Thanks Jason, that is helpful. To clarify:

1. This is largely unavoidable. The main purpose of this query is to find "old 
stuff" that was not previously processed, and that result set is unbounded. I 
implemented the flags as a way to make this more efficient. One workaround I 
came up with was to break the query into 100 subqueries by doing [jcr:uuid] 
like '00%' and so on (roughly the pattern sketched below this list). That chops 
the result set into smaller pieces, but each piece is still unbounded.
2. Jackrabbit 1.0, not Oak. We've turned off almost all indexes except for a 
few properties that we care about.
3. That is interesting; I will have to experiment with it. I don't mind 
iterating over the repo as long as it doesn't turn into a memory usage issue. 
I've seen hints that the Session is holding onto nodes that I am done with, 
although that may be a misperception on my part.
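
For reference, the prefix partitioning in point 1 looks roughly like the 
sketch below. It is only an illustration, not our actual code: it assumes the 
usual javax.jcr imports, an open Session, and the MAX_NODES_TO_PROCESS 
constant from my earlier mail, and the prefix handling is simplified.

    // Rough sketch of the prefix partitioning: one bounded query per
    // [jcr:uuid] prefix instead of a single unbounded one.
    final QueryManager queryManager = session.getWorkspace().getQueryManager();
    for (int i = 0; i < 100; i++) {
        final String prefix = String.format("%02d", i);
        final String sql2 = "select * from [nt:base]"
                + " where uploadToImageManagerFlag = true"
                + " and [jcr:uuid] like '" + prefix + "%'";
        final Query query = queryManager.createQuery(sql2, Query.JCR_SQL2);
        query.setLimit(MAX_NODES_TO_PROCESS);   // keep each slice bounded
        final RowIterator rows = query.execute().getRows();
        while (rows.hasNext()) {
            final String path = rows.nextRow().getPath();
            // process the path...
        }
    }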

-----Original Message-----
From: Jason Bailey [mailto:[email protected]] 
Sent: Friday, March 25, 2016 1:01 PM
To: [email protected]
Subject: RE: Out of memory during query

A couple of observations. 

1. The query you created is amazingly broad, with the use of nt:base and no 
restrictions such as a path. If you're going to create a query, the more 
restrictive you can make it, the better.

2. Not sure if you're using JCR or OAK. If you're using oak, be sure to index 
on the property.

3. Queries are generally slow. In the most counterintuitive experience I've 
ever had, we discovered that it is far faster to manually descend through the 
resources and identify the items you are searching for than to use any query 
we've created (roughly the pattern sketched below).
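
Roughly, the descent looks like the sketch below. This is only an 
illustration, not our actual code; the /content root path, the 
uploadToImageManagerFlag property, and the surrounding class context are 
placeholders, and it assumes the usual Sling API imports 
(org.apache.sling.api.resource.*).

    // Rough sketch: walk the resource tree and collect the paths of
    // resources that carry the flag, instead of running a query.
    private void collectFlagged(final Resource resource, final List<String> paths) {
        final Boolean flag = resource.getValueMap()
                .get("uploadToImageManagerFlag", Boolean.class);
        if (Boolean.TRUE.equals(flag)) {
            paths.add(resource.getPath());
        }
        for (final Resource child : resource.getChildren()) {
            collectFlagged(child, paths);
        }
    }

    // Usage, given a ResourceResolver:
    final List<String> paths = new ArrayList<>();
    final Resource root = resourceResolver.getResource("/content");
    if (root != null) {
        collectFlagged(root, paths);
    }

If you do keep a query, adding a path restriction such as 
ISDESCENDANTNODE('/content') and a concrete node type instead of nt:base 
narrows it considerably.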



-----Original Message-----
From: Roll, Kevin [mailto:[email protected]] 
Sent: Friday, March 25, 2016 12:46 PM
To: [email protected]
Subject: RE: Out of memory during query

I am still working through the out-of-memory issue. The problem seems to be 
identical to what I saw in November - a potentially unbounded query that eats 
up memory. I thought that configuring a resultFetchSize in Jackrabbit had fixed 
the issue, but apparently not, and I'm not sure that this parameter is having 
any effect.

I'm now experimenting with using the QueryManager directly and setting a limit:

            final Session session = resourceResolver.adaptTo(Session.class);
            final QueryManager queryManager = session.getWorkspace().getQueryManager();
            final Query query = queryManager.createQuery(QUERY_STRING, Query.JCR_SQL2);
            query.setLimit(MAX_NODES_TO_PROCESS);
            final RowIterator rowIterator = query.execute().getRows();
            while (rowIterator.hasNext()) {
                // all I need from each row is the path
                final String path = rowIterator.nextRow().getPath();
                // ... process the path ...
            }

The query execution is still using more memory than I like (all I want is the 
path!) but it appears to be stable. My question is whether the setLimit() is 
actually passing that value to Lucene. I traced down into the Sling code, and 
got lost in the lower levels, but as far as I could tell that value is pushed 
downward. So, can anyone clarify if this will be an actual constraint on 
Lucene? To put the question a different way, will Lucene use approximately the 
same amount of memory to run my query no matter how large my repository gets? 
What I am desperately trying to avoid is an unbounded query execution that will 
eventually fail given a large enough repository.
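
In case it helps frame the question, the kind of bounded paging I have in mind 
on top of setLimit()/setOffset() is sketched below. The page size is just a 
placeholder, and I don't yet know whether the limit and offset actually reach 
Lucene.

    // Rough sketch: page through the results in fixed-size chunks so no
    // single execution has to materialize an unbounded result set.
    final long pageSize = 100;   // placeholder chunk size
    long offset = 0;
    while (true) {
        final Query query = queryManager.createQuery(QUERY_STRING, Query.JCR_SQL2);
        query.setLimit(pageSize);
        query.setOffset(offset);
        final RowIterator rows = query.execute().getRows();
        if (!rows.hasNext()) {
            break;   // no more results
        }
        while (rows.hasNext()) {
            final String path = rows.nextRow().getPath();
            // process the path...
        }
        offset += pageSize;
    }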


From: Roll, Kevin
Sent: Wednesday, March 23, 2016 3:54 PM
To: '[email protected]' <[email protected]>
Subject: Out of memory during query

Back in November we had an out-of-memory problem with our Sling application. I 
determined that a periodic task was executing a query that appeared to be 
unlimited in terms of result set size, which would eat up memory as the 
repository grew. In order to combat this I marked the nodes I am interested in 
with a boolean flag, and I configured Jackrabbit to set the resultFetchSize to 
100. This seemed to solve the problem and we had no further issues - until last 
week, when the problem reappeared.

I've been able to determine that the problem is entirely in the execution of 
this query. I can enter it from the JCR Explorer query window and it will cause 
the runaway memory problem. The query is very straightforward; it is simply:

select * from [nt:base] where uploadToImageManagerFlag = true

I have no need for any parallel results; I simply want to examine the 
resulting Resources one at a time. Deleting/rebuilding the Jackrabbit indexes 
did not help.
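
For reference, the access pattern I want is roughly the sketch below, using 
Sling's ResourceResolver.findResources(). I don't know whether this avoids 
loading the full result set internally; that is part of what I am asking.

    // Rough sketch: iterate the matching Resources one at a time; only the
    // path is kept.
    final Iterator<Resource> it = resourceResolver.findResources(
            "select * from [nt:base] where uploadToImageManagerFlag = true",
            Query.JCR_SQL2);
    while (it.hasNext()) {
        final String path = it.next().getPath();
        // process the path...
    }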

Any ideas why this query might be causing runaway memory consumption? Looking 
at a heap dump it appears that there are massive numbers of NodeId, 
HashMap$Entry, NameSet, ChildNodeEntry, NodeState, etc. It seems that for 
whatever reason a large number of nodes are being pulled into memory.

If this would make more sense on the Jackrabbit list I can ask over there as 
well.

Thanks!
