Hi Henry,

I quite often run queries that cover up to half a million AEM "pages". My times range from 30 seconds to 3 minutes depending on what I'm doing. Here are my observations:
If you are running a query, make sure that it's as simple as possible and that it keys on a property that is indexed. Try to make that property as selective as possible, meaning that you have added it to only a subset of the overall pages. Better still, try not to use queries at all.

That last item is key: every single outage I've had in the last year has been because of an OOTB query that was poorly written, or a query that worked just fine for 3 months and then decided that instead of using the indexed value it was going to iterate nodes and die a grisly death.

My preference is a controlled traversal, where I do a recursive descent of the resource tree, ignoring paths that I don't want to go down, such as a "jcr:content" node. If I have a limit or I'm looking at a small graph, then I just gather the results and process them after the search. If you are gathering a large set of data, you need to be aware of the potential impact of processing it. If I expect a large number of results, I incorporate a callback, so that as soon as I identify a matching resource I process it and write it out to either a file or the response. This minimizes the amount of memory committed to the process.

I have a utility here https://github.com/JEBailey/sling-resourcelocator that is a Java 8 port of the one I use in production. At the very least it should give you some ideas on how this is done.

-Jason

________________________________________
From: Henry Saginor <[email protected]>
Sent: Thursday, September 22, 2016 1:48 PM
To: [email protected]
Cc: Heath, Aaron
Subject: Re: Generating report of tens of thousands of pages

Hi Jordan,

You might need to create an index for your query to avoid using the default traversal indexer (that's what the warnings are about). See the Apache Oak documentation on that. Also, rather than getting the entire result set for the query at once, consider pagination via setLimit and setOffset.
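A rough sketch of that pagination loop, under stated assumptions: `PageFetcher` and `PagedReport` are hypothetical names invented for this example, not part of JCR or Sling. In a real repository the fetcher body is where you would call `setOffset` and `setLimit` on a `javax.jcr.query.Query` and then execute it; a stable sort order is assumed so that pages do not overlap.

```java
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for "run the query with this offset and limit". */
interface PageFetcher<T> {
    List<T> fetch(long offset, long limit);
}

final class PagedReport {
    /**
     * Fetches results one page at a time and hands each row to the callback,
     * so only a single page is held in memory at once.
     * Returns the total number of rows processed.
     */
    static <T> long forEachPage(PageFetcher<T> fetcher, long pageSize, Consumer<T> rowHandler) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        long offset = 0;
        while (true) {
            List<T> page = fetcher.fetch(offset, pageSize);
            for (T row : page) {
                rowHandler.accept(row);
            }
            offset += page.size();
            // A short (or empty) page means the result set is exhausted.
            if (page.size() < pageSize) {
                return offset;
            }
        }
    }
}
```

Each page is processed and released before the next one is requested, which keeps memory use bounded by the page size instead of by the size of the whole report.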
Aside from optimizing the query itself, if it's a large report, consider a background Sling job and only display pre-generated reports. If you are using a product like Adobe AEM/CQ, you can use CQ workflows as well. Of course, I don't know if your use case allows this.

Henry

> On Sep 22, 2016, at 7:48 AM, Shurmer, Jordan <[email protected]> wrote:
>
> Hello,
>
> I'm wondering if I can get some opinions on the best way to find/filter nodes
> when there are tens of thousands of nodes in the tree. JCR Queries are simply
> timing out after some time, and sometimes generate a lot of log warnings about
> node traversals (obviously). We usually end up writing our own simple node
> traversal script using the JCR Node API or the Sling Resource API. This still
> takes a really long time, and sometimes significantly affects the performance
> of the application.
>
> Is there any other way of doing this sort of thing that we've overlooked?
> We're looking for an efficient method, which is consistent and reliable.
>
> Let me know what you all think!
>
> Thanks,
> Jordan Shurmer | Software Engineer | Scripps Lifestyle Studios
>
> 9721 Sherrill Blvd, Knoxville TN 37932
> Office: 865-560-4887
> [email protected]<mailto:[email protected]>
>
> SCRIPPS NETWORKS INTERACTIVE | the Leader in Lifestyle Media |
> scrippsnetworksinteractive.com
> HGTV | Food Network | Travel Channel | DIY Network | Cooking Channel | Great
> American Country | TVN | Fine Living | Asian Food Channel
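For reference, a minimal sketch of the controlled traversal with a callback that Jason describes at the top of this thread. It is not taken from his utility: `TreeNode` and `ControlledTraversal` are hypothetical names, and `TreeNode` is only a plain in-memory stand-in for a Sling Resource or JCR Node so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Hypothetical stand-in for a repository resource: a name plus children. */
final class TreeNode {
    final String name;
    final List<TreeNode> children = new ArrayList<>();

    TreeNode(String name) {
        this.name = name;
    }

    TreeNode add(TreeNode child) {
        children.add(child);
        return this;
    }
}

final class ControlledTraversal {
    /**
     * Recursive descent of the tree. Subtrees rejected by {@code descend}
     * are never entered; nodes accepted by {@code matches} are passed
     * straight to the callback, so no result list is accumulated.
     */
    static void walk(TreeNode node, Predicate<TreeNode> descend,
                     Predicate<TreeNode> matches, Consumer<TreeNode> onMatch) {
        if (matches.test(node)) {
            onMatch.accept(node);
        }
        for (TreeNode child : node.children) {
            if (descend.test(child)) {
                walk(child, descend, matches, onMatch);
            }
        }
    }
}
```

A caller might pass `n -> !n.name.equals("jcr:content")` as the `descend` predicate to skip those subtrees, and a callback that writes each match to a file or response as it is found.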
