Re: Generating report of tens of thousands of pages

Jason Bailey Sun, 25 Sep 2016 19:43:02 -0700

Hi Henry,

I often run queries that cover up to half a million AEM "pages". My times 
range from 30 seconds to 3 minutes, depending on what I'm doing. Here are my 
observations:

If you are running a query, make sure that it's as simple as possible and that 
it is keying on a property that is indexed. 
Try to make that property as unique as possible, meaning that you have 
added it to only a subset of the overall pages. 
Try not to use queries.

That last item is key: every single outage I've had in the last year has been 
caused by either an OOTB query that was poorly written, or a query that 
worked just fine for 3 months and then decided that, instead of using the 
indexed value, it was going to iterate nodes and then die a grisly death. 

My preference is a controlled traversal, where I do a recursive descent of the 
resource tree, ignoring paths that I don't want to go down, such as a 
"jcr:content" node. If I have a limit or I'm looking at a small graph, then I 
just gather the results and process them after the search. If you are 
gathering a large set of data, you need to be aware of the potential impact of 
processing a large set.

If I expect a large set of results, I incorporate a callback, so that as soon 
as I identify a matching resource I process it and write it out to either a 
file or a response. This minimizes the amount of memory that is committed to 
the process.
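
A minimal sketch of the controlled traversal with a callback described above, 
in plain Java. It is not the code from the utility mentioned in this mail. 
TreeNode is a hypothetical in-memory stand-in for a Sling Resource, and the 
names walk, descend, matches and callback are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical stand-in for a Sling Resource: a name, a path and child nodes.
final class TreeNode {
    final String name;
    final String path;
    final List<TreeNode> children = new ArrayList<>();

    TreeNode(String name, String path) {
        this.name = name;
        this.path = path;
    }

    // Creates a child below this node and returns it.
    TreeNode add(String childName) {
        TreeNode child = new TreeNode(childName, path + "/" + childName);
        children.add(child);
        return child;
    }
}

final class ControlledTraversal {

    // Recursive descent. A child that fails 'descend' is skipped together with
    // its whole subtree. Every visited node that passes 'matches' is handed to
    // 'callback' immediately, so no result list is built up in memory.
    static void walk(TreeNode node, Predicate<TreeNode> descend,
                     Predicate<TreeNode> matches, Consumer<TreeNode> callback) {
        if (matches.test(node)) {
            callback.accept(node);
        }
        for (TreeNode child : node.children) {
            if (descend.test(child)) {
                walk(child, descend, matches, callback);
            }
        }
    }

    public static void main(String[] args) {
        TreeNode root = new TreeNode("content", "/content");
        TreeNode home = root.add("home");
        home.add("jcr:content").add("par");
        home.add("news").add("jcr:content");

        walk(root,
             n -> !"jcr:content".equals(n.name), // prune paths we don't want
             n -> n != root,                      // hypothetical match rule
             n -> System.out.println(n.path));    // process each hit right away
    }
}
```

Against a real repository the callback would write to a file or to the 
response instead of standard output, and the two predicates would test 
resource properties rather than just the node name.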

I have a utility here: https://github.com/JEBailey/sling-resourcelocator that 
is a Java 8 port of the one I use in production. At the very least it should 
give you some ideas on how this is done.

-Jason 

________________________________________
From: Henry Saginor <[email protected]>
Sent: Thursday, September 22, 2016 1:48 PM
To: [email protected]
Cc: Heath, Aaron
Subject: Re: Generating report of tens of thousands of pages

Hi Jordan,

You might need to create an index for your query to avoid falling back to the 
default traversal (that’s what the warnings are about). See the Apache Oak 
documentation on that. Also, rather than getting the entire result set for the 
query at once, consider pagination via setLimit and setOffset.
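
A rough sketch of that kind of paging loop, in plain Java so it runs without a 
repository. PageFetcher and PagedReport are hypothetical names made up for the 
example. With JCR, the fetch step would typically wrap 
javax.jcr.query.Query#setOffset, Query#setLimit and Query#execute.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical abstraction over "run the query with this offset and limit".
interface PageFetcher<T> {
    List<T> fetch(long offset, long limit);
}

final class PagedReport {

    // Pulls one page at a time and hands every row to 'sink'.
    // Stops when a page comes back shorter than the requested size
    // and returns the total number of rows processed.
    static <T> long forEachRow(PageFetcher<T> fetcher, long pageSize, Consumer<T> sink) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        long offset = 0;
        while (true) {
            List<T> page = fetcher.fetch(offset, pageSize);
            page.forEach(sink);
            offset += page.size();
            if (page.size() < pageSize) {
                return offset;
            }
        }
    }

    public static void main(String[] args) {
        // In-memory rows standing in for query results.
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            rows.add("/content/page-" + i);
        }
        PageFetcher<String> fetcher = (offset, limit) -> {
            int from = (int) Math.min(offset, rows.size());
            int to = (int) Math.min(offset + limit, rows.size());
            return rows.subList(from, to);
        };
        long total = forEachRow(fetcher, 10, System.out::println);
        System.out.println(total + " rows processed");
    }
}
```

Only one page of rows is held at a time, which keeps the memory footprint 
bounded by the page size rather than by the size of the full result set.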

Aside from optimizing the query itself, if it’s a large report, consider a 
background Sling job and only display pre-generated reports. If you are using a 
product like Adobe AEM/CQ, you can use CQ workflows as well. Of course, I don’t 
know if your use case allows this.

Henry

> On Sep 22, 2016, at 7:48 AM, Shurmer, Jordan <[email protected]> 
> wrote:
>
> Hello,
>
> I'm wondering if I can get some opinions on the best way to find/filter nodes 
> when there are tens of thousands of nodes in the tree. JCR Queries are simply 
> timing out after some time, and sometimes generate a lot of log warnings about 
> node traversals (obviously). We usually end up writing our own simple node 
> traversal script using the JCR Node api or the Sling Resource api. This still 
> takes a really long time, and sometimes significantly affects the performance 
> of the application.
>
> Is there any other way of doing this sort of thing that we've overlooked? 
> We're looking for an efficient method, which is consistent and reliable.
>
> Let me know what you all think!
>
> Thanks,
> Jordan Shurmer | Software Engineer | Scripps Lifestyle Studios
>
> 9721 Sherrill Blvd, Knoxville TN 37932
> Office: 865-560-4887
> [email protected]<mailto:[email protected]>
>
> SCRIPPS NETWORKS INTERACTIVE | the Leader in Lifestyle Media | 
> scrippsnetworksinteractive.com
> HGTV | Food Network | Travel Channel | DIY Network | Cooking Channel | Great 
> American Country | TVN | Fine Living | Asian Food Channel
>
