On Fri, Jan 30, 2015 at 10:51 AM, Lukas Kahwe Smith
<sm...@pooteeweet.org> wrote:
>
>> On 30 Jan 2015, at 10:44, Ard Schrijvers <a.schrijv...@onehippo.com> wrote:
>>
>> On Fri, Jan 30, 2015 at 10:03 AM, cfalletta <cedric.falle...@gmail.com> 
>> wrote:
>>> Hello Thomas,
>>>
>>> Thanks for your answer.
>>>
>>> I'm using version 2.6.5 of jackrabbit.
>>>
>>> We're loading 300,000+ documents in production and it takes 3-5 minutes
>>> to load them all. Two queries are run: the select * with a limit, and
>>> the select * without a limit. I'll attach the source file source_jackrabbit.txt
>>> <http://jackrabbit.510166.n4.nabble.com/file/n4661929/source_jackrabbit.txt>
>>>
>>> In the development environment, I set the logging of Jackrabbit to
>>> DEBUG, and it appeared that the first query was taking a lot of time.
>>> However, setting the logging level to DEBUG seriously decreased the
>>> overall performance. I'll run another test without the count and
>>> without debug mode on a large set of documents to be sure; thanks for
>>> the advice.
>>>
>>> By the way, I've heard of another implementation of QueryResult that
>>> would return the totalSize of the query without "limit":
>>> org.apache.jackrabbit.core.query.lucene.QueryResultImpl. But
>>> org.apache.jackrabbit.core.query.lucene.QueryResult only works with
>>> SingleColumnQueryResult.
>>> -> Any idea how to use QueryResultImpl, and is it a viable solution?
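
In case it helps, a sketch of reading that total size through the public
jackrabbit-api interface rather than the core classes directly. This is
an assumption on my side (JackrabbitQueryResult and its getTotalSize()
in 2.6.x, where -1 means the value is not available); the query string
is a placeholder:

    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;
    import javax.jcr.query.QueryResult;
    import org.apache.jackrabbit.api.query.JackrabbitQueryResult;

    public class TotalSizeExample {
        public static void run(Session session) throws Exception {
            QueryManager qm = session.getWorkspace().getQueryManager();
            Query query = qm.createQuery("//element(*, nt:base)", Query.XPATH);
            query.setLimit(100);
            QueryResult result = query.execute();
            // Assumption: the result implementation (QueryResultImpl)
            // also implements JackrabbitQueryResult.
            if (result instanceof JackrabbitQueryResult) {
                int total = ((JackrabbitQueryResult) result).getTotalSize();
                System.out.println("total size: " + total); // -1 if unknown
            }
        }
    }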
>>>
>>> Is Jackrabbit able to properly handle queries on millions of documents
>>> as long as we have a limit in the query?
>>
>> In general, yes.
>>
>> In a bit more detail: the problem is usually not the query itself, but
>> the authorization of the results. If you set a limit, say of 100, then
>> the authorization part can stop as soon as read access has been granted
>> for 100 nodes. A limit will still result in bad performance if your
>> user has read access to only, say, 0.1% of the nodes, because then, on
>> average, 100,000 nodes must be checked to produce 100 granted results.
>> The performance also depends on your bundle caches: if all nodes are in
>> memory, checking 100,000 nodes won't be blisteringly fast, but it won't
>> be really slow either. If the working set exceeds your caches, so that
>> nodes have to be fetched from the backing database, performance will
>> drop dramatically.
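
To illustrate the mechanics, a minimal sketch of such a bounded query
via the plain JCR API (my example; the XPath statement and node type
are placeholders, not taken from this thread):

    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;
    import javax.jcr.query.QueryResult;
    import javax.jcr.query.RowIterator;

    public class LimitedQuery {
        public static void run(Session session) throws Exception {
            QueryManager qm = session.getWorkspace().getQueryManager();
            Query query = qm.createQuery(
                    "//element(*, nt:unstructured) order by @jcr:created descending",
                    Query.XPATH);
            // The limit caps how many hits have to pass the access check.
            query.setLimit(100);
            QueryResult result = query.execute();
            for (RowIterator rows = result.getRows(); rows.hasNext();) {
                System.out.println(rows.nextRow().getPath());
            }
        }
    }

Note that with a user who can read only ~0.1% of the matches, the access
check behind setLimit(100) still has to touch on the order of 100,000
nodes, as described above.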
>>
>> Please realize that if you want to compare Jackrabbit searches with
>> something like Solr or Elasticsearch, a fair comparison would have to
>> check every result from Solr or Elasticsearch separately for read
>> access against some external system. It is for a reason that Solr and
>> ES hardly do anything for fine-grained (fine!) ACL-aware indexing:
>> that is a really complex problem.
>>
>> Hope this helps
>>
>> Last thing: some queries, mainly queries with hierarchical constraints,
>> do not perform well over millions of nodes. Again, that is something
>> that is hard to do efficiently with Lucene.
>
> We also found that for some queries SQL1 performed much better than SQL2:
> http://blog.liip.ch/archive/2012/06/26/jackrabbit-and-its-two-sql-languages-some-findings.html
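
For readers who don't know both dialects, a hypothetical example of one
and the same query expressed in JCR-SQL (SQL1) and in JCR-SQL2; the node
type and property are made up, and which dialect runs faster depends on
the Jackrabbit version, per the blog post:

    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;

    public class TwoSqlDialects {
        static Query sql1(QueryManager qm) throws Exception {
            // JCR 1.0 SQL, deprecated in JCR 2.0 but still supported.
            return qm.createQuery(
                    "SELECT * FROM nt:unstructured WHERE state = 'live'",
                    Query.SQL);
        }

        static Query sql2(QueryManager qm) throws Exception {
            return qm.createQuery(
                    "SELECT * FROM [nt:unstructured] WHERE [state] = 'live'",
                    Query.JCR_SQL2);
        }
    }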

Originally the SQL2 impl was done mainly to have Jackrabbit comply with
the reference spec. Afaik it didn't really query via Lucene in the
beginning, but did a node traversal and then checked the constraints per
node. Later on, the 'real' implementation was added. I don't know
whether the blog post above is about that first implementation, which
was mainly there to be spec-compliant. That said, I am on thin ice,
because I still only use XPath.

Regards Ard

>
> regards,
> Lukas Kahwe Smith
> sm...@pooteeweet.org
>
>
>



-- 
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com
