Hello Vikas,

apparently nobody has had time yet to react to your little survey, so I will
just try to give my 2 cents. IMO your questions are strongly intertwined with
how you set up your content modelling, what kind of data you have (binary
data vs XML), what kind of usage you expect (searches vs iterating nodes),
etc., and are therefore hard (if not impossible) to judge in general.

Though I am by no means in a position to back my remarks with code, proper
examples or benchmarks, I do think you have a use case that kind of "needs
the best of all worlds" regarding storing / indexing / iterating nodes /
searching (with sorting), etc.

I am not yet aware of the ins and outs of many parts of JR, but at least
storing 10K child nodes per node is AFAIK currently not an option. Regarding
your use case: at 100K nodes a day you end up with around 36,000,000
documents after one year in one single workspace with terabytes of data, so
over 100,000,000 docs within three years... Well, I think you at least have
to tune some settings :-)
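
Just to illustrate the child node issue: the usual workaround is to fan
documents out over a deeper hierarchy so that no single node gets tens of
thousands of children. Below is a minimal sketch using the plain JCR API;
the /docs path, the hash-bucket layout and the nt:unstructured node types
are just my own assumptions for illustration, not something JR prescribes:

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

public class HashShardedImport {

    /**
     * Stores a document under /docs/xx/yy/name, where xx and yy are
     * derived from the document name's hash, so siblings are spread
     * over 256 x 256 buckets instead of one huge flat folder.
     */
    public static Node storeDocument(Session session, String name)
            throws RepositoryException {
        int hash = name.hashCode();
        // two hex levels -> 65536 buckets, keeping sibling counts small
        String level1 = String.format("%02x", (hash >>> 8) & 0xff);
        String level2 = String.format("%02x", hash & 0xff);
        Node folder = getOrAddFolder(session.getRootNode(), "docs");
        folder = getOrAddFolder(folder, level1);
        folder = getOrAddFolder(folder, level2);
        Node doc = folder.addNode(name, "nt:unstructured");
        session.save();
        return doc;
    }

    private static Node getOrAddFolder(Node parent, String name)
            throws RepositoryException {
        return parent.hasNode(name)
                ? parent.getNode(name)
                : parent.addNode(name, "nt:unstructured");
    }
}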

Though, just to grasp the complexity of your requirements, I'll take the
searching part as an example: many millions of documents and terabytes of
data, and you want fast searching, right? Well, there is this Apache project
out there, Hadoop, a Lucene subproject built on the MapReduce algorithm [1],
which could enable your fast searching. Though, obviously, this is a bleeding
edge project, and obviously not (yet...) available in JR. But, as a next
requirement, you might also need fast faceted navigation... then you need the
bleeding edge Solr [2] technology, so you somehow need the best of Solr and
Hadoop. Since, of course, we also want authorisation, we need to add some
bleeding edge, not yet existing project that combines the best of those two
to include authorisation on searches. And, for all of these projects, we need
to know exactly how to tune the settings, because OOMs might occur in any of
them if you do not know the ins and outs of the configuration. I think you
grasp what I am trying to say: with 100,000,000 docs and many terabytes of
data, searching becomes much more complex than the current JR Lucene
implementation can handle, IMO.
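
For contrast, this is roughly what the current JR Lucene impl does give you
today through the standard JCR query API. Again only a sketch: the
nt:resource node type and the jcr:lastModified ordering are just example
choices of mine, and the query string is naively concatenated for brevity:

import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;

public class SearchExample {

    /**
     * Full-text search over all nt:resource nodes, sorted by last
     * modification date, via the Lucene-backed JCR query support.
     */
    public static void search(Session session, String text)
            throws RepositoryException {
        QueryManager qm = session.getWorkspace().getQueryManager();
        Query query = qm.createQuery(
                "//element(*, nt:resource)[jcr:contains(., '" + text + "')]"
                + " order by @jcr:lastModified descending",
                Query.XPATH);
        NodeIterator it = query.execute().getNodes();
        while (it.hasNext()) {
            Node node = it.nextNode();
            System.out.println(node.getPath());
        }
    }
}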

For most other parts of JR, similar arguments probably hold regarding the
requirements you have to deal with, but I think *any* system, open or closed
source, will face these issues (though others might disagree a little on
this, because my knowledge here is shallow).

I am not aware of available benchmarks or JR performance numbers, but
perhaps others are.

Regards Ard

[1] http://lucene.apache.org/hadoop/
[2] http://lucene.apache.org/solr/

> We are concerned about Jackrabbit and its ability to handle really
> heavy load requirements. We are looking to use Jackrabbit to push
> approximately 300-500 nodes a minute, up to 100K nodes a day. The
> live repository could easily grow to a few terabytes, all using a
> single workspace.
> 
> We wanted to ask the community how Jackrabbit is actually being used
> in production environments. So here is an email poll, if you will.
> 
> . How much data are you pushing into Jackrabbit at a time?
> 
> . Are you using burst modes or continuous data feed?
> 
> . What is the biggest repository (in size) that you have used or heard
> of being used with Jackrabbit?
> 
> . Are you satisfied with the response times of your queries?
> 
> . Have you refrained from having more than 10K child nodes per node?
> 
> . What caching mechanism are you using? Are you modifying the default
> caching that comes with jackrabbit?
> 
> . Are you using the default data store mechanisms such as file PMs and
> db PMs or have you built a custom PM or used one from Day systems?
> 
> 
> I hope these answers will help us and the community as a whole.
> 
> Thanks.
> 
