Hi,

I just came across this message by chance, but I would like to share my opinion on it.

Ard Schrijvers wrote:
Hello Vikas,

apparently nobody has yet had time to react to your little survey, so I will just
try to give my two cents. IMO your questions are strongly intertwined with how
you set up your content modelling, which kind of data you have (binary data vs
XML), what kind of usage you expect (searches vs iterating nodes), and so on,
and are therefore hard (if not impossible) to judge in the abstract.

Though I am by far not yet in a position to base my remarks on code, proper examples
or benchmarks, I do think you have a use case that needs "the best of all worlds"
regarding storing / indexing / iterating nodes / searching (with sorting), etc.

I am not yet aware of the ins and outs of many parts of JR, but storing 10K child nodes per node is AFAIK currently not an option. Regarding your use case: around 36.000.000 documents after one year in one single workspace with terabytes of data, so roughly 100.000.000 docs within three years... Well, I think you will at least have to tune some settings :-)
Just to grasp the complexity of your requirements, I'll take the searching part as
an example: many millions of documents and terabytes of data, and you want fast
searching, right? Well, there is this Apache project out there, Hadoop, a Lucene
subproject built on the MapReduce algorithm [1], to enable your fast searching.
Obviously, this is a bleeding-edge Apache top-level project, and obviously not
(yet...) available in JR. But as a next requirement you might also need fast
faceted navigation... then you need the bleeding-edge Solr [2] technology, so you
somehow need the best of both Solr and Hadoop. Since, of course, we also want
authorisation, we need to add some bleeding-edge, not-yet-existing top-level
project that combines the best of two bleeding-edge top-level projects to include
authorisation on searches. And for all of these projects we need to know exactly
how to tune the settings, because OOMs can occur in any of them if you do not know
the ins and outs of their configuration. I think you grasp what I am trying to
say: with 100.000.000 docs and many terabytes of data, searching becomes much more
complex than what the current JR Lucene implementation offers, IMO.
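To make the JR side of this a bit more concrete: application searches go through the standard JCR query API, which Jackrabbit currently backs with one embedded Lucene index per workspace. A minimal sketch, assuming an already open session and nt:file content (the XPath expression and class name are just illustrations):

  import javax.jcr.Node;
  import javax.jcr.NodeIterator;
  import javax.jcr.RepositoryException;
  import javax.jcr.Session;
  import javax.jcr.query.Query;
  import javax.jcr.query.QueryManager;
  import javax.jcr.query.QueryResult;

  public class SearchSketch {
      // Full-text search over nt:file nodes via the JCR query API;
      // Jackrabbit answers this from its Lucene index.
      public static void search(Session session, String text) throws RepositoryException {
          QueryManager qm = session.getWorkspace().getQueryManager();
          Query q = qm.createQuery(
                  "//element(*, nt:file)[jcr:contains(., '" + text + "')]", Query.XPATH);
          QueryResult result = q.execute();
          for (NodeIterator it = result.getNodes(); it.hasNext(); ) {
              Node hit = it.nextNode();
              System.out.println(hit.getPath());
          }
      }
  }

Every such query ends up against that single per-workspace index, which is exactly where the scaling question above comes in.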
Hadoop enables one to deal with millions of files containing TBs of data. The data is stored in what is called a distributed file system, and it can be processed in parallel using the map-reduce programming paradigm. The framework is fault tolerant with respect to both data storage and computation. Regarding searching: as far as I know, JR uses Lucene to store the index, but Lucene has some issues with write-only indexes, so Solr (built on top of Lucene) can be a higher-level solution to that.
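Just to make the map-reduce part a bit more concrete, here is the classic word-count job sketched against the org.apache.hadoop.mapred API (exact class names can differ between Hadoop releases, so treat this as an illustration rather than a drop-in):

  import java.io.IOException;
  import java.util.Iterator;
  import java.util.StringTokenizer;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class WordCount {

      // map: emit (word, 1) for every token in a line of input
      public static class Map extends MapReduceBase
              implements Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> output, Reporter reporter)
                  throws IOException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                  word.set(itr.nextToken());
                  output.collect(word, ONE);
              }
          }
      }

      // reduce: sum the counts emitted for each word
      public static class Reduce extends MapReduceBase
              implements Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterator<IntWritable> values,
                             OutputCollector<Text, IntWritable> output, Reporter reporter)
                  throws IOException {
              int sum = 0;
              while (values.hasNext()) {
                  sum += values.next().get();
              }
              output.collect(key, new IntWritable(sum));
          }
      }

      public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(WordCount.class);
          conf.setJobName("wordcount");
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);
          conf.setMapperClass(Map.class);
          conf.setReducerClass(Reduce.class);
          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));
          JobClient.runJob(conf);
      }
  }

The same map/reduce pair runs unchanged whether the input is a few MBs on one machine or TBs spread over the distributed file system.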

I have been working on WebDAV integration of the FileSystem interface for Hadoop (using JR), and have developed a working patch for Hadoop. I would be glad if you checked it out (https://issues.apache.org/jira/browse/HADOOP-496). Any feedback will be appreciated, since I am neither very familiar with JR nor do I have a deep understanding of its data flow model.
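For those not familiar with Hadoop: client code only ever talks to the abstract org.apache.hadoop.fs.FileSystem interface, and the configured default filesystem decides which concrete implementation it gets (HDFS, the local disk, or a WebDAV/JCR-backed one as in the patch above). A rough sketch of that client side, with placeholder paths:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class FsClientSketch {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Which FileSystem implementation this resolves to is purely a matter
          // of configuration; the calling code does not change.
          FileSystem fs = FileSystem.get(conf);

          Path file = new Path("/demo/hello.txt");   // placeholder path
          FSDataOutputStream out = fs.create(file);
          out.writeUTF("hello through the FileSystem interface");
          out.close();

          FSDataInputStream in = fs.open(file);
          System.out.println(in.readUTF());
          in.close();
      }
  }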

For other parts of JR, similar arguments probably hold regarding the requirements you have to deal with, but I think *any* system, open source or closed, will face them (though others might disagree a little on this, because my knowledge here is rather shallow).
I am not aware of available benchmarks or JR performance numbers, but perhaps
others are.

Regards Ard

[1] http://lucene.apache.org/hadoop/
[2] http://lucene.apache.org/solr/

We are concerned about Jackrabbit and its ability to handle really
heavy load requirements. We are looking to use Jackrabbit to push
approximately 300-500 nodes a minute, up to about 100K nodes a day. The
live repository could easily grow to a few terabytes, all in a
single workspace.
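To give an idea of the write pattern we have in mind, something roughly like the following (a minimal sketch against the JCR API; the repository setup, credentials and node types are only placeholders):

  import javax.jcr.Node;
  import javax.jcr.Repository;
  import javax.jcr.Session;
  import javax.jcr.SimpleCredentials;
  import org.apache.jackrabbit.core.TransientRepository;

  public class IngestSketch {
      public static void main(String[] args) throws Exception {
          Repository repository = new TransientRepository();   // placeholder: any Jackrabbit repository
          Session session = repository.login(
                  new SimpleCredentials("admin", "admin".toCharArray()));   // placeholder credentials
          try {
              Node root = session.getRootNode();
              Node docs = root.hasNode("docs") ? root.getNode("docs")
                                               : root.addNode("docs", "nt:unstructured");
              // simulate roughly one minute's worth of incoming documents
              for (int i = 0; i < 500; i++) {
                  Node doc = docs.addNode("doc-" + System.currentTimeMillis() + "-" + i,
                                          "nt:unstructured");
                  doc.setProperty("source", "feed");
                  if (i % 100 == 0) {
                      session.save();   // save in batches rather than per node
                  }
              }
              session.save();
          } finally {
              session.logout();
          }
      }
  }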

We wanted to ask the community how Jackrabbit is actually being used
in production environments. So here is an email poll, if you will.

. How much data are you pushing into Jackrabbit at a time?

. Are you using burst modes or a continuous data feed?

. What is the biggest repository (in size) that you have used or heard
of being used with Jackrabbit?

. Are you satisfied with the response times of your queries?

. Have you refrained from having more than 10K child nodes per node? (See the sketch after this list for one common workaround.)

. What caching mechanism are you using? Are you modifying the default
caching that comes with Jackrabbit?

. Are you using the default persistence mechanisms, such as the file PMs and
DB PMs, or have you built a custom PM or used one from Day Systems?
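Regarding the 10K-child-nodes question above: one commonly suggested workaround is to bucket children under hashed intermediate nodes so that no single node accumulates tens of thousands of direct children. A sketch only (the node types and the two-level hashing scheme are arbitrary choices):

  import javax.jcr.Node;
  import javax.jcr.RepositoryException;

  public class BucketSketch {
      // Return a parent two hash levels below "root" for a child with the given name.
      public static Node bucketFor(Node root, String name) throws RepositoryException {
          String hex = String.format("%08x", name.hashCode());
          String level1 = hex.substring(0, 2);
          String level2 = hex.substring(2, 4);
          Node l1 = root.hasNode(level1) ? root.getNode(level1)
                                         : root.addNode(level1, "nt:unstructured");
          Node l2 = l1.hasNode(level2) ? l1.getNode(level2)
                                       : l1.addNode(level2, "nt:unstructured");
          return l2;
      }
  }

  // usage:
  //   Node parent = BucketSketch.bucketFor(docsRoot, fileName);
  //   parent.addNode(fileName, "nt:file");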


I hope these answers will help us and the community as a whole.

Thanks.

