FW: Jackrabbits reliability and performance

Tarun Dogra Thu, 14 Nov 2013 08:23:00 -0800

Hi Enrique,

Thanks for the detailed reply. Unfortunately, I am not familiar with the nodes 
and the BTree side of the Jackrabbit framework, so I was hoping for an answer 
in terms of the overall picture of how Jackrabbit, as a JCR implementation, 
will fit into our system.


In brief, we need to integrate Jackrabbit (as advised by our vendor) into our 
clinical trial management system. I have already provided the specification of 
the server on which the system will be hosted. We just wanted to know whether, 
on such a server, Jackrabbit can take in approximately 15GB of data per year 
and manage that many documents/files (as mentioned before) without its 
performance being affected. We already know it is a very stable JCR 
implementation, but we wanted to confirm that such a system can meet our 
organisation's requirements.

Regards,
Tarun


From: Enrique Medina Montenegro [mailto:[email protected]]
Sent: 14 November 2013 14:29
To: [email protected]<mailto:[email protected]>
Cc: Mark Essex
Subject: Re: Jackrabbits reliability and performance

Hi Tarun,

Let me share my findings with you :-)

At my work we are evaluating the use of Jackrabbit to build a JCR repository to 
store the register of marks (intellectual property) as documents composed 
basically of an ID, some metadata (who created it, when, etc.) and the XML and 
JSON representation of the mark itself. Currently, we have all that information 
spread in several relational DBs and we would like to take advantage of the 
versioning and observation features of the JCR repository.

During our initial evaluation, mostly focused on performance, we noticed 
serious issues when adding the 1 million marks we currently have in our DBs 
underneath the same "parent" node, but we found out that this is actually a 
known limitation of Jackrabbit, whose documentation clearly states that no more 
than 10K child nodes should be added to the same "parent" node:

http://wiki.apache.org/jackrabbit/Performance

However, we were still more or less forced to follow that path because we were 
required to perform an initial dump of all the data in the DBs, and just adding 
each mark as a sub-node proved to be the fastest way to export all the data 
within an acceptable time window.

Nevertheless, we also tried to shard the nodes as a tree, basically splitting 
the 9-digit ID of our marks into 3-digit groups, so that each node would have 
at most 1K sub-nodes. For example, the mark with ID = 000342865 would be saved 
as root (node) --> marks (node) --> 000 (node) --> 342 (node) --> 000342865 
(node). Theoretically, this should perform much better than our original 
approach, but as a downside, it dramatically increased the time it took to 
export the 1M marks from the DBs, pushing us further outside our acceptable 
time window. The reason is that, for each mark, the export first had to look up 
the exact node under which to store it, and the bigger the JCR repository grew, 
the slower those node lookups became, which slowed down the whole export.
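
As an illustration of the sharding scheme described above, here is a minimal 
Java sketch that derives the sharded path from a 9-digit mark ID. The class and 
method names (MarkPaths, shardedPath) are hypothetical and not part of 
Jackrabbit, and the "/marks" root is assumed from the example path above. It 
only computes the path string; creating the intermediate nodes in the 
repository is not shown.

```java
// Hypothetical helper, not part of Jackrabbit: maps a 9-digit mark ID to the
// sharded path described above, e.g. 000342865 -> /marks/000/342/000342865.
final class MarkPaths {

    private MarkPaths() {
        // static helper, not meant to be instantiated
    }

    static String shardedPath(String markId) {
        // Assumption: IDs are always exactly nine ASCII digits.
        if (markId == null || !markId.matches("[0-9]{9}")) {
            throw new IllegalArgumentException(
                    "expected a 9-digit mark ID, got: " + markId);
        }
        String level1 = markId.substring(0, 3); // first 3-digit group
        String level2 = markId.substring(3, 6); // second 3-digit group
        // The last three digits vary from 000 to 999, so each second-level
        // node holds at most 1,000 marks.
        return "/marks/" + level1 + "/" + level2 + "/" + markId;
    }
}
```

With this layout, no node on the path has more than 1,000 children, which 
stays well below the 10K guideline mentioned earlier.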

We also took a look at the BTreeManager, but we just couldn't make it work due 
to the issue I describe here (which BTW has not been answered yet):

http://mail-archives.apache.org/mod_mbox/jackrabbit-users/201311.mbox/ajax/%3CCA%2BdeSP_weUQ0mtSBjoQGy3jq60jZEo7LtmF9kJZkvF1eyNvu-A%40mail.gmail.com%3E

So, getting back to the original approach of storing everything under the same 
node, how did we manage to get acceptable read times? Well, it boils down to 
using Lucene's indexing (configured to index only the "id" property, and not 
all the XML and JSON content, using the IndexingConfiguration in the Search 
section of the repository config file) to actually perform the search/retrieval 
of marks. So, for instance, instead of:

session.getNode("/marks/000342865") --> takes ~2.4 s with 1M marks under the 
same node

we run this query with SQL2:

SELECT * FROM markType WHERE id = '000342865' --> takes tens of ms with 1M 
marks under the same node thanks to Lucene's indexes

(notice that "markType" is a custom node type that we have created to model our 
domain, in this case the marks)
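
To make the query-based lookup above a bit more concrete, here is a minimal 
Java sketch that only builds the SQL2 statement for a given mark ID. The names 
MarkQueries and byId are made up for illustration, and the 9-digit check is an 
assumption based on the ID format described earlier. In a real application the 
resulting string would be passed to the JCR QueryManager (createQuery with 
Query.JCR_SQL2); that part is left out because it needs a live repository 
session.

```java
import java.util.regex.Pattern;

// Hypothetical helper: builds the SQL2 statement shown above for one mark ID.
final class MarkQueries {

    // Assumption: mark IDs are exactly nine ASCII digits, as in the example.
    private static final Pattern NINE_DIGITS = Pattern.compile("[0-9]{9}");

    private MarkQueries() {
        // static helper, not meant to be instantiated
    }

    static String byId(String markId) {
        // Validating the ID first means nothing but digits is ever
        // concatenated into the statement.
        if (markId == null || !NINE_DIGITS.matcher(markId).matches()) {
            throw new IllegalArgumentException(
                    "expected a 9-digit mark ID, got: " + markId);
        }
        // "markType" and "id" are the custom node type and property
        // used in the query above.
        return "SELECT * FROM markType WHERE id = '" + markId + "'";
    }
}
```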

LESSONS LEARNED: You need to clearly define the scope of your project in terms 
of the Jackrabbit functionality you intend to use, and then plan detailed 
performance tests to prove your approach. There are always trade-offs. For 
instance, in my case, when I want to get a specific version of a mark, I cannot 
use the "official" API through "VersionManager", because it fetches the node by 
its direct path before getting the version:

session.getWorkspace().getVersionManager().getVersionHistory("/marks/000342865").getVersionByLabel("v.6.0")

Instead, I have to use the "deprecated" API method on the node itself, once I 
have retrieved it with the SQL2 statement mentioned above:

markNode.getVersionHistory().getVersionByLabel("v.6.0")

with the uncertainty of when that deprecated API will be removed...

Please share your findings on the list as you make progress :-)

Regards,
Enrique Medina.

On Thu, Nov 14, 2013 at 10:40 AM, Tarun Dogra 
<[email protected]<mailto:[email protected]>> wrote:
Respected Sir/Madam,

In the next couple of months, we (ORION Clinical Services Ltd., UK) are about 
to release a clinical trial management system as a product to be used in-house 
by all our employees. We have bought this product off the shelf from a 
third-party vendor. As suggested by our vendor, we would implement Jackrabbit 
as the central repository system within this main product. But we are still not 
sure whether Jackrabbit is an ideal solution to integrate with our product. 
This is where we need your help, and we would appreciate it if you could share 
your expertise.

Just to give you an overview of our organisation, we will have around 7500 
documents (each approximately 250KB in size on average) per "study" within our 
clinical trial management framework. We usually take on around 7-8 such studies 
per year. So, on the basis of 8 studies per year, the total size of all the 
documents will grow by approximately 7500 x 250KB x 8 = 15GB per year. We just 
wanted to know a couple of things from you:

1.       Is Jackrabbit reliable enough as a system to cater to our 
above-mentioned needs? and

2.       Will the management of so many documents have any adverse effect on 
Jackrabbit's performance, considering that Jackrabbit will reside on one of our 
own hosted servers with the following spec:

Poweredge R710

CPU: 2 x Intel X5550

Memory: 16GB

Operating System: Windows 2008 R2 64bit SP1

Disk capacity: C: 142GB and D: 1.22TB
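
The sizing estimate given above (7500 documents of about 250K each, 8 studies 
per year) can be double-checked with a few lines of Java. This is only a sketch 
of the arithmetic: the class and method names are invented for illustration, 
and it assumes decimal units (1GB = 1,000,000KB), which is what the 15GB figure 
implies.

```java
// Hypothetical helper that reproduces the yearly growth estimate from the
// figures quoted in this message.
final class StorageEstimate {

    private StorageEstimate() {
        // static helper, not meant to be instantiated
    }

    // Number of documents added per year.
    static long documentsPerYear(int docsPerStudy, int studiesPerYear) {
        return (long) docsPerStudy * studiesPerYear;
    }

    // Yearly growth in kilobytes, given the average document size in KB.
    static long kilobytesPerYear(int docsPerStudy, int avgDocKb, int studiesPerYear) {
        return documentsPerYear(docsPerStudy, studiesPerYear) * avgDocKb;
    }

    // Yearly growth in gigabytes. Assumption: decimal units, 1GB = 1,000,000KB.
    static double gigabytesPerYear(int docsPerStudy, int avgDocKb, int studiesPerYear) {
        return kilobytesPerYear(docsPerStudy, avgDocKb, studiesPerYear) / 1_000_000.0;
    }
}
```

With the figures above, this gives 60,000 documents and 15,000,000KB (15GB) of 
new content per year.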


Sorry if you are not the correct department to consult regarding our 
above-mentioned concern; if this is the case, we would much appreciate it if 
you could direct us to the right department/person. Many thanks.

Look forward to hearing from you.

Regards,
Tarun

________________________________
**********************************Legal & Confidentiality 
Notice**************************************
This email and attachments hereto are strictly private and confidential. 
Reading, copying, disclosure or use by anybody else is not authorised. If you 
have received this email in error, please delete it and notify us as soon as 
possible.
The antivirus software used by ORION is automatically and constantly updated in 
an effort to minimise the risk of viruses infecting our systems, However, you 
should be aware that there is no absolute guarantee that any files attached to 
this email are virus free.
ORION may monitor email traffic data and also the content of email for the 
purposes of security and staff training.
ORION Clinical Services Limited is a private limited company registered in 
England. Company number 3457136. Registered address: 7 Bath Road, Slough, 
Berkshire, SL1 3UA. ORION Clinical Services Limited is the parent company of a 
number of subsidiary companies. For further details please visit our website at 
www.orioncro.com<http://www.orioncro.com>
________________________________________


