Well, again it depends on how you set up the JCR repository, so all I can give you is a conditional yes to your question...
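To make the sharding idea from my earlier message (quoted below) a bit more concrete, here is a minimal sketch in plain Java of how a 9-digit mark ID could be mapped to its sharded path. The class and method names (MarkPaths, shardedPath) are made up for illustration and are not part of the Jackrabbit or JCR API; the "/marks" parent and the 3-digit grouping are taken from the example in that message.

```java
public final class MarkPaths {
    private MarkPaths() {}

    /**
     * Hypothetical helper: maps a 9-digit mark ID to a sharded path of the
     * form /marks/AAA/BBB/AAABBBCCC, so that the two intermediate levels
     * each hold at most 1000 child nodes.
     */
    public static String shardedPath(String id) {
        if (id == null || !id.matches("\\d{9}")) {
            throw new IllegalArgumentException("expected a 9-digit ID, got: " + id);
        }
        // First and second 3-digit groups become the intermediate nodes;
        // the full ID is kept as the name of the leaf node.
        return "/marks/" + id.substring(0, 3) + "/" + id.substring(3, 6) + "/" + id;
    }

    public static void main(String[] args) {
        System.out.println(shardedPath("000342865"));
    }
}
```

This only computes the path. In a real repository the intermediate nodes would still have to be looked up or created before the mark node is added, which is exactly the step that slowed down our export.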
Salu2,
Quique.

On Thu, Nov 14, 2013 at 5:21 PM, Tarun Dogra <[email protected]> wrote:
> Hi Enrique,
>
> Thanks for the detailed reply. Unfortunately, I am not familiar with
> the nodes and the BTree side of the Jackrabbit framework, so I was expecting
> an answer in terms of the overall picture of how Jackrabbit as a JCR will
> fit into our system.
>
> In brief, we need to integrate Jackrabbit (as advised by our vendor) into
> our clinical trial management system. For this, I have already provided you
> with the specification of the server on which the system will be hosted. So
> I just wanted to know whether, on such a server, Jackrabbit is capable of
> taking in approximately 15GB of data per year and managing that many
> documents/files (as mentioned before) without its performance being
> affected. We already know it is a very stable JCR implementation, but we
> just wanted to confirm that such a system can meet our organisation’s
> requirements.
>
> Regards,
> Tarun
>
>
> From: Enrique Medina Montenegro [mailto:[email protected]]
> Sent: 14 November 2013 14:29
> To: [email protected]
> Cc: Mark Essex
> Subject: Re: Jackrabbits reliability and performance
>
> Hi Tarun,
>
> Let me share my findings with you :-)
>
> At my work we are evaluating the use of Jackrabbit to build a JCR
> repository to store the register of marks (intellectual property) as
> documents composed basically of an ID, some metadata (who created it, when,
> etc.) and the XML and JSON representation of the mark itself. Currently, we
> have all that information spread across several relational DBs and we would
> like to take advantage of the versioning and observation features of the
> JCR repository.
>
> During our initial evaluation, mostly focused on performance, we noticed
> serious issues when adding the 1 million marks we currently have in our DBs
> underneath the same "parent" node, but we found out that this is actually
> a known limitation documented by Jackrabbit, which clearly states that no
> more than 10K child nodes should be added to the same "parent" node:
>
> http://wiki.apache.org/jackrabbit/Performance
>
> However, we were still more or less forced to follow that path because we
> were required to perform an initial dump of all the data in the DBs, and
> just adding each mark as a sub-node proved to be the fastest way to export
> all the data within an acceptable time window.
>
> Nevertheless, we also tried to shard the nodes as a tree, basically
> splitting the 9-digit ID of our marks into 3-digit groups, so each node
> could have at most 1K sub-nodes. For example, the mark with
> ID = 000342865 would be saved into --> root (node) -> marks (node) ->
> 000 (node) -> 342 (node) -> 000342865 (node). Theoretically, this would
> perform much better than our original approach, but as a downside, it would
> dramatically slow down the export of the 1M marks from the DBs, pushing it
> further outside our acceptable time window (because, for each mark, it
> first had to look up the exact node in which to store it, and the bigger
> the JCR repository grew, the slower the node lookups became, which slowed
> down the overall export process).
>
> We also took a look at the BTreeManager, but we just couldn't make it work
> due to the issue I describe here (which BTW has not been answered yet):
>
>
> http://mail-archives.apache.org/mod_mbox/jackrabbit-users/201311.mbox/ajax/%3CCA%2BdeSP_weUQ0mtSBjoQGy3jq60jZEo7LtmF9kJZkvF1eyNvu-A%40mail.gmail.com%3E
>
> So getting back to the original approach of storing everything under the
> same node, how did we manage to get acceptable read times?
> Well, it boils down to using Lucene's indexing (configured properly to
> index only the "id" property, and not all the XML and JSON stuff, using the
> IndexingConfiguration in the Search section of the repository config file)
> to actually perform the search/retrieval of marks. So for instance, instead
> of:
>
> session.getNode("/marks/000342865") --> takes ~2.4 secs with 1M marks under
> the same node
>
> we run this query with SQL2:
>
> SELECT * FROM markType WHERE id = '000342865' --> takes tens of ms with 1M
> marks under the same node thanks to Lucene's indexes
>
> (notice that "markType" is a custom node type that we have created to
> model our domain, in this case the marks)
>
> LESSONS LEARNED: You need to clearly define the scope of your project in
> terms of the functionality you intend to use from Jackrabbit, and then
> plan detailed performance workshops to prove your approach. There are
> always trade-offs. For instance, in my case, when I want to get a
> specific version of a mark, I cannot use the "official" API through
> "VersionManager" because it uses the direct path to fetch the node prior to
> getting the revision:
>
> session.getWorkspace().getVersionManager().getVersionHistory("/marks/000342865").getVersionByLabel("v.6.0")
>
> Instead, I have to use the "deprecated" API method from the node itself,
> once I've got it using the SQL2 statement mentioned above:
>
> markNode.getVersionHistory().getVersionByLabel("v.6.0")
>
> with the uncertainty about when that deprecated API will be removed...
>
> Please share your findings on the list as you make progress :-)
>
> Regards,
> Enrique Medina.
>
> On Thu, Nov 14, 2013 at 10:40 AM, Tarun Dogra <[email protected]> wrote:
> Respected Sir/Madam,
>
> In the next couple of months, we (ORION Clinical Services Ltd., UK) are
> about to release a clinical trial management system as a product to be used
> in-house by all our employees.
> We have bought this product off the shelf
> from a third-party vendor. As suggested by our vendor, we would implement
> Jackrabbit as the central repository system within this main product. But
> we are still not sure whether Jackrabbit is an ideal solution to be
> integrated with our product, and this is where we need your help and
> would appreciate it if you could share your expertise.
>
> Just to give you an overview of our organisation, we will have around 7500
> documents (each approximately 250 KB on average) per "study"
> within our clinical trial management framework. We usually take on
> around 7-8 such studies per year. So, on the basis of 8 studies per year,
> the total size of all the documents will grow by 7500 x 250 KB x 8 = 15 GB
> approximately per year. So we just wanted to know a couple of things from
> you:
>
> 1. Is Jackrabbit reliable enough as a system to cater to our
> above-mentioned needs? and
>
> 2. Will the management of so many documents have any adverse effects
> on Jackrabbit's performance? This is considering that Jackrabbit will
> reside on one of our own hosted servers with the following spec:
>
> PowerEdge R710
>
> CPU: 2 x Intel X5550
>
> Memory: 16 GB
>
> Operating System: Windows 2008 R2 64-bit SP1
>
> Disk capacity: C: 142 GB and D: 1.22 TB
>
>
> Sorry if you are not the correct department to consult regarding
> our above-mentioned concern; if this is the case, it would be much
> appreciated if you could direct us to the right department/person. Many
> thanks.
>
> Look forward to hearing from you.
>
> Regards,
> Tarun
>
> ________________________________
> **********************************Legal & Confidentiality
> Notice**************************************
> This email and attachments hereto are strictly private and confidential.
> Reading, copying, disclosure or use by anybody else is not authorised. If
> you have received this email in error, please delete it and notify us as
> soon as possible.
> The antivirus software used by ORION is automatically and constantly
> updated in an effort to minimise the risk of viruses infecting our systems.
> However, you should be aware that there is no absolute guarantee that any
> files attached to this email are virus free.
> ORION may monitor email traffic data and also the content of email for the
> purposes of security and staff training.
> ORION Clinical Services Limited is a private limited company registered in
> England. Company number 3457136. Registered address: 7 Bath Road, Slough,
> Berkshire, SL1 3UA. ORION Clinical Services Limited is the parent company
> of a number of subsidiary companies. For further details please visit our
> website at www.orioncro.com
> ________________________________________
