Re: FW: Jackrabbits reliability and performance

Enrique Medina Montenegro Thu, 14 Nov 2013 08:58:48 -0800

Well, again it depends on how you set up the JCR repository, so all I can
give you is a conditional yes to your question...


Salu2,
Quique.


On Thu, Nov 14, 2013 at 5:21 PM, Tarun Dogra <[email protected]> wrote:

> Hi Enrique,
>
> Thanks for the detailed reply. Unfortunately, I am not familiar with the
> node and BTree side of the Jackrabbit framework, so I was hoping for an
> answer in terms of the overall picture of how Jackrabbit, as a JCR
> implementation, will fit into our system.
>
> In brief, we need to integrate Jackrabbit (as advised by our vendor) into
> our clinical trial management system. I have already provided you with the
> specification of the server on which the system will be hosted. We just
> want to know whether, on such a server, Jackrabbit is capable of taking in
> approximately 15GB of data per year and managing that many
> documents/files (as mentioned before) without its performance being
> affected. We already know it is a well-established JCR implementation, but
> we want to confirm that it can meet our organisation's requirements.
>
> Regards,
> Tarun
>
>
> From: Enrique Medina Montenegro [mailto:[email protected]]
> Sent: 14 November 2013 14:29
> To: [email protected]
> Cc: Mark Essex
> Subject: Re: Jackrabbits reliability and performance
>
> Hi Tarun,
>
> Let me share my findings with you :-)
>
> At my work we are evaluating the use of Jackrabbit to build a JCR
> repository to store the register of marks (intellectual property) as
> documents composed basically of an ID, some metadata (who created it, when,
> etc.) and the XML and JSON representation of the mark itself. Currently, we
> have all that information spread in several relational DBs and we would
> like to take advantage of the versioning and observation features of the
> JCR repository.
>
> During our initial evaluation, mostly focused on performance, we noticed
> serious issues when adding the 1 million marks we currently have in our DBs
> underneath the same "parent" node, but we found out that this is actually
> a known limitation of Jackrabbit, which clearly states that no more than
> 10K child nodes should be added to the same "parent" node:
>
> http://wiki.apache.org/jackrabbit/Performance
>
> However, we were still more or less forced to follow that path because we
> were required to perform an initial dump of all the data in the DBs, and
> just adding each mark as a sub-node proved to be the fastest way to export
> all the data within an acceptable time window.
>
> Nevertheless, we also tried to shard the nodes as a tree, basically
> splitting the 9-digit ID of our marks into 3-digit groups, so each node
> could have at most 1K sub-nodes. For example, the mark with
> ID = 000342865 would be saved into root (node) -> marks (node) ->
> 000 (node) -> 342 (node) -> 000342865 (node). Theoretically, this would
> perform much better than our original approach, but as a downside, it
> dramatically slowed down the export of the 1M marks from the DBs, pushing
> it further outside our acceptable time window. For each mark, the export
> first had to look up the exact node in which to store it, and the more
> the JCR repository grew, the slower those node lookups became, which
> slowed down the overall export process.
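>

The 3-digit sharding scheme described above can be sketched as a small path helper. This is only an illustration under stated assumptions: the class name MarkPaths and the method shardedPath are hypothetical and not part of Jackrabbit, and the code only builds the path string; it does not touch a repository.

```java
// Sketch only: MarkPaths and shardedPath are hypothetical names,
// not part of the Jackrabbit or JCR API.
final class MarkPaths {
    private MarkPaths() {}

    /**
     * Builds the sharded node path for a 9-digit mark ID:
     * /marks/<digits 1-3>/<digits 4-6>/<full ID>.
     * Each level then holds at most 1000 child nodes.
     */
    static String shardedPath(String markId) {
        if (markId == null || !markId.matches("\\d{9}")) {
            throw new IllegalArgumentException("expected a 9-digit ID, got: " + markId);
        }
        String level1 = markId.substring(0, 3); // first 3-digit group
        String level2 = markId.substring(3, 6); // second 3-digit group
        return "/marks/" + level1 + "/" + level2 + "/" + markId;
    }

    public static void main(String[] args) {
        // Prints the path for the example ID used in this thread.
        System.out.println(shardedPath("000342865"));
    }
}
```

Because the path is derived from the ID alone, the target parent can be computed without first searching the repository; whether that helps with the export times reported above would still have to be measured.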
>
> We also took a look at the BTreeManager, but we just couldn't make it work
> due to the issue I describe here (which BTW has not been answered yet):
>
>
> http://mail-archives.apache.org/mod_mbox/jackrabbit-users/201311.mbox/ajax/%3CCA%2BdeSP_weUQ0mtSBjoQGy3jq60jZEo7LtmF9kJZkvF1eyNvu-A%40mail.gmail.com%3E
>
> So getting back to the original approach of storing everything under the
> same node, how did we manage to get acceptable read times? Well, it boils
> down to using Lucene indexing (configured properly to index only the
> "id" property, and not all the XML and JSON content, using the
> IndexingConfiguration in the Search section of the repository config file)
> to actually perform the search/retrieval of marks. So for instance, instead
> of:
>
> session.getNode("/marks/000342865") --> takes ~2.4 s with 1M marks under
> the same node
>
> we run this query with SQL2:
>
> SELECT * FROM markType WHERE id = '000342865' --> takes tens of ms with 1M
> marks under the same node thanks to Lucene's indexes
>
> (notice that "markType" is a custom node type that we have created to
> model our domain, in this case the marks)
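>

As a rough sketch of the query-based lookup, the statement above can be built from a mark ID as follows. The class MarkQueries and the method byId are hypothetical names used only for illustration; the node type "markType" and the property "id" are taken from the message. The code only produces the statement text, doubling any single quote in the ID so the literal stays well-formed.

```java
// Sketch only: MarkQueries and byId are hypothetical names.
// "markType" and "id" come from the example in this thread.
final class MarkQueries {
    private MarkQueries() {}

    /** Returns a JCR-SQL2 statement selecting the mark with the given ID. */
    static String byId(String markId) {
        // A single quote inside a string literal is escaped by doubling it.
        String escaped = markId.replace("'", "''");
        return "SELECT * FROM markType WHERE id = '" + escaped + "'";
    }

    public static void main(String[] args) {
        String statement = byId("000342865");
        System.out.println(statement);
        // With a live javax.jcr.Session (not exercised here), the statement
        // would typically be run through the standard JCR 2.0 query API:
        //   session.getWorkspace().getQueryManager()
        //          .createQuery(statement, Query.JCR_SQL2)
        //          .execute().getNodes();
    }
}
```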
>
> LESSONS LEARNED: You need to clearly define the scope of your project in
> terms of the functionality you intend to use from Jackrabbit, and then
> plan for detailed performance workshops to prove your approach. There are
> always trade-offs. For instance, in my case, when I want to get a
> specific version of a mark, I cannot use the "official" API through
> "VersionManager" because it uses the direct path to fetch the node before
> getting the revision:
>
> session.getWorkspace().getVersionManager().getVersionHistory("/marks/000342865").getVersionByLabel("v.6.0")
>
> Instead, I have to use the "deprecated" API method on the node itself, once
> I've got it using the SQL2 statement mentioned above:
>
> markNode.getVersionHistory().getVersionByLabel("v.6.0")
>
> with the uncertainty about when that deprecated API will be removed...
>
> Please share your findings in the list as you make progress :-)
>
> Regards,
> Enrique Medina.
>
> On Thu, Nov 14, 2013 at 10:40 AM, Tarun Dogra <[email protected]> wrote:
> Respected Sir/Madam,
>
> In the next couple of months, we (ORION Clinical Services Ltd., UK) are
> about to release a clinical trial management system as a product to be used
> in-house by all our employees. We have bought this product off the shelf
> from a third-party vendor. As suggested by our vendor, we would implement
> Jackrabbit as the central repository system within this main product. But
> we are still not sure whether Jackrabbit is an ideal solution to be
> integrated with our product. This is where we need your help, and we
> would appreciate it if you could share your expertise.
>
> Just to give you an overview of our organisation, we will have around 7500
> documents (each approximately 250 KB on average) per "study"
> within our clinical trial management framework. We usually take on
> around 7-8 such studies per year. So, on the basis of 8 studies per year,
> the total size of all the documents will grow by 7500 x 250 KB x 8 = 15 GB
> approximately per year. We would like to know a couple of things from you:
>
> 1. Is Jackrabbit reliable enough as a system to cater to our
> above-mentioned needs? and
>
> 2. Will the management of so many documents have any adverse effect
> on Jackrabbit's performance, considering that Jackrabbit will reside on
> one of our own hosted servers with the following spec:
>
> Server: PowerEdge R710
> CPU: 2 x Intel X5550
> Memory: 16 GB
> Operating System: Windows 2008 R2 64bit SP1
> Disk capacity: C: 142 GB and D: 1.22 TB
>
>
> Sorry if you are not the correct department to consult regarding
> our above-mentioned concern; if this is the case, it would be much
> appreciated if you could direct us to the right department/person. Many
> thanks.
>
> Look forward to hearing from you.
>
> Regards,
> Tarun
>
> ________________________________
> **********************************Legal & Confidentiality
> Notice**************************************
> This email and attachments hereto are strictly private and confidential.
> Reading, copying, disclosure or use by anybody else is not authorised. If
> you have received this email in error, please delete it and notify us as
> soon as possible.
> The antivirus software used by ORION is automatically and constantly
> updated in an effort to minimise the risk of viruses infecting our systems,
> However, you should be aware that there is no absolute guarantee that any
> files attached to this email are virus free.
> ORION may monitor email traffic data and also the content of email for the
> purposes of security and staff training.
> ORION Clinical Services Limited is a private limited company registered in
> England. Company number 3457136. Registered address: 7 Bath Road, Slough,
> Berkshire, SL1 3UA. ORION Clinical Services Limited is the parent company
> of a number of subsidiary companies. For further details please visit our
> website at www.orioncro.com
> ________________________________________
>
>
