On Wed, Nov 20, 2013 at 7:39 PM, Enrique Medina Montenegro < [email protected]> wrote:
> Thanks for the feedback, Peter. Much appreciated.
>
> During these days I've also tried to segment the "marks" node into a deep
> tree structure by taking the ID in groups of 3 digits. So for example, as
> my IDs have 9 numbers, I can take the first 3 digits for the first level in
> the tree, then the next 3 digits for the second level, and then the last 3
> digits for the last level where the "mark" would actually be saved (as the
> leaf). An example is worth a thousand words:

Depending on your access patterns, you might also use the date, as in
/YYYY/MM/DD/003672897, for segmentation.

> mark --> ID = 003672897
>
> JCR --> root (node) --> marks (node) --> 003 (node) --> 672 (node) -->
> 003672897 (node)
>
> This is a valid approach at the theory level, but at the practical level,
> when I dump the 1M marks from the DB into JCR, for each and every "mark" it
> has to look up the path in the tree where to ultimately store the "mark",
> and this lookup starts to take on the order of seconds as the tree structure
> grows, making the full extraction process from the DB too slow for our
> requirements.

I did an evaluation of Jackrabbit recently, and I found that using Apache
Sling instead of "pure" Jackrabbit made a number of things very convenient.
While I'm not sure whether it would be faster, using the Sling REST API
would let you create each document with something like

  $ curl -u admin:admin [email protected] -Fother_field=test ... http://<server>:<port>/marks/<nnn>/<mmm>/<ppp>/

or its equivalent request using any HTTP client framework. This might show a
flatter response time, and it will create the intermediate nodes if needed.

Regards
Santiago

> That's why I still need to stick to the "flat" structure, taking advantage
> of Lucene's index, while still being worried about the use of the
> deprecated API, as I mentioned in my previous email.
>
> Salu2,
> Quique.
>
> On Wed, Nov 20, 2013 at 7:27 PM, Peter Harrison <[email protected]> wrote:
>
> > I am by no means an expert, but I have been developing for three or four
> > months with Jackrabbit. The approach I've taken is not to include the base
> > records under one node.
> >
> > For example, you may have classes of patent, such as medical, chemical
> > process etc, and so you could break down the mark into subnodes for each
> > class of patent. Finding a particular mark by its ID is still quite easy,
> > but not as trivial as simply having a path like /mark/<patentid>.
> >
> > I have put a REST interface in front of Jackrabbit that handles simple IDs
> > - running the appropriate query, and then returning the object which
> > contains the full path.
> >
> > This idea - that the path itself contains information about a node - takes
> > a little getting used to, but it allows you to do some very quick reporting
> > on specific classes, as searches can be scoped to specific trees.
> >
> > What I'm learning is that Jackrabbit isn't just another kind of DB - so
> > you should not treat it as just another kind of flat table. You should be
> > creating a deep tree structure rather than a shallow structure. Doing this
> > allows you to utilise the path to limit the scope of queries.
> >
> > PS: I have also modified the Java OCM to allow lists of primitives to be
> > stored as properties of a single subnode. I've been making changes to OCM
> > on my local system, but am not really sure how to contribute back.
> >
> > On 20/11/13 23:39, Enrique Medina Montenegro wrote:
> >
> >> Hi list,
> >>
> >> I've been evaluating Jackrabbit for several weeks, performing all sorts of
> >> performance testing due to the nature of the repository we need to create
> >> here at OHIM. Not sure if you're aware of us, but we're the European Office
> >> where you have to come to protect the intellectual property of your marks
> >> and designs in the whole European Community. Currently, we are storing all
> >> our marks and designs information in a relational DB, and besides serious
> >> performance issues (it's an old DB, not Oracle unfortunately) we don't have
> >> functionality such as versioning or observation, and the fact that our
> >> information is perfectly suitable to be modelled as an XML document led
> >> us to think about storing it in a JCR repository.
> >>
> >> I went through David's model and decided to create a single node called
> >> "marks" and then add one child node for each existing mark in our system
> >> (~1 million marks where each mark would have ~50 versions/revisions), but
> >> then I found that adding more than 10K child nodes could lead to potential
> >> performance issues. However, after some testing, I also found that indexing
> >> the mark nodes allowed us to query them extremely fast using SQL2, so we
> >> could overcome the issue with the 10K child nodes.
> >>
> >> For example, instead of doing → session.getNode("/marks/000345123") ← we
> >> could query → SELECT * FROM [iptool:markType] WHERE [iptool:id] =
> >> '000345123' (notice that we defined our own custom node types and also told
> >> Lucene just to index the [iptool:id] property through the use of the
> >> IndexConfiguration configuration).
> >>
> >> Everything was then progressing smoothly, but then we realized that in
> >> order to fetch a specific version or even the base version of a particular
> >> mark, the API recommended using the VersionManager:
> >>
> >> VersionHistory history =
> >>     session.getWorkspace().getVersionManager().getVersionHistory(markNode.getPath());
> >>
> >> Unfortunately, this API makes use of the direct path access to the node
> >> being versioned, which in our case was killing our performance due to the
> >> 10K child nodes limitation (sort of). Although there's the possibility to
> >> access the versions directly from the node itself using
> >> → markNode.getBaseVersion() or markNode.getVersionHistory() ←
> >> these methods are deprecated and we are not quite sure whether they will
> >> be removed in the near future or left there as an alternative way of
> >> retrieving the version history from a node.
> >>
> >> Therefore, could I possibly get some answers from you to help us out in
> >> making our final decision on whether to use Jackrabbit as our official JCR
> >> repository implementation?
> >>
> >> - Is the direct retrieval of the version history through the node itself
> >>   (now deprecated) going to be eventually removed or not? If so, when is it
> >>   planned to be removed? If not, will it be kept as a "valid" alternative to
> >>   the current VersionManager approach?
> >>
> >> - Using Lucene's indexes is giving very fast read times (on the order
> >>   of tens of ms), but do you foresee other hidden issues or side effects
> >>   of maintaining ~1M child nodes underneath the same parent "marks" node?
> >>
> >> - We also played around with the BTreeManager, but we couldn't make it work
> >>   with custom node types. I even posted this issue on the users mailing
> >>   list, but so far I haven't got any response:
> >>
> >>   http://mail-archives.apache.org/mod_mbox/jackrabbit-users/201311.mbox/ajax/%3CCA%2BdeSP_weUQ0mtSBjoQGy3jq60jZEo7LtmF9kJZkvF1eyNvu-A%40mail.gmail.com%3E
> >>
> >> Thanks so much in advance for helping us out to choose Jackrabbit as our
> >> JCR technology, hopefully!!! :-)
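
The two segmentation schemes mentioned in this thread (the ID split into 3-digit groups, and the /YYYY/MM/DD/ prefix) can be sketched in plain Java as below. This is only an illustration of how the paths are derived: the class and method names (MarkPaths, segmentedPath, datedPath) are hypothetical, and the thread does not say which date would be used for the date-based layout, so the date passed in is an assumption. Nothing here touches the JCR API; the resulting string would still have to be handed to the repository or appended to a Sling URL.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class MarkPaths {
    // Renders a date as the three path segments YYYY/MM/DD.
    private static final DateTimeFormatter DATE_SEGMENTS =
            DateTimeFormatter.ofPattern("yyyy/MM/dd");

    // The thread states that mark IDs have 9 digits; anything else is rejected.
    private static void requireNineDigits(String id) {
        if (id == null || !id.matches("\\d{9}")) {
            throw new IllegalArgumentException("expected a 9-digit mark ID, got: " + id);
        }
    }

    // Follows the example in the thread: marks --> 003 --> 672 --> 003672897,
    // i.e. two intermediate levels and the full ID as the leaf node name.
    public static String segmentedPath(String id) {
        requireNineDigits(id);
        return "/marks/" + id.substring(0, 3) + "/" + id.substring(3, 6) + "/" + id;
    }

    // Date-based variant, /YYYY/MM/DD/<id>. Which date to use (filing date,
    // import date, ...) is an assumption left to the caller.
    public static String datedPath(LocalDate date, String id) {
        requireNineDigits(id);
        return "/" + date.format(DATE_SEGMENTS) + "/" + id;
    }

    public static void main(String[] args) {
        System.out.println(segmentedPath("003672897"));
        // prints /marks/003/672/003672897
        System.out.println(datedPath(LocalDate.of(2013, 11, 20), "003672897"));
        // prints /2013/11/20/003672897
    }
}
```

Because the path is a pure function of the ID (or of the ID plus a date), it can be computed without any lookup in the repository. Whether that removes the slowdown reported during the bulk import is not something this sketch can show.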
