Some performance questions about Jackrabbit

Lorenzo Dini Fri, 01 Feb 2008 04:47:19 -0800

Hi Everybody,

I have been using Jackrabbit for almost a year now and I have some (a lot of :-) questions about the cost of operations performed on the repository. I am trying to optimize performance, and knowing which operations are really done underneath helps with the tuning :-)

I hope somebody will answer, and I hope these questions will also help other people use JR in the best way.

I am sure these questions apply to any deployment model, but I am just describing my case.


------------------------------------------------------------------------

Basically I have a Tomcat WAR with Axis and REST web services as the front-end, which use the in-thread Jackrabbit to read and store many GB of data.


Currently I am using the Jackrabbit 1.3.1 directly embedded in a
custom application that runs on Tomcat.

I am using LocalFileSystemPersistenceManager and I am still using the SimpleDBPersistenceManager (with MySQL on the same machine). Since I have lots of binaries in the repository, it was impossible to move to the BundlePersistenceManager before JR 1.4: the DataStore was not in the release and I could not afford to store the binaries in the DB.


I have 2 workspaces:
Workspace 1
------
Nodes:  22338   (added about 20 nodes per day)
Properties: 242239

Blobs: 13558 files - 48 GB of storage (stored on an AFS server with a soft link in the /blobs directory; no removal, just a few megabytes added per day)



Workspace 2
------
Nodes: 122605 (removed about 5000 nodes and added other 5000 nodes per day)
Properties: 1276972

Blobs: 23842 - 38 GB of storage (local file system; about 3 GB of old data removed per day and another 3 GB of new data added)

As suggested, I have an (almost) balanced tree structure with depth 8; there are no more than 100 children per node, usually no more than 20.

------------------------------------------------------------------------

QUESTIONS:

Session

1) What is the behavior when two sessions operate at the same time? When one session is open reading from the repository and, at the same time, another session is writing to the repository and saving with node.save() or session.save(), are the changes cached in memory until the read-only session is closed, or are the changes visible in the read-only session? How does it work with nodes already in memory before the change? And with nodes that are not in memory and must be read from persistence after the change?

2) How much does it cost to create a new Session through a login? Is it better to keep sessions in a pool or just create them every time? Currently I store 1 session per workspace and return it for read-only access (whenever a write-access connection is requested, I generate a new Session and remove all the read-only ones from the pool, which means I never log out a read-only session until it is closed because write access has been requested). Should I return 1 session per request regardless of usage and always log them out?

3) What happens if a session is garbage collected without a logout having been executed?

4) Since I am not using the JR security, I have implemented my own AccessManager and LoginModule classes that just return true and perform the minimal operations needed to allow anything. This causes an error in JR 1.4 at login() time.

Does the basic security provided by JR (SimpleAccessManager and SimpleLoginModule) add overhead for security checks? If it does not, I will move back to them for easier maintenance.

IO

5) Are the InputStreams returned by getProperty("...").getStream() FileInputStreams or BufferedInputStreams? In the former case I would wrap them in a BufferedInputStream to try to improve the IO.
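
A minimal sketch of the wrapping mentioned in question 5, using only java.io. It makes no claim about which stream type Jackrabbit actually returns; the class name Streams, the helper buffered() and the 64 KB buffer size are hypothetical choices for illustration.

```java
import java.io.BufferedInputStream;
import java.io.InputStream;

class Streams {
    // Assumed buffer size; tune it against real measurements.
    static final int BUFFER_SIZE = 64 * 1024;

    // Returns the stream unchanged if it is already buffered,
    // otherwise wraps it so small reads are served from memory.
    static InputStream buffered(InputStream in) {
        if (in instanceof BufferedInputStream) {
            return in;
        }
        return new BufferedInputStream(in, BUFFER_SIZE);
    }
}
```

The stream obtained from getStream() would be passed through Streams.buffered() before reading. Whether this helps depends on how the caller reads: large bulk reads gain little from an extra buffer.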

6) How much does the MySqlBundlePersistenceManager improve performance on average? My bottleneck is always Java at 100% of processor time and never MySQL, which uses no more than 5-10%. Will the BundlePM lower Java's processor usage?

7) Is there any tool to get a readable version of the serialized node stored in the DB?


Backup

8) What is the difference, if any, in performance between:

new SysViewSAXEventGenerator(node, false, true, th).serialize();

and

session.exportSystemView(node.getPath(), ch, true, false);

and is there a way to spread the backup over a longer time so that it does not use all the available resources?
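
One generic way to spread an export over time is to throttle the stream it writes to. The sketch below is plain java.io and not a Jackrabbit feature; ThrottledOutputStream and its bytesPerSecond parameter are hypothetical names. It assumes the export is written to an OutputStream (for example through the OutputStream variant of exportSystemView) rather than to a ContentHandler.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.InterruptedIOException;
import java.io.OutputStream;

class ThrottledOutputStream extends FilterOutputStream {
    private final long bytesPerSecond;
    private final long startNanos = System.nanoTime();
    private long written;

    ThrottledOutputStream(OutputStream out, long bytesPerSecond) {
        super(out);
        if (bytesPerSecond <= 0) {
            throw new IllegalArgumentException("rate must be positive");
        }
        this.bytesPerSecond = bytesPerSecond;
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        written++;
        pause();
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        out.write(buf, off, len);
        written += len;
        pause();
    }

    // Sleeps whenever more bytes were written than the average rate allows.
    private void pause() throws IOException {
        long dueMillis = written * 1000L / bytesPerSecond;
        long elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000L;
        long sleepMillis = dueMillis - elapsedMillis;
        if (sleepMillis > 0) {
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new InterruptedIOException("interrupted while throttling");
            }
        }
    }
}
```

Blocking the writer slows the exporting thread down, which lowers its CPU and IO share, but it also keeps the exporting session open for longer.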

9) What happens if a lot of modifications are performed by other sessions during the backup (which for me takes more than 1 hour per workspace using the commands in question 8)?

10) Since it does not make sense to export a 90 GB XML file with the binaries inside, right now I perform a backup by exporting the XML without binaries.


Importing it will overwrite all the binaries with new zero-sized files.

To restore, I change the blobs location, import the XML, and then move the blobs location back to the original storage in order to remap the binaries. Since the node UUIDs do not change, it works.

Do you have a better way to do this? I think the problem is the same when using a DataStore.

11) I am planning to move to JR 1.4, but migrating the whole storage to the new DataStore format is very costly.

Since the DataStore uses the md5 and no longer the node UUID, I cannot put back the file structure generated by the Blobs.

The only way is to create a script that changes the blobs structure to the new DataStore structure, but for this I need a mapping Node UUID -> md5.


Is there a way to know the file url from the Node instance??

If so, I could create a script that moves a specific file from the format No/de/UUID/propertyname.bin to the new format Fi/le/md5/...
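
The digest half of such a script could be sketched as follows, using only the JDK. The class BlobDigest and both method names are hypothetical. The digest algorithm is passed in as a parameter, and the two-level directory layout simply mirrors the Fi/le/md5 pattern written above; both are assumptions to verify against the real DataStore implementation before moving any file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class BlobDigest {
    // Streams the content through the digest so large blobs are never held in memory.
    static String hexDigest(InputStream in, String algorithm) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("unknown digest: " + algorithm, e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assumed layout: the first two character pairs become directories.
    static String relativePath(String hex) {
        return hex.substring(0, 2) + "/" + hex.substring(2, 4) + "/" + hex;
    }
}
```

Walking the existing blobs directory and recording UUID path -> digest for each file would give the mapping without asking the repository for it.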



Indexing and Searching

12) How much improvement comes from specifying indexing rules? I mainly use the name property for searching, plus a few others... Would giving these properties priority speed things up a lot? I think that most of the time is spent not on the Lucene query itself but on loading and sorting the nodes.


13) When exactly are the nodes loaded from the DB by the QueryEngine?
What happens during query.execute()?
What happens during query.getNodes()? How many nodes are read from the DB?
When (and how) is the sorting done?
What happens during iterator.nextNode()?

14) How does the sorting work, since it cannot be done by the DB? Is it done by Lucene, or are all the nodes simply sorted using Collections.sort? That would mean that all nodes must be loaded before the first one is returned, even if you need only the first N. How can this be sped up?
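
To make the "first N only" case concrete, here is a sketch of a bounded-heap selection on the application side. It says nothing about how Jackrabbit or Lucene sort internally; TopN and topN() are hypothetical names. Every item is still visited once, but only N items are kept and ordered, instead of sorting the full result.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class TopN {
    // Returns the n smallest items according to 'order', in sorted order.
    static <T> List<T> topN(Iterator<T> items, int n, Comparator<T> order) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Reversed comparator: the head of the heap is the worst item kept so far.
        PriorityQueue<T> heap = new PriorityQueue<>(n, order.reversed());
        while (items.hasNext()) {
            T item = items.next();
            if (heap.size() < n) {
                heap.add(item);
            } else if (order.compare(item, heap.peek()) < 0) {
                heap.poll();   // drop the current worst
                heap.add(item);
            }
        }
        List<T> result = new ArrayList<>(heap);
        result.sort(order);
        return result;
    }
}
```

For M results this costs roughly O(M log N) comparisons and O(N) memory, compared with O(M log M) and O(M) for a full Collections.sort.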

15) Is there any change in JR 1.4? I saw it is possible to limit the number of entries returned and to set an offset; how does this work with sorting?

16) If I need a specific subnode with a particular property, is it faster to list all the subnodes using node.getNodes() and pick the right one, or to do a Lucene query? I imagine it depends on the number of subnodes, but for approximately 20 subnodes, does the overhead of Lucene outweigh the cost of getNodes()?


NodeTypeDefinition

17) I use a quite complex node type definition, without references, as suggested (I use strings and call getNodeByUUID()). How much overhead does this definition add for type checking? I could enable it during development and testing and disable it in production.

I hope they are not ALL stupid questions; my apologies if some or most of them were already discussed before I joined the mailing list.


Lorenzo Dini


--
*Lorenzo Dini*

CERN - European Organization for Nuclear Research
Information Technology Department
CH-1211 Geneva 23

Building 28 - Office 1-007
Phone: +41 (0) 22 7674384
Fax: +41 (0) 22 7668847
E-mail: [EMAIL PROTECTED]
