Hi Everybody,
I have been using Jackrabbit for almost a year now and I have some (a
lot of :-) questions about the cost of the operations performed on the
repository. I am trying to optimize performance, and knowing which
operations are really done underneath helps with the tuning :-)
I hope somebody will answer, and I hope these questions will also help
other people use JR in the best way.
I am sure these questions apply to any deployment model, but I am just
describing my case.
------------------------------------------------------------------------
Basically I have a Tomcat WAR with Axis and REST web services as the
front end, which use Jackrabbit in the same JVM to read and store many
GB of data. Currently I am using Jackrabbit 1.3.1 directly embedded in a
custom application that runs on Tomcat.
I am using LocalFileSystemPersistenceManager and I am still using the
SimpleDBPersistenceManager (with MySQL on the same machine). Since I
have lots of binaries in the repository, it was impossible to move to
the BundlePersistenceManager before JR 1.4: the DataStore was not yet in
a release and I could not afford to store the binaries in the DB.
I have 2 workspaces:
Workspace 1------
Nodes: 22338 (about 20 nodes added per day)
Properties: 242239
Blobs: 13558 files - 48 GB of storage (stored on an AFS server with a
softlink in the /blobs directory; no removals, just a few megabytes
added per day)
Workspace 2------
Nodes: 122605 (about 5000 nodes removed and another 5000 added per day)
Properties: 1276972
Blobs: 23842 files - 38 GB of storage (local file system; about 3 GB of
old data removed and another 3 GB of new data added per day)
As suggested, I have an (almost) balanced tree structure of depth 8;
there are never more than 100 children per node, usually no more than 20.
------------------------------------------------------------------------
QUESTIONS:
Session
1) What is the behavior when two sessions operate at the same time?
When one session is open and reading from the repository while another
session is writing to it and saving with node.save() or session.save(),
are the changes cached in memory until the read-only session is closed,
or are they visible in the read-only session right away? How does it
work for nodes already in memory before the change? And for nodes that
are not in memory and must be read from persistence after the change?
2) How expensive is it to create a new Session through a login? Is it
better to keep sessions in a pool or to create a new one every time?
Currently I keep 1 session per workspace and hand it out for read-only
access. Whenever a write-access session is requested, I create a new
Session and remove all the read-only ones from the pool, which means I
never log out a read-only session until a write-access request causes
it to be closed. Should I instead return one session per request,
regardless of usage, and always log it out?
3) What happens if a session is garbage collected without logout()
having been called?
4) Since I am not using the JR security, I have implemented my own
AccessManager and LoginModule classes that just return true and perform
the minimal operations needed to allow anything. This causes an error in
JR 1.4 at login() time.
Does the basic security provided by JR (SimpleAccessManager and
SimpleLoginModule) add overhead for security checks? If it does not, I
will move back to them for easier maintenance.
IO
5) Are the InputStreams returned by getProperty("...").getStream()
FileInputStreams or BufferedInputStreams? If they are not buffered, I
would wrap them in a BufferedInputStream to try to improve the IO.
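For reference, the wrapping I have in mind looks roughly like the sketch
below. It uses only java.io; the names StreamUtil, buffered and drain and
the 64 KB buffer size are placeholders of mine, not anything from JR, and
I have not measured whether it actually helps.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

final class StreamUtil {
    // Hypothetical buffer size; it would need tuning against real binaries.
    static final int BUFFER_SIZE = 64 * 1024;

    // Wrap the stream only if it is not already a BufferedInputStream,
    // to avoid stacking one buffer on top of another.
    static InputStream buffered(InputStream in) {
        if (in instanceof BufferedInputStream) {
            return in;
        }
        return new BufferedInputStream(in, BUFFER_SIZE);
    }

    // Read the stream to the end and return the number of bytes seen.
    static long drain(InputStream in) throws IOException {
        byte[] chunk = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(chunk)) != -1) {
            total += n;
        }
        return total;
    }
}
```

The stream returned by getStream() would be passed to
StreamUtil.buffered() before reading from it.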
6) How much does the MySqlBundlePersistenceManager improve performance
on average? My bottleneck is always the Java process at 100% of
processor time, never MySQL, which uses no more than 5-10%. Will the
BundlePM lower the processor usage of Java?
7) Is there any tool to get a readable version of the serialized node
stored in the DB?
Backup
8) What is the difference, if any, in performance between:
new SysViewSAXEventGenerator(node, false, true, th).serialize();
and
session.exportSystemView(node.getPath(), ch, true, false);
And is there a way to spread the backup over a longer time so that it
does not use all the available resources?
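If no built-in throttle exists, one thing I could try on my side is to
wrap the ContentHandler passed to exportSystemView() so that it pauses
every so often. The sketch below uses only the SAX classes shipped with
the JDK; ThrottlingHandler and its two parameters are names I made up,
and I have not verified that pausing like this really relieves the
repository.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Forwards all SAX events to the target handler and sleeps for a short
// time after every `elementsPerPause` start-element events.
final class ThrottlingHandler extends XMLFilterImpl {
    private final int elementsPerPause;
    private final long pauseMillis;
    private long elements = 0;
    private long pauses = 0;

    ThrottlingHandler(ContentHandler target, int elementsPerPause, long pauseMillis) {
        if (elementsPerPause <= 0) {
            throw new IllegalArgumentException("elementsPerPause must be > 0");
        }
        this.elementsPerPause = elementsPerPause;
        this.pauseMillis = pauseMillis;
        setContentHandler(target);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (++elements % elementsPerPause == 0) {
            pauses++;
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new SAXException("export interrupted", e);
            }
        }
        // XMLFilterImpl passes the event on to the wrapped handler.
        super.startElement(uri, localName, qName, atts);
    }

    // Number of pauses taken so far (useful for logging).
    long pauseCount() {
        return pauses;
    }
}
```

I would then pass something like new ThrottlingHandler(ch, 1000, 50)
instead of ch to exportSystemView(); the values 1000 and 50 are only
examples.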
9) What happens if, during the backup (which for me takes more than 1
hour per workspace using the commands in question 8), a lot of
modifications are performed by other sessions?
10) Since it does not make sense to export a 90 GB XML file with the
binaries inside, right now I perform the backup by exporting the XML
without binaries.
Importing it overwrites all the binaries with new 0-sized files.
To restore, I change the blobs location, import the XML, and then
switch the blobs location back to the original storage in order to
remap the binaries. Since the node UUIDs do not change, it works.
Do you have a better way to do this? I think the problem is the same
with a DataStore.
11) I am planning to move to JR 1.4, but migrating the whole storage to
the new DataStore format is very costly.
Since the DataStore uses the md5 and no longer the node UUID, I cannot
simply put back the file structure generated by the Blobs.
The only way is to write a script that converts the blobs structure to
the new DataStore structure, but for this I need a mapping Node UUID ->
md5.
Is there a way to get the file URL from the Node instance?
If so, I could write a script that moves a specific file from the
format No/de/UUID/propertyname.bin to the new format Fi/le/md5/...
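The digest half of such a script is easy to write with the JDK alone; a
minimal sketch follows. BlobDigest and hexDigest are my own names, and
the algorithm is a parameter because I am only assuming the DataStore
uses md5. The sketch deliberately does not build the target path, since
I do not know the exact directory layout.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class BlobDigest {
    // Stream the content through the digest and return it as lowercase hex.
    // `algorithm` is any name accepted by MessageDigest, e.g. "MD5".
    static String hexDigest(InputStream in, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            md.update(chunk, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            // Mask to 0..255 so negative bytes print as two hex digits.
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }
}
```

Running this over each propertyname.bin file would give the UUID ->
digest mapping without going through the repository at all.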
Indexing and Searching
12) How much improvement do indexing rules bring? I mainly use the name
property for searching, plus a few others. Would giving these properties
priority speed things up a lot? I think that most of the time is spent
not in the Lucene query itself but in loading and sorting the nodes.
13) When exactly are the nodes loaded from the DB by the QueryEngine?
What happens during query.execute()?
What happens during query.getNodes()? How many nodes are read from the DB?
When (and how) is the sorting done?
What happens during iterator.nextNode()?
14) How does the sorting work, since it cannot be done by the DB? Is it
done by Lucene, or are all the nodes simply sorted using
Collections.sort? That would mean that all nodes must be loaded before
the first one is returned, even if you need only the first N. How can
this be sped up?
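To make the "first N" case concrete: in plain Java I would avoid a full
sort by keeping only the best N items in a bounded heap, roughly as
below. This is generic java.util code and says nothing about how JR
sorts internally; TopN and firstN are names of mine. It still has to
look at every item once, so it would save sorting work but not loading.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopN {
    // Return the first n items according to `order`, in sorted order,
    // while holding at most n items in memory at any time.
    static <T> List<T> firstN(Iterable<T> items, int n, Comparator<T> order) {
        if (n <= 0) {
            return new ArrayList<T>();
        }
        // Reversed comparator: the head of the heap is the worst item kept.
        PriorityQueue<T> heap = new PriorityQueue<T>(n, Collections.reverseOrder(order));
        for (T item : items) {
            if (heap.size() < n) {
                heap.add(item);
            } else if (order.compare(item, heap.peek()) < 0) {
                // The new item beats the worst kept item, so replace it.
                heap.poll();
                heap.add(item);
            }
        }
        List<T> result = new ArrayList<T>(heap);
        Collections.sort(result, order);
        return result;
    }
}
```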
15) Is there any change in JR 1.4? I saw it is possible to limit the
number of entries returned and to set an offset; how does this work
with sorting?
16) In case I need a specific subnode with a particular property, is it
faster to list all the subnodes using node.getNodes() and pick the right
one, or to do a Lucene query? I imagine it depends on the number of
subnodes, but for approximately 20 subnodes I suspect the overhead of
Lucene exceeds the cost of getNodes().
NodeTypeDefinition
17) I use a quite complex node type definition, without references as
suggested (I use strings and call getNodeByUUID()). How much overhead
does this definition add for type checking? If it is significant, I
could enable it during development and testing and disable it in
production.
I hope they are not ALL stupid questions; my apologies if some or most
of them were already discussed before I joined the mailing list.
Lorenzo Dini
--
*Lorenzo Dini*
CERN - European Organization for Nuclear Research
Information Technology Department
CH-1211 Geneva 23
Building 28 - Office 1-007
Phone: +41 (0) 22 7674384
Fax: +41 (0) 22 7668847