A few suggestions:

* Don't produce an XML document and then transform it - use the variant of exportSystemView() which accepts a ContentHandler instead of an OutputStream, and handle the SAX events directly to produce your CSV (see the sketch after the quoted message below).
* This seems like something which can be seriously parallelized. Fire up 5000 nodes and 15 minutes may be achievable :) (A single-JVM sketch of the idea follows at the end of this message.)

Justin

On 4/22/10 10:48 AM, Tai Tran wrote:
> Hi,
>
> I'm very new to JackRabbit, but I'm facing a performance-critical task
> in my project: dumping the whole JackRabbit repository into a CSV file.
>
> We're using the JackRabbit 1.6.0 standalone server with MySQL 5.x to
> store a huge hierarchy of network devices. Each device can have up to
> 100 attributes and several thousand child nodes, nested to an arbitrary
> depth:
>
> device[1]
>   rack
>     subrack
>       port
>       ...
>     ...
>   ...
>
> device[2]
>   ...
>
> device[5000]
>   ...
>
> We need to dump the whole JackRabbit tree into a flat CSV file, with
> each row holding the data of one node. The output is as large as the
> source data, up to 3.6 million lines, in the following format:
>
> rack, attr1, attr2, ...
> rack, attr1, attr2, ...
> ...
> subrack, attr1, attr2, ...
> ...
>
> To minimize calls through the RMI access layer, we tried iterating over
> each device in the repository, using Node.exportSystemView() to dump
> the data into an XML file on disk, and then parsing that file to
> generate the CSV output. However, this is very slow: it took more than
> 5 hours to dump the whole repository on a very fast server, while our
> target is to complete within 15 minutes (almost insane)!
>
> We're now planning to change the JackRabbit source code to add our own
> customized version of exportSystemView in the hope of tackling this
> performance issue.
>
> Any suggestions are really appreciated!!!
>
> Thanks a lot,
> Tai Tran
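For the first suggestion, here is a minimal sketch of a ContentHandler that
turns the system-view SAX events into CSV rows as they arrive, so no
intermediate XML file ever touches disk. It is built on
Session.exportSystemView(String, ContentHandler, boolean, boolean); the class
name CsvExportHandler is mine, it emits one row per sv:node in document order,
ignores property names, and does no CSV quoting/escaping - all things you would
want to harden for real use:

import java.io.PrintWriter;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class CsvExportHandler extends DefaultHandler {

    private final PrintWriter out;
    private final StringBuilder row = new StringBuilder();
    private final StringBuilder value = new StringBuilder();
    private boolean inValue = false;

    public CsvExportHandler(PrintWriter out) {
        this.out = out;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) {
        if ("sv:node".equals(qName)) {
            flushRow(); // the previous node has no more properties
            row.append(atts.getValue("sv:name"));
        } else if ("sv:value".equals(qName)) {
            inValue = true;
            value.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inValue) {
            value.append(ch, start, length); // text may arrive in several chunks
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("sv:value".equals(qName)) {
            inValue = false;
            row.append(',').append(value); // naive: no quoting of embedded commas
        }
    }

    @Override
    public void endDocument() {
        flushRow(); // emit the last pending row
        out.flush();
    }

    private void flushRow() {
        if (row.length() > 0) {
            out.println(row);
            row.setLength(0);
        }
    }
}

Used as, e.g.:

    session.exportSystemView("/devices", new CsvExportHandler(writer), true, false);

which streams the whole subtree through the handler in a single call instead of
export-to-XML-then-parse.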

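And a more modest, single-JVM take on the second suggestion, assuming each
device subtree is addressable as /devices/device[i] (the path, credentials,
pool size and output file naming are placeholders to adjust for your setup).
JCR sessions are not thread-safe, so each worker logs in on its own:

import java.io.PrintWriter;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;

public class ParallelCsvDump {

    public static void dumpAllDevices(final Repository repository)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8); // tune to hardware
        for (int i = 1; i <= 5000; i++) {
            final int idx = i;
            pool.submit(new Runnable() {
                public void run() {
                    Session session = null;
                    try {
                        // one session per task: sessions must not be shared
                        session = repository.login(new SimpleCredentials(
                                "admin", "admin".toCharArray()));
                        PrintWriter out = new PrintWriter("device-" + idx + ".csv");
                        try {
                            session.exportSystemView(
                                    "/devices/device[" + idx + "]",
                                    new CsvExportHandler(out), true, false);
                        } finally {
                            out.close();
                        }
                    } catch (Exception e) {
                        e.printStackTrace(); // sketch-level error handling
                    } finally {
                        if (session != null) {
                            session.logout();
                        }
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        // concatenate the per-device files into one CSV afterwards
    }
}

Whether this pays off over RMI depends on where it runs: doing the export in
the same JVM as the repository sidesteps the RMI layer entirely, which your
post identifies as the cost you were trying to minimize.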