Stefan Kurla wrote:
As far as the file nodetype is concerned, this is a custom nodetype which has 4 references per imported file. Currently all the references point to the same UUID because we are testing; this could change in the future.
This may be the time-consuming factor. Whenever a reference is added that points to a node N, the complete set of references pointing to N is rewritten to the persistence manager. As the number of references to N increases, this will slow down your import. Is there a reason why all files point to the same node?
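To illustrate why the import gets slower as it proceeds, here is a rough cost model in plain Java. It is not Jackrabbit code: the class and method names are made up for illustration, and it assumes the simplest case, where every added reference triggers one rewrite of the target node's full reference set. Saving in batches would change the constant, but not the overall growth.

```java
public class ReferenceRewriteCost {

    // Hypothetical model: each time one more reference to node N is added,
    // the complete set of references to N (including the new one) is written again.
    static long entriesWritten(int referencesToSameNode) {
        long written = 0;
        long setSize = 0;
        for (int i = 0; i < referencesToSameNode; i++) {
            setSize++;          // the new reference joins the set
            written += setSize; // the whole set is rewritten
        }
        return written;
    }

    public static void main(String[] args) {
        int refsPerFile = 4; // from the description: 4 references per imported file
        for (int files : new int[] {1000, 12000}) {
            int refs = files * refsPerFile;
            // All references share one target: cost grows quadratically.
            long shared = entriesWritten(refs);
            // Every reference has its own target: each set has size 1.
            long distinct = refs;
            System.out.println(files + " files: shared target = " + shared
                    + " entries written, distinct targets = " + distinct);
        }
    }
}
```

Under this model, n references to the same node cost n(n+1)/2 written entries in total, whereas n references to n different nodes cost only n. That would be consistent with a per-file import time that keeps climbing as more files are added.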
Any tips or ideas? I will update the results of the test. Right now I have imported 1K out of 12K files and the import time has gone up to 4 seconds per file. Is this normal? Remember that since I am importing the jackrabbit SVN, all files are put under one nt:folder, which is "jackrabbit". This is a pretty normal case of about 12K files and only 78MB. We are planning for a 1TB repository.
I did a quick test with an adapted version of
http://svn.apache.org/repos/asf/jackrabbit/trunk/jackrabbit-core/src/test/java/org/apache/jackrabbit/core/query/TextExtractorTest.java
that saves changes whenever 100 files have been imported. I used the svn export of jackrabbit/trunk (~3000 files in ~900 folders).

configuration:
- jackrabbit in-process
- o.a.j.c.persistence.db.DerbyPersistenceManager (externalBlobs = false)
- text extractors: pdf, xml and plain text

test result: Imported 2978 files in 50484 ms.

regards
marcel
