Stefan,
Thanks for taking a look at it.
I'm aware that every node is versionable and we've noticed some of the
issues with that. We had the perceived requirement of 'being able to
easily revert any set of changes', which led us down that path. We will
have to re-think our use of mix:versionable and look at other ways of
accomplishing our application goals, but haven't yet had a chance to do
so (along with other modelling changes).
I'm actually surprised that our data structure is so deep; I hadn't
intended it to be this way and suspect we have (or had) an application
bug that is causing this.
The _delete_me nodes are there because we were unable to delete corrupt
nodes.
What I think happened was that some edits were made to the repository
when it was brought up pointing at a different version store (either a
different repository.xml, or someone copied/deleted a workspace data dir
without the associated version store, or something else happened to
corrupt the nodes). We kept getting InvalidItemStateExceptions, and the
only way we could come up with to get rid of those nodes was to rename
them and then strip them out on an import.
What I ended up doing was writing a program that connected to the source
repository and my new destination repository.
It then walked the node tree, performing a non-recursive exportxml and
importxml one node at a time (stripping out the _delete_me nodes and
versionHistory properties), and performed a save after each node import.
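The core of it was roughly the following (a from-memory sketch rather
than the exact code we ran: the class and method names are made up, it
assumes a system-view export, and the rewriting of the exported XML to
strip the versionHistory properties is only noted in a comment):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

import javax.jcr.ImportUUIDBehavior;
import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.Session;

public class NodeByNodeCopy {

    private final Session src;   // session on the source (corrupt) repository
    private final Session dst;   // session on the new destination repository

    public NodeByNodeCopy(Session src, Session dst) {
        this.src = src;
        this.dst = dst;
    }

    /** Copies the node at srcPath (and its subtree) under dstParentPath. */
    public void copyTree(String srcPath, String dstParentPath) throws Exception {
        Node node = (Node) src.getItem(srcPath);

        // skip the renamed junk nodes instead of copying them over
        if (node.getName().startsWith("_delete_me")) {
            return;
        }

        // export just this one node (noRecurse = true) ...
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        src.exportSystemView(srcPath, out, false, true);

        // ... (in the real program the exported XML was also rewritten here
        // to drop the versionHistory properties) and import it under the
        // corresponding parent in the destination workspace
        dst.importXML(dstParentPath,
                new ByteArrayInputStream(out.toByteArray()),
                ImportUUIDBehavior.IMPORT_UUID_CREATE_NEW);
        dst.save();   // save after each node so the transient space stays small

        // recurse into the children, importing them under the node just created
        String dstPath = "/".equals(dstParentPath)
                ? "/" + node.getName()
                : dstParentPath + "/" + node.getName();
        for (NodeIterator it = node.getNodes(); it.hasNext();) {
            copyTree(it.nextNode().getPath(), dstPath);
        }
    }
}

Saving after every single node keeps the transient item state small,
which is what kept the memory usage down during the copy.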
Thanks for your help.
hi steve
On 7/30/07, Steven Singer <[EMAIL PROTECTED]> wrote:
How are people using importxml to restore or import anything but small
amounts of data into the repository? I have a 22meg xml file that I'm
unable to import because I keep running out of memory.
i analyzed the xml file that you sent me offline (thanks!).
i noticed the following:
1) system view xml export
2) file size: 22mb without whitespace,
=> 650mb with simple 2-space indentation (!)
3) 23k nodes and 202k properties
4) virtually every node is versionable
5) *very* deep structure: max depth is 2340... (!)
6) lots of junk data (e.g. thousands of _delete_me1234567890 nodes,
btw hundreds/thousands of levels deep and all versionable)
i'd say that the content model has lots of room for improvement ;)
it's mainly 5) that accounts for the excessive memory consumption during
import. while this could certainly be improved in jackrabbit, i can't
think of a really good use case for creating >2k-level-deep hierarchies.
i'd also suggest reviewing the use of mix:versionable. versionability
doesn't come for free since it implies a certain overhead: making 1 node
mix:versionable creates approx. 7 nodes and 13 properties in the version
store (version history, root version etc.). mix:versionable should
therefore only be used where needed.
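for illustration, this is all it takes to incur that overhead (just a
sketch; the node name and type here are arbitrary):

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

class VersionableOverhead {

    // assumes 'parent' is some existing node in a logged-in session
    static void makeVersionable(Session session, Node parent) throws RepositoryException {
        Node n = parent.addNode("document", "nt:unstructured");
        n.addMixin("mix:versionable");
        session.save();   // this alone creates the version history, root version
                          // etc. (the ~7 nodes / 13 properties) in the version store
        n.checkin();      // and every checkin adds another version on top of that
    }
}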
btw: by using a decorated content handler which performed a save every
200 nodes, i was able to import the data with a 512mb heap. it took about
30 minutes on a macbook pro (2ghz).
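the decoration was along these lines (a sketch of the idea only; the
names are made up, it assumes a system-view export file, and whether an
intermediate save succeeds mid-import depends on the content, e.g.
pending references):

import java.io.FileInputStream;

import javax.jcr.ImportUUIDBehavior;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.XMLReaderFactory;

/**
 * decorates the ContentHandler returned by Session.getImportContentHandler()
 * and calls session.save() after every N imported sv:node elements, so the
 * transient space never has to hold the whole import at once.
 */
class PeriodicSaveHandler extends XMLFilterImpl {

    private static final String SV_URI = "http://www.jcp.org/jcr/sv/1.0";

    private final Session session;
    private final int interval;
    private int nodeCount = 0;

    PeriodicSaveHandler(Session session, ContentHandler target, int interval) {
        this.session = session;
        this.interval = interval;
        setContentHandler(target);   // XMLFilterImpl forwards all SAX events to 'target'
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        super.endElement(uri, localName, qName);
        if (SV_URI.equals(uri) && "node".equals(localName)
                && ++nodeCount % interval == 0) {
            try {
                session.save();
            } catch (RepositoryException e) {
                throw new SAXException(e);
            }
        }
    }

    // usage (sketch): stream the system view export through the decorated handler
    public static void importFile(Session session, String parentPath, String file)
            throws Exception {
        ContentHandler target = session.getImportContentHandler(
                parentPath, ImportUUIDBehavior.IMPORT_UUID_CREATE_NEW);
        XMLReader reader = XMLReaderFactory.createXMLReader();
        reader.setContentHandler(new PeriodicSaveHandler(session, target, 200));
        reader.parse(new InputSource(new FileInputStream(file)));
        session.save();   // save whatever is left after the last interval
    }
}

the point is simply that the pending changes get flushed every 200 nodes
instead of accumulating until a single save at the very end.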
cheers
stefan
The importxml in JCR commands works fine, but when I go to save the data
the JVM memory usage goes up to 1GB and it eventually runs out of memory.
This was sort of discussed in
http://mail-archives.apache.org/mod_mbox/jackrabbit-users/200610.mbox/browser
but I didn't see any solutions proposed.
Does the backup tool suffer from the same problem (being unable to
restore content above a certain size)? How have other people handled
migrating data between different persistence managers, or changing a
node-type definition that seems to require a re-import?
Steven Singer
RAD International Ltd.