[MarkLogic Dev General] Processing Large Documents?

2012-02-19 Thread Todd Gochenour
I have a 154Gig file representing a data dump from MySQL that I want to load into MarkLogic and analyze. When I use the flow editor to collect/load this file into an empty database, it takes 33 seconds. When I add two delete element transforms to the flow the load fails with a timeout error

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-19 Thread Damon Feldman
Todd, RecordLoader and CoRB are useful tools for bulk loading and processing, respectively, and are on the MarkLogic developer site. Typically, XML documents in MarkLogic correspond to rows rather than tables, so it may be ideal to use RecordLoader's RECORD_NAME configuration property to
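As a rough illustration of the row-per-document idea above, here is a minimal Python sketch that streams through a large XML dump and emits one small document per record element. It is not RecordLoader and does not use any MarkLogic API. The element name `row`, the sample dump layout, and the helper `iter_records` are assumptions made for the example; the thread does not show the actual structure of the file.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, record_name="row"):
    """Yield each <record_name> element as a standalone XML string.

    The input is parsed incrementally, and each record is detached from
    its parent once emitted, so the whole dump is never held in memory.
    """
    stack = []  # open ancestors of the element currently being parsed
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem)
            continue
        stack.pop()  # elem has just closed
        if elem.tag == record_name:
            elem.tail = None  # drop whitespace that follows the record
            yield ET.tostring(elem, encoding="unicode")
            if stack:
                stack[-1].remove(elem)  # free the processed record


# Hypothetical sample input; a real dump may be laid out differently.
SAMPLE = (
    b'<dump><table_data name="t">'
    b'<row><field name="id">1</field></row>'
    b'<row><field name="id">2</field></row>'
    b'</table_data></dump>'
)

records = list(iter_records(io.BytesIO(SAMPLE)))
for record in records:
    print(record)
```

Each yielded string could then be loaded as its own document, which keeps individual fragments small.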

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-19 Thread Todd Gochenour
This advice repeats a recommendation I saw earlier tonight during some of my research, namely that with MarkLogic it's better to break up documents into smaller fragments. I guess there's a performance gain in bursting a document into small fragments, something to do with concurrency and locking

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-19 Thread Geert Josten
Hi Todd, It is mostly for two reasons: memory footprint and indexing. If you don’t have fragmentation enabled in the database configuration, then the entire document is one fragment of 150 GB. Any processing on a fragment means that the entire fragment is loaded into memory. Luckily

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-19 Thread Geert Josten
Hi Todd, I know a few tricks that could help get this done with Information Studio. One of them is putting your XQuery in a custom XQuery transform. But you need to copy things like the collection from the input file, and some other properties as well, to make sure resulting files are treated

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-19 Thread Todd Gochenour
This is my second day spent working with MarkLogic, having just come back this week from XMLPrague. So everything in my system to date is default configuration, straight out of the box. I have seen the Fragment Roots and Fragment Parents nodes listed under my database in the configure database