Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-26 Thread Geert Josten
*Sent:* Sunday, 26 February 2012 1:05 *To:* MarkLogic Developer Discussion *Subject:* Re: [MarkLogic Dev General] Processing Large Documents? Ah, of course. Compression. I looked to see that the legacy system has SQL statements for insert and select of the table doc_fil and it is calling

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-26 Thread Todd Gochenour
I believe I've found an answer to my question: where did my namespaced attribute xsi:type='xs:hexBinary' go? I declared the two namespaces in my XQuery code: declare namespace xs = "http://www.w3.org/2001/XMLSchema"; declare namespace xsi = "http://www.w3.org/2001/XMLSchema-instance"; And then I
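As an illustration of those declarations, a minimal Query Console sketch; the document URI and element path are assumptions, not the poster's actual data:

    xquery version "1.0-ml";
    declare namespace xs = "http://www.w3.org/2001/XMLSchema";
    declare namespace xsi = "http://www.w3.org/2001/XMLSchema-instance";
    (: hypothetical URI and path; adjust to the loaded document :)
    fn:doc("/doc_fil/32.xml")/doc_fil/file_blob/@xsi:type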

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-26 Thread Todd Gochenour
Here is what I've accomplished this weekend. My next step in this process is to scrub the foreign keys from the database. The foreign keys have the pattern {table}_id, so for instance the field /xyz/usr_id would reference the primary document /usr/id. Leveraging this pattern, I want to
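A rough sketch of how that {table}_id pattern could be followed in XQuery; the document URI, element names, and the lookup-by-root-element idea are assumptions about the intended approach, not code from the thread:

    xquery version "1.0-ml";
    (: Hypothetical example: for each *_id field in a row document, find the
       document whose root element matches the table name and whose id child
       matches the field value. The name match is naive: a field like
       upload_usr_id would need smarter parsing. :)
    for $fk in fn:doc("/xyz/1.xml")/xyz/*[fn:ends-with(fn:local-name(.), "_id")]
    let $table := fn:substring-before(fn:local-name($fk), "_id")
    let $target := fn:collection()/*[fn:local-name(.) = $table][id = fn:string($fk)]
    return fn:base-uri($target[1])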

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-25 Thread Todd Gochenour
My question is, what happened to the /file_blob/@xsi:type attribute? Was it interpreted by MarkLogic, just discarded, or is it there but my query fails to properly ask for it? Could it be that the element is already known by MarkLogic as hexBinary and so a cast to hexBinary is going in the

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-25 Thread Geert Josten
*From:* general-boun...@developer.marklogic.com [mailto: general-boun...@developer.marklogic.com] *On behalf of* Todd Gochenour *Sent:* Saturday, 25 February 2012 18:54 *To:* MarkLogic Developer Discussion *Subject:* Re: [MarkLogic Dev General] Processing Large Documents? My question is, what

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-25 Thread Todd Gochenour
<row> <field name="id">32</field> <field name="doc_rep_id">1</field> <field name="doc_fld_id">1</field> <field name="fil_version">2</field> <field name="upload_usr_id">1</field> <field name="upload_date">2006-11-01 15:26:34</field> <field name="mime_type">application/excel</field> <field name="abstract">xls</field> <field

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-25 Thread Geert Josten
...@developer.marklogic.com [mailto: general-boun...@developer.marklogic.com] *On behalf of* Todd Gochenour *Sent:* Saturday, 25 February 2012 19:06 *To:* MarkLogic Developer Discussion *Subject:* Re: [MarkLogic Dev General] Processing Large Documents? <row> <field name="id">32</field> <field name="doc_rep_id">1

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-25 Thread Todd Gochenour
Ah, of course. Compression. I looked to see that the legacy system has SQL statements for insert and select of the table doc_fil and it is calling compress() and uncompress(). I found this on Google: InnoDB implements compression with the help of the well-known zlib

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-24 Thread Todd Gochenour
It's time for me to pick this project up now that the work week has passed. I'm attempting to implement Michael Blakeley's recommendation to move the SQL blob content into its own document as part of this initial load/chunk phase. Here's how I see the strategy. As I iterate through each record
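As a sketch of that strategy (not the actual flow code), one way to split the blob out during the per-row pass; the URIs, element names, and field layout are assumptions based on the earlier examples in this thread:

    xquery version "1.0-ml";
    (: Sketch: write each row's blob to its own document and keep a
       reference to it in the row document. All names here are assumed. :)
    for $row at $i in fn:doc("/dump/doc_fil.xml")/table_data/row
    let $blob-uri := fn:concat("/blobs/doc_fil/", $i, ".hex")
    return (
      xdmp:document-insert($blob-uri, text { $row/field[@name = "file_blob"] }),
      xdmp:document-insert(fn:concat("/doc_fil/", $i, ".xml"),
        element doc_fil {
          $row/field[fn:not(@name = "file_blob")],
          element file_blob_ref { $blob-uri }
        })
    )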

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-24 Thread Geert Josten
*To:* MarkLogic Developer Discussion *Subject:* Re: [MarkLogic Dev General] Processing Large Documents? It's time for me to pick this project up now that the work week has passed. I'm attempting to implement Michael Blakeley's recommendation to move the SQL blob content into its own

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-21 Thread Todd Gochenour
Here's an update. Turns out there's another database I need to port to XQuery. The first one was admin. The second one has attachments stored as blobs in the database, so I turned on the hex-blob option in mysqldump to get a 537MB database extract. The blobs were marked up with the attribute

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-21 Thread David Lee
To: MarkLogic Developer Discussion Subject: Re: [MarkLogic Dev General] Processing Large Documents? Here's an update. Turns out there's another database I need to port to XQuery. The first one was admin. The second one has attachments stored as blobs in the database, so I turned on the hex-blob

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-21 Thread Todd Gochenour
Great. Good to hear that the database elements and attributes are indexed by default. eXistDB by default does the same. I'm looking at the Information Studio/Application Services/Database Settings page and wondering what these options provide in addition to the default indexes.

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-21 Thread David Lee
:52 AM To: MarkLogic Developer Discussion Subject: Re: [MarkLogic Dev General] Processing Large Documents? Great. Good to hear that the database elements and attributes are indexed by default. eXistDB by default does the same. I'm looking at the Information Studio/Application Services/Database

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-21 Thread mcundiff1
Developer Discussion general@developer.marklogic.com Sent: Tuesday, February 21, 2012 11:02:42 AM Subject: Re: [MarkLogic Dev General] Processing Large Documents? Here I am going further from my area of expertise, so buyer beware. The basic indexing options provide for a limited set

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-21 Thread Colleen Whitney
The options on that page offer a limited subset of common index settings that customers often enable to support application features. Support for wildcard queries is a good example; there are multiple indexes over and above default indexes that can be added for best results, and the checkbox on

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-21 Thread Michael Blakeley
As David said, you probably won't need to make any changes to the index config for some time. Mostly folks make changes to tweak full-text search capabilities. But I thought I'd point out that you can check the evaluation of an XPath. Note that I added a missing '@' to your original expression.
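A minimal example of what that check looks like in Query Console; the element and value below are just the ones used elsewhere in this thread, not the exact expression Michael quoted:

    xquery version "1.0-ml";
    (: xdmp:plan reports how the indexes will resolve the expression :)
    xdmp:plan(//usr[id = '123'])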

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-21 Thread Todd Gochenour
I know I said 'attributes' in my original question but the example was correct; ids are child elements, not attributes. I assume something like xdmp:plan(//usr[id='123']) or the generic xdmp:plan(//*[id='123']) is already indexed?

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-20 Thread Geert Josten
-boun...@developer.marklogic.com [mailto: general-boun...@developer.marklogic.com] *On behalf of* Todd Gochenour *Sent:* Monday, 20 February 2012 8:47 *To:* MarkLogic Developer Discussion *Subject:* Re: [MarkLogic Dev General] Processing Large Documents? This is my second day spent working

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-20 Thread Damon Feldman
To: MarkLogic Developer Discussion Subject: Re: [MarkLogic Dev General] Processing Large Documents? This advice repeats a recommendation I saw earlier tonight during some of my research, namely that with MarkLogic it's better to break up documents into smaller fragments. I guess there's a performance

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-20 Thread Todd Gochenour
Day three. President's Day. I will first chunk the data for each row as this will improve concurrency. I gather I will need to generate random document names for each chunk and put these documents in a collection using the name of the database as the folder name. I see the terms Forest and
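A sketch of that chunking pass, assuming the mysqldump --xml layout (mysqldump/database/table_data/row) and a made-up URI scheme and collection name:

    xquery version "1.0-ml";
    (: Sketch only: insert one document per row, named by table and row
       position, into a collection named after the source database. :)
    let $db := "legacy"   (: hypothetical source database name :)
    for $table in fn:doc("/dump.xml")/mysqldump/database/table_data
    for $row at $i in $table/row
    return xdmp:document-insert(
      fn:concat("/", $db, "/", $table/@name, "/", $i, ".xml"),
      element { fn:string($table/@name) } { $row/field },
      xdmp:default-permissions(),
      $db)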

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-20 Thread Todd Gochenour
Oops, I just realized that when I said 154 Gigabytes I should have said 154 Megabytes. My first transformation reduces this to 6 Megabytes. Big difference, yes? Todd

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-20 Thread Michael Blakeley
Ignore forests and stands for now. Those are physical storage artifacts, completely orthogonal to collections. One difference you may note compared to eXistDB is that a document can be in many collections at the same time. As I understand it, eXistDB collections act sort of like filesystem directories.

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-20 Thread Todd Gochenour
The XQuery I have for performing the chunking is timing out after 9 minutes (running in the query console). There are 156000 'rows' total in this extract. I'm now reading the Developer's guide for Understanding Transactions to figure out how I might optimize this query. My query reads:

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-20 Thread Todd Gochenour
I could do the denormalization work in SQL, but this would be a tedious manual process. My hope with XQuery is that I can analyze the structure and do this process automatically. Then I'd have a generic algorithm which can be applied to other databases.

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-20 Thread Tim Meagher
Developer Discussion Subject: Re: [MarkLogic Dev General] Processing Large Documents? The XQuery I have for performing the chunking is timing out after 9 minutes (running in the query console). There are 156000 'rows' total in this extract. I'm now reading the Developer's guide

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-20 Thread Michael Blakeley
You can raise the time limit: http://docs.marklogic.com/5.0doc/docapp.xqy#display.xqy?fname=http://pubs/5.0doc/xml/admin/http.xml&query=request+timeout Default Time Limit specifies the default value for any request's time limit, when otherwise unspecified. A request can change its time limit
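For completeness, the in-query form of raising a request's own limit (the value is just an example; the admin setting caps how high it can go):

    (: raise this request's time limit, up to the configured maximum :)
    xdmp:set-request-time-limit(3600)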

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-20 Thread Michael Blakeley
Hmm... I didn't see any joins or denormalization in the XQuery you posted most recently. So maybe we are talking at cross-purposes? Is your denormalization simply changing the row elements to elements named after table_data/@name? If so, I can see why that would be tedious: relational systems
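If that is the renaming in question, a computed element constructor handles it; a rough sketch against the assumed mysqldump layout:

    xquery version "1.0-ml";
    (: Sketch: rename each generic row element after its table :)
    for $table in fn:doc("/dump.xml")/mysqldump/database/table_data
    for $row in $table/row
    return element { fn:string($table/@name) } { $row/node() }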

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-20 Thread Todd Gochenour
The denormalization phase will happen in a subsequent pass across the data. For now I'm just trying to get the chunking and renaming accomplished in this first pass. I put the original MySQL data dump into the database so that I could perform queries against it. It only took 31 seconds to

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-20 Thread Todd Gochenour
Michael's last example with spawning almost worked. The generated document name for each record re-used the same table index, so I was left with only 45 documents in the end. I changed the XQuery to read: (: query console :) for $table in
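A sketch of the corrected pattern, with the row position folded into the URI so names no longer collide; /chunk-row.xqy is a hypothetical task module, not code from the thread:

    (: main query, run in Query Console :)
    xquery version "1.0-ml";
    for $table in fn:doc("/dump.xml")/mysqldump/database/table_data
    for $row at $i in $table/row
    return xdmp:spawn("/chunk-row.xqy",
      (xs:QName("uri"),  fn:concat("/", $table/@name, "/", $i, ".xml"),
       xs:QName("name"), fn:string($table/@name),
       xs:QName("row"),  $row))

    (: /chunk-row.xqy -- hypothetical task module :)
    xquery version "1.0-ml";
    declare variable $uri  as xs:string external;
    declare variable $name as xs:string external;
    declare variable $row  as element(row) external;
    xdmp:document-insert($uri, element { $name } { $row/field })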

[MarkLogic Dev General] Processing Large Documents?

2012-02-19 Thread Todd Gochenour
I have a 154Gig file representing a data dump from MySQL that I want to load into MarkLogic and analyze. When I use the flow editor to collect/load this file into an empty database, it takes 33 seconds. When I add two delete element transforms to the flow the load fails with a timeout error

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-19 Thread Damon Feldman
, 2012 7:59 PM To: MarkLogic Developer Discussion Subject: [MarkLogic Dev General] Processing Large Documents? I have a 154Gig file representing a data dump from MySQL that I want to load into MarkLogic and analyze. When I use the flow editor to collect/load this file into an empty database

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-19 Thread Todd Gochenour
This advice repeats a recommendation I saw earlier tonight during some of my research, namely that with MarkLogic it's better to break up documents into smaller fragments. I guess there's a performance gain in bursting a document into small fragments, something to do with concurrency and locking

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-19 Thread Geert Josten
Gochenour *Sent:* Monday, 20 February 2012 7:57 *To:* MarkLogic Developer Discussion *Subject:* Re: [MarkLogic Dev General] Processing Large Documents? This advice repeats a recommendation I saw earlier tonight during some of my research, namely that with MarkLogic it's better to break up

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-19 Thread Geert Josten
:00 *To:* MarkLogic Developer Discussion *Subject:* [MarkLogic Dev General] Processing Large Documents? I have a 154Gig file representing a data dump from MySQL that I want to load into MarkLogic and analyze. When I use the flow editor to collect/load this file into an empty database

Re: [MarkLogic Dev General] Processing Large Documents?

2012-02-19 Thread Todd Gochenour
Gochenour *Sent:* Monday, 20 February 2012 2:00 *To:* MarkLogic Developer Discussion *Subject:* [MarkLogic Dev General] Processing Large Documents? I have a 154Gig file representing a data dump from MySQL that I want to load into MarkLogic and analyze. When I use the flow editor