Hi Martin,

Thank you for the data you provided. We were able to reproduce the issue. It is not actually a bug but a consequence of Sedna's internal representation. Let me briefly explain how XML is stored internally and why your particular data is so hard to load.

Internal storage can be thought of as an index based on a descriptive schema. For example, consider the following XML snippet:

  <persons>
    <person id="person1">
      <name>Ivan</name>
    </person>
    <person id="person2">
      <name>Martin</name>
    </person>
  </persons>

The descriptive schema of this document is the following:

  persons (S)
  |
  == person (S)
     |
     == @id (S)
     |
     == name (S)
        |
        == text() (S)

By definition, every path of the document has exactly one corresponding path in the descriptive schema, and every path of the descriptive schema is a path of the document. Thus each node in the XML document is associated with exactly one schema node, while a schema node may have many document nodes associated with it. In our example, the person (S) schema node has two associated XML nodes.

In Sedna, all nodes of a document (or collection) are stored in chains of blocks (each block is 64 KB), with one chain per descriptive schema node. In our example we therefore have five chains of blocks: one chain for the "persons" nodes, one for the "person" nodes, one for the "id" attributes, and so on.

To retrieve the descriptive schema of a document, a collection or the whole database, you can run the doc("$schema_<document_name>"), doc("$schema_<collection_name>") or just doc("$schema") queries, respectively.
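
To make the descriptive schema idea more concrete, here is a rough Python sketch (not Sedna code) that collects the distinct paths of an XML document and counts how many nodes fall under each one. The function names and the lower-bound estimate are my own illustration; the estimate assumes that every chain occupies at least one 64 KB block, and it ignores the document node that Sedna also tracks.

```python
import xml.etree.ElementTree as ET
from collections import Counter

BLOCK_SIZE_KB = 64  # block size mentioned in the message

SAMPLE = """
<persons>
  <person id="person1"><name>Ivan</name></person>
  <person id="person2"><name>Martin</name></person>
</persons>
"""


def descriptive_schema(xml_text):
    """Map each distinct root-to-node path to the number of nodes on it."""
    counts = Counter()

    def visit(elem, parent_path):
        path = parent_path + "/" + elem.tag
        counts[path] += 1
        for attr in elem.attrib:
            counts[path + "/@" + attr] += 1
        # Whitespace-only text is ignored in this sketch.
        if elem.text and elem.text.strip():
            counts[path + "/text()"] += 1
        for child in elem:
            visit(child, path)
            # Text following a child element belongs to the current element.
            if child.tail and child.tail.strip():
                counts[path + "/text()"] += 1

    visit(ET.fromstring(xml_text), "")
    return counts


def min_storage_kb(schema):
    """Lower bound on storage, ASSUMING at least one block per chain."""
    return len(schema) * BLOCK_SIZE_KB


schema = descriptive_schema(SAMPLE)
for path, total_nodes in sorted(schema.items()):
    print(path, total_nodes)
print("schema nodes:", len(schema), "min storage (KB):", min_storage_kb(schema))
```

For the persons snippet this finds five distinct paths, matching the five chains described above. Running the same function on a document where almost every node has a unique path shows how quickly the number of chains, and with it the number of nearly empty blocks, grows.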

The doc("$schema...") queries also return how many blocks and nodes there are in each chain:

  <schema>
    <document name="auction">
      <document name="" total_nodes="1" total_blocks="1">
        <element name="persons" total_nodes="1" total_blocks="1">
          <element name="person" total_nodes="2" total_blocks="1">
            <attribute name="id" total_nodes="2" total_blocks="1"/>
            <element name="name" total_nodes="2" total_blocks="1">
              <text name="" total_nodes="2" total_blocks="1"/>
            </element>
          </element>
        </element>
      </document>
    </document>
  </schema>

For further details on Sedna's internal representation, refer to http://panda.ispras.ru/~grinev/mypapers/sedna.pdf. For an illustration, see http://www.slideshare.net/shcheklein/sedna-xml-database-system-internal-representation.

Now let's consider your data. It has a very complex descriptive schema (almost every node within an XML document has a unique path). This means that we end up with an enormous number of almost empty blocks (each storing one or two nodes). The main reason for this complexity is the hierarchy of entity-describing tags, which are nested, may have many different names, and in some places enclose entire articles:

  <creator>
    <person ...>
      <writer ...>
        <novelist ...>
          {content here}
        </novelist>
      </writer>
    </person>
  </creator>

If you want to load this data into Sedna, you have to change the representation a bit:

1. Simplify the entity description blocks. For example, the following representation will be much easier to load:

  <entity>
    <creator wordnetid="..." confidence="..."/>
    <person wordnetid="..." confidence="..."/>
    <writer wordnetid="..." confidence="..."/>
    <novelist wordnetid="..." confidence="..."/>
    {content here}
  </entity>

2. If you have several million files, it's better to concatenate them into one document (bulk load is better optimized when all data is known in advance):

  <articles>
    <article id="00000.xml"> {content here} </article>
    <article id="00001.xml"> {content here} </article>
    ...
  </articles>

3. Use the -bufs-num parameter to increase the number of buffers allocated by Sedna's storage manager (se_sm). It will significantly speed up bulk loading:

  ./se_sm -bufs-num 32000

Moreover, this applies whether or not you end up using Sedna. I believe such an XML representation is very hard for almost any XML tool or XML processing language.

Ivan Shcheklein,
Sedna Team

On Sun, May 9, 2010 at 12:00 AM, Martin Bukatovic <martin.bukato...@gmail.com> wrote:

> On Sat, May 8, 2010 at 12:21 PM, Ivan Shcheklein <shchekl...@gmail.com> wrote:
> > I can give you access to the private ftp on modis server and guarantee that
> > we'll not use your data except for the testing purposes.
>
> Seems OK. I will provide you with roughly 300 MB of xml data.
>
> > At least you can print schema of your collection: doc("$schema") . The more
> > different nodes it has the bigger burst factor is. Try to load 300MB and
> > send me result of this command.
>
> Even 300MB is huge enough to make it quite time consuming, therefore I loaded
> just 5 documents (with total size about 360 kB) successfully and the
> database directory has 235 MB (using version 3.3.55). The schema of this
> collection can be reached at
> http://www.fi.muni.cz/~xbukatov/nxd/tmp/sedna-inex-schema.xml
>
> Martin B.
------------------------------------------------------------------------------
_______________________________________________
Sedna-discussion mailing list
Sedna-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sedna-discussion