Re: [Sedna-discussion] Unexpected disk space usage

Ivan Shcheklein Tue, 11 May 2010 10:43:03 -0700

Hi Martin,

Thank you for the data provided. We were able to reproduce the issue. This
is not actually a bug but a peculiarity of Sedna's internal representation.
Let me briefly explain how XML is stored internally and why it is so hard to
load the data in your particular case.

Internal storage can be viewed as an index based on a descriptive schema.
For example, consider the following XML snippet:

<persons>
<person id="person1">
<name>Ivan</name>
</person>
<person id="person2">
<name>Martin</name>
</person>
</persons>

The descriptive schema of this document is the following:

persons (S)
|
 == person (S)
    |
     == @id (S)
    |
     == name (S)
        |
         == text() (S)

By definition, every path in the document has exactly one corresponding path
in the descriptive schema, and every path in the descriptive schema is a path
of the document. Thus each node in the XML document is connected with exactly
one schema node, and each schema node may have many document nodes connected
with it. In our example the person (S) schema node has two connected XML nodes.
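
To make the definition concrete, here is a minimal Python sketch (not Sedna
code) that collects the distinct paths of a document and counts the nodes
attached to each one. It assumes whitespace-only text is not stored as a text
node; the function name descriptive_schema is just an illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def descriptive_schema(xml_text):
    """Count document nodes per distinct root-to-node path."""
    counts = Counter()

    def visit(elem, path):
        path = path + "/" + elem.tag
        counts[path] += 1
        for attr in elem.attrib:
            counts[path + "/@" + attr] += 1
        # Text before the first child and after each child are text nodes.
        # Assumption: whitespace-only text is ignored.
        texts = [elem.text] + [child.tail for child in elem]
        n_text = sum(1 for t in texts if t and t.strip())
        if n_text:
            counts[path + "/text()"] += n_text
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), "")
    return counts

sample = """<persons>
<person id="person1"><name>Ivan</name></person>
<person id="person2"><name>Martin</name></person>
</persons>"""

schema = descriptive_schema(sample)
for path, n in schema.items():
    print(path, n)
```

For the snippet above this yields five paths, with two document nodes
attached to /persons/person.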

In Sedna all nodes of a document (or collection) are stored in chains of
blocks (each block is 64KB), one chain per descriptive schema node. In our
example we again have five chains of blocks: one chain for "persons" nodes,
one chain for "person" nodes, one for the "id" attribute, etc.

To retrieve the descriptive schema of a document, a collection or the whole
database, one may run the *doc("$schema_<document_name>")*,
*doc("$schema_<collection_name>")* or just *doc("$schema")* query,
respectively. These queries also return how many blocks and nodes there are
in each chain:

<schema>
  <document name="auction">
    <document name="" total_nodes="1" total_blocks="1">
      <element name="persons" total_nodes="1" total_blocks="1">
        <element name="person" total_nodes="2" total_blocks="1">
          <attribute name="id" total_nodes="2" total_blocks="1"/>
          <element name="name" total_nodes="2" total_blocks="1">
            <text name="" total_nodes="2" total_blocks="1"/>
          </element>
        </element>
      </element>
    </document>
  </document>
</schema>
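
If you save such a result to a file, a short script can total the chains. The
following is only a sketch: it assumes the output is well-formed XML and uses
the total_nodes and total_blocks attributes exactly as shown above; the
summarize function is hypothetical. Note that it also counts the chain of the
document node itself.

```python
import xml.etree.ElementTree as ET

# Sample $schema output for the two-person document, written as well-formed XML.
SCHEMA_XML = """<schema>
  <document name="auction">
    <document name="" total_nodes="1" total_blocks="1">
      <element name="persons" total_nodes="1" total_blocks="1">
        <element name="person" total_nodes="2" total_blocks="1">
          <attribute name="id" total_nodes="2" total_blocks="1"/>
          <element name="name" total_nodes="2" total_blocks="1">
            <text name="" total_nodes="2" total_blocks="1"/>
          </element>
        </element>
      </element>
    </document>
  </document>
</schema>"""

BLOCK_SIZE = 64 * 1024  # 64KB per block

def summarize(schema_xml):
    """Return (chains, nodes, blocks) summed over all schema nodes."""
    chains = nodes = blocks = 0
    for item in ET.fromstring(schema_xml).iter():
        if "total_blocks" not in item.attrib:
            continue  # wrapper elements without statistics
        chains += 1
        nodes += int(item.get("total_nodes"))
        blocks += int(item.get("total_blocks"))
    return chains, nodes, blocks

chains, nodes, blocks = summarize(SCHEMA_XML)
print(f"{chains} chains, {nodes} nodes, {blocks} blocks")
print(f"about {blocks * BLOCK_SIZE // 1024} KB in node blocks")
print(f"{nodes / blocks:.1f} nodes per block on average")
```

A very low nodes-per-block average is the sign that most blocks are almost
empty.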

For further details on Sedna's internal representation refer to
http://panda.ispras.ru/~grinev/mypapers/sedna.pdf . For an illustration see:
http://www.slideshare.net/shcheklein/sedna-xml-database-system-internal-representation

Now let's consider your data. It has a very complex descriptive schema
(almost every node in the XML documents has a unique path). This means that
we get an enormous number of almost empty blocks (each stores one or two
nodes). The main cause of this complexity is the hierarchy of
entity-describing tags, which are nested, may have many different names and
in some places enclose entire articles:

<creator>
  <person ...>
     <writer ...>
       <novelist ... >
        {content here}
       </novelist>
     </writer>
   </person>
</creator>
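
As a rough check of the arithmetic: since every schema node owns a chain of
at least one 64KB block, the number of distinct paths gives a lower bound on
storage. The sketch below is only an estimate. The database directory also
holds files other than node blocks, so the block count derived from the
235 MB you reported is an upper bound, and the 1000-path schema is a
hypothetical figure.

```python
BLOCK_SIZE_KB = 64  # block size as stated above

def min_storage_mb(schema_nodes):
    """Lower bound: every schema node owns a chain of at least one block."""
    return schema_nodes * BLOCK_SIZE_KB / 1024

def max_blocks(db_size_mb):
    """Upper bound on the number of 64KB blocks that fit in db_size_mb."""
    return int(db_size_mb * 1024 // BLOCK_SIZE_KB)

print(max_blocks(235))       # blocks that fit into the reported 235 MB
print(min_storage_mb(1000))  # hypothetical schema with 1000 distinct paths
```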

If you want to load it into Sedna you have to change the representation a bit:

*1. Simplify entity description blocks. For example, the following
representation will be much easier to load:*

<entity>
  <creator wordnetid="..." confidence="..."/>
  <person wordnetid="..." confidence="..."/>
  <writer wordnetid="..." confidence="..."/>
  <novelist wordnetid="..." confidence="..."/>
  {content here}
</entity>
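
Such a rewrite could be scripted along the following lines. This is a sketch
under assumptions, not a tested converter for your data: it assumes that
entity-describing tags are exactly those carrying a wordnetid attribute and
that each wrapper contains nothing but the next wrapper. The attribute values
in the demo are made up.

```python
import xml.etree.ElementTree as ET

def is_entity_tag(elem):
    # Assumption: entity-describing tags are the ones with a wordnetid attribute.
    return "wordnetid" in elem.attrib

def flatten_entity(outer):
    """Turn a chain of nested entity tags into one <entity> with empty markers."""
    entity = ET.Element("entity")
    current = outer
    while True:
        # Emit an empty marker element that keeps the tag name and attributes.
        ET.SubElement(entity, current.tag, dict(current.attrib))
        children = list(current)
        if len(children) != 1 or not is_entity_tag(children[0]):
            break
        only = children[0]
        has_text = bool(current.text and current.text.strip())
        has_tail = bool(only.tail and only.tail.strip())
        if has_text or has_tail:
            break  # the wrapper holds real content, stop peeling here
        current = only
    # Move the innermost content (text and child elements) under <entity>.
    entity[-1].tail = current.text
    entity.extend(list(current))
    entity.tail = outer.tail
    return entity

nested = ET.fromstring(
    '<creator wordnetid="1" confidence="0.9">'
    '<person wordnetid="2" confidence="0.9">'
    '<writer wordnetid="3" confidence="0.8">'
    'Some <b>content</b> here'
    '</writer></person></creator>'
)
flat = flatten_entity(nested)
print(ET.tostring(flat, encoding="unicode"))
```

After flattening, every marker sits directly under entity, so the number of
distinct paths no longer grows with each combination of nested tag names.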

*2. If you have several million files, it's better to concatenate them into
one document (bulk load is better optimized since all the data is known in
advance):*

<articles>
  <article id="00000.xml">
    {content here}
  </article>
  <article id="00000.xml">
    {content here}
  </article>
   ...
</articles>
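
A minimal sketch of such a concatenation in Python follows. It writes the
output as a stream so that only one article is held in memory at a time. The
concatenate function and the file names are illustrative, and keeping each
file's root element inside <article> is my assumption about what
{content here} should contain.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def concatenate(paths, out_path):
    """Stream many small XML files into one <articles> document."""
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("<articles>\n")
        for path in paths:
            root = ET.parse(path).getroot()
            # Wrap each source document, using its file name as the id.
            article = ET.Element("article", {"id": os.path.basename(path)})
            article.append(root)
            out.write(ET.tostring(article, encoding="unicode"))
            out.write("\n")
        out.write("</articles>\n")

# Small demonstration with two temporary files.
with tempfile.TemporaryDirectory() as tmp:
    inputs = []
    for i, body in enumerate(["<doc>first</doc>", "<doc>second</doc>"]):
        p = os.path.join(tmp, f"{i:05d}.xml")
        with open(p, "w", encoding="utf-8") as f:
            f.write(body)
        inputs.append(p)
    merged = os.path.join(tmp, "articles.xml")
    concatenate(inputs, merged)
    ids = [a.get("id") for a in ET.parse(merged).getroot()]
    print(ids)
```

Note that ElementTree may rename namespace prefixes when it serializes, so
check the output if your files use namespaces.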

*3. Use the -bufs-num parameter to increase the number of buffers allocated by
Sedna's storage manager (se_sm). It'll significantly speed up bulk loading.*

./se_sm -bufs-num 32000

Moreover, this does not depend on whether you are going to use Sedna or not.
I believe such an XML representation is very hard for almost any XML tool or
XML processing language.

Ivan Shcheklein,
Sedna Team

On Sun, May 9, 2010 at 12:00 AM, Martin Bukatovic <
martin.bukato...@gmail.com> wrote:

> On Sat, May 8, 2010 at 12:21 PM, Ivan Shcheklein <shchekl...@gmail.com>
> wrote:
> > I can give you access to the private ftp on modis server and guarantee that
> > we'll not use your data except for the testing purposes.
>
> Seems OK. I will provide you with roughly 300 MB of xml data.
>
> > At least you can print schema of your collection: doc("$schema") . The more
> > different nodes it has the bigger burst factor is. Try to load 300MB and
> > send me result of this command.
>
> Even 300MB is huge enough to make it quite time consuming, therefore I
> loaded just 5 documents (with total size about 360 kB) successfully and the
> database directory has 235 MB (using version 3.3.55). The schema of this
> collection can be reached at
> http://www.fi.muni.cz/~xbukatov/nxd/tmp/sedna-inex-schema.xml
>
> Martin B.
>

------------------------------------------------------------------------------

_______________________________________________
Sedna-discussion mailing list
Sedna-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sedna-discussion
