On Thu, Apr 16, 2009 at 3:32 PM, OKO <ol...@sics.se> wrote:
[snip]
>
> The "1.1" vs "1.0" expansion factor is much larger for the case of
> collections with few and small files (factor 350) , than for many and small
> files (factor 25).
>
> But nevertheless, I wonder for what kind of databases 1.1 would beat 1.0
> from a *size* point of view. Maybe comparing processing speed of "1.1" vs
> "1.0" would give a different picture, but at the moment I have no numbers to
> offer in that respect.

Here is the deal: the database format hasn't changed between 1.0 and
1.1, except that some bugs in index files were fixed and the collection
files are now created properly. In your case, unfortunately, that means
they take a lot more space than before, because the default size is
intended for bigger collections with many documents.

The database you tried to rebuild is a corner case, however: it has
several collections that hold only a single document each, so the
default size settings are far too large. But this is only the default
setting, and it can be changed to a more suitable number. That way, the
1.1 database should take about the same amount of space as the existing
database.

> By the way, I do not have any very large documents in the database, so I
> have no idea how "1.1" compares to "1.0" for such beasts.
>
> And the 64,000 $ question is, of course, what means are available to
> decrease disk space footprint of the "1.1"  database?

The following steps can help to reduce the database size:
1. Meta collections can be turned off. The setting is in the
config/system.xml file (in the Xindice 1.1 directory). In the following
line:

    <root-collection dbroot="./db/" name="db" use-metadata="on">

change use-metadata="on" to use-metadata="off". This will make all Meta
collections go away.

2. The initial collection size can be adjusted (here I assume that all
the collections use the default BTreeFiler to store data; HashFiler is
a somewhat different beast). When a collection is created, it can be
given a configuration that specifies the pagecount setting, which
directly affects the initial collection size:

<collection compressed="true" inline-metadata="true" name="test">
  <filer class="org.apache.xindice.core.filer.BTreeFiler" pagecount="16" />
</collection>

When creating a collection with the command-line tool, this setting can
be specified with the --pagecount parameter:

bin\xindice ac -c /db -n test --pagecount 16

The "perfect" value for pagecount depends on the size and number of
documents in a collection. For collections whose documents are added
and modified often, it should be picked based on how much data is
expected to be stored in the collection. In any case it only sets the
initial size, so the collection file will grow when it becomes too
small to hold new data.

In your case, when a collection has only one small document and this
situation is not expected to change, pagecount may very well be set to
1.
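
As a rough sketch, the two settings for your case would look like
this. The collection name "test" is only the example name used above,
and all other attributes are copied unchanged from the lines quoted
earlier in this message:

```xml
<!-- config/system.xml: opening tag only, with Meta collections turned off -->
<root-collection dbroot="./db/" name="db" use-metadata="off">

<!-- collection configuration for a collection holding one small document -->
<collection compressed="true" inline-metadata="true" name="test">
  <filer class="org.apache.xindice.core.filer.BTreeFiler" pagecount="1" />
</collection>
```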

Now back to data migration... I made some changes to xindice_rebuild
to add an optional pagecount parameter that overrides the original
collection setting.

If you have the source version of the Xindice download, please save
the attached file to <xindice
directory>\java\src\org\apache\xindice\tools\DatabaseRebuild.java and
run build.bat or build.sh, depending on your OS. After that, you can
try to rebuild the database again using the optional parameter:

bin\xindice_rebuild.bat rebuild db -p 1

If you have the binary version of the Xindice download instead, please
let me know and I'll see what can be done for that. Alternatively,
collections can be rebuilt manually by exporting the documents,
creating new collections using the command-line tool with the option
"--pagecount 1", and importing the documents into the new collections.

Let me know if something doesn't work for you.

Natalia

Attachment: DatabaseRebuild.java