Natalia,
Thanks for offering some hints.

To put some data on the table ...

Below are some size figures for two parts of the database:
   (1) a set of collections with few and small documents
   (2) a collection with many small documents

For each of these I give the size of the
   (A) exported format 
   (B) xindice 1.0 database
   (C) xindice 1.1 database

Sizes are computed by "du" (running Cygwin).

Summary table of sizes in KB (table more readable in fixed font ;-) ):

.      |   (A)   |   (B)   |   (C)   |
. (1)  |      20 |     250 |   84000 |
. (2)  |    4000 |     450 |   12000 |

The "1.1" vs "1.0" expansion factor is much larger for collections with
few and small files (a factor of roughly 340) than for a collection with
many small files (a factor of roughly 28).
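
For reference, the factors can be recomputed from the exact byte totals
that "du" reports in the listings further down; the figures in the text
are rounded. This is only a small sketch, and the dictionary keys and the
function name are my own labels, nothing that Xindice defines.

```python
# Byte totals copied from the "du" listings further down in this message.
sizes = {
    "(1) few small documents": {"1.0": 245760, "1.1": 84049920},
    "(2) many small documents": {"1.0": 454656, "1.1": 12607488},
}


def expansion_factor(old_bytes, new_bytes):
    """Return how many times larger the 1.1 files are than the 1.0 files."""
    return new_bytes / old_bytes


factors = {
    name: expansion_factor(s["1.0"], s["1.1"]) for name, s in sizes.items()
}

for name, factor in factors.items():
    print(f"{name}: 1.1 is about {factor:.0f} times the size of 1.0")
```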

But nevertheless, I wonder for what kind of databases 1.1 would beat 1.0
from a *size* point of view. Maybe comparing processing speed of "1.1" vs
"1.0" would give a different picture, but at the moment I have no numbers to
offer in that respect. 

By the way, I do not have any very large documents in the database, so I
have no idea how "1.1" compares to "1.0" for such beasts.

And the $64,000 question is, of course, what means are available to
decrease the disk space footprint of the "1.1" database?

/O


==========================
(1) a set of collections with few and small documents

(A) This is a size listing of (part of) the exported 1.0 database.
Result: total of 20KB contents. 
Most of these files are small.
(Note: the "meta" collections/files hold my own meta info at the application
level, not Xindice meta info!)

xxx$ du -ba publications
553     publications/rss-common/meta/meta.xml
1065    publications/rss-common/meta
1577    publications/rss-common
257     publications/rss_2_0/meta/meta.xml
769     publications/rss_2_0/meta
1281    publications/rss_2_0
260     publications/rss_0_92/meta/meta.xml
772     publications/rss_0_92/meta
1284    publications/rss_0_92
260     publications/rss_0_91/meta/meta.xml
772     publications/rss_0_91/meta
1284    publications/rss_0_91
438     publications/newsletter/meta/meta.xml
950     publications/newsletter/meta
1462    publications/newsletter
512     publications/archive/meta
1024    publications/archive
429     publications/newsletter-index/meta/meta.xml
941     publications/newsletter-index/meta
1453    publications/newsletter-index
427     publications/news-archive/meta/meta.xml
939     publications/news-archive/meta
1451    publications/news-archive
394     publications/home-page/meta/meta.xml
906     publications/home-page/meta
1418    publications/home-page
257     publications/rss_1_0/meta/meta.xml
769     publications/rss_1_0/meta
1281    publications/rss_1_0
315     publications/home-page-time-stamp/meta/meta.xml
827     publications/home-page-time-stamp/meta
1339    publications/home-page-time-stamp
350     publications/events/meta/meta.xml
862     publications/events/meta
1374    publications/events
1634    publications/meta/meta.xml
2146    publications/meta
18886   publications
xxx$ 

(B) The size of the corresponding 1.0 database files 
Result: total of 250KB contents. 

yyy$ du -ba publications/
12288   publications/archive/archive.tbl
12288   publications/archive/meta/meta.tbl
12288   publications/archive/meta
24576   publications/archive
12288   publications/home-page/home-page.tbl
12288   publications/home-page/meta/meta.tbl
12288   publications/home-page/meta
24576   publications/home-page
12288   publications/meta/meta.tbl
12288   publications/meta
12288   publications/news-archive/meta/meta.tbl
12288   publications/news-archive/meta
12288   publications/news-archive/news-archive.tbl
24576   publications/news-archive
12288   publications/newsletter/meta/meta.tbl
12288   publications/newsletter/meta
12288   publications/newsletter/newsletter.tbl
24576   publications/newsletter
12288   publications/publications.tbl
12288   publications/rss-common/meta/meta.tbl
12288   publications/rss-common/meta
12288   publications/rss-common/rss-common.tbl
24576   publications/rss-common
12288   publications/rss_0_91/meta/meta.tbl
12288   publications/rss_0_91/meta
12288   publications/rss_0_91/rss_0_91.tbl
24576   publications/rss_0_91
12288   publications/rss_0_92/meta/meta.tbl
12288   publications/rss_0_92/meta
12288   publications/rss_0_92/rss_0_92.tbl
24576   publications/rss_0_92
12288   publications/rss_1_0/meta/meta.tbl
12288   publications/rss_1_0/meta
12288   publications/rss_1_0/rss_1_0.tbl
24576   publications/rss_1_0
12288   publications/rss_2_0/meta/meta.tbl
12288   publications/rss_2_0/meta
12288   publications/rss_2_0/rss_2_0.tbl
24576   publications/rss_2_0
245760  publications/
yyy$ 

(C) ... and of the corresponding 1.1 database files
Result: total of 84000KB contents. 


zzz$ du -ba publications/
4202496 publications/archive/archive.tbl
4202496 publications/archive/meta/meta.tbl
4202496 publications/archive/meta
8404992 publications/archive
4202496 publications/home-page/home-page.tbl
4202496 publications/home-page/meta/meta.tbl
4202496 publications/home-page/meta
8404992 publications/home-page
4202496 publications/meta/meta.tbl
4202496 publications/meta
4202496 publications/news-archive/meta/meta.tbl
4202496 publications/news-archive/meta
4202496 publications/news-archive/news-archive.tbl
8404992 publications/news-archive
4202496 publications/newsletter/meta/meta.tbl
4202496 publications/newsletter/meta
4202496 publications/newsletter/newsletter.tbl
8404992 publications/newsletter
4202496 publications/publications.tbl
4202496 publications/rss-common/meta/meta.tbl
4202496 publications/rss-common/meta
4202496 publications/rss-common/rss-common.tbl
8404992 publications/rss-common
4202496 publications/rss_0_91/meta/meta.tbl
4202496 publications/rss_0_91/meta
4202496 publications/rss_0_91/rss_0_91.tbl
8404992 publications/rss_0_91
4202496 publications/rss_0_92/meta/meta.tbl
4202496 publications/rss_0_92/meta
4202496 publications/rss_0_92/rss_0_92.tbl
8404992 publications/rss_0_92
4202496 publications/rss_1_0/meta/meta.tbl
4202496 publications/rss_1_0/meta
4202496 publications/rss_1_0/rss_1_0.tbl
8404992 publications/rss_1_0
4202496 publications/rss_2_0/meta/meta.tbl
4202496 publications/rss_2_0/meta
4202496 publications/rss_2_0/rss_2_0.tbl
8404992 publications/rss_2_0
84049920 publications/
zzz$ 
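
A quick cross-check of the two database listings for (1): both contain the
same 20 .tbl files, and within each listing every .tbl file has the same
size, so each total is simply the file count times a fixed per-file size.
The sketch below only redoes that arithmetic with numbers copied from the
listings; the constant names are mine.

```python
# Numbers copied from the "du" listings for (1)(B) and (1)(C) above.
TBL_FILES = 20               # .tbl files in each listing
BYTES_PER_TBL_1_0 = 12288    # size of every .tbl file in the 1.0 listing
BYTES_PER_TBL_1_1 = 4202496  # size of every .tbl file in the 1.1 listing

total_1_0 = TBL_FILES * BYTES_PER_TBL_1_0
total_1_1 = TBL_FILES * BYTES_PER_TBL_1_1

# The per-file ratio equals the overall ratio, because the file count
# is the same in both listings.
per_file_ratio = BYTES_PER_TBL_1_1 / BYTES_PER_TBL_1_0

print(total_1_0, total_1_1, per_file_ratio)
```

So for these nearly empty collections the expansion comes entirely from
the fixed size of each .tbl file, not from the documents themselves.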


==========================
(2) a collection with a large set of small documents

(A) This is a size listing of (part of) the exported 1.0 database.
This one contains approx 4000 rather small documents.
Result: total of 4000KB contents (so approx 1KB per document)

xxx$ du -b content/news
3932206 content/news/db
665     content/news/meta
3933383 content/news
xxx$ 

(note: not displaying individual files/documents ... there are too many of
them)

(B) The size of the corresponding 1.0 database files 
Result: total of 450KB contents. 

yyy$ du -ba content/news/
430080  content/news/db/db.tbl
430080  content/news/db
12288   content/news/meta/meta.tbl
12288   content/news/meta
12288   content/news/news.tbl
454656  content/news/
yyy$ 

(C) ... and of the corresponding 1.1 database files
Result: total of 12000KB contents. 

zzz$ du -ba content/news/
4202496 content/news/db/db.tbl
4202496 content/news/db
4202496 content/news/meta/meta.tbl
4202496 content/news/meta
4202496 content/news/news.tbl
12607488   content/news/
zzz$ 
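
The 4202496 bytes per .tbl file fit the defaults Natalia mentions (page
size 4 KB, page count 1024): 1024 * 4096 = 4194304, which leaves 8192
bytes that I take to be per-file overhead. Under that reading, a rough
estimate of what a smaller page count might save can be sketched as below.
The assumptions are mine and not confirmed anywhere: that the 8192 bytes
of overhead do not depend on the page count, and that a file still grows
when it needs more pages than were preallocated. The function names are
hypothetical.

```python
def estimated_tbl_bytes(page_count, page_size=4096, overhead=8192):
    """Estimated initial size of one .tbl file.

    Assumption: the file preallocates page_count pages of page_size bytes
    plus a fixed overhead (8192 = 4202496 - 1024 * 4096).
    """
    return overhead + page_count * page_size


def estimated_footprint(tbl_files, page_count):
    """Estimated initial disk footprint of tbl_files files of equal size."""
    return tbl_files * estimated_tbl_bytes(page_count)


# With the defaults this reproduces the "du" total for (2)(C): 3 .tbl files.
default_total = estimated_footprint(3, 1024)

# Hypothetical smaller page counts, only to show the trend.
for page_count in (1024, 256, 64):
    print(page_count, estimated_footprint(3, page_count))
```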

===end===


Natalia Shilenkova wrote:
> 
> There was a change in Xindice v1.1 that could possibly be responsible
> for database size increase.
> 
> Xindice v1.0 failed to correctly allocate initial file space for a
> collection according to its page size and page count parameters, so a
> collection with just a few documents would occupy several Kb on disk
> instead of reserving some space on disk for the collection to grow. It
> was fixed in v1.1 and a collection with default page size (4Kb) and
> page count (1024) parameters now occupies about 4Mb.
> 
> If you had several small collections in Xindice v1.0 it is possible
> that after rebuilding them for v1.1 the database would take
> considerably more disk space. How many collections do you have in that
> database and how big are the collections? If this indeed is the reason
> for the database increase, it could be fixed by adjusting page count
> parameter for small collections.
> 
> Also, I think Meta collections were introduced somewhere between 1.0
> and 1.1, I cannot remember now if they are created when rebuilding a
> database, but if they are, it would take some disk space, too. You can
> explore database directories to see if Meta collections are there (in
> system/Metas, I think). Meta collections can be turned off.
> 
> Natalia
> 
> On Thu, Apr 16, 2009 at 9:35 AM, OKO <ol...@sics.se> wrote:
>>
>> Ran xindice_rebuild on a smallish existing 1.0 database to get it into
>>  1.1
>> format.
>> The size was expanded enormously:
>> db 1.0: 2 M
>> db 1.1: 233 M
>>
>> That is a factor of 100 ;-(
>>
>> I have not dared do this on my real 1.0 database, which now occupies more
>> than 1G of disk space.
>>
>> *Question*:
>>  - Is this a feature or a bug?
>>
>> Tool info as presented:
>> $ xindice -h
>> trying to register database
>>
>> Xindice Command Tools v1.1
>>
>> ...etc...
>>
>> /O
>> --
>> View this message in context:
>> http://www.nabble.com/xindice_rebuild---file-size-vastly-multiplied-tp23078111p23078111.html
>> Sent from the Xindice - Users mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/xindice_rebuild---file-size-vastly-multiplied-tp23078111p23084822.html
Sent from the Xindice - Users mailing list archive at Nabble.com.
