Sounds like a good sample to me. The collection could be smaller if your
document is mostly tags. I would guess that the internal storage is not
just your raw document, but a parsed version, and the tags are probably
represented by a number. If the ratio of your data to your
tags goes way up, then you will probably see a difference. I don't know
this for a fact, I don't actually code Xindice, I just play a coder on television.
This is pretty much the case. Xindice doesn't store documents as a serialized DOM. Instead, it creates a tokenized stream and stores all element and attribute names in a single, global collection that maps those names to integer symbol IDs. The symbol IDs are what actually get stored in the collection and index files, so if the XML is very data-oriented and has a lot of tags and attributes, replacing those names with IDs can reduce the size of the disk image considerably.
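
To make the idea concrete, here is a minimal Python sketch of name-to-ID tokenization. It is not Xindice's actual code or on-disk format: the `SymbolTable` class, the `tokenize` function, and the tuple layout are all hypothetical, chosen only to show how repeated element and attribute names collapse into small integers.

```python
import xml.etree.ElementTree as ET


class SymbolTable:
    """Hypothetical global table mapping element/attribute names to integer IDs."""

    def __init__(self):
        self._ids = {}
        self._names = []

    def intern(self, name):
        # Assign the next free ID the first time a name is seen.
        if name not in self._ids:
            self._ids[name] = len(self._names)
            self._names.append(name)
        return self._ids[name]

    def name(self, symbol_id):
        return self._names[symbol_id]


def tokenize(xml_text, symbols):
    """Flatten a document into tokens, storing symbol IDs instead of names."""
    tokens = []

    def walk(elem):
        tokens.append(("start", symbols.intern(elem.tag)))
        for attr, value in elem.attrib.items():
            tokens.append(("attr", symbols.intern(attr), value))
        if elem.text and elem.text.strip():
            tokens.append(("text", elem.text.strip()))
        for child in elem:
            walk(child)  # text following a child (its tail) is dropped for brevity
        tokens.append(("end",))

    walk(ET.fromstring(xml_text))
    return tokens


symbols = SymbolTable()
doc = '<person id="1"><name>Ann</name><name>Bob</name></person>'
tokens = tokenize(doc, symbols)
```

In this sketch both `<name>` elements become the same `("start", 2)` token, so the name string is stored once in the table no matter how often the tag repeats. The text values are kept verbatim, which is why the savings depend on how much of the document is markup.
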
--
Tom Bradford - http://www.tbradford.org
Architect - XQRL (XQuery Engine) - http://www.xqrl.com
Apache Xindice (Native XML Database) - http://xml.apache.org/xindice
Project Labrador (Web Services Framework) - http://notdotnet.org
