abel deuring wrote:
> A text index (class SearchIndex.UnTextIndex) is definitely a cause of
> bloating if you use CatalogAware objects. An UnTextIndex maintains for
Right. If you don't use CatalogAware, however, and don't unindex before
reindexing an object, you should see huge bloat savings, because the
only things that are supposed to be updated then are the indexes and
metadata whose data has actually changed.
> each word a list of documents, where this word appears. So, if a
> document to be indexed contains, say, 100 words, 100 IIBTrees
> (containing mappings documentId -> word score) will be updated. (see
> UnTextIndex.insertForwardIndexEntry) If you have a larger number of
> documents, these mappings may be quite large: Assume 10,000 documents,
> and assume that you have 10 words which appear in 30% of all documents.
> Hence, each of the IIBTrees for these words contains 3000 entries. (Ok,
> one can try to keep this number of frequent words low by using a "good"
> stop word list, but at least for German, such a list is quite difficult
> to build. And one can argue that many "not too really frequent" words
> should be indexed in order to allow more precise phrase searches.)
> I don't know the details of how data is stored inside the BTrees, so I
> can give only a rough estimate of the memory requirements: with 32-bit
> integers, we have at least 8 bytes per IIBTree entry (documentId and
> score), so each of the 10 BTrees for the "frequent words" has a minimum
> size of 3000*8 = 24000 bytes.
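The arithmetic above can be checked with a quick sketch; the numbers are the hypothetical ones from the example, not measurements:

```python
# Back-of-envelope estimate of the per-word index size, using the
# hypothetical figures from the example above (not real measurements).
num_docs = 10000
doc_fraction = 0.30       # each "frequent" word appears in 30% of documents
bytes_per_entry = 8       # 32-bit documentId + 32-bit score

entries_per_word = round(num_docs * doc_fraction)
bytes_per_word = entries_per_word * bytes_per_entry

print(entries_per_word)   # 3000 entries in that word's IIBTree
print(bytes_per_word)     # 24000 bytes minimum of raw entry data
```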
> If you now add a new document containing 5 of these frequent words, 5
> larger BTrees will be updated. [Chris, let me know if I'm talking
> nonsense...] I assume that the entire updated BTrees (120000 bytes in
> total) will be appended to the ZODB (ignoring the less frequent
> words) -- even if the document contains only 1 kB of text.
Nah... I don't think so. At least I hope not! Each bucket in a BTree
is a separate persistent object. So only the sum of the data in the
updated buckets will be appended to the ZODB. So if you add an item to
a BTree, you don't add 24000+ bytes for each update. You just add the
amount of space taken up by the bucket... unfortunately I don't know
exactly how much this is, but I'd imagine it's pretty close to the
datasize with only a little overhead.
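Chris's point about write granularity can be illustrated with a toy calculation. The bucket capacity below is an assumption made up for illustration, not the real IIBTree internal constant:

```python
# Toy illustration of ZODB write granularity: only the dirty buckets are
# appended to the storage, not the whole BTree.
BUCKET_CAPACITY = 60      # entries per bucket (assumed, illustrative)
TOTAL_ENTRIES = 3000      # entries in one frequent word's IIBTree
BYTES_PER_ENTRY = 8       # documentId + score, 32 bits each

whole_tree_bytes = TOTAL_ENTRIES * BYTES_PER_ENTRY    # naive worst case
one_bucket_bytes = BUCKET_CAPACITY * BYTES_PER_ENTRY  # one dirty bucket

print(whole_tree_bytes)   # 24000 bytes if the whole tree were rewritten
print(one_bucket_bytes)   # 480 bytes for the single updated bucket
```

So a one-entry insert costs roughly one bucket's worth of data per affected word, not the whole per-word tree.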
> This is the reason why I'm working on some kind of "lazy cataloging".
> My approach is to use a Python class (or Base class, if ZClasses are
> involved) which has a manage_afterAdd method. This method looks for
> superValues of a type like "lazyCatalog" (derived from ZCatalog) and
> inserts self.getPhysicalPath() into the update list of each found
> catalog. Later, a "lazyCatalog" can index all objects in this list. The
> bloating then happens either in RAM (without subtransactions), or in a
> temporary file, if you use subtransactions.
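A minimal sketch of the lazy-cataloging idea, using plain-Python stand-ins for the Zope pieces. LazyCatalog, LazyDocument, the update_queue attribute, and the path strings are all hypothetical names; the real code would go through superValues, getPhysicalPath, and ZCatalog's indexing machinery:

```python
# Hypothetical stand-ins for the "lazy cataloging" approach: objects
# only register their path on add; actual indexing happens later.
class LazyCatalog:
    """Stand-in for a ZCatalog subclass with a deferred-update list."""
    def __init__(self):
        self.update_queue = []   # paths of objects awaiting (re)indexing
        self.indexed = set()     # stand-in for the real catalog indexes

    def queue(self, path):
        if path not in self.update_queue:
            self.update_queue.append(path)

    def index_pending(self):
        # Run later (e.g. from a scheduled batch job), so the index
        # bloat happens in one batch instead of on every object add.
        for path in self.update_queue:
            self.indexed.add(path)
        self.update_queue = []


class LazyDocument:
    """Stand-in for a content object; registers itself on add."""
    def __init__(self, path, catalogs):
        self._path = path
        self._catalogs = catalogs   # stand-in for a superValues lookup

    def manage_afterAdd(self):
        # Cheap: just record the path in each found catalog's queue.
        for catalog in self._catalogs:
            catalog.queue(self._path)
```

Adding a document is then a constant-cost queue append; the expensive BTree updates are deferred to index_pending().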
> OK, another approach which fits your (Giovanni's) needs better might be
> to use a database other than ZODB, but I'm afraid that even then
> "instant indexing" will be an expensive process if you have a large
> number of documents.
Another option is to use a session manager, and update the catalog at
Zope-Dev maillist - [EMAIL PROTECTED]