Re: [Zope] Re: [Zope-dev] Re: Zcatalog bloat problem (berkeleydb is a solution?)

2001-06-26 Thread Chris McDonough

Off the top of my head, I don't think there are any.  But this is why I
haven't fixed it yet, because I'd need to think about it past "off the
top of my head".  ;-)

- C


Casey Duncan wrote:
> 
> 
> What if any disadvantages are there to not calling unindex_object first?
> If there aren't any good ones, I think I'll be rewriting some of my own
> "CatalogAware" code...
> --
> | Casey Duncan
> | Kaivo, Inc.
> | [EMAIL PROTECTED]
> `-->
> 
> ___
> Zope maillist  -  [EMAIL PROTECTED]
> http://lists.zope.org/mailman/listinfo/zope
> **   No cross posts or HTML encoding!  **
> (Related lists -
>  http://lists.zope.org/mailman/listinfo/zope-announce
>  http://lists.zope.org/mailman/listinfo/zope-dev )

___
Zope-Dev maillist  -  [EMAIL PROTECTED]
http://lists.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
 http://lists.zope.org/mailman/listinfo/zope-announce
 http://lists.zope.org/mailman/listinfo/zope )



Re: [Zope] Re: [Zope-dev] Re: Zcatalog bloat problem (berkeleydb is a solution?)

2001-06-26 Thread Casey Duncan

Chris McDonough wrote:
> 
> abel deuring wrote:
> > A text index (class SearchIndex.UnTextIndex) is definetely is a cause of
> > bloating, if you use CatalogAware objects. An UnTextIndex maintains for
> 
> Right.. if you don't use CatalogAware, however, and don't unindex before
> reindexing an object, you should see a huge bloat savings, because the
> only things which are supposed to be updated then are indexes and
> metadata which have data that has changed.
> 
[snip]

What if any disadvantages are there to not calling unindex_object first?
If there aren't any good ones, I think I'll be rewriting some of my own
"CatalogAware" code...
-- 
| Casey Duncan
| Kaivo, Inc.
| [EMAIL PROTECTED]
`-->

___
Zope-Dev maillist  -  [EMAIL PROTECTED]
http://lists.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
 http://lists.zope.org/mailman/listinfo/zope-announce
 http://lists.zope.org/mailman/listinfo/zope )



Re: [Zope] Re: [Zope-dev] Re: Zcatalog bloat problem (berkeleydb is a solution?)

2001-06-26 Thread Chris McDonough

abel deuring wrote:
> A text index (class SearchIndex.UnTextIndex) is definetely is a cause of
> bloating, if you use CatalogAware objects. An UnTextIndex maintains for

Right.. if you don't use CatalogAware, however, and don't unindex before
reindexing an object, you should see a huge bloat savings, because the
only things which are supposed to be updated then are indexes and
metadata which have data that has changed.

> each word a list of documents, where this word appears. So, if a
> document to be indexed contains, say, 100 words, 100 IIBTrees
> (containing mappings documentId -> word score) will be updated. (see
> UnTextIndex.insertForwardIndexEntry) If you have a larger number of
> documents, these mappings may be quite large: Assume 10.000 documents,
> and assume that you have 10 words which appear in 30% of all documents.
> Hence, each of the IIBTrees for these words contains 3000 entries. (Ok,
> one can try to keep this number of frequent words low by using a "good"
> stop word list, but at least for German, such a list is quite difficult
> to build. And one can argue that many "not too really frequent" words
> should be indexed in order to allow more precise phrase searches)I don't
> know the details, how data is stored inside the BTress, so I can give
> only a rough estimate of the memory requirements: With 32 bit integers,
> we have at least 8 bytes per IIBTree entry (documentId and score), so
> each of the 10 BTree for the "frequent words" has a minimum length of
> 3000*8 = 24000 bytes.
> 
> If you now add a new document containing 5 of these frequent words, 5
> larger BTrees will be updated. [Chris, let me know, if I'm now going to
> tell nonsense...] I assume that the entire updated BTrees = 12 bytes
> will be appended to the ZODB (ignoring the less frequent words) -- even
> if the document contains only 1 kB text.

Nah... I don't think so.  At least I hope not!  Each bucket in a BTree
is a separate persistent object.  So only the sum of the data in the
updated buckets will be appended to the ZODB.  So if you add an item to
a BTree, you don't add 24000+ bytes for each update.  You just add the
amount of space taken up by the bucket... unfortunately I don't know
exactly how much this is, but I'd imagine it's pretty close to the
datasize with only a little overhead.
 
> This is the reason, why I'm working on some kind of "lazy cataloging".
> My approach is to use a Python class (or Base class,if ZClasses are
> involved), which has a method manage_afterAdd. This method looks for
> superValues of a type like "lazyCatalog" (derived from ZCatalog), and
> inserts self.getPhysicalPath() into the update list of each found
> "lazyCatalog".
> 
> Later, a "lazyCatalog" can index all objects in this list. Then, then
> bloating happens either in RAM (without subtransaction), or in a
> temporary file, if you use subtransactions.
> 
> OK, another approach which fits better to your (Giovanni) needs might be
> to use another data base than ZODB, but I'm afarid that even then
> "instant indexing" will be an expensive process, if you have a large
> number of documents.

Another option is to use a session manager, and update the catalog at
session-end.

- C

___
Zope-Dev maillist  -  [EMAIL PROTECTED]
http://lists.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
 http://lists.zope.org/mailman/listinfo/zope-announce
 http://lists.zope.org/mailman/listinfo/zope )