Re: [Zope] Re: [Zope-dev] Re: Zcatalog bloat problem (berkeleydb is a solution?)

2001-06-26 Thread Chris McDonough

Off the top of my head, I don't think there are any.  But this is why I
haven't fixed it yet, because I'd need to think about it past off the
top of my head.  ;-)

- C


Casey Duncan wrote:
 
 
 What, if any, disadvantages are there to not calling unindex_object first?
 If there aren't any good ones, I think I'll be rewriting some of my own
 CatalogAware code...
 --
 | Casey Duncan
 | Kaivo, Inc.
 | [EMAIL PROTECTED]
 `--
 
 ___
 Zope maillist  -  [EMAIL PROTECTED]
 http://lists.zope.org/mailman/listinfo/zope
 **   No cross posts or HTML encoding!  **
 (Related lists -
  http://lists.zope.org/mailman/listinfo/zope-announce
  http://lists.zope.org/mailman/listinfo/zope-dev )

___
Zope-Dev maillist  -  [EMAIL PROTECTED]
http://lists.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
 http://lists.zope.org/mailman/listinfo/zope-announce
 http://lists.zope.org/mailman/listinfo/zope )



Re: [Zope] Re: [Zope-dev] Re: Zcatalog bloat problem (berkeleydb is a solution?)

2001-06-26 Thread Chris McDonough

abel deuring wrote:
 A text index (class SearchIndex.UnTextIndex) is definitely a cause of
 bloating, if you use CatalogAware objects. An UnTextIndex maintains for

Right... if you don't use CatalogAware, however, and don't unindex before
reindexing an object, you should see huge bloat savings, because the
only things that are supposed to be updated then are the indexes and
metadata whose data has actually changed.
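The difference Chris describes can be sketched in plain Python. This is not the real ZCatalog API (TinyCatalog, refresh, and reindex_full are made-up names for illustration); it just shows why skipping the unindex step saves writes: a full unindex/reindex touches every index entry for the object twice, while a change-aware refresh only touches the indexes whose value differs.

```python
# Hypothetical sketch, NOT the real ZCatalog API: compare a full
# unindex/reindex cycle with a refresh that skips unchanged indexes.

class TinyCatalog:
    def __init__(self, index_names):
        # one forward index per name: value -> set of document ids
        self.indexes = {name: {} for name in index_names}
        self.indexed = {}   # docid -> {index name: last indexed value}
        self.writes = 0     # count of index mutations (proxy for ZODB appends)

    def _insert(self, name, value, docid):
        self.indexes[name].setdefault(value, set()).add(docid)
        self.writes += 1

    def _remove(self, name, value, docid):
        self.indexes[name].get(value, set()).discard(docid)
        self.writes += 1

    def reindex_full(self, docid, attrs):
        # "unindex_object first": every index is touched twice
        old = self.indexed.get(docid, {})
        for name, value in old.items():
            self._remove(name, value, docid)
        for name, value in attrs.items():
            self._insert(name, value, docid)
        self.indexed[docid] = dict(attrs)

    def refresh(self, docid, attrs):
        # only touch indexes whose value actually changed
        old = self.indexed.get(docid, {})
        for name, value in attrs.items():
            if old.get(name) != value:
                if name in old:
                    self._remove(name, old[name], docid)
                self._insert(name, value, docid)
        self.indexed[docid] = dict(attrs)
```

With ten indexed attributes of which one changed, refresh performs 2 writes (one remove, one insert) where the full cycle performs 20.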

 each word a list of the documents in which that word appears. So, if a
 document to be indexed contains, say, 100 words, 100 IIBTrees
 (containing mappings documentId -> word score) will be updated (see
 UnTextIndex.insertForwardIndexEntry). If you have a larger number of
 documents, these mappings may be quite large: assume 10,000 documents,
 and assume that you have 10 words which appear in 30% of all documents.
 Hence, each of the IIBTrees for these words contains 3,000 entries. (OK,
 one can try to keep the number of frequent words low by using a good
 stop word list, but at least for German, such a list is quite difficult
 to build. And one can argue that many not-too-frequent words should be
 indexed in order to allow more precise phrase searches.) I don't
 know the details of how data is stored inside the BTrees, so I can give
 only a rough estimate of the memory requirements: with 32-bit integers,
 we have at least 8 bytes per IIBTree entry (documentId and score), so
 each of the 10 BTrees for the frequent words has a minimum size of
 3000*8 = 24000 bytes.
 
 If you now add a new document containing 5 of these frequent words, 5
 larger BTrees will be updated. [Chris, let me know if I'm now going to
 tell nonsense...] I assume that the entire updated BTrees (5 * 24000 =
 120000 bytes) will be appended to the ZODB (ignoring the less frequent
 words) -- even if the document contains only 1 kB of text.

Nah... I don't think so.  At least I hope not!  Each bucket in a BTree
is a separate persistent object, so only the data in the updated
buckets will be appended to the ZODB.  So if you add an item to
a BTree, you don't append 24000+ bytes per update; you just add the
amount of space taken up by the affected bucket.  Unfortunately I don't
know exactly how much that is, but I'd imagine it's pretty close to the
data size with only a little overhead.
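The two estimates above can be put side by side. The bucket capacity below is an assumption made purely for illustration (the real IIBTree bucket size may differ); the per-tree and per-document numbers come from abel's scenario.

```python
# Back-of-the-envelope comparison for abel's scenario: appending whole
# BTrees per update vs. appending only the touched bucket, per Chris's
# correction. BUCKET_ENTRIES is an assumed value, not the real IIBTree
# bucket capacity.

ENTRY_BYTES = 8          # 32-bit documentId + 32-bit score
ENTRIES_PER_TREE = 3000  # each frequent word appears in 3,000 documents
FREQUENT_WORDS_HIT = 5   # frequent words contained in the new document
BUCKET_ENTRIES = 60      # hypothetical bucket capacity

whole_tree = ENTRIES_PER_TREE * ENTRY_BYTES              # 24,000 bytes per word
naive_append = FREQUENT_WORDS_HIT * whole_tree           # abel's worst case
bucket_append = FREQUENT_WORDS_HIT * BUCKET_ENTRIES * ENTRY_BYTES

print(naive_append, bucket_append)  # 120000 vs. 2400 bytes appended
```

Under these assumptions the bucket-level write is a factor of 50 smaller than rewriting the whole trees, which is the substance of Chris's reply.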
 
 This is the reason why I'm working on some kind of lazy cataloging.
 My approach is to use a Python class (or base class, if ZClasses are
 involved) which has a method manage_afterAdd. This method looks for
 superValues of a type like lazyCatalog (derived from ZCatalog), and
 inserts self.getPhysicalPath() into the update list of each
 lazyCatalog found.
 
 Later, a lazyCatalog can index all objects in this list. Then the
 bloating happens either in RAM (without subtransactions), or in a
 temporary file, if you use subtransactions.
 
 OK, another approach which fits better to your (Giovanni) needs might be
 to use a database other than ZODB, but I'm afraid that even then
 instant indexing will be an expensive process if you have a large
 number of documents.
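The lazy-cataloging scheme quoted above can be sketched as follows. The class and method names here (other than the role of manage_afterAdd) are hypothetical stand-ins, not real Zope 2 API; in real Zope the mixin would walk superValues() to locate its catalogs rather than receiving them explicitly.

```python
# Hypothetical sketch of abel's lazy-catalog idea: objects queue their
# paths when added, and the catalog indexes the whole batch later in a
# single pass instead of reindexing on every individual change.

class LazyCatalog:
    def __init__(self):
        self.pending = []    # paths queued by manage_afterAdd
        self.indexed = set()

    def queue(self, path):
        if path not in self.pending:
            self.pending.append(path)

    def index_pending(self):
        # run later, e.g. from a scheduled job or at session end;
        # all the write activity happens in this one pass
        batch, self.pending = self.pending, []
        for path in batch:
            self.indexed.add(path)
        return len(batch)


class LazyCatalogAware:
    """Stand-in for the author's base class (or ZClass base)."""

    def __init__(self, path, catalogs):
        self.path = path          # plays the role of getPhysicalPath()
        self.catalogs = catalogs  # real code would find these via superValues()

    def manage_afterAdd(self):
        for catalog in self.catalogs:
            catalog.queue(self.path)
```

This also matches Chris's session-manager suggestion below: the queue is drained once at session end, so the intermediate bloat lives in RAM or a temporary file rather than in the ZODB.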

Another option is to use a session manager, and update the catalog at
session-end.

- C




Re: [Zope] Re: [Zope-dev] Re: Zcatalog bloat problem (berkeleydb is a solution?)

2001-06-26 Thread Casey Duncan

Chris McDonough wrote:
 
 abel deuring wrote:
  A text index (class SearchIndex.UnTextIndex) is definitely a cause of
  bloating, if you use CatalogAware objects. An UnTextIndex maintains for
 
  Right... if you don't use CatalogAware, however, and don't unindex before
  reindexing an object, you should see huge bloat savings, because the
  only things that are supposed to be updated then are the indexes and
  metadata whose data has actually changed.
 
[snip]

What, if any, disadvantages are there to not calling unindex_object first?
If there aren't any good ones, I think I'll be rewriting some of my own
CatalogAware code...
-- 
| Casey Duncan
| Kaivo, Inc.
| [EMAIL PROTECTED]
`--
