Re: [Zope-dev] Shared lexicons for ZCTextIndex (was: Re: [Zope-Checkins] CVS: Zope/lib/python/Products/ZCTextIndex - ZCTextIndex.py:1.32)

2002-08-15 Thread Chris Withers

Casey Duncan wrote:

 Anyone care to weigh in with use cases for shared lexicons?

Well, the use case you describe, several indexes with roughly the same lexicon,
is the one to watch out for. If you're going to do some quantitative tests on
this, the results would be interesting.

Still, KISS and all that would suggest the simpler design is better.

my 2p,

Chris


___
Zope-Dev maillist  -  [EMAIL PROTECTED]
http://lists.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
 http://lists.zope.org/mailman/listinfo/zope-announce
 http://lists.zope.org/mailman/listinfo/zope )



Re: [Zope-dev] Shared lexicons for ZCTextIndex (was: Re: [Zope-Checkins] CVS: Zope/lib/python/Products/ZCTextIndex - ZCTextIndex.py:1.32)

2002-08-15 Thread Jim Fulton


The original reason to share vocabularies was that multiple fields
often came from the same human vocabularies. The idea was that vocabularies
would encompass a number of features including:

- Words (or n-grams) used

- Synonyms

- Stemming rules

- Stop words

- Splitting rules

There was, potentially, a lot of information to be shared and it would
often be important, for consistency, to share the same rules for different
fields that contained the same sort of content. Sharing had as much
to do with using consistent rules as it did with optimization.
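
To illustrate the consistency argument, here is a minimal Python sketch. It
does not use the real ZCTextIndex classes: ToyLexicon, Splitter,
CaseNormalizer and StopWordRemover below are hypothetical stand-ins written
for this example, and synonyms and stemming are left out. Two fields run
through the same lexicon, so they get the same splitting, case and stop-word
rules and the same word ids.

```python
import re


class Splitter:
    """Splitting rule: break each input string into word tokens."""

    def process(self, words):
        out = []
        for text in words:
            out.extend(re.findall(r"\w+", text))
        return out


class CaseNormalizer:
    """Lowercase every word so 'Python' and 'python' share one entry."""

    def process(self, words):
        return [w.lower() for w in words]


class StopWordRemover:
    """Drop words that should never be indexed."""

    def __init__(self, stop_words):
        self.stop_words = frozenset(stop_words)

    def process(self, words):
        return [w for w in words if w not in self.stop_words]


class ToyLexicon:
    """Hypothetical lexicon: runs a pipeline, then maps words to integer ids."""

    def __init__(self, *pipeline):
        self.pipeline = pipeline
        self.wids = {}

    def source_to_word_ids(self, text):
        words = [text]
        for element in self.pipeline:
            words = element.process(words)
        # A word gets a new id only the first time it is seen.
        return [self.wids.setdefault(w, len(self.wids) + 1) for w in words]


# One lexicon shared by a "title" field and a "body" field.
lexicon = ToyLexicon(Splitter(), CaseNormalizer(), StopWordRemover({"the", "of"}))
title_ids = lexicon.source_to_word_ids("The Zen of Python")
body_ids = lexicon.source_to_word_ids("python is zen")
print(title_ids, body_ids, len(lexicon.wids))
```

Both fields end up referring to "zen" and "python" by the same ids, and each
word is stored once. With one lexicon per index, the rules would have to be
kept in sync by hand.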

Unfortunately, the old text index never implemented a lot of these ideas. :(

The pipe-lining model used by ZCTextIndex moves some of this functionality
out of the lexicon and leaves some of these ideas unimplemented, as did
TextIndex.

I think that there is at least potential value in sharing lexicons.
Of course, a down side is that it complicates set up.

On the subject of referencing lexicons by path rather than using direct
references, I'm inclined to agree that direct references are better for
simplicity and speed. It's easy enough to add a new index when you
want to change a lexicon. (Well, there are some complications having to do
with making sure that you get all the needed data into the new index...)

Jim


Casey Duncan wrote:
 On Wednesday 14 August 2002 06:03 pm, Guido van Rossum wrote:
 
Fix for issue #505
ZCTextIndex is now associated by path to its lexicon. After replacing a
lexicon used by an index, clear the index to make it use the new lexicon.
 
So the semantics are that when you replace the lexicon, the index is
reset to empty, right?  Why not create a new index instead?  Then the
lexicon could be internal to the index.  Sharing lexicons doesn't
sound like a probable use case, the more I think about it.

--Guido van Rossum (home page: http://www.python.org/~guido/)


 
 I don't disagree. This was a conceptual holdover from the previous generation 
 TextIndex. I'm switching this over to zope-dev for wider discussion:
 
 The current implementation of ZCTextIndex is like the old TextIndex in that 
 you can create one Lexicon (the successor to Vocabularies) shared by multiple 
 ZCTextIndexes.
 
 I imagine the thought was that there are only a finite number of words and 
 that sharing the lexicon would save space and possibly index time, since a 
 given word would only need to be inserted once into the lexicon regardless of 
 the number of indexes it occurred in. More significant might be the (cache) 
 memory savings of only having to keep one copy of the words in memory across 
 several indexes. Plus fewer loads and stores to the database overall by 
 sharing the word list.
 
 On the other hand I think query speeds may be compromised since one large 
 lexicon would take longer to search for a given word (or words) than several 
 smaller ones. This would be especially true for small indexes sharing a 
 lexicon with a much larger one.
 
 The other downside (as illustrated by issue #505) is the complication of 
 linking index to lexicon and making the link manageable so that you can tweak 
 the indexing system easily. My fix is not entirely complete because a hard 
 ref to the lexicon is still stored in the low-level index (to which the 
 ZCTextIndex class delegates). In order to fix this effectively without 
 introducing Zope dependencies at the low level (which we have looked to 
 avoid) I would need to create some sort of Lexicon proxy that can access the 
 correct lexicon on demand by a path efficiently. This proxy would be 
 referenced by the low level index in place of the actual lexicon.
 
 Of course the other solution, which is much simpler, is to dispense with this 
 notion of sharing lexicons entirely and, as Guido suggests, just make the 
 lexicon part of the index.
 
 Without hard use cases to the contrary, I lean toward that simpler design. 
 However I would like to perform some additional testing on large corpuses 
 with many indexes to assess the memory/performance differences between these 
 two approaches. We have already ascertained that with the new ZODB cache code 
 in 2.6, the cache setting can have a profound effect on query performance 
 (like a factor of 10), so I think testing would be helpful.
 
 Anyone care to weigh in with use cases for shared lexicons?
 
 -Casey
 



-- 
Jim Fulton   mailto:[EMAIL PROTECTED]   Python Powered!
CTO  (888) 344-4332   http://www.python.org
Zope Corporation http://www.zope.com   http://www.zope.org



Re: [Zope-dev] Shared lexicons for ZCTextIndex (was: Re: [Zope-Checkins] CVS: Zope/lib/python/Products/ZCTextIndex - ZCTextIndex.py:1.32)

2002-08-15 Thread Casey Duncan

On Thursday 15 August 2002 09:21 am, Jim Fulton wrote:
 The original reason to share vocabularies was that multiple fields
 often came from the same human vocabularies. The idea was that vocabularies
 would encompass a number of features including:
 
 - Words (or n-grams) used
 
 - Synonyms
 
 - Stemming rules
 
 - Stop words
 
 - Splitting rules
 
 There was, potentially, a lot of information to be shared and it would
 often be important, for consistency, to share the same rules for different
 fields that contained the same sort of content. Sharing had as much
 to do with using consistent rules as it did with optimization.
 
 Unfortunately, the old text index never implemented a lot of these ideas. :(
 
 The pipe-lining model used by ZCTextIndex moves some of this functionality
 out of the lexicon and leaves some of these ideas unimplemented, as did
 TextIndex.

I'm not sure what you mean. The pipelining is defined and executed in the 
lexicon.
 
 I think that there is at least potential value in sharing lexicons.
 Of course, a down side is that it complicates set up.

I guess the main complaint was that given a set of indexes sharing a lexicon, 
deleting the lexicon and replacing it with another one had no effect on the 
indexes and in fact removes your ability to manage their lexicon at all. So 
you must replace all of the indexes to use the new lexicon by hand.

Admittedly this is really more of a user interface and management issue than 
anything. Zope is just not very good at managing one-to-many relationships 
unless the one is the container of the many. 8^(
 
 On the subject of referencing lexicons by path rather than using direct
 references, I'm inclined to agree that direct references are better for
 simplicity and speed. It's easy enough to add a new index when you
 want to change a lexicon. (Well, there are some complications having to do
 with making sure that you get all the needed data into the new index...)

The current fix is a compromise that does a traversal as seldom as possible. 
Unfortunately, that makes it even more complex than either a simple direct 
reference or a path reference would be.

I'm thinking about adopting an alternative fix, which keeps the direct 
reference and the path to the lexicon and gives you a management interface to 
select a new lexicon or simply connect to a replacement (which would clear 
the index). It could also tell you if the lexicon used by the index is the 
actual one referenced from the path. 
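
A rough Python sketch of that alternative, under stated assumptions: ToyIndex
and its methods are hypothetical names, not the ZCTextIndex API, and the
resolve callable stands in for whatever path traversal Zope would do. The
index holds both the direct reference and the path, can report whether the
two still agree, and clears itself when it reconnects.

```python
class ToyIndex:
    """Hypothetical index that keeps a direct lexicon reference and its path."""

    def __init__(self, lexicon_path, resolve):
        self.lexicon_path = lexicon_path
        self._resolve = resolve               # callable: path -> object or None
        self.lexicon = resolve(lexicon_path)  # direct reference, used for speed
        self.docs = {}                        # indexed data (docid -> word ids)

    def lexicon_is_current(self):
        # True only if the object now at the path is the one we hold.
        return self._resolve(self.lexicon_path) is self.lexicon

    def reconnect(self):
        # Word ids from the old lexicon are meaningless in the new one,
        # so connecting to a replacement clears the index.
        self.lexicon = self._resolve(self.lexicon_path)
        self.docs.clear()


# A dict stands in for the object hierarchy in this example.
namespace = {"/catalog/lexicon": object()}
index = ToyIndex("/catalog/lexicon", namespace.get)
index.docs[1] = [1, 2]

namespace["/catalog/lexicon"] = object()  # someone replaces the lexicon
stale = not index.lexicon_is_current()    # the management screen could show this
index.reconnect()
```

Traversal happens only when the check or the reconnect is requested, so
queries keep using the direct reference.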

I dunno though, maybe we would be better off as before and just document how 
you go about the replacement procedure by hand. The management interface 
could still be improved though, perhaps allowing you to manage the lexicon 
through the index in the case that the original lexicon reference was 
removed. Before, there was no disclosure and no way to get to the deleted 
lexicon.

-Casey




Re: [Zope-dev] Shared lexicons for ZCTextIndex (was: Re: [Zope-Checkins] CVS: Zope/lib/python/Products/ZCTextIndex - ZCTextIndex.py:1.32)

2002-08-15 Thread Jim Fulton

Casey Duncan wrote:
 On Thursday 15 August 2002 09:21 am, Jim Fulton wrote:
 

...

 I'm not sure what you mean. The pipelining is defined and executed in the 
 lexicon.

My mistake.


 
I think that there is at least potential value in sharing lexicons.
Of course, a down side is that it complicates set up.

 
 I guess the main complaint was that given a set of indexes sharing a lexicon, 
 deleting the lexicon and replacing it with another one had no effect on the 
 indexes and in fact removes your ability to manage their lexicon at all. So 
 you must replace all of the indexes to use the new lexicon by hand.
 
 Admittedly this is really more of a user interface and management issue than 
 anything. Zope is just not very good at managing one-to-many relationships 
 unless the one is the container of the many. 8^(

Maybe that's not Zope's job. Perhaps the lexicon should keep track of the indexes
using it. Then, if you try to delete it, you'd at least get a warning letting
you know that you may need to recreate a bunch of indexes, and telling you
which ones.
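
Something along these lines, as a minimal sketch only. TrackedLexicon and its
method names are made up for illustration, and how the warning would be hooked
into Zope's delete machinery is left open.

```python
class TrackedLexicon:
    """Hypothetical lexicon that remembers which indexes are using it."""

    def __init__(self):
        self._users = set()

    def register(self, index_id):
        # Called by an index when it starts using this lexicon.
        self._users.add(index_id)

    def unregister(self, index_id):
        # Called when an index is deleted or switched to another lexicon.
        self._users.discard(index_id)

    def deletion_warning(self):
        """Return a warning naming the dependent indexes, or None if unused."""
        if not self._users:
            return None
        names = ", ".join(sorted(self._users))
        return ("Lexicon is still used by: %s. "
                "These indexes may need to be recreated." % names)


lexicon = TrackedLexicon()
lexicon.register("title")
lexicon.register("body")
warning = lexicon.deletion_warning()
```

The lexicon stores only index ids here, not references to the indexes, so it
does not keep deleted indexes alive.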


 
On the subject of referencing lexicons by path rather than using direct
references, I'm inclined to agree that direct references are better for
simplicity and speed. It's easy enough to add a new index when you
want to change a lexicon. (Well, there are some complications having to do
with making sure that you get all the needed data into the new index...)

 
 The current fix is a compromise that does a traversal as seldom as possible. 
 Unfortunately, that makes it even more complex than either a simple direct 
 reference or a path reference would be.

Yup, and this is brittle.


 I'm thinking about adopting an alternative fix, which keeps the direct 
 reference and the path to the lexicon and gives you a management interface to 
 select a new lexicon or simply connect to a replacement (which would clear 
 the index). It could also tell you if the lexicon used by the index is the 
 actual one referenced from the path. 

That sounds OK.

Jim







Re: [Zope-dev] Shared lexicons for ZCTextIndex (was: Re: [Zope-Checkins] CVS: Zope/lib/python/Products/ZCTextIndex - ZCTextIndex.py:1.32)

2002-08-15 Thread Guido van Rossum

 I think that there is at least potential value in sharing lexicons.
 Of course, a down side is that it complicates set up.

This is where I say YAGNI and announce that I'll be happy to
refactor the code if and when a real need is discovered.

 On the subject of referencing lexicons by path rather than using
 direct references, I'm inclined to agree that direct references are
 better for simplicity and speed. It's easy enough to add a new index
 when you want to change a lexicon. (Well, there are some
 complications having to do with making sure that you get all the
 needed data into the new index...)

What was the use case for switching lexicons in the first place?  I
bet it was just someone idly playing around and noticing that it
didn't work right...

--Guido van Rossum (home page: http://www.python.org/~guido/)




Re: [Zope-dev] Shared lexicons for ZCTextIndex (was: Re: [Zope-Checkins] CVS: Zope/lib/python/Products/ZCTextIndex - ZCTextIndex.py:1.32)

2002-08-15 Thread Guido van Rossum

 I guess the main complaint was that given a set of indexes sharing a
 lexicon, deleting the lexicon and replacing it with another one had
 no effect on the indexes and in fact removes your ability to manage
 their lexicon at all. So you must replace all of the indexes to use
 the new lexicon by hand.

What ability to manage the lexicon are you talking about?  The
lexicon has nothing manageable once it's created, except its name. :-)

IMO we should remove the external Lexicon from ZCTextIndex and let
ZCTextIndex create the Lexicon for you.  That means that the pipeline
options need to be selected when you create a ZCTextIndex -- this is
actually simpler because it's now one-stop shopping (except for the
need to still create a ZCatalog).
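
A minimal sketch of what that one-stop design could look like, assuming
nothing about the real ZCTextIndex constructor: SelfContainedIndex, its
pipeline argument and its methods are hypothetical. The pipeline is chosen
when the index is created, and the lexicon is just a private word-to-id
mapping that nothing outside the index can reference.

```python
class SelfContainedIndex:
    """Hypothetical text index that owns its lexicon instead of sharing one."""

    def __init__(self, pipeline=()):
        self._pipeline = tuple(pipeline)  # callables: list of words -> list of words
        self._wids = {}                   # private lexicon: word -> word id
        self._postings = {}               # word id -> set of document ids

    def _words(self, text):
        words = text.split()
        for step in self._pipeline:
            words = step(words)
        return words

    def index_doc(self, docid, text):
        for word in self._words(text):
            wid = self._wids.setdefault(word, len(self._wids) + 1)
            self._postings.setdefault(wid, set()).add(docid)

    def search(self, term):
        """Return ids of documents containing every word in term."""
        results = None
        for word in self._words(term):
            # Unknown words have no id and therefore no postings.
            docs = self._postings.get(self._wids.get(word), set())
            results = docs if results is None else results & docs
        return set() if results is None else set(results)


def lowercase(words):
    return [w.lower() for w in words]


index = SelfContainedIndex(pipeline=[lowercase])
index.index_doc(1, "Shared Lexicons")
index.index_doc(2, "private lexicons")
hits = index.search("LEXICONS")
```

Changing the pipeline then means creating a new index, which is the
replacement procedure discussed earlier in the thread.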

--Guido van Rossum (home page: http://www.python.org/~guido/)
