[Zope-dev] SearchIndex Splitter lowercase indexes?

2001-05-24 Thread E. Seifert

Hi Michel,

Michel Pelletier wrote:
>The splitter should really be a modular component.  That's what
>vocabularies were originally for, to store language-specific artifacts
>like word lists and splitters.  For example, stripping the "ing" suffix
>obviously only makes sense in English.  So if you want to change this
>behavior, make your own vocabulary with its own custom splitter.
>
>This is because each language has very different splitting requirements,
>and even different meanings of the word "word".  Imagine, for example,
>splitting Japanese or one of the Chinese languages (based textually on
>Kanji).

Just imagine German! It has compound words with no spaces or other
non-alphanumeric characters between their parts.

>Identifying words in Kanji is a very hard problem.  In Romance languages
>it's easy: words are separated by spaces.  But in Kanji, words are
>differentiated by the context of the surrounding characters; there are no
>"spaces".  Splitting Kanji text requires a pre-existing dictionary and some
>interesting heuristic matching algorithms.  And that's only half of
>Japanese itself, really, since there are two other alphabets (hiragana and
>katakana) that *are* character-phonetic like Romance languages, and all
>three alphabets are commonly mixed together in the same sentence!  The
>Chinese languages may also have these phonetic alphabets.

The same applies to German: you'd need a huge dictionary with word stems,
exceptions, and stop words.
The stems of many words also change with grammatical case.

>In other words, it's not an easy problem!  There is going to be an
>unimaginable culture clash when Asian and other non-Romance languages
>catch up to the volume of Romance language content on the web.

Well, English and German aren't in fact Romance languages; they're Germanic
:-)

Eric



___
Zope-Dev maillist  -  [EMAIL PROTECTED]
http://lists.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
 http://lists.zope.org/mailman/listinfo/zope-announce
 http://lists.zope.org/mailman/listinfo/zope )



Re: [Zope-dev] SearchIndex Splitter lowercase indexes?

2001-05-24 Thread Michel Pelletier

On Thu, 24 May 2001, Christian Robottom Reis wrote:

> On Thu, 24 May 2001, Michel Pelletier wrote:
>
> > This is a very common indexing strategy to save space and make searches
> > more relevant.  Otherwise 'Dog' and 'dog' would return two completely
> > different result sets.
>
> Fine. However:
>
> >>> s.indexes('Foo')
> []
>
> Is _this_ supposed to happen, too?

Yes.  The splitter was applied to the document before it was indexed, so
both Foo and foo became foo and there is no Foo.  The index itself is
technically not case-insensitive; it's case-flattened, which makes the
query interface case-insensitive.

> Ah, I guess so. It's the problem with
> using this outside of Zope. :-)

No, you just didn't apply the splitter before you queried the index.

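results = []
# Run the query through the same Splitter used at index time, so each
# word is normalized (lowercased, etc.) before the lookup.
for word in Splitter("search for these words or foo"):
    results = results + s.indexes(word)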

> Uhhh, no, it _is_ implemented. It just didn't work like I'd expect :-)
>
> >>> index.positions(1,['crazy'])
> [2]
> >>> index.positions(1,'crazy')
> []
> >>> index.positions(1,['Crazy'])
> []

I see.  Yes, the argument must be a sequence, and you must also apply the
splitter to your input before querying an index.
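
A minimal sketch of why the query has to go through the same normalization
as the indexed text. None of this is Zope's SearchIndex code: `normalize`,
`ToyIndex`, and its methods are hypothetical names, and the stand-in
normalizer only lowercases and splits on non-word characters.

```python
import re
from collections import defaultdict


def normalize(text):
    """Hypothetical stand-in for a splitter: lowercase, then extract word runs."""
    return re.findall(r"\w+", text.lower())


class ToyIndex:
    """Toy inverted index mapping each normalized word to a set of document ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def index_document(self, doc_id, text):
        # Words are case-flattened *before* they are stored.
        for word in normalize(text):
            self._postings[word].add(doc_id)

    def lookup_raw(self, word):
        # Looks the word up exactly as given, with no normalization.
        return sorted(self._postings.get(word, ()))

    def search(self, query):
        # Normalizes the query the same way the documents were normalized.
        hits = set()
        for word in normalize(query):
            hits |= self._postings.get(word, set())
        return sorted(hits)


index = ToyIndex()
index.index_document(1, "Foo Bar Baz")
index.index_document(2, "Crazy foo")

print(index.lookup_raw("Foo"))  # [] because only 'foo' was ever stored
print(index.search("Foo"))      # [1, 2]
```

The stored keys are all lowercase, so a raw lookup of 'Foo' finds nothing,
while the same word passed through the normalizer matches both documents.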

> So it does look lowercase words up. Of course, this is an artifact of the
> following point you make:
>
> > you want to look up things in a text index, use the same splitter to munge
> > the content before querying the index; otherwise, you may end up not
> > finding what you're looking for.
>
> This makes sense:
>
> >>> s = Splitter("Crazy")
> >>> index.positions(1,s)
> [2]
>
> Ahhm. Okay. Will update my documentation with this important point.

Ah I see you came to the answer yourself.  Yes this is an important point,
especially for other languages where the splitter *must* be applied to
extract the words from context, like Japanese.

> > In other words, it's not an easy problem!  There is going to be an
> > unimaginable culture clash when Asian and other non-Romance languages
> > catch up to the volume of Romance language content on the web.
>
> Fascinating points on i18n and l10n of the indexing mechanism. Makes me
> wonder how far the current implementation will go before having to be
> rewritten, and if the world will survive east-meets-the-west of computing
> text.

Digital Garage implemented a JVocabulary and have successfully cataloged
Japanese text.  I wonder if htdig or php can do that.

> But I believe the Splitter could stay the same for western languages, from
> what I've seen of the code. Can't really see the ing-cutting stuff here.

Oh, well maybe it used to remove common suffixes and I took it out.  It's
called stemming, and it's a pretty common pattern.  But you'd be surprised
how many people run into English-only quirks even in western languages
with Zope's splitter.

-Michel





Re: [Zope-dev] SearchIndex Splitter lowercase indexes?

2001-05-24 Thread Andreas Jung


- Original Message -
From: "Christian Robottom Reis" <[EMAIL PROTECTED]>
>
> Fascinating points on i18n and l10n of the indexing mechanism. Makes me
> wonder how far the current implementation will go before having to be
> rewritten, and if the world will survive east-meets-the-west of computing
> text.
>
> But I believe the Splitter could stay the same for western languages, from
> what I've seen of the code. Can't really see the ing-cutting stuff here.
>

Zope 2.4 will come with a reorganization of the index stuff that makes it
easier to write customized indexes. Also, the text index is prepared to
work with multiple splitters: you should be able to choose a splitter from
a list of available splitters in the ZMI. Stay tuned.

Andreas






RE: [Zope-dev] SearchIndex Splitter lowercase indexes?

2001-05-24 Thread Loren Stafford

Have you seen http://www.zope.org/Members/brianh/JSplitter

FYI

-- Loren




Re: [Zope-dev] SearchIndex Splitter lowercase indexes?

2001-05-24 Thread Christian Robottom Reis

On Thu, 24 May 2001, Michel Pelletier wrote:

> This is a very common indexing strategy to save space and make searches
> more relevant.  Otherwise 'Dog' and 'dog' would return two completely
> different result sets.

Fine. However:

>>> s.indexes('Foo')
[]

Is _this_ supposed to happen, too? Ah, I guess so. It's the problem with
using this outside of Zope. :-) I couldn't figure out what it was for.

> find the same words.  The splitter can also be passed a mapping of
> synonyms, so you can tell the splitter that "automobile", "ford" and "lisp"
> are all synonymous with the word "car".

Yes, I've seen this in stop_word_dict from Lexicon.py.

> > It makes TextIndex's position() call behave
> > unexpectedly until you do some tests with the Splitter itself!
>
> position() is currently unimplemented, isn't it?  So does it
> matter?  Also, I don't know what you're doing with position(), but anytime

Uhhh, no, it _is_ implemented. It just didn't work like I'd expect :-)

>>> index.positions(1,['crazy'])
[2]
>>> index.positions(1,'crazy')
[]
>>> index.positions(1,['Crazy'])
[]

So it does look lowercase words up. Of course, this is an artifact of the
following point you make:

> you want to look up things in a text index, use the same splitter to munge
> the content before querying the index; otherwise, you may end up not
> finding what you're looking for.

This makes sense:

>>> s = Splitter("Crazy")
>>> index.positions(1,s)
[2]

Ahhm. Okay. Will update my documentation with this important point.

> In other words, it's not an easy problem!  There is going to be an
> unimaginable culture clash when Asian and other non-Romance languages
> catch up to the volume of Romance language content on the web.

Fascinating points on i18n and l10n of the indexing mechanism. Makes me
wonder how far the current implementation will go before having to be
rewritten, and if the world will survive east-meets-the-west of computing
text.

But I believe the Splitter could stay the same for western languages, from
what I've seen of the code. Can't really see the ing-cutting stuff here.

Take care,
--
/\/\ Christian Reis, Senior Engineer, Async Open Source, Brazil
~\/~ http://async.com.br/~kiko/ | [+55 16] 274 4311







Re: [Zope-dev] SearchIndex Splitter lowercase indexes?

2001-05-24 Thread Michel Pelletier

On Thu, 24 May 2001, Christian Robottom Reis wrote:

> Hi, I've been testing SearchIndex's Splitter here, and I'm finding the
> behaviour only a tiny bit strange: it converts the words it splits to
> lowercase. Is this intentional? 

Yes.

> Example:
> 
> >>> import SearchIndex.Splitter
> >>> import SearchIndex.Lexicon
> >>> s = SearchIndex.Splitter.Splitter("Foo Bar Baz",
> ...     SearchIndex.Lexicon.stop_word_dict)
> >>> s[0]
> 'foo'
> >>> s.indexes('foo')
> [0]
> 
> Why does this happen? 

This is a very common indexing strategy to save space and make searches
more relevant.  Otherwise 'Dog' and 'dog' would return two completely
different result sets.  

The splitter also removes single-character words, splits words on
non-alphanumeric characters based on your locale (like -), and trims off
common English suffixes like 's' and 'ing' so that 'walk' and 'walking'
find the same words.  The splitter can also be passed a mapping of
synonyms, so you can tell the splitter that "automobile", "ford" and "lisp"
are all synonymous with the word "car".
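
The steps described above can be sketched roughly as follows. This is an
illustration only, not the SearchIndex Splitter: `toy_split`, the suffix
list, the three-character minimum stem, and applying synonyms after suffix
trimming are all assumptions made for the example, and it splits on ASCII
letters and digits rather than consulting the locale.

```python
import re

# Illustrative suffix list; a real stemmer handles far more cases.
SUFFIXES = ("ing", "s")


def toy_split(text, synonyms=None):
    """Lowercase, split, drop one-character words, trim suffixes, map synonyms."""
    synonyms = synonyms or {}
    words = []
    # Lowercase first, then split on anything that is not a letter or digit.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if len(word) < 2:
            continue  # single-character words are dropped
        for suffix in SUFFIXES:
            # Keep at least three characters of stem so short words such as
            # "is" or "sing" are left alone.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                break
        # Synonyms are looked up on the trimmed word (an assumption).
        words.append(synonyms.get(word, word))
    return words


print(toy_split("Walking a dog-walk"))
# ['walk', 'dog', 'walk']
print(toy_split("Fords and automobiles", {"ford": "car", "automobile": "car"}))
# ['car', 'and', 'car']
```

Because the same function would be used on both documents and queries,
'walk' and 'walking' end up as the same index key.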

> It makes TextIndex's position() call behave
> unexpectedly until you do some tests with the Splitter itself!

position() is currently unimplemented, isn't it?  So does it
matter?  Also, I don't know what you're doing with position(), but anytime
you want to look up things in a text index, use the same splitter to munge
the content before querying the index; otherwise, you may end up not
finding what you're looking for.

The splitter should really be a modular component.  That's what
vocabularies were originally for, to store language-specific artifacts
like word lists and splitters.  For example, stripping the "ing" suffix
obviously only makes sense in English.  So if you want to change this
behavior, make your own vocabulary with its own custom splitter.

This is because each language has very different splitting requirements,
and even different meanings of the word "word".  Imagine, for example,
splitting Japanese or one of the Chinese languages (based textually on
Kanji).

Identifying words in Kanji is a very hard problem.  In Romance languages
it's easy: words are separated by spaces.  But in Kanji, words are
differentiated by the context of the surrounding characters; there are no
"spaces".  Splitting Kanji text requires a pre-existing dictionary and some
interesting heuristic matching algorithms.  And that's only half of
Japanese itself, really, since there are two other alphabets (hiragana and
katakana) that *are* character-phonetic like Romance languages, and all
three alphabets are commonly mixed together in the same sentence!  The
Chinese languages may also have these phonetic alphabets.
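
To give a feel for dictionary-based splitting, here is a sketch of one very
simple heuristic, greedy forward longest-match. It is not what any
particular Japanese splitter does; the function name and the tiny word set
are made up for the example, and real segmenters need much larger
dictionaries and smarter disambiguation.

```python
def longest_match_split(text, dictionary):
    """Segment unspaced text by repeatedly taking the longest dictionary word.

    Characters that start no known word are emitted as one-character tokens.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until something matches;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                pos += size
                break
    return words


# Hypothetical miniature dictionary, for illustration only.
toy_dictionary = {"東京", "東京都", "住む"}
print(longest_match_split("東京都に住む", toy_dictionary))
# ['東京都', 'に', '住む']
```

Greedy matching picks 東京都 over the shorter 東京 simply because it is
longer, which is exactly the kind of choice that can go wrong without
context; that is where the more interesting heuristics come in.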

In other words, it's not an easy problem!  There is going to be an
unimaginable culture clash when Asian and other non-Romance languages
catch up to the volume of Romance language content on the web.

-Michel





[Zope-dev] SearchIndex Splitter lowercase indexes?

2001-05-24 Thread Christian Robottom Reis


Hi, I've been testing SearchIndex's Splitter here, and I'm finding the
behaviour only a tiny bit strange: it converts the words it splits to
lowercase. Is this intentional? Example:

>>> import SearchIndex.Splitter
>>> import SearchIndex.Lexicon
>>> s = SearchIndex.Splitter.Splitter("Foo Bar Baz",
...     SearchIndex.Lexicon.stop_word_dict)
>>> s[0]
'foo'
>>> s.indexes('foo')
[0]

Why does this happen? It makes TextIndex's position() call behave
unexpectedly until you do some tests with the Splitter itself!

Take care,
--
/\/\ Christian Reis, Senior Engineer, Async Open Source, Brazil
~\/~ http://async.com.br/~kiko/ | [+55 16] 274 4311


