Re: [Zope-dev] ZCatalog with UTF-8 Chinese

2000-09-28 Thread Zope mailing lists

On Thu, 28 Sep 2000, Sin Hang Kin wrote:
> After reading some of the query code, I think the regular expression operations
> in parse, quotes and parse2 are not safe for UTF-8 strings. So I

That wouldn't surprise me.

> decided to emulate what they do. However, I do not understand what getLexicon
> is doing, and I would like to learn what q should look like before it is
> passed to evaluate. I also do not understand why the vocabulary seems to be
> stored as integers; is getLexicon a step that looks up the strings and
> converts them to integers? I am getting lost.

I don't fully understand Lexicon myself, but I've at least spent some
time groveling around in the code.  I understand there's been a relatively
recent checkin of a new version of the text index stuff that at least
provides clearer variable names and additional comments; if you aren't
working from the CVS version, you might want to browse the files through
the CVS web interface.

So, here's what I understand:

The lexicon takes words and associates them with integers.  It is the
integers that are stored in the text index.  So in the final stages
of the search process, the parsed words are looked up in the lexicon
to get the integer, and the integer is then passed to the index
to get back the result set (list of documents containing the word).
The result set is itself a list of integers.  I think it is in fact
pairs (or some more complex data structure); at the least the index
stores the document number and the word offset (I think it's a word
offset) of the word into the document.
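
To make that concrete, here is a toy sketch in Python of the word-to-integer
mapping and the lookup path.  All the names in it (ToyLexicon, ToyTextIndex,
get_id, lookup, index_document, search) are made up for illustration and are
not the real Zope API, and the (document number, word offset) pairs simply
follow my guess above about what the index stores.

```python
class ToyLexicon:
    """Hypothetical stand-in for a lexicon: maps each word to an integer id."""

    def __init__(self):
        self._word_to_id = {}

    def get_id(self, word):
        # Assign the next free integer the first time a word is seen.
        return self._word_to_id.setdefault(word, len(self._word_to_id))

    def lookup(self, word):
        # Return the word's integer id, or None if it was never indexed.
        return self._word_to_id.get(word)


class ToyTextIndex:
    """Hypothetical index: stores integer word ids, never the words."""

    def __init__(self, lexicon):
        self._lexicon = lexicon
        self._postings = {}  # word id -> list of (doc id, word offset)

    def index_document(self, doc_id, text):
        # Naive whitespace splitting; a real splitter is more careful.
        for offset, word in enumerate(text.lower().split()):
            word_id = self._lexicon.get_id(word)
            self._postings.setdefault(word_id, []).append((doc_id, offset))

    def search(self, word):
        # Word -> integer via the lexicon, then integer -> result set.
        word_id = self._lexicon.lookup(word.lower())
        if word_id is None:
            return []
        return self._postings.get(word_id, [])
```

Searching is then two lookups: the lexicon turns the word into an integer, and
the index turns that integer into the list of places the word occurs.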

As for what q looks like...well, I haven't grovelled through the
parse, quote, parens, and parse2 code much, so I'm guessing a bit here:
I *think* that before it goes into evaluate, q is a sequence whose
items are either words or sub-sequences, and each sub-sequence is in
turn made up of words or sub-sequences, recursively.  The sub-sequences
would be the parenthesized expressions from the original string.  In
the original string, any occurrences of the pair of words 'and not'
were replaced by 'andnot'.
Any quoted strings (double quotes only, I believe) were replaced
by sequences of words separated by the 'near' operator ('...').
parse2 makes sure that every other item in q is an operator, by
sticking the default operator, 'or', in between any pairs that
aren't separated by an operator.

If I'm right, an expression like:

This is and not a (good "test of") searching

should end up feeding to evaluate a 'q' like this:

('this', 'or', 'is', 'andnot', 'a', 'or', ('good', 'or', ('test', '...',
   'of')), 'or', 'searching')

I'm least sure of those parens around test...of.
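
Here is a small sketch in Python of the default-operator step as I described
it.  It is not the real parse2: insert_default_operator and the OPERATORS
tuple are names I made up, and I am assuming the operators are exactly 'and',
'or', 'andnot' and '...'.

```python
# Assumed operator set; the real code may recognise others.
OPERATORS = ('and', 'or', 'andnot', '...')


def insert_default_operator(q, default='or'):
    """Put `default` between any two adjacent operands, recursing into
    nested sequences (the parenthesized or quoted sub-expressions)."""
    result = []
    prev_was_operand = False
    for item in q:
        if isinstance(item, (list, tuple)):
            # A sub-expression counts as a single operand at this level.
            item = insert_default_operator(item, default)
            is_operand = True
        else:
            is_operand = item not in OPERATORS
        if is_operand and prev_was_operand:
            result.append(default)
        result.append(item)
        prev_was_operand = is_operand
    return tuple(result)


# Input as I imagine it after 'and not' folding, quoting and parens handling.
parsed = ('this', 'is', 'andnot', 'a',
          ('good', ('test', '...', 'of')), 'searching')
q = insert_default_operator(parsed)
```

With that input, q comes out as the tuple shown above, which at least shows
that my description hangs together.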

Maybe this will at least give you a clue to enable you to figure
out what the code *really* does.

--RDM


___
Zope-Dev maillist  -  [EMAIL PROTECTED]
http://lists.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
 http://lists.zope.org/mailman/listinfo/zope-announce
 http://lists.zope.org/mailman/listinfo/zope )




[Zope-dev] ZCatalog with UTF-8 Chinese

2000-09-27 Thread Sin Hang Kin

Dear Developer:

I am trying to short-cut UnTextIndex so that it handles UTF-8 Chinese, and I
need some help.

After reading some of the query code, I think the regular expression operations
in parse, quotes and parse2 are not safe for UTF-8 strings. So I
decided to emulate what they do. However, I do not understand what getLexicon
is doing, and I would like to learn what q should look like before it is
passed to evaluate. I also do not understand why the vocabulary seems to be
stored as integers; is getLexicon a step that looks up the strings and
converts them to integers? I am getting lost.

Could an experienced developer help me out with this?

Rgs,

Kent Sin
-
kentsin.weblogs.com
kentsin.imeme.net


def query(self, s, default_operator = Or, ws = (string.whitespace,)):
    """
    This is called by TextIndexes. A 'query term' which is a string
    's' is passed in, along with an index object. s is parsed, then
    the wildcards are parsed, then something is parsed again, then the
    whole thing is 'evaluated'
    """

    # First replace any occurrences of " and not " with " andnot "
    s = ts_regex.gsub('[%s]+and[%s]*not[%s]+' % (ws * 3), ' andnot ', s)

    # do some parsing
    q = parse(s)

    ## here, we give lexicons a chance to transform the query.
    ## For example, substitute wildcards, or translate words into
    ## various languages.
    q = self.getLexicon(self._lexicon).query_hook(q)

    # do some more parsing
    q = parse2(q, default_operator)

    ## evaluate the final 'expression'
    return self.evaluate(q)
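
For anyone who wants to try the first step in isolation, here is a minimal
sketch of the " and not " folding. It uses Python's standard re module on a
decoded text string as a stand-in for ts_regex.gsub, which is an assumption
on my part and not what the method above calls; fold_and_not is a
hypothetical helper name. The pattern itself is copied from the code above.

```python
import re
import string

# Same pattern as in query(): whitespace, 'and', optional whitespace,
# 'not', whitespace.  string.whitespace only covers ASCII whitespace.
WS = string.whitespace
AND_NOT = re.compile('[%s]+and[%s]*not[%s]+' % ((WS,) * 3))


def fold_and_not(s):
    """Replace each ' and not ' (any ASCII whitespace) with ' andnot '."""
    return AND_NOT.sub(' andnot ', s)


folded = fold_and_not('中文 and  not 測試')
```

Working on decoded text instead of raw UTF-8 bytes means the pattern only
ever sees whole characters, which sidesteps the byte-level worry for this
particular step. It says nothing about whether parse, quotes and parse2
split Chinese text sensibly.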




