RE: Proposal: Statistical Stopword elimination

2003-03-31 Thread Alex Murzaku
Google and it behaves the same way. Very frequent terms ARE indexed. They get removed only when they are part of a query with more than one term.
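
A minimal sketch of the query-time behavior described in this snippet: frequent terms stay in the index, but are dropped from queries with more than one term. This is plain Java, not Lucene's actual QueryParser or StopFilter; the class name, method name, and stopword list are purely illustrative.

    import java.util.*;

    // Query-time stopword handling: keep stopwords in single-term queries,
    // drop them from multi-term queries. Illustrative only, not Lucene's API.
    public class QueryStopwordFilter {
        private static final Set<String> STOPWORDS =
                new HashSet<>(Arrays.asList("the", "a", "of", "to"));

        public static List<String> filterQueryTerms(List<String> terms) {
            if (terms.size() <= 1) {
                return terms; // single-term query: keep the term even if it is a stopword
            }
            List<String> kept = new ArrayList<>();
            for (String t : terms) {
                if (!STOPWORDS.contains(t.toLowerCase())) {
                    kept.add(t);
                }
            }
            // If every term was a stopword, fall back to the original query
            return kept.isEmpty() ? terms : kept;
        }

        public static void main(String[] args) {
            System.out.println(filterQueryTerms(Arrays.asList("the")));            // [the]
            System.out.println(filterQueryTerms(Arrays.asList("the", "matrix")));  // [matrix]
        }
    }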

RE: time for 1.3 release?

2003-01-17 Thread Alex Murzaku
I had built for my earlier Snowball-Lucene integration, I did use these lists for the analyzers and also made sure to exclude them from the tests (since the analyzer would remove them...)

RE: language identifier, stemmers and analyzers

2002-11-19 Thread Alex Murzaku
and manual language selection for this.

Re: about bigram based word segment

2002-09-13 Thread Alex Murzaku
--- Che Dong [EMAIL PROTECTED] wrote: if Chinese is segmented with single characters like: w1w2w3 = w1 w2 w3, a search for w1w2 and for w2w1 will return the same result, isn't it? That wouldn't be the case if you quote the two characters (therefore submitting a phrase query.) But this discussion
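
A small sketch of the distinction made here: with one token per character, a plain AND query cannot tell w1w2 from w2w1, but a phrase query, which checks token positions, can. Plain Java, names purely illustrative; this is not Lucene's PhraseQuery implementation.

    import java.util.*;

    public class PhraseCheck {
        // true if the query tokens occur adjacently, in order, somewhere in the document
        static boolean phraseMatch(List<String> docTokens, List<String> phrase) {
            outer:
            for (int i = 0; i + phrase.size() <= docTokens.size(); i++) {
                for (int j = 0; j < phrase.size(); j++) {
                    if (!docTokens.get(i + j).equals(phrase.get(j))) continue outer;
                }
                return true;
            }
            return false;
        }

        public static void main(String[] args) {
            List<String> doc = Arrays.asList("w2", "w1");    // document text "w2w1", one token per character
            List<String> query = Arrays.asList("w1", "w2");  // phrase query "w1w2"
            System.out.println(phraseMatch(doc, query));     // false: the phrase query distinguishes the order
            System.out.println(doc.containsAll(query));      // true: a bag-of-words AND query does not
        }
    }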

Re: fixed url and How to contribute code to lucene sandbox?

2002-09-12 Thread Alex Murzaku
I don't know any Asian languages, but from earlier experiments I remember that bigram tokenization could sometimes hurt matching, e.g.: w1w2w3 == tokenized as == w1w2 w2w3 (or even _w1 w1w2 w2w3 w3_) would miss a search for w2. w1 w2 w3 would work better.
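
A sketch of the mismatch described above: indexing w1w2w3 as overlapping bigrams yields the tokens [w1w2, w2w3], so a query for the single character w2 finds nothing, while unigram tokens [w1, w2, w3] would match. ASCII letters stand in for the CJK characters; the code is illustrative only.

    import java.util.*;

    public class BigramExample {
        // overlapping character bigrams of the input string
        static List<String> bigrams(String s) {
            List<String> tokens = new ArrayList<>();
            for (int i = 0; i + 2 <= s.length(); i++) {
                tokens.add(s.substring(i, i + 2));
            }
            return tokens;
        }

        public static void main(String[] args) {
            String text = "ABC"; // stands for w1w2w3
            System.out.println(bigrams(text));                // [AB, BC]
            System.out.println(bigrams(text).contains("B"));  // false: the single character w2 is not a token
        }
    }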

Re: Bug? QueryParser may not correctly interpret RangeQuery text

2002-06-02 Thread Alex Murzaku
It's true that the unsophisticated end-user would not use SQL, but between range (inclusive, exclusive), boolean, fuzzy, etc., the simple query parser you have is evolving into something more complex than SQL. While SQL supports them with keywords, we are getting into an endless quest for unused

Re: cvs commit: jakarta-lucene TODO.txt

2002-05-28 Thread Alex Murzaku
in the case of short fields like addresses, single sentences, etc.) Cheers, Alex --- [EMAIL PROTECTED] wrote: - Alex Murzaku contributed some code for dealing with Russian. c.f. http://nagoya.apache.org/eyebrowse/ReadMsg?[EMAIL PROTECTED]msgId=115631

Re: [OT] Intertwingle

2002-04-23 Thread Alex Murzaku
Yes Otis, very interesting and very familiar. I am downloading it and, hopefully, I can get rid of Outlook. Thank you very much, Petite Abeille! You see, Lucene isn't bad after all...

RE: Normalization

2002-03-13 Thread Alex Murzaku
Would it make sense to allow a full regex in the matching part? Could use regex or oromatcher packages. Don't know how that would affect your hashing though...

RE: Normalization

2002-03-13 Thread Alex Murzaku
Hi Rodrigo and Brian, The power of regex is desirable, especially in left- and right-context matching. As it is, you need to write a lot of little rules for every possible combination. A regex instead would allow just one rule covering most of the combinations. For example, you have a rule
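
A sketch of the point made here: one regex with left and right context can stand in for many literal rewrite rules. This uses plain java.util.regex, not Rodrigo's rule language, and the rule itself (drop a trailing "e" after any consonant when the suffix "ment" follows) is made up purely for illustration.

    import java.util.regex.*;

    public class RegexContextRule {
        public static void main(String[] args) {
            // Left context: a consonant. Match: "e". Right context: the suffix "ment".
            Pattern rule = Pattern.compile("(?<=[bcdfghjklmnpqrstvwxz])e(?=ment)");
            System.out.println(rule.matcher("placement").replaceAll(""));  // "placment"
            System.out.println(rule.matcher("agreement").replaceAll(""));  // "agreement" (left context is a vowel)
        }
    }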

RE: Normalization

2002-03-11 Thread Alex Murzaku
The generic string transducer kit could become a fine and widely used Lucene contrib tool, but it could also become more than that: a standalone tool like Snowball. The formal language Rodrigo describes is quite powerful and allows for a lot. What I was trying to say is that it doesn't need to be

RE: Token retrieval question

2001-10-12 Thread Alex Murzaku
From what I remember, Lucene indices are structures like: term, doc(i), pos1, ... where for every TERM there is a list of DOCs in which it appears and the respective POSitions in that DOC. Our problem is that TERM, usually, is a non-word (or stem). For display purposes, having a real word
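
A rough model of the index layout described in this snippet: for every term, the documents it appears in and its positions within each document. This is only a sketch of the data shape in plain Java collections, not Lucene's actual postings format.

    import java.util.*;

    public class PostingsSketch {
        // term -> (docId -> positions of the term in that doc)
        static Map<String, Map<Integer, List<Integer>>> index = new HashMap<>();

        static void add(String term, int docId, int position) {
            index.computeIfAbsent(term, t -> new HashMap<>())
                 .computeIfAbsent(docId, d -> new ArrayList<>())
                 .add(position);
        }

        public static void main(String[] args) {
            // doc 1: "quick brown fox"; in practice the stored TERM would be a stem, not the surface word
            add("quick", 1, 0);
            add("brown", 1, 1);
            add("fox",   1, 2);
            System.out.println(index.get("fox")); // {1=[2]}
        }
    }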