Google and it behaves the same way. Very
frequent terms ARE indexed. They get removed only when they are part of
a query with more than one term.
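The query-time behavior described above can be sketched as follows. This is a hypothetical illustration (not Google's or Lucene's actual code), assuming a fixed list of very frequent terms: everything is indexed, and frequent terms are dropped only from queries that still have other terms left.

```python
# Hypothetical sketch: index every term, but drop very frequent
# ("stop") terms from a query only when other terms remain.
STOP_TERMS = {"the", "a", "of"}  # assumed frequent-term list

def filter_query(terms):
    """Remove stop terms, but never empty a query entirely."""
    kept = [t for t in terms if t not in STOP_TERMS]
    # A single-term query is kept as-is, even if it is a stop term.
    return kept if kept else terms

print(filter_query(["the", "matrix"]))  # stop term dropped
print(filter_query(["the"]))            # single-term query kept
```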
--
Alex Murzaku
___
alex(at)lissus.com http://www.lissus.com
I had built for my earlier Snowball-Lucene integration, I did
use these lists for the analyzers and also made sure to exclude them
from the tests (since the analyzer would remove them...)
Regards,
and
manual language selection for this.
-Original Message-
From: maurits van wijland [mailto:[EMAIL PROTECTED]]
Sent: Saturday, November 16, 2002 8:22 AM
To: Lucene Developers
--- Che Dong [EMAIL PROTECTED] wrote:
If Chinese is segmented into single characters, like w1w2w3 = w1 w2 w3,
then searching for w1w2 and for w2w1 will return the same result, won't
it?
That wouldn't be the case if you quote the two characters (thereby
submitting a phrase query). But this discussion
I don't know any Asian languages, but from earlier experiments I
remember that bigram tokenization could sometimes hurt matching, e.g.:
w1w2w3 == tokenized as == w1w2 w2w3 (or even _w1 w1w2 w2w3 w3_) would
miss a search for w2. w1 w2 w3 would work better.
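The effect is easy to see in a toy sketch (illustrative only, not Lucene code), using Latin letters to stand in for CJK characters: a bigram tokenizer never emits a single-character token, so a one-character query has nothing to match against.

```python
# Illustrative sketch: why bigram tokenization can miss a
# single-character query, while unigram tokenization matches it.
def bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]

def unigrams(text):
    return list(text)

doc = "ABC"          # stands in for w1w2w3
print(bigrams(doc))  # ['AB', 'BC'] -- the token 'B' is never emitted
print(unigrams(doc)) # ['A', 'B', 'C'] -- a search for 'B' can match
print("B" in bigrams(doc), "B" in unigrams(doc))
```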
--- Doug Cutting [EMAIL PROTECTED]
It's true that the unsophisticated end-user would not
use SQL, but between ranges (inclusive, exclusive),
boolean, fuzzy, etc., the simple query parser you have
is evolving into something more complex than SQL.
While SQL supports them with keywords, we are getting
into an endless quest for unused
in the case
of short fields like addresses, single sentences,
etc.)
Cheers,
Alex
--- [EMAIL PROTECTED] wrote:
- Alex Murzaku contributed some code for dealing
with Russian.
c.f.
http://nagoya.apache.org/eyebrowse/ReadMsg?[EMAIL PROTECTED]msgId=115631
Yes Otis, very interesting and very familiar. I am
downloading it and, hopefully, I can get rid of
Outlook.
Thank you very much, Petite Abeille! You see, Lucene is
not so bad after all...
--- Otis Gospodnetic [EMAIL PROTECTED]
wrote:
--- petite_abeille [EMAIL PROTECTED] wrote:
Thought you might
Would it make sense to allow a full regex in the matching part? It could
use the regex or oromatcher packages. I don't know how that would affect
your hashing, though...
Alex
-Original Message-
From: Rodrigo Reyes [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, March 12, 2002 5:16 PM
To: Lucene
Hi Rodrigo and Brian,
The power of regex is desirable especially in the left and right context
matching. As it is, you need to write a lot of little rules for every
possible combination. A regex instead would allow for just one rule
covering most of the combinations. For example, you have a rule
The generic string transducer kit could become a fine and widely used
lucene contrib tool but could also become more than that: a standalone
tool like Snowball. The formal language Rodrigo describes is quite
powerful and allows for a lot.
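As a hypothetical illustration of the point about contexts (the rule, the consonant class, and the function name below are all my invention, not Rodrigo's formalism): a single regex with a lookbehind can express a left-context condition that would otherwise require one literal rule per possible context character.

```python
import re

# Hypothetical example: strip a final "s" only when the left context
# is a consonant. With literal rules this is one rule per consonant;
# with a regex lookbehind it is a single rule.
RULE = re.compile(r"(?<=[bcdfghjklmnpqrstvwxz])s$")

def apply_rule(word):
    return RULE.sub("", word)

print(apply_rule("cats"))  # "cat" -- 't' is a consonant, 's' stripped
print(apply_rule("gas"))   # "gas" -- preceded by a vowel, kept
```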
What I was trying to say is that it doesn't need to be
From what I remember, Lucene indices are structures like:
term, doc(i), pos1, ..
where for every TERM there is a list of DOCs in which it appears and the
respective POSitions in that DOC.
Our problem is that TERM, usually, is a non-word (or stem). For display
purposes, having a real word