Re: Snowball and accents filter...?

2007-04-28 Thread Andrew Green
On Fri, 2007-04-27 at 16:59 -0700, Chris Hostetter wrote:
> : In order to do this, we tried subclassing the SnowballAnalyzer... it
> : doesn't work yet, though. Here is the code of our custom class:
> 
> At first glance, what you've got seems fine; can you elaborate on what you
> mean by "it doesn't work" ?
> 
> Perhaps the issue is that the SnowballStemmer can't handle the accented
> characters, and you should strip them first, then stem?
> 
>   public TokenStream tokenStream(String fieldName, Reader reader) {
> TokenStream result = new StandardTokenizer(reader);
> result = new StandardFilter(result);
> result = new LowerCaseFilter(result);
> if (stopSet != null)
>   result = new StopFilter(result, stopSet);
> result = new ISOLatin1AccentFilter(result);
> result = new SnowballFilter(result, name);
> return result;
>   }
> 
Thanks for your answer, Chris.

It doesn't work for the opposite reason: it requires words to be spelled
correctly, including accents, in order to stem them. So, for example,
"civilización" and its plural, "civilizaciones" are stemmed correctly,
but the accentless version, "civilizacion", doesn't get stemmed at all.
So if someone misspells the word, omitting the accent, in the search
query--a likely scenario--the only hits they get are identical
misspellings in the documents, if such things exist. But we need
stemming of both accented and unaccented versions of the word. Stemming
misspellings may sound inherently evil, I suppose, but it seems to be
our best bet.

We're currently trying to modify the SpanishStemmer to do this, but
haven't quite gotten it working yet.

Another option that I'm imagining might work, though less well, would be
to simultaneously maintain two indexes, one of correctly stemmed words
generated without the accents filter, and another of unstemmed words
with the accents stripped, and query both indexes when searching.
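
A rough, untested sketch of how that two-index query might look with
MultiSearcher (the index paths and the field name are placeholders):

// one index stemmed with accents intact, one unstemmed with accents stripped
Searchable[] both = new Searchable[] {
  new IndexSearcher("/indexes/stemmed"),
  new IndexSearcher("/indexes/accentless")
};
MultiSearcher searcher = new MultiSearcher(both);
Hits hits = searcher.search(new TermQuery(new Term("content", "civilizacion")));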

Yet another possibility would be, I think, to silently use a dictionary
to correct spellings in queries before searching.

A few Google queries show that they do things sort of the way we're
trying to, though perhaps not quite...

Thanks again,
Andrew





Re: Search for docs containing only a certain word in a specified field?

2007-04-28 Thread karl wettin


On 28 Apr 2007, at 07:52, Kun Hong wrote:


karl wettin wrote:


On 27 Apr 2007, at 14:11, Erik Hatcher wrote:



On Apr 27, 2007, at 6:39 AM, karl wettin wrote:

On 27 Apr 2007, at 12:36, Erik Hatcher wrote:


Unless someone has some other tricks I'm not aware of, that is.


I guess it would be possible to add start/stop-tokens such as ^
and $ to the indexed text: "^ the $" and place a phrase query
with 0 slop.


True, true. That'd work too.


Thanks for the replies and discussion.

I think I didn't express my problem correctly. The problem is I want to
find documents containing only the "the" token in the title field, but
not necessarily with only one appearance. For example, if the query is
"the", I want to find documents whose title is "the", "the the" or
"the the the".


I'm not sure if you mean that it should treat all repetitive tokens
as only one token? Then you are better off using a filter when
analyzing text you insert into the index: rather than creating one
token for each "the" in "the the the the the the", you create only one.
You might also want to use this filter when parsing user queries. (It
will be hard to find the band 'the the'.)
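
For what it's worth, a rough, untested sketch of such a filter against
the 2.x TokenStream API (the class name is made up; this collapses
consecutive duplicates only):

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class DropRepeatedTokensFilter extends TokenFilter {
  private String previousText;

  public DropRepeatedTokensFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    Token token;
    while ((token = input.next()) != null) {
      if (!token.termText().equals(previousText)) {
        previousText = token.termText();
        return token; // first token in a run of duplicates: keep it
      }
      // same text as the token just emitted: drop it
    }
    return null; // end of stream
  }
}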


If not, and what you write above is all you want to match, nothing
more, nothing less, then you could do something like this:


(Dry-coded and untested.)

int n = 3; // the; the the; the the the
String field = "title";
String token = "the";
BooleanQuery bq = new BooleanQuery();
for (int i=0;i
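
The archive cuts the loop off here; a plausible completion of the whole
sketch (equally dry-coded and untested), assuming the ^ and $ boundary
tokens suggested earlier in the thread:

int n = 3; // the; the the; the the the
String field = "title";
String token = "the";
BooleanQuery bq = new BooleanQuery();
for (int i = 0; i < n; i++) {
  // exact phrase "^ the ... the $" with i+1 copies of the token
  PhraseQuery pq = new PhraseQuery(); // default slop is 0, i.e. exact
  pq.add(new Term(field, "^"));
  for (int j = 0; j <= i; j++) {
    pq.add(new Term(field, token));
  }
  pq.add(new Term(field, "$"));
  bq.add(pq, BooleanClause.Occur.SHOULD); // match any one of the lengths
}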

Re: Index sync up

2007-04-28 Thread Erick Erickson

I don't understand why you think HitCollector caches lots of data.
All it does is provide a place for you to decide whether you want a
doc or not. There's no fetching of the doc, or anything else except
the score and the doc ID. There's nothing else you have to do in the
HitCollector.collect method.

TopDocs ends up with an array of doc IDs and scores, which is probably
what you want: just skip to the Nth document and read off the next
X documents.

In neither case is there very much storage involved. You've got to score
all the documents anyway if you want the most relevant ones, and TopDocs
holds just an int and a float for each scoring document. Still not a huge
amount of data.
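
To make that concrete, a minimal untested sketch against the 1.4/2.x API
(the class name and List-based storage are invented for the example):

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.search.HitCollector;

public class IdScoreCollector extends HitCollector {
  // one {docId, score} pair per matching document; nothing else is kept
  public final List idScorePairs = new ArrayList();

  public void collect(int doc, float score) {
    // called once per match; no Document is fetched here
    idScorePairs.add(new Object[] { new Integer(doc), new Float(score) });
  }
}

You'd pass an instance to Searcher.search(query, collector) and page
through idScorePairs afterwards.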

Anyway, best of luck however it works out.
Erick



On 4/27/07, Tony Qian <[EMAIL PROTECTED]> wrote:


Erick,

Thanks for your explanation. I was thinking of using HitCollector. The
search interface we are facing now is actually pretty simple. One of the
searches requires a maximum of 500 results with a page size of 500
(basically, return the first 500). The second one requires a max of 250
with a page size of 25. At this time, we are OK even if we have to run
the query several times.

I see one problem with HitCollector, which is that it caches a huge
amount of data if the result set is very large. The best implementation
(I think) is for the client to pass a page number and page size into the
search method, and for Lucene to return the documents on that page
instead of always returning the first 100 documents.

I haven't looked at the Lucene code yet and don't know how hard it would
be to implement that.

Tony


>From: "Erick Erickson" <[EMAIL PROTECTED]>
>Reply-To: java-user@lucene.apache.org
>To: java-user@lucene.apache.org
>Subject: Re: Index sync up
>Date: Fri, 27 Apr 2007 13:12:16 -0400
>
><4> is also easy.
>
>From the javadoc:
>"*Caution:* Iterate only over the hits needed. Iterating over all hits is
>generally not desirable and may be the source of performance issues."
>
>So an iterator should be fine for all documents, even those > 100. But do
>be aware that the entire query gets re-executed every 100 docs or so, so
>yes, there is a performance issue. You'll pay a price; how big depends on
>a lot of variables. But let's say the query takes 2 seconds to run.
>You'll spend two seconds searching before returning document 0, two more
>seconds between documents 100 and 101, two seconds between 200 and 201,
>etc., *even if you just throw them away* if you use an iterator.
>
>So, getting hits 10,000 through 10,100 will spend a LOT of time
>processing queries. You're better off using a HitCollector, or perhaps
>TopDocs, etc.
>
>On the other hand, if your query takes 10 ms and you never really expect
>to fetch more than, say, 500 documents, who cares? Do it as simply as
>possible.
>
>But now that I'm thinking about it, it's unclear to me what happens if
>you just ask for Hits.doc(401) as your first call to get any document
>from the Hits object. I took a quick look at the Hits code and it
>*looks* like, for fetching an arbitrary 100 documents, the maximum
>number of searches you'll make is two. Again, it was a quick look, but
>it seems like the following
>
>Hits hits = search();
>Document doc = hits.doc(401);
>
>will execute the search twice: first to get the first 100 docs, then to
>get documents 400-800. At least I think that's what's happening. That
>said, I think you'd still be ahead by implementing your own HitCollector
>if you expect to fetch thousands of documents. The "fetch twice as many
>documents as the one we're asked for" algorithm seems tailored for
>relatively small data sets, which shouldn't be any surprise.
>
>Erick
>
>On 4/27/07, Tony Qian <[EMAIL PROTECTED]> wrote:
>>
>>All,
>>
>>After playing around with Lucene, we decided to replace our old
>>full-text search engine with Lucene. I got "Lucene in Action" a week
>>ago and finished reading most of the book. I have several questions.
>>
>>1) Since the book was written two years ago and Lucene has made a lot
>>of changes, is there any plan for a 2nd edition? (I guess this question
>>is for Otis and Erik; btw, it is a great book.)
>>
>>2) I have two processes for indexing. One runs every 5 minutes to add
>>new content to an existing index. Another runs daily to rebuild the
>>entire index, which also handles removing old content. After the
>>rebuild process finishes indexing, we'd like to replace the index built
>>by the first process (every 5 minutes) with the index built by the
>>second process. How do I do it safely while avoiding duplicated or
>>missing documents? (It is possible that the first process is still
>>adding documents to the index when we try to replace it with the second
>>one.)
>>NOTE: both processes retrieve data from the same database.
>>
>>3) We are doing indexing on a master server and push index data to
>>slave servers. In order to make new data visible to clients, we have to
>>close the IndexSearcher and reopen it after the new data is copied
>>over. We use a web-based application (servlet) as the search interface,
>>creating an IndexSearcher as an instance variable 

Re: Snowball and accents filter...?

2007-04-28 Thread Erick Erickson

You actually wouldn't have to maintain two versions. You could,
instead, inject the accentless (stemmed) terms into your single
index as synonyms (see Lucene in Action). This is easier
to search and maintain.

But it also bloats your index by some factor, since you're storing two
words for every accented word in your corpus. And it gives you
headaches if there is more than one accent in the word (do you
then store all 4 possibilities for two accents? 8 for 3? etc.?).
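
A rough, untested sketch of that injection as a 2.x TokenFilter; the
class is invented for illustration, stripAccents() covers only common
Spanish accents, and only the fully stripped variant is emitted (which
sidesteps the combinatorial problem, at the cost of not matching
partially accented misspellings):

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class AccentSynonymFilter extends TokenFilter {
  private Token pending; // accentless twin waiting to be emitted

  public AccentSynonymFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    if (pending != null) {
      Token synonym = pending;
      pending = null;
      return synonym;
    }
    Token token = input.next();
    if (token == null) {
      return null;
    }
    String stripped = stripAccents(token.termText());
    if (!stripped.equals(token.termText())) {
      pending = new Token(stripped, token.startOffset(), token.endOffset());
      pending.setPositionIncrement(0); // same position: a true synonym
    }
    return token;
  }

  private static String stripAccents(String s) {
    // minimal Spanish-only mapping; ISOLatin1AccentFilter covers more
    StringBuffer sb = new StringBuffer(s.length());
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      switch (c) {
        case 'á': c = 'a'; break;
        case 'é': c = 'e'; break;
        case 'í': c = 'i'; break;
        case 'ó': c = 'o'; break;
        case 'ú': c = 'u'; break;
        case 'ü': c = 'u'; break;
        case 'ñ': c = 'n'; break;
      }
      sb.append(c);
    }
    return sb.toString();
  }
}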

I think your notion of running the search terms through a dictionary
is a very good one. That way, your searcher doesn't have to care
about all this nonsense and can assume correctly accented characters.

Erick

On 4/28/07, Andrew Green <[EMAIL PROTECTED]> wrote:


On Fri, 2007-04-27 at 16:59 -0700, Chris Hostetter wrote:
> : In order to do this, we tried subclassing the SnowballAnalyzer... it
> : doesn't work yet, though. Here is the code of our custom class:
>
> At first glance, what you've got seems fine; can you elaborate on what
> you mean by "it doesn't work"?
>
> Perhaps the issue is that the SnowballStemmer can't handle the accented
> characters, and you should strip them first, then stem?
>
>   public TokenStream tokenStream(String fieldName, Reader reader) {
> TokenStream result = new StandardTokenizer(reader);
> result = new StandardFilter(result);
> result = new LowerCaseFilter(result);
> if (stopSet != null)
>   result = new StopFilter(result, stopSet);
> result = new ISOLatin1AccentFilter(result);
> result = new SnowballFilter(result, name);
> return result;
>   }
>
Thanks for your answer, Chris.

It doesn't work for the opposite reason: it requires words to be spelled
correctly, including accents, in order to stem them. So, for example,
"civilización" and its plural, "civilizaciones" are stemmed correctly,
but the accentless version, "civilizacion", doesn't get stemmed at all.
So if someone misspells the word, omitting the accent, in the search
query--a likely scenario--the only hits they get are identical
misspellings in the documents, if such things exist. But we need
stemming of both accented and unaccented versions of the word. Stemming
misspellings may sound inherently evil, I suppose, but it seems to be
our best bet.

We're currently trying to modify the SpanishStemmer to do this, but
haven't quite gotten it working yet.

Another option that I'm imagining might work, though less well, would be
to simultaneously maintain two indexes, one of correctly stemmed words
generated without the accents filter, and another of unstemmed words
with the accents stripped, and query both indexes when searching.

Yet another possibility would be, I think, to silently use a dictionary
to correct spellings in queries before searching.

A few Google queries show that they do things sort of the way we're
trying to, though perhaps not quite...

Thanks again,
Andrew






Sort in Lucene 1.4.3

2007-04-28 Thread Zhang, Lisheng
Hi,

I encountered a problem in Lucene 1.4.3: I called

Searcher.search(..., new Sort("myfiled"));

In "myfiled", most values look like numbers ("123456" or something
similar), but one field contains the value "Just a TRY", and I got this
error:

java.lang.ClassCastException at
org.apache.lucene.search.FieldDocSortedHitQueue.lessThan
(FieldDocSortedHitQueue.java:129)

It seems that Lucene judged this field to be numeric, so it cannot cast
one particular value to String?

To me, if the client did not specify the sorting field type, we should
treat it just as a String?

Thanks very much for your help, and best regards, Lisheng






Re: Index sync up

2007-04-28 Thread Otis Gospodnetic
Hi Tony,
 
- Original Message 

All,

After playing around with Lucene, we decided to replace our old full-text
search engine with Lucene. I got "Lucene in Action" a week ago and finished
reading most of the book. I have several questions.

1) Since the book was written two years ago and Lucene has made a lot of
changes, is there any plan for a 2nd edition? (I guess this question is for
Otis and Erik; btw, it is a great book.)

OG: Thanks. Yes, there are plans for LIA2. At this point in time they are
still just plans. We started preparing for the second edition some months
ago, but then Lucene got some fresh blood and started developing and
changing so rapidly that we decided to wait a little longer. Plus, both
Erik and I are quite busy these days (see my signature).

2) I have two processes for indexing. One runs every 5 minutes to add new
content to an existing index. Another runs daily to rebuild the entire
index, which also handles removing old content. After the rebuild process
finishes indexing, we'd like to replace the index built by the first
process (every 5 minutes) with the index built by the second process. How
do I do it safely while avoiding duplicated or missing documents? (It is
possible that the first process is still adding documents to the index
when we try to replace it with the second one.)
NOTE: both processes retrieve data from the same database.

OG: You'll need to make those two processes communicate somehow. If they
run on the same server, the easiest way might be using files: if file X
exists, stop updating the index; or, if file Y exists, that means the
first process is still updating, so hold off on the index swap.
If this is running under UNIX, you might be able to just do:

rm -rf index     # the files won't *really* be removed at this point,
                 # so searching against this index will still work
mv newIndex index

and then reopen the IndexSearcher.

You could also play with symlinks:

- normally you'd have: index -> index-built-on-20070428
- when you build a new index the following night, you call it
  index-built-on-20070429 and point index at it:
  index -> index-built-on-20070429
- reopen the IndexSearcher

3) We are doing indexing on a master server and push index data to slave
servers. In order to make new data visible to clients, we have to close
the IndexSearcher and reopen it after the new data is copied over. We use
a web-based application (servlet) as the search interface, creating an
IndexSearcher as an instance variable for all clients. My question is:
what will happen to clients if I close the IndexSearcher while they are
still doing searches? How can I safely update the index while clients are
searching?

OG: The clients using the IndexSearcher when you close it will get an
exception, most likely an IOException.
But you don't *have* to close the old IndexSearcher. You could just open a
new one and let the old one get GCed.
OR, if you really want to close the old one, you could always come up with
a simple mechanism that implements the "oh, this IndexSearcher needs to be
closed soon; OK, let's give all clients who are using it 60 seconds to
finish up and then we are closing this IS" policy. Or you could keep a
count of the clients using it; I believe Solr does this. You'll also want
to warm up the new IndexSearcher with a query before exposing it to real
clients, especially if your index is big.
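
A minimal sketch of that open-new-and-let-the-old-one-go approach
(untested; the searcher field, index path and warm-up query are
placeholders):

// 'searcher' should be a volatile field so worker threads see the swap
IndexSearcher fresh = new IndexSearcher("/path/to/newIndex");
fresh.search(warmupQuery); // warm it up before real clients hit it
searcher = fresh;          // publish: new requests use the new index
// deliberately no close() on the old one; GC reclaims it once
// in-flight searches finish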

4) Lucene caches the first 100 hits in memory. We decided to requery in
order to return search results to clients. For the first 100 documents, I
can iterate through "Hits". Do I have to use doc(n) to retrieve documents
beyond 100? Any performance issues?

OG: For hits > 100 you still use the same API as for hits < 100. However,
if your application or its users need to go deep into the results, you
might want to look at the IndexSearcher search() method that returns
TopDocs.
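
A hedged sketch of that variant (untested; pageStart and pageSize are
illustrative names):

// fetch enough results to cover the requested page, then read just the page
TopDocs topDocs = searcher.search(query, null, pageStart + pageSize);
ScoreDoc[] scoreDocs = topDocs.scoreDocs;
for (int i = pageStart; i < scoreDocs.length; i++) {
  Document doc = searcher.doc(scoreDocs[i].doc); // only this page is fetched
  // ... render doc ...
}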

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Lucene Consulting - http://lucene-consulting.com/








Re: Sort in Lucene 1.4.3

2007-04-28 Thread Otis Gospodnetic
Lisheng,

Have a look at the javadoc for the Sort object:
Valid Types of Values

 There are three possible kinds of term values which may be put into
 sorting fields: Integers, Floats, or Strings. Unless SortField objects
 are specified, the type of value in the field is determined by parsing
 the first term in the field.

Thus, if you know what type of value your field has, use SortField and
set it explicitly. Also, instantiate Sort and SortField only once instead
of in each call to search().
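
In code, something like this (untested; field name kept from your report):

// force String sorting instead of letting Lucene guess from the first term
Sort sort = new Sort(new SortField("myfiled", SortField.STRING));
Hits hits = searcher.search(query, sort); // reuse 'sort' across searches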

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Lucene Consulting - http://lucene-consulting.com/


- Original Message 
From: "Zhang, Lisheng" <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Saturday, April 28, 2007 9:04:23 PM
Subject: Sort in Lucene 1.4.3

Hi,

I encountered a problem in Lucene 1.4.3: I called

Searcher.search(..., new Sort("myfiled"));

In "myfiled", most values look like numbers ("123456" or something
similar), but one field contains the value "Just a TRY", and I got this
error:

java.lang.ClassCastException at
org.apache.lucene.search.FieldDocSortedHitQueue.lessThan
(FieldDocSortedHitQueue.java:129)

It seems that Lucene judged this field to be numeric, so it cannot cast
one particular value to String?

To me, if the client did not specify the sorting field type, we should
treat it just as a String?

Thanks very much for your help, and best regards, Lisheng







