Re: Problems...

2005-01-07 Thread Chris Hostetter

: Stored = as-is value stored in the Lucene index
:
: Tokenized = field is analyzed using the specified Analyzer - the tokens
: emitted are indexed
:
: Indexed = the text (either as-is with keyword fields, or the tokens
: from tokenized fields) is made searchable (aka inverted)
:
: Vectored = term frequency is stored in the index in an easily
: retrievable fashion.

FYI: I've FAQed this...

http://wiki.apache.org/jakarta-lucene/LuceneFAQ
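
The Stored / Tokenized / Indexed definitions quoted above can be summarized as a small table of flags. The following is a minimal sketch in plain Java, not Lucene code: the enum `FieldKind` is a hypothetical name, and its four constants are assumed to mirror the Lucene 1.4 factory methods Field.Keyword, Field.UnIndexed, Field.UnStored and Field.Text(String, String). Term vectors ("Vectored") are left out.

```java
// Illustrative model only; FieldKind is not part of Lucene.
enum FieldKind {
    KEYWORD(true, true, false),    // stored as-is and indexed as a single term
    UNINDEXED(true, false, false), // stored only, not searchable
    UNSTORED(false, true, true),   // analyzed and indexed, original text not kept
    TEXT(true, true, true);        // stored, analyzed and indexed

    final boolean stored;
    final boolean indexed;
    final boolean tokenized;

    FieldKind(boolean stored, boolean indexed, boolean tokenized) {
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }

    /** A field can be searched only if it was indexed (inverted). */
    boolean searchable() {
        return indexed;
    }

    /** A field value can be read back from a hit only if it was stored. */
    boolean retrievable() {
        return stored;
    }
}
```

For instance, `FieldKind.UNSTORED.searchable()` is true while `FieldKind.UNSTORED.retrievable()` is false, which is the usual choice for large body text.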


-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re[2]: RemoteSearcher

2005-01-07 Thread Yura Smolsky
Hello, Otis.

Interesting. Does Nutch not use RemoteSearchable because RemoteSearchable is not
very useful? I mean, is it suitable for distributing the index process in
parallel across many services or not? Will it give us good performance?

We have RemoteSearchable in the sources, but no one uses it. :)

I ask this question because I use PyLucene (a very good port to
Python) and I need to work out a lot of things about implementing
RemoteSearchable in omniORBpy (CORBA).  I have a big index (3,000,000 docs) with
many fields. I have noticed that search is becoming slower. I want to
distribute the index across many servers. Is RemoteSearchable worth it?

BTW, is there a working demo of Nutch with a big index?

OG Nutch (nutch.org) has a pretty sophisticated infrastructure for
OG distributed searching, but it doesn't use RemoteSearcher.


 Does anyone know of an application based on RemoteSearcher that
 distributes an index across many servers?
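
The general idea of spreading one search over several index servers can be sketched without any remoting layer. This is a minimal sketch in plain Java, not the RemoteSearchable API: `Shard`, `Hit` and `FanOutSearch` are hypothetical names, each shard stands in for one server, and scores are assumed to be comparable across shards (which is not automatically true for separately built indexes).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: these types are hypothetical, not Lucene classes.
final class FanOutSearch {
    static final class Hit {
        final String docId;
        final float score;

        Hit(String docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** One index partition; a real one would call out to another server. */
    interface Shard {
        List<Hit> search(String query, int topN);
    }

    /** Queries every shard in parallel and keeps the topN best hits overall. */
    static List<Hit> search(List<Shard> shards, String query, int topN) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Future<List<Hit>>> pending = new ArrayList<>();
            for (Shard shard : shards) {
                pending.add(pool.submit(() -> shard.search(query, topN)));
            }
            List<Hit> merged = new ArrayList<>();
            for (Future<List<Hit>> result : pending) {
                merged.addAll(result.get()); // waits for that shard to finish
            }
            merged.sort((x, y) -> Float.compare(y.score, x.score)); // best first
            return new ArrayList<>(merged.subList(0, Math.min(topN, merged.size())));
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("shard search failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Each shard only has to return its own top N hits, so the merge step stays cheap however large the individual indexes are.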
 


Yura Smolsky,







Re: reading fields selectively

2005-01-07 Thread mark harwood
There is no API for this, but I recall somebody
 talking about adding support for this a few months
 back

See
http://marc.theaimsgroup.com/?l=lucene-dev&m=109485996612177&w=2

This implementation was working on a version of Lucene
before compression was introduced so things may have
changed a little.


Cheers,
Mark








Re: reading fields selectively

2005-01-07 Thread John Wang
Thanks guys for the info!

After looking at the patch code I have two problems:

1) The patch implementation doesn't help with performance. It still
reads the data for every field in the document; it just doesn't store all
of them. So this implementation helps if there are memory
restrictions, but not if you are after performance.

2) We are bundling Lucene in our application, and we are trying very hard
not to change Lucene code and thus diverge from the Lucene code
base. This patch implementation requires you to make changes to
SegmentReader.java. I am hoping not to have to do that.


Any ideas?

Thanks

-John





Re: setting Similarity at search time

2005-01-07 Thread John Wang
Hi Chuck:

 Trying to follow up on this thread. Do you know if this feature
will be incorporated in the next Lucene release?

 How would someone find out which patches will go into the next release?

Thanks

-John


On Mon, 15 Nov 2004 13:05:36 -0800, Chuck Williams [EMAIL PROTECTED] wrote:
 Take a look at this:
 
 http://issues.apache.org/bugzilla/show_bug.cgi?id=31841
 
 Not my initial patch, but the latest patch from Wolf Siberski.  I
 haven't used it yet, but it looks like what you are looking for, and
 something I want to use too.
 
 Chuck
 
-Original Message-
From: Ken McCracken [mailto:[EMAIL PROTECTED]
Sent: Monday, November 15, 2004 11:31 AM
To: Lucene Users List
Subject: setting Similarity at search time
   
Hi,
   
Is there a way to set the Similarity at search(...) time, rather than
just setting it on the (Index)Searcher object itself?  I'd like to be
able to specify different similarities in different threads searching
concurrently, using the same IndexSearcher instance.
   
In my use case, the choice of Similarity is a parameter of the search
request, and hence may be different for each request.
   
Can such a method be added to override the search(...) method?
   
Thanks,
-Ken
   
   



Re: reading fields selectively

2005-01-07 Thread mark harwood
It still reads the data for every field in the
document

No, not if your fields are positioned in the right
order. It stops reading fields after it has got what
is needed. 
If your doc has fields in the order:

smallFrequentlyReadField, largeRarelyReadField

then the patch will not read largeRarelyReadField
off the disk when you ask for
smallFrequentlyReadField.

If the fields are ordered the other way around then
there is (currently) no way of knowing the offset of
the smallFrequentlyReadField so all fields would have
to be read.
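
The effect of field order can be illustrated with a toy record format. This is a minimal sketch in plain Java and makes no claim about Lucene's actual stored-fields file layout or about the patch itself: `SequentialFields` is a hypothetical class that writes name/value pairs one after another with no offset table, so a reader can stop early but cannot jump ahead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Toy format for illustration; this is NOT how Lucene stores fields on disk.
final class SequentialFields {
    /** Serializes alternating field names and values in the order given. */
    static byte[] write(String... namesAndValues) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (String s : namesAndValues) {
                out.writeUTF(s);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns how many values had to be read to reach the wanted field, or -1. */
    static int valuesReadToFind(byte[] record, String wanted) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
            int valuesRead = 0;
            while (in.available() > 0) {
                String name = in.readUTF();
                in.readUTF(); // no offsets are kept, so the value is read to get past it
                valuesRead++;
                if (name.equals(wanted)) {
                    return valuesRead; // stop: later fields are never touched
                }
            }
            return -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the small field written first, looking it up touches one value; with the large field written first, the large value has to be read before the small one is reached.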

Hope this helps.
Mark








Duplicate Id

2005-01-07 Thread mahaveer jain
Hi,
 
I have an application where I know I will have duplicate IDs. When I search 
these duplicate IDs, will it search the content in both files?
 
For example:
 
Id = Mahaveer, Content = Jain India
Id = Mahaveer, Content = Lucene Test
 
Now when I search for India Test, will it return both records? Also, can I 
display unique results?
 
Mahaveer
 
 


Re: setting Similarity at search time

2005-01-07 Thread Erik Hatcher
On Jan 7, 2005, at 4:26 AM, John Wang wrote:
 Trying to follow up on this thread. Do you know if this feature
will be incorporated in the next Lucene release?
 How would someone find out which patches will go into the next 
release?
CVS commit messages are sent to the lucene-dev e-mail list.  This is 
the best way to see what is happening with the codebase.  As for what 
is planned, see here:

http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard
Lucene is community-driven, though, so if you have a compelling 
rationale for why something should be committed by all means lobby for 
it.

Erik


Use search engine technology for object persistence

2005-01-07 Thread Erik Hatcher
Interesting article:
http://www.javaworld.com/javaworld/jw-01-2005/jw-0103-search_p.html
I don't agree with the use of QueryParser for non-human-entered 
queries, but otherwise it's a reasonable approach for a 
light-weight object store.

Erik


Re: Duplicate Id

2005-01-07 Thread Otis Gospodnetic
Hello,

If you search for India OR Test, you will find both, if you use AND,
you will find none.  Lucene can search any text, not just files.  It
sounds like you are using Lucene's demo as a real application (not a
good practise).  I suggest you take a look at the Resources page on the
Lucene Wiki to get a better idea about what Lucene is and how it can be
used.
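
The OR versus AND behaviour described above can be checked with a toy matcher. This is a minimal sketch in plain Java, not Lucene: `DuplicateIdDemo` is a hypothetical class, the two documents are the ones from the question, and content is simply lowercased and split on whitespace. The `uniqueIds` helper shows one way to collapse hits that share an Id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy matcher for illustration; no Lucene classes are involved.
final class DuplicateIdDemo {
    // Each entry is {id, content}; both documents share the same id.
    static final String[][] DOCS = {
        {"Mahaveer", "Jain India"},
        {"Mahaveer", "Lucene Test"},
    };

    /** Returns the id of every matching document (one entry per document). */
    static List<String> search(boolean requireAll, String... terms) {
        List<String> ids = new ArrayList<>();
        for (String[] doc : DOCS) {
            Set<String> tokens = new HashSet<>(Arrays.asList(doc[1].toLowerCase().split("\\s+")));
            int matched = 0;
            for (String term : terms) {
                if (tokens.contains(term.toLowerCase())) {
                    matched++;
                }
            }
            // AND needs every term in the same document; OR needs at least one.
            boolean hit = requireAll ? matched == terms.length : matched > 0;
            if (hit) {
                ids.add(doc[0]);
            }
        }
        return ids;
    }

    /** Collapses hits that share an id, keeping first-seen order. */
    static Set<String> uniqueIds(List<String> ids) {
        return new LinkedHashSet<>(ids);
    }
}
```

Each document is matched on its own, so sharing an Id does not merge the two contents; that is why the AND query finds nothing.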

Otis





Re: questions

2005-01-07 Thread Luke Shannon
Hello Jac;

If you have verified that the index folder is indeed being created and there
is a segment(s) file(s) in it, check that the IndexSearcher in the demo is
pointing to that location. This is an easy error to make and would account
for the "no segments folder" error message.

Luke


- Original Message - 
From: jac jac [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Friday, January 07, 2005 2:03 AM
Subject: questions



 Hi, I am a newbie and I just installed Tomcat on my machine.
 May I know, when I placed the Luceneweb folder in the webapps folder of
Tomcat, why couldn't I run a search when I tested the
website? Did I miss anything?

 It tells me that there is no c:\opt\index\segment folder...
 I created it but I still couldn't get Lucene to work...

 At http://jakarta.apache.org/lucene/docs/demo.html:
 under the indexing instructions, where should I type the following:
java org.apache.lucene.demo.IndexFiles {full-path-to-lucene}/src ?
 Is it a must to install Ant?

 Please kindly help! Thanks very much in advance

 regards,
 jac






Re: reading fields selectively

2005-01-07 Thread Mariella Di Giacomo
Hi,
This is probably a trivial question.
How can you enforce the order of the fields when you index them?
Thanks,
Mariella


Re: reading fields selectively

2005-01-07 Thread Erik Hatcher
On Jan 7, 2005, at 10:03 AM, Mariella Di Giacomo wrote:
Probably this is trivial question.
How can you enforce the order of the fields when you index them ?
By the order in which you add them to a document.
Erik



Re: reading fields selectively

2005-01-07 Thread Mariella Di Giacomo
At 10:24 AM 1/7/2005 -0500, Erik Hatcher wrote:
On Jan 7, 2005, at 10:03 AM, Mariella Di Giacomo wrote:
Probably this is trivial question.
How can you enforce the order of the fields when you index them ?
By the order in which you add them to a document.
So when you do the following:
 doc.add(Field.Keyword("id", keywords[i]));
 doc.add(Field.UnIndexed("country", unindexed[i]));
 doc.add(Field.UnStored("contents", unstored[i]));
 doc.add(Field.Text("city", text[i]));
The first field stored will be id, the second country, and so on.
Is that correct ?
Mariella





Re: reading fields selectively

2005-01-07 Thread Erik Hatcher
On Jan 7, 2005, at 10:34 AM, Mariella Di Giacomo wrote:
At 10:24 AM 1/7/2005 -0500, Erik Hatcher wrote:
On Jan 7, 2005, at 10:03 AM, Mariella Di Giacomo wrote:
Probably this is trivial question.
How can you enforce the order of the fields when you index them ?
By the order in which you add them to a document.
So when you do the following:
 doc.add(Field.Keyword("id", keywords[i]));
 doc.add(Field.UnIndexed("country", unindexed[i]));
 doc.add(Field.UnStored("contents", unstored[i]));
 doc.add(Field.Text("city", text[i]));
The first field stored will be id, the second country, and so on.
Is that correct ?
Yes.


Re: Use search engine technology for object persistence

2005-01-07 Thread Luke Francl
On Fri, 2005-01-07 at 08:05, Erik Hatcher wrote:
 Interesting article:
 
   http://www.javaworld.com/javaworld/jw-01-2005/jw-0103-search_p.html

Sort of off-topic, but does this mean JavaWorld is publishing again? I
had read Bill Venners's post from back in January '04 that they shut
down.

Luke





Use a date field for ranking

2005-01-07 Thread Christoph Kiehl
Hi,
we are currently implementing a search engine for a news site. Our goal 
is to have a search result that uses the publish date of the documents 
to boost the score of the documents.

I took a look at nutch to see how it implements pagerank and it seems 
like this is done at index time by setting a document boost.

This approach won't work for us because ranking by date is optional. We 
have to use something that boosts the scores at _search_ time.

My idea is to implement it like the sort functionality built into lucene 
and use the FieldCache.

Does anyone have a better idea, or see an important downside to this approach?
Regards,
Christoph


Re: Use a date field for ranking

2005-01-07 Thread Chris Hostetter
: we are currently implementing a search engine for a news site. Our goal
: is to have a search result that uses the publish date of the documents
: to boost the score of the documents.

: have to use something that boosts the scores at _search_ time.

1) There is a way to boost individual Query objects (which you may then
compose into a tree of BooleanQueries); see Query.setBoost(float).

2) If you are planning to rebuild your index on a regular basis (i.e.
nightly), then you can easily apply boosts to your documents when you index
them.

If you want to be able to do only incremental additions...

3) I'm sure there is a very cool and efficient way to do this using a
custom Similarity implementation (which somehow causes the default score
to be divided by the age of the document), but I've never actually played
with the Similarity class, so I won't say for certain it can be done that
way (hopefully someone else can chime in).

4) I can tell you what I came up with when I was proof-of-concepting this a
while back...

In my case, I'm willing to accept that there is some finite granularity of
time at which newer documents are no longer very much fresher than
older documents (i.e. articles from the same week are equally fresh to
me). I also have a practical cutoff of how old things can get before they
are just plain old: 52 weeks.

With those numbers in mind, I can add a special field to each document
that indicates which week the article was published (i.e. 2004w1, 2004w2,
2004w3, etc...).  At search time, my query can include a BooleanQuery of
52 clauses ORed together, each one containing the magic token for one of the
last 52 weeks prior to when the search was executed, each with a slightly
decreasing boost from the week before.
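
The week-token idea can be sketched as a function that produces the 52 tokens and their boosts for a given search date. This is a minimal sketch in plain Java under stated assumptions: `WeekBoosts` is a hypothetical class, ISO week numbering is assumed for the tokens, and the linear fall-off from 1.0 is just one possible choice of "slightly decreasing". Turning each entry into a boosted query clause is left to the caller.

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative helper; the token format follows the 2004w1 style from the post.
final class WeekBoosts {
    /** Token for the ISO week containing the date, e.g. 2005w1. */
    static String weekToken(LocalDate date) {
        return date.get(IsoFields.WEEK_BASED_YEAR) + "w"
            + date.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
    }

    /**
     * Maps each of the last `weeks` week tokens to a boost, newest first.
     * The boost falls linearly from 1.0 for the current week (an assumption).
     */
    static Map<String, Float> clauses(LocalDate searchDate, int weeks) {
        Map<String, Float> boosts = new LinkedHashMap<>();
        for (int age = 0; age < weeks; age++) {
            float boost = (float) (weeks - age) / weeks;
            boosts.put(weekToken(searchDate.minusWeeks(age)), boost);
        }
        return boosts;
    }
}
```

Documents would be indexed with the same `weekToken` value in their special field, so the tokens produced at search time line up with what is in the index.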





-Hoss




Check to see if index is optimized

2005-01-07 Thread Crump, Michael
Hello,

Lucene is great!  I just have a question.

Is there a simple way to check and see if an index is already optimized?
What happens if optimize is called on an already optimized index - does
the call basically do a noop?  Or is it still an expensive call?

Regards,

Michael



Question about the best way to replace existing docs in an index.

2005-01-07 Thread Jim Lynch
My application for Lucene involves updating an existing index with a 
mixture of new and revised documents.  From what I've been able to 
discern from reading, I'm going to have to delete the old versions of the 
revised documents before indexing them again.  Since this indexing will 
probably take quite a while due to the number of new/revised documents 
I'll be adding and the large number of documents already in the index, 
I'm uncomfortable keeping an IndexReader and an IndexWriter open for 
long periods of time.  

What I'm considering doing is reading the file with multiple documents 
twice.  The first time, I test whether each document is in the index and 
delete it if it is, with something like:

The Reference term is unique.
...
   String ref;
   while ((ref = getNextDocument()) != null) {
     Term t = new Term("Reference", ref);
     TermDocs td = indexReader.termDocs(t);
     if (td.next()) {                  // true only if some document has this term
       indexReader.delete(td.doc());   // delete by document number
     }
     td.close();
   }
Or should I not bother to look for the term at all and do something like 
this?

   String ref;
   while ((ref = getNextDocument()) != null) {
     indexReader.delete(new Term("Reference", ref));  // delete by term
   }
Is either of these more efficient?
Then I would close the indexReader and go back and reread the file, 
indexing merrily away.

Should I be concerned about keeping both an indexReader and indexWriter 
open at the same time?  I'll have other processes probably making 
searches during this time.  I'm not concerned about the searches not 
finding the data I'm currently adding, I'm more concerned about locking 
those searches out.  

A couple of valid assumptions.  The reference term is unique in the 
index and there will be only one in the input file.

Thanks,
Jim.


Quick question about highlighting.

2005-01-07 Thread Jim Lynch
I've read as much as I could find on the highlighting that is now in the 
sandbox.  I didn't find the javadocs.  I found a link to them, but it 
redirected me to a CVS tree.

Do I assume that you have to store the content of the document for the 
highlighting to work?  Otherwise I don't see how it could work.

Thanks,
Jim.


Re: Quick question about highlighting.

2005-01-07 Thread David Spencer
Jim Lynch wrote:
I've read as much as I could find on the highlighting that is now in the 
sandbox.  I didn't find the javadocs.
I have a copy here:
http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/highlighter/build/docs/api/overview-summary.html
  I found a link to them, but it
redirected my to a cvs tree.
Do I assume that you have to store the content of the document for the 
highlighting to work?  
Not per se, but you do need access to the contents to pass to 
Highlighter.getBestFragments(). You can store the contents in the index, 
or you can keep them in a cache or DB, or you can refetch the doc...

You also need to know which Analyzer you used, to get the tokenStream via:
TokenStream tokenStream = analyzer.tokenStream(field, new StringReader(body));


Otherwise I don't see how it could work.
Thanks,
Jim.


Re: Check to see if index is optimized

2005-01-07 Thread Luke Shannon
This may not be a simple way, but you could just do a quick check on the
folder to see if there is more than one file containing the name segment.

Luke




Re: Check to see if index is optimized

2005-01-07 Thread Morus Walter
Crump, Michael writes:

 
 Is there a simple way to check and see if an index is already optimized?
 What happens if optimize is called on an already optimized index - does
 the call basically do a noop?  Or is it still and expensive call?
 
Why don't you just try that? E.g. using luke. Or three lines of code...

You will find that calling optimize on an optimized index does
not change the index. (Optimized means just one segment and no
deleted documents.)

So I guess the answer for your first question can be found in the sources
of optimize:

  public synchronized void optimize() throws IOException {
    flushRamSegments();
    while (segmentInfos.size() > 1 ||
           (segmentInfos.size() == 1 &&
            (SegmentReader.hasDeletions(segmentInfos.info(0)) ||
             segmentInfos.info(0).dir != directory ||
             (useCompoundFile &&
              (!SegmentReader.usesCompoundFile(segmentInfos.info(0)) ||
                SegmentReader.hasSeparateNorms(segmentInfos.info(0))))))) {
      int minSegment = segmentInfos.size() - mergeFactor;
      mergeSegments(minSegment < 0 ? 0 : minSegment);
    }
  }

segmentInfos is private in IndexWriter, so I suspect you cannot check
that without modifying lucene.
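
The loop condition in optimize() can be restated as a standalone predicate, which makes it easier to see when the call has nothing to do. This is a minimal sketch in plain Java: `OptimizeCheck` and `Segment` are hypothetical stand-ins for Lucene's internal segment bookkeeping, so this illustrates the logic only and cannot be pointed at a real index.

```java
import java.util.List;

// Illustrative restatement of the quoted condition; not usable against a real index.
final class OptimizeCheck {
    /** Hypothetical summary of the per-segment facts the condition looks at. */
    static final class Segment {
        final boolean hasDeletions;
        final boolean inIndexDirectory;
        final boolean usesCompoundFile;
        final boolean hasSeparateNorms;

        Segment(boolean hasDeletions, boolean inIndexDirectory,
                boolean usesCompoundFile, boolean hasSeparateNorms) {
            this.hasDeletions = hasDeletions;
            this.inIndexDirectory = inIndexDirectory;
            this.usesCompoundFile = usesCompoundFile;
            this.hasSeparateNorms = hasSeparateNorms;
        }
    }

    /** True when the merge loop would run at least once. */
    static boolean needsOptimize(List<Segment> segments, boolean useCompoundFile) {
        if (segments.size() > 1) {
            return true; // several segments always get merged
        }
        if (segments.size() == 1) {
            Segment s = segments.get(0);
            return s.hasDeletions
                || !s.inIndexDirectory
                || (useCompoundFile && (!s.usesCompoundFile || s.hasSeparateNorms));
        }
        return false; // an empty index has nothing to merge
    }
}
```

Read this way, a single clean segment in the right directory makes the loop body skip entirely, leaving only the flushRamSegments() call.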

HTH
Morus




Re: Check to see if index is optimized

2005-01-07 Thread Luke Francl
On Fri, 2005-01-07 at 13:24, Crump, Michael wrote:

 Is there a simple way to check and see if an index is already optimized?
 What happens if optimize is called on an already optimized index - does
 the call basically do a noop?  Or is it still and expensive call?

If an index has no deletions, it does not need to be optimized. You can
find out if it has deletions with IndexReader.hasDeletions.

I am not sure what the cost of optimization is if the index doesn't need
it. Perhaps someone else on this list knows.

Regards,
Luke Francl





Re: Check to see if index is optimized

2005-01-07 Thread Mike Snare
 If an index has no deletions, it does not need to be optimized. You can
 find out if it has deletions with IndexReader.hasDeletions.

Is that true?  An index that has just been created (with no deletions)
can still have multiple segments that could be optimized.  I'm not
sure your statement is correct.

-Mike




Re: Check to see if index is optimized

2005-01-07 Thread Mike Snare
Based on the method sent earlier, it looks like Lucene first checks to
see if optimization is even necessary.




Query based stemming

2005-01-07 Thread Peter Kim
Hi,

I'm new to Lucene, so I apologize if this issue has been discussed
before (I'm sure it has), but I had a hard time finding an answer using
google. (Maybe this would be a good candidate for the FAQ!) :)

Is it possible to enable stem queries on a per-query basis? It doesn't
seem to be possible since the stem tokenizing is done during the
indexing process. Are people basically stuck with having all their
queries stemmed or none at all?

Thanks!
Peter




Re: Quick question about highlighting.

2005-01-07 Thread Jim Lynch
OK, thanks.  That clears things up.  I'll play with it once I get 
something indexed.

Jim.


Re: Query based stemming

2005-01-07 Thread Jim Lynch
From what I've read, if you want to have a choice, the easiest way is 
to index the documents twice, once with stemming on and once with it off, 
placing the results in two different indexes.  Then at query time, 
select which index you want to use based on whether you want stemming on 
or off.

Jim.


Re: Query based stemming

2005-01-07 Thread Chris Hostetter

: Is it possible to enable stem queries on a per-query basis? It doesn't
: seem to be possible since the stem tokenizing is done during the
: indexing process. Are people basically stuck with having all their
: queries stemmed or none at all?

:  From what I've read, if you want to have a choice, the easiest way is
: to index the documents twice. Once with stemming on and once with it off
: placing the results in two different indexes.  Then at query time,
: select which index you want to use based on whether you want stemming on
: or off.

As I understand it, the intended place to implement stemming is in an
Analyzer Filter (not to be confused with a search Filter).  Since you
can specify an Analyzer when you call addDocument, you don't have to
actually have two separate indexes; you could just have all the docs in
one index and use a search Filter to indicate which docs to look at.

Alternately: the Analyzer's tokenStream method is given the fieldName
being analyzed, so you could write an Analyzer with a set of rules
telling it to only apply your stemming filter to certain fields, and
then instead of having twice as many documents, you can just index your
text in two separate fields (which should be a little easier than
separate docs because you are only duplicating the fields where stemming
is relevant).  Then at search time you don't have to filter anything, just
search the field that's applicable to your current desire (stemmed or
unstemmed).
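
The field-name rule can be sketched as a toy analyzer that decides per field whether to stem. This is a minimal sketch in plain Java, not Lucene's Analyzer API: `PerFieldStemming` is a hypothetical class, the "_stemmed" field-name suffix is an assumed convention, and the suffix stripping is a deliberately crude stand-in for a real stemmer such as Porter's.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy analyzer for illustration; the naming convention and stemmer are assumptions.
final class PerFieldStemming {
    /** Crude suffix stripper; NOT a real stemming algorithm. */
    static String stem(String token) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (token.length() > suffix.length() + 2 && token.endsWith(suffix)) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    /** Lowercases and splits the text, stemming only for fields named *_stemmed. */
    static List<String> analyze(String fieldName, String text) {
        boolean applyStemming = fieldName.endsWith("_stemmed");
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (raw.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            tokens.add(applyStemming ? stem(raw) : raw);
        }
        return tokens;
    }
}
```

Indexing the same text under both `contents` and `contents_stemmed`, and analyzing the query with the same rule, lets each search pick the stemmed or unstemmed field.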

Lastly: although it's tricky to get correct, there's no law saying you
have to use the same Analyzer when you query as when you index.  You could
index your documents using an Analyzer that does no stemming, and then at
search time (if you want stemming) use an Analyzer that does reverse
stemming to expand your query terms out to all the possible variants.


(NOTE: I've never actually tried this, but I think the theory is sound.)


-Hoss

