Re: Check to see if index is optimized

2005-01-07 Thread Mike Snare
 If an index has no deletions, it does not need to be optimized. You can
 find out if it has deletions with IndexReader.hasDeletions.

Is that true?  An index that has just been created (with no deletions)
can still have multiple segments that could be optimized.  I'm not
sure your statement is correct.
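
If it helps, here's a minimal sketch of the pre-check being discussed
(Lucene 1.4-era API; the index path is just an example).  Note that, per the
caveat above, hasDeletions() alone won't tell you whether a freshly built
index still has multiple segments:

 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.IndexWriter;

 public class OptimizeIfNeeded
 {
  public static void main(String[] args) throws Exception
  {
   String path = "index";  // hypothetical index location

   IndexReader reader = IndexReader.open(path);
   boolean hasDeletions = reader.hasDeletions();
   reader.close();

   // Deletions are one signal that optimize() has real work to do,
   // but a newly built index can still hold multiple segments.
   if (hasDeletions)
   {
    IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), false);
    writer.optimize();  // merges all segments into one
    writer.close();
   }
  }
 }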

-Mike

On Fri, 07 Jan 2005 14:22:23 -0600, Luke Francl
[EMAIL PROTECTED] wrote:
 On Fri, 2005-01-07 at 13:24, Crump, Michael wrote:
 
  Is there a simple way to check and see if an index is already optimized?
  What happens if optimize is called on an already optimized index - does
  the call basically do a noop?  Or is it still an expensive call?
 
 If an index has no deletions, it does not need to be optimized. You can
 find out if it has deletions with IndexReader.hasDeletions.
 
 I am not sure what the cost of optimization is if the index doesn't need
 it. Perhaps someone else on this list knows.
 
 Regards,
 Luke Francl
 
 
 





Re: Check to see if index is optimized

2005-01-07 Thread Mike Snare
Based on the method sent earlier, it looks like Lucene first checks to
see if optimization is even necessary.




Re: Search not working properly. Bug !!!!!!

2004-12-30 Thread Mike Snare
You appear to be searching for the word "Engineer" in the "name"
field.  Shouldn't this query be directed at the "designation" field?
The only terms in the "name" field would be "Ebrahim", "Faisal", "John",
and "Smith", wouldn't they?


On Thu, 30 Dec 2004 22:06:46 +0530, Mohamed Ebrahim Faisal
[EMAIL PROTECTED] wrote:
 Hi all
 
 I have written a simple program to test Indexing & Search. After indexing a
 couple of documents, I searched for them, but I didn't get successful
 matches. I don't know whether it is a bug in Lucene or in my code. I have
 enclosed the code for your review.
 
 But when I used Lucene for bigger applications (where the index contains
 larger documents), search worked amazingly.
 
 Following is the code which didn't work properly
 
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.queryParser.QueryParser;
 import org.apache.lucene.search.Hits;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.TermQuery;
 
 import org.apache.lucene.search.Searcher;
 
 public class testLucene
 {
  private static final String[] strSTOP_WORDS =
  {
    "and",
    "are",
    "was",
    "will",
    "with" };
  private void test() throws Exception
  {
   Analyzer objAnalyzer = new StandardAnalyzer();
   IndexWriter index = new IndexWriter("index", objAnalyzer, true );
   Searcher objIndexSearcher = new IndexSearcher("index");
 
   Document d = new Document();
 
   d.add( Field.Text("name", "Ebrahim Faisal"));
   d.add( Field.Text("address", "New York"));
   d.add( Field.Text("designation", "Software Engineer"));
   d.add( Field.Text("xyz", "123 IndexWriter index"));
 
   index.addDocument( d );
 
   d = new Document();
 
   d.add( Field.Text("name", "John Smith"));
   d.add( Field.Text("address", "India"));
   d.add( Field.Text("designation", "Sr. Software Engineer"));
   d.add( Field.Text("xyz", "456 StandardAnalyzer true"));
 
   index.addDocument( d );
 
   index.optimize();
   index.close();
 
   Query objQuery = null;
 
   objQuery = QueryParser.parse("Engineer", "name", objAnalyzer);
 
   Hits objHits = objIndexSearcher.search(objQuery);
 
   for (int nStart = 0; nStart < objHits.length(); nStart++)
   {
    d = objHits.doc(nStart);
    System.out.println( "address " + d.get("address"));
   }
 
  }
  public static void main(String[] args) throws Exception
  {
   new testLucene().test();
  }
 }
 





Re: Indexing terms only

2004-12-22 Thread Mike Snare
Whether or not the text is stored in the index is a different concern
than how it is analyzed.  If you want the text to be indexed, and not
stored, then use the Field.Text(String, String) method or the
appropriate constructor when adding a field to the Document.  You'll
need to also store a reference to the actual file (URL, Path, etc) in
the document so it can be retrieved from the doc returned in the Hits
object.

Or did I completely misunderstand the question?

-Mike

On Wed, 22 Dec 2004 17:23:24 +0100, DES [EMAIL PROTECTED] wrote:
 hi
 
 I need to index my text so that the index contains only tokenized, stemmed
 words, without stopwords etc. The text is German, so I tried to use
 GermanAnalyzer, but it stores the whole text, not terms. Please give me a tip
 on how to index terms only. Thanks!
 
 DES





Re: Indexing terms only

2004-12-22 Thread Mike Snare
I've never used the German analyzer, so I don't know what stop words
it defines/uses.  Someone else will have to answer that.  Sorry.

On Wed, 22 Dec 2004 17:45:17 +0100, DES [EMAIL PROTECTED] wrote:
 I actually use Field.Text(String,String) to add documents to my index. Maybe
 I do not understand the way an analyzer works, but I thought that all German
 articles (der, die, das etc.) should be filtered out. However, if I use Luke
 to view my index, the original text is completely stored in the field. And
 what I need is a term vector that I can create from an indexed document
 field. So this field should contain terms only.
 
  Whether or not the text is stored in the index is a different concern
  than how it is analyzed.  If you want the text to be indexed, and not
  stored, then use the Field.Text(String, String) method or the
  appropriate constructor when adding a field to the Document.  You'll
  need to also store a reference to the actual file (URL, Path, etc) in
  the document so it can be retrieved from the doc returned in the Hits
  object.
 
  Or did I completely misunderstand the question?
 
  -Mike
 
  On Wed, 22 Dec 2004 17:23:24 +0100, DES [EMAIL PROTECTED] wrote:
  hi
 
  I need to index my text so that the index contains only tokenized, stemmed
  words, without stopwords etc. The text is German, so I tried to use
  GermanAnalyzer, but it stores the whole text, not terms. Please give me a
  tip on how to index terms only. Thanks!
 
  DES
 
 
 
 





Re: Indexing terms only

2004-12-22 Thread Mike Snare
Thanks for correcting me.  I use the reader version -- hence my confusion.
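
For anyone following along, here's a minimal sketch of the distinction Erik
describes below (Lucene 1.4 API; the field names are just examples):

 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;

 public class FieldDemo
 {
  public static void main(String[] args)
  {
   Document doc = new Document();

   // Field.Text(String, String): stored, indexed, and tokenized.
   doc.add(Field.Text("title", "Half-Baked Ideas"));

   // Field.UnStored(String, String): indexed and tokenized, NOT stored.
   doc.add(Field.UnStored("body", "full plain text goes here"));

   // Field.Text(String, Reader): indexed and tokenized, NOT stored --
   // the inconsistency Erik mentions below.
   doc.add(Field.Text("contents", new java.io.StringReader("more text")));

   System.out.println(doc);
  }
 }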

-Mike

On Wed, 22 Dec 2004 11:53:31 -0500, Erik Hatcher
[EMAIL PROTECTED] wrote:
 
 On Dec 22, 2004, at 11:36 AM, Mike Snare wrote:
   Whether or not the text is stored in the index is a different concern
   than how it is analyzed.  If you want the text to be indexed, and not
  stored, then use the Field.Text(String, String) method
 
 Correction: Field.Text(String, String) is a stored field.  If you want
 unstored, use Field.UnStored(String, String).
  This is a bit confusing because Field.Text(String, Reader) is not
  stored.  This confusion has been cleared up in the CVS version of
  Lucene: these methods will be deprecated in the 1.9 release and removed
  in the 2.0 release.
 
 Erik
 
 





Re: retrieve tokens

2004-12-22 Thread Mike Snare
 But for the other issue on 'store lucene' vs 'store db': can anyone
 provide me with some field experience on size?
 The system I'm developing will provide searching through some 2000
 PDFs, say some 200 pages each. I feed the plain text into Lucene on a
 Field.UnStored basis. I also store this plain text in the database for
 the sole purpose of presenting a context snippet.

Why not store the snippet in another field that is stored, but not
indexed?  You could then retrieve the snippet immediately from the
doc.  This would only increase your index by num_docs * snippet_size
and would avoid the database access time and complexity.
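
A minimal sketch of that layout (Lucene 1.4 API; the field names, snippet
length, and file path are made up for illustration):

 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.index.IndexWriter;

 public class SnippetIndexer
 {
  public static void main(String[] args) throws Exception
  {
   IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);

   String fullText = "...plain text extracted from the PDF...";

   Document doc = new Document();
   // Searchable body: indexed and tokenized, but not stored.
   doc.add(Field.UnStored("body", fullText));
   // Display snippet: stored but not indexed, so it adds no terms.
   doc.add(Field.UnIndexed("snippet",
     fullText.substring(0, Math.min(200, fullText.length()))));
   // Reference back to the original file.
   doc.add(Field.Keyword("path", "/docs/report.pdf"));

   writer.addDocument(doc);
   writer.optimize();
   writer.close();
  }
 }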

-Mike




Re: Relevance percentage

2004-12-20 Thread Mike Snare
I'm still new to Lucene, but wouldn't that be the coord()?  My
understanding is that the coord() is the fraction of the boolean query
that matched a given document.

Again, I'm new, so somebody else will have to confirm or deny...
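
If that's right, here's a hedged sketch of one way to make the match
fraction visible (Lucene 1.4-era API): keep the default coord() -- which is
overlap/maxOverlap -- and flatten the other scoring factors.  Note the raw
score still isn't a literal percentage; this just makes coord the dominant
factor:

 import org.apache.lucene.search.DefaultSimilarity;

 // Keeps the default coord() (overlap / maxOverlap) but flattens tf, idf,
 // and the norms, so scores track the fraction of query terms matched.
 public class CoordOnlySimilarity extends DefaultSimilarity
 {
  public float tf(float freq) { return freq > 0 ? 1.0f : 0.0f; }
  public float idf(int docFreq, int numDocs) { return 1.0f; }
  public float lengthNorm(String fieldName, int numTerms) { return 1.0f; }
  public float queryNorm(float sumOfSquaredWeights) { return 1.0f; }
 }

 // Usage (hypothetical index path):
 //  IndexSearcher searcher = new IndexSearcher("index");
 //  searcher.setSimilarity(new CoordOnlySimilarity());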

-Mike


On Mon, 20 Dec 2004 00:33:21 -0800 (PST), Gururaja H
[EMAIL PROTECTED] wrote:
 How do I find out the percentage of matched terms in the document(s) using
 Lucene?
 Here is an example of what I am trying to do:
 The search query has 5 terms (ibm, risc, tape, drive, manual) and there are 4
 matching documents with the following attributes:
 Doc#1: contains terms (ibm, drive)
 Doc#2: contains terms (ibm, risc, tape, drive)
 Doc#3: contains terms (ibm, risc, tape, drive)
 Doc#4: contains terms (ibm, risc, tape, drive, manual).
 The percentages displayed would be 100% (Doc#4), 80% (Doc#2), 80% (Doc#3)
 and 40% (Doc#1).
 
 Any help on how to go about doing this?
 
 Thanks,
 Gururaja
 
 





Re: Why does the StandardTokenizer split hyphenated words?

2004-12-16 Thread Mike Snare
 Not if these words are spelling variations of the same concept, which
 doesn't seem unlikely.
 
   In addition, why do we assume that "a-1" is a typical product name but
   "a-b" isn't?
 
  Maybe for "a-b", but what about English words like "half-baked"?

Perhaps that's the difference in thinking, then.  I would imagine that
you would want to search on "half-baked" and not "half AND baked".

 Regards
 Daniel
 
 --
 http://www.danielnaber.de
 
 





Re: Why does the StandardTokenizer split hyphenated words?

2004-12-16 Thread Mike Snare
Absolutely, but -- correct me if I'm wrong -- it would give no higher
ranking to "half-baked" and would take a good deal longer on large
indices.


On Thu, 16 Dec 2004 20:03:27 +0100, Daniel Naber
[EMAIL PROTECTED] wrote:
 On Thursday 16 December 2004 13:46, Mike Snare wrote:
 
    Maybe for "a-b", but what about English words like "half-baked"?
  
   Perhaps that's the difference in thinking, then.  I would imagine that
   you would want to search on "half-baked" and not "half AND baked".
  
  A search for "half-baked" will find both "half-baked" and "half baked" (the
  phrase). The only thing you won't find is "halfbaked".
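 
  A quick way to see this for yourself -- a minimal sketch (Lucene 1.4-era
  API; the field name is arbitrary) that prints what the parser builds:
 
   import org.apache.lucene.analysis.standard.StandardAnalyzer;
   import org.apache.lucene.queryParser.QueryParser;
   import org.apache.lucene.search.Query;
 
   public class ParseDemo
   {
    public static void main(String[] args) throws Exception
    {
     Query q = QueryParser.parse("half-baked", "contents",
       new StandardAnalyzer());
     // StandardAnalyzer splits the hyphenated word, so this prints
     // the phrase query: "half baked"
     System.out.println(q.toString("contents"));
    }
   }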
 
 Regards
  Daniel
 
 --
 http://www.danielnaber.de
 
 





Re: Why does the StandardTokenizer split hyphenated words?

2004-12-15 Thread Mike Snare
  "a-1" is considered a typical product name that needs to stay unchanged
  (there's a comment in the source that mentions this). Indexing
  "hyphen-word" as two tokens has the advantage that it can then be found
  with the following queries:
  "hyphen-word" (will be turned into a phrase query internally)
  "hyphen word" (phrase query)
  (it cannot be found by searching for "hyphenword", however).

Sure.  But phrase queries are slower than a single-word query.  In my
case, using the standard analyzer prior to my modification caused a
single (hyphenated) word query to take upwards of 10 seconds (1M+
documents with ~400K terms).  The exact same search with the new
analyzer takes 0.5 seconds (granted, the new tokenization caused a
significant reduction in the number of terms).  Also, the phrase query
would place the same value on a doc that simply had the two words
adjacent as on a doc that had the hyphenated version, wouldn't it?
This seems odd.

In addition, why do we assume that "a-1" is a typical product name but
"a-b" isn't?

I am in no way second-guessing or suggesting a change.  It just doesn't
make sense to me, and I'm trying to understand.  It is very likely, as
is often the case, that this is just one of those things one has to
accept.




Why does the StandardTokenizer split hyphenated words?

2004-12-15 Thread Mike Snare
I am writing a tool that uses Lucene, and I immediately ran into a
problem searching for words that contain internal hyphens (dashes).
After looking at the StandardTokenizer, I saw that this is because
there is no rule that will match ALPHA P ALPHA or ALPHANUM P
ALPHANUM.  Based on what I can tell from the source, every other
term in a word containing any of the following (.,/-_) must contain at
least one digit.
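
For anyone who wants to see the behavior directly, a minimal sketch (Lucene
1.4-era API; the field name is arbitrary) that prints the tokens the
StandardAnalyzer produces:

 import java.io.StringReader;
 import org.apache.lucene.analysis.Token;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.standard.StandardAnalyzer;

 public class TokenDemo
 {
  public static void main(String[] args) throws Exception
  {
   StandardAnalyzer analyzer = new StandardAnalyzer();
   TokenStream stream = analyzer.tokenStream("contents",
     new StringReader("half-baked a-1"));
   Token token;
   // Prints: half, baked, a-1 -- the all-letter word is split,
   // the letter-digit word is kept whole.
   while ((token = stream.next()) != null)
   {
    System.out.println(token.termText());
   }
  }
 }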

I was wondering if someone could shed some light on why it was deemed
necessary to prevent indexing a word like 'word-with-hyphen' without
first splitting it into its constituent parts.  The only reason I can
think of (and the only one I've found) is to handle hyphenated words
at line breaks, although my first thought would be that this would be
undesired behavior, since a word that was broken due to a line break
should actually be reconstructed, and not split.

In my case, the words are keywords that must remain as is, searchable
with the hyphen in place.  It was easy enough to modify the tokenizer
to do what I need, so I'm not really asking for help there.  I'm
really just curious as to why it is that "a-1" is considered a single
token, but "a-b" is split.

Anyone care to elaborate?

Thanks,
-Mike
