Re: Check to see if index is optimized
> If an index has no deletions, it does not need to be optimized. You can find out if it has deletions with IndexReader.hasDeletions().

Is that true? An index that has just been created (with no deletions) can still have multiple segments that could be optimized. I'm not sure your statement is correct.

-Mike

On Fri, 07 Jan 2005 14:22:23 -0600, Luke Francl [EMAIL PROTECTED] wrote:
> On Fri, 2005-01-07 at 13:24, Crump, Michael wrote:
> > Is there a simple way to check and see if an index is already optimized? What happens if optimize() is called on an already optimized index? Does the call basically amount to a noop, or is it still an expensive call?
>
> If an index has no deletions, it does not need to be optimized. You can find out if it has deletions with IndexReader.hasDeletions(). I am not sure what the cost of optimization is if the index doesn't need it. Perhaps someone else on this list knows.
>
> Regards,
> Luke Francl
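A minimal sketch of the check under discussion, assuming Lucene 1.4-era APIs (the class name and index directory are invented for illustration). As noted above, hasDeletions() alone is not a complete test, since a freshly built index with no deletions can still consist of multiple segments:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class OptimizeIfNeeded {
    public static void main(String[] args) throws Exception {
        String indexDir = "index";

        // Check for deletions before deciding to optimize.
        IndexReader reader = IndexReader.open(indexDir);
        boolean hasDeletions = reader.hasDeletions();
        reader.close();

        if (hasDeletions) {
            // Open the existing index (create == false) and merge it
            // down to a single segment.
            IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
            writer.optimize();
            writer.close();
        }
    }
}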
Re: Check to see if index is optimized
Based on the method sent earlier, it looks like Lucene first checks to see if optimization is even necessary.
Re: Search not working properly. Bug !!!!!!
You appear to be searching for the word Engineer in the name field. Shouldn't this query be directed at the designation field? The only terms in the name field would be Ebrahim, Faisal, John, and Smith, wouldn't they?

On Thu, 30 Dec 2004 22:06:46 +0530, Mohamed Ebrahim Faisal [EMAIL PROTECTED] wrote:
> Hi all
>
> I have written a simple program to test indexing and search. After indexing a couple of documents, I searched for the same, but I didn't get successful matches. I don't know whether it is a bug in Lucene or in the code; I have enclosed the code for your review. When I used Lucene for bigger applications (where the index contains larger documents), search worked amazingly.
>
> Following is the code which didn't work properly:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.Searcher;

public class testLucene {

    private static final String[] strSTOP_WORDS = { "and", "are", "was", "will", "with" };

    private void test() throws Exception {
        Analyzer objAnalyzer = new StandardAnalyzer();
        IndexWriter index = new IndexWriter("index", objAnalyzer, true);
        Searcher objIndexSearcher = new IndexSearcher("index");

        Document d = new Document();
        d.add(Field.Text("name", "Ebrahim Faisal"));
        d.add(Field.Text("address", "New York"));
        d.add(Field.Text("designation", "Software Engineer"));
        d.add(Field.Text("xyz", "123 IndexWriter index"));
        index.addDocument(d);

        d = new Document();
        d.add(Field.Text("name", "John Smith"));
        d.add(Field.Text("address", "India"));
        d.add(Field.Text("designation", "Sr. Software Engineer"));
        d.add(Field.Text("xyz", "456 StandardAnalyzer true"));
        index.addDocument(d);

        index.optimize();
        index.close();

        Query objQuery = QueryParser.parse("Engineer", "name", objAnalyzer);
        Hits objHits = objIndexSearcher.search(objQuery);
        for (int nStart = 0; nStart < objHits.length(); nStart++) {
            d = objHits.doc(nStart);
            System.out.println("address " + d.get("address"));
        }
    }

    public static void main(String[] args) throws Exception {
        new testLucene().test();
    }
}
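For what it's worth, a minimal sketch of the change suggested above, reusing the variable names from the posted code: direct the query at the designation field, where Engineer was actually indexed. Note also (an observation beyond the reply above) that the posted code opens the searcher before any documents are added; an IndexSearcher sees the index as it existed when the searcher was opened, so it should be opened after the writer is closed:

index.optimize();
index.close();

// Open the searcher only after the writer has committed and closed.
Searcher objIndexSearcher = new IndexSearcher("index");

// Search the designation field, which contains the term "engineer"
// (StandardAnalyzer lowercases "Engineer" at query time as well).
Query objQuery = QueryParser.parse("Engineer", "designation", objAnalyzer);
Hits objHits = objIndexSearcher.search(objQuery);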
Re: Indexing terms only
Whether or not the text is stored in the index is a different concern than how it is analyzed. If you want the text to be indexed but not stored, then use the Field.Text(String, String) method or the appropriate constructor when adding a field to the Document. You'll also need to store a reference to the actual file (URL, path, etc.) in the document so it can be retrieved from the doc returned in the Hits object. Or did I completely misunderstand the question?

-Mike

On Wed, 22 Dec 2004 17:23:24 +0100, DES [EMAIL PROTECTED] wrote:
> hi
>
> I need to index my text so that the index contains only tokenized, stemmed words, without stopwords etc. The text is German, so I tried to use GermanAnalyzer, but it stores the whole text, not terms. Please give me a tip on how to index terms only. Thanks!
>
> DES
Re: Indexing terms only
I've never used the GermanAnalyzer, so I don't know what stop words it defines/uses. Someone else will have to answer that. Sorry.

On Wed, 22 Dec 2004 17:45:17 +0100, DES [EMAIL PROTECTED] wrote:
> I actually use Field.Text(String, String) to add documents to my index. Maybe I do not understand the way an analyzer works, but I thought that all German articles (der, die, das, etc.) should be filtered out. However, if I use Luke to view my index, the original text is completely stored in the field. What I need is a term vector that I can create from an indexed document field, so this field should contain terms only.
>
> > Whether or not the text is stored in the index is a different concern than how it is analyzed. If you want the text to be indexed but not stored, then use the Field.Text(String, String) method or the appropriate constructor when adding a field to the Document. You'll also need to store a reference to the actual file (URL, path, etc.) in the document so it can be retrieved from the doc returned in the Hits object. Or did I completely misunderstand the question?
> >
> > -Mike
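The confusion can be seen directly by printing what the analyzer actually produces. A small sketch, assuming Lucene 1.4-era APIs (the field name and sample sentence are invented): analysis determines the terms that get indexed, while storage keeps the original string untouched, which is why Luke shows the full text in a stored field.

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.de.GermanAnalyzer;

public class ShowGermanTokens {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new GermanAnalyzer();
        TokenStream stream = analyzer.tokenStream("contents",
                new StringReader("Der Hund und die Katze"));
        // Prints only the indexed terms: articles and other stopwords
        // (der, und, die) are dropped; the rest are lowercased and stemmed.
        for (Token t = stream.next(); t != null; t = stream.next()) {
            System.out.println(t.termText());
        }
    }
}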
Re: Indexing terms only
Thanks for correcting me. I use the Reader version, hence my confusion.

-Mike

On Wed, 22 Dec 2004 11:53:31 -0500, Erik Hatcher [EMAIL PROTECTED] wrote:
> On Dec 22, 2004, at 11:36 AM, Mike Snare wrote:
> > Whether or not the text is stored in the index is a different concern than how it is analyzed. If you want the text to be indexed, and not stored, then use the Field.Text(String, String) method
>
> Correction: Field.Text(String, String) is a stored field. If you want unstored, use Field.UnStored(String, String). This is a bit confusing because Field.Text(String, Reader) is not stored. This confusion has been cleared up in the CVS version of Lucene: these methods will be deprecated in the 1.9 release and removed in the 2.0 release.
>
> Erik
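To summarize the distinction in code, a minimal sketch using the Lucene 1.4-era Field factory methods (field names and path are invented for illustration):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldKinds {
    public static Document example() {
        Document doc = new Document();
        // Indexed AND stored: analyzed into terms, original string kept.
        doc.add(Field.Text("title", "indexing terms only"));
        // Indexed but NOT stored: searchable terms only.
        doc.add(Field.UnStored("contents", "the full body text goes here"));
        // Stored but NOT indexed: kept verbatim for display, not searchable.
        doc.add(Field.UnIndexed("path", "/data/docs/example.txt"));
        return doc;
    }
}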
Re: retrieve tokens
> But for the other issue of 'store in Lucene' vs. 'store in db': can anyone provide me with some field experience on size? The system I'm developing will provide searching through some 2000 PDFs, say some 200 pages each. I feed the plain text into Lucene on a Field.UnStored basis. I also store this plain text in the database for the sole purpose of presenting a context snippet.

Why not store the snippet in another field that is stored, but not indexed? You could then immediately retrieve the snippet from the doc. This would only increase your index by num_docs * size_snippet, and would save the db access time and complexity.

-Mike
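A minimal sketch of that suggestion, assuming the snippet is precomputed at index time (the field names and helper class are invented for illustration):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class PdfDocBuilder {
    public static Document build(String fullText, String snippet, String pdfPath) {
        Document doc = new Document();
        doc.add(Field.UnStored("contents", fullText));  // searchable, not stored
        doc.add(Field.UnIndexed("snippet", snippet));   // stored for display only
        doc.add(Field.Keyword("path", pdfPath));        // stored, indexed as one term
        return doc;
    }
}

At search time, hits.doc(i).get("snippet") then returns the snippet with no database round trip.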
Re: Relevance percentage
I'm still new to Lucene, but wouldn't that be the coord()? My understanding is that coord() is the fraction of the boolean query that matched a given document. Again, I'm new, so somebody else will have to confirm or deny...

-Mike

On Mon, 20 Dec 2004 00:33:21 -0800 (PST), Gururaja H [EMAIL PROTECTED] wrote:
> How do I find out the percentage of matched terms in the document(s) using Lucene? Here is an example of what I am trying to do. The search query has 5 terms (ibm, risc, tape, drive, manual) and there are 4 matching documents with the following attributes:
>
> Doc#1: contains terms (ibm, drive)
> Doc#2: contains terms (ibm, risc, tape, drive)
> Doc#3: contains terms (ibm, risc, tape, drive)
> Doc#4: contains terms (ibm, risc, tape, drive, manual)
>
> The percentages displayed would be 100% (Doc#4), 80% (Doc#2), 80% (Doc#3) and 40% (Doc#1). Any help on how to go about doing this?
>
> Thanks,
> Gururaja
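One brute-force way to get exactly the percentages in the example, without touching the scoring internals: run one TermQuery per query term and count, per document, how many of the terms hit. A sketch assuming Lucene 1.4-era APIs (the class and method names are invented; it does one search per term, so it is not cheap for large term sets):

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class MatchPercentage {
    // Returns a map of document id (Integer) to percentage matched (Float).
    public static Map percentages(IndexSearcher searcher, String field, String[] terms)
            throws Exception {
        Map counts = new HashMap();
        for (int i = 0; i < terms.length; i++) {
            Hits hits = searcher.search(new TermQuery(new Term(field, terms[i])));
            for (int j = 0; j < hits.length(); j++) {
                Integer id = new Integer(hits.id(j));
                Integer prev = (Integer) counts.get(id);
                counts.put(id, new Integer(prev == null ? 1 : prev.intValue() + 1));
            }
        }
        // Convert raw counts into percentages of the total query terms.
        Map percents = new HashMap();
        for (Iterator it = counts.keySet().iterator(); it.hasNext();) {
            Integer id = (Integer) it.next();
            int matched = ((Integer) counts.get(id)).intValue();
            percents.put(id, new Float(100f * matched / terms.length));
        }
        return percents;
    }
}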
Re: Why does the StandardTokenizer split hyphenated words?
Not if these words are spelling variations of the same concept, which doesn't seem unlikely.

> In addition, why do we assume that a-1 is a typical product name but a-b isn't?

> Maybe for a-b, but what about English words like half-baked? Perhaps that's the difference in thinking, then. I would imagine that you would want to search on half-baked and not half AND baked.

Regards
 Daniel

--
http://www.danielnaber.de
Re: Why does the StandardTokenizer split hyphenated words?
Absolutely, but -- correct me if I'm wrong -- it would give no higher ranking to half-baked, and would take a good deal longer on large indices.

On Thu, 16 Dec 2004 20:03:27 +0100, Daniel Naber [EMAIL PROTECTED] wrote:
> On Thursday 16 December 2004 13:46, Mike Snare wrote:
> > Maybe for a-b, but what about English words like half-baked? Perhaps that's the difference in thinking, then. I would imagine that you would want to search on half-baked and not half AND baked.
>
> A search for half-baked will find both half-baked and half baked (the phrase). The only thing you'll not find is halfbaked.
>
> Regards
>  Daniel
>
> --
> http://www.danielnaber.de
Re: Why does the StandardTokenizer split hyphenated words?
> a-1 is considered a typical product name that needs to remain unchanged (there's a comment in the source that mentions this). Indexing hyphen-word as two tokens has the advantage that it can then be found with the following queries: hyphen-word (which will be turned into a phrase query internally) and hyphen word (a phrase query). It cannot be found by searching for hyphenword, however.

Sure. But phrase queries are slower than a single-word query. In my case, using the StandardAnalyzer prior to my modification caused a single (hyphenated) word query to take upwards of 10 seconds (1M+ documents with ~400K terms). The exact same search with the new analyzer takes 0.5 seconds (granted, the new tokenization caused a significant reduction in the number of terms).

Also, the phrase query would place the same value on a doc that simply had the two words as on a doc that had the hyphenated version, wouldn't it? That seems odd. In addition, why do we assume that a-1 is a typical product name but a-b isn't?

I am in no way second-guessing or suggesting a change. It just doesn't make sense to me, and I'm trying to understand. It is very likely, as is often the case, that this is just one of those things one has to accept.
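The phrase-query behavior Daniel describes is easy to verify. A small sketch, assuming Lucene 1.4-era APIs (the field name is invented): QueryParser hands half-baked to the analyzer, which splits it in two, and the parser then builds a phrase query from the pieces.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class HyphenQueryCheck {
    public static void main(String[] args) throws Exception {
        // StandardAnalyzer splits half-baked into the tokens half and baked,
        // so the parsed query comes back as the phrase query "half baked".
        Query q = QueryParser.parse("half-baked", "contents", new StandardAnalyzer());
        System.out.println(q.toString("contents"));
    }
}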
Why does the StandardTokenizer split hyphenated words?
I am writing a tool that uses Lucene, and I immediately ran into a problem searching for words that contain internal hyphens (dashes). After looking at the StandardTokenizer, I saw that it was because there is no rule that will match ALPHA P ALPHA or ALPHANUM P ALPHANUM. Based on what I can tell from the source, every other term in a word containing any of the following (.,/-_) must contain at least one digit.

I was wondering if someone could shed some light on why it was deemed necessary to prevent indexing a word like 'word-with-hyphen' without first splitting it into its constituent parts. The only reason I can think of (and the only one I've found) is to handle hyphenated words at line breaks, although my first thought would be that this is undesired behavior, since a word that was broken due to a line break should actually be reconstructed, not split.

In my case, the words are keywords that must remain as is, searchable with the hyphen in place. It was easy enough to modify the tokenizer to do what I need, so I'm not really asking for help there. I'm really just curious as to why it is that a-1 is considered a single token, but a-b is split. Anyone care to elaborate?

Thanks,
-Mike
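For anyone with the same requirement who would rather not patch the StandardTokenizer grammar, one alternative is an analyzer that never splits on hyphens in the first place. A minimal sketch, assuming Lucene 1.4-era APIs (this is not Mike's actual modification, and unlike StandardTokenizer it also leaves other punctuation attached to tokens):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class KeepHyphensAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Split on whitespace only, so word-with-hyphen stays one token.
        TokenStream stream = new WhitespaceTokenizer(reader);
        stream = new LowerCaseFilter(stream);
        stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);
        return stream;
    }
}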