Re: Problem finding similar documents with MoreLikeThis method.

2006-07-21 Thread Martin Braun
Hello, inspired by this thread, I also tried to implement a MoreLikeThis search, but I have the same problem of a null query. I did set the field name to a field that is stored in the index, but like() just returns null. Here is my code: Hits hits = this.is.search(new

Re: Problem finding similar documents with MoreLikeThis method.

2006-07-21 Thread mark harwood
Does your index use StandardAnalyzer? Are your fields stored (Field.Store.YES)? MoreLikeThis uses StandardAnalyzer by default to read the stored content from the example doc, which may produce tokens that do not match those of the indexed content. Use setAnalyzer() to ensure they are in sync.

Re: Problem finding similar documents with MoreLikeThis method.

2006-07-21 Thread Martin Braun
hi mark, Does your index use StandardAnalyzer? Are your fields stored (Field.Store.YES)? Thanks! That was the hint in the right direction: the field was stored but not indexed: titleDocument.add(new Field("kurz", title.getKurz(), Field.Store.YES, Field.Index.NO)); (That was the field for the
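A minimal sketch of the fix the thread converges on, not code from the thread verbatim: the field must be indexed (tokenized), not just stored, and the analyzer must match the one used at index time. The class and method names are mine; the field name "kurz" is the one from Martin's mail, and the contrib MoreLikeThis class (org.apache.lucene.search.similar.MoreLikeThis) from Lucene 1.9/2.0 is assumed.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.similar.MoreLikeThis;

public class MltSketch {

    // At index time: Index.TOKENIZED instead of Index.NO, otherwise
    // MoreLikeThis has no terms to build a similarity query from.
    static void addTitleField(Document titleDocument, String kurz) {
        titleDocument.add(new Field("kurz", kurz,
                Field.Store.YES, Field.Index.TOKENIZED));
    }

    // At query time: keep the analyzer in sync with the indexing analyzer.
    static Query similarTo(IndexReader reader, Analyzer indexAnalyzer,
                           int docId) throws java.io.IOException {
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[] { "kurz" });
        mlt.setAnalyzer(indexAnalyzer);  // avoid token mismatches
        return mlt.like(docId);          // query built from the doc's terms
    }
}
```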

Where to find drill-down examples (source code)

2006-07-21 Thread Martin Braun
Hello all, I want to realize a drill-down function, aka narrow search, aka refine search. I want to have something like: Refine by Date: * 1990-2000 (30 docs) * 2001-2003 (200 docs) * 2004-2006 (10 docs) But not only date ranges but also other categories. What I have found in the
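The counting side of such a drill-down can be sketched with plain java.util.BitSet, which is essentially what a cached Lucene QueryFilter of that era hands you per category: intersect the bits of the current result set with each category's bits and report the cardinality. A minimal sketch under that assumption (class and method names are mine, not from the thread):

```java
import java.util.BitSet;

public class DrillDownCount {

    // result: docs matching the user's current query (one bit per doc id).
    // category: docs matching one refinement, e.g. the 1990-2000 date range
    // (in Lucene, typically the cached bits of a QueryFilter/RangeFilter).
    static int refineCount(BitSet result, BitSet category) {
        BitSet both = (BitSet) result.clone(); // don't clobber the result set
        both.and(category);
        return both.cardinality();             // the "(30 Docs)" style count
    }
}
```

Computing one such count per category is what produces the "Refine by Date" list; the same intersection works for any other category whose bits are precomputed.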

StandardAnalyzer question

2006-07-21 Thread Ngo, Anh (ISS Southfield)
Hello, the Lucene 2.0.0 StandardAnalyzer treats the _ (underscore) as a token separator. Is there a way I can make StandardAnalyzer not tokenize on _ or any other given character? I'd like to keep all the features that StandardAnalyzer has but modify it a bit for my needs. How do I control what

Re: Where to find drill-down examples (source code)

2006-07-21 Thread Miles Barr
Martin Braun wrote: I want to realize a drill-down function, aka narrow search, aka refine search. I want to have something like: Refine by Date: * 1990-2000 (30 docs) * 2001-2003 (200 docs) * 2004-2006 (10 docs) But not only date ranges but also other categories. What I have found in the

Re: BooleanQuery question

2006-07-21 Thread Paul Borgermans
Hi, you can't have a BooleanQuery containing only MUST_NOT clauses, which is what (-(FILE:abstract.htm)) is: it matches no documents, so the mandatory qualification on it causes the query to fail for all docs. This is true for search queries, but it makes sense in a query filter IMHO. I
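A common workaround in the search (as opposed to filter) case is to anchor the purely negative clause to a MatchAllDocsQuery, available since Lucene 1.9, so the MUST_NOT clause has something to subtract from. A sketch, reusing the FILE field from the thread (the wrapper class and method are mine):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TermQuery;

public class NotQuerySketch {

    // Equivalent of "everything except FILE:abstract.htm": the MUST clause
    // matches all documents, and the MUST_NOT clause then excludes the term.
    static BooleanQuery allExcept(String field, String value) {
        BooleanQuery bq = new BooleanQuery();
        bq.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
        bq.add(new TermQuery(new Term(field, value)),
               BooleanClause.Occur.MUST_NOT);
        return bq;
    }
}
```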

Re: Where to find drill-down examples (source code)

2006-07-21 Thread Ken Krugler
Hello all, I want to realize a drill-down function, aka narrow search, aka refine search. I want to have something like: Refine by Date: * 1990-2000 (30 docs) * 2001-2003 (200 docs) * 2004-2006 (10 docs) But not only date ranges but also other categories. What I have found in the

Fastest Method for Searching (need all results)

2006-07-21 Thread Ryan O'Hara
My index contains approximately 5 million documents. During a search, I need to grab the value of a field for every document in the result set. I am currently using a HitCollector to search. Below is my code: searcher.search(query, new HitCollector(){ public
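One way to avoid per-hit stored-field I/O entirely is FieldCache, which loads the field once into an array indexed by doc id; this assumes the field is indexed with a single untokenized term per document. A hedged sketch against the Lucene 2.0-era API (method and variable names are illustrative):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class CollectAllValues {

    static List collect(IndexSearcher searcher, IndexReader reader,
                        Query query, String field) throws IOException {
        // Loaded once per reader, then cached; one String slot per doc id.
        final String[] values = FieldCache.DEFAULT.getStrings(reader, field);
        final List out = new ArrayList();
        searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                out.add(values[doc]); // array lookup, no stored-field read
            }
        });
        return out;
    }
}
```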

Re: StandardAnalyzer question

2006-07-21 Thread Daniel Naber
On Friday, 21 July 2006 16:16, Ngo, Anh (ISS Southfield) wrote: The Lucene 2.0.0 StandardAnalyzer treats the _ (underscore) as a token separator. Is there a way I can make StandardAnalyzer not tokenize on _ or any other given character? You need to add _ to the #LETTER definition in

Re: Fastest Method for Searching (need all results)

2006-07-21 Thread Otis Gospodnetic
I haven't had the chance to use this new feature yet, but have you tried selective field loading, so that you load only that one field from your index and not all of them? Otis

Re: Fastest Method for Searching (need all results)

2006-07-21 Thread Mark Miller
Ryan O'Hara wrote: My index contains approximately 5 million documents. During a search, I need to grab the value of a field for every document in the result set. I am currently using a HitCollector to search. Below is my code: searcher.search(query, new HitCollector(){

Re: Where to find drill-down examples (source code)

2006-07-21 Thread Mark Miller
Ken Krugler wrote: Hello all, I want to realize a drill-down function, aka narrow search, aka refine search. I want to have something like: Refine by Date: * 1990-2000 (30 docs) * 2001-2003 (200 docs) * 2004-2006 (10 docs) But not only date ranges but also other categories. What I have

Re: Fastest Method for Searching (need all results)

2006-07-21 Thread Mark Miller
It provides a new API, IndexReader.document(int doc, String[] fields). A document containing only the specified fields is created. The other fields of the document are not loaded, although unfortunately uncompressed strings still have to be scanned, because the length information in the index
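At the time of this thread the feature lived in a patch; it later shipped as the FieldSelector API (Lucene 2.1), where MapFieldSelector restricts loading to named fields. A sketch against that later API, under that assumption (the wrapper and field name are illustrative):

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.MapFieldSelector;
import org.apache.lucene.index.IndexReader;

public class SelectiveLoad {

    static String loadOneField(IndexReader reader, int docId, String field)
            throws IOException {
        // Only the named field is materialized; other stored fields are
        // skipped (uncompressed strings may still be scanned past, as the
        // thread notes).
        Document doc = reader.document(docId,
                new MapFieldSelector(new String[] { field }));
        return doc.get(field);
    }
}
```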

Re: Fastest Method for Searching (need all results)

2006-07-21 Thread Ryan O'Hara
I haven't had the chance to use this new feature yet, but have you tried with selective field loading, so that you can load only that 1 field from your index and not all of them? I have not tried selective field loading, but it sounds like a good idea. What class is it in? Any more

RE: StandardAnalyzer question

2006-07-21 Thread Ngo, Anh (ISS Southfield)
What is the #LETTER definition in StandardTokenizer.jj? I saw: | #P: (_|-|/|.|,) | #HAS_DIGIT: // at least one digit (LETTER|DIGIT)* DIGIT (LETTER|DIGIT)* Should I remove _ and recompile the source code? Sincerely, Anh Ngo

Re: StandardAnalyzer question

2006-07-21 Thread Mark Miller
I do not believe so. If you look above you will see that #P is only used when looking for a NUM: a host, an IP, a phone number, etc. You would be removing the ability to recognize a _ while rooting those tokens out. It will still be parsed when tokenizing an EMAIL as well. I don't think this is the

RE: StandardAnalyzer question

2006-07-21 Thread Ngo, Anh (ISS Southfield)
Hello Mark, please show me how to add - to the #LETTER definition. Thanks, Anh Ngo

Re: StandardAnalyzer question

2006-07-21 Thread Mark Miller
I take it back. It's probably exactly what you want. Watch out if you're not compiling all of Lucene... you need to avoid a ParserException using ant if you try to just extract the StandardAnalyzer package (the recommended approach). On 7/21/06, Mark Miller [EMAIL PROTECTED] wrote: I do not

Re: StandardAnalyzer question

2006-07-21 Thread Mark Miller
| #LETTER: // unicode letters [ \u0041-\u005a, \u0061-\u007a, \u00c0-\u00d6, \u00d8-\u00f6, \u00f8-\u00ff, \u0100-\u1fff ] becomes | #LETTER: // unicode letters [ \u0041-\u005a,

Re: Fastest Method for Searching (need all results)

2006-07-21 Thread eks dev
Have you tried to only collect doc ids and see if the speed problem is there, or maybe to fetch only the field values? If you have dense results, it can easily be split() or addSymbolsToHash() that takes the time. I see three possibilities for what could be slow: getting doc ids, fetching field values, or

Re: StandardAnalyzer question

2006-07-21 Thread Doron Cohen
\u002d would add '-'. The original request was for '_': \u005f. Mark Miller [EMAIL PROTECTED] wrote on 21/07/2006 13:09:28: | #LETTER: // unicode letters [ \u0041-\u005a, \u0061-\u007a, \u00c0-\u00d6, \u00d8-\u00f6, \u00f8-\u00ff,
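Put together, the edit the thread converges on looks roughly like this in StandardTokenizer.jj. This is a reconstruction, not the thread's exact text: the JavaCC angle brackets and quoting are assumed (the archive strips them), \u005f is '_', and javacc must be re-run on the .jj file afterwards.

```java
| <#LETTER:      // unicode letters, plus underscore
    [ "\u0041"-"\u005a",
      "\u005f",          // '_' added so it no longer splits tokens
      "\u0061"-"\u007a",
      "\u00c0"-"\u00d6",
      "\u00d8"-"\u00f6",
      "\u00f8"-"\u00ff",
      "\u0100"-"\u1fff"
    ]
  >
```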

RE: StandardAnalyzer question

2006-07-21 Thread Ngo, Anh (ISS Southfield)
I did try it and recompiled the whole package, but it did not work. My #LETTER is: | #LETTER: // unicode letters [ \u0041-\u005a, \u005f, \u0061-\u007a, \u00c0-\u00d6, \u00d8-\u00f6, \u00f8-\u00ff,

Re: StandardAnalyzer question

2006-07-21 Thread Mark Miller
Ngo, Anh (ISS Southfield) wrote: I did try it and recompiled the whole package, but it did not work. My #LETTER is: | #LETTER: // unicode letters [ \u0041-\u005a, \u005f, \u0061-\u007a, \u00c0-\u00d6, \u00d8-\u00f6,

RE: StandardAnalyzer question

2006-07-21 Thread Ngo, Anh (ISS Southfield)
It works now. Thank you very much. I forgot to run javacc on StandardTokenizer.jj. Sincerely, Anh Ngo

RE: Performance question

2006-07-21 Thread Scott Smith
Interesting, and thanks for the answer. I guess I won't write code to control the order clauses get added -- one less thing to do :-)