RE: Plural word search

2007-03-09 Thread DECAFFMEYER MATHIEU
I needed this myself not long ago. Here is a piece of code to get an Analyzer that uses a tokenizer and an English stemmer (for "bears" it will also return "bear", and vice versa): private static Analyzer createEnglishAnalyzer() { return new Analyzer() { public TokenStream tokenSt

RE: indexing pdfs

2007-03-09 Thread Kainth, Sachin
Hi Ashwin, Well, in that case you might need to use IFilters some other way instead of through SeekAFile. I don't know how, since I haven't used it myself. Perhaps someone else here has. Sachin -Original Message- From: ashwin kumar [mailto:[EMAIL PROTECTED] Sent: 09 March 2007 02:48 To:

Re: Index a source, but not store it... can it be done?

2007-03-09 Thread John Haxby
Chris Hostetter wrote: i'm not a crypto expert, but i imagine it would probably take the same amount of statistical guesswork to reconstruct meaningful info from either approach (hashing the individual words compared to eliminating the positions) so i would think the trade-off of supporting phrase

RE: FieldCache: flush cache explicitly

2007-03-09 Thread Ramana Jelda
Yep, I strongly support John. I know when I reopen indexes, so why wait for the garbage collector? On top of that, FieldCache uses a WeakHashMap, which may lead to memory leaks. Jelda > -Original Message- > From: John Wang [mailto:[EMAIL PROTECTED] > Sent: Friday, March

WilcardQuery and memory

2007-03-09 Thread Joe
Hi, Here we use Lucene to index our emails, currently 500,000 documents. When searching the body with a WildcardQuery, problems arise. I did some profiling with JProfiler. I see that the more BooleanClause instances are used, the more memory is required during search. Most memory is used by instances

pdf,.doc,.xls,.ppt indexing in lucene

2007-03-09 Thread ashwin kumar
Hi all, I have tried indexing .txt files using Lucene and it's working fine. Now I want to index .doc, .pdf, .xls and .ppt files with Lucene. Can someone help with that? Thanks, regards, ashwin

RE: WilcardQuery and memory

2007-03-09 Thread Rob Staveley (Tom)
For indexing e-mail, I recommend that you tokenise the e-mail addresses into fragments and query on the fragments as whole terms rather than using wildcards. Rather than looking for fischauto333* in (say) smtp-from, look for fischauto333 in (say) an additional field called smtp-from-fragments to
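The fragment idea can be sketched in plain Java, without Lucene. This is only an illustration under assumptions: the class and method names are hypothetical, the domain `example.com` is made up for the demo, and splitting on every non-alphanumeric character is one possible choice rather than the scheme the poster necessarily used. The resulting fragments would go into an extra field (such as the smtp-from-fragments field mentioned in the message) so they can be matched as whole terms instead of with a wildcard.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: breaks an e-mail address into whole-term fragments.
final class EmailFragments {

    // Lower-case the address and split it on any run of characters that is
    // not a letter or digit; empty pieces (e.g. from a leading '.') are dropped.
    static List<String> fragments(String address) {
        List<String> out = new ArrayList<>();
        for (String part : address.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // example.com is an assumed domain, used only for illustration.
        System.out.println(fragments("fischauto333@example.com"));
    }
}
```

A query for the term fischauto333 against the fragments field is then an ordinary term lookup, so it does not expand into many BooleanClause instances the way a wildcard does.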

Re: WilcardQuery and memory

2007-03-09 Thread Joe
Hi Rob, For indexing e-mail, I recommend that you tokenise the e-mail addresses into fragments and query on the fragments as whole terms rather than using wildcards. [example] Hm, for email addresses this isn't a big problem here. The real problem is the query on the body part of an email, wh

Re: WilcardQuery and memory

2007-03-09 Thread Erick Erickson
You can also use a filter. The basic idea is to construct a Lucene Filter, probably using something like RegexTermEnum/TermDocs. It's faster than you think. This, in combination with ConstantScoreQuery, should fix you right up. Several things: 1> you lose scoring with the filter part of a query w

Re: pdf,.doc,.xls,.ppt indexing in lucene

2007-03-09 Thread Grant Ingersoll
Search the archive, read the FAQ (see link in my signature). On Mar 9, 2007, at 7:20 AM, ashwin kumar wrote: hi all i have tried indexing .txt using lucene and its working fine. now i want to index .doc , .pdf , .xls , . ppt with lucene can some one help in doing that thanks regards ashwin

Words not found, large file indexing

2007-03-09 Thread Walker, Keith 1
I'm having problems with queries not returning a hit when a document does in fact have those terms. (I'm not worried about the ranking, just whether or not it's a hit.) Is anything wrong with the query syntax? (see below) Also, words in the document's index (not the Lucene index) seemed less lik

Re: Words not found, large file indexing

2007-03-09 Thread Chris Hostetter
are you perhaps exceeding this... http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int) : Date: Fri, 09 Mar 2007 12:14:38 -0500 : From: "Walker, Keith 1" <[EMAIL PROTECTED]> : Reply-To: java-user@lucene.apache.org : To: java-user@lucene.apache.org
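Why a cap on field length makes words "disappear" from large files can be shown with a small simulation. This is plain Java, not Lucene code: the class and method names are hypothetical, whitespace splitting stands in for a real analyzer, and the limit of 10,000 terms is the default I recall for IndexWriter in Lucene of this era, so treat that number as an assumption and check the linked Javadoc.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of a per-field term limit; not the Lucene implementation.
final class FieldLengthDemo {

    // Keep only the first maxFieldLength whitespace-separated terms of the text.
    static List<String> indexedTerms(String text, int maxFieldLength) {
        String[] terms = text.trim().split("\\s+");
        int kept = Math.min(terms.length, maxFieldLength);
        return Arrays.asList(terms).subList(0, kept);
    }

    // A term can only be found if it falls within the indexed prefix.
    static boolean isSearchable(String text, String term, int maxFieldLength) {
        return indexedTerms(text, maxFieldLength).contains(term);
    }

    public static void main(String[] args) {
        // Build a document of 10,000 filler words followed by one rare word.
        StringBuilder doc = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            doc.append("filler ");
        }
        doc.append("needle");

        System.out.println(isSearchable(doc.toString(), "needle", 10000));
        System.out.println(isSearchable(doc.toString(), "needle", Integer.MAX_VALUE));
    }
}
```

With the cap at 10,000 the rare word sits just past the indexed prefix and is not found; raising the limit (in Lucene, through the setMaxFieldLength(int) method linked above) brings it back.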

Re: Words not found, large file indexing

2007-03-09 Thread Steffen Heinrich
Hello Chris, this is incredible! I'm new to Lucene and just subscribed to the list because of this very phenomenon. Keith's problem was also my problem. Your mail was the first one I read and is _exactly_ what I needed to know. Thanks a lotta! Cheers, Steffen On 9 Mar 2007 at 9:25, Chris Hostetter

Wildcard query with untokenized punctuation

2007-03-09 Thread McGuigan, Colin
(Lucene 1.9.1) I have a "filename" field in Lucene that holds a value, like this: pagefile.sys If I run searches through QueryParser, and I do a search for: pagefile.sys pagefile pagefile. This all works because it goes through getFieldQuery, which tokenizes the string and generat

Re: Wildcard query with untokenized punctuation

2007-03-09 Thread Steffen Heinrich
On 9 Mar 2007 at 15:10, McGuigan, Colin wrote: > I have a "filename" field in Lucene that holds a value, like this: > pagefile.sys > Hi Colin, I'm still _very_ new to lucene, but isn't that what the un-tokenized indexing is for? Like in 1.9.1 doc.add(Field.Keyword("filename", "pagefile.sys"));

Find related question

2007-03-09 Thread sdeck
Hello, I run Nutch and get a whole slew of articles and when I display search results, there may be 5-6 articles that have different titles, and most of the body text is the same, but I want to group them all under one result. These are usually AP articles that all newspapers repurpose. When usi

Re: Index a source, but not store it... can it be done?

2007-03-09 Thread Jason Pump
Agreed, it's totally hackable, particularly with an MD5 hash code. If you used a 16-bit hash, e.g. hash % 65536, then it becomes more difficult to reconstruct the original document, but querying becomes less precise. It might be nice to store the individual words contained in each document as just a sorted l
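A 16-bit word hash of the kind described could look roughly like the sketch below. The class and method names are hypothetical, and reducing an MD5 digest modulo 65536 by taking its last two bytes is my reading of "mod % 65536", not necessarily what the poster had in mind.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: maps a word to one of 65,536 buckets.
final class WordHash16 {

    static int hash16(String word) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(word.getBytes(StandardCharsets.UTF_8));
            // Reading the 16-byte digest as a big-endian integer, its value
            // modulo 65536 is simply the last two bytes.
            return ((digest[14] & 0xFF) << 8) | (digest[15] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bear -> " + hash16("bear"));
        System.out.println("bears -> " + hash16("bears"));
    }
}
```

Because many different words share each bucket, an index of these values reveals less about the source text, but a query on one word also matches every other word that lands in the same bucket, which is the loss of precision mentioned above.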

RE: Wildcard query with untokenized punctuation

2007-03-09 Thread McGuigan, Colin
-Original Message- From: Steffen Heinrich [mailto:[EMAIL PROTECTED] Sent: Fri 3/9/2007 4:31 PM To: java-user@lucene.apache.org Subject: Re: Wildcard query with untokenized punctuation On 9 Mar 2007 at 15:10, McGuigan, Colin wrote: >> I have a "filename" field in Lucene that holds a value