Re: Catalog backend for document stored fields?

2006-10-24 Thread Doron Cohen
I'm indexing logs from a transaction-based application. ... millions of documents per month, and the size of the indices is ~35 gigs per month (that's the lower bound). I have no choice but to 'store' each field's values (as well as indexing/tokenizing them) because I'll need to retrieve them in

Re: number of term occurrences

2006-10-24 Thread Paz Belmonte
Hi, I have tried these options too and the term vector returns null. What do you think the problem is? 2006/10/24, beatriz ramos [EMAIL PROTECTED]: -- Forwarded message -- From: beatriz ramos [EMAIL PROTECTED] Date: 24-Oct-2006 11:24 Subject: Re: number of term

RE: number of term occurrences

2006-10-24 Thread Samir Abdou
Hi, You indexed without storing term vectors! That is why the term vector is null. Samir -Original Message- From: Paz Belmonte [mailto:[EMAIL PROTECTED] Sent: Tuesday, 24 October 2006 12:30 To: java-user Subject: Re: number of term occurrences Hi, I have tried these options too and the
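
A minimal sketch of the fix Samir is pointing at, against the Lucene 2.0 API of the period; the directory, analyzer, field name and text below are only placeholders:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class IndexWithVectors {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        Document doc = new Document();
        // Field.TermVector.YES is the part that was missing: without it,
        // IndexReader.getTermFreqVector() returns null for this field.
        doc.add(new Field("contents", "some sample text to index",
                          Field.Store.YES, Field.Index.TOKENIZED,
                          Field.TermVector.YES));
        writer.addDocument(doc);
        writer.close();
    }
}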

index short text

2006-10-24 Thread zhongyi yuan
I use Lucene to index address information. Because the addresses are so short, I think the Lucene score computation is not suitable. Can anyone give me some advice on indexing short address information? The format of an address is: name, address, etc.

Re: index short text

2006-10-24 Thread Erick Erickson
Could you specify why the score is not suitable? What is it you're trying to do that isn't working correctly? At a guess, I'd suspect that if you're using, say, StandardAnalyzer during index time, the input stream is being tokenized differently than you expect. And, depending upon what analyzer
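
A quick way to check that guess, sketched against the Lucene 2.0 analysis API: run a sample address through the analyzer used at index time and print the tokens it actually produces. The field name and sample address below are made up.

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class ShowTokens {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        String address = "John Smith, 42 Main St., Springfield";
        // Print each token the analyzer emits for this input.
        TokenStream ts = analyzer.tokenStream("address", new StringReader(address));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.println(t.termText());
        }
    }
}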

Re: number of term occurrences

2006-10-24 Thread Paz Belmonte
I don't know. How are these vectors stored? Could you show me an example (or documentation where I can find one)? 2006/10/24, Samir Abdou [EMAIL PROTECTED]: Hi, You indexed without storing term vectors! That is why the term vector is null. Samir -Original Message- From: Paz Belmonte

Re: number of term occurrences

2006-10-24 Thread Tricia Williams
When you create a Document by adding Field(s) (http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html), consider the last constructor, which allows you to specify whether the field will have its TermVector stored or not. Also, Luke has a column in its document view
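
Putting Tricia's pointer into code: a minimal sketch, against the Lucene 2.0 API of the period, that reads a stored term vector back and prints term/frequency pairs. The index path, document number and field name are placeholders.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class ShowTermVector {
    public static void main(String[] args) throws Exception {
        // "index-dir" and "contents" stand in for the real index path and field name.
        IndexReader reader = IndexReader.open("index-dir");
        TermFreqVector tfv = reader.getTermFreqVector(0, "contents");
        if (tfv == null) {
            System.out.println("No term vector stored for this field/document.");
        } else {
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            for (int i = 0; i < terms.length; i++) {
                System.out.println(terms[i] + " : " + freqs[i]);
            }
        }
        reader.close();
    }
}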

Re: near duplicates

2006-10-24 Thread Beto Siless
Hi Karl! I'm interested in near duplicate detection based on termFreqVectors. Right now I'm comparing all documents with each other (calculating the angle)... Is there a way to avoid that? Thanks! Beto karl wettin wrote: On 17 Oct 2006 at 17.54, Find Me wrote: How to eliminate near duplicates
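
For reference, the pairwise comparison described above might look like the sketch below, assuming term vectors were stored at index time (the field name is illustrative); it is exactly this O(n^2) angle computation over all document pairs that one would like to avoid.

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class CosineSimilarity {
    // Cosine similarity between the stored term frequency vectors of two documents.
    static double cosine(IndexReader reader, int docA, int docB, String field)
            throws Exception {
        TermFreqVector a = reader.getTermFreqVector(docA, field);
        TermFreqVector b = reader.getTermFreqVector(docB, field);
        if (a == null || b == null) return 0.0;
        String[] termsA = a.getTerms();
        int[] freqsA = a.getTermFrequencies();
        String[] termsB = b.getTerms();
        int[] freqsB = b.getTermFrequencies();
        // Build a term -> frequency map for document B.
        Map freqMapB = new HashMap();
        double normB = 0.0;
        for (int j = 0; j < termsB.length; j++) {
            freqMapB.put(termsB[j], new Integer(freqsB[j]));
            normB += (double) freqsB[j] * freqsB[j];
        }
        // Dot product over the terms shared with document A.
        double dot = 0.0, normA = 0.0;
        for (int i = 0; i < termsA.length; i++) {
            normA += (double) freqsA[i] * freqsA[i];
            Integer f = (Integer) freqMapB.get(termsA[i]);
            if (f != null) dot += (double) freqsA[i] * f.intValue();
        }
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}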

Re: near duplicates

2006-10-24 Thread Beto Siless
Hi Andrej! I'm taking a look at fuzzy signatures for near duplicate detection and I have seen your TextProfileSignature. The question is: if I index the documents with their text signature, is there a way to filter near duplicates at search time without comparing each document with all
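
One way a signature can avoid pairwise comparison, sketched here purely as an assumption and not as Andrzej's implementation: compute the signature externally (e.g. with something like TextProfileSignature), store it as an untokenized field at index time, and collapse hits that share the same signature at search time. Field names and the collapse policy below are illustrative only.

import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SignatureCollapse {
    // Keep only the first (highest-ranked) hit for each signature value;
    // assumes the signature was stored at index time in a "signature" field.
    static void printDeduped(IndexSearcher searcher, Query query) throws Exception {
        Hits hits = searcher.search(query);
        Set seen = new HashSet();
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            String sig = doc.get("signature");
            if (sig != null && !seen.add(sig)) {
                continue; // near-duplicate of an earlier hit, skip it
            }
            System.out.println(doc.get("title"));
        }
    }
}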

Re: near duplicates

2006-10-24 Thread Find Me
It doesn't make sense to eliminate near duplicates at search time. But if you are trying to cluster duplicates together, then you probably want to look at Carrot. On 10/24/06, Beto Siless [EMAIL PROTECTED] wrote: Hi Andrej! I'm taking a look at fuzzy signatures for near duplicate detection

Re: near duplicates

2006-10-24 Thread Andrzej Bialecki
Beto Siless wrote: Hi Andrej! I'm taking a look at fuzzy signatures for near duplicate detection and I have seen your TextProfileSignature. The question is: if I index the documents with their text signature, is there a way to filter near duplicates at search time without comparing each

Re: Scalability Questions

2006-10-24 Thread Doron Cohen
4) Roughly how large is the index file in comparison to the size of the input files? It depends on whether you store fields or just index them, plus there is also a compression (gzip -9 equivalent) option. As an example, here are index size numbers I saw: when indexing 1M docs of ~20KB of very
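
The store/index/compress choices mentioned above correspond to options on the Field constructor; a minimal sketch against the Lucene 2.0 API, with field names and values as placeholders:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldOptions {
    static Document makeDoc(String title, String body) {
        Document doc = new Document();
        // Stored and indexed: goes into both the stored-fields data and the inverted index.
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        // Indexed only: searchable but not retrievable, keeps the index smaller.
        doc.add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));
        // Stored compressed, not indexed: the compression option mentioned above.
        doc.add(new Field("raw", body, Field.Store.COMPRESS, Field.Index.NO));
        return doc;
    }
}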

Re: index architectures

2006-10-24 Thread Doron Cohen
Perhaps another comment along the same lines - I think you would be able to get more from your system by bounding the number of open searchers to 2: - old, serving 'old' queries, soon to be closed; - new, being opened and warmed up, then serving 'new' queries. Because... - if I understood
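
A rough sketch of that two-searcher pattern; the class, method names and warm-up policy below are purely illustrative, not an existing Lucene API:

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class TwoSearcherSketch {
    private IndexSearcher current;   // serves all incoming queries
    private final String indexPath;

    TwoSearcherSketch(String indexPath) throws Exception {
        this.indexPath = indexPath;
        this.current = new IndexSearcher(indexPath);
    }

    // Called after the index has been updated: open and warm a new searcher,
    // then swap it in and close the old one, so at most two are ever open.
    synchronized void reopen(Query[] warmupQueries) throws Exception {
        IndexSearcher fresh = new IndexSearcher(indexPath);
        for (int i = 0; i < warmupQueries.length; i++) {
            fresh.search(warmupQueries[i]);  // warm caches before taking traffic
        }
        IndexSearcher old = current;
        current = fresh;
        old.close();  // real code would wait for in-flight 'old' queries (e.g. ref-counting)
    }

    synchronized Hits search(Query q) throws Exception {
        return current.search(q);
    }
}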

Re: number of term occurrences

2006-10-24 Thread Doron Cohen
I don't know why the termDocs option did not work for you. Perhaps you did not (re)open the searcher after the index was populated? Anyhow, here is a small code snippet that does just this, see if it works for you, then you can compare it to your code... void numberOfTermOcc() throws Exception
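
Doron's snippet is truncated in the archive; a sketch of the same idea under the Lucene 2.0 API - counting how many documents contain a term and its total number of occurrences via IndexReader.termDocs() - might look like the following. The index path, field and term are placeholders, and this is not necessarily the original code.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class TermOccurrences {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("index-dir");
        TermDocs termDocs = reader.termDocs(new Term("contents", "lucene"));
        int docCount = 0;
        long totalOccurrences = 0;
        while (termDocs.next()) {
            docCount++;
            totalOccurrences += termDocs.freq();  // occurrences within termDocs.doc()
        }
        System.out.println("docs: " + docCount + ", total occurrences: " + totalOccurrences);
        reader.close();
    }
}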