Re: Catalog backend for document stored fields?

2006-10-24 Thread Doron Cohen
I'm indexing logs from a transaction-based application. ... millions of documents per month, and the size of the indices is ~35 gigs per month (that's the lower bound). I have no choice but to 'store' each field's values (as well as indexing/tokenizing them) because I'll need to retrieve them in

Re: number of term occurrences

2006-10-24 Thread Paz Belmonte
Hi, I have tried these options too and the term vector returns null. What do you think the problem is? 2006/10/24, beatriz ramos [EMAIL PROTECTED]: -- Forwarded message -- From: beatriz ramos [EMAIL PROTECTED] Date: 24-Oct-2006 11:24 Subject: Re: number of term

RE: number of term occurrences

2006-10-24 Thread Samir Abdou
Hi, You indexed without storing term vectors! That is why the term vector is null. Samir -Original Message- From: Paz Belmonte [mailto:[EMAIL PROTECTED] Sent: Tuesday, 24 October 2006 12:30 To: java-user Subject: Re: number of term occurrences Hi, I have tried these options too and the
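
A minimal sketch of the fix Samir is pointing at, against the Lucene 2.0 API of the period; the directory, analyzer, field name and text below are only placeholders:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class IndexWithVectors {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        Document doc = new Document();
        // Field.TermVector.YES is the part that was missing: without it,
        // IndexReader.getTermFreqVector() returns null for this field.
        doc.add(new Field("contents", "some sample text to index",
                          Field.Store.YES, Field.Index.TOKENIZED,
                          Field.TermVector.YES));
        writer.addDocument(doc);
        writer.close();
    }
}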

index short text

2006-10-24 Thread zhongyi yuan
I use Lucene to index address information. Because the addresses are so short, I think the Lucene score computation is not suitable. Can anyone give me some advice on indexing short address information? The format of an address is: name, address, etc.

Re: index short text

2006-10-24 Thread Erick Erickson
Could you specify why the score is not suitable? What is it you're trying to do that isn't working correctly? At a guess, I'd suspect that if you're using, say, StandardAnalyzer during index time, the input stream is being tokenized differently than you expect. And, depending upon what analyzer
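
A quick way to check that guess, sketched against the Lucene 2.0 analysis API: run a sample address through the analyzer used at index time and print the tokens it actually produces. The field name and sample address below are made up.

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class ShowTokens {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        String address = "John Smith, 42 Main St., Springfield";
        // Print each token the analyzer emits for this input.
        TokenStream ts = analyzer.tokenStream("address", new StringReader(address));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.println(t.termText());
        }
    }
}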

Re: number of term occurrences

2006-10-24 Thread Paz Belmonte
I don't know. How are these vectors stored? Could you show me an example (or documentation where I can find one)? 2006/10/24, Samir Abdou [EMAIL PROTECTED]: Hi, You indexed without storing term vectors! That is why the term vector is null. Samir -Original Message- From: Paz Belmonte

Re: number of term occurrences

2006-10-24 Thread Tricia Williams
When you create a Document by adding Field(s) (http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html), consider the last constructor, which allows you to specify whether the field will have its TermVector stored or not. Also, Luke has a column in its document view
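
Putting Tricia's pointer into code: a minimal sketch, against the Lucene 2.0 API of the period, that reads a stored term vector back and prints term/frequency pairs. The index path, document number and field name are placeholders.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class ShowTermVector {
    public static void main(String[] args) throws Exception {
        // "index-dir" and "contents" stand in for the real index path and field name.
        IndexReader reader = IndexReader.open("index-dir");
        TermFreqVector tfv = reader.getTermFreqVector(0, "contents");
        if (tfv == null) {
            System.out.println("No term vector stored for this field/document.");
        } else {
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            for (int i = 0; i < terms.length; i++) {
                System.out.println(terms[i] + " : " + freqs[i]);
            }
        }
        reader.close();
    }
}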

Re: near duplicates

2006-10-24 Thread Beto Siless
Hi Karl! I'm interested in near duplicate detection based on termFreqVectors. Right now I'm comparing all documents with each other (calculating the angle)... Is there a way to avoid that? Thanks! Beto karl wettin wrote: On 17 Oct 2006 at 17.54, Find Me wrote: How to eliminate near duplicates
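
For reference, the pairwise comparison described above might look like the sketch below, assuming term vectors were stored at index time (the field name is illustrative); it is exactly this O(n^2) angle computation over all document pairs that one would like to avoid.

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class CosineSimilarity {
    // Cosine similarity between the stored term frequency vectors of two documents.
    static double cosine(IndexReader reader, int docA, int docB, String field)
            throws Exception {
        TermFreqVector a = reader.getTermFreqVector(docA, field);
        TermFreqVector b = reader.getTermFreqVector(docB, field);
        if (a == null || b == null) return 0.0;
        String[] termsA = a.getTerms();
        int[] freqsA = a.getTermFrequencies();
        String[] termsB = b.getTerms();
        int[] freqsB = b.getTermFrequencies();
        // Build a term -> frequency map for document B.
        Map freqMapB = new HashMap();
        double normB = 0.0;
        for (int j = 0; j < termsB.length; j++) {
            freqMapB.put(termsB[j], new Integer(freqsB[j]));
            normB += (double) freqsB[j] * freqsB[j];
        }
        // Dot product over the terms shared with document A.
        double dot = 0.0, normA = 0.0;
        for (int i = 0; i < termsA.length; i++) {
            normA += (double) freqsA[i] * freqsA[i];
            Integer f = (Integer) freqMapB.get(termsA[i]);
            if (f != null) dot += (double) freqsA[i] * f.intValue();
        }
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}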

Re: near duplicates

2006-10-24 Thread Beto Siless
Hi Andrej! I'm taking a look at fuzzy signatures for near duplicate detection and I have seen your TextProfileSignature. The question is: if I index the documents with their text signature, is there a way to filter near duplicates at search time without comparing each document with all
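
One way a signature can avoid pairwise comparison, sketched here purely as an assumption and not as Andrzej's implementation: compute the signature externally (e.g. with something like TextProfileSignature), store it as an untokenized field at index time, and collapse hits that share the same signature at search time. Field names and the collapse policy below are illustrative only.

import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SignatureCollapse {
    // Keep only the first (highest-ranked) hit for each signature value;
    // assumes the signature was stored at index time in a "signature" field.
    static void printDeduped(IndexSearcher searcher, Query query) throws Exception {
        Hits hits = searcher.search(query);
        Set seen = new HashSet();
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            String sig = doc.get("signature");
            if (sig != null && !seen.add(sig)) {
                continue; // near-duplicate of an earlier hit, skip it
            }
            System.out.println(doc.get("title"));
        }
    }
}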

Re: near duplicates

2006-10-24 Thread Find Me
It doesn't make sense to eliminate near duplicates at search time. But if you are trying to cluster duplicates together, then you probably want to look at Carrot. On 10/24/06, Beto Siless [EMAIL PROTECTED] wrote: Hi Andrej! I'm taking a look at fuzzy signatures for near duplicate detection

Re: near duplicates

2006-10-24 Thread Andrzej Bialecki
Beto Siless wrote: Hi Andrej! I'm taking a look at fuzzy signatures for near duplicate detection and I have seen your TextProfileSignature. The question is: if I index the documents with their text signature, is there a way to filter near duplicates at search time without comparing each

Re: Scalability Questions

2006-10-24 Thread Doron Cohen
4) Roughly how large is the index file in comparison to the size of the input files? It depends on whether you store fields or just index them, plus there is also a compression (gzip -9 equivalent) option. As an example, here are index size numbers I saw: when indexing 1M docs of ~20KB of very
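
The store/index/compress choices mentioned above correspond to options on the Field constructor; a minimal sketch against the Lucene 2.0 API, with field names and values as placeholders:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldOptions {
    static Document makeDoc(String title, String body) {
        Document doc = new Document();
        // Stored and indexed: goes into both the stored-fields data and the inverted index.
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        // Indexed only: searchable but not retrievable, keeps the index smaller.
        doc.add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));
        // Stored compressed, not indexed: the compression option mentioned above.
        doc.add(new Field("raw", body, Field.Store.COMPRESS, Field.Index.NO));
        return doc;
    }
}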

Re: index architectures

2006-10-24 Thread Doron Cohen
Perhaps another comment along the same lines - I think you would be able to get more from your system by bounding the number of open searchers to 2: - old, serving 'old' queries, soon to be closed; - new, being opened and warmed up, then serving 'new' queries. Because... - if I understood
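
A rough sketch of that two-searcher pattern; the class, method names and warm-up policy below are purely illustrative, not an existing Lucene API:

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class TwoSearcherSketch {
    private IndexSearcher current;   // serves all incoming queries
    private final String indexPath;

    TwoSearcherSketch(String indexPath) throws Exception {
        this.indexPath = indexPath;
        this.current = new IndexSearcher(indexPath);
    }

    // Called after the index has been updated: open and warm a new searcher,
    // then swap it in and close the old one, so at most two are ever open.
    synchronized void reopen(Query[] warmupQueries) throws Exception {
        IndexSearcher fresh = new IndexSearcher(indexPath);
        for (int i = 0; i < warmupQueries.length; i++) {
            fresh.search(warmupQueries[i]);  // warm caches before taking traffic
        }
        IndexSearcher old = current;
        current = fresh;
        old.close();  // real code would wait for in-flight 'old' queries (e.g. ref-counting)
    }

    synchronized Hits search(Query q) throws Exception {
        return current.search(q);
    }
}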

Re: number of term occurrences

2006-10-24 Thread Doron Cohen
I don't know why the termDocs option did not work for you. Perhaps you did not (re)open the searcher after the index was populated? Anyhow, here is a small code snippet that does just this, see if it works for you, then you can compare it to your code... void numberOfTermOcc() throws Exception
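
Doron's snippet is truncated in the archive; a sketch of the same idea under the Lucene 2.0 API - counting how many documents contain a term and its total number of occurrences via IndexReader.termDocs() - might look like the following. The index path, field and term are placeholders, and this is not necessarily the original code.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class TermOccurrences {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("index-dir");
        TermDocs termDocs = reader.termDocs(new Term("contents", "lucene"));
        int docCount = 0;
        long totalOccurrences = 0;
        while (termDocs.next()) {
            docCount++;
            totalOccurrences += termDocs.freq();  // occurrences within termDocs.doc()
        }
        System.out.println("docs: " + docCount + ", total occurrences: " + totalOccurrences);
        reader.close();
    }
}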