I'm indexing logs from a transaction-based application.
...
Millions of documents per month; the size of the indices is ~35 GB per
month (that's the lower bound). I have no choice but to 'store' each
field's values (as well as indexing/tokenizing them) because I'll need to
retrieve them in
Hi,
I have tried these options too and the TermVector returns null.
What do you think the problem is?
2006/10/24, beatriz ramos [EMAIL PROTECTED]:
-- Forwarded message --
From: beatriz ramos [EMAIL PROTECTED]
Date: 24-Oct-2006 11:24
Subject: Re: number of term occurrences
Hi,
You indexed without storing vectors! This is why the term vector is null.
Samir
-Original Message-
From: Paz Belmonte [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 24 October 2006 12:30
To: java-user
Subject: Re: number of term occurrences
Hi,
I have tried these options too and the
I use Lucene to index address information. Because the address
information is so short, I think Lucene's score computation is
not suitable.
Can anyone give me some advice on indexing short address information?
The format of an address is: name, address, etc.
Could you specify why the score is not suitable? What is it you're trying to
do that isn't working correctly?
At a guess, I'd suspect that if you're using, say, StandardAnalyzer during
index time, the input stream is being tokenized differently than you expect.
And, depending upon what analyzer
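To see what the analyzer actually emits for an address, a quick sketch along
these lines can help (written against the Lucene 2.0-era API; the sample
address and field name are made up):

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class ShowTokens {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        // Print every term StandardAnalyzer produces for a sample address.
        TokenStream stream = analyzer.tokenStream("address",
                new StringReader("John Doe, 42 Main St., Springfield"));
        Token token;
        while ((token = stream.next()) != null) {
            System.out.println(token.termText());
        }
    }
}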
I don't know. How are these vectors stored?
Could you show me an example? (Or documentation where I can find it?)
When you create a Document by adding Field(s)
(http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html),
consider the last constructor, which allows you to specify whether the field
will have its TermVector stored or not. Also, Luke has a column in
its document view
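For example, something like this (a sketch using the Lucene 2.x constants;
the index path, field name, and text are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermFreqVector;

public class TermVectorDemo {
    public static void main(String[] args) throws Exception {
        // Index time: ask Lucene to keep the term vector for this field.
        IndexWriter writer = new IndexWriter("/path/to/index",
                new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("contents", "some text to index",
                Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES));
        writer.addDocument(doc);
        writer.close();

        // Search time: (re)open a reader and fetch the vector for document 0.
        IndexReader reader = IndexReader.open("/path/to/index");
        TermFreqVector vector = reader.getTermFreqVector(0, "contents");
        System.out.println(vector);  // only null if the vector was never stored
        reader.close();
    }
}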
Hi Karl!
I'm interested in near duplicate detection based on termFreqVectors. Now
I'm comparing all documents with each other (calculating the angle)...
Is there a way to avoid that?
Thanks!
Beto
karl wettin wrote:
On 17 Oct 2006, at 17:54, Find Me wrote:
How to eliminate near duplicates
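For reference, the all-pairs comparison Beto describes (the angle between two
term frequency vectors) might look roughly like this, assuming the vectors
were stored at index time (the field name is a placeholder):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class VectorAngle {
    // Cosine of the angle between two docs' term frequency vectors;
    // 1.0 means identical term distributions, 0.0 means no shared terms.
    static double cosine(IndexReader reader, int docA, int docB)
            throws Exception {
        TermFreqVector a = reader.getTermFreqVector(docA, "contents");
        TermFreqVector b = reader.getTermFreqVector(docB, "contents");
        Map<String, Integer> freqA = new HashMap<String, Integer>();
        String[] termsA = a.getTerms();
        int[] countsA = a.getTermFrequencies();
        double normA = 0;
        for (int i = 0; i < termsA.length; i++) {
            freqA.put(termsA[i], Integer.valueOf(countsA[i]));
            normA += (double) countsA[i] * countsA[i];
        }
        String[] termsB = b.getTerms();
        int[] countsB = b.getTermFrequencies();
        double dot = 0, normB = 0;
        for (int i = 0; i < termsB.length; i++) {
            Integer fa = freqA.get(termsB[i]);
            if (fa != null) {
                dot += fa.intValue() * (double) countsB[i];
            }
            normB += (double) countsB[i] * countsB[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}

Comparing every document against every other this way is O(n^2) in the number
of documents, which is exactly the cost the signature idea below tries to avoid.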
Hi Andrej!
I'm taking a look at fuzzy signatures for near duplicate detection,
and I have seen your TextProfileSignature. The question is: if I index
the documents with their text signature, is there a way to filter near
duplicates at search time without comparing each document with all
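One way the question's idea could be sketched: store the signature as a
separate untokenized field and collapse hits that share a value while walking
the results, so no pairwise comparison is needed (the field names here are
hypothetical):

import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class DedupSearch {
    // Keep only the first hit for each text signature value.
    static void printWithoutDuplicates(IndexSearcher searcher, Query query)
            throws Exception {
        Hits hits = searcher.search(query);
        Set<String> seen = new HashSet<String>();
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            String sig = doc.get("signature");
            if (sig == null || seen.add(sig)) {  // first doc per signature
                System.out.println(doc.get("title"));
            }
        }
    }
}

Note this only collapses documents whose signatures match exactly, which is
what a fuzzy signature such as TextProfileSignature is designed to produce
for near-duplicate texts.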
It doesn't make sense to eliminate near duplicates during search time. But
if you are trying to cluster duplicates together, then you probably want to
look at Carrot.
On 10/24/06, Beto Siless [EMAIL PROTECTED] wrote:
Hi Andrej!
I'm taking a look at fuzzy signatures for near duplicate detection
4) Roughly how large is the index file in comparison to the size of the
input files?
It depends on whether you store fields or just index them, plus
there is also a compression (gzip -9 equivalent) option.
As an example of the index-size numbers I saw: when indexing 1M docs of
~20KB of very
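The compression option mentioned above corresponds to Field.Store.COMPRESS in
the Field constructor of Lucene 2.x; a minimal illustration of the three
storage choices (field names and text are placeholders):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class StorageChoices {
    static Document makeDoc(String text) {
        Document doc = new Document();
        // Indexed only: smallest index, but the original text is not retrievable.
        doc.add(new Field("body", text,
                Field.Store.NO, Field.Index.TOKENIZED));
        // Indexed and stored verbatim: text retrievable, largest index.
        doc.add(new Field("bodyStored", text,
                Field.Store.YES, Field.Index.TOKENIZED));
        // Indexed and stored compressed (roughly the gzip -9 trade-off above).
        doc.add(new Field("bodyZipped", text,
                Field.Store.COMPRESS, Field.Index.TOKENIZED));
        return doc;
    }
}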
Perhaps another comment along the same lines - I think you would be able to
get more from your system by bounding the number of open searchers to two:
- the old one, serving 'old' queries, soon to be closed;
- the new one, being opened and warmed up, and then serving 'new' queries.
Because... - if I understood
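A bare-bones sketch of that two-searcher scheme (the index path, warm-up
query, and class name are made up; real code must also wait for in-flight
queries before closing the old searcher):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class TwoSearchers {
    private IndexSearcher current;

    public TwoSearchers(String indexDir) throws Exception {
        current = new IndexSearcher(indexDir);
    }

    // Queries always go through the currently published searcher.
    public synchronized IndexSearcher getSearcher() {
        return current;
    }

    // Open and warm the new searcher, publish it, then close the old one.
    public void reopen(String indexDir) throws Exception {
        IndexSearcher fresh = new IndexSearcher(indexDir);
        fresh.search(new TermQuery(new Term("contents", "warmup")));  // warm up
        IndexSearcher old;
        synchronized (this) {
            old = current;
            current = fresh;
        }
        old.close();  // unsafe if queries are still running against it
    }
}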
I don't know why the termDocs option did not work for you. Perhaps you did
not (re)open the searcher after the index was populated? Anyhow, here is a
small code snippet that does just this; see if it works for you, then
compare it to your code...
void numberOfTermOcc() throws Exception
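The snippet is cut off in the archive; a guess at the general shape of such a
method, against the Lucene 2.0-era API (the index path, field, and term are
placeholders):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class TermCounts {
    // Count the documents containing a term and its total occurrences.
    static void numberOfTermOcc() throws Exception {
        // (Re)open the reader so it sees the freshly populated index.
        IndexReader reader = IndexReader.open("/path/to/index");
        TermDocs termDocs = reader.termDocs(new Term("contents", "lucene"));
        int docs = 0;
        int occurrences = 0;
        while (termDocs.next()) {
            docs++;
            occurrences += termDocs.freq();  // frequency within the current doc
        }
        System.out.println(occurrences + " occurrences in " + docs + " docs");
        termDocs.close();
        reader.close();
    }
}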