Just saw the implementation of MultiDocValues.getNumericValues(). It
returns a sort of anonymous inner class that fetches the doc value from the
appropriate index reader. Very cool implementation!
I guess that answers my question on how to get docVal from multiple
atomic readers.
It would be
Thanks, Michael, for your response.
A few questions:
1. Can I expect better performance when retrieving a single NumericDocValue
for all hits vs. retrieving the stored document for each hit to fetch the field
value? As far as I understand, retrieving n documents from the index
requires n disk reads. How ma
DocValues are better than payloads.
E.g. index a NumericDocValuesField with each doc, holding your id.
Then at search time you can use MultiDocValues.getNumericValues.
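A minimal sketch of the approach Mike describes, against the Lucene 4.x-era API (the field names, version constant, and values below are made up for illustration; in later Lucene versions NumericDocValues is iterator-based rather than random-access, so this shape only applies to 4.x):

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.MultiDocValues;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class DocValuesLookup {
    public static void main(String[] args) throws IOException {
        RAMDirectory dir = new RAMDirectory();
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
        try (IndexWriter w = new IndexWriter(dir, cfg)) {
            for (long id = 0; id < 3; id++) {
                Document doc = new Document();
                doc.add(new StringField("body", "hello", Field.Store.NO));
                // Per-document numeric value, stored column-stride:
                doc.add(new NumericDocValuesField("id", 1000 + id));
                w.addDocument(doc);
            }
        }
        try (DirectoryReader r = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(r);
            TopDocs hits = searcher.search(new TermQuery(new Term("body", "hello")), 10);
            // One view over all segments; no stored-document retrieval needed:
            NumericDocValues ids = MultiDocValues.getNumericValues(r, "id");
            for (ScoreDoc sd : hits.scoreDocs) {
                System.out.println(ids.get(sd.doc));
            }
        }
    }
}
```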
Mike McCandless
http://blog.mikemccandless.com
On Fri, Mar 21, 2014 at 4:35 PM, Rohit Banga wrote:
> Hi everyone
>
> When I
Thanks guys for your replies,
I will go for the sharding approach suggested by Oliver & Tri Cao.
In my case, every word occurrence is a document, and the context of each
occurrence is stored in document fields. I use that to do n-gram analysis in a large
corpus of text, and lucene seems to be the best and
Hi everyone
When I query a lucene index, I get back a list of document ids. This index
search is fast. Now, for every document in the result, I need a unique
String field called "id" which is stored in the document. From the
documentation I gather that document ids are internal and I should not
Thanks, that is the conclusion I came to as well; it was a little naive of
me to think all the segments always get replaced on each commit, as that of
course would be unnecessary and terribly inefficient. De-duplication using
a Set was indeed the fix for me.
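For reference, the de-duplication itself needs nothing Lucene-specific: collecting the files of several commit points into a Set drops the segment files that unchanged commits share. A small sketch (the file names are invented for illustration; in real code they would come from IndexCommit.getFileNames()):

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DedupCommitFiles {
    public static void main(String[] args) {
        // Two commit points; the unchanged segment _0 appears in both.
        List<String> commit1 = Arrays.asList("_0.cfs", "_0.cfe", "segments_1");
        List<String> commit2 = Arrays.asList("_0.cfs", "_0.cfe", "_1.cfs", "segments_2");

        // A Set keeps each file name once, unlike a List:
        Set<String> unique = new LinkedHashSet<>();
        unique.addAll(commit1);
        unique.addAll(commit2);

        System.out.println(unique.size()); // 5, not 7
    }
}
```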
On Fri, Mar 21, 2014 at 12:47 AM, Uwe
Every word occurrence, or every unique word? I mean, Integer.MAX_VALUE is about 2
billion, while even the OED only has 600,000 words defined. The former doesn't
sound like a good use case match for Lucene as it exists today. Lucene
indexes "documents", not "words".
I'm sure some day Lucene will switch fro
I ran into this issue before and after some digging, I don't think there is an easy way to accommodate long IDs in Lucene. So I decided to go with sharding documents into multiple indexes. It turned out to be a good decision in my case because I would have to shard the index anyway for performance
Hi Oli,
Thanks for your reply,
I thought about this, but it feels like a crude, inefficient
reimplementation of what's already in lucene -- CompositeReader, isn't it? It
would involve writing my own CompositeCompositeReader which would forward the
requests to the underlying CompositeReaders...
I
Can you split your corpus across multiple Lucene instances?
Cheers, Oli
-Original Message-
From: Artem Gayardo-Matrosov [mailto:ar...@gayardo.com]
Sent: Friday, March 21, 2014 12:29 PM
To: java-user@lucene.apache.org
Subject: maxDoc/numDocs int fields
Hi all,
I am using lucene to index
Hi all,
I am using lucene to index a large corpus of text, with every word being a
separate document (this is something I cannot change), and I am hitting a
limitation of the CompositeReader only supporting Integer.MAX_VALUE
documents.
Is there any way to work around this limitation? For the mome
Computing the cosine between two documents requires the vectors for the
two documents to be the same length (same number of elements, same
dimensionality; not the norm). The length of each vector is the size
of the vocabulary for the whole set. The two sets will inevitably have
different nu
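A small illustration of that point: building both term-frequency vectors over the union vocabulary gives them the same dimensionality, after which the cosine is straightforward. This is plain Java, independent of Lucene; the sample terms and counts are made up:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class CosineOverJointVocab {
    // Cosine of two term-frequency maps, aligned on the union vocabulary
    // so both vectors have the same dimensionality.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> vocab = new TreeSet<>();
        vocab.addAll(a.keySet());
        vocab.addAll(b.keySet());
        double dot = 0, normA = 0, normB = 0;
        for (String term : vocab) {
            int x = a.getOrDefault(term, 0); // 0 for terms absent from a doc
            int y = b.getOrDefault(term, 0);
            dot += (double) x * y;
            normA += (double) x * x;
            normB += (double) y * y;
        }
        return (normA == 0 || normB == 0) ? 0
                : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = new HashMap<>();
        d1.put("lucene", 2);
        d1.put("index", 1);
        Map<String, Integer> d2 = new HashMap<>();
        d2.put("lucene", 1);
        d2.put("search", 1);
        System.out.printf(Locale.ROOT, "%.3f%n", cosine(d1, d2));
    }
}
```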
What analyzer are you using? smartcn?
From: kalaik [kalaiselva...@zohocorp.com]
Sent: Friday, March 21, 2014 5:10 AM
To: java-user@lucene.apache.org
Subject: QueryParser
Dear Team,
we are using lucene in our product , it well searching fo
Dear Team,
we are using Lucene in our product; it searches well, with high
speed and performance, but
Japanese, Chinese and Korean text is not searched properly.
We had used QueryParser,
and QueryParser splits a query like "轻歌曼舞庆元旦" into separate words
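One possible fix, following the smartcn suggestion above: hand QueryParser a SmartChineseAnalyzer (from the lucene-analyzers-smartcn module) so Simplified Chinese queries are segmented into dictionary words instead of being broken up character by character. A sketch against the Lucene 4.x-era API; the field name "contents" and the version constant are placeholders:

```java
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class CjkQueryExample {
    public static void main(String[] args) throws Exception {
        // SmartChineseAnalyzer segments Simplified Chinese into words,
        // unlike StandardAnalyzer's per-character treatment of CJK text.
        SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_47);
        QueryParser parser = new QueryParser(Version.LUCENE_47, "contents", analyzer);
        Query q = parser.parse("轻歌曼舞庆元旦");
        // The printed clauses depend on the segmenter's dictionary:
        System.out.println(q);
    }
}
```

The same analyzer must be used at index time, otherwise the segmented query terms will not match the indexed tokens. Japanese and Korean need their own analyzers (e.g. the kuromoji module for Japanese); smartcn only covers Chinese.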
Hi all,
Does anybody know of a way of getting a breakdown of the disk space a
particular field takes up in a lucene index? I'm experimenting with different
query-time and index-time field combinations, and I'd like to see the exact
effect they have on disk usage, but I can only really get stat
Hello Herb. Thank you very much for your reply. I want to have the cosine for
each a and each b. I'm using code for lucene I found online, which I will post
below.
Hello Uwe. Thank you very much for replying. I am using a class DocVector and
then a class in which I try to compute the similariti
Hi,
a commit is actually just the list of segments and their (immutable) files.
Because the files are not mutable, every commit point can safely refer to the
same files which are also used by an earlier commit point. In your code, you
should use a Set instead of a List. Depending on how many
c