Re: Question about Payloads in Lucene 4.5

2014-03-21 Thread Rohit Banga
Just saw the implementation of MultiDocValues.getNumericValues(). It returns an anonymous inner class that fetches the doc value from the appropriate index reader. Very cool implementation! I guess that answers my question on how to get doc values from multiple atomic readers. It would be

Re: Question about Payloads in Lucene 4.5

2014-03-21 Thread Rohit Banga
Thanks Michael for your response. A few questions: 1. Can I expect better performance when retrieving a single NumericDocValue for all hits vs. retrieving the documents for all hits to fetch the field value? As far as I understand, retrieving n documents from the index requires n disk reads. How ma

Re: Question about Payloads in Lucene 4.5

2014-03-21 Thread Michael McCandless
DocValues are better than payloads. E.g. index a NumericDocValuesField with each doc, holding your id. Then at search time you can use MultiDocValues.getNumericValues. Mike McCandless http://blog.mikemccandless.com On Fri, Mar 21, 2014 at 4:35 PM, Rohit Banga wrote: > Hi everyone > > When I
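Mike's suggestion can be sketched against the Lucene 4.x API roughly as follows. This is an illustrative sketch only: the writer/reader plumbing and the helper names are assumed context, not code from the thread.

```java
// Sketch against the Lucene 4.x API; the IndexWriter/IndexReader setup
// is assumed to exist elsewhere.
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.MultiDocValues;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.TopDocs;

class DocValuesLookup {
    // Index time: attach the application-level id as a numeric doc value.
    static void addDoc(IndexWriter writer, long appId) throws Exception {
        Document doc = new Document();
        doc.add(new NumericDocValuesField("id", appId));
        writer.addDocument(doc);
    }

    // Search time: one doc-values lookup per hit, no stored-field load.
    static long[] idsForHits(IndexReader reader, TopDocs topDocs) throws Exception {
        NumericDocValues ids = MultiDocValues.getNumericValues(reader, "id");
        long[] result = new long[topDocs.scoreDocs.length];
        for (int i = 0; i < result.length; i++) {
            // Lucene 4.x: get(docID) returns the long value for that doc.
            result[i] = ids.get(topDocs.scoreDocs[i].doc);
        }
        return result;
    }
}
```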

Re: maxDoc/numDocs int fields

2014-03-21 Thread Artem Gayardo-Matrosov
Thanks guys for your replies. I will go for the sharding approach suggested by Oliver & Tri Cao. In my case, every word occurrence is a document, and the context of the occurrence is stored in document fields. I use that to do n-gram analysis on a large corpus of text, and Lucene seems to be the best and
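The sharding approach discussed in this thread boils down to mapping a global long document number onto a shard index plus a per-shard int docID, keeping each shard below Integer.MAX_VALUE documents. A minimal round-robin sketch (the routing scheme and class name are illustrative assumptions, not something the posters specified):

```java
// Route a global long doc number to (shard, local int docID) and back.
// Round-robin assignment keeps shards evenly sized.
public class ShardRouter {
    private final int numShards;

    public ShardRouter(int numShards) {
        this.numShards = numShards;
    }

    /** Which shard a global doc number lives in. */
    public int shardFor(long globalDoc) {
        return (int) (globalDoc % numShards);
    }

    /** The int docID within that shard (must stay below Integer.MAX_VALUE). */
    public int localDoc(long globalDoc) {
        return (int) (globalDoc / numShards);
    }

    /** Reconstruct the global doc number from (shard, localDoc). */
    public long globalDoc(int shard, int localDoc) {
        return (long) localDoc * numShards + shard;
    }
}
```

With enough shards, each per-shard docID fits comfortably in an int even for corpora well past 2 billion "documents".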

Question about Payloads in Lucene 4.5

2014-03-21 Thread Rohit Banga
Hi everyone When I query a lucene index, I get back a list of document ids. This index search is fast. Now for all documents matching the result I need a unique String field called "id" which is stored in the document. From the documentation I gather that document ids are internal and I should not

Re: Segments reusable across commits?

2014-03-21 Thread Vitaly Funstein
Thanks, that is the conclusion I came to as well; it was a little naive of me to think all the segments always get replaced on each commit, as that of course would be unnecessary and terribly inefficient. De-duplication using a Set was indeed the fix for me. On Fri, Mar 21, 2014 at 12:47 AM, Uwe
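The Set-based de-duplication can be illustrated with plain collections: successive commit points share unchanged segment files, so gathering file names into a List accumulates duplicates while a Set does not. A toy sketch (the file names are made up for illustration):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Collect the unique file names referenced by a series of commit points.
// Commit points share their unchanged (immutable) segment files, so a
// plain List would hold the same name once per commit that references it.
public class CommitFiles {
    public static Set<String> uniqueFiles(List<List<String>> commits) {
        Set<String> unique = new LinkedHashSet<>();
        for (List<String> commit : commits) {
            unique.addAll(commit);
        }
        return unique;
    }
}
```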

Re: maxDoc/numDocs int fields

2014-03-21 Thread Jack Krupansky
Every word occurrence, or every unique word? I mean, Integer.MAX_VALUE is about 2 billion, and even the OED only has 600,000 words defined. The former doesn't sound like a good use-case match for Lucene as it exists today: Lucene indexes "documents", not "words". I'm sure some day Lucene will switch fro

Re: maxDoc/numDocs int fields

2014-03-21 Thread Tri Cao
I ran into this issue before and after some digging, I don't think there is an easy way to accommodate long IDs in Lucene. So I decided to go with sharding documents into multiple indexes. It turned out to be a good decision in my case because I would have to shard the index anyway for performance

Re: maxDoc/numDocs int fields

2014-03-21 Thread Artem Gayardo-Matrosov
Hi Oli, Thanks for your reply. I thought about this, but it feels like making a crude, inefficient implementation of what's already in Lucene -- CompositeReader, isn't it? It would involve writing my own CompositeCompositeReader which would forward the requests to the underlying CompositeReader... I

RE: maxDoc/numDocs int fields

2014-03-21 Thread Oliver Christ
Can you split your corpus across multiple Lucene instances? Cheers, Oli -Original Message- From: Artem Gayardo-Matrosov [mailto:ar...@gayardo.com] Sent: Friday, March 21, 2014 12:29 PM To: java-user@lucene.apache.org Subject: maxDoc/numDocs int fields Hi all, I am using lucene to index

maxDoc/numDocs int fields

2014-03-21 Thread Artem Gayardo-Matrosov
Hi all, I am using Lucene to index a large corpus of text, with every word being a separate document (this is something I cannot change), and I am hitting the limitation of CompositeReader only supporting Integer.MAX_VALUE documents. Is there any way to work around this limitation? For the mome

Re: Dimension mismatch exception

2014-03-21 Thread Herb Roitblat
Computing the cosine between two documents requires the vectors for the two documents to be the same length (same number of elements, i.e. same dimensionality, not the same norm). The length of each vector is the size of the vocabulary of the whole set. The two sets will inevitably have different nu
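One way to sidestep the mismatch entirely is to hold each document's term weights in a sparse map and take the dot product over shared terms only; mathematically this is the same as padding both dense vectors out to the union vocabulary, since a term absent from a document contributes zero. A sketch (a hypothetical helper, not the code posted in this thread):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class Cosine {
    /** Assign each term in the combined vocabulary a fixed dimension index,
     *  as one would when building dense vectors over the union vocabulary. */
    public static Map<String, Integer> dimensions(Map<String, Double> a,
                                                  Map<String, Double> b) {
        Map<String, Integer> dims = new LinkedHashMap<>();
        for (String t : a.keySet()) dims.putIfAbsent(t, dims.size());
        for (String t : b.keySet()) dims.putIfAbsent(t, dims.size());
        return dims;
    }

    /** Cosine similarity over sparse term-weight maps. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (double w : a.values()) na += w * w;
        for (double w : b.values()) nb += w * w;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w; // terms absent from b add 0
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```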

RE: QueryParser

2014-03-21 Thread Allison, Timothy B.
What analyzer are you using? smartcn? From: kalaik [kalaiselva...@zohocorp.com] Sent: Friday, March 21, 2014 5:10 AM To: java-user@lucene.apache.org Subject: QueryParser Dear Team, we are using lucene in our product , it well searching fo

QueryParser

2014-03-21 Thread kalaik
Dear Team, we are using Lucene in our product. It searches well, with high speed and performance, but Japanese, Chinese, and Korean text is not searched properly. We used QueryParser, and the query is split into words like "轻歌曼舞庆元旦"

Getting individual field sizes from an index

2014-03-21 Thread Alan Woodward
Hi all, Does anybody know of a way to get a breakdown of the disk space a particular field takes up in a Lucene index? I'm experimenting with different query-time and index-time field combinations, and I'd like to see the exact effect they have on disk usage, but I can only really get stat
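Lucene doesn't expose a per-field size breakdown directly, but each codec writes per-purpose file extensions (in 4.x, e.g. .tim/.tip for terms, .doc for postings, .fdt for stored fields), so summing file sizes by extension gives a coarse view; comparing runs that index one field at a time then isolates a single field's cost. A small directory-walking helper along those lines (an illustration, not a Lucene API):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

public class IndexSizeByExtension {
    /** Sum the sizes of regular files in a directory, keyed by extension. */
    public static Map<String, Long> sizes(Path dir) throws IOException {
        Map<String, Long> byExt = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files) {
                if (!Files.isRegularFile(f)) continue;
                String name = f.getFileName().toString();
                int dot = name.lastIndexOf('.');
                String ext = dot < 0 ? "(none)" : name.substring(dot + 1);
                byExt.merge(ext, Files.size(f), Long::sum);
            }
        }
        return byExt;
    }
}
```

Pointing this at an index directory before and after adding a field shows which file types (and hence which index structures) grew.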

Re: Dimension mismatch exception

2014-03-21 Thread Stefy D.
Hello Herb. Thank you very much for your reply. I want to have the cosine for each a and each b. I'm using code for Lucene that I found online, which I will post below. Hello Uwe. Thank you very much for replying. I am using a class DocVector and then a class in which I try to compute the similariti

RE: Segments reusable across commits?

2014-03-21 Thread Uwe Schindler
Hi, a commit is actually just the list of segments and their (immutable) files. Because the files are immutable, every commit point can safely refer to the same files that are also used by an earlier commit point. In your code, you should use a Set instead of a List. Depending on how many c