Re: Comparing two indexes for equality - Finding non stored fieldNames per document

2018-01-05 Thread Chetan Mehrotra
Based on the suggestions here, I implemented a script to un-invert the index (details at OAK-7122 [1], [2]). Uninverting was done with the following logic:
def collectFieldNames(DirectoryReader reader) {
    println "Proceeding to collect the field names per document"
    Bits liveDocs =

Re: Comparing two indexes for equality - Finding non stored fieldNames per document

2018-01-03 Thread Chetan Mehrotra
> This isn't an API problem. This is by design -- this is how it works.
Ack. What I meant earlier regarding the API is that uninverting the index is not a direct operation and hence is not supported via the API. It would need to be done using other APIs and would require post-processing of the index

Re: Comparing two indexes for equality - Finding non stored fieldNames per document

2018-01-03 Thread Dawid Weiss
> That helps and explains why there is no support in std api
This isn't an API problem. This is by design -- this is how it works. If you wish to retrieve fields that are indexed and stored with the document, the API provides such an option (indexed and stored field type). Your indexed fields are

Re: Comparing two indexes for equality - Finding non stored fieldNames per document

2018-01-03 Thread Chetan Mehrotra
>> So unless you "store" that value with the document as a stored field, you'll have to "uninvert" the index yourself.
That helps and explains why there is no support in std api
> Luke has some capabilities to look at the index at a low level, perhaps that could give you some pointers.
I

Re: Comparing two indexes for equality - Finding non stored fieldNames per document

2018-01-02 Thread Erick Erickson
Luke has some capabilities to look at the index at a low level, perhaps that could give you some pointers. I think you can pull the older branch from here: https://github.com/DmitryKey/luke or: https://code.google.com/archive/p/luke/ NOTE: This is not a part of Lucene, but an independent project

Re: Comparing two indexes for equality - Finding non stored fieldNames per document

2018-01-02 Thread Dawid Weiss
Ok. I think you should look at the Java API -- this will give you more clarity of what is actually stored in the index and how to extract it. The thing (I think) you're missing is that an inverted index points in the "other" direction (from a given value to all documents that contained it). So
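The direction problem described above can be illustrated with a toy model. The following is a minimal sketch, not Lucene code: it assumes a hypothetical in-memory inverted index (field name to term to document IDs) and walks every postings list to rebuild the set of field names seen for each document. The class and method names (`UninvertSketch`, `fieldNamesPerDoc`, `sampleIndex`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only: a real Lucene index is traversed via its own reader APIs.
class UninvertSketch {

    // Inverted direction: field -> term -> IDs of the documents containing that term.
    // Uninverted direction (the result): document ID -> names of the fields it has terms in.
    static Map<Integer, SortedSet<String>> fieldNamesPerDoc(
            Map<String, Map<String, List<Integer>>> invertedIndex) {
        Map<Integer, SortedSet<String>> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, List<Integer>>> field : invertedIndex.entrySet()) {
            // Every term of the field must be visited, because any posting may
            // be the only evidence that a given document has this field.
            for (List<Integer> postings : field.getValue().values()) {
                for (int docId : postings) {
                    result.computeIfAbsent(docId, k -> new TreeSet<>()).add(field.getKey());
                }
            }
        }
        return result;
    }

    // Hypothetical sample data: three documents, two indexed fields.
    static Map<String, Map<String, List<Integer>>> sampleIndex() {
        Map<String, List<Integer>> title = new TreeMap<>();
        title.put("lucene", new ArrayList<>(Arrays.asList(0, 1)));
        title.put("oak", new ArrayList<>(Arrays.asList(1)));
        Map<String, List<Integer>> path = new TreeMap<>();
        path.put("/content", new ArrayList<>(Arrays.asList(0, 2)));
        Map<String, Map<String, List<Integer>>> index = new TreeMap<>();
        index.put("title", title);
        index.put("path", path);
        return index;
    }

    public static void main(String[] args) {
        // prints {0=[path, title], 1=[title], 2=[path]}
        System.out.println(fieldNamesPerDoc(sampleIndex()));
    }
}
```

The cost is proportional to the total number of postings, which is why this kind of post-processing is something you do yourself rather than a single API call.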

Re: Comparing two indexes for equality - Finding non stored fieldNames per document

2018-01-02 Thread Chetan Mehrotra
> Only stored fields are kept for each document. If you need to dump internal data structures (terms, positions, offsets, payloads, you name it) you'll need to dive into the API and traverse all segments, then dump the above (and note that document IDs are per-segment and will have to be

Re: Comparing two indexes for equality - Finding non stored fieldNames per document

2018-01-02 Thread Dawid Weiss
Only stored fields are kept for each document. If you need to dump internal data structures (terms, positions, offsets, payloads, you name it) you'll need to dive into the API and traverse all segments, then dump the above (and note that document IDs are per-segment and will have to be somehow

Re: Comparing two indexes for equality - Finding non stored fieldNames per document

2018-01-02 Thread Chetan Mehrotra
> How about the quickest solution: dump the content of both indexes to a document-per-line text
That would work (and is the plan), but so far I can only get the stored fields per document and no other per-document data. What other data can we get on a per-document basis using the Lucene API?

Re: Comparing two indexes for equality - Finding non stored fieldNames per document

2018-01-01 Thread Dawid Weiss
How about the quickest solution: dump the content of both indexes to a document-per-line text file, sort, diff? Even if your indexes are large, if you have a large spare disk, this will be super fast.
Dawid
On Tue, Jan 2, 2018 at 7:33 AM, Chetan Mehrotra wrote:
> Hi, >
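The dump, sort, diff idea can be sketched in a few lines. This is a minimal in-memory illustration, assuming each document has already been dumped to one text line that starts with a stable key (a path here, which is an assumption; internal document IDs would not work because they can differ between two indexes with identical content). For large indexes the same steps would be done on files with external sort and diff tools; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class DumpDiffSketch {

    // Sorts both dumps and walks them in parallel, reporting lines present on
    // only one side: "< " marks left-only lines, "> " marks right-only lines.
    static List<String> diff(List<String> left, List<String> right) {
        List<String> a = new ArrayList<>(left);
        List<String> b = new ArrayList<>(right);
        Collections.sort(a);
        Collections.sort(b);
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int c = a.get(i).compareTo(b.get(j));
            if (c == 0) {            // same line in both dumps
                i++;
                j++;
            } else if (c < 0) {      // line only in the left dump
                out.add("< " + a.get(i++));
            } else {                 // line only in the right dump
                out.add("> " + b.get(j++));
            }
        }
        while (i < a.size()) out.add("< " + a.get(i++));
        while (j < b.size()) out.add("> " + b.get(j++));
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical dumps: same documents, written in a different order,
        // with one field value that differs.
        List<String> oldIndex = Arrays.asList("/b title=x", "/a title=y");
        List<String> newIndex = Arrays.asList("/a title=y", "/b title=z");
        // prints [< /b title=x, > /b title=z]
        System.out.println(diff(oldIndex, newIndex));
    }
}
```

An empty result means the two dumps are equal regardless of the order in which documents were written, which is the property needed when two indexing approaches traverse the data differently.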

Comparing two indexes for equality - Finding non stored fieldNames per document

2018-01-01 Thread Chetan Mehrotra
Hi, We use Lucene for indexing in Jackrabbit Oak [2]. Recently we implemented a new indexing approach [1] which traverses the data to be indexed in a different way compared to the traversal approach we have been using so far. The new approach is faster and produces an index with the same number of