Based on the suggestion here, I implemented a script to un-invert the
index (details at OAK-7122 [1], [2]).
The uninverting was done with the following logic:
def collectFieldNames(DirectoryReader reader) {
    println "Proceeding to collect the field names per document"
    Bits liveDocs = MultiFields.getLiveDocs(reader)
    // ... walk each field's TermsEnum and record which live docs contain it
}
> This isn't an API problem. This is by design -- this is how it works.
Ack. What I was referring to wrt the API earlier is that uninverting
the index is not a direct operation and hence is not supported via the
API. It would need to be done using other APIs and would require
post-processing of the index.
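(To make the "post-processing" concrete: this is not the Lucene API, just a toy sketch that treats the inverted index as a plain map from term to posting list and rebuilds the forward, per-document view from it.)

```java
import java.util.*;

public class Uninvert {
    // A Lucene-style inverted index maps term -> list of doc IDs containing it.
    // "Uninverting" rebuilds the forward view: doc ID -> terms it contains.
    static Map<Integer, Set<String>> uninvert(Map<String, List<Integer>> inverted) {
        Map<Integer, Set<String>> forward = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : inverted.entrySet()) {
            for (int doc : e.getValue()) {
                forward.computeIfAbsent(doc, d -> new TreeSet<>()).add(e.getKey());
            }
        }
        return forward;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> inverted = new HashMap<>();
        inverted.put("lucene", Arrays.asList(0, 2));
        inverted.put("oak", Arrays.asList(1, 2));
        // prints {0=[lucene], 1=[oak], 2=[lucene, oak]}
        System.out.println(uninvert(inverted));
    }
}
```

In the real index the "map" is reached by iterating each field's terms and their postings, which is exactly the traversal the script above performs.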
> That helps and explains why there is no support in std api
This isn't an API problem. This is by design -- this is how it works.
If you wish
to retrieve fields that are indexed and stored with the document, the
API provides
such an option (indexed and stored field type). Your indexed fields
are
>> So unless you "store" that value
>> with the document as a stored field, you'll have to "uninvert" the
>> index yourself.
That helps and explains why there is no support in std api
> Luke has some capabilities to look at the index at a low level,
> perhaps that could give you some pointers. I
Luke has some capabilities to look at the index at a low level,
perhaps that could give you some pointers. I think you can pull
the older branch from here:
https://github.com/DmitryKey/luke
or:
https://code.google.com/archive/p/luke/
NOTE: This is not a part of Lucene, but an independent project
Ok. I think you should look at the Java API -- this will give you more
clarity about what is actually stored in the index
and how to extract it. The thing (I think) you're missing is that an
inverted index points in the "other" direction (from a given value to
all documents that contained it). So
> Only stored fields are kept for each document. If you need to dump
> internal data structures (terms, positions, offsets, payloads, you
> name it) you'll need to dive into the API and traverse all segments,
> then dump the above (and note that document IDs are per-segment and
> will have to be
Only stored fields are kept for each document. If you need to dump
internal data structures (terms, positions, offsets, payloads, you
name it) you'll need to dive into the API and traverse all segments,
then dump the above (and note that document IDs are per-segment and
will have to be somehow
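(The per-segment doc ID point can be illustrated without Lucene at all: a global doc ID is the segment's doc base -- the sum of the doc counts of all preceding segments, which is what Lucene exposes as `LeafReaderContext.docBase` -- plus the segment-local ID. A toy sketch:)

```java
public class DocBase {
    // Each segment numbers its documents from 0. A global doc ID is the
    // segment's base (sum of maxDoc of all preceding segments) + local ID.
    static int globalId(int[] segmentMaxDocs, int segment, int localId) {
        int base = 0;
        for (int i = 0; i < segment; i++) {
            base += segmentMaxDocs[i];
        }
        return base + localId;
    }

    public static void main(String[] args) {
        int[] maxDocs = {5, 3, 7};                   // three segments
        System.out.println(globalId(maxDocs, 0, 4)); // 4
        System.out.println(globalId(maxDocs, 1, 0)); // 5
        System.out.println(globalId(maxDocs, 2, 6)); // 14
    }
}
```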
> How about the quickest solution: dump the content of both indexes to a
document-per-line text
That would work (and is the plan) but so far I can only get stored
fields per document and no other data on a per-document basis. What
other data can we get on a per-document basis using the Lucene API?
How about the quickest solution: dump the content of both indexes to a
document-per-line text
file, sort, diff?
Even if your indexes are large, if you have large spare disk, this
will be super fast.
Dawid
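(The sort-and-diff idea can be sketched in a few lines -- assuming each index is dumped as one line per document, this toy version reports the lines that appear in only one of the two dumps, which is what `sort` + `diff` would surface:)

```java
import java.util.*;

public class IndexDiff {
    // Given two document-per-line dumps, return the lines present in
    // exactly one dump (the symmetric difference), sorted.
    static Set<String> symmetricDiff(List<String> dumpA, List<String> dumpB) {
        Set<String> onlyA = new TreeSet<>(dumpA);
        onlyA.removeAll(dumpB);
        Set<String> onlyB = new TreeSet<>(dumpB);
        onlyB.removeAll(dumpA);
        Set<String> diff = new TreeSet<>(onlyA);
        diff.addAll(onlyB);
        return diff;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("doc1|f1=x", "doc2|f1=y");
        List<String> b = Arrays.asList("doc1|f1=x", "doc2|f1=z");
        // prints [doc2|f1=y, doc2|f1=z]
        System.out.println(symmetricDiff(a, b));
    }
}
```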
On Tue, Jan 2, 2018 at 7:33 AM, Chetan Mehrotra wrote:
> Hi,
>
Hi,
We use Lucene for indexing in Jackrabbit Oak [2]. Recently we
implemented a new indexing approach [1] which traverses the data to be
indexed in a different way compared to the traversal approach we have
been using so far. The new approach is faster and produces an index
with the same number of