Hello, I think you are conflating two different index
structures, probably because of the names of the options in Solr.

1. indexing term vectors: this means that given a document, you can
look up a miniature "inverted index" for just that document. Each
document's "term vectors" contain a term dictionary of the terms in
that one document, and optionally things like positions and character
offsets. This is useful if you are examining *many terms* for just a
few documents. For example: the MoreLikeThis use case. In Solr this is
activated with termVectors=true. To additionally store
positions/offsets information inside the term vectors, use
termPositions=true and termOffsets=true, respectively.
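To make this concrete, here is a minimal sketch of reading a term vector for one document with the Lucene 4.6 API. The field name "body", the two-word document, and the tiny in-memory index are illustrative assumptions, not anything from this thread:

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class TermVectorDemo {
  static String run() throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
        Version.LUCENE_46, new WhitespaceAnalyzer(Version.LUCENE_46)));

    // Equivalent of termVectors/termPositions/termOffsets=true in a Solr schema
    FieldType ft = new FieldType(TextField.TYPE_STORED);
    ft.setStoreTermVectors(true);
    ft.setStoreTermVectorPositions(true);
    ft.setStoreTermVectorOffsets(true);

    Document doc = new Document();
    doc.add(new Field("body", "hello world", ft));
    writer.addDocument(doc);
    writer.close();

    IndexReader reader = DirectoryReader.open(dir);
    // The per-document "miniature inverted index": every term in doc 0
    Terms vector = reader.getTermVector(0, "body");
    TermsEnum termsEnum = vector.iterator(null);
    DocsAndPositionsEnum dpEnum = null;
    BytesRef text;
    StringBuilder out = new StringBuilder();
    while ((text = termsEnum.next()) != null) {
      dpEnum = termsEnum.docsAndPositions(null, dpEnum);
      dpEnum.nextDoc();
      for (long i = 0; i < termsEnum.totalTermFreq(); i++) {
        dpEnum.nextPosition(); // offsets are only defined after nextPosition()
        out.append(text.utf8ToString()).append(" ")
           .append(dpEnum.startOffset()).append("-")
           .append(dpEnum.endOffset()).append("\n");
      }
    }
    reader.close();
    return out.toString();
  }

  public static void main(String[] args) throws Exception {
    System.out.print(run());
  }
}
```

With a whitespace analyzer the two tokens of "hello world" come back (in sorted term order) with their character offsets, "hello 0-5" and "world 6-11".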

2. indexing character offsets: this means that given a term, you can
get the offset information along with each position that matches. You
can really think of this as a special form of a payload. This is
useful if you are examining *many documents* for just a few terms. For
example, many highlighting use cases. In Solr this is activated with
storeOffsetsWithPositions=true. It is unrelated to term vectors.
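A minimal sketch of this second structure with the Lucene 4.6 API, where offsets live in the postings themselves rather than in term vectors (again, the field name and the tiny in-memory index are illustrative assumptions):

```java
import java.io.IOException;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class PostingsOffsetsDemo {
  static String run() throws IOException {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
        Version.LUCENE_46, new WhitespaceAnalyzer(Version.LUCENE_46)));

    // Equivalent of storeOffsetsWithPositions=true in a Solr schema:
    // offsets are written into the postings lists, not term vectors
    FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
    ft.setIndexOptions(
        FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);

    Document doc = new Document();
    doc.add(new Field("body", "hello world", ft));
    writer.addDocument(doc);
    writer.close();

    IndexReader reader = DirectoryReader.open(dir);
    AtomicReader ar = SlowCompositeReaderWrapper.wrap(reader);
    // Given one term, walk every matching document's positions + offsets
    DocsAndPositionsEnum postings =
        ar.termPositionsEnum(new Term("body", "world"));
    StringBuilder out = new StringBuilder();
    while (postings.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
      for (int i = 0; i < postings.freq(); i++) {
        postings.nextPosition(); // offsets are only defined after nextPosition()
        out.append(postings.startOffset()).append("-")
           .append(postings.endOffset()).append("\n");
      }
    }
    reader.close();
    return out.toString();
  }

  public static void main(String[] args) throws IOException {
    System.out.print(run());
  }
}
```

For the single document "hello world", the term "world" matches at character offsets 6-11; if the field had been indexed without the OFFSETS option, startOffset()/endOffset() here would return -1 instead.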

Hopefully this helps.

On Mon, Mar 10, 2014 at 9:32 PM, Jefferson French <jkfaus...@gmail.com> wrote:
> This looks like a codec issue, but I'm not sure how to address it. I've
> found that a different instance of DocsAndPositionsEnum is instantiated
> between my code and Solr's TermVectorComponent.
>
> Mine:
> org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum
> Solr: 
> org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVDocsEnum
>
> As far as I can tell, I've only used Lucene/Solr 4.6, so I'm not sure where
> the Lucene 4.1 reference comes from. I've searched through the Solr config
> files and can't see where to change the codec, but shouldn't the reader use
> the same codec as used when the index was created?
>
>
> On Fri, Mar 7, 2014 at 1:37 PM, Jefferson French <jkfaus...@gmail.com> wrote:
>
>> We have an API on top of Lucene 4.6 that I'm trying to adapt to running
>> under Solr 4.6. The problem is although I'm getting the correct offsets
>> when the index is created by Lucene, the same method calls always return -1
>> when the index is created by Solr. In the latter case I can see the
>> character offsets via Luke, and I can even get them from Solr when I access
>> the /tvrh search handler, which uses the TermVectorComponent class.
>>
>> This is roughly how I'm reading character offsets in my Lucene code:
>>
>>> AtomicReader reader = ...
>>> Term term = ...
>>> DocsAndPositionsEnum postings = reader.termPositionsEnum(term);
>>> while (postings.nextDoc() != DocsAndPositionsEnum.NO_MORE_DOCS) {
>>>   for (int i = 0; i < postings.freq(); i++) {
>>>     postings.nextPosition(); // offsets are only defined after nextPosition()
>>>     System.out.println("start:" + postings.startOffset());
>>>     System.out.println("end:" + postings.endOffset());
>>>   }
>>> }
>>
>>
>> Notice that I want the values for a single term. When run against an index
>> created by Solr, the above calls to startOffset() and endOffset() return
>> -1. Solr's TermVectorComponent prints the correct offsets like this
>> (paraphrased):
>>
>>> IndexReader reader = searcher.getIndexReader();
>>> Terms vector = reader.getTermVector(docId, field);
>>> TermsEnum termsEnum = vector.iterator(null);
>>> DocsAndPositionsEnum dpEnum = null;
>>> BytesRef text;
>>> while((text = termsEnum.next()) != null) {
>>>   String term = text.utf8ToString();
>>>   int freq = (int) termsEnum.totalTermFreq();
>>>   dpEnum = termsEnum.docsAndPositions(null, dpEnum);
>>>   dpEnum.nextDoc();
>>>   for (int i = 0; i < freq; i++) {
>>>     final int pos = dpEnum.nextPosition();
>>>     System.out.println("start:" + dpEnum.startOffset());
>>>     System.out.println("end:" + dpEnum.endOffset());
>>>   }
>>> }
>>
>>
>> but in this case it is getting the offsets per doc ID, rather than a
>> single term, which is what I want.
>>
>> Could anyone tell me:
>>
>>    1. Why I'm not able to get the offsets using my first example, and/or
>>    2. A better way to get the offsets for a given term?
>>
>> Thanks.
>>
>>        Jeff