Hello, I think you are confusing two different index structures, probably because of the names of the options in Solr.

1. Indexing term vectors: given a document, you can look up a miniature
   "inverted index" just for that document. Each document has a "term
   vector" holding a term dictionary of the terms in that one document
   and, optionally, things like positions and character offsets. This is
   useful if you are examining *many terms* for just a few documents, for
   example the MoreLikeThis use case. In Solr this is activated with
   termVectors=true. To additionally store position and offset information
   inside the term vectors, use termPositions=true and termOffsets=true,
   respectively.

2. Indexing character offsets: given a term, you can get the offset
   information "along with" each position that matched, so you can think
   of this as a special form of payload. This is useful if you are
   examining *many documents* for just a few terms, for example many
   highlighting use cases. In Solr this is activated with
   storeOffsetsWithPositions=true. It is unrelated to term vectors.

Your first snippet reads the postings (case 2), while TermVectorComponent
reads the term vectors (case 1). If the Solr field only has term vectors
enabled, that would explain the -1 offsets.

Hopefully this helps.

On Mon, Mar 10, 2014 at 9:32 PM, Jefferson French <jkfaus...@gmail.com> wrote:

> This looks like a codec issue, but I'm not sure how to address it. I've
> found that a different instance of DocsAndPositionsEnum is instantiated
> between my code and Solr's TermVectorComponent.
>
> Mine:
> org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum
> Solr:
> org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVDocsEnum
>
> As far as I can tell, I've only used Lucene/Solr 4.6, so I'm not sure where
> the Lucene 4.1 reference comes from. I've searched through the Solr config
> files and can't see where to change the codec, but shouldn't the reader use
> the same codec as used when the index was created?
>
>
> On Fri, Mar 7, 2014 at 1:37 PM, Jefferson French <jkfaus...@gmail.com> wrote:
>
>> We have an API on top of Lucene 4.6 that I'm trying to adapt to running
>> under Solr 4.6.
>> The problem is that although I'm getting the correct offsets
>> when the index is created by Lucene, the same method calls always return -1
>> when the index is created by Solr. In the latter case I can see the
>> character offsets via Luke, and I can even get them from Solr when I access
>> the /tvrh search handler, which uses the TermVectorComponent class.
>>
>> This is roughly how I'm reading character offsets in my Lucene code:
>>
>>> AtomicReader reader = ...
>>> Term term = ...
>>> DocsAndPositionsEnum postings = reader.termPositionsEnum(term);
>>> while (postings.nextDoc() != DocsAndPositionsEnum.NO_MORE_DOCS) {
>>>   for (int i = 0; i < postings.freq(); i++) {
>>>     postings.nextPosition(); // offsets are only valid after nextPosition()
>>>     System.out.println("start:" + postings.startOffset());
>>>     System.out.println("end:" + postings.endOffset());
>>>   }
>>> }
>>
>>
>> Notice that I want the values for a single term. When run against an index
>> created by Solr, the above calls to startOffset() and endOffset() return
>> -1. Solr's TermVectorComponent prints the correct offsets like this
>> (paraphrased):
>>
>>> IndexReader reader = searcher.getIndexReader();
>>> Terms vector = reader.getTermVector(docId, field);
>>> TermsEnum termsEnum = vector.iterator(null);
>>> DocsAndPositionsEnum dpEnum = null;
>>> BytesRef text;
>>> while ((text = termsEnum.next()) != null) {
>>>   String term = text.utf8ToString();
>>>   // for a term vector this is the frequency within the one document
>>>   int freq = (int) termsEnum.totalTermFreq();
>>>   dpEnum = termsEnum.docsAndPositions(null, dpEnum);
>>>   dpEnum.nextDoc();
>>>   for (int i = 0; i < freq; i++) {
>>>     final int pos = dpEnum.nextPosition();
>>>     System.out.println("start:" + dpEnum.startOffset());
>>>     System.out.println("end:" + dpEnum.endOffset());
>>>   }
>>> }
>>
>>
>> but in this case it is getting the offsets per doc ID, rather than a
>> single term, which is what I want.
>>
>> Could anyone tell me:
>>
>> 1. Why I'm not able to get the offsets using my first example, and/or
>> 2. A better way to get the offsets for a given term?
>>
>> Thanks.
>>
>> Jeff
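
P.S. As a rough illustration of the two options described at the top of
this reply, a schema.xml fragment might look something like the one below.
The field names and the text_general type are placeholders chosen for the
example, not taken from the thread; only the attribute names come from the
explanation above.

```xml
<!-- Hypothetical field names; "text_general" is assumed to exist in your schema. -->

<!-- Case 1: per-document term vectors, with positions and offsets stored
     inside the vectors (what TermVectorComponent / the /tvrh handler reads). -->
<field name="body_tv" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

<!-- Case 2: offsets stored alongside positions in the postings, so a
     per-term DocsAndPositionsEnum can report startOffset()/endOffset(). -->
<field name="body_offsets" type="text_general" indexed="true" stored="true"
       storeOffsetsWithPositions="true"/>
```

The two settings are independent, so a field can enable either one or both.
Documents need to be reindexed after changing these attributes for the new
setting to take effect.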