Re: Field compression and storage optimization

Yonik Seeley Fri, 01 Sep 2006 12:22:34 -0700

On 9/1/06, Mike Klaas <[EMAIL PROTECTED]> wrote:

On 9/1/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> A couple of thoughts...
>  - should this be specific to highlighting? (if not, the name should change)


Not necessarily--the reason I named it as such is that I had trouble
thinking of applications of only-sometimes storing term vectors for a
field.  Though since I've misunderstood how term vectors work, the
only thing that remains is compression, which is more generally
applicable.

>  - compression options make sense for both text and string fields...
> perhaps it should just be added there.

That sounds ideal.  Perhaps a compressed=true/false with optional
compressionThreshold (default compress all)?
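
A minimal sketch of how such a pair of settings might behave, for illustration only. The class and method names are hypothetical and not part of any existing schema code, and it assumes the threshold is a character count with 0 meaning "compress all". It uses java.util.zip from the JDK as a stand-in for whatever compression the index would actually apply.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class FieldCompressionSketch {
    /** Hypothetical per-field settings; a threshold of 0 means "compress all". */
    static final class Options {
        final boolean compressed;
        final int compressionThreshold; // assumed unit: characters
        Options(boolean compressed, int compressionThreshold) {
            this.compressed = compressed;
            this.compressionThreshold = compressionThreshold;
        }
    }

    /** Compress only when enabled and the value is at least as long as the threshold. */
    static boolean shouldCompress(String value, Options opts) {
        return opts.compressed && value.length() >= opts.compressionThreshold;
    }

    /** Deflate the UTF-8 bytes of the value. */
    static byte[] compress(String value) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(value.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Inverse of compress(); rejects truncated or malformed input. */
    static String decompress(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported input");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("malformed compressed data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

With a threshold of 0 every value of a compressed field is deflated; a larger threshold leaves short values alone, where the compression overhead is unlikely to pay off.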

Should these types of parameters be overridable at the
field-definition level?  It is a bit difficult since field properties
are boolean and there would have to be some means of determining
whether a field property is set or not.

>  - if you store term vectors for longer fields, shouldn't you just
> store them for all fields (the longer ones will presumably take up the
> bulk of the index anyway)

True, it might make more sense to reverse the inequality.

> Regarding term vectors... like some other field properties, they are
> per-field and not per-field-instance (so you can't turn it on for some
> and off for others).  On document retrieval, I think one would detect
> that term vectors were stored, but one wouldn't get back any terms (I
> haven't tried this though).  I doubt the highlighter handles this
> case.

If they are per-field, does that mean that term-vectors are generated
for all documents for a field if only one document requests them?  If
so, there is little point to this optimization.


No, term vectors are only generated for fields where it is explicitly set.
But, the per-segment FieldInfos keeps track of "indexed", "omitNorms"
and "termVectors" on a per-fieldname basis.  When segments are merged,
if one segment doesn't have termvectors stored and another segment
does, the entire new segment is marked as having termvectors.
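
The merge behavior described here can be modeled with a small, self-contained sketch. This is not Lucene's FieldInfos code; the class name and the map-based representation are assumptions made purely for illustration. Each segment is reduced to a map from field name to its "termVectors" flag, and merging ORs the flags per field name.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TermVectorFlagMergeSketch {
    /**
     * Merge per-segment "termVectors" flags (hypothetical model, non-null values).
     * A field is flagged in the merged segment if any input segment flagged it.
     */
    static Map<String, Boolean> merge(List<Map<String, Boolean>> segments) {
        Map<String, Boolean> merged = new LinkedHashMap<>();
        for (Map<String, Boolean> segment : segments) {
            for (Map.Entry<String, Boolean> e : segment.entrySet()) {
                // Logical OR: once true for a field name, the flag stays true.
                merged.merge(e.getKey(), e.getValue(), Boolean::logicalOr);
            }
        }
        return merged;
    }
}
```

Under this model, merging a segment where "body" has the flag set with one where it does not yields a merged segment where "body" is flagged, even though some of its documents never stored a term vector.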

If not, however, the highlighting code currently works by attempting
term-vector retrieval and falling back on re-analysis, so I believe
that it should be fine.


I haven't tried it, but I think it might be impossible to tell an
empty field with termvectors stored from a field without termvectors
that was "promoted" to having termvectors.  I think
reader.getTermFreqVector(docId,field) may *not* return null in the
latter case.
Anyone know for sure?
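
If the concern above turns out to be real, one defensive option is for the fallback to treat an empty term vector the same as a missing one. The sketch below is only an illustration: TermVectorLookup is a hypothetical stand-in for the real term-vector call, and reanalyze() is a crude lowercase word splitter standing in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class HighlightTermsSketch {
    /** Hypothetical lookup; may return null or an empty array when no vector was stored. */
    interface TermVectorLookup {
        String[] termsFor(int docId, String field);
    }

    /** Prefer stored term vectors; fall back to re-analysis when they are null or empty. */
    static List<String> termsForHighlighting(TermVectorLookup lookup, int docId,
                                             String field, String storedText) {
        String[] fromVector = lookup.termsFor(docId, field);
        if (fromVector != null && fromVector.length > 0) {
            return Arrays.asList(fromVector);
        }
        return reanalyze(storedText);
    }

    /** Stand-in for re-analysis: lowercase the text and split on non-word characters. */
    static List<String> reanalyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }
}
```

The cost of this choice is that a genuinely empty field with term vectors stored is re-analyzed as well, which should be cheap since there is nothing to analyze.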

-Yonik
