Re: Field compression and storage optimization

Mike Klaas Fri, 01 Sep 2006 11:00:44 -0700

On 9/1/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:

> A couple of thoughts...
>  - should this be specific to highlighting? (if not, the name should change)


Not necessarily. I named it that because I had trouble thinking of
applications for only sometimes storing term vectors for a field.
But since I've misunderstood how term vectors work, the only thing
that remains is compression, which is more generally applicable.

>  - compression options make sense for both text and string fields...
> perhaps it should just be added there.


That sounds ideal.  Perhaps a compressed=true/false property with an
optional compressionThreshold (by default, compress everything)?
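
To make the idea concrete, here is a rough sketch in plain Java with no
Lucene or Solr dependency. The names CompressionPolicy, compressAll and
shouldCompress are hypothetical, and I am assuming that "compress all"
can be expressed as a threshold of 0 characters.

```java
/** Hypothetical sketch: decides whether a stored field value gets compressed. */
final class CompressionPolicy {
    private final boolean compressed;        // the compressed=true/false property
    private final int compressionThreshold;  // minimum length (in characters) to compress

    CompressionPolicy(boolean compressed, int compressionThreshold) {
        if (compressionThreshold < 0) {
            throw new IllegalArgumentException("compressionThreshold must be >= 0");
        }
        this.compressed = compressed;
        this.compressionThreshold = compressionThreshold;
    }

    /** compressed=true with no threshold given: assumed to mean "compress everything". */
    static CompressionPolicy compressAll() {
        return new CompressionPolicy(true, 0);
    }

    /** True when compression is enabled and the value is at least threshold characters long. */
    boolean shouldCompress(String value) {
        return compressed && value.length() >= compressionThreshold;
    }
}
```

With this shape, compressed=false always wins, and the threshold only
matters once compression has been switched on.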

Should these types of parameters be overridable at the
field-definition level?  That is a bit difficult, since field
properties are boolean and there would have to be some means of
determining whether a field property has been set at all.
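
One possible way to track "set or not" is to keep a second mask next to
the boolean values. This is only a sketch under the assumption that
properties are held as bit flags; PropertyOverrides, resolve and the two
flag constants are hypothetical names, not existing Solr code.

```java
/** Hypothetical sketch: field-level overrides on top of field-type properties. */
final class PropertyOverrides {
    static final int COMPRESSED   = 1 << 0;
    static final int TERM_VECTORS = 1 << 1;

    private PropertyOverrides() {}

    /**
     * @param typeProps    property values inherited from the field type
     * @param fieldValues  property values given on the field definition
     * @param fieldSetMask bits the field definition explicitly set (true OR false)
     * @return effective properties: the field value where set, the type value otherwise
     */
    static int resolve(int typeProps, int fieldValues, int fieldSetMask) {
        return (typeProps & ~fieldSetMask) | (fieldValues & fieldSetMask);
    }
}
```

The extra mask is what lets a field say compressed=false explicitly and
have that differ from not mentioning compressed at all.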

>  - if you store term vectors for longer fields, shouldn't you just
> store them for all fields (the longer ones will presumably take up the
> bulk of the index anyway)


True, it might make more sense to reverse the inequality.

> Regarding term vectors... like some other field properties, they are
> per-field and not per-field-instance (so you can't turn it on for some
> and off for others).  On document retrieval, I think one would detect
> that term vectors were stored, but one wouldn't get back any terms (I
> haven't tried this though).  I doubt the highlighter handles this
> case.


If they are per-field, does that mean that term-vectors are generated
for all documents for a field if only one document requests them?  If
so, there is little point to this optimization.

If not, however, the highlighting code currently works by attempting
term-vector retrieval and falling back on re-analysis, so I believe
that it should be fine.
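
The fallback I have in mind looks roughly like the sketch below. It is
plain Java, not the actual highlighting code: TokenSource and tokensFor
are hypothetical names, and the token lists stand in for whatever the
term vector or the analyzer would really return. Treating an empty term
vector the same as a missing one is my assumption, to cover the case
where a vector is reported as stored but yields no terms.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical sketch: prefer stored term-vector data, else re-analyze the text. */
final class TokenSource {
    private TokenSource() {}

    /**
     * @param storedTermVector tokens read from a stored term vector, if any
     * @param reanalyze        fallback that re-analyzes the stored field text
     */
    static List<String> tokensFor(Optional<List<String>> storedTermVector,
                                  Supplier<List<String>> reanalyze) {
        // Assumption: an empty vector is treated like "no vector stored".
        return storedTermVector
                .filter(tokens -> !tokens.isEmpty())
                .orElseGet(reanalyze);
    }
}
```

The fallback is only invoked when needed, so documents that do have
term vectors never pay the re-analysis cost.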

-Mike

On 8/31/06, Mike Klaas <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I was thinking of enabling the compressed=True field option (which
> currently has no effect), as compression is important for highlighting
> large fields (since they must be stored).
>
> However, rather than exposing a lucene implementation detail, I
> decided to create a FieldType which dynamically chooses to compress
> and/or term-vector a field depending on the field length (configurable
> in the field type).
>
> Any objections to committing this?
>
> -Mike
>
> public class HighlitTextField extends TextField {
>   /* if field size (in characters) is at least this threshold, the field
>      will be stored compressed */
>   public static int DEFAULT_COMPRESS_THRESHOLD = 200;
>   /* if field size (in characters) is at least this threshold, the field
>      will have term vector data stored */
>   public static int DEFAULT_TERMVEC_THRESHOLD = 500;
>
>   int compressThreshold;
>   int termVecThreshold;
>
>   private static String CT = "compressThreshold";
>   private static String TV = "termVecThreshold";
>
>   protected void init(IndexSchema schema, Map<String,String> args) {
>     SolrParams p = new MapSolrParams(args);
>     compressThreshold = p.getInt(CT, DEFAULT_COMPRESS_THRESHOLD);
>     termVecThreshold = p.getInt(TV, DEFAULT_TERMVEC_THRESHOLD);
>     for(String prop: new String[]{CT, TV})
>       args.remove(prop);
>     super.init(schema, args);
>   }
>
>   /* Helpers for field construction */
>   protected Field.TermVector getFieldTermVec(SchemaField field,
>                                              String internalVal) {
>     /* store all termvec data if field length reaches the threshold */
>     return internalVal.length() >= termVecThreshold ?
>       Field.TermVector.WITH_POSITIONS_OFFSETS : Field.TermVector.NO;
>   }
>   protected Field.Store getFieldStore(SchemaField field,
>                                       String internalVal) {
>     /* compress field if length reaches the threshold */
>     return internalVal.length() >= compressThreshold ?
>       Field.Store.COMPRESS : Field.Store.YES;
>   }
> }
