On Mon, Aug 12, 2013 at 8:38 AM, Mathias Lux <m...@itec.uni-klu.ac.at> wrote:
> Hi!
>
> I'm basically searching for a method to put byte[] data into Lucene
> DocValues of type BINARY (see [1]). Currently only primitives and
> Strings are supported according to [1].
>
> I know that this can be done with a custom update handler, but I'd
> like to avoid that.
>
Can you describe a little bit what kind of operations you want to do with it?

I don't really know how BinaryField is typically used, but maybe it could support this option. On the other hand, adding it to BinaryField might not "buy" you much without some additional stuff, depending upon what you need to do. For example, if you really want to sort/facet on the thing, SORTED(SET) would probably be a better implementation: it doesn't care that the values are binary.

BINARY, SORTED, and SORTED_SET actually all take byte[]. The difference is:

* SORTED: deduplicates/compresses the unique byte[] values and gives each document an ordinal number that reflects sort order (for sorting/faceting/grouping/etc.)
* SORTED_SET: similar, except each document has a "set" (which can be empty) of ordinal numbers (e.g. for faceting multivalued fields)
* BINARY: just stores the byte[] for each document (no deduplication, no compression, no ordinals, nothing).

So for sorting/faceting, BINARY is generally not very efficient unless there is something custom going on. For example, Lucene's faceting package stores the "values" elsewhere in a separate taxonomy index, so it uses this type just to encode a delta-compressed ordinal list for each document.

For scoring factors/function queries, encoding the values inside NUMERIC(s) [up to 64 bits each] might still be best on average: the compression applied here is surprisingly efficient.
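
To make the difference between the types a bit more concrete, here is a rough plain-Java sketch of the idea. It does not call Lucene at all: the class name DocValuesSketch and all of its methods are made up for illustration, and the unsigned byte-wise sort order is an assumption chosen to mimic how raw byte[] values are usually compared. It only shows what "deduplicate, sort, and give each document an ordinal" means, plus one possible way to pack a short byte[] (at most 8 bytes) into a single 64-bit value.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names are Lucene or Solr APIs.
class DocValuesSketch {

    // Assumption: values are ordered byte by byte, treating each byte as unsigned.
    static final Comparator<byte[]> UNSIGNED = Arrays::compareUnsigned;

    // SORTED-style step 1: collect the unique values in sort order.
    static byte[][] sortedDictionary(byte[][] perDocValues) {
        TreeSet<byte[]> unique = new TreeSet<>(UNSIGNED);
        unique.addAll(Arrays.asList(perDocValues));
        return unique.toArray(new byte[0][]);
    }

    // SORTED-style step 2: each document keeps only the ordinal (the position
    // of its value in the sorted dictionary), so comparing two documents is
    // just an int comparison.
    static int[] ordinals(byte[][] perDocValues, byte[][] dictionary) {
        int[] ords = new int[perDocValues.length];
        for (int doc = 0; doc < perDocValues.length; doc++) {
            ords[doc] = Arrays.binarySearch(dictionary, perDocValues[doc], UNSIGNED);
        }
        return ords;
    }

    // NUMERIC-style: pack at most 8 bytes into one long, big-endian.
    static long packToLong(byte[] value) {
        if (value.length > Long.BYTES) {
            throw new IllegalArgumentException("at most 8 bytes fit into 64 bits");
        }
        long packed = 0;
        for (byte b : value) {
            packed = (packed << 8) | (b & 0xFF);
        }
        return packed;
    }

    public static void main(String[] args) {
        // One value per document; documents 0 and 2 share the same value.
        byte[][] docs = { {3, 1}, {1}, {3, 1}, {(byte) 0xFF} };

        // BINARY-style storage would simply keep `docs` as it is.
        byte[][] dictionary = sortedDictionary(docs);
        int[] ords = ordinals(docs, dictionary);

        System.out.println("unique values: " + dictionary.length);
        System.out.println("ordinals per doc: " + Arrays.toString(ords));
        System.out.println("packed {1, 2}: " + packToLong(new byte[] {1, 2}));
    }
}
```

With BINARY-style storage, sorting would have to compare the raw byte[] of every document, while with the SORTED-style layout above the comparison only touches the per-document ordinals. A SORTED_SET-style layout would keep a small sorted set of such ordinals per document instead of a single one.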