Re: Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?

Aaron Daubman Tue, 05 Jun 2012 09:17:07 -0700

Thanks for the responses,

> By saying "dirty data" you imply that only one of the values is "good" or
> "clean" and that the others can be safely discarded/ignored, as opposed to
> true multi-valued data where each value is there for good reason and needs
> to be preserved. In any case, how do you know/decide which value should be
> used for sorting - and did you just get lucky that Solr happened to use the
> right one?
>


I haven't gone back and checked the old version's docs where this was
"working"; however, I suspect that either the field never ended up
appearing in docs more than once, or, if it did, it had the same value
repeated...

The real issue here is that the docs are created externally, and the
producer won't (yet) guarantee that fields that should appear once will
actually appear once. Because of this, I don't want to declare the field as
multiValued="false" as I don't want to cause indexing errors. It would be
great for me (and apparently many others after searching) if there were an
option as simple as forceSingleValued="true" - where some deterministic
behavior such as "use first field encountered, ignore all others", would
occur.


> The preferred technique would be to preprocess and "clean" the data before
> it is handed to Solr or SolrJ, even if the source must remain "dirty".
> Barring that, a preprocessor or a custom update processor, certainly.
>

I could write preprocessors (this is really what will eventually happen
when the producer cleans their data), custom processors, etc.; however,
for something this simple it would be great not to produce more code
that has to be maintained.



> Please clarify exactly how the data is being fed into Solr.
>

I am using "generic" code to read from a key/value store and compose
documents. This is another reason fixing the data at this point would not
be desirable: the currently generic code would need to be made specific to
look for these particular fields and then coerce them to single values...
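
For illustration only, here is a minimal sketch of what a "use first value
encountered, ignore all others" coercion might look like if it did have to
live in the document-composing code. It is plain Java with no Solr
dependency. The class name FirstValueCoercer, the map-of-lists document
shape, and the field names in the usage note are all hypothetical, not taken
from the actual code. The set of fields to coerce is passed in as
configuration, so the composing code itself would not need to know about
particular fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical helper: keeps only the first value of selected fields.
 * A "document" is modelled here as field name -> list of values, which is
 * an assumption about how the generic composing code holds its data.
 */
public class FirstValueCoercer {
    // Names of the fields that must end up with at most one value.
    private final Set<String> singleValuedFields;

    public FirstValueCoercer(Set<String> singleValuedFields) {
        this.singleValuedFields = singleValuedFields;
    }

    /** Returns a copy of doc; the input map is not modified. */
    public Map<String, List<Object>> coerce(Map<String, List<Object>> doc) {
        Map<String, List<Object>> result = new LinkedHashMap<String, List<Object>>();
        for (Map.Entry<String, List<Object>> entry : doc.entrySet()) {
            List<Object> values = entry.getValue();
            if (singleValuedFields.contains(entry.getKey()) && values.size() > 1) {
                // Deterministic rule: keep the first value, drop the rest.
                result.put(entry.getKey(), new ArrayList<Object>(values.subList(0, 1)));
            } else {
                // Other fields, and fields that already have 0 or 1 values,
                // are copied unchanged.
                result.put(entry.getKey(), new ArrayList<Object>(values));
            }
        }
        return result;
    }
}
```

With "sort_key" configured as single-valued, a document carrying sort_key
values ["b", "a"] would come out with only ["b"], while a genuinely
multi-valued field such as "tags" would pass through untouched. "First" here
means first in the order the values were read from the store, so the result
is only as deterministic as that read order.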

Thanks again,
      Aaron
