Re: LanguageIdentifierUpdateProcessor uses only firstValue() on multivalued fields

Trey Grainger Thu, 28 Nov 2013 16:06:42 -0800

Yeah, the documentation is wrong: the processor doesn't concatenate the
values in a multivalued field, it only uses the first one, as you
mentioned.


If you want to detect the language of each of the values in the
multivalued field (as opposed to specifying multiple separate string
values), this is easy enough to accomplish by modifying the code in the
language detect update processor to loop through each of the values:

LinkedHashSet<String> langsToPrepend = new LinkedHashSet<String>();
for (final Object inputValue : inputField.getValues()) {
    // Skip non-String values (Integer, Date, ...). Without this check,
    // fieldValueLangs would stay null and the inner loop would throw
    // a NullPointerException.
    if (!(inputValue instanceof String)) {
        continue;
    }
    List<DetectedLanguage> fieldValueLangs =
        this.detectLanguage((String) inputValue);
    for (DetectedLanguage lang : fieldValueLangs) {
        langsToPrepend.add(lang.getLangCode());
    }
}

The "langsToPrepend" variable above will contain a set of languages,
where detectLanguage was called separately for each value in the
multivalued field.  If you just want to concatenate all the values and
detect languages once (as opposed to only using the first value in the
multivalued field, like it does today), just concatenate each of the
input values in the first loop and call detectLanguage once at the
end.
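
In case it helps, here is a minimal, self-contained sketch of that
concatenation variant. It is not the code from Solr or from the book: the
class and method names (ConcatValuesSketch, concatStringValues) are made
up for illustration, and it assumes you pass in the collection you get
from inputField.getValues(). It joins the String values with a single
space and skips everything else.

```java
import java.util.Collection;

// Hypothetical helper, not part of Solr: builds the single string that
// would be handed to detectLanguage once per multivalued field.
class ConcatValuesSketch {

    // Joins all String values with a space; non-String values
    // (Integer, Date, ...) are ignored. Returns "" for null or no Strings.
    static String concatStringValues(Collection<?> values) {
        StringBuilder sb = new StringBuilder();
        if (values == null) {
            return "";
        }
        for (Object value : values) {
            if (value instanceof String) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append((String) value);
            }
        }
        return sb.toString();
    }
}
```

Inside the processor you would then make a single call along the lines of
detectLanguage(concatStringValues(inputField.getValues())) instead of one
call per value.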

I wrote code that does this for an example in the Solr in Action book.
The particular example was detecting languages for each value in a
multivalued field and then prepending the language to the text for
the multivalued field (so the analyzer would know which stemmer to
use, as they were being dynamically substituted in based upon the
language).  The code is available here if you are interested:
https://github.com/treygrainger/solr-in-action/blob/master/src/main/java/sia/ch14/MultiTextFieldLanguageIdentifierUpdateProcessor.java

Good luck!

-Trey




On Wed, Nov 27, 2013 at 10:16 AM, Müller, Stephan <
muel...@ponton-consulting.de> wrote:

> > I suspect that it is an oversight for a use case that was not considered.
> > I mean, it should probably either ignore or convert non text/string
> > values.
> Ok, I'll see that I provide a patch against trunk. It actually
> ignores non string values, but is unable to check the remaining values
> of a multivalued field.
>
> > Hmmm... are you using JSON input? I mean, how are the types being set?
> > Solr XML doesn't have a way to set the value types.
> >
> No. It's a field with multivalued=true. That results in a SolrInputField
> where value (which is defined to be Object) actually holds a List.
> This list is populated with Integer, String, Date, you name it.
> I'm talking about the actual Java-Datatypes. The values in the list are
> probably set by this 3rdparty Textbodyprocessor thingy.
>
> Now the Language processor just asks for field.getValue().
> This is delegated to the SolrInputField, which in turn calls firstValue().
> Interestingly enough, it is already able to handle a Collection as its
> value. But if the value is a collection, it just returns the first element.
>
> > You could workaround it with an update processor that copied the field and
> > massaged the multiple values into what you really want the language
> > detection to see. You could even implement that processor as a JavaScript
> > script with the stateless script update processor.
> >
> Our workaround would be to not feed the multivalued field but only the
> String fields (which are also included in the multivalued field)
>
>
> Filing a Bug/Feature request and providing the patch will take some time
> as I haven't setup a fully working trunk in my IDEA installation.
> But I'm eager to do it :)
>
> Regards,
> Stephan
>
>
> > -- Jack Krupansky
> >
> > -----Original Message-----
> > From: Müller, Stephan
> > Sent: Wednesday, November 27, 2013 5:02 AM
> > To: solr-user@lucene.apache.org
> > Subject: LanguageIdentifierUpdateProcessor uses only firstValue() on
> > multivalued fields
> >
> > Hello,
> >
> > this is a repost. This message was originally posted on the 'general' list
> > but it was suggested, that the 'user' list might be a better place to ask.
> >
> > ---- Original Message ----
> > Hi,
> >
> > we are passing a multivalued field to the
> > LanguageIdentifierUpdateProcessor.
> > This multivalued field contains arbitrary types (Integer, String, Date).
> >
> > Now, the LanguageIdentifierUpdateProcessor.concatFields(SolrInputDocument
> > doc, String[] fields), which btw does not use the parameter fields, is
> > unable to parse all fields of the/a multivalued field. The call "Object
> > content = doc.getFieldValue(fieldName);" does not care what type the field
> > is and just delegates to SolrInputDocument which in turn calls
> > getFirstValue.
> >
> > So, two issues:
> > First - if the first value of the multivalued field is not of type String,
> > the field is ignored completely.
> >
> > Second - the concat method does not concat all values of a multivalued
> > field.
> >
> > While http://www.mail-archive.com/solr-
> > u...@lucene.apache.org/msg90530.html
> > states: "The feature is designed to detect exactly one language per field.
> > In case of multivalued, it will concatenate all values before detection."
> > But as far as I can see, the code is unable to do this at all for
> > multivalued fields.
> >
> > This behavior was found in 4.3 but the code is still the same for current
> > trunk (as of 2013-11-26)
> >
> > Is this a bug? Is this a special design decision? Did we miss a certain
> > configuration, that would allow the Language identification to use all
> > values of a multivalued field?
> >
> > We are about to write our own
> > LangDetectLanguageIdentifierUpdateProcessorFactory (why is the getInstance
> > hardcoded to return LanguageIdentifierUpdateProcessor?) and overwrite
> > LanguageIdentifierUpdateProcessor to handle all values of a multivalued
> > field, ignoring non-string values.
> >
> >
> >
> > Please see configuration below.
> >
> > I hope I was able to make myself clear. I'd like to hear your thoughts on
> > this, before I go off and file a bug report.
> >
> > Regards,
> > Stephan
> >
> >
> > A little background:
> > We are using a 3rd-party CMS framework which pulls in some magic SOLR
> > configuration (namely the textbody field).
> >
> > The textbody field is defined as follows:
> > <!--
> > The default text search field.
> > This field and the field name_tokenized are used as default search fields
> > for the /editor and /cmdismax search request handlers in solrconfig.xml.
> >
> > For the Content Feeder the text of all indexed fields of the CoreMedia
> > document is stored in this field.
> > The CAE Feeder by default stores the text of all elements in this field.
> > -->
> > <field name="textbody" type="text_general" stored="false"
> > multiValued="true"/>
> >
> > As you can see, it is also used as search field, therefore we want to have
> > the actual datatypes on the values.
> > The field itself is generated by a processor, prior to calling the
> > language identification (see processor chain).
> >
> >
> >
> > The processor chain:
> >
> > <updateRequestProcessorChain>
> >   <!-- Improve error messages -->
> >   <processor class="3rdpartypackage.ErrorHandlingProcessorFactory" />
> >
> >   <!-- Blob extraction -->
> >   <processor class="3rdpartypackage.BinaryDataProcessorFactory">
> >     <!-- some comments -->
> >   </processor>
> >
> >   <!-- Textbody handling -->
> >   <processor class="3rdpartypackage.TextBodyProcessorFactory" />
> >
> >   <!-- Copy content of field name to name_tokenized -->
> >   <processor class="solr.CloneFieldUpdateProcessorFactory">
> >     <str name="source">name</str>
> >     <str name="dest">name_tokenized</str>
> >   </processor>
> >
> >   <!--Language detection -->
> >   <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
> >     <str name="langid.fl">textbody,name_tokenized</str>
> >     <str name="langid.langField">language</str>
> >     <str name="langid.fallback">en</str>
> >   </processor>
> >
> >   <!-- Index into language dependent fields if defined (e.g. textbody_en
> > instead of textbody) -->
> >   <processor class="3rdpartypackage.solr.update.processor.LanguageDependentFieldsProcessorFactory">
> >     <str name="languageField">language</str>
> >     <str name="textFields">textbody,name_tokenized</str>
> >   </processor>
> >
> >   <processor class="solr.RunUpdateProcessorFactory" />
> > </updateRequestProcessorChain>
>
>
