Re: Re: LanguageIdentifierUpdateProcessor uses only firstValue() on multivalued fields

Müller, Stephan Fri, 29 Nov 2013 03:12:23 -0800

Hello Trey, thank you for this example.

We've solved it by omitting the multivalued field and passing the distinct 
string fields instead. Still, I will propose a patch so that the language 
processor can concatenate multiple values by default. I think it's a 
reasonable feature (and I can't remember ever having contributed a patch to 
an open source project).
My thoughts on the patch implementation are much the same as yours: iterating 
over getValues(). I'll bring this up on the dev list and probably in JIRA.
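
To make the idea concrete, here is a minimal sketch of concatenating only the String values of a multivalued field. It uses a plain Java Collection as a stand-in for what SolrInputField.getValues() returns; the class and method names (MultiValueConcat, concatStrings) and the single-space separator are my own assumptions, not anything in Solr or in the eventual patch.

```java
import java.util.Collection;

/** Hypothetical helper for illustration only; not part of Solr. */
final class MultiValueConcat {
    private MultiValueConcat() {}

    /**
     * Joins every String in the given values with a single space.
     * Non-String values (Integer, Date, ...) are skipped.
     * A null collection yields an empty string.
     */
    static String concatStrings(Collection<?> values) {
        StringBuilder sb = new StringBuilder();
        if (values == null) {
            return "";
        }
        for (Object value : values) {
            if (value instanceof String) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append((String) value);
            }
        }
        return sb.toString();
    }
}
```

The joined string would then be handed to language detection once, instead of only the first value.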



One question: how do you guard against a possible NPE in line 129?
> (final Object inputValue : inputField.getValues()) {

SolrInputField.getValues() returns null if the associated value is null; 
it does not create an empty Collection.
That, by the way, seems to be a minor bug in the Javadoc, which does not state 
that this method can return null.
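
One possible guard, sketched under the assumption that treating a null result as "no values" is acceptable: wrap the call so the for-each loop always receives a Collection. NullSafeValues and orEmpty are hypothetical names of mine, not Solr API.

```java
import java.util.Collection;
import java.util.Collections;

/** Hypothetical helper for illustration only; not part of Solr. */
final class NullSafeValues {
    private NullSafeValues() {}

    /**
     * Returns the given collection unchanged, or an immutable empty list
     * when it is null, so that a for-each loop over the result cannot
     * throw a NullPointerException.
     *
     * Intended use (assumes inputField is a SolrInputField):
     *   for (final Object inputValue : NullSafeValues.orEmpty(inputField.getValues())) { ... }
     */
    static <T> Collection<T> orEmpty(Collection<T> values) {
        return values == null ? Collections.<T>emptyList() : values;
    }
}
```

This keeps the shape of the loop unchanged; an explicit null check before the loop would work just as well.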


Regards,
Stephan - srm

[...]

> The "langsToPrepend" variable above will contain a set of languages, where
> detectLanguage was called separately for each value in the multivalued
> field.  If you just want to concatenate all the values and detect
> languages once (as opposed to only using the first value in the
> multivalued field, like it does today), just concatenate each of the input
> values in the first loop and call detectLanguage once at the end.
> 
> I wrote code that does this for an example in the Solr in Action book.
>  The particular example was detecting languages for each value in a
> multivalued field and then pre-pending the language to the text for the
> multivalued field (so the analyzer would know which stemmer to use, as
> they were being dynamically substituted in based upon the language).  The
> code is available here if you are interested:
> https://github.com/treygrainger/solr-in-action/blob/master/src/main/java/sia/ch14/MultiTextFieldLanguageIdentifierUpdateProcessor.java
> 
> Good luck!
> 
> -Trey
> 
> 
> 
> 
> On Wed, Nov 27, 2013 at 10:16 AM, Müller, Stephan <Mueller@ponton-consulting.de> wrote:
> 
> > > I suspect that it is an oversight for a use case that was not
> > > considered.
> > > I mean, it should probably either ignore or convert non text/string
> > > values.
> > Ok, I'll see that I provide a patch against trunk. It actually ignores
> > non string values, but is unable to check the remaining values of a
> > multivalued field.
> >
> > > Hmmm... are you using JSON input? I mean, how are the types being set?
> > > Solr XML doesn't have a way to set the value types.
> > >
> > No. It's a field with multivalued=true. That results in a
> > SolrInputField where value (which is defined to be Object) actually
> > holds a List.
> > This list is populated with Integer, String, Date, you name it.
> > I'm talking about the actual Java-Datatypes. The values in the list
> > are probably set by this 3rdparty Textbodyprocessor thingy.
> >
> > Now the Language processor just asks for field.getValue().
> > This is delegated to the SolrInputField, which in turn calls
> > firstValue(). Interestingly enough, it is already able to handle a
> > Collection as its value.
> > But if the value is a collection, it just returns the first element.
> >
> > > You could work around it with an update processor that copied the
> > > field and massaged the multiple values into what you really want the
> > > language detection to see. You could even implement that processor as
> > > a JavaScript script with the stateless script update processor.
> > >
> > Our workaround would be to not feed the multivalued field but only the
> > String fields (which are also included in the multivalued field)
> >
> >
> > Filing a Bug/Feature request and providing the patch will take some
> > time as I haven't set up a fully working trunk in my IDEA installation.
> > But I'm eager to do it :)
> >
> > Regards,
> > Stephan
> >
> >
> > > -- Jack Krupansky
> > >
> > > -----Original Message-----
> > > From: Müller, Stephan
> > > Sent: Wednesday, November 27, 2013 5:02 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: LanguageIdentifierUpdateProcessor uses only firstValue() on
> > > multivalued fields
> > >
> > > Hello,
> > >
> > > this is a repost. This message was originally posted on the 'general'
> > > list, but it was suggested that the 'user' list might be a better
> > > place to ask.
> > >
> > > ---- Original Message ----
> > > Hi,
> > >
> > > we are passing a multivalued field to the
> > > LanguageIdentifierUpdateProcessor.
> > > This multivalued field contains arbitrary types (Integer, String,
> > > Date).
> > >
> > > Now, the
> > > LanguageIdentifierUpdateProcessor.concatFields(SolrInputDocument
> > > doc, String[] fields), which btw does not use the parameter fields,
> > > is unable to process all values of a multivalued field. The call
> > > "Object content = doc.getFieldValue(fieldName);" does not care what
> > > type the field is and just delegates to SolrInputDocument, which in
> > > turn calls getFirstValue.
> > >
> > > So, two issues:
> > > First - if the first value of the multivalued field is not of type
> > > String, the field is ignored completely.
> > >
> > > Second - the concat method does not concat all values of a
> > > multivalued field.
> > >
> > > http://www.mail-archive.com/solr-u...@lucene.apache.org/msg90530.html
> > > states: "The feature is designed to detect exactly one language per
> > > field. In case of multivalued, it will concatenate all values before
> > > detection."
> > > But as far as I can see, the code is unable to do this at all for
> > > multivalued fields.
> > >
> > > This behavior was found in 4.3 but the code is still the same for
> > > current trunk (as of 2013-11-26)
> > >
> > > Is this a bug? Is this a special design decision? Did we miss a
> > > certain configuration, that would allow the Language identification
> > > to use all values of a multivalued field?
> > >
> > > We are about to write our own
> > > LangDetectLanguageIdentifierUpdateProcessorFactory (why is
> > > getInstance hardcoded to return LanguageIdentifierUpdateProcessor?)
> > > and override LanguageIdentifierUpdateProcessor to handle all values
> > > of a multivalued field, ignoring non-string values.
> > >
> > >
> > >
> > > Please see configuration below.
> > >
> > > I hope I was able to make myself clear. I'd like to hear your
> > > thoughts on this, before I go off and file a bug report.
> > >
> > > Regards,
> > > Stephan
> > >
> > >
> > > A little background:
> > > We are using a 3rd-party CMS framework which pulls in some magic
> > > SOLR configuration (namely the textbody field).
> > >
> > > The textbody field is defined as follows:
> > > <!--
> > > The default text search field.
> > > This field and the field name_tokenized are used as default search
> > > fields for the /editor and /cmdismax search request handlers in
> > > solrconfig.xml.
> > >
> > > For the Content Feeder the text of all indexed fields of the
> > > CoreMedia document is stored in this field.
> > > The CAE Feeder by default stores the text of all elements in this
> > > field.
> > > -->
> > > <field name="textbody" type="text_general" stored="false"
> > > multiValued="true"/>
> > >
> > > As you can see, it is also used as a search field; therefore we want
> > > to have the actual datatypes on the values.
> > > The field itself is generated by a processor, prior to calling the
> > > language identification (see processor chain).
> > >
> > >
> > >
> > > The processor chain:
> > >
> > > <updateRequestProcessorChain>
> > >   <!-- Improve error messages -->
> > >   <processor class="3rdpartypackage.ErrorHandlingProcessorFactory" />
> > >
> > >   <!-- Blob extraction -->
> > >   <processor class="3rdpartypackage.BinaryDataProcessorFactory">
> > >     <!-- some comments -->
> > >   </processor>
> > >
> > >   <!-- Textbody handling -->
> > >   <processor class="3rdpartypackage.TextBodyProcessorFactory" />
> > >
> > >   <!-- Copy content of field name to name_tokenized -->
> > >   <processor class="solr.CloneFieldUpdateProcessorFactory">
> > >     <str name="source">name</str>
> > >     <str name="dest">name_tokenized</str>
> > >   </processor>
> > >
> > >   <!--Language detection -->
> > >   <processor
> > >     class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
> > >     <str name="langid.fl">textbody,name_tokenized</str>
> > >     <str name="langid.langField">language</str>
> > >     <str name="langid.fallback">en</str>
> > >   </processor>
> > >
> > >   <!-- Index into language dependent fields if defined (e.g.
> > > textbody_en instead of textbody) -->
> > >   <processor
> > >     class="3rdpartypackage.solr.update.processor.LanguageDependentFieldsProcessorFactory">
> > >     <str name="languageField">language</str>
> > >     <str name="textFields">textbody,name_tokenized</str>
> > >   </processor>
> > >
> > >   <processor class="solr.RunUpdateProcessorFactory" />
> > > </updateRequestProcessorChain>
> >
> >
