Hello Trey, thank you for this example. We've solved it by omitting the multivalued field and passing the distinct string fields instead. Still, I'll go ahead and propose a patch so that the language processor is able to concatenate multivalues by default. I think it's a reasonable feature (and I can't remember ever having contributed a patch to an open source project). My thoughts on the patch implementation are much the same as yours: iterating over getValues(). I'll have this discussed on the dev list and probably in JIRA.
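To make the idea concrete, here is a minimal sketch of the concatenation step, not the actual patch. The class and method names (MultiValueConcat, concatStringValues) are hypothetical and not part of Solr. The Collection<Object> parameter stands in for whatever SolrInputField.getValues() returns, and treating a null collection as "no text" is my assumption.

```java
import java.util.Collection;

// Hypothetical helper, not part of Solr: joins the String entries of a
// multivalued field so that language detection can see all of them.
final class MultiValueConcat {

    private MultiValueConcat() {
    }

    // "values" stands in for the result of SolrInputField.getValues(),
    // which may be null when the field has no value (assumption: a null
    // collection is treated as empty text).
    static String concatStringValues(Collection<Object> values) {
        StringBuilder sb = new StringBuilder();
        if (values == null) {
            return "";
        }
        for (Object value : values) {
            // Non-string values (Integer, Date, ...) and null entries are skipped.
            if (value instanceof String) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append((String) value);
            }
        }
        return sb.toString();
    }
}
```

The resulting string could then be handed to the language detection once, instead of only the first value of the field.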
One thing: How do you guard against a possible NPE in line 129?
> (final Object inputValue : inputField.getValues()) {

SolrInputField.getValues() will return null if the associated value was null. It does not create an empty Collection.
That, btw, seems to be a minor bug in the javadoc, which does not state that this method can return null.

Regards,
Stephan

- srm
[...]

> The "langsToPrepend" variable above will contain a set of languages, where
> detectLanguage was called separately for each value in the multivalued
> field. If you just want to concatenate all the values and detect
> languages once (as opposed to only using the first value in the
> multivalued field, like it does today), just concatenate each of the input
> values in the first loop and call detectLanguage once at the end.
>
> I wrote code that does this for an example in the Solr in Action book.
> The particular example was detecting languages for each value in a
> multivalued field and then pre-pending the language to the text for the
> multivalued field (so the analyzer would know which stemmer to use, as
> they were being dynamically substituted in based upon the language). The
> code is available here if you are interested:
> https://github.com/treygrainger/solr-in-action/blob/master/src/main/java/sia/ch14/MultiTextFieldLanguageIdentifierUpdateProcessor.java
>
> Good luck!
>
> -Trey
>
>
> On Wed, Nov 27, 2013 at 10:16 AM, Müller, Stephan <Mueller@ponton-consulting.de> wrote:
>
> > > I suspect that it is an oversight for a use case that was not considered.
> > > I mean, it should probably either ignore or convert non text/string
> > > values.
> > Ok, I'll see that I provide a patch against trunk. It actually ignores
> > non string values, but is unable to check the remaining values of a
> > multivalued field.
> >
> > > Hmmm... are you using JSON input? I mean, how are the types being set?
> > > Solr XML doesn't have a way to set the value types.
> > >
> > No.
> > It's a field with multivalued=true. That results in a
> > SolrInputField where value (which is defined to be Object) actually holds a List.
> > This list is populated with Integer, String, Date, you name it.
> > I'm talking about the actual Java-Datatypes. The values in the list
> > are probably set by this 3rdparty Textbodyprocessor thingy.
> >
> > Now the Language processor just asks for field.getValue().
> > This is delegated to the SolrInputField which in turn calls
> > firstValue() Interestingly enough, already is able to handle a Collection as its value.
> > But if the value is a collection, it just returns the first element.
> >
> > > You could workaround it with an update processor that copied the field and
> > > massaged the multiple values into what you really want the language
> > > detection to see. You could even implement that processor as a
> > > JavaScript script with the stateless script update processor.
> > >
> > Our workaround would be to not feed the multivalued field but only the
> > String fields (which are also included in the multivalued field)
> >
> >
> > Filing a Bug/Feature request and providing the patch will take some
> > time as I haven't setup a fully working trunk in my IDEA installation.
> > But I'm eager to do it :)
> >
> > Regards,
> > Stephan
> >
> >
> > > -- Jack Krupansky
> > >
> > > -----Original Message-----
> > > From: Müller, Stephan
> > > Sent: Wednesday, November 27, 2013 5:02 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: LanguageIdentifierUpdateProcessor uses only firstValue() on
> > > multivalued fields
> > >
> > > Hello,
> > >
> > > this is a repost. This message was originally posted on the 'general' list
> > > but it was suggested, that the 'user' list might be a better place to ask.
> > >
> > > ---- Original Message ----
> > > Hi,
> > >
> > > we are passing a multivalued field to the
> > > LanguageIdentifierUpdateProcessor.
> > > This multivalued field contains arbitrary types (Integer, String, Date).
> > >
> > > Now, the
> > > LanguageIdentifierUpdateProcessor.concatFields(SolrInputDocument
> > > doc, String[] fields), which btw does not use the parameter fields,
> > > is unable to parse all fields of the/a multivalued field. The call
> > > "Object content = doc.getFieldValue(fieldName);" does not care what
> > > type the field is and just delegates to SolrInputDocument which in
> > > turn calls getFirstValue.
> > >
> > > So, two issues:
> > > First - if the first value of the multivalued field is not of type String,
> > > the field is ignored completely.
> > >
> > > Second - the concat method does not concat all values of a
> > > multivalued field.
> > >
> > > While http://www.mail-archive.com/solr-u...@lucene.apache.org/msg90530.html
> > > states: "The feature is designed to detect exactly one language per field.
> > > In case of multivalued, it will concatenate all values before detection."
> > > But as far as I can see, the code is unable to do this at all for
> > > multivalued fields.
> > >
> > > This behavior was found in 4.3 but the code is still the same for
> > > current trunk (as of 2013-11-26)
> > >
> > > Is this a bug? Is this a special design decision? Did we miss a
> > > certain configuration, that would allow the Language identification
> > > to use all values of a multivalued field?
> > >
> > > We are about to write our own
> > > LangDetectLanguageIdentifierUpdateProcessorFactory (why is the getInstance
> > > hardcoded to return LanguageIdentifierUpdateProcessor?) and
> > > overwrite LanguageIdentifierUpdateProcessor to handle all values of
> > > a multivalued field, ignoring non-string values.
> > >
> > >
> > > Please see configuration below.
> > >
> > > I hope I was able to make myself clear. I'd like to hear your
> > > thoughts on this, before I go off and file a bug report.
> > >
> > > Regards,
> > > Stephan
> > >
> > >
> > > A little background:
> > > We are using a 3rd-party CMS framework which pulls in some magic
> > > SOLR configuration (namely the textbody field).
> > >
> > > The textbody field is defined as follows:
> > > <!--
> > > The default text search field.
> > > This field and the field name_tokenized are used as default search
> > > fields for the /editor and /cmdismax search request handlers in solrconfig.xml.
> > >
> > > For the Content Feeder the text of all indexed fields of the
> > > CoreMedia document is stored in this field.
> > > The CAE Feeder by default stores the text of all elements in this field.
> > > -->
> > > <field name="textbody" type="text_general" stored="false"
> > >        multiValued="true"/>
> > >
> > > As you can see, it is also used as search field, therefor we want to
> > > have the actual datatypes on the values.
> > > The field itself is generated by a processor, prior to calling the
> > > language identification (see processor chain).
> > >
> > >
> > > The processor chain:
> > >
> > > <updateRequestProcessorChain>
> > >   <!-- Improve error messages -->
> > >   <processor class="3rdpartypackage.ErrorHandlingProcessorFactory" />
> > >
> > >   <!-- Blob extraction -->
> > >   <processor class="3rdpartypackage.BinaryDataProcessorFactory">
> > >     <!-- some comments -->
> > >   </processor>
> > >
> > >   <!-- Textbody handling -->
> > >   <processor class="3rdpartypackage.TextBodyProcessorFactory" />
> > >
> > >   <!-- Copy content of field name to name_tokenized -->
> > >   <processor class="solr.CloneFieldUpdateProcessorFactory">
> > >     <str name="source">name</str>
> > >     <str name="dest">name_tokenized</str>
> > >   </processor>
> > >
> > >   <!--Language detection -->
> > >   <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
> > >     <str name="langid.fl">textbody,name_tokenized</str>
> > >     <str name="langid.langField">language</str>
> > >     <str name="langid.fallback">en</str>
> > >   </processor>
> > >
> > >   <!-- Index into language dependent fields if defined (e.g.
> > >        textbody_en instead of textbody) -->
> > >   <processor class="3rdpartypackage.solr.update.processor.LanguageDependentFieldsProcessorFactory">
> > >     <str name="languageField">language</str>
> > >     <str name="textFields">textbody,name_tokenized</str>
> > >   </processor>
> > >
> > >   <processor class="solr.RunUpdateProcessorFactory" />
> > > </updateRequestProcessorChain>
> > >