Re: Solr 7.7 UpdateRequestProcessor broken

Jason Gerlowski Mon, 18 Feb 2019 08:39:45 -0800

Hey all,

I have a proposed update which adds a 7.7 section to our "Upgrade
Notes" ref-guide page.  I put a mention of this in there, but don't
have a ton of context on the issue.  Would appreciate a review from
anyone more familiar.  Check out SOLR-13256 if you get a few minutes.


Best,

Jason

On Mon, Feb 18, 2019 at 9:06 AM Jan Høydahl <jan....@cominvent.com> wrote:
>
> Thanks for chiming in Markus. Yea, same with the langid tests, they just work 
> locally with manually constructed SolrInputDocument objects.
> This breaking change sounds really scary and we should add an UPGRADE 
> NOTE somewhere.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > On 15 Feb 2019, at 10:34, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >
> > I stumbled upon this too yesterday and created SOLR-13249. In local unit 
> > tests we get String but in distributed unit tests we get a 
> > ByteArrayUtf8CharSequence instead.
> >
> > https://issues.apache.org/jira/browse/SOLR-13249
> >
> >
> >
> > -----Original message-----
> >> From:Andreas Hubold <andreas.hub...@coremedia.com>
> >> Sent: Friday 15th February 2019 10:10
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Solr 7.7 UpdateRequestProcessor broken
> >>
> >> Hi,
> >>
> >> thank you, Jan.
> >>
> >> I've created https://issues.apache.org/jira/browse/SOLR-13255. Maybe you
> >> want to add your patch to that ticket. I did not have time to test it yet.
> >>
> >> So I guess, all SolrJ usages have to handle CharSequence now for string
> >> fields? Well, this really sounds like a major breaking change for custom
> >> code.
> >>
> >> Thanks,
> >> Andreas
> >>
> >> Jan Høydahl wrote on 15.02.19 at 09:14:
> >>> Hi
> >>>
> >>> This is a subtle change which is not detected by our langid unit tests, 
> >>> as I think it only happens when the document is transferred with SolrJ and 
> >>> the Javabin codec.
> >>> Was introduced in https://issues.apache.org/jira/browse/SOLR-12992
> >>>
> >>> Please create a new JIRA issue for langid so we can try to fix it in 7.7.1
> >>>
> >>> Other SolrInputDocument users assuming String type for strings in 
> >>> SolrInputDocument would also be vulnerable.
> >>>
> >>> I have a patch ready that you could test:
> >>>
> >>> Index: solr/contrib/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessor.java
> >>> IDEA additional info:
> >>> Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
> >>> <+>UTF-8
> >>> ===================================================================
> >>> --- solr/contrib/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessor.java  (revision 8c831daf4eb41153c25ddb152501ab5bae3ea3d5)
> >>> +++ solr/contrib/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessor.java  (date 1550217809000)
> >>> @@ -60,12 +60,12 @@
> >>>            Collection<Object> fieldValues = doc.getFieldValues(fieldName);
> >>>            if (fieldValues != null) {
> >>>              for (Object content : fieldValues) {
> >>> -              if (content instanceof String) {
> >>> -                String stringContent = (String) content;
> >>> +              if (content instanceof CharSequence) {
> >>> +                CharSequence stringContent = (CharSequence) content;
> >>>                  if (stringContent.length() > maxFieldValueChars) {
> >>> -                  detector.append(stringContent.substring(0, maxFieldValueChars));
> >>> +                  detector.append(stringContent.subSequence(0, maxFieldValueChars).toString());
> >>>                  } else {
> >>> -                  detector.append(stringContent);
> >>> +                  detector.append(stringContent.toString());
> >>>                  }
> >>>                  detector.append(" ");
> >>>                } else {
> >>> Index: solr/contrib/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java
> >>> IDEA additional info:
> >>> Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
> >>> <+>UTF-8
> >>> ===================================================================
> >>> --- solr/contrib/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java  (revision 8c831daf4eb41153c25ddb152501ab5bae3ea3d5)
> >>> +++ solr/contrib/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java  (date 1550217691000)
> >>> @@ -413,10 +413,10 @@
> >>>          Collection<Object> fieldValues = doc.getFieldValues(fieldName);
> >>>          if (fieldValues != null) {
> >>>            for (Object content : fieldValues) {
> >>> -            if (content instanceof String) {
> >>> -              String stringContent = (String) content;
> >>> +            if (content instanceof CharSequence) {
> >>> +              CharSequence stringContent = (CharSequence) content;
> >>>                if (stringContent.length() > maxFieldValueChars) {
> >>> -                sb.append(stringContent.substring(0, maxFieldValueChars));
> >>> +                sb.append(stringContent.subSequence(0, maxFieldValueChars));
> >>>                } else {
> >>>                  sb.append(stringContent);
> >>>                }
> >>> @@ -449,8 +449,8 @@
> >>>          Collection<Object> contents = doc.getFieldValues(field);
> >>>          if (contents != null) {
> >>>            for (Object content : contents) {
> >>> -            if (content instanceof String) {
> >>> -              docSize += Math.min(((String) content).length(), maxFieldValueChars);
> >>> +            if (content instanceof CharSequence) {
> >>> +              docSize += Math.min(((CharSequence) content).length(), maxFieldValueChars);
> >>>              }
> >>>            }
> >>>
> >>>
> >>>
> >>> --
> >>> Jan Høydahl, search solution architect
> >>> Cominvent AS - www.cominvent.com
> >>>
> >>>> On 14 Feb 2019, at 16:02, Andreas Hubold 
> >>>> <andreas.hub...@coremedia.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> while trying to update from Solr 7.6 to 7.7 I ran into some unexpected 
> >>>> incompatibilities with UpdateRequestProcessors.
> >>>>
> >>>> The SolrInputDocument passed to UpdateRequestProcessor#processAdd does 
> >>>> not return Strings for string fields anymore but instances of 
> >>>> org.apache.solr.common.util.ByteArrayUtf8CharSequence. I found some 
> >>>> related JIRA issues (SOLR-12983?) but nothing under the "Upgrade Notes" 
> >>>> section.
> >>>>
> >>>> I can adapt our UpdateRequestProcessor implementations but at least the 
> >>>> org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
> >>>>  is broken now as well and needs to be fixed in Solr. It expects String 
> >>>> values and logs messages such as the following now:
> >>>>
> >>>> 2019-02-14 13:14:47.537 WARN  (qtp802600647-19) [   x:studio] 
> >>>> o.a.s.u.p.LangDetectLanguageIdentifierUpdateProcessor Field 
> >>>> name_tokenized not a String value, not including in detection
> >>>>
> >>>> I wonder what kind of plugins are affected by the change. Does this only 
> >>>> affect UpdateRequestProcessors or more plugins? Do I need to handle 
> >>>> these ByteArrayUtf8CharSequence instances in SolrJ clients now as well?
> >>>>
> >>>> Cheers,
> >>>> Andreas
> >>>>
> >>>>
> >>>
> >>
> >>
>
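
For custom code that assumed String values, the pattern used in the patch
quoted above (check for CharSequence instead of String, then convert) can be
sketched roughly as follows. This is only an illustration in plain Java, not
Solr code: the class and method names (FieldValueText, asText, concatText) are
hypothetical, and a StringBuilder stands in for any non-String CharSequence
such as ByteArrayUtf8CharSequence.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical helper, not part of Solr or SolrJ.
final class FieldValueText {

    // Returns the value as a String, truncated to maxChars, if it is any
    // CharSequence (String included); returns null for other types.
    static String asText(Object value, int maxChars) {
        if (!(value instanceof CharSequence)) {
            return null;
        }
        CharSequence cs = (CharSequence) value;
        if (cs.length() > maxChars) {
            return cs.subSequence(0, maxChars).toString();
        }
        return cs.toString();
    }

    // Concatenates all textual values, each followed by a single space,
    // silently skipping values that are not a CharSequence.
    static String concatText(Collection<?> values, int maxChars) {
        StringBuilder sb = new StringBuilder();
        for (Object value : values) {
            String text = asText(value, maxChars);
            if (text != null) {
                sb.append(text).append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A String, a non-String CharSequence (stand-in), and a non-text value.
        List<Object> values =
            Arrays.asList("hello", new StringBuilder("wonderful world"), 42);
        System.out.println(concatText(values, 9));
    }
}
```

In an UpdateRequestProcessor the values would presumably come from
doc.getFieldValues(fieldName), as in the patch; code that needs a real String
should call toString() on the value rather than cast it.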
