Hello Solr Community,

I’d like your feedback on an issue I encountered while configuring the
Solr langid module. It occurs when the deprecated langid.whitelist
property, rather than the newer langid.allowlist property, is used to
define the allowed language codes.

The langid.whitelist property has been deprecated since Solr 9.0.0, and the
recommended approach is to use langid.allowlist instead. I use
langid.allowlist myself, but I want to draw attention to a problem with the
legacy support for langid.whitelist. I believe there is a bug in the
backward-compatibility code that causes unintended behavior when the
langid.whitelist property is configured.

To illustrate the problem, I’ll provide a detailed example based on the
code:

   1. *The check for legacyAllowList*: In
      LanguageIdentifierUpdateProcessor.java
      (
https://github.com/apache/solr/blob/main/solr/modules/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java#L123-L127
      ), there is a check for the length of the legacyAllowList string.
      However, legacyAllowList is never used after that length check.
      Instead, an empty string ("") is used as the default value when
      fetching the LANG_ALLOWLIST parameter.
   2. *Resulting issue with the langAllowlist set*: As a result, the
      Set<String> langAllowlist is populated with a single element: an
      empty string (""). This causes a problem later, where the code checks
      whether langAllowlist is empty
      (
https://github.com/apache/solr/blob/main/solr/modules/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java#L385-L405
      ). The check langAllowlist.isEmpty() returns false because the set
      does contain an element, the empty string, and no detected language
      code can match that element.
   3. *Unexpected fallback behavior*: Consequently, even when the language
      of the document is detected correctly (for instance, German), the
      flow enters the "else" clause. This produces the log message
      *"Detected a language not in allowlist (de), using fallback en"*, and
      the language is set to the fallback, English (en), instead of the
      detected German.
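
To make the three steps above concrete, here is a minimal Java sketch of the
behavior as I understand it. It is not the Solr source: a plain Map stands
in for SolrParams, and the class and method names (LegacyAllowlistDemo,
buildAllowlist, resolve) are hypothetical and exist only for this
illustration. The key detail is that "".split(",") in Java returns an array
containing one empty string.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for the parameter handling described above.
// Assumption: this mirrors the reported logic, not the exact Solr code.
class LegacyAllowlistDemo {
  static final String LANG_WHITELIST = "langid.whitelist";
  static final String LANG_ALLOWLIST = "langid.allowlist";

  static Set<String> buildAllowlist(Map<String, String> params) {
    Set<String> langAllowlist = new HashSet<>();
    String legacyAllowList = params.getOrDefault(LANG_WHITELIST, "");
    // The legacy value only takes part in the length check...
    if (params.getOrDefault(LANG_ALLOWLIST, legacyAllowList).length() > 0) {
      // ...but the values themselves are read with "" as the default.
      for (String lang : params.getOrDefault(LANG_ALLOWLIST, "").split(",")) {
        langAllowlist.add(lang);
      }
    }
    return langAllowlist;
  }

  // An empty allowlist means "accept everything"; otherwise the detected
  // language must be a member, or the fallback is used.
  static String resolve(String detected, Set<String> langAllowlist, String fallback) {
    if (langAllowlist.isEmpty() || langAllowlist.contains(detected)) {
      return detected;
    }
    return fallback;
  }

  public static void main(String[] args) {
    Set<String> allow = buildAllowlist(Map.of(LANG_WHITELIST, "de,en"));
    System.out.println(allow.size());               // prints 1 (only "")
    System.out.println(resolve("de", allow, "en")); // prints en
  }
}
```

With only langid.whitelist set to "de,en", the set ends up as {""}, so German
is replaced by the fallback. With langid.allowlist set to the same value, the
sketch returns "de" as expected.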

I believe this behavior stems from a bug in the backward-compatibility
handling for the deprecated langid.whitelist property. Because the
legacyAllowList value is never passed into the langAllowlist set, the
result is incorrect fallback behavior.
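
One possible direction, offered only as a sketch and not as a tested patch:
resolve the effective value once, falling back to the legacy property, and
use that same value both for the length check and for populating the set.
As an assumption for illustration, a plain Map stands in for SolrParams, and
the class and method names (AllowlistFixSketch, buildAllowlist) are
hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a fix; not taken from the Solr code base.
class AllowlistFixSketch {
  static Set<String> buildAllowlist(Map<String, String> params) {
    String legacyAllowList = params.getOrDefault("langid.whitelist", "");
    // langid.allowlist wins when present; otherwise use the legacy value.
    String effective = params.getOrDefault("langid.allowlist", legacyAllowList);

    Set<String> langAllowlist = new HashSet<>();
    if (!effective.isEmpty()) {
      // Split the same string that passed the emptiness check.
      for (String lang : effective.split(",")) {
        langAllowlist.add(lang);
      }
    }
    return langAllowlist;
  }
}
```

Under this sketch, a configuration with only langid.whitelist=de,en yields
the set {de, en}, and a configuration with neither property yields an empty
set, which keeps the "no restriction" behavior.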

I’d appreciate any insights or thoughts from the community on this issue.
Thank you in advance for your time!

Alex
