Hello Solr Community,

I’m seeking your feedback regarding an issue I’ve encountered when configuring the Solr Langid module, specifically when using the deprecated langid.whitelist property instead of Solr’s newer langid.allowlist property to define allowed language codes.
As you are likely aware, the langid.whitelist property has been deprecated since Solr 9.0.0, and the recommended approach is to use langid.allowlist instead. I am indeed using the langid.allowlist property, but I would like to bring attention to an issue I’ve observed with the legacy support for langid.whitelist. I believe there is a bug in the backward compatibility code that could cause unintended behavior when the langid.whitelist property is configured.

To illustrate the problem, here is a detailed walkthrough based on the code:

1. *The check for legacyAllowList*: In LanguageIdentifierUpdateProcessor.java (https://github.com/apache/solr/blob/main/solr/modules/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java#L123-L127), there is a check on the length of the legacyAllowList string. However, legacyAllowList is never used after that length check. Instead, an empty string ("") is used as the default value when fetching the LANG_ALLOWLIST parameter.

2. *Resulting issue with the langAllowlist set*: As a result, the Set<String> langAllowlist is populated with a single element: an empty string (""). This causes a problem later in the code, where langAllowlist is checked for emptiness (https://github.com/apache/solr/blob/main/solr/modules/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java#L385-L405). The check langAllowlist.isEmpty() returns false because the set does contain an element - the empty string.

3. *Unexpected fallback behavior*: Consequently, even though the language of the document might be correctly detected (for instance, if the document is identified as being in German), the flow incorrectly enters the "else" clause.
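To make the three steps concrete, here is a minimal, self-contained sketch of the behavior as I understand it. It is not the Solr source: the class AllowlistSketch, its method names, and the use of a plain Map in place of Solr's parameter object are all hypothetical, and the parsing logic is only my reading of the linked lines. It also includes one possible way the legacy value could be carried through, as an assumption rather than a proposed patch.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the allowlist handling described above; NOT Solr code.
class AllowlistSketch {

  // Parsing as described: the legacy value only gates the branch,
  // while the set is filled from langid.allowlist with "" as the default.
  static Set<String> parseAsDescribed(Map<String, String> params) {
    Set<String> langAllowlist = new HashSet<>();
    String legacyAllowList = params.getOrDefault("langid.whitelist", "");
    if (params.getOrDefault("langid.allowlist", legacyAllowList).length() > 0) {
      // "".split(",") yields a one-element array containing "", so "" is added.
      for (String lang : params.getOrDefault("langid.allowlist", "").split(",")) {
        langAllowlist.add(lang);
      }
    }
    return langAllowlist;
  }

  // Assumed alternative: reuse the legacy value as the default when filling the set.
  static Set<String> parseWithLegacyFallback(Map<String, String> params) {
    Set<String> langAllowlist = new HashSet<>();
    String legacyAllowList = params.getOrDefault("langid.whitelist", "");
    String value = params.getOrDefault("langid.allowlist", legacyAllowList);
    if (value.length() > 0) {
      for (String lang : value.split(",")) {
        langAllowlist.add(lang);
      }
    }
    return langAllowlist;
  }

  // Simplified version of the later check: keep the detected language if the
  // allowlist is empty or contains it, otherwise use the fallback.
  static String resolve(String detected, Set<String> allowlist, String fallback) {
    if (allowlist.isEmpty() || allowlist.contains(detected)) {
      return detected;
    }
    return fallback;
  }

  public static void main(String[] args) {
    Map<String, String> params = Map.of("langid.whitelist", "de,en");

    Set<String> described = parseAsDescribed(params);
    System.out.println(described.size());                 // 1 (only the empty string)
    System.out.println(resolve("de", described, "en"));   // en

    Set<String> carried = parseWithLegacyFallback(params);
    System.out.println(resolve("de", carried, "en"));     // de
  }
}
```

With only langid.whitelist configured, the first variant ends up with a set containing just the empty string, so German falls through to the fallback; the second variant keeps "de" because the legacy value actually reaches the set.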
The result is the log message *"Detected a language not in allowlist (de), using fallback en"*, and the language is set to the fallback, English (en), even though the document language was correctly identified as German.

I believe this behavior stems from a bug in the backward compatibility handling for the deprecated langid.whitelist property. If the legacyAllowList value is not properly used or passed to the langAllowlist set, it leads to incorrect fallback behavior.

I’d appreciate any insights or thoughts from the community on this issue. Thank you in advance for your time!

Alex