On the performance side: loading the models takes a fair amount of
time, and your detectLanguages creates a new detector on every call.
Are you able to reuse the detector? I can't remember off the top of my
head whether the language detectors are thread safe, but you should be
able to load the models only once.
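
Off the top of my head, something like this (untested; the
DetectorHolder name is just a placeholder, and the import paths assume
Tika 2.x+, so adjust for your version) would let you pay the
model-loading cost once per thread instead of once per call:

    import org.apache.tika.langdetect.optimaize.OptimaizeLangDetector;
    import org.apache.tika.language.detect.LanguageDetector;

    public class DetectorHolder {
        // One detector per thread, built lazily on first use. This
        // sidesteps the thread-safety question while still loading the
        // models only once per thread.
        private static final ThreadLocal<LanguageDetector> DETECTOR =
                ThreadLocal.withInitial(
                        () -> new OptimaizeLangDetector().loadModels());

        public static LanguageDetector get() {
            return DETECTOR.get();
        }
    }

Your detectLanguages would then call DetectorHolder.get() instead of
constructing a new OptimaizeLangDetector for every batch of segments.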

As for accuracy, have you tried other detectors? Another hack I've
seen: for short snippets of text, people concatenate/repeat the text a
number of times until they have a few hundred words.
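
A rough, untested sketch of that padding (padShortText and the
targetWords knob are made up; a few hundred words is the usual
ballpark):

    private static String padShortText(String text, int targetWords) {
        // An empty or blank snippet stays as-is.
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return text;
        }
        int wordCount = trimmed.split("\\s+").length;

        // Repeats the snippet until it reaches roughly targetWords
        // words, so the detector has more n-grams to work with.
        StringBuilder padded = new StringBuilder(trimmed);
        for (int words = wordCount; words < targetWords; words += wordCount) {
            padded.append(' ').append(trimmed);
        }
        return padded.toString();
    }

You would then run each segment through it before detection, e.g.
detector.detect(padShortText(segment, 300)).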

On Thu, Sep 4, 2025 at 5:52 AM Francesco Festa <fe...@netseven.it> wrote:
>
> Hi everyone,
>
> I'm currently working with Tika and I'm having problems with the accuracy
> of its language predictions, so I was wondering if anyone has any ideas on
> how to improve what we have.
>
> My use case is fairly specific: I have short to medium-sized texts that
> can be in either one or two languages. When a text has two languages, it
> is usually the original text immediately followed by its translation (so
> the position of the two languages is roughly identifiable).
>
> For this reason, what we are currently doing is splitting the text in
> half, running Tika on both halves, and then taking both outputs (if both
> halves yield the same language we count it only once).
>
> The code that does this is the following:
>
>     private static List<String> segmentateText(String text, int numSegments, int minLength) {
>         // If the text is shorter than the given minLength we don't segment it
>         // (in this example minLength is 0)
>         if (text.length() <= minLength) {
>             List<String> singleSegment = new ArrayList<>();
>             singleSegment.add(text);
>             return singleSegment;
>         }
>         // Computes the size of each segment by counting the number of words
>         String[] words = text.split("\\s+");
>         int segmentSize = words.length / numSegments;
>
>         List<String> segments = new ArrayList<>();
>         for (int i = 0; i < numSegments; i++) {
>             int start = i * segmentSize;
>
>             // If this is not the last segment, computes the end index;
>             // otherwise takes all remaining words
>             int end;
>             if (i < numSegments - 1)
>                 end = (i + 1) * segmentSize;
>             else
>                 end = words.length;
>
>             // Constructs the segment
>             StringBuilder segment = new StringBuilder();
>             for (int j = start; j < end; j++) {
>                 segment.append(words[j]);
>                 // Appends a space if this is not the last word
>                 if (j < end - 1)
>                     segment.append(" ");
>             }
>             segments.add(segment.toString());
>         }
>
>         return segments;
>     }
>
> We then simply pass the segments to this other function, which does the
> detection:
>
>     private static Set<String> detectLanguages(List<String> segments) {
>         LanguageDetector detector = new OptimaizeLangDetector().loadModels();
>
>         Set<String> result = new HashSet<>();
>         String detectedLang;
>
>         // For each segment the language gets detected, parsed and added to a set
>         for (String segment : segments) {
>             LanguageResult language = detector.detect(segment);
>             detectedLang = language.getLanguage();
>             detectedLang = (String) Utils.normalizeLanguage(detectedLang); // We can ignore this
>             result.add(detectedLang);
>         }
>
>         return result;
>     }
>
> The problem is that this approach has okay-ish results for medium-sized
> texts but sucks for shorter ones. For example:
> "Open Access: Soziologische Aspekte ; Open Access: Sociological Implications"
> This should be German and English, but it is instead flagged as Italian.
> There are many similar cases across our data, even with different
> languages.
>
> Now I'm wondering: what can I do to improve performance for this specific
> use case?
>
> I'll list some additional information I think could be relevant:
> - I can't know beforehand the set of languages present in the data, so we 
> can't load only the models we need.
> - Computing power is somewhat limited: this code has to run in a Big Data
> pipeline over millions of strings, so it can't be too slow.
> - I'm using the latest version of Tika.
>
> Thanks a lot in advance to everyone who is willing to help,
> Francesco
