Re: tika parser detecting "IBM500" for small files

Satinder Singh Mon, 07 Sep 2020 22:09:19 -0700

and my code is:

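import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.mime.MimeTypes;
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;


public static String detectEncoding(InputStream is) throws IOException
  {
    CharsetDetector detector = new CharsetDetector();
    // setText(InputStream) needs mark/reset support; TikaInputStream provides it
    detector.setText(TikaInputStream.get(is));
    CharsetMatch detected = detector.detect();
    // Assumed completion: the snippet as posted stops at detect()
    return detected != null ? detected.getName() : null;
  }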
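
For context on why a three-byte file is hard to classify, here is a rough sketch that uses only the JDK, not Tika. It does not reproduce Tika's detection logic; it only checks whether the bytes of "a d" decode without error under a few charsets. The class name ShortSampleAmbiguity and the helper decodesCleanly are made up for this illustration, and IBM500 may not be available on every Java runtime.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ShortSampleAmbiguity {

    // Hypothetical helper: returns true if the bytes decode under the named
    // charset without a malformed-input or unmappable-character error.
    static boolean decodesCleanly(byte[] bytes, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return false; // e.g. IBM500 can be missing from a minimal runtime
        }
        try {
            Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The sample content from the original question
        byte[] sample = "a d".getBytes(StandardCharsets.US_ASCII);
        for (String name : new String[] {"UTF-8", "ISO-8859-1", "IBM500"}) {
            System.out.println(name + " decodes cleanly: "
                    + decodesCleanly(sample, name));
        }
    }
}
```

If several charsets accept the same bytes, a detector has to pick between them on statistics alone, and three bytes give it very little to work with.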

On Sat, Sep 5, 2020 at 12:13 AM John Patrick <[email protected]> wrote:
>
> Have you tried 1.24.1?
> Did it detect as a different type on an older version?
> Have you tried it on another machine?
> Are other files being detected as expected?
> What OS and what Java version are you using?
>
>
> I've just tried it with 1.24 and I'm getting ISO-8859-1. Here is my
> output: https://gist.github.com/nhojpatrick/c11c00ce35f5af26de51efca9f8e8b4e
>
> I'm using Java 1.8.0_261 on a Mac.
>
> John
>
> On Fri, 4 Sep 2020 at 07:10, Satinder Singh <[email protected]> wrote:
> >
> > Why does tika-parsers-1.24.jar detect the encoding "IBM500" for small files?
> > Example content of a small file:
> > "a d"
> >
> > How can I fix this?
