Re: tika parser detecting "IBM500" for small files

John Patrick Tue, 08 Sep 2020 14:14:32 -0700

What about my other questions?
1) Did it behave differently before? If so, which combination of
OS + Java + Tika versions was working?
2) Are other files being detected correctly?


Have you tried your code in other environments?

Have you tried using the tika-app-1.24.1.jar as per my example?

Can you try adding this debug line:
System.out.println("file.encoding=" + System.getProperty("file.encoding"));

What does the debug line show?
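
If it is easier, the same check can be widened into a small standalone report that covers the OS, Java version, and default charset I asked about. This is only a rough sketch using the plain JDK: the class name EncodingEnvReport and the collect() helper are made up for illustration, and it does not call Tika at all, so it only tells us about the environment, not about the detector itself.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika: gathers the environment details
// that are worth comparing between a working and a non-working machine.
class EncodingEnvReport {

    // Returns the properties in a fixed, readable order.
    static Map<String, String> collect() {
        Map<String, String> report = new LinkedHashMap<>();
        String[] keys = {"os.name", "os.version", "java.version", "java.vendor", "file.encoding"};
        for (String key : keys) {
            // String.valueOf turns a missing property into the text "null".
            report.put(key, String.valueOf(System.getProperty(key)));
        }
        // The charset the JVM actually uses by default, which may differ
        // from what the file.encoding property suggests.
        report.put("defaultCharset", Charset.defaultCharset().name());
        return report;
    }

    public static void main(String[] args) {
        collect().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Running it on both machines and pasting the two outputs side by side should make any difference in the setup obvious.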



On 08/09/2020, Satinder Singh <[email protected]> wrote:
> and my code is:
>
> import org.apache.tika.Tika;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.mime.MimeTypes;
> import org.apache.tika.parser.txt.CharsetDetector;
> import org.apache.tika.parser.txt.CharsetMatch;
>
> public static String detectEncoding(InputStream is)
> {
>     CharsetDetector detector = new CharsetDetector();
>     detector.setText(TikaInputStream.get(is));
>     CharsetMatch detected = detector.detect();
>
> On Sat, Sep 5, 2020 at 12:13 AM John Patrick <[email protected]>
> wrote:
>>
>> Have you tried 1.24.1?
>> Did it detect as a different type on an older version?
>> Have you tried it on another machine?
>> Are other files being detected as expected?
>> What os are you using and what java version are you using?
>>
>>
>> I've just tried it with 1.24 and I'm getting ISO-8859-1. Here is my
>> output:
>> https://gist.github.com/nhojpatrick/c11c00ce35f5af26de51efca9f8e8b4e
>>
>> I'm using 1.8.0_261 on a mac.
>>
>> John
>>
>> On Fri, 4 Sep 2020 at 07:10, Satinder Singh <[email protected]> wrote:
>> >
>> > Why does tika-parsers-1.24.jar detect the encoding "IBM500" for small
>> > files? Example content of a small file:
>> > "a d"
>> >
>> > How can I fix this?
>
