Re: tika parser detecting "IBM500" for small files

Satinder Singh Fri, 11 Sep 2020 00:03:20 -0700

inline ...

1) Did it work differently before? What combination was that working
version... os + java + tika
Linux 6, tika-parsers 1.24.1, Java 8.
It has never worked for a file in which a word starts or ends with "d", e.g.
"a d"

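To show how little a detector has to work with here, the sketch below decodes the three bytes of "a d" under two single-byte charsets using only the JDK. It does not reproduce how CharsetDetector scores its candidates. The class and method names are illustrative, and it assumes the runtime ships the extended charsets that include IBM500 (it reports when that is not the case).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika.
class ShortInputDemo {
    // Decodes the same bytes under the named charset, or reports that the
    // charset is not available in this runtime.
    static String decodeAs(byte[] bytes, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return "(charset not supported in this runtime)";
        }
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The sample file content from this thread: 0x61 0x20 0x64
        byte[] bytes = "a d".getBytes(StandardCharsets.US_ASCII);
        for (String name : new String[] {"ISO-8859-1", "IBM500"}) {
            System.out.println(name + " -> " + decodeAs(bytes, name));
        }
    }
}
```

Both decodes return a string without raising an error, so byte validity alone cannot separate the two candidates for an input this short.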

2) Are other files working correctly?
Yes

Have you tried your code on other environments?
No. We need it working on Linux 6/7.

Have you tried using the tika-app-1.24.1.jar as per my example?
No. We only need encoding detection, so we are using only tika-parsers-1.24.1.

Can you try adding this debug line:
System.out.println("file.encoding=" + System.getProperty("file.encoding"));
I will try it.
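
A minimal standalone version of that debug check, for anyone following along. The class and method names are illustrative. Printing `Charset.defaultCharset()` as well is an addition not requested in the thread. The check uses `System.getProperty`, which reads the value without changing it.

```java
import java.nio.charset.Charset;

// Hypothetical helper for printing the JVM's encoding settings.
class EncodingDebug {
    // Builds the debug text from the file.encoding system property and
    // the charset the JVM actually uses as its default.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it on both the Linux machine and the Mac would show whether the two environments differ in their default encoding.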

On Wed, Sep 9, 2020 at 2:43 AM John Patrick <[email protected]> wrote:
>
> What about my other questions...
> 1) Did it work differently before? What combination was that working
> version... os + java + tika
> 2) Are other files working correctly?
>
> Have you tried your code on other environments?
>
> Have you tried using the tika-app-1.24.1.jar as per my example?
>
> Can you try adding this debug line:
> System.out.println("file.encoding=" + System.getProperty("file.encoding"));
>
> What does the debug line show?
>
>
>
> On 08/09/2020, Satinder Singh <[email protected]> wrote:
> > and my code is:
> >
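> > import java.io.IOException;
> > import java.io.InputStream;
> >
> > import org.apache.tika.Tika;
> > import org.apache.tika.io.TikaInputStream;
> > import org.apache.tika.mime.MimeTypes;
> > import org.apache.tika.parser.txt.CharsetDetector;
> > import org.apache.tika.parser.txt.CharsetMatch;
> >
> > public static String detectEncoding(InputStream is) throws IOException
> >   {
> >     CharsetDetector detector = new CharsetDetector();
> >     // Wrap the stream and hand it to the detector.
> >     detector.setText(TikaInputStream.get(is));
> >     CharsetMatch detected = detector.detect();
> >     // The original message ended here; the return below is an assumed completion.
> >     return detected == null ? null : detected.getName();
> >   }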
> >
> > On Sat, Sep 5, 2020 at 12:13 AM John Patrick <[email protected]>
> > wrote:
> >>
> >> Have you tried 1.24.1?
> >> Did it detect as a different type on an older version?
> >> Have you tried it on another machine...
> >> Are other files being detected as expected?
> >> What os are you using and what java version are you using?
> >>
> >>
> >> I've just tried it with 1.24 and I'm getting ISO-8859-1. Here is my
> >> output:
> >> https://gist.github.com/nhojpatrick/c11c00ce35f5af26de51efca9f8e8b4e
> >>
> >> I'm using Java 1.8.0_261 on a Mac.
> >>
> >> John
> >>
> >> On Fri, 4 Sep 2020 at 07:10, Satinder Singh <[email protected]> wrote:
> >> >
> >> > Why does tika-parsers-1.24.jar detect the encoding "IBM500" for small files?
> >> > Example content of a small file:
> >> > "a d"
> >> >
> >> > How can I fix this?
> >
