So on a Mac, using; a
I get; "Match of UTF-8 with confidence 15"

Using; a d
I get; "Match of IBM500 in fr with confidence 98"

Using; d
I get; "Match of UTF-8 with confidence 15"

So I don't think Oracle Linux is the issue, and if you had tested on
different operating systems yourself you would have seen the same
results as me.

I've no idea why it thinks "a d" is IBM500 in French with 98% confidence...
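
One way to poke at this, as a rough sketch and not an explanation of
what the detector does internally, is to decode the same three bytes
under each candidate charset and look at what comes out. The class and
method names here are made up for illustration, and IBM500 is not
guaranteed to be present on every JVM, hence the isSupported check.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeCandidates {

    // Decode the bytes under a named charset; returns null if this JVM
    // does not provide that charset.
    static String decodeAs(byte[] data, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return null;
        }
        return new String(data, Charset.forName(charsetName));
    }

    // Render each char as U+XXXX so control characters stay visible.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The file content from this thread: bytes 0x61 0x20 0x64.
        byte[] data = "a d".getBytes(StandardCharsets.US_ASCII);
        for (String name : new String[] {"UTF-8", "ISO-8859-1", "IBM500"}) {
            String decoded = decodeAs(data, name);
            System.out.println(name + " -> "
                    + (decoded == null ? "(charset not available)" : codePoints(decoded)));
        }
    }
}
```

All three charsets accept these bytes without error, so the detector
has very little to separate them on beyond its own statistics.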

If you think it is wrong, raise a defect, but with a file of so few
characters I would expect some strange detection.
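
If very short files are a real case for you, one possible workaround
(my assumption, not something Tika provides) is to ignore the detector
when the input is tiny or the confidence is low and fall back to a
default. The sketch below is plain Java; the class name, method name
and both thresholds are invented for illustration. The name and
confidence arguments would come from CharsetMatch.getName() and
CharsetMatch.getConfidence() in your code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ShortInputGuard {

    // Illustrative thresholds only; tune them against real files.
    static final int MIN_BYTES = 16;
    static final int MIN_CONFIDENCE = 50;

    // Trust the detected charset only when there was enough input and
    // the detector was reasonably confident; otherwise use the fallback.
    static Charset choose(String detectedName, int confidence,
                          int byteCount, Charset fallback) {
        if (detectedName == null
                || byteCount < MIN_BYTES
                || confidence < MIN_CONFIDENCE) {
            return fallback;
        }
        try {
            return Charset.forName(detectedName);
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // The "a d" case from this thread: 3 bytes, IBM500 at confidence 98.
        System.out.println(choose("IBM500", 98, 3, StandardCharsets.UTF_8));
    }
}
```

With these thresholds the 3-byte "a d" file would take the fallback
regardless of the reported confidence, which sidesteps the odd result
without fixing the detection itself.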

John

On Fri, 11 Sep 2020 at 12:23, John Patrick <[email protected]> wrote:
>
> further inline comments...
>
> On Fri, 11 Sep 2020 at 08:02, Satinder Singh <[email protected]> wrote:
> >
> > inline ...
> >
> > 1) Did it work different before? what combination was that working
> > version... os + java + tika
> > linux  6 , tika-parser 1.24.1, java 8.
> > It never worked for a file having "d" in word start or end, e.g.
> > "a d"
> >
> > 2) Are other files working correct?
> > Yes
> >
> > Have you tried your code on other environments???
> > No. we need it working on linux 6/7
> Do you mean Oracle Linux?
> I know you need it working on linux 6 or 7 but knowing if your code
> works elsewhere potentially helps track down any issues...
> You might have a test environment which is Debian where it passes,
> another on CentOS where it passes, your developers might be on Ubuntu
> and the code works fine there, but your production is Oracle Linux and
> it's failing there?
> Is it a physical host, a virtual host (VMware or similar), or a
> container host (OpenShift or Docker)?
>
> >
> > Have you tried using the tika-app-1.24.1.jar as per my example?
> > No. Requirement is for only encoding detection. So using only 
> > tika-parser-1.24.1
> Again, it helps track down the issue: if my example tika-app works for
> you but your tika-parser still doesn't, that helps identify where to
> look next.
>
> >
> > Can you try adding this debug line;
> > System.out.println("file.encoding=" + System.getProperty("file.encoding"));
> > I will try it.
> >
> > On Wed, Sep 9, 2020 at 2:43 AM John Patrick <[email protected]> wrote:
> > >
> > > What about my other questions...
> > > 1) Did it work different before? what combination was that working
> > > version... os + java + tika
> > > 2) Are other files working correct?
> > >
> > > Have you tried your code on other environments???
> > >
> > > Have you tried using the tika-app-1.24.1.jar as per my example?
> > >
> > > Can you try adding this debug line;
> > > System.out.println("file.encoding=" +
> > > System.getProperty("file.encoding"));
> > >
> > > What does the debug line show?
> > >
> > >
> > >
> > > On 08/09/2020, Satinder Singh <[email protected]> wrote:
> > > > and my code is:
> > > >
> > > > import org.apache.tika.Tika;
> > > > import org.apache.tika.io.TikaInputStream;
> > > > import org.apache.tika.mime.MimeTypes;
> > > > import org.apache.tika.parser.txt.CharsetDetector;
> > > > import org.apache.tika.parser.txt.CharsetMatch;
> > > >
> > > > public static String detectEncoding(InputStream is)
> > > > {
> > > >     CharsetDetector detector = new CharsetDetector();
> > > >     detector.setText(TikaInputStream.get(is));
> > > >     CharsetMatch detected = detector.detect();
> > > >
> > > > On Sat, Sep 5, 2020 at 12:13 AM John Patrick <[email protected]>
> > > > wrote:
> > > >>
> > > >> Have you tried 1.24.1?
> > > >> Did it detect as a different type on an older version?
> > > >> Have you tried it on another machine...
> > > >> Are other files being detected as expected?
> > > >> What os are you using and what java version are you using?
> > > >>
> > > >>
> > > >> As I've just done it with 1.24 and I'm getting ISO-8859-1. Here is my
> > > >> output
> > > >> https://gist.github.com/nhojpatrick/c11c00ce35f5af26de51efca9f8e8b4e
> > > >>
> > > >> I'm using 1.8.0_261 on a mac.
> > > >>
> > > >> John
> > > >>
> > > >> On Fri, 4 Sep 2020 at 07:10, Satinder Singh <[email protected]> 
> > > >> wrote:
> > > >> >
> > > >> > Why tika-parsers-1.24.jar detects encoding "IBM500" for small files.
> > > >> > Example content of a small file:
> > > >> > "a d"
> > > >> >
> > > >> > How to fix this?
> > > >
