Hi,

Thanks for the hint. I tested the new version of Tika (2.7.0), but I cannot see 
any difference (detection of the shift_jis file still does not work). 
Did you test it as a server? I have to use Tika as a server, with a POST 
request.



-----Original Message-----
From: Tim Allison <[email protected]> 
Sent: Monday, May 1, 2023 16:47
To: [email protected]
Subject: Re: post request with shift_jis encoding and filename hint

In Tika 2.7.0, we migrated to a living fork of the Universal Charset Detector 
(TIKA-3213).  I just tried the main branch's detection of the file attached to 
TIKA-2473, and the detection now works for that file.

I completely understand the problems you're having and appreciate your 
attempted workarounds, but you're right, Tika should _just work_.  So, give 
2.7.0 a try.

That said, charset detection is not always perfect, and charset detection on 
short files is notoriously challenging.

On Fri, Apr 28, 2023 at 11:43 AM Medea Springmeier 
<[email protected]> wrote:
>
> Hi,
>
> I want Tika to detect a text file that is encoded in shift_jis.
>
> I found this one here:
>
> https://github.com/dadoonet/fscrawler/issues/400
>
> (and there is also a ticket for the problem in the Jira of Tika: 
> https://issues.apache.org/jira/browse/TIKA-2437)
>
>
>
> So, I also put the filename in my request to give Tika a hint. With a PUT 
> request everything works (I get back "Content-Type": "text/plain; 
> charset=Shift_JIS" and also the shift_jis text I want). But with a POST 
> request I cannot add a Content-Disposition header in the POST body without 
> also adding a Content-Type header (I use Java and the MultipartEntityBuilder 
> for my request to Tika Server (2.6.0)). However, when I add a Content-Type 
> header, Tika uses it for its detection, even when it is set to a wildcard. 
> So all I get in this situation is "Content-Type": 
> "application/octet-stream", without any detected text, plus the information 
> that Tika used the EmptyParser.
>
>
>
> I don't want to add "Content-Type": "text/plain" to the request (this would 
> work) because I do not have only text files. And I do not want to guess the 
> Content-Type from the filename myself. I expect Tika to be able to do that.
>
>
>
> I want to use Tika with POST requests. Is there any way to do that and still 
> detect shift_jis encoded text files?
>
> Is there perhaps a way to tell Tika to use only MIME magic and the filename, 
> but not the Content-Type, when guessing the MIME type?
>
>
>
>
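
For reference, the PUT variant described in the quoted message (file name passed 
as a hint, no Content-Type sent) might look roughly like the sketch below. It is 
only an illustration under assumptions that are not stated in the thread: it uses 
the JDK's java.net.http client (Java 11+) rather than MultipartEntityBuilder, it 
assumes a Tika Server listening on http://localhost:9998 with the /tika endpoint, 
and the class name TikaPutSketch and the file sample.txt are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

class TikaPutSketch {

    // Builds a PUT request whose Content-Disposition header carries the
    // file name, so the server can use it as a detection hint.
    // Deliberately sets no Content-Type header.
    static HttpRequest buildRequest(URI endpoint, String fileName, byte[] body) {
        return HttpRequest.newBuilder(endpoint)
                .header("Content-Disposition", "attachment; filename=" + fileName)
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input file; pass a real path as the first argument.
        Path file = Path.of(args.length > 0 ? args[0] : "sample.txt");
        // Assumed address of a locally running Tika Server.
        URI endpoint = URI.create("http://localhost:9998/tika");

        HttpRequest request = buildRequest(
                endpoint, file.getFileName().toString(), Files.readAllBytes(file));
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

The sketch assumes an ASCII file name; other names would need to be encoded 
before being placed in the header. It does not address the multipart POST case 
that the question is about.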
