In Tika 2.7.0, we migrated to a living fork of the Universal Charset
Detector (TIKA-3213).  I just tried the main branch's detection of the
file attached to TIKA-2473, and the detection now works for that file.

I completely understand the problems you're having and appreciate your
attempted workarounds, but you're right: Tika should _just work_.  So,
please give 2.7.0 a try.

That said, charset detection is not always perfect, and charset
detection on short files is notoriously challenging.
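
For retesting against a 2.7.0 server, here is a minimal sketch of the PUT
variant that you reported as working, using only the JDK's java.net.http
classes (Java 11+).  The endpoint http://localhost:9998/tika, the
"attachment; filename=..." form of the header, the Accept value, and the
sample filename and bytes are assumptions for illustration, not taken from
your setup.  The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class TikaPutRequestSketch {

    // Builds (but does not send) a PUT request whose Content-Disposition
    // header carries the filename as a detection hint. No Content-Type
    // header is set, so the server is left to detect the type itself.
    // Assumes an ASCII filename without quotes or backslashes.
    static HttpRequest buildPut(URI endpoint, String fileName, byte[] body) {
        return HttpRequest.newBuilder(endpoint)
                .header("Content-Disposition",
                        "attachment; filename=\"" + fileName + "\"")
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    public static void main(String[] args) {
        // Assumed local Tika Server endpoint and placeholder content.
        URI endpoint = URI.create("http://localhost:9998/tika");
        byte[] body = new byte[] {(byte) 0x82, (byte) 0xA0}; // placeholder bytes

        HttpRequest request = buildPut(endpoint, "sample.txt", body);

        // Print what would be sent; nothing goes over the network here.
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

To actually send it, pass the request to
HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
with a Tika Server running at the assumed address.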

On Fri, Apr 28, 2023 at 11:43 AM Medea Springmeier
<[email protected]> wrote:
>
> Hi,
>
> I want Tika to detect a text file encoded as Shift_JIS.
>
> I found this one here:
>
> https://github.com/dadoonet/fscrawler/issues/400
>
> (there is also a ticket for this problem in Tika's Jira:
> https://issues.apache.org/jira/browse/TIKA-2437)
>
>
>
> So I also put the filename in my request to give Tika a hint. With a PUT
> request everything works: I get back "Content-Type": "text/plain;
> charset=Shift_JIS" along with the Shift_JIS text I want. With a POST request,
> however, I cannot add a Content-Disposition header in the POST body without
> also adding a Content-Type header (I use Java and MultipartEntityBuilder for
> my requests to Tika Server 2.6.0). When I add a Content-Type header, Tika
> uses it for its detection even when it is set to a wildcard. So all I get in
> this situation is "Content-Type": "application/octet-stream", without any
> detected text, and the information that Tika used the EmptyParser.
>
>
>
> I don't want to add "Content-Type": "text/plain" to the request (this would
> work) because I do not have only text files. And I do not want to guess the
> Content-Type from the filename myself. I would expect Tika to be able to do
> that.
>
>
>
> I want to use Tika with POST requests. Is there any way to do that and still
> detect Shift_JIS-encoded text files?
>
> Is there perhaps a way to tell Tika to use only MIME magic and the filename,
> and not the Content-Type, when guessing the MIME type?
>
>
>
>
