In Tika 2.7.0, we migrated to a living fork of the Universal Charset Detector (TIKA-3213). I just tried the main branch's detection of the file attached to TIKA-2473, and the detection now works for that file.
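
To re-test against 2.7.0, a minimal sketch of the PUT variant described in the quoted message (raw file bytes plus a filename hint in Content-Disposition) could look like the following. It assumes a Tika Server listening on its default port 9998 at localhost; the class name, method names, and the sample file name are placeholders for illustration and are not part of Tika.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of Tika.
class TikaPutSketch {

    // Build a PUT request that carries the raw file bytes and a filename hint.
    // No Content-Type header is set, so detection is left to the server.
    static HttpRequest buildRequest(String endpoint, String fileName, byte[] body) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Disposition", "attachment; filename=" + fileName)
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    // Send the request and return the response body (the extracted text).
    static String extractText(HttpRequest request) throws Exception {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: java TikaPutSketch <file>");
            return;
        }
        Path file = Path.of(args[0]);
        // Assumed endpoint: a local Tika Server on the default port.
        HttpRequest request = buildRequest(
                "http://localhost:9998/tika",
                file.getFileName().toString(),
                Files.readAllBytes(file));
        System.out.println(extractText(request));
    }
}
```

Running it with the path of a Shift_JIS text file prints whatever text the server extracts, which makes it easy to compare the result between server versions.
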
I completely understand the problems you're having and appreciate your attempted workarounds, but you're right, Tika should _just work_. So, give 2.7.0 a try. That said, charset detection is not always perfect, and charset detection on short files is notoriously challenging.

On Fri, Apr 28, 2023 at 11:43 AM Medea Springmeier <[email protected]> wrote:
>
> Hi,
>
> I want that Tika can detect a textfile with shift_jis as charEncoding.
>
> I found this one here:
>
> https://github.com/dadoonet/fscrawler/issues/400
>
> (and there is also a ticket for the problem in the Jira of Tika:
> https://issues.apache.org/jira/browse/TIKA-2437)
>
>
>
> So, I put the filename also in my request to give Tika a hint. When I make a
> PUT request there is all fine (I get back the "Content-Type": "text/plain;
> charset=Shift_JIS" and also the shift_jis text I want to have). But when I
> make a POST request I get the problem that I cannot add a Content-Disposition
> header in the Post-Body without also adding a Content-Type header (I use Java
> and the MultipartEntityBuilder for my request to Tika Server (2.6.0)).
> However, when I add a Content-Type header than Tika uses it for his detection
> also when it is set as Wildcard. So, all what I get in this situation is
> "Content-Type": "application/octet-stream" without any detected text and the
> information that Tika used the EmptyParser.
>
>
>
> I don't want to add the "Content-Type": "text/plain" in the request (this
> would work) because I do not have only textfiles. And I do not want to make a
> guess myself on the filename for the Content-Type. In my expectation that
> should Tika able to do.
>
>
>
> I want to use Tika with Post requests. Is there any way to use it in this way
> and to detect shift_jis encoded textfiles?
>
> Maybe, is there a method that I can tell Tika only to use Mime-Magic and the
> filename, but not to use the Content-Type for guessing the Mime-Type?
>
>
>
>
