Hi Tilman!
Thanks for the quick reply. According to my tests with 3.2.1, the issue is
fixed for docx, odt, pptx, and xlsx, but it still happens for the doc, ppt,
and xls extensions.
I will test it further and let you know if I find anything, but hopefully
that points you in the right direction.
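
In the meantime, one possible workaround is sketched below, assuming the
files are small enough to buffer in memory: read the bytes once, give
detection its own stream, and upload from a fresh one. The names
BufferedDetect, StreamDetector, and Result are hypothetical and not part of
Tika; the detector parameter stands in for the real detection call.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedDetect {

    /** Hypothetical stand-in for a detection call such as tika.detect(stream). */
    @FunctionalInterface
    public interface StreamDetector {
        String detect(InputStream stream) throws IOException;
    }

    /** Detected type plus an unread stream over the same bytes. */
    public static final class Result {
        public final String mimeType;
        public final InputStream freshStream;

        Result(String mimeType, InputStream freshStream) {
            this.mimeType = mimeType;
            this.freshStream = freshStream;
        }
    }

    public static Result detectWithoutConsuming(InputStream original, StreamDetector detector)
            throws IOException {
        // Assumption: the file fits in memory. readAllBytes() requires Java 9+.
        byte[] bytes = original.readAllBytes();
        // Detection gets its own copy, so it may read as much as it likes.
        String mimeType = detector.detect(new ByteArrayInputStream(bytes));
        // The caller uploads from a separate, untouched stream.
        return new Result(mimeType, new ByteArrayInputStream(bytes));
    }
}
```

With Tika on the classpath, the detector could be something like
`stream -> new Tika().detect(stream)`, and `freshStream` is what would go to
S3. This trades memory for safety, so it may not suit very large uploads.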

Regards,
Alvaro

On Mon, Jun 23, 2025 at 11:11 AM Tilman Hausherr <thaush...@t-online.de>
wrote:

> Hi,
>
> Please test with the unreleased 3.2.1:
>
> https://dist.apache.org/repos/dist/dev/tika/3.2.1/
>
> https://repository.apache.org/content/repositories/orgapachetika-1115/org/apache/tika
>
> Tilman
>
>
>
> On 6/23/2025 11:01 AM, Alvaro Nogueira via user wrote:
>
>
>
> ---------- Forwarded message ---------
> From: Alvaro Nogueira <alvaro.nogue...@flywire.com>
> Date: Mon, Jun 23, 2025 at 10:54 AM
> Subject: InputStream consumed by Tika.detect
> To: <user-subscr...@tika.apache.org>
>
>
> Hello,
> We've been using Tika 3.1.0 to detect the MIME types of files before
> uploading them to S3.
> However, after upgrading to v3.2.0, we've noticed that the original
> InputStream is consumed entirely for certain file extensions.
> The affected extensions all seem to be Microsoft file formats, which
> points us to the POIFSContainerDetector, which was changed in this release.
> Extensions that fail in our tests: doc, docx, odt, ppt, pptx, xls, xlsx
> Extensions that work as before: bmp, csv, gif, jpeg, jpg, pdf, png, rtf,
> svg, txt
>
> Here's some code to reproduce the issue:
>
> import java.io.IOException;
> import java.io.InputStream;
>
> import org.apache.tika.Tika;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> // Assumption: ClassPathResource is Spring's org.springframework.core.io class.
> import org.springframework.core.io.ClassPathResource;
>
> class TikaBugReport {
>
>     // affected extensions: doc, docx, odt, ppt, pptx, xls, xlsx
>     public static void main(String[] args) throws IOException {
>         String fileName = "Test.docx";
>         InputStream inputStream = new ClassPathResource(fileName).getInputStream();
>         checkFileMime(inputStream, fileName);
>     }
>
>     public static void checkFileMime(InputStream inputStream, String fileName) {
>         try {
>             Tika tika = new Tika();
>             System.out.println("InputStream available bytes before processing: "
>                     + inputStream.available());
>             System.out.println("InputStream supports mark: " + inputStream.markSupported());
>
>             Metadata metadata = new Metadata();
>
>             TikaInputStream tikaInputStream = TikaInputStream.get(inputStream);
>             System.out.println("Original InputStream available bytes after TikaInputStream.get(): "
>                     + inputStream.available());
>
>             String mimeType = tika.detect(tikaInputStream, metadata);
>
>             // Debug: check state after detection
>             System.out.println("Original InputStream available bytes after tika.detect(): "
>                     + inputStream.available());
>             System.out.println("TikaInputStream available bytes after tika.detect(): "
>                     + tikaInputStream.available());
>             if (inputStream.available() == 0) {
>                 throw new IllegalStateException("InputStream is empty after tika.detect()");
>             }
>
>         } catch (Exception e) {
>             System.out.printf("Mime check exception for file '%s': [%s]%n",
>                     fileName, e.getMessage());
>         }
>     }
> }
>
>
> --
> Thank you and regards,
>
> Álvaro Nogueira
> Senior Software Engineer
>
> Disclaimer for electronic communications
> <https://www.flywire.com/legal/disclaimer-for-electronic-communications>
>
>
>
