On 6/23/2025 2:21 PM, Alvaro Nogueira via user wrote:
Hi Tilman!
Thanks for the quick reply. According to my tests, the issue is fixed for docx, odt, pptx, and xlsx, but it still happens for doc, ppt, and xls. I will test further and let you know if I find anything, but hopefully this points you in the right direction.

Hi,

I've approved your JIRA request, please report it there and don't forget to include a file.

Tilman



Regards,
Alvaro

On Mon, Jun 23, 2025 at 11:11 AM Tilman Hausherr <thaush...@t-online.de> wrote:

    Hi,

    Please test with the unreleased 3.2.1:

    https://dist.apache.org/repos/dist/dev/tika/3.2.1/

    https://repository.apache.org/content/repositories/orgapachetika-1115/org/apache/tika

    Tilman



    On 6/23/2025 11:01 AM, Alvaro Nogueira via user wrote:


    ---------- Forwarded message ---------
    From: *Alvaro Nogueira* <alvaro.nogue...@flywire.com>
    Date: Mon, Jun 23, 2025 at 10:54 AM
    Subject: InputStream consumed by Tika.detect
    To: <user-subscr...@tika.apache.org>


    Hello,
    We've been using Tika version 3.1.0 to successfully detect
    MimeTypes from files before uploading them to our S3.
    However, after v3.2.0 upgrade, we've noticed that the original
    inputStream is being consumed entirely for certain file extensions.
    The affected extensions seem to be all for Microsoft files,
    pointing us to the POIFSContainerDetector, which was actually
    changed for this release.
    This is the list of extensions we've tested with errors: doc,
    docx, odt, ppt, pptx, xls, xlsx
    And these ones work as before: bmp, csv, gif, jpeg, jpg, pdf,
    png, rtf, svg, txt

    Here's some code to reproduce the issue:

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.tika.Tika;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.springframework.core.io.ClassPathResource;

    class TikaBugReport {

        // affected extensions: doc, docx, odt, ppt, pptx, xls, xlsx
        public static void main(String[] args) throws IOException {
            String fileName = "Test.docx";
            InputStream inputStream = new ClassPathResource(fileName).getInputStream();
            checkFileMime(inputStream, fileName);
        }

        public static void checkFileMime(InputStream inputStream, String fileName) {
            try {
                Tika tika = new Tika();
                System.out.println("InputStream available bytes before processing: "
                        + inputStream.available());
                System.out.println("InputStream supports mark: " + inputStream.markSupported());

                Metadata metadata = new Metadata();

                // Wrap the original stream; detection runs against the wrapper.
                TikaInputStream tikaInputStream = TikaInputStream.get(inputStream);
                System.out.println("Original InputStream available bytes after TikaInputStream.get(): "
                        + inputStream.available());

                String mimeType = tika.detect(tikaInputStream, metadata);

                // Debug: check state after detection
                System.out.println("Original InputStream available bytes after tika.detect(): "
                        + inputStream.available());
                System.out.println("TikaInputStream available bytes after tika.detect(): "
                        + tikaInputStream.available());
                if (inputStream.available() == 0) {
                    throw new IllegalStateException("InputStream is empty after TikaInputStream creation");
                }

            } catch (Exception e) {
                System.out.printf("Mime check exception for file '%s': [%s]%n", fileName, e.getMessage());
            }
        }
    }
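
Until a fixed release is confirmed for every format, one possible stopgap is to run detection on a copy of the bytes, so that whatever the detector does to its stream cannot affect the data that is uploaded afterwards. The following is only a minimal sketch and was not part of the original report: `DetectOnCopy`, `StreamDetector`, and `Detected` are hypothetical names, the detector is a stand-in for a call such as `tika.detect(stream)`, and it assumes the file is small enough to hold in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class DetectOnCopy {

    /** Hypothetical stand-in for a detection call such as tika.detect(stream). */
    public interface StreamDetector {
        String detect(InputStream in) throws IOException;
    }

    /** Detection result plus a fresh, unread stream over the same bytes. */
    public static final class Detected {
        public final String mimeType;
        public final InputStream content;

        Detected(String mimeType, InputStream content) {
            this.mimeType = mimeType;
            this.content = content;
        }
    }

    /** Buffers the source once, detects on one copy, and returns another copy for later use. */
    public static Detected detectOnCopy(InputStream source, StreamDetector detector)
            throws IOException {
        byte[] bytes = source.readAllBytes(); // requires Java 9+; holds the whole file in memory
        String mimeType = detector.detect(new ByteArrayInputStream(bytes));
        return new Detected(mimeType, new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) throws IOException {
        InputStream upload = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));

        // A deliberately greedy detector that drains its stream, mimicking the reported behaviour.
        StreamDetector greedy = in -> {
            in.readAllBytes();
            return "application/octet-stream";
        };

        Detected result = detectOnCopy(upload, greedy);
        System.out.println(result.mimeType + ", bytes left: " + result.content.available());
        // prints "application/octet-stream, bytes left: 5"
    }
}
```

With Tika on the classpath, the detector argument could presumably be written as `in -> tika.detect(in)`. The trade-off is memory: for large uploads, spooling to a temporary file would be a better fit than a byte array.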

-- Thank you and regards,

    Álvaro Nogueira
    Senior Software Engineer    


    Disclaimer for electronic communications
    <https://www.flywire.com/legal/disclaimer-for-electronic-communications>


