Hello all,
I started my server like this:

    java -jar tika-server-1.20.jar -server

I was extracting information from a RAR file and noticed that a LOT of weird output was included. The archive contains binaries (.so), .class files and .java files. I can reproduce the problem consistently with that RAR file, but I cannot share it.

So I tried this instead:

    curl -T tika-server-1.20.jar localhost:9998/tika --header "Accept: text/plain" > out.txt

This shows the same problem, just not as badly as with my RAR file:

    schemaorg_apache_xmlbeans/system/sD023D6490046BA0250A839A9AD24C443/agautoformatattributegroup.xsb
    Úzº¾�����������9http://schemas.openxmlformats.org/spreadsheetml/2006/main�^MAG_AutoFormat��unqualified�8<xsd:attributeGroup name="AG_AutoFormat" xmlns=" http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:xsd=" http://www.w3.org/2001/XMLSchema"> <xsd:attribute name="autoFormatId" type="xsd:unsignedInt"> <xsd:annotation>

My question is: why is this output at all?

Some tests with the RAR file (not the Tika jar) showed the following:

- Each file, when submitted separately, is extracted properly (meaning: I get text).
- When I delete files from the RAR, some files that were not extracted properly before are now extracted properly.
- The bad output contains a distinctive byte pattern, EF BF BD, which is the UTF-8 encoding of the replacement character (U+FFFD).
- It does not happen with every RAR. As a test I downloaded the "br" dump of Wikimedia, packed it into a RAR and ran "/rmeta/text" on it, and that extracted properly.

So I'm guessing that some kind of buffer is overflowing into the next text extraction?

What could I do to debug this in more depth, and/or what extra information could I give the devs so they can tackle it?

Thanks!
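One way to narrow this down might be to look at the "/rmeta/text" output per embedded file and count the replacement characters in each entry. The sketch below is only an illustration, not a confirmed diagnosis: it assumes the response is a JSON list of metadata objects in which the extracted text is stored under "X-TIKA:content" and the path inside the archive under "X-TIKA:embedded_resource_path". The function name suspicious_entries and the file name rmeta.json are made up for this example.

```python
import json

# U+FFFD, the Unicode replacement character; its UTF-8 encoding is EF BF BD.
REPLACEMENT = "\ufffd"


def suspicious_entries(rmeta_json, threshold=1):
    """Return (index, path, count) for entries containing replacement characters.

    rmeta_json is the raw JSON text returned by the /rmeta/text endpoint.
    The two key names below are assumptions about that output format.
    """
    entries = json.loads(rmeta_json)
    report = []
    for index, entry in enumerate(entries):
        content = entry.get("X-TIKA:content") or ""
        count = content.count(REPLACEMENT)
        if count >= threshold:
            # The container document itself normally has no embedded path.
            path = entry.get("X-TIKA:embedded_resource_path", "(container)")
            report.append((index, path, count))
    return report


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python check_rmeta.py rmeta.json
    with open(sys.argv[1], encoding="utf-8") as handle:
        for index, path, count in suspicious_entries(handle.read()):
            print(index, path, count)
```

If the key names hold for this Tika version, saving the response with something like "curl -T file.rar localhost:9998/rmeta/text > rmeta.json" and running the script on it should list which embedded files produce the garbage, which would be more useful in a bug report than the mixed plain-text output.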
