Hi
There might be a bug in AutoDetectParser: it fails to recognise some
plain-text files as plain text.
Three test files are attached; as you can see, they are all plain text.
The following code is used for my testing:
————————
AutoDetectParser parser = new AutoDetectParser();
File dir = new File("/Users/-/work/jate/experiment/bugged_corpus");
for (File f : dir.listFiles()) {
    // try-with-resources closes the stream once each file has been parsed
    try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
        Metadata metadata = new Metadata();
        parser.parse(in, handler, metadata);
        String content = handler.toString();
        System.out.println(metadata); // line A
    } catch (Exception e) {
        e.printStackTrace();
    }
}
————————
For the three test files, I would expect line A to report plain text
(Content-Type=text/plain). Instead, it prints the following:
X-Parsed-By=org.apache.tika.parser.EmptyParser Content-Type=image/x-portable-bitmap
X-Parsed-By=org.apache.tika.parser.DefaultParser X-Parsed-By=org.apache.tika.parser.mp3.Mp3Parser xmpDM:audioCompressor=MP3 Content-Type=audio/mpeg
X-Parsed-By=org.apache.tika.parser.EmptyParser Content-Type=image/x-portable-bitmap
As a result, the variable "content" is always empty.
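In case it helps narrow this down, below is a minimal sketch, using only the JDK, that prints the first bytes of each file in hex so they can be compared against the magic numbers of the misdetected types (a portable bitmap, for example, starts with the ASCII characters "P1" or "P4"). The class name HeadDump, its method names, and the 16-byte window are my own choices for illustration and are not part of Tika; whether the leading bytes are what actually triggers the wrong detection is only an assumption on my side.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical diagnostic helper, not part of Tika.
public class HeadDump {

    // Formats at most n leading bytes of data as space-separated lowercase hex pairs.
    public static String hex(byte[] data, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(n, data.length); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", data[i] & 0xff));
        }
        return sb.toString();
    }

    // Reads up to n bytes from the start of the file (fewer if the file is shorter).
    public static byte[] head(File f, int n) throws IOException {
        byte[] buf = new byte[n];
        int off = 0;
        try (InputStream in = new FileInputStream(f)) {
            int r;
            while (off < n && (r = in.read(buf, off, n - off)) > 0) {
                off += r;
            }
        }
        return Arrays.copyOf(buf, off);
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory if no argument is given.
        File dir = new File(args.length > 0 ? args[0] : ".");
        File[] files = dir.listFiles();
        if (files == null) {
            System.err.println("Not a readable directory: " + dir);
            return;
        }
        for (File f : files) {
            if (f.isFile()) {
                System.out.println(f.getName() + ": " + hex(head(f, 16), 16));
            }
        }
    }
}
```

Running it with the corpus directory as the only argument prints one line per file, which can be read next to the Content-Type values above.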
Any suggestions on this, please?
Thanks