Hi everyone,

Tika (tested with v2.8.0, but this doesn't seem to be version-specific) seems
to detect generic XML differently depending on whether an XML declaration is
present and what it contains:

@Test
public void testDetect() throws IOException {
    try (final InputStream in = new BufferedInputStream(new ByteArrayInputStream(
            "<data>42</data>".getBytes(StandardCharsets.US_ASCII)))) {
        assertEquals(MediaType.TEXT_PLAIN,
                new Tika().getDetector().detect(in, new Metadata()).getBaseType());
    }
    try (final InputStream in = new BufferedInputStream(new ByteArrayInputStream(
            "<?xml?><data>42</data>".getBytes(StandardCharsets.US_ASCII)))) {
        assertEquals(MediaType.TEXT_PLAIN,
                new Tika().getDetector().detect(in, new Metadata()).getBaseType());
    }
    try (final InputStream in = new BufferedInputStream(new ByteArrayInputStream(
            "<?xml version='1.0'?><data>42</data>".getBytes(StandardCharsets.US_ASCII)))) {
        assertEquals(MediaType.APPLICATION_XML,
                new Tika().getDetector().detect(in, new Metadata()).getBaseType());
    }
}

In short, only XML files with an XML declaration that explicitly includes
a version (as in the third case above) are detected as application/xml.
XML files without an XML declaration, or with an XML declaration that
lacks one, are detected as text/plain.

Is that intentional?

Thanks
John
