I wasn't around on the project when the XML MIME magic was developed, so take this as personal opinion, not an official statement. :D
The first item is intentional (XML data with no declaration). Text-based files are challenging, and looking for matching tags is beyond what our current detection does... not to say that it would be impossible. We do allow a missing declaration for specific subtypes, such as SVG, IIRC.

The second item is surprising because it looks like we should only require '<?xml' at offset 0. I'll look into that tomorrow.

On Wed, Jul 12, 2023 at 11:58 AM John Ulrik <[email protected]> wrote:

> Hi everyone,
>
> Tika (testing with v2.8.0, but doesn't seem to be version-specific) seems
> to detect generic XML depending on the existence of and details on the XML
> declaration:
>
> @Test
> public void testDetect() throws IOException {
>     try (final InputStream in = new BufferedInputStream(new ByteArrayInputStream(
>             "<data>42</data>".getBytes(StandardCharsets.US_ASCII)))) {
>         assertEquals(MediaType.TEXT_PLAIN,
>                 new Tika().getDetector().detect(in, new Metadata()).getBaseType());
>     }
>     try (final InputStream in = new BufferedInputStream(new ByteArrayInputStream(
>             "<?xml?><data>42</data>".getBytes(StandardCharsets.US_ASCII)))) {
>         assertEquals(MediaType.TEXT_PLAIN,
>                 new Tika().getDetector().detect(in, new Metadata()).getBaseType());
>     }
>     try (final InputStream in = new BufferedInputStream(new ByteArrayInputStream(
>             "<?xml version='1.0'?><data>42</data>".getBytes(StandardCharsets.US_ASCII)))) {
>         assertEquals(MediaType.APPLICATION_XML,
>                 new Tika().getDetector().detect(in, new Metadata()).getBaseType());
>     }
> }
>
> In short, only XML files with an XML declaration that explicitly includes
> an encoding will be detected as application/xml. XML files without XML
> declaration or with an XML declaration but without encoding will be
> detected as text/plain.
>
> Is that intentional?
>
> Thanks
> John
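
To make the "only require '<?xml' at offset 0" remark concrete, here is a minimal sketch of what such a check amounts to as a plain byte-prefix comparison. This is not Tika's detector code or its actual magic definitions; the class name XmlPrefixSketch and the method startsWithXmlDecl are hypothetical and use only the JDK.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is NOT how Tika implements detection.
final class XmlPrefixSketch {

    // The five ASCII bytes of "<?xml".
    private static final byte[] XML_DECL_PREFIX =
            "<?xml".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the data begins with "<?xml" at offset 0. */
    static boolean startsWithXmlDecl(byte[] data) {
        if (data.length < XML_DECL_PREFIX.length) {
            return false; // too short to contain the prefix
        }
        for (int i = 0; i < XML_DECL_PREFIX.length; i++) {
            if (data[i] != XML_DECL_PREFIX[i]) {
                return false; // mismatch within the first five bytes
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The three inputs from the quoted test.
        String[] samples = {
            "<data>42</data>",
            "<?xml?><data>42</data>",
            "<?xml version='1.0'?><data>42</data>"
        };
        for (String s : samples) {
            boolean match = startsWithXmlDecl(s.getBytes(StandardCharsets.US_ASCII));
            System.out.println(match + "  " + s);
        }
    }
}
```

Under this prefix-only rule, the second and third inputs from the test would both match and the first would not, which is why the text/plain result for the second input looks surprising.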
