I wasn't around on the project when the xml mime magic was developed.  So,
take this as personal opinion, not an official statement. :D

The first item is intentional (XML data with no declaration).  Text-based
files are challenging, and looking for matching tags is beyond what our
current detection does, which is not to say that it would be impossible.  We
do allow a missing declaration for specific subtypes, such as svg, IIRC.

The second item is surprising because it looks like we should only require
'<?xml' at offset 0. I'll look into that tomorrow.
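
To make concrete what I mean by "only require '<?xml' at offset 0", here is
a minimal sketch of such a prefix check. This is not Tika's actual detection
code; the class and method names are made up for illustration, and it
assumes an ASCII-compatible encoding with no byte order mark.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not part of Tika.
final class XmlPrefixSketch {
    // The literal "<?xml" that is expected at offset 0.
    private static final byte[] XML_MAGIC =
            "<?xml".getBytes(StandardCharsets.US_ASCII);

    // True if the data begins with "<?xml". Nothing after the prefix
    // (version, encoding, closing "?>") is inspected.
    static boolean startsWithXmlMagic(byte[] data) {
        if (data.length < XML_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, XML_MAGIC.length), XML_MAGIC);
    }

    public static void main(String[] args) {
        String[] samples = {
            "<data>42</data>",
            "<?xml?><data>42</data>",
            "<?xml version='1.0'?><data>42</data>"
        };
        for (String s : samples) {
            byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
            System.out.println(s + " -> " + startsWithXmlMagic(bytes));
        }
    }
}
```

With a check like this, the first sample does not match and the other two
do, which is why the text/plain result for the second case surprises me.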

On Wed, Jul 12, 2023 at 11:58 AM John Ulrik <[email protected]> wrote:

> Hi everyone,
>
> Tika (tested with v2.8.0, but this doesn't seem to be version-specific)
> seems to detect generic XML depending on whether an XML declaration is
> present and on what it contains:
>
> @Test
> public void testDetect() throws IOException {
>     // No XML declaration: detected as text/plain
>     try (final InputStream in = new BufferedInputStream(new ByteArrayInputStream(
>             "<data>42</data>".getBytes(StandardCharsets.US_ASCII)))) {
>         assertEquals(MediaType.TEXT_PLAIN,
>                 new Tika().getDetector().detect(in, new Metadata()).getBaseType());
>     }
>     // Bare "<?xml?>" with no attributes: detected as text/plain
>     try (final InputStream in = new BufferedInputStream(new ByteArrayInputStream(
>             "<?xml?><data>42</data>".getBytes(StandardCharsets.US_ASCII)))) {
>         assertEquals(MediaType.TEXT_PLAIN,
>                 new Tika().getDetector().detect(in, new Metadata()).getBaseType());
>     }
>     // Declaration with a version: detected as application/xml
>     try (final InputStream in = new BufferedInputStream(new ByteArrayInputStream(
>             "<?xml version='1.0'?><data>42</data>".getBytes(StandardCharsets.US_ASCII)))) {
>         assertEquals(MediaType.APPLICATION_XML,
>                 new Tika().getDetector().detect(in, new Metadata()).getBaseType());
>     }
> }
>
> In short, only XML files with an XML declaration that explicitly includes
> a version will be detected as application/xml. XML files without an XML
> declaration, or with an XML declaration but without a version, will be
> detected as text/plain.
>
> Is that intentional?
>
> Thanks
> John
>
>
