Hi Tim,

Thanks!  For me, the fix can wait until the next release.
The URL ended up by chance in the sample I used to verify that
upgrading Tika in StormCrawler didn't break anything.
Out of 450 documents parsed by Tika, it was the only one
that looked like a regression.

Best,
Sebastian

On 11/16/21 16:49, Tim Allison wrote:
> Hi Seb,
> 
> I'm sorry for taking forever to reply.  That's a bug.  Now fixed:
> https://issues.apache.org/jira/browse/TIKA-3593
> 
> If you specify the DcXMLParser in your tika-config after the default
> parser, it _should_ be selected instead of the XMLParser.  Let me know
> if I can help with this temporary workaround.
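> 
> A minimal sketch of such a config, assuming the usual
> <properties>/<parsers> layout of a tika-config.xml (the file name is
> an assumption, and this has not been verified against 2.1.0):
> 
> ```xml
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>   <parsers>
>     <!-- default parser first -->
>     <parser class="org.apache.tika.parser.DefaultParser"/>
>     <!-- DcXMLParser listed after it, so that it should be
>          selected instead of XMLParser -->
>     <parser class="org.apache.tika.parser.xml.DcXMLParser"/>
>   </parsers>
> </properties>
> ```
> 
> With tika-app, the file could then be passed via
> --config=tika-config.xml.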
> 
> Thank you for identifying this problem!
> 
> Cheers,
> 
>       Tim
> 
> On Thu, Nov 11, 2021 at 7:21 AM Sebastian Nagel
> <[email protected]> wrote:
>>
>> Hi,
>>
>> When is the Dublin Core XML parser used to parse XML files?
>> Is any configuration required to enable the DcXMLParser?
>>
>> There is a difference between 1.27 and 2.1.0:
>>
>> $> java -jar tika-app-1.27.jar -J \
>>       https://news.haltonhills.halinet.on.ca/dc.xml \
>>    | jq '.[0]."dc:title"'
>> "Deaths"
>> $> java -jar tika-app-2.1.0.jar ...
>> null
>>
>> $> java -jar tika-app-1.27.jar -J \
>>       https://news.haltonhills.halinet.on.ca/dc.xml \
>>    | jq '.[0]."X-Parsed-By"'
>> [
>>   "org.apache.tika.parser.DefaultParser",
>>   "org.apache.tika.parser.xml.DcXMLParser"
>> ]
>> $> java -jar tika-app-2.1.0.jar -J \
>>       https://news.haltonhills.halinet.on.ca/dc.xml \
>>    | jq '.[0]."X-TIKA:Parsed-By"'
>> [
>>   "org.apache.tika.parser.DefaultParser",
>>   "org.apache.tika.parser.xml.XMLParser"
>> ]
>>
>>
>> Thanks,
>> Sebastian
