Hi,
I'm trying to override the built-in PDF parser with another one. I
looked through the mailing list archive and found the following hints
on how to override a built-in parser:
http://mail-archives.apache.org/mod_mbox/tika-user/201105.mbox/%3CBANLkTimp4omHywv_ptOmqEX9v-%2BW4e7fVA%40mail.gmail.com%3E
https://issues.apache.org/jira/browse/TIKA-527
Is there any documentation of the syntax of the configuration file
available?
The problem is that the proposed method does not work for me. Any use
of a configuration file apparently sends Tika into endless recursion,
even when the file does not override a built-in parser.
If I understand it correctly, the following configuration file should
have the same effect as the built-in configuration:
$ cat tika-config.xml
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
  </parsers>
</properties>
But when I pass this file to Tika, the command-line application
terminates after a while with an exception:
$ java -Dtika.config=tika-config.xml -jar tika-app-1.0.jar --list-parsers
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(String.java:216)
    at java.lang.StringBuilder.toString(StringBuilder.java:430)
    at org.apache.tika.mime.MediaType.toString(MediaType.java:237)
    at org.apache.tika.detect.MagicDetector.<init>(MagicDetector.java:142)
    at org.apache.tika.mime.MimeTypesReader.readMatch(MimeTypesReader.java:254)
    at org.apache.tika.mime.MimeTypesReader.readMatches(MimeTypesReader.java:202)
    at org.apache.tika.mime.MimeTypesReader.readMagic(MimeTypesReader.java:186)
    at org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:152)
    at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:124)
    at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:107)
    at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:63)
    at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:91)
    at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:147)
    at org.apache.tika.mime.MimeTypes.getDefaultMimeTypes(MimeTypes.java:455)
    at org.apache.tika.config.TikaConfig.typesFromDomElement(TikaConfig.java:273)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:161)
    at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:237)
    at org.apache.tika.mime.MediaTypeRegistry.getDefaultRegistry(MediaTypeRegistry.java:42)
    at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:52)
    at sun.reflect.GeneratedConstructorAccessor4.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at java.lang.Class.newInstance0(Class.java:355)
    at java.lang.Class.newInstance(Class.java:308)
    at org.apache.tika.config.TikaConfig.parserFromDomElement(TikaConfig.java:288)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:162)
    at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:237)
    at org.apache.tika.mime.MediaTypeRegistry.getDefaultRegistry(MediaTypeRegistry.java:42)
    at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:52)
    at sun.reflect.GeneratedConstructorAccessor4.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
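
Reading the trace, the cycle appears to be: the DefaultParser
constructor asks for the default media type registry, which loads the
default TikaConfig, which reads the configuration file and instantiates
DefaultParser again. The following is a minimal, self-contained sketch
of that pattern as I understand it. The classes are hypothetical
stand-ins, not Tika's actual code, and the depth guard exists only so
that the demo terminates instead of exhausting memory:

```java
import java.util.ArrayList;
import java.util.List;

public class RecursionSketch {
    // Guard so the demo stops; the real trace shows no such limit.
    static final int MAX_DEPTH = 5;
    static final List<String> calls = new ArrayList<>();
    static int depth = 0;

    // Hypothetical stand-in for TikaConfig.
    static class Config {
        Config() {
            calls.add("Config.<init>");
            // The config file lists DefaultParser, so building the
            // config instantiates the parser.
            new DefaultParserStandIn();
        }

        static Config getDefaultConfig() {
            // No caching: every call builds a fresh config.
            return new Config();
        }
    }

    // Hypothetical stand-in for DefaultParser.
    static class DefaultParserStandIn {
        DefaultParserStandIn() {
            calls.add("DefaultParser.<init>");
            if (++depth >= MAX_DEPTH) {
                return; // demo-only exit from the cycle
            }
            // The parser constructor needs the default config again,
            // which closes the loop.
            Config.getDefaultConfig();
        }
    }

    // Runs the cycle once and returns the recorded constructor calls.
    static List<String> run() {
        calls.clear();
        depth = 0;
        Config.getDefaultConfig();
        return new ArrayList<>(calls);
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Without the guard, the two constructors in this sketch would keep
calling each other until the JVM runs out of stack or heap, which
matches the alternating TikaConfig.<init> and DefaultParser.<init>
frames above.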
Is this a bug in Tika, or am I doing something wrong?
Thanks
Stephan
--
_______________________________________________________________
Stephan Mühlstrasser [email protected] www.pdflib.com
PDFlib GmbH, Franziska-Bilek-Weg 9, 80339 München, Germany
Court of registry/Amtsgericht München HRB 129497
Managing Directors/Geschäftsführer: Thomas Merz, Petra Porst