Hi,

I'm trying to override the built-in PDF parser with another one. I looked through the mailing list archive and found the following hints on how to override a built-in parser:

http://mail-archives.apache.org/mod_mbox/tika-user/201105.mbox/%3CBANLkTimp4omHywv_ptOmqEX9v-%2BW4e7fVA%40mail.gmail.com%3E

https://issues.apache.org/jira/browse/TIKA-527

Is there any documentation available for the syntax of the configuration file?

The problem is that using the proposed method does not work for me. Any use of the configuration file apparently sends Tika into an endless recursion, even without overriding a built-in parser in the configuration file.

If I understand it correctly, the following configuration file should have the same effect as the built-in configuration:

$ cat tika-config.xml
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
  </parsers>
</properties>

But when I pass that file to Tika, the command-line application terminates with an exception after a while:

$ java -Dtika.config=tika-config.xml -jar tika-app-1.0.jar --list-parsers
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOfRange(Arrays.java:3209)
        at java.lang.String.<init>(String.java:216)
        at java.lang.StringBuilder.toString(StringBuilder.java:430)
        at org.apache.tika.mime.MediaType.toString(MediaType.java:237)
        at org.apache.tika.detect.MagicDetector.<init>(MagicDetector.java:142)
        at org.apache.tika.mime.MimeTypesReader.readMatch(MimeTypesReader.java:254)
        at org.apache.tika.mime.MimeTypesReader.readMatches(MimeTypesReader.java:202)
        at org.apache.tika.mime.MimeTypesReader.readMagic(MimeTypesReader.java:186)
        at org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:152)
        at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:124)
        at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:107)
        at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:63)
        at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:91)
        at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:147)
        at org.apache.tika.mime.MimeTypes.getDefaultMimeTypes(MimeTypes.java:455)
        at org.apache.tika.config.TikaConfig.typesFromDomElement(TikaConfig.java:273)
        at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:161)
        at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:237)
        at org.apache.tika.mime.MediaTypeRegistry.getDefaultRegistry(MediaTypeRegistry.java:42)
        at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:52)
        at sun.reflect.GeneratedConstructorAccessor4.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at java.lang.Class.newInstance0(Class.java:355)
        at java.lang.Class.newInstance(Class.java:308)
        at org.apache.tika.config.TikaConfig.parserFromDomElement(TikaConfig.java:288)
        at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:162)
        at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:237)
        at org.apache.tika.mime.MediaTypeRegistry.getDefaultRegistry(MediaTypeRegistry.java:42)
        at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:52)
        at sun.reflect.GeneratedConstructorAccessor4.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
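
My reading of the trace above is that the frames repeat in a cycle: TikaConfig.<init> instantiates DefaultParser reflectively, the DefaultParser constructor asks MediaTypeRegistry for the default registry, and that calls TikaConfig.getDefaultConfig, which constructs a TikaConfig again. To illustrate what I mean, here is a minimal self-contained sketch of that pattern. It is not Tika code: the class names ConfigCycleSketch, Config and CompositeParser are hypothetical stand-ins, and the LIMIT guard exists only so that the sketch terminates instead of running out of memory.

```java
import java.util.ArrayList;
import java.util.List;

public class ConfigCycleSketch {

    /** Hypothetical stand-in for a config class; NOT Tika's TikaConfig. */
    static class Config {
        static int constructions = 0;   // how often the constructor has run
        static final int LIMIT = 5;     // artificial guard so the sketch stops

        final List<Object> parsers = new ArrayList<>();

        Config(String parserClassName) throws Exception {
            constructions++;
            if (constructions > LIMIT) {
                throw new IllegalStateException(
                        "config constructed recursively " + constructions + " times");
            }
            // Mirrors "instantiate the parser class named in the config file".
            parsers.add(Class.forName(parserClassName)
                    .getDeclaredConstructor()
                    .newInstance());
        }

        /** Assumed behaviour: builds a fresh config on every call. */
        static Config getDefault() throws Exception {
            return new Config(CompositeParser.class.getName());
        }
    }

    /** Hypothetical stand-in for a parser whose constructor needs the default config. */
    static class CompositeParser {
        CompositeParser() throws Exception {
            // Asking for the default config while a config is still being
            // constructed is what closes the cycle.
            Config.getDefault();
        }
    }

    /** Runs the cycle until the guard trips and reports how many configs were built. */
    public static int constructionsBeforeGuardTrips() {
        Config.constructions = 0;
        try {
            Config.getDefault();
        } catch (Exception e) {
            // Expected: the guard's exception arrives wrapped by reflection.
        }
        return Config.constructions;
    }

    public static void main(String[] args) {
        System.out.println("Config constructor calls: " + constructionsBeforeGuardTrips());
    }
}
```

Without the guard, each nested construction would start another one. If every level also loads the MIME type definitions, as the MimeTypesReader frames suggest, that could explain why I see an OutOfMemoryError rather than a StackOverflowError.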

Is this a bug in Tika, or am I doing something wrong?

Thanks
Stephan

--
_______________________________________________________________
Stephan Mühlstrasser   [email protected]            www.pdflib.com
  PDFlib GmbH, Franziska-Bilek-Weg 9, 80339 München,  Germany
       Court of registry/Amtsgericht München HRB 129497
 Managing Directors/Geschäftsführer: Thomas Merz, Petra Porst
---------------------------------------------------------------
    PDFlib: powerful toolkits for PDF developers since 1997
_______ See www.pdflib.com/products for product details________

