Hi,
I'm currently in the process of moving a system over to use ForkParser
with Tika 2.3.0, but I'm running into a couple of issues.
First, I'd hoped to use the 'ForkParser(Path tikaBin,
ParserFactoryFactory factoryFactory)' constructor to get better
isolation, but have run into the issue described in TIKA-3223 where it
can't find an exception class, for example when parsing an encrypted
document. For now I've switched to using a 'legacy' constructor, but it
would be nice to eventually move to the newer method.
Second, there seems to be some work missing in the handling of metadata
from certain parsers when using ForkParser. For example, for
OpenDocument ODP and ODS files and Microsoft Open XML formats, while the
document text is returned, there is no metadata in either the returned
Metadata object or in the returned HTML head. The OpenDocument ODT
format works as expected via ForkParser though.
For an audio/mp4 file, the title is returned but the rest of the
metadata is missing, although the values are present in the body of the
returned HTML. For a video/mp4 file, metadata values are only present
in the body of the HTML, and the Metadata object has an incorrect
video/quicktime content type.
If it's possible to squeeze fixing this second issue into the next
version it would be really helpful!
Thanks,
Stephen.