On 2010-06-24 10:56, [email protected] wrote:
> Hi,
> 
> It looks like Tika does not include a PostScript parser. At least the
> copy that comes with Nutch 1.1. Is this right? I just want to double
> check because PostScript is a major file format. I get errors "Can't
> retrieve Tika parser for mime-type application/postscript" in the log
> when Nutch comes across a PostScript file. I've found a reference to
> parser-pdf associated with PostScript, but it does not work any
> better. It tries to treat PostScript files as pdf and fails, if I
> correctly interpret its complaints.

A PDF parser can't properly parse PostScript, sorry. On the other hand,
PostScript parsers may be (and often are) able to parse PDFs.

> 
> Could anyone help with parsing PostScript in Nutch, please? It is
> hard to believe that this is not implemented.

You can use Ghostscript via the parse-ext plugin - see the examples in
the plugin.xml file there.
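
For anyone who wants to try this, here is a rough sketch of the kind of
helper script an external-command parser could point at. It is not taken
from the Nutch sources. The assumptions: Ghostscript is installed as
`gs`, the build includes the `txtwrite` device, and the caller feeds the
document on stdin and reads the extracted text from stdout. The function
names `build_gs_command` and `postscript_to_text` are made up for
illustration.

```python
import subprocess


def build_gs_command(ps_path="-"):
    """Return a Ghostscript argv that writes extracted text to stdout.

    ps_path="-" tells gs to read the PostScript from standard input.
    """
    return [
        "gs",
        "-q",                 # suppress the startup banner so stdout holds only text
        "-dNOPAUSE",          # do not pause between pages
        "-dBATCH",            # exit after the last input instead of prompting
        "-dSAFER",            # restrict file access from the PostScript program
        "-sDEVICE=txtwrite",  # text-extraction device (assumed present in this build)
        "-sOutputFile=-",     # send the device output to stdout
        ps_path,
    ]


def postscript_to_text(data, timeout=30):
    """Run Ghostscript on PostScript bytes and return the extracted text as bytes.

    Raises subprocess.CalledProcessError if gs exits with a non-zero status
    and subprocess.TimeoutExpired if it runs longer than `timeout` seconds.
    """
    result = subprocess.run(
        build_gs_command("-"),
        input=data,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,
    )
    return result.stdout
```

A wrapper would read the document from stdin, pass the bytes to
`postscript_to_text`, and write the result to stdout. The timeout
matters because a PostScript program is free to loop forever.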

(...and BTW, parsing PostScript is definitely not on the same level of
complexity as parsing PDF - PostScript is a full programming language,
whereas PDF is "just" a page description format).
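
To illustrate the "full programming language" point, here is a toy
sketch of an evaluator for a tiny PostScript-like subset (integers,
`(...)` strings, `{...}` procedures, `show` and `repeat`). It is nowhere
near a real interpreter, and all names in it are hypothetical. It only
shows that the text on the page is whatever the program computes, so it
cannot be recovered by scanning the file for strings.

```python
def tokenize(src):
    """Split a tiny PostScript-like program into tokens; (..) strings stay whole."""
    tokens, i = [], 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():
            i += 1
        elif ch == "(":
            end = src.index(")", i)  # toy: no nested or escaped parentheses
            tokens.append(("str", src[i + 1:end]))
            i = end + 1
        elif ch in "{}":
            tokens.append((ch, ch))
            i += 1
        else:
            j = i + 1
            while j < len(src) and not src[j].isspace() and src[j] not in "(){}":
                j += 1
            tokens.append(("word", src[i:j]))
            i = j
    return tokens


def parse(tokens, pos=0):
    """Build a nested item list; each { ... } becomes a ("proc", [...]) item."""
    items = []
    while pos < len(tokens):
        kind, value = tokens[pos]
        if kind == "{":
            body, pos = parse(tokens, pos + 1)
            items.append(("proc", body))
        elif kind == "}":
            return items, pos + 1
        else:
            items.append((kind, value))
            pos += 1
    return items, pos


def run(items, stack, out):
    """Execute items on an operand stack; `show` appends a string to `out`."""
    for kind, value in items:
        if kind in ("str", "proc"):
            stack.append(value)          # literals are pushed, not executed
        elif value == "show":
            out.append(stack.pop())
        elif value == "repeat":
            body = stack.pop()
            count = stack.pop()
            for _ in range(count):
                run(body, stack, out)
        else:
            stack.append(int(value))     # toy: any other word must be an integer


def shown_text(src):
    """Return the concatenated text that the program passes to `show`."""
    items, _ = parse(tokenize(src))
    out = []
    run(items, [], out)
    return "".join(out)
```

For example, `shown_text("3 { (ab) show } repeat")` returns `"ababab"`,
even though the string "ab" appears only once in the source.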

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
