On 2010-06-24 10:56, [email protected] wrote:
> Hi,
>
> It looks like Tika does not include a PostScript parser. At least the
> copy that comes with Nutch 1.1. Is this right? I just want to double
> check because PostScript is a major file format. I get errors "Can't
> retrieve Tika parser for mime-type application/postscript" in the log
> when Nutch comes across a PostScript file. I've found a reference to
> parser-pdf associated with PostScript, but it does not work any
> better. It tries to treat PostScript files as pdf and fails, if I
> correctly interpret its complains.

The PDF parser can't properly parse PostScript, sorry. On the other hand,
PostScript interpreters may be (and often are) able to parse PDFs.

>
> Could anyone help with parsing PostScript in Nutch, please? It is
> hard to believe that this is not implemented.

You can use Ghostscript via the parse-ext plugin; see the examples in the
plugin.xml file there.

(...and BTW, parsing PostScript is definitely not on the same level of
complexity as parsing PDF: PostScript is a full programming language,
whereas PDF is "just" a page description format.)

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
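
P.S. A rough, untested sketch of what the external command behind parse-ext might look like. It assumes that parse-ext passes the fetched document to the command on stdin and reads the extracted text from stdout, and that the installed Ghostscript provides the txtwrite device. The function names are made up for illustration; check the plugin.xml shipped with your Nutch version for the real wiring.

```python
import subprocess


def gs_text_command(gs_binary="gs"):
    """Build a Ghostscript command that reads PostScript on stdin
    and writes extracted plain text to stdout (hypothetical helper)."""
    return [
        gs_binary,
        "-q",                 # suppress the startup banner
        "-dNOPAUSE",          # do not prompt between pages
        "-dBATCH",            # exit once the input is processed
        "-dSAFER",            # restrict file access from the PostScript program
        "-sDEVICE=txtwrite",  # text extraction device (assumed to be available)
        "-sOutputFile=-",     # send device output to stdout
        "-",                  # read the document from stdin
    ]


def postscript_to_text(ps_bytes, gs_binary="gs", timeout=30):
    """Run Ghostscript on the given PostScript bytes and return the text.

    Raises CalledProcessError if Ghostscript exits with an error and
    TimeoutExpired if it runs longer than `timeout` seconds.
    """
    result = subprocess.run(
        gs_text_command(gs_binary),
        input=ps_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

A wrapper script registered for application/postscript would then read sys.stdin.buffer, call postscript_to_text(), and print the result. The timeout matters here: since PostScript is a full programming language, a document can loop forever.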

