Re: nutch crawling arabic pdf site

Markus Jelsma Sun, 20 Feb 2011 09:25:59 -0800

The problem isn't fixed in the 0.9 release of Tika, so you're still stuck here,
and there is no other parse-pdf plugin you can use. There is, however,
the parse-ext plugin [1], which you could perhaps use to execute pdf2text and
return the parsed content. I haven't used this plugin and I don't know how to
configure it. If you manage to get it up and running, please post your
findings on the list.
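
To illustrate the general idea of handing the raw document to an external
converter and reading back its text output, here is a minimal, untested-in-Nutch
sketch in plain Java. It is not the ExtParser code and it uses no Nutch API; the
class name ExternalTextExtractor, the method name extract and the timeout
handling are all made up for illustration. The converter command line is left to
the caller, since the right arguments depend on the tool you have installed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Nutch: pipes bytes to an external command
// and returns whatever the command writes to stdout, decoded as UTF-8.
public class ExternalTextExtractor {

    public static String extract(List<String> command, byte[] input, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Let the tool's error messages show up on our own stderr.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        final Process process = builder.start();

        // Write stdin on its own thread so a full pipe cannot deadlock us.
        Thread writer = new Thread(() -> {
            try (OutputStream stdin = process.getOutputStream()) {
                stdin.write(input);
            } catch (IOException ignored) {
                // The tool may close stdin early; the exit code is checked below.
            }
        });

        // Collect stdout on a second thread so the timeout below stays effective.
        final ByteArrayOutputStream collected = new ByteArrayOutputStream();
        Thread reader = new Thread(() -> {
            try (InputStream stdout = process.getInputStream()) {
                byte[] buffer = new byte[8192];
                int n;
                while ((n = stdout.read(buffer)) != -1) {
                    collected.write(buffer, 0, n);
                }
            } catch (IOException ignored) {
                // Stream closed, for example after the process was killed.
            }
        });

        writer.start();
        reader.start();

        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            throw new IOException("External command timed out: " + command);
        }
        writer.join();
        reader.join();

        if (process.exitValue() != 0) {
            throw new IOException("External command failed with exit code "
                    + process.exitValue());
        }
        // Assumption: the tool emits UTF-8, which matters for Arabic text.
        return new String(collected.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // "cat" stands in for a real converter; it simply echoes its input.
        byte[] sample = "\u0645\u0631\u062d\u0628\u0627".getBytes(StandardCharsets.UTF_8);
        System.out.println(extract(Arrays.asList("cat"), sample, 10));
    }
}
```

In the demo, cat is only a placeholder so the sketch runs on any Unix-like
system; you would swap in the converter command you actually have, and check
that it really writes UTF-8 to stdout.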


As a last resort you might have to write a custom plugin [2], but I imagine
it'd do the same job as the parse-ext plugin with pdf2text.

[1]: http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java?view=markup
[2]: http://wiki.apache.org/nutch/WritingPluginExample

> thaaaaaanx  a lot for your help
> you have a wide experience
> but the problem is still exist
> i don't know what can i do
