Re: nutch crawling arabic pdf site

Markus Jelsma Wed, 16 Feb 2011 07:35:12 -0800

Hi,

1. Is the PDF actually fetched, parsed and indexed? Does your regex-urlfilter
skip PDFs?
2. Is the PDF too large, so that Nutch truncates it?
3. Does Tika actually parse the PDF as you expect?
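
For points 1 and 2, here is a rough sketch of how the configuration could be checked from a shell. It assumes you run it from the Nutch installation directory with the stock conf/ layout. The helper names pdf_rules and content_limit are made up for illustration and are not part of Nutch.

```shell
# Assumption: conf/regex-urlfilter.txt and conf/nutch-site.xml are in their
# default locations. Both helper names below are hypothetical.

# Print the non-comment URL filter rules that mention "pdf" (case-insensitive).
pdf_rules() {
    grep -v '^#' "$1" | grep -i 'pdf'
}

# Print the http.content.limit property line and the two lines after it.
content_limit() {
    grep -A 2 'http.content.limit' "$1"
}
```

Usage would be something like `pdf_rules conf/regex-urlfilter.txt` and `content_limit conf/nutch-site.xml`. A filter rule that starts with `-` excludes matching URLs, so a `-` rule listing pdf would keep the file from being fetched. If the limit is not overridden in nutch-site.xml, the value in conf/nutch-default.xml applies, and content longer than that limit is truncated.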


The problem could be in any of these places. You can use the parser checker to
confirm that Tika is working:

bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.apache.org/licenses/icla.pdf


Cheers,

On Wednesday 16 February 2011 08:31:23 hala wrote:
> Thank you for your reply.
> I do a complete crawl cycle (generate, fetch, update, index) and use Nutch's
> internal search. I crawl a site that has a link to a PDF file. The PDF
> contains Arabic words, and I want to search for them with Nutch.
> If the Arabic words are in the site's pages, Nutch returns them, but if they
> are in the PDF, Nutch does not return them.
> Please give me any help.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
