Hi,

1. Is the PDF actually fetched, parsed and indexed? Doesn't your regex-urlfilter skip PDFs?
2. Is the PDF too large, so that it is being truncated by Nutch?
3. Does Tika actually parse the PDF as you expect?

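For point 2, here is a minimal sketch of the override that would rule out truncation. It assumes the PDF is fetched over HTTP and that you keep local overrides in conf/nutch-site.xml. The property name http.content.limit comes from nutch-default.xml, where the default is typically 65536 bytes, so verify it against your own copy. The property belongs inside the existing <configuration> element.

```xml
<!-- conf/nutch-site.xml (assumed location of your local overrides) -->
<property>
  <name>http.content.limit</name>
  <!-- A negative value disables truncation of downloaded content. -->
  <value>-1</value>
</property>
```

For point 1, it is also worth checking that no exclusion rule (a line starting with "-") in conf/regex-urlfilter.txt matches URLs ending in .pdf. After changing either setting, the PDF has to be fetched again before the index can change.
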
There may be issues at several different points. You can use the parser checker to confirm that Tika is working:

    bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.apache.org/licenses/icla.pdf

Cheers,

On Wednesday 16 February 2011 08:31:23 hala wrote:
> Thank you for your reply.
> I do a complete crawl cycle (generate, fetch, update, index) and use Nutch's
> internal search. I crawl a site that has a link to a PDF file. The PDF contains
> Arabic words, and I want to search on them with Nutch.
> If the site has Arabic words, Nutch returns them to me, but if the Arabic
> words are in the PDF, Nutch doesn't return them to me.
> Please give me any help.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

