Crawling images

Wade Dugas Thu, 29 Jul 2010 06:38:31 -0700

I am not able to get Nutch 1.2 to crawl JPEG images. The parse-tika plugin is 
supposed to be able to parse them, but what needs to be done to have them fetched?


I have updated regex-urlfilter.txt so that it no longer skips JPEG images:
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$
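As a quick sanity check of the suffix rule above, here is a minimal Python sketch. It is not Nutch code: it assumes the rule behaves like an unanchored regex search against the URL, the leading "-" (which marks a reject rule in the filter file) is dropped, and the example.com URLs and the is_rejected helper are made up for illustration.

```python
import re

# Suffix pattern copied from the reject rule above, without the leading "-".
SKIP_SUFFIXES = re.compile(
    r"\.(gif|GIF|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|"
    r"mov|MOV|exe|bmp|BMP)$"
)


def is_rejected(url: str) -> bool:
    """Return True if the suffix rule would reject this URL (hypothetical helper)."""
    return SKIP_SUFFIXES.search(url) is not None


# Hypothetical URLs, for illustration only.
samples = [
    "http://example.com/photo.jpg",
    "http://example.com/photo.JPEG",
    "http://example.com/logo.png",
    "http://example.com/index.html",
]

for url in samples:
    verdict = "rejected" if is_rejected(url) else "not rejected by this rule"
    print(f"{url}: {verdict}")
```

If the .jpg and .JPEG URLs are not rejected here, this particular rule is probably not what keeps the images from being fetched, though other rules or filter files in the configuration could still exclude them.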


Although parse-tika should automatically parse the image/jpeg MIME type, the 
fetcher doesn't seem to pick the images up.

I would like to parse the images and store them in the content cache.

Any help would be most appreciated.

Thanks,
Wade
