I am not able to get Nutch 1.2 to crawl JPEG images. The parse-tika plugin is
supposed to be able to parse them, but what needs to be done to have them fetched?
I have updated regex-urlfilter.txt so that its suffix rule no longer skips JPEG images:
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$
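As a sanity check on the pattern itself, here is a minimal Python sketch that tests the rule above against a few URLs. It is only an approximation: Nutch evaluates the rule with Java regular expressions, the URLs are made up for illustration, and the helper name is_skipped is hypothetical. It only covers this one rule, so other rules in the file or other URL filters could still reject the images.

```python
import re

# The suffix rule exactly as quoted above, minus the leading "-" that marks
# it as an exclusion rule in the filter file.
SKIP_SUFFIXES = re.compile(
    r"\.(gif|GIF|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz"
    r"|mov|MOV|exe|bmp|BMP)$"
)


def is_skipped(url: str) -> bool:
    """Return True if the suffix rule would reject this URL (hypothetical helper)."""
    return SKIP_SUFFIXES.search(url) is not None


# Hypothetical URLs, for illustration only.
for url in (
    "http://example.com/photo.jpg",
    "http://example.com/photo.JPEG",
    "http://example.com/logo.png",
):
    verdict = "skipped" if is_skipped(url) else "not skipped by this rule"
    print(url, verdict)
```

With this rule, the .jpg and .JPEG URLs are not rejected while the .png URL still is, so the suffix rule by itself should not be what keeps the images from being fetched.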
Although parse-tika should automatically handle the image/jpeg MIME type, the
fetcher doesn't seem to pick the images up.
I would like to parse the images and store them in the content cache.
Any help would be most appreciated.
Thanks,
Wade