regex-urlfilter.txt not working?

Steve Cohen Tue, 21 Dec 2010 12:58:54 -0800

In regex-urlfilter.txt we have the following:

-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|XLS|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|jpe|pcx|tif|tiff|dll|DLL|a|so|o|class|bin|ttf|pfb|pfm|afm|hqx|sea|eps|ai|ram|wav|avi|mid|mov|mpg|mpeg|mp3|ogg|dat|dta|log|bz2|jar|arj|cab|rar|tar|zip|tar.gz|upp|tgz|sdd|hdr|iso|img|gpg|gbk|fac|ghg|mdic|jnilib|dmg|3gp|m4a|m4v|wma|wmv|wrl|lzh|msi|gg|kml|kmz|skb|skp|chm|mht|html/|htm/|phtml/|ghtml/|asp/|js|jsp/|shtml/|doc|PDF|pdf|swf|xml)$
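
One way to rule out the pattern itself is to test it outside Nutch against the URL that shows up in the logs. The following is only a sketch: it uses Python's `re` module as a stand-in for the Java regex engine that Nutch uses (the two should agree for a pattern this simple), and it assumes the filter searches anywhere in the URL and that the first matching rule decides. The `accepts` helper and the trailing `+.` catch-all rule are illustrative and not taken from the configuration above.

```python
import re

# The exclusion rule exactly as it appears in regex-urlfilter.txt.
EXCLUDE_RULE = (
    r"-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|XLS"
    r"|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|jpe|pcx|tif|tiff|dll|DLL|a|so"
    r"|o|class|bin|ttf|pfb|pfm|afm|hqx|sea|eps|ai|ram|wav|avi|mid|mov|mpg|mpeg"
    r"|mp3|ogg|dat|dta|log|bz2|jar|arj|cab|rar|tar|zip|tar.gz|upp|tgz|sdd|hdr"
    r"|iso|img|gpg|gbk|fac|ghg|mdic|jnilib|dmg|3gp|m4a|m4v|wma|wmv|wrl|lzh|msi"
    r"|gg|kml|kmz|skb|skp|chm|mht|html/|htm/|phtml/|ghtml/|asp/|js|jsp/|shtml/"
    r"|doc|PDF|pdf|swf|xml)$"
)

# Assumed catch-all rule that accepts everything not excluded earlier.
RULES = [EXCLUDE_RULE, "+."]


def accepts(rules, url):
    """Hypothetical helper: the first rule whose regex is found in the URL wins."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected


pdf_url = "http://www.fodors.com/pdf/fodors-south-australia.pdf"
html_url = "http://www.fodors.com/index.html"

print(pdf_url, "accepted" if accepts(RULES, pdf_url) else "rejected")
print(html_url, "accepted" if accepts(RULES, html_url) else "rejected")
```

If the PDF URL is rejected here, the pattern is probably fine and the question becomes where in the crawl the filter is actually applied.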



So we shouldn't see any mention of PDFs, right? Well, in the logs I am seeing
this:

2010-12-21 15:45:04,340 WARN  parse.ParseUtil - No suitable parser found when trying to parse content http://www.fodors.com/pdf/fodors-south-australia.pdf of type application/pdf
2010-12-21 15:45:04,340 WARN  fetcher.Fetcher - Error parsing: http://www.fodors.com/pdf/fodors-south-australia.pdf: org.apache.nutch.parse.ParseException: parser not found for contentType=application/pdf url=http://www.fodors.com/pdf/fodors-south-australia.pdf
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:879)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:647)

Does ParseUtil.java not use regex-urlfilter.txt?

Thanks,
Steve Cohen
