Yeah, maybe it's because of this:
parse-(text|html|msexcel|mspowerpoint|msword|rss|zip)

PDF is not included.
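
To illustrate, here is a rough sketch in Python (not Nutch code) that checks plugin ids against the plugin.includes value quoted below. It assumes the whole plugin id has to match the expression. The helper name is_included and the id "parse-pdf" are only used for illustration; the right PDF parser plugin id depends on your Nutch version.

```python
import re

# plugin.includes value copied from the quoted nutch-site.xml
PLUGIN_INCLUDES = (
    r"nutch-extensionpoints|protocol-file|protocol-http|urlfilter-regex|"
    r"parse-(text|html|msexcel|mspowerpoint|msword|rss|zip)|"
    r"index-(anchor|basic|more)|scoring-opic|query-(basic|more|site|url)|"
    r"response-(json|xml)|summary-basic|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Hypothetical helper: True if plugin_id matches the includes expression.

    Assumption: the id must match the expression as a whole, not just a part.
    """
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    for plugin_id in ("parse-html", "parse-zip", "parse-pdf"):
        print(plugin_id, is_included(plugin_id))
```

With the value as quoted, no id ending in "pdf" can match the parse-(...) group, which would fit the "No suitable parser found" warning for application/pdf.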
On Wed, Dec 22, 2010 at 2:43 AM, scohen [via Lucene] wrote:

> I forgot to mention, in nutch-site.xml we have this property:
>
> <property>
>   <name>plugin.includes</name>
>
> <value>nutch-extensionpoints|protocol-file|protocol-http|urlfilter-regex|parse-(text|html|msexcel|mspowerpoint|msword|rss|zip)|index-(anchor|basic|more)|scoring-opic|query-(basic|more|site|url)|response-(json|xml)|summary-basic|urlnormalizer-(pass|regex|basic)
>
>   </value>
> </property>
>
> On Tue, Dec 21, 2010 at 3:58 PM, Steve Cohen <[hidden email]> wrote:
>
> > in the regex-urlfilter.txt we have the following:
> >
> >
> >
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|XLS|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|jpe|pcx|tif|tiff|dll|DLL|a|so|o|class|bin|ttf|pfb|pfm|afm|hqx|sea|eps|ai|ram|wav|avi|mid|mov|mpg|mpeg|mp3|ogg|dat|dta|log|bz2|jar|arj|cab|rar|tar|zip|tar.gz|upp|tgz|sdd|hdr|iso|img|gpg|gbk|fac|ghg|mdic|jnilib|dmg|3gp|m4a|m4v|wma|wmv|wrl|lzh|msi|gg|kml|kmz|skb|skp|chm|mht|html/|htm/|phtml/|ghtml/|asp/|js|jsp/|shtml/|doc|PDF|pdf|swf|xml)$
>
> >
> >
> > So we shouldn't see any mention of PDFs, right? Well, in the logs I am
> > seeing this:
> >
> > 2010-12-21 15:45:04,340 WARN  parse.ParseUtil - No suitable parser found
> > when trying to parse content
> > http://www.fodors.com/pdf/fodors-south-australia.pdf of type
> > application/pdf
> > 2010-12-21 15:45:04,340 WARN  fetcher.Fetcher - Error parsing:
> > http://www.fodors.com/pdf/fodors-south-australia.pdf:
> > org.apache.nutch.parse.ParseException: parser not found for
> > contentType=application/pdf url=
> > http://www.fodors.com/pdf/fodors-south-australia.pdf
> >         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
> >         at
> > org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:879)
> >         at
> > org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:647)
> >
> > Does ParseUtil.java not use regex-urlfilter.txt?
> >
> > Thanks,
> > Steve Cohen
> >
>
>



-- 
Kumar Anurag



-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/regex-urlfilter-txt-not-working-tp2127997p2128100.html
Sent from the Nutch - User mailing list archive at Nabble.com.
