Did you (1) fetch with -noParsing and then parse in a separate step, or (2) fetch with parsing enabled?

On Wed, Dec 22, 2010 at 3:27 AM, scohen [via Lucene] wrote:
> I can understand that I am getting a "No suitable parser found" error
> because we don't have a parse-pdf plugin. However, it shouldn't be looking
> at PDFs in the first place, because we are telling it to ignore PDFs with
> the regex-urlfilter.txt file.
>
> I don't see how not mentioning parse-(pdf) would cause the
> regex-urlfilter.txt file to not work.
>
> On Tue, Dec 21, 2010 at 4:19 PM, Anurag <[hidden email]> wrote:
>
> > Yeah, maybe because of this:
> > parse-(text|html|msexcel|mspowerpoint|msword|rss|zip)
> >
> > PDF is not included.
> >
> > On Wed, Dec 22, 2010 at 2:43 AM, scohen [via Lucene] <[hidden email]> wrote:
> >
> > > I forgot to mention, in nutch-site.xml we have this property:
> > >
> > > <property>
> > >   <name>plugin.includes</name>
> > >   <value>nutch-extensionpoints|protocol-file|protocol-http|urlfilter-regex|parse-(text|html|msexcel|mspowerpoint|msword|rss|zip)|index-(anchor|basic|more)|scoring-opic|query-(basic|more|site|url)|response-(json|xml)|summary-basic|urlnormalizer-(pass|regex|basic)</value>
> > > </property>
> > >
> > > On Tue, Dec 21, 2010 at 3:58 PM, Steve Cohen <[hidden email]> wrote:
> > >
> > > > In the regex-urlfilter.txt we have the following:
> > > >
> > > > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|XLS|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|jpe|pcx|tif|tiff|dll|DLL|a|so|o|class|bin|ttf|pfb|pfm|afm|hqx|sea|eps|ai|ram|wav|avi|mid|mov|mpg|mpeg|mp3|ogg|dat|dta|log|bz2|jar|arj|cab|rar|tar|zip|tar.gz|upp|tgz|sdd|hdr|iso|img|gpg|gbk|fac|ghg|mdic|jnilib|dmg|3gp|m4a|m4v|wma|wmv|wrl|lzh|msi|gg|kml|kmz|skb|skp|chm|mht|html/|htm/|phtml/|ghtml/|asp/|js|jsp/|shtml/|doc|PDF|pdf|swf|xml)$
> > > >
> > > > So we shouldn't see any mention of PDFs, right? Well, in the logs I am
> > > > seeing this:
> > > >
> > > > 2010-12-21 15:45:04,340 WARN parse.ParseUtil - No suitable parser found when trying to parse content http://www.fodors.com/pdf/fodors-south-australia.pdf of type application/pdf
> > > > 2010-12-21 15:45:04,340 WARN fetcher.Fetcher - Error parsing: http://www.fodors.com/pdf/fodors-south-australia.pdf: org.apache.nutch.parse.ParseException: parser not found for contentType=application/pdf url=http://www.fodors.com/pdf/fodors-south-australia.pdf
> > > >     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
> > > >     at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:879)
> > > >     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:647)
> > > >
> > > > Does ParseUtil.java not use the regex-urlfilter.txt?
> > > >
> > > > Thanks,
> > > > Steve Cohen

--
Kumar Anurag

View this message in context: http://lucene.472066.n3.nabble.com/regex-urlfilter-txt-not-working-tp2127997p2128438.html
Sent from the Nutch - User mailing list archive at Nabble.com.
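
P.S. One quick way to rule out the pattern itself is to test the exclusion rule against the URL from the log, outside of Nutch. The following is only a minimal Python sketch, not Nutch code. It assumes the filter applies rules top to bottom, that the first matching rule decides, and that a URL matching no rule is rejected. The `accepts` helper is hypothetical, the `+.` catch-all rule is an assumption (it is not shown in the thread), the exclusion list is abbreviated, and Python's `re` stands in for Java's regex engine.

```python
import re

# Exclusion rule abbreviated from the thread (the real list is much longer).
RULES = [
    r"-\.(gif|GIF|jpg|JPG|png|PNG|doc|PDF|pdf|swf|xml)$",
    r"+.",  # assumed catch-all accept rule; not shown in the thread
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches the URL starts with '+'."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected


pdf_url = "http://www.fodors.com/pdf/fodors-south-australia.pdf"
html_url = "http://www.fodors.com/index.html"

print(pdf_url, "accepted:", accepts(pdf_url))
print(html_url, "accepted:", accepts(html_url))
```

If the PDF URL is rejected here, the pattern is probably not the problem, and the answer to the -noParsing question above becomes the more useful clue.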

