We are fetching and parsing in one step.
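
For reference, the two modes Anurag asked about look roughly like this in
Nutch 1.x (the segment path is just an example):

    # one step: parse while fetching (what we do)
    bin/nutch fetch crawl/segments/20101221000000

    # two steps: fetch with -noParsing, then run the parse job
    bin/nutch fetch crawl/segments/20101221000000 -noParsing
    bin/nutch parse crawl/segments/20101221000000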

I seem to have fixed my issue, though I am still a little confused.

We are using an HDFS setup with a multi-site Nutch configuration.

We've specified these conf files for HDFS:

search1:/opt/nutch/hadoop/conf] ls
core-default.xml  core-site.xml       hdfs-default.xml  hdfs-site.xml
log4j.properties  mapred-default.xml  mapred-site.xml   masters
slaves

Just the files for HDFS, none of the files for either Nutch site. We edited
the start-all.sh script to pass --config /opt/nutch/hadoop/conf so it uses
these files.
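
So the daemons all come up against that one conf dir; the edit is roughly
equivalent to starting them like this (Hadoop's daemon scripts take
--config to point at an alternate conf dir):

    bin/start-all.sh --config /opt/nutch/hadoop/conf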

For the Nutch sites we have these:

search1:/opt/nutch/site1/conf] ls
automaton-urlfilter.txt  common-terms.utf8  configuration.xsl
context.xsl  core-default.xml  core-site.xml
core-site.xml.hdfs  core-site.xml.not_hdfs  crawl.ini
crawl.ini.bak  crawl-tool.xml  crawl-urlfilter.txt
crawl-urlfilter.txt.bak  custom-fields.xml  domain-suffixes.xml
domain-suffixes.xsd  domain-urlfilter.txt  domain-urlfilter.txt.bak
elevate.xml  exclude_tags.txt  hadoop-default.xml.bak
hadoop-default.xml.deprecated  hadoop-env.sh  hadoop-site.xml.bak
hadoop-site.xml.change_map_tasks_to_one_for_testing  hadoop-site.xml.old
hadoop-site.xml.orig  hdfs-default.xml  hdfs-site.xml
hdfs-site.xml.hdfs  hdfs-site.xml.not_hdfs  httpclient-auth.xml
log4j.properties  mapred-default.xml  mapred-site.xml
mapred-site.xml.bak  mapred-site.xml.hdfs  mapred-site.xml.not_hdfs
masters  nutch-conf.xsl  nutch-default.xml
nutch-site.xml  parse-plugins.dtd  parse-plugins.xml
prefix-urlfilter.txt  protwords.txt  README
regex-normalize.xml  regex-urlfilter.txt  regex-urlfilter.txt.bak
regex-urlfilter.txt.smc  schema.xml  schema.xml.bak
slaves  solrconfig.xml  solrconfig.xml.ver-1.4
spellings.txt  stopwords.txt  subcollections.xml
suffix-urlfilter.txt  synonyms.txt  tika-mimetypes.xml
xslt

search1:/opt/nutch/site2/conf] ls
(identical file listing to site1/conf above)


And we are setting NUTCH_CONF_DIR in each site's crawl script.

We have different nutch-site.xml files and different regex-urlfilter.txt
files per site, so we don't want to put them in the HDFS conf directory.
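
Each per-site crawl script looks roughly like this (paths and crawl
arguments are just examples):

    #!/bin/sh
    # point Nutch at this site's private conf dir before launching
    export NUTCH_CONF_DIR=/opt/nutch/site1/conf
    /opt/nutch/bin/nutch crawl urls-site1 -dir crawl-site1 -depth 3 -topN 1000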

However, the only way I seem to be able to get the regex-urlfilter.txt read
is to add it to the HDFS conf directory. So my new question is: what do we
have to do to get the per-site Nutch conf files read, or do we have to stop
and restart the HDFS cluster with different conf files every time we crawl
a different site?
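
My working theory (an unverified assumption on my part) is that in
distributed mode the fetcher tasks read the filter files out of the Nutch
job jar rather than out of NUTCH_CONF_DIR on the local disk, so we may
have to repack each site's conf into the job jar before crawling, along
these lines (the job jar name is hypothetical):

    # assumption: the job jar carries the conf files at its root
    cd /opt/nutch
    cp site1/conf/nutch-site.xml site1/conf/regex-urlfilter.txt .
    jar uf nutch-1.2.job nutch-site.xml regex-urlfilter.txt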

Thanks,
Steve

On Tue, Dec 21, 2010 at 5:32 PM, Anurag <[email protected]> wrote:

>
> did you use
> 1. fetching with -noParsing and then parsing, or
> 2. just fetching?
>
> On Wed, Dec 22, 2010 at 3:27 AM, scohen [via Lucene] wrote:
>
> > I can understand that I am getting a "No suitable parser found" error
> > because we don't have a parse-pdf plugin. However, it shouldn't be
> > looking at PDFs in the first place, because we are telling it to ignore
> > PDFs with the regex-urlfilter.txt file.
> >
> > I don't see how not mentioning parse-(pdf) would cause the
> > regex-urlfilter.txt file to not work.
> >
> > On Tue, Dec 21, 2010 at 4:19 PM, Anurag wrote:
> >
> > >
> > > Yeah, maybe because of this:
> > > parse-(text|html|msexcel|mspowerpoint|msword|rss|zip)
> > >
> > > PDF is not included.
> > > On Wed, Dec 22, 2010 at 2:43 AM, scohen [via Lucene] wrote:
> > > > I forgot to mention, in nutch-site.xml we have this property:
> > > >
> > > > <property>
> > > >   <name>plugin.includes</name>
> > > >   <value>nutch-extensionpoints|protocol-file|protocol-http|urlfilter-regex|parse-(text|html|msexcel|mspowerpoint|msword|rss|zip)|index-(anchor|basic|more)|scoring-opic|query-(basic|more|site|url)|response-(json|xml)|summary-basic|urlnormalizer-(pass|regex|basic)</value>
> > > > </property>
> > > > On Tue, Dec 21, 2010 at 3:58 PM, Steve Cohen wrote:
> > > >
> > > > > In the regex-urlfilter.txt we have the following:
> > > > >
> > > > > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|XLS|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|jpe|pcx|tif|tiff|dll|DLL|a|so|o|class|bin|ttf|pfb|pfm|afm|hqx|sea|eps|ai|ram|wav|avi|mid|mov|mpg|mpeg|mp3|ogg|dat|dta|log|bz2|jar|arj|cab|rar|tar|zip|tar.gz|upp|tgz|sdd|hdr|iso|img|gpg|gbk|fac|ghg|mdic|jnilib|dmg|3gp|m4a|m4v|wma|wmv|wrl|lzh|msi|gg|kml|kmz|skb|skp|chm|mht|html/|htm/|phtml/|ghtml/|asp/|js|jsp/|shtml/|doc|PDF|pdf|swf|xml)$
> > > > >
> > > > > So we shouldn't see any mention of PDFs, right? Well, in the logs I
> > > > > am seeing this:
> > > > >
> > > > > 2010-12-21 15:45:04,340 WARN  parse.ParseUtil - No suitable parser found
> > > > > when trying to parse content
> > > > > http://www.fodors.com/pdf/fodors-south-australia.pdf of type
> > > > > application/pdf
> > > > > 2010-12-21 15:45:04,340 WARN  fetcher.Fetcher - Error parsing:
> > > > > http://www.fodors.com/pdf/fodors-south-australia.pdf:
> > > > > org.apache.nutch.parse.ParseException: parser not found for
> > > > > contentType=application/pdf
> > > > > url=http://www.fodors.com/pdf/fodors-south-australia.pdf
> > > > >         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
> > > > >         at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:879)
> > > > >         at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:647)
> > > > >
> > > > > Does ParseUtil.java not use the regex-urlfilter.txt?
> > > > >
> > > > > Thanks,
> > > > > Steve Cohen
> > >
> > > --
> > > Kumar Anurag
> > >
>
>
>
> --
> Kumar Anurag
>
