I ran the regex tester command you provided. It seems to be taking
forever (15+ minutes by now).
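A note on the hang: as far as I can tell, URLFilterChecker reads candidate URLs from standard input, so if nothing is piped in it will simply block waiting for input rather than terminate. In the meantime, the rules file itself can be sanity-checked outside Nutch with a small script that mimics the regex filter's first-match-wins behaviour. This is only a rough sketch of the semantics, not Nutch's actual code, and the rule text and test URLs below are illustrative examples:

```python
import re

# Example rules in the style of urlfilter.regex.file: a "+" line accepts
# matching URLs, a "-" line rejects them. The first matching rule wins,
# and a URL matched by no rule is rejected. The dots in cnn.com are
# escaped here, which is slightly stricter than the pattern in the thread.
RULES_TEXT = """
+^http://([a-z0-9]*\\.)*cnn\\.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
-.
"""

def parse_rules(text):
    """Parse '+pattern' / '-pattern' lines into (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return True if the first rule whose regex matches is a '+' rule."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False  # no rule matched: reject

rules = parse_rules(RULES_TEXT)
# A dated article URL should be accepted; the site root should be rejected.
print(accepts(rules, "http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1"))  # True
print(accepts(rules, "http://www.cnn.com/"))  # False
```

If the article URL prints False here, the pattern itself is wrong; if it prints True, the problem is more likely in how Nutch is loading the file.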

On Sun, Nov 25, 2012 at 3:28 PM, Joe Zhang <[email protected]> wrote:

> You mean the content of my pattern file?
>
> Well, even when I reduce it to simply "-.", the same problem still pops up.
>
> On Sun, Nov 25, 2012 at 3:30 PM, Markus Jelsma <[email protected]> wrote:
>
>> You seem to have an NPE caused by your regex rules, for some weird
>> reason. If you can provide a way to reproduce it, you can file an issue
>> in Jira. This NPE should also occur if you run the regex tester.
>>
>> nutch -Durlfilter.regex.file=path org.apache.nutch.net.URLFilterChecker
>> -allCombined
>>
>> In the meantime you can check if a rule causes the NPE.
>>
>> -----Original message-----
>> > From:Joe Zhang <[email protected]>
>> > Sent: Sun 25-Nov-2012 23:26
>> > To: [email protected]
>> > Subject: Re: Indexing-time URL filtering again
>> >
>> > the last few lines of hadoop.log:
>> >
>> > 2012-11-25 16:30:30,021 INFO  indexer.IndexingFilters - Adding
>> > org.apache.nutch.indexer.anchor.AnchorIndexingFilter
>> > 2012-11-25 16:30:30,026 INFO  indexer.IndexingFilters - Adding
>> > org.apache.nutch.indexer.metadata.MetadataIndexer
>> > 2012-11-25 16:30:30,218 WARN  mapred.LocalJobRunner - job_local_0001
>> > java.lang.RuntimeException: Error in configuring object
>> >         at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>> >         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>> >         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>> >         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
>> >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>> >         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>> > Caused by: java.lang.reflect.InvocationTargetException
>> >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> >         at java.lang.reflect.Method.invoke(Method.java:601)
>> >         at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
>> >         ... 5 more
>> > Caused by: java.lang.RuntimeException: Error in configuring object
>> >         at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>> >         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>> >         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>> >         at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
>> >         ... 10 more
>> > Caused by: java.lang.reflect.InvocationTargetException
>> >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> >         at java.lang.reflect.Method.invoke(Method.java:601)
>> >         at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
>> >         ... 13 more
>> > Caused by: java.lang.NullPointerException
>> >         at java.io.Reader.<init>(Reader.java:78)
>> >         at java.io.BufferedReader.<init>(BufferedReader.java:94)
>> >         at java.io.BufferedReader.<init>(BufferedReader.java:109)
>> >         at org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:180)
>> >         at org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:156)
>> >         at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
>> >         at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:57)
>> >         at org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:95)
>> >         ... 18 more
>> > 2012-11-25 16:30:30,568 ERROR solr.SolrIndexer - java.io.IOException: Job failed!
>> >
>> >
>> > On Sun, Nov 25, 2012 at 3:08 PM, Markus Jelsma
>> > <[email protected]> wrote:
>> >
>> > > You should provide the log output.
>> > >
>> > > -----Original message-----
>> > > > From:Joe Zhang <[email protected]>
>> > > > Sent: Sun 25-Nov-2012 17:27
>> > > > To: [email protected]
>> > > > Subject: Re: Indexing-time URL filtering again
>> > > >
>> > > > I actually checked out the most recent build from SVN, Release 1.6 -
>> > > > 23/11/2012.
>> > > >
>> > > > The following command
>> > > >
>> > > > bin/nutch solrindex  -Durlfilter.regex.file=.....UrlFiltering.txt
>> > > > http://localhost:8983/solr/ crawl/crawldb/ -linkdb crawl/linkdb/
>> > > > crawl/segments/*  -filter
>> > > >
>> > > > produced the following output:
>> > > >
>> > > > SolrIndexer: starting at 2012-11-25 16:19:29
>> > > > SolrIndexer: deleting gone documents: false
>> > > > SolrIndexer: URL filtering: true
>> > > > SolrIndexer: URL normalizing: false
>> > > > java.io.IOException: Job failed!
>> > > >
>> > > > Can anybody help?
>> > > > On Sun, Nov 25, 2012 at 6:43 AM, Joe Zhang <[email protected]> wrote:
>> > > >
>> > > > > How exactly do I get to trunk?
>> > > > >
>> > > > > I did download NUTCH-1300-1.5-1.patch, ran the patch command
>> > > > > correctly, and re-built Nutch. But the problem still persists...
>> > > > >
>> > > > > On Sun, Nov 25, 2012 at 3:29 AM, Markus Jelsma <[email protected]> wrote:
>> > > > >
>> > > > >> No, this is not a bug. As I said, you need either to patch your
>> > > > >> Nutch or get the sources from trunk. The -filter parameter is not
>> > > > >> in your version. Check the patch manual if you don't know how it
>> > > > >> works.
>> > > > >>
>> > > > >> $ cd trunk ; patch -p0 < file.patch
>> > > > >>
>> > > > >> -----Original message-----
>> > > > >> > From:Joe Zhang <[email protected]>
>> > > > >> > Sent: Sun 25-Nov-2012 08:42
>> > > > >> > To: Markus Jelsma <[email protected]>; user <
>> > > > >> [email protected]>
>> > > > >> > Subject: Re: Indexing-time URL filtering again
>> > > > >> >
>> > > > >> > This does seem like a bug. Can anybody help?
>> > > > >> >
>> > > > >> > On Sat, Nov 24, 2012 at 6:40 PM, Joe Zhang <[email protected]> wrote:
>> > > > >> >
>> > > > >> > > Markus, could you advise? Thanks a lot!
>> > > > >> > >
>> > > > >> > >
>> > > > >> > > On Sat, Nov 24, 2012 at 12:49 AM, Joe Zhang <[email protected]> wrote:
>> > > > >> > >
>> > > > >> > >> I followed your instructions and applied the patch, Markus,
>> > > > >> > >> but the problem still persists --- "-filter" is interpreted
>> > > > >> > >> as a path by solrindex.
>> > > > >> > >>
>> > > > >> > >> On Fri, Nov 23, 2012 at 12:39 AM, Markus Jelsma <
>> > > > >> > >> [email protected]> wrote:
>> > > > >> > >>
>> > > > >> > >>> Ah, I get it now. Please use trunk or patch your version
>> > > > >> > >>> with https://issues.apache.org/jira/browse/NUTCH-1300 to
>> > > > >> > >>> enable filtering.
>> > > > >> > >>>
>> > > > >> > >>> -----Original message-----
>> > > > >> > >>> > From:Joe Zhang <[email protected]>
>> > > > >> > >>> > Sent: Fri 23-Nov-2012 03:08
>> > > > >> > >>> > To: [email protected]
>> > > > >> > >>> > Subject: Re: Indexing-time URL filtering again
>> > > > >> > >>> >
>> > > > >> > >>> > But Markus said it worked for him. I was really hoping
>> > > > >> > >>> > he could send his command line.
>> > > > >> > >>> >
>> > > > >> > >>> > On Thu, Nov 22, 2012 at 6:28 PM, Lewis John Mcgibbney <
>> > > > >> > >>> > [email protected]> wrote:
>> > > > >> > >>> >
>> > > > >> > >>> > > Is this a bug?
>> > > > >> > >>> > >
>> > > > >> > >>> > > On Thu, Nov 22, 2012 at 10:13 PM, Joe Zhang <[email protected]> wrote:
>> > > > >> > >>> > > > Putting -filter between crawldb and segments, I
>> > > > >> > >>> > > > still got the same thing:
>> > > > >> > >>> > > >
>> > > > >> > >>> > > > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
>> > > > >> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch
>> > > > >> > >>> > > > Input path does not exist:
>> > > > >> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse
>> > > > >> > >>> > > > Input path does not exist:
>> > > > >> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data
>> > > > >> > >>> > > > Input path does not exist:
>> > > > >> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text
>> > > > >> > >>> > > >
>> > > > >> > >>> > > > On Thu, Nov 22, 2012 at 3:11 PM, Markus Jelsma
>> > > > >> > >>> > > > <[email protected]>wrote:
>> > > > >> > >>> > > >
>> > > > >> > >>> > > >> These are roughly the available parameters:
>> > > > >> > >>> > > >>
>> > > > >> > >>> > > >> Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-hostdb <hostdb>]
>> > > > >> > >>> > > >>        [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit]
>> > > > >> > >>> > > >>        [-deleteGone] [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter]
>> > > > >> > >>> > > >>        [-filter] [-normalize]
>> > > > >> > >>> > > >>
>> > > > >> > >>> > > >> Having -filter at the end should work fine; it works
>> > > > >> > >>> > > >> here with -filter at the end. If, for some reason, it
>> > > > >> > >>> > > >> doesn't, put it before the segment and after the
>> > > > >> > >>> > > >> crawldb, and file an issue in Jira.
>> > > > >> > >>> > > >>
>> > > > >> > >>> > > >> Cheers
>> > > > >> > >>> > > >>
>> > > > >> > >>> > > >> -----Original message-----
>> > > > >> > >>> > > >> > From:Joe Zhang <[email protected]>
>> > > > >> > >>> > > >> > Sent: Thu 22-Nov-2012 23:05
>> > > > >> > >>> > > >> > To: Markus Jelsma <[email protected]>;
>> user <
>> > > > >> > >>> > > >> [email protected]>
>> > > > >> > >>> > > >> > Subject: Re: Indexing-time URL filtering again
>> > > > >> > >>> > > >> >
>> > > > >> > >>> > > >> > Yes, I forgot to do that. But still, what exactly
>> > > > >> > >>> > > >> > should the command look like?
>> > > > >> > >>> > > >> >
>> > > > >> > >>> > > >> > bin/nutch solrindex -Durlfilter.regex.file=....UrlFiltering.txt
>> > > > >> > >>> > > >> > http://localhost:8983/solr/ .../crawldb/ ..../segments/* -filter
>> > > > >> > >>> > > >> >
>> > > > >> > >>> > > >> > This command would cause nutch to interpret
>> > > > >> > >>> > > >> > "-filter" as a path.
>> > > > >> > >>> > > >> >
>> > > > >> > >>> > > >> > On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma <[email protected]> wrote:
>> > > > >> > >>> > > >> > Hi,
>> > > > >> > >>> > > >> >
>> > > > >> > >>> > > >> > I just tested a small index job that usually
>> > > > >> > >>> > > >> > writes 1200 records to Solr. It works fine if I
>> > > > >> > >>> > > >> > specify -. in a filter (index nothing) and point
>> > > > >> > >>> > > >> > to it with -Durlfilter.regex.file=path like you
>> > > > >> > >>> > > >> > do. I assume you mean by `it doesn't work` that it
>> > > > >> > >>> > > >> > filters nothing and indexes all records from the
>> > > > >> > >>> > > >> > segment. Did you forget the -filter parameter?
>> > > > >> > >>> > > >> >
>> > > > >> > >>> > > >> > Cheers
>> > > > >> > >>> > > >> >
>> > > > >> > >>> > > >> > -----Original message-----
>> > > > >> > >>> > > >> > > From:Joe Zhang <[email protected]>
>> > > > >> > >>> > > >> > > Sent: Thu 22-Nov-2012 07:29
>> > > > >> > >>> > > >> > > To: user <[email protected]>
>> > > > >> > >>> > > >> > > Subject: Indexing-time URL filtering again
>> > > > >> > >>> > > >> > >
>> > > > >> > >>> > > >> > > Dear List:
>> > > > >> > >>> > > >> > >
>> > > > >> > >>> > > >> > > I asked a similar question before, but I haven't
>> > > > >> > >>> > > >> > > solved the problem, so I will re-ask it more
>> > > > >> > >>> > > >> > > clearly and seek advice.
>> > > > >> > >>> > > >> > >
>> > > > >> > >>> > > >> > > I'm using nutch 1.5.1 and solr 3.6.1 together.
>> > > > >> > >>> > > >> > > Things work fine at the rudimentary level.
>> > > > >> > >>> > > >> > >
>> > > > >> > >>> > > >> > > The basic problem I face in crawling/indexing is
>> > > > >> > >>> > > >> > > that I need to control which pages the crawler
>> > > > >> > >>> > > >> > > should VISIT (so far through
>> > > > >> > >>> > > >> > > nutch/conf/regex-urlfilter.txt) and which pages
>> > > > >> > >>> > > >> > > are INDEXED by Solr. The latter are only a SUBSET
>> > > > >> > >>> > > >> > > of the former, and they are giving me a headache.
>> > > > >> > >>> > > >> > >
>> > > > >> > >>> > > >> > > A real-life example would be: when we crawl
>> > > > >> > >>> > > >> > > CNN.com, we only want to index "real content"
>> > > > >> > >>> > > >> > > pages such as
>> > > > >> > >>> > > >> > > http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1
>> > > > >> > >>> > > >> > > When we start the crawling from the root, we
>> > > > >> > >>> > > >> > > can't specify tight patterns (e.g.,
>> > > > >> > >>> > > >> > > +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*)
>> > > > >> > >>> > > >> > > in nutch/conf/regex-urlfilter.txt, because the
>> > > > >> > >>> > > >> > > pages on the path between the root and the
>> > > > >> > >>> > > >> > > content pages do not satisfy such patterns.
>> > > > >> > >>> > > >> > > Putting such patterns in
>> > > > >> > >>> > > >> > > nutch/conf/regex-urlfilter.txt would severely
>> > > > >> > >>> > > >> > > jeopardize the coverage of the crawl.
>> > > > >> > >>> > > >> > >
>> > > > >> > >>> > > >> > > The closest solution I've got so far (courtesy
>> > > > >> > >>> > > >> > > of Markus) was this:
>> > > > >> > >>> > > >> > >
>> > > > >> > >>> > > >> > > nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
>> > > > >> > >>> > > >> > >
>> > > > >> > >>> > > >> > > but unfortunately I haven't been able to make it
>> > > > >> > >>> > > >> > > work for me. The content of the
>> > > > >> > >>> > > >> > > urlfilter.regex.file is what I thought "correct"
>> > > > >> > >>> > > >> > > --- something like the following:
>> > > > >> > >>> > > >> > >
>> > > > >> > >>> > > >> > > +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
>> > > > >> > >>> > > >> > > -.
>> > > > >> > >>> > > >> > >
>> > > > >> > >>> > > >> > > Everything seems quite straightforward. Am I
>> > > > >> > >>> > > >> > > doing anything wrong here? Can anyone advise?
>> > > > >> > >>> > > >> > > I'd greatly appreciate it.
>> > > > >> > >>> > > >> > >
>> > > > >> > >>> > > >> > > Joe
>> > > > >> > >>> > > >> > >
>> > > > >> > >>> > > >> >
>> > > > >> > >>> > > >> >
>> > > > >> > >>> > > >>
>> > > > >> > >>> > >
>> > > > >> > >>> > >
>> > > > >> > >>> > >
>> > > > >> > >>> > > --
>> > > > >> > >>> > > Lewis
>> > > > >> > >>> > >
>> > > > >> > >>> >
>> > > > >> > >>>
>> > > > >> > >>
>> > > > >> > >>
>> > > > >> > >
>> > > > >> >
>> > > > >>
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
