Along the same line of discussion: if the indexing-time filter worked, the regex patterns in regex-urlfilter.txt would still take precedence (according to the regex tester Markus suggested above). So how can one turn off regex-urlfilter.txt at indexing time?
BTW, Markus, I did try to rebuild a clean 1.6; the NPE still exists.

On Mon, Nov 26, 2012 at 5:38 PM, Joe Zhang <[email protected]> wrote:
> When do you think we are going to see an official release of Nutch 1.6?

On Mon, Nov 26, 2012 at 2:49 PM, Markus Jelsma <[email protected]> wrote:
> Building from source with ant produces a local runtime in runtime/local; that's the same as what you get when you extract an official release.

On Mon 26-Nov-2012 22:23, Joe Zhang wrote:
> Yes, that's what I've been doing, but "ant" itself won't produce the official binary release.

On Mon, Nov 26, 2012 at 2:16 PM, Markus Jelsma wrote:
> Just ant will do the trick.

On Mon 26-Nov-2012 22:03, Joe Zhang wrote:
> Talking about ant: after ant clean, which ant target should I use?

On Mon, Nov 26, 2012 at 3:21 AM, Markus Jelsma wrote:
> I checked the code. You're probably not pointing it to a valid path, or perhaps the build is wrong and you haven't used ant clean before building Nutch. If you keep having trouble you may want to check out trunk.

On Mon 26-Nov-2012 00:40, Joe Zhang wrote:
> OK, I'm testing it. But like I said, even when I reduce the patterns to the simplest form "-.", the problem still persists.

On Sun, Nov 25, 2012 at 3:59 PM, Markus Jelsma wrote:
> It's taking input from stdin; enter some URLs to test it. You can add an issue with reproducible steps.

On Sun 25-Nov-2012 23:49, Joe Zhang wrote:
> I ran the regex tester command you provided. It seems to be taking forever (15+ minutes by now).

On Sun, Nov 25, 2012 at 3:28 PM, Joe Zhang wrote:
> You mean the content of my pattern file? Well, even when I reduce it to simply "-.", the same problem still pops up.

On Sun, Nov 25, 2012 at 3:30 PM, Markus Jelsma wrote:
> You seem to have an NPE caused by your regex rules, for some weird reason. If you can provide a way to reproduce, you can file an issue in Jira. This NPE should also occur if you run the regex tester:
>
> nutch -Durlfilter.regex.file=path org.apache.nutch.net.URLFilterChecker -allCombined
>
> In the meantime you can check if a rule causes the NPE.
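For readers following along: the regex URL filter rules being tested here use first-match-wins semantics, with `+` meaning accept and `-` meaning reject, so a file containing only "-." should reject every URL. A rough Python illustration of those semantics (not Nutch's actual code; the example.com rules are made up for illustration):

```python
import re

def filter_url(url, rules):
    # Nutch-style regex URL filter semantics (illustration only):
    # each rule is '+' or '-' followed by a regex; the first rule whose
    # regex matches the URL decides. URLs matching no rule are rejected.
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False

# A lone "-." rule rejects everything, since '.' matches any character.
print(filter_url("http://example.com/page", ["-."]))  # False

# A '+' rule placed before the catch-all '-.' lets matching URLs through.
rules = [r"+^http://example\.com/docs/", "-."]
print(filter_url("http://example.com/docs/intro.html", rules))  # True
print(filter_url("http://example.com/login", rules))            # False
```

This is also why the tester above reads URLs from stdin: each URL is run through the rule list in order and the first match decides its fate.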
On Sun 25-Nov-2012 23:26, Joe Zhang wrote:
> The last few lines of hadoop.log:
>
> 2012-11-25 16:30:30,021 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2012-11-25 16:30:30,026 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
> 2012-11-25 16:30:30,218 WARN mapred.LocalJobRunner - job_local_0001
> java.lang.RuntimeException: Error in configuring object
>     at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> Caused by: java.lang.reflect.InvocationTargetException
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:601)
>     at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
>     ... 5 more
> Caused by: java.lang.RuntimeException: Error in configuring object
>     at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>     at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
>     ... 10 more
> Caused by: java.lang.reflect.InvocationTargetException
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:601)
>     at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
>     ... 13 more
> Caused by: java.lang.NullPointerException
>     at java.io.Reader.<init>(Reader.java:78)
>     at java.io.BufferedReader.<init>(BufferedReader.java:94)
>     at java.io.BufferedReader.<init>(BufferedReader.java:109)
>     at org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:180)
>     at org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:156)
>     at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
>     at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:57)
>     at org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:95)
>     ... 18 more
> 2012-11-25 16:30:30,568 ERROR solr.SolrIndexer - java.io.IOException: Job failed!

On Sun, Nov 25, 2012 at 3:08 PM, Markus Jelsma wrote:
> You should provide the log output.

On Sun 25-Nov-2012 17:27, Joe Zhang wrote:
> I actually checked out the most recent build from SVN, Release 1.6 - 23/11/2012.
>
> The following command
>
> bin/nutch solrindex -Durlfilter.regex.file=.....UrlFiltering.txt http://localhost:8983/solr/ crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/* -filter
>
> produced the following output:
>
> SolrIndexer: starting at 2012-11-25 16:19:29
> SolrIndexer: deleting gone documents: false
> SolrIndexer: URL filtering: true
> SolrIndexer: URL normalizing: false
> java.io.IOException: Job failed!
>
> Can anybody help?
On Sun, Nov 25, 2012 at 6:43 AM, Joe Zhang wrote:
> How exactly do I get to trunk?
>
> I did download NUTCH-1300-1.5-1.patch, ran the patch command correctly, and re-built Nutch. But the problem still persists...

On Sun, Nov 25, 2012 at 3:29 AM, Markus Jelsma wrote:
> No, this is no bug. As I said, you need either to patch your Nutch or get the sources from trunk. The -filter parameter is not in your version. Check the patch manual if you don't know how it works.
>
> $ cd trunk ; patch -p0 < file.patch

On Sun 25-Nov-2012 08:42, Joe Zhang wrote:
> This does seem a bug. Can anybody help?

On Sat, Nov 24, 2012 at 6:40 PM, Joe Zhang wrote:
> Markus, could you advise? Thanks a lot!

On Sat, Nov 24, 2012 at 12:49 AM, Joe Zhang wrote:
> I followed your instruction and applied the patch, Markus, but the problem still persists --- "-filter" is interpreted as a path by solrindex.

On Fri, Nov 23, 2012 at 12:39 AM, Markus Jelsma wrote:
> Ah, I get it now. Please use trunk or patch your version with https://issues.apache.org/jira/browse/NUTCH-1300 to enable filtering.

On Fri 23-Nov-2012 03:08, Joe Zhang wrote:
> But Markus said it worked for him. I was really hoping he could send his command line.

On Thu, Nov 22, 2012 at 6:28 PM, Lewis John Mcgibbney wrote:
> Is this a bug?

On Thu, Nov 22, 2012 at 10:13 PM, Joe Zhang wrote:
> Putting -filter between crawldb and segments, I still got the same thing:
>
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch
> Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse
> Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data
> Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text
On Thu, Nov 22, 2012 at 3:11 PM, Markus Jelsma wrote:
> These are roughly the available parameters:
>
> Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-hostdb <hostdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter] [-filter] [-normalize]
>
> Having -filter at the end should work fine. If, for some reason, it doesn't, put it before the segment and after the crawldb and file an issue in Jira; it works here if I have -filter at the end.
>
> Cheers

On Thu 22-Nov-2012 23:05, Joe Zhang wrote:
> Yes, I forgot to do that. But still, what exactly should the command look like?
>
> bin/nutch solrindex -Durlfilter.regex.file=....UrlFiltering.txt http://localhost:8983/solr/ .../crawldb/ ..../segments/* -filter
>
> This command would cause nutch to interpret "-filter" as a path.
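The earlier InvalidInputException (Input path does not exist: .../-filter/crawl_fetch) is consistent with this: a SolrIndexer that does not know the -filter flag simply treats every trailing argument as a segment directory, so the literal token -filter becomes a path. A minimal Python sketch of the pitfall (illustrative argument handling using a subset of the flags from the usage message above, not Nutch's actual parser):

```python
def parse_args_unpatched(args):
    # Pre-NUTCH-1300 style: everything after <solr url> <crawldb>
    # is assumed to be a segment directory.
    solr_url, crawldb, *segments = args
    return solr_url, crawldb, segments, False

def parse_args_patched(args):
    # NUTCH-1300 style: known flags are consumed; the rest are segments.
    flags = {"-filter", "-normalize", "-noCommit", "-deleteGone"}
    seen = {a for a in args if a in flags}
    solr_url, crawldb, *segments = [a for a in args if a not in flags]
    return solr_url, crawldb, segments, "-filter" in seen

args = ["http://localhost:8983/solr/", "crawl/crawldb",
        "crawl/segments/20121125", "-filter"]
print(parse_args_unpatched(args)[2])  # ['crawl/segments/20121125', '-filter']
print(parse_args_patched(args)[2])    # ['crawl/segments/20121125']
```

In the unpatched case the bogus "segment" named -filter is what Hadoop later fails to open as crawl_fetch/crawl_parse/etc.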
On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma wrote:
> Hi,
>
> I just tested a small index job that usually writes 1200 records to Solr. It works fine if I specify -. in a filter (index nothing) and point to it with -Durlfilter.regex.file=path like you do. I assume you mean by `it doesn't work` that it filters nothing and indexes all records from the segment. Did you forget the -filter parameter?
>
> Cheers

On Thu 22-Nov-2012 07:29, Joe Zhang wrote (Subject: Indexing-time URL filtering again):
> Dear List:
>
> I asked a similar question before, but I haven't solved the problem. Therefore I try to re-ask the question more clearly and seek advice.
>
> I'm using Nutch 1.5.1 and Solr 3.6.1 together. Things work fine at the rudimentary level.
>
> The basic problem I face in crawling/indexing is that I need to control which pages the crawlers should VISIT (so far through nutch/conf/regex-urlfilter.txt) and which pages are INDEXED by Solr. The latter are only a SUBSET of the former, and they are giving me a headache.
>
> A real-life example would be: when we crawl CNN.com, we only want to index "real content" pages such as http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1. When we start the crawling from the root, we can't specify tight patterns (e.g., +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*) in nutch/conf/regex-urlfilter.txt, because the pages on the path between the root and the content pages do not satisfy such patterns. Putting such patterns in nutch/conf/regex-urlfilter.txt would severely jeopardize the coverage of the crawl.
>
> The closest solution I've got so far (courtesy of Markus) was this:
>
> nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
>
> but unfortunately I haven't been able to make it work for me. The content of the urlfilter.regex.file is what I thought "correct" --- something like the following:
>
> +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
> -.
>
> Everything seems quite straightforward. Am I doing anything wrong here? Can anyone advise? I'd greatly appreciate it.
>
> Joe
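The coverage problem described in the original question is easy to see by testing the tight pattern directly: it matches dated article URLs but not the hub pages a crawler must traverse to reach them. A quick Python check (pattern taken from the message, with the dots in cnn.com escaped; the hub URLs are hypothetical examples):

```python
import re

# The indexing-time pattern from the thread: dated CNN article URLs only.
ARTICLE_PATTERN = re.compile(
    r"^http://([a-z0-9]*\.)*cnn\.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*")

urls = [
    "http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1",
    "http://www.cnn.com/",     # root page: crawl must pass through it
    "http://www.cnn.com/US/",  # section hub: also not date-shaped
]
for url in urls:
    print(url, "->", bool(ARTICLE_PATTERN.search(url)))
# Only the first (dated article) URL matches.
```

This is exactly why the pattern belongs in an indexing-time filter file rather than in regex-urlfilter.txt: at crawl time it would reject the root and section pages and stop the crawl from ever reaching the articles.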

