When do you think we are going to see an official release of Nutch 1.6?

On Mon, Nov 26, 2012 at 2:49 PM, Markus Jelsma <[email protected]> wrote:
Building from source with ant produces a local runtime in runtime/local; that's the same as what you get when you extract an official release.

-----Original message-----
From: Joe Zhang <[email protected]>
Sent: Mon 26-Nov-2012 22:23
To: [email protected]
Subject: Re: Indexing-time URL filtering again

Yes, that's what I've been doing, but "ant" itself won't produce the official binary release.

On Mon, Nov 26, 2012 at 2:16 PM, Markus Jelsma <[email protected]> wrote:

Just ant will do the trick.

-----Original message-----
From: Joe Zhang <[email protected]>
Sent: Mon 26-Nov-2012 22:03
To: [email protected]
Subject: Re: Indexing-time URL filtering again

Talking about ant: after ant clean, which ant target should I use?

On Mon, Nov 26, 2012 at 3:21 AM, Markus Jelsma <[email protected]> wrote:

I checked the code. You're probably not pointing it to a valid path, or perhaps the build is wrong and you haven't used ant clean before building Nutch. If you keep having trouble you may want to check out trunk.

-----Original message-----
From: Joe Zhang <[email protected]>
Sent: Mon 26-Nov-2012 00:40
To: [email protected]
Subject: Re: Indexing-time URL filtering again

OK, I'm testing it. But like I said, even when I reduce the patterns to the simplest form "-.", the problem still persists.

On Sun, Nov 25, 2012 at 3:59 PM, Markus Jelsma <[email protected]> wrote:

It's taking input from stdin; enter some URLs to test it. You can add an issue with reproducible steps.
-----Original message-----
From: Joe Zhang <[email protected]>
Sent: Sun 25-Nov-2012 23:49
To: [email protected]
Subject: Re: Indexing-time URL filtering again

I ran the regex tester command you provided. It seems to be taking forever (15 min+ by now).

On Sun, Nov 25, 2012 at 3:28 PM, Joe Zhang <[email protected]> wrote:

You mean the content of my pattern file? Well, even when I reduce it to simply "-.", the same problem still pops up.

On Sun, Nov 25, 2012 at 3:30 PM, Markus Jelsma <[email protected]> wrote:

You seem to have an NPE caused by your regex rules, for some weird reason. If you can provide a way to reproduce it you can file an issue in Jira. This NPE should also occur if you run the regex tester:

nutch -Durlfilter.regex.file=path org.apache.nutch.net.URLFilterChecker -allCombined

In the meantime you can check if a rule causes the NPE.
-----Original message-----
From: Joe Zhang <[email protected]>
Sent: Sun 25-Nov-2012 23:26
To: [email protected]
Subject: Re: Indexing-time URL filtering again

The last few lines of hadoop.log:

2012-11-25 16:30:30,021 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2012-11-25 16:30:30,026 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2012-11-25 16:30:30,218 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.RuntimeException: Error in configuring object
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
        ... 5 more
Caused by: java.lang.RuntimeException: Error in configuring object
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
        ... 10 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
        ... 13 more
Caused by: java.lang.NullPointerException
        at java.io.Reader.<init>(Reader.java:78)
        at java.io.BufferedReader.<init>(BufferedReader.java:94)
        at java.io.BufferedReader.<init>(BufferedReader.java:109)
        at org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:180)
        at org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:156)
        at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
        at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:57)
        at org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:95)
        ...
        ... 18 more
2012-11-25 16:30:30,568 ERROR solr.SolrIndexer - java.io.IOException: Job failed!

On Sun, Nov 25, 2012 at 3:08 PM, Markus Jelsma <[email protected]> wrote:

You should provide the log output.

-----Original message-----
From: Joe Zhang <[email protected]>
Sent: Sun 25-Nov-2012 17:27
To: [email protected]
Subject: Re: Indexing-time URL filtering again

I actually checked out the most recent build from SVN, Release 1.6 - 23/11/2012.

The following command

bin/nutch solrindex -Durlfilter.regex.file=.....UrlFiltering.txt http://localhost:8983/solr/ crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/* -filter

produced the following output:

SolrIndexer: starting at 2012-11-25 16:19:29
SolrIndexer: deleting gone documents: false
SolrIndexer: URL filtering: true
SolrIndexer: URL normalizing: false
java.io.IOException: Job failed!

Can anybody help?

On Sun, Nov 25, 2012 at 6:43 AM, Joe Zhang <[email protected]> wrote:

How exactly do I get to trunk?
I did download NUTCH-1300-1.5-1.patch, ran the patch command correctly, and rebuilt Nutch. But the problem still persists...

On Sun, Nov 25, 2012 at 3:29 AM, Markus Jelsma <[email protected]> wrote:

No, this is no bug. As I said, you need either to patch your Nutch or get the sources from trunk. The -filter parameter is not in your version. Check the patch manual if you don't know how it works.

$ cd trunk ; patch -p0 < file.patch

-----Original message-----
From: Joe Zhang <[email protected]>
Sent: Sun 25-Nov-2012 08:42
To: Markus Jelsma <[email protected]>; user <[email protected]>
Subject: Re: Indexing-time URL filtering again

This does seem a bug. Can anybody help?

On Sat, Nov 24, 2012 at 6:40 PM, Joe Zhang <[email protected]> wrote:

Markus, could you advise? Thanks a lot!
On Sat, Nov 24, 2012 at 12:49 AM, Joe Zhang <[email protected]> wrote:

I followed your instruction and applied the patch, Markus, but the problem still persists --- "-filter" is interpreted as a path by solrindex.

On Fri, Nov 23, 2012 at 12:39 AM, Markus Jelsma <[email protected]> wrote:

Ah, I get it now. Please use trunk or patch your version with https://issues.apache.org/jira/browse/NUTCH-1300 to enable filtering.

-----Original message-----
From: Joe Zhang <[email protected]>
Sent: Fri 23-Nov-2012 03:08
To: [email protected]
Subject: Re: Indexing-time URL filtering again

But Markus said it worked for him. I was really hoping he could send his command line.
On Thu, Nov 22, 2012 at 6:28 PM, Lewis John Mcgibbney <[email protected]> wrote:

Is this a bug?

--
Lewis

On Thu, Nov 22, 2012 at 10:13 PM, Joe Zhang <[email protected]> wrote:

Putting -filter between crawldb and segments, I still got the same thing:

org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch
Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse
Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data
Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text

On Thu, Nov 22, 2012 at 3:11 PM, Markus Jelsma <[email protected]> wrote:
These are roughly the available parameters:

Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-hostdb <hostdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter] [-filter] [-normalize]

Having -filter at the end should work fine. If, for some reason, it doesn't, put it before the segment and after the crawldb and file an issue in Jira; it works here with -filter at the end.
Cheers

-----Original message-----
From: Joe Zhang <[email protected]>
Sent: Thu 22-Nov-2012 23:05
To: Markus Jelsma <[email protected]>; user <[email protected]>
Subject: Re: Indexing-time URL filtering again

Yes, I forgot to do that. But still, what exactly should the command look like?

bin/nutch solrindex -Durlfilter.regex.file=....UrlFiltering.txt http://localhost:8983/solr/ .../crawldb/ ..../segments/* -filter

This command would cause nutch to interpret "-filter" as a path.
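That symptom, and the earlier InvalidInputException on `.../-filter/crawl_fetch`, are exactly what you get when a command-line tool collects every argument it doesn't recognize as a positional path. A hypothetical toy sketch in Python (this is not Nutch's actual parser, just an illustration of the failure mode):

```python
def parse_args(args, known_flags):
    """Toy sketch of why an unrecognized '-filter' ends up as a path.

    Assumption, not Nutch source: flags the tool knows are consumed as
    options; everything else is collected as a positional (path) argument.
    """
    flags, paths = set(), []
    for arg in args:
        if arg in known_flags:
            flags.add(arg)
        else:
            paths.append(arg)  # an unknown '-filter' lands here
    return flags, paths

# A Nutch 1.5.1 without the NUTCH-1300 patch does not know '-filter',
# so it is swallowed as a segment directory, producing
# "Input path does not exist: .../-filter/crawl_fetch".
old_flags = {"-linkdb", "-dir", "-noCommit"}
flags, paths = parse_args(
    ["http://localhost:8983/solr/", "crawl/crawldb",
     "crawl/segments/20121125", "-filter"],
    old_flags,
)
```

Once the patched version registers `-filter` as a known flag, the same argument list parses as an option instead of a path, which is why patching (or moving to trunk) fixes Joe's command.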
On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma <[email protected]> wrote:

Hi,

I just tested a small index job that usually writes 1200 records to Solr. It works fine if I specify -. in a filter (index nothing) and point to it with -Durlfilter.regex.file=path like you do. I assume you mean by `it doesn't work` that it filters nothing and indexes all records from the segment. Did you forget the -filter parameter?
Cheers

-----Original message-----
From: Joe Zhang <[email protected]>
Sent: Thu 22-Nov-2012 07:29
To: user <[email protected]>
Subject: Indexing-time URL filtering again

Dear List:

I asked a similar question before, but I haven't solved the problem, so let me re-ask the question more clearly and seek advice.

I'm using Nutch 1.5.1 and Solr 3.6.1 together. Things work fine at the rudimentary level.
The basic problem I face in crawling/indexing is that I need to control which pages the crawlers should VISIT (so far through nutch/conf/regex-urlfilter.txt) and which pages are INDEXED by Solr. The latter are only a SUBSET of the former, and they are giving me a headache.

A real-life example would be: when we crawl CNN.com, we only want to index "real content" pages such as
http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1
When we start the crawling from the root, we can't specify tight patterns (e.g., +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*) in nutch/conf/regex-urlfilter.txt, because the pages on the path between root and content pages do not satisfy such patterns. Putting such patterns in nutch/conf/regex-urlfilter.txt would severely jeopardize the coverage of the crawl.
The closest solution I've got so far (courtesy of Markus) was this:

nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...

but unfortunately I haven't been able to make it work for me. The content of the urlfilter.regex.file is what I thought "correct" --- something like the following:

+^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
-.
Everything seems quite straightforward. Am I doing anything wrong here? Can anyone advise? I'd greatly appreciate it.

Joe
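The two-rule file in Joe's original message relies on first-matching-rule-wins semantics: a URL hitting the `+` pattern is indexed, everything else falls through to `-.` and is dropped. A small Python sketch of those semantics (the patterns are copied verbatim from the thread, including the unescaped dot in `cnn.com`; note Nutch evaluates them with Java's regex engine, so this Python `re` version is only an approximation):

```python
import re

# Rules in file order, as in the urlfilter.regex.file from the thread.
RULES = [
    ("+", re.compile(r"^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*")),
    ("-", re.compile(r".")),  # '-.' rejects everything not matched above
]

def accepts(url):
    """First matching rule wins; '+' keeps the URL, '-' drops it."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: filtered out
```

With this file passed via -Durlfilter.regex.file and the -filter flag active at index time, the dated article URL from the example is kept while the CNN front page and intermediate navigation pages are filtered, without touching the crawl-time regex-urlfilter.txt.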

