Along the same line of discussion: if the indexing-time filter worked, the regex patterns in regex-urlfilter.txt would still take precedence (according to the regex tester Markus suggested above). So how can one turn off regex-urlfilter.txt at indexing time?
BTW, Markus, I did try to rebuild a clean 1.6; the NPE still exists.

On Mon, Nov 26, 2012 at 5:38 PM, Joe Zhang <[email protected]> wrote:
> When do you think we are going to see an official release of Nutch 1.6?

On Mon, Nov 26, 2012 at 2:49 PM, Markus Jelsma <[email protected]> wrote:
> Building from source with ant produces a local runtime in runtime/local; that's the same as what you get when you extract an official release.

On Mon 26-Nov-2012 22:23, Joe Zhang wrote:
> Yes, that's what I've been doing, but "ant" itself won't produce the official binary release.

On Mon, Nov 26, 2012 at 2:16 PM, Markus Jelsma wrote:
> Just ant will do the trick.

On Mon 26-Nov-2012 22:03, Joe Zhang wrote:
> Talking about ant: after ant clean, which ant target should I use?

On Mon, Nov 26, 2012 at 3:21 AM, Markus Jelsma wrote:
> I checked the code. You're probably not pointing it to a valid path, or perhaps the build is wrong and you haven't used ant clean before building Nutch. If you keep having trouble you may want to check out trunk.

On Mon 26-Nov-2012 00:40, Joe Zhang wrote:
> OK, I'm testing it. But like I said, even when I reduce the patterns to the simplest form "-.", the problem still persists.

On Sun, Nov 25, 2012 at 3:59 PM, Markus Jelsma wrote:
> It's taking input from stdin; enter some URLs to test it. You can add an issue with reproducible steps.

On Sun 25-Nov-2012 23:49, Joe Zhang wrote:
> I ran the regex tester command you provided. It seems to be taking forever (15+ minutes by now).

On Sun, Nov 25, 2012 at 3:28 PM, Joe Zhang wrote:
> You mean the content of my pattern file? Well, even when I reduce it to simply "-.", the same problem still pops up.

On Sun, Nov 25, 2012 at 3:30 PM, Markus Jelsma wrote:
> You seem to have an NPE caused by your regex rules, for some weird reason. If you can provide a way to reproduce, you can file an issue in Jira. This NPE should also occur if you run the regex tester:
>
> nutch -Durlfilter.regex.file=path org.apache.nutch.net.URLFilterChecker -allCombined
>
> In the meantime you can check if a rule causes the NPE.
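For readers following along: the regex URL filter rules being tested here use first-match-wins semantics, with `+` meaning accept and `-` meaning reject, so a file containing only "-." should reject every URL. A rough Python illustration of those semantics (not Nutch's actual code; the example.com rules are made up for illustration):

```python
import re

def filter_url(url, rules):
    # Nutch-style regex URL filter semantics (illustration only):
    # each rule is '+' or '-' followed by a regex; the first rule whose
    # regex matches the URL decides. URLs matching no rule are rejected.
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False

# A lone "-." rule rejects everything, since '.' matches any character.
print(filter_url("http://example.com/page", ["-."]))  # False

# A '+' rule placed before the catch-all '-.' lets matching URLs through.
rules = [r"+^http://example\.com/docs/", "-."]
print(filter_url("http://example.com/docs/intro.html", rules))  # True
print(filter_url("http://example.com/login", rules))            # False
```

This is also why the tester above reads URLs from stdin: each URL is run through the rule list in order and the first match decides its fate.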
On Sun 25-Nov-2012 23:26, Joe Zhang wrote:
> The last few lines of hadoop.log:
>
> 2012-11-25 16:30:30,021 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2012-11-25 16:30:30,026 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
> 2012-11-25 16:30:30,218 WARN mapred.LocalJobRunner - job_local_0001
> java.lang.RuntimeException: Error in configuring object
>     at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> Caused by: java.lang.reflect.InvocationTargetException
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:601)
>     at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
>     ... 5 more
> Caused by: java.lang.RuntimeException: Error in configuring object
>     at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>     at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
>     ... 10 more
> Caused by: java.lang.reflect.InvocationTargetException
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:601)
>     at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
>     ... 13 more
> Caused by: java.lang.NullPointerException
>     at java.io.Reader.<init>(Reader.java:78)
>     at java.io.BufferedReader.<init>(BufferedReader.java:94)
>     at java.io.BufferedReader.<init>(BufferedReader.java:109)
>     at org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:180)
>     at org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:156)
>     at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
>     at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:57)
>     at org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:95)
>     ... 18 more
> 2012-11-25 16:30:30,568 ERROR solr.SolrIndexer - java.io.IOException: Job failed!

On Sun, Nov 25, 2012 at 3:08 PM, Markus Jelsma wrote:
> You should provide the log output.

On Sun 25-Nov-2012 17:27, Joe Zhang wrote:
> I actually checked out the most recent build from SVN, Release 1.6 - 23/11/2012.
>
> The following command
>
> bin/nutch solrindex -Durlfilter.regex.file=.....UrlFiltering.txt http://localhost:8983/solr/ crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/* -filter
>
> produced the following output:
>
> SolrIndexer: starting at 2012-11-25 16:19:29
> SolrIndexer: deleting gone documents: false
> SolrIndexer: URL filtering: true
> SolrIndexer: URL normalizing: false
> java.io.IOException: Job failed!
>
> Can anybody help?
On Sun, Nov 25, 2012 at 6:43 AM, Joe Zhang wrote:
> How exactly do I get to trunk?
>
> I did download NUTCH-1300-1.5-1.patch, ran the patch command correctly, and re-built Nutch. But the problem still persists...

On Sun, Nov 25, 2012 at 3:29 AM, Markus Jelsma wrote:
> No, this is no bug. As I said, you need either to patch your Nutch or get the sources from trunk. The -filter parameter is not in your version. Check the patch manual if you don't know how it works.
>
> $ cd trunk ; patch -p0 < file.patch

On Sun 25-Nov-2012 08:42, Joe Zhang wrote:
> This does seem a bug. Can anybody help?

On Sat, Nov 24, 2012 at 6:40 PM, Joe Zhang wrote:
> Markus, could you advise? Thanks a lot!

On Sat, Nov 24, 2012 at 12:49 AM, Joe Zhang wrote:
> I followed your instruction and applied the patch, Markus, but the problem still persists --- "-filter" is interpreted as a path by solrindex.

On Fri, Nov 23, 2012 at 12:39 AM, Markus Jelsma wrote:
> Ah, I get it now. Please use trunk or patch your version with https://issues.apache.org/jira/browse/NUTCH-1300 to enable filtering.

On Fri 23-Nov-2012 03:08, Joe Zhang wrote:
> But Markus said it worked for him. I was really hoping he could send his command line.

On Thu, Nov 22, 2012 at 6:28 PM, Lewis John Mcgibbney wrote:
> Is this a bug?

On Thu, Nov 22, 2012 at 10:13 PM, Joe Zhang wrote:
> Putting -filter between crawldb and segments, I still got the same thing:
>
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch
> Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse
> Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data
> Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text
On Thu, Nov 22, 2012 at 3:11 PM, Markus Jelsma wrote:
> These are roughly the available parameters:
>
> Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-hostdb <hostdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter] [-filter] [-normalize]
>
> Having -filter at the end should work fine. If, for some reason, it doesn't, put it before the segment and after the crawldb and file an issue in Jira; it works here if I have -filter at the end.
>
> Cheers

On Thu 22-Nov-2012 23:05, Joe Zhang wrote:
> Yes, I forgot to do that. But still, what exactly should the command look like?
>
> bin/nutch solrindex -Durlfilter.regex.file=....UrlFiltering.txt http://localhost:8983/solr/ .../crawldb/ ..../segments/* -filter
>
> This command would cause nutch to interpret "-filter" as a path.
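The earlier InvalidInputException (Input path does not exist: .../-filter/crawl_fetch) is consistent with this: a SolrIndexer that does not know the -filter flag simply treats every trailing argument as a segment directory, so the literal token -filter becomes a path. A minimal Python sketch of the pitfall (illustrative argument handling using a subset of the flags from the usage message above, not Nutch's actual parser):

```python
def parse_args_unpatched(args):
    # Pre-NUTCH-1300 style: everything after <solr url> <crawldb>
    # is assumed to be a segment directory.
    solr_url, crawldb, *segments = args
    return solr_url, crawldb, segments, False

def parse_args_patched(args):
    # NUTCH-1300 style: known flags are consumed; the rest are segments.
    flags = {"-filter", "-normalize", "-noCommit", "-deleteGone"}
    seen = {a for a in args if a in flags}
    solr_url, crawldb, *segments = [a for a in args if a not in flags]
    return solr_url, crawldb, segments, "-filter" in seen

args = ["http://localhost:8983/solr/", "crawl/crawldb",
        "crawl/segments/20121125", "-filter"]
print(parse_args_unpatched(args)[2])  # ['crawl/segments/20121125', '-filter']
print(parse_args_patched(args)[2])    # ['crawl/segments/20121125']
```

In the unpatched case the bogus "segment" named -filter is what Hadoop later fails to open as crawl_fetch/crawl_parse/etc.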
On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma wrote:
> Hi,
>
> I just tested a small index job that usually writes 1200 records to Solr. It works fine if I specify -. in a filter (index nothing) and point to it with -Durlfilter.regex.file=path like you do. I assume you mean by `it doesn't work` that it filters nothing and indexes all records from the segment. Did you forget the -filter parameter?
>
> Cheers

On Thu 22-Nov-2012 07:29, Joe Zhang wrote (Subject: Indexing-time URL filtering again):
> Dear List:
>
> I asked a similar question before, but I haven't solved the problem. Therefore I try to re-ask the question more clearly and seek advice.
>
> I'm using Nutch 1.5.1 and Solr 3.6.1 together. Things work fine at the rudimentary level.
>
> The basic problem I face in crawling/indexing is that I need to control which pages the crawlers should VISIT (so far through nutch/conf/regex-urlfilter.txt) and which pages are INDEXED by Solr. The latter are only a SUBSET of the former, and they are giving me a headache.
>
> A real-life example would be: when we crawl CNN.com, we only want to index "real content" pages such as http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1. When we start the crawling from the root, we can't specify tight patterns (e.g., +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*) in nutch/conf/regex-urlfilter.txt, because the pages on the path between the root and the content pages do not satisfy such patterns. Putting such patterns in nutch/conf/regex-urlfilter.txt would severely jeopardize the coverage of the crawl.
>
> The closest solution I've got so far (courtesy of Markus) was this:
>
> nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
>
> but unfortunately I haven't been able to make it work for me. The content of the urlfilter.regex.file is what I thought "correct" --- something like the following:
>
> +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
> -.
>
> Everything seems quite straightforward. Am I doing anything wrong here? Can anyone advise? I'd greatly appreciate it.
>
> Joe
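The coverage problem described in the original question is easy to see by testing the tight pattern directly: it matches dated article URLs but not the hub pages a crawler must traverse to reach them. A quick Python check (pattern taken from the message, with the dots in cnn.com escaped; the hub URLs are hypothetical examples):

```python
import re

# The indexing-time pattern from the thread: dated CNN article URLs only.
ARTICLE_PATTERN = re.compile(
    r"^http://([a-z0-9]*\.)*cnn\.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*")

urls = [
    "http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1",
    "http://www.cnn.com/",     # root page: crawl must pass through it
    "http://www.cnn.com/US/",  # section hub: also not date-shaped
]
for url in urls:
    print(url, "->", bool(ARTICLE_PATTERN.search(url)))
# Only the first (dated article) URL matches.
```

This is exactly why the pattern belongs in an indexing-time filter file rather than in regex-urlfilter.txt: at crawl time it would reject the root and section pages and stop the crawl from ever reaching the articles.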

