Hi Lewis, hi Sherban,

I have to turn my vote into a

-1

The crawl (if run from bin/crawl) isn't working because the generator ignores the batch id passed via the option -batchId, see https://issues.apache.org/jira/browse/NUTCH-2143. Thanks, Sherban, for being insistent! The logs you sent point to the same problem:

> Generating a new fetchlist
> .../bin/nutch generate ... -batchId 1443749246-29495
> ...
> GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
> Fetching :
> .../bin/nutch fetch ... 1443749246-29495 ...
> ...
> FetcherJob: batchId: 1443749246-29495

If you use the batch id logged by the Generator (1443749246-1282586680) for the steps "fetch", "parse", and "updatedb", the crawl should step forward.
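As a workaround you could run the individual steps by hand and reuse the id printed by the Generator, roughly like this (an untested sketch only, using the crawl id from your log and leaving out the -D Hadoop options that bin/crawl adds):

  # run from runtime/local; replace <batchId> with the id printed by GeneratorJob,
  # e.g. "GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs"
  bin/nutch generate -topN 50000 -adddays 0 -crawlId methods
  bin/nutch fetch <batchId> -crawlId methods -threads 50
  bin/nutch parse <batchId> -crawlId methods
  bin/nutch updatedb <batchId> -crawlId methods

That's essentially what bin/crawl runs, except that it passes its own pre-generated batch id to the follow-up steps, and that id is never used by the generator.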
Of course, this is no option for a released 2.3.1! We have to fix this bug. :)

Thanks,
Sebastian

On 10/05/2015 07:53 PM, Drulea, Sherban wrote:
> Hi Sebastian,
>
> I tried multiple URLs in my seed.txt file. None of them result in the nutch generator crawling any links.
>
> Here’s my environment:
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> SOLR 4.6.0
> Mongo version 3.0.2.
> Nutch 2.3.1
>
> ―――――――――――――――
> regex-urlfilter.txt:
> ―――――――――――――――
> +.
>
> ―――――――――――――――
> nutch-site.xml
> ―――――――――――――――
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>
> <property>
>   <name>http.agent.name</name>
>   <value>nutch Mongo Solr Crawler</value>
> </property>
>
> <property>
>   <name>storage.data.store.class</name>
>   <value>org.apache.gora.mongodb.store.MongoStore</value>
>   <description>Default class for storing data</description>
> </property>
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
>   <description>Regular expression naming plugin directory names to include.</description>
> </property>
>
> </configuration>
>
> ―――――――――――――――
> gora.properties:
> ―――――――――――――――
> ############################
> # MongoDBStore properties  #
> ############################
> gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
> gora.mongodb.override_hadoop_configuration=false
> gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
> gora.mongodb.servers=localhost:27017
> gora.mongodb.db=method_centers
>
> ―――――――――――――――
> seed.txt
> ―――――――――――――――
> http://punklawyer.com
> http://mail-archives.apache.org/mod_mbox/nutch-user/
> http://hbase.apache.org/index.html
> http://wiki.apache.org/nutch/FrontPage
> http://www.aintitcool.com/
> ―――――――――――――――
>
> Here are the results of the crawl command "./bin/crawl urls methods http://127.0.0.1:8983/solr/ 2":
> Injecting seed URLs
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch inject urls -crawlId methods
> InjectorJob: starting at 2015-10-01 18:27:23
> InjectorJob: Injecting urlDir: urls
> InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
> InjectorJob: total number of urls rejected by filters: 0
> InjectorJob: total number of urls injected after normalization and filtering: 5
> Injector: finished at 2015-10-01 18:27:26, elapsed: 00:00:02
> Thu Oct 1 18:27:26 PDT 2015 : Iteration 1 of 2
> Generating batchId
> Generating a new fetchlist
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId methods -batchId 1443749246-29495
> GeneratorJob: starting at 2015-10-01 18:27:26
> GeneratorJob: Selecting best-scoring urls due for fetch.
> GeneratorJob: starting
> GeneratorJob: filtering: false
> GeneratorJob: normalizing: false
> GeneratorJob: topN: 50000
> GeneratorJob: finished at 2015-10-01 18:27:29, time elapsed: 00:00:02
> GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
> Fetching :
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D fetcher.timelimit.mins=180 1443749246-29495 -crawlId methods -threads 50
> FetcherJob: starting at 2015-10-01 18:27:29
> FetcherJob: batchId: 1443749246-29495
> FetcherJob: threads: 50
> FetcherJob: parsing: false
> FetcherJob: resuming: false
> FetcherJob : timelimit set for : 1443760049865
> Using queue mode : byHost
> Fetcher: threads: 50
> QueueFeeder finished: total 0 records. Hit by time limit :0
> -finishing thread FetcherThread0, activeThreads=0
> ...
> -finishing thread FetcherThread49, activeThreads=0
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold sequence: 5
> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
> -activeThreads=0
> Using queue mode : byHost
> Fetcher: threads: 50
> QueueFeeder finished: total 0 records. Hit by time limit :0
> -finishing thread FetcherThread0, activeThreads=0
> ...
>
> -finishing thread FetcherThread48, activeThreads=0
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold sequence: 5
> -finishing thread FetcherThread49, activeThreads=0
> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
> -activeThreads=0
> FetcherJob: finished at 2015-10-01 18:27:42, time elapsed: 00:00:12
> Parsing :
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1 1443749246-29495 -crawlId methods
> ParserJob: starting at 2015-10-01 18:27:43
> ParserJob: resuming: false
> ParserJob: forced reparse: false
> ParserJob: batchId: 1443749246-29495
> ParserJob: success
> ParserJob: finished at 2015-10-01 18:27:45, time elapsed: 00:00:02
> CrawlDB update for methods
>
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1443749246-29495 -crawlId methods
> DbUpdaterJob: starting at 2015-10-01 18:27:46
> DbUpdaterJob: batchId: 1443749246-29495
> DbUpdaterJob: finished at 2015-10-01 18:27:48, time elapsed: 00:00:02
> Indexing methods on SOLR index -> http://127.0.0.1:8983/solr/
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://127.0.0.1:8983/solr/ -all -crawlId methods
> IndexingJob: starting
> Active IndexWriters :
> SOLRIndexWriter
>   solr.server.url : URL of the SOLR instance (mandatory)
>   solr.commit.size : buffer size when sending to SOLR (default 1000)
>   solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>   solr.auth : use authentication (default false)
>   solr.auth.username : username for authentication
>   solr.auth.password : password for authentication
>
>
> IndexingJob: done.
> SOLR dedup -> http://127.0.0.1:8983/solr/
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true http://127.0.0.1:8983/solr/
> Thu Oct 1 18:27:54 PDT 2015 : Iteration 2 of 2
> Generating batchId
> Generating a new fetchlist
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId methods -batchId 1443749274-17203
> GeneratorJob: starting at 2015-10-01 18:27:55
> GeneratorJob: Selecting best-scoring urls due for fetch.
> GeneratorJob: starting
> GeneratorJob: filtering: false
> GeneratorJob: normalizing: false
> GeneratorJob: topN: 50000
> GeneratorJob: finished at 2015-10-01 18:27:57, time elapsed: 00:00:02
> GeneratorJob: generated batch id: 1443749275-2050785747 containing 0 URLs
> Generate returned 1 (no new segments created)
> Escaping loop: no more URLs to fetch now
>
> There are no errors but also no data. What else can I debug?
>
> I see some warnings in my hadoop.log but nothing glaring...
>
> 2015-10-01 18:19:29,430 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2015-10-01 18:19:29,441 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2015-10-01 18:19:29,441 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2015-10-01 18:19:29,442 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2015-10-01 18:19:30,326 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
> 2015-10-01 18:19:30,327 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
> 2015-10-01 18:19:30,405 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
> 2015-10-01 18:19:30,406 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
> ...
> 2015-10-01 18:27:23,838 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2015-10-01 18:27:24,567 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
> 2015-10-01 18:27:24,969 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
> 2015-10-01 18:27:24,971 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
> 2015-10-01 18:27:25,050 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
> 2015-10-01 18:27:25,052 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
>
> 2015-10-01 18:27:30,288 INFO httpclient.Http - http.proxy.host = null
> 2015-10-01 18:27:30,288 INFO httpclient.Http - http.proxy.port = 8080
> 2015-10-01 18:27:30,288 INFO httpclient.Http - http.timeout = 10000
> 2015-10-01 18:27:30,288 INFO httpclient.Http - http.content.limit = 65536
> 2015-10-01 18:27:30,288 INFO httpclient.Http - http.agent = nutch Mongo Solr Crawler/Nutch-2.4-SNAPSHOT
> 2015-10-01 18:27:30,288 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2015-10-01 18:27:30,288 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.proxy.host = null
> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.proxy.port = 8080
> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.timeout = 10000
> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.content.limit = 65536
> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.agent = nutch Mongo Solr Crawler/Nutch-2.4-SNAPSHOT
> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>
> I’ve been trying this for 3 days with no luck. I want to use Nutch but may be forced to use another program.
>
> My best guess is that maybe something is borked with my plugin.includes:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
>   <description>Regular expression naming plugin directory names to include.</description>
> </property>
>
> Are these valid? Is there a more minimal set to try?
>
> Cheers,
> Sherban
>
>
> On 10/4/15, 12:23 PM, "Sebastian Nagel" <[email protected]> wrote:
>
>> Hi Sherban,
>>
>>> Right now it finds 0 URLs with no errors.
>>
>> Can you specify what's going wrong? It could be anything, even a configuration problem.
>> What did you crawl? Using which storage back-end?
>>
>> Thanks,
>> Sebastian
>>
>>
>> On 10/02/2015 03:02 AM, Drulea, Sherban wrote:
>>> Hi Lewis,
>>>
>>> -1 until I verify Nutch actually crawls. Right now it finds 0 URLs with no errors.
>>>
>>> 2.3.1 is an improvement over 2.3.0, which didn't work with Mongo at all.
>>>
>>> Cheers,
>>> Sherban
>>>
>>>
>>>
>>> On 9/30/15, 5:35 PM, "Lewis John Mcgibbney" <[email protected]> wrote:
>>>
>>>> Hi Folks,
>>>> Is anyone else able to test and run the release candidate for 2.3.1?
>>>> It would be great to get a release if we can get the VOTEs and the RC is suitable.
>>>> Thanks in advance.
>>>> Best
>>>> Lewis
>>>>
>>>> On Wed, Sep 23, 2015 at 9:46 PM, Lewis John Mcgibbney <[email protected]> wrote:
>>>>
>>>>> Hi Folks,
>>>>> It turns out the formatting for the original email below was terrible. Sorry about that.
>>>>> I've hopefully corrected the formatting now. Please VOTE away!
>>>>>
>>>>> On Tue, Sep 22, 2015 at 6:45 PM, Lewis John Mcgibbney <[email protected]> wrote:
>>>>>
>>>>>> Hi user@ & dev@,
>>>>>>
>>>>>> This thread is a VOTE for releasing Apache Nutch 2.3.1 RC#1.
>>>>>>
>>>>>> We addressed 32 issues in all, which can be seen in the release report:
>>>>>> http://s.apache.org/nutch_2.3.1
>>>>>>
>>>>>> The release candidate comprises the following components.
>>>>>>
>>>>>> * A staging repository [0] containing various Maven artifacts
>>>>>> * A branch-2.3.1 of the 2.x code [1]
>>>>>> * The tagged source upon which we are VOTE'ing [2]
>>>>>> * Finally, the release artifacts [3], which I would encourage you to verify for signatures and test.
>>>>>>
>>>>>> You should use the following KEYS [4] file to verify the signatures of all release artifacts.
>>>>>>
>>>>>> Please VOTE as follows
>>>>>>
>>>>>> [ ] +1 Push the release, I am happy :)
>>>>>> [ ] +/-0 I am not bothered either way
>>>>>> [ ] -1 I am not happy with this release candidate (please state why)
>>>>>>
>>>>>> Firstly, thank you to everyone that contributed to Nutch. Secondly, thank you to everyone that VOTEs. It is appreciated.
>>>>>>
>>>>>> Thanks
>>>>>> Lewis
>>>>>> (on behalf of Nutch PMC)
>>>>>>
>>>>>> p.s. Here's my +1
>>>>>>
>>>>>> [0] https://repository.apache.org/content/repositories/orgapachenutch-1005
>>>>>> [1] https://svn.apache.org/repos/asf/nutch/branches/branch-2.3.1
>>>>>> [2] https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1
>>>>>> [3] https://dist.apache.org/repos/dist/dev/nutch/2.3.1
>>>>>> [4] http://www.apache.org/dist/nutch/KEYS
>>>>>>
>>>>>> --
>>>>>> *Lewis*
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Lewis*
>>>>
>>>>
>>>>
>>>> --
>>>> *Lewis*
>>>
>>>
>>>
>>> __________________________________________________________________________
>>>
>>> This email message is for the sole use of the intended recipient(s) and may contain confidential information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.
>>>
>>
>

