It seems the problem is with the generator: it doesn't generate any links to crawl. Is there any way to debug why the generator doesn't work?
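The only thing I've found so far is to inspect the webpage table directly. A rough sketch of what I'd check (assuming a local 2.x runtime, the crawlId "methods" and the Mongo settings from the mail below; Gora names the collection <crawlId>_webpage, so adjust if yours differs — I haven't run these against your setup):

```shell
# Overall status counts in the webpage table (injected vs. fetched etc.):
./bin/nutch readdb -crawlId methods -stats

# Dump individual rows to look at batch ids, markers and scores:
./bin/nutch readdb -crawlId methods -dump /tmp/methods-dump

# Or look directly in MongoDB (collection name assumed to be methods_webpage):
mongo method_centers --eval 'db.methods_webpage.find().limit(5).forEach(printjson)'
```

If the rows are there but carry a different batch id than the one the fetcher was invoked with, that would explain "QueueFeeder finished: total 0 records".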
On 10/1/15, 6:39 PM, "Drulea, Sherban" <[email protected]> wrote:

>Hi All,
>
>Thanks for pointing me to the 2.3.1 release. It works without error but
>doesn't crawl. I'm out of ideas why.
>
>Here's my environment:
>
>java version "1.8.0_60"
>Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
>Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
>
>SOLR 4.6.0
>Mongo version 3.0.2
>Nutch 2.3.1
>
>My regex-urlfilter.txt:
>---------------
>+.
>---------------
>
>nutch-site.xml
>---------------
><?xml version="1.0"?>
><?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
><!-- Put site-specific property overrides in this file. -->
>
><configuration>
>
>  <property>
>    <name>http.agent.name</name>
>    <value>nutch Mongo Solr Crawler</value>
>  </property>
>
>  <property>
>    <name>storage.data.store.class</name>
>    <value>org.apache.gora.mongodb.store.MongoStore</value>
>    <description>Default class for storing data</description>
>  </property>
>
>  <property>
>    <name>plugin.includes</name>
>    <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
>    <description>Regular expression naming plugin directory names to include.</description>
>  </property>
>
></configuration>
>---------------
>
>gora.properties:
>---------------
>############################
># MongoDBStore properties  #
>############################
>gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
>gora.mongodb.override_hadoop_configuration=false
>gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
>gora.mongodb.servers=localhost:27017
>gora.mongodb.db=method_centers
>---------------
>
>Seed.txt
>---------------
>http://punklawyer.com/
>http://mail-archives.apache.org/mod_mbox/nutch-user/
>http://hbase.apache.org/index.html
>http://wiki.apache.org/nutch/FrontPage
>http://www.aintitcool.com/
>---------------
>
>Here are the results of the crawl command "./bin/crawl urls methods http://127.0.0.1:8983/solr/ 2":
>
>Injecting seed URLs
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch inject urls -crawlId methods
>InjectorJob: starting at 2015-10-01 18:27:23
>InjectorJob: Injecting urlDir: urls
>InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
>InjectorJob: total number of urls rejected by filters: 0
>InjectorJob: total number of urls injected after normalization and filtering: 5
>Injector: finished at 2015-10-01 18:27:26, elapsed: 00:00:02
>
>Thu Oct 1 18:27:26 PDT 2015 : Iteration 1 of 2
>
>Generating batchId
>Generating a new fetchlist
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId methods -batchId 1443749246-29495
>GeneratorJob: starting at 2015-10-01 18:27:26
>GeneratorJob: Selecting best-scoring urls due for fetch.
>GeneratorJob: starting
>GeneratorJob: filtering: false
>GeneratorJob: normalizing: false
>GeneratorJob: topN: 50000
>GeneratorJob: finished at 2015-10-01 18:27:29, time elapsed: 00:00:02
>GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
>
>Fetching :
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D fetcher.timelimit.mins=180 1443749246-29495 -crawlId methods -threads 50
>FetcherJob: starting at 2015-10-01 18:27:29
>FetcherJob: batchId: 1443749246-29495
>FetcherJob: threads: 50
>FetcherJob: parsing: false
>FetcherJob: resuming: false
>FetcherJob : timelimit set for : 1443760049865
>Using queue mode : byHost
>Fetcher: threads: 50
>QueueFeeder finished: total 0 records. Hit by time limit :0
>-finishing thread FetcherThread0, activeThreads=0
>-finishing thread FetcherThread1, activeThreads=0
>-finishing thread FetcherThread2, activeThreads=0
>-finishing thread FetcherThread3, activeThreads=0
>-finishing thread FetcherThread4, activeThreads=0
>-finishing thread FetcherThread5, activeThreads=0
>-finishing thread FetcherThread6, activeThreads=0
>-finishing thread FetcherThread7, activeThreads=0
>-finishing thread FetcherThread8, activeThreads=0
>-finishing thread FetcherThread9, activeThreads=0
>-finishing thread FetcherThread10, activeThreads=0
>-finishing thread FetcherThread11, activeThreads=0
>-finishing thread FetcherThread12, activeThreads=0
>-finishing thread FetcherThread13, activeThreads=0
>-finishing thread FetcherThread14, activeThreads=0
>-finishing thread FetcherThread15, activeThreads=0
>-finishing thread FetcherThread16, activeThreads=0
>-finishing thread FetcherThread17, activeThreads=0
>-finishing thread FetcherThread18, activeThreads=0
>-finishing thread FetcherThread19, activeThreads=0
>-finishing thread FetcherThread20, activeThreads=0
>-finishing thread FetcherThread21, activeThreads=0
>-finishing thread FetcherThread22, activeThreads=0
>-finishing thread FetcherThread23, activeThreads=0
>-finishing thread FetcherThread25, activeThreads=0
>-finishing thread FetcherThread24, activeThreads=0
>-finishing thread FetcherThread26, activeThreads=0
>-finishing thread FetcherThread27, activeThreads=0
>-finishing thread FetcherThread28, activeThreads=0
>-finishing thread FetcherThread29, activeThreads=0
>-finishing thread FetcherThread30, activeThreads=0
>-finishing thread FetcherThread31, activeThreads=0
>-finishing thread FetcherThread32, activeThreads=0
>-finishing thread FetcherThread33, activeThreads=0
>-finishing thread FetcherThread34, activeThreads=0
>-finishing thread FetcherThread35, activeThreads=0
>-finishing thread FetcherThread36, activeThreads=0
>-finishing thread FetcherThread37, activeThreads=0
>-finishing thread FetcherThread38, activeThreads=0
>-finishing thread FetcherThread39, activeThreads=0
>-finishing thread FetcherThread40, activeThreads=0
>-finishing thread FetcherThread41, activeThreads=0
>-finishing thread FetcherThread42, activeThreads=0
>-finishing thread FetcherThread43, activeThreads=0
>-finishing thread FetcherThread44, activeThreads=0
>-finishing thread FetcherThread45, activeThreads=0
>-finishing thread FetcherThread46, activeThreads=0
>-finishing thread FetcherThread47, activeThreads=0
>-finishing thread FetcherThread48, activeThreads=0
>-finishing thread FetcherThread49, activeThreads=0
>Fetcher: throughput threshold: -1
>Fetcher: throughput threshold sequence: 5
>0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
>-activeThreads=0
>Using queue mode : byHost
>Fetcher: threads: 50
>QueueFeeder finished: total 0 records. Hit by time limit :0
>-finishing thread FetcherThread0, activeThreads=0
>-finishing thread FetcherThread1, activeThreads=0
>-finishing thread FetcherThread2, activeThreads=0
>-finishing thread FetcherThread3, activeThreads=0
>-finishing thread FetcherThread4, activeThreads=0
>-finishing thread FetcherThread5, activeThreads=0
>-finishing thread FetcherThread6, activeThreads=0
>-finishing thread FetcherThread7, activeThreads=0
>-finishing thread FetcherThread8, activeThreads=0
>-finishing thread FetcherThread9, activeThreads=0
>-finishing thread FetcherThread10, activeThreads=0
>-finishing thread FetcherThread11, activeThreads=0
>-finishing thread FetcherThread12, activeThreads=0
>-finishing thread FetcherThread13, activeThreads=0
>-finishing thread FetcherThread14, activeThreads=0
>-finishing thread FetcherThread15, activeThreads=0
>-finishing thread FetcherThread16, activeThreads=0
>-finishing thread FetcherThread17, activeThreads=0
>-finishing thread FetcherThread18, activeThreads=0
>-finishing thread FetcherThread19, activeThreads=0
>-finishing thread FetcherThread20, activeThreads=0
>-finishing thread FetcherThread21, activeThreads=0
>-finishing thread FetcherThread22, activeThreads=0
>-finishing thread FetcherThread23, activeThreads=0
>-finishing thread FetcherThread24, activeThreads=0
>-finishing thread FetcherThread25, activeThreads=0
>-finishing thread FetcherThread26, activeThreads=0
>-finishing thread FetcherThread27, activeThreads=0
>-finishing thread FetcherThread28, activeThreads=0
>-finishing thread FetcherThread29, activeThreads=0
>-finishing thread FetcherThread30, activeThreads=0
>-finishing thread FetcherThread31, activeThreads=0
>-finishing thread FetcherThread32, activeThreads=0
>-finishing thread FetcherThread33, activeThreads=0
>-finishing thread FetcherThread34, activeThreads=0
>-finishing thread FetcherThread35, activeThreads=0
>-finishing thread FetcherThread36, activeThreads=0
>-finishing thread FetcherThread37, activeThreads=0
>-finishing thread FetcherThread38, activeThreads=0
>-finishing thread FetcherThread39, activeThreads=0
>-finishing thread FetcherThread40, activeThreads=0
>-finishing thread FetcherThread41, activeThreads=0
>-finishing thread FetcherThread42, activeThreads=0
>-finishing thread FetcherThread43, activeThreads=0
>-finishing thread FetcherThread44, activeThreads=0
>-finishing thread FetcherThread45, activeThreads=0
>-finishing thread FetcherThread46, activeThreads=0
>-finishing thread FetcherThread47, activeThreads=0
>-finishing thread FetcherThread48, activeThreads=0
>Fetcher: throughput threshold: -1
>Fetcher: throughput threshold sequence: 5
>-finishing thread FetcherThread49, activeThreads=0
>0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
>-activeThreads=0
>FetcherJob: finished at 2015-10-01 18:27:42, time elapsed: 00:00:12
>
>Parsing :
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1 1443749246-29495 -crawlId methods
>ParserJob: starting at 2015-10-01 18:27:43
>ParserJob: resuming: false
>ParserJob: forced reparse: false
>ParserJob: batchId: 1443749246-29495
>ParserJob: success
>ParserJob: finished at 2015-10-01 18:27:45, time elapsed: 00:00:02
>
>CrawlDB update for methods
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1443749246-29495 -crawlId methods
>DbUpdaterJob: starting at 2015-10-01 18:27:46
>DbUpdaterJob: batchId: 1443749246-29495
>DbUpdaterJob: finished at 2015-10-01 18:27:48, time elapsed: 00:00:02
>
>Indexing methods on SOLR index -> http://127.0.0.1:8983/solr/
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://127.0.0.1:8983/solr/ -all -crawlId methods
>IndexingJob: starting
>Active IndexWriters :
>SOLRIndexWriter
>  solr.server.url : URL of the SOLR instance (mandatory)
>  solr.commit.size : buffer size when sending to SOLR (default 1000)
>  solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>  solr.auth : use authentication (default false)
>  solr.auth.username : username for authentication
>  solr.auth.password : password for authentication
>IndexingJob: done.
>
>SOLR dedup -> http://127.0.0.1:8983/solr/
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true http://127.0.0.1:8983/solr/
>
>Thu Oct 1 18:27:54 PDT 2015 : Iteration 2 of 2
>
>Generating batchId
>Generating a new fetchlist
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId methods -batchId 1443749274-17203
>GeneratorJob: starting at 2015-10-01 18:27:55
>GeneratorJob: Selecting best-scoring urls due for fetch.
>GeneratorJob: starting
>GeneratorJob: filtering: false
>GeneratorJob: normalizing: false
>GeneratorJob: topN: 50000
>GeneratorJob: finished at 2015-10-01 18:27:57, time elapsed: 00:00:02
>GeneratorJob: generated batch id: 1443749275-2050785747 containing 0 URLs
>Generate returned 1 (no new segments created)
>Escaping loop: no more URLs to fetch now
>
>So no errors but also no data. What else can I debug?
>
>I see some warnings in my hadoop.log but nothing alarming…
>
>2015-10-01 18:19:29,430 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>2015-10-01 18:19:29,441 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>2015-10-01 18:19:29,441 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>2015-10-01 18:19:29,442 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>2015-10-01 18:19:30,326 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
>2015-10-01 18:19:30,327 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
>2015-10-01 18:19:30,405 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
>2015-10-01 18:19:30,406 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
>…
>2015-10-01 18:27:23,838 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>2015-10-01 18:27:24,567 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
>2015-10-01 18:27:24,969 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
>2015-10-01 18:27:24,971 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
>2015-10-01 18:27:25,050 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
>2015-10-01 18:27:25,052 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
>2015-10-01 18:27:30,288 INFO httpclient.Http - http.proxy.host = null
>2015-10-01 18:27:30,288 INFO httpclient.Http - http.proxy.port = 8080
>2015-10-01 18:27:30,288 INFO httpclient.Http - http.timeout = 10000
>2015-10-01 18:27:30,288 INFO httpclient.Http - http.content.limit = 65536
>2015-10-01 18:27:30,288 INFO httpclient.Http - http.agent = nutch Mongo Solr Crawler/Nutch-2.4-SNAPSHOT
>2015-10-01 18:27:30,288 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
>2015-10-01 18:27:30,288 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>2015-10-01 18:27:30,292 INFO httpclient.Http - http.proxy.host = null
>2015-10-01 18:27:30,292 INFO httpclient.Http - http.proxy.port = 8080
>2015-10-01 18:27:30,292 INFO httpclient.Http - http.timeout = 10000
>2015-10-01 18:27:30,292 INFO httpclient.Http - http.content.limit = 65536
>2015-10-01 18:27:30,292 INFO httpclient.Http - http.agent = nutch Mongo Solr Crawler/Nutch-2.4-SNAPSHOT
>2015-10-01 18:27:30,292 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
>2015-10-01 18:27:30,292 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>
>I've been trying this for 3 days with no luck. I want to use Nutch but may be forced to use another program.
>
>My best guess is that maybe something is borked with my plugin.includes:
>
><property>
>  <name>plugin.includes</name>
>  <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
>  <description>Regular expression naming plugin directory names to include.</description>
></property>
>
>Are these valid? Is there a more minimal set to try?
>
>Cheers,
>Sherban
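Regarding the plugin.includes question at the end of the quoted mail: the value is just a regular expression matched against plugin directory names, so it can be sanity-checked outside Nutch. A quick sketch using grep -E (which approximates, but is not identical to, Java's regex engine; the plugin names below are the ones the value is meant to enable, plus parse-js as a negative case):

```shell
# The plugin.includes value from nutch-site.xml, joined onto one line.
PLUGINS='protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr'

# -x requires the whole plugin name to match, mimicking how the
# regex selects plugin directories.
for name in protocol-http protocol-httpclient urlfilter-regex \
            parse-html parse-tika index-basic index-anchor \
            urlnormalizer-basic scoring-opic indexer-solr parse-js; do
  if printf '%s\n' "$name" | grep -Eqx "$PLUGINS"; then
    echo "$name: included"
  else
    echo "$name: excluded"
  fi
done
```

Every name except parse-js comes out as included, so the regex itself looks syntactically fine; given that the generator produced 5 URLs but the fetcher's QueueFeeder saw 0 records, the plugin set is probably not the culprit.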

