Thanks Sebastian. I’m running on OS X 10.9.5 btw.
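One detail in the output quoted below may be worth a look: the generate step is invoked with -batchId 1443749246-29495, GeneratorJob then reports a different generated batch id (1443749246-1282586680, containing 5 URLs), and the fetcher, which is given 1443749246-29495, reads 0 records. Whether that mismatch is the actual cause is not established here. A minimal Python sketch for spotting such a mismatch in saved console output follows; the function name is hypothetical and the sample lines are copied from the log below.

```python
import re

# Patterns for the two console lines of interest (taken from the quoted output).
GENERATED = re.compile(r"GeneratorJob: generated batch id: (\S+)")
FETCHED = re.compile(r"FetcherJob: batchId: (\S+)")


def batch_id_mismatches(log_text):
    """Return (generated_id, fetched_id) pairs that differ, in log order."""
    mismatches = []
    generated = None
    for line in log_text.splitlines():
        match = GENERATED.search(line)
        if match:
            generated = match.group(1)
            continue
        match = FETCHED.search(line)
        if match and generated is not None and match.group(1) != generated:
            mismatches.append((generated, match.group(1)))
    return mismatches


sample = """\
GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
FetcherJob: batchId: 1443749246-29495
"""

for generated_id, fetched_id in batch_id_mismatches(sample):
    print("generated", generated_id, "but fetcher was given", fetched_id)
```

The script only compares strings in the console output; it does not query MongoDB or Nutch itself.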
On 10/5/15, 11:53 AM, "Sebastian Nagel" <[email protected]> wrote:

>Hi Sherban,
>
>thanks for the detailed description and the attached log.
>I'll have a look at it and hope to be able to reproduce the
>problem.
>
>Sebastian
>
>On 10/05/2015 07:53 PM, Drulea, Sherban wrote:
>> Hi Sebastian,
>>
>> I tried multiple URLs in my seed.txt file. None of them result in the
>> nutch generator crawling any links.
>>
>> Here’s my environment:
>> java version "1.8.0_60"
>> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
>> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
>> SOLR 4.6.0
>> Mongo version 3.0.2.
>> Nutch 2.3.1
>>
>> ―――――――――――――――
>>
>> regex-urlfilter.txt:
>> ―――――――――――――――
>> +.
>>
>> ―――――――――――――――
>> nutch-site.xml
>> ―――――――――――――――
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <!-- Put site-specific property overrides in this file. -->
>>
>> <configuration>
>>
>> <property>
>> <name>http.agent.name</name>
>> <value>nutch Mongo Solr Crawler</value>
>> </property>
>>
>> <property>
>> <name>storage.data.store.class</name>
>> <value>org.apache.gora.mongodb.store.MongoStore</value>
>> <description>Default class for storing data</description>
>> </property>
>>
>> <property>
>> <name>plugin.includes</name>
>> <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
>> <description>Regular expression naming plugin directory names to include.
>> </description>
>> </property>
>>
>> </configuration>
>>
>>
>> ―――――――――――――――
>> gora.properties:
>> ―――――――――――――――
>> ############################
>> # MongoDBStore properties #
>> ############################
>> gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
>> gora.mongodb.override_hadoop_configuration=false
>> gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
>> gora.mongodb.servers=localhost:27017
>> gora.mongodb.db=method_centers
>>
>> ―――――――――――――――
>> seed.txt
>> ―――――――――――――――
>> http://punklawyer.com
>> http://mail-archives.apache.org/mod_mbox/nutch-user/
>> http://hbase.apache.org/index.html
>> http://wiki.apache.org/nutch/FrontPage
>> http://www.aintitcool.com/
>> ―――――――――――――――
>>
>> Here are the results of the crawl command "./bin/crawl urls methods http://127.0.0.1:8983/solr/ 2":
>> Injecting seed URLs
>> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch inject urls -crawlId methods
>> InjectorJob: starting at 2015-10-01 18:27:23
>> InjectorJob: Injecting urlDir: urls
>> InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
>> InjectorJob: total number of urls rejected by filters: 0
>> InjectorJob: total number of urls injected after normalization and filtering: 5
>> Injector: finished at 2015-10-01 18:27:26, elapsed: 00:00:02
>> Thu Oct 1 18:27:26 PDT 2015 : Iteration 1 of 2
>> Generating batchId
>> Generating a new fetchlist
>> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId methods -batchId 1443749246-29495
>> GeneratorJob: starting at 2015-10-01 18:27:26
>> GeneratorJob: Selecting best-scoring urls due for fetch.
>> GeneratorJob: starting
>> GeneratorJob: filtering: false
>> GeneratorJob: normalizing: false
>> GeneratorJob: topN: 50000
>> GeneratorJob: finished at 2015-10-01 18:27:29, time elapsed: 00:00:02
>> GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
>> Fetching :
>> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D fetcher.timelimit.mins=180 1443749246-29495 -crawlId methods -threads 50
>> FetcherJob: starting at 2015-10-01 18:27:29
>> FetcherJob: batchId: 1443749246-29495
>> FetcherJob: threads: 50
>> FetcherJob: parsing: false
>> FetcherJob: resuming: false
>> FetcherJob : timelimit set for : 1443760049865
>> Using queue mode : byHost
>> Fetcher: threads: 50
>> QueueFeeder finished: total 0 records. Hit by time limit :0
>> -finishing thread FetcherThread0, activeThreads=0
>> ...
>> -finishing thread FetcherThread49, activeThreads=0
>> Fetcher: throughput threshold: -1
>> Fetcher: throughput threshold sequence: 5
>> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
>> -activeThreads=0
>> Using queue mode : byHost
>> Fetcher: threads: 50
>> QueueFeeder finished: total 0 records. Hit by time limit :0
>> -finishing thread FetcherThread0, activeThreads=0
>> ...
>>
>> -finishing thread FetcherThread48, activeThreads=0
>> Fetcher: throughput threshold: -1
>> Fetcher: throughput threshold sequence: 5
>> -finishing thread FetcherThread49, activeThreads=0
>> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
>> -activeThreads=0
>> FetcherJob: finished at 2015-10-01 18:27:42, time elapsed: 00:00:12
>> Parsing :
>> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1 1443749246-29495 -crawlId methods
>> ParserJob: starting at 2015-10-01 18:27:43
>> ParserJob: resuming: false
>> ParserJob: forced reparse: false
>> ParserJob: batchId: 1443749246-29495
>> ParserJob: success
>> ParserJob: finished at 2015-10-01 18:27:45, time elapsed: 00:00:02
>> CrawlDB update for methods
>> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1443749246-29495 -crawlId methods
>> DbUpdaterJob: starting at 2015-10-01 18:27:46
>> DbUpdaterJob: batchId: 1443749246-29495
>> DbUpdaterJob: finished at 2015-10-01 18:27:48, time elapsed: 00:00:02
>> Indexing methods on SOLR index -> http://127.0.0.1:8983/solr/
>> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://127.0.0.1:8983/solr/ -all -crawlId methods
>> IndexingJob: starting
>> Active IndexWriters :
>> SOLRIndexWriter
>> solr.server.url : URL of the SOLR instance (mandatory)
>> solr.commit.size : buffer size when sending to SOLR (default 1000)
>> solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>> solr.auth : use authentication (default false)
>> solr.auth.username : username for authentication
>> solr.auth.password : password for authentication
>>
>>
>> IndexingJob: done.
>> SOLR dedup -> http://127.0.0.1:8983/solr/
>> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true http://127.0.0.1:8983/solr/
>> Thu Oct 1 18:27:54 PDT 2015 : Iteration 2 of 2
>> Generating batchId
>> Generating a new fetchlist
>> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId methods -batchId 1443749274-17203
>> GeneratorJob: starting at 2015-10-01 18:27:55
>> GeneratorJob: Selecting best-scoring urls due for fetch.
>> GeneratorJob: starting
>> GeneratorJob: filtering: false
>> GeneratorJob: normalizing: false
>> GeneratorJob: topN: 50000
>> GeneratorJob: finished at 2015-10-01 18:27:57, time elapsed: 00:00:02
>> GeneratorJob: generated batch id: 1443749275-2050785747 containing 0 URLs
>> Generate returned 1 (no new segments created)
>> Escaping loop: no more URLs to fetch now
>>
>> There are no errors but also no data. What else can I debug?
>>
>> I see some warnings in my hadoop.log but nothing glaring ….
>>
>> 2015-10-01 18:19:29,430 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform...
>> using builtin-java classes where applicable
>> 2015-10-01 18:19:29,441 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2015-10-01 18:19:29,441 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>> 2015-10-01 18:19:29,442 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>> 2015-10-01 18:19:30,326 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
>> 2015-10-01 18:19:30,327 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
>> 2015-10-01 18:19:30,405 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
>> 2015-10-01 18:19:30,406 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
>> ….
>> 2015-10-01 18:27:23,838 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 2015-10-01 18:27:24,567 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
>> 2015-10-01 18:27:24,969 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
>> 2015-10-01 18:27:24,971 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
>> 2015-10-01 18:27:25,050 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
>> 2015-10-01 18:27:25,052 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
>>
>> 2015-10-01 18:27:30,288 INFO httpclient.Http - http.proxy.host = null
>> 2015-10-01 18:27:30,288 INFO httpclient.Http - http.proxy.port = 8080
>> 2015-10-01 18:27:30,288 INFO httpclient.Http - http.timeout = 10000
>> 2015-10-01 18:27:30,288 INFO httpclient.Http - http.content.limit = 65536
>> 2015-10-01 18:27:30,288 INFO httpclient.Http - http.agent = nutch Mongo Solr Crawler/Nutch-2.4-SNAPSHOT
>> 2015-10-01 18:27:30,288 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
>> 2015-10-01 18:27:30,288 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.proxy.host = null
>> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.proxy.port = 8080
>> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.timeout = 10000
>> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.content.limit = 65536
>> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.agent = nutch Mongo Solr Crawler/Nutch-2.4-SNAPSHOT
>> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
>> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>>
>> I’ve been trying this for 3 days with no luck. I want to use nutch but
>> may be forced to use another program.
>>
>> My best guess is maybe something is borked with my plugin.includes:
>>
>> <property>
>> <name>plugin.includes</name>
>> <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
>> <description>Regular expression naming plugin directory names to include. </description>
>> </property>
>>
>> Are these valid? Is there a more minimal set to try?
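On the question of whether the plugin.includes value is valid: its description says it is a regular expression over plugin directory names, so one quick sanity check is to join the wrapped value back into a single line and test it against the directory names one expects to be picked up. A minimal Python sketch follows. It assumes the expression has to match the whole directory name, it uses Python's re module only as a stand-in for Java's regex engine, and the candidate names are illustrative rather than a listing of the actual plugins directory.

```python
import re

# The plugin.includes value from the message above, joined into one string.
PLUGIN_INCLUDES = (
    "protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)"
    "|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)"
    "|scoring-opic|indexer-solr"
)


def included(names, pattern=PLUGIN_INCLUDES):
    """Map each directory name to True if the whole name matches the pattern."""
    compiled = re.compile(pattern)
    return {name: compiled.fullmatch(name) is not None for name in names}


# Hypothetical directory names to try; adjust to what is really on disk.
candidates = [
    "protocol-http",
    "protocol-httpclient",
    "parse-tika",
    "index-basic",
    "indexer-solr",
    "protocol-file",
]

for name, matched in included(candidates).items():
    print(name, "included" if matched else "not included")
```

A check like this only shows that the expression itself is well-formed and selects the intended names; it says nothing about whether those plugins load correctly.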
>>
>> Cheers,
>> Sherban
>>
>>
>>
>>
>> On 10/4/15, 12:23 PM, "Sebastian Nagel" <[email protected]> wrote:
>>
>>> Hi Sherban,
>>>
>>>> Right now it finds 0 URLs with no errors.
>>>
>>> Can you specify what's going wrong? It could
>>> be everything, even a configuration problem.
>>> What did you crawl? Using which storage back-end?
>>>
>>> Thanks,
>>> Sebastian
>>>
>>>
>>> On 10/02/2015 03:02 AM, Drulea, Sherban wrote:
>>>> Hi Lewis,
>>>>
>>>> -1 until I verify nutch actually crawls. Right now it finds 0 URLs with no
>>>> errors.
>>>>
>>>> 2.3.1 is an improvement over 2.3.0 which didn't work with Mongo at all.
>>>>
>>>> Cheers,
>>>> Sherban
>>>>
>>>>
>>>>
>>>> On 9/30/15, 5:35 PM, "Lewis John Mcgibbney" <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Folks,
>>>>> Is anyone else able to test and run the release candidate for 2.3.1?
>>>>> It would be great to get a release if we can get the VOTE's and the RC is
>>>>> suitable.
>>>>> Thanks in advance.
>>>>> Best
>>>>> Lewis
>>>>>
>>>>> On Wed, Sep 23, 2015 at 9:46 PM, Lewis John Mcgibbney <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Folks,
>>>>>> It turns out the formatting for the original email below was terrible.
>>>>>> Sorry about that.
>>>>>> I've hopefully corrected formatting now. Please VOTE away!
>>>>>>
>>>>>> On Tue, Sep 22, 2015 at 6:45 PM, Lewis John Mcgibbney <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi user@ & dev@,
>>>>>>>
>>>>>>> This thread is a VOTE for releasing Apache Nutch 2.3.1 RC#1.
>>>>>>>
>>>>>>> We addressed 32 issues in all which can be seen at the release report
>>>>>>> http://s.apache.org/nutch_2.3.1
>>>>>>>
>>>>>>> The release candidate comprises the following components.
>>>>>>>
>>>>>>> * A staging repository [0] containing various Maven artifacts
>>>>>>> * A branch-2.3.1 of the 2.x code [1]
>>>>>>> * The tagged source upon which we are VOTE'ing [2]
>>>>>>> * Finally, the release artifacts [3] which I would encourage you to
>>>>>>> verify for signatures and test.
>>>>>>>
>>>>>>> You should use the following KEYS [4] file to verify the signatures of
>>>>>>> all release artifacts.
>>>>>>>
>>>>>>> Please VOTE as follows
>>>>>>>
>>>>>>> [ ] +1 Push the release, I am happy :)
>>>>>>> [ ] +/-0 I am not bothered either way
>>>>>>> [ ] -1 I am not happy with this release candidate (please state why)
>>>>>>>
>>>>>>> Firstly thank you to everyone that contributed to Nutch. Secondly, thank
>>>>>>> you to everyone that VOTE's. It is appreciated.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Lewis
>>>>>>> (on behalf of Nutch PMC)
>>>>>>>
>>>>>>> p.s. Here's my +1
>>>>>>>
>>>>>>> [0] https://repository.apache.org/content/repositories/orgapachenutch-1005
>>>>>>> [1] https://svn.apache.org/repos/asf/nutch/branches/branch-2.3.1
>>>>>>> [2] https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1
>>>>>>> [3] https://dist.apache.org/repos/dist/dev/nutch/2.3.1
>>>>>>> [4] http://www.apache.org/dist/nutch/KEYS
>>>>>>>
>>>>>>> --
>>>>>>> *Lewis*
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Lewis*
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Lewis*
>>>>
>>>>
>>>>
>>>>
>>>> _________________________________________________________________________
>>>>
>>>> This email message is for the sole use of the intended recipient(s) and
>>>> may contain confidential information. Any unauthorized review, use,
>>>> disclosure or distribution is prohibited. If you are not the intended
>>>> recipient, please contact the sender by reply email and destroy all
>>>> copies of the original message.
>>>>
>>>
>>
>

