Hi All,
Thanks for pointing me to the 2.3.1 release. It runs without errors, but it
doesn't actually crawl anything, and I'm out of ideas as to why.
Here’s my environment:
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
Solr 4.6.0
MongoDB 3.0.2
Nutch 2.3.1
My regex-urlfilter.txt:
———————————————
+.
———————————————
nutch-site.xml
———————————————
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch Mongo Solr Crawler</value>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.mongodb.store.MongoStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
<description>Regular expression naming plugin directory names to
include. </description>
</property>
</configuration>
———————————————
gora.properties:
———————————————
############################
# MongoDBStore properties #
############################
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=method_centers
———————————————
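For what it's worth, the store can be checked from the mongo shell to confirm that
inject is actually writing rows. I'm assuming here that the Gora MongoDB store names
the collection <crawlId>_webpage, i.e. methods_webpage in my case; if that guess is
wrong, "show collections" should reveal the real name:
———————————————
$ mongo method_centers
> show collections
> db.methods_webpage.count()    // assuming the <crawlId>_webpage naming
> db.methods_webpage.findOne()  // one raw row, to see which fields/markers are set
———————————————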
seed.txt:
———————————————
http://punklawyer.com/
http://mail-archives.apache.org/mod_mbox/nutch-user/
http://hbase.apache.org/index.html
http://wiki.apache.org/nutch/FrontPage
http://www.aintitcool.com/
———————————————
Here are the results of the crawl command "./bin/crawl urls methods
http://127.0.0.1:8983/solr/ 2":
Injecting seed URLs
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch inject urls -crawlId
methods
InjectorJob: starting at 2015-10-01 18:27:23
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora
storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 5
Injector: finished at 2015-10-01 18:27:26, elapsed: 00:00:02
Thu Oct 1 18:27:26 PDT 2015 : Iteration 1 of 2
Generating batchId
Generating a new fetchlist
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true
-topN 50000 -noNorm -noFilter -adddays 0 -crawlId methods -batchId
1443749246-29495
GeneratorJob: starting at 2015-10-01 18:27:26
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2015-10-01 18:27:29, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
Fetching :
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true
-D fetcher.timelimit.mins=180 1443749246-29495 -crawlId methods -threads 50
FetcherJob: starting at 2015-10-01 18:27:29
FetcherJob: batchId: 1443749246-29495
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1443760049865
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
[... FetcherThread1 through FetcherThread49 finish the same way, activeThreads=0 ...]
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0
queues
-activeThreads=0
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
[... FetcherThread1 through FetcherThread48 finish the same way, activeThreads=0 ...]
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread49, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0
queues
-activeThreads=0
FetcherJob: finished at 2015-10-01 18:27:42, time elapsed: 00:00:12
Parsing :
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true
-D mapred.skip.attempts.to.start.skipping=2 -D
mapred.skip.map.max.skip.records=1 1443749246-29495 -crawlId methods
ParserJob: starting at 2015-10-01 18:27:43
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: 1443749246-29495
ParserJob: success
ParserJob: finished at 2015-10-01 18:27:45, time elapsed: 00:00:02
CrawlDB update for methods
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true
1443749246-29495 -crawlId methods
DbUpdaterJob: starting at 2015-10-01 18:27:46
DbUpdaterJob: batchId: 1443749246-29495
DbUpdaterJob: finished at 2015-10-01 18:27:48, time elapsed: 00:00:02
Indexing methods on SOLR index -> http://127.0.0.1:8983/solr/
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true
-D solr.server.url=http://127.0.0.1:8983/solr/ -all -crawlId methods
IndexingJob: starting
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
IndexingJob: done.
SOLR dedup -> http://127.0.0.1:8983/solr/
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true
http://127.0.0.1:8983/solr/
Thu Oct 1 18:27:54 PDT 2015 : Iteration 2 of 2
Generating batchId
Generating a new fetchlist
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true
-topN 50000 -noNorm -noFilter -adddays 0 -crawlId methods -batchId
1443749274-17203
GeneratorJob: starting at 2015-10-01 18:27:55
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2015-10-01 18:27:57, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1443749275-2050785747 containing 0 URLs
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
So: no errors, but also no data. What else can I debug?
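One thing I can still try is dumping the webpage table after each step to see which
rows and batch markers are actually in the store. This is only a sketch based on the
readdb usage text; I haven't verified every flag against 2.3.1:
———————————————
# counts per status
$ ./bin/nutch readdb -crawlId methods -stats
# dump rows (plus extracted text) to a local directory for inspection
$ ./bin/nutch readdb -crawlId methods -dump dump_methods -text
———————————————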
I see some warnings in my hadoop.log, but nothing alarming:
2015-10-01 18:19:29,430 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2015-10-01 18:19:29,441 INFO crawl.FetchScheduleFactory - Using FetchSchedule
impl: org.apache.nutch.crawl.DefaultFetchSchedule
2015-10-01 18:19:29,441 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000
2015-10-01 18:19:29,442 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2015-10-01 18:19:30,326 WARN conf.Configuration -
file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.retry.interval; Ignoring.
2015-10-01 18:19:30,327 WARN conf.Configuration -
file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.attempts; Ignoring.
2015-10-01 18:19:30,405 WARN conf.Configuration -
file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.retry.interval; Ignoring.
2015-10-01 18:19:30,406 WARN conf.Configuration -
file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.attempts; Ignoring.
….
2015-10-01 18:27:23,838 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2015-10-01 18:27:24,567 INFO crawl.InjectorJob - InjectorJob: Using class
org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
2015-10-01 18:27:24,969 WARN conf.Configuration -
file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.retry.interval; Ignoring.
2015-10-01 18:27:24,971 WARN conf.Configuration -
file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.attempts; Ignoring.
2015-10-01 18:27:25,050 WARN conf.Configuration -
file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.retry.interval; Ignoring.
2015-10-01 18:27:25,052 WARN conf.Configuration -
file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.attempts; Ignoring.
2015-10-01 18:27:30,288 INFO httpclient.Http - http.proxy.host = null
2015-10-01 18:27:30,288 INFO httpclient.Http - http.proxy.port = 8080
2015-10-01 18:27:30,288 INFO httpclient.Http - http.timeout = 10000
2015-10-01 18:27:30,288 INFO httpclient.Http - http.content.limit = 65536
2015-10-01 18:27:30,288 INFO httpclient.Http - http.agent = nutch Mongo Solr
Crawler/Nutch-2.4-SNAPSHOT
2015-10-01 18:27:30,288 INFO httpclient.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2015-10-01 18:27:30,288 INFO httpclient.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2015-10-01 18:27:30,292 INFO httpclient.Http - http.proxy.host = null
2015-10-01 18:27:30,292 INFO httpclient.Http - http.proxy.port = 8080
2015-10-01 18:27:30,292 INFO httpclient.Http - http.timeout = 10000
2015-10-01 18:27:30,292 INFO httpclient.Http - http.content.limit = 65536
2015-10-01 18:27:30,292 INFO httpclient.Http - http.agent = nutch Mongo Solr
Crawler/Nutch-2.4-SNAPSHOT
2015-10-01 18:27:30,292 INFO httpclient.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2015-10-01 18:27:30,292 INFO httpclient.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
I've been trying this for 3 days with no luck. I want to use Nutch, but I may be
forced to use another tool.
My best guess is that something is borked in my plugin.includes:
<property>
<name>plugin.includes</name>
<value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
<description>Regular expression naming plugin directory names to
include. </description>
</property>
Are these plugin names valid? Is there a more minimal set I could try?
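For instance, would stripping it down to one protocol plugin and one parser be a sane
baseline to test with? Something like this (just a guess on my part, not taken from
any docs):
———————————————
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-html|index-basic|urlnormalizer-basic|scoring-opic|indexer-solr</value>
</property>
———————————————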
Cheers,
Sherban