Hi All,

Thanks for pointing me to the 2.3.1 release. It runs without errors, but it doesn’t 
crawl anything, and I’m out of ideas as to why.

Here’s my environment:

java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
Solr 4.6.0
MongoDB 3.0.2
Nutch 2.3.1

My regex-urlfilter.txt:
———————————————
+.
———————————————
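For reference, my understanding of how urlfilter-regex reads that file (a rough sketch in Python, not Nutch's actual code; the function name and rule handling are mine): each non-comment line is a +/- sign followed by a regex, and the first rule whose regex is found anywhere in the URL decides. So a lone `+.` should accept every URL, meaning the filter shouldn't be what blocks the crawl:

```python
import re

def url_filter(url, rules):
    """Sketch of RegexURLFilter semantics: first rule whose regex
    matches anywhere in the URL decides; '+' accepts, '-' rejects.
    A URL matching no rule is rejected."""
    for rule in rules:
        rule = rule.strip()
        if not rule or rule.startswith("#"):
            continue  # skip blanks and comments
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False

# A lone "+." matches any URL, so everything is accepted.
print(url_filter("http://punklawyer.com/", ["+."]))
```

If that understanding is right, all 5 seeds pass the filter — consistent with "urls rejected by filters: 0" in the inject output below.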

nutch-site.xml:
———————————————
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

    <property>
        <name>http.agent.name</name>
        <value>nutch Mongo Solr Crawler</value>
    </property>

    <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.mongodb.store.MongoStore</value>
        <description>Default class for storing data</description>
    </property>

    <property>
        <name>plugin.includes</name>
        
<value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
        <description>Regular expression naming plugin directory names to 
include. </description>
   </property>

</configuration>

———————————————

gora.properties:
———————————————
############################
# MongoDBStore properties  #
############################
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=method_centers
———————————————
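In case it helps anyone reproduce: this is how I've been checking whether anything lands in Mongo. (Assumes the mongo shell is on the PATH; the collection name is my guess based on the crawlId "methods" plus the usual "_webpage" suffix, so adjust if yours differs.)

———————————————
$ mongo method_centers --eval "db.getCollectionNames()"
$ mongo method_centers --eval "db.methods_webpage.count()"
———————————————

After inject I do see documents there, so Gora/Mongo connectivity itself seems fine.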

seed.txt:
———————————————
http://punklawyer.com/
http://mail-archives.apache.org/mod_mbox/nutch-user/
http://hbase.apache.org/index.html
http://wiki.apache.org/nutch/FrontPage
http://www.aintitcool.com/
———————————————

Here are the results of the crawl command "./bin/crawl urls methods 
http://127.0.0.1:8983/solr/ 2":

Injecting seed URLs

/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch inject urls -crawlId 
methods

InjectorJob: starting at 2015-10-01 18:27:23

InjectorJob: Injecting urlDir: urls

InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora 
storage class.

InjectorJob: total number of urls rejected by filters: 0

InjectorJob: total number of urls injected after normalization and filtering: 5

Injector: finished at 2015-10-01 18:27:26, elapsed: 00:00:02

Thu Oct 1 18:27:26 PDT 2015 : Iteration 1 of 2

Generating batchId

Generating a new fetchlist

/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D 
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapred.reduce.tasks.speculative.execution=false -D 
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 
-topN 50000 -noNorm -noFilter -adddays 0 -crawlId methods -batchId 
1443749246-29495

GeneratorJob: starting at 2015-10-01 18:27:26

GeneratorJob: Selecting best-scoring urls due for fetch.

GeneratorJob: starting

GeneratorJob: filtering: false

GeneratorJob: normalizing: false

GeneratorJob: topN: 50000

GeneratorJob: finished at 2015-10-01 18:27:29, time elapsed: 00:00:02

GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs

Fetching :

/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D 
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapred.reduce.tasks.speculative.execution=false -D 
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 
-D fetcher.timelimit.mins=180 1443749246-29495 -crawlId methods -threads 50

FetcherJob: starting at 2015-10-01 18:27:29

FetcherJob: batchId: 1443749246-29495

FetcherJob: threads: 50

FetcherJob: parsing: false

FetcherJob: resuming: false

FetcherJob : timelimit set for : 1443760049865

Using queue mode : byHost

Fetcher: threads: 50

QueueFeeder finished: total 0 records. Hit by time limit :0

-finishing thread FetcherThread0, activeThreads=0

[... FetcherThread1 through FetcherThread49 all finish the same way, activeThreads=0 ...]

Fetcher: throughput threshold: -1

Fetcher: throughput threshold sequence: 5

0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 
queues

-activeThreads=0

Using queue mode : byHost

Fetcher: threads: 50

QueueFeeder finished: total 0 records. Hit by time limit :0

-finishing thread FetcherThread0, activeThreads=0

[... FetcherThread1 through FetcherThread49 all finish the same way, activeThreads=0 ...]

Fetcher: throughput threshold: -1

Fetcher: throughput threshold sequence: 5

0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 
queues

-activeThreads=0

FetcherJob: finished at 2015-10-01 18:27:42, time elapsed: 00:00:12

Parsing :

/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D 
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapred.reduce.tasks.speculative.execution=false -D 
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 
-D mapred.skip.attempts.to.start.skipping=2 -D 
mapred.skip.map.max.skip.records=1 1443749246-29495 -crawlId methods

ParserJob: starting at 2015-10-01 18:27:43

ParserJob: resuming: false

ParserJob: forced reparse: false

ParserJob: batchId: 1443749246-29495

ParserJob: success

ParserJob: finished at 2015-10-01 18:27:45, time elapsed: 00:00:02

CrawlDB update for methods

/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D 
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapred.reduce.tasks.speculative.execution=false -D 
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 
1443749246-29495 -crawlId methods

DbUpdaterJob: starting at 2015-10-01 18:27:46

DbUpdaterJob: batchId: 1443749246-29495

DbUpdaterJob: finished at 2015-10-01 18:27:48, time elapsed: 00:00:02

Indexing methods on SOLR index -> http://127.0.0.1:8983/solr/

/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D 
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapred.reduce.tasks.speculative.execution=false -D 
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 
-D solr.server.url=http://127.0.0.1:8983/solr/ -all -crawlId methods

IndexingJob: starting

Active IndexWriters :

SOLRIndexWriter

solr.server.url : URL of the SOLR instance (mandatory)

solr.commit.size : buffer size when sending to SOLR (default 1000)

solr.mapping.file : name of the mapping file for fields (default 
solrindex-mapping.xml)

solr.auth : use authentication (default false)

solr.auth.username : username for authentication

solr.auth.password : password for authentication



IndexingJob: done.

SOLR dedup -> http://127.0.0.1:8983/solr/

/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D 
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapred.reduce.tasks.speculative.execution=false -D 
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 
http://127.0.0.1:8983/solr/

Thu Oct 1 18:27:54 PDT 2015 : Iteration 2 of 2

Generating batchId

Generating a new fetchlist

/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D 
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapred.reduce.tasks.speculative.execution=false -D 
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 
-topN 50000 -noNorm -noFilter -adddays 0 -crawlId methods -batchId 
1443749274-17203

GeneratorJob: starting at 2015-10-01 18:27:55

GeneratorJob: Selecting best-scoring urls due for fetch.

GeneratorJob: starting

GeneratorJob: filtering: false

GeneratorJob: normalizing: false

GeneratorJob: topN: 50000

GeneratorJob: finished at 2015-10-01 18:27:57, time elapsed: 00:00:02

GeneratorJob: generated batch id: 1443749275-2050785747 containing 0 URLs

Generate returned 1 (no new segments created)

Escaping loop: no more URLs to fetch now

So: no errors, but also no data. The line that stands out to me is "QueueFeeder 
finished: total 0 records" — the injector reports 5 URLs injected, yet the 
fetcher's queue is empty. I also notice the generator logs "generated batch id: 
1443749246-1282586680" while fetch runs with batchId 1443749246-29495; is that 
mismatch normal? What else can I debug?

I see some warnings in my hadoop.log, but nothing alarming ….

2015-10-01 18:19:29,430 WARN  util.NativeCodeLoader - Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable

2015-10-01 18:19:29,441 INFO  crawl.FetchScheduleFactory - Using FetchSchedule 
impl: org.apache.nutch.crawl.DefaultFetchSchedule

2015-10-01 18:19:29,441 INFO  crawl.AbstractFetchSchedule - 
defaultInterval=2592000

2015-10-01 18:19:29,442 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000

2015-10-01 18:19:30,326 WARN  conf.Configuration - 
file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an
 attempt to override final parameter: 
mapreduce.job.end-notification.max.retry.interval;  Ignoring.

2015-10-01 18:19:30,327 WARN  conf.Configuration - 
file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an
 attempt to override final parameter: 
mapreduce.job.end-notification.max.attempts;  Ignoring.

2015-10-01 18:19:30,405 WARN  conf.Configuration - 
file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an
 attempt to override final parameter: 
mapreduce.job.end-notification.max.retry.interval;  Ignoring.

2015-10-01 18:19:30,406 WARN  conf.Configuration - 
file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an
 attempt to override final parameter: 
mapreduce.job.end-notification.max.attempts;  Ignoring.

….


2015-10-01 18:27:23,838 WARN  util.NativeCodeLoader - Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable

2015-10-01 18:27:24,567 INFO  crawl.InjectorJob - InjectorJob: Using class 
org.apache.gora.mongodb.store.MongoStore as the Gora storage class.

2015-10-01 18:27:24,969 WARN  conf.Configuration - 
file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an
 attempt to override final parameter: 
mapreduce.job.end-notification.max.retry.interval;  Ignoring.

2015-10-01 18:27:24,971 WARN  conf.Configuration - 
file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an
 attempt to override final parameter: 
mapreduce.job.end-notification.max.attempts;  Ignoring.

2015-10-01 18:27:25,050 WARN  conf.Configuration - 
file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an
 attempt to override final parameter: 
mapreduce.job.end-notification.max.retry.interval;  Ignoring.

2015-10-01 18:27:25,052 WARN  conf.Configuration - 
file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an
 attempt to override final parameter: 
mapreduce.job.end-notification.max.attempts;  Ignoring.


2015-10-01 18:27:30,288 INFO  httpclient.Http - http.proxy.host = null

2015-10-01 18:27:30,288 INFO  httpclient.Http - http.proxy.port = 8080

2015-10-01 18:27:30,288 INFO  httpclient.Http - http.timeout = 10000

2015-10-01 18:27:30,288 INFO  httpclient.Http - http.content.limit = 65536

2015-10-01 18:27:30,288 INFO  httpclient.Http - http.agent = nutch Mongo Solr 
Crawler/Nutch-2.4-SNAPSHOT

2015-10-01 18:27:30,288 INFO  httpclient.Http - http.accept.language = 
en-us,en-gb,en;q=0.7,*;q=0.3

2015-10-01 18:27:30,288 INFO  httpclient.Http - http.accept = 
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

2015-10-01 18:27:30,292 INFO  httpclient.Http - http.proxy.host = null

2015-10-01 18:27:30,292 INFO  httpclient.Http - http.proxy.port = 8080

2015-10-01 18:27:30,292 INFO  httpclient.Http - http.timeout = 10000

2015-10-01 18:27:30,292 INFO  httpclient.Http - http.content.limit = 65536

2015-10-01 18:27:30,292 INFO  httpclient.Http - http.agent = nutch Mongo Solr 
Crawler/Nutch-2.4-SNAPSHOT

2015-10-01 18:27:30,292 INFO  httpclient.Http - http.accept.language = 
en-us,en-gb,en;q=0.7,*;q=0.3

2015-10-01 18:27:30,292 INFO  httpclient.Http - http.accept = 
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

I’ve been trying this for 3 days with no luck. I want to use Nutch but may be 
forced to use another program.

My best guess is that something is wrong with my plugin.includes:

<property>
        <name>plugin.includes</name>
        
<value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
        <description>Regular expression naming plugin directory names to 
include. </description>
   </property>

Are these valid? Is there a more minimal set to try?
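For comparison, here is the most minimal set I'd try next — this is my guess at something close to the stock nutch-default.xml value (going from memory, so treat it as a hypothesis, not a known-good config): single protocol plugin, single parser, single normalizer, keeping indexer-solr since the crawl script indexes to Solr.

<property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-html|index-basic|urlnormalizer-basic|scoring-opic|indexer-solr</value>
</property>

If the crawl behaves the same with this set, that would at least rule out the protocol-(http|httpclient) alternation as the culprit.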

Cheers,
Sherban


