Hi Sebastian,
I tried multiple URLs in my seed.txt file. None of them result in the
nutch generator crawling any links.
Here’s my environment:
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
SOLR 4.6.0
Mongo version 3.0.2.
Nutch 2.3.1
―――――――――――――――
regex-urlfilter.txt:
―――――――――――――――
+.
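A rough Python sketch of how this single rule is understood to behave. It assumes the rules are tried in order, the first pattern found anywhere in the URL decides, `+` accepts, `-` rejects, and a URL matching no rule is dropped. Those semantics are an assumption taken from the comments in the stock regex-urlfilter.txt, not verified against the plugin source, and `accepts` and `rules` are illustrative names:

```python
import re

def accepts(rules, url):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is rejected.
    return False

# The single rule from regex-urlfilter.txt above: accept anything non-empty.
rules = [("+", ".")]

# The seed URLs from seed.txt in this message.
seeds = [
    "http://punklawyer.com",
    "http://mail-archives.apache.org/mod_mbox/nutch-user/",
    "http://hbase.apache.org/index.html",
    "http://wiki.apache.org/nutch/FrontPage",
    "http://www.aintitcool.com/",
]

for url in seeds:
    print(url, accepts(rules, url))
```

Under these assumptions every seed is accepted, which agrees with the injector reporting 0 URLs rejected by filters.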
―――――――――――――――
nutch-site.xml
―――――――――――――――
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch Mongo Solr Crawler</value>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.mongodb.store.MongoStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
<description>Regular expression naming plugin directory names to
include. </description>
</property>
</configuration>
―――――――――――――――
gora.properties:
―――――――――――――――
############################
# MongoDBStore properties #
############################
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=method_centers
―――――――――――――――
seed.txt
―――――――――――――――
http://punklawyer.com
http://mail-archives.apache.org/mod_mbox/nutch-user/
http://hbase.apache.org/index.html
http://wiki.apache.org/nutch/FrontPage
http://www.aintitcool.com/
―――――――――――――――
Here are the results of the crawl command "./bin/crawl urls methods http://127.0.0.1:8983/solr/ 2":
Injecting seed URLs
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch inject urls
-crawlId methods
InjectorJob: starting at 2015-10-01 18:27:23
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the
Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and
filtering: 5
Injector: finished at 2015-10-01 18:27:26, elapsed: 00:00:02
Thu Oct 1 18:27:26 PDT 2015 : Iteration 1 of 2
Generating batchId
Generating a new fetchlist
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0
-crawlId methods -batchId 1443749246-29495
GeneratorJob: starting at 2015-10-01 18:27:26
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2015-10-01 18:27:29, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
Fetching :
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D fetcher.timelimit.mins=180
1443749246-29495 -crawlId methods -threads 50
FetcherJob: starting at 2015-10-01 18:27:29
FetcherJob: batchId: 1443749246-29495
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1443760049865
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
...
-finishing thread FetcherThread49, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
in 0 queues
-activeThreads=0
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
...
-finishing thread FetcherThread48, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread49, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
in 0 queues
-activeThreads=0
FetcherJob: finished at 2015-10-01 18:27:42, time elapsed: 00:00:12
Parsing :
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D
mapred.skip.attempts.to.start.skipping=2 -D
mapred.skip.map.max.skip.records=1 1443749246-29495 -crawlId methods
ParserJob: starting at 2015-10-01 18:27:43
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: 1443749246-29495
ParserJob: success
ParserJob: finished at 2015-10-01 18:27:45, time elapsed: 00:00:02
CrawlDB update for methods
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true 1443749246-29495 -crawlId methods
DbUpdaterJob: starting at 2015-10-01 18:27:46
DbUpdaterJob: batchId: 1443749246-29495
DbUpdaterJob: finished at 2015-10-01 18:27:48, time elapsed: 00:00:02
Indexing methods on SOLR index -> http://127.0.0.1:8983/solr/
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D
solr.server.url=http://127.0.0.1:8983/solr/ -all -crawlId methods
IndexingJob: starting
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
IndexingJob: done.
SOLR dedup -> http://127.0.0.1:8983/solr/
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true http://127.0.0.1:8983/solr/
Thu Oct 1 18:27:54 PDT 2015 : Iteration 2 of 2
Generating batchId
Generating a new fetchlist
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0
-crawlId methods -batchId 1443749274-17203
GeneratorJob: starting at 2015-10-01 18:27:55
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2015-10-01 18:27:57, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1443749275-2050785747 containing 0 URLs
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
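One thing that can be checked mechanically in the output above is whether every batch ID handed to the fetcher was also reported by the generator: iteration 1 passes `-batchId 1443749246-29495`, while GeneratorJob reports `1443749246-1282586680`. Whether that difference explains the empty fetch queue is an assumption, not a confirmed diagnosis. A minimal Python sketch (the function name and the regular expressions are illustrative, not part of Nutch):

```python
import re

def batch_ids(log_text):
    """Collect the batch IDs that appear in crawl-script output."""
    requested = re.findall(r"-batchId\s+(\S+)", log_text)
    generated = re.findall(r"generated batch id:\s+(\S+)", log_text)
    fetched = re.findall(r"FetcherJob: batchId:\s+(\S+)", log_text)
    return requested, generated, fetched

# Three lines copied from iteration 1 of the output above.
sample = """\
-crawlId methods -batchId 1443749246-29495
GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
FetcherJob: batchId: 1443749246-29495
"""

requested, generated, fetched = batch_ids(sample)

# IDs the fetcher was asked for that the generator never reported.
unmatched = set(fetched) - set(generated)
print("requested:", requested)
print("generated:", generated)
print("fetched but never generated:", sorted(unmatched))
```

Feeding the whole console output to `batch_ids` instead of the three sample lines would cover both iterations.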
There are no errors, but also no data. What else can I debug?
I see some warnings in my hadoop.log, but nothing glaring:
2015-10-01 18:19:29,430 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-10-01 18:19:29,441 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2015-10-01 18:19:29,441 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2015-10-01 18:19:29,442 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2015-10-01 18:19:30,326 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2015-10-01 18:19:30,327 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2015-10-01 18:19:30,405 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2015-10-01 18:19:30,406 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
….
2015-10-01 18:27:23,838 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-10-01 18:27:24,567 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
2015-10-01 18:27:24,969 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2015-10-01 18:27:24,971 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2015-10-01 18:27:25,050 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2015-10-01 18:27:25,052 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2015-10-01 18:27:30,288 INFO httpclient.Http - http.proxy.host = null
2015-10-01 18:27:30,288 INFO httpclient.Http - http.proxy.port = 8080
2015-10-01 18:27:30,288 INFO httpclient.Http - http.timeout = 10000
2015-10-01 18:27:30,288 INFO httpclient.Http - http.content.limit = 65536
2015-10-01 18:27:30,288 INFO httpclient.Http - http.agent = nutch Mongo Solr Crawler/Nutch-2.4-SNAPSHOT
2015-10-01 18:27:30,288 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2015-10-01 18:27:30,288 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2015-10-01 18:27:30,292 INFO httpclient.Http - http.proxy.host = null
2015-10-01 18:27:30,292 INFO httpclient.Http - http.proxy.port = 8080
2015-10-01 18:27:30,292 INFO httpclient.Http - http.timeout = 10000
2015-10-01 18:27:30,292 INFO httpclient.Http - http.content.limit = 65536
2015-10-01 18:27:30,292 INFO httpclient.Http - http.agent = nutch Mongo Solr Crawler/Nutch-2.4-SNAPSHOT
2015-10-01 18:27:30,292 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2015-10-01 18:27:30,292 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
I've been trying this for 3 days with no luck. I want to use Nutch but may
be forced to use another program.
My best guess is that something is borked with my plugin.includes:
<property>
<name>plugin.includes</name>
<value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
<description>Regular expression naming plugin directory names to
include. </description>
</property>
Are these valid? Is there a more minimal set to try?
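Since the description says the value is a regular expression over plugin directory names, one check that needs no Nutch at all is to match the expression against such names. A minimal Python sketch, assuming the expression has to match the whole directory name (worth confirming against the Nutch source); the `sample` list is an illustrative selection, not a listing of the actual plugins directory, and `included` is a made-up helper name:

```python
import re

# The plugin.includes value from nutch-site.xml, with the mail line wraps removed.
PLUGIN_INCLUDES = (
    "protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|"
    "index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|"
    "scoring-opic|indexer-solr"
)

def included(names, expression=PLUGIN_INCLUDES):
    """Return the names that the expression matches in full."""
    pattern = re.compile(expression)
    return [name for name in names if pattern.fullmatch(name)]

# Illustrative directory names; the last two are not in the expression.
sample = [
    "protocol-http",
    "parse-tika",
    "index-basic",
    "indexer-solr",
    "index-more",
    "parse-js",
]

selected = included(sample)
print(selected)
```

Replacing `sample` with the real contents of the plugins directory would show which plugins the expression selects under that full-match assumption.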
Cheers,
Sherban
On 10/4/15, 12:23 PM, "Sebastian Nagel" <[email protected]> wrote:
>Hi Sherban,
>
>> Right now it finds 0 URLs with no errors.
>
>Can you specify what's going wrong? It could
>be anything, even a configuration problem.
>What did you crawl? Using which storage back-end?
>
>Thanks,
>Sebastian
>
>
>On 10/02/2015 03:02 AM, Drulea, Sherban wrote:
>> Hi Lewis,
>>
>> -1 until I verify nutch actually crawls. Right now it finds 0 URLs with no
>> errors.
>>
>> 2.3.1 is an improvement over 2.3.0 which didn't work with Mongo at all.
>>
>> Cheers,
>> Sherban
>>
>>
>>
>> On 9/30/15, 5:35 PM, "Lewis John Mcgibbney" <[email protected]>
>> wrote:
>>
>>> Hi Folks,
>>> Is anyone else able to test and run the release candidate for 2.3.1?
>>> It would be great to get a release if we can get the VOTE's and the RC is
>>> suitable.
>>> Thanks in advance.
>>> Best
>>> Lewis
>>>
>>> On Wed, Sep 23, 2015 at 9:46 PM, Lewis John Mcgibbney <
>>> [email protected]> wrote:
>>>
>>>> Hi Folks,
>>>> It turns out the formatting for the original email below was terrible.
>>>> Sorry about that.
>>>> I've hopefully corrected formatting now. Please VOTE away!
>>>>
>>>> On Tue, Sep 22, 2015 at 6:45 PM, Lewis John Mcgibbney <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi user@ & dev@,
>>>>>
>>>>> This thread is a VOTE for releasing Apache Nutch 2.3.1 RC#1.
>>>>>
>>>>> We addressed 32 issues in all which can be seen at the release report
>>>>> http://s.apache.org/nutch_2.3.1
>>>>>
>>>>> The release candidate comprises the following components.
>>>>>
>>>>> * A staging repository [0] containing various Maven artifacts
>>>>> * A branch-2.3.1 of the 2.x code [1]
>>>>> * The tagged source upon which we are VOTE'ing [2]
>>>>> * Finally, the release artifacts [3] which I would encourage you to
>>>>> verify for signatures and test.
>>>>>
>>>>> You should use the following KEYS [4] file to verify the signatures of
>>>>> all release artifacts.
>>>>>
>>>>> Please VOTE as follows
>>>>>
>>>>> [ ] +1 Push the release, I am happy :)
>>>>> [ ] +/-0 I am not bothered either way
>>>>> [ ] -1 I am not happy with this release candidate (please state why)
>>>>>
>>>>> Firstly thank you to everyone that contributed to Nutch. Secondly, thank
>>>>> you to everyone that VOTE's. It is appreciated.
>>>>>
>>>>> Thanks
>>>>> Lewis
>>>>> (on behalf of Nutch PMC)
>>>>>
>>>>> p.s. Here's my +1
>>>>>
>>>>> [0] https://repository.apache.org/content/repositories/orgapachenutch-1005
>>>>> [1] https://svn.apache.org/repos/asf/nutch/branches/branch-2.3.1
>>>>> [2] https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1
>>>>> [3] https://dist.apache.org/repos/dist/dev/nutch/2.3.1
>>>>> [4] http://www.apache.org/dist/nutch/KEYS
>>>>>
>>>>> --
>>>>> *Lewis*
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> *Lewis*
>>>>
>>>
>>>
>>>
>>> --
>>> *Lewis*
>>
>>
>>
>