Re: [VOTE] Release Apache Nutch 2.3.1

Drulea, Sherban Mon, 05 Oct 2015 12:09:54 -0700

Thanks Sebastian. I’m running on OS X 10.9.5 btw.


On 10/5/15, 11:53 AM, "Sebastian Nagel" <[email protected]> wrote:

>Hi Sherban,
>
>thanks for the detailed description and the attached log.
>I'll have a look at it and hope to be able to reproduce the
>problem.
>
>Sebastian
>
>On 10/05/2015 07:53 PM, Drulea, Sherban wrote:
>> Hi Sebastian,
>> 
>> I tried multiple URLs in my seed.txt file. None of them result in the
>> nutch generator crawling any links.
>> 
>> Here’s my environment:
>> java version "1.8.0_60"
>> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
>> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
>> SOLR 4.6.0
>> Mongo version 3.0.2.
>> Nutch 2.3.1
>> 
>> ―――――――――――――――
>> 
>> regex-urlfilter.txt:
>> ―――――――――――――――
>> +.
>> 
>> ―――――――――――――――
>> nutch-site.xml
>> ―――――――――――――――
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> 
>> <!-- Put site-specific property overrides in this file. -->
>> 
>> <configuration>
>> 
>>     <property>
>>         <name>http.agent.name</name>
>>         <value>nutch Mongo Solr Crawler</value>
>>     </property>
>> 
>>     <property>
>>         <name>storage.data.store.class</name>
>>         <value>org.apache.gora.mongodb.store.MongoStore</value>
>>         <description>Default class for storing data</description>
>>     </property>
>>     
>>     <property>
>>         <name>plugin.includes</name>
>>         <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
>>         <description>Regular expression naming plugin directory names to include. </description>
>>    </property>
>>     
>> </configuration>
>> 
>> 
>> ―――――――――――――――
>> gora.properties:
>> ―――――――――――――――
>> ############################
>> # MongoDBStore properties  #
>> ############################
>> gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
>> gora.mongodb.override_hadoop_configuration=false
>> gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
>> gora.mongodb.servers=localhost:27017
>> gora.mongodb.db=method_centers
>> 
>> ―――――――――――――――
>> seed.txt
>> ―――――――――――――――
>> http://punklawyer.com
>> http://mail-archives.apache.org/mod_mbox/nutch-user/
>> http://hbase.apache.org/index.html
>> http://wiki.apache.org/nutch/FrontPage
>> http://www.aintitcool.com/
>> ―――――――――――――――
>> 
>> Here are the results of the crawl command "./bin/crawl urls methods
>> http://127.0.0.1:8983/solr/ 2":
>> Injecting seed URLs
>> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch inject urls
>> -crawlId methods
>> InjectorJob: starting at 2015-10-01 18:27:23
>> InjectorJob: Injecting urlDir: urls
>> InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the
>> Gora storage class.
>> InjectorJob: total number of urls rejected by filters: 0
>> InjectorJob: total number of urls injected after normalization and
>> filtering: 5
>> Injector: finished at 2015-10-01 18:27:26, elapsed: 00:00:02
>> Thu Oct 1 18:27:26 PDT 2015 : Iteration 1 of 2
>> Generating batchId
>> Generating a new fetchlist
>> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D
>> mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>> mapred.reduce.tasks.speculative.execution=false -D
>> mapred.map.tasks.speculative.execution=false -D
>> mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0
>> -crawlId methods -batchId 1443749246-29495
>> GeneratorJob: starting at 2015-10-01 18:27:26
>> GeneratorJob: Selecting best-scoring urls due for fetch.
>> GeneratorJob: starting
>> GeneratorJob: filtering: false
>> GeneratorJob: normalizing: false
>> GeneratorJob: topN: 50000
>> GeneratorJob: finished at 2015-10-01 18:27:29, time elapsed: 00:00:02
>> GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
>> Fetching : 
>> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D
>> mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>> mapred.reduce.tasks.speculative.execution=false -D
>> mapred.map.tasks.speculative.execution=false -D
>> mapred.compress.map.output=true -D fetcher.timelimit.mins=180
>> 1443749246-29495 -crawlId methods -threads 50
>> FetcherJob: starting at 2015-10-01 18:27:29
>> FetcherJob: batchId: 1443749246-29495
>> FetcherJob: threads: 50
>> FetcherJob: parsing: false
>> FetcherJob: resuming: false
>> FetcherJob : timelimit set for : 1443760049865
>> Using queue mode : byHost
>> Fetcher: threads: 50
>> QueueFeeder finished: total 0 records. Hit by time limit :0
>> -finishing thread FetcherThread0, activeThreads=0
>> ...
>> -finishing thread FetcherThread49, activeThreads=0
>> Fetcher: throughput threshold: -1
>> Fetcher: throughput threshold sequence: 5
>> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
>> -activeThreads=0
>> Using queue mode : byHost
>> Fetcher: threads: 50
>> QueueFeeder finished: total 0 records. Hit by time limit :0
>> -finishing thread FetcherThread0, activeThreads=0
>> ...
>> 
>> -finishing thread FetcherThread48, activeThreads=0
>> Fetcher: throughput threshold: -1
>> Fetcher: throughput threshold sequence: 5
>> -finishing thread FetcherThread49, activeThreads=0
>> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
>> -activeThreads=0
>> FetcherJob: finished at 2015-10-01 18:27:42, time elapsed: 00:00:12
>> Parsing : 
>> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D
>> mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>> mapred.reduce.tasks.speculative.execution=false -D
>> mapred.map.tasks.speculative.execution=false -D
>> mapred.compress.map.output=true -D
>> mapred.skip.attempts.to.start.skipping=2 -D
>> mapred.skip.map.max.skip.records=1 1443749246-29495 -crawlId methods
>> ParserJob: starting at 2015-10-01 18:27:43
>> ParserJob: resuming:  false
>> ParserJob: forced reparse:  false
>> ParserJob: batchId: 1443749246-29495
>> ParserJob: success
>> ParserJob: finished at 2015-10-01 18:27:45, time elapsed: 00:00:02
>> CrawlDB update for methods
>> 
>> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D
>> mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>> mapred.reduce.tasks.speculative.execution=false -D
>> mapred.map.tasks.speculative.execution=false -D
>> mapred.compress.map.output=true 1443749246-29495 -crawlId methods
>> DbUpdaterJob: starting at 2015-10-01 18:27:46
>> DbUpdaterJob: batchId: 1443749246-29495
>> DbUpdaterJob: finished at 2015-10-01 18:27:48, time elapsed: 00:00:02
>> Indexing methods on SOLR index -> http://127.0.0.1:8983/solr/
>> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D
>> mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>> mapred.reduce.tasks.speculative.execution=false -D
>> mapred.map.tasks.speculative.execution=false -D
>> mapred.compress.map.output=true -D
>> solr.server.url=http://127.0.0.1:8983/solr/ -all -crawlId methods
>> IndexingJob: starting
>> Active IndexWriters :
>> SOLRIndexWriter
>> solr.server.url : URL of the SOLR instance (mandatory)
>> solr.commit.size : buffer size when sending to SOLR (default 1000)
>> solr.mapping.file : name of the mapping file for fields (default
>> solrindex-mapping.xml)
>> solr.auth : use authentication (default false)
>> solr.auth.username : username for authentication
>> solr.auth.password : password for authentication
>> 
>> 
>> IndexingJob: done.
>> SOLR dedup -> http://127.0.0.1:8983/solr/
>> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D
>> mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>> mapred.reduce.tasks.speculative.execution=false -D
>> mapred.map.tasks.speculative.execution=false -D
>> mapred.compress.map.output=true http://127.0.0.1:8983/solr/
>> Thu Oct 1 18:27:54 PDT 2015 : Iteration 2 of 2
>> Generating batchId
>> Generating a new fetchlist
>> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D
>> mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>> mapred.reduce.tasks.speculative.execution=false -D
>> mapred.map.tasks.speculative.execution=false -D
>> mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0
>> -crawlId methods -batchId 1443749274-17203
>> GeneratorJob: starting at 2015-10-01 18:27:55
>> GeneratorJob: Selecting best-scoring urls due for fetch.
>> GeneratorJob: starting
>> GeneratorJob: filtering: false
>> GeneratorJob: normalizing: false
>> GeneratorJob: topN: 50000
>> GeneratorJob: finished at 2015-10-01 18:27:57, time elapsed: 00:00:02
>> GeneratorJob: generated batch id: 1443749275-2050785747 containing 0 URLs
>> Generate returned 1 (no new segments created)
>> Escaping loop: no more URLs to fetch now
>> 
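
One detail in the output above may be worth cross-checking: generate was called with -batchId 1443749246-29495, GeneratorJob reports batch id 1443749246-1282586680, and the fetcher, which was given 1443749246-29495, found 0 records. Whether that mismatch is the cause of the empty fetch is an assumption; the log does not confirm it. A minimal Python sketch for pulling the three IDs out of console output like this (the helper name `batch_ids` is hypothetical, and the generate command in the excerpt is abridged):

```python
import re

# Excerpt of the console output quoted above, line wraps removed.
# The generate command is abridged (the -D options are omitted).
LOG = """\
bin/nutch generate -topN 50000 -noNorm -noFilter -adddays 0 -crawlId methods -batchId 1443749246-29495
GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
FetcherJob: batchId: 1443749246-29495
QueueFeeder finished: total 0 records. Hit by time limit :0
"""


def batch_ids(log: str) -> dict:
    """Return the requested, generated, and fetched batch IDs (None if absent)."""
    patterns = {
        "requested": r"-batchId (\S+)",
        "generated": r"GeneratorJob: generated batch id: (\S+)",
        "fetched": r"FetcherJob: batchId: (\S+)",
    }
    found = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, log)
        found[name] = match.group(1) if match else None
    return found


ids = batch_ids(LOG)
print(ids)
if ids["generated"] != ids["fetched"]:
    print("Generated and fetched batch IDs differ.")
```

If the IDs differ, one experiment (untested here) would be to run the fetch step by hand with the generated ID, 1443749246-1282586680, and see whether QueueFeeder then reports any records.
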
>> There are no errors, but also no data. What else can I debug?
>> 
>> I see some warnings in my hadoop.log, but nothing glaring:
>> 
>> 2015-10-01 18:19:29,430 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 2015-10-01 18:19:29,441 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2015-10-01 18:19:29,441 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
>> 2015-10-01 18:19:29,442 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
>> 2015-10-01 18:19:30,326 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
>> 2015-10-01 18:19:30,327 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
>> 2015-10-01 18:19:30,405 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
>> 2015-10-01 18:19:30,406 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
>> ….
>> 2015-10-01 18:27:23,838 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 2015-10-01 18:27:24,567 INFO  crawl.InjectorJob - InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
>> 2015-10-01 18:27:24,969 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
>> 2015-10-01 18:27:24,971 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
>> 2015-10-01 18:27:25,050 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
>> 2015-10-01 18:27:25,052 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
>> 
>> 2015-10-01 18:27:30,288 INFO  httpclient.Http - http.proxy.host = null
>> 2015-10-01 18:27:30,288 INFO  httpclient.Http - http.proxy.port = 8080
>> 2015-10-01 18:27:30,288 INFO  httpclient.Http - http.timeout = 10000
>> 2015-10-01 18:27:30,288 INFO  httpclient.Http - http.content.limit = 65536
>> 2015-10-01 18:27:30,288 INFO  httpclient.Http - http.agent = nutch Mongo Solr Crawler/Nutch-2.4-SNAPSHOT
>> 2015-10-01 18:27:30,288 INFO  httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
>> 2015-10-01 18:27:30,288 INFO  httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>> 2015-10-01 18:27:30,292 INFO  httpclient.Http - http.proxy.host = null
>> 2015-10-01 18:27:30,292 INFO  httpclient.Http - http.proxy.port = 8080
>> 2015-10-01 18:27:30,292 INFO  httpclient.Http - http.timeout = 10000
>> 2015-10-01 18:27:30,292 INFO  httpclient.Http - http.content.limit = 65536
>> 2015-10-01 18:27:30,292 INFO  httpclient.Http - http.agent = nutch Mongo Solr Crawler/Nutch-2.4-SNAPSHOT
>> 2015-10-01 18:27:30,292 INFO  httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
>> 2015-10-01 18:27:30,292 INFO  httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>> 
>> I’ve been trying this for 3 days with no luck. I want to use nutch but
>> may be forced to use another program.
>> 
>> My best guess is maybe something is borked with my plugin.includes:
>> 
>> <property>
>>         <name>plugin.includes</name>
>>         <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
>>         <description>Regular expression naming plugin directory names to include. </description>
>>    </property>
>> 
>> Are these valid? Is there a more minimal set to try?
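
As a quick sanity check on the expression itself, here is a minimal Python sketch that tests which plugin directory names it matches. It assumes Nutch requires the whole directory name to match the expression (hence `re.fullmatch`); the helper name `included` is hypothetical, and the candidate names are illustrative rather than read from a real plugins/ folder.

```python
import re

# plugin.includes value from nutch-site.xml above, with the mail line wraps removed.
PLUGIN_INCLUDES = (
    "protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|"
    "index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr"
)


def included(names, pattern=PLUGIN_INCLUDES):
    """Return the names that the expression matches in full, in input order."""
    regex = re.compile(pattern)
    return [name for name in names if regex.fullmatch(name)]


# Illustrative directory names (assumption: not taken from an actual install).
candidates = [
    "protocol-http", "protocol-httpclient", "parse-html", "parse-tika",
    "index-basic", "index-more", "indexer-solr", "urlfilter-regex",
    "storage-mongodb",
]
print(included(candidates))
```

The expression compiles and selects the expected names, so this check at least does not point to a syntax problem in the value. It says nothing about whether those plugins load correctly at runtime.
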
>> 
>> Cheers,
>> Sherban
>> 
>> 
>> 
>> 
>> On 10/4/15, 12:23 PM, "Sebastian Nagel" <[email protected]> wrote:
>> 
>>> Hi Sherban,
>>>
>>>> Right now it finds 0 URLs with no errors.
>>>
>>> Can you specify what's going wrong? It could
>>> be anything, even a configuration problem.
>>> What did you crawl? Using which storage back-end?
>>>
>>> Thanks,
>>> Sebastian
>>>
>>>
>>> On 10/02/2015 03:02 AM, Drulea, Sherban wrote:
>>>> Hi Lewis,
>>>>
>>>> -1 until I verify nutch actually crawls. Right now it finds 0 URLs with
>>>> no errors.
>>>>
>>>> 2.3.1 is an improvement over 2.3.0, which didn’t work with Mongo at all.
>>>>
>>>> Cheers,
>>>> Sherban
>>>>
>>>>
>>>>
>>>> On 9/30/15, 5:35 PM, "Lewis John Mcgibbney" <[email protected]> wrote:
>>>>
>>>>> Hi Folks,
>>>>> Is anyone else able to test and run the release candidate for 2.3.1?
>>>>> It would be great to get a release if we can get the VOTEs and the RC
>>>>> is suitable.
>>>>> Thanks in advance.
>>>>> Best
>>>>> Lewis
>>>>>
>>>>> On Wed, Sep 23, 2015 at 9:46 PM, Lewis John Mcgibbney <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Folks,
>>>>>> It turns out the formatting for the original email below was terrible.
>>>>>> Sorry about that.
>>>>>> I've hopefully corrected formatting now. Please VOTE away!
>>>>>>
>>>>>> On Tue, Sep 22, 2015 at 6:45 PM, Lewis John Mcgibbney <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi user@ & dev@,
>>>>>>>
>>>>>>> This thread is a VOTE for releasing Apache Nutch 2.3.1 RC#1.
>>>>>>>
>>>>>>> We addressed 32 issues in all, which can be seen in the release report
>>>>>>> http://s.apache.org/nutch_2.3.1
>>>>>>>
>>>>>>> The release candidate comprises the following components.
>>>>>>>
>>>>>>> * A staging repository [0] containing various Maven artifacts
>>>>>>> * A branch-2.3.1 of the 2.x code [1]
>>>>>>> * The tagged source upon which we are VOTE'ing [2]
>>>>>>> * Finally, the release artifacts [3], which I would encourage you to
>>>>>>> verify for signatures and test.
>>>>>>>
>>>>>>> You should use the following KEYS [4] file to verify the signatures of
>>>>>>> all release artifacts.
>>>>>>>
>>>>>>> Please VOTE as follows
>>>>>>>
>>>>>>> [ ] +1 Push the release, I am happy :)
>>>>>>> [ ] +/-0 I am not bothered either way
>>>>>>> [ ] -1 I am not happy with this release candidate (please state why)
>>>>>>>
>>>>>>> Firstly, thank you to everyone who contributed to Nutch. Secondly,
>>>>>>> thank you to everyone who VOTEs. It is appreciated.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Lewis
>>>>>>> (on behalf of Nutch PMC)
>>>>>>>
>>>>>>> p.s. Here's my +1
>>>>>>>
>>>>>>> [0] https://repository.apache.org/content/repositories/orgapachenutch-1005
>>>>>>> [1] https://svn.apache.org/repos/asf/nutch/branches/branch-2.3.1
>>>>>>> [2] https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1
>>>>>>> [3] https://dist.apache.org/repos/dist/dev/nutch/2.3.1
>>>>>>> [4] http://www.apache.org/dist/nutch/KEYS
>>>>>>>
>>>>>>> --
>>>>>>> *Lewis*
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Lewis*
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -- 
>>>>> *Lewis*
>>>>
>>>>
>>>>
>>>>
>>>
>> 
>
