Hi Lewis, hi Sherban,

I have to turn my vote into a

-1

The crawl (if run from bin/crawl) isn't working because the generator ignores the batch id passed via the option -batchId, see https://issues.apache.org/jira/browse/NUTCH-2143. Thanks, Sherban, for being insistent! The logs you sent point to the same problem:

> Generating a new fetchlist
> .../bin/nutch generate ... -batchId 1443749246-29495
> ...
> GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
> Fetching :
> .../bin/nutch fetch ... 1443749246-29495 ...
> ...
> FetcherJob: batchId: 1443749246-29495

If you use the batch id logged by the Generator (1443749246-1282586680) for the steps "fetch", "parse", and "updatedb", the crawl should step forward.
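As a workaround you could run the individual steps by hand and reuse the id printed by the Generator, roughly like this (an untested sketch only, using the crawl id from your log and leaving out the -D Hadoop options that bin/crawl adds):

  # run from runtime/local; replace <batchId> with the id printed by GeneratorJob,
  # e.g. "GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs"
  bin/nutch generate -topN 50000 -adddays 0 -crawlId methods
  bin/nutch fetch <batchId> -crawlId methods -threads 50
  bin/nutch parse <batchId> -crawlId methods
  bin/nutch updatedb <batchId> -crawlId methods

That's essentially what bin/crawl runs, except that it passes its own pre-generated batch id to the follow-up steps, and that id is never used by the generator.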
Of course, this is no option for a released 2.3.1! We have to fix this bug. :)

Thanks,
Sebastian

On 10/05/2015 07:53 PM, Drulea, Sherban wrote:
> Hi Sebastian,
>
> I tried multiple URLs in my seed.txt file. None of them result in the nutch generator crawling any links.
>
> Here’s my environment:
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> SOLR 4.6.0
> Mongo version 3.0.2.
> Nutch 2.3.1
>
> ―――――――――――――――
> regex-urlfilter.txt:
> ―――――――――――――――
> +.
>
> ―――――――――――――――
> nutch-site.xml
> ―――――――――――――――
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>
> <property>
>   <name>http.agent.name</name>
>   <value>nutch Mongo Solr Crawler</value>
> </property>
>
> <property>
>   <name>storage.data.store.class</name>
>   <value>org.apache.gora.mongodb.store.MongoStore</value>
>   <description>Default class for storing data</description>
> </property>
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
>   <description>Regular expression naming plugin directory names to include.</description>
> </property>
>
> </configuration>
>
> ―――――――――――――――
> gora.properties:
> ―――――――――――――――
> ############################
> # MongoDBStore properties  #
> ############################
> gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
> gora.mongodb.override_hadoop_configuration=false
> gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
> gora.mongodb.servers=localhost:27017
> gora.mongodb.db=method_centers
>
> ―――――――――――――――
> seed.txt
> ―――――――――――――――
> http://punklawyer.com
> http://mail-archives.apache.org/mod_mbox/nutch-user/
> http://hbase.apache.org/index.html
> http://wiki.apache.org/nutch/FrontPage
> http://www.aintitcool.com/
> ―――――――――――――――
>
> Here are the results of the crawl command "./bin/crawl urls methods http://127.0.0.1:8983/solr/ 2":
> Injecting seed URLs
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch inject urls -crawlId methods
> InjectorJob: starting at 2015-10-01 18:27:23
> InjectorJob: Injecting urlDir: urls
> InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
> InjectorJob: total number of urls rejected by filters: 0
> InjectorJob: total number of urls injected after normalization and filtering: 5
> Injector: finished at 2015-10-01 18:27:26, elapsed: 00:00:02
> Thu Oct 1 18:27:26 PDT 2015 : Iteration 1 of 2
> Generating batchId
> Generating a new fetchlist
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId methods -batchId 1443749246-29495
> GeneratorJob: starting at 2015-10-01 18:27:26
> GeneratorJob: Selecting best-scoring urls due for fetch.
> GeneratorJob: starting
> GeneratorJob: filtering: false
> GeneratorJob: normalizing: false
> GeneratorJob: topN: 50000
> GeneratorJob: finished at 2015-10-01 18:27:29, time elapsed: 00:00:02
> GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
> Fetching :
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D fetcher.timelimit.mins=180 1443749246-29495 -crawlId methods -threads 50
> FetcherJob: starting at 2015-10-01 18:27:29
> FetcherJob: batchId: 1443749246-29495
> FetcherJob: threads: 50
> FetcherJob: parsing: false
> FetcherJob: resuming: false
> FetcherJob : timelimit set for : 1443760049865
> Using queue mode : byHost
> Fetcher: threads: 50
> QueueFeeder finished: total 0 records. Hit by time limit :0
> -finishing thread FetcherThread0, activeThreads=0
> ...
> -finishing thread FetcherThread49, activeThreads=0
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold sequence: 5
> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
> -activeThreads=0
> Using queue mode : byHost
> Fetcher: threads: 50
> QueueFeeder finished: total 0 records. Hit by time limit :0
> -finishing thread FetcherThread0, activeThreads=0
> ...
>
> -finishing thread FetcherThread48, activeThreads=0
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold sequence: 5
> -finishing thread FetcherThread49, activeThreads=0
> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
> -activeThreads=0
> FetcherJob: finished at 2015-10-01 18:27:42, time elapsed: 00:00:12
> Parsing :
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1 1443749246-29495 -crawlId methods
> ParserJob: starting at 2015-10-01 18:27:43
> ParserJob: resuming: false
> ParserJob: forced reparse: false
> ParserJob: batchId: 1443749246-29495
> ParserJob: success
> ParserJob: finished at 2015-10-01 18:27:45, time elapsed: 00:00:02
> CrawlDB update for methods
>
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1443749246-29495 -crawlId methods
> DbUpdaterJob: starting at 2015-10-01 18:27:46
> DbUpdaterJob: batchId: 1443749246-29495
> DbUpdaterJob: finished at 2015-10-01 18:27:48, time elapsed: 00:00:02
> Indexing methods on SOLR index -> http://127.0.0.1:8983/solr/
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://127.0.0.1:8983/solr/ -all -crawlId methods
> IndexingJob: starting
> Active IndexWriters :
> SOLRIndexWriter
>   solr.server.url : URL of the SOLR instance (mandatory)
>   solr.commit.size : buffer size when sending to SOLR (default 1000)
>   solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>   solr.auth : use authentication (default false)
>   solr.auth.username : username for authentication
>   solr.auth.password : password for authentication
>
>
> IndexingJob: done.
> SOLR dedup -> http://127.0.0.1:8983/solr/
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true http://127.0.0.1:8983/solr/
> Thu Oct 1 18:27:54 PDT 2015 : Iteration 2 of 2
> Generating batchId
> Generating a new fetchlist
> /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId methods -batchId 1443749274-17203
> GeneratorJob: starting at 2015-10-01 18:27:55
> GeneratorJob: Selecting best-scoring urls due for fetch.
> GeneratorJob: starting
> GeneratorJob: filtering: false
> GeneratorJob: normalizing: false
> GeneratorJob: topN: 50000
> GeneratorJob: finished at 2015-10-01 18:27:57, time elapsed: 00:00:02
> GeneratorJob: generated batch id: 1443749275-2050785747 containing 0 URLs
> Generate returned 1 (no new segments created)
> Escaping loop: no more URLs to fetch now
>
> There are no errors but also no data. What else can I debug?
>
> I see some warnings in my hadoop.log but nothing glaring...
>
> 2015-10-01 18:19:29,430 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2015-10-01 18:19:29,441 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2015-10-01 18:19:29,441 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2015-10-01 18:19:29,442 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2015-10-01 18:19:30,326 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
> 2015-10-01 18:19:30,327 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
> 2015-10-01 18:19:30,405 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
> 2015-10-01 18:19:30,406 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
> ...
> 2015-10-01 18:27:23,838 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2015-10-01 18:27:24,567 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
> 2015-10-01 18:27:24,969 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
> 2015-10-01 18:27:24,971 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
> 2015-10-01 18:27:25,050 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
> 2015-10-01 18:27:25,052 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
>
> 2015-10-01 18:27:30,288 INFO httpclient.Http - http.proxy.host = null
> 2015-10-01 18:27:30,288 INFO httpclient.Http - http.proxy.port = 8080
> 2015-10-01 18:27:30,288 INFO httpclient.Http - http.timeout = 10000
> 2015-10-01 18:27:30,288 INFO httpclient.Http - http.content.limit = 65536
> 2015-10-01 18:27:30,288 INFO httpclient.Http - http.agent = nutch Mongo Solr Crawler/Nutch-2.4-SNAPSHOT
> 2015-10-01 18:27:30,288 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2015-10-01 18:27:30,288 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.proxy.host = null
> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.proxy.port = 8080
> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.timeout = 10000
> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.content.limit = 65536
> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.agent = nutch Mongo Solr Crawler/Nutch-2.4-SNAPSHOT
> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2015-10-01 18:27:30,292 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>
> I’ve been trying this for 3 days with no luck. I want to use Nutch but may be forced to use another program.
>
> My best guess is that maybe something is borked with my plugin.includes:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
>   <description>Regular expression naming plugin directory names to include.</description>
> </property>
>
> Are these valid? Is there a more minimal set to try?
>
> Cheers,
> Sherban
>
>
> On 10/4/15, 12:23 PM, "Sebastian Nagel" <[email protected]> wrote:
>
>> Hi Sherban,
>>
>>> Right now it finds 0 URLs with no errors.
>>
>> Can you specify what's going wrong? It could be anything, even a configuration problem.
>> What did you crawl? Using which storage back-end?
>>
>> Thanks,
>> Sebastian
>>
>>
>> On 10/02/2015 03:02 AM, Drulea, Sherban wrote:
>>> Hi Lewis,
>>>
>>> -1 until I verify Nutch actually crawls. Right now it finds 0 URLs with no errors.
>>>
>>> 2.3.1 is an improvement over 2.3.0, which didn't work with Mongo at all.
>>>
>>> Cheers,
>>> Sherban
>>>
>>>
>>>
>>> On 9/30/15, 5:35 PM, "Lewis John Mcgibbney" <[email protected]> wrote:
>>>
>>>> Hi Folks,
>>>> Is anyone else able to test and run the release candidate for 2.3.1?
>>>> It would be great to get a release if we can get the VOTEs and the RC is suitable.
>>>> Thanks in advance.
>>>> Best
>>>> Lewis
>>>>
>>>> On Wed, Sep 23, 2015 at 9:46 PM, Lewis John Mcgibbney <[email protected]> wrote:
>>>>
>>>>> Hi Folks,
>>>>> It turns out the formatting for the original email below was terrible. Sorry about that.
>>>>> I've hopefully corrected the formatting now. Please VOTE away!
>>>>>
>>>>> On Tue, Sep 22, 2015 at 6:45 PM, Lewis John Mcgibbney <[email protected]> wrote:
>>>>>
>>>>>> Hi user@ & dev@,
>>>>>>
>>>>>> This thread is a VOTE for releasing Apache Nutch 2.3.1 RC#1.
>>>>>>
>>>>>> We addressed 32 issues in all, which can be seen in the release report:
>>>>>> http://s.apache.org/nutch_2.3.1
>>>>>>
>>>>>> The release candidate comprises the following components.
>>>>>>
>>>>>> * A staging repository [0] containing various Maven artifacts
>>>>>> * A branch-2.3.1 of the 2.x code [1]
>>>>>> * The tagged source upon which we are VOTE'ing [2]
>>>>>> * Finally, the release artifacts [3], which I would encourage you to verify for signatures and test.
>>>>>>
>>>>>> You should use the following KEYS [4] file to verify the signatures of all release artifacts.
>>>>>>
>>>>>> Please VOTE as follows
>>>>>>
>>>>>> [ ] +1 Push the release, I am happy :)
>>>>>> [ ] +/-0 I am not bothered either way
>>>>>> [ ] -1 I am not happy with this release candidate (please state why)
>>>>>>
>>>>>> Firstly, thank you to everyone that contributed to Nutch. Secondly, thank you to everyone that VOTEs. It is appreciated.
>>>>>>
>>>>>> Thanks
>>>>>> Lewis
>>>>>> (on behalf of Nutch PMC)
>>>>>>
>>>>>> p.s. Here's my +1
>>>>>>
>>>>>> [0] https://repository.apache.org/content/repositories/orgapachenutch-1005
>>>>>> [1] https://svn.apache.org/repos/asf/nutch/branches/branch-2.3.1
>>>>>> [2] https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1
>>>>>> [3] https://dist.apache.org/repos/dist/dev/nutch/2.3.1
>>>>>> [4] http://www.apache.org/dist/nutch/KEYS
>>>>>>
>>>>>> --
>>>>>> *Lewis*
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Lewis*
>>>>
>>>>
>>>>
>>>> --
>>>> *Lewis*
>>>
>>>
>>>
>>> __________________________________________________________________________
>>>
>>> This email message is for the sole use of the intended recipient(s) and may contain confidential information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.
>>>
>>
>

