Re: Nutch returns empty result set for some websites

Ankur Dulwani Tue, 22 Jul 2014 03:14:46 -0700

I have replaced protocol-http with protocol-httpclient in my nutch-site.xml.

With this change Nutch is able to crawl other secure (https) sites, but there
is still no response for the https://www.google.com/finance site.

 
Regards,
Ankur Dulwani



On Monday, 21 July 2014 9:12 PM, Julien Nioche-4 [via Lucene] 
<[email protected]> wrote:
 


From your log:

2014-07-19 10:41:58,279 ERROR fetcher.FetcherJob - Unexpected error for https://www.google.com/finance
org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https

Replace protocol-http with protocol-httpclient in your nutch-site.xml,
or use the code from the https://github.com/apache/nutch/tree/2.x
branch (which contains
https://github.com/apache/nutch/commit/5d0ecc1ada4bbbb09b61a79f4bc967ec38dcd8e1).
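
For reference, a minimal sketch of what that change could look like in
conf/nutch-site.xml. Only the swap of protocol-http for protocol-httpclient
comes from the advice above; the rest of the plugin list is an assumption
and should be copied from the plugin.includes value in your own
nutch-default.xml:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumed plugin list: keep your existing plugin.includes value and
       change only protocol-http to protocol-httpclient. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  </property>
</configuration>
```

A value set in nutch-site.xml overrides the one in nutch-default.xml, so the
rest of the default configuration stays untouched.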
 
The 2.x branch is also faster than the 2.2 release, although of course
if it's speed you're after then you should probably use 1.8.

HTH 

Julien 



On 19 July 2014 06:26, Ankur Dulwani <[hidden email]> wrote: 


> Hi, 
> Below is part of my hadoop.log. I am new to Nutch and Solr,
> so these lines go over my head.
> 
> 
> 
> 2014-07-19 10:43:45,314 INFO  mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
> 2014-07-19 10:43:45,341 INFO  solr.SolrMappingReader - source: content dest: content
> 2014-07-19 10:43:45,341 INFO  solr.SolrMappingReader - source: title dest: title
> 2014-07-19 10:43:45,341 INFO  solr.SolrMappingReader - source: host dest: host
> 2014-07-19 10:43:45,341 INFO  solr.SolrMappingReader - source: batchId dest: batchId
> 2014-07-19 10:43:45,341 INFO  solr.SolrMappingReader - source: boost dest: boost
> 2014-07-19 10:43:45,341 INFO  solr.SolrMappingReader - source: digest dest: digest
> 2014-07-19 10:43:45,341 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
> 2014-07-19 10:43:45,343 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
> 2014-07-19 10:43:45,343 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2014-07-19 10:43:45,343 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2014-07-19 10:43:45,343 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2014-07-19 10:43:45,393 INFO  store.HBaseStore - Keyclass and nameclass match but mismatching table names  mappingfile schema is 'webpage' vs actual schema 'crawl_GF_webpage' , assuming they are the same.
> 2014-07-19 10:43:45,442 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
> 2014-07-19 10:43:46,161 INFO  solr.SolrIndexerJob - SolrIndexerJob: done.
> 2014-07-19 10:43:47,405 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting...
> 2014-07-19 10:43:47,405 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
> 2014-07-19 10:43:47,749 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2014-07-19 10:43:47,761 WARN  mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 2014-07-19 10:43:48,692 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
> 2014-07-19 10:43:49,323 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: done.
> 
> 
> I have also attached the complete hadoop.log file. 
> 
> 
> Regards, 
> Ankur Dulwani 
> 
> 
> 
> On Saturday, 19 July 2014 10:07 AM, remi tassing [via Lucene] <[hidden email]> wrote:
> 
> 
> 
> Can you check the log file for more info? 
> 
> default location: $NUTCH_HOME/logs/hadoop.log 
> 
> Ref: 
> http://www.opensourceconnections.com/blog/2014/05/24/crawling-with-nutch/
> 
> 
> On Fri, Jul 18, 2014 at 8:52 PM, Ankur Dulwani <[hidden email]> wrote:
> 
> 
> > Hi, 
> > I am using Nutch to crawl data from different sources. It works for
> > most websites, but it gives an empty result for some sites, such as
> > https://www.google.com/finance.
> > 
> > Fetcher: throughput threshold sequence: 5 
> > 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
> > 
> > 
> > This is what I get after crawling. 
> > 
> > Do I need to add any configuration or properties?
> > 
> > Thanks in advance. 
> > 
> > 
> > 
> > -- 
> > View this message in context: 
> > http://lucene.472066.n3.nabble.com/Nutch-returns-empty-result-set-for-some-websites-tp4147874.html
> > Sent from the Nutch - User mailing list archive at Nabble.com. 
> > 
> 
> 
> ________________________________ 
> 
> If you reply to this email, your message will be added to the discussion 
> below: 
> http://lucene.472066.n3.nabble.com/Nutch-returns-empty-result-set-for-some-websites-tp4147874p4148015.html
> 
> hadoop.log (124K) <http://lucene.472066.n3.nabble.com/attachment/4148018/0/hadoop.log>
> 
> 
> 
> 
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Nutch-returns-empty-result-set-for-some-websites-tp4147874p4148018.html
> Sent from the Nutch - User mailing list archive at Nabble.com. 
> 


-- 

Open Source Solutions for Text Engineering 

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nutch-returns-empty-result-set-for-some-websites-tp4147874p4148544.html
Sent from the Nutch - User mailing list archive at Nabble.com.
