From your log:

2014-07-19 10:41:58,279 ERROR fetcher.FetcherJob - Unexpected error for https://www.google.com/finance
org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
Replace protocol-http with protocol-httpclient in your nutch-site.xml, or use
the code from the https://github.com/apache/nutch/tree/2.x branch (which
contains
https://github.com/apache/nutch/commit/5d0ecc1ada4bbbb09b61a79f4bc967ec38dcd8e1).
The 2.x branch is also faster than the 2.2 release, although of course if it's
speed you're after then you should probably use 1.8.

HTH

Julien

On 19 July 2014 06:26, Ankur Dulwani <[email protected]> wrote:

> Hi,
> Following is some part of my hadoop.log, As I am new to Nutch and Solr,
> therefore these lines jump above my head.
>
> 2014-07-19 10:43:45,314 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
> 2014-07-19 10:43:45,341 INFO solr.SolrMappingReader - source: content dest: content
> 2014-07-19 10:43:45,341 INFO solr.SolrMappingReader - source: title dest: title
> 2014-07-19 10:43:45,341 INFO solr.SolrMappingReader - source: host dest: host
> 2014-07-19 10:43:45,341 INFO solr.SolrMappingReader - source: batchId dest: batchId
> 2014-07-19 10:43:45,341 INFO solr.SolrMappingReader - source: boost dest: boost
> 2014-07-19 10:43:45,341 INFO solr.SolrMappingReader - source: digest dest: digest
> 2014-07-19 10:43:45,341 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
> 2014-07-19 10:43:45,343 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
> 2014-07-19 10:43:45,343 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2014-07-19 10:43:45,343 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2014-07-19 10:43:45,343 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2014-07-19 10:43:45,393 INFO store.HBaseStore - Keyclass and nameclass match but mismatching table names mappingfile schema is 'webpage' vs actual schema 'crawl_GF_webpage' , assuming they are the same.
> 2014-07-19 10:43:45,442 WARN mapred.FileOutputCommitter - Output path is null in cleanup
> 2014-07-19 10:43:46,161 INFO solr.SolrIndexerJob - SolrIndexerJob: done.
> 2014-07-19 10:43:47,405 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting...
> 2014-07-19 10:43:47,405 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
> 2014-07-19 10:43:47,749 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2014-07-19 10:43:47,761 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 2014-07-19 10:43:48,692 WARN mapred.FileOutputCommitter - Output path is null in cleanup
> 2014-07-19 10:43:49,323 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: done.
>
> I have also attached the complete hadoop.log file.
>
> Regards,
> Ankur Dulwani
>
> On Saturday, 19 July 2014 10:07 AM, remi tassing [via Lucene] <[email protected]> wrote:
>
> > Can you check the log file for more info?
> > default location: $NUTCH_HOME/logs/hadoop.log
> > Ref: http://www.opensourceconnections.com/blog/2014/05/24/crawling-with-nutch/
> >
> > On Fri, Jul 18, 2014 at 8:52 PM, Ankur Dulwani <[hidden email]> wrote:
> >
> > > Hi,
> > > I am using Nutch to crawl data from different sources, though it works for
> > > mostly all the websites but it gives empty result for some sites like
> > > https://www.google.com/finance.
> > >
> > > Fetcher: throughput threshold sequence: 5
> > > 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
> > >
> > > This is what I get after crawling.
> > >
> > > So I need to add any configurations or any properties to be added.
> > >
> > > Thanks in advance.
> > >
> > > --
> > > View this message in context:
> > > http://lucene.472066.n3.nabble.com/Nutch-returns-empty-result-set-for-some-websites-tp4147874.html
> > > Sent from the Nutch - User mailing list archive at Nabble.com.
>
> hadoop.log (124K) <http://lucene.472066.n3.nabble.com/attachment/4148018/0/hadoop.log>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
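
For anyone who has not overridden plugin.includes before, here is a minimal
sketch of the nutch-site.xml change suggested at the top of this message. Only
the swap of protocol-http for protocol-httpclient comes from this thread. The
remaining entries in the value are an assumption based on what I believe the
stock 2.x defaults look like, so keep whatever your own configuration already
lists there.

```xml
<!-- nutch-site.xml: values here override those in nutch-default.xml -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- protocol-httpclient takes the place of protocol-http so that https
         URLs have a protocol handler. Everything after the first "|" is
         assumed for illustration; keep your existing entries instead. -->
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  </property>
</configuration>
```

After the change, re-run the fetch and check hadoop.log again: the
ProtocolNotFound line for url=https should no longer appear if the plugin was
picked up.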

