Re: Nutch returns empty result set for some websites

Julien Nioche Mon, 21 Jul 2014 08:42:29 -0700

From your log:

2014-07-19 10:41:58,279 ERROR fetcher.FetcherJob - Unexpected error
for https://www.google.com/finance
org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https


Your configuration has no protocol plugin that handles https URLs.
Either replace protocol-http with protocol-httpclient in your
nutch-site.xml, or use the code from the
https://github.com/apache/nutch/tree/2.x branch (which contains
https://github.com/apache/nutch/commit/5d0ecc1ada4bbbb09b61a79f4bc967ec38dcd8e1).
The 2.x branch is also faster than the 2.2 release, although if speed
is what you're after you should probably use 1.8.
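
For illustration, the nutch-site.xml change would look roughly like the
sketch below. Only the swap of protocol-http for protocol-httpclient comes
from the advice above; the remaining entries in the value are an assumed,
typical plugin list, so keep whatever your own plugin.includes already
contains and change just the protocol plugin.

```xml
<!-- conf/nutch-site.xml: a sketch, not a drop-in file.
     Everything in the value except protocol-httpclient is an assumed
     typical plugin list; keep the entries from your own configuration. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  </property>
</configuration>
```

If the ProtocolNotFound error for url=https no longer shows up in
hadoop.log on the next crawl, the plugin was picked up.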

HTH

Julien



On 19 July 2014 06:26, Ankur Dulwani <[email protected]> wrote:

> Hi,
> The following is part of my hadoop.log. As I am new to Nutch and Solr,
> these lines go over my head.
>
>
>
> 2014-07-19 10:43:45,314 INFO  mapreduce.GoraRecordReader -
> gora.buffer.read.limit = 10000
> 2014-07-19 10:43:45,341 INFO  solr.SolrMappingReader - source: content
> dest: content
> 2014-07-19 10:43:45,341 INFO  solr.SolrMappingReader - source: title dest:
> title
> 2014-07-19 10:43:45,341 INFO  solr.SolrMappingReader - source: host dest:
> host
> 2014-07-19 10:43:45,341 INFO  solr.SolrMappingReader - source: batchId
> dest: batchId
> 2014-07-19 10:43:45,341 INFO  solr.SolrMappingReader - source: boost dest:
> boost
> 2014-07-19 10:43:45,341 INFO  solr.SolrMappingReader - source: digest
> dest: digest
> 2014-07-19 10:43:45,341 INFO  solr.SolrMappingReader - source: tstamp
> dest: tstamp
> 2014-07-19 10:43:45,343 INFO  basic.BasicIndexingFilter - Maximum title
> length for indexing set to: 100
> 2014-07-19 10:43:45,343 INFO  indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2014-07-19 10:43:45,343 INFO  anchor.AnchorIndexingFilter - Anchor
> deduplication is: off
> 2014-07-19 10:43:45,343 INFO  indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2014-07-19 10:43:45,393 INFO  store.HBaseStore - Keyclass and nameclass
> match but mismatching table names  mappingfile schema is 'webpage' vs
> actual schema 'crawl_GF_webpage' , assuming they are the same.
> 2014-07-19 10:43:45,442 WARN  mapred.FileOutputCommitter - Output path is
> null in cleanup
> 2014-07-19 10:43:46,161 INFO  solr.SolrIndexerJob - SolrIndexerJob: done.
> 2014-07-19 10:43:47,405 INFO  solr.SolrDeleteDuplicates -
> SolrDeleteDuplicates: starting...
> 2014-07-19 10:43:47,405 INFO  solr.SolrDeleteDuplicates -
> SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
> 2014-07-19 10:43:47,749 WARN  util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> 2014-07-19 10:43:47,761 WARN  mapred.JobClient - No job jar file set.
> User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 2014-07-19 10:43:48,692 WARN  mapred.FileOutputCommitter - Output path is
> null in cleanup
> 2014-07-19 10:43:49,323 INFO  solr.SolrDeleteDuplicates -
> SolrDeleteDuplicates: done.
>
>
> I have also attached the complete hadoop.log file.
>
>
> Regards,
> Ankur Dulwani
>
>
>
> On Saturday, 19 July 2014 10:07 AM, remi tassing [via Lucene] <
> [email protected]> wrote:
>
>
>
> Can you check the log file for more info?
>
> default location: $NUTCH_HOME/logs/hadoop.log
>
> Ref:
> http://www.opensourceconnections.com/blog/2014/05/24/crawling-with-nutch/
>
>
> On Fri, Jul 18, 2014 at 8:52 PM, Ankur Dulwani <[hidden email]>
> wrote:
>
>
> > Hi,
> > I am using Nutch to crawl data from different sources. It works for
> > most websites, but it gives an empty result for some sites, such as
> > https://www.google.com/finance.
> >
> > Fetcher: throughput threshold sequence: 5
> > 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
> > in 0 queues
> >
> >
> > This is what I get after crawling.
> >
> > Do I need to add any configuration or properties?
> >
> > Thanks in advance.
> >
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Nutch-returns-empty-result-set-for-some-websites-tp4147874.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
>
>
>
> hadoop.log (124K) <
> http://lucene.472066.n3.nabble.com/attachment/4148018/0/hadoop.log>
>
>
>
>
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
