Re: Nutch returns empty result set for some websites

Ankur Dulwani Fri, 18 Jul 2014 22:28:28 -0700

Hi,
Below is part of my hadoop.log. I am new to Nutch and Solr, so these lines 
go over my head.

2014-07-19 10:43:45,314 INFO  mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2014-07-19 10:43:45,341 INFO  solr.SolrMappingReader - source: content dest: content
2014-07-19 10:43:45,341 INFO  solr.SolrMappingReader - source: title dest: title
2014-07-19 10:43:45,341 INFO  solr.SolrMappingReader - source: host dest: host
2014-07-19 10:43:45,341 INFO  solr.SolrMappingReader - source: batchId dest: batchId
2014-07-19 10:43:45,341 INFO  solr.SolrMappingReader - source: boost dest: boost
2014-07-19 10:43:45,341 INFO  solr.SolrMappingReader - source: digest dest: digest
2014-07-19 10:43:45,341 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2014-07-19 10:43:45,343 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2014-07-19 10:43:45,343 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2014-07-19 10:43:45,343 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2014-07-19 10:43:45,343 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2014-07-19 10:43:45,393 INFO  store.HBaseStore - Keyclass and nameclass match but mismatching table names  mappingfile schema is 'webpage' vs actual schema 'crawl_GF_webpage' , assuming they are the same.
2014-07-19 10:43:45,442 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2014-07-19 10:43:46,161 INFO  solr.SolrIndexerJob - SolrIndexerJob: done.
2014-07-19 10:43:47,405 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting...
2014-07-19 10:43:47,405 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
2014-07-19 10:43:47,749 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-07-19 10:43:47,761 WARN  mapred.JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2014-07-19 10:43:48,692 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2014-07-19 10:43:49,323 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: done.

I have also attached the complete hadoop.log file.

Regards,
Ankur Dulwani

On Saturday, 19 July 2014 10:07 AM, remi tassing [via Lucene] 
<[email protected]> wrote:

Can you check the log file for more info? 

default location: $NUTCH_HOME/logs/hadoop.log 

Ref: 
http://www.opensourceconnections.com/blog/2014/05/24/crawling-with-nutch/
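
Since hadoop.log is mostly INFO lines, it can help to pull out only the 
WARN/ERROR entries before reading further. The following is a minimal 
sketch, not part of Nutch: it assumes the "date time LEVEL logger - message" 
layout seen in the log excerpt in this thread, and the names parse_log, 
problems and scan_file are made up for illustration. Lines that do not start 
with a timestamp are treated as continuations of the previous entry.

```python
import re

# Assumed layout: "2014-07-19 10:43:45,442 WARN  mapred.FileOutputCommitter - message"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+([A-Z]+)\s+(\S+) - ?(.*)$"
)


def parse_log(lines):
    """Turn raw log lines into dicts; fold wrapped lines into the previous entry."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE_RE.match(line)
        if match:
            ts, level, logger, msg = match.groups()
            entries.append(
                {"time": ts, "level": level, "logger": logger, "message": msg.strip()}
            )
        elif entries and line.strip():
            # No timestamp: treat as a continuation (wrapped text or stack trace).
            joined = entries[-1]["message"] + " " + line.strip()
            entries[-1]["message"] = joined.strip()
    return entries


def problems(lines, levels=("WARN", "ERROR", "FATAL")):
    """Return only the entries whose level is in `levels`."""
    return [entry for entry in parse_log(lines) if entry["level"] in levels]


def scan_file(path):
    """Print the WARN/ERROR/FATAL entries of one log file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for entry in problems(handle):
            print(entry["time"], entry["level"], entry["logger"], "-", entry["message"])
```

For example, scan_file("logs/hadoop.log") run from the Nutch home directory 
would print only the warnings and errors, assuming the log is in the default 
location mentioned above.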

On Fri, Jul 18, 2014 at 8:52 PM, Ankur Dulwani <[hidden email]> 
wrote: 

> Hi, 
> I am using Nutch to crawl data from different sources. It works for most 
> websites, but it returns an empty result for some sites, such as 
> https://www.google.com/finance. 
> 
> Fetcher: throughput threshold sequence: 5 
> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues 
> 
> 
> This is what I get after crawling. 
> 
> Do I need to add any configuration or set any properties? 
> 
> Thanks in advance. 
> 
> 
> 
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Nutch-returns-empty-result-set-for-some-websites-tp4147874.html
> Sent from the Nutch - User mailing list archive at Nabble.com. 
> 


hadoop.log (124K) 
<http://lucene.472066.n3.nabble.com/attachment/4148018/0/hadoop.log>

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nutch-returns-empty-result-set-for-some-websites-tp4147874p4148018.html
Sent from the Nutch - User mailing list archive at Nabble.com.
