Nutch2.3/Solr5/Cassandra2.1.3 crawl returns no data

jonathan . katon Wed, 25 Feb 2015 15:08:37 -0800


Hello,


I am installing Nutch and Solr for the first time, and as a noob I am having
a problem: Solr does not return any results after a Nutch crawl. I'm using
http://nutch.apache.org as the seed URL. Any help would be greatly
appreciated. I have looked over the Nutch and Apache logs and nothing stands
out as a problem.


On a RHEL 6.4 server I have installed Nutch 2.3, Solr 5.0 and Cassandra
2.1.3. To accomplish this I followed instructions on multiple sites
including:
http://wiki.apache.org/nutch/NutchTutorial
http://wiki.apache.org/nutch/Nutch2Tutorial
https://wiki.apache.org/nutch/Nutch2Cassandra
http://wiki.apache.org/nutch/IntranetDocumentSearch

I know Cassandra is working by testing:
> bin/cassandra-cli
Connected to: "Test Cluster" on 127.0.0.1/9160
Welcome to Cassandra CLI version 2.1.3

I know Solr is working because I have created a core named "nutch_crawler"
and I can go to the website and access the gui at http://mydomain:8983/solr


Before I built Nutch using "ant runtime" I updated the following

ivy/ivy.xml - Uncommented the line
<dependency org="org.apache.gora" name="gora-cassandra" rev="0.5" conf="*->default" />

conf/gora.properties - Added two lines
gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
gora.cassandrastore.servers=localhost:9160

conf/nutch-site.xml - Added the following:
  <property>
    <name>http.agent.name</name>
    <value>Nutch HTTP Agent Crawler</value>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.cassandra.store.CassandraStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description></description>
  </property>
  <property>
    <name>metatags.names</name>
    <value>*</value>
    <description>Names of the metatags to extract, separated by ','.
      Use '*' to extract all metatags. Prefixes the names with 'metatag.'
      in the parse-metadata. For instance to index description and keywords,
      you need to activate the plugin index-metadata and set the value of the
      parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
    </description>
  </property>
  <property>
    <name>index.parse.md</name>
    <value>*</value>
    <description></description>
  </property>
  <property>
    <name></name>
    <value></value>
    <description></description>
  </property>
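
For anyone checking a plugin.includes value like the one above, here is a minimal Python sketch. It assumes Nutch treats the value as a regular expression that has to match a plugin id in full, and that Python's regex syntax behaves the same as Java's for a pattern this simple. The helper name selected_plugins and the candidate list are made up for illustration; "protocol-httpclient" is only there as an id that should not be selected.

```python
import re

# Value of plugin.includes from nutch-site.xml, joined onto one line.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(html|tika|metatags)|"
    "index-(basic|anchor|metadata)|indexer-solr|scoring-opic|"
    "urlnormalizer-(pass|regex|basic)"
)


def selected_plugins(pattern, plugin_ids):
    """Return the plugin ids that the pattern matches in full."""
    regex = re.compile(pattern)
    return [pid for pid in plugin_ids if regex.fullmatch(pid)]


# Hypothetical list of ids to test against the pattern.
CANDIDATES = [
    "protocol-http",
    "parse-metatags",
    "index-metadata",
    "indexer-solr",
    "protocol-httpclient",
]

print(selected_plugins(PLUGIN_INCLUDES, CANDIDATES))
```

Under those assumptions, every id in the list except "protocol-httpclient" is selected.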

I then built Nutch using "ant runtime" and there were no errors. After the
build I went to runtime/local/ and created the directory "urls" and then
created a file in urls called "seeds.txt" which contains a single line
(without the quotes): "http://nutch.apache.org".

Next, to perform the crawl, I ran the following:
> ./bin/crawl urls/seeds.txt crawl2 http://mydomain:8983/solr/nutch_crawler/ 5

This runs perfectly fine with no errors returned.

If I go to core admin in the solr web gui it shows me that nutch_crawler
contains no Docs.
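
The document count can also be checked without the GUI. This is a minimal Python sketch, assuming the core answers on Solr's standard select handler at the same URL that was passed to bin/crawl. The function names count_url and num_found are made up for illustration, and "mydomain" is the same placeholder host as above. Only count_url is run here; num_found needs a reachable Solr.

```python
import json
import urllib.parse
import urllib.request


def count_url(core_url):
    """Build a select URL that matches every document but returns no rows."""
    query = urllib.parse.urlencode({"q": "*:*", "rows": 0, "wt": "json"})
    return core_url.rstrip("/") + "/select?" + query


def num_found(core_url, timeout=10):
    """Ask Solr how many documents the core holds (needs a reachable Solr)."""
    with urllib.request.urlopen(count_url(core_url), timeout=timeout) as resp:
        return json.load(resp)["response"]["numFound"]


# Core URL as passed to bin/crawl; "mydomain" is a placeholder host.
print(count_url("http://mydomain:8983/solr/nutch_crawler/"))
```

Calling num_found with the same core URL should return 0 for an empty core, matching what the core admin page shows.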

If I spit out the nutch db stats
> ./bin/nutch readdb crawl2 -status
WebTable statistics start
Statistics for WebTable:
status 2 (status_fetched):      1
min score:      1.0
retry 0:        2
jobs:   {db_stats-job_local1413246058_0001={jobID=job_local1413246058_0001,
jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0},
Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=116,
MAP_INPUT_RECORDS=2, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=14,
MAP_OUTPUT_BYTES=106, COMMITTED_HEAP_BYTES=1516240896, CPU_MILLISECONDS=0,
SPLIT_RAW_BYTES=905, COMBINE_INPUT_RECORDS=8, REDUCE_INPUT_RECORDS=7,
REDUCE_INPUT_GROUPS=7, COMBINE_OUTPUT_RECORDS=7, PHYSICAL_MEMORY_BYTES=0,
REDUCE_OUTPUT_RECORDS=7, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=8},
FileSystemCounters={FILE_BYTES_READ=1217716, FILE_BYTES_WRITTEN=1392268},
File Output Format Counters ={BYTES_WRITTEN=250}}}}
max score:      1.0
TOTAL urls:     2
status 3 (status_gone): 1
avg score:      1.0
WebTable statistics: done
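
As a quick sanity check on the statistics above, this minimal Python sketch pulls the per-status counts out of the readdb output and compares their sum with the "TOTAL urls" line. The parsing is based only on the lines shown here, so the regular expressions are an assumption about the output format, and the function names are made up for illustration.

```python
import re

# Relevant lines copied from the readdb -status output above.
STATUS_OUTPUT = """\
status 2 (status_fetched):      1
retry 0:        2
TOTAL urls:     2
status 3 (status_gone): 1
"""


def status_counts(text):
    """Map each status name (e.g. status_fetched) to its count."""
    pattern = re.compile(r"^status \d+ \((\w+)\):\s+(\d+)$", re.MULTILINE)
    return {name: int(count) for name, count in pattern.findall(text)}


def total_urls(text):
    """Return the number on the 'TOTAL urls' line, or None if it is absent."""
    match = re.search(r"^TOTAL urls:\s+(\d+)$", text, re.MULTILINE)
    return int(match.group(1)) if match else None


counts = status_counts(STATUS_OUTPUT)
print(counts)
print(sum(counts.values()) == total_urls(STATUS_OUTPUT))
```

The two statuses (one fetched, one gone) account for both URLs in the table.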

If I spit out the nutch db dump
> ./bin/nutch readdb crawl2 -dump crawl2dump
> more crawl2dump/part-r-00000
http://nutch.apache.org/        key:    org.apache.nutch:http/
baseUrl:        null
status: 2 (status_fetched)
fetchTime:      1427496745962
prevFetchTime:  1424904733150
fetchInterval:  2592000
retriesSinceFetch:      0
modifiedTime:   0
prevModifiedTime:       0
protocolStatus: (null)
parseStatus:    (null)
title:  null
score:  1.0
marker _injmrk_ :       y
marker dist :   0
reprUrl:        null
batchId:        1424904736-28420
metadata _csh_ :
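
To make the time fields in the dump easier to read, here is a minimal Python sketch. It assumes fetchTime and prevFetchTime are epoch milliseconds and fetchInterval is in seconds; the dump itself does not state the units, and the helper name from_millis is made up for illustration.

```python
from datetime import datetime, timezone

# Values copied from the dump above.
FETCH_TIME = 1427496745962       # assumed epoch milliseconds
PREV_FETCH_TIME = 1424904733150  # assumed epoch milliseconds
FETCH_INTERVAL = 2592000         # assumed seconds


def from_millis(ms):
    """Convert epoch milliseconds to a UTC datetime (whole seconds)."""
    return datetime.fromtimestamp(ms // 1000, tz=timezone.utc)


FMT = "%Y-%m-%d %H:%M:%S"
print("prevFetchTime:", from_millis(PREV_FETCH_TIME).strftime(FMT))
print("fetchTime:    ", from_millis(FETCH_TIME).strftime(FMT))
print("interval days:", FETCH_INTERVAL / 86400)
# Gap between the two timestamps, minus the fetch interval, in seconds.
print("gap - interval:", (FETCH_TIME - PREV_FETCH_TIME) // 1000 - FETCH_INTERVAL)
```

Under these assumptions prevFetchTime is 2015-02-25 22:52:13 UTC, shortly before this mail was sent, and fetchTime is 2015-03-27 22:52:25 UTC, about one 30-day fetchInterval later. That would be consistent with fetchTime holding the next scheduled fetch rather than the time of the last one.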





---------------------------------------------------------

Jonathan Katon

Design Technology Group, Teradyne, Inc.
Software Tools Engineer

Office: 978-370-3561
Cell: 978-809-4001
Email: [email protected]

