Re: Nutch2.3/Solr5/Cassandra2.1.3 crawl returns no data

Lewis John Mcgibbney Wed, 25 Feb 2015 15:38:52 -0800

Hi Jonathan,

There are two other threads ongoing, namely


http://www.mail-archive.com/user%40nutch.apache.org/msg13237.html
and
http://www.mail-archive.com/user%40nutch.apache.org/msg13235.html

Please monitor those links and we can take it from there.
I would strongly suggest that you set the logging levels to DEBUG within
log4j.properties and then create a fresh log.
Then step through the individual stages of the crawl cycle and try to
verify whether you are losing data between FetcherJob and ParserJob.
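
As a rough aid for that last check, here is a minimal Python sketch that
scans a readdb dump for rows that were fetched but carry no parse status.
It assumes the dump layout shown in your message below (a record starts
with "<url> key: <reversed-url>", followed by "name: value" lines). The
names parse_dump, find_unparsed and SAMPLE_DUMP are hypothetical helpers
for illustration and are not part of Nutch.

```python
import re

# Assumed layout, copied from the readdb dump quoted in this thread:
# a record starts with "<url> key: <reversed-url>", then "name: value" lines.
SAMPLE_DUMP = (
    "http://nutch.apache.org/\tkey:\torg.apache.nutch:http/\n"
    "baseUrl:\tnull\n"
    "status:\t2 (status_fetched)\n"
    "protocolStatus:\t(null)\n"
    "parseStatus:\t(null)\n"
    "title:\tnull\n"
)

# Matches the first line of a record and captures the URL.
RECORD_START = re.compile(r"^(\S+)\s+key:\s+\S+")


def parse_dump(dump_text):
    """Split dump text into a list of dicts, one per record."""
    records = []
    current = None
    for line in dump_text.splitlines():
        match = RECORD_START.match(line)
        if match:
            current = {"url": match.group(1)}
            records.append(current)
            continue
        if current is None or ":" not in line:
            continue  # skip blank lines and anything before the first record
        name, _, value = line.partition(":")
        current[name.strip()] = value.strip()
    return records


def find_unparsed(dump_text):
    """Return URLs marked status_fetched whose parseStatus is missing or (null)."""
    return [
        record["url"]
        for record in parse_dump(dump_text)
        if "status_fetched" in record.get("status", "")
        and record.get("parseStatus", "(null)") == "(null)"
    ]


if __name__ == "__main__":
    print(find_unparsed(SAMPLE_DUMP))  # ['http://nutch.apache.org/']
```

To run it against a real dump, pass the contents of
crawl2dump/part-r-00000 to find_unparsed. Any URL it returns was fetched
without a recorded parse status, which is the stage to inspect in the
DEBUG log.
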
Thank you
Lewis



On Wed, Feb 25, 2015 at 3:06 PM, <[email protected]> wrote:

> I am installing Nutch and Solr for the first time and as a noob I am
> having a problem with Nutch and Solr not returning any results after a
> crawl - I'm using http://nutch.apache.org . Any help would be greatly
> appreciated. I have looked over the Nutch and Apache logs and nothing is
> popping out at me as a problem.
>
>
> On a RHEL 6.4 server I have installed Nutch 2.3, Solr 5.0 and Cassandra
> 2.1.3. To accomplish this I followed instructions on multiple sites
> including:
> http://wiki.apache.org/nutch/NutchTutorial
> http://wiki.apache.org/nutch/Nutch2Tutorial
> https://wiki.apache.org/nutch/Nutch2Cassandra
> http://wiki.apache.org/nutch/IntranetDocumentSearch
>
> I know Cassandra is working by testing:
> > bin/cassandra-cli
> Connected to: "Test Cluster" on 127.0.0.1/9160
> Welcome to Cassandra CLI version 2.1.3
>
> I know Solr is working because I have created a core named "nutch_crawler"
> and I can go to the website and access the gui at
> http://mydomain:8983/solr
>
>
> Before I built Nutch using "ant runtime" I updated the following
>
> *Ivy/ivy.xml* - Uncommented the line
> <dependency org="org.apache.gora" name="gora-cassandra" rev="0.5"
> conf="*->default" />
>
> *Conf/gora.properties* - Added two lines
> gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
> gora.cassandrastore.servers=localhost:9160
>
> *Conf/nutch-site.xml* - Added the following:
>   <property>
>     <name>http.agent.name</name>
>     <value>Nutch HTTP Agent Crawler</value>
>   </property>
>   <property>
>     <name>storage.data.store.class</name>
>     <value>org.apache.gora.cassandra.store.CassandraStore</value>
>     <description>Default class for storing data</description>
>   </property>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>     <description></description>
>   </property>
>   <property>
>     <name>metatags.names</name>
>     <value>*</value>
>     <description>Names of the metatags to extract, separated by ','.
>       Use '*' to extract all metatags. Prefixes the names with 'metatag.'
>       in the parse-metadata. For instance, to index description and
>       keywords, you need to activate the plugin index-metadata and set
>       the value of the parameter 'index.parse.md' to
>       'metatag.description,metatag.keywords'.
>     </description>
>   </property>
>   <property>
>     <name>index.parse.md</name>
>     <value>*</value>
>     <description></description>
>   </property>
>   <property>
>     <name></name>
>     <value></value>
>     <description></description>
>   </property>
>
> I then built Nutch using "ant runtime" and there were no errors. After the
> build I went to runtime/local/ and created the directory "urls" and then
> created a file in urls called "seeds.txt" which contains a single line
> (without the quotes): "http://nutch.apache.org".
>
> Next, to perform the crawl, I ran the following:
> > ./bin/crawl urls/seeds.txt crawl2 http://mydomain:8983/solr/nutch_crawler/ 5
>
> This runs perfectly fine with no errors returned.
>
> If I go to core admin in the solr web gui it shows me that nutch_crawler
> contains no Docs.
>
> If I spit out the nutch db stats
> > ./bin/nutch readdb crawl2 -status
> WebTable statistics start
> Statistics for WebTable:
> status 2 (status_fetched):      1
> min score:      1.0
> retry 0:        2
> jobs:
> {db_stats-job_local1413246058_0001={jobID=job_local1413246058_0001,
> jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0},
> Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=116,
> MAP_INPUT_RECORDS=2, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=14,
> MAP_OUTPUT_BYTES=106, COMMITTED_HEAP_BYTES=1516240896, CPU_MILLISECONDS=0,
> SPLIT_RAW_BYTES=905, COMBINE_INPUT_RECORDS=8, REDUCE_INPUT_RECORDS=7,
> REDUCE_INPUT_GROUPS=7, COMBINE_OUTPUT_RECORDS=7, PHYSICAL_MEMORY_BYTES=0,
> REDUCE_OUTPUT_RECORDS=7, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=8},
> FileSystemCounters={FILE_BYTES_READ=1217716, FILE_BYTES_WRITTEN=1392268},
> File Output Format Counters ={BYTES_WRITTEN=250}}}}
> max score:      1.0
> TOTAL urls:     2
> status 3 (status_gone): 1
> avg score:      1.0
> WebTable statistics: done
>
> If I spit out the nutch db dump
> > ./bin/nutch readdb crawl2 -dump crawl2dump
> > more crawl2dump/part-r-00000
> http://nutch.apache.org/        key:    org.apache.nutch:http/
> baseUrl:        null
> status: 2 (status_fetched)
> fetchTime:      1427496745962
> prevFetchTime:  1424904733150
> fetchInterval:  2592000
> retriesSinceFetch:      0
> modifiedTime:   0
> prevModifiedTime:       0
> protocolStatus: (null)
> parseStatus:    (null)
> title:  null
> score:  1.0
> marker _injmrk_ :       y
> marker dist :   0
> reprUrl:        null
> batchId:        1424904736-28420
> metadata _csh_ :
>
>
>
>
>
> ---------------------------------------------------------
>
> *Jonathan Katon*
>
> *Design Technology Group, Teradyne, Inc.*
> *Software Tools Engineer*
>
> Office: 978-370-3561
> Cell: 978-809-4001
> Email: [email protected]
>
>
>


-- 
*Lewis*
