Re: Nutch2.3/Solr5/Cassandra2.1.3 crawl returns no data

jonathan . katon Thu, 26 Feb 2015 09:06:13 -0800

Lewis and anyone else reading this,

Thank you for the links to the other threads. I will keep watching them
for updates.


Before I respond to the last email, I want to give a high-level overview
of what I am trying to accomplish. I have an intranet site with thousands
of pages written in txt, html, php, and javascript, and it also serves
many PowerPoint, Word, and PDF documents. I am trying to add search to the
site so that this content can be found. I assume the best approach is Solr
plus Nutch. I couldn't get Nutch 2.3 to behave with Solr 5.0 on its own,
so I added Cassandra 2.1.3, and the three now run together without the
errors I was getting before. I don't know enough about Nutch yet to know
whether it can do everything I need, so Tika may be thrown into the mix at
some point.


I have set the logging levels in log4j.properties to DEBUG as you
suggested.

Reviewing the log, I see a few errors and exceptions, but I'm not sure
whether they explain the missing crawl data:

ERROR store.CassandraStore -
ERROR store.CassandraStore - [Ljava.lang.StackTraceElement;@7481ca2
DEBUG util.NativeCodeLoader - Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
WARN mapreduce.GoraRecordWriter - Exception at GoraRecordWriter.class while closing datastore.InvalidRequestException(why:supercolumn parameter is not optional for super CF sc)

Outside of those items I listed above nothing stands out as a problem
between FetcherJob and ParserJob.
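
To make this kind of log review repeatable, the ERROR and WARN lines can be
tallied per logger. The following is only a minimal Python sketch: the
sample text is copied from the excerpt above, the "LEVEL logger - message"
layout is assumed from that excerpt, and the function name count_problems
is made up for illustration. A real run would read the Nutch log file
instead of the sample string.

```python
import re
from collections import Counter

# Sample lines copied from the log excerpt above; a real run would read
# the Nutch log file instead (its location is not assumed here).
SAMPLE_LOG = """\
ERROR store.CassandraStore -
ERROR store.CassandraStore - [Ljava.lang.StackTraceElement;@7481ca2
DEBUG util.NativeCodeLoader - Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
WARN mapreduce.GoraRecordWriter - Exception at GoraRecordWriter.class while closing datastore.InvalidRequestException(why:supercolumn parameter is not optional for super CF sc)
"""

# Matches "<LEVEL> <logger> - <message>". re.search is used so that a
# leading timestamp, if the real log has one, does not prevent a match.
LINE_RE = re.compile(r"\b(ERROR|WARN|INFO|DEBUG)\s+(\S+)\s+-\s*(.*)$")


def count_problems(log_text, levels=("ERROR", "WARN")):
    """Count log lines per (level, logger) for the requested levels."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match and match.group(1) in levels:
            counts[(match.group(1), match.group(2))] += 1
    return counts


problems = count_problems(SAMPLE_LOG)
for (level, logger), n in sorted(problems.items()):
    print(f"{level:5} {logger}: {n}")
```

A tally like this only shows where the noise is concentrated; it does not
by itself say which of the messages is responsible for the empty crawl.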


I have created several pastebins to simplify sharing the config and log
files:

1) log4j.properties - http://pastebin.com/KnK4A5wB
2) nutch-site.xml - http://pastebin.com/uuVmEdFU
3) gora.properties - http://pastebin.com/DeiM3aUF - I only added two lines
   for Cassandra
4) gora-cassandra-mapping.xml - http://pastebin.com/F9dCsRwp - I have not
   changed this file at all
5) apache-nutch-2.3.log - http://pastebin.com/R9iNqcmN - The log file from
   running the following in DEBUG:
   ./bin/crawl urls/seeds.txt crawl2 http://mylocalhost:8983/solr/nutch_crawler/ 5
6) Cassandra keyspace - http://pastebin.com/xj3cWUKE - Output of the
   webpage keyspace
7) bin/nutch readdb -dump - http://pastebin.com/1ZDYCeBi - Output from
   running that command
8) regex-urlfilter.txt - Not included because I have not touched it at
   all.
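
For the readdb dump in item 7, one way to check whether pages are fetched
but never parsed is to look for records with status_fetched and an empty
parseStatus. The following is a minimal Python sketch under stated
assumptions: the record layout is taken from the readdb dump excerpt in my
original message (quoted further down), the sample is a shortened copy of
that record, and the function names parse_dump and fetched_but_unparsed
are made up for illustration.

```python
import re

# Shortened copy of the record from the readdb dump quoted in this thread.
SAMPLE_DUMP = """\
http://nutch.apache.org/        key:    org.apache.nutch:http/
baseUrl:        null
status: 2 (status_fetched)
protocolStatus: (null)
parseStatus:    (null)
title:  null
score:  1.0
"""

# Assumption: a record starts with "<url>  key:  <reversed-url-key>".
HEADER_RE = re.compile(r"^(\S+)\s+key:\s+(\S+)$")


def parse_dump(text):
    """Split dump text into {url: {field: value}} records."""
    records = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = HEADER_RE.match(line)
        if header:
            current = records.setdefault(header.group(1), {})
            continue
        if current is not None and ":" in line:
            # Split on the first colon only; values may contain colons.
            name, _, value = line.partition(":")
            current[name.strip()] = value.strip()
    return records


def fetched_but_unparsed(records):
    """Return URLs that were fetched but carry no parse status."""
    empty = ("(null)", "null", "")
    return [
        url
        for url, fields in records.items()
        if "status_fetched" in fields.get("status", "")
        and fields.get("parseStatus", "") in empty
    ]


records = parse_dump(SAMPLE_DUMP)
for url in fetched_but_unparsed(records):
    print("fetched but not parsed:", url)
```

If every fetched URL shows up in this list, that would be consistent with
data going missing between FetcherJob and ParserJob, though it does not
identify the cause.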



As a side note on Solr: when creating my "nutch_crawler" core I tried the
two methods described below. Different tutorials recommend one or the
other, so I'm not sure which procedure is correct.

1) Generic create:

> ./bin/solr create -c nutch_crawler

This defaults to the Solr-provided "data_driven_schema_configs" configset,
which doesn't include a schema.xml.

2) Create a special "nutch_configs" configset. I did this by copying the
"basic_configs" configset provided by Solr to a new folder "nutch_configs"
and then replacing its schema.xml with the schema.xml provided by Nutch. I
then created the core using

> ./bin/solr create -c nutch_crawler -d /path/to/configsets/nutch_configs

Jonathan


---------------------------------------------------------

Jonathan Katon

Design Technology Group, Teradyne, Inc.
Software Tools Engineer

Office: 978-370-3561
Cell: 978-809-4001
Email: [email protected]






From:   Lewis John Mcgibbney <[email protected]>
To:     "[email protected]" <[email protected]>
Date:   02/25/2015 06:37 PM
Subject:        Re: Nutch2.3/Solr5/Cassandra2.1.3 crawl returns no data



Hi Jonathan,

There are another two threads ongoing, namely

http://www.mail-archive.com/user%40nutch.apache.org/msg13237.html
and
http://www.mail-archive.com/user%40nutch.apache.org/msg13235.html

Please monitor those links and we can take it from there.
I would strongly suggest that you set logging levels to DEBUG within
log4j.properties and then create a fresh log.
Then step through the individual stages of the crawl cycle and try to
verify whether you are losing data between FetcherJob and ParserJob.
Thank you
Lewis



On Wed, Feb 25, 2015 at 3:06 PM, <[email protected]> wrote:

> I am installing Nutch and Solr for the first time and as a noob I am
> having a problem with Nutch and Solr not returning any results after a
> crawl - I'm using http://nutch.apache.org . Any help would be greatly
> appreciated. I have looked over the Nutch and Apache logs and nothing is
> popping out at me as a problem.
>
>
> On a RHEL 6.4 server I have installed Nutch 2.3, Solr 5.0 and Cassandra
> 2.1.3. To accomplish this I followed instructions on multiple sites
> including:
> http://wiki.apache.org/nutch/NutchTutorial
> http://wiki.apache.org/nutch/Nutch2Tutorial
> https://wiki.apache.org/nutch/Nutch2Cassandra
> http://wiki.apache.org/nutch/IntranetDocumentSearch
>
> I know Cassandra is working by testing:
> > bin/cassandra-cli
> Connected to: "Test Cluster" on 127.0.0.1/9160
> Welcome to Cassandra CLI version 2.1.3
>
> I know Solr is working because I have created a core named
> "nutch_crawler" and I can go to the website and access the gui at
> http://mydomain:8983/solr
>
>
> Before I built Nutch using "ant runtime" I updated the following
>
> *Ivy/ivy.xml* - Uncommented the line
> <dependency org="org.apache.gora" name="gora-cassandra" rev="0.5"
> conf="*->default" />
>
> *Conf/gora.properties* - Added two lines
> gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
> gora.cassandrastore.servers=localhost:9160
>
> *Conf/nutch-site.xml* - Added the following:
>   <property>
>     <name>http.agent.name</name>
>     <value>Nutch HTTP Agent Crawler</value>
>   </property>
>   <property>
>     <name>storage.data.store.class</name>
>     <value>org.apache.gora.cassandra.store.CassandraStore</value>
>     <description>Default class for storing data</description>
>   </property>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>     <description></description>
>   </property>
>   <property>
>     <name>metatags.names</name>
>     <value>*</value>
>     <description>Names of the metatags to extract, separated by ','.
>       Use '*' to extract all metatags. Prefixes the names with 'metatag.'
>       in the parse-metadata. For instance to index description and
>       keywords, you need to activate the plugin index-metadata and set
>       the value of the parameter 'index.parse.md' to
>       'metatag.description,metatag.keywords'.
>     </description>
>   </property>
>   <property>
>     <name>index.parse.md</name>
>     <value>*</value>
>     <description></description>
>   </property>
>   <property>
>     <name></name>
>     <value></value>
>     <description></description>
>   </property>
>
> I then built Nutch using "ant runtime" and there were no errors. After
> the build I went to runtime/local/ and created the directory "urls" and
> then created a file in urls called "seeds.txt" which contains a single
> line without the quotes "http://nutch.apache.org".
>
> Next to perform the crawl I ran the following:
> > ./bin/crawl urls/seeds.txt crawl2 http://mydomain:8983/solr/nutch_crawler/ 5
>
> This runs perfectly fine with no errors returned.
>
> If I go to core admin in the solr web gui it shows me that nutch_crawler
> contains no Docs.
>
> If I spit out the nutch db stats
> > ./bin/nutch readdb crawl2 -status
> WebTable statistics start
> Statistics for WebTable:
> status 2 (status_fetched):      1
> min score:      1.0
> retry 0:        2
> jobs:
> {db_stats-job_local1413246058_0001={jobID=job_local1413246058_0001,
> jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0},
> Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=116,
> MAP_INPUT_RECORDS=2, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=14,
> MAP_OUTPUT_BYTES=106, COMMITTED_HEAP_BYTES=1516240896,
> CPU_MILLISECONDS=0,
> SPLIT_RAW_BYTES=905, COMBINE_INPUT_RECORDS=8, REDUCE_INPUT_RECORDS=7,
> REDUCE_INPUT_GROUPS=7, COMBINE_OUTPUT_RECORDS=7, PHYSICAL_MEMORY_BYTES=0,
> REDUCE_OUTPUT_RECORDS=7, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=8},
> FileSystemCounters={FILE_BYTES_READ=1217716, FILE_BYTES_WRITTEN=1392268},
> File Output Format Counters ={BYTES_WRITTEN=250}}}}
> max score:      1.0
> TOTAL urls:     2
> status 3 (status_gone): 1
> avg score:      1.0
> WebTable statistics: done
>
> If I spit out the nutch db dump
> > ./bin/nutch readdb crawl2 -dump crawl2dump
> > more crawl2dump/part-r-00000
> http://nutch.apache.org/        key:    org.apache.nutch:http/
> baseUrl:        null
> status: 2 (status_fetched)
> fetchTime:      1427496745962
> prevFetchTime:  1424904733150
> fetchInterval:  2592000
> retriesSinceFetch:      0
> modifiedTime:   0
> prevModifiedTime:       0
> protocolStatus: (null)
> parseStatus:    (null)
> title:  null
> score:  1.0
> marker _injmrk_ :       y
> marker dist :   0
> reprUrl:        null
> batchId:        1424904736-28420
> metadata _csh_ :


--
*Lewis*
