RE: Nutch2.3/Solr5/Cassandra2.1.3 crawl returns no data

Burrough, Matthew William Thu, 26 Feb 2015 20:57:07 -0800

Hi Jonathan,

It looks like I am running into the same issue as you.  I've spent the last two 
days trying to get Nutch 2.3.0 to run on Windows Server 2012R2 + Cygwin and 
communicate with storage on a different Windows server, all running in Azure.  
(I know, Linux would be much easier to get this stack running, but that isn't 
an option on this project.)  I started with HBase, since that seemed to be a 
popular option and Azure has a pre-made set of VMs for it.  Unfortunately 
Azure's HBase options are 0.98 and 0.98.4, neither of which proved compatible 
with Nutch 2.3/Gora 0.5.  Moving on, I opted for Cassandra today and pulled 
down version 2.0.2 since that version was listed on the Nutch homepage.  
Cassandra ran on Windows without any hacks, so that was a plus.  I plan to 
output into Elasticsearch, or even better, JSON.

As for Nutch, I ran into a series of issues that I managed to get past while 
trying to index my personal website using the step-by-step Nutch commands:

  1.  Permission setting issue with Hadoop 1.x on Windows/Cygwin. Worked around 
for now with hacky patch: 
https://github.com/congainc/patch-hadoop_7682-1.0.x-win. Note that the other 
hack people suggest, to replace the Hadoop 1.2 jar with the Hadoop 0.20.2 jar 
in lib does not work on Nutch 2.3 (Results in a 
java.lang.ExceptionInInitializerError at 
org.apache.gora.mapreduce.GoraOutputFormat.setOutput).  I'm hoping this might 
be resolved in Hadoop 2.x, but based on NUTCH-1936, it looks like Nutch support 
for Hadoop 2.x may not happen until GSOC this year.
  2.  During the Nutch generate phase, a Thrift exception:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.thrift.EncodingUtils.setBit(IIZ)I
    at org.apache.cassandra.thrift.CfDef.setGc_grace_secondsIsSet(CfDef.java:895)
    at org.apache.cassandra.thrift.CfDef.setGc_grace_seconds(CfDef.java:881)
    at me.prettyprint.cassandra.service.ThriftCfDef.toThrift(ThriftCfDef.java:270)
    at me.prettyprint.cassandra.service.ThriftCfDef.toThriftList(ThriftCfDef.java:258)
    at me.prettyprint.cassandra.service.ThriftKsDef.toThrift(ThriftKsDef.java:109)
    at me.prettyprint.cassandra.service.ThriftCluster$6.execute(ThriftCluster.java:158)
    at me.prettyprint.cassandra.service.ThriftCluster$6.execute(ThriftCluster.java:151)
    at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:104)
    at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:253)
    at me.prettyprint.cassandra.service.ThriftCluster.addKeyspace(ThriftCluster.java:168)
    at org.apache.gora.cassandra.store.CassandraClient.checkKeyspace(CassandraClient.java:171)
    at org.apache.gora.cassandra.store.CassandraClient.initialize(CassandraClient.java:121)
    at org.apache.gora.cassandra.store.CassandraStore.initialize(CassandraStore.java:152)
    at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:104)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:163)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:137)
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78)
    at org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:133)
    at org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:122)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:209)
    at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:241)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:308)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:316)

Solved by removing the libthrift-0.8.0.jar file from the lib directory, 
leaving just libthrift-0.9.1.jar. I'm not sure this is a great thing to do, but 
it did unblock me on this step.

  3.  The elasticindex command no longer works in Nutch 2.3, as the indexer was 
moved to a plugin model:

Error: Could not find or load main class org.apache.nutch.indexer.elastic.ElasticIndexerJob

After adding indexer-elastic to plugin.includes, nutch index -all now 
runs and connects to the ES cluster I specified in nutch-site.
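
The NoSuchMethodError in step 2 is typical of two versions of the same library sitting on one classpath. A minimal sketch for spotting such duplicates follows. It assumes jars are named "<artifact>-<version>.jar", and the function names and the "runtime/local/lib" path are illustrative, not part of Nutch.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumption: jars follow the "<artifact>-<version>.jar" naming convention,
# with the version starting with a digit (e.g. libthrift-0.9.1.jar).
JAR_PATTERN = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_duplicate_jars(jar_names):
    """Return {artifact: [versions]} for artifacts present in more than one version."""
    versions = defaultdict(set)
    for name in jar_names:
        match = JAR_PATTERN.match(name)
        if match:
            versions[match.group("artifact")].add(match.group("version"))
    return {a: sorted(v) for a, v in versions.items() if len(v) > 1}


def scan_lib_dir(lib_dir):
    """Apply find_duplicate_jars to every *.jar file directly inside lib_dir."""
    return find_duplicate_jars(p.name for p in Path(lib_dir).glob("*.jar"))


if __name__ == "__main__":
    # Hypothetical location; point this at the lib directory of your install.
    for artifact, found in scan_lib_dir("runtime/local/lib").items():
        print(f"{artifact}: {', '.join(found)}")
```

The script only reports candidates. Which version to keep still has to be decided by hand, as with the libthrift jars above.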

However the one issue I haven't gotten past is the one you mention and Lewis 
linked to - I'm not getting any of the linked pages off of my initial seed, and 
there doesn't appear to be any content or metadata pulled down from my site 
(even the seed page). I'll look through the links Lewis provided and try to get 
any additional tracing.  Hopefully we'll get this working, as Cassandra does 
seem like a good platform to use.

Matt

________________________________
From: [email protected] [[email protected]]
Sent: Thursday, February 26, 2015 11:58 AM
To: [email protected]
Subject: Re: Nutch2.3/Solr5/Cassandra2.1.3 crawl returns no data

Lewis and anyone else reading this,

Thank you for the links to the other posts. I will continue to review any 
updates to them!!

Before I go into my response to the last email I just want to give a mile high 
overview of what I am trying to accomplish. I have an intranet site with 
thousands of pages created using txt, html, php, and javascript, along with 
many powerpoint, word, and pdf documents being served up. I am trying to add 
search functionality to the website so content can be found in the website. For 
this I assume the best approach is to use Solr and Nutch. I couldn't get Nutch 
2.3 to behave with Solr 5.0 so I ended up adding in Cassandra 2.1.3 which then 
made the three play together nicely without all the errors I was getting 
before. I don't know enough about Nutch yet to know if it can do what I'm 
trying to accomplish so Tika may be thrown into the mix at some point.

I have updated log4j.properties logging levels to DEBUG like you mentioned.

Reviewing the log, I see a couple of errors and exceptions, but I'm not sure 
whether they are the reason for the lack of crawl data:

ERROR store.CassandraStore -
ERROR store.CassandraStore - [Ljava.lang.StackTraceElement;@7481ca2
DEBUG util.NativeCodeLoader - Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
WARN mapreduce.GoraRecordWriter - Exception at GoraRecordWriter.class while closing datastore.InvalidRequestException(why:supercolumn parameter is not optional for super CF sc)

Outside of those items I listed above nothing stands out as a problem between 
FetcherJob and ParserJob.
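
With logging at DEBUG the log gets long, so it can help to pull out only the WARN and ERROR lines before reading further. A minimal sketch, assuming each log line carries its level as an uppercase word as in the excerpt above; the function name is illustrative.

```python
import re

# Assumption: the first uppercase level word on a line is that line's level,
# as in "ERROR store.CassandraStore - ...".
LEVEL_PATTERN = re.compile(r"\b(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\b")
PROBLEM_LEVELS = {"WARN", "ERROR", "FATAL"}


def problem_lines(lines):
    """Yield (line_number, level, text) for lines logged at WARN or above."""
    for number, line in enumerate(lines, start=1):
        match = LEVEL_PATTERN.search(line)
        if match and match.group(1) in PROBLEM_LEVELS:
            yield number, match.group(1), line.rstrip("\n")
```

To use it on a file, open the log and pass the file object, for example `for n, level, text in problem_lines(open(path)): print(n, text)`, where `path` is wherever your Nutch log is written.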

I have created multiple pastebins to simplify sharing of config/log files

1) log4j.properties - http://pastebin.com/KnK4A5wB
2) nutch-site.xml - http://pastebin.com/uuVmEdFU
3) gora.properties - http://pastebin.com/DeiM3aUF - I only added two lines for 
cassandra
4) gora-cassandra-mapping.xml - http://pastebin.com/F9dCsRwp - I have not 
changed this file at all
5) apache-nutch-2.3.log - http://pastebin.com/R9iNqcmN - The log file from 
running in DEBUG: "./bin/crawl urls/seeds.txt crawl2 
http://mylocalhost:8983/solr/nutch_crawler/ 5"
6) cassandra keyspace - http://pastebin.com/xj3cWUKE - Showing output of 
webpage keyspace
7) bin/nutch readdb -dump - http://pastebin.com/1ZDYCeBi - showing output from 
running that command

8) regex-urlfilter.txt - Not included because I have not touched it at 
all.
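
The readdb dump in item 7 can be checked mechanically for pages that were fetched but show no parse result, which is one way to look for data lost between FetcherJob and ParserJob. A minimal sketch, assuming the dump layout quoted in the first message of this thread (a "URL key: ..." line starting each record, followed by "name: value" lines); the function names are illustrative, and a "(null)" parseStatus is only a hint, not proof, that parsing did not happen.

```python
import re

# Assumption: each record starts with "<url>   key:   <reversed-url-key>".
RECORD_START = re.compile(r"^(?P<url>\S+)\s+key:\s+(?P<key>\S+)")
# Assumption: simple fields look like "name:   value" with no space before ':'.
FIELD = re.compile(r"^(?P<name>[A-Za-z]\w*):\s*(?P<value>.*)$")


def parse_dump(lines):
    """Split a readdb -dump text into a list of {field: value} records."""
    records = []
    current = None
    for line in lines:
        line = line.rstrip("\n")
        start = RECORD_START.match(line)
        if start:
            current = {"url": start.group("url"), "key": start.group("key")}
            records.append(current)
            continue
        field = FIELD.match(line)
        if field and current is not None:
            current[field.group("name")] = field.group("value").strip()
    return records


def fetched_but_unparsed(records):
    """Return URLs whose status is fetched while parseStatus is still (null)."""
    return [
        r["url"]
        for r in records
        if "status_fetched" in r.get("status", "")
        and r.get("parseStatus") == "(null)"
    ]
```

Lines such as "marker dist : 0" do not match the simple field pattern and are skipped, which is fine for this check.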

As a side note on Solr: when creating my "nutch_crawler" core I tried the two 
different methods described below. Different tutorials state you should 
do one or the other, so I'm not sure what the correct procedure is.

1) Generic create - ./bin/solr create -c nutch_crawler
This defaults to using the solr provided "data_driven_schema_configs" configset 
which doesn't include a schema

2) Create a special "nutch_configs" configset. I did this by copying the 
"basic_configs" configset provided by solr to a new folder "nutch_configs" and 
then copying the schema.xml file which is provided by nutch into 
"nutch_configs" replacing its schema from "basic_configs". I then created the 
core using

> ./bin/solr create -c nutch_crawler -d /path/to/configsets/nutch_configs

Jonathan

---------------------------------------------------------

Jonathan Katon

Design Technology Group, Teradyne, Inc.
Software Tools Engineer

Office: 978-370-3561[X]
Cell: 978-809-4001[X]
Email: [email protected]


From: Lewis John Mcgibbney <[email protected]>
To: "[email protected]" <[email protected]>
Date: 02/25/2015 06:37 PM
Subject: Re: Nutch2.3/Solr5/Cassandra2.1.3 crawl returns no data

________________________________

Hi Jonathan,

There are two other ongoing threads, namely

http://www.mail-archive.com/user%40nutch.apache.org/msg13237.html
and
http://www.mail-archive.com/user%40nutch.apache.org/msg13235.html

Please monitor those links and we can take it from there.
I would strongly suggest that you set logging levels to DEBUG within
log4j.properties and then create a fresh log.
Then step through the individual stages of the crawl cycle and try to
verify whether you are losing data between FetcherJob and ParserJob.
Thank you
Lewis

On Wed, Feb 25, 2015 at 3:06 PM, <[email protected]> wrote:

> I am installing Nutch and Solr for the first time and as a noob I am
> having a problem with Nutch and Solr not returning any results after a
> crawl - I'm using http://nutch.apache.org . Any help would be greatly
> appreciated. I have looked over the Nutch and Apache logs and nothing is
> popping out at me as a problem.
>
>
> On a RHEL 6.4 server I have installed Nutch 2.3, Solr 5.0 and Cassandra
> 2.1.3. To accomplish this I followed instructions on multiple sites
> including:
> http://wiki.apache.org/nutch/NutchTutorial
> http://wiki.apache.org/nutch/Nutch2Tutorial
> https://wiki.apache.org/nutch/Nutch2Cassandra
> http://wiki.apache.org/nutch/IntranetDocumentSearch
>
> I know Cassandra is working by testing:
> > bin/cassandra-cli
> Connected to: "Test Cluster" on 127.0.0.1/9160
> Welcome to Cassandra CLI version 2.1.3
>
> I know Solr is working because I have created a core named "nutch_crawler"
> and I can go to the website and access the gui at
> http://mydomain:8983/solr
>
>
> Before I built Nutch using "ant runtime" I updated the following
>
> *Ivy/ivy.xml* - Uncommented the line
> <dependency org="org.apache.gora" name="gora-cassandra" rev="0.5"
> conf="*->default" />
>
> *Conf/gora.properties* - Added two lines
> gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
> gora.cassandrastore.servers=localhost:9160
>
> *Conf/nutch-site.xml* - Added the following:
>   <property>
>     <name>http.agent.name</name>
>     <value>Nutch HTTP Agent Crawler</value>
>   </property>
>   <property>
>     <name>storage.data.store.class</name>
>     <value>org.apache.gora.cassandra.store.CassandraStore</value>
>     <description>Default class for storing data</description>
>   </property>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>     <description></description>
>   </property>
>   <property>
>     <name>metatags.names</name>
>     <value>*</value>
>     <description>Names of the metatags to extract, separated by ','.
>       Use '*' to extract all metatags. Prefixes the names with 'metatag.'
>       in the parse-metadata. For instance, to index description and keywords,
>       you need to activate the plugin index-metadata and set the value of the
>       parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
>     </description>
>   </property>
>   <property>
>     <name>index.parse.md</name>
>     <value>*</value>
>     <description></description>
>   </property>
>   <property>
>     <name></name>
>     <value></value>
>     <description></description>
>   </property>
>
> I then built Nutch using "ant runtime" and there were no errors. After the
> build I went to runtime/local/, created the directory "urls", and then
> created a file in urls called "seeds.txt" which contains a single line
> (without the quotes): "http://nutch.apache.org".
>
> Next to perform the crawl I ran the following:
> > ./bin/crawl urls/seeds.txt crawl2 http://mydomain:8983/solr/nutch_crawler/ 5
>
> This runs perfectly fine with no errors returned.
>
> If I go to core admin in the solr web gui it shows me that nutch_crawler
> contains no Docs.
>
> If I spit out the nutch db stats
> > ./bin/nutch readdb crawl2 -status
> WebTable statistics start
> Statistics for WebTable:
> status 2 (status_fetched):      1
> min score:      1.0
> retry 0:        2
> jobs:
> {db_stats-job_local1413246058_0001={jobID=job_local1413246058_0001,
> jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0},
> Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=116,
> MAP_INPUT_RECORDS=2, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=14,
> MAP_OUTPUT_BYTES=106, COMMITTED_HEAP_BYTES=1516240896, CPU_MILLISECONDS=0,
> SPLIT_RAW_BYTES=905, COMBINE_INPUT_RECORDS=8, REDUCE_INPUT_RECORDS=7,
> REDUCE_INPUT_GROUPS=7, COMBINE_OUTPUT_RECORDS=7, PHYSICAL_MEMORY_BYTES=0,
> REDUCE_OUTPUT_RECORDS=7, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=8},
> FileSystemCounters={FILE_BYTES_READ=1217716, FILE_BYTES_WRITTEN=1392268},
> File Output Format Counters ={BYTES_WRITTEN=250}}}}
> max score:      1.0
> TOTAL urls:     2
> status 3 (status_gone): 1
> avg score:      1.0
> WebTable statistics: done
>
> If I spit out the nutch db dump
> > ./bin/nutch readdb crawl2 -dump crawl2dump
> >more crawl2dump/part-r-00000
> http://nutch.apache.org/        key:    org.apache.nutch:http/
> baseUrl:        null
> status: 2 (status_fetched)
> fetchTime:      1427496745962
> prevFetchTime:  1424904733150
> fetchInterval:  2592000
> retriesSinceFetch:      0
> modifiedTime:   0
> prevModifiedTime:       0
> protocolStatus: (null)
> parseStatus:    (null)
> title:  null
> score:  1.0
> marker _injmrk_ :       y
> marker dist :   0
> reprUrl:        null
> batchId:        1424904736-28420
> metadata _csh_ :
>
>
>
>
>
> ---------------------------------------------------------
>
> *Jonathan Katon*
>
> *Design Technology Group, Teradyne, Inc.*
> *Software Tools Engineer*
>
> Office: 978-370-3561[X]
> Cell: 978-809-4001[X]
> Email: [email protected]
>
>
>

--
*Lewis*
