Hi Jonathan,
It looks like I am running into the same issue as you. I've spent the last two days trying to get Nutch 2.3.0 to run on Windows Server 2012R2 + Cygwin and communicate with storage on a different Windows server, all running in Azure. (I know, Linux would be much easier to get this stack running, but that isn't an option on this project.)

I started with HBase, since that seemed to be a popular option and Azure has a pre-made set of VMs for it. Unfortunately Azure's HBase options are 0.98 and 0.98.4, neither of which proved compatible with Nutch 2.3/Gora 0.5. Moving on, I opted for Cassandra today and pulled down version 2.0.2 since that version was listed on the Nutch homepage. Cassandra ran on Windows without any hacks, so that was a plus. I plan to output into Elasticsearch, or even better, JSON.

As for Nutch, I ran into a series of issues that I managed to get past while trying to index my personal website using the step-by-step Nutch commands:

1. Permission setting issue with Hadoop 1.x on Windows/Cygwin. Worked around for now with a hacky patch (https://github.com/congainc/patch-hadoop_7682-1.0.x-win). Note that the other hack people suggest, replacing the Hadoop 1.2 jar with the Hadoop 0.20.2 jar in lib, does not work on Nutch 2.3 (it results in a java.lang.ExceptionInInitializerError at org.apache.gora.mapreduce.GoraOutputFormat.setOutput). I'm hoping this might be resolved in Hadoop 2.x, but based on NUTCH-1936, it looks like Nutch support for Hadoop 2.x may not happen until GSOC this year.

2. During the Nutch generate phase, a thrift exception:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.thrift.EncodingUtils.setBit(IIZ)I
    at org.apache.cassandra.thrift.CfDef.setGc_grace_secondsIsSet(CfDef.java:895)
    at org.apache.cassandra.thrift.CfDef.setGc_grace_seconds(CfDef.java:881)
    at me.prettyprint.cassandra.service.ThriftCfDef.toThrift(ThriftCfDef.java:270)
    at me.prettyprint.cassandra.service.ThriftCfDef.toThriftList(ThriftCfDef.java:258)
    at me.prettyprint.cassandra.service.ThriftKsDef.toThrift(ThriftKsDef.java:109)
    at me.prettyprint.cassandra.service.ThriftCluster$6.execute(ThriftCluster.java:158)
    at me.prettyprint.cassandra.service.ThriftCluster$6.execute(ThriftCluster.java:151)
    at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:104)
    at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:253)
    at me.prettyprint.cassandra.service.ThriftCluster.addKeyspace(ThriftCluster.java:168)
    at org.apache.gora.cassandra.store.CassandraClient.checkKeyspace(CassandraClient.java:171)
    at org.apache.gora.cassandra.store.CassandraClient.initialize(CassandraClient.java:121)
    at org.apache.gora.cassandra.store.CassandraStore.initialize(CassandraStore.java:152)
    at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:104)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:163)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:137)
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78)
    at org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:133)
    at org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:122)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:209)
    at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:241)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:308)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:316)

Solved by removing the libthrift-0.8.0.jar file from the lib directory, leaving just libthrift-0.9.1.jar. I'm not sure this is a great thing to do, but it did unblock me on this step.

3. The elasticindex command no longer works in Nutch 2.3, as the indexer was moved to a plugin model (Error: Could not find or load main class org.apache.nutch.indexer.elastic.ElasticIndexerJob). After adding indexer-elastic to plugin.includes, nutch index -all now runs and connects to the ES cluster I specified in nutch-site.

However the one issue I haven't gotten past is the one you mention and Lewis linked to - I'm not getting any of the linked pages off of my initial seed, and there doesn't appear to be any content or metadata pulled down from my site (even the seed page). I'll look through the links Lewis provided and try to get any additional tracing. Hopefully we'll get this working, as Cassandra does seem like a good platform to use.

Matt

________________________________
From: [email protected] [[email protected]]
Sent: Thursday, February 26, 2015 11:58 AM
To: [email protected]
Subject: Re: Nutch2.3/Solr5/Cassandra2.1.3 crawl returns no data

Lewis and anyone else reading this,

Thank you for the links to the other posts. I will continue to review any updates to them!!

Before I go into my response to the last email I just want to give a mile high overview of what I am trying to accomplish. I have an intranet site with thousands of pages created using txt, html, php, and javascript, along with many powerpoint, word, and pdf documents being served up. I am trying to add search functionality to the website so content can be found in the website. For this I assume the best approach is to use Solr and Nutch.
I couldn't get Nutch 2.3 to behave with Solr 5.0 so I ended up adding in Cassandra 2.1.3 which then made the three play together nicely without all the errors I was getting before. I don't know enough about Nutch yet to know if it can do what I'm trying to accomplish so Tika may be thrown into the mix at some point.

I have updated log4j.properties logging levels to DEBUG like you mentioned. Reviewing the log I see a couple Errors and Exceptions but I'm not sure if they are the reason for the lack of crawling data

ERROR store.CassandraStore -
ERROR store.CassandraStore - [Ljava.lang.StackTraceElement;@7481ca2
DEBUG util.NativeCodeLoader - Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
WARN mapreduce.GoraRecordWriter - Exception at GoraRecordWriter.class while closing datastore.InvalidRequestException(why:supercolumn parameter is not optional for super CF sc)

Outside of those items I listed above nothing stands out as a problem between FetcherJob and ParserJob.

I have created multiple pastebins to simplify sharing of config/log files

1) log4j.properties - http://pastebin.com/KnK4A5wB
2) nutch-site.xml - http://pastebin.com/uuVmEdFU
3) gora.properties - http://pastebin.com/DeiM3aUF - I only added two lines for cassandra
4) gora-cassandra-mapping.xml - http://pastebin.com/F9dCsRwp - I have not changed this file at all
5) apache-nutch-2.3.log - http://pastebin.com/R9iNqcmN - The log file from running in DEBUG "./bin/crawl urls/seeds.txt crawl2 http://mylocalhost:8983/solr/nutch_crawler/ 5"
6) cassandra keyspace - http://pastebin.com/xj3cWUKE - Showing output of webpage keyspace
7) bin/nutch readdb -dump - http://pastebin.com/1ZDYCeBi - showing output from running that command
8) regex-urlfilter.xml - Not including this because I have not touched it at all.

As a side note with Solr.
When creating my "nutch_crawler" core I tried two different methods which I describe below. Different tutorials state you should do one or the other so I'm not sure what the correct procedure is.

1) Generic create - ./bin/solr create -c nutch_crawler
This defaults to using the solr provided "data_driven_schema_configs" configset which doesn't include a schema

2) Create a special "nutch_configs" configset. I did this by copying the "basic_configs" configset provided by solr to a new folder "nutch_configs" and then copying the schema.xml file which is provided by nutch into "nutch_configs" replacing its schema from "basic_configs". I then created the core using
> ./bin/solr create -c nutch_crawler -d /path/to/configsets/nutch_configs

Jonathan

---------------------------------------------------------
Jonathan Katon
Design Technology Group, Teradyne, Inc.
Software Tools Engineer
Office: 978-370-3561[X]
Cell: 978-809-4001[X]
Email: [email protected]

Lewis John Mcgibbney ---02/25/2015 06:37:31 PM---Hi Jonathan, There are another two threads ongoing, namely

From: Lewis John Mcgibbney <[email protected]>
To: "[email protected]" <[email protected]>
Date: 02/25/2015 06:37 PM
Subject: Re: Nutch2.3/Solr5/Cassandra2.1.3 crawl returns no data
________________________________

Hi Jonathan,

There are another two threads ongoing, namely
http://www.mail-archive.com/user%40nutch.apache.org/msg13237.html
and
http://www.mail-archive.com/user%40nutch.apache.org/msg13235.html
Please monitor those links and we can take it from there.

I would strongly suggest that you set logging levels to DEBUG within log4j.properties and then create a fresh log. Then step through the individual stages for the crawl cycle and try to verify if you are losing data between FetcherJob and ParserJob.
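
For anyone following along, here is a minimal sketch of what stepping through the stages one at a time could look like, written as a small Python wrapper. The subcommand names and flags (inject, generate -topN, fetch -all, parse -all, updatedb -all, readdb -stats, -crawlId) are assumptions based on the Nutch 2.x tutorial and the bin/crawl script, so please check them against the usage text that bin/nutch prints for each command. The helper names, the seed directory "urls", and the topN value are purely illustrative.

```python
import subprocess

# Assumption: the script is started from runtime/local, where bin/nutch lives.
NUTCH = "bin/nutch"


def stage_commands(crawl_id, seed_dir="urls", top_n=50):
    """Return one argv list per crawl stage, in order.

    The flags are assumptions taken from the Nutch 2.x tutorial; verify them
    against the usage output of your own build.
    """
    tail = ["-crawlId", crawl_id]
    return [
        [NUTCH, "inject", seed_dir] + tail,
        [NUTCH, "generate", "-topN", str(top_n)] + tail,
        [NUTCH, "fetch", "-all"] + tail,
        [NUTCH, "parse", "-all"] + tail,
        [NUTCH, "updatedb", "-all"] + tail,
    ]


def stats_command(crawl_id):
    """Command that prints WebTable statistics for the given crawl id."""
    return [NUTCH, "readdb", "-stats", "-crawlId", crawl_id]


def run_stages(crawl_id, runner=subprocess.run, **kwargs):
    """Run each stage, printing WebTable stats after every one.

    Comparing the stats (and the DEBUG log) after fetch and after parse
    should show at which stage the data goes missing.
    """
    for argv in stage_commands(crawl_id, **kwargs):
        print("==>", " ".join(argv))
        runner(argv, check=True)  # stop at the first failing stage
        runner(stats_command(crawl_id), check=True)
```

Calling run_stages("crawl2") from runtime/local would run the five stages against the same crawl id used with bin/crawl above. The runner argument is only there so the command list can be inspected without actually launching Nutch.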
Thank you
Lewis

On Wed, Feb 25, 2015 at 3:06 PM, <[email protected]> wrote:

> I am installing Nutch and Solr for the first time and as a noob I am
> having a problem with Nutch and Solr not returning any results after a
> crawl - I'm using http://nutch.apache.org . Any help would be greatly
> appreciated. I have looked over the Nutch and Apache logs and nothing is
> popping out at me as a problem.
>
> On a RHEL 6.4 server I have installed Nutch 2.3, Solr 5.0 and Cassandra
> 2.1.3. To accomplish this I followed instructions on multiple sites
> including:
> http://wiki.apache.org/nutch/NutchTutorial
> http://wiki.apache.org/nutch/Nutch2Tutorial
> https://wiki.apache.org/nutch/Nutch2Cassandra
> http://wiki.apache.org/nutch/IntranetDocumentSearch
>
> I know Cassandra is working by testing:
>
> bin/cassandra-cli
> Connected to: "Test Cluster" on 127.0.0.1/9160
> Welcome to Cassandra CLI version 2.1.3
>
> I know Solr is working because I have created a core named "nutch_crawler"
> and I can go to the website and access the gui at
> http://mydomain:8983/solr
>
> Before I built Nutch using "ant runtime" I updated the following
>
> *Ivy/ivy.xml* - Uncommented line
> <dependency org="org.apache.gora" name="gora-cassandra" rev="0.5"
> conf="*->default" />
>
> *Conf/gora.properties* - Added two lines
> gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
> gora.cassandrastore.servers=localhost:9160
>
> *Conf/nutch-site.xml* - Added the following:
> <property>
>   <name>http.agent.name</name>
>   <value>Nutch HTTP Agent Crawler</value>
> </property>
> <property>
>   <name>storage.data.store.class</name>
>   <value>org.apache.gora.cassandra.store.CassandraStore</value>
>   <description>Default class for storing data</description>
> </property>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description></description>
> </property>
> <property>
>   <name>metatags.names</name>
>   <value>*</value>
>   <description>Names of the metatags to extract, separated by ','.
>   Use '*' to extract all metatags. Prefixes the names with 'metatag.'
>   in the parse-metadata. For instance to index description and keywords,
>   you need to activate the plugin index-metadata and set the value of the
>   parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
>   </description>
> </property>
> <property>
>   <name>index.parse.md</name>
>   <value>*</value>
>   <description></description>
> </property>
> <property>
>   <name></name>
>   <value></value>
>   <description></description>
> </property>
>
> I then built Nutch using "ant runtime" and there were no errors. After the
> build I went to runtime/local/ and created the directory "urls" and then
> created a file in urls called "seeds.txt" which contains a single line
> without the quotes "http://nutch.apache.org".
>
> Next to perform the crawl I ran the following:
>
> ./bin/crawl urls/seeds.txt crawl2 http://mydomain:8983/solr/nutch_crawler/ 5
>
> This runs perfectly fine with no errors returned.
>
> If I go to core admin in the solr web gui it shows me that nutch_crawler
> contains no Docs.
>
> If I spit out the nutch db stats
>
> ./bin/nutch readdb crawl2 -status
> WebTable statistics start
> Statistics for WebTable:
> status 2 (status_fetched): 1
> min score: 1.0
> retry 0: 2
> jobs:
> {db_stats-job_local1413246058_0001={jobID=job_local1413246058_0001,
> jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0},
> Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=116,
> MAP_INPUT_RECORDS=2, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=14,
> MAP_OUTPUT_BYTES=106, COMMITTED_HEAP_BYTES=1516240896, CPU_MILLISECONDS=0,
> SPLIT_RAW_BYTES=905, COMBINE_INPUT_RECORDS=8, REDUCE_INPUT_RECORDS=7,
> REDUCE_INPUT_GROUPS=7, COMBINE_OUTPUT_RECORDS=7, PHYSICAL_MEMORY_BYTES=0,
> REDUCE_OUTPUT_RECORDS=7, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=8},
> FileSystemCounters={FILE_BYTES_READ=1217716, FILE_BYTES_WRITTEN=1392268},
> File Output Format Counters ={BYTES_WRITTEN=250}}}}
> max score: 1.0
> TOTAL urls: 2
> status 3 (status_gone): 1
> avg score: 1.0
> WebTable statistics: done
>
> If I spit out the nutch db dump
>
> ./bin/nutch readdb crawl2 -dump crawl2dump
> >more crawl2dump/part-r-00000
> http://nutch.apache.org/ key: org.apache.nutch:http/
> baseUrl: null
> status: 2 (status_fetched)
> fetchTime: 1427496745962
> prevFetchTime: 1424904733150
> fetchInterval: 2592000
> retriesSinceFetch: 0
> modifiedTime: 0
> prevModifiedTime: 0
> protocolStatus: (null)
> parseStatus: (null)
> title: null
> score: 1.0
> marker _injmrk_ : y
> marker dist : 0
> reprUrl: null
> batchId: 1424904736-28420
> metadata _csh_ :
>
> ---------------------------------------------------------
>
> *Jonathan Katon*
>
> *Design Technology Group, Teradyne, Inc.*
> *Software Tools Engineer*
>
> Office: 978-370-3561[X]
> Cell: 978-809-4001[X]
> Email: [email protected]
>

--
*Lewis*

