Lewis and anyone else reading this,

Thank you for the links to the other posts. I will keep checking them for updates.
Before I respond to the last email, I want to give a mile-high overview of what I am trying to accomplish. I have an intranet site with thousands of pages created using txt, html, php, and javascript, along with many PowerPoint, Word, and PDF documents being served up. I am trying to add search functionality so that content can be found within the website. I assume the best approach for this is Solr plus Nutch. I couldn't get Nutch 2.3 to behave with Solr 5.0, so I ended up adding Cassandra 2.1.3, which made the three play together nicely without all the errors I was getting before. I don't know enough about Nutch yet to know if it can do what I'm trying to accomplish, so Tika may be thrown into the mix at some point.

I have updated the log4j.properties logging levels to DEBUG like you mentioned. Reviewing the log, I see a couple of errors and exceptions, but I'm not sure whether they are the reason no data is being crawled:

ERROR store.CassandraStore -
ERROR store.CassandraStore - [Ljava.lang.StackTraceElement;@7481ca2
DEBUG util.NativeCodeLoader - Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
WARN mapreduce.GoraRecordWriter - Exception at GoraRecordWriter.class while closing datastore.InvalidRequestException(why:supercolumn parameter is not optional for super CF sc)

Outside of the items listed above, nothing stands out as a problem between FetcherJob and ParserJob.
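For anyone wanting to pull the same kind of lines out of a DEBUG-level log, here is a minimal Python sketch. It assumes the log4j layout seen in the lines quoted above (level, logger name, " - ", message, optionally preceded by a timestamp). The names `find_problems` and `count_by_logger` are made up for illustration and are not part of Nutch.

```python
import re
from collections import Counter

# Matches "<LEVEL> <logger> - <message>", with or without a leading timestamp.
# Assumption: the log uses the layout seen in the lines quoted in this email.
LINE_RE = re.compile(r"\b(ERROR|WARN|FATAL)\s+(\S+)\s+-\s?(.*)")


def find_problems(lines):
    """Return (level, logger, message) tuples for ERROR/WARN/FATAL lines."""
    hits = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            hits.append((match.group(1), match.group(2), match.group(3).strip()))
    return hits


def count_by_logger(hits):
    """Count problem lines per (level, logger) pair."""
    return Counter((level, logger) for level, logger, _ in hits)


# Sample input: the four lines quoted from the log in this email.
sample = [
    "ERROR store.CassandraStore - ",
    "ERROR store.CassandraStore - [Ljava.lang.StackTraceElement;@7481ca2",
    "DEBUG util.NativeCodeLoader - Failed to load native-hadoop with error: "
    "java.lang.UnsatisfiedLinkError: no hadoop in java.library.path",
    "WARN mapreduce.GoraRecordWriter - Exception at GoraRecordWriter.class "
    "while closing datastore.InvalidRequestException(why:supercolumn "
    "parameter is not optional for super CF sc)",
]

hits = find_problems(sample)
counts = count_by_logger(hits)
for (level, logger), n in sorted(counts.items()):
    print(level, logger, n)
```

To run it against a real log, pass an open file handle in place of `sample`. The DEBUG line is skipped on purpose, since the missing native-hadoop library is reported at DEBUG level only.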
I have created multiple pastebins to simplify sharing of config/log files:

1) log4j.properties - http://pastebin.com/KnK4A5wB
2) nutch-site.xml - http://pastebin.com/uuVmEdFU
3) gora.properties - http://pastebin.com/DeiM3aUF - I only added two lines for Cassandra
4) gora-cassandra-mapping.xml - http://pastebin.com/F9dCsRwp - I have not changed this file at all
5) apache-nutch-2.3.log - http://pastebin.com/R9iNqcmN - The log file from running the following in DEBUG:
   ./bin/crawl urls/seeds.txt crawl2 http://mylocalhost:8983/solr/nutch_crawler/ 5
6) cassandra keyspace - http://pastebin.com/xj3cWUKE - Output of the webpage keyspace
7) bin/nutch readdb -dump - http://pastebin.com/1ZDYCeBi - Output from running that command
8) regex-urlfilter.xml - Not included because I have not touched it at all

As a side note on Solr: when creating my "nutch_crawler" core I tried two different methods, described below. Different tutorials state you should do one or the other, so I'm not sure what the correct procedure is.

1) Generic create:
   ./bin/solr create -c nutch_crawler
   This defaults to the Solr-provided "data_driven_schema_configs" configset, which doesn't include a schema.
2) Create a special "nutch_configs" configset. I did this by copying the "basic_configs" configset provided by Solr to a new folder "nutch_configs", then copying the schema.xml file provided by Nutch into "nutch_configs", replacing the schema from "basic_configs". I then created the core using:
   ./bin/solr create -c nutch_crawler -d /path/to/configsets/nutch_configs

Jonathan

---------------------------------------------------------
Jonathan Katon
Design Technology Group, Teradyne, Inc.
Software Tools Engineer
Office: 978-370-3561
Cell: 978-809-4001
Email: [email protected]

From: Lewis John Mcgibbney <[email protected]>
To: "[email protected]" <[email protected]>
Date: 02/25/2015 06:37 PM
Subject: Re: Nutch2.3/Solr5/Cassandra2.1.3 crawl returns no data

Hi Jonathan,

There are two other threads ongoing, namely
http://www.mail-archive.com/user%40nutch.apache.org/msg13237.html
and
http://www.mail-archive.com/user%40nutch.apache.org/msg13235.html
Please monitor those links and we can take it from there.
I would strongly suggest that you set logging levels to DEBUG within log4j.properties and then create a fresh log. Then step through the individual stages of the crawl cycle and try to verify whether you are losing data between FetcherJob and ParserJob.
Thank you
Lewis

On Wed, Feb 25, 2015 at 3:06 PM, <[email protected]> wrote:

> I am installing Nutch and Solr for the first time and as a noob I am
> having a problem with Nutch and Solr not returning any results after a
> crawl - I'm using http://nutch.apache.org . Any help would be greatly
> appreciated. I have looked over the Nutch and Apache logs and nothing is
> popping out at me as a problem.
>
> On a RHEL 6.4 server I have installed Nutch 2.3, Solr 5.0 and Cassandra 2.1.3.
> To accomplish this I followed instructions on multiple sites including:
> http://wiki.apache.org/nutch/NutchTutorial
> http://wiki.apache.org/nutch/Nutch2Tutorial
> https://wiki.apache.org/nutch/Nutch2Cassandra
> http://wiki.apache.org/nutch/IntranetDocumentSearch
>
> I know Cassandra is working by testing:
>
> bin/cassandra-cli
> Connected to: "Test Cluster" on 127.0.0.1/9160
> Welcome to Cassandra CLI version 2.1.3
>
> I know Solr is working because I have created a core named "nutch_crawler"
> and I can go to the website and access the GUI at
> http://mydomain:8983/solr
>
> Before I built Nutch using "ant runtime" I updated the following:
>
> *ivy/ivy.xml* - Uncommented the line
> <dependency org="org.apache.gora" name="gora-cassandra" rev="0.5" conf="*->default" />
>
> *conf/gora.properties* - Added two lines
> gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
> gora.cassandrastore.servers=localhost:9160
>
> *conf/nutch-site.xml* - Added the following:
> <property>
>   <name>http.agent.name</name>
>   <value>Nutch HTTP Agent Crawler</value>
> </property>
> <property>
>   <name>storage.data.store.class</name>
>   <value>org.apache.gora.cassandra.store.CassandraStore</value>
>   <description>Default class for storing data</description>
> </property>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description></description>
> </property>
> <property>
>   <name>metatags.names</name>
>   <value>*</value>
>   <description>Names of the metatags to extract, separated by ','.
>     Use '*' to extract all metatags. Prefixes the names with 'metatag.'
>     in the parse-metadata. For instance to index description and keywords,
>     you need to activate the plugin index-metadata and set the value of the
>     parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
>   </description>
> </property>
> <property>
>   <name>index.parse.md</name>
>   <value>*</value>
>   <description></description>
> </property>
> <property>
>   <name></name>
>   <value></value>
>   <description></description>
> </property>
>
> I then built Nutch using "ant runtime" and there were no errors. After the
> build I went to runtime/local/ and created the directory "urls" and then
> created a file in urls called "seeds.txt" which contains a single line,
> without the quotes: "http://nutch.apache.org"
>
> Next, to perform the crawl I ran the following:
>
> ./bin/crawl urls/seeds.txt crawl2 http://mydomain:8983/solr/nutch_crawler/ 5
>
> This runs perfectly fine with no errors returned.
>
> If I go to core admin in the Solr web GUI it shows me that nutch_crawler
> contains no docs.
>
> If I spit out the nutch db stats:
>
> ./bin/nutch readdb crawl2 -status
> WebTable statistics start
> Statistics for WebTable:
> status 2 (status_fetched): 1
> min score: 1.0
> retry 0: 2
> jobs:
> {db_stats-job_local1413246058_0001={jobID=job_local1413246058_0001,
> jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0},
> Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=116,
> MAP_INPUT_RECORDS=2, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=14,
> MAP_OUTPUT_BYTES=106, COMMITTED_HEAP_BYTES=1516240896, CPU_MILLISECONDS=0,
> SPLIT_RAW_BYTES=905, COMBINE_INPUT_RECORDS=8, REDUCE_INPUT_RECORDS=7,
> REDUCE_INPUT_GROUPS=7, COMBINE_OUTPUT_RECORDS=7, PHYSICAL_MEMORY_BYTES=0,
> REDUCE_OUTPUT_RECORDS=7, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=8},
> FileSystemCounters={FILE_BYTES_READ=1217716, FILE_BYTES_WRITTEN=1392268},
> File Output Format Counters ={BYTES_WRITTEN=250}}}}
> max score: 1.0
> TOTAL urls: 2
> status 3 (status_gone): 1
> avg score: 1.0
> WebTable statistics: done
>
> If I spit out the nutch db dump:
>
> ./bin/nutch readdb crawl2 -dump crawl2dump
> more crawl2dump/part-r-00000
> http://nutch.apache.org/ key: org.apache.nutch:http/
> baseUrl: null
> status: 2 (status_fetched)
> fetchTime: 1427496745962
> prevFetchTime: 1424904733150
> fetchInterval: 2592000
> retriesSinceFetch: 0
> modifiedTime: 0
> prevModifiedTime: 0
> protocolStatus: (null)
> parseStatus: (null)
> title: null
> score: 1.0
> marker _injmrk_ : y
> marker dist : 0
> reprUrl: null
> batchId: 1424904736-28420
> metadata _csh_ :
>
> ---------------------------------------------------------
>
> *Jonathan Katon*
>
> *Design Technology Group, Teradyne, Inc.*
> *Software Tools Engineer*
>
> Office: 978-370-3561
> Cell: 978-809-4001
> Email: [email protected]

--
*Lewis*
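The dump quoted above shows a record that is marked fetched but has a null parseStatus. A rough Python sketch for spotting such records in a larger dump might look like this. It assumes the plain-text layout shown in the thread (a URL line followed by "name: value" lines). `parse_dump` and `fetched_but_unparsed` are hypothetical helpers written for this example, not Nutch APIs.

```python
def parse_dump(text):
    """Split readdb dump text into one dict per URL record.

    Assumption: each record starts with a line beginning with the URL,
    followed by "name: value" lines, as in the dump quoted in the thread.
    """
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(("http://", "https://")):
            # New record: the first token on the line is the URL.
            current = {"url": line.split()[0]}
            records.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return records


def fetched_but_unparsed(records):
    """Return URLs with status 2 (fetched) whose parseStatus is missing or null."""
    return [
        r["url"]
        for r in records
        if r.get("status", "").startswith("2 ")
        and r.get("parseStatus") in (None, "(null)")
    ]


# Sample input: a shortened copy of the record quoted in the thread.
sample_dump = """\
http://nutch.apache.org/ key: org.apache.nutch:http/
baseUrl: null
status: 2 (status_fetched)
protocolStatus: (null)
parseStatus: (null)
title: null
"""

records = parse_dump(sample_dump)
unparsed = fetched_but_unparsed(records)
print(unparsed)
```

A non-empty result would be consistent with data going missing between FetcherJob and ParserJob, though it does not by itself say why.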

