Hello,
I am installing Nutch and Solr for the first time, and as a noob I am having a problem: Solr does not return any results after a Nutch crawl of http://nutch.apache.org. Any help would be greatly appreciated. I have looked over the Nutch and Apache logs and nothing stands out as a problem.

On a RHEL 6.4 server I have installed Nutch 2.3, Solr 5.0 and Cassandra 2.1.3. To accomplish this I followed instructions on multiple sites, including:

    http://wiki.apache.org/nutch/NutchTutorial
    http://wiki.apache.org/nutch/Nutch2Tutorial
    https://wiki.apache.org/nutch/Nutch2Cassandra
    http://wiki.apache.org/nutch/IntranetDocumentSearch

I know Cassandra is working because the CLI connects:

    > bin/cassandra-cli
    Connected to: "Test Cluster" on 127.0.0.1/9160
    Welcome to Cassandra CLI version 2.1.3

I know Solr is working because I have created a core named "nutch_crawler" and I can access the GUI at http://mydomain:8983/solr

Before I built Nutch using "ant runtime" I updated the following files.

ivy/ivy.xml - Uncommented the line:

    <dependency org="org.apache.gora" name="gora-cassandra" rev="0.5" conf="*-> default" />

conf/gora.properties - Added two lines:

    gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
    gora.cassandrastore.servers=localhost:9160

conf/nutch-site.xml - Added the following:

    <property>
      <name>http.agent.name</name>
      <value>Nutch HTTP Agent Crawler</value>
    </property>
    <property>
      <name>storage.data.store.class</name>
      <value>org.apache.gora.cassandra.store.CassandraStore</value>
      <description>Default class for storing data</description>
    </property>
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|
      index-(basic|anchor|metadata)|indexer-solr|scoring-opic|
      urlnormalizer-(pass|regex|basic)</value>
      <description></description>
    </property>
    <property>
      <name>metatags.names</name>
      <value>*</value>
      <description>Names of the metatags to extract, separated by ','.
      Use '*' to extract all metatags. Prefixes the names with 'metatag.'
      in the parse-metadata. For instance to index description and keywords,
      you need to activate the plugin index-metadata and set the value of the
      parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
      </description>
    </property>
    <property>
      <name>index.parse.md</name>
      <value>*</value>
      <description></description>
    </property>
    <property>
      <name></name>
      <value></value>
      <description></description>
    </property>

I then built Nutch using "ant runtime" and there were no errors. After the build I went to runtime/local/, created the directory "urls", and created a file in it called "seeds.txt" which contains a single line (without the quotes): "http://nutch.apache.org".

Next, to perform the crawl, I ran the following:

    > ./bin/crawl urls/seeds.txt crawl2 http://mydomain:8983/solr/nutch_crawler/ 5

This runs with no errors returned. However, if I go to core admin in the Solr web GUI, it shows me that nutch_crawler contains no docs.

If I print the Nutch db stats:

    > ./bin/nutch readdb crawl2 -status
    WebTable statistics start
    Statistics for WebTable:
    status 2 (status_fetched): 1
    min score: 1.0
    retry 0: 2
    jobs: {db_stats-job_local1413246058_0001={jobID=job_local1413246058_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=116, MAP_INPUT_RECORDS=2, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=14, MAP_OUTPUT_BYTES=106, COMMITTED_HEAP_BYTES=1516240896, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=905, COMBINE_INPUT_RECORDS=8, REDUCE_INPUT_RECORDS=7, REDUCE_INPUT_GROUPS=7, COMBINE_OUTPUT_RECORDS=7, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=7, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=8}, FileSystemCounters={FILE_BYTES_READ=1217716, FILE_BYTES_WRITTEN=1392268}, File Output Format Counters ={BYTES_WRITTEN=250}}}}
    max score: 1.0
    TOTAL urls: 2
    status 3 (status_gone): 1
    avg score: 1.0
    WebTable statistics: done

If I dump the Nutch db:

    > ./bin/nutch readdb crawl2 -dump crawl2dump
    > more crawl2dump/part-r-00000
    http://nutch.apache.org/ key: org.apache.nutch:http/
    baseUrl: null
    status: 2 (status_fetched)
    fetchTime: 1427496745962
    prevFetchTime: 1424904733150
    fetchInterval: 2592000
    retriesSinceFetch: 0
    modifiedTime: 0
    prevModifiedTime: 0
    protocolStatus: (null)
    parseStatus: (null)
    title: null
    score: 1.0
    marker _injmrk_ : y
    marker dist : 0
    reprUrl: null
    batchId: 1424904736-28420
    metadata _csh_ :

---------------------------------------------------------
Jonathan Katon
Design Technology Group, Teradyne, Inc.
Software Tools Engineer
Office: 978-370-3561
Cell: 978-809-4001

