How to make Nutch 1.7 request mimic a browser?
In some cases the content downloaded in the Fetch phase from an HTTP URL is not what you would get if you requested the same URL from a well-known browser such as Google Chrome, because the server expects a User-Agent value that represents a browser.

There is an http.agent.name property in nutch-site.xml. Is that the property to set so that the server responds to a Nutch GET request the same way it would respond to a request from a browser, or is there another configurable property?

For example, the User-Agent value for a Chrome browser is:

Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36

Thanks.
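One way to confirm that the User-Agent header is what changes the response, independently of Nutch, is to fetch the same URL twice with different User-Agent values and compare the bodies. The following is a minimal Python sketch using only the standard library. The function names and the "Nutch-test" agent string are hypothetical, and the sketch makes no claim about which Nutch property produces the header.

```python
import urllib.request

# User-Agent string quoted in the question above.
CHROME_UA = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36")


def build_request(url, user_agent):
    """Return a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent, timeout=10):
    """Download the response body for url using the given User-Agent."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def differs_by_user_agent(url, crawler_agent="Nutch-test"):
    """Return True if the server sends different bytes to a crawler-like
    agent than to the Chrome agent. "Nutch-test" is a made-up agent name."""
    return fetch(url, crawler_agent) != fetch(url, CHROME_UA)
```

Pages with dynamic content (timestamps, session tokens) can differ between any two requests, so a True result from differs_by_user_agent is only a hint that the server is sensitive to the User-Agent value.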
Re: Nutch2.3/Solr5/Cassandra2.1.3 crawl returns no data
Lewis and anyone else reading this,

Thank you for the links to the other posts. I will continue to review any updates to them!

Before I go into my response to the last email, I want to give a mile-high overview of what I am trying to accomplish. I have an intranet site with thousands of pages created using txt, html, php, and javascript, along with many PowerPoint, Word, and PDF documents being served up. I am trying to add search functionality so that content can be found within the website. For this I assume the best approach is to use Solr and Nutch. I couldn't get Nutch 2.3 to behave with Solr 5.0, so I ended up adding Cassandra 2.1.3, which made the three play together nicely without all the errors I was getting before. I don't know enough about Nutch yet to know whether it can do what I'm trying to accomplish, so Tika may be thrown into the mix at some point.

I have updated the log4j.properties logging levels to DEBUG as you suggested. Reviewing the log, I see a couple of errors and exceptions, but I'm not sure whether they are the reason for the lack of crawl data:

ERROR store.CassandraStore -
ERROR store.CassandraStore - [Ljava.lang.StackTraceElement;@7481ca2
DEBUG util.NativeCodeLoader - Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
WARN mapreduce.GoraRecordWriter - Exception at GoraRecordWriter.class while closing datastore.InvalidRequestException(why:supercolumn parameter is not optional for super CF sc)

Outside of the items listed above, nothing stands out as a problem between FetcherJob and ParserJob.
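When scanning a DEBUG-level log for problems like the ones listed above, it can help to count ERROR and WARN entries per logger instead of reading line by line. The following is a minimal Python sketch, assuming each log line contains a level followed by a logger name and " - ", as in the excerpts above. The function names are hypothetical, and the path passed to summarize_file is whatever file log4j is configured to write to.

```python
import re
from collections import Counter

# Matches a level and logger name as they appear in the excerpts above,
# e.g. "ERROR store.CassandraStore - message". Any leading timestamp is ignored.
LINE_RE = re.compile(r"\b(ERROR|WARN)\s+(\S+)\s+-")


def summarize(lines):
    """Count ERROR and WARN entries per (level, logger) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


def summarize_file(path):
    """Summarize a log file on disk; undecodable bytes are replaced."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return summarize(handle)
```

Printing summarize_file(path).most_common() lists the noisiest loggers first, which narrows down where to start reading the surrounding DEBUG output.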
I have created multiple pastebins to simplify sharing of the config and log files:

1) log4j.properties - http://pastebin.com/KnK4A5wB
2) nutch-site.xml - http://pastebin.com/uuVmEdFU
3) gora.properties - http://pastebin.com/DeiM3aUF - I only added two lines for Cassandra
4) gora-cassandra-mapping.xml - http://pastebin.com/F9dCsRwp - I have not changed this file at all
5) apache-nutch-2.3.log - http://pastebin.com/R9iNqcmN - The log file from running the following in DEBUG:
   ./bin/crawl urls/seeds.txt crawl2 http://mylocalhost:8983/solr/nutch_crawler/ 5
6) cassandra keyspace - http://pastebin.com/xj3cWUKE - Output of the webpage keyspace
7) bin/nutch readdb -dump - http://pastebin.com/1ZDYCeBi - Output from running that command
8) regex-urlfilter.xml - Not included because I have not touched it at all.

As a side note on Solr: when creating my nutch_crawler core I tried two different methods, described below. Different tutorials say to do one or the other, so I'm not sure which procedure is correct.

1) Generic create:
   ./bin/solr create -c nutch_crawler
   This defaults to the Solr-provided data_driven_schema_configs configset, which doesn't include a schema.
2) Create a special nutch_configs configset. I did this by copying the basic_configs configset provided by Solr to a new folder, nutch_configs, and then copying the schema.xml file provided by Nutch into nutch_configs, replacing the schema from basic_configs. I then created the core using:
   ./bin/solr create -c nutch_crawler -d /path/to/configsets/nutch_configs

Jonathan

-
Jonathan Katon
Design Technology Group, Teradyne, Inc.
Software Tools Engineer
Office: 978-370-3561
Cell: 978-809-4001
Email: jonathan.ka...@teradyne.com

From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
To: user@nutch.apache.org user@nutch.apache.org
Date: 02/25/2015 06:37 PM
Subject: Re: Nutch2.3/Solr5/Cassandra2.1.3 crawl returns no data

Hi Jonathan,

There are another two threads ongoing, namely
http://www.mail-archive.com/user%40nutch.apache.org/msg13237.html
and
http://www.mail-archive.com/user%40nutch.apache.org/msg13235.html
Please monitor those links and we can take it from there.

I would strongly suggest that you set logging levels to DEBUG within log4j.properties and then create a fresh log. Then step through the individual stages of the crawl cycle and try to verify whether you are losing data between FetcherJob and ParserJob.

Thank you
Lewis

On Wed, Feb 25, 2015 at 3:06 PM, user-digest-h...@nutch.apache.org wrote:

I am installing Nutch and Solr for the first time, and as a noob I am having a problem with Nutch and Solr not returning any results after a crawl - I'm using http://nutch.apache.org . Any help would be greatly appreciated. I have looked over the Nutch and Apache logs and nothing is popping out at me as a problem.

On a RHEL 6.4 server I have installed Nutch 2.3, Solr 5.0 and Cassandra 2.1.3. To accomplish this I followed instructions on multiple sites, including:

http://wiki.apache.org/nutch/NutchTutorial
http://wiki.apache.org/nutch/Nutch2Tutorial
https://wiki.apache.org/nutch/Nutch2Cassandra
http://wiki.apache.org/nutch/IntranetDocumentSearch

I know Cassandra is working by testing:

bin/cassandra-cli
Connected to: Test Cluster on 127.0.0.1/9160
RE: Nutch2.3/Solr5/Cassandra2.1.3 crawl returns no data
Hi Jonathan,

It looks like I am running into the same issue as you. I've spent the last two days trying to get Nutch 2.3.0 to run on Windows Server 2012 R2 + Cygwin and communicate with storage on a different Windows server, all running in Azure. (I know, Linux would be much easier to get this stack running, but that isn't an option on this project.)

I started with HBase, since that seemed to be a popular option and Azure has a pre-made set of VMs for it. Unfortunately Azure's HBase options are 0.98 and 0.98.4, neither of which proved compatible with Nutch 2.3/Gora 0.5. Moving on, I opted for Cassandra today and pulled down version 2.0.2, since that version was listed on the Nutch homepage. Cassandra ran on Windows without any hacks, so that was a plus. I plan to output into Elasticsearch or, even better, JSON.

As for Nutch, I ran into a series of issues that I managed to get past while trying to index my personal website using the step-by-step Nutch commands:

1. Permission-setting issue with Hadoop 1.x on Windows/Cygwin. Worked around for now with a hacky patch: https://github.com/congainc/patch-hadoop_7682-1.0.x-win. Note that the other hack people suggest, replacing the Hadoop 1.2 jar in lib with the Hadoop 0.20.2 jar, does not work on Nutch 2.3 (it results in a java.lang.ExceptionInInitializerError at org.apache.gora.mapreduce.GoraOutputFormat.setOutput). I'm hoping this might be resolved in Hadoop 2.x, but based on NUTCH-1936, it looks like Nutch support for Hadoop 2.x may not happen until GSoC this year.

2. During the Nutch generate phase, a Thrift exception:

Exception in thread main java.lang.NoSuchMethodError: org.apache.thrift.EncodingUtils.setBit(IIZ)I
    at org.apache.cassandra.thrift.CfDef.setGc_grace_secondsIsSet(CfDef.java:895)
    at org.apache.cassandra.thrift.CfDef.setGc_grace_seconds(CfDef.java:881)
    at me.prettyprint.cassandra.service.ThriftCfDef.toThrift(ThriftCfDef.java:270)
    at me.prettyprint.cassandra.service.ThriftCfDef.toThriftList(ThriftCfDef.java:258)
    at me.prettyprint.cassandra.service.ThriftKsDef.toThrift(ThriftKsDef.java:109)
    at me.prettyprint.cassandra.service.ThriftCluster$6.execute(ThriftCluster.java:158)
    at me.prettyprint.cassandra.service.ThriftCluster$6.execute(ThriftCluster.java:151)
    at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:104)
    at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:253)
    at me.prettyprint.cassandra.service.ThriftCluster.addKeyspace(ThriftCluster.java:168)
    at org.apache.gora.cassandra.store.CassandraClient.checkKeyspace(CassandraClient.java:171)
    at org.apache.gora.cassandra.store.CassandraClient.initialize(CassandraClient.java:121)
    at org.apache.gora.cassandra.store.CassandraStore.initialize(CassandraStore.java:152)
    at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:104)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:163)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:137)
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78)
    at org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:133)
    at org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:122)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:209)
    at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:241)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:308)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:316)

Solved by removing the libthrift-0.8.0.jar file from the lib directory, leaving just libthrift-0.9.1.jar. I'm not sure this is a great thing to do, but it did unblock me on this step.

3. The elasticindex command no longer works in Nutch 2.3, as the indexer was moved to a plugin model:

Error: Could not find or load main class org.apache.nutch.indexer.elastic.ElasticIndexerJob

After adding indexer-elastic to plugin.includes, nutch index -all now runs and connects to the ES cluster I specified in nutch-site.

However, the one issue I haven't gotten past is the one you mention and Lewis linked to: I'm not getting any of the linked pages off of my initial seed, and there doesn't appear to be any content or metadata pulled down from my site (even the seed page). I'll look through the links Lewis provided and try to get additional tracing. Hopefully we'll get this working, as Cassandra does seem like a good platform to use.

Matt

From: jonathan.ka...@teradyne.com [jonathan.ka...@teradyne.com]
Sent: Thursday, February 26, 2015 11:58 AM
To: user@nutch.apache.org
Subject: Re: Nutch2.3/Solr5/Cassandra2.1.3 crawl returns no data

Lewis and anyone else reading this, Thank you for the links to the other posts. I will continue to