How to make Nutch 1.7 request mimic a browser?

2015-02-26 Thread Meraj A. Khan
In some instances the content downloaded in the Fetch phase from an
HTTP URL is not what you would get if you requested the same URL from
a well-known browser such as Google Chrome. That is because the server
is expecting a user agent value that represents a browser.

There is an http.agent.name property in nutch-site.xml. Is it the
property that should be used to set the user agent, so that the server
responds to a Nutch GET request the same way it would to a request
from a browser? Or is there another configurable property?

For example, the user agent value for a Chrome browser is below.

Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/41.0.2228.0 Safari/537.36


Thanks.
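
One way to check, outside of Nutch, whether the user agent is really what
changes the server's response is to send the same GET request with two
different User-Agent headers and compare the bodies. The sketch below uses
only the Python standard library. The helper names build_request and
fetch_body are hypothetical, and it does not show how Nutch itself assembles
its agent string.

```python
import urllib.request

# Chrome user agent string quoted in the message above.
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/41.0.2228.0 Safari/537.36"
)


def build_request(url, user_agent):
    """Return a GET request for url carrying the given User-Agent (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_body(url, user_agent, timeout=10):
    """Fetch url with the given User-Agent and return the raw response bytes."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling fetch_body once with CHROME_UA and once with the crawler's agent
string, then comparing the two results, should show whether the server
treats the two clients differently.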


Re: Nutch2.3/Solr5/Cassandra2.1.3 crawl returns no data

2015-02-26 Thread jonathan . katon

Lewis and anyone else reading this,

Thank you for the links to the other posts. I will continue to review any
updates to them!!

Before I go into my response to the last email, I just want to give a
mile-high overview of what I am trying to accomplish. I have an intranet
site with thousands of pages created using txt, HTML, PHP, and JavaScript,
along with many PowerPoint, Word, and PDF documents being served up. I am
trying to add search functionality so that content can be found within the
website. For this I assume the best approach is to use Solr and Nutch. I
couldn't get Nutch 2.3 to behave with Solr 5.0, so I ended up adding
Cassandra 2.1.3, which made the three play together nicely without all
the errors I was getting before. I don't know enough about Nutch yet to
know if it can do what I'm trying to accomplish, so Tika may be thrown
into the mix at some point.


I have updated log4j.properties logging levels to DEBUG like you mentioned.

Reviewing the log, I see a couple of errors and exceptions, but I'm not
sure if they are the reason for the lack of crawl data:

ERROR store.CassandraStore -
ERROR store.CassandraStore - [Ljava.lang.StackTraceElement;@7481ca2
DEBUG util.NativeCodeLoader - Failed to load native-hadoop with error:
java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
WARN mapreduce.GoraRecordWriter - Exception at GoraRecordWriter.class while
closing datastore.InvalidRequestException(why:supercolumn parameter is not
optional for super CF sc)

Outside of those items I listed above nothing stands out as a problem
between FetcherJob and ParserJob.
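
With logging at DEBUG the log gets large, so it can help to pull out only
the ERROR and WARN entries and count them per logger. The following is a
minimal sketch, assuming each entry carries the level, the logger name, and
a " - " separator as in the lines quoted above. The function name
summarize_problems is hypothetical.

```python
import re

# Matches entries such as "ERROR store.CassandraStore - ..." or
# "WARN mapreduce.GoraRecordWriter - ...". The level may be preceded by a
# timestamp, so the pattern is searched for rather than anchored at the start.
LEVEL_RE = re.compile(r"\b(ERROR|WARN)\b\s+(\S+)\s+-")


def summarize_problems(lines):
    """Count ERROR/WARN entries per (level, logger) pair."""
    counts = {}
    for line in lines:
        match = LEVEL_RE.search(line)
        if match:
            key = (match.group(1), match.group(2))
            counts[key] = counts.get(key, 0) + 1
    return counts
```

Passing an open log file to summarize_problems gives a quick view of which
components are reporting problems before reading the full trace.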


I have created multiple pastebins to simplify sharing of config/log files:

1) log4j.properties - http://pastebin.com/KnK4A5wB
2) nutch-site.xml - http://pastebin.com/uuVmEdFU
3) gora.properties - http://pastebin.com/DeiM3aUF - I only added two lines
for cassandra
4) gora-cassandra-mapping.xml - http://pastebin.com/F9dCsRwp - I have not
changed this file at all
5) apache-nutch-2.3.log - http://pastebin.com/R9iNqcmN - The log file from
running in DEBUG  ./bin/crawl urls/seeds.txt crawl2
http://mylocalhost:8983/solr/nutch_crawler/ 5
6) cassandra keyspace - http://pastebin.com/xj3cWUKE - Showing output of
webpage keyspace
7) bin/nutch readdb -dump - http://pastebin.com/1ZDYCeBi - showing output
from running that command

8) regex-urlfilter.txt - Not including this because I have not touched it
at all.



As a side note on Solr: when creating my nutch_crawler core I tried two
different methods, described below. Different tutorials state you should
do one or the other, so I'm not sure what the correct procedure is.

1) Generic create - ./bin/solr create -c nutch_crawler
This defaults to using the Solr-provided data_driven_schema_configs
configset, which doesn't include a schema.

2) Create a special nutch_configs configset. I did this by copying the
basic_configs configset provided by Solr to a new folder, nutch_configs,
and then copying the schema.xml file provided by Nutch into
nutch_configs, replacing the schema from basic_configs. I then created
the core using

 ./bin/solr create -c nutch_crawler -d /path/to/configsets/nutch_configs
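
The copy steps in method 2 can be scripted so they are repeatable. This is
a minimal sketch, not an official procedure: the function name
make_nutch_configset is hypothetical, and it assumes the configset keeps its
schema at conf/schema.xml inside the configset folder.

```python
import shutil
from pathlib import Path


def make_nutch_configset(configsets_dir, nutch_schema, name="nutch_configs"):
    """Copy basic_configs to a new configset and overwrite its schema.xml."""
    configsets_dir = Path(configsets_dir)
    target = configsets_dir / name
    # copytree raises if the target already exists, so an earlier
    # configset is never silently overwritten.
    shutil.copytree(configsets_dir / "basic_configs", target)
    # Assumption: the schema lives at conf/schema.xml inside the configset.
    shutil.copy(nutch_schema, target / "conf" / "schema.xml")
    return target
```

The returned path is what would then be passed to the -d option of the
create command shown above.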

Jonathan


-

Jonathan Katon

Design Technology Group, Teradyne, Inc.
Software Tools Engineer

Office: 978-370-3561
Cell: 978-809-4001
Email: jonathan.ka...@teradyne.com






From:   Lewis John Mcgibbney lewis.mcgibb...@gmail.com
To: user@nutch.apache.org user@nutch.apache.org
Date:   02/25/2015 06:37 PM
Subject:Re: Nutch2.3/Solr5/Cassandra2.1.3 crawl returns no data



Hi Jonathan,

There are another two threads ongoing, namely

http://www.mail-archive.com/user%40nutch.apache.org/msg13237.html
and
http://www.mail-archive.com/user%40nutch.apache.org/msg13235.html

Please monitor those links and we can take it from there.
I would strongly suggest that you set logging levels to DEBUG within
log4j.properties and then create a fresh log.
Then step through the individual stages of the crawl cycle and try to
verify whether you are losing data between FetcherJob and ParserJob.
Thank you
Lewis



On Wed, Feb 25, 2015 at 3:06 PM, user-digest-h...@nutch.apache.org wrote:

 I am installing Nutch and Solr for the first time and as a noob I am
 having a problem with Nutch and Solr not returning any results after a
 crawl - I'm using http://nutch.apache.org . Any help would be greatly
 appreciated. I have looked over the Nutch and Apache logs and nothing is
 popping out at me as a problem.


 On a RHEL 6.4 server I have installed Nutch 2.3, Solr 5.0 and Cassandra
 2.1.3. To accomplish this I followed instructions on multiple sites
 including:
 http://wiki.apache.org/nutch/NutchTutorial
 http://wiki.apache.org/nutch/Nutch2Tutorial
 https://wiki.apache.org/nutch/Nutch2Cassandra
 http://wiki.apache.org/nutch/IntranetDocumentSearch

 I know Cassandra is working by testing:
  bin/cassandra-cli
 Connected to: Test Cluster on 127.0.0.1/9160
 

RE: Nutch2.3/Solr5/Cassandra2.1.3 crawl returns no data

2015-02-26 Thread Burrough, Matthew William
Hi Jonathan,



It looks like I am running into the same issue as you.  I've spent the last two 
days trying to get Nutch 2.3.0 to run on Windows Server 2012R2 + Cygwin and 
communicate with storage on a different Windows server, all running in Azure.  
(I know, Linux would be much easier to get this stack running, but that isn't 
an option on this project.)  I started with HBase, since that seemed to be a 
popular option and Azure has a pre-made set of VMs for it.  Unfortunately 
Azure's HBase options are 0.98 and 0.98.4, neither of which proved compatible 
with Nutch 2.3/Gora 0.5.  Moving on, I opted for Cassandra today and pulled 
down version 2.0.2 since that version was listed on the Nutch homepage.  
Cassandra ran on Windows without any hacks, so that was a plus.  I plan to 
output into Elasticsearch, or even better, JSON.



As for Nutch, I ran into a series of issues that I managed to get past while 
trying to index my personal website using the step-by-step Nutch commands:

  1.  Permission setting issue with Hadoop 1.x on Windows/Cygwin. Worked around
for now with a hacky patch:
https://github.com/congainc/patch-hadoop_7682-1.0.x-win. Note that the other
hack people suggest, replacing the Hadoop 1.2 jar with the Hadoop 0.20.2 jar
in lib, does not work on Nutch 2.3 (it results in a
java.lang.ExceptionInInitializerError at
org.apache.gora.mapreduce.GoraOutputFormat.setOutput). I'm hoping this might
be resolved in Hadoop 2.x, but based on NUTCH-1936, it looks like Nutch support
for Hadoop 2.x may not happen until GSoC this year.
  2.  During the Nutch generate phase, a Thrift exception:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.thrift.EncodingUtils.setBit(IIZ)I
    at org.apache.cassandra.thrift.CfDef.setGc_grace_secondsIsSet(CfDef.java:895)
    at org.apache.cassandra.thrift.CfDef.setGc_grace_seconds(CfDef.java:881)
    at me.prettyprint.cassandra.service.ThriftCfDef.toThrift(ThriftCfDef.java:270)
    at me.prettyprint.cassandra.service.ThriftCfDef.toThriftList(ThriftCfDef.java:258)
    at me.prettyprint.cassandra.service.ThriftKsDef.toThrift(ThriftKsDef.java:109)
    at me.prettyprint.cassandra.service.ThriftCluster$6.execute(ThriftCluster.java:158)
    at me.prettyprint.cassandra.service.ThriftCluster$6.execute(ThriftCluster.java:151)
    at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:104)
    at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:253)
    at me.prettyprint.cassandra.service.ThriftCluster.addKeyspace(ThriftCluster.java:168)
    at org.apache.gora.cassandra.store.CassandraClient.checkKeyspace(CassandraClient.java:171)
    at org.apache.gora.cassandra.store.CassandraClient.initialize(CassandraClient.java:121)
    at org.apache.gora.cassandra.store.CassandraStore.initialize(CassandraStore.java:152)
    at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:104)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:163)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:137)
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78)
    at org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:133)
    at org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:122)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:209)
    at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:241)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:308)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:316)

Solved by removing the libthrift-0.8.0.jar file from the lib directory,
leaving just libthrift-0.9.1.jar. I'm not sure this is a great thing to do,
but it did unblock me on this step.

  3.  The elasticindex command no longer works in Nutch 2.3, as the indexer was
moved to a plugin model:

Error: Could not find or load main class org.apache.nutch.indexer.elastic.ElasticIndexerJob

After adding indexer-elastic to plugin.includes, nutch index -all now
runs and connects to the ES cluster I specified in nutch-site.xml.
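
The NoSuchMethodError in item 2 came from two versions of the same library
sitting side by side in lib. A quick way to spot other such conflicts is to
group the jar files by artifact name and report any artifact that appears
in more than one version. The sketch below assumes jar names follow the
name-version.jar pattern with the version starting with a digit. The
function name find_duplicate_jars is hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path

# Splits "libthrift-0.9.1.jar" into name "libthrift" and version "0.9.1".
# The version is assumed to start with a digit.
JAR_RE = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_duplicate_jars(lib_dir):
    """Return {artifact: sorted versions} for artifacts present in several versions."""
    versions = defaultdict(set)
    for jar in Path(lib_dir).glob("*.jar"):
        match = JAR_RE.match(jar.name)
        if match:
            versions[match.group("name")].add(match.group("version"))
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}
```

Which of the reported versions should be kept is a separate question that
this check does not answer; it only lists the candidates.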

However, the one issue I haven't gotten past is the one you mention and Lewis
linked to: I'm not getting any of the linked pages off of my initial seed, and
there doesn't appear to be any content or metadata pulled down from my site
(even the seed page). I'll look through the links Lewis provided and try to get
additional tracing. Hopefully we'll get this working, as Cassandra does
seem like a good platform to use.



Matt




From: jonathan.ka...@teradyne.com [jonathan.ka...@teradyne.com]
Sent: Thursday, February 26, 2015 11:58 AM
To: user@nutch.apache.org
Subject: Re: Nutch2.3/Solr5/Cassandra2.1.3 crawl returns no data


Lewis and anyone else reading this,

Thank you for the links to the other posts. I will continue to