Re: Accessing Nutch crawl results via hadoop

Chris Gerken Thu, 20 Sep 2012 18:43:03 -0700

Lewis,

The nutch/gora build produces a nutch job file and a nutch jar file.  The jar 
file seems to have everything (and a bit more) that I need.  The problem is 
getting that jar into maven, but that seems to require only a few manual 
steps.  I'll know for sure when I get this thing running under hadoop.


thanks

- Chris



On Sep 20, 2012, at 6:43 PM, Lewis John Mcgibbney wrote:

> Chris,
> 
> How did you get on with this? Any progress?
> 
> I wanted to have a look today but got caught up identifying problems
> in gora-cassandra v0.2.1.
> 
> Did you find a method to reuse generated webpage classes?
> 
> Lewis
> 
> On Wed, Sep 19, 2012 at 10:06 PM, Chris Gerken
> <[email protected]> wrote:
>> No.  This has nothing to do with ant.  The nutch job has been built and it 
>> has been run. As part of that ant build some Avro classes were built (e.g. 
>> WebPage) specifically for the storage of crawled data into Cassandra via 
>> gora. It seems to me that as I build a completely different job - one that's 
>> going to run in hadoop and access the crawled data from Cassandra - that I 
>> can reuse the classes that the nutch build created (e.g. WebPage) 
>> instead of rebuilding them from scratch.  So I know those Avro classes are 
>> there somewhere.  What I don't know is which ones they are and what 
>> auxiliary files they prereq.
>> 
>> So my question is: Do those files that I need to access the crawled data in 
>> Cassandra exist in a reusable jar somewhere as a result of the nutch build?  
>> I'm not interested in the source, just the actual class files.
>> 
>> Chris Gerken
>> 
>> 
>> 
>> On Sep 19, 2012, at 3:56 PM, Lewis John Mcgibbney wrote:
>> 
>>> Can you not just run 'ant job' from the command line?
>>> 
>>> Is this what you mean?
>>> 
>>> From the Nutch top-level directory you can run 'ant -projecthelp' to
>>> see a fully annotated description of all of the possible ant tasks.
>>> 
>>> hth
>>> 
>>> On Wed, Sep 19, 2012 at 9:51 PM, Chris Gerken
>>> <[email protected]> wrote:
>>>> Hello,
>>>> 
>>>> We've set up nutch and gora to gather some crawling data which is now 
>>>> stored in a Cassandra column family.  Is there some easy way to get the 
>>>> Avro classes used for the crawl, along with any necessary supporting 
>>>> files, into a hadoop job?  I'm building the hadoop job with maven, but am 
>>>> willing to consume a simple jar if there is a jar that just holds the 
>>>> classes and files I want.
>>>> 
>>>> thanks
>>>> 
>>>> - Chris
>>> 
>>> 
>>> 
>>> --
>>> Lewis
>> 
> 
> 
> 
> -- 
> Lewis
