Hi All,
I never heard back from Alex, so I'm wondering if anyone else has insight.
I know that I can run
"bin/nutch readlinkdb mycrawl/linkdb -dump linkdbout"
to have Nutch dump the crawled pages to a text file, so I'm sure it's possible
to get a list of all pages crawled on a site. However, the output of that
command is not very easy to parse - I just want a list of each unique page URL
from the crawl, which I guess I could get by parsing that output and removing
duplicate URLs.
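In case it helps to show what I'm after, here is roughly what I had in mind for
the parsing step. This is just a sketch (the class name is made up, and I'm only
guessing that each record line in the dump starts with the page URL, with the
inlink detail lines starting with something else); I'd point it at the
part-00000 file inside the dump directory:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;

public class UniqueUrls {
  public static void main(String[] args) throws IOException {
    // Collect unique page URLs from the text dump produced by "readlinkdb -dump".
    // Assumption: record lines begin with the URL itself, so any line whose
    // first token isn't a URL (e.g. inlink details) is simply skipped.
    Set<String> urls = new TreeSet<String>();
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      line = line.trim();
      if (line.length() == 0) {
        continue;
      }
      String first = line.split("\\s+")[0];
      if (first.startsWith("http://") || first.startsWith("https://")) {
        urls.add(first);
      }
    }
    in.close();
    for (String url : urls) {
      System.out.println(url);
    }
  }
}

Is that roughly what people normally do, or is there a cleaner way?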
I'm also looking at the source of CrawlDbReader - it looks like no matter how
you invoke it from the command line, it ends up creating a NutchJob and
submitting it to Hadoop. I'm not familiar with Hadoop, so that's another
stumbling block.
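For what it's worth, the other idea I had was to skip running a job at all and
read the db files directly with the Hadoop I/O classes, something like the
sketch below. I'm guessing at the path layout (crawldb/current/part-00000/data)
and at Text/CrawlDatum being the key/value classes, so this may well be off:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class ListCrawledUrls {
  public static void main(String[] args) throws Exception {
    // Assumption: each part of the crawldb is a map file whose "data" file is a
    // SequenceFile of <Text url, CrawlDatum>, e.g.
    //   mycrawl/crawldb/current/part-00000/data
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    while (reader.next(url, datum)) {
      System.out.println(url);   // the key is the page URL
    }
    reader.close();
  }
}

If the crawldb is the wrong place to look and I should be walking the linkdb
instead, I assume the same approach would work against its part files, just
with a different value class.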
Could someone familiar with the crawldb tell me if I'm on the right
track? Again, I just want a list of all pages on a site from a crawl (without
duplicates, but I can remove them myself if I have to). Should I be trying to
parse the output of "readlinkdb -dump", or should I try to run some job
through Hadoop?
Many Thanks,
Branden Makana
On Jul 14, 2010, at 9:57 PM, Branden Root wrote:
> Hi Alex,
>
> Thanks for your reply. I want to have Nutch crawl a site, then get a list of
> all pages/images on the site from the crawl. I am fluent in Java, but I'm
> looking for pointers to where to begin.
>
> From running the tutorial, I did see a file created by the crawl,
> "links/part-00000", with plaintext info on all the site's pages - is that the
> linkdb you refer to?
>
> Thanks,
> Branden Makana
>
> On Wednesday, July 14, 2010, Alex McLintock <[email protected]> wrote:
>> I'm a bit confused about what you want to do, what skills you have,
>> and how much you can code yourself. Presumably you have seen the
>> linkdb, and that there is code to read from it?
>>
>> Have you looked at the readdb tool? You probably want to look at
>> the class org.apache.nutch.crawl.CrawlDbReader.
>>
>>
>> Alex