Hi Alex,

Thanks for your reply. I want to have Nutch crawl a site, then get a
list of all pages/images from the crawl. I am fluent in Java, I was
just looking for pointers to where to begin.
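To be concrete, what I'm hoping for is something along these lines — an untested sketch that assumes the Nutch 1.x on-disk layout, where the crawldb is a set of Hadoop MapFiles whose data files are SequenceFiles keyed by URL (Text) with CrawlDatum values. The `DumpUrls` class name and the `part-00000` path are just placeholders (there may be several part directories), and `bin/nutch readdb <crawldb> -dump <dir>` would be the command-line equivalent:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class DumpUrls {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Crawldb data lives under <crawldb>/current/part-*/data;
    // part-00000 is assumed here, adjust for multiple partitions.
    Path data = new Path(args[0], "current/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    // Each key is a URL known to the crawl; print it and move on.
    while (reader.next(url, datum)) {
      System.out.println(url);
    }
    reader.close();
  }
}
```

That would sidestep the CLI entirely and give me the URL list in-process, which is really what I'm after.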


From the tutorial, I did see a "crawl/
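For the image part, if img src attributes end up in the outlinks (something I still need to verify against the parser), I'd then filter the dumped URLs with something crude like this — a hypothetical `ImageFilter` helper that only checks the URL suffix, so it would miss images served without an extension:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ImageFilter {
  // Crude extension check; proper detection should really use the
  // Content-Type that Nutch records, not just the URL suffix.
  static boolean looksLikeImage(String url) {
    String lower = url.toLowerCase();
    for (String ext : new String[] {".jpg", ".jpeg", ".png", ".gif"}) {
      if (lower.endsWith(ext)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> urls = Arrays.asList(
        "http://example.com/index.html",
        "http://example.com/logo.png",
        "http://example.com/photo.JPG");
    List<String> images = new ArrayList<String>();
    for (String u : urls) {
      if (looksLikeImage(u)) {
        images.add(u);
      }
    }
    // Keeps only logo.png and photo.JPG from the sample list.
    System.out.println(images);
  }
}
```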

On Wednesday, July 14, 2010, Alex McLintock <[email protected]> wrote:
> I'm a bit confused as to what you want to do, what skills you have
> available, and how much you can code yourself. Presumably you have seen
> the linkdb, and that there is code to read from it?
>
> Have you looked at the ReadDB facility? You probably want to look at
> the class org.apache.nutch.crawl.CrawlDbReader
>
>
> Alex
>
>
>
> On 14 July 2010 21:34, Branden Root <[email protected]> wrote:
>> Hello,
>>
>>
>>        I'm new to Nutch, so pardon me if this question has been asked before 
>> (an archives search didn't show anything). I'm trying to use Nutch to crawl 
>> a website and then get a list of all URLs on the site, including image 
>> URLs. I just need the URLs themselves, not the page/image content or 
>> anything like that.
>>
>>        Right now I know how to run Nutch on the command line, and after 
>> crawling/indexing I can view the links file to see all the links, so 
>> that's a starting point. But I really want to be able to 
>> programmatically run a Nutch crawl (examples exist), and then 
>> programmatically retrieve those links (I can't find any examples of 
>> that). I also want the final link printout to include all image hrefs 
>> on the crawled site (if that is even possible).
>>
>>        Any help is greatly appreciated!
>>
>> Thanks,
>> Branden Makana
>>
>>
>>
>

-- 
Branden Root
Chief Technology Officer | 206.575.3740
Portent Interactive
An Internet Marketing Agency
http://www.portentinteractive.com