Sorry, had a crash. I have seen the plain-text file of pages crawled - is that really all I need, assuming I can get it to list the images on each page too?
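
To make it concrete, here is roughly what I want to end up with once the crawl finishes: walk the crawldb and print every URL it knows about. A rough sketch against Nutch 1.x / Hadoop 0.20 - the part-00000 path is a guess on my part, and with more than one reducer there would be several part-* directories to loop over:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.util.NutchConfiguration;

    public class ListCrawledUrls {
      public static void main(String[] args) throws IOException {
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // The crawldb is a map of URL -> CrawlDatum stored as Hadoop
        // SequenceFiles; this path assumes a local crawl dir named "crawl".
        Path data = new Path("crawl/crawldb/current/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();
        CrawlDatum datum = new CrawlDatum();
        while (reader.next(url, datum)) {
          System.out.println(url);  // just the URL, not the content
        }
        reader.close();
      }
    }

From Alex's pointer below, org.apache.nutch.crawl.CrawlDbReader looks like it does this and more (stats, dumps, single-URL lookups), and "bin/nutch readdb crawl/crawldb -dump <some_dir>" seems to be the command-line equivalent - presumably the plain-text file I was looking at.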
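
On the images: as far as I can tell, image URLs are filtered out before they ever reach the crawldb, because the stock URL filters skip image suffixes. In my 1.x checkout both conf/crawl-urlfilter.txt (which the one-shot crawl command appears to use) and conf/regex-urlfilter.txt contain a line like:

    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

(the exact suffix list may differ by version). If I drop the image extensions from that pattern, the img src outlinks that parse-html collects should survive updatedb and show up in the crawldb dump - does that sound right?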
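
And for kicking off the crawl from code: the examples I have found just invoke the crawl tool's entry point directly, with the same arguments as the command line. A minimal sketch, assuming the Nutch 1.x org.apache.nutch.crawl.Crawl class (the urls dir, depth, and topN values are only placeholders):

    import org.apache.nutch.crawl.Crawl;

    public class RunCrawl {
      public static void main(String[] args) throws Exception {
        // Equivalent to: bin/nutch crawl urls -dir crawl -depth 3 -topN 50
        Crawl.main(new String[] {
            "urls", "-dir", "crawl", "-depth", "3", "-topN", "50"
        });
      }
    }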
On Wednesday, July 14, 2010, Branden Makana <[email protected]> wrote:
> Hi Alex,
>
> Thanks for your reply. I want to have Nutch crawl a site, then get a
> list of all pages/images from the crawl. I am fluent in Java; I was
> just looking for pointers on where to begin.
>
> From the tutorial, I did see a "crawl/
>
> On Wednesday, July 14, 2010, Alex McLintock <[email protected]> wrote:
>> I'm a bit confused as to what you want to do, what skills you have
>> available, and how much you can code yourself. Presumably you have
>> seen the linkdb, and you see that there is code to read from the
>> linkdb?
>>
>> Have you looked at the readdb facility? You probably want to look at
>> the class org.apache.nutch.crawl.CrawlDbReader.
>>
>> Alex
>>
>> On 14 July 2010 21:34, Branden Root <[email protected]> wrote:
>>> Hello,
>>>
>>> I'm new to Nutch, so pardon me if this question has been asked
>>> before (an archives search didn't show anything). I'm trying to use
>>> Nutch to crawl a website and then get a list of all URLs on the
>>> site, including image URLs. I just need the URLs themselves, not
>>> the page/image content or anything like that.
>>>
>>> Right now I know how to run Nutch on the command line, and after
>>> crawling/indexing I can view the links/whatever file to see all the
>>> links, so that's a starting point. But I really want to be able to
>>> programmatically run a Nutch crawl (examples exist), then
>>> programmatically retrieve those links (no examples I can find). I
>>> also want to include all image hrefs on the crawled site in the
>>> final link printout (if that is even possible).
>>>
>>> Any help is greatly appreciated!
>>>
>>> Thanks,
>>> Branden Makana
>>>

--
Branden Root
Chief Technology Officer | 206.575.3740
Portent Interactive
An Internet Marketing Agency
http://www.portentinteractive.com

