Hi,

You can pass the -format option to the dump command, which will give
you the crawldb dump in CSV format:

 "bin/nutch readdb mycrawl/crawldb -dump crawldbout -format csv"

The CSV will have the following fields:

Url;Status code;Status name;Fetch Time;Modified Time;Retries since
fetch;Retry interval;Score;Signature;Metadata

I am not sure whether there will be only one row per URL. If not, you
can then extract the unique URLs from the CSV file yourself.
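
If it helps, once the dump finishes, something like this should pull
out the unique URLs (an untested sketch - it assumes the dump lands in
part-* files under the output directory and that ';' is the field
separator as above; you may still need to drop the header line and any
quotes around the URLs):

 "cut -d';' -f1 crawldbout/part-* | sort -u > urls.txt"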

- Uday.

On Sat, Jul 17, 2010 at 12:35 AM, Branden Makana
<[email protected]> wrote:
> Hi All,
>
>
>        I never heard back from Alex, so I'm wondering if anyone else 
> has insight. I know that I can run
>
>                "bin/nutch readlinkdb mycrawl/linkdb -dump linkdbout"
>
>        to have Nutch dump the list of pages crawled to a text file, so I'm 
> sure it's possible to get a list of all pages crawled on a site. However, the 
> command's output format is not very easy to parse - I just want a list of 
> each unique page URL from the crawl, which I guess I could get by parsing 
> that output and removing duplicate URLs.
>
>        I'm also looking at the source of CrawlDbReader - it looks like 
> however you invoke it from the command line, it ends up creating a 
> NutchJob and sending it off to Hadoop. I'm not familiar with Hadoop, so 
> that's another stumbling block.
>
>
>        Could someone familiar with the crawldb tell me if I'm on the right 
> track? Again, I just want a list of all pages on a site from a crawl (without 
> duplicates, though I can remove them myself if I have to). Should I be trying to 
> parse the output of "readlinkdb -dump", or should I be trying to run some job 
> through Hadoop?
>
>
> Many Thanks,
> Branden Makana
>
>
>
> On Jul 14, 2010, at 9:57 PM, Branden Root wrote:
>
>> Hi Alex,
>>
>> Thanks for your reply. I want to have Nutch crawl a site, then get a list of 
>> all pages/images on the site from the crawl. I am fluent in Java, but I'm 
>> looking for pointers to where to begin.
>>
>> From running the tutorial, I did see a file created by the crawl, 
>> "links/part-00000", with plaintext info on all the site's pages - is that the 
>> linkdb you refer to?
>>
>>
>>
>> Thanks,
>> Branden Makana
>>
>>
>>
>> On Wednesday, July 14, 2010, Alex McLintock <[email protected]> wrote:
>>> I'm a bit confused as to what you want to do, what skills you have
>>> available, and how much you can code yourself. Presumably you have seen
>>> the linkdb, and you see that there is code to read from it?
>>>
>>> Have you looked at the ReadDB facility? You probably want to look at
>>> the class org.apache.nutch.crawl.CrawlDbReader
>>>
>>>
>>> Alex
>
>
