Is there a way to extract it and then inject it again? I would like to extract the content and metadata from each page, change them, and then persist the result. Is there a simple way to do that with the Nutch classes, or what else could I do?
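
To make it concrete, below is roughly what I had in mind - an untested sketch that reads the <url, Content> records straight out of a segment's content directory, tweaks the per-page metadata, and writes the modified records to a new file. The paths and the "my-custom-field" key are just made-up examples, and I realize merging the output back into the segment would be a separate step:

// Untested sketch: read <url, Content> pairs from a Nutch 1.x segment,
// tweak the per-page metadata, and write the modified records elsewhere.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class SegmentContentRewriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // One part file of a segment's content directory (path is a guess).
    Path in = new Path("mycrawl/segments/20100717000000/content/part-00000/data");
    // Modified copy; getting it back into the segment is a separate step.
    Path out = new Path("content-modified/part-00000/data");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Content.class);

    Text url = new Text();
    Content content = new Content();
    try {
      while (reader.next(url, content)) {
        // content.getContent() is the raw page as a byte[],
        // content.getMetadata() holds the per-page metadata.
        content.getMetadata().set("my-custom-field", "my-value");
        writer.append(url, content);
      }
    } finally {
      reader.close();
      writer.close();
    }
  }
}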
On Sat, Jul 17, 2010 at 11:16 AM, Rayala Udayakumar <[email protected]> wrote:
> Hi,
>
> You can use the -format option along with the dump, which will allow
> you to get the crawl db dump in CSV format.
>
> "bin/nutch readlinkdb mycrawl/linkdb -dump linkdbout -format csv"
>
> The CSV will have the following fields:
>
> Url;Status code;Status name;Fetch Time;Modified Time;Retries since
> fetch;Retry interval;Score;Signature;Metadata
>
> I am not sure if there will be only one row per URL. If not, you can
> then extract the unique URLs from the CSV file.
>
> - Uday.
>
> On Sat, Jul 17, 2010 at 12:35 AM, Branden Makana
> <[email protected]> wrote:
> > Hi All,
> >
> > I never heard a reply back from Alex, so I'm wondering if anyone
> > else has insight. I know that I can run
> >
> > "bin/nutch readlinkdb mycrawl/linkdb -dump linkdbout"
> >
> > to have Nutch dump the list of pages crawled to a text file, so
> > I'm sure it's possible to get a list of all pages crawled on a site.
> > However, the format from that command is not very easy to parse - I just
> > want a list of each unique page URL from the crawl, which I guess I could
> > get by parsing that output and removing duplicate URLs.
> >
> > I'm also looking at the source of CrawlDbReader - it looks like
> > however you call the command line, it'll end up creating a NutchJob and
> > sending it off to Hadoop. I'm not familiar with Hadoop, so that's another
> > stumbling block.
> >
> > Could someone familiar with the crawldb tell me if I'm on the
> > right track? Again, I just want a list of all pages on a site from a crawl
> > (without duplicates, but I can remove them myself if I have to). Should I
> > be trying to parse the output of "readlinkdb -dump", or should I be trying
> > to run some job through Hadoop?
> >
> > Many Thanks,
> > Branden Makana
> >
> > On Jul 14, 2010, at 9:57 PM, Branden Root wrote:
> >
> >> Hi Alex,
> >>
> >> Thanks for your reply. I want to have Nutch crawl a site, then get a
> >> list of all pages/images on the site from the crawl. I am fluent in Java,
> >> but I'm looking for pointers to where to begin.
> >>
> >> From running the tutorial, I did see a file created by the crawl,
> >> "links/part-00000", with plaintext info on all the site's pages - is that
> >> the linkdb you refer to?
> >>
> >> Thanks,
> >> Branden Makana
> >>
> >> On Wednesday, July 14, 2010, Alex McLintock <[email protected]> wrote:
> >>> I'm a bit confused as to what you want to do, what skills you have
> >>> available, and how much you can code yourself. Presumably you have seen
> >>> the linkdb, and you see that there is code to read from the linkdb?
> >>>
> >>> Have you looked at the ReadDB facility? You probably want to look at
> >>> the class org.apache.nutch.crawl.CrawlDbReader
> >>>
> >>> Alex
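
P.S. On the unique-URL question further down the thread: as far as I understand it, the crawldb keeps exactly one entry per URL it knows about, so reading the keys directly already gives a duplicate-free list without going through a dump and removing duplicates afterwards. A rough sketch along the same lines (again untested, and the part-file path is a guess):

// Untested sketch: print each URL in the crawldb once.
// The crawldb maps URL -> CrawlDatum, so the keys should already be unique.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class ListCrawledUrls {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // One part file of the crawldb; loop over all part-* dirs in a real run.
    Path db = new Path("mycrawl/crawldb/current/part-00000/data");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, db, conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    try {
      while (reader.next(url, datum)) {
        // One line per URL known to the crawldb; to list only fetched pages,
        // check datum.getStatus() against CrawlDatum.STATUS_DB_FETCHED first.
        System.out.println(url);
      }
    } finally {
      reader.close();
    }
  }
}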

