Hi All,
I never heard back from Alex, so I'm wondering if anyone else has insight.
I know that I can run
"bin/nutch readlinkdb mycrawl/linkdb -dump linkdbout"
to have Nutch dump the crawled pages to a text file, so I'm sure it's possible
to get a list of all pages crawled on a site. However, the output of that
command is not very easy to parse - I just want a list of each unique page URL
from the crawl, which I guess I could get by parsing that output and removing
duplicate URLs.
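In case it helps to show what I'm after, here is roughly what I had in mind for
the parsing step. This is just a sketch (the class name is made up, and I'm only
guessing that each record line in the dump starts with the page URL, with the
inlink detail lines starting with something else); I'd point it at the
part-00000 file inside the dump directory:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;

public class UniqueUrls {
  public static void main(String[] args) throws IOException {
    // Collect unique page URLs from the text dump produced by "readlinkdb -dump".
    // Assumption: record lines begin with the URL itself, so any line whose
    // first token isn't a URL (e.g. inlink details) is simply skipped.
    Set<String> urls = new TreeSet<String>();
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      line = line.trim();
      if (line.length() == 0) {
        continue;
      }
      String first = line.split("\\s+")[0];
      if (first.startsWith("http://") || first.startsWith("https://")) {
        urls.add(first);
      }
    }
    in.close();
    for (String url : urls) {
      System.out.println(url);
    }
  }
}

Is that roughly what people normally do, or is there a cleaner way?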
I'm also looking at the source of CrawlDbReader - it looks like no matter how
you invoke it from the command line, it ends up creating a NutchJob and
submitting it to Hadoop. I'm not familiar with Hadoop, so that's another
stumbling block.
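For what it's worth, the other idea I had was to skip running a job at all and
read the db files directly with the Hadoop I/O classes, something like the
sketch below. I'm guessing at the path layout (crawldb/current/part-00000/data)
and at Text/CrawlDatum being the key/value classes, so this may well be off:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class ListCrawledUrls {
  public static void main(String[] args) throws Exception {
    // Assumption: each part of the crawldb is a map file whose "data" file is a
    // SequenceFile of <Text url, CrawlDatum>, e.g.
    //   mycrawl/crawldb/current/part-00000/data
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    while (reader.next(url, datum)) {
      System.out.println(url);   // the key is the page URL
    }
    reader.close();
  }
}

If the crawldb is the wrong place to look and I should be walking the linkdb
instead, I assume the same approach would work against its part files, just
with a different value class.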
Could someone familiar with the crawldb tell me if I'm on the right
track? Again, I just want a list of all pages on a site from a crawl (without
duplicates, but I can remove them myself if I have to). Should I be trying to
parse the output of "readlinkdb -dump", or should I try to run some job
through Hadoop?
Many Thanks,
Branden Makana
On Jul 14, 2010, at 9:57 PM, Branden Root wrote:
> Hi Alex,
>
> Thanks for your reply. I want to have Nutch crawl a site, then get a list of
> all pages/images on the site from the crawl. I am fluent in Java, but I'm
> looking for pointers to where to begin.
>
> From running the tutorial, I did see a file created by the crawl,
> "links/part-00000", with plaintext info on all the site's pages - is that the
> linkdb you refer to?
>
> Thanks,
> Branden Makana
>
> On Wednesday, July 14, 2010, Alex McLintock <[email protected]> wrote:
>> I'm a bit confused about what you want to do, what skills you have,
>> and how much you can code yourself. Presumably you have seen the
>> linkdb, and that there is code to read from it?
>>
>> Have you looked at the readdb tool? You probably want to look at
>> the class org.apache.nutch.crawl.CrawlDbReader.
>>
>>
>> Alex