On Thu, Jun 27, 2013 at 4:30 AM, devang pandey <[email protected]> wrote:

> I am quite new to Nutch. I have crawled a site successfully using Nutch 1.2
>

You should use the latest version (1.7) as it has many bug fixes and
enhancements.


> and extracted a segment dump with the *readseg* command, but the issue is that the dump
> contains a lot of information other than URLs and outlinks, so if I want to
> analyse it, a manual approach needs to be adopted.


Did you use the general options? Those are:

  -nocontent     ignore content directory
  -nofetch       ignore crawl_fetch directory
  -nogenerate    ignore crawl_generate directory
  -noparse       ignore crawl_parse directory
  -noparsedata   ignore parse_data directory
  -noparsetext   ignore parse_text directory

To see the usage, just run "bin/nutch readseg" w/o any params.
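
For example, to keep only the parse_data entries (which is where the outlinks
live), something along these lines should work; the segment path below is just
a placeholder for one of your own segments:

  # Dump only parse_data, skipping the other segment directories.
  # Replace the segment path with an actual segment from your crawl.
  bin/nutch readseg -dump crawl/segments/20130627123456 dump_out \
    -nocontent -nofetch -nogenerate -noparse -noparsetext

The plain-text dump ends up under the output directory (dump_out above).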

> It would be really great
> if there were any utility or plugin which exports links with outlinks in a
> machine-readable format like CSV or SQL. Please suggest.
>

The "dump" option of readseg command would give you a dump of the segment
in plain text file which is human readable. You could run some shell
commands to convert it into desired form you want.
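
For instance, assuming the default dump layout where each record starts with a
"URL::" line and parse_data lists outlinks as "outlink: toUrl: ..." lines,
a rough sketch like this would produce url,outlink CSV pairs (adjust the
patterns to match whatever your dump actually contains):

  # Rough sketch: extract url,outlink pairs from a readseg dump.
  # Assumes "URL:: <url>" record headers and
  # "  outlink: toUrl: <url> anchor: ..." lines from parse_data.
  awk '
    /^URL::/          { url = $2 }
    /outlink: toUrl:/ { print url "," $3 }
  ' dump_out/dump > outlinks.csv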
