On Thu, Jun 27, 2013 at 4:30 AM, devang pandey <[email protected]> wrote:
> I am quite new to Nutch. I have crawled a site successfully using Nutch 1.2

You should use the latest version (1.7) as it has many bug fixes and
enhancements.

> and extracted a segment dump with the *readseg* command, but the issue is
> that the dump contains a lot of information other than the URL and
> outlinks, so if I want to analyse it, a manual approach needs to be
> adopted.

Did you use the general options? Those are:

  -nocontent      ignore content directory
  -nofetch        ignore crawl_fetch directory
  -nogenerate     ignore crawl_generate directory
  -noparse        ignore crawl_parse directory
  -noparsedata    ignore parse_data directory
  -noparsetext    ignore parse_text directory

To see the usage, just run "bin/nutch readseg" without any params.

> It would be really great if there is any utility or plugin which exports
> links with their outlinks in a machine-readable format like CSV or SQL.
> Please suggest.

The "dump" option of the readseg command gives you a dump of the segment as
a plain-text file, which is human readable. You could run some shell
commands to convert it into the form you want, along the lines of the
sketch below.
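A rough sketch (assuming a Nutch 1.x segment layout and the default dump
format, where each record has a "URL::" line and the parse data lists
"outlink: toUrl: ..." lines; the segment path here is just a made-up
example, replace it with one of your own segments):

  # dump only parse_data, which is where the outlinks live
  bin/nutch readseg -dump crawl/segments/20130627123456 dump_out \
      -nocontent -nofetch -nogenerate -noparse -noparsetext

  # readseg writes a file named "dump" into the output dir;
  # turn it into a simple url,outlink CSV
  awk '/^URL::/ {url = $2}
       /outlink: toUrl:/ {print url "," $3}' dump_out/dump > links.csv

Check the dump file first, though, since the exact line format can differ a
bit between Nutch versions.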

