Hi,

I want to use Apache Nutch to collect all first- and second-level links for every domain in the seed file. I am struggling to dump the results in the right form. The command bin/nutch readlinkdb out/linkdb -dump links is not enough, because I want the following specific representation:

rootdomain1: {subdomain1: {subsubdomain1, subsubdomain2, subsubdomain3, subsubdomain4, ...}, subdomain2: {subsubdomain5, subsubdomain6, subsubdomain1, subsubdomain7, ...}, ...}
rootdomain2: {...}
...
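
To make the target concrete, here is a rough Python sketch of the post-processing I have in mind. It is only an illustration under assumptions: I assume the text dump lists each target URL on a line ending in "Inlinks:", followed by indented " fromUrl: <url> anchor: <text>" lines, so the inlinks have to be inverted into outlinks first. The function names (parse_linkdb_dump, build_two_level_map, format_map) are made up for this example, and it works on full URLs rather than domains. Depending on the link-filtering settings, same-host links may also be missing from the linkdb.

```python
from collections import defaultdict


def parse_linkdb_dump(text):
    """Yield (from_url, to_url) pairs from a readlinkdb text dump.

    Assumed layout (not verified against every Nutch version):
    a record header '<to_url><TAB>Inlinks:' followed by indented
    lines of the form ' fromUrl: <from_url> anchor: <text>'.
    """
    to_url = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.endswith("Inlinks:") and not line.startswith(" "):
            # Header line: everything before 'Inlinks:' is the target URL.
            to_url = stripped[: -len("Inlinks:")].strip()
        elif stripped.startswith("fromUrl:") and to_url is not None:
            rest = stripped[len("fromUrl:"):].strip()
            from_url = rest.split(" anchor:", 1)[0].strip()
            yield from_url, to_url


def build_two_level_map(edges, seeds):
    """Return {seed: {first_level_link: [second_level_links]}}."""
    outlinks = defaultdict(set)
    for src, dst in edges:
        outlinks[src].add(dst)  # invert inlinks into outlinks
    nested = {}
    for seed in seeds:
        nested[seed] = {
            child: sorted(outlinks.get(child, ()))
            for child in sorted(outlinks.get(seed, ()))
        }
    return nested


def format_map(nested):
    """Render the nested map in the 'root: {child: {...}, ...}' form."""
    lines = []
    for root, children in nested.items():
        inner = ", ".join(
            child + ": {" + ", ".join(grand) + "}"
            for child, grand in children.items()
        )
        lines.append(root + ": {" + inner + "}")
    return "\n".join(lines)
```

With the part files of the dump directory concatenated into one string, print(format_map(build_two_level_map(parse_linkdb_dump(text), seeds))) would give one line per seed URL.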

For reference, this is the command I use to collect the links in the first place: bin/crawl urls/ out none 2

Any ideas?
