Hi,
I want to use Apache Nutch to collect all first- and second-level links
for every domain in the seed file. What I am struggling with is how to
dump the results properly. The command bin/nutch readlinkdb out/linkdb
-dump links is not enough, because I need the following very specific
representation:
rootdomain1: {subdomain1: {subsubdomain1, subsubdomain2,
subsubdomain3, subsubdomain4, ...}, subdomain2: {subsubdomain5,
subsubdomain6, subsubdomain1, subsubdomain7, ...}, ...}
rootdomain2: {...}
...
For reference, this is the command I use to collect the links in the
first place: bin/crawl urls/ out none 2
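To make the target structure concrete, here is a rough Python sketch of the post-processing I have in mind. It rests on assumptions I have not verified against a real dump: that each record in the text dump starts with a line of the form "<url><TAB>Inlinks:" followed by indented lines of the form "fromUrl: <url> anchor: <text>", and that the seed URLs appear in the dump exactly as written in the seed file. Since the linkdb stores inlinks, the sketch inverts them into outlinks first. The function names parse_linkdb_dump and two_level_tree are made up for illustration and are not part of Nutch.

```python
from collections import defaultdict


def parse_linkdb_dump(lines):
    """Invert inlink records into an outlink map: source URL -> set of target URLs.

    Assumed (unverified) dump layout:
        <target url>\tInlinks:
         fromUrl: <source url> anchor: <anchor text>
    """
    outlinks = defaultdict(set)
    target = None
    for raw in lines:
        line = raw.rstrip("\n")
        stripped = line.strip()
        if not stripped:
            continue  # blank separator between records
        if stripped.startswith("fromUrl:"):
            if target is None:
                continue  # inlink line without a record header; skip it
            rest = stripped[len("fromUrl:"):].strip()
            # Drop the anchor text, keeping only the source URL.
            source = rest.split(" anchor:", 1)[0].strip()
            outlinks[source].add(target)
        else:
            # Record header: the URL is everything before the first tab.
            target = line.split("\t", 1)[0].strip()
    return outlinks


def two_level_tree(outlinks, seeds):
    """Build {seed: {first-level link: [second-level links]}} from the outlink map."""
    tree = {}
    for seed in seeds:
        tree[seed] = {
            first: sorted(outlinks.get(first, ()))
            for first in sorted(outlinks.get(seed, ()))
        }
    return tree
```

The idea would be to feed it the lines of the part file written by the -dump command together with the seed URLs, then print or serialize the resulting dictionary. URL normalization (trailing slashes, http vs. https) would probably need extra handling before the lookups match.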
Any ideas?