Hi,
has anyone built a parsing plugin which decides on a per host basis how the
content of the document should be parsed?
For example, if the title of a document is in the first h1-tag of a page for
host1, but the title for a document of host2 is in the third h2-tag, the
plugin would extract
Hi,
we're looking for a Nutch developer to implement some plugins for us in the
next few weeks.
Substantial knowledge of Nutch, Java and databases is needed.
If you're interested, please contact me (koch at huberverlag dot de).
Thanks in advance,
Martina
The nutch data files are pretty opaque, and even strings can't extract
anything except the occasional URL. Is there any code to dump the contents
of the various files in a human readable form?
--
http://www.linkedin.com/in/paultomblin
yes, there are tools which you can use to dump the content of crawl db,
link db and segments.
dump=./crawl/dump
bin/nutch readdb $crawl/crawldb -dump $dump/crawldb
bin/nutch readlinkdb $crawl/linkdb -dump $dump/linkdb
bin/nutch readseg -dump $1 $dump/segments/$1
you will get more info if you
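The commands above can be wrapped in a small function that dumps everything in one go. This is only a sketch under a few assumptions: the crawl lives on the local filesystem (not HDFS), `$crawl` is the crawl directory (the snippet above uses it without defining it), and the names `NUTCH` and `dump_crawl` are made up for illustration. It loops over every directory under `segments/` instead of taking a single segment as `$1`.

```shell
#!/bin/sh
# Sketch: dump the crawl db, link db and all segments of a Nutch crawl.
# NUTCH and dump_crawl are hypothetical names; adjust to your installation.
NUTCH="${NUTCH:-bin/nutch}"

dump_crawl() {
    crawl="$1"                  # crawl directory, e.g. ./crawl (assumed layout)
    dump="${2:-$crawl/dump}"    # output directory, defaults to <crawl>/dump

    $NUTCH readdb "$crawl/crawldb" -dump "$dump/crawldb"
    $NUTCH readlinkdb "$crawl/linkdb" -dump "$dump/linkdb"

    # One output directory per segment, named after the segment directory.
    for seg in "$crawl"/segments/*; do
        [ -d "$seg" ] || continue   # skip if there are no segments
        $NUTCH readseg -dump "$seg" "$dump/segments/$(basename "$seg")"
    done
}
```

Run from the Nutch installation directory, `dump_crawl ./crawl` would write plain-text dumps below `./crawl/dump`; a second argument picks a different output directory.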
As a long-time Nutch user and plugin developer who has also integrated Nutch
into some products, I could help you.
I am based in Houston, Texas -- skype me on hooduku
sudhi
--- On Mon, 7/27/09, sf30098 sf30...@yahoo.com wrote:
From: sf30098 sf30...@yahoo.com
Subject: Support needed
To:
Koch Martina wrote:
Hi,
has anyone built a parsing plugin which decides on a per host basis how the
content of the document should be parsed?
For example, if the title of a document is in the first h1-tag of a page for
host1, but the title for a document of host2 is in the third h2-tag, the
Awesome! Thanks.
On Tue, Jul 28, 2009 at 12:26 PM, reinhard schwab reinhard.sch...@aon.at wrote:
yes, there are tools which you can use to dump the content of crawl db,
link db and segments.
dump=./crawl/dump
bin/nutch readdb $crawl/crawldb -dump $dump/crawldb
bin/nutch readlinkdb