Host specific parsing

2009-07-28 Thread Koch Martina
Hi, has anyone built a parsing plugin which decides on a per-host basis how the content of the document should be parsed? For example, if the title of a document is in the first h1 tag of a page for host1, but the title for a document of host2 is in the third h2 tag, the plugin would extract
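To make the idea in the question concrete, here is a minimal sketch of per-host title extraction. It is not a Nutch plugin and does not use any Nutch API; it is plain Python using only the standard library. The host names, the TITLE_RULES table, the DEFAULT_RULE fallback, and the names NthTagText and extract_title are all hypothetical, chosen only to mirror the "first h1 for host1, third h2 for host2" example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical per-host rules: host -> (tag name, 1-based occurrence).
TITLE_RULES = {
    "host1.example": ("h1", 1),  # title is in the first <h1>
    "host2.example": ("h2", 3),  # title is in the third <h2>
}
# Assumed fallback for hosts without a rule: the first <title> element.
DEFAULT_RULE = ("title", 1)


class NthTagText(HTMLParser):
    """Collect the text inside the n-th occurrence of a given tag.

    Nested tags of the same name are not handled; this is only a sketch.
    """

    def __init__(self, tag, n):
        super().__init__()
        self.tag = tag
        self.n = n
        self.seen = 0           # number of matching opening tags so far
        self.capturing = False  # True while inside the n-th occurrence
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.seen += 1
            self.capturing = self.seen == self.n

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.parts.append(data)


def extract_title(url, html):
    """Pick the rule for the URL's host and return the matching text."""
    host = urlparse(url).hostname or ""
    tag, n = TITLE_RULES.get(host, DEFAULT_RULE)
    parser = NthTagText(tag, n)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

Calling extract_title(url, html) looks up the rule by the host part of the URL, so adding a new host only means adding one entry to the table. In a real plugin the same lookup would presumably sit inside whatever parse extension point is used, which this sketch does not attempt to show.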

Development support

2009-07-28 Thread Koch Martina
Hi, we're looking for a Nutch developer to implement some plugins for us in the next few weeks. Substantial knowledge of Nutch, Java and databases is needed. If you're interested, please contact me (koch at huberverlag dot de). Thanks in advance, Martina

Dumping what I have?

2009-07-28 Thread Paul Tomblin
The Nutch data files are pretty opaque, and even the strings utility can't extract anything except the occasional URL. Is there any code to dump the contents of the various files in a human-readable form? -- http://www.linkedin.com/in/paultomblin

Re: Dumping what I have?

2009-07-28 Thread reinhard schwab
Yes, there are tools which you can use to dump the content of crawl db, link db and segments.
    dump=./crawl/dump
    bin/nutch readdb $crawl/crawldb -dump $dump/crawldb
    bin/nutch readlinkdb $crawl/linkdb -dump $dump/linkdb
    bin/nutch readseg -dump $1 $dump/segments/$1
you will get more info if you

Re: Support needed

2009-07-28 Thread Sudhi Seshachala
As a long-time Nutch user and plugin developer who has even implemented Nutch in some products, I could help you. I am based in Houston, Texas. Skype me on hooduku sudhi --- On Mon, 7/27/09, sf30098 sf30...@yahoo.com wrote: From: sf30098 sf30...@yahoo.com Subject: Support needed To:

Re: Host specific parsing

2009-07-28 Thread Andrzej Bialecki
Koch Martina wrote: Hi, has anyone built a parsing plugin which decides on a per-host basis how the content of the document should be parsed? For example, if the title of a document is in the first h1 tag of a page for host1, but the title for a document of host2 is in the third h2 tag, the

Re: Dumping what I have?

2009-07-28 Thread Paul Tomblin
Awesome! Thanks.
On Tue, Jul 28, 2009 at 12:26 PM, reinhard schwab reinhard.sch...@aon.at wrote:
yes, there are tools which you can use to dump the content of crawl db, link db and segments.
    dump=./crawl/dump
    bin/nutch readdb $crawl/crawldb -dump $dump/crawldb
    bin/nutch readlinkdb