date:20090728

Host specific parsing

2009-07-28 Thread Koch Martina

Hi, has anyone built a parsing plugin which decides on a per-host basis how the content of the document should be parsed? For example, if the title of a document is in the first h1-tag of a page for host1, but the title for a document of host2 is in the third h2-tag, the plugin would extract

Development support

2009-07-28 Thread Koch Martina

Hi, we're looking for a Nutch developer to implement some plugins for us in the next few weeks. Substantial knowledge of Nutch, Java and databases is needed. If you're interested, please contact me (koch at huberverlag dot de). Thanks in advance, Martina

Dumping what I have?

2009-07-28 Thread Paul Tomblin

The nutch data files are pretty opaque, and even strings can't extract anything except the occasional URL. Is there any code to dump the contents of the various files in a human readable form? -- http://www.linkedin.com/in/paultomblin

Re: Dumping what I have?

2009-07-28 Thread reinhard schwab

yes, there are tools which you can use to dump the content of crawl db, link db and segments.

dump=./crawl/dump
bin/nutch readdb $crawl/crawldb -dump $dump/crawldb
bin/nutch readlinkdb $crawl/linkdb -dump $dump/linkdb
bin/nutch readseg -dump $1 $dump/segments/$1

you will get more info if you
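The three dump commands in this reply can be wrapped in a small script. The following is only a sketch under assumptions that are not in the message: the snippet never defines `$crawl`, so it is assumed here to be `./crawl` (overridable through a hypothetical `CRAWL_DIR` variable), `NUTCH` is a hypothetical variable naming the launcher, and `dump_all` is a hypothetical function name. The segment argument is passed through exactly as `$1` is in the message.

```shell
#!/bin/sh
# Sketch: dump crawldb, linkdb and one segment into text form.
# ASSUMPTION: the crawl directory is ./crawl; the original snippet uses
# $crawl without showing where it is set.
crawl="${CRAWL_DIR:-./crawl}"
dump="$crawl/dump"
# ASSUMPTION: the launcher lives at bin/nutch relative to the current directory.
NUTCH="${NUTCH:-bin/nutch}"

# dump_all SEGMENT
# Runs the three commands from the message, using SEGMENT where the
# message uses $1.
dump_all() {
    segment="$1"
    "$NUTCH" readdb "$crawl/crawldb" -dump "$dump/crawldb"
    "$NUTCH" readlinkdb "$crawl/linkdb" -dump "$dump/linkdb"
    "$NUTCH" readseg -dump "$segment" "$dump/segments/$segment"
}
```

After sourcing the script from the Nutch installation directory, calling `dump_all` with a segment argument should run the same three commands; the quoting only guards against spaces in paths.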

Re: Support needed

2009-07-28 Thread Sudhi Seshachala

As a long-time Nutch user and developer of plugins, who has even implemented Nutch in some products, I could help you. I am based in Houston, Texas -- skype me on hooduku sudhi --- On Mon, 7/27/09, sf30098 sf30...@yahoo.com wrote: From: sf30098 sf30...@yahoo.com Subject: Support needed To:

Re: Host specific parsing

2009-07-28 Thread Andrzej Bialecki

Koch Martina wrote: Hi, has anyone built a parsing plugin which decides on a per-host basis how the content of the document should be parsed? For example, if the title of a document is in the first h1-tag of a page for host1, but the title for a document of host2 is in the third h2-tag, the

Re: Dumping what I have?

2009-07-28 Thread Paul Tomblin

Awesome! Thanks.

On Tue, Jul 28, 2009 at 12:26 PM, reinhard schwab reinhard.sch...@aon.at wrote:

yes, there are tools which you can use to dump the content of crawl db, link db and segments.

dump=./crawl/dump
bin/nutch readdb $crawl/crawldb -dump $dump/crawldb
bin/nutch readlinkdb

7 matches
