Host specific parsing
Hi,

has anyone built a parsing plugin which decides on a per-host basis how the content of a document should be parsed? For example, if the title of a document is in the first h1 tag of a page for host1, but the title for a document of host2 is in the third h2 tag, the plugin would extract the title differently depending on the host.

In my opinion something like a dispatcher plugin would be needed:

- Identify the host of a document
- Read and cache instructions on how to get the information for that host (database or config file)
- Execute the host-specific plugin

Do you have any suggestions on how to implement such a scenario efficiently? Has anyone implemented something similar who can point out possible performance issues or other critical issues to be considered?

Thanks in advance.

Kind regards,
Martina
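To make the rule part concrete, here is a minimal sketch in plain Java of the kind of per-host rule table described above; the host names and the TitleRule/RuleTable classes are purely illustrative, and in practice the table would be read and cached from a database or config file:

import java.util.HashMap;
import java.util.Map;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// one rule per host: which tag holds the title, and which occurrence of it
class TitleRule {
    final String tag;  // e.g. "h1" or "h2"
    final int index;   // zero-based occurrence within the page

    TitleRule(String tag, int index) { this.tag = tag; this.index = index; }

    // apply the rule to a parsed DOM; returns null if the tag occurs too rarely
    String extract(Element root) {
        NodeList nodes = root.getElementsByTagName(tag);
        Node n = nodes.item(index);
        return n == null ? null : n.getTextContent().trim();
    }
}

class RuleTable {
    private final Map<String, TitleRule> rules = new HashMap<>();

    RuleTable() {
        // hypothetical hosts; real rules would come from a database or config file
        rules.put("host1.example.com", new TitleRule("h1", 0)); // first h1
        rules.put("host2.example.com", new TitleRule("h2", 2)); // third h2
    }

    TitleRule forHost(String host) { return rules.get(host); }
}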
Development support
Hi,

we're looking for a Nutch developer to implement some plugins for us in the next few weeks. Substantial knowledge of Nutch, Java and databases is needed. If you're interested, please contact me (koch at huberverlag dot de).

Thanks in advance,
Martina
Dumping what I have?
The Nutch data files are pretty opaque, and even the Unix strings utility can't extract anything except the occasional URL. Is there any code to dump the contents of the various files in a human-readable form?

--
http://www.linkedin.com/in/paultomblin
Re: Dumping what I have?
yes, there are tools which you can use to dump the content of the crawl db, link db and segments.

crawl=./crawl        # assuming your crawl directory is ./crawl
dump=./crawl/dump

bin/nutch readdb $crawl/crawldb -dump $dump/crawldb
bin/nutch readlinkdb $crawl/linkdb -dump $dump/linkdb
bin/nutch readseg -dump $1 $dump/segments/$1   # $1 = name of one segment

you will get more info if you call

bin/nutch readdb
bin/nutch readlinkdb
bin/nutch readseg

Paul Tomblin wrote:

The Nutch data files are pretty opaque, and even the Unix strings utility can't extract anything except the occasional URL. Is there any code to dump the contents of the various files in a human-readable form?
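If you want to dump every segment in one go instead of passing each segment name by hand, a small shell loop works; this assumes the same ./crawl layout as the commands above:

crawl=./crawl
dump=./crawl/dump
for seg in $crawl/segments/*; do
  bin/nutch readseg -dump $seg $dump/segments/$(basename $seg)
done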
Re: Support needed
As a long-time Nutch user and developer of plugins, who has even implemented Nutch in some products, I could help you. I am based in Houston, Texas.

-- skype me on hooduku

sudhi

--- On Mon, 7/27/09, sf30098 sf30...@yahoo.com wrote:

From: sf30098 sf30...@yahoo.com
Subject: Support needed
To: nutch-user@lucene.apache.org
Date: Monday, July 27, 2009, 4:01 PM

I need someone with substantial knowledge of Nutch, Java and Lucene who has customised the system before, in particular for image indexing and geo-positioning (either one alone is fine as well). The role involves providing support and advice on how to go about implementing such a system. This includes:

1. answering questions and providing guidance during implementation
2. reviewing code and suggesting how to improve it

Please let me know if you're interested.
Re: Host specific parsing
Koch Martina wrote:

Hi, has anyone built a parsing plugin which decides on a per-host basis how the content of a document should be parsed? [...] Do you have any suggestions on how to implement such a scenario efficiently? Has anyone implemented something similar who can point out possible performance issues or other critical issues to be considered?

Yes, and yes. With the current plugin system you can create a new dispatcher plugin and then add the other necessary plugins as import elements. This way they will be accessible from the same classloader, so you can instantiate them directly in your dispatcher plugin.

As for the lookup ... many solutions are possible. DB connections from map tasks may be problematic, both because of latency and because of the cost of setting up so many DB connections. On the other hand, if you add local caching (using JCS or Ehcache) the hit/miss ratio should be decent enough. If the mapping of host names to plugins can be expressed by rules, then maybe a simple rule set would be enough.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web / Embedded Unix, System Integration
http://www.sigram.com - Contact: info at sigram dot com
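A rough sketch of the cached lookup Andrzej suggests, here with a plain ConcurrentHashMap standing in for JCS or Ehcache; HostHandler and loadHandlerFor are hypothetical names for the per-host plugin object and the expensive database/config lookup, and in a real plugin this would sit inside the dispatcher's parse filter:

import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// stands in for whatever per-host plugin or rule object the dispatcher executes
interface HostHandler {
    String extractTitle(org.w3c.dom.Element root);
}

class HandlerCache {
    // Optional so that "this host has no special rule" is cached as well;
    // otherwise every page from an unconfigured host would hit the database
    private final ConcurrentMap<String, Optional<HostHandler>> cache =
            new ConcurrentHashMap<>();

    Optional<HostHandler> handlerFor(String host) {
        return cache.computeIfAbsent(host, this::loadHandlerFor);
    }

    // the expensive part: one database or config-file lookup per distinct host
    private Optional<HostHandler> loadHandlerFor(String host) {
        // ... query the database / read the config file, build the handler ...
        return Optional.empty(); // placeholder: no host-specific handler found
    }
}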
Re: Dumping what I have?
Awesome! Thanks.

On Tue, Jul 28, 2009 at 12:26 PM, reinhard schwab reinhard.sch...@aon.at wrote:

yes, there are tools which you can use to dump the content of the crawl db, link db and segments. [...]

--
http://www.linkedin.com/in/paultomblin