Host specific parsing

2009-07-28 Thread Koch Martina
Hi,

has anyone built a parsing plugin which decides on a per-host basis how the
content of the document should be parsed?

For example, if the title of a document is in the first h1 tag of a page for
host1, but the title for a document of host2 is in the third h2 tag, the
plugin would extract the title differently depending on the host.

In my opinion something like a dispatcher plugin would be needed:

-  Identify the host of a document

-  Read and cache instructions on how to get the information for that
host (from a database or config file; sketched below)

-  Execute the host-specific plugin
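
For illustration, such per-host instructions could be as small as a rule
record like the following sketch (all names here are hypothetical, not an
existing Nutch API):

// Hypothetical rule record: which tag, and which occurrence of it, holds the title.
public class TitleRule {
    public final String tagName;   // e.g. "h1" or "h2"
    public final int occurrence;   // 1-based index of the tag on the page

    public TitleRule(String tagName, int occurrence) {
        this.tagName = tagName;
        this.occurrence = occurrence;
    }
}

// e.g. host1 -> new TitleRule("h1", 1); host2 -> new TitleRule("h2", 3)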

Do you have any suggestions on how to implement such a scenario efficiently?
Has anyone implemented something similar and can point out possible
performance issues or other critical issues to be considered?

Thanks in advance.

Kind regards,
Martina


Development support

2009-07-28 Thread Koch Martina
Hi,

we're looking for a Nutch developer to implement some plugins for us in the 
next few weeks.
Substantial knowledge of Nutch, Java and databases is required.

If you're interested, please contact me (koch at huberverlag dot de).

Thanks in advance,

Martina



Dumping what I have?

2009-07-28 Thread Paul Tomblin
The Nutch data files are pretty opaque, and even the Unix strings utility
can't extract anything except the occasional URL.  Is there any code to dump
the contents of the various files in a human-readable form?

-- 
http://www.linkedin.com/in/paultomblin


Re: Dumping what I have?

2009-07-28 Thread reinhard schwab
yes, there are tools which you can use to dump the content of crawl db,
link db and segments.

# assumes the crawl directory is ./crawl; $1 = name of a segment under $crawl/segments
crawl=./crawl
dump=$crawl/dump
bin/nutch readdb $crawl/crawldb -dump $dump/crawldb
bin/nutch readlinkdb $crawl/linkdb -dump $dump/linkdb
bin/nutch readseg -dump $crawl/segments/$1 $dump/segments/$1

you will get more info on the available options if you call these commands without arguments:

bin/nutch readdb
bin/nutch readlinkdb
bin/nutch readseg

Paul Tomblin schrieb:
 The Nutch data files are pretty opaque, and even the Unix strings utility
 can't extract anything except the occasional URL.  Is there any code to dump
 the contents of the various files in a human-readable form?



Re: Support needed

2009-07-28 Thread Sudhi Seshachala
As a long-time Nutch user, developer of plugins, and someone who has
implemented Nutch in some products, I could help you.
I am based in Houston, Texas; Skype me at hooduku.

sudhi

--- On Mon, 7/27/09, sf30098 sf30...@yahoo.com wrote:

From: sf30098 sf30...@yahoo.com
Subject: Support needed
To: nutch-user@lucene.apache.org
Date: Monday, July 27, 2009, 4:01 PM


I need someone with substantial knowledge of Nutch, Java and Lucene who has
customised the system before. In particular, this should relate to image
indexing and geo-positioning, if possible (either one on its own is fine as
well).

The role will involve providing support and advice on how to go about
implementing such a system.

This includes:
1. answering questions and providing guidance during implementation
2. reviewing code and suggesting how to improve it.

Please let me know if you're interested.
-- 
View this message in context: 
http://www.nabble.com/Support-needed-tp24688172p24688172.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Host specific parsing

2009-07-28 Thread Andrzej Bialecki

Koch Martina wrote:

Hi,

has anyone built a parsing plugin which decides on a per-host basis how the
content of the document should be parsed?

For example, if the title of a document is in the first h1 tag of a page for
host1, but the title for a document of host2 is in the third h2 tag, the
plugin would extract the title differently depending on the host.

In my opinion something like a dispatcher plugin would be needed:

-  Identify the host of a document

-  Read and cache instructions on how to get the information for that
host (from a database or config file)

-  Execute the host-specific plugin

Do you have any suggestions on how to implement such a scenario efficiently?
Has anyone implemented something similar and can point out possible
performance issues or other critical issues to be considered?


Yes, and yes. With the current plugin system you can create a new 
dispatcher plugin, and then add other necessary plugins as import 
elements. This way they will be accessible from the same classloader, so 
that you can instantiate them directly in your dispatcher plugin.
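
A minimal sketch of that dispatch step, assuming the delegate parsers have
already been instantiated (DelegateParser, extractTitle and the wiring are
hypothetical placeholders, not Nutch's actual extension-point API):

import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class DispatcherSketch {

    // Hypothetical delegate abstraction; in a real plugin this would be one
    // of Nutch's parse extension points, pulled in via import elements.
    public interface DelegateParser {
        String extractTitle(String html);
    }

    private final Map<String, DelegateParser> byHost = new HashMap<String, DelegateParser>();
    private final DelegateParser fallback;

    public DispatcherSketch(Map<String, DelegateParser> byHost, DelegateParser fallback) {
        this.byHost.putAll(byHost);
        this.fallback = fallback;
    }

    public String extractTitle(String url, String html) throws Exception {
        String host = new URL(url).getHost();   // 1. identify the host
        DelegateParser p = byHost.get(host);    // 2. look up the host-specific parser
        if (p == null) {
            p = fallback;                       // unknown host: use the default parser
        }
        return p.extractTitle(html);            // 3. execute it
    }
}

In a real plugin the map would be populated from the cached per-host
instructions discussed below.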


As for the lookup ... many solutions are possible. DB connections from 
map tasks may be problematic, both because of latency and the cost of 
setting up so many DB connections. OTOH, if you add local caching (using 
JCS or Ehcache) the hit/miss ratio should be decent enough. If the 
mapping of host names to plugins can be expressed by rules then maybe a 
simple rule set would be enough.
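
A sketch of that local-caching idea, with a plain ConcurrentHashMap standing
in for JCS/Ehcache (loadRulesFromDb and the rule-string format are made up
for illustration):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HostRuleCacheSketch {

    private final Map<String, String> cache = new ConcurrentHashMap<String, String>();

    public String rulesFor(String host) {
        String rules = cache.get(host);
        if (rules == null) {
            // first request for this host in this task: one DB/config round trip
            rules = loadRulesFromDb(host);
            cache.put(host, rules);
        }
        return rules;
    }

    // Hypothetical placeholder for the real lookup against a database or
    // config file.
    private String loadRulesFromDb(String host) {
        return "h1:1"; // made-up rule string: first h1 tag holds the title
    }
}

A production cache would add expiry and size limits, which is exactly what
JCS or Ehcache provide out of the box.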


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Dumping what I have?

2009-07-28 Thread Paul Tomblin
Awesome!  Thanks.

On Tue, Jul 28, 2009 at 12:26 PM, reinhard schwab reinhard.sch...@aon.at wrote:

 yes, there are tools which you can use to dump the content of crawl db,
 link db and segments.

 # assumes the crawl directory is ./crawl; $1 = name of a segment under $crawl/segments
 crawl=./crawl
 dump=$crawl/dump
 bin/nutch readdb $crawl/crawldb -dump $dump/crawldb
 bin/nutch readlinkdb $crawl/linkdb -dump $dump/linkdb
 bin/nutch readseg -dump $crawl/segments/$1 $dump/segments/$1

 you will get more info on the available options if you call these commands without arguments:

 bin/nutch readdb
 bin/nutch readlinkdb
 bin/nutch readseg

 Paul Tomblin schrieb:
  The Nutch data files are pretty opaque, and even the Unix strings utility
  can't extract anything except the occasional URL.  Is there any code to
  dump the contents of the various files in a human-readable form?




-- 
http://www.linkedin.com/in/paultomblin