Re: svn commit: r485076 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/SpellCheckedMetadata.java test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java

2006-12-09 Thread Chris Mattmann
Hi Sami,

On 12/9/06 2:27 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

> Author: siren
> Date: Sat Dec  9 14:27:07 2006
> New Revision: 485076
>
> URL: http://svn.apache.org/viewvc?view=rev&rev=485076
> Log:
> Optimize SpellCheckedMetadata further by taking into account the fact that it
> is used only for http-headers.
>
> I am starting to believe that spellchecking should just be a utility method
> used by http protocol plugins.

I think that right now I'm -1 on this change. I would make note of all the
comments on NUTCH-139, from which this code was born. In the end, I think
what we all realized was that the spell checking capability is necessary,
but not everywhere, as you point out. However, I don't think it's limited
entirely to HTTP headers (what you've currently changed the code to). I
think it should be implemented as a protocol layer service, also providing
spell checking support to other protocol plugins, like protocol-file, etc.,
where field headers run the risk of being misspelled as well. What's to stop
someone from implementing protocol-file++ that returns different file header
keys than those of protocol-file? Just b/c HTTP is the most pervasively used
plugin right now, I think it's merely convenient to assume that only HTTP
protocol field keys may need spell checking services.
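To make the idea of a protocol-neutral spell checking service concrete, here is a minimal sketch in Java. It is not the actual SpellCheckedMetadata code: the class name HeaderSpellCheckSketch, the normalization rule (letters only, lower-cased), and the one-third edit-distance threshold are all assumptions made for illustration. The point is only that the checker takes its canonical key names as input, so nothing in it is tied to HTTP.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a protocol-neutral header key corrector. */
final class HeaderSpellCheckSketch {
  // Maps a normalized key ("contenttype") to its canonical form ("Content-Type").
  private final Map<String, String> canonicalByKey = new HashMap<>();

  HeaderSpellCheckSketch(String... canonicalNames) {
    for (String name : canonicalNames) {
      canonicalByKey.put(normalize(name), name);
    }
  }

  /** Assumed normalization: keep letters only and lower-case them. */
  static String normalize(String name) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < name.length(); i++) {
      char c = name.charAt(i);
      if (Character.isLetter(c)) {
        sb.append(Character.toLowerCase(c));
      }
    }
    return sb.toString();
  }

  /** Returns the closest canonical name, or the input unchanged if none is close. */
  String correct(String name) {
    String key = normalize(name);
    String exact = canonicalByKey.get(key);
    if (exact != null) {
      return exact;
    }
    // Assumed threshold: tolerate edits up to a third of the key length.
    int threshold = Math.max(1, key.length() / 3);
    String best = name;
    int bestDistance = threshold + 1;
    for (Map.Entry<String, String> entry : canonicalByKey.entrySet()) {
      int distance = levenshtein(key, entry.getKey());
      if (distance < bestDistance) {
        bestDistance = distance;
        best = entry.getValue();
      }
    }
    return best;
  }

  /** Classic two-row Levenshtein edit distance. */
  static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] cur = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.length(); i++) {
      cur[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
      }
      int[] tmp = prev;
      prev = cur;
      cur = tmp;
    }
    return prev[b.length()];
  }
}
```

A protocol plugin could build one of these from whatever key names it cares about and call correct() on each incoming key; unknown keys pass through untouched.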

Just my 2 cents...

Cheers,
  Chris





Re: svn commit: r485076 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/SpellCheckedMetadata.java test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java

2006-12-09 Thread Sami Siren
Chris Mattmann wrote:
> Hi Sami,
>
> On 12/9/06 2:27 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
>
>> Author: siren
>> Date: Sat Dec  9 14:27:07 2006
>> New Revision: 485076
>>
>> URL: http://svn.apache.org/viewvc?view=rev&rev=485076
>> Log:
>> Optimize SpellCheckedMetadata further by taking into account the fact that it
>> is used only for http-headers.
>>
>> I am starting to believe that spellchecking should just be a utility method
>> used by http protocol plugins.
>
> I think that right now I'm -1 on this change. I would make note of all the
> comments on NUTCH-139, from which this code was born. In the end, I think
> what we all realized was that the spell checking capability is necessary,
> but not everywhere, as you point out. However, I don't think it's limited
> entirely to HTTP headers (what you've currently changed the code to). I
> think it should be implemented as a protocol layer service, also providing
> spell checking support to other protocol plugins, like protocol-file, etc.,

In protocol-file all headers are artificial and generated in Nutch code,
so if there's a spelling mistake there, we should fix the code
generating the headers rather than rely on spellchecking in the first place.

> where field headers run the risk of being misspelled as well. What's to stop
> someone from implementing protocol-file++ that returns different file header
> keys than those of protocol-file? Just b/c HTTP is the most pervasively used
> plugin right now, I think it's merely convenient to assume that only HTTP
> protocol field keys may need spell checking services.

If there's a real need for spell checking on other keys, one can just add
more classes to the array; no big deal.
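As a rough illustration of "adding more classes to the array", the sketch below collects the public static final String constants of every class it is given, using reflection. It is an assumption-laden stand-in rather than the committed code: HttpHeadersSketch, FileHeadersSketch, the File-Owner key, and SpellCheckedNamesSketch are hypothetical names, and lower-casing stands in for whatever normalization the real class applies.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for an interface of HTTP header key constants. */
interface HttpHeadersSketch {
  String CONTENT_TYPE = "Content-Type";
  String CONTENT_LENGTH = "Content-Length";
}

/** Hypothetical interface of file protocol keys; the key name is made up. */
interface FileHeadersSketch {
  String FILE_OWNER = "File-Owner";
}

final class SpellCheckedNamesSketch {
  /** Indexes every public static final String constant of the given classes. */
  static Map<String, String> index(Class<?>... sources) {
    Map<String, String> names = new TreeMap<>();
    for (Class<?> source : sources) {
      for (Field field : source.getFields()) {
        int mods = field.getModifiers();
        boolean isConstant = Modifier.isPublic(mods)
            && Modifier.isStatic(mods)
            && Modifier.isFinal(mods)
            && field.getType() == String.class;
        if (!isConstant) {
          continue;
        }
        try {
          String value = (String) field.get(null);
          // Assumed normalization: lower-case key pointing at the canonical spelling.
          names.put(value.toLowerCase(Locale.ROOT), value);
        } catch (IllegalAccessException e) {
          throw new IllegalStateException(e);
        }
      }
    }
    return names;
  }
}
```

With this shape, covering another protocol is a one-token change at the call site: index(HttpHeadersSketch.class) becomes index(HttpHeadersSketch.class, FileHeadersSketch.class).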

--
 Sami Siren



Re: svn commit: r485076 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/SpellCheckedMetadata.java test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java

2006-12-09 Thread Chris Mattmann
Hi Sami,

Indeed, I see your point. I guess what I was advocating was more of a
ProtocolHeaders interface that lives in org.apache.nutch.metadata. Then we
could update the code that you have below to use ProtocolHeaders.class
rather than HttpHeaders.class. We would then make ProtocolHeaders extend
HttpHeaders, so that it inherits all of the HttpHeaders keys by default while
still allowing more ProtocolHeaders met keys (e.g., we could have an
interface for FileHeaders, etc.).
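The interface layout described above might look roughly like the following sketch. Every name here is hypothetical (the Demo suffix marks them as such), and the File-Permissions key is invented purely for illustration. What the sketch relies on is standard Java behavior: Class.getFields() on an interface also returns the constants of its superinterfaces, so reflective code pointed at the aggregate interface sees every inherited key.

```java
import java.lang.reflect.Field;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for the existing HTTP header constants. */
interface HttpHeadersDemo {
  String CONTENT_TYPE = "Content-Type";
}

/** Hypothetical file protocol keys; the key name is made up. */
interface FileHeadersDemo {
  String FILE_PERMISSIONS = "File-Permissions";
}

/** Aggregate interface: inherits the keys of every protocol it extends. */
interface ProtocolHeadersDemo extends HttpHeadersDemo, FileHeadersDemo {
}

final class ProtocolHeadersInspector {
  /** Lists the String constants visible through the given interface. */
  static Set<String> keysOf(Class<?> headers) {
    Set<String> keys = new TreeSet<>();
    for (Field field : headers.getFields()) {
      if (field.getType() != String.class) {
        continue;
      }
      try {
        // Interface fields are implicitly public static final, so no instance is needed.
        keys.add((String) field.get(null));
      } catch (IllegalAccessException e) {
        throw new IllegalStateException(e);
      }
    }
    return keys;
  }
}
```

Calling keysOf(ProtocolHeadersDemo.class) yields both the HTTP key and the file key, which is why swapping HttpHeaders.class for an aggregate interface would be enough to widen the set of spell-checked names.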

What do you think about that? Alternatively, we could just create a
ProtocolHeaders interface in org.apache.nutch.metadata that aggregates all
the met key fields from HttpHeaders, and it would also be the place where
the met key fields for FileHeaders, etc. could go.

Let me know what you think, and thanks!

Cheers,
  Chris



On 12/9/06 3:53 PM, Sami Siren [EMAIL PROTECTED] wrote:

> Chris Mattmann wrote:
>> Hi Sami,
>>
>> On 12/9/06 2:27 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
>>
>>> Author: siren
>>> Date: Sat Dec  9 14:27:07 2006
>>> New Revision: 485076
>>>
>>> URL: http://svn.apache.org/viewvc?view=rev&rev=485076
>>> Log:
>>> Optimize SpellCheckedMetadata further by taking into account the fact that
>>> it is used only for http-headers.
>>>
>>> I am starting to believe that spellchecking should just be a utility method
>>> used by http protocol plugins.
>>
>> I think that right now I'm -1 on this change. I would make note of all the
>> comments on NUTCH-139, from which this code was born. In the end, I think
>> what we all realized was that the spell checking capability is necessary,
>> but not everywhere, as you point out. However, I don't think it's limited
>> entirely to HTTP headers (what you've currently changed the code to). I
>> think it should be implemented as a protocol layer service, also providing
>> spell checking support to other protocol plugins, like protocol-file, etc.,
>
> In protocol-file all headers are artificial and generated in Nutch code,
> so if there's a spelling mistake there, we should fix the code
> generating the headers rather than rely on spellchecking in the first place.
>
>> where field headers run the risk of being misspelled as well. What's to stop
>> someone from implementing protocol-file++ that returns different file header
>> keys than those of protocol-file? Just b/c HTTP is the most pervasively used
>> plugin right now, I think it's merely convenient to assume that only HTTP
>> protocol field keys may need spell checking services.
>
> If there's a real need for spell checking on other keys, one can just add
> more classes to the array; no big deal.
>
> --
>  Sami Siren