[jira] Updated: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type

2006-01-15 Thread Gal Nitzan (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-179?page=all ]

Gal Nitzan updated NUTCH-179:
-

Description: 
Sometimes real-world requirements (usually your boss) call for a special 
parser for a certain site. Even though the content type is text/html, a 
specialized parser is needed.

Example: I am required to crawl certain sites, some of which are partner 
sites. When fetching from a partner site, I need to look for certain entries 
in the text and boost the score.

Currently the ParserFactory looks for a plugin based only on the content type.

Facing this issue myself, I noticed that it would make things much easier 
for others if ParserFactory could use NutchConf to check for certain 
properties and, when they match, select the plugin based on the URL and not 
just the content type.
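
The lookup proposed above could be sketched roughly as follows. This is a minimal, self-contained illustration and not Nutch code: the class UrlAwareParserSelector, its methods, and the plugin ids are all hypothetical, and in Nutch the rules would presumably be read from NutchConf properties rather than added by hand.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch: choose a parser plugin id by URL first, then by content type. */
class UrlAwareParserSelector {
    // URL pattern -> parser plugin id, checked in insertion order.
    private final Map<Pattern, String> urlRules = new LinkedHashMap<>();
    // Content type -> default parser plugin id.
    private final Map<String, String> contentTypeRules = new LinkedHashMap<>();

    void addUrlRule(String regex, String parserId) {
        urlRules.put(Pattern.compile(regex), parserId);
    }

    void addContentTypeRule(String contentType, String parserId) {
        contentTypeRules.put(contentType, parserId);
    }

    /** Returns the parser id for this URL and content type, or null if nothing matches. */
    String select(String url, String contentType) {
        // A matching URL rule wins over the content-type default.
        for (Map.Entry<Pattern, String> rule : urlRules.entrySet()) {
            if (rule.getKey().matcher(url).find()) {
                return rule.getValue();
            }
        }
        return contentTypeRules.get(contentType);
    }
}
```

With a URL rule such as "^https?://partner\\.example\\.com/" mapped to a made-up id like "parse-partner", pages from that host would go to the specialized parser, while every other text/html page would still fall through to the content-type default.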

The implementation shouldn't be too complicated.

Looking forward to hearing more ideas.

  was:
Sometimes real-world requirements (usually your boss) call for a special 
parser for a certain site.

Example: I am required to crawl certain sites, some of which are partner 
sites. When fetching from a partner site, I need to look for certain entries 
in the text and boost the score.

Currently the ParserFactory looks for a plugin based only on the content type.

Facing this issue myself, I noticed that it would make things much easier 
for others if ParserFactory could use NutchConf to check for certain 
properties and, when they match, select the plugin based on the URL and not 
just the content type.

The implementation shouldn't be too complicated.

Looking forward to hearing more ideas.


 Proposition: Enable Nutch to use a parser plugin not just based on content 
 type
 ---

  Key: NUTCH-179
  URL: http://issues.apache.org/jira/browse/NUTCH-179
  Project: Nutch
 Type: Improvement
   Components: fetcher
 Versions: 0.8-dev
 Reporter: Gal Nitzan



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



RE: [jira] Updated: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type

2006-01-15 Thread Chris Mattmann
Hi Gal,

 Check out:

http://wiki.apache.org/nutch/ParserFactoryImprovementProposal/

That's how the parser factory currently works. Also added, but not
described in that proposal, is the ability to call a parser by its id,
through a method in ParseUtil.java.

G'luck!

Cheers,
  Chris


__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


 -Original Message-
 From: Gal Nitzan (JIRA) [mailto:[EMAIL PROTECTED]
 Sent: Sunday, January 15, 2006 4:10 PM
 To: nutch-dev@incubator.apache.org
 Subject: [jira] Updated: (NUTCH-179) Proposition: Enable Nutch to use a
 parser plugin not just based on content type
 
  [ http://issues.apache.org/jira/browse/NUTCH-179?page=all ]
 
 Gal Nitzan updated NUTCH-179:
 -
 
 Description:
 Sorry, please close this issue.
 
 I figured out that if I set my parse plugin first, it will always be
 called first and can then decide whether it wants to parse or not.
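
 The workaround described above, a plugin that is ordered first and decides for itself whether to handle a page, might look roughly like this. It is only a sketch under stated assumptions: SiteParser, PartnerParser, GenericHtmlParser, and ParserChain are hypothetical names, not Nutch classes, and the assumption that a declined page falls through to the next parser is made here for illustration rather than taken from the Nutch source.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for a parse plugin; NOT the Nutch Parser interface. */
interface SiteParser {
    /** Returns a result, or empty to decline and let the next parser try. */
    Optional<String> parse(String url, String text);
}

/** Ordered first: handles only partner URLs and declines everything else. */
class PartnerParser implements SiteParser {
    public Optional<String> parse(String url, String text) {
        if (!url.startsWith("http://partner.example.com/")) {
            return Optional.empty(); // not a partner site: decline
        }
        // Placeholder for the partner-specific work (entry lookup, score boost).
        float boost = text.contains("special-offer") ? 2.0f : 1.0f;
        return Optional.of("partner boost=" + boost);
    }
}

/** Fallback parser that accepts every page. */
class GenericHtmlParser implements SiteParser {
    public Optional<String> parse(String url, String text) {
        return Optional.of("generic");
    }
}

/** Tries each parser in order and returns the first result that is present. */
class ParserChain {
    private final List<SiteParser> parsers;

    ParserChain(SiteParser... parsers) {
        this.parsers = Arrays.asList(parsers);
    }

    String parse(String url, String text) {
        for (SiteParser parser : parsers) {
            Optional<String> result = parser.parse(url, text);
            if (result.isPresent()) {
                return result.get();
            }
        }
        throw new IllegalStateException("no parser accepted " + url);
    }
}
```

 Building the chain as new ParserChain(new PartnerParser(), new GenericHtmlParser()) puts the specialized parser first, so the URL check lives inside the plugin and no change to the factory is needed.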
 
 
  Proposition: Enable Nutch to use a parser plugin not just based on
 content type
  
 ---
 
   Key: NUTCH-179
   URL: http://issues.apache.org/jira/browse/NUTCH-179
   Project: Nutch
  Type: Improvement
Components: fetcher
  Versions: 0.8-dev
  Reporter: Gal Nitzan
 
 