[ 
https://issues.apache.org/jira/browse/NUTCH-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2675:
-----------------------------------
    Fix Version/s:     (was: 1.15)
                   1.16

> Give parsers the capability to read and write CrawlDatum
> --------------------------------------------------------
>
>                 Key: NUTCH-2675
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2675
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.15
>            Reporter: Junqiang Zhang
>            Priority: Minor
>             Fix For: 1.16
>
>
> Parsers are called inside org.apache.nutch.parse.ParseSegment,
> (Line 127 for version 1.15)        parseResult = parseUtil.parse(content);
> and inside org.apache.nutch.fetcher.FetcherThread.
> (Line 640 for version 1.15)            parseResult = this.parseUtil.parse(content);
> The current version of Nutch does not give parsers access to CrawlDatum. 
> If users want to customize the parsing process based on CrawlDatum metadata, 
> that metadata is difficult to read from within a parser. 
> On the other hand, if users want to save metadata generated during parsing, 
> it can only be saved as parseMeta of org.apache.nutch.parse.ParseData. Only 
> the parseMeta entries selected by db.parsemeta.to.crawldb in nutch-site.xml 
> can then be added to CrawlDatum, inside 
> org.apache.nutch.parse.ParseOutputFormat and 
> org.apache.nutch.crawl.CrawlDbReducer. If parsers had direct access to 
> CrawlDatum, they could add the metadata generated during parsing to 
> CrawlDatum directly.
> I use Nutch to fetch and parse web pages. To read the required metadata from 
> CrawlDatum during parsing, I use the following workaround.
> (1) During web page fetching, inside 
> org.apache.nutch.protocol.http.api.HttpBase of lib-http plugin, read the 
> required metadata from CrawlDatum, and save the required metadata together 
> with the Headers metadata of org.apache.nutch.net.protocols.Response to the 
> metadata of org.apache.nutch.protocol.Content. This can be done at line 334 
> of the code by replacing "response.getHeaders()" with a new metadata object 
> containing both the required metadata from CrawlDatum and the Headers metadata.
> The code that needs to be modified inside 
> org.apache.nutch.protocol.http.api.HttpBase of the lib-http plugin is:
> (Line 332 for version 1.15)      Content c = new Content(u.toString(), u.toString(),
> (Line 333 for version 1.15)           (content == null ? EMPTY_CONTENT : content),
> (Line 334 for version 1.15)           response.getHeader("Content-Type"), response.getHeaders(), mimeTypes);
> (2) During HTML page parsing, inside org.apache.nutch.parse.html.HtmlParser 
> of the parse-html plugin, read the required metadata from the metadata of 
> org.apache.nutch.protocol.Content, and customize the parsing process using 
> the required metadata.
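
The two workaround steps above could be sketched roughly as follows. This is a minimal, self-contained illustration and not Nutch code: plain java.util.Map objects stand in for the Headers metadata and the CrawlDatum metadata, and the class name, the method names and the key prefix are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in sketch: plain maps replace Nutch's metadata types.
final class MetadataMergeSketch {

    // Hypothetical prefix that keeps CrawlDatum keys apart from HTTP header names.
    static final String DATUM_PREFIX = "_datum_.";

    // Step (1), fetch side: build the metadata handed to Content from the
    // response headers plus the CrawlDatum entries the parser will need.
    static Map<String, String> mergeForContent(Map<String, String> responseHeaders,
                                               Map<String, String> datumMeta,
                                               Iterable<String> requiredKeys) {
        Map<String, String> merged = new LinkedHashMap<>(responseHeaders);
        for (String key : requiredKeys) {
            String value = datumMeta.get(key);
            if (value != null) {            // skip keys the datum does not carry
                merged.put(DATUM_PREFIX + key, value);
            }
        }
        return merged;
    }

    // Step (2), parse side: read a CrawlDatum entry back from the Content metadata.
    static String readDatumValue(Map<String, String> contentMeta, String key) {
        return contentMeta.get(DATUM_PREFIX + key);
    }
}
```

In this sketch the prefix is only there so that a CrawlDatum key cannot overwrite an HTTP header with the same name; the issue itself does not say how the two sets of keys are kept apart.
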
> If parsers had direct access to CrawlDatum, the above workaround would not 
> be needed. To give parsers the capability to directly read and write 
> CrawlDatum, I would like to suggest adding a new method "public ParseResult 
> parse(Content content, CrawlDatum datum)" to org.apache.nutch.parse.ParseUtil 
> in future versions of Nutch.
> To stay compatible with 1.15 and previous versions, I would like to suggest 
> adding a new configuration property to nutch-default.xml. By default, the 
> property would select the current method "public ParseResult parse(Content 
> content)". Users who want to use "public ParseResult parse(Content content, 
> CrawlDatum datum)" can change the property in nutch-site.xml.
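
One way to picture the proposal is the sketch below. It is only an illustration under stated assumptions: Content, CrawlDatum and ParseResult are tiny stand-in classes rather than the real Nutch types, the configuration is a plain map, and the property name "parser.pass.crawldatum" is hypothetical, since the issue does not name the property.

```java
import java.util.HashMap;
import java.util.Map;

final class ParseUtilSketch {

    // Hypothetical property name; the issue does not specify one.
    static final String PASS_DATUM_PROPERTY = "parser.pass.crawldatum";

    // Minimal stand-ins for the Nutch classes of the same names.
    static final class Content {
        final String url;
        Content(String url) { this.url = url; }
    }

    static final class CrawlDatum {
        final Map<String, String> metaData = new HashMap<>();
    }

    static final class ParseResult {
        final String url;
        final boolean sawDatum;
        ParseResult(String url, boolean sawDatum) {
            this.url = url;
            this.sawDatum = sawDatum;
        }
    }

    private final boolean passDatum;

    // The property defaults to false so that existing behaviour is unchanged.
    ParseUtilSketch(Map<String, String> conf) {
        this.passDatum = Boolean.parseBoolean(conf.getOrDefault(PASS_DATUM_PROPERTY, "false"));
    }

    // Existing signature: the parser sees only the content.
    ParseResult parse(Content content) {
        return new ParseResult(content.url, false);
    }

    // Proposed overload: the parser can also read and write the datum.
    ParseResult parse(Content content, CrawlDatum datum) {
        datum.metaData.put("example.parsed", "true"); // hypothetical key written during parsing
        return new ParseResult(content.url, true);
    }

    // What a call site such as ParseSegment or FetcherThread might do.
    ParseResult parseFor(Content content, CrawlDatum datum) {
        return passDatum ? parse(content, datum) : parse(content);
    }
}
```

With an empty configuration, parseFor falls back to the one-argument method and leaves the datum untouched; setting the hypothetical property to "true" routes the call through the two-argument overload.
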



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
