[ https://issues.apache.org/jira/browse/NUTCH-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-2675:
-----------------------------------
    Fix Version/s:     (was: 1.15)
                   1.16

> Give parsers the capability to read and write CrawlDatum
> --------------------------------------------------------
>
>                 Key: NUTCH-2675
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2675
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.15
>            Reporter: Junqiang Zhang
>            Priority: Minor
>             Fix For: 1.16
>
>
> Parsers are called inside org.apache.nutch.parse.ParseSegment,
> (Line 127 for version 1.15)   parseResult = parseUtil.parse(content);
> and inside org.apache.nutch.fetcher.FetcherThread.
> (Line 640 for version 1.15)   parseResult = this.parseUtil.parse(content);
>
> The current version of Nutch does not give parsers access to CrawlDatum.
> If users want to customize the parsing process using some metadata of
> CrawlDatum, it is difficult to read the required metadata.
>
> On the other hand, if users want to save metadata generated during
> parsing, the metadata can only be saved as parseMeta of
> org.apache.nutch.parse.ParseData, and only those parseMeta entries
> selected by db.parsemeta.to.crawldb in nutch-site.xml can be added to
> CrawlDatum inside org.apache.nutch.parse.ParseOutputFormat and
> org.apache.nutch.crawl.CrawlDbReducer. If parsers had direct access to
> CrawlDatum, the metadata generated during parsing could be added to
> CrawlDatum directly by parsers.
>
> I use Nutch to fetch and parse web pages. To read the required metadata
> from CrawlDatum during parsing, I use the following workaround.
>
> (1) During web page fetching, inside
> org.apache.nutch.protocol.http.api.HttpBase of the lib-http plugin, read
> the required metadata from CrawlDatum, and save it together with the
> Headers metadata of org.apache.nutch.net.protocols.Response to the
> metadata of org.apache.nutch.protocol.Content. This can be done at line
> 334 of the code by replacing "response.getHeaders()" with a new metadata
> object containing both the required metadata from CrawlDatum and the
> Headers metadata.
> The code that needs to be modified inside
> org.apache.nutch.protocol.http.api.HttpBase of the lib-http plugin is
> (Line 332 for version 1.15)   Content c = new Content(u.toString(), u.toString(),
> (Line 333 for version 1.15)       (content == null ? EMPTY_CONTENT : content),
> (Line 334 for version 1.15)       response.getHeader("Content-Type"), response.getHeaders(), mimeTypes);
>
> (2) During HTML page parsing, inside
> org.apache.nutch.parse.html.HtmlParser of the parse-html plugin, read the
> required metadata from the metadata of org.apache.nutch.protocol.Content,
> and customize the parsing process using the required metadata.
>
> If parsers had direct access to CrawlDatum, the above workaround would
> not be needed. To give parsers the capability to directly read and write
> CrawlDatum, I would like to suggest adding a new method
> "public ParseResult parse(Content content, CrawlDatum datum)" to
> org.apache.nutch.parse.ParseUtil in future versions of Nutch.
>
> To stay compatible with 1.15 and previous versions, I would like to
> suggest adding a new configuration property to nutch-default.xml. By
> default, the property would select the current method
> "public ParseResult parse(Content content)". Users who want to use
> "public ParseResult parse(Content content, CrawlDatum datum)" can change
> the property in nutch-site.xml.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
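
For illustration only, the two-argument parse method and the compatibility switch suggested in the issue description could look roughly like the sketch below. It does not use the real Nutch classes: Content, CrawlDatum, ParseResult, Parser and ParseUtil here are simplified stand-ins, the boolean passDatum models the proposed configuration property (the issue does not name it), and DatumAwareParser and the metadata keys "case" and "parsed.length" are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative stand-ins only: these are NOT the real Nutch classes.
final class ParseWithDatumSketch {

    /** Stand-in for org.apache.nutch.protocol.Content (URL and raw text only). */
    static final class Content {
        final String url;
        final String text;
        Content(String url, String text) { this.url = url; this.text = text; }
    }

    /** Stand-in for org.apache.nutch.crawl.CrawlDatum (metadata map only). */
    static final class CrawlDatum {
        final Map<String, String> metaData = new HashMap<>();
    }

    /** Stand-in for org.apache.nutch.parse.ParseResult (parsed text only). */
    static final class ParseResult {
        final String text;
        ParseResult(String text) { this.text = text; }
    }

    /** Parser contract extended with the proposed two-argument overload. */
    interface Parser {
        ParseResult parse(Content content);

        // Default: ignore the datum, so existing parsers keep working unchanged.
        default ParseResult parse(Content content, CrawlDatum datum) {
            return parse(content);
        }
    }

    /** Stand-in for ParseUtil; passDatum models the proposed (unnamed) property. */
    static final class ParseUtil {
        private final Parser parser;
        private final boolean passDatum;

        ParseUtil(Parser parser, boolean passDatum) {
            this.parser = parser;
            this.passDatum = passDatum;
        }

        ParseResult parse(Content content) {
            return parser.parse(content);
        }

        ParseResult parse(Content content, CrawlDatum datum) {
            // With the switch off, fall back to the existing one-argument call.
            return passDatum ? parser.parse(content, datum) : parser.parse(content);
        }
    }

    /** Hypothetical parser that reads one datum entry and writes another. */
    static final class DatumAwareParser implements Parser {
        @Override
        public ParseResult parse(Content content) {
            return new ParseResult(content.text.trim());
        }

        @Override
        public ParseResult parse(Content content, CrawlDatum datum) {
            String text = content.text.trim();
            // Read: customize parsing from metadata carried by the datum.
            if ("upper".equals(datum.metaData.get("case"))) {
                text = text.toUpperCase(Locale.ROOT);
            }
            // Write: store a value produced during parsing directly on the datum.
            datum.metaData.put("parsed.length", Integer.toString(text.length()));
            return new ParseResult(text);
        }
    }

    public static void main(String[] args) {
        Content content = new Content("http://example.com/", "  hello  ");
        CrawlDatum datum = new CrawlDatum();
        datum.metaData.put("case", "upper");

        ParseUtil legacy = new ParseUtil(new DatumAwareParser(), false);
        ParseUtil enabled = new ParseUtil(new DatumAwareParser(), true);

        System.out.println(legacy.parse(content, datum).text);
        System.out.println(enabled.parse(content, datum).text);
        System.out.println(datum.metaData.get("parsed.length"));
    }
}
```

The default method on the stand-in Parser interface is one possible way to keep existing parsers source-compatible; whether Nutch would use that approach or a separate extension point is left open by the issue.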