[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851719#action_12851719 ]
Hudson commented on NUTCH-779:
Integrated in Nutch-trunk #1112 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1112/])
Mechanism for passing metadata from parse to crawldb
Mechanism for passing metadata from parse to crawldb
Key: NUTCH-779
URL: https://issues.apache.org/jira/browse/NUTCH-779
Project: Nutch
Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.1
Attachments: NUTCH-779, NUTCH-779-v2.patch
The attached patch allows parse metadata to be passed to the corresponding entry of the crawldb. Comments are welcome.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850915#action_12850915 ]
Julien Nioche commented on NUTCH-779:
Could anyone please review this issue? I would like to commit it in time for the 1.1 release.
[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850939#action_12850939 ]
Andrzej Bialecki commented on NUTCH-779:
In CrawlDbReducer, the line {{if (metaFromParse!=null){}} needs some whitespace fixing. Other than that, +1.
Re: [jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
I'd like to use Julien's approach because I found the scoring filter complex to understand. My use case is the following:
1. During scoring after parsing, I want to tag the pages that are interesting to me, say meta=HIT.
2. In the next step (to be created), I would like to prune NON-HIT content from the segment in order to optimize segment space (I use Nutch caching); I typically need to ditch 90% of segment data.
I am also considering:
4. focusing recrawls on HIT pages and their outlinks.
Today I don't really know how one can retrieve this metadata. I have managed to avoid storing text content for NON-HIT pages, but it is a dirty trick.
2010/1/19 Andrzej Bialecki (JIRA) j...@apache.org
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12802175#action_12802175]
Andrzej Bialecki commented on NUTCH-779:
Personally I would use ScoringFilters because I'm familiar with the API, but the approach that you propose is certainly more user friendly especially for novice users.
--
-MilleBii-
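The two steps of the use case above (tag pages as HIT, then prune everything else) can be sketched in plain Java. This is only an illustration under stated assumptions: it uses no Nutch classes, a segment is modelled as a simple list, and the names `HitPruningSketch`, `Entry`, `tagHit`, `prune`, and the metadata key `meta` are hypothetical. How the tag would actually be read back from a real segment or crawldb is exactly the open question raised in the message.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: entries are plain objects, not Nutch data structures.
class HitPruningSketch {

    static final String META_KEY = "meta"; // hypothetical metadata key
    static final String HIT = "HIT";       // tag value taken from the message above

    // Minimal stand-in for one fetched page and its metadata.
    static class Entry {
        final String url;
        final Map<String, String> metadata = new HashMap<>();

        Entry(String url) {
            this.url = url;
        }
    }

    // Step 1: mark an entry as interesting. Deciding which pages are
    // interesting is left to the caller.
    static void tagHit(Entry entry) {
        entry.metadata.put(META_KEY, HIT);
    }

    // Step 2: return a new list containing only the entries tagged HIT.
    static List<Entry> prune(List<Entry> segment) {
        List<Entry> kept = new ArrayList<>();
        for (Entry entry : segment) {
            if (HIT.equals(entry.metadata.get(META_KEY))) {
                kept.add(entry);
            }
        }
        return kept;
    }
}
```

The same predicate could also drive the recrawl selection mentioned above, since it depends only on the tag being available next to the URL.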
[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12802172#action_12802172 ]
Julien Nioche commented on NUTCH-779:
"The property needs some documentation in nutch-default.xml plus a sensible default."
Sure - just wanted the general approach to be checked before doing the tedious bits. Do you think it makes sense to do things the way I suggested or would you use the ScoringFilters instead?
[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12802175#action_12802175 ]
Andrzej Bialecki commented on NUTCH-779:
Personally I would use ScoringFilters because I'm familiar with the API, but the approach that you propose is certainly more user friendly especially for novice users.
[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801875#action_12801875 ]
Andrzej Bialecki commented on NUTCH-779:
You can already achieve this with ScoringFilters, although it requires using three methods instead ... I would also rename the status to parse_meta, it's less cryptic this way. The property needs some documentation in nutch-default.xml plus a sensible default.
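To make the idea of a configurable property with a sensible default more concrete, here is a minimal sketch in plain Java. It is not the code from the patch and uses no Nutch classes. It assumes, purely for illustration, that the property holds a comma-separated list of parse metadata keys and that an empty value (the default) transfers nothing. The names `ParseMetaTransferSketch`, `keysToTransfer`, and `transfer` are hypothetical; only the `metaFromParse` null check echoes the line quoted in the review.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the mechanism discussed in this thread.
class ParseMetaTransferSketch {

    // Turns a comma-separated property value into a set of metadata keys.
    // A null or empty value yields an empty set, so nothing is copied by default.
    static Set<String> keysToTransfer(String propertyValue) {
        Set<String> keys = new LinkedHashSet<>();
        if (propertyValue == null) {
            return keys;
        }
        for (String key : propertyValue.split(",")) {
            String trimmed = key.trim();
            if (!trimmed.isEmpty()) {
                keys.add(trimmed);
            }
        }
        return keys;
    }

    // Copies only the configured keys from the parse metadata into the
    // metadata of the corresponding crawldb entry.
    static void transfer(Map<String, String> metaFromParse,
                         Map<String, String> crawlDbMeta,
                         Set<String> keys) {
        if (metaFromParse != null) {
            for (String key : keys) {
                String value = metaFromParse.get(key);
                if (value != null) {
                    crawlDbMeta.put(key, value);
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new LinkedHashMap<>();
        parseMeta.put("category", "news");
        parseMeta.put("charset", "utf-8");

        Map<String, String> crawlDbMeta = new LinkedHashMap<>();
        transfer(parseMeta, crawlDbMeta, keysToTransfer("category, lang"));

        // Only "category" is copied: "charset" is not configured and
        // "lang" is absent from the parse metadata.
        System.out.println(crawlDbMeta);
    }
}
```

Listing the keys explicitly keeps the crawldb small, which is one reason an empty default seems a reasonable choice in this sketch.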