[jira] Updated: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gal Nitzan updated NUTCH-485:
    Attachment: NUTCH-485.200705130928.patch

Following Andrzej's advice, much cleaner code :)
Attached...

Change HtmlParseFilter 's to return ParseResult object instead of Parse object
    Key: NUTCH-485
    URL: https://issues.apache.org/jira/browse/NUTCH-485
    Project: Nutch
    Issue Type: Improvement
    Components: fetcher
    Affects Versions: 1.0.0
    Environment: All
    Reporter: Gal Nitzan
    Fix For: 1.0.0
    Attachments: NUTCH-485.200705122151.patch, NUTCH-485.200705130928.patch

The current implementation of HtmlParseFilters.java doesn't allow a filter to add parse objects to the ParseResult object.
A change to HtmlParseFilter is needed that allows the filter to return a ParseResult, and of course a corresponding change to HtmlParseFilters.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495350 ]

Doğacan Güney commented on NUTCH-485:

You probably should not add put(String/Text key, Parse parse) methods to ParseResult. ParseResult has no method for adding a Parse object directly, so that it can check whether the parse object comes from the real url or a sub-url.

Change HtmlParseFilter 's to return ParseResult object instead of Parse object
    Key: NUTCH-485
    URL: https://issues.apache.org/jira/browse/NUTCH-485
    Project: Nutch
    Issue Type: Improvement
    Components: fetcher
    Affects Versions: 1.0.0
    Environment: All
    Reporter: Gal Nitzan
    Fix For: 1.0.0
    Attachments: NUTCH-485.200705122151.patch, NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch

The current implementation of HtmlParseFilters.java doesn't allow a filter to add parse objects to the ParseResult object.
A change to HtmlParseFilter is needed that allows the filter to return a ParseResult, and of course a corresponding change to HtmlParseFilters.
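The point made in the comment above about ParseResult can be illustrated with a minimal, self-contained Java sketch. It is not the Nutch source: `ParseResultSketch`, `Entry`, and `isCanonical` are hypothetical names, and plain strings stand in for Nutch's parse text and parse data. The assumption shown is that the container, rather than the caller, builds each entry, so it can record whether the key matches the originally fetched url.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a ParseResult-like container; not the Nutch class.
final class ParseResultSketch {

  // One stored parse; 'canonical' is true only for the originally fetched url.
  static final class Entry {
    final String text;
    final boolean canonical;

    Entry(String text, boolean canonical) {
      this.text = text;
      this.canonical = canonical;
    }
  }

  private final String originalUrl;
  private final Map<String, Entry> entries = new LinkedHashMap<>();

  ParseResultSketch(String originalUrl) {
    this.originalUrl = originalUrl;
  }

  // Callers hand over raw parts, never a ready-made Entry, so the
  // real-url versus sub-url decision is always made here.
  void put(String url, String text) {
    entries.put(url, new Entry(text, url.equals(originalUrl)));
  }

  // False for sub-urls and for urls that were never added.
  boolean isCanonical(String url) {
    Entry e = entries.get(url);
    return e != null && e.canonical;
  }

  public static void main(String[] args) {
    ParseResultSketch result = new ParseResultSketch("http://example.com/feed");
    result.put("http://example.com/feed", "feed text");
    result.put("http://example.com/feed/item1", "item text");
    System.out.println(result.isCanonical("http://example.com/feed"));
    System.out.println(result.isCanonical("http://example.com/feed/item1"));
  }
}
```

If callers could insert a pre-built entry, they could also set the flag themselves, which is the situation the comment warns against.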
[jira] Updated: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gal Nitzan updated NUTCH-485:
    Attachment: NUTCH-485.200705131241.patch

Thanks Doğacan, I missed it :(
Thanks to all reviewers. Yet another patch...

Change HtmlParseFilter 's to return ParseResult object instead of Parse object
    Key: NUTCH-485
    URL: https://issues.apache.org/jira/browse/NUTCH-485
    Project: Nutch
    Issue Type: Improvement
    Components: fetcher
    Affects Versions: 1.0.0
    Environment: All
    Reporter: Gal Nitzan
    Fix For: 1.0.0
    Attachments: NUTCH-485.200705122151.patch, NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch, NUTCH-485.200705131241.patch

The current implementation of HtmlParseFilters.java doesn't allow a filter to add parse objects to the ParseResult object.
A change to HtmlParseFilter is needed that allows the filter to return a ParseResult, and of course a corresponding change to HtmlParseFilters.
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495357 ]

Doğacan Güney commented on NUTCH-443:

Well... That's embarrassing. It seems I forgot to include the necessary changes to Indexer. Indexer has to read crawl_parse too, so that it can pick up sub-urls' fetch datums.

That seemed easy (just a couple of lines), but then I realized that there is another bug. (Which, in my defense, was present in Nutch before 443. So the bug was there, I only made it worse :) It is a bit difficult to describe, so please bear with me.

The problem goes like this: In fetcher, if max.redirect is 0, Nutch pushes an empty Content to content and a LINKED datum to crawl_fetch (let's call this url foo). ParseSegment parses the empty Content and creates a parse data and an empty parse text. After updatedb and one more generate-fetch-parse-updatedb cycle, we now have a proper content, parse text and parse data for foo in the new segment.

Now, assume I index both of these segments together. Url foo will have two sets of (fetch datum, parse), one coming from the first segment, the other coming from the second segment. Since the first fetch datum is LINKED, this code in Indexer.reduce will cause foo to be discarded:

    if (redir != null) {
      // XXX page was redirected - what should we do?
      // XXX discard it for now
      return;
    }

And it doesn't work if we just remove this code. Remember that foo has two sets of (fetch datum, parse), and one of the parses contains an empty parse text. Since Indexer.reduce will effectively choose one of the parses at random, it is likely that we will get an empty parse text for url foo.

This is the part that I made worse: Since Indexer has to read crawl_parse, it will get a lot of STATUS_LINKED datums (which are written to crawl_parse as outlinks) and discard a lot of useful pages in any multi-segment index job.

Sorry if the description is unnecessarily complex.

allow parsers to return multiple Parse object, this will speed up the rss parser
    Key: NUTCH-443
    URL: https://issues.apache.org/jira/browse/NUTCH-443
    Project: Nutch
    Issue Type: New Feature
    Components: fetcher
    Affects Versions: 0.9.0
    Reporter: Renaud Richardet
    Assigned To: Andrzej Bialecki
    Priority: Minor
    Fix For: 1.0.0
    Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, NUTCH-443.022507.patch.txt, NUTCH-443.02282007-v2.patch, NUTCH-443.02282007.patch, NUTCH-443.08052007.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff

allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can return multiple parse objects, that will all be indexed separately. Advantage: no need to fetch all feed-items separately.
see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
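The multi-segment discard described in the comment above can be modelled with a short Java sketch. This is a simplified illustration, not the Indexer code: `RedirDiscardSketch`, `SegmentRecord`, and the two-value `Status` enum are hypothetical, and each segment is assumed to contribute exactly one (fetch datum status, parse text) pair for the url.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of the multi-segment problem; all types here are hypothetical.
final class RedirDiscardSketch {

  enum Status { LINKED, FETCH_SUCCESS }

  // One (fetch datum status, parse text) pair contributed by one segment.
  static final class SegmentRecord {
    final Status status;
    final String parseText;

    SegmentRecord(Status status, String parseText) {
      this.status = status;
      this.parseText = parseText;
    }
  }

  // Returns the text that would be indexed for a url, or null if the url is dropped.
  // Mirrors the quoted check: a single LINKED datum discards the whole url.
  static String reduce(List<SegmentRecord> recordsForUrl) {
    String text = null;
    boolean redirected = false;
    for (SegmentRecord r : recordsForUrl) {
      if (r.status == Status.LINKED) {
        redirected = true;
      } else {
        text = r.parseText;
      }
    }
    return redirected ? null : text;
  }

  public static void main(String[] args) {
    List<SegmentRecord> foo = Arrays.asList(
        new SegmentRecord(Status.LINKED, ""),                   // first segment
        new SegmentRecord(Status.FETCH_SUCCESS, "real text"));  // second segment
    System.out.println(reduce(foo));                // both segments: foo is dropped
    System.out.println(reduce(foo.subList(1, 2)));  // second segment alone: foo is kept
  }
}
```

The sketch only reproduces the symptom. It does not model the second half of the problem, where removing the check leaves the reducer free to pick the empty parse text from the first segment.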
[jira] Resolved: (NUTCH-484) Nutch Nightly API link is broken in site
[ https://issues.apache.org/jira/browse/NUTCH-484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren resolved NUTCH-484.
    Resolution: Fixed

Committed and updated the site, thanks Gal.

Nutch Nightly API link is broken in site
    Key: NUTCH-484
    URL: https://issues.apache.org/jira/browse/NUTCH-484
    Project: Nutch
    Issue Type: Bug
    Components: documentation
    Affects Versions: 1.0.0
    Environment: All
    Reporter: Gal Nitzan
    Priority: Trivial
    Fix For: 1.0.0
    Attachments: NUTCH-484.200705121200.patch

The Nightly API link is broken
[jira] Updated: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-444:
    Attachment: NUTCH-444.patch
                feed.tar.bz2

First version of feed plugin featuring a Parser and an IndexingFilter. You would need the latest patch from NUTCH-443 (redirect_and_index.patch) to test it.

Possibly use a different library to parse RSS feed for improved performance and compatibility
    Key: NUTCH-444
    URL: https://issues.apache.org/jira/browse/NUTCH-444
    Project: Nutch
    Issue Type: Improvement
    Components: fetcher
    Affects Versions: 0.9.0
    Reporter: Renaud Richardet
    Assigned To: Chris A. Mattmann
    Priority: Minor
    Fix For: 1.0.0
    Attachments: feed.tar.bz2, NUTCH-444.patch, parse-feed-v2.tar.bz2, parse-feed.tar.bz2

As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
- OutOfMemory when parsing 100k feeds, since it has to convert the feed to jdom first
- no support for Atom 1.0
- there has been no development in the last year

Alternatives are:
- Rome
- Informa
- custom implementation based on Stax
- ??
[jira] Reopened: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reopened NUTCH-443:
    Assignee: Chris A. Mattmann (was: Andrzej Bialecki)

Per Doğacan's comment, we need to reopen this and test out his new patch for it. Andrzej, I'd be happy for you to reassign this to yourself; otherwise, I will have some time on Tuesday to look at it if you don't get to it before then.

allow parsers to return multiple Parse object, this will speed up the rss parser
    Key: NUTCH-443
    URL: https://issues.apache.org/jira/browse/NUTCH-443
    Project: Nutch
    Issue Type: New Feature
    Components: fetcher
    Affects Versions: 0.9.0
    Reporter: Renaud Richardet
    Assigned To: Chris A. Mattmann
    Priority: Minor
    Fix For: 1.0.0
    Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, NUTCH-443.022507.patch.txt, NUTCH-443.02282007-v2.patch, NUTCH-443.02282007.patch, NUTCH-443.08052007.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff, redirect_and_index.patch

allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can return multiple parse objects, that will all be indexed separately. Advantage: no need to fetch all feed-items separately.
see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495381 ]

Chris A. Mattmann commented on NUTCH-444:

Doğacan -- I will check this out tomorrow (Monday) night, latest Tuesday. I've reopened NUTCH-443 and will also look at your new patch from there.

Possibly use a different library to parse RSS feed for improved performance and compatibility
    Key: NUTCH-444
    URL: https://issues.apache.org/jira/browse/NUTCH-444
    Project: Nutch
    Issue Type: Improvement
    Components: fetcher
    Affects Versions: 0.9.0
    Reporter: Renaud Richardet
    Assigned To: Chris A. Mattmann
    Priority: Minor
    Fix For: 1.0.0
    Attachments: feed.tar.bz2, NUTCH-444.patch, parse-feed-v2.tar.bz2, parse-feed.tar.bz2

As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
- OutOfMemory when parsing 100k feeds, since it has to convert the feed to jdom first
- no support for Atom 1.0
- there has been no development in the last year

Alternatives are:
- Rome
- Informa
- custom implementation based on Stax
- ??
[jira] Commented: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495410 ]

Doğacan Güney commented on NUTCH-485:

I have two more minor nits:

1) ParseResult.isSuccess returns true only if all parses are successful. This makes sense, but I think you should make it more obvious by mentioning it in the method's javadoc.
2) There seem to be some whitespace issues. For example, some indents are 4 spaces. All indents should be 2 spaces.

Anyway, I don't know if my vote counts, but, besides these two issues, I am +1 on this patch.

I think this may be very useful for image search. After parsing a page, one can traverse the DOM, add image src's as urls and the immediate text around images as parse text (+ whatever data you can gather as parse data). Of course, this doesn't automatically make Nutch an image search engine, but it is a good first step.

Change HtmlParseFilter 's to return ParseResult object instead of Parse object
    Key: NUTCH-485
    URL: https://issues.apache.org/jira/browse/NUTCH-485
    Project: Nutch
    Issue Type: Improvement
    Components: fetcher
    Affects Versions: 1.0.0
    Environment: All
    Reporter: Gal Nitzan
    Fix For: 1.0.0
    Attachments: NUTCH-485.200705122151.patch, NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch, NUTCH-485.200705131241.patch

The current implementation of HtmlParseFilters.java doesn't allow a filter to add parse objects to the ParseResult object.
A change to HtmlParseFilter is needed that allows the filter to return a ParseResult, and of course a corresponding change to HtmlParseFilters.
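The first nit above asks for the all-parses-must-succeed rule to be stated in javadoc. A minimal sketch of what such a documented method might look like follows. It is an illustration only: `IsSuccessSketch` is a hypothetical class, a list of booleans stands in for the per-parse status objects, and the treatment of an empty result is an assumption of this sketch, not something stated in the thread.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; a list of booleans stands in for per-parse status.
final class IsSuccessSketch {

  /**
   * Reports whether the whole result counts as successful.
   *
   * @param parseSucceeded one flag per parse held in the result
   * @return true only if every parse succeeded; a single failed parse makes
   *         the whole result unsuccessful. An empty list yields true in this
   *         sketch (an assumption).
   */
  static boolean isSuccess(List<Boolean> parseSucceeded) {
    for (boolean ok : parseSucceeded) {
      if (!ok) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(isSuccess(Arrays.asList(true, true)));
    System.out.println(isSuccess(Arrays.asList(true, false)));
  }
}
```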
[jira] Updated: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gal Nitzan updated NUTCH-485:
    Attachment: NUTCH-485.200705140001.patch

Thanks Doğacan for taking the time to review the code. I agree with your comments on the usage. I run a video search and it is sure going to help.
The ability to discover and add content to the segment on the fly while parsing is long-awaited functionality, and it was all made possible after NUTCH-443... :)
And yet one more update, with a better description in the javadoc and some fixes to indentation.

Change HtmlParseFilter 's to return ParseResult object instead of Parse object
    Key: NUTCH-485
    URL: https://issues.apache.org/jira/browse/NUTCH-485
    Project: Nutch
    Issue Type: Improvement
    Components: fetcher
    Affects Versions: 1.0.0
    Environment: All
    Reporter: Gal Nitzan
    Fix For: 1.0.0
    Attachments: NUTCH-485.200705122151.patch, NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch, NUTCH-485.200705131241.patch, NUTCH-485.200705140001.patch

The current implementation of HtmlParseFilters.java doesn't allow a filter to add parse objects to the ParseResult object.
A change to HtmlParseFilter is needed that allows the filter to return a ParseResult, and of course a corresponding change to HtmlParseFilters.