[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505701 ]

Doğacan Güney commented on NUTCH-444:
-------------------------------------

+1 for committing feed plugin. I would like to see parse-rss removed too. However,

* if we are to remove it, we must make sure that it is announced properly. Otherwise, nutch-user will just receive a ton of emails asking why nutch doesn't parse feeds even though parse-rss is included in config :)

* it may be a good idea to optionally provide parse-rss's behavior in feed. So, one can get either the "one parse per item" behavior or "all entries condensed into one parse" behavior. We can do this later on, if/when it turns out that some people need the old behavior.

> Possibly use a different library to parse RSS feed for improved performance and compatibility
> ----------------------------------------------------------------------------------------------
>
> Key: NUTCH-444
> URL: https://issues.apache.org/jira/browse/NUTCH-444
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 0.9.0
> Reporter: Renaud Richardet
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.0.0
>
> Attachments: feed.tar.bz2, NUTCH-444.Mattmann.061707.patch.txt, NUTCH-444.patch, parse-feed-v2.tar.bz2, parse-feed.tar.bz2
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome
> - Informa
> - custom implementation based on Stax
> - ??

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
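The two behaviors mentioned in the comment above ("one parse per item" versus "all entries condensed into one parse") can be illustrated with a small, self-contained sketch. It is not the feed plugin's code: FeedItem, Mode and the plain Map of URL to text are hypothetical stand-ins for the real Nutch and feed-library types.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FeedParseModes {
    /** Hypothetical stand-in for one feed entry; not a Nutch or ROME type. */
    static final class FeedItem {
        final String link;
        final String title;
        final String description;

        FeedItem(String link, String title, String description) {
            this.link = link;
            this.title = title;
            this.description = description;
        }
    }

    /** Hypothetical switch between the two behaviors discussed in the thread. */
    enum Mode { ONE_PARSE_PER_ITEM, SINGLE_CONDENSED_PARSE }

    /** Returns a map of URL to parse text (a simplification of a parse result). */
    static Map<String, String> parse(String feedUrl, List<FeedItem> items, Mode mode) {
        Map<String, String> result = new LinkedHashMap<>();
        if (mode == Mode.ONE_PARSE_PER_ITEM) {
            // Each entry becomes its own parse, keyed by the entry's link.
            for (FeedItem item : items) {
                result.put(item.link, item.title + "\n" + item.description);
            }
        } else {
            // All entries are folded into one parse, keyed by the feed URL.
            List<String> parts = new ArrayList<>();
            for (FeedItem item : items) {
                parts.add(item.title + "\n" + item.description);
            }
            result.put(feedUrl, String.join("\n\n", parts));
        }
        return result;
    }

    public static void main(String[] args) {
        List<FeedItem> items = new ArrayList<>();
        items.add(new FeedItem("http://example.com/a", "A", "first entry"));
        items.add(new FeedItem("http://example.com/b", "B", "second entry"));
        System.out.println(parse("http://example.com/feed", items, Mode.ONE_PARSE_PER_ITEM).keySet());
        System.out.println(parse("http://example.com/feed", items, Mode.SINGLE_CONDENSED_PARSE).keySet());
    }
}
```

In the first mode every entry can be indexed as a separate document; in the second, only the feed URL itself ends up in the result.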
[jira] Created: (NUTCH-500) Add hadoop masters configuration file into conf folder
Add hadoop masters configuration file into conf folder
------------------------------------------------------

Key: NUTCH-500
URL: https://issues.apache.org/jira/browse/NUTCH-500
Project: Nutch
Issue Type: Improvement
Components: ndfs
Affects Versions: 0.9.0
Environment: Linux Fedora 7, Java 1.5
Reporter: Emmanuel Joke
Priority: Minor
Fix For: 1.0.0

Hadoop scripts read a configuration file named masters to know how many namenodes should be started. This file is not in the repository at the moment, so the scripts print some error messages (not really important ones) when we start the cluster. Anyway, it would be a good idea to add a template file to the conf directory.
RE: [jira] Resolved: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object
Thanks Doğacan, much obliged.

Gal.

> -----Original Message-----
> From: Doğacan Güney (JIRA) [mailto:[EMAIL PROTECTED]
> Sent: Sunday, June 17, 2007 11:29 PM
> To: nutch-dev@lucene.apache.org
> Subject: [jira] Resolved: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object
>
> Doğacan Güney resolved NUTCH-485.
>
> Resolution: Fixed
>
> Committed in rev 548103 with two modifications:
>
> 1) Fix whitespace issues.
>
> 2) Original patch changed CCParseFilter to return the original parse result if CCParseFilter fails. Now if CCParseFilter fails with an exception, it returns an empty parse created from the exception.
[jira] Resolved: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-485.
--------------------------------

Resolution: Fixed

Committed in rev 548103 with two modifications:

1) Fix whitespace issues.

2) Original patch changed CCParseFilter to return the original parse result if CCParseFilter fails. Now if CCParseFilter fails with an exception, it returns an empty parse created from the exception.

> Change HtmlParseFilter 's to return ParseResult object instead of Parse object
> ------------------------------------------------------------------------------
>
> Key: NUTCH-485
> URL: https://issues.apache.org/jira/browse/NUTCH-485
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Gal Nitzan
> Assignee: Doğacan Güney
> Fix For: 1.0.0
>
> Attachments: NUTCH-485.200705122151.patch, NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch, NUTCH-485.200705131241.patch, NUTCH-485.200705140001.patch
>
> The current implementation of HtmlParseFilters.java doesn't allow a filter to add parse objects to the ParseResult object.
> A change to the HtmlParseFilter is needed which allows the filter to return ParseResult, and of course a change to HtmlParseFilters.
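The second modification (a failing filter yields an empty parse built from the exception instead of passing the original parse through) can be sketched as follows. This is only an illustration under stated assumptions: SimpleParse and applyFilter are hypothetical names, not Nutch's Parse, ParseResult or CCParseFilter.

```java
import java.util.function.UnaryOperator;

class FilterFailureSketch {
    /** Hypothetical simplified parse; not Nutch's Parse or ParseResult. */
    static final class SimpleParse {
        final boolean success;
        final String text;
        final String message;

        SimpleParse(boolean success, String text, String message) {
            this.success = success;
            this.text = text;
            this.message = message;
        }

        /** Builds an empty, failed parse that records why the filter failed. */
        static SimpleParse emptyFrom(Exception e) {
            return new SimpleParse(false, "", String.valueOf(e.getMessage()));
        }
    }

    /**
     * Runs a filter over a parse. If the filter throws, the caller receives an
     * empty failed parse created from the exception, not the unfiltered input.
     */
    static SimpleParse applyFilter(SimpleParse input, UnaryOperator<SimpleParse> filter) {
        try {
            return filter.apply(input);
        } catch (RuntimeException e) {
            return SimpleParse.emptyFrom(e);
        }
    }

    public static void main(String[] args) {
        SimpleParse original = new SimpleParse(true, "page text", "");
        SimpleParse result = applyFilter(original, p -> {
            throw new IllegalStateException("bad license markup");
        });
        System.out.println(result.success + " / " + result.message);
    }
}
```

The point of the design is that a downstream consumer can tell from the result that filtering failed, which is not possible when the original parse is returned unchanged.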
[jira] Updated: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-444:
------------------------------------

Attachment: NUTCH-444.Mattmann.061707.patch.txt

Hi Folks,

Here is a patch that brings this issue up-to-date. The patch takes Doğacan's initial patch, and cleans it up in many places, e.g.:

* changed ParseStatus.STATUS_FAILURE on failed parse (was ParseStatus.STATUS_SUCCESS) - line 271
* reformatted code to conform to project style
* removed magic strings
* added in Apache license
* added in unit test
* fixed build.xml file to include refs to nutch-extensionpoints dep during unit test

While I think there are a few minor open questions moving forward, I don't see any of them hindering the committal of this patch. In answer to my above referenced question regarding this issue as well, I noticed that all-in-all, the feed plugin provided here does provide a superset of functionality provided by that of parse-rss. So, I am +1 for removing parse-rss.

Some things to consider going forward:

1. I did find one difference in semantics between the parse-rss plugin and the feed plugin: the feed plugin adds the URL pointer to the channel file as the Text entry in the map provided in the ParseResult class. While this is probably the correct thing to do, it was causing me some grief initially b/c it caused my unit test to fail. My unit test was expecting to receive the url: http://test.channel.com, the identified URL in the rsstest.rss file, provided as sample input for the unit test. However, since the feed plugin parser takes the *actual* URL pointer to the channel file (e.g., file:/some/path/on/your/system/rsstest.rss), rather than the specified channel URL, this test was failing. The old parse-rss plugin actually took the channel URL instead. I thought about this, and it's not a major hurdle. I think the semantics of simply taking the URL pointer to the channel file that was used (even if it was a file: pointer), is fine.

2. It might be a good idea to factor out the desired index/parse properties taken from the feed and allow them to be specified by a configuration file to this plugin. In other words, wouldn't it be nice to tell the plugin which fields we want to extract (e.g., author, published date, etc.)? This would be an improvement to this plugin later on.

Okey dok, so here it is. If there are no objections, I'd like to commit this in the next 48 hrs. I'd also like feedback from folks like Andrzej and Doğacan regarding removing parse-rss from the sources.

Thanks!

Cheers,
Chris
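The suggestion of letting a configuration file tell the plugin which feed fields to extract could look roughly like the sketch below. Everything in it is an assumption made for illustration: the property name feed.extract.fields, the default value, and the use of java.util.Properties in place of the Nutch configuration object are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

class FeedFieldSelection {
    /** Hypothetical property name; no such setting is defined in the thread. */
    static final String FIELDS_PROPERTY = "feed.extract.fields";

    /** Reads a comma-separated list of field names from the configuration. */
    static Set<String> configuredFields(Properties conf) {
        String raw = conf.getProperty(FIELDS_PROPERTY, "author,published");
        Set<String> fields = new LinkedHashSet<>();
        for (String name : raw.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                fields.add(trimmed);
            }
        }
        return fields;
    }

    /** Keeps only the configured fields that the feed entry actually has. */
    static Map<String, String> select(Map<String, String> entry, Set<String> fields) {
        Map<String, String> selected = new LinkedHashMap<>();
        for (String field : fields) {
            String value = entry.get(field);
            if (value != null) {
                selected.put(field, value);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(FIELDS_PROPERTY, "author, title");

        Map<String, String> entry = new LinkedHashMap<>();
        entry.put("author", "Jane Doe");
        entry.put("title", "Release notes");
        entry.put("published", "2007-06-17");

        System.out.println(select(entry, configuredFields(conf)));
    }
}
```

With this shape, adding or dropping an extracted field is a configuration change and needs no change to the parsing code.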
[jira] Work started: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-444 started by Chris A. Mattmann.
[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505607 ]

Chris A. Mattmann commented on NUTCH-444:
-----------------------------------------

Hi Nutch Newbie:

I will take a look at this today, and take an action to prepare a patch.

Cheers,
Chris
[jira] Closed: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann closed NUTCH-443.
----------------------------------

Patch applied to trunk: http://svn.apache.org/viewvc?rev=548076&view=rev

> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>
> Key: NUTCH-443
> URL: https://issues.apache.org/jira/browse/NUTCH-443
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher
> Affects Versions: 0.9.0
> Reporter: Renaud Richardet
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.0.0
>
> Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, NUTCH-443.022507.patch.txt, NUTCH-443.02282007-v2.patch, NUTCH-443.02282007.patch, NUTCH-443.08052007.patch, NUTCH_443_reopened_v3.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff, patch.txt, redirect_and_index.patch, redirect_and_index_v2.patch
>
> allow Parser#parse to return a Map. This way, the RSS parser can return multiple parse objects, that will all be indexed separately.
> Advantage: no need to fetch all feed-items separately.
> see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
[jira] Resolved: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved NUTCH-443.
------------------------------------

Resolution: Fixed

Patch tested and contributed by Dogacan. This update is a fix and semantics change from the original patch for NUTCH-443. The original patch did not tell the Indexer to read crawl_parse too, so that it can pick up sub-urls' fetch datums. This patch addresses that issue. Now, if Fetcher gets a null content, it filters the null content out instead of pushing an empty content.
[jira] Commented: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505598 ]

Andrzej Bialecki commented on NUTCH-485:
----------------------------------------

Whitespace changes should be committed as a separate patch, if really needed - otherwise the patch should not introduce purely whitespace changes. This is not a dogma, but keeping this rule makes it easier later on to see what is the meaning of the patch.
[jira] Closed: (NUTCH-476) Would like to add a field to the document class for its MD5 signature
[ https://issues.apache.org/jira/browse/NUTCH-476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-476.
------------------------------

Resolution: Invalid

Such a field is already stored in index (as "digest"). You can change how it is calculated by db.signature.class option.

> Would like to add a field to the document class for its MD5 signature
> ---------------------------------------------------------------------
>
> Key: NUTCH-476
> URL: https://issues.apache.org/jira/browse/NUTCH-476
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Environment: all
> Reporter: Linh Pham
> Priority: Minor
>
> During indexing a file, if an MD5 signature was calculated and stored along with the document as a default, it could then be used to remove duplicates from the results on retrieval.
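The idea behind the request (use a content digest to drop duplicate results) can be shown with the JDK alone. This is a generic sketch and not how Nutch computes its "digest" field; the class and method names are made up for the example, and it hashes the raw text with MD5 via java.security.MessageDigest.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DigestDedup {
    /** Returns the MD5 digest of the content as a lowercase hex string. */
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Keeps the first URL seen for each distinct content digest. */
    static List<String> dedup(Map<String, String> urlToContent) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, String> e : urlToContent.entrySet()) {
            if (seen.add(md5Hex(e.getValue()))) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("http://example.com/a", "same body");
        docs.put("http://example.com/mirror/a", "same body");
        docs.put("http://example.com/b", "other body");
        System.out.println(dedup(docs));
    }
}
```

An exact hash only catches byte-identical content, which is one reason a configurable signature class is useful.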
[jira] Closed: (NUTCH-270) Apply just the applicable portions of the patch to protocol.httpclient.Http.java
[ https://issues.apache.org/jira/browse/NUTCH-270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-270.
------------------------------

Resolution: Fixed

Fixed as part of NUTCH-61.

> Apply just the applicable portions of the patch to protocol.httpclient.Http.java
> --------------------------------------------------------------------------------
>
> Key: NUTCH-270
> URL: https://issues.apache.org/jira/browse/NUTCH-270
> Project: Nutch
> Issue Type: Sub-task
> Components: fetcher
> Affects Versions: 0.8
> Reporter: Jeremy Calvert
>
> This seems to be two issues in one. Adaptive scheduling AND content change detection.
> I don't see any reason not to apply the patch to allow content change detection. That is, the parts of the patch to support changing the signature HttpResponse(URL url, long lastModified). It'd be especially useful for those of us who refetch feeds fairly frequently.
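One common way a lastModified value supports content change detection is a conditional HTTP request: the client sends If-Modified-Since and treats a 304 response as "unchanged". The sketch below assumes that this is the intent and uses only java.net; it is not the protocol-httpclient plugin's code, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

class ConditionalFetch {
    /**
     * Prepares (but does not open) a connection that asks the server to send
     * the body only if it changed after lastModifiedMillis.
     */
    static HttpURLConnection prepare(String url, long lastModifiedMillis) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            if (lastModifiedMillis > 0) {
                // Sent as the If-Modified-Since request header on connect.
                conn.setIfModifiedSince(lastModifiedMillis);
            }
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** A 304 Not Modified status means the stored copy is still current. */
    static boolean unchanged(int statusCode) {
        return statusCode == HttpURLConnection.HTTP_NOT_MODIFIED;
    }

    public static void main(String[] args) {
        HttpURLConnection conn = prepare("http://example.com/feed.xml", 1182000000000L);
        System.out.println("If-Modified-Since (ms): " + conn.getIfModifiedSince());
        System.out.println("304 means unchanged: " + unchanged(304));
    }
}
```

For feeds that are refetched often, a 304 avoids downloading and reparsing a body that has not changed.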
[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-443:
-------------------------------

Attachment: NUTCH_443_reopened_v3.patch

New version against latest trunk. Tested locally, seems to work.