[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-06-17 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505701
 ] 

Doğacan Güney commented on NUTCH-444:
-

+1 for committing feed plugin.

I would like to see parse-rss removed too. However:

* If we are to remove it, we must make sure that it is announced properly. 
Otherwise, nutch-user will just receive a ton of emails asking why Nutch 
doesn't parse feeds even though parse-rss is included in the config :)

* It may be a good idea to optionally provide parse-rss's behavior in feed, so 
that one can get either the "one parse per item" behavior or the "all entries 
condensed into one parse" behavior. We can do this later on, if/when it turns 
out that some people need the old behavior.
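
The two behaviors mentioned in the second point could look roughly like the sketch below. This is only an illustration under stated assumptions: the class, the Item type, and the boolean switch are hypothetical stand-ins, plain strings take the place of Nutch's parse objects, and nothing here is taken from the attached patches.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-ins only; none of these types come from Nutch or the patch. */
class FeedParseModes {

    /** One feed entry: its link and its extracted text. */
    static final class Item {
        final String link;
        final String text;

        Item(String link, String text) {
            this.link = link;
            this.text = text;
        }
    }

    /** "One parse per item": each entry becomes its own parse, keyed by the entry link. */
    static Map<String, String> parsePerItem(List<Item> items) {
        Map<String, String> parses = new LinkedHashMap<>();
        for (Item item : items) {
            parses.put(item.link, item.text);
        }
        return parses;
    }

    /** "All entries condensed into one parse": a single parse keyed by the feed URL. */
    static Map<String, String> parseCondensed(String feedUrl, List<Item> items) {
        StringBuilder all = new StringBuilder();
        for (Item item : items) {
            if (all.length() > 0) {
                all.append('\n');
            }
            all.append(item.text);
        }
        Map<String, String> parses = new LinkedHashMap<>();
        parses.put(feedUrl, all.toString());
        return parses;
    }

    /** Hypothetical switch between the two behaviors (e.g. driven by a config option). */
    static Map<String, String> parse(String feedUrl, List<Item> items, boolean onePerItem) {
        return onePerItem ? parsePerItem(items) : parseCondensed(feedUrl, items);
    }
}
```

With a switch like this, the old behavior would stay available to those who need it without keeping a second plugin around.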

> Possibly use a different library to parse RSS feed for improved performance 
> and compatibility
> -
>
> Key: NUTCH-444
> URL: https://issues.apache.org/jira/browse/NUTCH-444
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 0.9.0
> Reporter: Renaud Richardet
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.0.0
>
> Attachments: feed.tar.bz2, NUTCH-444.Mattmann.061707.patch.txt, 
> NUTCH-444.patch, parse-feed-v2.tar.bz2, parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current 
> library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to 
> jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome 
> - Informa
> - custom implementation based on Stax
> - ??

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-500) Add hadoop masters configuration file into conf folder

2007-06-17 Thread Emmanuel Joke (JIRA)
Add hadoop masters configuration file into conf folder
--

Key: NUTCH-500
URL: https://issues.apache.org/jira/browse/NUTCH-500
Project: Nutch
Issue Type: Improvement
Components: ndfs
Affects Versions: 0.9.0
Environment: Linux Fedora 7, Java 1.5
Reporter: Emmanuel Joke
Priority: Minor
Fix For: 1.0.0


Hadoop scripts read a configuration file named masters to know how many 
namenodes should be started.
This file is not in the repository at the moment, so it generates some error 
messages (not really important ones) when we start the cluster.

Anyway, it could be a good idea to add a template file in the conf directory.




RE: [jira] Resolved: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-06-17 Thread Gal Nitzan
Thanks Doğacan, much obliged.

Gal.

> -----Original Message-----
> From: Doğacan Güney (JIRA) [mailto:[EMAIL PROTECTED]
> Sent: Sunday, June 17, 2007 11:29 PM
> To: nutch-dev@lucene.apache.org
> Subject: [jira] Resolved: (NUTCH-485) Change HtmlParseFilter 's to return
> ParseResult object instead of Parse object
>
>
>  [ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Doğacan Güney resolved NUTCH-485.
> -
>
> Resolution: Fixed
>
> Committed in rev 548103 with two modifications:
>
> 1) Fix whitespace issues.
>
> 2) Original patch changed CCParseFilter to return the original parse
> result if CCParseFilter fails. Now if CCParseFilter fails with an
> exception, it returns an empty parse created from the exception.
>
> > Change HtmlParseFilter 's to return ParseResult object instead of Parse
> > object
> > --
> >
> > Key: NUTCH-485
> > URL: https://issues.apache.org/jira/browse/NUTCH-485
> > Project: Nutch
> > Issue Type: Improvement
> > Components: fetcher
> > Affects Versions: 1.0.0
> > Environment: All
> > Reporter: Gal Nitzan
> > Assignee: Doğacan Güney
> > Fix For: 1.0.0
> >
> > Attachments: NUTCH-485.200705122151.patch,
> > NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch,
> > NUTCH-485.200705131241.patch, NUTCH-485.200705140001.patch
> >
> >
> > The current implementation of HtmlParseFilters.java doesn't allow a
> > filter to add parse objects to the ParseResult object.
> > A change to HtmlParseFilter is needed that allows the filter to
> > return ParseResult, and of course a matching change to HtmlParseFilters.
>




[jira] Resolved: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-06-17 Thread Doğacan Güney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney resolved NUTCH-485.
-

Resolution: Fixed

Committed in rev 548103 with two modifications:

1) Fix whitespace issues.

2) Original patch changed CCParseFilter to return the original parse result if 
CCParseFilter fails. Now if CCParseFilter fails with an exception, it returns 
an empty parse created from the exception.
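
The difference described in point 2 can be sketched as follows. This is a hedged illustration only: Result, filterKeepingOriginal, and filterReturningEmpty are hypothetical names standing in for the real Nutch classes, and the code is not taken from rev 548103.

```java
import java.util.function.UnaryOperator;

/** Hypothetical stand-ins; this is not the CCParseFilter code itself. */
class FilterFallbackSketch {

    /** Minimal stand-in for a parse outcome. */
    static final class Result {
        final boolean success;
        final String text;
        final String message;

        private Result(boolean success, String text, String message) {
            this.success = success;
            this.text = text;
            this.message = message;
        }

        static Result ok(String text) {
            return new Result(true, text, "");
        }

        /** An "empty parse" that carries only the failure cause. */
        static Result emptyFrom(Exception e) {
            return new Result(false, "", e.toString());
        }
    }

    /** Behavior of the original patch: on failure, hand back the input unchanged. */
    static Result filterKeepingOriginal(Result input, UnaryOperator<Result> filter) {
        try {
            return filter.apply(input);
        } catch (RuntimeException e) {
            return input;
        }
    }

    /** Behavior as committed: on failure, return an empty parse built from the exception. */
    static Result filterReturningEmpty(Result input, UnaryOperator<Result> filter) {
        try {
            return filter.apply(input);
        } catch (RuntimeException e) {
            return Result.emptyFrom(e);
        }
    }
}
```

The second variant makes the failure visible downstream instead of silently passing the unfiltered parse along.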

> Change HtmlParseFilter 's to return ParseResult object instead of Parse object
> --
>
> Key: NUTCH-485
> URL: https://issues.apache.org/jira/browse/NUTCH-485
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Gal Nitzan
> Assignee: Doğacan Güney
> Fix For: 1.0.0
>
> Attachments: NUTCH-485.200705122151.patch, 
> NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch, 
> NUTCH-485.200705131241.patch, NUTCH-485.200705140001.patch
>
>
> The current implementation of HtmlParseFilters.java doesn't allow a filter to 
> add parse objects to the ParseResult object.
> A change to HtmlParseFilter is needed that allows the filter to return 
> ParseResult, and of course a matching change to HtmlParseFilters.




[jira] Updated: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-06-17 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-444:


Attachment: NUTCH-444.Mattmann.061707.patch.txt

Hi Folks,

 Here is a patch that brings this issue up-to-date. The patch takes Doğacan's 
initial patch, and cleans it up in many places, e.g.:

* changed the status returned on a failed parse to ParseStatus.STATUS_FAILURE 
(was ParseStatus.STATUS_SUCCESS) - line 271
* reformatted code to conform to project style
* removed magic strings
* added in Apache license
* added in unit test
* fixed build.xml file to include refs to nutch-extensionpoints dep during unit 
test

 While I think there are a few minor open questions moving forward, I don't see 
any of them hindering the committal of this patch. In answer to my earlier 
question on this issue, I noticed that, all in all, the feed plugin provided 
here offers a superset of the functionality of parse-rss. So, I am +1 for 
removing parse-rss. Some things to consider going forward:

1. I did find one difference in semantics between the parse-rss plugin and the 
feed plugin: the feed plugin adds the URL pointer to the channel file as the 
Text entry in the <Text, Parse> map provided in the ParseResult class. While 
this is probably the correct thing to do, it caused me some grief initially 
because it made my unit test fail. My unit test was expecting to receive the 
URL http://test.channel.com, the channel URL identified in the rsstest.rss 
file provided as sample input for the unit test. However, since the feed 
plugin parser takes the *actual* URL pointer to the channel file (e.g., 
file:/some/path/on/your/system/rsstest.rss) rather than the specified channel 
URL, the test was failing. The old parse-rss plugin took the channel URL 
instead. I thought about this, and it's not a major hurdle. I think the 
semantics of simply taking the URL pointer to the channel file that was used 
(even if it was a file: pointer) is fine.

2. It might be a good idea to factor out the desired index/parse properties 
taken from the feed and allow them to be specified by a configuration file to 
this plugin. In other words, wouldn't it be nice to tell the plugin which 
fields we want to extract (e.g., author, published date, etc.)? This would be 
an improvement to this plugin later on.
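
As a rough sketch of what item 2 could look like: the property name feed.index.fields and the class below are hypothetical, java.util.Properties stands in for the plugin's real configuration object, and a plain map stands in for a feed entry. None of this exists in the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical sketch of configuration-driven field selection. */
class ConfigurableFeedFields {

    /** Hypothetical property name; not an existing Nutch setting. */
    static final String FIELDS_KEY = "feed.index.fields";

    /** Keeps only the entry fields named in the comma-separated property. */
    static Map<String, String> selectFields(Properties conf, Map<String, String> entry) {
        Map<String, String> selected = new LinkedHashMap<>();
        String raw = conf.getProperty(FIELDS_KEY, "");
        for (String name : raw.split(",")) {
            String field = name.trim();
            // Skip blanks (e.g. an unset property) and fields the entry lacks.
            if (!field.isEmpty() && entry.containsKey(field)) {
                selected.put(field, entry.get(field));
            }
        }
        return selected;
    }
}
```

Setting the property to "author, published" would then extract only those two fields and ignore everything else in the entry.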

Okey dok, so here it is. If there are no objections, I'd like to commit this in 
the next 48 hrs. I'd also like feedback from folks like Andrzej and Doğacan 
regarding removing parse-rss from the sources.

Thanks!

Cheers,
  Chris







[jira] Work started: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-06-17 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-444 started by Chris A. Mattmann.





[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-06-17 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505607
 ] 

Chris A. Mattmann commented on NUTCH-444:
-

Hi Nutch Newbie:

I will take a look at this today, and take an action to prepare a patch. 

Cheers,
  Chris






[jira] Closed: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-06-17 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann closed NUTCH-443.
---


Patch applied to trunk:

http://svn.apache.org/viewvc?rev=548076&view=rev

> allow parsers to return multiple Parse object, this will speed up the rss 
> parser
> 
>
> Key: NUTCH-443
> URL: https://issues.apache.org/jira/browse/NUTCH-443
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher
> Affects Versions: 0.9.0
> Reporter: Renaud Richardet
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.0.0
>
> Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
> NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
> NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, 
> NUTCH-443.022507.patch.txt, NUTCH-443.02282007-v2.patch, 
> NUTCH-443.02282007.patch, NUTCH-443.08052007.patch, 
> NUTCH_443_reopened_v3.patch, parse-map-core-draft-v1.patch, 
> parse-map-core-untested.patch, parsers.diff, patch.txt, 
> redirect_and_index.patch, redirect_and_index_v2.patch
>
>
> allow Parser#parse to return a Map. This way, the RSS parser 
> can return multiple parse objects, that will all be indexed separately. 
> Advantage: no need to fetch all feed-items separately.
> see the discussion at 
> http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html




[jira] Resolved: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-06-17 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-443.
-

Resolution: Fixed

Patch tested and contributed by Doğacan. This update is a fix and semantics 
change from the original patch for NUTCH-443. The original patch did not tell 
the Indexer to read crawl_parse too, so that it can pick up sub-urls' fetch 
datums. This patch addresses that issue. Now, if Fetcher gets a null content, 
it filters that content out instead of pushing an empty content.





[jira] Commented: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-06-17 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505598
 ] 

Andrzej Bialecki  commented on NUTCH-485:
-

Whitespace changes should be committed as a separate patch, if really needed; 
otherwise the patch should not introduce purely whitespace changes. This is not 
dogma, but keeping to this rule makes it easier later on to see what the patch 
actually changes.





[jira] Closed: (NUTCH-476) Would like to add a field to the document class for its MD5 signature

2007-06-17 Thread Doğacan Güney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney closed NUTCH-476.
---

Resolution: Invalid

Such a field is already stored in the index (as "digest"). You can change how 
it is calculated via the db.signature.class option.
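
For readers wondering how a digest like this helps with duplicates, here is a minimal standalone sketch. It assumes a plain MD5 over the raw content bytes and uses only the JDK; the class and method names are hypothetical, and this is not the Nutch signature implementation, which is whatever class db.signature.class names.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of digest-based duplicate removal; not Nutch code. */
class DigestDedupSketch {

    /** Returns the MD5 of the content as 32 lowercase hex characters. */
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(content);
            return String.format("%032x", new BigInteger(1, hash));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Keeps the first document seen for each distinct digest, in input order. */
    static List<String> dropDuplicates(List<String> docs) {
        Map<String, String> byDigest = new LinkedHashMap<>();
        for (String doc : docs) {
            byDigest.putIfAbsent(md5Hex(doc.getBytes(StandardCharsets.UTF_8)), doc);
        }
        return new ArrayList<>(byDigest.values());
    }
}
```

Documents with identical bytes share a digest, so only the first one survives; near-duplicates with small differences would need a different signature.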

> Would like to add a field to the document class for its MD5 signature 
> --
>
> Key: NUTCH-476
> URL: https://issues.apache.org/jira/browse/NUTCH-476
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Environment: all
> Reporter: Linh Pham
> Priority: Minor
>
> During indexing a file, if an MD5 signature was calculated and stored along 
> with the document  as a default,
> it could then be used to remove duplicates from the results on retrieval.




[jira] Closed: (NUTCH-270) Apply just the applicable portions of the patch to protocol.httpclient.Http.java

2007-06-17 Thread Doğacan Güney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney closed NUTCH-270.
---

Resolution: Fixed

Fixed as part of NUTCH-61.

> Apply just the applicable portions of the patch to 
> protocol.httpclient.Http.java
> 
>
> Key: NUTCH-270
> URL: https://issues.apache.org/jira/browse/NUTCH-270
> Project: Nutch
> Issue Type: Sub-task
> Components: fetcher
> Affects Versions: 0.8
> Reporter: Jeremy Calvert
>
> This seems to be two issues in one.  Adaptive scheduling AND content change 
> detection.
> I don't see any reason not to apply the patch to allow content change 
> detection.  That is, the parts of the patch to support changing the signature 
> HttpResponse(URL url, long lastModified).  It'd be especially useful for 
> those of us who refetch feeds fairly frequently.




[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-06-17 Thread Doğacan Güney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-443:


Attachment: NUTCH_443_reopened_v3.patch

New version against latest trunk. 

Tested locally, seems to work.

