[jira] Updated: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-05-13 Thread Gal Nitzan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gal Nitzan updated NUTCH-485:
-

Attachment: NUTCH-485.200705130928.patch

Following Andrzej's advice, much cleaner code :)

Attached...

 Change HtmlParseFilter 's to return ParseResult object instead of Parse object
 --

 Key: NUTCH-485
 URL: https://issues.apache.org/jira/browse/NUTCH-485
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
 Environment: All
Reporter: Gal Nitzan
 Fix For: 1.0.0

 Attachments: NUTCH-485.200705122151.patch, 
 NUTCH-485.200705130928.patch


 The current implementation of HtmlParseFilters.java doesn't allow a filter to 
 add parse objects to the ParseResult object.
 A change to HtmlParseFilter is needed that allows a filter to return a 
 ParseResult, and of course a matching change to HtmlParseFilters.
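
A minimal sketch of the requested change, assuming heavily simplified stand-ins for the Nutch types. Parse, ParseResult, OldFilter, NewFilter and runAll below are hypothetical names made up for illustration; they are not the signatures in the attached patches.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified stand-ins for illustration; not the real Nutch classes.
class ParseFilterSketch {

  /** Stand-in for a Parse: only the extracted text. */
  static final class Parse {
    final String text;
    Parse(String text) { this.text = text; }
  }

  /** Stand-in for ParseResult: maps a url to its parse, in insertion order. */
  static final class ParseResult {
    final Map<String, Parse> parses = new LinkedHashMap<>();
  }

  /** Old shape: a filter can only hand back the single Parse it was given. */
  interface OldFilter {
    Parse filter(String url, Parse parse);
  }

  /** Requested shape: a filter receives and returns the whole ParseResult. */
  interface NewFilter {
    ParseResult filter(String url, ParseResult result);
  }

  /** Runs the filters in order, passing each one the previous filter's output. */
  static ParseResult runAll(String url, ParseResult result, NewFilter... filters) {
    for (NewFilter f : filters) {
      result = f.filter(url, result);
    }
    return result;
  }

  public static void main(String[] args) {
    ParseResult result = new ParseResult();
    result.parses.put("http://example.com/", new Parse("page text"));

    // A filter that adds an extra parse for a sub-url it discovered.
    NewFilter addsSubUrl = (url, r) -> {
      r.parses.put(url + "#item1", new Parse("item text"));
      return r;
    };

    ParseResult out = runAll("http://example.com/", result, addsSubUrl);
    System.out.println(out.parses.keySet());
  }
}
```

With the old shape there is nowhere for the filter to put the second entry, which is the limitation this issue describes.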

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-05-13 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495350
 ] 

Doğacan Güney commented on NUTCH-485:
-

You probably should not add put(String/Text key, Parse parse) methods to 
ParseResult. ParseResult deliberately has no direct method for adding a Parse 
object, so that it can check whether the parse object comes from the real url 
or a sub-url. 
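
To illustrate the point, here is a rough sketch in which the result object builds the parse itself and so always knows whether the key is the original url. The class layout, the fromOriginalUrl flag and the put(String, String) signature are assumptions for illustration only, not the actual ParseResult code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; names and fields are assumptions, not the Nutch source.
class ParseResultSketch {

  /** Stand-in for a Parse that remembers whether it belongs to the fetched url. */
  static final class Parse {
    final String text;
    final boolean fromOriginalUrl;
    Parse(String text, boolean fromOriginalUrl) {
      this.text = text;
      this.fromOriginalUrl = fromOriginalUrl;
    }
  }

  static final class ParseResult {
    private final String originalUrl;
    private final Map<String, Parse> parses = new LinkedHashMap<>();

    ParseResult(String originalUrl) { this.originalUrl = originalUrl; }

    /**
     * Only entry point for adding a parse. Callers pass the raw text, so the
     * result itself decides whether the key is the real url or a sub-url.
     */
    void put(String key, String text) {
      parses.put(key, new Parse(text, key.equals(originalUrl)));
    }

    Parse get(String key) { return parses.get(key); }
  }

  public static void main(String[] args) {
    ParseResult result = new ParseResult("http://example.com/feed");
    result.put("http://example.com/feed", "feed text");
    result.put("http://example.com/item1", "item text");
    System.out.println(result.get("http://example.com/feed").fromOriginalUrl);
    System.out.println(result.get("http://example.com/item1").fromOriginalUrl);
  }
}
```

A put(key, Parse) overload would let callers hand in a ready-made Parse whose flag the result never had a chance to set, which is the concern raised above.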

 Change HtmlParseFilter 's to return ParseResult object instead of Parse object
 --

 Key: NUTCH-485
 URL: https://issues.apache.org/jira/browse/NUTCH-485
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
 Environment: All
Reporter: Gal Nitzan
 Fix For: 1.0.0

 Attachments: NUTCH-485.200705122151.patch, 
 NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch


 The current implementation of HtmlParseFilters.java doesn't allow a filter to 
 add parse objects to the ParseResult object.
 A change to HtmlParseFilter is needed that allows a filter to return a 
 ParseResult, and of course a matching change to HtmlParseFilters.




[jira] Updated: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-05-13 Thread Gal Nitzan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gal Nitzan updated NUTCH-485:
-

Attachment: NUTCH-485.200705131241.patch

Thanks Doğacan, I missed it :( 

Thanks to all reviewers.
 
Yet another patch...

 Change HtmlParseFilter 's to return ParseResult object instead of Parse object
 --

 Key: NUTCH-485
 URL: https://issues.apache.org/jira/browse/NUTCH-485
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
 Environment: All
Reporter: Gal Nitzan
 Fix For: 1.0.0

 Attachments: NUTCH-485.200705122151.patch, 
 NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch, 
 NUTCH-485.200705131241.patch


 The current implementation of HtmlParseFilters.java doesn't allow a filter to 
 add parse objects to the ParseResult object.
 A change to HtmlParseFilter is needed that allows a filter to return a 
 ParseResult, and of course a matching change to HtmlParseFilters.




[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-13 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495357
 ] 

Doğacan Güney commented on NUTCH-443:
-

Well... That's embarrassing. It seems I forgot to include the necessary changes 
to Indexer. Indexer has to read crawl_parse too, so that it can pick up 
sub-urls' fetch datums. 

So, that seemed easy (just a couple of lines) but then I realized that there is 
another bug. (Which, in my defense, was present in Nutch before 443. So the bug 
was there, I only made it worse:)

It is a bit difficult to describe, so please bear with me. The problem goes 
like this:

In fetcher, if max.redirect is 0, Nutch pushes an empty Content to content and 
a LINKED datum to crawl_fetch (let's call this url foo). ParseSegment parses 
empty Content and creates a parse data and an empty parse text. After updatedb 
and one more generate-fetch-parse-updatedb cycle, we now have a proper content, 
parse text and parse data for foo in the new segment.

Now, assume I index both of these segments together. Url foo will have two sets 
of (fetch datum, parse), one coming from the first segment, the other coming 
from the second segment. Since the first fetch datum is LINKED, this code in 
Indexer.reduce will cause foo to be discarded:

if (redir != null) {
  // XXX page was redirected - what should we do?
  // XXX discard it for now
  return;
}

And it doesn't work if we just remove this code. Remember that foo has two sets 
of (fetch datum, parse) and one of the parses contains an empty parse text. 
Since Indexer will, in reduce, randomly choose one of the parses, it is likely 
that we will get an empty parse text for url foo.

This is the part that I made worse: since Indexer has to read crawl_parse, it 
will get a lot of STATUS_LINKED datums (which are written to crawl_parse as 
outlinks) and discard a lot of useful pages in any multi-segment index job.
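
A toy reproduction of the two failure modes described above, not the actual Indexer code. Status, Entry and both reduce methods are hypothetical names, and "the last parse seen wins" is only a stand-in for the arbitrary choice made in reduce.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the situation; not the actual Indexer code.
class IndexerReduceSketch {

  enum Status { FETCH_SUCCESS, LINKED }

  /** One (fetch datum, parse text) pair for a url, as read from one segment. */
  static final class Entry {
    final Status status;
    final String parseText;
    Entry(Status status, String parseText) {
      this.status = status;
      this.parseText = parseText;
    }
  }

  /** Mirrors the quoted check: any LINKED datum discards the whole url. */
  static String reduceDiscardingOnLinked(List<Entry> entries) {
    String text = null;
    for (Entry e : entries) {
      if (e.status == Status.LINKED) {
        return null; // page treated as redirected, so it is dropped
      }
      text = e.parseText;
    }
    return text;
  }

  /** Check removed: whichever parse is seen last wins, possibly the empty one. */
  static String reduceWithoutCheck(List<Entry> entries) {
    String text = null;
    for (Entry e : entries) {
      text = e.parseText;
    }
    return text;
  }

  public static void main(String[] args) {
    // Url foo as seen when both segments are indexed together.
    List<Entry> foo = Arrays.asList(
        new Entry(Status.FETCH_SUCCESS, "real page text"), // second segment
        new Entry(Status.LINKED, ""));                      // first segment
    System.out.println("with check: " + reduceDiscardingOnLinked(foo));
    System.out.println("without check: [" + reduceWithoutCheck(foo) + "]");
  }
}
```

Either way the good parse from the second segment is lost: with the check foo is dropped entirely, and without it foo can end up indexed with the empty text.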

Sorry if the description is unnecessarily complex.



 allow parsers to return multiple Parse object, this will speed up the rss 
 parser
 

 Key: NUTCH-443
 URL: https://issues.apache.org/jira/browse/NUTCH-443
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
 Assigned To: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
 NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
 NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, 
 NUTCH-443.022507.patch.txt, NUTCH-443.02282007-v2.patch, 
 NUTCH-443.02282007.patch, NUTCH-443.08052007.patch, 
 parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff


 allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser 
 can return multiple parse objects, that will all be indexed separately. 
 Advantage: no need to fetch all feed-items separately.
 see the discussion at 
 http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html




[jira] Resolved: (NUTCH-484) Nutch Nightly API link is broken in site

2007-05-13 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-484.
--

Resolution: Fixed

committed and updated site, thanks Gal

 Nutch Nightly API link is broken in site
 

 Key: NUTCH-484
 URL: https://issues.apache.org/jira/browse/NUTCH-484
 Project: Nutch
  Issue Type: Bug
  Components: documentation
Affects Versions: 1.0.0
 Environment: All
Reporter: Gal Nitzan
Priority: Trivial
 Fix For: 1.0.0

 Attachments: NUTCH-484.200705121200.patch


 The Nightly API link is broken




[jira] Updated: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-05-13 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-444:


Attachment: NUTCH-444.patch
feed.tar.bz2

First version of feed plugin featuring a Parser and an IndexingFilter. You 
would need the latest patch from NUTCH-443 (redirect_and_index.patch) to test 
it.

 Possibly use a different library to parse RSS feed for improved performance 
 and compatibility
 -

 Key: NUTCH-444
 URL: https://issues.apache.org/jira/browse/NUTCH-444
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
 Assigned To: Chris A. Mattmann
Priority: Minor
 Fix For: 1.0.0

 Attachments: feed.tar.bz2, NUTCH-444.patch, parse-feed-v2.tar.bz2, 
 parse-feed.tar.bz2


 As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current 
 library (feedparser) has the following issues:
 - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to 
 jdom first
 - no support for Atom 1.0
 - there has been no development in the last year
 Alternatives are:
 - Rome 
 - Informa
 - custom implementation based on Stax
 - ??
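
For the "custom implementation based on Stax" alternative, a minimal sketch of what streaming a feed could look like with the JDK's javax.xml.stream API. It only collects item links; the class name, method name and sample feed are made up, and this is not what the attached patch does.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: streams an RSS document instead of building a tree first.
class StaxFeedSketch {

  /** Returns the text of the <link> element inside each <item>. */
  static List<String> itemLinks(String rss) {
    List<String> links = new ArrayList<>();
    try {
      XMLStreamReader r = XMLInputFactory.newInstance()
          .createXMLStreamReader(new StringReader(rss));
      boolean inItem = false;
      while (r.hasNext()) {
        int event = r.next();
        if (event == XMLStreamConstants.START_ELEMENT) {
          String name = r.getLocalName();
          if (name.equals("item")) {
            inItem = true;
          } else if (inItem && name.equals("link")) {
            // getElementText reads up to the matching end tag of <link>.
            links.add(r.getElementText().trim());
          }
        } else if (event == XMLStreamConstants.END_ELEMENT
            && r.getLocalName().equals("item")) {
          inItem = false;
        }
      }
      r.close();
    } catch (XMLStreamException e) {
      throw new IllegalStateException("malformed feed", e);
    }
    return links;
  }

  public static void main(String[] args) {
    String rss = "<rss version=\"2.0\"><channel><title>t</title>"
        + "<link>http://example.com/</link>"
        + "<item><title>a</title><link>http://example.com/a</link></item>"
        + "<item><link>http://example.com/b</link></item>"
        + "</channel></rss>";
    System.out.println(itemLinks(rss));
  }
}
```

Because only the current event is held at any time, memory use does not depend on building a full document tree for the feed, which is the motivation given for moving away from the jdom conversion.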




[jira] Reopened: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-13 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reopened NUTCH-443:
-

  Assignee: Chris A. Mattmann  (was: Andrzej Bialecki )

Per Doğacan's comment, we need to reopen this and test out his new patch for 
it. Andrzej, I'd be happy for you to reassign this to yourself; otherwise, I 
will have some time on Tuesday to look at it if you haven't by then.

 allow parsers to return multiple Parse object, this will speed up the rss 
 parser
 

 Key: NUTCH-443
 URL: https://issues.apache.org/jira/browse/NUTCH-443
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
 Assigned To: Chris A. Mattmann
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
 NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
 NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, 
 NUTCH-443.022507.patch.txt, NUTCH-443.02282007-v2.patch, 
 NUTCH-443.02282007.patch, NUTCH-443.08052007.patch, 
 parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff, 
 redirect_and_index.patch


 allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser 
 can return multiple parse objects, that will all be indexed separately. 
 Advantage: no need to fetch all feed-items separately.
 see the discussion at 
 http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html




[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-05-13 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495381
 ] 

Chris A. Mattmann commented on NUTCH-444:
-

Doğacan -- I will check this out tomorrow (Monday) night, Tuesday at the 
latest. I've reopened NUTCH-443 and will also look at your new patch from there.

 Possibly use a different library to parse RSS feed for improved performance 
 and compatibility
 -

 Key: NUTCH-444
 URL: https://issues.apache.org/jira/browse/NUTCH-444
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
 Assigned To: Chris A. Mattmann
Priority: Minor
 Fix For: 1.0.0

 Attachments: feed.tar.bz2, NUTCH-444.patch, parse-feed-v2.tar.bz2, 
 parse-feed.tar.bz2


 As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current 
 library (feedparser) has the following issues:
 - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to 
 jdom first
 - no support for Atom 1.0
 - there has been no development in the last year
 Alternatives are:
 - Rome 
 - Informa
 - custom implementation based on Stax
 - ??




[jira] Commented: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-05-13 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495410
 ] 

Doğacan Güney commented on NUTCH-485:
-

I have two more minor nits:

1) ParseResult.isSuccess returns true only if all parses are successful. This 
makes sense, but I think you should make it more obvious by mentioning it in 
the method's javadoc. 

2) There seem to be some whitespace issues. For example, some indents are 4 
spaces. All indents should be 2 spaces.
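
As a sketch of what nit 1 asks for, the javadoc could spell out the all-or-nothing behaviour along these lines. The class below is a simplified stand-in (it stores a success flag per url instead of Parse objects), and treating an empty result as successful is my assumption, not necessarily what the patch does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the real ParseResult may differ.
class IsSuccessSketch {

  static final class ParseResult {
    // url -> whether that parse succeeded
    private final Map<String, Boolean> status = new LinkedHashMap<>();

    void put(String url, boolean success) { status.put(url, success); }

    /**
     * Returns true only if every parse in this result succeeded.
     * A single failed parse makes the whole result unsuccessful.
     * An empty result has no failures, so this sketch reports true for it.
     */
    boolean isSuccess() {
      for (boolean ok : status.values()) {
        if (!ok) return false;
      }
      return true;
    }
  }

  public static void main(String[] args) {
    ParseResult result = new ParseResult();
    result.put("http://example.com/feed", true);
    System.out.println(result.isSuccess());
    result.put("http://example.com/item1", false);
    System.out.println(result.isSuccess());
  }
}
```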

Anyway, I don't know if my vote counts, but, besides these two issues, I am +1 
on this patch.

I think this may be very useful for image search. After parsing a page, one can 
traverse DOM, add image src's as urls and the immediate text around images as 
parse text (+ whatever data you can gather as parse data). Of course, this 
doesn't automatically make Nutch an image search engine, but is a good first 
step.
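
Roughly what I have in mind, as a standalone sketch using only the JDK's DOM classes. It assumes well-formed XHTML input and takes the parent element's text as "the immediate text around" an image; collectImages and parse are made-up helper names, and nothing here uses the Nutch plugin API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch; a real filter would work on the DOM Nutch hands it.
class ImageParseSketch {

  /** Collects image src -> nearby text (here: the parent element's text). */
  static Map<String, String> collectImages(Document doc) {
    Map<String, String> images = new LinkedHashMap<>();
    NodeList imgs = doc.getElementsByTagName("img");
    for (int i = 0; i < imgs.getLength(); i++) {
      Element img = (Element) imgs.item(i);
      String src = img.getAttribute("src");
      if (src.isEmpty()) continue; // no url to key the parse on
      String around = img.getParentNode().getTextContent().trim();
      images.put(src, around);
    }
    return images;
  }

  /** Parses well-formed XHTML; real-world HTML would need a lenient parser. */
  static Document parse(String xhtml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalStateException("could not parse document", e);
    }
  }

  public static void main(String[] args) {
    String page = "<html><body><p>A red bicycle "
        + "<img src=\"http://example.com/bike.jpg\"/> leaning on a wall</p>"
        + "</body></html>";
    // Each entry could become one extra (url, parse text) pair in the result.
    System.out.println(collectImages(parse(page)));
  }
}
```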

 Change HtmlParseFilter 's to return ParseResult object instead of Parse object
 --

 Key: NUTCH-485
 URL: https://issues.apache.org/jira/browse/NUTCH-485
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
 Environment: All
Reporter: Gal Nitzan
 Fix For: 1.0.0

 Attachments: NUTCH-485.200705122151.patch, 
 NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch, 
 NUTCH-485.200705131241.patch


 The current implementation of HtmlParseFilters.java doesn't allow a filter to 
 add parse objects to the ParseResult object.
 A change to HtmlParseFilter is needed that allows a filter to return a 
 ParseResult, and of course a matching change to HtmlParseFilters.




[jira] Updated: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-05-13 Thread Gal Nitzan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gal Nitzan updated NUTCH-485:
-

Attachment: NUTCH-485.200705140001.patch

Thanks Doğacan for taking the time to review the code.

I agree with your comments on the usage. I run a video search and it is sure 
going to help. The ability to discover and add content to the segment on the 
fly while parsing is long-awaited functionality, and it was all made possible 
by NUTCH-443... :)


And yet one more update with a better description in javadoc and some fixes to 
indentation.

 Change HtmlParseFilter 's to return ParseResult object instead of Parse object
 --

 Key: NUTCH-485
 URL: https://issues.apache.org/jira/browse/NUTCH-485
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
 Environment: All
Reporter: Gal Nitzan
 Fix For: 1.0.0

 Attachments: NUTCH-485.200705122151.patch, 
 NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch, 
 NUTCH-485.200705131241.patch, NUTCH-485.200705140001.patch


 The current implementation of HtmlParseFilters.java doesn't allow a filter to 
 add parse objects to the ParseResult object.
 A change to HtmlParseFilter is needed that allows a filter to return a 
 ParseResult, and of course a matching change to HtmlParseFilters.
