[jira] Resolved: (NUTCH-687) Add RAT
[ https://issues.apache.org/jira/browse/NUTCH-687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren resolved NUTCH-687.
------------------------------
    Resolution: Fixed
    Fix Version/s: 1.0.0

committed

Add RAT
-------
    Key: NUTCH-687
    URL: https://issues.apache.org/jira/browse/NUTCH-687
    Project: Nutch
    Issue Type: Improvement
    Reporter: Sami Siren
    Assignee: Sami Siren
    Priority: Minor
    Fix For: 1.0.0
    Attachments: NUTCH-687.patch

Add apache rat so we can easily see the situation with required headers

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-689) Swf parser doesn't seem to handle relative links
[ https://issues.apache.org/jira/browse/NUTCH-689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674520#action_12674520 ]

Sami Siren commented on NUTCH-689:
----------------------------------

for some reason I cannot apply the patch:

    patching file src/java/org/apache/nutch/parse/swf/SWFParser.java
    Hunk #2 FAILED at 94.

Swf parser doesn't seem to handle relative links
------------------------------------------------
    Key: NUTCH-689
    URL: https://issues.apache.org/jira/browse/NUTCH-689
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 0.9.0
    Reporter: Peter Sparks
    Attachments: parse-swf.patch

I was using the swf parser to extract links from flash files on the site www.arnoldworldwide.com and I was getting a malformed URL exception because an outlink was found that was a relative link and it wasn't being resolved. I was able to fix it by resolving all links as they are added to the list of outlinks.
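The fix described in NUTCH-689 (resolve every link as it is added to the list of outlinks) can be illustrated with a small sketch. This is not the code from parse-swf.patch: the class name `OutlinkResolver`, the method `resolveAll`, and the example URLs are hypothetical, and `java.net.URI` is used here only as one plausible way to do the resolution.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not taken from the NUTCH-689 patch.
class OutlinkResolver {

    // Resolves each raw link against the URL of the document it was found in.
    // Relative links become absolute; absolute links pass through unchanged.
    // Links that are not valid URI references are skipped instead of failing the parse.
    static List<String> resolveAll(String baseUrl, List<String> rawLinks) {
        URI base = URI.create(baseUrl);
        List<String> resolved = new ArrayList<>();
        for (String link : rawLinks) {
            try {
                resolved.add(base.resolve(link).toString());
            } catch (IllegalArgumentException e) {
                // The link could not be parsed at all, so it is dropped.
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        List<String> links = new ArrayList<>();
        links.add("movies/intro.swf");
        links.add("/about/index.html");
        links.add("http://example.org/absolute.html");
        for (String url : resolveAll("http://www.example.com/flash/main.swf", links)) {
            System.out.println(url);
        }
    }
}
```

Resolving at the point where links are collected means that later stages only ever see absolute URLs.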
[jira] Resolved: (NUTCH-591) StringIndexOutOfBoundsException when extracting text from a Word document.
[ https://issues.apache.org/jira/browse/NUTCH-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren resolved NUTCH-591.
------------------------------
    Resolution: Duplicate

duplicate of NUTCH-691

StringIndexOutOfBoundsException when extracting text from a Word document.
--------------------------------------------------------------------------
    Key: NUTCH-591
    URL: https://issues.apache.org/jira/browse/NUTCH-591
    Project: Nutch
    Issue Type: Bug
    Components: fetcher
    Affects Versions: 0.9.0
    Environment: linux redhat as4u4 x86 kernel 2.6.9
    Reporter: frank ling

see http://issues.apache.org/bugzilla/show_bug.cgi?id=41076+
[jira] Resolved: (NUTCH-688) Fix missing/wrong headers in source files
[ https://issues.apache.org/jira/browse/NUTCH-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren resolved NUTCH-688.
------------------------------
    Resolution: Fixed

I think we are done with this.

Fix missing/wrong headers in source files
-----------------------------------------
    Key: NUTCH-688
    URL: https://issues.apache.org/jira/browse/NUTCH-688
    Project: Nutch
    Issue Type: Bug
    Reporter: Sami Siren
    Assignee: Sami Siren
    Priority: Blocker
    Fix For: 1.0.0
[jira] Created: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19
AlreadyBeingCreatedException with Hadoop 0.19
---------------------------------------------
    Key: NUTCH-692
    URL: https://issues.apache.org/jira/browse/NUTCH-692
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.0.0
    Reporter: julien nioche

I have been using the SVN version of Nutch on an EC2 cluster and got some AlreadyBeingCreatedException errors during the reduce phase of a parse. For some reason one of my tasks crashed, and then I ran into this AlreadyBeingCreatedException when other nodes tried to pick it up.

There was recently a discussion on the Hadoop user list about similar issues with Hadoop 0.19 (see http://markmail.org/search/after+upgrade+to+0%2E19%2E0). I have not tried using 0.18.2 yet but will do so if the problems persist with 0.19.

I was wondering whether anyone else had experienced the same problem. Do you think 0.19 is stable enough to use for Nutch 1.0?

I will be running a crawl on a super large cluster in the next couple of weeks and I will confirm this issue.

J.
[jira] Resolved: (NUTCH-691) Update jakarta poi jars to the most relevant version
[ https://issues.apache.org/jira/browse/NUTCH-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren resolved NUTCH-691.
------------------------------
    Resolution: Fixed
    Fix Version/s: 1.0.0

committed, Thanks Dmitry

Update jakarta poi jars to the most relevant version
----------------------------------------------------
    Key: NUTCH-691
    URL: https://issues.apache.org/jira/browse/NUTCH-691
    Project: Nutch
    Issue Type: Improvement
    Components: fetcher
    Affects Versions: 1.0.0
    Reporter: Dmitry Lihachev
    Fix For: 1.0.0
    Attachments: NUTCH-691-v1-poi.patch, NUTCH-691-v1-test.patch
    Original Estimate: 0.25h
    Remaining Estimate: 0.25h

Update jakarta poi jars to the most relevant version closes bug NUTCH-591.
[jira] Resolved: (NUTCH-563) Include custom fields in BasicQueryFilter
[ https://issues.apache.org/jira/browse/NUTCH-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren resolved NUTCH-563.
------------------------------
    Resolution: Fixed
    Assignee: Sami Siren

committed, thanks

Include custom fields in BasicQueryFilter
-----------------------------------------
    Key: NUTCH-563
    URL: https://issues.apache.org/jira/browse/NUTCH-563
    Project: Nutch
    Issue Type: New Feature
    Components: searcher
    Reporter: julien nioche
    Assignee: Sami Siren
    Priority: Minor
    Fix For: 1.0.0
    Attachments: diff.BasicQueryFilter.dynamicFields.txt, NUTCH-563.patch

This patch allows additional fields to be included in the BasicQueryFilter by specifying runtime parameters. Any parameter matching the regular expression (query\\.basic\\.(.+)\\.boost) will be added to the list of fields to be used by the BQF, and the specified float value will be used as the boost.
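To make the parameter convention from NUTCH-563 concrete, here is a minimal sketch of how keys matching that regular expression could be turned into a field-to-boost map. Only the regular expression comes from the issue text. The class `BoostFields`, the method `fromParameters`, and the use of a plain `Map` in place of the Nutch configuration object are assumptions made for illustration, not the committed patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the naming convention; not the NUTCH-563 patch itself.
class BoostFields {

    // The expression quoted in the issue, written as a Java string literal.
    private static final Pattern BOOST_KEY =
            Pattern.compile("query\\.basic\\.(.+)\\.boost");

    // Returns field name -> boost for every parameter whose key matches the pattern.
    // Parameters with any other key are ignored.
    static Map<String, Float> fromParameters(Map<String, String> params) {
        Map<String, Float> boosts = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            Matcher m = BOOST_KEY.matcher(entry.getKey());
            if (m.matches()) {
                // Group 1 is the part between "query.basic." and ".boost".
                boosts.put(m.group(1), Float.parseFloat(entry.getValue()));
            }
        }
        return boosts;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("query.basic.author.boost", "2.5");
        params.put("query.basic.title.boost", "1.5");
        params.put("http.agent.name", "example-crawler");
        System.out.println(fromParameters(params));
    }
}
```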
[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674603#action_12674603 ]

Sami Siren commented on NUTCH-692:
----------------------------------

Have you seen this outside of EC2? Only in multinode setup?
[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674607#action_12674607 ]

julien nioche commented on NUTCH-692:
-------------------------------------

I have seen this only in multinode setup and on EC2.
[jira] Updated: (NUTCH-583) FeedParser empty links for items
[ https://issues.apache.org/jira/browse/NUTCH-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren updated NUTCH-583:
-----------------------------
    Fix Version/s: 1.1 (was: 1.0.0)

pushing this to 1.1

FeedParser empty links for items
--------------------------------
    Key: NUTCH-583
    URL: https://issues.apache.org/jira/browse/NUTCH-583
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.0.0
    Reporter: Enis Soztutar
    Assignee: Enis Soztutar
    Fix For: 1.1

FeedParser in the feed plugin just discards an item if it does not have a link element. However, RSS 2.0 does not require the link element for each item. Moreover, the link is sometimes given in the guid element, which is a globally unique identifier for the item. I think we can search for the item's URL first; if it is still not found, we can use the feed's URL, merging all the parse texts into one Parse object.
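The fallback order proposed in NUTCH-583 (item link, then guid, then the feed's own URL) can be sketched as follows. This is only an illustration of the idea: the class `FeedItemUrl` and its methods are hypothetical, and the check that a guid looks like an HTTP URL is an added assumption, since the issue does not say how a usable guid would be recognised. Merging parse texts into a single Parse object is not shown.

```java
import java.util.Locale;

// Hypothetical sketch of the fallback order; not FeedParser code.
class FeedItemUrl {

    // Picks the URL to use for a feed item: its link if present,
    // otherwise a guid that looks like a URL, otherwise the feed's URL.
    static String choose(String link, String guid, String feedUrl) {
        if (looksLikeHttpUrl(link)) {
            return link.trim();
        }
        if (looksLikeHttpUrl(guid)) {
            return guid.trim();
        }
        return feedUrl;
    }

    // Assumption: a value is usable only if it starts with http:// or https://.
    private static boolean looksLikeHttpUrl(String value) {
        if (value == null) {
            return false;
        }
        String v = value.trim().toLowerCase(Locale.ROOT);
        return v.startsWith("http://") || v.startsWith("https://");
    }

    public static void main(String[] args) {
        String feed = "http://example.com/feed.xml";
        System.out.println(choose("http://example.com/post/1", null, feed));
        System.out.println(choose(null, "http://example.com/post/2", feed));
        System.out.println(choose(null, "tag:example.com,2009:3", feed));
    }
}
```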
would someone help confirm a patch (fix incorrect encoding detection in cached.jsp)
Hi Devs,

I am using the latest nightly build of Nutch (0.9-dev) with the default configuration. I'm indexing some sites, some of which use ISO-8859-15 or GB2312 encoding. The non-ASCII characters are not displayed correctly in the cached page view served from Tomcat 6.0.18. They are all shown as invalid UTF-8 data with raw code point 0xEFBFBD.

I then dumped that segment (nutch readsegs) and found that CharEncodingForConversion is present in the Parse Metadata:

    Parse Metadata: CharEncodingForConversion=GB2312 OriginalCharEncoding=GB2312

However, cached.jsp (from nutch-2009-02-16_04-01-15.war) doesn't read it from the Parse Metadata section, but tries to read it from the Content section. After a two-line modification to cached.jsp:

    --- cached.jsp.orig 2009-02-18 12:17:25.0 -0500
    +++ cached.jsp.patched 2009-02-18 12:43:26.0 -0500
    @@ -40,6 +40,7 @@
         .getLocale().getLanguage();
       Metadata metaData = bean.getParseData(details).getContentMeta();
    +  Metadata parseMetaData = bean.getParseData(details).getParseMeta();
       String content = null;
       String contentType = (String) metaData.get(Metadata.CONTENT_TYPE);
    @@ -49,7 +50,7 @@
         // but I don't know how to emit 'byte sequence' in JSP.
         // out.getOutputStream().write(bean.getContent(details)) may work,
         // but I'm not sure.
    -    String encoding = (String) metaData.get("CharEncodingForConversion");
    +    String encoding = (String) parseMetaData.get("CharEncodingForConversion");
         if (encoding != null) {
           try {
             content = new String(bean.getContent(details), encoding);

the webpage is decoded correctly and the bad encoding problem is fixed.

I am not familiar with the Nutch development process. If some of you could help confirm this patch and commit it, that would help a lot.

Thanks,
--
Justin Yao
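The symptom in this report can be reproduced outside Nutch. The sketch below is a standalone illustration, not cached.jsp: the class `CachedContentDecoder` and its `decode` method are hypothetical, and ISO-8859-1 stands in for the encodings named in the email because every JVM is guaranteed to support it. Decoding non-UTF-8 bytes as UTF-8 yields the replacement character U+FFFD, whose UTF-8 encoding is the byte sequence EF BF BD.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical standalone illustration; not the code in cached.jsp.
class CachedContentDecoder {

    // Decodes raw page bytes with the encoding recorded at parse time.
    // Falls back to UTF-8 when no encoding is known or the name is unsupported.
    static String decode(byte[] raw, String encoding) {
        if (encoding != null) {
            try {
                return new String(raw, encoding);
            } catch (UnsupportedEncodingException e) {
                // Unknown charset name: fall through to the UTF-8 default.
            }
        }
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "caf" followed by 0xE9, which is e-acute in ISO-8859-1 but invalid on its own in UTF-8.
        byte[] raw = {0x63, 0x61, 0x66, (byte) 0xE9};
        System.out.println(decode(raw, "ISO-8859-1")); // decoded with the recorded encoding
        System.out.println(decode(raw, null));         // no encoding found: contains U+FFFD
    }
}
```

When the encoding lookup returns null, as it would if the key is read from the wrong metadata section, the fallback path is taken and the non-ASCII bytes are lost.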
Re: would someone help confirm a patch (fix incorrect encoding detection in cached.jsp)
Justin Yao wrote:
> Hi Devs,
>
> I am using the latest nightly build nutch 0.9-dev with default configuration. I'm indexing some sites, some of which use ISO-8859-15 or GB2312 encoding. I can't see the non-ASCII characters correctly at cached page view from tomcat 6.0.18. They are all shown as invalid UTF-8 data with raw code point 0xEFBFBD.
>
> Then I dumped that segment (nutch readsegs) and found the CharEncodingForConversion is present in Parse Metadata:
>
>     Parse Metadata: CharEncodingForConversion=GB2312 OriginalCharEncoding=GB2312
>
> However, the cached.jsp (from nutch-2009-02-16_04-01-15.war) doesn't read it from Parse Metadata section, but try to read it from Content section.

Your analysis seems correct, since the parse metadata is where the HTML parser stores the encoding.

> I am not familiar with Nutch development process. If some of you could help confirm this patch and commit it, that would help a lot.

You should check http://wiki.apache.org/nutch/Becoming%20A%20Nutch%20Developer for some info about developing Nutch. To proceed, you'd create a new Jira issue and attach your patch there.

Thanks.

--
Sami Siren

> Thanks,
[jira] Created: (NUTCH-693) Add configurable option for treating nofollow behaviour.
Add configurable option for treating nofollow behaviour.
--------------------------------------------------------
    Key: NUTCH-693
    URL: https://issues.apache.org/jira/browse/NUTCH-693
    Project: Nutch
    Issue Type: New Feature
    Reporter: Andrew McCall
    Priority: Minor
    Attachments: nutch.nofollow.patch

For my purposes I'd like to follow links even if they're marked nofollow. Ideally I'd like to follow them, but not pass the link juice between them. I've attached a patch that adds a configuration element, parser.html.outlinks.ignore_nofollow, which allows the parser to ignore the nofollow elements on a page.
[jira] Updated: (NUTCH-693) Add configurable option for treating nofollow behaviour.
[ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew McCall updated NUTCH-693:
-------------------------------
    Attachment: nutch.nofollow.patch

Here is the patch.
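The behaviour requested in NUTCH-693 amounts to a single switch in outlink collection. The sketch below is a rough illustration, not nutch.nofollow.patch: the classes `NofollowOutlinks` and `Link` are hypothetical, and the boolean argument stands in for the value of parser.html.outlinks.ignore_nofollow, which the real parser would presumably read from its configuration. The "do not pass link juice" part of the request is not covered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the attached nutch.nofollow.patch.
class NofollowOutlinks {

    // A link found on a page, with a flag saying whether it was marked nofollow.
    static final class Link {
        final String url;
        final boolean nofollow;

        Link(String url, boolean nofollow) {
            this.url = url;
            this.nofollow = nofollow;
        }
    }

    // Collects outlink URLs. When ignoreNofollow is false, links marked nofollow
    // are skipped; when it is true, they are kept like any other link.
    static List<String> collect(List<Link> links, boolean ignoreNofollow) {
        List<String> outlinks = new ArrayList<>();
        for (Link link : links) {
            if (link.nofollow && !ignoreNofollow) {
                continue;
            }
            outlinks.add(link.url);
        }
        return outlinks;
    }

    public static void main(String[] args) {
        List<Link> links = new ArrayList<>();
        links.add(new Link("http://example.com/a", false));
        links.add(new Link("http://example.com/b", true));
        System.out.println(collect(links, false));
        System.out.println(collect(links, true));
    }
}
```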
[jira] Commented: (NUTCH-563) Include custom fields in BasicQueryFilter
[ https://issues.apache.org/jira/browse/NUTCH-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674885#action_12674885 ]

Hudson commented on NUTCH-563:
------------------------------

Integrated in Nutch-trunk #729 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/729/])

Include custom fields in BasicQueryFilter, contributed by Julien Nioche
[jira] Commented: (NUTCH-688) Fix missing/wrong headers in source files
[ https://issues.apache.org/jira/browse/NUTCH-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674888#action_12674888 ]

Hudson commented on NUTCH-688:
------------------------------

Integrated in Nutch-trunk #729 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/729/])

add missing headers, part 2 rest
add missing headers, part 1 core
[jira] Commented: (NUTCH-691) Update jakarta poi jars to the most relevant version
[ https://issues.apache.org/jira/browse/NUTCH-691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674887#action_12674887 ]

Hudson commented on NUTCH-691:
------------------------------

Integrated in Nutch-trunk #729 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/729/])

- Update jakarta poi jars to the most relevant version, contributed by Dmitry Lihachev