[jira] Resolved: (NUTCH-687) Add RAT

2009-02-18 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-687.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

committed

 Add RAT
 ---

 Key: NUTCH-687
 URL: https://issues.apache.org/jira/browse/NUTCH-687
 Project: Nutch
  Issue Type: Improvement
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-687.patch


 Add Apache RAT so we can easily check the status of required license headers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-689) Swf parser doesn't seem to handle relative links

2009-02-18 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674520#action_12674520
 ] 

Sami Siren commented on NUTCH-689:
--

For some reason I cannot apply the patch:

patching file src/java/org/apache/nutch/parse/swf/SWFParser.java
Hunk #2 FAILED at 94.



 Swf parser doesn't seem to handle relative links
 

 Key: NUTCH-689
 URL: https://issues.apache.org/jira/browse/NUTCH-689
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Peter Sparks
 Attachments: parse-swf.patch


 I was using the SWF parser to extract links from Flash files on the site 
 www.arnoldworldwide.com and got a malformed URL exception because an outlink 
 was a relative link that was not being resolved. I was able to fix it by 
 resolving all links as they are added to the list of outlinks.
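
 The fix described above can be sketched roughly as follows. This is only an 
 illustration of resolving links against the page URL with java.net.URL, not 
 the code from parse-swf.patch; the class name OutlinkResolver, the method 
 resolveAll and the example.com URLs are all made up for the example.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not the actual SWFParser code.
final class OutlinkResolver {

    /** Resolves every raw link against the URL of the page it was found on. */
    static List<String> resolveAll(String pageUrl, List<String> rawLinks) {
        URL base;
        try {
            base = new URL(pageUrl);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("page URL must be absolute: " + pageUrl, e);
        }
        List<String> resolved = new ArrayList<>();
        for (String raw : rawLinks) {
            try {
                // new URL(context, spec) keeps absolute specs unchanged and
                // resolves relative ones against the context URL.
                resolved.add(new URL(base, raw).toString());
            } catch (MalformedURLException e) {
                // Skip links that still cannot be resolved (for example an
                // unknown protocol) instead of failing the whole parse.
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        List<String> links = resolveAll(
                "http://www.example.com/flash/intro.swf",
                Arrays.asList("about/team.html", "/contact", "http://other.example.org/x"));
        links.forEach(System.out::println);
    }
}
```

 Skipping unresolvable links rather than rethrowing is a design choice made 
 for this sketch; the patch itself may handle that case differently.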




[jira] Resolved: (NUTCH-591) StringIndexOutOfBoundsException when extracting text from a Word document.

2009-02-18 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-591.
--

Resolution: Duplicate

duplicate of NUTCH-691

 StringIndexOutOfBoundsException when extracting text from a Word document.
 --

 Key: NUTCH-591
 URL: https://issues.apache.org/jira/browse/NUTCH-591
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.9.0
 Environment: linux
 redhat as4u4 x86
 kernel 2.6.9
Reporter: frank ling

 see 
 http://issues.apache.org/bugzilla/show_bug.cgi?id=41076+




[jira] Resolved: (NUTCH-688) Fix missing/wrong headers in source files

2009-02-18 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-688.
--

Resolution: Fixed

I think we are done with this.

 Fix missing/wrong headers in source files
 -

 Key: NUTCH-688
 URL: https://issues.apache.org/jira/browse/NUTCH-688
 Project: Nutch
  Issue Type: Bug
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Blocker
 Fix For: 1.0.0







[jira] Created: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2009-02-18 Thread julien nioche (JIRA)
AlreadyBeingCreatedException with Hadoop 0.19
-

 Key: NUTCH-692
 URL: https://issues.apache.org/jira/browse/NUTCH-692
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: julien nioche


I have been using the SVN version of Nutch on an EC2 cluster and got several 
AlreadyBeingCreatedExceptions during the reduce phase of a parse. For some 
reason one of my tasks crashed, and then I ran into this 
AlreadyBeingCreatedException when other nodes tried to pick it up.

There was recently a discussion on the Hadoop user list about similar issues 
with Hadoop 0.19 (see http://markmail.org/search/after+upgrade+to+0%2E19%2E0). 
I have not tried 0.18.2 yet, but will do so if the problems persist with 0.19.

I was wondering whether anyone else has experienced the same problem. Do you 
think 0.19 is stable enough to use for Nutch 1.0?
I will be running a crawl on a very large cluster in the next couple of weeks 
and will confirm whether this issue persists.

J.  






[jira] Resolved: (NUTCH-691) Update jakarta poi jars to the most relevant version

2009-02-18 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-691.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

committed, Thanks Dmitry

 Update jakarta poi jars to the most relevant version
 

 Key: NUTCH-691
 URL: https://issues.apache.org/jira/browse/NUTCH-691
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Dmitry Lihachev
 Fix For: 1.0.0

 Attachments: NUTCH-691-v1-poi.patch, NUTCH-691-v1-test.patch

   Original Estimate: 0.25h
  Remaining Estimate: 0.25h

 Updating the Jakarta POI jars to the most relevant version closes bug NUTCH-591.




[jira] Resolved: (NUTCH-563) Include custom fields in BasicQueryFilter

2009-02-18 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-563.
--

Resolution: Fixed
  Assignee: Sami Siren

committed, thanks

 Include custom fields in BasicQueryFilter
 -

 Key: NUTCH-563
 URL: https://issues.apache.org/jira/browse/NUTCH-563
 Project: Nutch
  Issue Type: New Feature
  Components: searcher
Reporter: julien nioche
Assignee: Sami Siren
Priority: Minor
 Fix For: 1.0.0

 Attachments: diff.BasicQueryFilter.dynamicFields.txt, NUTCH-563.patch


 This patch allows additional fields to be included in the BasicQueryFilter 
 by specifying runtime parameters. Any parameter matching the regular 
 expression (query\\.basic\\.(.+)\\.boost) will be added to the list of 
 fields used by the BQF, and the specified float value will be used as the 
 boost.
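
 As a rough illustration of how such parameters could be picked up, the 
 sketch below scans a map of configuration properties with the regular 
 expression quoted above. It is not the code from NUTCH-563.patch; the class 
 BoostParamScanner, the plain Map input and the field names "title" and 
 "author" are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the actual BasicQueryFilter code.
final class BoostParamScanner {

    // Same expression as in the issue description, written as a Java literal.
    private static final Pattern BOOST_KEY =
            Pattern.compile("query\\.basic\\.(.+)\\.boost");

    /** Returns field name -> boost for every property whose key matches. */
    static Map<String, Float> collectBoosts(Map<String, String> params) {
        Map<String, Float> boosts = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            Matcher m = BOOST_KEY.matcher(entry.getKey());
            if (m.matches()) {
                // Group 1 is the field name; the value is parsed as a float.
                // A non-numeric value raises NumberFormatException here.
                boosts.put(m.group(1), Float.parseFloat(entry.getValue()));
            }
        }
        return boosts;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("query.basic.title.boost", "1.5");
        params.put("query.basic.author.boost", "2.0");
        params.put("http.agent.name", "example-crawler");
        System.out.println(collectBoosts(params));
    }
}
```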




[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2009-02-18 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674603#action_12674603
 ] 

Sami Siren commented on NUTCH-692:
--

Have you seen this outside of EC2? Only in a multinode setup?

 AlreadyBeingCreatedException with Hadoop 0.19
 -

 Key: NUTCH-692
 URL: https://issues.apache.org/jira/browse/NUTCH-692
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: julien nioche





[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2009-02-18 Thread julien nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674607#action_12674607
 ] 

julien nioche commented on NUTCH-692:
-

I have seen this only in a multinode setup, and only on EC2.

 AlreadyBeingCreatedException with Hadoop 0.19
 -

 Key: NUTCH-692
 URL: https://issues.apache.org/jira/browse/NUTCH-692
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: julien nioche





[jira] Updated: (NUTCH-583) FeedParser empty links for items

2009-02-18 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-583:
-

Fix Version/s: (was: 1.0.0)
   1.1

pushing this to 1.1

 FeedParser empty links for items
 

 Key: NUTCH-583
 URL: https://issues.apache.org/jira/browse/NUTCH-583
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 1.1


 The FeedParser in the feed plugin simply discards an item if it has no link 
 element. However, RSS 2.0 does not require a link element for each item. 
 Moreover, the link is sometimes given in the guid element, which is a 
 globally unique identifier for the item. I think we can search for the 
 item's URL first; if it is still not found, we can use the feed's URL, 
 merging all the parse texts into one Parse object. 
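
 The lookup order proposed above (link, then guid, then the feed's own URL) 
 might look something like the sketch below. It is not FeedParser code: the 
 class FeedItemUrlChooser and its methods are hypothetical, and treating a 
 guid as a URL only when it starts with http:// or https:// is an assumption 
 made for the example. Merging the parse texts is not shown.

```java
// Hypothetical helper for illustration; not the actual FeedParser code.
final class FeedItemUrlChooser {

    /** Picks a URL for a feed item: link first, then guid, then the feed URL. */
    static String chooseUrl(String link, String guid, String feedUrl) {
        if (hasText(link)) {
            return link.trim();
        }
        // Assumption: a guid only counts as a link if it looks like an HTTP URL.
        if (hasText(guid) && looksLikeHttpUrl(guid.trim())) {
            return guid.trim();
        }
        // Last resort: attribute the item to the feed itself.
        return feedUrl;
    }

    private static boolean hasText(String s) {
        return s != null && !s.trim().isEmpty();
    }

    private static boolean looksLikeHttpUrl(String s) {
        String lower = s.toLowerCase();
        return lower.startsWith("http://") || lower.startsWith("https://");
    }

    public static void main(String[] args) {
        String feed = "http://example.com/feed.xml";
        System.out.println(chooseUrl("http://example.com/a", null, feed));
        System.out.println(chooseUrl(null, "http://example.com/b", feed));
        System.out.println(chooseUrl(null, "tag:example.com,2009:item-3", feed));
    }
}
```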




would someone help confirm a patch (fix incorrect encoding detection in cached.jsp)

2009-02-18 Thread Justin Yao

Hi Devs,

I am using the latest nightly build of Nutch (0.9-dev) with the default 
configuration. I'm indexing some sites, some of which use ISO-8859-15 or 
GB2312 encoding. Non-ASCII characters are not displayed correctly in the 
cached page view served from Tomcat 6.0.18. They all appear as the Unicode 
replacement character (U+FFFD, UTF-8 bytes 0xEFBFBD).


Then I dumped that segment (nutch readsegs) and found that 
CharEncodingForConversion is present in the Parse Metadata:


Parse Metadata: CharEncodingForConversion=GB2312 OriginalCharEncoding=GB2312

However, cached.jsp (from nutch-2009-02-16_04-01-15.war) does not read it 
from the Parse Metadata section, but tries to read it from the Content 
section instead.


After a two-line modification to cached.jsp:

--- cached.jsp.orig 2009-02-18 12:17:25.0 -0500
+++ cached.jsp.patched  2009-02-18 12:43:26.0 -0500
@@ -40,6 +40,7 @@
 .getLocale().getLanguage();

   Metadata metaData = bean.getParseData(details).getContentMeta();
+  Metadata parseMetaData = bean.getParseData(details).getParseMeta();

   String content = null;
   String contentType = (String) metaData.get(Metadata.CONTENT_TYPE);
@@ -49,7 +50,7 @@
 // but I don't know how to emit 'byte sequence' in JSP.
 // out.getOutputStream().write(bean.getContent(details)) may work,
 // but I'm not sure.
-String encoding = (String) metaData.get(CharEncodingForConversion);
+String encoding = (String) parseMetaData.get(CharEncodingForConversion);

 if (encoding != null) {
   try {
 content = new String(bean.getContent(details), encoding);

The web page is decoded correctly and the encoding problem is fixed.
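
To show why the source of the encoding name matters, here is a small 
standalone sketch, separate from cached.jsp. It decodes raw bytes with a 
declared charset and shows what happens when the same bytes are read as 
UTF-8. The class CachedContentDecoder is hypothetical, and falling back to 
windows-1252 when no usable encoding is given is an assumption made for the 
example, not something shown in the patch above.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper for illustration; not part of cached.jsp.
final class CachedContentDecoder {

    // Assumed fallback for pages with no usable declared encoding.
    static final Charset FALLBACK = Charset.forName("windows-1252");

    /** Decodes raw page bytes with the declared encoding, or the fallback. */
    static String decode(byte[] raw, String declaredEncoding) {
        Charset charset = FALLBACK;
        if (declaredEncoding != null) {
            try {
                charset = Charset.forName(declaredEncoding);
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                // Unknown or malformed charset name: keep the fallback.
            }
        }
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        byte[] euro = {(byte) 0xA4}; // the euro sign in ISO-8859-15

        // With the encoding from the parse metadata the byte decodes correctly.
        System.out.println(decode(euro, "ISO-8859-15").equals("\u20AC"));

        // Read as UTF-8, the same byte is malformed and becomes U+FFFD.
        System.out.println(new String(euro, StandardCharsets.UTF_8).equals("\uFFFD"));
    }
}
```

If the encoding name is looked up in the wrong metadata section it comes 
back as null, so a decoder like this one never gets the chance to use the 
page's real charset.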


I am not familiar with the Nutch development process. If one of you could 
help confirm this patch and commit it, that would help a lot.


Thanks,
--
Justin Yao




Re: would someone help confirm a patch (fix incorrect encoding detection in cached.jsp)

2009-02-18 Thread Sami Siren

Justin Yao wrote:

Hi Devs,

I am using the latest nightly build nutch 0.9-dev with default 
configuration. I'm indexing some sites, some of which use ISO-8859-15 
or GB2312 encoding. I can't see the non-ASCII characters correctly at 
cached page view from tomcat 6.0.18. They are all shown as invalid 
UTF-8 data with raw code point 0xEFBFBD.


Then I dumped that segment (nutch readsegs) and found the 
CharEncodingForConversion is present in Parse Metadata:


Parse Metadata: CharEncodingForConversion=GB2312 
OriginalCharEncoding=GB2312


However, the cached.jsp (from nutch-2009-02-16_04-01-15.war) doesn't 
read it from Parse Metadata section, but try to read it from 
Content section.

Your analysis seems correct, since the parse metadata is where the 
HTML parser stores the encoding.


I am not familiar with Nutch development process. If some of you could 
help confirm this patch and commit it, that would help a lot.

You should check 
http://wiki.apache.org/nutch/Becoming%20A%20Nutch%20Developer for some 
information about developing Nutch. To proceed, create a new Jira issue 
and attach your patch there. Thanks.


--
Sami Siren



Thanks,




[jira] Created: (NUTCH-693) Add configurable option for treating nofollow behaviour.

2009-02-18 Thread Andrew McCall (JIRA)
Add configurable option for treating nofollow behaviour.


 Key: NUTCH-693
 URL: https://issues.apache.org/jira/browse/NUTCH-693
 Project: Nutch
  Issue Type: New Feature
Reporter: Andrew McCall
Priority: Minor
 Attachments: nutch.nofollow.patch

For my purposes I'd like to follow links even if they're marked nofollow. 
Ideally I'd like to follow them, but not pass link juice between them. 

I've attached a patch that adds a configuration element, 
parser.html.outlinks.ignore_nofollow, which allows the parser to ignore 
nofollow directives on a page. 
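
The behaviour of such a switch could be sketched as below. This is not the 
code from nutch.nofollow.patch: the class NofollowFilter, its Link type and 
the boolean parameter are hypothetical, and the sketch only looks at the rel 
attribute of individual links. How the real option is read from the 
configuration is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not the actual HTML parser code.
final class NofollowFilter {

    /** A link as found in the page: target URL plus its rel attribute (may be null). */
    static final class Link {
        final String url;
        final String rel;

        Link(String url, String rel) {
            this.url = url;
            this.rel = rel;
        }
    }

    /** Returns the URLs to follow; nofollow links are kept only if ignoreNofollow is true. */
    static List<String> outlinks(List<Link> links, boolean ignoreNofollow) {
        List<String> result = new ArrayList<>();
        for (Link link : links) {
            if (ignoreNofollow || !isNofollow(link.rel)) {
                result.add(link.url);
            }
        }
        return result;
    }

    // rel is a whitespace-separated token list, compared case-insensitively.
    private static boolean isNofollow(String rel) {
        if (rel == null) {
            return false;
        }
        for (String token : rel.trim().split("\\s+")) {
            if (token.equalsIgnoreCase("nofollow")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Link> links = Arrays.asList(
                new Link("http://example.com/a", null),
                new Link("http://example.com/b", "nofollow"));
        System.out.println(outlinks(links, false));
        System.out.println(outlinks(links, true));
    }
}
```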




[jira] Updated: (NUTCH-693) Add configurable option for treating nofollow behaviour.

2009-02-18 Thread Andrew McCall (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew McCall updated NUTCH-693:


Attachment: nutch.nofollow.patch

Here is the patch.

 Add configurable option for treating nofollow behaviour.
 

 Key: NUTCH-693
 URL: https://issues.apache.org/jira/browse/NUTCH-693
 Project: Nutch
  Issue Type: New Feature
Reporter: Andrew McCall
Priority: Minor
 Attachments: nutch.nofollow.patch






[jira] Commented: (NUTCH-563) Include custom fields in BasicQueryFilter

2009-02-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674885#action_12674885
 ] 

Hudson commented on NUTCH-563:
--

Integrated in Nutch-trunk #729 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/729/])
 Include custom fields in BasicQueryFilter, contributed by Julien Nioche


 Include custom fields in BasicQueryFilter
 -

 Key: NUTCH-563
 URL: https://issues.apache.org/jira/browse/NUTCH-563
 Project: Nutch
  Issue Type: New Feature
  Components: searcher
Reporter: julien nioche
Assignee: Sami Siren
Priority: Minor
 Fix For: 1.0.0

 Attachments: diff.BasicQueryFilter.dynamicFields.txt, NUTCH-563.patch






[jira] Commented: (NUTCH-688) Fix missing/wrong headers in source files

2009-02-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674888#action_12674888
 ] 

Hudson commented on NUTCH-688:
--

Integrated in Nutch-trunk #729 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/729/])
 add missing headers, part 2 rest
 add missing headers, part 1 core


 Fix missing/wrong headers in source files
 -

 Key: NUTCH-688
 URL: https://issues.apache.org/jira/browse/NUTCH-688
 Project: Nutch
  Issue Type: Bug
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Blocker
 Fix For: 1.0.0







[jira] Commented: (NUTCH-691) Update jakarta poi jars to the most relevant version

2009-02-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674887#action_12674887
 ] 

Hudson commented on NUTCH-691:
--

Integrated in Nutch-trunk #729 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/729/])
 - Update jakarta poi jars to the most relevant version, contributed by 
Dmitry Lihachev


 Update jakarta poi jars to the most relevant version
 

 Key: NUTCH-691
 URL: https://issues.apache.org/jira/browse/NUTCH-691
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Dmitry Lihachev
 Fix For: 1.0.0

 Attachments: NUTCH-691-v1-poi.patch, NUTCH-691-v1-test.patch

   Original Estimate: 0.25h
  Remaining Estimate: 0.25h

