[jira] Created: (NUTCH-742) Checksum Error

2009-06-20 Thread mawanqiang (JIRA)
Checksum Error 
---

 Key: NUTCH-742
 URL: https://issues.apache.org/jira/browse/NUTCH-742
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.0.0
 Environment: linux ubuntu8.0.4 64bit 
10datanode 4G of memory per node 
Reporter: mawanqiang


An error occurred while using Nutch 1.0 to build an index over approximately 1 million records.
The error is:
java.lang.RuntimeException: problem advancing post rec#6758513
at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:883)
at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:237)
at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:233)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:79)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
at org.apache.hadoop.mapred.Child.main(Child.java:158)
Caused by: org.apache.hadoop.fs.ChecksumException: Checksum Error
at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:153)
at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:90)
at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:301)
at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:331)
at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:315)
at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:377)
at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:174)
at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:277)
at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:297)
at org.apache.hadoop.mapred.Task$ValuesIterator.readNextKey(Task.java:922)
at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:881)
... 6 more


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[Nutch Wiki] Update of AddingNewLocalization by Mike Dawson

2009-06-20 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by Mike Dawson:
http://wiki.apache.org/nutch/AddingNewLocalization

New page:
===Adding a New Language to Nutch===

If you want to have Nutch in your language, hopefully the steps below help.  I 
just Googled around.

* Unzip Nutch 1.0 to any folder

* Translate the .properties files that you find in src/web/locale/org/nutch/jsp:
** For each file, make sure that you have your own version ending in 
_langcode.properties, e.g. _fa.properties.  By the way, OmegaT is an excellent 
translation-memory program to help with standardizing terms, etc.

* Make a folder src/web/include/langcode with a file header.xml - again, this 
needs to be translated.
* Make a folder src/web/pages/langcode and copy the .xml files from the 
English folder and then translate them.  In search.xml look for the line:
<pre>
<input type="hidden" name="lang" value="fa"/>
</pre>
Change the value of lang to match the language you are adding (e.g. fa).

* Add your language to src/web/include/footer.html

* In the Nutch base directory run ant

<pre>
ant generate-docs
</pre>

* Work in progress - I now find that when doing the search it still comes back 
in English... for some reason it seems like the JSP loads the resource bundle 
according to the language passed by the browser headers, not according to the 
lang parameter...
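
The _langcode suffix matters because Java's ResourceBundle lookup resolves a bundle by appending the locale's language code to the base name before falling back to the default file. A rough sketch of that naming convention (a simplified illustration only; candidateFile is a hypothetical helper, not part of Nutch):

```java
public class BundleNaming {
    // Compute the properties file a ResourceBundle lookup tries first for a
    // given base name and language code (simplified: ignores country/variant
    // and classloader details of the real java.util.ResourceBundle search).
    static String candidateFile(String baseName, String langCode) {
        String path = baseName.replace('.', '/');
        return langCode.isEmpty() ? path + ".properties"
                                  : path + "_" + langCode + ".properties";
    }

    public static void main(String[] args) {
        // For locale "fa" the localized file is tried before the default one.
        System.out.println(candidateFile("org.nutch.jsp.search", "fa"));
        System.out.println(candidateFile("org.nutch.jsp.search", ""));
    }
}
```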


[jira] Commented: (NUTCH-731) Redirection of robots.txt in RobotRulesParser

2009-06-20 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722214#action_12722214
 ] 

Julien Nioche commented on NUTCH-731:
-

Here is an example that the patch helps to address:

curl http://wizardhq.com/robots.txt

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a 
href="http://www.wizardhq.com/robots.txt">here</a>.</p>
</body></html>

Again, the ratio of robots_denied statuses went up after I applied the 
patch, which means that such cases are not so rare.
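
The one-level redirect handling the patch adds can be sketched as pure logic: given the status and Location header of the robots.txt response, compute the single follow-up URL. This is a hedged illustration, not the actual NUTCH-731 patch; followOnce is a hypothetical helper.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class RobotsRedirect {
    // Resolve at most one robots.txt redirect: given the original URL, the
    // HTTP status, and the Location header, return the URL to fetch next,
    // or null if the response is not a redirect.
    static URL followOnce(URL original, int status, String location)
            throws MalformedURLException {
        if ((status == 301 || status == 302) && location != null) {
            // new URL(context, spec) also resolves relative Location headers
            return new URL(original, location);
        }
        return null;
    }

    public static void main(String[] args) throws MalformedURLException {
        URL orig = new URL("http://wizardhq.com/robots.txt");
        // The wizardhq.com case above: 301 to the www host.
        System.out.println(followOnce(orig, 301, "http://www.wizardhq.com/robots.txt"));
    }
}
```

A second redirect would simply not be followed, keeping the crawler from chasing redirect loops.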

 Redirection of robots.txt in RobotRulesParser
 -

 Key: NUTCH-731
 URL: https://issues.apache.org/jira/browse/NUTCH-731
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Julien Nioche
 Attachments: NUTCH-731.patch


 The attached patch allows following one level of redirection for robots.txt 
 files. A similar issue was mentioned in NUTCH-124 and was marked as fixed a 
 long time ago, but the problem remained, at least when using Fetcher2. 
 Mathijs Homminga pointed out the problem in a mail to the nutch-dev list in 
 March.
 I have been using this patch for a while now on a large cluster and noticed 
 that the ratio of robots_denied per fetchlist went up, meaning that we are 
 now picking up restrictions we would not have had before (and getting fewer 
 complaints from webmasters at the same time).




[Nutch Wiki] Update of AddingNewLocalization by Mike Dawson

2009-06-20 Thread Apache Wiki

The following page has been changed by Mike Dawson:
http://wiki.apache.org/nutch/AddingNewLocalization

--
  
  If you want to have Nutch in your language - hopefully the below helps.  I 
just Googled around.
  
- * Unzip Nutch 1.0 to any folder
+  * Unzip Nutch 1.0 to any folder
  
- * Translate the .properties files that you find in 
src/web/locale/org/nutch/jsp :
+  * Translate the .properties files that you find in 
src/web/locale/org/nutch/jsp :
- ** For each file make sure that you have your own version ending in 
_langcode.properties e.g. _fa.properties .  Btw OmegaT is an excellent 
Translation memory program to help with standardizing terms etc.
+  * For each file make sure that you have your own version ending in 
_langcode.properties e.g. _fa.properties .  Btw OmegaT is an excellent 
Translation memory program to help with standardizing terms etc.
  
- * Make a folder src/web/include/langcode with a file header.xml - again 
this needs translated.
+  * Make a folder src/web/include/langcode with a file header.xml - again 
this needs translated.
- * Make a folder src/web/pages/langcode and copy the .xml files from the 
English folder and then translate them.  In search.xml look for the line:
+  * Make a folder src/web/pages/langcode and copy the .xml files from the 
English folder and then translate them.  In search.xml look for the line:
- <pre>
+ 
+ {{{
  <input type="hidden" name="lang" value="fa"/>
- </pre>
+ }}}
  Change the value of lang to match the language you are adding (e.g. fa)
  
- * Add your language to src/web/include/footer.html
+  * Add your language to src/web/include/footer.html
  
- * In the Nutch base directory run ant
+  * In the Nutch base directory run ant
  
- <pre>
+ {{{
  ant generate-docs
- </pre>
+ }}}
  
- * Work in progress - I now find that when doing the search it still comes 
back in English... for some reason it seems like the JSP loads the resource 
bundle according to the language passed by the browser headers, not according 
to the lang parameter...
+  * Work in progress - I now find that when doing the search it still comes 
back in English... for some reason it seems like the JSP loads the resource 
bundle according to the language passed by the browser headers, not according 
to the lang parameter...
  


[jira] Commented: (NUTCH-731) Redirection of robots.txt in RobotRulesParser

2009-06-20 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722242#action_12722242
 ] 

Ken Krugler commented on NUTCH-731:
---

This is definitely an issue - I've been pinging various domains while testing 
robots.txt handling in bixo, and many of them will redirect from 
http://domain/robots.txt to http://www.domain/robots.txt.

 Redirection of robots.txt in RobotRulesParser
 -

 Key: NUTCH-731
 URL: https://issues.apache.org/jira/browse/NUTCH-731
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Julien Nioche
 Attachments: NUTCH-731.patch


 The attached patch allows following one level of redirection for robots.txt 
 files. A similar issue was mentioned in NUTCH-124 and was marked as fixed a 
 long time ago, but the problem remained, at least when using Fetcher2. 
 Mathijs Homminga pointed out the problem in a mail to the nutch-dev list in 
 March.
 I have been using this patch for a while now on a large cluster and noticed 
 that the ratio of robots_denied per fetchlist went up, meaning that we are 
 now picking up restrictions we would not have had before (and getting fewer 
 complaints from webmasters at the same time).




[Nutch Wiki] Update of AddingNewLocalization by Mike Dawson

2009-06-20 Thread Apache Wiki

The following page has been changed by Mike Dawson:
http://wiki.apache.org/nutch/AddingNewLocalization

--
- ===Adding a New Language to Nutch===
+ = Adding a New Language to Nutch =
  
- If you want to have Nutch in your language - hopefully the below helps.  I 
just Googled around.
+ If you want to have Nutch in your language - hopefully the below helps.  I 
have been Googling around and digging in some source code...
  
   * Unzip Nutch 1.0 to any folder
  
@@ -25, +25 @@

  ant generate-docs
  }}}
  
-  * Work in progress - I now find that when doing the search it still comes 
back in English... for some reason it seems like the JSP loads the resource 
bundle according to the language passed by the browser headers, not according 
to the lang parameter...
+  * It seems like some changes are needed to search.jsp to make it behave as 
users would expect.  The original appears to expect the language of the browser 
to take precedence over the language selected...  After out.flush() at about 
line 160 add the following in src/web/jsp/search.jsp:
  
+ {{{
+ 
+   //see what locale we should use
+   Locale ourLocale = null;
+   if(!queryLang.equals("")) {
+   ourLocale = new Locale(queryLang);
+   language = new String(queryLang);
+   }else {
+   ourLocale = request.getLocale();
+   }
+ 
+ }}}
+ 
+ Then change the line:
+ 
+ {{{
+ <i18n:bundle baseName="org.nutch.jsp.search"/>
+ }}}
+ 
+ to:
+ 
+ {{{
+ <i18n:bundle baseName="org.nutch.jsp.search" locale="<%=ourLocale%>"/>
+ }}}
+ 
+ * Now we are ready to build it:
+ 
+ {{{
+ ant war
+ }}}
+ 
+ * Copy the .war file to your servlet container's webapp directory.  If 
everything went well, you will see your language code at the bottom; select 
it, and the search interface will come back with the localisation you just 
added.
+ 
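The locale-selection change above boils down to a simple precedence rule, sketched here outside the JSP. This is an illustration under the assumption that queryLang holds the request's lang parameter; pickLocale is a hypothetical helper.

```java
import java.util.Locale;

public class LocaleChoice {
    // Prefer the explicit lang parameter over the browser's Accept-Language
    // locale, mirroring the search.jsp change described above.
    static Locale pickLocale(String queryLang, Locale browserLocale) {
        return queryLang.equals("") ? browserLocale : new Locale(queryLang);
    }

    public static void main(String[] args) {
        System.out.println(pickLocale("fa", Locale.ENGLISH)); // fa
        System.out.println(pickLocale("", Locale.ENGLISH));   // en
    }
}
```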


[jira] Updated: (NUTCH-731) Redirection of robots.txt in RobotRulesParser

2009-06-20 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated NUTCH-731:
---

Fix Version/s: 1.1
 Assignee: Otis Gospodnetic

 Redirection of robots.txt in RobotRulesParser
 -

 Key: NUTCH-731
 URL: https://issues.apache.org/jira/browse/NUTCH-731
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Otis Gospodnetic
 Fix For: 1.1

 Attachments: NUTCH-731.patch


 The attached patch allows following one level of redirection for robots.txt 
 files. A similar issue was mentioned in NUTCH-124 and was marked as fixed a 
 long time ago, but the problem remained, at least when using Fetcher2. 
 Mathijs Homminga pointed out the problem in a mail to the nutch-dev list in 
 March.
 I have been using this patch for a while now on a large cluster and noticed 
 that the ratio of robots_denied per fetchlist went up, meaning that we are 
 now picking up restrictions we would not have had before (and getting fewer 
 complaints from webmasters at the same time).




[jira] Resolved: (NUTCH-742) Checksum Error

2009-06-20 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved NUTCH-742.


Resolution: Incomplete

Could you please post more detailed information to the nutch-user mailing 
list first?

 Checksum Error 
 ---

 Key: NUTCH-742
 URL: https://issues.apache.org/jira/browse/NUTCH-742
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.0.0
 Environment: linux ubuntu8.0.4 64bit 
 10datanode 4G of memory per node 
Reporter: mawanqiang

 An error occurred while using Nutch 1.0 to build an index over approximately 
 1 million records.
 The error is:
 java.lang.RuntimeException: problem advancing post rec#6758513
 at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:883)
 at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:237)
 at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:233)
 at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:79)
 at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
 at org.apache.hadoop.mapred.Child.main(Child.java:158)
 Caused by: org.apache.hadoop.fs.ChecksumException: Checksum Error
 at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:153)
 at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:90)
 at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:301)
 at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:331)
 at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:315)
 at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:377)
 at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:174)
 at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:277)
 at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:297)
 at org.apache.hadoop.mapred.Task$ValuesIterator.readNextKey(Task.java:922)
 at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:881)
 ... 6 more




Build failed in Hudson: Nutch-trunk #851

2009-06-20 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/851/

--
[...truncated 4676 lines...]

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex
 
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex
 

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex
 

init:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix
 
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes
 
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/test
 

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: urlfilter-suffix
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes
 
[javac] Note: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java
  uses unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/urlfilter-suffix.jar
 

deps-test:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix
 
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix
 

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix
 

init:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator
 
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes
 
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/test
 

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: urlfilter-validator
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes
 

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/urlfilter-validator.jar
 

deps-test:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator
 
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator
 

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator
 

init:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic
 
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes
 
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test
 

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: urlnormalizer-basic
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes
 

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/urlnormalizer-basic.jar
 

deps-test:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic
 
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic
 

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic
 

init:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass
 
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/classes
 
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/test
 

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: urlnormalizer-pass
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/classes
 

jar:
  [jar] Building jar: