[jira] Created: (NUTCH-742) Checksum Error
Checksum Error
--------------

Key: NUTCH-742
URL: https://issues.apache.org/jira/browse/NUTCH-742
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.0.0
Environment: linux ubuntu8.0.4 64bit, 10 datanodes, 4G of memory per node
Reporter: mawanqiang

Nutch 1.0 fails with an error when creating an index from approximately 1 million data items. The error is:

java.lang.RuntimeException: problem advancing post rec#6758513
    at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:883)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:237)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:233)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:79)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)
Caused by: org.apache.hadoop.fs.ChecksumException: Checksum Error
    at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:153)
    at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:90)
    at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:301)
    at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:331)
    at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:315)
    at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:377)
    at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:174)
    at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:277)
    at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:297)
    at org.apache.hadoop.mapred.Task$ValuesIterator.readNextKey(Task.java:922)
    at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:881)
    ... 6 more

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
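For readers unfamiliar with what a ChecksumException signals, here is a minimal sketch of checksum verification in general. It is not Hadoop's IFileInputStream code; the class ChecksumSketch and its methods are hypothetical, and CRC32 is chosen here only as a common example of a checksum. The idea is that a checksum stored when data is written is recomputed when the data is read back, and a mismatch means the bytes changed in between (for example on disk, in memory, or in transit).

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical illustration only; not taken from Hadoop or Nutch.
class ChecksumSketch {

    /** Computes a CRC32 checksum over the whole byte array. */
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** Returns true if the data still matches the checksum stored at write time. */
    static boolean verify(byte[] data, long expected) {
        return checksum(data) == expected;
    }

    public static void main(String[] args) {
        byte[] record = "key\tvalue".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(record);   // computed when the data is written

        record[0] ^= 0x01;                // simulate a single flipped bit
        System.out.println("intact after corruption? " + verify(record, stored));
    }
}
```

Because the check only detects that the bytes differ, it cannot say where the corruption came from, which may be why more detail about the cluster is needed to diagnose a report like this one.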
[Nutch Wiki] Update of AddingNewLocalization by Mike Dawson
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The following page has been changed by Mike Dawson:
http://wiki.apache.org/nutch/AddingNewLocalization

New page:
===Adding a New Language to Nutch===

If you want to have Nutch in your language - hopefully the below helps. I just Googled around.

 * Unzip Nutch 1.0 to any folder
 * Translate the .properties files that you find in src/web/locale/org/nutch/jsp :
 ** For each file make sure that you have your own version ending in _langcode.properties e.g. _fa.properties . Btw OmegaT is an excellent Translation memory program to help with standardizing terms etc.
 * Make a folder src/web/include/langcode with a file header.xml - again this needs translated.
 * Make a folder src/web/pages/langcode and copy the .xml files from the English folder and then translate them. In search.xml look for the line:

<pre>
<input type="hidden" name="lang" value="fa"/>
</pre>

Change the value of lang to match the language you are adding (e.g. fa)

 * Add your language to src/web/include/footer.html
 * In the Nutch base directory run ant

<pre>
ant generate-docs
</pre>

 * Work in progress - I now find that when doing the search it still comes back in English... for some reason it seems like the JSP loads the resource bundle according to the language passed by the browser headers, not according to the lang parameter...
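The _langcode.properties naming above follows the standard Java resource bundle convention. As a hedged sketch, the JDK can be asked which file name it would look for given a base name and a language code. The class BundleNames and the method resourceFor are hypothetical, and the base name org.nutch.jsp.search is used only as an example of a bundle under src/web/locale/org/nutch/jsp.

```java
import java.util.Locale;
import java.util.ResourceBundle;

// Hypothetical helper for illustration; not part of Nutch.
class BundleNames {

    /** Returns the classpath resource the JDK would look up for this base name and language. */
    static String resourceFor(String baseName, String langCode) {
        ResourceBundle.Control control =
            ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_PROPERTIES);
        Locale locale = Locale.forLanguageTag(langCode);
        // Appends the locale suffix, e.g. "_fa", to the base name.
        String bundleName = control.toBundleName(baseName, locale);
        // Converts dots to slashes and appends the ".properties" suffix.
        return control.toResourceName(bundleName, "properties");
    }

    public static void main(String[] args) {
        System.out.println(resourceFor("org.nutch.jsp.search", "fa"));
    }
}
```

This can be a quick way to check that a translated file has the name the JVM expects before rebuilding.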
[jira] Commented: (NUTCH-731) Redirection of robots.txt in RobotRulesParser
[ https://issues.apache.org/jira/browse/NUTCH-731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722214#action_12722214 ]

Julien Nioche commented on NUTCH-731:
-------------------------------------

Here is an example of a case that the patch helps address:

curl http://wizardhq.com/robots.txt

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <html><head>
    <title>301 Moved Permanently</title>
    </head><body>
    <h1>Moved Permanently</h1>
    <p>The document has moved <a href="http://www.wizardhq.com/robots.txt">here</a>.</p>
    </body></html>

Again, the ratio of robots_denied statuses started going up after I wrote the patch, which means that such cases are not so rare.

Redirection of robots.txt in RobotRulesParser
---------------------------------------------

Key: NUTCH-731
URL: https://issues.apache.org/jira/browse/NUTCH-731
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 1.0.0
Reporter: Julien Nioche
Attachments: NUTCH-731.patch

The attached patch allows following one level of redirection for robots.txt files. A similar issue was mentioned in NUTCH-124 and was marked as fixed a long time ago, but the problem remained, at least when using Fetcher2. Mathijs Homminga pointed to the problem in a mail to the nutch-dev list in March.

I have been using this patch for a while now on a large cluster and noticed that the ratio of robots_denied per fetchlist went up, meaning that at least we are now getting restrictions we would not have had before (and getting fewer complaints from webmasters at the same time).
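To illustrate the behaviour described in the issue, here is a minimal sketch of following exactly one level of redirection when fetching robots.txt. It is not the content of NUTCH-731.patch and does not use Nutch's RobotRulesParser API. The class RobotsRedirect, the Response type and the Http interface are hypothetical stand-ins, and treating any non-200 final response as "no rules available" is an assumption made for the sketch.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch; none of these types exist in Nutch.
class RobotsRedirect {

    /** Minimal stand-in for an HTTP response: status code, Location header, body. */
    static final class Response {
        final int status;
        final String location;
        final String body;

        Response(int status, String location, String body) {
            this.status = status;
            this.location = location;
            this.body = body;
        }
    }

    /** Stand-in for whatever HTTP client actually performs the request. */
    interface Http {
        Response get(URI uri);
    }

    /** Returns the redirect target, or null if the response is not a usable redirect. */
    static URI redirectTarget(URI requested, Response response) {
        int s = response.status;
        boolean redirect = s == 301 || s == 302 || s == 303 || s == 307;
        if (!redirect || response.location == null) {
            return null;
        }
        try {
            // resolve() also handles relative Location values such as "/robots.txt".
            return requested.resolve(new URI(response.location));
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** Fetches robots.txt, following at most one redirect; returns the body or null. */
    static String fetchRobots(URI robotsUri, Http http) {
        Response first = http.get(robotsUri);
        URI target = redirectTarget(robotsUri, first);
        // One hop only: a second redirect is not followed.
        Response last = (target == null) ? first : http.get(target);
        return last.status == 200 ? last.body : null;
    }
}
```

Capping the hop count at one keeps the sketch safe against redirect loops while still covering the domain to www.domain case shown in the curl output.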
[Nutch Wiki] Update of AddingNewLocalization by Mike Dawson
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The following page has been changed by Mike Dawson:
http://wiki.apache.org/nutch/AddingNewLocalization

--
  If you want to have Nutch in your language - hopefully the below helps. I just Googled around.
- * Unzip Nutch 1.0 to any folder
+ * Unzip Nutch 1.0 to any folder
- * Translate the .properties files that you find in src/web/locale/org/nutch/jsp :
+ * Translate the .properties files that you find in src/web/locale/org/nutch/jsp :
- ** For each file make sure that you have your own version ending in _langcode.properties e.g. _fa.properties . Btw OmegaT is an excellent Translation memory program to help with standardizing terms etc.
+ * For each file make sure that you have your own version ending in _langcode.properties e.g. _fa.properties . Btw OmegaT is an excellent Translation memory program to help with standardizing terms etc.
- * Make a folder src/web/include/langcode with a file header.xml - again this needs translated.
+ * Make a folder src/web/include/langcode with a file header.xml - again this needs translated.
- * Make a folder src/web/pages/langcode and copy the .xml files from the English folder and then translate them. In search.xml look for the line:
+ * Make a folder src/web/pages/langcode and copy the .xml files from the English folder and then translate them. In search.xml look for the line:
- <pre>
+
+ {{{
  <input type="hidden" name="lang" value="fa"/>
- </pre>
+ }}}
  Change the value of lang to match the language you are adding (e.g. fa)
- * Add your language to src/web/include/footer.html
+ * Add your language to src/web/include/footer.html
- * In the Nutch base directory run ant
+ * In the Nutch base directory run ant
- <pre>
+ {{{
  ant generate-docs
- </pre>
+ }}}
- * Work in progress - I now find that when doing the search it still comes back in English... for some reason it seems like the JSP loads the resource bundle according to the language passed by the browser headers, not according to the lang parameter...
+ * Work in progress - I now find that when doing the search it still comes back in English... for some reason it seems like the JSP loads the resource bundle according to the language passed by the browser headers, not according to the lang parameter...
[jira] Commented: (NUTCH-731) Redirection of robots.txt in RobotRulesParser
[ https://issues.apache.org/jira/browse/NUTCH-731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722242#action_12722242 ]

Ken Krugler commented on NUTCH-731:
-----------------------------------

This is definitely an issue - I've been pinging various domains while testing robots.txt handling in bixo, and many of them will redirect a request for http://domain/robots.txt to http://www.domain/robots.txt.

Redirection of robots.txt in RobotRulesParser
---------------------------------------------

Key: NUTCH-731
URL: https://issues.apache.org/jira/browse/NUTCH-731
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 1.0.0
Reporter: Julien Nioche
Attachments: NUTCH-731.patch

The attached patch allows following one level of redirection for robots.txt files. A similar issue was mentioned in NUTCH-124 and was marked as fixed a long time ago, but the problem remained, at least when using Fetcher2. Mathijs Homminga pointed to the problem in a mail to the nutch-dev list in March.

I have been using this patch for a while now on a large cluster and noticed that the ratio of robots_denied per fetchlist went up, meaning that at least we are now getting restrictions we would not have had before (and getting fewer complaints from webmasters at the same time).
[Nutch Wiki] Update of AddingNewLocalization by Mike Dawson
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The following page has been changed by Mike Dawson:
http://wiki.apache.org/nutch/AddingNewLocalization

--
- ===Adding a New Language to Nutch===
+ = Adding a New Language to Nutch =
- If you want to have Nutch in your language - hopefully the below helps. I just Googled around.
+ If you want to have Nutch in your language - hopefully the below helps. I have been Googling around and digging in some source code...
  * Unzip Nutch 1.0 to any folder
@@ -25, +25 @@
  ant generate-docs
  }}}
- * Work in progress - I now find that when doing the search it still comes back in English... for some reason it seems like the JSP loads the resource bundle according to the language passed by the browser headers, not according to the lang parameter...
+ * It seems like some changes are needed to search.jsp to make it behave as users would expect. The original appears to expect the language of the browser to take precedence over the language selected... After out.flush() at about line 160 add the following in src/web/jsp/search.jsp:
+ {{{
+
+ //see what locale we should use
+ Locale ourLocale = null;
+ if(!queryLang.equals("")) {
+   ourLocale = new Locale(queryLang);
+   language = new String(queryLang);
+ }else {
+   ourLocale = request.getLocale();
+ }
+
+ }}}
+
+ Then change the line:
+
+ {{{
+ <i18n:bundle baseName="org.nutch.jsp.search"/>
+ }}}
+
+ to:
+
+ {{{
+ <i18n:bundle baseName="org.nutch.jsp.search" locale="<%=ourLocale%>"/>
+ }}}
+
+ * Now we are ready to build it:
+
+ {{{
+ ant war
+ }}}
+
+ * Copy the .war file to your servlet container's webapp directory. If everything went well you will see your language code in the bottom, then you can select it, and the search interface will come back with the localisation you just put in.
+
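The locale selection added to search.jsp above can be tried outside a servlet container. The following is a minimal standalone sketch of the same precedence rule (explicit lang parameter first, browser locale as fallback). The class LocaleChooser and the method chooseLocale are hypothetical names, the null check is an addition not present in the wiki snippet, and Locale.forLanguageTag is used here in place of the Locale constructor.

```java
import java.util.Locale;

// Hypothetical helper mirroring the scriptlet logic; not part of Nutch.
class LocaleChooser {

    /**
     * Prefers the language explicitly requested via the "lang" query parameter
     * and falls back to the locale sent by the browser.
     */
    static Locale chooseLocale(String queryLang, Locale browserLocale) {
        if (queryLang != null && !queryLang.trim().isEmpty()) {
            return Locale.forLanguageTag(queryLang.trim());
        }
        return browserLocale;
    }

    public static void main(String[] args) {
        // Explicit parameter wins over the browser setting.
        System.out.println(chooseLocale("fa", Locale.ENGLISH).getLanguage());
        // Empty parameter: fall back to the browser locale.
        System.out.println(chooseLocale("", Locale.ENGLISH).getLanguage());
    }
}
```

In the JSP, the browser locale would come from request.getLocale(), as in the snippet on the wiki page.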
[jira] Updated: (NUTCH-731) Redirection of robots.txt in RobotRulesParser
[ https://issues.apache.org/jira/browse/NUTCH-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated NUTCH-731:
----------------------------------

Fix Version/s: 1.1
Assignee: Otis Gospodnetic

Redirection of robots.txt in RobotRulesParser
---------------------------------------------

Key: NUTCH-731
URL: https://issues.apache.org/jira/browse/NUTCH-731
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Otis Gospodnetic
Fix For: 1.1
Attachments: NUTCH-731.patch

The attached patch allows following one level of redirection for robots.txt files. A similar issue was mentioned in NUTCH-124 and was marked as fixed a long time ago, but the problem remained, at least when using Fetcher2. Mathijs Homminga pointed to the problem in a mail to the nutch-dev list in March.

I have been using this patch for a while now on a large cluster and noticed that the ratio of robots_denied per fetchlist went up, meaning that at least we are now getting restrictions we would not have had before (and getting fewer complaints from webmasters at the same time).
[jira] Resolved: (NUTCH-742) Checksum Error
[ https://issues.apache.org/jira/browse/NUTCH-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic resolved NUTCH-742.
-----------------------------------

Resolution: Incomplete

Could you please post more detailed information to the nutch-user mailing list first?

Checksum Error
--------------

Key: NUTCH-742
URL: https://issues.apache.org/jira/browse/NUTCH-742
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.0.0
Environment: linux ubuntu8.0.4 64bit, 10 datanodes, 4G of memory per node
Reporter: mawanqiang

Nutch 1.0 fails with an error when creating an index from approximately 1 million data items. The error is:

java.lang.RuntimeException: problem advancing post rec#6758513
    at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:883)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:237)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:233)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:79)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)
Caused by: org.apache.hadoop.fs.ChecksumException: Checksum Error
    at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:153)
    at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:90)
    at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:301)
    at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:331)
    at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:315)
    at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:377)
    at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:174)
    at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:277)
    at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:297)
    at org.apache.hadoop.mapred.Task$ValuesIterator.readNextKey(Task.java:922)
    at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:881)
    ... 6 more
Build failed in Hudson: Nutch-trunk #851
See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/851/

--
[...truncated 4676 lines...]
deploy:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex
    [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex
copy-generated-lib:
    [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex
init:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/test
init-plugin:
deps-jar:
compile:
    [echo] Compiling plugin: urlfilter-suffix
    [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes
    [javac] Note: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java uses unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
jar:
    [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/urlfilter-suffix.jar
deps-test:
deploy:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix
    [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix
copy-generated-lib:
    [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix
init:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/test
init-plugin:
deps-jar:
compile:
    [echo] Compiling plugin: urlfilter-validator
    [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes
jar:
    [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/urlfilter-validator.jar
deps-test:
deploy:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator
    [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator
copy-generated-lib:
    [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator
init:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test
init-plugin:
deps-jar:
compile:
    [echo] Compiling plugin: urlnormalizer-basic
    [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes
jar:
    [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/urlnormalizer-basic.jar
deps-test:
deploy:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic
    [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic
copy-generated-lib:
    [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic
init:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/classes
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/test
init-plugin:
deps-jar:
compile:
    [echo] Compiling plugin: urlnormalizer-pass
    [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/classes
jar:
    [jar] Building jar: