[Nutch Wiki] Update of FrontPage by JulienNioche
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The FrontPage page has been changed by JulienNioche. http://wiki.apache.org/nutch/FrontPage?action=diffrev1=127rev2=128 -- * JavaDemoApplication - A simple demonstration of how to use the Nutch APIin a Java application * InstallingWeb2 * ApacheConUs2009MeetUp - List of topics for !MeetUp at !ApacheCon US 2009 in Oakland (Nov 2-6) + * TikaPlugin - Comments on the Tika integration and differences with existing parse plugins == Nutch 2.0 == * Nutch2Architecture -- Discussions on the Nutch 2.0 architecture.
[jira] Commented: (NUTCH-768) Upgrade Nutch 1.0 to use Hadoop 0.20
[ https://issues.apache.org/jira/browse/NUTCH-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790162#action_12790162 ] Dennis Kubes commented on NUTCH-768: The older jetty jar file was not removed with this patch. It will need to be removed from the nutch lib directory if applying the patch versus pulling from trunk. There is also a second patch that updates unit tests for the Jetty interfaces. Neither of these will need to be applied if pulling from Trunk as those problems have been corrected. Upgrade Nutch 1.0 to use Hadoop 0.20 Key: NUTCH-768 URL: https://issues.apache.org/jira/browse/NUTCH-768 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 Attachments: NUTCH-768-1-20091125.patch Upgrade Nutch 1.0 to use the Hadoop 0.20 release. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Build failed in Hudson: Nutch-trunk #1011
This is failing because of the older jetty jar being removed and the Jetty interfaces changes. I am currently working to fix the interfaces for the new Jetty version. Hope to have a patch committed later today and this should be back to normal. Dennis Apache Hudson Server wrote: See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1011/ -- [...truncated 4728 lines...] jar: init: init-plugin: deps-jar: compile: [echo] Compiling plugin: lib-regex-filter compile-test: compile: [echo] Compiling plugin: urlfilter-regex [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-regex/classes jar: [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-regex/urlfilter-regex.jar deps-test: init: init-plugin: deps-jar: compile: [echo] Compiling plugin: lib-regex-filter jar: deps-test: deploy: copy-generated-lib: deploy: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex copy-generated-lib: [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex init: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/test init-plugin: deps-jar: compile: [echo] Compiling plugin: urlfilter-suffix [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes [javac] Note: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java uses unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. jar: [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/urlfilter-suffix.jar deps-test: deploy: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix copy-generated-lib: [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix init: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/test init-plugin: deps-jar: compile: [echo] Compiling plugin: urlfilter-validator [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes jar: [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/urlfilter-validator.jar deps-test: deploy: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator copy-generated-lib: [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator init: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test init-plugin: deps-jar: compile: [echo] Compiling plugin: urlnormalizer-basic [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes jar: [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/urlnormalizer-basic.jar deps-test: deploy: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic copy-generated-lib: [copy]
[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790225#action_12790225 ] Andrzej Bialecki commented on NUTCH-666: - Dennis, what's the status of this patch (especially the missing part, the new language identifier)? Analysis plugins for multiple language and new Language Identifier Tool --- Key: NUTCH-666 URL: https://issues.apache.org/jira/browse/NUTCH-666 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 Attachments: NUTCH-666-1-20081126.patch Add analysis plugins for czech, greek, japanese, chinese, korean, dutch, russian, and thai. Also includes a new Language Identifier tool that used the new indexing framework in NUTCH-646. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-427) protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implme
[ https://issues.apache.org/jira/browse/NUTCH-427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790244#action_12790244 ] Vincent Couturier commented on NUTCH-427: - The last attached zip does not contain the changes of Ilquiz Latypov. It's necessary to patch the zip with the protocol-smb-diff.txt. I will try to put a patched version but if Iluqiz can put his updated version it would be easier. protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implmentation. -- Key: NUTCH-427 URL: https://issues.apache.org/jira/browse/NUTCH-427 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 0.8.1, 0.9.0, 1.0.0 Environment: JAVA - OS independent Reporter: Armel Nene Priority: Minor Attachments: protocol-smb-diff.txt, protocol-smb.zip, protocol-smb.zip, protocol-smb.zip Title:protocol-smb - Nutch protocol plugin for crawling Microsoft Windows shares Author: Armel T. Nene Update: Vadim Bauer Email:armel.nene NOSPAM-AT-NOSPAM idna-solutions.com, V a d i m B a u e r AT g m x . d e A. Introduction The protocol-smb plugins allows you to crawl Microsoft Windows shares. It implements the CIFS/SMB protocol which is commonly used on Microsoft OS. The plugin replicate the behaviour of the protocol-file over CIFS/SMB protocol. This plugin uses the JCifs library and also support all the properties from the JCifs library. You can find more information on the following site: http://jcifs.samba.org/ The smb protocol syntax for crawling is as follow: smb://x (i.e. smb://server/share). B. Installation 1) Binaries only: The protocol-smb files can be found in the ../plugins directory. Copy the protocol-smb to NUTCHHOME/build/plugins directory. Put the smb.properties file in the NUTCHHOME/conf directory. Configure the properties in smb.properties file Enable the plugin by updating nutch-site.xml file found in NUTCHHOME/conf directory e.g. property nameplugin.includes/name valueprotocol-smb| other plugins.../value description /description /property 2) Source code:The protocol-smb sources can be found in the ../src directory. Always refer to the Nutch wiki for detailed instructions on building Nutch. In short: Copy the 'protocol-smb' folder to NUTCHHOME/src/plugin Update the build.xml in NUTCHHOME/src/plugin to include plugin Update the NUTCHHOME/default.properties file to include plugin run ant to build Copy the 'smb.properties' file to NUTCHHOME/conf, and configure the properties Enable the plugin by updating the nutch-site.xml file C: Known Issues 1) URLMalformedException: unkown protocol: smb The SMB URL protocol handler is not being successfully installed. In short, the jCIFS jar must be loaded by the System class loader. Workaround: a) a short term solutions will be to installed the JCIFS jar library found in protocol-smb folder in JDKHOME/jre/lib/ext and (or) JREHOME/lib/ext b) After completing step a), if the exeception is still thrown set the System properties by passing the following arguments to the JVM: -Djava.protocol.handler.pkgs=jcifs c) You can set the property also in your Code for example if you start Crawling with org.apache.nutch.crawl.Crawl Add the following two lines. This will be the Same like in b) public static void main(String args[]) throws Exception { System.setProperty(java.protocol.handler.pkgs, jcifs); new java.util.PropertyPermission(java.protocol.handler.pkgs,read, write) //and so on Also you can visit the FAQ page: http://jcifs.samba.org/src/docs/faq.html 2) FATAL smb.SMB - Could