[Nutch Wiki] Update of FrontPage by JulienNioche

2009-12-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The FrontPage page has been changed by JulienNioche.
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=127&rev2=128

--

   * JavaDemoApplication - A simple demonstration of how to use the Nutch API in 
a Java application
   * InstallingWeb2
   * ApacheConUs2009MeetUp - List of topics for !MeetUp at !ApacheCon US 2009 
in Oakland (Nov 2-6)
+  * TikaPlugin - Comments on the Tika integration and differences with 
existing parse plugins
  
  == Nutch 2.0 ==
   * Nutch2Architecture -- Discussions on the Nutch 2.0 architecture.


[jira] Commented: (NUTCH-768) Upgrade Nutch 1.0 to use Hadoop 0.20

2009-12-14 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790162#action_12790162
 ] 

Dennis Kubes commented on NUTCH-768:


The older Jetty jar file was not removed by this patch.  If you apply the 
patch rather than pulling from trunk, you will need to remove it from the 
Nutch lib directory yourself.  There is also a second patch that updates the 
unit tests for the Jetty interfaces.  Neither step is needed when pulling 
from trunk, as those problems have already been corrected there.

 Upgrade Nutch 1.0 to use Hadoop 0.20
 

 Key: NUTCH-768
 URL: https://issues.apache.org/jira/browse/NUTCH-768
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-768-1-20091125.patch


 Upgrade Nutch 1.0 to use the Hadoop 0.20 release.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Build failed in Hudson: Nutch-trunk #1011

2009-12-14 Thread Dennis Kubes
This is failing because the older Jetty jar was removed and the Jetty 
interfaces changed.  I am currently working on fixing the interfaces for 
the new Jetty version.  I hope to have a patch committed later today, 
after which the build should be back to normal.


Dennis

Apache Hudson Server wrote:

See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1011/

--
[...truncated 4728 lines...]
jar:

init:

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: lib-regex-filter

compile-test:

compile:
 [echo] Compiling plugin: urlfilter-regex
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-regex/classes

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-regex/urlfilter-regex.jar

deps-test:

init:

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: lib-regex-filter

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex

init:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/test

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: urlfilter-suffix
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes
[javac] Note: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java
 uses unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/urlfilter-suffix.jar

deps-test:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix

init:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/test

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: urlfilter-validator
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/urlfilter-validator.jar

deps-test:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator

init:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: urlnormalizer-basic
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/urlnormalizer-basic.jar

deps-test:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic

copy-generated-lib:
 [copy] 

[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2009-12-14 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790225#action_12790225
 ] 

Andrzej Bialecki  commented on NUTCH-666:
-

Dennis, what's the status of this patch (especially the missing part, the new 
language identifier)?

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-666-1-20081126.patch


 Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
 Russian, and Thai.  Also includes a new Language Identifier tool that uses 
 the new indexing framework in NUTCH-646.




[jira] Commented: (NUTCH-427) protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implme

2009-12-14 Thread Vincent Couturier (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790244#action_12790244
 ] 

Vincent Couturier commented on NUTCH-427:
-

The last attached zip does not contain Ilquiz Latypov's changes, so it has to 
be patched with protocol-smb-diff.txt. I will try to upload a patched 
version, but it would be easier if Ilquiz could upload his updated version.

 protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This 
 protocol allows Nutch to crawl Microsoft Windows Shares remotely using the 
 CIFS/SMB protocol implementation.
 --

 Key: NUTCH-427
 URL: https://issues.apache.org/jira/browse/NUTCH-427
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.8.1, 0.9.0, 1.0.0
 Environment: JAVA - OS independent
Reporter: Armel Nene
Priority: Minor
 Attachments: protocol-smb-diff.txt, protocol-smb.zip, 
 protocol-smb.zip, protocol-smb.zip


 Title:    protocol-smb - Nutch protocol plugin for crawling Microsoft Windows 
           shares
 Author:   Armel T. Nene
 Update:   Vadim Bauer
 Email:    armel.nene NOSPAM-AT-NOSPAM idna-solutions.com, V a d i m B a u e r 
           AT g m x . d e
 A.  Introduction
 The protocol-smb plugin allows you to crawl Microsoft Windows shares. It 
 implements the CIFS/SMB protocol, which is commonly used on Microsoft 
 operating systems. The plugin replicates the behaviour of protocol-file 
 over the CIFS/SMB protocol. It uses the JCifs library and supports all 
 the properties of that library.
 You can find more information on the following site: 
 http://jcifs.samba.org/
 The smb URL syntax for crawling is as follows: smb://x (e.g. 
 smb://server/share).
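
 As a rough illustration of the smb://server/share syntax described above, 
 the sketch below splits such a URL into server, share and remaining path. 
 It is not part of the plugin or of JCifs: the class SmbUrlParts and its 
 fields are hypothetical names, and it relies only on java.net.URI, which 
 parses the URL syntactically and needs no smb protocol handler.

```java
import java.net.URI;

// Hypothetical helper (not from Nutch or JCifs): splits an smb:// URL
// into the server, the share, and the path inside the share.
final class SmbUrlParts {
    final String server;
    final String share;
    final String pathInShare;

    private SmbUrlParts(String server, String share, String pathInShare) {
        this.server = server;
        this.share = share;
        this.pathInShare = pathInShare;
    }

    static SmbUrlParts parse(String url) {
        // java.net.URI only checks syntax, so an unregistered scheme is fine.
        URI uri = URI.create(url);
        if (!"smb".equalsIgnoreCase(uri.getScheme())) {
            throw new IllegalArgumentException("not an smb URL: " + url);
        }
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("missing server in: " + url);
        }
        String path = uri.getPath() == null ? "" : uri.getPath();
        // Drop the leading slash, then split at the first remaining slash.
        String trimmed = path.startsWith("/") ? path.substring(1) : path;
        int slash = trimmed.indexOf('/');
        String share = slash < 0 ? trimmed : trimmed.substring(0, slash);
        String rest = slash < 0 ? "" : trimmed.substring(slash + 1);
        return new SmbUrlParts(host, share, rest);
    }

    public static void main(String[] args) {
        SmbUrlParts p = parse("smb://server/share/docs/report.txt");
        System.out.println(p.server + " | " + p.share + " | " + p.pathInShare);
    }
}
```

 Keeping the parsing on java.net.URI means the example runs even when the 
 smb handler discussed under Known Issues is not installed.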
 
 B.  Installation
 1) Binaries only: The protocol-smb files can be found in the ../plugins 
    directory.
      Copy the protocol-smb folder to the NUTCHHOME/build/plugins directory.
      Put the smb.properties file in the NUTCHHOME/conf directory.
      Configure the properties in the smb.properties file.
      Enable the plugin by updating the nutch-site.xml file found in the 
      NUTCHHOME/conf directory, e.g.
        <property>
          <name>plugin.includes</name>
          <value>protocol-smb|other plugins...</value>
          <description>
          </description>
        </property>
 2) Source code: The protocol-smb sources can be found in the ../src 
    directory.
    Always refer to the Nutch wiki for detailed instructions on building 
    Nutch.  In short:
      Copy the 'protocol-smb' folder to NUTCHHOME/src/plugin.
      Update the build.xml in NUTCHHOME/src/plugin to include the plugin.
      Update the NUTCHHOME/default.properties file to include the plugin.
      Run ant to build.
      Copy the 'smb.properties' file to NUTCHHOME/conf, and configure the 
      properties.
      Enable the plugin by updating the nutch-site.xml file.
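
 The effect of the plugin.includes setting mentioned in the installation 
 steps can be sketched in a few lines of Java. This assumes the value is 
 treated as a regular expression that must match a whole plugin id. The 
 class PluginIncludesCheck and the sample plugin list are hypothetical and 
 not taken from the Nutch sources.

```java
import java.util.regex.Pattern;

// Hypothetical check (not Nutch code): tests whether a plugin id would be
// selected by a plugin.includes value, ASSUMING the value is a regular
// expression matched against the complete plugin id.
final class PluginIncludesCheck {

    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Sample value for illustration only; use your own plugin list.
        String includes = "protocol-smb|urlfilter-regex|parse-(text|html)";
        System.out.println(isIncluded(includes, "protocol-smb"));
        System.out.println(isIncluded(includes, "protocol-http"));
    }
}
```

 Under that assumption, stray whitespace around a '|' becomes part of the 
 neighbouring alternative, so it is safer to write the list without spaces.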
 C.  Known Issues
 1) MalformedURLException: unknown protocol: smb
    The SMB URL protocol handler is not being successfully installed. 
    In short, the jCIFS jar must be loaded by the System class loader.
    Workaround:
    a) A short-term solution is to install the jCIFS jar library found in 
       the protocol-smb folder in JDKHOME/jre/lib/ext and/or 
       JREHOME/lib/ext.
    b) If the exception is still thrown after completing step a), set the 
       system property by passing the following argument to the JVM:
         -Djava.protocol.handler.pkgs=jcifs
    c) You can also set the property in your code, for example if you 
       start crawling with org.apache.nutch.crawl.Crawl. Add the following 
       two lines; this has the same effect as b):
         public static void main(String args[]) throws Exception {
           System.setProperty("java.protocol.handler.pkgs", "jcifs");
           new java.util.PropertyPermission("java.protocol.handler.pkgs", "read, write");
           //and so on
    You can also visit the FAQ page: 
    http://jcifs.samba.org/src/docs/faq.html
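
 The handler problem in issue 1 can be reproduced and checked with plain 
 JDK classes. The following is only a sketch: SmbHandlerCheck and its 
 methods are hypothetical names, not part of Nutch or jCIFS. It appends 
 jcifs to java.protocol.handler.pkgs (a '|'-separated list of package 
 prefixes) instead of overwriting the property, then tests whether an 
 smb:// URL can be constructed.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical diagnostic (not Nutch or jCIFS code) for the
// "unknown protocol: smb" problem.
final class SmbHandlerCheck {
    static final String HANDLER_PKGS = "java.protocol.handler.pkgs";

    // Returns the property value with pkg appended, keeping any packages
    // that are already listed and avoiding duplicates.
    static String withHandlerPackage(String existing, String pkg) {
        if (existing == null || existing.trim().isEmpty()) {
            return pkg;
        }
        for (String p : existing.split("\\|")) {
            if (p.trim().equals(pkg)) {
                return existing;
            }
        }
        return existing + "|" + pkg;
    }

    // True if the JVM can find a protocol handler for the URL's scheme.
    @SuppressWarnings("deprecation") // the URL(String) constructor is deprecated in recent JDKs
    static boolean canCreateUrl(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.setProperty(HANDLER_PKGS,
                withHandlerPackage(System.getProperty(HANDLER_PKGS), "jcifs"));
        // Still false if the jCIFS jar is not visible to the class loader.
        System.out.println("smb URLs usable: " + canCreateUrl("smb://server/share/"));
    }
}
```

 Running the check before starting a crawl shows whether the property alone 
 is enough or whether the jar is still not visible to the class loader.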
 2) FATAL smb.SMB - Could