[Nutch Wiki] Update of TikaPlugin by JulienNioche

2009-12-16 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The TikaPlugin page has been changed by JulienNioche.
http://wiki.apache.org/nutch/TikaPlugin?action=diff&rev1=2&rev2=3

--

  = Tika Plugin =
- The Tika plugin in http://issues.apache.org/jira/browse/NUTCH-766 is a first 
attempt at delegating the parsing to Tika instead of having to maintain the 
parser plugins in Nutch. This page will list the differences in coverage or 
functionality between the Tika plugin and the existing Nutch parsers. Tika also 
has more formats not covered by Nutch which are not described here.
+ The Tika plugin in http://issues.apache.org/jira/browse/NUTCH-766 is a first 
attempt at delegating the parsing to Tika instead of having to maintain the 
parser plugins in Nutch. This page will list the differences in coverage or 
functionality between the Tika plugin and the existing Nutch parsers. Tika also 
has more formats not covered by Nutch which are not described here and has a 
more generic capability of representing structured content which can be useful 
for HtmlParseFilters (which are currently limited to HTML content).
  
  '''html''': ?
  
@@ -9, +9 @@

  
  '''mp3''': ?
  
- '''msexcel''': ?
+ '''msexcel''': comparable (+ Tika able to represent content in structured way 
as XHTML tables which can be useful for HTML parser plugins)
  
- '''mspowerpoint''': ?
+ '''mspowerpoint''': comparable
  
- '''msword''': ?
+ '''msword''': Tika does not support Word 95; other versions are comparable
  
- '''openoffice''': ?
+ '''openoffice''': comparable
  
- '''pdf''': ?
+ '''pdf''': comparable
  
  '''rss''': ?
  
- '''rtf''': ?
+ '''rtf''': comparable
  
  '''swf''' : not yet covered in Tika (see 
https://issues.apache.org/jira/browse/TIKA-337)
  
  '''text''': ?
  
- '''zip''': ?not covered in Tika
+ '''zip''': ?
  


[jira] Commented: (NUTCH-775) Enhance Searcher interface

2009-12-16 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791411#action_12791411
 ] 

Andrzej Bialecki  commented on NUTCH-775:
-

+1. I would suggest creating a subclass of Metadata, where we can guarantee the 
presence of some required parameters, e.g.:

{code}
public class SearchContext extends Metadata {
  protected int numHits;
  protected String sortField;
  protected String dedupField;
  ...
  // setters and getters for the above
}
{code}

and change the QueryFilter interface to use SearchContext too.
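
A slightly fuller sketch of the suggestion above, with the elided accessors written out for the three fields shown. The `Metadata` class here is a minimal stand-in (a plain string map) so the example compiles on its own; it is not Nutch's real `Metadata` class. The accessor names, the default of 10 hits, and the range check are assumptions made for illustration, not part of any patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Nutch's Metadata so the sketch is self-contained.
class Metadata {
  private final Map<String, String> values = new HashMap<>();

  public void set(String name, String value) { values.put(name, value); }

  public String get(String name) { return values.get(name); }
}

// Typed fields guarantee that the required search parameters are always
// present; optional extras can still travel as plain metadata entries.
class SearchContext extends Metadata {
  protected int numHits = 10;   // assumed default, not taken from the issue
  protected String sortField;   // null is read here as "no explicit sort"
  protected String dedupField;  // null is read here as "no deduplication"

  public int getNumHits() { return numHits; }

  public void setNumHits(int numHits) {
    if (numHits < 0) {
      throw new IllegalArgumentException("numHits must be >= 0");
    }
    this.numHits = numHits;
  }

  public String getSortField() { return sortField; }

  public void setSortField(String sortField) { this.sortField = sortField; }

  public String getDedupField() { return dedupField; }

  public void setDedupField(String dedupField) { this.dedupField = dedupField; }
}
```

With this shape a caller sets the required parameters through typed setters (for example `ctx.setNumHits(25)`) and anything experimental through the inherited `set(name, value)`, so new options would not require another interface change.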

 Enhance Searcher interface
 --

 Key: NUTCH-775
 URL: https://issues.apache.org/jira/browse/NUTCH-775
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.1


 Current Searcher interface is too limited for many purposes:
 Hits search(Query query, int numHits, String dedupField, String sortField,
   boolean reverse) throws IOException;
 It would be nice if we had an interface that allowed adding different 
 features without changing the interface. I am proposing that we deprecate the 
 current search method and introduce something like:
 Hits search(Query query, Metadata context) throws IOException;
 Also at the same time we should enhance the QueryFilter interface to look 
 something like:
 BooleanQuery filter(Query input, BooleanQuery translation, Metadata context)
 throws QueryException;
 I would like to hear your comments before proceeding with a patch.
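
One way the deprecation path described above could look is sketched below. `Query`, `Hits`, and `Metadata` are empty or minimal stand-ins so the example compiles alone, and the context key names (`numHits`, `dedupField`, `sortField`, `reverse`) are hypothetical. The sketch also relies on interface default methods, which need Java 8 or later; on older Java the same delegation would live in an abstract base class instead.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for the Nutch types, so the sketch compiles alone.
class Query {}

class Hits {}

class Metadata {
  private final Map<String, String> values = new HashMap<>();

  public void set(String name, String value) { values.put(name, value); }

  public String get(String name) { return values.get(name); }
}

interface Searcher {
  // Proposed extensible entry point: new options go into the context
  // instead of becoming new method parameters.
  Hits search(Query query, Metadata context) throws IOException;

  // Old signature kept for existing callers; it only repackages its
  // arguments into a context and delegates to the new method.
  @Deprecated
  default Hits search(Query query, int numHits, String dedupField,
                      String sortField, boolean reverse) throws IOException {
    Metadata context = new Metadata();
    context.set("numHits", Integer.toString(numHits));
    if (dedupField != null) {
      context.set("dedupField", dedupField);
    }
    if (sortField != null) {
      context.set("sortField", sortField);
    }
    context.set("reverse", Boolean.toString(reverse));
    return search(query, context);
  }
}
```

Implementations would then only need to provide the two-argument method, and existing call sites keep compiling (with a deprecation warning) until they are migrated.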

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Build failed in Hudson: Nutch-trunk #1012

2009-12-16 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1012/

--
[...truncated 4728 lines...]
jar:

init:

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: lib-regex-filter

compile-test:

compile:
 [echo] Compiling plugin: urlfilter-regex
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-regex/classes

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-regex/urlfilter-regex.jar

deps-test:

init:

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: lib-regex-filter

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex

init:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/test

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: urlfilter-suffix
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes
[javac] Note: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java
 uses unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/urlfilter-suffix.jar

deps-test:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix

init:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/test

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: urlfilter-validator
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/urlfilter-validator.jar

deps-test:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator

init:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: urlnormalizer-basic
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/urlnormalizer-basic.jar

deps-test:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic

init:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass
[mkdir] Created dir: 

Build failed in Hudson: Nutch-trunk #1013

2009-12-16 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1013/

--
[...truncated 4728 lines...]
jar:

init:

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: lib-regex-filter

compile-test:

compile:
 [echo] Compiling plugin: urlfilter-regex
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-regex/classes

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-regex/urlfilter-regex.jar

deps-test:

init:

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: lib-regex-filter

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex

init:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/test

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: urlfilter-suffix
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes
[javac] Note: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java
 uses unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/urlfilter-suffix.jar

deps-test:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix

init:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/test

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: urlfilter-validator
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/urlfilter-validator.jar

deps-test:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator

init:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: urlnormalizer-basic
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/urlnormalizer-basic.jar

deps-test:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic

init:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass
[mkdir] Created dir: 

unsubscribe

2009-12-16 Thread malsmith
unsubscribe



[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2009-12-16 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791829#action_12791829
 ] 

Sami Siren commented on NUTCH-666:
--

Should we also consider switching to Tika for language identification and 
routing the proposed improvements in that area through Tika?

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-666-1-20081126.patch


 Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
 Russian, and Thai.  Also includes a new Language Identifier tool that uses 
 the new indexing framework in NUTCH-646.
