[Nutch Wiki] Update of TikaPlugin by JulienNioche
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The TikaPlugin page has been changed by JulienNioche.
http://wiki.apache.org/nutch/TikaPlugin?action=diff&rev1=2&rev2=3

--------------------------------------------------
  = Tika Plugin =

- The Tika plugin in http://issues.apache.org/jira/browse/NUTCH-766 is a first attempt at delegating the parsing to Tika instead of having to maintain the parser plugins in Nutch. This page will list the differences in coverage or functionality between the Tika plugin and the existing Nutch parsers. Tika also has more formats not covered by Nutch which are not described here.
+ The Tika plugin in http://issues.apache.org/jira/browse/NUTCH-766 is a first attempt at delegating the parsing to Tika instead of having to maintain the parser plugins in Nutch. This page will list the differences in coverage or functionality between the Tika plugin and the existing Nutch parsers. Tika also has more formats not covered by Nutch which are not described here and has a more generic capability of representing structured content which can be useful for HtmlParseFilters (which are currently limited to HTML content).

  '''html''': ?

@@ -9, +9 @@

  '''mp3''': ?

- '''msexcel''': ?
+ '''msexcel''': comparable (+ Tika able to represent content in structured way as XHTML tables which can be useful for HTML parser plugins)

- '''mspowerpoint''': ?
+ '''mspowerpoint''': comparable

- '''msword''': ?
+ '''msword''': Tika does not support word 95 other versions are comparable

- '''openoffice''': ?
+ '''openoffice''': comparable

- '''pdf''': ?
+ '''pdf''': comparable

  '''rss''': ?

- '''rtf''': ?
+ '''rtf''': comparable

  '''swf''' : not yet covered in Tika (see https://issues.apache.org/jira/browse/TIKA-337)

  '''text''': ?

- '''zip''': ?not covered in Tika
+ '''zip''': ?
[jira] Commented: (NUTCH-775) Enhance Searcher interface
[ https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791411#action_12791411 ]

Andrzej Bialecki commented on NUTCH-775:
----------------------------------------

+1. I would suggest creating a subclass of Metadata, where we can guarantee the presence of some required parameters, e.g.:

{code}
public class SearchContext extends Metadata {
  protected int numHits;
  protected String sortField;
  protected String dedupField;
  ...
  // setters and getters for the above
}
{code}

and change the QueryFilter interface to use SearchContext too.

Enhance Searcher interface
--------------------------

Key: NUTCH-775
URL: https://issues.apache.org/jira/browse/NUTCH-775
Project: Nutch
Issue Type: Improvement
Components: searcher
Reporter: Sami Siren
Assignee: Sami Siren
Fix For: 1.1

Current Searcher interface is too limited for many purposes:

Hits search(Query query, int numHits, String dedupField, String sortField, boolean reverse) throws IOException;

It would be nice that we had an interface that allowed adding different features without changing the interface. I am proposing that we deprecate the current search method and introduce something like:

Hits search(Query query, Metadata context) throws IOException;

Also at the same time we should enhance the QueryFilter interface to look something like:

BooleanQuery filter(Query input, BooleanQuery translation, Metadata context) throws QueryException;

I would like to hear your comments before proceeding with a patch.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
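As a rough illustration of the SearchContext suggestion in the comment above, here is a minimal, self-contained Java sketch. The Metadata class in it is a simplified stand-in written only for this example, not the real Nutch class, and the getter and setter names are assumptions; the comment itself only names the three fields.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for Nutch's Metadata class (an assumption made so the
// sketch compiles on its own): a plain map of single-valued string properties.
class Metadata {
    private final Map<String, String> values = new HashMap<>();

    public void set(String name, String value) {
        values.put(name, value);
    }

    public String get(String name) {
        return values.get(name);
    }
}

// Sketch of the suggested subclass: the parameters every search needs become
// typed fields, while optional settings go through the inherited set()/get().
class SearchContext extends Metadata {
    protected int numHits;
    protected String sortField;
    protected String dedupField;

    // Accessor names below are hypothetical; the comment only says
    // "setters and getters for the above".
    public int getNumHits() { return numHits; }
    public void setNumHits(int numHits) { this.numHits = numHits; }

    public String getSortField() { return sortField; }
    public void setSortField(String sortField) { this.sortField = sortField; }

    public String getDedupField() { return dedupField; }
    public void setDedupField(String dedupField) { this.dedupField = dedupField; }
}
```

With something like this, a caller would fill in the required values through the typed setters and pass any optional, feature-specific settings through the inherited set method, so adding a feature would not require changing the search signature.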
Build failed in Hudson: Nutch-trunk #1012
See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1012/

------------------------------------------
[...truncated 4728 lines...]

jar:

init:

init-plugin:

deps-jar:

compile:
     [echo] Compiling plugin: lib-regex-filter

compile-test:

compile:
     [echo] Compiling plugin: urlfilter-regex
    [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-regex/classes

jar:
      [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-regex/urlfilter-regex.jar

deps-test:

init:

init-plugin:

deps-jar:

compile:
     [echo] Compiling plugin: lib-regex-filter

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex
     [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex

copy-generated-lib:
     [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex

init:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/test

init-plugin:

deps-jar:

compile:
     [echo] Compiling plugin: urlfilter-suffix
    [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes
    [javac] Note: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java uses unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.

jar:
      [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/urlfilter-suffix.jar

deps-test:

deploy:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix
     [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix

copy-generated-lib:
     [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix

init:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/test

init-plugin:

deps-jar:

compile:
     [echo] Compiling plugin: urlfilter-validator
    [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes

jar:
      [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/urlfilter-validator.jar

deps-test:

deploy:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator
     [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator

copy-generated-lib:
     [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator

init:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test

init-plugin:

deps-jar:

compile:
     [echo] Compiling plugin: urlnormalizer-basic
    [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes

jar:
      [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/urlnormalizer-basic.jar

deps-test:

deploy:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic
     [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic

copy-generated-lib:
     [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic

init:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass
    [mkdir] Created dir:
Build failed in Hudson: Nutch-trunk #1013
See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1013/

------------------------------------------
[...truncated 4728 lines...]

jar:

init:

init-plugin:

deps-jar:

compile:
     [echo] Compiling plugin: lib-regex-filter

compile-test:

compile:
     [echo] Compiling plugin: urlfilter-regex
    [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-regex/classes

jar:
      [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-regex/urlfilter-regex.jar

deps-test:

init:

init-plugin:

deps-jar:

compile:
     [echo] Compiling plugin: lib-regex-filter

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex
     [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex

copy-generated-lib:
     [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex

init:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/test

init-plugin:

deps-jar:

compile:
     [echo] Compiling plugin: urlfilter-suffix
    [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes
    [javac] Note: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java uses unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.

jar:
      [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/urlfilter-suffix.jar

deps-test:

deploy:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix
     [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix

copy-generated-lib:
     [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix

init:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/test

init-plugin:

deps-jar:

compile:
     [echo] Compiling plugin: urlfilter-validator
    [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes

jar:
      [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/urlfilter-validator.jar

deps-test:

deploy:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator
     [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator

copy-generated-lib:
     [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator

init:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test

init-plugin:

deps-jar:

compile:
     [echo] Compiling plugin: urlnormalizer-basic
    [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes

jar:
      [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/urlnormalizer-basic.jar

deps-test:

deploy:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic
     [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic

copy-generated-lib:
     [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic

init:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass
    [mkdir] Created dir:
[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791829#action_12791829 ]

Sami Siren commented on NUTCH-666:
----------------------------------

We should also consider switching to Tika for language identification and route the proposed improvements in that area through Tika?

Analysis plugins for multiple language and new Language Identifier Tool
-----------------------------------------------------------------------

Key: NUTCH-666
URL: https://issues.apache.org/jira/browse/NUTCH-666
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.1
Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Fix For: 1.1
Attachments: NUTCH-666-1-20081126.patch

Add analysis plugins for czech, greek, japanese, chinese, korean, dutch, russian, and thai. Also includes a new Language Identifier tool that used the new indexing framework in NUTCH-646.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.