[jira] Commented: (NUTCH-835) document deduplication (exact duplicates) failed using MD5Signature
[ https://issues.apache.org/jira/browse/NUTCH-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884624#action_12884624 ] Julien Nioche commented on NUTCH-835: - This patch has been marked for 1.2 but has been committed to trunk only (2.0). Shall we also apply it to /nutch/branches/branch-1.2 ? document deduplication (exact duplicates) failed using MD5Signature --- Key: NUTCH-835 URL: https://issues.apache.org/jira/browse/NUTCH-835 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0, 1.1 Environment: Linux, Ubuntu 10.04, Java 1.6.0_20 Reporter: Sebastian Nagel Assignee: Andrzej Bialecki Fix For: 1.2, 2.0 The MD5Signature class calculates different signatures for identical documents. The reason is that byte[] data = content.getContent(); ... StringBuilder().append(data) ... uses java.lang.Object.toString() to get a string representation of the (binary) content which results in unique hash codes (e.g., [...@30dc9065) even for two byte arrays with identical content. A solution would be to take the MD5 sum of the binary content as first part of the final signature calculation (the parsed content is the second part): ... .append(StringUtil.toHexString(MD5Hash.digest(data).getDigest())).append(parse.getText()); Of course, there are many other solutions... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
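For readers following NUTCH-835, here is a minimal sketch of the failure described above and of the kind of fix the reporter proposes. It is not the committed patch: the class and method names (ContentSignatureSketch, brokenInput, fixedInput, md5Hex) are made up for illustration, and it uses the JDK's MessageDigest where the report uses Hadoop's MD5Hash and Nutch's StringUtil.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only; not the Nutch MD5Signature class.
class ContentSignatureSketch {

    // Hex-encoded MD5 of the raw bytes (stand-in for MD5Hash.digest + toHexString).
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return String.format("%032x", new BigInteger(1, md.digest(data)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // The bug: StringBuilder.append(byte[]) resolves to append(Object), so the
    // array's Object.toString() (type tag plus identity hash) is appended,
    // not the bytes themselves.
    static String brokenInput(byte[] data, String text) {
        return new StringBuilder().append(data).append(text).toString();
    }

    // The proposed direction: digest the binary content first, then append
    // the parsed text as the second part.
    static String fixedInput(byte[] data, String text) {
        return new StringBuilder().append(md5Hex(data)).append(text).toString();
    }

    public static void main(String[] args) {
        byte[] a = "same page".getBytes(StandardCharsets.UTF_8);
        byte[] b = "same page".getBytes(StandardCharsets.UTF_8);
        // Two arrays with identical content normally print different strings here.
        System.out.println(brokenInput(a, "text"));
        System.out.println(brokenInput(b, "text"));
        // With the digest in place, identical content gives identical input.
        System.out.println(fixedInput(a, "text").equals(fixedInput(b, "text")));
    }
}
```

The string returned by fixedInput would then be hashed to produce the final signature, so two documents with the same bytes and the same parsed text end up with the same value.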
[Nutchbase] WebPage class is a generated code?
Hi,

(This question is mostly to Dogacan and Enis, but I encourage anyone familiar with the code to join the threads with [Nutchbase] - the sooner the better ;) ).

I'm looking at src/gora/webpage.avsc and WebPage.java and friends... presumably the java code was autogenerated from avsc using Gora? If so, we should put this autogeneration step in our build.xml. Or am I missing something?

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: [Nutchbase] WebPage class is a generated code?
> (This question is mostly to Dogacan and Enis, but I encourage anyone familiar with the code to join the threads with [Nutchbase] - the sooner the better ;) ).
>
> I'm looking at src/gora/webpage.avsc and WebPage.java and friends... presumably the java code was autogenerated from avsc using Gora? If so, we should put this autogeneration step in our build.xml. Or am I missing something?

correct. if we keep the generated java classes in svn then we probably want to make this task optional i.e. it would not be done as part of the build tasks OR we can add it to the build but remove it from svn (or better add to svn ignore or whatever-it-is-called).

J.

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
[jira] Created: (NUTCH-840) Port tests from parse-html to parse-tika
Port tests from parse-html to parse-tika Key: NUTCH-840 URL: https://issues.apache.org/jira/browse/NUTCH-840 Project: Nutch Issue Type: Task Components: parser Affects Versions: 1.1 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 2.0 We don't have test for HTML in parse-tika so I'll copy them from the old parse-html plugin
Re: Nutch 2.0 : Design issue
On 2 July 2010 12:22, Andrzej Bialecki a...@getopt.org wrote:

> On 2010-07-02 12:42, Julien Nioche wrote:
> > Hi guys,
> >
> > You've probably seen that there has been some progress on 2.0 lately. We've updated the nutchbase svn branch with the latest developments done on Dogacan's Github i.e. using GORA as a storage layer.
> >
> > One of the main issues [1] I raised after using nutchbase was that : NutchBase currently marks entries in the table to be fetched | parsed | etc... and needs to go through the whole table at every step. As the table gets bigger it takes more and more time to read through the entries and check their marks which is not a viable option. NutchBase is currently slower than Nutch 1.1 (might be issues with Gora but still...)
> >
> > I suggest instead that we create fetchlists in separate tables, fetch parse in these tables then merge the entries back to the main table. The segment tables could then be deleted if necessary. We would then have a linear processing time for fetching + parsing + updating depending on the size of the segments and NOT on the size of the main table. This would be an improvement compared to 1.1 where the processing time in the updates is relative to the size of the crawldb . Doing this requires to be able to separate the name of a schema from the name of a table in Gora [2], which should not be a big problem.
>
> I think this is a good idea - this model is conceptually close to the current model, and I bet it will be easier to debug problems when changes are limited to a separate table... we could create 1 table per segment. (Oh, and let's stop calling them segments, please - maybe call them a batch or crawl cycle or something. The name segments caused a lot of confusion already, and it doesn't convey any useful meaning..)

Makes sense

> As for the time savings .. this remains to be seen. At the end of the fetching/parsing job we need to merge this data back into the main table, which is a massive update that also takes time.

True

> > On a second thought I was wondering whether it would also make sense to actually keep the segments as they currently are i.e. stored as NutchWritables in HDFS. The advantages of doing this would be that we'd keep exactly the same code for the fetching + parsing + would only need to modify the generations and update steps + would be able to easily port pre-2.0 segments to the webtable. The drawbacks being that there would be a dual storage GORA / HDFS and we'd need to keep the legacy Nutch Writable objects.
>
> The fetcher code is already ported in nutchbase not to use the plain files. I doubt there would be many users who want to jump to Nutch 2.0 and still want to hold on to their old segments... so I think this is not useful. Dual storage .. *shudder* that's asking for trouble.

Right, + am not too keen on keeping the legacy objects. Another advantage of having the GORA-based tables for the segments (or fetch_cycles ;-) ) is that is makes it easier to restart an interrupted fetch or parse. Forget about the HDFS based storage, let's just do it with GORA

> > Note that it would not change anything to the content of the main webtable nor the operations done on them. Maybe it would make sense to do that anyway at least as a transition while we make the webtable and GORA operations stable and then see if there is an advantage in storing the segments as GORA tables as well. I am pretty confident that we need to address the point raised in [1] anyway.
> >
> > What do you guys think?
> >
> > [1] http://github.com/dogacan/nutchbase/issues#issue/8
> > [2] http://github.com/enis/gora/issues#issue/30
>
> +1 to both points, -1 to the dual storage.
>
> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
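To make the proposal in this thread concrete, here is a rough in-memory sketch of the per-batch table idea: generate a small batch table from the main table, fetch and parse inside that batch only, then merge the batch back. Everything in it is an assumption for illustration (the class BatchTableSketch, the Status enum, plain Java maps standing in for GORA tables); it does not use the Gora API and is not code from nutchbase.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; maps stand in for the webtable and a batch table.
class BatchTableSketch {
    enum Status { UNFETCHED, FETCHED }

    // Generate step: copy up to `limit` unfetched entries into a separate batch table.
    static Map<String, Status> generateBatch(Map<String, Status> webTable, int limit) {
        Map<String, Status> batch = new LinkedHashMap<>();
        for (Map.Entry<String, Status> e : webTable.entrySet()) {
            if (batch.size() >= limit) break;
            if (e.getValue() == Status.UNFETCHED) batch.put(e.getKey(), e.getValue());
        }
        return batch;
    }

    // Fetch + parse step: touches only rows of the batch, never the main table.
    // Returns the number of rows visited.
    static int fetchAndParse(Map<String, Status> batch) {
        int touched = 0;
        for (Map.Entry<String, Status> e : batch.entrySet()) {
            e.setValue(Status.FETCHED);
            touched++;
        }
        return touched;
    }

    // Update step: write the batch rows back into the main table.
    static void mergeBack(Map<String, Status> webTable, Map<String, Status> batch) {
        webTable.putAll(batch);
    }

    public static void main(String[] args) {
        Map<String, Status> webTable = new LinkedHashMap<>();
        for (int i = 0; i < 1000; i++) {
            webTable.put("http://example.com/" + i, Status.UNFETCHED);
        }
        Map<String, Status> batch = generateBatch(webTable, 10);
        int touched = fetchAndParse(batch);
        mergeBack(webTable, batch);
        System.out.println("rows touched while fetching: " + touched);
    }
}
```

In this toy version the fetch and parse work scales with the batch size rather than with the main table. As noted in the thread, the merge at the end is still an update against the main table, so whether the real system saves time overall remains to be measured.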
[Nutch Wiki] Trivial Update of PluginCentral by AlexMc
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The PluginCentral page has been changed by AlexMc.
The comment on this change is: adding a couple of external tutorials relating to plugins (more welcome!!!).
http://wiki.apache.org/nutch/PluginCentral?action=diff&rev1=60&rev2=61

--
  * [[WritingPluginExample-0.9]] - Step-by-step example of how to write a plugin for the current development.
  * WritingPluginExample - A step-by-step example of how to write a plugin for the 0.7 branch. - updated by LucasBoullosa
  * [[http://wiki.media-style.com/display/nutchDocu/Write+a+plugin|Writing Plugins]] - by Stefan
+ * [[http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html|Example of writing a custom plugin] by Sujitpal
+ * [[http://www.ryanpfister.com/2009/04/how-to-sort-by-date-with-nutch/|Writing a plugin to add dates]] by Ryan Pfister

  == Plugins that Come with Nutch (0.9) ==
[jira] Updated: (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-840: Attachment: NUTCH-840.patch

Patch which adds the HTML tests to the Tika Parser.

The tests currently rely on some DOM related code from Neko-HTML, which introduces a dependency on the plugin lib-nekohtml. Apart from parse-tika, lib-nekohtml is used only in clustering-carrot, which will be removed shortly. Once this is done we can delete lib-nekohtml as well, then either:
a) add the neko jar to the parse-tika lib via IVY
b) replace it with another implementation already available from the tika dependencies or the main Nutch dependencies (e.g. dom4j)
[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884671#action_12884671 ] Julien Nioche commented on NUTCH-837: - I think we can also get rid of : * docs/ * WAR related tasks in ANT * src/web/ * src/xmlcatalog/ * src/engines/ Remove search servers and Lucene dependencies -- Key: NUTCH-837 URL: https://issues.apache.org/jira/browse/NUTCH-837 Project: Nutch Issue Type: Task Components: searcher, web gui Affects Versions: 1.1 Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 2.0 Attachments: NUTCH-837.patch One of the main aspects of 2.0 is the delegation of the indexing and search to external resources like SOLR. We can simplify the code a lot by getting rid of the : * search servers * indexing and analysis with Lucene * search side functionalities : ontologies / clustering etc... In the short term only SOLR / SOLRCloud will be supported but the plan would be to add other systems as well.
Re: [Nutchbase] WebPage class is a generated code?
Hey Guys,

Since they are generated, +1 to:

* adding a filepattern to svn:ignore to ignore them
* updating build.xml to autogenerate

Cheers,
Chris

On 7/2/10 3:24 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote:

> > (This question is mostly to Dogacan and Enis, but I encourage anyone familiar with the code to join the threads with [Nutchbase] - the sooner the better ;) ).
> >
> > I'm looking at src/gora/webpage.avsc and WebPage.java and friends... presumably the java code was autogenerated from avsc using Gora? If so, we should put this autogeneration step in our build.xml. Or am I missing something?
>
> correct. if we keep the generated java classes in svn then we probably want to make this task optional i.e. it would not be done as part of the build tasks OR we can add it to the build but remove it from svn (or better add to svn ignore or whatever-it-is-called).
>
> J.

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884691#action_12884691 ] Chris A. Mattmann commented on NUTCH-837: - Hey Julien: How are we going to replace the Nutch webapp? Cheers, Chris
[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884712#action_12884712 ] Chris A. Mattmann commented on NUTCH-837: - I'm not sure I agree :) The Nutch webapp is just a set of web pages that let someone know that Search is working. They are decent web pages, have a great look and feel and are something I've seen nearly every newbie Nutch user I've been around leverage to tell whether or not Nutch installed correctly. I'm also a fan of the "let's not lose functionality on a technology upgrade task" mantra. That is, we are reorganizing the architecture of Nutch to improve it, not to take away functionality. We should at least support the baseline of functionality that was present in 1.x. That said, I'm not sure the existing webapp should be maintained in its current form. Maybe we should take a pass at updating the webapp to work with the Nutch 2.0 architecture underneath. I'm happy to pick up a shovel and dig on that one. Cheers, Chris
[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884718#action_12884718 ] Chris A. Mattmann commented on NUTCH-837: - Hey Julien, Yep that's the point. Solr != Nutch, so Solr's Webapp can't be expected to be = Nutch's webapp. The example you cited about cached data is a great one, because Solr's webapp doesn't really support that (nor should it IMHO). So, I think we should still have a Nutch webapp and in my mind it's a must-have for a 2.0 release...not to worry though I'm volunteering to help do it! :) Cheers, Chris
[jira] Updated: (NUTCH-837) Remove search servers and Lucene dependencies
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-837: Attachment: NUTCH-837.patch Updated patch against r959954 (after NUTCH-836).
[jira] Updated: (NUTCH-837) Remove search servers and Lucene dependencies
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-837: Attachment: (was: NUTCH-837.patch)
[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884729#action_12884729 ] Andrzej Bialecki commented on NUTCH-837: - bq. So, I think we should still have a Nutch webapp and in my mind it's a must-have for a 2.0 release... I agree. But for the moment it's better to delete the old webapp stuff that we know for sure doesn't work with the current Nutch, and it will be completely reimplemented anyway.
[jira] Created: (NUTCH-841) Nutch 2.0 webapp
Nutch 2.0 webapp Key: NUTCH-841 URL: https://issues.apache.org/jira/browse/NUTCH-841 Project: Nutch Issue Type: Improvement Components: web gui Environment: Nutch 2.0 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Blocker Fix For: 2.0 In light of the conversation on NUTCH-837, we are removing the old Nutch webapp and will replace it with a 2.0 one that works with GORA + Solr.
[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884734#action_12884734 ] Julien Nioche commented on NUTCH-837: - :-)
[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884731#action_12884731 ] Chris A. Mattmann commented on NUTCH-837: - Okey dok, I created NUTCH-841 to track it. Julien, Andrzej, you have my +1 to take your axe to the old one :)
[jira] Resolved: (NUTCH-837) Remove search servers and Lucene dependencies
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-837. - Resolution: Fixed Committed in r960064. Thanks for review!
[Nutch Wiki] Update of WritingPluginExample-0.9 by Ramprasad Ramachandran
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The WritingPluginExample-0.9 page has been changed by Ramprasad Ramachandran.
http://wiki.apache.org/nutch/WritingPluginExample-0.9?action=diff&rev1=11&rev2=12

--
  </project>
  }}}
+ For Nutch-1.0 write the following:
+
+ {{{
+ <?xml version="1.0"?>
+
+ <project name="recommended" default="jar-core">
+
+   <import file="../build-plugin.xml"/>
+
+   <!-- Build compilation dependencies -->
+   <target name="deps-jar">
+     <ant target="jar" inheritall="false" dir="../lib-xml"/>
+   </target>
+
+   <!-- Add compilation dependencies to classpath -->
+   <path id="plugin.deps">
+     <fileset dir="${nutch.root}/build">
+       <include name="**/lib-xml/*.jar" />
+     </fileset>
+   </path>
+
+   <!-- Deploy Unit test dependencies -->
+   <target name="deps-test">
+     <ant target="deploy" inheritall="false" dir="../lib-xml"/>
+     <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
+     <ant target="deploy" inheritall="false" dir="../protocol-file"/>
+   </target>
+
+
+   <!-- for junit test -->
+   <mkdir dir="${build.test}/data"/>
+   <copy file="data/recommended.html" todir="${build.test}/data"/>
+ </project>
+ }}}
+
  Save this file in directory [!YourCheckoutDir]/src/plugin/recommended

  == The HTML Parser Extension ==
+ NOTE: Nutch-1.0 users make sure that you save all your java files in this directory C:\nutch-1.0\src\plugin\recommended\src\java\org\apache\nutch\parse\recommended
+
- This is the source code for the HTML Parser extension. It tries to grab the contents of the recommended meta tag and add them to the document being parsed. On the directory above, create a file called RecommendedParser.java and add this as the contents:
+ This is the source code for the HTML Parser extension. It tries to grab the contents of the recommended meta tag and add them to the document being parsed. On the directory , create a file called RecommendedParser.java and add this as the contents:

  {{{
  package org.apache.nutch.parse.recommended;
@@ -273, +310 @@
  }}}

  == Compiling the plugin ==
+
+ For ant installation in Windows, refer this - [[http://ant.apache.org/manual/install.html|ant]]

  In order to build the plugin - or Nutch itself - you'll need ant. If you're using MacOs you can easily get it via [[http://fink.sourceforge.net/|fink]]. Let's get junit while we're at it.
Build failed in Hudson: Nutch-trunk #1196
See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1196/changes Changes: [ab] NUTCH-837 Remove search servers and Lucene dependencies. [ab] NUTCH-836 Remove deprecated parse plugins. [jnioche] NUTCH-836 : Remove deprecated parse plugins -- [...truncated 3443 lines...] deps-jar: compile: [echo] Compiling plugin: urlnormalizer-basic compile-test: [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test [junit] Running org.apache.nutch.urlfilter.regex.TestRegexURLFilter jar: deps-test: deploy: copy-generated-lib: test: [echo] Testing plugin: urlnormalizer-basic [junit] Running org.apache.nutch.net.urlnormalizer.basic.TestBasicURLNormalizer [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.03 sec init: init-plugin: deps-jar: compile: [echo] Compiling plugin: urlnormalizer-pass compile-test: [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/test jar: deps-test: deploy: copy-generated-lib: test: [echo] Testing plugin: urlnormalizer-pass [junit] Running org.apache.nutch.net.urlnormalizer.pass.TestPassURLNormalizer [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.208 sec init: init-plugin: deps-jar: compile: [echo] Compiling plugin: urlnormalizer-regex compile-test: [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/test [javac] Note: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/src/plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java uses unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. jar: deps-test: init: init-plugin: compile: jar: [jar] Warning: skipping jar archive http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/nutch-extensionpoints/nutch-extensionpoints.jar because no files were included. 
deps-test: deploy: copy-generated-lib: deploy: copy-generated-lib: test: [echo] Testing plugin: urlnormalizer-regex [junit] Running org.apache.nutch.net.urlnormalizer.regex.TestRegexURLNormalizer [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.148 sec [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 12.556 sec test: jar: [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/classes [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/classes [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/nutch-2010-07-03_04-42-59.jar javadoc: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/docs/api [javadoc] Generating Javadoc [javadoc] Javadoc execution [javadoc] Loading source files for package org.apache.nutch.crawl... [javadoc] Loading source files for package org.apache.nutch.fetcher... [javadoc] Loading source files for package org.apache.nutch.indexer... [javadoc] Loading source files for package org.apache.nutch.indexer.solr... [javadoc] Loading source files for package org.apache.nutch.metadata... [javadoc] Loading source files for package org.apache.nutch.net... [javadoc] Loading source files for package org.apache.nutch.net.protocols... [javadoc] Loading source files for package org.apache.nutch.parse... [javadoc] Loading source files for package org.apache.nutch.plugin... [javadoc] Loading source files for package org.apache.nutch.protocol... [javadoc] Loading source files for package org.apache.nutch.scoring... [javadoc] Loading source files for package org.apache.nutch.scoring.webgraph... [javadoc] Loading source files for package org.apache.nutch.segment... [javadoc] Loading source files for package org.apache.nutch.tools... [javadoc] Loading source files for package org.apache.nutch.tools.arc... [javadoc] Loading source files for package org.apache.nutch.util... 
[javadoc] Loading source files for package org.apache.nutch.util.domain... [javadoc] Loading source files for package org.apache.nutch.protocol.http.api... [javadoc] Loading source files for package org.apache.nutch.urlfilter.api... [javadoc] Loading source files for package org.apache.nutch.microformats.reltag... [javadoc] Loading source files for package org.apache.nutch.protocol.file... [javadoc] Loading source files for package org.apache.nutch.protocol.ftp... [javadoc] Loading source files for package org.apache.nutch.protocol.http... [javadoc] Loading source files for package org.apache.nutch.protocol.httpclient... [javadoc] Loading source files for package org.apache.nutch.parse.ext... [javadoc] Loading source files for package