Re: [DISCUSS] Board resolution for Nutch as TLP
I think it looks good after the minor changes. +1.

Dennis

Andrzej Bialecki wrote:

Hi,

I was told that the next step is to come up with the proposed Board resolution and vote it among committers. Here's the proposed text (shameless copypaste from Tika and Mahout proposals).

IMPORTANT NOTE: I removed from the members of the PMC those existing Nutch committers that haven't been active for more than 1 year, with the intention of moving them to Emeritus status. If any one of these people feels left out and would like to become an active committer in the project, please let us know and we will gladly welcome you back :)

The text of the resolution follows. Committers, please read it and optionally comment on the salient points of the text, the rest is boilerplate. If there's an overall consensus I will call for a formal vote to submit this proposal to the Board.

==
X. Establish the Apache Nutch Project

WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to a large-scale web crawling platform for distribution at no charge to the public.

NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee (PMC), to be known as the Apache Nutch Project, be and hereby is established pursuant to Bylaws of the Foundation; and be it further

RESOLVED, that the Apache Nutch Project be and hereby is responsible for the creation and maintenance of software related to a large-scale web crawling platform; and be it further

RESOLVED, that the office of Vice President, Apache Nutch be and hereby is created, the person holding such office to serve at the direction of the Board of Directors as the chair of the Apache Nutch Project, and to have primary responsibility for management of the projects within the scope of responsibility of the Apache Nutch Project; and be it further

RESOLVED, that the persons listed immediately below be and hereby are appointed to serve as the initial members of the Apache Nutch Project:

• Andrzej Bialecki a...@...
• Otis Gospodnetic o...@...
• Dogacan Guney doga...@...
• Dennis Kubes ku...@...
• Chris Mattmann mattm...@...
• Julien Nioche jnio...@...
• Sami Siren si...@...

RESOLVED, that the Apache Nutch Project be and hereby is tasked with the migration and rationalization of the Apache Lucene Nutch sub-project; and be it further

RESOLVED, that all responsibilities pertaining to the Apache Lucene Nutch sub-project encumbered upon the Apache Nutch Project are hereafter discharged.

NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki be appointed to the office of Vice President, Apache Nutch, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed.
=
[jira] Commented: (NUTCH-768) Upgrade Nutch 1.0 to use Hadoop 0.20
[ https://issues.apache.org/jira/browse/NUTCH-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790162#action_12790162 ] Dennis Kubes commented on NUTCH-768: The older jetty jar file was not removed with this patch. It will need to be removed from the nutch lib directory if applying the patch versus pulling from trunk. There is also a second patch that updates unit tests for the Jetty interfaces. Neither of these will need to be applied if pulling from Trunk as those problems have been corrected. Upgrade Nutch 1.0 to use Hadoop 0.20 Key: NUTCH-768 URL: https://issues.apache.org/jira/browse/NUTCH-768 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 Attachments: NUTCH-768-1-20091125.patch Upgrade Nutch 1.0 to use the Hadoop 0.20 release. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Build failed in Hudson: Nutch-trunk #1011
This is failing because of the older jetty jar being removed and the Jetty interfaces changes. I am currently working to fix the interfaces for the new Jetty version. Hope to have a patch committed later today and this should be back to normal. Dennis Apache Hudson Server wrote: See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1011/ -- [...truncated 4728 lines...] jar: init: init-plugin: deps-jar: compile: [echo] Compiling plugin: lib-regex-filter compile-test: compile: [echo] Compiling plugin: urlfilter-regex [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-regex/classes jar: [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-regex/urlfilter-regex.jar deps-test: init: init-plugin: deps-jar: compile: [echo] Compiling plugin: lib-regex-filter jar: deps-test: deploy: copy-generated-lib: deploy: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex copy-generated-lib: [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex init: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/test init-plugin: deps-jar: compile: [echo] Compiling plugin: urlfilter-suffix [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes [javac] Note: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java uses unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. jar: [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/urlfilter-suffix.jar deps-test: deploy: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix copy-generated-lib: [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix init: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/test init-plugin: deps-jar: compile: [echo] Compiling plugin: urlfilter-validator [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes jar: [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/urlfilter-validator.jar deps-test: deploy: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator copy-generated-lib: [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator init: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic [mkdir] 
Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test init-plugin: deps-jar: compile: [echo] Compiling plugin: urlnormalizer-basic [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes jar: [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/urlnormalizer-basic.jar deps-test: deploy: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic copy-generated-lib: [copy]
[jira] Closed: (NUTCH-768) Upgrade Nutch 1.0 to use Hadoop 0.20
[ https://issues.apache.org/jira/browse/NUTCH-768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-768. -- Resolution: Fixed Weird. The hsqldb License file was the same checksum as that pulled from hadoop. It must have had the windows EOL in hadoop distribution as well. I changed it anyways. Everything committed with revision 885778. Upgrade Nutch 1.0 to use Hadoop 0.20 Key: NUTCH-768 URL: https://issues.apache.org/jira/browse/NUTCH-768 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 Attachments: NUTCH-768-1-20091125.patch Upgrade Nutch 1.0 to use the Hadoop 0.20 release. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-768) Upgrade Nutch 1.0 to use Hadoop 0.20
[ https://issues.apache.org/jira/browse/NUTCH-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784066#action_12784066 ] Dennis Kubes commented on NUTCH-768: If no objections I will commit this tomorrow sometime? Upgrade Nutch 1.0 to use Hadoop 0.20 Key: NUTCH-768 URL: https://issues.apache.org/jira/browse/NUTCH-768 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 Attachments: NUTCH-768-1-20091125.patch Upgrade Nutch 1.0 to use the Hadoop 0.20 release. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: svn commit: r884075 - /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
Oops. Sorry about that.

a...@apache.org wrote:

Author: ab
Date: Wed Nov 25 12:44:34 2009
New Revision: 884075

URL: http://svn.apache.org/viewvc?rev=884075&view=rev
Log:
Change access from private to public - this fixes Crawl.java breakage.

Modified:
    lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java

Modified: lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java?rev=884075&r1=884074&r2=884075&view=diff
==
--- lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java (original)
+++ lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java Wed Nov 25 12:44:34 2009
@@ -50,7 +50,7 @@
     super(conf);
   }

-  private void indexSolr(String solrUrl, Path crawlDb, Path linkDb,
+  public void indexSolr(String solrUrl, Path crawlDb, Path linkDb,
       List<Path> segments) throws IOException {
     LOG.info("SolrIndexer: starting");
[jira] Updated: (NUTCH-768) Upgrade Nutch 1.0 to use Hadoop 0.20
[ https://issues.apache.org/jira/browse/NUTCH-768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-768: --- Attachment: NUTCH-768-1-20091125.patch I thought I was going to be able to do this without code changes. No such luck. There are many, many deprecations as a result of this upgrade. Anything that used the old Mapper and Reducer interfaces seems to have deprecated methods in it. The NutchBean class needed to implement the two RPC*Bean interfaces to handle changes in Hadoop RPC (that could have been a leftover from 1.0 changes but I don't think so). Also there are numerous changes to build scripts and the nutch bin script to support different hadoop jars. There are also many new files for the conf directory as Hadoop has split out files and has new configuration files for new capabilities. After all changes I was able to run everything in local and pseudo-distributed mode as well as test out local and distributed searching. Everything seems to work fine. After we make this upgrade I would recommend going back and updating all of the tool interfaces for the most recent APIs. Upgrade Nutch 1.0 to use Hadoop 0.20 Key: NUTCH-768 URL: https://issues.apache.org/jira/browse/NUTCH-768 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 Attachments: NUTCH-768-1-20091125.patch Upgrade Nutch 1.0 to use the Hadoop 0.20 release. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-771) Add WebGraph classes to the bin/nutch script
Add WebGraph classes to the bin/nutch script Key: NUTCH-771 URL: https://issues.apache.org/jira/browse/NUTCH-771 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Environment: All, shell script Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 Currently the webgraph jobs are called on the command line by calling main methods on their classes. I propose to upgrade the bin/nutch shell script to allow calling these jobs as well. This would include the webgraphdb, linkrank, scoreupdater, and nodedumper jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-768) Upgrade Nutch 1.0 to use Hadoop 0.20
[ https://issues.apache.org/jira/browse/NUTCH-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782172#action_12782172 ] Dennis Kubes commented on NUTCH-768: I have tested the upgrade with Hadoop 0.20. To upgrade this correctly we do need to upgrade Xerces both in the main lib jars and within the lib-xml plugin. I have upgraded to the most recent version of Xerces 2.9.x. Having run through multiple full crawl and index cycles both on the new and old indexing frameworks, including the webgraphdb, and the solr indexing process, I didn't find any errors within the process. If no one has any objections I will commit these changes within the next 24 hours. Upgrade Nutch 1.0 to use Hadoop 0.20 Key: NUTCH-768 URL: https://issues.apache.org/jira/browse/NUTCH-768 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 Upgrade Nutch 1.0 to use the Hadoop 0.20 release. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (NUTCH-765) Allow Crawl class to call Either Solr or Lucene Indexer
[ https://issues.apache.org/jira/browse/NUTCH-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes reassigned NUTCH-765: -- Assignee: Dennis Kubes Allow Crawl class to call Either Solr or Lucene Indexer --- Key: NUTCH-765 URL: https://issues.apache.org/jira/browse/NUTCH-765 Project: Nutch Issue Type: Improvement Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Priority: Minor Fix For: 1.0.0, 1.1 Attachments: NUTCH-765-2009112-1.patch Change to the crawl class to have a -solr option which will call the solr indexer instead of the lucene indexer. This also allows it to ignore dedup and merge for solr indexing and to point to a specific solr instance. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-765) Allow Crawl class to call Either Solr or Lucene Indexer
[ https://issues.apache.org/jira/browse/NUTCH-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-765. Resolution: Fixed Committed. Allow Crawl class to call Either Solr or Lucene Indexer --- Key: NUTCH-765 URL: https://issues.apache.org/jira/browse/NUTCH-765 Project: Nutch Issue Type: Improvement Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Priority: Minor Fix For: 1.1, 1.0.0 Attachments: NUTCH-765-2009112-1.patch Change to the crawl class to have a -solr option which will call the solr indexer instead of the lucene indexer. This also allows it to ignore dedup and merge for solr indexing and to point to a specific solr instance. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-765) Allow Crawl class to call Either Solr or Lucene Indexer
[ https://issues.apache.org/jira/browse/NUTCH-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-765. -- Allow Crawl class to call Either Solr or Lucene Indexer --- Key: NUTCH-765 URL: https://issues.apache.org/jira/browse/NUTCH-765 Project: Nutch Issue Type: Improvement Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Priority: Minor Fix For: 1.0.0, 1.1 Attachments: NUTCH-765-2009112-1.patch Change to the crawl class to have a -solr option which will call the solr indexer instead of the lucene indexer. This also allows it to ignore dedup and merge for solr indexing and to point to a specific solr instance. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Plugin Help
It depends on how you are building and on your classpath. Let's call your plugin myhtmlfilter.

If running on a single server and you added it to your src/plugin/build.xml under the deploy section, a myhtmlfilter folder with the plugin should show up under the build/plugins folder upon build. Then you would just have to copy that myhtmlfilter folder over to your deployment plugins directory.

If running on a cluster, even in pseudo-distributed mode, you would need to copy over the nutch-*.job file. It has the plugins inside of it and it gets distributed out to the cluster.

If referencing from a webapp or the nutch war file, you would need to copy to WEB-INF/classes/plugins.

Dennis

david.stu...@progressivealliance.co.uk wrote:

Hi,

I am trying to write a plugin for Nutch and am having real trouble getting it registered in the system. I have created it in src/plugin and added it to both the build.xml in plugin and to nutch-site.xml. It seems to build OK, but when I try to run a basic

crawl urls -dir crawl -depth 3 -topN 2

I see the plugin registered in the hadoop.log:

2009-11-14 14:57:45,739 INFO plugin.PluginRepository - Html Filter Parse Plug-in (parse-htmlfilter)

But then I get the error message below. I have followed all of the tutorials, but they are mostly for Nutch 0.9 and have errors in them, which I have worked through.

Thanks for your help.

Regards,
Dave

java.lang.RuntimeException: org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.htmlfilter.HtmlfilterIndexer
    at org.apache.nutch.indexer.IndexingFilters.<init>(IndexingFilters.java:100)
    at org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:61)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
Caused by: org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.htmlfilter.HtmlfilterIndexer
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
    at org.apache.nutch.indexer.IndexingFilters.<init>(IndexingFilters.java:70)
    ... 8 more
Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.htmlfilter.HtmlfilterIndexer
    at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:319)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:254)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
[jira] Updated: (NUTCH-765) Allow Crawl class to call Either Solr or Lucene Indexer
[ https://issues.apache.org/jira/browse/NUTCH-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-765: --- Attachment: NUTCH-765-2009112-1.patch Allow Crawl class to call Either Solr or Lucene Indexer --- Key: NUTCH-765 URL: https://issues.apache.org/jira/browse/NUTCH-765 Project: Nutch Issue Type: Improvement Environment: All Reporter: Dennis Kubes Priority: Minor Fix For: 1.0.0, 1.1 Attachments: NUTCH-765-2009112-1.patch Change to the crawl class to have a -solr option which will call the solr indexer instead of the lucene indexer. This also allows it to ignore dedup and merge for solr indexing and to point to a specific solr instance. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-765) Allow Crawl class to call Either Solr or Lucene Indexer
Allow Crawl class to call Either Solr or Lucene Indexer --- Key: NUTCH-765 URL: https://issues.apache.org/jira/browse/NUTCH-765 Project: Nutch Issue Type: Improvement Environment: All Reporter: Dennis Kubes Priority: Minor Fix For: 1.1, 1.0.0 Attachments: NUTCH-765-2009112-1.patch Change to the crawl class to have a -solr option which will call the solr indexer instead of the lucene indexer. This also allows it to ignore dedup and merge for solr indexing and to point to a specific solr instance. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
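The behaviour described in this issue can be pictured with a small sketch. This is not the code from the attached patch: the class name, the method name and the step labels below are all hypothetical, and the only details taken from the issue text are that a -solr option selects the Solr indexer, points it at a specific Solr instance, and skips dedup and merge.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; it does not reproduce the real Crawl class.
final class CrawlOptionsSketch {

    /** Returns the indexing steps a crawl would run for the given arguments. */
    static List<String> planSteps(String[] args) {
        String solrUrl = null;
        for (int i = 0; i < args.length; i++) {
            // "-solr <url>" selects the Solr indexer and names the instance.
            if ("-solr".equals(args[i]) && i + 1 < args.length) {
                solrUrl = args[++i];
            }
        }

        List<String> steps = new ArrayList<>();
        if (solrUrl != null) {
            // Solr path: one indexing step, no dedup and no merge.
            steps.add("solrindex " + solrUrl);
        } else {
            // Lucene path: index, then dedup, then merge (labels are made up).
            steps.add("index");
            steps.add("dedup");
            steps.add("merge");
        }
        return steps;
    }

    public static void main(String[] args) {
        String[] withSolr = {"urls", "-dir", "crawl", "-solr", "http://localhost:8983/solr"};
        String[] withoutSolr = {"urls", "-dir", "crawl"};
        System.out.println(planSteps(withSolr));
        System.out.println(planSteps(withoutSolr));
    }
}
```

The Solr URL in main is a placeholder; any reachable instance would be passed the same way.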
Re: Server suggestion
My mistake, you're right. The last processing clusters we built were using Xeon quad cores, not i7s. The i7s were search servers which didn't need ECC memory. AFAICT, Wikipedia is correct and the i7s don't yet support ECC.

So my suggestion would be to stick with Xeon procs or something that supports ECC for the processing clusters. I would never build a processing cluster that doesn't have ECC memory. We spent a few weeks when we first started trying to track down weird corruption checksum bugs ultimately related to using non-ECC memory on a cluster.

Dennis

Doğacan Güney wrote:

Hi Dennis,

On Fri, Jul 24, 2009 at 16:46, Dennis Kubes <ku...@apache.org> wrote:

fredericoagent wrote:

If I want to set up Nutch with let's say 400 million urls in the database, is it better to have 4-5 super fast and loaded servers or have 12-15 smaller, cheaper servers?

More smaller servers. Make sure they are energy efficient though and have a decent amount of RAM. If a server goes down, you aren't affected as much.

By superfast I mean cpu is latest quad core or latest six core processor with 6 Gigs Ram and 1. or 1.5 TB HD. By cheap I mean something like a Xeon quad core 2.26 cpu with 3 Gig Ram and 500 Sata HD. or if anyone can suggest a better spec ideal

Our first servers were 1Ghz (yes, really) running hadoop 0.04 way back when. Our first production clusters were core2, 4G ECC, 1 750G hard drive. These days we have been building i7 8-core, 12G ECC, 4T raid-5 machines with up to 8 disks, 2U, for around 2200.00 each. If you are looking for a good server builder check out swt.com. They are supermicro resellers and build solid machines.

It suggests here: http://en.wikipedia.org/wiki/Core_i7#Drawbacks that core i7's do not support ECC RAM. Have you run into any issues or is WP wrong here?

Suggestions.

Don't skimp on the hard drive, do at least 750G or more. Price difference is negligible.

Do at least 2G RAM, 4G is better, 8G is better than that. You can get up to 12G on regular motherboards these days. After that it gets much more expensive.

Do more recent processors, such as core2 or i7. They are more power efficient per processing unit.

If you want a really fast machine, do multiple disks in a raid-5 format.

Dennis
Re: Nutch dev. plans
Doğacan Güney wrote: On Fri, Jul 17, 2009 at 21:32, Andrzej Bialeckia...@getopt.org wrote: Doğacan Güney wrote: Hey list, On Fri, Jul 17, 2009 at 16:55, Andrzej Bialeckia...@getopt.org wrote: Hi all, I think we should be creating a sandbox area, where we can collaborate on various subprojects, such as HBase, OSGI, Tika parsers, etc. Dogacan will be importing his HBase work as 'nutchbase'. Tika work is the least disruptive, so it could occur even on trunk. OSGI plugins work (which I'd like to tackle) means significant refactoring so I'd rather put this on a branch too. Thanks for starting the discussion, Andrzej. Can you detail your OSGI plugin framework design? Maybe I missed the discussion but updating the plugin system has been something that I wanted to do for a long time :) so I am very much interested in your design. There's no specific design yet except I can't stand the existing plugin framework anymore ... ;) I started reading on OSGI and it seems that it supports the functionality that we need, and much more - it certainly looks like a better alternative than maintaining our plugin system beyond 1.x ... I think I remember a conversation a while back about this :) Not OSGI specifically but changing the plugin framework. I am all for changing it to something like OSGI though. Dennis Couldn't agree more with the can't stand plugin framework :D Any good links on OSGI stuff? Oh, an additional comment about the scoring API: I don't think the claimed benefits of OPIC outweigh the widespread complications that it caused in the API. Besides, getting the static scoring right is very very tricky, so from the engineer's point of view IMHO it's better to do the computation offline, where you have more control over the process and can easily re-run the computation, rather than rely on an online unstable algorithm that modifies scores in place ... Yeah, I am convinced :) . I am not done yet, but I think OPIC-like scoring will feel very natural in a hbase-backed nutch. 
Give me a couple more days to polish the scoring API then we can change it if you are not happy with it. Dogacan, you mentioned that you would like to work on Katta integration. Could you shed some light on how this fits with the abstract indexing searching layer that we now have, and how distributed Solr fits into this picture? I haven't yet given much thought to Katta integration. But basically, I am thinking of indexing newly-crawled documents as lucene shards and uploading them to katta for searching. This should be very possible with the new indexing system. But so far, I have neither studied katta too much nor given much thought to integration. So I may be missing obvious stuff. Me too.. About distributed solr: I very much like to do this and again, I think, this should be possible to do within nutch. However, distributed solr is ultimately uninteresting to me because (AFAIK) it doesn't have the reliability and high-availability that hadoophbase have, i.e. if a machine dies you lose that part of the index. Grant Ingersoll is doing some initial work on integrating distributed Solr and Zookeeper, once this is in a usable shape then I think perhaps it's more or less equivalent to Katta. I have a patch in my queue that adds direct Hadoop-Solr indexing, using Hadoop OutputFormat. So there will be many options to push index updates to distributed indexes. We just need to offer the right API to implement the integration, and the current API is IMHO quite close. Are there any projects going on that are live indexing systems like solr, yet are backed up by hadoop HDFS like katta? There is the Bailey.sf.net project that fits this description, but it's dormant - either it was too early, or there were just too many design questions (or simply the committers moved to other things). -- Best regards, Andrzej Bialecki ___. 
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Ranking Scoring Algorithm Pseudocode
There isn't any pseudocode for this. The code for the main algorithm is in the LinkRank class. It is similar in nature to PageRank except it has the ability to filter reciprocal links. If the Link Loops program is run it also has the ability to filter out link cycles, but that program is O(n) running time so not very efficient.

The LinkRank class is just a single score factor though, the setup of the new indexing system allows multiple factors to be combined where the LinkRank may be only a single factor in that.

If looking for how the algorithm works I suggest looking at the early PageRank algorithm papers. Here are some links which you may find useful:

http://en.wikipedia.org/wiki/PageRank
http://www.ianrogers.net/google-page-rank/

Dennis

atencorps wrote:

Hi,

I came across the Ranking Score system in Nutch 1.0 (which includes the webgraph, linkrank etc). My question is, where can I find the pseudocode for the Ranking Scoring Algorithm/System in place in Nutch 1.0?

Thanks
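Since the thread asks for pseudocode, here is a minimal sketch of the general idea described above: a PageRank-style iteration that can ignore reciprocal links. It is not the LinkRank class from Nutch. The class name, the method signature, the 0.85 damping factor and the handling of pages without outlinks are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative PageRank-style scoring with an optional reciprocal-link filter.
final class LinkRankSketch {

    static Map<String, Double> rank(Map<String, Set<String>> links,
                                    int iterations, double damping,
                                    boolean dropReciprocal) {
        // Collect every page that appears as a source or as a target.
        Set<String> nodes = new TreeSet<>(links.keySet());
        for (Set<String> targets : links.values()) {
            nodes.addAll(targets);
        }

        // Keep only the outlinks that survive the reciprocal-link filter.
        Map<String, List<String>> out = new HashMap<>();
        for (String src : nodes) {
            List<String> kept = new ArrayList<>();
            for (String dst : links.getOrDefault(src, Collections.emptySet())) {
                boolean reciprocal =
                    links.getOrDefault(dst, Collections.emptySet()).contains(src);
                if (!(dropReciprocal && reciprocal)) {
                    kept.add(dst);
                }
            }
            out.put(src, kept);
        }

        // Start from a uniform score and redistribute it along the kept links.
        Map<String, Double> score = new HashMap<>();
        for (String n : nodes) {
            score.put(n, 1.0 / nodes.size());
        }
        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>();
            for (String n : nodes) {
                next.put(n, (1.0 - damping) / nodes.size());
            }
            for (String src : nodes) {
                List<String> targets = out.get(src);
                if (targets.isEmpty()) {
                    continue; // assumption: pages without outlinks pass nothing on
                }
                double share = damping * score.get(src) / targets.size();
                for (String dst : targets) {
                    next.merge(dst, share, Double::sum);
                }
            }
            score = next;
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> links = new HashMap<>();
        links.put("a", new HashSet<>(Arrays.asList("b")));
        links.put("b", new HashSet<>(Arrays.asList("a"))); // a and b link to each other
        links.put("c", new HashSet<>(Arrays.asList("b")));
        System.out.println("filtered:   " + rank(links, 10, 0.85, true));
        System.out.println("unfiltered: " + rank(links, 10, 0.85, false));
    }
}
```

In the small graph in main, a and b link to each other. With the filter on, a gains nothing from b, so a pair of pages cannot raise each other's scores simply by exchanging links.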
Re: Ranking Algorithms
The answer is simple and not so simple at the same time. Last year we put in quite a bit of work to implement a stable PageRank-like algorithm in Nutch. This was released as the new scoring and indexing frameworks. That gives a good general relevancy score, but it is really a starting point.

Many people look at search engines and see a single algorithm, such as PageRank. In reality, a modern search engine, such as Google or Yahoo, will have hundreds of algorithms and jobs that contribute to the relevancy of search results. This is because of two factors:

1) After getting good general relevancy (i.e. link analysis and such), search relevancy is about handling specific relevancy issues. For example handling reciprocal links, near duplicate detection, organizations that own 100k domains, template pages, blogs and echo chambers, hacked pages and blogs with link and keyword spam, malware, etc. Each of these types of issues, and there are many more, requires specific algorithms to handle it. Google and Yahoo would have algorithms (and people who specialize in certain areas) to handle all of these types of issues, usually through statistical analysis and machine learning jobs. These jobs would then be aggregated together (think pipeline) to form final search engine relevancy scores. In all fairness, this is offline relevancy. There would also be a considerable amount of work done on query parsing and online relevancy.

2) Relevancy scores change over time due to people and companies attempting to manipulate search results through SEO (both good and bad), through culture in general, and through search engines working through better algorithms.

So this is a long way of explaining that while Nutch has IMO a good general relevancy currently, taking it to the next level, to where results are as good as Google's, is going to take many different specialized MapReduce jobs that we currently don't have.

Dennis

atencorps wrote:

Nutch is a great search engine and I was recently pleased when the large multinational I work for did some trials of Nutch vs Google when we were evaluating and looking for Enterprise search. I was glad to say Nutch was a worthy competitor; Google Enterprise was chosen only due to office politics (preferring a large company over a smaller one, etc.).

In terms of Enterprise Search I think Nutch already has it covered; my question is about Internet Search. PageRank has been around for over 10 years and is what built Google. Are there any newer, more capable ranking algorithms available, and is there any vision in terms of implementing a truly worthy ranking algorithm in Nutch that could deliver quality Internet Search results like Google?
Re: LinkRank why 10 iterations?
You are running LinkRank on a comparatively small webgraph. LinkRank is meant, in principle, to be run on very large webgraphs, millions or perhaps 100s of millions of urls. On that scale 10 iterations was what we saw as a good default for the webgraph to converge while not taking an excessive amount of time. For smaller webgraphs 10 iterations might not be necessary. You can use the link.analyze.num.iterations configuration variable to set the number of iterations you would like to run. As a general rule I don't think I would ever go below 5 iterations, as all but the very smallest webgraphs wouldn't have enough chance to converge. Here is a good paper on PageRank and convergence. Principles are the same: http://www.webworkshop.net/pagerank.html Dennis

Bartosz Gadzimski wrote: Hello, Why are you making so many iterations in LinkRank? Is this necessary for some amount of websites? Thanks, Bartosz
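To make the convergence point concrete, here is a minimal sketch of a damped, PageRank-style iteration on a toy four-node graph. It is not Nutch's LinkRank implementation: the class name, the graph, the damping factor of 0.85 and the update formula are all assumptions chosen for illustration. It only shows how the per-iteration change in scores shrinks as more iterations are run, which is the trade-off the iteration count controls.

```java
import java.util.Arrays;

/** Toy power iteration on a tiny link graph. Illustrative only, NOT Nutch's LinkRank code. */
class LinkRankConvergenceSketch {

  /** One damped update: score(v) = (1 - d) + d * sum over inlinks u of score(u) / outdegree(u). */
  static double[] step(int[][] outlinks, double[] scores, double damping) {
    double[] next = new double[scores.length];
    Arrays.fill(next, 1.0 - damping);
    for (int u = 0; u < outlinks.length; u++) {
      for (int v : outlinks[u]) {
        next[v] += damping * scores[u] / outlinks[u].length;
      }
    }
    return next;
  }

  /** Largest absolute per-node change between two score vectors. */
  static double maxDelta(double[] a, double[] b) {
    double max = 0.0;
    for (int i = 0; i < a.length; i++) {
      max = Math.max(max, Math.abs(a[i] - b[i]));
    }
    return max;
  }

  /** Runs the given number of iterations and records the max change seen in each one. */
  static double[] deltas(int[][] outlinks, int iterations, double damping) {
    double[] scores = new double[outlinks.length];
    Arrays.fill(scores, 1.0); // every node starts with the same score
    double[] deltas = new double[iterations];
    for (int i = 0; i < iterations; i++) {
      double[] next = step(outlinks, scores, damping);
      deltas[i] = maxDelta(scores, next);
      scores = next;
    }
    return deltas;
  }

  public static void main(String[] args) {
    // Hypothetical graph: node 0 links to 1 and 2, node 1 to 2, node 2 to 0, node 3 to 2.
    int[][] graph = { {1, 2}, {2}, {0}, {2} };
    double[] d = deltas(graph, 10, 0.85);
    for (int i = 0; i < d.length; i++) {
      System.out.printf("iteration %d: max change %.5f%n", i + 1, d[i]);
    }
  }
}
```

Running it with a lower iteration count leaves a larger residual change, which is why dropping far below the default is risky on anything but the smallest graphs.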
[jira] Closed: (NUTCH-291) OpenSearchServlet should return date as well as lastModified
[ https://issues.apache.org/jira/browse/NUTCH-291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-291. -- Resolution: Fixed The open search servlet has been superseded by formatters for serving results in xml and json format. Closing issue. OpenSearchServlet should return date as well as lastModified Key: NUTCH-291 URL: https://issues.apache.org/jira/browse/NUTCH-291 Project: Nutch Issue Type: Improvement Components: web gui Affects Versions: 0.8 Reporter: Stefan Neufeind Assignee: Dennis Kubes Attachments: NUTCH-291-unfinished.patch Currently lastModified is provided by OpenSearchServlet - but only in case the date lastModified-date is known. Since you can sort by date (which is lastModified or if not present the fetchdate), it might be useful if OpenSearchServlet could provide date as well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-729) NPE in FieldIndexer when BasicFields url doesn't exist
NPE in FieldIndexer when BasicFields url doesn't exist -- Key: NUTCH-729 URL: https://issues.apache.org/jira/browse/NUTCH-729 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 0.9.0, 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 There is a NullPointerException during a logging call in FieldIndexer when there isn't a url for a document. Documents shouldn't be without urls but since the FieldIndexer doesn't validate fields it is possible for it to occur. Most often this happens when BasicFields is run with the wrong segments directory and doesn't complain. It could also occur if using the FieldIndexer to index things other than basic fields. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-729) NPE in FieldIndexer when BasicFields url doesn't exist
[ https://issues.apache.org/jira/browse/NUTCH-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-729: --- Attachment: NUTCH-729-1-20090235.patch Simple patch. Changes the logging to use the key (which should be url and which should always exist). NPE in FieldIndexer when BasicFields url doesn't exist -- Key: NUTCH-729 URL: https://issues.apache.org/jira/browse/NUTCH-729 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 0.9.0, 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 Attachments: NUTCH-729-1-20090235.patch There is a NullPointerException during a logging call in FieldIndexer when there isn't a url for a document. Documents shouldn't be without urls but since the FieldIndexer doesn't validate fields it is possible for it to occur. Most often this happens when BasicFields is run with the wrong segments directory and doesn't complain. It could also occur if using the FieldIndexer to index things other than basic fields. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
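The kind of change described in the patch comment can be sketched as follows. This is not the FieldIndexer source: the class, method names and the map-of-fields representation are hypothetical, and only the idea (log the record key instead of dereferencing a field that may be missing) comes from the issue.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: hypothetical names, not the actual FieldIndexer code. */
class NullSafeLoggingSketch {

  /** Unsafe variant: dereferences the "url" field, which is null when the field is absent. */
  static String describeUnsafe(String key, Map<String, String> fields) {
    return "Indexing [" + fields.get("url").trim() + "]";
  }

  /** Safer variant: uses the record key, which is present for every record. */
  static String describeSafe(String key, Map<String, String> fields) {
    return "Indexing [" + key + "]";
  }

  public static void main(String[] args) {
    Map<String, String> noUrl = new HashMap<>(); // a document without a url field
    System.out.println(describeSafe("http://example.org/", noUrl));
    try {
      describeUnsafe("http://example.org/", noUrl);
    } catch (NullPointerException e) {
      System.out.println("unsafe variant threw NullPointerException");
    }
  }
}
```

Logging the key does not validate the document; it only keeps a diagnostic message from becoming the cause of the job failure.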
Re: [VOTE] Release Apache Nutch 1.0
+1, is this binding? :)

Doğacan Güney wrote: Another non-binding +1 from me. Hope this one is a keeper :D On Mon, Mar 23, 2009 at 22:28, Sami Siren ssi...@gmail.com wrote: Hello, I have packaged the third release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc2/ See the CHANGES.txt[1] file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/ The following issues that were discovered during the review of last rc have been fixed: https://issues.apache.org/jira/browse/NUTCH-722 https://issues.apache.org/jira/browse/NUTCH-723 https://issues.apache.org/jira/browse/NUTCH-725 https://issues.apache.org/jira/browse/NUTCH-726 https://issues.apache.org/jira/browse/NUTCH-727 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... Here's my +1 Thanks! [1] http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/CHANGES.txt?revision=757511 -- Sami Siren -- Doğacan Güney
[jira] Created: (NUTCH-730) NPE in LinkRank if no nodes with which to create the WebGraph
NPE in LinkRank if no nodes with which to create the WebGraph - Key: NUTCH-730 URL: https://issues.apache.org/jira/browse/NUTCH-730 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0, 1.1 For LinkRank, if there are no nodes to process, then a NullPointerException is thrown when trying to count number of nodes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-730) NPE in LinkRank if no nodes with which to create the WebGraph
[ https://issues.apache.org/jira/browse/NUTCH-730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-730: --- Attachment: NUTCH-730-1-20090325.patch Throws a more detailed error message if there are no nodes to process. This shouldn't happen on large web graphs but may happen on smaller webgraphs or webgraphs that are all inside one domain (including subdomains). NPE in LinkRank if no nodes with which to create the WebGraph - Key: NUTCH-730 URL: https://issues.apache.org/jira/browse/NUTCH-730 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0, 1.1 Attachments: NUTCH-730-1-20090325.patch For LinkRank, if there are no nodes to process, then a NullPointerException is thrown when trying to count number of nodes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
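A fail-fast check of the sort the patch comment describes might look like the sketch below. The class, method name, message text and the idea of reading the count from a single text line are assumptions for illustration; the real LinkRank code and its error message may differ.

```java
/** Illustrative only: hypothetical helper, not the actual LinkRank code. */
class EmptyWebGraphCheckSketch {

  /**
   * Parses the node count produced by a counting step. Instead of letting a null
   * value surface later as a NullPointerException, it fails with an explanation.
   */
  static int requireNodeCount(String counterLine) {
    if (counterLine == null || counterLine.trim().isEmpty()) {
      throw new IllegalStateException(
          "No nodes to process: the counter produced no output. "
              + "Is the webgraph empty, or are all links inside one domain?");
    }
    return Integer.parseInt(counterLine.trim());
  }

  public static void main(String[] args) {
    System.out.println(requireNodeCount("42\n"));
    try {
      requireNodeCount(null);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

The job still fails on an empty graph, but the failure now names the likely cause rather than a line number inside the counter.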
Re: [VOTE] Release Apache Nutch 1.0
Non-binding +1 too :) Sami Siren wrote: Hello, I have packaged the first release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc0/ See the included CHANGES.txt file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... Thanks! -- Sami Siren
Re: planning for nutch-1.0-rc1
Sorry about the docs being sparse on this. I will write more about the process as time permits. Don't know about the problem below. What platform are you running on, Windows or Linux? Dennis

Bartosz Gadzimski wrote: Hello, Thanks Dennis for updating the wiki, it helped a lot. You gave an example with indexing but you didn't say much about it. Can you write some more? :) Anyway, I have problems at the last step (Nutch from 07 March): bin/nutch org.apache.nutch.indexer.field.FieldIndexer It simply stops somewhere:

2009-03-07 16:09:04,432 INFO field.FieldIndexer - FieldIndexer: starting
2009-03-07 16:09:04,436 INFO field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/basicfields
2009-03-07 16:09:04,498 INFO field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/anchorfields
2009-03-07 16:09:05,636 INFO plugin.PluginRepository - Plugins: looking in: /usr/local/nutch/plugins
2009-03-07 16:09:06,437 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-03-07 16:09:06,437 INFO plugin.PluginRepository - Registered Plugins:
2009-03-07 16:09:06,437 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2009-03-07 16:09:06,437 INFO plugin.PluginRepository - Basic Query Filter (query-basic) plugins
2009-03-07 16:09:07,769 INFO field.FieldIndexer - IFD [Thread-11]: setInfoStream deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@1b4a74b
2009-03-07 16:09:07,769 INFO field.FieldIndexer - IW 0 [Thread-11]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/tmp/hadoop-root/mapred/local/index/_-884655313 autoCommit=true mergepolicy=org.apache.lucene.index.logbytesizemergepol...@15356d5 mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@69d02b ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBuffereDeleteTerms=-1 maxFieldLength=1 index=
2009-03-07 16:09:07,781 WARN mapred.LocalJobRunner - job_local_0001 java.lang.NullPointerException
 at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:139)
 at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:131)
 at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
 at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:239)
 at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:69)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
2009-03-07 16:09:08,197 FATAL field.FieldIndexer - FieldIndexer: java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
 at org.apache.nutch.indexer.field.FieldIndexer.index(FieldIndexer.java:267)
 at org.apache.nutch.indexer.field.FieldIndexer.run(FieldIndexer.java:312)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.nutch.indexer.field.FieldIndexer.main(FieldIndexer.java:275)

In crawl/indexes there is only a _temporary folder. I will try to debug this but I have problems running Nutch in Eclipse. Thanks, Bartosz

Dennis Kubes wrote: I don't know if I would make this primary yet. I need to check what is causing this as it worked fine for me, in fact we currently have it in production. Also we would need to update the shell scripts to integrate this more tightly. Dennis

Bartosz Gadzimski wrote: Sami Siren wrote: Andrzej Bialecki wrote: Sami Siren wrote: I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 morning (EET). There are still some issues marked as fix for 1.0 in Jira. Neither of the two remaining _bugs_ seems too important to me, actually I only count the issues assigned to developers as real candidates to be included in 1.0: NUTCH-578 (kubes) NUTCH-477 (ab) NUTCH-669 (siren) There's one Critical issue reported, related to NekoHTML (NUTCH-700). I'm not sure what the feature differences (pertinent to Nutch) are between 0.9.4 and 1.9.11 - perhaps downgrading is the safest course of action. I will take care of that. I am also volunteering to push all open issues to 1.1 before starting the RC build on Tuesday. Any objections on the proposed procedure or timing? Sounds good. great! -- Sami Siren What about new scoring and new indexing? Will it be integrated as a primary scoring algorithm? I have a problem with it on LinkRank:

2009-03-02 20:43:45,708 INFO webgraph.LinkRank - Starting link counter job
2009-03-02 20:43:47,838 INFO webgraph.LinkRank - Finished link counter job
2009-03-02 20:43:47,839 INFO webgraph.LinkRank - Reading numlinks temp file
2009-03-02 20:43:47,840 INFO webgraph.LinkRank - Deleting numlinks temp file
2009-03-02 20:43:47,842 FATAL webgraph.LinkRank - LinkAnalysis: java.lang.NullPointerException
Re: planning for nutch-1.0-rc1
NUTCH-578 was a while back but as I remember it worked fine. No objections to either including or pushing it. Dennis Sami Siren wrote: I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 morning (EET). There are still some issues marked as fix for 1.0 in Jira. Neither of the two remaining _bugs_ seems too important to me, actually I only count the issues assigned to developers as real candidates to be included in 1.0: NUTCH-578 (kubes) NUTCH-477 (ab) NUTCH-669 (siren) I am also volunteering to push all open issues to 1.1 before starting the RC build on Tuesday. Any objections on the proposed procedure or timing? -- Sami Siren
Re: planning for nutch-1.0-rc1
I don't know if I would make this primary yet. I need to check what is causing this as it worked fine for me, in fact we currently have it in production. Also we would need to update the shell scripts to integrate this more tightly. Dennis

Bartosz Gadzimski wrote: Sami Siren wrote: Andrzej Bialecki wrote: Sami Siren wrote: I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 morning (EET). There are still some issues marked as fix for 1.0 in Jira. Neither of the two remaining _bugs_ seems too important to me, actually I only count the issues assigned to developers as real candidates to be included in 1.0: NUTCH-578 (kubes) NUTCH-477 (ab) NUTCH-669 (siren) There's one Critical issue reported, related to NekoHTML (NUTCH-700). I'm not sure what the feature differences (pertinent to Nutch) are between 0.9.4 and 1.9.11 - perhaps downgrading is the safest course of action. I will take care of that. I am also volunteering to push all open issues to 1.1 before starting the RC build on Tuesday. Any objections on the proposed procedure or timing? Sounds good. great! -- Sami Siren What about new scoring and new indexing? Will it be integrated as a primary scoring algorithm? I have a problem with it on LinkRank:

2009-03-02 20:43:45,708 INFO webgraph.LinkRank - Starting link counter job
2009-03-02 20:43:47,838 INFO webgraph.LinkRank - Finished link counter job
2009-03-02 20:43:47,839 INFO webgraph.LinkRank - Reading numlinks temp file
2009-03-02 20:43:47,840 INFO webgraph.LinkRank - Deleting numlinks temp file
2009-03-02 20:43:47,842 FATAL webgraph.LinkRank - LinkAnalysis: java.lang.NullPointerException
 at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:113)
 at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:582)
 at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:657)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:627)

Another question: what about the indexing framework mentioned here: http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg11764.html Having all the new scoring and indexing would be a real step forward. Thanks, Bartosz
[jira] Commented: (NUTCH-477) Extend URLFilters to support different filtering chains
[ https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675907#action_12675907 ] Dennis Kubes commented on NUTCH-477: Same here. I am not against having extra functionality, but I don't think I have ever used the chain options of normalizers either. I guess the call is do we want it in 1.0 or not. My thinking is we are going to be doing major redesign changes post 1.0 so doing lots of code refactoring wouldn't be a big deal. Extend URLFilters to support different filtering chains --- Key: NUTCH-477 URL: https://issues.apache.org/jira/browse/NUTCH-477 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: urlfilters.patch I propose to make the following changes to URLFilters: * extend URLFilters so that they support different filtering rules depending on the context where they are executed. This functionality mirrors the one that URLNormalizers already support. * change their return value to an int code, in order to support early termination of long filtering chains. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
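The second proposed change, an int return code that allows early termination, can be sketched roughly as below. The code values, the interface and the chain runner are all hypothetical; the attached urlfilters.patch may define them quite differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical codes and interface, not the API from urlfilters.patch. */
class FilterChainSketch {
  static final int CONTINUE = 0; // no decision yet, ask the next filter
  static final int ACCEPT = 1;   // accept the url and stop the chain
  static final int REJECT = -1;  // reject the url and stop the chain

  /** A filter that returns one of the codes above instead of the url or null. */
  interface CodeFilter {
    int filter(String url);
  }

  /** Runs filters in order and stops at the first ACCEPT or REJECT. */
  static boolean accepts(List<CodeFilter> chain, String url) {
    for (CodeFilter f : chain) {
      int code = f.filter(url);
      if (code == ACCEPT) {
        return true;
      }
      if (code == REJECT) {
        return false;
      }
    }
    return true; // assumption: a url that no filter objects to is accepted
  }

  public static void main(String[] args) {
    List<CodeFilter> chain = new ArrayList<>();
    chain.add(u -> u.startsWith("http://") ? CONTINUE : REJECT);
    chain.add(u -> u.endsWith(".exe") ? REJECT : CONTINUE);
    System.out.println(accepts(chain, "http://example.org/index.html"));
    System.out.println(accepts(chain, "ftp://example.org/file"));
  }
}
```

With a plain url-or-null return value, a filter can only end the chain by rejecting; the extra code lets a filter end it by accepting as well, which is where the saving on long chains would come from.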
[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-666: --- Affects Version/s: (was: 1.0.0) 1.1 Fix Version/s: (was: 1.0.0) 1.1 Analysis plugins for multiple language and new Language Identifier Tool --- Key: NUTCH-666 URL: https://issues.apache.org/jira/browse/NUTCH-666 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 Attachments: NUTCH-666-1-20081126.patch Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, Russian, and Thai. Also includes a new Language Identifier tool that uses the new indexing framework in NUTCH-646. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666484#action_12666484 ] Dennis Kubes commented on NUTCH-666: It is ok to move to 1.1. Analysis plugins for multiple language and new Language Identifier Tool --- Key: NUTCH-666 URL: https://issues.apache.org/jira/browse/NUTCH-666 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 Attachments: NUTCH-666-1-20081126.patch Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, Russian, and Thai. Also includes a new Language Identifier tool that uses the new indexing framework in NUTCH-646. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Site update
http://www.mail-archive.com/d...@forrest.apache.org/msg15136.html This might help. Dennis

Andrzej Bialecki wrote: Otis Gospodnetic wrote: Below is what it spits out. I'm not sure what the cause is. I did try forrest seed forrest validate as prescribed at https://issues.apache.org/jira/browse/FOR-984?focusedCommentId=12649593#action_12649593 , but forrest validate failed. validate-sitemap: /home/otis/apache-forrest/main/webapp/resources/schema/relaxng/sitemap-v06.rng:72:31: error: datatype library "http://www.w3.org/2001/XMLSchema-datatypes" not recognized [...] No clue. I'd say that until we figure out what happens we can go forward - if it generates a consistent and usable output.
[jira] Closed: (NUTCH-594) Serve Nutch search results in multiple formats including XML and JSON
[ https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-594. -- Serve Nutch search results in multiple formats including XML and JSON - Key: NUTCH-594 URL: https://issues.apache.org/jira/browse/NUTCH-594 Project: Nutch Issue Type: New Feature Environment: all Reporter: Dennis Kubes Assignee: Dennis Kubes Attachments: commons-beanutils-1.8.0.jar, commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch, NUTCH-594-4-20081230.patch, NUTCH-594-5-20081231.patch Allow search results to be served in XML, JSON, and other configurable formats. Right now there is an OpenSearch servlet that returns results in RSS. I would like something that has more flexibility in terms of the XML being served and also supports other formats such as JSON or plain text. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-572) Scoring and redirected Urls
[ https://issues.apache.org/jira/browse/NUTCH-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660394#action_12660394 ] Dennis Kubes commented on NUTCH-572: I would like to close this issue. Redirect handling has undergone significant changes since this issue was opened and we still need to take a hard look at redirects and possibly how scores are represented. However, the newer scoring and indexing frameworks do work around this issue. Scoring and redirected Urls --- Key: NUTCH-572 URL: https://issues.apache.org/jira/browse/NUTCH-572 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.8, 0.8.1, 0.9.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 When a redirect is found for a given url, the new or end url is stored as the content page and the old CrawlDatum gets one of a few redirect codes. The page that gets indexed in Nutch is the end page and it gets indexed under the end url. Many times a site will have a significant number of links pointing to the start page and very few pointing to the redirected end page. This is especially true for external links. OPIC scores do not get transferred to the end page but stay with the start page (the one doing the redirecting). But the start page doesn't get indexed. Hence the end page will show up in the index, but usually with a much reduced score.
A good example of this is cnn.com:

URL: http://www.cnn.com/
Version: 6
Status: 5 (db_redir_perm)
Fetch time: Tue Dec 04 11:02:09 CST 2007
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 51.19438
Signature: b5baaf80e9e10aa6205fc39051c362ff
Metadata: _pst_:success(1), lastModified=0

which redirects to http://www.cnn.com/?refresh=1

URL: http://www.cnn.com/?refresh=1
Version: 6
Status: 2 (db_fetched)
Fetch time: Tue Dec 04 11:02:11 CST 2007
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: b5baaf80e9e10aa6205fc39051c362ff
Metadata: _pst_:success(1), lastModified=0

Now, cnn.com, which should be one of the highest, if not the highest, ranking sites in the index for keywords such as "news", in fact doesn't show up in the index, and its redirected end page appears much farther down in search results. My proposal is that we somehow make OPIC scores follow redirects. To do this we would most likely need to store a start and end url for redirected urls. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
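One way to picture "scores follow redirects" is the sketch below, which folds each start url's score into the url its redirect chain ends at. Everything here is an assumption made for illustration: the class and method names, the in-memory maps standing in for the CrawlDb, and in particular the choice to sum scores; the issue itself only proposes that the score should follow the redirect, not how.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: hypothetical in-memory model, not Nutch's CrawlDb or OPIC code. */
class RedirectScoreSketch {

  /** Follows start -> end mappings until a url with no redirect is reached (cycle-safe). */
  static String resolve(Map<String, String> redirects, String url) {
    Set<String> seen = new HashSet<>();
    String current = url;
    while (redirects.containsKey(current) && seen.add(current)) {
      current = redirects.get(current);
    }
    return current;
  }

  /** Credits every url's score to the end of its redirect chain (summing is an assumption). */
  static Map<String, Double> foldScores(Map<String, Double> scores, Map<String, String> redirects) {
    Map<String, Double> result = new HashMap<>();
    for (Map.Entry<String, Double> e : scores.entrySet()) {
      result.merge(resolve(redirects, e.getKey()), e.getValue(), Double::sum);
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Double> scores = new HashMap<>();
    scores.put("http://www.cnn.com/", 51.19438);
    scores.put("http://www.cnn.com/?refresh=1", 1.0);
    Map<String, String> redirects = new HashMap<>();
    redirects.put("http://www.cnn.com/", "http://www.cnn.com/?refresh=1");
    System.out.println(foldScores(scores, redirects));
  }
}
```

The redirect map is the "start and end url" record the proposal says would need to be stored.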
[jira] Issue Comment Edited: (NUTCH-594) Serve Nutch search results in multiple formats including XML and JSON
[ https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659825#action_12659825 ] musepwizard edited comment on NUTCH-594 at 12/30/08 6:56 AM: -- JSON-Lib and EZMorph are both under Apache. There is an optional Xom library dependency for JSON-Lib which is not included, that is under LGPL, but everything else is Apache. http://json-lib.sourceforge.net/license.html http://ezmorph.sourceforge.net/license.html I put comments about these in the plugin.xml file for response-json. Is there anything else I need to do? was (Author: musepwizard): JSON-Lib and EZMorph are both under Apache. There is an optional Xom library dependency for JSON-Lib which is not included, that is under LGPL, but everything is Apache. http://json-lib.sourceforge.net/license.html http://ezmorph.sourceforge.net/license.html I put comments about these in the plugin.xml file for response-json. Is there anything else I need to do? Serve Nutch search results in multiple formats including XML and JSON - Key: NUTCH-594 URL: https://issues.apache.org/jira/browse/NUTCH-594 Project: Nutch Issue Type: New Feature Environment: all Reporter: Dennis Kubes Assignee: Dennis Kubes Attachments: commons-beanutils-1.8.0.jar, commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch Allow search results to be served in XML, JSON, and other configurable formats. Right now there is an OpenSearch servlet that returns results in RSS. I would like something that has more flexibility in terms of the XML being served and also supports other formats such as JSON or plain text. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-594) Serve Nutch search results in multiple formats including XML and JSON
[ https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659825#action_12659825 ] Dennis Kubes commented on NUTCH-594: JSON-Lib and EZMorph are both under Apache. There is an optional Xom library dependency for JSON-Lib which is not included, that is under LGPL, but everything is Apache. http://json-lib.sourceforge.net/license.html http://ezmorph.sourceforge.net/license.html I put comments about these in the plugin.xml file for response-json. Is there anything else I need to do? Serve Nutch search results in multiple formats including XML and JSON - Key: NUTCH-594 URL: https://issues.apache.org/jira/browse/NUTCH-594 Project: Nutch Issue Type: New Feature Environment: all Reporter: Dennis Kubes Assignee: Dennis Kubes Attachments: commons-beanutils-1.8.0.jar, commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch Allow search results to be served in XML, JSON, and other configurable formats. Right now there is an OpenSearch servlet that returns results in RSS. I would like something that has more flexibility in terms of the XML being served and also supports other formats such as JSON or plain text. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-594) Serve Nutch search results in multiple formats including XML and JSON
[ https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-594: --- Attachment: NUTCH-594-4-20081230.patch Final patch. Adds the ability to stop summaries from being returned and to only return a given set of fields by name. Serve Nutch search results in multiple formats including XML and JSON - Key: NUTCH-594 URL: https://issues.apache.org/jira/browse/NUTCH-594 Project: Nutch Issue Type: New Feature Environment: all Reporter: Dennis Kubes Assignee: Dennis Kubes Attachments: commons-beanutils-1.8.0.jar, commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch, NUTCH-594-4-20081230.patch Allow search results to be served in XML, JSON, and other configurable formats. Right now there is an OpenSearch servlet that returns results in RSS. I would like something that has more flexibility in terms of the XML being served and also supports other formats such as JSON or plain text. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-668) Domain URL Filter
[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-668. Resolution: Fixed Committed with revision 729958. Domain URL Filter - Key: NUTCH-668 URL: https://issues.apache.org/jira/browse/NUTCH-668 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: NUTCH-668-1-20081202.patch, NUTCH-668-2-20081204.patch, NUTCH-668-3-20081213.patch A URLFilter that adds the ability to filter out URLs by top level domain or by hostname. A configuration file with a listing of URLs is used to denote accepted urls. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
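The matching rule in the description (accept a url if its hostname, a parent domain, or its top level domain is listed) can be sketched as below. This is not the committed plugin: the class name and the in-memory set standing in for the configuration file are assumptions, and the return convention (the url when accepted, null when rejected) is modelled on how Nutch URL filters are generally described.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: hypothetical stand-in, not the NUTCH-668 plugin code. */
class DomainFilterSketch {

  /** Returns the url if its host, a parent domain, or its TLD is in the accepted set, else null. */
  static String filter(Set<String> accepted, String url) {
    String host;
    try {
      host = URI.create(url).getHost();
    } catch (IllegalArgumentException e) {
      return null; // unparsable urls are rejected in this sketch
    }
    if (host == null) {
      return null;
    }
    String candidate = host.toLowerCase(Locale.ROOT);
    while (true) {
      if (accepted.contains(candidate)) {
        return url;
      }
      int dot = candidate.indexOf('.');
      if (dot < 0) {
        return null; // reached the TLD without a match
      }
      candidate = candidate.substring(dot + 1); // strip the leftmost label and retry
    }
  }

  public static void main(String[] args) {
    Set<String> accepted = new HashSet<>(Arrays.asList("apache.org", "net"));
    System.out.println(filter(accepted, "http://lucene.apache.org/nutch/"));
    System.out.println(filter(accepted, "http://www.cnn.com/"));
  }
}
```

Stripping one label at a time means a single entry such as a top level domain accepts every host beneath it, so the accepted list stays short.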
[jira] Updated: (NUTCH-594) Serve Nutch search results in XML and JSON
[ https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-594: --- Attachment: ezmorph-1.0.6.jar ezmorph jar required for framework Serve Nutch search results in XML and JSON -- Key: NUTCH-594 URL: https://issues.apache.org/jira/browse/NUTCH-594 Project: Nutch Issue Type: New Feature Environment: all Reporter: Dennis Kubes Assignee: Dennis Kubes Attachments: ezmorph-1.0.6.jar, NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch Allow search results to be served in XML, JSON, and other configurable formats. Right now there is an OpenSearch servlet that returns results in RSS. I would like something that has more flexibility in terms of the XML being served and also supports other formats such as JSON or plain text. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-594) Serve Nutch search results in XML and JSON
[ https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-594: --- Attachment: NUTCH-594-3-20081229.patch A completely reworked framework with an extension point for serving search results in different formats. Included are plugins for serving results in XML and JSON format. XML is the default. Uses JSON-Lib to convert the results into JSON format. Serve Nutch search results in XML and JSON -- Key: NUTCH-594 URL: https://issues.apache.org/jira/browse/NUTCH-594 Project: Nutch Issue Type: New Feature Environment: all Reporter: Dennis Kubes Assignee: Dennis Kubes Attachments: ezmorph-1.0.6.jar, NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch Allow search results to be served in XML, JSON, and other configurable formats. Right now there is an OpenSearch servlet that returns results in RSS. I would like something that has more flexibility in terms of the XML being served and also supports other formats such as JSON or plain text. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-594) Serve Nutch search results in multiple formats including XML and JSON
[ https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-594: --- Summary: Serve Nutch search results in multiple formats including XML and JSON (was: Serve Nutch search results in XML and JSON) Serve Nutch search results in multiple formats including XML and JSON - Key: NUTCH-594 URL: https://issues.apache.org/jira/browse/NUTCH-594 Project: Nutch Issue Type: New Feature Environment: all Reporter: Dennis Kubes Assignee: Dennis Kubes Attachments: commons-beanutils-1.8.0.jar, commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch Allow search results to be served in XML, JSON, and other configurable formats. Right now there is an OpenSearch servlet that returns results in RSS. I would like something that has more flexibility in terms of the XML being served and also supports other formats such as JSON or plain text. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-594) Serve Nutch search results in XML and JSON
[ https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-594: --- Attachment: commons-beanutils-1.8.0.jar Commons BeanUtils jar. Serve Nutch search results in XML and JSON -- Key: NUTCH-594 URL: https://issues.apache.org/jira/browse/NUTCH-594 Project: Nutch Issue Type: New Feature Environment: all Reporter: Dennis Kubes Assignee: Dennis Kubes Attachments: commons-beanutils-1.8.0.jar, commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch Allow search results to be served in XML, JSON, and other configurable formats. Right now there is an OpenSearch servlet that returns results in RSS. I would like something that has more flexibility in terms of the XML being served and also supports other formats such as JSON or plain text.
[jira] Updated: (NUTCH-594) Serve Nutch search results in XML and JSON
[ https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-594: --- Attachment: commons-collections-3.2.1.jar Commons Collections jar. Serve Nutch search results in XML and JSON -- Key: NUTCH-594 URL: https://issues.apache.org/jira/browse/NUTCH-594 Project: Nutch Issue Type: New Feature Environment: all Reporter: Dennis Kubes Assignee: Dennis Kubes Attachments: commons-beanutils-1.8.0.jar, commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch Allow search results to be served in XML, JSON, and other configurable formats. Right now there is an OpenSearch servlet that returns results in RSS. I would like something that has more flexibility in terms of the XML being served and also supports other formats such as JSON or plain text.
[jira] Updated: (NUTCH-594) Serve Nutch search results in XML and JSON
[ https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-594: --- Attachment: json-lib-2.2.2-jdk15.jar JSON-Lib jar. Serve Nutch search results in XML and JSON -- Key: NUTCH-594 URL: https://issues.apache.org/jira/browse/NUTCH-594 Project: Nutch Issue Type: New Feature Environment: all Reporter: Dennis Kubes Assignee: Dennis Kubes Attachments: commons-beanutils-1.8.0.jar, commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch Allow search results to be served in XML, JSON, and other configurable formats. Right now there is an OpenSearch servlet that returns results in RSS. I would like something that has more flexibility in terms of the XML being served and also supports other formats such as JSON or plain text.
[jira] Updated: (NUTCH-594) Serve Nutch search results in multiple formats including XML and JSON
[ https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-594: --- Attachment: (was: NUTCH-594-3-20081229.patch) Serve Nutch search results in multiple formats including XML and JSON - Key: NUTCH-594 URL: https://issues.apache.org/jira/browse/NUTCH-594 Project: Nutch Issue Type: New Feature Environment: all Reporter: Dennis Kubes Assignee: Dennis Kubes Attachments: commons-beanutils-1.8.0.jar, commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, NUTCH-594-1-20071221.patch Allow search results to be served in XML, JSON, and other configurable formats. Right now there is an OpenSearch servlet that returns results in RSS. I would like something that has more flexibility in terms of the XML being served and also supports other formats such as JSON or plain text.
[jira] Updated: (NUTCH-594) Serve Nutch search results in multiple formats including XML and JSON
[ https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-594: --- Attachment: NUTCH-594-3-20081229.patch Fixed some things. Added the ability to set the MIME output type using the plugin.xml file. That way people can use application/json, text/json, or text/plain, whichever they want for their application. Serve Nutch search results in multiple formats including XML and JSON - Key: NUTCH-594 URL: https://issues.apache.org/jira/browse/NUTCH-594 Project: Nutch Issue Type: New Feature Environment: all Reporter: Dennis Kubes Assignee: Dennis Kubes Attachments: commons-beanutils-1.8.0.jar, commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch Allow search results to be served in XML, JSON, and other configurable formats. Right now there is an OpenSearch servlet that returns results in RSS. I would like something that has more flexibility in terms of the XML being served and also supports other formats such as JSON or plain text.
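The idea of pluggable response formats, each with its own configurable MIME type, can be sketched roughly as follows. This is only an illustration in Python, not the code in the patch: the registry, the function names, and the XML element names are all hypothetical, and in the real plugin the MIME type would come from plugin.xml rather than a function argument.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical registry: format name -> (MIME type, serializer function).
# In the patch this role is played by an extension point plus plugin.xml.
WRITERS = {}

def register(name, mime_type, serializer):
    """Register a response writer under a format name."""
    WRITERS[name] = (mime_type, serializer)

def to_xml(results):
    """Serialize a list of flat dicts as a simple XML document."""
    root = ET.Element("results")
    for hit in results:
        node = ET.SubElement(root, "result")
        for key, value in hit.items():
            ET.SubElement(node, key).text = str(value)
    return ET.tostring(root, encoding="unicode")

def to_json(results):
    """Serialize a list of flat dicts as a JSON object."""
    return json.dumps({"results": results})

register("xml", "text/xml", to_xml)
register("json", "application/json", to_json)

def render(results, fmt="xml"):
    """Return (MIME type, body) for the requested format; XML is the default."""
    mime_type, serializer = WRITERS[fmt]
    return mime_type, serializer(results)
```

A deployment that prefers text/json or text/plain would only change the registered MIME string; the serializer stays the same.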
Re: [jira] Commented: (NUTCH-675) Reduce tasks do not report their status and are killed by jobtracker
This is old. It has been fixed in more recent versions of Hadoop and Nutch. Otis Gospodnetic (JIRA) wrote: [ https://issues.apache.org/jira/browse/NUTCH-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658610#action_12658610 ] Otis Gospodnetic commented on NUTCH-675: Sha Feng, could you please bring this up on the Nutch mailing list instead of JIRA? It would also be good if you could upgrade your Nutch (including Hadoop) and see if it works then. 0.12 is a VERY old version of Hadoop. Reduce tasks do not report their status and are killed by jobtracker Key: NUTCH-675 URL: https://issues.apache.org/jira/browse/NUTCH-675 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.9.0 Environment: OS : Linux Reporter: sha feng Fix For: 0.9.0 We chose Fetcher2 as our fetcher. The map tasks of Fetcher2 fetch about 2,000,000 URLs, but at the reduce stage none of the reduce tasks can report their status, and they are killed by the jobtracker. Although we changed mapred.task.timeout from 60,000 to 1,800,000, it still does not work. So, who can tell us why? By the way, the version of Nutch we use is 0.9 and the version of Hadoop is 0.12. Thanks for your help!
[jira] Commented: (NUTCH-668) Domain URL Filter
[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658118#action_12658118 ] Dennis Kubes commented on NUTCH-668: Does anybody have a problem if I commit this today or tomorrow? Domain URL Filter - Key: NUTCH-668 URL: https://issues.apache.org/jira/browse/NUTCH-668 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: NUTCH-668-1-20081202.patch, NUTCH-668-2-20081204.patch, NUTCH-668-3-20081213.patch A URLFilter that adds the ability to filter out URLs by top level domain or by hostname. A configuration file with a listing of URLs is used to denote accepted urls.
Re: File system
If you are talking about the Nutch Content objects that are stored in the segments while pages are fetched, then you would need to write a MapReduce job that reads in the Content objects and does whatever processing you desire. Dennis oSilvio wrote: Very useful information, thanks! But in order to extract the data inside those files (like HTML pages) I can find neither a tool provided by Nutch nor a description of the process used to store the data. Do you know if it is possible to extract it using Lucene? Dennis Kubes-2 wrote: The Nutch databases are in either SequenceFile or MapFile format, both of which store key and value pairs. Their keys and values are Writable implementations, which translate an object into its byte equivalent and vice versa. The data and index files together make up the MapFile format: data is a SequenceFile, and index is the index that a MapFile uses for seeking to a specific key. Please see the Hadoop wiki for more information about Sequence and Map files and Writable formats. Dennis oSilvio wrote: Does somebody know how the file structure works, briefly? It seems that the data is compressed or something; it's not possible to understand what's recorded in the data or index files. Thanks Silvio
Re: File system
The Nutch databases are in either SequenceFile or MapFile format, both of which store key and value pairs. Their keys and values are Writable implementations, which translate an object into its byte equivalent and vice versa. The data and index files together make up the MapFile format: data is a SequenceFile, and index is the index that a MapFile uses for seeking to a specific key. Please see the Hadoop wiki for more information about Sequence and Map files and Writable formats. Dennis oSilvio wrote: Does somebody know how the file structure works, briefly? It seems that the data is compressed or something; it's not possible to understand what's recorded in the data or index files. Thanks Silvio
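To make the Writable idea concrete, here is a toy sketch in Python of an object that writes itself to a byte stream and reads itself back, with keys and values stored as alternating records. It is deliberately not the real on-disk layout: actual SequenceFiles also carry a header, sync markers, and optional compression, and the class and function names below are made up for illustration.

```python
import io
import struct

class TextWritable:
    """Toy stand-in for a Hadoop Writable: a length-prefixed UTF-8 string."""

    def __init__(self, value=""):
        self.value = value

    def write(self, out):
        # 4-byte big-endian length, followed by the encoded bytes.
        data = self.value.encode("utf-8")
        out.write(struct.pack(">I", len(data)))
        out.write(data)

    def read_fields(self, inp):
        (length,) = struct.unpack(">I", inp.read(4))
        self.value = inp.read(length).decode("utf-8")

def write_pairs(pairs):
    """Serialize (key, value) string pairs as alternating records."""
    buf = io.BytesIO()
    for key, value in pairs:
        TextWritable(key).write(buf)
        TextWritable(value).write(buf)
    return buf.getvalue()

def read_pairs(data):
    """Read back the pairs written by write_pairs."""
    inp = io.BytesIO(data)
    pairs = []
    while inp.tell() < len(data):
        key, value = TextWritable(), TextWritable()
        key.read_fields(inp)
        value.read_fields(inp)
        pairs.append((key.value, value.value))
    return pairs
```

This is also why the files look like binary noise in a text editor: the bytes are only meaningful to the class that wrote them.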
[jira] Closed: (NUTCH-448) Allow Plugin Includes and Excludes from File
[ https://issues.apache.org/jira/browse/NUTCH-448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-448. -- Resolution: Later This was some old functionality that seemed good at the time. Not so much now. Allow Plugin Includes and Excludes from File Key: NUTCH-448 URL: https://issues.apache.org/jira/browse/NUTCH-448 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0 Environment: all platforms Reporter: Dennis Kubes Assignee: Dennis Kubes Priority: Minor Fix For: 1.0.0 Attachments: plugin-fromfile.patch This functionality allows the plugin.includes and plugin.excludes values to be moved out of the nutch-default.xml and nutch-site.xml files and loaded from one or more text configuration files found in the classpath. This is a cleaner implementation than having one big long regular expression in the configuration file as plugin.includes or plugin.excludes. Loads plugin configuration from files defined by the plugin.files configuration variable. Files must be available on the classpath. The plugin files consist of one regex per line. Lines starting with a - will be excluded while lines starting with a # will be ignored. All other non-blank lines will be included as plugins, one per line. Any plugins configured through plugin.includes and plugin.excludes in the configuration are also added. Any plugins that are excluded are removed from the includes.
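The file format described above is simple enough to sketch. The following Python is only an illustration of the stated rules, not the code in plugin-fromfile.patch; the function name is hypothetical, and as a simplifying assumption an exclude removes an include only when the two pattern strings are identical.

```python
def load_plugin_patterns(lines, includes=(), excludes=()):
    """Parse plugin file lines into (included, excluded) pattern lists.

    `includes` and `excludes` stand in for values already configured
    through plugin.includes and plugin.excludes.
    """
    included = list(includes)
    excluded = list(excludes)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        if line.startswith("-"):
            excluded.append(line[1:].strip())
        else:
            included.append(line)
    # Excluded patterns are removed from the includes (exact string match here).
    return [p for p in included if p not in excluded], excluded
```

For example, a file containing `protocol-http`, `parse-(text|html)`, `-parse-pdf`, and `parse-pdf` would yield the first two patterns as includes and `parse-pdf` as the only exclude.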
[jira] Commented: (NUTCH-646) New Indexing Framework for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654154#action_12654154 ] Dennis Kubes commented on NUTCH-646: Not yet. I need to write up some serious documentation about how to use both the new scoring and indexing systems. I will try to get to that soon. New Indexing Framework for Nutch Key: NUTCH-646 URL: https://issues.apache.org/jira/browse/NUTCH-646 Project: Nutch Issue Type: New Feature Components: indexer Affects Versions: 0.9.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 0.9.0, 1.0.0 Attachments: arity-1.3.2.jar, NUTCH-646-1-20080818.patch, NUTCH-646-2-20081126.patch New indexing framework for Nutch that provides a more generic field abstraction consistent with Lucene index semantics. Allows multiple MR jobs to be created for different fields and those fields to be aggregated and indexed in the end. Overcomes limitations of the current indexer that limits what databases are passed into the indexer. Creates a new extension point as well for field-filters for manipulation of fields during the indexing process.
Domain URL filter Commit?
Anybody have a problem with me committing the domain-urlfilter plugin in NUTCH-668? Dennis
[jira] Commented: (NUTCH-668) Domain URL Filter
[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653881#action_12653881 ] Dennis Kubes commented on NUTCH-668: I agree. Being able to search for TLDs like .com would make it much more flexible. Let me work up the changes and I will post a new patch (without my local path :)). Although I do want to get this in quickly I think the new functionality is worth the wait. Domain URL Filter - Key: NUTCH-668 URL: https://issues.apache.org/jira/browse/NUTCH-668 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: NUTCH-668-1-20081202.patch, NUTCH-668-2-20081204.patch A URLFilter that adds the ability to filter out URLs by top level domain or by hostname. A configuration file with a listing of URLs is used to denote accepted urls.
Builds are Failing
After the upgrade to Hadoop, builds are failing because I think we have nutch set to build with Java 5 by default but I think Hadoop is built with Java 6 (At least the release version that I downloaded and used to upgrade Nutch). I know we aren't requiring Nutch to use Java 6 yet. This may force the point. I don't know if Hadoop will build with Java 5. I will test it out and post back results. If it does, then options are: 1) Force Nutch to use Java 6 2) Rebuild Hadoop from source instead of release version using Java 5 Thoughts? Dennis
Re: Builds are Failing
I take it back. Hadoop *requires* java 6 now as of 0.19. Which means we should be making changes to require Nutch to use java 6. Dennis Dennis Kubes wrote: After the upgrade to Hadoop, builds are failing because I think we have nutch set to build with Java 5 by default but I think Hadoop is built with Java 6 (At least the release version that I downloaded and used to upgrade Nutch). I know we aren't requiring Nutch to use Java 6 yet. This may force the point. I don't know if Hadoop will build with Java 5. I will test it out and post back results. If it does, then options are: 1) Force Nutch to use Java 6 2) Rebuild Hadoop from source instead of release version using Java 5 Thoughts? Dennis
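One way to check which Java version a jar's classes were compiled for is to read the class file header. The sketch below, in Python, assumes only the standard class file layout (a 4-byte magic number 0xCAFEBABE, then 2-byte minor and major versions, with major 49 for Java 5 and 50 for Java 6); the function name is made up for illustration and this is not part of the Nutch build.

```python
import struct

# Class file major versions relevant to this thread.
JAVA_BY_MAJOR = {49: "Java 5", 50: "Java 6"}

def class_file_target(header):
    """Return the Java release a .class file targets, given its first 8 bytes."""
    magic, _minor, major = struct.unpack(">IHH", header[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return JAVA_BY_MAJOR.get(major, "major version %d" % major)
```

A class extracted from a jar built with Java 6 would report "Java 6" here, and a Java 5 compiler or JVM would reject it.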
[jira] Updated: (NUTCH-668) Domain URL Filter
[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-668: --- Attachment: NUTCH-668-2-20081204.patch Updated to include URLUtil methods that were missing. Sorry. Domain URL Filter - Key: NUTCH-668 URL: https://issues.apache.org/jira/browse/NUTCH-668 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: NUTCH-668-1-20081202.patch, NUTCH-668-2-20081204.patch A URLFilter that adds the ability to filter out URLs by top level domain or by hostname. A configuration file with a listing of URLs is used to denote accepted urls. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-207) Bandwidth target for fetcher rather than a thread count
[ https://issues.apache.org/jira/browse/NUTCH-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653404#action_12653404 ] Dennis Kubes commented on NUTCH-207: I think this would be an interesting addition. It would need to be ported to Fetcher2 as well as Fetcher. If you want to take on the task of porting it that would be great. If you have any questions feel free to ask. Bandwidth target for fetcher rather than a thread count --- Key: NUTCH-207 URL: https://issues.apache.org/jira/browse/NUTCH-207 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 0.8 Reporter: Rod Taylor Attachments: ratelimit.patch Increases or decreases the number of threads from the starting value (fetcher.threads.fetch) up to a maximum (fetcher.threads.maximum) to achieve a target bandwidth (fetcher.threads.bandwidth). It seems to be able to keep within 10% of the target bandwidth even when large numbers of errors are found or when a number of large pages are encountered. To achieve more accurate tracking, Nutch should keep track of protocol overhead as well as the volume of pages downloaded.
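The control loop described in the issue might look something like the sketch below. It is a guess at the general shape, written in Python, and not the logic of ratelimit.patch: the function name, the one-thread step size, the 10% tolerance band, and the floor of one thread are all assumptions made for illustration.

```python
def adjust_threads(current, measured_bps, target_bps, maximum, tolerance=0.10):
    """Return the next thread count given the measured bandwidth.

    current      -- threads running now (initially fetcher.threads.fetch)
    measured_bps -- bandwidth observed over the last interval
    target_bps   -- desired bandwidth (fetcher.threads.bandwidth)
    maximum      -- upper bound on threads (fetcher.threads.maximum)
    """
    low = target_bps * (1 - tolerance)
    high = target_bps * (1 + tolerance)
    if measured_bps < low and current < maximum:
        return current + 1  # under target: add a thread
    if measured_bps > high and current > 1:
        return current - 1  # over target: drop a thread
    return current          # within the band, or already at a bound
```

Calling this once per measurement interval nudges the fetcher toward the target without oscillating while the measured rate sits inside the tolerance band.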
[jira] Closed: (NUTCH-635) LinkAnalysis Tool for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-635. -- LinkAnalysis Tool for Nutch --- Key: NUTCH-635 URL: https://issues.apache.org/jira/browse/NUTCH-635 Project: Nutch Issue Type: New Feature Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, NUTCH-635-7-20080808.patch, NUTCH-635-9-20081126.patch This is a basic PageRank-type link analysis tool for Nutch which simulates a sparse matrix using inlinks and outlinks and converges after a given number of iterations. This tool is meant to replace the current scoring system in Nutch with a system that converges instead of exponentially increasing scores. Also includes a tool to create an outlinkdb.
[jira] Resolved: (NUTCH-635) LinkAnalysis Tool for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-635. Resolution: Fixed Committed with revision 723441 LinkAnalysis Tool for Nutch --- Key: NUTCH-635 URL: https://issues.apache.org/jira/browse/NUTCH-635 Project: Nutch Issue Type: New Feature Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, NUTCH-635-7-20080808.patch, NUTCH-635-9-20081126.patch This is a basic PageRank-type link analysis tool for Nutch which simulates a sparse matrix using inlinks and outlinks and converges after a given number of iterations. This tool is meant to replace the current scoring system in Nutch with a system that converges instead of exponentially increasing scores. Also includes a tool to create an outlinkdb.
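For readers unfamiliar with this family of algorithms, a textbook PageRank-style iteration over inlinks and outlinks can be sketched as below. This is a small in-memory Python illustration, not the MapReduce implementation in the patch; the function name, the damping factor of 0.85, and the uniform starting score are assumptions, and pages without outlinks simply leak score in this simplified version.

```python
def link_rank(outlinks, iterations=10, damping=0.85):
    """Iteratively score pages from a dict of {page: [pages it links to]}."""
    nodes = set(outlinks)
    for targets in outlinks.values():
        nodes.update(targets)

    # Invert the outlinks into inlinks: {page: [pages linking to it]}.
    inlinks = {n: [] for n in nodes}
    for src, targets in outlinks.items():
        for dst in targets:
            inlinks[dst].append(src)

    scores = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_scores = {}
        for n in nodes:
            # Each inlinking page passes on an equal share of its score.
            incoming = sum(scores[s] / len(outlinks[s]) for s in inlinks[n])
            new_scores[n] = (1 - damping) / len(nodes) + damping * incoming
        scores = new_scores
    return scores
```

Because every page only redistributes the score it already has, the totals stay bounded from one iteration to the next, which is the convergence property the issue contrasts with exponentially increasing scores.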
[jira] Commented: (NUTCH-646) New Indexing Framework for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653489#action_12653489 ] Dennis Kubes commented on NUTCH-646: For the final version of this I have removed the arity dependencies and computation functionality. I still think that type of functionality is needed but it didn't feel like the right place for it at this time. New Indexing Framework for Nutch Key: NUTCH-646 URL: https://issues.apache.org/jira/browse/NUTCH-646 Project: Nutch Issue Type: New Feature Components: indexer Affects Versions: 0.9.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 0.9.0, 1.0.0 Attachments: arity-1.3.2.jar, NUTCH-646-1-20080818.patch, NUTCH-646-2-20081126.patch New indexing framework for Nutch that provides a more generic field abstraction consistent with Lucene index semantics. Allows multiple MR jobs to be created for different fields and those fields to be aggregated and indexed in the end. Overcomes limitations of the current indexer that limits what databases are passed into the indexer. Creates a new extension point as well for field-filters for manipulation of fields during the indexing process.
[jira] Resolved: (NUTCH-646) New Indexing Framework for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-646. Resolution: Fixed Committed with revision 723447 New Indexing Framework for Nutch Key: NUTCH-646 URL: https://issues.apache.org/jira/browse/NUTCH-646 Project: Nutch Issue Type: New Feature Components: indexer Affects Versions: 0.9.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0, 0.9.0 Attachments: arity-1.3.2.jar, NUTCH-646-1-20080818.patch, NUTCH-646-2-20081126.patch New indexing framework for Nutch that provides a more generic field abstraction consistent with Lucene index semantics. Allows multiple MR jobs to be created for different fields and those fields to be aggregated and indexed in the end. Overcomes limitations of the current indexer that limits what databases are passed into the indexer. Creates a new extension point as well for field-filters for manipulation of fields during the indexing process. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
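The aggregation step, where fields produced by separate jobs are merged per document before indexing, can be pictured with the following Python sketch. It is an in-memory illustration only: the function name and the (url, field name, value) tuple shape are hypothetical and are not the framework's actual types.

```python
from collections import defaultdict

def aggregate_fields(*job_outputs):
    """Merge (url, field_name, value) tuples from several jobs into one
    {url: {field_name: [values]}} mapping, ready to be indexed."""
    docs = defaultdict(dict)
    for output in job_outputs:
        for url, name, value in output:
            # A field may be multi-valued, so values accumulate in a list.
            docs[url].setdefault(name, []).append(value)
    return dict(docs)
```

Each job only needs to know about the fields it produces; the final merge is keyed on the URL.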
[jira] Resolved: (NUTCH-662) Upgrade Nutch to use Lucene 2.4
[ https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-662. Resolution: Fixed Committed with revision 722475 Upgrade Nutch to use Lucene 2.4 --- Key: NUTCH-662 URL: https://issues.apache.org/jira/browse/NUTCH-662 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: lucene-analyzers-2.4.0.jar, lucene-core-2.4.0.jar, lucene-misc-2.4.0.jar, NUTCH-662-20081121-1.patch Upgrade Nutch to use Lucene 2.4. This release changes the Lucene file format. New indexes created by this Lucene version will NOT be readable by older versions. Lucene 2.4 can read and update older index formats, although updating an older format will convert it to the new format. There are also some performance and functionality improvements.
[jira] Closed: (NUTCH-663) Upgrade Nutch to use Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-663. -- Upgrade Nutch to use Hadoop 0.19 Key: NUTCH-663 URL: https://issues.apache.org/jira/browse/NUTCH-663 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar, NUTCH-663-1-20081126.patch Upgrade Nutch to use a newer Hadoop, version 0.19. This includes performance improvements, bug fixes, and new functionality. Changes some current APIs.
[jira] Closed: (NUTCH-647) Resolve URLs tool
[ https://issues.apache.org/jira/browse/NUTCH-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-647. -- Resolve URLs tool - Key: NUTCH-647 URL: https://issues.apache.org/jira/browse/NUTCH-647 Project: Nutch Issue Type: New Feature Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: NUTCH-647-1-20080818.patch, NUTCH-647-2-20081126.patch A tool that takes a listing of urls and attempts to resolve their IP addresses. Useful for running after the fetcher has run to determine if DNS problems exist. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-647) Resolve URLs tool
[ https://issues.apache.org/jira/browse/NUTCH-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-647. Resolution: Fixed Fix Version/s: 1.0.0 Committed with revision 722478 Resolve URLs tool - Key: NUTCH-647 URL: https://issues.apache.org/jira/browse/NUTCH-647 Project: Nutch Issue Type: New Feature Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: NUTCH-647-1-20080818.patch, NUTCH-647-2-20081126.patch A tool that takes a listing of urls and attempts to resolve their IP addresses. Useful for running after the fetcher has run to determine if DNS problems exist. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
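The core of such a tool, taking a list of URLs and trying to resolve each host, can be sketched in a few lines of Python. This is an illustration rather than the patch's implementation: the function name is hypothetical, the resolver is injectable so the logic can be exercised without network access, and the real tool presumably runs multithreaded over a much larger list.

```python
import socket
from urllib.parse import urlparse

def resolve_urls(urls, resolver=socket.gethostbyname):
    """Map each URL to its host's IP address, or None if it cannot be resolved."""
    results = {}
    for url in urls:
        host = urlparse(url).hostname
        try:
            results[url] = resolver(host) if host else None
        except OSError:  # socket.gaierror is a subclass of OSError
            results[url] = None
    return results
```

Counting the None entries after a fetch gives a quick indication of whether failures were caused by DNS rather than by the remote servers.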
[jira] Resolved: (NUTCH-665) Search Load Testing Tool
[ https://issues.apache.org/jira/browse/NUTCH-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-665. Resolution: Fixed Committed with revision 722481 Search Load Testing Tool Key: NUTCH-665 URL: https://issues.apache.org/jira/browse/NUTCH-665 Project: Nutch Issue Type: New Feature Components: searcher Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Priority: Minor Fix For: 1.0.0 Attachments: NUTCH-665-20081126-1.patch A tool which spawns a number of threads and executes searches against configured search servers. This is used for light load testing of search servers.
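A light load test of this kind boils down to running queries from a pool of threads and timing them. The Python sketch below shows that shape under stated assumptions: `search` is any callable that executes one query (a stand-in for a call to a search server), and the function name and the returned statistics are invented for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_load_test(search, queries, threads=4):
    """Run every query through `search` on a thread pool and time each call."""
    def timed(query):
        start = time.perf_counter()
        search(query)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=threads) as pool:
        latencies = list(pool.map(timed, queries))

    if not latencies:
        return {"count": 0, "max_seconds": 0.0, "mean_seconds": 0.0}
    return {
        "count": len(latencies),
        "max_seconds": max(latencies),
        "mean_seconds": sum(latencies) / len(latencies),
    }
```

Raising `threads` increases concurrency against the search servers; the per-query latencies show how they respond.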
[jira] Closed: (NUTCH-665) Search Load Testing Tool
[ https://issues.apache.org/jira/browse/NUTCH-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-665. -- Search Load Testing Tool Key: NUTCH-665 URL: https://issues.apache.org/jira/browse/NUTCH-665 Project: Nutch Issue Type: New Feature Components: searcher Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Priority: Minor Fix For: 1.0.0 Attachments: NUTCH-665-20081126-1.patch A tool which spawns a number of threads and executes searches against configured search servers. This is used for light load testing of search servers.
[jira] Closed: (NUTCH-667) Input Format for working with Content in Hadoop Streaming
[ https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-667. -- Input Format for working with Content in Hadoop Streaming - Key: NUTCH-667 URL: https://issues.apache.org/jira/browse/NUTCH-667 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Priority: Minor Fix For: 1.0.0 Attachments: NUTCH-667-1-20081126.patch This is a ContentAsText input format that replaces line endings with spaces, which allows Nutch content to be used more effectively inside Hadoop streaming jobs. Streaming allows MapReduce jobs to be written in any language that can communicate with stdin and stdout.
[jira] Resolved: (NUTCH-667) Input Format for working with Content in Hadoop Streaming
[ https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-667. Resolution: Fixed Committed with revision 722483 Input Format for working with Content in Hadoop Streaming - Key: NUTCH-667 URL: https://issues.apache.org/jira/browse/NUTCH-667 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Priority: Minor Fix For: 1.0.0 Attachments: NUTCH-667-1-20081126.patch This is a ContentAsText input format that replaces line endings with spaces, which allows Nutch content to be used more effectively inside Hadoop streaming jobs. Streaming allows MapReduce jobs to be written in any language that can communicate with stdin and stdout.
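The reason for flattening line endings is that streaming jobs read one record per line, so a page containing newlines would be split into several records. A rough Python sketch of the transformation follows; the function name is hypothetical, collapsing runs of line endings into a single space is an assumption about the details, and the tab between key and value reflects the default separator used by Hadoop streaming.

```python
import re

def content_as_text_line(url, content):
    """Flatten page content onto one line so that a streaming job sees
    exactly one record per page: the URL, a tab, then the content."""
    flat = re.sub(r"[\r\n]+", " ", content)
    return "%s\t%s" % (url, flat)
```

A mapper written in any language can then split each input line on the first tab to recover the URL and the page text.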
[jira] Created: (NUTCH-668) Domain URL Filter
Domain URL Filter - Key: NUTCH-668 URL: https://issues.apache.org/jira/browse/NUTCH-668 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 A URLFilter that adds the ability to filter out URLs by top level domain or by hostname. A configuration file with a listing of URLs is used to denote accepted urls. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-668) Domain URL Filter
[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-668: --- Attachment: NUTCH-668-1-20081202.patch Includes the DomainURLFilter and test files. Domains can be filtered either by top level domain, ignoring subdomains, or by hostname, through configuration. There is a configuration file where valid domains are placed one per line. Those domains are used to create a valid domain set against which we validate URLs at runtime. Only URLs which match domains in the domain set are considered valid. Domain URL Filter - Key: NUTCH-668 URL: https://issues.apache.org/jira/browse/NUTCH-668 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: NUTCH-668-1-20081202.patch A URLFilter that adds the ability to filter out URLs by top level domain or by hostname. A configuration file with a listing of URLs is used to denote accepted urls.
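The matching rule can be illustrated with a short Python sketch. It is not the DomainURLFilter code from the patch: the function names are hypothetical, and checking the full hostname followed by each parent suffix is one assumed way to cover both hostname and domain entries. Returning the URL when accepted and nothing when rejected mirrors the URLFilter contract.

```python
from urllib.parse import urlparse

def load_domains(lines):
    """Build the valid-domain set from configuration lines, one entry per line."""
    return {line.strip().lower() for line in lines if line.strip()}

def domain_filter(url, domains):
    """Return the URL if its host or any parent domain is in the set, else None."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # Check www.example.com, then example.com, then com.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in domains:
            return url
    return None
```

With a set built from the lines `apache.org` and `com`, both http://lucene.apache.org/nutch/ and http://www.example.com/ pass, while http://example.net/ is rejected.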
Re: Pending Commits for Nutch Issues
Doğacan Güney wrote: Hi Dennis, On Wed, Nov 26, 2008 at 11:42 PM, Dennis Kubes wrote: If nobody has a problem with them I would like to commit the following issues in the next day or two: NUTCH-663: Upgrade Nutch to the most recent Hadoop version (0.19) NUTCH-662: Upgrade Nutch to the most recent Lucene version (2.4) NUTCH-647: Resolve URLs tool NUTCH-665: Search Load Testing Tool NUTCH-667: Input Format for working with Content in Hadoop Streaming And I would like to commit these in a week: NUTCH-635: LinkAnalysis Tool for Nutch NUTCH-646: New Indexing framework for Nutch NUTCH-594: Serve Nutch search results in XML and JSON NUTCH-666: Analysis plugins and new language identifier. There are others too but these are the ones I am trying to get moved into trunk right now.
I am OK with all but NUTCH-666... Why a new language identifier? (Or, if a new one, why keep the old one around?)
I haven't got the code pushed out yet. I do have a production version running but I need to make it play nice with the Apache licensing requirements. The library I am currently using is under the GPL. The reason I switched was that I found the old one wasn't working correctly for me. I don't know the accuracy levels of the old language identifier, but I found that pages containing both English and another language would often be classified as English. The new language identifier I am currently using has an accuracy rate of 97% and is trainable as before for multiple languages. Currently we have models for 20-30 languages. The new language identifier also works with the new indexing framework and with new functionality for custom fields. The only reason I would keep the old one around would be backwards compatibility for people currently using it. I will push out a patch shortly and we can review. If we don't want it to make it into this release I am OK with that.
Dennis
[jira] Updated: (NUTCH-665) Search Load Testing Tool
[ https://issues.apache.org/jira/browse/NUTCH-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-665: --- Attachment: NUTCH-665-20081126-1.patch Search load testing tool. Search Load Testing Tool Key: NUTCH-665 URL: https://issues.apache.org/jira/browse/NUTCH-665 Project: Nutch Issue Type: New Feature Components: searcher Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Priority: Minor Fix For: 1.0.0 Attachments: NUTCH-665-20081126-1.patch A tool which spawns a number of threads and executes searches against configured search servers. This is used for light load testing of search servers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
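The general shape of such a tool (several threads, each firing queries at a search backend) might look like the sketch below. It is not taken from the patch. `SearchLoadSketch` and the `search` callback are hypothetical placeholders for whatever client actually talks to the search servers.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Illustrative only: a hypothetical load driver, not the tool attached to NUTCH-665.
class SearchLoadSketch {

    // Each of `threads` workers runs every query once; returns the number of searches completed.
    static long run(int threads, List<String> queries, Consumer<String> search) {
        AtomicLong completed = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                for (String query : queries) {
                    search.accept(query); // placeholder for a real search call
                    completed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            // Wait for the workers; one minute is an arbitrary cap for this sketch.
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }
}
```

A real run would also record latency per query; that is left out here to keep the threading structure visible.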
[jira] Updated: (NUTCH-647) Resolve URLs tool
[ https://issues.apache.org/jira/browse/NUTCH-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-647: --- Attachment: NUTCH-647-2-20081126.patch Updated patch. Resolve URLs tool - Key: NUTCH-647 URL: https://issues.apache.org/jira/browse/NUTCH-647 Project: Nutch Issue Type: New Feature Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Attachments: NUTCH-647-1-20080818.patch, NUTCH-647-2-20081126.patch A tool that takes a listing of urls and attempts to resolve their IP addresses. Useful for running after the fetcher has run to determine if DNS problems exist. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
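For readers who want a feel for what resolving a listing of URLs involves, here is a minimal single-threaded sketch using only the JDK. It is an assumption about the general approach, not the code in the attached patches, and the class name `ResolveUrlsSketch` is made up for illustration.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.Optional;

// Illustrative only: a hypothetical helper, not the tool attached to NUTCH-647.
class ResolveUrlsSketch {

    // Extracts the host part of a URL, or null if the string is not a usable URI.
    static String hostOf(String url) {
        try {
            return URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // Returns the resolved IP address as text, or empty if the host cannot be resolved.
    static Optional<String> resolve(String url) {
        String host = hostOf(url);
        if (host == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(InetAddress.getByName(host).getHostAddress());
        } catch (UnknownHostException e) {
            return Optional.empty();
        }
    }

    // Prints one "url <tab> address" line per argument.
    public static void main(String[] args) {
        for (String url : args) {
            System.out.println(url + "\t" + resolve(url).orElse("UNRESOLVED"));
        }
    }
}
```

Running this over the URLs a fetcher failed on would show which failures trace back to DNS.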
[jira] Created: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
Analysis plugins for multiple language and new Language Identifier Tool --- Key: NUTCH-666 URL: https://issues.apache.org/jira/browse/NUTCH-666 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Add analysis plugins for czech, greek, japanese, chinese, korean, dutch, russian, and thai. Also includes a new Language Identifier tool that used the new indexing framework in NUTCH-646. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-666: --- Attachment: NUTCH-666-1-20081126.patch Part one of patch. This includes the new analyzers for different languages. Part two will include the new language identifier tool. Analysis plugins for multiple language and new Language Identifier Tool --- Key: NUTCH-666 URL: https://issues.apache.org/jira/browse/NUTCH-666 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: NUTCH-666-1-20081126.patch Add analysis plugins for czech, greek, japanese, chinese, korean, dutch, russian, and thai. Also includes a new Language Identifier tool that used the new indexing framework in NUTCH-646. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2
[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-663: --- Attachment: NUTCH-663-1-20081126.patch Updates jar and native files Upgrade Nutch to use Hadoop 0.18.2 -- Key: NUTCH-663 URL: https://issues.apache.org/jira/browse/NUTCH-663 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: hadoop-0.19-native.tar.gz, NUTCH-663-1-20081126.patch Upgrade Nutch to use a newer hadoop, version 0.18.2. This includes performance improvements, bug fixes, and new functionality. Changes some current APIs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2
[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-663: --- Attachment: hadoop-0.19.0-core.jar Hadoop core jar Upgrade Nutch to use Hadoop 0.18.2 -- Key: NUTCH-663 URL: https://issues.apache.org/jira/browse/NUTCH-663 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar, NUTCH-663-1-20081126.patch Upgrade Nutch to use a newer hadoop, version 0.18.2. This includes performance improvements, bug fixes, and new functionality. Changes some current APIs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2
[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650982#action_12650982 ] Dennis Kubes commented on NUTCH-663: Hadoop 0.19 was released. I am integrating it and should have a patch shortly. Upgrade Nutch to use Hadoop 0.18.2 -- Key: NUTCH-663 URL: https://issues.apache.org/jira/browse/NUTCH-663 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Upgrade Nutch to use a newer hadoop, version 0.18.2. This includes performance improvements, bug fixes, and new functionality. Changes some current APIs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-663) Upgrade Nutch to use Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-663: --- Summary: Upgrade Nutch to use Hadoop 0.19 (was: Upgrade Nutch to use Hadoop 0.18.2) change to 0.19 instead of 0.18.2 Upgrade Nutch to use Hadoop 0.19 Key: NUTCH-663 URL: https://issues.apache.org/jira/browse/NUTCH-663 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar, NUTCH-663-1-20081126.patch Upgrade Nutch to use a newer hadoop, version 0.18.2. This includes performance improvements, bug fixes, and new functionality. Changes some current APIs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-666: --- Attachment: (was: NUTCH-666-1-20081126.patch) Analysis plugins for multiple language and new Language Identifier Tool --- Key: NUTCH-666 URL: https://issues.apache.org/jira/browse/NUTCH-666 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: NUTCH-666-1-20081126.patch Add analysis plugins for czech, greek, japanese, chinese, korean, dutch, russian, and thai. Also includes a new Language Identifier tool that used the new indexing framework in NUTCH-646. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-663) Upgrade Nutch to use Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-663: --- Attachment: NUTCH-663-1-20081126.patch Updated patch to include API changes in Nutch classes. Upgrade Nutch to use Hadoop 0.19 Key: NUTCH-663 URL: https://issues.apache.org/jira/browse/NUTCH-663 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar, NUTCH-663-1-20081126.patch Upgrade Nutch to use a newer hadoop, version 0.18.2. This includes performance improvements, bug fixes, and new functionality. Changes some current APIs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-663) Upgrade Nutch to use Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-663: --- Attachment: (was: NUTCH-663-1-20081126.patch) Upgrade Nutch to use Hadoop 0.19 Key: NUTCH-663 URL: https://issues.apache.org/jira/browse/NUTCH-663 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar, NUTCH-663-1-20081126.patch Upgrade Nutch to use a newer hadoop, version 0.18.2. This includes performance improvements, bug fixes, and new functionality. Changes some current APIs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-635) LinkAnalysis Tool for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-635: --- Attachment: (was: NUTCH-635-8-20080818.patch) LinkAnalysis Tool for Nutch --- Key: NUTCH-635 URL: https://issues.apache.org/jira/browse/NUTCH-635 Project: Nutch Issue Type: New Feature Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, NUTCH-635-7-20080808.patch, NUTCH-635-9-20081126.patch This is a basic PageRank-type link analysis tool for Nutch which simulates a sparse matrix using inlinks and outlinks and converges after a given number of iterations. This tool is meant to replace the current scoring system in Nutch with a system that converges instead of exponentially increasing scores. Also includes a tool to create an outlinkdb. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-635) LinkAnalysis Tool for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-635: --- Attachment: NUTCH-635-9-20081126.patch Updated final patch for new link analysis framework. I am also going to write up some documentation on the wiki for how this new process works. LinkAnalysis Tool for Nutch --- Key: NUTCH-635 URL: https://issues.apache.org/jira/browse/NUTCH-635 Project: Nutch Issue Type: New Feature Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, NUTCH-635-7-20080808.patch, NUTCH-635-9-20081126.patch This is a basic PageRank-type link analysis tool for Nutch which simulates a sparse matrix using inlinks and outlinks and converges after a given number of iterations. This tool is meant to replace the current scoring system in Nutch with a system that converges instead of exponentially increasing scores. Also includes a tool to create an outlinkdb. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
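To illustrate the idea of a score that settles over a fixed number of iterations instead of growing without bound, here is a textbook PageRank-style iteration over an in-memory outlink map. It is a generic sketch, not the MapReduce implementation in the NUTCH-635 patches. The class name `LinkRankSketch`, the `damping` parameter, and the handling of pages without outlinks are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: generic PageRank-style scoring, not the algorithm from the patch.
class LinkRankSketch {

    // outlinks maps each page to the pages it links to; only non-zero entries are stored,
    // which is what makes the link matrix "sparse".
    static Map<String, Double> rank(Map<String, List<String>> outlinks, int iterations, double damping) {
        Set<String> nodes = new TreeSet<>(outlinks.keySet());
        for (List<String> targets : outlinks.values()) {
            nodes.addAll(targets);
        }
        int n = nodes.size();

        // Start with a uniform score.
        Map<String, Double> scores = new HashMap<>();
        for (String node : nodes) {
            scores.put(node, 1.0 / n);
        }

        for (int i = 0; i < iterations; i++) {
            // Every page receives a base share, then contributions from its inlinks.
            Map<String, Double> next = new HashMap<>();
            for (String node : nodes) {
                next.put(node, (1.0 - damping) / n);
            }
            for (String source : nodes) {
                List<String> targets = outlinks.getOrDefault(source, List.of());
                if (targets.isEmpty()) {
                    continue; // assumption: a page with no outlinks passes nothing on
                }
                double share = damping * scores.get(source) / targets.size();
                for (String target : targets) {
                    next.merge(target, share, Double::sum);
                }
            }
            scores = next;
        }
        return scores;
    }
}
```

Because each page divides its score among its outlinks, the total score in the graph never grows from one iteration to the next, which is the property the issue description contrasts with exponentially increasing scores.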
[jira] Created: (NUTCH-667) Input Forma for working with Content in Hadoop Streaming
Input Forma for working with Content in Hadoop Streaming Key: NUTCH-667 URL: https://issues.apache.org/jira/browse/NUTCH-667 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Priority: Minor Fix For: 1.0.0 This is a ContextAsText input format that removes line endings with spaces that allow Nutch content to be used more effectively inside of Hadoop streaming jobs that allow MapReduce jobs to be written in any language that can communicate with stdin and stdout. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-666: --- Attachment: NUTCH-666-1-20081126.patch Fixed patch. Now includes the changes to AnalyzerFactory to allow multiple languages per plugin. Analysis plugins for multiple language and new Language Identifier Tool --- Key: NUTCH-666 URL: https://issues.apache.org/jira/browse/NUTCH-666 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: NUTCH-666-1-20081126.patch Add analysis plugins for czech, greek, japanese, chinese, korean, dutch, russian, and thai. Also includes a new Language Identifier tool that used the new indexing framework in NUTCH-646. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-667) Input Forma for working with Content in Hadoop Streaming
[ https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-667: --- Attachment: NUTCH-667-1-20081126.patch Input format for working with hadoop streaming. Input Forma for working with Content in Hadoop Streaming Key: NUTCH-667 URL: https://issues.apache.org/jira/browse/NUTCH-667 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Priority: Minor Fix For: 1.0.0 Attachments: NUTCH-667-1-20081126.patch This is a ContextAsText input format that removes line endings with spaces that allow Nutch content to be used more effectively inside of Hadoop streaming jobs that allow MapReduce jobs to be written in any language that can communicate with stdin and stdout. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-667) Input Format for working with Content in Hadoop Streaming
[ https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-667: --- Summary: Input Format for working with Content in Hadoop Streaming (was: Input Forma for working with Content in Hadoop Streaming) Input Format for working with Content in Hadoop Streaming - Key: NUTCH-667 URL: https://issues.apache.org/jira/browse/NUTCH-667 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Priority: Minor Fix For: 1.0.0 Attachments: NUTCH-667-1-20081126.patch This is a ContextAsText input format that removes line endings with spaces that allow Nutch content to be used more effectively inside of Hadoop streaming jobs that allow MapReduce jobs to be written in any language that can communicate with stdin and stdout. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
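Hadoop Streaming hands records to the external program one per line, with the key taken as everything up to the first tab by default, so multi-line page content has to be flattened first. The sketch below shows that flattening step in isolation, reading the description as "line endings are replaced with spaces". The class name `StreamingLineSketch` is hypothetical, and the code is not taken from the NUTCH-667 patch.

```java
// Illustrative only: a hypothetical helper, not the input format from the patch.
class StreamingLineSketch {

    // Builds a single "url <tab> content" line, replacing every line ending with one space.
    static String toLine(String url, String content) {
        String flattened = content.replaceAll("\\r\\n|\\r|\\n", " ");
        return url + "\t" + flattened;
    }

    public static void main(String[] args) {
        System.out.println(toLine("http://example.com/", "first line\nsecond line"));
    }
}
```

A streaming mapper in any language can then split each line on the first tab to recover the URL and the page text.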
[jira] Updated: (NUTCH-646) New Indexing Framework for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-646: --- Attachment: NUTCH-646-2-20081126.patch Updated indexing patch. New Indexing Framework for Nutch Key: NUTCH-646 URL: https://issues.apache.org/jira/browse/NUTCH-646 Project: Nutch Issue Type: New Feature Components: indexer Affects Versions: 0.9.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 0.9.0, 1.0.0 Attachments: arity-1.3.2.jar, NUTCH-646-1-20080818.patch, NUTCH-646-2-20081126.patch New indexing framework for Nutch that provides a more generic field abstraction consistent with Lucene index semantics. Allows multiple MR jobs to be created for different fields and those fields to be aggregated and indexed in the end. Overcomes limitations of the current indexer that limits what databases are passed into the indexer. Creates a new extension point as well for field-filters for manipulation of fields during the indexing process. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2
[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650713#action_12650713 ] Dennis Kubes commented on NUTCH-663: @buddha1021 The 1.0 release for Nutch has some of the features for Nutch 2 but it is not a complete Nutch 2 Architecture. We felt it was best to add some needed features into the current version of Nutch and get them deployed to the community quickly. A lot of people have been asking about the development of Nutch and releasing. Truth is we have just been busy adding in needed features and patches. We should have a release out in the next couple of weeks. That will be a 1.0 release for Nutch but will probably contain a 0.18.2 or 0.19 release of Hadoop. We aren't waiting for Hadoop to go to 1.0. @Doğacan Güney I am not opposed to waiting for 0.19 as long as it will be released soon. I was looking and it seemed they tried to release a little while back and didn't finish because of some big errors. Upgrade Nutch to use Hadoop 0.18.2 -- Key: NUTCH-663 URL: https://issues.apache.org/jira/browse/NUTCH-663 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Upgrade Nutch to use a newer hadoop, version 0.18.2. This includes performance improvements, bug fixes, and new functionality. Changes some current APIs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-662) Upgrade Nutch to use Lucene 2.4
[ https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650009#action_12650009 ] Dennis Kubes commented on NUTCH-662: We had been running in production for about a month and never saw any issues with the indexing processes using 2.4. Then I was doing some work for upgrading the trunk and it popped up in delete duplicates unit testing. We don't do delete duplicates in our JobStream, we do it query side. First problem was that the old DfsIndexOutput didn't implement the seek method (probably because DFS can't seek), so when that was changed to allow it to seek, it was throwing checksum errors on the index when it was trying to open it. It turns out, as noted above, that 2.4 purposefully writes a bad checksum, seeks back, and then writes the correct checksum when closing the index, as a pseudo-two-phase commit. So I don't think it will affect the indexing process because as you noted it writes to local first then just transfers to DFS. In changing DfsIndexOutput to allow DeleteDuplicates to work I just took the same approach, local first, then put to DFS. Upgrade Nutch to use Lucene 2.4 --- Key: NUTCH-662 URL: https://issues.apache.org/jira/browse/NUTCH-662 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: lucene-analyzers-2.4.0.jar, lucene-core-2.4.0.jar, lucene-misc-2.4.0.jar, NUTCH-662-20081121-1.patch Upgrade nutch to use Lucene 2.4. This release changes the lucene file format. New indexes created by this lucene version will NOT be readable by older versions. Lucene 2.4 can read and update older index formats although updating an older format will convert it to the new format. There are also some performance and functionality improvements. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-662) Upgrade Nutch to use Lucene 2.4
Upgrade Nutch to use Lucene 2.4 --- Key: NUTCH-662 URL: https://issues.apache.org/jira/browse/NUTCH-662 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Upgrade nutch to use Lucene 2.4. This release changes the lucene file format. New indexes created by this lucene version will NOT be readable by older versions. Lucene 2.4 can read and update older index formats although updating an older format will convert it to the new format. There are also some performance and functionality improvements. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2
Upgrade Nutch to use Hadoop 0.18.2 -- Key: NUTCH-663 URL: https://issues.apache.org/jira/browse/NUTCH-663 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Upgrade Nutch to use a newer hadoop, version 0.18.2. This includes performance improvements, bug fixes, and new functionality. Changes some current APIs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-662) Upgrade Nutch to use Lucene 2.4
[ https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-662: --- Attachment: lucene-misc-2.4.0.jar Upgrade Nutch to use Lucene 2.4 --- Key: NUTCH-662 URL: https://issues.apache.org/jira/browse/NUTCH-662 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: lucene-core-2.4.0.jar, lucene-misc-2.4.0.jar Upgrade nutch to use Lucene 2.4. This release changes the lucene file format. New indexes created by this lucene version will NOT be readable by older versions. Lucene 2.4 can read and update older index formats although updating an older format will convert it to the new format. There are also some performance and functionality improvements. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-662) Upgrade Nutch to use Lucene 2.4
[ https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649679#action_12649679 ] Dennis Kubes commented on NUTCH-662: The upgrade to Lucene 2.4 causes a weird problem that might need some discussion. The o.a.n.indexer.FsDirectory$DfsIndexOutput class is used to interact with an index stored on DFS. In Lucene 2.4, the ChecksumIndexOutput.prepareCommit and finalizeCommit methods perform a pseudo two-phase commit. To do this, Lucene writes an intentionally mismatched checksum (long = checksum - 1), then flushes, seeks back, and writes the correct checksum in the same spot. They say this is to ensure the commit. Because DFS doesn't have append functionality we can't write to it, seek back to a position, and write again. DFS is write only. To handle this problem in the attached patch, I first write out to a local temporary file that is deleted upon exit, then when close is called on the IndexOutput, that file is written out to DFS all at once. I don't know if this is the best way to do this or if there is a better way, but it does handle the new write and seek functionality of Lucene 2.4. The previous implementation of DfsIndexOutput simply threw an UnsupportedOperationException when the seek method was called. This was fine before 2.4 as Lucene wasn't calling that method during writing to DFS. In 2.4 it does and unit tests were failing because of it. What does everybody think about this implementation? Other than that I don't see any major issues in upgrading to 2.4. Some people have said performance went down in 2.4. That might be the case, but I expect it will be fixed, and it would be good to be on the most recent Lucene version as we move to a 1.0 release for Nutch. Also we have been using 2.4 in production for a month now without any issues.
Upgrade Nutch to use Lucene 2.4 --- Key: NUTCH-662 URL: https://issues.apache.org/jira/browse/NUTCH-662 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: lucene-core-2.4.0.jar, lucene-misc-2.4.0.jar, NUTCH-662-20081121-1.patch Upgrade nutch to use Lucene 2.4. This release changes the lucene file format. New indexes created by this lucene version will NOT be readable by older versions. Lucene 2.4 can read and update older index formats although updating an older format will convert it to the new format. There are also some performance and functionality improvements. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
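The workaround described in the comment above (buffer everything in a local temporary file that supports seeking, then ship the finished file to the write-once store when close is called) can be sketched with plain JDK classes. This is only an illustration of the pattern and not the patched DfsIndexOutput. `BufferedSeekableOutputSketch` is a hypothetical name, and an ordinary `OutputStream` stands in for the DFS file.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;

// Illustrative only: local-buffer-then-copy pattern, not the class from the NUTCH-662 patch.
class BufferedSeekableOutputSketch implements AutoCloseable {
    private final File temp;
    private final RandomAccessFile local;
    private final OutputStream remote; // stand-in for a stream that cannot seek

    BufferedSeekableOutputSketch(OutputStream remote) throws IOException {
        this.remote = remote;
        this.temp = File.createTempFile("index-output", ".tmp");
        this.temp.deleteOnExit(); // safety net if close() is never reached
        this.local = new RandomAccessFile(temp, "rw");
    }

    void writeBytes(byte[] bytes) throws IOException { local.write(bytes); }
    void writeLong(long value) throws IOException { local.writeLong(value); }
    long position() throws IOException { return local.getFilePointer(); }
    void seek(long pos) throws IOException { local.seek(pos); } // works because the file is local

    // Copies the finished local file to the remote stream in one sequential pass.
    @Override
    public void close() throws IOException {
        local.close();
        try (OutputStream out = remote) {
            Files.copy(temp.toPath(), out);
        } finally {
            temp.delete();
        }
    }

    // Mimics the write, seek back, rewrite sequence described for the checksum.
    static byte[] demo() {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (BufferedSeekableOutputSketch out = new BufferedSeekableOutputSketch(sink)) {
            out.writeBytes(new byte[] {1, 2, 3});
            long checksumAt = out.position();
            out.writeLong(41L);   // deliberately wrong value first
            out.seek(checksumAt); // go back to the same spot
            out.writeLong(42L);   // overwrite with the correct value
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }
}
```

The remote side only ever sees the final bytes, so the intermediate mismatched checksum never reaches it. The cost is local disk space equal to the size of the file being written.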
[jira] Updated: (NUTCH-662) Upgrade Nutch to use Lucene 2.4
[ https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-662: --- Attachment: lucene-analyzers-2.4.0.jar Upgrade Nutch to use Lucene 2.4 --- Key: NUTCH-662 URL: https://issues.apache.org/jira/browse/NUTCH-662 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: lucene-analyzers-2.4.0.jar, lucene-core-2.4.0.jar, lucene-misc-2.4.0.jar, NUTCH-662-20081121-1.patch Upgrade nutch to use Lucene 2.4. This release changes the lucene file format. New indexes created by this lucene version will NOT be readable by older versions. Lucene 2.4 can read and update older index formats although updating an older format will convert it to the new format. There are also some performance and functionality improvements. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.