[jira] Created: (NUTCH-736) how long it takes nutch 1.0 to fetch
how long it takes nutch 1.0 to fetch

Key: NUTCH-736
URL: https://issues.apache.org/jira/browse/NUTCH-736
Project: Nutch
Issue Type: Task
Components: fetcher
Affects Versions: 1.0.0
Environment: Intel 2.8 Core2Duo, OS X 10.5.6
Reporter: Filipe Antunes

I need an opinion about how long it takes Nutch 1.0 to fetch a web site. At the moment I'm indexing 3000 sites (medical area): universities, clinics, hospitals, associations, journals (HTML, DOC, PDF, TXT, XLS). So far I have 5 segments (64 GB) and it is fetching the 6th. I'm using an Intel 2.8 Core2Duo on OS X 10.5.6 with a 4 Mbit internet connection (the machine is throttled to 64 Mbit during the day, 8 hours), and this fetch started one month ago. Does anyone have statistics on how long a site (# of pages) takes with Nutch 1.0?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Re: Nutch/Solr: storing the page cache in Solr
Siddhartha Reddy wrote:
> I'm trying to patch Nutch to allow the page cache to be added to the Solr index when using the SolrIndexer tool. Is there any reason this is not done by default? The Solr schema even has the cache field, but it is left empty.

This issue is more complicated. We would also need to handle non-string content such as various binary formats (PDF, Office, images, etc.), and there is no support for this in Solr (yet). Additionally, storing large binary blobs in a Lucene index has some performance consequences. Currently Nutch uses Solr for searching, and a separate (set of) segment servers for content serving.

> I'm enclosing a patch of the changes I have made. I have done some testing and this seems to work fine. Can someone please take a look at it and let me know if I'm doing anything wrong? I'm especially not sure about the character encoding to assume when converting the Content (which is stored as byte[]) to a String; I'm getting the encoding from Metadata (using the key Metadata.ORIGINAL_CHAR_ENCODING) but it is always null.

The patch looks ok, if handling String content is all you need. The char encoding should be available in ParseData.getMeta().

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
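To make the fallback concrete, here is a minimal, self-contained sketch of the decoding logic (plain Java, deliberately outside the Nutch classes; in the actual patch the declared encoding would come from ParseData.getMeta() or the Metadata.ORIGINAL_CHAR_ENCODING key, and the class and method names here are hypothetical):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContentDecoder {

    // Decode raw fetched bytes using the encoding declared at parse time,
    // falling back to UTF-8 when no (or an unknown) encoding was recorded.
    public static String decode(byte[] content, String declaredEncoding) {
        Charset cs = StandardCharsets.UTF_8; // default when nothing was declared
        if (declaredEncoding != null) {
            try {
                cs = Charset.forName(declaredEncoding.trim());
            } catch (IllegalArgumentException e) {
                // unknown or malformed charset name: keep the UTF-8 fallback
            }
        }
        return new String(content, cs);
    }
}
```

Defaulting to UTF-8 rather than the platform charset keeps the result stable across machines; a misdecoded page then degrades to replacement characters instead of varying by locale.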
[jira] Updated: (NUTCH-736) how long it takes nutch 1.0 to fetch
[ https://issues.apache.org/jira/browse/NUTCH-736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Filipe Antunes updated NUTCH-736:
---------------------------------

Description:
I need an opinion about how long it takes Nutch 1.0 to fetch a web site. At the moment I'm indexing 3000 sites (medical area): universities, clinics, hospitals, associations, journals (HTML, DOC, PDF, TXT, XLS). So far I have 5 segments (64 GB) and it is fetching the 6th. I'm using an Intel 2.8 Core2Duo on OS X 10.5.6 with a 4 Mbit internet connection (the machine is throttled to 64 Kbyte/s during the day, 8 hours), and this fetch started one month ago. Does anyone have statistics on how long a site (# of pages) takes with Nutch 1.0?

was:
I need an opinion about how long it takes Nutch 1.0 to fetch a web site. At the moment I'm indexing 3000 sites (medical area): universities, clinics, hospitals, associations, journals (HTML, DOC, PDF, TXT, XLS). So far I have 5 segments (64 GB) and it is fetching the 6th. I'm using an Intel 2.8 Core2Duo on OS X 10.5.6 with a 4 Mbit internet connection (the machine is throttled to 64 Mbit during the day, 8 hours), and this fetch started one month ago. Does anyone have statistics on how long a site (# of pages) takes with Nutch 1.0?

how long it takes nutch 1.0 to fetch

Key: NUTCH-736
URL: https://issues.apache.org/jira/browse/NUTCH-736
Project: Nutch
Issue Type: Task
Components: fetcher
Affects Versions: 1.0.0
Environment: Intel 2.8 Core2Duo OS X 10.5.6
Reporter: Filipe Antunes
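The month-long fetch above can be sanity-checked against raw line speed. A back-of-the-envelope sketch (assuming the daytime cap means 64 KB/s for 8 hours, and the full 4 Mbit/s, roughly 512 KB/s, is available for the other 16 hours):

```java
public class FetchBudget {

    // Bytes fetchable per day under the constraints described in the issue:
    // 8 h throttled to 64 KB/s, 16 h at the full ~4 Mbit/s (~512 KB/s).
    static long perDayBytes() {
        long daytime = 64L * 1024 * 8 * 3600;
        long night = 512L * 1024 * 16 * 3600;
        return daytime + night;
    }

    public static void main(String[] args) {
        double gibPerDay = perDayBytes() / (1024.0 * 1024 * 1024);
        System.out.printf("~%.1f GiB/day raw; 64 GiB needs only ~%.1f days of line time%n",
                gibPerDay, 64.0 / gibPerDay);
    }
}
```

At roughly 30 GiB/day of raw capacity, 64 GB of segments is only a couple of days of pure transfer, which suggests the month-long fetch is dominated by per-host politeness delays, DNS lookups and fetch-list scheduling rather than bandwidth.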
Regarding Solr1.3 and Nutch 0.9 Integration
Dear Paul,

We are trying to integrate Solr 1.3 and Nutch 0.9 and are facing a few problems. Below I am mentioning the error stack trace and the JDK version; please help us out with this problem. Finally, let us know if you have any useful documents on Solr-Nutch integration.

Environment we are working with:
jdk: 1.6
nutch: 0.9
solr: 1.3
OS: Windows XP

Articles referred to for integration:
http://wiki.apache.org/nutch/RunningNutchAndSolr
http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html

Error stack trace we got when updating the crawler-fetched content:

2009-05-13 20:22:27,175 WARN indexer.SolrClientAdapter - Could not index document, reason: Bad Request
Bad Request
request: http://localhost:8080/solr/update?wt=javabin&version=2.2
org.apache.solr.common.SolrException: Bad Request
Bad Request
request: http://localhost:8080/solr/update?wt=javabin&version=2.2
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:343)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:183)
    at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:217)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:63)
    at org.apache.nutch.indexer.SolrClientAdapter.index(SolrClientAdapter.java:75)
    at org.apache.nutch.indexer.SolrIndexer$OutputFormat$1.write(SolrIndexer.java:118)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:298)
    at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:238)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:313)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)
2009-05-13 20:22:27,190 INFO indexer.Indexer - Executing commit
2009-05-13 20:22:28,221 INFO indexer.Indexer - SolrIndexer: done

Regards,
Mallik.J.
The Future of Nutch, reactivated
Hi all,

I'd like to revive this thread and gather additional feedback so that we end up with concrete conclusions. Much of what I write below others have said before; I'm trying here to express it as it looks from my point of view.

Target audience
===============

I think that the Nutch project is experiencing a crisis of personality now - we are not sure what the target audience is, and we cannot satisfy everyone. I think there are the following groups of Nutch users:

1. Large-scale Internet crawl search: actually, there are only a few such users, because it takes considerable resources to manage operations on that scale. Scalability, manageability and ranking/spam prevention are the chief concerns here.

2. Medium-scale vertical search: I suspect that many Nutch users fall into this category. Modularity, flexibility in implementing custom processing, the ability to modify workflows and to use only some Nutch components seem to be the chief concerns here. Scalability too, but only up to a volume of ~100-200 mln documents.

3. Small- to medium-scale enterprise search: there's a sizeable number of Nutch users that fall into this category, for historical reasons. Link-based ranking and resource discovery are not that important here, but integration with Windows networking, Microsoft formats and databases, as well as realtime indexing and easy index maintenance, are crucial. This class of users often has to heavily customize Nutch to get any sensible result. Also, this is where Solr really shines, so there is little benefit in using Nutch here. I predict that Nutch will have fewer and fewer users of this type.

4. Single desktop to small intranet search: as above, but the accent is on ease of use out of the box, and an often requested feature is a GUI frontend. Currently IMHO Nutch is too complex and requires too much command-line operation for casual users to make this use case attractive.

What is the target audience that we as a community want to support? By this I mean not only moral support, but also active participation in the development process. From the place where we are at the moment we could go in any of the above directions.

Core competence
===============

This is a simple but important point. Currently we maintain several major subsystems in Nutch that are implemented by other projects, and often in a better way. The plugin framework (and dependency injection) and content parsing are two areas that we have to delegate to third-party libraries, such as Tika and OSGi or some other simple IoC container - probably there are other components that we don't have to do ourselves. Another thing that I'd love to delegate is distributed search and index maintenance - either through Solr or Katta or something else.

The question then is: what is the core competence of this project? I see the following major areas that are unique to Nutch:

* crawling - this includes crawl scheduling (and re-crawl scheduling), discovery and classification of new resources, strategies for crawling specific sets of URLs (hosts and domains) under bandwidth and netiquette constraints, etc.

* web graph analysis - this includes link-based ranking, mirror detection (and URL aliasing) but also link spam detection and more complex control over the crawling frontier.

Anything more? I'm not sure - perhaps I would add template detection and pagelet-level crawling (i.e. sensible re-crawling of portal-type sites). Nutch 1.0 already made some steps in this direction, with the new link analysis package and the pluggable FetchSchedule and Signature. A lot remains to be done here, and we are still spending a lot of resources on dealing with issues outside this core competence.

---

So, what do we need to do next?

* we need to decide where we should commit our resources, as a community of users, contributors and committers, so that the project is most useful to our target audience. At this point there are few active committers, so I don't think we can cover more than 1 direction at a time ... ;)

* we need to re-architect Nutch to focus on our core competence, and delegate what we can to other projects.

Feel free to comment on the above, make suggestions or corrections. I'd like to wrap it up in a concise mission statement that would help us set the goals for the next couple of months.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
[Nutch Wiki] Trivial Update of HttpAuthenticationSchemes by susam
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/HttpAuthenticationSchemes

--

== Introduction ==

- This is a feature in Nutch, developed by Susam Pal, that allows the crawler to authenticate itself to websites requiring NTLM, Basic or Digest authentication. This feature can not do POST based authentication that depends on cookies. More information on this can be found at: HttpPostAuthentication
+ This is a feature in Nutch that allows the crawler to authenticate itself to websites requiring NTLM, Basic or Digest authentication. This feature can not do POST based authentication that depends on cookies. More information on this can be found at: HttpPostAuthentication

== Necessity ==

There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS, or the NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication, but NTLM authentication didn't work due to a bug. Some portions of 'protocol-httpclient' were re-written to solve these problems, provide additional features like authentication support for proxy servers, and better inline documentation for the properties used to configure authentication.

@@ -108, +108 @@

Once you have checked the items listed above and you are still unable to fix the problem, or are confused about any point listed above, please mail the issue with the following information:

 1. Version of Nutch you are running.
- 1. Complete code in ''conf/httpclient-auth.xml' file.
+ 1. Complete code in 'conf/httpclient-auth.xml' file.
 1. Relevant portion from 'logs/hadoop.log' file. If you are clueless, send the complete file.
[Nutch Wiki] Update of RunningNutchAndSolr by amitkumar
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by amitkumar: http://wiki.apache.org/nutch/RunningNutchAndSolr

--

- private static class LuceneDocumentWrapper implements Writable {
+ public static class LuceneDocumentWrapper implements Writable {

  ).

+
+ HI, I to faced problems to integrate solr and nutch.After , some work i found the below article and integrated successfully. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
+
[Nutch Wiki] Update of RunningNutchAndSolr by amitkumar
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by amitkumar: http://wiki.apache.org/nutch/RunningNutchAndSolr

--

+ public static class LuceneDocumentWrapper implements Writable {

  ).

- HI, I to faced problems to integrate solr and nutch.After , some work i found the below article and integrated successfully. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
+ HI, I to faced problems in integrating solr and nutch. After, some work out i found the below article and integrated successfully. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
+
[Nutch Wiki] Update of RunningNutchAndSolr by amitkumar
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by amitkumar: http://wiki.apache.org/nutch/RunningNutchAndSolr

--

  d. Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste following fragment to it

  <requestHandler name="/nutch" class="solr.SearchHandler">
+   <lst name="defaults">
+     <str name="defType">dismax</str>
+     <str name="echoParams">explicit</str>
+     <float name="tie">0.01</float>
+     <str name="qf">
+       content^0.5 anchor^1.0 title^1.2
+     </str>
+     <str name="pf">content^0.5 anchor^1.5 title^1.2 site^1.5</str>
+     <str name="fl">url</str>
+     <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
+     <int name="ps">100</int>
+     <bool name="hl">true</bool>
+     <str name="q.alt">*:*</str>
+     <str name="hl.fl">title url content</str>
+     <str name="f.title.hl.fragsize">0</str>
+     <str name="f.title.hl.alternateField">title</str>
+     <str name="f.url.hl.fragsize">0</str>
+     <str name="f.url.hl.alternateField">url</str>
+     <str name="f.content.hl.fragmenter">regex</str>
+   </lst>
+ </requestHandler>

  6. Start Solr
[Nutch Wiki] Update of RunningNutchAndSolr by amitkumar
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by amitkumar: http://wiki.apache.org/nutch/RunningNutchAndSolr

--

  * apt-get install sun-java6-jdk subversion ant patch unzip

== Steps ==

- Setup

  The first step to get started is to download the required software components, namely Apache Solr and Nutch.

- 1. Download Solr version 1.3.0 or LucidWorks for Solr from Download page
+ '''1.''' Download Solr version 1.3.0 or LucidWorks for Solr from Download page
- 2. Extract Solr package
+ '''2.''' Extract Solr package
- 3. Download Nutch version 1.0 or later (Alternatively download the nightly version of Nutch that contains the required functionality)
+ '''3.''' Download Nutch version 1.0 or later (Alternatively download the nightly version of Nutch that contains the required functionality)
- 4. Extract the Nutch package
+ '''4.''' Extract the Nutch package

  tar xzf apache-nutch-1.0.tar.gz
- tar xzf apache-nutch-1.0.tar.gz

- 5. Configure Solr
+ '''5.''' Configure Solr

- For the sake of simplicity we are going to use the example configuration of Solr as a base.

- a. Copy the provided Nutch schema from directory
+ '''a.''' Copy the provided Nutch schema from directory apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (override the existing file)

We want to allow Solr to create the snippets for search results so we need to store the content in addition to indexing it:

- b. Change schema.xml so that the stored attribute of field "content" is true.
+ '''b.''' Change schema.xml so that the stored attribute of field "content" is true.

  <field name="content" type="text" stored="true" indexed="true"/>

We want to be able to tweak the relevancy of queries easily so we'll create a new dismax request handler configuration for our use case:

- d. Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste following fragment to it
+ '''d.''' Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste following fragment to it

  <requestHandler name="/nutch" class="solr.SearchHandler">

@@ -93, +89 @@

  </requestHandler>

- 6. Start Solr
+ '''6.''' Start Solr

  cd apache-solr-1.3.0/example
  java -jar start.jar

- 7. Configure Nutch
+ '''7. Configure Nutch'''

  a. Open nutch-site.xml in directory apache-nutch-1.0/conf, replace its contents with the following (we specify our crawler name, active plugins and limit maximum url count for single host per run to be 100):

  <?xml version="1.0"?>
  <configuration>
+   <property>
+     <name>http.agent.name</name>
+     <value>nutch-solr-integration</value>
+   </property>
+   <property>
+     <name>generate.max.per.host</name>
+     <value>100</value>
+   </property>
+   <property>
+     <name>plugin.includes</name>
+     <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
+   </property>
+ </configuration>

- b. Open regex-urlfilter.txt in directory apache-nutch-1.0/conf, replace its content with following:
+ '''b.''' Open regex-urlfilter.txt in directory apache-nutch-1.0/conf, replace its content with following:

  -^(https|telnet|file|ftp|mailto):

@@ -135, +143 @@

  # deny anything else
  -.

- 8. Create a seed list (the initial urls to fetch)
+ '''8.''' Create a seed list (the initial urls to fetch)

  mkdir urls
  echo "http://www.lucidimagination.com/" > urls/seed.txt

- 9. Inject seed url(s) to nutch crawldb (execute in nutch directory)
+ '''9.''' Inject seed url(s) to nutch crawldb (execute in nutch directory)

  bin/nutch inject crawl/crawldb urls

- 10. Generate fetch list, fetch and parse content
+ '''10.''' Generate fetch list, fetch and parse content

  bin/nutch generate crawl/crawldb crawl/segments

@@ -166, +174 @@

Now a full fetch cycle is completed. Next you can repeat step 10 a couple more times to get some more content.

- 11. Create linkdb
+ '''11.''' Create linkdb

  bin/nutch invertlinks crawl/linkdb -dir crawl/segments

- 12. Finally index all content from all segments to Solr
+ '''12.''' Finally index all content from all segments to Solr

  bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
Re: The Future of Nutch, reactivated
Hi Andrzej,

Great summary. My general feeling on this is similar to my prior comments on similar threads from Otis and from Dennis. My personal pet projects for Nutch2:

* refactored Nutch core data structures, modeled as POJOs
* refactored Nutch architecture where crawling/indexing/parsing/scoring/etc. are insulated from the underlying messaging substrate (e.g., crawl over JMS, EJB, Hadoop, RMI, etc.; crawl using Heritrix, parse using Tika or some other framework, etc.)
* simpler Nutch deployment mechanisms (separate the Nutch deployment package from the source code package); think about using Maven2

+1 to all of those and other ideas for how to improve the project's focus.

Cheers,
Chris

On 5/14/09 6:45 AM, Andrzej Bialecki a...@getopt.org wrote:
> Hi all, I'd like to revive this thread and gather additional feedback so that we end up with concrete conclusions. [...]
The Future of Nutch, reactivated
All,

Sorry that I didn't reply earlier, and thus this isn't threaded properly. I've lurked on the list via the RSS feed; I subscribed so I could put in my two cents' worth.

I've recently started using git to maintain a local branch of Nutch. My hope is to get my employer to let me contribute the general engineering work back to Nutch. We'd like to customize Nutch in various ways and use that as the basis of internal R&D and potentially some products that we'd not contribute. The other things that just make Nutch more flexible I'd like to contribute. I've been working with Nutch on and off since sometime in November or so for my job.

A couple of thoughts:

1. Nutch is too monolithic.
2. Nutch does the heavy lifting of a framework for a distributed system well.
3. Nutch doesn't really keep all the various pieces up to date very well.
4. Nutch requires at least a Bachelors in Nutch to deal with it.
5. Documentation in a Wiki is out of date, or it is hard to tell which versions various things work with.
6. Nutch isn't very friendly to simple requests if a complex hack can be found instead. (See recursive file:// handling.)

My most recent task was actually to update Tika to 0.3 and then use the Tika parsing of the docx format to index. There were several interesting problems, but I want to get permission from my employer and then just show the patches. I think we fall into category #2 (we wish we could fall into category #1, but such is life). We want to make our intranet searchable on a large scale, and would like to apply the indexing and retrieval in a number of R&D projects. We also have an interest in using Nutch/Lucene/Hadoop in a number of other problems unrelated to Internet search.

A couple of things that I'd like to help do (or see done) that would make Nutch far more framework-like, so I can assemble the pieces and parts into what I need:

1. Get Nutch and its various components into a public Maven repository, and have public scripts to do the publishing. I don't care if that is via Ant with Ivy extensions, or switching to a Maven build system. I've actually started with both approaches. I'm much better with Maven, but I think Ivy is more likely to be acceptable to the project. I'd like to see this done with Hadoop and any other core components. For now, I'm just maintaining a local POM file that pushes my builds into our local Maven repository. I'm going to do this one way or another, and would love to hear any feedback on an approach that is acceptable to be contributed back to Nutch.

2. Clearly segregate the Plugins from the Core from the bits that make it an Application. I've had fun problems with ClassLoaders, and it seems that the interfaces plugins are allowed to access are anything in Core or its existing libraries. It would seem better to have a Core Runtime, which plugins can depend upon, and which is relatively minimal. Identify the pieces of Nutch which are there to make it into a program you can run, and push those into a separate place. For APIs with multiple implementations, it would be nice not to be forced to use the same one the Core does when a plugin is written.

3. As you stated earlier, use OSGi for a plugin system and some type of dependency injection rather than hand-parsed XML files. I've had problems with the PluginClassloader (I wanted to use Tika in my plugin, and because of the plugin/classloader setup, I had to push the POI libraries into the lib directory rather than into the src/plugin/plugin-XXX/lib directory). Well, that was the first approach; the second was to hack the PluginClassloader to not delegate to the parent for the org.apache.tika package and then provide Tika in the plugin, and it all worked. Using a well-known plug-in system would have made this much easier.

4. Help transition to using 3rd-party libraries; Nutch still has an SWF parser that went unmaintained in 2002. Flash has moved a long way; it would seem sensible to either jettison that code, or update to newer versions of the same library by the same project (SWF2). Not that I care about Flash, but it seems that parsing isn't something Nutch proper is focused on.

5. With whatever build system is chosen, figure out how to set up a Maven build to construct out-of-tree Nutch plugins without having to manually deal with all of the various dependencies and packaging details.

6. Better support for running out of an IDE. The instructions work, and are very helpful. It'd be much nicer to see the use of tools or scripts to generate a saner setup than is currently there (having each plugin be a project in Eclipse would be a huge help for debugging weird classpath issues). Right now, running and compiling inside of Eclipse isn't at all similar to running it outside, if you have any type of classloader issues or multiple conflicting libraries. Not that there are any in-tree right now, but I can see how future ones could exist.

7. Make each plugin be its own deliverable (even if