Re: Next Nutch release
Dennis Kubes wrote:
Andrzej Bialecki wrote:

I believe that at this point it's crucial to keep the project well-focused (at the moment I think the main focus is on larger installations, and not the small ones), and also to make Nutch attractive to developers as a reusable search engine component.

I think there are two areas. One is to keep the focus as you stated above. The other is to provide a path to get more people involved. If no one objects I will continue working on such a path.

Please let me know if I can help in this people area. I'm currently unable to assist with technical Nutch issues on a day-to-day basis, but I am still very interested in doing what I can to ensure Nutch's long-term vitality as a project.

Cheers,
Doug
Re: Next Nutch release
Andrzej Bialecki wrote: Dennis Kubes wrote: I completely agree with this. I am interested in devoting as much time as possible to seeing the success of Nutch, Hadoop, and Lucene. As our business grows I would also be willing to devote developers full time to work on Nutch, Hadoop, and Lucene. I think that at least one company needs to come out and have a production search engine that is competition, however small, to the googles and yahoos of the world, built on Nutch and Hadoop. I thought that was the original goal of Nutch. I know there are some out there right now like Mozdex, but I mean a true billion page system. I think the .8 codebase, and yes improvements could be made, is capable of supporting such a system. I think then you will see many more developers become interested in the project. If you build it they will come. Sure, I'd love to point people to such a system. But did you do a calculation how much money in the initial investment, and then ongoing costs, is needed to maintain such an index? It cannot happen just because of someone's goodwill, there must be a sound business idea behind it, and a team of dedicated people to make it happen and persevere - not just to demonstrate how good Nutch is, but to keep up for the sake of their own business. I completely agree. We have been working on this business for almost a year. We received significant seed capital to build the alpha version of the search, which is complete, and are in the process of securing first round private equity funding to scale to 100M pages this year and up to 1B pages in year 2. Yes the initial investment for hardware, data center costs, marketing costs, and most importantly development staff for say a 1 billion page index capable of supporting 100 queries per second constant is around 5M and as it grows into the 10-20 billion range costs can grow as high as 100M. 
I think what many people don't understand is that search is as much a hardware (electricity, bandwidth) issue as it is a software issue. I know that we couldn't have developed the systems we have without Nutch, Hadoop, and Lucene and that I personally and we as a company are completely committed to their development. I will say that it is difficult for people to understand how to get more involved. I have been working with Nutch and Hadoop for almost a year now on a daily basis and only now am I understanding how to contribute through jira, etc. There needs to be more guidance in helping developers contribute. For example if you want to develop a new piece of function they do x, y, and z. Here is how to patch your system. If you want to develop a patch then here are the steps. I have programmed in Java for many years but haven't worked on many open source projects before. The process of how they work isn't explicit and it needs to be. Hmm. I might not be objective here anymore. There is however some documentation already on the Wiki, which explains how to contribute - if you feel it's inadequate please use your hard-earned experience to improve it. I am in the middle of writing a new wiki page for contributing that will go into much more detail about the process. We worked up many patches for issues we came up against in the .8 and .4 codebases but they were never contributed because, as stupid as it might sound, we really don't know how to give it back. The best thing I thought I could do was to help answer questions on the list. Again just need a little guidance. Are you willing to spend the time and do the required refactoring? Anyone else? Yes, I am and I currently have 2 other developers that can help. Sounds great. 
We could start by creating a new page on Wiki, which would collect our vision for Nutch - as I mentioned to Stefan, I think we should take a step back, and think about the strategy for the next 1-2 years of Nutch development, and what is the target audience. I am all for this, just understand this is a new process for me so will need some guidance. Sure if we start a 2.x branch and if I'm not developing for the trash or jira nirvana, I can imaging to contribute. I Just a quick comment: jira nirvana (which I believe refers to patches sitting idle in Jira for a long time) is not caused by ill will or disrespect for contributors, but foremost by limited human resources. If we want to maintain a certain level of quality, these patches cannot be applied blindly, but need to be reviewed, analyzed, applied, tested, and committed. That's an awful lot of work for 2-3 people, who also have other things to do ... It is very less attractive to developers spending weeks to find a bug like the regular expression one. Than such a bug sits there for month in the jira being rejected. Sure if nobody of the contributors run nutch with a 500 mio url It's not being rejected - see the comments on that issue, there is an overall agreement that
Re: Next Nutch release
Stefan Groschupf wrote:
I don't want to start an emotional discussion here, however talking about the problem in public might help.

What, specifically, is the problem you perceive?

Doug
Re: Next Nutch release
Just to put in my view. Stefan Groschupf wrote: Hi Andrzej, thank you for taking the time to comment, I highly value your comments. * I guess that for each case where Nutch seems inappropriate I could give you a counter-example of Nutch being used commercially with much success. I guess it depends on a particular application and the type of customer. Yes, it would be interesting to hear who use nutch .8 _successfully_ in production. Although I can't say who we are yet as we are in the middle of private equity funding, we have built a production version categorization search engine that uses the Nutch .8 and hadoop .4 code base that we are currently in the process of scaling to 100M pages. * no doubt Nutch has its warts - the plugin system could be simpler, for example ;) but hey, it's great that we have a plugin system at all! It would be easier now to refactor Nutch to use a different plugin system than it was to go from the completely monolithic design to the plugin system ... As with any open source project - if you don't like it, fix it and contribute the fix. Sure - I tried that more than once - but I do not want to start this discussion again. * things won't happen magically unless there is a greater involvement of skilled developers. One way road - well, with limited resources that this project has at the moment the only way is to gradually improve, we cannot afford to abandon the current codebase and start from scratch. I agree - the problem are skilled developers, I remember more than one offer of different companies to dedicate developers to the project, but looks like there was no interest. I completely agree with this. I am interested in devoting as much time as possible to seeing the success of Nutch, Hadoop, and Lucene. As our business grows I would also be willing to devote developers full time to work on Nutch, Hadoop, and Lucene. 
I think that at least one company needs to come out and have a production search engine that is competition, however small, to the googles and yahoos of the world, built on Nutch and Hadoop. I thought that was the original goal of Nutch. I know there are some out there right now like Mozdex, but I mean a true billion page system. I think the .8 codebase, and yes improvements could be made, is capable of supporting such a system. I think then you will see many more developers become interested in the project. If you build it they will come. I will say that it is difficult for people to understand how to get more involved. I have been working with Nutch and Hadoop for almost a year now on a daily basis and only now am I understanding how to contribute through jira, etc. There needs to be more guidance in helping developers contribute. For example if you want to develop a new piece of function they do x, y, and z. Here is how to patch your system. If you want to develop a patch then here are the steps. I have programmed in Java for many years but haven't worked on many open source projects before. The process of how they work isn't explicit and it needs to be. We worked up many patches for issues we came up against in the .8 and .4 codebases but they were never contributed because, as stupid as it might sound, we really don't know how to give it back. The best thing I thought I could do was to help answer questions on the list. Again just need a little guidance. Are you willing to spend the time and do the required refactoring? Anyone else? Yes, I am and I currently have 2 other developers that can help. In general there was some emotional discussion about API changes. Since nutch is a 0.x and also a software and not a library more frequent refactorings had may be improved the maintainability of the code over the time. Sure if we start a 2.x branch and if I'm not developing for the trash or jira nirvana, I can imaging to contribute. 
I would rethink and rewrite some major parts (e.g. remove the reusage of objects with a complex states and endless if than else conditions no body can debug) and I believe that is not difficult. I'm not talking about the algorithm stuff here. May be one day we can get some developer together first think about a good extendable design and than start a 2.x stream or a new project. I hope so too. But as Steve B. said once, what we need is developers, developers, developers ... ;) I agree, however it must be attractive for developers to spend time in a open source project. We saw many developers here. You are the only one left that does some serious development and I can't find words how much respect I have for your work. You are the only one that is able to fix serious bugs. We also have much respect for you Andrzej. You may have more developers than you think. They might just not know how to contribute. It is very less attractive to developers spending weeks to find a bug like the regular expression one. Than such a bug
Re: Next Nutch release
Dennis Kubes wrote:
I will say that it is difficult for people to understand how to get more involved. I have been working with Nutch and Hadoop for almost a year now on a daily basis and only now am I understanding how to contribute through jira, etc. There needs to be more guidance in helping developers contribute. For example if you want to develop a new piece of function they do x, y, and z. Here is how to patch your system. If you want to develop a patch then here are the steps.

The closest thing we have currently are the HowToContribute pages:

http://wiki.apache.org/nutch/HowToContribute
http://wiki.apache.org/lucene-hadoop/HowToContribute
http://wiki.apache.org/jakarta-lucene/HowToContribute

These are not great, but they're a start. Are there parts that are confusing? Do they assume too much? Are they missing things? If so, please help to update these. I note that the Nutch version is less evolved than the Lucene and Hadoop versions.

Doug
Re: Next Nutch release
Dennis Kubes wrote: I completely agree with this. I am interested in devoting as much time as possible to seeing the success of Nutch, Hadoop, and Lucene. As our business grows I would also be willing to devote developers full time to work on Nutch, Hadoop, and Lucene. I think that at least one company needs to come out and have a production search engine that is competition, however small, to the googles and yahoos of the world, built on Nutch and Hadoop. I thought that was the original goal of Nutch. I know there are some out there right now like Mozdex, but I mean a true billion page system. I think the .8 codebase, and yes improvements could be made, is capable of supporting such a system. I think then you will see many more developers become interested in the project. If you build it they will come. Sure, I'd love to point people to such a system. But did you do a calculation how much money in the initial investment, and then ongoing costs, is needed to maintain such an index? It cannot happen just because of someone's goodwill, there must be a sound business idea behind it, and a team of dedicated people to make it happen and persevere - not just to demonstrate how good Nutch is, but to keep up for the sake of their own business. I will say that it is difficult for people to understand how to get more involved. I have been working with Nutch and Hadoop for almost a year now on a daily basis and only now am I understanding how to contribute through jira, etc. There needs to be more guidance in helping developers contribute. For example if you want to develop a new piece of function they do x, y, and z. Here is how to patch your system. If you want to develop a patch then here are the steps. I have programmed in Java for many years but haven't worked on many open source projects before. The process of how they work isn't explicit and it needs to be. Hmm. I might not be objective here anymore. 
There is however some documentation already on the Wiki, which explains how to contribute - if you feel it's inadequate please use your hard-earned experience to improve it. We worked up many patches for issues we came up against in the .8 and .4 codebases but they were never contributed because, as stupid as it might sound, we really don't know how to give it back. The best thing I thought I could do was to help answer questions on the list. Again just need a little guidance. Are you willing to spend the time and do the required refactoring? Anyone else? Yes, I am and I currently have 2 other developers that can help. Sounds great. We could start by creating a new page on Wiki, which would collect our vision for Nutch - as I mentioned to Stefan, I think we should take a step back, and think about the strategy for the next 1-2 years of Nutch development, and what is the target audience. Sure if we start a 2.x branch and if I'm not developing for the trash or jira nirvana, I can imaging to contribute. I Just a quick comment: jira nirvana (which I believe refers to patches sitting idle in Jira for a long time) is not caused by ill will or disrespect for contributors, but foremost by limited human resources. If we want to maintain a certain level of quality, these patches cannot be applied blindly, but need to be reviewed, analyzed, applied, tested, and committed. That's an awful lot of work for 2-3 people, who also have other things to do ... It is very less attractive to developers spending weeks to find a bug like the regular expression one. Than such a bug sits there for month in the jira being rejected. Sure if nobody of the contributors run nutch with a 500 mio url It's not being rejected - see the comments on that issue, there is an overall agreement that it's ok; it simply hasn't been applied yet. See above for the why. I'm slowly coming to a point where I should be able to fix it - but let's not throw out the baby with the water ... 
Wow, I hold my finger crossed! There is a great book on this. It is 0691122024. Andrzej send me your address and I will buy and ship you a copy if you don't have it. Too late :) I found it two weeks ago, and it's already on its merry way - but thanks for the offer. We would also be willing to help develop this functionality further. I started working on a testbed as a part of another commercial project, it's likely that I could get a release from the customer to contribute this code to the project. A testbed is a prerequisite for any serious work on ranking and web graph. (It's quite unfortunate that the best-of-breed open source framework for working with web graphs is licensed under LGPL ...) I can definitely see a desire to re-write but I think even if you re-write you are still going to have the same problem. Search is hard and without guidance we can't get enough developers to understand what they need to know to help. Indeed. People often
Re: Next Nutch release
Hi,

I just finished reading all the source code of the Nutch GUI. Personally, I don't like putting a lot of code snippets into JSP files, since it takes a lot of time when refactoring. So how about adopting Velocity/FreeMarker with servlets?

On 1/17/07, Enis Soztutar [EMAIL PROTECTED] wrote:

Hi all,

for NUTCH-251: I suppose that NUTCH-251 is a relatively significant issue, judging by the votes. Stefan has written a good plugin for the admin GUI and I have updated it to work with nutch-0.8, hadoop 0.4. Some of the features in the patch are not appropriate for our use cases and it requires Hadoop changes, thus I am currently working on an alternative implementation of the administration GUI, which runs a Hadoop server (like JobTracker) to listen to submitted jobs, a web GUI to submit and track the jobs from the browser, and a job runner.

The architecture details of the patch are as follows:

- An interface AdminJob which is an abstract class representing a Job in nutch.
- various classes extending AdminJob, for example FetchAdminJob, IndexAdminJob.
- A queue which sorts the jobs in priority order, by a modified topological sort (jobs can be dependent).
- an interface to submit jobs
- an RPC server to listen to job submissions
- an extension point (basically same as the previous)
- a web server to serve plugin JSPs

The features will be:

- submitting jobs from code, command line or web interface
- tracking jobs from the command line or web interface
- scheduling jobs

I could send the code or details if anyone is interested in pretesting. And I will appreciate any comments and suggestions on this. I am planning to complete the patch and submit it to Jira ASAP.

Sami Siren wrote:

Hello,

It has been a while since the previous release (0.8.1) and looking at the great fixes done in trunk I'd start thinking about baking a new release soon.
Looking at the Jira roadmaps, there is 1 blocking issue (fixing the license headers) for 0.8.2 and two other blocking issues for 0.9.0, of which I think NUTCH-233 is safe to put in.

The top 10 voted issues are currently:

NUTCH-61   Adaptive re-fetch interval. Detecting unmodified content
NUTCH-48   Did you mean query enhancement/refinement feature
NUTCH-251  Administration GUI
NUTCH-289  CrawlDatum should store IP address
NUTCH-36   Chinese in Nutch
NUTCH-185  XMLParser is configurable xml parser plugin.
NUTCH-59   meta data support in webdb
NUTCH-92   DistributedSearch incorrectly scores results
NUTCH-68   A tool to generate arbitrary fetchlists
NUTCH-87   Efficient site-specific crawling for a large number of sites

Are there any opinions about issues that should go in before the next release? (Answering yes means that you are willing to provide a patch for it.)

--
Sami Siren
Re: Next Nutch release
Hi, I just finished reading all source code about nutch gui. And personally i don't like putting a lot of code snippets into jsp files since it takes a lot time when refactoring. So how about to adopt using velocity/freemarker with servlet?

In general I agree: it is the view layer and should have as little code as possible. However, the idea was to have as few dependencies on third-party tools and libraries as possible, and also to get things realized with low tech (JSP).

Stefan
Re: Next Nutch release
Stefan Groschupf wrote:
We run the gui in several production environments with patched hadoop code - since this is from our point of view the clean approach. Everything else feels like a workaround to fix some strange hadoop behaviors.

Are there issues in Hadoop's Jira for these? If so, do they have patches attached? Are they linked to the corresponding issue in Nutch?

Doug
Re: Next Nutch release
Stefan,

I also dived into contrib/web2 in Nutch. Both it and admin-gui own some plugins based on the Nutch plugin architecture. So I think it would be great if we extracted something at a high level, since they should have a lot in common. Well, I don't know whether it is the right time to do this job.

On 1/19/07, Stefan Groschupf [EMAIL PROTECTED] wrote:

Hi, I just finished reading all source code about nutch gui. And personally i don't like putting a lot of code snippets into jsp files since it takes a lot time when refactoring. So how about to adopt using velocity/freemarker with servlet?

In general I agree: it is the view layer and should have as little code as possible. However, the idea was to have as few dependencies on third-party tools and libraries as possible, and also to get things realized with low tech (JSP).

Stefan
Re: Next Nutch release
The old hadoop patch is here:
https://issues.apache.org/jira/browse/NUTCH-251

Also we had this conversation:
http://www.mail-archive.com/hadoop-dev@lucene.apache.org/msg00314.html

I guess after this we missed posting the patches we use internally. If someone feels strongly about getting the gui working with hadoop, he/she should feel free to update the patch and post it in the hadoop jira.

Stefan

On 18.01.2007, at 15:39, Doug Cutting wrote:

Stefan Groschupf wrote:
We run the gui in several production environments with patched hadoop code - since this is from our point of view the clean approach. Everything else feels like a workaround to fix some strange hadoop behaviors.

Are there issues in Hadoop's Jira for these? If so, do they have patches attached? Are they linked to the corresponding issue in Nutch?

Doug

~~~
101tec Inc.
Menlo Park, California
http://www.101tec.com
Re: Next Nutch release
Hi Scott,

feel free - I have no opinion on that. From my very little point of view the nutch .8 source stream is a one way street. In all my projects we move as far as possible away from nutch. I like hadoop a lot and writing customer tools on top of it is - that easy. But nutch .8 was a proof of concept for the early hadoop. There is only one serious developer left and wow how great he does his job - but nutch .8 is just too monolithic, too difficult to extend, too difficult to debug, too difficult to integrate for a serious mission critical application. I spend a significant part of my life daily working with nutch, but if someone would ask - I would answer don't use it.

Maybe one day we can get some developers together to first think about a good extensible design and then start a 2.x stream or a new project. And ... yes no OPIC and yes definitely no plugin architecture (I feel very sorry for all who wasted so much lifetime because of my terribly complicated plugin system) but a clean IoC design with lightweight default interface implementations and great test coverage.

Anyway just my *very little* point of view based on 3.5 years nutch experience.

Stefan

On 18.01.2007, at 21:33, Scott Green wrote:

Stefan, I also dived into contrib/web2 in nutch. The one and admin-gui are both owns some plugins based on nutch plugin architecture. So I think it is great if we extract something in high level and they should have a lot commons. Well, i dont know it is the right time to do this job.

On 1/19/07, Stefan Groschupf [EMAIL PROTECTED] wrote:

Hi, I just finished reading all source code about nutch gui. And personally i don't like putting a lot of code snippets into jsp files since it takes a lot time when refactoring. So how about to adopt using velocity/freemarker with servlet?

In general I agree: it is the view layer and should have as little code as possible. However, the idea was to have as few dependencies on third-party tools and libraries as possible, and also to get things realized with low tech (JSP).

Stefan

~~~
101tec Inc.
Menlo Park, California
http://www.101tec.com
Re: Next Nutch release
Hi all,

for NUTCH-251: I suppose that NUTCH-251 is a relatively significant issue, judging by the votes. Stefan has written a good plugin for the admin GUI and I have updated it to work with nutch-0.8, hadoop 0.4. Some of the features in the patch are not appropriate for our use cases and it requires Hadoop changes, thus I am currently working on an alternative implementation of the administration GUI, which runs a Hadoop server (like JobTracker) to listen to submitted jobs, a web GUI to submit and track the jobs from the browser, and a job runner.

The architecture details of the patch are as follows:

- An interface AdminJob which is an abstract class representing a Job in nutch.
- various classes extending AdminJob, for example FetchAdminJob, IndexAdminJob.
- A queue which sorts the jobs in priority order, by a modified topological sort (jobs can be dependent).
- an interface to submit jobs
- an RPC server to listen to job submissions
- an extension point (basically same as the previous)
- a web server to serve plugin JSPs

The features will be:

- submitting jobs from code, command line or web interface
- tracking jobs from the command line or web interface
- scheduling jobs

I could send the code or details if anyone is interested in pretesting. And I will appreciate any comments and suggestions on this. I am planning to complete the patch and submit it to Jira ASAP.

Sami Siren wrote:

Hello,

It has been a while since the previous release (0.8.1) and looking at the great fixes done in trunk I'd start thinking about baking a new release soon. Looking at the Jira roadmaps, there is 1 blocking issue (fixing the license headers) for 0.8.2 and two other blocking issues for 0.9.0, of which I think NUTCH-233 is safe to put in.

The top 10 voted issues are currently:

NUTCH-61   Adaptive re-fetch interval. Detecting unmodified content
NUTCH-48   Did you mean query enhancement/refinement feature
NUTCH-251  Administration GUI
NUTCH-289  CrawlDatum should store IP address
NUTCH-36   Chinese in Nutch
NUTCH-185  XMLParser is configurable xml parser plugin.
NUTCH-59   meta data support in webdb
NUTCH-92   DistributedSearch incorrectly scores results
NUTCH-68   A tool to generate arbitrary fetchlists
NUTCH-87   Efficient site-specific crawling for a large number of sites

Are there any opinions about issues that should go in before the next release? (Answering yes means that you are willing to provide a patch for it.)

--
Sami Siren
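The job queue in the architecture list above (jobs sorted in priority order, with dependencies handled by a modified topological sort) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: `SketchJob` and `JobScheduler` are hypothetical names standing in for AdminJob and its queue, "higher number means higher priority" is an assumed convention, and the ordering rule (among all jobs whose dependencies are satisfied, take the highest priority first) is one plausible reading of "modified topological sort".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical stand-in for the AdminJob idea; not the class from the patch.
class SketchJob {
    final String name;
    final int priority; // assumption: a higher value runs earlier
    final List<SketchJob> dependsOn = new ArrayList<>();

    SketchJob(String name, int priority) {
        this.name = name;
        this.priority = priority;
    }

    // Declares that this job may only run after 'other' has run.
    SketchJob after(SketchJob other) {
        dependsOn.add(other);
        return this;
    }
}

class JobScheduler {
    // Topological sort (Kahn's algorithm) where the set of ready jobs is a
    // priority queue, so ties between runnable jobs are broken by priority.
    static List<SketchJob> order(List<SketchJob> jobs) {
        Map<SketchJob, Integer> unmet = new HashMap<>();
        Map<SketchJob, List<SketchJob>> dependents = new HashMap<>();
        for (SketchJob job : jobs) {
            unmet.put(job, job.dependsOn.size());
            for (SketchJob dep : job.dependsOn) {
                dependents.computeIfAbsent(dep, k -> new ArrayList<>()).add(job);
            }
        }

        PriorityQueue<SketchJob> ready =
            new PriorityQueue<>((a, b) -> Integer.compare(b.priority, a.priority));
        for (SketchJob job : jobs) {
            if (unmet.get(job) == 0) {
                ready.add(job);
            }
        }

        List<SketchJob> runOrder = new ArrayList<>();
        while (!ready.isEmpty()) {
            SketchJob next = ready.poll();
            runOrder.add(next);
            // Finishing 'next' may unblock jobs that were waiting on it.
            for (SketchJob waiting : dependents.getOrDefault(next, Collections.emptyList())) {
                if (unmet.merge(waiting, -1, Integer::sum) == 0) {
                    ready.add(waiting);
                }
            }
        }

        if (runOrder.size() != jobs.size()) {
            throw new IllegalStateException("dependency cycle or missing dependency");
        }
        return runOrder;
    }
}
```

With a fetch job (priority 1), an index job (priority 9, depending on fetch) and an unrelated report job (priority 5), this yields report, fetch, index: the index job has the highest priority but cannot run before the fetch it depends on.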
Re: Next Nutch release
2007/1/17, Enis Soztutar [EMAIL PROTECTED]:

Hi all, for NUTCH-251: I suppose that NUTCH-251 is relatively a significant issue by the votes. Stafan has written a good plugin for the admin gui and i have updated it to work with nutch-0.8, hadoop 0.4.

Good to hear someone is working on that! Why not target it to trunk version of Nutch?

- a web server to serve plugin jsp's

Why not make it regular war? also please consider making a clean separation of view/logic when you implement the web ui.

--
Sami Siren
Re: Next Nutch release
The top 10 voted issues are currently:

NUTCH-61 Adaptive re-fetch interval. Detecting unmodified content

Well ... I'm of a split mind on this. I can bring this patch up to date and apply it before 0.9.0, if we understand that this is a 0 release ... ;) Otherwise I'd prefer to wait with it right after the release.

+1 for putting it in after 0.9.0

I would like also to proceed with NUTCH-339 (Fetcher2 patches plus some changes I made in the meantime), since I'd like to expose the new fetcher to a broader audience, and it doesn't affect the existing implementation.

+1 for putting it in before 0.9.0

NUTCH-48 Did you mean query enhancement/refinement feature
NUTCH-251 Administration GUI
NUTCH-289 CrawlDatum should store IP address

I'm still not entirely convinced about this - and there is already a mechanism in place to support it if someone really wishes to keep this particular info (CrawlDatum.metaData).

NUTCH-36 Chinese in Nutch
NUTCH-185 XMLParser is configurable xml parser plugin.
NUTCH-59 meta data support in webdb
NUTCH-92 DistributedSearch incorrectly scores results

This is too intrusive to fix just before the release - and needs additional discussion.

+1

NUTCH-68 A tool to generate arbitrary fetchlists

Easy to port this to 0.9.0 - I can do this.

cool.

I'll start working on the headers and stuff to get the blocking issue away.

--
Sami Siren
Re: Next Nutch release
Sami Siren wrote:

2007/1/17, Enis Soztutar [EMAIL PROTECTED]:

Hi all, for NUTCH-251: I suppose that NUTCH-251 is relatively a significant issue by the votes. Stafan has written a good plugin for the admin gui and i have updated it to work with nutch-0.8, hadoop 0.4.

Good to hear someone is working on that! Why not target it to trunk version of Nutch?

It is targeted to the trunk already. The previous one was targeted to nutch-0.8, hadoop 0.4, since back then those versions were the latest in the trunk.

- a web server to serve plugin jsp's

Why not make it regular war? also please consider making a clean separation of view/logic when you implement the web ui.

As Stefan's version used an embedded Jetty server, I continued this way. But I will consider that possibility also.

--
Sami Siren
RE: Next Nutch release
Hi guys,

I have been working on NUTCH-61 (Adaptive re-fetch interval. Detecting unmodified content), applying it to Nutch 0.8.1. Here are some points:

1. This feature is great for Nutch to have, as it differentiates between modified and unmodified content, therefore not indexing twice even if the document fetch time has arrived.
   a. There are some performance issues here. Even with this patch, Nutch still fetches the content and then checks its status against the last modified time in the database. If it has to check 1000 files before indexing the following 10 files, this will cause a real problem for those that are after real-time indexing.
2. Since I applied this patch to Nutch 0.8.1, when I try to parse xml files with our modified version of the xmlparser/indexer plugin, the fetcher throws the following exception:

   WARN fetcher.Fetcher - Error parsing: file:/C:/880254/8802_583254_20051006_12.xml: failed(2,200): java.lang.IllegalStateException: Root element not set

   The system will not hang or crash but the xml file will be indexed without any generated fields. The plugin works fine without the patch. I have another parser that parses graphics and other formats that fails when used with the patch. So far this problem occurs when using the file protocol.
3. The patch works fine when indexing a web site using the http protocol.

I am willing to work with Andrzej to make it stable, as I understand he is the architect of this patch. I have the possibility of testing it in a mixed environment in our computer lab. This patch can be the stepping stone for other features such as real-time indexing and a fetch queue for index updating, as opposed to creating a new index each time.

Best Regards,
Armel

-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com

-Original Message-
From: Enis Soztutar [mailto:[EMAIL PROTECTED]
Sent: 17 January 2007 15:39
To: nutch-dev@lucene.apache.org
Subject: Re: Next Nutch release

Sami Siren wrote:

2007/1/17, Enis Soztutar [EMAIL PROTECTED]:

Hi all, for NUTCH-251: I suppose that NUTCH-251 is relatively a significant issue by the votes. Stafan has written a good plugin for the admin gui and i have updated it to work with nutch-0.8, hadoop 0.4.

Good to hear someone is working on that! Why not target it to trunk version of Nutch?

It is targetted to the trunk already. The previous was targetted to nutch-0.8, hadoop 0.4, since back then that versions was the latest in the trunk

- a web server to serve plugin jsp's

Why not make it regular war? also please consider making a clean separation of view/logic when you implement the web ui.

As Stafan's version used embedded Jetty server, I continued this way. But i will consider that possibility also.

--
Sami Siren
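The two ideas in the NUTCH-61 title, detecting unmodified content and adapting the re-fetch interval, can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the patch: `RefetchSketch` is a hypothetical class, an MD5 hash of the content is used as the change signal (the message above mentions the last modified time instead), and the halving and 1.5x factors and the interval bounds are arbitrary. As in point 1a above, the content still has to be fetched before the check can be made.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical sketch; names, factors and bounds are illustrative assumptions.
class RefetchSketch {
    static final long MIN_INTERVAL_SEC = 60L;
    static final long MAX_INTERVAL_SEC = 30L * 24 * 3600; // 30 days

    // Content signature used to decide whether a page changed since the last fetch.
    static byte[] signature(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // A page with no stored signature is treated as modified (first fetch).
    static boolean isModified(byte[] storedSignature, byte[] fetchedContent) {
        return storedSignature == null
            || !Arrays.equals(storedSignature, signature(fetchedContent));
    }

    // Re-fetch changed pages sooner and unchanged pages later, within fixed bounds.
    static long nextIntervalSec(long currentSec, boolean modified) {
        long next = modified ? currentSec / 2 : currentSec * 3 / 2;
        return Math.max(MIN_INTERVAL_SEC, Math.min(MAX_INTERVAL_SEC, next));
    }
}
```

Under these assumptions, an unchanged page could skip re-indexing entirely, and a page that keeps coming back unchanged drifts toward the maximum interval, which reduces (but does not remove) the wasted fetches described in point 1a.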
Re: Next Nutch release
Armel T. Nene wrote: I am willing to work with Andrzej to make it stable, as I understand he is the architect of this patch. I have the possibility of testing it in a mixed environment in our computer lab. This patch can be the stepping stone for other features such as real-time indexing and a fetch queue for index updating, as opposed to creating a new index each time. Thanks for taking the initiative! I'll be glad to review the patch and apply it right after the 0.9 release. The best way to keep the process open would be to make an svn diff and attach this new version of the patch to the JIRA issue. -- Best regards, Andrzej Bialecki Information Retrieval, Semantic Web; Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Next Nutch release
Hi, great to hear people are still working on things. It shows once more that getting something in early would save some effort. :) Just some random comments. We run the GUI in several production environments with patched Hadoop code, since from our point of view this is the clean approach. Everything else feels like a workaround to fix some strange Hadoop behaviors. It may be a long time ago that I spoke to Doug and some other Hadoop developers, but at that time I understood that there was a general interest in having a Nutch GUI and in supporting the required functionality in Hadoop. I'm not sure if that is still the case or if I had a wrong impression. In any case, from my p.o.v. the clean way would be getting the required minor changes into Hadoop (not critical, simple stuff from my point of view) instead of implementing workarounds in Nutch. Since Hadoop is a kind of child of Nutch, there should be a close relation, at least to discuss things. Anyway, no strong opinion, just my 2 cents. In any case I'm very happy if people now see the need for a GUI as well and someone is working on that, since I'm kind of busy with other projects. Thanks. Stefan On 17.01.2007, at 06:42, Enis Soztutar wrote: Hi all, for NUTCH-251: I suppose that NUTCH-251 is a relatively significant issue, judging by the votes. Stefan has written a good plugin for the admin GUI and I have updated it to work with nutch-0.8, hadoop 0.4. Some of the features in the patch are not appropriate for our use cases and it requires Hadoop changes, thus I am currently working on an alternative implementation of the administration GUI, which runs a Hadoop server (like JobTracker) to listen for submitted jobs, a web GUI to submit and track the jobs from the browser, and a job runner. The architecture details of the patch are as follows: - An interface AdminJob, which is an abstract class representing a job in Nutch. - Various classes extending AdminJob, for example FetchAdminJob, IndexAdminJob. 
- A queue which sorts the jobs in priority order, by a modified topological sort (jobs can be dependent). - An interface to submit jobs - An RPC server to listen for job submissions - An extension point (basically the same as the previous one) - A web server to serve plugin JSPs The features will be: - submitting jobs from code, command line or web interface, - tracking jobs from the command line or web interface - scheduling jobs I could send the code or details if anyone is interested in pretesting. And I will appreciate any comments and suggestions on this. I am planning to complete the patch and submit it to Jira ASAP. Sami Siren wrote: Hello, It has been a while since the previous release (0.8.1) and looking at the great fixes done in trunk I'd start thinking about baking a new release soon. Looking at the jira roadmaps there is 1 blocking issue (fixing the license headers) for 0.8.2 and two other blocking issues for 0.9.0, of which I think NUTCH-233 is safe to put in. The top 10 voted issues are currently: NUTCH-61 Adaptive re-fetch interval. Detecting unmodified content NUTCH-48 Did you mean query enhancement/refinement feature NUTCH-251 Administration GUI NUTCH-289 CrawlDatum should store IP address NUTCH-36 Chinese in Nutch NUTCH-185 XMLParser is configurable xml parser plugin. NUTCH-59 meta data support in webdb NUTCH-92 DistributedSearch incorrectly scores results NUTCH-68 A tool to generate arbitrary fetchlists NUTCH-87 Efficient site-specific crawling for a large number of sites Are there any opinions about issues that should go in before the next release (Answering yes means that you are willing to provide a patch for it). -- Sami Siren ~~~ 101tec Inc. Menlo Park, California http://www.101tec.com
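The job queue in Enis's architecture list ("sorts the jobs in priority order, by a modified topological sort") could look roughly like the sketch below. Enis's actual code is not shown in the thread, so everything here is an assumption: the `AdminJob` fields, the `PrintJob` subclass, the `JobOrderSketch` class, and the rule that a larger priority value runs earlier. The technique is Kahn's topological sort with the usual FIFO of ready nodes replaced by a priority queue, so that among jobs whose dependencies are satisfied the highest-priority one runs first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical stand-in for the AdminJob class named in the message. */
abstract class AdminJob {
    final String name;
    final int priority; // assumption: a larger value runs earlier
    final List<AdminJob> dependsOn = new ArrayList<>();

    AdminJob(String name, int priority, AdminJob... deps) {
        this.name = name;
        this.priority = priority;
        dependsOn.addAll(Arrays.asList(deps));
    }

    abstract void run();
}

/** Trivial concrete job used only to demonstrate the ordering. */
final class PrintJob extends AdminJob {
    PrintJob(String name, int priority, AdminJob... deps) {
        super(name, priority, deps);
    }

    @Override
    void run() {
        System.out.println("running " + name);
    }
}

final class JobOrderSketch {
    /** Returns the jobs in an order that respects dependencies, preferring high priority. */
    static List<AdminJob> order(List<AdminJob> jobs) {
        Map<AdminJob, Integer> pending = new HashMap<>();          // unmet dependency counts
        Map<AdminJob, List<AdminJob>> dependents = new HashMap<>(); // reverse edges
        for (AdminJob job : jobs) {
            pending.put(job, job.dependsOn.size());
            for (AdminJob dep : job.dependsOn) {
                dependents.computeIfAbsent(dep, k -> new ArrayList<>()).add(job);
            }
        }
        // Ready jobs are polled by descending priority, ties broken by name.
        PriorityQueue<AdminJob> ready = new PriorityQueue<>(
                Comparator.comparingInt((AdminJob j) -> j.priority).reversed()
                        .thenComparing(j -> j.name));
        for (AdminJob job : jobs) {
            if (job.dependsOn.isEmpty()) {
                ready.add(job);
            }
        }
        List<AdminJob> ordered = new ArrayList<>();
        while (!ready.isEmpty()) {
            AdminJob next = ready.poll();
            ordered.add(next);
            for (AdminJob d : dependents.getOrDefault(next, Collections.emptyList())) {
                if (pending.merge(d, -1, Integer::sum) == 0) {
                    ready.add(d); // all of its dependencies have now been scheduled
                }
            }
        }
        if (ordered.size() != jobs.size()) {
            throw new IllegalStateException("cyclic or missing dependency");
        }
        return ordered;
    }

    public static void main(String[] args) {
        AdminJob fetch = new PrintJob("fetch", 1);
        AdminJob index = new PrintJob("index", 5, fetch);
        AdminJob report = new PrintJob("report", 9);
        for (AdminJob job : order(Arrays.asList(index, fetch, report))) {
            job.run();
        }
    }
}
```

In this sketch a dependency always wins over priority: "index" has a higher priority than "fetch" but still waits for it. Whether the real patch resolves that conflict the same way is not stated in the thread.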
Re: Next Nutch release
Sami Siren wrote: Hello, It has been a while since the previous release (0.8.1) and looking at the great fixes done in trunk I'd start thinking about baking a new release soon. Looking at the jira roadmaps there is 1 blocking issue (fixing the license headers) for 0.8.2 and two other blocking issues for 0.9.0, of which I think NUTCH-233 is safe to put in. Agreed. The replacement regex mentioned in the original comment seems safe enough, and simpler. The top 10 voted issues are currently: NUTCH-61 Adaptive re-fetch interval. Detecting unmodified content Well ... I'm of a split mind on this. I can bring this patch up to date and apply it before 0.9.0, if we understand that this is a 0 release ... ;) Otherwise I'd prefer to wait with it until right after the release. I would also like to proceed with NUTCH-339 (Fetcher2 patches plus some changes I made in the meantime), since I'd like to expose the new fetcher to a broader audience, and it doesn't affect the existing implementation. NUTCH-48 Did you mean query enhancement/refinement feature NUTCH-251 Administration GUI NUTCH-289 CrawlDatum should store IP address I'm still not entirely convinced about this - and there is already a mechanism in place to support it if someone really wishes to keep this particular info (CrawlDatum.metaData). NUTCH-36 Chinese in Nutch NUTCH-185 XMLParser is configurable xml parser plugin. NUTCH-59 meta data support in webdb NUTCH-92 DistributedSearch incorrectly scores results NUTCH-68 This is too intrusive to fix just before the release - and needs additional discussion. NUTCH-68 A tool to generate arbitrary fetchlists Easy to port this to 0.9.0 - I can do this. NUTCH-87 Efficient site-specific crawling for a large number of sites -- Best regards, Andrzej Bialecki Information Retrieval, Semantic Web; Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Next Nutch release
Sami, thanks a lot. I would like to see a feature added: a link to a web page that shows all already indexed URLs, so that other spiders can fetch this page and get the URLs that the open source Nutch already has to offer. So we need not only open source code for the machine, but also every node offering an open, downloadable database of URLs. And we need a list of URLs of other Nutch domains. With this list, each Nutch can crawl the URLs of the other Nutch installations that provide them on a website. As millions of URLs are a lot, I suggest having 26 web pages, from a to z: the page for the 'word' a displays all of its URLs, plus links to the 25 other pages b-z. Then several Nutch nodes could use a small p2p feature, and the sister YaCy could also fetch the URLs from a central open source point: all Nutch domains. Would it be possible to generate such a web page link somewhere on the Nutch homepage of the individual server install, listing all URLs? Open source has to be founded on solidarity, so make the Nutch URL database open to other open source search engine spiders from central points. Thanks. Original Message Date: Tue, 16 Jan 2007 17:53:41 +0200 From: Sami Siren [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Subject: Next Nutch release Hello, It has been a while since the previous release (0.8.1) and looking at the great fixes done in trunk I'd start thinking about baking a new release soon. Looking at the jira roadmaps there is 1 blocking issue (fixing the license headers) for 0.8.2 and two other blocking issues for 0.9.0, of which I think NUTCH-233 is safe to put in. The top 10 voted issues are currently: NUTCH-61 Adaptive re-fetch interval. Detecting unmodified content NUTCH-48 Did you mean query enhancement/refinement feature NUTCH-251 Administration GUI NUTCH-289 CrawlDatum should store IP address NUTCH-36 Chinese in Nutch NUTCH-185 XMLParser is configurable xml parser plugin. 
NUTCH-59 meta data support in webdb NUTCH-92 DistributedSearch incorrectly scores results NUTCH-68 A tool to generate arbitrary fetchlists NUTCH-87 Efficient site-specific crawling for a large number of sites Are there any opinions about issues that should go in before the next release (Answering yes means that you are willing to provide a patch for it). -- Sami Siren
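The a-to-z listing proposed in the message above could be sketched as follows. This is only one reading of the suggestion and nothing like it exists in Nutch as far as this thread shows: the class and method names are hypothetical, and filing each URL under the first letter of its host name (ignoring a leading "www.") is an assumption, since the message does not say which "word" decides the page. URLs without a usable host, or with a host that does not start with a letter, go to a catch-all bucket `#`.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: split a URL list into per-letter pages. */
final class UrlLetterPagesSketch {
    /** Assumption: the bucket is the first letter of the host, without a leading "www.". */
    static char bucketOf(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            host = null; // not a syntactically valid URI
        }
        if (host == null || host.isEmpty()) {
            return '#';
        }
        host = host.toLowerCase(Locale.ROOT);
        if (host.startsWith("www.")) {
            host = host.substring(4);
        }
        if (host.isEmpty()) {
            return '#';
        }
        char first = host.charAt(0);
        return (first >= 'a' && first <= 'z') ? first : '#';
    }

    /** Groups URLs by bucket; both the buckets and the URLs inside them are sorted. */
    static Map<Character, SortedSet<String>> group(List<String> urls) {
        Map<Character, SortedSet<String>> pages = new TreeMap<>();
        for (String url : urls) {
            pages.computeIfAbsent(bucketOf(url), k -> new TreeSet<>()).add(url);
        }
        return pages;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://www.apache.org/",
                "http://lucene.apache.org/nutch/",
                "http://127.0.0.1/status");
        group(urls).forEach((letter, page) -> System.out.println(letter + " -> " + page));
    }
}
```

Each bucket would then be rendered as one page that lists its URLs and links to the other letter pages.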
Re: Next Nutch release
Folks, When would you like to make the release? I've been working on NUTCH-185, but got a bit bogged down with other work. If there is interest in having NUTCH-185 included in the release, I could make a push to get out a patch by week's end... As for the rest, my +1 for NUTCH-61 being included sooner rather than later. It seems that the patch has garnered enough use and attention that folks would like to see it in the release. I think the email from the user trying to manage a terabyte of data a few days back was particularly telling. Cheers, Chris On 1/16/07 8:19 AM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Sami Siren wrote: Hello, It has been a while since the previous release (0.8.1) and looking at the great fixes done in trunk I'd start thinking about baking a new release soon. Looking at the jira roadmaps there is 1 blocking issue (fixing the license headers) for 0.8.2 and two other blocking issues for 0.9.0, of which I think NUTCH-233 is safe to put in. Agreed. The replacement regex mentioned in the original comment seems safe enough, and simpler. The top 10 voted issues are currently: NUTCH-61 Adaptive re-fetch interval. Detecting unmodified content Well ... I'm of a split mind on this. I can bring this patch up to date and apply it before 0.9.0, if we understand that this is a 0 release ... ;) Otherwise I'd prefer to wait with it until right after the release. I would also like to proceed with NUTCH-339 (Fetcher2 patches plus some changes I made in the meantime), since I'd like to expose the new fetcher to a broader audience, and it doesn't affect the existing implementation. NUTCH-48 Did you mean query enhancement/refinement feature NUTCH-251 Administration GUI NUTCH-289 CrawlDatum should store IP address I'm still not entirely convinced about this - and there is already a mechanism in place to support it if someone really wishes to keep this particular info (CrawlDatum.metaData). NUTCH-36 Chinese in Nutch NUTCH-185 XMLParser is configurable xml parser plugin. 
NUTCH-59 meta data support in webdb NUTCH-92 DistributedSearch incorrectly scores results NUTCH-68 This is too intrusive to fix just before the release - and needs additional discussion. NUTCH-68 A tool to generate arbitrary fetchlists Easy to port this to 0.9.0 - I can do this. NUTCH-87 Efficient site-specific crawling for a large number of sites