Re: project vitality? / less documentation is more!
Hello, Just my 2 cents: the Intranet crawl functionality is VERY confusing. If it were simply taken out of the tutorial, and out of the set of commands, that would actually help A LOT: I understood many, many things about Nutch once I tried so-called whole-web crawling, where one has to run every command one at a time. That would also eliminate all the questions about how to recrawl, etc. Or maybe a change of name would be enough: Intranet crawl could be called "fast-setup crawl", and whole-web crawling "serious crawling for Intranet or whole-web projects". What do you think? Thanks, Frank.
RE: project vitality? / less documentation is more!
-1

I found the instructions for doing an Intranet crawl extremely helpful for getting up and running quickly. I went back later and figured out more about what it was actually doing. Perhaps the name could just be changed to Single Site Crawling with the Nutch Shell Script and some explanatory text could be added. I'll try to take the time today to put a version of the tutorial on the wiki that does that. Then if folks agree, I'll put together a patch that changes the site links for the tutorial to point at the wiki.

Thanks, Jake.

-Original Message- From: Franz Werfel [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 07, 2006 3:01 AM To: nutch-user@lucene.apache.org Subject: Re: project vitality? / less documentation is more!
RE: project vitality? / less documentation is more!
You're right about the single site thing, but I think just changing the title and adding a bit more of an explanation should do the trick. I went ahead and put up a version of the tutorial on the wiki. I haven't changed it other than to try to get the formatting similar to what's on the current tutorial. Feel free to edit. http://wiki.apache.org/nutch/NutchTutorial

Thanks, Jake.

-Original Message- From: Franz Werfel [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 07, 2006 10:11 AM To: nutch-user@lucene.apache.org Subject: Re: project vitality? / less documentation is more!

Hello, "single site crawling" wouldn't address the confusion that results from the fact that the 'crawl' command is actually the concatenation of several commands; and it would not be accurate either, since you can crawl several sites with 'crawl'. I have to agree that it helps with getting up and running quickly; however, my point is that, after this first phase, it is _more_ difficult to go on to the next phase than if one hadn't taken this first step at all... Maybe at the end of the tutorial for Intranet crawling the following sentence could be added: "If you want to crawl the same site _again_, use the whole-web tutorial below, and NOT the crawl command." Also, the sentence "Whole-web crawling is designed to handle very large crawls which may take weeks to complete, running on multiple machines" is misleading, since one has to use whole-web crawling to fine-tune or recrawl even the smallest of websites. The distinction is not only in the scale of the project, but in the level of control one wants (IMHO). The documentation should at least give hints in that direction.

Thanks, Frank.
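Frank's point that 'crawl' is a concatenation of several commands can be made concrete with a small sketch. The Python below only builds and prints command lines; it does not run Nutch. The command names and argument order are an assumption based on the Nutch 0.8-era tutorial (0.7 takes different arguments), the function names and the `<segment-N>` placeholder are hypothetical, and the link-inversion and indexing steps that follow the fetch rounds are left out.

```python
"""Sketch: the one-shot `crawl` command versus the step-by-step commands.

Assumption: command names and argument order follow the Nutch 0.8-era
tutorial; Nutch 0.7 uses different arguments. Check `bin/nutch` in your
own installation before running any of these.
"""


def one_shot(url_dir="urls", crawl_dir="crawl", depth=3):
    """The single 'Intranet crawl' command."""
    return [f"bin/nutch crawl {url_dir} -dir {crawl_dir} -depth {depth}"]


def step_by_step(url_dir="urls", crawl_dir="crawl", depth=3):
    """Roughly the same work, spelled out one command at a time."""
    crawldb = f"{crawl_dir}/crawldb"
    segments = f"{crawl_dir}/segments"
    commands = [f"bin/nutch inject {crawldb} {url_dir}"]
    for round_no in range(1, depth + 1):
        # Placeholder: in practice this is the newest directory that
        # `generate` created under the segments directory.
        segment = f"<segment-{round_no}>"
        commands += [
            f"bin/nutch generate {crawldb} {segments}",
            f"bin/nutch fetch {segment}",
            f"bin/nutch updatedb {crawldb} {segment}",
        ]
    # Link inversion and indexing would follow; omitted in this sketch.
    return commands


if __name__ == "__main__":
    print("\n".join(one_shot(depth=2)))
    print("--- versus ---")
    print("\n".join(step_by_step(depth=2)))
```

Seen this way, a recrawl amounts to running further generate/fetch/updatedb rounds against the existing crawl directory, which is the level of control Frank says the one-shot command hides.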
RE: project vitality? / less documentation is more!
+1

-Original Message- From: Franz Werfel [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 07, 2006 3:01 AM To: nutch-user@lucene.apache.org Subject: Re: project vitality? / less documentation is more!
RE: project vitality? / less documentation is more!
+1

-Original Message- From: Franz Werfel [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 07, 2006 10:11 AM To: nutch-user@lucene.apache.org Subject: Re: project vitality? / less documentation is more!
Re: project vitality?
Stefan,

> I know people having 500 million pages indexed, and I personally run crawls with ~300 pages per second.

Sorry, but I have to ask: what kind of setup do you have (network, hardware, Nutch version) that you manage so many pages per second? Unless this is a company secret, it would be very nice to know how you manage this.

Rgrds, Thomas
Re: project vitality?
Hi Thomas, for this crawl setup we have a test environment of Nutch 0.8, 10xAMD's, custom Linux build, 100Mbit eth1, 1Gb eth0; each box has a 'caching' DNS server.

Stefan

On 06.03.2006 at 15:59, TDLN wrote:

--- company:http://www.media-style.com forum:http://www.text-mining.org blog:http://www.find23.net
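For a sense of scale, the two figures quoted in this exchange can be combined with some simple arithmetic. This is purely illustrative: nothing in the thread says the 500-million-page indexes were built at ~300 pages per second, and the function names below are made up for the sketch.

```python
# Illustrative arithmetic only: the thread does not claim that the
# 500-million-page indexes were fetched at ~300 pages per second.

SECONDS_PER_DAY = 24 * 60 * 60


def pages_per_day(pages_per_second):
    """Pages fetched in 24 hours at a sustained rate."""
    return pages_per_second * SECONDS_PER_DAY


def days_to_fetch(total_pages, pages_per_second):
    """Days needed to fetch total_pages at a sustained rate."""
    return total_pages / pages_per_day(pages_per_second)


if __name__ == "__main__":
    rate = 300             # pages per second, as quoted by Stefan
    total = 500_000_000    # pages, as quoted by Stefan
    print(f"{pages_per_day(rate):,} pages per day")
    print(f"{days_to_fetch(total, rate):.1f} days for {total:,} pages")
```

At a sustained 300 pages per second that is roughly 26 million pages a day, so 500 million pages would take about 19 days of uninterrupted fetching.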
Re: project vitality?
On 3/4/06, Stefan Groschupf wrote:

> Just a general note, jira has a voting functionality. This allows everybody to vote on an issue and can show in a very compressed style what the community is looking for. However it is not used that often yet. It would be great if more users could use it.

That's a good suggestion. I want to do some advertising for my favorite. ;) Because there is a bug in Nutch 0.7.1 which forces me to make complete recrawls instead of using the incremental approach, this is my voting recommendation: http://issues.apache.org/jira/browse/NUTCH-205

By the way: I totally agree with the opinions exchanged.

- Nutch is a great project and has the chance to become a very popular and robust piece of open source software. A big thank-you to all Nutch developers is more than appropriate: Thanks guys!
- On the other hand: as Richard wrote, there could be some improvements in documentation and in responses to mailing-list posts and reported jira issues.

My concrete suggestions: Nutch 0.8 should be available in around the next two months. Let's take the chance and improve the (wiki) documentation before releasing it. First let's specify what kind of documentation we would like to have in 0.8. I'm sure we'll get volunteers for writing down every documentation subject, and some more volunteers for checking and testing it. I would like to support the documentation project in the next weeks (as far as my spare time allows ;))
Re: project vitality?
Richard Braman wrote:

> I really do think Nutch is great, but I echo Matthias's comments that the community needs to come together and contribute more back. And that comes with the requirement of making sure volunteers are given access to make their contributions part of the project.

Here's how it works: One has to be a committer to directly change the code. One may be invited to become a committer if one contributes a number of non-trivial, consistently exemplary patches. Exemplary patches:

1. are easy for a committer to apply;
2. fix one thing;
3. fix it well;
4. are well formatted, using Sun's coding conventions;
5. are well documented, with Javadoc for all non-private items;
6. pass all existing unit tests;
7. include new unit tests;
8. etc.

An exemplary patch is thus something that a committer can commit with little hesitation. It follows that exemplary patches will be committed quickly. Lesser patches are likely to languish. For example, a committer might be reluctant to take on a poorly constructed patch for a bug that only affects niche users, since it may take a lot of time to turn it into code worthy of committing. Most committers are already doing as much as they can to help the project. The trick is not to get the committers to do more work, but for others to do more work for the committers, and, eventually, to get more committers.

> Putting the FAQs and tutorial on the website and not the wiki may be one of the two biggest problems in getting people started learning Nutch.

If you think these should move, don't just complain: file a bug, make your case, submit a patch, etc. The website is part of the source and is governed by the same process.

Doug
Re: project vitality?
David Wallace wrote:

> Also, I've lost count of the number of times someone has posted something to the effect of "I'll pay someone to give me Nutch support", simply because they find the existing documentation and mailing lists inadequate. Usually, that person gets told that the best way to get Nutch support is to ask questions on the mailing list; but since questions often go unanswered, this isn't a very good way to get Nutch support at all.

I agree this is a problem, but it is also an opportunity. I do try to answer Nutch questions whenever I have time, and most other Nutch developers are also active on these lists. The problem is simply that there are more questions than question-answering hours.

> All of this is acceptable in a product that hasn't yet reached version 1.0. The code has moved ahead faster than the documentation; and that's fine, provided the documentation will eventually catch up.

Yes, I hope it will.

> Maybe, once 0.8 is deemed production-worthy, the team should down tools, stop coding, and put some effort into really producing a really lovely set of documentation, including a comprehensive FAQ. I believe that this will help grow the user base, faster than adding new features ever could.

That would be nice. Once things settle down it will also be easier for support organizations, consultants, book authors, etc., to step in and improve documentation too.

Doug
Re: project vitality?
I like to think of it as a framework: building blocks to build what you ultimately need. If you're after the one-stop shop, plug and play, no development necessary, then perhaps some other commercial systems may be your best bet. The mailing list is very active, and most people get responses fairly quickly. If a question is ignored, it's often because it's already answered. To really understand Nutch you need to understand Lucene, Hadoop and search in general, and the wikis of both Lucene and Nutch are a great read. If all of this is above one's head or not within your time frame to bother with, then like I said, there are other products out there. Other than that, I'm happily running Nutch, looking forward to a billion+ page index and enjoying picking the brains of the talent pool we have here.

Happy nutcher -byron http://www.mozdex.com

--- Matt Wilkie [EMAIL PROTECTED] wrote:

Hi there, I'm new around here. The mailing lists seem to have a pretty steady stream of traffic but the website hasn't been updated since August, and there's only a handful of news items before that. What is the vitality of the Nutch project? Is it basically a laboratory proof of concept or a mature, ready-for-production product? Thanks for your time,

-- matt wilkie Geographic Information, Information Management and Technology, Yukon Department of Environment 10 Burns Road * Whitehorse, Yukon * Y1A 4Y9 867-667-8133 Tel * 867-393-7003 Fax http://environmentyukon.gov.yk.ca/geomatics/
Re: [Nutch-general] Re: project vitality?
Hi, I think that this is my first post. I follow the mailing list and read as many of the emails as I can. I'm going to make a few proposals. I have obtained some money to spend on them. I use Nutch and get paid for my Nutch expertise. I have some experience. I don't just speak for myself but also for some people who use Nutch now, have a commercial interest in Nutch and who will contribute money to the effort. This money is not a great deal but it could both escalate and become ongoing. I sympathize with the people who are (with no offense to any side, if there really is one) the complainers. I am grateful to the coders. I can and do make code improvements to Nutch for my own uses that nobody ever sees. I have a web interface (sort of), and many other tools that work with Nutch, from maps to communication with Nutch via telephone. I expect to gain from my association with Nutch, although exactly how I can't put my finger on yet. I wouldn't say that I'm frustrated - I'd describe it more as a feeling of hope mixed with helplessness and despair. I think the moment is almost gone. I'm old and scatterbrained and don't spell check or reread before I post. I will elaborate as soon as I see this on the list - but I don't like to type until I know what I have to deal with. I have about 3000 emails a day to sift through and I have so many email addresses I've signed up for that I never really know whether I'm going to hit the wrong list or something or whatever. Greg.
RE: project vitality?
Hello all, I think Nutch is a fantastic product. I used 0.6 initially, then 0.7. My 0.7 installation is in production, and mostly works really well. I haven't made the move to 0.8 yet, because the direction that Nutch has gone for 0.8 is quite different from what my organisation requires from its search engine. I owe Doug and the team a huge thank-you for all the effort they've put into Nutch. Well done. However, it's a sad day when someone like Richard Braman gets shot down in flames for making some fair and valid criticisms of the Nutch project. Apart from his statement about Nutch being in proof of concept stage, I agree with everything Richard has said. The documentation DOES leave a fair bit to be desired. The initial learning curve CAN be precipitous. It's easy to get confused with all the various settings in the XML configuration files and the various plug-ins. I can understand that he doesn't feel that he's in a position to contribute to the documentation base, because he doesn't know all the answers yet. I think moving everything, including the tutorial, to the Wiki is a fine idea; provided that we encourage new users to comment on what did and didn't work for them. I think we'll find there's a lot of common ground among their comments. Long-term readers of the nutch-user mailing list know that many newbies ask the same questions. Also, I've lost count of the number of times someone has posted something to the effect of I'll pay someone to give me Nutch support, simply because they find the existing documentation and mailing lists inadequate. Usually, that person gets told that the best way to get Nutch support is to ask questions on the mailing list; but since questions often go unanswered, this isn't a very good way to get Nutch support at all. All of this is acceptable in a product that hasn't yet reached version 1.0. The code has moved ahead faster than the documentation; and that's fine, provided the documentation will eventually catch up. 
Maybe, once 0.8 is deemed production-worthy, the team should down tools, stop coding, and put some effort into really producing a really lovely set of documentation, including a comprehensive FAQ. I believe that this will help grow the user base, faster than adding new features ever could. So in summary, well done to the Nutch team for this great product. Well done to Richard Braman for pointing out what could be done. And let's all not flame people whose opinions differ from our own. David.
Re: project vitality?
I think of the Nutch project as a marathon, not a sprint. Nutch's stated goals include:

* Scale to entire web
  - pages on millions of different servers
  - billions of pages
* Support high traffic
  - thousands of searches per second
* State-of-the-art search quality

(see http://wiki.apache.org/nutch/Presentations)

It's inspiring to see a project with such ambitious goals become a reality.

On 3/5/06, Byron Miller [EMAIL PROTECTED] wrote:
Re: [Nutch-general] Re: project vitality?
Hello again. OK - first of all I hate mailing lists. I don't consider them to be a valid form of communication for anything but the people doing the coding and don't really consider them of much use at all unless there is no other alternative. Except one - and that is when there needs to be something communicated to the people doing the work and it has to get through - in other words I think mailing lists are a last resort. I've been a part of a few areas of the net where what I was involved with just took off. One of them was in 1999 when Flash 4 came out and suddenly anyone with an ability to use Flash was hot and Flash was the big news and I was part of a forum called were-here.com which was the adult flash forum as opposed to the kids' flashkit.com site. My name was/is Mapp and for the most part of were-here's life I was moderator of the XML forum. I think that if anyone has or cares to read my posts they'll see that I always try to help, my help was usually complete, I am always polite. We had quite a ride for awhile but then the owners of the forum for some secretive reason just took the site down leaving the thousands of contributing posters homeless. I still keep up with all the XML stuff and I suppose I must be sort of an expert in XML - at least in knowing the different formats, vxml, aiml, on and on. I was also part of a few areas of the net where it looked like things were going to take off and never did. One thing I noticed is that technologies that take off have forums dedicated to them and ones that don't take off resist going off the mailing list. I like it how people say take it off list but oh where should it be taken to please? Nobody says take the discussion to the wiki because traditionally wikis aren't real discussion areas. What really should be said is take it to the forum but there isn't really one is there? If there is nobody says anything. 
I have the name nutchforum.com and am #1 in MSN, Google and Yahoo, and one person posted there one day. I know there are other efforts too, but if they have any good discussions about relevant topics I'm unaware of them. I agree that the people doing the coding shouldn't have to read this, and so obviously I'm proposing a Nutch forum with myself, for example (could be others too), as a moderator. At least I have a history and it is decent. Were-here.com is back up now - bought by a corporation and maintained as a learning resource for the Flash community - but I don't post there much, and that is because I resented seeing my hundreds if not thousands of hours of painstakingly trying to give back to the community by being complete, coherent, etc. lost because whoever happened to have the luck of owning the forum decided that oh well, see you around, I'm going to work for Microsoft, or whatever. I still resent it, even if some corporation knew that they could garner enough good will by buying the forum and restoring the posts/knowledge base. So, what I've done is pick Moodle - an open source PHP learning system, which has a forum - and I've decided that I'll attempt to start a useful forum, and that every week or two I'll make the forum SQL dump available, so if I ever decide that I don't care about anyone or I get snapped up by Google, any knowledge will live on. Moodle is being developed by teachers, the people I'd trust to do things right (except for librarians - check out the open source library software that librarians write for an example of a dominant open source effort). So I assume that any forum posting will be long-lived and free. I've also decided to pay for posts - the surest way for a forum/community to not get started is by there being no posting activity. So, I arranged to get posts paid for.
I'm not sure yet how much is reasonable but I started off figuring that a few dollars for a well thought out question and 20 -100 dollars for a reasonably comprehensive answer might be alright. Also, I've arranged for some hosting space for people who want to make search engines but don't have the resources. I have a few dedicated servers and unique IP addresses and the like for people who will share their experiences. I don't know what is reasonable to pay but I have arranged some funding and resources albeit with conditions. Also there are other things that normally cost money as well as I'll give support to people who want to use the web interface that I've been working on and if somebody else has an idea that needs a little money well right now the people that I've set up with older not so up to date nutch search engines are becoming desperate to get the stuff I told them would be available to them. These aren't people who want billion page indexes spread over 10 separate beowulf clusters - they're just people who thought they could spend a few hundred and get some additional functionality out of open source software. That being what I do mostly, set up and integrate open source software for people who have reasonable goals. I'm old now and
RE: [Nutch-general] Re: project vitality?
I'll take part in your forum. Just added first post.

-Original Message- From: Greg Boulter [mailto:[EMAIL PROTECTED] Sent: Sunday, March 05, 2006 6:33 PM To: nutch-user@lucene.apache.org Subject: Re: [Nutch-general] Re: project vitality?
RE: project vitality?
don't expect polish. You shouldn't need polish to be able to learn the command required to resume an aborted crawl, or to index what you have already crawled. Things like this shouldn't require an easter egg hunt. They are going to happen to everyone doing more than a simple crawl. If you find a bug, please file a bug report, so that other folks are aware of it. I have reported 2 so far. I have a third one (and a patch) that I am still in the process of developing and documenting, which relates to parsing pdfs. Better yet, if you have a solution or improvement, please construct a patch file (even for documentation) and attach it to a bug report. On the wiki, anyone can make themselves an account and update documentation. We don't boss folks around here, or complain. We pitch in and help. In the email I sent you I volunteered to help by offering to polish the documentation myself. I do need some answers first. Many of the questions that get asked on this list unfortunately go unanswered by the experts. If they go unanswered, it is impossible for those who would otherwise share their solutions on the Wiki to do so, because there is no solution to share. If I went and posted my knowledge about indexing and restarting crawls, it wouldn't be any better than what is already up there, which is incomplete and incorrect. I know there are those of you that know nutch inside and out. Right now that's just a few guys. I know I want to know more about it, that's why I am spending my free time trying to learn. Everything I am doing is part of an open source search project, not a commercial endeavour. I always contribute my knowledge back by posting answers to things I know about. Documentation, whether we like it or not, is key to the use of the product. The onus is on the developers to document the project, and to provide support when the documentation is clearly lacking. 
Once the developers share more of their knowledge, there will be more knowledgeable users and the developers won't need to spend as much time on support and documentation. I would agree that if you have 1 url to crawl, and you crawl it with depth = 3-6, nutch is easy to use. I tried with depth=10, and I hit a snag. This has been very hard to get through, given the lack of documentation. I have nutch up and running fine here http://24.75.221.234:8080 But this is a simple crawl and doesn't reflect all of the pages needed to make a good search engine. I told you I was more than willing to help, and I think many users feel the same way, but I for one feel that there is a lack of documentation and support. This isn't meant to offend anyone; if you are offended you need to toughen up your skin a little bit. -Original Message- From: sudhendra seshachala [mailto:[EMAIL PROTECTED] Sent: Saturday, March 04, 2006 1:26 AM To: nutch-user@lucene.apache.org Subject: Re: project vitality? I could not agree with Doug more. This is one of the best.. am trying UIMA too... though UIMA also uses Lucene...as of today, it is still a framework and community in early stages.. In fact the nightly builds has good improvements than 0.71. Any serious user or adopter should be trying with a snapshot of nightly build.. Doug, It would be better, if there is official 0.8 release or at least a RC. before major releasing 1.0. I am newbie, so let me know about ideas on releasing 0.8. Thanks Sudhi Doug Cutting [EMAIL PROTECTED] wrote: Richard Braman wrote: I think it is still very much at proof of concept stage. I think it is close, but as you have mentioned, the website is severely out of date and the information and documentation on it lacks luster. It stands to reason that if the documentation lacks luster the project must be dead! Seriously, this is an active project. It is not yet 1.0, so don't expect polish. If it doesn't look easily usable to you then perhaps it is not. 
It's still for early adopters. The commit list shows a fair amount of activity: http://www.mail-archive.com/nutch-commits%40lucene.apache.org/maillist.html Lots of public sites are using Nutch. Some are listed at http://wiki.apache.org/nutch/PublicServers, but many are not, like http://search.bittorrent.com/. I have tried to get the tutorial and faqs updated, but I haven't heard back. This is an all-volunteer project. If you find a bug, please file a bug report, so that other folks are aware of it. Better yet, if you have a solution or improvement, please construct a patch file (even for documentation) and attach it to a bug report. On the wiki, anyone can make themselves an account and update documentation. We don't boss folks around here, or complain. We pitch in and help. Doug Sudhi Seshachala http://sudhilogs.blogspot.com/
Re: project vitality?
Hi Richard, I told you I was more than willing to help, and I think many users feel the same way, but I for one feel that there is a lack of documentation and support. This isn't meant to offend anyone; if you are offended you need to toughen up your skin a little bit. Here you can find some more documentation: http://wiki.media-style.com/display/nutchDocu/Home It is the first hit when you search for nutch documentation with Google. Sure, it is full of typos and has many language issues since my English is terrible, but at least I guess that it already helps some people to get a nutch 0.7 or nutch 0.8 up and running. Seriously, nutch is as production-ready as a noncommercial open source project can be. I know people with indexes of 500 million pages, and I personally run crawls at ~300 pages per second. I'm not sure what more than that you can expect from an open source search project. Stefan
RE: project vitality?
I agree that the doc could be better, but I still take issue with the earlier use of the phrase proof-of-concept. If there are dozens of sites using it in production, several of them indexing 100's of millions of pages, I don't know how you can call it proof-of-concept. Honestly, I'm not sure if there's any other choice for a scalable open source search engine. Last I checked most of the other free projects were better suited to small site searches -- nothing on the scale of tens of millions of pages. So kudos, Nutch developers! Howie
RE: project vitality?
I do thank nutch developers very, very much for what they have put into the project :) I think the concept is great and yes it does work, if you invest the time needed to learn the interfaces, upgrade the distribution nightly, relearn the commands, etc. Doug's statement that nutch is for early adopters is accurate. Now that I have said that, I want to express my feeling that it's hard when it takes a week to figure out that invertlinks only applies to version 0.8, and when you ask to become a volunteer, you are met with no response. It's also frustrating when you share some hard-earned insights into something that nutch needs to work on, like pdf parsing, and your comments don't get a single good response from the nutch dev team. Sometimes, in OS projects I get the feeling that the developers breathe different air than users, and that our help is not wanted or that our questions are stupid and not worth their time to answer. I don't feel that there is really any such thing as a stupid question, only stupid answers. Some users even ask questions shamefully, like: I know I am a newbie, and my question is stupid, but here it is anyway. I think that's a stigma that we as the larger computer community need to steer away from, especially if we want newbie users to become advanced users. Nutch is nowhere near being a dead project, that is not what I said (I said it was close, not closed), it's just that I don't feel that it's something that anyone can just download and use without running into problems. Problems always exist, but need to be documented correctly so that they can be solved quickly. I think nutch has a long way to go before it is comparable to tomcat or httpd, which are both production ready and have literally volumes of information on using them in every manner possible. I am sorry if you don't like my opinion or the way it is expressed. 
-Original Message- From: carmmello [mailto:[EMAIL PROTECTED] Sent: Saturday, March 04, 2006 10:54 AM To: nutch-user@incubator.apache.org Subject: RE: project vitality? I really cannot agree with the way Mr. Richard Braman expresses his views. I have tried Nutch since version 0.3 and I could not make the 0.8 release work (Nutch is becoming a little bit complicated with all those map reduce, hadoop, and so on, that I can't deal with). I understand, however, that if a product is not finished yet, sometimes it may fail for lack of some fundamental documentation, but if there is a bunch of people who develop, for free, a product that is commercially worth some thousands of dollars and may fit our purposes, we have to say thanks. After that we can, of course, express our views, complaints and suggestions, but we should refrain from hard, non-relevant comments that go nowhere, like this non-technical post of mine. I, myself, have my own experimental implementation of Nutch 0.7.1.x (a nightly version), with more than 400,000 pages, that can sometimes be viewed during Brazilian working hours at http://www.qualidade.eng.br/constelacao.htm . It is in Portuguese, but English terms related to quality, standards and environment can be searched.
RE: project vitality?
The nutch dev team isn't focused on PDF parsing. Nutch is a search engine framework, IMHO, if you don't parse something correctly, you cannot rely on the results. We have all parsed things where you leave a comma out and the parse results are wrong. If there was a bug in nutch's html parsing, would that be a big deal? How about if it parsed the text in a particular tag out of order? Pdf is unfortunately not html, where you can parse the file sequentially and get an accurate result, but its use is second most ubiquitous. PDFBox is not a PDF parsing framework either. It has some pdf parsing algorithms that aren't being used. Google does a good job parsing pdf; nutch has to as well if it's going to compete. -Original Message- From: Chris Mattmann [mailto:[EMAIL PROTECTED] Sent: Saturday, March 04, 2006 4:10 PM To: nutch-user@lucene.apache.org Subject: Re: project vitality? Hello, I've been following this conversation for the past week and decided that I'd go ahead and chime in now. I think that honestly this whole thread of discussion needs to be taken off list, because it doesn't really have anything to do with the use of Nutch: what it boils down to is a list of complaints, requests for improvements and what not. Nutch's goal is to be a large-scale, open source search engine: it's not a PDF parsing framework, nor is it as thoroughly documented as some commercial software -- although I've run into many commercial software products that don't have the same quality of documentation that Nutch even has now in its nascent stages. Now that I have said that, I want to express my feeling that it's hard when it takes a week to figure out that invertlinks only applies to version 0.8, and when you ask to become a volunteer, you are met with no response. You don't need to ask to become a volunteer: just do it. As Doug said, create a patch, submit the patch to JIRA and let the community look at it. 
Change something on the Wiki if you don't think that the documentation is particularly good there. Use Nutch to do whatever you like, and if you feel that you contributed something that is applicable to a broader community outside of your domain, let people know about it. If it's really cool, I wouldn't worry about people ignoring you: they'll come around. It's also frustrating when you share some hard-earned insights into something that nutch needs to work on, like pdf parsing, and your comments don't get a single good response from the nutch dev team. The nutch dev team isn't focused on PDF parsing. Nutch is a search engine framework, and to Nutch, a PDF parser is a black box that conforms to a standard parsing interface that can be swapped out as technology evolves. Right now, Nutch uses PDFBox, but in a week it could use hot super new rad PDF parsing technology X.1, or some other greater PDF parser. If you feel that PDFBox isn't getting the job done for your particular domain, then post an actual question, not pointers to documents for the Nutch developers to go read. Honestly, I'm guessing they don't have the time, nor the desire, to go read a whole bunch of PDF documentation unless there's a real use case, and a real need to upgrade the existing parser. Empirically show that Nutch's PDF capabilities aren't getting the job done, post your results to the list, and let the community look at them. I'd guess you'd generate more interest and probably get a better response that way. Sometimes, in OS projects I get the feeling that the developers breathe different air than users, and that our help is not wanted or that our questions are stupid and not worth their time to answer. 
As far as I can tell the Nutch developers all breathe the same air as us (and moreover, I believe they put on their pants one leg at a time). Nutch is nowhere near being a dead project, that is not what I said (I said it was close, not closed), it's just that I don't feel that it's something that anyone can just download and use without running into problems. Problems is a generic word: I would agree with your statement if you qualified what problems means. Small problems like configuration issues? I'd buy that. Exception messages not providing super super detailed information about the error? Sure, I'd even buy that in some cases. However, larger, bigger problems that generally fall in the class of bugs? I would say the answer to that is probably a no. Problems always exist, but need to be documented correctly so that they can be solved quickly. I think nutch has a long way to go before it is comparable to tomcat or httpd, which are both production ready and have literally volumes of information on using them in every manner possible. Check out the committers list on Tomcat ( http://tomcat.apache.org/whoweare.html) versus that of Nutch ( http://lucene.apache.org/nutch/credits.html). 21 active committers on the Tomcat PMC and many more emeritus committers. Nutch has fewer than 10. To have the wealth
Re: project vitality?
I am sorry if you don't like my opinion or the way it is expressed. Hi Richard, most of your opinion I think is the same as mine. I have used nutch since spring 2004 for our page http://www.umkreisfinder.de It was a big effort to learn how nutch works and also a big effort to learn how to implement plugins. Seems to be a big system :) Much of the stuff I know is about version 0.5 or maybe 0.7. It is really difficult to keep up-to-date with all the stuff which is going on. In the last month I did not have the time to read all the messages on the mailing list, so I also feel less informed about what's going on. I think the only way to keep informed about what's going on with nutch is to read the mailing list each day. That's bad - I cannot spend so much time :( Sometimes replies on the mailing list are extremely fast, sometimes there is no response. No response for technical questions, no response if volunteers ask how they could help and no response if bugfixes or code snippets with some improvements are mailed to the mailing list. I can only agree, if you think this is bad. It is bad. Not only are there persons who would never come to a state where they could help the project - because they did not get the first wattles - but also progress of the nutch project would be slowed down if bugfixes and questions about how to volunteer are ignored. I can only suggest posting all patches and improvements to the jira system, so that this information would never be lost. For me it seems a little bit like many persons are working on the code they need; sometimes two persons need the same code - fine -, but if somebody is working on a project or bugfix nobody else in the community currently needs - very bad. Also it is a big question if and when patches are committed which are at the moment only needed by their programmer. I think we - the whole nutch community - should think about how we could generate the most value for nutch if persons ask how to volunteer. 
And also we should think about how we could pay tribute for stuff made by volunteers. Maybe if we simply check and add their improvements to the official code as soon as possible. Maybe we should organize ourselves a little bit better on this point. What do you think? It also may be useful to ask all future volunteers to work on some parts of the wiki to get better documentation. Maybe some of the nutch specialists must then look over the documentation created by beginners. May I ask: How many persons are currently working on nutch? How much time do we altogether currently spend on nutch? I am currently working on code to identify geographic information on websites to improve local searches, but did not find time to implement my ideas. Much other stuff to do :( I also feel that I should not start implementing this code until I understand all the stuff which would be new in the next release. Maybe I will understand all the important new stuff when reading the release information of the new version as soon as it is available. Last but not least, THANKS to all volunteers who worked on nutch. I am glad to be able to use nutch for our services. It is great to have the code of all the volunteers and run it together with the one percent of the code I have developed for our website. Thanks for reading my post Matthias -- http://www.eventax.com - eventax GmbH http://www.umkreisfinder.de - Die Suchmaschine für Lokales und Events
RE: project vitality?
I really do think nutch is great, but I echo Matthias's comments that the community needs to come together and contribute more back. And that comes with the requirement of making sure volunteers are given access to make their contributions part of the project. Also, if you use nutch you should be answering other users' questions as long as you are actively reading the nutch list and you know the answer. That is almost your obligation for using free open source software. Putting the faqs and tutorial on the website and not the wiki may be one of the two biggest problems in getting people started learning nutch. -Original Message- From: Matthias Jaekle [mailto:[EMAIL PROTECTED] Sent: Saturday, March 04, 2006 5:27 PM To: nutch-user@lucene.apache.org Subject: Re: project vitality? I am sorry if you don't like my opinion or the way it is expressed. Hi Richard, most of your opinion I think is the same as mine. I have used nutch since spring 2004 for our page http://www.umkreisfinder.de It was a big effort to learn how nutch works and also a big effort to learn how to implement plugins. Seems to be a big system :) Much of the stuff I know is about version 0.5 or maybe 0.7. It is really difficult to keep up-to-date with all the stuff which is going on. In the last month I did not have the time to read all the messages on the mailing list, so I also feel less informed about what's going on. I think the only way to keep informed about what's going on with nutch is to read the mailing list each day. That's bad - I cannot spend so much time :( Sometimes replies on the mailing list are extremely fast, sometimes there is no response. No response for technical questions, no response if volunteers ask how they could help and no response if bugfixes or code snippets with some improvements are mailed to the mailing list. I can only agree, if you think this is bad. It is bad. 
Not only are there persons who would never come to a state where they could help the project - because they did not get the first wattles - but also progress of the nutch project would be slowed down if bugfixes and questions about how to volunteer are ignored. I can only suggest posting all patches and improvements to the jira system, so that this information would never be lost. For me it seems a little bit like many persons are working on the code they need; sometimes two persons need the same code - fine -, but if somebody is working on a project or bugfix nobody else in the community currently needs - very bad. Also it is a big question if and when patches are committed which are at the moment only needed by their programmer. I think we - the whole nutch community - should think about how we could generate the most value for nutch if persons ask how to volunteer. And also we should think about how we could pay tribute for stuff made by volunteers. Maybe if we simply check and add their improvements to the official code as soon as possible. Maybe we should organize ourselves a little bit better on this point. What do you think? It also may be useful to ask all future volunteers to work on some parts of the wiki to get better documentation. Maybe some of the nutch specialists must then look over the documentation created by beginners. May I ask: How many persons are currently working on nutch? How much time do we altogether currently spend on nutch? I am currently working on code to identify geographic information on websites to improve local searches, but did not find time to implement my ideas. Much other stuff to do :( I also feel that I should not start implementing this code until I understand all the stuff which would be new in the next release. Maybe I will understand all the important new stuff when reading the release information of the new version as soon as it is available. Last but not least, THANKS to all volunteers who worked on nutch. 
I am glad to be able to use nutch for our services. It is great to have the code of all the volunteers and run them together with the one percent of the code I have developed for our website. Thanks for reading my post Matthias -- http://www.eventax.com - eventax GmbH http://www.umkreisfinder.de - Die Suchmaschine für Lokales und Events
Re: project vitality?
Maybe we should organize ourselves a little bit better on this point. What do you think? Just a general note: jira has a voting functionality. This allows everybody to vote on an issue and can show in a very compressed style what the community is looking for. However, it is not used that often yet. It would be great if more users used it. Reading the nutch user list becomes very time consuming, but browsing issues sorted by votes is very fast. http://issues.apache.org/jira/browse/NUTCH?report=com.atlassian.jira.plugin.system.project:popularissues-panel Stefan
Re: project vitality?
Hi Richard, IMHO, if you don't parse something correctly, you cannot rely on the results. Good, we're on the same page here. We have all parsed things where you leave a comma out and the parse results are wrong. If there was a bug in nutch's html parsing, would that be a big deal? Yes, it would be. HTML is the foundation for the web. Its content is the most pervasive out there (as you allude to below). How about if it parsed the text in a particular tag out of order? I'm wondering what that has to do with anything? You may want to read up on Lucene (http://lucene.apache.org/). Lucene is the underlying text search api (and index format) that Nutch is built on top of, and I'm wondering if it cares about the order in which a piece of text is given to it? Pdf is unfortunately not html, where you can parse the file sequentially and get an accurate result, Gonna have to disagree with you on this. You're making a general statement that's not true across the board. I would assert that in many cases, you can still get an accurate result. What about a PDF research paper? Do you care about what order the text comes in if you're just doing a general Google-like search? When I go to Google and type "grid computing papers", do I care that "grid computing" comes before some text within the research paper? Possibly, but mainly I care that "grid computing" was an emphasized phrase within the text. Now, your definition of emphasized may not just be that it's the first text that appears in the paper, in the title say: you may just care that the frequency of "grid computing" in the paper is relatively higher than a certain threshold compared to other terms. On the other hand, the fact that "grid computing" is in the title and comes first in the PDF may mean a lot to you. That's the nature of trying to extract structure out of inherently unstructured content. 
I'm not saying that the structure or order of text within a document is never useful: I agree that in a lot of cases, it can help you to infer what values are associated with what fields you want to index, etc. All I'm saying is that it's certainly a subset of the greater functionality of just doing free text search, so you shouldn't generalize and say that you can't parse a PDF sequentially and obtain good results. but its use is second most ubiquitous. PDFBox is not a PDF parsing framework either. It has some pdf parsing algorithms that aren't being used. Google does a good job parsing pdf; nutch has to as well if it's going to compete. Can you show that Google's PDF parsing capability is any better than Nutch's using accepted evaluation methods for PDF? How about some real use cases and real results? Until we see such numbers, I'm hesitant to believe what you're saying is true. If it is though, then I'm sure that the community would welcome any updates to the PDF parsing plugin that expedite its improvement. Cheers, Chris -Original Message- From: Chris Mattmann [mailto:[EMAIL PROTECTED] Sent: Saturday, March 04, 2006 4:10 PM To: nutch-user@lucene.apache.org Subject: Re: project vitality? Hello, I've been following this conversation for the past week and decided that I'd go ahead and chime in now. I think that honestly this whole thread of discussion needs to be taken off list, because it doesn't really have anything to do with the use of Nutch: what it boils down to is a list of complaints, requests for improvements and what not. Nutch's goal is to be a large-scale, open source search engine: it's not a PDF parsing framework, nor is it as thoroughly documented as some commercial software -- although I've run into many commercial software products that don't have the same quality of documentation that Nutch even has now in its nascent stages. 
Now that I have said that, I want to express my feeling that it's hard when it takes a week to figure out that invertlinks only applies to version 0.8, and when you ask to become a volunteer, you are met with no response. You don't need to ask to become a volunteer: just do it. As Doug said, create a patch, submit the patch to JIRA and let the community look at it. Change something on the Wiki if you don't think that the documentation is particularly good there. Use Nutch to do whatever you like, and if you feel that you contributed something that is applicable to a broader community outside of your domain, let people know about it. If it's really cool, I wouldn't worry about people ignoring you: they'll come around. It's also frustrating when you share some hard-earned insights into something that nutch needs to work on, like pdf parsing, and your comments don't get a single good response from the nutch dev team. The nutch dev team isn't focused on PDF parsing. Nutch is a search engine framework, and to Nutch, a PDF parser is a black box that conforms to a standard parsing interface that can be swapped out as technology evolves. Right
project vitality?
Hi there, I'm new around here. The mailing lists seem to have a pretty steady stream of traffic but the website hasn't been updated since August, and there's only a handful of news items before that. What is the vitality of the Nutch project? Is it basically a laboratory proof of concept or a mature, ready for production product? thanks for your time, -- matt wilkie Geographic Information, Information Management and Technology, Yukon Department of Environment 10 Burns Road * Whitehorse, Yukon * Y1A 4Y9 867-667-8133 Tel * 867-393-7003 Fax http://environmentyukon.gov.yk.ca/geomatics/
RE: project vitality?
I think it is still very much at proof of concept stage. I think it is close, but as you have mentioned, the website is severely out of date and the information and documentation on it lacks luster. I have tried to get the tutorial and faqs updated, but I haven't heard back. -Original Message- From: Matt Wilkie [mailto:[EMAIL PROTECTED] Sent: Friday, March 03, 2006 6:34 PM To: nutch-user@lucene.apache.org Subject: project vitality? Hi there, I'm new around here. The mailing lists seem to have a pretty steady stream of traffic but the website hasn't been updated since August, and there's only a handful of news items before that. What is the vitality of the Nutch project? Is it basically a laboratory proof of concept or a mature, ready for production product? thanks for your time, -- matt wilkie Geographic Information, Information Management and Technology, Yukon Department of Environment 10 Burns Road * Whitehorse, Yukon * Y1A 4Y9 867-667-8133 Tel * 867-393-7003 Fax http://environmentyukon.gov.yk.ca/geomatics/
RE: project vitality?
I wouldn't call Nutch 0.7.x proof-of-concept. There are several production sites running it already: http://wiki.apache.org/nutch/PublicServers Plus I think technorati is built on either Nutch and/or Lucene. That said, the doc could be better, and it's probably a good idea if you know Java since you might have to tweak the code a bit to get the exact behavior you want. If you don't have special needs, you could get something like a site search up in very little time. The newer versions seem to be changing a lot still though. I've been waiting for the dust to settle before I see if I want to upgrade. Howie I think it is still very much at proof of concept stage. I think it is close, but as you have mentioned, the website is severely out of date and the information and documentation on it lacks luster. I have tried to get the tutorial and faqs updated, but I haven't heard back. -Original Message- From: Matt Wilkie [mailto:[EMAIL PROTECTED] Sent: Friday, March 03, 2006 6:34 PM To: nutch-user@lucene.apache.org Subject: project vitality? Hi there, I'm new around here. The mailing lists seem to have a pretty steady stream of traffic but the website hasn't been updated since August, and there's only a handful of news items before that. What is the vitality of the Nutch project? Is it basically a laboratory proof of concept or a mature, ready for production product? thanks for your time, -- matt wilkie Geographic Information, Information Management and Technology, Yukon Department of Environment 10 Burns Road * Whitehorse, Yukon * Y1A 4Y9 867-667-8133 Tel * 867-393-7003 Fax http://environmentyukon.gov.yk.ca/geomatics/
Re: project vitality?
It has passed the concept stage. Technorati uses lucene. In open source projects the last thing people want to do is documentation. Anybody know why yahoo took down their nutch server? - Original Message - From: Howie Wang [EMAIL PROTECTED] To: [EMAIL PROTECTED]; nutch-user@lucene.apache.org Sent: Saturday, March 04, 2006 1:09 AM Subject: RE: project vitality? I wouldn't call Nutch 0.7.x proof-of-concept. There are several production sites running it already: http://wiki.apache.org/nutch/PublicServers Plus I think technorati is built on either Nutch and/or Lucene. That said, the doc could be better, and it's probably a good idea if you know Java since you might have to tweak the code a bit to get the exact behavior you want. If you don't have special needs, you could get something like a site search up in very little time. The newer versions seem to be changing a lot still though. I've been waiting for the dust to settle before I see if I want to upgrade. Howie I think it is still very much at proof of concept stage. I think it is close, but as you have mentioned, the website is severely out of date and the information and documentation on it lacks luster. I have tried to get the tutorial and faqs updated, but I haven't heard back. -Original Message- From: Matt Wilkie [mailto:[EMAIL PROTECTED] Sent: Friday, March 03, 2006 6:34 PM To: nutch-user@lucene.apache.org Subject: project vitality? Hi there, I'm new around here. The mailing lists seem to have a pretty steady stream of traffic but the website hasn't been updated since August, and there's only a handful of news items before that. What is the vitality of the Nutch project? Is it basically a laboratory proof of concept or a mature, ready for production product? thanks for your time, -- matt wilkie Geographic Information, Information Management and Technology, Yukon Department of Environment 10 Burns Road * Whitehorse, Yukon * Y1A 4Y9 867-667-8133 Tel * 867-393-7003 Fax http://environmentyukon.gov.yk.ca/geomatics/
Re: project vitality?
I could not agree with Doug more. This is one of the best.. am trying UIMA too... though UIMA also uses Lucene...as of today, it is still a framework and community in early stages.. In fact the nightly builds has good improvements than 0.71. Any serious user or adopter should be trying with a snapshot of nightly build.. Doug, It would be better, if there is official 0.8 release or at least a RC. before major releasing 1.0. I am newbie, so let me know about ideas on releasing 0.8. Thanks Sudhi Doug Cutting [EMAIL PROTECTED] wrote: Richard Braman wrote: I think it is still very much at proof of concept stage. I think it is close, but as you have mentioned, the website is severely out of date and the information and documentation on it lacks luster. It stands to reason that if the documentation lacks luster the project must be dead! Seriously, this is an active project. It is not yet 1.0, so don't expect polish. If it doesn't look easily usable to you then perhaps it is not. It's still for early adopters. The commit list shows a fair amount of activity: http://www.mail-archive.com/nutch-commits%40lucene.apache.org/maillist.html Lots of public sites are using Nutch. Some are listed at http://wiki.apache.org/nutch/PublicServers, but many are not, like http://search.bittorrent.com/. I have tried to get the tutorial and faqs updated, but I haven't heard back. This is an all-volunteer project. If you find a bug, please file a bug report, so that other folks are aware of it. Better yet, if you have a solution or improvement, please construct a patch file (even for documentation) and attach it to a bug report. On the wiki, anyone can make themselves an account and update documentation. We don't boss folks around here, or complain. We pitch in and help. Doug Sudhi Seshachala http://sudhilogs.blogspot.com/