Re: project vitality? / less documentation is more!

2006-03-07 Thread Franz Werfel
Hello,

Just my 2 cents: the Intranet crawl functionality is VERY confusing.

If it were simply taken out of the tutorial, and out of the set of
commands, that would actually help A LOT: I understood many, many
things about Nutch once I tried so-called whole-web crawling, where
one has to run each command one at a time. It would also
eliminate all the questions about how to recrawl, etc.

Or maybe a change of name would be enough: Intranet crawl could be
called "fast-setup crawl", and whole-web crawling "serious crawling
for Intranet or whole-web projects".

What do you think?

Thanks, Frank.


RE: project vitality? / less documentation is more!

2006-03-07 Thread Vanderdray, Jacob
-1

I found the instructions for doing an Intranet crawl extremely
helpful for getting up and running quickly.  I went back later and
figured out more about what it was actually doing.  Perhaps the name
could just be changed to "Single Site Crawling with the Nutch Shell
Script", and some explanatory text could be added.

I'll try to take the time today to put a version of the tutorial
on the wiki that does that.  Then if folks agree, I'll put together a
patch that changes the site links for the tutorial to point at the wiki.

Thanks,
Jake.

-Original Message-
From: Franz Werfel [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 07, 2006 3:01 AM
To: nutch-user@lucene.apache.org
Subject: Re: project vitality? / less documentation is more!



RE: project vitality? / less documentation is more!

2006-03-07 Thread Vanderdray, Jacob
You're right about the single site thing, but I think just
changing the title and adding a bit more of an explanation should do the
trick.  I went ahead and put up a version of the tutorial on the wiki.
I haven't changed it other than to try to get the formatting similar to
what's on the current tutorial.  Feel free to edit.

http://wiki.apache.org/nutch/NutchTutorial

Thanks,
Jake.

-Original Message-
From: Franz Werfel [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 07, 2006 10:11 AM
To: nutch-user@lucene.apache.org
Subject: Re: project vitality? / less documentation is more!

Hello,

"Single site crawling" wouldn't address the confusion that results
from the fact that the 'crawl' command is actually the concatenation
of several commands; and it would not be accurate either, since you can
crawl several sites with 'crawl'.
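
To make that point concrete, here is a minimal Python sketch of the idea that a one-shot crawl is just the individual steps chained together. It does not call Nutch at all: the step names (inject, generate, fetch, update_db), the FAKE_WEB link graph and the example.org URLs are all assumptions made up for illustration, not Nutch's actual commands or API.

```python
from typing import Dict, List

# Hypothetical link graph standing in for a real web site (illustration only).
FAKE_WEB: Dict[str, List[str]] = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/c"],
    "http://example.org/b": [],
    "http://example.org/c": [],
}


def inject(crawl_db: Dict[str, str], seeds: List[str]) -> None:
    """Add seed URLs to the database, marked as not yet fetched."""
    for url in seeds:
        crawl_db.setdefault(url, "unfetched")


def generate(crawl_db: Dict[str, str], top_n: int) -> List[str]:
    """Pick up to top_n unfetched URLs for the next round."""
    pending = sorted(url for url, status in crawl_db.items() if status == "unfetched")
    return pending[:top_n]


def fetch(fetch_list: List[str]) -> Dict[str, List[str]]:
    """'Download' each URL; here this is only a lookup in FAKE_WEB."""
    return {url: FAKE_WEB.get(url, []) for url in fetch_list}


def update_db(crawl_db: Dict[str, str], fetched: Dict[str, List[str]]) -> None:
    """Mark fetched URLs as done and record newly discovered links."""
    for url, outlinks in fetched.items():
        crawl_db[url] = "fetched"
        for link in outlinks:
            crawl_db.setdefault(link, "unfetched")


def one_shot_crawl(seeds: List[str], depth: int, top_n: int = 10) -> Dict[str, str]:
    """The 'concatenated' variant: every step chained, nothing to inspect in between."""
    crawl_db: Dict[str, str] = {}
    inject(crawl_db, seeds)
    for _ in range(depth):
        update_db(crawl_db, fetch(generate(crawl_db, top_n)))
    return crawl_db


if __name__ == "__main__":
    # One-shot: a single call, intermediate state is hidden.
    print(one_shot_crawl(["http://example.org/"], depth=2))

    # Step by step: the same work, but each round's fetch list is visible.
    db: Dict[str, str] = {}
    inject(db, ["http://example.org/"])
    for round_no in range(1, 3):
        batch = generate(db, top_n=10)
        print("round", round_no, "fetch list:", batch)
        update_db(db, fetch(batch))
```

Running the steps by hand produces the same database as the one-shot helper, but lets you look at (or change) the fetch list between rounds, which is the kind of control a one-shot command hides.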

But I have to agree that it helps in getting up and running quickly;
however, my point is that, after this first phase, it is _more_
difficult to move on to the next phase than if one hadn't taken this
first step at all...

Maybe at the end of the tutorial for Intranet crawling the following
sentence could be added:
"If you want to crawl the same site _again_, use the whole-web
tutorial below, and NOT the crawl command."

Also, the sentence "Whole-web crawling is designed to handle very
large crawls which may take weeks to complete, running on multiple
machines" is misleading, since one has to use whole-web crawling to
fine-tune or recrawl even the smallest of websites.

The distinction is not only in the scale of the project, but in the
level of control one wants (IMHO). The documentation should at least
give hints in that direction.

Thanks, Frank.





RE: project vitality? / less documentation is more!

2006-03-07 Thread Richard Braman
+1

-Original Message-
From: Franz Werfel [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 07, 2006 3:01 AM
To: nutch-user@lucene.apache.org
Subject: Re: project vitality? / less documentation is more!





RE: project vitality? / less documentation is more!

2006-03-07 Thread Richard Braman
+1

-Original Message-
From: Franz Werfel [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 07, 2006 10:11 AM
To: nutch-user@lucene.apache.org
Subject: Re: project vitality? / less documentation is more!






Re: project vitality?

2006-03-06 Thread TDLN
Stefan,

> I know people with 500 million page indexes, and I personally run
> crawls at ~300 pages per second.

Sorry, but I have to ask: what kind of setup do you have (network, hardware,
Nutch version) that lets you manage so many pages per second?

Unless this is a company secret, it would be very nice to know how you
manage this.

Regards, Thomas
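
As a rough sanity check on the figures quoted above, the arithmetic can be sketched in a few lines of Python. Combining the two numbers is an assumption made only for illustration: nothing in this thread says the 500 million page index was built at 300 pages per second, and the calculation assumes a constant rate with no downtime.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def pages_per_day(pages_per_second: float) -> float:
    """Pages fetched in one day at a constant rate."""
    return pages_per_second * SECONDS_PER_DAY


def days_to_fetch(total_pages: int, pages_per_second: float) -> float:
    """Days needed to fetch total_pages at a constant rate."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return total_pages / pages_per_day(pages_per_second)


if __name__ == "__main__":
    rate = 300                # pages per second, figure quoted in the thread
    index_size = 500_000_000  # pages, figure quoted in the thread
    print(f"{pages_per_day(rate):,.0f} pages per day")
    print(f"{days_to_fetch(index_size, rate):.1f} days for {index_size:,} pages")
```

At a sustained 300 pages per second this works out to roughly 26 million pages per day, so an index of that size would take on the order of weeks rather than hours.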


Re: project vitality?

2006-03-06 Thread Stefan Groschupf

Hi Thomas,
for this crawl setup we have a test environment of Nutch 0.8:
10 AMD boxes, a custom Linux build, 100 Mbit on eth1, 1 Gb on eth0, and
a 'caching' DNS server on each box.

Stefan


---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net




Re: project vitality?

2006-03-06 Thread mos
On 3/4/06, Stefan Groschupf wrote:

> Just a general note, jira has a voting functionality.
> This allows everybody to vote an issue and can show in a very
> compressed style what the community is looking for.
> However it is not used that often yet. It would be great if more
> users can use it.

That's a good suggestion.
I want to do some advertising for my favorite. ;)
Because there is a bug in Nutch 0.7.1 which forces me to make
complete recrawls instead of using the incremental approach, this is my
voting recommendation:
http://issues.apache.org/jira/browse/NUTCH-205

By the way:
I totally agree with the opinions exchanged.

- Nutch is a great project and has the chance to become a very, very
popular and robust piece of open source software. A big thank-you to all
Nutch developers is more than appropriate:
Thanks guys!

- On the other hand: as Richard wrote, there could be some improvements
in the documentation and in responses to the mailing list and to reported
Jira issues.

My concrete suggestions:

Nutch 0.8 should be available within the next two months or so. Let's
take the chance and
improve the (wiki) documentation before releasing it.
First let's specify what kind of documentation we would like to have in 0.8.
I'm sure that for every documentation subject we'll get volunteers for
writing it down, and some more volunteers for checking and testing it.

I would like to support the documentation project in the coming weeks
(as far as my spare time allows ;))


Re: project vitality?

2006-03-06 Thread Doug Cutting

Richard Braman wrote:

> I really do think nutch is great, but I echo Matthias's comments that the
> community needs to come together and contribute more back.  And that
> comes with the requirement of making sure volunteers are given access to
> make their contributions part of the project.


Here's how it works:

One has to be a committer to directly change the code.

One may be invited to become a committer if one contributes a number of 
non-trivial, consistently exemplary patches.


Exemplary patches:
 1. are easy for a committer to apply;
 2. fix one thing;
 3. fix it well;
 4. are well formatted, using Sun's coding conventions;
 5. are well documented, with Javadoc for all non-private items;
 6. pass all existing unit tests;
 7. include new unit tests;
 8. etc.

An exemplary patch is thus something that a committer can commit with 
little hesitation.  It follows that exemplary patches will be committed 
quickly.  Lesser patches are likely to languish.


For example, a committer might be reluctant to take on a poorly 
constructed patch for a bug that only affects niche users, since it may 
take a lot of time to turn it into code worthy of committing.


Most committers are already doing as much as they can to help the 
project.  The trick is not to get the committers to do more work, but 
for others to do more work for the committers, and, eventually, to get 
more committers.



> Putting the FAQs and tutorial on the website and not the wiki may be one
> of the two biggest problems in getting people started learning nutch.


If you think these should move, don't just complain: file a bug, make 
your case, submit a patch, etc.  The website is part of the source and 
is governed by the same process.


Doug


Re: project vitality?

2006-03-06 Thread Doug Cutting

David Wallace wrote:

> Also, I've lost count of the number of times someone has posted
> something to the effect of "I'll pay someone to give me Nutch support",
> simply because they find the existing documentation and mailing lists
> inadequate.  Usually, that person gets told that the best way to get
> Nutch support is to ask questions on the mailing list; but since
> questions often go unanswered, this isn't a very good way to get Nutch
> support at all.


I agree this is a problem, but it is also an opportunity. I do try to 
answer Nutch questions whenever I have time, and most other Nutch 
developers are also active on these lists.  The problem is simply that 
there are more questions than question answering hours.



> All of this is acceptable in a product that hasn't yet reached version
> 1.0.  The code has moved ahead faster than the documentation; and
> that's fine, provided the documentation will eventually catch up.


Yes, I hope it will.


> Maybe, once 0.8 is deemed production-worthy, the team should down tools,
> stop coding, and put some effort into really producing a really lovely
> set of documentation, including a comprehensive FAQ.  I believe that
> this will help grow the user base, faster than adding new features ever
> could.


That would be nice.  Once things settle down it will also be easier for 
support organizations, consultants, book authors, etc, to step in and 
improve documentation too.


Doug


Re: project vitality?

2006-03-05 Thread Byron Miller
I like to think of it as a framework: building blocks
to build what you ultimately need.

If you're after the one-stop shop, plug and play, no
development necessary, then perhaps some other
commercial system may be your best bet.

The mailing list is very active, and most people get responses
fairly quickly. If a question is ignored, it's often
because it has already been answered.

To really understand Nutch you need to understand
Lucene, Hadoop and search in general, and the wikis of
both Lucene and Nutch are a great read.

If all of this is above one's head or not within your
time frame to bother with then, like I said, there are
other products out there.

Other than that, I'm happily running Nutch, looking
forward to a billion+ page index and enjoying picking
the brains of the talent pool we have here.

Happy nutcher

-byron
http://www.mozdex.com


--- Matt Wilkie [EMAIL PROTECTED] wrote:

> Hi there, I'm new around here. The mailing lists seem to have a pretty
> steady stream of traffic but the website hasn't been updated since
> August, and there's only a handful of news items before that. What is
> the vitality of the Nutch project? Is it basically a laboratory proof of
> concept or a mature, ready-for-production product?
>
> thanks for your time,
>
> --
> matt wilkie
>
> Geographic Information,
> Information Management and Technology,
> Yukon Department of Environment
> 10 Burns Road * Whitehorse, Yukon * Y1A 4Y9
> 867-667-8133 Tel * 867-393-7003 Fax
> http://environmentyukon.gov.yk.ca/geomatics/



Re: [Nutch-general] Re: project vitality?

2006-03-05 Thread Greg Boulter
Hi,

I think that this is my first post. I follow the mailing list and read as
many of the emails as I can.

I'm going to make a few proposals.
I have obtained some money to spend on them.
I use and get paid for my nutch expertise.
I have some experience.
I don't just speak for myself but also for some people who use nutch now,
have a commercial interest in nutch and who will contribute money to the
effort.
This money is not a great deal but it could both escalate and become
ongoing.
I sympathize with the people who are (with no offense to any side, if
there really is one) the complainers.
I am grateful to the coders.
I can and do make code improvements to nutch for my own uses that nobody
ever sees.
I have a web interface (sort of), and many other tools that work with nutch,
from maps to communication with nutch via telephone.
I expect to gain from my association with nutch although how I can't really
put my finger on yet.
I wouldn't say that I'm frustrated - I'd describe it more as a feeling of
hope mixed with helplessness and despair.
I think the moment is almost gone.
I'm old and scatterbrained and don't spell check or reread before I post.
I will elaborate as soon as I see this on the list - but I don't like to
type until I know what I have to deal with, I have about 3000 emails a day
to sift through and I have so many email addresses I've signed up for that I
never really know whether I'm going to hit the wrong list or something or
whatever.

Greg.


RE: project vitality?

2006-03-05 Thread David Wallace
Hello all,
I think Nutch is a fantastic product.  I used 0.6 initially, then 0.7. 
My 0.7 installation is in production, and mostly works really well.  I
haven't made the move to 0.8 yet, because the direction that Nutch has
gone for 0.8 is quite different from what my organisation requires from
its search engine.
 
I owe Doug and the team a huge thank-you for all the effort they've put
into Nutch.  Well done.
 
However, it's a sad day when someone like Richard Braman gets shot down
in flames for making some fair and valid criticisms of the Nutch
project.  Apart from his statement about Nutch being in "proof of
concept" stage, I agree with everything Richard has said.  The
documentation DOES leave a fair bit to be desired.  The initial learning
curve CAN be precipitous.  It's easy to get confused with all the
various settings in the XML configuration files and the various
plug-ins.  I can understand that he doesn't feel that he's in a position
to contribute to the documentation base, because he doesn't know all the
answers yet.
 
I think moving everything, including the tutorial, to the Wiki is a
fine idea; provided that we encourage new users to comment on what did
and didn't work for them.  I think we'll find there's a lot of common
ground among their comments.  Long-term readers of the nutch-user
mailing list know that many newbies ask the same questions.  
 
Also, I've lost count of the number of times someone has posted
something to the effect of "I'll pay someone to give me Nutch support",
simply because they find the existing documentation and mailing lists
inadequate.  Usually, that person gets told that the best way to get
Nutch support is to ask questions on the mailing list; but since
questions often go unanswered, this isn't a very good way to get Nutch
support at all.
 
All of this is acceptable in a product that hasn't yet reached version
1.0.  The code has moved ahead faster than the documentation; and
that's fine, provided the documentation will eventually catch up. 
Maybe, once 0.8 is deemed production-worthy, the team should down tools,
stop coding, and put some effort into really producing a really lovely
set of documentation, including a comprehensive FAQ.  I believe that
this will help grow the user base, faster than adding new features ever
could.
 
So in summary, well done to the Nutch team for this great product. 
Well done to Richard Braman for pointing out what could be done.  And
let's all not flame people whose opinions differ from our own.
 
David.






Re: project vitality?

2006-03-05 Thread Chris Lamprecht
I think of the Nutch project as a marathon, not a sprint.  Nutch's
stated goals include:

* Scale to entire web
  - pages on millions of different servers
  - billions of pages
* Support high traffic
  - thousands of searches per second
* State-of-the-art search quality

(see http://wiki.apache.org/nutch/Presentations)

It's inspiring to see a project with such ambitious goals become a reality.





Re: [Nutch-general] Re: project vitality?

2006-03-05 Thread Greg Boulter
Hello again.

OK - first of all I hate mailing lists. I don't consider them to be a valid
form of communication for anything but the people doing the coding and don't
really consider them of much use at all unless there is no other
alternative. Except one - and that is when there needs to be something
communicated to the people doing the work and it has to get through - in
other words I think mailing lists are a last resort.

I've been a part of a few areas of the net where what I was involved with
just took off. One of them was in 1999 when Flash 4 came out and suddenly
anyone with an ability to use Flash was hot and Flash was the big news and I
was part of a forum called were-here.com which was the adult flash forum
as opposed to the kids' flashkit.com site. My name was/is Mapp and for the
most part of were-here's life I was moderator of the XML forum. I think that
if anyone has or cares to read my posts they'll see that I always try to
help, my help was usually complete, I am always polite. We had quite a ride
for awhile but then the owners of the forum for some secretive reason just
took the site down leaving the thousands of contributing posters homeless.
I still keep up with all the XML stuff and I suppose I must be sort of an
expert in XML - at least in knowing the different formats, vxml, aiml, on
and on.

I was also part of a few areas of the net where it looked like things were
going to take off and never did. One thing I noticed is that technologies
that take off have forums dedicated to them and ones that don't take off
resist going off the mailing list.

I like how people say "take it off list", but where should it be taken
to, please? Nobody says "take the discussion to the wiki" because
traditionally wikis aren't real discussion areas. What really should be said
is "take it to the forum", but there isn't really one, is there? If there is,
nobody says anything. I have the name nutchforum.com and am #1 in MSN,
Google and Yahoo, and one person posted there one day. I know there are other
efforts too, but if they have any good discussions about relevant topics I'm
unaware of them.

I agree that the people doing the coding shouldn't have to read this, and so
obviously I'm proposing a nutch forum with myself, for example (could be
others too), as a moderator. At least I have a history and it is decent.
Were-here.com is back up now - bought by a corporation and maintained as a
learning resource for the Flash community - but I don't post there much, and
that is because I resented seeing my hundreds if not thousands of hours of
painstakingly trying to give back to the community by being complete,
coherent, etc. lost because whoever happened to have the luck of owning the
forum decided "oh well, see you around, I'm going to work for Microsoft",
or whatever. I still resent it, even if some corporation knew that they could
garner enough goodwill by buying the forum and restoring the posts/knowledge
base.

So, what I've done is pick Moodle - an open source PHP learning system
which has a forum - and I've decided that I'll attempt to start a useful forum,
and that every week or two I'll make the forum SQL dump available,
so that if I ever decide that I don't care about anyone, or I get snapped up by
Google, any knowledge will live on. Moodle is being developed by teachers, the
people I'd trust to do things right (except for librarians - check out the
open source library software that librarians write for an example of a
dominant open source effort). So I assume that any forum posting will be
long-lived and free.

I've also decided to pay for posts - the surest way for a forum/community to
not get started is by there being no posting activity. So, I arranged to get
posts paid for. I'm not sure yet how much is reasonable but I started off
figuring that a few dollars for a well thought out question and 20 -100
dollars for a reasonably comprehensive answer might be alright. Also, I've
arranged for some hosting space for people who want to make search engines
but don't have the resources. I have a few dedicated servers and unique IP
addresses and the like for people who will share their experiences. I don't
know what is reasonable to pay but I have arranged some funding and
resources albeit with conditions.

Also there are other things that normally cost money as well as I'll give
support to people who want to use the web interface that I've been working
on and if somebody else has an idea that needs a little money well right now
the people that I've set up with older not so up to date nutch search
engines are becoming desperate to get the stuff I told them would be
available to them. These aren't people who want billion page indexes spread
over 10 separate beowulf clusters - they're just people who thought they
could spend a few hundred and get some additional functionality out of open
source software. That being what I do mostly, set up and integrate open
source software for people who have reasonable goals. I'm old now and 

RE: [Nutch-general] Re: project vitality?

2006-03-05 Thread Richard Braman
I'll take part in your forum. Just added first post.

-Original Message-
From: Greg Boulter [mailto:[EMAIL PROTECTED] 
Sent: Sunday, March 05, 2006 6:33 PM
To: nutch-user@lucene.apache.org
Subject: Re: [Nutch-general] Re: project vitality?


Hello again.

OK - first of all I hate mailing lists. I don't consider them to be a
valid form of communication for anything but the people doing the coding
and don't really consider them of much use at all unless there is no
other alternative. Except one - and that is when there needs to be
something communicated to the people doing the work and it has to get
through - in other words I think mailing lists are a last resort.

I've been a part of a few areas of the net where what I was involved
with just took off. One of them was in 1999 when Flash 4 came out and
suddenly anyone with an ability to use Flash was hot and Flash was the
big news and I was part of a forum called were-here.com which was the
adult flash forum as opposed to the kids' flashkit.com site. My name
was/is Mapp and for the most part of were-here's life I was moderator of
the XML forum. I think that if anyone has or cares to read my posts
they'll see that I always try to help, my help was usually complete, I
am always polite. We had quite a ride for awhile but then the owners of
the forum for some secretive reason just took the site down leaving the
thousands of contributing posters homeless. I still keep up with all
the XML stuff and I suppose I must be sort of an expert in XML - at
least in knowing the different formats, vxml, aiml, on and on.

I was also part of a few areas of the net where it looked like things
were going to take off and never did. One thing I noticed is that
technologies that take off have forums dedicated to them and ones that
don't take off resist going off the mailing list.

I like it how people say take it off list but oh where should it be
taken to please? Nobody says take the discussion to the wiki because
traditionally wikis aren't real discussion areas. What really should be
said is take it to the forum but there isn't really one is there? If
there is nobody says anything. I have the name nutchforum.com and am
#1 in MSN, Google and Yahoo and one person posted there one day. I know
there are other efforts too but if they have any good discussions about
relevant topics I'm unaware of them.

I agree that the people doing the coding shouldn't have to read this and
so obviously I'm proposing a nutch forum with myself for example (could
be others too) as a moderator. At least I have a history and it is
decent. Were-here.com is back up now - bought by a corporation and
maintained as a learning resource to the Flash community - but I don't
post there much, and that is because I resented my hundreds if not
thousands of hours of painstakingly trying to give back to the
community by being complete, coherent, etc. being lost because whoever
happened to have the luck of owning the forum decided that oh well,
see you around, I'm going to work for Microsoft, or whatever. I still
resent it even if some corporation knew that they could garner enough
good will by buying the forum and restoring the posts/knowledge base.

So, what I've done is pick Moodle - an open source PHP learning
system, which has a forum - and I've decided that I'll attempt to start a
useful forum, and that every week or two I'll make the forum
SQL dump available, so if I ever decide that I don't care about anyone or
I get snapped up by Google, any knowledge will live on. Moodle is being
developed by teachers, the people I'd trust to do things right (except
for librarians - check out the open source library software that
librarians write for an example of a dominant open source effort). So I
assume that any forum posting will be long-lived and free.

I've also decided to pay for posts - the surest way for a
forum/community to not get started is by there being no posting
activity. So, I arranged to get posts paid for. I'm not sure yet how
much is reasonable but I started off figuring that a few dollars for a
well thought out question and 20-100 dollars for a reasonably
comprehensive answer might be alright. Also, I've arranged for some
hosting space for people who want to make search engines but don't have
the resources. I have a few dedicated servers and unique IP addresses
and the like for people who will share their experiences. I don't know
what is reasonable to pay but I have arranged some funding and resources
albeit with conditions.

Also, there are other things that normally cost money, and I'll
give support to people who want to use the web interface that I've
been working on. And if somebody else has an idea that needs a little
money, well, right now the people that I've set up with older, not so
up-to-date nutch search engines are becoming desperate to get the stuff I told
them would be available to them. These aren't people who want billion
page indexes spread over 10 separate beowulf clusters

RE: project vitality?

2006-03-04 Thread Richard Braman

don't expect polish.
You shouldn't need polish to be able to learn the command required to
resume an aborted crawl, or to index what you have already crawled.
Things like this shouldn't require an Easter egg hunt.  They are going
to happen to everyone doing more than a simple crawl.

If you find a bug, please file a bug report, so that other folks are
aware of it.  
I have reported 2 so far.  I have a third one (and a patch) that I am
still in the process of developing and documenting, which relates to parsing
PDFs.

Better yet, if you have a 
solution or improvement, please construct a patch file (even for 
documentation) and attach it to a bug report. On the wiki, anyone can 
make themselves an account and update documentation. We don't boss 
folks around here, or complain. We pitch in and help.

In the email I sent you I volunteered to help by offering to polish the
documentation myself.  I do need some answers first.  Many of the
questions that get asked on this list unfortunately go unanswered by the
experts.  If they go unanswered, it is impossible for those who would
otherwise share their solutions on the Wiki to do so, because there is no
solution to share.

If I went and posted my knowledge about indexing and restarting crawls,
it wouldn't be any better than what is already up there, which is
incomplete and incorrect.  I know there are those of you that know nutch
inside and out. Right now that's just a few guys.  I know I want to know
more about it, that's why I am spending my free time trying to learn.
Everything I am doing is part of an open source search project, not a
commercial endeavour. I always contribute my knowledge back by posting
answers to things I know about.

Documentation, whether we like it or not, is key to the use of the
product. The onus is on the developers to document the project, and to
provide support when the documentation is clearly lacking.  Once the
developers share more of their knowledge, there will be more
knowledgeable users and the developers won't need to spend as much time on
support and documentation.

I would agree that if you have 1 URL to crawl, and you crawl it with
depth = 3-6, nutch is easy to use.  I tried with depth = 10, and I hit a
snag.  This has been very hard to get through, given the lack of
documentation.  I have nutch up and running fine here:
http://24.75.221.234:8080
But this is a simple crawl and doesn't reflect all of the pages needed
to make a good search engine.

I told you I was more than willing to help, and I think many users feel
the same way, but I for one feel that there is a lack of documentation
and support.  This isn't meant to offend anyone, if you are offended you
need to toughen up your skin a little bit.






-Original Message-
From: sudhendra seshachala [mailto:[EMAIL PROTECTED] 
Sent: Saturday, March 04, 2006 1:26 AM
To: nutch-user@lucene.apache.org
Subject: Re: project vitality?


I could not agree with Doug more. This is one of the best.. am trying
UIMA too... though UIMA also uses Lucene...as of today, it is still a
framework and community in early stages..
   
  In fact the nightly builds have good improvements over 0.71.
  Any serious user or adopter should be trying a snapshot of the
nightly build..
   
  Doug, 
  It would be better if there were an official 0.8 release, or at least an RC,
  before the major 1.0 release. I am a newbie, so let me know about ideas on
releasing 0.8.
   
  Thanks
  Sudhi
  

Doug Cutting [EMAIL PROTECTED] wrote:
  Richard Braman wrote:
 I think it is still very much at proof of concept stage. I think it is
 close, but as you have mentioned, the website is severely out of date
 and the information and documentation on it lacks luster.

It stands to reason that if the documentation lacks luster the project
must be dead! Seriously, this is an active project. It is not yet 1.0,
so don't expect polish. If it doesn't look easily usable to you then
perhaps it is not. It's still for early adopters.

The commit list shows a fair amount of activity:

http://www.mail-archive.com/nutch-commits%40lucene.apache.org/maillist.html

Lots of public sites are using Nutch. Some are listed at 
http://wiki.apache.org/nutch/PublicServers, but many are not, like 
http://search.bittorrent.com/.

 I have tried
 to get the tutorial and faqs updated, but I haven't heard back.

This is an all-volunteer project. If you find a bug, please file a bug 
report, so that other folks are aware of it. Better yet, if you have a 
solution or improvement, please construct a patch file (even for 
documentation) and attach it to a bug report. On the wiki, anyone can 
make themselves an account and update documentation. We don't boss 
folks around here, or complain. We pitch in and help.

Doug



  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   






Re: project vitality?

2006-03-04 Thread Stefan Groschupf

Hi Richard,

I told you I was more than willing to help, and I think many users feel
the same way, but I for one feel that there is a lack of documentation
and support.  This isn't meant to offend anyone, if you are offended you
need to toughen up your skin a little bit.


Here you can find some more documentation:
http://wiki.media-style.com/display/nutchDocu/Home

It is the first hit when you search for nutch documentation
with Google.
Sure, it is full of typos and has many language issues,
since my English is terrible,
but at least I guess that it already helps some people to get nutch
0.7 or nutch 0.8 up and running.


Seriously, nutch is as production ready as a noncommercial open
source project can be.
I know people who have indexes of 500 million pages, and I personally run crawls
at ~300 pages per second.


I'm not sure what more than that you can expect from an open source
search project.


Stefan






RE: project vitality?

2006-03-04 Thread Howie Wang

I agree that the doc could be better, but I still take issue with
the earlier use of the phrase proof-of-concept. If there are
dozens of sites using it in production, several of them indexing
100's of millions of pages, I don't know how you can call it
proof-of-concept.

Honestly, I'm not sure if there's any other choice for a scalable
open source search engine. Last I checked most of the other
free projects were better suited to small site searches -- nothing
on the scale of tens of millions of pages.

So kudos, Nutch developers!

Howie




RE: project vitality?

2006-03-04 Thread Richard Braman
I do thank nutch developers very, very much for what they have put into
the project:)  I think the concept is great and yes it does work, if you
invest the time needed to learn the interfaces, upgrade the
distribution nightly, relearn the commands, etc. Doug's statement that
nutch is for early adopters is accurate.

Now that I have said that, I want to express my feeling that it's hard
when it takes a week to figure out that invertlinks only applies to
version 0.8. and when you ask to become a volunteer, you are met with no
response.  It's also frustrating when you share some hard-earned
insights into something that nutch needs to work on, like pdf parsing,
and your comments don't get a single good response from the nutch dev
team.  

Sometimes, in OS projects I get the feeling that the developers breathe
different air than users, and that our help is not wanted or that our
questions are stupid and not worth their time to answer.  I don't feel
that there is really any such thing as a stupid question, only stupid
answers.  Some users even ask questions shamefully like: I know I am a
newbie, and my question is stupid, but here it is anyway.  I think
that's a stigma that we as the larger computer community need to steer
away from, especially if we want newbie users to become advanced users.

Nutch is nowhere near being a dead project, that is not what I said (I
said it was close, not closed), it's just that I don't feel that it's
something that anyone can just download and use without running into
problems.  Problems always exist, but need to be documented correctly so
that they can be solved quickly.  I think nutch has a long way to go
before it is comparable to tomcat or httpd, which are both production
ready and have literally volumes of information on using in every manner
possible.  

I am sorry if you don't like my opinion or the way it is expressed.

-Original Message-
From: carmmello [mailto:[EMAIL PROTECTED] 
Sent: Saturday, March 04, 2006 10:54 AM
To: nutch-user@incubator.apache.org
Subject: RE: project vitality?


I really cannot agree with the way Mr. Richard Braman expresses his
views.  I have tried Nutch since version 0.3 and I could not make the
0.8 release work (Nutch is becoming a little bit complicated with all
that map reduce, hadoop, and so on, that I can't deal with).  I
understand, however, that if a product is not finished yet, sometimes
it may fall short for lack of some fundamental documentation, but if
there is a bunch of people who develop, for free, a product that is
commercially worth some thousands of dollars and may fit our purposes,
we have to say thanks.  After that we can, of course, express our views,
complaints and suggestions, but we should refrain from harsh, irrelevant
comments that go nowhere, like this non-technical post of
mine. I myself have my own experimental implementation of Nutch
0.7.1.x (a nightly version), with more than 400,000 pages, that can
sometimes be viewed during Brazilian working hours at
http://www.qualidade.eng.br/constelacao.htm .  It is in Portuguese, but
English terms related to quality, standards and environment can be
searched.



RE: project vitality?

2006-03-04 Thread Richard Braman
The nutch dev team isn't focused on PDF parsing. Nutch is a search
engine framework, 

IMHO, if you don't parse something correctly, you cannot rely on the
results.
We have all parsed things where you leave a comma out and the parse
results are wrong.  If there was a bug in nutch's HTML parsing, would
that be a big deal? How about if it parsed the text in a particular tag
out of order?  PDF is unfortunately not HTML, where you can parse the
file sequentially and get an accurate result, but its use is second most
ubiquitous.  PDFBox is not a PDF parsing framework either.  It has some
PDF parsing algorithms that aren't being used.  Google does a good job
parsing PDF; nutch has to as well if it's going to compete.




-Original Message-
From: Chris Mattmann [mailto:[EMAIL PROTECTED] 
Sent: Saturday, March 04, 2006 4:10 PM
To: nutch-user@lucene.apache.org
Subject: Re: project vitality?


Hello,

 I've been following this conversation for the past week and decided
that I'd go ahead and chime in now. I think that honestly this whole
thread of discussion needs to be taken off list, because it doesn't
really have anything to do with the use of Nutch: what it boils down
to is a list of complaints, requests for improvements and what not.
Nutch's goal is to be a large-scale, open source search engine: it's not
a PDF parsing framework, nor is it as thoroughly documented as some
commercial software -- although I've run into many commercial software
products that don't have the same quality of documentation that Nutch
even has now in its nascent stages.

 Now that I have said that, I want to express my feeling that it's hard
 when it takes a week to figure out that invertlinks only applies to
 version 0.8. and when you ask to become a volunteer, you are met with
 no response.

You don't need to ask to become a volunteer: just do it. As Doug said,
create a patch, submit the patch to JIRA and let the community look at
it. Change something on the Wiki if you don't think that the
documentation is particularly good there. Use Nutch to do whatever you
like, and if you feel that you contributed something that is applicable
to a broader community outside of your domain, let people know about it.
If it's really cool, I wouldn't worry about people ignoring you: they'll
come around.

 It's also frustrating when you share some hard-earned insights into
 something that nutch needs to work on, like pdf parsing, and your 
 comments don't get a single good response from the nutch dev team.

The nutch dev team isn't focused on PDF parsing. Nutch is a search
engine framework, and to Nutch, a PDF parser is a black box that
conforms to a standard parsing interface that can be swapped out as
technology evolves. Right now, Nutch uses PDFBox, but in a week it could
use hot super new rad PDF parsing technology X.1, or some other
greater PDF parser. If you feel that PDFBox isn't getting the job done
for your particular domain, then post an actual question, not pointers
to documents for the Nutch developers to go read. Honestly, I'm guessing
they don't have the time, nor the desire to go read a whole bunch of PDF
documentation unless there's a real use case, and a real need to upgrade
the existing parser. Empirically show that Nutch's PDF capabilities
aren't getting the job done, post your results to the list, and let the
community look them. I'd guess you'd generate more interest and probably
get a better response that way.
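
[Editorial illustration] The "black box behind a standard parsing interface" idea in the paragraph above can be sketched roughly as follows. This is a minimal sketch in Python, not Nutch's actual plugin API: the names ParserRegistry, parse_plain_text and parse_pdf_placeholder are hypothetical, and the PDF function is only a stand-in for whatever library (PDFBox or a successor) would do the real extraction.

```python
from typing import Callable, Dict

# Hypothetical names throughout; this is NOT Nutch's real plugin API.
# A "parser" here is any function that turns raw bytes into plain text.
Parser = Callable[[bytes], str]


def parse_plain_text(raw: bytes) -> str:
    """Decode bytes as UTF-8 text, replacing undecodable bytes."""
    return raw.decode("utf-8", errors="replace")


def parse_pdf_placeholder(raw: bytes) -> str:
    """Stand-in for a real PDF library; returns a marker, not real text."""
    return "[pdf: %d bytes, extraction not implemented]" % len(raw)


class ParserRegistry:
    """Maps a content type to whichever parser is currently plugged in."""

    def __init__(self) -> None:
        self._parsers: Dict[str, Parser] = {}

    def register(self, content_type: str, parser: Parser) -> None:
        # Registering again for the same type swaps the implementation.
        self._parsers[content_type] = parser

    def parse(self, content_type: str, raw: bytes) -> str:
        try:
            parser = self._parsers[content_type]
        except KeyError:
            raise ValueError("no parser registered for " + content_type)
        return parser(raw)


registry = ParserRegistry()
registry.register("text/plain", parse_plain_text)
registry.register("application/pdf", parse_pdf_placeholder)
```

The caller only ever sees parse(content_type, raw), so replacing the PDF implementation is a single register call and nothing upstream has to change.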

 
 Sometimes, in OS projects I get the feeling that the developers 
 breathe different air than users, and that our help is not wanted or 
 that our questions are stupid and not worth their time to answer.

As far as I can tell the Nutch developers all breathe the same air as us
(and moreover, I believe they put on their pants one leg at a time)

 
 Nutch is nowhere near being a dead project, that is not what I said (I
 said it was close, not closed), it's just that I don't feel that it's
 something that anyone can just download and use without running into
 problems.

Problems is a generic word: I would agree with your statement if you
qualified what problems means. Small problems like configuration
issues? I'd buy that. Exception messages not providing super super
detailed information about the error? Sure, I'd even buy that in some
cases. However, larger, bigger problems that generally fall in the class
of bugs? I would say the answer to that is probably a no.

 Problems always exist, but need to be documented correctly so that
 they can be solved quickly.  I think nutch has a long way to go before
 it is comparable to tomcat or httpd, which are both production ready
 and have literally volumes of information on using in every manner
 possible.

Check out the committers list on Tomcat (
http://tomcat.apache.org/whoweare.html) versus that of Nutch (
http://lucene.apache.org/nutch/credits.html). 21 active committers on the
Tomcat PMC and many more emeritus committers. Nutch has fewer than 10. To
have the wealth

Re: project vitality?

2006-03-04 Thread Matthias Jaekle

 I am sorry if you don't like my opinion or the way it is expressed.

Hi Richard,

I think most of your opinion is the same as mine. I have been using nutch
since spring 2004 for our page http://www.umkreisfinder.de


It was a big effort to learn how nutch is working and also a big effort 
to learn how to implement plugins. Seems to be a big system :)


Much of the stuff I know is about version 0.5 or maybe 0.7. It is really
difficult to keep up to date with everything that is going on. In
the last month I did not have the time to read all the messages on the
mailing list, so I also feel less informed about what's going on.
I think the only way to keep informed about what's going on with nutch is to
read the mailing list each day. That's bad - I cannot spend so much
time :(


Sometimes replies on the mailing list are extremely fast, sometimes there
is no response. No response to technical questions, no response when
volunteers ask how they could help, and no response when bugfixes or code
snippets with some improvements are mailed to the mailing list.


I can only agree, if you think this is bad. It is bad.
Not only are there people who will never get to a state where
they can help the project - because they did not get the first wattles
- but progress of the nutch project is also slowed down if bugfixes
and questions about how to volunteer are ignored.


I only could suggest to post all patches and improvements to the jira 
system, so that this information would never be lost.


To me it seems a little bit like many people are working on the code
they need. Sometimes two people need the same code - fine - but if
somebody is working on a project or bugfix nobody else in the community
currently needs - very bad. It is also a big question if and when
patches get submitted that are at the moment needed only by their
programmer.


I think we - the whole nutch community - should think about how we
could generate the most value for nutch when people ask how to volunteer.
We should also think about how we could pay tribute to the work done
by volunteers. Maybe simply by checking and adding their improvements to
the official code as soon as possible.


Maybe we should organize ourselves a little bit better on this point.
What do you think?

It may also be useful to ask all future volunteers to work on some
parts of the wiki to get better documentation. Maybe some of the nutch
specialists would then have to look over the documentation created by beginners.


May I ask: How many people are currently working on nutch? How much
time do we altogether currently spend on nutch?


I am currently working on code to identify geographic information on 
websites to improve local searches, but did not find time to implement 
my ideas. Much other stuff to do :( I also feel that I should not start 
implementing this code until I understand all the stuff which would be 
new in the next release. Maybe I understand all the important new stuff 
when reading the release information of the new version as soon as it is 
available.


Last but not least, THANKS to all volunteers who worked on nutch. I am 
glad to be able to use nutch for our services. It is great to have the 
code of all the volunteers and run them together with the one percent of 
the code I have developed for our website.


Thanks for reading my post

Matthias

--
http://www.eventax.com - eventax GmbH
http://www.umkreisfinder.de - Die Suchmaschine für Lokales und Events


RE: project vitality?

2006-03-04 Thread Richard Braman
I really do think nutch is great, but I echo Matthias's comments that the
community needs to come together and contribute more back.  And that
comes with the requirement of making sure volunteers are given access to
make their contributions part of the project.

Also, if you use nutch you should be answering other users' questions as
long as you are actively reading the nutch list and you know the answer.
That is almost your obligation for using free open source software.

Putting the faqs and tutorial on the website and not the wiki may be one
of the two biggest problems in getting people started learning nutch.




Re: project vitality?

2006-03-04 Thread Stefan Groschupf


Maybe we should organize us ourself a little bit better in this point.
What do you think?


Just a general note, jira has a voting functionality.
This allows everybody to vote on an issue and can show in a very
compact way what the community is looking for.
However, it is not used that often yet. It would be great if more
users would use it.


Reading the nutch user list becomes very time consuming but browsing  
issues sorted by votes is very fast.


http://issues.apache.org/jira/browse/NUTCH?report=com.atlassian.jira.plugin.system.project:popularissues-panel


Stefan 


Re: project vitality?

2006-03-04 Thread Chris Mattmann
Hi Richard,

 IMHO, if you don't parse something correctly, you cannnot rely on the
 results.  

Good, we're on the same page here.

 We have all parsed things where you leave a comma out and the parse
 results are wrong.  If there was a bug in nutches html parsing would
 that be a big deal?

Yes, it would be. HTML is the foundation for the web. Its content is the
most pervasive out there (as you allude to below).

 Howabout if it parsed the text in a particular tag
 out of order?  

I'm wondering what that has to do with anything? You may want to read up on
Lucene (http://lucene.apache.org/). Lucene is the underlying text search api
(and index format) that Nutch is built on top of, and I'm wondering if it
cares about the order in which a piece of text is given to it?

 Pdf is unfortunately not html where you can parse the
 file sequentially and get an accurate result,

Gonna have to disagree with you on this. You're making a general statement
that's not true across the board. I would assert that in many cases, you can
still get an accurate result. What about a PDF research paper? Do you care
about what order the text comes in if you're just doing a general Google-like
search? When I go to Google and type "grid computing papers", do I
care that "grid computing" comes before some other text within the research
paper? Possibly, but mainly I care that "grid computing" was an emphasized
phrase within the text. Now, your definition of emphasized may not just be that
it's the first text that appears in the paper, in the title say: you may just
care that the frequency of "grid computing" in the paper is
higher than a certain threshold relative to other terms. On the other hand,
the fact that "grid computing" is in the title and comes first in the PDF
may mean a lot to you. That's the nature of trying to extract structure
out of inherently unstructured content. I'm not saying that the structure or
order of text within a document is never useful: I agree that in a lot of
cases, it can help you to infer what values are associated with what fields
you want to index, etc. All I'm saying is that it's certainly a subset of
the greater functionality of just doing free text search, so you shouldn't
generalize and say that you can't parse a PDF sequentially and obtain good
results.
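
[Editorial illustration] The point about term frequency versus text order can be made concrete with a small sketch. It is plain Python, not Lucene or Nutch code; the function names and the two sample strings are made up for illustration. It shows that per-term counts are identical for the same words in a different order, which is the case where out-of-order extraction does no harm.

```python
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Count lowercase word tokens; word order is discarded."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def relative_frequency(text: str, term: str) -> float:
    """Share of all tokens in the text that equal the given term."""
    counts = term_frequencies(text)
    total = sum(counts.values())
    return counts[term.lower()] / total if total else 0.0


# Same words, different order (illustrative strings, not real data).
in_order = "grid computing papers survey grid scheduling"
shuffled = "scheduling grid survey papers computing grid"

same_counts = term_frequencies(in_order) == term_frequencies(shuffled)
```

Here same_counts is True, so a frequency-threshold test on a single term gives the same answer for both strings. Matching the exact phrase "grid computing" is a different matter: that depends on the two words being adjacent, so it is the kind of query where extraction order would matter.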

 but its use is second most
 ubiquitous.  PDFBox is not a PDF parsing framework either.  It has some
 PDF parsing algorithms that aren't being used.  Google does a good job
 parsing PDF; nutch has to as well if it's going to compete.

Can you show that Google's PDF parsing capability is any better than Nutch's
using accepted evaluation methods for PDF? How about some real use cases and
real results? Until we could see such numbers, I'm hesitant to believe what
you're saying is true. If it is though, then I'm sure that the community
would welcome any updates to the PDF parsing plugin that expedite its
improvement.

Cheers,
  Chris



 
 
 
 

project vitality?

2006-03-03 Thread Matt Wilkie
Hi there, I'm new around here. The mailing lists seem to have a pretty 
steady stream of traffic but the website hasn't been updated since
August, and there's only a handful of news items before that. What is
the vitality of the Nutch project? Is it basically a laboratory proof of
concept or a mature, ready-for-production product?


thanks for your time,

--
matt wilkie

Geographic Information,
Information Management and Technology,
Yukon Department of Environment
10 Burns Road * Whitehorse, Yukon * Y1A 4Y9
867-667-8133 Tel * 867-393-7003 Fax
http://environmentyukon.gov.yk.ca/geomatics/




RE: project vitality?

2006-03-03 Thread Richard Braman
I think it is still very much at proof of concept stage.  I think it is
close, but as you have mentioned, the website Is severely out of date
and the information and documentation on it lacks luster.  I have tried
to get the tutorial and faqs updated, but I haven't heard back.





RE: project vitality?

2006-03-03 Thread Howie Wang

I wouldn't call Nutch 0.7.x proof-of-concept. There are several
production sites running it already:

http://wiki.apache.org/nutch/PublicServers

Plus I think Technorati is built on Nutch and/or Lucene.

That said, the doc could be better, and it's probably a good idea
if you know Java since you might have to tweak the code a bit to
get the exact behavior you want.  If you don't have special needs,
you could get something like a site search up in very little time.

The newer versions seem to be changing a lot still though. I've
been waiting for the dust to settle before I see if I want to upgrade.

Howie









Re: project vitality?

2006-03-03 Thread gekkokid
It's past the concept stage; Technorati uses Lucene. In open source projects,
the last thing people want to do is documentation.

Anybody know why Yahoo took down their Nutch server?












Re: project vitality?

2006-03-03 Thread sudhendra seshachala
I could not agree with Doug more. This is one of the best. I am trying UIMA
too, though UIMA also uses Lucene; as of today, it is still a framework and
community in its early stages.

In fact, the nightly builds have good improvements over 0.7.1.
Any serious user or adopter should be trying a snapshot of the nightly
build.

Doug,
It would be better if there were an official 0.8 release, or at least an RC,
before the major 1.0 release. I am a newbie, so let me know your ideas on
releasing 0.8.

Thanks
Sudhi
  

Doug Cutting [EMAIL PROTECTED] wrote:
  Richard Braman wrote:
 I think it is still very much at proof of concept stage. I think it is
 close, but as you have mentioned, the website is severely out of date
 and the information and documentation on it lacks luster.

It stands to reason that if the documentation lacks luster the project 
must be dead! Seriously, this is an active project. It is not yet 1.0, 
so don't expect polish. If it doesn't look easily usable to you then 
perhaps it is not. It's still for early adopters.

The commit list shows a fair amount of activity:

http://www.mail-archive.com/nutch-commits%40lucene.apache.org/maillist.html

Lots of public sites are using Nutch. Some are listed at 
http://wiki.apache.org/nutch/PublicServers, but many are not, like 
http://search.bittorrent.com/.

 I have tried
 to get the tutorial and faqs updated, but I haven't heard back.

This is an all-volunteer project. If you find a bug, please file a bug 
report, so that other folks are aware of it. Better yet, if you have a 
solution or improvement, please construct a patch file (even for 
documentation) and attach it to a bug report. On the wiki, anyone can 
make themselves an account and update documentation. We don't boss 
folks around here, or complain. We pitch in and help.

Doug



  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


