Re: Next Nutch release

2007-01-25 Thread Doug Cutting

Dennis Kubes wrote:

Andrzej Bialecki wrote:
I believe that at this point it's crucial to keep the project 
well-focused (at the moment I think the main focus is on larger 
installations, and not the small ones), and also to make Nutch 
attractive to developers as a reusable search engine component.


I think there are two areas.  One is to keep the focus as you stated 
above.  The other is to provide a path to get more people involved.  If 
no one objects I will continue working on such a path.


Please let me know if I can help in this people area.  I'm currently 
unable to assist with technical Nutch issues on a day-to-day basis, but 
I am still very interested in doing what I can to ensure Nutch's 
long-term vitality as a project.


Cheers,

Doug


Re: Next Nutch release

2007-01-20 Thread Dennis Kubes



Andrzej Bialecki wrote:

Dennis Kubes wrote:
I completely agree with this.  I am interested in devoting as much 
time as possible to seeing the success of Nutch, Hadoop, and Lucene.  
As our business grows I would also be willing to devote developers 
full time to work on Nutch, Hadoop, and Lucene.


I think that at least one company needs to come out and have a 
production search engine that is competition, however small, to the 
googles and yahoos of the world, built on Nutch and Hadoop.  I thought 
that was the original goal of Nutch.  I know there are some out there 
right now like Mozdex, but I mean a true billion page system.  I think 
the .8 codebase, and yes improvements could be made, is capable of 
supporting such a system.  I think then you will see many more 
developers become interested in the project.  If you build it they 
will come.


Sure, I'd love to point people to such a system. But did you 
calculate how much money, in initial investment and then ongoing 
costs, is needed to maintain such an index? It cannot happen just
because of someone's goodwill, there must be a sound business idea 
behind it, and a team of dedicated people to make it happen and 
persevere - not just to demonstrate how good Nutch is, but to keep up 
for the sake of their own business.


I completely agree.  We have been working on this business for almost a 
year.  We received significant seed capital to build the alpha version 
of the search, which is complete, and are in the process of securing 
first round private equity funding to scale to 100M pages this year and 
up to 1B pages in year 2.


Yes, the initial investment for hardware, data center costs, marketing 
costs, and most importantly development staff for, say, a 1 billion page 
index capable of sustaining a constant 100 queries per second is around 5M, 
and as it grows into the 10-20 billion page range costs can grow as high as 100M.
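
As a rough sanity check on these figures, the arithmetic can be sketched as below. This is only an illustration in Python: the currency unit is not stated above, the helper names are made up for the example, and the 100M figure is treated as an upper bound for both ends of the 10-20 billion page range.

```python
# Back-of-the-envelope arithmetic for the figures quoted above.
# Assumption: "5M" and "100M" are in the same (unstated) currency unit.

SECONDS_PER_DAY = 24 * 60 * 60


def queries_per_day(queries_per_second: int) -> int:
    """Total queries served in a day at a constant query rate."""
    return queries_per_second * SECONDS_PER_DAY


def cost_per_page(total_cost: float, pages: int) -> float:
    """Total cost divided evenly over the number of indexed pages."""
    return total_cost / pages


daily_queries = queries_per_day(100)
small_index = cost_per_page(5_000_000, 1_000_000_000)
large_index_low = cost_per_page(100_000_000, 20_000_000_000)
large_index_high = cost_per_page(100_000_000, 10_000_000_000)

print(daily_queries, small_index, large_index_low, large_index_high)
```

Taken at face value, the per-page figure stays within the same order of magnitude at both scales, and a constant 100 queries per second works out to several million queries per day.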


I think what many people don't understand is that search is as much a 
hardware (electricity, bandwidth) issue as it is a software issue.  I 
know that we couldn't have developed the systems we have without Nutch, 
Hadoop, and Lucene and that I personally and we as a company are 
completely committed to their development.




I will say that it is difficult for people to understand how to get 
more involved.  I have been working with Nutch and Hadoop for almost a 
year now on a daily basis and only now am I understanding how to 
contribute through jira, etc.  There needs to be more guidance in 
helping developers contribute.  For example, if you want to develop a 
new piece of functionality, then do x, y, and z.  Here is how to patch your 
system. If you want to develop a patch then here are the steps.  I 
have programmed in Java for many years but haven't worked on many open 
source projects before.  The process of how they work isn't explicit 
and it needs to be.


Hmm. I might not be objective here anymore. There is however some 
documentation already on the Wiki, which explains how to contribute - if 
you feel it's inadequate please use your hard-earned experience to 
improve it.


I am in the middle of writing a new wiki page for contributing that will 
go into much more detail about the process.




We worked up many patches for issues we came up against in the .8 and 
.4 codebases but they were never contributed because, as stupid as it 
might sound, we really don't know how to give it back.  The best thing 
I thought I could do was to help answer questions on the list.  Again, 
we just need a little guidance.


Are you willing to spend the time and do the required refactoring? 
Anyone else?


Yes, I am and I currently have 2 other developers that can help.


Sounds great. We could start by creating a new page on Wiki, which would 
collect our vision for Nutch - as I mentioned to Stefan, I think we 
should take a step back, and think about the strategy for the next 1-2 
years of Nutch development, and what is the target audience.


I am all for this, just understand this is a new process for me so will 
need some guidance.




Sure, if we start a 2.x branch and if I'm not developing for the trash 
or jira nirvana, I can imagine contributing. I 


Just a quick comment: jira nirvana (which I believe refers to patches 
sitting idle in Jira for a long time) is not caused by ill will or 
disrespect for contributors, but foremost by limited human resources. If 
we want to maintain a certain level of quality, these patches cannot be 
applied blindly, but need to be reviewed, analyzed, applied, tested, and 
committed. That's an awful lot of work for 2-3 people, who also have 
other things to do ...




It is much less attractive for developers to spend weeks finding a bug 
like the regular expression one. Then such a bug sits there for months 
in Jira being rejected. Sure, if none of the contributors run 
Nutch with a 500 million URL 


It's not being rejected - see the comments on that issue, there is an 
overall agreement that 

Re: Next Nutch release

2007-01-19 Thread Doug Cutting

Stefan Groschupf wrote:
I don't want to start an emotional discussion here; however, talking about 
the problem in public might help.


What, specifically, is the problem you perceive?

Doug


Re: Next Nutch release

2007-01-19 Thread Dennis Kubes

Just to put in my view.

Stefan Groschupf wrote:

Hi Andrzej,

thank you for taking the time to comment, I highly value your comments.

* I guess that for each case where Nutch seems inappropriate I could 
give you a counter-example of Nutch being used  commercially with much 
success. I guess it depends on a particular application and the type 
of customer.


Yes, it would be interesting to hear who uses Nutch .8 _successfully_ in 
production.


Although I can't say who we are yet as we are in the middle of private 
equity funding,  we have built a production version categorization 
search engine that uses the Nutch .8 and hadoop .4 code base that we are 
currently in the process of scaling to 100M pages.


* no doubt Nutch has its warts - the plugin system could be simpler, 
for example ;) but hey, it's great that we have a plugin system at 
all! It would be easier now to refactor Nutch to use a different 
plugin system than it was to go from the completely monolithic design 
to the plugin system ... As with any open source project - if you 
don't like it, fix it and contribute the fix.


Sure - I tried that more than once - but I do not want to start this 
discussion again.


* things won't happen magically unless there is a greater involvement 
of skilled developers. One way road - well, with limited resources 
that this project has at the moment the only way is to gradually 
improve, we cannot afford to abandon the current codebase and start 
from scratch.


I agree - the problem is skilled developers. I remember more than one 
offer from different companies to dedicate developers to the project, but 
it looks like there was no interest.


I completely agree with this.  I am interested in devoting as much time 
as possible to seeing the success of Nutch, Hadoop, and Lucene.  As our 
business grows I would also be willing to devote developers full time to 
work on Nutch, Hadoop, and Lucene.


I think that at least one company needs to come out and have a 
production search engine that is competition, however small, to the 
googles and yahoos of the world, built on Nutch and Hadoop.  I thought 
that was the original goal of Nutch.  I know there are some out there 
right now like Mozdex, but I mean a true billion page system.  I think 
the .8 codebase, and yes improvements could be made, is capable of 
supporting such a system.  I think then you will see many more 
developers become interested in the project.  If you build it they will 
come.


I will say that it is difficult for people to understand how to get more 
involved.  I have been working with Nutch and Hadoop for almost a year 
now on a daily basis and only now am I understanding how to contribute 
through jira, etc.  There needs to be more guidance in helping 
developers contribute.  For example, if you want to develop a new piece 
of functionality, then do x, y, and z.  Here is how to patch your system. If 
you want to develop a patch then here are the steps.  I have programmed 
in Java for many years but haven't worked on many open source projects 
before.  The process of how they work isn't explicit and it needs to be.


We worked up many patches for issues we came up against in the .8 and .4 
codebases but they were never contributed because, as stupid as it might 
sound, we really don't know how to give it back.  The best thing I 
thought I could do was to help answer questions on the list.  Again, we just 
need a little guidance.


Are you willing to spend the time and do the required refactoring? 
Anyone else?


Yes, I am and I currently have 2 other developers that can help.



In general there was some emotional discussion about API changes. Since 
Nutch is a 0.x release, and an application rather than a library, more frequent 
refactoring might have improved the maintainability of the code over 
time.


Sure, if we start a 2.x branch and if I'm not developing for the trash or 
jira nirvana, I can imagine contributing. I would rethink and rewrite 
some major parts (e.g. remove the reuse of objects with complex 
state and the endless if-then-else conditions nobody can debug), and I 
believe that is not difficult. I'm not talking about the algorithm stuff 
here.


Maybe one day we can get some developers together to first think about a 
good extensible design and then start a 2.x stream or a new project.


I hope so too. But as Steve B. said once, what we need is developers, 
developers, developers ... ;)


I agree; however, it must be attractive for developers to spend time on an 
open source project. We saw many developers here. You are the only one 
left who does some serious development, and I can't find words for how much 
respect I have for your work. You are the only one who is able to fix 
serious bugs.


We also have much respect for you Andrzej.

You may have more developers than you think.  They might just not know 
how to contribute.


It is much less attractive for developers to spend weeks finding a bug 
like the regular expression one. Then such a bug 

Re: Next Nutch release

2007-01-19 Thread Doug Cutting

Dennis Kubes wrote:
I will say that it is difficult for people to understand how to get more 
involved.  I have been working with Nutch and Hadoop for almost a year 
now on a daily basis and only now am I understanding how to contribute 
through jira, etc.  There needs to be more guidance in helping 
developers contribute.  For example, if you want to develop a new piece 
of functionality, then do x, y, and z.  Here is how to patch your system. If 
you want to develop a patch then here are the steps.


The closest thing we have currently are the HowToContribute pages:

http://wiki.apache.org/nutch/HowToContribute
http://wiki.apache.org/lucene-hadoop/HowToContribute
http://wiki.apache.org/jakarta-lucene/HowToContribute

These are not great, but they're a start.  Are there parts that are 
confusing?  Do they assume too much?  Are they missing things?  If so, 
please help to update these.


I note that the Nutch version is less evolved than the Lucene and Hadoop 
versions.


Doug



Re: Next Nutch release

2007-01-19 Thread Andrzej Bialecki

Dennis Kubes wrote:
I completely agree with this.  I am interested in devoting as much 
time as possible to seeing the success of Nutch, Hadoop, and Lucene.  
As our business grows I would also be willing to devote developers 
full time to work on Nutch, Hadoop, and Lucene.


I think that at least one company needs to come out and have a 
production search engine that is competition, however small, to the 
googles and yahoos of the world, built on Nutch and Hadoop.  I thought 
that was the original goal of Nutch.  I know there are some out there 
right now like Mozdex, but I mean a true billion page system.  I think 
the .8 codebase, and yes improvements could be made, is capable of 
supporting such a system.  I think then you will see many more 
developers become interested in the project.  If you build it they 
will come.


Sure, I'd love to point people to such a system. But did you 
calculate how much money, in initial investment and then ongoing 
costs, is needed to maintain such an index? It cannot happen just
because of someone's goodwill, there must be a sound business idea 
behind it, and a team of dedicated people to make it happen and 
persevere - not just to demonstrate how good Nutch is, but to keep up 
for the sake of their own business.




I will say that it is difficult for people to understand how to get 
more involved.  I have been working with Nutch and Hadoop for almost a 
year now on a daily basis and only now am I understanding how to 
contribute through jira, etc.  There needs to be more guidance in 
helping developers contribute.  For example, if you want to develop a 
new piece of functionality, then do x, y, and z.  Here is how to patch your 
system. If you want to develop a patch then here are the steps.  I 
have programmed in Java for many years but haven't worked on many open 
source projects before.  The process of how they work isn't explicit 
and it needs to be.


Hmm. I might not be objective here anymore. There is however some 
documentation already on the Wiki, which explains how to contribute - if 
you feel it's inadequate please use your hard-earned experience to 
improve it.




We worked up many patches for issues we came up against in the .8 and 
.4 codebases but they were never contributed because, as stupid as it 
might sound, we really don't know how to give it back.  The best thing 
I thought I could do was to help answer questions on the list.  Again, 
we just need a little guidance.


Are you willing to spend the time and do the required refactoring? 
Anyone else?


Yes, I am and I currently have 2 other developers that can help.


Sounds great. We could start by creating a new page on Wiki, which would 
collect our vision for Nutch - as I mentioned to Stefan, I think we 
should take a step back, and think about the strategy for the next 1-2 
years of Nutch development, and what is the target audience.


Sure, if we start a 2.x branch and if I'm not developing for the trash 
or jira nirvana, I can imagine contributing. I 


Just a quick comment: jira nirvana (which I believe refers to patches 
sitting idle in Jira for a long time) is not caused by ill will or 
disrespect for contributors, but foremost by limited human resources. If 
we want to maintain a certain level of quality, these patches cannot be 
applied blindly, but need to be reviewed, analyzed, applied, tested, and 
committed. That's an awful lot of work for 2-3 people, who also have 
other things to do ...




It is much less attractive for developers to spend weeks finding a bug 
like the regular expression one. Then such a bug sits there for months 
in Jira being rejected. Sure, if none of the contributors run 
Nutch with a 500 million URL 


It's not being rejected - see the comments on that issue, there is an 
overall agreement that it's ok; it simply hasn't been applied yet. See 
above for the why.



I'm slowly coming to a point where I should be able to fix it - but 
let's not throw out the baby with the bathwater ...

Wow, I'm keeping my fingers crossed!


There is a great book on this.  Its ISBN is 0691122024.  Andrzej, send me 
your address and I will buy and ship you a copy if you don't have it.  


Too late :) I found it two weeks ago, and it's already on its merry way 
- but thanks for the offer.



We would also be willing to help develop this functionality further.


I started working on a testbed as a part of another commercial project, 
it's likely that I could get a release from the customer to contribute 
this code to the project. A testbed is a prerequisite for any serious 
work on ranking and web graph.


(It's quite unfortunate that the best-of-breed open source framework for 
working with web graphs is licensed under LGPL ...)




I can definitely see a desire to re-write but I think even if you 
re-write you are still going to have the same problem.  Search is hard 
and without guidance we can't get enough developers to understand what 
they need to know to help.


Indeed. People often 

Re: Next Nutch release

2007-01-18 Thread Scott Green

Hi,

I just finished reading all the source code for the Nutch GUI. Personally,
I don't like putting a lot of code snippets into JSP files, since it costs
a lot of time when refactoring. So how about adopting Velocity or
FreeMarker with a servlet?
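
The separation being proposed can be sketched roughly as follows. This is a minimal illustration in Python, with the standard library's `string.Template` standing in for Velocity or FreeMarker purely for brevity; the job fields and function names are hypothetical and are not taken from the admin GUI code.

```python
from string import Template

# View: templates with placeholders only, no application logic.
JOB_ROW = Template("<tr><td>$name</td><td>$status</td></tr>")
JOB_TABLE = Template("<table>\n$rows\n</table>")


def load_jobs():
    """Logic layer (hypothetical): returns plain data, knows nothing about HTML."""
    return [
        {"name": "fetch", "status": "running"},
        {"name": "index", "status": "queued"},
    ]


def render_jobs(jobs):
    """Servlet-like glue: hands the data to the templates and returns the page."""
    rows = "\n".join(JOB_ROW.substitute(job) for job in jobs)
    return JOB_TABLE.substitute(rows=rows)


print(render_jobs(load_jobs()))
```

With this split, a refactoring of the job logic touches only `load_jobs`, and a change of layout touches only the templates. A real view layer would also need to escape the values it renders.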

On 1/17/07, Enis Soztutar [EMAIL PROTECTED] wrote:

Hi all, for NUTCH-251:

I suppose that NUTCH-251 is a relatively significant issue, judging by the votes.
Stefan has written a good plugin for the admin GUI, and I have updated it
to work with nutch-0.8, hadoop 0.4.

Some of the features in the patch are not appropriate for our use cases,
and it requires Hadoop changes, so I am currently working on an
alternative implementation of the administration GUI. It consists of a
Hadoop server (like JobTracker) that listens for submitted jobs, a web GUI
to submit and track the jobs from the browser, and a job runner.

The architecture details of the patch are as follows:

  - An abstract class AdminJob representing a job in Nutch.
  - Various classes extending AdminJob, e.g. FetchAdminJob, IndexAdminJob.
  - A queue which sorts the jobs in priority order, using a modified
topological sort (jobs can be dependent).
  - An interface to submit jobs
  - An RPC server to listen for job submissions
  - An extension point (basically the same as the previous one)
  - A web server to serve plugin JSPs

The features will be:
- submitting jobs from code, the command line, or the web interface,
- tracking jobs from the command line or web interface
- scheduling jobs

I could send the code or details if anyone is interested in pretesting.
I would appreciate any comments and suggestions on this. I am
planning to complete the patch and submit it to Jira ASAP.

Sami Siren wrote:
 Hello,

 It has been a while since the previous release (0.8.1), and looking at the
 great fixes done in trunk I'd start thinking about baking a new release
 soon.

 Looking at the Jira roadmaps there is 1 blocking issue (fixing the
 license headers) for 0.8.2 and two other blocking issues for 0.9.0, of
 which I think NUTCH-233 is safe to put in.

 The top 10 voted issues are currently:

 NUTCH-61  Adaptive re-fetch interval. Detecting umodified content
 NUTCH-48  Did you mean query enhancement/refignment feature
 NUTCH-251 Administration GUI
 NUTCH-289 CrawlDatum should store IP address
 NUTCH-36  Chinese in Nutch
 NUTCH-185 XMLParser is configurable xml parser plugin.
 NUTCH-59  meta data support in webdb
 NUTCH-92  DistributedSearch incorrectly scores results
 NUTCH-68  A tool to generate arbitrary fetchlists
 NUTCH-87  Efficient site-specific crawling for a large number of sites

 Are there any opinions about issues that should go in before the next
 release? (Answering yes means that you are willing to provide a patch for
 it.)

 --
  Sami Siren






Re: Next Nutch release

2007-01-18 Thread Stefan Groschupf

Hi,

I just finished reading all the source code for the Nutch GUI. Personally,
I don't like putting a lot of code snippets into JSP files, since it costs
a lot of time when refactoring. So how about adopting Velocity or
FreeMarker with a servlet?



In general I agree: it is the view layer and should have as little code as  
possible. However, the idea was to have as few dependencies as possible  
on third-party tools and libraries, and also to get  
things done with low tech (JSP).


Stefan





Re: Next Nutch release

2007-01-18 Thread Doug Cutting

Stefan Groschupf wrote:
We run the GUI in several production environments with patched Hadoop 
code, since this is from our point of view the clean approach. 
Everything else feels like a workaround to fix some strange Hadoop 
behaviors.


Are there issues in Hadoop's Jira for these?  If so, do they have 
patches attached?  Are they linked to the corresponding issue in Nutch?


Doug


Re: Next Nutch release

2007-01-18 Thread Scott Green

Stefan,

I also dived into contrib/web2 in Nutch. Both it and admin-gui
own some plugins based on the Nutch plugin architecture, so I think
it would be great to extract something at a high level, since they should
have a lot in common.  Well, I don't know whether it is the right time to do this job.

On 1/19/07, Stefan Groschupf [EMAIL PROTECTED] wrote:

Hi,
 I just finished reading all the source code for the Nutch GUI. Personally,
 I don't like putting a lot of code snippets into JSP files, since it costs
 a lot of time when refactoring. So how about adopting Velocity or
 FreeMarker with a servlet?


In general I agree: it is the view layer and should have as little code as
possible. However, the idea was to have as few dependencies as possible
on third-party tools and libraries, and also to get
things done with low tech (JSP).

Stefan






Re: Next Nutch release

2007-01-18 Thread Stefan Groschupf

The old Hadoop patch is here:
https://issues.apache.org/jira/browse/NUTCH-251
We also had this conversation:
http://www.mail-archive.com/hadoop-dev@lucene.apache.org/msg00314.html
I guess after this we neglected to post the patches we use internally.

If someone feels strongly about getting the GUI working with Hadoop, he or 
she should feel free to update the patch and post it in the Hadoop Jira.


Stefan







On 18.01.2007, at 15:39, Doug Cutting wrote:


Stefan Groschupf wrote:
We run the GUI in several production environments with patched  
Hadoop code, since this is from our point of view the clean  
approach. Everything else feels like a workaround to fix some  
strange Hadoop behaviors.


Are there issues in Hadoop's Jira for these?  If so, do they have  
patches attached?  Are they linked to the corresponding issue in  
Nutch?


Doug



~~~
101tec Inc.
Menlo Park, California
http://www.101tec.com





Re: Next Nutch release

2007-01-18 Thread Stefan Groschupf

Hi Scott,

feel free - I have no options on that.

From my very limited point of view, the Nutch .8 source stream is a  
one-way street.
In all my projects we move as far away from Nutch as possible. I like  
Hadoop a lot, and writing customer tools on top of it is that easy.
But Nutch .8 was a proof of concept for the early Hadoop.  There is  
only one serious developer left, and wow, what a great job he does -  
but Nutch .8 is just too monolithic, too difficult to extend, too  
difficult to debug, and too difficult to integrate for a serious  
mission-critical application.
I spend a significant part of my life working with Nutch daily, but  
if someone asked, I would answer: don't use it.
Maybe one day we can get some developers together to first think about a  
good extensible design and then start a 2.x stream or a new project.
And ... yes, no OPIC, and yes, definitely no plugin architecture (I feel  
very sorry for all those who wasted so much of their lifetime because of my  
terribly complicated plugin system), but a clean IoC design with lightweight  
default interface implementations and great test coverage.
Anyway, just my *very little* point of view based on 3.5 years of Nutch  
experience.


Stefan





On 18.01.2007, at 21:33, Scott Green wrote:


Stefan,

I also dived into contrib/web2 in Nutch. Both it and admin-gui
own some plugins based on the Nutch plugin architecture, so I think
it would be great to extract something at a high level, since they should
have a lot in common.  Well, I don't know whether it is the right time to do this job.

On 1/19/07, Stefan Groschupf [EMAIL PROTECTED] wrote:

Hi,
 I just finished reading all the source code for the Nutch GUI. Personally,
 I don't like putting a lot of code snippets into JSP files, since it costs
 a lot of time when refactoring. So how about adopting Velocity or
 FreeMarker with a servlet?


In general I agree: it is the view layer and should have as little code as
possible. However, the idea was to have as few dependencies as possible
on third-party tools and libraries, and also to get
things done with low tech (JSP).

Stefan













Re: Next Nutch release

2007-01-17 Thread Enis Soztutar

Hi all, for NUTCH-251:

I suppose that NUTCH-251 is a relatively significant issue, judging by the votes. 
Stefan has written a good plugin for the admin GUI, and I have updated it 
to work with nutch-0.8, hadoop 0.4.


Some of the features in the patch are not appropriate for our use cases, 
and it requires Hadoop changes, so I am currently working on an 
alternative implementation of the administration GUI. It consists of a 
Hadoop server (like JobTracker) that listens for submitted jobs, a web GUI 
to submit and track the jobs from the browser, and a job runner.


The architecture details of the patch are as follows:

 - An abstract class AdminJob representing a job in Nutch.
 - Various classes extending AdminJob, e.g. FetchAdminJob, IndexAdminJob.
 - A queue which sorts the jobs in priority order, using a modified 
topological sort (jobs can be dependent).
 - An interface to submit jobs
 - An RPC server to listen for job submissions
 - An extension point (basically the same as the previous one)
 - A web server to serve plugin JSPs

The features will be:
   - submitting jobs from code, the command line, or the web interface,
   - tracking jobs from the command line or web interface
   - scheduling jobs
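
One possible reading of the job classes and the priority queue described above can be sketched as follows. This is an illustration in Python rather than the Java of the actual patch, and it is not the patch's code: the `name`, `priority` and `depends_on` fields, the `run` method and the `schedule` function are all assumptions made for the example. The ordering is Kahn's topological sort with a heap in place of the plain queue.

```python
import heapq
from abc import ABC, abstractmethod


class AdminJob(ABC):
    """Abstract job; name, priority and depends_on are assumed fields."""

    def __init__(self, name, priority=0, depends_on=()):
        self.name = name
        self.priority = priority            # lower number runs earlier
        self.depends_on = list(depends_on)  # names of jobs that must finish first

    @abstractmethod
    def run(self):
        """Do the actual work of the job."""


class FetchAdminJob(AdminJob):
    def run(self):
        return f"fetching ({self.name})"


class IndexAdminJob(AdminJob):
    def run(self):
        return f"indexing ({self.name})"


def schedule(jobs):
    """Order jobs by priority without running a job before its dependencies."""
    by_name = {job.name: job for job in jobs}
    waiting_on = {job.name: len(job.depends_on) for job in jobs}
    dependents = {job.name: [] for job in jobs}
    for job in jobs:
        for dep in job.depends_on:
            if dep not in by_name:
                raise ValueError(f"unknown dependency: {dep}")
            dependents[dep].append(job.name)

    # Jobs with no dependencies are ready; the heap yields the best priority first.
    ready = [(job.priority, job.name) for job in jobs if not job.depends_on]
    heapq.heapify(ready)

    ordered = []
    while ready:
        _, name = heapq.heappop(ready)
        ordered.append(by_name[name])
        for child in dependents[name]:
            waiting_on[child] -= 1
            if waiting_on[child] == 0:
                heapq.heappush(ready, (by_name[child].priority, child))

    if len(ordered) != len(jobs):
        raise ValueError("dependency cycle among jobs")
    return ordered


jobs = [
    IndexAdminJob("index-a", priority=0, depends_on=["fetch-a"]),
    FetchAdminJob("fetch-b", priority=2),
    FetchAdminJob("fetch-a", priority=1),
]
print([job.name for job in schedule(jobs)])  # ['fetch-a', 'index-a', 'fetch-b']
```

Here `index-a` has the best priority but still waits for `fetch-a`, which is the behaviour a dependency-aware priority queue needs.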

I could send the code or details if anyone is interested in pretesting. 
I would appreciate any comments and suggestions on this. I am 
planning to complete the patch and submit it to Jira ASAP.


Sami Siren wrote:

Hello,

It has been a while since the previous release (0.8.1), and looking at the
great fixes done in trunk I'd start thinking about baking a new release
soon.

Looking at the Jira roadmaps there is 1 blocking issue (fixing the
license headers) for 0.8.2 and two other blocking issues for 0.9.0, of
which I think NUTCH-233 is safe to put in.

The top 10 voted issues are currently:

NUTCH-61  Adaptive re-fetch interval. Detecting umodified content
NUTCH-48  Did you mean query enhancement/refignment feature
NUTCH-251 Administration GUI
NUTCH-289 CrawlDatum should store IP address
NUTCH-36  Chinese in Nutch
NUTCH-185 XMLParser is configurable xml parser plugin.
NUTCH-59  meta data support in webdb
NUTCH-92  DistributedSearch incorrectly scores results
NUTCH-68  A tool to generate arbitrary fetchlists
NUTCH-87  Efficient site-specific crawling for a large number of sites

Are there any opinions about issues that should go in before the next
release? (Answering yes means that you are willing to provide a patch for
it.)

--
 Sami Siren

  




Re: Next Nutch release

2007-01-17 Thread Sami Siren

2007/1/17, Enis Soztutar [EMAIL PROTECTED]:


Hi all, for NUTCH-251:

I suppose that NUTCH-251 is a relatively significant issue, judging by the votes.
Stefan has written a good plugin for the admin GUI, and I have updated it
to work with nutch-0.8, hadoop 0.4.



Good to hear someone is working on that! Why not target it at the
trunk version of Nutch?


 - a web server to serve plugin jsp's


Why not make it a regular WAR? Also, please consider making a clean
separation of view and logic when you implement the web UI.

--
Sami Siren


Re: Next Nutch release

2007-01-17 Thread Sami Siren


 The top 10 voted issues are currently:

 NUTCH-61   Adaptive re-fetch interval. Detecting umodified content


Well ... I'm of a split mind on this. I can bring this patch up to date
and apply it before 0.9.0, if we understand that this is a 0 release
... ;) Otherwise I'd prefer to wait with it right after the release.



+1 for putting it in after 0.9.0

I would also like to proceed with NUTCH-339 (Fetcher2 patches plus
some changes I made in the meantime), since I'd like to expose the new
fetcher to a broader audience, and it doesn't affect the existing
implementation.



+1 for putting it in before 0.9.0



NUTCH-48  Did you mean query enhancement/refignment feature
 NUTCH-251 Administration GUI
 NUTCH-289 CrawlDatum should store IP address


I'm still not entirely convinced about this - and there is already a
mechanism in place to support it if someone really wishes to keep this
particular info (CrawlDatum.metaData).

 NUTCH-36  Chinese in Nutch
 NUTCH-185 XMLParser is configurable xml parser plugin.
 NUTCH-59  meta data support in webdb
 NUTCH-92  DistributedSearch incorrectly scores results

This is too intrusive to fix just before the release - and needs
additional discussion.



+1


 NUTCH-68  A tool to generate arbitrary fetchlists

Easy to port this to 0.9.0 - I can do this.



cool.


I'll start working on the headers and stuff to get the blocking issue out of the way.

--
Sami Siren


Re: Next Nutch release

2007-01-17 Thread Enis Soztutar

Sami Siren wrote:

2007/1/17, Enis Soztutar [EMAIL PROTECTED]:


Hi all, for NUTCH-251:

I suppose that NUTCH-251 is a relatively significant issue, judging by the votes.
Stefan has written a good plugin for the admin GUI, and I have updated it
to work with nutch-0.8, hadoop 0.4.



Good to hear someone is working on that! Why not target it at the
trunk version of Nutch?

It is targeted at the trunk already. The previous version was targeted at 
nutch-0.8, hadoop 0.4, since back then those versions were the latest in 
the trunk.



 - a web server to serve plugin jsp's


Why not make it a regular WAR? Also, please consider making a clean
separation of view and logic when you implement the web UI.

As Stefan's version used an embedded Jetty server, I continued this way. 
But I will consider that possibility also.




--
Sami Siren





RE: Next Nutch release

2007-01-17 Thread Armel T. Nene
Hi guys,

I have been working on NUTCH-61 (Adaptive re-fetch interval. Detecting
unmodified content), applying it to Nutch 0.8.1. Here are some points:

1. This feature is great for Nutch to have, as it differentiates between
modified and unmodified content, and therefore does not index twice even if the
document's fetch time has arrived.

a. There are some performance issues here. Even with this patch, Nutch
still fetches the content and then checks its status against the last
modified time in the database. If it has to check 1000 files before
indexing the following 10 files, this will cause a real problem for those
who are after real-time indexing.

2. Since I applied this patch to Nutch 0.8.1, when I try to parse XML
files with our modified version of the xmlparser/indexer plugin, the
fetcher throws the following exception:

 

WARN  fetcher.Fetcher - Error parsing:
file:/C:/880254/8802_583254_20051006_12.xml: failed(2,200):
java.lang.IllegalStateException: Root element not set

 

The system will not hang or crash but the xml file will be indexed without
any generated fields. The plugins works fine without the patch. I have
another parser that parses graphics and other formats that fails when used
with the patch. So far this problem occurs when using the file protocol.

 

3. The patch works fine when indexing web sites using the http protocol.
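
The concern in point 1a is that the modification check happens only after the content has been fetched. A minimal sketch of checking a cheap timestamp before fetching, and of adapting the re-fetch interval afterwards, might look like the following. Everything here is an assumption for illustration: ModifiedCheckSketch, needsRefetch, nextInterval, the in-memory map and the grow/shrink factors are hypothetical names and values, not part of Nutch or of the NUTCH-61 patch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; a plain map stands in for the crawl database. */
class ModifiedCheckSketch {
    /** URL -> modification time recorded at the last fetch. */
    private final Map<String, Long> lastSeen = new HashMap<>();

    /**
     * Returns true if the document should be fetched and indexed again.
     * currentModified is whatever cheap timestamp the protocol offers,
     * e.g. File.lastModified() for file: URLs.
     */
    boolean needsRefetch(String url, long currentModified) {
        Long previous = lastSeen.get(url);
        if (previous != null && previous == currentModified) {
            return false; // unchanged since last fetch: skip fetch, parse and index
        }
        lastSeen.put(url, currentModified);
        return true;
    }

    /**
     * Adapts the re-fetch interval: shrink it when the page changed,
     * grow it when it did not, and clamp to [min, max].
     * The factors (halve / grow by 50%) are arbitrary illustrative choices.
     */
    static long nextInterval(long current, boolean modified, long min, long max) {
        long next = modified ? current / 2 : current + current / 2;
        return Math.max(min, Math.min(max, next));
    }

    public static void main(String[] args) {
        ModifiedCheckSketch check = new ModifiedCheckSketch();
        String url = "file:/tmp/example.xml"; // hypothetical URL
        System.out.println(check.needsRefetch(url, 100L));
        System.out.println(check.needsRefetch(url, 100L));
        System.out.println(nextInterval(1000L, false, 100L, 10000L));
    }
}
```

With a check of this kind, the 1000 unchanged files would cost one timestamp lookup each rather than a full fetch and parse.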

 

I am willing to work with Andrzej to make it stable, as I understand he is the
architect of this patch. I have the possibility of testing it in a mixed
environment in our computer lab. This patch can be the stepping stone for
other features such as real-time indexing and a fetch queue for index
updating, as opposed to creating a new index each time.

 

Best Regards,

 

Armel

 

-

Armel T. Nene

iDNA Solutions

Tel: +44 (207) 257 6124

Mobile: +44 (788) 695 0483 

http://blog.idna-solutions.com

-Original Message-
From: Enis Soztutar [mailto:[EMAIL PROTECTED] 
Sent: 17 January 2007 15:39
To: nutch-dev@lucene.apache.org
Subject: Re: Next Nutch release

 

Sami Siren wrote:
 2007/1/17, Enis Soztutar [EMAIL PROTECTED]:

 Hi all, for NUTCH-251:

 I suppose that NUTCH-251 is relatively a significant issue by the votes.
 Stefan has written a good plugin for the admin GUI and I have updated it
 to work with nutch-0.8, hadoop 0.4.

 Good to hear someone is working on that! Why not target it to
 the trunk version of Nutch?

It is targeted to the trunk already. The previous one was targeted to
nutch-0.8, hadoop 0.4, since back then those versions were the latest in
the trunk.

  - a web server to serve plugin jsp's

 Why not make it a regular WAR? Also, please consider making a clean
 separation of view/logic when you implement the web UI.

As Stefan's version used an embedded Jetty server, I continued this way.
But I will consider that possibility also.

 --
 Sami Siren

 

 

 



Re: Next Nutch release

2007-01-17 Thread Andrzej Bialecki

Armel T. Nene wrote:

I am willing to work with Andrzej to make it stable, as I understand he is the
architect of this patch. I have the possibility of testing it in a mixed
environment in our computer lab. This patch can be the stepping stone for
other features such as real-time indexing and a fetch queue for index
updating, as opposed to creating a new index each time.
  


Thanks for taking the initiative! I'll be glad to review the patch and
apply it right after the 0.9 release. The best way to keep the process
open would be to create an svn diff and attach this new version of the
patch to the JIRA issue.


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Next Nutch release

2007-01-17 Thread Stefan Groschupf

Hi,

great to hear people are still working on things. It shows once more that
getting something in early would save some effort. :)

Just some random comments.

We run the GUI in several production environments with patched Hadoop
code, since this is from our point of view the clean approach.
Everything else feels like a workaround to fix some strange Hadoop
behaviors. It may be a long time since I spoke to Doug and some
other Hadoop developers, but at that time I understood that
there was a general interest in having a Nutch GUI and in supporting the
required functionality in Hadoop.

I'm not sure if that is still the case or if I had a wrong impression.
In any case, from my p.o.v. the clean way would be getting the
required minor changes into Hadoop (not critical, simple stuff from my
point of view) instead of implementing workarounds in Nutch. Since
Hadoop is a kind of child of Nutch, there should be a close relation,
at least to discuss things.
Anyway, no strong opinion, just my 2 cents. In any case I'm very happy
if people now see the need for a GUI as well and someone is working
on that, since I'm kind of busy with other projects.


Thanks.
Stefan


On 17.01.2007, at 06:42, Enis Soztutar wrote:


Hi all, for NUTCH-251:

I suppose that NUTCH-251 is relatively a significant issue by the
votes. Stefan has written a good plugin for the admin GUI and I
have updated it to work with nutch-0.8, hadoop 0.4.


Some of the features in the patch are not appropriate for our use
cases and it requires Hadoop changes, thus I am currently working
on an alternative implementation of the administration GUI, which
runs a Hadoop server (like JobTracker) to listen for submitted jobs,
a web GUI to submit and track the jobs from the browser, and a job
runner.


The architecture details of the patch are as follows:

 - An abstract class AdminJob representing a job in Nutch.
 - Various classes extending AdminJob, for example FetchAdminJob,
IndexAdminJob.
 - A queue which sorts the jobs in priority order, by a modified
topological sort (jobs can be dependent).

 - an interface to submit jobs
 - an RPC server to listen to job submissions
 - an extension point (basically the same as the previous one)
 - a web server to serve plugin JSPs

Upon completion, the features will be:
   - submitting jobs from code, command line or web interface,
   - tracking jobs from the command line or web interface
   - scheduling jobs
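
The job queue described above (priority order, with dependencies between jobs) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code of the patch: AdminJobSketch, NamedJobSketch and JobQueueSketch are hypothetical names, a larger priority value is assumed to mean "more urgent", and the "modified topological sort" is approximated here by Kahn's algorithm with a priority queue holding the jobs whose dependencies are already satisfied.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical base class modelled on the AdminJob described above. */
abstract class AdminJobSketch {
    final String name;
    final int priority; // assumption: larger value = more urgent
    final List<AdminJobSketch> dependsOn = new ArrayList<>();

    AdminJobSketch(String name, int priority) {
        this.name = name;
        this.priority = priority;
    }

    /** Concrete jobs (fetch, index, ...) would do their work here. */
    abstract void run();
}

/** Trivial concrete job used only to exercise the queue. */
class NamedJobSketch extends AdminJobSketch {
    NamedJobSketch(String name, int priority) { super(name, priority); }
    @Override void run() { System.out.println("running " + name); }
}

class JobQueueSketch {
    /** Orders jobs so that dependencies run first and, among runnable jobs, higher priority wins. */
    static List<AdminJobSketch> order(List<AdminJobSketch> jobs) {
        Map<AdminJobSketch, Integer> unmet = new HashMap<>(); // job -> unsatisfied dependency count
        Map<AdminJobSketch, List<AdminJobSketch>> dependents = new HashMap<>();
        for (AdminJobSketch job : jobs) {
            unmet.put(job, job.dependsOn.size());
            for (AdminJobSketch dep : job.dependsOn) {
                dependents.computeIfAbsent(dep, k -> new ArrayList<>()).add(job);
            }
        }
        PriorityQueue<AdminJobSketch> ready =
            new PriorityQueue<>((a, b) -> Integer.compare(b.priority, a.priority));
        for (AdminJobSketch job : jobs) {
            if (unmet.get(job) == 0) ready.add(job);
        }
        List<AdminJobSketch> ordered = new ArrayList<>();
        while (!ready.isEmpty()) {
            AdminJobSketch next = ready.poll();
            ordered.add(next);
            for (AdminJobSketch waiting : dependents.getOrDefault(next, Collections.emptyList())) {
                // A job becomes runnable once its last dependency has been scheduled.
                if (unmet.merge(waiting, -1, Integer::sum) == 0) ready.add(waiting);
            }
        }
        if (ordered.size() != jobs.size()) {
            throw new IllegalStateException("cyclic or unsatisfied job dependencies");
        }
        return ordered;
    }

    public static void main(String[] args) {
        AdminJobSketch fetch = new NamedJobSketch("fetch", 1);
        AdminJobSketch index = new NamedJobSketch("index", 10);
        index.dependsOn.add(fetch); // indexing has to wait for the fetch
        List<AdminJobSketch> jobs = new ArrayList<>();
        jobs.add(index);
        jobs.add(fetch);
        for (AdminJobSketch job : order(jobs)) job.run();
    }
}
```

Note that a high-priority job still waits for its dependencies: priority only decides the order among jobs that are ready to run.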

I could send the code or details if anyone is interested in
pretesting, and I will appreciate any comments and suggestions on
this. I am planning to complete the patch and submit it to Jira ASAP.


Sami Siren wrote:

Hello,

It has been a while since the previous release (0.8.1) and looking at
the great fixes done in trunk I'd start thinking about baking a new
release soon.

Looking at the jira roadmaps there is 1 blocking issue (fixing the
license headers) for 0.8.2 and two other blocking issues for 0.9.0 of
which I think NUTCH-233 is safe to put in.

The top 10 voted issues are currently:

NUTCH-61   Adaptive re-fetch interval. Detecting unmodified content
NUTCH-48   Did you mean query enhancement/refinement feature
NUTCH-251  Administration GUI
NUTCH-289  CrawlDatum should store IP address
NUTCH-36   Chinese in Nutch
NUTCH-185  XMLParser is configurable xml parser plugin.
NUTCH-59   meta data support in webdb
NUTCH-92   DistributedSearch incorrectly scores results
NUTCH-68   A tool to generate arbitrary fetchlists
NUTCH-87   Efficient site-specific crawling for a large number of sites

Are there any opinions about issues that should go in before the next
release? (Answering yes means that you are willing to provide a patch
for it.)

--
 Sami Siren







~~~
101tec Inc.
Menlo Park, California
http://www.101tec.com





Re: Next Nutch release

2007-01-16 Thread Andrzej Bialecki

Sami Siren wrote:

Hello,

It has been a while since the previous release (0.8.1) and looking at the
great fixes done in trunk I'd start thinking about baking a new release
soon.

Looking at the jira roadmaps there is 1 blocking issue (fixing the
license headers) for 0.8.2 and two other blocking issues for 0.9.0 of
which I think NUTCH-233 is safe to put in.
  


Agreed. The replacement regex mentioned in the original comment seems 
safe enough, and simpler.



The top 10 voted issues are currently:

NUTCH-61   Adaptive re-fetch interval. Detecting unmodified content
  


Well ... I'm of a split mind on this. I can bring this patch up to date
and apply it before 0.9.0, if we understand that this is a 0 release
... ;) Otherwise I'd prefer to wait with it until right after the release.


I would also like to proceed with NUTCH-339 (Fetcher2 patches plus
some changes I made in the meantime), since I'd like to expose the new
fetcher to a broader audience, and it doesn't affect the existing
implementation.




NUTCH-48   Did you mean query enhancement/refinement feature
NUTCH-251  Administration GUI
NUTCH-289  CrawlDatum should store IP address
  


I'm still not entirely convinced about this - and there is already a 
mechanism in place to support it if someone really wishes to keep this 
particular info (CrawlDatum.metaData).
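
As a rough illustration of that mechanism, the sketch below keeps an IP address in a per-URL metadata map instead of a dedicated field. It is an assumption-laden stand-in, not Nutch code: CrawlRecordSketch replaces CrawlDatum with a plain java.util.Map, the real metadata type and its accessors are not reproduced here, and the key name "_ip_" is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-in for a crawl record; the real CrawlDatum is deliberately not used here. */
class CrawlRecordSketch {
    final String url;
    /** Free-form metadata, playing the role of CrawlDatum.metaData. */
    final Map<String, String> metaData = new HashMap<>();

    CrawlRecordSketch(String url) {
        this.url = url;
    }
}

class IpMetadataSketch {
    /** Hypothetical key under which the resolved IP address is stored. */
    static final String IP_KEY = "_ip_";

    /** Records the IP for those who want it; everyone else pays no storage cost. */
    static void recordIp(CrawlRecordSketch record, String ip) {
        record.metaData.put(IP_KEY, ip);
    }

    /** Returns the stored IP, or null if none was recorded. */
    static String ipOf(CrawlRecordSketch record) {
        return record.metaData.get(IP_KEY);
    }

    public static void main(String[] args) {
        CrawlRecordSketch record = new CrawlRecordSketch("http://example.com/");
        recordIp(record, "192.0.2.10"); // documentation-range address, for illustration only
        System.out.println(record.url + " -> " + ipOf(record));
    }
}
```

The point of the design is that the record format stays unchanged for installations that do not need the IP address.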



NUTCH-36   Chinese in Nutch
NUTCH-185  XMLParser is configurable xml parser plugin.
NUTCH-59   meta data support in webdb
NUTCH-92   DistributedSearch incorrectly scores results


This is too intrusive to fix just before the release - and needs 
additional discussion.




NUTCH-68   A tool to generate arbitrary fetchlists


Easy to port this to 0.9.0 - I can do this.



NUTCH-87   Efficient site-specific crawling for a large number of sites
  




--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Next Nutch release

2007-01-16 Thread Thomas Müller
Sami, thanks a lot.

I would like to see a feature added whereby a link to a web page shows all
already-indexed URLs.

That way other spiders can fetch this page and get the URLs that the open
source Nutch already has to offer.


So we need not only open source code for the machine, but also
every node offering an open, downloadable database of URLs.

And we need a list of URLs of other Nutch domains. With this list, each Nutch
can crawl the URLs of the other Nutch instances that provide them on a website.

As millions of URLs are a lot, I suggest having 26 pages, from a to z: the
page for the letter a displays all URLs for a, along with links to the 25
other pages b-z.

Then several Nutch nodes could use a small p2p feature, and the sister
project YaCy could also fetch the URLs from a central open source point: all
Nutch domains.

Would it be possible to generate a web page link somewhere on the
Nutch homepage of the individual server install with all URLs?

Open source has to be founded on solidarity, so make the Nutch URL database
open to other open source search engine spiders from central points.

thanks

 Original Message 
Date: Tue, 16 Jan 2007 17:53:41 +0200
From: Sami Siren [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Subject: Next Nutch release

 Hello,
 
 It has been a while since the previous release (0.8.1) and looking at the
 great fixes done in trunk I'd start thinking about baking a new release
 soon.
 
 Looking at the jira roadmaps there is 1 blocking issue (fixing the
 license headers) for 0.8.2 and two other blocking issues for 0.9.0 of
 which I think NUTCH-233 is safe to put in.
 
 The top 10 voted issues are currently:
 
 NUTCH-61   Adaptive re-fetch interval. Detecting unmodified content
 NUTCH-48   Did you mean query enhancement/refinement feature
 NUTCH-251  Administration GUI
 NUTCH-289  CrawlDatum should store IP address
 NUTCH-36   Chinese in Nutch
 NUTCH-185  XMLParser is configurable xml parser plugin.
 NUTCH-59   meta data support in webdb
 NUTCH-92   DistributedSearch incorrectly scores results
 NUTCH-68   A tool to generate arbitrary fetchlists
 NUTCH-87   Efficient site-specific crawling for a large number of sites
 
 Are there any opinions about issues that should go in before the next
 release? (Answering yes means that you are willing to provide a patch for
 it.)
 
 --
  Sami Siren



Re: Next Nutch release

2007-01-16 Thread Chris Mattmann
Folks,

 When would you like to make the release? I've been working on NUTCH-185,
but got a bit bogged down with other work. If there is interest in having
NUTCH-185 included in the release, I could make a push to get out a patch by
week's end...

 As for the rest, my +1 for NUTCH-61 being included sooner rather than
later. It seems that the patch has garnered enough use and attention that
folks would like to see it in the release. I think the email from the user
trying to manage a terabyte of data a few days back was particularly
telling.

Cheers,
  Chris



On 1/16/07 8:19 AM, Andrzej Bialecki [EMAIL PROTECTED] wrote:

 Sami Siren wrote:
 Hello,
 
 It has been a while since the previous release (0.8.1) and looking at the
 great fixes done in trunk I'd start thinking about baking a new release
 soon.
 
 Looking at the jira roadmaps there is 1 blocking issue (fixing the
 license headers) for 0.8.2 and two other blocking issues for 0.9.0 of
 which I think NUTCH-233 is safe to put in.
   
 
 Agreed. The replacement regex mentioned in the original comment seems
 safe enough, and simpler.
 
 The top 10 voted issues are currently:
 
 NUTCH-61   Adaptive re-fetch interval. Detecting unmodified content
   
 
 Well ... I'm of a split mind on this. I can bring this patch up to date
 and apply it before 0.9.0, if we understand that this is a 0 release
 ... ;) Otherwise I'd prefer to wait with it until right after the release.
 
 I would also like to proceed with NUTCH-339 (Fetcher2 patches plus
 some changes I made in the meantime), since I'd like to expose the new
 fetcher to a broader audience, and it doesn't affect the existing
 implementation.
 
 
 NUTCH-48   Did you mean query enhancement/refinement feature
 NUTCH-251  Administration GUI
 NUTCH-289  CrawlDatum should store IP address
   
 
 I'm still not entirely convinced about this - and there is already a
 mechanism in place to support it if someone really wishes to keep this
 particular info (CrawlDatum.metaData).
 
 NUTCH-36  Chinese in Nutch
 NUTCH-185  XMLParser is configurable xml parser plugin.   NUTCH-59  meta
 data support in webdb
 NUTCH-92  DistributedSearch incorrectly scores results   NUTCH-68  
 
 This is too intrusive to fix just before the release - and needs
 additional discussion.
 
 
 NUTCH-68   A tool to generate arbitrary fetchlists
 
 Easy to port this to 0.9.0 - I can do this.
 
 
 NUTCH-87   Efficient site-specific crawling for a large number of sites