Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-17 Thread wangxu

Andrzej Bialecki wrote:

Howie Wang wrote:

I definitely don't expect people to write it just because it happens
to be useful to me :-) Call me crazy, but I'm thinking of
implementing this when I get some free time (whenever that will be).
It seems that I would just need to implement IWebDBWriter and
IWebDBReader, and then add a command-line option to the tools
(something like -mysql) to specify the type of db to instantiate. It
would affect about 15 files, but the tools changes would be simple
-- a few if statements here and there. Does that sound right? Howie


You are talking about the codebase from branch 0.7. This branch is not 
under active development. The current codebase is very different - it 
uses the MapReduce framework to process data in a distributed fashion.


So, there is no single interface for writing the CrawlDb. There is one 
class for reading the CrawlDb, but usually the data in the DB is used 
not standalone, but as one of many inputs to a map-reduce job.


To summarize - I think it would be very difficult to do this with the 
current codebase.



My URLs number at most around 1,000,000 per site;
perhaps I can do some tests and go ahead with the idea.

Based on 0.9, it seems the simplest way to achieve it is this: for any
MapReduce job associated with the CrawlDb, I add operations like these:

Read the relational DB to generate a temporary CrawlDb to use as the job's input path;
read the job-generated CrawlDb to update the relational DB.
Is that right?
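A minimal sketch of that round trip, using sqlite3 as a stand-in relational DB. The url/status schema and the tab-separated dump format here are assumptions for illustration only, not Nutch's actual CrawlDb format, which carries more fields (score, fetch time, retry counters, metadata).

```python
import sqlite3

def export_to_crawldb_input(conn, path):
    """Dump url/status rows from the relational DB into a flat
    tab-separated file to serve as the MapReduce job's input path."""
    with open(path, "w") as f:
        for url, status in conn.execute("SELECT url, status FROM crawldb"):
            f.write(f"{url}\t{status}\n")

def import_from_crawldb_output(conn, path):
    """Read the job-generated dump back and update the relational DB."""
    with open(path) as f:
        rows = [line.rstrip("\n").split("\t") for line in f]
    conn.executemany(
        "UPDATE crawldb SET status = ? WHERE url = ?",
        [(status, url) for url, status in rows],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawldb (url TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO crawldb VALUES ('http://example.com/', 'unfetched')")
conn.commit()

export_to_crawldb_input(conn, "crawldb_input.txt")
# ... the MapReduce job would run here and emit an updated dump ...
with open("crawldb_output.txt", "w") as f:
    f.write("http://example.com/\tfetched\n")
import_from_crawldb_output(conn, "crawldb_output.txt")
```

The cost Andrzej warns about shows up in the write-back step: every job finishes with a full pass over the output to sync the relational side.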



Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-13 Thread Andrzej Bialecki

Howie Wang wrote:

Sorry about the previous crappily formatted message. In brief, my
point was that a relational DB might perform better for small niche
users, and plus you get the flexibility of SQL. No more writing custom
code to tweak the webdb. Howie


Generally speaking, I agree that it would be a good option to have, 
especially for smaller setups - but it would require extensive 
modifications to many tools in Nutch. Unless you are willing to provide 
patches that implement it without breaking the large-scale case, I think 
we should let the matter rest ...



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-13 Thread Howie Wang
I definitely don't expect people to write it just because it happens to be
useful to me :-) Call me crazy, but I'm thinking of implementing this when I
get some free time (whenever that will be). It seems that I would just need to
implement IWebDBWriter and IWebDBReader, and then add a command-line option to
the tools (something like -mysql) to specify the type of db to instantiate. It
would affect about 15 files, but the tools changes would be simple -- a few if
statements here and there. Does that sound right?

Howie
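The switch Howie describes could look something like the factory below. This is a sketch only: the writer classes and their add_page method are stand-ins, not the real Nutch 0.7 IWebDBWriter/IWebDBReader signatures.

```python
# Two hypothetical webdb writer backends behind one minimal interface.
class FileWebDBWriter:
    def add_page(self, url):
        return f"file backend: appending {url} to webdb"

class MySQLWebDBWriter:
    def add_page(self, url):
        # A real backend would go through a DB driver; this just shows
        # the statement it would issue.
        return f"mysql backend: INSERT INTO pages (url) VALUES ('{url}')"

def make_writer(args):
    """Instantiate the db backend named on the command line,
    defaulting to the file-based webdb."""
    return MySQLWebDBWriter() if "-mysql" in args else FileWebDBWriter()

writer = make_writer(["updatedb", "-mysql"])
print(writer.add_page("http://example.com/"))
```

The "few if statements here and there" would be calls to a factory like this in each tool, which is why the change touches many files but stays shallow in each one.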
_
Live Search Maps – find all the local information you need, right when you need 
it.
http://maps.live.com/?icid=wlmtag2FORM=MGAC01

Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-13 Thread Arun Kaundal

Actually the Nutch people are kind of autocratic; don't expect more from them.
They do what they have decided. I am waiting for a really stable product with
incremental indexing, which detects and adds/removes pages as soon as they are
added/removed. But they don't want to do this, and I don't know why. What is
their mission? If we join together to implement this, it would be better. I
can work on this as a weekend project.
Ping me if you want.


On 4/13/07, Howie Wang [EMAIL PROTECTED] wrote:


I definitely don't expect people to write it just because it happens to be
useful to me :-) Call me crazy, but I'm thinking of implementing this when
I get some free time (whenever that will be). It seems that I would just
need to implement IWebDBWriter and IWebDBReader, and then add a command-line
option to the tools (something like -mysql) to specify the type of db to
instantiate. It would affect about 15 files, but the tools changes would be
simple -- a few if statements here and there. Does that sound right?

Howie
_
Live Search Maps – find all the local information you need, right when you
need it.
http://maps.live.com/?icid=wlmtag2FORM=MGAC01


Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-13 Thread Doug Cutting

Arun Kaundal wrote:

Actually the Nutch people are kind of autocratic; don't expect more from them.
They do what they have decided.


Have you submitted patches that have been ignored or rejected?

Each Nutch contributor indeed does what he or she decides.  Nutch is not 
a service organization that implements every feature that someone 
requests.  It is a collaborative project of volunteers.  Each 
contributor adds things they need, and others share the benefits.



I am waiting for a really stable product with
incremental indexing, which detects and adds/removes pages as soon as they are
added/removed. But they don't want to do this, and I don't know why.


Perhaps because this is difficult, especially while still supporting 
large crawls.  But if others don't want to implement this, I encourage 
you to try to implement it, and, if you succeed, contribute it back to 
the project.  That's the way Nutch grows.



What is their mission? If we join together to implement this, it would be
better. I can work on this as a weekend project.
Ping me if you want.


You can of course fork Nutch, or start a new project from scratch.  But 
you ought to also consider submitting patches to Nutch, working with 
other contributors to solve your problems here before abandoning 
Nutch in favor of another project.


Cheers,

Doug


Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-13 Thread Andrzej Bialecki

Howie Wang wrote:

I definitely don't expect people to write it just because it happens
to be useful to me :-) Call me crazy, but I'm thinking of
implementing this when I get some free time (whenever that will be).
It seems that I would just need to implement IWebDBWriter and
IWebDBReader, and then add a command-line option to the tools
(something like -mysql) to specify the type of db to instantiate. It
would affect about 15 files, but the tools changes would be simple
-- a few if statements here and there. Does that sound right? Howie


You are talking about the codebase from branch 0.7. This branch is not 
under active development. The current codebase is very different - it 
uses the MapReduce framework to process data in a distributed fashion.


So, there is no single interface for writing the CrawlDb. There is one 
class for reading the CrawlDb, but usually the data in the DB is used 
not standalone, but as one of many inputs to a map-reduce job.


To summarize - I think it would be very difficult to do this with the 
current codebase.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-13 Thread Howie Wang
Thanks for the input, Andrzej. Yes, I'm still working off of 0.7. I might
still try it since I'm not planning on upgrading for a while, but it sounds
like it's not going to port to the current versions.

Howie
_
Your friends are close to you. Keep them that way.
http://spaces.live.com/signup.aspx

Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-12 Thread wangxu
Has anybody thought of replacing CrawlDb with any kind of relational
DB, MySQL for example?

The CrawlDb is so difficult to manipulate.
I often need to edit several entries in the CrawlDb,
but that costs too much time waiting for MapReduce.


Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-12 Thread Nuther
Hi, wangxu.

You wrote on 13 April 2007, 1:03:31:

 Has anybody thought of replacing CrawlDb with any kind of relational
 DB, MySQL for example?

 The CrawlDb is so difficult to manipulate.
 I often need to edit several entries in the CrawlDb,
 but that costs too much time waiting for MapReduce.
You think MySQL would give you higher speed? :)
Just try DataPark Search with a large number of URLs :)
and you will see the difference ;)





Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-12 Thread Andrzej Bialecki

wangxu wrote:

Has anybody thought of replacing CrawlDb with any kind of relational
DB, MySQL for example?

The CrawlDb is so difficult to manipulate.
I often need to edit several entries in the CrawlDb,
but that costs too much time waiting for MapReduce.


Please make the following test using your favorite relational DB:

* create a table with 300 mln rows and 10 columns of mixed type

* select 1 mln rows, sorted by some value

* update 1 mln rows to different values

If you find that these operations take less time than with the current 
crawldb then we will have to revisit this issue. :)
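A scaled-down sketch of that test, with sqlite3 standing in for "your favorite relational DB" and the row counts reduced by three orders of magnitude (300,000 rows, 1,000-row select and update). Multiply the numbers back up before drawing conclusions; the point is that sorted selects and bulk updates at Andrzej's scale are exactly what relational databases struggle with.

```python
import sqlite3, time, random

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE crawldb (url TEXT PRIMARY KEY, score REAL, status INTEGER)"
)
# Populate 300,000 rows (Andrzej asks for 300 mln).
conn.executemany(
    "INSERT INTO crawldb VALUES (?, ?, ?)",
    ((f"http://example.com/page{i}", random.random(), 0)
     for i in range(300_000)),
)
conn.commit()

# Select the top 1,000 rows sorted by some value.
t0 = time.time()
top = conn.execute(
    "SELECT url FROM crawldb ORDER BY score DESC LIMIT 1000"
).fetchall()
print(f"sorted select: {time.time() - t0:.2f}s")

# Update 1,000 rows to different values.
t0 = time.time()
conn.execute("UPDATE crawldb SET status = 1 WHERE rowid <= 1000")
conn.commit()
print(f"update: {time.time() - t0:.2f}s")
```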



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-12 Thread Sami Siren
wangxu wrote:
 Has anybody thought of replacing CrawlDb with any kind of relational
 DB, MySQL for example?

 The CrawlDb is so difficult to manipulate.
 I often need to edit several entries in the CrawlDb,
 but that costs too much time waiting for MapReduce.
 

Once, when I was young and restless, I went down the path with a
relational db. It kind of worked with a few million records. I am not
trying to do it anymore.

Perhaps your problem is that you process too few records at a time?
Quite often I see examples where people fetch a few hundred or a few
thousand pages at a time. That might be a good amount for small crawls, but
if your goal is bigger you need bigger segments to get there.

--
 Sami Siren




Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-12 Thread Dennis Kubes



Andrzej Bialecki wrote:

wangxu wrote:

Has anybody thought of replacing CrawlDb with any kind of relational
DB, MySQL for example?

The CrawlDb is so difficult to manipulate.
I often need to edit several entries in the CrawlDb,
but that costs too much time waiting for MapReduce.


Please make the following test using your favorite relational DB:

* create a table with 300 mln rows and 10 columns of mixed type

* select 1 mln rows, sorted by some value

* update 1 mln rows to different values

If you find that these operations take less time than with the current 
crawldb then we will have to revisit this issue. :)


That is so funny.





RE: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-12 Thread Howie Wang
Please make the following test using your favorite relational DB:

* create a table with 300 mln rows and 10 columns of mixed type
* select 1 mln rows, sorted by some value
* update 1 mln rows to different values

If you find that these operations take less time than with the current
crawldb then we will have to revisit this issue. :)

That is so funny.


I think the original question and the above answer show the big difference
in the ways that Nutch is being used. For a small niche search engine with
fewer than a few million pages, it would probably be performant to use a
relational DB. I have a webdb with 5 million records, and usually fetch 20k
pages at a time. It takes me about 1 hour to do an updatedb. To inject just
a few dozen new URLs takes about 20 minutes. On a relational DB, I know the
injecting would be *much* faster, and I think the updatedb step would be also.

Also, for smaller engines the raw throughput doesn't matter as much, and
other considerations like robustness and flexibility can be more important.
With a relational DB, I could recover from a crashed crawl with a simple SQL
update, or remove a set of bogus URLs from the db just as easily. Now, when I
want to tweak the webdb in an unanticipated way, I have to write a custom
piece of Java to do it.

Just thought I'd throw in a perspective from a niche search guy.

Howie
_
Your friends are close to you. Keep them that way.
http://spaces.live.com/signup.aspx

RE: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-12 Thread Howie Wang
Sorry about the previous crappily formatted message. In brief, my point was
that a relational DB might perform better for small niche users, and plus you
get the flexibility of SQL. No more writing custom code to tweak the webdb.

Howie
_
Live Search Maps – find all the local information you need, right when you need 
it.
http://maps.live.com/?icid=wlmtag2FORM=MGAC01