Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?
Andrzej Bialecki wrote: Howie Wang wrote: I definitely don't expect people to write it just because it happens to be useful to me :-) Call me crazy, but I'm thinking of implementing this when I get some free time (whenever that will be). It seems that I would just need to implement IWebDBWriter and IWebDBReader, and then add a command-line option to the tools (something like -mysql) to specify the type of db to instantiate. It would affect about 15 files, but the tools changes would be simple -- a few if statements here and there. Does that sound right? Howie

You are talking about the codebase from branch 0.7. This branch is not under active development. The current codebase is very different - it uses the MapReduce framework to process data in a distributed fashion. So, there is no single interface for writing the CrawlDb. There is one class for reading the CrawlDb, but usually the data in the DB is used not standalone, but as one of many inputs to a map-reduce job. To summarize - I think it would be very difficult to do this with the current codebase.

My URLs number at most about 1,000,000 per site, so perhaps I can run some tests and go ahead with the idea. Based on 0.9, it seems the simplest way to achieve it is this: for any MapReduce job associated with the CrawlDb, I would add operations like these: read the relational DB to generate a temporary CrawlDb as the CrawlDb input path, then read the job-generated CrawlDb to update the relational DB. Is that right?
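The two bridge steps wangxu proposes (export the relational DB into a temporary CrawlDb-style input before the job runs, then re-import the job's output afterward) can be sketched as below. This is an illustration only: the single-table schema, the CSV file standing in for a CrawlDb segment, and all function names are assumptions, not Nutch APIs, and the "job" here is a trivial stand-in for a real MapReduce pass.

```python
import csv
import os
import sqlite3
import tempfile

def export_to_tmp_crawldb(conn, path):
    # Step 1: dump the relational DB into a temporary "CrawlDb" input file.
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        for url, status, score in conn.execute("SELECT url, status, score FROM crawldb"):
            w.writerow([url, status, score])

def run_job(in_path, out_path):
    # Stand-in for the MapReduce job: here it just bumps every score.
    with open(in_path) as f, open(out_path, "w", newline="") as g:
        w = csv.writer(g)
        for url, status, score in csv.reader(f):
            w.writerow([url, status, float(score) + 1.0])

def import_job_output(conn, path):
    # Step 2: read the job-generated "CrawlDb" back into the relational DB.
    with open(path) as f:
        for url, status, score in csv.reader(f):
            conn.execute("UPDATE crawldb SET status = ?, score = ? WHERE url = ?",
                         (status, float(score), url))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawldb (url TEXT PRIMARY KEY, status TEXT, score REAL)")
conn.execute("INSERT INTO crawldb VALUES ('http://example.com/', 'fetched', 1.0)")

tmp = tempfile.mkdtemp()
in_path, out_path = os.path.join(tmp, "in.csv"), os.path.join(tmp, "out.csv")
export_to_tmp_crawldb(conn, in_path)
run_job(in_path, out_path)
import_job_output(conn, out_path)
print(conn.execute("SELECT score FROM crawldb").fetchone()[0])  # 2.0
```

The catch Andrzej points at is the export/import itself: at large scale those two passes cost as much as the MapReduce job they wrap, which is why this only looks attractive for small crawls.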
Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?
Howie Wang wrote: Sorry about the previous crappily formatted message. In brief, my point was that a relational DB might perform better for small niche users, and you also get the flexibility of SQL. No more writing custom code to tweak the webdb. Howie

Generally speaking, I agree that it would be a good option to have, especially for smaller setups - but it would require extensive modifications to many tools in Nutch. Unless you are willing to provide patches that implement it without breaking the large-scale case, I think we should let the matter rest ...

-- Best regards, Andrzej Bialecki. Information Retrieval, Semantic Web; Embedded Unix, System Integration. http://www.sigram.com Contact: info at sigram dot com
RE: Has anybody thought of replacing CrawlDb with any kind of relational DB?
I definitely don't expect people to write it just because it happens to be useful to me :-) Call me crazy, but I'm thinking of implementing this when I get some free time (whenever that will be). It seems that I would just need to implement IWebDBWriter and IWebDBReader, and then add a command-line option to the tools (something like -mysql) to specify the type of db to instantiate. It would affect about 15 files, but the tools changes would be simple -- a few if statements here and there. Does that sound right? Howie
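The shape of the change Howie describes (a command-line switch selecting which webdb implementation the tools instantiate) can be sketched as below. It is written in Python for brevity, not in Nutch's Java; the class names loosely echo the 0.7 `IWebDBWriter` interface, and the MySQL-backed variant is entirely hypothetical.

```python
import argparse

class FileWebDBWriter:
    """Stand-in for the default file-based webdb writer."""
    def add_page(self, url):
        print(f"file db: adding {url}")

class MysqlWebDBWriter:
    """Hypothetical writer backed by MySQL (not a real Nutch class)."""
    def add_page(self, url):
        print(f"mysql db: adding {url}")

def make_writer(args):
    # The "few if statements here and there" in each of the ~15 tools
    # reduce to one factory call once the choice is centralized.
    return MysqlWebDBWriter() if args.mysql else FileWebDBWriter()

parser = argparse.ArgumentParser()
parser.add_argument("-mysql", action="store_true",
                    help="store the webdb in MySQL instead of files")
args = parser.parse_args(["-mysql"])  # example invocation

writer = make_writer(args)
writer.add_page("http://example.com/")  # prints "mysql db: adding http://example.com/"
```

As Andrzej notes downthread, this only maps cleanly onto 0.7, where a single reader/writer interface pair exists to swap out.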
Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?
Actually the Nutch people are kind of autocratic; don't expect more from them. They do what they have decided. I am waiting for a really stable product with incremental indexing, which detects and adds/removes pages as soon as they are added/removed. But they don't want to do this, and I don't know why. What is their mission? If we join together to implement this, it would be better. I can work on this as a weekend project. Ping me if you want.

On 4/13/07, Howie Wang [EMAIL PROTECTED] wrote: I definitely don't expect people to write it just because it happens to be useful to me :-) Call me crazy, but I'm thinking of implementing this when I get some free time (whenever that will be). It seems that I would just need to implement IWebDBWriter and IWebDBReader, and then add a command-line option to the tools (something like -mysql) to specify the type of db to instantiate. It would affect about 15 files, but the tools changes would be simple -- a few if statements here and there. Does that sound right? Howie
Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?
Arun Kaundal wrote: Actually nutch people are kind of autocrate., don't expect more from them They do what they have decided

Have you submitted patches that have been ignored or rejected? Each Nutch contributor indeed does what he or she decides. Nutch is not a service organization that implements every feature that someone requests. It is a collaborative project of volunteers. Each contributor adds things they need, and others share the benefits.

I am waiting really stable product with incremental indexing, which detect and add/remove pages as soon as they added/removed. But they don't want to this, i don't know why ?

Perhaps because this is difficult, especially while still supporting large crawls. But if others don't want to implement this, I encourage you to try to implement it, and, if you succeed, contribute it back to the project. That's the way Nutch grows.

what is there mission ? If we join together to implement this, it would be better. I can work on this as weekend project. ping me, if u want

You can of course fork Nutch, or start a new project from scratch. But you ought to also consider submitting patches to Nutch, working with other contributors to solve your problems here before abandoning Nutch in favor of another project. Cheers, Doug
Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?
Howie Wang wrote: I definitely don't expect people to write it just because it happens to be useful to me :-) Call me crazy, but I'm thinking of implementing this when I get some free time (whenever that will be). It seems that I would just need to implement IWebDBWriter and IWebDBReader, and then add a command-line option to the tools (something like -mysql) to specify the type of db to instantiate. It would affect about 15 files, but the tools changes would be simple -- a few if statements here and there. Does that sound right? Howie

You are talking about the codebase from branch 0.7. This branch is not under active development. The current codebase is very different - it uses the MapReduce framework to process data in a distributed fashion. So, there is no single interface for writing the CrawlDb. There is one class for reading the CrawlDb, but usually the data in the DB is used not standalone, but as one of many inputs to a map-reduce job. To summarize - I think it would be very difficult to do this with the current codebase.

-- Best regards, Andrzej Bialecki. http://www.sigram.com
RE: Has anybody thought of replacing CrawlDb with any kind of relational DB?
Thanks for the input, Andrzej. Yes, I'm still working off of 0.7. I might still try it since I'm not planning on upgrading for a while, but it sounds like it's not going to port to the current versions. Howie
Has anybody thought of replacing CrawlDb with any kind of relational DB?
Has anybody thought of replacing CrawlDb with any kind of relational DB, MySQL for example? The CrawlDb is so difficult to manipulate. I often need to edit several entries in the CrawlDb, but that costs too much time waiting for the MapReduce jobs.
Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?
Hi, wangxu. You wrote on April 13, 2007, at 1:03:31: Has anybody thought of replacing CrawlDb with any kind of relational DB, MySQL for example? The CrawlDb is so difficult to manipulate. I often need to edit several entries in the CrawlDb, but that costs too much time waiting for the MapReduce jobs.

You think MySQL would give you higher speed? :) Just try DataPark Search with a large number of urls :) and you will see the difference ;)
Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?
wangxu wrote: Have anybody thought of replacing CrawlDb with any kind of Rational DB,mysql,for example? Crawldb is so difficult to manipulate. I often have the requirements to edit several entries in crawdb; But that would cost too much waiting for the mapReduce.

Please make the following test using your favorite relational DB:
* create a table with 300 million rows and 10 columns of mixed type
* select 1 million rows, sorted by some value
* update 1 million rows to different values

If you find that these operations take less time than with the current crawldb then we will have to revisit this issue. :)

-- Best regards, Andrzej Bialecki. http://www.sigram.com
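Andrzej's thought experiment can actually be run at small scale. The sketch below uses SQLite (for self-containment; the thread discusses MySQL) and shrinks both numbers by a factor of 1,000: 300,000 rows instead of 300 million, touching 1,000 instead of 1 million. The column names loosely mimic CrawlDatum fields and are illustrative only.

```python
import random
import sqlite3
import time

N_ROWS = 300_000   # scaled down from 300 million
N_TOUCH = 1_000    # scaled down from 1 million

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A table with 10 columns of mixed type, loosely modeled on a crawl record.
cur.execute("""
    CREATE TABLE crawldb (
        url TEXT PRIMARY KEY,
        status INTEGER,
        fetch_time INTEGER,
        retries INTEGER,
        fetch_interval INTEGER,
        score REAL,
        signature BLOB,
        modified_time INTEGER,
        protocol TEXT,
        metadata TEXT
    )
""")

rng = random.Random(42)
cur.executemany(
    "INSERT INTO crawldb VALUES (?,?,?,?,?,?,?,?,?,?)",
    ((f"http://example.com/page{i}", rng.randint(0, 5), i, 0, 86400,
      rng.random(), bytes(16), i, "http", "") for i in range(N_ROWS)),
)
conn.commit()

# Select N_TOUCH rows, sorted by some value (here: score).
t0 = time.perf_counter()
top = cur.execute(
    "SELECT url, score FROM crawldb ORDER BY score DESC LIMIT ?",
    (N_TOUCH,),
).fetchall()
t_select = time.perf_counter() - t0

# Update the same N_TOUCH rows to different values.
t0 = time.perf_counter()
cur.execute(
    "UPDATE crawldb SET status = 9 WHERE url IN "
    "(SELECT url FROM crawldb ORDER BY score DESC LIMIT ?)",
    (N_TOUCH,),
)
conn.commit()
t_update = time.perf_counter() - t0

print(f"select: {t_select:.3f}s, update: {t_update:.3f}s "
      f"({cur.rowcount} rows changed)")
```

At this scale the relational DB does fine, which is Howie's point downthread; Andrzej's point is the other end, where random B-tree updates over hundreds of millions of rows lose to the sequential rewrite the MapReduce CrawlDb does.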
Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?
wangxu wrote: Have anybody thought of replacing CrawlDb with any kind of Rational DB,mysql,for example? Crawldb is so difficult to manipulate. I often have the requirements to edit several entries in crawdb; But that would cost too much waiting for the mapReduce.

Once when I was young and restless I went down the relational DB path. It kind of worked with a few million records. I am not trying to do it anymore. Perhaps your problem is that you process too few records at a time? Quite often I see examples where people fetch a few hundred or a few thousand pages at a time. That might be a good amount for small crawls, but if your goal is bigger you need bigger segments to get there. -- Sami Siren
Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?
Andrzej Bialecki wrote: wangxu wrote: Have anybody thought of replacing CrawlDb with any kind of Rational DB,mysql,for example? Crawldb is so difficult to manipulate. I often have the requirements to edit several entries in crawdb; But that would cost too much waiting for the mapReduce.

Please make the following test using your favorite relational DB:
* create a table with 300 million rows and 10 columns of mixed type
* select 1 million rows, sorted by some value
* update 1 million rows to different values

If you find that these operations take less time than with the current crawldb then we will have to revisit this issue. :)

That is so funny.
RE: Has anybody thought of replacing CrawlDb with any kind of relational DB?
Please make the following test using your favorite relational DB: * create a table with 300 million rows and 10 columns of mixed type * select 1 million rows, sorted by some value * update 1 million rows to different values. If you find that these operations take less time than with the current crawldb then we will have to revisit this issue. :) That is so funny.

I think the original question and the above answer show the big difference in the ways that Nutch is being used. For a small niche search engine with fewer than a few million pages, it would probably be performant to use a relational DB. I have a webdb with 5 million records, and usually fetch 20k pages at a time. It takes me about 1 hour to do an updatedb. To inject just a few dozen new urls takes about 20 minutes. On a relational DB, I know the injecting would be *much* faster, and I think the updatedb step would be also.

Also for smaller engines, the raw throughput doesn't matter as much, and other considerations like robustness and flexibility could be more important. With a relational DB, I could recover from a crashed crawl with a simple SQL update. Or I could remove a set of bogus URLs from the db just as easily. Now when I want to tweak the webdb in an unanticipated way, I have to write a custom piece of Java to do it.

Just thought I'd throw in a perspective from a niche search guy. Howie
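Howie's two maintenance examples, recovering from a crashed crawl and purging bogus URLs, would each reduce to one SQL statement. A minimal sketch, assuming a hypothetical single-table schema (Nutch's real CrawlDb is a set of MapFiles of CrawlDatum records, not SQL, which is exactly why today this needs custom Java):

```python
import sqlite3

# Hypothetical schema and status values, for illustration only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE crawldb (url TEXT PRIMARY KEY, status TEXT, fetch_time INTEGER)")
cur.executemany("INSERT INTO crawldb VALUES (?,?,?)", [
    ("http://example.com/a", "fetching", 100),   # left dangling by a crashed crawl
    ("http://example.com/b", "fetched", 100),
    ("http://spam.example/x", "unfetched", 0),   # bogus URL to purge
])

# Recover from a crashed crawl: anything still marked in-flight goes
# back to 'unfetched' so the next generate/fetch cycle picks it up.
cur.execute("UPDATE crawldb SET status = 'unfetched' WHERE status = 'fetching'")

# Remove a set of bogus URLs with a pattern match.
cur.execute("DELETE FROM crawldb WHERE url LIKE 'http://spam.example/%'")
conn.commit()

print(cur.execute("SELECT url, status FROM crawldb ORDER BY url").fetchall())
# [('http://example.com/a', 'unfetched'), ('http://example.com/b', 'fetched')]
```

The equivalent tweak against the MapReduce CrawlDb means writing a job that reads every record, transforms the matching ones, and rewrites the whole db, which is the flexibility gap Howie is describing.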
RE: Has anybody thought of replacing CrawlDb with any kind of relational DB?
Sorry about the previous crappily formatted message. In brief, my point was that a relational DB might perform better for small niche users, and you also get the flexibility of SQL. No more writing custom code to tweak the webdb. Howie