Re: read crawldb.

2008-01-30 Thread Andrzej Bialecki
Siddhartha Reddy wrote: On looking further I think it might be possible to get the content given a URL, but there is no existing class in nutch that can do this. Have a look at the CrawlDbReader code (particularly the 'readUrl' function), you will want to do something similar. This is not the c

Re: read crawldb.

2008-01-30 Thread Siddhartha Reddy
On looking further, I think it might be possible to get the content given a URL, but there is no existing class in Nutch that can do this. Have a look at the CrawlDbReader code (particularly the 'readUrl' function); you will want to do something similar. If you want to write a mapred job that will

Re: read crawldb.

2008-01-30 Thread nadav hashimshony
Thanks. Any idea how I can do this? Nadav. On Jan 30, 2008 2:17 PM, Siddhartha Reddy <[EMAIL PROTECTED]> wrote: > The HTML content is not stored in the crawldb; crawldb only contains the > metadata about the URLs. > > The HTML content is stored in the 'content' directory of the segment where > the

Re: read crawldb.

2008-01-30 Thread Siddhartha Reddy
The HTML content is not stored in the crawldb; the crawldb only contains the metadata about the URLs. The HTML content is stored in the 'content' directory of the segment where the particular page was fetched. As far as I know, there is no simple way to access the content of a page, given the URL.
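
To make the split described above concrete, here is a minimal, self-contained sketch of the lookup idea. It deliberately does not use the Nutch API: the class `SegmentContentLookup`, the method `findContent`, the segment names, and the sample URLs are all hypothetical stand-ins. The crawldb is modelled as a map from URL to status, and each segment's 'content' directory as a map from URL to raw HTML. The sketch assumes segment names are timestamps, so that sorting them puts the most recent fetch last.

```java
import java.util.Map;
import java.util.TreeMap;

public class SegmentContentLookup {

    /**
     * Hypothetical helper: scans segments from newest to oldest and returns
     * the first stored page body for the URL, or null if no segment has it.
     * Keys of 'segments' are segment names (assumed to be sortable timestamps);
     * values model each segment's 'content' directory as URL -> HTML.
     */
    static String findContent(TreeMap<String, Map<String, String>> segments, String url) {
        for (Map<String, String> content : segments.descendingMap().values()) {
            String html = content.get(url);
            if (html != null) {
                return html;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/";

        // Stand-in for the crawldb: metadata only, never the page body.
        Map<String, String> crawlDb = new TreeMap<>();
        crawlDb.put(url, "db_fetched");

        // Stand-ins for two segments that both fetched the same URL.
        Map<String, String> older = new TreeMap<>();
        older.put(url, "<html>first fetch</html>");
        Map<String, String> newer = new TreeMap<>();
        newer.put(url, "<html>second fetch</html>");

        TreeMap<String, Map<String, String>> segments = new TreeMap<>();
        segments.put("20080129101500", older);
        segments.put("20080130101500", newer);

        System.out.println("crawldb status: " + crawlDb.get(url));
        System.out.println("content:        " + findContent(segments, url));
    }
}
```

In a real installation the maps would be replaced by reads against the crawldb and the segment directories on disk; the point of the sketch is only that the status lookup and the content lookup go to different places.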

Re: read crawldb.

2008-01-30 Thread nadav hashimshony
Another small question. The following command: dbr.get("/home/nadav/workarea/nutch-0.9/CRAWLDB/crawldb", "http://www.howtofly.com", conf); returns only this data: Version: 5 Status: 2 (db_fetched) Fetch time: Fri Feb 29 10:57:43 IST 2008 Modified time: Thu Jan 01 02:00:00 IST

Re: Reg: Nutch Admin GUI

2008-01-30 Thread Andrzej Bialecki
Prafulla wrote: Hi, We are looking to develop an Admin GUI for the Nutch Project. We came across this page http://wiki.apache.org/nutch/NutchAdministrationUserInterface but most of the links for download and screen shots given there are outdated. Can we please know the current status of the work