Re: nutch protocol-file

2006-09-02 Thread Thomas Delnoij
Just add scoring-opic to your plugin.includes in nutch-site.xml. Rgrds, Thomas On 9/1/06, Cam Bazz [EMAIL PROTECTED] wrote: Hello, I wanted to index my files so I followed the instructions at http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch I get: Exception in
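The advice above amounts to a one-line change to nutch-site.xml. The value below is a sketch only: the exact plugin list is an assumption and should be copied from the plugin.includes default in your own nutch-default.xml, with scoring-opic appended (protocol-file appears here because the thread concerns filesystem crawling):

```xml
<property>
  <name>plugin.includes</name>
  <!-- illustrative list: copy your existing default value and append scoring-opic -->
  <value>protocol-file|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|scoring-opic</value>
</property>
```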

Re: Best performance approach for single MP machine?

2006-07-25 Thread Thomas Delnoij
Hi Doug, is it possible you could post your hadoop-site.xml? I would like to accomplish the same. Rgrds. Thomas On 7/21/06, Doug Cook [EMAIL PROTECTED] wrote: Thanks, Håvard (and Doug, in the original email). Those pointers, plus a few other tips from elsewhere, did the trick. I'm now up

Re: Null pointer error when perform search

2006-07-25 Thread Thomas Delnoij
Eric, you should set the searcher.dir property in nutch-site.xml to point to the crawl directory. See nutch-default.xml for an explanation of this config property. Rgrds, Thomas On 7/22/06, Eric Wu [EMAIL PROTECTED] wrote: Hi, I am new to Nutch and I got a null pointer exception when I try
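As described, the property goes in nutch-site.xml; the path below is hypothetical and should point at your own crawl output directory:

```xml
<property>
  <name>searcher.dir</name>
  <!-- hypothetical path: point this at the directory produced by your crawl -->
  <value>/home/nutch/crawl</value>
</property>
```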

Re: Why would a record be in the database but not show up in the results?

2006-07-25 Thread Thomas Delnoij
Matt, it's the index that is used for searching, not the webdb. What is the status of these pages in the webdb? Likely they are not fetched yet (DB_UNFETCHED), and thus can never be in your index. These articles give a very nice basic explanation of different concepts:

Re: Links

2006-07-25 Thread Thomas Delnoij
There's the 'nutch readdb' command - [EMAIL PROTECTED]:~ nutch readdb
Usage: CrawlDbReader crawldb (-stats | -dump out_dir | -topN out_dir [min] | -url url)
crawldb directory name where crawldb is located
-stats print overall statistics to System.out
-dump out_dir

Re: Injecting Into Intranet Crawl

2006-07-25 Thread Thomas Delnoij
For stuff like this it's best to use the whole-web crawling concepts as explained in the tutorial. Rgrds, Thomas On 7/25/06, Robert Sanford [EMAIL PROTECTED] wrote: I'm running version 0.7.2 and I'm using the Intranet crawl where I specify a list of site root URIs in a text file along with a list of regex for

Re: Date indexed in index-more?

2006-02-15 Thread Thomas Delnoij
As far as I can tell from the src (0.7.1), it is either calculated from the last-modified metadata property, or, when it is not available, from the fetchDate. See org.apache.nutch.indexer.more.MoreIndexingFilter. This also answers my own question
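The fallback described above can be sketched in a few lines. This is not the actual MoreIndexingFilter source; the method names and the standalone framing are illustrative:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DateFieldSketch {
    // Prefer the Last-Modified value; fall back to the fetch date,
    // mirroring the behavior described for MoreIndexingFilter.
    static Date resolveDate(Date lastModified, Date fetchDate) {
        return (lastModified != null) ? lastModified : fetchDate;
    }

    public static void main(String[] args) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        Date fetch = new Date(0L); // stands in for the fetch time
        // No Last-Modified metadata available: the fetch date is used.
        System.out.println(fmt.format(resolveDate(null, fetch)));
    }
}
```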

Re: intranet crawl update

2006-02-14 Thread Thomas Delnoij
I will try to answer your questions. If I am wrong, I am sure one of the more experienced developers can correct me ...:) - How do I update/refresh the index? There is no explanation or example about the intranet crawl! The main index (in crawldir/index) is updated by the CrawlTool after every

Re: index content within metatag only

2006-02-14 Thread Thomas Delnoij
I think the http://wiki.apache.org/nutch/WritingPluginExample tutorial shows how to implement the Filter - you would be filtering the 'content' metatag instead of the 'recommended'. Then it is up to you what other Filters you enable/disable. Also look at the

Date first indexed

2006-02-13 Thread Thomas Delnoij
I have worked through the WritingPluginExample (http://wiki.apache.org/nutch/WritingPluginExample) example. Now I am wondering if the following makes any sense. I would like to store the date (mmdd) the first time a Page was added to the Index. I thought I could create a plugin that would add a
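The idea above - stamp a page with the date it first enters the index, and never overwrite it on re-index - can be sketched as follows. The field name and the yyyyMMdd format are assumptions, and a plain map stands in for an index document:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import java.util.TimeZone;

public class FirstIndexedSketch {
    static final String FIELD = "dateFirstIndexed"; // hypothetical field name

    // Set the field only if the document does not already carry it.
    static void stamp(Map<String, String> doc, Date now) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        doc.putIfAbsent(FIELD, fmt.format(now));
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        stamp(doc, new Date(0L));          // first indexing run
        stamp(doc, new Date(86_400_000L)); // later re-index: no overwrite
        System.out.println(doc.get(FIELD)); // still the first date
    }
}
```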

Re: Duplicate urls in urls file

2006-02-13 Thread Thomas Delnoij
If the url is already in WebDB, it will not be added again. (WebDBInjector calls WebDBWriter.addPageIfNotPresent(page)). Rgrds, Thomas On 2/13/06, Hasan Diwan [EMAIL PROTECTED] wrote: I've written a perl script to build up a urls file to crawl from RSS feeds. Will nutch handle duplicate
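The add-if-not-present behavior can be illustrated with a plain set of URLs; this is a stand-in for the WebDB, not the actual WebDBWriter code:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class InjectSketch {
    // Returns true only when the URL was not already known,
    // mirroring the addPageIfNotPresent idea described above.
    static boolean addIfNotPresent(Set<String> db, String url) {
        return db.add(url);
    }

    public static void main(String[] args) {
        Set<String> db = new LinkedHashSet<>();
        System.out.println(addIfNotPresent(db, "http://example.com/"));
        System.out.println(addIfNotPresent(db, "http://example.com/")); // duplicate, ignored
        System.out.println(db.size());
    }
}
```

So a urls file with duplicates injects each URL only once.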

Re: MD5Hash

2006-01-20 Thread Thomas Delnoij
Jack Tang [EMAIL PROTECTED] wrote: Hi Thomas I suppose the only unique key of contents in web db is the page's url. So why not retrieve the content by url directly? /Jack On 1/8/06, Thomas Delnoij [EMAIL PROTECTED] wrote: I am working with Nutch 0.7.1. As far as I understand the current

Re: MD5Hash

2006-01-18 Thread Thomas Delnoij
Maybe one of the other developers can answer my question as well? I want to know if I only have to change the Fetcher ( org.apache.nutch.fetcher.Fetcher), lines 236-240, to accomplish unique MD5Hash for each Page based on their URL. Thanks in advance, Thomas D. On 1/15/06, Thomas Delnoij

Re: MD5Hash

2006-01-15 Thread Thomas Delnoij
at 22:14, Thomas Delnoij wrote: I am working with Nutch 0.7.1. As far as I understand the current implementation (please correct me if I am wrong), the MD5Hash is calculated based on the Pages' content. Pages with the same content but identified by different URLs, share the same
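To make the hash unique per URL rather than per content, one would feed the URL bytes to MD5 instead of the page body. A standalone sketch of that idea (not the actual Fetcher change from the thread):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class UrlHashSketch {
    // MD5 over the URL string, hex-encoded.
    static String md5(String url) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, d));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available in the JDK
        }
    }

    public static void main(String[] args) {
        // Same content at two URLs now yields two different hashes.
        System.out.println(md5("http://example.com/a"));
        System.out.println(md5("http://example.com/b"));
    }
}
```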

Re: other newbies like me

2006-01-11 Thread Thomas Delnoij
Andy, you need to install the Nutch webapp as the ROOT application of your tomcat installation, as described in the tutorial: http://lucene.apache.org/nutch/tutorial.html Rgrds, Thomas On 1/11/06, Andy Morris [EMAIL PROTECTED] wrote: Okay I used this guy's how-to to install IBM JAVA and

Re: RegexURLFilter / testing regex-urlfilter.txt

2005-11-30 Thread Thomas Delnoij
- regex.jar On 11/29/05, Thomas Delnoij [EMAIL PROTECTED] wrote: For the sake of the archives, I will answer my own question here: I had to add the following line to the bin/nutch script to be able to run org.apache.nutch.net.RegexURLFilter from the command line: CLASSPATH=${CLASSPATH
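The fix described amounts to appending the plugin jar to the CLASSPATH that the bin/nutch script builds. The paths below are illustrative only; they depend on where your checkout lives:

```shell
# hypothetical install location; adjust to your checkout
NUTCH_HOME=/opt/nutch-0.7.1
# line added to the bin/nutch script so RegexURLFilter can be run directly
CLASSPATH=${CLASSPATH}:${NUTCH_HOME}/build/plugins/urlfilter-regex/urlfilter-regex.jar
echo "$CLASSPATH"
```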

Re: How can I know how many pages nutch has fetched?

2005-11-29 Thread Thomas Delnoij
Kumar, you can use the nutch readdb [db_name] -stats command to generate statistics for your WebDB and the nutch segread command for your segments. HTH Thomas Delnoij On 11/29/05, Kumar Limbu [EMAIL PROTECTED] wrote: Hi Everyone, I am new to nutch and I would like to know how can I know how

Re: RegexURLFilter / testing regex-urlfilter.txt

2005-11-29 Thread Thomas Delnoij
overrides the classpath environment variable, so adding the jar there didn't help. Rgrds, Thomas Delnoij On 10/5/05, Thomas Delnoij [EMAIL PROTECTED] wrote: All. The problem is actually a bit different. I was a bit in a hurry when I posted the previous message, apologies. I added both

Re: NDFS / WebDB QUestion

2005-11-18 Thread Thomas Delnoij
for 100,000,000 pages, averaging 10 KB each, I would need up to 2000 GB storage on my datanodes? Thanks for your help. Thomas Delnoij On 11/13/05, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi, On 13 Nov 2005 at 12:58, Thomas Delnoij wrote: I have studied the available documentation and the mailing
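The quoted figure is easy to check. Assuming a replication factor of 2 (the factor is my assumption, not stated in the excerpt), 100 million pages at roughly 10 KB each come out at about 2000 GB:

```java
public class StorageEstimate {
    public static void main(String[] args) {
        long pages = 100_000_000L;   // 100 million pages
        long bytesPerPage = 10_000L; // ~10 KB average, decimal units
        long replication = 2L;       // assumed NDFS replication factor
        long totalGb = pages * bytesPerPage * replication / 1_000_000_000L;
        System.out.println(totalGb + " GB"); // 2000 GB
    }
}
```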

NDFS / WebDB Question

2005-11-13 Thread Thomas Delnoij
) what happens to Pages that cannot be parsed (for instance content-type: image/jpg); are they kept in WebDB or are they removed? Thanks for your help. Nutch is a great tool! - Thomas Delnoij

RegexURLFilter / testing regex-urlfilter.txt

2005-10-05 Thread Thomas Delnoij
for testing the regex-urlfilter. Secondly, I want to tune my regex-urlfilter for maximum relevancy of the crawl result. By now, I have around 50 entries. My second question is if I can expect any performance impact? Your help is greatly appreciated. Kind regards, Thomas Delnoij.
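For testing rules offline, the filter's ordered, first-match-wins semantics ('-' lines reject, '+' lines accept) can be mimicked in a few lines. This is a sketch of how the rule format behaves, not the RegexURLFilter implementation itself, and the sample patterns are illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // Ordered rules: pattern -> accept? The first matching rule decides.
    static Boolean filter(Map<Pattern, Boolean> rules, String url) {
        for (Map.Entry<Pattern, Boolean> r : rules.entrySet()) {
            if (r.getKey().matcher(url).find()) return r.getValue();
        }
        return false; // no rule matched: reject
    }

    public static void main(String[] args) {
        Map<Pattern, Boolean> rules = new LinkedHashMap<>();
        rules.put(Pattern.compile("\\.(gif|jpg|png)$"), false); // like a '-' line
        rules.put(Pattern.compile("^http://"), true);           // like a '+' line
        System.out.println(filter(rules, "http://example.com/logo.gif"));   // rejected
        System.out.println(filter(rules, "http://example.com/index.html")); // accepted
    }
}
```

On the performance question: each candidate URL is checked against the rules in order, so with ~50 entries the cost per URL is small, though rule order matters for both correctness and speed.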

Re: RegexURLFilter / testing regex-urlfilter.txt

2005-10-05 Thread Thomas Delnoij
I was a bit in a hurry when I posted this message, apologies. The problem is actually a bit different. I added both urlfilter-regex.jar and nutch-0.7.1.jar to my classpath. When I run java org.apache.nutch.net.RegexURLFilter, On 10/5/05, Thomas Delnoij [EMAIL PROTECTED] wrote: All. I want

Re: RegexURLFilter / testing regex-urlfilter.txt

2005-10-05 Thread Thomas Delnoij
help is really appreciated. Kind regards, Thomas Delnoij On 10/5/05, Thomas Delnoij [EMAIL PROTECTED] wrote: I was a bit in a hurry when I posted this message, apologies. The problem is actually a bit different. I added both urlfilter-regex.jar and nutch-0.7.1.jar to my classpath. When I run

Re: how long to crawl

2005-07-27 Thread thomas delnoij
I don't think it is atypical, because I had similar effects with crawl depth = 10. Rgrds, Thomas --- blackwater dev [EMAIL PROTECTED] wrote: Over a gig now, 18 hours running and still going...might just have to kill it unless this is typical. On 7/27/05, blackwater dev [EMAIL PROTECTED]