Re: Removing URLs from Web DB

2006-02-18 Thread Andrzej Bialecki
Chris Schneider wrote: Nutch Users, Does anyone have a tool or an easy method for removing URLs matching a certain pattern from the MapReduce crawldb? For example, let's say you've been crawling for a while, and then realize that you're spending a lot of time trying to crawl bogus URLs with
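One way to approach this is to rewrite the crawldb while dropping any entry whose URL matches an unwanted pattern. The standalone Java sketch below shows only the regex decision itself; it is not a Nutch tool, and the class name and pattern are made up for illustration. In practice this test would run inside whatever job rewrites the crawldb.

import java.util.regex.Pattern;

// Hypothetical helper illustrating only the filtering decision.
public class BogusUrlFilter {
    // Example pattern for "bogus" URLs, e.g. calendar traps or session IDs.
    private static final Pattern BOGUS =
        Pattern.compile(".*(calendar|sessionid=|\\?.*\\?).*", Pattern.CASE_INSENSITIVE);

    /** Returns true if the URL should be kept in the crawldb. */
    public static boolean shouldKeep(String url) {
        return !BOGUS.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldKeep("http://example.com/page.html"));          // true
        System.out.println(shouldKeep("http://example.com/calendar?year=1999")); // false
    }
}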

Content-based Crawl vs Link-based Crawl?

2006-02-18 Thread Elwin
Nutch crawls web pages from link to link by extracting outlinks from each page. For example, we could check whether the link text contains keywords from a dictionary to decide whether or not to crawl it. Moreover, we could check whether the content of a page fetched via an outlink contains some keywords
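As an illustration of the link-text check described above, here is a small standalone Java sketch; the class, dictionary, and method names are hypothetical and not part of Nutch.

import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: decide whether to follow an outlink based on its anchor text.
public class AnchorTextFilter {
    private static final List<String> KEYWORDS =
        Arrays.asList("nutch", "crawler", "search"); // example dictionary

    /** Returns true if the anchor text contains any dictionary keyword. */
    public static boolean shouldFollow(String anchorText) {
        String text = anchorText.toLowerCase(Locale.ROOT);
        for (String keyword : KEYWORDS) {
            if (text.contains(keyword)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(shouldFollow("Apache Nutch tutorial")); // true
        System.out.println(shouldFollow("Company picnic photos")); // false
    }
}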

Re: Out of Memory while fetching

2006-02-18 Thread Stefan Groschupf
The log says it: 060215 224057 SEVERE error writing output: java.io.IOException: key out of order: 391009 after 391009. In an ordered array of keys where each key is unique, 391010 should come after 391009. On 17.02.2006 at 14:10, keren nutch wrote: Hi Stefan, Thank you for your reply. I
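For context, the exception is about the invariant that keys must be written in strictly increasing order with no duplicates. The snippet below is a rough standalone illustration of that invariant, not the actual MapReduce writer code.

import java.io.IOException;

// Hypothetical illustration: each appended key must be strictly greater than the last.
public class OrderedKeyWriter {
    private long lastKey = Long.MIN_VALUE;

    public void append(long key) throws IOException {
        if (key <= lastKey) {
            throw new IOException("key out of order: " + key + " after " + lastKey);
        }
        lastKey = key;
    }

    public static void main(String[] args) throws IOException {
        OrderedKeyWriter writer = new OrderedKeyWriter();
        writer.append(391009L);
        writer.append(391010L); // fine: strictly increasing
        writer.append(391010L); // throws: duplicate key, like the message in the log
    }
}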

Re: Link problems with Nutch Web-GUI

2006-02-18 Thread Stefan Groschupf
Maybe this is your problem? Entities.encode(url) On 17.02.2006 at 15:13, Fankhauser, Alain wrote: Hello, I use Nutch 0.8-dev and I'm trying to index a local file system. After indexing I start Tomcat and search. If I do this, I find the expected results, but the links aren't correct. It's
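The hint points at HTML entity encoding of the URL when the result link is written into the page. The following is a hypothetical helper showing the kind of transformation involved; it is not Nutch's Entities class, which may differ in detail.

// Hypothetical sketch of HTML entity encoding as applied to a URL in a link.
public class EntityEncodeDemo {
    static String encode(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        String url = "file:///data/docs/report.html?section=1&page=2";
        // The encoded form is what ends up inside href="..." in the results page.
        System.out.println(encode(url));
    }
}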

Re: search inside lucene-fields

2006-02-18 Thread Stefan Groschupf
This depends on the query filter plugins you are using. As far as I know, the score of a document only increases if the word occurs in the title, but there is no title query filter. However, writing your own is very easy; check the query-site plugin. Stefan On 17.02.2006 at 16:36, Nutch
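By analogy with the query-site plugin mentioned above, a title filter could be a thin subclass of Nutch's FieldQueryFilter. The sketch below is hypothetical: the class and package names are invented, the FieldQueryFilter base class is assumed to behave as in the 0.8-dev codebase, and the plugin would still need a plugin.xml entry and an indexed title field, mirroring how query-site is set up.

// Hypothetical sketch, modeled on the query-site plugin's field filter.
package org.example.nutch.searcher.title;

import org.apache.nutch.searcher.FieldQueryFilter;

public class TitleQueryFilter extends FieldQueryFilter {
    public TitleQueryFilter() {
        // Restrict query terms written as title:foo to the "title" field.
        super("title");
    }
}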