Re: Deleting stale URLs from Nutch/Solr

2009-10-27 Thread Gora Mohanty
On Mon, 26 Oct 2009 17:26:23 +0100 Andrzej Bialecki a...@getopt.org wrote: [...] Stale (no longer existing) URLs are marked with STATUS_DB_GONE. They are kept in Nutch crawldb to prevent their re-discovery (through stale links pointing to these URL-s from other pages). If you really want to

Re: Deleting stale URLs from Nutch/Solr

2009-10-27 Thread Andrzej Bialecki
Gora Mohanty wrote: On Mon, 26 Oct 2009 17:26:23 +0100 Andrzej Bialecki a...@getopt.org wrote: [...] Stale (no longer existing) URLs are marked with STATUS_DB_GONE. They are kept in Nutch crawldb to prevent their re-discovery (through stale links pointing to these URL-s from other pages). If

Nutch in WebSphere

2009-10-27 Thread Joshua J Pavel
I'm very new at this, so forgive my novice questions. I'm trying to install nutch in WebSphere 6.1. While I can see that others have done this before, I've been unsuccessful. I keep getting this error: Error 500: java.lang.Error: java.lang.NoClassDefFoundError: org.apache.jsp._search (wrong

Re: Deleting stale URLs from Nutch/Solr

2009-10-27 Thread Gora Mohanty
On Tue, 27 Oct 2009 07:29:10 +0100 Andrzej Bialecki a...@getopt.org wrote: [...] I assume you mean that the generate step produces no new URL-s to fetch? That's expected, because they become eligible for re-fetching only after Nutch considers them expired, i.e. after the fetchTime +

Re: How to index files only with specific type

2009-10-27 Thread Dmitriy Fundak
If I disable the HTML parser (remove parse-(html from the plugin.includes property), HTML files don't get parsed, so I don't get the outlinks to KML files from the HTML. So I can't parse and index KML files. I might not be right, but I have a feeling that it's not possible without modifying the source code. Thanks

Re: How to index files only with specific type

2009-10-27 Thread Andrzej Bialecki
Dmitriy Fundak wrote: If I disable the HTML parser (remove parse-(html from the plugin.includes property), HTML files don't get parsed, so I don't get the outlinks to KML files from the HTML. So I can't parse and index KML files. I might not be right, but I have a feeling that it's not possible without modifying

Re: How to index files only with specific type

2009-10-27 Thread Dmitriy Fundak
Checking the URL suffix and returning null if it's not one I need helped. Thanks, Andrzej. 2009/10/27 Andrzej Bialecki a...@getopt.org: Dmitriy Fundak wrote: If I disable the HTML parser (remove parse-(html from the plugin.includes property), HTML files don't get parsed, so I don't get the outlinks to KML
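
The check described in this reply (look at the URL suffix, return null for anything unwanted) can be sketched as follows. This is only an illustration in Python, not Nutch plugin code: the snippet does not show where the check was added, and the names `WANTED_SUFFIXES` and `keep_if_wanted` are hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical list of wanted suffixes; the thread only mentions KML files.
WANTED_SUFFIXES = (".kml",)


def keep_if_wanted(url: str, doc: dict) -> Optional[dict]:
    """Return doc unchanged if the URL path has a wanted suffix, else None.

    Returning None stands in for "returning null" in the thread, i.e. the
    document is dropped instead of being indexed.
    """
    # Compare only the path, so a query string does not hide the suffix.
    path = urlparse(url).path.lower()
    return doc if path.endswith(WANTED_SUFFIXES) else None
```

With this shape, HTML pages can still be parsed for their outlinks while only the documents with a wanted suffix survive the final check.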

How to run fetch from local

2009-10-27 Thread saravan.krish
I generated the segments after the crawling process, then downloaded them from the crawldb to my local machine. Below are the four segments I generated and downloaded from the crawldb. Now if I run fetch on these four segments, I get the error below. Please help me run fetch locally.

Nutch indexes less pages, then it fetches

2009-10-27 Thread caezar
Hi All, I've got a strange problem: Nutch indexes far fewer URLs than it fetches. For example, take the URL http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm. I assume that it was fetched successfully because it is mentioned only once in the fetch logs: 2009-10-26 10:01:46,502 INFO

Redirect handling

2009-10-27 Thread caezar
Hi All, I've done some googling, but found different answers, so I would appreciate it if you could tell me which is the correct one: - when a page is redirected, the content of the target page is fetched and associated with the source (initial) page URL - when a page is redirected, a new entry with the redirect target URL

Re: Redirect handling

2009-10-27 Thread Paul Tomblin
There are two different types of redirect. When a web site returns a 301 status (permanent redirect), it means the URL you requested is no longer valid, so don't ask for it again. When it returns a 307 status (temporary redirect), it means keep asking for the URL you asked for, and I'll tell you
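
The distinction drawn in this reply can be sketched in a few lines. This is a minimal illustration of the 301/307 rule as described above, not how Nutch itself records redirects; the function name `url_to_request_next_time` and its arguments are hypothetical.

```python
def url_to_request_next_time(status: int, requested_url: str, location: str) -> str:
    """Pick the URL a client should ask for on its next visit.

    status        -- HTTP status code of the response
    requested_url -- the URL that was originally requested
    location      -- the redirect target given by the server
    """
    if status == 301:
        # Permanent redirect: the old URL is no longer valid, use the target.
        return location
    if status == 307:
        # Temporary redirect: keep asking for the original URL.
        return requested_url
    # Any other status: nothing to change in this sketch.
    return requested_url
```

For example, a 301 from `http://example.com/old` to `http://example.com/new` yields the new URL, while a 307 with the same target yields the old one.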

Nutch in Websphere

2009-10-27 Thread Joshua J Pavel
I'm very new at this, so forgive my novice questions. I'm trying to install nutch in WebSphere 6.1. While I can see that others have done this before, I've been unsuccessful. I keep getting this error: Error 500: java.lang.Error: java.lang.NoClassDefFoundError: org.apache.jsp._search (wrong

ERROR: Checksum Error

2009-10-27 Thread Eric Osgood
This is my second time receiving this error: Map output lost, rescheduling: getMapOutput (attempt_200910271443_0012_m_01_0,0) failed : org.apache.hadoop.fs.ChecksumException: Checksum Error --- Does anyone know why I am getting this error and how to fix it? I tried

Re: Nutch indexes less pages, then it fetches

2009-10-27 Thread 皮皮
Check the parse data first; maybe the parse was unsuccessful. 2009/10/27 caezar caeza...@gmail.com Hi All, I've got a strange problem: Nutch indexes far fewer URLs than it fetches. For example, take the URL http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm. I assume that it

Re: Nutch indexes less pages, then it fetches

2009-10-27 Thread kevin chen
I have had a similar experience. Reinhard Schwab responded with a possible fix. See the mail in this group from Reinhard Schwab at Sun, 25 Oct 2009 10:03:41 +0100 (05:03 EDT). I haven't had a chance to try it out yet. On Tue, 2009-10-27 at 07:34 -0700, caezar wrote: Hi All, I've got a strange problem,