On Mon, 26 Oct 2009 17:26:23 +0100
Andrzej Bialecki a...@getopt.org wrote:
[...]
Stale (no longer existing) URLs are marked with STATUS_DB_GONE.
They are kept in Nutch crawldb to prevent their re-discovery
(through stale links pointing to these URL-s from other pages).
If you really want to
Gora Mohanty wrote:
On Mon, 26 Oct 2009 17:26:23 +0100
Andrzej Bialecki a...@getopt.org wrote:
[...]
Stale (no longer existing) URLs are marked with STATUS_DB_GONE.
They are kept in Nutch crawldb to prevent their re-discovery
(through stale links pointing to these URL-s from other pages).
If
I'm very new at this, so forgive my novice questions. I'm trying to
install nutch in WebSphere 6.1. While I can see that others have done this
before, I've been unsuccessful. I keep getting this error:
Error 500: java.lang.Error: java.lang.NoClassDefFoundError:
org.apache.jsp._search (wrong
On Tue, 27 Oct 2009 07:29:10 +0100
Andrzej Bialecki a...@getopt.org wrote:
[...]
I assume you mean that the generate step produces no new URL-s
to fetch? That's expected, because they become eligible for
re-fetching only after Nutch considers them expired, i.e. after
the fetchTime +
If I disable the HTML parser (remove parse-html from the plugin.includes
property), HTML files don't get parsed.
So I don't get outlinks to KML files from the HTML.
So I can't parse and index KML files.
I might not be right, but I have a feeling that it's not possible
without modifying source code.
thx
Dmitriy Fundak wrote:
If I disable the HTML parser (remove parse-html from the plugin.includes
property), HTML files don't get parsed.
So I don't get outlinks to KML files from the HTML.
So I can't parse and index KML files.
I might not be right, but I have a feeling that it's not possible
without modifying
Checking the URL suffix and returning null if it's not one I need helped.
Thanks, Andrzej.
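
The suffix check described above can be sketched as follows. This is a minimal, language-neutral illustration in Python rather than Nutch plugin code; the thread does not say where the check was placed, so the function name `keep_if_wanted` and the `WANTED_SUFFIXES` tuple are hypothetical, and `None` stands in for the Java null mentioned in the message.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical list of wanted suffixes; the thread only mentions KML files.
WANTED_SUFFIXES = (".kml",)


def keep_if_wanted(url: str) -> Optional[str]:
    """Return the URL unchanged if its path ends with a wanted suffix, else None."""
    # Compare only the path so that a query string or fragment does not hide the suffix.
    path = urlparse(url).path.lower()
    return url if path.endswith(WANTED_SUFFIXES) else None
```

Comparing against the parsed path instead of the raw string is a design choice made for this sketch: it keeps URLs such as `route.kml?v=2` from being rejected by accident.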
2009/10/27 Andrzej Bialecki a...@getopt.org:
Dmitriy Fundak wrote:
If I disable the HTML parser (remove parse-html from the plugin.includes
property), HTML files don't get parsed.
So I don't get outlinks to KML
I generated the segments after the crawl, then downloaded them from the
crawldb to my local machine. Below are the four segments I generated and
downloaded. If I now run fetch on these four segments, I get the error
below. Please help me run fetch locally.
Hi All,
I've got a strange problem: Nutch indexes far fewer URLs than it
fetches. For example, take this URL:
http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
I assume that it was fetched successfully, because it is mentioned only
once in the fetch logs:
2009-10-26 10:01:46,502 INFO
Hi All,
I've done some googling but found different answers, so I would appreciate
it if you could tell me which one is correct:
- when a page is redirected, the content of the target page is fetched and
associated with the source (initial) page URL
- when a page is redirected, a new entry with the redirect target URL
There are two different types of redirect. When a web site returns a
301 status (redirect permanent), it means the url you requested is no
longer valid, don't ask for it again. When it returns a 307 status
(temporary redirect), it means keep asking for the url you asked for,
and I'll tell you
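
The distinction between the two status codes described above can be sketched as a small decision function. This is an illustration of the general HTTP semantics only, written in Python; it is not a description of how Nutch handles redirects, and the function name `url_to_request_next_time` and its parameters are hypothetical.

```python
from http import HTTPStatus


def url_to_request_next_time(requested_url: str, status: int, location: str) -> str:
    """Pick which URL a client should ask for on its next visit."""
    if status == HTTPStatus.MOVED_PERMANENTLY:  # 301: the old URL is no longer valid
        return location
    if status == HTTPStatus.TEMPORARY_REDIRECT:  # 307: keep asking for the original URL
        return requested_url
    # Any other status is outside the scope of this sketch; keep the original URL.
    return requested_url
```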
This is my second time receiving this error:
Map output lost, rescheduling: getMapOutput
(attempt_200910271443_0012_m_01_0,0) failed :
org.apache.hadoop.fs.ChecksumException: Checksum Error
---
Does anyone know why I am getting this error and how to fix it? I
tried
Check the parse data first; maybe the parse was unsuccessful.
2009/10/27 caezar caeza...@gmail.com
Hi All,
I've got a strange problem: Nutch indexes far fewer URLs than it
fetches. For example, take this URL:
http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
I assume that it
I have had a similar experience.
Reinhard Schwab posted a possible fix. See his mail to this group dated
Sun, 25 Oct 2009 10:03:41 +0100 (05:03 EDT).
I haven't had a chance to try it out yet.
On Tue, 2009-10-27 at 07:34 -0700, caezar wrote:
Hi All,
I've got a strange problem,