Further details:
If I run strace on the process, it looks like this, over and over and over:
gettimeofday({1155249187, 52}, NULL) = 0
gettimeofday({1155249188, 389}, NULL) = 0
gettimeofday({1155249188, 679}, NULL) = 0
gettimeofday({1155249188, 955}, NULL) = 0
clock_gettime(CLOCK_REALTI
I had the same problem before. Just read
http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg04303.html
Make that tiny change on line 385 of HttpBase.java and it will work fine.
Raphael
Sellek, Greg wrote:
I am experiencing the same issue as a similar post for 8/6. Whenever I
try
Hello,
Nutch is stalling in the fetch process. I've run it twice now, and it is
stopping on the *same* URL both times. I don't get what's going on!
The last status report was:
060810 145315 status: segment 20060810142649, 7900 pages, 14 errors,
98421231 bytes, 1571224 ms
060810 145315 status: 5
Hello,
is it possible to crawl e.g. http://www.domain.com,
but to skip crawling all URLs matching http://www.domain.com/subpage/ ?
I tried to achieve this with crawl-urlfilter.txt/regex-urlfilter.txt,
but it doesn't work:
-ftp.tu-clausthal.de
-^http://([a-z0-9]*\.)asta.tu-clausthal.de/de/m
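For reference, the url filter files are read top to bottom and the first
matching pattern decides whether a URL is kept, so the exclusion has to come
before the broader inclusion. A minimal sketch using the example domain from
the question (www.domain.com is just the poster's placeholder):
# skip the subtree first (first match wins)
-^http://www\.domain\.com/subpage/
# then accept everything else on the site
+^http://www\.domain\.com/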
Hello all - I have been taking a look at Nutch for purposes of indexing
a large pile of internal LAN files at our company, and so far it looks
quite impressive. I believe it could substitute for the Google Mini
appliance. However, the bigger Google boxes add more features that I am
not sure can be
Hi,
Could anyone explain to me what exactly the common-terms.utf8 file does? I
don't understand the real functionality of this file...
Regards,
--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]
I'm interested in crawling multiple shared folders (among other
things) on a corporate LAN.
It is a LAN of MS clients with Active Directory managed accounts.
The users routinely access the files based on NTFS-level (and
sharing?) permissions.
Ideally, I'd like to set up a central server (probably
I am experiencing the same issue as a similar post for 8/6. Whenever I
try to fetch pages, I see a lot of "fetch of xxx failed with:
java.lang.NullPointerException". I have put the appropriate agent info
in both the nutch-default and nutch-site config files. I tried using
DEBUG logging to get m
Take a look at this,
http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces
It will answer why you have a few more map tasks than are set in the
configuration.
Dennis
Murat Ali Bayir wrote:
my configs are given below:
in hadoop-site number of mapper = 130
in my code I use job.setNumMapT
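As the wiki page above explains, setNumMapTasks() is only a hint: the actual
map count comes from the input splits computed by the InputFormat, typically
at least one per input "part" file in the input path. A minimal sketch with
the old mapred API, assuming Hadoop 0.x-era classes on the classpath:

import org.apache.hadoop.mapred.JobConf;

public class MapTaskHint {
  public static void main(String[] args) {
    JobConf job = new JobConf();
    // Only a hint: the framework derives the real map count from the
    // input splits, so 130 here can still become 135 at run time.
    job.setNumMapTasks(130);
    // The reduce count, by contrast, is used exactly as configured.
    job.setNumReduceTasks(130);
    System.out.println("requested map tasks: " + job.getNumMapTasks());
  }
}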
my configs are given below:
in hadoop-site the number of mappers = 130
in my code I use job.setNumMapTasks = 130
in hadoop-default the number of mappers = 2
With this configuration I get 135 mappers in my job. However, there
is no problem with the number of reducers.
Andrzej Bialecki wrote:
Murat Ali Bayir
Murat Ali Bayir wrote:
Hi everybody, although I change the number of mappers in
hadoop-site.xml and use the job.setNumMapTasks method, the system gives
a different number of mappers; the problem only occurs for the number of
mappers, the number of reducers works correctly. What do I have to
do for setti
It cannot be the problem; it only restricts the number of tasks running
simultaneously, and there can be pending tasks as well. I checked that this is
not the problem. I am not sure, but I notice that the number of mapper tasks is
equal to k * the number of different parts in the input path. To illustrate, I
have 15 parts in
The name node is running. Run the bin/stop-all.sh script first and then
do a ps -ef | grep NameNode to see if the process is still running. If
it is, it may need to be killed by hand with kill -9 processid.
The second problem is the setup of ssh keys as described in a previous
email. Also I would re
There is also a mapred.tasktracker.tasks.maximum variable which may be
causing the task number to be different.
Dennis
Murat Ali Bayir wrote:
Hi everybody, although I change the number of mappers in
hadoop-site.xml and use the job.setNumMapTasks method, the system gives
another number as a number o
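For completeness, mapred.tasktracker.tasks.maximum only caps how many tasks a
single TaskTracker runs at the same time; it does not change how many map
tasks the job is split into. A hedged example of setting it in hadoop-site.xml
(the value 4 is arbitrary):

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>4</value>
</property>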
Hey list,
I would like to ask you if it is possible to start a search query with a
simple word (e.g. "Home"). Then Nutch will look up the word "Home" in a
list of synonyms. Nutch will then recognize that "House" is a synonym
for "Home". Now, Nutch can start a search query with "House" and "Ho
Hi everybody, although I change the number of mappers in hadoop-site.xml
and use the job.setNumMapTasks method, the system gives a different number
of mappers; the problem only occurs for the number of mappers, the number
of reducers works correctly. What do I have to do to set the number
of mappers
Hi,
I am interested in more comprehensive configuration of the crawl targets. The
current version only supports lists (files) containing URLs. One thing that
could be desirable is the injection of URLs with metadata attached. This
metadata (inserted into the CrawlData object) could be read by pl
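As the message notes, the version under discussion only injects plain URL
lists, so any example is only a sketch. Assuming the class meant is CrawlDatum
and that the Nutch version in use stores per-URL metadata in a Hadoop
MapWritable (true of later releases), a modified injector could attach values
roughly like this; the helper name and the "source" key are purely
illustrative:

import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class SeedMetadata {
  // Hypothetical helper: attach one key/value pair from the seed list to a
  // CrawlDatum so that a plugin could read it back during later phases.
  public static void attach(CrawlDatum datum, String key, String value) {
    MapWritable meta = datum.getMetaData();
    if (meta == null) {          // guard for releases that leave it unset
      meta = new MapWritable();
      datum.setMetaData(meta);
    }
    meta.put(new Text(key), new Text(value));
  }
}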
hello,
When I execute the DFS command, I get this:
[EMAIL PROTECTED] search]$ bin/start-all.sh
starting namenode, logging to
/nutch/search/logs/hadoop-nutch-namenode-localhost.out
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 81:0e:49:ce:61:8c:7b:09
I want to include embedded Flash in my crawls.
Despite (apparently successfully) including the parse-swf plugin, embedded
Flash does not seem to be retrieved. I'm assuming that the object tags are
not being parsed to find the .swf files.
Can anyone comment?
Thanks
Iain
hello,
we are trying to install Nutch on a single machine using this guide:
http://wiki.apache.org/nutch/NutchHadoopTutorial?highlight=%28nutch%29
but we are stuck at this step:
*first we execute this command