Hi all,
I am using the newest trunk source code. I get this error message every
time:
2009-04-10 20:08:23,816 INFO indexer.Indexer - Indexer: done
2009-04-10 20:08:23,817 INFO indexer.DeleteDuplicates - Dedup: starting
2009-04-10 20:08:23,818 INFO indexer.DeleteDuplicates - Dedup: adding
moved to Hadoop.
mapred-default.xml is overridden by nutch-site.xml
---
Key: NUTCH-186
URL: https://issues.apache.org/jira/browse/NUTCH-186
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8
Right now I put default mapred settings that I may override in
mapred-default.xml rather than in nutch-site.xml (or hadoop-site.xml),
because I can't override anything in *-site.xml. Has the new version of
Hadoop changed this, and is that why mapred-default.xml is deprecated? If
not, where should
[ http://issues.apache.org/jira/browse/NUTCH-209?page=all ]
Sami Siren closed NUTCH-209.
include nutch jar in mapred jobs
Key: NUTCH-209
URL: http://issues.apache.org/jira/browse/NUTCH-209
[EMAIL PROTECTED] wrote:
As far as we understood from the MapReduce documentation, all reduce tasks
must be launched after the last map task is finished, i.e. map and reduce
must not run simultaneously. But often in the logs we see records like:
map 80%, reduce 10%, and many more records where map is less
Where is the mapred branch of Nutch now located?
Anton Potehin wrote:
Where is the mapred branch of Nutch now located?
It is developed in trunk now.
P.
...
A wild idea: could we put this jar on NDFS, sorry, DFS, implement a
DFSClassLoader and point all the tasks' classloaders there? Eventually, when
DFS grows the locality mechanism, we would avoid transmitting this data unless
it's really changed...
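A minimal sketch of what such a DFSClassLoader might look like, written
against the modern org.apache.hadoop.fs API (the NDFS API of this era
differed); the class name, constructor, and the unpacked-classes layout
are assumptions, not anything committed to Nutch:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Loads classes from a directory of unpacked .class files stored on DFS,
// so tasks could resolve job classes without shipping the jar to each node.
public class DFSClassLoader extends ClassLoader {
  private final FileSystem fs;
  private final Path classRoot; // DFS directory holding the unpacked job jar

  public DFSClassLoader(Configuration conf, Path classRoot, ClassLoader parent)
      throws IOException {
    super(parent);
    this.fs = FileSystem.get(conf);
    this.classRoot = classRoot;
  }

  protected Class<?> findClass(String name) throws ClassNotFoundException {
    // Map org.example.Foo -> <classRoot>/org/example/Foo.class
    Path classFile = new Path(classRoot, name.replace('.', '/') + ".class");
    try (FSDataInputStream in = fs.open(classFile)) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      byte[] bytes = out.toByteArray();
      return defineClass(name, bytes, 0, bytes.length);
    } catch (IOException e) {
      throw new ClassNotFoundException(name, e);
    }
  }
}

With the locality mechanism Doug mentions, a tasktracker would mostly hit
a local replica of those blocks, so bytes would cross the network only
when the jar actually changed.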
include nutch jar in mapred jobs
you're asking for.
include nutch jar in mapred jobs
Key: NUTCH-209
URL: http://issues.apache.org/jira/browse/NUTCH-209
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Doug Cutting
Priority: Minor
could also try to make the job jar smaller, e.g., by only including enabled
plugins.
include nutch jar in mapred jobs
it would be a small change.
Re: including only enabled plugins: potentially you would have to build a
custom jar for each job, because the list of active plugins depends on the
job's Configuration. I think I would prefer the replication trick.
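For context, the set of active plugins is selected by the plugin.includes
property in the job's configuration (a real Nutch property; the value
below is an abbreviated illustration, not a recommended setting), which is
why the plugin list can differ from job to job:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic</value>
  <description>Regular expression naming plugin directory names to
  include.</description>
</property>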
include nutch jar in mapred jobs
Hi,
over the last few days I gave the mapred branch a try, and I was impressed!
But I still have a problem with incremental crawling. My setup: I
have 4 boxes (1x namenode/jobtracker - 3x datanode/tasktracker). Running
one round of crawling consists of the steps:
- generate (I set a limit
[
http://issues.apache.org/jira/browse/NUTCH-186?page=comments#action_12364010 ]
Gal Nitzan commented on NUTCH-186:
--
After reading the code I think I figured it out... :)
The issue of the mapred-default.xml is totally misleading.
Actually
and use a pair of mapred-default/mapred-site.xml ...
It would be more understandable for users.
mapred-default.xml is overridden by nutch-site.xml
[
http://issues.apache.org/jira/browse/NUTCH-186?page=comments#action_12363903 ]
Gal Nitzan commented on NUTCH-186:
--
OK, JobConf extends NutchConf, and in the JobConf constructor it adds the
mapred-default.xml resource.
The call to add the resource
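A hedged reconstruction of the constructor behavior being described here
(class and method names follow the old NutchConf style and are my
assumptions, not the actual source):

public class JobConf extends NutchConf {
  public JobConf(NutchConf conf) {
    super(conf);
    // The new resource is slotted in before nutch-site.xml, which is
    // always applied last -- so any mapred property repeated in
    // nutch-site.xml silently wins over mapred-default.xml, which is
    // exactly what this issue complains about.
    addConfResource("mapred-default.xml");
  }
}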
[ http://issues.apache.org/jira/browse/NUTCH-186?page=all ]
Gal Nitzan updated NUTCH-186:
-
Attachment: myBeautifulPatch.patch
The patch is attached.
mapred-default.xml is overridden by nutch-site.xml
While going through the nutch sources for creating an updated
nutch-default.xml I got some ideas.
Currently the mapred/ndfs engine is just seen as one part of nutch and
so it makes sense to have mapred/ndfs properties set in the same file as
the rest of the nutch config properties
Huh...
anybody interested in this?
Normally I wouldn't be so pushy, but to me it seems that Nutch dies if it
meets a Word document which can't be parsed. This seems like a serious
issue to me.
Or did I overlook something important/fundamental?
Lukas
On 1/6/06, Lukas Vlcek [EMAIL PROTECTED] wrote:
Lukas Vlcek wrote:
How can I learn that?
What I do is run the regular one-step command [/bin/nutch crawl]
In that case your nutch-default.xml / nutch-site.xml decides; there is a
boolean option there. If you didn't change it, then it defaults to
true (i.e. your fetcher is parsing the
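The boolean in question is presumably fetcher.parse (the property name is
an assumption based on the nutch-default.xml of this era; verify it
against your copy). Switching the fetcher to non-parsing mode in
nutch-site.xml would then look like:

<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, fetcher will parse content; if false, run
  ParseSegment as a separate step afterwards.</description>
</property>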
Hi,
I found the reason for that exception!
If you look into my crawl.log carefully, you will notice these lines:
060104 213608 Parsing
[http://220.000.000.001/otd_04_Detailed_Design_Document.doc] with
[EMAIL PROTECTED]
060104 213609 Unable to successfully parse content
Yes, it was fixed. Just update your code from trunk.
On Wed, 2006-01-04 at 08:51 +0100, Andrzej Bialecki wrote:
Lukas Vlcek wrote:
Hi,
I am trying to use the latest nutch-trunk version but I am facing an
unexpected "Job failed!" exception. It seems that all the crawling work
has already been done
Hmmm...
If I am looking correctly at my local SVN copy, I see that I last
updated yesterday - thus I have revision 365850 (Update of HTTPClient
to v3.0). So this should already be fixed... :-(
Andrzej, since you probably did the fix, is there anything special I
should check to be sure I have
Fixed in the copy I run, as I've been able to get my
100k pages indexed without getting that error.
-byron
--- Andrzej Bialecki [EMAIL PROTECTED] wrote:
Lukas Vlcek wrote:
Hi,
I am trying to use the latest nutch-trunk version but I am facing an
unexpected "Job failed!" exception. It seems
Thanks guys!
I really didn't have the latest copy...
L.
On 1/4/06, Byron Miller [EMAIL PROTECTED] wrote:
Fixed in the copy I run, as I've been able to get my
100k pages indexed without getting that error.
-byron
--- Andrzej Bialecki [EMAIL PROTECTED] wrote:
Lukas Vlcek wrote:
Hi,
I gave it another try last night and I am still having trouble.
This is the very end of my log (full version is attached) and you can
see another nasty exception:
...
060104 213644 map 100%
060104 213645 Optimizing index.
java.lang.NullPointerException: value cannot be null
at
Lukas Vlcek wrote:
I gave it another try last night and I am still having trouble.
This is the very end of my log (full version is attached) and you can
see another nasty exception:
Do you use the Fetcher in parsing or non-parsing mode, i.e. do you run a
ParseSegment as a separate step?
--
<property>
  <name>fetcher.verbose</name>
  <value>true</value>
  <description>If true, fetcher will log more verbosely.</description>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/home/lukas/nutch/mapred/local</value>
  <description>The local directory where MapReduce stores intermediate
  data files
Note: I mistakenly used nutch-user email for reply-to value. Feel free
to reply to either nutch-dev or nutch-user as I monitor both of them
:-)
Anyway, can anybody tell me how I can easily change the reply-to value in
Gmail? I am struggling with this all the time, especially when replying
to multiple
Lukas Vlcek wrote:
Hi,
I am trying to use the latest nutch-trunk version but I am facing an
unexpected "Job failed!" exception. It seems that all the crawling work
has already been done but some threads are hung, which results in an
exception after some timeout.
This was fixed (or should be fixed
[ http://issues.apache.org/jira/browse/NUTCH-121?page=all ]
Andrzej Bialecki closed NUTCH-121:
---
Fix Version: 0.8-dev
Resolution: Fixed
Assign To: Andrzej Bialecki
Committed. Thanks!
SegmentReader for mapred
Hi all,
I am currently working with Nutch 0.7.1 and I want to start using mapred.
Any ideas where I can find the latest version?
BTW, I looked at the path:
http://svn.apache.org/repos/asf/lucene/nutch/branches/
but the only directory that exists there is branch-0.7/
Thanks,
Raffi
mapred is now trunk...
Am 19.12.2005 um 18:46 schrieb Rafi Iz:
Hi all,
I am currently working with Nutch 0.7.1,
I want to start using the mapred, any ideas where I can find the
latest version.
B.T.W I looked at the path: http://svn.apache.org/repos/asf/lucene/
nutch/branches/
but the only
Thanks for the fast response,
Do you know where I can find a compressed version?
Thanks,
Rafi
From: Stefan Groschupf [EMAIL PROTECTED]
Reply-To: nutch-dev@lucene.apache.org
To: nutch-dev@lucene.apache.org
Subject: Re: Latest version of Mapred Date: Mon, 19 Dec 2005 19:00:29 +0100
mapred
Thanks for the fast response,
Do you know where I can find a compressed version?
Here are the nightly builds:
http://cvs.apache.org/dist/lucene/nutch/nightly/
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Sami Siren wrote:
+1. I think this is a good time to merge now, as mapred is fully usable.
Barring objections, I will do this tomorrow morning, Pacific time.
Doug
dedup.tmp
After each iteration we produce a new segment and may use it for search.
Now we are trying mapred. How can we use crawl in a similar way? We need
results during the process, not only at the end of the crawl (since it is
a very long process - weeks).
Sami Siren wrote:
+ if (k.contains(score)) {
String.contains(CharSequence) is marked "Since: 1.5" in the Javadoc, i.e.
it requires Java 5.
Ah, indeed. Fixed - thanks!
--
Best regards,
Andrzej Bialecki
[EMAIL PROTECTED] wrote:
Implement a reader for CrawlDB, loosely inspired by NUTCH-114 (thanks Stefan!).
The reader offers similar functionality to the classic readdb command.
This looks great! Thanks, Andrzej.
I just ran it on a 50M page crawl. It took longer than I expected. The
reduce
Doug Cutting wrote:
I just ran it on a 50M page crawl.
FYI, here's the output:
051123 191703 TOTAL urls: 167780785
051123 191703 avg score:1.152
051123 191703 max score:47357.137
051123 191703 min score:1.0
051123 191703 retry 0: 167780785
051123 191703 status 1
[EMAIL PROTECTED] wrote:
Yes, the problem is the negative progress percentages.
Is /usr/root/seeds/urls the same file on all hosts? How big is it?
Doug
[regarding mapred ver 0.8]
Anton Potehin wrote:
I tried to launch mapred on 2 machines: 192.168.0.250 and 192.168.0.111.
051123 053136 task_m_xaynqo -14885.741% /user/root/seeds/urls:31+31
Please help me find out what the problem is. What did I do wrong?
Is the problem the negative
Rod Taylor wrote:
The attached patches for Generator.java and Injector.java allow a
specific temporary directory to be specified. This gives Nutch the full
path to these temporary directories and seems to fix the "No input
directories" issue when using a local filesystem with multiple task
/nutch-default.xml
051107 091256 parsing file:/opt/nutch-0.8_7/conf/mapred-default.xml
051107 091256 parsing file:/opt/nutch-0.8_7/conf/nutch-site.xml
051107 091256 parsing file:/opt/nutch-0.8_7/conf/nutch-default.xml
051107 091256 parsing file:/opt/nutch-0.8_7/conf/nutch-site.xml
051107 091256 Client
Hello Nutch devs,
I have the same problems. I have 10 hosts and one master. On each host I
have a datanode and a tasktracker.
My mapred conf is 100 maps and 25 reducers. Below are the logs with errors.
Thanks
051107 144101 task_r_pd3ybk 0.224% reduce copy
051107 144102 Moving bad file
/tmp
urls due for fetch.
051107 091256 parsing file:/opt/nutch-0.8_7/conf/nutch-default.xml
051107 091256 parsing file:/opt/nutch-0.8_7/conf/mapred-default.xml
051107 091256 parsing file:/opt/nutch-0.8_7/conf/nutch-site.xml
051107 091256 parsing file:/opt/nutch-0.8_7/conf/nutch-default.xml
051107
On Mon, 2005-11-07 at 17:26 -0800, Paul Baclace wrote:
Rod Taylor wrote:
The attached patches for Generator.java and Injector.java allow a
specific temporary directory to be specified. This gives Nutch the full
path to these temporary directories and seems to fix the No input
directories
Rod Taylor wrote:
NDFS accomplishes the above path finding by auto-prefixing any path not
beginning with / with a /user/$USER. I didn't think it was appropriate
for LocalFileSystem.java to be mucking around trying to automatically
adjust paths to what the user may have intended.
Grep-ing for
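A tiny illustration of the auto-prefixing behavior described above (a
hypothetical helper written against the modern Hadoop Path API, not the
actual NDFS code):

import org.apache.hadoop.fs.Path;

public class PathQualifier {
  // Relative paths are silently re-rooted under the user's DFS home
  // directory; absolute paths (starting with /) pass through untouched.
  public static Path qualify(Path p, String user) {
    return p.isAbsolute() ? p : new Path("/user/" + user, p);
  }
}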
On Mon, 2005-11-07 at 18:12 -0800, Paul Baclace wrote:
Rod Taylor wrote:
NDFS accomplishes the above path finding by auto-prefixing any path not
beginning with / with a /user/$USER. I didn't think it was appropriate
for LocalFileSystem.java to be mucking around trying to automatically
I tried running one datanode per machine connecting back to the
same SAN
but it seemed pretty clunky.
SAN in general is a bad idea. A SAN is too slow for a serious setup.
... and it is the single point of failure...
Better to use many local HDDs.
Stefan
Rod Taylor wrote:
Every segment that I fetch seems to be missing a part when stored on the
filesystem. The strange thing is that it is always the same part (very
reproducible).
This sounds strange. Are the datanode errors always on the same host?
How many hosts are you running this on?
Doug
Ken van Mulder wrote:
The first is that the fetcher slows down over time and continues to use more
and more memory as it goes (which I think is eventually hanging the
process).
What parser plugins do you have enabled? These are usually the culprit.
Try using 'kill -QUIT' to see what various
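For reference, sending SIGQUIT makes a JVM print a full thread dump
without killing it, which shows where the fetcher threads are stuck (the
pid is a placeholder):

$ jps                # find the fetcher JVM's pid (or use ps)
$ kill -QUIT <pid>   # the thread dump goes to the JVM's stdout/log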
On Fri, 2005-11-04 at 13:43 -0800, Doug Cutting wrote:
Rod Taylor wrote:
Every segment that I fetch seems to be missing a part when stored on the
filesystem. The strange thing is that it is always the same part (very
reproducible).
This sounds strange. Are the datanode errors always on the
Rod Taylor wrote:
There is only a single datanode and there are 20 hosts.
That's a lot of load on one datanode. I typically run a datanode on
every host, accessing the local drives on that host.
Doug
Rod Taylor wrote:
I tried running one datanode per machine connecting back to the same SAN
but it seemed pretty clunky. A crash of any datanode would take down
the entire system (no data replication since it's a common data-store in
the end). Reducing it to a single datanode did not have this impact.
Why use NDFS at all? Why not just mount the SAN on all hosts? You're
not using NDFS as a distributed file system, but rather as a centralized
file system.
I was unable to make the mapred branch work by using 'local' as the
filesystem and having more than one tasktracker. Tasktrackers were
unable to complete any work, although it was quite a while ago when I
last tried (September).
Rod Taylor wrote:
Here you go. local filesystem and a single job tracker on another
machine. When the tasktracker and jobtracker are on the same box there
isn't a problem. When they are on different machines it runs into
issues.
This is using mapred.local.dir on the local machine (not shared
235806 Lost connection to JobTracker
[sbider5.sitebuildit.com/192.168.100.14:5464]. Retrying...
051104 235811 parsing file:/opt/nutch-0.8_7/conf/nutch-default.xml
051104 235811 parsing file:/opt/nutch-0.8_7/conf/mapred-default.xml
051104 235811
parsing /home/sitesell/local/taskTracker/task_r_mdnul7
Sources are from October 31st. Sun Standard Edition 1.5.0_02-b09 for
amd64
Every segment that I fetch seems to be missing a part when stored on the
filesystem. The strange thing is that it is always the same part (very
reproducible).
If I have mapred.reduce.tasks set to 20, the hole is at part 13.
I forgot to provide this earlier. Here is nutch ndfs -ls output for the
directory structure of a segment with a failed part-00013.
[EMAIL PROTECTED] ~]$ /opt/nutch/bin/nutch ndfs
-ls /opt/sitesell/sbider_data/nutch/segments/20051102031132/20051102031133
051103 162002 parsing
Paul Baclace wrote:
Here is a patch for improving the error message that is displayed
when an intranet crawl commandline has a file instead of a directory
of files containing URLs.
I have committed this to the mapred branch.
Thanks, Paul!
Doug
I will postpone the merge of the mapred branch into trunk until I have a
chance to (a) add some MapReduce documentation; and (b) implement
MapReduce-based dedup.
Doug
Doug Cutting wrote:
Currently we have three versions of nutch: trunk, 0.7 and mapred. This
increases the chances
[EMAIL PROTECTED] wrote:
Author: cutting
Date: Mon Sep 12 10:03:00 2005
New Revision: 280368
URL: http://svn.apache.org/viewcvs?rev=280368view=rev
Log:
Change so that -du and -ls commands work with zero arguments.
Come to think of it... Shouldn't the enigmatic TestClient be renamed
to
howdy,
I have been looking around for a nutch/mapred tutorial
and haven't had much luck. I found this one
http://lucene.apache.org/nutch/tutorial.html
which did help me get a crawl going on trunk, but no
such luck in branches/mapred. I set the urls file and
the filter in the same way that I
--- Earl Cahill [EMAIL PROTECTED] wrote:
howdy,
I have been looking around for a nutch/mapred
tutorial
and haven't had much luck. I found this one
http://lucene.apache.org/nutch/tutorial.html
which did help me get a crawl going on trunk, but no
such luck in branches/mapred
In some cases, though, focused crawling may require extra data to be
stored which is not useful for whole-web crawling, for example storing a
url's parent and seed url and its depth (essential for crawl scopes).
Sounds like metadata for a page. :)
Some time ago I submitted a patch to
Doug Cutting wrote:
Currently we have three versions of nutch: trunk, 0.7 and mapred. This
increases the chances for conflicts. I would thus like to merge the
mapred branch into trunk soon. The soonest I could actually start this
is next week. Are there any objections?
Doug
+1
P.
Currently we have three versions of nutch: trunk, 0.7 and mapred.
This
increases the chances for conflicts. I would thus like to merge the
mapred branch into trunk soon. The soonest I could actually start
this is next week. Are there any objections?
I, too, am looking forward
[EMAIL PROTECTED] wrote:
I, too, am looking forward to this, but I am wondering what that will
do to Kelvin Tan's recent contribution, especially since I saw that
both MapReduce and Kelvin's code change how FetchListEntry works. If
merging mapred to trunk means losing Kelvin's changes, then I suggest
one of the Nutch developers evaluates Kelvin's modifications and, if
they are good, commits them to trunk, and then makes the final
pre-mapred release (e.g. release-0.8).
It won't lose Kelvin's patch: it will still be a patch to 0.7
tasks, but have
run into issues with the tasks timing out. I attempted to override the
mapred.tasks.timeout option both in mapred-default.xml and in the actual
code for my Mapper class, but my timeout durations remained steady at the
default 10
minutes.
I looked at TaskTracker and I see
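A hedged sketch of the programmatic override being attempted (the property
name is copied verbatim from the message above; later Hadoop releases
spell it mapred.task.timeout, so verify it against your source tree):

// Hypothetical fragment: 'conf' is the job's base configuration.
JobConf job = new JobConf(conf);
job.set("mapred.tasks.timeout", "3600000"); // milliseconds; one hour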
I have been attempting to get the mapred branch version of the crawler
working and have hit some snags.
First, I have observed the same behavior as a previous poster from
yesterday who, instead of specifying a file for the URLs to be read
from, must now specify a directory (full path) to which
Jay Pound wrote:
is the org.apache.nutch.crawl package a part of the nightly builds?
No. Nightly builds are from trunk. The mapred code is in a separate
branch in subversion. After the 0.7 release, when the mapred branch is
folded into trunk, then it will be in nightly builds. Until
Fuad Efendi wrote:
Which parameter should I pass to Crawl? Should it be a directory
containing something, and in which format?
As before, inject takes a flat text file of urls, one per line. If you
wish to inject DMOZ urls, there is now a utility main() that will
convert the DMOZ file to such a file.
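A minimal illustration of that input format and a typical one-step
invocation (paths and URLs are placeholders):

$ cat urls/seeds.txt
http://example.com/
http://example.org/
$ bin/nutch crawl urls -dir crawl -depth 3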
Thanks,
It works now; I pass Crawl a folder containing a plain text file with
URLs. I am testing, and I pass a single URL.
At some point I have:
050815 162137 parsing \tmp\nutch\mapred\local\job_q3s4ai.xml
050815 162137 parsing file:/C:/workspace/MapRed/conf/nutch-site.xml
java.io.IOException
I need some help with how to use mapred. What are the commands to use with it?
Thanks,
Jay Pound
--
Pound Web Hosting www.poundwebhosting.com
(607)-435-3048
How would I set up mapred for SMP machines? I understand it will split up
big jobs like indexing or updating the db into a bunch of chunks to be
processed by separate machines. I have machines with multiple processors
that I want to test this with internally; it makes sense to utilize
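On a single multi-processor box the relevant knob should be the maximum
number of tasks a tasktracker runs in parallel; the property below appears
in the mapred-default.xml of this era, but verify the exact name against
your checkout:

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>4</value>
  <description>The maximum number of tasks that will be run simultaneously
  by a single task tracker.</description>
</property>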
I saw that this revision fixed something that had been puzzling me.
However, if the fix is applied, NDFS can't handle 0-byte files
anymore; it will simply hang. I haven't looked into the code yet. Maybe
this case is something that needs to be handled specially?
Yitao
I'm trying to start a NDFS datanode and keep getting the following error:
[EMAIL PROTECTED] nutchmapre]$ bin/nutch datanode
050728 213401 10 parsing file:/usr/local/nutchmapre/conf/nutch-default.xml
050728 213402 10 parsing file:/usr/local/nutchmapre/conf/nutch-site.xml
050728 213402 10 Opened