Is there a way to access stderr when using Hadoop Streaming? I see how
stdout is written to the log files, but I'm more concerned about what happens
when errors occur. Access to stderr would help debug when a run doesn't
complete successfully, but I haven't been able to figure out how to retrieve
what gets written to stderr.
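For context: anything a streaming mapper or reducer writes to stderr is captured in that task's stderr log, which can be browsed from the task details page in the JobTracker web UI or found under logs/userlogs on the tasktracker node. A minimal sketch of a hypothetical Python mapper that uses stderr for diagnostics:

#!/usr/bin/env python
# Hypothetical streaming mapper. stdout is the map output; stderr goes to
# the task's stderr log (logs/userlogs on the tasktracker, also reachable
# from the task details page in the JobTracker web UI).
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        # Diagnostic only -- this shows up in the task's stderr log.
        sys.stderr.write("skipping malformed line: %r\n" % line)
        continue
    sys.stdout.write("%s\t%s\n" % (fields[0], fields[1]))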
Thanks. At what number of servers and what file sizes would the performance
hit still be minor? I am concerned about implementing it all only to have to
rewrite it later to scale economically.
Thanks for all the information.
On Tue, May 19, 2009 at 1:30 PM, Amr Awadallah wrote:
> S d,
>
and small
scale tests.
On Tue, May 19, 2009 at 9:48 AM, Alex Loddengaard wrote:
> Streaming is slightly slower than native Java jobs. Otherwise Python works
> great in streaming.
>
> Alex
>
> On Tue, May 19, 2009 at 8:36 AM, s d wrote:
>
> > Hi,
> > How robust
Hi,
How robust is using Hadoop with Python over the streaming protocol? Any
disadvantages (performance? flexibility?)? It just strikes me that Python
is so much more convenient when it comes to deploying and crunching text
files.
Thanks,
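For reference, a streaming job in Python is just a pair of scripts that read stdin and write tab-separated key/value lines to stdout. A minimal word-count-style mapper sketch (hypothetical file name):

#!/usr/bin/env python
# mapper.py -- emits "word<TAB>1" for every token read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        sys.stdout.write("%s\t1\n" % word)

The matching reducer reads the shuffled "word<TAB>1" lines and sums the counts for consecutive identical words; both scripts are shipped to the nodes with -file and named via -mapper and -reducer on the streaming command line.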
I have a Hadoop Streaming program that crawls the web for data items,
processes each retrieved item and then stores the results on S3. For each
processed item a directory on S3 is created to store the results produced by
the processing. At the conclusion of a program run I've been getting a
duplic
I've used wget with Hadoop Streaming without any problems. Based on the
error code you're getting, I suggest you make sure that you have the proper
write permissions for the directory in which Hadoop does its processing (e.g.,
downloading, converting, ...) on each of the task tracker machines. The location
wher
I ran into this problem as well, and several people on this list provided a
helpful response: once the tasktracker starts, the maximum number of tasks
per node cannot be changed. In my case I've worked around this by stopping
and starting mapred (stop-mapred.sh, start-mapred.sh) between jobs.
My fault on this one. I mistakenly thought the environment variables
(AMAZON_ACCESS_KEY_ID and AMAZON_SECRET_ACCESS_KEY) would override values
set in hadoop-site.xml; I now see that this is not the case for the Hadoop
FS shell commands.
John
On Wed, Mar 4, 2009 at 5:18 PM, S D wrote:
>
I'm using Hadoop 0.19.0 with S3 Native. Up until a few days ago I was able
to use the various shell functions successfully; e.g.,
hadoop dfs -ls .
To ensure access to my Amazon S3 Native data store I set the following
environment variables: AMAZON_ACCESS_KEY_ID and AMAZON_SECRET_ACCESS_KEY.
> Instead, the output is again obtained with backticks.
>
> I don't know how irb captures the return value; by analogy, I would say
> that backticks are used for capturing the output even in irb.
>
> Best
>
> Roldano
>
>
>
> On Mon, Feb 23, 2009 a
I'm attempting to use the Hadoop FS shell
(http://hadoop.apache.org/core/docs/current/hdfs_shell.html) within a Ruby
script. My challenge is that I'm unable to get the return value of the
commands I'm invoking. As an example, I try to run get as follows:
hadoop fs -get /user/hadoop/testFile.t
ny changes you make will be ignored.
>
>
> 2009/2/18 S D
>
> > Thanks for your response Rasit. You may have missed a portion of my post.
> >
> > > On a different note, when I attempt to pass params via -D I get a usage
> > message; when I use
> > >
> For more details about these options:
> Use $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar -info
>
>
>
> I think -jobconf is not used in v0.19.
>
> 2009/2/18 S D
>
> > I'm having trouble overriding the maximum number of map tasks that
I'm having trouble overriding the maximum number of map tasks that run on a
given machine in my cluster. The default value of
mapred.tasktracker.map.tasks.maximum is set to 2 in hadoop-default.xml. When
running my job I passed
-jobconf mapred.tasktracker.map.tasks.maximum=1
to limit map tasks to one per machine.
unning. Have a look at the mapred.hosts.exclude property for how to
> exclude tasktrackers.
>
> Tom
>
> On Tue, Feb 17, 2009 at 5:31 PM, S D wrote:
> > Thanks for your response. For clarification, I'm using S3 Native instead
> of
> > HDFS. Hence, I'm not even calling st
te Student
> University of California, Santa Cruz
>
>
> On Tue, Feb 17, 2009 at 2:14 PM, S D wrote:
>
> > I have a Hadoop 0.19.0 cluster of 3 machines (storm, mystique, batman).
> It
> > seemed as if problems were occurring on mystique (I was noticing errors
> &
I have a Hadoop 0.19.0 cluster of 3 machines (storm, mystique, batman). It
seemed as if problems were occurring on mystique (I was noticing errors with
tasks that executed on mystique). So I decided to remove mystique. I did so
by calling stop-mapred.sh (I'm using S3 Native, not HDFS), removing mys
need the
localhost localhost.localdomain line, so I thought I'd better avoid removing
it altogether.
Thanks,
John
On Sun, Feb 15, 2009 at 10:38 AM, Nick Cen wrote:
> Try commenting out the localhost definition in your /etc/hosts file.
>
> 2009/2/14 S D
>
> > I'm reviewing t
ss it
> concurrently, and maybe one of them deletes it when it's done and one
> doesn't. Normally each task should run in its own temp directory though.
>
> On Sun, Feb 15, 2009 at 2:51 PM, S D wrote:
>
> > I was not able to determine the command shell return value fo
r:
'localdir' not found
Any clues on what could be going on?
Thanks,
John
On Sat, Feb 14, 2009 at 6:45 PM, Matei Zaharia wrote:
> Have you logged the output of the dfs command to see whether the copy has
> always succeeded?
>
> On Sat, Feb 14, 2009 at 2:46
I followed these instructions
http://wiki.apache.org/hadoop/MountableHDFS
and was able to get things working with 0.19.0 on Fedora. The only problem I
ran into was the AMD64 issue on one of my boxes (see the note on the above
link); I edited the Makefile and set OSARCH as suggested but couldn't g
In my Hadoop 0.19.0 program each map function is assigned a directory
(representing a data location in my S3 datastore). The first thing each map
function does is copy the particular S3 data to the local machine that the
map task is running on and then begin processing the data; e.g.,
command = "h
I'm reviewing the task trackers on the web interface (
http://jobtracker-hostname:50030/) for my cluster of 3 machines. The names
of the task trackers do not list real domain names; e.g., one of the task
trackers is listed as:
tracker_localhost:localhost/127.0.0.1:48167
I believe that the network
of
/user/hadoop/base that listed only hadoopInput followed by a second refresh
that listed the other subdirectories.
Any clues on how this could be? Perhaps there is a leftover process still
running?
Thanks,
John
On Mon, Feb 2, 2009 at 9:38 PM, Amareshwari Sriramadasu <
amar...@yahoo-inc.co
I'm using the Hadoop FS shell to move files into my data store (either HDFS
or S3Native). I'd like to use wildcards with copyFromLocal, but this doesn't
seem to work. Is there any way I can get that kind of functionality?
Thanks,
John
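One workaround, sketched here as an assumption rather than something from the thread: expand the wildcard on the client side and call copyFromLocal once per match.

#!/usr/bin/env python
# Hypothetical wrapper: expand a local glob ourselves and copy each match
# into the target directory with "hadoop fs -copyFromLocal".
import glob
import subprocess
import sys

pattern, dest = sys.argv[1], sys.argv[2]     # e.g. "data/*.log" and an HDFS/S3 dir
matches = glob.glob(pattern)
if not matches:
    sys.exit("no local files match %s" % pattern)
for path in matches:
    rc = subprocess.call(["hadoop", "fs", "-copyFromLocal", path, dest])
    if rc != 0:
        sys.exit("copyFromLocal failed (%d) for %s" % (rc, path))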
This does it. Thanks!
On Thu, Feb 5, 2009 at 9:14 PM, Arun C Murthy wrote:
>
> On Feb 5, 2009, at 1:40 PM, S D wrote:
>
> Is there a way to use the Reporter interface (or something similar such as
>> Counters) with Hadoop streaming? Alternatively, how could STDOUT be
>
Is there a way to use the Reporter interface (or something similar, such as
Counters) with Hadoop streaming? Alternatively, how could STDOUT be
intercepted for the purpose of updates? If anyone could point me to
documentation or examples that cover this, I'd appreciate it.
Thanks,
John
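For the record, since the answer quoted above is truncated: Hadoop Streaming picks up counter and status updates from specially formatted lines written to stderr (reporter:counter:<group>,<counter>,<amount> and reporter:status:<message>); everything else written to stderr just lands in the task log. A small Python sketch:

#!/usr/bin/env python
# Counters and task status from a streaming script go through stderr.
import sys

def update_counter(group, counter, amount=1):
    sys.stderr.write("reporter:counter:%s,%s,%d\n" % (group, counter, amount))

def update_status(message):
    sys.stderr.write("reporter:status:%s\n" % message)

count = 0
for line in sys.stdin:
    count += 1
    update_counter("MyJob", "records_seen")
    if count % 10000 == 0:
        update_status("processed %d records" % count)
    sys.stdout.write(line)                   # normal map output stays on stdout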
Shefali,
Is your firewall blocking port 54310 on the master?
John
On Wed, Feb 4, 2009 at 12:34 PM, shefali pawar wrote:
> Hi,
>
> I am trying to set up a two-node cluster using Hadoop 0.19.0, with 1
> master (which should also work as a slave) and 1 slave node.
>
> But while running bin/start-dfs
p" file in bin directory.
>
> I think the File System API also needs some improvement. I wonder if it
> has been considered by the core developers.
>
> Hope this helps,
> Rasit
>
> 2009/2/4 S D :
> > I'm using the Hadoop FS commands to move files from my local machine
Coincidentally I'm aware of the AWS::S3 package in Ruby but I'd prefer to
avoid that...
On Tue, Feb 3, 2009 at 5:02 PM, S D wrote:
> I'm at my wit's end. I want to do a simple test for the existence of a file
> on Hadoop. Here is the Ruby code I'm trying:
>
I'm at my wit's end. I want to do a simple test for the existence of a file
on Hadoop. Here is the Ruby code I'm trying:
val = `hadoop dfs -test -e s3n://holeinthebucket/user/hadoop/file.txt`
puts "Val: #{val}"
if val == 1
  # do one thing
else
  # do another
end
I never get a return value for
I'm using the Hadoop FS commands to move files from my local machine into
the Hadoop dfs. I'd like a way to force a write to the dfs even if a file of
the same name exists. Ideally I'd like to use a "-force" switch or some
such; e.g.,
hadoop dfs -copyFromLocal -force adirectory s3n://wholeinthe
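As far as I know there is no such switch in this release; one workaround (a sketch under that assumption) is to remove the destination first and then copy:

#!/usr/bin/env python
# Hypothetical forced copy: remove the destination (ignoring the error if it
# isn't there), then run copyFromLocal as usual.
import subprocess
import sys

src, dest = sys.argv[1], sys.argv[2]
subprocess.call(["hadoop", "fs", "-rmr", dest])   # non-zero exit is fine here
rc = subprocess.call(["hadoop", "fs", "-copyFromLocal", src, dest])
sys.exit(rc)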
You
> need not include it in your streaming jar.
> -Amareshwari
>
>
> S D wrote:
>
>> Thanks for your response, Amareshwari. I'm unclear on how to take advantage
>> of NLineInputFormat with Hadoop Streaming. Is the idea that I modify the
>> streaming jar file (c
> You can use NLineInputFormat for this, which splits one line (N=1, by
> default) as one split.
> So, each map task processes one line.
> See
> http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>
> -Amareshwari
>
> S D wrote:
Hello,
I have a clarifying question about Hadoop streaming. I'm new to the list and
didn't see anything posted that covers my questions; my apologies if I
overlooked a relevant post.
I have an input file consisting of a list of files (one per line) that need
to be processed independently of each other.
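Given the NLineInputFormat suggestion quoted above (N=1 means one input line per map task), the mapper itself can stay very simple; a hypothetical sketch, with the per-file processing step left as a placeholder:

#!/usr/bin/env python
# With NLineInputFormat and N=1, each map task sees exactly one line of the
# input file, i.e. one file name to work on.
import subprocess
import sys

for line in sys.stdin:                       # normally a single line per task
    # The line may arrive as "offset<TAB>filename" depending on how the key
    # is passed through; keeping the last tab-separated field covers both.
    filename = line.rstrip("\n").split("\t")[-1]
    if not filename:
        continue
    rc = subprocess.call(["./process_one_file.sh", filename])   # placeholder
    sys.stdout.write("%s\t%s\n" % (filename, "ok" if rc == 0 else "failed"))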