I would like to better understand YARN's scheduling with named workers and
relaxedLocality==true. For example, suppose that I have a three-node cluster
with nodes A,B,C. Each node has capacity to run two tasks of the kind I desire
simultaneously. My AM then requests nine containers with
I noticed that HDP 2.0 is available for download here:
http://hortonworks.com/products/hdp-2/?b=1#install
Is this the final GA version that tracks Apache Hadoop 2.2?
Sorry I am just a little confused by the different numbering schemes.
Thanks
John
declared publicly stable though, but
we can do that over a JIRA.
On Mon, Oct 21, 2013 at 2:05 AM, John Lilley john.lil...@redpoint.net wrote:
Harsh, thanks for the quick response. These files don't need to be on the
DFS (although we use that too). These are local files used during sorting
and are for the app's user to
utilize. You shouldn't have any permission issues working within them.
The LocalDirAllocator is still somewhat MR-bound, but you should still be able
to make it work by giving it a config with the values it needs.
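A minimal sketch of that suggestion, assuming the allocator is pointed at the NM's local-dirs key (any config key naming a comma-separated list of local directories works):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.LocalDirAllocator;
import org.apache.hadoop.fs.Path;

public class ScratchDirs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    LocalDirAllocator alloc = new LocalDirAllocator("yarn.nodemanager.local-dirs");
    // Picks a directory with enough free space, round-robining across disks.
    Path scratch = alloc.getLocalPathForWrite("myapp/sort-run-0001", conf);
    System.out.println("scratch file: " + scratch);
  }
}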
On Mon, Oct 21, 2013 at 8:49 PM, John Lilley john.lil...@redpoint.net wrote:
Thanks again. This gives me a lot of options; we will see what works.
Do you know if there are any permissions issues if we directly access the
folders of LOCAL_DIR_ENV?
Regarding LocalDirAllocator, I see its
Is this option typically enabled in practice? Is there a way for a specific AM
to disable it, or is it site-wide?
We are finding that our YARN application, because it launches many threads,
each with a somewhat large stack, has a large vmem/pmem ratio (say, 2GB /
200MB). Unless we disable
I submitted a JIRA requesting FSDataOutputStream.write(ByteBuffer) method.
HDFS-5395
This is my first JIRA submission for Hadoop, so please comment if this should
be different.
Thanks
John
We have a pure YARN application (no MapReduce) that needs to store a
significant amount of temporary data. How can we know the best location for
these files? How can we ensure that our YARN tasks have write access to these
locations? Is this something that must be configured outside of
configuration for.
Do the files need to be on a distributed FS or a local one?
On Sun, Oct 20, 2013 at 8:54 PM, John Lilley john.lil...@redpoint.net wrote:
We have a pure YARN application (no MapReduce) that needs to store
a significant amount of temporary data. How can we know the best
location
Ah, never mind.. it is getNodeReports()
john
From: John Lilley [mailto:john.lil...@redpoint.net]
Sent: Thursday, October 17, 2013 1:44 PM
To: user@hadoop.apache.org
Subject: Querying cluster nodes list
I thought mistakenly that getClusterMetrics() would return information about
the cluster's nodes, or about a queue's nodes, but this doesn't seem to be true
- it is only a count. How can a YARN application query the available node
list on a cluster and what resources are configured on each
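A minimal sketch of the getNodeReports() answer above, assuming the Hadoop 2 YarnClient API:

import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListNodes {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());
    yarn.start();
    // One NodeReport per running node, including its configured capability.
    List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
    for (NodeReport n : nodes) {
      System.out.println(n.getNodeId() + " capability=" + n.getCapability());
    }
    yarn.stop();
  }
}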
In the Hadoop 2.1 docs, I see there is a read(ByteBuffer) call in
FSDataInputStream, but I don't see a write(ByteBuffer) call documented for
FSDataOutputStream. Is there a (fast) way to write a ByteBuffer to HDFS files?
Thanks
John
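Pending such an API, one workaround sketch is to drain the ByteBuffer through the existing byte[] overloads (no copy for heap buffers; direct buffers are staged through a chunk):

import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.fs.FSDataOutputStream;

public final class ByteBufferWrite {
  public static void write(FSDataOutputStream out, ByteBuffer buf) throws IOException {
    if (buf.hasArray()) {
      // Heap buffer: hand the backing array straight to the stream.
      out.write(buf.array(), buf.arrayOffset() + buf.position(), buf.remaining());
      buf.position(buf.limit());
    } else {
      // Direct buffer: copy out in 64 KB chunks.
      byte[] chunk = new byte[64 * 1024];
      while (buf.hasRemaining()) {
        int n = Math.min(buf.remaining(), chunk.length);
        buf.get(chunk, 0, n);
        out.write(chunk, 0, n);
      }
    }
  }
}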
Arun
On Oct 5, 2013, at 3:12 PM, John Lilley
john.lil...@redpoint.net wrote:
Is there a description of how MapReduce under Hadoop 2.0 assigns mapper tasks
to preferred nodes? I think that someone on the list mentioned previously that
it attempted to assign one
Karim,
I am not an experienced Hadoop programmer, but what I found was that building
and debugging Hadoop under Eclipse was very difficult, and I was never able to
make it work correctly. I suggest using the well-documented command-line Maven
build, installing Hadoop from that build, and running
MapReduce try to
obtain an even task assignment while optimizing data locality?
Thanks,
John Lilley
Chief Architect, RedPoint Global Inc.
1515 Walnut Street | Suite 200 | Boulder, CO 80302
T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint |
john.lil
Is this option set on a per-application-instance basis or is it a cluster-wide
setting (or both)?
Is this a MapReduce-specific issue, or a YARN issue?
I don't understand how the problem arises in the first place. For example, if
I have an idle cluster with 10 nodes and each node has four
When I call getFileBlockLocations() on a DFS, will it return the blocks for
currently-inactive nodes?
If so, how can I filter out the unavailable blocks?
Or more generally, how do I get the list of node status? Is that
ApplicationClientProtocol.getClusterNodes()?
Thanks,
John
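A minimal sketch of querying block locations through the public FileSystem API (path taken from the command line; the host lists reflect the NameNode's current view of replica placement):

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockHosts {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path(args[0]));
    for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
      System.out.println(b.getOffset() + "+" + b.getLength()
          + " -> " + Arrays.toString(b.getHosts()));
    }
  }
}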
pass its host:port to the task as part of either
the cmd-line for the task or in its env. That is what the MR AM does.
hth,
Arun
On Sep 21, 2013, at 6:52 AM, John Lilley
john.lil...@redpoint.net wrote:
Thanks Harsh! The data-transport format is pretty easy
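A sketch of what Arun describes above — the AM publishing its host:port through the container launch environment (the AM_HOST/AM_PORT variable names are hypothetical; the task reads them at startup and reports status back):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

public class LaunchWithAmAddress {
  static ContainerLaunchContext build(String amHost, int amPort, List<String> commands) {
    Map<String, String> env = new HashMap<String, String>();
    env.put("AM_HOST", amHost);
    env.put("AM_PORT", Integer.toString(amPort));
    return ContainerLaunchContext.newInstance(
        null /* localResources */, env, commands,
        null /* serviceData */, null /* tokens */, null /* acls */);
  }
}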
recreate the tables with predefined splits to
create more regions.
Thanks,
Rahul
On Sun, Sep 22, 2013 at 4:38 AM, John Lilley
john.lil...@redpoint.net wrote:
Pavan,
How large are the rows in HBase? 22 million rows is not very much but you
mentioned huge strings. Can
there that make it easy to do such things once you define what you
want in a schema/spec form.
On Fri, Sep 20, 2013 at 5:32 PM, John Lilley john.lil...@redpoint.net wrote:
Thanks Harsh. Is this protocol something that is available to all AMs/tasks?
Or is it up to each AM/task pair to develop
If my YARN application tasks are all reading/writing HDFS simultaneously and
some node is unable to honor a connection request because it is overloaded,
what happens? I've seen HDFS attempt to retry connections.
For that matter, how does MR under YARN deal with connection overload during
the
Never mind, it's in the ApplicationClientProtocol class.
John
From: John Lilley [mailto:john.lil...@redpoint.net]
Sent: Thursday, September 19, 2013 6:12 PM
To: user@hadoop.apache.org
Subject: How to get number of data nodes as a hadoop client
How does a Hadoop client query the number
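A minimal sketch, assuming the YarnClient route (the cluster metrics carry a NodeManager count):

import org.apache.hadoop.yarn.api.records.YarnClusterMetrics;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NodeCount {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());
    yarn.start();
    YarnClusterMetrics m = yarn.getYarnClusterMetrics();
    System.out.println("NodeManagers: " + m.getNumNodeManagers());
    yarn.stop();
  }
}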
Pavan,
How large are the rows in HBase? 22 million rows is not very much but you
mentioned huge strings. Can you tell which part of the processing is the
limiting factor (read from HBase, mapper output, reducers)?
John
From: Pavan Sudheendra [mailto:pavan0...@gmail.com]
Sent: Saturday,
progress.
On Fri, Sep 20, 2013 at 12:18 AM, John Lilley john.lil...@redpoint.net wrote:
How does a YARN application master typically query ongoing status
(like percentage completion) of its tasks?
I would like to be able to ultimately relay information to the user like:
100 tasks
Are there any rough numbers one can give me regarding the latency of creating,
writing, and closing a small HDFS-based file? Does replication have a big
impact? I am trying to decide whether to communicate some modestly-sized
(~200KB) information via HDFS files or go to the trouble of
How does a YARN application master typically query ongoing status (like
percentage completion) of its tasks?
I would like to be able to ultimately relay information to the user like:
100 tasks are scheduled
10 tasks are complete
4 tasks are running and they are (4%, 10%, 50%, 70%) complete
But,
of synchronous and
sequenced write pipelines in HDFS). Reads would be the same, unless you're
unable to schedule a rack-local read (at worst
case) due to only one (busy) rack holding it.
On Sun, Sep 15, 2013 at 10:38 PM, John Lilley john.lil...@redpoint.net wrote:
In our YARN application, we
In our YARN application, we are considering whether to store temporary data
with replication=1 or replication=3 (or give the user an option). Obviously
there is a tradeoff between reliability and performance, but on smaller
clusters I'd expect this to be less of an issue.
What is the
A's containers finish.
It's also possible to configure the schedulers to use preemption to make this
kind of thing go a lot faster.
Does that make some sense?
-Sandy
On Mon, Sep 9, 2013 at 7:21 AM, John Lilley
john.lil...@redpoint.net wrote:
Do the Hadoop 2.0
Do the Hadoop 2.0 YARN scheduler(s) deal with situations like the following?
Hadoop cluster of 10 nodes, with 8GB each available for containers. There is
only one queue.
Application A requests 100 4GB containers. It initially, or after a little
while, gets 20 containers.
Later, application B
than have my containers implement more
logic.
On Fri, Aug 23, 2013 at 11:17 PM, John Lilley john.lil...@redpoint.net wrote:
Harsh,
Thanks for the clarification. I would find it very convenient in this case
to have my custom jars available in HDFS, but I can see the added complexity
needed
/browse/YARN (do let the
thread know the ID as well, in spirit of http://xkcd.com/979/)
:)
On Thu, Sep 5, 2013 at 11:41 PM, John Lilley john.lil...@redpoint.net wrote:
Harsh,
Thanks as usual for your sage advice. I was hoping to avoid actually
installing anything on individual Hadoop nodes
, August 22, 2013 6:25 PM
To: user@hadoop.apache.org
Subject: Re: yarn-site.xml and aux-services
Auxiliary services are essentially administrator-configured services. So, they
have to be set up at install time - before NM is started.
+Vinod
On Thu, Aug 22, 2013 at 1:38 PM, John Lilley
john.lil
-service
that belonged with an AM, how would one do it?
John
-Original Message-
From: John Lilley [mailto:john.lil...@redpoint.net]
Sent: Wednesday, June 05, 2013 11:41 AM
To: user@hadoop.apache.org
Subject: RE: yarn-site.xml and aux-services
Wow, thanks. Is this documented anywhere other
In HDFS, who tracks the filesystem path on the datanode of each block? The
namenode or the datanode? If datanode, where is that table stored?
John
I was looking at the HDFS block storage and noticed a couple things (1) all
block files are in a flat directory structure (2) there is a meta file for
each block file. This leads me to ask:
-- Where can I find good reading that describes this level of HDFS internals?
-- Is the flat storage
) are not for Public access.
Please do not use them in production. The only APIs we care not to change
incompatibly are the FileContext and FileSystem APIs. They provide much of
what you want - if not, log a JIRA.
On Fri, Jul 5, 2013 at 11:40 PM, John Lilley john.lil...@redpoint.net wrote:
I've seen
I've seen mentioned that you can access HDFS via ClientProtocol, as in:
ClientProtocol namenode = DFSClient.createNamenode(conf);
LocatedBlocks lbs = namenode.getBlockLocations(path, start, length);
But we use:
fs = FileSystem.get(URI, conf);
filestatus = fs.getFileStatus(path);
Manickam,
HDFS supports append; it is the command-line client that does not.
You can write a Java application that opens an HDFS-based file for append, and
use that instead of the hadoop command line.
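A minimal sketch of that (assumes the cluster permits appends; the path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendDelta {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Opens the existing file for append and adds one record at the end.
    FSDataOutputStream out = fs.append(new Path("/data/events.log"));
    out.write("one more delta record\n".getBytes("UTF-8"));
    out.close();
  }
}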
However, this doesn't completely answer your original question: How do I move
only the delta
Arun,
I don't know how to interpret the release schedule from the JIRA. It says
that the patch targets 2.1.0 and is checked into trunk; does that mean
it is likely to be rolled into the first Hadoop 2 GA or will it have to wait
for another cycle?
Thanks,
John
From: Arun C Murthy
,
Tariq
cloudfront.blogspot.com
On Tue, Jul 2, 2013 at 4:39 AM, John Lilley
john.lil...@redpoint.net wrote:
I've seen some benchmarks where replication=1 runs at about 50MB/sec and
replication=3 runs at about 33MB/sec, but I can't seem
(i.e. the 'yarn.nodemanager.local-dirs' configuration) accordingly,
with the app id and container id, and this will be cleaned up after the app
completes. You need to make use of this persisted data before completing the
application.
Thanks
Devaraj k
From: John Lilley [mailto:john.lil
I'm sure this has been asked a zillion times, so please just point me to the
JIRA comments: is there a feature underway to allow for re-writing of HDFS file
sections?
Thanks
John
I would like to hear your experiences working with large JSON data sets,
specifically:
1) How large is each JSON document?
2) Do they tend to be a single JSON doc per file, or multiples per file?
3) Do the JSON schemas change over time?
4) Are there interesting public data
I have YARN tasks that benefit from multicore scaling. However, they don't
*always* use more than one core. I would like to allocate containers based
only on memory, and let each task use as many cores as needed, without
allocating exclusive CPU slots in the scheduler. For example, on an
Geelong,
1. These files will probably be some standard format like .gz or .bz2 or
.zip. In that case, pick an appropriate InputFormat. See e.g.
http://cotdp.com/2012/07/hadoop-processing-zip-files-in-mapreduce/,
Blah blah,
Can you build and run the DistributedShell example? If it does not run
correctly, this would tend to implicate your configuration. If it runs
correctly, then your code is suspect.
John
From: blah blah [mailto:tmp5...@gmail.com]
Sent: Tuesday, June 25, 2013 6:09 PM
To:
that YARN should have available.
From: John Lilley john.lil...@redpoint.net
To: user@hadoop.apache.org
Sent: Tuesday, July 2, 2013 10:41 AM
Subject: RE
...@microsoft.com wrote:
I believe this is the default behavior.
By default, only memory limit on resources is enforced.
The capacity scheduler will use DefaultResourceCalculator to compute resource
allocation for containers by default, which also does not take CPU into account.
-Chuan
From: John
understand cgroups' CPU
control - does it statically mask cores available to processes, or does it set
up a prioritization for access to all available cores?
Thanks,
John
From: John Lilley [mailto:john.lil...@redpoint.net]
Sent: Tuesday, July 02, 2013 1:12 PM
To: user@hadoop.apache.org
Subject: RE
Is there any convention for clients/applications wishing to use temporary file
space in HDFS? For example, my application wants to:
1) Load data into some temporary space in HDFS as an external client
2) Run an AM, which produces HDFS output (also in the temporary space)
3)
, it will have access to all 8
cores. If 3 other tasks that requested a single vcore are later placed on the
same node, and all tasks are using as much CPU as they can get their hands on,
then each of the tasks will get 2 cores of CPU-time.
On Tue, Jul 2, 2013 at 12:12 PM, John Lilley
john.lil
regular writes and append. Random write is not supported. I
do not know of any feature/jira that is underway to support this feature.
On Tue, Jul 2, 2013 at 9:01 AM, John Lilley
john.lil...@redpoint.net wrote:
I'm sure this has been asked a zillion times, so please
Do the HDFS file-reader classes perform internal buffering?
Thanks
John
I don't think this can be done in a single map/reduce pass.
Here the author discusses an implementation in PIG:
http://techblug.wordpress.com/2011/08/07/transitive-closure-in-pig/
john
From: parnab kumar [mailto:parnab.2...@gmail.com]
Sent: Thursday, June 13, 2013 10:42 PM
To: user@hadoop.apache.org
Sorry this is the link I meant:
http://hortonworks.com/blog/transitive-closure-in-apache-pig/
john
From: John Lilley [mailto:john.lil...@redpoint.net]
Sent: Sunday, June 16, 2013 1:02 PM
To: user@hadoop.apache.org
Subject: RE: how to design the mapper and reducer for the below problem
I don't
You basically have a record similarity scoring and linking problem -- common
in data-quality software like ours. This could be thought of as computing the
cross-product of all records, counting the number of hash keys in common, and
then outputting those that exceed a threshold. This is very
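One common way to avoid materializing the full cross-product is to block on each hash key and only pair up records that share a key; a hedged MR sketch (the record layout here is hypothetical: recordId TAB key1,key2,...):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class KeyEmitMapper extends Mapper<Object, Text, Text, Text> {
  protected void map(Object key, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] parts = line.toString().split("\t");
    for (String hk : parts[1].split(",")) {
      ctx.write(new Text(hk), new Text(parts[0]));   // (hashKey, recordId)
    }
  }
}

class CandidatePairReducer extends Reducer<Text, Text, Text, Text> {
  protected void reduce(Text hashKey, Iterable<Text> ids, Context ctx)
      throws IOException, InterruptedException {
    List<String> buf = new ArrayList<String>();
    for (Text id : ids) buf.add(id.toString());
    // Every pair sharing this key is a candidate; score/count downstream.
    for (int i = 0; i < buf.size(); i++)
      for (int j = i + 1; j < buf.size(); j++)
        ctx.write(new Text(buf.get(i)), new Text(buf.get(j)));
  }
}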
-Original Message-
From: Albert Chu [mailto:ch...@llnl.gov]
Sent: Wednesday, June 12, 2013 2:27 PM
To: user@hadoop.apache.org
Subject: RE: Shuffle design: optimization tradeoffs
On Wed, 2013-06-12 at 18:08 +, John Lilley wrote:
In reading this link as well as the sailfish report, it strikes me
allocation
By default, the ResourceManager will try to give you a container on that node,
rack or anywhere (in that order).
We recently added ability to whitelist or blacklist nodes to allow for more
control.
Arun
On Jun 12, 2013, at 8:03 AM, John Lilley wrote:
If I request a container on a node
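For the blacklist side mentioned above, a one-line sketch (assumes a started AMRMClient and that your version carries updateBlacklist):

import java.util.Arrays;
import org.apache.hadoop.yarn.client.api.AMRMClient;

public class Blacklist {
  // Tell the RM not to place this AM's containers on the given host.
  static void exclude(AMRMClient<?> am, String host) {
    am.updateBlacklist(Arrays.asList(host), null /* removals */);
  }
}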
with
HDFS. But it does also bring drawbacks.
Regards
Bertrand
On Thu, Jun 13, 2013 at 7:57 PM, John Lilley
john.lil...@redpoint.net wrote:
When MR assigns data splits to map tasks, does it assign a set of
non-contiguous blocks to one map? The reason I ask
When MR assigns data splits to map tasks, does it assign a set of
non-contiguous blocks to one map? The reason I ask is, thinking through the
problem, if I were the MR scheduler I would attempt to hand a map task a bunch
of blocks that all exist on the same datanode, and then schedule the map
...@llnl.gov]
Sent: Tuesday, June 11, 2013 3:32 PM
To: user@hadoop.apache.org
Subject: Re: Shuffle design: optimization tradeoffs
On Tue, 2013-06-11 at 16:00 +, John Lilley wrote:
I am curious about the tradeoffs that drove design of the
partition/sort/shuffle (Elephant book p 208). Doubtless
There are kind of two parts to this. The semantics of MapReduce promise that
all tuples sharing the same key value are sent to the same reducer, so that you
can write useful MR applications that do things like “count words” or
“summarize by date”. In order to accomplish that, the shuffle
Harsh,
Thanks so much for your thorough explanation.
John
-Original Message-
From: Harsh J [mailto:ha...@cloudera.com]
Sent: Wednesday, June 05, 2013 8:18 PM
To: user@hadoop.apache.org
Subject: Re: Hadoop JARs and Eclipse
Hi John,
On Thu, Jun 6, 2013 at 1:21 AM, John Lilley john.lil
What resources are you trying to access?
Do you want to monitor the system status?
Do you want to read/write HDFS as a client?
Do you want to run your application on the Hadoop cluster?
John
From: Brian Mason [mailto:br...@gabey.com]
Sent: Thursday, June 06, 2013 6:52 AM
To:
(using the m2e plugin) would be equivalent
to adding all relevant Hadoop JARs to Eclipse.
On Wed, Jun 5, 2013 at 11:07 PM, John Lilley john.lil...@redpoint.net wrote:
Well, I've failed and given up on building Hadoop in Eclipse. Too
many things go wrong with Maven plugins and m2e.
But Hadoop
Do tasks that spawn from a YARN AM have a default working directory? Does
stdout/stderr get captured anywhere by default?
I ask because I am setting up tests with the distributed shell AM and want to
know if basic commands (e.g. ls) will send stdout/stderr somewhere I can see at
the end.
/jira/browse/YARN-772 to track
documentation for this and other such features which are available.
thanks,
Arun
On Jun 6, 2013, at 12:44 PM, John Lilley wrote:
In debugging a custom AM and its tasks, I want to create log4j settings so
that the log output goes to someplace standard where I can
What service addresses and ports does a YARN ApplicationMaster need to know
about?
Thanks,
John
Well, I've failed and given up on building Hadoop in Eclipse. Too many things
go wrong with Maven plugins and m2e.
But Hadoop builds just fine using the command-line, and it runs using Sandy's
development-node instructions.
My strategy now is
1) Tell Eclipse about all of the Hadoop JARs
<property>
<name>yarn.nodemanager.aux-services.bar.class</name>
<value>com.mypack.MyAuxServiceClassForBar</value>
</property>
On Wed, Jun 5, 2013 at 8:42 PM, John Lilley john.lil...@redpoint.net wrote:
Good, I was hoping that would be the case. But what are the mechanics of it?
Do I just add another entry? And what exactly
Is it possible to use Hadoop streaming or Hadoop pipes for multiple inputs
and outputs? Consider for example an equality join that accepts two inputs
(left, right), and produces three outputs (left unmatched, right unmatched,
joined). That's not actually what I'm trying to implement, but
From: John Lilley john.lil...@redpoint.net
To: user@hadoop.apache.org; Mahmood Naderan nt_mahm...@yahoo.com
Sent: Tuesday, June 4, 2013 3:28 AM
Subject: RE
Answered my own question. The Eclipse installed with CentOS 6 (or via yum)
seems to have this problem. A direct download of Eclipse for Java EE works
fine.
John
From: John Lilley [mailto:john.lil...@redpoint.net]
Sent: Monday, June 03, 2013 5:49 PM
To: user@hadoop.apache.org; Deepak Vohra
patch is related to the issue cited.
https://issues.apache.org/jira/browse/HADOOP-9489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
thanks,
Deepak
From: John Lilley john.lil...@redpoint.net
To: user
Rahul,
This is a very good question, and one we are grappling with currently in our
application port. I think there are a lot of legacy data-processing
applications like ours which would benefit by a port to Hadoop. However,
because we have a great load of C++, it is not necessarily a good
:32 AM, John Lilley
john.lil...@redpoint.net wrote:
Thanks for helping me build Hadoop! I'm through compiling and installing the
Maven plugins into Eclipse. I could use some pointers for the next steps I want
to take,
which are:
* Deploy the simplest development
at 11:33 AM, John Lilley
john.lil...@redpoint.net wrote:
Sorry if this is a dumb question, but I'm not sure where to start. I am
following BUILDING.txt instructions for source checked out today using git:
git clone
git://git.apache.org/hadoop-common.git
Thanks for helping me build Hadoop! I'm through compiling and installing the
Maven plugins into Eclipse. I could use some pointers for the next steps I want
to take,
which are:
* Deploy the simplest development only cluster (single node?) and
learn how to debug within it. I read about the
I am having trouble finding a definitive answer about OpenJDK vs Sun JDK in
regards to building Hadoop. This:
http://wiki.apache.org/hadoop/HadoopJavaVersions
Indicates that OpenJDK is not recommended, but is that an authoritative answer?
BUILDING.txt states no preference.
Thanks
John
, May 29, 2013 9:34 AM
To: user@hadoop.apache.org
Subject: Re: OpenJDK?
Yes. Use Sun/Oracle JDK
I have had memory issues while using Oozie. When I replaced OpenJDK with Sun
JDK 6, the memory issue was resolved.
Thanks,
Lenin
On Wed, May 29, 2013 at 8:22 PM, John Lilley
john.lil
Sorry if this is a dumb question, but I'm not sure where to start. I am
following BUILDING.txt instructions for source checked out today using git:
git clone git://git.apache.org/hadoop-common.git Hadoop
Following build steps and adding -X for more logging:
mvn compile -X
But I get this error
I was reading from the HortonWorks blog:
How MapReduce shuffle takes advantage of NM's Auxiliary-services
The Shuffle functionality required to run a MapReduce (MR) application is
implemented as an Auxiliary Service. This service starts up a Netty Web Server,
and knows how to handle MR specific
On Thu, May 23, 2013 at 11:01 PM, John Lilley
john.lil...@redpoint.net wrote:
I’ve read about splittable compressed formats in Hadoop. Are any of these
formats also “seekable” (in other words, be able to seek to an absolute
location in the uncompressed data
on the Hadoop 1.0.4 source code, especially the
ReduceTask.java file.
yours,
Ling Kun
On Wed, May 22, 2013 at 10:57 PM, John Lilley
john.lil...@redpoint.net wrote:
U, is that also the limit for the number of simultaneous connections? In
general, one does not need
What happens when MR produces data splits, and those splits don't align on
block boundaries? I've read that MR will attempt to make data splits near
block boundaries to improve data locality, but isn't there always some slop
where records straddle the block boundaries, resulting in an extra
How does SequenceFile guarantee that the sync marker does not appear in the
data?
John
Thanks to previous kind answers and more reading in the elephant book, I now
understand that mapper tasks place partitioned results into local files that
are served up to reducers via HTTP:
The output file's partitions are made available to the reducers over HTTP. The
maximum number of worker
On 21.05.2013 at 19:57, John Lilley john.lil...@redpoint.net wrote:
When MapReduce enters shuffle to partition the tuples, I am assuming that it
writes intermediate data to HDFS. What replication factor is used for those
temporary files?
john
--
Kai Voigt
k...@123
We intend to use the YARN APIs fairly soon. Are there notable differences in
YARNs classes, interfaces or semantics between 0.23 and 2.0? It seems to be
supported on both versions.
Thanks,
John
As mentioned by Bertrand, Hadoop, The Definitive Guide, is well... really
definitive :) place to start. It is pretty thorough for starters, and once you
have gone through it, the code will start making more sense too.
Regards,
Shahab
On Wed, May 22, 2013 at 10:33 AM, John Lilley
john.lil
the no. of copying threads for
copy.
tasktracker.http.threads=40
Thanks,
Rahul
On Wed, May 22, 2013 at 8:16 PM, John Lilley
john.lil...@redpoint.net wrote:
This brings up another nagging question I’ve had for some time. Between HDFS
and shuffle, there seems to be the potential
from getting frozen and will be
supported compatibly for foreseeable future.
Details to track here:
https://issues.apache.org/jira/browse/YARN-386
hth,
Arun
On May 22, 2013, at 7:38 AM, John Lilley wrote:
We intend to use the YARN APIs fairly soon. Are there notable differences in
YARNs
I am very interested in a deep understanding of the MapReduce Shuffle phase
algorithm and implementation. Are there whitepapers I could read for an
explanation? Or another mailing list for this question? Obviously there is
the code ;-)
john
I am trying to determine if it is feasible for multiple nodes to alternate
appends to a shared file in HDFS.
Can someone tell me, what is the overhead of an open/append/close cycle?
If multiple nodes attempt open-for-append at once, the losers queue nicely
waiting for the winner to close?
--john
When MapReduce enters shuffle to partition the tuples, I am assuming that it
writes intermediate data to HDFS. What replication factor is used for those
temporary files?
john
On Fri, May 17, 2013 at 3:38 AM, John Lilley
john.lil...@redpoint.net
wrote:
I seem to recall reading that when a MapReduce task writes a file,
the blocks of the file are always written to local disk, and
replicated to other nodes. If this is true, is this also true for
non-MR applications
of the actions.
Requests to Namenode are then made through ClientProtocol.
An HDFS committer would be able to give you an affirmative answer.
On Sun, Mar 31, 2013 at 11:27 AM, John Lilley
john.lil...@redpoint.net wrote:
From: Ted Yu [mailto:yuzhih...@gmail.com]
are working in a heterogeneous environment.
Cheers,
Tim
From: John Lilley john.lil...@redpoint.net
To: user@hadoop.apache.org
Sent: Friday, May 17, 2013 8:35:53 AM
Subject: RE: Distribution of native