worker affinity and YARN scheduling

2013-11-11 Thread John Lilley
I would like to better understand YARN's scheduling with named workers and relaxedLocality==true. For example, suppose that I have a three-node cluster with nodes A,B,C. Each node has capacity to run two tasks of the kind I desire simultaneously. My AM then requests nine containers with

HDP 2.0 GA?

2013-11-05 Thread John Lilley
I noticed that HDP 2.0 is available for download here: http://hortonworks.com/products/hdp-2/?b=1#install Is this the final GA version that tracks Apache Hadoop 2.2? Sorry I am just a little confused by the different numbering schemes. Thanks John

RE: temporary file locations for YARN applications

2013-10-21 Thread John Lilley
declared publicly stable though, but we can do that over a JIRA. On Mon, Oct 21, 2013 at 2:05 AM, John Lilley john.lil...@redpoint.net wrote: Harsh, thanks for the quick response. These files don't need to be on the DFS (although we use that too). These are local files used during sorting

RE: temporary file locations for YARN applications

2013-10-21 Thread John Lilley
and are for the app's user to utilize. You shouldn't have any permission issues working within them. The LocalDirAllocator is still somewhat MR-bound, but you should still be able to make it work by giving it a config with the values it needs. On Mon, Oct 21, 2013 at 8:49 PM, John Lilley john.lil

RE: temporary file locations for YARN applications

2013-10-21 Thread John Lilley
:49 PM, John Lilley john.lil...@redpoint.net wrote: Thanks again. This gives me a lot of options; we will see what works. Do you know if there are any permissions issues if we directly access the folders of LOCAL_DIR_ENV? Regarding LocalDirAllocator, I see its

yarn.nodemanager.vmem-check-enabled

2013-10-20 Thread John Lilley
Is this option typically enabled in practice? Is there a way for a specific AM to disable it, or is it site-wide? We are finding that our YARN application, because it launches many threads, each with a somewhat large stack, has a large vmem/pmem ratio (say, 2GB / 200MB). Unless we disable
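
For what it's worth, this check is controlled on the NodeManager side (i.e. site-wide, not per AM) in Hadoop 2.x. A sketch of the relevant yarn-site.xml fragment, assuming the stock property names; the ratio value is illustrative:

```xml
<!-- yarn-site.xml (NodeManager side; values here are illustrative) -->
<property>
  <!-- Turn off the virtual-memory check entirely (it is on by default) -->
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <!-- Or instead: raise the permitted vmem/pmem ratio for containers -->
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>10</value>
</property>
```

Raising the ratio is the gentler option when only the vmem/pmem ratio (not actual memory use) is the problem.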

JIRA submitted

2013-10-20 Thread John Lilley
I submitted a JIRA requesting FSDataOutputStream.write(ByteBuffer) method. HDFS-5395 This is my first JIRA submission for Hadoop, so please comment if this should be different. Thanks John

temporary file locations for YARN applications

2013-10-20 Thread John Lilley
We have a pure YARN application (no MapReduce) that has need to store a significant amount of temporary data. How can we know the best location for these files? How can we ensure that our YARN tasks have write access to these locations? Is this something that must be configured outside of

RE: temporary file locations for YARN applications

2013-10-20 Thread John Lilley
configuration for. Do the files need to be on a distributed FS or a local one? On Sun, Oct 20, 2013 at 8:54 PM, John Lilley john.lil...@redpoint.net wrote: We have a pure YARN application (no MapReduce) that has need to store a significant amount of temporary data. How can we know the best location

RE: Querying cluster nodes list

2013-10-17 Thread John Lilley
Ah, never mind... it is getNodeReports(). john From: John Lilley [mailto:john.lil...@redpoint.net] Sent: Thursday, October 17, 2013 1:44 PM To: user@hadoop.apache.org Subject: Querying cluster nodes list I thought mistakenly that getClusterMetrics() would return information about the cluster's

Querying cluster nodes list

2013-10-17 Thread John Lilley
I thought mistakenly that getClusterMetrics() would return information about the cluster's nodes, or about a queue's nodes, but this doesn't seem to be true - it is only a count. How can a YARN application query the available node list on a cluster and what resources are configured on each

FSDataOutputStream ByteBuffer write

2013-10-15 Thread John Lilley
In the Hadoop 2.1 docs, I see there is a read(ByteBuffer) call in FSDataInputStream, but I don't see a write(ByteBuffer) call documented for FSDataOutputStream. Is there a (fast) way to write a ByteBuffer to HDFS files? Thanks John

RE: MapReduce task-worker assignment

2013-10-08 Thread John Lilley
* Arun On Oct 5, 2013, at 3:12 PM, John Lilley john.lil...@redpoint.net wrote: Is there a description of how MapReduce under Hadoop 2.0 assigns mapper tasks to preferred nodes? I think that someone on the list mentioned previously that it attempted to assign one

RE: Hadoop 2.x with Eclipse

2013-10-06 Thread John Lilley
Karim, I am not an experienced Hadoop programmer, but what I found was that building and debugging Hadoop under Eclipse was very difficult, and I was never able to make it work correctly. I suggest using the well-documented command-line Maven build, installing Hadoop from that build, and running

MapReduce task-worker assignment

2013-10-05 Thread John Lilley
MapReduce try to obtain an even task assignment while optimizing data locality? Thanks, John Lilley Chief Architect, RedPoint Global Inc. 1515 Walnut Street | Suite 200 | Boulder, CO 80302 T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077 Skype: jlilley.redpoint | john.lil

RE: Non data-local scheduling

2013-10-05 Thread John Lilley
Is this option set on a per-application-instance basis or is it a cluster-wide setting (or both)? Is this a MapReduce-specific issue, or a YARN issue? I don't understand how the problem arises in the first place. For example, if I have an idle cluster with 10 nodes and each node has four

getFileBlockLocations

2013-10-04 Thread John Lilley
When I call getFileBlockLocations() on a DFS, will it return the blocks for currently-inactive nodes? If so, how can I filter out the unavailable blocks? Or more generally, how do I get the list of node status? Is that ApplicationClientProtocol.getClusterNodes()? Thanks, John

RE: Task status query

2013-10-01 Thread John Lilley
pass its host:port to the task as part of either the cmd-line for the task or in its env. That is what is done by the MR AM. hth, Arun On Sep 21, 2013, at 6:52 AM, John Lilley john.lil...@redpoint.net wrote: Thanks Harsh! The data-transport format is pretty easy

RE: How to best decide mapper output/reducer input for a huge string?

2013-09-23 Thread John Lilley
recreate the tables with predefined splits to create more regions. Thanks, Rahul On Sun, Sep 22, 2013 at 4:38 AM, John Lilley john.lil...@redpoint.net wrote: Pavan, How large are the rows in HBase? 22 million rows is not very much but you mentioned huge strings. Can

RE: Task status query

2013-09-21 Thread John Lilley
there that make it easy to do such things once you define what you want in a schema/spec form. On Fri, Sep 20, 2013 at 5:32 PM, John Lilley john.lil...@redpoint.net wrote: Thanks Harsh. Is this protocol something that is available to all AMs/tasks? Or is it up to each AM/task pair to develop

connection overload strategies

2013-09-21 Thread John Lilley
If my YARN application tasks are all reading/writing HDFS simultaneously and some node is unable to honor a connection request because it is overloaded, what happens? I've seen HDFS attempt to retry connections. For that matter, how does MR under YARN deal with connection overload during the

RE: How to get number of data nodes as a hadoop client

2013-09-21 Thread John Lilley
Never mind, it's in the ApplicationClientProtocol class. John From: John Lilley [mailto:john.lil...@redpoint.net] Sent: Thursday, September 19, 2013 6:12 PM To: user@hadoop.apache.org Subject: How to get number of data nodes as a hadoop client How does a Hadoop client query the number

RE: How to best decide mapper output/reducer input for a huge string?

2013-09-21 Thread John Lilley
Pavan, How large are the rows in HBase? 22 million rows is not very much but you mentioned huge strings. Can you tell which part of the processing is the limiting factor (read from HBase, mapper output, reducers)? John From: Pavan Sudheendra [mailto:pavan0...@gmail.com] Sent: Saturday,

RE: Task status query

2013-09-20 Thread John Lilley
progress. On Fri, Sep 20, 2013 at 12:18 AM, John Lilley john.lil...@redpoint.net wrote: How does a YARN application master typically query ongoing status (like percentage completion) of its tasks? I would like to be able to ultimately relay information to the user like: 100 tasks

HDFS file-create performance

2013-09-19 Thread John Lilley
Are there any rough numbers one can give me regarding the latency of creating, writing, and closing a small HDFS-based file? Does replication have a big impact? I am trying to decide whether to communicate some modestly-sized (~200KB) information via HDFS files or go to the trouble of

Task status query

2013-09-19 Thread John Lilley
How does a YARN application master typically query ongoing status (like percentage completion) of its tasks? I would like to be able to ultimately relay information to the user like: 100 tasks are scheduled 10 tasks are complete 4 tasks are running and they are (4%, 10%, 50%, 70%) complete But,

RE: HDFS performance with and without replication

2013-09-19 Thread John Lilley
of synchronous and sequenced write pipelines in HDFS). Reads would be the same, unless you're unable to schedule a rack-local read (at worst case) due to only one (busy) rack holding it. On Sun, Sep 15, 2013 at 10:38 PM, John Lilley john.lil...@redpoint.net wrote: In our YARN application, we

HDFS performance with and without replication

2013-09-15 Thread John Lilley
In our YARN application, we are considering whether to store temporary data with replication=1 or replication=3 (or give the user an option). Obviously there is a tradeoff between reliability and performance, but on smaller clusters I'd expect this to be less of an issue. What is the

RE: Scheduler question

2013-09-13 Thread John Lilley
A's containers finish. It's also possible to configure the schedulers to use preemption to make this kind of thing go a lot faster. Does that make some sense? -Sandy On Mon, Sep 9, 2013 at 7:21 AM, John Lilley john.lil...@redpoint.net wrote: Do the Hadoop 2.0

Scheduler question

2013-09-08 Thread John Lilley
Do the Hadoop 2.0 YARN scheduler(s) deal with situations like the following? Hadoop cluster of 10 nodes, with 8GB each available for containers. There is only one queue. Application A requests 100 4GB containers. It initially, or after a little while, gets 20 containers. Later, application B
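
The container arithmetic in the scenario above can be checked directly (a toy sketch; the numbers are taken from the message):

```python
# Toy capacity arithmetic for the scenario described above:
# a 10-node cluster with 8 GB per node for containers, and
# application A requesting 100 containers of 4 GB each.
nodes = 10
mem_per_node_gb = 8
container_gb = 4

containers_per_node = mem_per_node_gb // container_gb    # 2 containers fit per node
cluster_capacity = nodes * containers_per_node           # 20 containers cluster-wide

requested = 100
granted = min(requested, cluster_capacity)               # A can hold at most 20 at once
outstanding = requested - granted                        # 80 requests stay pending

print(granted, outstanding)  # 20 80
```

As the reply in the thread notes, what happens to the 80 pending requests once application B arrives depends on the scheduler (fair share vs. capacity, with or without preemption).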

RE: yarn-site.xml and aux-services

2013-09-05 Thread John Lilley
than have my containers implement more logic. On Fri, Aug 23, 2013 at 11:17 PM, John Lilley john.lil...@redpoint.net wrote: Harsh, Thanks for the clarification. I would find it very convenient in this case to have my custom jars available in HDFS, but I can see the added complexity needed

RE: yarn-site.xml and aux-services

2013-09-05 Thread John Lilley
/browse/YARN (do let the thread know the ID as well, in spirit of http://xkcd.com/979/) :) On Thu, Sep 5, 2013 at 11:41 PM, John Lilley john.lil...@redpoint.net wrote: Harsh, Thanks as usual for your sage advice. I was hoping to avoid actually installing anything on individual Hadoop nodes

RE: yarn-site.xml and aux-services

2013-08-23 Thread John Lilley
, August 22, 2013 6:25 PM To: user@hadoop.apache.org Subject: Re: yarn-site.xml and aux-services Auxiliary services are essentially administrator-configured services. So, they have to be set up at install time - before the NM is started. +Vinod On Thu, Aug 22, 2013 at 1:38 PM, John Lilley john.lil

RE: yarn-site.xml and aux-services

2013-08-22 Thread John Lilley
-service that belonged with an AM, how would one do it? John -Original Message- From: John Lilley [mailto:john.lil...@redpoint.net] Sent: Wednesday, June 05, 2013 11:41 AM To: user@hadoop.apache.org Subject: RE: yarn-site.xml and aux-services Wow, thanks. Is this documented anywhere other

HDFS blockIDs vs paths

2013-07-18 Thread John Lilley
In HDFS, who tracks the filesystem path on the datanode of each block? The namenode or the datanode? If datanode, where is that table stored? John

HDFS block storage

2013-07-15 Thread John Lilley
I was looking at the HDFS block storage and noticed a couple things (1) all block files are in a flat directory structure (2) there is a meta file for each block file. This leads me to ask: -- Where can I find good reading that describes this level of HDFS internals? -- Is the flat storage

RE: Accessing HDFS

2013-07-15 Thread John Lilley
) are not for public access. Please do not use them in production. The only APIs we care not to change incompatibly are the FileContext and FileSystem APIs. They provide much of what you want - if not, log a JIRA. On Fri, Jul 5, 2013 at 11:40 PM, John Lilley john.lil...@redpoint.net wrote: I've seen

Accessing HDFS

2013-07-05 Thread John Lilley
I've seen mentioned that you can access HDFS via ClientProtocol, as in: ClientProtocol namenode = DFSClient.createNamenode(conf); LocatedBlocks lbs = namenode.getBlockLocations(path, start, length); But we use: fs = FileSystem.get(URI, conf); filestatus = fs.getFileStatus(path);

RE: How to update a file which is in HDFS

2013-07-04 Thread John Lilley
Manickam, HDFS supports append; it is the command-line client that does not. You can write a Java application that opens an HDFS-based file for append, and use that instead of the hadoop command line. However, this doesn't completely answer your original question: How do I move only the delta

RE: Requesting containers on a specific host

2013-07-04 Thread John Lilley
Arun, I don't know how to interpret the release schedule from the JIRA. It says that the patch targets 2.1.0 and is checked into trunk; does that mean it is likely to be rolled into the first Hadoop 2 GA, or will it have to wait for another cycle? Thanks, John From: Arun C Murthy

RE: intermediate results files

2013-07-02 Thread John Lilley
, Tariq cloudfront.blogspot.com On Tue, Jul 2, 2013 at 4:39 AM, John Lilley john.lil...@redpoint.net wrote: I've seen some benchmarks where replication=1 runs at about 50MB/sec and replication=3 runs at about 33MB/sec, but I can't seem

RE: YARN tasks and child processes

2013-07-02 Thread John Lilley
(i.e. 'yarn.nodemanager.local-dirs' configuration) accordingly with the app id / container id, and this will be cleaned up after the app completion. You need to make use of this persisted data before completing the application. Thanks Devaraj k From: John Lilley [mailto:john.lil

HDFS file section rewrite

2013-07-02 Thread John Lilley
I'm sure this has been asked a zillion times, so please just point me to the JIRA comments: is there a feature underway to allow for re-writing of HDFS file sections? Thanks John

typical JSON data sets

2013-07-02 Thread John Lilley
I would like to hear your experiences working with large JSON data sets, specifically: 1) How large is each JSON document? 2) Do they tend to be a single JSON doc per file, or multiples per file? 3) Do the JSON schemas change over time? 4) Are there interesting public data

Containers and CPU

2013-07-02 Thread John Lilley
I have YARN tasks that benefit from multicore scaling. However, they don't *always* use more than one core. I would like to allocate containers based only on memory, and let each task use as many cores as needed, without allocating exclusive CPU slots in the scheduler. For example, on an

RE: some idea about the Data Compression

2013-07-02 Thread John Lilley
Geelong, 1. These files will probably be some standard format like .gz or .bz2 or .zip. In that case, pick an appropriate InputFormat. See e.g. http://cotdp.com/2012/07/hadoop-processing-zip-files-in-mapreduce/,

RE: Yarn HDFS and Yarn Exceptions when processing larger datasets.

2013-07-02 Thread John Lilley
Blah blah, Can you build and run the DistributedShell example? If it does not run correctly this would tend to implicate your configuration. If it run correctly then your code is suspect. John From: blah blah [mailto:tmp5...@gmail.com] Sent: Tuesday, June 25, 2013 6:09 PM To:

RE: YARN tasks and child processes

2013-07-02 Thread John Lilley
that YARN should have available. From: John Lilley john.lil...@redpoint.net To: user@hadoop.apache.org Sent: Tuesday, July 2, 2013 10:41 AM Subject: RE

RE: Containers and CPU

2013-07-02 Thread John Lilley
...@microsoft.com wrote: I believe this is the default behavior. By default, only memory limit on resources is enforced. The capacity scheduler will use DefaultResourceCalculator to compute resource allocation for containers by default, which also does not take CPU into account. -Chuan From: John

RE: Containers and CPU

2013-07-02 Thread John Lilley
understand cgroups' CPU control - does it statically mask cores available to processes, or does it set up a prioritization for access to all available cores? Thanks, John From: John Lilley [mailto:john.lil...@redpoint.net] Sent: Tuesday, July 02, 2013 1:12 PM To: user@hadoop.apache.org Subject: RE

HDFS temporary file locations

2013-07-02 Thread John Lilley
Is there any convention for clients/applications wishing to use temporary file space in HDFS? For example, my application wants to: 1) Load data into some temporary space in HDFS as an external client 2) Run an AM, which produces HDFS output (also in the temporary space) 3)

RE: Containers and CPU

2013-07-02 Thread John Lilley
, it will have access to all 8 cores. If 3 other tasks that requested a single vcore are later placed on the same node, and all tasks are using as much CPU as they can get their hands on, then each of the tasks will get 2 cores of CPU-time. On Tue, Jul 2, 2013 at 12:12 PM, John Lilley john.lil

RE: HDFS file section rewrite

2013-07-02 Thread John Lilley
regular writes and append. Random write is not supported. I do not know of any feature/JIRA that is underway to support this feature. On Tue, Jul 2, 2013 at 9:01 AM, John Lilley john.lil...@redpoint.net wrote: I'm sure this has been asked a zillion times, so please

HDFS file reader and buffering

2013-06-16 Thread John Lilley
Do the HDFS file-reader classes perform internal buffering? Thanks John

RE: how to design the mapper and reducer for the below problem

2013-06-16 Thread John Lilley
I don't think this can be done in a single map/reduce pass. Here the author discusses an implementation in PIG: http://techblug.wordpress.com/2011/08/07/transitive-closure-in-pig/ john From: parnab kumar [mailto:parnab.2...@gmail.com] Sent: Thursday, June 13, 2013 10:42 PM To: user@hadoop.apache.org

RE: how to design the mapper and reducer for the below problem

2013-06-16 Thread John Lilley
Sorry this is the link I meant: http://hortonworks.com/blog/transitive-closure-in-apache-pig/ john From: John Lilley [mailto:john.lil...@redpoint.net] Sent: Sunday, June 16, 2013 1:02 PM To: user@hadoop.apache.org Subject: RE: how to design the mapper and reducer for the below problem I don't
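
As the linked posts discuss, transitive closure is inherently iterative: each pass extends paths by one hop, which is why one map/reduce pass cannot do it. A toy in-memory sketch of that fixpoint iteration (not MapReduce itself, just the logic each pass repeats):

```python
# Iterative transitive closure over a set of directed edges.
# Each pass joins the current reachability set against the base edges,
# mirroring how a multi-pass MapReduce/Pig job extends paths one hop
# at a time until no new pairs appear.
def transitive_closure(edges):
    closure = set(edges)
    while True:
        new_pairs = {(a, d) for (a, b) in closure for (c, d) in edges if b == c}
        if new_pairs <= closure:          # fixpoint reached: nothing new
            return closure
        closure |= new_pairs

edges = {(1, 2), (2, 3), (3, 4)}
result = transitive_closure(edges)
# adds (1, 3), (2, 4), and (1, 4) to the original three edges
print(sorted(result))
```

A chain of length n needs on the order of log n doubling passes (or n single-hop passes), which is the cost the Pig implementation has to pay as extra MR jobs.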

RE: How to design the mapper and reducer for the following problem

2013-06-16 Thread John Lilley
You basically have a record similarity scoring and linking problem -- common in data-quality software like ours. This could be thought of as computing the cross-product of all records, counting the number of hash keys in common, and then outputting those that exceed a threshold. This is very
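
The cross-product-and-count approach described above can be sketched in a few lines (the record data here is hypothetical; a real implementation would use blocking keys to avoid the full O(n²) cross-product):

```python
from itertools import combinations

# Each record carries a set of hash keys; two records "link" when they
# share at least `threshold` keys. This is the naive cross-product
# scoring described above -- quadratic in the record count, which is
# why production systems partition/block before comparing.
def link_records(records, threshold):
    links = []
    for (id_a, keys_a), (id_b, keys_b) in combinations(records.items(), 2):
        if len(keys_a & keys_b) >= threshold:
            links.append((id_a, id_b))
    return links

records = {
    "r1": {"smith", "80302", "j"},
    "r2": {"smith", "80302", "x"},
    "r3": {"jones", "10001", "j"},
}
print(link_records(records, threshold=2))  # [('r1', 'r2')]
```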

RE: Shuffle design: optimization tradeoffs

2013-06-15 Thread John Lilley
-Original Message- From: Albert Chu [mailto:ch...@llnl.gov] Sent: Wednesday, June 12, 2013 2:27 PM To: user@hadoop.apache.org Subject: RE: Shuffle design: optimization tradeoffs On Wed, 2013-06-12 at 18:08 +, John Lilley wrote: In reading this link as well as the sailfish report, it strikes me

RE: container allocation

2013-06-14 Thread John Lilley
allocation By default, the ResourceManager will try give you a container on that node, rack or anywhere (in that order). We recently added ability to whitelist or blacklist nodes to allow for more control. Arun On Jun 12, 2013, at 8:03 AM, John Lilley wrote: If I request a container on a node

RE: Assignment of data splits to mappers

2013-06-14 Thread John Lilley
with HDFS. But it does also bring drawbacks. Regards Bertrand On Thu, Jun 13, 2013 at 7:57 PM, John Lilley john.lil...@redpoint.net wrote: When MR assigns data splits to map tasks, does it assign a set of non-contiguous blocks to one map? The reason I ask

Assignment of data splits to mappers

2013-06-13 Thread John Lilley
When MR assigns data splits to map tasks, does it assign a set of non-contiguous blocks to one map? The reason I ask is, thinking through the problem, if I were the MR scheduler I would attempt to hand a map task a bunch of blocks that all exist on the same datanode, and then schedule the map

RE: Shuffle design: optimization tradeoffs

2013-06-12 Thread John Lilley
...@llnl.gov] Sent: Tuesday, June 11, 2013 3:32 PM To: user@hadoop.apache.org Subject: Re: Shuffle design: optimization tradeoffs On Tue, 2013-06-11 at 16:00 +, John Lilley wrote: I am curious about the tradeoffs that drove design of the partition/sort/shuffle (Elephant book p 208). Doubtless

RE: Why/When partitioner is used.

2013-06-07 Thread John Lilley
There are kind of two parts to this. The semantics of MapReduce promise that all tuples sharing the same key value are sent to the same reducer, so that you can write useful MR applications that do things like “count words” or “summarize by date”. In order to accomplish that, the shuffle
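
The guarantee described above falls out of the partitioning function: every tuple with a given key hashes to the same reducer index. A minimal sketch in the spirit of Hadoop's default HashPartitioner (which takes the key's hashCode modulo the number of reducers):

```python
# Default-style hash partitioning: tuples with the same key always map
# to the same reducer index, which is what lets one reducer see every
# count for a given word or date.
def partition(key, num_reducers):
    # Hadoop's HashPartitioner masks off the sign bit of hashCode();
    # abs() plays the analogous role in this sketch.
    return abs(hash(key)) % num_reducers

# Both ("word", 1) tuples land on the same reducer, whichever one it is.
r1 = partition("word", 4)
r2 = partition("word", 4)
assert r1 == r2 and 0 <= r1 < 4
```

The shuffle then moves each partition's tuples to its reducer, which is the mechanism the rest of the message goes on to describe.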

RE: Hadoop JARs and Eclipse

2013-06-06 Thread John Lilley
Harsh, Thanks so much for your thorough explanation. John -Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: Wednesday, June 05, 2013 8:18 PM To: user@hadoop.apache.org Subject: Re: Hadoop JARs and Eclipse Hi John, On Thu, Jun 6, 2013 at 1:21 AM, John Lilley john.lil

RE: Management API

2013-06-06 Thread John Lilley
What resources are you trying to access? Do you want to monitor the system status? Do you want to read/write HDFS as a client? Do you want to run your application on the Hadoop cluster? John From: Brian Mason [mailto:br...@gabey.com] Sent: Thursday, June 06, 2013 6:52 AM To:

RE: Hadoop JARs and Eclipse

2013-06-06 Thread John Lilley
(using m2e plugin) would be equivalent to Adding all relevant Hadoop JARs to Eclipse. On Wed, Jun 5, 2013 at 11:07 PM, John Lilley john.lil...@redpoint.net wrote: Well, I've failed and given up on building Hadoop in Eclipse. Too many things go wrong with Maven plugins and m2e. But Hadoop

YARN task default output

2013-06-06 Thread John Lilley
Do tasks that spawn from a YARN AM have a default working directory? Does stdout/stderr get captured anywhere by default? I ask because I am setting up tests with the distributed shell AM and want to know if basic commands (e.g. ls) will send stdout/stderr somewhere I can see at the end.

RE: log4j defaults for AM task

2013-06-06 Thread John Lilley
/jira/browse/YARN-772 to track documentation for this and other such features which are available. thanks, Arun On Jun 6, 2013, at 12:44 PM, John Lilley wrote: In debugging a custom AM and its tasks, I want to create log4j settings so that the log output goes to someplace standard where I can

YARN servers and ports

2013-06-05 Thread John Lilley
What service addresses and ports does a YARN ApplicationMaster need to know about? Thanks, John

Hadoop JARs and Eclipse

2013-06-05 Thread John Lilley
Well, I've failed and given up on building Hadoop in Eclipse. Too many things go wrong with Maven plugins and m2e. But Hadoop builds just fine using the command-line, and it runs using Sandy's development-node instructions. My strategy now is 1) Tell Eclipse about all of the Hadoop JARs

RE: yarn-site.xml and aux-services

2013-06-05 Thread John Lilley
-services.bar.class</name> <value>com.mypack.MyAuxServiceClassForBar</value> </property> On Wed, Jun 5, 2013 at 8:42 PM, John Lilley john.lil...@redpoint.net wrote: Good, I was hoping that would be the case. But what are the mechanics of it? Do I just add another entry? And what exactly
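
The fragment being quoted pairs a service name with its implementing class in yarn-site.xml. A cleaned-up sketch, keeping the thread's hypothetical service name "bar" and class; note that the exact name of the MR shuffle service differs across 2.x releases (mapreduce.shuffle vs. mapreduce_shuffle):

```xml
<!-- yarn-site.xml: register an extra auxiliary service named "bar" -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <!-- comma-separated list of service names; keep the MR shuffle entry -->
  <value>mapreduce_shuffle,bar</value>
</property>
<property>
  <!-- each listed service needs a matching aux-services.<name>.class key -->
  <name>yarn.nodemanager.aux-services.bar.class</name>
  <value>com.mypack.MyAuxServiceClassForBar</value>
</property>
```

As noted elsewhere in the thread, these are administrator-configured settings, so the NodeManagers must be restarted after the change.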

streaming/pipes interface and multiple inputs / outputs

2013-06-05 Thread John Lilley
Is it possible to use Hadoop streaming or Hadoop pipes for multiple inputs and outputs? Consider for example an equality join that accepts two inputs (left, right), and produces three outputs (left unmatched, right unmatched, joined). That's not actually what I'm trying to implement, but

RE: HDFS interfaces

2013-06-04 Thread John Lilley
From: John Lilley john.lil...@redpoint.net To: user@hadoop.apache.org; Mahmood Naderan nt_mahm...@yahoo.com Sent: Tuesday, June 4, 2013 3:28 AM Subject: RE

RE: built hadoop! please help with next steps?

2013-06-04 Thread John Lilley
Answered my own question. The Eclipse that installs with CentOS 6 (or via yum) seems to have this problem. A direct download of Eclipse for Java EE works fine. John From: John Lilley [mailto:john.lil...@redpoint.net] Sent: Monday, June 03, 2013 5:49 PM To: user@hadoop.apache.org; Deepak Vohra

RE: built hadoop! please help with next steps?

2013-06-03 Thread John Lilley
patch is related to the issue cited. https://issues.apache.org/jira/browse/HADOOP-9489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel thanks, Deepak From: John Lilley john.lil...@redpoint.net To: user

RE: What else can be built on top of YARN.

2013-06-01 Thread John Lilley
Rahul, This is a very good question, and one we are grappling with currently in our application port. I think there are a lot of legacy data-processing applications like ours which would benefit by a port to Hadoop. However, because we have a great load of C++, it is not necessarily a good

RE: built hadoop! please help with next steps?

2013-05-31 Thread John Lilley
:32 AM, John Lilley john.lil...@redpoint.net wrote: Thanks for helping me build Hadoop! I'm through the compile and the install of Maven plugins into Eclipse. I could use some pointers on the next steps I want to take, which are: * Deploy the simplest development

RE: Help: error in hadoop build

2013-05-30 Thread John Lilley
at 11:33 AM, John Lilley john.lil...@redpoint.net wrote: Sorry if this is a dumb question, but I'm not sure where to start. I am following BUILDING.txt instructions for source checked out today using git: git clone git://git.apache.org/hadoop-common.git

built hadoop! please help with next steps?

2013-05-30 Thread John Lilley
Thanks for help me to build Hadoop! I'm through compile and install of maven plugins into Eclipse. I could use some pointers for next steps I want to take, which are: * Deploy the simplest development only cluster (single node?) and learn how to debug within it. I read about the

OpenJDK?

2013-05-29 Thread John Lilley
I am having trouble finding a definitive answer about OpenJDK vs Sun JDK in regards to building Hadoop. This: http://wiki.apache.org/hadoop/HadoopJavaVersions Indicates that OpenJDK is not recommended, but is that an authoritative answer? BUILDING.txt states no preference. Thanks John

RE: OpenJDK?

2013-05-29 Thread John Lilley
, May 29, 2013 9:34 AM To: user@hadoop.apache.org Subject: Re: OpenJDK? Yes. Use Sun/Oracle JDK I have had memory issues while using Oozie. When I replaced OpenJDK with Sun JDK 6. the memory issue was resolved. Thanks, Lenin On Wed, May 29, 2013 at 8:22 PM, John Lilley john.lil

Help: error in hadoop build

2013-05-29 Thread John Lilley
Sorry if this is a dumb question, but I'm not sure where to start. I am following BUILDING.txt instructions for source checked out today using git: git clone git://git.apache.org/hadoop-common.git Hadoop Following build steps and adding -X for more logging: mvn compile -X But I get this error

NM/AM interaction

2013-05-28 Thread John Lilley
I was reading from the HortonWorks blog: How MapReduce shuffle takes advantage of NM's Auxiliary-services The Shuffle functionality required to run a MapReduce (MR) application is implemented as an Auxiliary Service. This service starts up a Netty Web Server, and knows how to handle MR specific

RE: splittable vs seekable compressed formats

2013-05-24 Thread John Lilley
. On Thu, May 23, 2013 at 11:01 PM, John Lilley john.lil...@redpoint.net wrote: I’ve read about splittable compressed formats in Hadoop. Are any of these formats also “seekable” (in other words, be able to seek to an absolute location in the uncompressed data

RE: Shuffle phase replication factor

2013-05-23 Thread John Lilley
on the Hadoop 1.0.4 source code, especially the ReduceTask.java file. yours, Ling Kun On Wed, May 22, 2013 at 10:57 PM, John Lilley john.lil...@redpoint.net wrote: U, is that also the limit for the number of simultaneous connections? In general, one does not need

HDFS data and non-aligned splits

2013-05-23 Thread John Lilley
What happens when MR produces data splits, and those splits don't align on block boundaries? I've read that MR will attempt to make data splits near block boundaries to improve data locality, but isn't there always some slop where records straddle the block boundaries, resulting in an extra
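
The "slop" described above is handled by a record-reader convention: each split skips the partial record at its start (except at offset 0) and reads past its own end to finish its last record, so every record is read exactly once, by the split it starts in. A self-contained sketch over newline-delimited bytes (illustrative only, not Hadoop's actual LineRecordReader):

```python
# Sketch of the record-reader convention for non-aligned splits:
# a split [start, end) owns every record that *starts* inside it,
# skipping the leading partial record and reading past `end` to
# finish the last one -- the "extra read" mentioned above.
def read_split(data, start, end):
    if start == 0:
        pos = 0
    else:
        idx = data.find(b"\n", start - 1)      # first boundary at/after start
        pos = len(data) if idx == -1 else idx + 1
    records = []
    while pos < min(end, len(data)):           # record must start in this split
        nl = data.find(b"\n", pos)
        nl = len(data) if nl == -1 else nl
        records.append(data[pos:nl])           # may extend past `end` (the slop)
        pos = nl + 1
    return records

data = b"aaaa\nbbbbbb\ncc\n"
# A split boundary at byte 8 falls inside "bbbbbb": split one reads past
# the boundary to finish it, split two skips ahead and starts at "cc".
print(read_split(data, 0, 8), read_split(data, 8, 15))  # [b'aaaa', b'bbbbbb'] [b'cc']
```

Together the two splits yield each record exactly once, at the cost of one short remote read when the straddled tail lives in the next block.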

SequenceFile sync marker uniqueness

2013-05-23 Thread John Lilley
How does SequenceFile guarantee that the sync marker does not appear in the data? John

HTTP file server, map output, and other files

2013-05-23 Thread John Lilley
Thanks to previous kind answers and more reading in the elephant book, I now understand that mapper tasks place partitioned results into local files that are served up to reducers via HTTP: The output file's partitions are made available to the reducers over HTTP. The maximum number of worker

RE: Shuffle phase replication factor

2013-05-22 Thread John Lilley
On 21.05.2013 at 19:57, John Lilley john.lil...@redpoint.net wrote: When MapReduce enters shuffle to partition the tuples, I am assuming that it writes intermediate data to HDFS. What replication factor is used for those temporary files? john -- Kai Voigt k...@123

YARN in 2.0 and 0.23

2013-05-22 Thread John Lilley
We intend to use the YARN APIs fairly soon. Are there notable differences in YARN's classes, interfaces, or semantics between 0.23 and 2.0? It seems to be supported on both versions. Thanks, John

RE: Shuffle phase replication factor

2013-05-22 Thread John Lilley
As mentioned by Bertrand, Hadoop, The Definitive Guide, is well... really definitive :) place to start. It is pretty thorough for starts and once you are gone through it, the code will start making more sense too. Regards, Shahab On Wed, May 22, 2013 at 10:33 AM, John Lilley john.lil

RE: Shuffle phase replication factor

2013-05-22 Thread John Lilley
the no. of copying threads for copy. tasktracker.http.threads=40 Thanks, Rahul On Wed, May 22, 2013 at 8:16 PM, John Lilley john.lil...@redpoint.net wrote: This brings up another nagging question I’ve had for some time. Between HDFS and shuffle, there seems to be the potential

RE: Shuffle phase

2013-05-22 Thread John Lilley
at 19:57, John Lilley john.lil...@redpoint.net wrote: When MapReduce enters shuffle to partition the tuples, I am assuming that it writes intermediate data to HDFS. What replication factor is used for those temporary files? john -- Kai Voigt k...@123

RE: YARN in 2.0 and 0.23

2013-05-22 Thread John Lilley
from getting frozen and will be supported compatibly for the foreseeable future. Details to track here: https://issues.apache.org/jira/browse/YARN-386 hth, Arun On May 22, 2013, at 7:38 AM, John Lilley wrote: We intend to use the YARN APIs fairly soon. Are there notable differences in YARN's

MapReduce shuffle algorithm

2013-05-21 Thread John Lilley
I am very interested in a deep understanding of the MapReduce Shuffle phase algorithm and implementation. Are there whitepapers I could read for an explanation? Or another mailing list for this question? Obviously there is the code ;-) john

HDFS append overhead

2013-05-21 Thread John Lilley
I am trying to determine if it is feasible for multiple nodes to alternate appends to a shared file in HDFS. Can someone tell me, what is the overhead of an open/append/close cycle? If multiple nodes attempt open-for-append at once, the losers queue nicely waiting for the winner to close? --john

Shuffle phase replication factor

2013-05-21 Thread John Lilley
When MapReduce enters shuffle to partition the tuples, I am assuming that it writes intermediate data to HDFS. What replication factor is used for those temporary files? john

RE: Question about writing HDFS files

2013-05-17 Thread John Lilley
. On Fri, May 17, 2013 at 3:38 AM, John Lilley john.lil...@redpoint.net wrote: I seem to recall reading that when a MapReduce task writes a file, the blocks of the file are always written to local disk, and replicated to other nodes. If this is true, is this also true for non-MR applications

RE: Is FileSystem thread-safe?

2013-05-17 Thread John Lilley
of the actions. Requests to the NameNode are then made through ClientProtocol. An HDFS committer would be able to give you an affirmative answer. On Sun, Mar 31, 2013 at 11:27 AM, John Lilley john.lil...@redpoint.net wrote: From: Ted Yu [mailto:yuzhih...@gmail.com

RE: Distribution of native executables and data for YARN-based execution

2013-05-17 Thread John Lilley
are working in a heterogeneous environment. Cheers, Tim From: John Lilley john.lil...@redpoint.net To: user@hadoop.apache.org Sent: Friday, May 17, 2013 8:35:53 AM Subject: RE: Distribution of native
