RE: Help me understand hadoop caching behavior

2017-12-27 Thread Frank Luo
First, Hadoop itself doesn’t have any caching.

Secondly, if it is a mapper-only job, the data doesn’t go through the network.

So look somewhere else.




Re: Help me understand hadoop caching behavior

2017-12-27 Thread Avery, John
Never mind. I found my stupid mistake: I didn’t reset a variable, and that fact 
had escaped me for the past two days.
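
For anyone else chasing a similar ghost: the message above doesn’t say which 
variable it was, but a purely hypothetical sketch of how an un-reset accumulator 
can inflate a read benchmark looks something like this (the memcpy stands in for 
hdfsRead; all names and numbers are illustrative only):

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    /* Hypothetical harness: total_bytes is never reset between passes, so every
     * pass after the first also counts the bytes from earlier passes and the
     * reported "throughput" climbs far beyond what the hardware can deliver. */
    static long long total_bytes = 0;            /* the un-reset accumulator */

    static double measure_pass(const char *src, size_t chunk, int chunks)
    {
        static char buf[1 << 20];
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* total_bytes = 0;   <-- the missing reset */
        for (int i = 0; i < chunks; i++) {
            memcpy(buf, src, chunk);             /* stand-in for hdfsRead() */
            total_bytes += (long long)chunk;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        return (total_bytes / secs) / 1e9;       /* "GB/s", inflated after pass 1 */
    }

    int main(void)
    {
        static char src[1 << 20];
        for (int pass = 1; pass <= 3; pass++)
            printf("pass %d: %.2f GB/s\n", pass,
                   measure_pass(src, sizeof src, 4096));
        return 0;
    }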

From: "Avery, John" 
Date: Wednesday, December 27, 2017 at 4:20 PM
To: "user@hadoop.apache.org" 
Subject: Help me understand hadoop caching behavior

I’m writing a program using the C API for Hadoop. I have a 4-node cluster. 
(Cluster was setup according to 
https://www.tutorialspoint.com/hadoop/hadoop_multi_node_cluster.htm)
 Of the 4 nodes, one is the namenode and a datanode, the others are datanodes 
(with one being a secondary namenode).

I’ve already managed to write about 1.5TB of data to the cluster. My issue is 
reading data back, specifically, it’s too fast. *Way* too fast, and I don’t 
understand how or why. The 1.5 TB is stored in the form of about 20,000 60-80MB 
files. When I read back the files (7 files in parallel) I get read speeds in 
excess of 75GB/s. Obviously this is DRAM speed, here’s the problem…each of the 
4 nodes only has 32GB of RAM, and I’m asking Hadoop to re-read over 400GB of 
data. I am using the read back data, so it isn’t the compiler optimizing 
something out, because when I turn off optimization flags, it still runs 10x 
faster than the network/disks to this box can run.

Specifically: 2x10Gb network ports, bonded. Maximum network input 2.5GB/s. 
(test verified)
16x 4TB hard drives: 2GB/s maximum throughput (test verified; outside of 
Hadoop).

As for how I’m reading my data, hdfsOpenFile(…,O_RDONLY) and hdfsRead().

So, at best, I should get 4.5GB/s, and that’s in a perfect work world. But 
during my tests I see no network traffic, and very little (~30-70MB/s) disk IO. 
Yet it manages to return to me 300GB of unique data (the data is real, not a 
pattern, not something particularly compressible or dedupable).

I’m at a complete loss for how 300GB of data is getting sent to me so quickly?! 
I feel like I’m overlooking something trivial…I’m specifically asking for 10X 
the system’s memory (and over 2x the cluster’s memory!) in order to *prevent* 
caching from polluting my numbers. Yet it’s doing something that should be 
impossible. I’m at a complete loss. I fully expect to facepalm at the end of 
this.

Oh, and here’s the really weird part (to me). If I request all 20,000 files, it 
zooms past the 5000 I have cached from my 400MB read test and then slows down 
to a more realistic 2GB/s for the rest of the files. Until I re-run the program 
a second time…then it returns a result in something like 35 seconds instead of 
5 minutes. !!!


Help me understand hadoop caching behavior

2017-12-27 Thread Avery, John
I’m writing a program using the C API for Hadoop. I have a 4-node cluster. 
(Cluster was set up according to 
https://www.tutorialspoint.com/hadoop/hadoop_multi_node_cluster.htm) Of the 4 
nodes, one is the namenode and a datanode, the others are datanodes (with one 
being a secondary namenode).

I’ve already managed to write about 1.5TB of data to the cluster. My issue is 
reading data back; specifically, it’s too fast. *Way* too fast, and I don’t 
understand how or why. The 1.5TB is stored in the form of about 20,000 60-80MB 
files. When I read back the files (7 files in parallel) I get read speeds in 
excess of 75GB/s. Obviously this is DRAM speed; here’s the problem: each of the 
4 nodes only has 32GB of RAM, and I’m asking Hadoop to re-read over 400GB of 
data. I am using the read-back data, so it isn’t the compiler optimizing 
something out; even with optimization flags turned off, it still runs 10x 
faster than the network/disks on this box can deliver.

Specifically: 2x 10Gb network ports, bonded; maximum network input 2.5GB/s 
(test verified).
16x 4TB hard drives: 2GB/s maximum throughput (test verified, outside of 
Hadoop).

As for how I’m reading my data: hdfsOpenFile(…, O_RDONLY) and hdfsRead().
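
For concreteness, a minimal sketch of that read path with libhdfs might look 
like the following; the connection parameters, file path, and buffer size here 
are placeholders rather than details from the original program:

    #include <stdio.h>
    #include <fcntl.h>      /* O_RDONLY */
    #include "hdfs.h"       /* libhdfs C API shipped with Hadoop */

    int main(void)
    {
        /* "default" picks up fs.defaultFS from the client configuration. */
        hdfsFS fs = hdfsConnect("default", 0);
        if (!fs) { fprintf(stderr, "hdfsConnect failed\n"); return 1; }

        /* Path is a placeholder; the trailing 0s request the default
         * buffer size, replication, and block size. */
        hdfsFile f = hdfsOpenFile(fs, "/data/part-00000", O_RDONLY, 0, 0, 0);
        if (!f) {
            fprintf(stderr, "hdfsOpenFile failed\n");
            hdfsDisconnect(fs);
            return 1;
        }

        char buf[1 << 20];               /* 1MB read buffer */
        long long total = 0;
        tSize n;
        while ((n = hdfsRead(fs, f, buf, sizeof buf)) > 0)
            total += n;                  /* consume/verify the data here */

        printf("read %lld bytes\n", total);
        hdfsCloseFile(fs, f);
        hdfsDisconnect(fs);
        return 0;
    }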

So, at best, I should get 4.5GB/s (2.5GB/s over the network plus 2GB/s from 
local disk), and that’s in a perfect world. But during my tests I see no 
network traffic and very little (~30-70MB/s) disk IO. Yet it manages to return 
to me 300GB of unique data (the data is real, not a pattern, not something 
particularly compressible or dedupable).

I’m at a complete loss for how 300GB of data is getting sent to me so quickly! 
I feel like I’m overlooking something trivial… I’m specifically asking for 10x 
the system’s memory (and over 2x the cluster’s memory!) in order to *prevent* 
caching from polluting my numbers. Yet it’s doing something that should be 
impossible. I fully expect to facepalm at the end of this.

Oh, and here’s the really weird part (to me). If I request all 20,000 files, it 
zooms past the 5000 I have cached from my 400GB read test and then slows down 
to a more realistic 2GB/s for the rest of the files. That is, until I re-run the 
program a second time… then it returns a result in something like 35 seconds 
instead of 5 minutes!