Hi Gary,

I gave this a shot on a CDH4.7 test cluster and actually saw a performance
regression on reads when I ran the numbers.  Have you done any benchmarking?
My numbers are below:



Experimental method:
1. Write 14GB of data to HDFS via [1]
2. Read data multiple times via [2]
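
(As a sanity check on the dataset size -- my own back-of-envelope arithmetic,
not a measurement -- using the record layout from [1] below: each line is an
integer key, a "|", and 1024 random chars, so ~1033 bytes plus newline.)

```shell
# Rough size estimate for the data written by [1].
records=$((14 * 1024 * 1024))
bytes=$((records * 1033))
echo "$((bytes / 1024 / 1024 / 1024)) GiB"   # prints "14 GiB"
```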


*Experiment 1: run on virtual machines*


With short-circuit read *disabled*:
14/09/24 15:10:49 INFO spark.SparkContext: Job finished: saveAsTextFile at
<console>:13, took 344.931469949 s
14/09/24 15:11:30 INFO spark.SparkContext: Job finished: count at
<console>:13, took 18.601568871 s
14/09/24 15:11:54 INFO spark.SparkContext: Job finished: count at
<console>:13, took 16.531909024 s
14/09/24 15:12:18 INFO spark.SparkContext: Job finished: count at
<console>:13, took 17.639692651 s
14/09/24 15:12:38 INFO spark.SparkContext: Job finished: count at
<console>:13, took 16.773438345 s

With short-circuit read *enabled*:
14/09/24 14:28:38 INFO spark.SparkContext: Job finished: saveAsTextFile at
<console>:13, took 299.511103592 s
14/09/24 14:29:17 INFO spark.SparkContext: Job finished: count at
<console>:13, took 22.459146194 s
14/09/24 14:29:44 INFO spark.SparkContext: Job finished: count at
<console>:13, took 19.806642815 s
14/09/24 14:30:11 INFO spark.SparkContext: Job finished: count at
<console>:13, took 20.284644308 s
14/09/24 14:30:40 INFO spark.SparkContext: Job finished: count at
<console>:13, took 21.720455219 s


My summary here is that enabling short-circuit read made the write go
faster (odd, since this is a read-path feature) and caused a slight decrease
in read performance, from ~17s to ~20s.

The VMs were backed by FusionIO drives, but I thought there might be
something funky with the VMs, so I switched to bare hardware for a second
experiment.
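
(For reference, "enabled" above means roughly the following in hdfs-site.xml;
the socket path shown is the usual CDH default and will vary by install:)

```xml
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
```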


*Experiment 2: run on bare hardware*

With short-circuit read *disabled*:
14/09/24 15:59:11 INFO spark.SparkContext: Job finished: saveAsTextFile at
<console>:13, took 1605.965203162 s
14/09/24 15:59:39 INFO spark.SparkContext: Job finished: count at
<console>:13, took 11.984355461 s
14/09/24 16:00:00 INFO spark.SparkContext: Job finished: count at
<console>:13, took 11.134712764 s
14/09/24 16:00:11 INFO spark.SparkContext: Job finished: count at
<console>:13, took 8.694292372 s
14/09/24 16:00:24 INFO spark.SparkContext: Job finished: count at
<console>:13, took 9.83986823 s

With short-circuit read *enabled*:
14/09/24 16:23:14 INFO spark.SparkContext: Job finished: saveAsTextFile at
<console>:13, took 1113.897715871 s
14/09/24 16:25:19 INFO spark.SparkContext: Job finished: count at
<console>:13, took 14.249690605 s
14/09/24 16:25:47 INFO spark.SparkContext: Job finished: count at
<console>:13, took 12.67330165 s
14/09/24 16:26:04 INFO spark.SparkContext: Job finished: count at
<console>:13, took 10.673825924 s
14/09/24 16:26:19 INFO spark.SparkContext: Job finished: count at
<console>:13, took 9.722516379 s


This is separate hardware, so the numbers are very different from Experiment
1 (it's not just the VM overhead being bypassed).

Again, the write is much faster (1605s -> 1113s), but the reads are
comparable if not slightly slower (~10.4s -> ~11.8s).
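
(For the record, the ~10.4s/~11.8s figures are just the means of the four
read times quoted above -- awk, since shell arithmetic is integer-only:)

```shell
# Mean of the four bare-metal read times, rounded as quoted in the text.
awk 'BEGIN { printf "%.1f\n", (11.98 + 11.13 + 8.69 + 9.84) / 4 }'   # 10.4 (disabled)
awk 'BEGIN { printf "%.1f\n", (14.25 + 12.67 + 10.67 + 9.72) / 4 }'  # 11.8 (enabled)
```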




To make sure that short-circuit reads were actually working, I looked at the
datanode logs and saw the line below.  I think this confirms that (a) the
read from Spark was local (127.0.0.1 -> 127.0.0.1) and (b) short-circuit
read was successfully used ("success: true").

hadoop-datanode-mybox.local.log:2014-09-24 16:26:52,800 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_FDS, blockid:
-312380305519226759, srvID: DS-96112752-10.201.12.105-50010-1411586696381,
success: true
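
(If you want to count these, a grep along these lines works; shown here
against a sample line written to /tmp, since the log path above is specific
to my box:)

```shell
# Write one sample clienttrace line and count successful short-circuit FD
# requests; run the same pipeline against your real datanode log.
printf '%s\n' 'op: REQUEST_SHORT_CIRCUIT_FDS, blockid: -312380305519226759, success: true' > /tmp/dn-sample.log
grep 'REQUEST_SHORT_CIRCUIT_FDS' /tmp/dn-sample.log | grep -c 'success: true'   # prints 1
```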


Has anyone actually deployed this feature and benchmarked the gains?  I was
hoping to throw this switch on my clusters and get something like a 30% perf
boost, but in practice that hasn't materialized.


Cheers!
Andrew



[1] sc.parallelize(1 to (14*1024*1024))
      .map(k => Seq(k, org.apache.commons.lang.RandomStringUtils.random(1024,
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWxyZ0123456789")).mkString("|"))
      .saveAsTextFile("hdfs:///tmp/output")
[2] sc.textFile("hdfs:///tmp/output").count

On Wed, Sep 17, 2014 at 11:19 AM, Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> I'm pretty sure it does help, though I don't have any numbers for it. In
> any case, Spark will automatically benefit from this if you link it to a
> version of HDFS that contains this.
>
> Matei
>
> On September 17, 2014 at 5:15:47 AM, Gary Malouf (malouf.g...@gmail.com)
> wrote:
>
> Cloudera had a blog post about this in August 2013:
> http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-reads-bring-better-performance-and-security-to-hadoop/
>
> Has anyone been using this in production - curious as to if it made a
> significant difference from a Spark perspective.
>
>
