When I run some of the Apache Spark examples in the Spark shell or as a
job, I cannot get full core utilization on a single machine.
For example:
// Load and cache the input file
val textColumn = sc.textFile("/home/someuser/verylargefile.txt").cache()

// Count distinct words (fields are NUL-delimited in my file)
val distinctWordCount = textColumn.flatMap(line => line.split('\0'))
                                  .map(word => (word, 1))
                                  .reduceByKey(_ + _)
                                  .count()
When I run this, I mostly see only 1 or 2 active cores on my 8-core
machine. Isn't Spark supposed to parallelize this? The job takes about
15 seconds, but most of my cores sit idle. How can I configure Spark to
use all of them?
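From the documentation I gather that parallelism depends on the master URL and the number of input partitions, so I was going to try something like the sketch below. Is this the right approach? (The `local[8]` master URL and the `minPartitions` argument to `textFile` are just my reading of the docs, not something I have verified fixes the problem.)

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Guess: explicitly ask for 8 local worker threads, one per core
val conf = new SparkConf()
  .setAppName("DistinctWordCount")
  .setMaster("local[8]")
val sc = new SparkContext(conf)

// Guess: hint at least 8 input partitions so all cores get a task
val textColumn = sc.textFile("/home/someuser/verylargefile.txt", 8).cache()
```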