Hi,
When data is read from HDFS with textFile, a map is applied, as in the
following code, to put it into the format expected by the mllib training
algorithms:
import org.apache.spark.mllib.regression.LabeledPoint

val rddFile = sc.textFile("Some file on HDFS")
val rddData = rddFile.map { line =>
  val temp = line.split(",")
  // column 3 holds the label; remap "1"/"2" to 0.0/1.0, anything else to 2.0
  val y = temp(3) match {
    case "1" => 0.0
    case "2" => 1.0
    case _   => 2.0
  }
  // columns 1 and 2 are the features
  val x = temp.slice(1, 3).map(_.toDouble)
  LabeledPoint(y, x)
}
My question is: when is the map function actually performed? Is it lazily
evaluated the first time we use rddData, generating a new dataset (since
RDDs are immutable), so that the second time we use rddData the
transformation is not recomputed? Or is the transformation computed on the
fly each time, so that no extra memory is needed for it?
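To make the behaviour I am asking about concrete, here is a plain-Scala sketch (not the Spark API; `LazyDemo` and the `evals` counter are mine, purely for illustration) of the two possibilities using a lazy collection view, which, like an uncached RDD, recomputes on every traversal:

```scala
object LazyDemo {
  def run(): (Int, Int, Int, Int) = {
    var evals = 0
    val data = Seq(1, 2, 3)
    // Like an RDD transformation: building the view executes nothing yet.
    val doubled = data.view.map { x => evals += 1; x * 2 }
    val evalsBeforeAction = evals // still 0: the map is lazy
    val first = doubled.sum       // forcing the view runs the map once per element
    val second = doubled.sum      // a view, like an uncached RDD, recomputes from scratch
    (evalsBeforeAction, evals, first, second)
  }
}
```

Here `run()` returns (0, 6, 12, 12): nothing is evaluated before the first traversal, and the second traversal recomputes the map rather than reusing stored results.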
The motivation for asking is that I found the mllib library does a lot of
extra transformations. For example, the intercept is added with
map(point => new LabeledPoint(point.y, Array(1.0) ++ point.features)).
If a new dataset is generated every time a map is performed, then for a
really big dataset this will waste a lot of memory and IO. It will also be
less efficient when we chain several map functions on an RDD, since all of
them could be done in one pass.
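The "one pass" behaviour I am hoping for can be sketched with plain Scala iterators (again not Spark itself; `ChainDemo` and the trace list are my own illustration): chained maps are pull-based, so each element flows through both functions in turn, with no intermediate collection materialized between the two stages.

```scala
object ChainDemo {
  // Records the order in which two chained map stages see each element.
  def run(): List[String] = {
    val trace = scala.collection.mutable.ListBuffer[String]()
    val it = Iterator(1, 2)
      .map { x => trace += s"f($x)"; x + 10 }  // first "transformation"
      .map { x => trace += s"g($x)"; x * 2 }   // second, chained on top
    it.foreach(_ => ()) // single traversal: f and g interleave per element
    trace.toList
  }
}
```

The trace comes out List("f(1)", "g(11)", "f(2)", "g(12)"): the stages interleave element by element instead of the first map building a full intermediate dataset before the second starts.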
Thanks.
Sincerely,
DB Tsai
Machine Learning Engineer
Alpine Data Labs
--------------------------------------
Web: http://alpinenow.com/