Yes it is doing it twice, try to cache the initial RDD.
2014-02-25 8:14 GMT+01:00 Soumitra Kumar <kumar.soumi...@gmail.com>: > I have a code which reads an HBase table, and counts number of rows > containing a field. > > def readFields(rdd : RDD[(ImmutableBytesWritable, Result)]) : > RDD[List[Array[Byte]]] = { > return rdd.flatMap(kv => { > // Set of interesting keys for this use case > val keys = List ("src") > var data = List[Array[Byte]]() > var usefulRow = false > > val cf = Bytes.toBytes ("cf") > keys.foreach {key => > val col = kv._2.getValue(cf, Bytes.toBytes(key)) > if (col != null) > usefulRow = true > data = data :+ col > } > > if (usefulRow) > Some(data) > else > None > }) > } > > def main(args: Array[String]) { > val hBaseRDD = init(args) > // hBaseRDD.cache() > > println("**** Initial row count " + hBaseRDD.count()) > println("**** Rows with interesting fields " + > readFields(hBaseRDD).count()) > } > > > I am running on a one mode CDH installation. > > As it is it takes around 2.5 minutes. But if I comment out 'println("**** > Initial row count " + hBaseRDD.count())', it takes around 1.5 minutes. > > Is it doing HBase scan twice, for both 'count' calls? How do I improve it? > > Thanks, > -Soumitra. > >