Hi,

I have used TableInputFormat and newAPIHadoopRDD (defined on sparkContext) to do a full table scan and get an RDD from it.
A partial piece of the code looks like this (a simplified, self-contained sketch of the same setup is included further below):

  sparkContext.newAPIHadoopRDD(
    HBaseConfigurationUtil.hbaseConfigurationForReading(
      table.getName.getNameWithNamespaceInclAsString,
      hbaseQuorum,
      hBaseFilter,
      versionOpt,
      zNodeParentOpt),
    classOf[TableInputFormat],
    classOf[ImmutableBytesWritable],
    classOf[Result]
  )

As per my understanding, this full table scan is fast because we are reading HFiles directly.

Q1. Does that mean we are skipping the memstores? If yes, then we should be missing whatever data is still sitting in the memstore, because that data has not been persisted to disk yet and hence is not available via an HFile. *In my local setup, however, I always get all the data.* Since I am inserting only 10-20 entries, I am assuming they are still in the memstore when I issue the full-table-scan Spark job.

Q2. When I issue a get command, is there a way to know whether the record was served from the block cache, the memstore, or an HFile?
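For reference, here is a simplified, self-contained sketch of the same setup. HBaseConfigurationUtil.hbaseConfigurationForReading is our own helper, so the helper below is only an approximation using the standard TableInputFormat configuration keys, and the names and parameters are placeholders rather than our exact code:

  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.client.Result
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  import org.apache.hadoop.hbase.mapreduce.TableInputFormat
  import org.apache.spark.SparkContext

  // Roughly what our configuration helper produces (standard keys only).
  def hbaseReadConf(tableName: String, quorum: String, znodeParent: String) = {
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", quorum)
    conf.set("zookeeper.znode.parent", znodeParent)
    conf.set(TableInputFormat.INPUT_TABLE, tableName)    // table to scan
    conf.setInt(TableInputFormat.SCAN_MAXVERSIONS, 1)    // optional: cap versions (what versionOpt controls in our helper)
    conf
  }

  // Full table scan as an RDD of (row key, Result) pairs.
  def fullTableScan(sc: SparkContext, tableName: String, quorum: String, znodeParent: String) =
    sc.newAPIHadoopRDD(
      hbaseReadConf(tableName, quorum, znodeParent),
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result]
    )

On Q2, I could not find a per-request flag that says which store served a particular get, so the best I could come up with is an indirect experiment: read the row once while it can only be in the memstore, force a flush, and read it again once it can only come from the new HFile (or the block cache on a repeated read). A minimal sketch of that experiment follows; the table and column names are just placeholders:

  import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
  import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
  import org.apache.hadoop.hbase.util.Bytes

  val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val tableName  = TableName.valueOf("test_table")
  val table      = connection.getTable(tableName)
  val admin      = connection.getAdmin

  // 1. Write a row: at this point it lives only in the memstore (and the WAL).
  table.put(new Put(Bytes.toBytes("row1"))
    .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v")))

  // 2. Read it back before any flush: if the value comes back, the read path
  //    clearly did not skip the memstore.
  println(table.get(new Get(Bytes.toBytes("row1"))))

  // 3. Force the memstore to be written out to an HFile.
  admin.flush(tableName)

  // 4. Read again: the value can now only come from the new HFile
  //    (or from the block cache once the block has been read once).
  println(table.get(new Get(Bytes.toBytes("row1"))))

  admin.close(); table.close(); connection.close()

I am also assuming the RegionServer block cache counters (blockCacheHitCount / blockCacheMissCount in the RegionServer JMX metrics) could be watched around step 4 to tell block cache hits from HFile reads, but please correct me if those are the wrong metrics to look at.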
Thanks
-Sachin