Hi,

I have used TableInputFormat with newAPIHadoopRDD (defined on SparkContext) to
do a full table scan and get an RDD from it.

A partial snippet of the code looks like this:

import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Full table scan: TableInputFormat creates one input split per region.
sparkContext.newAPIHadoopRDD(
  HBaseConfigurationUtil.hbaseConfigurationForReading(
    table.getName.getNameWithNamespaceInclAsString,
    hbaseQuorum, hBaseFilter, versionOpt, zNodeParentOpt),
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result]
)
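
For context, hbaseConfigurationForReading is our own util; it just builds the
Configuration that TableInputFormat reads. A rough sketch of what it does
(filter/version/znode handling omitted, parameter names are ours):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Sketch only: set the ZooKeeper quorum and tell TableInputFormat
// which table to scan.
def hbaseConfigurationForReading(tableName: String, hbaseQuorum: String): Configuration = {
  val conf = HBaseConfiguration.create()
  conf.set("hbase.zookeeper.quorum", hbaseQuorum)
  conf.set(TableInputFormat.INPUT_TABLE, tableName)
  conf
}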


As per my understanding, this full table scan is fast because we are reading
the HFiles directly.

Q1. Does that mean we are skipping the memstores? If yes, then we should be
missing some data: anything still sitting in the memstore has not been
persisted to disk yet, and hence is not available via an HFile.
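
To test that hypothesis, I was thinking of forcing a flush before the scan so
that everything lands in HFiles first. A minimal sketch using the standard
HBase 2.x client API (the table name here is a placeholder):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory

// Flush the table so everything in the memstore is persisted to HFiles
// before running the Spark scan job.
val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
try {
  connection.getAdmin.flush(TableName.valueOf("my_ns:my_table"))
} finally {
  connection.close()
}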

In my local setup, however, I always get all the data. Since I am inserting
only 10-20 entries, I assume they are still in the memstore when I issue the
full-table-scan Spark job.
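
One way I try to confirm the rows are still only in the memstore is to look at
the per-region metrics before scanning. A sketch against the HBase 2.x Admin
API (table name is again a placeholder):

import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory

// Print memstore size vs. store-file count per region, to see whether
// the freshly inserted rows have been flushed yet.
val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
try {
  val admin = connection.getAdmin
  val tableName = TableName.valueOf("my_ns:my_table")
  for {
    server <- admin.getClusterMetrics.getLiveServerMetrics.keySet.asScala
    region <- admin.getRegionMetrics(server, tableName).asScala
  } println(s"${region.getNameAsString}: memstore=${region.getMemStoreSize}, " +
      s"storefiles=${region.getStoreFileCount}")
} finally {
  connection.close()
}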

Q2. When I issue a Get, is there a way to know whether the record was served
from the block cache, the memstore, or an HFile?
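
The only indirect check I can think of (not sure it is the right approach) is
to diff the region server's block cache hit count around the Get. A rough
sketch, assuming a stock setup with the default info port 16030 and the
blockCacheHitCount metric exposed via the JMX servlet:

import scala.io.Source

// Read blockCacheHitCount from the region server's /jmx endpoint; an
// increase across the Get suggests the block cache served the read.
def blockCacheHitCount(host: String): Long = {
  val url = s"http://$host:16030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=Server"
  val json = Source.fromURL(url).mkString
  """"blockCacheHitCount"\s*:\s*(\d+)""".r
    .findFirstMatchIn(json).map(_.group(1).toLong).getOrElse(-1L)
}

val before = blockCacheHitCount("localhost")
// ... issue the Get here ...
val after = blockCacheHitCount("localhost")
println(s"block cache hits during the Get: ${after - before}")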

Thanks
-Sachin
