@Ted Yu If the full table scan does not read the memstore, then why am I getting the recently inserted data? I am pretty sure others have seen this behavior earlier and simply did not notice.
@Jingcheng Thanks for your answer. If you are right, then my understanding was wrong. I will look at the code of TableInputFormat and see if I find something new.

On Thu, Jun 29, 2017 at 9:31 AM, Jingcheng Du <[email protected]> wrote:

> Hi Sachin,
> TableInputFormat should read the memstore.
> The TableInputFormat is converted to a scan on each region; the operations in
> each region are a normal scan, so the memstore is included.
> That's why you can always read all the data.
>
> bq. As per my understanding this full table scan works fast because we are
> reading HFiles directly.
> I think the full table scan is fast because you run the scan on each region
> concurrently in Spark.
>
> 2017-06-29 11:33 GMT+08:00 Ted Yu <[email protected]>:
>
> > TableInputFormat doesn't read the memstore.
> >
> > bq. I am inserting 10-20 entries only
> >
> > You can query JMX and check the values for the following:
> >
> > flushedCellsCount
> > flushedCellsSize
> > FlushMemstoreSize_num_ops
> >
> > For Q2, there is no client-side support for knowing where the data comes
> > from.
> >
> > On Wed, Jun 28, 2017 at 8:15 PM, Sachin Jain <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I have used TableInputFormat and newAPIHadoopRDD, defined on
> > > sparkContext, to do a full table scan and get an RDD from it.
> > >
> > > A partial piece of the code looks like this:
> > >
> > > sparkContext.newAPIHadoopRDD(
> > >   HBaseConfigurationUtil.hbaseConfigurationForReading(
> > >     table.getName.getNameWithNamespaceInclAsString,
> > >     hbaseQuorum, hBaseFilter, versionOpt, zNodeParentOpt),
> > >   classOf[TableInputFormat],
> > >   classOf[ImmutableBytesWritable],
> > >   classOf[Result]
> > > )
> > >
> > > As per my understanding, this full table scan works fast because we are
> > > reading HFiles directly.
> > >
> > > Q1. Does that mean we are skipping memstores? If yes, then we should
> > > have missed some data that is present in the memstore, because that data
> > > has not been persisted to disk yet and hence is not available via an HFile.
> > >
> > > In my local setup, I always get all the data. Since I am inserting 10-20
> > > entries only, I assume they are present in the memstore when I issue
> > > the full table scan Spark job.
> > >
> > > Q2. When I issue a get command, is there a way to know if the record is
> > > served from the block cache, memstore, or HFile?
> > >
> > > Thanks
> > > -Sachin
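[Editor's note] For reference, a minimal self-contained sketch of the scan pattern discussed in the thread. The table name, ZooKeeper quorum, and app name are placeholders; `HBaseConfigurationUtil` in the original snippet is the poster's own helper, replaced here by direct `HBaseConfiguration` settings.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object FullTableScanSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-full-scan"))

    // Standard HBase client configuration; values below are assumptions.
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "my_namespace:my_table") // hypothetical table
    conf.set("hbase.zookeeper.quorum", "localhost")                 // assumed quorum

    // One Spark partition per region; each region runs a normal scan,
    // which (per Jingcheng's reply) includes the memstore.
    val rdd = sc.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result]
    )

    println(s"rows scanned: ${rdd.count()}")
    sc.stop()
  }
}
```

The apparent speed comes from region-level parallelism, not from bypassing the memstore.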
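[Editor's note] Ted's JMX suggestion can be checked from the command line; a sketch, assuming a region server whose info server listens on the default port 16030 (the hostname is a placeholder). If `flushedCellsCount` is 0, the inserted rows were still in the memstore when the scan ran.

```shell
# Fetch the region server's JMX dump and pull out the flush-related metrics
# mentioned in the thread.
curl -s "http://regionserver-host:16030/jmx" \
  | grep -E 'flushedCellsCount|flushedCellsSize|FlushMemstoreSize_num_ops'
```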
