@Ted Yu If the full table scan does not read the memstore, then why am I getting the recently inserted data? I am pretty sure others have seen this behavior earlier and simply did not notice.
@Jingcheng Thanks for your answer. If you are right, then my understanding was wrong. I will look at the code of TableInputFormat and see if I find something new.

On Thu, Jun 29, 2017 at 9:31 AM, Jingcheng Du <[email protected]> wrote:

> Hi Sachin,
> TableInputFormat should read the memstore.
> The TableInputFormat is converted to a scan on each region; the operations in
> each region are a normal scan, so the memstore is included.
> That's why you can always read all the data.
>
> bq. As per my understanding this full table scan works fast because we are
> reading HFiles directly.
> I think the full table scan is fast because you run the scan on each region
> concurrently in Spark.
>
> 2017-06-29 11:33 GMT+08:00 Ted Yu <[email protected]>:
>
> > TableInputFormat doesn't read the memstore.
> >
> > bq. I am inserting 10-20 entries only
> >
> > You can query JMX and check the values for the following:
> >
> > flushedCellsCount
> > flushedCellsSize
> > FlushMemstoreSize_num_ops
> >
> > For Q2, there is no client-side support for knowing where the data comes
> > from.
> >
> > On Wed, Jun 28, 2017 at 8:15 PM, Sachin Jain <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I have used TableInputFormat and newAPIHadoopRDD, defined on
> > > sparkContext, to do a full table scan and get an RDD from it.
> > >
> > > A partial piece of the code looks like this:
> > >
> > > sparkContext.newAPIHadoopRDD(
> > >   HBaseConfigurationUtil.hbaseConfigurationForReading(
> > >     table.getName.getNameWithNamespaceInclAsString,
> > >     hbaseQuorum, hBaseFilter, versionOpt, zNodeParentOpt),
> > >   classOf[TableInputFormat],
> > >   classOf[ImmutableBytesWritable],
> > >   classOf[Result]
> > > )
> > >
> > > As per my understanding, this full table scan works fast because we are
> > > reading HFiles directly.
> > >
> > > Q1. Does that mean we are skipping memstores? If yes, then we should
> > > have missed some data that is present in the memstore, because that data
> > > has not been persisted to disk yet and hence is not available via an HFile.
> > >
> > > In my local setup, I always get all the data. Since I am inserting 10-20
> > > entries only, I assume they are present in the memstore when I issue
> > > the full table scan Spark job.
> > >
> > > Q2. When I issue a get command, is there a way to know if the record is
> > > served from the block cache, memstore, or HFile?
> > >
> > > Thanks
> > > -Sachin
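[Editor's note] For reference, a minimal self-contained sketch of the scan pattern discussed in the thread. The table name, ZooKeeper quorum, and app name are placeholders; `HBaseConfigurationUtil` in the original snippet is the poster's own helper, replaced here by direct `HBaseConfiguration` settings.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object FullTableScanSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-full-scan"))

    // Standard HBase client configuration; values below are assumptions.
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "my_namespace:my_table") // hypothetical table
    conf.set("hbase.zookeeper.quorum", "localhost")                 // assumed quorum

    // One Spark partition per region; each region runs a normal scan,
    // which (per Jingcheng's reply) includes the memstore.
    val rdd = sc.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result]
    )

    println(s"rows scanned: ${rdd.count()}")
    sc.stop()
  }
}
```

The apparent speed comes from region-level parallelism, not from bypassing the memstore.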
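[Editor's note] Ted's JMX suggestion can be checked from the command line; a sketch, assuming a region server whose info server listens on the default port 16030 (the hostname is a placeholder). If `flushedCellsCount` is 0, the inserted rows were still in the memstore when the scan ran.

```shell
# Fetch the region server's JMX dump and pull out the flush-related metrics
# mentioned in the thread.
curl -s "http://regionserver-host:16030/jmx" \
  | grep -E 'flushedCellsCount|flushedCellsSize|FlushMemstoreSize_num_ops'
```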
