Sachin: My previous answer was inaccurate. Please take a look at TableRecordReaderImpl where htable.getScanner() is called to obtain ResultScanner.
The (relatively) fast table scan may be due to your table not having much data.

Cheers

On Wed, Jun 28, 2017 at 10:27 PM, Sachin Jain <[email protected]> wrote:
> @Ted Yu If full table scan does not read the memstore, then why am I getting the
> recently inserted data? I am pretty sure others may have seen this earlier
> and did not notice.
>
> @Jingcheng Thanks for your answer. If you are right, then my understanding
> was wrong. I will look at the code of TableInputFormat and see if I find
> something new.
>
> On Thu, Jun 29, 2017 at 9:31 AM, Jingcheng Du <[email protected]> wrote:
> > Hi Sachin,
> > TableInputFormat should read the memstore.
> > TableInputFormat is converted to a scan against each region, and the
> > operations in each region are a normal scan, so the memstore is included.
> > That's why you can always read all the data.
> >
> > bq. As per my understanding this full table scan works fast because we are
> > reading HFiles directly.
> > I think the fast full table scan is because you run the scan in each region
> > concurrently in Spark.
> >
> > 2017-06-29 11:33 GMT+08:00 Ted Yu <[email protected]>:
> > > TableInputFormat doesn't read the memstore.
> > >
> > > bq. I am inserting 10-20 entries only
> > >
> > > You can query JMX and check the values for the following:
> > >
> > > flushedCellsCount
> > > flushedCellsSize
> > > FlushMemstoreSize_num_ops
> > >
> > > For Q2, there is no client-side support for knowing where the data comes
> > > from.
> > >
> > > On Wed, Jun 28, 2017 at 8:15 PM, Sachin Jain <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > I have used TableInputFormat and newAPIHadoopRDD, defined on sparkContext,
> > > > to do a full table scan and get an RDD from it.
> > > >
> > > > Partial piece of code looks like this:
> > > >
> > > > sparkContext.newAPIHadoopRDD(
> > > >   HBaseConfigurationUtil.hbaseConfigurationForReading(
> > > >     table.getName.getNameWithNamespaceInclAsString,
> > > >     hbaseQuorum, hBaseFilter, versionOpt, zNodeParentOpt),
> > > >   classOf[TableInputFormat],
> > > >   classOf[ImmutableBytesWritable],
> > > >   classOf[Result]
> > > > )
> > > >
> > > > As per my understanding this full table scan works fast because we are
> > > > reading HFiles directly.
> > > >
> > > > *Q1. Does that mean we are skipping memstores?* If yes, then we should
> > > > have missed some data which is present in the memstore, because that data
> > > > has not been persisted to disk yet and hence is not available via HFile.
> > > >
> > > > *In my local setup, I always get all the data.* Since I am inserting
> > > > 10-20 entries only, I assume this data is present in the memstore when I
> > > > issue the full table scan Spark job.
> > > >
> > > > Q2. When I issue a get command, is there a way to know if the record is
> > > > served from blockCache, memstore or HFile?
> > > >
> > > > Thanks
> > > > -Sachin
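[Editor's note] For reference, here is a minimal, self-contained sketch of the setup discussed in this thread, using only stock HBase and Spark APIs. `HBaseConfigurationUtil` in the original post is the poster's own helper and is not reproduced here; the quorum, znode parent and table name below are placeholders. This is a sketch under those assumptions, not the poster's actual code, and it needs a running HBase cluster to execute:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object FullTableScanSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-full-scan"))

    // Stock HBase configuration; connection values are placeholders.
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3")    // placeholder
    conf.set("zookeeper.znode.parent", "/hbase")         // placeholder
    conf.set(TableInputFormat.INPUT_TABLE, "ns:mytable") // placeholder

    // TableInputFormat produces one split per region. As noted in the
    // thread, each split's record reader (TableRecordReaderImpl) calls
    // htable.getScanner(), i.e. a normal client Scan -- so unflushed
    // cells in the memstore are returned along with data from HFiles.
    val rdd = sc.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result]
    )

    println(s"rows scanned: ${rdd.count()}")
    sc.stop()
  }
}
```

The apparent speed of the scan comes from the per-region splits running concurrently across Spark executors, not from bypassing the memstore.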
