I did a search in SHC for saveAsNewHadoop (case insensitive) - there was no
match.

I suggest you use the SHC forum for related questions.

On Mon, Jan 22, 2018 at 9:07 AM, vignesh <vignesh...@gmail.com> wrote:

> It would be similar to case 2, right? Say, for example, that in Spark I read
> a file of size 512 MB, which would span 4 tasks (if the block size is 128
> MB). Executors will be spawned based on data locality, say on machines 1, 2,
> 3 and 4. If block 3's region is handled by machine6, then when I bulk load
> via the Spark HBase connector (which uses saveAsNewAPIHadoopDataset), the
> HFile write for block 3 would go to any of those 4 machines and not to
> machine6. Is that right? Or have I misunderstood?
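>
> To make this concrete, here is a rough sketch of the kind of bulk load I
> mean (the table name "mytable", column family "cf", qualifier "q", the
> output path and the sample RDD are made-up placeholders, and sc is the
> usual SparkContext):
>
> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
> import org.apache.hadoop.hbase.client.ConnectionFactory
> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
> import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
> import org.apache.hadoop.hbase.util.Bytes
> import org.apache.hadoop.mapreduce.Job
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
>
> val conf = HBaseConfiguration.create()
> val conn = ConnectionFactory.createConnection(conf)
> val table = TableName.valueOf("mytable")
> val job = Job.getInstance(conf)
> // Configure the job to emit HFiles laid out for the table's regions.
> HFileOutputFormat2.configureIncrementalLoad(
>   job, conn.getTable(table), conn.getRegionLocator(table))
> FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"))
>
> // Placeholder input; in my real job this comes from the 512 MB file.
> val rdd = sc.parallelize(Seq(("row1", "v1"), ("row2", "v2")))
>
> // Each executor writes the HFiles for its partitions on whichever node
> // it runs, not necessarily the node hosting the target region.
> rdd.map { case (rowKey, value) =>
>     val kv = new KeyValue(Bytes.toBytes(rowKey), Bytes.toBytes("cf"),
>       Bytes.toBytes("q"), Bytes.toBytes(value))
>     (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), kv)
>   }
>   .sortByKey()
>   .saveAsNewAPIHadoopDataset(job.getConfiguration)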
>
> On Jan 22, 2018 22:27, "Ted Yu" <yuzhih...@gmail.com> wrote:
>
> > For case 1, the HFile would be loaded into the region (via a staging
> > directory).
> >
> > Please see:
> > http://hbase.apache.org/book.html#arch.bulk.load
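> >
> > To make the load step concrete, a minimal sketch (the HFile path
> > "/tmp/hfiles" and table name "mytable" are placeholders) would be:
> >
> > import org.apache.hadoop.fs.Path
> > import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
> > import org.apache.hadoop.hbase.client.ConnectionFactory
> > import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
> >
> > val conf = HBaseConfiguration.create()
> > val conn = ConnectionFactory.createConnection(conf)
> > val table = TableName.valueOf("mytable")
> > // Moves each HFile (via the staging directory) into its target region,
> > // without going through the memstore/WAL write path.
> > new LoadIncrementalHFiles(conf).doBulkLoad(
> >   new Path("/tmp/hfiles"), conn.getAdmin,
> >   conn.getTable(table), conn.getRegionLocator(table))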
> >
> > On Mon, Jan 22, 2018 at 8:52 AM, vignesh <vignesh...@gmail.com> wrote:
> >
> > > For bulk loads I use the Spark HBase connector provided by Hortonworks.
> > > For time-series writes I use the normal HBase client APIs.
> > >
> > > So does that mean that in case 2 (client API write) the write to the
> > > memstore will happen over the network? And in case 1 (bulk load), will
> > > the HFile be moved to the region server that is supposed to hold it, or
> > > will it be written locally, with that kept as one copy and the second
> > > replica going to that particular region server?
> > >
> > > On Jan 22, 2018 22:16, "Ted Yu" <yuzhih...@gmail.com> wrote:
> > >
> > > Which connector do you use to perform the write?
> > >
> > > bq. Or will Spark wisely launch an executor on that machine
> > >
> > > I don't think that is the case. Multiple writes may be performed, and
> > > they would end up on different region servers. Spark won't provide the
> > > affinity described above.
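> > >
> > > As an illustration (table and column names invented, sc being the usual
> > > SparkContext), a typical client API write from executors looks like the
> > > following; each Put is routed by row key to whichever region server
> > > owns that row, over the network whenever executor and region server are
> > > on different machines:
> > >
> > > import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
> > > import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
> > > import org.apache.hadoop.hbase.util.Bytes
> > >
> > > sc.parallelize(Seq(("row1", "v1"), ("row2", "v2")))
> > >   .foreachPartition { rows =>
> > >     // One connection per partition, opened on the executor itself.
> > >     val conn =
> > >       ConnectionFactory.createConnection(HBaseConfiguration.create())
> > >     val mutator = conn.getBufferedMutator(TableName.valueOf("mytable"))
> > >     rows.foreach { case (rowKey, value) =>
> > >       val put = new Put(Bytes.toBytes(rowKey))
> > >       put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"),
> > >         Bytes.toBytes(value))
> > >       // Buffered, then sent to the region server owning this row key.
> > >       mutator.mutate(put)
> > >     }
> > >     mutator.close()
> > >     conn.close()
> > >   }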
> > >
> > > On Mon, Jan 22, 2018 at 7:19 AM, vignesh <vignesh...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I have a Spark job which reads some time-series data and pushes it to
> > > > HBase using the HBase client API. I am executing this Spark job on a
> > > > 10-node cluster. Say that when Spark first kicks off, it picks
> > > > machine1, machine2 and machine3 as its executors. Now, when the job
> > > > inserts a row into HBase, below is my understanding of what happens.
> > > >
> > > > Based on the row key, a particular region (from META) is chosen, and
> > > > the row is pushed to that region server's memstore and WAL; once the
> > > > memstore is full, it is flushed to disk. Now assume a particular row
> > > > is being processed by an executor on machine2, while the region server
> > > > which handles the region the put is destined for is on machine6. Will
> > > > the data be transferred from machine2 to machine6 over the network and
> > > > then stored in the memstore of machine6? Or will Spark wisely launch
> > > > an executor on that machine during the write (if dynamic allocation
> > > > is turned on) and push to it?
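> > > >
> > > > For context, the write I am describing is just the plain HBase client
> > > > API path, roughly like the sketch below (table, column family and
> > > > qualifier names are made-up placeholders, not my real schema):
> > > >
> > > > import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
> > > > import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
> > > > import org.apache.hadoop.hbase.util.Bytes
> > > >
> > > > // The client looks up the owning region in META and sends the Put to
> > > > // that region server's memstore/WAL, wherever this code runs.
> > > > val conn =
> > > >   ConnectionFactory.createConnection(HBaseConfiguration.create())
> > > > val table = conn.getTable(TableName.valueOf("mytable"))
> > > > val put = new Put(Bytes.toBytes("rowkey-1"))
> > > > put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"),
> > > >   Bytes.toBytes("v"))
> > > > table.put(put)
> > > > table.close()
> > > > conn.close()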
> > > >
> > > >
> > > > --
> > > > I.VIGNESH
> > > >
> > >
> >
>
