Setting hbase.bulkload.locality.sensitive.enabled to true and hbase.mapreduce.hfileoutputformat.table.name to the <target_table> will preserve locality on a best-effort basis during bulk load, FYI. For more details, please refer to HBASE-12596 <https://issues.apache.org/jira/browse/HBASE-12596>
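For reference, a minimal Scala sketch of how those two properties might be set when preparing HFiles with HFileOutputFormat2. The table name "my_table" and the job name are placeholders, not from this thread:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.mapreduce.Job

val conf = HBaseConfiguration.create()
// Look up each row key's region location and try to place an HFile
// block replica on that region server's host (best effort, HBASE-12596).
conf.set("hbase.bulkload.locality.sensitive.enabled", "true")
// Table whose region locations are consulted; "my_table" is a placeholder.
conf.set("hbase.mapreduce.hfileoutputformat.table.name", "my_table")

val job = Job.getInstance(conf, "hfile-prep")
// Note: HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator)
// also sets the table-name property itself when given a table handle.
```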
Best Regards,
Yu

On 23 January 2018 at 01:13, Ted Yu <[email protected]> wrote:
> I did a search in SHC for saveAsNewHadoop (case insensitive) - there was
> no match.
>
> I suggest you use the SHC forum for related questions.
>
> On Mon, Jan 22, 2018 at 9:07 AM, vignesh <[email protected]> wrote:
> > It would be similar to case 2, right? Say, for example, in Spark I read
> > a file of size 512 MB, which would span 4 cores (if the block size is
> > 128 MB). Executors will be spawned based on data locality; say the
> > executors are launched on machine1, machine2, machine3 and machine4. If
> > the region for block3 is handled by machine6, then when I bulk load via
> > the Spark HBase connector (which uses saveAsNewAPIHadoopDataset), the
> > HFile write for block3 would go to any of those 4 machines and not to
> > machine6. Is that right? Or did I misunderstand?
> >
> > On Jan 22, 2018 22:27, "Ted Yu" <[email protected]> wrote:
> > > For case 1, the HFile would be loaded into the region (via a staging
> > > directory).
> > >
> > > Please see:
> > > http://hbase.apache.org/book.html#arch.bulk.load
> > >
> > > On Mon, Jan 22, 2018 at 8:52 AM, vignesh <[email protected]> wrote:
> > > > If it is a bulk load, I use the Spark HBase connector provided by
> > > > Hortonworks. For time-series writes I use the normal HBase client
> > > > APIs.
> > > >
> > > > So does that mean in case 2 (client API write) the write to the
> > > > memstore will happen over the network? In case 1 (bulk load), will
> > > > the HFile be moved to the region server that is supposed to hold
> > > > it, or will it be written locally, keeping that as one copy, with a
> > > > replica going to that particular region server?
> > > >
> > > > On Jan 22, 2018 22:16, "Ted Yu" <[email protected]> wrote:
> > > > > Which connector do you use to perform the write?
> > > > >
> > > > > bq. Or spark will wisely launch an executor on that machine
> > > > >
> > > > > I don't think that is the case. Multiple writes may be performed,
> > > > > which would end up on different region servers. Spark won't
> > > > > provide the affinity described above.
> > > > >
> > > > > On Mon, Jan 22, 2018 at 7:19 AM, vignesh <[email protected]> wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I have a Spark job which reads some time-series data and pushes
> > > > > > it to HBase using the HBase client API. I am executing this
> > > > > > Spark job on a 10-node cluster. Say at first, when Spark kicks
> > > > > > off, it picks machine1, machine2 and machine3 as its executors.
> > > > > > Now the job inserts a row into HBase. Below is my understanding
> > > > > > of what it does.
> > > > > >
> > > > > > Based on the row key, a particular region (from META) is chosen
> > > > > > and that row is pushed to that region server's memstore and
> > > > > > WAL, and once the memstore is full it is flushed to disk. Now
> > > > > > assume a particular row is being processed by an executor on
> > > > > > machine2, and the region server which handles the region that
> > > > > > the put goes to is on machine6. Will the data be transferred
> > > > > > from machine2 to machine6 over the network and then stored in
> > > > > > the memstore of machine6? Or will Spark wisely launch an
> > > > > > executor on that machine during the write (if dynamic
> > > > > > allocation is turned on) and push to it?
> > > > > >
> > > > > > --
> > > > > > I.VIGNESH
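The bulk-load path discussed above (case 1) can be sketched roughly as below; the table name, output path, and input RDD are placeholders, and the LoadIncrementalHFiles package shown is the HBase 1.x one (it moved to org.apache.hadoop.hbase.tool in 2.x):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD

// `hfiles: RDD[(ImmutableBytesWritable, KeyValue)]`, sorted by row key,
// is assumed to exist; "my_table" and "/tmp/hfiles" are placeholders.
def bulkLoad(hfiles: RDD[(ImmutableBytesWritable, KeyValue)]): Unit = {
  val conf = HBaseConfiguration.create()
  val connection = ConnectionFactory.createConnection(conf)
  val tableName = TableName.valueOf("my_table")
  val table = connection.getTable(tableName)
  val regionLocator = connection.getRegionLocator(tableName)

  // Wires the table's region boundaries (and, when enabled, the locality
  // properties above) into the job so each HFile fits one region.
  val job = Job.getInstance(conf, "bulk-load-prep")
  HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator)

  // Step 1: executors write HFiles to HDFS, not to region servers.
  hfiles.saveAsNewAPIHadoopFile(
    "/tmp/hfiles", classOf[ImmutableBytesWritable], classOf[KeyValue],
    classOf[HFileOutputFormat2], job.getConfiguration)

  // Step 2: region servers adopt the files via a staging directory;
  // this bypasses the memstore and WAL entirely.
  new LoadIncrementalHFiles(conf)
    .doBulkLoad(new Path("/tmp/hfiles"), connection.getAdmin, table, regionLocator)
}
```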
