Hi, I have a Spark job that reads some time-series data and pushes it to HBase using the HBase client API. I am running this Spark job on a 10-node cluster. Say that when Spark starts, it picks machine1, machine2 and machine3 as its executors. Here is my understanding of what happens when the job inserts a row into HBase.
Based on the row key, a particular region is chosen (looked up from META), and the row is written to the WAL and the memstore of the RegionServer hosting that region. Once the memstore is full, it is flushed to disk. Now assume a particular row is being processed by an executor on machine2, and the RegionServer that handles the target region for the put is on machine6. Will the data be transferred over the network from machine2 to machine6 and then stored in the memstore on machine6? Or will Spark launch an executor on machine6 for the write (if dynamic allocation is turned on) and push the data from there?
-- I.VIGNESH
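
P.S. To make the scenario concrete, here is a minimal toy model of the lookup as I understand it. This is not the HBase client API: the region start keys, the host assignments and the function names are all made up for illustration, and the only assumption is that each region owns the key range starting at its start key.

```python
from bisect import bisect_right

# Hypothetical region layout: (start_key, host), sorted by start key.
# The first region has an empty start key, so every row key falls somewhere.
REGIONS = [
    (b"", "machine4"),
    (b"g", "machine6"),
    (b"p", "machine9"),
]


def locate_region_host(row_key: bytes) -> str:
    """Return the host of the region whose key range contains row_key."""
    start_keys = [start for start, _ in REGIONS]
    # Last region whose start key is <= row_key (start keys are inclusive).
    idx = bisect_right(start_keys, row_key) - 1
    return REGIONS[idx][1]


def needs_network_hop(executor_host: str, row_key: bytes) -> bool:
    """True if the executor and the region's host are different machines."""
    return locate_region_host(row_key) != executor_host


if __name__ == "__main__":
    row = b"host-1"  # made-up row key
    print(locate_region_host(row))
    print(needs_network_hop("machine2", row))
```

In this model the row key `host-1` falls in the region hosted on machine6 while the executor runs on machine2, which is exactly the case I am asking about.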
