Awesome, thanks! :) Now my map and reduce tasks are super fast. Although, the table I'll eventually be using has a region split of 25: 4 on each of 5 machines and 5 on the master region node. I don't know if that's enough, though, but I'll look into this.

On Mon, Aug 26, 2013 at 2:55 PM, Ashwanth Kumar <[email protected]> wrote:

> Just click on "Split" and that should be fine. It will pick a key in the
> middle of each region and split there. Splits go 1 -> 2 -> 4 -> 8 regions
> and so on. The number of regions for a table is something you should be
> able to work out from the number of region servers and the size of the
> data you expect to store in the table.
>
> A bigger caching value typically means more data in memory for the mapper
> task. I guess as long as you have enough memory to hold that data you are
> fine. Maybe other experts can help me here.
>
> - A split on the table gives you parallelism, since each region is
>   typically processed by a separate mapper.
> - The right split plus decent caching can give you the best performance
>   on full-table-scan jobs. As I already said, beware of the
>   ScannerTimeoutException that can arise from very high caching values;
>   you might want to increase the scanner timeout in that case.
>
> On Mon, Aug 26, 2013 at 2:42 PM, Pavan Sudheendra <[email protected]> wrote:
>
>> Hi Ashwanth, thanks for the reply.
>>
>> I went to the HBase web UI and saw that my table had 1 online region.
>> Can you please guide me on how to split this table? I see the UI asking
>> for a region key and a split button. How many splits can I make,
>> exactly? Can I give two different keys and assume that the table is then
>> split into 3: one region from the beginning to key1, one from key1 to
>> key2, and one from key2 to the rest?
>>
>> On Mon, Aug 26, 2013 at 2:36 PM, Ashwanth Kumar <[email protected]> wrote:
>>
>>> setCaching sets the value via the API; the other way is to set it in
>>> the job configuration using the key "hbase.client.scanner.caching".
>>>
>>> I just realized that, given you have just 1 region, caching wouldn't
>>> help much in reducing the time. Splitting might be an ideal solution.
>>> Based on the heap space available to each mapper task, try playing
>>> with that 1500 value.
>>>
>>> A word of caution: if you increase it too much, you might see
>>> ScannerTimeoutException in your TT logs.
>>>
>>> On Mon, Aug 26, 2013 at 2:29 PM, Pavan Sudheendra <[email protected]> wrote:
>>>
>>>> Hi Ashwanth,
>>>> My caching is set to 1500:
>>>>
>>>> scan.setCaching(1500);
>>>> scan.setCacheBlocks(false);
>>>>
>>>> Can I set the number of splits via an API?
>>>>
>>>> On Mon, Aug 26, 2013 at 2:22 PM, Ashwanth Kumar <[email protected]> wrote:
>>>>
>>>>> To answer your question: go to the HBase web UI, where you can
>>>>> initiate a manual split on the table.
>>>>>
>>>>> But before you do that, maybe you can try increasing your client
>>>>> caching value (hbase.client.scanner.caching) in your job.
>>>>>
>>>>> On Mon, Aug 26, 2013 at 2:09 PM, Pavan Sudheendra <[email protected]> wrote:
>>>>>
>>>>>> What is the input split of the HBase table in this job status?
>>>>>>
>>>>>> map() completion: 0.0
>>>>>> reduce() completion: 0.0
>>>>>> Counters: 24
>>>>>>   File System Counters
>>>>>>     FILE: Number of bytes read=0
>>>>>>     FILE: Number of bytes written=216030
>>>>>>     FILE: Number of read operations=0
>>>>>>     FILE: Number of large read operations=0
>>>>>>     FILE: Number of write operations=0
>>>>>>     HDFS: Number of bytes read=116
>>>>>>     HDFS: Number of bytes written=0
>>>>>>     HDFS: Number of read operations=1
>>>>>>     HDFS: Number of large read operations=0
>>>>>>     HDFS: Number of write operations=0
>>>>>>   Job Counters
>>>>>>     Launched map tasks=1
>>>>>>     Data-local map tasks=1
>>>>>>     Total time spent by all maps in occupied slots (ms)=3332
>>>>>>   Map-Reduce Framework
>>>>>>     Map input records=45570
>>>>>>     Map output records=45569
>>>>>>     Map output bytes=4682237
>>>>>>     Input split bytes=116
>>>>>>     Combine input records=0
>>>>>>     Combine output records=0
>>>>>>     Spilled Records=0
>>>>>>     CPU time spent (ms)=1142950
>>>>>>     Physical memory (bytes) snapshot=475811840
>>>>>>     Virtual memory (bytes) snapshot=1262202880
>>>>>>     Total committed heap usage (bytes)=370343936
>>>>>>
>>>>>> My table has 80,000 rows. Is there any way to increase the number of
>>>>>> input splits? It takes nearly 30 minutes for the map tasks to
>>>>>> complete, which is very unusual.
>>>>>>
>>>>>> --
>>>>>> Regards-
>>>>>> Pavan
>>>>>
>>>>> --
>>>>> Ashwanth Kumar / ashwanthkumar.in

--
Regards-
Pavan
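[Editor's note] Putting the advice in this thread together, the job wiring implied by the `scan.setCaching(1500)` / `scan.setCacheBlocks(false)` snippets looks roughly like the sketch below, written against the 0.94-era HBase and MRv1 API the thread is using. The table name "mytable" and the identity `MyMapper` are placeholders, not taken from the thread. Note that caching alone does not add parallelism: with one region, `TableInputFormat` still produces a single input split, so one mapper; splitting the table (e.g. via the web UI, as discussed above) is what raises the mapper count.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class FullTableScanJob {

    // Placeholder identity mapper; one map task runs per region of the table.
    static class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(row, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Equivalent to scan.setCaching(1500), but via the job configuration,
        // as mentioned in the thread:
        // conf.set("hbase.client.scanner.caching", "1500");

        Job job = new Job(conf, "full-table-scan");
        job.setJarByClass(FullTableScanJob.class);

        Scan scan = new Scan();
        scan.setCaching(1500);       // rows fetched per scanner RPC; tune against mapper heap
        scan.setCacheBlocks(false);  // don't churn the block cache on a full scan

        TableMapReduceUtil.initTableMapperJob(
                "mytable", scan, MyMapper.class,
                ImmutableBytesWritable.class, Result.class, job);

        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);
        job.waitForCompletion(true);
    }
}
```

If a very high caching value triggers the ScannerTimeoutException mentioned above, the timeout to raise on 0.94.x releases is, to my knowledge, `hbase.regionserver.lease.period` (the scanner lease). This sketch needs a running HBase cluster and the hbase-client jars on the classpath; it is not runnable standalone.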

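[Editor's note] The thread's "as long as you have enough memory you are fine" advice can be sanity-checked with back-of-the-envelope arithmetic from the counters above: 4,682,237 map output bytes over 45,569 records, assumed representative of row size. This is illustrative arithmetic only, not an HBase API.

```java
public class ScanCachingFootprint {

    // Approximate bytes buffered client-side per scanner next() batch:
    // average row size times the caching value.
    static long batchBytes(long totalBytes, long totalRows, int caching) {
        long avgRowBytes = totalBytes / totalRows; // ~102 bytes per row here
        return avgRowBytes * caching;
    }

    public static void main(String[] args) {
        long perBatch = batchBytes(4682237L, 45569L, 1500);
        // ~153 KB per batch -- tiny next to the ~370 MB committed heap shown
        // in the counters, so caching at 1500 is comfortably affordable; the
        // single region, not memory, was the bottleneck in this job.
        System.out.println(perBatch + " bytes per batch");
    }
}
```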