Awesome, thanks! :) Now my map and reduce tasks are super fast. Although, the table I'll eventually be using has a region split of 25: 4 on each of 5 machines and 5 on the master region node. I don't know if that's enough, though, but I'll look into this.

On Mon, Aug 26, 2013 at 2:55 PM, Ashwanth Kumar <[email protected]> wrote:

> Just click on "Split" and that should be fine. It will pick a key in the
> middle of each region and split there. Splits go 1 -> 2 -> 4 -> 8 regions
> and so on. The number of regions for a table is something you should be
> able to work out from the number of region servers and the size of the
> data you expect to store in the table.
>
> A bigger caching value typically means more data in memory for the mapper
> task. I guess as long as you have enough memory to hold that data you are
> fine. Maybe other experts can help me here.
>
> - A split on the table gives you parallelism, since each region is
>   typically processed by a separate mapper.
> - The right split plus decent caching can give you the best performance
>   on full-table-scan jobs. As I already said, beware of the
>   ScannerTimeoutException that can arise from very high caching values;
>   you might want to increase the scanner timeout in that case.
>
> On Mon, Aug 26, 2013 at 2:42 PM, Pavan Sudheendra <[email protected]> wrote:
>
>> Hi Ashwanth, thanks for the reply.
>>
>> I went to the HBase web UI and saw that my table had 1 online region.
>> Can you please guide me on how to split this table? I see the UI asking
>> for a region key and a split button. How many splits can I make,
>> exactly? Can I give two different keys and assume that the table is then
>> split into 3: one region from the beginning to key1, one from key1 to
>> key2, and one from key2 to the rest?
>>
>> On Mon, Aug 26, 2013 at 2:36 PM, Ashwanth Kumar <[email protected]> wrote:
>>
>>> setCaching sets the value via the API; the other way is to set it in
>>> the job configuration using the key "hbase.client.scanner.caching".
>>>
>>> I just realized that, given you have just 1 region, caching wouldn't
>>> help much in reducing the time. Splitting might be an ideal solution.
>>> Based on the heap space available to each mapper task, try playing
>>> with that 1500 value.
>>>
>>> A word of caution: if you increase it too much, you might see
>>> ScannerTimeoutException in your TT logs.
>>>
>>> On Mon, Aug 26, 2013 at 2:29 PM, Pavan Sudheendra <[email protected]> wrote:
>>>
>>>> Hi Ashwanth,
>>>> My caching is set to 1500:
>>>>
>>>> scan.setCaching(1500);
>>>> scan.setCacheBlocks(false);
>>>>
>>>> Can I set the number of splits via an API?
>>>>
>>>> On Mon, Aug 26, 2013 at 2:22 PM, Ashwanth Kumar <[email protected]> wrote:
>>>>
>>>>> To answer your question: go to the HBase web UI, where you can
>>>>> initiate a manual split on the table.
>>>>>
>>>>> But before you do that, maybe you can try increasing your client
>>>>> caching value (hbase.client.scanner.caching) in your job.
>>>>>
>>>>> On Mon, Aug 26, 2013 at 2:09 PM, Pavan Sudheendra <[email protected]> wrote:
>>>>>
>>>>>> What is the input split of the HBase table in this job status?
>>>>>>
>>>>>> map() completion: 0.0
>>>>>> reduce() completion: 0.0
>>>>>> Counters: 24
>>>>>>   File System Counters
>>>>>>     FILE: Number of bytes read=0
>>>>>>     FILE: Number of bytes written=216030
>>>>>>     FILE: Number of read operations=0
>>>>>>     FILE: Number of large read operations=0
>>>>>>     FILE: Number of write operations=0
>>>>>>     HDFS: Number of bytes read=116
>>>>>>     HDFS: Number of bytes written=0
>>>>>>     HDFS: Number of read operations=1
>>>>>>     HDFS: Number of large read operations=0
>>>>>>     HDFS: Number of write operations=0
>>>>>>   Job Counters
>>>>>>     Launched map tasks=1
>>>>>>     Data-local map tasks=1
>>>>>>     Total time spent by all maps in occupied slots (ms)=3332
>>>>>>   Map-Reduce Framework
>>>>>>     Map input records=45570
>>>>>>     Map output records=45569
>>>>>>     Map output bytes=4682237
>>>>>>     Input split bytes=116
>>>>>>     Combine input records=0
>>>>>>     Combine output records=0
>>>>>>     Spilled Records=0
>>>>>>     CPU time spent (ms)=1142950
>>>>>>     Physical memory (bytes) snapshot=475811840
>>>>>>     Virtual memory (bytes) snapshot=1262202880
>>>>>>     Total committed heap usage (bytes)=370343936
>>>>>>
>>>>>> My table has 80,000 rows. Is there any way to increase the number of
>>>>>> input splits? It takes nearly 30 minutes for the map tasks to
>>>>>> complete, which is very unusual.
>>>>>>
>>>>>> --
>>>>>> Regards-
>>>>>> Pavan
>>>>>
>>>>> --
>>>>> Ashwanth Kumar / ashwanthkumar.in

--
Regards-
Pavan
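[Editor's note] Putting the advice in this thread together, the job wiring implied by the `scan.setCaching(1500)` / `scan.setCacheBlocks(false)` snippets looks roughly like the sketch below, written against the 0.94-era HBase and MRv1 API the thread is using. The table name "mytable" and the identity `MyMapper` are placeholders, not taken from the thread. Note that caching alone does not add parallelism: with one region, `TableInputFormat` still produces a single input split, so one mapper; splitting the table (e.g. via the web UI, as discussed above) is what raises the mapper count.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class FullTableScanJob {

    // Placeholder identity mapper; one map task runs per region of the table.
    static class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(row, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Equivalent to scan.setCaching(1500), but via the job configuration,
        // as mentioned in the thread:
        // conf.set("hbase.client.scanner.caching", "1500");

        Job job = new Job(conf, "full-table-scan");
        job.setJarByClass(FullTableScanJob.class);

        Scan scan = new Scan();
        scan.setCaching(1500);       // rows fetched per scanner RPC; tune against mapper heap
        scan.setCacheBlocks(false);  // don't churn the block cache on a full scan

        TableMapReduceUtil.initTableMapperJob(
                "mytable", scan, MyMapper.class,
                ImmutableBytesWritable.class, Result.class, job);

        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);
        job.waitForCompletion(true);
    }
}
```

If a very high caching value triggers the ScannerTimeoutException mentioned above, the timeout to raise on 0.94.x releases is, to my knowledge, `hbase.regionserver.lease.period` (the scanner lease). This sketch needs a running HBase cluster and the hbase-client jars on the classpath; it is not runnable standalone.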

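[Editor's note] The thread's "as long as you have enough memory you are fine" advice can be sanity-checked with back-of-the-envelope arithmetic from the counters above: 4,682,237 map output bytes over 45,569 records, assumed representative of row size. This is illustrative arithmetic only, not an HBase API.

```java
public class ScanCachingFootprint {

    // Approximate bytes buffered client-side per scanner next() batch:
    // average row size times the caching value.
    static long batchBytes(long totalBytes, long totalRows, int caching) {
        long avgRowBytes = totalBytes / totalRows; // ~102 bytes per row here
        return avgRowBytes * caching;
    }

    public static void main(String[] args) {
        long perBatch = batchBytes(4682237L, 45569L, 1500);
        // ~153 KB per batch -- tiny next to the ~370 MB committed heap shown
        // in the counters, so caching at 1500 is comfortably affordable; the
        // single region, not memory, was the bottleneck in this job.
        System.out.println(perBatch + " bytes per batch");
    }
}
```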