Currently there is no way to save the web UI details. There was some discussion around adding this on the mailing list, but no change as yet.

—
Sent from Mailbox for iPhone
On Tue, Feb 25, 2014 at 7:23 PM, Soumitra Kumar <kumar.soumi...@gmail.com> wrote:
> Found the issue: the splits in HBase were not uniform, so one job was
> taking 90% of the time.
> BTW, is there a way to save the details available on port 4040 after a
> job is finished?
>
> On Tue, Feb 25, 2014 at 7:26 AM, Nick Pentreath
> <nick.pentre...@gmail.com> wrote:
>> It's tricky really, since you may not know upfront how much data is in
>> there. You could possibly take a look at how much data is in the HBase
>> tables to get an idea.
>>
>> It may take a bit of trial and error, like running out of memory trying
>> to cache the dataset, and checking the Spark UI on port 4040 to see how
>> much is cached and how much memory still remains available. You should
>> also take a look at http://spark.apache.org/docs/latest/tuning.html for
>> ideas around memory and serialization tuning.
>>
>> Broadly speaking, what you want to try to do is filter as much data as
>> possible first and cache the subset of data on which you'll be
>> performing multiple passes or computations. For example, based on your
>> code above, you may in fact only wish to cache the "interesting" fields
>> RDD. It all depends on what you're trying to achieve.
>>
>> If you will only be doing one pass through the data anyway (like
>> running a count every time on the full dataset), then caching is not
>> going to help you.
>>
>> On Tue, Feb 25, 2014 at 4:59 PM, Soumitra Kumar
>> <kumar.soumi...@gmail.com> wrote:
>>
>>> Thanks Nick.
>>>
>>> How do I figure out if the RDD fits in memory?
>>>
>>> On Tue, Feb 25, 2014 at 1:04 AM, Nick Pentreath
>>> <nick.pentre...@gmail.com> wrote:
>>>
>>>> cache() only caches the data on the first action (count); the first
>>>> time, it still needs to read the data from the source. So the first
>>>> time you call count, it will take the same amount of time whether
>>>> cache is enabled or not. The second time you call count on a cached
>>>> RDD, you should see that it takes a lot less time (assuming that the
>>>> data fits in memory).
>>>>
>>>> On Tue, Feb 25, 2014 at 9:38 AM, Soumitra Kumar
>>>> <kumar.soumi...@gmail.com> wrote:
>>>>
>>>>> I did try 'hBaseRDD.cache()', but I don't see any improvement.
>>>>>
>>>>> My expectation is that with cache enabled, there should not be any
>>>>> penalty for the 'hBaseRDD.count' call.
>>>>>
>>>>> On Mon, Feb 24, 2014 at 11:29 PM, Nick Pentreath
>>>>> <nick.pentre...@gmail.com> wrote:
>>>>>
>>>>>> Yes, you're initiating a scan for each count call. The normal way
>>>>>> to improve this would be to use cache(), which is what you have in
>>>>>> your commented-out line:
>>>>>> // hBaseRDD.cache()
>>>>>>
>>>>>> If you uncomment that line, you should see an improvement overall.
>>>>>>
>>>>>> If caching is not an option for some reason (maybe the data is too
>>>>>> large), then you can implement an overall count in your readFields
>>>>>> method using accumulators:
>>>>>>
>>>>>> val count = sc.accumulator(0L)
>>>>>> ...
>>>>>> In your flatMap function, do count += 1 for each row (regardless of
>>>>>> whether it is "interesting" or not).
>>>>>>
>>>>>> In your main method, after doing an action (e.g. count in your
>>>>>> case), call val totalCount = count.value.
>>>>>>
>>>>>> On Tue, Feb 25, 2014 at 9:15 AM, Soumitra Kumar
>>>>>> <kumar.soumi...@gmail.com> wrote:
>>>>>>
>>>>>>> I have code which reads an HBase table and counts the number of
>>>>>>> rows containing a field.
>>>>>>>
>>>>>>> def readFields(rdd: RDD[(ImmutableBytesWritable, Result)]): RDD[List[Array[Byte]]] = {
>>>>>>>   rdd.flatMap(kv => {
>>>>>>>     // Set of interesting keys for this use case
>>>>>>>     val keys = List("src")
>>>>>>>     var data = List[Array[Byte]]()
>>>>>>>     var usefulRow = false
>>>>>>>
>>>>>>>     val cf = Bytes.toBytes("cf")
>>>>>>>     keys.foreach { key =>
>>>>>>>       val col = kv._2.getValue(cf, Bytes.toBytes(key))
>>>>>>>       if (col != null) {
>>>>>>>         usefulRow = true
>>>>>>>         data = data :+ col
>>>>>>>       }
>>>>>>>     }
>>>>>>>
>>>>>>>     if (usefulRow)
>>>>>>>       Some(data)
>>>>>>>     else
>>>>>>>       None
>>>>>>>   })
>>>>>>> }
>>>>>>>
>>>>>>> def main(args: Array[String]) {
>>>>>>>   val hBaseRDD = init(args)
>>>>>>>   // hBaseRDD.cache()
>>>>>>>
>>>>>>>   println("**** Initial row count " + hBaseRDD.count())
>>>>>>>   println("**** Rows with interesting fields " +
>>>>>>>     readFields(hBaseRDD).count())
>>>>>>> }
>>>>>>>
>>>>>>> I am running on a one-node CDH installation.
>>>>>>>
>>>>>>> As it is, it takes around 2.5 minutes. But if I comment out
>>>>>>> 'println("**** Initial row count " + hBaseRDD.count())', it takes
>>>>>>> around 1.5 minutes.
>>>>>>>
>>>>>>> Is it doing the HBase scan twice, once for each 'count' call? How
>>>>>>> do I improve it?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -Soumitra.
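For completeness, below is a minimal runnable sketch of the accumulator
approach Nick suggests above, folded into readFields so that the total row
count and the "interesting" count come out of a single HBase scan. It
assumes the Spark 0.9-era Accumulator API and the same column family and
key ("cf"/"src") as the code in the thread; passing SparkContext as a
parameter and the names totalRows and fields are illustrative choices, not
from the thread.

    import org.apache.spark.{Accumulator, SparkContext}
    import org.apache.spark.rdd.RDD
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.util.Bytes

    // Counts every row via an accumulator while extracting the "interesting"
    // fields, so the separate hBaseRDD.count() (and its extra scan) is not needed.
    def readFields(sc: SparkContext,
                   rdd: RDD[(ImmutableBytesWritable, Result)])
        : (RDD[List[Array[Byte]]], Accumulator[Long]) = {
      val totalRows = sc.accumulator(0L)
      val fields = rdd.flatMap { kv =>
        totalRows += 1L // incremented once per row, interesting or not
        val col = kv._2.getValue(Bytes.toBytes("cf"), Bytes.toBytes("src"))
        if (col != null) Some(List(col)) else None // keep only rows with the field
      }
      (fields, totalRows)
    }

    // Usage: an accumulator's value is only defined after an action has run.
    // val (fields, totalRows) = readFields(sc, hBaseRDD)
    // println("**** Rows with interesting fields " + fields.count()) // triggers the scan
    // println("**** Initial row count " + totalRows.value)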