Andy,

I am a big fan of the Increment class. Unfortunately, I'm not doing
simple increments for the viewer count. I will be receiving duplicate
messages from a particular client for a specific cube cell, and I
don't want them to be counted twice. (My stats don't have to be 100%
accurate, but the expected rate of duplicates will be higher than the
allowable error rate.)

I created an RPC endpoint coprocessor to perform this function, but
performance suffered heavily under load (it appears that the endpoint
executes all requests serially). When I tried implementing it as a
region observer instead, I was unsure how to correctly replace the
provided "put" with my own: when I issued a put from within "prePut",
the server blocked the new put (waiting for the "prePut" to finish).
Should I be attempting to modify the WALEdit object instead? Is there
a way to extend the functionality of "Increment" to provide arbitrary
bitwise operations on the contents of a field?

Thanks again!

--Tom
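
P.S. For reference, here is a minimal sketch of the direction I'm now
guessing the region observer should take: rewriting the KeyValues
inside the client's Put from within "prePut", rather than issuing a
second put from the hook (which, as I found, blocks waiting on the
same row). The column family and the computeValue() helper are
made-up placeholders, and this assumes the 0.92-style coprocessor API:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class UserIdRewriteObserver extends BaseRegionObserver {

  // Made-up column family holding the cube's viewer data.
  private static final byte[] FAMILY = Bytes.toBytes("v");

  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> c,
                     Put put, WALEdit edit, boolean writeToWAL)
      throws IOException {
    // Rewrite the KeyValues inside the client's Put in place instead
    // of issuing a second put from inside the hook.
    List<KeyValue> kvs = put.getFamilyMap().get(FAMILY);
    if (kvs == null) {
      return;
    }
    List<KeyValue> rewritten = new ArrayList<KeyValue>(kvs.size());
    for (KeyValue kv : kvs) {
      byte[] computed = computeValue(kv.getValue());
      rewritten.add(new KeyValue(kv.getRow(), kv.getFamily(),
          kv.getQualifier(), kv.getTimestamp(), computed));
    }
    put.getFamilyMap().put(FAMILY, rewritten);
  }

  // Placeholder: map the raw user ID to whatever computed value the
  // cell should actually store (a hash, an HLL register, etc.).
  private byte[] computeValue(byte[] userId) {
    return Bytes.toBytes(Bytes.hashCode(userId));
  }
}

(I'm assuming the WALEdit is built from the Put's family map after
"prePut" returns, so modifying the Put in place should keep the WAL
consistent; I'd appreciate confirmation of that.)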

> If it helps, yes this is possible:
>
>> Can I observe updates to a particular table and replace the
>> provided data with my own? (The client calls "put" with the actual
>> user ID, my co-processor replaces it with a computed value, so the
>> actual user ID never gets stored in HBase.)
>
> Since your option #2 requires atomic updates to the data structure,
> have you considered native atomic increments? See
>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#incrementColumnValue%28byte[],%20byte[],%20byte[],%20long,%20boolean%29
>
> or
>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Increment.html
>
> The former is a round trip for each value update. The latter allows
> you to pack multiple updates into a single round trip. This would
> give you accurate counts even with concurrent writers.
>
> It should be possible for you to do partial aggregation on the
> client side too, whenever parallel requests colocate multiple
> updates to the same cube within some small window of time.
>
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back.
>   - Piet Hein (via Tom White)
>
> ----- Original Message -----
>> From: Tom Brown <[email protected]>
>> To: [email protected]
>> Cc:
>> Sent: Monday, April 9, 2012 9:48 AM
>> Subject: Add client complexity or use a coprocessor?
>>
>> To whom it may concern,
>>
>> Ignoring the complexities of gathering the data, assume that I will
>> be tracking millions of unique viewers. Updates from each of our
>> millions of clients are gathered in a centralized platform and
>> spread among a group of machines for processing and inserting into
>> HBase (assume that this group can be scaled horizontally). The data
>> is stored in an OLAP cube format, and one of the metrics I'm
>> tracking across various attributes is viewership (how many people
>> from Y are watching X).
>>
>> I'm writing this to ask for your thoughts on the most appropriate
>> way to structure my data so I can count unique TV viewers (assume a
>> service like Netflix or Hulu).
>>
>> Here are the solutions I'm considering:
>>
>> 1. Store each unique user ID as the cell name within the cube(s) in
>> which it occurs. This has the advantage of 100% accuracy, but the
>> downside is the enormous space required to store each unique cell.
>> Consuming this data is also problematic, as the only way to produce
>> a viewership count is to count each cell. To save the overhead of
>> sending each cell over the network, the counting could be done by a
>> coprocessor on the region server, but that still doesn't avoid the
>> overhead of reading each cell from disk. I'm also not sure what
>> happens if a single row grows larger than an entire region (48
>> bytes per user ID * 10,000,000 users = 480MB).
>>
>> 2. Store a byte array that allows estimating unique viewers (with a
>> small margin of error*). Add a co-processor for updating this
>> column so I can guarantee that updates to a specific OLAP cell will
>> be atomic. The main benefit of this path is that the nodes that
>> update HBase can be less complex. Another benefit I see is that I
>> can just add more HBase regions as scale requires. However, I'm not
>> sure if I can use a coprocessor the way I want: can I observe
>> updates to a particular table and replace the provided data with my
>> own? (The client calls "put" with the actual user ID, my
>> co-processor replaces it with a computed value, so the actual user
>> ID never gets stored in HBase.)
>>
>> 3. Store a byte array that allows estimating unique viewers (with a
>> small margin of error*). Re-arrange my architecture so that each
>> OLAP cell is only updated by a single node. The main benefit of
>> this approach is that I wouldn't need to worry about atomic
>> operations in HBase, since all updates to a single cell would be
>> applied serially. The biggest downside is that I believe it would
>> add significant complexity to my overall architecture.
>>
>> Thanks for your time, and I look forward to hearing your thoughts.
>>
>> Sincerely,
>> Tom Brown
>>
>> *(For information about the byte array mentioned in #2 and #3, see:
>> http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html)
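
P.P.S. Andy, to make sure I understand the Increment suggestion, is
the idea something like the sketch below? One Increment carrying
several partially aggregated counts for the same cube row, applied
atomically in a single round trip. (The table, family, qualifier, and
row key names here are all made up.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CubeIncrement {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "olap_cube");

    // One cube row (e.g., a show/time bucket); the amounts are
    // partial aggregates collected client-side over a small window.
    Increment inc = new Increment(Bytes.toBytes("showX|2012040917"));
    inc.addColumn(Bytes.toBytes("m"), Bytes.toBytes("views|US"), 3L);
    inc.addColumn(Bytes.toBytes("m"), Bytes.toBytes("views|UK"), 1L);

    // All columns in the Increment are applied atomically, one RPC.
    Result result = table.increment(inc);
    table.close();
  }
}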
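
P.P.P.S. For anyone reading this thread later: here is a toy,
self-contained version of the kind of byte array options #2 and #3
refer to. It uses simple linear counting rather than the HyperLogLog
scheme from the linked article, but it shows the property I care
about: duplicate user IDs set the same bit, so they can never be
counted twice, and merging two arrays is a plain bitwise OR, which is
exactly the operation I was wishing "Increment" could apply
server-side.

public class LinearCounter {

  private final byte[] bits;

  public LinearCounter(int sizeInBytes) {
    bits = new byte[sizeInBytes];
  }

  // Hash the user ID to a bit position and set it. Duplicates hit
  // the same bit, so offering the same ID twice is a no-op.
  public void offer(String userId) {
    int h = userId.hashCode() & 0x7fffffff; // toy hash; use a real one
    int bit = h % (bits.length * 8);
    bits[bit / 8] |= (byte) (1 << (bit % 8));
  }

  // Standard linear-counting estimate: n ~ -m * ln(z / m), where m
  // is the total number of bits and z is the number still zero.
  public long estimate() {
    int m = bits.length * 8;
    int zeros = 0;
    for (byte b : bits) {
      zeros += 8 - Integer.bitCount(b & 0xff);
    }
    return Math.round(-m * Math.log((double) zeros / m));
  }

  // Merging two counters (e.g., inside a coprocessor) is a bitwise
  // OR over the stored bytes, so order and repetition don't matter.
  public void merge(LinearCounter other) {
    for (int i = 0; i < bits.length; i++) {
      bits[i] |= other.bits[i];
    }
  }
}

Usage would be counter.offer(userId) per update and
counter.estimate() per read. The estimate saturates once the bitmap
fills, so the array has to be sized for the expected cardinality.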
