Tom,

> I am a big fan of the Increment class. Unfortunately, I'm not doing
> simple increments for the viewer count. I will be receiving duplicate
> messages from a particular client for a specific cube cell, and don't
> want them to be counted twice
Gotcha.

> I created an RPC endpoint coprocessor to perform this function but
> performance suffered heavily under load (it appears that the endpoint
> performs all functions in serial).

Did you serialize access to your data structure(s)?

> When I tried implementing it as a region observer, I was unsure of how
> to correctly replace the provided "put" with my own. When I issued a
> put from within "prePut", the server blocked the new put (waiting for
> the "prePut" to finish). Should I be attempting to modify the WALEdit
> object?

You can add KVs to the WALEdit. Or, you can get a reference to the Put's
familyMap:

    Map<byte[], List<KeyValue>> familyMap = put.getFamilyMap();

and if you modify the map, you'll change what gets committed.
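To make that concrete, something along these lines should work as a
RegionObserver. This is only a sketch against the 0.92-era coprocessor
API, untested; the class name and the compute() helper are placeholders
for whatever transformation you actually want:

    import java.io.IOException;
    import java.util.List;
    import java.util.Map;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

    public class RewritingObserver extends BaseRegionObserver {

      @Override
      public void prePut(ObserverContext<RegionCoprocessorEnvironment> e,
          Put put, WALEdit edit, boolean writeToWAL) throws IOException {
        // Rewrite the incoming KeyValues in place instead of issuing a
        // second put from inside the hook.
        Map<byte[], List<KeyValue>> familyMap = put.getFamilyMap();
        for (List<KeyValue> kvs : familyMap.values()) {
          for (int i = 0; i < kvs.size(); i++) {
            KeyValue kv = kvs.get(i);
            kvs.set(i, new KeyValue(kv.getRow(), kv.getFamily(),
                kv.getQualifier(), kv.getTimestamp(), compute(kv.getValue())));
          }
        }
      }

      // Placeholder: derive whatever value you want persisted from the
      // client-supplied one. Here it just passes the bytes through.
      private byte[] compute(byte[] value) {
        return value;
      }
    }

Whatever is in the familyMap when prePut returns is what the region
commits, so the client-supplied user ID never has to be stored.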
> Is there a way to extend the functionality of "Increment" to provide
> arbitrary bitwise operations on the contents of a field?

As a matter of design, this should be a new operation. It does sound
interesting and useful, some sort of atomic bitfield.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)


----- Original Message -----
> From: Tom Brown <[email protected]>
> To: [email protected]
> Cc:
> Sent: Monday, April 9, 2012 10:14 PM
> Subject: Re: Add client complexity or use a coprocessor?
>
> Andy,
>
> I am a big fan of the Increment class. Unfortunately, I'm not doing
> simple increments for the viewer count. I will be receiving duplicate
> messages from a particular client for a specific cube cell, and don't
> want them to be counted twice (my stats don't have to be 100%
> accurate, but the expected rate of duplicates will be higher than the
> allowable error rate).
>
> I created an RPC endpoint coprocessor to perform this function but
> performance suffered heavily under load (it appears that the endpoint
> performs all functions in serial).
>
> When I tried implementing it as a region observer, I was unsure of how
> to correctly replace the provided "put" with my own. When I issued a
> put from within "prePut", the server blocked the new put (waiting for
> the "prePut" to finish). Should I be attempting to modify the WALEdit
> object?
>
> Is there a way to extend the functionality of "Increment" to provide
> arbitrary bitwise operations on the contents of a field?
>
> Thanks again!
>
> --Tom
>
>> If it helps, yes this is possible:
>>
>>> Can I observe updates to a
>>> particular table and replace the provided data with my own? (The
>>> client calls "put" with the actual user ID, my co-processor replaces
>>> it with a computed value, so the actual user ID never gets stored in
>>> HBase).
>>
>> Since your option #2 requires atomic updates to the data structure,
>> have you considered native atomic increments? See
>>
>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#incrementColumnValue%28byte[],%20byte[],%20byte[],%20long,%20boolean%29
>>
>> or
>>
>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Increment.html
>>
>> The former is a round trip for each value update. The latter allows
>> you to pack multiple updates into a single round trip. This would give
>> you accurate counts even with concurrent writers.
>>
>> It should be possible for you to do partial aggregation on the client
>> side too whenever parallel requests colocate multiple updates to the
>> same cube within some small window of time.
>>
>> Best regards,
>>
>>    - Andy
>>
>> Problems worthy of attack prove their worth by hitting back.
>> - Piet Hein (via Tom White)
>>
>> ----- Original Message -----
>>> From: Tom Brown <[email protected]>
>>> To: [email protected]
>>> Cc:
>>> Sent: Monday, April 9, 2012 9:48 AM
>>> Subject: Add client complexity or use a coprocessor?
>>>
>>> To whom it may concern,
>>>
>>> Ignoring the complexities of gathering the data, assume that I will
>>> be tracking millions of unique viewers. Updates from each of our
>>> millions of clients are gathered in a centralized platform and spread
>>> among a group of machines for processing and inserting into HBase
>>> (assume that this group can be scaled horizontally). The data is
>>> stored in an OLAP cube format, and one of the metrics I'm tracking
>>> across various attributes is viewership (how many people from Y are
>>> watching X).
>>>
>>> I'm writing this to ask for your thoughts as to the most appropriate
>>> way to structure my data so I can count unique TV viewers (assume a
>>> service like Netflix or Hulu).
>>>
>>> Here are the solutions I'm considering:
>>>
>>> 1. Store each unique user ID as the cell name within the cube(s)
>>> where it occurs. This has the advantage of 100% accuracy, but the
>>> downside is the enormous space required to store each unique cell.
>>> Consuming this data is also problematic, as the only way to provide a
>>> viewership count is by counting each cell. To save the overhead of
>>> sending each cell over the network, counting them could be done by a
>>> coprocessor on the region server, but that still doesn't avoid the
>>> overhead of reading each cell from disk. I'm also not sure what
>>> happens if a single row is larger than an entire region (48 bytes per
>>> user ID * 10,000,000 users = 480MB).
>>>
>>> 2. Store a byte array that allows estimating unique viewers (with a
>>> small margin of error*). Add a co-processor for updating this column
>>> so I can guarantee that updates to a specific OLAP cell will be
>>> atomic. The main benefit of this path is that the nodes that update
>>> HBase can be less complex. Another benefit I see is that I can just
>>> add more HBase regions as scale requires. However, I'm not sure if I
>>> can use a coprocessor the way I want; can I observe updates to a
>>> particular table and replace the provided data with my own? (The
>>> client calls "put" with the actual user ID, my co-processor replaces
>>> it with a computed value, so the actual user ID never gets stored in
>>> HBase.)
>>>
>>> 3. Store a byte array that allows estimating unique viewers (with a
>>> small margin of error*). Re-arrange my architecture so that each OLAP
>>> cell is only updated by a single node. The main benefit of this would
>>> be that I don't need to worry about atomic operations in HBase, since
>>> all updates for a single cell will be atomic and in serial. The
>>> biggest downside is that I believe it will add significant complexity
>>> to my overall architecture.
>>>
>>> Thanks for your time, and I look forward to hearing your thoughts.
>>>
>>> Sincerely,
>>> Tom Brown
>>>
>>> *(For information about the byte array mentioned in #2 and #3, see:
>>> http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html)
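P.S. To illustrate the "byte array that allows estimating unique viewers"
idea from the quoted message: here is a toy linear counting sketch -- one
of the simpler estimators in that family, not the HyperLogLog-style sketch
the linked article is about, and not necessarily what you have in mind.
The class name and the choice of hash are only placeholders:

    import java.util.Arrays;

    // Toy linear counting sketch: hash each viewer ID to one bit of a
    // fixed-size bitmap. Duplicate reports of the same viewer always hit
    // the same bit, so they never inflate the count.
    public class ViewerSketch {

      private final byte[] bits;

      public ViewerSketch(int sizeInBytes) {
        this.bits = new byte[sizeInBytes];
      }

      // Fold one viewer ID into the sketch. A real implementation would
      // use a stronger hash (e.g. Murmur) than Arrays.hashCode.
      public void add(byte[] viewerId) {
        int bit = (Arrays.hashCode(viewerId) & 0x7fffffff) % (bits.length * 8);
        bits[bit >>> 3] |= 1 << (bit & 7);
      }

      // OR in another sketch of the same size, e.g. the byte[] currently
      // stored in the OLAP cell, for a read-modify-write update.
      public void merge(byte[] other) {
        for (int i = 0; i < bits.length; i++) {
          bits[i] |= other[i];
        }
      }

      // Linear counting estimate: n ~= -m * ln(zeroBits / m).
      public long estimate() {
        int m = bits.length * 8;
        int zeros = 0;
        for (byte b : bits) {
          zeros += 8 - Integer.bitCount(b & 0xff);
        }
        if (zeros == 0) {
          return m; // saturated; the sketch is too small for this cardinality
        }
        return Math.round(-m * Math.log((double) zeros / m));
      }

      public byte[] toBytes() {
        return bits;
      }
    }

Because merging is just a bitwise OR, it doesn't matter whether the OR
happens in the client, in a region observer, or in a hypothetical atomic
bitfield operation -- the result is the same, and duplicate reports from
the same client fold away for free.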
