Tom,

> I am a big fan of the Increment class. Unfortunately, I'm not doing
> simple increments for the viewer count. I will be receiving duplicate
> messages from a particular client for a specific cube cell, and don't
> want them to be counted twice
Gotcha.

> I created an RPC endpoint coprocessor to perform this function but
> performance suffered heavily under load (it appears that the endpoint
> performs all functions in serial).

Did you serialize access to your data structure(s)?

> When I tried implementing it as a region observer, I was unsure of how
> to correctly replace the provided "put" with my own. When I issued a
> put from within "prePut", the server blocked the new put (waiting for
> the "prePut" to finish). Should I be attempting to modify the WALEdit
> object?

You can add KVs to the WALEdit. Or, you can get a reference to the Put's
familyMap:

    Map<byte[], List<KeyValue>> familyMap = put.getFamilyMap();

and if you modify the map, you'll change what gets committed.
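To make that concrete, something along these lines should work as a
RegionObserver. This is only a sketch against the 0.92-era coprocessor
API, untested; the class name and the compute() helper are placeholders
for whatever transformation you actually want:

    import java.io.IOException;
    import java.util.List;
    import java.util.Map;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

    public class RewritingObserver extends BaseRegionObserver {

      @Override
      public void prePut(ObserverContext<RegionCoprocessorEnvironment> e,
          Put put, WALEdit edit, boolean writeToWAL) throws IOException {
        // Rewrite the incoming KeyValues in place instead of issuing a
        // second put from inside the hook.
        Map<byte[], List<KeyValue>> familyMap = put.getFamilyMap();
        for (List<KeyValue> kvs : familyMap.values()) {
          for (int i = 0; i < kvs.size(); i++) {
            KeyValue kv = kvs.get(i);
            kvs.set(i, new KeyValue(kv.getRow(), kv.getFamily(),
                kv.getQualifier(), kv.getTimestamp(), compute(kv.getValue())));
          }
        }
      }

      // Placeholder: derive whatever value you want persisted from the
      // client-supplied one. Here it just passes the bytes through.
      private byte[] compute(byte[] value) {
        return value;
      }
    }

Whatever is in the familyMap when prePut returns is what the region
commits, so the client-supplied user ID never has to be stored.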
> Is there a way to extend the functionality of "Increment" to provide
> arbitrary bitwise operations on the contents of a field?

As a matter of design, this should be a new operation. It does sound
interesting and useful, some sort of atomic bitfield.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)


----- Original Message -----
> From: Tom Brown <[email protected]>
> To: [email protected]
> Cc:
> Sent: Monday, April 9, 2012 10:14 PM
> Subject: Re: Add client complexity or use a coprocessor?
>
> Andy,
>
> I am a big fan of the Increment class. Unfortunately, I'm not doing
> simple increments for the viewer count. I will be receiving duplicate
> messages from a particular client for a specific cube cell, and don't
> want them to be counted twice (my stats don't have to be 100%
> accurate, but the expected rate of duplicates will be higher than the
> allowable error rate).
>
> I created an RPC endpoint coprocessor to perform this function but
> performance suffered heavily under load (it appears that the endpoint
> performs all functions in serial).
>
> When I tried implementing it as a region observer, I was unsure of how
> to correctly replace the provided "put" with my own. When I issued a
> put from within "prePut", the server blocked the new put (waiting for
> the "prePut" to finish). Should I be attempting to modify the WALEdit
> object?
>
> Is there a way to extend the functionality of "Increment" to provide
> arbitrary bitwise operations on the contents of a field?
>
> Thanks again!
>
> --Tom
>
>> If it helps, yes this is possible:
>>
>>> Can I observe updates to a
>>> particular table and replace the provided data with my own? (The
>>> client calls "put" with the actual user ID, my co-processor replaces
>>> it with a computed value, so the actual user ID never gets stored in
>>> HBase).
>>
>> Since your option #2 requires atomic updates to the data structure,
>> have you considered native atomic increments? See
>>
>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#incrementColumnValue%28byte[],%20byte[],%20byte[],%20long,%20boolean%29
>>
>> or
>>
>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Increment.html
>>
>> The former is a round trip for each value update. The latter allows
>> you to pack multiple updates into a single round trip. This would give
>> you accurate counts even with concurrent writers.
>>
>> It should be possible for you to do partial aggregation on the client
>> side too whenever parallel requests colocate multiple updates to the
>> same cube within some small window of time.
>>
>> Best regards,
>>
>>    - Andy
>>
>> Problems worthy of attack prove their worth by hitting back.
>> - Piet Hein (via Tom White)
>>
>> ----- Original Message -----
>>> From: Tom Brown <[email protected]>
>>> To: [email protected]
>>> Cc:
>>> Sent: Monday, April 9, 2012 9:48 AM
>>> Subject: Add client complexity or use a coprocessor?
>>>
>>> To whom it may concern,
>>>
>>> Ignoring the complexities of gathering the data, assume that I will
>>> be tracking millions of unique viewers. Updates from each of our
>>> millions of clients are gathered in a centralized platform and spread
>>> among a group of machines for processing and inserting into HBase
>>> (assume that this group can be scaled horizontally). The data is
>>> stored in an OLAP cube format, and one of the metrics I'm tracking
>>> across various attributes is viewership (how many people from Y are
>>> watching X).
>>>
>>> I'm writing this to ask for your thoughts as to the most appropriate
>>> way to structure my data so I can count unique TV viewers (assume a
>>> service like Netflix or Hulu).
>>>
>>> Here are the solutions I'm considering:
>>>
>>> 1. Store each unique user ID as the cell name within the cube(s)
>>> where it occurs. This has the advantage of 100% accuracy, but the
>>> downside is the enormous space required to store each unique cell.
>>> Consuming this data is also problematic, as the only way to provide a
>>> viewership count is by counting each cell. To save the overhead of
>>> sending each cell over the network, counting them could be done by a
>>> coprocessor on the region server, but that still doesn't avoid the
>>> overhead of reading each cell from disk. I'm also not sure what
>>> happens if a single row is larger than an entire region (48 bytes per
>>> user ID * 10,000,000 users = 480MB).
>>>
>>> 2. Store a byte array that allows estimating unique viewers (with a
>>> small margin of error*). Add a co-processor for updating this column
>>> so I can guarantee that updates to a specific OLAP cell will be
>>> atomic. The main benefit of this path is that the nodes that update
>>> HBase can be less complex. Another benefit I see is that I can just
>>> add more HBase regions as scale requires. However, I'm not sure if I
>>> can use a coprocessor the way I want; can I observe updates to a
>>> particular table and replace the provided data with my own? (The
>>> client calls "put" with the actual user ID, my co-processor replaces
>>> it with a computed value, so the actual user ID never gets stored in
>>> HBase.)
>>>
>>> 3. Store a byte array that allows estimating unique viewers (with a
>>> small margin of error*). Re-arrange my architecture so that each OLAP
>>> cell is only updated by a single node. The main benefit of this would
>>> be that I don't need to worry about atomic operations in HBase, since
>>> all updates for a single cell will be atomic and in serial. The
>>> biggest downside is that I believe it will add significant complexity
>>> to my overall architecture.
>>>
>>> Thanks for your time, and I look forward to hearing your thoughts.
>>>
>>> Sincerely,
>>> Tom Brown
>>>
>>> *(For information about the byte array mentioned in #2 and #3, see:
>>> http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html)
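P.S. To illustrate the "byte array that allows estimating unique viewers"
idea from the quoted message: here is a toy linear counting sketch -- one
of the simpler estimators in that family, not the HyperLogLog-style sketch
the linked article is about, and not necessarily what you have in mind.
The class name and the choice of hash are only placeholders:

    import java.util.Arrays;

    // Toy linear counting sketch: hash each viewer ID to one bit of a
    // fixed-size bitmap. Duplicate reports of the same viewer always hit
    // the same bit, so they never inflate the count.
    public class ViewerSketch {

      private final byte[] bits;

      public ViewerSketch(int sizeInBytes) {
        this.bits = new byte[sizeInBytes];
      }

      // Fold one viewer ID into the sketch. A real implementation would
      // use a stronger hash (e.g. Murmur) than Arrays.hashCode.
      public void add(byte[] viewerId) {
        int bit = (Arrays.hashCode(viewerId) & 0x7fffffff) % (bits.length * 8);
        bits[bit >>> 3] |= 1 << (bit & 7);
      }

      // OR in another sketch of the same size, e.g. the byte[] currently
      // stored in the OLAP cell, for a read-modify-write update.
      public void merge(byte[] other) {
        for (int i = 0; i < bits.length; i++) {
          bits[i] |= other[i];
        }
      }

      // Linear counting estimate: n ~= -m * ln(zeroBits / m).
      public long estimate() {
        int m = bits.length * 8;
        int zeros = 0;
        for (byte b : bits) {
          zeros += 8 - Integer.bitCount(b & 0xff);
        }
        if (zeros == 0) {
          return m; // saturated; the sketch is too small for this cardinality
        }
        return Math.round(-m * Math.log((double) zeros / m));
      }

      public byte[] toBytes() {
        return bits;
      }
    }

Because merging is just a bitwise OR, it doesn't matter whether the OR
happens in the client, in a region observer, or in a hypothetical atomic
bitfield operation -- the result is the same, and duplicate reports from
the same client fold away for free.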
