Ted: Sorry, wrong number, this one is correct:
+ 10.5B columns
- 5 CF
- ~2B CQ

Jianshi


On Wed, Jun 25, 2014 at 2:01 AM, Jianshi Huang <[email protected]> wrote:

> Ted:
>
> +1.5B columns
> - 5 CF
> - 300M CQ
>
> Jianshi
>
>
> On Wed, Jun 25, 2014 at 1:50 AM, Ted Yu <[email protected]> wrote:
>
>> Thanks for the update.
>>
>> In your experiment so far, how many columns were involved?
>>
>> Cheers
>>
>>
>> On Tue, Jun 24, 2014 at 10:44 AM, Jianshi Huang <[email protected]> wrote:
>>
>>> +Update:
>>>
>>> Possibly 100s of billions of columns.
>>>
>>>
>>> On Wed, Jun 25, 2014 at 12:03 AM, Jianshi Huang <[email protected]> wrote:
>>>
>>>> Hi Ted,
>>>>
>>>> CF: maybe dozens
>>>> Columns: billions (rowkey = nodeId, CF = event type, CQ = Index+eventId)
>>>>
>>>> Make sense?
>>>>
>>>> Jianshi
>>>>
>>>>
>>>> On Tue, Jun 24, 2014 at 10:33 PM, Ted Yu <[email protected]> wrote:
>>>>
>>>>> Jianshi:
>>>>> How many column families and columns are you expecting (maximum) in
>>>>> your largest table?
>>>>>
>>>>> Cheers
>>>>>
>>>>>
>>>>> On Tue, Jun 24, 2014 at 7:29 AM, Jianshi Huang <[email protected]> wrote:
>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>> I did, it's a wonderful piece of work and for reviewing facts in a
>>>>>> network it's a great tool. (And Lumify looks really nice)
>>>>>>
>>>>>> However, my queries are mostly time-bound (from time A to time B),
>>>>>> and to make some queries real-time (< 50ms), I have to roll out my
>>>>>> own schema and index, to denormalize properties and to do
>>>>>> aggregations incrementally. I don't think there are existing
>>>>>> solutions in graph databases that can do these.
>>>>>>
>>>>>> And it's really fun to implement it myself. :)
>>>>>>
>>>>>> Please correct me if I'm wrong
>>>>>>
>>>>>> Jianshi
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 24, 2014 at 10:10 PM, David Medinets <[email protected]> wrote:
>>>>>>
>>>>>>> Did you get a chance to review http://securegraph.org/? SecureGraph
>>>>>>> is an API to manipulate graphs, similar to Blueprints.
>>>>>>> Unlike Blueprints, every SecureGraph method requires authorizations
>>>>>>> and visibilities. SecureGraph also supports multivalued properties
>>>>>>> as well as property metadata.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 24, 2014 at 9:51 AM, Jianshi Huang <[email protected]> wrote:
>>>>>>>
>>>>>>>> Wow, so many replies and very educational. Thank you all!
>>>>>>>>
>>>>>>>> I'm working on a graph backend, and I hope the same infrastructure
>>>>>>>> can support:
>>>>>>>>
>>>>>>>> 1) interactive graph exploration and queries
>>>>>>>>
>>>>>>>> Answering what the interactions are among N users from time A to
>>>>>>>> time B, and how users are connected (now and before).
>>>>>>>>
>>>>>>>> 2) real-time (<100ms) feature calculation (aggregation, matching)
>>>>>>>> in a network of accounts
>>>>>>>>
>>>>>>>> Answering questions like: what's the ratio of newly registered
>>>>>>>> accounts in my 'connected' (need flexible definition) network, and
>>>>>>>> how fast does it change; does the network have a path satisfying
>>>>>>>> A(CN) -> B(IT) -> C(US) where the age of the path is less than
>>>>>>>> 3 days; etc.
>>>>>>>>
>>>>>>>> 3) offline simulation of events or offline calculation of new
>>>>>>>> features (used for building models), so I need to take snapshots
>>>>>>>> and also save point-in-time data
>>>>>>>>
>>>>>>>> Having them all-in-one in the same infrastructure will greatly
>>>>>>>> simplify the implementation.
>>>>>>>>
>>>>>>>> BTW, I'm working for PayPal, Risk Data Science. (All questions
>>>>>>>> above are fake and are not related to PayPal :)
>>>>>>>>
>>>>>>>> I made a prototype in the last two weeks for purpose 1) and my
>>>>>>>> feeling about Accumulo is exactly what many of you have said: it
>>>>>>>> just works! Very little admin work, clean and clear documentation
>>>>>>>> and APIs. One thing I haven't got right is high-speed ingestion:
>>>>>>>> I only got 100K rows/sec/node, but it's already very satisfying. :)
>>>>>>>>
>>>>>>>> BTW, from Mike's slides it seems HBase is much faster in read
>>>>>>>> throughput if the number of columns is small. Any comments? What
>>>>>>>> about latency? Can I cache all data in memory in Accumulo to reduce
>>>>>>>> latency for cold data (say I just restarted my cluster)?
>>>>>>>>
>>>>>>>>
>>>>>>>> Jianshi
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jun 24, 2014 at 10:41 AM, William Slacum <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I think first and foremost, how has writing your application been?
>>>>>>>>> Is it something you can easily onboard other people for? Does it
>>>>>>>>> seem stable enough? If you can answer those questions positively,
>>>>>>>>> I think you have a winning situation.
>>>>>>>>>
>>>>>>>>> The big three Hadoop vendors (Cloudera, Hortonworks and MapR) all
>>>>>>>>> provide some level of support for Accumulo, so it has the pedigree
>>>>>>>>> of other members of the Hadoop ecosystem.
>>>>>>>>>
>>>>>>>>> Regarding the performance, I think Mike's presentation needs some
>>>>>>>>> context. He can definitely provide more context than the rest of
>>>>>>>>> us (and possibly Sean or Bill |-|), but I think one thing he was
>>>>>>>>> driving home is that out of the box, Accumulo is configured to run
>>>>>>>>> on someone's laptop. There are adjustments to be made when running
>>>>>>>>> at any scale greater than a dev machine and they may not be
>>>>>>>>> documented clearly.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jun 23, 2014 at 8:16 PM, Tejinder S Luthra <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Mike did a pretty good presentation on performance comparison
>>>>>>>>>> between Accumulo / HBase.
>>>>>>>>>> Again, not official IMO, but it is pretty detailed in the
>>>>>>>>>> approach taken and is an apples-to-apples comparison:
>>>>>>>>>> http://www.slideshare.net/AccumuloSummit/10-30-drob
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> From: Jeremy Kepner <[email protected]>
>>>>>>>>>> To: <[email protected]>
>>>>>>>>>> Date: 06/23/2014 07:42 PM
>>>>>>>>>> Subject: Re: How does Accumulo compare to HBase
>>>>>>>>>> ------------------------------
>>>>>>>>>>
>>>>>>>>>> Performance is probably the largest difference between Accumulo
>>>>>>>>>> and HBase.
>>>>>>>>>>
>>>>>>>>>> Accumulo can ingest/scan at a rate of 800K entries/sec/node.
>>>>>>>>>> This performance scales well into the hundreds of nodes to deliver
>>>>>>>>>> 100M+ entries/sec.
>>>>>>>>>>
>>>>>>>>>> There are no recent HBase benchmarks and none in the
>>>>>>>>>> peer-reviewed literature.
>>>>>>>>>> Old data suggests that HBase performance is ~1% of Accumulo
>>>>>>>>>> performance.
>>>>>>>>>>
>>>>>>>>>> In short, one can often replace a 20+ node database with
>>>>>>>>>> a single node Accumulo database.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 24, 2014 at 01:55:54AM +0800, Jianshi Huang wrote:
>>>>>>>>>> > Er... basically I need to explain to my manager why I'm
>>>>>>>>>> > choosing Accumulo instead of HBase.
>>>>>>>>>> >
>>>>>>>>>> > So what are the pros and cons of Accumulo vs. HBase?
>>>>>>>>>> > (btw HBase 0.98 also got cell-level security, modeled after
>>>>>>>>>> > Accumulo)
>>>>>>>>>> >
>>>>>>>>>> > --
>>>>>>>>>> > Jianshi Huang
>>>>>>>>>> >
>>>>>>>>>> > LinkedIn: jianshi
>>>>>>>>>> > Twitter: @jshuang
>>>>>>>>>> > Github & Blog: http://huangjs.github.com/
>>>>>>>>>>

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
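
For readers following the schema mentioned in the thread (rowkey = nodeId, CF = event type, CQ = Index+eventId), here is a minimal sketch of how such a layout could serve time-bound queries. It assumes the "Index" part is a zero-padded millisecond timestamp, so that lexicographic order matches time order; the thread does not say what the index actually is. The names `make_key`, `SortedTable` and `scan_time_range` are hypothetical, and the in-memory sorted list only stands in for a sorted key-value table. This is not Accumulo client code.

```python
from bisect import bisect_left, bisect_right

# Assumption: "Index" is a non-negative millisecond timestamp, zero-padded
# to a fixed width so string order equals numeric order.
TS_WIDTH = 13


def make_key(node_id, event_type, ts_ms, event_id):
    """Build a (row, CF, CQ) tuple with CQ = padded timestamp + ':' + eventId."""
    cq = f"{ts_ms:0{TS_WIDTH}d}:{event_id}"
    return (node_id, event_type, cq)


class SortedTable:
    """In-memory stand-in for a table kept sorted by (row, CF, CQ)."""

    def __init__(self):
        self._keys = []    # sorted list of key tuples
        self._values = {}  # key tuple -> value

    def put(self, key, value):
        if key not in self._values:
            # Insert at the position that keeps the key list sorted.
            self._keys.insert(bisect_left(self._keys, key), key)
        self._values[key] = value

    def scan_time_range(self, node_id, event_type, start_ms, end_ms):
        """Return (key, value) pairs with start_ms <= timestamp <= end_ms."""
        lo = (node_id, event_type, f"{start_ms:0{TS_WIDTH}d}")
        # '\uffff' sorts after any ordinary eventId character, so the upper
        # bound covers every event recorded at exactly end_ms.
        hi = (node_id, event_type, f"{end_ms:0{TS_WIDTH}d}:\uffff")
        i = bisect_left(self._keys, lo)
        j = bisect_right(self._keys, hi)
        return [(k, self._values[k]) for k in self._keys[i:j]]
```

Because all events for one node and event type are contiguous and ordered by time under this encoding, a "from time A to time B" query becomes a single range scan rather than a filter over the whole row.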
