Ted:

+1.5B columns
- 5 CF
- 300M CQ

Jianshi

On Wed, Jun 25, 2014 at 1:50 AM, Ted Yu <[email protected]> wrote:

> Thanks for the update.
>
> In your experiment so far, how many columns were involved?
>
> Cheers
>
> On Tue, Jun 24, 2014 at 10:44 AM, Jianshi Huang <[email protected]> wrote:
>
>> +Update:
>>
>> Possibly 100s of billions of columns.
>>
>> On Wed, Jun 25, 2014 at 12:03 AM, Jianshi Huang <[email protected]> wrote:
>>
>>> Hi Ted,
>>>
>>> CF: maybe dozens
>>> Columns: billions (rowkey = nodeId, CF = event type, CQ = Index+eventId)
>>>
>>> Make sense?
>>>
>>> Jianshi
>>>
>>> On Tue, Jun 24, 2014 at 10:33 PM, Ted Yu <[email protected]> wrote:
>>>
>>>> Jianshi:
>>>> How many column families and columns are you expecting (maximum) in
>>>> your largest table?
>>>>
>>>> Cheers
>>>>
>>>> On Tue, Jun 24, 2014 at 7:29 AM, Jianshi Huang <[email protected]> wrote:
>>>>
>>>>> Hi David,
>>>>>
>>>>> I did, it's a wonderful piece of work and for reviewing facts in a
>>>>> network it's a great tool. (And Lumify looks really nice)
>>>>>
>>>>> However, my queries are mostly time-bound (from time A to time B), and
>>>>> to make some queries real-time (< 50ms), I have to roll out my own
>>>>> schema and index, to denormalize properties and to incrementally do
>>>>> aggregations. I don't think there are existing solutions in graph
>>>>> databases that can do these.
>>>>>
>>>>> And it's really fun to implement it myself. :)
>>>>>
>>>>> Please correct me if I'm wrong
>>>>>
>>>>> Jianshi
>>>>>
>>>>> On Tue, Jun 24, 2014 at 10:10 PM, David Medinets <[email protected]> wrote:
>>>>>
>>>>>> Did you get a chance to review http://securegraph.org/? SecureGraph
>>>>>> is an API to manipulate graphs, similar to Blueprints. Unlike
>>>>>> Blueprints, every SecureGraph method requires authorizations and
>>>>>> visibilities. SecureGraph also supports multivalued properties as
>>>>>> well as property metadata.
>>>>>>
>>>>>> On Tue, Jun 24, 2014 at 9:51 AM, Jianshi Huang <[email protected]> wrote:
>>>>>>
>>>>>>> Wow, so many replies and very educational. Thank you all!
>>>>>>>
>>>>>>> I'm working on a Graph backend that I hope the same infrastructure
>>>>>>> can support
>>>>>>>
>>>>>>> 1) interactive graph exploration and queries
>>>>>>>
>>>>>>> Answering what are the interactions among N users from time A to
>>>>>>> time B, and how are users connected (now and before).
>>>>>>>
>>>>>>> 2) real-time (<100ms) feature calculation (aggregation, matching) in
>>>>>>> a network of accounts
>>>>>>>
>>>>>>> Answering questions like: what's the ratio of newly registered
>>>>>>> accounts in my 'connected' (need flexible definition) network, how
>>>>>>> fast does it change; Does the network have a path satisfying
>>>>>>> A(CN) -> B(IT) -> C(US) where the age of the path is less than
>>>>>>> 3 days; etc.
>>>>>>>
>>>>>>> 3) offline simulation of events or offline calculation of new
>>>>>>> features (used for building models), so I need to take snapshots and
>>>>>>> also save point-in-time data
>>>>>>>
>>>>>>> Having them all-in-one in the same infrastructure will greatly
>>>>>>> simplify the implementation.
>>>>>>>
>>>>>>> BTW, I'm working for PayPal, Risk Data Science. (All questions above
>>>>>>> are fake and are not related to PayPal :)
>>>>>>>
>>>>>>> I made a prototype in the last two weeks for purpose 1) and my
>>>>>>> feeling about Accumulo is exactly what many of you have said: it
>>>>>>> just works! Very little admin work, clean and clear documentation
>>>>>>> and APIs. One thing I haven't got right was high-speed ingestion, I
>>>>>>> only got 100K rows/sec/node, but it's already very satisfying. :)
>>>>>>>
>>>>>>> BTW, from Mike's slides it seems HBase is much faster in read
>>>>>>> throughput if the number of columns is small. Any comments? What
>>>>>>> about latency? Can I cache all data in memory in Accumulo to reduce
>>>>>>> latency for cold data (say I just restarted my cluster)?
>>>>>>>
>>>>>>> Jianshi
>>>>>>>
>>>>>>> On Tue, Jun 24, 2014 at 10:41 AM, William Slacum <[email protected]> wrote:
>>>>>>>
>>>>>>>> I think first and foremost, how has writing your application been?
>>>>>>>> Is it something you can easily onboard other people for? Does it
>>>>>>>> seem stable enough? If you can answer those questions positively, I
>>>>>>>> think you have a winning situation.
>>>>>>>>
>>>>>>>> The big three Hadoop vendors (Cloudera, Hortonworks and MapR) all
>>>>>>>> provide some level of support for Accumulo, so it has the pedigree
>>>>>>>> of other members of the Hadoop ecosystem.
>>>>>>>>
>>>>>>>> Regarding the performance, I think Mike's presentation needs some
>>>>>>>> context. He can definitely provide more context than the rest of us
>>>>>>>> (and possibly Sean or Bill |-|), but I think one thing he was
>>>>>>>> driving home is that out of the box, Accumulo is configured to run
>>>>>>>> on someone's laptop. There are adjustments to be made when running
>>>>>>>> at any scale greater than a dev machine and they may not be
>>>>>>>> documented clearly.
>>>>>>>>
>>>>>>>> On Mon, Jun 23, 2014 at 8:16 PM, Tejinder S Luthra <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Mike did a pretty good presentation on performance comparison
>>>>>>>>> between Accumulo / HBase. Again not official IMO but is pretty
>>>>>>>>> detailed in the approach taken and apples-apples comparison
>>>>>>>>> http://www.slideshare.net/AccumuloSummit/10-30-drob
>>>>>>>>>
>>>>>>>>> From: Jeremy Kepner <[email protected]>
>>>>>>>>> To: <[email protected]>
>>>>>>>>> Date: 06/23/2014 07:42 PM
>>>>>>>>> Subject: Re: How does Accumulo compare to HBase
>>>>>>>>> ------------------------------
>>>>>>>>>
>>>>>>>>> Performance is probably the largest difference between Accumulo
>>>>>>>>> and HBase.
>>>>>>>>>
>>>>>>>>> Accumulo can ingest/scan at a rate of 800K entries/sec/node.
>>>>>>>>> This performance scales well into the hundreds of nodes to deliver
>>>>>>>>> 100M+ entries/sec.
>>>>>>>>>
>>>>>>>>> There are no recent HBase benchmarks and none in the peer-reviewed
>>>>>>>>> literature.
>>>>>>>>> Old data suggests that HBase performance is ~1% of Accumulo
>>>>>>>>> performance.
>>>>>>>>>
>>>>>>>>> In short, one can often replace a 20+ node database with
>>>>>>>>> a single node Accumulo database.
>>>>>>>>>
>>>>>>>>> On Tue, Jun 24, 2014 at 01:55:54AM +0800, Jianshi Huang wrote:
>>>>>>>>> > Er... basically I need to explain to my manager why I'm choosing
>>>>>>>>> > Accumulo instead of HBase.
>>>>>>>>> >
>>>>>>>>> > So what are the pros and cons of Accumulo vs. HBase? (btw HBase
>>>>>>>>> > 0.98 also got cell-level security, modeled after Accumulo)
>>>>>>>>> >
>>>>>>>>> > --
>>>>>>>>> > Jianshi Huang
>>>>>>>>> >
>>>>>>>>> > LinkedIn: jianshi
>>>>>>>>> > Twitter: @jshuang
>>>>>>>>> > Github & Blog: http://huangjs.github.com/

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
