Update: possibly hundreds of billions of columns.
On Wed, Jun 25, 2014 at 12:03 AM, Jianshi Huang <[email protected]> wrote:

> Hi Ted,
>
> CF: maybe dozens
> Columns: billions (rowkey = nodeId, CF = event type, CQ = Index+eventId)
>
> Make sense?
>
> Jianshi
>
>
> On Tue, Jun 24, 2014 at 10:33 PM, Ted Yu <[email protected]> wrote:
>
>> Jianshi:
>> How many column families and columns are you expecting (maximum) in your
>> largest table ?
>>
>> Cheers
>>
>>
>> On Tue, Jun 24, 2014 at 7:29 AM, Jianshi Huang <[email protected]> wrote:
>>
>>> Hi David,
>>>
>>> I did, it's a wonderful piece of work and for reviewing facts in a
>>> networks it's a great tool. (And Lumify looks really nice)
>>>
>>> However, my queries are mostly time-bound (from time A to time B), and
>>> to make some query real-time (< 50ms), I have to roll out my own schema and
>>> index, to denormalize properties and to incrementally do aggregations. I
>>> don't think there're existing solution in Graph database that can do these.
>>>
>>> And it's really fun to implement it myself. :)
>>>
>>> Please correct me if I'm wrong
>>>
>>> Jianshi
>>>
>>>
>>>
>>> On Tue, Jun 24, 2014 at 10:10 PM, David Medinets <[email protected]> wrote:
>>>
>>>> Did you get a chance to review http://securegraph.org/? SecureGraph is
>>>> an API to manipulate graphs, similar to Blueprints. Unlike Blueprints,
>>>> every Secure graph method requires authorizations and visibilities.
>>>> SecureGraph also supports multivalued properties as well as property
>>>> metadata.
>>>>
>>>>
>>>> On Tue, Jun 24, 2014 at 9:51 AM, Jianshi Huang <[email protected]> wrote:
>>>>
>>>>> Wow, so many replies and very educational. Thank you all!
>>>>>
>>>>> I'm working on a Graph backend that I hope the same infrastructure can
>>>>> support
>>>>>
>>>>> 1) interactive graph exploration and queries
>>>>>
>>>>> Answering what are the interactions among N users from time A to time
>>>>> B, and how are users connected (now and before).
>>>>>
>>>>> 2) real-time (<100ms) feature calculation (aggregation, matching) in a
>>>>> network of accounts
>>>>>
>>>>> Answering questions like: what's the ratio of newly registered
>>>>> accounts in my 'connected' (need flexible definition) network, how fast
>>>>> does it change; Does the network has path satisfying A(CN) -> B(IT) ->
>>>>> C(US) where the age of path is less than 3 days; etc.
>>>>>
>>>>> 3) offline simulation of events or offline calculation of new features
>>>>> (used for building models), so I need to take snapshots and also save
>>>>> point-in-time data
>>>>>
>>>>> Having them all-in-one in the same infrastructure will greatly
>>>>> simplify the implementation.
>>>>>
>>>>> BTW, I'm working for PayPal, Risk Data Science. (All questions above
>>>>> are fake and are not related to PayPal :)
>>>>>
>>>>> I made a prototype in the last two weeks for purpose 1) and my feeling
>>>>> about Accumulo is exactly what many of you has said: it just works! Very
>>>>> little admin work, Clean and clear documentation and APIs. One thing I
>>>>> haven't got right was high-speed ingestion, I only got 100K rows/sec/node,
>>>>> but it's already very satisfying. :)
>>>>>
>>>>> BTW, from Mike's slides it seems HBase is much faster in read
>>>>> throughput if the number of columns is small. Any comments? What about
>>>>> latency? Can I cache all data in memory in Accumulo to reduce latency for
>>>>> cold data (say I just restarted my cluster)?
>>>>>
>>>>>
>>>>> Jianshi
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jun 24, 2014 at 10:41 AM, William Slacum <[email protected]> wrote:
>>>>>
>>>>>> I think first and foremost, how has writing your application been? Is
>>>>>> it something you can easily onboard other people for? Does it seem stable
>>>>>> enough? If you can answer those questions positively, I think you have a
>>>>>> winning situation.
>>>>>>
>>>>>> The big three Hadoop vendors (Cloudera, Hortonworks and MapR) all
>>>>>> provide some level of support for Accumulo, so it has the pedigree of
>>>>>> other members of the Hadoop ecosystem.
>>>>>>
>>>>>> Regarding the performance, I think Mike's presentation needs some
>>>>>> context. He can definitely provide more context than the rest of us (and
>>>>>> possibly Sean or Bill |-|), but I think one thing he was driving home is
>>>>>> that out of the box, Accumulo is configured to run on someone's laptop.
>>>>>> There are adjustments to be made when running at any scale greater than a
>>>>>> dev machine and they may not be documented clearly.
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 23, 2014 at 8:16 PM, Tejinder S Luthra <[email protected]> wrote:
>>>>>>
>>>>>>> Mike did a pretty good presentation on performance comparison
>>>>>>> between Accumulo / HBase. Again not official IMO but is pretty detailed
>>>>>>> in the approach take and apples-apples comparison
>>>>>>> http://www.slideshare.net/AccumuloSummit/10-30-drob
>>>>>>>
>>>>>>>
>>>>>>> From: Jeremy Kepner <[email protected]>
>>>>>>> To: <[email protected]>
>>>>>>> Date: 06/23/2014 07:42 PM
>>>>>>> Subject: Re: How does Accumulo compare to HBase
>>>>>>> ------------------------------
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Performance is probably the largest difference between Accumulo and
>>>>>>> HBase.
>>>>>>>
>>>>>>> Accumulo can ingest/scan at a rate of 800K entries/sec/node.
>>>>>>> This performance scales well into the hundreds of nodes to deliver
>>>>>>> 100M+ entries/sec.
>>>>>>>
>>>>>>> There are no recent HBase benchmarks and none in the peer-reviewed
>>>>>>> literature.
>>>>>>> Old data suggests that HBase performance is ~1% of Accumulo
>>>>>>> performance.
>>>>>>>
>>>>>>> In short, one can often replace a 20+ node database with
>>>>>>> a single node Accumulo database.
>>>>>>>
>>>>>>> On Tue, Jun 24, 2014 at 01:55:54AM +0800, Jianshi Huang wrote:
>>>>>>> > Er... basically I need to explain to my manager why choosing
>>>>>>> > Accumulo, instead of HBase.
>>>>>>> >
>>>>>>> > So what are the pros and cons of Accumulo vs. HBase? (btw HBase
>>>>>>> > 0.98 also got cell-level security, modeled after Accumulo)
>>>>>>> >
>>>>>>> > --
>>>>>>> > Jianshi Huang
>>>>>>> >
>>>>>>> > LinkedIn: jianshi
>>>>>>> > Twitter: @jshuang
>>>>>>> > Github & Blog: http://huangjs.github.com/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jianshi Huang
>>>>>
>>>>> LinkedIn: jianshi
>>>>> Twitter: @jshuang
>>>>> Github & Blog: http://huangjs.github.com/
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
