Thanks for the update. In your experiment so far, how many columns were involved?
Cheers

On Tue, Jun 24, 2014 at 10:44 AM, Jianshi Huang <[email protected]> wrote:

> +Update:
>
> Possibly 100s Billion of columns.
>
>
> On Wed, Jun 25, 2014 at 12:03 AM, Jianshi Huang <[email protected]>
> wrote:
>
>> Hi Ted,
>>
>> CF: maybe dozens
>> Columns: billions (rowkey = nodeId, CF = event type, CQ = Index+eventId)
>>
>> Make sense?
>>
>> Jianshi
>>
>>
>> On Tue, Jun 24, 2014 at 10:33 PM, Ted Yu <[email protected]> wrote:
>>
>>> Jianshi:
>>> How many column families and columns are you expecting (maximum) in your
>>> largest table ?
>>>
>>> Cheers
>>>
>>>
>>> On Tue, Jun 24, 2014 at 7:29 AM, Jianshi Huang <[email protected]>
>>> wrote:
>>>
>>>> Hi David,
>>>>
>>>> I did, it's a wonderful piece of work and for reviewing facts in a
>>>> networks it's a great tool. (And Lumify looks really nice)
>>>>
>>>> However, my queries are mostly time-bound (from time A to time B), and
>>>> to make some query real-time (< 50ms), I have to roll out my own schema and
>>>> index, to denormalize properties and to incrementally do aggregations. I
>>>> don't think there're existing solution in Graph database that can do these.
>>>>
>>>> And it's really fun to implement it myself. :)
>>>>
>>>> Please correct me if I'm wrong
>>>>
>>>> Jianshi
>>>>
>>>>
>>>>
>>>> On Tue, Jun 24, 2014 at 10:10 PM, David Medinets <
>>>> [email protected]> wrote:
>>>>
>>>>> Did you get a chance to review http://securegraph.org/? SecureGraph
>>>>> is an API to manipulate graphs, similar to Blueprints. Unlike Blueprints,
>>>>> every Secure graph method requires authorizations and visibilities.
>>>>> SecureGraph also supports multivalued properties as well as property
>>>>> metadata.
>>>>>
>>>>>
>>>>> On Tue, Jun 24, 2014 at 9:51 AM, Jianshi Huang <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Wow, so many replies and very educational. Thank you all!
>>>>>>
>>>>>> I'm working on a Graph backend that I hope the same infrastructure
>>>>>> can support
>>>>>>
>>>>>> 1) interactive graph exploration and queries
>>>>>>
>>>>>> Answering what are the interactions among N users from time A to time
>>>>>> B, and how are users connected (now and before).
>>>>>>
>>>>>> 2) real-time (<100ms) feature calculation (aggregation, matching) in
>>>>>> a network of accounts
>>>>>>
>>>>>> Answering questions like: what's the ratio of newly registered
>>>>>> accounts in my 'connected' (need flexible definition) network, how fast
>>>>>> does it change; Does the network has path satisfying A(CN) -> B(IT) ->
>>>>>> C(US) where the age of path is less than 3 days; etc.
>>>>>>
>>>>>> 3) offline simulation of events or offline calculation of new
>>>>>> features (used for building models), so I need to take snapshots and also
>>>>>> save point-in-time data
>>>>>>
>>>>>> Having them all-in-one in the same infrastructure will greatly
>>>>>> simplify the implementation.
>>>>>>
>>>>>> BTW, I'm working for PayPal, Risk Data Science. (All questions above
>>>>>> are fake and are not related to PayPal :)
>>>>>>
>>>>>> I made a prototype in the last two weeks for purpose 1) and my
>>>>>> feeling about Accumulo is exactly what many of you has said: it just
>>>>>> works! Very little admin work, Clean and clear documentation and APIs.
>>>>>> One thing I haven't got right was high-speed ingestion, I only got 100K
>>>>>> rows/sec/node, but it's already very satisfying. :)
>>>>>>
>>>>>> BTW, from Mike's slides it seems HBase is much faster in read
>>>>>> throughput if the number of columns is small. Any comments? What about
>>>>>> latency? Can I cache all data in memory in Accumulo to reduce latency for
>>>>>> cold data (say I just restarted my cluster)?
>>>>>>
>>>>>>
>>>>>> Jianshi
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 24, 2014 at 10:41 AM, William Slacum <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I think first and foremost, how has writing your application been?
>>>>>>> Is it something you can easily onboard other people for? Does it seem
>>>>>>> stable enough? If you can answer those questions positively, I think you
>>>>>>> have a winning situation.
>>>>>>>
>>>>>>> The big three Hadoop vendors (Cloudera, Hortonworks and MapR) all
>>>>>>> provide some level of support for Accumulo, so it has the pedigree of
>>>>>>> other members of the Hadoop ecosystem.
>>>>>>>
>>>>>>> Regarding the performance, I think Mike's presentation needs some
>>>>>>> context. He can definitely provide more context than the rest of us (and
>>>>>>> possibly Sean or Bill |-|), but I think one thing he was driving home is
>>>>>>> that out of the box, Accumulo is configured to run on someone's laptop.
>>>>>>> There are adjustments to be made when running at any scale greater than
>>>>>>> a dev machine and they may not be documented clearly.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jun 23, 2014 at 8:16 PM, Tejinder S Luthra <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Mike did a pretty good presentation on performance comparison
>>>>>>>> between Accumulo / HBase. Again not official IMO but is pretty
>>>>>>>> detailed in the approach take and apples-apples comparison
>>>>>>>> http://www.slideshare.net/AccumuloSummit/10-30-drob
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> [image: Inactive hide details for Jeremy Kepner ---06/23/2014
>>>>>>>> 07:42:57 PM---Performance is probably the largest difference between
>>>>>>>> Accu]Jeremy
>>>>>>>> Kepner ---06/23/2014 07:42:57 PM---Performance is probably the largest
>>>>>>>> difference between Accumulo and HBase. Accumulo can ingest/scan
>>>>>>>>
>>>>>>>> From: Jeremy Kepner <[email protected]>
>>>>>>>> To: <[email protected]>
>>>>>>>> Date: 06/23/2014 07:42 PM
>>>>>>>> Subject: Re: How does Accumulo compare to HBase
>>>>>>>> ------------------------------
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Performance is probably the largest difference between Accumulo and
>>>>>>>> HBase.
>>>>>>>>
>>>>>>>> Accumulo can ingest/scan at a rate of 800K entries/sec/node.
>>>>>>>> This performance scales well into the hundreds of nodes to deliver
>>>>>>>> 100M+ entries/sec.
>>>>>>>>
>>>>>>>> There are no recent HBase benchmarks and none in the peer-reviewed
>>>>>>>> literature.
>>>>>>>> Old data suggests that HBase performance is ~1% of Accumulo
>>>>>>>> performance.
>>>>>>>>
>>>>>>>> In short, one can often replace a 20+ node database with
>>>>>>>> a single node Accumulo database.
>>>>>>>>
>>>>>>>> On Tue, Jun 24, 2014 at 01:55:54AM +0800, Jianshi Huang wrote:
>>>>>>>> > Er... basically I need to explain to my manager why choosing Accumulo,
>>>>>>>> > instead of HBase.
>>>>>>>> >
>>>>>>>> > So what are the pros and cons of Accumulo vs. HBase? (btw HBase 0.98 also
>>>>>>>> > got cell-level security, modeled after Accumulo)
>>>>>>>> >
>>>>>>>> > --
>>>>>>>> > Jianshi Huang
>>>>>>>> >
>>>>>>>> > LinkedIn: jianshi
>>>>>>>> > Twitter: @jshuang
>>>>>>>> > Github & Blog: http://huangjs.github.com/
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jianshi Huang
>>>>>>
>>>>>> LinkedIn: jianshi
>>>>>> Twitter: @jshuang
>>>>>> Github & Blog: http://huangjs.github.com/
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Jianshi Huang
>>>>
>>>> LinkedIn: jianshi
>>>> Twitter: @jshuang
>>>> Github & Blog: http://huangjs.github.com/
>>>>
>>>
>>>
>>
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>
