Small amounts, in a one-node cluster (at first). As it scales, I'll be looking at running various O(nk) algorithms, where n is the number of distinct users and k is the number of overlapping features I want to consider.
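
To make that concrete, here's a minimal sketch (Scala, Spark 1.x RDD API) of the shape of pass I have in mind; the user ids and feature names are made-up placeholders, not real data:

import org.apache.spark.{SparkConf, SparkContext}

object OverlapSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("overlap-sketch").setMaster("local[*]"))

    // n users, each with a small set of feature ids (placeholder data).
    val users = sc.parallelize(Seq(
      ("alice", Set("ml", "nlp", "music")),
      ("bob",   Set("nlp", "cycling")),
      ("carol", Set("ml", "music", "films"))
    ))

    // The k overlapping features I care about for one query user.
    val query = Set("ml", "music")

    // One pass over all n users, intersecting each feature set with the
    // query set: O(n * k) work in total.
    val scored = users
      .map { case (id, feats) => (id, (feats & query).size) }
      .filter { case (_, overlap) => overlap > 0 }
      .sortBy({ case (_, overlap) => overlap }, ascending = false)

    scored.collect().foreach(println)
    sc.stop()
  }
}

So: one linear scan per query over all users, with constant-ish work per user while the per-user feature sets stay small.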
Is Apache Spark good as a general database, as well as for its fancier features? E.g.: considering I'm building a network, maybe using its graph features? (I've pasted a rough sketch of what I mean below the quoted thread.)

On Wed, Jan 21, 2015 at 2:27 AM, Ted Yu <[email protected]> wrote:
> Apache Spark supports integration with HBase (which has a REST API).
>
> What's the amount of data you want to store in this system?
>
> Cheers
>
> On Tue, Jan 20, 2015 at 3:40 AM, Alec Taylor <[email protected]> wrote:
>>
>> I am architecting a platform incorporating: recommender systems,
>> information retrieval (ML), sequence mining, and Natural Language
>> Processing.
>>
>> Additionally, I have the generic CRUD and authentication components,
>> with everything exposed RESTfully.
>>
>> For the storage layer(s), a few options immediately present
>> themselves:
>>
>> Generic CRUD layer (high speed needed here, though I suppose I could
>> use Redis…)
>>
>> - Hadoop with HBase, perhaps with Phoenix for an elastic loose-schema
>>   SQL layer on top
>> - Apache Spark (perhaps piping to HDFS)… ¿maybe?
>> - MongoDB (or a similar document store), a graph database, or even
>>   something like Postgres
>>
>> Analytics layer (to enable Big Data / data-intensive computing features)
>>
>> - Apache Spark
>> - Hadoop with MapReduce, and/or utilising some other Apache or
>>   non-Apache project with integration
>> - Disco (from Nokia)
>>
>> ________________________________
>>
>> Should I prefer one layer (e.g. on HDFS) over multiple disparate
>> layers? The advantage here is obvious, but I am certain there are
>> disadvantages. (And yes, I know there are various ways, automated and
>> manual, to push data from non-HDFS-backed stores into HDFS.)
>>
>> Also, as a bonus answer: which stack would you recommend for this
>> user network I'm building?
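
PS: by "graph features" above I mean something along the lines of GraphX. A minimal sketch of the kind of thing I'd try, with placeholder users and a stock PageRank call (my understanding is GraphX is a graph-processing library on Spark rather than a true graph database, which is part of what I'm asking about):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.graphx.{Edge, Graph}

object FollowSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("follow-sketch").setMaster("local[*]"))

    // Vertices: (numeric id, username) -- placeholder users.
    val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))

    // Edges: who follows whom in the user network.
    val follows = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"),
      Edge(2L, 3L, "follows"),
      Edge(3L, 1L, "follows")
    ))

    val graph = Graph(users, follows)

    // e.g. rank users by influence within the follow network.
    val ranks = graph.pageRank(tol = 0.001).vertices

    ranks.join(users).collect().foreach { case (_, (rank, name)) =>
      println(s"$name: $rank")
    }
    sc.stop()
  }
}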
