I'd like to use Spark as an analytical stack; the only difference is that I want to find the best way to connect it to a dataset that I'm actively working on. Perhaps saying 'updates to an RDD' is a bit of a loaded term. I don't need the 'resilient', just a distributed data set.

Right now, the best way I can think of doing that is to keep the data in a distributed system like HBase, and then, when I want to do my analytics, use the HadoopInputFormat readers to transfer the data from HBase into Spark and run the analytics there. Of course, that means I pay the overhead of serialization/deserialization and network transfer before I can even start my calculations. If the dataset were already held in the Spark processes, I could start the calculations immediately.

So, is there a 'better' way to manage a distributed data set, which would then serve as an input to Spark RDDs?
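
For concreteness, here is a minimal sketch of the HBase-to-Spark hand-off I'm describing, reading an HBase table through TableInputFormat with newAPIHadoopRDD. The table name "my_dataset", the app name, and the "local[4]" master are just placeholders:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[4]", "hbase-analytics")

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_dataset")

    // Every row is deserialized and shipped across the network before any
    // analytics can start -- this is the overhead I'd like to avoid.
    val rows = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    rows.cache()            // keep it resident in Spark for follow-on queries
    println(rows.count())
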
Kyle

On Fri, Dec 6, 2013 at 10:13 PM, Christopher Nguyen <[email protected]> wrote:

> Kyle, the fundamental contract of a Spark RDD is that it is immutable.
> This follows the paradigm where data is (functionally) transformed into
> other data, rather than mutated. This allows these systems to make certain
> assumptions and guarantees that otherwise they wouldn't be able to.
>
> Now we've been able to get mutative behavior with RDDs---for fun,
> almost---but that's implementation dependent and may break at any time.
>
> It turns out this behavior is quite appropriate for the analytic stack,
> where you typically apply the same transform/operator to all data. You're
> finding that transactional systems are the exact opposite, where you
> typically apply a different operation to individual pieces of the data.
> Incidentally this is also the dichotomy between column- and row-based
> storage being optimal for each respective pattern.
>
> Spark is intended for the analytic stack. To use Spark as the persistence
> layer of a transaction system is going to be very awkward. I know there are
> some vendors who position their in-memory databases as good for both OLTP
> and OLAP use cases, but when you talk to them in depth they will readily
> admit that it's really optimal for one and not the other.
>
> If you want to make a project out of making a special Spark RDD that
> supports this behavior, it might be interesting. But there will be no
> simple shortcuts to get there from here.
>
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao <http://adatao.com>
> linkedin.com/in/ctnguyen
>
>
> On Fri, Dec 6, 2013 at 10:56 PM, Kyle Ellrott <[email protected]> wrote:
>
>> I'm trying to figure out if I can use an RDD to backend an interactive
>> server. One of the requirements would be to have incremental updates to
>> elements in the RDD, ie transforms that change/add/delete a single element
>> in the RDD.
>> It seems pretty drastic to do a full RDD filter to remove a single
>> element, or do the union of the RDD with another one of size 1 to add an
>> element. (Or is it?) Is there an efficient way to do this in Spark? Are
>> there any example of this kind of usage?
>>
>> Thank you,
>> Kyle
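
For reference, this is roughly what the single-element "update" from my original question looks like today: a full filter pass to drop one record, and a union with a size-1 RDD to add one. A minimal sketch (the keys, values, and app name are made up):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext("local[4]", "single-element-updates")

    // A toy key/value data set standing in for the real distributed data set.
    val data: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))

    // "Delete" a single element: a full pass over the entire RDD.
    val afterDelete = data.filter { case (k, _) => k != "b" }

    // "Add" a single element: union with a one-element RDD
    // (and another union per update after that).
    val afterAdd = afterDelete.union(sc.parallelize(Seq(("d", 4))))

    afterAdd.collect().foreach(println)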
