Sure, Shay. Let's connect offline.

Sent while mobile. Pls excuse typos etc.
On Nov 16, 2013 2:27 AM, "Shay Seng" <[email protected]> wrote:
> Nice, any possibility of sharing this code in advance?
>
>
> On Fri, Nov 15, 2013 at 11:22 AM, Christopher Nguyen <[email protected]> wrote:
>
>> Shay, we've done this at Adatao, specifically a big data frame in RDD
>> representation and subsetting/projections/data mining/machine learning
>> algorithms on that in-memory table structure.
>>
>> We're planning to harmonize that with the MLBase work in the near future.
>> Just a matter of prioritization on limited resources. If there's enough
>> interest we'll accelerate that.
>>
>> Sent while mobile. Pls excuse typos etc.
>> On Nov 16, 2013 1:11 AM, "Shay Seng" <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Is there some way to get R-style Data.Frame data structures into RDDs?
>>> I've been using RDD[Seq[]] but this is getting quite error-prone and the
>>> code gets pretty hard to read, especially after a few joins, maps etc.
>>>
>>> Rather than access columns by index, I would prefer to access them by
>>> name.
>>> e.g. instead of writing:
>>> myrdd.map(l => Seq(l(0), l(1), l(4), l(9)))
>>> I would prefer to write
>>> myrdd.map(l => DataFrame(l.id, l.entryTime, l.exitTime, l.cost))
>>>
>>> Also joins are particularly irritating. Currently I have to first
>>> construct a pair:
>>> somePairRdd.join(myrdd.map(l => ((l(1), l(2)), (l(0), l(1), l(2), l(3)))))
>>> Now I have to unzip away the join-key and remap the values into a seq
>>>
>>> instead I would rather write
>>> someDataFrame.join(myrdd , l=> l.entryTime && l.exitTime)
>>>
>>>
>>> The question is this:
>>> (1) I started writing a DataFrameRDD class that kept track of the column
>>> names and column values, and some optional attributes common to the entire
>>> dataframe. However I got a little muddled when trying to figure out what
>>> happens when a DataFrameRDD is chained with other operations and gets
>>> transformed to other types of RDDs. The Value part of the RDD is obvious,
>>> but I didn't know the best way to pass on the "column and attribute"
>>> portions of the DataFrame class.
>>>
>>> I googled around for some documentation on how to write RDDs, but only
>>> found a pptx slide presentation with very vague info. Is there a better
>>> source of info on how to write RDDs?
>>>
>>> (2) Even better than info on how to write RDDs, has anyone written an
>>> RDD that functions as a DataFrame? :-)
>>>
>>> tks
>>> shay
>>>
>>
>
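
For readers following the thread, here is a minimal sketch of the "access columns by name" idea from the original question. It is not the Adatao code mentioned above, which is not shown in this thread. It uses plain Scala case classes and `Seq` so it runs without Spark. `Entry`, `Visit`, `joinBy`, the field types, and the sample values are all hypothetical and chosen only for illustration; the field names are taken from the example in the question.

```scala
// Hypothetical record types: the field names come from the example in the
// thread; the field types (Long, Double, String) are assumptions.
final case class Entry(id: Long, entryTime: Long, exitTime: Long, cost: Double)
final case class Visit(entryTime: Long, exitTime: Long, gate: String)

object NamedColumnsSketch {
  // Hypothetical helper: keys both sides with a function and performs an
  // inner join. Written against Seq so it runs without Spark; rows with no
  // match on the other side are dropped.
  def joinBy[A, B, K](left: Seq[A], right: Seq[B])(leftKey: A => K, rightKey: B => K): Seq[(A, B)] = {
    val index: Map[K, Seq[B]] = right.groupBy(rightKey)
    for {
      a <- left
      b <- index.getOrElse(leftKey(a), Seq.empty[B])
    } yield (a, b)
  }

  // Made-up sample rows.
  val entries: Seq[Entry] = Seq(
    Entry(1L, 100L, 200L, 9.5),
    Entry(2L, 150L, 260L, 4.0),
    Entry(3L, 100L, 200L, 1.25)
  )
  val visits: Seq[Visit] = Seq(
    Visit(100L, 200L, "north"),
    Visit(150L, 999L, "south")
  )

  // Projection by field name instead of by positional index.
  val projected: Seq[(Long, Double)] = entries.map(e => (e.id, e.cost))

  // Join on (entryTime, exitTime) without hand-building key/value pairs.
  val joined: Seq[(Entry, Visit)] =
    joinBy(entries, visits)(e => (e.entryTime, e.exitTime), v => (v.entryTime, v.exitTime))

  def main(args: Array[String]): Unit = {
    projected.foreach(println)
    joined.foreach { case (e, v) => println(s"${e.id} -> ${v.gate}") }
  }
}
```

With Spark, the same shape should be reachable by holding the rows in an `RDD[Entry]` and using `keyBy` on each side followed by the pair-RDD `join`, although that variant is not tested here. Because the column names live in the case class type, they travel with the rows through `map` and `join`; this sketch does not cover the dataframe-wide attributes raised in question (1).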
