Nice, any possibility of sharing this code in advance?

On Fri, Nov 15, 2013 at 11:22 AM, Christopher Nguyen <[email protected]> wrote:

> Shay, we've done this at Adatao, specifically a big data frame in RDD
> representation and subsetting/projections/data mining/machine learning
> algorithms on that in-memory table structure.
>
> We're planning to harmonize that with the MLBase work in the near future.
> Just a matter of prioritization on limited resources. If there's enough
> interest we'll accelerate that.
>
> Sent while mobile. Pls excuse typos etc.
> On Nov 16, 2013 1:11 AM, "Shay Seng" <[email protected]> wrote:
>
>> Hi,
>>
>> Is there some way to get R-style Data.Frame data structures into RDDs?
>> I've been using RDD[Seq[]] but this is getting quite error-prone and the
>> code gets pretty hard to read especially after a few joins, maps etc.
>>
>> Rather than access columns by index, I would prefer to access them by
>> name.
>> e.g. instead of writing:
>> myrdd.map(l => Seq(l(0), l(1), l(4), l(9)))
>> I would prefer to write
>> myrdd.map(l => DataFrame(l.id, l.entryTime, l.exitTime, l.cost))
>>
>> Also joins are particularly irritating. Currently I have to first
>> construct a pair:
>> somePairRdd.join(myrdd.map(l => ((l(1), l(2)), (l(0), l(1), l(2), l(3)))))
>> Now I have to unzip away the join key and remap the values into a Seq.
>>
>> Instead, I would rather write
>> someDataFrame.join(myrdd, l => l.entryTime && l.exitTime)
>>
>>
>> The question is this:
>> (1) I started writing a DataFrameRDD class that kept track of the column
>> names and column values, and some optional attributes common to the entire
>> dataframe. However I got a little muddled when trying to figure out what
>> happens when a DataFrameRDD is chained with other operations and gets
>> transformed to other types of RDDs. The Value part of the RDD is obvious,
>> but I didn't know the best way to pass on the "column and attribute"
>> portions of the DataFrame class.
>>
>> I googled around for some documentation on how to write RDDs, but only
>> found a pptx slide presentation with very vague info. Is there a better
>> source of info on how to write RDDs?
>>
>> (2) Even better than info on how to write RDDs, has anyone written an RDD
>> that functions as a DataFrame? :-)
>>
>> tks
>> shay
>>
>
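
One way to get the by-name access and the less painful join asked about above, without writing a custom RDD, might be to give each row a case class and key the join on a function of the row. The following is only a minimal sketch under assumptions: `Visit`, `Gate`, `NamedColumns` and `joinOn` are hypothetical names, the field names are borrowed from the question, and plain Scala `Seq`s stand in for RDDs so the example runs without Spark.

```scala
// Hypothetical row type; field names follow the ones used in the question.
final case class Visit(id: Long, entryTime: Long, exitTime: Long, cost: Double)

// Hypothetical right-hand side of the join.
final case class Gate(entryTime: Long, exitTime: Long, gate: String)

object NamedColumns {
  // Stand-ins for RDDs: plain Seqs keep the sketch runnable without Spark.
  val visits: Seq[Visit] = Seq(
    Visit(1L, 100L, 200L, 9.5),
    Visit(2L, 100L, 250L, 4.0),
    Visit(3L, 300L, 400L, 7.25)
  )
  val gates: Seq[Gate] = Seq(
    Gate(100L, 200L, "north"),
    Gate(300L, 400L, "south")
  )

  // Projection by field name instead of by position.
  val idAndCost: Seq[(Long, Double)] = visits.map(v => (v.id, v.cost))

  // Inner join driven by key functions, so callers never build or
  // unpack the (key, value) pairs by hand.
  def joinOn[A, B, K](left: Seq[A], right: Seq[B])(lk: A => K, rk: B => K): Seq[(A, B)] = {
    val rightByKey = right.groupBy(rk)
    for {
      a <- left
      b <- rightByKey.getOrElse(lk(a), Seq.empty)
    } yield (a, b)
  }

  // Join on the composite key (entryTime, exitTime); whole rows come back,
  // so their fields are still reachable by name afterwards.
  val joined: Seq[(Visit, Gate)] =
    joinOn(visits, gates)(v => (v.entryTime, v.exitTime), g => (g.entryTime, g.exitTime))
}
```

On real RDDs the same shape would presumably be keying both sides (for example with `keyBy`) and calling the pair-RDD `join`. This does not address carrying dataframe-wide attributes through transformations; it only removes the positional indexing.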
