Sure, Shay. Let's connect offline.

Sent while mobile. Pls excuse typos etc.
On Nov 16, 2013 2:27 AM, "Shay Seng" <[email protected]> wrote:

> Nice, any possibility of sharing this code in advance?
>
>
> On Fri, Nov 15, 2013 at 11:22 AM, Christopher Nguyen <[email protected]> wrote:
>
>> Shay, we've done this at Adatao, specifically a big data frame in RDD
>> representation and subsetting/projections/data mining/machine learning
>> algorithms on that in-memory table structure.
>>
>> We're planning to harmonize that with the MLBase work in the near future.
>> Just a matter of prioritization on limited resources. If there's enough
>> interest we'll accelerate that.
>>
>> Sent while mobile. Pls excuse typos etc.
>> On Nov 16, 2013 1:11 AM, "Shay Seng" <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Is there some way to get R-style data.frame data structures into RDDs?
>>> I've been using RDD[Seq[]], but this is getting quite error-prone and the
>>> code gets pretty hard to read, especially after a few joins, maps, etc.
>>>
>>> Rather than access columns by index, I would prefer to access them by
>>> name.
>>> e.g. instead of writing:
>>> myrdd.map(l => Seq(l(0), l(1), l(4), l(9)))
>>> I would prefer to write
>>> myrdd.map(l => DataFrame(l.id, l.entryTime, l.exitTime, l.cost))
>>>
>>> Also, joins are particularly irritating. Currently I have to first
>>> construct a pair:
>>> somePairRdd.join(myrdd.map(l => ((l(1), l(2)), (l(0), l(1), l(2), l(3)))))
>>> Now I have to unzip away the join key and remap the values into a Seq.
>>>
>>> Instead, I would rather write:
>>> someDataFrame.join(myrdd, l => l.entryTime && l.exitTime)
>>>
>>>
>>> My questions are:
>>> (1) I started writing a DataFrameRDD class that kept track of the column
>>> names and column values, and some optional attributes common to the entire
>>> dataframe. However, I got a little muddled when trying to figure out what
>>> happens when a DataFrameRDD is chained with other operations and gets
>>> transformed into other types of RDDs. The Value part of the RDD is obvious,
>>> but I didn't know the best way to pass on the "column and attribute"
>>> portions of the DataFrame class.
>>>
>>> I googled around for some documentation on how to write RDDs, but only
>>> found a pptx slide presentation with very vague info. Is there a better
>>> source of info on how to write RDDs?
>>>
>>> (2) Even better than info on how to write RDDs, has anyone written an
>>> RDD that functions as a DataFrame? :-)
>>>
>>> tks
>>> shay
>>>
>>
>
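
For readers of this thread: the named-column access and the join on
(entryTime, exitTime) that Shay asks for can be approximated with plain
Scala case classes. The following is only a minimal sketch under stated
assumptions, not Adatao's implementation and not a Spark API. The names
Visit, Gate, NamedColumns and joinByTimes are hypothetical, the sample
rows are made up for illustration, and plain Seq stands in for RDD so the
example runs without a Spark dependency.

```scala
// Hypothetical record type; field names follow the example in the thread.
case class Visit(id: Long, entryTime: Long, exitTime: Long, cost: Double)

// Hypothetical second data set that shares the two time columns.
case class Gate(entryTime: Long, exitTime: Long, gate: String)

object NamedColumns {
  // Assumption: Seq stands in for RDD; the sample rows are illustrative only.
  val visits: Seq[Visit] = Seq(
    Visit(1L, 100L, 200L, 9.5),
    Visit(2L, 100L, 200L, 3.0),
    Visit(3L, 300L, 400L, 7.25)
  )

  val gates: Seq[Gate] = Seq(
    Gate(100L, 200L, "north"),
    Gate(300L, 400L, "south")
  )

  // Projection by column name instead of by positional index.
  val costs: Seq[(Long, Double)] = visits.map(v => (v.id, v.cost))

  // Inner join on the composite key (entryTime, exitTime).
  // The key is derived from named fields, so no index bookkeeping is needed.
  def joinByTimes(left: Seq[Gate], right: Seq[Visit]): Seq[(Gate, Visit)] = {
    val rightByKey = right.groupBy(v => (v.entryTime, v.exitTime))
    for {
      g <- left
      v <- rightByKey.getOrElse((g.entryTime, g.exitTime), Seq.empty)
    } yield (g, v)
  }

  val joined: Seq[(Gate, Visit)] = joinByTimes(gates, visits)
}
```

With real RDDs, the same shape would likely be expressed by calling keyBy
with the (entryTime, exitTime) tuple on both sides and then join, which
yields (key, (left, right)) pairs. Case classes keep the column names in
the type, but they do not carry the dataframe-wide attributes mentioned
in question (1); those would still need a separate wrapper.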
