Re: DataFrame RDDs

andy petrella Tue, 19 Nov 2013 00:04:38 -0800

Indeed, the Scala version could be blocking (I'm not sure why it needs
2.11, maybe Miles uses quasiquotes...)


Andy


On Tue, Nov 19, 2013 at 8:48 AM, Anwar Rizal <[email protected]> wrote:

> I had that in mind too when Miles Sabin presented Shapeless at Scala.IO
> Paris last month.
>
> If anybody would like to experiment with Shapeless in Spark to create
> something like an R data frame or an Incanter dataset, I would be happy to see
> it and possibly help.
>
> My feeling, however, is that the fast pace of Shapeless (e.g. in my
> understanding, the latest Shapeless requires 2.11) may be a problem.
> On Nov 19, 2013 12:46 AM, "andy petrella" <[email protected]> wrote:
>
>> Maybe I'm wrong, but this use case could be a good fit for Shapeless'
>> records <https://github.com/milessabin/shapeless>.
>>
>> Shapeless' records are like, so to say, Lisp's records but typed! In that
>> sense, they're closer to Haskell's record notation, but imho less
>> powerful, since access is based on a String (field name) in
>> Shapeless, whereas Haskell uses pure functions!
>>
>> Anyway, this documentation
>> <https://github.com/milessabin/shapeless/wiki/Feature-overview%3a-shapeless-2.0.0#extensible-records>
>> is self-explanatory and shows in a straightforward way how we (maybe)
>> could use them to simulate an R frame.
>>
>> Thinking out loud: when reading a csv file, for instance, what would be
>> needed are
>>  * a Read[T] for each column,
>>  * folding the list of columns by "reading" each and prepending the
>> result (combined with the name with ->>) to an HList
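
The two bullets above can be sketched in plain Scala, without Shapeless. This is only an illustration under stated assumptions: `Read`, `Column` and `CsvRow` are hypothetical names, and a `List[(String, Any)]` stands in for the typed HList record, so the per-field static typing that Shapeless would give is lost here.

```scala
// Hypothetical type class: how to parse one CSV cell into a T.
trait Read[T] {
  def read(raw: String): T
}

object Read {
  implicit val readInt: Read[Int] = new Read[Int] {
    def read(raw: String): Int = raw.trim.toInt
  }
  implicit val readDouble: Read[Double] = new Read[Double] {
    def read(raw: String): Double = raw.trim.toDouble
  }
  implicit val readString: Read[String] = new Read[String] {
    def read(raw: String): String = raw.trim
  }
}

// A column pairs a name with the Read instance used for its cells.
final class Column[T](val name: String)(implicit reader: Read[T]) {
  def parseCell(raw: String): (String, Any) = (name, reader.read(raw))
}

object CsvRow {
  // Fold over the columns, "reading" each cell and prepending the named result.
  // Assumption: a List[(String, Any)] replaces the typed HList record.
  def parse(columns: List[Column[_]], line: String): List[(String, Any)] = {
    val cells = line.split(",", -1).toList
    columns.zip(cells).foldRight(List.empty[(String, Any)]) {
      case ((col, cell), acc) => col.parseCell(cell) :: acc
    }
  }
}
```

For example, `CsvRow.parse(List(new Column[String]("id"), new Column[Int]("pre")), "a, 3")` yields the pairs `("id", "a")` and `("pre", 3)`.
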
>>
>> The gain would be that we should recover one helpful feature of R's frame
>> which is:
>>   R         :: frame$newCol = frame$post - frame$pre
>>     // which adds a column to a frame
>>   Shapeless :: frame2 = frame + ("newCol" ->> (frame("post") - frame("pre")))
>>     // type safe "difference" between ints for instance
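
A rough analogue of that derived column, assuming each row is a plain `Map[String, Int]` rather than a Shapeless record. The names `DerivedColumn` and `withDiff` are hypothetical. Unlike the record version, a missing field here fails at runtime instead of at compile time.

```scala
object DerivedColumn {
  // Assumption: every column holds an Int, so a simple Map can model a row.
  type Row = Map[String, Int]

  // Returns a new row with "newCol" = post - pre; the input row is unchanged.
  def withDiff(row: Row): Row =
    row + ("newCol" -> (row("post") - row("pre")))

  // Applied row by row, as it would be inside a map over a collection of rows.
  def withDiffAll(rows: Seq[Row]): Seq[Row] = rows.map(withDiff)
}
```
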
>>
>> Of course, we're not recovering R's frame as is, because we're simply
>> dealing with rows one by one, whereas a frame deals with the full table
>> -- but in the case of Spark it would make no sense to mimic that, since
>> we use RDDs for that :-D.
>>
>> I haven't experimented with this yet, but it'd be fun to try; I don't know if
>> anyone is interested ^^
>>
>> Cheers
>>
>> andy
>>
>>
>> On Fri, Nov 15, 2013 at 8:49 PM, Christopher Nguyen <[email protected]> wrote:
>>
>>> Sure, Shay. Let's connect offline.
>>>
>>> Sent while mobile. Pls excuse typos etc.
>>> On Nov 16, 2013 2:27 AM, "Shay Seng" <[email protected]> wrote:
>>>
>>>> Nice, any possibility of sharing this code in advance?
>>>>
>>>>
>>>> On Fri, Nov 15, 2013 at 11:22 AM, Christopher Nguyen <[email protected]> wrote:
>>>>
>>>>> Shay, we've done this at Adatao, specifically a big data frame in RDD
>>>>> representation and subsetting/projections/data mining/machine learning
>>>>> algorithms on that in-memory table structure.
>>>>>
>>>>> We're planning to harmonize that with the MLBase work in the near
>>>>> future. Just a matter of prioritization on limited resources. If there's
>>>>> enough interest we'll accelerate that.
>>>>>
>>>>> Sent while mobile. Pls excuse typos etc.
>>>>> On Nov 16, 2013 1:11 AM, "Shay Seng" <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Is there some way to get R-style Data.Frame data structures into
>>>>>> RDDs? I've been using RDD[Seq[]] but this is getting quite error-prone
>>>>>> and the code gets pretty hard to read especially after a few joins, maps etc.
>>>>>>
>>>>>> Rather than access columns by index, I would prefer to access them by
>>>>>> name.
>>>>>> e.g. instead of writing:
>>>>>> myrdd.map(l => Seq(l(0), l(1), l(4), l(9)))
>>>>>> I would prefer to write
>>>>>> myrdd.map(l => DataFrame(l.id, l.entryTime, l.exitTime, l.cost))
>>>>>>
>>>>>> Also joins are particularly irritating. Currently I have to first
>>>>>> construct a pair:
>>>>>> somePairRdd.join(myrdd.map(l => ((l(1), l(2)), (l(0), l(1), l(2), l(3)))))
>>>>>> Now I have to unzip away the join-key and remap the values into a seq
>>>>>>
>>>>>> instead I would rather write
>>>>>> someDataFrame.join(myrdd , l=> l.entryTime && l.exitTime)
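
One way to get named access today is a case class per row type. The sketch below assumes plain Scala collections stand in for RDDs, and `Visit`, `NamedJoin` and `joinOnTimes` are hypothetical names. With Spark, `keyBy` on the RDD followed by `join` on the resulting pair RDD would play the same role.

```scala
// Hypothetical row type: fields are accessed by name instead of by index.
final case class Visit(id: String, entryTime: Long, exitTime: Long, cost: Double)

object NamedJoin {
  // Inner join of keyed values against visits on (entryTime, exitTime).
  // The key is built from named fields rather than positional l(1), l(2).
  def joinOnTimes[A](left: Seq[((Long, Long), A)], visits: Seq[Visit]): Seq[(A, Visit)] = {
    val byTimes: Map[(Long, Long), Seq[Visit]] =
      visits.groupBy(v => (v.entryTime, v.exitTime))
    for {
      (key, a) <- left
      v <- byTimes.getOrElse(key, Seq.empty)
    } yield (a, v)
  }
}
```

The joined value keeps the whole `Visit`, so there is no need to unzip the key and remap the remaining values into a Seq afterwards.
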
>>>>>>
>>>>>>
>>>>>> The question is this:
>>>>>> (1) I started writing a DataFrameRDD class that kept track of the
>>>>>> column names and column values, and some optional attributes common to
>>>>>> the entire dataframe. However I got a little muddled when trying to figure
>>>>>> out what happens when a DataFrameRDD is chained with other operations and gets
>>>>>> transformed to other types of RDDs. The Value part of the RDD is obvious,
>>>>>> but I didn't know the best way to pass on the "column and attribute"
>>>>>> portions of the DataFrame class.
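
One possible shape for carrying the column metadata along, sketched with plain collections instead of a real RDD subclass. `Frame`, `select` and `column` are hypothetical names, not Spark API. The idea is that every transformation returns a new wrapper whose column names are recomputed together with the rows.

```scala
// Hypothetical wrapper: column names travel with the rows through each transformation.
final case class Frame(columns: Vector[String], rows: Seq[Vector[Any]]) {
  private def indexOf(name: String): Int = {
    val i = columns.indexOf(name)
    require(i >= 0, s"unknown column: $name")
    i
  }

  // Projection returns a new Frame whose metadata matches the new rows.
  def select(names: String*): Frame = {
    val idx = names.map(indexOf).toVector
    Frame(names.toVector, rows.map(r => idx.map(r)))
  }

  // Read one column by name.
  def column(name: String): Seq[Any] = {
    val i = indexOf(name)
    rows.map(_(i))
  }
}
```

Operations that leave the wrapper (an arbitrary `map` to another type, say) would drop the metadata, which is the muddle described above; keeping it would mean offering frame-aware versions of those operations.
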
>>>>>>
>>>>>> I googled around for some documentation on how to write RDDs, but
>>>>>> only found a pptx slide presentation with very vague info. Is there a
>>>>>> better source of info on how to write RDDs?
>>>>>>
>>>>>> (2) Even better than info on how to write RDDs, has anyone written an
>>>>>> RDD that functions as a DataFrame? :-)
>>>>>>
>>>>>> tks
>>>>>> shay
>>>>>>
>>>>>
>>>>
>>
