FYI, I asked Miles and, unsurprisingly, Alois Cochard has done some work exactly in that direction (I know he's a big fan of Records).
Check this out: https://twitter.com/aloiscochard/status/402745443450232832

So undoubtedly there is room for a neat DataFrame in Scala, but the path to it is rather tough ;-)

Cheers
andy

On Tue, Nov 19, 2013 at 9:03 AM, andy petrella wrote:

> indeed the scala version could be blocking (I'm not sure why it needs
> 2.11, maybe Miles uses quasiquotes...)
>
> Andy
>
> On Tue, Nov 19, 2013 at 8:48 AM, Anwar Rizal wrote:
>
>> I had that in mind too when Miles Sabin presented Shapeless at Scala.IO
>> Paris last month.
>>
>> If anybody would like to experiment with shapeless in Spark to create
>> something like an R data frame or an Incanter dataset, I would be happy
>> to see it and eventually help.
>>
>> My feeling, however, is that the fact that shapeless moves fast (e.g. in
>> my understanding, the latest shapeless requires 2.11) may be a problem.
>>
>> On Nov 19, 2013 12:46 AM, "andy petrella" wrote:
>>
>>> Maybe I'm wrong, but this use case could be a good fit for
>>> Shapeless<https://github.com/milessabin/shapeless>' records.
>>>
>>> Shapeless' records are like, so to say, lisp's records but typed! In that
>>> sense, they're closer to Haskell's record notation, but imho less
>>> powerful, since the access will be based on String (field name) for
>>> Shapeless where Haskell will use pure functions!
>>>
>>> Anyway, this
>>> documentation<https://github.com/milessabin/shapeless/wiki/Feature-overview%3a-shapeless-2.0.0#extensible-records>
>>> is self-explanatory and makes it straightforward to see how we (maybe)
>>> could use them to simulate an R frame.
>>>
>>> Thinking out loud: when reading a csv file, for instance, what would be
>>> needed are
>>> * a Read[T] for each column,
>>> * folding the list of columns by "reading" each and prepending the
>>>   result (combined with the name with ->>) to an HList
>>>
>>> The gain would be that we should recover one helpful feature of R's
>>> frame which is:
>>> R :: frame$newCol = frame$post - frame$pre
>>> // which adds a column to a frame
>>> Shpls :: frame2 = frame + ("newCol" ->> (frame("post") -
>>> frame("pre"))) // type safe "difference" between ints for instance
>>>
>>> Of course, we're not recovering R's frame as is, because we're simply
>>> dealing with rows one by one, where a frame is dealing with the full table
>>> -- but in the case of Spark it would make no sense to mimic that, since
>>> we use RDDs for that :-D.
>>>
>>> I haven't experimented with this yet, but it'd be fun to try, don't know
>>> if someone is interested in it ^^
>>>
>>> Cheers
>>>
>>> andy
>>>
>>> On Fri, Nov 15, 2013 at 8:49 PM, Christopher Nguyen wrote:
>>>
>>>> Sure, Shay. Let's connect offline.
>>>>
>>>> Sent while mobile. Pls excuse typos etc.
>>>> On Nov 16, 2013 2:27 AM, "Shay Seng" wrote:
>>>>
>>>>> Nice, any possibility of sharing this code in advance?
>>>>>
>>>>> On Fri, Nov 15, 2013 at 11:22 AM, Christopher Nguyen wrote:
>>>>>
>>>>>> Shay, we've done this at Adatao, specifically a big data frame in RDD
>>>>>> representation and subsetting/projections/data mining/machine learning
>>>>>> algorithms on that in-memory table structure.
>>>>>>
>>>>>> We're planning to harmonize that with the MLBase work in the near
>>>>>> future.
>>>>>> Just a matter of prioritization on limited resources. If there's
>>>>>> enough interest we'll accelerate that.
>>>>>>
>>>>>> Sent while mobile. Pls excuse typos etc.
>>>>>> On Nov 16, 2013 1:11 AM, "Shay Seng" wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Is there some way to get R-style Data.Frame data structures into
>>>>>>> RDDs? I've been using RDD[Seq[]] but this is getting quite error-prone
>>>>>>> and the code gets pretty hard to read especially after a few joins,
>>>>>>> maps etc.
>>>>>>>
>>>>>>> Rather than access columns by index, I would prefer to access them
>>>>>>> by name.
>>>>>>> e.g. instead of writing:
>>>>>>> myrdd.map(l => Seq(l(0), l(1), l(4), l(9)))
>>>>>>> I would prefer to write
>>>>>>> myrdd.map(l => DataFrame(l.id, l.entryTime, l.exitTime, l.cost))
>>>>>>>
>>>>>>> Also joins are particularly irritating. Currently I have to first
>>>>>>> construct a pair:
>>>>>>> somePairRdd.join(myrdd.map(l => ((l(1), l(2)), (l(0), l(1), l(2), l(3)))))
>>>>>>> Now I have to unzip away the join-key and remap the values into a seq
>>>>>>>
>>>>>>> instead I would rather write
>>>>>>> someDataFrame.join(myrdd, l => l.entryTime && l.exitTime)
>>>>>>>
>>>>>>> The question is this:
>>>>>>> (1) I started writing a DataFrameRDD class that kept track of the
>>>>>>> column names and column values, and some optional attributes common to
>>>>>>> the entire dataframe. However I got a little muddled when trying to
>>>>>>> figure out what happens when a DataFrameRDD is chained with other
>>>>>>> operations and gets transformed to other types of RDDs. The Value part
>>>>>>> of the RDD is obvious, but I didn't know the best way to pass on the
>>>>>>> "column and attribute" portions of the DataFrame class.
>>>>>>>
>>>>>>> I googled around for some documentation on how to write RDDs, but
>>>>>>> only found a pptx slide presentation with very vague info. Is there a
>>>>>>> better source of info on how to write RDDs?
>>>>>>>
>>>>>>> (2) Even better than info on how to write RDDs, has anyone written
>>>>>>> an RDD that functions as a DataFrame? :-)
>>>>>>>
>>>>>>> tks
>>>>>>> shay
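
For what it's worth, here is a minimal sketch of the name-based column access asked about in the original question. It uses plain Scala collections only, so it is neither the Adatao implementation nor the Shapeless-record approach discussed above. `Schema`, `Row`, `select`, `withColumn` and `NamedColumnsSketch` are hypothetical names made up for illustration, and the sample values are invented. The idea is simply that each row carries its column names along, so a projection or a derived column produces a row with a matching schema.

```scala
// Hypothetical helper types for illustration; not part of Spark or Shapeless.

// A schema maps column names to positions in the underlying values.
final case class Schema(names: Vector[String]) {
  private val index: Map[String, Int] = names.zipWithIndex.toMap

  def position(name: String): Int =
    index.getOrElse(name, throw new NoSuchElementException(s"unknown column: $name"))
}

// A row keeps its values together with the schema, so columns are read by name.
final case class Row(schema: Schema, values: Vector[Any]) {
  def apply(name: String): Any = values(schema.position(name))

  // Keep only the named columns; the result carries a matching, narrower schema.
  def select(names: String*): Row =
    Row(Schema(names.toVector), names.toVector.map(n => apply(n)))

  // Append a derived column, in the spirit of R's frame$newCol = frame$post - frame$pre.
  def withColumn(name: String, value: Any): Row =
    Row(Schema(schema.names :+ name), values :+ value)
}

object NamedColumnsSketch {
  val schema: Schema = Schema(Vector("id", "entryTime", "exitTime", "cost"))

  // Invented sample data; a local Seq stands in for an RDD here.
  val rows: Seq[Row] = Seq(
    Row(schema, Vector[Any](1, 100L, 160L, 2.5)),
    Row(schema, Vector[Any](2, 200L, 230L, 1.0))
  )

  // Same shape as myrdd.map(...) in the question, but with names instead of indices.
  val withDuration: Seq[Row] = rows.map { r =>
    val duration = r("exitTime").asInstanceOf[Long] - r("entryTime").asInstanceOf[Long]
    r.select("id", "cost").withColumn("duration", duration)
  }
}
```

In this sketch column names and value types are only checked at run time (hence the casts); an encoding based on Shapeless records, as suggested in the thread, would aim to move those checks to compile time.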
