Re: DataFrame RDDs

2013-11-19 Thread andy petrella
FYI,
I asked Miles and, unsurprisingly, Alois Cochard has done some work exactly
in that direction (I know he's a big fan of Records).

Check this out: https://twitter.com/aloiscochard/status/402745443450232832

So undoubtedly there is room for a neat DataFrame in Scala, but the
path to it is rather tough ;-)

Cheers

andy

On Tue, Nov 19, 2013 at 9:03 AM, andy petrella wrote:

> indeed the scala version could be blocking (I'm not sure what it needs
> 2.11, maybe Miles uses quasiquotes...)
>
> Andy
>
>
>
> On Tue, Nov 19, 2013 at 8:48 AM, Anwar Rizal  wrote:
>
>> I had that in mind too when Miles Sabin presented Shapeless at Scala.IO
>> Paris last month.
>>
>> If anybody would like to experiment with shapeless in Spark to create
>> something like R data frame or In canter dataset, I would be happy to see
>> and eventually help.
>>
>> My feeling is however the fact that shapeless goes fast (eg. in my
>> understanding, the latest shapeless requires 2.11) may be a problem.
>> On Nov 19, 2013 12:46 AM, "andy petrella" 
>> wrote:
>>
>>> Maybe I'm wrong, but this use case could be a good fit for 
>>> Shapeless'
>>> records.
>>>
>>> Shapeless' records are like, so to say, lisp's record but typed! In that
>>> sense, they're more closer to Haskell's record notation, but imho less
>>> powerful, since the access will be based on String (field name) for
>>> Shapeless where Haskell will use pure functions!
>>>
>>> Anyway, this 
>>> documentation
>>>  is
>>> self-explanatory and straightforward how we (maybe) could use them to
>>> simulate an R's frame
>>>
>>> Thinking out loud: when reading a csv file, for instance, what would be
>>> needed are
>>>  * a Read[T] for each column,
>>>  * fold'ling the list of columns by "reading" each and prepending the
>>> result (combined with the name with ->>) to an HList
>>>
>>> The gain would be that we should recover one helpful feature of R's
>>> frame which is:
>>>   R   :: frame$newCol = frame$post - frame$pre
>>> // which adds a column to a frame
>>>   Shpls :: frame2 = frame + ("newCol" --> (frame("post") -
>>> frame("pre"))) // type safe "difference" between ints for instance
>>>
>>> Of course, we're not recovering R's frame as is, because we're simply
>>> dealing with rows on by one, where a frame is dealing with the full table
>>> -- but in the case of Spark this would have no sense to mimic that, since
>>> we use RDDs for that :-D.
>>>
>>> I didn't experimented this yet, but It'd be fun to try, don't know if
>>> someone is interested in ^^
>>>
>>> Cheers
>>>
>>> andy
>>>
>>>
>>> On Fri, Nov 15, 2013 at 8:49 PM, Christopher Nguyen wrote:
>>>
 Sure, Shay. Let's connect offline.

 Sent while mobile. Pls excuse typos etc.
 On Nov 16, 2013 2:27 AM, "Shay Seng"  wrote:

> Nice, any possibility of sharing this code in advance?
>
>
> On Fri, Nov 15, 2013 at 11:22 AM, Christopher Nguyen 
> wrote:
>
>> Shay, we've done this at Adatao, specifically a big data frame in RDD
>> representation and subsetting/projections/data mining/machine learning
>> algorithms on that in-memory table structure.
>>
>> We're planning to harmonize that with the MLBase work in the near
>> future. Just a matter of prioritization on limited resources. If there's
>> enough interest we'll accelerate that.
>>
>> Sent while mobile. Pls excuse typos etc.
>> On Nov 16, 2013 1:11 AM, "Shay Seng"  wrote:
>>
>>> Hi,
>>>
>>> Is there some way to get R-style Data.Frame data structures into
>>> RDDs? I've been using RDD[Seq[]] but this is getting quite error-prone 
>>> and
>>> the code gets pretty hard to read especially after a few joins, maps 
>>> etc.
>>>
>>> Rather than access columns by index, I would prefer to access them
>>> by name.
>>> e.g. instead of writing:
>>> myrdd.map(l => Seq(l(0), l(1), l,(4), l(9))
>>> I would prefer to write
>>> myrdd.map(l => DataFrame(l.id, l.entryTime, l.exitTime, l.cost))
>>>
>>> Also joins are particularly irritating. Currently I have to first
>>> construct a pair:
>>> somePairRdd.join(myrdd.map(l=> (l(1),l(2)), (l(0),l(1),l(2),l(3)))
>>> Now I have to unzip away the join-key and remap the values into a seq
>>>
>>> instead I would rather write
>>> someDataFrame.join(myrdd , l=> l.entryTime && l.exitTime)
>>>
>>>
>>> The question is this:
>>> (1) I started writing a DataFrameRDD class that kept track of the
>>> column names and column values, and some optional attributes common to 
>>> the
>>> entire dataframe. However I got a little muddled when trying to figure 
>>> out
>>> what happens when a dataframRDD is chained with other operations and get
>>> transformed to other types of RDDs. The Value part of the RDD is obvious,
>>> but I didn't know the best way to pass on the "column and attribute"
>>> portions of the DataFrame class.

Re: DataFrame RDDs

2013-11-19 Thread andy petrella
indeed the Scala version could be a blocker (I'm not sure why it needs
2.11; maybe Miles uses quasiquotes...)

Andy


On Tue, Nov 19, 2013 at 8:48 AM, Anwar Rizal  wrote:

> I had that in mind too when Miles Sabin presented Shapeless at Scala.IO
> Paris last month.
>
> If anybody would like to experiment with shapeless in Spark to create
> something like R data frame or In canter dataset, I would be happy to see
> and eventually help.
>
> My feeling is however the fact that shapeless goes fast (eg. in my
> understanding, the latest shapeless requires 2.11) may be a problem.
> On Nov 19, 2013 12:46 AM, "andy petrella"  wrote:
>
>> Maybe I'm wrong, but this use case could be a good fit for 
>> Shapeless'
>> records.
>>
>> Shapeless' records are like, so to say, lisp's record but typed! In that
>> sense, they're more closer to Haskell's record notation, but imho less
>> powerful, since the access will be based on String (field name) for
>> Shapeless where Haskell will use pure functions!
>>
>> Anyway, this 
>> documentation
>>  is
>> self-explanatory and straightforward how we (maybe) could use them to
>> simulate an R's frame
>>
>> Thinking out loud: when reading a csv file, for instance, what would be
>> needed are
>>  * a Read[T] for each column,
>>  * fold'ling the list of columns by "reading" each and prepending the
>> result (combined with the name with ->>) to an HList
>>
>> The gain would be that we should recover one helpful feature of R's frame
>> which is:
>>   R   :: frame$newCol = frame$post - frame$pre
>> // which adds a column to a frame
>>   Shpls :: frame2 = frame + ("newCol" --> (frame("post") - frame("pre")))
>> // type safe "difference" between ints for instance
>>
>> Of course, we're not recovering R's frame as is, because we're simply
>> dealing with rows on by one, where a frame is dealing with the full table
>> -- but in the case of Spark this would have no sense to mimic that, since
>> we use RDDs for that :-D.
>>
>> I didn't experimented this yet, but It'd be fun to try, don't know if
>> someone is interested in ^^
>>
>> Cheers
>>
>> andy
>>
>>
>> On Fri, Nov 15, 2013 at 8:49 PM, Christopher Nguyen wrote:
>>
>>> Sure, Shay. Let's connect offline.
>>>
>>> Sent while mobile. Pls excuse typos etc.
>>> On Nov 16, 2013 2:27 AM, "Shay Seng"  wrote:
>>>
 Nice, any possibility of sharing this code in advance?


 On Fri, Nov 15, 2013 at 11:22 AM, Christopher Nguyen 
 wrote:

> Shay, we've done this at Adatao, specifically a big data frame in RDD
> representation and subsetting/projections/data mining/machine learning
> algorithms on that in-memory table structure.
>
> We're planning to harmonize that with the MLBase work in the near
> future. Just a matter of prioritization on limited resources. If there's
> enough interest we'll accelerate that.
>
> Sent while mobile. Pls excuse typos etc.
> On Nov 16, 2013 1:11 AM, "Shay Seng"  wrote:
>
>> Hi,
>>
>> Is there some way to get R-style Data.Frame data structures into
>> RDDs? I've been using RDD[Seq[]] but this is getting quite error-prone 
>> and
>> the code gets pretty hard to read especially after a few joins, maps etc.
>>
>> Rather than access columns by index, I would prefer to access them by
>> name.
>> e.g. instead of writing:
>> myrdd.map(l => Seq(l(0), l(1), l,(4), l(9))
>> I would prefer to write
>> myrdd.map(l => DataFrame(l.id, l.entryTime, l.exitTime, l.cost))
>>
>> Also joins are particularly irritating. Currently I have to first
>> construct a pair:
>> somePairRdd.join(myrdd.map(l=> (l(1),l(2)), (l(0),l(1),l(2),l(3)))
>> Now I have to unzip away the join-key and remap the values into a seq
>>
>> instead I would rather write
>> someDataFrame.join(myrdd , l=> l.entryTime && l.exitTime)
>>
>>
>> The question is this:
>> (1) I started writing a DataFrameRDD class that kept track of the
>> column names and column values, and some optional attributes common to 
>> the
>> entire dataframe. However I got a little muddled when trying to figure 
>> out
>> what happens when a dataframRDD is chained with other operations and get
>> transformed to other types of RDDs. The Value part of the RDD is obvious,
>> but I didn't know the best way to pass on the "column and attribute"
>> portions of the DataFrame class.
>>
>> I googled around for some documentation on how to write RDDs, but
>> only found a pptx slide presentation with very vague info. Is there a
>> better source of info on how to write RDDs?
>>
>> (2) Even better than info on how to write RDDs, has anyone written an
>> RDD that functions as a DataFrame? :-)
>>
>> tks
>> shay
>>
>
>>
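For a fixed, known schema, the indexed style Shay describes in the message quoted above can already be avoided with a plain case class. A hedged sketch (my illustration, not code from the thread: the `Trip` type and data are invented, field names are borrowed from his example, and a local `Seq` stands in for the RDD):

```scala
// Sketch only: Trip and the sample rows are made up; with Spark this map
// would run on an RDD[Seq[String]] instead of a local Seq.
case class Trip(id: Long, entryTime: Long, exitTime: Long, cost: Double)

val raw: Seq[Seq[String]] = Seq(
  Seq("1", "100", "200", "9.5"),
  Seq("2", "150", "260", "12.0")
)

// Parse by index once, at the edge; downstream code reads t.cost, not l(3).
val trips = raw.map(l => Trip(l(0).toLong, l(1).toLong, l(2).toLong, l(3).toDouble))

println(trips.map(t => t.exitTime - t.entryTime)) // durations, by name
```

The limitation, which is why the thread reaches for shapeless records or Dynamic, is that a case class fixes the set of columns at compile time, so it cannot express "add a column" the way an R frame can.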

Re: DataFrame RDDs

2013-11-19 Thread andy petrella
Exactly, we could actually use Dynamic with Records.

Still thinking out loud:
The fact is that, with Dynamic, we would lose the type -- the
implementation is up to us and uses Any for its parameters, of course.
Maybe we could use shapeless records as the delegate value for the
selectDynamic (and so on) implementation...

However, we might encounter some problems because of the Any in the
signatures, since the compiler loses the type... most probably we'd need a
ClassTag or something similar... not sure if it'll work -- with clean
code at least (°_-)
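A rough sketch of that concern (my own illustration, not code from the thread; `TypedRow` and its fields are invented): selectDynamic has to return Any, and a ClassTag-based accessor is one way to recover a typed value at the call site, assuming the runtime class matches.

```scala
import scala.language.dynamics
import scala.reflect.ClassTag

// Illustration only: TypedRow is a made-up name for this sketch.
class TypedRow(values: Map[String, Any]) extends Dynamic {
  // Dynamic access: compiles for any field name, but the static type is Any.
  def selectDynamic(field: String): Any = values(field)

  // Typed access: the caller states the expected type; we check it at runtime.
  def as[T](field: String)(implicit ct: ClassTag[T]): T = {
    val v = values(field)
    // Primitive Int values arrive boxed, so compare against the boxed class.
    val expected: Class[_] =
      if (ct.runtimeClass == java.lang.Integer.TYPE) classOf[java.lang.Integer]
      else ct.runtimeClass
    if (expected.isInstance(v)) v.asInstanceOf[T]
    else throw new ClassCastException(s"$field is ${v.getClass}, not $expected")
  }
}

val r = new TypedRow(Map("post" -> 10, "pre" -> 3))
val diff = r.as[Int]("post") - r.as[Int]("pre") // statically an Int
println(diff)
```

The check still happens at runtime, which is exactly the "compiler loses the type" trade-off above; only the shapeless-record route keeps the field type in the static signature.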



Andy Petrella
Belgium (Liège)


IT Consultant for NextLab sprl (co-founder)
Engaged Citizen Coder for WAJUG (co-founder)
Author of Learning Play! Framework 2

Mobile: +32 495 99 11 04
Mails:

   - andy.petre...@nextlab.be
   - andy.petre...@gmail.com

Socials:

   - Twitter: https://twitter.com/#!/noootsab
   - LinkedIn: http://be.linkedin.com/in/andypetrella
   - Blogger: http://ska-la.blogspot.com/
   - GitHub:  https://github.com/andypetrella
   - Masterbranch: https://masterbranch.com/andy.petrella



On Tue, Nov 19, 2013 at 8:07 AM, Matei Zaharia wrote:

> Interesting idea — in Scala you can also use the Dynamic type (
> http://hacking-scala.org/post/49051516694/introduction-to-type-dynamic)
> to allow dynamic properties. It has the same potential pitfalls as string
> names, but with nicer syntax.
>
> Matei
>
> On Nov 18, 2013, at 3:45 PM, andy petrella 
> wrote:
>
> Maybe I'm wrong, but this use case could be a good fit for 
> Shapeless'
> records.
>
> Shapeless' records are like, so to say, lisp's record but typed! In that
> sense, they're more closer to Haskell's record notation, but imho less
> powerful, since the access will be based on String (field name) for
> Shapeless where Haskell will use pure functions!
>
> Anyway, this 
> documentation
>  is
> self-explanatory and straightforward how we (maybe) could use them to
> simulate an R's frame
>
> Thinking out loud: when reading a csv file, for instance, what would be
> needed are
>  * a Read[T] for each column,
>  * fold'ling the list of columns by "reading" each and prepending the
> result (combined with the name with ->>) to an HList
>
> The gain would be that we should recover one helpful feature of R's frame
> which is:
>   R   :: frame$newCol = frame$post - frame$pre
>   // which adds a column to a frame
>   Shpls :: frame2 = frame + ("newCol" --> (frame("post") - frame("pre")))
> // type safe "difference" between ints for instance
>
> Of course, we're not recovering R's frame as is, because we're simply
> dealing with rows on by one, where a frame is dealing with the full table
> -- but in the case of Spark this would have no sense to mimic that, since
> we use RDDs for that :-D.
>
> I didn't experimented this yet, but It'd be fun to try, don't know if
> someone is interested in ^^
>
> Cheers
>
> andy
>
>
> On Fri, Nov 15, 2013 at 8:49 PM, Christopher Nguyen wrote:
>
>> Sure, Shay. Let's connect offline.
>>
>> Sent while mobile. Pls excuse typos etc.
>> On Nov 16, 2013 2:27 AM, "Shay Seng"  wrote:
>>
>>> Nice, any possibility of sharing this code in advance?
>>>
>>>
>>> On Fri, Nov 15, 2013 at 11:22 AM, Christopher Nguyen wrote:
>>>
 Shay, we've done this at Adatao, specifically a big data frame in RDD
 representation and subsetting/projections/data mining/machine learning
 algorithms on that in-memory table structure.

 We're planning to harmonize that with the MLBase work in the near
 future. Just a matter of prioritization on limited resources. If there's
 enough interest we'll accelerate that.

 Sent while mobile. Pls excuse typos etc.
  On Nov 16, 2013 1:11 AM, "Shay Seng"  wrote:

> Hi,
>
> Is there some way to get R-style Data.Frame data structures into RDDs?
> I've been using RDD[Seq[]] but this is getting quite error-prone and the
> code gets pretty hard to read especially after a few joins, maps etc.
>
> Rather than access columns by index, I would prefer to access them by
> name.
> e.g. instead of writing:
> myrdd.map(l => Seq(l(0), l(1), l,(4), l(9))
> I would prefer to write
> myrdd.map(l => DataFrame(l.id, l.entryTime, l.exitTime, l.cost))
>
> Also joins are particularly irritating. Currently I have to first
> construct a pair:
> somePairRdd.join(myrdd.map(l=> (l(1),l(2)), (l(0),l(1),l(2),l(3)))
> Now I have to unzip away the join-key and remap the values into a seq
>
> instead I would rather write
> someDataFrame.join(myrdd , l=> l.entryTime && l.exitTime)
>
>
> The question is this:
> (1) I started writing a DataFrameRDD class that kept track of the column
> names and column values, and some optional attributes common to the entire
> dataframe.

Re: DataFrame RDDs

2013-11-18 Thread Anwar Rizal
I had that in mind too when Miles Sabin presented Shapeless at Scala.IO
Paris last month.

If anybody would like to experiment with shapeless in Spark to create
something like an R data frame or an Incanter dataset, I would be happy to
see it and eventually help.

My feeling, however, is that shapeless moves fast (e.g., to my
understanding, the latest shapeless requires Scala 2.11), and that may be a
problem.
On Nov 19, 2013 12:46 AM, "andy petrella"  wrote:

> Maybe I'm wrong, but this use case could be a good fit for 
> Shapeless'
> records.
>
> Shapeless' records are like, so to say, lisp's record but typed! In that
> sense, they're more closer to Haskell's record notation, but imho less
> powerful, since the access will be based on String (field name) for
> Shapeless where Haskell will use pure functions!
>
> Anyway, this 
> documentation
>  is
> self-explanatory and straightforward how we (maybe) could use them to
> simulate an R's frame
>
> Thinking out loud: when reading a csv file, for instance, what would be
> needed are
>  * a Read[T] for each column,
>  * fold'ling the list of columns by "reading" each and prepending the
> result (combined with the name with ->>) to an HList
>
> The gain would be that we should recover one helpful feature of R's frame
> which is:
>   R   :: frame$newCol = frame$post - frame$pre
>   // which adds a column to a frame
>   Shpls :: frame2 = frame + ("newCol" --> (frame("post") - frame("pre")))
> // type safe "difference" between ints for instance
>
> Of course, we're not recovering R's frame as is, because we're simply
> dealing with rows on by one, where a frame is dealing with the full table
> -- but in the case of Spark this would have no sense to mimic that, since
> we use RDDs for that :-D.
>
> I didn't experimented this yet, but It'd be fun to try, don't know if
> someone is interested in ^^
>
> Cheers
>
> andy
>
>
> On Fri, Nov 15, 2013 at 8:49 PM, Christopher Nguyen wrote:
>
>> Sure, Shay. Let's connect offline.
>>
>> Sent while mobile. Pls excuse typos etc.
>> On Nov 16, 2013 2:27 AM, "Shay Seng"  wrote:
>>
>>> Nice, any possibility of sharing this code in advance?
>>>
>>>
>>> On Fri, Nov 15, 2013 at 11:22 AM, Christopher Nguyen wrote:
>>>
 Shay, we've done this at Adatao, specifically a big data frame in RDD
 representation and subsetting/projections/data mining/machine learning
 algorithms on that in-memory table structure.

 We're planning to harmonize that with the MLBase work in the near
 future. Just a matter of prioritization on limited resources. If there's
 enough interest we'll accelerate that.

 Sent while mobile. Pls excuse typos etc.
 On Nov 16, 2013 1:11 AM, "Shay Seng"  wrote:

> Hi,
>
> Is there some way to get R-style Data.Frame data structures into RDDs?
> I've been using RDD[Seq[]] but this is getting quite error-prone and the
> code gets pretty hard to read especially after a few joins, maps etc.
>
> Rather than access columns by index, I would prefer to access them by
> name.
> e.g. instead of writing:
> myrdd.map(l => Seq(l(0), l(1), l,(4), l(9))
> I would prefer to write
> myrdd.map(l => DataFrame(l.id, l.entryTime, l.exitTime, l.cost))
>
> Also joins are particularly irritating. Currently I have to first
> construct a pair:
> somePairRdd.join(myrdd.map(l=> (l(1),l(2)), (l(0),l(1),l(2),l(3)))
> Now I have to unzip away the join-key and remap the values into a seq
>
> instead I would rather write
> someDataFrame.join(myrdd , l=> l.entryTime && l.exitTime)
>
>
> The question is this:
> (1) I started writing a DataFrameRDD class that kept track of the
> column names and column values, and some optional attributes common to the
> entire dataframe. However I got a little muddled when trying to figure out
> what happens when a dataframRDD is chained with other operations and get
> transformed to other types of RDDs. The Value part of the RDD is obvious,
> but I didn't know the best way to pass on the "column and attribute"
> portions of the DataFrame class.
>
> I googled around for some documentation on how to write RDDs, but only
> found a pptx slide presentation with very vague info. Is there a better
> source of info on how to write RDDs?
>
> (2) Even better than info on how to write RDDs, has anyone written an
> RDD that functions as a DataFrame? :-)
>
> tks
> shay
>

>>>
>


Re: DataFrame RDDs

2013-11-18 Thread Matei Zaharia
Interesting idea — in Scala you can also use the Dynamic type 
(http://hacking-scala.org/post/49051516694/introduction-to-type-dynamic) to 
allow dynamic properties. It has the same potential pitfalls as string names, 
but with nicer syntax.
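A minimal sketch of that idea (an illustration with invented names, not code from the thread): a row backed by a Map, where scala.Dynamic turns `row.entryTime` into a `selectDynamic("entryTime")` call.

```scala
import scala.language.dynamics

// Minimal sketch: Row and the sample fields are invented for illustration.
class Row(values: Map[String, Any]) extends Dynamic {
  // row.foo compiles for ANY name foo and is resolved here at runtime.
  def selectDynamic(field: String): Any =
    values.getOrElse(field, throw new NoSuchElementException(field))
}

val row = new Row(Map("id" -> 7, "entryTime" -> 100L, "cost" -> 12.5))
println(row.entryTime) // nicer than row(1); still stringly-typed underneath
```

A typo like `row.entyTime` still compiles and only fails at runtime, which is the "same potential pitfalls as string names" caveat.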

Matei

On Nov 18, 2013, at 3:45 PM, andy petrella  wrote:

> Maybe I'm wrong, but this use case could be a good fit for Shapeless' records.
> 
> Shapeless' records are like, so to say, lisp's record but typed! In that 
> sense, they're more closer to Haskell's record notation, but imho less 
> powerful, since the access will be based on String (field name) for Shapeless 
> where Haskell will use pure functions!
> 
> Anyway, this documentation is self-explanatory and straightforward how we 
> (maybe) could use them to simulate an R's frame
> 
> Thinking out loud: when reading a csv file, for instance, what would be 
> needed are 
>  * a Read[T] for each column, 
>  * fold'ling the list of columns by "reading" each and prepending the result 
> (combined with the name with ->>) to an HList
> 
> The gain would be that we should recover one helpful feature of R's frame 
> which is:
>   R   :: frame$newCol = frame$post - frame$pre
>// which adds a column to a frame
>   Shpls :: frame2 = frame + ("newCol" --> (frame("post") - frame("pre"))) 
> // type safe "difference" between ints for instance
>
> Of course, we're not recovering R's frame as is, because we're simply dealing 
> with rows on by one, where a frame is dealing with the full table -- but in 
> the case of Spark this would have no sense to mimic that, since we use RDDs 
> for that :-D.
> 
> I didn't experimented this yet, but It'd be fun to try, don't know if someone 
> is interested in ^^
> 
> Cheers
> 
> andy
> 
> 
> On Fri, Nov 15, 2013 at 8:49 PM, Christopher Nguyen  wrote:
> Sure, Shay. Let's connect offline.
> 
> Sent while mobile. Pls excuse typos etc.
> 
> On Nov 16, 2013 2:27 AM, "Shay Seng"  wrote:
> Nice, any possibility of sharing this code in advance? 
> 
> 
> On Fri, Nov 15, 2013 at 11:22 AM, Christopher Nguyen  wrote:
> Shay, we've done this at Adatao, specifically a big data frame in RDD 
> representation and subsetting/projections/data mining/machine learning 
> algorithms on that in-memory table structure.
> 
> We're planning to harmonize that with the MLBase work in the near future. 
> Just a matter of prioritization on limited resources. If there's enough 
> interest we'll accelerate that.
> 
> Sent while mobile. Pls excuse typos etc.
> 
> On Nov 16, 2013 1:11 AM, "Shay Seng"  wrote:
> Hi, 
> 
> Is there some way to get R-style Data.Frame data structures into RDDs? I've 
> been using RDD[Seq[]] but this is getting quite error-prone and the code gets 
> pretty hard to read especially after a few joins, maps etc. 
> 
> Rather than access columns by index, I would prefer to access them by name.
> e.g. instead of writing:
> myrdd.map(l => Seq(l(0), l(1), l,(4), l(9))
> I would prefer to write
> myrdd.map(l => DataFrame(l.id, l.entryTime, l.exitTime, l.cost))
> 
> Also joins are particularly irritating. Currently I have to first construct a 
> pair:
> somePairRdd.join(myrdd.map(l=> (l(1),l(2)), (l(0),l(1),l(2),l(3)))
> Now I have to unzip away the join-key and remap the values into a seq
> 
> instead I would rather write
> someDataFrame.join(myrdd , l=> l.entryTime && l.exitTime)
> 
> 
> The question is this:
> (1) I started writing a DataFrameRDD class that kept track of the column 
> names and column values, and some optional attributes common to the entire 
> dataframe. However I got a little muddled when trying to figure out what 
> happens when a dataframRDD is chained with other operations and get 
> transformed to other types of RDDs. The Value part of the RDD is obvious, but 
> I didn't know the best way to pass on the "column and attribute" portions of 
> the DataFrame class.
> 
> I googled around for some documentation on how to write RDDs, but only found 
> a pptx slide presentation with very vague info. Is there a better source of 
> info on how to write RDDs? 
> 
> (2) Even better than info on how to write RDDs, has anyone written an RDD 
> that functions as a DataFrame? :-)
> 
> tks
> shay
> 
> 



Re: DataFrame RDDs

2013-11-18 Thread andy petrella
Maybe I'm wrong, but this use case could be a good fit for Shapeless'
records.

Shapeless' records are, so to say, like Lisp records but typed! In that
sense, they're closer to Haskell's record notation, but imho less
powerful, since access is based on a String (field name) in Shapeless
where Haskell uses pure functions!

Anyway, this documentation is self-explanatory and shows straightforwardly
how we (maybe) could use them to simulate an R frame

Thinking out loud: when reading a csv file, for instance, what would be
needed are
 * a Read[T] for each column,
 * folding the list of columns by "reading" each and prepending the
result (combined with the name via ->>) to an HList
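The per-column reading step can be sketched in plain Scala (my sketch, under the stated assumption that a tiny Read typeclass and a tuple stand in for shapeless's ->> and HList, just to show the idea):

```scala
// Sketch, not shapeless: one Read[T] per column type, applied while splitting
// a csv line. With shapeless one would prepend ("name" ->> value) to an HList.
trait Read[T] { def read(s: String): T }
object Read {
  implicit val readInt: Read[Int] =
    new Read[Int] { def read(s: String) = s.trim.toInt }
  implicit val readDouble: Read[Double] =
    new Read[Double] { def read(s: String) = s.trim.toDouble }
  implicit val readString: Read[String] =
    new Read[String] { def read(s: String) = s }
}
def col[T](s: String)(implicit r: Read[T]): T = r.read(s)

val line = "42,12.5,hello"
val parts = line.split(",")
val row = (col[Int](parts(0)), col[Double](parts(1)), col[String](parts(2)))
println(row) // each column came out with its own type
</imports-placeholder>
```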

The gain would be that we should recover one helpful feature of R's frame
which is:
  R   :: frame$newCol = frame$post - frame$pre
  // which adds a column to a frame
  Shpls :: frame2 = frame + ("newCol" ->> (frame("post") - frame("pre")))
  // type safe "difference" between ints for instance

Of course, we're not recovering R's frame as is, because we're simply
dealing with rows one by one, where a frame deals with the full table
-- but in the case of Spark it would make no sense to mimic that, since
we use RDDs for that :-D.

I haven't experimented with this yet, but it'd be fun to try; don't know if
anyone is interested ^^

Cheers

andy


On Fri, Nov 15, 2013 at 8:49 PM, Christopher Nguyen  wrote:

> Sure, Shay. Let's connect offline.
>
> Sent while mobile. Pls excuse typos etc.
> On Nov 16, 2013 2:27 AM, "Shay Seng"  wrote:
>
>> Nice, any possibility of sharing this code in advance?
>>
>>
>> On Fri, Nov 15, 2013 at 11:22 AM, Christopher Nguyen wrote:
>>
>>> Shay, we've done this at Adatao, specifically a big data frame in RDD
>>> representation and subsetting/projections/data mining/machine learning
>>> algorithms on that in-memory table structure.
>>>
>>> We're planning to harmonize that with the MLBase work in the near
>>> future. Just a matter of prioritization on limited resources. If there's
>>> enough interest we'll accelerate that.
>>>
>>> Sent while mobile. Pls excuse typos etc.
>>> On Nov 16, 2013 1:11 AM, "Shay Seng"  wrote:
>>>
 Hi,

 Is there some way to get R-style Data.Frame data structures into RDDs?
 I've been using RDD[Seq[]] but this is getting quite error-prone and the
 code gets pretty hard to read especially after a few joins, maps etc.

 Rather than access columns by index, I would prefer to access them by
 name.
 e.g. instead of writing:
 myrdd.map(l => Seq(l(0), l(1), l,(4), l(9))
 I would prefer to write
 myrdd.map(l => DataFrame(l.id, l.entryTime, l.exitTime, l.cost))

 Also joins are particularly irritating. Currently I have to first
 construct a pair:
 somePairRdd.join(myrdd.map(l=> (l(1),l(2)), (l(0),l(1),l(2),l(3)))
 Now I have to unzip away the join-key and remap the values into a seq

 instead I would rather write
 someDataFrame.join(myrdd , l=> l.entryTime && l.exitTime)


 The question is this:
 (1) I started writing a DataFrameRDD class that kept track of the
 column names and column values, and some optional attributes common to the
 entire dataframe. However I got a little muddled when trying to figure out
 what happens when a dataframRDD is chained with other operations and get
 transformed to other types of RDDs. The Value part of the RDD is obvious,
 but I didn't know the best way to pass on the "column and attribute"
 portions of the DataFrame class.

 I googled around for some documentation on how to write RDDs, but only
 found a pptx slide presentation with very vague info. Is there a better
 source of info on how to write RDDs?

 (2) Even better than info on how to write RDDs, has anyone written an
 RDD that functions as a DataFrame? :-)

 tks
 shay

>>>
>>


Re: DataFrame RDDs

2013-11-15 Thread Christopher Nguyen
Sure, Shay. Let's connect offline.

Sent while mobile. Pls excuse typos etc.
On Nov 16, 2013 2:27 AM, "Shay Seng"  wrote:

> Nice, any possibility of sharing this code in advance?
>
>
> On Fri, Nov 15, 2013 at 11:22 AM, Christopher Nguyen wrote:
>
>> Shay, we've done this at Adatao, specifically a big data frame in RDD
>> representation and subsetting/projections/data mining/machine learning
>> algorithms on that in-memory table structure.
>>
>> We're planning to harmonize that with the MLBase work in the near future.
>> Just a matter of prioritization on limited resources. If there's enough
>> interest we'll accelerate that.
>>
>> Sent while mobile. Pls excuse typos etc.
>> On Nov 16, 2013 1:11 AM, "Shay Seng"  wrote:
>>
>>> Hi,
>>>
>>> Is there some way to get R-style Data.Frame data structures into RDDs?
>>> I've been using RDD[Seq[]] but this is getting quite error-prone and the
>>> code gets pretty hard to read especially after a few joins, maps etc.
>>>
>>> Rather than access columns by index, I would prefer to access them by
>>> name.
>>> e.g. instead of writing:
>>> myrdd.map(l => Seq(l(0), l(1), l,(4), l(9))
>>> I would prefer to write
>>> myrdd.map(l => DataFrame(l.id, l.entryTime, l.exitTime, l.cost))
>>>
>>> Also joins are particularly irritating. Currently I have to first
>>> construct a pair:
>>> somePairRdd.join(myrdd.map(l=> (l(1),l(2)), (l(0),l(1),l(2),l(3)))
>>> Now I have to unzip away the join-key and remap the values into a seq
>>>
>>> instead I would rather write
>>> someDataFrame.join(myrdd , l=> l.entryTime && l.exitTime)
>>>
>>>
>>> The question is this:
>>> (1) I started writing a DataFrameRDD class that kept track of the column
>>> names and column values, and some optional attributes common to the entire
>>> dataframe. However I got a little muddled when trying to figure out what
>>> happens when a dataframRDD is chained with other operations and get
>>> transformed to other types of RDDs. The Value part of the RDD is obvious,
>>> but I didn't know the best way to pass on the "column and attribute"
>>> portions of the DataFrame class.
>>>
>>> I googled around for some documentation on how to write RDDs, but only
>>> found a pptx slide presentation with very vague info. Is there a better
>>> source of info on how to write RDDs?
>>>
>>> (2) Even better than info on how to write RDDs, has anyone written an
>>> RDD that functions as a DataFrame? :-)
>>>
>>> tks
>>> shay
>>>
>>
>


Re: DataFrame RDDs

2013-11-15 Thread Shay Seng
Nice, any possibility of sharing this code in advance?


On Fri, Nov 15, 2013 at 11:22 AM, Christopher Nguyen  wrote:

> Shay, we've done this at Adatao, specifically a big data frame in RDD
> representation and subsetting/projections/data mining/machine learning
> algorithms on that in-memory table structure.
>
> We're planning to harmonize that with the MLBase work in the near future.
> Just a matter of prioritization on limited resources. If there's enough
> interest we'll accelerate that.
>
> Sent while mobile. Pls excuse typos etc.
> On Nov 16, 2013 1:11 AM, "Shay Seng"  wrote:
>
>> Hi,
>>
>> Is there some way to get R-style Data.Frame data structures into RDDs?
>> I've been using RDD[Seq[]] but this is getting quite error-prone and the
>> code gets pretty hard to read especially after a few joins, maps etc.
>>
>> Rather than access columns by index, I would prefer to access them by
>> name.
>> e.g. instead of writing:
>> myrdd.map(l => Seq(l(0), l(1), l,(4), l(9))
>> I would prefer to write
>> myrdd.map(l => DataFrame(l.id, l.entryTime, l.exitTime, l.cost))
>>
>> Also joins are particularly irritating. Currently I have to first
>> construct a pair:
>> somePairRdd.join(myrdd.map(l=> (l(1),l(2)), (l(0),l(1),l(2),l(3)))
>> Now I have to unzip away the join-key and remap the values into a seq
>>
>> instead I would rather write
>> someDataFrame.join(myrdd , l=> l.entryTime && l.exitTime)
>>
>>
>> The question is this:
>> (1) I started writing a DataFrameRDD class that kept track of the column
>> names and column values, and some optional attributes common to the entire
>> dataframe. However I got a little muddled when trying to figure out what
>> happens when a dataframRDD is chained with other operations and get
>> transformed to other types of RDDs. The Value part of the RDD is obvious,
>> but I didn't know the best way to pass on the "column and attribute"
>> portions of the DataFrame class.
>>
>> I googled around for some documentation on how to write RDDs, but only
>> found a pptx slide presentation with very vague info. Is there a better
>> source of info on how to write RDDs?
>>
>> (2) Even better than info on how to write RDDs, has anyone written an RDD
>> that functions as a DataFrame? :-)
>>
>> tks
>> shay
>>
>


Re: DataFrame RDDs

2013-11-15 Thread Christopher Nguyen
Shay, we've done this at Adatao, specifically a big data frame in RDD
representation and subsetting/projections/data mining/machine learning
algorithms on that in-memory table structure.

We're planning to harmonize that with the MLBase work in the near future.
Just a matter of prioritization on limited resources. If there's enough
interest we'll accelerate that.

Sent while mobile. Pls excuse typos etc.
On Nov 16, 2013 1:11 AM, "Shay Seng"  wrote:

> Hi,
>
> Is there some way to get R-style data.frame structures into RDDs?
> I've been using RDD[Seq[]] but this is getting quite error-prone, and the
> code gets pretty hard to read, especially after a few joins, maps, etc.
>
> Rather than access columns by index, I would prefer to access them by name.
> e.g. instead of writing:
> myrdd.map(l => Seq(l(0), l(1), l(4), l(9)))
> I would prefer to write
> myrdd.map(l => DataFrame(l.id, l.entryTime, l.exitTime, l.cost))
>
> Also joins are particularly irritating. Currently I have to first
> construct a pair:
> somePairRdd.join(myrdd.map(l => ((l(1), l(2)), (l(0), l(1), l(2), l(3)))))
> Now I have to unzip away the join-key and remap the values into a seq
>
> instead, I would rather write
> someDataFrame.join(myrdd, l => l.entryTime && l.exitTime)
>
>
> The question is this:
> (1) I started writing a DataFrameRDD class that kept track of the column
> names and column values, and some optional attributes common to the entire
> dataframe. However, I got a little muddled when trying to figure out what
> happens when a DataFrameRDD is chained with other operations and gets
> transformed into other types of RDDs. The Value part of the RDD is obvious,
> but I didn't know the best way to pass on the "column and attribute"
> portions of the DataFrame class.
>
> I googled around for some documentation on how to write RDDs, but only
> found a pptx slide presentation with very vague info. Is there a better
> source of info on how to write RDDs?
>
> (2) Even better than info on how to write RDDs, has anyone written an RDD
> that functions as a DataFrame? :-)
>
> tks
> shay
>
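On question (1) above, one pattern (an assumption of mine, not something proposed in this thread) is to keep the schema next to the data in a wrapper whose transformations return a new wrapper, so the column metadata survives chaining. A plain Seq stands in for the RDD, and `Frame` is an invented name:

```scala
// Sketch: Frame is invented for illustration; in Spark the `rows` member
// would be an RDD[T] and these methods would delegate to rdd.map/rdd.filter.
case class Frame[T](schema: Vector[String], rows: Seq[T]) {
  // A row-level map can change the row shape, so the caller restates the schema.
  def mapRows[U](newSchema: Vector[String])(f: T => U): Frame[U] =
    Frame(newSchema, rows.map(f))
  // A filter keeps the row type, so the schema carries over unchanged.
  def filterRows(p: T => Boolean): Frame[T] =
    Frame(schema, rows.filter(p))
}

val f = Frame(Vector("id", "cost"), Seq((1, 9.5), (2, 12.0)))
val cheap = f.filterRows(_._2 < 10.0)
println(cheap.schema) // the column names survived the transformation
println(cheap.rows)
```

The design choice is that schema-preserving operations (filter, sample) carry the metadata automatically, while schema-changing ones (map, join) force the caller to say what the new columns are, which is exactly the point where a bare RDD loses that information.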