Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

2014-01-24 Thread John Myles White
I think they’re uncorrelated, but you’d have to ask Wes to know for sure.

 — John

On Jan 24, 2014, at 12:19 AM, Matthias BUSSONNIER 
bussonniermatth...@gmail.com wrote:

 
 On Jan 24, 2014, at 04:51, Jonathan Malmaud wrote:
 
 Sounds reasonable. As a temporary measure for people who want that 
 functionality immediately, I've taken a stab at wrapping pandas in a Julia 
 package (just as pyplot does for matplotlib), at 
 https://github.com/malmaud/pandas. 
 
 
 Would this explain this tweet from 10 hours ago?
 
 Wes McKinney @wesmckinn
 Friendly reminder that performance-obsessed data hackers (R, Python, Julia) 
 should feel free to drop me a line about working together
 
 -- 
 M



Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

2014-01-23 Thread John Myles White
Yeah, at some point in the future I’d like to see if we can imitate the 
experimental query() and eval() methods from Pandas.

It’s the fact that those methods were just recently introduced which made me 
decide we needed to stop spending time on getting them working right now. We’re 
way behind Pandas in terms of performance and reliability, so it’s a bad idea 
for us to try being as feature complete until we catch up.

 — John

On Jan 23, 2014, at 6:37 AM, Jonathan Malmaud malm...@gmail.com wrote:

 Pandas has a 'query' method 
 (http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-query) which 
 uses the Python numexpr package for delayed evaluation (if I understand what 
 you mean by that in this context). 



Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

2014-01-23 Thread John Myles White
I think that’s probably because you need to do using DataArrays now.

 — John

On Jan 23, 2014, at 2:08 AM, Jon Norberg jon.norb...@ecology.su.se wrote:

 Is this why I get this on the latest Julia Studio on Mac with recently updated 
 packages:
 
 julia> using DataFrames
 julia> using RDatasets
 julia> iris = data("datasets", "iris")
 data not defined
 
 ??



Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

2014-01-23 Thread Jonathan Malmaud
Sounds reasonable. As a temporary measure for people who want that 
functionality immediately, I've taken a stab at wrapping pandas in a Julia 
package (just as pyplot does for matplotlib), 
at https://github.com/malmaud/pandas. 

On Thursday, January 23, 2014 10:17:40 AM UTC-5, John Myles White wrote:

 Yeah, at some point in the future I’d like to see if we can imitate the 
 experimental query() and eval() methods from Pandas. 

 It’s the fact that those methods were just recently introduced which made 
 me decide we needed to stop spending time on getting them working right 
 now. We’re way behind Pandas in terms of performance and reliability, so 
 it’s a bad idea for us to try being as feature complete until we catch up. 

  — John 




Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

2014-01-23 Thread John Myles White
Just saw that. Seems like a very smart way to get us important functionality 
while we continue to push things forward. Would be very cool if we could make 
it possible to switch between the Pandas and native Julia implementations 
totally seamlessly.

 — John

On Jan 23, 2014, at 7:51 PM, Jonathan Malmaud malm...@gmail.com wrote:

 Sounds reasonable. As a temporary measure for people who want that 
 functionality immediately, I've taken a stab at wrapping pandas in a Julia 
 package (just as pyplot does for matplotlib), at 
 https://github.com/malmaud/pandas. 
 



Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

2014-01-22 Thread Kevin Squire
I'm also a fan of the expression-based interface (mostly because I'm used
to similar things in Pandas).  I haven't looked at that code, though, so I
can't comment on the complexity.

Kevin


On Wed, Jan 22, 2014 at 11:18 AM, Blake Johnson blakejohnso...@gmail.com wrote:

 Sure, but the resulting expression is *much* more verbose. I just noticed
 that all expression-based indexing was on the chopping block. What is left
 after all this?

 I can see how axing these features would make DataFrames.jl easier to
 maintain, but I found the expression stuff to present a rather nice
 interface.

 --Blake


 On Tuesday, January 21, 2014 11:51:03 AM UTC-5, John Myles White wrote:

 Can you do something like df["ColA"] = f(df)?
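
A minimal sketch of that suggestion, assuming a recent Julia and using a plain Dict of column vectors as a hypothetical stand-in for a data frame. The column names and the body of f are illustrative only and are not taken from DataFrames.

```julia
# Hypothetical stand-in for a data frame: column names mapped to equal-length vectors.
df = Dict("ColB" => [1.0, 2.0, 3.0], "ColC" => [10.0, 20.0, 30.0])

# Hypothetical f: derives a new column from the existing ones.
f(table) = table["ColB"] .+ table["ColC"]

# Explicit assignment replaces the in-place transformation that within! provided.
df["ColA"] = f(df)
```

The assignment is more verbose than an expression-based call, but it needs no quoting and no special scoping rules.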

  — John


 On Jan 21, 2014, at 8:48 AM, Blake Johnson blakejo...@gmail.com wrote:

 I use within! pretty frequently. What should I be using instead if that
 is on the chopping block?

 --Blake

 On Tuesday, January 21, 2014 7:42:39 AM UTC-5, tshort wrote:

 I also agree with your approach, John. Based on your criteria, here
 are some other things to consider for the chopping block.

 - expression-based indexing
 - NamedArray (you already have an issue on this)
 - with, within, based_on and variants
 - @transform, @DataFrame
 - select, filter
 - DataStream

 Many of these were attempts to ease syntax via delayed evaluation. We
 can either do without or try to implement something like LINQ.



 On Mon, Jan 20, 2014 at 7:02 PM, Kevin Squire kevin@gmail.com wrote:
  Hi John,
 
  I agree with pretty much everything you have written here, and really
  appreciate that you've taken the lead in cleaning things up and getting us
  on track.
 
  Cheers!
  Kevin
 
 
  On Mon, Jan 20, 2014 at 1:57 PM, John Myles White johnmyl...@gmail.com
  wrote:
 
  As I said in another thread recently, I am currently the lead maintainer
  of more packages than I can keep up with. I think it’s been useful for me
  to start so many different projects, but I can’t keep maintaining most of
  my packages given my current work schedule.
 
  Without Simon Kornblith, Kevin Squire, Sean Garborg and several others
  doing amazing work to keep DataArrays and DataFrames going, much of our
  basic data infrastructure would have already become completely unusable.
  But even with the great work that’s been done on those packages recently,
  there’s still a lot of additional design work required. I’d like to free
  up some of my time to do that work.
 
  To keep things moving forward, I’d like to propose a couple of radical
  New Year’s resolutions for the packages I work on.
 
  (1) We need to stop adding functionality and focus entirely on improving
  the quality and documentation of our existing functionality. We have way
  too much prototype code in DataFrames that I can’t keep up with. I’m about
  to make a pull request for DataFrames that will remove everything related
  to column groupings, database-style indexing and Blocks.jl support. I
  absolutely want to see us push all of those ideas forward in the future,
  but they need to happen in unmerged forks or separate packages until we
  have the resources needed to support them. Right now, they make an
  overwhelming maintenance challenge even more onerous.
 
  (2) We can’t support anything other than the master branch of most
  JuliaStats packages except possibly for Distributions. I personally don’t
  have the time to simultaneously keep stuff working with Julia 0.2 and
  Julia 0.3. Moreover, many of our basic packages aren’t mature enough to
  justify supporting older versions. We should do a better job of supporting
  our master releases and not invest precious time trying to support older
  releases.
 
  (3) We need to make more of DataArrays and DataFrames reflect the Julian
  worldview. Lots of our code uses an interface that is incongruous with the
  interfaces found in Base. Even worse, a large chunk of code has
  type-stability problems that make it very slow, when comparable code that
  uses normal Arrays is 100x faster. We need to develop new idioms and new
  strategies for making code that interacts with type-destabilizing NA’s
  faster. More generally, we need to make DataArrays and DataFrames fit in
  better with Julia when Julia and R disagree. Following R’s lead has often
  led us astray because R doesn’t share Julia’s strengths or weaknesses.
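
The type-stability concern in resolution (3) can be illustrated with a rough sketch, assuming a recent Julia in which `missing` stands in for the NA value of DataArrays (that substitution is an assumption, as are the function name and the sample data). It shows how the element type the compiler sees changes once missing values or untyped containers are involved; it does not attempt to reproduce the 100x figure, which depends on the Julia version and the workload.

```julia
# Sum the non-missing entries of a vector with a plain loop.
function sum_present(xs)
    total = 0.0
    for x in xs
        if !ismissing(x)
            total += x
        end
    end
    return total
end

plain   = [1.0, 2.0, 3.0]           # Vector{Float64}: concrete element type
with_na = [1.0, missing, 3.0]       # Vector{Union{Missing, Float64}}
boxed   = Any[1.0, missing, 3.0]    # Vector{Any}: nothing is known per element

# Report, for each container, whether its element type is concrete.
for xs in (plain, with_na, boxed)
    println(eltype(xs), " concrete? ", isconcretetype(eltype(xs)),
            ", sum = ", sum_present(xs))
end
```

Only the first container has a concrete element type, which is the situation the compiler can optimize most directly.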

 
  (4) Going forward, there should be exactly one way to do most things. The
  worst part of our current codebase is that there are multiple ways to
  express the same computation, but (a) some of them are unusably slow and
  (b) some of them don’t ever get tested or maintained properly. This is
  closely linked to the excess proliferation of functionality described in
  Resolution 1 above. We need to start removing stuff from our packages and
  making the parts we keep both reliable and fast.
 
  I think we can push DataArrays and DataFrames to 1.0 status by the end of
  this 

Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

2014-01-22 Thread John Myles White
My impression is that Pandas didn't support anything like delayed evaluation. 
Is that wrong?

I'm aware that the resulting expressions are a lot more verbose. That 
definitely sucks.

I'd love to see strong proposals for how we're going to do a better job of 
making code shorter going forward. But too much of our current codebase is 
buggy, unable to handle edge cases, slow and undocumented. I think it's much 
more important that we have one way of doing things that actually works as 
advertised for every Julia user than two ways of doing things, each of which is 
slightly broken and performs worse than R and Pandas.

As I've been saying lately, I'm burning out on maintaining so much Julia code. If 
someone else wants to take charge of my projects, I'm ok with that. But if I'm 
going to be doing the work going forward, I need to devote my energies to 
making a small number of things work really well. Once we get our core 
functionality solid, I'll be comfortable getting fancier stuff working again.

 -- John

On Jan 22, 2014, at 1:06 PM, Kevin Squire kevin.squ...@gmail.com wrote:

 I'm also a fan of the expression-based interface (mostly because I'm used to 
 similar things in Pandas).  I haven't looked at that code, though, so I can't 
 comment on the complexity.
 
 Kevin
 
 

Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

2014-01-22 Thread Kevin Squire
Maybe I misinterpreted the term expression-based interface.


On Wed, Jan 22, 2014 at 2:33 PM, John Myles White
johnmyleswh...@gmail.com wrote:

 My impression is that Pandas didn't support anything like delayed
 evaluation. Is that wrong?

 I'm aware that the resulting expressions are a lot more verbose. That
 definitely sucks.

 I'd love to see strong proposals for how we're going to do a better job of
 making code shorter going forward. But too much of our current codebase is
 buggy, unable to handle edge cases, slow and undocumented. I think it's
 much more important that we have one way of doing things that actually
 works as advertised for every Julia user than two ways of doing things,
 each of which is slightly broken and performs worse than R and Pandas.

 As I've been saying lately, I'm burning out on maintaining so much Julia
 code. If someone else wants to take charge of my projects, I'm ok with
 that. But if I'm going to be doing the work going forward, I need to devote
 my energies to making a small number of things work really well. Once we
 get our core functionality solid, I'll be comfortable getting fancier stuff
 working again.

  -- John


Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

2014-01-22 Thread John Myles White
The idealized expression interface offers things like (up to reordering):

with(df, a + b * x)

where a and b are variables in the caller's scope and x is a column of df.

In practice, we've had to hack this sort of thing together to offer things like

with(df, :($a + $b * x))

That's because we need to pass quoted strings and we also need to tell the 
system which variables are in the caller's scope.

More generally, I'd refer to any operation that passes expressions around and 
asks other functions to evaluate them with an ad hoc scope as expression-based 
operations.
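
A minimal sketch of such an expression-based operation, assuming a recent Julia (where elementwise arithmetic is written with dotted operators). The Dict of columns, `substitute`, and `with_columns` are hypothetical names used for illustration; this is not the DataFrames implementation of with.

```julia
# Hypothetical stand-in for a data frame: column names mapped to vectors.
cols = Dict(:x => [1.0, 2.0, 3.0])

# Replace every symbol that names a column with that column's data,
# leaving all other parts of the expression untouched.
substitute(ex, cols) = ex
substitute(s::Symbol, cols) = haskey(cols, s) ? cols[s] : s
function substitute(ex::Expr, cols)
    Expr(ex.head, map(arg -> substitute(arg, cols), ex.args)...)
end

# Hypothetical helper: evaluate `ex` with the columns acting as an ad hoc scope.
with_columns(cols, ex) = eval(substitute(ex, cols))

a, b = 10.0, 2.0
# `$a` and `$b` splice the caller's values into the quoted expression;
# the bare `x` is resolved against the columns at evaluation time.
result = with_columns(cols, :($a .+ $b .* x))

# The splice is what makes local variables usable, since eval itself
# only sees global scope.
function scaled_by_local()
    scale = 3.0
    with_columns(cols, :($scale .* x))
end
```

Dropping the `$` in `scaled_by_local` would leave `scale` as a bare symbol for eval to look up globally, which is the scoping problem described above.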

R offers very deep support for this in the language.

 -- John

On Jan 22, 2014, at 2:48 PM, Kevin Squire kevin.squ...@gmail.com wrote:

 Maybe I misinterpreted the term expression-based interface.
 
 

Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

2014-01-22 Thread Kevin Squire
Got it.  I was thinking of the more verbose (but still useful)

df[(df[colA] > 4) & !isna(df[colB]), :]

Kevin


On Wed, Jan 22, 2014 at 3:10 PM, John Myles White
johnmyleswh...@gmail.com wrote:

 The idealized expression interface offers things like (up to reordering):

 with(df, a + b * x)

 where a and b are variables in the caller's scope and x is a column of df.

 In practice, we've had to hack this sort of thing together to offer things
 like

 with(df, :($a + $b * x))

 That's because we need to pass quoted strings and we also need to tell the
 system which variables are in the caller's scope.

 More generally, I'd refer to any operation that passes expressions around
 and asks other functions to evaluate them with an ad hoc scope as
 expression-based operations.

 R offers very deep support for this in the language.

  -- John


Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

2014-01-22 Thread John Myles White
That's exactly the kind of indexing I'd like to encourage using until we get 
our core functionality cleaned up. Nothing special required except Boolean 
indexing, which is easy to make fast and doesn't have weird scoping issues.
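
A small sketch of Boolean indexing on plain vectors, assuming a recent Julia in which `missing` and `ismissing` play the role that NA and isna played in DataArrays. The column contents and the comparison against 4 are assumptions made for illustration.

```julia
# Hypothetical columns of equal length.
colA = [1, 5, 7, 3, 9]
colB = [2.0, missing, 4.0, 1.0, missing]

# Build a Boolean mask: keep rows where colA exceeds 4 and colB is present.
mask = (colA .> 4) .& .!ismissing.(colB)

# Index every column with the same mask to select the matching rows.
selectedA = colA[mask]
selectedB = colB[mask]
```

Everything here is ordinary array code evaluated in the caller's scope, so no quoting or variable splicing is involved.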

 -- John

On Jan 22, 2014, at 3:18 PM, Kevin Squire kevin.squ...@gmail.com wrote:

 Got it.  I was thinking of the more verbose (but still useful)
 
 df[(df[colA] > 4) & !isna(df[colB]), :]
 
 Kevin
 
 
 On Wed, Jan 22, 2014 at 3:10 PM, John Myles White johnmyleswh...@gmail.com 
 wrote:
 The idealized expression interface offers things like (up to reordering):
 
 with(df, a + b * x)
 
 where a and b are variables in the caller's scope and x is a column of df.
 
 In practice, we've had to hack this sort of thing together to offer things 
 like
 
 with(df, :($a + $b * x))
 
 That's because we need to pass quoted strings and we also need to tell the 
 system which variables are in the caller's cope.
 
 More generally, I'd refer to any operation that passes expressions around and 
 asks other functions to evaluate them with an ad hoc scope as 
 expression-based operations.
 
 R offers very deep support for this in the language.
 
  -- John
 
 On Jan 22, 2014, at 2:48 PM, Kevin Squire kevin.squ...@gmail.com wrote:
 
 Maybe I misinterpreted the term expression-based interface.
 
 
 On Wed, Jan 22, 2014 at 2:33 PM, John Myles White johnmyleswh...@gmail.com 
 wrote:
 My impression is that Pandas didn't support anything like delayed 
 evaluation. Is that wrong?
 
 I'm aware that the resulting expressions are a lot more verbose. That 
 definitely sucks.
 
 I'd love to see strong proposals for how we're going to do a better job of 
 making code shorter going forward. But too much of our current codebase is 
 buggy, unable to handle edge cases, slow and undocumented. I think it's much 
 more important that we have one way of doing things that actually works as 
 advertised for every Julia user than two ways of doing things, each of which 
 is slightly broken and performs worse than R and Pandas.
 
 As I've been saying lately, I'm burning out on maintaining so much Julia code. 
 If someone else wants to take charge of my projects, I'm ok with that. But 
 if I'm going to be doing the work going forward, I need to devote my 
 energies to making a small number of things work really well. Once we get 
 our core functionality solid, I'll be comfortable getting fancier stuff 
 working again.
 
  -- John
 
 On Jan 22, 2014, at 1:06 PM, Kevin Squire kevin.squ...@gmail.com wrote:
 
 I'm also a fan of the expression-based interface (mostly because I'm used 
 to similar things in Pandas).  I haven't looked at that code, though, so I 
 can't comment on the complexity.
 
 Kevin
 
 
 On Wed, Jan 22, 2014 at 11:18 AM, Blake Johnson blakejohnso...@gmail.com 
 wrote:
 Sure, but the resulting expression is much more verbose. I just noticed 
 that all expression-based indexing was on the chopping block. What is left 
 after all this?
 
 I can see how axing these features would make DataFrames.jl easier to 
 maintain, but I found the expression stuff to present a rather nice 
 interface.
 
 --Blake
 
 
 On Tuesday, January 21, 2014 11:51:03 AM UTC-5, John Myles White wrote:
 Can you do something like df["ColA"] = f(df)?
 
  — John
 
 
 On Jan 21, 2014, at 8:48 AM, Blake Johnson blakejo...@gmail.com wrote:
 
 I use within! pretty frequently. What should I be using instead if that is 
 on the chopping block?
 
 --Blake
 
 On Tuesday, January 21, 2014 7:42:39 AM UTC-5, tshort wrote:
 I also agree with your approach, John. Based on your criteria, here 
 are some other things to consider for the chopping block. 
 
 - expression-based indexing 
 - NamedArray (you already have an issue on this) 
 - with, within, based_on and variants 
 - @transform, @DataFrame 
 - select, filter 
 - DataStream 
 
 Many of these were attempts to ease syntax via delayed evaluation. We 
 can either do without or try to implement something like LINQ. 
 
 
 
 On Mon, Jan 20, 2014 at 7:02 PM, Kevin Squire kevin@gmail.com wrote: 
  Hi John, 
  
  I agree with pretty much everything you have written here, and really 
  appreciate that you've taken the lead in cleaning things up and getting 
  us 
  on track. 
  
  Cheers! 
 Kevin 
  
  
  On Mon, Jan 20, 2014 at 1:57 PM, John Myles White johnmyl...@gmail.com 
  wrote: 
  
  As I said in another thread recently, I am currently the lead 
  maintainer 
  of more packages than I can keep up with. I think it’s been useful for 
  me to 
  start so many different projects, but I can’t keep maintaining most of 
  my 
  packages given my current work schedule. 
  
  Without Simon Kornblith, Kevin Squire, Sean Garborg and several others 
  doing amazing work to keep DataArrays and DataFrames going, much of our 
  basic data infrastructure would have already become completely 
  unusable. But 
  even with the great work that’s been done on those packages recently, 
  there’s 
  still a lot of 

Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

2014-01-21 Thread Tom Short
I also agree with your approach, John. Based on your criteria, here
are some other things to consider for the chopping block.

- expression-based indexing
- NamedArray (you already have an issue on this)
- with, within, based_on and variants
- @transform, @DataFrame
- select, filter
- DataStream

Many of these were attempts to ease syntax via delayed evaluation. We
can either do without or try to implement something like LINQ.



On Mon, Jan 20, 2014 at 7:02 PM, Kevin Squire kevin.squ...@gmail.com wrote:
 Hi John,

 I agree with pretty much everything you have written here, and really
 appreciate that you've taken the lead in cleaning things up and getting us
 on track.

 Cheers!
Kevin


 On Mon, Jan 20, 2014 at 1:57 PM, John Myles White johnmyleswh...@gmail.com
 wrote:

 As I said in another thread recently, I am currently the lead maintainer
 of more packages than I can keep up with. I think it’s been useful for me to
 start so many different projects, but I can’t keep maintaining most of my
 packages given my current work schedule.

 Without Simon Kornblith, Kevin Squire, Sean Garborg and several others
 doing amazing work to keep DataArrays and DataFrames going, much of our
 basic data infrastructure would have already become completely unusable. But
 even with the great work that’s been done on those packages recently, there’s
 still a lot of additional design work required. I’d like to free up some of my
 time to do that work.

 To keep things moving forward, I’d like to propose a couple of radical New
 Year’s resolutions for the packages I work on.

 (1) We need to stop adding functionality and focus entirely on improving
 the quality and documentation of our existing functionality. We have way too
 much prototype code in DataFrames that I can’t keep up with. I’m about to
 make a pull request for DataFrames that will remove everything related to
 column groupings, database-style indexing and Blocks.jl support. I
 absolutely want to see us push all of those ideas forward in the future, but
 they need to happen in unmerged forks or separate packages until we have the
 resources needed to support them. Right now, they make an overwhelming
 maintenance challenge even more onerous.

 (2) We can’t support anything other than the master branch of most
 JuliaStats packages except possibly for Distributions. I personally don’t
 have the time to simultaneously keep stuff working with Julia 0.2 and Julia
 0.3. Moreover, many of our basic packages aren’t mature enough to justify
 supporting older versions. We should do a better job of supporting our
 master releases and not invest precious time trying to support older
 releases.

 (3) We need to make more of DataArrays and DataFrames reflect the Julian
 worldview. Lots of our code uses an interface that is incongruous with the
 interfaces found in Base. Even worse, a large chunk of code has
 type-stability problems that make it very slow, when comparable code that
 uses normal Arrays is 100x faster. We need to develop new idioms and new
 strategies for making code that interacts with type-destabilizing NA’s
 faster. More generally, we need to make DataArrays and DataFrames fit in
 better with Julia when Julia and R disagree. Following R’s lead has often
 led us astray because R doesn’t share Julia’s strengths or weaknesses.

 (4) Going forward, there should be exactly one way to do most things. The
 worst part of our current codebase is that there are multiple ways to
 express the same computation, but (a) some of them are unusably slow and (b)
 some of them don’t ever get tested or maintained properly. This is closely
 linked to the excess proliferation of functionality described in Resolution
 1 above. We need to start removing stuff from our packages and making the
 parts we keep both reliable and fast.

 I think we can push DataArrays and DataFrames to 1.0 status by the end of
 this year. But I think we need to adopt a new approach if we’re going to get
 there. Lots of stuff needs to get deprecated and what remains needs a lot
 more testing, benchmarking and documentation.

  — John




Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

2014-01-21 Thread John Myles White
I agree with everything on this list, including my always neglected DataStreams 
project.

I think it would be nice to get rid of expression-based indexing + select and 
focus on getting something like LINQ working. For another interesting 
perspective, check out the newly created query function in Pandas, which takes 
in strings rather than expressions as inputs.

 — John

On Jan 21, 2014, at 4:42 AM, Tom Short tshort.rli...@gmail.com wrote:

 I also agree with your approach, John. Based on your criteria, here
 are some other things to consider for the chopping block.
 
 - expression-based indexing
 - NamedArray (you already have an issue on this)
 - with, within, based_on and variants
 - @transform, @DataFrame
 - select, filter
 - DataStream
 
 Many of these were attempts to ease syntax via delayed evaluation. We
 can either do without or try to implement something like LINQ.



Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

2014-01-21 Thread Blake Johnson
I use within! pretty frequently. What should I be using instead if that is 
on the chopping block?

--Blake

On Tuesday, January 21, 2014 7:42:39 AM UTC-5, tshort wrote:

 I also agree with your approach, John. Based on your criteria, here 
 are some other things to consider for the chopping block. 

 - expression-based indexing 
 - NamedArray (you already have an issue on this) 
 - with, within, based_on and variants 
 - @transform, @DataFrame 
 - select, filter 
 - DataStream 

 Many of these were attempts to ease syntax via delayed evaluation. We 
 can either do without or try to implement something like LINQ. 



 On Mon, Jan 20, 2014 at 7:02 PM, Kevin Squire 
 kevin@gmail.com 
 wrote: 
  Hi John, 
  
  I agree with pretty much everything you have written here, and really 
  appreciate that you've taken the lead in cleaning things up and getting 
 us 
  on track. 
  
  Cheers! 
 Kevin 
  
  
  On Mon, Jan 20, 2014 at 1:57 PM, John Myles White 
  johnmyl...@gmail.com 

  wrote: 
  
  As I said in another thread recently, I am currently the lead 
 maintainer 
  of more packages than I can keep up with. I think it’s been useful for 
 me to 
  start so many different projects, but I can’t keep maintaining most of 
 my 
  packages given my current work schedule. 
  
  Without Simon Kornblith, Kevin Squire, Sean Garborg and several others 
  doing amazing work to keep DataArrays and DataFrames going, much of our 
  basic data infrastructure would have already become completely 
 unusable. But 
  even with the great work that’s been done on those packages recently, 
 there’s 
  still a lot of additional design work required. I’d like to free up some 
 of my 
  time to do that work. 
  
  To keep things moving forward, I’d like to propose a couple of radical 
 New 
  Year’s resolutions for the packages I work on. 
  
  (1) We need to stop adding functionality and focus entirely on 
 improving 
  the quality and documentation of our existing functionality. We have 
 way too 
  much prototype code in DataFrames that I can’t keep up with. I’m about 
 to 
  make a pull request for DataFrames that will remove everything related 
 to 
  column groupings, database-style indexing and Blocks.jl support. I 
  absolutely want to see us push all of those ideas forward in the 
 future, but 
  they need to happen in unmerged forks or separate packages until we 
 have the 
  resources needed to support them. Right now, they make an overwhelming 
  maintenance challenge even more onerous. 
  
  (2) We can’t support anything other than the master branch of most 
  JuliaStats packages except possibly for Distributions. I personally 
 don’t 
  have the time to simultaneously keep stuff working with Julia 0.2 and 
 Julia 
  0.3. Moreover, many of our basic packages aren’t mature enough to 
 justify 
  supporting older versions. We should do a better job of supporting our 
  master releases and not invest precious time trying to support older 
  releases. 
  
  (3) We need to make more of DataArrays and DataFrames reflect the 
 Julian 
  worldview. Lots of our code uses an interface that is incongruous with 
 the 
  interfaces found in Base. Even worse, a large chunk of code has 
  type-stability problems that make it very slow, when comparable code 
 that 
  uses normal Arrays is 100x faster. We need to develop new idioms and 
 new 
  strategies for making code that interacts with type-destabilizing NA’s 
  faster. More generally, we need to make DataArrays and DataFrames fit 
 in 
  better with Julia when Julia and R disagree. Following R’s lead has 
 often 
  led us astray because R doesn’t share Julia’s strengths or weaknesses. 
  
  (4) Going forward, there should be exactly one way to do most things. 
 The 
  worst part of our current codebase is that there are multiple ways to 
  express the same computation, but (a) some of them are unusably slow 
 and (b) 
  some of them don’t ever get tested or maintained properly. This is 
 closely 
  linked to the excess proliferation of functionality described in 
 Resolution 
  1 above. We need to start removing stuff from our packages and making 
 the 
  parts we keep both reliable and fast. 
  
  I think we can push DataArrays and DataFrames to 1.0 status by the end 
 of 
  this year. But I think we need to adopt a new approach if we’re going 
 to get 
  there. Lots of stuff needs to get deprecated and what remains needs a 
 lot 
  more testing, benchmarking and documentation. 
  
   — John 
  
  



Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

2014-01-21 Thread John Myles White
Can you do something like df["ColA"] = f(df)?
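
A small sketch of that pattern, assuming a current Julia (1.x) and only Base. The `Dict` of columns is a hypothetical stand-in for a data frame, and `f` is an illustrative function, not part of any package:

```julia
# Hypothetical stand-in for a data frame: column name => column vector.
cols = Dict(
    "ColA" => [1.0, 2.0, 3.0],
    "ColB" => [10.0, 20.0, 30.0],
)

# f is an ordinary function of the whole table, so it needs no quoted
# expressions and no special evaluation scope.
f(df) = 2 .* df["ColA"] .+ df["ColB"]

# Compute the new values first, then overwrite the column in place.
cols["ColA"] = f(cols)
```

This is more verbose than a within!-style block, but the right-hand side is plain code that runs in the caller's scope.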

 — John

On Jan 21, 2014, at 8:48 AM, Blake Johnson blakejohnso...@gmail.com wrote:

 I use within! pretty frequently. What should I be using instead if that is on 
 the chopping block?
 
 --Blake
 
 On Tuesday, January 21, 2014 7:42:39 AM UTC-5, tshort wrote:
 I also agree with your approach, John. Based on your criteria, here 
 are some other things to consider for the chopping block. 
 
 - expression-based indexing 
 - NamedArray (you already have an issue on this) 
 - with, within, based_on and variants 
 - @transform, @DataFrame 
 - select, filter 
 - DataStream 
 
 Many of these were attempts to ease syntax via delayed evaluation. We 
 can either do without or try to implement something like LINQ. 
 
 
 
 On Mon, Jan 20, 2014 at 7:02 PM, Kevin Squire kevin@gmail.com wrote: 
  Hi John, 
  
  I agree with pretty much everything you have written here, and really 
  appreciate that you've taken the lead in cleaning things up and getting us 
  on track. 
  
  Cheers! 
 Kevin 
  
  
  On Mon, Jan 20, 2014 at 1:57 PM, John Myles White johnmyl...@gmail.com 
  wrote: 
  
  As I said in another thread recently, I am currently the lead maintainer 
  of more packages than I can keep up with. I think it’s been useful for me 
  to 
  start so many different projects, but I can’t keep maintaining most of my 
  packages given my current work schedule. 
  
  Without Simon Kornblith, Kevin Squire, Sean Garborg and several others 
  doing amazing work to keep DataArrays and DataFrames going, much of our 
  basic data infrastructure would have already become completely unusable. 
  But 
  even with the great work that’s been done on those packages recently, 
  there’s 
  still a lot of additional design work required. I’d like to free up some of 
  my 
  time to do that work. 
  
  To keep things moving forward, I’d like to propose a couple of radical New 
  Year’s resolutions for the packages I work on. 
  
  (1) We need to stop adding functionality and focus entirely on improving 
  the quality and documentation of our existing functionality. We have way 
  too 
  much prototype code in DataFrames that I can’t keep up with. I’m about to 
  make a pull request for DataFrames that will remove everything related to 
  column groupings, database-style indexing and Blocks.jl support. I 
  absolutely want to see us push all of those ideas forward in the future, 
  but 
  they need to happen in unmerged forks or separate packages until we have 
  the 
  resources needed to support them. Right now, they make an overwhelming 
  maintenance challenge even more onerous. 
  
  (2) We can’t support anything other than the master branch of most 
  JuliaStats packages except possibly for Distributions. I personally don’t 
  have the time to simultaneously keep stuff working with Julia 0.2 and 
  Julia 
  0.3. Moreover, many of our basic packages aren’t mature enough to justify 
  supporting older versions. We should do a better job of supporting our 
  master releases and not invest precious time trying to support older 
  releases. 
  
  (3) We need to make more of DataArrays and DataFrames reflect the Julian 
  worldview. Lots of our code uses an interface that is incongruous with the 
  interfaces found in Base. Even worse, a large chunk of code has 
  type-stability problems that make it very slow, when comparable code that 
  uses normal Arrays is 100x faster. We need to develop new idioms and new 
  strategies for making code that interacts with type-destabilizing NA’s 
  faster. More generally, we need to make DataArrays and DataFrames fit in 
  better with Julia when Julia and R disagree. Following R’s lead has often 
  led us astray because R doesn’t share Julia’s strengths or weaknesses. 
  
  (4) Going forward, there should be exactly one way to do most things. The 
  worst part of our current codebase is that there are multiple ways to 
  express the same computation, but (a) some of them are unusably slow and 
  (b) 
  some of them don’t ever get tested or maintained properly. This is closely 
  linked to the excess proliferation of functionality described in 
  Resolution 
  1 above. We need to start removing stuff from our packages and making the 
  parts we keep both reliable and fast. 
  
  I think we can push DataArrays and DataFrames to 1.0 status by the end of 
  this year. But I think we need to adopt a new approach if we’re going to 
  get 
  there. Lots of stuff needs to get deprecated and what remains needs a lot 
  more testing, benchmarking and documentation. 
  
   — John 
  
 



[julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

2014-01-20 Thread John Myles White
As I said in another thread recently, I am currently the lead maintainer of 
more packages than I can keep up with. I think it’s been useful for me to start 
so many different projects, but I can’t keep maintaining most of my packages 
given my current work schedule.

Without Simon Kornblith, Kevin Squire, Sean Garborg and several others doing 
amazing work to keep DataArrays and DataFrames going, much of our basic data 
infrastructure would have already become completely unusable. But even with the 
great work that’s been done on those packages recently, there’s still a lot of 
additional design work required. I’d like to free up some of my time to do that 
work.

To keep things moving forward, I’d like to propose a couple of radical New 
Year’s resolutions for the packages I work on.

(1) We need to stop adding functionality and focus entirely on improving the 
quality and documentation of our existing functionality. We have way too much 
prototype code in DataFrames that I can’t keep up with. I’m about to make a 
pull request for DataFrames that will remove everything related to column 
groupings, database-style indexing and Blocks.jl support. I absolutely want to 
see us push all of those ideas forward in the future, but they need to happen 
in unmerged forks or separate packages until we have the resources needed to 
support them. Right now, they make an overwhelming maintenance challenge even 
more onerous.

(2) We can’t support anything other than the master branch of most JuliaStats 
packages except possibly for Distributions. I personally don’t have the time to 
simultaneously keep stuff working with Julia 0.2 and Julia 0.3. Moreover, many 
of our basic packages aren’t mature enough to justify supporting older 
versions. We should do a better job of supporting our master releases and not 
invest precious time trying to support older releases.

(3) We need to make more of DataArrays and DataFrames reflect the Julian 
worldview. Lots of our code uses an interface that is incongruous with the 
interfaces found in Base. Even worse, a large chunk of code has type-stability 
problems that make it very slow, when comparable code that uses normal Arrays 
is 100x faster. We need to develop new idioms and new strategies for making 
code that interacts with type-destabilizing NA’s faster. More generally, we 
need to make DataArrays and DataFrames fit in better with Julia when Julia and 
R disagree. Following R’s lead has often led us astray because R doesn’t share 
Julia’s strengths or weaknesses.
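
To make the type-stability point concrete, here is a minimal sketch, assuming a current Julia (1.x) and only Base. `SketchNA`, `get_or_na` and `sum_skipna` are hypothetical names for illustration, not the DataArrays API, and the sketch makes no claim about the size of the speed difference:

```julia
# Hypothetical sentinel type standing in for a missing value.
struct SketchNA end

# Type-unstable accessor: the return type depends on run-time data, so it
# is inferred as a Union rather than a single concrete type.
get_or_na(data::Vector{Float64}, isna::Vector{Bool}, i::Int) =
    isna[i] ? SketchNA() : data[i]

# Type-stable alternative: consult the mask inside the loop and never
# materialize the sentinel, so the accumulator stays a Float64.
function sum_skipna(data::Vector{Float64}, isna::Vector{Bool})
    total = 0.0
    for i in eachindex(data)
        if !isna[i]
            total += data[i]
        end
    end
    return total
end

data = [1.0, 2.0, 3.0]
isna = [false, true, false]

# Ask the compiler what it infers for each function.
unstable_type = Base.return_types(get_or_na, (Vector{Float64}, Vector{Bool}, Int))[1]
stable_type = Base.return_types(sum_skipna, (Vector{Float64}, Vector{Bool}))[1]
```

Code that calls `get_or_na` in a loop has to handle both possible return types on every iteration, while `sum_skipna` compiles down to a loop over plain `Float64` values.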

(4) Going forward, there should be exactly one way to do most things. The worst 
part of our current codebase is that there are multiple ways to express the 
same computation, but (a) some of them are unusably slow and (b) some of them 
don’t ever get tested or maintained properly. This is closely linked to the 
excess proliferation of functionality described in Resolution 1 above. We need 
to start removing stuff from our packages and making the parts we keep both 
reliable and fast.

I think we can push DataArrays and DataFrames to 1.0 status by the end of this 
year. But I think we need to adopt a new approach if we’re going to get there. 
Lots of stuff needs to get deprecated and what remains needs a lot more 
testing, benchmarking and documentation.

 — John