Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages
I think they’re uncorrelated, but you’d have to ask Wes to know for sure. -- John
On Jan 24, 2014, at 12:19 AM, Matthias BUSSONNIER bussonniermatth...@gmail.com wrote:
On Jan 24, 2014, at 04:51, Jonathan Malmaud wrote:
Sounds reasonable. As a temporary measure for people who want that functionality immediately, I've taken a stab at wrapping pandas in a Julia package (just as pyplot does for matplotlib), at https://github.com/malmaud/pandas.
Would this explain this tweet from 10 hours ago?
Wes McKinney @wesmckinn: Friendly reminder that performance-obsessed data hackers (R, Python, Julia) should feel free to drop me a line about working together
-- M
Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages
Yeah, at some point in the future I’d like to see if we can imitate the experimental query() and eval() methods from Pandas. It’s the fact that those methods were just recently introduced which made me decide we needed to stop spending time on getting them working right now. We’re way behind Pandas in terms of performance and reliability, so it’s a bad idea for us to try being as feature complete until we catch up. -- John
On Jan 23, 2014, at 6:37 AM, Jonathan Malmaud malm...@gmail.com wrote:
Pandas has a 'query' method (http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-query) which uses the Python numexpr package for delayed evaluation (if I understand what you mean by that in this context).
Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages
I think that’s probably because you need to do using DataArrays now. -- John
On Jan 23, 2014, at 2:08 AM, Jon Norberg jon.norb...@ecology.su.se wrote:
Is this why I get this on the latest Julia Studio on Mac with recently updated packages:
    julia> using DataFrames
    julia> using RDatasets
    julia> iris = data("datasets", "iris")
    data not defined
??
Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages
Sounds reasonable. As a temporary measure for people who want that functionality immediately, I've taken a stab at wrapping pandas in a Julia package (just as pyplot does for matplotlib), at https://github.com/malmaud/pandas.
On Thursday, January 23, 2014 10:17:40 AM UTC-5, John Myles White wrote:
Yeah, at some point in the future I’d like to see if we can imitate the experimental query() and eval() methods from Pandas. It’s the fact that those methods were just recently introduced which made me decide we needed to stop spending time on getting them working right now. We’re way behind Pandas in terms of performance and reliability, so it’s a bad idea for us to try being as feature complete until we catch up. -- John
Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages
Just saw that. Seems like a very smart way to get us important functionality while we continue to push things forward. Would be very cool if we could make it possible to switch between the Pandas and native Julia implementations totally seamlessly. -- John
On Jan 23, 2014, at 7:51 PM, Jonathan Malmaud malm...@gmail.com wrote:
Sounds reasonable. As a temporary measure for people who want that functionality immediately, I've taken a stab at wrapping pandas in a Julia package (just as pyplot does for matplotlib), at https://github.com/malmaud/pandas.
Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages
I'm also a fan of the expression-based interface (mostly because I'm used to similar things in Pandas). I haven't looked at that code, though, so I can't comment on the complexity. Kevin
On Wed, Jan 22, 2014 at 11:18 AM, Blake Johnson blakejohnso...@gmail.com wrote:
Sure, but the resulting expression is *much* more verbose. I just noticed that all expression-based indexing was on the chopping block. What is left after all this? I can see how axing these features would make DataFrames.jl easier to maintain, but I found the expression stuff to present a rather nice interface. --Blake
On Tuesday, January 21, 2014 11:51:03 AM UTC-5, John Myles White wrote:
Can you do something like df["ColA"] = f(df)? -- John
On Jan 21, 2014, at 8:48 AM, Blake Johnson blakejo...@gmail.com wrote:
I use within! pretty frequently. What should I be using instead if that is on the chopping block? --Blake
On Tuesday, January 21, 2014 7:42:39 AM UTC-5, tshort wrote:
I also agree with your approach, John. Based on your criteria, here are some other things to consider for the chopping block.
- expression-based indexing
- NamedArray (you already have an issue on this)
- with, within, based_on and variants
- @transform, @DataFrame
- select, filter
- DataStream
Many of these were attempts to ease syntax via delayed evaluation. We can either do without or try to implement something like LINQ.
On Mon, Jan 20, 2014 at 7:02 PM, Kevin Squire kevin@gmail.com wrote:
Hi John, I agree with pretty much everything you have written here, and really appreciate that you've taken the lead in cleaning things up and getting us on track. Cheers! Kevin
On Mon, Jan 20, 2014 at 1:57 PM, John Myles White johnmyl...@gmail.com wrote:
As I said in another thread recently, I am currently the lead maintainer of more packages than I can keep up with. I think it’s been useful for me to start so many different projects, but I can’t keep maintaining most of my packages given my current work schedule. Without Simon Kornblith, Kevin Squire, Sean Garborg and several others doing amazing work to keep DataArrays and DataFrames going, much of our basic data infrastructure would have already become completely unusable. But even with the great work that’s been done on those packages recently, there’s still a lot of additional design work required. I’d like to free up some of my time to do that work. To keep things moving forward, I’d like to propose a couple of radical New Year’s resolutions for the packages I work on.
(1) We need to stop adding functionality and focus entirely on improving the quality and documentation of our existing functionality. We have way too much prototype code in DataFrames that I can’t keep up with. I’m about to make a pull request for DataFrames that will remove everything related to column groupings, database-style indexing and Blocks.jl support. I absolutely want to see us push all of those ideas forward in the future, but they need to happen in unmerged forks or separate packages until we have the resources needed to support them. Right now, they make an overwhelming maintenance challenge even more onerous.
(2) We can’t support anything other than the master branch of most JuliaStats packages except possibly for Distributions. I personally don’t have the time to simultaneously keep stuff working with Julia 0.2 and Julia 0.3. Moreover, many of our basic packages aren’t mature enough to justify supporting older versions. We should do a better job of supporting our master releases and not invest precious time trying to support older releases.
(3) We need to make more of DataArrays and DataFrames reflect the Julian worldview. Lots of our code uses an interface that is incongruous with the interfaces found in Base. Even worse, a large chunk of code has type-stability problems that make it very slow, when comparable code that uses normal Arrays is 100x faster. We need to develop new idioms and new strategies for making code that interacts with type-destabilizing NA’s faster. More generally, we need to make DataArrays and DataFrames fit in better with Julia when Julia and R disagree. Following R’s lead has often led us astray because R doesn’t share Julia’s strengths or weaknesses.
(4) Going forward, there should be exactly one way to do most things. The worst part of our current codebase is that there are multiple ways to express the same computation, but (a) some of them are unusably slow and (b) some of them don’t ever get tested or maintained properly. This is closely linked to the excess proliferation of functionality described in Resolution 1 above. We need to start removing stuff from our packages and making the parts we keep both reliable and fast. I think we can push DataArrays and DataFrames to 1.0 status by the end of this
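Resolution (3) in the quoted announcement refers to type-stability problems around NA. As a rough illustration only, here is a minimal sketch in plain Julia 1.x syntax (which postdates this thread). The names `NAType`, `NA`, `unstable_sqrt` and `stable_sqrt` are hypothetical and are not the DataArrays definitions; the point is just that a function whose return type depends on a run-time value is inferred as a Union, while a variant that always returns a Float64 is inferred as a concrete type.

```julia
# Hypothetical stand-in for a missing-value marker; NOT the DataArrays definition.
struct NAType end
const NA = NAType()

# Type-unstable: returns either an NAType or a Float64 depending on the value of x.
unstable_sqrt(x::Float64) = x < 0 ? NA : sqrt(x)

# Type-stable: always returns a Float64, using NaN as the "no result" marker.
stable_sqrt(x::Float64) = x < 0 ? NaN : sqrt(x)

# Ask the compiler which return type it infers for a Float64 argument.
unstable_type = Base.return_types(unstable_sqrt, (Float64,))[1]
stable_type = Base.return_types(stable_sqrt, (Float64,))[1]

println("unstable_sqrt infers: ", unstable_type)
println("stable_sqrt infers:   ", stable_type)
```

Whether this kind of instability accounts for the slowdown mentioned in the announcement is not something the sketch measures; it only shows what "type-destabilizing" can look like at the level of a single function.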
Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages
My impression is that Pandas didn't support anything like delayed evaluation. Is that wrong? I'm aware that the resulting expressions are a lot more verbose. That definitely sucks. I'd love to see strong proposals for how we're going to do a better job of making code shorter going forward. But too much of our current codebase is buggy, unable to handle edge cases, slow and undocumented. I think it's much more important that we have one way of doing things that actually works as advertised for every Julia user than two ways of doing things, each of which is slightly broken and performs worse than R and Pandas. As I've been saying lately, I'm burning out on maintaining so much Julia code. If someone else wants to take charge of my projects, I'm ok with that. But if I'm going to be doing the work going forward, I need to devote my energies to making a small number of things work really well. Once we get our core functionality solid, I'll be comfortable getting fancier stuff working again. -- John
On Jan 22, 2014, at 1:06 PM, Kevin Squire kevin.squ...@gmail.com wrote:
I'm also a fan of the expression-based interface (mostly because I'm used to similar things in Pandas). I haven't looked at that code, though, so I can't comment on the complexity. Kevin
Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages
Maybe I misinterpreted the term expression-based interface.
On Wed, Jan 22, 2014 at 2:33 PM, John Myles White johnmyleswh...@gmail.com wrote:
My impression is that Pandas didn't support anything like delayed evaluation. Is that wrong? I'm aware that the resulting expressions are a lot more verbose. That definitely sucks. I'd love to see strong proposals for how we're going to do a better job of making code shorter going forward. But too much of our current codebase is buggy, unable to handle edge cases, slow and undocumented. I think it's much more important that we have one way of doing things that actually works as advertised for every Julia user than two ways of doing things, each of which is slightly broken and performs worse than R and Pandas. As I've been saying lately, I'm burning out on maintaining so much Julia code. If someone else wants to take charge of my projects, I'm ok with that. But if I'm going to be doing the work going forward, I need to devote my energies to making a small number of things work really well. Once we get our core functionality solid, I'll be comfortable getting fancier stuff working again. -- John
Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages
The idealized expression interface offers things like (up to reordering):
    with(df, a + b * x)
where a and b are variables in the caller's scope and x is a column of df. In practice, we've had to hack this sort of thing together to offer things like
    with(df, :($a + $b * x))
That's because we need to pass quoted strings and we also need to tell the system which variables are in the caller's scope. More generally, I'd refer to any operation that passes expressions around and asks other functions to evaluate them with an ad hoc scope as expression-based operations. R offers very deep support for this in the language. -- John
On Jan 22, 2014, at 2:48 PM, Kevin Squire kevin.squ...@gmail.com wrote:
Maybe I misinterpreted the term expression-based interface.
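To make the scoping issue John describes a little more concrete, here is a minimal sketch in plain Julia 1.x syntax (which postdates this thread). It is not the DataFrames implementation of `with`: the `Dict` standing in for a data frame and the helpers `substitute` and `with_columns` are hypothetical names invented for illustration. The sketch rewrites a quoted expression so that symbols naming a column are replaced by the column's data, then evaluates the result. Because the helper cannot see the caller's local variables, their values have to be spliced in with `$` when the expression is built.

```julia
# Hypothetical stand-in for a data frame: column names mapped to vectors.
columns = Dict(:x => [1.0, 2.0, 3.0], :y => [10.0, 20.0, 30.0])

# Recursively replace symbols that name a column with the column's data.
substitute(ex, cols) = ex                      # literals and other nodes pass through
substitute(s::Symbol, cols) = haskey(cols, s) ? cols[s] : s
substitute(ex::Expr, cols) =
    Expr(ex.head, map(arg -> substitute(arg, cols), ex.args)...)

# Hypothetical helper: evaluate the rewritten expression at global scope.
with_columns(cols, ex) = eval(substitute(ex, cols))

a, b = 2.0, 0.5

# `$a` and `$b` splice the caller's values into the quoted expression;
# `x` is left as a symbol and resolved against the columns.
result = with_columns(columns, :($a .+ $b .* x))
println(result)
```

Dropping the `$` would leave `a` and `b` as bare symbols for `eval` to resolve in global scope, which only works by accident when the caller's variables happen to be globals. That is the ad hoc scoping John refers to.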
Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages
Got it. I was thinking of the more verbose (but still useful) df[(df[colA] 4) !isna(df[colB]), :] Kevin
On Wed, Jan 22, 2014 at 3:10 PM, John Myles White johnmyleswh...@gmail.com wrote:
The idealized expression interface offers things like (up to reordering): with(df, a + b * x) where a and b are variables in the caller's scope and x is a column of df. In practice, we've had to hack this sort of thing together to offer things like with(df, :($a + $b * x)) That's because we need to pass quoted strings and we also need to tell the system which variables are in the caller's scope. More generally, I'd refer to any operation that passes expressions around and asks other functions to evaluate them with an ad hoc scope as expression-based operations. R offers very deep support for this in the language. -- John
Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages
That's exactly the kind of indexing I'd like to encourage using until we get our core functionality cleaned up. Nothing special required except Boolean indexing, which is easy to make fast and doesn't have weird scoping issues. -- John On Jan 22, 2014, at 3:18 PM, Kevin Squire kevin.squ...@gmail.com wrote: Got it. I was thinking of the more verbose (but still useful) df[(df[colA] 4) !isna(df[colB]), :] Kevin On Wed, Jan 22, 2014 at 3:10 PM, John Myles White johnmyleswh...@gmail.com wrote: The idealized expression interface offers things like (up to reordering): with(df, a + b * x) where a and b are variables in the caller's scope and x is a column of df. In practice, we've had to hack this sort of thing together to offer things like with(df, :($a + $b * x)) That's because we need to pass quoted strings and we also need to tell the system which variables are in the caller's cope. More generally, I'd refer to any operation that passes expressions around and asks other functions to evaluate them with an ad hoc scope as expression-based operations. R offers very deep support for this in the language. -- John On Jan 22, 2014, at 2:48 PM, Kevin Squire kevin.squ...@gmail.com wrote: Maybe I misinterpreted the term expression-based interface. On Wed, Jan 22, 2014 at 2:33 PM, John Myles White johnmyleswh...@gmail.com wrote: My impression is that Pandas didn't support anything like delayed evaluation. Is that wrong? I'm aware that the resulting expressions are a lot more verbose. That definitely sucks. I'd love to see strong proposals for how we're going to do a better job of making code shorter going forward. But too much of our current codebase is buggy, unable to handle edge cases, slow and undocumented. I think it's much more important that we have one way of doing things that actually works as advertised for every Julia user than two ways of doing things, each of which is slightly broken and performs worse than R and Pandas. 
As I've been saying lately, I'm burning out on maintaing so much Julia code. If someone else wants to take charge of my projects, I'm ok with that. But if I'm going to be doing the work going forward, I need to devote my energies to making a small number of things work really well. Once we get our core functionality solid, I'll be comfortable getting fancier stuff working again. -- John On Jan 22, 2014, at 1:06 PM, Kevin Squire kevin.squ...@gmail.com wrote: I'm also a fan of the expression-based interface (mostly because I'm used to similar things in Pandas). I haven't looked at that code, though, so I can't comment on the complexity. Kevin On Wed, Jan 22, 2014 at 11:18 AM, Blake Johnson blakejohnso...@gmail.com wrote: Sure, but the resulting expression is much more verbose. I just noticed that all expression-based indexing was on the chopping block. What is left after all this? I can see how axing these features would make DataFrames.jl easier to maintain, but I found the expression stuff to present a rather nice interface. --Blake On Tuesday, January 21, 2014 11:51:03 AM UTC-5, John Myles White wrote: Can you do something like df[“ColA”] = f(df)? — John On Jan 21, 2014, at 8:48 AM, Blake Johnson blakejo...@gmail.com wrote: I use within! pretty frequently. What should I be using instead if that is on the chopping block? --Blake On Tuesday, January 21, 2014 7:42:39 AM UTC-5, tshort wrote: I also agree with your approach, John. Based on your criteria, here are some other things to consider for the chopping block. - expression-based indexing - NamedArray (you already have an issue on this) - with, within, based_on and variants - @transform, @DataFrame - select, filter - DataStream Many of these were attempts to ease syntax via delayed evaluation. We can either do without or try to implement something like LINQ. 
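A minimal sketch of the Boolean-indexing style discussed in this message, written against present-day Base Julia rather than the 2014 DataFrames API. The NamedTuple `df`, the `threshold` variable and the `>` comparison are assumptions for illustration (the operators in the quoted df[(df[colA] 4) !isna(df[colB]), :] did not survive in the archive), and `missing`/`ismissing` stand in for NA/`isna`.

```julia
# Hypothetical stand-in for a data frame: a NamedTuple of equal-length columns.
# `missing` plays the role that NA played in DataArrays.
df = (colA = [1, 5, 7, 3, 9],
      colB = [0.5, missing, 2.0, 1.5, missing])

# A plain variable from the caller's scope; no quoting or splicing is needed
# because the mask is computed by ordinary code, not by a passed expression.
threshold = 4

# Build a Boolean mask with broadcasting: colA above the threshold AND colB present.
mask = (df.colA .> threshold) .& .!ismissing.(df.colB)

# Apply the same mask to every column to select the matching rows.
subset = map(col -> col[mask], df)

println(subset.colA)
println(subset.colB)
```

Because the mask is an ordinary array computed in the caller's scope, there is nothing to resolve later, which is the "no weird scoping issues" property mentioned above.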
Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages
I also agree with your approach, John. Based on your criteria, here are some other things to consider for the chopping block.
- expression-based indexing
- NamedArray (you already have an issue on this)
- with, within, based_on and variants
- @transform, @DataFrame
- select, filter
- DataStream
Many of these were attempts to ease syntax via delayed evaluation. We can either do without or try to implement something like LINQ.
On Mon, Jan 20, 2014 at 7:02 PM, Kevin Squire kevin.squ...@gmail.com wrote: Hi John, I agree with pretty much everything you have written here, and really appreciate that you've taken the lead in cleaning things up and getting us on track. Cheers! Kevin
Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages
I agree with everything on this list, including my always neglected DataStreams project. I think it would be nice to get rid of expression-based indexing + select and focus on getting something like LINQ working. For another interesting perspective, check out the newly created query function in Pandas, which takes in strings rather than expressions as inputs. -- John
On Jan 21, 2014, at 4:42 AM, Tom Short tshort.rli...@gmail.com wrote: I also agree with your approach, John. Based on your criteria, here are some other things to consider for the chopping block.
- expression-based indexing
- NamedArray (you already have an issue on this)
- with, within, based_on and variants
- @transform, @DataFrame
- select, filter
- DataStream
Many of these were attempts to ease syntax via delayed evaluation. We can either do without or try to implement something like LINQ.
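To make the string-versus-expression trade-off concrete, here is a rough sketch in present-day Base Julia of evaluating a query given as a string against a table, with the caller's variables passed in explicitly. `substitute` and `with_columns` are hypothetical helpers invented for this illustration; they are not part of DataFrames or Pandas, and the use of `eval` (which runs at global scope) makes this a toy rather than a recommended design.

```julia
# Replace any symbol that names a column or a caller-supplied variable
# with its value; leave operators and literals untouched.
substitute(ex, bindings) = ex
substitute(s::Symbol, bindings) = get(bindings, s, s)
substitute(ex::Expr, bindings) =
    Expr(ex.head, (substitute(arg, bindings) for arg in ex.args)...)

# Evaluate `ex` with columns of `table` and the keyword arguments in `scope`
# as the only visible names (the "ad hoc scope" described in the thread).
function with_columns(table, ex; scope...)
    bindings = Dict{Symbol,Any}(pairs(table))
    merge!(bindings, Dict{Symbol,Any}(scope))
    return eval(substitute(ex, bindings))
end

table = (x = [1.0, 2.0, 3.0],)   # hypothetical one-column table

# String input, parsed into an expression first.
from_string = with_columns(table, Meta.parse("a .+ b .* x"); a = 1.0, b = 2.0)

# Quoted-expression input, no parsing step.
from_expr = with_columns(table, :(a .+ b .* x); a = 1.0, b = 2.0)

println(from_string)
println(from_expr)
```

Either input form ends up as the same expression tree; the design question raised in the thread is who is responsible for saying which names come from the table and which from the caller, which this sketch sidesteps by requiring the caller to pass `a` and `b` explicitly.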
Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages
I use within! pretty frequently. What should I be using instead if that is on the chopping block? --Blake
On Tuesday, January 21, 2014 7:42:39 AM UTC-5, tshort wrote: I also agree with your approach, John. Based on your criteria, here are some other things to consider for the chopping block.
- expression-based indexing
- NamedArray (you already have an issue on this)
- with, within, based_on and variants
- @transform, @DataFrame
- select, filter
- DataStream
Many of these were attempts to ease syntax via delayed evaluation. We can either do without or try to implement something like LINQ.
Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages
Can you do something like df["ColA"] = f(df)? -- John
On Jan 21, 2014, at 8:48 AM, Blake Johnson blakejohnso...@gmail.com wrote: I use within! pretty frequently. What should I be using instead if that is on the chopping block? --Blake
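The suggestion above, replacing a within!-style update with plain assignment of a function's result, might look like the following sketch. It uses a Dict of vectors as a hypothetical stand-in for a data frame, and the body of `f` and the column names "x" and "y" are made up for illustration.

```julia
# Hypothetical stand-in for a data frame: column name => column vector.
df = Dict{String,Vector{Float64}}(
    "x" => [1.0, 2.0, 3.0],
    "y" => [10.0, 20.0, 30.0],
)

# An ordinary function of the whole table. `a` and `b` are regular keyword
# arguments, so nothing has to be quoted or evaluated in a special scope.
f(df; a = 1.0, b = 2.0) = a .+ b .* df["x"] .+ df["y"]

# Plain assignment creates (or overwrites) the column.
df["ColA"] = f(df)

println(df["ColA"])
```

This is more verbose than an expression-based form, as noted elsewhere in the thread, but every name in it is resolved by the usual language rules.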
[julia-users] New Year's resolutions for DataArrays, DataFrames and other packages
As I said in another thread recently, I am currently the lead maintainer of more packages than I can keep up with. I think it's been useful for me to start so many different projects, but I can't keep maintaining most of my packages given my current work schedule. Without Simon Kornblith, Kevin Squire, Sean Garborg and several others doing amazing work to keep DataArrays and DataFrames going, much of our basic data infrastructure would have already become completely unusable. But even with the great work that's been done on those packages recently, there's still a lot of additional design work required. I'd like to free up some of my time to do that work. To keep things moving forward, I'd like to propose a couple of radical New Year's resolutions for the packages I work on.

(1) We need to stop adding functionality and focus entirely on improving the quality and documentation of our existing functionality. We have way too much prototype code in DataFrames that I can't keep up with. I'm about to make a pull request for DataFrames that will remove everything related to column groupings, database-style indexing and Blocks.jl support. I absolutely want to see us push all of those ideas forward in the future, but they need to happen in unmerged forks or separate packages until we have the resources needed to support them. Right now, they make an overwhelming maintenance challenge even more onerous.

(2) We can't support anything other than the master branch of most JuliaStats packages except possibly for Distributions. I personally don't have the time to simultaneously keep stuff working with Julia 0.2 and Julia 0.3. Moreover, many of our basic packages aren't mature enough to justify supporting older versions. We should do a better job of supporting our master releases and not invest precious time trying to support older releases.

(3) We need to make more of DataArrays and DataFrames reflect the Julian worldview. Lots of our code uses an interface that is incongruous with the interfaces found in Base. Even worse, a large chunk of code has type-stability problems that make it very slow, when comparable code that uses normal Arrays is 100x faster. We need to develop new idioms and new strategies for making code that interacts with type-destabilizing NAs faster. More generally, we need to make DataArrays and DataFrames fit in better with Julia when Julia and R disagree. Following R's lead has often led us astray because R doesn't share Julia's strengths or weaknesses.

(4) Going forward, there should be exactly one way to do most things. The worst part of our current codebase is that there are multiple ways to express the same computation, but (a) some of them are unusably slow and (b) some of them don't ever get tested or maintained properly. This is closely linked to the excess proliferation of functionality described in Resolution 1 above. We need to start removing stuff from our packages and making the parts we keep both reliable and fast.

I think we can push DataArrays and DataFrames to 1.0 status by the end of this year. But I think we need to adopt a new approach if we're going to get there. Lots of stuff needs to get deprecated and what remains needs a lot more testing, benchmarking and documentation. -- John
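As a rough illustration of the type-stability concern in Resolution 3, the sketch below contrasts two ways of storing a column that may contain missing values. It is written for present-day Base Julia, where `missing` stands in for the NA of DataArrays, and `sum_skipping_missing` is a hypothetical helper; the behavior of the 2014 compiler and the 100x figure quoted above are not reproduced or measured here.

```julia
# Loosely typed column: with element type Any, each element is stored as a
# boxed value and every `+` has to be dispatched at run time.
loose = Any[1.0, 2.0, missing, 4.0]

# Tightly typed column: a small Union element type lets the compiler
# specialize the loop body for the two possible cases.
tight = Union{Missing,Float64}[1.0, 2.0, missing, 4.0]

# One possible idiom: handle the missing case explicitly and accumulate
# into a variable whose type never changes (always Float64).
function sum_skipping_missing(xs)
    total = 0.0
    for x in xs
        if !ismissing(x)
            total += x
        end
    end
    return total
end

println(eltype(loose), " => ", sum_skipping_missing(loose))
println(eltype(tight), " => ", sum_skipping_missing(tight))
```

Both calls give the same answer; the difference is in what the compiler can infer. In a REPL, `@code_warntype sum_skipping_missing(loose)` is one way to inspect where inference falls back to `Any`.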