Re: [Haskell-cafe] Mathematics and Statistics libraries
Hey all,

There are actually a number of issues that come up in building an effective data-frame-like library for Haskell, and for data vis as well (I have strong personal opinions on both, and I'm exploring and experimenting with them this spring). While folks have touched on a bunch of these, I thought I'd add my own opinions to the mix.

First of all: any good data manipulation (i.e. data-frame-like) library needs support for efficiently querying subsets of the data in various ways. Not just that, it really should provide a coherent way of dealing with out-of-core data! From there you might ask: do I want to iterate through chunks of the data, or do I want to allow more general patterns of data access, and perhaps even ways to parallelize?

The basic thing (as others have remarked after this draft email got underway) is that you essentially want to support some SQL-like selection operations, and have them be efficient too, along with playing nicely with columns of differing types.

What sort of abstractions you provide is crucial, because that in turn affects how you can write algorithms! If you look closely, this is tantamount to saying that any sufficiently well designed (industrial grade) data frame lib for Haskell might wind up leading into a model for supporting mapreduce or graphlab (http://graphlab.org/) style algorithms in the multicore / not distributed regime, though a first version would pragmatically just provide an interface with sequentially chunked data and use pipes-core or one of the other enumerator libraries.

There's also some need for the aforementioned fancy types for managing data, but that's not even the real challenge (in my opinion).

Probably the best lib to take ideas from is the Python pandas library, or at least that's my personal opinion.

Now in the space of data vis, probably the best example of a good library in terms of ease of getting informative (and pretty) output is ggplot2 (in R).
Now if you look there, you'll see that it's VERY much integrated with the model fitting and data analysis functionality of R, and has a very compositional approach which could easily be ported fairly directly over to Haskell.

However, as with a good data-frame-like, certain obstacles come up. If we insist on a type-safe way of doing things while being at least as high level as R or Python, the absence of row types for frame column names makes it a bit tricky to specify linear models that are statically well formed (as in, only referencing column names that are actually in the underlying data frame). While there are approaches that work some of the time, there's not really a good general-purpose way (as far as I can tell) to solve that small problem of resolving names as early as possible. Or at the very least I don't see a simple approach that I'm happy with.

These can be summarized, I think, as follows:

- Any practical data frame lib needs to interact well with out-of-core data, and ideally also simplify the task of writing algorithms on top in a way that gives out-of-core goodness for free. There are a lot of different ways this could be done under the covers, perhaps using one of the libraries like reducers, enumerator or pipes-core, but it really should be invisible to the client algorithm's author, or at least invisible by default. Moreover, I think any attack in that direction is essentially a precursor to sorting out mapreduce- and graphlab-like tools for Haskell.

- Any really nice high-level data vis tool needs some data analysis / machine learning style library to work with, and this is probably best understood by looking at things already out there, such as ggplot2 in R.

That said, I'm all ears for other folks' takes on this, especially since I'm spending some time this spring experimenting in both these directions.
cheers
-Carter

On Sun, Mar 25, 2012 at 9:54 AM, Aleksey Khudyakov alexey.sklad...@gmail.com wrote:
> On 25.03.2012 14:52, Tom Doris wrote:
>> Hi Heinrich,
>> If we compare the GHCi experience with R or IPython, leaving aside any GUIs, the help system they have at the REPL level is just a lot more intuitive and easy to use, and you get access to the full manual entries. For example, compare what you see if you type :info sort into GHCi versus ?sort in R. R gives you a view of the full docs for the function, whereas in GHCi you just get the type signature.
>
> Integrating haddock documentation into GHCi would be really helpful, but it's a GSoC project on its own.
>
> For me the most important difference between R's REPL and GHCi is that :reload wipes all local bindings. Effectively it forces you to write everything in a file and to avoid doing anything which can't be fitted into a one-liner. It may not be bad, but it's definitely a different style.
>
> And of course data visualization.
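A minimal sketch of the "sequentially chunked" first version Carter describes, under loud assumptions: a lazy list of in-memory chunks stands in for pipes-core or an enumerator library, and the names Summary, addChunk, summarise and meanOf are hypothetical, not taken from any package mentioned in the thread.

```haskell
import Data.List (foldl')

-- Running summary that is updated one chunk at a time (hypothetical type).
data Summary = Summary !Int !Double

emptySummary :: Summary
emptySummary = Summary 0 0

-- Fold one in-memory chunk into the running summary.
addChunk :: Summary -> [Double] -> Summary
addChunk = foldl' step
  where
    step (Summary n s) x = Summary (n + 1) (s + x)

-- Consume the chunks in order. If the outer list is produced lazily,
-- only the chunk currently being folded has to be resident.
summarise :: [[Double]] -> Summary
summarise = foldl' addChunk emptySummary

-- The mean is undefined for empty input, hence Maybe.
meanOf :: Summary -> Maybe Double
meanOf (Summary 0 _) = Nothing
meanOf (Summary n s) = Just (s / fromIntegral n)
```

The algorithm author only writes addChunk-style steps; where the chunks come from (disk, network, memory) stays hidden behind the outer list, which is the "invisible by default" property asked for above.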
Re: [Haskell-cafe] Mathematics and Statistics libraries
> There is the plot[1] library, which provides updateable plots from the GHCi REPL and has a gnuplot-like interface. I wrote it for this very reason: a mathematics/statistics development environment. It uses Data.Vector.Storable, which provides compatibility with both the statistics and hmatrix packages (as well as hstatistics).

Looks very interesting. I'll try it out.

>> I think talking about data frames is a bit pointless unless we specify what a data frame is. Basically there are two representations of a tabular data structure: array of tuples or tuple of arrays. If you want the first, go for Data.Vector.Vector YourData. If you want the second, you'll probably end up with some HList-like data structure to hold the arrays.
>
> Matrices from hmatrix are easily converted to rows or columns of Data.Vector.Storable and can be sliced and otherwise manipulated.

That's why I said that a homogeneous data frame is simple. But if you want columns which hold values of different types, they are no longer a matrix and things become way more interesting.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Mathematics and Statistics libraries
Date: Sun, 25 Mar 2012 17:54:11 +0400
From: Aleksey Khudyakov alexey.sklad...@gmail.com
Subject: Re: [Haskell-cafe] Mathematics and Statistics libraries
To: haskell-cafe@haskell.org
Message-ID: 4f6f2383.6070...@gmail.com

> On 25.03.2012 14:52, Tom Doris wrote:
>> Hi Heinrich,
>
> And of course data visualization. The only library I know of is Chart[1], but I don't like its API much.

There is the plot[1] library, which provides updateable plots from the GHCi REPL and has a gnuplot-like interface. I wrote it for this very reason: a mathematics/statistics development environment. It uses Data.Vector.Storable, which provides compatibility with both the statistics and hmatrix packages (as well as hstatistics).

> I think talking about data frames is a bit pointless unless we specify what a data frame is. Basically there are two representations of a tabular data structure: array of tuples or tuple of arrays. If you want the first, go for Data.Vector.Vector YourData. If you want the second, you'll probably end up with some HList-like data structure to hold the arrays.

Matrices from hmatrix are easily converted to rows or columns of Data.Vector.Storable and can be sliced and otherwise manipulated.

[1] http://hackage.haskell.org/package/plot

Vivian
Re: [Haskell-cafe] Mathematics and Statistics libraries
Tom Doris tomdo...@gmail.com writes:

> If you're interested in UI work, ideally we'd have something similar to RStudio as an environment: a simple set of windows encapsulating an editor, a REPL, a plotting panel and help/history. This sounds superficial, but it really has an impact when you're exploring a data set and trying stuff out.

I agree, this sounds really nice.

> I really disagree that we need a data frame type structure; they're an abomination in R: they try to accommodate both event records and time series, and do neither well.

Just to clarify (since I think the original suggestion was mine), I don't want to copy R's data frame (which I never quite understood, anyway), but I'd like some standardized data structure, ideally with an option to label columns, and functions to slice and join. The underlying structure can just be a list of columns (Vector) or whatever.

-k
--
If I haven't seen further, it is by standing in the footprints of giants
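A rough sketch of the structure Ketil asks for: labelled columns plus functions to slice and join. Everything here is an assumption made for illustration: the names Frame, select, sliceRows and joinColumns are hypothetical, columns are restricted to Double, and plain lists stand in for Vector so that only base is needed.

```haskell
-- A frame as an association list of labelled columns (hypothetical type).
newtype Frame = Frame [(String, [Double])]
  deriving (Eq, Show)

-- Keep only the named columns, in the order requested.
-- Names that are not present in the frame are silently dropped.
select :: [String] -> Frame -> Frame
select names (Frame cols) =
  Frame [ (n, c) | n <- names, Just c <- [lookup n cols] ]

-- Keep `len` rows of every column, starting at row `from` (0-based).
sliceRows :: Int -> Int -> Frame -> Frame
sliceRows from len (Frame cols) =
  Frame [ (n, take len (drop from c)) | (n, c) <- cols ]

-- Join two frames side by side; on a name clash the left column wins.
joinColumns :: Frame -> Frame -> Frame
joinColumns (Frame a) (Frame b) =
  Frame (a ++ [ col | col@(n, _) <- b, n `notElem` map fst a ])
```

Nothing here checks that columns have equal length or that a requested name exists; a real library would have to decide whether such errors are reported at run time or ruled out by types, which is the design question the rest of the thread circles around.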
Re: [Haskell-cafe] Mathematics and Statistics libraries
On 26/03/2012, at 8:35 PM, Ketil Malde wrote:
> Just to clarify (since I think the original suggestion was mine), I don't want to copy R's data frame (which I never quite understood, anyway)

A data.frame is
- a record of vectors all the same length
- which can be sliced and diced like a 2d matrix

It's not unlike an SQL table (think of a column-oriented database, so a table is really a collection of named columns, but it _looks_ like a collection of rows).
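The description above (named columns of equal length, addressed like a 2d matrix) can be sketched as follows. This is an illustration only: DataFrame, Column, Cell, mkFrame, cell and nth are hypothetical names, only two column types are modelled, and lists stand in for vectors.

```haskell
-- A column is homogeneous, but different columns may have different types.
data Column = NumCol [Double] | TextCol [String]
  deriving (Eq, Show)

-- A single value read out of a frame.
data Cell = NumCell Double | TextCell String
  deriving (Eq, Show)

newtype DataFrame = DataFrame [(String, Column)]
  deriving (Eq, Show)

colLength :: Column -> Int
colLength (NumCol xs)  = length xs
colLength (TextCol xs) = length xs

-- Smart constructor: reject input whose columns differ in length.
mkFrame :: [(String, Column)] -> Maybe DataFrame
mkFrame cols
  | allSame (map (colLength . snd) cols) = Just (DataFrame cols)
  | otherwise                            = Nothing
  where
    allSame []       = True
    allSame (n : ns) = all (== n) ns

-- Matrix-style access: row index (0-based), then column name.
cell :: Int -> String -> DataFrame -> Maybe Cell
cell i name (DataFrame cols) =
  case lookup name cols of
    Just (NumCol xs)  -> NumCell <$> nth i xs
    Just (TextCol xs) -> TextCell <$> nth i xs
    Nothing           -> Nothing

-- Safe list indexing.
nth :: Int -> [a] -> Maybe a
nth k xs
  | k < 0     = Nothing
  | otherwise = case drop k xs of
      (x : _) -> Just x
      []      -> Nothing
```

Storage is column-oriented, while cell makes the frame look row-addressable, which mirrors the SQL-table analogy. The price of mixed column types is that every read returns a tagged Cell, checked at run time rather than statically.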
Re: [Haskell-cafe] Mathematics and Statistics libraries
Tom Doris wrote:
> If you're interested in UI work, ideally we'd have something similar to RStudio as an environment: a simple set of windows encapsulating an editor, a REPL, a plotting panel and help/history. This sounds superficial, but it really has an impact when you're exploring a data set and trying stuff out.

Concerning UI, the following project suggestion aims to give GHCi a web GUI:

http://hackage.haskell.org/trac/summer-of-code/ticket/1609

But one of your criteria is that a good UI should come with a help system, too, right?

Best regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com
Re: [Haskell-cafe] Mathematics and Statistics libraries
Hi Heinrich,

If we compare the GHCi experience with R or IPython, leaving aside any GUIs, the help system they have at the REPL level is just a lot more intuitive and easy to use, and you get access to the full manual entries. For example, compare what you see if you type :info sort into GHCi versus ?sort in R. R gives you a view of the full docs for the function, whereas in GHCi you just get the type signature. I usually :def a command to call out to :!hoogle --info %, which gives what you'd expect :info to give.

So, as is usually the case, there's a solution in Haskell that matches the features in other systems, but it's not the default and you have to invest effort getting it set up right. This is fine for Haskell devs who do some stats work, but it represents an offputtingly steep learning curve for quants who are willing to learn a little Haskell but expect (reasonably) some basic stuff like inline help to Just Work.

Tom

On 25 March 2012 08:26, Heinrich Apfelmus apfel...@quantentunnel.de wrote:
> Tom Doris wrote:
>> If you're interested in UI work, ideally we'd have something similar to RStudio as an environment: a simple set of windows encapsulating an editor, a REPL, a plotting panel and help/history. This sounds superficial, but it really has an impact when you're exploring a data set and trying stuff out.
>
> Concerning UI, the following project suggestion aims to give GHCi a web GUI:
>
> http://hackage.haskell.org/trac/summer-of-code/ticket/1609
>
> But one of your criteria is that a good UI should come with a help system, too, right?
>
> Best regards,
> Heinrich Apfelmus
>
> --
> http://apfelmus.nfshost.com
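For reference, the kind of GHCi macro Tom mentions might look like the line below, placed in ~/.ghci. This is a sketch rather than Tom's actual setup: the command name :doc is made up here, and it assumes a hoogle executable with a generated database is on the PATH.

```haskell
-- ~/.ghci
-- Define ":doc NAME", which shells out to hoogle for the documentation of NAME.
:def doc \name -> return (":!hoogle --info \"" ++ name ++ "\"")
```

After restarting GHCi, typing `:doc sort` would run `hoogle --info "sort"` in a shell and print whatever hoogle returns for that name.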
Re: [Haskell-cafe] Mathematics and Statistics libraries
On 25.03.2012 14:52, Tom Doris wrote:
> Hi Heinrich,
> If we compare the GHCi experience with R or IPython, leaving aside any GUIs, the help system they have at the REPL level is just a lot more intuitive and easy to use, and you get access to the full manual entries. For example, compare what you see if you type :info sort into GHCi versus ?sort in R. R gives you a view of the full docs for the function, whereas in GHCi you just get the type signature.

Integrating haddock documentation into GHCi would be really helpful, but it's a GSoC project on its own.

For me the most important difference between R's REPL and GHCi is that :reload wipes all local bindings. Effectively it forces you to write everything in a file and to avoid doing anything which can't be fitted into a one-liner. It may not be bad, but it's definitely a different style.

And of course data visualization. The only library I know of is Chart[1], but I don't like its API much.

I think talking about data frames is a bit pointless unless we specify what a data frame is. Basically there are two representations of a tabular data structure: array of tuples or tuple of arrays. If you want the first, go for Data.Vector.Vector YourData. If you want the second, you'll probably end up with some HList-like data structure to hold the arrays.

[1] http://hackage.haskell.org/package/Chart
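The two representations named above can be put side by side in a small sketch. The record types and field names (Obs, ObsColumns, obsTime, obsPrice) are invented for illustration, and plain lists stand in for Data.Vector so that the example needs nothing beyond base.

```haskell
-- "Array of tuples": one record per row.
data Obs = Obs { obsTime :: Double, obsPrice :: Double }
  deriving (Eq, Show)

-- "Tuple of arrays": one sequence per field.
data ObsColumns = ObsColumns { times :: [Double], prices :: [Double] }
  deriving (Eq, Show)

-- Row-oriented to column-oriented.
toColumns :: [Obs] -> ObsColumns
toColumns rows = ObsColumns (map obsTime rows) (map obsPrice rows)

-- Column-oriented back to row-oriented.
-- zipWith truncates to the shorter column, so ragged input loses rows.
toRows :: ObsColumns -> [Obs]
toRows (ObsColumns ts ps) = zipWith Obs ts ps
```

With a fixed record type the conversion is trivial in both directions. The HList-like machinery only becomes necessary when the set of columns is not known when the record type is written, which is the general data frame case.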
Re: [Haskell-cafe] Mathematics and Statistics libraries
If the goal is to help Haskell be a more acceptable choice for general statistical analysis tasks, then hmatrix, statistics, and the various GSL wrappers already provide the majority of the functionality needed. I think the bigger problem is that there is no guidance on which libraries are industrial strength, there's no glue layer making it easier to use the APIs you'd want to, and GHCi isn't always ideal as a REPL for this workflow.

If you're interested in UI work, ideally we'd have something similar to RStudio as an environment: a simple set of windows encapsulating an editor, a REPL, a plotting panel and help/history. This sounds superficial, but it really has an impact when you're exploring a data set and trying stuff out.

However, it would be a bigger contribution to get us to the point where we are able to just import Quant.Prelude to bring into scope all the standard functionality assumed in an environment like R or Matlab. In my experience most of this can come from re-exporting existing libraries while occasionally wrapping functions to simplify the interfaces and make them more consistent (e.g., a quant doesn't particularly need to know why Statistics.Sample.KernelDensity.kde uses unboxed vectors when the rest of that lib uses Generic, and they certainly won't want to spend their time remembering that they need to convert to call that function).

As an exercise, in GHCi, try loading a few arbitrary CSV files of tables including floating point columns, do a linear regression of one such column on another, and then display a scatterplot with the regression line; maybe throw in a check for the normality of the residuals. Assume you'll need to be able to handle large data sets, so you need to use bytestring, attoparsec etc.; beware that there's a known bug that will cause a segfault/bus error if you use some hmatrix/GSL functions from GHCi on x86_64, which is kind of a blocker in itself.
Maybe I missed something obvious, but it took me a looong time to figure out which containers, persistence + parsing, stats and plotting packages I should choose.

I really disagree that we need a data frame type structure; they're an abomination in R: they try to accommodate both event records and time series, and do neither well. Haskell records are fine for inhomogeneous event series, and for homogeneous time series parallel Vectors or Matrices are better, as they can be passed to BLAS and LAPACK with consequent performance and clarity advantages - column-oriented storage rocks, and Haskell is already a good fit.

Having used C++, Matlab and R (the latter for quite a while), I now use Haskell for all of my statistical analysis work. Despite the many shortcomings, it's definitely worth it for the code clarity and type checking, to say nothing of the pre-optimization performance and robustness.

Best of luck, happy to share some preliminary code with you directly if you're interested!

Tom

On 21 March 2012 17:24, Ben Jones ben.jamin.pw...@gmail.com wrote:
> I am a student currently interested in participating in Google Summer of Code. I have a strong interest in Haskell, and a semester's worth of coding experience in the language. I am a mathematics and CS double major with only a semester left and I am looking for information regarding what the community is lacking as far as mathematics and statistics libraries are concerned. If there is enough interest I would like to put together a project with this. I understand that such libraries are probably low priority, but if anyone has anything I would love to hear it.
>
> Thanks for reading,
> -Benjamin
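To make the regression step of Tom's exercise concrete, here is a small sketch of an ordinary least squares fit of one column on another, plus the residuals one would feed into a normality check. It deliberately uses only base and plain lists; fitLine and residuals are hypothetical names, not functions from hmatrix or statistics, and CSV loading and plotting are left out.

```haskell
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Ordinary least squares fit of y = intercept + slope * x.
-- Returns Nothing if the columns differ in length, hold fewer than
-- two points, or x has zero variance (the slope is then undefined).
fitLine :: [Double] -> [Double] -> Maybe (Double, Double)
fitLine xs ys
  | length xs /= length ys || length xs < 2 || sxx == 0 = Nothing
  | otherwise = Just (intercept, slope)
  where
    mx        = mean xs
    my        = mean ys
    sxx       = sum [ (x - mx) ^ (2 :: Int) | x <- xs ]
    sxy       = sum [ (x - mx) * (y - my) | (x, y) <- zip xs ys ]
    slope     = sxy / sxx
    intercept = my - slope * mx

-- Observed minus fitted values for a given (intercept, slope).
residuals :: (Double, Double) -> [Double] -> [Double] -> [Double]
residuals (a, b) xs ys = [ y - (a + b * x) | (x, y) <- zip xs ys ]
```

For example, `fitLine [0, 1, 2, 3] [1, 3, 5, 7]` recovers the line y = 1 + 2x. A version meant for large data sets would work on unboxed vectors and accumulate the sums in a single pass instead of traversing each list several times.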
Re: [Haskell-cafe] Mathematics and Statistics libraries
On 3/21/12 3:00 PM, Ryan Newton wrote:
> I think such libraries are high priority! My own experience with them is not deep, but I'll echo what I think is a common observation:
>
> * Matrix libraries are good
> * Statistics libs need more work

I would also be very excited about a solid statistics proposal. The ticket Aleksey links to is a good start (as is the experience report linked from there), although I think that it would be possible to implement a core library with less type-trickery than he supposes. Such an interface wouldn't necessarily be perfectly statically safe, but other, trickier interfaces could be built on top of it (just as we have fancier type-level interfaces with statically checked dimensions on top of lower-level matrix libs, etc.).

I envision a set of tools that let users get up and running with loading a dump of data and calculating a set of metrics on it with only a few lines. It should be designed such that the basic framework is easily extensible with various other analyses, and such that analyses compose fairly straightforwardly.

Which indeed amounts to some Frame-type structure, and a core set of functions on it :-)

--g
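One low-tech reading of "calculating a set of metrics with only a few lines", with no type-level machinery at all, is sketched below. The Metric type and the report function are hypothetical, and columns are assumed to be plain lists of Double.

```haskell
import Data.List (foldl')

-- A metric is a named summary of one numeric column (hypothetical type).
data Metric = Metric
  { metricName :: String
  , runMetric  :: [Double] -> Double
  }

count, total, average :: Metric
count   = Metric "count" (fromIntegral . length)
total   = Metric "sum" (foldl' (+) 0)
-- Built from the two metrics above; NaN on an empty column.
average = Metric "mean" (\xs -> runMetric total xs / runMetric count xs)

-- Apply every metric to every labelled column.
report :: [Metric] -> [(String, [Double])] -> [(String, [(String, Double)])]
report metrics cols =
  [ (label, [ (metricName m, runMetric m xs) | m <- metrics ])
  | (label, xs) <- cols
  ]
```

New analyses are added by defining more Metric values, and they compose by being put in the same list. This is not statically safe in the sense discussed above (column names are strings), but it is the kind of core on which a more strongly typed interface could later be layered.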
[Haskell-cafe] Mathematics and Statistics libraries
I am a student currently interested in participating in Google Summer of Code. I have a strong interest in Haskell, and a semester's worth of coding experience in the language. I am a mathematics and CS double major with only a semester left and I am looking for information regarding what the community is lacking as far as mathematics and statistics libraries are concerned. If there is enough interest I would like to put together a project with this. I understand that such libraries are probably low priority, but if anyone has anything I would love to hear it.

Thanks for reading,
-Benjamin
Re: [Haskell-cafe] Mathematics and Statistics libraries
I think such libraries are high priority! My own experience with them is not deep, but I'll echo what I think is a common observation:

- Matrix libraries are good
- Statistics libs need more work

And as far as wrappers around machine learning or computer vision libs (OpenCV)... I'm not really sure about the status of those.

On Wed, Mar 21, 2012 at 1:24 PM, Ben Jones ben.jamin.pw...@gmail.com wrote:
> I am a student currently interested in participating in Google Summer of Code. I have a strong interest in Haskell, and a semester's worth of coding experience in the language. I am a mathematics and CS double major with only a semester left and I am looking for information regarding what the community is lacking as far as mathematics and statistics libraries are concerned. If there is enough interest I would like to put together a project with this. I understand that such libraries are probably low priority, but if anyone has anything I would love to hear it.
>
> Thanks for reading,
> -Benjamin
Re: [Haskell-cafe] Mathematics and Statistics libraries
I'd like to see more statistics work, definitely. Bryan's statistics library is excellent, but Ed Kmett has been talking about some very interesting approaches to sampling from complicated distributions, which I'd like to see implemented eventually in a library.

On Wed, Mar 21, 2012 at 1:24 PM, Ben Jones ben.jamin.pw...@gmail.com wrote:
> I am a student currently interested in participating in Google Summer of Code. I have a strong interest in Haskell, and a semester's worth of coding experience in the language. I am a mathematics and CS double major with only a semester left and I am looking for information regarding what the community is lacking as far as mathematics and statistics libraries are concerned. If there is enough interest I would like to put together a project with this. I understand that such libraries are probably low priority, but if anyone has anything I would love to hear it.
>
> Thanks for reading,
> -Benjamin
Re: [Haskell-cafe] Mathematics and Statistics libraries
On 21.03.2012 21:24, Ben Jones wrote:
> I am a student currently interested in participating in Google Summer of Code. I have a strong interest in Haskell, and a semester's worth of coding experience in the language. I am a mathematics and CS double major with only a semester left and I am looking for information regarding what the community is lacking as far as mathematics and statistics libraries are concerned. If there is enough interest I would like to put together a project with this. I understand that such libraries are probably low priority, but if anyone has anything I would love to hear it.

There is an existing statistics-related GSoC project[1]. It proposes implementing an analog of R's data frames. I think it's rather difficult, since there is no obvious design. Also, I think the implementation will require a lot of type trickery.

[1] http://hackage.haskell.org/trac/summer-of-code/ticket/1596