Re: [Haskell-cafe] Mathematics and Statistics libraries

2012-03-29 Thread Carter Tazio Schonwald
Hey All,

There are actually a number of issues that come up in building an effective 
data-frame-like library for Haskell, and in data vis as well (I have strong 
personal opinions on both, and I'm exploring / experimenting with both this 
spring). While folks have touched on a bunch of these, I thought I'd add my 
own opinions to the mix.

First of all: any good data manipulation (i.e. data-frame-like) library needs 
support for efficiently querying subsets of the data in various ways. Not just 
that, it really should provide a coherent way of dealing with out-of-core data! 
From there you might ask: do I want to iterate through chunks of the data, or 
do I want to allow more general patterns of data access, and perhaps even ways 
to parallelize? The basic point (as others have remarked since this draft 
email got underway) is that you essentially want to support some SQL-like 
selection operations, and have them be efficient too, while playing nicely 
with columns of differing types.
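
To make the "SQL-like selection over columns of differing types" idea concrete, here is a minimal sketch using only base. The names Column, Frame, selectCols and whereDouble are made up for illustration and do not come from any existing library. Plain lists stand in for the packed vectors an efficient implementation would use.

```haskell
-- A column is a homogeneous sequence; a frame holds labelled columns
-- of differing types. Lists are used only to keep the sketch base-only.
data Column
  = DoubleCol [Double]
  | TextCol   [String]
  deriving (Show, Eq)

newtype Frame = Frame [(String, Column)]
  deriving (Show, Eq)

-- Keep the entries of a column whose mask position is True.
maskColumn :: [Bool] -> Column -> Column
maskColumn mask (DoubleCol xs) = DoubleCol [x | (x, True) <- zip xs mask]
maskColumn mask (TextCol xs)   = TextCol   [x | (x, True) <- zip xs mask]

-- SELECT <names>: keep the named columns, in the requested order.
-- Nothing if any name is missing.
selectCols :: [String] -> Frame -> Maybe Frame
selectCols names (Frame cols) = Frame <$> mapM pick names
  where
    pick n = (,) n <$> lookup n cols

-- WHERE <name> satisfies p: keep the rows for which a Double column passes.
-- Nothing if the column is missing or is not a Double column.
whereDouble :: String -> (Double -> Bool) -> Frame -> Maybe Frame
whereDouble name p (Frame cols) =
  case lookup name cols of
    Just (DoubleCol xs) ->
      let mask = map p xs
      in Just (Frame [(n, maskColumn mask c) | (n, c) <- cols])
    _ -> Nothing
```

A query then chains in the Maybe monad, e.g. `whereDouble "x" (> 2) frame >>= selectCols ["name"]`.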

What sort of abstractions you provide is crucial, because that in turn 
affects how you can write algorithms! If you look closely, this is tantamount 
to saying that any sufficiently well designed (industrial grade) data frame 
lib for Haskell might wind up leading into a model for supporting MapReduce or 
GraphLab (http://graphlab.org/) style algorithms in the multicore / not 
distributed regime, though a first version would pragmatically just provide an 
interface to sequentially chunked data and use pipes-core or one of the other 
enumerator libraries. There's also some need for the aforementioned fancy 
types for managing data, but that's not even the real challenge (in my 
opinion). Probably the best lib to take ideas from, in my opinion, is the 
Python pandas library.
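
As a rough sketch of the "sequentially chunked data" interface and why it leads towards MapReduce-style algorithms: each chunk is summarised on its own, and the partial summaries are merged with a Monoid. A list of lists stands in for chunks that a streaming library would read from disk. Summary, summarise and summariseChunks are hypothetical names for illustration, not an existing API.

```haskell
import Data.List (foldl')

-- Partial result for one chunk: how many values were seen, and their sum.
data Summary = Summary
  { sCount :: !Int
  , sSum   :: !Double
  } deriving (Show, Eq)

-- Merging two partial results is associative, so chunks can be combined
-- in any grouping (sequentially here, in parallel in principle).
instance Semigroup Summary where
  Summary n1 s1 <> Summary n2 s2 = Summary (n1 + n2) (s1 + s2)

instance Monoid Summary where
  mempty = Summary 0 0

-- "Map" step: summarise one in-memory chunk with a strict left fold.
summarise :: [Double] -> Summary
summarise = foldl' step mempty
  where
    step (Summary n s) x = Summary (n + 1) (s + x)

-- "Reduce" step: merge the per-chunk summaries.
summariseChunks :: [[Double]] -> Summary
summariseChunks = foldMap summarise

-- The mean is only defined once at least one value has been seen.
meanOf :: Summary -> Maybe Double
meanOf (Summary 0 _) = Nothing
meanOf (Summary n s) = Just (s / fromIntegral n)
```

The client algorithm only ever sees one chunk at a time, which is the property that makes the out-of-core case invisible to it.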

Now in the space of data vis, probably the best example of a good library in 
terms of ease of getting informative (and pretty) output is ggplot2 (in R). 
If you look there, you'll see that it's VERY much integrated with the model 
fitting and data analysis functionality of R, and has a very compositional 
approach which could be ported pretty directly over to Haskell.

However, as with a good data-frame-like library, certain obstacles come up if 
we insist on a type-safe way of doing things while being at least as high 
level as R or Python. The absence of row types for frame column names makes 
specifying linear models that are statically well formed (as in only 
referencing column names that are actually in the underlying data frame) a bit 
tricky. While there are approaches that work some of the time, there's not 
really a good general-purpose way (as far as I can tell) to solve that small 
problem of resolving names as early as possible. Or at the very least I don't 
see a simple approach that I'm happy with.
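
For illustration only, here is one of the approaches that works "some of the time": a run-time check that resolves every column name in a model specification against the frame's header once, before any fitting starts. This is not the static guarantee that row types would give. ModelSpec, ResolvedModel and resolveModel are hypothetical names for this sketch.

```haskell
import Data.List (elemIndex)

-- A linear model given by column names, in the spirit of R's y ~ x1 + x2.
data ModelSpec = ModelSpec
  { response   :: String
  , predictors :: [String]
  } deriving (Show, Eq)

-- The same model with every name replaced by a column index.
data ResolvedModel = ResolvedModel
  { responseIx   :: Int
  , predictorIxs :: [Int]
  } deriving (Show, Eq)

-- Resolve all names against the header up front. The first unknown name
-- (response first, then predictors in order) is reported as an error.
resolveModel :: [String] -> ModelSpec -> Either String ResolvedModel
resolveModel header (ModelSpec y xs) =
  ResolvedModel <$> resolve y <*> traverse resolve xs
  where
    resolve name =
      maybe (Left ("unknown column: " ++ name)) Right (elemIndex name header)
```

Code downstream of resolveModel works with indices only, so a misspelt column name cannot surface halfway through a long computation.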

These can be summarized, I think, as follows:

  * Any practical data frame lib needs to interact well with out-of-core data, 
    and ideally also simplify the task of writing algorithms on top in a way 
    that gives out-of-core goodness more or less for free. There are a lot of 
    different ways this could be done under the covers, perhaps using one of 
    the libraries like reducers, enumerator or pipes-core, but it really 
    should be invisible to the client algorithm's author, or at least 
    invisible by default. Moreover, I think any attack in that direction is 
    essentially a precursor to sorting out MapReduce- and GraphLab-like tools 
    for Haskell.
  * Any really nice high-level data vis tool needs some data analysis / 
    machine learning style library to work with, and this is probably best 
    understood by looking at things already out there, such as ggplot2 in R.

That said, I'm all ears for other folks' takes on this, especially since I'm 
spending some time this spring experimenting in both these directions.

cheers
-Carter


Re: [Haskell-cafe] Mathematics and Statistics libraries

2012-03-28 Thread Aleksey Khudyakov
 There is the plot[1] library which provides for updateable plots from GHCi
 REPL and has a gnuplot-like interface.  I wrote it for this very reason, a
 mathematics/statistics development environment.

 It uses Data.Vector.Storable, which provides for compatibility with both
 statistics and hmatrix packages (as well as hstatistics).

Looks very interesting. I'll try it out.


 I think talking about data frames is a bit pointless unless we specify
 what a data frame is. Basically there are two representations of a tabular
 data structure: array of tuples or tuple of arrays. If you want the first,
 go for Data.Vector.Vector YourData. If you want the second, you'll probably
 end up with some HList-like data structure to hold the arrays.

 Matrices from hmatrix are easily converted to rows or columns of
 Data.Vector.Storable and can be sliced and otherwise manipulated.

That's why I said that a homogeneous data frame is simple. But if you want
columns which hold values of different types, they are no longer a matrix
and things become way more interesting.
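
The two representations mentioned in this exchange, "array of tuples" and "tuple of arrays", can be sketched as follows. Plain lists stand in for Data.Vector so that the example needs only base, and the record Obs with its fields is invented for illustration.

```haskell
-- One observation; the field names are made up for this example.
data Obs = Obs
  { obsLabel :: String
  , obsValue :: Double
  } deriving (Show, Eq)

-- "Array of tuples": one record per row.
type RowTable = [Obs]

-- "Tuple of arrays": one sequence per column, each with its own type.
data ColTable = ColTable
  { labels :: [String]
  , values :: [Double]
  } deriving (Show, Eq)

-- Split rows into columns.
toColumns :: RowTable -> ColTable
toColumns rows = ColTable (map obsLabel rows) (map obsValue rows)

-- Zip columns back into rows (truncates to the shorter column).
toRows :: ColTable -> RowTable
toRows (ColTable ls vs) = zipWith Obs ls vs
```

With a fixed record type the column form is easy to write by hand. The difficulty raised in the thread appears when the set of columns is not known in advance, which is where HList-like structures come in.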

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Mathematics and Statistics libraries

2012-03-27 Thread Vivian McPhail
On Sun, 25 Mar 2012 17:54:11 +0400, Aleksey Khudyakov wrote:

 And of course data visualization. The only library I know of is Chart[1], but
 I don't like its API much.


There is the plot[1] library which provides for updateable plots from GHCi
REPL and has a gnuplot-like interface.  I wrote it for this very reason, a
mathematics/statistics development environment.

It uses Data.Vector.Storable, which provides for compatibility with both
statistics and hmatrix packages (as well as hstatistics).


 I think talking about data frames is a bit pointless unless we specify
 what a data frame is. Basically there are two representations of a tabular
 data structure: array of tuples or tuple of arrays. If you want the first,
 go for Data.Vector.Vector YourData. If you want the second, you'll probably
 end up with some HList-like data structure to hold the arrays.

Matrices from hmatrix are easily converted to rows or columns of
Data.Vector.Storable and can be sliced and otherwise manipulated.


[1] http://hackage.haskell.org/package/plot

Vivian


Re: [Haskell-cafe] Mathematics and Statistics libraries

2012-03-26 Thread Ketil Malde
Tom Doris tomdo...@gmail.com writes:

 If you're interested in UI work, ideally we'd have something similar
 to RStudio as an environment, a simple set of windows encapsulating an
 editor, a repl, a plotting panel and help/history, this sounds
 superficial but it really has an impact when you're exploring a data
 set and trying stuff out.

I agree, this sounds really nice.

 I really disagree that we need a data frame type structure; they're an
 abomination in R, they try to accommodate event records and time
 series, and do neither well.

Just to clarify (since I think the original suggestion was mine), I
don't want to copy R's data frame (which I never quite understood,
anyway), but I'd like some standardized data structure, ideally with an
option to label columns, and functions to slice and join.  The
underlying structure can just be a list of columns (Vector) or whatever.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants



Re: [Haskell-cafe] Mathematics and Statistics libraries

2012-03-26 Thread Richard O'Keefe

On 26/03/2012, at 8:35 PM, Ketil Malde wrote:
 Just to clarify (since I think the original suggestion was mine), I
 don't want to copy R's data frame (which I never quite understood,
 anyway)

A data.frame is
 - a record of vectors all the same length
 - which can be sliced and diced like a 2d matrix

It's not unlike an SQL table (think of a column-oriented database
so a table is really a collection of named columns, but it _looks_
like a collection of rows).
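
A minimal sketch of that description: a record of named columns whose equal length is checked once at construction, with access by column and by row. All columns are Double here to keep it short, and DataFrame, mkFrame, column and row are hypothetical names, not R's or any Haskell library's API.

```haskell
-- Named columns, all of the same length (enforced by mkFrame).
newtype DataFrame = DataFrame [(String, [Double])]
  deriving (Show, Eq)

-- Smart constructor: reject columns of differing lengths.
mkFrame :: [(String, [Double])] -> Maybe DataFrame
mkFrame cols
  | allSame (map (length . snd) cols) = Just (DataFrame cols)
  | otherwise                         = Nothing
  where
    allSame []       = True
    allSame (n : ns) = all (== n) ns

-- Column access by name, like df$name in R.
column :: String -> DataFrame -> Maybe [Double]
column name (DataFrame cols) = lookup name cols

-- Row access by 0-based index: the stored columns read back as one row.
row :: Int -> DataFrame -> Maybe [(String, Double)]
row i (DataFrame cols)
  | i < 0     = Nothing
  | otherwise = traverse pick cols
  where
    pick (name, xs) = case drop i xs of
      (x : _) -> Just (name, x)
      []      -> Nothing
```

The storage is column-oriented, yet `row` makes it look like a collection of rows, which is the point of the SQL table comparison.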




Re: [Haskell-cafe] Mathematics and Statistics libraries

2012-03-25 Thread Heinrich Apfelmus

Tom Doris wrote:


If you're interested in UI work, ideally we'd have something similar
to RStudio as an environment, a simple set of windows encapsulating an
editor, a repl, a plotting panel and help/history, this sounds
superficial but it really has an impact when you're exploring a data
set and trying stuff out.


Concerning UI, the following project suggestion aims to give GHCi a web GUI

  http://hackage.haskell.org/trac/summer-of-code/ticket/1609

But one of your criteria is that a good UI should come with a help 
system, too, right?



Best regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com




Re: [Haskell-cafe] Mathematics and Statistics libraries

2012-03-25 Thread Tom Doris
Hi Heinrich,

If we compare the GHCi experience with R or IPython, leaving aside any
GUIs, the help system they have at the repl level is just a lot more
intuitive and easy to use, and you get access to the full manual
entries. For example, compare what you see if you type :info sort into
GHCi versus ?sort in R. R gives you a view of the full docs for the
function, whereas in GHCi you just get the type signature.

I usually :def a command that calls out to :!hoogle --info %, which
gives what you'd expect :info to give. So, as is usually the case,
there's a solution in Haskell that matches the features in other
systems, but it's not the default and you have to invest effort
getting it set up right. This is fine for Haskell devs who do some
stats work, but it represents an offputtingly steep learning curve for
quants who are willing to learn a little Haskell but expect
(reasonably) some basic stuff like inline help to Just Work.
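
A sketch of the kind of GHCi macro described above, as it might appear in a ~/.ghci file. It assumes the hoogle executable is installed and on the PATH. The command name hinfo is made up for this example, and the exact definition Tom uses is not given in the message.

```haskell
-- ~/.ghci
-- :def binds a new GHCi command to a function of type String -> IO String.
-- The returned string is then run by GHCi as if it had been typed in.
-- `show` wraps the argument in double quotes for the shell.
:def hinfo \name -> return (":! hoogle --info " ++ show name)
```

After restarting GHCi, `:hinfo sort` runs `hoogle --info "sort"` in a shell and prints whatever Hoogle returns.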

Tom



Re: [Haskell-cafe] Mathematics and Statistics libraries

2012-03-25 Thread Aleksey Khudyakov

On 25.03.2012 14:52, Tom Doris wrote:

Hi Heinrich,

If we compare the GHCi experience with R or IPython, leaving aside any
GUIs, the help system they have at the repl level is just a lot more
intuitive and easy to use, and you get access to the full manual
entries. For example, compare what you see if you type :info sort into
GHCi versus ?sort in R. R gives you a view of the full docs for the
function, whereas in GHCi you just get the type signature.

Integrating Haddock documentation into GHCi would be really helpful, but 
it's a GSoC project on its own.


For me the most important difference between R's REPL and GHCi is that 
:reload wipes all local bindings. Effectively it forces you to write 
everything in a file and to avoid doing anything which can't be fitted 
into a one-liner. It may not be bad, but it's definitely a different style.


And of course data visualization. The only library I know of is Chart[1], but 
I don't like its API much.


I think talking about data frames is a bit pointless unless we specify 
what a data frame is. Basically there are two representations of a tabular 
data structure: array of tuples or tuple of arrays. If you want the first, 
go for Data.Vector.Vector YourData. If you want the second, you'll probably 
end up with some HList-like data structure to hold the arrays.




[1] http://hackage.haskell.org/package/Chart



Re: [Haskell-cafe] Mathematics and Statistics libraries

2012-03-24 Thread Tom Doris
If the goal is to help Haskell be a more acceptable choice for general
statistical analysis tasks, then hmatrix, statistics, and the various
gsl wrappers already provide the majority of the functionality needed.
I think the bigger problem is that there is no guidance on which
libraries are industrial strength, and there's no glue layer making it
easier to use the APIs you'd want to, and GHCi isn't always ideal as a
repl for this workflow.

If you're interested in UI work, ideally we'd have something similar
to RStudio as an environment: a simple set of windows encapsulating an
editor, a repl, a plotting panel and help/history. This sounds
superficial, but it really has an impact when you're exploring a data
set and trying stuff out. However, it would be a bigger contribution
to get us to the point where we are able to just import
Quant.Prelude to bring into scope all the standard functionality
assumed in an environment like R or Matlab. In my experience most of
this can come from re-exporting existing libraries while occasionally
wrapping functions to simplify the interfaces and make them more
consistent (e.g., a quant doesn't particularly need to know why
Statistics.Sample.KernelDensity.kde uses unboxed vectors when the rest
of that lib uses Generic, and they certainly won't want to spend their
time remembering that they need to convert to call that function).

As an exercise, in GHCi, try loading a few arbitrary csv files of
tables including floating point columns, do a linear regression of one
such column on another, and then display a scatterplot with the
regression line, maybe throw in a check for the normality of the
residuals. Assume you'll need to be able to handle large data sets so
you need to use bytestring, attoparsec etc; beware that there's a
known bug that will cause a segfault/bus error if you use some
hmatrix/gsl functions from GHCi on x86_64, which is kind of a blocker
in itself. Maybe I missed something obvious but it took me a looong
time to figure out which containers, persistence + parsing, stats and
plotting packages I should choose.
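
For reference, the regression step of that exercise on its own is small. The sketch below fits y = intercept + slope * x by ordinary least squares using only base and plain lists. CSV parsing, plotting and the normality check are left out, and for real data sets the vector-based packages named above would be the better choice. linearFit and residuals are names invented for this example.

```haskell
-- Arithmetic mean; the caller must pass a non-empty list.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Least-squares fit of y = intercept + slope * x.
-- Returns Nothing for mismatched lengths, fewer than two points,
-- or when all x values are equal (the slope is then undefined).
linearFit :: [Double] -> [Double] -> Maybe (Double, Double)
linearFit xs ys
  | length xs /= length ys = Nothing
  | length xs < 2          = Nothing
  | sxx == 0               = Nothing
  | otherwise              = Just (intercept, slope)
  where
    mx        = mean xs
    my        = mean ys
    sxx       = sum [(x - mx) ^ (2 :: Int) | x <- xs]
    sxy       = sum [(x - mx) * (y - my) | (x, y) <- zip xs ys]
    slope     = sxy / sxx
    intercept = my - slope * mx

-- Residuals y - (intercept + slope * x), the input to a normality check.
residuals :: (Double, Double) -> [Double] -> [Double] -> [Double]
residuals (a, b) xs ys = [y - (a + b * x) | (x, y) <- zip xs ys]
```

For example, the points (1,3), (2,5), (3,7), (4,9) lie exactly on y = 1 + 2x, so `linearFit [1,2,3,4] [3,5,7,9]` recovers an intercept of 1 and a slope of 2.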

I really disagree that we need a data frame type structure; they're an
abomination in R, they try to accommodate event records and time
series, and do neither well. Haskell records are fine for
inhomogeneous event series, and for homogeneous time series parallel
Vectors or Matrices are better, as they can be passed to BLAS and
LAPACK with consequent performance and clarity advantages.
Column-oriented storage rocks, and Haskell is already a good fit.

Having used C++, Matlab and R (the latter for quite a while), I now use
Haskell for all of my statistical analysis work. Despite the many
shortcomings, it's definitely worth it for the code clarity and type
checking, to say nothing of the pre-optimization performance and
robustness.

Best of luck, happy to share some preliminary code with you directly
if you're interested!
Tom





Re: [Haskell-cafe] Mathematics and Statistics libraries

2012-03-23 Thread Gershom Bazerman

On 3/21/12 3:00 PM, Ryan Newton wrote:

I think such libraries are high priority!

My own experience with them is not deep, but I'll echo what I think is 
a common observation:


  * Matrix libraries are good
  * Statistics libs need more work

I would also be very excited about a solid statistics proposal. The 
ticket Aleksey links to is a good start (as is the experience report 
linked from there), although I think that it would be possible to 
implement a core library with less type-trickery than he supposes. Such 
an interface wouldn't necessarily be perfectly statically safe, but 
other, trickier interfaces could be built on top of it (just as we have 
fancier type-level interfaces with statically checked dimensions on top 
of lower-level matrix libs, etc.). I envision a set of tools that let 
users get up and running with loading a dump of data and calculating a 
set of metrics on it with only a few lines. It should be designed such 
that the basic framework is easily extensible with various other 
analyses, and such that analyses compose fairly straightforwardly. Which 
indeed amounts to some Frame-type structure, and a core set of functions 
on it :-)
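
One way to read "analyses compose fairly straightforwardly" without much type trickery is to package each analysis as a left fold and combine folds with Applicative, so that several metrics are computed in one pass. This is only a sketch of that idea, not the design proposed in the ticket. Analysis, runAnalysis, total, count and average are names chosen for this example.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

import Data.List (foldl')

-- An analysis over rows of type a producing a result b:
-- a step function, an initial state, and a final extraction.
data Analysis a b = forall s. Analysis (s -> a -> s) s (s -> b)

-- Strict pair, so the combined state is forced at every step.
data Pair x y = Pair !x !y

instance Functor (Analysis a) where
  fmap f (Analysis step start done) = Analysis step start (f . done)

instance Applicative (Analysis a) where
  pure b = Analysis (\() _ -> ()) () (const b)
  Analysis stepL startL doneL <*> Analysis stepR startR doneR =
    Analysis step (Pair startL startR) done
    where
      step (Pair l r) a = Pair (stepL l a) (stepR r a)
      done (Pair l r)   = doneL l (doneR r)

-- Run an analysis in a single pass over the data.
runAnalysis :: Analysis a b -> [a] -> b
runAnalysis (Analysis step start done) = done . foldl' step start

-- Sum of all values.
total :: Analysis Double Double
total = Analysis (+) 0 id

-- Number of rows.
count :: Analysis a Int
count = Analysis (\n _ -> n + 1) 0 id

-- Built from the two analyses above; still one pass over the input.
-- On empty input this is 0/0, i.e. NaN.
average :: Analysis Double Double
average = (/) <$> total <*> fmap fromIntegral count
```

New metrics are added by writing another Analysis value, and `(,) <$> total <*> count` yields both results from a single traversal.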


--g


[Haskell-cafe] Mathematics and Statistics libraries

2012-03-21 Thread Ben Jones
I am a student currently interested in participating in Google Summer of
Code. I have a strong interest in Haskell, and a semester's worth of coding
experience in the language. I am a mathematics and cs double major with
only a semester left and I am looking for information regarding what the
community is lacking as far as mathematics and statistics libraries are
concerned. If there is enough interest I would like to put together a
project with this. I understand that such libraries are probably low
priority, but if anyone has anything I would love to hear it.

Thanks for reading,
  -Benjamin


Re: [Haskell-cafe] Mathematics and Statistics libraries

2012-03-21 Thread Ryan Newton
I think such libraries are high priority!

My own experience with them is not deep, but I'll echo what I think is a
common observation:

   - Matrix libraries are good
   - Statistics libs need more work

And as far as wrappers around machine learning or computer vision libs
(OpenCV)... I'm not really sure about the status of those.




Re: [Haskell-cafe] Mathematics and Statistics libraries

2012-03-21 Thread Daniel Peebles
I'd like to see more statistics work, definitely. Bryan's statistics
library is excellent, but Ed Kmett has been talking about some very
interesting approaches to sampling from complicated distributions, which
I'd like to see implemented eventually in a library.



Re: [Haskell-cafe] Mathematics and Statistics libraries

2012-03-21 Thread Aleksey Khudyakov

On 21.03.2012 21:24, Ben Jones wrote:

I am a student currently interested in participating in Google Summer of
Code. I have a strong interest in Haskell, and a semester's worth of
coding experience in the language. I am a mathematics and cs double
major with only a semester left and I am looking for information
regarding what the community is lacking as far as mathematics and
statistics libraries are concerned. If there is enough interest I would
like to put together a project with this. I understand that such
libraries are probably low priority, but if anyone has anything I would
love to hear it.

There is an existing statistics-related GSoC project[1]. It proposes 
implementing an analog of R's data frames. I think that's rather 
difficult, since there is no obvious design. Also, I think the 
implementation will require a lot of type trickery.

[1] http://hackage.haskell.org/trac/summer-of-code/ticket/1596
