Re: Past and future of data.generators

2014-06-08 Thread Mars0i
For the record, after doing some simple speed comparisons of sampling 
functions from Incanter, data.generators, and bigml/sampling (using 
Criterium, and making sure to doall lazy sequences), it appears that 
data.generators performs very well in some situations.
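
The point about forcing lazy sequences can be illustrated with a rough sketch. It uses only clojure.core and `System/nanoTime`, not Criterium, and `timed-ms` is a hypothetical helper named here for illustration; a single timing like this is far less reliable than a real benchmark.

```clojure
;; Sketch only: `timed-ms` is a hypothetical helper, not part of Criterium.
(defn timed-ms
  "Runs thunk f once and returns the elapsed wall-clock time in milliseconds."
  [f]
  (let [start (System/nanoTime)]
    (f)
    (/ (- (System/nanoTime) start) 1e6)))

(def n 100000)

;; Unforced: repeatedly returns a lazy seq immediately, so no random
;; numbers are actually drawn inside the timed region.
(def lazy-ms (timed-ms #(repeatedly n rand)))

;; Forced: doall realizes all n elements inside the timed region.
(def forced-ms (timed-ms #(doall (repeatedly n rand))))

(println "unforced:" lazy-ms "ms, forced:" forced-ms "ms")
```

With Criterium, the same `doall` would go inside the expression passed to `quick-bench` or `bench`, so that the sampling work is what gets measured.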

-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Past and future of data.generators

2014-06-08 Thread Mars0i
For the record, I just did some simple speed comparisons of sampling 
functions from Incanter, data.generators, and bigml/sampling, and 
data.generators performs very well.  It was fastest in some tests.



Re: Past and future of data.generators

2014-06-05 Thread Linus Ericsson
I do agree that the name data.generators is not where one would look for a
controllable random source. A more specific name for these functions should
be considered.

java.util.Random has been an issue for me when stress-testing random reads
and writes to a huge memory area from several threads. If I were to do it
again, I would use java.util.concurrent.ThreadLocalRandom to generate
random numbers in parallel. (j.u.c.TLR is only available in JDK >= 1.7.0,
and clojure.core aims to support 1.6.0 as well.) The core.async library
does use ThreadLocalRandom. The reducers functionality has a conditional
import, which I think should only be used as a very last resort in
clojure.core.
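
As a minimal sketch of the ThreadLocalRandom approach, assuming JDK 7 or later: `tl-rand-int` and `sums` are hypothetical names for illustration, and the four-future workload is only a stand-in for a real stress test.

```clojure
(import '(java.util.concurrent ThreadLocalRandom))

(defn tl-rand-int
  "Returns a random int in [0, n) from the calling thread's own generator."
  [n]
  ;; Always call (ThreadLocalRandom/current) on the thread that uses it;
  ;; do not cache the instance and share it across threads.
  (.nextInt (ThreadLocalRandom/current) (int n)))

;; Four futures draw numbers in parallel without contending on a single
;; shared java.util.Random. mapv is eager, so all futures start before
;; the first deref.
(def sums
  (->> (range 4)
       (mapv (fn [_] (future (reduce + (repeatedly 1000 #(tl-rand-int 100))))))
       (mapv deref)))
```

One trade-off: ThreadLocalRandom cannot be seeded, so this helps with contention but not with reproducibility. In a script, calling `(shutdown-agents)` at the end lets the JVM exit promptly after futures have been used.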

A surprise and caveat is that the performance was really bad when
live-generating random memory addresses - likely because of cache thrashing.
The performance was much higher when using a pre-realized (very long)
sequence of random data.

A facility for generating random memory addresses would likely benefit
from a buffer that helps the hardware prefetch memory (which is often a
realistic scenario in stream processing).
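
A rough sketch of the pre-realized approach, under the assumption that the "memory area" can be modelled as a plain long array: `random-indices`, `touch-all`, `mem`, and `idxs` are hypothetical names, and the sizes are deliberately tiny.

```clojure
(import '(java.util Random))

(defn random-indices
  "Returns a long-array of n random indices in [0, size), drawn up front."
  ^longs [^Random rng n size]
  (let [buf (long-array n)]
    (dotimes [i n]
      (aset buf i (long (.nextInt rng (int size)))))
    buf))

(defn touch-all
  "Reads mem at every precomputed index and returns the sum of the values read."
  [^longs mem ^longs idxs]
  (areduce idxs i acc 0 (+ acc (aget mem (aget idxs i)))))

;; Stand-in for the large memory area: 1024 slots, each holding 1.
(def mem (long-array 1024 1))

;; All random indices are generated before the access loop starts.
(def idxs (random-indices (Random. 42) 10000 1024))

;; The access loop itself does no random number generation.
(def total (touch-all mem idxs))
```

Generating the indices first keeps the RNG work out of the measured access loop, which is the separation described above; whether it also helps hardware prefetching would need to be measured on the actual workload.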

Summary:
- better namespace for random object/number generation
- ThreadLocalRandom is only available in JDK >= 1.7.0
- stress tests do benefit from buffering incoming random data, which is more
realistic as well.

I will dig deeper into Criterium to see if this is already implemented there.

/Linus

On Thursday, June 5, 2014, Mars0i marsh...@logical.net wrote:

 clojure.core provides a minimal set of functions for random effects: rand,
 rand-int, and rand-nth, currently with no simple ability to base these on a
 resettable random number generator or on different RNGs in different
 threads.  (But see ticket CLJ-1420,
 http://dev.clojure.org/jira/browse/CLJ-1420, pointed out by Andy
 Fingerhut in another thread.)

 data.generators includes additional useful general-purpose functions
 involving random numbers and random choices, but this is not at all
 obvious when you read the docstrings.  (Some of the docstrings are pretty
 mysterious.)  It's also not necessarily what one would guess from the name
 of the library.  (None of this is a criticism of anyone or anything about
 the project.  data.generators is at a 0.n.m release stage.  I'm very
 grateful for the work that people have put in on it.)

 As I understand it, data.generators was split off from test.generative,
 which sounds like a good idea.  So data.generators was intended to provide
 functions that generate random data for testing.  (I imagine that the
 existing documentation makes more sense in the context of test.generative,
 too.)

 However, what's in data.generators has more general applications, for
 people who want random numbers, samples, etc. outside of software testing.
 (In my case, that would be for random effects in scientific simulations.)
 Off the top of my head, it seems to me that these other applications might
 have slightly different needs from the use of data.generators by
 test.generative.

 For one thing, efficiency might matter a lot in some simulations, but not
 in software testing.  (At least, *I* wouldn't care if my test functions
 were slow.)  I'm not saying that functions in data.generators are slow, but
 I don't think there's a good reason to worry about making them efficient if
 they're only intended for software testing.

 Further, there are needs beyond what data.generators currently
 provides.  See the sampling functions in bigml/sampling
 (https://github.com/bigmlcom/sampling) or Incanter (http://incanter.org/),
 for example, and lots of other random functions that Incanter provides.
 Some of those should remain in Incanter, of course, but I wonder whether
 Clojure would benefit from a contributed library that satisfied a set of
 core needs for random effects.  (Incanter partly builds on clojure.core's
 rand at this point.)

 Maybe data.generators is/will be that library.  Or maybe parts of
 data.generators would make more sense as part of a separate library
 (math.random? data.random? math.probability?) that could be split out of
 data.generators.  (If it doesn't make sense to split data.generators, then
 would a new name for the library be more appropriate?)

 Just some things I was wondering about.  Curious to see what others say.

 (Fun tip: Check out data.generators' anything function, which is like
 Emacs' Zippy the Pinhead functions for people who prefer industrial atonal
 music composed by randomly filtered Jackson Pollock paintings, to speech.
 Or: People who want to thoroughly test their functions by throwing random
 randomly-typed data at them.)


Re: Past and future of data.generators

2014-06-05 Thread Mikera
One of the challenges with random number generation is that there are quite 
a few specialised requirements. I don't believe a generic approach can meet 
all needs. I think we actually need a few things:

1. Better implementation for clojure.core/rand etc. I think conditional 
usage of j.u.c.ThreadLocalRandom for Java >= 1.7 would be great if we can 
make it work - there are plenty of concurrent workloads where a shared 
regular java.util.Random isn't a good solution.
2. A library of generic random number generation tools (e.g. data.random - 
should be general purpose, able to generate a wide range of useful 
distributions, allow arbitrary java.util.Random instances to be passed as 
seeds etc.)
3. More specialised solutions can live in specific libraries (e.g. 
core.matrix will be getting support for generation of random matrices 
etc.). Often specialised implementations will offer much better performance 
for specific use cases, so we need to keep this option open. An example 
would be generating large random boolean matrices - generating and storing 
individual bits in bulk is *much* more efficient than going via generic 
random number functions for each bit.
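
The bulk-bits idea in item 3 can be sketched as follows. This is an illustration only, not how core.matrix does or will do it: `random-bits` and `bit-at` are hypothetical names, and packing 64 bits per long is just one possible layout.

```clojure
(import '(java.util Random))

(defn random-bits
  "Returns a long-array holding at least n-bits random bits, 64 per element."
  ^longs [^Random rng n-bits]
  (let [n-words (quot (+ n-bits 63) 64)
        words   (long-array n-words)]
    ;; One call to nextLong yields 64 bits, instead of 64 separate calls
    ;; to a generic per-element random function.
    (dotimes [i n-words]
      (aset words i (.nextLong rng)))
    words))

(defn bit-at
  "Returns true if bit i of the packed array is set."
  [^longs words ^long i]
  (bit-test (aget words (quot i 64)) (rem i 64)))
```

For example, `(random-bits (Random. 1) 1000)` fills 16 longs, and `(bit-at bits 999)` reads one entry back as a boolean. The unused tail bits of the last word are also random and should simply be ignored.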

I think we should clearly separate random number generation from sample 
data construction. The latter certainly depends upon the former, but random 
numbers have a lot of other independent use cases. Hence I'm in favour of 
something like data.random being separate from data.generators.
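
To make that separation concrete, here is a minimal sketch. No data.random library with this API exists as far as this thread establishes: every function name below is hypothetical, and the only real API used is java.util.Random, passed in explicitly so that a seed makes results reproducible.

```clojure
(import '(java.util Random))

;; Random number layer: every function takes the generator explicitly.
(defn rand-double
  "Uniform double in [0, 1)."
  [^Random rng]
  (.nextDouble rng))

(defn rand-gaussian
  "Normally distributed double with the given mean and standard deviation."
  [^Random rng mean sd]
  (+ mean (* sd (.nextGaussian rng))))

(defn rand-choice
  "Uniformly chosen element of the vector coll."
  [^Random rng coll]
  (nth coll (.nextInt rng (count coll))))

;; Sample data layer: builds test-style data on top of the layer above.
(defn sample-person
  "Returns a map of made-up person data drawn from rng."
  [^Random rng]
  {:name   (rand-choice rng ["Ada" "Grace" "Alan"])
   :height (rand-gaussian rng 170.0 10.0)})
```

Two generators created with the same seed, e.g. `(Random. 42)`, produce the same `sample-person` result, which covers the resettable use case; the lower layer stays usable on its own for simulations that never construct sample data.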


Re: Past and future of data.generators

2014-06-05 Thread Thomas
Hi,

I have used Uncommons Maths (http://maths.uncommons.org/) in a few of my 
projects, so that could be used in data.random. I have also played with the 
random.org API in the past as a source of random numbers.

Thomas

PS. In one of my use cases I also care about the performance of the random 
generator, as I potentially need to create loads (millions) of random 
numbers, and then performance can have an impact.
