Re: [Haskell-cafe] Big Arrays

2010-10-06 Thread Ketil Malde
Hemanth Kapila saihema...@gmail.com writes:

> Let us say, we are using a bit-array of size 2^43 (that is, a byte array of
> size 2^40) to store a bloom filter. And let us further assume that we are
> interested in a false-positive probability of 0.01.

Since we are just making up numbers, let us instead say we are using a
bit array of size 2^32 - still too large for Int indexing (which was the
issue here) but only 500MB in size.

> That means, I will be able to use this array to represent a set of
> cardinality 9.18e11 ~ 10^12.

...bringing it down to less than 10^9, easily reached for building an
indexing of k-words (tuples) in e.g. the human genome (3GB).
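
These capacity figures follow from the standard Bloom filter relation
n = -m * (ln 2)^2 / ln p, for m bits at false-positive rate p with the
optimal number of hash functions. A quick sketch to check them (the
function name is my own):

```haskell
-- Largest set size n a Bloom filter of m bits supports at
-- false-positive rate p: n = -m * (ln 2)^2 / ln p.
bloomCapacity :: Double -> Double -> Double
bloomCapacity m p = negate (m * log 2 * log 2) / log p

main :: IO ()
main = do
  print (bloomCapacity (2 ** 43) 0.01)  -- ~9.18e11, the figure quoted above
  print (bloomCapacity (2 ** 32) 0.01)  -- ~4.5e8, comfortably below 10^9
```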

But: I'm about to start analyzing Illumina sequencing data, where we have
sequences from two individuals of the same species.  I'm interested in
the differences between these individuals, so I might want to index both
data sets and screen the sequences of each against the other to identify
bits that don't appear in the other.  Since I'll have easily tens, maybe
a hundred gigabytes of sequence from each, it's not impossible to get
into the territory you describe.

(In practice, things will be limited by available RAM, sadly still some
orders of magnitude less than 2^40 - although apparently SGI can deliver
it. I guess I just need a bigger budget.)

> I was curious to know what sort of programs would be dealing with sets of
> 10^12 elements.

Most likely a program using mutation, and probably not a copying,
multi-generation GC.

Other data sets of considerable size include acoustic data (I
understand a modern sonar delivers about a Gbit/sec continuously), and
seismic data.

Also, for more moderate bodies of data and more complex analysis, you
might want to use a less frugal data structure, like a suffix array.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Big Arrays

2010-10-06 Thread Hemanth Kapila
Thanks for the response.
That sequence-comparison application sounds very impressive.




Re: [Haskell-cafe] Big Arrays

2010-10-05 Thread Ketil Malde
Hemanth Kapila saihema...@gmail.com writes:

> Just out of curiosity, may I know a use case of such huge arrays?

Bloom filters?

> At such sizes, I thought, the array would not have the expected array
> properties (constant access time) due to thrashing.

Yes, but this is true for any array size, due to the cache hierarchy.
Accessing stuff in the same vector register is faster than accessing
things in L1 cache is faster than L2, then L3, then RAM, then swap (on
SSD), then (rotating) disk. :-)  Perhaps it's not /entirely/
unreasonable to consider array accesses to be log(N) cost - but I'm
fairly sure they're always faster than pointer chasing in tree
structures.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants


Re: [Haskell-cafe] Big Arrays

2010-10-05 Thread Hemanth Kapila
Thanks Ketil.
On Wed, Oct 6, 2010 at 1:36 AM, Ketil Malde ke...@malde.org wrote:

>> Just out of curiosity, may I know a use case of such huge arrays?
>
> Bloom filters?

I'm probably dense - I still don't see where one would use such
a big array.

Let us say, we are using a bit-array of size 2^43 (that is, a byte array of
size 2^40) to store a bloom filter. And let us further assume that we are
interested in a false-positive probability of 0.01.

That means, I will be able to use this array to represent a set of
cardinality 9.18e11 ~ 10^12.

I was curious to know what sort of programs would be dealing with sets of
10^12 elements.
I'm mainly curious about how one would decide that bloom filters are the best
algorithm when dealing with this amount of data. What factors do we consider
when deciding on the algorithm or the data structure?  Would the impact on GC
be taken into account, for example (I'm guessing there would be at least one
copy from a younger generation to a permanent generation)?


/Hemanth K


Re: Re[2]: [Haskell-cafe] Big Arrays

2010-10-04 Thread Hemanth Kapila
Just out of curiosity, may I know a use case of such huge arrays?
At such sizes, I thought, the array would not have the expected array
properties (constant access time) due to thrashing.

thanks,
Hemanth


Re[2]: [Haskell-cafe] Big Arrays

2010-10-04 Thread Bulat Ziganshin
Hello John,

Monday, October 4, 2010, 7:57:13 AM, you wrote:

> Sure it does; a 32-bit system can address much more than 2**30
> elements. Artificially limiting how much memory can be allocated by
> depending on a poorly-specced type like 'Int' is a poor design
> decision in Haskell and GHC.

Do you understand that this "poor design decision" makes array access
several times faster, and doesn't limit anything except for very rare huge
Bool arrays?


-- 
Best regards,
 Bulat  mailto:bulat.zigans...@gmail.com



Re: Re[2]: [Haskell-cafe] Big Arrays

2010-10-04 Thread John Millikin
On Mon, Oct 4, 2010 at 01:51, Bulat Ziganshin bulat.zigans...@gmail.com wrote:
> Hello John,
>
> Monday, October 4, 2010, 7:57:13 AM, you wrote:
>
>> Sure it does; a 32-bit system can address much more than 2**30
>> elements. Artificially limiting how much memory can be allocated by
>> depending on a poorly-specced type like 'Int' is a poor design
>> decision in Haskell and GHC.
>
> Do you understand that this "poor design decision" makes array access
> several times faster, and doesn't limit anything except for very rare huge
> Bool arrays?

I don't see how using 'Int' instead of 'Word' makes array access
several times faster. Could you elaborate on that?

The important use case that gets limited is an array of Word8 -- by using
'Int', byte buffers are artificially limited to half of their potential
maximum size.
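
John's factor-of-two point can be read straight off the type bounds; a
small sketch (the `Integer` conversion, to avoid truncating the unsigned
maximum, is my own framing):

```haskell
import Data.Word (Word)

main :: IO ()
main = do
  -- Convert to Integer so the unsigned maximum isn't truncated.
  let maxI = toInteger (maxBound :: Int)
      maxW = toInteger (maxBound :: Word)
  print maxI               -- 2^31 - 1 on a 32-bit GHC, 2^63 - 1 on 64-bit
  print maxW               -- 2^32 - 1 on a 32-bit GHC, 2^64 - 1 on 64-bit
  print (maxW `div` maxI)  -- 2 either way: Word can index twice the range
```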


[Haskell-cafe] Big Arrays

2010-10-03 Thread Henry Laxen
Dear Group,

I am trying to create a (relatively) big array, 
but it seems I cannot make anything larger 
than 2^30 or so.  Here is the code:

import Data.Word
import Data.Array.Unboxed
import Control.Monad.ST
import Data.Array.ST
import Control.Exception
import Prelude hiding (catch)


t1 :: Word64 -> UArray Word64 Bool
t1 size = runSTUArray
  (do a <- newArray (0, size) True
            :: ST s (STUArray s Word64 Bool)
      writeArray a 0 False
      return a)

catchArrayException x = do
  let err = show (x :: SomeException)
  putStrLn $ "Exception [" ++ err ++ "]"
  return ()

main = do
  let a1 = t1 (2^30)
      a2 = t1 (2^31)
      a3 = t1 (2^32)
  catch (print (a1!0, a1!1)) catchArrayException
  catch (print (a2!0, a2!1)) catchArrayException
  catch (print (a3!0, a3!1)) catchArrayException
  

This results in:
*Main GOA> main
(False,True)
(Exception [Negative range size]
(False,Exception [Error in array index; 1 not in range [0..1)]

It looks like array ranges can only be Ints, and not Int64 or Word64 types.  
Any pointers would be appreciated.
Best wishes,
Henry Laxen





Re: [Haskell-cafe] Big Arrays

2010-10-03 Thread Bulat Ziganshin
Hello Henry,

Sunday, October 3, 2010, 7:54:49 PM, you wrote:

> It looks like array ranges can only be Ints, and not Int64 or Word64 types.

yes, it's Int internally for efficiency reasons. you can do your own
implementation to override this limit :)


-- 
Best regards,
 Bulat  mailto:bulat.zigans...@gmail.com



Re: [Haskell-cafe] Big Arrays

2010-10-03 Thread Antoine Latter
On Sun, Oct 3, 2010 at 11:18 AM, Bulat Ziganshin
bulat.zigans...@gmail.com wrote:
> Hello Henry,
>
> Sunday, October 3, 2010, 7:54:49 PM, you wrote:
>
>> It looks like array ranges can only be Ints, and not Int64 or Word64 types.
>
> yes, it's Int internally for efficiency reasons. you can do your own
> implementation to override this limit :)


A good place to start when rolling your own is the primitive
package[1] on Hackage. It is intimately tied to GHC, however.

Antoine

[1] http://hackage.haskell.org/package/primitive


Re: [Haskell-cafe] Big Arrays

2010-10-03 Thread Bryan O'Sullivan
On Sun, Oct 3, 2010 at 11:54 AM, Henry Laxen nadine.and.he...@pobox.com wrote:

> I am trying to create a (relatively) big array,
> but it seems I cannot make anything larger
> than 2^30 or so.  Here is the code:


Use a 64-bit machine, where Int is 64 bits wide. Trying to create a larger
array on a 32-bit machine doesn't make any sense.


Re: [Haskell-cafe] Big Arrays

2010-10-03 Thread John Millikin
On Sun, Oct 3, 2010 at 19:09, Bryan O'Sullivan b...@serpentine.com wrote:
> On Sun, Oct 3, 2010 at 11:54 AM, Henry Laxen nadine.and.he...@pobox.com
> wrote:
>
>> I am trying to create a (relatively) big array,
>> but it seems I cannot make anything larger
>> than 2^30 or so.  Here is the code:
>
> Use a 64-bit machine, where Int is 64 bits wide. Trying to create a larger
> array on a 32-bit machine doesn't make any sense.

Sure it does; a 32-bit system can address much more than 2**30
elements. Artificially limiting how much memory can be allocated by
depending on a poorly-specced type like 'Int' is a poor design
decision in Haskell and GHC.

OP: for this particular use case (unboxed Word64), an easy solution
is to have some structure like

data BigArray a = BigArray (UArray Word32 a) (UArray Word32 a)
                           (UArray Word32 a) (UArray Word32 a)

with each array containing 2^30 elements. You'll need to write custom
indexing and modification functions, to process the index and pass it
to the appropriate array. Something like:

idxBig :: BigArray a -> Word32 -> (UArray Word32 a, Word32)
idxBig (BigArray a0 a1 a2 a3) i
  | i < 2^30        = (a0, i)
  | i < 2^31        = (a1, i - 2^30)
  | i < 2^30 + 2^31 = (a2, i - 2^31)
  | otherwise       = (a3, i - 2^31 - 2^30)

Then wrap the existing array functions:

idx :: IArray UArray a => BigArray a -> Word32 -> a
idx arr i = let (arr', i') = idxBig arr i in arr' ! i'
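
For what it's worth, here is that sketch made self-contained and runnable,
with each sub-array shrunk from 2^30 to 4 elements so the index arithmetic
is easy to verify by hand (the `quarter` constant, the `mk` helper, and the
demo values are mine):

```haskell
import Data.Array.Unboxed (IArray, UArray, listArray, (!))
import Data.Word (Word32, Word8)

quarter :: Word32
quarter = 4  -- John's sketch uses 2^30; shrunk here for a quick check

data BigArray a = BigArray (UArray Word32 a) (UArray Word32 a)
                           (UArray Word32 a) (UArray Word32 a)

-- Map a global index to (sub-array, local index).
idxBig :: BigArray a -> Word32 -> (UArray Word32 a, Word32)
idxBig (BigArray a0 a1 a2 a3) i
  | i < quarter     = (a0, i)
  | i < 2 * quarter = (a1, i - quarter)
  | i < 3 * quarter = (a2, i - 2 * quarter)
  | otherwise       = (a3, i - 3 * quarter)

idx :: IArray UArray a => BigArray a -> Word32 -> a
idx arr i = let (arr', i') = idxBig arr i in arr' ! i'

-- Four consecutive blocks, so global index i holds the value i.
mk :: Word8 -> UArray Word32 Word8
mk off = listArray (0, quarter - 1) [off .. off + 3]

big :: BigArray Word8
big = BigArray (mk 0) (mk 4) (mk 8) (mk 12)

main :: IO ()
main = print [ idx big i | i <- [0 .. 15] ]  -- prints [0,1,2,...,15]
```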