Re: [Haskell-cafe] ANNOUNCE: GA-1.0, a library for working with genetic algorithms
algorithm: {- BEGIN CODE -} evolveSolution = do let funcs = [mkPrim (:), mkPrim lookup, mkPrim delete, mkPrim insert] ++ map mkPrim [0..100] ++ map mkPrim [(+),(*),(-)] allFuncs = funcs ++ primsForContainersPackage -- my package should have eventually provided such collections fitness f = f 503 == 0 gaConf = mkGA funcs (mkPrim fitness) defaultConfig in evolve gaConf {- END CODE -} In the system each individual is an operator and a list of arguments, each contained in their own Dynamic type. All individuals include 1) a mapping from type to sub-trees that are of that type and 2) a mapping of types to functions that will construct the same individual (that is: Map typ (typ - Individual)). The union of the domain of these to mappings show what, if any, opportunities for crossover exist between any two individuals. The global configuration maintains all the primitives needed to generate new individuals, which means sub-trees can also be generated to allow mutation. The main two issues that made me stop (read: I didn't recognize these as the core issue till I'd already hacked around without thinking about why what I'm doing wasn't quite right) were: 1) I didn't have a good way to dynamically safely coerce one type, ty1, into another type, ty2. For example, when given t_1 - t_2 - ... - t_n - r and needed b_1 - b_2 - ... - b_m - r where m n and there was a injective mapping between the b, t type variables I still had bugs in the actual coercion. A more concrete example of this point: given Int - Float - Float, I wanted to coerce it into a function of type Float - Int - Float or Float - Float or Int - Float. Usually my solution worked, but I think a bug lingered (needs testing, which I don't have time for now). 2) Generation of individuals in highly heterogenious configurations was basically non-terminating without special effort. I was going to make a routine to compute the minimum depth given any particular primitive, then removed any primitive from consideration if the minimum depth put me over the maximum depth for the individual. So a bit long winded, but that was the effort in a nutshell. If nothing else I hope it was entertaining. I'm sure it's doable but I haven't the time of focus to do it properly, and won't for a while. Cheers, Thomas On Thu, Sep 29, 2011 at 12:45 PM, Kenneth Hoste kenneth.ho...@gmail.com wrote: Hello, I'm proud to announce the v1.0 release of GA [1], my library for working with genetic algorithms in Haskell. Source repo is available on github. [2] This is a major version bump compared to the previous v0.2 release, because the library is pretty mature now in my view. Major features: * flexible user-friendly API for working with genetic algorithms * Entity type class to let user define entity definition, scoring, etc. * abstraction over monad, resulting in a powerful yet simple interface * support for scoring entire population at once * support for checkpointing each generation, and restoring from last checkpoint * convergence detection, as defined by user * also available: random searching, user-defined progress output * illustrative toy examples included I'm happy to take any questions or suggestions that you might have. [1] http://hackage.haskell.org/package/GA [2] https://github.com/boegel/GA ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] ANNOUNCE: GA-1.0, a library for working with genetic algorithms
Hello, I'm proud to announce the v1.0 release of GA [1], my library for working with genetic algorithms in Haskell. Source repo is available on github. [2] This is a major version bump compared to the previous v0.2 release, because the library is pretty mature now in my view. Major features: * flexible user-friendly API for working with genetic algorithms * Entity type class to let user define entity definition, scoring, etc. * abstraction over monad, resulting in a powerful yet simple interface * support for scoring entire population at once * support for checkpointing each generation, and restoring from last checkpoint * convergence detection, as defined by user * also available: random searching, user-defined progress output * illustrative toy examples included I'm happy to take any questions or suggestions that you might have. [1] http://hackage.haskell.org/package/GA [2] https://github.com/boegel/GA ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] HackageDB User Account
Hi Jonathan, On 21 Sep 2011, at 22:41, Jonathan Frywater wrote: How does one go about getting an account? I sent an email to the address provided at http://hackage.haskell.org/packages/accounts.html but haven't received any response yet. Since it's been over 3 weeks, I decided to try my luck here. Ross is usually quite responsive. Try resending him your email, maybe he missed it for some reason. I got a password reset for Hackage a couple of weeks ago in a matter of hours. Kenneth___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Template Haskell tutorials?
Hi, The links to the supposedly brilliant Template Haskell tutorials by Bulat are broken. http://www.haskell.org/bz/thdoc.htm http://www.haskell.org/bz/th3.htm Does anyone know if these tutorials moved to somewhere else? greetings, Kenneth___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] ANNOUNCE: StrictBench 0.1 - Benchmarking code through strict evaluation
Hi all, On Jun 8, 2009, at 15:12 , Magnus Therning wrote: On Mon, Jun 8, 2009 at 1:56 PM, Martijn van Steenbergenmart...@van.steenbergen.nl wrote: Magnus Therning wrote: Is there no way to force repeated evaluation of a pure value? (It'd be nice to be able to perform time measurements on pure code so that it's possible to compare Haskell implementations of algorithms to implementations in other languages, without running into confounding factors.) I'm really curious about this too. My guess is that the answer is no because doing so would (among other things) mean a thunk have to be copied first before it is evaluated, to preserve the unevaluated version. And what guarantee is there that values further down the expression haven't been evaluated already? Efficient lazy evaluation is hard; inefficient lazy evalation is even harder. ;-) Yes, I guessed as much. I was hoping that there might be some way of tricking GHC into being more inefficient though, something like a rollback in evaluation state. I've been playing around with MicroBench [1], and I believe there is a way to trick GHC (at least the 6.10.2 version) into being inefficient. Below is a snippet of the code I used to benchmark various implementations of a function. The key is partial function application and the IO monad. Don't ask me why it works, but I believe it does. -- benchmark a given function by applying it n times to the given value benchmark :: (a - b) - a - Int - IO() benchmark f x n = do r - mapM (\y - return $! f y) (replicate n x) performGC return () The performGC might not be 100% necessary, but I see it as a part of the function evaluation (i.e. make the runtime clean up the mess the function made). Of course this assumes performGC to be called before using benchmark. Note: MicroBench was doing something similar, but was using mapM_ instead, which no longer seems to fool GHC into evaluating the function n times. mapM does seem to work though. K. -- Kenneth Hoste Paris research group - ELIS - Ghent University, Belgium email: kenneth.ho...@elis.ugent.be website: http://www.elis.ugent.be/~kehoste blog: http://boegel.kejo.be ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C
On May 18, 2009, at 15:28 , Claus Reinke wrote: My current best try uses the uvector package, has two 'vectors' of type (UArr Double) as input, and relies on the sumU and zipWithU functions which use streaming to compute the result: dist_fast :: UArr Double - UArr Double - Double dist_fast p1 p2 = sumDs `seq` sqrt sumDs where sumDs = sumU ds ds= zipWithU euclidean p1 p2 euclidean x y = d*d where d = x-y You'll probably want to make sure that 'euclidian' is specialized to the types you need (here 'Double'), not used overloaded for 'Num a=a' (check -ddump-tc, or -ddump-simpl output). I understand from your later post that is was in fact specialized, but how do I make sure it _is_ specialized? Can I just add a type signature in the dist_fast definition for euclidean, or should I define euclidean outside of dist_fast, with an explicit type signature? If the latter, won't that hurt performance? Or should marking it INLINE take care of that? After that, unrolling the fused fold loop (uvector internal) might help a bit, but isn't there yet: http://hackage.haskell.org/trac/ghc/ticket/3123 http://hackage.haskell.org/trac/ghc/wiki/Inlining And even if that gets implemented, it doesn't apply directly to your case, where the loop is in a library, but you might want to control its unrolling in your client code. Having the loop unrolled by a default factor (8x or so) should help for loops like this, with little computation. This seems rather serious, and might be one of the bigger reasons why I'm getting nowhere close to C in terms of performance... The loop body is ridiculously small, so it would make sense to unroll it somewhat to help avoid the loop overhead. However, it seems like GHC isn't able to do that now. Is there any way to unroll the loop myself, to speed things up? Seems hard, because I'm using uvector... K. -- Kenneth Hoste Paris research group - ELIS - Ghent University, Belgium email: kenneth.ho...@elis.ugent.be website: http://www.elis.ugent.be/~kehoste blog: http://boegel.kejo.be ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C
On May 18, 2009, at 20:54 , Claus Reinke wrote: As I said, I don't get the fusion if I just add the function above to the original Dist.hs, export it and compile the module with '-c -O2 -ddump-simpl': I can't reproduce this. Interesting. I'm using ghc 6.11.20090320 (windows), uvector-0.1.0.3. I attach the modified Dist.hs and its simpl output, created via: ghc -c Dist.hs -O2 -ddump-tc -ddump-simpl-stats -ddump-simpl Dist.dumps Perhaps others can confirm the effect? Note that the 'dist_fast' in the same module does get fused, so it is not likely an options issue. I still suspect that the inlining of the 'Dist.zipWith' wrapper in the 'dist_fast_inlined' '__inline_me' has some significance - it is odd to see inlined code in an '__inline_me' and the fusion rule won't trigger on 'Dist.sumU . Dist. $wzipWithU', right? As far as I can tell, the dist_fast_inlined doesn't get fused, i.e. I'm seeing zipWithU and sumU being used in it, which is not the case in dist_fast. This is on OS X/PowerPC, using GHC 6.10.1. Does the complete program fragment I posted earlier yield the desired result? Yes. Note that the original poster also reported slowdown from use of 'dist_fast_inlined'. Don, you were defining dist inside the main module, while in our case the dist functions are defined in a seperate Dist.hs module... Would that matter? K. -- Kenneth Hoste Paris research group - ELIS - Ghent University, Belgium email: kenneth.ho...@elis.ugent.be website: http://www.elis.ugent.be/~kehoste blog: http://boegel.kejo.be ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C
On May 19, 2009, at 13:24 , Daniel Schüssler wrote: Hello! On Monday 18 May 2009 14:37:51 Kenneth Hoste wrote: I'm mostly interested in the range 10D to 100D is the dimension known at compile-time? Then you could consider Template Haskell. In general, no. :-) It will be known for some applications, but not for others. I'm more and more amazed what comes into play just to implement something simple like n-dim. Euclidean distance relatively fast using Haskell. It seems to me that GHC is missing several critical optimizations (yes, I know, patches welcome) to enable it to emit fast code for HPC applications. I'm still a big fan of Haskell, for a variety of reasons, but it seems like it's not ready yet for the task I had in mind, which is a shame. Just to be clear, this isn't a flame bait post or anything, just my 2 cents. K. I wrote up some code for generating the vector types and vector subtraction/inner product below, HTH. One problem is that I'm using a typeclass and apparently you can't make {-# SPECIALISE #-} pragmas with TH, so let's hope it is automatically specialised by GHC. Greetings, Daniel TH.hs -- {-# LANGUAGE TemplateHaskell #-} {-# OPTIONS -fglasgow-exts #-} module TH where import Language.Haskell.TH import Control.Monad -- Non-TH stuff class InnerProductSpace v r | v - r where innerProduct :: v - v - r class AbGroup v where minus :: v - v - v euclidean x y = case minus x y of z - sqrt $! innerProduct z z -- TH noContext :: Q Cxt noContext = return [] strict :: Q Type - StrictTypeQ strict = liftM ((,) IsStrict) makeVectors :: Int -- ^ Dimension - Q Type -- ^ Component type, assumed to be a 'Num' - String -- ^ Name for the generated type - Q [Dec] makeVectors n ctyp name0 = do -- let's assume ctyp = Double, name = Vector for the comments -- generate names for the variables we will need xs - replicateM n (newName x) ys - replicateM n (newName y) let name = mkName name0 -- shorthands for arithmetic expressions; the first takes expressions, -- the others take variable names sumE e1 e2 = infixE (Just e1) [|(+)|] (Just e2) varDiffE e1 e2 = infixE (Just (varE e1)) [|(-)|] (Just (varE e2)) varProdE e1 e2 = infixE (Just (varE e1)) [|(*)|] (Just (varE e2)) conPat vars = conP name (fmap varP vars) -- data Vector = Vector !Double ... !Double theDataD = dataD noContext name [] -- no context, no params [normalC name (replicate n (strict ctyp))] [''Eq,''Ord,''Show] -- 'deriving' clause innerProdD = -- instance InnerProductSpace Vector Double where ... instanceD noContext ( conT ''InnerProductSpace `appT` conT name `appT` ctyp) -- innerProduct = ... [valD (varP 'innerProduct) (normalB -- \(Vector x1 x2 ... xn) (Vector y1 y2 ... yn) - (lamE [conPat xs, conPat ys] -- x1*y1 + + xn*yn + 0 (foldl sumE [|0|] $ zipWith varProdE xs ys) )) [] -- no 'where' clause ] abGroupD = instanceD noContext ( conT ''AbGroup `appT` conT name) -- minus = ... [valD (varP 'minus) (normalB -- \(Vector x1 x2 ... xn) (Vector y1 y2 ... yn) - (lamE [conPat xs, conPat ys] -- Vector (x1-y1) ... (xn-yn) (foldl appE (conE name) $ zipWith varDiffE xs ys) )) [] -- no 'where' clause ] sequence [theDataD,innerProdD,abGroupD] Main.hs -- {-# LANGUAGE TemplateHaskell #-} {-# LANGUAGE MultiParamTypeClasses #-} module Main where import TH $(makeVectors 3 [t|Double|] Vec3) main = print $ euclidean (Vec3 1 1 1) (Vec3 0 0 0) ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe -- Kenneth Hoste Paris research group - ELIS - Ghent University, Belgium email: kenneth.ho...@elis.ugent.be website: http://www.elis.ugent.be/~kehoste blog: http://boegel.kejo.be ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] fast Eucl. dist. - Haskell vs C
Hello, For a while now, I've been trying to come up with a fast Haskell-only function which implements Euclidean distance in n-dimensional space. So far, I've been disappointed by the performance of my best effort in Haskell, compared to C. I'm hoping some of the Haskell experts and/or performance gurus on this list can help me out on resolving this, and also maybe shed some light on some strange (imho) things I've run into. My current best try uses the uvector package, has two 'vectors' of type (UArr Double) as input, and relies on the sumU and zipWithU functions which use streaming to compute the result: dist_fast :: UArr Double - UArr Double - Double dist_fast p1 p2 = sumDs `seq` sqrt sumDs where sumDs = sumU ds ds= zipWithU euclidean p1 p2 euclidean x y = d*d where d = x-y I've been benchmarking this function against various alternatives using the MicroBench [1] package, which allows to get accurate timings of single function calls. I've also mimicked the MicroBench approach in pure C, to get comparable timings for a C-only implementation. The C-only function is quite straightforward: double dist(int dim, double* p1, double* p2){ int i; double d = 0.0; for(i=0; i dim; i++){ d += (p1[i] - p2[i])*(p1[i] - p2[i]); } return sqrt(d); } (Note that the optimizer takes care of the p1-p2 operation appearing twice in the code). All code is attached if you'd like to play around with it. All numbers reported below are using GHC 6.10.2 and gcc 4.3.3 on Linux/x86. The compilation details can be found in the Makefile attached, but in short, I'm using -O2 -fexcess-precision or -O2 -fexcess-precision -fvia-C -optc-O3 with GHC, and -O3 with gcc. Dist.hs Description: Binary data microbench_dist.c Description: Binary data microbench_dist.hs Description: Binary data dist_c.c Description: Binary data dist_c.h Description: Binary data Makefile Description: Binary data Now the bad news: for small dimensions, e.g. 2D/3D-space, the dist_fast function is 70-240x slower than a pure C implementation, depending on the architecture. For example, in 2D-space on an Intel Pentium 4 (3.0GHz, 1M L2 cache), a single call to dist_fast takes about 1.75 microseconds (or 0.0175s), while a call to dist_c (C implementation of Eucl. dist), takes about 0.025 microseconds (70x slowdown). On a Core 2 2.5GHz with 6MB L2 this maps to 1.9 and 0.008 microseconds, resp. (i.e. 240x slower), while on a Core i7 2.66GHz with 8MB L2 the numbers are 1.53 and 0.011 microseconds (i.e. 140x slower). For larger dimensions, the gap becomes less big, but is still worrying: 10D: 40-110x; 100D: 10-17x; 1kD: 2.5x-6x. I'm mostly interested in the range 10D to 100D, so seeing that Haskell is over 10x and up to 100x slower than C is kind of making me cry. I've tried some things to improve on this without much luck, on the contrary: *) Marking dist_fast for inlining makes things worse; in general the inlined version is 2x slower for low dimensionality, and even 5x slower for larger dimensionality. This was somewhat surprising to me... *) In a moment of weakness, I used the Foreign Function Interface to call the dist_c C-only implementation from Haskell. Unfortunately, there seems to be a lot of overhead in calling dist_c from Haskell. Most of the performance gain from using C melts away, and sometimes the performance of the FFI'd dist_c is 15-30% worse than the native dist_fast version (especially at low dimensionality). Only for the largest dimensionalities (10k-100kD), the FFI'd version reaches the performance of the native C approach. But, since I'm mostly interested in the 10-100D range, this is of little use to me. One thing I noticed is that compiling through C using -fvia-C -optc-O3 might be a bad idea, depending on your system. On an Intel Pentium 4 system, -fvia-C -optc-O3 was giving me a speedup of up 70% (large dim.), while on Core 2 and Core i7 it resulted in a slowdown of 15-20% ! I was using roughly equal versions of GCC with this, i.e. a self-bootstrapped GCC 4.3.x. So, my question to the list if simple: how can I get better performance out of a Haskell-only approach? Any comments/suggestions are highly appreciated. I'd prefer a Haskell-only approach, but my main concern is speed. The Euclidean distance function will be used quite heavily in various tools. I currently have a C-version of some of the tools, but the amount of code that is needed for those tools is becoming ridiculously big. I believe using Haskell will allow me to come up with a more easy to maintain code base. However, I don't want to pay a huge price for this in terms of performance. greetings, Kenneth [1] MicroBench: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/microbench -- Kenneth Hoste Paris research group
Re: [Haskell-cafe] possible memory leak in uvector 0.1.0.3
On Mar 3, 2009, at 11:10 , Manlio Perillo wrote: Manlio Perillo ha scritto: [...] The other program, with a lot of array concatenations, still eats a lot of memory... Concatenating arrays generally copies data. Which uses memory. Of course, but why the garbage collector does not release this temporary used memory? Note that I'm using the -F1 flag with the RST. Maybe it is a problem with IntMap, when there are a lot of keys? It *is* a problem with IntMap. I have changed the program to not use any array concatenation, and it still requires a lot of memory. Does esist a data structure that is able to store something like 480189 keys with efficient memory usage? I ran into the same problem when I first organized the IntMap to use user IDs as keys... The problem is the huge amount of keys, and the small UArrays as values. The overhead of both the IntMap and the UArray data types is just way too big with 480k different keys... I never looked into it thoroughly, but if you look at the definition of IntMap, each key causes several words of overhead, along with one word or so for each UArray. K. -- Kenneth Hoste Paris research group - ELIS - Ghent University, Belgium email: kenneth.ho...@elis.ugent.be website: http://www.elis.ugent.be/~kehoste blog: http://boegel.kejo.be ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] help optimizing memory usage for a program
On Mar 2, 2009, at 19:13 , Manlio Perillo wrote: Manlio Perillo ha scritto: [...] moreover, you may set upgrowing factor. with a g.f. of 1.5, for example, memory will be collected once heap will become 1.5x larger than real memory usage after last GC. this effectively guarantees that memory overhead will never be over this factor Thanks. This seems to be effective (but it also reduce performances). 3) With -F1 option [...] I have to parse the whole data set, to check if memory usage is good. Ok, done: real 49m7.369s user 45m21.642s sys0m18.893s 814 MB used This is better memory usage, respect to: real 7m17.853s user 3m38.506s sys0m7.612s 1586 MB used However, Kenneth Hoste reported (http://boegel.kejo.be/): 26 minutes, with 700 MB used. Maybe he was using the latest GHC version. I would also like to check how performances are with other functional languages. The 26m/700MB I mentioned on my blog was on my ancient PowerBook G4 (1.5GHz PowerPC G4, 1.25G). I redid the same experiment on our iMac (Core2 Duo, 2.0 GHz, 3.0G), i.e.: - read in all the data - count the number of keys in the IntMap (which should be 17,770, i.e. the number of movies) - compute the mean overall movie rating (which should be 3.6033) That task was done in 3m42s, using just 632M of memory, using the following command: ./netflix ./training_set/ +RTS -A128M -s The -A option makes sure GC isn't cleaning up stuff too frequently, while -s just reports some statistics. The way in which I'm reading in the data is somewhat different from yours. I construct the IntMap from the ground up, i.e. starting with an empty IntMap and using foldM, together with a function readMovie with the following type: readMovie :: IntMap (UArray Int Int, UArray Int Word8) - FilePath - IO (IntMap (UArray Int Int, UArray Int Word8)) In readMovie, I'm using the 'insert' function provided by the IntMap module, which justs insert a new key-value pair in the existing IntMap. Your approach is very different: you create 17,770 IntMaps with a single key/value pair in them, and then use union to combine them all. I profiled your approach on the same iMac (using the same +RTS options), and it needed 4m33s to run, using 948M of memory. I think my approach is turning out better because I'm: - building up the IntMap using 'empty' and 'insert', instead of combining 17,770 'singleton' IntMaps (which probably results better GC behavior) - using UArray instead of Urr (although I don't know if that actually makes a difference here) I hope this helps you with figuring out what the bottleneck is on your side. It took me several days to come up with this approach, with the help from various Haskellers at IRC, so I'm surely no expert... I've thrown my current code online at http://boegel.kejo.be/files/Netflix_read-and-parse_24-02-2009.hs , let me know if it's helpful in any way... Also, I was indeed using GHC 6.10.1, although I'm unsure to what extent that matter. greetings, Kenneth -- Kenneth Hoste Paris research group - ELIS - Ghent University, Belgium email: kenneth.ho...@elis.ugent.be website: http://www.elis.ugent.be/~kehoste blog: http://boegel.kejo.be ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] memory-efficient data type for Netflix data - UArray Int Int vs UArray Int Word8
On Feb 26, 2009, at 13:00 , Manlio Perillo wrote: Kenneth Hoste ha scritto: Hello, I'm having a go at the Netflix Prize using Haskell. Yes, I'm brave. [...] To see if I could efficiently represent the data set in this way, I wrote a small Haskell program (attached) which uses the following data type: From what I see, to append a new integer to the Array, you convert the array to a list, append the new element to the list, and then convert to array again. Isn't this a bit inefficient? Yes, performance-wise this is terribly inefficient, I agree. But, it was just an artefact of how the raw data is organized. My main concern was the memory usage of the huge IntMap with UArray elements. Once I solved that, I would be able to get around the performance issue by reorganizing the raw data. However, as I posted yesterday, I've been able to circumvent the issue by rethinking my data type, i.e. using the ~18K movie IDs as key instead of the 480K user IDs, which radically limits the overhead... That way, I'm able to fit the data set in 700M of memory, without having to reorganize the raw data. The uvector package implements a vector of unboxed types, and has an snocU operation, to append an element to the array. I don't know how efficient it is, however. By the way, about uvector: it has a Stream data type, and you can build a vector from a stream. Thanks for letting me know, I'll keep this in mind. greetings, Kenneth -- Kenneth Hoste Paris research group - ELIS - Ghent University, Belgium email: kenneth.ho...@elis.ugent.be website: http://www.elis.ugent.be/~kehoste blog: http://boegel.kejo.be ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] memory-efficient data type for Netflix data - UArray Int Int vs UArray Int Word8
Hello, I'm having a go at the Netflix Prize using Haskell. Yes, I'm brave. I kind of have an algorithm in mind that I want to implement using Haskell, but up until now, the main issue has been to find a way to efficiently represent the data... For people who are not familiar with the Netflix data, in short, it consist of roughly 100M (1e8) user ratings (1-5, integer) for 17,770 different movies, coming from 480,109 different users. The idea I have in mind requires fast access to all the ratings for a particular user, so I was thinking about an IntMap in terms of user ids, which maps to movie id/rating pairs somehow. To see if I could efficiently represent the data set in this way, I wrote a small Haskell program (attached) which uses the following data type: testMemSizeUArray_UArray_Word8.hs Description: Binary data type data = IntMap (UArray Int Word8) -- map of user IDs to ratings (lacks movie IDs) For convenience, and because I've been discussing this with various people in #haskell @ IRC, the code is also available here: http://hpaste.org/fastcgi/hpaste.fcgi/view?id=1671#a1671 . I'm aware that the way in which the UArray's are constructed (i.e. by adding a single element each time), is far from efficient performance-wise, but that's not the point here... By reformatting the raw data, I can easily read in the data more efficiently. The issue I ran into is that changing the data type to the following, doesn't yield any significant different in memory usage. type data = IntMap (UArray Int Int) -- use Int instead of Word8 for a user rating Someone (dolio) @ #haskell suggested that maybe UArray is not byte packed for Word8, which would cause little difference with a UArray containing Int's, but someone else (dons @ #ghc) was able to tell me it _is_ byte packed. Does anyone know why the Word8 version is not significantly better in terms of memory usage? greetings, Kenneth PS: My adventures on trying to tackle the Netflix Prize problem with Haskell can be followed at http://boegel.kejo.be. -- Kenneth Hoste ELIS - Ghent University email: kenneth.ho...@elis.ugent.be blog: http://www.elis.ugent.be/~kehoste/blog website: http://www.elis.ugent.be/~kehoste ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] memory-efficient data type for Netflix data - UArray Int Int vs UArray Int Word8
On Feb 23, 2009, at 19:57 , Don Stewart wrote: bos: 2009/2/23 Kenneth Hoste kenneth.ho...@ugent.be Does anyone know why the Word8 version is not significantly better in terms of memory usage? Yes, because there's a typo on line 413 of Data/Array/Vector/Prim/ BUArr.hs. How's that for service? :-) UArray or UArr? Well, I'm using UArray, but I'm willing to consider other suitable containers... As long as they are memory efficient. :-) The typical usage of a UArray will be getting all it's contents, and converting it to a list to easily manipulate (filter, ...). So, maybe another data type allows me to store the data in a limited amount of memory (which is my main concern now)... K. -- Kenneth Hoste ELIS - Ghent University email: kenneth.ho...@elis.ugent.be blog: http://www.elis.ugent.be/~kehoste/blog website: http://www.elis.ugent.be/~kehoste ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] LLVM back end
On 20 Nov 2006, at 08:27, Michael T. Richter wrote: I've been eyeing LLVM[1] as interesting technology -- brief executive summary: a virtual machine suited as the back end of compiler output with optimised native code then coming from it as either JIT-based execution of the LLVM bytecode or as a further compilation step -- and couldn't help but immediately think of the possibility of one of the Haskell compiler projects providing an LLVM code generator. I think this would help in several areas: it could make porting the compiler to other architectures -- including oddball ones that would be too small to otherwise support -- easier; it could help remove the nigh-ubiquitous reliance upon GCC as a back-end (while I think that GCC is a pretty good piece of software, I'm not sure it's really suited to its current role as the do-everything back end); it could leverage some of the really interesting work that's going on in optimisation technology by letting one VM's optimiser do the work for any number of languages; it could improve interaction between source code written in multiple languages. Is this me opening up a Pandora's Box of ignorance here? Or is LLVM potentially interesting? (And were someone motivated into perhaps trying to make an LLVM back-end, where would one start to poke around in, say, the GHC codebase to even begin to implement this? And how insane would they be driven by the process?) I've been looking at LLVM for a while too now, for research purposes. And one of the big downsides (at least for me), is the lack of a real simulator which mimicks the virtual machine. They do have a JIT compiler which allows you to execute LLVM bytecode on a number of platforms, but I'm interested in analyzing the dynamic behaviour of the LLVM bytecode, and JIT'ing doesn''t allow that. They also have an interpreter, which is probably quite similar to a simulator, but that's just horrible slow because of the SSA (hence, no registers) system used. To get to the point: I think Haskell might be a great candidate for building an LLVM bytecode simulator. I have been thinking about it, but because of my lack of coding experience (and understanding of Monads), I haven't started a project yet. If some people would be interested in such a thing though, we might join to make this a succes. I'm not LLVM expert, and surely no Haskell expert either, but I think such a project can be pretty interesting. greetings, Kenneth -- Statistics are like a bikini. What they reveal is suggestive, but what they conceal is vital (Aaron Levenstein) Kenneth Hoste ELIS - Ghent University [EMAIL PROTECTED] http://www.elis.ugent.be/~kehoste ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Request for subject suggestions for a series of informal top 10 Haskell articles
Greetings, As you all might know, the Haskell community has it's own informal magazine, The Monad.Reader (http://haskell.org/tmrwiki). I'm planning to write (with some support from the frequent users of #haskell @ irc.freenode.net) a series of articles for TMR, which describe some aspect of Haskell. Because of the informal (but serious and thorough) style of TMR, I'd like to do this using a top 10 list regarding some subject. Every article will list 10 of most famous, most frequently used or most usefull things regarding that subject. I'd like to ask for suggestions on each subject when I'm writing the article on this mailing list, because I don't know the Haskell world very well (I've only been working with it for a year know, in my spare time). This first mail is intented to ask for suggestions regarding future subjects for the articles. Suggestions are welcome by replying, or by adding them to the 'take 10 [1..]'-wiki at http://haskell.org/tmrwiki/TopTenHaskell . Also other suggestions or remarks are very welcome. I hope this series of articles can provide a source of information for both new and experienced users, because you there's always room to discover new aspects of your favourite language. Kenneth a.k.a. boegel ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Dread __DISCARD__
Steven Elkins wrote: Hello everyone, I'm a Haskell newbie trying out various programs from the web. I'm trying to compile one called bjpop-ray (from Bernie Pope, I think) and I hit this at link-time: snip Can you tell us where you got bjpop-ray ? I wrote my own raytracer in Haskell, and would like to check this one out too... K. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe