(Yes, it is a Java binary requiring Java 6+. It runs against Hadoop 0.20.x - 2.0.x or work-alikes, or Amazon EMR. The work is in the reducer in this implementation, so you would need to hand the reducers the extra memory instead of the mappers. I think you can run the whole 20M rows of input in Myrrix, with your given ~5GB per *reducer*, if you turn up the number of reducers a bit, to -Dmapred.reduce.tasks=16 or so. It gets away with more because of a few tricks like using floats, and further partitioning the matrices.)
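For concreteness, a minimal sketch (not Myrrix's actual job driver; the class and job names here are made up) of how those two knobs might be set in Hadoop 1.x-style job-setup code, assuming the split mapred.reduce.child.java.opts property is available (older 0.20.x releases only have mapred.child.java.opts, which applies to both map and reduce tasks):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  public class ReducerSizingSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // More, smaller reducers, so each holds a smaller slice of the matrix
      conf.setInt("mapred.reduce.tasks", 16);
      // ~5GB heap per reducer, since the work happens reduce-side here
      conf.set("mapred.reduce.child.java.opts", "-Xmx5g");
      Job job = new Job(conf, "als-iteration-sketch");
      // ... set mapper/reducer/input/output as usual, then:
      // job.waitForCompletion(true);
    }
  }

The same properties can equally be passed on the command line with -D, as above, if the driver goes through ToolRunner / GenericOptionsParser.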
Either way, the issue with a large U is not in computing *U* but in computing M from U, since that is when U has to be in memory. While it's just a gut guess, 20 features sounds quite low relative to the cardinality of the input. 50-100 is more usual, but yes, you are memory constrained. Your rule of thumb is about right for this implementation, as it uses 8-byte doubles. There's other overhead in the data structure, and other data structures in memory too, of course. (A rough worked version of that estimate is sketched at the end of this message.)

Note that in this context you need to constrain Hadoop not to put 2 mappers on one machine if 2x the heap used doesn't fit in that machine's physical memory. It will fail, or at least you will get bad swapping: mapred.tasktracker.map.tasks.maximum=1 (same idea if you were using big reducers.)

In apps like this, most of the RAM is used by long-lived objects, so you can squeeze more out of a given amount of RAM by turning up the new ratio: -XX:NewRatio=12 or even higher. Otherwise you "run out" of heap while there is still a fair bit of room available but reserved for new, short-lived objects. While it won't make much difference, I recommend -XX:+UseParallelOldGC, and I do not recommend disabling UseGCOverheadLimit! With recent Java versions and this heap size it should also already turn on useful flags like -XX:+UseCompressedOops by default.

On Mon, Nov 19, 2012 at 8:29 AM, Abramov Pavel <[email protected]> wrote:
>
> Can Myrrix Computation Level run on FreeBSD? Yes, we use hadoop with
> freeBSD )
>
> Regards,
> Pavel
>
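Re the rule of thumb above, a rough back-of-the-envelope sketch, using only the numbers already in this thread (20M rows, 20 vs. 50-100 features, 8-byte doubles). It treats the 20M figure as the row dimension of U; if that's raw input tuples rather than distinct users, the real figure is smaller. The 1.5x factor is just a placeholder for data-structure overhead, not a measured number:

  public class HeapEstimateSketch {
    public static void main(String[] args) {
      long rows = 20000000L;                   // ~20M rows, assumed = rows of U
      int[] featureCounts = {20, 50, 100};
      for (int features : featureCounts) {
        long rawBytes = rows * features * 8L;  // 8-byte doubles in U
        double withOverhead = rawBytes * 1.5;  // guessed structure overhead
        System.out.printf("features=%d raw=%.1fGB with-overhead=%.1fGB%n",
            features, rawBytes / 1e9, withOverhead / 1e9);
      }
    }
  }

At 20 features that comes out around 3-5GB, consistent with the ~5GB-per-reducer figure above; at 50-100 features it grows to roughly 8-16GB before overhead, which is why more features means you are memory constrained.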
