feasibility of ignite and alluxio for interfacing MPI and Spark

2016-09-16 Thread AlexG
Do Ignite and Alluxio offer reasonable means of transferring data, in memory, from Spark to MPI? A straightforward way to transfer data is to use piping, but unless you have MPI processes running in a one-to-one mapping to the Spark partitions, this will require some complicated logic to get working.
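For reference, the piping approach mentioned above can be sketched with RDD.pipe; the wrapper script name, the text serialization, and the paths below are assumptions, not part of the original post.

    // Minimal sketch of the piping approach, assuming a hypothetical wrapper
    // script "mpi_worker.sh" that reads rows as text on stdin and writes
    // results as text on stdout. One external process is launched per partition,
    // which is exactly the partition-to-process mapping issue described above.
    import org.apache.spark.{SparkConf, SparkContext}

    object PipeToMpiSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pipe-to-mpi"))

        val rows = sc.textFile("hdfs:///data/matrix.txt")  // placeholder input path

        // Each partition is streamed through one instance of the external program.
        val piped = rows.pipe("./mpi_worker.sh")

        piped.saveAsTextFile("hdfs:///data/mpi_output")    // placeholder output path
        sc.stop()
      }
    }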

how to write pyspark interface to scala code?

2016-04-12 Thread AlexG
I have Scala Spark code for computing a matrix factorization. I'd like to make it possible to use this code from PySpark, so users can pass in a Python RDD and receive one back without knowing or caring that Scala code is being called. Please point me to an example of code (e.g. somewhere in the
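One common pattern, sketched here only under assumptions (the object and method names are hypothetical), is to expose a Scala entry point that works on JavaRDDs; PySpark can then reach it through the py4j gateway (sc._jvm) and convert the Python RDD to and from its Java counterpart.

    // Hypothetical Scala-side entry point callable from PySpark via py4j.
    // Using JavaRDD[String] keeps the language boundary simple: the Python side
    // passes rows as strings and wraps the returned JavaRDD back into a Python RDD.
    import org.apache.spark.api.java.JavaRDD

    object MatrixFactorizationBridge {
      // Takes "i,j,value" strings, returns factor entries in the same format.
      def factorize(rows: JavaRDD[String], rank: Int): JavaRDD[String] = {
        val parsed = rows.rdd.map { line =>
          val Array(i, j, v) = line.split(",")
          (i.toInt, j.toInt, v.toDouble)
        }
        // ... run the existing Scala factorization on `parsed` here ...
        val factors = parsed.map { case (i, j, v) => s"$i,$j,$v" }  // placeholder
        JavaRDD.fromRDD(factors)
      }
    }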

cause of RPC error?

2016-02-04 Thread AlexG
I am simply trying to load an RDD from disk with transposeRowsRDD.avro(baseInputFname).rdd.map( ) and I get this error in my log: 16/02/04 11:44:07 ERROR TaskSchedulerImpl: Lost executor 7 on nid00788: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network

Re: cause of RPC error?

2016-02-04 Thread AlexG
To clarify, that's the tail of the node's stderr log, so the last message shown is the final line of the file.

failure to parallelize an RDD

2016-01-12 Thread AlexG
I transpose a matrix (colChunkOfA), stored as a 200-by-54843210 array of rows in Array[Array[Float]] format, into another matrix (rowChunk), a 54843210-by-200 matrix also stored row-wise as an Array[Array[Float]], using the following code: val rowChunk = new Array[Tuple2[Int,Array[Float]]](numCols)
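If it is the sc.parallelize of that 54843210-element driver-side array that fails, one hedged alternative is to do the transpose as a distributed shuffle, so the transposed matrix never exists as a single array on the driver; this is only a sketch of the idea, not the original code.

    // Sketch: distributed transpose of a small number of very long rows.
    // The 200 original rows are parallelized with their indices, exploded into
    // (columnIndex, (rowIndex, value)) entries, and regrouped by column, so the
    // 54843210-by-200 result is built as an RDD rather than a driver-side array.
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def transposeRows(sc: SparkContext,
                      colChunkOfA: Array[Array[Float]]): RDD[(Int, Array[Float])] = {
      val numRows = colChunkOfA.length   // 200 in the original post
      val rowsRdd = sc.parallelize(colChunkOfA.zipWithIndex, numRows)

      rowsRdd
        .flatMap { case (row, rowIdx) =>
          row.iterator.zipWithIndex.map { case (v, colIdx) => (colIdx, (rowIdx, v)) }
        }
        .groupByKey()
        .mapValues { entries =>
          val out = new Array[Float](numRows)
          entries.foreach { case (rowIdx, v) => out(rowIdx) = v }
          out
        }
    }

The 200 source rows still have to be shipped from the driver once, just as 200 moderately sized tasks rather than one giant parallelized array.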

how to make a dataframe of Array[Double]?

2015-12-14 Thread AlexG
When I attempt to create a DataFrame of Array[Double], I get an error about RDD[Array[Double]] not having a toDF function: import sqlContext.implicits._ val testvec = Array( Array(1.0, 2.0, 3.0, 4.0), Array(5.0, 6.0, 7.0, 8.0)) val testrdd = sc.parallelize(testvec) testrdd.toDF gives :29: error:
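A common cause is that an RDD of bare Array[Double] has no Product type for the toDF implicit to work with; a minimal sketch of the usual workaround, assuming the same sqlContext as above, is to wrap each array in a Tuple1 (or a case class).

    import sqlContext.implicits._  // assumes a SQLContext named sqlContext, as in the post

    val testvec = Array(Array(1.0, 2.0, 3.0, 4.0), Array(5.0, 6.0, 7.0, 8.0))

    // Wrapping each Array[Double] in a Tuple1 gives each row a Product type,
    // so the toDF implicit applies; the single column is named explicitly.
    val testDF = sc.parallelize(testvec).map(Tuple1(_)).toDF("values")

    testDF.printSchema()  // values: array<double>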

distcp suddenly broken with spark-ec2 script setup

2015-12-09 Thread AlexG
I've been using the same method to launch my clusters and then pull my data from S3 to local HDFS: $SPARKHOME/ec2/spark-ec2 -k mykey -i ~/.ssh/mykey.pem -s 29 --instance-type=r3.8xlarge --placement-group=pcavariants --copy-aws-credentials --hadoop-major-version=2 --spot-price=2.8 launch mycluster

Parquet runs out of memory when reading in a huge matrix

2015-12-05 Thread AlexG
I am trying to multiply against a large matrix that is stored in Parquet format, so I am being careful not to store the RDD in memory, but I am getting an OOM error from the Parquet reader: 15/12/06 05:23:36 WARN TaskSetManager: Lost task 950.0 in stage 4.0 (TID 28398, 172.31.34.233):

controlling parquet file sizes for faster transfer to S3 from HDFS

2015-11-26 Thread AlexG
Is there a way to control how large the part-* files are for a Parquet dataset? I'm currently using e.g. results.toDF.coalesce(60).write.mode("append").parquet(outputdir) to manually reduce the number of parts, but this doesn't map linearly to fewer parts: I noticed that coalescing to 30 actually
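As a hedged note on the behavior described: coalesce only merges existing partitions without a shuffle, so the merged parts can end up quite uneven, whereas repartition does a full shuffle and tends to give part files of more uniform size, at the cost of that shuffle. The numbers below are placeholders.

    // Sketch: two ways to control the number (and indirectly the size) of
    // part-* files written for a Parquet dataset.
    val df = results.toDF

    // coalesce(n): no shuffle, merges existing partitions; cheap, but the
    // resulting part files can vary a lot in size.
    df.coalesce(60).write.mode("append").parquet(outputdir)

    // repartition(n): full shuffle, roughly even partitions; more expensive,
    // but the part sizes track n much more predictably.
    df.repartition(60).write.mode("append").parquet(outputdir)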

Why does a 3.8 TB dataset take up 11.59 TB on HDFS

2015-11-24 Thread AlexG
I downloaded a 3.8 TB dataset from S3 to a freshly launched spark-ec2 cluster with 16.73 TB of storage, using distcp. The dataset is a collection of tar files of about 1.7 TB each. Nothing else was stored in HDFS, but after completing the download, the namenode page says that 11.59 TB are in use.

out of memory error with Parquet

2015-11-13 Thread AlexG
I'm using Spark to read in data from many files and write it back out in Parquet format for ease of use later on. Currently, I'm using this code: val fnamesRDD = sc.parallelize(fnames, ceil(fnames.length.toFloat/numfilesperpartition).toInt) val results =
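Since the code above is cut off, here is only a general sketch of the read-many-files-then-write-Parquet pattern it describes; the record schema, the per-file parser, and the output location are placeholders, not the original code.

    // Sketch: distribute the list of input file names, parse each file inside
    // its partition, and write the combined result once as Parquet.
    import scala.math.ceil

    case class Record(id: Long, value: Double)  // placeholder schema

    val numPartitions = ceil(fnames.length.toFloat / numfilesperpartition).toInt
    val fnamesRDD = sc.parallelize(fnames, numPartitions)

    val results = fnamesRDD.flatMap { fname =>
      // placeholder parser: read one file and emit its records
      scala.io.Source.fromFile(fname).getLines().map { line =>
        val Array(id, v) = line.split(",")
        Record(id.toLong, v.toDouble)
      }
    }

    import sqlContext.implicits._
    results.toDF().write.parquet(outputdir)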

Re: out of memory error with Parquet

2015-11-13 Thread AlexG
Never mind; when I switched to Spark 1.5.0, my code works as written and is pretty fast! Looking at some Parquet-related Spark JIRAs, it seems that Parquet is known to have some memory issues with buffering and writing, and that at least some of them were resolved in Spark 1.5.0.

spark on yarn is slower than spark-ec2 standalone, how to tune?

2015-08-15 Thread AlexG
I'm using a manual installation of Spark under YARN to run a 30-node r3.8xlarge EC2 cluster (each node has 244 GB RAM and 600 GB of SSD). All my code runs much faster on a cluster launched with the spark-ec2 script, but there's a mysterious problem with nodes becoming inaccessible, so I switched to using

what is the cause of, and how to recover from, unresponsive nodes with the spark-ec2 script

2015-08-12 Thread AlexG
I'm using the spark-ec2 script to launch a 30-node r3.8xlarge cluster. Occasionally several nodes will become unresponsive: I will notice that HDFS complains it can't find some blocks, then when I go to restart Hadoop, the messages indicate that the connection to some nodes timed out, then when I

spark hangs at broadcasting during a filter

2015-08-05 Thread AlexG
I'm trying to load a 1 TB file whose lines i,j,v represent the values of a matrix given as A_{ij} = v, so I can convert it to a Parquet file. Only some of the rows of A are relevant, so the following code first loads the triplets as text, splits them into Tuple3[Int, Int, Double], drops triplets
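A hedged sketch of the pipeline described (load the triplets as text, parse them to (i, j, v), drop the irrelevant rows, write Parquet); the broadcast set of wanted row indices, the delimiter, and the paths are all assumptions.

    // Sketch: filter a coordinate-format matrix down to the relevant rows
    // and persist it as Parquet. `wantedRows` stands in for however the
    // relevant row indices are actually determined.
    import sqlContext.implicits._

    val wantedRows: Set[Int] = Set(0, 17, 42)                // placeholder
    val wantedBc = sc.broadcast(wantedRows)

    val triplets = sc.textFile("hdfs:///data/triplets.txt")  // placeholder path
      .map { line =>
        val Array(i, j, v) = line.split("\\s+")
        (i.toInt, j.toInt, v.toDouble)
      }
      .filter { case (i, _, _) => wantedBc.value.contains(i) }

    triplets.toDF("i", "j", "v").write.parquet("hdfs:///data/A.parquet")  // placeholder path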

small accumulator gives out of memory error

2015-07-15 Thread AlexG
When I run the following minimal working example, the accumulator matrix is 32-by-100K and each executor has 64 GB, but I get an out-of-memory error: Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit. Here BDM is a Breeze DenseMatrix object
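Since the example itself is truncated, here is only a sketch of how a Breeze DenseMatrix accumulator along these lines can be declared with the Spark 1.x AccumulatorParam API; the per-task update is a placeholder.

    // Sketch: an accumulator over a Breeze DenseMatrix (Spark 1.x API).
    // Note that every task builds its own local 32-by-100000 contribution,
    // and the driver holds the accumulated copy.
    import breeze.linalg.{DenseMatrix => BDM}
    import org.apache.spark.AccumulatorParam

    object BDMAccumulatorParam extends AccumulatorParam[BDM[Double]] {
      def zero(initial: BDM[Double]): BDM[Double] =
        BDM.zeros[Double](initial.rows, initial.cols)
      def addInPlace(m1: BDM[Double], m2: BDM[Double]): BDM[Double] = {
        m1 += m2  // add in place to avoid allocating a third matrix
        m1
      }
    }

    val acc = sc.accumulator(BDM.zeros[Double](32, 100000))(BDMAccumulatorParam)

    sc.parallelize(0 until 32).foreach { i =>
      val contribution = BDM.zeros[Double](32, 100000)
      contribution(i, i) = 1.0  // placeholder per-task update
      acc += contribution
    }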

java heap error

2015-07-15 Thread AlexG
I'm trying to compute the Frobenius norm error in approximating an IndexedRowMatrix A with the product L*R, where L and R are Breeze DenseMatrices. I've written the following function that computes the squared error over each partition of rows and then sums up to get the total squared error (ignore
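The function body is cut off above, so the following is only a sketch of one way to get that squared Frobenius error, assuming L (n-by-k) and R (k-by-m) are small enough to broadcast; row i of L*R is rebuilt per row instead of materializing the whole product.

    // Sketch: squared Frobenius norm of A - L*R for an IndexedRowMatrix A,
    // summing per-row squared errors without forming L*R.
    import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
    import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix

    def frobeniusSqError(A: IndexedRowMatrix, L: BDM[Double], R: BDM[Double]): Double = {
      val Lb = A.rows.sparkContext.broadcast(L)
      val Rb = A.rows.sparkContext.broadcast(R)
      A.rows.map { row =>
        val i = row.index.toInt
        // Row i of L*R, computed as R^T * (row i of L) to stay vector-sized.
        val approx = Rb.value.t * Lb.value(i, ::).t
        val diff = BDV(row.vector.toArray) - approx
        diff dot diff
      }.sum()
    }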

out of memory error in treeAggregate

2015-07-15 Thread AlexG
I'm using the following function to compute B*A, where B is a 32-by-8mil Breeze DenseMatrix and A is an 8mil-by-100K IndexedRowMatrix. // computes BA where B is a local matrix and A is distributed: let b_i denote the // ith col of B and a_i denote the ith row of A, then BA = sum(b_i a_i) def
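With the function body truncated, here is only a sketch of the BA = sum(b_i a_i) formulation with treeAggregate, assuming B is broadcast; note that the zero value is the full 32-by-100K result matrix, which every task and every combine step has to hold.

    // Sketch: B*A as a sum of outer products b_i * a_i^T, accumulated with
    // treeAggregate. Each aggregation buffer is a full 32-by-100000 matrix,
    // which is where the memory pressure comes from.
    import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
    import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix

    def leftMultiply(B: BDM[Double], A: IndexedRowMatrix): BDM[Double] = {
      val Bb = A.rows.sparkContext.broadcast(B)
      val zero = BDM.zeros[Double](B.rows, A.numCols().toInt)
      A.rows.treeAggregate(zero)(
        seqOp = (acc, row) => {
          val bi = Bb.value(::, row.index.toInt)  // ith column of B
          val ai = BDV(row.vector.toArray)        // ith row of A
          acc += bi * ai.t                        // rank-1 update, in place
          acc
        },
        combOp = (m1, m2) => { m1 += m2; m1 },
        depth = 2
      )
    }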

Re: breeze.linalg.DenseMatrix not found

2015-06-29 Thread AlexG
I get the same error even when I define covOperator not to use a matrix at all: def covOperator(v: BDV[Double]): BDV[Double] = { v }

breeze.linalg.DenseMatrix not found

2015-06-29 Thread AlexG
I'm trying to compute the eigendecomposition of a matrix in a portion of my code, using mllib.linalg.EigenValueDecomposition (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/EigenValueDecomposition.scala ) as follows: val tol = 1e-10 val maxIter
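If mllib's EigenValueDecomposition turns out not to be reachable from user code (it is package-private in the releases I have looked at), one hedged fallback for modest column counts is to form the Gramian with the public RowMatrix API and decompose it locally with Breeze; this is an alternative sketch, not the approach in the post.

    // Sketch of an alternative: compute A^T A with the public RowMatrix API,
    // then take its eigendecomposition locally with Breeze. Only feasible when
    // the number of columns of A is small enough for a local n-by-n matrix.
    import breeze.linalg.{eigSym, DenseMatrix => BDM, DenseVector => BDV}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    def gramEigen(A: RowMatrix): (BDV[Double], BDM[Double]) = {
      val n = A.numCols().toInt
      val gram = A.computeGramianMatrix()              // n-by-n, collected to the driver
      val local = new BDM[Double](n, n, gram.toArray)  // column-major; symmetric anyway
      val es = eigSym(local)
      (es.eigenvalues, es.eigenvectors)                // eigenpairs of A^T A
    }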