Do Ignite and Alluxio offer reasonable means of transferring data, in memory,
from Spark to MPI? A straightforward way to transfer data is to use piping, but
unless you have MPI processes running in a one-to-one mapping to the Spark
partitions, this will require some complicated logic to get working.
I have Scala Spark code for computing a matrix factorization. I'd like to
make it possible to use this code from PySpark, so users can pass in a
Python RDD and receive one back without knowing or caring that Scala code is
being called.
Please point me to an example of code (e.g. somewhere in the
I am simply trying to load an RDD from disk with
transposeRowsRDD.avro(baseInputFname).rdd.map( )
and I get this error in my log:
16/02/04 11:44:07 ERROR TaskSchedulerImpl: Lost executor 7 on nid00788:
Remote RPC client disassociated. Likely due to containers exceeding
thresholds, or network
To clarify, that's the tail of the node stderr log, so the last message shown
is at the EOF.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/cause-of-RPC-error-tp26151p26152.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
I transpose a matrix (colChunkOfA), stored row-wise as a 200-by-54843210
Array[Array[Float]], into another matrix (rowChunk), also stored row-wise as
a 54843210-by-200 Array[Array[Float]], using the following code:
val rowChunk = new Array[Tuple2[Int,Array[Float]]](numCols)
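For reference, here is a tiny self-contained version of that transpose (dimensions shrunk from 200-by-54843210 to 2-by-3, and the loop body is my assumption about what the elided code does):

```scala
// colChunkOfA is numRows-by-numCols stored row-wise; rowChunk pairs each
// output row index j with column j of the input, giving the
// numCols-by-numRows transpose, also stored row-wise.
val colChunkOfA: Array[Array[Float]] =
  Array(Array(1f, 2f, 3f), Array(4f, 5f, 6f))  // 2-by-3
val numRows = colChunkOfA.length
val numCols = colChunkOfA(0).length

val rowChunk = new Array[Tuple2[Int, Array[Float]]](numCols)
var j = 0
while (j < numCols) {
  val row = new Array[Float](numRows)
  var i = 0
  while (i < numRows) { row(i) = colChunkOfA(i)(j); i += 1 }
  rowChunk(j) = (j, row)
  j += 1
}
```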
When I attempt to create a DataFrame of Array[Double] rows, I get an error
about RDD[Array[Double]] not having a toDF function:
import sqlContext.implicits._
val testvec = Array( Array(1.0, 2.0, 3.0, 4.0), Array(5.0, 6.0, 7.0, 8.0))
val testrdd = sc.parallelize(testvec)
testrdd.toDF
gives
:29: error:
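The likely cause is that the implicit toDF conversion only exists for element types with an encoder, i.e. Product types such as tuples and case classes, which a bare Array[Double] is not. A common workaround (my assumption about the fix; only the plain-Scala wrapping step is shown, since toDF itself needs a live SQLContext) is to wrap each array in a Tuple1:

```scala
// Wrapping each Array[Double] in Tuple1 makes the element type a Product,
// which the implicit toDF conversion accepts. In Spark you would then run:
//   sc.parallelize(wrapped).toDF("features")
val testvec = Array(Array(1.0, 2.0, 3.0, 4.0), Array(5.0, 6.0, 7.0, 8.0))
val wrapped: Array[Tuple1[Array[Double]]] = testvec.map(Tuple1(_))
```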
I've been using the same method to launch my clusters then pull my data from
S3 to local hdfs:
$SPARKHOME/ec2/spark-ec2 -k mykey -i ~/.ssh/mykey.pem -s 29
--instance-type=r3.8xlarge --placement-group=pcavariants
--copy-aws-credentials --hadoop-major-version=2 --spot-price=2.8 launch
mycluster
I am trying to multiply against a large matrix that is stored in Parquet
format, so I am being careful not to store the RDD in memory, but I am
getting an OOM error from the Parquet reader:
15/12/06 05:23:36 WARN TaskSetManager: Lost task 950.0 in stage 4.0
(TID 28398, 172.31.34.233):
Is there a way to control how large the part-* files are for a Parquet
dataset? I'm currently using e.g.
results.toDF.coalesce(60).write.mode("append").parquet(outputdir)
to manually reduce the number of parts, but this doesn't map linearly to
fewer parts: I noticed that coalescing to 30 actually
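coalesce(n) only caps the number of output files; how large each one ends up depends on how rows land in the surviving partitions, which is why the mapping is not linear. One rough heuristic (my assumption, not a Spark API) is to derive n from the dataset's total size and a target file size:

```scala
// Pick a partition count so that totalBytes / n is near targetFileBytes;
// each Parquet part-* file is written from one partition.
def numPartsForTarget(totalBytes: Long, targetFileBytes: Long): Int =
  math.max(1, math.ceil(totalBytes.toDouble / targetFileBytes).toInt)
```

You would then use it as, e.g., results.toDF.coalesce(numPartsForTarget(total, 512L << 20)).write.mode("append").parquet(outputdir), where total has to be estimated or measured separately.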
I downloaded a 3.8 TB dataset from S3 to a freshly launched spark-ec2 cluster
with 16.73 TB of storage, using distcp. The dataset is a collection of tar
files of about 1.7 TB each. Nothing else was stored in HDFS, but after
completing the download, the namenode page says that 11.59 TB are in use.
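One plausible explanation (my assumption, not something stated above) is HDFS block replication: the namenode reports raw usage across all replicas, and with a replication factor of 3 the arithmetic lands close to the observed figure:

```scala
// Back-of-the-envelope check: logical dataset size times the replication
// factor approximates the raw usage the namenode page reports.
val datasetTB = 3.8
val replicationFactor = 3  // a common HDFS default; check dfs.replication
val expectedRawTB = datasetTB * replicationFactor  // ~11.4, near the 11.59 observed
```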
I'm using Spark to read in data from many files and write it back out in
Parquet format for ease of use later on. Currently, I'm using this code:
val fnamesRDD = sc.parallelize(fnames,
  math.ceil(fnames.length.toFloat / numfilesperpartition).toInt)
val results =
Never mind; when I switched to Spark 1.5.0, my code works as written and is
pretty fast! Looking at some Parquet related Spark jiras, it seems that
Parquet is known to have some memory issues with buffering and writing, and
that at least some were resolved in Spark 1.5.0.
I'm using a manual installation of Spark under Yarn to run a 30-node
r3.8xlarge EC2 cluster (each node has 244 GB RAM, 600 GB SSD). All my code
runs much faster on a cluster launched w/ the spark-ec2 script, but there's
a mysterious problem with nodes becoming inaccessible, so I switched to
using
I'm using the spark-ec2 script to launch a 30-node r3.8xlarge cluster.
Occasionally several nodes will become unresponsive: I will notice that hdfs
complains it can't find some blocks, then when I go to restart hadoop, the
messages indicate that the connection to some nodes timed out, then when I
I'm trying to load a 1 TB file whose lines i,j,v represent the values of a
matrix given as A_{ij} = v so I can convert it to a Parquet file. Only some
of the rows of A are relevant, so the following code first loads the
triplets as text, splits them into Tuple3[Int, Int, Double], drops triplets
When I call the following minimal working example, the accumulator matrix is
32-by-100K, and each executor has 64 GB, but I get an out-of-memory error:
Exception in thread main java.lang.OutOfMemoryError: Requested array size
exceeds VM limit
Here BDM is a Breeze DenseMatrix.
object
I'm trying to compute the Frobenius norm error in approximating an
IndexedRowMatrix A with the product L*R where L and R are Breeze
DenseMatrices.
I've written the following function that computes the squared error over
each partition of rows then sums up to get the total squared error (ignore
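For what it's worth, the row-at-a-time computation looks like this in plain Scala (the names and the row-major layout of L and R are my assumptions, and Spark/Breeze are left out so it runs standalone): each row i contributes sum_j (A_ij - (L*R)_ij)^2, and the per-partition sums add up to the total squared error.

```scala
// Squared Frobenius error contribution of row i of A against row i of L*R,
// with L (m-by-k) and R (k-by-n) stored row-major as Array[Array[Double]].
def rowSquaredError(i: Int, aRow: Array[Double],
                    l: Array[Array[Double]], r: Array[Array[Double]]): Double = {
  val k = r.length
  var err = 0.0
  var j = 0
  while (j < aRow.length) {
    var lr = 0.0
    var t = 0
    while (t < k) { lr += l(i)(t) * r(t)(j); t += 1 }  // (L*R)_ij
    val d = aRow(j) - lr
    err += d * d
    j += 1
  }
  err
}

// With L the identity, L*R = R, so exact rows of R give zero error.
val lId = Array(Array(1.0, 0.0), Array(0.0, 1.0))
val rEx = Array(Array(1.0, 2.0), Array(3.0, 4.0))
```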
I'm using the following function to compute B*A where B is a 32-by-8mil
Breeze DenseMatrix and A is an 8mil-by-100K IndexedRowMatrix.
// Computes BA where B is a local matrix and A is distributed: let b_i denote
// the ith column of B and a_i denote the ith row of A; then BA = sum(b_i a_i).
def
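Here is a self-contained version of that outer-product formulation using plain arrays (no Spark or Breeze; aRows plays the role of the IndexedRowMatrix rows collected from one partition):

```scala
// BA = sum over i of (column i of B) outer (row i of A). B is m-by-p stored
// row-major, so b(r)(i) is entry (r, i); each row of A arrives as an
// (index, values) pair. The result is m-by-n.
def multiplyByOuterProducts(b: Array[Array[Double]],
                            aRows: Seq[(Int, Array[Double])],
                            n: Int): Array[Array[Double]] = {
  val m = b.length
  val result = Array.ofDim[Double](m, n)
  for ((i, aRow) <- aRows; r <- 0 until m; c <- 0 until n)
    result(r)(c) += b(r)(i) * aRow(c)  // accumulate b_i a_i into the result
  result
}
```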
I get the same error even when I define covOperator not to use a matrix at
all:
def covOperator(v : BDV[Double]) :BDV[Double] = { v }
I'm trying to compute the eigendecomposition of a matrix in a portion of my
code, using mllib.linalg.EigenValueDecomposition
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/EigenValueDecomposition.scala
)
as follows:
val tol = 1e-10
val maxIter
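EigenValueDecomposition delegates to ARPACK's Lanczos under the hood; purely to illustrate how a (tol, maxIter) pair bounds such an iteration, here is plain power iteration (a deliberately simpler method, not what MLlib runs) on a small symmetric matrix:

```scala
// Power iteration for the dominant eigenvalue of a symmetric matrix a,
// stopping once successive Rayleigh-quotient estimates differ by less
// than tol, or after maxIter sweeps, whichever comes first.
def dominantEigenvalue(a: Array[Array[Double]], tol: Double, maxIter: Int): Double = {
  val n = a.length
  def matVec(v: Array[Double]): Array[Double] =
    Array.tabulate(n)(i => (0 until n).map(j => a(i)(j) * v(j)).sum)
  var v = Array.fill(n)(1.0 / math.sqrt(n))
  var lambda = 0.0
  var iter = 0
  var converged = false
  while (iter < maxIter && !converged) {
    val w = matVec(v)
    val norm = math.sqrt(w.map(x => x * x).sum)
    v = w.map(_ / norm)
    // Rayleigh quotient v^T A v as the current eigenvalue estimate
    val next = v.zip(matVec(v)).map { case (x, y) => x * y }.sum
    converged = math.abs(next - lambda) < tol
    lambda = next
    iter += 1
  }
  lambda
}
```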