Hi,
I'm using Intel MKL with Spark 1.6.0, which I built myself with the -Pnetlib-lgpl flag. I'm running Spark in local[4] mode and I launch it like this:

    # export LD_LIBRARY_PATH=/opt/intel/lib/intel64:/opt/intel/mkl/lib/intel64
    # bin/spark-shell ...

I have also added the following symlinks to /opt/intel/mkl/lib/intel64:

    lrwxrwxrwx 1 root root 12 Feb 1 09:18 libblas.so -> libmkl_rt.so
    lrwxrwxrwx 1 root root 12 Feb 1 09:18 libblas.so.3 -> libmkl_rt.so
    lrwxrwxrwx 1 root root 12 Feb 1 09:18 liblapack.so -> libmkl_rt.so
    lrwxrwxrwx 1 root root 12 Feb 1 09:18 liblapack.so.3 -> libmkl_rt.so

I believe (???) that I'm using Intel MKL because these warnings went away:

    16/02/01 07:49:38 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
    16/02/01 07:49:38 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS

After collectAsMap there is no further progress, but I can observe that only 1 CPU is being utilised, with the following stack trace:

    "ForkJoinPool-3-worker-7" #130 daemon prio=5 os_prio=0 tid=0x00007fbf30ab6000 nid=0xbdc runnable [0x00007fbf12205000]
       java.lang.Thread.State: RUNNABLE
        at com.github.fommil.netlib.F2jBLAS.ddot(F2jBLAS.java:71)
        at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:128)
        at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:111)
        at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:349)
        at org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:587)
        at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:561)
        at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:555)

These last few steps take more than half of the total time for a 1M x 100 dataset. The code is just:

    val clusters = KMeans.train(parsedData, 1000, 1)

Shouldn't it be utilising all the cores for the dot product? Is this a misconfiguration?

Thanks!
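
P.S. A couple of extra details in case they help. First, this is the kind of quick check I could run in spark-shell to confirm which backend netlib-java actually resolved on the driver JVM (just a sketch, I haven't captured its output yet):

    // Print the concrete BLAS/LAPACK implementations netlib-java resolved at runtime.
    // If MKL was picked up this should name NativeSystemBLAS; if it names F2jBLAS,
    // the native library was not loaded in this JVM.
    import com.github.fommil.netlib.{BLAS, LAPACK}

    println("BLAS:   " + BLAS.getInstance().getClass.getName)
    println("LAPACK: " + LAPACK.getInstance().getClass.getName)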
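
Second, to make the repro concrete, parsedData can be built the usual way from a text file of dense vectors; something like the following would reproduce the shape of my job (the path and parsing below are placeholders rather than my exact code):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Placeholder input: 1M rows of 100 space-separated doubles per line.
    val data = sc.textFile("data/points_1m_x_100.txt")
    val parsedData = data
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    // k = 1000 clusters, maxIterations = 1, as in the line quoted above.
    val clusters = KMeans.train(parsedData, 1000, 1)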