MLlib/kmeans newbie question(s)

2015-03-07 Thread Pierce Lamb
each tweet into a vector, randomly picks some clusters, then runs kmeans to group the tweets (at a really high level, the clusters, i assume, would be common topics). As such, when it checks each tweet to see if models.predict == 1, different sets of tweets should appear under each cluster
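A minimal sketch of that flow in Spark 1.x Scala (hedged: the pre-tokenized tweets RDD and the use of MLlib's HashingTF as the vectorizer are assumptions, not the original poster's code):

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.feature.HashingTF
  import org.apache.spark.rdd.RDD

  // tweets: RDD[Seq[String]], each tweet already tokenized into words (assumed input)
  def clusterTweets(tweets: RDD[Seq[String]], k: Int) = {
    val tf = new HashingTF(1 << 18)               // hash words into fixed-size vectors
    val vectors = tweets.map(t => tf.transform(t)).cache()
    val model = KMeans.train(vectors, k, 20)      // k clusters, 20 iterations
    // pair each tweet with the cluster it is assigned to
    tweets.zip(vectors).map { case (words, v) => (model.predict(v), words) }
  }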

Re: high GC in the Kmeans algorithm

2015-02-20 Thread Xiangrui Meng
in the memory from the web ui. On Wed, Feb 11, 2015 at 4:25 PM, Sean Owen so...@cloudera.com wrote: Are you actually using that memory for executors? On Wed, Feb 11, 2015 at 8:17 AM, lihu lihu...@gmail.com wrote: Hi, I run the kmeans(MLlib) in a cluster

Re: high GC in the Kmeans algorithm

2015-02-17 Thread lihu
ui. On Wed, Feb 11, 2015 at 4:25 PM, Sean Owen so...@cloudera.com wrote: Are you actually using that memory for executors? On Wed, Feb 11, 2015 at 8:17 AM, lihu lihu...@gmail.com wrote: Hi, I run the kmeans(MLlib) in a cluster with 12 workers. Every

Re: high GC in the Kmeans algorithm

2015-02-17 Thread Xiangrui Meng
: Are you actually using that memory for executors? On Wed, Feb 11, 2015 at 8:17 AM, lihu lihu...@gmail.com wrote: Hi, I run the kmeans(MLlib) in a cluster with 12 workers. Every worker has 128G RAM and 24 cores. I run 48 tasks on one machine. The total data is just 40GB

getting the cluster elements from kmeans run

2015-02-11 Thread Harini Srinivasan
Hi, Is there a way to get the elements of each cluster after running kmeans clustering? I am using the Java version. thanks

Re: getting the cluster elements from kmeans run

2015-02-11 Thread VISHNU SUBRAMANIAN
running kmeans clustering? I am using the Java version. thanks

Re: getting the cluster elements from kmeans run

2015-02-11 Thread Suneel Marthi
KMeansModel only returns the cluster centroids. To get the # of elements in each cluster, try calling kmeans.predict() on each of the points in the data used to build the model. See https://github.com/OryxProject/oryx/blob/master/oryx-app-mllib/src/main/java/com/cloudera/oryx/app/mllib/kmeans
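In code, that approach looks roughly like this (a sketch, assuming Spark 1.x MLlib and that data is the RDD[Vector] the model was trained on):

  import org.apache.spark.SparkContext._   // pair-RDD functions in pre-1.3 Spark
  import org.apache.spark.mllib.clustering.KMeansModel
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD

  // key every point by its predicted cluster, then gather the members per cluster
  def clusterElements(model: KMeansModel, data: RDD[Vector]) =
    data.map(p => (model.predict(p), p)).groupByKey()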

Re: high GC in the Kmeans algorithm

2015-02-11 Thread Sean Owen
. On Wed, Feb 11, 2015 at 4:25 PM, Sean Owen so...@cloudera.com wrote: Are you actually using that memory for executors? On Wed, Feb 11, 2015 at 8:17 AM, lihu lihu...@gmail.com wrote: Hi, I run the kmeans(MLlib) in a cluster with 12 workers. Every worker has 128G RAM

Re: high GC in the Kmeans algorithm

2015-02-11 Thread lihu
using that memory for executors? On Wed, Feb 11, 2015 at 8:17 AM, lihu lihu...@gmail.com wrote: Hi, I run the kmeans(MLlib) in a cluster with 12 workers. Every worker has 128G RAM and 24 cores. I run 48 tasks on one machine. The total data is just 40GB

high GC in the Kmeans algorithm

2015-02-11 Thread lihu
Hi, I run the kmeans(MLlib) in a cluster with 12 workers. Every worker has 128G RAM and 24 cores. I run 48 tasks on one machine. The total data is just 40GB. When the dimension of the data set is about 10^7, each task's duration is about 30s, but the cost of GC is about 20s. When I

Re: KMeans with large clusters Java Heap Space

2015-01-30 Thread derrickburns
bytes (8 bytes per double). You should try to use my new generalized kmeans clustering package https://github.com/derrickburns/generalized-kmeans-clustering , which works on high-dimensional sparse data. You will want to use the RandomIndexing embedding: def sparseTrain(raw: RDD[Vector

KMeans with large clusters Java Heap Space

2015-01-29 Thread mvsundaresan
))) => (val1, val2, val3, val4)} joined.saveAsTextFile(".../clustersoutput.txt")

Re: Why KMeans with mllib is so slow ?

2014-12-15 Thread Xiangrui Meng
compare kmeans in mllib with another kmeans implementation directly. The kmeans|| initialization step takes more time than the algorithm implemented in julia, for example. There is also the ability to run multiple runs of the kmeans algorithm in mllib, even though by default the number of runs is 1. DB Tsai can
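Both of those knobs are on the KMeans builder in Spark 1.x; a sketch (data is assumed to be a cached RDD[Vector]):

  import org.apache.spark.mllib.clustering.KMeans

  val model = new KMeans()
    .setK(100)
    .setMaxIterations(20)
    .setRuns(5)                            // several restarts; the best model is kept
    .setInitializationMode(KMeans.RANDOM)  // skip the costlier k-means|| init
    .run(data)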

Re: Why KMeans with mllib is so slow ?

2014-12-15 Thread Jaonary Rabarisoa
I've tried some additional experiments with kmeans and I finally got it to work as I expected. In fact, the number of partitions is critical. I had a data set of 24x784 with 12 partitions. In this case the kmeans algorithm took a very long time (hours) to converge. When I change

Re: Why KMeans with mllib is so slow ?

2014-12-08 Thread DB Tsai
AM, Jaonary Rabarisoa jaon...@gmail.com wrote: After some investigation, I learned that I can't compare kmeans in mllib with another kmeans implementation directly. The kmeans|| initialization step takes more time than the algorithm implemented in julia for example. There is also the ability

Why KMeans with mllib is so slow ?

2014-12-05 Thread Jaonary Rabarisoa
Hi all, I'm trying to run clustering with the kmeans algorithm. The size of my data set is about 240k vectors of dimension 384. Solving the problem with the kmeans available in julia (kmeans++) http://clusteringjl.readthedocs.org/en/latest/kmeans.html takes about 8 minutes on a single core

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread Sean Owen
in a distributed store like HDFS. But it's also possible you're not configuring the implementations the same way, yes. There's not enough info here really to say. On Fri, Dec 5, 2014 at 9:50 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I'm trying to run clustering with the kmeans algorithm

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread Davies Liu
with the kmeans algorithm. The size of my data set is about 240k vectors of dimension 384. Solving the problem with the kmeans available in julia (kmeans++) http://clusteringjl.readthedocs.org/en/latest/kmeans.html takes about 8 minutes on a single core. Solving the same problem with spark

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread Jaonary Rabarisoa
The code is really simple: object TestKMeans { def main(args: Array[String]) { val conf = new SparkConf() .setAppName("Test KMeans") .setMaster("local[8]") .set("spark.executor.memory", "8g") val sc = new SparkContext(conf) val numClusters = 500

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread DB Tsai
("Test KMeans") .setMaster("local[8]") .set("spark.executor.memory", "8g") val sc = new SparkContext(conf) val numClusters = 500; val numIterations = 2; val data = sc.textFile("sample.csv").map(x => Vectors.dense(x.split(',').map(_.toDouble))) data.cache() val

Store kmeans model

2014-11-24 Thread Jaonary Rabarisoa
Dear all, How can one save a kmeans model after training? Best, Jao
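MLlib's 1.x releases of this era had no built-in persistence for KMeansModel, so a common workaround was to persist the centers and rebuild the model by hand. A hedged sketch (assumes the public KMeansModel constructor of Spark 1.x, a trained model, and a SparkContext sc in scope):

  import org.apache.spark.mllib.clustering.KMeansModel
  import org.apache.spark.mllib.linalg.Vectors

  // save: one center per line, comma-separated
  sc.parallelize(model.clusterCenters.map(_.toArray.mkString(",")))
    .saveAsTextFile("kmeans-centers")

  // load: rebuild the model from the saved centers
  val centers = sc.textFile("kmeans-centers")
    .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
    .collect()
  val restored = new KMeansModel(centers)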

MLIB KMeans Exception

2014-11-20 Thread Alan Prando
Hi Folks! I'm running a Python Spark job on a cluster with 1 master and 10 slaves (64G RAM and 32 cores each machine). This job reads a 1.2-terabyte file with 1128201847 lines on HDFS and calls the Kmeans method as follows: # SLAVE CODE - Reading features from HDFS def

Re: MLIB KMeans Exception

2014-11-20 Thread Xiangrui Meng
cores each machine). This job reads a 1.2-terabyte file with 1128201847 lines on HDFS and calls the Kmeans method as follows: # SLAVE CODE - Reading features from HDFS def get_features_from_images_hdfs(self, timestamp): def shallow(lista): for row in lista

buffer overflow when running Kmeans

2014-10-21 Thread Yang
This is the stack trace I got with yarn logs -applicationId; I really have no idea where to dig further. Thanks! yang 14/10/21 14:36:43 INFO ConnectionManager: Accepted connection from [ phxaishdc9dn1262.stratus.phx.ebay.com/10.115.58.21] 14/10/21 14:36:47 ERROR Executor: Exception in task ID 98

Re: buffer overflow when running Kmeans

2014-10-21 Thread Ted Yu
Just posted the link below for a similar question. Have you seen this thread? http://search-hadoop.com/m/JW1q5ezXPH/KryoException%253A+Buffer+overflowsubj=RE+spark+nbsp+kryo+serilizable+nbsp+exception On Tue, Oct 21, 2014 at 2:44 PM, Yang tedd...@gmail.com wrote: this is the stack trace I got

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-15 Thread Ray
there for almost 1 hour. I guess I can only go with random initialization in KMeans. Thanks again for your help. Ray

Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi guys, I am new to Spark. When I run Spark Kmeans (org.apache.spark.mllib.clustering.KMeans) on a small dataset, it works great. However, when using a large dataset with 1.5 million vectors, it just hangs at some reduceByKey/collectAsMap stages (the attached image shows the corresponding UI

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
observable hanging. Hopefully this provides more information. Thanks. Ray

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Xiangrui Meng

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Burak Yavuz
:58:03 PM Subject: Re: Spark KMeans hangs at reduceByKey / collectAsMap Hi Xiangrui, The input dataset has 1.5 million sparse vectors. Each sparse vector has a dimension(cardinality) of 9153 and has less than 15 nonzero elements. Yes, if I set num-executors = 200, from the hadoop cluster

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi Burak, In Kmeans, I used k_value = 100, num_iteration = 2, and num_run = 1. In the current test, I increased num-executors to 200. In the storage info 2 (as shown below), 11 executors are used (I think the data is kind of balanced) and the others have zero memory usage. http://apache-spark-user

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread DB Tsai
- From: Ray ray-w...@outlook.com To: u...@spark.incubator.apache.org Sent: Tuesday, October 14, 2014 2:58:03 PM Subject: Re: Spark KMeans hangs at reduceByKey / collectAsMap Hi Xiangrui, The input dataset has 1.5 million sparse vectors. Each sparse vector has a dimension(cardinality) of 9153

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Xiangrui Meng
, October 14, 2014 2:58:03 PM Subject: Re: Spark KMeans hangs at reduceByKey / collectAsMap Hi Xiangrui, The input dataset has 1.5 million sparse vectors. Each sparse vector has a dimension(cardinality) of 9153 and has less than 15 nonzero elements. Yes, if I set num-executors = 200, from

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi Xiangrui, Thanks for the guidance. I read the log carefully and found the root cause. KMeans, by default, uses KMeans++ as the initialization mode. According to the log file, the 70-minute hanging is actually the computing time of Kmeans++, as pasted below: 14/10/14 14:48:18 INFO
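For reference, the static train() overload in Spark 1.x also accepts the initialization mode, so switching away from k-means|| is a one-line change (a sketch using the k, iteration, and run values mentioned in this thread; data is an RDD[Vector]):

  import org.apache.spark.mllib.clustering.KMeans

  // args: data, k, maxIterations, runs, initializationMode
  val model = KMeans.train(data, 100, 2, 1, KMeans.RANDOM)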

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Xiangrui Meng
Xiangrui, Thanks for the guidance. I read the log carefully and found the root cause. KMeans, by default, uses KMeans++ as the initialization mode. According to the log file, the 70-minute hanging is actually the computing time of Kmeans++, as pasted below: 14/10/14 14:48:18 INFO DAGScheduler

Re: How to run kmeans after pca?

2014-09-30 Thread st553
Thanks for your response Burak, it was very helpful. I am noticing that if I run PCA before KMeans, the KMeans algorithm actually takes longer to run than if I had just run KMeans without PCA. I was hoping that using PCA first would actually speed up KMeans. I have

Re: How to run kmeans after pca?

2014-09-30 Thread Evan R. Sparks
Caching after doing the multiply is a good idea. Keep in mind that during the first iteration of KMeans, the cached rows haven't yet been materialized - so it is both doing the multiply and the first pass of KMeans all at once. To isolate which part is slow you can run cachedRows.numRows

Re: Python version of kmeans

2014-09-18 Thread Burak Yavuz
Hi, spark-1.0.1/examples/src/main/python/kmeans.py => Naive example for users to understand how to code in Spark. spark-1.0.1/python/pyspark/mllib/clustering.py => Use this!!! Bonus: spark-1.0.1/examples/src/main/python/mllib/kmeans.py => Example of how to call KMeans. Feel free to use

Re: OutOfMemoryError with basic kmeans

2014-09-17 Thread st553
Not sure if you resolved this but I had a similar issue and resolved it. In my case, the problem was the ids of my items were of type Long and could be very large (even though there are only a small number of distinct ids... maybe a few hundred of them). KMeans will create a dense vector
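A sketch of the remapping that resolves this (hedged: itemIds is a hypothetical RDD[Long] of the raw ids; the point is that vector dimension should be the count of distinct ids, not the largest id value):

  // map each distinct Long id to a small contiguous Int index
  val idToIndex: Map[Long, Int] = itemIds.distinct().collect().zipWithIndex.toMap
  val dim = idToIndex.size  // dimension becomes a few hundred, not max(id)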

How to run kmeans after pca?

2014-09-17 Thread st553
I would like to reduce the dimensionality of my data before running kmeans. The problem I'm having is that both RowMatrix.computePrincipalComponents() and RowMatrix.computeSVD() return a DenseMatrix whereas KMeans.train() requires an RDD[Vector]. Does MLlib provide a way to do this conversion
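One way to bridge the two in the Spark 1.x API is to project the rows through the principal components with RowMatrix.multiply, which hands back a RowMatrix whose rows are an RDD[Vector] again. A sketch:

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.mllib.linalg.distributed.RowMatrix
  import org.apache.spark.rdd.RDD

  def pcaThenKMeans(data: RDD[Vector], dims: Int, k: Int) = {
    val mat = new RowMatrix(data)
    val pc = mat.computePrincipalComponents(dims)  // top `dims` components
    val projected = mat.multiply(pc).rows.cache()  // projected rows, an RDD[Vector]
    KMeans.train(projected, k, 20)
  }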

Python version of kmeans

2014-09-17 Thread MEETHU MATHEW
Hi all, I need the kmeans code written against Pyspark for some testing purposes. Can somebody tell me the difference between these two files? spark-1.0.1/examples/src/main/python/kmeans.py and spark-1.0.1/python/pyspark/mllib/clustering.py Thanks & Regards, Meethu M

Re: Only master is really busy at KMeans training

2014-08-26 Thread durin
failure. Is there a recommended solution to this issue?

Re: Only master is really busy at KMeans training

2014-08-26 Thread Xiangrui Meng
solution to this issue?

Re: Only master is really busy at KMeans training

2014-08-26 Thread durin

Only master is really busy at KMeans training

2014-08-19 Thread durin

Re: Only master is really busy at KMeans training

2014-08-19 Thread Xiangrui Meng

Re: KMeans - java.lang.IllegalArgumentException: requirement failed

2014-08-12 Thread Sean Owen
It sounds like your data does not all have the same dimension? That's a decent guess. Have a look at the assertions in this method. On Tue, Aug 12, 2014 at 4:44 AM, Ge, Yao (Y.) y...@ford.com wrote: I am trying to train a KMeans model with sparse vectors with Spark 1.0.1. When I run
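A quick way to test that guess before training (a sketch; data is the RDD[Vector] being fed to KMeans):

  // every vector should report the same size; more than one distinct size
  // will trip MLlib's requirement check
  val dims = data.map(_.size).distinct().collect()
  assert(dims.length == 1, s"mixed vector dimensions: ${dims.mkString(",")}")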

RE: KMeans - java.lang.IllegalArgumentException: requirement failed

2014-08-12 Thread Ge, Yao (Y.)
' Subject: KMeans - java.lang.IllegalArgumentException: requirement failed I am trying to train a KMeans model with sparse vectors with Spark 1.0.1. When I run the training I get the following exception: java.lang.IllegalArgumentException: requirement failed at scala.Predef$.require

Using very large files for KMeans training -- cluster centers size?

2014-08-11 Thread durin
I'm trying to apply KMeans training to some text data, which consists of lines that each contain something between 3 and 20 words. For that purpose, all unique words are saved in a dictionary. This dictionary can become very large as no hashing etc. is done, but it should spill to disk in case

KMeans - java.lang.IllegalArgumentException: requirement failed

2014-08-11 Thread Ge, Yao (Y.)
I am trying to train a KMeans model with sparse vectors with Spark 1.0.1. When I run the training I get the following exception: java.lang.IllegalArgumentException: requirement failed at scala.Predef$.require(Predef.scala:221) at org.apache.spark.mllib.util.MLUtils

Re: KMeans Input Format

2014-08-09 Thread AlexanderRiggers
at <console>:19 scala> scala> // Set model and run it scala> val model = new KMeans(). | setInitializationMode("k-means||"). | setK(2).setMaxIterations(2). | setEpsilon(1e-4). | setRuns(1). | run(parsedData) 14/08/09 14:58:33 WARN snappy.LoadSnappy: Snappy native library is available 14

Re: KMeans Input Format

2014-08-08 Thread AlexanderRiggers
cost should be a member of KMeans, isn't it? My whole code is here: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf val conf = new SparkConf() .setMaster("local") .setAppName("Kmeans") .set("spark.executor.memory", "2g") val sc = new SparkContext

Re: KMeans Input Format

2014-08-08 Thread Sean Owen
(-incubator, +user) It's a method of KMeansModel, not KMeans. At first glance it looks like model should be a KMeansModel, but Scala says it's not. The problem is... val model = new KMeans() .setInitializationMode("k-means||") .setK(2) .setMaxIterations(2) .setEpsilon(1e-4) .setRuns(1) .run(train
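In other words, the chain must end in run(...), and the cost comes off the resulting model (a sketch mirroring the code above; train is the RDD[Vector] used to fit the model):

  val model = new KMeans()
    .setInitializationMode("k-means||")
    .setK(2)
    .setMaxIterations(2)
    .setEpsilon(1e-4)
    .setRuns(1)
    .run(train)                          // returns a KMeansModel
  val cost = model.computeCost(train)    // WSSSE; defined on the model, not on KMeans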

Re: KMeans Input Format

2014-08-07 Thread Burak Yavuz
AM Subject: KMeans Input Format I want to perform a K-Means task but fail at training the model and get kicked out of Spark's scala shell before I get my result metrics. I am not sure if the input format is the problem or something else. I use Spark 1.0.0 and my input text file (400MB) looks like

Re: KMeans Input Format

2014-08-07 Thread Sean Owen
-shell with the flag --driver-memory 2g or more if you have more RAM available and try again? Thanks, Burak - Original Message - From: AlexanderRiggers alexander.rigg...@gmail.com To: u...@spark.incubator.apache.org Sent: Thursday, August 7, 2014 7:37:40 AM Subject: KMeans Input

Re: KMeans Input Format

2014-08-07 Thread AlexanderRiggers
model = new KMeans() model: org.apache.spark.mllib.clustering.KMeans = org.apache.spark.mllib.clustering.KMeans@4c5fa12d scala> .setInitializationMode("k-means||") res0: org.apache.spark.mllib.clustering.KMeans = org.apache.spark.mllib.clustering.KMeans@4c5fa12d scala> .setK(2) res1

Re: KMeans Input Format

2014-08-07 Thread durin
() with val train = parsedData.repartition(20).cache() Best regards, Simon

Re: KMeans Input Format

2014-08-07 Thread Xiangrui Meng
Besides durin's suggestion, please also confirm driver and executor memory in the WebUI, since they are small according to the log: 14/08/07 19:59:10 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 34.6 KB, free 303.3 MB) -Xiangrui

Re: KMeans: expensiveness of large vectors

2014-07-29 Thread durin
Development is really rapid here, that's a great thing. Out of curiosity, how did communication work before torrent? Did everything have to go back to the master / driver first?

Re: KMeans: expensiveness of large vectors

2014-07-29 Thread Xiangrui Meng
here, that's a great thing. Out of curiosity, how did communication work before torrent? Did everything have to go back to the master / driver first?

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread durin

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread Xiangrui Meng
representation. Do you know of any way to improve performance then? Best regards, Simon

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread durin

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread Xiangrui Meng
in a reasonable time. I guess using torrent helps a lot in this case. Best regards, Simon

Re: Kmeans: set initial centers explicitly

2014-07-27 Thread Xiangrui Meng

Kmeans: set initial centers explicitly

2014-07-24 Thread SK
allows us to specify the initial centers.) thanks

KMeans: expensiveness of large vectors

2014-07-24 Thread durin
a KMeans model than a large number of rows. To give an example: 10k rows X 1k columns took 21 seconds on my cluster, whereas 1k rows X 10k columns took 1min47s. Both files had a size of 238M. Can someone explain what in the implementation of KMeans causes large vectors to be so much more expensive

Re: Kmeans

2014-07-17 Thread Xiangrui Meng
wrote: Can anyone explain to me what the difference is between kmeans in MLlib and kmeans in examples/src/main/python/kmeans.py? Best Regards ... Amin Mohebbi, PhD candidate in Software Engineering at University of Malaysia. H/P: +60 18

Kmeans

2014-07-16 Thread amin mohebbi
Can anyone explain to me what the difference is between kmeans in MLlib and kmeans in examples/src/main/python/kmeans.py? Best Regards ... Amin Mohebbi, PhD candidate in Software Engineering at University of Malaysia. H/P: +60 18

Re: KMeans code is rubbish

2014-07-14 Thread Wanda Hawk
with this issue is to run kmeans multiple times and choose the best answer. You can do this by changing the runs parameter from the default value (1) to something larger (say 10). -Ameet On Fri, Jul 11, 2014 at 1:20 AM, Wanda Hawk wanda_haw...@yahoo.com wrote: I also took a look at spark-1.0.0

Re: KMeans for large training data

2014-07-12 Thread durin
for this behavior? Best regards, Simon

Re: KMeans for large training data

2014-07-12 Thread Aaron Davidson
The "netlib.BLAS: Failed to load implementation" warning only means that the BLAS implementation may be slower than a native one. The reason it only shows up at the end is that the library is only used for the finalization step of the KMeans algorithm, so your job should've been wrapping

Re: KMeans code is rubbish

2014-07-11 Thread Wanda Hawk
for this: https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das tathagata.das1...@gmail.com wrote: I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your dataset as well, I got the expected answer. And I believe that even

KMeans for large training data

2014-07-11 Thread durin
code (where it gets slow) is this: What could I do to use more executors, and generally speed this up?

Re: KMeans for large training data

2014-07-11 Thread Sean Owen
it wrong. The relevant code (where it gets slow) is this: What could I do to use more executors, and generally speed this up?

Re: KMeans code is rubbish

2014-07-11 Thread Ameet Talwalkar
Hi Wanda, As Sean mentioned, K-means is not guaranteed to find an optimal answer, even for seemingly simple toy examples. A common heuristic to deal with this issue is to run kmeans multiple times and choose the best answer. You can do this by changing the runs parameter from the default value

Re: KMeans for large training data

2014-07-11 Thread Sean Owen
no activity is shown in the WebUI. Is that the GC at work? If yes, how would I improve this? You mean there are a few minutes where no job is running? I assume that's time when the driver is busy doing something. Is it thrashing? Also, Local KMeans++ reached the max number of iterations: 30

KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
Can someone please run the standard kMeans code on this input with 2 centers?: (2,1) (1,2) (3,2) (2,3) (4,1) (5,1) (6,1) (4,2) (6,2) (4,3) (5,3) (6,3) The obvious result should be (2,2) and (5,2) ... (you can draw them if you don't believe me ...) Thanks, Wanda

Re: KMeans code is rubbish

2014-07-10 Thread Sean Owen
kmeans = new KMeans() kmeans.setK(2) val model = kmeans.run(vectors) model.clusterCenters res10: Array[org.apache.spark.mllib.linalg.Vector] = Array([5.0,2.0], [2.0,2.0]) You may be aware that k-means starts from a random set of centroids. It's possible that your run picked one that leads
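For reference, a self-contained version of that check (a sketch assuming a spark-shell sc; the points and the expected centers come from the original post):

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  val points = Seq((2,1),(1,2),(3,2),(2,3),(4,1),(5,1),(6,1),(4,2),(6,2),(4,3),(5,3),(6,3))
  val vectors = sc.parallelize(points.map(p => Vectors.dense(p._1.toDouble, p._2.toDouble)))
  val model = KMeans.train(vectors, 2, 20)       // k = 2 centers, 20 iterations
  model.clusterCenters.foreach(println)          // expect roughly [2.0,2.0] and [5.0,2.0]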

Re: KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
: A picture is worth a thousand... Well, a picture with this dataset, what you are expecting and what you get, would help in answering your initial question. Bertrand On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk wanda_haw...@yahoo.com wrote: Can someone please run the standard kMeans code

Re: KMeans code is rubbish

2014-07-10 Thread Tathagata Das
I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your dataset as well, I got the expected answer. And I believe that even though initialization is done using sampling, the example actually sets the seed to a constant 42, so the result should always be the same no matter how

Re: KMeans code is rubbish

2014-07-10 Thread Xiangrui Meng
(not the mllib KMeans that Sean ran) with your dataset as well, I got the expected answer. And I believe that even though initialization is done using sampling, the example actually sets the seed to a constant 42, so the result should always be the same no matter how many times you run it. So I am

Re: KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
),(5,3),(6,3)).map(p => Vectors.dense(Array[Double](p._1, p._2))) val kmeans = new KMeans() kmeans.setK(2) val model = kmeans.run(vectors) model.clusterCenters res10: Array[org.apache.spark.mllib.linalg.Vector] = Array([5.0,2.0], [2.0,2.0]) You may be aware that k-means starts from a random set

Re: KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
10, 2014 12:46 PM, Tathagata Das tathagata.das1...@gmail.com wrote: I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your dataset as well, I got the expected answer. And I believe that even though initialization is done using sampling, the example actually sets the seed

Inter and Inra Cluster Density in KMeans

2014-05-28 Thread Stuti Awasthi
Hi, I wanted to calculate the InterClusterDensity and IntraClusterDensity from the clusters generated by KMeans. How can I achieve that? Is there any existing code/API to use for this purpose? Thanks Stuti Awasthi

Re: Understanding epsilon in KMeans

2014-05-16 Thread Sean Owen
It is running k-means many times, independently, from different random starting points in order to pick the best clustering. Convergence ends one run, not all of them. Yes, epsilon should be the same as the convergence threshold elsewhere. You can set epsilon if you instantiate KMeans directly. Maybe

Re: Understanding epsilon in KMeans

2014-05-16 Thread Long Pham
Stuti, I'm answering your questions in order: 1. From MLlib https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L159 , you can see that clustering stops when we have reached *maxIterations* or there are no more *activeRuns*. KMeans

Re: Understanding epsilon in KMeans

2014-05-16 Thread Brian Gawalt
Hi Stuti, I think you're right. The epsilon parameter is indeed used as a threshold for deciding when KMeans has converged. If you look at line 201 of mllib's KMeans.scala: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L201 you
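Concretely, in Spark 1.x the threshold is set on a directly instantiated KMeans; a smaller epsilon means a stricter convergence test and potentially more iterations before a run stops early (a sketch; data is an RDD[Vector]):

  import org.apache.spark.mllib.clustering.KMeans

  val model = new KMeans()
    .setK(10)
    .setMaxIterations(50)
    .setEpsilon(1e-6)   // a run converges once no center moves more than this
    .run(data)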

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Xiangrui Meng
significant time before any movement. In the stage detail of the UI, I can see that there are 127 tasks running but the duration of each is at least a few minutes. I'm working off local storage (not hdfs) and the kmeans data is about 6.5GB (50M rows). Is this normal behaviour? Thanks!

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Tsai Li Ming
local storage (not hdfs) and the kmeans data is about 6.5GB (50M rows). Is this normal behaviour? Thanks!

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Xiangrui Meng
K = 50 is certainly a large number for k-means. If there is no particular reason to have 50 clusters, could you try to reduce it to, e.g., 100 or 1000? Also, the example code is not for large-scale problems. You should use the KMeans algorithm in mllib clustering for your problem

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Tsai Li Ming
number for k-means. If there is no particular reason to have 50 clusters, could you try to reduce it to, e.g., 100 or 1000? Also, the example code is not for large-scale problems. You should use the KMeans algorithm in mllib clustering for your problem. -Xiangrui On Sun, Mar 23, 2014 at 11

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Xiangrui Meng
Number of rows doesn't matter much as long as you have enough workers to distribute the work. K-means has complexity O(n * d * k), where n is the number of points, d is the dimension, and k is the number of clusters. If you use the KMeans implementation from MLlib, the initialization stage is done

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Xiangrui Meng
Sorry, I meant the master branch of https://github.com/apache/spark. -Xiangrui On Mon, Mar 24, 2014 at 6:27 PM, Tsai Li Ming mailingl...@ltsai.com wrote: Thanks again. If you use the KMeans implementation from MLlib, the initialization stage is done on master. The master here is the app

Feed KMeans algorithm with a row major matrix

2014-03-18 Thread Jaonary Rabarisoa
Dear All, I'm trying to cluster data from native library code with Spark Kmeans||. In my native library the data are represented as a matrix (rows = number of data points, cols = dimension). For efficiency reasons, they are copied into a one-dimensional Scala Array in row-major order, so after
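A hedged sketch of that conversion (assumes the flat row-major array, its row dimension, and a SparkContext are in scope; names are illustrative):

  import org.apache.spark.SparkContext
  import org.apache.spark.mllib.linalg.{Vector, Vectors}
  import org.apache.spark.rdd.RDD

  // flat: row-major Array[Double] of length numRows * dim, copied from the native side
  def toRowRDD(sc: SparkContext, flat: Array[Double], dim: Int): RDD[Vector] =
    sc.parallelize(flat.grouped(dim).map(arr => Vectors.dense(arr)).toSeq)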

Re: Feed KMeans algorithm with a row major matrix

2014-03-18 Thread Xiangrui Meng
...@gmail.com wrote: Dear All, I'm trying to cluster data from native library code with Spark Kmeans||. In my native library the data are represented as a matrix (rows = number of data points, cols = dimension). For efficiency reasons, they are copied into a one-dimensional Scala Array in row-major order, so after
