each tweet into a vector and randomly picks some initial cluster centers, then
runs kmeans to group the tweets (at a really high level, the clusters, I
assume, would be common topics). As such, when it checks each tweet to see if
model.predict == 1, different sets of tweets should appear under each cluster
in the web UI.
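For illustration, a minimal sketch of that per-tweet check (assuming "model" is the trained KMeansModel and "tweets" is an RDD of (text, featureVector) pairs; both names are hypothetical):

// Keep only the tweets whose feature vector is assigned to cluster 1
val cluster1Tweets = tweets.filter { case (_, vec) => model.predict(vec) == 1 }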
On Wed, Feb 11, 2015 at 4:25 PM, Sean Owen so...@cloudera.com
wrote:
Are you actually using that memory for executors?
On Wed, Feb 11, 2015 at 8:17 AM, lihu lihu...@gmail.com wrote:
Hi,
I run KMeans (MLlib) in a cluster with 12 workers. Every worker owns
128G RAM and 24 cores. I run 48 tasks on one machine; the total data is just
40GB.
Hi,
Is there a way to get the elements of each cluster after running kmeans
clustering? I am using the Java version.
thanks
KMeansModel only returns the cluster centroids.
To get the # of elements in each cluster, try calling kmeans.predict() on each
of the points in the data used to build the model.
See
https://github.com/OryxProject/oryx/blob/master/oryx-app-mllib/src/main/java/com/cloudera/oryx/app/mllib/kmeans
.
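For example, a minimal sketch in Scala (the Java API is analogous), assuming model is the trained KMeansModel and data is the RDD[Vector] it was built on:

// Predict the cluster of every training point, then count points per cluster id
val clusterSizes = data.map(point => model.predict(point)).countByValue()
// clusterSizes: Map[Int, Long], e.g. Map(0 -> 1234, 1 -> 987, ...)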
Hi,
I run KMeans (MLlib) in a cluster with 12 workers. Every worker owns
128G RAM and 24 cores. I run 48 tasks on one machine; the total data is just
40GB.
When the dimension of the data set is about 10^7, each task's duration
is about 30s, but the cost of GC is about 20s.
When I
bytes (8 bytes per double)
You should try to use my new generalized kmeans clustering package
https://github.com/derrickburns/generalized-kmeans-clustering , which
works on high dimensional sparse data.
You will want to use the RandomIndexing embedding:
def sparseTrain(raw: RDD[Vector
))) =
(val1, val2, val3, val4)}
joined.saveAsTextFile(.../clustersoutput.txt)
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-with-large-clusters-Java-Heap-Space-tp21432.html
compare kmeans in mllib
with another kmeans implementation directly. The kmeans|| initialization
step takes more time than the algorithm implemented in julia for example.
There is also the ability to run multiple runs of the kmeans algorithm in
mllib, even though by default the number of runs is 1.
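For reference, a sketch of requesting multiple runs through the builder API (data is assumed to be the parsed RDD[Vector]):

import org.apache.spark.mllib.clustering.KMeans

val model = new KMeans()
  .setK(500)
  .setMaxIterations(20)
  .setRuns(10) // default is 1; the run with the lowest cost is kept
  .run(data)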
DB Tsai can
I've tried some additional experiments with kmeans and I finally got it
to work as I expected. In fact, the number of partitions is critical. I had a
data set of 24x784 with 12 partitions. In this case the kmeans
algorithm took a very long time (hours) to converge. When I change
AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
After some investigation, I learned that I can't compare kmeans in mllib
with another kmeans implementation directly. The kmeans|| initialization
step takes more time than the algorithm implemented in julia for example.
There is also the ability
Hi all,
I'm trying to run a clustering with the kmeans algorithm. The size of my data
set is about 240k vectors of dimension 384.
Solving the problem with the kmeans available in julia (kmeans++)
http://clusteringjl.readthedocs.org/en/latest/kmeans.html
takes about 8 minutes on a single core
in a distributed store like HDFS.
But it's also possible you're not configuring the implementations the
same way, yes. There's not enough info here really to say.
On Fri, Dec 5, 2014 at 9:50 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Hi all,
I'm trying to run a clustering with the kmeans algorithm. The size of my
data set is about 240k vectors of dimension 384.
Solving the problem with the kmeans available in julia (kmeans++)
http://clusteringjl.readthedocs.org/en/latest/kmeans.html
takes about 8 minutes on a single core.
Solving the same problem with spark
The code is really simple:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors

object TestKMeans {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Test KMeans")
      .setMaster("local[8]")
      .set("spark.executor.memory", "8g")
    val sc = new SparkContext(conf)
    val numClusters = 500
    val numIterations = 2
    val data = sc.textFile("sample.csv")
      .map(x => Vectors.dense(x.split(',').map(_.toDouble)))
    data.cache()
    val
Dear all,
How can one save a kmeans model after training ?
Best,
Jao
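One minimal approach (a sketch; sc is the SparkContext, the path is hypothetical, and newer Spark releases also ship save/load helpers on the model itself): a KMeansModel is fully determined by its cluster centers, so persisting those is enough.

import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector

// Save: write the centers out
sc.parallelize(model.clusterCenters).saveAsObjectFile("hdfs:///models/kmeans-centers")

// Load: read the centers back and rebuild the model
val centers = sc.objectFile[Vector]("hdfs:///models/kmeans-centers").collect()
val restored = new KMeansModel(centers)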
Hi Folks!
I'm running a Python Spark job on a cluster with 1 master and 10 slaves
(64G RAM and 32 cores each machine).
This job reads a file with 1.2 terabytes and 1128201847 lines from HDFS and
calls the KMeans method as follows:
# SLAVE CODE - Reading features from HDFS
def get_features_from_images_hdfs(self, timestamp):
    def shallow(lista):
        for row in lista
this is the stack trace I got with yarn logs -applicationId
really no idea where to dig further.
thanks!
yang
14/10/21 14:36:43 INFO ConnectionManager: Accepted connection from [phxaishdc9dn1262.stratus.phx.ebay.com/10.115.58.21]
14/10/21 14:36:47 ERROR Executor: Exception in task ID 98
Just posted below for a similar question.
Have you seen this thread ?
http://search-hadoop.com/m/JW1q5ezXPH/KryoException%253A+Buffer+overflowsubj=RE+spark+nbsp+kryo+serilizable+nbsp+exception
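If it is the Kryo buffer overflow from that thread, the usual remedy is to enlarge the serializer buffer, sketched below (on older releases the property was spark.kryoserializer.buffer.max.mb and took a plain number of megabytes):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "256m") // default is far smaller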
On Tue, Oct 21, 2014 at 2:44 PM, Yang tedd...@gmail.com wrote:
this is the stack trace I got
there for almost 1 hour.
I guess I can only go with random initialization in KMeans.
Thanks again for your help.
Ray
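For reference, switching to random initialization is one setter call (a sketch; data is the input RDD[Vector]):

import org.apache.spark.mllib.clustering.KMeans

val model = new KMeans()
  .setInitializationMode(KMeans.RANDOM) // skip the kmeans|| initialization step
  .setK(100)
  .setMaxIterations(10)
  .run(data)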
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16530.html
Hi guys,
I am new to Spark. When I run Spark Kmeans
(org.apache.spark.mllib.clustering.KMeans) on a small dataset, it works
great. However, when using a large dataset with 1.5 million vectors, it just
hangs there at some reduceByKey/collectAsMap stages (attached image shows
the corresponding UI).
observable
hanging.
Hopefully this provides more information.
Thanks.
Ray
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16417.html
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16428.html
From: Ray ray-w...@outlook.com
To: u...@spark.incubator.apache.org
Sent: Tuesday, October 14, 2014 2:58:03 PM
Subject: Re: Spark KMeans hangs at reduceByKey / collectAsMap
Hi Xiangrui,
The input dataset has 1.5 million sparse vectors. Each sparse vector has a
dimension (cardinality) of 9153 and has fewer than 15 nonzero elements.
Yes, if I set num-executors = 200, from the hadoop cluster
Hi Burak,
In KMeans, I used k_value = 100, num_iteration = 2, and num_run = 1.
In the current test, I increased num-executors = 200. In the storage info 2
(as shown below), 11 executors are used (I think the data is kind of
balanced) and others have zero memory usage.
Hi Xiangrui,
Thanks for the guidance. I read the log carefully and found the root cause.
KMeans, by default, uses KMeans++ as the initialization mode. According to
the log file, the 70-minute hanging is actually the computing time of
KMeans++, as pasted below:
14/10/14 14:48:18 INFO DAGScheduler
Thanks for your response Burak it was very helpful.
I am noticing that if I run PCA before KMeans that the KMeans algorithm will
actually take longer to run than if I had just run KMeans without PCA. I was
hoping that by using PCA first it would actually speed up the KMeans
algorithm.
I have
Caching after doing the multiply is a good idea. Keep in mind that during
the first iteration of KMeans, the cached rows haven't yet been
materialized - so it is both doing the multiply and the first pass of
KMeans all at once. To isolate which part is slow you can run
cachedRows.numRows
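A sketch of that isolation step (mat and pc are assumed to be the RowMatrix and the matrix returned by computePrincipalComponents):

val projected = mat.multiply(pc).rows // RDD[Vector] in the reduced space
projected.cache()
projected.count() // forces materialization, so a later KMeans timing excludes the multiply
// (calling numRows on the cached RowMatrix, as suggested above, has the same effect)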
Hi,
spark-1.0.1/examples/src/main/python/kmeans.py = Naive example for users to
understand how to code in Spark
spark-1.0.1/python/pyspark/mllib/clustering.py = Use this!!!
Bonus: spark-1.0.1/examples/src/main/python/mllib/kmeans.py = Example on how
to call KMeans. Feel free to use
Not sure if you resolved this but I had a similar issue and resolved it. In
my case, the problem was the ids of my items were of type Long and could be
very large (even though there are only a small number of distinct ids...
maybe a few hundred of them). KMeans will create a dense vector
I would like to reduce the dimensionality of my data before running kmeans.
The problem I'm having is that both RowMatrix.computePrincipalComponents()
and RowMatrix.computeSVD() return a DenseMatrix whereas KMeans.train()
requires an RDD[Vector]. Does MLlib provide a way to do this conversion
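One way to bridge the two (a sketch; the component count and k are hypothetical): RowMatrix.multiply projects the data and returns another RowMatrix, whose rows field is exactly the RDD[Vector] that KMeans.train expects.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(data)               // data: RDD[Vector]
val pc = mat.computePrincipalComponents(50) // keep 50 components
val reduced = mat.multiply(pc).rows         // RDD[Vector] in the reduced space
val model = KMeans.train(reduced, 10, 20)   // k = 10, 20 iterations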
Hi all,
I need the kmeans code written against Pyspark for some testing purpose.
Can somebody tell me the difference between these two files.
spark-1.0.1/examples/src/main/python/kmeans.py and
spark-1.0.1/python/pyspark/mllib/clustering.py
Thanks Regards,
Meethu M
failure.
Is there a recommended solution to this issue?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Only-master-is-really-busy-at-KMeans-training-tp12411p12803.html
Simon
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Only-master-is-really-busy-at-KMeans-training-tp12411.html
It sounds like your data does not all have the same dimension? That's
a decent guess. Have a look at the assertions in this method.
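A quick way to test that guess (a sketch; parsedData stands for the RDD[Vector] being trained on): count the distinct vector sizes; anything other than exactly one size will trip the requirement.

val dims = parsedData.map(_.size).distinct().collect()
println(dims.mkString(", ")) // a healthy dataset prints a single dimension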
On Tue, Aug 12, 2014 at 4:44 AM, Ge, Yao (Y.) y...@ford.com wrote:
I am trying to train a KMeans model with sparse vector with Spark 1.0.1.
When I run
Subject: KMeans - java.lang.IllegalArgumentException: requirement failed
I am trying to train a KMeans model with sparse vector with Spark 1.0.1.
When I run the training I got the following exception:
java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require
I'm trying to apply KMeans training to some text data, which consists of
lines that each contain something between 3 and 20 words. For that purpose,
all unique words are saved in a dictionary. This dictionary can become very
large as no hashing etc. is done, but it should spill to disk in case
I am trying to train a KMeans model with sparse vector with Spark 1.0.1.
When I run the training I got the following exception:
java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:221)
at
org.apache.spark.mllib.util.MLUtils
at console:19
scala>
scala> // Set model and run it
scala> val model = new KMeans().
     |   setInitializationMode("k-means||").
     |   setK(2).setMaxIterations(2).
     |   setEpsilon(1e-4).
     |   setRuns(1).
     |   run(parsedData)
14/08/09 14:58:33 WARN snappy.LoadSnappy: Snappy native library is available
14
cost should be a member of KMeans, isn't it?
My whole code is here:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("Kmeans")
  .set("spark.executor.memory", "2g")
val sc = new SparkContext
(-incubator, +user)
It's a method of KMeansModel, not KMeans. On first glance it looks
like model should be a KMeansModel, but Scala says it's not. The
problem is...
val model = new KMeans()
  .setInitializationMode("k-means||")
  .setK(2)
  .setMaxIterations(2)
  .setEpsilon(1e-4)
  .setRuns(1)
  .run(train
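For completeness, the cost is computed on the fitted model rather than on the KMeans estimator (a sketch; train is the training RDD[Vector]):

val model = new KMeans().setK(2).setMaxIterations(2).run(train)
val wssse = model.computeCost(train) // within-set sum of squared errors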
AM
Subject: KMeans Input Format
I want to perform a K-Means task and fail training the model and get kicked
out of Spark's scala shell before I get my result metrics. I am not sure if
the input format is the problem or something else. I use Spark 1.0.0 and my
input text file (400MB) looks like
-shell with the flag --driver-memory 2g or more if
you have more RAM available and try again?
Thanks,
Burak
- Original Message -
From: AlexanderRiggers alexander.rigg...@gmail.com
To: u...@spark.incubator.apache.org
Sent: Thursday, August 7, 2014 7:37:40 AM
Subject: KMeans Input
scala> val model = new KMeans()
model: org.apache.spark.mllib.clustering.KMeans =
org.apache.spark.mllib.clustering.KMeans@4c5fa12d
scala> .setInitializationMode("k-means||")
res0: org.apache.spark.mllib.clustering.KMeans =
org.apache.spark.mllib.clustering.KMeans@4c5fa12d
scala> .setK(2)
res1
()
with
val train = parsedData.repartition(20).cache()
Best regards,
Simon
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-Input-Format-tp11654p11719.html
Besides durin's suggestion, please also confirm driver and executor
memory in the WebUI, since they are small according to the log:
14/08/07 19:59:10 INFO MemoryStore: Block broadcast_0 stored as values to
memory (estimated size 34.6 KB, free 303.3 MB)
-Xiangrui
Development is really rapid here, that's a great thing.
Out of curiosity, how did communication work before torrent? Did everything
have to go back to the master / driver first?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-expensiveness-of-large-vectors-tp10614p10870.html
representation.
Do you know of any way to improve performance then?
Best regards,
Simon
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-expensiveness-of-large-vectors-tp10614p10804.html
in a reasonable time. I guess using torrent helps a lot in this
case.
Best regards,
Simon
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-expensiveness-of-large-vectors-tp10614p10833.html
allows us to specify the initial centers.)
thanks
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Kmeans-set-initial-centers-explicitly-tp10609.html
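Later MLlib releases do allow this through setInitialModel; a sketch (data and the center values are hypothetical):

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Seed k-means with explicit starting centers
val initial = new KMeansModel(Array(Vectors.dense(2.0, 2.0), Vectors.dense(5.0, 2.0)))
val model = new KMeans()
  .setK(2)
  .setMaxIterations(20)
  .setInitialModel(initial)
  .run(data)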
a KMeans model
than a large number of rows.
To give an example:
10k rows X 1k columns took 21 seconds on my cluster, whereas 1k rows X 10k
columns took 1min47s. Both files had a size of 238M.
Can someone explain what in the implementation of KMeans causes large
vectors to be so much more expensive
wrote:
Can anyone explain to me what is the difference between kmeans in MLlib and
kmeans in examples/src/main/python/kmeans.py?
Best Regards
...
Amin Mohebbi
PhD candidate in Software Engineering
at University of Malaysia
H/P : +60 18
with this issue
is to run kmeans multiple times and choose the best answer. You can do this by
changing the runs parameter from the default value (1) to something larger (say
10).
-Ameet
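In code, the runs parameter is the fourth argument of the train helper (a sketch; data is the parsed RDD[Vector]):

import org.apache.spark.mllib.clustering.KMeans

// k = 2, 20 iterations, 10 independent runs; the best clustering by cost is returned
val model = KMeans.train(data, 2, 20, 10)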
On Fri, Jul 11, 2014 at 1:20 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:
I also took a look at
spark-1.0.0
for this
behavior?
Best regards,
Simon
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-for-large-training-data-tp9407p9508.html
The netlib.BLAS: Failed to load implementation warning only means that
the BLAS implementation may be slower than using a native one. The reason
why it only shows up at the end is that the library is only used for the
finalization step of the KMeans algorithm, so your job should've been
wrapping
for this:
https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui
On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your
dataset as well, I got the expected answer. And I believe that even
it wrong. The relevant code (where it gets slow) is this:
What could I do to use more executors, and generally speed this up?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-for-large-training-data-tp9407.html
Hi Wanda,
As Sean mentioned, K-means is not guaranteed to find an optimal answer,
even for seemingly simple toy examples. A common heuristic to deal with
this issue is to run kmeans multiple times and choose the best answer. You
can do this by changing the runs parameter from the default value
no activity is shown in the WebUI. Is that
the GC at work? If yes, how would I improve this?
You mean there are a few minutes where no job is running? I assume
that's time when the driver is busy doing something. Is it thrashing?
Also, Local KMeans++ reached the max number of iterations: 30
Can someone please run the standard kMeans code on this input with 2 centers ?:
2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3
The obvious result should be (2,2) and (5,2) ... (you can draw them if you
don't believe me ...)
Thanks,
Wanda
val vectors = sc.parallelize(Seq((2,1),(1,2),(3,2),(2,3),(4,1),(5,1),(6,1),(4,2),(6,2),(4,3),(5,3),(6,3)))
  .map(p => Vectors.dense(Array[Double](p._1, p._2)))
val kmeans = new KMeans()
kmeans.setK(2)
val model = kmeans.run(vectors)
model.clusterCenters
res10: Array[org.apache.spark.mllib.linalg.Vector] = Array([5.0,2.0], [2.0,2.0])
You may be aware that k-means starts from a random set of centroids.
It's possible that your run picked one that leads
A picture is worth a thousand... Well, a picture with this dataset, what you
are expecting and what you get, would help answering your initial question.
Bertrand
On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:
Can someone please run the standard kMeans code
I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with
your dataset as well, I got the expected answer. And I believe that even
though initialization is done using sampling, the example actually sets the
seed to a constant 42, so the result should always be the same no matter how
many times you run it. So I am
Hi,
I wanted to calculate the InterClusterDensity and IntraClusterDensity from the
clusters generated from KMeans.
How can I achieve that? Is there any already present code/api to use for this
purpose.
Thanks
Stuti Awasthi
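There is no built-in API for this; a minimal sketch under the usual definitions (intra-cluster density as mean squared distance to the assigned center, inter-cluster separation as pairwise center distances; model and data are assumed to be the trained KMeansModel and the RDD[Vector]):

import org.apache.spark.SparkContext._ // implicit pair-RDD functions on older Spark
import org.apache.spark.mllib.linalg.Vectors

// Intra-cluster: mean squared distance of points to their assigned center, per cluster
val intra = data.map { v =>
  val c = model.predict(v)
  (c, (Vectors.sqdist(v, model.clusterCenters(c)), 1L))
}.reduceByKey { case ((d1, n1), (d2, n2)) => (d1 + d2, n1 + n2) }
 .mapValues { case (sum, n) => sum / n }

// Inter-cluster: squared distances between all pairs of centers (local; k is small)
val inter = model.clusterCenters.combinations(2)
  .map { case Array(a, b) => Vectors.sqdist(a, b) }.toArray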
It is running k-means many times, independently, from different random
starting points in order to pick the best clustering. Convergence ends
one run, not all of them.
Yes epsilon should be the same as convergence threshold elsewhere.
You can set epsilon if you instantiate KMeans directly. Maybe
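That is, a sketch of instantiating KMeans directly with an explicit epsilon (k and the iteration count are hypothetical):

import org.apache.spark.mllib.clustering.KMeans

val model = new KMeans()
  .setK(10)
  .setMaxIterations(50)
  .setEpsilon(1e-6) // convergence threshold on how far centers may still move
  .run(data)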
Stuti,
I'm answering your questions in order:
1. From MLlib
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L159
, you can see that clustering stops when we have reached maxIterations or
there are no more activeRuns.
KMeans
Hi Stuti,
I think you're right. The epsilon parameter is indeed used as a threshold
for deciding when KMeans has converged. If you look at line 201 of mllib's
KMeans.scala:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L201
you
significant time before any movement.
In the stage detail of the UI, I can see that there are 127 tasks running but
the duration of each is at least a few minutes.
I'm working off local storage (not hdfs) and the kmeans data is about 6.5GB
(50M rows).
Is this a normal behaviour?
Thanks!
K = 50000 is certainly a large number for k-means. If there is no
particular reason to have 50000 clusters, could you try to reduce it
to, e.g., 100 or 1000? Also, the example code is not for large-scale
problems. You should use the KMeans algorithm in mllib clustering for
your problem.
-Xiangrui
On Sun, Mar 23, 2014 at 11
Number of rows doesn't matter much as long as you have enough workers
to distribute the work. K-means has complexity O(n * d * k), where n
is number of points, d is the dimension, and k is the number of
clusters. If you use the KMeans implementation from MLlib, the
initialization stage is done
Sorry, I meant the master branch of https://github.com/apache/spark. -Xiangrui
On Mon, Mar 24, 2014 at 6:27 PM, Tsai Li Ming mailingl...@ltsai.com wrote:
Thanks again.
If you use the KMeans implementation from MLlib, the
initialization stage is done on master,
The master here is the app
Dear All,
I'm trying to cluster data from native library code with Spark Kmeans||. In
my native library the data are represented as a matrix (row = number of
data and col = dimension). For efficiency reason, they are copied into a
one dimensional scala Array row major wise so after
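A sketch of slicing such a row-major array back into per-row vectors for MLlib (flat, numRows, and dim are hypothetical names):

import org.apache.spark.mllib.linalg.Vectors

// flat: Array[Double] of length numRows * dim, laid out row major
val rows = (0 until numRows).map { i =>
  Vectors.dense(flat.slice(i * dim, (i + 1) * dim))
}
val data = sc.parallelize(rows) // RDD[Vector], ready for KMeans.train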