Re: K Means Clustering Explanation

2018-03-04 Thread Alessandro Solimando
Hi Matt,
unfortunately I have no code pointer at hand.

I will sketch how to accomplish this via the API; it should at least help
you get started.

1) ETL + vectorization (I assume your feature vector column is named "features")

2) You run a clustering algorithm (say KMeans:
https://spark.apache.org/docs/2.2.0/ml-clustering.html); after fitting the
model, its "transform" method adds an extra column named "prediction", so each
row in the original dataframe gets a cluster id associated with it (you can
control the name of the prediction column, as shown in the example; I assume
the default here)

3) You run a DT (https://spark.apache.org/docs/2.2.0/ml-classification-
regression.html#decision-tree-classifier), and specify

.setLabelCol("prediction")
.setFeaturesCol("features")


so that the output of KMeans (the cluster id) is used as the class label by
the classification algorithm, which again uses the feature vector (still
stored in the "features" column).

4) For a start, you can visualize the decision tree with the "toDebugString"
method (note that features are referenced by index, not by their original
names; see this for an idea of how to map the indexes back:
https://stackoverflow.com/questions/36122559/how-to-map-variable-names-to-features-after-pipeline)

The easiest insight you can get from the classifier is "feature importance",
which gives you an approximate idea of the most relevant features used for
classification.

Otherwise you can inspect the model programmatically or manually, but you
have to define precisely what you want to look at (coverage, precision,
recall, etc.) and rank the tree leaves accordingly (using the impurity stats
of each node, for instance).
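
For reference, here is a minimal Scala sketch of steps 2-4, assuming a
dataframe "df" that already contains a vector column named "features"; the k
value and the "dtPrediction" column name are placeholders:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.classification.DecisionTreeClassifier

// Step 2: cluster and attach the cluster id as the "prediction" column
val kmeans = new KMeans().setK(10).setFeaturesCol("features")
val kmeansModel = kmeans.fit(df)
val clustered = kmeansModel.transform(df)

// Step 3: learn a decision tree that predicts the cluster id from the features
val dt = new DecisionTreeClassifier()
  .setLabelCol("prediction")        // the cluster id acts as the class label
  .setFeaturesCol("features")
  .setPredictionCol("dtPrediction") // avoid clashing with the KMeans output column
val dtModel = dt.fit(clustered)

// Step 4: inspect the tree and the feature importances
println(dtModel.toDebugString)
println(dtModel.featureImportances)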

Hth,
Alessandro

On 2 March 2018 at 15:42, Matt Hicks <m...@outr.com> wrote:

> Thanks Alessandro and Christoph.  I appreciate the feedback, but I'm still
> having issues determining how to actually accomplish this with the API.
>
> Can anyone point me to an example in code showing how to accomplish this?
>
>
>
> On Fri, Mar 2, 2018 2:37 AM, Alessandro Solimando
> alessandro.solima...@gmail.com wrote:
>
>> Hi Matt,
>> similarly to what Christoph does, I first derive the cluster id for the
>> elements of my original dataset, and then I use a classification algorithm
>> (cluster ids being the classes here).
>>
>> For this method to be useful you need a "human-readable" model;
>> tree-based models are generally a good choice (e.g., Decision Tree).
>>
>> However, since those models tend to be verbose, you still need a way to
>> summarize them to facilitate readability (there must be some literature on
>> this topic, although I have no pointers to provide).
>>
>> Hth,
>> Alessandro
>>
>>
>>
>>
>>
>> On 1 March 2018 at 21:59, Christoph Brücke <carabo...@gmail.com> wrote:
>>
>> Hi Matt,
>>
>> I see. You could use the trained model to predict the cluster id for each
>> training point. Now you should be able to create a dataset with your
>> original input data and the associated cluster id for each data point in
>> the input data. Now you can group this dataset by cluster id and aggregate
>> over the original 5 features. E.g., get the mean for numerical data or the
>> value that occurs the most for categorical data.
>>
>> The exact aggregation is use-case dependent.
>>
>> I hope this helps,
>> Christoph
>>
>> On 01.03.2018 21:40, "Matt Hicks" <m...@outr.com> wrote:
>>
>> Thanks for the response Christoph.
>>
>> I'm converting large amounts of data into clustering training and I'm
>> just having a hard time reasoning about reversing the clusters (in code)
>> back to the original format to properly understand the dominant values in
>> each cluster.
>>
>> For example, if I have five fields of data and I've trained ten clusters
>> of data I'd like to output the five fields of data as represented by each
>> of the ten clusters.
>>
>>
>>
>> On Thu, Mar 1, 2018 2:36 PM, Christoph Brücke carabo...@gmail.com wrote:
>>
>> Hi Matt,
>>
>> the clusters are defined by their centroids / cluster centers. All the
>> points belonging to a certain cluster are closer to its centroid than to
>> the centroids of any other cluster.
>>
>> What I typically do is convert the cluster centers back to the original
>> input format or, if that is not possible, use the point nearest to the
>> cluster center as a representation of the whole cluster.
>>
>> Can you be a little bit more specific about your use-case?
>>
>> Best,
>> Christoph
>>
>> On 01.03.2018 20:53, "Matt Hicks" <m...@outr.com> wrote:
>>
>> I'm using K Means clustering for a project right now, and it's working
>> very well.  However, I'd like to determine what information distinguishes
>> each cluster so I can explain the "reasons" data fits into a specific
>> cluster.
>>
>> Is there a proper way to do this in Spark ML?
>>
>>
>>
>>


Re: K Means Clustering Explanation

2018-03-02 Thread Matt Hicks
Thanks Alessandro and Christoph.  I appreciate the feedback, but I'm still
having issues determining how to actually accomplish this with the API.
Can anyone point me to an example in code showing how to accomplish this?  





On Fri, Mar 2, 2018 2:37 AM, Alessandro Solimando
alessandro.solima...@gmail.com wrote:
Hi Matt,
similarly to what Christoph does, I first derive the cluster id for the
elements of my original dataset, and then I use a classification algorithm
(cluster ids being the classes here).
For this method to be useful you need a "human-readable" model; tree-based
models are generally a good choice (e.g., Decision Tree).

However, since those models tend to be verbose, you still need a way to
summarize them to facilitate readability (there must be some literature on this
topic, although I have no pointers to provide).
Hth,
Alessandro



On 1 March 2018 at 21:59, Christoph Brücke <carabo...@gmail.com>  wrote:
Hi Matt,
I see. You could use the trained model to predict the cluster id for each
training point. Now you should be able to create a dataset with your original
input data and the associated cluster id for each data point in the input data.
Now you can group this dataset by cluster id and aggregate over the original 5
features. E.g., get the mean for numerical data or the value that occurs the
most for categorical data.
The exact aggregation is use-case dependent.
I hope this helps,
Christoph

On 01.03.2018 21:40, "Matt Hicks" <m...@outr.com> wrote:
Thanks for the response Christoph.
I'm converting large amounts of data into clustering training and I'm just
having a hard time reasoning about reversing the clusters (in code) back to the
original format to properly understand the dominant values in each cluster.
For example, if I have five fields of data and I've trained ten clusters of data
I'd like to output the five fields of data as represented by each of the ten
clusters.  





On Thu, Mar 1, 2018 2:36 PM, Christoph Brücke carabo...@gmail.com  wrote:
Hi Matt,
the clusters are defined by their centroids / cluster centers. All the points
belonging to a certain cluster are closer to its centroid than to the
centroids of any other cluster.
What I typically do is convert the cluster centers back to the original input
format or, if that is not possible, use the point nearest to the cluster
center as a representation of the whole cluster.
Can you be a little bit more specific about your use-case?
Best,
Christoph
On 01.03.2018 20:53, "Matt Hicks" <m...@outr.com> wrote:
I'm using K Means clustering for a project right now, and it's working very
well.  However, I'd like to determine what information distinguishes each
cluster so I can explain the "reasons" data fits into a specific cluster.
Is there a proper way to do this in Spark ML?

Re: K Means Clustering Explanation

2018-03-02 Thread Alessandro Solimando
Hi Matt,
similarly to what Christoph does, I first derive the cluster id for the
elements of my original dataset, and then I use a classification algorithm
(cluster ids being the classes here).

For this method to be useful you need a "human-readable" model; tree-based
models are generally a good choice (e.g., Decision Tree).

However, since those models tend to be verbose, you still need a way to
summarize them to facilitate readability (there must be some literature on
this topic, although I have no pointers to provide).

Hth,
Alessandro





On 1 March 2018 at 21:59, Christoph Brücke <carabo...@gmail.com> wrote:

> Hi Matt,
>
> I see. You could use the trained model to predict the cluster id for each
> training point. Now you should be able to create a dataset with your
> original input data and the associated cluster id for each data point in
> the input data. Now you can group this dataset by cluster id and aggregate
> over the original 5 features. E.g., get the mean for numerical data or the
> value that occurs the most for categorical data.
>
> The exact aggregation is use-case dependent.
>
> I hope this helps,
> Christoph
>
> On 01.03.2018 21:40, "Matt Hicks" <m...@outr.com> wrote:
>
> Thanks for the response Christoph.
>
> I'm converting large amounts of data into clustering training and I'm just
> having a hard time reasoning about reversing the clusters (in code) back to
> the original format to properly understand the dominant values in each
> cluster.
>
> For example, if I have five fields of data and I've trained ten clusters
> of data I'd like to output the five fields of data as represented by each
> of the ten clusters.
>
>
>
> On Thu, Mar 1, 2018 2:36 PM, Christoph Brücke carabo...@gmail.com wrote:
>
>> Hi Matt,
>>
>> the clusters are defined by their centroids / cluster centers. All the
>> points belonging to a certain cluster are closer to its centroid than to
>> the centroids of any other cluster.
>>
>> What I typically do is convert the cluster centers back to the original
>> input format or, if that is not possible, use the point nearest to the
>> cluster center as a representation of the whole cluster.
>>
>> Can you be a little bit more specific about your use-case?
>>
>> Best,
>> Christoph
>>
>> On 01.03.2018 20:53, "Matt Hicks" <m...@outr.com> wrote:
>>
>> I'm using K Means clustering for a project right now, and it's working
>> very well.  However, I'd like to determine what information distinguishes
>> each cluster so I can explain the "reasons" data fits into a specific
>> cluster.
>>
>> Is there a proper way to do this in Spark ML?
>>
>>
>


Re: K Means Clustering Explanation

2018-03-01 Thread Christoph Brücke
Hi Matt,

I see. You could use the trained model to predict the cluster id for each
training point. Now you should be able to create a dataset with your
original input data and the associated cluster id for each data point in
the input data. Now you can group this dataset by cluster id and aggregate
over the original 5 features. E.g., get the mean for numerical data or the
value that occurs the most for categorical data.

The exact aggregation is use-case dependent.
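
To make this concrete, here is a minimal Scala sketch of such an aggregation,
assuming "clustered" is the original dataframe with the KMeans "prediction"
column attached, and "age" / "country" are placeholder numerical and
categorical columns:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// mean of a numerical column per cluster, plus the cluster size
val numericalSummary = clustered
  .groupBy("prediction")
  .agg(avg("age").as("avg_age"), count(lit(1)).as("size"))

// most frequent value of a categorical column per cluster
val topCategoryPerCluster = clustered
  .groupBy("prediction", "country").count()
  .withColumn("rank", row_number().over(
    Window.partitionBy("prediction").orderBy(desc("count"))))
  .filter(col("rank") === 1)
  .drop("rank")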

I hope this helps,
Christoph

On 01.03.2018 21:40, "Matt Hicks" <m...@outr.com> wrote:

Thanks for the response Christoph.

I'm converting large amounts of data into clustering training and I'm just
having a hard time reasoning about reversing the clusters (in code) back to
the original format to properly understand the dominant values in each
cluster.

For example, if I have five fields of data and I've trained ten clusters of
data I'd like to output the five fields of data as represented by each of
the ten clusters.



On Thu, Mar 1, 2018 2:36 PM, Christoph Brücke carabo...@gmail.com wrote:

> Hi Matt,
>
> the clusters are defined by their centroids / cluster centers. All the
> points belonging to a certain cluster are closer to its centroid than to
> the centroids of any other cluster.
>
> What I typically do is convert the cluster centers back to the original
> input format or, if that is not possible, use the point nearest to the
> cluster center as a representation of the whole cluster.
>
> Can you be a little bit more specific about your use-case?
>
> Best,
> Christoph
>
> On 01.03.2018 20:53, "Matt Hicks" <m...@outr.com> wrote:
>
> I'm using K Means clustering for a project right now, and it's working
> very well.  However, I'd like to determine what information distinguishes
> each cluster so I can explain the "reasons" data fits into a specific
> cluster.
>
> Is there a proper way to do this in Spark ML?
>
>


K Means Clustering Explanation

2018-03-01 Thread Matt Hicks
I'm using K Means clustering for a project right now, and it's working very
well.  However, I'd like to determine what information distinguishes each
cluster so I can explain the "reasons" data fits into a specific cluster.
Is there a proper way to do this in Spark ML?

Re: K means clustering in spark

2015-12-31 Thread Yanbo Liang
Hi Anjali,

The main output of KMeansModel is clusterCenters, which is Array[Vector]. It
has k elements, where k is the number of clusters, and each element is the
center of the corresponding cluster.
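
For illustration, here is a minimal Scala sketch on a toy dense dataset (the
same fields are exposed in the Python API), assuming an existing SparkContext
"sc":

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// toy data: two obvious groups
val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

val model = KMeans.train(data, 2, 20)  // k = 2, maxIterations = 20

// one center per cluster
model.clusterCenters.zipWithIndex.foreach { case (c, i) =>
  println(s"cluster $i center: $c")
}

// cluster membership: which points ended up in which cluster
data.map(p => (model.predict(p), p)).groupByKey().collect().foreach {
  case (i, points) => println(s"cluster $i: ${points.mkString(", ")}")
}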

Yanbo


2015-12-31 12:52 GMT+08:00 :

> Hi,
>
> I am trying to use kmeans for clustering in Spark using Python. I ran it
> on the data set that comes with Spark; it's a 3*4 matrix.
>
> Can anybody please help me with how the data should be oriented for
> kmeans, and how to find out what the clusters are and which members belong
> to each.
>
> Thanks
> Anjali
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


K means clustering in spark

2015-12-30 Thread anjali . gautam09
Hi,

I am trying to use kmeans for clustering in Spark using Python. I ran it on
the data set that comes with Spark; it's a 3*4 matrix.

Can anybody please help me with how the data should be oriented for kmeans,
and how to find out what the clusters are and which members belong to each.

Thanks 
Anjali


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Distance Calculation in Spark K means clustering

2015-08-31 Thread ashensw
Hi all,

I am currently working on a K means clustering project. I want to get the
distance of each data point to its cluster center after building the K means
model. Currently I get the cluster index of each data point by passing the
JavaRDD that contains all the data points to the K means predict function,
which returns a JavaRDD of the cluster index of each data point. Then I
convert those JavaRDDs to lists using the collect function and use them to
calculate Euclidean distances. But since this process involves a collect, it
seems rather time consuming.

Is there a more efficient way to calculate the distance of each data point
to its cluster center?

I would also like to know the distance measure that the Spark K means
algorithm uses to build the model. Is it Euclidean distance or squared
Euclidean distance?
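
For reference, one way to compute these distances without collecting
everything to the driver is sketched below in Scala (a rough sketch only,
assuming the trained model and an RDD of points are in scope):

import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Euclidean distance of every point to its assigned cluster center,
// computed on the executors instead of on the driver
def distancesToCenters(model: KMeansModel, points: RDD[Vector]): RDD[Double] =
  points.map { p =>
    val center = model.clusterCenters(model.predict(p))
    math.sqrt(Vectors.sqdist(p, center))
  }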

Thank you



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Distance-Calculation-in-Spark-K-means-clustering-tp24516.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Taking too long on K-means clustering

2015-08-27 Thread masoom alam
HI every one,

I am trying to run the KDD data set - basically chapter 5 of the Advanced
Analytics with Spark book. The data set is 789 MB, but Spark is taking some
3 to 4 hours. Is this normal behaviour, or is some tuning required?
The server RAM is 32 GB, but we can only give 4 GB of RAM to Java on 64-bit
Ubuntu.

Please guide.

Thanks


Settings for K-Means Clustering in Mlib for large data set

2015-06-18 Thread Rogers Jeffrey
Hi All,

I am trying to run KMeans clustering on a large data set with 12,000 points
and 80,000 dimensions.  I have a Spark cluster in EC2 standalone mode with 8
workers running on 2 slaves with 160 GB RAM and 40 vCPUs.

My code is as follows:

import numpy as np
from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.clustering import KMeans

def convert_into_sparse_vector(A):
    non_nan_indices = np.nonzero(~np.isnan(A))
    non_nan_values = A[non_nan_indices]
    dictionary = dict(zip(non_nan_indices[0], non_nan_values))
    return Vectors.sparse(len(A), dictionary)

X = [convert_into_sparse_vector(A) for A in complete_dataframe.values]
sc = SparkContext(appName="parallel_kmeans")
data = sc.parallelize(X, 10)
model = KMeans.train(data, 1000, initializationMode="k-means||")

where complete_dataframe is a pandas data frame that has my data.

I get the error: Py4JNetworkError: An error occurred while trying to
connect to the Java server.

The error  trace is as follows:

  Exception happened during processing 
 of request from ('127.0.0.1', 41360) Traceback (most recent
 call last):   File /usr/lib64/python2.6/SocketServer.py, line 283,
 in _handle_request_noblock
 self.process_request(request, client_address)   File 
 /usr/lib64/python2.6/SocketServer.py, line 309, in process_request
 self.finish_request(request, client_address)   File 
 /usr/lib64/python2.6/SocketServer.py, line 322, in finish_request
 self.RequestHandlerClass(request, client_address, self)   File 
 /usr/lib64/python2.6/SocketServer.py, line 617, in __init__
 self.handle()   File /root/spark/python/pyspark/accumulators.py, line 
 235, in handle
 num_updates = read_int(self.rfile)   File 
 /root/spark/python/pyspark/serializers.py, line 544, in read_int
 raise EOFError EOFError
 
 --- 
 Py4JNetworkError  Traceback (most recent call
 last) ipython-input-13-3dd00c2c5e93 in module()
  1 model = KMeans.train(data, 1000, initializationMode=k-means||)

 /root/spark/python/pyspark/mllib/clustering.pyc in train(cls, rdd, k,
 maxIterations, runs, initializationMode, seed, initializationSteps,
 epsilon)
 134 Train a k-means clustering model.
 135 model = callMLlibFunc(trainKMeansModel, 
 rdd.map(_convert_to_vector), k, maxIterations,
 -- 136   runs, initializationMode, seed, 
 initializationSteps, epsilon)
 137 centers = callJavaFunc(rdd.context, model.clusterCenters)
 138 return KMeansModel([c.toArray() for c in centers])

 /root/spark/python/pyspark/mllib/common.pyc in callMLlibFunc(name,
 *args)
 126 sc = SparkContext._active_spark_context
 127 api = getattr(sc._jvm.PythonMLLibAPI(), name)
 -- 128 return callJavaFunc(sc, api, *args)
 129
 130

 /root/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func,
 *args)
 119  Call Java Function 
 120 args = [_py2java(sc, a) for a in args]
 -- 121 return _java2py(sc, func(*args))
 122
 123

 /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
 __call__(self, *args)
 534 END_COMMAND_PART
 535
 -- 536 answer = self.gateway_client.send_command(command)
 537 return_value = get_return_value(answer, self.gateway_client,
 538 self.target_id, self.name)

 /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
 send_command(self, command, retry)
 367 if retry:
 368 #print_exc()
 -- 369 response = self.send_command(command)
 370 else:
 371 response = ERROR

 /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
 send_command(self, command, retry)
 360  the Py4J protocol.
 361 
 -- 362 connection = self._get_connection()
 363 try:
 364 response = connection.send_command(command)

 /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
 _get_connection(self)
 316 connection = self.deque.pop()
 317 except Exception:
 -- 318 connection = self._create_connection()
 319 return connection
 320

 /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
 _create_connection(self)
 323 connection = GatewayConnection(self.address, self.port,
 324 self.auto_close, self.gateway_property)
 -- 325 connection.start()
 326 return connection
 327

 /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
 start(self)
 430 'server'
 431 logger.exception(msg)
 -- 432 raise Py4JNetworkError(msg)
 433
 434 def close(self):

 Py4JNetworkError: An error occurred while trying to connect to the
 Java server


Is there any specific setting that I am missing

Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-18 Thread Xiangrui Meng
With 80,000 features and 1000 clusters, you need 80,000,000 doubles to
store the cluster centers. That is ~600MB. If there are 10 partitions,
you might need 6GB on the driver to collect updates from workers. I
guess the driver died. Did you specify driver memory with
spark-submit? -Xiangrui
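
(For reference, the arithmetic: 1000 clusters x 80,000 dimensions =
80,000,000 doubles, and at 8 bytes per double that is ~640 MB per copy of the
centers; with ~10 partitions each sending an update of that size, the driver
may need on the order of 10 x 640 MB ~ 6 GB.)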

On Thu, Jun 18, 2015 at 12:22 PM, Rogers Jeffrey
rogers.john2...@gmail.com wrote:
 Hi All,

 I am trying to run KMeans clustering on a large data set with 12,000 points
 and 80,000 dimensions.  I have a spark cluster in Ec2 stand alone mode  with
 8  workers running on 2 slaves with 160 GB Ram and 40 VCPU.

 My Code is as Follows:

 def convert_into_sparse_vector(A):
 non_nan_indices=np.nonzero(~np.isnan(A) )
 non_nan_values=A[non_nan_indices]
 dictionary=dict(zip(non_nan_indices[0],non_nan_values))
 return Vectors.sparse (len(A),dictionary)

 X=[convert_into_sparse_vector(A) for A in complete_dataframe.values ]
 sc=SparkContext(appName=parallel_kmeans)
 data=sc.parallelize(X,10)
 model = KMeans.train(data, 1000, initializationMode=k-means||)

 where complete_dataframe is a pandas data frame that has my data.

 I get the error: Py4JNetworkError: An error occurred while trying to connect
 to the Java server.

 The error  trace is as follows:

  Exception happened during
 processing of request from ('127.0.0.1', 41360) Traceback (most recent
 call last):   File /usr/lib64/python2.6/SocketServer.py, line 283,
 in _handle_request_noblock
 self.process_request(request, client_address)   File
 /usr/lib64/python2.6/SocketServer.py, line 309, in process_request
 self.finish_request(request, client_address)   File
 /usr/lib64/python2.6/SocketServer.py, line 322, in finish_request
 self.RequestHandlerClass(request, client_address, self)   File
 /usr/lib64/python2.6/SocketServer.py, line 617, in __init__
 self.handle()   File /root/spark/python/pyspark/accumulators.py,
 line 235, in handle
 num_updates = read_int(self.rfile)   File
 /root/spark/python/pyspark/serializers.py, line 544, in read_int
 raise EOFError EOFError
 

 ---
 Py4JNetworkError  Traceback (most recent call
 last) ipython-input-13-3dd00c2c5e93 in module()
  1 model = KMeans.train(data, 1000, initializationMode=k-means||)

 /root/spark/python/pyspark/mllib/clustering.pyc in train(cls, rdd, k,
 maxIterations, runs, initializationMode, seed, initializationSteps,
 epsilon)
 134 Train a k-means clustering model.
 135 model = callMLlibFunc(trainKMeansModel,
 rdd.map(_convert_to_vector), k, maxIterations,
 -- 136   runs, initializationMode, seed,
 initializationSteps, epsilon)
 137 centers = callJavaFunc(rdd.context, model.clusterCenters)
 138 return KMeansModel([c.toArray() for c in centers])

 /root/spark/python/pyspark/mllib/common.pyc in callMLlibFunc(name,
 *args)
 126 sc = SparkContext._active_spark_context
 127 api = getattr(sc._jvm.PythonMLLibAPI(), name)
 -- 128 return callJavaFunc(sc, api, *args)
 129
 130

 /root/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func,
 *args)
 119  Call Java Function 
 120 args = [_py2java(sc, a) for a in args]
 -- 121 return _java2py(sc, func(*args))
 122
 123

 /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
 __call__(self, *args)
 534 END_COMMAND_PART
 535
 -- 536 answer = self.gateway_client.send_command(command)
 537 return_value = get_return_value(answer,
 self.gateway_client,
 538 self.target_id, self.name)

 /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
 send_command(self, command, retry)
 367 if retry:
 368 #print_exc()
 -- 369 response = self.send_command(command)
 370 else:
 371 response = ERROR

 /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
 send_command(self, command, retry)
 360  the Py4J protocol.
 361 
 -- 362 connection = self._get_connection()
 363 try:
 364 response = connection.send_command(command)

 /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
 _get_connection(self)
 316 connection = self.deque.pop()
 317 except Exception:
 -- 318 connection = self._create_connection()
 319 return connection
 320

 /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
 _create_connection(self)
 323 connection = GatewayConnection(self.address, self.port,
 324 self.auto_close, self.gateway_property)
 -- 325 connection.start()
 326 return connection
 327

Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-18 Thread Rogers Jeffrey
I am submitting the application from a python notebook. I am launching
pyspark as follows:

SPARK_PUBLIC_DNS=ec2-54-165-202-17.compute-1.amazonaws.com \
SPARK_WORKER_CORES=8 SPARK_WORKER_MEMORY=15g SPARK_MEM=30g OUR_JAVA_MEM=30g \
SPARK_DAEMON_JAVA_OPTS="-XX:MaxPermSize=30g -Xms30g -Xmx30g" IPYTHON=1 \
PYSPARK_PYTHON=/usr/bin/python SPARK_PRINT_LAUNCH_COMMAND=1 \
./spark/bin/pyspark --master spark://54.165.202.17.compute-1.amazonaws.com:7077 --deploy-mode client

I guess I should be adding another extra argument --conf
spark.driver.memory=15g . Is that correct?

Regards,
Rogers Jeffrey L

On Thu, Jun 18, 2015 at 7:50 PM, Xiangrui Meng men...@gmail.com wrote:

 With 80,000 features and 1000 clusters, you need 80,000,000 doubles to
 store the cluster centers. That is ~600MB. If there are 10 partitions,
 you might need 6GB on the driver to collect updates from workers. I
 guess the driver died. Did you specify driver memory with
 spark-submit? -Xiangrui

 On Thu, Jun 18, 2015 at 12:22 PM, Rogers Jeffrey
 rogers.john2...@gmail.com wrote:
  Hi All,
 
  I am trying to run KMeans clustering on a large data set with 12,000
 points
  and 80,000 dimensions.  I have a spark cluster in Ec2 stand alone mode
 with
  8  workers running on 2 slaves with 160 GB Ram and 40 VCPU.
 
  My Code is as Follows:
 
  def convert_into_sparse_vector(A):
  non_nan_indices=np.nonzero(~np.isnan(A) )
  non_nan_values=A[non_nan_indices]
  dictionary=dict(zip(non_nan_indices[0],non_nan_values))
  return Vectors.sparse (len(A),dictionary)
 
  X=[convert_into_sparse_vector(A) for A in complete_dataframe.values ]
  sc=SparkContext(appName=parallel_kmeans)
  data=sc.parallelize(X,10)
  model = KMeans.train(data, 1000, initializationMode=k-means||)
 
  where complete_dataframe is a pandas data frame that has my data.
 
  I get the error: Py4JNetworkError: An error occurred while trying to
 connect
  to the Java server.
 
  The error  trace is as follows:
 
   Exception happened during
  processing of request from ('127.0.0.1', 41360) Traceback (most recent
  call last):   File /usr/lib64/python2.6/SocketServer.py, line 283,
  in _handle_request_noblock
  self.process_request(request, client_address)   File
  /usr/lib64/python2.6/SocketServer.py, line 309, in process_request
  self.finish_request(request, client_address)   File
  /usr/lib64/python2.6/SocketServer.py, line 322, in finish_request
  self.RequestHandlerClass(request, client_address, self)   File
  /usr/lib64/python2.6/SocketServer.py, line 617, in __init__
  self.handle()   File /root/spark/python/pyspark/accumulators.py,
  line 235, in handle
  num_updates = read_int(self.rfile)   File
  /root/spark/python/pyspark/serializers.py, line 544, in read_int
  raise EOFError EOFError
  
 
 
 ---
  Py4JNetworkError  Traceback (most recent call
  last) ipython-input-13-3dd00c2c5e93 in module()
   1 model = KMeans.train(data, 1000, initializationMode=k-means||)
 
  /root/spark/python/pyspark/mllib/clustering.pyc in train(cls, rdd, k,
  maxIterations, runs, initializationMode, seed, initializationSteps,
  epsilon)
  134 Train a k-means clustering model.
  135 model = callMLlibFunc(trainKMeansModel,
  rdd.map(_convert_to_vector), k, maxIterations,
  -- 136   runs, initializationMode, seed,
  initializationSteps, epsilon)
  137 centers = callJavaFunc(rdd.context,
 model.clusterCenters)
  138 return KMeansModel([c.toArray() for c in centers])
 
  /root/spark/python/pyspark/mllib/common.pyc in callMLlibFunc(name,
  *args)
  126 sc = SparkContext._active_spark_context
  127 api = getattr(sc._jvm.PythonMLLibAPI(), name)
  -- 128 return callJavaFunc(sc, api, *args)
  129
  130
 
  /root/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func,
  *args)
  119  Call Java Function 
  120 args = [_py2java(sc, a) for a in args]
  -- 121 return _java2py(sc, func(*args))
  122
  123
 
  /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
  __call__(self, *args)
  534 END_COMMAND_PART
  535
  -- 536 answer = self.gateway_client.send_command(command)
  537 return_value = get_return_value(answer,
  self.gateway_client,
  538 self.target_id, self.name)
 
  /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
  send_command(self, command, retry)
  367 if retry:
  368 #print_exc()
  -- 369 response = self.send_command(command)
  370 else:
  371 response = ERROR
 
  /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
  send_command(self, command, retry)
  360

Announcement: Generalized K-Means Clustering on Spark

2015-01-25 Thread derrickburns
This project generalizes the Spark MLLIB K-Means clusterer to support
clustering of dense or sparse, low or high dimensional data using distance
functions defined by Bregman divergences.

https://github.com/derrickburns/generalized-kmeans-clustering
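
(For context, a Bregman divergence is defined from a strictly convex function
phi as D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>; choosing
phi(x) = ||x||^2 recovers the usual squared Euclidean distance used by
standard k-means.)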



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Announcement-Generalized-K-Means-Clustering-on-Spark-tp21363.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



K-means clustering

2014-11-25 Thread amin mohebbi
I have generated a sparse matrix with Python, which has size 4000*174000
(.pkl); the following is a small part of this matrix:
 (0, 45) 1  (0, 413) 1  (0, 445) 1  (0, 107) 4  (0, 80) 2  (0, 352) 1  (0, 157) 
1  (0, 191) 1  (0, 315) 1  (0, 395) 4  (0, 282) 3  (0, 184) 1  (0, 403) 1  (0, 
169) 1  (0, 267) 1  (0, 148) 1  (0, 449) 1  (0, 241) 1  (0, 303) 1  (0, 364) 1  
(0, 257) 1  (0, 372) 1  (0, 73) 1  (0, 64) 1  (0, 427) 1  : :  (2, 399) 1  (2, 
277) 1  (2, 229) 1  (2, 255) 1  (2, 409) 1  (2, 355) 1  (2, 391) 1  (2, 28) 1  
(2, 384) 1  (2, 86) 1  (2, 285) 2  (2, 166) 1  (2, 165) 1  (2, 419) 1  (2, 367) 
2  (2, 133) 1  (2, 61) 1  (2, 434) 1  (2, 51) 1  (2, 423) 1  (2, 398) 1  (2, 
438) 1  (2, 389) 1  (2, 26) 1  (2, 455) 1
I am new to Spark and would like to cluster this matrix with the k-means
algorithm. Can anyone explain what kind of problems I might face? Please note
that I do not want to use MLlib and would like to write my own k-means.
Best Regards

...

Amin Mohebbi

PhD candidate in Software Engineering 
 at university of Malaysia  

Tel : +60 18 2040 017



E-Mail : tp025...@ex.apiit.edu.my

  amin_...@me.com

Re: K-means clustering

2014-11-25 Thread Xiangrui Meng
There is a simple example here:
https://github.com/apache/spark/blob/master/examples/src/main/python/kmeans.py
. You can take advantage of sparsity by computing the distance via
inner products:
http://spark-summit.org/2014/talk/sparse-data-support-in-mllib-2
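
(For context, the identity being exploited is ||x - c||^2 = ||x||^2 -
2*<x, c> + ||c||^2: the squared norms can be precomputed once per point and
per center, so each assignment only needs the inner product <x, c>, which for
a sparse x touches only its non-zero entries.)
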
-Xiangrui

On Tue, Nov 25, 2014 at 2:39 AM, amin mohebbi
aminn_...@yahoo.com.invalid wrote:
  I  have generated a sparse matrix by python, which has the size of
 4000*174000 (.pkl), the following is a small part of this matrix :

  (0, 45) 1
   (0, 413) 1
   (0, 445) 1
   (0, 107) 4
   (0, 80) 2
   (0, 352) 1
   (0, 157) 1
   (0, 191) 1
   (0, 315) 1
   (0, 395) 4
   (0, 282) 3
   (0, 184) 1
   (0, 403) 1
   (0, 169) 1
   (0, 267) 1
   (0, 148) 1
   (0, 449) 1
   (0, 241) 1
   (0, 303) 1
   (0, 364) 1
   (0, 257) 1
   (0, 372) 1
   (0, 73) 1
   (0, 64) 1
   (0, 427) 1
   : :
   (2, 399) 1
   (2, 277) 1
   (2, 229) 1
   (2, 255) 1
   (2, 409) 1
   (2, 355) 1
   (2, 391) 1
   (2, 28) 1
   (2, 384) 1
   (2, 86) 1
   (2, 285) 2
   (2, 166) 1
   (2, 165) 1
   (2, 419) 1
   (2, 367) 2
   (2, 133) 1
   (2, 61) 1
   (2, 434) 1
   (2, 51) 1
   (2, 423) 1
   (2, 398) 1
   (2, 438) 1
   (2, 389) 1
   (2, 26) 1
   (2, 455) 1

 I am new in Spark and would like to cluster this matrix by k-means
 algorithm. Can anyone explain to me what kind of problems  I might be faced.
 Please note that I do not want to use Mllib and would like to write my own
 k-means.
 Best Regards

 ...

 Amin Mohebbi

 PhD candidate in Software Engineering
  at university of Malaysia

 Tel : +60 18 2040 017



 E-Mail : tp025...@ex.apiit.edu.my

   amin_...@me.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: k-means clustering

2014-11-25 Thread Yanbo Liang
Pre-processing is a major part of the workload before training the model.
MLlib provides TF-IDF calculation, StandardScaler and Normalizer, which are
essential for preprocessing and are a great help for model training.

Take a look at this
http://spark.apache.org/docs/latest/mllib-feature-extraction.html
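
As a rough Scala sketch of that preprocessing path (assuming an existing
SparkContext "sc" and an RDD of already-tokenized documents; the file name
and whitespace tokenization are placeholders):

import org.apache.spark.mllib.feature.{HashingTF, IDF, Normalizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// one document per line, split into terms
val docs: RDD[Seq[String]] = sc.textFile("docs.txt").map(_.split(" ").toSeq)

// term frequencies via feature hashing
val tf = new HashingTF().transform(docs)
tf.cache()

// inverse document frequencies, then TF-IDF vectors
val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)

// optionally L2-normalize each vector before feeding it to KMeans
val normalized = new Normalizer().transform(tfidf)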

2014-11-21 7:18 GMT+08:00 Jun Yang yangjun...@gmail.com:

 Guys,

 As to the questions of pre-processing, you could just migrate your logic
 to Spark before using K-means.

 I only used Scala on Spark, and haven't used Python binding on Spark, but
 I think the basic steps must be the same.

 BTW, if your data set is big with huge sparse dimension feature vector,
 K-Means may not works as good as you expected. And I think this is still
 the optimization direction of Spark MLLib.

 On Wed, Nov 19, 2014 at 2:21 PM, amin mohebbi aminn_...@yahoo.com.invalid
  wrote:

 Hi there,

 I would like to do text clustering using  k-means and Spark on a
 massive dataset. As you know, before running the k-means, I have to do
 pre-processing methods such as TFIDF and NLTK on my big dataset. The
 following is my code in python :

 if __name__ == '__main__': # Cluster a bunch of text documents. import re
 import sys k = 6 vocab = {} xs = [] ns=[] cat=[] filename='2013-01.csv'
 with open(filename, newline='') as f: try: newsreader = csv.reader(f) for
 row in newsreader: ns.append(row[3]) cat.append(row[4]) except csv.Error
 as e: sys.exit('file %s, line %d: %s' % (filename, newsreader.line_num,
 e))  remove_spl_char_regex = re.compile('[%s]' %
 re.escape(string.punctuation)) # regex to remove special characters
 remove_num = re.compile('[\d]+') #nltk.download() stop_words=
 nltk.corpus.stopwords.words('english') for a in ns: x = defaultdict(float
 )  a1 = a.strip().lower() a2 = remove_spl_char_regex.sub( ,a1) #
 Remove special characters a3 = remove_num.sub(, a2) #Remove numbers #Remove
 stop words words = a3.split() filter_stop_words = [w for w in words if
 not w in stop_words] stemed = [PorterStemmer().stem_word(w) for w in
 filter_stop_words] ws=sorted(stemed)  #ws=re.findall(r\w+, a1) for w in
 ws: vocab.setdefault(w, len(vocab)) x[vocab[w]] += 1 xs.append(x.items())

 Can anyone explain to me how can I do the pre-processing step, before
 running the k-means using spark.


 Best Regards

 ...

 Amin Mohebbi

 PhD candidate in Software Engineering
  at university of Malaysia

 Tel : +60 18 2040 017



 E-Mail : tp025...@ex.apiit.edu.my

   amin_...@me.com




 --
 yangjun...@gmail.com
 http://hi.baidu.com/yjpro



Re: k-means clustering

2014-11-20 Thread Jun Yang
Guys,

As to the question of pre-processing, you could just migrate your logic to
Spark before using K-means.

I have only used Scala on Spark, and haven't used the Python binding, but I
think the basic steps must be the same.

BTW, if your data set is big, with huge, sparse feature vectors, K-Means may
not work as well as you expect. And I think this is still an optimization
direction for Spark MLlib.

On Wed, Nov 19, 2014 at 2:21 PM, amin mohebbi aminn_...@yahoo.com.invalid
wrote:

 Hi there,

 I would like to do text clustering using  k-means and Spark on a massive
 dataset. As you know, before running the k-means, I have to do
 pre-processing methods such as TFIDF and NLTK on my big dataset. The
 following is my code in python :

 if __name__ == '__main__': # Cluster a bunch of text documents. import re
 import sys k = 6 vocab = {} xs = [] ns=[] cat=[] filename='2013-01.csv'
 with open(filename, newline='') as f: try: newsreader = csv.reader(f) for
 row in newsreader: ns.append(row[3]) cat.append(row[4]) except csv.Error
 as e: sys.exit('file %s, line %d: %s' % (filename, newsreader.line_num,
 e))  remove_spl_char_regex = re.compile('[%s]' %
 re.escape(string.punctuation)) # regex to remove special characters
 remove_num = re.compile('[\d]+') #nltk.download() stop_words=
 nltk.corpus.stopwords.words('english') for a in ns: x = defaultdict(float)
 a1 = a.strip().lower() a2 = remove_spl_char_regex.sub( ,a1) # Remove
 special characters a3 = remove_num.sub(, a2) #Remove numbers #Remove
 stop words words = a3.split() filter_stop_words = [w for w in words if not
 w in stop_words] stemed = [PorterStemmer().stem_word(w) for w in
 filter_stop_words] ws=sorted(stemed)  #ws=re.findall(r\w+, a1) for w in
 ws: vocab.setdefault(w, len(vocab)) x[vocab[w]] += 1 xs.append(x.items())

 Can anyone explain to me how can I do the pre-processing step, before
 running the k-means using spark.


 Best Regards

 ...

 Amin Mohebbi

 PhD candidate in Software Engineering
  at university of Malaysia

 Tel : +60 18 2040 017



 E-Mail : tp025...@ex.apiit.edu.my

   amin_...@me.com




-- 
yangjun...@gmail.com
http://hi.baidu.com/yjpro


k-means clustering

2014-11-18 Thread amin mohebbi
Hi there,
I would like to do text clustering using k-means and Spark on a massive
dataset. As you know, before running k-means, I have to apply pre-processing
methods such as TF-IDF and NLTK to my big dataset. The following is my code
in Python:

if __name__ == '__main__':
    # Cluster a bunch of text documents.
    import re
    import sys
    import csv
    import string
    import nltk
    from collections import defaultdict
    from nltk.stem.porter import PorterStemmer

    k = 6
    vocab = {}
    xs = []
    ns = []
    cat = []
    filename = '2013-01.csv'
    with open(filename, newline='') as f:
        try:
            newsreader = csv.reader(f)
            for row in newsreader:
                ns.append(row[3])
                cat.append(row[4])
        except csv.Error as e:
            sys.exit('file %s, line %d: %s' % (filename, newsreader.line_num, e))

    # regex to remove special characters
    remove_spl_char_regex = re.compile('[%s]' % re.escape(string.punctuation))
    remove_num = re.compile('[\d]+')
    # nltk.download()
    stop_words = nltk.corpus.stopwords.words('english')

    for a in ns:
        x = defaultdict(float)

        a1 = a.strip().lower()
        a2 = remove_spl_char_regex.sub(" ", a1)  # Remove special characters
        a3 = remove_num.sub("", a2)              # Remove numbers
        # Remove stop words
        words = a3.split()
        filter_stop_words = [w for w in words if not w in stop_words]
        stemed = [PorterStemmer().stem_word(w) for w in filter_stop_words]
        ws = sorted(stemed)

        # ws = re.findall(r"\w+", a1)
        for w in ws:
            vocab.setdefault(w, len(vocab))
            x[vocab[w]] += 1
        xs.append(x.items())



Can anyone explain how I can do the pre-processing step before running
k-means using Spark?
 
Best Regards

...

Amin Mohebbi

PhD candidate in Software Engineering 
 at university of Malaysia  

Tel : +60 18 2040 017



E-Mail : tp025...@ex.apiit.edu.my

  amin_...@me.com

Re: Categorical Features for K-Means Clustering

2014-09-16 Thread st553
Does MLlib provide utility functions to do this kind of encoding?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Categorical-Features-for-K-Means-Clustering-tp9416p14394.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Categorical Features for K-Means Clustering

2014-09-16 Thread Sean Owen
I think it's on the table but not yet merged?
https://issues.apache.org/jira/browse/SPARK-1216

On Tue, Sep 16, 2014 at 10:04 PM, st553 sthompson...@gmail.com wrote:
 Does MLlib provide utility functions to do this kind of encoding?



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Categorical-Features-for-K-Means-Clustering-tp9416p14394.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Speeding up K-Means Clustering

2014-07-17 Thread Xiangrui Meng
Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g
and repartition your data to match the number of CPU cores so that the
data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to store the
data. Make sure there is enough memory for caching. -Xiangrui

On Thu, Jul 17, 2014 at 1:48 AM, Ravishankar Rajagopalan
viora...@gmail.com wrote:
 I am trying to use MLlib for K-Means clustering on a data set with 1 million
 rows and 50 columns (all columns have double values) which is on HDFS (raw
 txt file is 28 MB)

 I initially tried the following:

 val data3 = sc.textFile(hdfs://...inputData.txt)
 val parsedData3 = data3.map( _.split('\t').map(_.toDouble))
 val numIterations = 10
 val numClusters = 200
 val clusters = KMeans.train(parsedData3, numClusters, numIterations)

 This took me nearly 850 seconds.

 I tried using persist with MEMORY_ONLY option hoping that this would
 significantly speed up the algorithm:

 val data3 = sc.textFile(hdfs://...inputData.txt)
 val parsedData3 = data3.map( _.split('\t').map(_.toDouble))
 parsedData3.persist(MEMORY_ONLY)
 val numIterations = 10
 val numClusters = 200
 val clusters = KMeans.train(parsedData3, numClusters, numIterations)

 This resulted in only a marginal improvement and took around 720 seconds.

 Is there any other way to speed up the algorithm further?

 Thank you.

 Regards,
 Ravi


Re: Speeding up K-Means Clustering

2014-07-17 Thread Ravishankar Rajagopalan
Hi Xiangrui,

Yes I am using Spark v0.9 and am not running it in local mode.

I did the memory setting using export SPARK_MEM=4G before starting the  Spark
instance.

Also previously, I was starting it with -c 1 but changed it to -c 12 since
it is a 12 core machine. It did bring down the time taken to less than 200
seconds from over 700 seconds.

I am not sure how to repartition the data to match the CPU cores. How do I
do it?

Thank you.

Ravi


On Thu, Jul 17, 2014 at 3:17 PM, Xiangrui Meng men...@gmail.com wrote:

 Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g
 and repartition your data to match number of CPU cores such that the
 data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to storage
 the data. Make sure there are enough memory for caching. -Xiangrui

 On Thu, Jul 17, 2014 at 1:48 AM, Ravishankar Rajagopalan
 viora...@gmail.com wrote:
  I am trying to use MLlib for K-Means clustering on a data set with 1
 million
  rows and 50 columns (all columns have double values) which is on HDFS
 (raw
  txt file is 28 MB)
 
  I initially tried the following:
 
  val data3 = sc.textFile(hdfs://...inputData.txt)
  val parsedData3 = data3.map( _.split('\t').map(_.toDouble))
  val numIterations = 10
  val numClusters = 200
  val clusters = KMeans.train(parsedData3, numClusters, numIterations)
 
  This took me nearly 850 seconds.
 
  I tried using persist with MEMORY_ONLY option hoping that this would
  significantly speed up the algorithm:
 
  val data3 = sc.textFile(hdfs://...inputData.txt)
  val parsedData3 = data3.map( _.split('\t').map(_.toDouble))
  parsedData3.persist(MEMORY_ONLY)
  val numIterations = 10
  val numClusters = 200
  val clusters = KMeans.train(parsedData3, numClusters, numIterations)
 
  This resulted in only a marginal improvement and took around 720 seconds.
 
  Is there any other way to speed up the algorithm further?
 
  Thank you.
 
  Regards,
  Ravi



Re: Speeding up K-Means Clustering

2014-07-17 Thread Xiangrui Meng
Please try

val parsedData3 =
  data3.repartition(12).map(_.split("\t").map(_.toDouble)).cache()

and check the storage and driver/executor memory in the WebUI. Make
sure the data is fully cached.

-Xiangrui


On Thu, Jul 17, 2014 at 5:09 AM, Ravishankar Rajagopalan
viora...@gmail.com wrote:
 Hi Xiangrui,

 Yes I am using Spark v0.9 and am not running it in local mode.

 I did the memory setting using export SPARK_MEM=4G before starting the
 Spark instance.

 Also previously, I was starting it with -c 1 but changed it to -c 12 since
 it is a 12 core machine. It did bring down the time taken to less than 200
 seconds from over 700 seconds.

 I am not sure how to repartition the data to match the CPU cores. How do I
 do it?

 Thank you.

 Ravi


 On Thu, Jul 17, 2014 at 3:17 PM, Xiangrui Meng men...@gmail.com wrote:

 Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g
 and repartition your data to match number of CPU cores such that the
 data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to storage
 the data. Make sure there are enough memory for caching. -Xiangrui

 On Thu, Jul 17, 2014 at 1:48 AM, Ravishankar Rajagopalan
 viora...@gmail.com wrote:
  I am trying to use MLlib for K-Means clustering on a data set with 1
  million
  rows and 50 columns (all columns have double values) which is on HDFS
  (raw
  txt file is 28 MB)
 
  I initially tried the following:
 
  val data3 = sc.textFile(hdfs://...inputData.txt)
  val parsedData3 = data3.map( _.split('\t').map(_.toDouble))
  val numIterations = 10
  val numClusters = 200
  val clusters = KMeans.train(parsedData3, numClusters, numIterations)
 
  This took me nearly 850 seconds.
 
  I tried using persist with MEMORY_ONLY option hoping that this would
  significantly speed up the algorithm:
 
  val data3 = sc.textFile(hdfs://...inputData.txt)
  val parsedData3 = data3.map( _.split('\t').map(_.toDouble))
  parsedData3.persist(MEMORY_ONLY)
  val numIterations = 10
  val numClusters = 200
  val clusters = KMeans.train(parsedData3, numClusters, numIterations)
 
  This resulted in only a marginal improvement and took around 720
  seconds.
 
  Is there any other way to speed up the algorithm further?
 
  Thank you.
 
  Regards,
  Ravi




Categorical Features for K-Means Clustering

2014-07-11 Thread Wen Phan
Hi Folks,

Does anyone have experience or recommendations on incorporating categorical
features (attributes) into k-means clustering in Spark?  In other words, I
want to cluster on a set of attributes that include categorical variables.

I know I could probably implement some custom code to parse and calculate my
own similarity function, but I wanted to reach out before I did so.  I’d also
prefer to take advantage of the k-means|| initialization feature of the model
in MLlib, so an MLlib-based implementation would be preferred.

Thanks in advance.

Best,

-Wen




Re: Categorical Features for K-Means Clustering

2014-07-11 Thread Sean Owen
Since you can't define your own distance function, you will need to
convert these to numeric dimensions. 1-of-n encoding can work OK,
depending on your use case. So a dimension that takes on 3 categorical
values, becomes 3 dimensions, of which all are 0 except one that has
value 1.
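
As a tiny hand-rolled illustration of that encoding in Scala (the category
values are hypothetical):

// a categorical attribute with 3 values becomes 3 numeric dimensions
val categories = Seq("red", "green", "blue")
def oneHot(value: String): Array[Double] =
  categories.map(c => if (c == value) 1.0 else 0.0).toArray

// "red" -> [1.0, 0.0, 0.0], "green" -> [0.0, 1.0, 0.0], "blue" -> [0.0, 0.0, 1.0]
println(oneHot("green").mkString("[", ", ", "]"))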

On Fri, Jul 11, 2014 at 3:07 PM, Wen Phan wen.p...@mac.com wrote:
 Hi Folks,

 Does any one have experience or recommendations on incorporating categorical 
 features (attributes) into k-means clustering in Spark?  In other words, I 
 want to cluster on a set of attributes that include categorical variables.

 I know I could probably implement some custom code to parse and calculate my 
 own similarity function, but I wanted to reach out before I did so.  I’d also 
 prefer to take advantage of the k-means\parallel initialization feature of 
 the model in MLlib, so an MLlib-based implementation would be preferred.

 Thanks in advance.

 Best,

 -Wen


Re: Categorical Features for K-Means Clustering

2014-07-11 Thread Wen Phan
I see.  So, basically, kind of like dummy variables in regression.
Thanks, Sean.

On Jul 11, 2014, at 10:11 AM, Sean Owen so...@cloudera.com wrote:

 Since you can't define your own distance function, you will need to
 convert these to numeric dimensions. 1-of-n encoding can work OK,
 depending on your use case. So a dimension that takes on 3 categorical
 values, becomes 3 dimensions, of which all are 0 except one that has
 value 1.
 
 On Fri, Jul 11, 2014 at 3:07 PM, Wen Phan wen.p...@mac.com wrote:
 Hi Folks,
 
 Does any one have experience or recommendations on incorporating categorical 
 features (attributes) into k-means clustering in Spark?  In other words, I 
 want to cluster on a set of attributes that include categorical variables.
 
 I know I could probably implement some custom code to parse and calculate my 
 own similarity function, but I wanted to reach out before I did so.  I’d 
 also prefer to take advantage of the k-means\parallel initialization feature 
 of the model in MLlib, so an MLlib-based implementation would be preferred.
 
 Thanks in advance.
 
 Best,
 
 -Wen


