[Structured Streaming][Parquet] How to specify partition and data when saving to Parquet

2018-03-02 Thread karthikjay
My DataFrame has the following schema:

root
 |-- data: struct (nullable = true)
 |    |-- zoneId: string (nullable = true)
 |    |-- deviceId: string (nullable = true)
 |    |-- timeSinceLast: long (nullable = true)
 |-- date: date (nullable = true)

How can I do a writeStream with Parquet format, write only the data fields
(zoneId, deviceId, timeSinceLast, but not date), and partition the output by
date? I tried the following code and the partitionBy clause did not work:

val query1 = df1
  .writeStream
  .format("parquet")
  .option("path", "/Users/abc/hb_parquet/data")
  .option("checkpointLocation", "/Users/abc/hb_parquet/checkpoint")
  .partitionBy("data.zoneId")
  .start()
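
For reference, a minimal sketch of one way to get the intended layout: the file
sink's partition columns have to be top-level columns, so the nested fields are
flattened first and the stream is partitioned by date. The column handling
below is illustrative and not verified against this exact job.

import org.apache.spark.sql.functions.col

// Flatten the struct so that both the data fields and the partition column
// are top-level columns, then partition by "date" as intended.
val flattened = df1.select(
  col("data.zoneId").as("zoneId"),
  col("data.deviceId").as("deviceId"),
  col("data.timeSinceLast").as("timeSinceLast"),
  col("date"))

val query1 = flattened
  .writeStream
  .format("parquet")
  .option("path", "/Users/abc/hb_parquet/data")
  .option("checkpointLocation", "/Users/abc/hb_parquet/checkpoint")
  .partitionBy("date")
  .start()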





Re: Question on Spark-kubernetes integration

2018-03-02 Thread Felix Cheung
For pyspark specifically, IMO it should be very high on the list to port back...

As for roadmap - should be sharing more soon.


From: lucas.g...@gmail.com 
Sent: Friday, March 2, 2018 9:41:46 PM
To: user@spark.apache.org
Cc: Felix Cheung
Subject: Re: Question on Spark-kubernetes integration

Oh interesting, given that pyspark was working in spark on kub 2.2 I assumed it 
would be part of what got merged.

Is there a roadmap in terms of when that may get merged up?

Thanks!



On 2 March 2018 at 21:32, Felix Cheung wrote:
That’s in the plan. We should be sharing a bit more about the roadmap in future 
releases shortly.

In the mean time this is in the official documentation on what is coming:
https://spark.apache.org/docs/latest/running-on-kubernetes.html#future-work

This support started as a fork of the Apache Spark project, and this fork has 
dynamic scaling support you can check out here:
https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html#dynamic-executor-scaling



From: Lalwani, Jayesh
Sent: Friday, March 2, 2018 8:08:55 AM
To: user@spark.apache.org
Subject: Question on Spark-kubernetes integration

Does the Resource scheduler support dynamic resource allocation? Are there any 
plans to add in the future?






Re: Question on Spark-kubernetes integration

2018-03-02 Thread lucas.g...@gmail.com
Oh interesting, given that pyspark was working in spark on kub 2.2 I
assumed it would be part of what got merged.

Is there a roadmap in terms of when that may get merged up?

Thanks!



On 2 March 2018 at 21:32, Felix Cheung  wrote:

> That’s in the plan. We should be sharing a bit more about the roadmap in
> future releases shortly.
>
> In the mean time this is in the official documentation on what is coming:
> https://spark.apache.org/docs/latest/running-on-kubernetes.html#future-work
>
> This support started as a fork of the Apache Spark project, and this fork
> has dynamic scaling support you can check out here:
> https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html#dynamic-executor-scaling
>
>
> --
> *From:* Lalwani, Jayesh 
> *Sent:* Friday, March 2, 2018 8:08:55 AM
> *To:* user@spark.apache.org
> *Subject:* Question on Spark-kubernetes integration
>
>
> Does the Resource scheduler support dynamic resource allocation? Are there
> any plans to add in the future?
>


Re: Question on Spark-kubernetes integration

2018-03-02 Thread Felix Cheung
That's in the plan. We should be sharing a bit more about the roadmap in future 
releases shortly.

In the mean time this is in the official documentation on what is coming:
https://spark.apache.org/docs/latest/running-on-kubernetes.html#future-work

This support started as a fork of the Apache Spark project, and this fork has 
dynamic scaling support you can check out here:
https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html#dynamic-executor-scaling



From: Lalwani, Jayesh 
Sent: Friday, March 2, 2018 8:08:55 AM
To: user@spark.apache.org
Subject: Question on Spark-kubernetes integration

Does the Resource scheduler support dynamic resource allocation? Are there any 
plans to add in the future?





Re: [Beginner] How to save Kafka Dstream data to parquet ?

2018-03-02 Thread Tathagata Das
Structured Streaming's file sink solves these problems by writing a
log/manifest of all the authoritative files written out (for any format).
So if you run batch or interactive queries on the output directory with
Spark, it will automatically read the manifest and only process files that
are in the manifest, thus skipping any partial files, etc.
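
For reference, a minimal sketch of the file sink described above (this assumes
Spark 2.x with the spark-sql-kafka-0-10 package; the topic name and paths are
placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

// The file sink keeps a _spark_metadata log under the output path listing
// every file it has fully committed.
val query = kafkaDf
  .selectExpr("CAST(value AS STRING) AS json")   // parse further as needed
  .writeStream
  .format("parquet")
  .option("path", "/data/events_parquet")
  .option("checkpointLocation", "/data/events_checkpoint")
  .start()

// Later, Spark batch reads on the same path consult that log and skip any
// partial or uncommitted files:
val events = spark.read.parquet("/data/events_parquet")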



On Fri, Mar 2, 2018 at 1:37 PM, Sunil Parmar  wrote:

> Is there a way to get finer control over file writing in parquet file
> writer ?
>
> We have a streaming application using Apache Apex (on a path of migration to
> Spark ...story for a different thread). The existing streaming application
> reads JSON from Kafka and writes Parquet to HDFS. We're trying to deal with
> partial files by writing .tmp files and renaming them as the last step. We
> only commit offsets after the rename is successful. This way we get
> at-least-once semantics and avoid the partial-file-write issue.
>
> Thoughts ?
>
>
> Sunil Parmar
>
> On Wed, Feb 28, 2018 at 1:59 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> There is no good way to save to parquet without causing downstream
>> consistency issues.
>> You could use foreachRDD to get each RDD, convert it to
>> DataFrame/Dataset, and write out as parquet files. But you will later run
>> into issues with partial files caused by failures, etc.
>>
>>
>> On Wed, Feb 28, 2018 at 11:09 AM, karthikus  wrote:
>>
>>> Hi all,
>>>
>>> I have Kafka stream data and I need to save it in parquet format
>>> without using Structured Streaming (due to the lack of Kafka message
>>> header support).
>>>
>>> val kafkaStream =
>>>   KafkaUtils.createDirectStream(
>>> streamingContext,
>>> LocationStrategies.PreferConsistent,
>>> ConsumerStrategies.Subscribe[String, String](
>>>   topics,
>>>   kafkaParams
>>> )
>>>   )
>>> // process the messages
>>> val messages = kafkaStream.map(record => (record.key, record.value))
>>> val lines = messages.map(_._2)
>>>
>>> Now, how do I save it as parquet? All the examples that I have come
>>> across use SQLContext, which is deprecated! Any help appreciated!
>>>
>>>
>>>
>>>
>>
>


Re: Pyspark Error: Unable to read a hive table with transactional property set as 'True'

2018-03-02 Thread ayan guha
Hi

Couple of questions:

1. It seems the error is due to a number format problem:

Caused by: java.util.concurrent.ExecutionException:
java.lang.NumberFormatException: For input string: "0003024_"
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
... 75 more

Why do you think it is due to ACID?

2. You should not be creating a HiveContext again in the REPL; there is no need
for that. The REPL already reports: SparkContext available as sc, HiveContext
available as sqlContext.

3. Have you tried the same with spark 2.x?



On Sat, Mar 3, 2018 at 5:00 AM, Debabrata Ghosh 
wrote:

> Hi All,
>Greetings ! I needed some help to read a Hive table
> via Pyspark for which the transactional property is set to 'True' (In other
> words ACID property is enabled). Following is the entire stacktrace and the
> description of the hive table. Would you please be able to help me resolve
> the error:
>
> [Spark 1.6.3 startup log and Hive metastore initialization output omitted
> here; it is quoted in the original message later in this digest.]

Re: [Beginner] How to save Kafka Dstream data to parquet ?

2018-03-02 Thread Sunil Parmar
Is there a way to get finer control over file writing in parquet file
writer ?

We have a streaming application using Apache Apex (on a path of migration to
Spark ...story for a different thread). The existing streaming application
reads JSON from Kafka and writes Parquet to HDFS. We're trying to deal with
partial files by writing .tmp files and renaming them as the last step. We
only commit offsets after the rename is successful. This way we get
at-least-once semantics and avoid the partial-file-write issue.

Thoughts ?


Sunil Parmar

On Wed, Feb 28, 2018 at 1:59 PM, Tathagata Das 
wrote:

> There is no good way to save to parquet without causing downstream
> consistency issues.
> You could use foreachRDD to get each RDD, convert it to DataFrame/Dataset,
> and write out as parquet files. But you will later run into issues with
> partial files caused by failures, etc.
>
>
> On Wed, Feb 28, 2018 at 11:09 AM, karthikus  wrote:
>
>> Hi all,
>>
>> I have Kafka stream data and I need to save it in parquet format
>> without using Structured Streaming (due to the lack of Kafka message
>> header support).
>>
>> val kafkaStream =
>>   KafkaUtils.createDirectStream(
>> streamingContext,
>> LocationStrategies.PreferConsistent,
>> ConsumerStrategies.Subscribe[String, String](
>>   topics,
>>   kafkaParams
>> )
>>   )
>> // process the messages
>> val messages = kafkaStream.map(record => (record.key, record.value))
>> val lines = messages.map(_._2)
>>
>> Now, how do I save it as parquet? All the examples that I have come
>> across use SQLContext, which is deprecated! Any help appreciated!
>>
>>
>>
>>
>
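
A minimal sketch of the foreachRDD approach quoted above, combined with the
commit-offsets-only-after-a-successful-write idea from the top of this message.
It assumes Spark 2.2+ (for spark.read.json on a Dataset[String]) and the
direct stream kafkaStream from the quoted snippet; the output path is a
placeholder, and this still does not remove the partial-file risk TD mentions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

kafkaStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Capture this micro-batch's offset ranges before any transformation.
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

    // Convert the JSON values into a DataFrame and append as Parquet.
    val df = spark.read.json(spark.createDataset(rdd.map(_.value)))
    df.write.mode("append").parquet("/data/kafka_parquet")

    // Commit offsets only after the write succeeded (at-least-once).
    kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }
}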


Re: Can I get my custom spark strategy to run last?

2018-03-02 Thread Vadim Semenov
Something like this?

sparkSession.experimental.extraStrategies = Seq(Strategy)

val logicalPlan = df.logicalPlan
val newPlan: LogicalPlan = Strategy(logicalPlan)

Dataset.ofRows(sparkSession, newPlan)


On Thu, Mar 1, 2018 at 8:20 PM, Keith Chapman 
wrote:

> Hi,
>
> I'd like to write a custom Spark strategy that runs after all the existing
> Spark strategies are run. Looking through the Spark code it seems like the
> custom strategies are prepended to the list of strategies in Spark. Is
> there a way I could get it to run last?
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>


Pyspark Error: Unable to read a hive table with transactional property set as 'True'

2018-03-02 Thread Debabrata Ghosh
Hi All,
Greetings! I need some help to read a Hive table via Pyspark for which the
transactional property is set to 'True' (in other words, the ACID property is
enabled). Following is the entire stack trace and the description of the Hive
table. Would you please be able to help me resolve the error:

18/03/01 11:06:22 INFO BlockManagerMaster: Registered BlockManager
18/03/01 11:06:22 INFO EventLoggingListener: Logging events to
hdfs:///spark-history/local-1519923982155
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.3
      /_/

Using Python version 2.7.12 (default, Jul  2 2016 17:42:40)
SparkContext available as sc, HiveContext available as sqlContext.
>>> from pyspark.sql import HiveContext
>>> hive_context = HiveContext(sc)
>>> hive_context.sql("select count(*) from
load_etl.trpt_geo_defect_prod_dec07_del_blank").show()
18/03/01 11:09:45 INFO HiveContext: Initializing execution hive, version
1.2.1
18/03/01 11:09:45 INFO ClientWrapper: Inspected Hadoop version:
2.7.3.2.6.0.3-8
18/03/01 11:09:45 INFO ClientWrapper: Loaded
org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version
2.7.3.2.6.0.3-8
18/03/01 11:09:46 INFO HiveMetaStore: 0: Opening raw store with
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
18/03/01 11:09:46 INFO ObjectStore: ObjectStore, initialize called
18/03/01 11:09:46 INFO Persistence: Property
hive.metastore.integral.jdo.pushdown unknown - will be ignored
18/03/01 11:09:46 INFO Persistence: Property datanucleus.cache.level2
unknown - will be ignored
18/03/01 11:09:50 INFO ObjectStore: Setting MetaStore object pin classes
with
hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
18/03/01 11:09:50 INFO Datastore: The class
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
"embedded-only" so does not have its own datastore table.
18/03/01 11:09:50 INFO Datastore: The class
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
"embedded-only" so does not have its own datastore table.
18/03/01 11:09:53 INFO Datastore: The class
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
"embedded-only" so does not have its own datastore table.
18/03/01 11:09:53 INFO Datastore: The class
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
"embedded-only" so does not have its own datastore table.
18/03/01 11:09:54 INFO MetaStoreDirectSql: Using direct SQL, underlying DB
is DERBY
18/03/01 11:09:54 INFO ObjectStore: Initialized ObjectStore
18/03/01 11:09:54 WARN ObjectStore: Version information not found in
metastore. hive.metastore.schema.verification is not enabled so recording
the schema version 1.2.0
18/03/01 11:09:54 WARN ObjectStore: Failed to get database default,
returning NoSuchObjectException
18/03/01 11:09:54 INFO HiveMetaStore: Added admin role in metastore
18/03/01 11:09:54 INFO HiveMetaStore: Added public role in metastore
18/03/01 11:09:55 INFO HiveMetaStore: No user is added in admin role, since
config is empty
18/03/01 11:09:55 INFO HiveMetaStore: 0: get_all_databases
18/03/01 11:09:55 INFO audit: ugi=devu...@ip.com   ip=unknown-ip-addr
cmd=get_all_databases
18/03/01 11:09:55 INFO HiveMetaStore: 0: get_functions: db=default pat=*
18/03/01 11:09:55 INFO audit: ugi=devu...@ip.com   ip=unknown-ip-addr
cmd=get_functions: db=default pat=*
18/03/01 11:09:55 INFO Datastore: The class
"org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as
"embedded-only" so does not have its own datastore table.
18/03/01 11:09:55 INFO SessionState: Created local directory:
/tmp/22ea9ac9-23d1-4247-9e02-ce45809cd9ae_resources
18/03/01 11:09:55 INFO SessionState: Created HDFS directory:
/tmp/hive/hdetldev/22ea9ac9-23d1-4247-9e02-ce45809cd9ae
18/03/01 11:09:55 INFO SessionState: Created local directory:
/tmp/hdetldev/22ea9ac9-23d1-4247-9e02-ce45809cd9ae
18/03/01 11:09:55 INFO SessionState: Created HDFS directory:
/tmp/hive/hdetldev/22ea9ac9-23d1-4247-9e02-ce45809cd9ae/_tmp_space.db
18/03/01 11:09:55 INFO HiveContext: default warehouse location is
/user/hive/warehouse
18/03/01 11:09:55 INFO HiveContext: Initializing HiveMetastoreConnection
version 1.2.1 using Spark classes.
18/03/01 11:09:55 INFO ClientWrapper: Inspected Hadoop version:
2.7.3.2.6.0.3-8
18/03/01 11:09:55 INFO ClientWrapper: Loaded
org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version
2.7.3.2.6.0.3-8
18/03/01 11:09:56 INFO metastore: Trying to connect to metastore with URI
thrift://ip.com:9083
18/03/01 11:09:56 INFO metastore: Connected to metastore.
18/03/01 11:09:56 INFO SessionState: Created local directory:
/tmp/24379bb3-8ddf-4716-b68d-07ac0f92d9f1_resources
18/03/01 11:09:56 INFO SessionState: Created HDFS directory:
/tmp/hive/hdetldev/24379bb3-8ddf-4716-b68d-07ac0f92d9f1
18/03/01 11:09:56 INFO SessionState: Created local directory:
/tmp/hdetldev/24379bb3-8ddf-4716-b68d-07ac0f92d9f1
18/03/01 

Spark Streaming reading many topics with Avro

2018-03-02 Thread Guillermo Ortiz
Hello,

I want to read several topics with a single Spark Streaming process. I'm
using Avro, and the data in the different topics has different schemas.
Ideally, if I had only one topic I could implement a deserializer, but I
don't know if that is possible with many different schemas.

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092,anotherhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "use_a_separate_group_id_for_each_stream",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean))



I can only set one value.deserializer and, even if I could set many of them, I
don't know how the process would pick the right one. Any idea? I guess I could
use a ByteArrayDeserializer and do it myself, too...
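
A sketch of that last idea: consume raw bytes with Kafka's ByteArrayDeserializer
and dispatch on each record's topic to pick the right Avro schema. The topic
names, schema JSON strings (schemaJsonA, schemaJsonB) and streamingContext are
placeholders; readers are built once per partition because Avro readers are not
serializable:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092,anotherhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[ByteArrayDeserializer],  // raw bytes, decoded per topic below
  "group.id" -> "use_a_separate_group_id_for_each_stream",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean))

// One Avro schema (as JSON text) per topic, keyed by topic name.
val schemaJsonByTopic: Map[String, String] = Map(
  "topicA" -> schemaJsonA,
  "topicB" -> schemaJsonB)

val stream = KafkaUtils.createDirectStream[String, Array[Byte]](
  streamingContext, PreferConsistent,
  Subscribe[String, Array[Byte]](schemaJsonByTopic.keys.toSeq, kafkaParams))

val decoded = stream.mapPartitions { iter =>
  // Build one datum reader per topic, once per partition.
  val readers = schemaJsonByTopic.map { case (topic, json) =>
    topic -> new GenericDatumReader[GenericRecord](new Schema.Parser().parse(json))
  }
  iter.map { record =>
    val decoder = DecoderFactory.get().binaryDecoder(record.value, null)
    (record.topic, readers(record.topic).read(null, decoder))  // (topic, Avro GenericRecord)
  }
}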


Question on Spark-kubernetes integration

2018-03-02 Thread Lalwani, Jayesh
Does the Resource scheduler support dynamic resource allocation? Are there any 
plans to add it in the future?




Re: K Means Clustering Explanation

2018-03-02 Thread Matt Hicks
Thanks Alessandro and Christoph.  I appreciate the feedback, but I'm still
having issues determining how to actually accomplish this with the API.
Can anyone point me to an example in code showing how to accomplish this?  





On Fri, Mar 2, 2018 2:37 AM, Alessandro Solimando
alessandro.solima...@gmail.com wrote:
Hi Matt,
similarly to what Christoph does, I first derive the cluster id for the
elements of my original dataset, and then I use a classification algorithm
(cluster ids being the classes here).
For this method to be useful you need a "human-readable" model; tree-based
models are generally a good choice (e.g., Decision Tree).

However, since those models tend to be verbose, you still need a way to
summarize them to facilitate readability (there must be some literature on this
topic, although I have no pointers to provide).
Hth,
Alessandro



On 1 March 2018 at 21:59, Christoph Brücke wrote:
Hi Matt,
I see. You could use the trained model to predict the cluster id for each
training point. Now you should be able to create a dataset with your original
input data and the associated cluster id for each data point in the input data.
Now you can group this dataset by cluster id and aggregate over the original 5
features. E.g., get the mean for numerical data or the value that occurs the
most for categorical data.
The exact aggregation is use-case dependent.
I hope this helps,
Christoph

On 01.03.2018 at 21:40, Matt Hicks wrote:
Thanks for the response Christoph.
I'm converting large amounts of data into clustering training and I'm just
having a hard time reasoning about reversing the clusters (in code) back to the
original format to properly understand the dominant values in each cluster.
For example, if I have five fields of data and I've trained ten clusters of data
I'd like to output the five fields of data as represented by each of the ten
clusters.  





On Thu, Mar 1, 2018 2:36 PM, Christoph Brücke carabo...@gmail.com  wrote:
Hi Matt,
the clusters are defined by their centroids / cluster centers. All the points
belonging to a certain cluster are closer to its centroid than to the centroids
of any other cluster.
What I typically do is to convert the cluster centers back to the original input
format or, if that is not possible, use the point nearest to the cluster center
and use this as a representation of the whole cluster.
Can you be a little bit more specific about your use-case?
Best,
Christoph
On 01.03.2018 at 20:53, Matt Hicks wrote:
I'm using K Means clustering for a project right now, and it's working very
well.  However, I'd like to determine from the clusters what information
distinctions define each cluster so I can explain the "reasons" data fits into a
specific cluster.
Is there a proper way to do this in Spark ML?

Re: K Means Clustering Explanation

2018-03-02 Thread Alessandro Solimando
Hi Matt,
similarly to what Christoph does, I first derive the cluster id for the
elements of my original dataset, and then I use a classification algorithm
(cluster ids being the classes here).

For this method to be useful you need a "human-readable" model; tree-based
models are generally a good choice (e.g., Decision Tree).

However, since those models tend to be verbose, you still need a way to
summarize them to facilitate readability (there must be some literature on
this topic, although I have no pointers to provide).
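
A minimal Spark ML sketch of this cluster-then-classify idea (the input
DataFrame rawData, the feature column names, k, and the tree depth are all
illustrative):

import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col

// Assemble the original columns into a feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3", "f4", "f5"))
  .setOutputCol("features")
val features = assembler.transform(rawData)

// 1) Cluster, then attach the cluster id ("prediction") to every row.
val kmeansModel = new KMeans().setK(10).setFeaturesCol("features").fit(features)
val withClusterId = kmeansModel
  .transform(features)
  .withColumn("clusterId", col("prediction").cast("double"))

// The centroids themselves are one direct way to characterise each cluster.
kmeansModel.clusterCenters.zipWithIndex.foreach { case (c, i) =>
  println(s"cluster $i center: $c")
}

// 2) Fit an interpretable classifier with the cluster id as label; its splits
// describe which feature values distinguish the clusters.
val tree = new DecisionTreeClassifier()
  .setLabelCol("clusterId")
  .setFeaturesCol("features")
  .setMaxDepth(5)
  .fit(withClusterId)
println(tree.toDebugString)

// 3) Christoph's variant: aggregate the original columns per cluster id.
withClusterId.groupBy("clusterId").avg("f1", "f2", "f3", "f4", "f5").show()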

Hth,
Alessandro





On 1 March 2018 at 21:59, Christoph Brücke  wrote:

> Hi Matt,
>
> I see. You could use the trained model to predict the cluster id for each
> training point. Now you should be able to create a dataset with your
> original input data and the associated cluster id for each data point in
> the input data. Now you can group this dataset by cluster id and aggregate
> over the original 5 features. E.g., get the mean for numerical data or the
> value that occurs the most for categorical data.
>
> The exact aggregation is use-case dependent.
>
> I hope this helps,
> Christoph
>
> On 01.03.2018 at 21:40, Matt Hicks wrote:
>
> Thanks for the response Christoph.
>
> I'm converting large amounts of data into clustering training and I'm just
> having a hard time reasoning about reversing the clusters (in code) back to
> the original format to properly understand the dominant values in each
> cluster.
>
> For example, if I have five fields of data and I've trained ten clusters
> of data I'd like to output the five fields of data as represented by each
> of the ten clusters.
>
>
>
> On Thu, Mar 1, 2018 2:36 PM, Christoph Brücke carabo...@gmail.com wrote:
>
>> Hi Matt,
>>
>> the clusters are defined by their centroids / cluster centers. All the
>> points belonging to a certain cluster are closer to its centroid than to
>> the centroids of any other cluster.
>>
>> What I typically do is to convert the cluster centers back to the
>> original input format or, if that is not possible, use the point nearest to
>> the cluster center and use this as a representation of the whole cluster.
>>
>> Can you be a little bit more specific about your use-case?
>>
>> Best,
>> Christoph
>>
>> On 01.03.2018 at 20:53, Matt Hicks wrote:
>>
>> I'm using K Means clustering for a project right now, and it's working
>> very well.  However, I'd like to determine from the clusters what
>> information distinctions define each cluster so I can explain the "reasons"
>> data fits into a specific cluster.
>>
>> Is there a proper way to do this in Spark ML?
>>
>>
>