Re: MatrixUDT and VectorUDT in Spark ML

2018-05-30 Thread Dongjin Lee
How is this issue going? Is there a JIRA ticket for this?

Thanks,
Dongjin

On Sat, Mar 24, 2018 at 1:39 PM, Himanshu Mohan <
himanshu.mo...@aexp.com.invalid> wrote:

> I agree
>
> Thanks
>
> Himanshu
>
> *From:* Li Jin [mailto:ice.xell...@gmail.com]
> *Sent:* Friday, March 23, 2018 8:24 PM
> *To:* dev 
> *Subject:* MatrixUDT and VectorUDT in Spark ML
>
>
>
> Hi All,
>
>
>
> I came across these two types MatrixUDT and VectorUDT in Spark ML when
> doing feature extraction and preprocessing with PySpark. However, when
> trying to do some basic operations, such as vector multiplication and
> matrix multiplication, I had to fall back to a Python UDF.
>
>
>
> It seems to me it would be very useful to have built-in operators on these
> types, just like for first-class Spark SQL types, e.g.,
>
>
>
> df.withColumn('v', df.matrix_column * df.vector_column)
>
>
>
> I wonder what other people's thoughts are on this?
>
>
>
> Li
>
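A minimal sketch (not from the thread) of the Python UDF fallback described above, here doing element-wise multiplication of two VectorUDT columns; the column names, the example data, and the multiply semantics are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

spark = SparkSession.builder.appName("vector-udf-sketch").getOrCreate()

df = spark.createDataFrame(
    [(Vectors.dense([1.0, 2.0]), Vectors.dense([3.0, 4.0]))],
    ["v1", "v2"],
)

# Element-wise product as a Python UDF: every row is serialized to Python and
# back, which is exactly the overhead built-in operators would avoid.
elementwise_mul = udf(
    lambda a, b: Vectors.dense(a.toArray() * b.toArray()),
    VectorUDT(),
)

df.withColumn("v", elementwise_mul("v1", "v2")).show(truncate=False)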


-- 
*Dongjin Lee*

*A hitchhiker in the mathematical world.*

*github: github.com/dongjinleekr
linkedin: kr.linkedin.com/in/dongjinleekr
slideshare: www.slideshare.net/dongjinleekr*


Re: Spark version for Mesos 0.27.0

2018-05-30 Thread Thodoris Zois
Hello,

I need Mesos 0.27 for specific purposes and unfortunately I can’t use a newer 
version. Did you find anything? Could it be Spark 1.6? 

Apart from that, since which version does Spark support dynamic allocation on Mesos?

- Thodoris
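For reference (not from the thread): dynamic allocation on Mesos relies on the external shuffle service running on each agent, and the settings involved look roughly like the sketch below. This does not answer the version question, and the Mesos master URL is a placeholder.

from pyspark import SparkConf, SparkContext

# A minimal sketch of the configuration for dynamic allocation on Mesos.
# The external shuffle service must already be running on every Mesos agent
# (the Spark distribution ships sbin/start-mesos-shuffle-service.sh for this).
conf = (
    SparkConf()
    .setAppName("mesos-dynamic-allocation-sketch")
    .setMaster("mesos://zk://zk1:2181/mesos")          # placeholder master URL
    .set("spark.dynamicAllocation.enabled", "true")    # scale executors up and down
    .set("spark.shuffle.service.enabled", "true")      # required by dynamic allocation
)
sc = SparkContext(conf=conf)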

> On 25 May 2018, at 16:06, Jacek Laskowski  wrote:
> 
> Hi,
> 
> Mesos 0.27.0?! That's been a while. I'd search for the changes to pom.xml and 
> see when the mesos dependency version changed. That'd give you the most 
> precise answer. I think it could've been 1.5 or older.
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
> 
>> On Fri, May 25, 2018 at 1:29 PM, Thodoris Zois  wrote:
>> Hello,
>> 
>> Could you please tell me which version of Spark works with Apache Mesos
>> version 0.27.0? (I cannot find anything in the docs on GitHub.)
>> 
>> Thank you very much,
>> Thodoris Zois
>> 
>> 
> 


Spark on Kubernetes plan for 2.4

2018-05-30 Thread Yinan Li
On behalf of folks who work on Spark on Kubernetes, I would like to share a
doc on the plan for Spark on Kubernetes features and changes for the upcoming
2.4 release. Please take a look if you are interested. Feedback and comments
are highly appreciated. Thanks!

Yinan


Re: Unable to alter partition. The transaction for alter partition did not commit successfully.

2018-05-30 Thread naresh Goud
What are you doing? Please give more details on what you are doing.

On Wed, May 30, 2018 at 12:58 PM Arun Hive 
wrote:

>
> Hi
>
> While running my Spark job component I am getting the following exception.
> Requesting your help on this:
> Spark core version -
> spark-core_2.10-2.1.1
>
> Spark streaming version -
> spark-streaming_2.10-2.1.1
>
> Spark hive version -
> spark-hive_2.10-2.1.1
>
>
> 2018-05-28 00:08:04,317  [streaming-job-executor-2] ERROR (Hive.java:1883)
> - org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter
> partition. The transaction for alter partition did not commit successfully.
> at org.apache.hadoop.hive.ql.metadata.Hive.alterPartition(Hive.java:573)
> at org.apache.hadoop.hive.ql.metadata.Hive.alterPartition(Hive.java:546)
> at
> org.apache.hadoop.hive.ql.metadata.Hive.alterPartitionSpec(Hive.java:1915)
> at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:1875)
> at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1407)
> at
> org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:1593)
> at sun.reflect.GeneratedMethodAccessor123.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.spark.sql.hive.client.Shim_v1_2.loadDynamicPartitions(HiveShim.scala:831)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(HiveClientImpl.scala:693)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadDynamicPartitions$1.apply(HiveClientImpl.scala:691)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadDynamicPartitions$1.apply(HiveClientImpl.scala:691)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:279)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:226)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:225)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:268)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl.loadDynamicPartitions(HiveClientImpl.scala:691)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(HiveExternalCatalog.scala:823)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadDynamicPartitions$1.apply(HiveExternalCatalog.scala:811)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadDynamicPartitions$1.apply(HiveExternalCatalog.scala:811)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog.loadDynamicPartitions(HiveExternalCatalog.scala:811)
> at
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:319)
> at
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> at
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
> at
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
> at
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:263)
> at
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:243)
>
> 
> -
>  -
> 
> -
> 
> at
> org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:272)
> at
> org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:272)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
> at
> 

[VOTE] SPIP ML Pipelines in R

2018-05-30 Thread Hossein
Hi,

I started a discussion thread for a new R package to expose MLlib pipelines in R.

To summarize, we will work on utilities to generate R wrappers for the MLlib
pipeline API in a new R package. This will lower the burden of exposing
new APIs in the future.

Following the SPIP process, I am proposing the SPIP for a vote.

+1: Let's go ahead and implement the SPIP.
+0: Don't really care.
-1: I do not think this is a good idea for the following reasons.

Thanks,
--Hossein


Re: [SQL] Purpose of RuntimeReplaceable unevaluable unary expressions?

2018-05-30 Thread Reynold Xin
SQL expressions?

On Wed, May 30, 2018 at 11:09 AM Jacek Laskowski  wrote:

> Hi,
>
> I've been exploring RuntimeReplaceable expressions [1] and have been
> wondering what their purpose is.
>
> Quoting the scaladoc [2]:
>
> > An expression that gets replaced at runtime (currently by the optimizer)
> into a different expression for evaluation. This is mainly used to provide
> compatibility with other databases.
>
> For example, ParseToTimestamp expression is a RuntimeReplaceable
> expression and it is replaced by Cast(left, TimestampType)
> or Cast(UnixTimestamp(left, format), TimestampType) per to_timestamp
> function (there are two variants).
>
> My question is: why is RuntimeReplaceable better than simply using the
> Casts as the implementation of the to_timestamp functions?
>
> def to_timestamp(s: Column, fmt: String): Column = withExpr {
>   // pseudocode: build the Cast directly instead of using ParseToTimestamp
>   Cast(UnixTimestamp(s.expr, Literal(fmt)), TimestampType)
> }
>
> What's wrong with the above implementation compared to the current one?
>
> [1]
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L275
>
> [2]
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L266-L267
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>


[SQL] Purpose of RuntimeReplaceable unevaluable unary expressions?

2018-05-30 Thread Jacek Laskowski
Hi,

I've been exploring RuntimeReplaceable expressions [1] and have been
wondering what their purpose is.

Quoting the scaladoc [2]:

> An expression that gets replaced at runtime (currently by the optimizer)
into a different expression for evaluation. This is mainly used to provide
compatibility with other databases.

For example, ParseToTimestamp expression is a RuntimeReplaceable expression
and it is replaced by Cast(left, TimestampType) or Cast(UnixTimestamp(left,
format), TimestampType) per to_timestamp function (there are two variants).

My question is: why is RuntimeReplaceable better than simply using the
Casts as the implementation of the to_timestamp functions?

def to_timestamp(s: Column, fmt: String): Column = withExpr {
  // pseudocode: build the Cast directly instead of using ParseToTimestamp
  Cast(UnixTimestamp(s.expr, Literal(fmt)), TimestampType)
}

What's wrong with the above implementation compared to the current one?

[1]
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L275

[2]
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L266-L267

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski


Re: Unable to alter partition. The transaction for alter partition did not commit successfully.

2018-05-30 Thread Arun Hive
 
Hi,

While running my Spark job component I am getting the following exception.
Requesting your help on this:

Spark core version - spark-core_2.10-2.1.1
Spark streaming version - spark-streaming_2.10-2.1.1
Spark hive version - spark-hive_2.10-2.1.1
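For context, a minimal sketch of the kind of call pattern the trace below implies, i.e. a dynamic-partition insertInto issued from a streaming job; the table name, columns, and configs here are assumptions, not taken from this report.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-partition-insert-sketch")
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .enableHiveSupport()
    .getOrCreate()
)

def write_batch(rdd):
    # Typically invoked from dstream.foreachRDD(write_batch); assumes a
    # partitioned Hive table named "events_partitioned" already exists.
    df = spark.createDataFrame(rdd, ["id", "payload", "dt"])
    # DataFrameWriter.insertInto is the entry point that reaches
    # Hive.loadDynamicPartitions in the stack trace below.
    df.write.mode("append").insertInto("events_partitioned")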

2018-05-28 00:08:04,317  [streaming-job-executor-2] ERROR (Hive.java:1883) - 
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter partition. 
The transaction for alter partition did not commit successfully.
 at org.apache.hadoop.hive.ql.metadata.Hive.alterPartition(Hive.java:573)
 at org.apache.hadoop.hive.ql.metadata.Hive.alterPartition(Hive.java:546)
 at org.apache.hadoop.hive.ql.metadata.Hive.alterPartitionSpec(Hive.java:1915)
 at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:1875)
 at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1407)
 at 
org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:1593)
 at sun.reflect.GeneratedMethodAccessor123.invoke(Unknown Source)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.apache.spark.sql.hive.client.Shim_v1_2.loadDynamicPartitions(HiveShim.scala:831)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(HiveClientImpl.scala:693)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadDynamicPartitions$1.apply(HiveClientImpl.scala:691)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadDynamicPartitions$1.apply(HiveClientImpl.scala:691)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:279)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:226)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:225)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:268)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.loadDynamicPartitions(HiveClientImpl.scala:691)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(HiveExternalCatalog.scala:823)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadDynamicPartitions$1.apply(HiveExternalCatalog.scala:811)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadDynamicPartitions$1.apply(HiveExternalCatalog.scala:811)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.loadDynamicPartitions(HiveExternalCatalog.scala:811)
 at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:319)
 at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
 at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
 at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
 at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
 at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:263)
 at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:243)

-
 
-
 
-
 at 
org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:272)
 at 
org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:272)
 at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
 at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
 at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
 at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
 at 

Closing IPC connection

2018-05-30 Thread Arun Hive
Hi,
While running my Spark job component I am getting the following exception.
Requesting your help on this:

Spark core version - spark-core_2.10-2.1.1
Spark streaming version - spark-streaming_2.10-2.1.1
Spark hive version - spark-hive_2.10-2.1.1

b-executor-0] DEBUG (Client.java:428) - The ping interval is 6 ms.
2018-05-28 00:08:10,187  [streaming-job-executor-0] DEBUG (Client.java:698) - Connecting to /:8020
2018-05-28 00:08:10,188  [streaming-job-executor-0] DEBUG (Client.java:1176) - closing ipc connection to /:8020: null
java.nio.channels.ClosedByInterruptException
at 
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
 at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:659) at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192) 
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530) at 
org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494) at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608) at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:706) at 
org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:369) at 
org.apache.hadoop.ipc.Client.getConnection(Client.java:1522) at 
org.apache.hadoop.ipc.Client.call(Client.java:1439) at 
org.apache.hadoop.ipc.Client.call(Client.java:1400) at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
 at com.sun.proxy.$Proxy11.getListing(Unknown Source) at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:554)
 at sun.reflect.GeneratedMethodAccessor86.invoke(Unknown Source) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
 at com.sun.proxy.$Proxy12.getListing(Unknown Source) at 
org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1958) at 
org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1941) at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:693)
 at 
org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105)
 at 
org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755)
 at 
org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751)
 at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751)
 at org.apache.hadoop.fs.Globber.listStatus(Globber.java:69) at 
org.apache.hadoop.fs.Globber.glob(Globber.java:217) at 
org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1644) at 
org.apache.hadoop.hive.common.HiveStatsUtils.getFileStatusRecurse(HiveStatsUtils.java:73)
 at 
org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:1549) 
at sun.reflect.GeneratedMethodAccessor123.invoke(Unknown Source) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
org.apache.spark.sql.hive.client.Shim_v1_2.loadDynamicPartitions(HiveShim.scala:831)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(HiveClientImpl.scala:693)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadDynamicPartitions$1.apply(HiveClientImpl.scala:691)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadDynamicPartitions$1.apply(HiveClientImpl.scala:691)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:279)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:226)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:225)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:268)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.loadDynamicPartitions(HiveClientImpl.scala:691)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(HiveExternalCatalog.scala:823)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadDynamicPartitions$1.apply(HiveExternalCatalog.scala:811)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadDynamicPartitions$1.apply(HiveExternalCatalog.scala:811)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.loadDynamicPartitions(HiveExternalCatalog.scala:811)
 at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:319)
 at 

Re: Revisiting Online serving of Spark models?

2018-05-30 Thread Denny Lee
I most likely will not be able to join SF next week but definitely up for a
session after Summit in Seattle to dive further into this, eh?!

On Wed, May 30, 2018 at 9:32 AM Felix Cheung 
wrote:

> Hi!
>
> Thank you! Let’s meet then
>
> June 6 4pm
>
> Moscone West Convention Center
> 800 Howard Street, San Francisco, CA 94103
> 
>
> Ground floor (outside of conference area - should be available for all) -
> we will meet and decide where to go
>
> (Would not send invite because that would be too much noise for dev@)
>
> To paraphrase Joseph, we will use this to kick off the discussion and
> post notes after and follow up online. As for Seattle, I would be very
> interested to meet in person later and discuss ;)
>
>
> _
> From: Saikat Kanjilal 
> Sent: Tuesday, May 29, 2018 11:46 AM
>
> Subject: Re: Revisiting Online serving of Spark models?
> To: Maximiliano Felice 
> Cc: Felix Cheung , Holden Karau <
> hol...@pigscanfly.ca>, Joseph Bradley , Leif Walsh
> , dev 
>
>
>
> Would love to join but am in Seattle, thoughts on how to make this work?
>
> Regards
>
> Sent from my iPhone
>
> On May 29, 2018, at 10:35 AM, Maximiliano Felice <
> maximilianofel...@gmail.com> wrote:
>
> Big +1 to a meeting with fresh air.
>
> Could anyone send the invites? I don't really know which place Holden is
> talking about.
>
> 2018-05-29 14:27 GMT-03:00 Felix Cheung :
>
>> You had me at blue bottle!
>>
>> _
>> From: Holden Karau 
>> Sent: Tuesday, May 29, 2018 9:47 AM
>> Subject: Re: Revisiting Online serving of Spark models?
>> To: Felix Cheung 
>> Cc: Saikat Kanjilal , Maximiliano Felice <
>> maximilianofel...@gmail.com>, Joseph Bradley ,
>> Leif Walsh , dev 
>>
>>
>>
>> I'm down for that, we could all go for a walk, maybe to the Mint Plaza
>> Blue Bottle, and grab coffee (if the weather holds we can have our design
>> meeting outside :p)?
>>
>> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung 
>> wrote:
>>
>>> Bump.
>>>
>>> --
>>> *From:* Felix Cheung 
>>> *Sent:* Saturday, May 26, 2018 1:05:29 PM
>>> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
>>> *Cc:* Leif Walsh; Holden Karau; dev
>>>
>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>
>>> Hi! How about we meet the community and discuss on June 6 4pm at (near)
>>> the Summit?
>>>
>>> (I propose we meet at the venue entrance so we can accommodate people who
>>> might not be in the conference)
>>>
>>> --
>>> *From:* Saikat Kanjilal 
>>> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
>>> *To:* Maximiliano Felice
>>> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>
>>> I’m in the same exact boat as Maximiliano and have use cases as well for
>>> model serving and would love to join this discussion.
>>>
>>> Sent from my iPhone
>>>
>>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <
>>> maximilianofel...@gmail.com> wrote:
>>>
>>> Hi!
>>>
>>> I don't usually write a lot on this list, but I keep up to date with
>>> the discussions and I'm a heavy user of Spark. This topic caught my
>>> attention, as we're currently facing this issue at work. I'm attending
>>> the summit and was wondering if it would be possible for me to join that
>>> meeting. I might be able to share some helpful use cases and ideas.
>>>
>>> Thanks,
>>> Maximiliano Felice
>>>
>>> On Tue, May 22, 2018 at 9:14 AM, Leif Walsh
>>> wrote:
>>>
 I’m with you on json being more readable than parquet, but we’ve had
 success using pyarrow’s parquet reader and have been quite happy with it so
 far. If your target is Python (and probably if not now, then soon, R), you
 should look into it.

 On Mon, May 21, 2018 at 16:52 Joseph Bradley 
 wrote:

> Regarding model reading and writing, I'll give quick thoughts here:
> * Our approach was to use the same format but write JSON instead of
> Parquet.  It's easier to parse JSON without Spark, and using the same
> format simplifies architecture.  Plus, some people want to check files 
> into
> version control, and JSON is nice for that.
> * The reader/writer APIs could be extended to take format parameters
> (just like DataFrame reader/writers) to handle JSON (and maybe, 
> eventually,
> handle Parquet in the online serving setting).
>
> This would be a big project, so proposing a SPIP might be best.  If
> people are around at the Spark Summit, that could be a good time to meet 
> up
> & then post notes back to the dev list.
>
> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <
> felixcheun...@hotmail.com> wrote:
>
>> Specifically, I’d like to bring part of the discussion to Model and
>> PipelineModel, and various ModelReader and SharedReadWrite 
>> implementations
>> 

Re: Revisiting Online serving of Spark models?

2018-05-30 Thread Felix Cheung
Hi!

Thank you! Let’s meet then

June 6 4pm

Moscone West Convention Center
800 Howard Street, San Francisco, CA 94103

Ground floor (outside of conference area - should be available for all) - we 
will meet and decide where to go

(Would not send invite because that would be too much noise for dev@)

To paraphrase Joseph, we will use this to kick off the discussion and post
notes after and follow up online. As for Seattle, I would be very interested to
meet in person later and discuss ;)


_
From: Saikat Kanjilal 
Sent: Tuesday, May 29, 2018 11:46 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Maximiliano Felice 
Cc: Felix Cheung , Holden Karau 
, Joseph Bradley , Leif Walsh 
, dev 


Would love to join but am in Seattle, thoughts on how to make this work?

Regards

Sent from my iPhone

On May 29, 2018, at 10:35 AM, Maximiliano Felice
<maximilianofel...@gmail.com> wrote:

Big +1 to a meeting with fresh air.

Could anyone send the invites? I don't really know which place Holden is
talking about.

2018-05-29 14:27 GMT-03:00 Felix Cheung
<felixcheun...@hotmail.com>:
You had me at blue bottle!

_
From: Holden Karau <hol...@pigscanfly.ca>
Sent: Tuesday, May 29, 2018 9:47 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: Saikat Kanjilal <sxk1...@hotmail.com>, Maximiliano Felice
<maximilianofel...@gmail.com>, Joseph Bradley <jos...@databricks.com>,
Leif Walsh <leif.wa...@gmail.com>, dev <dev@spark.apache.org>



I'm down for that, we could all go for a walk, maybe to the Mint Plaza Blue
Bottle, and grab coffee (if the weather holds we can have our design meeting
outside :p)?

On Tue, May 29, 2018 at 9:37 AM, Felix Cheung
<felixcheun...@hotmail.com> wrote:
Bump.


From: Felix Cheung <felixcheun...@hotmail.com>
Sent: Saturday, May 26, 2018 1:05:29 PM
To: Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
Cc: Leif Walsh; Holden Karau; dev

Subject: Re: Revisiting Online serving of Spark models?

Hi! How about we meet the community and discuss on June 6 4pm at (near) the 
Summit?

(I propose we meet at the venue entrance so we can accommodate people who might
not be in the conference)


From: Saikat Kanjilal <sxk1...@hotmail.com>
Sent: Tuesday, May 22, 2018 7:47:07 AM
To: Maximiliano Felice
Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
Subject: Re: Revisiting Online serving of Spark models?

I’m in the same exact boat as Maximiliano and have use cases as well for model 
serving and would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice
<maximilianofel...@gmail.com> wrote:

Hi!

I don't usually write a lot on this list, but I keep up to date with the
discussions and I'm a heavy user of Spark. This topic caught my attention, as
we're currently facing this issue at work. I'm attending the summit and was
wondering if it would be possible for me to join that meeting. I might be
able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice

On Tue, May 22, 2018 at 9:14 AM, Leif Walsh
<leif.wa...@gmail.com> wrote:
I’m with you on json being more readable than parquet, but we’ve had success 
using pyarrow’s parquet reader and have been quite happy with it so far. If 
your target is Python (and probably if not now, then soon, R), you should look
into it.

On Mon, May 21, 2018 at 16:52 Joseph Bradley
<jos...@databricks.com> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  
It's easier to parse JSON without Spark, and using the same format simplifies 
architecture.  Plus, some people want to check files into version control, and 
JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like 
DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet 
in the online serving setting).
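A hedged sketch contrasting today's Parquet-backed ML persistence with the format parameter idea above; the .format("json") call is hypothetical (it does not exist in PySpark today), while the rest uses existing APIs.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("ml-persistence-sketch").getOrCreate()

train = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.1])), (0.0, Vectors.dense([2.0, 1.0]))],
    ["label", "features"],
)
model = LogisticRegression(maxIter=5).fit(train)

# Today: metadata as JSON plus Parquet data, readable only with a SparkSession.
model.write().overwrite().save("/tmp/lr-model")
reloaded = LogisticRegressionModel.load("/tmp/lr-model")

# Hypothetical extension discussed above (NOT an existing API):
# model.write().format("json").overwrite().save("/tmp/lr-model-json")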

This would be a big project, so proposing a SPIP might be best.  If people are 
around at the Spark Summit, that could be a good time to meet up & then post 
notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung
<felixcheun...@hotmail.com> wrote:
Specifically, I'd like to bring part of the discussion to Model and PipelineModel,
and the various ModelReader and SharedReadWrite implementations that rely on
SparkContext. This is a big blocker on reusing trained models outside of Spark
for online serving.

What’s the next step? Would folks be interested in getting together to 
discuss/get some feedback?


_
From: Felix Cheung <felixcheun...@hotmail.com>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online