[Structured Streaming Query] Calculate Running Avg from Kafka feed using SQL query

2018-04-02 Thread Aakash Basu
Hi, This is a very interesting requirement, where I am getting stuck at a few places.

*Requirement* -

Col1  Col2
1     10
2     11
3     12
4     13
5     14

*I have to calculate avg of col1 and then divide each row of col2 by that avg. And,
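For the sample above, avg(col1) = (1+2+3+4+5)/5 = 3, so each col2 value would be divided by 3. In batch Spark SQL this can be expressed with a scalar subquery; the following is a minimal sketch assuming a temp view named "input" (streaming queries restrict such subqueries, so this applies to batch semantics only):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avg-scale-sql-sketch").getOrCreate()

# Sample data from the thread: avg(col1) = (1+2+3+4+5)/5 = 3.
df = spark.createDataFrame(
    [(1, 10), (2, 11), (3, 12), (4, 13), (5, 14)], ["col1", "col2"])
df.createOrReplaceTempView("input")

# A scalar subquery computes avg(col1) once; each col2 is divided by it.
spark.sql("""
    SELECT col1, col2,
           col2 / (SELECT avg(col1) FROM input) AS col2_scaled
    FROM input
""").show()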

Re: [Structured Streaming Query] Calculate Running Avg from Kafka feed using SQL query

2018-04-02 Thread Aakash Basu
Any help, guys?

On Mon, Apr 2, 2018 at 1:01 PM, Aakash Basu wrote:
> Hi,
>
> This is a very interesting requirement, where I am getting stuck at a few
> places.
>
> *Requirement* -
>
> Col1  Col2
> 1 10
> 2 11
> 3 12
> 4

Re: [Structured Streaming Query] Calculate Running Avg from Kafka feed using SQL query

2018-04-02 Thread Aakash Basu
Hi all, The following is the updated code, where I'm getting the avg in a DF, but collect(), which I use to store the value in a variable and pass it into the final select query, is not working. So avg is currently a DataFrame, not a variable with the value stored in it.

New code -

from
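Note that collect() is not supported on a streaming DataFrame, which is the likely reason it fails here. One batch-mode workaround (a sketch under that assumption, not the poster's actual code) keeps the average inside the query plan with a cross join instead of pulling it to the driver:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("avg-scale-crossjoin").getOrCreate()

df = spark.createDataFrame(
    [(1, 10), (2, 11), (3, 12), (4, 13), (5, 14)], ["col1", "col2"])

# Single-row aggregate DataFrame holding the average.
avg_df = df.agg(F.avg("col1").alias("avg_col1"))

# crossJoin attaches that one row to every input row, so the division
# happens inside the plan and nothing is collected to the driver.
df.crossJoin(avg_df) \
  .withColumn("col2_scaled", F.col("col2") / F.col("avg_col1")) \
  .show()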

Merge query using spark sql

2018-04-02 Thread Deepak Sharma
I am using Spark to run a merge query in Postgres SQL. The way it's being done now: save the data to be merged into Postgres as temp tables, then run the merge queries in Postgres using a Java SQL connection and statement. So basically this query runs in Postgres. The queries are insert into source
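A minimal PySpark sketch of that staging pattern, with hypothetical connection details and table names (the merge itself still runs on the Postgres side, over a direct connection as the poster describes):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pg-merge-sketch").getOrCreate()
updates = spark.read.parquet("/data/updates")  # hypothetical source

# Stage the rows to merge into a Postgres table via JDBC.
(updates.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb")  # hypothetical
    .option("dbtable", "staging_updates")
    .option("user", "spark")
    .option("password", "...")
    .mode("overwrite")
    .save())

# The merge then runs in Postgres over a separate connection (the poster
# used java.sql). On Postgres 9.5+ an upsert would look like:
#   INSERT INTO target SELECT * FROM staging_updates
#   ON CONFLICT (id) DO UPDATE SET ...;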

unsubscribe

2018-04-02 Thread purna pradeep
unsubscribe

unsubscribe

2018-04-02 Thread Romero, Saul
unsubscribe

Re: is there a way of register python UDF using java API?

2018-04-02 Thread kant kodali
Looks like there is spark.udf().registerPython() like below.

public void registerPython(java.lang.String name,
    org.apache.spark.sql.execution.python.UserDefinedPythonFunction udf)

Can anyone describe what the *udfDeterministic* parameter does in the method signature below?

public

Re: is there a way of register python UDF using java API?

2018-04-02 Thread Bryan Cutler
Hi Kant, The udfDeterministic would be set to false if the results of your UDF are non-deterministic, such as those produced by random numbers, so that the Catalyst optimizer will not cache and reuse results.

On Mon, Apr 2, 2018 at 12:11 PM, kant kodali wrote:
> Looks like there is
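On the PySpark side the same flag is exposed as asNondeterministic() (available since Spark 2.3); a minimal sketch:

import random
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.appName("nondeterministic-udf").getOrCreate()

# Marking the UDF non-deterministic tells Catalyst not to cache,
# deduplicate, or reorder its invocations.
rand_udf = udf(lambda: random.random(), "double").asNondeterministic()

spark.range(3).withColumn("r", rand_udf()).show()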

Uncaught exception in thread heartbeat-receiver-event-loop-thread

2018-04-02 Thread Shiyuan
Hi, I got an error of "Uncaught exception in thread heartbeat-receiver-event-loop-thread". Does this error indicate that some node is too overloaded to be responsive? Thanks!

ERROR Utils: Uncaught exception in thread heartbeat-receiver-event-loop-thread
java.lang.NullPointerException

unsubscribe

2018-04-02 Thread 学生张洪斌
Sent from NetEase Mail Master

How to delete empty columns in df when writing to parquet?

2018-04-02 Thread Junfeng Chen
I am trying to read data from Kafka and write it in Parquet format via Spark Streaming. The problem is that the data from Kafka have a variable structure. For example, app one has columns A, B, C; app two has columns B, C, D. So the DataFrame I read from Kafka has all columns A, B, C, D. When I decide
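One possible approach, sketched for a batch DataFrame (in a streaming job it would have to run inside foreachBatch; paths here are hypothetical): count the non-null values per column and keep only the columns that have at least one.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("drop-empty-cols").getOrCreate()
df = spark.read.json("/data/from_kafka")  # hypothetical unified ABCD frame

# F.count(col) counts only non-null values, so a column whose count is 0
# is entirely empty for this batch and can be dropped before writing.
counts = df.agg(
    *[F.count(F.col(c)).alias(c) for c in df.columns]).collect()[0]
non_empty = [c for c in df.columns if counts[c] > 0]

df.select(*non_empty).write.mode("append").parquet("/data/out")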

[Spark sql]: Re-execution of same operation takes less time than 1st

2018-04-02 Thread snjv
Hi, When we execute the same operation twice, Spark takes ~40% less time than the first run. Our operation is like this:
Read 150M rows (spread across multiple parquet files) into a DF.
Read 10M rows (spread across multiple parquet files) into another DF.
Do an intersect operation.
Size of 150M row file:
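Warm caches (OS page cache, Alluxio, JVM JIT) are the usual suspects here, rather than Spark reusing results. A rough way to rule out Spark's own cache, sketched with hypothetical paths, is to clear it between runs and compare timings; if the gap persists, the cause is outside Spark:

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rerun-timing").getOrCreate()

def timed_intersect():
    big = spark.read.parquet("/data/big")    # hypothetical paths
    small = spark.read.parquet("/data/small")
    start = time.time()
    big.intersect(small).count()             # force full execution
    return time.time() - start

first = timed_intersect()
spark.catalog.clearCache()  # drop anything Spark itself cached
second = timed_intersect()
print(f"first: {first:.1f}s  second: {second:.1f}s")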

[Spark-sql]: DF parquet read write multiple tasks

2018-04-02 Thread snjv
Spark: 2.2
Number of cores: 128 (all allocated to Spark)
Filesystem: Alluxio 1.6
Block size on Alluxio: 32MB
Input1 size: 586MB (150M records with only 1 column as int)
Input2 size: 50MB (10M records with only 1 column as int)
Input1 is spread across 20 parquet files. Each file size is
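Each input partition becomes one task per stage, so the read parallelism can be inspected directly; a small sketch with hypothetical paths (the partition count is driven mainly by spark.sql.files.maxPartitionBytes and the file/block layout, not by core count alone):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-count").getOrCreate()

df1 = spark.read.parquet("/alluxio/input1")  # e.g. 586MB across 20 files
print("input1 partitions:", df1.rdd.getNumPartitions())

# Default split target is 128MB; smaller files/blocks mean more tasks.
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))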