Hi,
This is a very interesting requirement, where I am getting stuck at a few
places.
*Requirement* -
Col1  Col2
1     10
2     11
3     12
4     13
5     14
*I have to calculate the avg of col1 and then divide each row of col2 by
that avg.*
Any help, guys?
Hi all,
The following is the updated code. I'm getting the avg in a DF, but the
collect() call, which should store the value in a variable to pass to the
final select query, is not working. So avg is currently a DataFrame rather
than a variable holding the value.
New code -
from
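The code above is cut off, but here is a minimal sketch of the collect()
approach, assuming the two-column frame from the requirement: collect()
returns a list of Rows, so the scalar has to be pulled out of the first
(and only) Row before it can go into the final select.

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col

spark = SparkSession.builder.appName("avg-divide").getOrCreate()

df = spark.createDataFrame(
    [(1, 10), (2, 11), (3, 12), (4, 13), (5, 14)], ["col1", "col2"])

# agg() yields a one-row DataFrame; collect()[0][0] unwraps the single
# aggregate value into a plain Python float (3.0 for this data).
avg_col1 = df.agg(avg(col("col1"))).collect()[0][0]

result = df.select(col("col1"), (col("col2") / avg_col1).alias("col2_by_avg"))
result.show()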
I am using Spark to run merge queries in PostgreSQL.
The way it's done now is to save the data to be merged into Postgres as
temp tables.
The merge queries are then run in Postgres over a Java SQL connection and
Statement.
So basically the query runs in Postgres.
The queries are insert into source
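A minimal sketch of that flow, assuming hypothetical table names,
credentials, and a unique constraint on target(id); Spark stages the rows
over JDBC, then a plain Postgres connection runs the upsert inside the
database:

import psycopg2
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pg-merge").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# 1. Stage the data to merge as a table in Postgres (needs the Postgres
#    JDBC driver on the Spark classpath).
(df.write.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")
    .option("dbtable", "staging_source")
    .option("user", "spark").option("password", "secret")
    .mode("overwrite").save())

# 2. Run the merge in Postgres itself; pre-15 Postgres has no MERGE, so an
#    INSERT ... ON CONFLICT upsert stands in for it here.
conn = psycopg2.connect(host="localhost", dbname="mydb",
                        user="spark", password="secret")
with conn, conn.cursor() as cur:
    cur.execute("""
        INSERT INTO target (id, value)
        SELECT id, value FROM staging_source
        ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value
    """)
conn.close()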
Looks like there is spark.udf().registerPython() like below.
public void registerPython(java.lang.String name,
        org.apache.spark.sql.execution.python.UserDefinedPythonFunction udf)
Can anyone describe what the *udfDeterministic* parameter does in the
method signature below?
public
Hi Kant,
The udfDeterministic flag would be set to false if the results from your
UDF are non-deterministic, such as those produced by random numbers, so
that the Catalyst optimizer will not cache and reuse results.
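For reference, a minimal PySpark sketch of flagging such a UDF (the names
here are illustrative): asNondeterministic() tells Catalyst not to assume
the function returns the same result for the same input.

import random
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("nondeterministic-udf").getOrCreate()

# The UDF returns a fresh random number per call, so it must be flagged
# non-deterministic or the optimizer may reuse a cached result.
random_udf = udf(lambda: random.random(), DoubleType()).asNondeterministic()

spark.range(3).withColumn("r", random_udf()).show()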
Hi,
I got an "Uncaught exception in
thread heartbeat-receiver-event-loop-thread" error. Does this error indicate
that some node is too overloaded to be responsive? Thanks!
ERROR Utils: Uncaught exception in thread
heartbeat-receiver-event-loop-thread
java.lang.NullPointerException
Sent from NetEase Mail Master
I am trying to read data from Kafka and write it out in parquet format via
Spark Streaming.
The problem is that the data from Kafka have variable structure. For
example, app one has columns A,B,C, while app two has columns B,C,D. So the
data frame I read from Kafka has all columns A,B,C,D. When I decide
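One way to handle this, sketched below with Structured Streaming (an
assumption; the topic, servers, and paths are hypothetical), is to parse
the JSON payloads against the superset schema A,B,C,D so that columns an
app omits simply come out as null:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

# The union of every app's columns; missing fields parse as null.
schema = StructType(
    [StructField(c, StringType(), True) for c in ["A", "B", "C", "D"]])

# Requires the spark-sql-kafka connector on the classpath.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "apps").load())

parsed = (raw.select(from_json(col("value").cast("string"), schema)
          .alias("j")).select("j.*"))

(parsed.writeStream.format("parquet")
    .option("path", "/data/apps")
    .option("checkpointLocation", "/data/apps_chk")
    .start())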
Hi,
When we execute the same operation twice, the second run takes ~40% less
time than the first.
Our operation is like this:
Read 150M rows (spread across multiple parquet files) into a DF.
Read 10M rows (spread across multiple parquet files) into another DF.
Do an intersect operation.
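A minimal sketch of that operation (the paths are hypothetical); the second
run is typically faster because parquet footers and blocks are already warm
in the OS page cache and Alluxio, and the JVM code paths are JIT-compiled:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intersect-timing").getOrCreate()

big = spark.read.parquet("/data/input1")    # ~150M rows, single int column
small = spark.read.parquet("/data/input2")  # ~10M rows, single int column

# intersect() deduplicates and shuffles both sides before matching.
print(big.intersect(small).count())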
Spark: 2.2
Number of cores: 128 (all allocated to Spark)
Filesystem: Alluxio 1.6
Block size on Alluxio: 32MB
Input1 size: 586MB (150M records with only 1 column, an int)
Input2 size: 50MB (10M records with only 1 column, an int)
Input1 is spread across 20 parquet files; each file size is