Re: Union of multiple data frames

2018-04-05 Thread Cesar
Thanks for your answers. The suggested method works when the number of DataFrames is small. However, I am trying to union more than 30 DataFrames, and creating the plan takes longer than executing it, which should not be the case. Thanks! -- Cesar On Thu, Apr 5, 2018 at 1:29 PM, Andy
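Not an answer given in the thread, but a minimal PySpark sketch of one commonly used workaround for this symptom: union in fixed-size chunks and checkpoint each chunk so the logical plan (and planning time) stops growing with every union. The thread's own code is Scala; the same idea carries over. The checkpoint directory, chunk size, and dfs list below are all hypothetical.

    from functools import reduce
    from pyspark.sql import SparkSession, DataFrame

    spark = SparkSession.builder.appName("union-many").getOrCreate()
    # DataFrame.checkpoint() needs a checkpoint directory; this path is made up
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    def union_all(dfs, chunk_size=8):
        """Union DataFrames in chunks, checkpointing each chunk so the
        accumulated plan is truncated instead of growing with every union."""
        chunks = []
        for i in range(0, len(dfs), chunk_size):
            chunk = reduce(DataFrame.union, dfs[i:i + chunk_size])
            # checkpoint() materializes the chunk and cuts its lineage/plan
            chunks.append(chunk.checkpoint())
        return reduce(DataFrame.union, chunks)

    # dfs would be the 30+ DataFrames (identical schemas) from the thread
    # dfUnion = union_all(dfs)
    # dfUnion.show()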

Re: how to set up pyspark eclipse, pyDev, virtualenv? syntaxError: yield from walk(

2018-04-05 Thread Andy Davidson
FYI http://www.learn4master.com/algorithms/pyspark-unit-test-set-up-sparkcontext From: Andrew Davidson Date: Wednesday, April 4, 2018 at 5:36 PM To: "user @spark" Subject: how to set up pyspark eclipse, pyDev, virtualenv? syntaxError:
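The linked article is not reproduced here; below is only a rough sketch, assuming plain unittest, of the kind of SparkContext test fixture the link's title refers to. Class, app, and test names are made up.

    import unittest
    from pyspark import SparkConf, SparkContext

    class PySparkTestCase(unittest.TestCase):
        @classmethod
        def setUpClass(cls):
            # one local SparkContext shared by all tests in the class
            conf = SparkConf().setMaster("local[2]").setAppName("pyspark-unit-test")
            cls.sc = SparkContext(conf=conf)

        @classmethod
        def tearDownClass(cls):
            cls.sc.stop()

        def test_count(self):
            rdd = self.sc.parallelize(["a", "b", "a"])
            self.assertEqual(rdd.count(), 3)

    if __name__ == "__main__":
        unittest.main()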

Re: how to set up pyspark eclipse, pyDev, virtualenv? syntaxError: yield from walk(

2018-04-05 Thread Andy Davidson
Hi Hyukjin, Thanks for the links. At this point I more or less have my Eclipse, PyDev, Spark, and unit tests working. I can run a simple unit test from the command line or from within Eclipse. The test creates a data frame from a text file and calls df.show(). The last challenge is that it
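A minimal sketch of the kind of test described above (build a data frame from a text file and call df.show()), assuming a local SparkSession and a hypothetical data/sample.txt path; this is not the actual test from the mail.

    import unittest
    from pyspark.sql import SparkSession

    class DataFrameShowTest(unittest.TestCase):
        @classmethod
        def setUpClass(cls):
            cls.spark = (SparkSession.builder
                         .master("local[2]")
                         .appName("df-show-test")
                         .getOrCreate())

        @classmethod
        def tearDownClass(cls):
            cls.spark.stop()

        def test_show_text_file(self):
            # spark.read.text() yields a DataFrame with one string column, "value"
            df = self.spark.read.text("data/sample.txt")
            df.show()
            self.assertIn("value", df.columns)

    if __name__ == "__main__":
        unittest.main()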

Re: how to set up pyspark eclipse, pyDev, virtualenv? syntaxError: yield from walk(

2018-04-05 Thread Hyukjin Kwon
FYI, there is a JIRA and PR for virtualenv support in PySpark: https://issues.apache.org/jira/browse/SPARK-13587 https://github.com/apache/spark/pull/13599 2018-04-06 7:48 GMT+08:00 Andy Davidson : > FYI > >

unsubscribe

2018-04-05 Thread Nikhil Kalbande
unsubscribe

Spark Structured Streaming Inner Queries fails

2018-04-05 Thread Aakash Basu
Hi, Why are inner queries not allowed in Spark Structured Streaming? Spark treats the inner query as a separate stream altogether and expects it to be triggered with its own writeStream.start(). Why is that? Error: pyspark.sql.utils.StreamingQueryException: 'Queries with streaming sources must be
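A small PySpark sketch that reproduces the pattern being asked about, using the built-in rate source and made-up view and column names; it is meant only to illustrate where the quoted error comes from, not to suggest this exact code appears in the thread.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("streaming-inner-query").getOrCreate()

    # "rate" is a built-in test source emitting (timestamp, value) rows
    events = spark.readStream.format("rate").option("rowsPerSecond", 1).load()
    events.createOrReplaceTempView("events")

    # The scalar subquery below is planned as a second query over a streaming
    # source, which is why Spark refuses to evaluate it inline.
    above_average = spark.sql(
        "SELECT value FROM events "
        "WHERE value > (SELECT avg(value) FROM events)"
    )

    query = (above_average.writeStream
             .outputMode("append")
             .format("console")
             .start())
    # once the query runs it fails with the StreamingQueryException quoted above
    query.awaitTermination()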

[Structured Streaming] How to save entire column aggregation to a file

2018-04-05 Thread Aakash Basu
Hi, I want to save an aggregate to a file without using any window, watermark, or groupBy, so my aggregation is over the entire column. df = spark.sql("select avg(col1) as aver from ds") Now, the challenge is as follows: 1) If I use outputMode = Append, I get "Append output mode not supported
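Not an answer from the thread, but one approach that is often suggested for this situation, assuming Spark 2.4+ where foreachBatch is available: run the aggregation in complete output mode and let each micro-batch write its (now non-streaming) result with the regular batch writer. The source, column name, and output path below are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("whole-column-avg").getOrCreate()

    # hypothetical stand-in for the thread's streaming "ds" with a col1 column
    stream = spark.readStream.format("rate").load().withColumnRenamed("value", "col1")
    stream.createOrReplaceTempView("ds")

    aggregated = spark.sql("select avg(col1) as aver from ds")

    def write_batch(batch_df, batch_id):
        # each micro-batch is a plain DataFrame, so the file writers accept it
        batch_df.write.mode("overwrite").json("/tmp/aver_latest")

    query = (aggregated.writeStream
             .outputMode("complete")      # keeps the running whole-column average
             .foreachBatch(write_batch)   # sidesteps the file-sink limitation
             .start())
    query.awaitTermination()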

[Structured Streaming] More than 1 streaming in a code

2018-04-05 Thread Aakash Basu
Hi, If I have more than one writeStream in a piece of code, both operating on the same readStream data, why does only the first writeStream produce output? I want the second one to be printed to the console as well. How can I do that? from pyspark.sql import SparkSession from pyspark.sql.functions import split,
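A minimal sketch of the usual fix: start both queries before blocking, then wait with spark.streams.awaitAnyTermination() instead of awaiting the first query on its own. The socket source and the split() transformation are placeholders echoing the imports in the mail, not the original code.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split

    spark = SparkSession.builder.appName("two-sinks").getOrCreate()

    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    words = lines.select(split(lines.value, " ").alias("words"))

    # start both sinks first; a common pitfall is calling awaitTermination()
    # on the first query before the second one is ever started
    q1 = lines.writeStream.outputMode("append").format("console").start()
    q2 = words.writeStream.outputMode("append").format("console").start()

    # block until either query stops, keeping both printing to the console
    spark.streams.awaitAnyTermination()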

Which metrics would be best to alert on?

2018-04-05 Thread Mark Bonetti
Hi, I'm building a monitoring system for Apache Spark and want to set up default alerts (threshold or anomaly) on the two or three key metrics that Spark users typically want to alert on, but I don't yet have production-grade experience with Spark. Importantly, the alert rules have to be generally useful,

Union of multiple data frames

2018-04-05 Thread Cesar
The following code works for small n, but not for large n (>20): val dfUnion = Seq(df1, df2, df3, ..., dfn).reduce(_ union _) dfUnion.show() By "not working" I mean that Spark takes a long time to create the execution plan. Is there a more optimal way to perform a union of multiple data frames?

Re: Union of multiple data frames

2018-04-05 Thread Andy Davidson
Hi Cesar, I have used Brandon's approach in the past without any problem. Andy From: Brandon Geise Date: Thursday, April 5, 2018 at 11:23 AM To: Cesar , "user @spark" Subject: Re: Union of multiple data frames > Maybe

Re: Union of multiple data frames

2018-04-05 Thread Brandon Geise
Maybe something like: var finalDF = spark.sqlContext.emptyDataFrame for (df <- dfs) { finalDF = finalDF.union(df) } where dfs is a Seq of DataFrames. From: Cesar Date: Thursday, April 5, 2018 at 2:17 PM To: user Subject: Union of