Aggregation of Streaming UI Statistics for multiple jobs

2018-05-26 Thread skmishra
Hi, I am working on a streaming use case where I need to run multiple Spark Streaming applications at the same time and measure their throughput and latencies. The Spark UI provides all the statistics, but if I want to run more than 100 applications at the same time then I do not have any clue on
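
One way to aggregate these numbers across many applications (a sketch, not from the thread): each driver exposes the monitoring REST API under /api/v1, including streaming statistics for DStream jobs, so a small script can poll every driver and collect throughput and latency in one place. The endpoint paths and field names follow the documented monitoring API; the host/port values below are placeholders.

    import requests

    # Placeholder driver UI endpoints, one per running streaming application.
    ui_endpoints = ["http://driver-host-1:4040", "http://driver-host-2:4040"]

    for base in ui_endpoints:
        for app in requests.get(base + "/api/v1/applications").json():
            stats = requests.get(
                base + "/api/v1/applications/" + app["id"] + "/streaming/statistics"
            ).json()
            # Fields such as avgInputRate, avgProcessingTime, avgTotalDelay
            # come from the streaming statistics endpoint (see monitoring docs).
            print(app["id"], stats.get("avgInputRate"), stats.get("avgTotalDelay"))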

Re: PySpark API on top of Apache Arrow

2018-05-26 Thread Jules Damji
Actually, we do mention that Pandas UDF is built upon Apache Arrow. :-) And point to the blog by their contributors from Two Sigma. :-) “On the other hand, Pandas UDF built atop Apache Arrow accords high-performance to Python developers, whether you use Pandas UDFs on a single-node machine or
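
For readers following along, a minimal scalar Pandas UDF sketch (Spark 2.3+ with pyarrow installed; the DataFrame and column names are made up), showing the Arrow-backed path the blog describes:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2.0), (2, 3.0)], ["id", "v"])

    # The function receives and returns pandas.Series; rows move between the JVM
    # and the Python worker as Arrow record batches rather than pickled rows.
    @pandas_udf("double", PandasUDFType.SCALAR)
    def plus_one(v):
        return v + 1

    df.withColumn("v_plus_one", plus_one("v")).show()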

Re: PySpark API on top of Apache Arrow

2018-05-26 Thread Corey Nolet
Gourav & Nicholas, THank you! It does look like the pyspark Pandas UDF is exactly what I want and the article I read didn't mention that it used Arrow underneath. Looks like Wes McKinney was also key part of building the Pandas UDF. Gourav, I totally apologize for my long and drawn out response

Re: PySpark API on top of Apache Arrow

2018-05-26 Thread Nicolas Paris
Hi Corey, I'm not familiar with Arrow or Plasma. However, I recently read an article about Spark on a standalone machine (your case). Sounds like you could benefit from PySpark "as-is": https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html Regards, 2018-05-23
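
For what it's worth, the "as-is" single-node setup amounts to running PySpark in local mode (a sketch; the path and column name are placeholders):

    from pyspark.sql import SparkSession

    # local[*] uses every core on the single machine.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("single-node-test")
             .getOrCreate())

    df = spark.read.parquet("/path/to/local/data")  # placeholder path
    df.groupBy("some_column").count().show()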

Re: Spark 2.3 Tree Error

2018-05-26 Thread Aakash Basu
I think I found the solution: the last comment from this link - https://issues.apache.org/jira/browse/SPARK-14948 But my question is, even after using table.column, why does Spark find the same column name from two different tables ambiguous? I mean, with table1.column = table2.column, Spark should
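
A common workaround for this class of error (a sketch; the DataFrame and column names are made up): when both sides of a join come from the same lineage, alias them and reference columns through the aliases, or rename one side up front so no duplicate attribute names reach the plan:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])  # stand-in data

    # Alias both sides and qualify the join columns through the aliases.
    left = df.alias("t1")
    right = df.alias("t2")
    joined = left.join(right, F.col("t1.id") == F.col("t2.id"))

    # Alternatively, rename one side so the attribute names no longer collide.
    right_renamed = df.withColumnRenamed("id", "id_right")
    joined2 = df.join(right_renamed, F.col("id") == F.col("id_right"))
    joined2.show()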

Re: Spark 2.3 Tree Error

2018-05-26 Thread Aakash Basu
You're right. The same set of queries works for a maximum of 2 columns in the loop. If I give more than 2 columns, the 2nd column fails with this error - *attribute(s) with the same name appear in the operation: marginal_adhesion_bucketed. Please check if the right attribute(s) are used.* Any

Re: Spark 2.3 Tree Error

2018-05-26 Thread hemant singh
Per the SQL plan, this is where it is failing - Attribute(s) with the same name appear in the operation: fnlwgt_bucketed. Please check if the right attribute(s) are used.; On Sat, May 26, 2018 at 6:16 PM, Aakash Basu wrote: > Hi, > > This query is based on one step

Re: Silly question on Dropping Temp Table

2018-05-26 Thread Aakash Basu
Well, it did, meaning internally a TempTable and a TempView are the same. Thanks buddy! On Sat, May 26, 2018 at 9:23 PM, Aakash Basu wrote: > Question is, while registering, using registerTempTable() and while > dropping, using a dropTempView(), would it go and hit
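
A quick way to check the equivalence (a sketch; the DataFrame and view name are illustrative): registerTempTable() is just a deprecated alias of createOrReplaceTempView(), so the catalog drop call removes the same session-scoped entry:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["id"])  # stand-in data

    df.registerTempTable("my_temp")   # deprecated alias of createOrReplaceTempView
    spark.sql("SELECT COUNT(*) FROM my_temp").show()

    # Drops that same temporary view; returns True if it existed.
    spark.catalog.dropTempView("my_temp")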

Re: Silly question on Dropping Temp Table

2018-05-26 Thread Aakash Basu
The question is: while registering using registerTempTable() and dropping using dropTempView(), would it hit the same TempTable internally, or would it search for a registered view? Not sure. Any idea? On Sat, May 26, 2018 at 9:04 PM, SNEHASISH DUTTA wrote: >

Silly question on Dropping Temp Table

2018-05-26 Thread Aakash Basu
Hi all, I'm trying to use dropTempTable() after the respective temporary table's use is over (to free up the memory for the next calculations). The newer SparkSession doesn't need sqlContext, so it is confusing me how to use the function. 1) Tried the same DF which I used to register a temp table to

Spark 2.3 Tree Error

2018-05-26 Thread Aakash Basu
Hi, This query is based on one step further from the query in this link. In this scenario, I add 1 or 2 more columns to be processed, and Spark throws an ERROR, printing the physical plan of the queries. It

Spark 2.3 Memory Leak on Executor

2018-05-26 Thread Aakash Basu
Hi, I am getting a memory leak warning, which was supposedly a Spark bug up to version 1.6 and was resolved. Mode: Standalone IDE: PyCharm Spark version: 2.3 Python version: 3.6 Below is the stack trace - 2018-05-25 15:00:05 WARN Executor:66 - Managed memory leak detected; size = 262144 bytes,

what defines dataset partition number in spark sql

2018-05-26 Thread 崔苗
Hi, I want to know: when I create a dataset by reading files from HDFS in Spark SQL, like Dataset user = spark.read().format("json").load(filePath), what defines the partition number of the dataset? And what if the filePath is a directory instead of a single file? Why can't we get the partitions
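
A sketch of how to inspect and influence this (paths and values are placeholders): for file-based sources, the partition count is driven by the number and size of the input files together with spark.sql.files.maxPartitionBytes (128 MB by default), and loading a directory simply scans all the files under it.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # A directory behaves like a single path: every part file under it is scanned.
    users = spark.read.format("json").load("hdfs:///path/to/users/")  # placeholder path
    print(users.rdd.getNumPartitions())

    # Lower the split cap for subsequent reads to get more, smaller partitions.
    spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))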