Re: pySpark - pandas UDF and binaryType

2019-05-03 Thread Gourav Sengupta
And also be aware that pandas UDFs do not always lead to better performance, and can sometimes be massively slower. With Grouped Map, don't you run into the risk of random memory errors as well? On Thu, May 2, 2019 at 9:32 PM Bryan Cutler wrote: > Hi, > > BinaryType support was not added
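The memory risk mentioned above comes from how Grouped Map works: each group is materialized as one in-memory pandas DataFrame on a single executor. A minimal sketch of the pattern (column names and values are illustrative, and the Spark registration is shown only as a comment):

```python
import pandas as pd

def demean(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group arrives as ONE in-memory pandas DataFrame, so a single
    # very large group can exhaust an executor's memory.
    pdf = pdf.copy()
    pdf["value"] = pdf["value"] - pdf["value"].mean()
    return pdf

# In Spark this would be registered roughly as:
#   df.groupBy("key").applyInPandas(demean, schema="key string, value double")
# Here the same function is exercised on plain pandas groups:
df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1.0, 3.0, 5.0]})
out = pd.concat(demean(g) for _, g in df.groupby("key"))
```

If one key dominates the data, that one task holds the whole group at once, which is where the "random memory errors" tend to come from.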

Re: Howto force spark to honor parquet partitioning

2019-05-03 Thread Gourav Sengupta
So you want data from one physical partition on disk to go to only one executor? On Fri, May 3, 2019 at 5:38 PM Tomas Bartalos wrote: > Hello, > > I have partitioned parquet files based on the "event_hour" column. > After reading parquet files to spark: >

Re: Spark SQL JDBC teradata syntax error

2019-05-03 Thread Gourav Sengupta
What is the query? On Fri, May 3, 2019 at 5:28 PM KhajaAsmath Mohammed wrote: > Hi > > I have followed link > https://community.teradata.com/t5/Connectivity/Teradata-JDBC-Driver-returns-the-wrong-schema-column-nullability/m-p/77824 > to > connect teradata from spark. > > I was able to print

[MLlib][Beginner][Debug]: Logistic Regression model always predicts the same value

2019-05-03 Thread Josue Lopes
So this is my first time using Apache Spark and machine learning in general, and I'm currently trying to create a small application to detect credit card fraud. Currently I have about 1 transaction objects I'm using for my data set, with 70% of it going towards training the model and 30% for
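A model that always predicts the same value on fraud data is very often a class-imbalance symptom: fraud rows are so rare that predicting "not fraud" everywhere already minimizes the loss. One common fix is inverse-frequency instance weights; a minimal sketch (the Spark hookup via a weight column is shown only as a comment, names illustrative):

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: rare (fraud) rows count for more, so the
    classifier cannot win by always predicting the majority class."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

labels = [0] * 98 + [1] * 2   # heavily imbalanced, like fraud data
w = class_weights(labels)

# In Spark ML you would materialize w as a per-row "weight" column and
# configure the estimator with LogisticRegression().setWeightCol("weight").
```

With these weights the total weighted mass of each class is equal, so both classes contribute equally to training.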

Howto force spark to honor parquet partitioning

2019-05-03 Thread Tomas Bartalos
Hello, I have partitioned parquet files based on "event_hour" column. After reading parquet files to spark: spark.read.format("parquet").load("...") Files from the same parquet partition are scattered in many spark partitions. Example of mapping spark partition -> parquet partition: Spark
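The scattering described above happens because the parquet reader assigns rows to Spark partitions by file splits, not by the on-disk directory partition. Repartitioning by the column restores co-location; a sketch of the effect (the Spark call is shown as a comment, the hash-partitioning it implies is simulated in plain Python):

```python
# In Spark, restoring the grouping would look roughly like:
#   df = spark.read.format("parquet").load("/data").repartition("event_hour")
# repartition(col) hash-partitions rows on the key, so every row with the
# same event_hour lands in the same Spark partition:

def assign_partition(key, num_partitions):
    # Simplified stand-in for Spark's hash partitioner.
    return hash(key) % num_partitions

rows = [("h01", 1), ("h02", 2), ("h01", 3), ("h03", 4), ("h01", 5)]
buckets = {}
for hour, value in rows:
    buckets.setdefault(assign_partition(hour, 4), []).append((hour, value))
```

Different keys may still share a partition (hash collisions), but no single key is ever split across partitions.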

Spark SQL JDBC teradata syntax error

2019-05-03 Thread KhajaAsmath Mohammed
Hi, I have followed the link https://community.teradata.com/t5/Connectivity/Teradata-JDBC-Driver-returns-the-wrong-schema-column-nullability/m-p/77824 to connect to Teradata from Spark. I was able to print the schema if I give a table name instead of a SQL query. I am getting the below error if I give a query (code
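A frequent cause of this symptom (table name works, raw query fails) is that Spark's JDBC reader wraps the `dbtable` option in `SELECT * FROM <dbtable> ...`, so a query must be passed as a parenthesized derived table with an alias. A hedged sketch, with names and the URL purely illustrative:

```python
def as_dbtable(query: str, alias: str = "t") -> str:
    """Wrap a SQL query so it is legal as Spark's 'dbtable' option:
    Spark emits SELECT * FROM <dbtable>, and Teradata requires the
    derived table to carry an alias."""
    return f"({query}) {alias}"

dbtable = as_dbtable("select id, name from mydb.mytable")

# spark.read.format("jdbc") \
#      .option("url", "jdbc:teradata://host/DATABASE=mydb") \
#      .option("dbtable", dbtable) \
#      .load()
```

If the un-wrapped query was being passed directly, the resulting SQL is malformed, which would explain the syntax error.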

Re: Spark SQL Teradata load is very slow

2019-05-03 Thread Shyam P
Asmath, why is upperBound set to 300? How many cores do you have? Check how the data is distributed in the Teradata DB table: SELECT distinct(itm_bloon_seq_no), count(*) as cc FROM TABLE order by itm_bloon_seq_no desc; Is this column "itm_bloon_seq_no" already in the table, or did you derive it at spark
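The reason the bounds matter: Spark turns partitionColumn/lowerBound/upperBound/numPartitions into one WHERE clause per task, with stride (upper - lower) / numPartitions. If the column's real distribution is skewed inside those bounds, one task drags the whole load. A simplified sketch (Spark's actual edge handling differs slightly):

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Rough sketch of how Spark's JDBC source splits a read into
    per-task WHERE clauses (simplified)."""
    stride = (upper - lower) // num_partitions
    preds = []
    for i in range(num_partitions):
        lo = lower + i * stride
        if i == 0:
            preds.append(f"{column} < {lo + stride} or {column} is null")
        elif i == num_partitions - 1:
            preds.append(f"{column} >= {lo}")
        else:
            preds.append(f"{column} >= {lo} and {column} < {lo + stride}")
    return preds

preds = jdbc_partition_predicates("itm_bloon_seq_no", 0, 300, 3)
```

If most rows have itm_bloon_seq_no near one end of the range, that single predicate fetches nearly everything, which is exactly the slowness being asked about.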

Re: Update / Delete records in Parquet

2019-05-03 Thread Chetan Khatri
Agreed on delta.io; I am exploring both options. On Wed, May 1, 2019 at 2:50 PM Vitaliy Pisarev wrote: > Ankit, you should take a look at delta.io, which was recently open sourced > by Databricks. > > Full DML support is on the way. > > > > *From: *"Khare, Ankit" > *Date: *Tuesday, 23 April
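The underlying issue in this thread is that parquet files are immutable, so "UPDATE"/"DELETE" without Delta means read, transform, and rewrite the affected data. A minimal in-memory sketch of those semantics with pandas (the Delta equivalent is shown only as a comment):

```python
import pandas as pd

# Parquet is immutable: DML = read -> filter/mutate -> rewrite.
df = pd.DataFrame({"id": [1, 2, 3], "status": ["new", "new", "old"]})

# DELETE FROM t WHERE status = 'old'
df = df[df["status"] != "old"].copy()

# UPDATE t SET status = 'done' WHERE id = 1
df.loc[df["id"] == 1, "status"] = "done"

# With Delta Lake this rewrite is handled for you, roughly:
#   DeltaTable.forPath(spark, path).delete("status = 'old'")
```

Delta's transaction log makes the rewrite atomic, which is the part that is hard to get right by hand over plain parquet.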

Re: Getting EOFFileException while reading from sequence file in spark

2019-05-03 Thread Prateek Rajput
Hi all, please share if anyone has faced the same problem. There are many similar issues on the web, but I did not find any solution, or the reason why this happens. It would be really helpful. Regards, Prateek On Mon, Apr 29, 2019 at 3:18 PM Prateek Rajput wrote: > I checked and removed 0 sized files
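Since zero-sized files were already suspected above: an empty sequence file has no SEQ header, and reading one is a classic way to hit an EOFException. A small defensive filter, illustrative only (in Spark you would pass the surviving paths, e.g. `sc.sequenceFile(",".join(good_paths))`):

```python
import os

def non_empty_files(paths):
    """Skip zero-byte files before handing paths to the sequence-file
    reader; empty files lack the SEQ header and can raise EOFException."""
    return [p for p in paths if os.path.getsize(p) > 0]
```

This does not cover truncated (non-empty but corrupt) files, which would need a try/except around the read itself.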

Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

2019-05-03 Thread Olivier Girardot
Hi, I did not try on another vendor, so I can't say if it's only related to gke, and no, I did not notice anything on the kubelet or kube-dns processes... Regards Le ven. 3 mai 2019 à 03:05, Li Gao a écrit : > hi Olivier, > > This seems a GKE specific issue? have you tried on other vendors ?