Understanding Spark execution plans

2020-08-05 Thread Daniel Stojanov
Hi, When an execution plan is printed, it lists the tree of operations that will be completed when the job runs. The operations have somewhat cryptic names of the sort: BroadcastHashJoin, Project, Filter, etc. These do not appear to map directly to functions that are performed on an RDD. 1) Is there …
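[For context: those names are Catalyst physical-plan operators, not RDD methods. A minimal sketch of how they surface, runnable in spark-shell; the tiny datasets are made up for illustration.]

    import org.apache.spark.sql.functions.broadcast
    import spark.implicits._

    val left  = Seq((1, "a"), (2, "b")).toDF("id", "v")
    val right = Seq((1, "x")).toDF("id", "w")

    // Filter, Project and BroadcastHashJoin appear in the printed plan as
    // Catalyst physical operators; they do not map one-to-one to RDD methods.
    left.filter($"id" > 0)
      .join(broadcast(right), "id")
      .select($"id", $"w")
      .explain(true)  // prints the parsed, analyzed, optimized and physical plans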

Multi insert with join in Spark SQL

2020-08-05 Thread moqi
Hi, I am trying to migrate Hive SQL to Spark SQL. When I execute a multi-insert-with-join statement, Spark SQL will scan the same table multiple times, while Hive SQL scans it only once. In the actual production environment, this table is relatively large, which causes the running time of …
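[A workaround often suggested for this pattern, sketched here with hypothetical table and column names: materialize the shared scan once with cache() so both inserts read it instead of rescanning the source.]

    import spark.implicits._

    // Cache the common scan so both inserts reuse it instead of rescanning.
    val src = spark.table("big_table").where($"dt" === "2020-08-05").cache()

    src.filter($"kind" === "a").write.mode("append").insertInto("target_a")
    src.filter($"kind" === "b").write.mode("append").insertInto("target_b")

    src.unpersist()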

S3 read/write from PySpark

2020-08-05 Thread Daniel Stojanov
Hi, I am trying to read/write files to S3 from PySpark. The procedure that I have used is to download Spark, then start PySpark with the hadoop-aws, guava, and aws-java-sdk-bundle packages. The versions are explicitly specified by looking up the exact dependency versions on Maven. Allowing dependencies to …
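[For reference, a hedged Scala equivalent of that setup. The hadoop-aws version must match the Hadoop build bundled with your Spark distribution (3.2.0 below is only an example), and the bucket paths are made up.]

    // Launch with the matching connector, e.g.:
    //   spark-shell --packages org.apache.hadoop:hadoop-aws:3.2.0
    val hc = spark.sparkContext.hadoopConfiguration
    hc.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    hc.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    val df = spark.read.parquet("s3a://my-bucket/input/")
    df.write.mode("overwrite").parquet("s3a://my-bucket/output/")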

Re: Tab delimited csv import and empty columns

2020-08-05 Thread Stephen Coy
Hi Sean, German and others, Setting the “nullValue” option (for parsing CSV at least) seems to be an exercise in futility. When parsing the file, com.univocity.parsers.common.input.AbstractCharInputReader#getString contains the following logic: String out; if (len <= 0) { out = …
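[For anyone following along, these are the reader options under discussion. Behaviour differs across Spark versions, so treat this as an illustrative sketch rather than a fix; the file path is made up.]

    val df = spark.read
      .option("sep", "\t")
      .option("header", "true")
      .option("nullValue", "")   // fields matching this string become null
      .option("emptyValue", "")  // what an empty quoted field is parsed as (Spark 2.4+)
      .csv("data.tsv")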

Re: Comments conventions in Spark distribution official examples

2020-08-05 Thread Sean Owen
These only matter to our documentation, which includes the source of these examples inline in the docs. For brevity, the examples don't need to show all the imports that are otherwise necessary for the source file. You can ignore them, just as the compiler does with comments, if you are using the example …
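[Concretely, a minimal sketch of how the markers look in an example source file: the docs build pulls only the regions between matching on/off markers into the published page. The label names below are made up.]

    package org.apache.spark.examples.sql

    // $example on:init_session$
    import org.apache.spark.sql.SparkSession
    // $example off:init_session$

    object ExampleDoc {
      def main(args: Array[String]): Unit = {
        // $example on:create_session$
        val spark = SparkSession.builder().appName("Example").getOrCreate()
        // $example off:create_session$
        spark.stop()
      }
    }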

Comments conventions in Spark distribution official examples

2020-08-05 Thread Fuad Efendi
Hello, I am trying to guess what such comments are needed for and cannot find anything about them on the Internet; maybe they are for some documentation tool? Both the Java and Scala examples have this in import statements and in the code: “$example on” and “$example off”. package org.apache.spark.examples.sql // $example …

Async API to save RDDs?

2020-08-05 Thread Antonin Delpeuch (lists)
Hi, The RDD API provides async variants of a few RDD methods, which let the user execute the corresponding jobs asynchronously. This makes it possible, for instance, to cancel the jobs: https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/AsyncRDDActions.html There does not seem to …
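[There is indeed no async variant of the save actions. A common workaround, sketched here as an assumption-laden example rather than a definitive pattern: run the blocking save on its own thread and cancel it via a job group. The output path is made up.]

    import java.util.concurrent.Executors
    import scala.concurrent.{ExecutionContext, Future}

    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

    val rdd = spark.sparkContext.parallelize(1 to 1000)

    val saving = Future {
      // setJobGroup is per-thread, so it must be called on the thread
      // that actually triggers the job.
      spark.sparkContext.setJobGroup("save-job", "async save", interruptOnCancel = true)
      rdd.saveAsTextFile("/tmp/out")
    }

    // From any other thread, cancel all jobs in the group:
    // spark.sparkContext.cancelJobGroup("save-job")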

file importing / hibernate

2020-08-05 Thread nt
1. I need to import CSV files with some entity-resolution logic; Spark could help me process rows in parallel. Do you think this is a good approach? 2. I have a quite complex database structure and am eager to use e.g. Hibernate to resolve and save the data, but it seems like everybody uses plain JDBC. Is this …
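[A hedged sketch of the plain-JDBC route that usually comes up in these threads: parse the CSV in parallel, apply the resolution step, then use Spark's built-in JDBC writer rather than an ORM. The connection details are hypothetical, the dedup call is only a stand-in for real entity-resolution logic, and the JDBC driver is assumed to be on the classpath.]

    val rows = spark.read.option("header", "true").csv("/data/input/*.csv")

    // Stand-in for real entity-resolution logic.
    val resolved = rows.dropDuplicates("external_id")

    resolved.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://db:5432/app")
      .option("dbtable", "staging.entities")
      .option("user", "etl")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .mode("append")
      .save()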

Re: Renaming a DataFrame column makes Spark lose partitioning information

2020-08-05 Thread Antoine Wendlinger
Well that's great! Thank you very much :) Antoine On Tue, Aug 4, 2020 at 11:22 PM Terry Kim wrote: > This is fixed in Spark 3.0 by https://github.com/apache/spark/pull/26943: > > scala> :paste > // Entering paste mode (ctrl-D to finish) > > Seq((1, 2)) > .toDF("a", "b") > …
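[For readers hitting the same issue, a hedged reproduction of what the thread describes; the version behaviour below is as reported in the thread, not independently verified here.]

    import spark.implicits._

    val df = Seq((1, 2)).toDF("a", "b").repartition($"b")

    // On Spark 3.0+ the plan still reports hashpartitioning on the renamed
    // column; on 2.x the partitioning information could be lost by the rename,
    // forcing an unnecessary shuffle downstream.
    df.withColumnRenamed("b", "c").explain()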