Re: Spark SaveMode

2019-07-19 Thread Mich Talebzadeh
This behaviour is governed by the underlying RDBMS for bulk insert, where it either commits or rolls back. You can insert the new rows into a staging table in Oracle (which is common in ETL) and then insert/select into the Oracle table in a shell routine. The other way is to use JDBC in Spark to read Oracle…
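
A minimal sketch of that staging-table route, done entirely from Spark rather than a shell routine (connection details, STAGING and TARGET are illustrative, not from the thread):

import java.sql.DriverManager
import org.apache.spark.sql.{DataFrame, SaveMode}

val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
val props = new java.util.Properties()
props.setProperty("user", "scott")
props.setProperty("password", "tiger")

def loadViaStaging(df: DataFrame): Unit = {
  // 1. Bulk-insert into a staging table that has no PK, so nothing can violate it.
  df.write.mode(SaveMode.Append).jdbc(url, "STAGING", props)
  // 2. Insert/select only the new keys into the real table.
  val conn = DriverManager.getConnection(url, props)
  try {
    conn.createStatement().execute(
      """INSERT INTO TARGET (ID, COL1, COL2)
        |SELECT S.ID, S.COL1, S.COL2 FROM STAGING S
        |WHERE NOT EXISTS (SELECT 1 FROM TARGET T WHERE T.ID = S.ID)""".stripMargin)
  } finally {
    conn.close()
  }
}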

Re: Spark SaveMode

2019-07-19 Thread Jörn Franke
This is not an issue with Spark but with the underlying database. The primary key constraint has a purpose, and ignoring it would defeat that purpose. To handle your use case you would need to make multiple decisions, which may imply you don't want to simply "insert if not exists". Maybe you want to d…

Spark SaveMode

2019-07-19 Thread Richard
Any reason why Spark's SaveMode doesn't have a mode that ignores primary key/unique constraint violations? Let's say I'm using Spark to migrate some data from Cassandra to Oracle; I want the insert operation to be "ignore if the primary key exists" instead of failing the whole batch. Thanks, Richard
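
For context, a minimal sketch of the write in question (URL, table name and credentials are illustrative). SaveMode only operates at table granularity (Append, Overwrite, ErrorIfExists, Ignore), so a single duplicate key aborts the whole batch:

import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()
props.setProperty("user", "scott")
props.setProperty("password", "tiger")

// `df` is the DataFrame read from Cassandra; one PK violation fails the batch.
df.write
  .mode(SaveMode.Append)
  .jdbc("jdbc:oracle:thin:@//dbhost:1521/ORCL", "TARGET", props)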

Re: Spark dataset to explode json string

2019-07-19 Thread Richard
OK, thanks. I have another way that currently works, but it is not efficient when I have to extract a lot of fields, since it creates a UDF for each extraction: df = df.withColumn("foo", getfoo.apply(col("jsonCol"))).withColumn("bar", getbar.apply(col("jsonCol")));
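
The per-field UDFs can be collapsed into a single from_json pass; a sketch in Scala, assuming the two-field example and the df schema posted earlier in this thread:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Schema for {"foo": "val1", "bar": "val2"}; add a field per key you need.
val jsonSchema = StructType(Seq(
  StructField("foo", StringType, nullable = true),
  StructField("bar", StringType, nullable = true)
))

// Parse the JSON string once, then promote struct fields to top-level columns.
val parsed = df
  .withColumn("j", from_json(col("jsonCol"), jsonSchema))
  .select(col("id"), col("col1"), col("col2"),
    col("j.foo").as("foo"), col("j.bar").as("bar"))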

Re: Spark dataset to explode json string

2019-07-19 Thread Mich Talebzadeh
You can try to split the {"foo": "val1", "bar": "val2"} string as below. This is an example of the output:

(c1003d93-5157-4092-86cf-0607157291d8,{"rowkey":"c1003d93-5157-4092-86cf-0607157291d8","ticker":"TSCO","timeissued":"2019-07-01T09:10:55","price":395.25})
{"rowkey":"c1003d93-5157-4092-86cf-060715…
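
If the goal is just to pull a few keys out of the JSON string without defining a full schema, get_json_object is another option (a sketch using the field names from the sample output above, not necessarily the exact method meant here):

import org.apache.spark.sql.functions.{col, get_json_object}

// Extract individual keys from the JSON string by JSONPath.
val extracted = df
  .withColumn("ticker", get_json_object(col("jsonCol"), "$.ticker"))
  .withColumn("price", get_json_object(col("jsonCol"), "$.price"))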

Re: Spark and Oozie

2019-07-19 Thread William Shen
Dennis, do you know what's taking the additional time? Is it the Spark job, or Oozie waiting for an allocation from YARN? Do you have a resource contention issue in YARN?

Re: Spark dataset to explode json string

2019-07-19 Thread Richard
Example of jsonCol (String): {"foo": "val1", "bar": "val2"} Thanks,

Re: Spark dataset to explode json string

2019-07-19 Thread Mich Talebzadeh
Sure. Do you have an example of a record from Cassandra read into the df, by any chance? Only the columns that need to go into Oracle: df.select('col1, 'col2, 'jsonCol).take(1).foreach(println) HTH

Re: Spark dataset to explode json string

2019-07-19 Thread Richard
Thanks for the reply; my situation is a little different from your sample. The following is the schema from the source (df.printSchema()):

root
 |-- id: string (nullable = true)
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- jsonCol: string (nullable = true)

I want to extract multiple fields from the jsonCol string…

Re: Spark dataset to explode json string

2019-07-19 Thread Mich Talebzadeh
Hi Richard, You can use the following to read JSON data into a DF. The example reads JSON from a Kafka topic:

val sc = spark.sparkContext
import spark.implicits._
// Use map to create the new RDD using the value portion of the pair.
val jsonRDD = pricesRDD.map…
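
A sketch of where that example is heading, since the message is cut off above (pricesRDD is assumed to hold (key, jsonString) pairs, as in the earlier sample output):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("json-to-df").getOrCreate()
import spark.implicits._

// Stand-in for the truncated pricesRDD: (key, json) pairs.
val pricesRDD = spark.sparkContext.parallelize(Seq(
  ("c1003d93", """{"rowkey":"c1003d93","ticker":"TSCO","price":395.25}""")))

// Keep only the JSON value, then let Spark infer the schema.
val df = spark.read.json(pricesRDD.map(_._2).toDS())
df.printSchema()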

Spark dataset to explode json string

2019-07-19 Thread Richard
Let's say I use Spark to migrate some data from a Cassandra table to an Oracle table. Cassandra table:

CREATE TABLE SOURCE (
  id UUID PRIMARY KEY,
  col1 text,
  col2 text,
  jsonCol text
);

Example jsonCol value: {"foo": "val1", "bar": "val2"} I am trying to extract fields from the JSON column while importing…
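
A minimal sketch of the migration path being described, assuming the spark-cassandra-connector and an Oracle JDBC driver are on the classpath (keyspace, connection details and target table are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("cassandra-to-oracle").getOrCreate()

// Read the source table through the Cassandra connector's DataSource API.
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "mykeyspace", "table" -> "source"))
  .load()

// Write to Oracle over JDBC (the jsonCol field extraction would go in between).
df.write
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
  .option("dbtable", "TARGET")
  .option("user", "scott")
  .option("password", "tiger")
  .mode("append")
  .save()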

Spark ImportError: No module named XXX

2019-07-19 Thread zenglong chen
Hi all,

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/ubuntu/spark-2.4.3/python/lib/pyspark.zip/pyspark/worker.py", line 364, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/home/ubuntu/spa…
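
This traceback usually means a module imported by the job is not available on the executors' Python path; one common fix (file names here are hypothetical) is to ship the module with the job:

spark-submit --py-files /home/ubuntu/libs/mymodule.py my_app.py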

Re: Spark and Oozie

2019-07-19 Thread Bartek Dobija
Hi Dennis, Oozie jobs shouldn't take that long in a well-configured cluster. Oozie allocates its own resources in YARN, which may require fine-tuning. Check whether YARN gives resources to the Oozie job immediately, which may be one of the reasons, and change job priorities in the YARN scheduling configuration…
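
As one concrete form of that scheduler tuning, a dedicated queue can be defined in capacity-scheduler.xml (queue name and percentages are illustrative, not from the thread):

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,oozie</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.oozie.capacity</name>
  <value>30</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>70</value>
</property>

The Oozie launcher and the Spark action can then be pointed at that queue so they are not starved by other workloads.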

Spark and Oozie

2019-07-19 Thread Dennis Suhari
Dear experts, I am using Spark for processing data from HDFS (Hadoop). These Spark applications are data pipelines, data wrangling and machine learning applications, so Spark submits its jobs using YARN. This also works well. For scheduling I am now trying to use Apache Oozie, but I am facing…