spark kafka option properties

2020-05-24 Thread Gunjan Kumar
Hi,

While reading streaming data from Kafka we use the following API.

df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("subscribe", "topic1") \
  .option("startingOffsets", "earliest") \
  .load()


My question is: how can I see the different option() keys available here, such as

kafka.bootstrap.servers, in the IDE?
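
For reference: the Kafka source options are plain string keys documented in the
Structured Streaming + Kafka integration guide rather than typed setters an IDE
can autocomplete. A minimal sketch of a few commonly used keys; the servers and
topic names are placeholders.

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
      .option("subscribe", "topic1")             # or "subscribePattern" / "assign"
      .option("startingOffsets", "earliest")     # "latest", or a JSON offset spec
      .option("failOnDataLoss", "false")         # tolerate expired/deleted offsets
      .option("maxOffsetsPerTrigger", "10000")   # rate-limit each micro-batch
      .load())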


Re: [apache-spark]-spark-shuffle

2020-05-24 Thread vijay.bvp
How a Spark job reads data sources depends on the underlying source system and on
the job configuration, i.e. the number of executors and cores per executor.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#external-datasets

About Shuffle operations. 
https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
https://stackoverflow.com/questions/32210011/spark-difference-between-shuffle-write-shuffle-spill-memory-shuffle-spill
https://stackoverflow.com/questions/29011574/how-does-spark-partitioning-work-on-files-in-hdfs/29012187#29012187

This has a great explanation of how shuffle works:
https://stackoverflow.com/questions/37528047/how-are-stages-split-into-tasks-in-spark
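
A tiny, hypothetical illustration of the stage split described in those links: a
narrow map followed by a wide reduceByKey, which introduces a shuffle boundary
visible in the Spark UI. Names and data are made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
sc = spark.sparkContext

# Narrow transformation: map stays within each input partition, no data movement.
pairs = sc.parallelize(["a", "b", "a", "c", "b", "a"], 3).map(lambda w: (w, 1))

# Wide transformation: reduceByKey repartitions records by key, so a shuffle
# boundary (a new stage) appears in the job's DAG.
counts = pairs.reduceByKey(lambda x, y: x + y, 2)
print(counts.collect())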


A sample of the code and job configuration, the DAG, and the underlying source
(HDFS or others) would help explain further.

thanks
VP








Re: ETL Using Spark

2020-05-24 Thread vijay.bvp
Hi Avadhut Narayan Joshi,

The use case is achievable using Spark. A connection to SQL Server is possible, as
Mich mentioned below, as long as there is a JDBC driver that can connect to SQL
Server.

For production workloads, important points to consider:

>> What are the QoS requirements for your case? At least once, at most once, or
exactly once?
>> How do you handle Spark Streaming job restarts (because of an error, or because
you have to deploy a new version of the application)?
>> What are your error handling strategies?
>> How do you deal with late-arriving data, since you are doing aggregations?

It is best to make the downstream systems idempotent; that is the least
troublesome way to keep production workloads maintainable.

Best Regards
VP

thanks
Vijay
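
A minimal sketch of the restart-handling point, assuming Structured Streaming; the
sink format and all paths below are hypothetical. A checkpointLocation lets a
restarted query resume from its last committed offsets and state.

# df: a streaming DataFrame, e.g. a Kafka source as shown earlier in this digest
query = (df.writeStream
         .format("parquet")                                 # hypothetical file sink
         .option("path", "/data/warehouse/events")          # hypothetical output path
         .option("checkpointLocation", "/checkpoints/etl")  # offsets/state survive restarts
         .outputMode("append")
         .trigger(processingTime="1 minute")
         .start())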




Cleanup hook for temporary files produced as part of a spark job

2020-05-24 Thread jelmer
I am writing something that partitions a data set and then trains a machine
learning model on the data in each partition

The resulting model is very big, and right now I am storing it in an RDD as
a pair of:
partition_id and very_big_model_that_is_hundreds_of_megabytes_big

But it is becoming increasingly apparent that storing data that big in a
single row of an RDD causes all sorts of complications.

So I figured that instead I could save this model to the filesystem and
store a pointer to the model (file path) in the RDD. Then I would simply
load the model again in a mapPartitions function and avoid the issue.
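
A rough sketch of that idea, assuming a shared filesystem path visible to every
executor; the directory and the dict standing in for the real model are
hypothetical.

import os
import pickle

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("model-per-partition").getOrCreate()
sc = spark.sparkContext

MODEL_DIR = "/shared/models"  # hypothetical directory on a shared filesystem

def train_and_save(partition_id, rows):
    # Stand-in for the real training step that produces the huge model.
    model = {"partition": partition_id, "n_rows": sum(1 for _ in rows)}
    path = os.path.join(MODEL_DIR, "part-{}.pkl".format(partition_id))
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Keep only a small (partition_id, path) pointer in the RDD.
    yield (partition_id, path)

model_paths = sc.parallelize(range(100), 4).mapPartitionsWithIndex(train_and_save)
print(model_paths.collect())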

But this raises the question of when to clean up these temporary files. Is
there some way to ensure that files written by Spark code get cleaned up
when the SparkSession ends or the RDD is no longer referenced?

Or is there any other solution to this problem?




Re: Parallelising JDBC reads in spark

2020-05-24 Thread Mike Artz
Does anything different happen when you set the isolationLevel to do
dirty reads, i.e. "READ_UNCOMMITTED"?

On Sun, May 24, 2020 at 7:50 PM Manjunath Shetty H 
wrote:

> Hi,
>
> We are writing an ETL pipeline using Spark that fetches data from SQL
> Server in batch mode (every 15 mins). The problem we are facing is how to
> parallelise single-table reads into multiple tasks without missing any
> data.
>
> We have tried this,
>
>
>    - Use the `ROW_NUMBER` window function in the SQL query
>    - Then do:
>
>    DataFrame df =
>        hiveContext
>            .read()
>            .jdbc(
>                **,              // jdbc url
>                query,           // table / subquery containing ROW_NUMBER
>                "row_num",       // partition column
>                1,               // lower bound
>                ,                // upper bound
>                noOfPartitions,  // number of partitions
>                jdbcOptions);    // connection properties
>
>
>
> The problem with this approach is that if our tables get updated in SQL
> Server while the tasks are still running, the `ROW_NUMBER` values will
> change and we may miss some records.
>
>
> Is there any approach to fix this issue? Any pointers would be helpful.
>
>
> *Note*: I am on Spark 1.6
>
>
> Thanks
>
> Manjunath Shetty
>
>


Re: Parallelising JDBC reads in spark

2020-05-24 Thread Georg Heiler
Why don't you apply proper change data capture?
This will be more complex though.

On Mon., 25 May 2020 at 07:38, Manjunath Shetty H <
manjunathshe...@live.com> wrote:

> Hi Mike,
>
> Thanks for the response.
>
> Even with that flag set, a data miss can happen, right? As the fetch is
> based on the last watermark (the maximum timestamp of the rows that the
> last batch job fetched), take a scenario like this with a table:
>
> a :  1
> b :  2
> c :  3
> d :  4
> *f  :  6*
> g :  7
> h :  8
> e :  5
>
>
>    - a, b, c, d, e get picked by the first task
>    - by the time the second task starts, e has been updated, so the row order
>      changes
>    - as f moves up, it gets missed completely in the fetch
>
>
> Thanks
> Manjunath
>
> --
> *From:* Mike Artz 
> *Sent:* Monday, May 25, 2020 10:50 AM
> *To:* Manjunath Shetty H 
> *Cc:* user 
> *Subject:* Re: Parallelising JDBC reads in spark
>
> Does anything different happen when you set the isolationLevel to do
> dirty reads, i.e. "READ_UNCOMMITTED"?
>
> On Sun, May 24, 2020 at 7:50 PM Manjunath Shetty H <
> manjunathshe...@live.com> wrote:
>
> Hi,
>
> We are writing an ETL pipeline using Spark that fetches data from SQL
> Server in batch mode (every 15 mins). The problem we are facing is how to
> parallelise single-table reads into multiple tasks without missing any
> data.
>
> We have tried this,
>
>
>    - Use the `ROW_NUMBER` window function in the SQL query
>    - Then do:
>
>    DataFrame df =
>        hiveContext
>            .read()
>            .jdbc(
>                **,              // jdbc url
>                query,           // table / subquery containing ROW_NUMBER
>                "row_num",       // partition column
>                1,               // lower bound
>                ,                // upper bound
>                noOfPartitions,  // number of partitions
>                jdbcOptions);    // connection properties
>
>
>
> The problem with this approach is that if our tables get updated in SQL
> Server while the tasks are still running, the `ROW_NUMBER` values will
> change and we may miss some records.
>
>
> Is there any approach to fix this issue? Any pointers would be helpful.
>
>
> *Note*: I am on Spark 1.6
>
>
> Thanks
>
> Manjunath Shetty
>
>


Re: Parallelising JDBC reads in spark

2020-05-24 Thread Manjunath Shetty H
Hi Mike,

Thanks for the response.

Even with that flag set, a data miss can happen, right? As the fetch is based on
the last watermark (the maximum timestamp of the rows that the last batch job
fetched), take a scenario like this with a table:

a :  1
b :  2
c :  3
d :  4
f  :  6
g :  7
h :  8
e :  5


  *   a, b, c, d, e get picked by the first task
  *   by the time the second task starts, e has been updated, so the row order
      changes
  *   as f moves up, it gets missed completely in the fetch

Thanks
Manjunath


From: Mike Artz 
Sent: Monday, May 25, 2020 10:50 AM
To: Manjunath Shetty H 
Cc: user 
Subject: Re: Parallelising JDBC reads in spark

Does anything different happen when you set the isolationLevel to do dirty
reads, i.e. "READ_UNCOMMITTED"?

On Sun, May 24, 2020 at 7:50 PM Manjunath Shetty H
<manjunathshe...@live.com> wrote:
Hi,

We are writing an ETL pipeline using Spark that fetches data from SQL Server
in batch mode (every 15 mins). The problem we are facing is how to
parallelise single-table reads into multiple tasks without missing any data.

We have tried this,


  *   Use the `ROW_NUMBER` window function in the SQL query
  *   Then do:

DataFrame df =
    hiveContext
        .read()
        .jdbc(
            ,                // jdbc url
            query,           // table / subquery containing ROW_NUMBER
            "row_num",       // partition column
            1,               // lower bound
            ,                // upper bound
            noOfPartitions,  // number of partitions
            jdbcOptions);    // connection properties



The problem with this approach is that if our tables get updated in SQL
Server while the tasks are still running, the `ROW_NUMBER` values will
change and we may miss some records.


Is there any approach to fix this issue? Any pointers would be helpful.


Note: I am on Spark 1.6


Thanks

Manjunath Shetty


Parallelising JDBC reads in spark

2020-05-24 Thread Manjunath Shetty H
Hi,

We are writing an ETL pipeline using Spark that fetches data from SQL Server
in batch mode (every 15 mins). The problem we are facing is how to
parallelise single-table reads into multiple tasks without missing any data.

We have tried this,


  *   Use the `ROW_NUMBER` window function in the SQL query
  *   Then do:

DataFrame df =
    hiveContext
        .read()
        .jdbc(
            ,                // jdbc url
            query,           // table / subquery containing ROW_NUMBER
            "row_num",       // partition column
            1,               // lower bound
            ,                // upper bound
            noOfPartitions,  // number of partitions
            jdbcOptions);    // connection properties



The problem with this approach is that if our tables get updated in SQL
Server while the tasks are still running, the `ROW_NUMBER` values will
change and we may miss some records.


Is there any approach to fix this issue? Any pointers would be helpful.


Note: I am on Spark 1.6


Thanks

Manjunath Shetty
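
Not from the thread, but a hedged sketch of one alternative for the partitioning
part only: DataFrameReader.jdbc also accepts explicit predicates, so partitioning
on a stable, indexed key means a concurrent update cannot move a row between
partitions the way ROW_NUMBER can. All table, column, and connection values below
are made up, and this does not by itself solve the watermark / late-update
question.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="jdbc-predicates-demo")
sqlContext = SQLContext(sc)   # Spark 1.6 style entry point; could be a HiveContext

# One partition per predicate; ranges cover a stable key such as the primary key.
predicates = [
    "id >= 0       AND id < 1000000",
    "id >= 1000000 AND id < 2000000",
    "id >= 2000000 AND id < 3000000",
]

df = sqlContext.read.jdbc(
    url="jdbc:sqlserver://dbhost:1433;databaseName=sales",   # hypothetical
    table="dbo.orders",                                       # hypothetical
    predicates=predicates,
    properties={"user": "etl_user", "password": "secret"})    # hypothetical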