Spark Kafka option properties
Hi, while reading streaming data from Kafka we use the following API:

    df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
        .option("subscribe", "topic1") \
        .option("startingOffsets", "earliest") \
        .load()

My question is: how can I see the different options() available here, such as kafka.bootstrap.servers, in the IDE?
Re: [apache-spark]-spark-shuffle
How a Spark job reads data sources depends on the underlying source system and on the job configuration (number of executors and cores per executor).
https://spark.apache.org/docs/latest/rdd-programming-guide.html#external-datasets

About shuffle operations:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
https://stackoverflow.com/questions/32210011/spark-difference-between-shuffle-write-shuffle-spill-memory-shuffle-spill
https://stackoverflow.com/questions/29011574/how-does-spark-partitioning-work-on-files-in-hdfs/29012187#29012187

This has a great explanation of how shuffle works:
https://stackoverflow.com/questions/37528047/how-are-stages-split-into-tasks-in-spark

A sample of the code and job configuration, the DAG, and the underlying source (HDFS or other) would help explain this further.

Thanks
VP
Re: unsubscribe
On Sat, 16 May 2020, 22:34 Punna Yenumala wrote:
Re: ETL Using Spark
Hi Avadhut Narayan Joshi,

The use case is achievable using Spark. Connecting to SQL Server is possible, as Mich mentioned below, as long as there is a JDBC driver that can connect to SQL Server.

For production workloads, important points to consider:

>> What are the QoS requirements for your case: at least once, at most once, or exactly once?
>> How do you handle Spark Streaming job restarts (because of an error, or because you have to deploy a new version of the application)?
>> What are your error-handling strategies?
>> How do you deal with late-arriving data, since you are doing aggregations?

It is best to make downstream systems idempotent; that is the least troublesome way to keep production workloads maintainable.

Best Regards
VP

Thanks
Vijay
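For the SQL Server connection piece, a minimal PySpark sketch of a JDBC read is below; the host, database, table, credentials and driver class are hypothetical placeholders, and it assumes the Microsoft SQL Server JDBC driver jar is already on the Spark driver/executor classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("etl-from-sqlserver").getOrCreate()

    # Hypothetical connection details; replace with your own.
    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=sales")
          .option("dbtable", "dbo.orders")
          .option("user", "etl_user")
          .option("password", "...")
          .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
          .load())

    df.printSchema()

Making the downstream writes idempotent (for example, upserts keyed on a natural key) then keeps restarts and replays harmless, as noted above.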
Cleanup hook for temporary files produced as part of a spark job
I am writing something that partitions a data set and then trains a machine learning model on the data in each partition. The resulting model is very big, and right now I am storing it in an RDD as a pair of: partition_id and very_big_model_that_is_hundreds_of_megabytes_big. But it is becoming increasingly apparent that storing data that big in a single row of an RDD causes all sorts of complications.

So I figured that instead I could save each model to the filesystem and store a pointer to it (the file path) in the RDD. Then I would simply load the model again in a mapPartitions function and avoid the issue. But that raises the question of when to clean up these temporary files. Is there some way to ensure that files output by Spark code get cleaned up when the SparkSession ends, or when the RDD is no longer referenced? Or is there any other solution to this problem?
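As far as I know, Spark's own cleanup only covers data it manages itself (shuffle files, broadcasts, temp dirs), not files your code writes, so there is no built-in hook tied to the SparkSession or to RDD lifetime for this. One possible approach, as a minimal sketch assuming the models are written to a scratch directory on a filesystem the driver can reach (paths and names below are hypothetical), is to register a process-exit hook on the driver that removes the scratch directory after the session is stopped:

    import atexit
    import shutil
    import tempfile

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioned-model-training").getOrCreate()

    # Hypothetical scratch directory the per-partition models get written to;
    # the RDD then only carries (partition_id, model_path) pairs.
    model_dir = tempfile.mkdtemp(prefix="partition_models_")

    def _cleanup():
        # Stop the session first so no task can still be reading the files,
        # then remove the whole scratch directory.
        spark.stop()
        shutil.rmtree(model_dir, ignore_errors=True)

    # Runs when the driver process exits normally (not on a hard kill).
    atexit.register(_cleanup)

If the models live on HDFS or S3 instead, the same idea applies with the corresponding filesystem delete; writing under a per-run directory and sweeping old runs periodically is another option that also survives driver crashes.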
Re: Parallelising JDBC reads in spark
Does anything different happen when you set the isolationLevel to do dirty reads, i.e. "READ_UNCOMMITTED"?

On Sun, May 24, 2020 at 7:50 PM Manjunath Shetty H wrote:

> Hi,
>
> We are writing an ETL pipeline using Spark that fetches the data from SQL
> Server in batch mode (every 15 mins). The problem we are facing is how to
> parallelise single-table reads into multiple tasks without missing any data.
>
> We have tried this:
>
> - Use the `ROW_NUMBER` window function in the SQL query
> - Then do:
>
>     DataFrame df = hiveContext
>         .read()
>         .jdbc(
>             **,              // url
>             query,           // table / subquery
>             "row_num",       // partition column
>             1,               // lower bound
>             ,                // upper bound
>             noOfPartitions,  // number of partitions
>             jdbcOptions);    // connection properties
>
> The problem with this approach is that if our tables get updated in SQL
> Server while tasks are still running, then the `ROW_NUMBER` will change and
> we may miss some records.
>
> Any approach to fix this issue? Any pointers will be helpful.
>
> *Note*: I am on Spark 1.6
>
> Thanks
> Manjunath Shetty
Re: Parallelising JDBC reads in spark
Why don't you apply proper change data capture? This will be more complex though.

On Mon, 25 May 2020 at 07:38, Manjunath Shetty H <manjunathshe...@live.com> wrote:

> Hi Mike,
>
> Thanks for the response.
>
> Even with that flag set, data miss can happen, right? The fetch is based on
> the last watermark (the maximum timestamp of the rows that the last batch
> job fetched). Take a scenario like this with a table:
>
> a : 1
> b : 2
> c : 3
> d : 4
> *f : 6*
> g : 7
> h : 8
> e : 5
>
> - a, b, c, d, e get picked by one task
> - by the time the second task starts, e has been updated, so the row order changes
> - as f moves up, it will completely get missed in the fetch
>
> Thanks
> Manjunath
Re: Parallelising JDBC reads in spark
Hi Mike,

Thanks for the response.

Even with that flag set, data miss can happen, right? The fetch is based on the last watermark (the maximum timestamp of the rows that the last batch job fetched). Take a scenario like this with a table:

a : 1
b : 2
c : 3
d : 4
f : 6
g : 7
h : 8
e : 5

* a, b, c, d, e get picked by one task
* by the time the second task starts, e has been updated, so the row order changes
* as f moves up, it will completely get missed in the fetch

Thanks
Manjunath

From: Mike Artz
Sent: Monday, May 25, 2020 10:50 AM
To: Manjunath Shetty H
Cc: user
Subject: Re: Parallelising JDBC reads in spark

Does anything different happen when you set the isolationLevel to do dirty reads, i.e. "READ_UNCOMMITTED"?
Parallelising JDBC reads in spark
Hi,

We are writing an ETL pipeline using Spark that fetches data from SQL Server in batch mode (every 15 mins). The problem we are facing is how to parallelise single-table reads into multiple tasks without missing any data.

We have tried this:

* Use the `ROW_NUMBER` window function in the SQL query
* Then do:

    DataFrame df = hiveContext
        .read()
        .jdbc(
            ,                // url
            query,           // table / subquery
            "row_num",       // partition column
            1,               // lower bound
            ,                // upper bound
            noOfPartitions,  // number of partitions
            jdbcOptions);    // connection properties

The problem with this approach is that if our tables get updated in SQL Server while tasks are still running, then the `ROW_NUMBER` will change and we may miss some records.

Any approach to fix this issue? Any pointers will be helpful.

Note: I am on Spark 1.6

Thanks
Manjunath Shetty
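For reference, a minimal PySpark sketch of the kind of partitioned jdbc() read being discussed, with hypothetical URL, credentials, table, ordering column, upper bound and partition count; note that lowerBound/upperBound only decide how the row_num range is split into per-partition predicates, they do not filter rows, so the renumbering problem described above still applies:

    # Spark 1.6-style PySpark; the connection details, table, ordering column
    # "id", bounds and partition count are all hypothetical placeholders.
    props = {"user": "etl_user", "password": "..."}

    query = ("(SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS row_num "
             "FROM dbo.some_table) AS t")

    df = sqlContext.read.jdbc(
        url="jdbc:sqlserver://sqlhost:1433;databaseName=sales",
        table=query,
        column="row_num",       # partition column
        lowerBound=1,
        upperBound=1000000,     # roughly the max row_num; only sets the stride
        numPartitions=8,
        properties=props)

Because each partition runs its own query, ROW_NUMBER is recomputed per task, which is exactly why concurrent updates can shift rows between ranges; partitioning on a stable indexed column, or snapshotting the rows into a staging table first, is one way around that, though it changes the approach.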