Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread ayan guha
Hi, Option 1: You can write back to the processed queue and add some additional info, like the last time tried and a retry counter. Option 2: Load unpaired transactions into a staging table in Postgres. Modify the streaming job of the "created" flow to try to pair any unpaired trx and clear it by moving it to the main table.
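A minimal sketch of the bookkeeping Option 1 describes, in plain Python rather than Spark. The names `RetryEnvelope`, `next_attempt`, and `MAX_ATTEMPTS` are hypothetical and do not come from the thread; the retry limit in particular is an assumption, since the thread does not say when to give up.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Optional

MAX_ATTEMPTS = 5  # assumed retry limit; not specified in the thread


@dataclass(frozen=True)
class RetryEnvelope:
    """A failed message plus the extra info suggested in Option 1."""
    transaction_id: int
    payload: dict
    attempts: int = 0                 # how many times pairing has been tried
    last_tried: Optional[str] = None  # ISO-8601 timestamp of the last try


def next_attempt(msg: RetryEnvelope, now: datetime) -> Optional[RetryEnvelope]:
    """Return the envelope to write back to the queue, or None to give up."""
    attempts = msg.attempts + 1
    if attempts >= MAX_ATTEMPTS:
        return None  # retries exhausted; the caller decides what happens next
    return replace(msg, attempts=attempts, last_tried=now.isoformat())


if __name__ == "__main__":
    first = RetryEnvelope(transaction_id=1, payload={"transaction_id": 1})
    print(next_attempt(first, datetime.now(timezone.utc)))
```

Keeping the counter inside the message means the consumer needs no extra state to decide whether to retry again.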

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Mich Talebzadeh
One alternative I can think of is that you publish your orphaned transactions to another topic from the main Spark job. You create a new DF based on orphaned transactions: result = orphanedDF \ .. .writeStream \
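The routing decision behind this suggestion can be sketched in plain Python, independent of Spark. This is only an illustration: `split_orphans` and `created_ids` are hypothetical names, and it assumes the set of already-stored "created" transaction IDs is available to the job. In the actual Spark job, the orphaned part would be the DataFrame written to the second topic.

```python
from typing import Iterable, List, Set, Tuple


def split_orphans(
    processed: Iterable[dict], created_ids: Set[int]
) -> Tuple[List[dict], List[dict]]:
    """Split a batch of processed events into (paired, orphaned).

    An event is "orphaned" when its transaction_id has no matching
    "created" transaction yet.
    """
    paired: List[dict] = []
    orphaned: List[dict] = []
    for event in processed:
        if event["transaction_id"] in created_ids:
            paired.append(event)
        else:
            orphaned.append(event)
    return paired, orphaned


if __name__ == "__main__":
    batch = [{"transaction_id": 1}, {"transaction_id": 2}]
    paired, orphaned = split_orphans(batch, created_ids={1})
    print("paired:", paired)
    print("orphaned:", orphaned)
```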

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Bruno Oliveira
I mean... I guess? But I don't really have Airflow here, and I didn't really want to fall back to a "batch"-kinda approach with Airflow. I'd rather use a Dead Letter Queue approach instead (like I mentioned: another topic for the failed ones, which is later consumed and pumps the messages back

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Mich Talebzadeh
Well, this is a matter of using journal entries. Those "orphaned" transactions that you cannot pair through transaction_id can be written to a journal table in your Postgres DB. Then you can pair them with the entries in the relevant Postgres table. If the essence is not

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Bruno Oliveira
That is exactly the case, Sebastian! - In practise, that "created" means "*authorized*", but I still cannot deduct anything from the customer balance - the "processed" means I can safely deduct the transaction_amount from the customer balance, - and the "refunded" means I must give the

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Sebastian Piu
So in payment systems you have something similar, I think. You have an authorisation, then the actual transaction, and maybe a refund some time in the future. You want to proceed with a transaction only if you've seen the auth, but in an eventually consistent system this might not always happen. You

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Bruno Oliveira
I'm terribly sorry, Mich. That was my mistake. The timestamps are not the same (I copied them without realizing that; I'm really sorry for the confusion). Please assume NONE of the following transactions are in the database yet *transactions-created:* { "transaction_id": 1, "amount": 1000, "timestamp":

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Mich Talebzadeh
One second. The topic called transactions_processed is streaming through Spark. Transactions-created are created at the same time (the same timestamp), but you have NOT received them and they don't yet exist in your DB, because presumably your relational database is too slow to ingest them? You do

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Bruno Oliveira
Thanks for the quick reply! I'm not sure I got the idea correctly... but from what I'm understanding, wouldn't that actually end the same way? Because this is the current scenario: *transactions-processed:* { "transaction_id": 1, "timestamp": "2020-04-04 11:01:00" } { "transaction_id": 2,

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Mich Talebzadeh
Thanks for the details. Can you read these in the same app? For example (this is PySpark, but it serves the purpose): read topic "newtopic" in one micro-batch and the other topic "md" in another micro-batch. try: # process topic --> newtopic streamingNewtopic =

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Bruno Oliveira
Hello! Sure thing! I'm reading them *separately*; both are apps written with Scala + Spark Structured Streaming. I feel like I missed some details in my original thread (sorry, it was past 4 AM and it was getting frustrating). Please let me try to clarify some points: *Transactions Created

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Mich Talebzadeh
Can you please clarify if you are reading these two topics separately or within the same scala or python script in Spark Structured Streaming? HTH