Thank you so much for this information, Sean. One more question: when Spark re-runs the failed partition, where does it run? On the same node or on some other node?
On Fri, 21 Jan 2022, 23:41 Sean Owen, <sro...@gmail.com> wrote:

> The Spark program already knows the partitions of the data and where they
> exist; that's just defined by the data layout. It doesn't care what data is
> inside. It knows partition 1 needs to be processed and, if the task
> processing it fails, it needs to be run again. I'm not sure where you're
> seeing data loss here? The data is already stored to begin with, not
> somehow consumed and deleted.
>
> On Fri, Jan 21, 2022 at 12:07 PM Siddhesh Kalgaonkar <kalgaonkarsiddh...@gmail.com> wrote:
>
>> Okay, so suppose I have 10 records distributed across 5 nodes, and the
>> partition on the first node holding 2 records failed. I understand that it
>> will re-process this partition, but how will it come to know that XYZ
>> partition was holding XYZ data, so that it will pick only those records
>> again and reprocess them? In the case of a partition failure, is there
>> data loss? Or is it stored somewhere?
>>
>> Maybe my question is very naive, but I am trying to understand it in a
>> better way.
>>
>> On Fri, Jan 21, 2022 at 11:32 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> In that case, the file exists in parts across machines. No, tasks won't
>>> re-read the whole file; no task does or can do that. Failed partitions are
>>> reprocessed, but as in the first pass, the same partition is processed.
>>>
>>> On Fri, Jan 21, 2022 at 12:00 PM Siddhesh Kalgaonkar <kalgaonkarsiddh...@gmail.com> wrote:
>>>
>>>> Hello team,
>>>>
>>>> I am aware that, in case of memory issues, when a task fails it will
>>>> try to restart 4 times (the default number), and if it still fails, it
>>>> will cause the entire job to fail.
>>>>
>>>> But suppose I am reading a file that is distributed across nodes in
>>>> partitions. What will happen if a partition that holds some data fails?
>>>> Will it re-read the entire file and get that specific subset of data,
>>>> since the driver has the complete information? Or will it copy the data
>>>> to the other working nodes or tasks and try to run it?
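For reference, the "4 retries" mentioned in the thread is controlled by `spark.task.maxFailures`, and a retried task is scheduled like any other task: the scheduler prefers executors with good data locality, so the retry may land on the same node or a different one. A minimal configuration sketch (a hedged example, not from the thread; `spark.excludeOnFailure.enabled` is the Spark 3.1+ name for the older `spark.blacklist.*` settings, so check the docs for your version):

```
# spark-defaults.conf (sketch)

# Number of attempts per task before the stage, and hence the job, fails.
spark.task.maxFailures            4

# Optionally exclude executors/nodes after repeated task failures, which
# forces the retry onto a different node instead of the one that failed.
spark.excludeOnFailure.enabled    true
```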