Probably, because Spark prefers locality, but not necessarily.
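For reference, a minimal sketch of the two scheduler settings behind this behavior; both config keys are standard Spark settings, and the values shown are just the documented defaults, not recommendations:

    import org.apache.spark.sql.SparkSession

    // Sketch only: both keys are standard Spark configs; the values shown
    // are the documented defaults.
    val spark = SparkSession.builder()
      .appName("retry-locality-sketch")
      // A task is retried up to this many times before the stage,
      // and with it the job, fails.
      .config("spark.task.maxFailures", "4")
      // How long the scheduler waits for a data-local slot (e.g. the node
      // holding the partition's data) before falling back to a less local,
      // possibly different, node.
      .config("spark.locality.wait", "3s")
      .getOrCreate()

So a retried task usually lands on the node that holds the data, but after the locality wait expires it can run anywhere with a free slot.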
On Fri, Jan 21, 2022 at 2:10 PM Siddhesh Kalgaonkar <kalgaonkarsiddh...@gmail.com> wrote:

> Thank you so much for this information, Sean. One more question: when it
> wants to re-run the failed partition, where does it run? On the same node
> or some other node?
>
> On Fri, 21 Jan 2022, 23:41 Sean Owen <sro...@gmail.com> wrote:
>
>> The Spark program already knows the partitions of the data and where they
>> exist; that's just defined by the data layout. It doesn't care what data
>> is inside. It knows partition 1 needs to be processed and, if the task
>> processing it fails, needs to be run again. I'm not sure where you're
>> seeing data loss here: the data is already stored to begin with, not
>> somehow consumed and deleted.
>>
>> On Fri, Jan 21, 2022 at 12:07 PM Siddhesh Kalgaonkar <kalgaonkarsiddh...@gmail.com> wrote:
>>
>>> Okay, so suppose I have 10 records distributed across 5 nodes, and the
>>> partition on the first node, holding 2 records, failed. I understand that
>>> it will re-process this partition, but how will it come to know that XYZ
>>> partition was holding XYZ data, so that it will pick up only those records
>>> again and reprocess them? In case of failure of a partition, is there
>>> data loss? Or is it stored somewhere?
>>>
>>> Maybe my question is very naive, but I am trying to understand it in a
>>> better way.
>>>
>>> On Fri, Jan 21, 2022 at 11:32 PM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> In that case, the file exists in parts across machines. No, tasks won't
>>>> re-read the whole file; no task does or can do that. Failed partitions
>>>> are reprocessed, but, as in the first pass, the same partition is
>>>> processed.
>>>>
>>>> On Fri, Jan 21, 2022 at 12:00 PM Siddhesh Kalgaonkar <kalgaonkarsiddh...@gmail.com> wrote:
>>>>
>>>>> Hello team,
>>>>>
>>>>> I am aware that in case of memory issues, when a task fails it will
>>>>> retry 4 times (the default number), and if it still fails it will cause
>>>>> the entire job to fail.
>>>>>
>>>>> But suppose I am reading a file that is distributed across nodes in
>>>>> partitions. What will happen if a partition that holds some data fails?
>>>>> Will it re-read the entire file and get that specific subset of data,
>>>>> since the driver has the complete information? Or will it copy the data
>>>>> to the other working nodes or tasks and try to run it?
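A minimal Scala sketch of the scenario discussed above, assuming a small text file read into 5 partitions (the path and record counts are hypothetical): each partition corresponds to a fixed split of the input, known to the driver up front, so a retried task re-reads only its own split and no data is lost:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partition-retry-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: a small text file, requested as 5 partitions.
    // Each partition is just a byte range (split) of the file; Spark knows
    // the splits regardless of what data is inside them.
    val lines = sc.textFile("hdfs:///tmp/records.txt", minPartitions = 5)

    // Each task sees only its own partition index and split. If the task
    // for index 0 fails, Spark reschedules a task for index 0, which
    // re-reads exactly that split from storage; nothing else is re-read.
    val perPartitionCounts = lines
      .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
      .collect()

    perPartitionCounts.foreach { case (idx, n) =>
      println(s"partition $idx holds $n records")
    }

With 10 records spread over 5 splits, this would print roughly 2 records per partition; the mapping from partition index to split is what lets a retry pick up exactly the same records.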