Hi Selvam,

Is your 35 GB of Parquet data split across multiple S3 objects, or is it just one big Parquet file?

If it's just one big file then I believe only one executor will be able to work on it until some job action partitions the data into smaller chunks.
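As for your note about partitions: the same check works after reading from S3, since the DataFrame still has an underlying RDD. A minimal sketch from the spark-shell (assuming a SparkSession named spark; the path s3a://my-bucket/data/ is just a placeholder, not your real bucket):

  // Read the Parquet data from S3.
  val df = spark.read.parquet("s3a://my-bucket/data/")

  // Each partition becomes one task, so this count shows how much
  // parallelism the read will get across your executors.
  println(s"Partitions: ${df.rdd.getNumPartitions}")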
On 11 October 2016 at 06:03, Selvam Raman <sel...@gmail.com> wrote:

> I mentioned parquet as input format.
>
> On Oct 10, 2016 11:06 PM, "ayan guha" <guha.a...@gmail.com> wrote:
>
>> It really depends on the input format used.
>>
>> On 11 Oct 2016 08:46, "Selvam Raman" <sel...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> How spark reads data from s3 and runs parallel task.
>>>
>>> Assume I have a s3 bucket size of 35 GB (parquet file).
>>>
>>> How the sparksession will read the data and process the data parallel.
>>> How it splits the s3 data and assign to each executor task.
>>>
>>> Please share me your points.
>>>
>>> Note:
>>> if we have RDD, then we can look at the partitions.size or length to
>>> check how many partitions for a file. But how this will be accomplished in
>>> terms of S3 bucket.
>>>
>>> --
>>> Selvam Raman
>>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"