Hi,

We switched from 0.8.1 to 0.9.0 on Monday, and we're facing a problem whose cause is not obvious.

Basically, we generate a random dense matrix (2M rows * 40 columns), split it in 20, collect() it and then broadcast it. The generation is fine, but then the workers send the blocks and nothing happens. Spark stays locked in this state forever.

Here is what happens on the driver:

14/02/19 16:21:49 INFO TaskSetManager: Serialized task 12.0:13 as 2083 bytes in 0 ms
14/02/19 16:21:49 INFO TaskSetManager: Starting task 12.0:14 as TID 2294 on executor 1: t4.exensa.loc (PROCESS_LOCAL)
14/02/19 16:21:49 INFO TaskSetManager: Serialized task 12.0:14 as 2083 bytes in 0 ms
14/02/19 16:21:49 INFO TaskSetManager: Starting task 12.0:15 as TID 2295 on executor 4: t3.exensa.loc (PROCESS_LOCAL)
14/02/19 16:21:49 INFO TaskSetManager: Serialized task 12.0:15 as 2083 bytes in 1 ms
14/02/19 16:21:49 INFO TaskSetManager: Starting task 12.0:16 as TID 2296 on executor 0: t0.exensa.loc (PROCESS_LOCAL)
14/02/19 16:21:49 INFO TaskSetManager: Serialized task 12.0:16 as 2083 bytes in 2 ms
14/02/19 16:21:49 INFO TaskSetManager: Starting task 12.0:17 as TID 2297 on executor 3: t1.exensa.loc (PROCESS_LOCAL)
14/02/19 16:21:49 INFO TaskSetManager: Serialized task 12.0:17 as 2083 bytes in 1 ms
14/02/19 16:21:49 INFO TaskSetManager: Starting task 12.0:18 as TID 2298 on executor 2: t5.exensa.loc (PROCESS_LOCAL)
14/02/19 16:21:49 INFO TaskSetManager: Serialized task 12.0:18 as 2083 bytes in 1 ms
14/02/19 16:21:49 INFO TaskSetManager: Starting task 12.0:19 as TID 2299 on executor 5: t6.exensa.loc (PROCESS_LOCAL)
14/02/19 16:21:49 INFO TaskSetManager: Serialized task 12.0:19 as 2083 bytes in 1 ms

And on an executor:

14/02/19 15:21:53 INFO Executor: Serialized size of result for 2287 is 17229427
14/02/19 15:21:53 INFO Executor: Sending result for 2287 directly to driver
14/02/19 15:21:53 INFO Executor: Serialized size of result for 2299 is 17229262
14/02/19 15:21:53 INFO Executor: Sending result for 2299 directly to driver
14/02/19 15:21:53 INFO Executor: Finished task ID 2299
14/02/19 15:21:53 INFO Executor: Finished task ID 2287
14/02/19 15:21:53 INFO Executor: Serialized size of result for 2281 is 17229426
14/02/19 15:21:53 INFO Executor: Sending result for 2281 directly to driver
14/02/19 15:21:53 INFO Executor: Finished task ID 2281

And... that's all. The driver never receives the information that the tasks are finished.

I have akka.frameSize=512 and a Kryo buffer of 512 MB. DEBUG level does not add anything (at least on the executor side; I didn't try it on the driver).

The RDD being collected is made of (Int, (Array[Int], FloatMatrix)).

Any help would be greatly appreciated.

Thanks,
Guillaume