Yes, I checked the Spark UI to follow what’s going on. The job starts a few 
tasks fine (8 tasks in my case) out of ~70k tasks, and then stalls.

I was actually able to get things to work by disabling dynamic allocation: 
setting the number of executors manually disables dynamic allocation, and that 
seems to fix the problem.
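
In case it’s useful, this is roughly what I mean on the spark-submit side (just 
a sketch; the executor count, cores, and memory are placeholder values, and 
my_job.py is a stand-in for the actual script):

  spark-submit \
    --master yarn \
    --conf spark.dynamicAllocation.enabled=false \
    --num-executors 200 \
    --executor-cores 4 \
    --executor-memory 20g \
    my_job.py

As I understand it, explicitly setting --num-executors (spark.executor.instances) 
is what turns dynamic allocation off.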

My guess is that when faced with too many backlogged tasks, the dynamic 
allocator has trouble launching executors, or something along those lines. 
I’m not sure whether this is a bug, but maybe someone familiar with the 
internals of dynamic allocation can tell whether it’s worth filing.
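
For anyone who would rather keep dynamic allocation on, I imagine the knobs to 
look at are the standard ones below (again just a sketch; the values are 
placeholders and I haven’t verified that tuning them avoids the stall):

  spark.dynamicAllocation.enabled=true
  spark.dynamicAllocation.minExecutors=10
  spark.dynamicAllocation.maxExecutors=200
  spark.dynamicAllocation.schedulerBacklogTimeout=1s
  spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=1s
  spark.shuffle.service.enabled=true

The external shuffle service (spark.shuffle.service.enabled) is required for 
dynamic allocation on YARN.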

I’m using YARN as the resource manager.

Khaled 

> On Jun 13, 2016, at 6:24 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Have you looked at the Spark GUI to see what it is waiting for? Is it available 
> memory? What resource manager are you using?
> 
> Dr Mich Talebzadeh
> 
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 
> http://talebzadehmich.wordpress.com
> 
> 
> On 13 June 2016 at 20:45, Khaled Hammouda <khaled.hammo...@kik.com> wrote:
> Hi Michael,
> 
> Thanks for the suggestion to use Spark 2.0 preview. I just downloaded the 
> preview and tried using it, but I’m running into the exact same issue.
> 
> Khaled
> 
>> On Jun 13, 2016, at 2:58 PM, Michael Armbrust <mich...@databricks.com> wrote:
>> 
>> You might try with the Spark 2.0 preview.  We spent a bunch of time 
>> improving the handling of many small files.
>> 
>> On Mon, Jun 13, 2016 at 11:19 AM, khaled.hammouda <khaled.hammo...@kik.com> wrote:
>> I'm trying to use Spark SQL to load JSON data that is split across about 70k
>> files in 24 directories in HDFS, using
>> sqlContext.read.json("hdfs:///user/hadoop/data/*/*").
>> 
>> This doesn't seem to work for some reason, I get timeout errors like the
>> following:
>> 
>> -------
>> 16/06/13 15:46:31 ERROR TransportChannelHandler: Connection to
>> ip-172-31-31-114.ec2.internal/172.31.31.114:46028 has been quiet for 120000
>> ms while there are outstanding requests. Assuming connection is dead; please
>> adjust spark.network.timeout if this is wrong.
>> 16/06/13 15:46:31 ERROR TransportResponseHandler: Still have 1 requests
>> outstanding when connection from
>> ip-172-31-31-114.ec2.internal/172.31.31.114:46028 is closed
>> ...
>> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
>> seconds]. This timeout is controlled by spark.rpc.askTimeout
>> ...
>> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
>> [120 seconds]
>> ------
>> 
>> I don't want to start tinkering with increasing timeouts yet. I tried to
>> load just one sub-directory, which contains around 4k files, and this seems
>> to work fine. So I thought of writing a loop where I load the json files
>> from each sub-dir and then unionAll the current dataframe with the previous
>> dataframe. However, this also fails because apparently the json files don't
>> have the exact same schema, causing this error:
>> 
>> ---
>> Traceback (most recent call last):
>>   File "/home/hadoop/load_json.py", line 65, in <module>
>>     df = df.unionAll(hrdf)
>>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py",
>> line 998, in unionAll
>>   File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
>> line 813, in __call__
>>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line
>> 51, in deco
>> pyspark.sql.utils.AnalysisException: u"unresolved operator 'Union;"
>> ---
>> 
>> I'd like to know what's preventing Spark from loading 70k files the same way
>> it loads 4k files.
>> 
>> To give you some idea about my setup and data:
>> - ~70k files across 24 directories in HDFS
>> - Each directory contains 3k files on average
>> - Cluster: 200-node EMR cluster; each node has 53 GB memory and 8 cores
>> available to YARN
>> - Spark 1.6.1
>> 
>> Thanks.
>> 
>> 
>> 
>> 
>> 
>> 
> 
> 
