Re: spark times out, maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Konstantinos Kougios
Thanks, I did that, and now I am getting an out-of-memory error. But I am not sure where it occurs. It can't be in the Spark executor, as I have 28GB allocated to it. It is not the driver, because I run that locally and monitor it via jvisualvm. Unfortunately I can't monitor Hadoop via JMX. From the

spark times out, maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Kostas Kougios
I am reading millions of XML files via val xmls = sc.binaryFiles(xmlDir). The operation runs fine locally, but on YARN it fails with: client token: N/A diagnostics: Application application_1433491939773_0012 failed 2 times due to ApplicationMaster for attempt

RE: spark times out, maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Ewan Leith
Try putting a * on the end of xmlDir, i.e. xmlDir = hdfs:///abc/def/* rather than xmlDir = hdfs:///abc/def, and see what happens. I don't know why, but that appears to be more reliable for me with S3 as the filesystem. I'm also using binaryFiles, but I've tried running the same command while
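A minimal sketch of the globbed call, as suggested above, for spark-shell (assumes a live SparkContext `sc`; the path is illustrative):

```scala
// Glob the directory instead of passing the bare path; binaryFiles
// returns an RDD of (path, PortableDataStream) pairs, streaming each
// file's bytes lazily rather than loading them eagerly.
val xmlDir = "hdfs:///abc/def/*" // note the trailing *
val xmls = sc.binaryFiles(xmlDir) // RDD[(String, PortableDataStream)]
```

binaryFiles also accepts an optional minPartitions argument, which can help if the default partitioning is too coarse for millions of small files.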

Re: spark times out, maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Konstantinos Kougios
No luck, I am afraid. After giving the namenode 16GB of RAM, I am still getting an out-of-memory exception, though a somewhat different one: 15/06/08 15:35:52 ERROR yarn.ApplicationMaster: User class threw exception: GC overhead limit exceeded java.lang.OutOfMemoryError: GC overhead limit exceeded at

RE: spark times out, maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Ewan Leith
Can you do a simple sc.binaryFiles("hdfs:///path/to/files/*").count() in the spark-shell and verify that part works? Ewan -Original Message- From: Konstantinos Kougios [mailto:kostas.koug...@googlemail.com] Sent: 08 June 2015 15:40 To: Ewan Leith; user@spark.apache.org Subject: Re:
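The suggested sanity check, sketched for spark-shell (assumes a live `sc`; the path is illustrative):

```scala
// Count the matched files before doing any real processing; if this
// alone fails, the problem is in listing/reading the files, not in
// the application's own XML handling.
val n = sc.binaryFiles("hdfs:///path/to/files/*").count()
println(s"matched $n files")
```

Isolating the file-listing step like this narrows the failure down before any per-file logic runs.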

Re: spark times out, maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Konstantinos Kougios
It was giving the same error, which made me realize it is the driver, but the driver running on Hadoop, not the local one. So I did --conf spark.driver.memory=8g and now it is processing the files! Cheers On 08/06/15 15:52, Ewan Leith wrote: Can you do a simple
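A hedged sketch of the fix described above (the application jar and path are illustrative): in yarn-cluster mode the driver runs inside the ApplicationMaster on the cluster, so the local JVM monitored with jvisualvm is not the driver at all, and its heap has to be raised via configuration:

```shell
# Raise the cluster-side driver heap; in cluster deploy mode the
# driver (holding the metadata for ~1M listed files) lives in the AM.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.driver.memory=8g \
  your-app.jar hdfs:///abc/def/*
```

The same setting can also be passed as --driver-memory 8g on the spark-submit command line.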