Re: join function in a loop
I am sorry, we cannot divide the data set and process it separately. Does that mean I am overusing Spark for my data size, since it spends so long shuffling the data?

On Sun, May 29, 2016 at 8:53 AM, Ted Yu wrote:
> Heri:
> Is it possible to partition your data set so that the number of rows
> involved in the join is under control?
>
> Cheers
Re: join function in a loop
Heri:
Is it possible to partition your data set so that the number of rows involved in the join is under control?

Cheers

On Sat, May 28, 2016 at 5:25 PM, Mich Talebzadeh wrote:
> You are welcome.
>
> Also, you can use the OS command /usr/bin/free to see how much free
> memory you have on each node.
>
> You should also check the Spark GUI (the first job on master node:4040,
> the next on 4041, etc.) for the resource and storage (memory usage) of
> each SparkSubmit job.
>
> HTH
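Ted's suggestion can be sketched in a few lines of pure Python (this is not Spark API code; the bucket count and sample rows are made up for illustration). The idea is to split the join key space into buckets and join one bucket at a time, so no single join has to hold all matching rows at once. In Spark, the analogous move would be filtering both sides on a key range per iteration.

```python
def bucketed_join(left, right, num_buckets=4):
    """Join two lists of (key, value) pairs one key bucket at a time."""
    results = []
    for b in range(num_buckets):
        # Build a lookup table for only the left rows falling in this bucket
        l_chunk = {k: v for k, v in left if hash(k) % num_buckets == b}
        for k, v in right:
            if hash(k) % num_buckets == b and k in l_chunk:
                results.append((k, l_chunk[k], v))
    return results

# Keys 0..9 on the left, 5..14 on the right: only 5..9 should join.
left = [(i, "L%d" % i) for i in range(10)]
right = [(i, "R%d" % i) for i in range(5, 15)]
print(sorted(bucketed_join(left, right)))
```

Each iteration only materializes one bucket's worth of the left side, which is the property that keeps per-join memory bounded.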
Re: join function in a loop
You are welcome.

Also, you can use the OS command /usr/bin/free to see how much free memory you have on each node.

You should also check the Spark GUI (the first job on master node:4040, the next on 4041, etc.) for the resource and storage (memory usage) of each SparkSubmit job.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 29 May 2016 at 01:16, heri wijayanto wrote:
> Thank you, Dr Mich Talebzadeh. I will capture the error messages, but
> currently my cluster is running another job. After it finishes, I will
> try your suggestions.
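For scripting the memory check rather than eyeballing /usr/bin/free by hand, the same numbers can be read from /proc/meminfo on each Linux node. Below is a small sketch of a parser for that format (the sample figures are invented; field names are the standard Linux ones):

```python
def parse_meminfo(text):
    """Return a dict of field name -> value in kB from /proc/meminfo-style text."""
    mem = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            mem[name.strip()] = int(parts[0])  # first token is the kB value
    return mem

# In practice you would read open("/proc/meminfo").read() on each node;
# a made-up sample is used here so the snippet is self-contained.
sample = "MemTotal:       16332948 kB\nMemFree:         2050224 kB\n"
mem = parse_meminfo(sample)
print(mem["MemFree"] // 1024, "MB free")
```

Running this over every node (e.g. via ssh) gives a quick picture of where the cluster is short on memory.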
Re: join function in a loop
Thank you, Dr Mich Talebzadeh. I will capture the error messages, but currently my cluster is running another job. After it finishes, I will try your suggestions.

On Sun, May 29, 2016 at 7:55 AM, Mich Talebzadeh wrote:
> You should have errors in the yarn-nodemanager and yarn-resourcemanager
> logs.
>
> It appears that you are running out of memory. Have you also checked
> with jps and jmonitor the SparkSubmit (driver) process for the failing
> job?
Re: join function in a loop
You should have errors in the yarn-nodemanager and yarn-resourcemanager logs.

Something like the line below indicates a healthy container:

2016-05-29 00:50:50,496 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 29769 for container-id container_1464210869844_0061_01_01: 372.6 MB of 4 GB physical memory used; 2.7 GB of 8.4 GB virtual memory used

It appears that you are running out of memory. Have you also checked with jps and jmonitor the SparkSubmit (driver) process for the failing job? It will show you the resource usage, e.g. memory/heap/CPU.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 29 May 2016 at 00:26, heri wijayanto wrote:
> I use Spark with the join function to process around 250 million rows of
> text. When I used only several hundred rows it could run, but with the
> large data set it fails.
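When there are many containers, grepping those ContainersMonitorImpl lines by hand gets tedious. As a sketch (the regex below is written against the sample line above, not against any official log format guarantee), the memory figures can be pulled out with a few lines of Python:

```python
import re

# Extract container id and physical-memory figures from a
# ContainersMonitorImpl log line like the sample quoted above.
LOG_RE = re.compile(
    r"container-id (?P<cid>\S+): "
    r"(?P<used>[\d.]+) (?P<used_unit>\w+) of (?P<limit>[\d.]+) (?P<limit_unit>\w+) "
    r"physical memory used"
)

line = ("Memory usage of ProcessTree 29769 for container-id "
        "container_1464210869844_0061_01_01: 372.6 MB of 4 GB physical "
        "memory used; 2.7 GB of 8.4 GB virtual memory used")

m = LOG_RE.search(line)
print(m.group("cid"), m.group("used"), m.group("used_unit"),
      "of", m.group("limit"), m.group("limit_unit"))
```

Run over a whole nodemanager log, this makes it easy to spot containers creeping toward their physical-memory limit before YARN kills them.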
Re: join function in a loop
I use Spark with the join function to process around 250 million rows of text.

When I used only several hundred rows it could run, but when I use the large data set it fails.

My Spark version is 1.6.1, running in yarn-cluster mode, and we have 5 node computers.

Thank you very much, Ted Yu

On Sun, May 29, 2016 at 6:48 AM, Ted Yu wrote:
> Can you let us know your use case?
>
> When the join failed, what was the error (consider pastebin)?
>
> Which release of Spark are you using?
>
> Thanks
Re: join function in a loop
Can you let us know your use case?

When the join failed, what was the error (consider pastebin)?

Which release of Spark are you using?

Thanks

> On May 28, 2016, at 3:27 PM, heri wijayanto wrote:
>
> Hi everyone,
> I perform a join function in a loop, and it fails. I found a tutorial on
> the web that says I should use a broadcast variable, but it is not a good
> choice when doing it in a loop.
> I need your suggestions to address this problem, thank you very much.
> And I am sorry, I am a beginner in Spark programming.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
join function in a loop
Hi everyone,
I perform a join function in a loop, and it fails. I found a tutorial on the web that says I should use a broadcast variable, but it is not a good choice when doing it in a loop.
I need your suggestions to address this problem, thank you very much.
And I am sorry, I am a beginner in Spark programming.
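For readers following the thread, the broadcast idea mentioned above amounts to a map-side hash join: hold the small table in memory as a lookup table and stream the big table past it once, with no shuffle. A pure-Python analogue (names and data made up; in Spark 1.6 the dict would be wrapped in sc.broadcast(...) and consulted inside a map(), and when looping, the previous iteration's broadcast should be unpersisted to free executor memory):

```python
# The "broadcast" side: small enough to fit in memory on every worker.
small = {"id1": "alice", "id2": "bob"}

# The streamed side: scanned once, row by row.
big = [("id1", 10), ("id2", 20), ("id3", 30)]

# Inner join: rows whose key is missing from the small side are dropped.
joined = [(k, small[k], v) for k, v in big if k in small]
print(joined)
```

This only works when one side of the join is genuinely small; with two large sides, as in the 250-million-row case above, partitioning the key space (as Ted suggests) is the more realistic route.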