Re: join function in a loop

2016-05-28 Thread heri wijayanto
I am sorry, we cannot divide the data set and process it separately. Does
it mean that Spark is overkill for my data size, since it takes such a long
time to shuffle the data?



Re: join function in a loop

2016-05-28 Thread Ted Yu
Heri:
Is it possible to partition your data set so that the number of rows
involved in the join stays under control?

Cheers
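
A minimal sketch of that idea against the Spark 1.6 RDD API (Scala). The
paths, tab-separated layout, and chunk count are hypothetical; the point is
that equal keys hash to the same bucket, so joining bucket by bucket gives
the same result as one large join while keeping each shuffle small:

import org.apache.spark.{SparkConf, SparkContext}

object ChunkedJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("chunked-join"))

    // Hypothetical inputs, keyed by the first tab-separated field.
    val left  = sc.textFile("hdfs:///data/left").map(l => (l.split('\t')(0), l))
    val right = sc.textFile("hdfs:///data/right").map(l => (l.split('\t')(0), l))

    val numChunks = 10
    def bucket(key: String): Int =
      ((key.hashCode % numChunks) + numChunks) % numChunks

    // Join one key bucket at a time; matching keys always fall in the
    // same bucket, so the union of the per-bucket joins is the full join.
    val parts = (0 until numChunks).map { i =>
      left.filter { case (k, _) => bucket(k) == i }
          .join(right.filter { case (k, _) => bucket(k) == i })
    }

    parts.reduce(_ union _).saveAsTextFile("hdfs:///data/joined")
  }
}

Each pass shuffles only about a tenth of the keys, at the cost of scanning
the inputs once per chunk; persisting both input RDDs first avoids
re-reading them from HDFS on every pass.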



Re: join function in a loop

2016-05-28 Thread Mich Talebzadeh
You are welcome

You can also use the OS command /usr/bin/free to see how much free memory
you have on each node.

You should also check the Spark GUI (the first job on the master node at
port 4040, the next on 4041, etc.) for the resource and storage (memory
usage) figures of each SparkSubmit job.
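
The same numbers can also be pulled from inside the driver. A small sketch,
assuming a live SparkContext named sc (Spark 1.6 Scala API, e.g. in the
spark-shell); getExecutorMemoryStatus reports, per executor, the maximum
and remaining memory available for caching:

// Print per-executor cache memory: maximum available and still free.
sc.getExecutorMemoryStatus.foreach { case (executor, (maxMem, remaining)) =>
  println(f"$executor%-30s max: ${maxMem >> 20}%5d MB  free: ${remaining >> 20}%5d MB")
}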

HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com





Re: join function in a loop

2016-05-28 Thread heri wijayanto
Thank you, Dr Mich Talebzadeh. I will capture the error messages, but my
cluster is currently running another job. After it finishes, I will try
your suggestions.



Re: join function in a loop

2016-05-28 Thread Mich Talebzadeh
You should see errors in the yarn-nodemanager and yarn-resourcemanager logs.

A healthy container logs something like this:

2016-05-29 00:50:50,496 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTree 29769 for container-id
container_1464210869844_0061_01_01: 372.6 MB of 4 GB physical memory
used; 2.7 GB of 8.4 GB virtual memory used

It appears that you are running out of memory. Have you also checked the
SparkSubmit (driver) process of the failing job with jps and jmonitor?
That will show you its resource usage: memory, heap, CPU, etc.
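
If the logs confirm memory pressure, two common mitigations on the Spark
side are more partitions (smaller per-task state) and serialized,
disk-spillable caching. A sketch with hypothetical paths and partition
count, assuming a live SparkContext sc (Spark 1.6, Scala):

import org.apache.spark.storage.StorageLevel

// Hypothetical inputs, keyed by the first tab-separated field.
val left = sc.textFile("hdfs:///data/left")
  .map(l => (l.split('\t')(0), l))
  .repartition(400)                          // smaller per-task state
  .persist(StorageLevel.MEMORY_AND_DISK_SER) // spill to disk instead of OOM
val right = sc.textFile("hdfs:///data/right")
  .map(l => (l.split('\t')(0), l))

// An explicit partition count keeps individual shuffle blocks small.
val joined = left.join(right, 400)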

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com





Re: join function in a loop

2016-05-28 Thread heri wijayanto
I am using Spark's join function to process around 250 million rows of text.

With just several hundred rows it runs fine, but with the full data set it
fails.

My Spark version is 1.6.1, running in yarn-cluster mode, and we have 5 nodes.

Thank you very much, Ted Yu



Re: join function in a loop

2016-05-28 Thread Ted Yu
Can you let us know your use case?

When the join failed, what was the error (consider pastebin)?

Which release of Spark are you using?

Thanks




join function in a loop

2016-05-28 Thread heri wijayanto
Hi everyone,
I perform a join inside a loop, and it fails. I found a tutorial on the web
which says I should use a broadcast variable, but that it is not a good
choice inside a loop.
I need your suggestion to address this problem, thank you very much.
And I am sorry, I am a beginner in Spark programming.
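
For the archive, here is one shape such a loop can take when the smaller
side of each join fits in memory. Everything below (paths, field layout,
iteration count) is a hypothetical sketch against the Spark 1.6 RDD API,
not the poster's code: broadcasting the small side makes each join a
map-side lookup with no shuffle, and releasing the broadcast and the
previous RDD each iteration keeps stale copies from accumulating.

import org.apache.spark.{SparkConf, SparkContext}

object JoinInLoop {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("join-in-loop"))

    var current = sc.textFile("hdfs:///data/big")
                    .map(l => (l.split('\t')(0), l))
    current.persist()

    for (step <- 1 to 5) {
      // Collect the small side for this iteration and broadcast it.
      val small = sc.textFile(s"hdfs:///data/lookup-$step")
                    .map(l => (l.split('\t')(0), l))
                    .collectAsMap()
      val bc = sc.broadcast(small)

      // Map-side (broadcast) join: the big side is never shuffled.
      val next = current.flatMap { case (k, v) =>
        bc.value.get(k).map(w => (k, v + "\t" + w))
      }
      next.persist()
      next.count()                    // materialize before cleanup

      current.unpersist(blocking = false)
      bc.unpersist(blocking = false)  // drop executor copies of the broadcast
      current = next
    }

    current.saveAsTextFile("hdfs:///data/result")
  }
}

If neither side fits in memory, chunking the join by key bucket, as Ted Yu
suggests above, avoids broadcasting altogether.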