Hi,

I also added swap space and the algorithm ran a bit longer, but after some
time I ran into the same problem again. I am on holiday and will dig into the
issue in the next few days. I will keep you guys updated.
Regards,
Behroz

On Fri, Sep 11, 2015 at 6:36 AM, Edward J. Yoon <[email protected]> wrote:
> Hi, I think you need to add some swap space. Did you figure out what the
> problem is?
>
> On Fri, Sep 4, 2015 at 8:20 AM, Behroz Sikander <[email protected]> wrote:
> > More info on this:
> > I noticed that only 2 machines were failing with OutOfMemory. After
> > messing around, I found out that the swap space was 0 on these 2
> > machines, but the others had 1 GB of swap. I added swap to these
> > machines and it worked. But, as expected, the next run of the algorithm
> > with more data crashed again. This time the GroomChildProcess crashed
> > with the following log message:
> >
> > OpenJDK 64-Bit Server VM warning: INFO:
> > os::commit_memory(0x00000007fa100000, 42467328, 0) failed; error='Cannot
> > allocate memory' (errno=12)
> > #
> > # There is insufficient memory for the Java Runtime Environment to continue.
> > # Native memory allocation (malloc) failed to allocate 42467328 bytes for committing reserved memory.
> > # An error report file with more information is saved as:
> > # /home/behroz/Documents/Packages/tmp_data/hama_tmp/bsp/local/groomServer/attempt_201509040050_0004_000006_0/work/hs_err_pid28850.log
> >
> > My slave machines have 8 GB of RAM, 4 CPUs, a 20 GB hard drive and 1 GB
> > of swap. I run 3 groom child processes, each taking 2 GB of RAM. Apart
> > from the GroomChildProcess, I have GroomServer, DataNode and TaskManager
> > running on the slave. After assigning 2 GB of RAM to each of the 3 child
> > groom processes (6 GB in total), only 2 GB of RAM is left for the
> > others. Do you think this is the problem?
> >
> > Regards,
> > Behroz
> >
> > On Thu, Sep 3, 2015 at 11:39 PM, Behroz Sikander <[email protected]> wrote:
> >
> >> Ok, I found a strange thing. In my hadoop folder, I found a new file
> >> named "hs_err_pid4919.log" inside the $HADOOP_HOME directory.
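The memory budget described above (8 GB of RAM, 3 child tasks with a 2 GB heap each, plus GroomServer, DataNode and TaskManager) can be sanity-checked with simple arithmetic. The figures come from the thread; this is only a sketch, not a measurement of the actual processes:

```java
// Rough sanity check of the memory budget from this thread: 8 GB of RAM per
// slave, 3 BSP child tasks with a 2 GB heap each, plus GroomServer, DataNode
// and TaskManager sharing whatever is left.
public class MemoryBudget {
    // RAM left for the OS and the other daemons after the task heaps.
    static long remainingGb(long totalRamGb, long tasks, long heapPerTaskGb) {
        return totalRamGb - tasks * heapPerTaskGb;
    }

    public static void main(String[] args) {
        long left = remainingGb(8, 3, 2);
        System.out.println(left + " GB left for the OS, GroomServer, DataNode"
                + " and TaskManager");
        // 2 GB shared by the OS, three more JVMs and the tasks' own native
        // (non-heap) memory is very tight, which fits the native
        // "Cannot allocate memory" failure in the hs_err log.
    }
}
```

Note that -Xmx only caps the Java heap; each JVM also needs native memory on top of it, so the real footprint per task is larger than 2 GB.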
> >>
> >> The contents of the file are:
> >>
> >> # Increase physical memory or swap space
> >> # Check if swap backing store is full
> >> # Use 64 bit Java on a 64 bit OS
> >> # Decrease Java heap size (-Xmx/-Xms)
> >> # Decrease number of Java threads
> >> # Decrease Java thread stack sizes (-Xss)
> >> # Set larger code cache with -XX:ReservedCodeCacheSize=
> >> # This output file may be truncated or incomplete.
> >> #
> >> # Out of Memory Error (os_linux.cpp:2809), pid=4919, tid=140564483778304
> >> #
> >> # JRE version: OpenJDK Runtime Environment (7.0_79-b14) (build 1.7.0_79-b14)
> >> # Java VM: OpenJDK 64-Bit Server VM (24.79-b02 mixed mode linux-amd64 compressed oops)
> >> # Derivative: IcedTea 2.5.6
> >> # Distribution: Ubuntu 14.04 LTS, package 7u79-2.5.6-0ubuntu1.14.04.1
> >> # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
> >> #
> >>
> >> --------------- T H R E A D ---------------
> >>
> >> Current thread (0x00007fd7c0438800): JavaThread "PacketResponder:
> >> BP-1786576942-141.40.254.14-1441293753577:blk_1074136820_396012,
> >> type=HAS_DOWNSTREAM_IN_PIPELINE" daemon [_thread_new, id=11943,
> >> stack(0x00007fd7b80fa000,0x00007fd7b81fb000)]
> >>
> >> Stack: [0x00007fd7b80fa000,0x00007fd7b81fb000], sp=0x00007fd7b81f9be0,
> >> free space=1022k
> >> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> >>
> >> I think my DataNode process is crashing. I now know that it is an
> >> out-of-memory error, but I am not sure about the cause.
> >>
> >> On Thu, Sep 3, 2015 at 10:25 PM, Behroz Sikander <[email protected]> wrote:
> >>
> >>> Ok. HA = High Availability?
> >>>
> >>> I am also trying to solve the following problem, but I do not
> >>> understand why I get the exception, because my algorithm does not send
> >>> a lot of data to the master.
> >>>
> >>> 'BSP task process exit with nonzero status of 1'
> >>>
> >>> Each slave node processes some data and sends back a Double array of
> >>> size 96 to the master machine. Recently, I was testing the algorithm
> >>> on 8000 files when it crashed. This means that 8000 double arrays of
> >>> size 96 are sent to the master to process. Once the master receives
> >>> all the data, it gets out of sync and starts the processing again.
> >>> Here is the calculation:
> >>>
> >>> 8000 * 96 * 8 (size of a double) = 6144000 bytes = ~6.1 MB
> >>>
> >>> I am not sure, but this does not seem like a lot of data, and I think
> >>> the message manager that you mentioned should be able to handle it.
> >>>
> >>> Regards,
> >>> Behroz
> >>>
> >>> On Tue, Sep 1, 2015 at 1:07 PM, Edward J. Yoon <[email protected]> wrote:
> >>>
> >>>> I'm reading the GroomServer code and its taskMonitorService. It seems
> >>>> related to cluster HA.
> >>>>
> >>>> On Sat, Aug 29, 2015 at 1:16 PM, Edward J. Yoon <[email protected]> wrote:
> >>>> >> If my Groom Child Process fails for some reason, the processes are
> >>>> >> not killed automatically
> >>>> >
> >>>> > I have also experienced this problem before. I guess that if one of
> >>>> > the processes crashes with OutOfMemory, the other processes wait
> >>>> > for it forever. This is a bug.
> >>>> >
> >>>> > On Sat, Aug 29, 2015 at 1:02 AM, Behroz Sikander <[email protected]> wrote:
> >>>> >> Just another quick question. If my Groom Child Process fails for
> >>>> >> some reason, the processes are not killed automatically. If I run
> >>>> >> the JPS command, I can still see something like
> >>>> >> "3791 GroomServer$BSPPeerChild". Is this the expected behavior?
> >>>> >>
> >>>> >> I am using the latest hama version (0.7.0).
> >>>> >>
> >>>> >> Regards,
> >>>> >> Behroz
> >>>> >>
> >>>> >> On Fri, Aug 28, 2015 at 4:12 PM, Behroz Sikander <[email protected]> wrote:
> >>>> >>
> >>>> >>> Ok, I will try it out.
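The message-volume calculation above can be reproduced directly; the 8000 messages and 96 doubles per message are the figures from the thread, and this counts only the raw payload, not the per-message object and RPC overhead that the message managers add on top:

```java
// Back-of-the-envelope check of the message volume: 8000 tasks each send a
// double[96] to the master, and a double is 8 bytes. Only the raw payload is
// counted here; per-message object and RPC overhead comes on top of this.
public class MessageVolume {
    static long payloadBytes(long messages, long doublesPerMessage) {
        return messages * doublesPerMessage * Double.BYTES; // 8 bytes each
    }

    public static void main(String[] args) {
        long bytes = payloadBytes(8000, 96);
        System.out.println(bytes + " bytes = ~" + bytes / 1_000_000.0 + " MB");
        // prints: 6144000 bytes = ~6.144 MB
    }
}
```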
> >>>> >>>
> >>>> >>> No, actually I am learning a lot by facing these problems. It is
> >>>> >>> actually a good thing :D
> >>>> >>>
> >>>> >>> Regards,
> >>>> >>> Behroz
> >>>> >>>
> >>>> >>> On Fri, Aug 28, 2015 at 5:52 AM, Edward J. Yoon <[email protected]> wrote:
> >>>> >>>
> >>>> >>>> > message managers. Hmmm, I will recheck my logic related to
> >>>> >>>> > messages. Btw
> >>>> >>>>
> >>>> >>>> Serialization (like GraphJobMessage) is a good idea. It stores
> >>>> >>>> multiple messages in serialized form in a single object to
> >>>> >>>> reduce the memory usage and RPC overhead.
> >>>> >>>>
> >>>> >>>> > what is the limit of these message managers? How much data at
> >>>> >>>> > a single time can they handle?
> >>>> >>>>
> >>>> >>>> It depends on memory.
> >>>> >>>>
> >>>> >>>> > P.S. Each day, as I am moving towards a big cluster, I am
> >>>> >>>> > running into problems (a lot of them :D).
> >>>> >>>>
> >>>> >>>> Haha, sorry for the inconvenience, and thanks for your reports.
> >>>> >>>>
> >>>> >>>> On Fri, Aug 28, 2015 at 11:25 AM, Behroz Sikander <[email protected]> wrote:
> >>>> >>>> > Ok. So, I do have a memory problem. I will try to scale out.
> >>>> >>>> >
> >>>> >>>> > >> Each task processor has two message managers, one for
> >>>> >>>> > >> outgoing and one for incoming. All these are handled in
> >>>> >>>> > >> memory, so it sometimes requires large memory space.
> >>>> >>>> >
> >>>> >>>> > So, you mean that before barrier synchronization, I have a lot
> >>>> >>>> > of data in the message managers. Hmmm, I will recheck my logic
> >>>> >>>> > related to messages. Btw, what is the limit of these message
> >>>> >>>> > managers? How much data at a single time can they handle?
> >>>> >>>> >
> >>>> >>>> > P.S. Each day, as I am moving towards a big cluster, I am
> >>>> >>>> > running into problems (a lot of them :D).
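The batching idea mentioned in the thread (GraphJobMessage-style serialization) can be sketched as follows. This is not Hama's actual GraphJobMessage class, just a simplified illustration of packing many small messages into one serialized object so the message manager holds a single byte buffer instead of thousands of message objects:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch of message batching: serialize many double[] messages into one byte
// array (a count, then each message's length and values). NOT Hama's actual
// GraphJobMessage, just an illustration of the technique.
public class BatchedMessages {

    static byte[] pack(double[][] messages) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(messages.length);
            for (double[] m : messages) {
                out.writeInt(m.length);
                for (double d : m) out.writeDouble(d);
            }
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Inverse of pack().
    static double[][] unpack(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            double[][] messages = new double[in.readInt()][];
            for (int i = 0; i < messages.length; i++) {
                messages[i] = new double[in.readInt()];
                for (int j = 0; j < messages[i].length; j++) {
                    messages[i][j] = in.readDouble();
                }
            }
            return messages;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        double[][] msgs = new double[8000][96]; // the workload from the thread
        byte[] wire = pack(msgs);
        System.out.println("one object, " + wire.length + " bytes on the wire");
        // 4 + 8000 * (4 + 96 * 8) = 6176004 bytes, sent as a single object
        // instead of 8000 separate ones.
    }
}
```

The payload size barely changes, but the number of heap objects and RPC calls drops from thousands to one, which is where the memory and overhead savings come from.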
> >>>> >>>> >
> >>>> >>>> > Regards,
> >>>> >>>> > Behroz Sikander
> >>>> >>>> >
> >>>> >>>> > On Fri, Aug 28, 2015 at 4:04 AM, Edward J. Yoon <[email protected]> wrote:
> >>>> >>>> >
> >>>> >>>> >> > for 3 Groom child process + 2GB for Ubuntu OS). Is this
> >>>> >>>> >> > correct understanding?
> >>>> >>>> >>
> >>>> >>>> >> and,
> >>>> >>>> >>
> >>>> >>>> >> > on a big dataset. I think these exceptions have something
> >>>> >>>> >> > to do with the Ubuntu OS killing the hama process due to
> >>>> >>>> >> > lack of memory. So, I was curious about
> >>>> >>>> >>
> >>>> >>>> >> Yes, you're right.
> >>>> >>>> >>
> >>>> >>>> >> Each task processor has two message managers, one for
> >>>> >>>> >> outgoing and one for incoming. All these are handled in
> >>>> >>>> >> memory, so it sometimes requires large memory space. To solve
> >>>> >>>> >> the OutOfMemory issue, you should scale out your cluster by
> >>>> >>>> >> increasing the number of nodes and job tasks, or optimize
> >>>> >>>> >> your algorithm. Another option is a disk-spillable message
> >>>> >>>> >> manager. This is not supported yet.
> >>>> >>>> >>
> >>>> >>>> >> On Fri, Aug 28, 2015 at 10:45 AM, Behroz Sikander <[email protected]> wrote:
> >>>> >>>> >> > Hi,
> >>>> >>>> >> > Yes. According to hama-default.xml, each machine will open
> >>>> >>>> >> > 3 processes with 2GB memory each. This means that my VMs
> >>>> >>>> >> > need at least 8GB memory (2GB each for 3 Groom child
> >>>> >>>> >> > processes + 2GB for the Ubuntu OS). Is this a correct
> >>>> >>>> >> > understanding?
> >>>> >>>> >> >
> >>>> >>>> >> > I recently ran into the following exceptions when I was
> >>>> >>>> >> > trying to run hama on a big dataset. I think these
> >>>> >>>> >> > exceptions have something to do with the Ubuntu OS killing
> >>>> >>>> >> > the hama process due to lack of memory.
> >>>> >>>> >> > So, I was curious about my configurations.
> >>>> >>>> >> >
> >>>> >>>> >> > 'BSP task process exit with nonzero status of 137.'
> >>>> >>>> >> > 'BSP task process exit with nonzero status of 1'
> >>>> >>>> >> >
> >>>> >>>> >> > Regards,
> >>>> >>>> >> > Behroz
> >>>> >>>> >> >
> >>>> >>>> >> > On Fri, Aug 28, 2015 at 3:04 AM, Edward J. Yoon <[email protected]> wrote:
> >>>> >>>> >> >
> >>>> >>>> >> >> Hi,
> >>>> >>>> >> >>
> >>>> >>>> >> >> You can change the max tasks per node by setting the below
> >>>> >>>> >> >> property in hama-site.xml. :-)
> >>>> >>>> >> >>
> >>>> >>>> >> >> <property>
> >>>> >>>> >> >>   <name>bsp.tasks.maximum</name>
> >>>> >>>> >> >>   <value>3</value>
> >>>> >>>> >> >>   <description>The maximum number of BSP tasks that will
> >>>> >>>> >> >>   be run simultaneously by a groom server.</description>
> >>>> >>>> >> >> </property>
> >>>> >>>> >> >>
> >>>> >>>> >> >> On Fri, Aug 28, 2015 at 5:18 AM, Behroz Sikander <[email protected]> wrote:
> >>>> >>>> >> >> > Hi,
> >>>> >>>> >> >> > Recently, I noticed that my hama deployment is only
> >>>> >>>> >> >> > opening 3 processes per machine. This is because of the
> >>>> >>>> >> >> > configuration settings in the default hama file.
> >>>> >>>> >> >> >
> >>>> >>>> >> >> > My question is: why 3, and not 5 or 7? What criteria
> >>>> >>>> >> >> > should be considered if I want to increase the value?
> >>>> >>>> >> >> >
> >>>> >>>> >> >> > Regards,
> >>>> >>>> >> >> > Behroz
> >>>> >>>> >> >>
> >>>> >>>> >> >> --
> >>>> >>>> >> >> Best Regards, Edward J. Yoon
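A side note on the two exit statuses reported in this thread: on Linux, process runners commonly report 128 plus the signal number when a child is killed by a signal, so status 137 means signal 9 (SIGKILL), which is what the kernel OOM killer sends; status 1 is an ordinary error exit, such as a JVM aborting after a failed native allocation. A tiny decoder (a sketch, not part of Hama):

```java
// Decode "exit with nonzero status of N": statuses above 128 usually encode
// 128 + signal number. 137 - 128 = 9 (SIGKILL, the OOM killer's signal);
// status 1 is a plain error exit from the process itself.
public class ExitStatus {
    static int signalOf(int exitStatus) {
        return exitStatus > 128 ? exitStatus - 128 : 0; // 0: not a signal death
    }

    public static void main(String[] args) {
        System.out.println("status 137 -> signal " + signalOf(137)); // SIGKILL
        System.out.println("status 1   -> signal " + signalOf(1));   // error exit
    }
}
```

On a machine that shows status 137, `dmesg` output mentioning the OOM killer would confirm this reading.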
