Getting core.exception.OutOfMemoryError error on allocating large arrays
I am running:

    enum long DIM = 1024L * 1024L * 1024L * 8L;
    void main() { auto signal = new double[DIM]; }

and getting a core.exception.OutOfMemoryError. One option is to use short/int, but I need to use double. Also, when using large arrays, the computer becomes slow. Is there no workaround at all, so that I can work on large arrays? Please let me know.
Re: Getting core.exception.OutOfMemoryError error on allocating large arrays
Assuming double.sizeof == 8 on your machine, you're requesting 1024*1024*1024*8*8 bytes = 68GB. Do you have that much RAM available?

You are completely correct. However, in C one could do:

    const long DIM = 1024L * 1024L * 1024L * 8L;
    int main() { double signal[DIM]; }

which runs fine. So, I was sort of looking for some solution like this.
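If the data set genuinely exceeds RAM, one workaround (a sketch of my own, not something suggested in the thread, and the filename `signal.dat` is an arbitrary choice) is to back the array with a memory-mapped file via std.mmfile, letting the OS page data in and out on demand:

```d
// Sketch: back a huge double array with a memory-mapped file instead of
// the GC heap. The OS pages the data lazily, so the array can exceed RAM.
import std.mmfile;
import std.stdio;

void main()
{
    enum long DIM = 1024L * 1024L;   // kept small here for demonstration
    auto mmf = new MmFile("signal.dat", MmFile.Mode.readWriteNew,
                          DIM * double.sizeof, null);
    auto signal = cast(double[]) mmf[];   // view the file as a double[]
    signal[0]       = 3.14;
    signal[DIM - 1] = 2.71;
    writeln(signal[0], " ", signal[DIM - 1]);
}
```

Access patterns with poor locality will thrash the page cache, so this only pays off when the traversal is mostly sequential.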
Re: Getting core.exception.OutOfMemoryError error on allocating large arrays
Thanks. Yes, you are right. I will change my program.
Re: Passing a value by reference to function in taskPool
Thanks a lot for your reply.
How to initialize an immutable array
I am making a program which accesses a 1D array in a for loop, and I am parallelizing this with foreach, TaskPool and parallel. The array does not need to change once initialized. However, the parallel version takes more time than the serial version, which I think may be because the compiler is trying to make sure the array is properly handled by the different threads. So, is there a way an array can be made immutable and still be initialized? Thanks a lot for your time.
Re: How to initialize an immutable array
The array is really big!

    import std.stdio;
    import std.datetime;
    import std.parallelism;
    import std.range;

    //int numberOfWorkers = 2; // for parallel

    double my_abs(double n) { return n > 0 ? n : -n; }

    immutable long DIM = 1024L * 1024L * 128L;

    void main()
    {
        double[] signal = new double[DIM + 1];
        double temp;
        double[2] sample = [4.1, 7.2];
        for (long i = 0L; i < DIM + 1; i++)
        {
            signal[i] = (i + DIM) % 7 + (i + DIM + 1) % 5; // could be any random value
        }
        //auto workerPool = new TaskPool(numberOfWorkers); // for parallel
        StopWatch sw;
        sw.start(); // start/resume measuring
        for (long i = 0L; i < DIM; i++)
        //foreach (i; workerPool.parallel(iota(0, DIM))) // for parallel
        {
            temp = my_abs(sample[0] - signal[i]) + my_abs(sample[1] - signal[i + 1]);
        }
        //workerPool.finish(); // for parallel
        sw.stop(); // stop/pause measuring
        writeln("Total time: ", sw.peek().msecs / 1000, " [sec]");
    }

It has both serial and parallel versions. Just comment/uncomment as per the comments.
Re: How to initialize an immutable array
foreach (immutable i; 0 .. DIM + 1) { — Thanks. However, rdmd gives an error on this line: temp1.d(12): Error: no identifier for declarator immutable(i)
Re: How to initialize an immutable array
On Friday, 1 March 2013 at 20:28:19 UTC, FG wrote: I suppose this:

    immutable long DIM = 1024L * 1024L * 128L;
    immutable(double)[] signal = new double[DIM+1];

    static this() {
        for (long i = 0L; i < DIM+1; i++) {
            signal[i] = (i+DIM)%7 + (i+DIM+1)%5;
        }
    }

    void main() { ... }

Thanks. This gives an error which I don't know how to resolve:

    Error: cannot evaluate new double[](134217729LU) at compile time

Can you please tell me how to fix it?
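The error arises because a module-level initializer must be computable at compile time. One way around it (a sketch under my own assumptions, not the thread's exact answer) is to do all allocation inside a static constructor, where runtime allocation is allowed, and then convert the finished buffer to immutable with std.exception.assumeUnique:

```d
// Sketch: build the data at program startup in a module constructor,
// then mark it immutable once no other mutable reference remains.
import std.exception : assumeUnique;
import std.stdio;

immutable long DIM = 1024L * 1024L;  // smaller here for demonstration
immutable double[] signal;           // initialized once, below

shared static this()
{
    auto tmp = new double[DIM + 1];  // runtime allocation is fine here
    foreach (i; 0 .. DIM + 1)
        tmp[i] = (i + DIM) % 7 + (i + DIM + 1) % 5;
    signal = assumeUnique(tmp);      // tmp is the only reference, so this is safe
}

void main()
{
    writeln(signal[0]);  // (0+DIM)%7 + (DIM+1)%5 = 4 + 2
}
```

assumeUnique is a promise to the compiler, not a check: if any mutable alias to the buffer survives, the immutability guarantee is broken.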
Re: How to initialize an immutable array
Removing immutable word solves the problem. Thanks.
Re: How to initialize an immutable array
I realized that access to temp causes a bottleneck. When it is defined inside the for loop it becomes local to each iteration, and then there is a speedup. Defining it outside makes it shared among the worker threads, which slows the program down.
Passing a value by reference to function in taskPool
Here is the code:

    import std.stdio, std.datetime, std.random, std.range, std.parallelism;

    enum long numberOfSlaves = 2;

    void myFunc(ref long countvar)
    {
        countvar = 500;
        writeln("value of countvar is ", countvar);
    }

    void main()
    {
        long count1 = 0, count2 = 0;
        alias typeof(task!(myFunc)(0L)) MyTask;

        // Possibility 1
        MyTask[numberOfSlaves] tasks;
        tasks[0] = task!(myFunc)(count1);
        taskPool.put(tasks[0]);
        tasks[1] = task!(myFunc)(count2);
        taskPool.put(tasks[1]);
        for (long cc = 0; cc < numberOfSlaves; cc++)
            tasks[cc].yieldForce();

        // Possibility 2
        //myFunc(count1);
        //myFunc(count2);

        writeln("value of count1 and count2 are ", count1, " ", count2);
    }

Possibility 1: here I wanted to pass a value by reference to myFunc, but when I read that value back in main, it is not changed at all. Possibility 2 does what I want. So, how do I properly use taskPool so that pass-by-reference works? Comment/uncomment Possibility 1 or 2 to see the output.
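The likely cause (my reading of std.parallelism, stated as an assumption rather than the thread's reply) is that task!() stores a copy of each argument inside the Task object, so the ref parameter binds to that internal copy, not to the variable in main. Passing a pointer instead makes the worker write through to the original:

```d
// Sketch: pass a pointer so the task writes to the caller's variable
// instead of to the Task's internal copy of the argument.
import std.stdio;
import std.parallelism;

void myFunc(long* countvar)
{
    *countvar = 500;  // writes through to the variable in main
}

void main()
{
    long count1 = 0, count2 = 0;
    auto t0 = task!myFunc(&count1);
    auto t1 = task!myFunc(&count2);
    taskPool.put(t0);
    taskPool.put(t1);
    t0.yieldForce();  // wait before reading the results
    t1.yieldForce();
    writeln(count1, " ", count2);
}
```

The yieldForce calls are essential: reading count1/count2 before the tasks finish would race with the workers' writes.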
Re: How to specify number of worker threads for Taskpool
Thanks a lot for your reply. It was very helpful.
Re: Finding large difference b/w execution time of c++ and D codes for same problem
Thanks a lot for your reply.
Strange that D is printing large values as zero. Any mistake in my code?
Here is the program:

    import std.stdio;

    const long DIM = 1024*1024*1024*1024*4;

    void main()
    {
        writeln("DIM is ", DIM);
        writeln("Value ", 1024*1024*1024*1024*4);
        writeln("Max ", long.max);
    }

I compiled and ran it: gdc -frelease -O3 temp.d -o t1 ; ./t1

    DIM is 0
    Value 0
    Max 9223372036854775807

Can you please tell why it takes DIM as zero? If I reduce DIM, it works fine. It is strange.
Re: Strange that D is printing large values as zero. Any mistake in my code?
On Thursday, 14 February 2013 at 15:51:45 UTC, Joseph Rushton Wakeling wrote: On 02/14/2013 04:44 PM, Sparsh Mittal wrote: Can you please tell, why it is taking DIM as zero? If I reduce DIM, it works fine. It is strange. 1024 is an int value. Write 1024L instead to ensure that the calculation is performed using long. Thanks a lot for your help.
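A minimal illustration of the fix: integer literals like 1024 have type int, so the product is computed with 32-bit wraparound (2^42 mod 2^32 == 0) before it is ever widened to long. The L suffix makes the whole calculation 64-bit:

```d
// Sketch: the same product computed in int (wraps to 0) and in long.
import std.stdio;

void main()
{
    int k = 1024;
    long wrong = k * k * k * k * 4;                   // 32-bit overflow first, then widened
    long right = 1024L * 1024L * 1024L * 1024L * 4L;  // computed entirely in 64-bit
    writeln(wrong);
    writeln(right);
}
```

It is enough for one operand of the first multiplication to be long: the usual integral promotions then carry long through the rest of the expression.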
Finding large difference b/w execution time of c++ and D codes for same problem
I am writing a Julia sets program in C++ and D, written the same way as much as possible. On executing them I find a large difference in their execution times. Can you comment on what I am doing wrong, or is this expected?

//=== C++ code, compiled with -O3 ===

    #include <sys/time.h>
    #include <iostream>
    using namespace std;

    const int DIM = 4194304;

    struct complexClass {
        float r;
        float i;
        complexClass(float a, float b) { r = a; i = b; }
        float squarePlusMag(complexClass another) {
            float r1 = r*r - i*i + another.r;
            float i1 = 2.0*i*r + another.i;
            r = r1;
            i = i1;
            return (r1*r1 + i1*i1);
        }
    };

    int juliaFunction(int x, int y) {
        complexClass a(x, y);
        complexClass c(-0.8, 0.156);
        for (int i = 0; i < 200; i++) {
            if (a.squarePlusMag(c) > 1000) return 0;
        }
        return 1;
    }

    void kernel() {
        for (int x = 0; x < DIM; x++) {
            for (int y = 0; y < DIM; y++) {
                int offset = x + y * DIM;
                int juliaValue = juliaFunction(x, y);
                // juliaValue will be used by some function.
            }
        }
    }

    int main() {
        struct timeval start, end;
        gettimeofday(&start, NULL);
        kernel();
        gettimeofday(&end, NULL);
        float delta = ((end.tv_sec - start.tv_sec) * 1000000u + end.tv_usec - start.tv_usec) / 1.e6;
        cout << "C++ code with dimension " << DIM << " Total time: " << delta << " [sec]\n";
    }

//=== D code, compiled with -O -release -inline ===

    #!/usr/bin/env rdmd
    import std.stdio;
    import std.datetime;

    immutable int DIM = 4194304;

    struct complexClass {
        float r;
        float i;
        float squarePlusMag(complexClass another) {
            float r1 = r*r - i*i + another.r;
            float i1 = 2.0*i*r + another.i;
            r = r1;
            i = i1;
            return (r1*r1 + i1*i1);
        }
    }

    int juliaFunction(int x, int y) {
        complexClass c = complexClass(-0.8, 0.156);
        complexClass a = complexClass(x, y);
        for (int i = 0; i < 200; i++) {
            if (a.squarePlusMag(c) > 1000) return 0;
        }
        return 1;
    }

    void kernel() {
        for (int x = 0; x < DIM; x++) {
            for (int y = 0; y < DIM; y++) {
                int offset = x + y * DIM;
                int juliaValue = juliaFunction(x, y);
                // juliaValue will be used by some function.
            }
        }
    }

    void main() {
        StopWatch sw;
        sw.start();
        kernel();
        sw.stop();
        writeln("D code serial with dimension ", DIM, " Total time: ", sw.peek().msecs / 1000, " [sec]");
    }

I will appreciate any help.
Re: Finding large difference b/w execution time of c++ and D codes for same problem
I am finding the C++ code is much faster than the D code.
Re: Finding large difference b/w execution time of c++ and D codes for same problem
Pardon me, can you please point me to a suitable reference or just give the command here? Searching on Google, I could not find anything yet. Performance is my main concern.
Re: Finding large difference b/w execution time of c++ and D codes for same problem
OK. I found it.
Re: Finding large difference b/w execution time of c++ and D codes for same problem
Thanks for your insights. It was very helpful.
Re: Looking for writing parallel foreach kind of statement for nested for-loops
Think again if you need that. Things start getting pretty ugly. :) — Yes, it is not at all intuitive. — Indeed... Sparsh, any reason you need the calculation to be done on 2D blocks instead of independent slots? — For my problem the original answer was fine, since the parallel calculations are not at all dependent on each other. Sometimes, though, there are calculations to be done on a 2D grid where the work is not uniform across the grid (see the paper "An overview of parallel visualisation methods for Mandelbrot and Julia sets", where you can see the parallelization of Julia sets), and hence dividing the grid one particular way gives better load balancing than another. Thanks a lot for your answer and time.
Re: Looking for writing parallel foreach kind of statement for nested for-loops
for (int i = 1; i < N; i++) is equivalent to foreach (i; iota(1, N)), so you can use:

    foreach (i; parallel(iota(1, N))) { ... }

Thanks a lot. This one divides the x-cross-y region by rows: if the dimension is 8*12 and there are 4 parallel threads, the current method gives a 2*12 strip to each of the 4 threads. That answers my question, but I was just curious: can we have a method which divides the 2D region into blocks instead, e.g. 8*12 split into a 4*6 block for each of the 4 threads?
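One way to get block decomposition (a sketch of my own, not a reply from the thread; the block sizes BX and BY are illustrative) is to parallelize over block indices and loop over each block's 2D extent inside the task:

```d
// Sketch: divide an NX x NY grid into BX x BY blocks and hand one block
// to each parallel iteration, instead of parallelizing over rows.
import std.parallelism;
import std.range;
import std.stdio;
import core.atomic;

enum NX = 8, NY = 12;   // grid dimensions
enum BX = 4, BY = 6;    // block dimensions: (8/4)*(12/6) = 4 blocks

shared long visited;    // counts cells, just to verify full coverage

void main()
{
    enum blocksX = NX / BX, blocksY = NY / BY;
    foreach (b; parallel(iota(0, blocksX * blocksY)))
    {
        immutable bx = (b % blocksX) * BX;   // block origin in x
        immutable by = (b / blocksX) * BY;   // block origin in y
        foreach (x; bx .. bx + BX)
            foreach (y; by .. by + BY)
                atomicOp!"+="(visited, 1);   // real work on cell (x, y) goes here
    }
    writeln(visited);   // NX * NY cells, each touched exactly once
}
```

Blocks give better cache locality and, for non-uniform workloads like Julia sets, spread expensive regions across threads more evenly than whole-row strips.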
Re: Allocating large 2D array in D
Thanks for your prompt reply. It was very helpful.
Not able to get scaled performance on increasing number of threads
I am parallelizing a program which follows this structure:

    for iter = 1 to MAX_ITERATION {
        myLocalBarrier = new Barrier(numberOfThreads+1);
        for i = 1 to numberOfThreads {
            spawn(myFunc, args)
        }
    }
Not able to get scaled performance on increasing number of threads
It got posted before I completed it! Sorry. I am parallelizing a program which follows this structure:

    immutable int numberOfThreads = 2;

    for iter = 1 to MAX_ITERATION {
        myLocalBarrier = new Barrier(numberOfThreads+1);
        for i = 1 to numberOfThreads {
            spawn(myFunc, args)
        }
        myLocalBarrier.wait();
    }

    void myFunc(args) {
        // do the task
        myLocalBarrier.wait();
    }

When I run it and compare this parallel version with its serial version, I only get a speedup of about 1.3 with 2 threads. When I write the same program in Go, the scaling is nearly 2. Also, in D, top shows only about 130% CPU usage, not nearly 200% or 180%. So I am wondering whether I am doing it properly. Please help me.
Re: Not able to get scaled performance on increasing number of threads
Can't tell much without the whole source or at least compilable standalone piece. Give me a moment. I will post.
Re: Not able to get scaled performance on increasing number of threads
Here is the code:

    #!/usr/bin/env rdmd
    import std.stdio;
    import std.concurrency;
    import core.thread;
    import std.datetime;
    import std.conv;
    import core.sync.barrier;

    immutable int gridSize = 256;
    immutable int MAXSTEPS = 5;       /* Maximum number of iterations */
    immutable double TOL_VAL = 0.1;   /* Numerical tolerance */
    immutable double omega = 0.376;
    immutable double one_minus_omega = 1.0 - 0.376;
    immutable int numberOfThreads = 2;

    double MAX_FUNC(double a, double b) { return a > b ? a : b; }
    double ABS_VAL(double a) { return a > 0 ? a : -a; }

    __gshared Barrier myLocalBarrier = null;
    shared double[gridSize+2][gridSize+2] gridInfo;
    shared double maxError = 0.0;

    void main(string[] args)
    {
        for (int i = 0; i < gridSize+2; i++) {
            for (int j = 0; j < gridSize+2; j++) {
                if (i == 0)
                    gridInfo[i][j] = 1.0;
                else
                    gridInfo[i][j] = 0.0;
            }
        }

        bool shouldCheck = false;
        bool isConverged = false;

        for (int iter = 1; iter <= MAXSTEPS; iter++) {
            shouldCheck = false;
            if (iter % 400 == 0) {
                shouldCheck = true;
                maxError = 0.0;
            }

            // This is Phase 1
            {
                myLocalBarrier = new Barrier(numberOfThreads+1);
                for (int cc = 0; cc < numberOfThreads; cc++) {
                    spawn(&SolverSlave, thisTid, cc, 0, shouldCheck);
                }
                myLocalBarrier.wait();
            }

            // This is Phase 2
            {
                myLocalBarrier = new Barrier(numberOfThreads+1);
                for (int cc = 0; cc < numberOfThreads; cc++) {
                    spawn(&SolverSlave, thisTid, cc, 1, shouldCheck);
                }
                myLocalBarrier.wait();
            }

            if (maxError < TOL_VAL) {
                isConverged = true;
                break;
            }
        }

        if (isConverged)
            writeln("It converged");
        else
            writeln("It did not converge");
    }

    void SolverSlave(Tid owner, int myNumber, int remainder, bool shouldCheckHere)
    {
        double sum = 0;
        // Divide the task among threads
        int iStart = ((myNumber * gridSize) / numberOfThreads) + 1;
        int iEnd   = ((myNumber + 1) * gridSize) / numberOfThreads;

        for (int i = iStart; i <= iEnd; i++) {
            for (int j = 1; j < gridSize+1; j++) {
                if ((i+j) % 2 == remainder) {  // Phase 1 or 2
                    sum = ( gridInfo[i  ][j+1]
                          + gridInfo[i+1][j  ]
                          + gridInfo[i-1][j  ]
                          + gridInfo[i  ][j-1] ) * 0.25;
                    // Should not check every time, to reduce synchronization overhead
                    if (shouldCheckHere) {
                        maxError = MAX_FUNC(ABS_VAL(omega * (sum - gridInfo[i][j])), maxError);
                    }
                    gridInfo[i][j] = one_minus_omega * gridInfo[i][j] + omega * sum;
                }
            }
        }
        myLocalBarrier.wait();
    }
Re: Not able to get scaled performance on increasing number of threads
Excellent. Thank you so much for your suggestion and code. It now produces near linear speedup.
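The thread's actual fix is not quoted here, but one plausible shape for it (a sketch under my own assumptions) is to stop spawning fresh threads and allocating a new Barrier every iteration, and instead reuse one pool of workers via std.parallelism; the implicit join at the end of each parallel foreach already acts as the per-iteration barrier:

```d
// Sketch: red-black relaxation with a persistent TaskPool. The parallel
// foreach joins all workers before returning, replacing the Barrier.
import std.parallelism;
import std.range;
import std.stdio;

enum gridSize = 64;
enum MAXSTEPS = 5;
__gshared double[gridSize + 2][gridSize + 2] grid;

void main()
{
    foreach (ref row; grid) row[] = 0.0;  // doubles default to NaN in D
    grid[0][] = 1.0;                      // boundary condition

    auto pool = new TaskPool(2);          // 2 worker threads, created once
    scope (exit) pool.finish();

    foreach (iter; 0 .. MAXSTEPS)
    {
        foreach (phase; 0 .. 2)           // red cells, then black cells
        {
            // cells of one color only read the other color: no data race
            foreach (i; pool.parallel(iota(1, gridSize + 1)))
                foreach (j; 1 .. gridSize + 1)
                    if ((i + j) % 2 == phase)
                        grid[i][j] = 0.25 * (grid[i][j+1] + grid[i+1][j]
                                           + grid[i-1][j] + grid[i][j-1]);
        }
    }
    writeln("finished ", MAXSTEPS, " iterations");
}
```

Spawning a thread costs far more than dispatching a task to an already-running worker, so amortizing thread creation over all iterations is usually where the missing speedup comes from.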
Re: Not able to get scaled performance on increasing number of threads
Thanks. Yes, you are right. I have increased the dimension.
Trying to understand how shared works in D
I wrote this code. My purpose is to see how shared works in D. I create a global variable (globalVar) and access it in two different threads, and it prints fine although it is not shared. So, can you please tell me what difference using or not using shared makes (ref. http://www.informit.com/articles/article.aspx?p=1609144seqNum=3)? Also, is a global const implicitly shared?

    #!/usr/bin/env rdmd
    import std.stdio;
    import std.concurrency;
    import core.thread;

    const int globalConst = 51;
    int globalVar = 17;

    void main()
    {
        writefln("Calling Function");
        spawn(&test1, thisTid);
        writefln("Wait Here");
        spawn(&test2, thisTid);
        writefln("End Here");
    }

    void test1(Tid owner)
    {
        writefln("The value of globalConst here is %s", globalConst);
        writefln("The value of globalVar here is %s", globalVar);
    }

    void test2(Tid owner)
    {
        writefln("The value of globalConst here is %s", globalConst);
        writefln("The value of globalVar here is %s", globalVar);
    }
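The reading above can be misleading: an unshared global in D is thread-local, so each thread prints its own copy of globalVar, which merely happens to hold the same initial value. A small demonstration (my own sketch, not from the thread) makes the difference visible by writing to both kinds of global from another thread:

```d
// Sketch: a plain global is thread-local (one copy per thread); a shared
// global is a single process-wide copy.
import std.stdio;
import std.concurrency;
import core.thread;
import core.atomic;

int globalVar = 17;          // thread-local: one copy per thread
shared int sharedVar = 17;   // one copy for the whole process

void writer()
{
    globalVar = 99;                // touches only the writer thread's copy
    atomicStore(sharedVar, 99);    // visible to every thread
}

void main()
{
    spawn(&writer);
    thread_joinAll();              // wait for the writer thread to finish
    writeln(globalVar, " ", atomicLoad(sharedVar));
}
```

This also answers the const question: immutable/const globals are implicitly shareable precisely because no thread can ever observe a change to them.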
Re: Trying to understand how shared works in D
Thanks a lot. VERY VERY helpful.
Looking for command for synchronization of threads
Background: I am implementing an iterative algorithm in a parallel manner. The algorithm iteratively updates a matrix (2D grid) of data, so I will divide the grid among different threads, each of which works on its part for a single iteration. After each iteration, all threads must wait, since the next iteration depends on the previous one. My issue: to achieve this synchronization, I am looking for an equivalent of sync in Cilk or cudaEventSynchronize in CUDA. I saw synchronized, but was not sure if that is the answer. Please help me. I will put that command at the end of the for loop, so it is executed once per iteration.
Re: Looking for command for synchronization of threads
I suggest looking at std.parallelism, since it's designed for this kind of thing. That aside, all the traditional synchronization primitives are in core.sync; the equivalent of sync in Cilk would be core.sync.barrier. — Thanks. I wrote this:

    #!/usr/bin/env rdmd
    import std.stdio;
    import std.concurrency;
    import std.algorithm;
    import core.sync.barrier;
    import core.thread;

    void sorter(Tid owner, shared(int)[] sliceToSort, int mynumber)
    {
        writefln("Came inside %s", mynumber);
        sort(sliceToSort);
        writefln("Going out of %s", mynumber);
    }

    void main()
    {
        shared numbers = [ 6, 5, 4, 3, 2, 1 ];
        auto barrier = new Barrier(2);
        spawn(&sorter, thisTid, numbers[0 .. $ / 2], 0);
        spawn(&sorter, thisTid, numbers[$ / 2 .. $], 1);
        writefln("Waiting for barrier in main");
        barrier.wait();
        writeln(numbers);
    }

It compiles but the barrier never gets released. Can you please point out the fault? Pardon my mistake; I searched the whole web and there are almost no examples of this online. I saw this: http://www.digitalmars.com/d/archives/digitalmars/D/bugs/Issue_9005_New_std.concurrency.spawn_should_allow_void_delegate_Args_shared_for_new_Tid_44426.html but it does not compile.
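My reading of the bug (an assumption, not the thread's actual reply): Barrier(2) releases only after two parties call wait(), but here only main ever calls it, so it blocks forever. One way to fix it is to give the workers access to the barrier and have each of them call wait() too, sizing it as workers + 1 so main is the final party:

```d
// Sketch: a __gshared barrier that both workers and main wait on.
// Casting the slices to unshared is safe here because they don't overlap.
import std.stdio;
import std.concurrency;
import std.algorithm;
import core.sync.barrier;

__gshared Barrier barrier;   // __gshared so spawned threads see the same object

void sorter(shared(int)[] sliceToSort, int mynumber)
{
    sort(cast(int[]) sliceToSort);   // strip shared: this slice is private to us
    barrier.wait();                  // announce completion
}

void main()
{
    shared numbers = [ 6, 5, 4, 3, 2, 1 ];
    barrier = new Barrier(3);        // 2 workers + main
    spawn(&sorter, numbers[0 .. $ / 2], 0);
    spawn(&sorter, numbers[$ / 2 .. $], 1);
    barrier.wait();                  // released once both workers arrive
    writeln(cast(int[]) numbers);
}
```

The barrier also gives the needed memory ordering: after main's wait() returns, the workers' writes to the array are guaranteed visible.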
An error on trying to sort shared data using slice
Purpose: I am trying to sort only a range of values in an array of structs (the struct has two fields, and I want to sort on one of them using the myComp function below). However, I am getting this error:

    ../src/phobos/std/algorithm.d(7731): Error: cannot implicitly convert expression (assumeSorted(r)) of type SortedRange!(shared(intpair)[], myComp) to SortedRange!(shared(intpair[]), myComp)
    ./ParallelCode.d(223): Error: template instance ParallelCode.singleSlave.sort!(myComp, cast(SwapStrategy)0, shared(intpair[])) error instantiating

where my relevant code is:

    struct intpair {
        int[2] AllInts;
    }

    shared intpair[] distArray;

    void main() {
        ...
        distArray = new shared intpair[number_of_lines];
        ...
    }

    void singleThreadFunction(...)
    {
        bool myComp(shared intpair x, shared intpair y) {
            return x.AllInts[0] < y.AllInts[0];
        }
        shared intpair[] tempSortArray = distArray[startRange .. endRange+1];
        /* line 223: */ sort!(myComp)(tempSortArray);
    }

Can you please help me? Thanks.
Re: An error on trying to sort shared data using slice
Thanks for your reply and the link (which I will try to follow). However, I am trying to write a parallel program where I have one big array. Multiple threads (e.g. 2, 4, 8) each work on part of that array; afterwards they sort their portions and return the answer to main. So I have made the array a shared global variable. I do not know if there is another way to tackle the problem: when I don't use shared, singleThreadFunction, which is executed by the different threads, cannot process the array. Thanks.
Re: An error on trying to sort shared data using slice
Thanks a lot. Actually, I am using std.concurrency, following your tutorial: http://ddili.org/ders/d.en/concurrency.html — thanks for that tutorial. My requirement is to sort a portion of the array in each thread, such that the portions do not overlap and together cover the whole array. So I declare the array shared. Currently, each thread takes a slice of that array to sort; although that slice is effectively private to the thread, the compiler forces me to type it as shared. Can you suggest something, e.g. sorting only a portion of an array without slicing? Thanks.
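One workable approach (an assumption of mine, not necessarily the thread's final answer): since each thread sorts a disjoint slice, it is safe to cast that slice to unshared locally and sort the unshared view. The helper name sortMyPortion and the test values are illustrative:

```d
// Sketch: strip shared from a disjoint slice before sorting, since no
// other thread touches this index range while we work on it.
import std.algorithm : sort;
import std.stdio;

struct intpair { int[2] AllInts; }

shared intpair[] distArray;

void sortMyPortion(size_t startRange, size_t endRange)
{
    // cast away shared: valid only because the ranges never overlap
    auto myView = cast(intpair[]) distArray[startRange .. endRange + 1];
    sort!((a, b) => a.AllInts[0] < b.AllInts[0])(myView);
}

void main()
{
    distArray = new shared intpair[4];
    distArray[0].AllInts[0] = 3;   // fill with some test values
    distArray[1].AllInts[0] = 1;
    distArray[2].AllInts[0] = 9;
    distArray[3].AllInts[0] = 7;
    sortMyPortion(0, 1);           // sorts only the first half
    writeln(distArray[0].AllInts[0], " ", distArray[1].AllInts[0]);
}
```

The cast sidesteps the SortedRange type mismatch in the error message because sort then instantiates on a plain intpair[]; the burden of proving non-overlap shifts to the programmer.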
Re: An error on trying to sort shared data using slice
Thanks a lot. Your code is very valuable to explain the whole concept. I have changed my code based on it.
Re: How to call external program in D
Thanks a lot, it was very helpful.