Getting core.exception.OutOfMemoryError error on allocating large arrays

2013-03-03 Thread Sparsh Mittal

I am running


enum long DIM = 1024L * 1024L * 1024L* 8L ;
void main() {
  auto signal = new double[DIM];
}


and getting a core.exception.OutOfMemoryError. One option is 
to use short/int, but I need to use double. Also, on using large 
arrays, the computer becomes slow.


Is there no workaround at all, so that I can work on large 
arrays? Please let me know.
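One possible workaround (my sketch, not from this thread): back the 
array with a memory-mapped file via std.mmfile, so the OS pages data 
to disk on demand instead of needing it all in RAM. The file name 
signal.dat is illustrative.

```d
import std.mmfile;
import std.stdio;

void main()
{
    // Illustrative size; scale up toward the real DIM as disk space allows.
    enum long DIM = 1024L * 1024L;
    // File-backed mapping: the OS pages this in and out on demand.
    auto mm = new MmFile("signal.dat", MmFile.Mode.readWriteNew,
                         DIM * double.sizeof, null);
    auto signal = cast(double[]) mm[];
    signal[0] = 3.14;
    writeln(signal[0]);
}
```

Access is ordinary slicing; only the pages actually touched need 
physical memory.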


Re: Getting core.exception.OutOfMemoryError error on allocating large arrays

2013-03-03 Thread Sparsh Mittal


Assuming double.sizeof==8 on your machine, you're requesting 
1024*1024*1024*8*8 bytes = 68GB; do you have that much RAM 
available?

You are completely correct, however in C, one could do:

const long DIM = 1024L * 1024L * 1024L * 8L;
int main() {
   double signal[DIM];
}

which runs fine. So, I was sort of looking for some solution like 
this.




Re: Getting core.exception.OutOfMemoryError error on allocating large arrays

2013-03-03 Thread Sparsh Mittal

Thanks. Yes, you are right. I will change my program.


Re: Passing a value by reference to function in taskPool

2013-03-02 Thread Sparsh Mittal

Thanks a lot for your reply.



How to initialize an immutable array

2013-03-01 Thread Sparsh Mittal
I am making a program which accesses 1D array using for loop and 
then I am parallelizing this with foreach, TaskPool and parallel.


The array does not need to change, once initialized. However, the 
parallel version takes more time than serial version, which I 
think may be because compiler is trying to make sure that array 
is properly handled by different threads.


So, is there a way, an array can be made immutable and still 
initialized? Thanks a lot for your time.


Re: How to initialize an immutable array

2013-03-01 Thread Sparsh Mittal

Array is really big!


import std.stdio;
import std.datetime;
import std.parallelism;
import std.range;
//int numberOfWorkers = 2; //for parallel;
double my_abs(double n) { return n > 0 ? n : -n; }

immutable long DIM = 1024L*1024L *128L;

void main()
{

  double[] signal = new double[DIM+1];

  double temp;

  double[2] sample = [4.1, 7.2];


  for(long i=0L; i < DIM+1; i++)
  {
    signal[i] = (i+DIM)%7 + (i+DIM+1)%5; // could be any random value
  }

  //auto workerPool = new TaskPool(numberOfWorkers); // for parallel

  StopWatch sw;
  sw.start(); //start/resume measuring.


  for (long i=0L; i < DIM; i++)
  //foreach(i; workerPool.parallel(iota(0, DIM))) // for parallel
  {
    temp = my_abs(sample[0]-signal[i]) +
           my_abs(sample[1]-signal[i+1]);
  }
  //workerPool.finish(); // for parallel

  sw.stop(); //stop/pause measuring.


  writeln("Total time: ", (sw.peek().msecs/1000), "[sec]");

}

It has both serial and parallel versions. Just comment/uncomment 
as per comments.





Re: How to initialize an immutable array

2013-03-01 Thread Sparsh Mittal



foreach (immutable i; 0 .. DIM + 1) {


Thanks. However, rdmd gives an error on this line:

temp1.d(12): Error: no identifier for declarator immutable(i)
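For what it's worth, a plain foreach over a range initializes the 
array the same way and compiles on older frontends that reject 
`immutable i` (a sketch with DIM reduced for illustration):

```d
import std.stdio;

void main()
{
    enum DIM = 4; // reduced for illustration
    auto signal = new double[DIM + 1];
    // Plain loop variable instead of `immutable i`:
    foreach (i; 0 .. DIM + 1)
        signal[i] = (i + DIM) % 7 + (i + DIM + 1) % 5;
    writeln(signal);
}
```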



Re: How to initialize an immutable array

2013-03-01 Thread Sparsh Mittal

On Friday, 1 March 2013 at 20:28:19 UTC, FG wrote:

I suppose this:

immutable long DIM = 1024L*1024L *128L;
immutable(double)[] signal = new double[DIM+1];
static this() {
for (long i=0L; i < DIM+1; i++) {
signal[i] = (i+DIM)%7 + (i+DIM+1)%5;
}
}
void main()
{ ... }


Thanks. This gives an error, which I don't know how to resolve:

Error: cannot evaluate new double[](134217729LU) at compile time

Can you please tell me how to fix it?
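The `new` cannot run at compile time, but a module constructor runs 
at program start, before main. A sketch (using 
std.exception.assumeUnique; DIM reduced for illustration): build the 
array as mutable, then mark it immutable once no other mutable 
reference remains.

```d
import std.exception : assumeUnique;
import std.stdio;

immutable double[] signal;

shared static this()
{
    enum long DIM = 1024; // reduced for illustration
    auto tmp = new double[DIM + 1];
    foreach (i; 0 .. DIM + 1)
        tmp[i] = (i + DIM) % 7 + (i + DIM + 1) % 5;
    // Safe because tmp is the only reference to this memory.
    signal = assumeUnique(tmp);
}

void main()
{
    writeln(signal[0]);
}
```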


Re: How to initialize an immutable array

2013-03-01 Thread Sparsh Mittal

Removing immutable word solves the problem. Thanks.




Re: How to initialize an immutable array

2013-03-01 Thread Sparsh Mittal
I realized that access to temp causes a bottleneck. On defining 
it inside the for loop, it becomes local and then there is a speedup. 
Defining it outside makes it shared, which slows the program.




Passing a value by reference to function in taskPool

2013-03-01 Thread Sparsh Mittal

Here is a code:

import std.stdio, std.datetime, std.random, std.range, 
std.parallelism;



enum long numberOfSlaves = 2;


void myFunc( ref long countvar)
{
    countvar = 500;
    writeln("value of countvar is ", countvar);
}


void main()
{
    long count1=0, count2=0;
    alias typeof(task!(myFunc)(0L)) MyTask;


    //Possibility 1
    MyTask[numberOfSlaves] tasks;
    tasks[0] = task!(myFunc)(count1);
    taskPool.put(tasks[0]);
    tasks[1] = task!(myFunc)(count2);
    taskPool.put(tasks[1]);
    for (long cc = 0; cc < numberOfSlaves; cc++)
        tasks[cc].yieldForce();

    //Possibility 2
    //myFunc(count1);
    //myFunc(count2);

    writeln("value of count1 and count2 are ", count1, " ", count2);

}


Possibility 1: Here, I wanted to pass a value by reference to 
myFunc, but when I read that value back in main, it has not changed 
at all.


Possibility 2: It does what I want.

So, how do I use taskPool properly so that pass by reference works?


Uncomment/comment Possibility 1 or 2 to see the output.
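The reason Possibility 1 fails is that task! copies its arguments 
into the Task object, so the ref parameter binds to that copy, not 
to count1/count2 in main. One workaround (my sketch): pass a pointer 
instead.

```d
import std.parallelism;
import std.stdio;

// task! stores a copy of each argument, so `ref long` binds to the
// stored copy. A pointer still reaches the caller's variable.
void myFunc(long* countvar)
{
    *countvar = 500;
}

void main()
{
    long count1 = 0, count2 = 0;
    auto t1 = task!myFunc(&count1);
    auto t2 = task!myFunc(&count2);
    taskPool.put(t1);
    taskPool.put(t2);
    t1.yieldForce();
    t2.yieldForce();
    writeln(count1, " ", count2);
}
```

Alternatively, make myFunc return the value and read it from 
yieldForce(), which avoids sharing state altogether.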



Re: How to specify number of worker threads for Taskpool

2013-02-17 Thread Sparsh Mittal

Thanks a lot for your reply. It was very helpful.





Re: Finding large difference b/w execution time of c++ and D codes for same problem

2013-02-14 Thread Sparsh Mittal

Thanks a lot for your reply.


Strange that D is printing large values as zero. Any mistake in my code?

2013-02-14 Thread Sparsh Mittal

Here is the program:


import std.stdio;

const long DIM = 1024*1024*1024*1024*4;
void main()
{
writeln("DIM is ", DIM);
writeln("Value ", 1024*1024*1024*1024*4);
writeln("Max ", long.max);

}

I compiled it:
gdc -frelease -O3 temp.d -o t1 ; ./t1
DIM is 0
Value 0
Max 9223372036854775807

Can you please tell me why it is taking DIM as zero? If I reduce 
DIM, it works fine. It is strange.


Re: Strange that D is printing large values as zero. Any mistake in my code?

2013-02-14 Thread Sparsh Mittal
On Thursday, 14 February 2013 at 15:51:45 UTC, Joseph Rushton 
Wakeling wrote:

On 02/14/2013 04:44 PM, Sparsh Mittal wrote:
Can you please tell, why it is taking DIM as zero? If I reduce 
DIM, it works fine. It is strange.


1024 is an int value.  Write 1024L instead to ensure that the 
calculation is performed using long.


Thanks a lot for your help.



Finding large difference b/w execution time of c++ and D codes for same problem

2013-02-12 Thread Sparsh Mittal
I am writing a Julia set program in C++ and D, in exactly the same 
way as much as possible. On executing, I find a large difference in 
their execution times. Can you comment on what I am doing wrong, or 
is it expected?



//===C++ code, compiled with -O3 ==
#include <sys/time.h>
#include <iostream>
using namespace std;
const int DIM = 4194304;

struct complexClass {
  float r;
  float i;
  complexClass( float a, float b )
  {
    r = a;
    i = b;
  }

  float squarePlusMag(complexClass another)
  {
    float r1 = r*r - i*i + another.r;
    float i1 = 2.0*i*r + another.i;

    r = r1;
    i = i1;

    return (r1*r1 + i1*i1);
  }
};


int juliaFunction( int x, int y )
{
  complexClass a(x, y);
  complexClass c(-0.8, 0.156);

  int i = 0;

  for (i=0; i<200; i++) {
    if( a.squarePlusMag(c) > 1000)
      return 0;
  }

  return 1;
}


void kernel(  ){
  for (int x=0; x<DIM; x++) {
    for (int y=0; y<DIM; y++) {
      int offset = x + y * DIM;
      int juliaValue = juliaFunction( x, y );
      //juliaValue will be used by some function.
    }
  }
}


int main()
{
  struct timeval start, end;
  gettimeofday(&start, NULL);
  kernel();
  gettimeofday(&end, NULL);
  float delta = ((end.tv_sec - start.tv_sec) * 1000000u +
                 end.tv_usec - start.tv_usec) / 1.e6;

  cout << "C++ code with dimension " << DIM << " Total time: "
       << delta << "[sec]\n";
}






//=D code, compiled with -O -release -inline=


#!/usr/bin/env rdmd
import std.stdio;
import std.datetime;
immutable int DIM = 4194304;


struct complexClass {
  float r;
  float i;

  float squarePlusMag(complexClass another)
  {
    float r1 = r*r - i*i + another.r;
    float i1 = 2.0*i*r + another.i;

    r = r1;
    i = i1;

    return (r1*r1 + i1*i1);
  }
}


int juliaFunction( int x, int y )
{
  complexClass c = complexClass(-0.8, 0.156);
  complexClass a = complexClass(x, y);

  for (int i=0; i<200; i++) {
    if( a.squarePlusMag(c) > 1000)
      return 0;
  }
  return 1;
}


void kernel(  ){
  for (int x=0; x<DIM; x++) {
    for (int y=0; y<DIM; y++) {
      int offset = x + y * DIM;
      int juliaValue = juliaFunction( x, y );
      //juliaValue will be used by some function.
    }
  }
}


void main()
{
  StopWatch sw;
  sw.start();
  kernel();
  sw.stop();
  writeln("D code serial with dimension ", DIM, " Total time: ",
          (sw.peek().msecs/1000), "[sec]");
}

//
I will appreciate any help.


Re: Finding large difference b/w execution time of c++ and D codes for same problem

2013-02-12 Thread Sparsh Mittal

I am finding C++ code is much faster than D code.


Re: Finding large difference b/w execution time of c++ and D codes for same problem

2013-02-12 Thread Sparsh Mittal
Pardon me, can you please point me to a suitable reference or just 
tell me the command here. Searching on Google, I could not find 
anything yet. Performance is my main concern.
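For reference, typical optimization invocations for the three major 
D compilers look roughly like this (these flag sets are my 
assumption of what the reply pointed to; file names are 
illustrative):

```shell
# GDC (GCC-based), as used elsewhere in this thread:
gdc -O3 -frelease julia.d -o julia_gdc
# DMD (reference compiler):
dmd -O -release -inline julia.d -ofjulia_dmd
# LDC (LLVM-based):
ldc2 -O3 -release julia.d -of=julia_ldc
```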






Re: Finding large difference b/w execution time of c++ and D codes for same problem

2013-02-12 Thread Sparsh Mittal

OK. I found it.



Re: Finding large difference b/w execution time of c++ and D codes for same problem

2013-02-12 Thread Sparsh Mittal

Thanks for your insights. It was very helpful.




Re: Looking for writing parallel foreach kind of statement for nested for-loops

2013-02-10 Thread Sparsh Mittal


Think again if you need that. Things start getting pretty 
ugly. :)



Yes, it is not at all intuitive.

Indeed... Sparsh, any reason you need the calculation to be done 
on 2d blocks instead of independent slots?


For my problem, the original answer was fine, since the parallel 
calculations are not at all dependent on each other.

Sometimes there are calculations to be done on a 2d grid where the 
work is not uniform across the grid (see the paper "An overview of 
parallel visualisation methods for Mandelbrot and Julia sets", 
where you can see Julia set parallelization), and hence dividing it 
in a particular way leads to better load-balancing than others.


Thanks a lot for your answer and time.




Re: Looking for writing parallel foreach kind of statement for nested for-loops

2013-02-09 Thread Sparsh Mittal



for(int i=1; i < N; i++)  ==  foreach(i; iota(1, N))
so you can use:  foreach(i; parallel(iota(1, N))) { ... }

Thanks a lot. This one divides the x-cross-y region by rows. 
Suppose the dimension is 8*12 and there are 4 parallel threads; 
then the current method gives 2*12 to each of the 4 threads.


The current reply answers my question, but I was just curious: can 
we have a method which divides the 2d region as follows: 8*12 
divided into 4*6 to each of 4 threads?
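One way to sketch such 2d blocking on top of parallel (my own 
sketch, not from the thread): parallelize over block indices and 
loop serially inside each block.

```d
import std.parallelism, std.range, std.stdio;

void main()
{
    enum X = 8, Y = 12;   // grid dimensions
    enum BX = 4, BY = 6;  // block handed to each worker
    enum blocksPerRow = X / BX;

    // Parallel over whole blocks; serial within a block.
    foreach (b; parallel(iota(0, blocksPerRow * (Y / BY))))
    {
        immutable x0 = (b % blocksPerRow) * BX;
        immutable y0 = (b / blocksPerRow) * BY;
        foreach (x; x0 .. x0 + BX)
            foreach (y; y0 .. y0 + BY)
            {
                // work on cell (x, y) goes here
            }
    }
    writeln("done");
}
```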








Re: Allocating large 2D array in D

2013-02-04 Thread Sparsh Mittal

Thanks for your prompt reply. It was very helpful.


Not able to get scaled performance on increasing number of threads

2013-02-01 Thread Sparsh Mittal

I am parallelizing a program which follows this structure:

for iter = 1 to MAX_ITERATION
{
 myLocalBarrier = new Barrier(numberOfThreads+1);
 for i= 1 to numberOfThreads
  {
spawn(myFunc, args)
  }

}


Not able to get scaled performance on increasing number of threads

2013-02-01 Thread Sparsh Mittal

It got posted before I completed it! Sorry.


I am parallelizing a program which follows this structure:

immutable int numberOfThreads= 2

for iter = 1 to MAX_ITERATION
{
 myLocalBarrier = new Barrier(numberOfThreads+1);
 for i= 1 to numberOfThreads
  {
spawn(myFunc, args)
  }
  myLocalBarrier.wait();

}

void myFunc(args)
{
 //do the task

   myLocalBarrier.wait()
}

When I run it and compare this parallel version with its serial 
version, I only get a speedup of nearly 1.3 for 2 threads. When I 
write the same program in Go, the scaling is nearly 2.


Also, in D, on doing top, I see the usage as only 130% CPU and 
not nearly 200% or 180%. So I was wondering if I am doing it 
properly. Please help me.


Re: Not able to get scaled performance on increasing number of threads

2013-02-01 Thread Sparsh Mittal




Can't tell much without the whole source or at least compilable 
standalone piece.

Give me a moment. I will post.



Re: Not able to get scaled performance on increasing number of threads

2013-02-01 Thread Sparsh Mittal

Here is the code:


#!/usr/bin/env rdmd
import std.stdio;
import std.concurrency;
import core.thread;
import std.datetime;
import std.conv;
import core.sync.barrier;



immutable int gridSize = 256;
immutable int MAXSTEPS = 5;     /* Maximum number of iterations */
immutable double TOL_VAL = 0.1; /* Numerical Tolerance */

immutable double omega = 0.376;
immutable double one_minus_omega = 1.0 - 0.376;


immutable int numberOfThreads = 2;


double MAX_FUNC(double a, double b)
{
  return a > b ? a : b;
}

double ABS_VAL(double a)
{
  return a > 0 ? a : -a;
}

__gshared Barrier myLocalBarrier = null;
shared double[gridSize+2][gridSize+2] gridInfo;
shared double maxError = 0.0;

void main(string[] args)
{

  for(int i=0; i < gridSize+2; i++)
  {
    for(int j=0; j < gridSize+2; j++)
    {
      if(i==0)
        gridInfo[i][j] = 1.0;
      else
        gridInfo[i][j] = 0.0;
    }
  }

  bool shouldCheck = false;
  bool isConverged = false;
  for(int iter = 1; iter <= MAXSTEPS; iter++)
  {
    shouldCheck = false;
    if(iter % 400 == 0)
    {
      shouldCheck = true;
      maxError = 0.0;
    }

    //This is Phase 1
    {
      myLocalBarrier = new Barrier(numberOfThreads+1);
      for (int cc=0; cc < numberOfThreads; cc++)
      {
        spawn(&SolverSlave, thisTid, cc, 0, shouldCheck);
      }

      myLocalBarrier.wait();
    }

    //This is Phase 2
    {
      myLocalBarrier = new Barrier(numberOfThreads+1);
      for (int cc=0; cc < numberOfThreads; cc++)
      {
        spawn(&SolverSlave, thisTid, cc, 1, shouldCheck);
      }

      myLocalBarrier.wait();
    }

    if( maxError < TOL_VAL)
    {
      isConverged = true;
      break;
    }

  }
  if(isConverged)
    writeln("It converged");
  else
    writeln("It did not converge");
}


void SolverSlave(Tid owner, int myNumber, int remainder, bool shouldCheckHere)
{
  double sum = 0;

  //Divide task among threads
  int iStart = ((myNumber*gridSize)/numberOfThreads) + 1;
  int iEnd = (((myNumber+1)*gridSize)/numberOfThreads);

  for(int i=iStart; i <= iEnd; i++)
  {
    for(int j=1; j < gridSize+1; j++)
    {
      if( ((i+j)%2 == remainder)) //Phase 1 or 2
      {
        sum = ( gridInfo[i  ][j+1] + gridInfo[i+1][j  ] +
                gridInfo[i-1][j  ] + gridInfo[i  ][j-1] )*0.25;

        //Should not check every time, to reduce synchronization overhead
        if(shouldCheckHere)
        {
          maxError = MAX_FUNC(ABS_VAL(omega*(sum-gridInfo[i][j])), maxError);
        }
        gridInfo[i][j] = one_minus_omega*gridInfo[i][j] + omega*sum;
      }
    }
  }

  myLocalBarrier.wait();
}





Re: Not able to get scaled performance on increasing number of threads

2013-02-01 Thread Sparsh Mittal
Excellent. Thank you so much for your suggestion and code. It now 
produces near linear speedup.




Re: Not able to get scaled performance on increasing number of threads

2013-02-01 Thread Sparsh Mittal

Thanks. Yes, you are right. I have increased the dimension.


Trying to understand how shared works in D

2013-01-31 Thread Sparsh Mittal


I wrote this code. My purpose is to see how shared works in D. I 
create a global variable (globalVar) and access it in two 
different threads, and it prints fine, although it is not shared. 
So, can you please tell what difference it makes to use/not-use 
shared (ref. 
http://www.informit.com/articles/article.aspx?p=1609144&seqNum=3).


Also, is a global const implicitly shared?


#!/usr/bin/env rdmd

import std.stdio;
import std.concurrency;
import core.thread;

const int globalConst = 51;
int globalVar = 17;
void main() {

  writefln("Calling Function");
  spawn(&test1, thisTid);
  writefln("Wait Here");
  spawn(&test2, thisTid);
  writefln("End Here");
}

void test1(Tid owner)
{
  writefln("The value of globalConst here is %s", globalConst);
  writefln("The value of globalVar here is %s", globalVar);
}

void test2(Tid owner)
{
  writefln("The value of globalConst here is %s", globalConst);
  writefln("The value of globalVar here is %s", globalVar);
}
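A small experiment (my sketch, not from the thread) showing the 
difference: unqualified globals in D are thread-local, so each 
thread gets its own copy, while immutable/const globals are 
implicitly shared across threads.

```d
import core.thread;
import std.stdio;

immutable int globalConst = 51; // one copy, visible to every thread
int globalVar = 17;             // unqualified globals are thread-local in D

void main()
{
    globalVar = 99; // changes only the main thread's copy
    auto t = new Thread({
        // A new thread gets a fresh TLS copy: still 17 here.
        writeln(globalVar, " ", globalConst);
    });
    t.start();
    t.join();
    writeln(globalVar, " ", globalConst);
}
```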


Re: Trying to understand how shared works in D

2013-01-31 Thread Sparsh Mittal

Thanks a lot. VERY VERY helpful.



Looking for command for synchronization of threads

2013-01-30 Thread Sparsh Mittal


Background:
I am implementing an iterative algorithm in a parallel manner. The 
algorithm iteratively updates a matrix (2D grid) of data. So, I 
will divide the grid among different threads, which will work on 
it for a single iteration. After each iteration, all threads should 
wait, since the next iteration depends on the previous iteration.


My issue:
To achieve synchronization, I am looking for an equivalent of 
sync in Cilk or cudaEventSynchronize in CUDA. I saw 
synchronized, but was not sure if that is the answer. Please 
help me. I will put that command at the end of the for loop and it 
will be executed once per iteration.




Re: Looking for command for synchronization of threads

2013-01-30 Thread Sparsh Mittal




I suggest looking at std.parallelism since it's designed for 
this kind of thing.  That aside, all traditional 
synchronization methods are in core.sync.  The equivalent of 
sync in Cilk would be core.sync.barrier.


Thanks. I wrote this:

#!/usr/bin/env rdmd

import std.stdio;
import std.concurrency;
import std.algorithm;
import core.sync.barrier;
import core.thread;

void sorter(Tid owner, shared(int)[] sliceToSort, int mynumber)
{
    writefln("Came inside %s", mynumber);
    sort(sliceToSort);
    writefln("Going out of %s", mynumber);
}

void main()
{
    shared numbers = [ 6, 5, 4, 3, 2, 1 ];
    auto barrier = new Barrier(2);
    spawn(&sorter, thisTid, numbers[0 .. $ / 2], 0);
    spawn(&sorter, thisTid, numbers[$ / 2 .. $], 1);

    writefln("Waiting for barrier in main");
    barrier.wait();

    writeln(numbers);
}
}

It compiles, but the barrier never gets released. Can you please 
point out the fault. Pardon my mistake. I searched the whole web; 
there are almost no examples of it online.



I saw this: 
http://www.digitalmars.com/d/archives/digitalmars/D/bugs/Issue_9005_New_std.concurrency.spawn_should_allow_void_delegate_Args_shared_for_new_Tid_44426.html


but it does not compile.


An error on trying to sort shared data using slice

2013-01-28 Thread Sparsh Mittal
Purpose: I am trying to sort only a range of values in an array 
of structs (the struct has two fields and I want to sort on one of 
its fields using the myComp function below). However, I am getting 
this error:


../src/phobos/std/algorithm.d(7731): Error: cannot implicitly 
convert expression (assumeSorted(r)) of type 
SortedRange!(shared(intpair)[], myComp) to 
SortedRange!(shared(intpair[]), myComp)
./ParallelCode.d(223): Error: template instance 
ParallelCode.singleSlave.sort!(myComp, cast(SwapStrategy)0, 
shared(intpair[])) error instantiating

=
where my relevant code is:
=


struct intpair{
  int AllInts[2];
};
shared intpair [] distArray;

void main()
{
...
distArray = new shared intpair[number_of_lines];
...
}

void singleThreadFunction(...)
{
 bool myComp(shared intpair x,shared intpair y)
   {
 return x.AllInts[0] < y.AllInts[0];
   }
shared intpair[] tempSortArray = 
distArray[startRange..endRange+1];


/*line 223:*/   sort!(myComp)(tempSortArray);

}

Can you please help me. Thanks.


Re: An error on trying to sort shared data using slice

2013-01-28 Thread Sparsh Mittal

Thanks for your reply and link (which I will try to follow).

However, I am trying to write a parallel program where I have a 
big array. Multiple (e.g. 2, 4, 8) threads work on parts of that 
array. Afterwards, they sort their portions of the array and 
return the answer to main.


So, I have made a global variable, which is the shared array. I do 
not know if there is another way to tackle the problem.


When I don't use shared, the singleThreadFunction, which is 
executed by different threads, does not process the shared array. 
Thanks.


Re: An error on trying to sort shared data using slice

2013-01-28 Thread Sparsh Mittal
Thanks a lot. Actually, I am using std.concurrency, following 
your tutorial:
http://ddili.org/ders/d.en/concurrency.html. Thanks for that 
tutorial.


My requirement is to sort a portion of an array in each thread, 
such that there is no overlap b/w portions and all portions 
together make the whole array.


So I am taking the array as shared. Currently, in each thread, I 
am taking a slice of that array to sort; although that slice is 
not shared, I am forced to make it shared since the compiler does 
not allow it otherwise.


Can you suggest something, e.g. sorting only a portion of an 
array, without slicing? Thanks.
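One sketch (my assumption, not the tutorial's code): since each 
thread owns its slice exclusively, it can cast away shared just for 
the local sort; the programmer, not the compiler, then guarantees 
the portions don't overlap. The name sortMyPortion and the sample 
data are illustrative.

```d
import std.algorithm : sort;
import std.stdio;

struct intpair { int[2] AllInts; }

shared intpair[] distArray;

void sortMyPortion(size_t startRange, size_t endRange)
{
    // This worker owns [startRange, endRange] exclusively, so the
    // cast is safe in practice even though the compiler can't prove it.
    auto myView = cast(intpair[]) distArray[startRange .. endRange + 1];
    sort!((a, b) => a.AllInts[0] < b.AllInts[0])(myView);
}

void main()
{
    distArray = cast(shared) [intpair([3, 0]), intpair([1, 0]), intpair([2, 0])];
    sortMyPortion(0, 2);
    writeln((cast(intpair[]) distArray)[0].AllInts[0]); // smallest first
}
```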




Re: An error on trying to sort shared data using slice

2013-01-28 Thread Sparsh Mittal
Thanks a lot. Your code is very valuable to explain the whole 
concept. I have changed my code based on it.


Re: How to call external program in D

2012-11-16 Thread Sparsh Mittal




Thanks a lot, it was very helpful.