Re: [Mesa-dev] [PATCH v2 1/1] clover: Wait for requested operation if blocking flag is set

2017-08-17 Thread Francisco Jerez
Jan Vesely  writes:

> On Tue, 2017-08-15 at 12:00 -0700, Francisco Jerez wrote:
>> Jan Vesely  writes:
>> 
>> > On Sat, 2017-08-12 at 20:14 -0700, Francisco Jerez wrote:
>> > > Jan Vesely  writes:
>> > > 
>> > > > On Sat, 2017-08-05 at 12:34 -0700, Francisco Jerez wrote:
>> > > > > Francisco Jerez  writes:
>> > > > > 
>> > > > > > Jan Vesely  writes:
>> > > > > > 
>> > > > > > > Hi,
>> > > > > > > 
>> > > > > > > thanks for detailed explanation. I indeed missed the writeBuffer 
>> > > > > > > part
>> > > > > > > in specs.
>> > > > > > > 
>> > > > > > > On Wed, 2017-08-02 at 15:05 -0700, Francisco Jerez wrote:
>> > > > > > > > These changes are somewhat redundant and potentially
>> > > > > > > > performance-impacting, the reason is that in the OpenCL API,
>> > > > > > > > clEnqueueWrite* commands are specified to block until the 
>> > > > > > > > memory
>> > > > > > > > provided by the application as origin can be reused safely 
>> > > > > > > > (i.e. until
>> > > > > > > > soft_copy_op()() runs), not necessarily until the transfer to 
>> > > > > > > > graphics
>> > > > > > > > memory has completed (which is what hard_event::wait() will 
>> > > > > > > > wait for).
>> > > > > > > > OTOH reads and maps as implemented by soft_copy_op and friends 
>> > > > > > > > are
>> > > > > > > > essentially blocking so the wait() call is redundant in most 
>> > > > > > > > cases.
>> > > > > > > 
>> > > > > > > That explains a noticeable slowdown running piglit with these 
>> > > > > > > changes.
>> > > > > > > I'm not sure about the read part though. I expected it to be 
>> > > > > > > basically
>> > > > > > > a noop, but it changes behaviour.
>> > > > > > 
>> > > > > > I think this change would have slowed you down the most whenever 
>> > > > > > the
>> > > > > > mapping operation performed by soft_copy_op() is able to proceed
>> > > > > > immediately, either because the buffer is idle (so the driver 
>> > > > > > doesn't
>> > > > > > stall on transfer_map()) *or* because the driver is trying to be 
>> > > > > > smart
>> > > > > > and creates a bounce buffer where data can be copied into 
>> > > > > > immediately
>> > > > > > without stalling, then inserts a pipelined GPU copy from the bounce
>> > > > > > buffer into the real buffer.  With this patch you will stall on 
>> > > > > > the GPU
>> > > > > > copy regardless (and whatever other work was already on the command
>> > > > > > stream which might be substantial), even though it wouldn't have 
>> > > > > > been
>> > > > > > necessary in any of these cases.
>> > > > > > 
>> > > > > > > Adding clGetEventInfo(CL_EVENT_COMMAND_EXECUTION_STATUS) after a
>> > > > > > > blocking read in one of the piglit tests surprisingly returns
>> > > > > > > CL_QUEUED.
>> > > > > > > 
>> > > > > > 
>> > > > > > Hmm, yeah, that seems kind of debatable behaviour, although it's
>> > > > > > definitely legit for writes, not quite sure for reads...  I 
>> > > > > > believe the
>> > > > > > reason why that happens is because the CPU copy proceeds very 
>> > > > > > quickly
>> > > > > > (due to the reasons mentioned in the last paragraph), but the 
>> > > > > > hard_event
>> > > > > > is still associated with a pipe_fence synchronous with the GPU's 
>> > > > > > command
>> > > > > > stream, so it won't get signalled until the GPU catches up.
>> > > > 
>> > > > Hi,
>> > > > sorry for the delay, last week was submission week...
>> > > > 
>> > > 
>> > > No worries.
>> > > 
>> > > > The part that I'm still missing is what kind of GPU work needs to be
>> > > > done after clEnqueueRead*(). I assume all necessary work is completed
>> > > > before I can access the data.
>> > > > Also CL_QUEUED status was surprising. since we performed at least some
>> > > > of the work (we got the data), I'd expect CL_RUNNING or CL_SUBMITTED at
>> > > > least.
>> > > > 
>> > > 
>> > > The lag is not due to GPU work that needs to be performed after the
>> > > clEnqueueRead call, but due to GPU work that may precede it in the
>> > > command stream: Because clover doesn't know when the last time was that
>> > > the buffer was referenced by GPU work, the event is associated with a
>> > > fence synchronous with the current batch (which obviously won't signal
>> > > before any of the GPU work that actually referenced the buffer
>> > > completes).  However the pipe driver has a more accurate idea of when
>> > > the buffer was used last, so the transfer_map() operation is likely to
>> > > complete more quickly than the CL event status changes to CL_COMPLETE.
>> > > The reason you're seeing CL_QUEUED rather than CL_SUBMITTED is because
>> > > the read operation didn't even need to flush the current batch, so
>> > > there's no fence associated with the CL event object yet.
>> > 
>> > thanks. so the event is waiting for the current batch, even if the
>> > buffer access is done out of order. The question is, why do we use the
>> > fence at all? If I understood correctly mapping the buffer will be
>> > delayed by the pipe driver if needed, so we don't really need the
>> > fence. Am I missing something?

Re: [Mesa-dev] [PATCH v2 1/1] clover: Wait for requested operation if blocking flag is set

2017-08-17 Thread Jan Vesely
On Tue, 2017-08-15 at 12:00 -0700, Francisco Jerez wrote:
> Jan Vesely  writes:
> 
> > On Sat, 2017-08-12 at 20:14 -0700, Francisco Jerez wrote:
> > > Jan Vesely  writes:
> > > 
> > > > On Sat, 2017-08-05 at 12:34 -0700, Francisco Jerez wrote:
> > > > > Francisco Jerez  writes:
> > > > > 
> > > > > > Jan Vesely  writes:
> > > > > > 
> > > > > > > Hi,
> > > > > > > 
> > > > > > > thanks for detailed explanation. I indeed missed the writeBuffer 
> > > > > > > part
> > > > > > > in specs.
> > > > > > > 
> > > > > > > On Wed, 2017-08-02 at 15:05 -0700, Francisco Jerez wrote:
> > > > > > > > These changes are somewhat redundant and potentially
> > > > > > > > performance-impacting, the reason is that in the OpenCL API,
> > > > > > > > clEnqueueWrite* commands are specified to block until the memory
> > > > > > > > provided by the application as origin can be reused safely 
> > > > > > > > (i.e. until
> > > > > > > > soft_copy_op()() runs), not necessarily until the transfer to 
> > > > > > > > graphics
> > > > > > > > memory has completed (which is what hard_event::wait() will 
> > > > > > > > wait for).
> > > > > > > > OTOH reads and maps as implemented by soft_copy_op and friends 
> > > > > > > > are
> > > > > > > > essentially blocking so the wait() call is redundant in most 
> > > > > > > > cases.
> > > > > > > 
> > > > > > > That explains a noticeable slowdown running piglit with these 
> > > > > > > changes.
> > > > > > > I'm not sure about the read part though. I expected it to be 
> > > > > > > basically
> > > > > > > a noop, but it changes behaviour.
> > > > > > 
> > > > > > I think this change would have slowed you down the most whenever the
> > > > > > mapping operation performed by soft_copy_op() is able to proceed
> > > > > > immediately, either because the buffer is idle (so the driver 
> > > > > > doesn't
> > > > > > stall on transfer_map()) *or* because the driver is trying to be 
> > > > > > smart
> > > > > > and creates a bounce buffer where data can be copied into 
> > > > > > immediately
> > > > > > without stalling, then inserts a pipelined GPU copy from the bounce
> > > > > > buffer into the real buffer.  With this patch you will stall on the 
> > > > > > GPU
> > > > > > copy regardless (and whatever other work was already on the command
> > > > > > stream which might be substantial), even though it wouldn't have 
> > > > > > been
> > > > > > necessary in any of these cases.
> > > > > > 
> > > > > > > Adding clGetEventInfo(CL_EVENT_COMMAND_EXECUTION_STATUS) after a
> > > > > > > blocking read in one of the piglit tests surprisingly returns
> > > > > > > CL_QUEUED.
> > > > > > > 
> > > > > > 
> > > > > > Hmm, yeah, that seems kind of debatable behaviour, although it's
> > > > > > definitely legit for writes, not quite sure for reads...  I believe 
> > > > > > the
> > > > > > reason why that happens is because the CPU copy proceeds very 
> > > > > > quickly
> > > > > > (due to the reasons mentioned in the last paragraph), but the 
> > > > > > hard_event
> > > > > > is still associated with a pipe_fence synchronous with the GPU's 
> > > > > > command
> > > > > > stream, so it won't get signalled until the GPU catches up.
> > > > 
> > > > Hi,
> > > > sorry for the delay, last week was submission week...
> > > > 
> > > 
> > > No worries.
> > > 
> > > > The part that I'm still missing is what kind of GPU work needs to be
> > > > done after clEnqueueRead*(). I assume all necessary work is completed
> > > > before I can access the data.
> > > > Also CL_QUEUED status was surprising. since we performed at least some
> > > > of the work (we got the data), I'd expect CL_RUNNING or CL_SUBMITTED at
> > > > least.
> > > > 
> > > 
> > > The lag is not due to GPU work that needs to be performed after the
> > > clEnqueueRead call, but due to GPU work that may precede it in the
> > > command stream: Because clover doesn't know when the last time was that
> > > the buffer was referenced by GPU work, the event is associated with a
> > > fence synchronous with the current batch (which obviously won't signal
> > > before any of the GPU work that actually referenced the buffer
> > > completes).  However the pipe driver has a more accurate idea of when
> > > the buffer was used last, so the transfer_map() operation is likely to
> > > complete more quickly than the CL event status changes to CL_COMPLETE.
> > > The reason you're seeing CL_QUEUED rather than CL_SUBMITTED is because
> > > the read operation didn't even need to flush the current batch, so
> > > there's no fence associated with the CL event object yet.
> > 
> > thanks. so the event is waiting for the current batch, even if the
> > buffer access is done out of order. The question is, why do we use the
> > fence at all? If I understood correctly mapping the buffer will be
> > delayed by the pipe driver if needed, so we don't really need the
> > fence. Am I missing something?

Re: [Mesa-dev] [PATCH v2 1/1] clover: Wait for requested operation if blocking flag is set

2017-08-15 Thread Francisco Jerez
Jan Vesely  writes:

> On Sat, 2017-08-12 at 20:14 -0700, Francisco Jerez wrote:
>> Jan Vesely  writes:
>> 
>> > On Sat, 2017-08-05 at 12:34 -0700, Francisco Jerez wrote:
>> > > Francisco Jerez  writes:
>> > > 
>> > > > Jan Vesely  writes:
>> > > > 
>> > > > > Hi,
>> > > > > 
>> > > > > thanks for detailed explanation. I indeed missed the writeBuffer part
>> > > > > in specs.
>> > > > > 
>> > > > > On Wed, 2017-08-02 at 15:05 -0700, Francisco Jerez wrote:
>> > > > > > These changes are somewhat redundant and potentially
>> > > > > > performance-impacting, the reason is that in the OpenCL API,
>> > > > > > clEnqueueWrite* commands are specified to block until the memory
>> > > > > > provided by the application as origin can be reused safely (i.e. 
>> > > > > > until
>> > > > > > soft_copy_op()() runs), not necessarily until the transfer to 
>> > > > > > graphics
>> > > > > > memory has completed (which is what hard_event::wait() will wait 
>> > > > > > for).
>> > > > > > OTOH reads and maps as implemented by soft_copy_op and friends are
>> > > > > > essentially blocking so the wait() call is redundant in most cases.
>> > > > > 
>> > > > > That explains a noticeable slowdown running piglit with these 
>> > > > > changes.
>> > > > > I'm not sure about the read part though. I expected it to be 
>> > > > > basically
>> > > > > a noop, but it changes behaviour.
>> > > > 
>> > > > I think this change would have slowed you down the most whenever the
>> > > > mapping operation performed by soft_copy_op() is able to proceed
>> > > > immediately, either because the buffer is idle (so the driver doesn't
>> > > > stall on transfer_map()) *or* because the driver is trying to be smart
>> > > > and creates a bounce buffer where data can be copied into immediately
>> > > > without stalling, then inserts a pipelined GPU copy from the bounce
>> > > > buffer into the real buffer.  With this patch you will stall on the GPU
>> > > > copy regardless (and whatever other work was already on the command
>> > > > stream which might be substantial), even though it wouldn't have been
>> > > > necessary in any of these cases.
>> > > > 
>> > > > > Adding clGetEventInfo(CL_EVENT_COMMAND_EXECUTION_STATUS) after a
>> > > > > blocking read in one of the piglit tests surprisingly returns
>> > > > > CL_QUEUED.
>> > > > > 
>> > > > 
>> > > > Hmm, yeah, that seems kind of debatable behaviour, although it's
>> > > > definitely legit for writes, not quite sure for reads...  I believe the
>> > > > reason why that happens is because the CPU copy proceeds very quickly
>> > > > (due to the reasons mentioned in the last paragraph), but the 
>> > > > hard_event
>> > > > is still associated with a pipe_fence synchronous with the GPU's 
>> > > > command
>> > > > stream, so it won't get signalled until the GPU catches up.
>> > 
>> > Hi,
>> > sorry for the delay, last week was submission week...
>> > 
>> 
>> No worries.
>> 
>> > The part that I'm still missing is what kind of GPU work needs to be
>> > done after clEnqueueRead*(). I assume all necessary work is completed
>> > before I can access the data.
>> > Also CL_QUEUED status was surprising. since we performed at least some
>> > of the work (we got the data), I'd expect CL_RUNNING or CL_SUBMITTED at
>> > least.
>> > 
>> 
>> The lag is not due to GPU work that needs to be performed after the
>> clEnqueueRead call, but due to GPU work that may precede it in the
>> command stream: Because clover doesn't know when the last time was that
>> the buffer was referenced by GPU work, the event is associated with a
>> fence synchronous with the current batch (which obviously won't signal
>> before any of the GPU work that actually referenced the buffer
>> completes).  However the pipe driver has a more accurate idea of when
>> the buffer was used last, so the transfer_map() operation is likely to
>> complete more quickly than the CL event status changes to CL_COMPLETE.
>> The reason you're seeing CL_QUEUED rather than CL_SUBMITTED is because
>> the read operation didn't even need to flush the current batch, so
>> there's no fence associated with the CL event object yet.
>
> thanks. so the event is waiting for the current batch, even if the
> buffer access is done out of order. The question is, why do we use the
> fence at all? If I understood correctly mapping the buffer will be
> delayed by the pipe driver if needed, so we don't really need the
> fence. Am I missing something?
>

The current CL event object implementation needs a gallium fence object
in order to find out what status it's in and in order to implement
waiting.  The problem is that it doesn't have any fence object available
that is close "enough" to the actual GPU operation that modified the
buffer for the last time.  There are several more or less hairy ways to
improve this, e.g. using a soft_event that's manually signaled when
triggered instead of a 
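
A minimal illustration of that idea (hypothetical sketch only, not clover's
actual soft_event implementation): an event that is completed by the
CPU-side action itself rather than by a GPU fence, so waiting on it never
involves the command stream.

   #include <condition_variable>
   #include <mutex>

   class manual_event {
      std::mutex mtx;
      std::condition_variable cv;
      bool signalled = false;

   public:
      // Called by the CPU-side action, e.g. right after soft_copy_op() runs.
      void signal() {
         std::lock_guard<std::mutex> lock(mtx);
         signalled = true;
         cv.notify_all();
      }

      // Waiters only depend on the action, not on any GPU work.
      void wait() {
         std::unique_lock<std::mutex> lock(mtx);
         cv.wait(lock, [this] { return signalled; });
      }
   };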

Re: [Mesa-dev] [PATCH v2 1/1] clover: Wait for requested operation if blocking flag is set

2017-08-14 Thread Jan Vesely
On Sat, 2017-08-12 at 20:14 -0700, Francisco Jerez wrote:
> Jan Vesely  writes:
> 
> > On Sat, 2017-08-05 at 12:34 -0700, Francisco Jerez wrote:
> > > Francisco Jerez  writes:
> > > 
> > > > Jan Vesely  writes:
> > > > 
> > > > > Hi,
> > > > > 
> > > > > thanks for detailed explanation. I indeed missed the writeBuffer part
> > > > > in specs.
> > > > > 
> > > > > On Wed, 2017-08-02 at 15:05 -0700, Francisco Jerez wrote:
> > > > > > These changes are somewhat redundant and potentially
> > > > > > performance-impacting, the reason is that in the OpenCL API,
> > > > > > clEnqueueWrite* commands are specified to block until the memory
> > > > > > provided by the application as origin can be reused safely (i.e. 
> > > > > > until
> > > > > > soft_copy_op()() runs), not necessarily until the transfer to 
> > > > > > graphics
> > > > > > memory has completed (which is what hard_event::wait() will wait 
> > > > > > for).
> > > > > > OTOH reads and maps as implemented by soft_copy_op and friends are
> > > > > > essentially blocking so the wait() call is redundant in most cases.
> > > > > 
> > > > > That explains a noticeable slowdown running piglit with these changes.
> > > > > I'm not sure about the read part though. I expected it to be basically
> > > > > a noop, but it changes behaviour.
> > > > 
> > > > I think this change would have slowed you down the most whenever the
> > > > mapping operation performed by soft_copy_op() is able to proceed
> > > > immediately, either because the buffer is idle (so the driver doesn't
> > > > stall on transfer_map()) *or* because the driver is trying to be smart
> > > > and creates a bounce buffer where data can be copied into immediately
> > > > without stalling, then inserts a pipelined GPU copy from the bounce
> > > > buffer into the real buffer.  With this patch you will stall on the GPU
> > > > copy regardless (and whatever other work was already on the command
> > > > stream which might be substantial), even though it wouldn't have been
> > > > necessary in any of these cases.
> > > > 
> > > > > Adding clGetEventInfo(CL_EVENT_COMMAND_EXECUTION_STATUS) after a
> > > > > blocking read in one of the piglit tests surprisingly returns
> > > > > CL_QUEUED.
> > > > > 
> > > > 
> > > > Hmm, yeah, that seems kind of debatable behaviour, although it's
> > > > definitely legit for writes, not quite sure for reads...  I believe the
> > > > reason why that happens is because the CPU copy proceeds very quickly
> > > > (due to the reasons mentioned in the last paragraph), but the hard_event
> > > > is still associated with a pipe_fence synchronous with the GPU's command
> > > > stream, so it won't get signalled until the GPU catches up.
> > 
> > Hi,
> > sorry for the delay, last week was submission week...
> > 
> 
> No worries.
> 
> > The part that I'm still missing is what kind of GPU work needs to be
> > done after clEnqueueRead*(). I assume all necessary work is completed
> > before I can access the data.
> > Also CL_QUEUED status was surprising. since we performed at least some
> > of the work (we got the data), I'd expect CL_RUNNING or CL_SUBMITTED at
> > least.
> > 
> 
> The lag is not due to GPU work that needs to be performed after the
> clEnqueueRead call, but due to GPU work that may precede it in the
> command stream: Because clover doesn't know when the last time was that
> the buffer was referenced by GPU work, the event is associated with a
> fence synchronous with the current batch (which obviously won't signal
> before any of the GPU work that actually referenced the buffer
> completes).  However the pipe driver has a more accurate idea of when
> the buffer was used last, so the transfer_map() operation is likely to
> complete more quickly than the CL event status changes to CL_COMPLETE.
> The reason you're seeing CL_QUEUED rather than CL_SUBMITTED is because
> the read operation didn't even need to flush the current batch, so
> there's no fence associated with the CL event object yet.

thanks. so the event is waiting for the current batch, even if the
buffer access is done out of order. The question is, why do we use the
fence at all? If I understood correctly mapping the buffer will be
delayed by the pipe driver if needed, so we don't really need the
fence. Am I missing something?

Jan

> 
> > > > 
> > > > > The specs don't mention use of events with blocking read, but it does
> > > > > say that "When the read command has completed, the contents of the
> > > > > buffer that ptr points to can be used by the application." in the non-
> > > > > blocking section. I'd say that the expectation is for the event to be
> > > > > CL_COMPLETE after blocking read (at least beignet/pocl/intel-cpu-sdk
> > > > > follow this).
> > > > > 
> > > > > > 
> > > > > > The only reason why it might be useful to behave differently on 
> > > > > > blocking
> > > > > > transfers is that the application may have specified a 

Re: [Mesa-dev] [PATCH v2 1/1] clover: Wait for requested operation if blocking flag is set

2017-08-14 Thread Aaron Watry
On Mon, Aug 14, 2017 at 4:29 PM, Aaron Watry  wrote:
> On Mon, Aug 14, 2017 at 1:53 PM, Francisco Jerez  
> wrote:
>> Aaron Watry  writes:
>>
>>> On Sat, Aug 12, 2017 at 10:14 PM, Francisco Jerez  
>>> wrote:
 Jan Vesely  writes:

> On Sat, 2017-08-05 at 12:34 -0700, Francisco Jerez wrote:
>> Francisco Jerez  writes:
>>
>> > Jan Vesely  writes:
>> >
>> > > Hi,
>> > >
>> > > thanks for detailed explanation. I indeed missed the writeBuffer part
>> > > in specs.
>> > >
>> > > On Wed, 2017-08-02 at 15:05 -0700, Francisco Jerez wrote:
>> > > > These changes are somewhat redundant and potentially
>> > > > performance-impacting, the reason is that in the OpenCL API,
>> > > > clEnqueueWrite* commands are specified to block until the memory
>> > > > provided by the application as origin can be reused safely (i.e. 
>> > > > until
>> > > > soft_copy_op()() runs), not necessarily until the transfer to 
>> > > > graphics
>> > > > memory has completed (which is what hard_event::wait() will wait 
>> > > > for).
>> > > > OTOH reads and maps as implemented by soft_copy_op and friends are
>> > > > essentially blocking so the wait() call is redundant in most cases.
>> > >
>> > > That explains a noticeable slowdown running piglit with these 
>> > > changes.
>> > > I'm not sure about the read part though. I expected it to be 
>> > > basically
>> > > a noop, but it changes behaviour.
>> >
>> > I think this change would have slowed you down the most whenever the
>> > mapping operation performed by soft_copy_op() is able to proceed
>> > immediately, either because the buffer is idle (so the driver doesn't
>> > stall on transfer_map()) *or* because the driver is trying to be smart
>> > and creates a bounce buffer where data can be copied into immediately
>> > without stalling, then inserts a pipelined GPU copy from the bounce
>> > buffer into the real buffer.  With this patch you will stall on the GPU
>> > copy regardless (and whatever other work was already on the command
>> > stream which might be substantial), even though it wouldn't have been
>> > necessary in any of these cases.
>> >
>> > > Adding clGetEventInfo(CL_EVENT_COMMAND_EXECUTION_STATUS) after a
>> > > blocking read in one of the piglit tests surprisingly returns
>> > > CL_QUEUED.
>> > >
>> >
>> > Hmm, yeah, that seems kind of debatable behaviour, although it's
>> > definitely legit for writes, not quite sure for reads...  I believe the
>> > reason why that happens is because the CPU copy proceeds very quickly
>> > (due to the reasons mentioned in the last paragraph), but the 
>> > hard_event
>> > is still associated with a pipe_fence synchronous with the GPU's 
>> > command
>> > stream, so it won't get signalled until the GPU catches up.
>
> Hi,
> sorry for the delay, last week was submission week...
>

 No worries.

> The part that I'm still missing is what kind of GPU work needs to be
> done after clEnqueueRead*(). I assume all necessary work is completed
> before I can access the data.
> Also CL_QUEUED status was surprising. since we performed at least some
> of the work (we got the data), I'd expect CL_RUNNING or CL_SUBMITTED at
> least.
>

 The lag is not due to GPU work that needs to be performed after the
 clEnqueueRead call, but due to GPU work that may precede it in the
 command stream: Because clover doesn't know when the last time was that
 the buffer was referenced by GPU work, the event is associated with a
 fence synchronous with the current batch (which obviously won't signal
 before any of the GPU work that actually referenced the buffer
 completes).  However the pipe driver has a more accurate idea of when
 the buffer was used last, so the transfer_map() operation is likely to
 complete more quickly than the CL event status changes to CL_COMPLETE.
 The reason you're seeing CL_QUEUED rather than CL_SUBMITTED is because
 the read operation didn't even need to flush the current batch, so
 there's no fence associated with the CL event object yet.
>>>
>>> Speaking of event status issues, I've been sitting on the attached
>>> patch (and some others) until my current series dealing with language
>>> versions is dealt with.
>>>
>>> Basically, our clSetEventCallback implementation is ignoring several
>>> event statuses that *should* be triggering the callbacks, and is
>>> instead generating errors which cause CTS failures.
>>>
>>> --Aaron
>>>

>> >
>> > > The specs don't mention use of events with blocking read, but it does
>> > > say that "When the read command has 

Re: [Mesa-dev] [PATCH v2 1/1] clover: Wait for requested operation if blocking flag is set

2017-08-14 Thread Aaron Watry
On Mon, Aug 14, 2017 at 1:53 PM, Francisco Jerez  wrote:
> Aaron Watry  writes:
>
>> On Sat, Aug 12, 2017 at 10:14 PM, Francisco Jerez  
>> wrote:
>>> Jan Vesely  writes:
>>>
 On Sat, 2017-08-05 at 12:34 -0700, Francisco Jerez wrote:
> Francisco Jerez  writes:
>
> > Jan Vesely  writes:
> >
> > > Hi,
> > >
> > > thanks for detailed explanation. I indeed missed the writeBuffer part
> > > in specs.
> > >
> > > On Wed, 2017-08-02 at 15:05 -0700, Francisco Jerez wrote:
> > > > These changes are somewhat redundant and potentially
> > > > performance-impacting, the reason is that in the OpenCL API,
> > > > clEnqueueWrite* commands are specified to block until the memory
> > > > provided by the application as origin can be reused safely (i.e. 
> > > > until
> > > > soft_copy_op()() runs), not necessarily until the transfer to 
> > > > graphics
> > > > memory has completed (which is what hard_event::wait() will wait 
> > > > for).
> > > > OTOH reads and maps as implemented by soft_copy_op and friends are
> > > > essentially blocking so the wait() call is redundant in most cases.
> > >
> > > That explains a noticeable slowdown running piglit with these changes.
> > > I'm not sure about the read part though. I expected it to be basically
> > > a noop, but it changes behaviour.
> >
> > I think this change would have slowed you down the most whenever the
> > mapping operation performed by soft_copy_op() is able to proceed
> > immediately, either because the buffer is idle (so the driver doesn't
> > stall on transfer_map()) *or* because the driver is trying to be smart
> > and creates a bounce buffer where data can be copied into immediately
> > without stalling, then inserts a pipelined GPU copy from the bounce
> > buffer into the real buffer.  With this patch you will stall on the GPU
> > copy regardless (and whatever other work was already on the command
> > stream which might be substantial), even though it wouldn't have been
> > necessary in any of these cases.
> >
> > > Adding clGetEventInfo(CL_EVENT_COMMAND_EXECUTION_STATUS) after a
> > > blocking read in one of the piglit tests surprisingly returns
> > > CL_QUEUED.
> > >
> >
> > Hmm, yeah, that seems kind of debatable behaviour, although it's
> > definitely legit for writes, not quite sure for reads...  I believe the
> > reason why that happens is because the CPU copy proceeds very quickly
> > (due to the reasons mentioned in the last paragraph), but the hard_event
> > is still associated with a pipe_fence synchronous with the GPU's command
> > stream, so it won't get signalled until the GPU catches up.

 Hi,
 sorry for the delay, last week was submission week...

>>>
>>> No worries.
>>>
 The part that I'm still missing is what kind of GPU work needs to be
 done after clEnqueueRead*(). I assume all necessary work is completed
 before I can access the data.
 Also CL_QUEUED status was surprising. since we performed at least some
 of the work (we got the data), I'd expect CL_RUNNING or CL_SUBMITTED at
 least.

>>>
>>> The lag is not due to GPU work that needs to be performed after the
>>> clEnqueueRead call, but due to GPU work that may precede it in the
>>> command stream: Because clover doesn't know when the last time was that
>>> the buffer was referenced by GPU work, the event is associated with a
>>> fence synchronous with the current batch (which obviously won't signal
>>> before any of the GPU work that actually referenced the buffer
>>> completes).  However the pipe driver has a more accurate idea of when
>>> the buffer was used last, so the transfer_map() operation is likely to
>>> complete more quickly than the CL event status changes to CL_COMPLETE.
>>> The reason you're seeing CL_QUEUED rather than CL_SUBMITTED is because
>>> the read operation didn't even need to flush the current batch, so
>>> there's no fence associated with the CL event object yet.
>>
>> Speaking of event status issues, I've been sitting on the attached
>> patch (and some others) until my current series dealing with language
>> versions is dealt with.
>>
>> Basically, our clSetEventCallback implementation is ignoring several
>> event statuses that *should* be triggering the callbacks, and is
>> instead generating errors which cause CTS failures.
>>
>> --Aaron
>>
>>>
> >
> > > The specs don't mention use of events with blocking read, but it does
> > > say that "When the read command has completed, the contents of the
> > > buffer that ptr points to can be used by the application." in the non-
> > > blocking section. I'd say that the expectation is for the event to be
> > > CL_COMPLETE after 

Re: [Mesa-dev] [PATCH v2 1/1] clover: Wait for requested operation if blocking flag is set

2017-08-14 Thread Francisco Jerez
Aaron Watry  writes:

> On Sat, Aug 12, 2017 at 10:14 PM, Francisco Jerez  
> wrote:
>> Jan Vesely  writes:
>>
>>> On Sat, 2017-08-05 at 12:34 -0700, Francisco Jerez wrote:
 Francisco Jerez  writes:

 > Jan Vesely  writes:
 >
 > > Hi,
 > >
 > > thanks for detailed explanation. I indeed missed the writeBuffer part
 > > in specs.
 > >
 > > On Wed, 2017-08-02 at 15:05 -0700, Francisco Jerez wrote:
 > > > These changes are somewhat redundant and potentially
 > > > performance-impacting, the reason is that in the OpenCL API,
 > > > clEnqueueWrite* commands are specified to block until the memory
 > > > provided by the application as origin can be reused safely (i.e. 
 > > > until
 > > > soft_copy_op()() runs), not necessarily until the transfer to 
 > > > graphics
 > > > memory has completed (which is what hard_event::wait() will wait 
 > > > for).
 > > > OTOH reads and maps as implemented by soft_copy_op and friends are
 > > > essentially blocking so the wait() call is redundant in most cases.
 > >
 > > That explains a noticeable slowdown running piglit with these changes.
 > > I'm not sure about the read part though. I expected it to be basically
 > > a noop, but it changes behaviour.
 >
 > I think this change would have slowed you down the most whenever the
 > mapping operation performed by soft_copy_op() is able to proceed
 > immediately, either because the buffer is idle (so the driver doesn't
 > stall on transfer_map()) *or* because the driver is trying to be smart
 > and creates a bounce buffer where data can be copied into immediately
 > without stalling, then inserts a pipelined GPU copy from the bounce
 > buffer into the real buffer.  With this patch you will stall on the GPU
 > copy regardless (and whatever other work was already on the command
 > stream which might be substantial), even though it wouldn't have been
 > necessary in any of these cases.
 >
 > > Adding clGetEventInfo(CL_EVENT_COMMAND_EXECUTION_STATUS) after a
 > > blocking read in one of the piglit tests surprisingly returns
 > > CL_QUEUED.
 > >
 >
 > Hmm, yeah, that seems kind of debatable behaviour, although it's
 > definitely legit for writes, not quite sure for reads...  I believe the
 > reason why that happens is because the CPU copy proceeds very quickly
 > (due to the reasons mentioned in the last paragraph), but the hard_event
 > is still associated with a pipe_fence synchronous with the GPU's command
 > stream, so it won't get signalled until the GPU catches up.
>>>
>>> Hi,
>>> sorry for the delay, last week was submission week...
>>>
>>
>> No worries.
>>
>>> The part that I'm still missing is what kind of GPU work needs to be
>>> done after clEnqueueRead*(). I assume all necessary work is completed
>>> before I can access the data.
>>> Also CL_QUEUED status was surprising. since we performed at least some
>>> of the work (we got the data), I'd expect CL_RUNNING or CL_SUBMITTED at
>>> least.
>>>
>>
>> The lag is not due to GPU work that needs to be performed after the
>> clEnqueueRead call, but due to GPU work that may precede it in the
>> command stream: Because clover doesn't know when the last time was that
>> the buffer was referenced by GPU work, the event is associated with a
>> fence synchronous with the current batch (which obviously won't signal
>> before any of the GPU work that actually referenced the buffer
>> completes).  However the pipe driver has a more accurate idea of when
>> the buffer was used last, so the transfer_map() operation is likely to
>> complete more quickly than the CL event status changes to CL_COMPLETE.
>> The reason you're seeing CL_QUEUED rather than CL_SUBMITTED is because
>> the read operation didn't even need to flush the current batch, so
>> there's no fence associated with the CL event object yet.
>
> Speaking of event status issues, I've been sitting on the attached
> patch (and some others) until my current series dealing with language
> versions is dealt with.
>
> Basically, our clSetEventCallback implementation is ignoring several
> event statuses that *should* be triggering the callbacks, and is
> instead generating errors which cause CTS failures.
>
> --Aaron
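
(For reference, the registration in question -- a sketch with placeholder
callback and handles; per the spec CL_SUBMITTED, CL_RUNNING and CL_COMPLETE
must all be accepted as trigger statuses:)

   #include <CL/cl.h>
   #include <stdio.h>

   static void CL_CALLBACK notify(cl_event ev, cl_int status, void *user_data)
   {
      /* 'status' is the execution status that triggered the callback. */
      printf("event %p reached status %d\n", (void *)ev, status);
   }

   static void register_callbacks(cl_event ev)
   {
      /* Each callback fires once the event reaches (or has passed) the
         given status. */
      clSetEventCallback(ev, CL_SUBMITTED, notify, NULL);
      clSetEventCallback(ev, CL_RUNNING, notify, NULL);
      clSetEventCallback(ev, CL_COMPLETE, notify, NULL);
   }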
>
>>
 >
 > > The specs don't mention use of events with blocking read, but it does
 > > say that "When the read command has completed, the contents of the
 > > buffer that ptr points to can be used by the application." in the non-
 > > blocking section. I'd say that the expectation is for the event to be
 > > CL_COMPLETE after blocking read (at least beignet/pocl/intel-cpu-sdk
 > > follow this).
 > >
 > > >
 > > > The only reason why it might be useful to behave differently on 
 > > > 

Re: [Mesa-dev] [PATCH v2 1/1] clover: Wait for requested operation if blocking flag is set

2017-08-12 Thread Francisco Jerez
Jan Vesely  writes:

> On Sat, 2017-08-05 at 12:34 -0700, Francisco Jerez wrote:
>> Francisco Jerez  writes:
>> 
>> > Jan Vesely  writes:
>> > 
>> > > Hi,
>> > > 
>> > > thanks for detailed explanation. I indeed missed the writeBuffer part
>> > > in specs.
>> > > 
>> > > On Wed, 2017-08-02 at 15:05 -0700, Francisco Jerez wrote:
>> > > > These changes are somewhat redundant and potentially
>> > > > performance-impacting, the reason is that in the OpenCL API,
>> > > > clEnqueueWrite* commands are specified to block until the memory
>> > > > provided by the application as origin can be reused safely (i.e. until
>> > > > soft_copy_op()() runs), not necessarily until the transfer to graphics
>> > > > memory has completed (which is what hard_event::wait() will wait for).
>> > > > OTOH reads and maps as implemented by soft_copy_op and friends are
>> > > > essentially blocking so the wait() call is redundant in most cases.
>> > > 
>> > > That explains a noticeable slowdown running piglit with these changes.
>> > > I'm not sure about the read part though. I expected it to be basically
>> > > a noop, but it changes behaviour.
>> > 
>> > I think this change would have slowed you down the most whenever the
>> > mapping operation performed by soft_copy_op() is able to proceed
>> > immediately, either because the buffer is idle (so the driver doesn't
>> > stall on transfer_map()) *or* because the driver is trying to be smart
>> > and creates a bounce buffer where data can be copied into immediately
>> > without stalling, then inserts a pipelined GPU copy from the bounce
>> > buffer into the real buffer.  With this patch you will stall on the GPU
>> > copy regardless (and whatever other work was already on the command
>> > stream which might be substantial), even though it wouldn't have been
>> > necessary in any of these cases.
>> > 
>> > > Adding clGetEventInfo(CL_EVENT_COMMAND_EXECUTION_STATUS) after a
>> > > blocking read in one of the piglit tests surprisingly returns
>> > > CL_QUEUED.
>> > > 
>> > 
>> > Hmm, yeah, that seems kind of debatable behaviour, although it's
>> > definitely legit for writes, not quite sure for reads...  I believe the
>> > reason why that happens is because the CPU copy proceeds very quickly
>> > (due to the reasons mentioned in the last paragraph), but the hard_event
>> > is still associated with a pipe_fence synchronous with the GPU's command
>> > stream, so it won't get signalled until the GPU catches up.
>
> Hi,
> sorry for the delay, last week was submission week...
>

No worries.

> The part that I'm still missing is what kind of GPU work needs to be
> done after clEnqueueRead*(). I assume all necessary work is completed
> before I can access the data.
> Also CL_QUEUED status was surprising. since we performed at least some
> of the work (we got the data), I'd expect CL_RUNNING or CL_SUBMITTED at
> least.
>

The lag is not due to GPU work that needs to be performed after the
clEnqueueRead call, but due to GPU work that may precede it in the
command stream: Because clover doesn't know when the last time was that
the buffer was referenced by GPU work, the event is associated with a
fence synchronous with the current batch (which obviously won't signal
before any of the GPU work that actually referenced the buffer
completes).  However the pipe driver has a more accurate idea of when
the buffer was used last, so the transfer_map() operation is likely to
complete more quickly than the CL event status changes to CL_COMPLETE.
The reason you're seeing CL_QUEUED rather than CL_SUBMITTED is because
the read operation didn't even need to flush the current batch, so
there's no fence associated with the CL event object yet.
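
To make that concrete (illustrative snippet; assumes <CL/cl.h> and existing
q, buf, ptr and size -- the exact status observed depends on what else is
sitting in the current batch):

   cl_event ev;
   cl_int status;
   clEnqueueReadBuffer(q, buf, CL_TRUE, 0, size, ptr, 0, NULL, &ev);
   /* The data in ptr is already valid here... */
   clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
                  sizeof(status), &status, NULL);
   /* ...yet 'status' may still be CL_QUEUED, because the event's fence is
      tied to the (unflushed) current batch rather than to the copy. */
   clFlush(q);    /* submit the batch; the status can then advance */
   clFinish(q);   /* and must be CL_COMPLETE once this returns */
   clReleaseEvent(ev);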

>> > 
>> > > The specs don't mention use of events with blocking read, but it does
>> > > say that "When the read command has completed, the contents of the
>> > > buffer that ptr points to can be used by the application." in the non-
>> > > blocking section. I'd say that the expectation is for the event to be
>> > > CL_COMPLETE after blocking read (at least beignet/pocl/intel-cpu-sdk
>> > > follow this).
>> > > 
>> > > > 
>> > > > The only reason why it might be useful to behave differently on 
>> > > > blocking
>> > > > transfers is that the application may have specified a user event in 
>> > > > the
>> > > > event dependency list, which may cause the soft_copy_op() call to be
>> > > > delayed until the application signals the user event.  In order to fix
>> > > > that it should really only be necessary to wait for the event action to
>> > > > be executed, not necessarily its associated GPU work.
>> > > 
>> > > I think that another use is that non-blocking writes do not create an
>> > > extra copy of the buffer. Thus
>> > > clEnqueueWriteBuffer(...,cl_false, ev, ...)
>> > > clWaitForEvents(ev)
>> > > is more memory efficient.
>> > > 
>> > > > 
>> > > > Last time this issue came up (yeah it's not the 

Re: [Mesa-dev] [PATCH v2 1/1] clover: Wait for requested operation if blocking flag is set

2017-08-12 Thread Jan Vesely
On Sat, 2017-08-05 at 12:34 -0700, Francisco Jerez wrote:
> Francisco Jerez  writes:
> 
> > Jan Vesely  writes:
> > 
> > > Hi,
> > > 
> > > thanks for detailed explanation. I indeed missed the writeBuffer part
> > > in specs.
> > > 
> > > On Wed, 2017-08-02 at 15:05 -0700, Francisco Jerez wrote:
> > > > These changes are somewhat redundant and potentially
> > > > performance-impacting, the reason is that in the OpenCL API,
> > > > clEnqueueWrite* commands are specified to block until the memory
> > > > provided by the application as origin can be reused safely (i.e. until
> > > > soft_copy_op()() runs), not necessarily until the transfer to graphics
> > > > memory has completed (which is what hard_event::wait() will wait for).
> > > > OTOH reads and maps as implemented by soft_copy_op and friends are
> > > > essentially blocking so the wait() call is redundant in most cases.
> > > 
> > > That explains a noticeable slowdown running piglit with these changes.
> > > I'm not sure about the read part though. I expected it to be basically
> > > a noop, but it changes behaviour.
> > 
> > I think this change would have slowed you down the most whenever the
> > mapping operation performed by soft_copy_op() is able to proceed
> > immediately, either because the buffer is idle (so the driver doesn't
> > stall on transfer_map()) *or* because the driver is trying to be smart
> > and creates a bounce buffer where data can be copied into immediately
> > without stalling, then inserts a pipelined GPU copy from the bounce
> > buffer into the real buffer.  With this patch you will stall on the GPU
> > copy regardless (and whatever other work was already on the command
> > stream which might be substantial), even though it wouldn't have been
> > necessary in any of these cases.
> > 
> > > Adding clGetEventInfo(CL_EVENT_COMMAND_EXECUTION_STATUS) after a
> > > blocking read in one of the piglit tests surprisingly returns
> > > CL_QUEUED.
> > > 
> > 
> > Hmm, yeah, that seems kind of debatable behaviour, although it's
> > definitely legit for writes, not quite sure for reads...  I believe the
> > reason why that happens is because the CPU copy proceeds very quickly
> > (due to the reasons mentioned in the last paragraph), but the hard_event
> > is still associated with a pipe_fence synchronous with the GPU's command
> > stream, so it won't get signalled until the GPU catches up.

Hi,
sorry for the delay, last week was submission week...

The part that I'm still missing is what kind of GPU work needs to be
done after clEnqueueRead*(). I assume all necessary work is completed
before I can access the data.
Also, the CL_QUEUED status was surprising: since we performed at least some
of the work (we got the data), I'd expect CL_RUNNING or CL_SUBMITTED at
least.

> > 
> > > The specs don't mention use of events with blocking read, but it does
> > > say that "When the read command has completed, the contents of the
> > > buffer that ptr points to can be used by the application." in the non-
> > > blocking section. I'd say that the expectation is for the event to be
> > > CL_COMPLETE after blocking read (at least beignet/pocl/intel-cpu-sdk
> > > follow this).
> > > 
> > > > 
> > > > The only reason why it might be useful to behave differently on blocking
> > > > transfers is that the application may have specified a user event in the
> > > > event dependency list, which may cause the soft_copy_op() call to be
> > > > delayed until the application signals the user event.  In order to fix
> > > > that it should really only be necessary to wait for the event action to
> > > > be executed, not necessarily its associated GPU work.
> > > 
> > > I think that another use is that non-blocking writes do not create an
> > > extra copy of the buffer. Thus
> > > clEnqueueWriteBuffer(...,cl_false, ev, ...)
> > > clWaitForEvents(ev)
> > > is more memory efficient.
> > > 
> > > > 
> > > > Last time this issue came up (yeah it's not the first time) I proposed
> > > > the patches below to add a mechanism to wait for the event action only,
> > > > feel free to include it as PATCH 0.1 and 0.2 of this series (it's been a
> > > > while so they may no longer apply cleanly).
> > > 
> > > I think we can just add comments explaining why the blocking argument
> > > is ignored, until someone chooses to fix this problem
> > 
> > I think the problem is definitely worth fixing, and it shouldn't really
> > take more effort than adding comments explaining the current behaviour
> > ;), basically just add a bunch of 'if (blocking)
> > hev().wait_signalled();' where the spec requires it, roughly as you had
> > been doing in this patch, but wait_signalled() should only stall on the
> > CPU work associated with the event, which should give you the same
> > performance as the current approach.

I can send a patch that replaces wait() -> wait_signalled()

> > 
> > > and/or to
> > > implement proper non-blocking variants (would 

Re: [Mesa-dev] [PATCH v2 1/1] clover: Wait for requested operation if blocking flag is set

2017-08-05 Thread Francisco Jerez
Francisco Jerez  writes:

> Jan Vesely  writes:
>
>> Hi,
>>
>> thanks for detailed explanation. I indeed missed the writeBuffer part
>> in specs.
>>
>> On Wed, 2017-08-02 at 15:05 -0700, Francisco Jerez wrote:
>>> These changes are somewhat redundant and potentially
>>> performance-impacting, the reason is that in the OpenCL API,
>>> clEnqueueWrite* commands are specified to block until the memory
>>> provided by the application as origin can be reused safely (i.e. until
>>> soft_copy_op()() runs), not necessarily until the transfer to graphics
>>> memory has completed (which is what hard_event::wait() will wait for).
>>> OTOH reads and maps as implemented by soft_copy_op and friends are
>>> essentially blocking so the wait() call is redundant in most cases.
>>
>> That explains a noticeable slowdown running piglit with these changes.
>> I'm not sure about the read part though. I expected it to be basically
>> a noop, but it changes behaviour.
>
> I think this change would have slowed you down the most whenever the
> mapping operation performed by soft_copy_op() is able to proceed
> immediately, either because the buffer is idle (so the driver doesn't
> stall on transfer_map()) *or* because the driver is trying to be smart
> and creates a bounce buffer where data can be copied into immediately
> without stalling, then inserts a pipelined GPU copy from the bounce
> buffer into the real buffer.  With this patch you will stall on the GPU
> copy regardless (and whatever other work was already on the command
> stream which might be substantial), even though it wouldn't have been
> necessary in any of these cases.
>
>> Adding clGetEventInfo(CL_EVENT_COMMAND_EXECUTION_STATUS) after a
>> blocking read in one of the piglit tests surprisingly returns
>> CL_QUEUED.
>>
>
> Hmm, yeah, that seems kind of debatable behaviour, although it's
> definitely legit for writes, not quite sure for reads...  I believe the
> reason why that happens is because the CPU copy proceeds very quickly
> (due to the reasons mentioned in the last paragraph), but the hard_event
> is still associated with a pipe_fence synchronous with the GPU's command
> stream, so it won't get signalled until the GPU catches up.
>
>> The specs don't mention use of events with blocking read, but it does
>> say that "When the read command has completed, the contents of the
>> buffer that ptr points to can be used by the application." in the non-
>> blocking section. I'd say that the expectation is for the event to be
>> CL_COMPLETE after blocking read (at least beignet/pocl/intel-cpu-sdk
>> follow this).
>>
>>> 
>>> The only reason why it might be useful to behave differently on blocking
>>> transfers is that the application may have specified a user event in the
>>> event dependency list, which may cause the soft_copy_op() call to be
>>> delayed until the application signals the user event.  In order to fix
>>> that it should really only be necessary to wait for the event action to
>>> be executed, not necessarily its associated GPU work.
>>
>> I think that another use is that non-blocking writes do not create an
>> extra copy of the buffer. Thus
>> clEnqueueWriteBuffer(...,cl_false, ev, ...)
>> clWaitForEvents(ev)
>> is more memory efficient.
>>
>>> 
>>> Last time this issue came up (yeah it's not the first time) I proposed
>>> the patches below to add a mechanism to wait for the event action only,
>>> feel free to include it as PATCH 0.1 and 0.2 of this series (it's been a
>>> while so they may no longer apply cleanly).
>>
>> I think we can just add comments explaining why the blocking argument
>> is ignored, until someone chooses to fix this problem
>
> I think the problem is definitely worth fixing, and it shouldn't really
> take more effort than adding comments explaining the current behaviour
> ;), basically just add a bunch of 'if (blocking)
> hev().wait_signalled();' where the spec requires it, roughly as you had
> been doing in this patch, but wait_signalled() should only stall on the
> CPU work associated with the event, which should give you the same
> performance as the current approach.
>
>> and/or to
>> implement proper non-blocking variants (would std::async work for
>> trivial cases like ReadBuffer?)
>>

Hm, and to answer this question -- Yeah, std::async would probably work,
but I'm not certain whether it would actually perform better than the
current approach, because on the one hand the actual DMA-ing of the
buffer is likely to happen quasi-asynchronously already assuming the
driver is competent, and OTOH because spawning a new thread for the copy
would introduce additional overhead that might defeat your purpose
unless the copy is very large -- Only experimentation will tell whether
it pays off.
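
A rough sketch of that (hypothetical, not a proposed patch; it assumes the
source pointer stays valid until the returned future is waited on, and it
ignores error handling):

   #include <cstddef>
   #include <cstring>
   #include <future>

   // Run the host-side copy on a worker thread; whoever completes the CL
   // event would wait on the future instead of doing the memcpy inline.
   std::future<void> async_copy(void *dst, const void *src, std::size_t size)
   {
      return std::async(std::launch::async, [=] {
         std::memcpy(dst, src, size);
      });
   }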

>> thanks,
>> Jan
>>
>>> 
>>> Thank you.
>>> 
>>> Jan Vesely  writes:
>>> 
>>> > v2: wait in map_buffer and map_image as well
>>> > 
>>> > Signed-off-by: Jan Vesely 

Re: [Mesa-dev] [PATCH v2 1/1] clover: Wait for requested operation if blocking flag is set

2017-08-05 Thread Francisco Jerez
Jan Vesely  writes:

> Hi,
>
> thanks for detailed explanation. I indeed missed the writeBuffer part
> in specs.
>
> On Wed, 2017-08-02 at 15:05 -0700, Francisco Jerez wrote:
>> These changes are somewhat redundant and potentially
>> performance-impacting, the reason is that in the OpenCL API,
>> clEnqueueWrite* commands are specified to block until the memory
>> provided by the application as origin can be reused safely (i.e. until
>> soft_copy_op()() runs), not necessarily until the transfer to graphics
>> memory has completed (which is what hard_event::wait() will wait for).
>> OTOH reads and maps as implemented by soft_copy_op and friends are
>> essentially blocking so the wait() call is redundant in most cases.
>
> That explains a noticeable slowdown running piglit with these changes.
> I'm not sure about the read part though. I expected it to be basically
> a noop, but it changes behaviour.

I think this change would have slowed you down the most whenever the
mapping operation performed by soft_copy_op() is able to proceed
immediately, either because the buffer is idle (so the driver doesn't
stall on transfer_map()) *or* because the driver is trying to be smart
and creates a bounce buffer where data can be copied into immediately
without stalling, then inserts a pipelined GPU copy from the bounce
buffer into the real buffer.  With this patch you will stall on the GPU
copy regardless (and whatever other work was already on the command
stream which might be substantial), even though it wouldn't have been
necessary in any of these cases.
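
Concretely (sketch; q, kernel, gws, buf, data and size are assumed), with
the patch the blocking write below would also wait for the unrelated kernel
to finish on the GPU, even though the spec only requires that 'data' be
reusable when the call returns:

   clEnqueueNDRangeKernel(q, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
   /* hard_event::wait() stalls on the whole command stream up to this
      point, not just on soft_copy_op() consuming 'data'. */
   clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, size, data, 0, NULL, NULL);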

> Adding clGetEventInfo(CL_EVENT_COMMAND_EXECUTION_STATUS) after a
> blocking read in one of the piglit tests surprisingly returns
> CL_QUEUED.
>

Hmm, yeah, that seems kind of debatable behaviour, although it's
definitely legit for writes, not quite sure for reads...  I believe the
reason why that happens is because the CPU copy proceeds very quickly
(due to the reasons mentioned in the last paragraph), but the hard_event
is still associated with a pipe_fence synchronous with the GPU's command
stream, so it won't get signalled until the GPU catches up.

> The specs don't mention use of events with blocking read, but it does
> say that "When the read command has completed, the contents of the
> buffer that ptr points to can be used by the application." in the non-
> blocking section. I'd say that the expectation is for the event to be
> CL_COMPLETE after blocking read (at least beignet/pocl/intel-cpu-sdk
> follow this).
>
>> 
>> The only reason why it might be useful to behave differently on blocking
>> transfers is that the application may have specified a user event in the
>> event dependency list, which may cause the soft_copy_op() call to be
>> delayed until the application signals the user event.  In order to fix
>> that it should really only be necessary to wait for the event action to
>> be executed, not necessarily its associated GPU work.
>
> I think that another use is that non-blocking writes do not create an
> extra copy of the buffer. Thus
> clEnqueueWriteBuffer(...,cl_false, ev, ...)
> clWaitForEvents(ev)
> is more memory efficient.
>
>> 
>> Last time this issue came up (yeah it's not the first time) I proposed
>> the patches below to add a mechanism to wait for the event action only,
>> feel free to include it as PATCH 0.1 and 0.2 of this series (it's been a
>> while so they may no longer apply cleanly).
>
> I think we can just add comments explaining why the blocking argument
> is ignored, until someone chooses to fix this problem

I think the problem is definitely worth fixing, and it shouldn't really
take more effort than adding comments explaining the current behaviour
;), basically just add a bunch of 'if (blocking)
hev().wait_signalled();' where the spec requires it, roughly as you had
been doing in this patch, but wait_signalled() should only stall on the
CPU work associated with the event, which should give you the same
performance as the current approach.
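
In other words, roughly this shape in each entry point (sketch, assuming
the wait_signalled() helper from the patches posted earlier in the thread;
compare the wait() calls in the patch quoted below):

   if (blocking)
      hev().wait_signalled();

   ret_object(rd_ev, hev);
   return CL_SUCCESS;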

> and/or to
> implement proper non-blocking variants (would std::async work for
> trivial cases like ReadBuffer?)
>
> thanks,
> Jan
>
>> 
>> Thank you.
>> 
>> Jan Vesely  writes:
>> 
>> > v2: wait in map_buffer and map_image as well
>> > 
>> > Signed-off-by: Jan Vesely 
>> > ---
>> > Hi Aaron,
>> > 
>> > yes, I think you're right, we should wait in Map* as well.
>> > If nothing else it's consistent, even if passing the flag to add_map might 
>> > make it unnecessary (haven't actually checked).
>> > 
>> > thanks,
>> > Jan
>> > 
>> >  src/gallium/state_trackers/clover/api/transfer.cpp | 30 
>> > --
>> >  1 file changed, 28 insertions(+), 2 deletions(-)
>> > 
>> > diff --git a/src/gallium/state_trackers/clover/api/transfer.cpp 
>> > b/src/gallium/state_trackers/clover/api/transfer.cpp
>> > index f7046253be..729a34590e 100644
>> > --- a/src/gallium/state_trackers/clover/api/transfer.cpp
>> > +++ 

Re: [Mesa-dev] [PATCH v2 1/1] clover: Wait for requested operation if blocking flag is set

2017-08-03 Thread Jan Vesely
Hi,

thanks for the detailed explanation. I indeed missed the writeBuffer part
in the specs.

On Wed, 2017-08-02 at 15:05 -0700, Francisco Jerez wrote:
> These changes are somewhat redundant and potentially
> performance-impacting, the reason is that in the OpenCL API,
> clEnqueueWrite* commands are specified to block until the memory
> provided by the application as origin can be reused safely (i.e. until
> soft_copy_op()() runs), not necessarily until the transfer to graphics
> memory has completed (which is what hard_event::wait() will wait for).
> OTOH reads and maps as implemented by soft_copy_op and friends are
> essentially blocking so the wait() call is redundant in most cases.

That explains a noticeable slowdown running piglit with these changes.
I'm not sure about the read part though. I expected it to be basically
a noop, but it changes behaviour.
Adding clGetEventInfo(CL_EVENT_COMMAND_EXECUTION_STATUS) after a
blocking read in one of the piglit tests surprisingly returns
CL_QUEUED.

The specs don't mention use of events with blocking read, but it does
say that "When the read command has completed, the contents of the
buffer that ptr points to can be used by the application." in the non-
blocking section. I'd say that the expectation is for the event to be
CL_COMPLETE after blocking read (at least beignet/pocl/intel-cpu-sdk
follow this).
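
The check in question is essentially (sketch; assumes <CL/cl.h> and
existing q, buf, ptr and size):

   cl_event ev;
   cl_int status;
   clEnqueueReadBuffer(q, buf, CL_TRUE, 0, size, ptr, 0, NULL, &ev);
   clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
                  sizeof(status), &status, NULL);
   /* beignet/pocl/the Intel CPU runtime report CL_COMPLETE here;
      clover currently still reports CL_QUEUED. */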

> 
> The only reason why it might be useful to behave differently on blocking
> transfers is that the application may have specified a user event in the
> event dependency list, which may cause the soft_copy_op() call to be
> delayed until the application signals the user event.  In order to fix
> that it should really only be necessary to wait for the event action to
> be executed, not necessarily its associated GPU work.

I think that another use is that non-blocking writes do not create an
extra copy of the buffer. Thus
clEnqueueWriteBuffer(...,cl_false, ev, ...)
clWaitForEvents(ev)
is more memory efficient.
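
(Sketch of that pattern; assumes <CL/cl.h> and existing q, buf, host_ptr
and size. Per the spec host_ptr must not be modified between the enqueue
and the wait:)

   cl_event ev;
   clEnqueueWriteBuffer(q, buf, CL_FALSE, 0, size, host_ptr, 0, NULL, &ev);
   /* No snapshot of host_ptr is taken, so no extra copy of the buffer... */
   clWaitForEvents(1, &ev);
   /* ...and only after this may the application reuse host_ptr. */
   clReleaseEvent(ev);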

> 
> Last time this issue came up (yeah it's not the first time) I proposed
> the patches below to add a mechanism to wait for the event action only,
> feel free to include it as PATCH 0.1 and 0.2 of this series (it's been a
> while so they may no longer apply cleanly).

I think we can just add comments explaining why the blocking argument
is ignored, until someone chooses to fix this problem and/or to
implement proper non-blocking variants (would std::async work for
trivial cases like ReadBuffer?)

thanks,
Jan

> 
> Thank you.
> 
> Jan Vesely  writes:
> 
> > v2: wait in map_buffer and map_image as well
> > 
> > Signed-off-by: Jan Vesely 
> > ---
> > Hi Aaron,
> > 
> > yes, I think you're right, we should wait in Map* as well.
> > If nothing else it's consistent, even if passing the flag to add_map might 
> > make it unnecessary (haven't actually checked).
> > 
> > thanks,
> > Jan
> > 
> >  src/gallium/state_trackers/clover/api/transfer.cpp | 30 
> > --
> >  1 file changed, 28 insertions(+), 2 deletions(-)
> > 
> > diff --git a/src/gallium/state_trackers/clover/api/transfer.cpp 
> > b/src/gallium/state_trackers/clover/api/transfer.cpp
> > index f7046253be..729a34590e 100644
> > --- a/src/gallium/state_trackers/clover/api/transfer.cpp
> > +++ b/src/gallium/state_trackers/clover/api/transfer.cpp
> > @@ -295,6 +295,9 @@ clEnqueueReadBuffer(cl_command_queue d_q, cl_mem d_mem, 
> > cl_bool blocking,
> > , obj_origin, obj_pitch,
> > region));
> >  
> > +   if (blocking)
> > +   hev().wait();
> > +
> > ret_object(rd_ev, hev);
> > return CL_SUCCESS;
> >  
> > @@ -325,6 +328,9 @@ clEnqueueWriteBuffer(cl_command_queue d_q, cl_mem 
> > d_mem, cl_bool blocking,
> > ptr, {}, obj_pitch,
> > region));
> >  
> > +   if (blocking)
> > +   hev().wait();
> > +
> > ret_object(rd_ev, hev);
> > return CL_SUCCESS;
> >  
> > @@ -362,6 +368,9 @@ clEnqueueReadBufferRect(cl_command_queue d_q, cl_mem 
> > d_mem, cl_bool blocking,
> > , obj_origin, obj_pitch,
> > region));
> >  
> > +   if (blocking)
> > +   hev().wait();
> > +
> > ret_object(rd_ev, hev);
> > return CL_SUCCESS;
> >  
> > @@ -399,6 +408,9 @@ clEnqueueWriteBufferRect(cl_command_queue d_q, cl_mem 
> > d_mem, cl_bool blocking,
> > ptr, host_origin, host_pitch,
> > region));
> >  
> > +   if (blocking)
> > +   hev().wait();
> > +
> > ret_object(rd_ev, hev);
> > return CL_SUCCESS;
> >  
> > @@ -504,6 +516,9 @@ clEnqueueReadImage(cl_command_queue d_q, cl_mem d_mem, 
> > cl_bool blocking,
> > , src_origin, src_pitch,
> > region));
> >  
> > +   if (blocking)
> > +   hev().wait();
> > +
> > ret_object(rd_ev, hev);
> > return CL_SUCCESS;
> >  
> > @@ -538,6 +553,9 @@ clEnqueueWriteImage(cl_command_queue d_q, 

Re: [Mesa-dev] [PATCH v2 1/1] clover: Wait for requested operation if blocking flag is set

2017-08-02 Thread Francisco Jerez
These changes are somewhat redundant and potentially
performance-impacting; the reason is that in the OpenCL API,
clEnqueueWrite* commands are specified to block until the memory
provided by the application as origin can be reused safely (i.e. until
soft_copy_op()() runs), not necessarily until the transfer to graphics
memory has completed (which is what hard_event::wait() will wait for).
OTOH reads and maps as implemented by soft_copy_op and friends are
essentially blocking so the wait() call is redundant in most cases.
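
For example (sketch; assumes <CL/cl.h>, <string.h> and existing q and buf),
this is legal even if the transfer into graphics memory is still in flight
when the call returns:

   char data[4096];
   memset(data, 0xab, sizeof(data));
   /* "Blocking" only guarantees that 'data' is safe to reuse on return,
      i.e. that soft_copy_op() has consumed it. */
   clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);
   memset(data, 0, sizeof(data));   /* safe to reuse immediately */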

The only reason why it might be useful to behave differently on blocking
transfers is that the application may have specified a user event in the
event dependency list, which may cause the soft_copy_op() call to be
delayed until the application signals the user event.  In order to fix
that it should really only be necessary to wait for the event action to
be executed, not necessarily its associated GPU work.
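
That case looks roughly like this (sketch; assumes <CL/cl.h> and existing
ctx, q, buf, data and size -- with blocking=CL_TRUE the enqueue call itself
would block here until another thread signals the user event):

   cl_int err;
   cl_event user_ev = clCreateUserEvent(ctx, &err);
   cl_event write_ev;
   clEnqueueWriteBuffer(q, buf, CL_FALSE, 0, size, data,
                        1, &user_ev, &write_ev);
   /* soft_copy_op() is held back until the application signals: */
   clSetUserEventStatus(user_ev, CL_COMPLETE);
   clWaitForEvents(1, &write_ev);   /* only now may 'data' be reused */
   clReleaseEvent(user_ev);
   clReleaseEvent(write_ev);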

Last time this issue came up (yeah it's not the first time) I proposed
the patches below to add a mechanism to wait for the event action only,
feel free to include it as PATCH 0.1 and 0.2 of this series (it's been a
while so they may no longer apply cleanly).

Thank you.

Jan Vesely  writes:

> v2: wait in map_buffer and map_image as well
>
> Signed-off-by: Jan Vesely 
> ---
> Hi Aaron,
>
> yes, I think you're right, we should wait in Map* as well.
> If nothing else it's consistent, even if passing the flag to add_map might 
> make it unnecessary (haven't actually checked).
>
> thanks,
> Jan
>
>  src/gallium/state_trackers/clover/api/transfer.cpp | 30 
> --
>  1 file changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/src/gallium/state_trackers/clover/api/transfer.cpp 
> b/src/gallium/state_trackers/clover/api/transfer.cpp
> index f7046253be..729a34590e 100644
> --- a/src/gallium/state_trackers/clover/api/transfer.cpp
> +++ b/src/gallium/state_trackers/clover/api/transfer.cpp
> @@ -295,6 +295,9 @@ clEnqueueReadBuffer(cl_command_queue d_q, cl_mem d_mem, 
> cl_bool blocking,
> , obj_origin, obj_pitch,
> region));
>  
> +   if (blocking)
> +   hev().wait();
> +
> ret_object(rd_ev, hev);
> return CL_SUCCESS;
>  
> @@ -325,6 +328,9 @@ clEnqueueWriteBuffer(cl_command_queue d_q, cl_mem d_mem, 
> cl_bool blocking,
> ptr, {}, obj_pitch,
> region));
>  
> +   if (blocking)
> +   hev().wait();
> +
> ret_object(rd_ev, hev);
> return CL_SUCCESS;
>  
> @@ -362,6 +368,9 @@ clEnqueueReadBufferRect(cl_command_queue d_q, cl_mem 
> d_mem, cl_bool blocking,
> , obj_origin, obj_pitch,
> region));
>  
> +   if (blocking)
> +   hev().wait();
> +
> ret_object(rd_ev, hev);
> return CL_SUCCESS;
>  
> @@ -399,6 +408,9 @@ clEnqueueWriteBufferRect(cl_command_queue d_q, cl_mem 
> d_mem, cl_bool blocking,
> ptr, host_origin, host_pitch,
> region));
>  
> +   if (blocking)
> +   hev().wait();
> +
> ret_object(rd_ev, hev);
> return CL_SUCCESS;
>  
> @@ -504,6 +516,9 @@ clEnqueueReadImage(cl_command_queue d_q, cl_mem d_mem, 
> cl_bool blocking,
> , src_origin, src_pitch,
> region));
>  
> +   if (blocking)
> +   hev().wait();
> +
> ret_object(rd_ev, hev);
> return CL_SUCCESS;
>  
> @@ -538,6 +553,9 @@ clEnqueueWriteImage(cl_command_queue d_q, cl_mem d_mem, 
> cl_bool blocking,
> ptr, {}, src_pitch,
> region));
>  
> +   if (blocking)
> +   hev().wait();
> +
> ret_object(rd_ev, hev);
> return CL_SUCCESS;
>  
> @@ -667,7 +685,11 @@ clEnqueueMapBuffer(cl_command_queue d_q, cl_mem d_mem, 
> cl_bool blocking,
>  
> void *map = mem.resource(q).add_map(q, flags, blocking, obj_origin, 
> region);
>  
> -   ret_object(rd_ev, create(q, CL_COMMAND_MAP_BUFFER, deps));
> +   auto hev = create(q, CL_COMMAND_MAP_BUFFER, deps);
> +   if (blocking)
> +   hev().wait();
> +
> +   ret_object(rd_ev, hev);
> ret_error(r_errcode, CL_SUCCESS);
> return map;
>  
> @@ -695,7 +717,11 @@ clEnqueueMapImage(cl_command_queue d_q, cl_mem d_mem, 
> cl_bool blocking,
>  
> void *map = img.resource(q).add_map(q, flags, blocking, origin, region);
>  
> -   ret_object(rd_ev, create(q, CL_COMMAND_MAP_IMAGE, deps));
> +   auto hev = create(q, CL_COMMAND_MAP_IMAGE, deps);
> +   if (blocking)
> +   hev().wait();
> +
> +   ret_object(rd_ev, hev);
> ret_error(r_errcode, CL_SUCCESS);
> return map;
>  
> -- 
> 2.13.3

From 162047763a83a57764c5784c95c3f5474da897b1 Mon Sep 17 00:00:00 2001
From: Francisco Jerez 
Date: Tue, 9 Jun 2015 22:52:25 +0300
Subject: [PATCH 1/2] clover: Wrap event::wait_count in a method taking care of
 the required locking.

---
 src/gallium/state_trackers/clover/core/event.cpp | 19 ---