Hey,

2014-07-21 22:17 GMT+02:00 Karl Rupp <[email protected]>:

> Hi hi hi,
>
>
> On 07/21/2014 03:59 PM, Philippe Tillet wrote:
>
>> Hi hi,
>>
>> Yes, the vendor ID looks useless for our purpose. I think it's attached
>> to a given machine, though, contrary to the device ID, which changes across
>> executions. There are some OpenCL extensions (provided by Apple) to find
>> out which OpenCL device is driving the display, using vendor IDs, if I
>> remember correctly.
>>
>> Let's agree on a device matching process. Here is how it works for now
>> (including the improvements I made yesterday):
>>
>> 1)  Device vendor (should be from the hardware generation, since it
>> looks like we've agreed on an SDK-independent mechanism)
>> 2)  Device Type
>> 3)  Hardware generation
>> 4)  Device name
>>
>> 1) and 2) should be swapped, indeed.
>>
>
> If everything else fails, at least the device type is reliable
> information. :-)
>
>
>
>
>> I don't really use vendor defaults;
>> here is how fallbacks are handled:
>>
>> -> There are global defaults for each device type, for both doubles and
>> floats.
>> -> When the generation is not found in the database, we look for the
>> closest generation for this vendor.
>> Example. Generations are stored as:
>> enum architecture_generation
>> {
>>   tesla,
>>   fermi,
>>   kepler,
>>   maxwell,
>>
>>   evergreen,
>>   northern_islands,
>>   southern_islands
>> };
>>
>> If the user has the combination (nvidia, kepler) and there is no kepler
>> in the database, then we will search the database for (nvidia,
>> $architecture) and select the $architecture which is closest to
>> kepler (in terms of the difference in the enum values).
>>
>> -> If the device name is not found, the first device found for this
>> architecture is selected. We should change this to use a mechanism
>> similar to the one above, so that similar devices are close in an enum.
>>
>> enum device_name
>> {
>>   ...
>>   gtx560,
>>   gtx570,
>>   gtx580,
>>   gtx590,
>>   ...
>> };
>>
>> That way, if there is a profile for the gtx520 and one for the gtx580,
>> then the gtx530 will pick the former and the gtx570 will pick the
>> latter. There is a pitfall with this approach, though, since it won't
>> handle how close devices from different generations are (e.g., a gtx470
>> may be similar to a gtx550?).
>>
>
> What if 'closer' always picks the weaker GPU?
>

Well, this is indeed tricky. Should a gtx560 fall back on a gtx580 or a
gtx540? And what about a gtx660? I would argue that, given what a fallback
is supposed to be, we shouldn't be too greedy about finding the best
fallback possible.
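
To make the "closest value in the enum" idea concrete, here is a minimal
sketch of the generation lookup I have in mind (the profile struct, the map
layout and closest_profile are placeholders for illustration, not the actual
ViennaCL names):

#include <cassert>
#include <cstdlib>
#include <map>

struct profile { /* tuning parameters: local sizes, vector widths, ... */ };

enum architecture_generation
{
  tesla, fermi, kepler, maxwell,
  evergreen, northern_islands, southern_islands
};

// hypothetical per-vendor database: generation -> tuning profile
typedef std::map<architecture_generation, profile> generation_map;

// Pick the entry whose enum value is closest to the requested generation.
// The map is traversed in ascending order and only a strictly smaller
// distance replaces the current best, so ties go to the older generation.
profile const & closest_profile(generation_map const & db,
                                architecture_generation requested)
{
  assert(!db.empty());
  generation_map::const_iterator best = db.begin();
  int best_distance = std::abs(static_cast<int>(requested)
                               - static_cast<int>(best->first));
  for (generation_map::const_iterator it = db.begin(); it != db.end(); ++it)
  {
    int distance = std::abs(static_cast<int>(requested)
                            - static_cast<int>(it->first));
    if (distance < best_distance)
    {
      best = it;
      best_distance = distance;
    }
  }
  return best->second;
}

The same distance-based lookup would apply to the device_name enum; whether
a tie should go to the weaker or the stronger device is exactly the question
you raise, and in this version it simply goes to the older entry.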

> GPU-rebranding is indeed annoying here, but I think it can be addressed in
> the same way vendors relabel it: We just take the tuning profile for device
> 'a' and rebrand it with name 'b'. ;-)
>
>
That's the best way to handle it, indeed! We could rebrand a new version of
ViennaCL which just adds support for the rebranded devices. Kidding... :-p

>
>
>> -> Finally, even with all these fallbacks we cannot ensure correctness
>> for an unknown device. The kernel might require too many resources, for
>> example. For now, an exception would be thrown at
>> template_base::generate(), which is not acceptable. How should we handle
>> this?
>> My idea is that we should check for invalidity when constructing the
>> template, and if the profile is not valid for this device, then fall
>> back on the defaults.
>>
>
> Does this affect kernels other than GEMM? Either way, this needs some kind
> of hierarchical fallback, which, in the simplest terms, is just falling back to
> a super-conservative profile.


Yes, even the simplest vector addition template can be made invalid: just
feed it a local size that is too big and it will die like a goldfish.
I will experiment with another solution to handle this, since I'll need to
try it for evolutionary tuning anyway: a mechanism that projects the profile
back onto the space of valid configurations. For example, if
work_group_size > max_work_group_size, change the local sizes so that we
obtain work_group_size = max_work_group_size. For some criteria it gets
trickier, but it should give us a better device-adaptive fallback. It is
actually the only viable solution, since the OpenCL standard only guarantees
work_group_size >= 1, and for GPUs we certainly want a better lower bound.
If we use a fallback with a work group size of 128, we have no guarantee
that it will work everywhere (even though it probably will). Also, we can
do nice things like ensuring that the local size is a multiple of 32 on
NVIDIA and a multiple of 64 on AMD.
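
As a rough sketch of what I mean by "projecting" (clamp_to_device and the
two local-size members are only illustrative; real profiles have more
parameters than this):

// hypothetical subset of a tuning profile: two local sizes whose product
// is the work-group size
struct execution_profile
{
  unsigned int local_size_0;
  unsigned int local_size_1;
};

// Project the profile back onto the space of configurations the device
// accepts: halve the larger local size until the work-group size fits,
// then round local_size_0 down to a multiple of the warp/wavefront size
// (32 on NVIDIA, 64 on AMD) when it is large enough.
void clamp_to_device(execution_profile & prof,
                     unsigned int max_work_group_size,
                     unsigned int warp_size)
{
  while (prof.local_size_0 * prof.local_size_1 > max_work_group_size)
  {
    if (prof.local_size_0 >= prof.local_size_1)
      prof.local_size_0 /= 2;
    else
      prof.local_size_1 /= 2;
  }
  if (prof.local_size_0 > warp_size)
    prof.local_size_0 = (prof.local_size_0 / warp_size) * warp_size;
}

Every device constraint becomes one such projection step, so any profile can
be made valid up front instead of throwing in template_base::generate().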


>
>
>> As a general rule, when the slow default profile is
>> used, should we output a warning?
>>
>
> We should provide a diagnostics function to the user, yes. We should not
> dump anything to stdout without the user explicitly asking for it.
>
>
The most convenient option is probably to introduce a VIENNACL_DEBUG_GENERATOR
flag.
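
Something along these lines, perhaps (just a sketch; only the flag name is
the actual proposal, the function and message are illustrative):

#include <iostream>
#include <string>

// Called at the point where the generator gives up and falls back to the
// conservative default profile; prints to stderr only if the user compiled
// with -DVIENNACL_DEBUG_GENERATOR.
inline void report_default_profile(std::string const & device_name)
{
#ifdef VIENNACL_DEBUG_GENERATOR
  std::cerr << "ViennaCL generator: no tuned profile for device '"
            << device_name << "', using the default profile." << std::endl;
#else
  (void)device_name;   // avoid unused-parameter warnings when the flag is off
#endif
}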

Philippe

> Best regards,
> Karli
>
>