Hey hey,



2014-07-09 14:47 GMT+02:00 Karl Rupp <r...@iue.tuwien.ac.at>:

> Hey,
>
>
>      Philippe, did you by chance check the impact of the generator
>>     integration on kernel latency? We only have a 1-10us margin to work
>>     with, which I haven't checked yet.
>>
>>
>>
>> Don't worry for the overhead. It used to be fine. I'll re-check to see
>> whether everything is still fine, but when the program-name and the
>> kernel name prefix is known in advance (ie for the pre-compiled
>> programs), I don't see where a significant overhead could come from!
>> I'll benchmark this ASAP, once some other modifications are done.
>>
>
> The overhead could come from too many indirections in memory accesses,
> i.e. if too many lookups in maps and string comparisons are involved. Since
> you know the implementation better than me and don't think this is an
> issue, it should be fine. Either way, it needs to be checked, as it is such
> a fundamental quantity for the onset of scaling behavior of almost all
> 'higher-level' algorithms.
>
>
The process of enqueueing the generator is extremely lightweight; there is
no map involved. It basically does two things:
- Parse the statement to retrieve some quantities (e.g. M, N, K in the case
of GEMM)
- Recursively enqueue the elements of the statement (matrix, vector,
scalar, etc.)
When the program name is known in advance, there is no need to build the
representation of the statement (which fills a char*), but even this should
be fast enough. I remember having measured, some time ago, a total overhead
of < 10 microseconds when building this representation. But I'll re-evaluate
this ASAP.
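For the re-evaluation, a warmup + repeat timing loop is probably the right shape for staying honest about sub-10us costs. A minimal, ViennaCL-independent sketch (measure_overhead and dummy_enqueue are illustrative names, not existing API):

```python
import time

def measure_overhead(fn, warmup=10, repeats=1000):
    """Average wall-clock cost per call of fn, in seconds."""
    # Warm up so one-time costs (lazy initialization, caches) don't skew the result.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

# Placeholder standing in for the generator's enqueue step.
def dummy_enqueue():
    pass

overhead_us = measure_overhead(dummy_enqueue) * 1e6
print("average overhead: %.3f us" % overhead_us)
```

Averaging over many repeats matters here because a single call sits well below the timer's practical resolution.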




>
>
>          I've been very motivated to work on the kernel generator
>>         recently, and
>>         simply don't feel like working on (1) or (2) at the moment. Now,
>>         there
>>         are two different options, for (4):
>>         4.1 - Implementing the kernel fusion mechanism inside the
>> scheduler.
>>         4.2 - Input-dependent kernels, and performance prediction.
>>
>>         While I could help with 4.1, I don't feel like I could do this
>> task
>>         alone, because I don't have a sufficient knowledge of the
>>         backend. Plus,
>>         it implies getting rid of op_executor(), and I'm not sure how I
>>         could do that either!
>>         I feel operational, though, for 4.2. I feel like ViennaCL 1.6
>>         should be
>>         a performance-oriented release, and having an
>>         (input+device)-dependent
>>         kernel selection mechanism is something we have to do!
>>
>>
>>     I think we should not go for 4.1 with a 1.6.0 release, simply
>>     because it would delay the release cycle. We should provide features
>>     to our users fairly quickly after they are stabilized, not have them
>>     hanging around in the developer repository for too long. We have
>>     enough features for 1.6.0 already ;-)
>>
>>     Some work from your side on 4.2 would be good, so if you have some
>>     resources left, please focus on that.
>>
>>
>> Sure. 4.2 is part of my (future) PhD work, so I can't expect to have
>> everything working flawlessly for ViennaCL 1.6.0.
>>
>
> As always, it's better to have a smaller set of reliable features in a
> release rather than a larger set of broken features ;-)
>
>
>
>
>  But I feel like I
>> should be able to create the backbone for this release: a simple
>> environment-variable based mechanism that points to a folder with the
>> files spit out by the python auto-tuner. I'd like an environment-variable
>> based extension, as it can easily be exploited by advanced users in C++
>> and generalized by pyviennacl (since python has a portable filesystem
>> framework)!
>>
>> Here's my idea. We could have VIENNACL_MODELS_PATH pointing to a
>> directory containing standardized device names (lower-case, spaces
>> replaced by dashes). At runtime, we check if the environment variable is
>> set and if we can open the corresponding file. If not, we fall back on
>> the built-in, input-agnostic database.
>>
>
> This sounds to me much more like a researcher's facility than
> something an average user wants to be exposed to. Keep in mind that
> whenever something needs to go through the file system, it is subject to
> additional problems: These can be permission problems, problems with blanks
> (or umlauts, etc.), random IO errors, or tricky problems in batch systems
> on supercomputers. Since I'm part of the PETSc developer team I've learned
> about so many problems on machines 'out there', where Murphy's law is
> constantly in action. Can we focus on populating the built-in database for
> the 1.6.0 release instead? A standard user with a standard GPU should not
> have to worry about any tuning or stumble upon file system problems.
>
>
>
I like to look at it the other way around. A portable filesystem
implementation is planned for C++17. Since, in 2014, we are still far away
from using C++11, will we ever be able to rely on such C++17 features? If we
want ViennaCL to use input-dependent kernels (which is not reasonably
doable without a model file) preferably before 2025, we'll have to deal
with the lack of a portable (non-boost) filesystem toolkit. Of course, the
environment variables involved would be disabled by default. Similarly, it
sounds ridiculous not to provide an optional caching mechanism just because
it would involve using the filesystem! I'm ready to bet that a lot of users
would prefer significantly increased performance (kernel caching,
input-dependent kernels) at the cost of optionally messing with the
filesystem. I'm even sure that the python community would totally laugh at
us if we didn't choose this option! In the worst case, if there are
filesystem problems, the environment variable can be unset and ViennaCL
will only use the built-in database.
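To make the fallback concrete, here is a sketch of the proposed lookup in Python terms (VIENNACL_MODELS_PATH is the name proposed above; the function names are made up, and a None return stands for "use the built-in database"):

```python
import os

def standardize_device_name(name):
    # Proposed convention: lower-case, spaces replaced by dashes.
    return name.lower().replace(" ", "-")

def find_model_file(device_name, env_var="VIENNACL_MODELS_PATH"):
    # Return the path of a tuned model file for this device, or None to
    # signal a fallback to the built-in, input-agnostic database.
    models_dir = os.environ.get(env_var)
    if models_dir is None:
        return None
    candidate = os.path.join(models_dir, standardize_device_name(device_name))
    return candidate if os.path.isfile(candidate) else None
```

Any failure mode (unset variable, missing directory, unreadable file) collapses to the same harmless outcome: the built-in database is used.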



>
>  The good point is that the auto-tuner can be integrated in pyviennacl's
>> installation, since there is no other dependency!
>>
>> python configure.py --autotune
>> python setup.py build;
>> python setup.py install;
>>
>> Of course, --autotune can take some more options (activated for all the
>> devices by default, but we can chose to auto-tune just one device, if
>> needed.) I suggest, too, that it is activated by default and that some
>> warning is done at the beginning of setup.py that explains what
>> auto-tuning does, that it can lengthen the compilation time and how to
>> deactivate it.
>>
>
> Regarding "python configure.py --autotune": If we are not super-careful
> about an efficient tuning process, we will pretty much inherit the problems
> of ATLAS, i.e. endless installations. I think it is a good feature to have,
> yes, but I think we need to develop some performance models and heuristics
> first before we can really provide this to our users.


Hmm, in the worst case it should still be possible to do this the other way
around:
sudo python autotune.py --models-path ...

You're right, perhaps we should disable the auto-tuner by default for the
pyviennacl installation. The auto-tuning process is pretty short if we
restrict it to large square matrices and the right parameter space (~15 to
30 mins on most desktop GPUs less than 5 years old), but most pyviennacl
users will probably not want to wait even that long.
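Command-line wise, the switches discussed above could look like this (a sketch with argparse; the flag names beyond --autotune are tentative):

```python
import argparse

def parse_configure_args(argv):
    # Sketch of the discussed configure.py options; --autotune-device is a
    # hypothetical name for the "tune just one device" switch.
    parser = argparse.ArgumentParser(prog="configure.py")
    parser.add_argument("--autotune", action="store_true",
                        help="run the OpenCL auto-tuner during installation "
                             "(can lengthen the build considerably)")
    parser.add_argument("--autotune-device", metavar="NAME", default=None,
                        help="restrict auto-tuning to a single device")
    return parser.parse_args(argv)

args = parse_configure_args(["--autotune", "--autotune-device", "Tesla C2050"])
```

With store_true the tuner stays off by default, which matches the point above about not inheriting ATLAS-style endless installations.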

> In essence, the autotuner should remain more important for us (and whoever
> wants to carry out some research in that direction) than for average
> users.


Yep, but it can also allow us to collect more data if it's simple enough
for the users to run it!


>
> Just one more question on the interaction with the GUI: Is it still
> possible to iterate over all reasonable configurations within C++ so that
> we don't have to require Python for the benchmark GUI?
>

Yes, it is possible, even though there is no API.

for (p1 in values_p1)
  for (p2 in values_p2)
    ...
      device_specific::execute(some_template(some_parameters(p1, p2, ...), some_options),
                               some_statement, some_context);
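The nested loops amount to a cross-product over the parameter ranges, so the GUI only needs to enumerate tuples. In Python terms (the ranges here are made up for illustration):

```python
import itertools

# Hypothetical parameter ranges; real templates expose more knobs
# (work-group sizes, vector widths, ...).
values_p1 = [1, 2, 4]
values_p2 = [16, 32]

# Each tuple (p1, p2) would parameterize one some_template(...) execution.
configurations = list(itertools.product(values_p1, values_p2))
print(len(configurations))  # 6 configurations
```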


Best regards,
Philippe (the (over?)confident)



>
> Best regards,
> Karli (the hesitant...)
>
>
>
_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
