Re: [HACKERS] [GSoC] Personal presentation and request for clarification

2017-03-22 Thread João Miguel Afonso
> From: Robert Haas 
> Sent: 09 March 2017 01:09
>
>> The project that most caught my eye was "Implementing push-based query
>> executor".
>> Although it completely fits my capabilities and current research, I have
>> some concerns about "The ability to understand and modify PostgreSQL executor
>> code", as I have not had enough time to understand the scale of the changes
>> involved.
>
> They are formidable.
>
> https://www.Postgresql.org/message-id/CA%2BTgmoaf_uR_wVMj53MVvyEQ_wRx62MM3QQwR6aPZe0Lbr%2BJew%40mail.gmail.com

I want to contribute valuable work, so I will focus on my second
choice: "Sorting algorithms benchmark and implementation". Once I am
more familiar with the PostgreSQL project, I may give the first one a
try.


> From: pgsql-hackers-ow...@postgresql.org  
> on behalf of Kevin Grittner 
> Sent: 17 March 2017 13:57
>
> Some ideas for desirable content:
>
>   - A resume or CV of the student, including any prior GSoC work
>   - Their reasons for wanting to participate
>   - What else they have planned for the summer, and what their time
> commitment to the GSoC work will be
>   - A clear statement that there will be no intellectual property
> problems with the work they will be doing -- that the PostgreSQL
> community will be able to use their work without encumbrances
> (e.g., there should be no agreements related to prior or
> ongoing work which might assign the rights to the work they do
> to someone else)
>   - A description of what they will do, and how
>   - Milestones with dates
>   - What they consider to be the test that they have successfully
> completed the project

Using the information posted HERE and Kevin Grittner's suggestions,
I would like to start writing my proposal as well as begin my work on the
project.

In the last two weeks I have been using profiling tools such as
dstat, top and iostat on my university's cluster with the "NAS
Parallel Benchmarks" package from NASA. Next I will start another
academic assignment using DTrace on a Solaris machine.

I have permanent access to the SeARCH6 cluster, described HERE.
I know it is not that powerful, but it is quite heterogeneous: it is
composed of many generations of processors, including both Intel
many-core solutions (the KNC and the KNL, which is not listed), which
I think is good for testing the algorithms in many different scenarios.

I have no permission to install new software, so I guess I can't use
specific benchmarking software, but the cluster can still be used to
test the algorithms in isolation, using some selected data sets.
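
As an illustration of testing a sort in isolation on selected data sets, a minimal Python sketch could look like the following. The names make_datasets and time_sort and the choice of input shapes are assumptions made for this example, not part of the project description, and Python's built-in sorted only stands in for the algorithm under test.

```python
import random
import time


def make_datasets(n, seed=0):
    """Build a few input shapes that commonly stress sorting algorithms."""
    rng = random.Random(seed)
    base = [rng.randrange(n) for _ in range(n)]
    return {
        "random": base,
        "sorted": sorted(base),
        "reversed": sorted(base, reverse=True),
        "few_distinct": [rng.randrange(4) for _ in range(n)],
    }


def time_sort(sort_fn, data, repeats=5):
    """Return the best wall-clock time of sort_fn over fresh copies of data."""
    expected = sorted(data)
    best = float("inf")
    for _ in range(repeats):
        work = list(data)  # fresh copy, so no run sees already-sorted input
        start = time.perf_counter()
        result = sort_fn(work)
        elapsed = time.perf_counter() - start
        if result != expected:  # a fast but wrong sort must not be reported
            raise ValueError("sort_fn returned an incorrectly ordered result")
        best = min(best, elapsed)
    return best


if __name__ == "__main__":
    for name, data in make_datasets(100_000).items():
        print(f"{name:>12}: {time_sort(sorted, data):.4f} s")
```

Taking the best of several repeats reduces noise from other jobs on a shared cluster, although it does not replace a proper benchmark environment.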

The point here is just to mention the knowledge and resources that I
may be able to use on the project. Other information about my
motivations and competences can be found HERE.

Anyway, I would like to accomplish some small goals before the
23 April deadline, so that I can spot and be prepared for some of the
trickier parts of the project.

As I will have classes and evaluations in June, and possibly an
internship at the University of Texas in July, I will have to
work on both at the same time. I therefore made a schedule with
what I think I can do, leaving August almost free to explore the
project further (micro-optimisations, ...) or to catch up in case
something doesn't go as expected.

I would appreciate it if you could review it and advise me if I'm
heading in the wrong direction.

Schedule:

Before April 3:

project specific work:
- read all the suggested papers
- implement all the sorting algorithms (functional but
 unoptimised versions)
- validate core ideas with the community
integration work:
- read some of the PostgreSQL documentation and source code
- read the HACKERS mailing list

April 3 - May 30:

project specific work:
- discuss possible benchmarks and optimization possibilities
- run a simple benchmark of the currently used sort
integration work:
- deepen my understanding of the PostgreSQL project
- keep reading the mailing list and clarify any doubts

May 30 - June 26 (Coding officially begins!):

- set up the final benchmark environment
- properly benchmark the current sort
- macro optimise all the implemented sorts and define performance
 goals
- test the produced code vs the current one

June 26 - July 24:

micro optimise all the algorithms:
- study cache/memory issues, vectorisation, ...
- first steps towards parallelism
do a full profile of the current work:
- CPU and memory usage
- execution time
- number of operations (per second)

July 24 - August 29:

- optimise parallel solutions
- discuss some possible optimisations and test them
- revise and document all the code
- produce a useful report for future reference

After August 29:

- keep in contact and look for a possible project that fits
 my skills


A small aside:

I read this INFO

[HACKERS] [GSoC] Personal presentation and request for clarification

2017-03-02 Thread João Miguel Afonso
Dear community member(s),

I am João Afonso, a Portuguese MSc student, and I'm writing to ask for some
information about the GSoC projects.

For the reasons explained below, PostgreSQL is the organisation that I most
identify with, so I am trying to introduce myself to the community. As I
really want to participate, I will describe my most relevant experience
and knowledge in the field.
Please feel free to skip the less relevant topics.

The project that most caught my eye was "Implementing push-based query
executor".
Although it completely fits my capabilities and current research, I have some
concerns about "The ability to understand and modify PostgreSQL executor code",
as I have not had enough time to understand the scale of the changes involved.

My second choice would be "Sorting algorithms benchmark and
implementation". Although it is not directly related to my current work, I am
more familiar with the topic and it looks considerably easier to accomplish.
As I said, I have not had enough time to explore the whole project
documentation or source code, but I read the code of the sorting algorithm and
realised that it is sequential. Would a parallel implementation bring some
benefit here?
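
To make the question concrete, one common way to parallelise a sort is to sort independent chunks and then merge the sorted runs. The sketch below is only an illustration of that structure and is not based on PostgreSQL's sort code; split_chunks and chunked_sort are hypothetical names. It uses threads from Python's standard library, so under CPython's global interpreter lock it shows the shape of the approach rather than a real speedup.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def split_chunks(items, parts):
    """Split items into at most `parts` contiguous chunks of near-equal size."""
    size = max(1, -(-len(items) // parts))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def chunked_sort(items, workers=4):
    """Sort chunks independently, then k-way merge the sorted runs."""
    chunks = split_chunks(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        runs = list(pool.map(sorted, chunks))  # one sorted run per chunk
    return list(heapq.merge(*runs))            # merge step is sequential


if __name__ == "__main__":
    print(chunked_sort([5, 3, 9, 1, 4, 8, 2, 7, 6], workers=3))
```

The final merge stays sequential in this sketch, so whether the approach pays off depends on how the chunk-sorting time compares with the merge and coordination overhead.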

I will keep reading the documentation and some of the code, but
I would appreciate it if someone more familiar with the project could point me
to the project that best suits my abilities.



My motivations:


I am part of a group of five MSc students currently working on a linear
algebra approach to OLAP. We are simultaneously translating the SQL language
into linear algebra operations, developing methods to automate the process,
optimising the previously implemented sparse matrix operations, and
benchmarking the resulting work on different Intel x86 and Nvidia
architectures (multi-core, many-core, GPU). Future work may even include
query/machine-level cost prediction.


It would be really interesting for me to do a continuous analysis of how
replacing relational algebra would influence the performance and
implementation complexity of each independent module and of the entire system,
so I could at least do benchmarking tasks even if I am not accepted into GSoC.


I'm also working on a personal project: a general benchmark script
capable of testing all the combinations of N parameters, in Serial, Shared
and Distributed Memory. Its main purpose is to reduce the time spent on the
tasks assigned in the MSc. [GIT HERE]
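
The idea of testing every combination of N parameters can be sketched in a few lines of Python. This is not the script referred to above, only a minimal illustration; parameter_grid, run_grid and the example parameter names are hypothetical.

```python
import itertools
import time


def parameter_grid(params):
    """Yield one dict per combination of the given parameter values."""
    names = list(params)
    for values in itertools.product(*(params[name] for name in names)):
        yield dict(zip(names, values))


def run_grid(func, params, repeats=3):
    """Time func(**combo) for every combination; keep the best of `repeats`."""
    results = []
    for combo in parameter_grid(params):
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            func(**combo)
            best = min(best, time.perf_counter() - start)
        results.append((combo, best))
    return results


if __name__ == "__main__":
    def workload(size, threads):
        # Placeholder workload; `threads` is ignored in this toy example.
        sum(range(size))

    grid = {"size": [10_000, 100_000], "threads": [1, 2, 4]}
    for combo, seconds in run_grid(workload, grid):
        print(combo, f"{seconds:.6f} s")
```

The number of runs grows as the product of the value counts, so large grids usually need to be pruned before they are submitted to a shared cluster.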


Apart from the anxiety of joining such an important project (which I think is
normal), my major concern for both projects is my limited experience with the
micro-optimisations mentioned.


For this reason, I feel it is important to mention my previous work on this
topic. It was on the simple dot product algorithm, and the main subjects of
study were cache issues, CPU/memory bounds, ... The work is described [HERE].


Please feel free to contact me with any questions.

Best regards,

João Afonso