Re: [HACKERS] list of credits for release notes

2017-10-03 Thread Kohei KaiGai
2017-10-03 6:09 GMT+09:00 Tatsuo Ishii :
>> On Wed, Sep 27, 2017 at 8:29 PM, Michael Paquier
>>  wrote:
>>> Looking at this list, the first name is followed by the family name,
>>> so there are inconsistencies with some Japanese names:
>>> - Fujii Masao should be Masao Fujii.
>>> - KaiGai Kohei should be Kohei Kaigai.
>>
>> But his emails say Fujii Masao, not Masao Fujii.
>
> Michael is correct.
>
> Sometimes people choose to put the family name first in their emails.  However, I am
> sure "Fujii" is the family name and "Masao" is the first name.
>
>> KaiGai's case is a bit trickier, as his email name has changed over time.
>
> Michael is correct about Kaigai too.
>
I set up my personal e-mail account in ${FAMILYNAME} ${FIRSTNAME} order,
following the Eastern convention.
However, my last company enforced a centralized e-mail account policy,
so ${FIRSTNAME} ${LASTNAME} was shown when I posted a message
from the jp.nec.com domain.
That is why my email name has changed.

Thanks,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] Float value 'Infinity' is cast to numeric 1 on Windows

2017-09-27 Thread Kohei KaiGai
Hello,

Does it make sense to put an isinf(val) check and an ereport() prior to
the snprintf()?
That would be a more lightweight check than going through set_var_from_str().
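
For illustration, a minimal sketch of what such a guard could look like near
the top of float8_numeric() in utils/adt/numeric.c (the error code and message
wording here are assumptions for the sketch, not an existing patch):

    float8      val = PG_GETARG_FLOAT8(0);

    /* NaN already has its own code path; handle infinity the same way,
     * before the value is formatted into a string, so the platform-specific
     * "1.#INF" spelling never reaches set_var_from_str(). */
    if (isinf(val))
        ereport(ERROR,
                (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                 errmsg("cannot convert infinity to numeric")));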

Thanks,

2017-09-27 19:41 GMT+09:00 Taiki Kondo :
> Hi all,
>
> I built PostgreSQL 9.6.5 with Visual Studio 2013 on Windows, and I found
> weird behavior on it.
>
> When I execute the following SQL, an ERROR is expected, but I got the
> mysterious value 1.
>
>
> postgres=# select (float8 'inf')::numeric;
>  numeric
> ---------
>        1
> (1 row)
>
>
> On Windows, at least with Visual Studio, the string representation of a
> float/double value meaning infinity is "1.#INF", not "inf". So
> set_var_from_str(), called from float8_numeric() in utils/adt/numeric.c,
> returns the numeric value 1, and nothing checks for this situation.
>
> The same situation also occurs when casting from float4.
>
> I wrote a patch to add a check for this situation.
> Please find attached.
>
>
> Sincerely,
>
> --
> Taiki Kondo
> NEC Solution Innovators, Ltd.
>
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2017-03-12 Thread Kouhei Kaigai
> Hello,
> 
> On Wed, March 1, 2017 7:21 pm, Kouhei Kaigai wrote:
> >> I've looked at the patch, and as I'm not that familiar with the
> >> pg-sourcecode, customs and so on, this isn't a review, more like food
> >> for thought and all should be taken with a grain of salt. :)
> >>
> >> So here are a few questions and remarks:
> >>
> >> >+ double  limit_tuples = -1.0;
> >>
> >> Surely the limit cannot be fractional, and must be an integer. So
> >> wouldn't it be better the same type as say:
> >>
> >> >+ if (root->limit_tuples >= 0.0 &&
> >>
> >> Than you could also compare with ">= 0", not ">= 0.0".
> >>
> > The above variable represents the "estimated" number of rows at the
> > planning stage, not execution time.
> > You may be able to see Path structure has "rows" field declared as
> > double type. It makes sense to consider stupid paths during planning,
> > even if it is eventually rejected. For example, if a cross join with
> > two large tables appear during planning, 64bit integer will make
> > overflow easily.
> 
> Hm, ok. Not related to your patch, just curious: Is there a mechanism in
> place that automatically rejects plans where the limit would overflow the
> double-to-uint64 conversion? Or is this more of a "there would hopefully be
> a plan with a better limit, so we do not use the bad one"?
> 
> Would it be possible to force a plan where such an overflow would occur?
>
We have no such mechanism, and little need for one.
The estimated number of rows at planning time is stored in the plan_rows field
of the Plan structure as FP64. Once the plan tree is constructed, the estimated
number of rows no longer affects execution. (Some plans might use it to
estimate resource consumption at execution time.)
On the other hand, the actual number of rows processed is saved in the
instrument field of the PlanState structure. It is counted up from zero, one
row at a time, so people will want to replace their hardware well before a
uint64 overflow. If one billion rows are processed per second, the uint64
overflow happens in the 26th century(!).

> >> And this comment might be better "were we already called?"
> >>
> >> >+ bool        rs_started;     /* are we already called? */
> >>
> > Other variables in ResultState uses present form, like:
> >
> > +   bool        rs_started;     /* are we already called? */
> >     bool        rs_done;        /* are we done? */
> >     bool        rs_checkqual;   /* do we need to check the qual? */
> >  } ResultState;
> 
> Yes, I noted that, but still "are" and "called" and "already" don't read
> well together for me:
> 
>   are - present form
>   called - past form like "were we called?", or "are we called bob?" an
> ongoing process
>   already - it has started
> 
> So "are we already called" reads like someone is waiting for being called.
> 
> Maybe to mirror the comment on "rs_done":
> 
>   /* have we started yet? */
> 
> Also, maybe it's easier for the comment to describe what is happening in
> the code because of the flag, not just to the flag itself:
> 
>   /* To do things once when we are called */
> 
> Anyway, it is a minor point and don't let me distract you from your work,
> I do like the feature and the patch :)
>
Fixed to "have we started yet?"


PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 



passdown-limit-fdw.v6.patch
Description: passdown-limit-fdw.v6.patch



Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2017-03-01 Thread Kouhei Kaigai
> Hello all,
> 
> as this is my first mail to pgsql-hackers, please be gentle :)
>
Welcome to pgsql-hackers,

> I've looked at the patch, and as I'm not that familiar with the pg-sourcecode,
> customs and so on, this isn't a review, more like food for thought and all
> should be taken with a grain of salt. :)
> 
> So here are a few questions and remarks:
> 
> >+double  limit_tuples = -1.0;
> 
> Surely the limit cannot be fractional, and must be an integer. So wouldn't
> it be better to use the same type as, say:
> 
> >+if (root->limit_tuples >= 0.0 &&
> 
> Then you could also compare with ">= 0", not ">= 0.0".
>
The above variable represents the "estimated" number of rows at the
planning stage, not at execution time.
You can see that the Path structure has a "rows" field declared as
double. It makes sense to consider stupid paths during planning,
even if they are eventually rejected. For example, if a cross join of
two large tables appears during planning, a 64-bit integer would overflow
easily.
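
For reference, the relevant field of the planner's Path node (excerpted from
the 9.6-era nodes/relation.h, with unrelated members trimmed):

    typedef struct Path
    {
        NodeTag     pathtype;       /* tag identifying scan/join method */
        ...
        double      rows;           /* estimated number of result tuples */
        Cost        startup_cost;   /* cost expended before fetching any tuples */
        Cost        total_cost;     /* total cost (assumed to be all incurred) */
        ...
    } Path;

As a concrete data point, a cross join of two 10-billion-row tables is
estimated at 10^20 rows, well beyond the int64 maximum of about 9.2 x 10^18,
yet still representable (approximately) as a double.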

> node->ss.ps.ps_numTuples is f.i. an uint64.
> 
> Or is there a specific reason the limit must be a double?
>
The above variable represents the "actual" number of rows at the execution
stage. Most likely, the hardware replacement cycle will come before an int64
overflow.

> And finally:
> 
> >+if (node->ss.ps.ps_numTuples > 0)
> 
> >+appendStringInfo(&buf, " LIMIT %ld",
> node->ss.ps.ps_numTuples);
> 
> vs.
> 
> >+appendStringInfo(&buf, "%s LIMIT %lu",
> >+ sql,
> node->ss.ps.ps_numTuples);
> 
> It seems odd to have two different format strings here for the same variable.
>
Ah, yes, %lu is right because ps_numTuples is uint64.
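
As a side note, PostgreSQL code usually spells that with the UINT64_FORMAT
macro rather than a hard-coded %lu; a sketch of the quoted statement written
that way:

    /* %lu is not correct for uint64 on every platform (e.g. 64-bit Windows),
     * so the tree-wide convention is the UINT64_FORMAT macro. */
    appendStringInfo(&buf, "%s LIMIT " UINT64_FORMAT,
                     sql, node->ss.ps.ps_numTuples);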

> A few comments miss "." at the end, like these:
> 
> >+ * Also, pass down the required number of tuples
> 
> >+ * Pass down the number of required tuples by the upper node
> 
OK,

> And this comment might be better "were we already called?"
> 
> >+    bool        rs_started;     /* are we already called? */
> 
Other variables in ResultState use the present form, like:

+   bool        rs_started;     /* are we already called? */
    bool        rs_done;        /* are we done? */
    bool        rs_checkqual;   /* do we need to check the qual? */
 } ResultState;

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 


passdown-limit-fdw.v5.patch
Description: passdown-limit-fdw.v5.patch



Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2017-02-28 Thread Kouhei Kaigai
The attached patch is a rebased version of the pass-down LIMIT clause patch,
which I forgot to register in the last CF.

It allows us to inform the sub-plans of the required number of rows, not only
the ones we have individually handled in pass_down_bound().
Its primary target is controlling the number of remote tuples transferred over
the network connection by postgres_fdw.

According to the past discussion, we add a new field @ps_numTuples to the
PlanState to represent the required number of tuples.
The Limit node assigns a particular number to this field of its sub-plan; the
sub-plan then knows that its upper node does not require the entire tuple set,
and can adjust its execution strategy.
Like MergeAppend, the sub-plan can also pass the bound down to its own
sub-plans, if that makes sense and works correctly.
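
As an illustration only, a rough sketch of how an FDW could consult the
proposed field; ps_numTuples comes from the description above, while the
function and variable names are assumptions, not the patch itself:

    static void
    exampleBeginRemoteFetch(ForeignScanState *node, StringInfo sql)
    {
        uint64      limit = node->ss.ps.ps_numTuples;   /* 0 means "no bound" */

        if (limit > 0)
        {
            /* The upper Limit node needs at most this many tuples, so the
             * remote query can be decorated with a LIMIT clause instead of
             * transferring the whole result set over the network. */
            appendStringInfo(sql, " LIMIT " UINT64_FORMAT, limit);
        }
    }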

This feature is potentially a basis for GPU-based sorting on top of
CustomScan, because it is advantageous for a workload to pick up the
top-N tuples if the data size is small enough to load onto GPU RAM.

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 


> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kohei KaiGai
> Sent: Tuesday, January 03, 2017 12:07 PM
> To: Kaigai Kouhei(海外 浩平) 
> Cc: Jeevan Chalke ; Robert Haas
> ; pgsql-hackers@postgresql.org; Etsuro Fujita
> ; Andres Freund 
> Subject: Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan
> [take-2]
> 
> Oops, I oversight this patch was marked as "returned with feedback", not
> "moved to the next CF".
> 
> Its status has not been changed since the last update. (Code was revised
> according to the last comment by Jeevan, but CF-Nov was time up at that
> time.)
> 
> How do I handle the patch?
> 
> 2016-12-05 16:49 GMT+09:00 Kouhei Kaigai :
> > Hello,
> >
> > Sorry for my late response.
> > The attached patch reflects your comments.
> >
> >> Here are few comments on latest patch:
> >>
> >>
> >> 1.
> >> make/make check is fine, however I am getting regression failure in
> >> postgres_fdw contrib module (attached regression.diff).
> >> Please investigate and fix.
> >>
> > It was an incorrect interaction when postgres_fdw tries to push down
> > sorting to the remote side. We cannot attach LIMIT clause on the plain
> > scan path across SORT, however, the previous version estimated the
> > cost for the plain scan with LIMIT clause even if local sorting is needed.
> > If remote scan may return just 10 rows, estimated cost of the local
> > sort is very lightweight, thus, this unreasonable path was chosen.
> > (On the other hands, its query execution results were correct because
> > ps_numTuples is not delivered across Sort node, so ForeignScan
> > eventually scanned all the remote tuples. It made correct results but
> > not optimal from the viewpoint of performance.)
> >
> > The v3 patch estimates the cost with remote LIMIT clause only if
> > supplied pathkey strictly matches with the final output order of the
> > query, thus, no local sorting is expected.
> >
> > Some of the regression test cases still have different plans but due
> > to the new optimization by remote LIMIT clause.
> > Without remote LIMIT clause, some of regression test cases preferred
> > remote-JOIN + local-SORT then local-LIMIT.
> > Once we have remote-LIMIT option, it allows to discount the cost for
> > remote-SORT by choice of top-k heap sorting.
> > It changed the optimizer's decision on some test cases.
> >
> > Potential one big change is the test case below.
> >
> >  -- CROSS JOIN, not pushed down
> >  EXPLAIN (VERBOSE, COSTS OFF)
> >  SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1,
> > t2.c1 OFFSET 100 LIMIT 10;
> >
> > It assumed CROSS JOIN was not pushed down due to the cost for network
> > traffic, however, remote LIMIT reduced the estimated number of tuples
> > to be moved. So, all of the CROSS JOIN + ORDER BY + LIMIT became to
> > run on the remote side.
> >
> >> 2.
> >> + *
> >> + * MEMO: root->limit_tuples is not attached when query
> >> contains
> >> + * grouping-clause or aggregate functions. So, we don's
> adjust
> >> + * rows even if LIMIT  is supplied.
> >>
> >> Can you please explain why you are not doing this for grouping-clause
> >> or aggregate functions.
> >>
> > See grouping_planner() at optimizer/plan/planner.c It puts an invalid
> > value on the root->limit_tuples if query has GROUP BY clause,

Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-24 Thread Kohei KaiGai
2017-02-25 1:40 GMT+09:00 Claudio Freire :
> On Fri, Feb 24, 2017 at 1:24 PM, Robert Haas  wrote:
>> On Fri, Feb 24, 2017 at 7:29 PM, Kohei KaiGai  wrote:
>>> This attached patch re-designed the previous v2 according to Robert's 
>>> comment.
>>> All I had to do is putting entrypoint for ForeignScan/CustomScan at
>>> ExecShutdownNode,
>>> because it is already modified to call child node first, earlier than
>>> ExecShutdownGather
>>> which releases DSM segment.
>>
>> Aside from the documentation, which needs some work, this looks fine
>> to me on a quick read.
>
> LGTM too.
>
> You may want to clarify in the docs when the hook is called in
> relation to other hooks too.
>
I tried to add a description of how the custom-scan callbacks perform under
the executor, and roughly when they are invoked.
However, it is fundamentally not easy for most people, because it assumes the
FDW/CSP author understands the overall behavior of the optimizer / executor,
not only the API specifications.

Thanks,
-- 
KaiGai Kohei 


parallel-finish-fdw_csp.v4.patch
Description: Binary data



Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-24 Thread Kohei KaiGai
Hello,

This attached patch re-designs the previous v2 according to Robert's comment.
All I had to do was put an entrypoint for ForeignScan/CustomScan into
ExecShutdownNode, because it is already modified to call the child nodes
first, earlier than ExecShutdownGather, which releases the DSM segment.
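
For illustration, the dispatch described above might look roughly like the
following inside ExecShutdownNode()'s switch; the two FDW/CSP entrypoint
names are assumptions based on this description, not necessarily the patch's
exact symbols:

    switch (nodeTag(node))
    {
        case T_GatherState:
            ExecShutdownGather((GatherState *) node);
            break;
        case T_ForeignScanState:
            ExecShutdownForeignScan((ForeignScanState *) node);
            break;
        case T_CustomScanState:
            ExecShutdownCustomScan((CustomScanState *) node);
            break;
        default:
            break;
    }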

Thanks,

2017-02-20 9:25 GMT+09:00 Kouhei Kaigai :
>> -Original Message-
>> From: pgsql-hackers-ow...@postgresql.org
>> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
>> Sent: Monday, February 20, 2017 2:20 AM
>> To: Kaigai Kouhei(海外 浩平) 
>> Cc: Claudio Freire ; Amit Kapila
>> ; pgsql-hackers 
>> Subject: Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside
>> ExecEndGather)
>>
>> On Fri, Feb 17, 2017 at 12:46 PM, Kouhei Kaigai 
>> wrote:
>> > The attached patch is revised one.
>> >
>> > Invocation of Exec(Foreign|Custom)ParallelFinish was moved to
>> > ExecParallelRetrieveInstrumentation() not to walk on the plan- state
>> > tree twice.
>> > One (hypothetical) downside is, FDW/CSP can retrieve its own run-time
>> > statistics only when query is executed under EXPLAIN ANALYZE.
>> >
>> > This enhancement allows FDW/CSP to collect its specific run- time
>> > statistics more than Instrumentation, then show them as output of
>> > EXPLAIN. My expected examples are GPU's kernel execution time, DMA
>> > transfer ratio and so on. These statistics will never appear in the
>> > Instrumentation structure, however, can be a hot- point of performance
>> > bottleneck if CustomScan works on background workers.
>>
>> Would gather_shutdown_children_first.patch from
>> https://www.postgresql.org/message-id/CAFiTN-s5KuRuDrQCEpiHHzmVf7JTtbn
>> b8eb10c-6aywjdxb...@mail.gmail.com
>> help with this problem also?  Suppose we did that, and then also added an
>> ExecShutdownCustom method.  Then you'd definitely be able to get control
>> before the DSM went away, either from ExecEndNode() or ExecShutdownNode().
>>
> Ah, yes, I couldn't find any problem around the above approach.
> ExecShutdownGather() can be called by either ExecShutdownNode() or
> ExecEndGather(). This patch allows to have an entrypoint for CSP/FDW
> prior to release of the DSM.
>
> Thanks,
> 
> PG-Strom Project / NEC OSS Promotion Center
> KaiGai Kohei 
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers



-- 
KaiGai Kohei 


parallel-finish-fdw_csp.v3.patch
Description: Binary data



Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-19 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Monday, February 20, 2017 2:20 AM
> To: Kaigai Kouhei(海外 浩平) 
> Cc: Claudio Freire ; Amit Kapila
> ; pgsql-hackers 
> Subject: Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside
> ExecEndGather)
> 
> On Fri, Feb 17, 2017 at 12:46 PM, Kouhei Kaigai 
> wrote:
> > The attached patch is revised one.
> >
> > Invocation of Exec(Foreign|Custom)ParallelFinish was moved to
> > ExecParallelRetrieveInstrumentation() not to walk on the plan- state
> > tree twice.
> > One (hypothetical) downside is, FDW/CSP can retrieve its own run-time
> > statistics only when query is executed under EXPLAIN ANALYZE.
> >
> > This enhancement allows FDW/CSP to collect its specific run- time
> > statistics more than Instrumentation, then show them as output of
> > EXPLAIN. My expected examples are GPU's kernel execution time, DMA
> > transfer ratio and so on. These statistics will never appear in the
> > Instrumentation structure, however, can be a hot- point of performance
> > bottleneck if CustomScan works on background workers.
> 
> Would gather_shutdown_children_first.patch from
> https://www.postgresql.org/message-id/CAFiTN-s5KuRuDrQCEpiHHzmVf7JTtbn
> b8eb10c-6aywjdxb...@mail.gmail.com
> help with this problem also?  Suppose we did that, and then also added an
> ExecShutdownCustom method.  Then you'd definitely be able to get control
> before the DSM went away, either from ExecEndNode() or ExecShutdownNode().
> 
Ah, yes, I couldn't find any problem with the above approach.
ExecShutdownGather() can be called by either ExecShutdownNode() or
ExecEndGather(). This patch allows CSP/FDW to have an entrypoint
prior to the release of the DSM.

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 




Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-16 Thread Kouhei Kaigai
Hello,

The attached patch is a revised version.

The invocation of Exec(Foreign|Custom)ParallelFinish was moved to
ExecParallelRetrieveInstrumentation() so as not to walk the plan-state
tree twice.
One (hypothetical) downside is that FDW/CSP can retrieve its own
run-time statistics only when the query is executed under EXPLAIN
ANALYZE.

This enhancement allows FDW/CSP to collect specific run-time
statistics beyond Instrumentation, then show them in the output of
EXPLAIN. My expected examples are GPU kernel execution time, DMA
transfer ratio, and so on. These statistics will never appear in the
Instrumentation structure, but they can be a hot spot of a performance
bottleneck if CustomScan works on background workers.
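
As an illustrative example of the kind of extension-side callback this
enables (all structure and function names below are hypothetical; only the
general idea of copying private statistics out of the per-node DSM area
before it is released comes from the description above):

    typedef struct MyScanSharedStats
    {
        double      gpu_kernel_time_ms;     /* GPU kernel execution time */
        double      dma_transfer_mb;        /* amount of data moved over DMA */
    } MyScanSharedStats;

    typedef struct MyScanState
    {
        CustomScanState css;
        MyScanSharedStats *shared_stats;    /* lives in the per-node DSM area */
        MyScanSharedStats  final_stats;     /* private copy for EXPLAIN output */
    } MyScanState;

    static void
    myScanParallelFinish(CustomScanState *node)
    {
        MyScanState *myss = (MyScanState *) node;

        /* Copy the worker-collected statistics into backend-private memory
         * before the Gather node releases the DSM segment. */
        if (myss->shared_stats)
            myss->final_stats = *myss->shared_stats;
    }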

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 


> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Claudio Freire
> Sent: Monday, February 06, 2017 3:37 PM
> To: Kaigai Kouhei(海外 浩平) 
> Cc: Amit Kapila ; Robert Haas
> ; pgsql-hackers 
> Subject: Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside
> ExecEndGather)
> 
> On Mon, Feb 6, 2017 at 1:42 AM, Kouhei Kaigai  wrote:
> > I also had thought an idea to have extra space to Instrumentation
> > structure, however, it needs to make Instrumentation flexible-length
> > structure according to the custom format by CSP/FDW. Likely, it is not
> a good design.
> > As long as extension can retrieve its custom statistics on DSM area
> > required by ExecParallelEstimate(), I have no preference on the hook
> location.
> 
> That's what I had in mind: the hook happens there, but the extension
> retrieves the information from some extension-specific DSM area, just as
> it would on the ParallelFinish hook.
> 
> > One thing we may pay attention is, some extension (not mine) may want
> > to collect worker's statistics regardless of Instrumentation (in other
> > words, even if plan is not under EXPLAIN ANALYZE).
> > It is the reason why I didn't put a hook under the
> ExecParallelRetrieveInstrumentation().
> 
> I don't think you should worry about that as long as it's a hypothetical
> case.
> 
> If/when some extension actually needs to do that, the design can be discussed
> with a real use case at hand, and not a hypothetical one.
> 
> 
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make
> changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


parallel-finish-fdw_csp.v2.patch
Description: parallel-finish-fdw_csp.v2.patch



Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-05 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Claudio Freire
> Sent: Monday, February 06, 2017 1:05 PM
> To: Kaigai Kouhei(海外 浩平) 
> Cc: Amit Kapila ; Robert Haas
> ; pgsql-hackers 
> Subject: Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside
> ExecEndGather)
> 
> On Mon, Feb 6, 2017 at 1:00 AM, Claudio Freire 
> wrote:
> > On Sun, Feb 5, 2017 at 9:19 PM, Kouhei Kaigai  wrote:
> >>> If the use case for this is to gather instrumentation, I'd suggest
> >>> calling the hook in RetrieveInstrumentation, and calling it
> >>> appropriately. It would make the intended use far clearer than it is
> now.
> >>>
> >>> And if it saves some work, all the better.
> >>>
> >>> Until there's a use case for a non-instrumentation hook in that
> >>> place, I wouldn't add it. This level of generality sounds like a
> >>> solution waiting for a problem to solve.
> >>>
> >> The use cases I'd like to add are extension specific but significant
> >> for performance analytics. These statistics are not included in
> Instrumentation.
> >> For example, my problems are GPU execution time, data transfer ratio
> >> over DMA, synchronization time for GPU task completion, and so on.
> >> Only extension can know which attributes are collected during the
> >> execution, and its data format. I don't think Instrumentation fits these
> requirements.
> >> This is a problem I faced on the v9.6 based interface design, so I
> >> could notice it.
> >
> >
> > What RetrieveInstrumentation currently does may not cover the
> > extension's specific instrumentation, but what I'm suggesting is
> > adding a hook, like the one you're proposing for ParallelFinish, that
> > would collect extension-specific instrumentation. Couple that with
> > extension-specific explain output and you'll get your use cases
> > covered, I think.
> 
> 
> In essence, you added a ParallelFinish that is called after
> RetrieveInstrumentation. That's overreaching, for your use case.
> 
> If instrumentation is what you're collecting, it's far more correct, and
> more readable, to have a hook called from inside RetrieveInstrumentation
> that will retrieve extension-specific instrumentation.
> 
> RetrieveInstrumentation itself doesn't need to parse that information, that
> will be loaded into the custom scan nodes, and those are the ones that will
> parse it when generating explain output.
> 
> So, in essence it's almost what you did with ParallelFinish, more clearly
> aimed at collecting runtime statistics.
> 
I had also thought of an idea to add extra space to the Instrumentation
structure; however, it would require making Instrumentation a flexible-length
structure according to the custom format of the CSP/FDW. That is likely not a
good design.
As long as an extension can retrieve its custom statistics from the DSM area
required by ExecParallelEstimate(), I have no preference on the hook location.

One thing we may pay attention to is that some extension (not mine) may want
to collect worker statistics regardless of Instrumentation (in other words,
even if the plan is not under EXPLAIN ANALYZE).
That is the reason why I didn't put a hook under
ExecParallelRetrieveInstrumentation().

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 




Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-05 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Claudio Freire
> Sent: Saturday, February 04, 2017 8:47 AM
> To: Kaigai Kouhei(海外 浩平) 
> Cc: Amit Kapila ; Robert Haas
> ; pgsql-hackers 
> Subject: Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside
> ExecEndGather)
> 
> On Mon, Oct 31, 2016 at 11:33 AM, Kouhei Kaigai 
> wrote:
> > Hello,
> >
> > The attached patch implements the suggestion by Amit before.
> >
> > What I'm motivated is to collect extra run-time statistics specific to
> > a particular ForeignScan/CustomScan, not only the standard
> > Instrumentation; like DMA transfer rate or execution time of GPU
> > kernels in my case.
> >
> > Per-node DSM toc is one of the best way to return run-time statistics
> > to the master backend, because FDW/CSP can assign arbitrary length of
> > the region according to its needs. It is quite easy to require.
> > However, one problem is, the per-node DSM toc is already released when
> > ExecEndNode() is called on the child node of Gather.
> >
> > This patch allows extensions to get control on the master backend's
> > context when all the worker node gets finished but prior to release of
> > the DSM segment. If FDW/CSP has its special statistics on the segment,
> > it can move to the private memory area for EXPLAIN output or something
> > other purpose.
> >
> > One design consideration is whether the hook shall be called from
> > ExecParallelRetrieveInstrumentation() or ExecParallelFinish().
> > The former is a function to retrieve the standard Instrumentation
> > information, thus, it is valid only if EXPLAIN ANALYZE.
> > On the other hands, if we put entrypoint at ExecParallelFinish(),
> > extension can get control regardless of EXPLAIN ANALYZE, however, it
> > also needs an extra planstate_tree_walker().
> 
> If the use case for this is to gather instrumentation, I'd suggest calling
> the hook in RetrieveInstrumentation, and calling it appropriately. It would
> make the intended use far clearer than it is now.
> 
> And if it saves some work, all the better.
> 
> Until there's a use case for a non-instrumentation hook in that place, I
> wouldn't add it. This level of generality sounds like a solution waiting
> for a problem to solve.
>
The use cases I'd like to add are extension specific but significant for
performance analytics. These statistics are not included in Instrumentation.
For example, my problems are GPU execution time, data transfer ratio over
DMA, synchronization time for GPU task completion, and so on.
Only the extension can know which attributes are collected during execution,
and their data format. I don't think Instrumentation fits these requirements.
This is a problem I faced in the v9.6-based interface design, so I was able
to notice it.

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 




Re: [HACKERS] T_Float morph to T_Integer after nodeRead

2017-01-05 Thread Kouhei Kaigai
> Kouhei Kaigai  writes:
> > Simplified description of what I did is:
> >   fval = makeFloat(psprintf("%.0f", plan_nrows));
> >   custom_scan->custom_private = list_make1(fval);
> 
> So don't do that.  The lexer would never produce T_Float for an
> integer-looking string, so I think it's out of scope for nodeRead() to be
> able to reconstitute such a thing.  You could use %.1f, perhaps.
> But actually, if you're worried about reconstituting exactly what you had,
> aren't you risking precision loss anyway?  Maybe something like
> psprintf("%.*e", DBL_DIG+3, plan_nrows) would be safer.
>
Ah, indeed, it is a good idea to avoid the problem.
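
Concretely, the call from my first mail would become something like this,
following Tom's suggestion (a sketch only):

    fval = makeFloat(psprintf("%.*e", DBL_DIG + 3, plan_nrows));
    custom_scan->custom_private = list_make1(fval);

    /* "%.*e" always emits a '.' and an exponent, so nodeTokenType() will
     * re-read the token as T_Float rather than T_Integer. */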

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 





[HACKERS] T_Float morph to T_Integer after nodeRead

2017-01-05 Thread Kouhei Kaigai
I noticed a strange behavior when a T_Float value is serialized, then
deserialized in the worker process for CPU parallel execution.

A simplified description of what I did is:
  fval = makeFloat(psprintf("%.0f", plan_nrows));
  custom_scan->custom_private = list_make1(fval);

This string expression contains no dot, so the Float value was written out
as if it were an integer value, like "654321".

nodeRead() calls nodeTokenType() to determine the token type.
It treats a numeric token with no dot as an Integer value, even if it was
generated by makeFloat(). Then the worker process references this value
using floatVal() and gets a SEGV.

A workaround is to never use the "%.0f" format for makeFloat().
That may be sufficient because we have a small number of makeFloat() calls
across the source tree. Or do we have any other systematic solution to
prevent a numeric cstring without a dot?

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 






Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2017-01-04 Thread Kouhei Kaigai



> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Thursday, January 05, 2017 5:29 AM
> To: Kohei KaiGai 
> Cc: Kaigai Kouhei(海外 浩平) ; Jeevan Chalke
> ; pgsql-hackers@postgresql.org; Etsuro
> Fujita ; Andres Freund 
> Subject: Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan
> [take-2]
> 
> On Mon, Jan 2, 2017 at 10:07 PM, Kohei KaiGai  wrote:
> > Oops, I oversight this patch was marked as "returned with feedback",
> > not "moved to the next CF".
> >
> > Its status has not been changed since the last update. (Code was
> > revised according to the last comment by Jeevan, but CF-Nov was time
> > up at that time.)
> >
> > How do I handle the patch?
> 
> Can you just change the status to "Moved to Next CF" and then make it "Needs
> Review"?
> 
Unfortunately, it was not possible. Probably, administrator privilege is
needed for this operation.
May I add it to CF 2017-03?

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 




Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2017-01-02 Thread Kohei KaiGai
Oops, I overlooked that this patch was marked as "returned with feedback",
not "moved to the next CF".

Its status has not been changed since the last update. (The code was revised
according to the last comment by Jeevan, but CF-Nov had already run out of
time by then.)

How do I handle the patch?

2016-12-05 16:49 GMT+09:00 Kouhei Kaigai :
> Hello,
>
> Sorry for my late response.
> The attached patch reflects your comments.
>
>> Here are few comments on latest patch:
>>
>>
>> 1.
>> make/make check is fine, however I am getting regression failure in
>> postgres_fdw contrib module (attached regression.diff).
>> Please investigate and fix.
>>
> It was an incorrect interaction when postgres_fdw tries to push down
> sorting to the remote side. We cannot attach LIMIT clause on the plain
> scan path across SORT, however, the previous version estimated the cost
> for the plain scan with LIMIT clause even if local sorting is needed.
> If remote scan may return just 10 rows, estimated cost of the local sort
> is very lightweight, thus, this unreasonable path was chosen.
> (On the other hands, its query execution results were correct because
> ps_numTuples is not delivered across Sort node, so ForeignScan eventually
> scanned all the remote tuples. It made correct results but not optimal
> from the viewpoint of performance.)
>
> The v3 patch estimates the cost with remote LIMIT clause only if supplied
> pathkey strictly matches with the final output order of the query, thus,
> no local sorting is expected.
>
> Some of the regression test cases still have different plans but due to
> the new optimization by remote LIMIT clause.
> Without remote LIMIT clause, some of regression test cases preferred
> remote-JOIN + local-SORT then local-LIMIT.
> Once we have remote-LIMIT option, it allows to discount the cost for
> remote-SORT by choice of top-k heap sorting.
> It changed the optimizer's decision on some test cases.
>
> Potential one big change is the test case below.
>
>  -- CROSS JOIN, not pushed down
>  EXPLAIN (VERBOSE, COSTS OFF)
>  SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 
> OFFSET 100 LIMIT 10;
>
> It assumed CROSS JOIN was not pushed down due to the cost for network
> traffic, however, remote LIMIT reduced the estimated number of tuples
> to be moved. So, all of the CROSS JOIN + ORDER BY + LIMIT became to run
> on the remote side.
>
>> 2.
>> + *
>> + * MEMO: root->limit_tuples is not attached when query
>> contains
>> + * grouping-clause or aggregate functions. So, we don's adjust
>> + * rows even if LIMIT  is supplied.
>>
>> Can you please explain why you are not doing this for grouping-clause or
>> aggregate functions.
>>
> See grouping_planner() at optimizer/plan/planner.c
> It puts an invalid value on the root->limit_tuples if query has GROUP BY 
> clause,
> so we cannot know number of tuples to be returned even if we have upper limit
> actually.
>
> /*
>  * Figure out whether there's a hard limit on the number of rows that
>  * query_planner's result subplan needs to return.  Even if we know a
>  * hard limit overall, it doesn't apply if the query has any
>  * grouping/aggregation operations, or SRFs in the tlist.
>  */
> if (parse->groupClause ||
> parse->groupingSets ||
> parse->distinctClause ||
> parse->hasAggs ||
> parse->hasWindowFuncs ||
> parse->hasTargetSRFs ||
> root->hasHavingQual)
> root->limit_tuples = -1.0;
> else
>     root->limit_tuples = limit_tuples;
>
>> 3.
>> Typo:
>>
>> don's  => don't
>>
> Fixed,
>
> best regards,
> 
> PG-Strom Project / NEC OSS Promotion Center
> KaiGai Kohei 
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



-- 
KaiGai Kohei 


passdown-limit-fdw.v3.patch
Description: Binary data



Re: [HACKERS] varlena beyond 1GB and matrix

2016-12-22 Thread Kohei KaiGai
2016-12-23 8:24 GMT+09:00 Robert Haas :
> On Wed, Dec 7, 2016 at 11:01 PM, Kohei KaiGai  wrote:
>> Regardless of the ExpandedObject, does the flatten format need to
>> contain fully flatten data chunks?
>
> I suspect it does, and I think that's why this isn't going to get very
> far without a super-varlena format.
>
Yep, I'm now investigating how to implement the typlen == -3
approach. Likely, it will be the most straightforward infrastructure for
other potential use cases beyond matrix/vector.

Thanks,
-- 
KaiGai Kohei 




Re: [HACKERS] varlena beyond 1GB and matrix

2016-12-22 Thread Kohei KaiGai
2016-12-23 8:23 GMT+09:00 Robert Haas :
> On Wed, Dec 7, 2016 at 10:44 PM, Kohei KaiGai  wrote:
>>> Handling objects >1GB at all seems like the harder part of the
>>> problem.
>>>
>> I could get your point almost. Does the last line above mention about
>> amount of the data object >1GB? even if the "super-varlena" format
>> allows 64bit length?
>
> Sorry, I can't understand your question about what I wrote.
>
I thought you just pointed out that it is always the harder part to handle a
large amount of data, even if the data format allows >1GB or more. Right?

-- 
KaiGai Kohei 




Re: [HACKERS] varlena beyond 1GB and matrix

2016-12-08 Thread Kohei KaiGai
2016-12-08 16:11 GMT+09:00 Craig Ringer :
> On 8 December 2016 at 12:01, Kohei KaiGai  wrote:
>
>>> At a higher level, I don't understand exactly where such giant
>>> ExpandedObjects would come from.  (As you point out, there's certainly
>>> no easy way for a client to ship over the data for one.)  So this feels
>>> like a very small part of a useful solution, if indeed it's part of a
>>> useful solution at all, which is not obvious.
>>>
>> I expect an aggregate function that consumes millions of rows as source
>> of a large matrix larger than 1GB. Once it is formed to a variable, it is
>> easy to deliver as an argument of PL functions.
>
> You might be interested in how Java has historically dealt with similar 
> issues.
>
> For a long time the JVM had quite low limits on the maximum amount of
> RAM it could manage, in the single gigabytes for a long time. Even for
> the 64-bit JVM. Once those limitations were lifted, the garbage
> collector algorithm placed a low practical limit on how much RAM it
> could cope with effectively.
>
> If you were doing scientific computing with Java, lots of big
> image/video work, using GPGPUs, doing large scale caching, etc, this
> rapidly became a major pain point. So people introduced external
> memory mappings to Java, where objects could reference and manage
> memory outside the main JVM heap. The most well known is probably
> BigMemory (https://www.terracotta.org/products/bigmemory), but there
> are many others. They exposed this via small opaque handle objects
> that you used to interact with the external memory store via library
> functions.
>
> It might make a lot of sense to apply the same principle to
> PostgreSQL, since it's much less intrusive than true 64-bit VARLENA.
> Rather than extending all of PostgreSQL to handle special-case
> split-up VARLENA extended objects, have your interim representation be
> a simple opaque value that points to externally mapped memory. Your
> operators for the type, etc, know how to work with it. You probably
> don't need a full suite of normal operators, you'll be interacting
> with the data in a limited set of ways.
>
> The main issue would presumably be one of resource management, since
> we currently assume we can just copy a Datum around without telling
> anybody about it or doing any special management. You'd need to know
> when to clobber your external segment, when to copy(!) it if
> necessary, etc. This probably makes sense for working with GPGPUs
> anyway, since they like dealing with big contiguous chunks of memory
> (or used to, may have improved?).
>
> It sounds like only code specifically intended to work with the
> oversized type should be doing much with it except passing it around
> as an opaque handle, right?
>
Thanks for your proposal. Its characteristics look like a large object,
but volatile (because there is no backing storage).
As you mentioned, I also think its resource management is the core of the
issue. We have no reference-count mechanism, so it is uncertain when we
should release an orphaned memory chunk, or we have to accept this memory
consumption until clean-up of the relevant memory context.
Moreover, we have to pay attention to the scenario where this opaque
identifier is delivered to a background worker, so it needs to be valid in
the other process's context. (Or do we prohibit exchanging it at the planner
stage?)

> Do you need to serialize this type to/from disk at all? Or just
> exchange it in chunks with a client? If you do need to, can you
> possibly do TOAST-like or pg_largeobject-like storage where you split
> it up for on disk storage then reassemble for use?
>
Even though I don't expect this big chunk to be entirely exported to/imported
from the client at once, it makes sense to save/load this big chunk to/from
disk, because constructing a big in-memory structure consumes many more CPU
cycles than a simple memory copy from buffers.
So, in other words, it does not need valid typinput/typoutput handlers, but
serialization/deserialization on disk I/O is helpful.

Thanks,
-- 
KaiGai Kohei 




Re: [HACKERS] varlena beyond 1GB and matrix

2016-12-07 Thread Kohei KaiGai
2016-12-08 8:36 GMT+09:00 Tom Lane :
> Robert Haas  writes:
>> On Wed, Dec 7, 2016 at 8:50 AM, Kohei KaiGai  wrote:
>>> I like to propose a new optional type handler 'typserialize' to
>>> serialize an in-memory varlena structure (that can have indirect
>>> references) to on-disk format.
>
>> I think it's probably a mistake to conflate objects with substructure
>> with objects > 1GB.  Those are two somewhat orthogonal needs.
>
> Maybe.  I think where KaiGai-san is trying to go with this is being
> able to turn an ExpandedObject (which could contain very large amounts
> of data) directly into a toast pointer or vice versa.  There's nothing
> really preventing a TOAST OID from having more than 1GB of data
> attached, and if you had a side channel like this you could transfer
> the data without ever having to form a larger-than-1GB tuple.
>
> The hole in that approach, to my mind, is that there are too many places
> that assume that they can serialize an ExpandedObject into part of an
> in-memory tuple, which might well never be written to disk, or at least
> not written to disk in a table.  (It might be intended to go into a sort
> or hash join, for instance.)  This design can't really work for that case,
> and unfortunately I think it will be somewhere between hard and impossible
> to remove all the places where that assumption is made.
>
Regardless of the ExpandedObject, does the flattened format need to
contain fully flattened data chunks?
If a data type internally contains multiple toast pointers, like an array,
its flattened image is likely very small and we can store it using the
existing varlena mechanism.
One problem is that VARSIZE() will never tell us the exact total length of
the data even if it references multiple GB-scale chunks.

> At a higher level, I don't understand exactly where such giant
> ExpandedObjects would come from.  (As you point out, there's certainly
> no easy way for a client to ship over the data for one.)  So this feels
> like a very small part of a useful solution, if indeed it's part of a
> useful solution at all, which is not obvious.
>
I expect an aggregate function that consumes millions of rows as the source
of a large matrix larger than 1GB. Once it is formed into a variable, it is
easy to deliver as an argument of PL functions.

> FWIW, ExpandedObjects themselves are far from a fully fleshed out
> concept, one of the main problems being that they don't have very long
> lifespans except in the case that they're the value of a plpgsql
> variable.  I think we would need to move things along quite a bit in
> that area before it would get to be useful to think in terms of
> ExpandedObjects containing multiple GB of data.  Otherwise, the
> inevitable flattenings and re-expansions are just going to kill you.
>
> Likewise, the need for clients to be able to transfer data in chunks
> gets pressing well before you get to 1GB.  So there's a lot here that
> really should be worked on before we try to surmount that barrier.
>
You are pointing out the problem around the client<->server protocol, aren't
you? Likely, we will eventually need this enhancement. I agree.

Thanks,
-- 
KaiGai Kohei 




Re: [HACKERS] varlena beyond 1GB and matrix

2016-12-07 Thread Kohei KaiGai
2016-12-08 8:04 GMT+09:00 Robert Haas :
> On Wed, Dec 7, 2016 at 8:50 AM, Kohei KaiGai  wrote:
>> I like to propose a new optional type handler 'typserialize' to
>> serialize an in-memory varlena structure (that can have indirect
>> references) to on-disk format.
>> If any, it shall be involced on the head of toast_insert_or_update()
>> than indirect references are transformed to something other which
>> is safe to save. (My expectation is, the 'typserialize' handler
>> preliminary saves the indirect chunks to the toast relation, then
>> put toast pointers instead.)
>
> This might not work.  The reason is that we have important bits of
> code that expect that they can figure out how to do some operation on
> a datum (find its length, copy it, serialize it) based only on typlen
> and typbyval.  See src/backend/utils/adt/datum.c for assorted
> examples.  Note also the lengthy comment near the top of the file,
> which explains that typlen > 0 indicates a fixed-length type, typlen
> == -1 indicates a varlena, and typlen == -2 indicates a cstring.  I
> think there's probably room for other typlen values; for example, we
> could have typlen == -3 indicate some new kind of thing -- a
> super-varlena that has a higher length limit, or some other kind of
> thing altogether.
>
> Now, you can imagine trying to treat what you're talking about as a
> new type of TOAST pointer, but I think that's not going to help,
> because at some point the TOAST pointer gets de-toasted into a varlena
> ... which is still limited to 1GB.  So that's not really going to
> work.  And it brings out another point, which is that if you define a
> new typlen code, like -3, for super-big things, they won't be
> varlenas, which means they won't work with the existing TOAST
> interfaces.  Still, you might be able to fix that.  You would probably
> have to do some significant surgery on the wire protocol, per the
> commit message for fa2fa995528023b2e6ba1108f2f47558c6b66dcd.
>
Hmm... The reason why I didn't introduce the idea of a 64-bit varlena format
is that this approach seems too invasive for the existing PostgreSQL core and
extensions, because I assumed this "long" variable-length datum would
utilize/enhance the existing varlena infrastructure.
However, once we have infrastructure completely independent of the existing
varlena, we do not need to risk mixing the two kinds of code.
That seems to me an advantage.

My concern about ExpandedObject is that its flat format still needs a 32-bit
varlena header, which restricts the total length to 1GB, so it was not
possible to represent a large data chunk.
So I didn't think the current ExpandedObject is a solution for us.

Regarding the protocol stuff, I want to consider support for large
records after the internal data format, because my expected primary
usage is internal use for in-database analytics, where the user will get
the results from complicated logic described in a PL function.

> I think it's probably a mistake to conflate objects with substructure
> with objects > 1GB.  Those are two somewhat orthogonal needs.  As Jim
> also points out, expanded objects serve the first need.  Of course,
> those assume that we're dealing with a varlena, so if we made a
> super-varlena, we'd still need to create an equivalent.  But perhaps
> it would be a fairly simple adaptation of what we've already got.
> Handling objects >1GB at all seems like the harder part of the
> problem.
>
I could mostly get your point. Does the last line above refer to the
amount of data in an object >1GB, even if the "super-varlena" format
allows 64-bit lengths?

Thanks,
-- 
KaiGai Kohei 




[HACKERS] varlena beyond 1GB and matrix

2016-12-07 Thread Kohei KaiGai
_y; /* vertical size of each block */
      int     gridsz_x;       /* horizontal # of blocks */
      int     gridsz_y;       /* vertical # of blocks */
      struct {
          Oid     va_valueid;
          Oid     va_toastrelid;
          void   *ptr_block;
      } blocks[FLEXIBLE_ARRAY_MEMBER];
  };

If and when this structure is fetched from a tuple, its @ptr_block fields
are initialized to NULL. Once it is supplied to a function that references
a part of the blocks, type-specific code can load the sub-matrix from the
toast relation, then update @ptr_block so that the sub-matrix is not loaded
from the toast relation multiple times.
I'm not certain whether that is acceptable behavior/manners.
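
A minimal sketch of that on-demand loading, with hypothetical names
throughout (MatrixHeader and fetch_toast_block() are illustrations, not
existing APIs):

    static void *
    matrix_get_block(MatrixHeader *matrix, int gx, int gy)
    {
        int     index = gy * matrix->gridsz_x + gx;

        if (matrix->blocks[index].ptr_block == NULL)
        {
            /* First touch: load the sub-matrix from the toast relation,
             * then cache the pointer so later references reuse it. */
            matrix->blocks[index].ptr_block =
                fetch_toast_block(matrix->blocks[index].va_toastrelid,
                                  matrix->blocks[index].va_valueid);
        }
        return matrix->blocks[index].ptr_block;
    }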

If that is acceptable, it seems to me the direction to support matrices
larger than 1GB is all green.


Your comments are welcome.


 FOR YOUR REFERENCE

* Beyond the 1GB limitation of varlena
http://kaigai.hatenablog.com/entry/2016/12/04/223826

* PGconf.SV 2016 and PL/CUDA
http://kaigai.hatenablog.com/entry/2016/11/17/070708

* PL/CUDA slides at PGconf.ASIA (English)
http://www.slideshare.net/kaigai/pgconfasia2016-plcuda-en

Best regards,
-- 
KaiGai Kohei 




Re: [HACKERS] raw output from copy

2016-12-06 Thread Kohei KaiGai
2016-12-06 16:59 GMT+09:00 Pavel Stehule :
>
>
> 2016-12-06 1:50 GMT+01:00 Kohei KaiGai :
>>
>> 2016-12-05 22:45 GMT+09:00 Pavel Stehule :
>> >
>> > There are more goals:
>> >
>> > 1. user friendly import of text or binary data - import text data (with
>> > psql) from file is possible - but you have to load a content to psql
>> > variable. For binary data you should to use workaround based on LO and
>> > transformation from LO to bytea.
>> >
>> > 2. user friendly export text or binary data - now, the binary data can
>> > be
>> > exported only via transformation to LO. The XML has very interesting
>> > features when is passing from/to client binary. This feature is
>> > impossible
>> > in psql now.
>> >
>>   :
>> 
>>   :
>> >> It seems to me extend of COPY statement for this optimization is a bit
>> >> overkill
>> >> solution. Do we find out an alternative solution that we can build on
>> >> the existing
>> >> infrastructure?
>> >
>> > The advantage and sense of COPY RAW was reusing existing interface. The
>> > question was: How I can export/import binary data simply from psql
>> > console?
>> >
>> OK, I could get your point.
>>
>> Likeky, we can implement the feature without COPY statement enhancement
>> by adding a special purpose function and \xxx command on psql.
>>
>> Let's assume the two commands below on psql:
>>
>> \blob_import   (STDIN|)
>> \blob_export  (STDOUT|)
>>
>> On \blob_import, the psql command reads the binary contents from either
>> stdin or file, than call a special purpose function that takes three
>> arguments; table name, column name and a binary data chunk.
>> PQexecParams() of libpq allows to deliver the data chunk with keeping
>> binary data format, then the special purpose function will be able to
>> lookup the destination table/column and construct a tuple that contains
>> the supplied data chunk. (I think _recv handler shall be used for
>> data validation, but not an element of this feature.)
>>
>>
>> On \blob_export, the psql command also set up a simple query as follows:
>>   SELECT blob_export((<query>))
>> For example,
>>   \blob_export SELECT binary_data FROM my_table WHERE id = 10   /tmp/aaa
>> shall be transformed to
>>   SELECT blob_export((SELECT binary_data FROM my_table WHERE id = 10))
>
>
> This is reason why I prefer a COPY statement - because it does all necessary
> things natural.  But if there is a disagreement against COPY RAW it can be
> implemented as psql commands.
>
Yes, both approaches would be able to implement what you want to do.
I agree it is valuable if psql can import/export a particular item with a
simple shell-script description; however, there is no consensus yet on how
to implement it.

If psql supports the special \xxx commands, it is equally convenient
from the standpoint of users, with no enhancement of the statement.

I hope committers will comment on the approach we should take.

Thanks,

> export should be similar like \g, \gset feature
>
> so
>
> SELECT xmldoc FROM 
> \gbinary_store .xxx
>
> import is maybe better solved by proposed file references in queries
>
> Regards
>
> Pavel
>
>
>>
>>
>> This function is declared as:
>>   blob_export(anyelement) RETURNS bytea
>> So, as long as the user supplied query returns exactly one column and
>> one row, it can transform the argument to the binary stream, then psql
>> command receive it and dump somewhere; stdout or file.
>>
>> How about your thought?
>>
>> Thanks,
>> --
>> KaiGai Kohei 
>
>



-- 
KaiGai Kohei 




Re: [HACKERS] raw output from copy

2016-12-05 Thread Kohei KaiGai
2016-12-05 22:45 GMT+09:00 Pavel Stehule :
>
> There are more goals:
>
> 1. user friendly import of text or binary data - import text data (with
> psql) from file is possible - but you have to load a content to psql
> variable. For binary data you should to use workaround based on LO and
> transformation from LO to bytea.
>
> 2. user friendly export text or binary data - now, the binary data can be
> exported only via transformation to LO. The XML has very interesting
> features when is passing from/to client binary. This feature is impossible
> in psql now.
>
  :

  :
>> It seems to me extend of COPY statement for this optimization is a bit
>> overkill
>> solution. Do we find out an alternative solution that we can build on
>> the existing
>> infrastructure?
>
> The advantage and sense of COPY RAW was reusing existing interface. The
> question was: How I can export/import binary data simply from psql console?
>
OK, I could get your point.

Likely, we can implement the feature without enhancing the COPY statement,
by adding a special-purpose function and \xxx commands in psql.

Let's assume the two commands below in psql:

\blob_import <table> <column> (STDIN|<filename>)
\blob_export <query> (STDOUT|<filename>)

On \blob_import, the psql command reads the binary contents from either
stdin or a file, then calls a special-purpose function that takes three
arguments: table name, column name, and a binary data chunk.
PQexecParams() of libpq allows us to deliver the data chunk while keeping
the binary data format, so the special-purpose function will be able to
look up the destination table/column and construct a tuple that contains
the supplied data chunk. (I think the _recv handler should be used for
data validation, but that is not an element of this feature.)
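
For illustration, a minimal client-side sketch of delivering a binary chunk
through PQexecParams(); blob_import() and the table/column names are
assumptions taken from the sketch above, not an existing server-side
function (requires #include <libpq-fe.h>, error handling trimmed for
brevity):

    static int
    send_blob(PGconn *conn, const char *chunk, int chunk_len)
    {
        const char *command = "SELECT blob_import('my_table', 'binary_data', $1)";
        Oid         types[1]   = {17};          /* 17 = OID of bytea */
        const char *values[1]  = {chunk};
        int         lengths[1] = {chunk_len};
        int         formats[1] = {1};           /* $1 is sent in binary form */
        PGresult   *res;

        res = PQexecParams(conn, command, 1, types, values, lengths, formats, 0);
        if (PQresultStatus(res) != PGRES_TUPLES_OK)
        {
            PQclear(res);
            return -1;
        }
        PQclear(res);
        return 0;
    }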


On \blob_export, the psql command also sets up a simple query as follows:
  SELECT blob_export((




Re: [HACKERS] raw output from copy

2016-12-05 Thread Kohei KaiGai
Sorry for my late response.

I've briefly checked the series of past discussions.
I understand the target/purpose of this patch is to provide a fast interface
to import/export a particular cell of a relation, by skipping the text<->binary
transformation. Its typical use cases are the XML and JSON data types. Right?

If so, how about the idea of using the fast-path function call protocol to
call functions that import/export these document types?
It allows accepting the binary form of the data stream with minimal overhead.
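
For reference, a minimal client-side sketch of such a fast-path call through
libpq's PQfn() might look like the following; the import function itself, its
OID lookup, and its int4 return value are assumptions for illustration only:

#include <stdio.h>
#include "libpq-fe.h"

/* Call a server-side import function through the fast-path interface,
 * sending the whole document as one binary argument.  'import_fn_oid'
 * must be looked up beforehand (e.g. by querying pg_proc); the function
 * is assumed to return an int4 such as the number of bytes stored. */
static int
fastpath_import(PGconn *conn, Oid import_fn_oid, char *buf, int buflen)
{
    PQArgBlock  arg;
    int         result_val;
    int         result_len;
    PGresult   *res;

    arg.isint = 0;              /* passed by reference: a binary blob */
    arg.len = buflen;
    arg.u.ptr = (int *) buf;

    res = PQfn(conn, (int) import_fn_oid, &result_val, &result_len,
               1 /* integer result */, &arg, 1);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
    {
        fprintf(stderr, "fast-path call failed: %s", PQerrorMessage(conn));
        PQclear(res);
        return -1;
    }
    PQclear(res);
    return result_val;
}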

It seems to me that extending the COPY statement for this optimization is a
bit of an overkill solution. Can we find an alternative solution that we can
build on the existing infrastructure?

Best regards,

2016-12-05 14:16 GMT+09:00 Haribabu Kommi :
>
>
> On Tue, Nov 22, 2016 at 10:48 PM, Haribabu Kommi 
> wrote:
>>
>>  Hi,
>>
>> This is a gentle reminder.
>>
>> you assigned as reviewer to the current patch in the 11-2016 commitfest.
>> But you haven't shared your review yet. Please share your review about
>> the patch. This will help us in smoother operation of commitfest.
>>
>> Please Ignore if you already shared your review.
>
>
> Patch is not applying properly to HEAD.
> Moved to next CF with "waiting on author" status.
>
>
> Regards,
> Hari Babu
> Fujitsu Australia



-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2016-12-04 Thread Kouhei Kaigai
Hello,

Sorry for my late response.
The attached patch reflects your comments.

> Here are few comments on latest patch:
> 
> 
> 1.
> make/make check is fine, however I am getting regression failure in
> postgres_fdw contrib module (attached regression.diff).
> Please investigate and fix.
>
It was an incorrect interaction when postgres_fdw tries to push down
sorting to the remote side. We cannot attach a LIMIT clause to the plain
scan path across a SORT; however, the previous version estimated the cost
of the plain scan with a LIMIT clause even if local sorting is needed.
If the remote scan may return just 10 rows, the estimated cost of the local
sort is very lightweight, thus this unreasonable path was chosen.
(On the other hand, its query execution results were correct because
ps_numTuples is not delivered across the Sort node, so ForeignScan eventually
scanned all the remote tuples. It produced correct results but was not optimal
from the viewpoint of performance.)

The v3 patch estimates the cost with the remote LIMIT clause only if the
supplied pathkey strictly matches the final output order of the query, thus
no local sorting is expected.

Some of the regression test cases still have different plans, but that is due
to the new optimization by the remote LIMIT clause.
Without the remote LIMIT clause, some regression test cases preferred
remote-JOIN + local-SORT then local-LIMIT.
Once we have the remote-LIMIT option, it allows discounting the cost of the
remote-SORT by choosing top-k heap sorting.
It changed the optimizer's decision on some test cases.

One potentially big change is the test case below.

 -- CROSS JOIN, not pushed down
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 
100 LIMIT 10;

It assumed the CROSS JOIN was not pushed down due to the cost of network
traffic; however, the remote LIMIT reduced the estimated number of tuples
to be moved. So, the whole CROSS JOIN + ORDER BY + LIMIT came to run
on the remote side.

> 2.
> + *
> + * MEMO: root->limit_tuples is not attached when query
> contains
> + * grouping-clause or aggregate functions. So, we don's adjust
> + * rows even if LIMIT  is supplied.
> 
> Can you please explain why you are not doing this for grouping-clause or
> aggregate functions.
>
See grouping_planner() in optimizer/plan/planner.c.
It puts an invalid value in root->limit_tuples if the query has a GROUP BY
clause, so we cannot know the number of tuples to be returned even if we
actually have an upper limit.

/*
 * Figure out whether there's a hard limit on the number of rows that
 * query_planner's result subplan needs to return.  Even if we know a
 * hard limit overall, it doesn't apply if the query has any
 * grouping/aggregation operations, or SRFs in the tlist.
 */
if (parse->groupClause ||
parse->groupingSets ||
parse->distinctClause ||
parse->hasAggs ||
parse->hasWindowFuncs ||
parse->hasTargetSRFs ||
root->hasHavingQual)
root->limit_tuples = -1.0;
else
root->limit_tuples = limit_tuples;

> 3.
> Typo:
> 
> don's  => don't
>
Fixed,

best regards,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 



passdown-limit-fdw.v3.patch
Description: passdown-limit-fdw.v3.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2016-11-21 Thread Kouhei Kaigai
Hello,

The attached patch is a revised version of pass-down LIMIT to FDW/CSP.

Below are the updates from the last version.

'ps_numTuples' of PlanState is now declared as uint64, instead of long,
to avoid problems on 32-bit machines when a large LIMIT clause is
supplied.

'ps_numTuples' is re-interpreted; 0 means that the upper node wants
to fetch all the tuples. It allows eliminating a boilerplate initialization
in the ExecInit handler of each executor node.
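
For illustration, the producer side of this contract could look roughly like
the sketch below; ps_numTuples is the new field, everything else follows the
existing LimitState, and this is not the patch text itself:

#include "postgres.h"
#include "nodes/execnodes.h"

/* Called by the Limit node once its bounds are computed, before the
 * first ExecProcNode() on the child. */
static void
pass_down_num_tuples(LimitState *node)
{
    PlanState  *child = outerPlanState(node);

    if (node->noCount)
        child->ps_numTuples = 0;    /* 0 = no hint; fetch all tuples */
    else
        child->ps_numTuples = (uint64) (node->offset + node->count);
}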

Even though it was not suggested, estimate_path_cost_size() of postgres_fdw
now adjusts the number of rows if the foreign path is located on top of
the base relations and the LIMIT clause takes a constant value.
It produces a more adequate plan, as follows:

* WITHOUT this patch

postgres=# explain verbose select * from t_a, t_b where t_a.id = t_b.id and 
t_a.x < t_b.x LIMIT 100;
   QUERY PLAN

 Limit  (cost=261.17..274.43 rows=100 width=88)
   Output: t_a.id, t_a.x, t_a.y, t_b.id, t_b.x, t_b.y
   ->  Hash Join  (cost=261.17..581.50 rows=2416 width=88)
 Output: t_a.id, t_a.x, t_a.y, t_b.id, t_b.x, t_b.y
 Hash Cond: (t_a.id = t_b.id)
 Join Filter: (t_a.x < t_b.x)
 ->  Foreign Scan on public.t_a  (cost=100.00..146.12 rows=1204 
width=44)
   Output: t_a.id, t_a.x, t_a.y
   Remote SQL: SELECT id, x, y FROM public.t
 ->  Hash  (cost=146.12..146.12 rows=1204 width=44)
   Output: t_b.id, t_b.x, t_b.y
   ->  Foreign Scan on public.t_b  (cost=100.00..146.12 rows=1204 
width=44)
 Output: t_b.id, t_b.x, t_b.y
 Remote SQL: SELECT id, x, y FROM public.t
(14 rows)

* WITH this patch
-
postgres=# explain verbose select * from t_a, t_b where t_a.id = t_b.id and 
t_a.x < t_b.x LIMIT 100;
  QUERY PLAN
--
 Limit  (cost=100.00..146.58 rows=100 width=88)
   Output: t_a.id, t_a.x, t_a.y, t_b.id, t_b.x, t_b.y
   ->  Foreign Scan  (cost=100.00..146.58 rows=100 width=88)
 Output: t_a.id, t_a.x, t_a.y, t_b.id, t_b.x, t_b.y
 Relations: (public.t_a) INNER JOIN (public.t_b)
 Remote SQL: SELECT r1.id, r1.x, r1.y, r2.id, r2.x, r2.y FROM (public.t 
r1 INNER JOIN public.t r2 ON (((r1.x < r2.x)) AND ((r1.id = r2.id
(6 rows)


On the other hand, I noticed it is not safe to attach the LIMIT clause at
the planner stage because root->limit_tuples is declared as double.
Even if the LIMIT clause takes a constant value, it is potentially larger
than 2^53, which is the largest value we can represent accurately with the
float64 data type, while the LIMIT clause allows up to 2^63-1.
So, postgres_fdw now attaches the LIMIT clause to the remote query at
execution time only.
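
In outline, the execution-time attachment amounts to something like the
following (a sketch only; the helper name is made up and the real patch
builds the query elsewhere):

#include "postgres.h"
#include "lib/stringinfo.h"

/* Append the run-time limit to the already-deparsed remote query.
 * A value of 0 means no hint was given, so nothing is appended. */
static void
append_remote_limit(StringInfo buf, uint64 num_tuples)
{
    if (num_tuples > 0)
        appendStringInfo(buf, " LIMIT " UINT64_FORMAT, num_tuples);
}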

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 


> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Thursday, November 10, 2016 3:08 AM
> To: Kaigai Kouhei(海外 浩平) 
> Cc: pgsql-hackers@postgresql.org; Jeevan Chalke
> ; Etsuro Fujita
> ; Andres Freund 
> Subject: ##freemail## Re: PassDownLimitBound for ForeignScan/CustomScan
> [take-2]
> 
> On Mon, Oct 31, 2016 at 10:20 AM, Kouhei Kaigai 
> wrote:
> > As an example, I enhanced postgres_fdw to understand the ps_numTuples
> > if it is set. If and when remote ORDER BY is pushed down, the latest
> > code tries to sort the entire remote table because it does not know
> > how many rows to be returned. Thus, it took larger execution time.
> > On the other hands, the patched one runs the remote query with LIMIT
> > clause according to the ps_numTuples; which is informed by the Limit
> > node on top of the ForeignScan node.
> 
> So there are two cases here.  If the user says LIMIT 12, we could in theory
> know that at planner time and optimize accordingly.  If the user says LIMIT
> twelve(), however, we will need to wait until execution time unless twelve()
> happens to be capable of being simplified to a constant by the planner.
> 
> Therefore, it's possible to imagine having two mechanisms here. In the
> simple case where the LIMIT and OFFSET values are constants, we could
> implement a system to get hold of that information during planning and
> use it for whatever we like.   In addition, we can have an
> execution-time system that optimizes based on values available at execution
> (regardless of whether those values were also available during planning).
> Those are, basically, two separate things, and this patch has enough to
> do just focusing on one of them.
> 
> --
> Robert Haas
> EnterpriseDB: htt

Re: [HACKERS] Use of pg_proc.probin is legal?

2016-11-17 Thread Kohei KaiGai
2016-11-16 7:46 GMT-08:00 Tom Lane :
> Kohei KaiGai  writes:
>> On the other hands, interpret_AS_clause() raises an ereport if SQL
>> function tries to use probin except
>> for C-language. Is it illegal for other languages to use probin field
>> to store something useful?
>
> Well, there's no convention about how to use it.
>
>> In my case, PL/CUDA language allows to define SQL function with a CUDA
>> code block.
>> It saves a raw CUDA source code on the pg_proc.prosrc and its
>> intermediate representation
>> on the pg_proc.probin; which is automatically constructed on the
>> validator callback of the language
>> handler.
>
> I have precisely zero sympathy for such a kluge.  The validator exists
> to validate, it is not supposed to modify the pg_proc row.
>
> We could imagine extending the PL API to allow storage of a compiled
> version in probin, but overloading the validator isn't the way to do
> that IMO.  I'd prefer to see a separate "compile" function for it.
> Existence of a compile function could be the trigger that instructs,
> eg, pg_dump not to include the probin value in the dump.
>
> (There once was a LANCOMPILER option in the CREATE LANGUAGE syntax,
> which I imagine was meant to do something like this, but it was never
> fully implemented and we got rid of it years ago.)
>
> The bigger question though is whether it's really worth the trouble.
> All existing PLs deal with this by caching compiled (to one degree
> or another) representations in process memory.  If you keep it in
> probin you can save some compilation time once per session, but on the
> other hand you're limited to representations that can fit in a flat blob.
>
After thinking it over with a fresh brain, it may not be a serious problem.
Right now, it registers a dependency between a PL/CUDA function and
its helper functions (*1), then puts the helper function's OID in the
intermediate representation in pg_proc.probin, so that it keeps working
even if the function gets renamed.
However, in this situation we can simply raise an ereport if the helper
function is not found, instead of using the dependency mechanism.
In addition, the heavier part of the compilation is from the intermediate
form to the real binary form, rather than from source to intermediate form,
although I don't know about other scripting languages.

Thanks,

(*1) PL/CUDA allows putting certain helper functions in its
program source for particular purposes. For example, the directive
below tells the PL/CUDA runtime to allocate a particular
amount of device memory according to the result (int8) of
my_plcuda_func_workbuf_sz(), which has an argument set identical to
that of the PL/CUDA function.

$$
  :
#plcuda_workbuf_sz my_plcuda_func_workbuf_sz
  :
$$

-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Use of pg_proc.probin is legal?

2016-11-15 Thread Kohei KaiGai
Hello,

I have a question as subject line.

https://www.postgresql.org/docs/devel/static/catalog-pg-proc.html
According to the documentation, the purpose of pg_proc.probin is
introduced as follows:
| Additional information about how to invoke the function. Again, the
interpretation is language-specific.

On the other hand, interpret_AS_clause() raises an ereport if a SQL
function tries to use probin except
for the C language. Is it illegal for other languages to use the probin
field to store something useful?

In my case, the PL/CUDA language allows defining a SQL function with a CUDA
code block.
It saves the raw CUDA source code in pg_proc.prosrc and its intermediate
representation in pg_proc.probin, which is automatically constructed by the
validator callback of the language handler.

I noticed the problematic scenario because pg_dump generates a CREATE
FUNCTION statement with two AS arguments, and then pg_restore tries to run
the statement but fails.

The solution seems to me to be one of these:
1. Allow CREATE FUNCTION with two AS arguments. If the language does not
support it, probin is simply ignored.
2. pg_dump does not dump pg_proc.probin if the language is not C.
3. Update the documentation to say not to use probin for languages other
than C.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2016-11-09 Thread Kouhei Kaigai
> On Tue, Nov 8, 2016 at 6:54 AM, Jeevan Chalke
>  wrote:
> > 1. ps_numTuples is declared as long, however offset and count members in
> > LimitState struct and bound member in SortState struct is int64.  However
> > long on 32 bit machine may be 32 bits and thus I think tuples_needed which
> > is long may have overflow hazards as it may store int64 + int64.  I think
> > ps_numTuples should be int64.
> 
> I suggested long originally because that's what ExecutorRun() was
> using at the time.  It seems that it got changed to uint64 in
> 23a27b039d94ba359286694831eafe03cd970eef, so I guess we should
> probably use uint64.
> 
> > 2. Robert suggested following in the previous discussion:
> > "For example, suppose we add a new PlanState member "long
> > numTuples" where 0 means that the number of tuples that will be needed
> > is unknown (so that most node types need not initialize it), a
> > positive value is an upper bound on the number of tuples that will be
> > fetched, and -1 means that it is known for certain that we will need
> > all of the tuples."
> >
> > We should have 0 for the default case so that we don't need to initialize it
> > at most of the places.  But I see many such changes in the patch.  I think
> > this is not possible here since 0 can be a legal user provided value which
> > cannot be set as a default (default is all rows).
> >
> > However do you think, can we avoid that? Is there any other way so that we
> > don't need every node having ps_numTuples to be set explicitly?
> 
> +1.
>
I thought we had to distinguish the case where LIMIT 0 is supplied.
However, in this case, ExecLimit() never goes down to the child nodes,
thus its ps_numTuples shall not be referenced anywhere.

OK, I'll use uint64 for ps_numTuples, and 0 shall be the usual default
value, meaning that no specific number of rows is given.
Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2016-11-09 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Thursday, November 10, 2016 3:08 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: pgsql-hackers@postgresql.org; Jeevan Chalke; Etsuro Fujita; Andres Freund
> Subject: Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]
> 
> On Mon, Oct 31, 2016 at 10:20 AM, Kouhei Kaigai  wrote:
> > As an example, I enhanced postgres_fdw to understand the ps_numTuples
> > if it is set. If and when remote ORDER BY is pushed down, the latest
> > code tries to sort the entire remote table because it does not know
> > how many rows to be returned. Thus, it took larger execution time.
> > On the other hands, the patched one runs the remote query with LIMIT
> > clause according to the ps_numTuples; which is informed by the Limit
> > node on top of the ForeignScan node.
> 
> So there are two cases here.  If the user says LIMIT 12, we could in
> theory know that at planner time and optimize accordingly.  If the
> user says LIMIT twelve(), however, we will need to wait until
> execution time unless twelve() happens to be capable of being
> simplified to a constant by the planner.
> 
> Therefore, it's possible to imagine having two mechanisms here. In the
> simple case where the LIMIT and OFFSET values are constants, we could
> implement a system to get hold of that information during planning and
> use it for whatever we like.   In addition, we can have an
> execution-time system that optimizes based on values available at
> execution (regardless of whether those values were also available
> during planning).  Those are, basically, two separate things, and this
> patch has enough to do just focusing on one of them.
>
OK, we need to have a private value in postgres_fdw to indicate whether
LIMIT and OFFSET were supplied at the planner stage. If so, it has to
be matched against the ps_numTuples informed at the executor stage.
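
For instance (struct and field names here are purely illustrative, not the
actual postgres_fdw code), the per-scan private state could carry:

/* Sketch: plan-time LIMIT information kept in postgres_fdw's private state */
typedef struct PgFdwLimitInfo
{
    bool        has_const_limit;    /* LIMIT/OFFSET were constants at plan time */
    uint64      planned_limit;      /* value used for costing; must agree with
                                     * ps_numTuples seen at execution time */
} PgFdwLimitInfo;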

I'll revise the patch.
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: scan key push down to heap [WIP]

2016-11-01 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Dilip Kumar
> Sent: Saturday, October 29, 2016 3:48 PM
> To: Andres Freund
> Cc: Tom Lane; Alvaro Herrera; pgsql-hackers
> Subject: Re: [HACKERS] Proposal: scan key push down to heap [WIP]
> 
> On Wed, Oct 26, 2016 at 12:01 PM, Andres Freund  wrote:
> > The gains are quite noticeable in some cases. So if we can make it work
> > without noticeable downsides...
> >
> > What I'm worried about though is that this, afaics, will quite
> > noticeably *increase* total cost in cases with a noticeable number of
> > columns and a not that selective qual. The reason for that being that
> > HeapKeyTest() uses heap_getattr(), whereas upper layers use
> > slot_getattr(). The latter "caches" repeated deforms, the former
> > doesn't... That'll lead to deforming being essentially done twice, and
> > it's quite often already a major cost of query processing.
> 
> What about putting slot reference inside HeapScanDesc ?. I know it
> will make ,heap layer use executor structure but just a thought.
> 
> I have quickly hacked this way where we use slot reference in
> HeapScanDesc and directly use
>  slot_getattr inside HeapKeyTest (only if we have valid slot otherwise
> use _heap_getattr) and measure the worst case performance (what you
> have mentioned above.)
> 
> My Test: (21 column table with varchar in beginning + qual is on last
> few column + varying selectivity )
> 
> postgres=# \d test
>   Table "public.test"
>  Column |   Type| Modifiers
> +---+---
>  f1 | integer   |
>  f2 | character varying |
>  f3 | integer   |
>  f4 | integer   |
>  f5 | integer   |
>  f6 | integer   |
>  f7 | integer   |
>  f8 | integer   |
>  f9 | integer   |
>  f10| integer   |
>  f11| integer   |
>  f12| integer   |
>  f13| integer   |
>  f14| integer   |
>  f15| integer   |
>  f16| integer   |
>  f17| integer   |
>  f18| integer   |
>  f19| integer   |
>  f20| integer   |
>  f21| integer   |
> 
> tuple count : 1000 (10 Million)
> explain analyze select * from test where f21< $1 and f20 < $1 and f19
> < $1 and f15 < $1 and f10 < $1; ($1 vary from 1Million to 1Million).
> 
> Target code base:
> ---
> 1. Head
> 2. Heap_scankey_pushdown_v1
> 3. My hack for keeping slot reference in HeapScanDesc
> (v1+use_slot_in_HeapKeyTest)
> 
> Result:
> Selectivity Head   scan_key_pushdown_v1 v1+use_slot_in_HeapKeyTest
> 0.1 3880  2980 2747
> 0.2 4041  3187 2914
> 0.5 5051  4921 3626
> 0.8 5378  7296 3879
> 1.0 6161  8525 4575
> 
> Performance graph is attached in the mail..
> 
> Observation:
> 
> 1. Heap_scankey_pushdown_v1, start degrading after very high
> selectivity (this behaviour is only visible if table have 20 or more
> columns, I tested with 10 columns but with that I did not see any
> regression in v1).
> 
> 2. (v1+use_slot_in_HeapKeyTest) is always winner, even at very high 
> selectivity.
> 
Prior to this interface change, a starting point may be to restrict scan key
pushdown to the case where the OpExpr references a column with a static
attcacheoff. This type of column does not need a walk over the tuple from the
head, thus the tuple-deforming cost will not be a downside.
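
A sketch of such a restriction, assuming a helper along these lines (the
function itself is illustrative and uses the 9.6-era tuple descriptor layout):

#include "postgres.h"
#include "access/attnum.h"
#include "utils/rel.h"

/* Return true only when the attribute has a fixed, cached offset, so the
 * qual can be checked without deforming preceding variable-length or
 * nullable columns. */
static bool
scankey_column_is_cheap(Relation rel, AttrNumber attno)
{
    TupleDesc   tupdesc = RelationGetDescr(rel);

    if (attno <= 0 || attno > tupdesc->natts)
        return false;
    return tupdesc->attrs[attno - 1]->attcacheoff >= 0;
}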

By the way, I'm a bit skeptical whether this enhancement is really worth the
work it requires, because we can now easily increase the number of processor
cores to run a seq-scan with a qualifier, especially when it has high
selectivity.
What are your thoughts?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2016-10-31 Thread Kouhei Kaigai
Hello,

The attached patch implements the suggestion Amit made earlier.

My motivation is to collect extra run-time statistics specific
to a particular ForeignScan/CustomScan, beyond the standard
Instrumentation; in my case, things like DMA transfer rate or execution
time of GPU kernels.

The per-node DSM toc is one of the best ways to return run-time statistics
to the master backend, because FDW/CSP can request a region of arbitrary
length according to its needs, and requesting it is quite easy.
However, one problem is that the per-node DSM toc is already released when
ExecEndNode() is called on the child node of Gather.

This patch allows extensions to get control in the master backend's
context when all the worker processes have finished but prior to release
of the DSM segment. If FDW/CSP keeps its special statistics in the
segment, it can move them to a private memory area for EXPLAIN output
or some other purpose.

One design consideration is whether the hook shall be called from
ExecParallelRetrieveInstrumentation() or ExecParallelFinish().
The former is a function to retrieve the standard Instrumentation
information, thus it is invoked only under EXPLAIN ANALYZE.
On the other hand, if we put the entry point in ExecParallelFinish(),
the extension can get control regardless of EXPLAIN ANALYZE; however,
it also needs an extra planstate_tree_walker().

Right now, we don't assume anything about the requirements of FDW/CSP.
It may want run-time statistics regardless of EXPLAIN ANALYZE, thus the
hook shall always be invoked once the Gather node has confirmed termination
of the worker processes.
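
To make the intended usage concrete, a CustomScan provider could do something
like the sketch below in such a callback; the struct, field, and callback
names are all assumptions for illustration, not the signature added by the
patch:

#include "postgres.h"
#include "nodes/extensible.h"

/* Extension-private counters kept in the per-node DSM area */
typedef struct MyGpuStats
{
    double      dma_send_mb;    /* RAM -> GPU transfer volume */
    double      dma_recv_mb;    /* GPU -> RAM transfer volume */
    double      kernel_msec;    /* accumulated GPU kernel time */
} MyGpuStats;

typedef struct MyCustomScanState
{
    CustomScanState css;
    MyGpuStats      stats;      /* backend-private copy */
} MyCustomScanState;

/* Invoked on the leader after all workers have finished, but before the
 * parallel context's DSM segment is detached; 'coordinate' points to this
 * node's region in the DSM toc. */
static void
my_parallel_finish(CustomScanState *node, void *coordinate)
{
    MyCustomScanState *mcss = (MyCustomScanState *) node;
    MyGpuStats *shared = (MyGpuStats *) coordinate;

    mcss->stats.dma_send_mb += shared->dma_send_mb;
    mcss->stats.dma_recv_mb += shared->dma_recv_mb;
    mcss->stats.kernel_msec += shared->kernel_msec;
}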

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 


> -Original Message-
> From: Amit Kapila [mailto:amit.kapil...@gmail.com]
> Sent: Monday, October 17, 2016 11:22 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Robert Haas; pgsql-hackers
> Subject: ##freemail## Re: [HACKERS] Steps inside ExecEndGather
> 
> On Mon, Oct 17, 2016 at 6:22 AM, Kouhei Kaigai  wrote:
> > Hello,
> >
> > I'm now trying to carry extra performance statistics on CustomScan
> > (like DMA transfer rate, execution time of GPU kernels, etc...)
> > from parallel workers to the leader process using the DSM segment
> > attached by the parallel-context.
> > We can require an arbitrary length of DSM using ExecCustomScanEstimate
> > hook by extension, then it looks leader/worker can share the DSM area.
> > However, we have a problem on this design.
> >
> > Below is the implementation of ExecEndGather().
> >
> >   void
> >   ExecEndGather(GatherState *node)
> >   {
> >   ExecShutdownGather(node);
> >   ExecFreeExprContext(&node->ps);
> >   ExecClearTuple(node->ps.ps_ResultTupleSlot);
> >   ExecEndNode(outerPlanState(node));
> >   }
> >
> > It calls ExecShutdownGather() prior to the recursive call of ExecEndNode().
> > The DSM segment shall be released on this call, so child node cannot
> > reference the DSM at the time of ExecEndNode().
> >
> 
> Before releasing DSM, we do collect all the statistics or
> instrumentation information of each node.  Refer
> ExecParallelFinish()->ExecParallelRetrieveInstrumentation(), so I am
> wondering why can't you collect the additional information in the same
> way?
> 
> 
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com


parallel-finish-fdw_csp.v1.patch
Description: parallel-finish-fdw_csp.v1.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2016-10-31 Thread Kouhei Kaigai
Hello,

The attached patch is a revised version of the pass-down-bounds feature.
Its functionality is not changed from the previous version; however, the
implementation was revised according to the discussion at the last CF.

This patch adds a new field (ps_numTuples) to the PlanState. This is
a hint for optimization when the parent node needs only the first N tuples.
It shall be set by the parent node prior to ExecProcNode(), after
ExecInitNode() or ExecReScan(); then child nodes can adjust their
execution behavior (e.g., Sort will take a top-N heapsort if ps_numTuples
is set) and pass the hint further down to their own child nodes.
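
As an illustration of the consumer side (a sketch only; ps_numTuples is the
new field, the rest follows the existing tuplesort API, and this is not the
patch text):

#include "postgres.h"
#include "nodes/execnodes.h"
#include "utils/tuplesort.h"

/* In the Sort node, before feeding tuples into the tuplesort: honor the
 * hint from the parent, if any (0 = no hint). */
static void
apply_num_tuples_hint(SortState *node, Tuplesortstate *tss)
{
    if (node->ss.ps.ps_numTuples > 0)
        tuplesort_set_bound(tss, (int64) node->ss.ps.ps_numTuples);
}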

As an example, I enhanced postgres_fdw to understand ps_numTuples
if it is set. If and when a remote ORDER BY is pushed down, the current
code tries to sort the entire remote table because it does not know
how many rows are to be returned. Thus, it takes a longer execution time.
On the other hand, the patched one runs the remote query with a LIMIT
clause according to ps_numTuples, which is informed by the Limit
node on top of the ForeignScan node.

* without patch
=
postgres=# explain (analyze,verbose) select * from ft order by x,y limit 10;
 QUERY PLAN

 Limit  (cost=100.00..100.43 rows=10 width=52) (actual time=2332.548..2332.550 
rows=10 loops=1)
   Output: id, x, y, z
   ->  Foreign Scan on public.ft  (cost=100.00..146.46 rows=1077 width=52) 
(actual time=2332.547..2332.548 rows=10 loops=1)
 Output: id, x, y, z
 Remote SQL: SELECT id, x, y, z FROM public.t ORDER BY x ASC NULLS 
LAST, y ASC NULLS LAST
 Planning time: 0.177 ms
 Execution time: 2445.590 ms
(7 rows)

* with patch
==
postgres=# explain (analyze,verbose) select * from ft order by x,y limit 10;
QUERY PLAN
--
 Limit  (cost=100.00..100.43 rows=10 width=52) (actual time=579.469..579.471 
rows=10 loops=1)
   Output: id, x, y, z
   ->  Foreign Scan on public.ft  (cost=100.00..146.46 rows=1077 width=52) 
(actual time=579.468..579.469 rows=10 loops=1)
 Output: id, x, y, z
 Remote SQL: SELECT id, x, y, z FROM public.t ORDER BY x ASC NULLS 
LAST, y ASC NULLS LAST
 Planning time: 0.123 ms
 Execution time: 579.858 ms
(7 rows)


Right now, I have a few concerns about this patch.
1. Because the LIMIT clause can take an expression, not only a constant value,
   we cannot determine the value of ps_numTuples until execution time.
   So, it is not possible to adjust the remote query at planning time, and
   EXPLAIN does not show the exact remote query even if a LIMIT clause is
   actually attached.

2. Where is the best location to put the interface contract for setting the
   ps_numTuples field? It has to be set prior to the first ExecProcNode(),
   after ExecInitNode() or ExecReScan().

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 

> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Friday, September 16, 2016 12:39 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Jeevan Chalke; pgsql-hackers@postgresql.org; Etsuro Fujita; Andres Freund
> Subject: ##freemail## Re: [HACKERS] PassDownLimitBound for 
> ForeignScan/CustomScan
> 
> On Tue, Sep 13, 2016 at 9:07 PM, Kouhei Kaigai  wrote:
> > In the current implementation calls recompute_limits() on the first
> > invocation of ExecLimit and ExecReScanLimit. Do we expect the
> > ps->numTuples will be also passed down to the child nodes on the same
> > timing?
> 
> Sure, unless we find some reason why that's not good.
> 
> > I also think this new executor contract shall be considered as a hint
> > (but not a requirement) for the child nodes, because it allows the
> > parent nodes to re-distribute the upper limit regardless of the type
> > of the child nodes as long as the parent node can work correctly and
> > has benefit even if the child node returns a part of tuples. It makes
> > the decision whether the upper limit should be passed down much simple.
> > The child node "can" ignore the hint but can utilize for more optimization.
> 
> +1.
> 
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


passdown-limit-fdw.v1.patch
Description: passdown-limit-fdw.v1.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Steps inside ExecEndGather

2016-10-16 Thread Kouhei Kaigai
> On Mon, Oct 17, 2016 at 6:22 AM, Kouhei Kaigai  wrote:
> > Hello,
> >
> > I'm now trying to carry extra performance statistics on CustomScan
> > (like DMA transfer rate, execution time of GPU kernels, etc...)
> > from parallel workers to the leader process using the DSM segment
> > attached by the parallel-context.
> > We can require an arbitrary length of DSM using ExecCustomScanEstimate
> > hook by extension, then it looks leader/worker can share the DSM area.
> > However, we have a problem on this design.
> >
> > Below is the implementation of ExecEndGather().
> >
> >   void
> >   ExecEndGather(GatherState *node)
> >   {
> >   ExecShutdownGather(node);
> >   ExecFreeExprContext(&node->ps);
> >   ExecClearTuple(node->ps.ps_ResultTupleSlot);
> >   ExecEndNode(outerPlanState(node));
> >   }
> >
> > It calls ExecShutdownGather() prior to the recursive call of ExecEndNode().
> > The DSM segment shall be released on this call, so child node cannot
> > reference the DSM at the time of ExecEndNode().
> >
> 
> Before releasing DSM, we do collect all the statistics or
> instrumentation information of each node.  Refer
> ExecParallelFinish()->ExecParallelRetrieveInstrumentation(), so I am
> wondering why can't you collect the additional information in the same
> way?
>
Thanks for the suggestion.
Hmm. Indeed, that is a more straightforward way to do it, although a new hook
is needed for CSP/FDW.

What I want to collect are: DMA transfer rate between RAM<->GPU, execution
time of GPU kernels, and so on. These are obviously outside the standard
Instrumentation structure, so only the CSP/FDW can know their size and format.

If we had a callback just before the planstate_tree_walker() when the
planstate is either CustomScanState or ForeignScanState, it looks to me like
the problem could be solved very cleanly.

Best regards,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Steps inside ExecEndGather

2016-10-16 Thread Kouhei Kaigai
> I'm now trying to carry extra performance statistics on CustomScan
> (like DMA transfer rate, execution time of GPU kernels, etc...)
> from parallel workers to the leader process using the DSM segment
> attached by the parallel-context.
> We can require an arbitrary length of DSM using ExecCustomScanEstimate
> hook by extension, then it looks leader/worker can share the DSM area.
> However, we have a problem on this design.
> 
> Below is the implementation of ExecEndGather().
> 
>   void
>   ExecEndGather(GatherState *node)
>   {
>   ExecShutdownGather(node);
>   ExecFreeExprContext(&node->ps);
>   ExecClearTuple(node->ps.ps_ResultTupleSlot);
>   ExecEndNode(outerPlanState(node));
>   }
> 
> It calls ExecShutdownGather() prior to the recursive call of ExecEndNode().
> The DSM segment shall be released on this call, so child node cannot
> reference the DSM at the time of ExecEndNode().
> 
> Is there some technical reason why parallel context needs to be released
> prior to ExecEndNode() of the child nodes? Or, just convention of coding?
> 
> I think I'm not an only person who wants to use DSM of CustomScan to write
> back something extra status of parallel workers.
> How about an idea to move ExecShutdownGather() after the ExecEndNode()?
> 
> To avoid this problem, right now, I allocate an another DSM then inform
> its handle to the parallel workers. This segment can be survived until
> ExecEndCustomScan(), but not best effective way, of course.
>
My analysis was not quite correct.

ExecShutdownNode() at ExecutePlan() is the primary point that calls
ExecShutdownGather(), thus the parallel context shall not survive to the point
of ExecEndPlan() regardless of the implementation of ExecEndGather().

Hmm, what is the best way to do this? Or is it simply an abuse of the DSM
that is set up by the parallel context?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Steps inside ExecEndGather

2016-10-16 Thread Kouhei Kaigai
Hello,

I'm now trying to carry extra performance statistics of a CustomScan
(like DMA transfer rate, execution time of GPU kernels, etc.)
from parallel workers to the leader process using the DSM segment
attached by the parallel context.
An extension can request an arbitrary length of DSM using the
ExecCustomScanEstimate hook, so it looks like the leader and workers can
share the DSM area. However, there is a problem with this design.

Below is the implementation of ExecEndGather().

  void
  ExecEndGather(GatherState *node)
  {
  ExecShutdownGather(node);
  ExecFreeExprContext(&node->ps);
  ExecClearTuple(node->ps.ps_ResultTupleSlot);
  ExecEndNode(outerPlanState(node));
  }

It calls ExecShutdownGather() prior to the recursive call of ExecEndNode().
The DSM segment shall be released by this call, so the child node cannot
reference the DSM at the time of ExecEndNode().

Is there some technical reason why the parallel context needs to be released
prior to ExecEndNode() of the child nodes? Or is it just a coding convention?

I think I'm not the only person who wants to use the DSM of a CustomScan to
write back some extra status from the parallel workers.
How about moving ExecShutdownGather() after the ExecEndNode()?

To avoid this problem, right now I allocate another DSM and then pass
its handle to the parallel workers. This segment can survive until
ExecEndCustomScan(), but it is not the most effective way, of course.
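
In rough outline, the workaround looks like this (names are mine and error
handling is omitted):

#include "postgres.h"
#include "storage/dsm.h"

/* Create a separate DSM segment owned by the extension itself; its handle
 * is then shipped to the workers via the coordinate area.  Unlike the
 * parallel context's segment, its lifetime is controlled by the extension. */
static dsm_segment *
create_private_stats_segment(Size nbytes, dsm_handle *handle)
{
    dsm_segment *seg = dsm_create(nbytes, 0);

    *handle = dsm_segment_handle(seg);
    return seg;
}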

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] palloc() too large on pg_buffercache with large shared_buffers

2016-09-14 Thread Kouhei Kaigai
> On Wed, Sep 14, 2016 at 12:13 AM, Kouhei Kaigai  wrote:
> > It looks to me pg_buffercache tries to allocate more than 1GB using
> > palloc(), when shared_buffers is more than 256GB.
> >
> > # show shared_buffers ;
> >  shared_buffers
> > 
> >  280GB
> > (1 row)
> >
> > # SELECT buffers, d.datname, coalesce(c.relname, '???')
> > FROM (SELECT count(*) buffers, reldatabase, relfilenode
> > FROM pg_buffercache group by reldatabase, relfilenode) b
> >LEFT JOIN pg_database d ON d.oid = b.reldatabase
> >LEFT JOIN pg_class c ON d.oid = (SELECT oid FROM pg_database
> >  WHERE datname = current_database())
> >AND b.relfilenode = pg_relation_filenode(c.oid)
> >ORDER BY buffers desc;
> > ERROR:  invalid memory alloc request size 1174405120
> >
> > It is a situation to use MemoryContextAllocHuge(), instead of palloc().
> > Also, it may need a back patching?
> 
> I guess so.  Although it's not very desirable for it to use that much
> memory, I suppose if you have a terabyte of shared_buffers you
> probably have 4GB of memory on top of that to show what they contain.
>
Exactly. I found this problem when someone asked me why shared_buffers=280GB
is slower than shared_buffers=128MB when scanning a 350GB table.
As I expected, most of the shared buffers were not in use, and they also
reduced the amount of free memory usable for the page cache.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] palloc() too large on pg_buffercache with large shared_buffers

2016-09-13 Thread Kouhei Kaigai
Hello,

It looks to me like pg_buffercache tries to allocate more than 1GB using
palloc() when shared_buffers is more than 256GB.

# show shared_buffers ;
 shared_buffers

 280GB
(1 row)

# SELECT buffers, d.datname, coalesce(c.relname, '???')
FROM (SELECT count(*) buffers, reldatabase, relfilenode
FROM pg_buffercache group by reldatabase, relfilenode) b
   LEFT JOIN pg_database d ON d.oid = b.reldatabase
   LEFT JOIN pg_class c ON d.oid = (SELECT oid FROM pg_database
 WHERE datname = current_database())
   AND b.relfilenode = pg_relation_filenode(c.oid)
   ORDER BY buffers desc;
ERROR:  invalid memory alloc request size 1174405120

This is a situation where MemoryContextAllocHuge() should be used instead of
palloc(). Also, it may need back-patching?
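
The fix is essentially a one-liner; in outline it amounts to the following
(a sketch only, with the surrounding pg_buffercache code abbreviated into a
made-up helper):

#include "postgres.h"
#include "utils/memutils.h"

/* Allocate the per-buffer record array; with very large shared_buffers the
 * request exceeds MaxAllocSize (1GB - 1), so the "huge" variant of the
 * allocator has to be used. */
static void *
alloc_buffer_records(Size entry_size, int nbuffers)
{
    Size    nbytes = entry_size * (Size) nbuffers;

    if (nbytes > MaxAllocSize)
        return MemoryContextAllocHuge(CurrentMemoryContext, nbytes);
    return palloc(nbytes);
}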

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




pgsql-fix-pg_buffercache-palloc-huge.patch
Description: pgsql-fix-pg_buffercache-palloc-huge.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan

2016-09-13 Thread Kouhei Kaigai
> On Tue, Sep 13, 2016 at 3:48 AM, Kouhei Kaigai  wrote:
> > Sorry for my late.
> >
> > The attached patch fixed the wording problems on SGML part.
> 
> I agree that we should have some way for foreign data wrappers and
> custom scans and perhaps also other executor nodes to find out whether
> there's a known limit to the number of tuples that they might need to
> produce, but I wonder if we should be doing something more general
> than this.  For example, suppose we add a new PlanState member "long
> numTuples" where 0 means that the number of tuples that will be needed
> is unknown (so that most node types need not initialize it), a
> positive value is an upper bound on the number of tuples that will be
> fetched, and -1 means that it is known for certain that we will need
> all of the tuples.  This might be relevant to the executor batching
> stuff that Andres has been working on, because you could for example
> set ps->numTuples == 1 on the inner side of a semi-join, warning the
> executor node that it shouldn't bother trying to batch anything.
>
I also think the generic approach is a preferable direction.

The current implementation calls recompute_limits() on the first
invocation of ExecLimit and ExecReScanLimit. Do we expect that
ps->numTuples will also be passed down to the child nodes at the same
timing?
I also think this new executor contract should be considered a hint
(but not a requirement) for the child nodes, because it allows the
parent nodes to re-distribute the upper limit regardless of the type
of the child nodes, as long as the parent node works correctly and
benefits even if the child node returns only a part of the tuples. It makes
the decision of whether the upper limit should be passed down much simpler.
The child node "can" ignore the hint but can also utilize it for more
optimization.

> On a more practical level, I notice that you haven't adapted
> postgres_fdw or file_fdw to benefit from this new callback.  It seems
> like postgres_fdw could benefit, because it could fetch only the
> required number of tuples if that happens to be a smaller number than
> the configured fetch_size.
>
That is just because of my time pressure around the patch submission days.
I'll try to enhance postgres_fdw as a user of this run-time optimization.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan

2016-09-13 Thread Kouhei Kaigai
Sorry for my late reply.

The attached patch fixes the wording problems in the SGML part.

Best regards,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

> -Original Message-
> From: Jeevan Chalke [mailto:jeevan.cha...@enterprisedb.com]
> Sent: Tuesday, September 06, 2016 11:22 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: pgsql-hackers@postgresql.org; Etsuro Fujita
> Subject: Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan
> 
> Hi,
> 
> Changes look good to me.
> 
> However there are couple of minor issues need to be fixed.
> 
> 1.
> "under" repeated on second line. Please remove.
> +if and when CustomScanState is located under
> +under LimitState; which implies the underlying node is not
> 
> 2.
> Typo: dicsussion => discussion
> Please fix.
> 
> Apart from this I see no issues.
> 
> 
> Thanks
> 
> 
> --
> 
> Jeevan B Chalke
> Principal Software Engineer, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
> 



pgsql-v10-fdw-css-limit-bound.v3.patch
Description: pgsql-v10-fdw-css-limit-bound.v3.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan

2016-09-05 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Monday, September 05, 2016 12:58 PM
> To: Jeevan Chalke
> Cc: pgsql-hackers@postgresql.org; Etsuro Fujita
> Subject: Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan
> 
> > On Mon, Aug 29, 2016 at 7:25 AM, Kouhei Kaigai  wrote:
> >
> >
> > Hello,
> >
> > The attached patch adds an optional callback to support special 
> > optimization
> > if ForeignScan/CustomScan are located under the Limit node in plan-tree.
> >
> > Our sort node wisely switches the behavior when we can preliminary know
> > exact number of rows to be produced, because all the Sort node has to
> > return is the top-k rows when it is located under the Limit node.
> > It is much lightweight workloads than sorting of entire input rows when
> > nrows is not small.
> >
> > In my case, this information is very useful because GPU can complete its
> > sorting operations mostly on L1-grade memory if we can preliminary know
> > the top-k value is enough small and fits to size of the fast memory.
> >
> > Probably, it is also valuable for Fujita-san's case because this 
> > information
> > allows to attach "LIMIT k" clause on the remote query of postgres_fdw.
> > It will reduce amount of the network traffic and remote CPU consumption
> > once we got support of sort pushdown.
> >
> >
> >
> > One thing we need to pay attention is cost estimation on the planner 
> > stage.
> > In the existing code, only create_ordered_paths() and
> > create_merge_append_path()
> > considers the limit clause for cost estimation of sorting. They use the
> > 'limit_tuples' of PlannerInfo; we can reference the structure when 
> > extension
> > adds ForeignPath/CustomPath, so I think we don't need a special 
> > enhancement
> > on the planner stage.
> >
> Thanks for your comments.
> 
> > I believe this hook is gets called at execution time.
> > So to push LIMIT clause like you said above we should use "limit_tuples" at 
> > the time
> > of planning and then use this hook to optimize at runtime, right?
> >
> Yes. For more correctness, a valid "limit_tuples" of PlannerInfo is set only 
> when
> LIMIT clause takes constant values; it is true for most of use case.
> Then, the hook I added shall be called at execution time for more exact 
> optimization.
> 
> If FDW/CSP cannot accept uncertain number of rows to generate on planning 
> time,
> it is not a duty to provide its own path which is optimized for small number 
> of
> LIMIT clause.
> 
> > Apart from that, attached patch applies cleanly on latest sources and found 
> > no issues
> > with make or with regressions.
> >
> > However this patch is an infrastructure for any possible optimization when
> > foreign/customscan is under LIMIT.
> >
> > So look good to me.
> >
> > I quickly tried adding a hook support in postgres_fdw, and it gets called 
> > correctly
> > when we have foreignscan with LIMIT (limit being evaluated on local server).
> >
> > So code wise no issue. Also add this hook details in documentation.
> >
> OK, I'll try to write up some detailed documentation stuff; not only API 
> specification.
>
The v2 patch is attached. It describes the role of this hook and how an
extension can utilize the LIMIT clause for further optimization at planning
and execution time.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


pgsql-v10-fdw-css-limit-bound.v2.patch
Description: pgsql-v10-fdw-css-limit-bound.v2.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan

2016-09-04 Thread Kouhei Kaigai
> On Mon, Aug 29, 2016 at 7:25 AM, Kouhei Kaigai  wrote:
> 
> 
>   Hello,
> 
>   The attached patch adds an optional callback to support special 
> optimization
>   if ForeignScan/CustomScan are located under the Limit node in plan-tree.
> 
>   Our sort node wisely switches the behavior when we can preliminary know
>   exact number of rows to be produced, because all the Sort node has to
>   return is the top-k rows when it is located under the Limit node.
>   It is much lightweight workloads than sorting of entire input rows when
>   nrows is not small.
> 
>   In my case, this information is very useful because GPU can complete its
>   sorting operations mostly on L1-grade memory if we can preliminary know
>   the top-k value is enough small and fits to size of the fast memory.
> 
>   Probably, it is also valuable for Fujita-san's case because this 
> information
>   allows to attach "LIMIT k" clause on the remote query of postgres_fdw.
>   It will reduce amount of the network traffic and remote CPU consumption
>   once we got support of sort pushdown.
> 
> 
> 
>   One thing we need to pay attention is cost estimation on the planner 
> stage.
>   In the existing code, only create_ordered_paths() and
> create_merge_append_path()
>   considers the limit clause for cost estimation of sorting. They use the
>   'limit_tuples' of PlannerInfo; we can reference the structure when 
> extension
>   adds ForeignPath/CustomPath, so I think we don't need a special 
> enhancement
>   on the planner stage.
>
Thanks for your comments.

> I believe this hook is gets called at execution time.
> So to push LIMIT clause like you said above we should use "limit_tuples" at 
> the time
> of planning and then use this hook to optimize at runtime, right?
>
Yes. To be more precise, a valid "limit_tuples" in PlannerInfo is set only
when the LIMIT clause takes constant values; that is true for most use cases.
Then, the hook I added shall be called at execution time for more exact
optimization.

If FDW/CSP cannot accept an uncertain number of rows to generate at planning
time, it has no duty to provide its own path optimized for a small LIMIT
value.

> Apart from that, attached patch applies cleanly on latest sources and found 
> no issues
> with make or with regressions.
> 
> However this patch is an infrastructure for any possible optimization when
> foreign/customscan is under LIMIT.
> 
> So look good to me.
> 
> I quickly tried adding a hook support in postgres_fdw, and it gets called 
> correctly
> when we have foreignscan with LIMIT (limit being evaluated on local server).
> 
> So code wise no issue. Also add this hook details in documentation.
>
OK, I'll try to write up some detailed documentation, not only the API
specification.

Best regards,

> 
> Thanks
> 
> 
> 
>   Thanks,
>   --
>   NEC Business Creation Division / PG-Strom Project
>   KaiGai Kohei 
> 
> 
> 
>   --
>   Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
>   To make changes to your subscription:
>   http://www.postgresql.org/mailpref/pgsql-hackers
> <http://www.postgresql.org/mailpref/pgsql-hackers>
> 
> 
> 
> 
> 
> 
> --
> 
> Jeevan B Chalke
> Principal Software Engineer, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
> 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PassDownLimitBound for ForeignScan/CustomScan

2016-08-28 Thread Kouhei Kaigai
Hello,

The attached patch adds an optional callback to support special optimization
when ForeignScan/CustomScan is located under a Limit node in the plan tree.

Our Sort node wisely switches its behavior when we can know in advance the
exact number of rows to be produced, because all the Sort node has to
return is the top-k rows when it is located under a Limit node.
That is a much more lightweight workload than sorting the entire input rows
when nrows is not small.

In my case, this information is very useful because the GPU can complete its
sorting operations mostly in L1-grade memory if we know in advance that
the top-k value is small enough to fit in the fast memory.

Probably, it is also valuable for Fujita-san's case because this information
allows attaching a "LIMIT k" clause to the remote query of postgres_fdw.
It will reduce the amount of network traffic and remote CPU consumption
once we get support for sort pushdown.

One thing we need to pay attention to is cost estimation at the planner stage.
In the existing code, only create_ordered_paths() and create_merge_append_path()
consider the limit clause for the cost estimation of sorting. They use
'limit_tuples' of PlannerInfo; an extension can reference that structure when
it adds a ForeignPath/CustomPath, so I think we don't need a special
enhancement at the planner stage.
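
For example, an extension's path-construction code can already consult that
field when costing a pre-sorted path, roughly like this (only
root->limit_tuples is an existing field; the helper itself is made up):

#include "postgres.h"
#include "nodes/relation.h"

/* Clamp the expected output rows of a pre-ordered Foreign/Custom path by
 * the hard limit known to the planner, when one exists (< 0 means none). */
static double
clamp_rows_by_limit(PlannerInfo *root, double path_rows)
{
    if (root->limit_tuples >= 0.0 && root->limit_tuples < path_rows)
        return root->limit_tuples;
    return path_rows;
}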

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



pgsql-v10-fdw-css-limit-bound.v1.patch
Description: pgsql-v10-fdw-css-limit-bound.v1.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] comment fix for CUSTOMPATH_* flags

2016-08-28 Thread Kouhei Kaigai
Hello,

I noticed that the source code comment around the CustomPath structure says
"see above" for the definition of the CUSTOMPATH_* flags. That was originally
right, but the definition was moved to nodes/extensible.h during further
development, so there is nothing above anymore.
The attached patch corrects the comment to point to the right location.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




pgsql-v9.6-custom-flags-comments-fixup.patch
Description: pgsql-v9.6-custom-flags-comments-fixup.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-08-04 Thread Kouhei Kaigai
> -Original Message-
> From: Etsuro Fujita [mailto:fujita.ets...@lab.ntt.co.jp]
> Sent: Thursday, August 04, 2016 4:42 PM
> To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown 
> plans
> 
> On 2016/08/02 22:02, Kouhei Kaigai wrote:
> 
> I wrote:
> >> I removed the Relations line.  Here is an updated version of the patch.
> >>
> >> * As I said upthread, I left the upper-relation handling for another
> >> patch.  Currently, the patch prints "Foreign Scan" with no relations in
> >> that case.
> >>
> >> * I kept custom joins as-is.  We would need discussions about how to
> >> choose relations we print in EXPLAIN, so I'd also like to leave that for
> >> yet another patch.
> 
> > Please don't rely on fs_relids bitmap to pick up relations to be printed.
> > It just hold a set of underlying relations, but it does not mean all of
> > these relations are actually scanned inside of the ForeignScan.
> 
> Firstly, I'd like to discuss about what the relations printed after
> "Foreign join" would mean.  I think they would mean the relations
> involved in that foreign join (in other words, the ones that participate
> in that foreign join) rather than the relations to be scanned by a
> Foreign Scan representing that foreign join.  Wouldn't that make sense?
> 
> In that sense I think it's reasonable to print all relations specified
> in fs_relids after "Foreign Join".  (And in that sense I was thinking we
> could handle the custom join case the same way as the foreign join case.)
> 
> > You didn't answer the following scenario I pointed out in the upthread.
> >
> > | Please assume an enhancement of postgres_fdw that reads a small local 
> > table (tbl_1)
> > | and parse them as VALUES() clause within a remote query to execute remote 
> > join
> > | with foreign tables (ftbl_2, ftbl_3).
> > | This ForeignScan node has three underlying relations; tbl_1, ftbl_2 and 
> > ftbl_3.
> > | Likely, tbl_1 will be scanned by SeqScan, not ForeignScan itself.
> > | In this case, which relations shall be printed around ForeignScan?
> > | Is it possible to choose proper relation names without hint by the 
> > extension?
> > |
> >
> > To care about these FDW usage, you should add an alternative bitmap rather
> > than fs_relids/custom_relids. My suggestion is, having explain_relids for
> > the purpose.
> 
> We currently don't allow such a join to be performed as a foreign join,
> because we allow a join to be performed so if the relations of the join
> are all foreign tables belonging to the same remote server, as you know.
> 
> So, as I said upthread, I think it would make sense to introduce such a
> bitmap when we extend the existing foreign-join pushdown infrastructure
> so that we can do such a thing as proposed by you.  I guess that that
> would need to extend the concept of foreign joins, though.
>
OK, right now FDW does not support taking arbitrary child nodes, unlike
CustomScan. However, that implies your proposed infrastructure will also have
to be revised once FDW gets enhanced in the near future.

> > Also, the logic to print "Foreign (Scan|Insert|Update|Delete)" is different
> > from what I suggested. I'm suggesting to allow extension giving a label
> > to fill up "Foreign %s" format.
> >
> > Please explain why your choice is better than my proposition.
> 
> No, I haven't done anything about that yet.  I kept the behavior as-is.
> 
> > At least, my proposition is available to apply on both of foreign-scan and
> > custom-scan, and no need to future maintenance if and when FDW gets support
> > remote Aggregation, Sort or others.
> 
> I'd like to discuss this issue separately, maybe in a new thread.
>
Why try to invent a similar infrastructure twice, separately?

What I proposed fully covers what you want to do, and has more benefits:
- A common manner for both ForeignScan and CustomScan
- Flexibility to control the "Foreign XXX" label and the relation names to be
  printed.

Even if your proposal is sufficient for the current usage of FDW, I've been
saying it is not sufficient for CustomScan today, nor for ForeignScan in the
near future.
Ignoring the CustomScan side is not an answer, because we have to enhance
the infrastructure on the CustomScan side to follow up on FDW sooner or later.
However, at that time we will have to apply a different manner on the
CustomScan side, or overwrite what you proposed on the FDW side.
That is not a desirable future.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-08-02 Thread Kouhei Kaigai
> -Original Message-
> From: Etsuro Fujita [mailto:fujita.ets...@lab.ntt.co.jp]
> Sent: Tuesday, August 02, 2016 9:36 PM
> To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown 
> plans
> 
> On 2016/08/01 20:15, Etsuro Fujita wrote:
> > I thought about the Relations line a bit more and noticed that there are
> > cases where the table reference names for joining relations in the
> > Relations line are printed incorrectly.  Here is an example:
> >
> > postgres=# explain verbose select * from (select t1.a, t2.a from ft1 t1,
> > ft2 t2 where t1.a = t2.a union select t1.a, t2.a from ft1 t1, ft2 t2
> > where t1.a = t2.a) as t(t1a, t2a);
> >  QUERY PLAN
> >
> --
> --
> >
> >  Unique  (cost=204.12..204.13 rows=2 width=8)
> >Output: t1.a, t2.a
> >->  Sort  (cost=204.12..204.12 rows=2 width=8)
> >  Output: t1.a, t2.a
> >  Sort Key: t1.a, t2.a
> >  ->  Append  (cost=100.00..204.11 rows=2 width=8)
> >->  Foreign Scan  (cost=100.00..102.04 rows=1 width=8)
> >  Output: t1.a, t2.a
> >  Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
> >  Remote SQL: SELECT r1.a, r2.a FROM (public.t1 r1
> > INNER JOIN public.t2 r2 ON (((r1.a = r2.a
> >->  Foreign Scan  (cost=100.00..102.04 rows=1 width=8)
> >  Output: t1_1.a, t2_1.a
> >  Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
> >  Remote SQL: SELECT r1.a, r2.a FROM (public.t1 r1
> > INNER JOIN public.t2 r2 ON (((r1.a = r2.a
> > (14 rows)
> >
> > The table reference names for ft1 and ft2 in the Relations line for the
> > second Foreign Scan should be t1_1 and t2_1 respectively.
> >
> > Another concern about the Relations line is, that represents just an
> > internal representation of a pushed-down join, so that would not
> > necessarily match a deparsed query shown in the Remote SQL line.  Here
> > is an example, which I found when working on supporting pushing down
> > full outer join a lot more, by improving the deparsing logic so that
> > postgres_fdw can build a remote query that involves subqueries [1],
> > which I'll work on for 10.0:
> >
> > + -- full outer join with restrictions on the joining relations
> > + EXPLAIN (COSTS false, VERBOSE)
> > + SELECT t1.c1, t2.c1 FROM (SELECT c1 FROM ft4 WHERE c1 BETWEEN 50 AND
> > 60) t1 FULL JOIN (SELECT c1 FROM ft5 WHERE c1 BETWEEN 50 AND 60) t2 ON
> > (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1;
> > + QUERY
> > PLAN
> > +
> >
> --
> --
> --
> -
> >
> > +  Foreign Scan
> > +Output: ft4.c1, ft5.c1
> > +Relations: (public.ft4) FULL JOIN (public.ft5)
> > +Remote SQL: SELECT ss1.c1, ss2.c1 FROM ((SELECT c1 FROM "S 1"."T 3"
> > WHERE ((c1 >= 50)) AND ((c1 <= 60))) ss1(c1) FULL JOIN (SELECT c1 FROM
> > "S 1"."T 4" WHERE ((c1 >= 50)) AND ((c1 <= 60))) ss2(c1) ON (((ss1.c1 =
> > ss2.c1 ORDER BY ss1.c1 ASC NULLS LAST, ss2.c1 ASC NULLS LAST
> > + (4 rows)
> >
> > "(public.ft4) FULL JOIN (public.ft5)" in the Relations line does not
> > exactly match the deparsed query in the Remote SQL line, which I think
> > would be rather confusing for users.  (We may be able to print more
> > exact information in the Relations line so as to match the depaserd
> > query, but I think that that would make the Relations line redundant.)
> >
> > Would we really need the Relations line?  If joining relations are
> > printed by core like "Foreign Join on public.ft1 t1_1, public.ft2 t2_1"
> > as proposed upthread, we can see those relations from that, not the
> > Relations line.  Also we can see the join tree structure from the
> > deparsed query in the Remote SQL line.  The Relations line seems to be
> > not that useful anymore, then.  What do you think about that?
> 
> I removed the Relations line.  Here is an updated version of t

Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-08-01 Thread Kouhei Kaigai
> -Original Message-
> From: Etsuro Fujita [mailto:fujita.ets...@lab.ntt.co.jp]
> Sent: Tuesday, August 02, 2016 2:45 PM
> To: Kaigai Kouhei(海外 浩平); Ashutosh Bapat
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown 
> plans
> 
> On 2016/08/02 13:32, Kouhei Kaigai wrote:
> 
> I wrote:
> >> My concern here is EXPLAIN for foreign joins.  I think it's another
> >> problem how we handle Foreign Scan plan nodes representing
> >> post-scan/join operations in EXPLAIN, so I'd like to leave that for
> >> another patch.
> 
> > What is the post-scan/join operations? Are you saying EPQ recheck and
> > alternative local join plan?
> 
> No.  I mean e.g., aggregation, window functions, sorting, or table
> modification.  In other words, Foreign Scan plan nodes created from
> ForeignPath paths created from GetForeignUpperPaths.
>
Why do you need to handle these upper paths individually?
The FDW driver knows whether the remote query contains aggregation, window
functions, sorting, or table modification, so it can give a proper label.

If the remote query contains a partial aggregation on top of a join, for
example, "Partial Agg + Scan" would be a proper label to fill the
"Foreign %s" format.

All you need are the two enhancements below:
- Add "const char *explain_label" to ForeignScanState, or somewhere else the
  extension can set it. It gives the label to be printed; if it is NULL,
  EXPLAIN shows "Foreign Scan" as the default.
- Add "Bitmapset *explain_rels" to ForeignScanState, or somewhere else the
  extension can set it. It indicates the relations to be printed after the
  "Foreign %s" token; to print all the relation names underlying this
  ForeignScan node, just copy the fs_relids bitmap. If it is NULL, EXPLAIN
  prints nothing as the default.

Please note that the default does not change the existing behavior.
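Just to illustrate the idea, here is a rough sketch in C. The field names,
their placement, and the callback below are only for illustration -- this is
not an existing PostgreSQL API, only the shape of what I'm proposing:

/* Hypothetical additions to ForeignScanState (sketch only) */
typedef struct ForeignScanState
{
    ScanState   ss;                 /* existing member */
    /* ... existing members omitted ... */
    const char *explain_label;      /* printed as "Foreign %s"; NULL => "Scan" */
    Bitmapset  *explain_rels;       /* relids printed after the label; NULL => none */
} ForeignScanState;

/* An FDW could then set them when the scan begins, e.g.: */
static void
exampleBeginForeignScan(ForeignScanState *node, int eflags)
{
    ForeignScan *fscan = (ForeignScan *) node->ss.ps.plan;

    node->explain_label = "Join";               /* EXPLAIN shows "Foreign Join" */
    node->explain_rels  = bms_copy(fscan->fs_relids);
}

With that, core EXPLAIN only formats what the extension handed over; the
extension never has to build the output string itself.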

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-08-01 Thread Kouhei Kaigai
> On 2016/08/01 22:25, Kouhei Kaigai wrote:
> 
> I wrote:
> >>> a broader
> >>> word like "Processing" seems to work well because we allow various
> >>> kinds of operations to the remote side, in addition to scans/joins,
> >>> to be performed in that one Foreign Scan node indicated by "Foreign
> >>> Processing", such as aggregation, window functions, distinct, order
> >>> by, row locking, table modification, or combinations of them.
> 
>  >> On 2016/07/29 13:28, Ashutosh Bapat wrote:
> >>> "Scan" is a better word than "Processing". From plan's perspective it's
> >>> ultimately a Scan (on the data produced by the foreign server) and not
> >>> processing.
> 
> I wrote:
> >> Exactly, but one thing I'm concerned about is the table modification
> >> case; the EXPLAIN output would be something like this:
> >>
> >>Foreign Scan
> >>  Remote SQL: INSERT INTO remote_table VALUES ...
> >>
> >> That would be strange, so I think a more broader word is better.  I
> >> don't think "Foreign Processing" is best.  "Foreign Upper" might be much
> >> better because the corresponding path is created by calling
> >> GetForeignUpperPaths.
> 
> > Be "Foreign %s", and gives "Insert" label by postgres_fdw; which knows
> > the ForeignScan node actually insert tuples.
> > From the standpoint of users, it looks "Foreign Insert".
> 
> My concern here is EXPLAIN for foreign joins.  I think it's another
> problem how we handle Foreign Scan plan nodes representing
> post-scan/join operations in EXPLAIN, so I'd like to leave that for
> another patch.
>
What are the post-scan/join operations? Do you mean EPQ recheck and the
alternative local join plan?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-08-01 Thread Kouhei Kaigai
> -Original Message-
> From: Etsuro Fujita [mailto:fujita.ets...@lab.ntt.co.jp]
> Sent: Monday, August 01, 2016 8:26 PM
> To: Ashutosh Bapat
> Cc: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown 
> plans
> 
> On 2016/07/29 13:28, Ashutosh Bapat wrote:
> 
> I wrote:
> > Probably something like this:
> >
> >Foreign Processing
> >  Remote Operations: ...
> >
> > In the Remote Operations line, the FDW/extension could print
> > any info
> > about remote operations, eg, "Scan/Join + Aggregate".
> 
> I wrote:
> > I intentionally chose that word and thought we could leave detailed
> > descriptions about remote operations to the FDW/extension; a broader
> > word like "Processing" seems to work well because we allow various
> > kinds of operations to the remote side, in addition to scans/joins,
> > to be performed in that one Foreign Scan node indicated by "Foreign
> > Processing", such as aggregation, window functions, distinct, order
> > by, row locking, table modification, or combinations of them.
> 
> > "Scan" is a better word than "Processing". From plan's perspective it's
> > ultimately a Scan (on the data produced by the foreign server) and not
> > processing.
> 
> Exactly, but one thing I'm concerned about is the table modification
> case; the EXPLAIN output would be something like this:
> 
>Foreign Scan
>  Remote SQL: INSERT INTO remote_table VALUES ...
> 
> That would be strange, so I think a more broader word is better.  I
> don't think "Foreign Processing" is best.  "Foreign Upper" might be much
> better because the corresponding path is created by calling
> GetForeignUpperPaths.
>
Be "Foreign %s", and gives "Insert" label by postgres_fdw; which knows
the ForeignScan node actually insert tuples.
From the standpoint of users, it looks "Foreign Insert".

> Also for a Foreign Scan representing a foreign join, I think "Foreign
> Join" is better than "Foreign Scan".  Here is an example:
> 
>Foreign Join on foreign_table1, foreign_table2
>  Remote SQL: SELECT ... FROM remote_table1 INNER JOIN remote_table2
> WHERE ...
> 
> I think "Foreign Join" better indicates that foreign tables listed after
> that (ie, foreign_table1 and foreign_table2 in the example) are joining
> (or component) tables of this join, than "Foreign Scan".
>
postgres_fdw knows it is a remote join, so it is quite easy to tell the core
that this ForeignScan node is a "Join". It also knows which tables are involved.
We can add hint information to control the relation names to be printed.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-07-28 Thread Kouhei Kaigai
> I wrote:
> >> That may be so, but my point is that the target relations involved in
> >> the foreign join (ie, ft1 and ft2) should be printed somewhere in the
> >> EXPLAIN output by core, as in EXPLAIN for a simple foreign table scan.
> 
> > Why? According to your rule, Hash Join should take "on t0,t1,t2".
> >
> > postgres=# explain select id from t0 natural join t1 natural join t2;
> >  QUERY PLAN
> > -
> >  Hash Join  (cost=6370.00..4560826.24 rows=98784048 width=4)
> >Hash Cond: (t0.aid = t1.aid)
> >->  Hash Join  (cost=3185.00..3199360.58 rows=98784048 width=8)
> >  Hash Cond: (t0.bid = t2.bid)
> >  ->  Seq Scan on t0  (cost=0.00..184.80 rows=10080 width=12)
> >  ->  Hash  (cost=1935.00..1935.00 rows=10 width=4)
> >->  Seq Scan on t2  (cost=0.00..1935.00 rows=10 width=4)
> >->  Hash  (cost=1935.00..1935.00 rows=10 width=4)
> >  ->  Seq Scan on t1  (cost=0.00..1935.00 rows=10 width=4)
> > (9 rows)
> 
> I don't think it needs "on t0,t1,t2", because we can see joining
> relations from inner/outer plans in that case.  In a foreign-join case,
> however, we can't see such relations from the EXPLAIN printed *by core*.
>   postgres_fdw avoids this issue by adding such relations to the EXPLAIN
> using ExplainForeignScan as shown in the below example, but since such
> relations are essential, I think that information should be shown by
> core itself.
>
I have never opposed having the core print the relation names; however, it has
to be controllable by the extension that provides the ForeignScan/CustomScan,
because only the extension knows exactly how the underlying relations are
scanned.

> postgres=# explain select * from (select ft1.a from ft1 left join ft2 on
> ft1.a = ft2.a where ft1.b = 1) ss1(a) full join (select ft3.a from ft3
> left join ft4 on ft3.a = ft4.a where ft3.b = 1) ss2(a) on ss1.a = ss2.a;
> QUERY PLAN
> 
>   Hash Full Join  (cost=202.06..204.12 rows=1 width=8)
> Hash Cond: (ft1.a = ft3.a)
> ->  Foreign Scan  (cost=100.00..102.05 rows=1 width=4)
>   Relations: (public.ft1) LEFT JOIN (public.ft2)
> ->  Hash  (cost=102.05..102.05 rows=1 width=4)
>   ->  Foreign Scan  (cost=100.00..102.05 rows=1 width=4)
> Relations: (public.ft3) LEFT JOIN (public.ft4)
> (7 rows)
> 
>  From the Relations line shown by postgres_fdw, we can see which foreign
> join joins which foreign tables, but if no such lines, we couldn't.
>
In the case of postgres_fdw, if the extension can tell core EXPLAIN to print
all of its underlying relations, the core EXPLAIN routine will print the
relation names. There is no problem here.

> >>> postgres=# explain select id from t0 natural join t1 natural join t2;
> >>> QUERY PLAN
> >>> ---
> >>>  Custom Scan (GpuJoin)  (cost=12385.67..291245.35 rows=9815465 width=4)
> >>>GPU Projection: t0.id
> >>>Depth 1: GpuHashJoin, HashKeys: (t0.bid)
> >>> JoinQuals: (t0.bid = t2.bid)
> >>> Nrows (in/out: 98.15%), KDS-Hash (size: 13.47MB, nbatches: 1)
> >>>Depth 2: GpuHashJoin, HashKeys: (t0.aid)
> >>> JoinQuals: (t0.aid = t1.aid)
> >>> Nrows (in/out: 100.00%), KDS-Hash (size: 13.47MB, nbatches: 1)
> >>>->  Seq Scan on t0  (cost=0.00..18.96 rows=996 width=12)
> >>>->  Seq Scan on t2  (cost=0.00..1935.00 rows=10 width=4)
> >>>->  Seq Scan on t1  (cost=0.00..1935.00 rows=10 width=4)
> >>> (11 rows)
> 
> > My largest concern for you proposition is, ForeignScan/CustomScan node is
> > enforced to print name of underlying relations, regardless of its actual
> > behavior. The above GpuJoin never scans tables at least, thus, it mislead
> > users if we have no choice to print underlying relation names.
> 
> OK, I understand we would need special handling for such custom joins.
>
This is not a case only for CustomScan.

Please assume an enhancement of postgres_fdw that reads a small local table
(tbl_1) and embeds its rows as a VALUES() clause within a remote query, to
execute a remote join with foreign tables (ftbl_2, ftbl_3).
That ForeignScan node has three underlying relations: tbl_1, ftbl_2 and ftbl_3.
Likely, tbl_1 will be scanned by a SeqScan, not by the ForeignScan itself.
In this case, which relations should be printed next to the ForeignScan?
Is it possible to choose the proper relation names without a hint from the
extension?
   
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-07-28 Thread Kouhei Kaigai
> On 2016/07/28 10:01, Kouhei Kaigai wrote:
> > What I'm saying is here:
> 
> > EXPLAIN (COSTS false, VERBOSE)
> > SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY
> > t1.c3, t1.c1 OFFSET 100 LIMIT 10;
> >   QUERY PLAN
> > ---
> > Limit
> >   Output: t1.c1, t2.c1, t1.c3
> >   ->  Foreign Scan
> >   
> > Output: t1.c1, t2.c1, t1.c3
> > Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
> > Remote SQL: SELECT r1."C 1", r1.c3, r2."C 1" FROM ("S 1"."T
> > 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1" ORDER BY
> > r1.c3 ASC N\
> > ULLS LAST, r1."C 1" ASC NULLS LAST
> > (6 rows)
> 
> > Your concern is that EXPLAIN shows ForeignScan node as "Foreign Scan"
> > although it actually works as "Foreign Join".
> 
> That may be so, but my point is that the target relations involved in
> the foreign join (ie, ft1 and ft2) should be printed somewhere in the
> EXPLAIN output by core, as in EXPLAIN for a simple foreign table scan.
>
Why? According to your rule, Hash Join should take "on t0,t1,t2".

postgres=# explain select id from t0 natural join t1 natural join t2;
 QUERY PLAN
-
 Hash Join  (cost=6370.00..4560826.24 rows=98784048 width=4)
   Hash Cond: (t0.aid = t1.aid)
   ->  Hash Join  (cost=3185.00..3199360.58 rows=98784048 width=8)
 Hash Cond: (t0.bid = t2.bid)
 ->  Seq Scan on t0  (cost=0.00..184.80 rows=10080 width=12)
 ->  Hash  (cost=1935.00..1935.00 rows=10 width=4)
   ->  Seq Scan on t2  (cost=0.00..1935.00 rows=10 width=4)
   ->  Hash  (cost=1935.00..1935.00 rows=10 width=4)
 ->  Seq Scan on t1  (cost=0.00..1935.00 rows=10 width=4)
(9 rows)

> > My suggestion makes EXPLAIN print "Foreign %s" according to the label
> > assigned by the extension. Only FDW driver knows how this ForeignScan
> > node actually performs on, thus, only FDW driver can set meaningful label.
> >
> > This label may be "Join", may be "Partial Aggregate + Join", or something
> > others according to the actual job of this node.
> 
> OK, thanks for that.  But sorry, I'm not sure that's a good idea.  One
> reason for that is that FDWs/extentions might give different labels to
> the same upper-planner action, which would lead to confusing EXPLAINs.
>
If an extension puts in a label that is hard to understand, that is the extension's problem.

> Another is that labeling might be annoying, so some FDWs/extensions may
> not want to do that.
>
I'm happy with an arbitrary label; it is not annoying to me. :-)

> Couldn't we print EXPLAINs in a more unified way?  A simple idea is to
> introduce a broad label, eg, "Foreign Processing", as a unified label;
> if the FDW/extension performs any upper-planner actions remotely, then
> the label "Foreign Processing" will be printed by core, and following
> that, the FDW/extension would print what they want, using
> ExplainForeignScan/ExplainCustomScan.  Probably something like this:
> 
>Foreign Processing
>  Remote Operations: ...
> 
> In the Remote Operations line, the FDW/extension could print any info
> about remote operations, eg, "Scan/Join + Aggregate".  Different data
> sources have different concepts/terms, so it seems reasonable to me that
> there would be different descriptions by different data sources, to the
> same plan actions done remotely.
>
"Foreign" implies this node is processed by FDW, but "Procesing" gives us
no extra information; seems to me redundant.
Prior to the new invention, please explain why you don't want to by my
suggestion first? Annoying is a feel of you, but not a logic to persuade
others.

> >>>  - A flag to turn on/off printing relation(s) name
> >>
> >> ISTM that the relation information should be always printed even in the
> >> case of scanrelid=0 by the core, because that would be essential for
> >> analyzing EXPLAIN results.
> 
> > We have no information which relations are actually scanned by ForeignScan
> > and CustomScan. Your patch abuses fs_relids and custom_relids, however,
> > these fields don't indicate the relations actually scan by these nodes.
> > It just tracks relations processed by this node or its children.
> 
> Maybe my explanation was not enough, but I just m

Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-07-27 Thread Kouhei Kaigai
> On 2016/07/27 13:51, Kouhei Kaigai wrote:
> > This output is, at least, not incorrect.
> > This ForeignScan actually scan a relation that consists of two joined
> > tables on the remote side. So, not incorrect, but may not convenient for
> > better understanding by users who don't have deep internal knowledge.
> 
> Hmm.
> 
> > On the other hands, I cannot be happy with your patch because it concludes
> > the role of ForeignScan/CustomScan with scanrelid==0 is always join.
> > However, it does not cover all the case.
> >
> > When postgres_fdw gets support of remote partial aggregate, we can implement
> > the feature using ForeignScan with scanrelid==0. Is it a join? No.
> 
> Yeah, I think that that could be implemented in both cases (scanrelid>0
> and scanrelid=0).
> 
> > Probably, the core cannot know exact purpose of ForeignScan/CustomScan with
> > scanrelid==0. It can be a remote partial aggregation. It can be an 
> > alternative
> > sort logic by GPU. It can be something others.
> > So, I think extension needs to inform the core more proper label to show;
> > including a flag to control whether the relation(s) name shall be attached
> > next to the ForeignScan/CustomScan line.
> >
> > I'd like you to suggest to add two fields below:
> >  - An alternative label extension wants to show instead of the default one
> 
> How about adding something like the "Additional Operations" line as
> proposed in a previous email, instead?  As you know, FDWs/Extensions
> could add such information through ExplainForeignScan/ExplainCustomScan.
>
No. What I'm saying is this:


EXPLAIN (COSTS false, VERBOSE)
SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY  
t1.c3, t1.c1 OFFSET 100 LIMIT 10;
  QUERY PLAN   
---
Limit
  Output: t1.c1, t2.c1, t1.c3
  ->  Foreign Scan
  
Output: t1.c1, t2.c1, t1.c3
Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
Remote SQL: SELECT r1."C 1", r1.c3, r2."C 1" FROM ("S 1"."T  
1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1" ORDER BY  
r1.c3 ASC N\
ULLS LAST, r1."C 1" ASC NULLS LAST
(6 rows)


Your concern is that EXPLAIN shows the ForeignScan node as "Foreign Scan"
although it actually works as a "Foreign Join".

My suggestion makes EXPLAIN print "Foreign %s" according to a label assigned
by the extension. Only the FDW driver knows what this ForeignScan node actually
performs, thus only the FDW driver can set a meaningful label.

This label may be "Join", "Partial Aggregate + Join", or something else,
according to the actual job of this node.

> >  - A flag to turn on/off printing relation(s) name
> 
> ISTM that the relation information should be always printed even in the
> case of scanrelid=0 by the core, because that would be essential for
> analyzing EXPLAIN results.
>
We have no information about which relations are actually scanned by a
ForeignScan or CustomScan. Your patch abuses fs_relids and custom_relids;
however, these fields don't indicate the relations actually scanned by these
nodes. They just track the relations processed by this node or its children.

See the example below. This GpuJoin on CustomScan takes three SeqScan nodes as
inner/outer sources of the join. GpuJoin never actually scans the tables; it
picks up the records generated by the underlying SeqScan nodes.


postgres=# explain select id from t0 natural join t1 natural join t2;
QUERY PLAN
---
 Custom Scan (GpuJoin)  (cost=12385.67..291245.35 rows=9815465 width=4)
   GPU Projection: t0.id
   Depth 1: GpuHashJoin, HashKeys: (t0.bid)
JoinQuals: (t0.bid = t2.bid)
Nrows (in/out: 98.15%), KDS-Hash (size: 13.47MB, nbatches: 1)
   Depth 2: GpuHashJoin, HashKeys: (t0.aid)
JoinQuals: (t0.aid = t1.aid)
Nrows (in/out: 100.00%), KDS-Hash (size: 13.47MB, nbatches: 1)
   ->  Seq Scan on t0  (cost=0.00..18.96 rows=996 width=12)
   ->  Seq Scan on t2  (cost=0.00..1935.00 rows=10 width=4)
   ->  Seq Scan on t1  (cost=0.00..1935.00 rows=10 width=4)
(11 rows)


If the EXPLAIN command attaches something like "on t0,t1,t2" after the GpuJoin
*with no choice*, it is problematic and misleading.
It should be controllable by the extension, which knows what tables are
actually scanned. (Please note that I never said the extension makes the
string; that is the job of core EXPLAIN. I am suggesting it needs to be
controllable.)


Even though I recommended a boolean flag 

Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-07-26 Thread Kouhei Kaigai
> I noticed that currently the core doesn't show any information on the
> target relations involved in a foreign/custom join in EXPLAIN, by
> itself.  Here is an example:
> 
> -- join two tables
> EXPLAIN (COSTS false, VERBOSE)
> SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY
> t1.c3, t1.c1 OFFSET 100 LIMIT 10;
>   QUERY PLAN   \
> 
> --
> -\
> ---
> Limit
>   Output: t1.c1, t2.c1, t1.c3
>   ->  Foreign Scan
> Output: t1.c1, t2.c1, t1.c3
> Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
> Remote SQL: SELECT r1."C 1", r1.c3, r2."C 1" FROM ("S 1"."T
> 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1" ORDER BY
> r1.c3 ASC N\
> ULLS LAST, r1."C 1" ASC NULLS LAST
> (6 rows)
> 
> postgres_fdw shows the target relations in the Relations line, as shown
> above, but I think that the core should show such information
> independently of FDWs; in the above example replace "Foreign Scan" with
> "Foreign Join on public.ft1 t1, public.ft2 t2".  Please find attached a
> patch for that.  Comments welcome!
>
This output is, at least, not incorrect.
This ForeignScan actually scans a relation that consists of two tables joined
on the remote side. So it is not incorrect, but it may not be convenient for
users who don't have deep internal knowledge.

On the other hand, I cannot be happy with your patch, because it assumes the
role of a ForeignScan/CustomScan with scanrelid==0 is always a join. However,
that does not cover all the cases.

When postgres_fdw gets support for remote partial aggregates, we can implement
the feature using a ForeignScan with scanrelid==0. Is it a join? No.

Probably, the core cannot know the exact purpose of a ForeignScan/CustomScan
with scanrelid==0. It can be a remote partial aggregation. It can be
alternative sort logic on a GPU. It can be something else.
So I think the extension needs to give the core a more proper label to show,
including a flag to control whether the relation name(s) shall be attached
next to the ForeignScan/CustomScan line.

I'd like to suggest adding the two fields below:
 - an alternative label the extension wants to show instead of the default one
 - a flag to turn on/off printing the relation name(s)

EXPLAIN can print proper information according to these requirements.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] make clean didn't clean up files generated from *.(y|l)

2016-06-28 Thread Kouhei Kaigai
> Kouhei Kaigai  writes:
> > I tried to build the latest master branch just after the switch from
> > REL9_5_STABLE and "make clean", however, repl_gram.c was not cleaned
> > up correctly. So, my problem is that repl_gram.l was the latest version,
> > but compiler saw the repl_gram.c generated based on the v9.5 source.
> > ...
> > Probably, we have to add explicit cleanup of these auto-generated files
> > on Makefiles.
> 
> "make clean" absolutely should NOT remove that file; not even "make
> distclean" should, because we ship it in tarballs.  Likewise for the other
> bison product files you mention, as well as a boatload of other derived
> files.
> 
> If you want to checkout a different release branch in the same working
> directory, I'd suggest "make maintainer-clean" or "git clean -dfx" first.
> (Personally I don't ever do that --- it's much easier to maintain a
> separate workdir per branch.)
> 
> Having said that, switching to a different branch should have resulted in
> repl_gram.l being updated by git, and thereby acquiring a new file mod
> date; so I don't understand why make wouldn't have chosen to rebuild
> repl_gram.c.  Can you provide a reproducible sequence that makes this
> happen?
>
Ah, I might have done something improper just before the branch switch.

$ cd ~/source/pgsql <-- REL9_5_STABLE; already built
$ git checkout master
$ cp -r ~/source/pgsql ~/repo/pgsql-kg
$ cd ~/repo/pgsql-kg
$ ./configure
$ make clean
$ make  <-- repl_gram.c raised an error

~/source/pgsql is a copy of the community's branch, with none of my own
modifications. To keep it clean, I copied the entire repository to another
directory, but the cp command updated the file modification timestamps.
That may be the reason why repl_gram.c was not rebuilt.

Sorry for the noise.
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] make clean didn't clean up files generated from *.(y|l)

2016-06-28 Thread Kouhei Kaigai
Hello,

I got the build error below. It complains that K_RESERVE_WAL is not declared;
however, that should not be something that has been overlooked for a long time.

I tried to build the latest master branch just after switching from
REL9_5_STABLE and running "make clean"; however, repl_gram.c was not cleaned
up correctly. So my problem is that repl_gram.l was the latest version,
but the compiler saw a repl_gram.c generated from the 9.5 source.

I could see similar problems in:
  src/backend/replication/repl_gram.c
  src/interfaces/ecpg/preproc/pgc.c
  src/bin/pgbench/exprparse.c
  src/bin/pgbench/exprscan.c
  src/pl/plpgsql/src/pl_gram.c

(*) At least, these files raised build errors.

Probably, we have to add explicit cleanup of these auto-generated files
to the Makefiles.

Thanks,


gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement 
-Wendif-labels -Wmissing-format-attribute -Wformat-security 
-fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O2 -O0 -g -I. -I. 
-I../../../src/include -D_GNU_SOURCE   -c -o repl_gram.o repl_gram.c
In file included from repl_gram.y:320:0:
repl_scanner.l: In function 'replication_yylex':
repl_scanner.l:98:10: error: 'K_RESERVE_WAL' undeclared (first use in this 
function)
 RESERVE_WAL   { return K_RESERVE_WAL; }
  ^
repl_scanner.l:98:10: note: each undeclared identifier is reported only once 
for each function it appears in
make[3]: *** [repl_gram.o] Error 1
make[3]: Leaving directory `/home/kaigai/repo/pgsql/src/backend/replication'
make[2]: *** [replication-recursive] Error 2
make[2]: Leaving directory `/home/kaigai/repo/pgsql/src/backend'
make[1]: *** [all-backend-recurse] Error 2
make[1]: Leaving directory `/home/kaigai/repo/pgsql/src'
make: *** [all-src-recurse] Error 2


--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Does people favor to have matrix data type?

2016-06-01 Thread Kouhei Kaigai
> -Original Message-
> From: Jim Nasby [mailto:jim.na...@bluetreble.com]
> Sent: Wednesday, June 01, 2016 11:32 PM
> To: Kaigai Kouhei(海外 浩平); Gavin Flower; Joe Conway; Ants Aasma; Simon Riggs
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Does people favor to have matrix data type?
> 
> On 5/30/16 9:05 PM, Kouhei Kaigai wrote:
> > Due to performance reason, location of each element must be deterministic
> > without walking on the data structure. This approach guarantees we can
> > reach individual element with 2 steps.
> 
> Agreed.
> 
> On various other points...
> 
> Yes, please keep the discussion here, even when it relates only to PL/R.
> Whatever is being done for R needs to be done for plpython as well. I've
> looked at ways to improve analytics in plpython related to this, and it
> looks like I need to take a look at the fast-path function stuff. One of
> the things I've pondered for storing ndarrays in Postgres is how to
> reduce or eliminate the need to copy data from one memory region to
> another. It would be nice if there was a way to take memory that was
> allocated by one manager (ie: python's) and transfer ownership of that
> memory directly to Postgres without having to copy everything. Obviously
> you'd want to go the other way as well. IIRC cython's memory manager is
> the same as palloc in regard to very large allocations basically being
> ignored completely, so this should be possible in that case.
> 
> One thing I don't understand is why this type needs to be limited to 1
> or 2 dimensions? Isn't the important thing how many individual elements
> you can fit into GPU? So if you can fit a 1024x1024, you could also fit
> a 100x100x100, a 32x32x32x32, etc. At low enough values maybe that stops
> making sense, but I don't see why there needs to be an artificial limit.
>
Simply, I hadn't looked at matrices with more than 2 dimensions.
Is that a valid mathematical concept?
By its nature, a GPU is designed to map threads onto the X-, Y-, and Z-axes.
However, it is not limited to 3 dimensions, because the programmer can treat
the upper 10 bits of the X-axis as a 4th dimension, for example.

> I think what's important for something like kNN is that the storage is
> optimized for this, which I think means treating the highest dimension
> as if it was a list. I don't know if it then matters whither the lower
> dimensions are C style vs FORTRAN style. Other algorithms might want
> different storage.
>
FORTRAN style is preferable, because it allows using BLAS libraries without
any data format transformation.
I'm not sure why you prefer a list structure on the highest dimension.
A simple lookup table is enough, and is well suited to massively parallel
processors.
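To make the point concrete, here is a minimal illustration of Fortran-order
(column-major) indexing, which is what BLAS routines expect. This is just a
sketch under that assumption, not code from any existing patch:

#include <stddef.h>

/* Element (i, j) of an M-by-N matrix stored in column-major order,
 * i.e. each column is contiguous, as BLAS expects. */
static inline double
mat_get(const double *a, int M, int i, int j)
{
    return a[(size_t) j * M + i];
}

A row-major (C-style) layout would need either a transposition or an interface
that accepts row-major data (such as the CBLAS order flag) before the array can
be handed to such a library.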

> Something else to consider is the 1G toast limit. I'm pretty sure that's
> why MADlib stores matricies as a table of vectors. I know for certain
> it's a problem they run into, because they've discussed it on their
> mailing list.
> 
> BTW, take a look at MADlib svec[1]... ISTM that's just a special case of
> what you're describing with entire grids being zero (or vice-versa).
> There might be some commonality there.
> 
> [1] https://madlib.incubator.apache.org/docs/v1.8/group__grp__svec.html
>
Once we try to treat a table as the representation of a matrix, it goes
beyond the scope of a data type in PostgreSQL. Users would likely have to do
something special to handle it, beyond the operator functions that support
matrix data types.

For a matrix larger than the toast limit, my thought is that a large matrix
consisting of multiple grids can have one toast pointer per sub-matrix.
Although an individual sub-matrix must be at most 1GB, we can represent an
entire matrix larger than 1GB as a set of grids.

As I wrote in the previous message, the matrix structure header will have an
offset to each sub-matrix grid. A sub-matrix will be stored inline, or as an
external toast pointer if VARATT_IS_1B_E(ptr).
Probably, we also have to hack the row deletion code so that it does not leak
sub-matrices in the toast table.
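For illustration only, the on-disk header I have in mind would look roughly
like the sketch below. The struct and field names are hypothetical, not an
existing PostgreSQL type:

/* Sketch of a grid-partitioned matrix header (assumes postgres.h). */
typedef struct MatrixGridHeader
{
    int32   vl_len_;    /* varlena header (do not touch directly) */
    int32   nrows;      /* total number of rows */
    int32   ncols;      /* total number of columns */
    int32   grid_sz;    /* rows/columns per grid, e.g. 1024 */
    /*
     * One entry per grid, in grid-row-major order.  An offset of 0 means
     * the grid is entirely zero and not stored; otherwise it points to a
     * sub-matrix varlena, which may itself be inline or toasted.
     */
    int64   grid_offset[FLEXIBLE_ARRAY_MEMBER];
} MatrixGridHeader;

Whether each grid_offset points inside the same datum or to a separate toast
pointer is exactly the design point discussed above.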

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

 --
> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
> Experts in Analytics, Data Architecture and PostgreSQL
> Data in Trouble? Get it in Treble! http://BlueTreble.com
> 855-TREBLE2 (855-873-2532)   mobile: 512-569-9461

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Does people favor to have matrix data type?

2016-05-30 Thread Kouhei Kaigai
> -Original Message-
> From: Gavin Flower [mailto:gavinflo...@archidevsys.co.nz]
> Sent: Tuesday, May 31, 2016 9:47 AM
> To: Kaigai Kouhei(海外 浩平); Joe Conway; Jim Nasby; Ants Aasma; Simon Riggs
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Does people favor to have matrix data type?
> 
> On 31/05/16 12:01, Kouhei Kaigai wrote:
> >> On 05/29/2016 04:55 PM, Kouhei Kaigai wrote:
> >>> For the closer integration, it may be valuable if PL/R and PL/CUDA can 
> >>> exchange
> >>> the data structure with no serialization/de-serialization when PL/R code 
> >>> tries
> >>> to call SQL functions.
> >> I had been thinking about something similar. Maybe PL/R can create an
> >> extension within the R environment that wraps PL/CUDA directly or at the
> >> least provides a way to use a fast-path call. We should probably try to
> >> start out with one common use case to see how it might work and how much
> >> benefit there might be.
> >>
> > My thought is the second option above. If SPI interface supports fast-path
> > like 'F' protocol, it may become a natural way for other PLs also to
> > integrate SQL functions in other languages.
> >
> >>> IIUC, pg.spi.exec("SELECT my_function(...)") is the only way to call SQL 
> >>> functions
> >> inside PL/R scripts.
> >>
> >> Correct (currently).
> >>
> >> BTW, this is starting to drift off topic I think -- perhaps we should
> >> continue off list?
> >>
> > Some elements are common for PostgreSQL (matrix data type and fastpath SPI
> > interface). I like to keep the discussion on the list.
> > Regarding to the PoC on a particular use case, it might be an off-list
> > discussion.
> >
> > Thanks,
> > --
> > NEC Business Creation Division / PG-Strom Project
> > KaiGai Kohei 
> >
> Possibly there should be two matrix types in Postgres: the first would
> be the default and optimized for small dense matrices, the second would
> store large sparse matrices efficiently in memory at the expensive of
> speed (possibly with one or more parameters relating to how sparse it is
> likely to be?) - for appropriate definitions 'small' & 'large', though
> memory savings for the latter type might not kick in unless the matrices
> are big enough (so a small sparse matrix might consume more memory than
> a nominally larger dense matrix type & a sparse matrix might have to be
> sufficiently sparse to make real memory savings at any size).
>
One idea in my mind is to represent a sparse matrix as a grid of smaller
matrices and to omit the all-zero areas. It looks like indirect pointer
references: the header of the matrix type has an offset value for each grid,
and offset == 0 means that grid contains only zeroes.

For performance reasons, the location of each element must be deterministic,
without walking the data structure. This approach guarantees we can reach an
individual element in 2 steps.

A flat matrix can be represented as a special case of the sparse matrix:
if the entire matrix consists of a single grid, it is a flat matrix.
We may not need to define two separate data types.
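As a rough illustration of that 2-step access (purely a sketch, with the
simplifying assumption that the matrix dimensions are multiples of the grid
size; none of this is existing code):

#include <stdint.h>

#define GRID_SZ 1024    /* rows/columns per grid, as an example */

/*
 * offsets[] is the per-grid table in the matrix header; 0 means the grid is
 * all zero and therefore not stored.  "base" points at the start of the
 * matrix datum.
 */
static float
matrix_get(const char *base, const int64_t *offsets,
           int ncols, int i, int j)
{
    int     ngrid_x = ncols / GRID_SZ;              /* grids per row */
    int64_t off = offsets[(i / GRID_SZ) * ngrid_x + (j / GRID_SZ)];

    if (off == 0)
        return 0.0f;            /* step 1 found an omitted all-zero grid */
    /* step 2: index inside the dense GRID_SZ x GRID_SZ block */
    const float *grid = (const float *) (base + off);
    return grid[(i % GRID_SZ) * GRID_SZ + (j % GRID_SZ)];
}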

> Probably good to think of 2 types at the start, even if the only the
> first is implemented initially.
>
I agree.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Does people favor to have matrix data type?

2016-05-30 Thread Kouhei Kaigai
> On 05/29/2016 04:55 PM, Kouhei Kaigai wrote:
> > For the closer integration, it may be valuable if PL/R and PL/CUDA can 
> > exchange
> > the data structure with no serialization/de-serialization when PL/R code 
> > tries
> > to call SQL functions.
> 
> I had been thinking about something similar. Maybe PL/R can create an
> extension within the R environment that wraps PL/CUDA directly or at the
> least provides a way to use a fast-path call. We should probably try to
> start out with one common use case to see how it might work and how much
> benefit there might be.
>
My thought is the second option above. If the SPI interface supported a fast
path like the 'F' protocol, it could become a natural way for other PLs to
integrate SQL functions written in other languages as well.

> > IIUC, pg.spi.exec("SELECT my_function(...)") is the only way to call SQL 
> > functions
> inside PL/R scripts.
> 
> Correct (currently).
> 
> BTW, this is starting to drift off topic I think -- perhaps we should
> continue off list?
>
Some elements are common to PostgreSQL itself (the matrix data type and a
fast-path SPI interface), so I'd like to keep that discussion on the list.
The PoC on a particular use case could be an off-list discussion.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Does people favor to have matrix data type?

2016-05-29 Thread Kouhei Kaigai
> On 05/28/2016 03:33 PM, Kouhei Kaigai wrote:
> >> -Original Message-
> >> From: Joe Conway [mailto:m...@joeconway.com]
> >> Sent: Sunday, May 29, 2016 1:40 AM
> >> To: Kaigai Kouhei(海外 浩平); Jim Nasby; Ants Aasma; Simon Riggs
> >> Cc: pgsql-hackers@postgresql.org
> >> Subject: Re: [HACKERS] Does people favor to have matrix data type?
> >>
> >> On 05/28/2016 07:12 AM, Kouhei Kaigai wrote:
> >>> Sparse matrix! It is a disadvantaged area for the current array format.
> >>>
> >>> I have two ideas. HPC folks often split a large matrix into multiple
> >>> grid. A grid is typically up to 1024x1024 matrix, for example.
> >>> If a grid is consists of all zero elements, it is obvious we don't need
> >>> to have individual elements on the grid.
> >>> One other idea is compression. If most of matrix is zero, it is an ideal
> >>> data for compression, and it is easy to reconstruct only when calculation.
> >>>
> >>>> Related to this, Tom has mentioned in the past that perhaps we should
> >>>> support abstract use of the [] construct. Currently point finds a way to
> >>>> make use of [], but I think that's actually coded into the grammar.
> >>>>
> >>> Yep, if we consider 2D-array is matrix, no special enhancement is needed
> >>> to use []. However, I'm inclined to have own data structure for matrix
> >>> to present the sparse matrix.
> >>
> >> +1 I'm sure this would be useful for PL/R as well.
> >>
> >> Joe
> >>
> > It is pretty good idea to combine PL/R and PL/CUDA (what I'm now working)
> > for advanced analytics. We will be able to off-load heavy computing portion
> > to GPU, then also utilize various R functions inside database.
> 
> Agreed. Perhaps at some point we should discuss closer integration of
> some sort, or at least a sample use case.
>
What I'm trying to implement first is k-means clustering on GPU. Its core
workload is iterating massive distance calculations. When I ran R's kmeans()
function for a million items with 10 clusters on 40 dimensions, it took about
a thousand seconds.
If a GPU version produces the result matrix more rapidly, then I expect R can
plot the relationship between items and clusters in a human-friendly way.

For closer integration, it may be valuable if PL/R and PL/CUDA could exchange
the data structure with no serialization/de-serialization when PL/R code calls
SQL functions. IIUC, pg.spi.exec("SELECT my_function(...)") is the only way to
call SQL functions inside PL/R scripts.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Does people favor to have matrix data type?

2016-05-28 Thread Kouhei Kaigai
> -Original Message-
> From: Joe Conway [mailto:m...@joeconway.com]
> Sent: Sunday, May 29, 2016 1:40 AM
> To: Kaigai Kouhei(海外 浩平); Jim Nasby; Ants Aasma; Simon Riggs
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Does people favor to have matrix data type?
> 
> On 05/28/2016 07:12 AM, Kouhei Kaigai wrote:
> > Sparse matrix! It is a disadvantaged area for the current array format.
> >
> > I have two ideas. HPC folks often split a large matrix into multiple
> > grid. A grid is typically up to 1024x1024 matrix, for example.
> > If a grid is consists of all zero elements, it is obvious we don't need
> > to have individual elements on the grid.
> > One other idea is compression. If most of matrix is zero, it is an ideal
> > data for compression, and it is easy to reconstruct only when calculation.
> >
> >> Related to this, Tom has mentioned in the past that perhaps we should
> >> support abstract use of the [] construct. Currently point finds a way to
> >> make use of [], but I think that's actually coded into the grammar.
> >>
> > Yep, if we consider 2D-array is matrix, no special enhancement is needed
> > to use []. However, I'm inclined to have own data structure for matrix
> > to present the sparse matrix.
> 
> +1 I'm sure this would be useful for PL/R as well.
> 
> Joe
>
It is a pretty good idea to combine PL/R and PL/CUDA (which I'm working on now)
for advanced analytics. We would be able to off-load the heavy computing
portion to the GPU, and also utilize various R functions inside the database.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Does people favor to have matrix data type?

2016-05-28 Thread Kouhei Kaigai
> On 5/25/16 7:46 AM, Kouhei Kaigai wrote:
> > My only concern is that domain type is not allowed to define type cast.
> > If we could add type cast on domain, we can define type transformation from
> > other array type to matrix.
> 
> I've actually wished for that in the past, as well as casting to
> compound types. Having those would make it easier to mock up a real data
> type for experimentation.
> 
> I strongly encourage you to talk to the MADlib community about
> first-class matrix support. They currently emulate matrices via arrays.
>
The MADlib folks contacted me in the past, when I initially made PG-Strom
several years ago; however, there has been no communication with them since.

https://madlib.incubator.apache.org/docs/v1.8/group__grp__arraysmatrix.html
According to the documentation, their approach is different from what
I'd like to implement. They use a table as if it were a matrix. Because of
that data format, we would have to rearrange the data elements to fit the
requirements of high-performance matrix libraries (like cuBLAS).

> I don't know offhand if they support NULLs in their regular matrices.
> They also support a sparsematrix "type" that is actually implemented as
> a table that contains coordinates and a value for each value in the
> matrix. Having that as a type might also be interesting (if you're
> sparse enough, that will be cheaper than the current NULL bitmap
> implementation).
>
Sparse matrices! That is a weak area for the current array format.

I have two ideas. HPC folks often split a large matrix into multiple grids;
a grid is typically up to a 1024x1024 matrix, for example.
If a grid consists of all-zero elements, we obviously don't need to store
the individual elements of that grid.
The other idea is compression. If most of the matrix is zero, it is ideal data
for compression, and it is easy to reconstruct it only when calculating.

> Related to this, Tom has mentioned in the past that perhaps we should
> support abstract use of the [] construct. Currently point finds a way to
> make use of [], but I think that's actually coded into the grammar.
>
Yep, if we consider a 2D array to be a matrix, no special enhancement is
needed to use []. However, I'm inclined to have a dedicated data structure for
matrix, so that it can represent sparse matrices.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Does people favor to have matrix data type?

2016-05-25 Thread Kouhei Kaigai
> On Wed, May 25, 2016 at 10:38 AM, Simon Riggs  wrote:
> > On 25 May 2016 at 03:52, Kouhei Kaigai  wrote:
> >>
> >> In a few days, I'm working for a data type that represents matrix in
> >> mathematical area. Does people favor to have this data type in the core,
> >> not only my extension?
> >
> >
> > If we understood the use case, it might help understand whether to include
> > it or not.
> >
> > Multi-dimensionality of arrays isn't always useful, so this could be good.
> 
> Many natural language and image processing methods extract feature
> vectors that then use some simple distance metric, like dot product to
> calculate vector similarity. For example we presented a latent
> semantic analysis prototype at pgconf.eu 2015 that used real[] to
> store the features and a dotproduct(real[], real[]) real function to
> do similarity matching. However using real[] instead of a hypothetical
> realvector or realmatrix did not prove to be a huge overhead, so
> overall I'm on the fence for the usefulness of a special type. Maybe a
> helper function or two to validate the additional restrictions in a
> check constraint would be enough.
>
A 'matrix' data type defined as a domain over real[] is one implementation
option. We can define operators on domain types, which allows us to process a
large amount of calculation with one operation, at native binary speed.

My only concern is that a domain type is not allowed to define a type cast.
If we could add casts on domains, we could define type transformations from
other array types to matrix.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Does people favor to have matrix data type?

2016-05-25 Thread Kouhei Kaigai
> -Original Message-
> From: Simon Riggs [mailto:si...@2ndquadrant.com]
> Sent: Wednesday, May 25, 2016 4:39 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Does people favor to have matrix data type?
> 
> On 25 May 2016 at 03:52, Kouhei Kaigai  wrote:
> 
> 
>   In a few days, I'm working for a data type that represents matrix in
>   mathematical area. Does people favor to have this data type in the core,
>   not only my extension?
> 
> 
> If we understood the use case, it might help understand whether to include it 
> or not.
> 
> Multi-dimensionality of arrays isn't always useful, so this could be good.
>
As you may expect, the reason why I've worked on a matrix data type is partly
as groundwork for GPU acceleration, but it is not limited to that.

What I'm trying to do is in-database calculation of analytic algorithms,
rather than exporting the entire dataset to the client side.
My first target is k-means clustering, often used in data mining.
When we categorize N items that have M attributes into k clusters, the master
data can be represented as an NxM matrix, which is equivalent to N vectors in
M dimensions.
The cluster centroids are also located in that M-dimensional space, so they
can be represented as a kxM matrix, equivalent to k vectors in M dimensions.
The k-means algorithm requires calculating the distance to every cluster
centroid for each item, which produces an Nxk matrix, usually called the
distance matrix. Next, it updates the cluster centroids using the distance
matrix, then repeats the entire process until convergence.

The heart of the workload is the calculation of the distance matrix. When I
tried to write the k-means algorithm using SQL + R, its performance was not
sufficient (poor).
  https://github.com/kaigai/toybox/blob/master/Rstat/pgsql-kmeans.r
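For reference, the distance-matrix step boils down to something like the C
sketch below (just an illustration of the workload, assuming plain row-major
float arrays; it is not the code from the repository above nor from PG-Strom):

/*
 * D[n][k] = squared Euclidean distance between item n (row n of X, N x M)
 * and centroid k (row k of C, K x M).
 */
static void
distance_matrix(const float *X, const float *C, float *D,
                int N, int K, int M)
{
    for (int n = 0; n < N; n++)
        for (int k = 0; k < K; k++)
        {
            float   s = 0.0f;

            for (int m = 0; m < M; m++)
            {
                float   d = X[n * M + m] - C[k * M + m];

                s += d * d;
            }
            D[n * K + k] = s;
        }
}

This O(N * K * M) loop is exactly the part that maps well onto a GPU.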

If we had native functions to use instead of the complicated SQL expressions,
it would make sense for people who try in-database analytics.

Also, fortunately, PostgreSQL's 2-D array format is binary compatible with
what BLAS libraries require. That will allow a GPU to process large matrices
at HPC-grade performance.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Does people favor to have matrix data type?

2016-05-24 Thread Kouhei Kaigai
For the past few days, I've been working on a data type that represents a
matrix in the mathematical sense. Would people favor having this data type in
the core, not only in my extension?

Like oidvector or int2vector, it is an array type with a few restrictions:
 - 2-dimensional only
 - never contains NULL
 - element type is real or float
 - no custom array lower bounds

A vector in mathematics is a special case of a matrix: either 1xN or Nx1.

We can define various operators between matrix/scalar, like:
  matrix + matrix -> matrix
  matrix - scalar -> matrix
  matrix * matrix -> matrix
  transform(matrix) -> matrix
 :

What do people think?

If the overall consensus is in favor, I'll set up a patch.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Academic help for Postgres

2016-05-11 Thread Kohei KaiGai
2016-05-11 23:20 GMT+09:00 Bruce Momjian :
> I am giving a keynote at an IEEE database conference in Helsinki next
> week (http://icde2016.fi/).  (Yes, I am not attending PGCon Ottawa
> because I accepted the Helsinki conference invitation before the PGCon
> Ottawa date was changed from June to May).
>
> As part of the keynote, I would like to mention areas where academia can
> help us.  The topics I can think of are:
>
> Query optimization
> Optimizer statistics
> Indexing structures
> Reducing function call overhead
> CPU locality
> Sorting
> Parallelism
> Sharding
>
> Any others?
>
How about NVRAM utilization?

-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-09 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of David Rowley
> Sent: Tuesday, May 10, 2016 2:01 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Robert Haas; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] asynchronous and vectorized execution
> 
> On 10 May 2016 at 13:38, Kouhei Kaigai  wrote:
> > My concern about ExecProcNode is, it is constructed with a large switch
> > ... case statement. It involves tons of comparison operation at run-time.
> > If we replace this switch ... case by function pointer, probably, it make
> > performance improvement. Especially, OLAP workloads that process large
> > amount of rows.
> 
> I imagined that any decent compiler would have built the code to use
> jump tables for this. I have to say that I've never checked to make
> sure though.
>
Ah, indeed, you are right. Please forget the above part.

With GCC 4.8.5, the case labels between T_ResultState and T_LimitState are
handled using a jump table.

TupleTableSlot *
ExecProcNode(PlanState *node)
{
:
  
:
switch (nodeTag(node))
  5ad361:   8b 03   mov(%rbx),%eax
  5ad363:   2d c9 00 00 00  sub$0xc9,%eax
  5ad368:   83 f8 24cmp$0x24,%eax
  5ad36b:   0f 87 4f 02 00 00   ja 5ad5c0 
  5ad371:   ff 24 c5 68 48 8b 00jmpq   *0x8b4868(,%rax,8)
  5ad378:   0f 1f 84 00 00 00 00nopl   0x0(%rax,%rax,1)
  5ad37f:   00

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-09 Thread Kouhei Kaigai
uot;event loop", which
> repeatedly waits for FDs chosen by waiting nodes to become readable
> and/or writeable and then gives the node a chance to react.
> Eventually, the waiting node will stop waiting and have a result
> ready, at which point the event loop will give the parent of that node
> a chance to run.  If that node consequently becomes ready, then its
> parent gets a chance to run.  Eventually (we hope), the node for which
> we're waiting becomes ready, and we can then read a result tuple.
> With some more work, this seems like it can handle the FDW case, but I
> haven't worked out how to make it handle the related parallel query
> case.  What we want there is to wait not for the readiness of an FD
> but rather for some other process involved in the parallel query to
> reach a point where it can welcome assistance executing that node.  I
> don't know exactly what the signaling for that should look like yet -
> maybe setting the process latch or something.
> 
> By the way, one smaller executor project that I think we should also
> look at has to do with this comment in nodeSeqScan.c:
> 
> static bool
> SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
> {
> /*
>  * Note that unlike IndexScan, SeqScan never use keys in 
> heap_beginscan
>  * (and this is very bad) - so, here we do not check are keys ok or 
> not.
>  */
> return true;
> }
> 
> Some quick prototyping by my colleague Dilip Kumar suggests that, in
> fact, there are cases where pushing down keys into heap_beginscan()
> could be significantly faster.  Some care is required here because any
> functions we execute as scan keys are run with the buffer locked, so
> we had better not run anything very complicated.  But doing this for
> simple things like integer equality operators seems like it could save
> quite a few buffer lock/unlock cycles and some other executor overhead
> as well.
> 
> Thoughts, ideas, suggestions, etc. very welcome.
>

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] textlike under the LIKE operator for char(n)

2016-05-06 Thread Kohei KaiGai
2016-05-06 23:17 GMT+09:00 Kevin Grittner :
> On Fri, May 6, 2016 at 8:58 AM, Kohei KaiGai  wrote:
>
>> postgres=# select 'abcd'::char(20) LIKE 'ab%cd';
>>  ?column?
>> --
>>  f
>> (1 row)
>>
>> postgres=# select 'abcd'::char(4) LIKE 'ab%cd';
>>  ?column?
>> --
>>  t
>> (1 row)
>>
>> LIKE operator (that is eventually processed by textlike) considers the
>> padding space of char(n) data type as a part of string.
>
> The SQL standard generally requires this for CHAR(n) columns.
>
>> On the other hands, equal operator ignores the padding space when it
>> compares two strings.
>>
>> postgres=# select 'abcd'::char(20) = 'abcd';
>>  ?column?
>> --
>>  t
>> (1 row)
>>
>> postgres=# select 'abcd'::char(4) = 'abcd';
>>  ?column?
>> --
>>  t
>> (1 row)
>
> The SQL standard specifically requires this exception to the
> general rule.
>
>> Is this behavior as expected? or, bug?
>
> This has been discussed on community lists multiple times in the
> past; you might want to search the archives.  I'm not inclined to
> dig through the standard for details on this point again right now,
> but in general the behaviors we provide for CHAR(n) are mandated by
> standard.  It would not entirely shock me if there are some corner
> cases where different behavior could be allowed or even more
> correct, but my recollection is that what you have shown is all
> required to work that way.
>
Thanks. I couldn't quickly figure out the reason for this behavior.
The requirement of the SQL standard is clear guidance, even if it looks
a bit mysterious.

> Generally, I recommend avoiding CHAR(n) columns like the plague.
>
Yep, I agree. I found this issue while porting the LIKE operator to GPU,
not because some real-life query tried to use char(n).

Best regards,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] textlike under the LIKE operator for char(n)

2016-05-06 Thread Kohei KaiGai
Hi,

I found mysterious behavior when using the LIKE operator on the char(n) data type.


postgres=# select 'abcd'::char(20) LIKE 'ab%cd';
 ?column?
--
 f
(1 row)

postgres=# select 'abcd'::char(4) LIKE 'ab%cd';
 ?column?
--
 t
(1 row)

The LIKE operator (which is eventually processed by textlike) considers the
padding spaces of the char(n) data type as part of the string.

On the other hand, the equality operator ignores the padding spaces when it
compares two strings.

postgres=# select 'abcd'::char(20) = 'abcd';
 ?column?
--
 t
(1 row)

postgres=# select 'abcd'::char(4) = 'abcd';
 ?column?
--
 t
(1 row)

The LIKE operator on the char(n) data type is implemented by textlike().

at pg_proc.h:
DATA(insert OID = 1631 (  bpcharlike   PGNSP PGUID 12 1 0 0 0 f f
f f t f i s 2 0 16 "1042 25" _null_ _null_ _null_ _null_ _null_
textlike _null_ _null_ _null_ ));

It calls GenericMatchText() with the length of the target string
calculated by VARSIZE_ANY_EXHDR; however, that length includes the padding
spaces.
It seems to me that bcTruelen(), instead of this macro, would give the correct
length for char(n) data types.
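
For reference, bcTruelen() essentially counts the length while ignoring the
trailing blanks; roughly like this (simplified from varchar.c, shown only to
illustrate the point):

static int
bcTruelen(BpChar *arg)
{
    char       *s = VARDATA_ANY(arg);
    int         len = VARSIZE_ANY_EXHDR(arg);
    int         i;

    /* walk back over the blank padding appended by char(n) */
    for (i = len - 1; i >= 0; i--)
    {
        if (s[i] != ' ')
            break;
    }
    return i + 1;
}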

Is this behavior as expected? or, bug?

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] EPQ recheck across HashJoin, what does it actuall check?

2016-03-31 Thread Kouhei Kaigai
Folks,

Could you correct me if I misunderstand the execution path.

We may have the following example. Query try to get row level locking
on the table 't1' that is under the 'Hash' node.

postgres=# EXPLAIN
postgres-# SELECT * FROM t0 NATURAL JOIN t1 WHERE ax between 10 and 30 FOR 
UPDATE OF t1;
   QUERY PLAN
-
 LockRows  (cost=2682.53..263118.01 rows=1980172 width=93)
   ->  Hash Join  (cost=2682.53..243316.29 rows=1980172 width=93)
 Hash Cond: (t0.aid = t1.aid)
 ->  Seq Scan on t0  (cost=0.00..183332.58 rows=858 width=46)
 ->  Hash  (cost=2435.00..2435.00 rows=19802 width=51)
   ->  Seq Scan on t1  (cost=0.00..2435.00 rows=19802 width=51)
 Filter: ((ax >= '10'::double precision) AND (ax <= 
'30'::double precision))
(7 rows)

A concurrent update on 't1' by another backend may trigger an EPQ recheck to
verify whether the source tuple is still visible.
In this example, if somebody concurrently updated 't1.ax' to 45, LockRows
should discard the joined tuple and should not acquire a row lock.

ExecLockRows() fills the es_epqTuple[] slots with the locked/non-locked tuples,
then calls EvalPlanQualNext() to re-evaluate the tuple.
EvalPlanQualNext() is a thin wrapper around ExecProcNode(), so it walks
down into the child plan nodes.

If the node is some kind of scan node, ExecScan() -> ExecScanFetch() handles
the EPQ recheck appropriately; no problem there.

In the case of HashJoin, it seems to me that ExecHashJoin() takes no special
action. If a record that had 't1.ax' = 20.0 when it was fetched and loaded into
the hash table was later updated to 45.0, it ought to be handled as an invisible
row. However, ExecHashJoin() does not walk down to the inner side again; thus,
the updated tuple might be considered visible, and locked.
Unless this is my oversight, ExecHashJoin() and ExecHash() need to call
ExecProcNode() and then re-evaluate the join condition again, apart from the
hash key values of the latest tuples.

Thought?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-28 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Tuesday, March 29, 2016 10:54 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Petr Jelinek; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Reworks of CustomScan serialization/deserialization
> 
> On Mon, Mar 28, 2016 at 9:36 PM, Kouhei Kaigai  wrote:
> > I don't have a strong reason to keep these stuff in separate files.
> > Both stuffs covers similar features and amount of code are enough small.
> > So, the attached v4 just merged custom-node.[ch] stuff into extensible.
> >
> > Once we put similar routines closely, it may be better to consolidate
> > these routines.
> > As long as EXTNODENAME_MAX_LEN == CUSTOM_NAME_MAX_LEN, both features
> > have identical structure layout, so it is easy to call an internal
> > common function to register or find out a table of callbacks according
> > to the function actually called by other modules.
> >
> > I'm inclined to think to replace EXTNODENAME_MAX_LEN and
> > CUSTOM_NAME_MAX_LEN by NAMEDATALEN again, to fit structure layout.
> 
> I don't think that we need both EXTNODENAME_MAX_LEN and
> CUSTOM_NAME_MAX_LEN; we can use EXTNODENAME_MAX_LEN for both.  I'm
> opposed to using NAMEDATALEN for anything unrelated to the size of a
> Name.  If it's not being stored in a catalog, it doesn't need to care.
>
OK, I adjusted the v4 patch to use EXTNODENAME_MAX_LEN for both.

The structure of the hash entry was revised as follows; it is registered via
an internal common function, RegisterExtensibleNodeEntry, and looked up via
another internal common function, GetExtensibleNodeEntry.

typedef struct
{
    char        extnodename[EXTNODENAME_MAX_LEN];
    const void *extnodemethods;
} ExtensibleNodeEntry;

ExtensibleNodeMethods and CustomScanMethods are stored in
'extensible_node_methods' and 'custom_scan_methods' separately.
The entry-point functions call the above internal common functions with the
appropriate HTAB variable.

It will be reusable if we add further extensible nodes in future
versions.
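
For illustration, the internal registration helper looks roughly like this
(a sketch only; the exact names and error handling in the attached patch may
differ):

static void
RegisterExtensibleNodeEntry(HTAB **p_htable, const char *htable_label,
                            const char *extnodename,
                            const void *extnodemethods)
{
    ExtensibleNodeEntry *entry;
    bool        found;

    if (strlen(extnodename) >= EXTNODENAME_MAX_LEN)
        elog(ERROR, "extensible node name is too long");

    if (*p_htable == NULL)
    {
        HASHCTL     ctl;

        memset(&ctl, 0, sizeof(HASHCTL));
        ctl.keysize = EXTNODENAME_MAX_LEN;
        ctl.entrysize = sizeof(ExtensibleNodeEntry);
        *p_htable = hash_create(htable_label, 100, &ctl, HASH_ELEM);
    }

    entry = (ExtensibleNodeEntry *)
        hash_search(*p_htable, extnodename, HASH_ENTER, &found);
    if (found)
        ereport(ERROR,
                (errcode(ERRCODE_DUPLICATE_OBJECT),
                 errmsg("extensible node type \"%s\" already exists",
                        extnodename)));
    entry->extnodemethods = extnodemethods;
}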

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


pgsql-v9.6-custom-scan-serialization-reworks.5.patch
Description: pgsql-v9.6-custom-scan-serialization-reworks.5.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-28 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Friday, March 25, 2016 12:27 AM
> To: Petr Jelinek
> Cc: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Reworks of CustomScan serialization/deserialization
> 
> On Wed, Mar 23, 2016 at 1:36 PM, Petr Jelinek  wrote:
> > Ok, I am happy with it, marked it as ready for committer (it was marked as
> > committed although it wasn't committed).
> 
> Thanks for fixing the status.   I had forgotten about this thread.
> 
> I can't really endorse the naming conventions here.  I mean, we've got
> the main extensible nodes stuff in extensible.h, and then we've got
> this stuff in custom_node.h (BTW, there is a leftover reference to
> custom-node.h).  There's no hint in the naming that this relates to
> scans, and why is it extensible in one place and custom in another?
> 
> I'm not quite sure how to clean this up.  At a minimum, I think we
> should standardize on "custom_scan.h" instead of "custom_node.h".  I
> think that would be clearer.  But I'm wondering if we should bite the
> bullet and rename everything from "custom" to "extensible" and declare
> it all in "extensible.h".
>
I don't have a strong reason to keep this stuff in separate files.
Both cover similar features and the amount of code is small enough.
So, the attached v4 just merges the custom-node.[ch] stuff into extensible.

Once we put similar routines close together, it may be better to consolidate
them.
As long as EXTNODENAME_MAX_LEN == CUSTOM_NAME_MAX_LEN, both features
have an identical structure layout, so it is easy to call an internal
common function that registers or finds a table of callbacks according
to the function actually called by other modules.

I'm inclined to replace EXTNODENAME_MAX_LEN and
CUSTOM_NAME_MAX_LEN with NAMEDATALEN again, to fit the structure layout.

> src/backend/nodes/custom_node.c:45: indent with spaces.
> +}
> 
Oops, thanks,

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



pgsql-v9.6-custom-scan-serialization-reworks.4.patch
Description: pgsql-v9.6-custom-scan-serialization-reworks.4.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Upper planner pathification

2016-03-28 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Saturday, March 19, 2016 8:57 AM
> To: Tom Lane
> Cc: Robert Haas; Petr Jelinek; David Rowley; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] WIP: Upper planner pathification
> 
> > -Original Message-
> > From: pgsql-hackers-ow...@postgresql.org
> > [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane
> > Sent: Friday, March 18, 2016 11:44 PM
> > To: Kaigai Kouhei(海外 浩平)
> > Cc: Robert Haas; Petr Jelinek; David Rowley; pgsql-hackers@postgresql.org
> > Subject: Re: [HACKERS] WIP: Upper planner pathification
> >
> > Kouhei Kaigai  writes:
> > > On Wed, Mar 16, 2016 at 2:47 PM, Tom Lane  wrote:
> > >> I do not, however, like the proposal to expose wflists and so forth.
> > >> Those are internal data structures in grouping_planner and I absolutely
> > >> refuse to promise that they're going to stay stable.
> >
> > > Hmm... It's not easy to imagine a case that extension wants own idea
> > > to extract window functions from the target list and select active
> > > windows, even if extension wants to have own executor and own cost
> > > estimation logic.
> > > In case when extension tries to add WindowPath + CustomPath(Sort),
> > > extension is interested in alternative sort task, but not window
> > > function itself. It is natural to follow the built-in implementation,
> > > thus, it motivates extension author to take copy & paste the code.
> > > select_active_windows() is static, so extension needs to have same
> > > routine on their side.
> >
> > Well, to be perfectly blunt about it, I have said from day one that this
> > notion that a CustomScan extension will be able to cause arbitrary planner
> > behavior changes is loony.  We are simply not going to drop a hook into
> > every tenth line of the planner for you, nor de-static-ify every internal
> > function, nor (almost equivalently) expose the data structures those
> > functions produce, because it would cripple core planner development to
> > try to keep the implied APIs stable.  And I continue to maintain that any
> > actually-generally-useful ideas would be better handled by submitting them
> > as patches to the core planner, rather than trying to implement them in an
> > arms-length extension.
> >
> > In the case at hand, I notice that the WindowFuncLists struct is
> > actually from find_window_functions in clauses.c, so an extension
> > that needed to get hold of that would be unlikely to do any copying
> > and pasting anyhow -- it'd just call find_window_functions again.
> > (That only needs to search the targetlist, so it's not that expensive.)
> > The other lists you mention are all tightly tied to specific, and not
> > terribly well-designed, implementation strategies for grouping sets and
> > window functions.  Those are *very* likely to change in the near future;
> > and even if they don't, I'm skeptical that an external implementor of
> > grouping sets or window functions would want to use exactly those same
> > implementation strategies.  Maybe the only reason you're there at all
> > is you want to be smarter about the order of doing window functions,
> > for example.
> >
> > So I really don't want to export any of that stuff.
> >
Hmm. I can understand that we have active development around this area,
so the miscellaneous internal data structures may not be stable enough
to expose to extensions.
Would you be open to revisiting this discussion once the implementation
calms down in a future development cycle?
> 
> > As for other details of the proposed patch, I was intending to put
> > all the hook calls into grouping_planner for consistency, rather than
> > scattering them between grouping_planner and its subroutines.  So that
> > would probably mean calling the hook for a given step *before* we
> > generate the core paths for that step, not after.  Did you have a
> > reason to want the other order?  (If you say "so the hook can look
> > at the core-made paths", I'm going to say "that's a bad idea".  It'd
> > further increase the coupling between a CustomScan extension and core.)
> >
> No deep reason. I just followed the manner in scan/join path hook; that
> calls extension once the core feature adds built-in path nodes.
>
Ah, I overlooked a deep reason.
ForeignScan/CustomScan may have an alternative execution path if extensions ...

Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-23 Thread Kouhei Kaigai
> >> On 15/03/16 05:03, Kouhei Kaigai wrote:
> >>> Petr,
> >>>
> >>> The attached patch is the revised one that follows the new extensible-
> >>> node routine.
> >>>
> >>> It is almost same the previous version except for:
> >>> - custom-apis.[ch] was renamed to custom-node.[ch]
> >>> - check for the length of custom-scan-method name followed
> >>> the manner of RegisterExtensibleNodeMethods()
> >>>
> >>
> >> Hi,
> >>
> >> looks good, only nitpick I have is that it probably should be
> >> custom_node.h with underscore given that we use underscore everywhere
> >> (except for libpq and for some reason atomic ops).
> >>
> > Sorry for my response late.
> >
> > The attached patch just renamed custom-node.[ch] by custom_node.[ch].
> > Other portions are not changed from the previous revison.
> >
> 
> Forgot to attach?
>
Yes. Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



pgsql-v9.6-custom-scan-serialization-reworks.3.patch
Description: pgsql-v9.6-custom-scan-serialization-reworks.3.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-23 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Petr Jelinek
> Sent: Thursday, March 17, 2016 5:06 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Robert Haas; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Reworks of CustomScan serialization/deserialization
> 
> On 15/03/16 05:03, Kouhei Kaigai wrote:
> > Petr,
> >
> > The attached patch is the revised one that follows the new extensible-
> > node routine.
> >
> > It is almost same the previous version except for:
> > - custom-apis.[ch] was renamed to custom-node.[ch]
> > - check for the length of custom-scan-method name followed
> >the manner of RegisterExtensibleNodeMethods()
> >
> 
> Hi,
> 
> looks good, only nitpick I have is that it probably should be
> custom_node.h with underscore given that we use underscore everywhere
> (except for libpq and for some reason atomic ops).
>
Sorry for my late response.

The attached patch just renames custom-node.[ch] to custom_node.[ch].
Other portions are unchanged from the previous revision.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Upper planner pathification

2016-03-19 Thread Kouhei Kaigai
> Robert Haas  writes:
> > On Wed, Mar 16, 2016 at 2:47 PM, Tom Lane  wrote:
> >> Robert Haas  writes:
> >>> On Mon, Mar 14, 2016 at 9:21 PM, Kouhei Kaigai  
> >>> wrote:
> >>>> So, even though we don't need to define multiple hook declarations,
> >>>> I think the hook invocation is needed just after create__paths()
> >>>> for each. It will need to inform extension the context of hook
> >>>> invocation, the argument list will take UpperRelationKind.
> 
> >>> That actually seems like a pretty good point.  Otherwise you can't
> >>> push anything from the upper rels down unless you are prepared to
> >>> handle all of it.
> 
> >> I'm not exactly convinced of the use-case for that.  What external
> >> thing is likely to handle window functions but not aggregation,
> >> for example?
> 
> > I'm not going to say that you're entirely wrong, but I think that
> > attitude is a bit short-sighted.
> 
> Well, I'm prepared to yield to the extent of repeating the hook call
> before each phase with an UpperRelationKind parameter to tell which phase
> we're about to do.  The main concern here is to avoid redundant
> computation, but the hook can check the "kind" parameter and fall out
> quickly if it has nothing useful to do at the current phase.
> 
> I do not, however, like the proposal to expose wflists and so forth.
> Those are internal data structures in grouping_planner and I absolutely
> refuse to promise that they're going to stay stable.  (I had already
> been thinking a couple of weeks ago about revising the activeWindows
> data structure, now that it would be reasonably practical to cost out
> different orders for doing the window functions in.)  I think a hook
> that has its own ideas about window function implementation methods
> can gather its own information about the WFs without that much extra
> cost, and it very probably wouldn't want exactly the same data that
> create_window_paths uses today anyway.
> 
> So what I would now propose is
> 
> typedef void (*create_upper_paths_hook_type) (PlannerInfo *root,
>   UpperRelationKind stage,
>   RelOptInfo *input_rel);
>
Hmm... It's not easy to imagine a case where an extension wants its own way
to extract window functions from the target list and select the active
windows, even if the extension wants to have its own executor and its own
cost estimation logic.
When an extension tries to add WindowPath + CustomPath(Sort), it is
interested in the alternative sort task, but not in the window function
itself. It is natural to follow the built-in implementation, which
motivates the extension author to copy & paste the code.
select_active_windows() is static, so the extension needs to have the same
routine on its side.

On the other hand, 'rollup_lists' and 'rollup_groupclauses' need
three static functions (extract_rollup_sets(), reorder_grouping_sets()
and preprocess_groupclause()) to reproduce the equivalent data structure.
That is a larger copy & paste burden if the extension is not interested in
extracting the information related to grouping sets.


I understand it is not "best", but it is better to provide the extra
information needed for an extension to reproduce an equivalent path node,
even if the fields of the UpperPathExtraData structure are not stable right now.

> and have this invoked at each stage right before we call
> create_grouping_paths, create_window_paths, etc.
>
That seems reasonable to me.

> Also, I don't particularly see a need for a corresponding API for FDWs.
> If an FDW is going to do anything in this space, it presumably has to
> build up ForeignPaths for all the steps anyway.  So I'd be inclined
> to leave GetForeignUpperPaths as-is.
>
That seems reasonable to me. An FDW driver that is interested in remote
execution of upper paths can use the hook as it sees fit.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Upper planner pathification

2016-03-19 Thread Kouhei Kaigai
> > Robert Haas  writes:
> > > On Mon, Mar 14, 2016 at 9:21 PM, Kouhei Kaigai  
> > > wrote:
> > >> So, even though we don't need to define multiple hook declarations,
> > >> I think the hook invocation is needed just after create__paths()
> > >> for each. It will need to inform extension the context of hook
> > >> invocation, the argument list will take UpperRelationKind.
> >
> > > That actually seems like a pretty good point.  Otherwise you can't
> > > push anything from the upper rels down unless you are prepared to
> > > handle all of it.
> >
> > I'm not exactly convinced of the use-case for that.  What external
> > thing is likely to handle window functions but not aggregation,
> > for example?
> >
> WindowPath usually takes a SortPath. Even though extension don't want to
> handle window function itself, it may want to add alternative sort logic
> than built-in.
> Unless it does not calculate expected cost, nobody knows whether WindowPath +
> SortPath is really cheaper than WindowPath + CustomPath("GpuSort").
> 
> The supplied query may require to run group-by prior to window function,
> but extension may not be interested in group-by on the other hands, thus,
> extension needs to get control around the location where built-in logic
> also adds paths to fetch the cheapest path of the underlying paths.
>
If I were to design the hook, I would put its entry points at:
- the tail of create_grouping_paths(), just before set_cheapest()
- the tail of create_window_paths(), just before set_cheapest()
- the tail of create_distinct_paths(), just before set_cheapest()
- the tail of create_ordered_paths(), just before set_cheapest()
- the tail of grouping_planner(), after the loop of create_modifytable_path()

I'm not 100% certain whether the last one is the most straightforward way
to provide alternative writing logic. For example, if an extension has
its own special storage, such as a columnar format, we may need more thought
on whether CustomPath and the related stuff are the suitable tools.

On the other hand, I believe the earlier four entry points are the right locations.
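
To make the idea concrete, an extension-side callback for such a hook (using
the three-argument signature proposed upthread) might look roughly like the
sketch below. The CustomPath construction details, the cost estimation and
the 'gpusort_path_methods' table are assumptions for illustration, not part
of the attached patch:

static void
gpusort_create_upper_paths(PlannerInfo *root,
                           UpperRelationKind stage,
                           RelOptInfo *input_rel)
{
    RelOptInfo *ordered_rel;
    CustomPath *cpath;

    /* only interested in the sorting step */
    if (stage != UPPERREL_ORDERED)
        return;

    ordered_rel = fetch_upper_rel(root, UPPERREL_ORDERED, NULL);

    cpath = makeNode(CustomPath);
    cpath->path.pathtype = T_CustomScan;
    cpath->path.parent = ordered_rel;
    cpath->path.pathtarget = ordered_rel->reltarget;
    cpath->path.pathkeys = root->sort_pathkeys;
    /* rows, startup_cost and total_cost of the alternative sort go here */
    cpath->custom_paths = list_make1(input_rel->cheapest_total_path);
    cpath->methods = &gpusort_path_methods;

    add_path(ordered_rel, &cpath->path);
}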

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


pgsql-v9.6-upper-custom-path.1.patch
Description: pgsql-v9.6-upper-custom-path.1.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Upper planner pathification

2016-03-19 Thread Kouhei Kaigai
> Robert Haas  writes:
> > On Mon, Mar 14, 2016 at 9:21 PM, Kouhei Kaigai  wrote:
> >> So, even though we don't need to define multiple hook declarations,
> >> I think the hook invocation is needed just after create__paths()
> >> for each. It will need to inform extension the context of hook
> >> invocation, the argument list will take UpperRelationKind.
> 
> > That actually seems like a pretty good point.  Otherwise you can't
> > push anything from the upper rels down unless you are prepared to
> > handle all of it.
> 
> I'm not exactly convinced of the use-case for that.  What external
> thing is likely to handle window functions but not aggregation,
> for example?
>
A WindowPath usually takes a SortPath. Even if an extension doesn't want to
handle the window function itself, it may want to add alternative sort logic
to the built-in one.
Unless it calculates the expected cost, nobody knows whether WindowPath +
SortPath is really cheaper than WindowPath + CustomPath("GpuSort").

The supplied query may require running group-by prior to the window function,
but the extension may not be interested in group-by; thus, the extension
needs to get control around the location where the built-in logic also adds
paths, so it can fetch the cheapest of the underlying paths.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Upper planner pathification

2016-03-18 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane
> Sent: Friday, March 18, 2016 11:44 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Robert Haas; Petr Jelinek; David Rowley; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] WIP: Upper planner pathification
> 
> Kouhei Kaigai  writes:
> > On Wed, Mar 16, 2016 at 2:47 PM, Tom Lane  wrote:
> >> I do not, however, like the proposal to expose wflists and so forth.
> >> Those are internal data structures in grouping_planner and I absolutely
> >> refuse to promise that they're going to stay stable.
> 
> > Hmm... It's not easy to imagine a case that extension wants own idea
> > to extract window functions from the target list and select active
> > windows, even if extension wants to have own executor and own cost
> > estimation logic.
> > In case when extension tries to add WindowPath + CustomPath(Sort),
> > extension is interested in alternative sort task, but not window
> > function itself. It is natural to follow the built-in implementation,
> > thus, it motivates extension author to take copy & paste the code.
> > select_active_windows() is static, so extension needs to have same
> > routine on their side.
> 
> Well, to be perfectly blunt about it, I have said from day one that this
> notion that a CustomScan extension will be able to cause arbitrary planner
> behavior changes is loony.  We are simply not going to drop a hook into
> every tenth line of the planner for you, nor de-static-ify every internal
> function, nor (almost equivalently) expose the data structures those
> functions produce, because it would cripple core planner development to
> try to keep the implied APIs stable.  And I continue to maintain that any
> actually-generally-useful ideas would be better handled by submitting them
> as patches to the core planner, rather than trying to implement them in an
> arms-length extension.
> 
> In the case at hand, I notice that the WindowFuncLists struct is
> actually from find_window_functions in clauses.c, so an extension
> that needed to get hold of that would be unlikely to do any copying
> and pasting anyhow -- it'd just call find_window_functions again.
> (That only needs to search the targetlist, so it's not that expensive.)
> The other lists you mention are all tightly tied to specific, and not
> terribly well-designed, implementation strategies for grouping sets and
> window functions.  Those are *very* likely to change in the near future;
> and even if they don't, I'm skeptical that an external implementor of
> grouping sets or window functions would want to use exactly those same
> implementation strategies.  Maybe the only reason you're there at all
> is you want to be smarter about the order of doing window functions,
> for example.
> 
> So I really don't want to export any of that stuff.
>
Hmm. I can understand that we have active development around this area,
so the miscellaneous internal data structures may not be stable enough
to expose to extensions.
Would you be open to revisiting this discussion once the implementation
calms down in a future development cycle?

> As for other details of the proposed patch, I was intending to put
> all the hook calls into grouping_planner for consistency, rather than
> scattering them between grouping_planner and its subroutines.  So that
> would probably mean calling the hook for a given step *before* we
> generate the core paths for that step, not after.  Did you have a
> reason to want the other order?  (If you say "so the hook can look
> at the core-made paths", I'm going to say "that's a bad idea".  It'd
> further increase the coupling between a CustomScan extension and core.)
>
No deep reason. I just followed the convention of the scan/join path hook,
which calls the extension once the core feature has added built-in path nodes.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-14 Thread Kouhei Kaigai
Petr,

The attached patch is the revised one that follows the new extensible-
node routine.

It is almost the same as the previous version, except that:
- custom-apis.[ch] was renamed to custom-node.[ch]
- the check for the length of the custom-scan-method name follows
  the manner of RegisterExtensibleNodeMethods()

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Tuesday, March 15, 2016 2:54 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Petr Jelinek; pgsql-hackers@postgresql.org
> Subject: ##freemail## Re: [HACKERS] Reworks of CustomScan
> serialization/deserialization
> 
> On Sun, Mar 13, 2016 at 9:53 PM, Kouhei Kaigai  wrote:
> > OK, I split the previous small patch into two tiny patches.
> > The one is bugfix around max length of the extnodename.
> > The other replaces Assert() by ereport() according to the upthread 
> > discussion.
> 
> Committed, except that (1) I replaced ereport() with elog(), because I
> can't see making translators care about this message; and (2) I
> reworded the error message a bit.
> 
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


pgsql-v9.6-custom-scan-serialization-reworks.2.patch
Description: pgsql-v9.6-custom-scan-serialization-reworks.2.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Upper planner pathification

2016-03-14 Thread Kouhei Kaigai




> -Original Message-
> From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> Sent: Tuesday, March 15, 2016 2:04 AM
> To: Petr Jelinek
> Cc: Kaigai Kouhei(海外 浩平); David Rowley; Robert Haas;
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] WIP: Upper planner pathification
> 
> Petr Jelinek  writes:
> > On 14/03/16 02:43, Kouhei Kaigai wrote:
> >> Even though I couldn't check the new planner implementation entirely,
> >> it seems to be the points below are good candidate to inject CustomPath
> >> (and potentially ForeignScan).
> >>
> >> - create_grouping_paths
> >> - create_window_paths
> >> - create_distinct_paths
> >> - create_ordered_paths
> >> - just below of the create_modifytable_path
> >> (may be valuable for foreign-update pushdown)
> 
> > To me that seems too low inside the planning tree, perhaps adding it
> > just to the subquery_planner before SS_identify_outer_params would be
> > better, that's the place where you see the path for the whole (sub)query
> > so you can search and modify what you need from there.
> 
> I don't like either of those too much.  The main thing I've noticed over
> the past few days is that you can't readily generate custom upper-level
> Paths unless you know what PathTarget grouping_planner is expecting each
> level to produce.  So what I was toying with doing is (1) having
> grouping_planner put all those targets into the PlannerInfo, perhaps
> in an array indexed by UpperRelationKind; and (2) adding a hook call
> immediately after those targets are computed, say right before
> the create_grouping_paths() call (approximately planner.c:1738
> in HEAD).  It should be sufficient to have one hook there since
> you can inject Paths into any of the upper relations at that point;
> moreover, that's late enough that you shouldn't have to recompute
> anything you figured out during scan/join planning.
>
Regarding (2), I doubt whether that location is reasonable,
because the pathlist of each upper_rels[] is still empty at that point, isn't it?
It would make it hard for an extension to construct its own CustomPath that
takes the underlying built-in path nodes.

For example, if an extension implements its own sort logic but is not
interested in group-by/window functions, it will want to add its
CustomPath to UPPERREL_ORDERED; however, it does not know which relation is
the input_rel, and no built-in paths have been added yet at the point
of create_upper_paths_hook().

On the other hand, there is another problem if we put a hook after
all the upper paths are done. In that case, the built-in create_*_paths()
functions cannot take into account a CustomPath to be added later when
they pick the cheapest path.

So, even though we don't need to define multiple hook declarations,
I think the hook invocation is needed just after each create_*_paths().
It will need to inform the extension of the context of the hook
invocation, so the argument list will take an UpperRelationKind.

In addition, the extension cannot reference some local variables from
the root structure, such as:
 - rollup_lists
 - rollup_groupclauses
 - wflists
 - activeWindows
 - have_postponed_srfs
As we do for set_join_pathlist_hook, it would be a good idea to define an
UpperPathExtraData structure to pack this miscellaneous information.

So, how about redefining the hook as follows?

typedef void (*create_upper_paths_hook_type) (UpperRelationKind upper_kind,
                                              PlannerInfo *root,
                                              RelOptInfo *scan_join_rel,
                                              UpperPathExtraData *extra);
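
For concreteness, the extra-data structure mentioned above could simply pack
those local variables, e.g. (purely hypothetical layout, mirroring the list
above):

typedef struct UpperPathExtraData
{
    List       *rollup_lists;
    List       *rollup_groupclauses;
    WindowFuncLists *wflists;
    List       *activeWindows;
    bool        have_postponed_srfs;
} UpperPathExtraData;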

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-14 Thread Kouhei Kaigai
> On Sun, Mar 13, 2016 at 9:53 PM, Kouhei Kaigai  wrote:
> > OK, I split the previous small patch into two tiny patches.
> > The one is bugfix around max length of the extnodename.
> > The other replaces Assert() by ereport() according to the upthread 
> > discussion.
> 
> Committed, except that (1) I replaced ereport() with elog(), because I
> can't see making translators care about this message; and (2) I
> reworded the error message a bit.
>
Thanks, and I got the point: ereport() is suggested instead of elog() for
error messages that may be visible to users.

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Use %u to print user mapping's umid and userid

2016-03-14 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Etsuro Fujita
> Sent: Monday, March 14, 2016 4:59 PM
> To: Ashutosh Bapat; Tom Lane
> Cc: pgsql-hackers
> Subject: Re: [HACKERS] Use %u to print user mapping's umid and userid
> 
> Hi,
> 
> On 2016/02/09 14:09, Ashutosh Bapat wrote:
> > Sorry, I was wrong. For public user mapping userid is 0 (InvalidOid),
> > which is returned as is in UserMapping object. I confused InvalidOid
> > with -1.
> 
> I think the following umid handling in postgresGetForeignPlan has the
> same issue:
> 
>  /*
>   * Build the fdw_private list that will be available to the executor.
>   * Items in the list must match order in enum FdwScanPrivateIndex.
>   */
>  fdw_private = list_make4(makeString(sql.data),
>   retrieved_attrs,
>   makeInteger(fpinfo->fetch_size),
>   makeInteger(foreignrel->umid));
> 
> I don't think it's correct to use makeInteger for the foreignrel's umid.
>
BTW, using ExtensibleNode lets us forget about the problems that come from
data format translation.

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Upper planner pathification

2016-03-13 Thread Kouhei Kaigai
> -Original Message-
> From: Petr Jelinek [mailto:p...@2ndquadrant.com]
> Sent: Monday, March 14, 2016 12:18 PM
> To: Kaigai Kouhei(海外 浩平); Tom Lane; David Rowley
> Cc: Robert Haas; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] WIP: Upper planner pathification
> 
> On 14/03/16 02:43, Kouhei Kaigai wrote:
> >
> > CustomPath node is originally expected to generate various kind of plan
> > node, not only scan/join, and its interface is designed to support them.
> > For example, we can expect a CustomPath that generates "CustomSort".
> >
> > On the other hands, upper path consideration is more variable than the
> > case of scan/join path consideration. Probably, we can have no centralized
> > point to add custom-paths for sort, group-by, ...
> > So, I think we have hooks for each (supported) upper path workload.
> >
> > In case of sorting for example, the best location of the hook is just
> > above of the Assert() in the create_ordered_paths(). It allows to compare
> > estimated cost between SortPath and CustomPath.
> > However, it does not allow to inject CustomPath(for sort) into the path
> > node that may involve sorting, like WindowPath or AggPath.
> > Thus, another hook may be put on create_window_paths and
> > create_grouping_paths in my thought.
> >
> > Some other good idea?
> >
> > Even though I couldn't check the new planner implementation entirely,
> > it seems to be the points below are good candidate to inject CustomPath
> > (and potentially ForeignScan).
> >
> > - create_grouping_paths
> > - create_window_paths
> > - create_distinct_paths
> > - create_ordered_paths
> > - just below of the create_modifytable_path
> >(may be valuable for foreign-update pushdown)
> >
> 
> To me that seems too low inside the planning tree, perhaps adding it
> just to the subquery_planner before SS_identify_outer_params would be
> better, that's the place where you see the path for the whole (sub)query
> so you can search and modify what you need from there.
>
Thanks for your idea. Yes, I also thought about a similar point: the place
where all the path consideration is completed. It indeed allows extensions to
walk down the path tree and replace parts of it.
However, when we want to inject a CustomPath under the built-in paths, the
extension has to recalculate the cost of the built-in paths again.
That may affect the decision of built-in path selection.
So I concluded that it is not realistic to re-implement the equivalent
upper-planning stuff on the extension side, if we put the hook after
all the planning work is done.

If an extension can add its CustomPath at create_grouping_paths(), the
later steps, like create_window_paths(), stand on the estimated cost
of the CustomPath. Thus, the extension doesn't need to know the details of
the entire upper planning.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-13 Thread Kouhei Kaigai
> On 14/03/16 02:53, Kouhei Kaigai wrote:
> >> -Original Message-
> >> From: pgsql-hackers-ow...@postgresql.org
> >> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Petr Jelinek
> >> Sent: Friday, March 11, 2016 12:27 AM
> >> To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org
> >> Subject: Re: [HACKERS] Reworks of CustomScan serialization/deserialization
> >>
> >> On 10/03/16 08:08, Kouhei Kaigai wrote:
> >>>>
> >>>>>> Also in RegisterCustomScanMethods
> >>>>>> +  Assert(strlen(methods->CustomName) <= CUSTOM_NAME_MAX_LEN);
> >>>>>>
> >>>>>> Shouldn't this be actually "if" with ereport() considering this is
> >>>>>> public API and extensions can pass anything there? (for that matter 
> >>>>>> same
> >>>>>> is true for RegisterExtensibleNodeMethods but that's already 
> >>>>>> committed).
> >>>>>>
> >>>>> Hmm. I don't have clear answer which is better. The reason why I put
> >>>>> Assert() here is that only c-binary extension uses this interface, thus,
> >>>>> author will fix up the problem of too long name prior to its release.
> >>>>> Of course, if-with-ereport() also informs extension author the name is
> >>>>> too long.
> >>>>> One downside of Assert() may be, it makes oversight if --enable-cassert
> >>>>> was not specified.
> >>>>>
> >>>>
> >>>> Well that's exactly my problem, this should IMHO throw error even
> >>>> without --enable-cassert. It's not like it's some performance sensitive
> >>>> API where if would be big problem, ensuring correctness of the input is
> >>>> more imporant here IMHO.
> >>>>
> >>> We may need to fix up RegisterExtensibleNodeMethods() first.
> >>>
> >>> Also, length limitation is (EXTNODENAME_MAX_LEN-1) because the last byte
> >>> is consumed by '\0' character. In fact, hash, match and keycopy function
> >>> of HTAB for string keys deal with the first (keysize - 1) bytes.
> >>> So, strkey(extnodename) == EXTNODENAME_MAX_LEN is not legal.
> >>>
> >>
> >> Yes, my thoughts as well but that can be separate tiny patch that does
> >> not have to affect this one. In my opinion if we fixed this one it would
> >> be otherwise ready to go in, and I definitely prefer this approach to
> >> the previous one.
> >>
> > OK, I split the previous small patch into two tiny patches.
> > The one is bugfix around max length of the extnodename.
> > The other replaces Assert() by ereport() according to the upthread 
> > discussion.
> >
> 
> Okay, it's somewhat akin to hairsplitting but works for me. Do you plan
> to do same thing with the CustomScan patch itself as well?.
>
Yes. I'll fix up the patch to follow the same convention.

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-13 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Petr Jelinek
> Sent: Friday, March 11, 2016 12:27 AM
> To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Reworks of CustomScan serialization/deserialization
> 
> On 10/03/16 08:08, Kouhei Kaigai wrote:
> >>
> >>>> Also in RegisterCustomScanMethods
> >>>> +Assert(strlen(methods->CustomName) <= CUSTOM_NAME_MAX_LEN);
> >>>>
> >>>> Shouldn't this be actually "if" with ereport() considering this is
> >>>> public API and extensions can pass anything there? (for that matter same
> >>>> is true for RegisterExtensibleNodeMethods but that's already committed).
> >>>>
> >>> Hmm. I don't have clear answer which is better. The reason why I put
> >>> Assert() here is that only c-binary extension uses this interface, thus,
> >>> author will fix up the problem of too long name prior to its release.
> >>> Of course, if-with-ereport() also informs extension author the name is
> >>> too long.
> >>> One downside of Assert() may be, it makes oversight if --enable-cassert
> >>> was not specified.
> >>>
> >>
> >> Well that's exactly my problem, this should IMHO throw error even
> >> without --enable-cassert. It's not like it's some performance sensitive
> >> API where if would be big problem, ensuring correctness of the input is
> >> more imporant here IMHO.
> >>
> > We may need to fix up RegisterExtensibleNodeMethods() first.
> >
> > Also, length limitation is (EXTNODENAME_MAX_LEN-1) because the last byte
> > is consumed by '\0' character. In fact, hash, match and keycopy function
> > of HTAB for string keys deal with the first (keysize - 1) bytes.
> > So, strkey(extnodename) == EXTNODENAME_MAX_LEN is not legal.
> >
> 
> Yes, my thoughts as well but that can be separate tiny patch that does
> not have to affect this one. In my opinion if we fixed this one it would
> be otherwise ready to go in, and I definitely prefer this approach to
> the previous one.
>
OK, I split the previous small patch into two tiny patches.
One is a bugfix around the max length of extnodename.
The other replaces Assert() with ereport(), according to the upthread discussion.
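
Concretely, the second patch turns the length check into something like this
(a sketch; the exact error code and wording may differ from the patch):

void
RegisterExtensibleNodeMethods(const ExtensibleNodeMethods *methods)
{
    if (strlen(methods->extnodename) >= EXTNODENAME_MAX_LEN)
        ereport(ERROR,
                (errcode(ERRCODE_NAME_TOO_LONG),
                 errmsg("extensible node name is too long")));
    /* ... register the methods table keyed by extnodename ... */
}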

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



pgsql-v9.6-fix-extnodename-max-len.patch
Description: pgsql-v9.6-fix-extnodename-max-len.patch


pgsql-v9.6-replace-assert-by-ereport-on-register-extnode.patch
Description: pgsql-v9.6-replace-assert-by-ereport-on-register-extnode.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Upper planner pathification

2016-03-13 Thread Kouhei Kaigai
Hello,

I'm now looking into the new planner implementation to find a way to
integrate CustomPath into the upper planner as well.
The CustomPath node was originally expected to generate various kinds of plan
nodes, not only scan/join, and its interface is designed to support them.
For example, we can expect a CustomPath that generates a "CustomSort".

On the other hand, upper-path consideration is more varied than
scan/join path consideration. We probably have no centralized
point to add custom paths for sort, group-by, ...
So I think we need hooks for each (supported) upper-path workload.

In the case of sorting, for example, the best location for the hook is just
above the Assert() in create_ordered_paths(). That allows comparing the
estimated cost between a SortPath and a CustomPath.
However, it does not allow injecting a CustomPath (for sort) into a path
node that may involve sorting, like WindowPath or AggPath.
Thus, another hook may need to be put in create_window_paths and
create_grouping_paths, in my view.

Any other good ideas?

Even though I couldn't check the new planner implementation entirely,
the points below seem to be good candidates for injecting CustomPath
(and potentially ForeignScan).

- create_grouping_paths
- create_window_paths
- create_distinct_paths
- create_ordered_paths
- just below of the create_modifytable_path
  (may be valuable for foreign-update pushdown)

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane
> Sent: Saturday, March 05, 2016 3:02 AM
> To: David Rowley
> Cc: Robert Haas; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] WIP: Upper planner pathification
> 
> OK, here is a version that I think addresses all of the recent comments:
> 
> * I refactored the grouping-sets stuff as suggested by Robert and David.
> The GroupingSetsPath code is now used *only* when there are grouping sets,
> otherwise what you get is a plain AGG_SORTED AggPath.  This allowed
> removal of a boatload of weird corner cases in the GroupingSets code path,
> so it was a good change.  (Fundamentally, that's cleaning up some
> questionable coding in the grouping sets patch rather than fixing anything
> directly related to pathification, but I like the code better now.)
> 
> * I refactored the handling of targetlists in createplan.c.  After some
> reflection I decided that the disuse_physical_tlist callers fell into
> three separate categories: those that actually needed the exact requested
> tlist to be returned, those that wanted non-bloated tuples because they
> were going to put them into sort or hash storage, and those that needed
> grouping columns to be properly labeled.  The new approach is to pass down
> a "flags" word that specifies which if any of these cases apply at a
> specific plan level.  use_physical_tlist now always makes the right
> decision to start with, and disuse_physical_tlist is gone entirely, which
> should make things a bit faster since we won't uselessly construct and
> discard physical tlists.  The missing logic from make_subplanTargetList
> and locate_grouping_columns is reincarnated in the physical-tlist code.
> 
> * Added explicit limit/offset fields to LimitPath, as requested by Teodor.
> 
> * Removed SortPath.sortgroupclauses.
> 
> * Fixed handling of parallel-query fields in new path node types.
> (BTW, I found what seemed to be a couple of pre-existing bugs of
> the same kind, eg create_mergejoin_path was different from the
> other two kinds of join as to setting parallel_degree.)
> 
> 
> What remains to be done, IMV:
> 
> * Performance testing as per yesterday's discussion.
> 
> * Debug support in outfuncs.c and print_path() for new node types.
> 
> * Clean up unfinished work on function header comments.
> 
> * Write some documentation about how FDWs might use this.
> 
> I'll work on the performance testing next.  Barring unsatisfactory
> results from that, I think this could be committable in a couple
> of days.
> 
>   regards, tom lane



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-09 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Petr Jelinek
> Sent: Thursday, March 10, 2016 11:01 AM
> To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Reworks of CustomScan serialization/deserialization
> 
> On 10/03/16 02:18, Kouhei Kaigai wrote:
> >
> >> I am not sure I like the fact that we have this EXTNODENAME_MAX_LEN and
> >> now the CUSTOM_NAME_MAX_LEN with the same length and also they are both
> >> same lenght as NAMEDATALEN I wonder if this shouldn't be somehow
> >> squished to less defines.
> >>
> > Hmm. I just followed the manner in extensible.c, because this label was
> > initially NAMEDATALEN, then Robert changed it with EXTNODENAME_MAX_LEN.
> > I guess he avoid to apply same label on different entities - NAMEDATALEN
> > is a limitation for NameData type, but identifier of extensible-node and
> > custom-scan node are not restricted by.
> >
> 
> Makes sense.
> 
> >> Also in RegisterCustomScanMethods
> >> +  Assert(strlen(methods->CustomName) <= CUSTOM_NAME_MAX_LEN);
> >>
> >> Shouldn't this be actually "if" with ereport() considering this is
> >> public API and extensions can pass anything there? (for that matter same
> >> is true for RegisterExtensibleNodeMethods but that's already committed).
> >>
> > Hmm. I don't have clear answer which is better. The reason why I put
> > Assert() here is that only c-binary extension uses this interface, thus,
> > author will fix up the problem of too long name prior to its release.
> > Of course, if-with-ereport() also informs extension author the name is
> > too long.
> > One downside of Assert() may be, it makes oversight if --enable-cassert
> > was not specified.
> >
> 
> Well that's exactly my problem, this should IMHO throw error even
> without --enable-cassert. It's not like it's some performance sensitive
> API where if would be big problem, ensuring correctness of the input is
> more imporant here IMHO.
>
We may need to fix up RegisterExtensibleNodeMethods() first.

Also, the length limitation is (EXTNODENAME_MAX_LEN - 1) because the last byte
is consumed by the '\0' terminator. In fact, the hash, match and keycopy
functions of an HTAB with string keys only deal with the first (keysize - 1) bytes.
So, strlen(extnodename) == EXTNODENAME_MAX_LEN is not legal.
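
To illustrate the off-by-one point, a keycopy for string keys effectively
behaves like this (standalone sketch, not the dynahash code):

/*
 * keysize bytes are reserved for the key, but only keysize - 1 of them can
 * hold characters, because the stored key is always NUL-terminated.
 */
static void
string_keycopy(char *dest, const char *src, size_t keysize)
{
    strncpy(dest, src, keysize - 1);
    dest[keysize - 1] = '\0';
}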

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



pgsql-v9.6-extensible-namelen-check-by-ereport.patch
Description: pgsql-v9.6-extensible-namelen-check-by-ereport.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-09 Thread Kouhei Kaigai
> On 29/02/16 13:07, Kouhei Kaigai wrote:
> >
> > I'd like to adjust a few of custom-scan interface prior to v9.6 freeze.
> >
> > The major point is serialization/deserialization mechanism.
> > Now, extension has to give LibraryName and SymbolName to reproduce
> > same CustomScanMethods on the background worker process side. Indeed,
> > it is sufficient information to pull the table of function pointers.
> >
> > On the other hands, we now have different mechanism to wrap private
> > information - extensible node. It requires extensions to register its
> > ExtensibleNodeMethods identified by name, usually, on _PG_init() time.
> > It is also reasonable way to reproduce same objects on background
> > worker side.
> >
> > However, mixture of two different ways is not good. My preference is
> > what extensible-node is doing rather than what custom-scan is currently
> > doing.
> > The attached patch allows extension to register CustomScanMethods once,
> > then readFunc.c can pull this table by CustomName in string form.
> >
> 
> Agreed, but this will break compatibility right?
>
The mechanism of passing a pair of library name and symbol name is a new feature
in v9.6, not in v9.5, so now is the last chance to fix up the interface
requirement.

> > I'm not 100% certain whether "nodes/custom-apis.h" is the best location,
> > but somewhere we can put these declarations rather than the primitive
> > header files might be needed.
> 
> custom-apis.c does not sound like right name to me, maybe it can be just
> custom.c but custom.h might be bit too generic, maybe custom_node.h
>
OK, custom_node.h may be better.

> I am not sure I like the fact that we have this EXTNODENAME_MAX_LEN and
> now the CUSTOM_NAME_MAX_LEN with the same length and also they are both
> same lenght as NAMEDATALEN I wonder if this shouldn't be somehow
> squished to less defines.
>
Hmm. I just followed the manner in extensible.c: this limit was initially
NAMEDATALEN, then Robert changed it to EXTNODENAME_MAX_LEN.
I guess he avoided applying the same label to different entities - NAMEDATALEN
is a limitation of the NameData type, but the identifiers of extensible nodes
and custom-scan nodes are not restricted by it.

> Also in RegisterCustomScanMethods
> + Assert(strlen(methods->CustomName) <= CUSTOM_NAME_MAX_LEN);
> 
> Shouldn't this be actually "if" with ereport() considering this is
> public API and extensions can pass anything there? (for that matter same
> is true for RegisterExtensibleNodeMethods but that's already committed).
>
Hmm. I don't have a clear answer as to which is better. The reason why I put
Assert() here is that only C-binary extensions use this interface, so the
author will fix a too-long name prior to its release.
Of course, if-with-ereport() also informs the extension author that the name
is too long.
One downside of Assert() is that the problem can slip through unnoticed if
--enable-cassert was not specified.

> Other than that this seems like straight conversion to same basic
> template as extensible nodes so I think it's ok.
> 

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-03-07 Thread Kouhei Kaigai




> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Saturday, March 05, 2016 2:42 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
> Subject: Re: [HACKERS] Way to check whether a particular block is on the
> shared_buffer?
> 
> On Thu, Mar 3, 2016 at 8:54 PM, Kouhei Kaigai  wrote:
> > I found one other, but tiny, problem to implement SSD-to-GPU direct
> > data transfer feature under the PostgreSQL storage.
> >
> > Extension cannot know the raw file descriptor opened by smgr.
> >
> > I expect an extension issues an ioctl(2) on the special device file
> > on behalf of the special kernel driver, to control the P2P DMA.
> > This ioctl(2) will pack file descriptor of the DMA source and some
> > various information (like base position, range, destination device
> > pointer, ...).
> >
> > However, the raw file descriptor is wrapped in the fd.c, instead of
> > the File handler, thus, not visible to extension. oops...
> >
> > The attached patch provides a way to obtain raw file descriptor (and
> > relevant flags) of a particular File virtual file descriptor on
> > PostgreSQL. (No need to say, extension has to treat the raw descriptor
> > carefully not to give an adverse effect to the storage manager.)
> >
> > How about this tiny enhancement?
> 
> Why not FileDescriptor(), FileFlags(), FileMode() as separate
> functions like FilePathName()?
>
There is no deep reason. The attached patch adds three individual
functions.
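
For reference, the interface has roughly the following shape (a sketch only;
the exact names and return types in the attached patch are authoritative):

    extern int    FileGetRawDesc(File file);    /* kernel file descriptor */
    extern int    FileGetRawFlags(File file);   /* open(2) flags of the VFD */
    extern mode_t FileGetRawMode(File file);    /* creation mode of the VFD */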

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



pgsql-v9.6-filegetrawdesc.2.patch
Description: pgsql-v9.6-filegetrawdesc.2.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-03-03 Thread Kouhei Kaigai
I found one other, but tiny, problem in implementing the SSD-to-GPU direct
data transfer feature under the PostgreSQL storage layer.

An extension cannot know the raw file descriptor opened by smgr.

I expect an extension to issue an ioctl(2) on a special device file exposed
by the custom kernel driver, to control the P2P DMA.
This ioctl(2) will pack the file descriptor of the DMA source and various
other information (like base position, range, destination device
pointer, ...).

However, the raw file descriptor is wrapped inside fd.c behind the File
handle, and is thus not visible to extensions. oops...

The attached patch provides a way to obtain the raw file descriptor (and
relevant flags) of a particular File virtual file descriptor in
PostgreSQL. (Needless to say, an extension has to treat the raw descriptor
carefully so as not to have an adverse effect on the storage manager.)

How about this tiny enhancement?
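
For illustration, the ioctl(2) call I have in mind looks roughly like the
following. Everything here - the struct layout, command number and helper
function - is hypothetical, not an existing kernel interface:

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/types.h>

    typedef struct
    {
        int         fdesc;          /* raw fd of the DMA source file */
        off_t       offset;         /* base position within the file */
        size_t      length;         /* number of bytes to transfer */
        uint64_t    dest_devptr;    /* destination device pointer on the GPU */
    } StromDmaRequest;

    #define STROM_IOCTL_P2P_DMA   _IOW('S', 0x80, StromDmaRequest)

    /* returns 0 on success, -1 (with errno set) on failure */
    static int
    issue_p2p_dma(int strom_devfd, int raw_fdesc,
                  off_t offset, size_t length, uint64_t dest_devptr)
    {
        StromDmaRequest req;

        req.fdesc = raw_fdesc;
        req.offset = offset;
        req.length = length;
        req.dest_devptr = dest_devptr;

        return ioctl(strom_devfd, STROM_IOCTL_P2P_DMA, &req);
    }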

> > -Original Message-
> > From: pgsql-hackers-ow...@postgresql.org
> > [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> > Sent: Saturday, February 13, 2016 1:46 PM
> > To: Kaigai Kouhei(海外 浩平)
> > Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
> > Subject: Re: [HACKERS] Way to check whether a particular block is on the
> > shared_buffer?
> >
> > On Thu, Feb 11, 2016 at 9:05 PM, Kouhei Kaigai  wrote:
> > > Hmm. In my experience, it is often not a productive discussion whether
> > > a feature is niche or commodity. So, let me change the viewpoint.
> > >
> > > We may utilize OS-level locking mechanism here.
> > >
> > > Even though it depends on filesystem implementation under the VFS,
> > > we may use inode->i_mutex lock that shall be acquired during the buffer
> > > copy from user to kernel, at least, on a few major filesystems; ext4,
> > > xfs and btrfs in my research. As well, the modified NVMe SSD driver can
> > > acquire the inode->i_mutex lock during P2P DMA transfer.
> > >
> > > Once we can consider the OS buffer is updated atomically by the lock,
> > > we don't need to worry about corrupted pages, but still needs to pay
> > > attention to the scenario when updated buffer page is moved to GPU.
> > >
> > > In this case, PD_ALL_VISIBLE may give us a hint. GPU side has no MVCC
> > > infrastructure, so I intend to move all-visible pages only.
> > > If someone updates the buffer concurrently, then write out the page
> > > including invisible tuples, PD_ALL_VISIBLE flag shall be cleared because
> > > updated tuples should not be visible to the transaction which issued
> > > P2P DMA.
> > >
> > > Once GPU met a page with !PD_ALL_VISIBLE, it can return an error status
> > > that indicates CPU to retry this page again. In this case, this page is
> > > likely loaded to the shared buffer already, so retry penalty is not so
> > > much.
> > >
> > > I'll try to investigate the implementation in this way.
> > > Please correct me, if I misunderstand something (especially, treatment
> > > of PD_ALL_VISIBLE).
> >
> > I suppose there's no theoretical reason why the buffer couldn't go
> > from all-visible to not-all-visible and back to all-visible again all
> > during the time you are copying it.
> >
> The backend process that is copying the data to GPU has a transaction
> in-progress (= not committed). Is it possible to get the updated buffer
> page back to the all-visible state again?
> I expect that in-progress transactions works as a blocker for backing
> to all-visible. Right?
> 
> > Honestly, I think trying to access buffers without going through
> > shared_buffers is likely to be very hard to make correct and probably
> > a loser.
> >
> No challenge, no outcome. ;-)
> 
> > Copying the data into shared_buffers and then to the GPU is,
> > doubtless, at least somewhat slower.  But I kind of doubt that it's
> > enough slower to make up for all of the problems you're going to have
> > with the approach you've chosen.
> >
> Honestly, I'm still uncertain whether it works well as I expects.
> However, scan workload on the table larger than main memory is
> headache for PG-Strom, so I'd like to try ideas we can implement.
> 
> Thanks,
> --
> NEC Business Creation Division / PG-Strom Project
> KaiGai Kohei 
>



pgsql-v9.6-filegetrawdesc.1.patch
Description: pgsql-v9.6-filegetrawdesc.1.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Reworks of CustomScan serialization/deserialization

2016-02-29 Thread Kouhei Kaigai
Hello,

I'd like to adjust a few of custom-scan interface prior to v9.6 freeze.

The major point is the serialization/deserialization mechanism.
Right now, an extension has to give LibraryName and SymbolName to reproduce
the same CustomScanMethods on the background worker process side. Indeed,
that is sufficient information to pull in the table of function pointers.

On the other hand, we now have a different mechanism to wrap private
information - the extensible node. It requires an extension to register its
ExtensibleNodeMethods, identified by name, usually at _PG_init() time.
That is also a reasonable way to reproduce the same objects on the
background worker side.

However, a mixture of two different ways is not good. My preference is
what extensible-node is doing rather than what custom-scan is currently
doing.
The attached patch allows an extension to register its CustomScanMethods
once; readfuncs.c can then pull this table by CustomName in string form.
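
For illustration, the intended usage looks roughly like this (a sketch only -
names other than RegisterCustomScanMethods and the lookup call are
hypothetical, and the attached patch is authoritative):

    /* extension side, typically at library load time */
    static Node *CreateMyScanState(CustomScan *cscan);

    static CustomScanMethods my_scan_methods = {
        .CustomName = "MyCustomScan",              /* used as the lookup key */
        .CreateCustomScanState = CreateMyScanState,
    };

    void
    _PG_init(void)
    {
        RegisterCustomScanMethods(&my_scan_methods);
    }

    /* readfuncs.c (e.g. on a background worker) can then do roughly:
     *     methods = GetCustomScanMethods(custom_name, false);
     */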


The minor one is the header file location of the CustomMethods declarations.
These are currently declared in relation.h, plannodes.h and execnodes.h.
These files are very primitive, so we put lines like these:

  struct ParallelContext; /* avoid including parallel.h here */
  struct shm_toc; /* avoid including shm_toc.h here */
  struct ExplainState;/* avoid including explain.h here */

to avoid including other headers there.

It seems to me the CustomMethods declarations should be moved somewhere more
appropriate, like fdwapi.h is for the FDW API. If we put "struct CustomMethods;"
in these primitive header files instead, it will work.

I'm not 100% certain whether "nodes/custom-apis.h" is the best location,
but we may need somewhere other than those primitive header files to put
these declarations.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


pgsql-v9.6-custom-scan-serialization-reworks.1.patch
Description: pgsql-v9.6-custom-scan-serialization-reworks.1.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] A trivial fix on extensiblenode

2016-02-29 Thread Kouhei Kaigai
Hello,

RegisterExtensibleNodeMethods() initializes its hash table
with keysize=NAMEDATALEN, instead of EXTNODENAME_MAX_LEN.

The attached patch fixes it.
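
For reference, the essence of the fix is a one-line change in the hash table
setup (the attached patch is the authoritative version):

    ctl.keysize = EXTNODENAME_MAX_LEN;   /* previously NAMEDATALEN */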

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




pgsql-v9.6-trivial-fix-extensiblenode.patch
Description: pgsql-v9.6-trivial-fix-extensiblenode.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-13 Thread Kouhei Kaigai




> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Saturday, February 13, 2016 1:46 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
> Subject: Re: [HACKERS] Way to check whether a particular block is on the
> shared_buffer?
> 
> On Thu, Feb 11, 2016 at 9:05 PM, Kouhei Kaigai  wrote:
> > Hmm. In my experience, it is often not a productive discussion whether
> > a feature is niche or commodity. So, let me change the viewpoint.
> >
> > We may utilize OS-level locking mechanism here.
> >
> > Even though it depends on filesystem implementation under the VFS,
> > we may use inode->i_mutex lock that shall be acquired during the buffer
> > copy from user to kernel, at least, on a few major filesystems; ext4,
> > xfs and btrfs in my research. As well, the modified NVMe SSD driver can
> > acquire the inode->i_mutex lock during P2P DMA transfer.
> >
> > Once we can consider the OS buffer is updated atomically by the lock,
> > we don't need to worry about corrupted pages, but still needs to pay
> > attention to the scenario when updated buffer page is moved to GPU.
> >
> > In this case, PD_ALL_VISIBLE may give us a hint. GPU side has no MVCC
> > infrastructure, so I intend to move all-visible pages only.
> > If someone updates the buffer concurrently, then write out the page
> > including invisible tuples, PD_ALL_VISIBLE flag shall be cleared because
> > updated tuples should not be visible to the transaction which issued
> > P2P DMA.
> >
> > Once GPU met a page with !PD_ALL_VISIBLE, it can return an error status
> > that indicates CPU to retry this page again. In this case, this page is
> > likely loaded to the shared buffer already, so retry penalty is not so
> > much.
> >
> > I'll try to investigate the implementation in this way.
> > Please correct me, if I misunderstand something (especially, treatment
> > of PD_ALL_VISIBLE).
> 
> I suppose there's no theoretical reason why the buffer couldn't go
> from all-visible to not-all-visible and back to all-visible again all
> during the time you are copying it.
>
The backend process that is copying the data to the GPU has a transaction
in progress (= not committed). Is it possible for the updated buffer
page to get back to the all-visible state again?
I expect that the in-progress transaction works as a blocker against going
back to all-visible. Right?

> Honestly, I think trying to access buffers without going through
> shared_buffers is likely to be very hard to make correct and probably
> a loser.
>
No challenge, no outcome. ;-)

> Copying the data into shared_buffers and then to the GPU is,
> doubtless, at least somewhat slower.  But I kind of doubt that it's
> enough slower to make up for all of the problems you're going to have
> with the approach you've chosen.
>
Honestly, I'm still uncertain whether it works as well as I expect.
However, a scan workload on a table larger than main memory is a
headache for PG-Strom, so I'd like to try the ideas we can implement.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-02-12 Thread Kohei KaiGai
2016-02-13 0:11 GMT+09:00 Robert Haas :
> On Fri, Feb 12, 2016 at 9:56 AM, Andres Freund  wrote:
>> On 2016-02-10 23:26:20 -0500, Robert Haas wrote:
>>> I think the part about whacking around the FDW API is a little more
>>> potentially objectionable to others, so I want to hold off doing that
>>> unless a few more people chime in with +1.  Perhaps we could start a
>>> new thread to talk about that specific idea.  This is useful even
>>> without that, though.
>>
>> FWIW, I can delete a couple hundred lines of code from citusdb thanks to
>> this...
>>
>> A quick questions about what you committed:
>>
>>> @@ -527,10 +532,17 @@ extern PGDLLIMPORT Node *newNodeMacroHolder;
>>>   */
>>>  extern char *nodeToString(const void *obj);
>>>
>>> +struct Bitmapset;/* not to include bitmapset.h here */
>>> +struct StringInfoData;   /* not to include stringinfo.h here */
>>> +extern void outToken(struct StringInfoData *str, const char *s);
>>> +extern void outBitmapset(struct StringInfoData *str,
>>> +  const struct Bitmapset *bms);
>>> +
>>>  /*
>>>   * nodes/{readfuncs.c,read.c}
>>>   */
>>>  extern void *stringToNode(char *str);
>>> +extern struct Bitmapset *readBitmapset(void);
>>
>> why exactly did you expose read/writeBitmapset(), and nothing else?
>> Afaics a lot of the other read routines are also pretty necessary from
>> the outside?
>
> Because that's what KaiGai submitted and I had no reason to
> second-guess it.  I'm happy to expose other stuff along similar lines
> if somebody writes a patch for it.
>
Although we have various service macros in outfuncs.c and readfuncs.c,
it is not a big burden for extension authors to write the equivalent code,
except for strings and bitmapsets. On the other hand, strings need _outToken()
and bitmapsets need _outBitmapset(). It is not a good idea to make extensions
carry copy-and-pasted code with no clear reason.
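
For illustration, an extension-side output callback would use the exposed
helpers roughly like this (a sketch; MyPlanInfo and its fields are
hypothetical):

    static void
    my_node_out(StringInfo str, const struct ExtensibleNode *node)
    {
        const MyPlanInfo *info = (const MyPlanInfo *) node;

        appendStringInfoString(str, " :label ");
        outToken(str, info->label);             /* quoting/escaping as core does */
        appendStringInfoString(str, " :attrs_used ");
        outBitmapset(str, info->attrs_used);    /* e.g. "(b 1 3 7)" */
    }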

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-11 Thread Kouhei Kaigai
> On Tue, Feb 9, 2016 at 6:35 PM, Kouhei Kaigai  wrote:
> > Unfortunately, it was not sufficient.
> >
> > Due to the assumption, the buffer page to be suspended does not exist
> > when a backend process issues a series P2P DMA command. (If block would
> > be already loaded to the shared buffer, it don't need to issue P2P DMA,
> > but just use usual memory<->device DMA because RAM is much faster than
> > SSD.)
> > It knows the pair of (rel,fork,block), but no BufferDesc of this block
> > exists. Thus, it cannot acquire locks in BufferDesc structure.
> >
> > Even if the block does not exist at this point, concurrent process may
> > load the same page. BufferDesc of this page shall be assigned at this
> > point, however, here is no chance to lock something in BufferDesc for
> > the process which issues P2P DMA command.
> >
> > It is the reason why I assume the suspend/resume mechanism shall take
> > a pair of (rel,fork,block) as identifier of the target block.
> 
> I see the problem, but I'm not terribly keen on putting in the hooks
> that it would take to let you solve it without hacking core.  It
> sounds like an awfully invasive thing for a pretty niche requirement.
>
Hmm. In my experience, discussing whether a feature is niche or commodity is
often not productive. So, let me change the viewpoint.

We may be able to utilize an OS-level locking mechanism here.

Even though it depends on the filesystem implementation under the VFS,
we may be able to rely on the inode->i_mutex lock, which is acquired during
the buffer copy from user to kernel space, at least on a few major
filesystems: ext4, xfs and btrfs in my research. Likewise, the modified NVMe
SSD driver can acquire the inode->i_mutex lock during the P2P DMA transfer.

Once we can consider the OS buffer to be updated atomically under the lock,
we don't need to worry about corrupted pages, but we still need to pay
attention to the scenario where an updated buffer page is moved to the GPU.

In this case, PD_ALL_VISIBLE may give us a hint. The GPU side has no MVCC
infrastructure, so I intend to move all-visible pages only.
If someone updates the buffer concurrently and then writes out the page
including invisible tuples, the PD_ALL_VISIBLE flag shall be cleared, because
the updated tuples should not be visible to the transaction which issued the
P2P DMA.

Once the GPU meets a page with !PD_ALL_VISIBLE, it can return an error status
that tells the CPU to retry this page. In that case, the page is likely
already loaded into the shared buffers, so the retry penalty is not so
large.
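
To put that retry rule concretely, the host-side equivalent of the check is
roughly the following (a sketch only - the status codes and the function are
hypothetical; only PageIsAllVisible() is the real core macro):

    #define STROM_STATUS_OK             0   /* hypothetical */
    #define STROM_STATUS_RETRY_BY_CPU   1   /* hypothetical */

    static int
    strom_check_page_for_p2p_dma(Page page)
    {
        if (!PageIsAllVisible(page))
            return STROM_STATUS_RETRY_BY_CPU;  /* reload via shared_buffers */
        return STROM_STATUS_OK;                /* safe to process on the device */
    }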

I'll try to investigate an implementation along these lines.
Please correct me if I misunderstand something (especially the treatment
of PD_ALL_VISIBLE).

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-02-11 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Thursday, February 11, 2016 1:26 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Andres Freund; Amit Kapila; pgsql-hackers
> Subject: Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan 
> support
> on readfuncs.c)
> 
> On Wed, Feb 10, 2016 at 1:25 AM, Kouhei Kaigai  wrote:
> > It is pretty good!
> >
> > The attached patch (primary one) implements the above idea.
> >
> > Now ExtensibleNode works as a basis structure of data container,
> > regardless of CustomScan and ForeignScan.
> > Also, fdw_private and custom_private are de-defined to Node * type
> > from List * type. It affected to a few FDW APIs.
> 
> I extracted the subset of this that just creates the extensible node
> framework and did some cleanup - the result is attached.  I will
> commit this if nobody objects.
>
I have no objection of course.

> I think the part about whacking around the FDW API is a little more
> potentially objectionable to others, so I want to hold off doing that
> unless a few more people chime in with +1.  Perhaps we could start a
> new thread to talk about that specific idea.  This is useful even
> without that, though.
>

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-02-09 Thread Kouhei Kaigai
> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Monday, February 08, 2016 11:59 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Andres Freund; Amit Kapila; pgsql-hackers
> Subject: ##freemail## Re: CustomScan in a larger structure (RE: [HACKERS]
> CustomScan support on readfuncs.c)
> 
> On Sun, Feb 7, 2016 at 7:28 PM, Kouhei Kaigai  wrote:
> > The new callbacks of T_ExtensibleNode will replace the necessity to
> > form and deform process of private values, like as:
> >   https://github.com/pg-strom/devel/blob/master/src/gpujoin.c#L114
> 
> Yeah.
> 
> > It transforms a bunch of internal data of CustomScan (similar to the
> > extended fields in T_ExtensibleNode) to/from the node functions
> > understandable forms for copy, input and output support.
> > I think it implies you proposition is workable.
> >
> > I'd like to follow this proposition basically.
> > On the other hands, I have two points I want to pay attention.
> >
> > 1. At this moment, it is allowed to define a larger structure that
> > embeds CustomPath and CustomScanState by extension. How do we treat
> > this coding manner in this case? Especially, CustomScanState has no
> > private pointer dereference because it assumes an extension define
> > a larger structure. Of course, we don't need to care about node
> > operations on Path and PlanState nodes, but right now.
> 
> I see no real advantage in letting a CustomPath be larger.  If
> custom_private can include extension-defined node types, that seems
> good enough.  On the other hand, if CustomScanState can be larger,
> that seems fine.   We don't really need any special support for that,
> do we?
>
Yes. Right now, we have no code path that handles PlanState or structures
derived from it using the node operations. So, it is not a real problem.

> > 2. I intended to replace LibraryName and SymbolName fields from the
> > CustomScanMethods structure by integration of extensible node type.
> > We had to give a pair of these identifiers because custom scan provider
> > has no registration points at this moment. A little concern is extension
> > has to assume a particular filename of itself.
> > But, probably, it shall be a separated discussion. My preference is
> > preliminary registration of custom scan provider by its name, as well
> > as extensible node.
> 
> Seems like we could just leave the CustomScan stuff alone and worry
> about this as a separate facility.
>
OK

> > Towards the last question; whether *_private shall be void * or List *,
> > I want to keep fdw_private and custom_private as List * pointer, because
> > a new node type definition is a bit overdone job if this FDW or CSP will
> > take only a few private fields with primitive data types.
> > It is a preferable features when extension defines ten or more private
> > fields.
> 
> Well, I suggested Node *, not void *.  A Node can be a List, but not
> every Node is a List.
>
It is pretty good!

The attached patch (the primary one) implements the above idea.

Now ExtensibleNode works as the base structure for data containers,
independently of CustomScan and ForeignScan.
Also, fdw_private and custom_private are redefined from List * to Node *.
This affected a few FDW APIs.

The secondary patch is a demonstration of the new ExtensibleNode using the
postgres_fdw extension. Its private data is currently expected to be packed
into a list in a particular order; a self-defined structure allows keeping
these variables without the ugly packing/unpacking.
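
For illustration, what the primary patch enables is roughly the following
(a sketch; MyFdwPlanInfo and its fields are hypothetical):

    typedef struct MyFdwPlanInfo
    {
        ExtensibleNode  ext;            /* must be the first field */
        Oid             checkAsUser;
        List           *remote_conds;
        double          rows_estimate;
    } MyFdwPlanInfo;

    /* ... which can then be attached directly, e.g.
     *     plan->fdw_private = (Node *) info;
     * instead of packing the fields into a List in a fixed order. */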

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


pgsql-v9.6-custom-private.v5.demo.patch
Description: pgsql-v9.6-custom-private.v5.demo.patch


pgsql-v9.6-custom-private.v5.patch
Description: pgsql-v9.6-custom-private.v5.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-09 Thread Kouhei Kaigai
> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Wednesday, February 10, 2016 1:58 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
> Subject: ##freemail## Re: [HACKERS] Way to check whether a particular block is
> on the shared_buffer?
> 
> On Sun, Feb 7, 2016 at 9:49 PM, Kouhei Kaigai  wrote:
> > On the other hands, it also became clear we have to guarantee OS buffer
> > or storage block must not be updated partially during the P2P DMA.
> > My motivation is a potential utilization of P2P DMA of SSD-to-GPU to
> > filter out unnecessary rows and columns prior to loading to CPU/RAM.
> > It needs to ensure PostgreSQL does not write out buffers to OS buffers
> > to avoid unexpected data corruption.
> >
> > What I want to achieve is suspend of buffer write towards a particular
> > (relnode, forknum, blocknum) pair for a short time, by completion of
> > data processing by GPU (or other external devices).
> > In addition, it is preferable being workable regardless of the choice
> > of storage manager, even if we may have multiple options on top of the
> > pluggable smgr in the future.
> 
> It seems like you just need to take an exclusive content lock on the
> buffer, or maybe a shared content lock would be sufficient.
>
Unfortunately, that was not sufficient.

By assumption, the buffer page to be suspended does not exist when a backend
process issues a series of P2P DMA commands. (If the block were already
loaded into the shared buffers, it would not need to issue P2P DMA; it would
just use the usual memory<->device DMA, because RAM is much faster than
SSD.)
The backend knows the (rel, fork, block) triple, but no BufferDesc for this
block exists. Thus, it cannot acquire locks in the BufferDesc structure.

Even if the block does not exist at this point, a concurrent process may
load the same page. A BufferDesc for this page will be assigned at that
point; however, there is no chance for the process which issues the P2P DMA
command to lock anything in that BufferDesc.

That is the reason why I assume the suspend/resume mechanism should take
a (rel, fork, block) triple as the identifier of the target block.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-07 Thread Kouhei Kaigai
> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Monday, February 08, 2016 1:52 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
> Subject: Re: [HACKERS] Way to check whether a particular block is
> on the shared_buffer?
> 
> On Thu, Feb 4, 2016 at 11:34 PM, Kouhei Kaigai  wrote:
> > I can agree that smgr hooks shall be primarily designed to make storage
> > systems pluggable, even if we can use this hooks for suspend & resume of
> > write i/o stuff.
> > In addition, "pluggable storage" is a long-standing feature, even though
> > it is not certain whether existing smgr hooks are good starting point.
> > It may be a risk if we implement a grand feature on top of the hooks
> > but out of its primary purpose.
> >
> > So, my preference is a mechanism to hook buffer write to implement this
> > feature. (Or, maybe a built-in write i/o suspend / resume stuff if it
> > has nearly zero cost when no extension activate the feature.)
> > One downside of this approach is larger number of hook points.
> > We have to deploy the hook nearby existing smgrwrite of LocalBufferAlloc
> > and FlushRelationBuffers, in addition to FlushBuffer, at least.
> 
> I don't understand what you're hoping to achieve by introducing
> pluggability at the smgr layer.  I mean, md.c is pretty much good for
> read and writing from anything that looks like a directory of files.
> Another smgr layer would make sense if we wanted to read and write via
> some kind of network protocol, or if we wanted to have some kind of
> storage substrate that did internally to itself some of the tasks for
> which we are currently relying on the filesystem - e.g. if we wanted
> to be able to use a raw device, or perhaps more plausibly if we wanted
> to reduce the number of separate files we need, or provide a substrate
> that can clip an unused extent out of the middle of a relation
> efficiently.  But I don't understand what this has to do with what
> you're trying to do here.  The subject of this thread is about whether
> you can check for the presence of a block in shared_buffers, and as
> discussed upthread, you can.  I don't quite follow how we made the
> jump from there to smgr pluggability.
>
Yes. smgr pluggability is not what I want to investigate in this thread.
It is not the purpose of the discussion, just one potential implementation
idea.

Through the discussion, it became clear that an extension can check for the
existence of a buffer for a particular block using the existing
infrastructure.

On the other hand, it also became clear that we have to guarantee that the
OS buffer or storage block is not partially updated during the P2P DMA.
My motivation is the potential use of SSD-to-GPU P2P DMA to filter out
unnecessary rows and columns prior to loading into CPU/RAM.
This requires ensuring that PostgreSQL does not write out buffers to the OS
buffers, to avoid unexpected data corruption.

What I want to achieve is suspension of buffer writes to a particular
(relnode, forknum, blocknum) triple for a short time, until data processing
by the GPU (or another external device) completes.
In addition, it should preferably work regardless of the choice of storage
manager, even if we have multiple options on top of a pluggable smgr in the
future.

Data processing close to the storage requires that the OS buffer not be
updated concurrently during the P2P DMA. So, I want the feature below.
1. An extension (that controls the GPU and the P2P DMA) can register a
   particular (relnode, forknum, blocknum) triple as a block suspended for
   write.
2. Once a particular block is suspended, smgrwrite (or its caller) shall be
   blocked until the suspended block is unregistered.
3. The extension unregisters the block once the P2P DMA from it has
   completed; suspended concurrent backends then resume their write I/O.
4. On the other hand, the extension cannot register the block while some
   other backend is concurrently executing smgrwrite, to avoid potential
   data corruption.

One idea was to inject a thin layer on top of the smgr mechanism to
implement the above. However, I'm also uncertain whether hooking the entire
smgr interface is a straightforward approach to achieve it.

The minimum I want is a facility to get control at the head and tail of
smgrwrite() - to suspend a concurrent write before smgr_write runs, and to
notify the mechanism when the concurrent smgr_write has completed; a rough
sketch of that shape follows below.

It does not need smgr pluggability, but the entry points have to be located
around the smgr functions.
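
All of the names in the sketch below are hypothetical - nothing like this
exists in core today - but it shows the kind of entry points I mean:

    /* registration API used by the extension */
    extern void SuspendBlockWrite(RelFileNode rnode, ForkNumber forknum,
                                  BlockNumber blocknum);
    extern void ResumeBlockWrite(RelFileNode rnode, ForkNumber forknum,
                                 BlockNumber blocknum);

    /* smgrwrite() would wrap its existing body roughly as: */
    WaitIfBlockWriteSuspended(reln->smgr_rnode.node, forknum, blocknum);
    /* ... existing write path ... */
    NotifyBlockWriteDone(reln->smgr_rnode.node, forknum, blocknum);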

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

