Re: [HACKERS] list of credits for release notes

2017-10-03 Thread Kohei KaiGai
2017-10-03 6:09 GMT+09:00 Tatsuo Ishii <is...@sraoss.co.jp>:
>> On Wed, Sep 27, 2017 at 8:29 PM, Michael Paquier
>> <michael.paqu...@gmail.com> wrote:
>>> Looking at this list, the first name is followed by the family name,
>>> so there are inconsistencies with some Japanese names:
>>> - Fujii Masao should be Masao Fujii.
>>> - KaiGai Kohei should be Kohei Kaigai.
>>
>> But his emails say Fujii Masao, not Masao Fujii.
>
> Michael is correct.
>
> Sometimes people choose family name first in the emails.  However I am
> sure "Fujii" is the family name and "Masao" is the firstname.
>
>> KaiGai's case is a bit trickier, as his email name has changed over time.
>
> Michael is correct about Kaigai too.
>
I set up my personal e-mail account in the ${FAMILYNAME} ${FIRSTNAME}
manner, following the eastern convention.
However, my last company enforced a centralized e-mail account policy,
so ${FIRSTNAME} ${LASTNAME} was shown when I posted a message
from the jp.nec.com domain.
That is why my email name has changed over time.

Thanks,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kai...@heterodb.com>




Re: [HACKERS] Float value 'Infinity' is cast to numeric 1 on Windows

2017-09-27 Thread Kohei KaiGai
Hello,

Does it make sense to put an isinf(val) check and an ereport() prior to
the snprintf()?
That would be a lighter-weight check than relying on set_var_from_str().
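For reference, a minimal sketch of the check I have in mind, placed in
float8_numeric() just before the snprintf(), where val is the function's
existing float8 argument (the error code and message wording below are
only placeholders):

    /* reject +/-Infinity before it is rendered as "1.#INF" on Windows */
    if (isinf(val))
        ereport(ERROR,
                (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                 errmsg("cannot convert infinity to numeric")));

float4_numeric() follows the same snprintf()-then-set_var_from_str()
pattern, so it would need the same check.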

Thanks,

2017-09-27 19:41 GMT+09:00 Taiki Kondo :
> Hi all,
>
> I built PostgreSQL 9.6.5 with Visual Studio 2013 on Windows, and I found
> weird behavior on it.
>
> When I execute the following SQL, an ERROR is expected, but I got the
> mysterious value 1.
>
>
> postgres=# select (float8 'inf')::numeric;
>  numeric
> ---------
>        1
> (1 row)
>
>
> On Windows, at least with Visual Studio, the string for a float/double value
> meaning infinity is "1.#INF", not "inf". So set_var_from_str(), called from
> float8_numeric() in utils/adt/numeric.c, will return the numeric value 1,
> and no one checks for this situation.
>
> This situation is also occurring by cast from float4.
>
> I wrote a patch to add a check for this situation.
> Please find it attached.
>
>
> Sincerely,
>
> --
> Taiki Kondo
> NEC Solution Innovators, Ltd.
>
>
>
>



-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 




Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-24 Thread Kohei KaiGai
2017-02-25 1:40 GMT+09:00 Claudio Freire <klaussfre...@gmail.com>:
> On Fri, Feb 24, 2017 at 1:24 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Fri, Feb 24, 2017 at 7:29 PM, Kohei KaiGai <kai...@kaigai.gr.jp> wrote:
>>> The attached patch re-designed the previous v2 according to Robert's
>>> comment.
>>> All I had to do was put the entrypoint for ForeignScan/CustomScan into
>>> ExecShutdownNode, because it has already been modified to visit the child
>>> nodes first, earlier than ExecShutdownGather, which releases the DSM
>>> segment.
>>
>> Aside from the documentation, which needs some work, this looks fine
>> to me on a quick read.
>
> LGTM too.
>
> You may want to clarify in the docs when the hook is called in
> relation to other hooks too.
>
I tried to add a description of how the custom-scan callbacks behave under
the executor, and roughly when they are invoked.
However, it is fundamentally not easy for most people, because it assumes
the FDW/CSP author understands the overall behavior of the optimizer and
executor, not only the API specifications.

Thanks,
-- 
KaiGai Kohei <kai...@kaigai.gr.jp>


parallel-finish-fdw_csp.v4.patch
Description: Binary data



Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-24 Thread Kohei KaiGai
Hello,

The attached patch re-designed the previous v2 according to Robert's comment.
All I had to do was put the entrypoint for ForeignScan/CustomScan into
ExecShutdownNode, because it has already been modified to visit the child
nodes first, earlier than ExecShutdownGather, which releases the DSM segment.
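For reference, the shape of the change in ExecShutdownNode() is roughly as
below; the callback names are used here only for illustration, and the
attached patch is the authoritative version:

    /* in ExecShutdownNode(), after the child nodes have been walked */
    switch (nodeTag(node))
    {
        case T_GatherState:
            ExecShutdownGather((GatherState *) node);
            break;
        case T_ForeignScanState:
            /* let the FDW touch the DSM before Gather releases it */
            ExecShutdownForeignScan((ForeignScanState *) node);
            break;
        case T_CustomScanState:
            /* same entrypoint for custom scan providers */
            ExecShutdownCustomScan((CustomScanState *) node);
            break;
        default:
            break;
    }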

Thanks,

2017-02-20 9:25 GMT+09:00 Kouhei Kaigai :
>> -Original Message-
>> From: pgsql-hackers-ow...@postgresql.org
>> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
>> Sent: Monday, February 20, 2017 2:20 AM
>> To: Kaigai Kouhei(海外 浩平) 
>> Cc: Claudio Freire ; Amit Kapila
>> ; pgsql-hackers 
>> Subject: Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside
>> ExecEndGather)
>>
>> On Fri, Feb 17, 2017 at 12:46 PM, Kouhei Kaigai 
>> wrote:
>> > The attached patch is revised one.
>> >
>> > Invocation of Exec(Foreign|Custom)ParallelFinish was moved to
>> > ExecParallelRetrieveInstrumentation() not to walk on the plan-state
>> > tree twice.
>> > One (hypothetical) downside is, FDW/CSP can retrieve its own run-time
>> > statistics only when query is executed under EXPLAIN ANALYZE.
>> >
>> > This enhancement allows FDW/CSP to collect its specific run-time
>> > statistics more than Instrumentation, then show them as output of
>> > EXPLAIN. My expected examples are GPU's kernel execution time, DMA
>> > transfer ratio and so on. These statistics will never appear in the
>> > Instrumentation structure, however, can be a hot-point of performance
>> > bottleneck if CustomScan works on background workers.
>>
>> Would gather_shutdown_children_first.patch from
>> https://www.postgresql.org/message-id/CAFiTN-s5KuRuDrQCEpiHHzmVf7JTtbn
>> b8eb10c-6aywjdxb...@mail.gmail.com
>> help with this problem also?  Suppose we did that, and then also added an
>> ExecShutdownCustom method.  Then you'd definitely be able to get control
>> before the DSM went away, either from ExecEndNode() or ExecShutdownNode().
>>
> Ah, yes, I couldn't find any problem with the above approach.
> ExecShutdownGather() can be called by either ExecShutdownNode() or
> ExecEndGather(). This patch provides an entrypoint for CSP/FDW
> prior to the release of the DSM.
>
> Thanks,
> 
> PG-Strom Project / NEC OSS Promotion Center
> KaiGai Kohei 
>
>



-- 
KaiGai Kohei 


parallel-finish-fdw_csp.v3.patch
Description: Binary data



Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2017-01-02 Thread Kohei KaiGai
Oops, I overlooked that this patch was marked as "returned with feedback",
not "moved to the next CF".

Its status has not changed since the last update. (The code was revised
according to the last comment by Jeevan, but the November CF ran out of
time at that point.)

How should I handle the patch?

2016-12-05 16:49 GMT+09:00 Kouhei Kaigai :
> Hello,
>
> Sorry for my late response.
> The attached patch reflects your comments.
>
>> Here are few comments on latest patch:
>>
>>
>> 1.
>> make/make check is fine, however I am getting regression failure in
>> postgres_fdw contrib module (attached regression.diff).
>> Please investigate and fix.
>>
> It was an incorrect interaction when postgres_fdw tries to push down
> sorting to the remote side. We cannot attach the LIMIT clause to the plain
> scan path across a SORT; however, the previous version estimated the cost
> for the plain scan with the LIMIT clause even if local sorting was needed.
> If the remote scan may return just 10 rows, the estimated cost of the local
> sort is very lightweight, thus this unreasonable path was chosen.
> (On the other hand, its query execution results were correct, because
> ps_numTuples is not delivered across the Sort node, so the ForeignScan
> eventually scanned all the remote tuples. It produced correct results, but
> was not optimal from the viewpoint of performance.)
>
> The v3 patch estimates the cost with the remote LIMIT clause only if the
> supplied pathkey strictly matches the final output order of the query, so
> no local sorting is expected.
>
> Some of the regression test cases still have different plans, but due to
> the new optimization by the remote LIMIT clause.
> Without the remote LIMIT clause, some of the regression test cases preferred
> remote-JOIN + local-SORT then local-LIMIT.
> Once we have the remote-LIMIT option, it allows discounting the cost of the
> remote SORT by choosing top-k heap sorting.
> That changed the optimizer's decision on some test cases.
>
> One potentially big change is the test case below.
>
>  -- CROSS JOIN, not pushed down
>  EXPLAIN (VERBOSE, COSTS OFF)
>  SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 
> OFFSET 100 LIMIT 10;
>
> It assumed the CROSS JOIN was not pushed down due to the cost of network
> traffic; however, the remote LIMIT reduced the estimated number of tuples
> to be moved. So the whole CROSS JOIN + ORDER BY + LIMIT now runs
> on the remote side.
>
>> 2.
>> + *
>> + * MEMO: root->limit_tuples is not attached when query
>> contains
>> + * grouping-clause or aggregate functions. So, we don's adjust
>> + * rows even if LIMIT  is supplied.
>>
>> Can you please explain why you are not doing this for grouping-clause or
>> aggregate functions.
>>
> See grouping_planner() in optimizer/plan/planner.c.
> It puts an invalid value in root->limit_tuples if the query has a GROUP BY
> clause, so we cannot know the number of tuples to be returned even if we
> actually have an upper limit.
>
> /*
>  * Figure out whether there's a hard limit on the number of rows that
>  * query_planner's result subplan needs to return.  Even if we know a
>  * hard limit overall, it doesn't apply if the query has any
>  * grouping/aggregation operations, or SRFs in the tlist.
>  */
> if (parse->groupClause ||
> parse->groupingSets ||
> parse->distinctClause ||
> parse->hasAggs ||
> parse->hasWindowFuncs ||
> parse->hasTargetSRFs ||
> root->hasHavingQual)
> root->limit_tuples = -1.0;
> else
> root->limit_tuples = limit_tuples;
>
>> 3.
>> Typo:
>>
>> don's  => don't
>>
> Fixed,
>
> best regards,
> 
> PG-Strom Project / NEC OSS Promotion Center
> KaiGai Kohei 
>
>
>



-- 
KaiGai Kohei 


passdown-limit-fdw.v3.patch
Description: Binary data



Re: [HACKERS] varlena beyond 1GB and matrix

2016-12-22 Thread Kohei KaiGai
2016-12-23 8:24 GMT+09:00 Robert Haas <robertmh...@gmail.com>:
> On Wed, Dec 7, 2016 at 11:01 PM, Kohei KaiGai <kai...@kaigai.gr.jp> wrote:
>> Regardless of the ExpandedObject, does the flatten format need to
>> contain fully flatten data chunks?
>
> I suspect it does, and I think that's why this isn't going to get very
> far without a super-varlena format.
>
Yep, I'm now investigating how to implement the typlen == -3 approach.
Likely, it will be the most straightforward infrastructure for other
potential use cases beyond matrix/vector.
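Just to illustrate the direction (nothing below exists yet; the -3 code and
the 64-bit length macro are hypothetical), a datum.c-style length
computation would simply grow one more branch:

    if (typLen > 0)
        size = typLen;                              /* fixed-length type */
    else if (typLen == -1)
        size = VARSIZE_ANY(DatumGetPointer(value)); /* ordinary varlena */
    else if (typLen == -2)
        size = strlen(DatumGetCString(value)) + 1;  /* cstring */
    else if (typLen == -3)
        size = LONGVARSIZE(DatumGetPointer(value)); /* hypothetical 64-bit
                                                     * length header */
    else
        elog(ERROR, "invalid typLen: %d", typLen);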

Thanks,
-- 
KaiGai Kohei <kai...@kaigai.gr.jp>




Re: [HACKERS] varlena beyond 1GB and matrix

2016-12-22 Thread Kohei KaiGai
2016-12-23 8:23 GMT+09:00 Robert Haas <robertmh...@gmail.com>:
> On Wed, Dec 7, 2016 at 10:44 PM, Kohei KaiGai <kai...@kaigai.gr.jp> wrote:
>>> Handling objects >1GB at all seems like the harder part of the
>>> problem.
>>>
>>> I could mostly get your point. Does the last line above refer to the
>>> amount of data in an object >1GB, even if the "super-varlena" format
>>> allows a 64-bit length?
>
> Sorry, I can't understand your question about what I wrote.
>
I thought you were just pointing out that handling a large amount of data is
always the harder part, even if the data format allows >1GB. Right?

-- 
KaiGai Kohei <kai...@kaigai.gr.jp>




Re: [HACKERS] varlena beyond 1GB and matrix

2016-12-08 Thread Kohei KaiGai
2016-12-08 16:11 GMT+09:00 Craig Ringer <cr...@2ndquadrant.com>:
> On 8 December 2016 at 12:01, Kohei KaiGai <kai...@kaigai.gr.jp> wrote:
>
>>> At a higher level, I don't understand exactly where such giant
>>> ExpandedObjects would come from.  (As you point out, there's certainly
>>> no easy way for a client to ship over the data for one.)  So this feels
>>> like a very small part of a useful solution, if indeed it's part of a
>>> useful solution at all, which is not obvious.
>>>
>> I expect an aggregate function that consumes millions of rows as source
>> of a large matrix larger than 1GB. Once it is formed to a variable, it is
>> easy to deliver as an argument of PL functions.
>
> You might be interested in how Java has historically dealt with similar 
> issues.
>
> For a long time the JVM had quite low limits on the maximum amount of
> RAM it could manage, in the single gigabytes for a long time. Even for
> the 64-bit JVM. Once those limitations were lifted, the garbage
> collector algorithm placed a low practical limit on how much RAM it
> could cope with effectively.
>
> If you were doing scientific computing with Java, lots of big
> image/video work, using GPGPUs, doing large scale caching, etc, this
> rapidly became a major pain point. So people introduced external
> memory mappings to Java, where objects could reference and manage
> memory outside the main JVM heap. The most well known is probably
> BigMemory (https://www.terracotta.org/products/bigmemory), but there
> are many others. They exposed this via small opaque handle objects
> that you used to interact with the external memory store via library
> functions.
>
> It might make a lot of sense to apply the same principle to
> PostgreSQL, since it's much less intrusive than true 64-bit VARLENA.
> Rather than extending all of PostgreSQL to handle special-case
> split-up VARLENA extended objects, have your interim representation be
> a simple opaque value that points to externally mapped memory. Your
> operators for the type, etc, know how to work with it. You probably
> don't need a full suite of normal operators, you'll be interacting
> with the data in a limited set of ways.
>
> The main issue would presumably be one of resource management, since
> we currently assume we can just copy a Datum around without telling
> anybody about it or doing any special management. You'd need to know
> when to clobber your external segment, when to copy(!) it if
> necessary, etc. This probably makes sense for working with GPGPUs
> anyway, since they like dealing with big contiguous chunks of memory
> (or used to, may have improved?).
>
> It sounds like only code specifically intended to work with the
> oversized type should be doing much with it except passing it around
> as an opaque handle, right?
>
Thanks for your proposal. Its characteristics look like a large object,
but volatile (because there is no backing storage).
As you mentioned, I also think its resource management is the core issue.
We have no reference-count mechanism, so it is uncertain when we should
release an orphan memory chunk; otherwise we have to accept the memory
consumption until the relevant memory context is cleaned up.
Moreover, we have to pay attention to the scenario where this opaque
identifier is delivered to a background worker, so it needs to be valid in
another process's context. (Or should we prohibit exchanging it at the
planner stage?)

> Do you need to serialize this type to/from disk at all? Or just
> exchange it in chunks with a client? If you do need to, can you
> possibly do TOAST-like or pg_largeobject-like storage where you split
> it up for on disk storage then reassemble for use?
>
Even though I don't expect this big chunk to be entirely exported to or
imported from the client at once, it makes sense to save/load it to/from
disk, because constructing a big in-memory structure consumes far more CPU
cycles than a simple memory copy from buffers.
So, in other words, it does not need valid typinput/typoutput handlers, but
serialization/deserialization for disk I/O is helpful.

Thanks,
-- 
KaiGai Kohei <kai...@kaigai.gr.jp>




Re: [HACKERS] varlena beyond 1GB and matrix

2016-12-07 Thread Kohei KaiGai
2016-12-08 8:36 GMT+09:00 Tom Lane <t...@sss.pgh.pa.us>:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Wed, Dec 7, 2016 at 8:50 AM, Kohei KaiGai <kai...@kaigai.gr.jp> wrote:
>>> I like to propose a new optional type handler 'typserialize' to
>>> serialize an in-memory varlena structure (that can have indirect
>>> references) to on-disk format.
>
>> I think it's probably a mistake to conflate objects with substructure
>> with objects > 1GB.  Those are two somewhat orthogonal needs.
>
> Maybe.  I think where KaiGai-san is trying to go with this is being
> able to turn an ExpandedObject (which could contain very large amounts
> of data) directly into a toast pointer or vice versa.  There's nothing
> really preventing a TOAST OID from having more than 1GB of data
> attached, and if you had a side channel like this you could transfer
> the data without ever having to form a larger-than-1GB tuple.
>
> The hole in that approach, to my mind, is that there are too many places
> that assume that they can serialize an ExpandedObject into part of an
> in-memory tuple, which might well never be written to disk, or at least
> not written to disk in a table.  (It might be intended to go into a sort
> or hash join, for instance.)  This design can't really work for that case,
> and unfortunately I think it will be somewhere between hard and impossible
> to remove all the places where that assumption is made.
>
Regardless of the ExpandedObject, does the flattened format need to
contain fully flattened data chunks?
If a data type internally contains multiple toast pointers, like an array,
its flattened image is likely small enough that we can store it using the
existing varlena mechanism.
One problem is that VARSIZE() will never tell us the exact total length of
the data if it references multiple GB-scale chunks.

> At a higher level, I don't understand exactly where such giant
> ExpandedObjects would come from.  (As you point out, there's certainly
> no easy way for a client to ship over the data for one.)  So this feels
> like a very small part of a useful solution, if indeed it's part of a
> useful solution at all, which is not obvious.
>
I expect an aggregate function that consumes millions of rows as the source
of a matrix larger than 1GB. Once it is formed into a variable, it is
easy to deliver as an argument to PL functions.

> FWIW, ExpandedObjects themselves are far from a fully fleshed out
> concept, one of the main problems being that they don't have very long
> lifespans except in the case that they're the value of a plpgsql
> variable.  I think we would need to move things along quite a bit in
> that area before it would get to be useful to think in terms of
> ExpandedObjects containing multiple GB of data.  Otherwise, the
> inevitable flattenings and re-expansions are just going to kill you.
>
> Likewise, the need for clients to be able to transfer data in chunks
> gets pressing well before you get to 1GB.  So there's a lot here that
> really should be worked on before we try to surmount that barrier.
>
You are pointing out the problem around the client<->server protocol,
aren't you? Likely, we will eventually need that enhancement. I agree.

Thanks,
-- 
KaiGai Kohei <kai...@kaigai.gr.jp>




Re: [HACKERS] varlena beyond 1GB and matrix

2016-12-07 Thread Kohei KaiGai
2016-12-08 8:04 GMT+09:00 Robert Haas <robertmh...@gmail.com>:
> On Wed, Dec 7, 2016 at 8:50 AM, Kohei KaiGai <kai...@kaigai.gr.jp> wrote:
>> I like to propose a new optional type handler 'typserialize' to
>> serialize an in-memory varlena structure (that can have indirect
>> references) to on-disk format.
>> If any, it shall be involced on the head of toast_insert_or_update()
>> than indirect references are transformed to something other which
>> is safe to save. (My expectation is, the 'typserialize' handler
>> preliminary saves the indirect chunks to the toast relation, then
>> put toast pointers instead.)
>
> This might not work.  The reason is that we have important bits of
> code that expect that they can figure out how to do some operation on
> a datum (find its length, copy it, serialize it) based only on typlen
> and typbyval.  See src/backend/utils/adt/datum.c for assorted
> examples.  Note also the lengthy comment near the top of the file,
> which explains that typlen > 0 indicates a fixed-length type, typlen
> == -1 indicates a varlena, and typlen == -2 indicates a cstring.  I
> think there's probably room for other typlen values; for example, we
> could have typlen == -3 indicate some new kind of thing -- a
> super-varlena that has a higher length limit, or some other kind of
> thing altogether.
>
> Now, you can imagine trying to treat what you're talking about as a
> new type of TOAST pointer, but I think that's not going to help,
> because at some point the TOAST pointer gets de-toasted into a varlena
> ... which is still limited to 1GB.  So that's not really going to
> work.  And it brings out another point, which is that if you define a
> new typlen code, like -3, for super-big things, they won't be
> varlenas, which means they won't work with the existing TOAST
> interfaces.  Still, you might be able to fix that.  You would probably
> have to do some significant surgery on the wire protocol, per the
> commit message for fa2fa995528023b2e6ba1108f2f47558c6b66dcd.
>
Hmm... The reason why I didn't introduce the idea of a 64-bit varlena format
is that the approach seemed too invasive for the existing PostgreSQL core and
extensions, because I assumed this "long" variable-length datum would utilize
and enhance the existing varlena infrastructure.
However, once we have an infrastructure completely independent from the
existing varlena, we no longer run the risk of mixing up the two code paths.
That seems like an advantage to me.

My concern about ExpandedObject is that its flat format still needs the
32-bit varlena header, which restricts the total length to 1GB, so it was
not possible to represent a large data chunk.
That is why I didn't think the current ExpandedObject was a solution for us.

Regarding the protocol stuff, I want to consider support for large records
after the internal data format, because my expected primary usage is internal
use for in-database analytics, where the user gets the results from
complicated logic described in a PL function.

> I think it's probably a mistake to conflate objects with substructure
> with objects > 1GB.  Those are two somewhat orthogonal needs.  As Jim
> also points out, expanded objects serve the first need.  Of course,
> those assume that we're dealing with a varlena, so if we made a
> super-varlena, we'd still need to create an equivalent.  But perhaps
> it would be a fairly simple adaptation of what we've already got.
> Handling objects >1GB at all seems like the harder part of the
> problem.
>
I could mostly get your point. Does the last line above refer to the
amount of data in an object >1GB, even if the "super-varlena" format
allows a 64-bit length?

Thanks,
-- 
KaiGai Kohei <kai...@kaigai.gr.jp>




[HACKERS] varlena beyond 1GB and matrix

2016-12-07 Thread Kohei KaiGai
This is a design proposal for a matrix data type which can be larger
than 1GB. It needs not only new data type support but also a platform
enhancement, because the existing varlena has a hard limit (1GB).
We had a discussion about this topic at the developer unconference
at Tokyo/Akihabara, the day before PGconf.ASIA. The design proposal
below stands on the overall consensus of that discussion.


============
 BACKGROUND
============
The varlena format has either a short (1-byte) or a long (4-byte) header.
We use the long header for in-memory structures, which are referenced
by VARSIZE()/VARDATA(), or for on-disk structures which are larger than
126 bytes but smaller than the TOAST threshold. Otherwise, the short format
is used if the varlena is less than 126 bytes or is stored externally in the
toast relation. No varlena representation supports a data size larger
than 1GB.
On the other hand, some use cases which handle (relatively) big data in the
database are interested in variable-length datums larger than 1GB.

One example is what I presented at PGconf.SV.
A PL/CUDA function takes two arguments (2D arrays of int4 instead of
matrices), then returns the top-N combinations of the chemical compounds
according to their similarity, like this:

SELECT knn_gpu_similarity(5, Q.matrix, D.matrix)
  FROM (SELECT array_matrix(bitmap) matrix
          FROM finger_print_query
         WHERE tag LIKE '%abc%') Q,
       (SELECT array_matrix(bitmap) matrix
          FROM finger_print_10m) D;

array_matrix() is an aggregate function that generates a 2D array
containing the entire input relation stream.
It works fine as long as the data size is less than 1GB. Once it exceeds
that boundary, the user has to split the 2D array manually, even though
it is not uncommon for recent GPU models to have more than 10GB of RAM.

It is really problematic if we cannot split the mathematical problem
into several portions appropriately. And it is not comfortable for
users who cannot use the full capability of the GPU device.

============
 PROBLEM
============
Our problem is that the varlena format does not permit a variable-length
datum larger than 1GB, even though our "matrix" type wants to move
a bunch of data larger than 1GB.
So, we need to solve the varlena format restriction
prior to the matrix type implementation.

At the developer unconference, people discussed three ideas for the problem.
The overall consensus was the idea of a special data type
which can contain multiple indirect references to other data chunks.
Both the main part and the referenced data chunks are each less than 1GB,
but the total amount of data we can represent is more than 1GB.

For example, even if we have a large matrix of around 8GB, its sub-matrices
separated into 9 portions (3x3) are individually less than 1GB.
It is problematic when we try to save a matrix which contains indirect
references to the sub-matrices, because toast_save_datum() writes out
the flat portion just after the varlena header onto a tuple or a toast
relation as is.
If the main portion of the matrix contains pointers (indirect references),
that is obviously problematic. We need an infrastructure to
serialize the indirect references prior to saving.

BTW, the other, less acknowledged ideas were a 64-bit varlena header
and utilization of large objects. The former changes the current data
format and thus would have unexpected side effects on the existing code of
the PostgreSQL core and extensions. The latter requires users to construct
a large object beforehand. That makes it impossible to use the interim
result of a sub-query, and leads to unnecessary I/O for preparation.

============
 SOLUTION
============
I would like to propose a new optional type handler, 'typserialize', to
serialize an in-memory varlena structure (that can have indirect
references) into an on-disk format.
If present, it shall be invoked at the head of toast_insert_or_update(),
so that indirect references are transformed into something else which
is safe to save. (My expectation is that the 'typserialize' handler
first saves the indirect chunks to the toast relation, then
puts toast pointers in their place.)
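Just to illustrate the intention (all names below are tentative, not an
agreed design; get_typserialize() stands for a lookup of the proposed
pg_type column):

    /* at the head of toast_insert_or_update(), roughly */
    for (i = 0; i < numAttrs; i++)
    {
        Form_pg_attribute att = tupleDesc->attrs[i];
        Oid     serialize_fn = get_typserialize(att->atttypid);

        /* let the type flatten its indirect references first */
        if (OidIsValid(serialize_fn) && !toast_isnull[i])
            toast_values[i] = OidFunctionCall1(serialize_fn,
                                               toast_values[i]);
    }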

On the other hand, it is uncertain whether we need a symmetric
'typdeserialize' handler. Because all the functions/operators which
support such a special data type should know its internal format,
it is possible to load the indirectly referenced data chunks on
demand.
That will be beneficial from the performance perspective if a function
or operator touches only a part of the large structure, because the
remaining portions do not need to be loaded into memory.

One thing I'm not certain about is whether we can update the datum
supplied as an argument in functions/operators.
Let me assume the data structure below:

  struct {
      int32   varlena_head;
      Oid     element_id;
      int     matrix_type;
      int     blocksz_x;   /* horizontal size of each block */
      int     blocksz_y;   /* vertical size of each block */
      int     gridsz_x;    /* 

Re: [HACKERS] raw output from copy

2016-12-06 Thread Kohei KaiGai
2016-12-06 16:59 GMT+09:00 Pavel Stehule <pavel.steh...@gmail.com>:
>
>
> 2016-12-06 1:50 GMT+01:00 Kohei KaiGai <kai...@kaigai.gr.jp>:
>>
>> 2016-12-05 22:45 GMT+09:00 Pavel Stehule <pavel.steh...@gmail.com>:
>> >
>> > There are more goals:
>> >
>> > 1. user friendly import of text or binary data - import text data (with
>> > psql) from file is possible - but you have to load a content to psql
>> > variable. For binary data you should to use workaround based on LO and
>> > transformation from LO to bytea.
>> >
>> > 2. user friendly export text or binary data - now, the binary data can
>> > be
>> > exported only via transformation to LO. The XML has very interesting
>> > features when is passing from/to client binary. This feature is
>> > impossible
>> > in psql now.
>> >
>>   :
>> 
>>   :
>> >> It seems to me extend of COPY statement for this optimization is a bit
>> >> overkill
>> >> solution. Do we find out an alternative solution that we can build on
>> >> the existing
>> >> infrastructure?
>> >
>> > The advantage and sense of COPY RAW was reusing existing interface. The
>> > question was: How I can export/import binary data simply from psql
>> > console?
>> >
>> OK, I could get your point.
>>
>> Likely, we can implement the feature without COPY statement enhancement
>> by adding a special purpose function and \xxx command on psql.
>>
>> Let's assume the two commands below on psql:
>>
>> \blob_import   (STDIN|)
>> \blob_export  (STDOUT|)
>>
>> On \blob_import, the psql command reads the binary contents from either
>> stdin or file, than call a special purpose function that takes three
>> arguments; table name, column name and a binary data chunk.
>> PQexecParams() of libpq allows to deliver the data chunk with keeping
>> binary data format, then the special purpose function will be able to
>> lookup the destination table/column and construct a tuple that contains
>> the supplied data chunk. (I think _recv handler shall be used for
>> data validation, but not an element of this feature.)
>>
>>
>> On \blob_export, the psql command also set up a simple query as follows:
>>   SELECT blob_export((<user's supplied query>))
>> For example,
>>   \blob_export SELECT binary_data FROM my_table WHERE id = 10   /tmp/aaa
>> shall be transformed to
>>   SELECT blob_export((SELECT binary_data FROM my_table WHERE id = 10))
>
>
> This is reason why I prefer a COPY statement - because it does all necessary
> things natural.  But if there is a disagreement against COPY RAW it can be
> implemented as psql commands.
>
Yes, both approaches will be able to implement what you want to do.
I agree it is valuable if psql can import/export a particular item from a
simple shell-script description; however, there is no consensus on how
to implement it.

If psql supports the special \xxx commands, it is equally convenient
from the standpoint of users, with no enhancement of the COPY statement.

I hope committers will comment on the approach we should take.

Thanks,

> export should be similar like \g, \gset feature
>
> so
>
> SELECT xmldoc FROM 
> \gbinary_store .xxx
>
> import is maybe better solved by proposed file references in queries
>
> Regards
>
> Pavel
>
>
>>
>>
>> This function is declared as:
>>   blob_export(anyelement) RETURNS bytea
>> So, as long as the user supplied query returns exactly one column and
>> one row, it can transform the argument to the binary stream, then psql
>> command receive it and dump somewhere; stdout or file.
>>
>> How about your thought?
>>
>> Thanks,
>> --
>> KaiGai Kohei <kai...@kaigai.gr.jp>
>
>



-- 
KaiGai Kohei <kai...@kaigai.gr.jp>




Re: [HACKERS] raw output from copy

2016-12-05 Thread Kohei KaiGai
2016-12-05 22:45 GMT+09:00 Pavel Stehule :
>
> There are more goals:
>
> 1. user friendly import of text or binary data - import text data (with
> psql) from file is possible - but you have to load a content to psql
> variable. For binary data you should to use workaround based on LO and
> transformation from LO to bytea.
>
> 2. user friendly export text or binary data - now, the binary data can be
> exported only via transformation to LO. The XML has very interesting
> features when is passing from/to client binary. This feature is impossible
> in psql now.
>
  :

  :
>> It seems to me extend of COPY statement for this optimization is a bit
>> overkill
>> solution. Do we find out an alternative solution that we can build on
>> the existing
>> infrastructure?
>
> The advantage and sense of COPY RAW was reusing existing interface. The
> question was: How I can export/import binary data simply from psql console?
>
OK, I could get your point.

Likely, we can implement the feature without a COPY statement enhancement
by adding a special-purpose function and \xxx commands to psql.

Let's assume the two commands below in psql:

\blob_import   (STDIN|)
\blob_export  (STDOUT|)

On \blob_import, the psql command reads the binary contents from either
stdin or a file, then calls a special-purpose function that takes three
arguments: table name, column name and a binary data chunk.
PQexecParams() of libpq allows delivering the data chunk while keeping
the binary data format, so the special-purpose function will be able to
look up the destination table/column and construct a tuple that contains
the supplied data chunk. (I think the _recv handler shall be used for
data validation, but that is not an element of this feature.)
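To make the client side concrete, the psql command would essentially do the
following with PQexecParams(); blob_import() and the table/column names are
part of the proposal, not existing objects, and chunk/chunk_len hold the
file contents read by psql:

    const char *paramValues[1]  = { chunk };
    int         paramLengths[1] = { (int) chunk_len };
    int         paramFormats[1] = { 1 };        /* 1 = binary parameter */
    Oid         paramTypes[1]   = { 17 };       /* BYTEAOID */

    PGresult *res = PQexecParams(conn,
                                 "SELECT blob_import('my_table', 'my_column', $1)",
                                 1, paramTypes,
                                 paramValues, paramLengths, paramFormats,
                                 0);            /* text result is fine */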


On \blob_export, the psql command also sets up a simple query as follows:
  SELECT blob_export((




Re: [HACKERS] raw output from copy

2016-12-05 Thread Kohei KaiGai
Sorry for my late response.

I've briefly checked the past series of discussions.
I understand the target/purpose of this patch is to provide a fast interface
to import/export a particular cell of a relation, by skipping the
text<->binary transformation. Its typical use cases are the XML and JSON
data types. Right?

If so, how about the idea of using the fast-path invocation protocol to call
functions that import/export these document types?
It allows accepting the binary form of the data stream, with minimum overhead.
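For reference, the fast-path interface I mean is libpq's PQfn(), which ships
a binary argument to a function identified by its pg_proc OID, roughly like
this (the OID lookup and the chunk buffer are assumed to be prepared by the
caller beforehand):

    PQArgBlock  arg;
    int         result_buf;
    int         result_len;
    PGresult   *res;

    arg.isint = 0;
    arg.len   = (int) chunk_len;    /* binary payload to import */
    arg.u.ptr = (int *) chunk;

    res = PQfn(conn,
               import_fn_oid,       /* pg_proc OID, looked up beforehand */
               &result_buf, &result_len,
               1,                   /* treat the result as an int */
               &arg, 1);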

It seems to me that extending the COPY statement for this optimization is a
bit of an overkill solution. Can we find an alternative solution that we can
build on the existing infrastructure?

Best regards,

2016-12-05 14:16 GMT+09:00 Haribabu Kommi :
>
>
> On Tue, Nov 22, 2016 at 10:48 PM, Haribabu Kommi 
> wrote:
>>
>>  Hi,
>>
>> This is a gentle reminder.
>>
>> you assigned as reviewer to the current patch in the 11-2016 commitfest.
>> But you haven't shared your review yet. Please share your review about
>> the patch. This will help us in smoother operation of commitfest.
>>
>> Please Ignore if you already shared your review.
>
>
> Patch is not applying properly to HEAD.
> Moved to next CF with "waiting on author" status.
>
>
> Regards,
> Hari Babu
> Fujitsu Australia



-- 
KaiGai Kohei 




Re: [HACKERS] Use of pg_proc.probin is legal?

2016-11-17 Thread Kohei KaiGai
2016-11-16 7:46 GMT-08:00 Tom Lane <t...@sss.pgh.pa.us>:
> Kohei KaiGai <kai...@kaigai.gr.jp> writes:
>> On the other hands, interpret_AS_clause() raises an ereport if SQL
>> function tries to use probin except
>> for C-language. Is it illegal for other languages to use probin field
>> to store something useful?
>
> Well, there's no convention about how to use it.
>
>> In my case, PL/CUDA language allows to define SQL function with a CUDA
>> code block.
>> It saves a raw CUDA source code on the pg_proc.prosrc and its
>> intermediate representation
>> on the pg_proc.probin; which is automatically constructed on the
>> validator callback of the language
>> handler.
>
> I have precisely zero sympathy for such a kluge.  The validator exists
> to validate, it is not supposed to modify the pg_proc row.
>
> We could imagine extending the PL API to allow storage of a compiled
> version in probin, but overloading the validator isn't the way to do
> that IMO.  I'd prefer to see a separate "compile" function for it.
> Existence of a compile function could be the trigger that instructs,
> eg, pg_dump not to include the probin value in the dump.
>
> (There once was a LANCOMPILER option in the CREATE LANGUAGE syntax,
> which I imagine was meant to do something like this, but it was never
> fully implemented and we got rid of it years ago.)
>
> The bigger question though is whether it's really worth the trouble.
> All existing PLs deal with this by caching compiled (to one degree
> or another) representations in process memory.  If you keep it in
> probin you can save some compilation time once per session, but on the
> other hand you're limited to representations that can fit in a flat blob.
>
After thinking about it with a fresh brain, it may not be a serious problem.
Right now, it records a dependency between a PL/CUDA function and
its helper functions (*1), then puts the helper function's OID in the
intermediate representation in pg_proc.probin, so that it keeps working
even if the function gets renamed.
However, this is a situation where we could simply raise an ereport if a
helper function is not found, instead of using the dependency mechanism.
In addition, the heavier compilation step is from the intermediate form to
the real binary form, rather than from source to intermediate form,
although I don't know about other scripting languages.
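In code terms, the runtime would simply re-resolve the helper when the
PL/CUDA function is executed, roughly like below (helper_name, nargs and
argtypes come from the parsed #plcuda_* directive; this is only a sketch):

    Oid helper_oid = LookupFuncName(list_make1(makeString(helper_name)),
                                    nargs, argtypes, true); /* missing_ok */

    if (!OidIsValid(helper_oid))
        ereport(ERROR,
                (errcode(ERRCODE_UNDEFINED_FUNCTION),
                 errmsg("PL/CUDA helper function \"%s\" was not found",
                        helper_name)));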

Thanks,

(*1) PL/CUDA allows putting certain kinds of helper functions in its
program source for various purposes. For example, the directive
below tells the PL/CUDA runtime to allocate a particular
amount of device memory according to the result (int8) of
my_plcuda_workbuf_sz(), which has an argument set identical to
the PL/CUDA function.

$$
  :
#plcuda_workbuf_sz my_plcuda_func_workbuf_sz
  :
$$

-- 
KaiGai Kohei <kai...@kaigai.gr.jp>




[HACKERS] Use of pg_proc.probin is legal?

2016-11-15 Thread Kohei KaiGai
Hello,

I have a question as subject line.

https://www.postgresql.org/docs/devel/static/catalog-pg-proc.html
According to the documentation, the purpose of pg_proc.probin is
introduced as follows:
| Additional information about how to invoke the function. Again, the
interpretation is language-specific.

On the other hand, interpret_AS_clause() raises an ereport if a SQL
function tries to use probin for anything other than the C language.
Is it illegal for other languages to use the probin field
to store something useful?

In my case, the PL/CUDA language allows defining a SQL function with a CUDA
code block.
It saves the raw CUDA source code in pg_proc.prosrc and its intermediate
representation in pg_proc.probin; the latter is automatically constructed
in the validator callback of the language handler.

I noticed the problematic scenario because pg_dump generates a CREATE
FUNCTION statement with two AS arguments, and then pg_restore tries to run
that statement but fails.

The solution seems to me to be one of these:
1. Allow CREATE FUNCTION with two AS arguments. If the language does not
support it, probin is simply ignored.
2. pg_dump does not dump pg_proc.probin if the language is not C.
3. Update the documentation to state that probin must not be provided for
languages other than C.

Thanks,
-- 
KaiGai Kohei 




Re: [HACKERS] Academic help for Postgres

2016-05-11 Thread Kohei KaiGai
2016-05-11 23:20 GMT+09:00 Bruce Momjian :
> I am giving a keynote at an IEEE database conference in Helsinki next
> week (http://icde2016.fi/).  (Yes, I am not attending PGCon Ottawa
> because I accepted the Helsinki conference invitation before the PGCon
> Ottawa date was changed from June to May).
>
> As part of the keynote, I would like to mention areas where academia can
> help us.  The topics I can think of are:
>
> Query optimization
> Optimizer statistics
> Indexing structures
> Reducing function call overhead
> CPU locality
> Sorting
> Parallelism
> Sharding
>
> Any others?
>
How about NVRAM utilization?

-- 
KaiGai Kohei 




Re: [HACKERS] textlike under the LIKE operator for char(n)

2016-05-06 Thread Kohei KaiGai
2016-05-06 23:17 GMT+09:00 Kevin Grittner <kgri...@gmail.com>:
> On Fri, May 6, 2016 at 8:58 AM, Kohei KaiGai <kai...@kaigai.gr.jp> wrote:
>
>> postgres=# select 'abcd'::char(20) LIKE 'ab%cd';
>>  ?column?
>> ----------
>>  f
>> (1 row)
>>
>> postgres=# select 'abcd'::char(4) LIKE 'ab%cd';
>>  ?column?
>> ----------
>>  t
>> (1 row)
>>
>> LIKE operator (that is eventually processed by textlike) considers the
>> padding space of char(n) data type as a part of string.
>
> The SQL standard generally requires this for CHAR(n) columns.
>
>> On the other hands, equal operator ignores the padding space when it
>> compares two strings.
>>
>> postgres=# select 'abcd'::char(20) = 'abcd';
>>  ?column?
>> ----------
>>  t
>> (1 row)
>>
>> postgres=# select 'abcd'::char(4) = 'abcd';
>>  ?column?
>> ----------
>>  t
>> (1 row)
>
> The SQL standard specifically requires this exception to the
> general rule.
>
>> Is this behavior as expected? or, bug?
>
> This has been discussed on community lists multiple times in the
> past; you might want to search the archives.  I'm not inclined to
> dig through the standard for details on this point again right now,
> but in general the behaviors we provide for CHAR(n) are mandated by
> standard.  It would not entirely shock me if there are some corner
> cases where different behavior could be allowed or even more
> correct, but my recollection is that what you have shown is all
> required to work that way.
>
Thanks. I couldn't quickly find the reason for the behavior.
A requirement of the SQL standard is clear guidance, even if it looks
a bit mysterious.

> Generally, I recommend avoiding CHAR(n) columns like the plague.
>
Yep, I agree. I found this matter while porting the LIKE operator to GPU,
not because some real-life query tried to use char(n).

Best regards,
-- 
KaiGai Kohei <kai...@kaigai.gr.jp>




[HACKERS] textlike under the LIKE operator for char(n)

2016-05-06 Thread Kohei KaiGai
Hi,

I found mysterious behavior when we use the LIKE operator on the char(n) data type.


postgres=# select 'abcd'::char(20) LIKE 'ab%cd';
 ?column?
----------
 f
(1 row)

postgres=# select 'abcd'::char(4) LIKE 'ab%cd';
 ?column?
----------
 t
(1 row)

The LIKE operator (which is eventually processed by textlike) considers the
padding spaces of the char(n) data type as part of the string.

On the other hand, the equality operator ignores the padding spaces when it
compares two strings.

postgres=# select 'abcd'::char(20) = 'abcd';
 ?column?
----------
 t
(1 row)

postgres=# select 'abcd'::char(4) = 'abcd';
 ?column?
----------
 t
(1 row)

The LIKE operator on the char(n) data type is implemented by textlike().

at pg_proc.h:
DATA(insert OID = 1631 (  bpcharlike   PGNSP PGUID 12 1 0 0 0 f f
f f t f i s 2 0 16 "1042 25" _null_ _null_ _null_ _null_ _null_
textlike _null_ _null_ _null_ ));

It calls GenericMatchText() with the length of the target string
calculated by VARSIZE_ANY_EXHDR; however, that includes the padding
spaces.
It seems to me that bcTruelen() would give the correct length for char(n)
data types, instead of this macro.

Is this behavior expected, or is it a bug?

Thanks,
-- 
KaiGai Kohei 




Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-02-12 Thread Kohei KaiGai
2016-02-13 0:11 GMT+09:00 Robert Haas :
> On Fri, Feb 12, 2016 at 9:56 AM, Andres Freund  wrote:
>> On 2016-02-10 23:26:20 -0500, Robert Haas wrote:
>>> I think the part about whacking around the FDW API is a little more
>>> potentially objectionable to others, so I want to hold off doing that
>>> unless a few more people chime in with +1.  Perhaps we could start a
>>> new thread to talk about that specific idea.  This is useful even
>>> without that, though.
>>
>> FWIW, I can delete a couple hundred lines of code from citusdb thanks to
>> this...
>>
>> A quick questions about what you committed:
>>
>>> @@ -527,10 +532,17 @@ extern PGDLLIMPORT Node *newNodeMacroHolder;
>>>   */
>>>  extern char *nodeToString(const void *obj);
>>>
>>> +struct Bitmapset;/* not to include bitmapset.h here */
>>> +struct StringInfoData;   /* not to include stringinfo.h here */
>>> +extern void outToken(struct StringInfoData *str, const char *s);
>>> +extern void outBitmapset(struct StringInfoData *str,
>>> +  const struct Bitmapset *bms);
>>> +
>>>  /*
>>>   * nodes/{readfuncs.c,read.c}
>>>   */
>>>  extern void *stringToNode(char *str);
>>> +extern struct Bitmapset *readBitmapset(void);
>>
>> why exactly did you expose read/writeBitmapset(), and nothing else?
>> Afaics a lot of the other read routines are also pretty necessary from
>> the outside?
>
> Because that's what KaiGai submitted and I had no reason to
> second-guess it.  I'm happy to expose other stuff along similar lines
> if somebody writes a patch for it.
>
Although we have various service macros in outfuncs.c and readfuncs.c,
it is not a big burden for extension authors to write equivalent code,
except for strings and bitmaps. On the other hand, strings need _outToken
and bitmaps need _outBitmapset. It is not a good idea to make extensions
carry copy-and-pasted code for no clear reason.
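For example, with the two functions exposed, the extension side of a
text-output callback only needs something like the following
(MyCustomScanInfo and its label/extra_rels fields are made up for
illustration):

    static void
    my_custom_scan_textout(StringInfo str, const CustomScan *cscan)
    {
        MyCustomScanInfo *info = linitial(cscan->custom_private);

        appendStringInfoString(str, " :label ");
        outToken(str, info->label);           /* quoting as in outfuncs.c */

        appendStringInfoString(str, " :extra_rels ");
        outBitmapset(str, info->extra_rels);  /* emits e.g. "(b 1 3)" */
    }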

Thanks,
-- 
KaiGai Kohei 




[HACKERS] cash_mul_int8 / cash_div_int8

2015-10-07 Thread Kohei KaiGai
I noticed cash_mul_int8 / cash_div_int8 are defined in cash.c;
however, pg_proc.h and pg_operator.h contain no relevant entries.

Is it just a careless oversight?

Thanks,
-- 
KaiGai Kohei 




Re: [HACKERS] cash_mul_int8 / cash_div_int8

2015-10-07 Thread Kohei KaiGai
Also, cash_pl, cash_mi, cash_mul_int4 and so on do not have overflow checks
like int8pl has.

Of course, most people don't need to worry about 64-bit overflow for
money... :-)
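For reference, a sketch of what an int8pl-style overflow check would look
like for cash addition (written as a separate function here purely for
illustration; the error message is a placeholder):

    Datum
    cash_pl_checked(PG_FUNCTION_ARGS)
    {
        Cash    c1 = PG_GETARG_CASH(0);
        Cash    c2 = PG_GETARG_CASH(1);
        Cash    result = c1 + c2;

        /* overflow iff both inputs share a sign and the result does not */
        if (((c1 >= 0) == (c2 >= 0)) && ((result >= 0) != (c1 >= 0)))
            ereport(ERROR,
                    (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                     errmsg("money out of range")));

        PG_RETURN_CASH(result);
    }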

2015-10-08 0:03 GMT+09:00 Kohei KaiGai <kai...@kaigai.gr.jp>:
> I noticed cash_mul_int8 / cash_div_int8 are defined in cash.c,
> however, pg_proc.h and pg_operator.h contains no relevant entries.
>
> Is it just a careless oversight?
>
> Thanks,
> --
> KaiGai Kohei <kai...@kaigai.gr.jp>



-- 
KaiGai Kohei <kai...@kaigai.gr.jp>




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Kohei KaiGai
2015-08-19 20:12 GMT+09:00 Simon Riggs si...@2ndquadrant.com:
 On 12 June 2015 at 00:29, Tomas Vondra tomas.von...@2ndquadrant.com wrote:


 I see two ways to fix this:

 (1) enforce the 1GB limit (probably better for back-patching, if that's
 necessary)

 (2) make it work with hash tables over 1GB

 I'm in favor of (2) if there's a good way to do that. It seems a bit
 stupid not to be able to use fast hash table because there's some artificial
 limit. Are there any fundamental reasons not to use the
 MemoryContextAllocHuge fix, proposed by KaiGai-san?


 If there are no objections, I will apply the patch for 2) to HEAD and
 backpatch to 9.5.

Please don't rush. :-)

It is not difficult to replace palloc() with palloc_huge(); however, it may
lead to another problem once the planner gives us a crazy estimation.
Below is my comment from the other thread.

==========
Also, we may need to pay attention to the reliability of the planner's
scale estimation.
Even though the plan below says that the join generates 60521928028 rows,
it actually generates 776157676 rows (0.12%).


tpcds100=# EXPLAIN ANALYZE select
ws1.ws_order_number,ws1.ws_warehouse_sk wh1,ws2.ws_warehouse_sk wh2
 from web_sales ws1,web_sales ws2
 where ws1.ws_order_number = ws2.ws_order_number
   and ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk;
 QUERY PLAN
------------------------------------------------------------------------
 Merge Join  (cost=25374644.08..1160509591.61 rows=60521928028
width=24) (actual time=138347.979..491889.343 rows=776157676 loops=1)
   Merge Cond: (ws1.ws_order_number = ws2.ws_order_number)
   Join Filter: (ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
   Rows Removed by Join Filter: 127853313
   ->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16)
(actual time=73252.300..79017.420 rows=72001237 loops=1)
 Sort Key: ws1.ws_order_number
 Sort Method: quicksort  Memory: 7083296kB
 ->  Seq Scan on web_sales ws1  (cost=0.00..3290612.48
rows=72001248 width=16) (actual time=0.023..39951.201 rows=72001237
loops=1)
   ->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16)
(actual time=65095.655..128885.811 rows=904010978 loops=1)
 Sort Key: ws2.ws_order_number
 Sort Method: quicksort  Memory: 7083296kB
 ->  Seq Scan on web_sales ws2  (cost=0.00..3290612.48
rows=72001248 width=16) (actual time=0.014..31046.888 rows=72001237
loops=1)
 Planning time: 0.232 ms
 Execution time: 530176.521 ms
(14 rows)


So, even if we allow nodeHash.c to allocate hash buckets larger than
1GB, the initial size should be determined carefully.
Probably, 1GB is a good starting point, even if it is expanded later.
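In code terms, something like the following in nodeHash.c is what I have in
mind (just a sketch; nbuckets and the field names follow the existing
hash-table code):

    Size    bucket_bytes = (Size) nbuckets * sizeof(HashJoinTuple);

    /* don't trust a crazy row estimate for the initial allocation */
    if (bucket_bytes > MaxAllocSize)
        nbuckets = (int) (MaxAllocSize / sizeof(HashJoinTuple));

    /* MemoryContextAllocHuge() lifts the 1GB limit if growth needs it */
    hashtable->buckets = (HashJoinTuple *)
        MemoryContextAllocHuge(hashtable->batchCxt,
                               (Size) nbuckets * sizeof(HashJoinTuple));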

-- 
KaiGai Kohei kai...@kaigai.gr.jp




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Kohei KaiGai
2015-08-19 21:29 GMT+09:00 Simon Riggs si...@2ndquadrant.com:
 On 19 August 2015 at 12:55, Kohei KaiGai kai...@kaigai.gr.jp wrote:

 2015-08-19 20:12 GMT+09:00 Simon Riggs si...@2ndquadrant.com:
  On 12 June 2015 at 00:29, Tomas Vondra tomas.von...@2ndquadrant.com
  wrote:
 
 
  I see two ways to fix this:
 
  (1) enforce the 1GB limit (probably better for back-patching, if that's
  necessary)
 
  (2) make it work with hash tables over 1GB
 
  I'm in favor of (2) if there's a good way to do that. It seems a bit
  stupid not to be able to use fast hash table because there's some
  artificial
  limit. Are there any fundamental reasons not to use the
  MemoryContextAllocHuge fix, proposed by KaiGai-san?
 
 
  If there are no objections, I will apply the patch for 2) to HEAD and
  backpatch to 9.5.
 
 Please don't be rush. :-)


 Please explain what rush you see?

If we have no fail-safe mechanism for when the planner estimates a much
larger number of tuples than is actually needed, a strange estimation will
consume a massive amount of RAM. That is a bad side effect.
My previous patch didn't pay attention to that scenario, so I need to
revise the patch.

Thanks,

 It is not difficult to replace palloc() by palloc_huge(), however, it may
 lead
 another problem once planner gives us a crazy estimation.
 Below is my comment on the another thread.


  Yes, I can read both threads and would apply patches for each problem.

 --
 Simon Riggshttp://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services



-- 
KaiGai Kohei kai...@kaigai.gr.jp




Re: [HACKERS] security labels on databases are bad for dump restore

2015-07-14 Thread Kohei KaiGai
2015-07-15 2:39 GMT+09:00 Ted Toth txt...@gmail.com:
 That's exactly what I'm talking about like I said KaiGais branch was
 never merged into the mainline so I do not believe that it is used at
 all.

It depends on the definition of "integrated".
The PostgreSQL core offers an infrastructure for label-based security
mechanisms, not only SELinux. Also, one extension module that is
usually distributed with PostgreSQL bridges the world of the database and
the world of SELinux (even though not all of the features I initially
designed are implemented yet). I would say it is integrated.

 On Tue, Jul 14, 2015 at 12:28 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jul 14, 2015 at 1:22 PM, Ted Toth txt...@gmail.com wrote:
 I'm sort of new to this so maybe I'm missing something but since the
 sepgsql SELinux userspace object manager was never integrated into
 postgresql (AFAIK KaiGais branch was never merged into the mainline)
 who uses these labels? What use are they?

 See contrib/sepgsql

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



-- 
KaiGai Kohei kai...@kaigai.gr.jp




Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-06-24 Thread Kohei KaiGai
Does it make sense to put the result tuple of the remote join into every
estate->es_epqTupleSet[] slot represented by this ForeignScan if
scanrelid == 0?

That would allow rechecking the qualifier for each LockRows node that intends
to lock a base foreign table underlying the remote join.
ForeignScan->fdw_relids tells us which rtindexes are represented
by this ForeignScan, so the infrastructure side may be able to handle it.
Thanks,


2015-06-24 11:40 GMT+09:00 Etsuro Fujita fujita.ets...@lab.ntt.co.jp:
 Hi,

 While reviewing the foreign join pushdown core patch, I noticed that the
 patch doesn't perform an EvalPlanQual recheck properly.  The example
 that crashes the server will be shown below (it uses the postgres_fdw
 patch [1]).  I think the reason for that is because the ForeignScan node
 performing the foreign join remotely has scanrelid = 0 while
 ExecScanFetch assumes that its scan node has scanrelid > 0.

 I think this is a bug.  I've not figured out how to fix this yet, but I
 thought we would also need another plan that evaluates the join locally
 for the test tuples for EvalPlanQual.  Maybe I'm missing something, though.

 Create an environment:

 postgres=# create table tab (a int, b int);
 CREATE TABLE
 postgres=# create foreign table foo (a int) server myserver options
 (table_name 'foo');
 CREATE FOREIGN TABLE
 postgres=# create foreign table bar (a int) server myserver options
 (table_name 'bar');
 CREATE FOREIGN TABLE
 postgres=# insert into tab values (1, 1);
 INSERT 0 1
 postgres=# insert into foo values (1);
 INSERT 0 1
 postgres=# insert into bar values (1);
 INSERT 0 1
 postgres=# analyze tab;
 ANALYZE
 postgres=# analyze foo;
 ANALYZE
 postgres=# analyze bar;
 ANALYZE

 Run the example:

 [Terminal 1]
 postgres=# begin;
 BEGIN
 postgres=# update tab set b = b + 1 where a = 1;
 UPDATE 1

 [Terminal 2]
 postgres=# explain verbose select tab.* from tab, foo, bar where tab.a =
 foo.a and foo.a = bar.a for update;

  QUERY PLAN

 
 
  LockRows  (cost=100.00..101.18 rows=4 width=70)
Output: tab.a, tab.b, tab.ctid, foo.*, bar.*
-  Nested Loop  (cost=100.00..101.14 rows=4 width=70)
  Output: tab.a, tab.b, tab.ctid, foo.*, bar.*
  Join Filter: (foo.a = tab.a)
  -  Seq Scan on public.tab  (cost=0.00..1.01 rows=1 width=14)
Output: tab.a, tab.b, tab.ctid
  -  Foreign Scan  (cost=100.00..100.08 rows=4 width=64)
Output: foo.*, foo.a, bar.*, bar.a
Relations: (public.foo) INNER JOIN (public.bar)
Remote SQL: SELECT l.a1, l.a2, r.a1, r.a2 FROM (SELECT
 ROW(l.a9), l.a9 FROM (SELECT a a9 FROM public.foo FOR UPDATE) l) l (a1,
 a2) INNER
 JOIN (SELECT ROW(r.a9), r.a9 FROM (SELECT a a9 FROM public.bar FOR
 UPDATE) r) r (a1, a2) ON ((l.a2 = r.a2))
 (11 rows)

 postgres=# select tab.* from tab, foo, bar where tab.a = foo.a and foo.a
 = bar.a for update;

 [Terminal 1]
 postgres=# commit;
 COMMIT

 [Terminal 2]
 (After the commit in Terminal 1, Terminal 2 will show the following.)
 server closed the connection unexpectedly
 This probably means the server terminated abnormally
 before or while processing the request.
 The connection to the server was lost. Attempting reset: Failed.
 !

 Best regards,
 Etsuro Fujita

 [1]
 http://www.postgresql.org/message-id/CAEZqfEe9KGy=1_wagh2rgzpg0o4pqgd+iauyaj8wtze+cyj...@mail.gmail.com





-- 
KaiGai Kohei kai...@kaigai.gr.jp




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-06-11 Thread Kohei KaiGai
2015-06-11 23:20 GMT+09:00 Jan Wieck j...@wi3ck.info:
 On 06/11/2015 09:53 AM, Kouhei Kaigai wrote:

 curious: what was work_mem set to?

 work_mem=48GB

 My machine mounts 256GB physical RAM.


 work_mem can be allocated several times per backend. Nodes like sort and
 hash_aggregate may each allocate that much. You should set work_mem to a
 fraction of physical-RAM / concurrent-connections depending on the
 complexity of your queries. 48GB does not sound reasonable.

A small max_connections together with a large work_mem is a usual
configuration for typical OLAP workloads.

Even if the configuration is not reasonable, this is not the right error message;
people cannot tell from it how to fix the problem.

psql:query21.sql:50: ERROR:  invalid memory alloc request size 1073741824




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-06-11 Thread Kohei KaiGai
2015-06-11 23:33 GMT+09:00 Tomas Vondra tomas.von...@2ndquadrant.com:
 Hi,

 On 06/11/15 16:20, Jan Wieck wrote:

 On 06/11/2015 09:53 AM, Kouhei Kaigai wrote:

 curious: what was work_mem set to?

 work_mem=48GB

 My machine mounts 256GB physical RAM.


 work_mem can be allocated several times per backend. Nodes like sort
 and hash_aggregate may each allocate that much. You should set
 work_mem to a fraction of physical-RAM / concurrent-connections
 depending on the complexity of your queries. 48GB does not sound
 reasonable.


 That's true, but there are cases where values like this may be useful (e.g.
 for a particular query). We do allow such work_mem values, so I consider
 this failure to be a bug.

 It probably existed in the past, but was amplified by the hash join
 improvements I did for 9.5, because that uses NTUP_PER_BUCKET=1 instead of
 NTUP_PER_BUCKET=10. So the arrays of buckets are much larger, and we also
 much more memory than we had in the past.

 Interestingly, the hash code checks for INT_MAX overflows on a number of
 places, but does not check for this ...

Which number should be changed in this case?

Indeed, nbuckets is declared as int, so INT_MAX is the hard limit on the
number of hash buckets. However, an extreme configuration like this one can
easily hit that restriction.

Should we make nbuckets a long int?

-- 
KaiGai Kohei kai...@kaigai.gr.jp




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-06-11 Thread Kohei KaiGai
2015-06-11 23:28 GMT+09:00 Robert Haas robertmh...@gmail.com:
 On Wed, Jun 10, 2015 at 10:57 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 The attached patch replaces this palloc0() by MemoryContextAllocHuge() + 
 memset().
 Indeed, this hash table is constructed towards the relation with 
 nrows=119994544,
 so, it is not strange even if hash-slot itself is larger than 1GB.

 You forgot to attach the patch, I think.

Oops, I indeed forgot to attach it.

  It looks to me like the size
 of a HashJoinTuple is going to be 16 bytes, so 1GB/16 = ~64 million.
 That's a lot of buckets, but maybe not unreasonably many if you've got
 enough memory.

EXPLAIN says this Hash node is fed by an underlying SeqScan with
119994544 (~119 million) rows, but the hash table is still much smaller than my
work_mem setting.
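
For reference, the change described above boils down to something like the
following sketch (not the attached patch itself; the surrounding hash-table
code is simplified):

    /* Allocate the bucket array with the "huge" allocator so that it may
     * exceed the 1GB MaxAllocSize limit enforced by palloc0(), then clear
     * it explicitly because the huge variant does not zero the memory. */
    hashtable->buckets = (HashJoinTuple *)
        MemoryContextAllocHuge(hashtable->batchCxt,
                               (Size) nbuckets * sizeof(HashJoinTuple));
    memset(hashtable->buckets, 0, (Size) nbuckets * sizeof(HashJoinTuple));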

-- 
KaiGai Kohei kai...@kaigai.gr.jp


hashslot-allocation-by-huge-alloc.patch
Description: Binary data



Re: [HACKERS] trust authentication behavior

2015-05-18 Thread Kohei KaiGai
2015-05-18 15:15 GMT+09:00 Denis Kirjanov k...@itsirius.su:


 - Ursprüngliche Mail -
 Von: Kohei KaiGai kai...@kaigai.gr.jp
 An: Robert Haas robertmh...@gmail.com
 CC: David G. Johnston david.g.johns...@gmail.com, Denis Kirjanov 
 k...@itsirius.su, pgsql-hackers@postgresql.org, Kohei KaiGai 
 kai...@ak.jp.nec.com
 Gesendet: Samstag, 16. Mai 2015 03:32:55
 Betreff: Re: [HACKERS] trust authentication behavior

 2015-05-16 5:13 GMT+09:00 Robert Haas robertmh...@gmail.com:
 On Thu, May 14, 2015 at 3:52 PM, David G. Johnston
 david.g.johns...@gmail.com wrote:
 On Thu, May 14, 2015 at 12:22 PM, Denis Kirjanov k...@itsirius.su wrote:

 Yeah, but the idea is to do that without the pg_hba.conf

 You may want to try describing the problem and not just ask if the chosen
 solution is possible - of which I am doubtful but I have never used selinux
 or studied it in any depth.  pg_hba.conf is the chosen tool for this kind of
 thing so pointing out why it cannot be used is a much more useful first
 step.

 In mandatory access control systems like SE-Linux, the system security
 policy is supposed to centralize all security decisions, and it should
 be possible to enforce any necessary access control rule by modifying
 that policy.  At least that's my understanding.  sepgsql lets the
 kernel's mandatory access control policies filter down into access
 control decisions that PostgreSQL makes.  sepgsql consults the
 operating system policy when faced with an access control decision of
 a type that it supports, and accepts or rejects the connect based on
 that.

 So the question is whether the sepgsql integration points include
 anything that can block a connection, rather than, say, allowing the
 connection but blocking access to particular tables.  Looking at the
 code, it appears that it vaguely contemplates a db_database:{access}
 permission, which sounds like about the right thing, and it's also
 mentioned at 
 https://wiki.postgresql.org/wiki/SEPostgreSQL/Permissions#Connection
 as maybe being the right thing, but I can't find anyplace that it is
 actually enforce.  That's rather disappointing...

 KaiGai, any thoughts?

 I'd like to understand what Denis Kirjanov actually wants to do
 first of all.

 If he wants to control accesses whole of the PostgreSQL instances
 according to the credential on operating system, it is a configuration
 on operating system side.
 He can write up self security policy module that allows someone_t
 domain to connect PostgreSQL instance, not per database basis.
 It is permission around the pair of someone_t and postgresql_t, than
 can be wrapped up by postgresql_stream_connect() in selinux's policy.

 If he wants to control accesses per database basis based on selinux
 policy, it is a right choice to consider sepgsql module.
 However, some of permissions are not still implemented, like
 db_database:{access} because of priority of permissions (and I had to
 focus on GPU acceleration infrastructure in v9.5 cycle...).

 If he wants to control accesses based on the credential of operating
 system, not limited to selinux, IDENT method is available, isn't it?

 Also, he may additionally needs labeled networking configuration,
 if he wants to carry security label information over the TCP/IP
 connection. It is a point to be considered for his requirement.

 Thanks,
 --
 KaiGai Kohei kai...@kaigai.gr.jp
 - Ursprüngliche Mail -


 The idea is to put all the security decisions into the selinux policy.

Yep, that is the fundamental idea of the sepgsql module.

 As I wrote before it concerns the security label command behavior(done 
 through the policy) and
 the incoming connections (for now it's incoming  remove trust-based)

Are you concerned about the client's security label when the connection
comes in over a TCP/IP network? Or is there some other concern?

If you want to get the client's label over a TCP/IP network, you need
additional configuration of peer-to-peer IPsec to exchange security
labels over the network.
Alternatively, SELinux also offers a mechanism to assign the client's
security label based on the source IP address.

If you have another concern, please describe it.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp




Re: [HACKERS] trust authentication behavior

2015-05-15 Thread Kohei KaiGai
2015-05-16 5:13 GMT+09:00 Robert Haas robertmh...@gmail.com:
 On Thu, May 14, 2015 at 3:52 PM, David G. Johnston
 david.g.johns...@gmail.com wrote:
 On Thu, May 14, 2015 at 12:22 PM, Denis Kirjanov k...@itsirius.su wrote:

 Yeah, but the idea is to do that without the pg_hba.conf

 You may want to try describing the problem and not just ask if the chosen
 solution is possible - of which I am doubtful but I have never used selinux
 or studied it in any depth.  pg_hba.conf is the chosen tool for this kind of
 thing so pointing out why it cannot be used is a much more useful first
 step.

 In mandatory access control systems like SE-Linux, the system security
 policy is supposed to centralize all security decisions, and it should
 be possible to enforce any necessary access control rule by modifying
 that policy.  At least that's my understanding.  sepgsql lets the
 kernel's mandatory access control policies filter down into access
 control decisions that PostgreSQL makes.  sepgsql consults the
 operating system policy when faced with an access control decision of
 a type that it supports, and accepts or rejects the connect based on
 that.

 So the question is whether the sepgsql integration points include
 anything that can block a connection, rather than, say, allowing the
 connection but blocking access to particular tables.  Looking at the
 code, it appears that it vaguely contemplates a db_database:{access}
 permission, which sounds like about the right thing, and it's also
 mentioned at 
 https://wiki.postgresql.org/wiki/SEPostgreSQL/Permissions#Connection
 as maybe being the right thing, but I can't find anyplace that it is
 actually enforce.  That's rather disappointing...

 KaiGai, any thoughts?

First of all, I'd like to understand what Denis Kirjanov actually wants to do.

If he wants to control access to the PostgreSQL instance as a whole
according to credentials on the operating system, that is a configuration
on the operating system side.
He can write his own security policy module that allows the someone_t
domain to connect to the PostgreSQL instance, not on a per-database basis.
It is a permission on the pair of someone_t and postgresql_t, which
can be wrapped up by postgresql_stream_connect() in SELinux's policy.

If he wants to control access on a per-database basis according to the SELinux
policy, considering the sepgsql module is the right choice.
However, some permissions are still not implemented, like
db_database:{access}, because of the priority of permissions (and I had to
focus on the GPU acceleration infrastructure in the v9.5 cycle...).

If he wants to control access based on operating-system credentials,
not limited to SELinux, the IDENT method is available, isn't it?

Also, he may additionally need a labeled networking configuration
if he wants to carry security label information over the TCP/IP
connection. That is a point to be considered for his requirement.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp




Re: [HACKERS] One question about security label command

2015-05-13 Thread Kohei KaiGai
2015-05-13 21:45 GMT+09:00 Robert Haas robertmh...@gmail.com:
 On Sun, May 10, 2015 at 3:15 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 2015-05-01 9:52 GMT+09:00 Kohei KaiGai kai...@kaigai.gr.jp:
 2015-05-01 7:40 GMT+09:00 Alvaro Herrera alvhe...@2ndquadrant.com:
 Kouhei Kaigai wrote:
  * Tom Lane (t...@sss.pgh.pa.us) wrote:
   The idea of making the regression test entirely independent of the
   system's policy would presumably solve this problem, so I'd kind of
   like to see progress on that front.
 
  Apologies, I guess it wasn't clear, but that's what I was intending to
  advocate.
 
 OK, I'll try to design a new regression test policy that is independent
 from the system's policy assumption, like unconfined domain.

 Please give me time for this work.

 Any progress here?

 Not done.
 The last version I rebuild had a trouble on user/role transition from
 unconfined_u/unconfined_r to the self defined user/role...
 So, I'm trying to keep the user/role field (that is not redefined for
 several years) but to define self domain/types (that have been
 redefined multiple times) for the regression test at this moment.

 The second approach above works.
 I defined a own privileged domain (sepgsql_regtest_superuser_t)
 instead of system's unconfined_t domain.
 The reason why regression test gets failed was, definition of
 unconfined_t in the system default policy was changed to bypass
 multi-category rules; which our regression test depends on.
 So, the new sepgsql_regtest_superuser_t domain performs almost
 like as unconfined_t, but restricted by multi-category policy as
 traditional unconfined_t did.
 It is self defined domain, so will not affected by system policy
 change.
 Even though the sepgsql-regtest.te still uses unconfined_u and
 unconfined_r pair for selinux-user and role, it requires users to
 define additional selinux-user by hand if we try to define own one.
 In addition, its definition has not been changed for several years.
 So, I thought it has less risk to rely on unconfined_u/unconfined_r
 field unlike unconfined_t domain.

 Can you add this to the next CommitFest?

OK, done

https://commitfest.postgresql.org/5/249/

-- 
KaiGai Kohei kai...@kaigai.gr.jp




Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-05-10 Thread Kohei KaiGai
Tom,

I briefly checked your updates.
Even though it is not described in the commit log, I noticed a
problematic change.

This commit reverts create_plan_recurse() to a static function. That means an
extension cannot build its child plan nodes, even if it wants to add custom
join logic.
Please assume a simple case like below:
  SELECT * FROM t0, t1 WHERE t0.a = t1.x;

An extension adds a custom join path, and its PlanCustomPath method is then
called back to create a plan node once the path gets chosen by the planner.
The role of PlanCustomPath is to construct the plan node of the custom join
itself, and also the plan nodes of the source relations.
If create_plan_recurse() is static, we have no way to initialize the plan nodes
for the t0 and t1 scans, even if the join logic itself is more powerful than the
built-in ones.

It was already discussed upthread, and it was the consensus.
Please revert create_plan_recurse() to non-static, as in the initial commit.
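
To make the point concrete, a custom-join provider needs to do something like
the following in its PlanCustomPath callback (a sketch only; the callback
signature follows the patch under discussion, it assumes create_plan_recurse()
is exported again, and qual/cost handling is omitted):

    static Plan *
    my_plan_custom_join(PlannerInfo *root, RelOptInfo *joinrel,
                        CustomPath *best_path, List *tlist, List *clauses)
    {
        CustomScan *cscan = makeNode(CustomScan);
        ListCell   *lc;

        cscan->scan.plan.targetlist = tlist;
        cscan->scan.scanrelid = 0;      /* replaces a join, not a base scan */

        /* Build Plan nodes for the underlying relations (t0 and t1 above);
         * this is the step that is impossible while create_plan_recurse()
         * remains static in createplan.c. */
        foreach(lc, best_path->custom_paths)
        {
            Path   *childpath = (Path *) lfirst(lc);

            cscan->custom_plans = lappend(cscan->custom_plans,
                                          create_plan_recurse(root, childpath));
        }
        return &cscan->scan.plan;
    }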

Also, regarding the *_scan_tlist:

 I don't have time to pursue this idea right now, but I think it would be
 a good change to squeeze into 9.5, just so that we could have some test
 coverage on those parts of this patch.

Do you want just a test purpose module and regression test cases?

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp




Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-05-09 Thread Kohei KaiGai
2015-05-09 11:21 GMT+09:00 Robert Haas robertmh...@gmail.com:
 On Fri, May 8, 2015 at 5:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 ... btw, I just noticed something that had escaped me because it seems so
 obviously wrong that I had not even stopped to consider the possibility
 that the code was doing what it's doing.  To wit, that the planner
 supposes that two foreign tables are potentially remote-joinable if they
 share the same underlying FDW handler function.  Not the same server, and
 not even the same pg_foreign_data_wrapper entry, but the pg_proc entry for
 the handler function.  I think this is fundamentally bogus.  Under what
 circumstances are we not just laying off the need to check same server
 origin onto the FDW?  How is it that the urgent need for the FDW to check
 for that isn't even mentioned in the documentation?

 I think that we'd really be better off insisting on same server (as in
 same pg_foreign_server OID), hence automatically same FDW, and what's
 even more important, same user mapping for any possible query execution
 context.  The possibility that there are some corner cases where some FDWs
 could optimize other scenarios seems to me to be poor return for the bugs
 and security holes that will arise any time typical FDWs forget to check
 this.

 I originally wanted to go quite the other way with this and check for
 join pushdown via handler X any time at least one of the two relations
 involved used handler X, even if the other one used some other handler
 or was a plain table.  In particular, it seems to me quite plausible
 to want to teach an FDW that a certain local table is replicated on a
 remote node, allowing a join between a foreign table and a plain table
 to be pushed down.  This infrastructure can't be used that way anyhow,
 so maybe there's no harm in tightening it up, but I'm wary of
 circumscribing what FDW authors can do.  I think it's better to be
 rather expansive in terms of when we call them and let them return
 without doing anything some of them time than to define the situations
 in which we call them too narrowly and end up ruling out interesting
 use cases.

Probably, joining a foreign table with a locally replicated relation on the
remote side is a relatively minor case. Even if the rough check by sameness of
the foreign server id does not invoke GetForeignJoinPaths, an FDW driver can
still implement arbitrary logic using set_join_pathlist_hook at its own risk,
can't it?

The attached patch changes the logic that checks the joinability of two foreign
relations. As discussed upthread, it checks the foreign server id instead of
the handler function.
build_join_rel() sets fdw_server of the joined RelOptInfo if the inner and
outer foreign relations have the same value, which eventually allows
GetForeignJoinPaths to be invoked in add_paths_to_joinrel().
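
In essence, the new check behaves like this sketch (the fdw_server field name
follows the attached patch, not a stock header):

    /* Two relations may be joined remotely only when both are managed by
     * the same foreign server; otherwise keep the join local. */
    if (OidIsValid(outerrel->fdw_server) &&
        outerrel->fdw_server == innerrel->fdw_server)
        joinrel->fdw_server = outerrel->fdw_server;   /* join may be pushed down */
    else
        joinrel->fdw_server = InvalidOid;             /* different servers */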

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


custom-join-fdw-pushdown-check-by-server.patch
Description: Binary data



Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-05-09 Thread Kohei KaiGai
2015-05-09 8:18 GMT+09:00 Kohei KaiGai kai...@kaigai.gr.jp:
 2015-05-09 2:46 GMT+09:00 Tom Lane t...@sss.pgh.pa.us:
 Kouhei Kaigai kai...@ak.jp.nec.com writes:
 I've been trying to code-review this patch, because the documentation
 seemed several bricks shy of a load, and I find myself entirely confused
 by the fdw_ps_tlist and custom_ps_tlist fields.

 Main-point of your concern is lack of documentation/comments to introduce
 how does the pseudo-scan targetlist works here, isn't it??

 Well, there's a bunch of omissions and outright errors in the docs and
 comments, but this is the main issue that I was uncertain how to fix
 from looking at the patch.

 Also,
 if that is what they're for (ie, to allow the FDW to redefine the scan
 tuple contents) it would likely be better to decouple that feature from
 whether the plan node is for a simple scan or a join.

 In this version, we don't intend FDW/CSP to redefine the contents of
 scan tuples, even though I want off-loads heavy targetlist calculation
 workloads to external computing resources in *the future version*.

 I do not think it's a good idea to introduce such a field now and then
 redefine how it works and what it's for in a future version.  We should
 not be moving the FDW APIs around more than we absolutely have to,
 especially not in ways that wouldn't throw an obvious compile error
 for un-updated code.  Also, the longer we wait to make a change that
 we know we want, the more pain we inflict on FDW authors (simply because
 there will be more of them a year from now than there are today).

 Ah, above my sentence don't intend to reuse the existing field for
 different works in the future version. It's just what I want to support
 in the future version.
 Yep, I see. It is not a good idea to redefine the existing field for
 different purpose silently. It's not my plan.

 The business about
 resjunk columns in that list also seems a bit half baked, or at least
 underdocumented.

 I'll add source code comments to introduce how does it works any when
 does it have resjunk=true. It will be a bit too deep to be introduced
 in the SGML file.

 I don't actually see a reason for resjunk marking in that list at all,
 if what it's for is to define the contents of the scan tuple.  I think we
 should just s/ExecCleanTypeFromTL/ExecTypeFromTL/ in nodeForeignscan and
 nodeCustom, and get rid of the sanity check in create_foreignscan_plan
 (which is pretty pointless anyway, considering the number of other ways
 you could screw up that tlist without it being detected).

 http://www.postgresql.org/message-id/9a28c8860f777e439aa12e8aea7694f8010d7...@bpxm15gp.gisp.nec.co.jp

 Does the introduction in above post make sense?
 The *_ps_tlist is not only used for a basic of scan-tuple descriptor, but
 also used to solve var-node if varno==INDEX_VAR in EXPLAIN command.
 On the other hands, existence of the junk entries (which are referenced in
 external computing resources only) may cause unnecessary projection.
 So, I want to discriminate target-entries for basis of scan-tuple descriptor
 from other ones just for EXPLAIN command.

 I'm also inclined to rename the fields to
 fdw_scan_tlist/custom_scan_tlist, which would better reflect what they do,
 and to change the API of make_foreignscan() to add a parameter
 corresponding to the scan tlist.  It's utterly bizarre and error-prone
 that this patch has added a field that the FDW is supposed to set and
 not changed make_foreignscan to match.

 OK, I'll do the both of changes. The name of ps_tlist is a shorten of
 pseudo-scan target-list. So, fdw_scan_tlist/custom_scan_tlist are
 almost intentional.

The attached patch renames *_ps_tlist to *_scan_tlist according to
the suggestion.
It also adds a few detailed source-code comments around this alternative
scan tlist.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


custom-join-rename-ps_tlist.patch
Description: Binary data



Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-05-09 Thread Kohei KaiGai
2015-05-09 8:32 GMT+09:00 Kohei KaiGai kai...@kaigai.gr.jp:
 2015-05-09 3:51 GMT+09:00 Tom Lane t...@sss.pgh.pa.us:
 Robert Haas robertmh...@gmail.com writes:
 On Fri, May 8, 2015 at 1:46 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 That's nice, but 9.5 feature freeze is only a week away.  I don't have a
 lot of confidence that this stuff is actually in a state where we won't
 regret shipping it in 9.5.

 Yeah.  The POC you were asking for upthread certainly exists and has
 for a while, or I would not have committed this.  But I do not think
 it likely that the  postgres_fdw support will be ready for 9.5.

 Well, we have two alternatives.  I can keep hacking on this and get it
 to a state where it seems credible to me, but we won't have any proof
 that it actually works (though perhaps we could treat any problems
 as bugs that should hopefully get found before 9.5 ships, if a
 postgres_fdw patch shows up in the next few months).  Or we could
 revert the whole thing and bounce it to the 9.6 cycle.  I don't really
 like doing the latter, but I'm pretty uncomfortable with committing to
 published FDW APIs that are (a) as messy as this and (b) practically
 untested.  The odds that something slipped through the cracks are high.

 Aside from the other gripes I raised, I'm exceedingly unhappy with the
 ad-hoc APIs proposed for GetForeignJoinPaths and set_join_pathlist_hook.
 It's okay for internal calls in joinpath.c to look like that, but
 exporting that set of parameters seems like pure folly.  We've changed
 those parameter lists repeatedly (for instance in 9.2 and again in 9.3);
 the odds that they'll need to change again in future approach 100%.

 One way we could reduce the risk of code breakage there is to stuff all
 or most of those parameters into a struct.  This might result in a small
 slowdown for the internal calls, or then again maybe not --- there
 probably aren't many architectures that can pass 10 parameters in
 registers anyway.

 Is it like a following structure definition?

   typedef struct
   {
 PlannerInfo *root;
 RelOptInfo *joinrel;
 RelOptInfo *outerrel;
 RelOptInfo *innerrel;
 List *restrictlist;
 JoinType jointype;
 SpecialJoinInfo *sjinfo;
 SemiAntiJoinFactors *semifactors;
 Relids param_source_rels;
 Relids extra_lateral_rels;
   } SetJoinPathListArgs;

 I agree the idea. It also helps CSP driver implementation where it calls
 next driver that was already chained on its installation.

   if (set_join_pathlist_next)
   set_join_pathlist_next(args);

 is more stable manner than

   if (set_join_pathlist_next)
   set_join_pathlist_next(root,
joinrel,
outerrel,
innerrel,
restrictlist,
jointype,
sjinfo,
semifactors,
param_source_rels,
extra_lateral_rels);

The attached patch newly defines the ExtraJoinPathArgs struct to pack
all the information that needs to be delivered to GetForeignJoinPaths
and set_join_pathlist_hook.

Its definition is as below:
  typedef struct
  {
      PlannerInfo *root;
      RelOptInfo  *joinrel;
      RelOptInfo  *outerrel;
      RelOptInfo  *innerrel;
      List        *restrictlist;
      JoinType     jointype;
      SpecialJoinInfo *sjinfo;
      SemiAntiJoinFactors *semifactors;
      Relids       param_source_rels;
      Relids       extra_lateral_rels;
  } ExtraJoinPathArgs;

Then the hook invocation becomes much simpler, like:

/*
 * 6. Finally, give extensions a chance to manipulate the path list.
 */
if (set_join_pathlist_hook)
set_join_pathlist_hook(jargs);
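
An extension-side hook would then look roughly like this (a sketch;
create_my_custom_join_path() is a hypothetical function of the extension, and
the hook typedef follows the patched definition):

    static set_join_pathlist_hook_type set_join_pathlist_next = NULL;

    static void
    my_join_pathlist_hook(ExtraJoinPathArgs *args)
    {
        /* add an alternative join path for the cases this extension handles */
        if (args->jointype == JOIN_INNER)
            add_path(args->joinrel,
                     (Path *) create_my_custom_join_path(args->root,
                                                         args->joinrel,
                                                         args->outerrel,
                                                         args->innerrel,
                                                         args->restrictlist));

        /* keep any previously installed hook working */
        if (set_join_pathlist_next)
            set_join_pathlist_next(args);
    }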

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


custom-join-argument-by-struct.patch
Description: Binary data



Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-05-08 Thread Kohei KaiGai
2015-05-09 2:46 GMT+09:00 Tom Lane t...@sss.pgh.pa.us:
 Kouhei Kaigai kai...@ak.jp.nec.com writes:
 I've been trying to code-review this patch, because the documentation
 seemed several bricks shy of a load, and I find myself entirely confused
 by the fdw_ps_tlist and custom_ps_tlist fields.

 Main-point of your concern is lack of documentation/comments to introduce
 how does the pseudo-scan targetlist works here, isn't it??

 Well, there's a bunch of omissions and outright errors in the docs and
 comments, but this is the main issue that I was uncertain how to fix
 from looking at the patch.

 Also,
 if that is what they're for (ie, to allow the FDW to redefine the scan
 tuple contents) it would likely be better to decouple that feature from
 whether the plan node is for a simple scan or a join.

 In this version, we don't intend FDW/CSP to redefine the contents of
 scan tuples, even though I want off-loads heavy targetlist calculation
 workloads to external computing resources in *the future version*.

 I do not think it's a good idea to introduce such a field now and then
 redefine how it works and what it's for in a future version.  We should
 not be moving the FDW APIs around more than we absolutely have to,
 especially not in ways that wouldn't throw an obvious compile error
 for un-updated code.  Also, the longer we wait to make a change that
 we know we want, the more pain we inflict on FDW authors (simply because
 there will be more of them a year from now than there are today).

Ah, my sentence above doesn't mean I intend to reuse the existing field for
different work in a future version; it is just something I want to support
in a future version.
Yep, I see. It is not a good idea to silently redefine an existing field for
a different purpose. That is not my plan.

 The business about
 resjunk columns in that list also seems a bit half baked, or at least
 underdocumented.

 I'll add source code comments to introduce how does it works any when
 does it have resjunk=true. It will be a bit too deep to be introduced
 in the SGML file.

 I don't actually see a reason for resjunk marking in that list at all,
 if what it's for is to define the contents of the scan tuple.  I think we
 should just s/ExecCleanTypeFromTL/ExecTypeFromTL/ in nodeForeignscan and
 nodeCustom, and get rid of the sanity check in create_foreignscan_plan
 (which is pretty pointless anyway, considering the number of other ways
 you could screw up that tlist without it being detected).

http://www.postgresql.org/message-id/9a28c8860f777e439aa12e8aea7694f8010d7...@bpxm15gp.gisp.nec.co.jp

Does the introduction in the above post make sense?
The *_ps_tlist is not only used as the basis of the scan-tuple descriptor, but
also used to resolve Var nodes with varno == INDEX_VAR in the EXPLAIN command.
On the other hand, the existence of the junk entries (which are referenced only
by the external computing resource) may cause an unnecessary projection.
So, I want to distinguish the target entries that form the basis of the
scan-tuple descriptor from the ones that exist just for the EXPLAIN command.
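
As a reminder of how the list is consumed on the executor side, a sketch along
the lines of nodeForeignscan.c (variable names are illustrative):

    /* The scan-tuple descriptor is built from the whole fdw_ps_tlist; with
     * ExecCleanTypeFromTL() instead, resjunk entries (useful only for
     * EXPLAIN or external computation) would be dropped from the descriptor. */
    TupleDesc   scan_tupdesc = ExecTypeFromTL(fsplan->fdw_ps_tlist, false);

    ExecAssignScanType(&node->ss, scan_tupdesc);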

 I'm also inclined to rename the fields to
 fdw_scan_tlist/custom_scan_tlist, which would better reflect what they do,
 and to change the API of make_foreignscan() to add a parameter
 corresponding to the scan tlist.  It's utterly bizarre and error-prone
 that this patch has added a field that the FDW is supposed to set and
 not changed make_foreignscan to match.

OK, I'll make both changes. The name ps_tlist was short for
pseudo-scan target list, so fdw_scan_tlist/custom_scan_tlist match the
intention well.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp




Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-05-08 Thread Kohei KaiGai
2015-05-09 3:51 GMT+09:00 Tom Lane t...@sss.pgh.pa.us:
 Robert Haas robertmh...@gmail.com writes:
 On Fri, May 8, 2015 at 1:46 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 That's nice, but 9.5 feature freeze is only a week away.  I don't have a
 lot of confidence that this stuff is actually in a state where we won't
 regret shipping it in 9.5.

 Yeah.  The POC you were asking for upthread certainly exists and has
 for a while, or I would not have committed this.  But I do not think
 it likely that the  postgres_fdw support will be ready for 9.5.

 Well, we have two alternatives.  I can keep hacking on this and get it
 to a state where it seems credible to me, but we won't have any proof
 that it actually works (though perhaps we could treat any problems
 as bugs that should hopefully get found before 9.5 ships, if a
 postgres_fdw patch shows up in the next few months).  Or we could
 revert the whole thing and bounce it to the 9.6 cycle.  I don't really
 like doing the latter, but I'm pretty uncomfortable with committing to
 published FDW APIs that are (a) as messy as this and (b) practically
 untested.  The odds that something slipped through the cracks are high.

 Aside from the other gripes I raised, I'm exceedingly unhappy with the
 ad-hoc APIs proposed for GetForeignJoinPaths and set_join_pathlist_hook.
 It's okay for internal calls in joinpath.c to look like that, but
 exporting that set of parameters seems like pure folly.  We've changed
 those parameter lists repeatedly (for instance in 9.2 and again in 9.3);
 the odds that they'll need to change again in future approach 100%.

 One way we could reduce the risk of code breakage there is to stuff all
 or most of those parameters into a struct.  This might result in a small
 slowdown for the internal calls, or then again maybe not --- there
 probably aren't many architectures that can pass 10 parameters in
 registers anyway.

Is it like the following structure definition?

  typedef struct
  {
PlannerInfo *root;
RelOptInfo *joinrel;
RelOptInfo *outerrel;
RelOptInfo *innerrel;
List *restrictlist;
JoinType jointype;
SpecialJoinInfo *sjinfo;
SemiAntiJoinFactors *semifactors;
Relids param_source_rels;
Relids extra_lateral_rels;
  } SetJoinPathListArgs;

I agree with the idea. It also helps a CSP driver implementation that calls
the next hook which was already chained at its installation time.

  if (set_join_pathlist_next)
  set_join_pathlist_next(args);

is a more stable manner than

  if (set_join_pathlist_next)
  set_join_pathlist_next(root,
   joinrel,
   outerrel,
   innerrel,
   restrictlist,
   jointype,
   sjinfo,
   semifactors,
   param_source_rels,
   extra_lateral_rels);

Thanks,
--
KaiGai Kohei kai...@kaigai.gr.jp




Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-05-08 Thread Kohei KaiGai
2015-05-09 6:48 GMT+09:00 Tom Lane t...@sss.pgh.pa.us:
 ... btw, I just noticed something that had escaped me because it seems so
 obviously wrong that I had not even stopped to consider the possibility
 that the code was doing what it's doing.  To wit, that the planner
 supposes that two foreign tables are potentially remote-joinable if they
 share the same underlying FDW handler function.  Not the same server, and
 not even the same pg_foreign_data_wrapper entry, but the pg_proc entry for
 the handler function.  I think this is fundamentally bogus.  Under what
 circumstances are we not just laying off the need to check same server
 origin onto the FDW?  How is it that the urgent need for the FDW to check
 for that isn't even mentioned in the documentation?

Indeed, comparison of fdw_handler may cause unexpected behavior.
I agree it needs to be fixed.

 I think that we'd really be better off insisting on same server (as in
 same pg_foreign_server OID), hence automatically same FDW, and what's
 even more important, same user mapping for any possible query execution
 context.  The possibility that there are some corner cases where some FDWs
 could optimize other scenarios seems to me to be poor return for the bugs
 and security holes that will arise any time typical FDWs forget to check
 this.

The former version of the foreign/custom-join patch did check for joinable
relations using the FDW server id; however, it was changed to the current form
because that may offer additional optimization opportunities - in cases where
multiple foreign servers share the same remote host, access credentials and so on...
Also, I understand your concern about potential security holes caused by oversight.
It is a matter of weighing the two; however, it seems to me the benefit from
the potential optimization cases does not outweigh the security-hole risk.
So, I'll make a patch to change the logic that checks which foreign tables are
joinable.

Thanks,
--
KaiGai Kohei kai...@kaigai.gr.jp




Re: [HACKERS] feature freeze and beta schedule

2015-05-02 Thread Kohei KaiGai
2015-05-02 1:37 GMT+09:00 Andres Freund and...@anarazel.de:
 * ctidscan as an example of custom-scan
   Hasn't really gotten sufficient review.
   = Move

I have to agree.

 * Join pushdown support for foreign tables
   Hasn't gotten particularly much review yet. And it doesn't look
   entirely ready to me on a very quick scan. But I don't know that much
   about the area.
   = Move?

I think its overall design has been discussed and people reached a
consensus. Even though some documentation / comment work
is still required, that is not a fundamental design change, is it?
I expect the join pushdown of postgres_fdw will reach a committable
state within two weeks.

Thanks,
--
KaiGai Kohei kai...@kaigai.gr.jp




Re: [HACKERS] One question about security label command

2015-04-30 Thread Kohei KaiGai
2015-05-01 7:40 GMT+09:00 Alvaro Herrera alvhe...@2ndquadrant.com:
 Kouhei Kaigai wrote:
  * Tom Lane (t...@sss.pgh.pa.us) wrote:
   The idea of making the regression test entirely independent of the
   system's policy would presumably solve this problem, so I'd kind of
   like to see progress on that front.
 
  Apologies, I guess it wasn't clear, but that's what I was intending to
  advocate.
 
 OK, I'll try to design a new regression test policy that is independent
 from the system's policy assumption, like unconfined domain.

 Please give me time for this work.

 Any progress here?

Not done yet.
The last version I rebuilt had trouble with the user/role transition from
unconfined_u/unconfined_r to the self-defined user/role...
So, at this moment I'm trying to keep the user/role fields (which have not been
redefined for several years) but to define our own domains/types (which have
been redefined multiple times) for the regression test.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp




Re: [HACKERS] One question about security label command

2015-03-11 Thread Kohei KaiGai
2015-03-12 1:27 GMT+09:00 Alvaro Herrera alvhe...@2ndquadrant.com:
 Robert Haas wrote:
 On Tue, Mar 10, 2015 at 6:58 PM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
  ERRCODE_FEATURE_NOT_SUPPORTED is suitable error code here.
  Please see the attached one.

 Committed.  I did not bother back-patching this, but I can do that if
 people think it's important.

 I don't really care myself.

 The sepgsql regression tests don't seem
 to pass for me any more; I wonder if some expected-output changes are
 needed as a result of core changes.
 I'm guessing these tests are not running in an automated fashion anywhere?

 Oops, that's bad.  I vaguely recall asking someone for a buildfarm
 animal running these tests, but I guess that didn't happen.

This regression test failure comes from the base security policy of SELinux.
In recent selinux-policy packages, the unconfined domain was changed
to have literally unrestricted permissions. This test case relies on the
multi-category policy restricting the unconfined domain, so its assumption is
no longer correct.
The attached patch fixes the policy module of the regression test.
However, I also think we should stop relying on the permission set of
pre-defined SELinux domains. Instead of a pre-defined one, sepgsql-regtest.te
ought to define its own domain with an appropriate permission set, independent
of the base selinux-policy version.
Please give me time to investigate.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


sepgsql-fixup-regtest-policy.patch
Description: Binary data



Re: [HACKERS] One question about security label command

2015-03-10 Thread Kohei KaiGai
The attached patch revises the error message raised when a security label
is specified on an unsupported object type.
getObjectTypeDescription() may read better than the catalog OID.

postgres=# SECURITY LABEL FOR selinux ON ROLE kaigai
postgres-#   IS 'system_u:object_r:unlabeled_t:s0';
ERROR:  sepgsql provider does not support labels on role

2015-03-09 23:55 GMT+09:00 Robert Haas robertmh...@gmail.com:
 On Tue, Mar 3, 2015 at 5:01 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 From standpoint of SQL syntax, yep, SECURITY LABEL command support
 the object types below, however, it fully depends on security label
 provider; sepgsql.so in this case.
 At this moment, it supports database, schema, function, tables and
 column are supported by sepgsql. So, it is expected behavior.

 If the core system supports labels on other object types and sepgsql
 does not, it should give a better error for those cases, like:

 ERROR: sepgsql provider does not support labels on roles

 Errors like ERROR:  unsupported object type: 1260 are a good way to
 report a failure that is never expected to happen, but they shouldn't
 be used as user-facing error messages.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company





-- 
KaiGai Kohei kai...@kaigai.gr.jp


security-label-errmsg.patch
Description: Binary data



Re: [HACKERS] One question about security label command

2015-03-10 Thread Kohei KaiGai
ERRCODE_FEATURE_NOT_SUPPORTED is a suitable error code here.
Please see the attached version.
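
In essence, the report in the v2 patch becomes something like this sketch
(object stands for the ObjectAddress passed to sepgsql's relabel callback):

    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("sepgsql provider does not support labels on %s",
                    getObjectTypeDescription(object))));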

Thanks,


2015-03-11 4:34 GMT+09:00 Robert Haas robertmh...@gmail.com:
 On Tue, Mar 10, 2015 at 9:41 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
 And perhaps make it an ereport also, with errcode etc.

 Yeah, definitely.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



-- 
KaiGai Kohei kai...@kaigai.gr.jp


security-label-errmsg.v2.patch
Description: Binary data



Re: [HACKERS] Parallel Seq Scan

2015-02-23 Thread Kohei KaiGai
 Amit and I had a long discussion about this on Friday while in Boston
 together.  I previously argued that the master and the slave should be
 executing the same node, ParallelSeqScan.  However, Amit argued
 persuasively that what the master is doing is really pretty different
 from what the worker is doing, and that they really ought to be
 running two different nodes.  This led us to cast about for a better
 design, and we came up with something that I think will be much
 better.

 The basic idea is to introduce a new node called Funnel.  A Funnel
 node will have a left child but no right child, and its job will be to
 fire up a given number of workers.  Each worker will execute the plan
 which is the left child of the funnel.  The funnel node itself will
 pull tuples from all of those workers, and can also (if there are no
 tuples available from any worker) execute the plan itself.  So a
 parallel sequential scan will look something like this:

 Funnel
 Workers: 4
 -> Partial Heap Scan on xyz

 What this is saying is that each worker is going to scan part of the
 heap for xyz; together, they will scan the whole thing.


What is the best way to determine the number of workers?
A fixed number is one idea. It may also make sense to add a new common field
to the Path node that indicates how large a portion of the node's execution can
be parallelized, or that it cannot run in parallel at all.
Apart from plan time, we may also be able to determine the number at execution
time according to the number of already running workers and the number of CPU cores.
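
Just to illustrate the execution-time idea (a sketch with assumed names and
policy, nothing from an actual patch; sysconf() comes from <unistd.h> and
Min/Max from c.h):

    /* Pick a worker count from the planner's request, an upper cap, and the
     * number of online CPU cores, leaving one core for the leader process. */
    static int
    choose_funnel_workers(int requested, int max_workers)
    {
        int     ncores = (int) sysconf(_SC_NPROCESSORS_ONLN);
        int     nworkers = Min(requested, max_workers);

        return Max(1, Min(nworkers, ncores - 1));
    }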

 The neat thing about this way of separating things out is that we can
 eventually write code to push more stuff down into the funnel.  For
 example, consider this:

 Nested Loop
 -> Seq Scan on foo
 -> Index Scan on bar
 Index Cond: bar.x = foo.x

 Now, if a parallel sequential scan is cheaper than a regular
 sequential scan, we can instead do this:

 Nested Loop
 -> Funnel
 -> Partial Heap Scan on foo
 -> Index Scan on bara
 Index Cond: bar.x = foo.x

 The problem with this is that the nested loop/index scan is happening
 entirely in the master.  But we can have logic that fixes that by
 knowing that a nested loop can be pushed through a funnel, yielding
 this:

 Funnel
 -> Nested Loop
 -> Partial Heap Scan on foo
 -> Index Scan on bar
 Index Cond: bar.x = foo.x

 Now that's pretty neat.  One can also imagine doing this with
 aggregates.  Consider:

I guess the planner enhancement would live around add_paths_to_joinrel().
When any of the underlying join paths support multi-node execution, the new
code would add a Funnel node on top of these join paths. Just my thought.

 HashAggregate
 -> Funnel
 -> Partial Heap Scan on foo
 Filter: x = 1

 Here, we can't just push the HashAggregate through the filter, but
 given infrastructure for we could convert that to something like this:

 HashAggregateFinish
 -> Funnel
 -> HashAggregatePartial
 -> Partial Heap Scan on foo
  Filter: x = 1

 That'd be swell.

 You can see that something like this will also work for breaking off
 an entire plan tree and shoving it down into a worker.  The master
 can't participate in the computation in that case, but it's otherwise
 the same idea.

I believe the entire vision we've discussed in the combining-aggregate-function
thread is the above, although people primarily consider applying this feature
to aggregate push-down across joins.

One key piece of infrastructure may be the capability to define a combining
function for an aggregate. It informs the planner that the given aggregate
supports two-stage execution. In addition to this, we may need a planner
enhancement to inject the partial aggregate node during path construction.

Probably, we have to set a flag to inform the later stage (which constructs the
Agg plan) that the underlying scan/join node performs partial aggregation, so
the final aggregation has to expect state data instead of the usual row-by-row
arguments.

Also, I think a HashJoin with a very large outer relation but a much smaller
inner relation is a good candidate to distribute across multiple nodes.
Even if the multi-node HashJoin has to read the small inner relation N times,
splitting up the very large outer relation will still be a gain.
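
For illustration, a combining function merges two partial transition states
rather than consuming input rows; for a simple int8 sum state it could be as
small as this sketch (a hypothetical extension function, not an existing API):

    PG_FUNCTION_INFO_V1(my_sum_combine);

    Datum
    my_sum_combine(PG_FUNCTION_ARGS)
    {
        int64   state1 = PG_GETARG_INT64(0);    /* partial state from one worker */
        int64   state2 = PG_GETARG_INT64(1);    /* partial state from another */

        PG_RETURN_INT64(state1 + state2);
    }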

Thanks,
--
KaiGai Kohei kai...@kaigai.gr.jp




Re: [HACKERS] Commit fest 2015-12 enters money time

2015-02-16 Thread Kohei KaiGai
2015-02-16 12:25 GMT+09:00 Michael Paquier michael.paqu...@gmail.com:


 On Fri, Feb 13, 2015 at 10:06 AM, Michael Paquier
 michael.paqu...@gmail.com wrote:

 In order to move on to the next CF, I am going to go through all the
 remaining patches and update their status accordingly. And sorry for
 slacking a bit regarding that. If you wish to complain, of course feel
 free to post messages on this thread or on the thread related to the
 patch itself.


 As all the entries have been either moved or updated, CF 2014-12 is closed,
 and CF 2015-02 has been changed to In Progress.

I tried to add my name to the CF entry for the Join pushdown support
for foreign tables patch; however, it was unintentionally marked as
rejected in the last CF, even though it should have been returned with feedback.

https://commitfest.postgresql.org/3/20/

Is it possible to move it to the current CF?
I couldn't find an operation to add a new entry to the current CF.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp




Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-01-10 Thread Kohei KaiGai
2015-01-11 10:40 GMT+09:00 Jim Nasby jim.na...@bluetreble.com:
 On 1/9/15, 8:51 PM, Kohei KaiGai wrote:

 2015-01-10 9:56 GMT+09:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/9/15, 6:54 PM, Jim Nasby wrote:


 On 1/9/15, 6:44 PM, Petr Jelinek wrote:



 Yep, I had a same impression when I looked at the code first time,
 however, it is defined as below. Not a manner of custom-scan itself.

 /*
* ==
* Scan nodes
* ==
*/
 typedef struct Scan
 {
   Planplan;
   Index   scanrelid;  /* relid is index into the range
 table
 */
 } Scan;


 Yeah there are actually several places in the code where relid means
 index in range table and not oid of relation, it still manages to
 confuse
 me. Nothing this patch can do about that.



 Well, since it's confused 3 of us now... should we change it (as a
 separate patch)? I'm willing to do that work but don't want to waste
 time if
 it'll just be rejected.

 Any other examples of this I should fix too?



 Sorry, to clarify... any other items besides Scan.scanrelid that I should
 fix?

 This naming is a little bit confusing, however, I don't think it should
 be
 changed because this structure has been used for a long time, so reworking
 will prevent back-patching when we find bugs around scanrelid.


 We can still backpatch; it just requires more work. And how many bugs do we
 actually expect to find around this anyway?

 If folks think this just isn't worth fixing fine, but I find the
 backpatching argument rather dubious.

Even though there is no problem with the Scan structure itself, a bugfix may
reference the variable name scanrelid. If we renamed it in v9.5, we would
also need a small adjustment to apply such a bugfix to prior versions.
That seems to me a waste of committers' time.
-- 
KaiGai Kohei kai...@kaigai.gr.jp




Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-01-09 Thread Kohei KaiGai
2015-01-10 9:56 GMT+09:00 Jim Nasby jim.na...@bluetreble.com:
 On 1/9/15, 6:54 PM, Jim Nasby wrote:

 On 1/9/15, 6:44 PM, Petr Jelinek wrote:


 Yep, I had a same impression when I looked at the code first time,
 however, it is defined as below. Not a manner of custom-scan itself.

 /*
   * ==
   * Scan nodes
   * ==
   */
 typedef struct Scan
 {
  Planplan;
  Index   scanrelid;  /* relid is index into the range table
 */
 } Scan;


 Yeah there are actually several places in the code where relid means
 index in range table and not oid of relation, it still manages to confuse
 me. Nothing this patch can do about that.


 Well, since it's confused 3 of us now... should we change it (as a
 separate patch)? I'm willing to do that work but don't want to waste time if
 it'll just be rejected.

 Any other examples of this I should fix too?


 Sorry, to clarify... any other items besides Scan.scanrelid that I should
 fix?

This naming is a little bit confusing; however, I don't think it should be
changed, because this structure has been used for a long time and reworking it
would complicate back-patching when we find bugs around scanrelid.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp




Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-01-09 Thread Kohei KaiGai
2015-01-10 8:18 GMT+09:00 Jim Nasby jim.na...@bluetreble.com:
 On 1/6/15, 5:43 PM, Kouhei Kaigai wrote:

 scan_relid != InvalidOid
  

 
 Ideally, they should be OidIsValid(scan_relid)
 

 Scan.scanrelid is an index of range-tables list, not an object-id.
 So, InvalidOid or OidIsValid() are not a good choice.


 I think the name needs to change then; scan_relid certainly looks like the
 OID of a relation.

 scan_index?

Yep, I had the same impression when I looked at the code for the first time;
however, it is defined as below. It is not something specific to custom scans.

/*
 * ==
 * Scan nodes
 * ==
 */
typedef struct Scan
{
    Plan        plan;
    Index       scanrelid;      /* relid is index into the range table */
} Scan;

-- 
KaiGai Kohei kai...@kaigai.gr.jp




[HACKERS] port/atomics/arch-*.h are missing from installation

2014-10-02 Thread Kohei KaiGai
I got the following error when I tried to build my extension
against the latest master branch.

Were the port/atomics/*.h files forgotten by make install?

[kaigai@magro pg_strom]$ make
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -fexcess-precision=standard -g -fpic -Wall -O0
-DPGSTROM_DEBUG=1 -I. -I./ -I/usr/local/pgsql/include/server
-I/usr/local/pgsql/include/internal -D_GNU_SOURCE   -c -o shmem.o
shmem.c
In file included from /usr/local/pgsql/include/server/storage/barrier.h:21:0,
 from shmem.c:18:
/usr/local/pgsql/include/server/port/atomics.h:65:36: fatal error:
port/atomics/arch-x86.h: No such file or directory
 # include port/atomics/arch-x86.h
^
compilation terminated.
make: *** [shmem.o] Error 1


Even though the source directory has header files...

[kaigai@magro sepgsql]$ find ./src | grep atomics
./src/include/port/atomics
./src/include/port/atomics/generic-xlc.h
./src/include/port/atomics/arch-x86.h
./src/include/port/atomics/generic-acc.h
./src/include/port/atomics/arch-ppc.h
./src/include/port/atomics/generic.h
./src/include/port/atomics/arch-hppa.h
./src/include/port/atomics/generic-msvc.h
./src/include/port/atomics/arch-ia64.h
./src/include/port/atomics/generic-sunpro.h
./src/include/port/atomics/arch-arm.h
./src/include/port/atomics/generic-gcc.h
./src/include/port/atomics/fallback.h
./src/include/port/atomics.h
./src/backend/port/atomics.c

the install destination has only atomics.h

[kaigai@magro sepgsql]$ find /usr/local/pgsql/include | grep atomics
/usr/local/pgsql/include/server/port/atomics.h

The attached patch is probably the right remedy.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


pgsql-v9.5-fixup-makefile-for-atomics.patch
Description: Binary data



Re: [HACKERS] [v9.5] Custom Plan API

2014-09-29 Thread Kohei KaiGai
2014-09-29 20:26 GMT+09:00 Thom Brown t...@linux.com:
 On 29 September 2014 09:48, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 On Wed, Sep 17, 2014 at 7:40 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
  At this moment, I revised the above portion of the patches.
  create_custom_plan() was modified to call PlanCustomPath callback
  next to the initialization of tlist and clauses.
 
  It's probably same as what you suggested.

 create_custom_plan() is mis-named.  It's actually only applicable to the
 custom-scan case, because it's triggered by create_plan_recurse() getting
 a path node with a T_CustomScan pathtype.  Now, we could change that;
 although in general create_plan_recurse() dispatches on pathtype, we could
 make CustomPath an exception; the top of that function could say if
 (IsA(best_path, CustomPath)) { /* do custom stuff */ }.  But the problem
 with that idea is that, when the custom path is specifically a custom scan,
 rather than a join or some other thing, you want to do all of the same
 processing that's in create_scan_plan().

 So I think what should happen is that create_plan_recurse() should handle
 T_CustomScan the same way it handles T_SeqScan, T_IndexScan, et
 al: by calling create_scan_plan().  The switch inside that function can
 then call a function create_customscan_plan() if it sees T_CustomScan.  And
 that function will be simpler than the
 create_custom_plan() that you have now, and it will be named correctly,
 too.

 Fixed, according to what you suggested. It seems to me that
 create_customscan_plan() became simpler than before.
 Probably it will minimize the amount of special-case handling if CustomScan
 with scanrelid==0 replaces built-in join plans in a future version.

 In ExplainNode(), I think sname should be set to Custom Scan, not 
 Custom.
 And further down, the custom_name should be printed as Custom Plan
 Provider not just Custom.

 Fixed. I added an additional regression test to check EXPLAIN output
 in a non-text format.

 setrefs.c has remaining handling for the scanrelid = 0 case; please remove
 that.

 Sorry, I removed it, and checked the patch again to ensure there are no similar
 portions.

 Thanks for your reviewing.

 pgsql-v9.5-custom-scan.part-2.v11.patch

 +GetSpecialCustomVar(CustomPlanState *node,
 +Var *varnode,
 +PlanState **child_ps);

 This doesn't seem to strictly match the actual function:

 +GetSpecialCustomVar(PlanState *ps, Var *varnode, PlanState **child_ps)

It's more convenient if the first argument is PlanState, because
GetSpecialCustomVar() is called for every suspicious special var-node
that might be managed by a custom-plan provider.
If we have to ensure its first argument is a CustomPlanState on the
caller side, it makes the function's invocation more complicated.
Also, the callback portion is called only when the PlanState is a
CustomPlanState, so it is natural for the callback interface to take a
CustomPlanState argument.
Do we need to match the prototype of the wrapper function with the
callback?
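
To illustrate the point, here is a rough sketch of how the wrapper could
accept a bare PlanState and dispatch to the callback only for custom-plan
nodes. The "methods" field, the callback name and the return type follow
the patch under discussion, not a committed API:

/*
 * A rough sketch only: the wrapper takes a bare PlanState so that callers
 * do not need to know the node type, and it forwards to the provider
 * callback only when the node really is a CustomPlanState.
 */
static Node *
GetSpecialCustomVarSketch(PlanState *ps, Var *varnode, PlanState **child_ps)
{
    CustomPlanState *cps;

    if (!IsA(ps, CustomPlanState))
        return NULL;            /* not managed by a custom-plan provider */

    cps = (CustomPlanState *) ps;
    return cps->methods->GetSpecialCustomVar(cps, varnode, child_ps);
}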

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.5] Custom Plan API

2014-08-30 Thread Kohei KaiGai
2014-08-29 13:33 GMT-04:00 Robert Haas robertmh...@gmail.com:
 On Wed, Aug 27, 2014 at 6:51 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
  I'd like to follow this direction, and start stripping the DDL support.

 ...please make it so.

 The attached patch eliminates DDL support.

 Instead of the new CREATE CUSTOM PLAN PROVIDER statement,
 it adds an internal function; register_custom_scan_provider
 that takes custom plan provider name and callback function
 to add alternative scan path (should have a form of CustomPath)
 during the query planner is finding out the cheapest path to
 scan the target relation.
 Also, documentation stuff is revised according to the latest
 design.
 Any other stuff keeps the previous design.

 Comments:

 1. There seems to be no reason for custom plan nodes to have MultiExec
 support; I think this as an area where extensibility is extremely
 unlikely to work out.  The MultiExec mechanism is really only viable
 between closely-cooperating nodes, like Hash and HashJoin, or
 BitmapIndexScan, BitmapAnd, BitmapOr, and BitmapHeapScan; and arguably
 those things could have been written as a single, more complex node.
 Are we really going to want to support a custom plan that can
 substitute for a Hash or BitmapAnd node?  I really doubt that's very
 useful.

This is intended to allow a particular custom-scan provider to exchange
its internal data when multiple custom-scan nodes are stacked.
So it can be considered a facility to implement closely-cooperating nodes,
both of which are managed by the same custom-scan provider.
An example is a GPU-accelerated version of hash-join that takes an
underlying custom-scan node returning a hash table in a GPU-preferable
data structure, which should not be part of the row-by-row interface.
I believe it is valuable for some use cases, even though I couldn't find
such a use case in the ctidscan example.

 2. This patch is still sort of on the fence about whether we're
 implementing custom plans (of any type) or custom scans (thus, of some
 particular relation).  I previously recommended that we confine
 ourselves initially to the task of adding custom *scans* and leave the
 question of other kinds of custom plan nodes to a future patch.  After
 studying the latest patch, I'm inclined to suggest a slightly revised
 strategy.  This patch is really adding THREE kinds of custom objects:
 CustomPlanState, CustomPlan, and CustomPath. CustomPlanState inherits
 from ScanState, so it is not really a generic CustomPlan, but
 specifically a CustomScan; likewise, CustomPlan inherits from Scan,
 and is therefore a CustomScan, not a CustomPlan.  But CustomPath is
 different: it's just a Path.  Even if we only have the hooks to inject
 CustomPaths that are effectively scans at this point, I think that
 part of the infrastructure could be somewhat generic.  Perhaps
 eventually we have CustomPath which can generate either CustomScan or
 CustomJoin which in turn could generate CustomScanState and
 CustomJoinState.

The suggestion seems reasonable to me. The reason why CustomPlanState
inherits ScanState and CustomPlan inherits Scan is just convenience for
the implementation of extensions. Some useful internal APIs, like
ExecScan(), take a ScanState argument, so it was a better strategy to
choose Scan/ScanState instead of the bare Plan/PlanState.
Anyway, I'd like to follow the perspective that treats CustomScan as one
derivative of CustomPath. It is more flexible.

 For now, I propose that we rename CustomPlan and CustomPlanState to
 CustomScan and CustomScanState, because that's what they are; but that
 we leave CustomPath as-is.  For ease of review, I also suggest
 splitting this into a series of three patches: (1) add support for
 CustomPath; (2) add support for CustomScan and CustomScanState; (3)
 ctidscan.

OK, I'll do that.

 3. Is it really a good idea to invoke custom scan providers for RTEs
 of every type?  It's pretty hard to imagine that a custom scan
 provider can do anything useful with, say, RTE_VALUES.  Maybe an
 accelerated scan of RTE_CTE or RTE_SUBQUERY is practical somehow, but
 even that feels like an awfully big stretch.  At least until clear use
 cases emerge, I'd be inclined to restrict this to RTE_RELATION scans
 where rte-relkind != RELKIND_FOREIGN_TABLE; that is, put the logic in
 set_plain_rel_pathlist() rather than set_rel_pathlist().

I agree. Indeed, it's not easy to imagine a use case for custom logic on
non-plain relations.
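
For illustration, a rough sketch of the restriction being suggested (the
hook type and variable names are illustrative, not an existing API): custom
scan providers get a chance only for plain relations, foreign tables keep
going through the FDW machinery, and other RTE kinds are ignored.

#include "postgres.h"
#include "catalog/pg_class.h"
#include "nodes/parsenodes.h"
#include "nodes/relation.h"

/* illustrative hook type and variable, not an existing API */
typedef void (*add_scan_path_hook_type) (PlannerInfo *root, RelOptInfo *rel,
                                         Index rti, RangeTblEntry *rte);
static add_scan_path_hook_type add_custom_scan_paths_hook = NULL;

static void
maybe_add_custom_scan_paths(PlannerInfo *root, RelOptInfo *rel,
                            Index rti, RangeTblEntry *rte)
{
    /* only plain relations; foreign tables stay with their FDW */
    if (rte->rtekind != RTE_RELATION ||
        rte->relkind == RELKIND_FOREIGN_TABLE)
        return;

    if (add_custom_scan_paths_hook)
        (*add_custom_scan_paths_hook) (root, rel, rti, rte);
}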

 (We might even want to consider whether the hook in
 set_plain_rel_pathlist() ought to be allowed to inject a non-custom
 plan; e.g. substitute a scan of relation B for a scan of relation A.
 For example, imagine that B contains all rows from A that satisfy some
 predicate. This could even be useful for foreign tables; e.g.
 substitute a scan of a local copy of a foreign table for a reference
 to that table.  But I put all of these ideas in parentheses because
 they're only good ideas to the extent that they don't sidetrack 

Re: [HACKERS] [v9.5] Custom Plan API

2014-08-22 Thread Kohei KaiGai
2014-08-23 0:39 GMT+09:00 Robert Haas robertmh...@gmail.com:
 On Thu, Jul 17, 2014 at 3:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Alvaro Herrera alvhe...@2ndquadrant.com writes:
 I haven't followed this at all, but I just skimmed over it and noticed
 the CustomPlanMarkPos thingy; apologies if this has been discussed
 before.  It seems a bit odd to me; why isn't it sufficient to have a
 boolean flag in regular CustomPlan to indicate that it supports
 mark/restore?

 Yeah, I thought that was pretty bogus too, but it's well down the
 list of issues that were there last time I looked at this ...

 I think the threshold question for this incarnation of the patch is
 whether we're happy with new DDL (viz, CREATE CUSTOM PLAN PROVIDER) as
 a way of installing new plan providers into the database.  If we are,
 then I can go ahead and enumerate a long list of things that will need
 to be fixed to make that code acceptable (such as adding pg_dump
 support).  But if we're not, there's no point in spending any time on
 that part of the patch.

Even though I'm the patch author, I agree with this approach.
In fact, the previous custom-plan interface I proposed during the v9.4
development cycle did not include DDL support to register the
custom-plan providers; nevertheless, it worked fine.

One thing that was pointed out to me, and the reason why I implemented
DDL support, is that the intermediation of a C-language function also
loads the extension module implicitly. It is an advantage, but I am
not sure whether it must be supported from the beginning.

 I can see a couple of good reasons to think that this approach might
 be reasonable:

 - In some ways, a custom plan provider (really, at this point, a
 custom scan provider) is very similar to a foreign data wrapper.  To
 the guts of PostgreSQL, an FDW is a sort of black box that knows how
 to scan some data not managed by PostgreSQL.  A custom plan provider
 is similar, except that it scans data that *is* managed by PostgreSQL.

 - There's also some passing resemblance between a custom plan provider
 and an access method.  Access methods provide a set of APIs for fast
 access to data via an index, while custom plan providers provide an
 API for fast access to data via magic that the core system doesn't
 (and need not) understand.  While access methods don't have associated
 SQL syntax, they do include catalog structure, so perhaps this should
 too, by analogy.

 All that having been said, I'm having a hard time mustering up
 enthusiasm for this way of doing things.  As currently constituted,
 the pg_custom_plan_provider catalog contains only a name, a class
 that is always 's' for scan, and a handler function OID.  Quite
 frankly, that's a whole lot of nothing.  If we got rid of the
 pg_catalog structure and just had something like
 RegisterCustomPlanProvider(char *name, void (*)(customScanArg *),
 which could be invoked from _PG_init(), hundreds and hundreds of lines
 of code could go away and we wouldn't lose any actual functionality;
 you'd just list your custom plan providers in shared_preload_libraries
 or local_preload_libraries instead of listing them in a system
 catalog.  In fact, you might even have more functionality, because you
 could load providers into particular sessions rather than system-wide,
 which isn't possible with this design.

Indeed. It's an advantage of the approach without a system catalog.
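
Just to make sure I understand the suggestion, a minimal sketch of what such
a registration might look like. RegisterCustomPlanProvider and customScanArg
are the hypothetical names from your mail, not an existing API, and the
argument contents are only illustrative:

#include "postgres.h"
#include "fmgr.h"
#include "nodes/parsenodes.h"
#include "nodes/relation.h"

PG_MODULE_MAGIC;

typedef struct customScanArg
{
    PlannerInfo    *root;       /* illustrative contents */
    RelOptInfo     *baserel;
    RangeTblEntry  *rte;
} customScanArg;

/* hypothetical registration routine, per the suggestion above */
extern void RegisterCustomPlanProvider(char *name,
                                       void (*fn) (customScanArg *));

static void
ctidscan_add_scan_path(customScanArg *arg)
{
    /* inspect arg->baserel / arg->rte and add CustomPath nodes
     * when this provider can do something useful */
}

void
_PG_init(void)
{
    /* the module would simply be listed in shared_preload_libraries
     * or local_preload_libraries */
    RegisterCustomPlanProvider("ctidscan", ctidscan_add_scan_path);
}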


 I think the underlying issue here really has to do with when custom
 plan providers get invoked - what triggers that?  For foreign data
 wrappers, we have some relations that are plain tables (relkind = 'r')
 and no foreign data wrapper code is invoked.  We have others that are
 flagged as foreign tables (relkind = 'f') and for those we look up the
 matching FDW (via ftserver) and run the code.  Similarly, for an index
 AM, we notice that the relation is an index (relkind = 'r') and then
 consult relam to figure out which index AM we should invoke.  But as
 KaiGai is conceiving this feature, it's quite different.  Rather than
 applying only to particular relations, and being mutually exclusive
 with other options that might apply to those relations, it applies to
 *everything* in the database in addition to whatever other options may
 be present.  The included ctidscan implementation is just as good an
 example as PG-Strom: you inspect the query and see, based on the
 operators present, whether there's any hope of accelerating things.
 In other words, there's no user configuration - and also, not
 irrelevantly, no persistent on-disk state the way you have for an
 index, or even an FDW, which has on disk state to the extent that
 there have to be catalog entries tying a particular FDW to a
 particular table.

Yes, that's my point. In the case of an FDW or index AM, the query planner
can have some expectation of how the relevant executor node will handle
the given relation scan, according to the persistent state.
However, a custom plan is a black box to the query planner, and it
cannot have 

Re: [HACKERS] RLS Design

2014-07-09 Thread Kohei KaiGai
2014-07-09 15:07 GMT+09:00 Stephen Frost sfr...@snowman.net:
 KaiGai,

 * Kohei KaiGai (kai...@kaigai.gr.jp) wrote:
 What I'd like to implement is adjustment of query like:
   SELECT * FROM t1 WHERE (x like '%abc%') AND (quals by built-in RLS)
   AND (quals by extension-1) AND ... AND (quals by extension-N);
 I don't mind even if the qualifiers in the second block are connected in an
 OR'd manner; however, I want the RLS infrastructure to accept additional
 security models provided by extensions.

 Would having a table-level 'AND'-vs-'OR' modifier for the RLS policies
 on that table be sufficient for what you're looking for?  That seems a
 simple enough addition which would still allow more complex groups to be
 developed later on...

Probably, what I'm considering is simpler.
If a table has multiple built-in RLS policies, its expression node will be
represented as a BoolExpr with OR_EXPR and every policy linked to its args
field, won't it? We assume the built-in RLS model merges multiple policies
in an OR manner.
In the case when an extension wants to apply an additional security model
on top of the RLS infrastructure, a straightforward way is to add its own
rules in addition to the built-in rules. If the extension can get control
to modify the above expression node, and the RLS infrastructure works well
on the modified expression node, I think that is sufficient to implement
multiple security models on the RLS infrastructure.
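
A minimal sketch of what I mean, assuming the built-in policy quals are
available as a list and the extension supplies one extra qual; this is not
the actual RLS code, only an illustration of the expression-node shape:

#include "postgres.h"
#include "nodes/makefuncs.h"
#include "nodes/pg_list.h"
#include "nodes/primnodes.h"

/*
 * Merge the built-in policy quals in an OR'd manner, then enforce an
 * extension-provided qual on top of them with AND.
 */
static Node *
build_rls_qual_sketch(List *policy_quals, Node *extension_qual)
{
    Node   *builtin_or;

    /* built-in RLS policies are combined with OR */
    if (list_length(policy_quals) == 1)
        builtin_or = (Node *) linitial(policy_quals);
    else
        builtin_or = (Node *) makeBoolExpr(OR_EXPR, policy_quals, -1);

    /* an additional security model is enforced on top, with AND */
    if (extension_qual == NULL)
        return builtin_or;
    return (Node *) makeBoolExpr(AND_EXPR,
                                 list_make2(builtin_or, extension_qual),
                                 -1);
}

The point is only that the extension appends its own qual on top of whatever
the built-in infrastructure produced, without touching the built-in OR'd
block.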

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RLS Design

2014-07-08 Thread Kohei KaiGai
2014-07-06 14:19 GMT+09:00 Stephen Frost sfr...@snowman.net:
 Kaigai,

 * Kouhei Kaigai (kai...@ak.jp.nec.com) wrote:
  Can you clarify where this is coming from..?  It sounds like you're
  referring to an existing implementation and, if so, it'd be good to get
  more information on how that works exactly.

 Oracle VPD - Multiple Policies for Each Table, View, or Synonym
 http://docs.oracle.com/cd/B19306_01/network.102/b14266/apdvpoli.htm#i1008351

 It says - Note that all policies applied to a table are enforced with AND 
 syntax.

 While I'm not against using this as an example to consider, it's much
 more complex than what we're talking about here- and it supports
 application contexts which allow groups of RLS rights to be applied or
 not applied; essentially it allows both AND and OR for sets of RLS
 policies, along with default policies which are applied no matter
 what.

 Not only Oracle VPD; it also fits the attitude of defense in depth.
 Please assume a system that installs a network firewall, unix permissions
 and selinux. If somebody wants to reference an information asset within
 a file, he has to connect to the server from a network address allowed
 by the firewall configuration AND both DAC and MAC have to allow his
 access.

 These are not independent systems and your argument would apply to our
 GRANT system also, which I hope it's agreed would make it far less
 useful.  Note also that SELinux brings in another complexity- it needs
 to make system calls out to check the access.

 Usually, we have to pass all of the access controls to reference the target
 information, not just one of the access control mechanisms installed.

 This is true in some cases, and not in others.  Only one role you are a
 member of needs to have access to a relation, not all of them.  There
 are other examples of 'OR'-style security policies, this is merely one.
 I'm simply not convinced that it applies in the specific case we're
 talking about.

 In the end, I expect that either way people will be upset because they
 won't be able to specify fully which should be AND vs. which should be
 OR with the kind of flexibility other systems provide.  What I'm trying
 to get to is an initial implementation which is generally useful and is
 able to add such support later.

  This is similar to how roles work- your overall access includes all access
  granted to any roles you are a member of. You don't need SELECT rights 
  granted
  to every role you are a member of to select from the table. Additionally,
  if an admin wants to AND the quals together then they can simply create
  a policy which does that rather than have 2 policies.
 
 It seems to me a pain for database administration if we have to pay
 attention to keeping each RLS policy from conflicting with the others.

 This notion of 'conflict' doesn't make much sense to me.  What is
 'conflicting' here?  Each policy would simply need to stand on its own
 for the role which it's being applied to.  That's very simple and
 straight-forward.

 I expect 90% of RLS policies will be configured for the PUBLIC user, to
 apply the same rules to everybody's access. In this case, the DBA has to
 ensure the target table has no policy, or that the existing policy does
 not conflict with the new policy to be set.
 I don't think it is a good idea to force these checks on the DBA.

 If the DBA only uses PUBLIC then they have to ensure that each policy
 they set up for PUBLIC can stand on its own- though, really, I expect if
 they go that route they'd end up with just one policy that calls a
 stored procedure...

   You are suggesting instead that if application 2 sets up policies on the
  table and then application 1 adds another policy that it should reduce what
  application 2's users can see?  That doesn't make any sense to me.  I'd
  actually expect these applications to at least use different roles anyway,
  which means they could each have a single role specific policy which only
  returns what that application is allowed to see.
 
 I don't think this assumption is reasonable.
 Please consider two applications: app-X, a database security product that
 applies access control based on the remote IP address of the client for
 any table access by any database role; and app-Y, a usual enterprise
 package for daily business data, with its own RLS policy.
 What is the expected behavior in this case?

 That the DBA manage the rights on the tables.  I expect that will be
 required for quite a while with PG.  It's nice to think of these
 application products that will manage all access for users by setting up
 their own policies, but we have yet to even discuss how they would have
 appropriate rights on the table to be able to do so (and to not
 interfere with each other..).

 Let's at least get something which is generally useful in.  I'm all for
 trying to plan out how to get there and would welcome suggestions you
 have which are specific to PG on what we could do here (I'm not keen on
 just trying to mimic another product completely...), but at the level
 

Re: [HACKERS] 9.5 CF1

2014-07-07 Thread Kohei KaiGai
 Custom Plan API
 Shigeru Hanada has said he plans to post a design review soon.

 Any updates? Should this be moved to the next CF?

Now I'm working to revise the patch according to his suggestion;
it will be completed within a couple of days.
A few issues need design-level suggestions from committers.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] No toast table for pg_shseclabel but for pg_seclabel

2014-07-04 Thread Kohei KaiGai
There is no other reason than what Alvaro mentioned upthread.
We intended to store SELinux security labels (less than 100 bytes at most),
so I didn't think it would actually lead to any problem.

On the other hand, pg_seclabel was merged in a different development cycle.
We didn't have a deep discussion about the necessity of a toast table for
pg_seclabel; I added its toast table mechanically.

It by no means implies that we need a toast table for local objects but not
for shared database objects.

Thanks,

2014-07-04 19:11 GMT+09:00 Andres Freund and...@2ndquadrant.com:
 On 2014-07-04 11:50:17 +0200, Andres Freund wrote:
 Hi,

 postgres=# SELECT oid::regclass, reltoastrelid FROM pg_class WHERE relname 
 IN ('pg_seclabel', 'pg_shseclabel');
   oid  | reltoastrelid
 ---+---
  pg_seclabel   |  3598
  pg_shseclabel | 0
 (2 rows)

 Isn't that a somewhat odd choice? Why do we assume that there cannot be
 lengthy seclabels on shared objects? Granted, most shared objects aren't
 candidates for large amounts of data, but both users and databases don't
 seem to fall into that category.

 Hm. It seems they were explicitly removed around
 http://archives.postgresql.org/message-id/1309888389-sup-3853%40alvh.no-ip.org

 I don't understand the reasoning there. There's a toast table for
 non-shared objects. Why would we expect less data for the shared ones,
 even though they're pretty basic objects and more likely to be used to
 store policies and such?

 Greetings,

 Andres Freund

 --
  Andres Freund http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training  Services



-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] NUMA packaging and patch

2014-06-26 Thread Kohei KaiGai
Hello,

Let me comment on this patch.

It can be applied on the head of the master branch, and it builds and
passes the regression test successfully.
What this patch tries to do is quite simple and obvious.
It asks the operating system to distribute physical pages across all
NUMA nodes on allocation.

One thing I am concerned about is that it may conflict with the NUMA
balancing feature supported by recent Linux kernels, which migrates
physical pages according to the location of the tasks that reference
them, across NUMA zones.
# I'm not sure whether it is applied to shared memory regions.
# Please correct me if I misunderstand, but it looks to me as if
# physical pages in shared memory are also moved.
http://events.linuxfoundation.org/sites/events/files/slides/summit2014_riel_chegu_w_0340_automatic_numa_balancing_0.pdf

Probably the interleave policy works well for OLTP workloads.
How about OLAP workloads, if physical pages are migrated
transparently by the operating system to the local node?
In the OLAP case, less concurrency is required, but a query runs
complicated logic (usually including a full scan) on a particular
CPU.

Doesn't it make sense to have a GUC to control the NUMA policy?
In some cases, it makes sense to allocate physical memory
according to the operating system's choice.
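
A minimal sketch of what I have in mind, assuming a boolean setting (it
would actually be a GUC; the name numa_interleave_shared_buffers is only
illustrative) that decides whether the shared memory segment is interleaved
via libnuma or left to the operating system's default policy:

#include <stdbool.h>
#include <stddef.h>
#include <numa.h>

/* would come from a GUC in a real patch; name is illustrative */
static bool numa_interleave_shared_buffers = true;

static void
apply_numa_policy_sketch(void *segment_addr, size_t segment_size)
{
    if (!numa_interleave_shared_buffers)
        return;                 /* let the OS decide (default policy) */
    if (numa_available() < 0)
        return;                 /* no NUMA support on this system */

    /* spread the segment's pages across all NUMA nodes */
    numa_interleave_memory(segment_addr, segment_size, numa_all_nodes_ptr);
}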

Thanks,

2014-06-11 2:34 GMT+09:00 Kevin Grittner kgri...@ymail.com:
 Josh Berkus j...@agliodbs.com wrote:
 On 06/08/2014 03:45 PM, Kevin Grittner wrote:
 By default, the OS cache and buffers are allocated in the memory
 node with the shortest distance from the CPU a process is
 running on.

 Note that this will stop being the default in future Linux kernels.
 However, we'll have to deal with the old ones for some time to come.

 I was not aware of that.  Thanks.  Do you have a URL handy?

 In any event, that is the part of the problem which I think falls
 into the realm of packagers and/or sysadmins; a patch for that
 doesn't seem sensible, given how cpusets are implemented.  I did
 figure we would want to add some documentation around it, though.
 Do you agree that is worthwhile?

 --
 Kevin Grittner
 EDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sepgsql: label regression test failed

2014-05-16 Thread Kohei KaiGai
2014-05-16 16:26 GMT+09:00 Heikki Linnakangas hlinnakan...@vmware.com:
 On 05/14/2014 07:33 AM, Sergey Muraviov wrote:

 I've got this compiler warning:
   relation.c: In function ‘sepgsql_relation_drop’:
 relation.c:472:25: warning: ‘tclass’ may be used uninitialized in this
 function [-Wmaybe-uninitialized]
sepgsql_avc_check_perms(object,
   ^

 KaiGei, could you take a look at this warning, too? It looks like a genuine
 bug to me, but I'm not sure what we should do there instead.

This warning is harmless, because the code path that does not initialize
the tclass variable (the case when the dropped relation is an index) never
reaches the code that references tclass.
It just checks the schema's {remove_name} permission, then jumps to
another code path for indexes and never comes back.

BTW, I could not reproduce this message in my environment with -Wall
(Fedora 20, gcc 4.8.2). Is it a newer compiler's wisdom?

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


sepgsql-fixup-maybe-uninitialized-warnning.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sepgsql: label regression test failed

2014-05-15 Thread Kohei KaiGai
Sorry, I had forgotten about this report.

The failure of the label test comes from a specification change in the MCS
policy. Previously it was applied to all domains, including unconfined_t,
but now it is applied only to domains with the mcsconstrained attribute.

This regression test runs sepgsql_setcon() on the system unconfined_t
domain and checks the behavior when a process intends to move to wider or
narrower ranges of categories, so it was affected by the system policy
change, even though sepgsql itself behaves as we intend.

The attached patch adds the mcsconstrained attribute to the domain used by
this regression test, if the attribute exists, so it will work on both F20
and older systems.

Regarding the regression tests on ddl and alter, it looks to me as if the
hook invocations around recomputeNamespacePath() are gone, because the
schema already allowed to be searched had already been checked.
Was the behavior around recomputeNamespacePath() updated recently?
At least it is not a problem here, since the {search} permission on
regtest_schema_2 is checked in this test scenario.

Thanks,

2014-05-14 13:33 GMT+09:00 Sergey Muraviov sergey.k.murav...@gmail.com:
 Hi.

 Some regression tests for sepgsql still not work on Fedora 20:

 == running regression test queries==
 test label... FAILED
 test dml  ... ok
 test ddl  ... FAILED
 test alter... FAILED
 test misc ... ok

 ==
  3 of 5 tests failed.
 ==

 $ sestatus
 SELinux status: enabled
 SELinuxfs mount:/sys/fs/selinux
 SELinux root directory: /etc/selinux
 Loaded policy name: targeted
 Current mode:   enforcing
 Mode from config file:  enforcing
 Policy MLS status:  enabled
 Policy deny_unknown status: allowed
 Max kernel policy version:  29

 $ uname -i -o -r
 3.14.3-200.fc20.x86_64 x86_64 GNU/Linux

 $ /usr/local/pgsql/bin/postgres --version
 postgres (PostgreSQL) 9.4beta1

 PS
 I've got this compiler warning:
  relation.c: In function ‘sepgsql_relation_drop’:
 relation.c:472:25: warning: ‘tclass’ may be used uninitialized in this
 function [-Wmaybe-uninitialized]
   sepgsql_avc_check_perms(object,
  ^


 2013-12-25 0:34 GMT+04:00 Kohei KaiGai kai...@kaigai.gr.jp:

 Hello,

  It seems to me that changes in the base security policy on Fedora affected
  the regression test. Our test cases for sepgsql_setcon() utilize the MCS
  rules, which prevent domain transitions from narrow categories to wider
  ones, to control the success cases and failure cases.

 However, its coverage was changed. It was applied all the domains in the
 system, thus unconfined_t domain had been enforced by MCS rules.
 But now, it shall be applied only domains with mcs_constrained_type
 attribute.

 [kaigai@vmlinux tmp]$ diff -up old/policy/mcs new/policy/mcs
   :
  snip
   :
  mlsconstrain process { transition dyntransition }
 -   (( h1 dom h2 ) or ( t1 == mcssetcats ));
 +   (( h1 dom h2 ) or ( t1 != mcs_constrained_type ));

 Probably, we need to define a domain by ourselves for regression test to
 ensure
 the test stability, not using the system unconfined domain that has
 different
 meaning by release.

 I'll make a patch. Please wait for a while.

 Thanks for your test  reports.

 2013/12/18 Sergey Muraviov sergey.k.murav...@gmail.com:
  # semodule -l | grep sepgslq
  sepgsql-regtest 1.07
 
  Full list of modules is in attachment.
 
 
  2013/12/18 Kohei KaiGai kai...@kaigai.gr.jp
 
  Could you show me semodule -l on your environment?
  I believe security policy has not been changed between F19 and F20...
 
  Thanks,
 
  2013/12/18 Sergey Muraviov sergey.k.murav...@gmail.com:
   Hi
  
   I've tried to test postgres 9.3.2 and 9.4devel with selinux on Fedora
   20
   and
   met with a label regression test failure.
  
   PS
   I've got some warning during build process.
  
   --
   Best regards,
   Sergey Muraviov
  
  
   --
   Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
   To make changes to your subscription:
   http://www.postgresql.org/mailpref/pgsql-hackers
  
 
 
 
  --
  KaiGai Kohei kai...@kaigai.gr.jp
 
 
 
 
  --
  Best regards,
  Sergey Muraviov



 --
 KaiGai Kohei kai...@kaigai.gr.jp




 --
 Best regards,
 Sergey Muraviov



-- 
KaiGai Kohei kai...@kaigai.gr.jp


sepgsql-fixup-regtest-policy.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-04-28 Thread Kohei KaiGai
 Yeah.  I'm still not exactly convinced that custom-scan will ever allow
 independent development of new plan types (which, with all due respect to
 Robert, is what it was being sold as last year in Ottawa).  But I'm not
 opposed in principle to committing it, if we can find a way to have a cleaner
 API for things like setrefs.c.  It seems like late-stage planner processing
 in general is an issue for this patch (createplan.c and subselect.c are
 also looking messy).  EXPLAIN isn't too great either.

 I'm not sure exactly what to do about those cases, but I wonder whether
 things would get better if we had the equivalent of
 expression_tree_walker/mutator capability for plan nodes.  The state of
 affairs in setrefs and subselect, at least, is a bit reminiscent of the
 bad old days when we had lots of different bespoke code for traversing
 expression trees.

 Hmm. If we have something like expression_tree_walker/mutator for plan
 nodes, we can pass a walker/mutator function pointer instead of exposing
 the static functions that take over the recursive jobs.
 If a custom-plan provider (that has sub-plans) got a callback with a
 walker/mutator pointer, all it has to do for its sub-plans is call this
 new plan-tree walking support routine with the supplied walker/mutator.
 It seems to me a simpler design than what I did.

I tried to write similar walker/mutator functions for the plan-node tree;
however, it was not possible to implement these routines simply enough,
because the jobs of the walker/mutator functions are not uniform, so the
caller side must also have large switch-case branches.

I picked setrefs.c for my investigation.
set_plan_refs() applies fix_scan_list() to the expression trees that appear
in the plan node if it is derived from Scan; however, it also applies
set_join_references() for subclasses of Join, or
set_dummy_tlist_references() for some other plan nodes.
It implies that walker/mutator functions for a Plan node have to apply
different operations according to the type of the Plan node. I'm not
certain how many different forms are needed.
(In addition, set_plan_refs() usually behaves like a walker, but often
behaves as a mutator for a trivial subquery.)

I'm expecting a function like the one below. It allows calling the
plan_walker function for each plan node, and also calling the expr_walker
function for each expression node on the plan node.

bool
plan_tree_walker(Plan *plan,
 bool (*plan_walker) (),
 bool (*expr_walker) (),
 void *context)

I'd like to see whether there is some other form in which to implement
this routine.
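
To make the dispatch problem concrete, here is a heavily simplified sketch.
It is not a complete implementation; only a couple of node types are shown,
and a real version would need many more cases, which is exactly the
non-uniformity I mentioned above.

#include "postgres.h"
#include "nodes/plannodes.h"

static bool
plan_tree_walker_sketch(Plan *plan,
                        bool (*plan_walker) (Plan *plan, void *context),
                        bool (*expr_walker) (Node *node, void *context),
                        void *context)
{
    if (plan == NULL)
        return false;

    /* let the caller look at the plan node itself */
    if (plan_walker(plan, context))
        return true;

    /* expression trees that every Plan node has */
    if (expr_walker((Node *) plan->targetlist, context) ||
        expr_walker((Node *) plan->qual, context))
        return true;

    /* fields specific to particular node types */
    switch (nodeTag(plan))
    {
        case T_SeqScan:
            break;              /* nothing beyond the common fields */
        case T_NestLoop:
            if (expr_walker((Node *) ((Join *) plan)->joinqual, context))
                return true;
            break;
        default:
            break;              /* a real implementation needs many more cases */
    }

    /* recurse into the standard child slots */
    return plan_tree_walker_sketch(plan->lefttree, plan_walker,
                                   expr_walker, context) ||
           plan_tree_walker_sketch(plan->righttree, plan_walker,
                                   expr_walker, context);
}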


One alternative idea to give the custom-plan provider a chance to
handle its subplans is to pass it function pointers (1) to handle
recursion over the plan tree and (2) to set up the backend's internal
state.
In the case of setrefs.c, set_plan_refs() and fix_expr_common()
are the minimum necessity for extensions. It also removes the need
to export static functions.

What do you think?
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Creating tables for columns

2014-03-21 Thread Kohei KaiGai
I have implemented similar code on top of the FDW API:
  https://github.com/kaigai/pg_strom/blob/old_cuda/utilcmds.c#L244

Probably heap_create_with_catalog() is what you are looking for.

2014-03-21 22:57 GMT+09:00 Rajashree Mandaogane rajashree@gmail.com:
 We are working on a project in which we need to create tables for each
 column. So which function should we call in recursion to create the tables?



-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Best way to know frequency of column reference?

2014-03-21 Thread Kohei KaiGai
Hello,

As people may know, I've implemented a relation cache mechanism on top of
the custom-plan interface that holds the contents of particular columns
only; thus it does not need to access storage as long as the user's query
refers only to the columns in the in-memory cache.
The key factor in how well this mechanism works is the selection of the
columns to be cached on the initial read, or re-read, of the related heap.
The simplest idea is just to pick up the columns referenced by the query on
demand, and to reconstruct the cache later if a further query references a
wider set of columns than the previous one; however, that is not a good
strategy.
So I'd like to investigate a way to select the columns to be cached
adaptively. Probably a better idea is column selection according to the
frequency of column references in a particular time slot.
Right now pg_statistic does not record that kind of information, if I
understand correctly. Is there any way to retrieve how many times columns
were referenced? Or do I need to implement my own extension to track it?
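
If nothing existing tracks this, a rough sketch of how an extension could
count it by itself: walk a query's expression trees and bump a counter for
every Var that references the relation of interest. The fixed array keyed
by attribute number is illustrative only; a real extension would keep
counters per relation and per time slot.

#include "postgres.h"
#include "access/htup_details.h"
#include "nodes/nodeFuncs.h"
#include "nodes/primnodes.h"

typedef struct ColRefCounter
{
    Index   scan_varno;         /* range-table index of the target relation */
    long    refcount[MaxHeapAttributeNumber + 1];
} ColRefCounter;

static bool
count_column_refs_walker(Node *node, ColRefCounter *counter)
{
    if (node == NULL)
        return false;
    if (IsA(node, Var))
    {
        Var    *var = (Var *) node;

        /* count only user columns of the relation we care about */
        if (var->varno == counter->scan_varno &&
            var->varattno > 0 &&
            var->varattno <= MaxHeapAttributeNumber)
            counter->refcount[var->varattno]++;
        return false;
    }
    return expression_tree_walker(node, count_column_refs_walker,
                                  (void *) counter);
}

Calling such a walker on the targetlist and quals of incoming queries, for
example from a planner hook, would be one way to accumulate the numbers.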

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: contrib/cache_scan (Re: [HACKERS] What's needed for cache-only table scan?)

2014-03-06 Thread Kohei KaiGai
2014-03-06 18:17 GMT+09:00 Haribabu Kommi kommi.harib...@gmail.com:

 On Tue, Mar 4, 2014 at 3:07 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

   4. + cchunk = ccache_vacuum_tuple(ccache, ccache-root_chunk, ctid);
   + if (pchunk != NULL  pchunk != cchunk)
  
   + ccache_merge_chunk(ccache, pchunk);
  
   + pchunk = cchunk;
  
  
 The merge_chunk is called only when the heap tuples are spread
   across two cache chunks. Actually one cache chunk can accommodate one
   or more than heap pages. it needs some other way of handling.
  
  I adjusted the logic to merge the chunks as follows:
 
  Once a tuple is vacuumed from a chunk, it also checks whether it can be
  merged with its child leafs. A chunk has up to two child leafs; left one
  has less ctid that the parent, and right one has greater ctid. It means
  a chunk without right child in the left sub-tree or a chunk without left
  child in the right sub-tree are neighbor of the chunk being vacuumed. In
  addition, if vacuumed chunk does not have either (or both) of children,
  it can be merged with parent node.
  I modified ccache_vacuum_tuple() to merge chunks during t-tree
  walk-down,
  if vacuumed chunk has enough free space.
 


 Patch looks good.

Thanks for volunteering.

 Regarding merging of the nodes, instead of checking whether merge is
 possible or not for every tuple which is vacuumed,
 can we put some kind of threshold as whenever the node is 50% free then try
 to merge it from leaf nodes until 90% is full.
 The rest of the 10% will be left for the next inserts on the node. what do
 you say?

Hmm, indeed, that makes sense. How about an idea that kicks off chunk
merging only if the expected free space of the merged chunk is less than
50%? If the threshold depends on the (expected) usage of the merged chunk,
it can avoid over-merging.
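
To make the criterion concrete, a rough sketch; the chunk-size constant and
the way usage is obtained are illustrative, not the actual cache_scan code:

#include <stdbool.h>
#include <stddef.h>

#define CCACHE_CHUNK_SIZE   (1024 * 1024)   /* illustrative chunk size */

/*
 * Merge two neighboring chunks only if the merged chunk would fit and its
 * expected free space would be less than 50%, so that merging does not
 * happen too aggressively.
 */
static bool
chunk_merge_is_worthwhile(size_t usage_a, size_t usage_b)
{
    size_t  expected_usage = usage_a + usage_b;

    return (expected_usage <= CCACHE_CHUNK_SIZE &&
            expected_usage >= CCACHE_CHUNK_SIZE / 2);
}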

 I will update you later regarding the performance test results.

Thanks,

Also, I'll rebase the patch on top of the new custom-scan interfaces
according to Tom's suggestion, even though the main logic of cache_scan
is not changed.

Best regards,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-03-04 Thread Kohei KaiGai
2014-03-04 23:09 GMT+09:00 Robert Haas robertmh...@gmail.com:
 On Mon, Mar 3, 2014 at 5:15 PM, Stephen Frost sfr...@snowman.net wrote:
 As I mentioned
 up-thread, I'd really like to see FDW join push-down, FDW aggregate
 push-down, parallel query execution, and parallel remote-FDW execution
 and I don't see this CustomScan approach as the right answer to any of
 those.

 In accordance with the above, what I'd like to see with this patch is
 removal of the postgres_fdw changes and any changes which were for that
 support.  In addition, I'd like to understand why 'ctidscan' makes any
 sense to have as an example of what to use this for- if that's valuable,
 why wouldn't we simply implement that in core?  I do want an example in
 contrib of how to properly use this capability, but I don't think that's
 it.

 I suggested that example to KaiGai at last year's PGCon.  It may
 indeed be something we want to have in core, but right now we don't.

 More generally, I think this discussion is focusing on the wrong set
 of issues.  The threshold issue for this patch is whether there is a
 set of hook points that enable a workable custom-scan functionality,
 and whether KaiGai has correctly identified them.  In other words, I
 think we should be worrying about whether KaiGai's found all of the
 places that need to be modified to support a custom scan, and whether
 the modifications he's made to each of those places are correct and
 adequate.  Whether he's picked the best possible example does not
 strike me as a matter of principal concern, and it's far too late to
 tell him he's got to go pick a different one at this point anyway.

That is definitely the point to be discussed here. Even though I *believe*
I have put in the callbacks needed to implement alternative joins / scans,
it may lead to a different conclusion from another person's viewpoint.

At least, I could implement a custom scan as an alternative to a join
using postgres_fdw; however, it's uncertain whether I have covered
all the possible cases we should care about.
So, I'd like to see comments from others.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-03-04 Thread Kohei KaiGai
2014-03-04 23:10 GMT+09:00 Stephen Frost sfr...@snowman.net:
 The cache_scan module that I and Haribabu are discussing in another
 thread also might be a good demonstration for custom-scan interface,
 however, its code scale is a bit larger than ctidscan.

 That does sound interesting though I'm curious about the specifics...

This module caches some of the columns, but not all; thus it can hold a
much larger number of records in a particular amount of RAM than the
standard buffer cache.
It is constructed on top of the custom-scan node, and also adds a new
hook for a callback on page vacuuming to invalidate its cache entries.
(By the way, I originally designed this module as a demonstration of the
on-vacuum hook, because I had already made the ctidscan and postgres_fdw
enhancements for the custom-scan node.)

  For one thing, an example where you could have this CustomScan node calling
  other nodes underneath would be interesting.  I realize the CTID scan can't
  do that directly but I would think your GPU-based system could; after all,
  if you're running a join or an aggregate with the GPU, the rows could come
  from nearly anything.  Have you considered that, or is the expectation that
  users will just go off and access the heap and/or whatever indexes 
  directly,
  like ctidscan does?  How would such a requirement be handled?
 
 In the case when a custom-scan node has underlying nodes, they shall be
 invoked using ExecProcNode as the built-in nodes do, so it will be able to
 fetch tuples coming from the underlying nodes. Of course, the custom-scan
 provider can present tuples coming from somewhere else as if they came
 from the underlying relation. That is the responsibility of the extension
 module. In some cases, it will be required to return junk system
 attributes, like ctid, for row-level locks or table updating.
 That is also the responsibility of the extension module (or it should not
 add a custom path if the custom-scan provider cannot perform as required).

 Right, tons of work to do to make it all fit together and play nice-
 what I was trying to get at is: has this actually been done?  Is the GPU
 extension that you're talking about as the use-case for this been
 written?

It's a chicken-and-egg problem, because the implementation of the extension
module fully depends on the interface of the backend. Unlike the commitfest,
there is no deadline for my extension module, so I put higher priority on
the submission of the custom-scan node than on the extension.
However, the GPU extension is not purely theoretical. I implemented a
prototype using the FDW APIs, and it could accelerate sequential scans if
the query had sufficiently complicated qualifiers.

See the movie (from 2:45). The table t1 is a regular table, and t2 is a
foreign table. Both of them have the same contents; however, the response
time of the query is much shorter when GPU acceleration is working.
http://www.youtube.com/watch?v=xrUBffs9aJ0
So I'm confident that GPU acceleration will deliver a performance gain once
it can run on regular tables, not only foreign tables.

 How does it handle all of the above?  Or are we going through
 all these gyrations in vain hope that it'll actually all work when
 someone tries to use it for something real?

I'm not talking about anything difficult. If a junk attribute requires
returning the ctid of the tuple, the custom-scan provider reads a tuple of
the underlying relation and then includes the correct item pointer. If the
custom scan is designed to run on the cache, all it needs to do is
reconstruct a tuple with the correct item pointer (thus this cache needs to
hold the ctid as well). That is all I did in the cache_scan module.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-03-04 Thread Kohei KaiGai
2014-03-05 5:52 GMT+09:00 Stephen Frost sfr...@snowman.net:
 * Robert Haas (robertmh...@gmail.com) wrote:
 On Tue, Mar 4, 2014 at 2:34 PM, Stephen Frost sfr...@snowman.net wrote:
  Alright- so do you feel that the simple ctidscan use-case is a
  sufficient justification and example of how this can be generally
  useful that we should be adding these hooks to core..?  I'm willing to
  work through the patch and clean it up this weekend if we agree that
  it's useful and unlikely to immediately be broken by expected changes..

 Yeah, I think it's useful.  But based on Tom's concurrently-posted
 review, I think there's probably a good deal of work left here.

 Yeah, it certainly looks like it.

 KaiGai- will you have time to go over and address Tom's concerns..?

Yes, I need to. Let me work through it during the latter half of this week
and the weekend, so I'd like to submit a revised one by next Monday.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Triggers on foreign tables

2014-03-03 Thread Kohei KaiGai
I checked the latest (v8) patch again, and could not find a problem
in your design change from v7.
As Noah pointed out, it uses a per-query-depth tuplestore that is released
in AfterTriggerEndSubXact.

So, may I mark it as Ready for Committer?

2014-03-03 15:48 GMT+09:00 Ronan Dunklau ronan.dunk...@dalibo.com:
 Hello.

 Did you have time to review the latest version of this patch ? Is there
 anything I can do to get this ready for commiter ?

 Thank you for all the work performed so far.




 Le mardi 4 février 2014 13:16:22 Ronan Dunklau a écrit :
 Le lundi 3 février 2014 23:28:45 Noah Misch a écrit :
  On Sun, Feb 02, 2014 at 11:53:51AM +0100, Ronan Dunklau wrote:
   Le jeudi 30 janvier 2014 14:05:08 Noah Misch a écrit :
On Thu, Jan 23, 2014 at 03:17:35PM +0100, Ronan Dunklau wrote:
 What do you think about this approach ? Is there something I missed
 which
 would make it not sustainable ?
   
Seems basically reasonable.  I foresee multiple advantages from having
one
tuplestore per query level as opposed to one for the entire
transaction.
You would remove the performance trap of backing up the tuplestore by
rescanning. It permits reclaiming memory and disk space in
AfterTriggerEndQuery() rather than at end of transaction.  You could
remove
ate_ptr1 and ate_ptr2 from AfterTriggerEventDataFDW and just store the
flags word: depending on AFTER_TRIGGER_2CTIDS, grab either the next
one
or
the next two tuples from the tuplestore.  Using work_mem per
AfterTriggerBeginQuery() instead of per transaction is no problem.
What
do
you think of that design change?
  
   I agree that this design is better, but I have some objections.
  
   We can remove ate_ptr2 and rely on the AFTER_TRIGGER_2CTIDS flag, but
   the
   rescanning and ate_ptr1 (renamed ate_tupleindex in the attached patch)
   can't go away.
  
   Consider for example the case of a foreign table with more than one
   AFTER
   UPDATE triggers. Unless we store the tuples once for each trigger, we
   will
   have to rescan the tuplestore.
 
  Will we?  Within a given query level, when do (non-deferred) triggers
  execute in an order other than the enqueue order?

 Let me explain what I had in mind.

 Looking at the code in AfterTriggerSaveEvent:

 - we build a template AfterTriggerEvent, and store the tuple(s)
 - for each suitable after trigger that matches the trigger type, as well as
 the WHEN condition if any, a copy of the previously built AfterTriggerEvent
 is queued

 Later, those events are fired in order.

 This means that more than one event can be fired for one tuple.

 Take this example:

 CREATE TRIGGER trig_row_after1
 AFTER UPDATE ON rem2
 FOR EACH ROW
 WHEN (NEW.f1 % 5  3)
 EXECUTE PROCEDURE trigger_func('TRIG1');

 CREATE TRIGGER trig_row_after2
 AFTER UPDATE ON rem2
 FOR EACH ROW
 WHEN (NEW.f1 % 5  4)
 EXECUTE PROCEDURE trigger_func('TRIG2');

 UPDATE rem2 set f2 = 'something';

 Assuming 5 rows with f1 as a serial, the fired AfterTriggerEvent's
 ate_tupleindex will be, in that order:

 0-0-2-2-4-8-8

 So, at least a backward seek is required for trig_row_after2 to be able to
 retrieve a tuple that was already consumed when firing trig_row_after1.

 On a side note, this made me realize that it is better to avoid storing a
 tuple entirely if there is no enabled trigger (the f1 = 4 case above). The
 attached patch does that, so the previous sequence becomes:

 0-0-2-2-4-6-6

 It also prevents from initalizing a tuplestore at all if its not needed.

   To mitigate the effects of this behaviour, I added the option to perform
   a
   reverse_seek when the looked-up tuple is nearer from the current index
   than
   from the start.
 
  If there's still a need to seek within the tuplestore, that should get rid
  of the O(n^2) effect.  I'm hoping that per-query-level tuplestores will
  eliminate the need to seek entirely.

 I think the only case when seeking is still needed is when there are more
 than one after trigger that need to be fired, since the abovementioned
 change prevents from seeking to skip tuples.

If you do pursue that change, make sure the code still does the right
thing
when it drops queued entries during subxact abort.
  
   I don't really understand what should be done at that stage. Since
   triggers on foreign tables are not allowed to be deferred, everything
   should be cleaned up at the end of each query, right ? So, there
   shouldn't be any queued entries.
 
  I suspect that's right.  If you haven't looked over
  AfterTriggerEndSubXact(), please do so and ensure all its actions still
  make sense in the context of this new kind of trigger storage.

 You're right, I missed something here. When aborting a subxact, the
 tuplestores for queries below the subxact query depth should be cleaned, if
 any, because AfterTriggerEndQuery has not been called for the failing query.

 The attached patch fixes that.

 The attached 

Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-03-01 Thread Kohei KaiGai
2014-03-01 22:38 GMT+09:00 Stephen Frost sfr...@snowman.net:
 KaiGai,

 * Kohei KaiGai (kai...@kaigai.gr.jp) wrote:
 BTW, this kind of discussion looks like a talk with a ghost because
 we cannot see the new interface according to the parallel execution
 right now, so we cannot have tangible investigation whether it becomes
 really serious backward incompatibility, or not.

 Yeah, it would certainly be nice if we had all of the answers up-front.
 What I keep hoping for is that someone who has been working on this area
 (eg: Robert) would speak up...

I'd also like to see his opinion.

 My bet is minor one. I cannot imagine plan-node interface that does
 not support existing non-parallel SeqScan or NestLoop and so on.

 Sure you can- because once we change the interface, we're probably going
 to go through and make everything use the new one rather than have to
 special-case things.  That's more-or-less exactly my point here because
 having an external hook like CustomScan would make that kind of
 wholesale change more difficult.

I think we should follow the general rule in the case of custom-scan also.
In other words, it's the extension author's responsibility to keep up with
the latest specification of the interfaces.
For example, I have an extension module that is unable to work on the
latest PG code because of interface changes to ProcessUtility_hook.
Is that a matter of backward incompatibility? Definitely not. It is
my job to follow up.

 That does *not* mean I'm against using GPUs and GPU optimizations.  What
 it means is that I'd rather see that done in core, which would allow us
 to simply change that interface along with the rest when doing wholesale
 changes and not have to worry about backwards compatibility and breaking
 other people's code.

I also have to introduce some background from a previous discussion.
Right now we have two options for GPU programming: CUDA or OpenCL.
Both the libraries and the drivers are provided under proprietary licenses,
so they do not fit the core implementation of PostgreSQL, only extensions
that are installed on the user's own responsibility.
Because of that, I brought up a discussion about pluggable planner/executor
nodes (as a basis for GPU acceleration) at the last developer meeting, and
have since worked on the custom-scan node interface.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-03-01 Thread Kohei KaiGai
2014-03-02 9:51 GMT+09:00 Stephen Frost sfr...@snowman.net:
 KaiGai,

 * Kohei KaiGai (kai...@kaigai.gr.jp) wrote:
 Now we have two options for GPU programming: CUDA or OpenCL.
 Both of libraries and drivers are provided under the proprietary license,
 so it does not fit for the core implementation of PostgreSQL, but
 extensions that shall be installed on user's responsibility.

 Being able to work with libraries which are not BSD licensed doesn't
 change the license under which PostgreSQL code is released.  Nor does it
 require PostgreSQL to be licensed in any different way from how it is
 today.  Where it would get a bit ugly, I agree, is for the packagers who
 have to decide if they want to build against those libraries or not.  We
 might be able to make things a bit easier by having a startup-time
 determination of if these nodes are able to be used or not.  This isn't
 unlike OpenSSL which certainly isn't BSD nor is it even GPL-compatible,
 a situation which causes a great deal of difficulty already due to the
 whole readline nonsense- but it's difficulty for the packagers, not for
 the PostgreSQL project, per se.

As you mentioned, it is a headache for packagers, and it does not make
sense for us if packagers disable the features that require proprietary
drivers. In fact, Fedora / RHEL does not permit distributing software
under non-open-source licenses. Obviously, NVIDIA's CUDA is a library
distributed under a proprietary license, thus out of scope for the Linux
distributors. That also leads them to turn off features that must be
linked with proprietary drivers.
All we can do is implement these features as extensions, then offer users
the option to use them or not.

 Because of the story, I brought up a discussion about pluggable
 planner/executor node (as a basis of GPU acceleration) in the last
 developer meeting, then has worked for custom-scan node interface.

 And I'm still unconvinced of this approach and worry that it's going to
 break more often than it works.  That's my 2c on it, but I won't get in
 the way if someone else wants to step up and support it.  As I mentioned
 up-thread, I'd really like to see FDW join push-down, FDW aggregate
 push-down, parallel query execution, and parallel remote-FDW execution
 and I don't see this CustomScan approach as the right answer to any of
 those.

It's the right approach for FDW functionality enhancement; I have never
opposed it.

What I'd like to implement is GPU acceleration that can run on regular
tables, not only foreign tables. Also, according to the backlog of the
developer meeting, a pluggable executor node is also a feature the PG-XC
folks need to make their project work with upstream.
I think the custom-scan feature is capable of hosting these requirements,
and it does not prevent enhancement of the FDW features.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-03-01 Thread Kohei KaiGai
2014-03-02 10:29 GMT+09:00 Stephen Frost sfr...@snowman.net:
 * Kohei KaiGai (kai...@kaigai.gr.jp) wrote:
 As you mentioned, it is a headache for packagers, and does not make
 sense for us if packager disabled the feature that requires proprietary
 drivers.

 No, I disagree with that.  I don't expect this use-case to be very
 common to begin with and telling individuals that they have to compile
 it themselves is certainly not out of the question.

 In fact, Fedora / RHEL does not admit to distribute software
 under the none open source software license.

 I'm pretty confident you can get RPMs for those distributions.

 Obviously, nvidia's cuda
 is a library being distributed under the proprietary license, thus out of
 the scope for the Linux distributors.

 This also doesn't make any sense to me- certainly the CUDA libraries are
 available under Debian derivatives, along with open-source wrappers for
 them like pycuda.

 It also leads them to turn off the
 feature that shall be linked with proprietary drivers.
 All we can do is to implement these features as extension, then offer
 an option for users to use or not to use.

 No, we can tell individuals who want it that they're going to need to
 build with support for it.  We don't offer OpenSSL as an extension (I
 certainly wish we did- and had a way to replace it w/ GNUTLS or one of
 the better licensed options).

I know there are some alternative ways. However, they require additional
knowledge and setup effort from users, and they also lose the benefit of
using the distributor's Linux if alternative RPMs are required.
I don't want to recommend such a complicated setup procedure to users.

 What I'd like to implement is GPU acceleration that can perform on
 regular tables, not only foreign tables. Also, regarding to the backlog
 in the developer meeting, pluggable executor node is also required
 feature by PG-XC folks to work their project with upstream.
 I think custom-scan feature is capable to host these requirement,
 and does not prevent enhancement FDW features.

 I think you're conflating things again- while it might be possible to
 use CustomScan to implement FDW join-pushdown or FDW aggregate-pushdown,
 *I* don't think it's the right approach.  Regarding the PG-XC
 requirement, I expect they're looking for FDW join/aggregate-pushdown
 and also see that it *could* be done w/ CustomScan.

The reason why I submitted the part-3 patch (which enhances postgres_fdw
for remote join using custom-scan) is that it is an easy way to demonstrate
the replacement of a join by a special scan, with a minimum amount of patch
to be reviewed. If we have another idea to demonstrate it, I don't insist on
the remote-join feature on foreign tables.
Regarding PG-XC, I don't know their exact needs because I didn't
attend the cluster meeting, but someone mentioned pluggable
plan/executor nodes in this context.

 We could punt on the whole thing and drop in hooks which could be used
 to replace everything done from the planner through to the executor and
 then anyone *could* implement any of the above, and parallel query too.
 That doesn't make it the right approach.

That is a problem I pointed out in the last developer meeting. Because we
have no way for an extension to enhance just a part of the planner / executor
logic, an extension has to replace the whole planner / executor using hooks.
It is painful for extension authors.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-03-01 Thread Kohei KaiGai
2014-03-02 10:38 GMT+09:00 Robert Haas robertmh...@gmail.com:
 On Wed, Feb 26, 2014 at 10:23 AM, Stephen Frost sfr...@snowman.net wrote:
 * Kouhei Kaigai (kai...@ak.jp.nec.com) wrote:
 IIUC, his approach was integration of join-pushdown within FDW APIs,
 however, it does not mean the idea of remote-join is rejected.

 For my part, trying to consider doing remote joins *without* going
 through FDWs is just nonsensical.

 That is, of course, true by definition, but I think it's putting the
 focus in the wrong place.  It's possible that there are other cases
 when a scan might a plausible path for a joinrel even if there are no
 foreign tables in play.  For example, you could cache the joinrel
 output and then inject a cache scan as a path for the joinrel.

That might be a better way to demonstrate usage of the custom-scan node
than the (ad hoc) enhancement of postgres_fdw.
As I have discussed in another thread, it is possible to switch a heap
reference to a cache reference on the fly, so that would be a possible use
case for the custom-scan node.

So, I'm inclined to drop the postgres_fdw portion from my submission
to focus on the custom-scan capability.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-02-28 Thread Kohei KaiGai
2014-03-01 0:36 GMT+09:00 Stephen Frost sfr...@snowman.net:
 * Kouhei Kaigai (kai...@ak.jp.nec.com) wrote:
 * Stephen Frost (sfr...@snowman.net) wrote:
  I don't see how you can be when there hasn't been any discussion that I've
  seen about how parallel query execution is going to change things for us.
 
 If parallel query execution changes whole of the structure of plan nodes,
 it will also affect to the interface of custom-scan because it is a thin-
 abstraction of plan-node. However, if parallel execution feature is
 implemented as one of new plan node in addition to existing one, I cannot
 imagine a scenario that affects to the structure of another plan node.

 Let's just say that I have doubts that we'll be able to implement
 parallel execution *without* changing the plan node interface in some
 way which will require, hopefully minor, changes to all of the nodes.
 The issue is that even a minor change would break the custom-scan API
 and we'd immediately be in the boat of dealing with complaints regarding
 backwards compatibility.  Perhaps we can hand-wave that, and we've had
 some success changing hook APIs between major releases, but such changes
 may also be in ways which wouldn't break in obvious ways or even
 possibly be changes which have to be introduced into back-branches.
 Parallel query is going to be brand-new real soon and it's reasonable to
 think we'll need to make bug-fix changes to it after it's out which
 might even involve changes to the API which is developed for it.

Even if we change the plan-node interface in a future release,
it should not be a change that makes the existing stuff impossible.
The custom-scan API is designed to provide an alternative way to scan
or join relations, in addition to the existing logic like SeqScan or
NestLoop. If such a change broke the plan-node interfaces so that the
existing stuff could not be implemented, it would be problematic for
everything, not only the custom-scan node. I don't think a change that
makes it impossible to implement the existing logic will be merged.
Most likely, the new parallel execution feature can work together with
the existing sequential logic and the custom-scan interface.

BTW, this kind of discussion feels like talking with a ghost, because
we cannot see the new interface for parallel execution right now, so we
cannot make a tangible investigation of whether it will become a really
serious backward incompatibility or not.
My bet is on a minor one. I cannot imagine a plan-node interface that does
not support the existing non-parallel SeqScan, NestLoop and so on.

  The issue here is that we're going to be expected to maintain an interface
  once we provide it and so that isn't something we should be doing lightly.
  Particularly when it's as involved as this kind of change is with what's
  going on in the backend where we are nearly 100% sure to be changing things
  in the next release or two.
 
 FDW APIs are also revised several times in the recent releases. If we can
 design perfect interface from the beginning, it's best but usually hard
 to design.

 Sure, but FDWs also have a *much* broader set of use-cases, in my view,
 which is also why I was pushing to work on join-push-down to happen
 there instead of having this kind of a hook interface, which I don't
 think we'd want to directly expose as part of the 'official FDW API' as
 it ends up putting all the work on the FDW with little aide, making it
 terribly likely to end up with a bunch of duplciated code in the FDWs
 from the backend to deal with it, particularly for individuals writing
 FDWs who aren't familiar with what PG already has.

It might not be a good idea to use postgres_fdw as the basis of a proof
of concept to demonstrate that custom-scan can effectively host
an alternative way to join, fetching the result set of a remote join
as if it were a relation scan, even though it did demonstrate that this
is possible.
So, I don't mind the part-3 portion (remote join of postgres_fdw on
top of custom-scan) being dropped from the submission.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-02-25 Thread Kohei KaiGai
 to believe these
arguments are one of the most stable stuffs.

 Below is the information we can reproduce but complicated steps:
  - List *mergeclause_list
  - bool mergejoin_allow
  - Relids param_source_rels
  - Relids extra_lateral_rels

 Below is the information we can reproduce easily:
  - SemiAntiJoinFactors *semifactors

 I think, the first two categories or the first category (if functions to
 reproduce the second group is exposed) should be informed to extension,
 however, priority of the third group is not high.


  BTW, is it a good idea for custom nodes to also affect other paths like
  append, group etc.? Will it need separate hooks for each of those?
 
 Yes. I plan to support above plan node, in addition to scan / join only.
 The custom-scan node is thin abstraction towards general executor
 behavior,
 so I believe it is not hard to enhance this node, without new plan node
 for each of them.
 Of course, it will need separate hook to add alternative path on the
 planner
 stage, but no individual plan nodes. (Sorry, it was unclear for me what
 does the hook mean.)


 If we represent all the operation like grouping, sorting, aggregation, as
 some sort of relation, we can create paths for each of the relation like we
 do (I am heavily borrowing from Tom's idea of pathifying those operations).
 We will need much lesser hooks in custom scan node.

 BTW, from the patch, I do not see this change to be light weight. I was
 expecting more of a list of hooks to be defined by the user and this
 infrastructure just calling them at appropriate places.

Let's focus on scan and join, which we are currently working on.
Even if we need separate node types for grouping or sorting, it will not
be necessary to construct the whole framework from scratch.
For example, the definition of the CustomProvider table will be reusable
for other classes of operations, because most of them are thin abstractions
over the existing executor interface.

 Thanks,
 --
 NEC OSS Promotion Center / PG-Strom Project
 KaiGai Kohei kai...@ak.jp.nec.com


  -Original Message-
  From: Ashutosh Bapat [mailto:ashutosh.ba...@enterprisedb.com]
  Sent: Tuesday, February 25, 2014 5:59 PM
  To: Kohei KaiGai
  Cc: Kaigai, Kouhei(海外, 浩平); Stephen Frost; Shigeru Hanada; Jim
  Mlodgenski; Robert Haas; Tom Lane; PgHacker; Peter Eisentraut
  Subject: Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)
 
 
 
 
  On Sun, Feb 23, 2014 at 6:54 PM, Kohei KaiGai kai...@kaigai.gr.jp
  wrote:
 
 
Folks,
 
Let me remind the custom-scan patches; that is a basis feature of
remote join of postgres_fdw, cache-only scan, (upcoming) GPU
acceleration feature or various alternative ways to scan/join
  relations.
Unfortunately, small amount of discussion we could have in this
  commit
fest, even though Hanada-san volunteered to move the patches into
ready for committer state at the CF-Nov.
 
 
 
  Sorry for jumping into this late.
 
  Instead of custom node, it might be better idea to improve FDW
  infrastructure
  to push join. For the starters, is it possible for the custom scan node
  hooks to create a ForeignScan node? In general, I think, it might be
  better
  for the custom scan hooks to create existing nodes if they serve the
  purpose.
 
 
 
 
Prior to time-up, I'd like to ask hacker's opinion about its
  potential
arguable points (from my standpoint) if it needs to be fixed up.
One is hook definition to add alternative join path, and the other
  one
is a special varno when a custom scan replace a join node.
I'd like to see your opinion about them while we still have to
  change
the design if needed.
 
(1) Interface to add alternative paths in addition to built-in
  join
  paths
 
 
This patch adds add_join_path_hook on add_paths_to_joinrel to
  allow
extensions to provide alternative scan path in addition to the
  built-in
join paths. Custom-scan path being added is assumed to perform to
  scan
on a (virtual) relation that is a result set of joining relations.
My concern is its arguments to be pushed. This hook is declared
  as follows:
 
/* Hook for plugins to add custom join path, in addition to
  default
  ones */
typedef void (*add_join_path_hook_type)(PlannerInfo *root,
RelOptInfo *joinrel,
RelOptInfo *outerrel,
RelOptInfo *innerrel,
JoinType jointype,
SpecialJoinInfo
  *sjinfo,
List *restrictlist,
List *mergeclause_list,
SemiAntiJoinFactors
  *semifactors

Re: [HACKERS] New hook after raw parsing, before analyze

2014-02-13 Thread Kohei KaiGai
See the discussion of Custom-Scan API.
https://commitfest.postgresql.org/action/patch_view?id=1282

I believe my third patch is what you really want to do...

 This rewritten query would be handled by the FDW table that I previously 
 added to the catalog.

 The reason I want this new hook is that I don't want tableA and tableB to be 
 in the catalog.

I'd like to see why you want the pseudo table fdw_tableA_tableB to be in
the catalog instead of tableA and tableB. In addition, the parser will
raise an error if the referenced columns (as part of tableA or tableB) are
not in the catalog, because of a name lookup failure.

Thanks,


2014-02-13 19:01 GMT+09:00 David Beck db...@starschema.net:
 Hello Hackers,

 I work on a foreign data wrapper for a legacy system. I generally find the 
 hook system very useful and flexible way to extend Postgres.
 The post parse analyze hook almost fits what I need, but I have a few use 
 cases where I would need to tap right into the parsed queries but before any 
 catalog based validation is done.
 Please find the attached trivial patch for this new hook.

 One of the use cases I have is this:

 I have table like data structures in the source system for the FDW I work on.
 These tables are sometimes too big and the source system is able to filter 
 and join them with limitations, thus it is not optimal to transfer the data 
 to Postgres.
 At the same time I want the users to think in terms of the original tables.

 The idea is to rewrite the SQL queries like this:

   SELECT * FROM tableA a, tableB b WHERE a.id=b.id AND a.col1=1234 AND 
 b.col2=987

 to:

   SELECT * FROM fdw_tableA_tableB ab WHERE ab.col1=1234 AND ab.col2=987


 This rewritten query would be handled by the FDW table that I previously 
 added to the catalog.

 The reason I want this new hook is that I don't want tableA and tableB to be 
 in the catalog.

 Looking forward to hear your thoughts, opinions, comments.

 Best regards, David





 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: contrib/cache_scan (Re: [HACKERS] What's needed for cache-only table scan?)

2014-02-12 Thread Kohei KaiGai
2014-02-12 14:59 GMT+09:00 Haribabu Kommi kommi.harib...@gmail.com:
 I reviewed all the three patches. The first 1 and 2 core PostgreSQL patches
 are fine.
 And I have comments in the third patch related to cache scan.

Thanks for volunteering.

 1. +# contrib/dbcache/Makefile

Makefile header comment is not matched with file name location.

Ahh, it's an old name from when I started the implementation.

 2.+   /*
 +   * Estimation of average width of cached tuples - it does not make
 +   * sense to construct a new cache if its average width is more than
 +   * 30% of the raw data.
 +   */

Move the estimation of average width calculation of cached tuples into
 the case where the cache is not found,
otherwise it is an overhead for cache hit scenario.

You are right. If and when an existing cache is found and matched, its width
is obviously less than 30% of the total width.

 3. + if (old_cache)
 + attrs_used = bms_union(attrs_used, old_cache->attrs_used);

can't we need the check to see the average width is more than 30%? During
 estimation it doesn't
include the existing other attributes.

Indeed. It should drop some attributes from the existing cache if the total
average width grows beyond the threshold. Probably, we need a statistical
variable to track how many times, or how recently, each attribute is referenced.

 4. + lchunk->right = cchunk;
 + lchunk->l_depth = TTREE_DEPTH(lchunk->right);

   I think it should be lchunk->r_depth needs to be set in a clock wise
 rotation.

Oops, nice catch.

 5. can you add some comments in the code with how the block is used?

Sorry, I'll add them. A block is consumed from the head to store pointers
to tuples, and from the tail to store the contents of the tuples. A block
can hold multiple tuples as long as the tuple pointers growing from the head
do not cross into the area for the tuple contents. Anyway, I'll put this in
the source code; a rough sketch of the layout is below.
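
Just for illustration, here is a sketch of that layout (the struct, field
and macro names below are made up for this example, not taken from the patch):

/*
 *  +-------------------------------------------------------------+
 *  | header | item[0] | item[1] | ...  free  ... | tuple1 | tuple0 |
 *  +-------------------------------------------------------------+
 *           ^ item offsets grow rightwards; tuple bodies grow leftwards
 */
typedef struct ccache_block_example
{
    uint32      nitems;     /* item slots consumed from the head */
    uint32      freetail;   /* offset where the tuple-content area begins */
    uint32      items[FLEXIBLE_ARRAY_MEMBER];  /* offsets of tuple bodies */
} ccache_block_example;

/* a new tuple fits as long as the head usage does not cross the tail area */
#define BLOCK_HAS_ROOM(blk, tuplen) \
    (offsetof(ccache_block_example, items) + \
     ((blk)->nitems + 1) * sizeof(uint32) <= \
     (blk)->freetail - MAXALIGN(tuplen))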

 6. In do_insert_tuple function I felt moving the tuples and rearranging
 their addresses is little bit costly. How about the following way?

Always insert the tuple from the bottom of the block where the empty
 space is started and store their corresponding reference pointers
in the starting of the block in an array. As and when the new tuple
 inserts this array increases from block start and tuples from block end.
Just need to sort this array based on item pointers, no need to update
 their reference pointers.

In this case the movement is required only when the tuple is moved from
 one block to another block and also whenever if the continuous
free space is not available to insert the new tuple. you can decide based
 on how frequent the sorting will happen in general.

That seems like a reasonable suggestion to me.
Probably, an easier implementation is to replace an old block containing
dead space with a new block that contains only valid tuples, if and when
the dead space grows beyond a threshold of block usage.

 7. In ccache_find_tuple function this Assert(i_min + 1 < cchunk->ntups); can
 go wrong when only one tuple present in the block
with the equal item pointer what we are searching in the forward scan
 direction.

It shouldn't happen, because the first or second ItemPointerCompare will
handle that condition. Please assume cchunk->ntups == 1. In this case,
any given ctid will match one of the comparisons, because any ctid is less
than, equal to or greater than the only cached tuple, thus it moves to the
right or left node according to the scan direction.

 8. I am not able to find a protection mechanism in insert/delete and etc of
 a tuple in Ttree. As this is a shared memory it can cause problems.

For design simplification, I put a giant lock on each columnar cache.
So, the routines in cscan.c acquire an exclusive lwlock prior to invoking
ccache_insert_tuple / ccache_delete_tuple.

 9. + /* merge chunks if this chunk has enough space to merge */
 + ccache_merge_chunk(ccache, cchunk);

   calling the merge chunks for every call back of heap page prune is a
 overhead for vacuum. After the merge which may again leads
   to node splits because of new data.

OK, I'll check the conditions for merging the chunks, to prevent overly
frequent merges / splits.

 10. columner is present in some places of the patch. correct it.

Ahh, it should be columnar.

 11. In cache_scan_next function, incase of cache insert fails because of
 shared memory the tuple pointer is not reset and cache is NULL.
 Because of this during next record fetch it leads to assert as cache !=
 NULL.

You are right. I have to reset the scan state to the normal seqscan path,
not just set csstate->ccache to NULL.
I left an older manner in place during trial and error in the implementation.

 12. + if (ccache->status != CCACHE_STATUS_IN_PROGRESS)
   + cs_put_ccache(ccache);

 The cache is created with refcnt as 2 and in some times two times put
 cache is called to eliminate it and in some times with a different approach.
 It is little bit confusing, can you explain in with comments with why 2
 is required and how it 

Re: [HACKERS] dynamic shared memory and locks

2014-02-10 Thread Kohei KaiGai
2014-02-08 4:52 GMT+09:00 Robert Haas robertmh...@gmail.com:
 On Tue, Jan 21, 2014 at 11:37 AM, Robert Haas robertmh...@gmail.com wrote:
 One idea I just had is to improve the dsm_toc module so that it can
 optionally set up a tranche of lwlocks for you, and provide some
 analogues of RequestAddinLWLocks and LWLockAssign for that case.  That
 would probably make this quite a bit simpler to use, at least for
 people using it with dynamic shared memory.  But I think that's a
 separate patch.

 I played with this a bit today and it doesn't actually seem to
 simplify things very much.  The backend that creates the DSM needs to
 do this:

 lwlocks = shm_toc_allocate(toc, sizeof(LWLockPadded) * nlwlocks);
 for (i = 0; i < nlwlocks; ++i)
 LWLockInitialize(&lwlocks[i].lock, tranche_id);

 Since that's all of three lines, encapsulating it doesn't look all
 that helpful.  Then each backend needs to do something like this:

 static LWLockTranche mytranche;
 mytranche.name = "some descriptive module name";
 mytranche.array_base = lwlocks;
 mytranche.array_stride = sizeof(LWLockPadded);
 LWLockRegisterTranche(tranche_id, &mytranche);

 That's not a lot of code either, and there's no obvious way to reduce
 it much by hooking into shm_toc.

 I thought maybe we needed some cleanup when the dynamic shared memory
 segment is unmapped, but looks like we really don't.  One can do
 something like LWLockRegisterTranche(tranche_id, NULL) for debugging
 clarity, but it isn't strictly needed; one can release all lwlocks
 from the tranche or assert that none are held, but that should really
 only be a problem if the user does something like
 LWLockAcquire(lock_in_the_segment); dsm_detach(seg), because error
 recovery starts by releasing *all* lwlocks.  And if the user does it
 explicitly, I'm kinda OK with that just seg faulting.  After all, the
 user could equally well have done dsm_detach(seg);
 LWLockAcquire(lock_in_the_segment) and there's no way at all to
 cross-check for that sort of mistake.

Would it cause another problem if dsm_detach() also released the lwlocks
allocated on the dsm segment being released?
The lwlocks being held are tracked in the held_lwlocks[] array; its length is
usually 100. When dsm_detach() is called for the segment spanning
addr ~ (addr + length), it seems to me an easy job to walk this
array and find the lwlocks within that range, as sketched below.
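
For instance (just an assumption for illustration, not existing code; this
would have to live in lwlock.c, where held_lwlocks[] and num_held_lwlocks
are defined, and be invoked from dsm_detach() with the mapped address and
length of the segment being detached):

void
LWLockReleaseAllInRange(char *base, Size length)
{
    int     i;

    /*
     * Walk backwards, so releasing an entry does not disturb the
     * indexes we have not visited yet.
     */
    for (i = num_held_lwlocks - 1; i >= 0; i--)
    {
        char   *addr = (char *) held_lwlocks[i];

        if (addr >= base && addr < base + length)
            LWLockRelease(held_lwlocks[i]);
    }
}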


 I do see one thing about the status quo that does look reasonably
 annoying: the code expects that tranche IDs are going to stay
 relatively small.  For example, consider a module foobar that uses DSM
 + LWLocks.  It won't do at all for each backend, on first use of the
 foobar module, to do LWLockNewTrancheId() and then reuse that
 tranche_id repeatedly for each new DSM - because over a system
 lifetime of months, tranche IDs could grow into the millions, causing
 LWLockTrancheArray to get really big (and eventually repalloc will
 fail).  Rather, the programmer needs to ensure that
 LWLockNewTrancheId() gets called *once per postmaster lifetime*,
 perhaps by allocating a chunk of permanent shared memory and using
 that to store the tranche_id that should be used each time an
 individual backend fires up a DSM.  Considering that one of the goals
 of dynamic shared memory is to allow modules to be loaded after
 postmaster startup and still be able to do useful stuff, that's kind
 of obnoxious.  I don't have a good idea what to do about it, though;
 making LWLockTrancheArray anything other than a flat array looks
 likely to slow down --enable-dtrace builds unacceptably.

Just my rough idea. Doesn't it make sense to have an offset from the
head of the DSM segment that contains the lwlock, instead of the identifier
form, to indicate the cstring datum? That does not prevent allocating it
later, after the postmaster has started up, and there is no problem if the
number of dynamic lwlocks grows to millions or more.
If an lwlock has a tranche_offset rather than a tranche_id, all we need to do is:
1. Find out which DSM segment or regular SHM segment contains the
supplied lwlock.
2. Calculate mapped_address + tranche_offset; that is the local location
where the cstring form is put.

Probably, it needs a common utility routine in dsm.c to find the
particular DSM segment that contains a supplied address
(the inverse function of dsm_segment_address), roughly like the sketch below.
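
A minimal sketch, assuming it lives in dsm.c so it can walk the private
dsm_segment_list and look at the mapped_address / mapped_size fields of each
attached segment (the function name is made up for illustration):

dsm_segment *
dsm_find_segment_by_address(void *addr)
{
    dlist_iter  iter;

    dlist_foreach(iter, &dsm_segment_list)
    {
        dsm_segment *seg = dlist_container(dsm_segment, node, iter.cur);
        char       *base = (char *) seg->mapped_address;

        if ((char *) addr >= base && (char *) addr < base + seg->mapped_size)
            return seg;
    }
    return NULL;    /* address is not in any attached DSM segment */
}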

How about my ideas?

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-01-29 Thread Kohei KaiGai
2014-01-29 Christian Convey christian.con...@gmail.com:

 On Mon, Jan 27, 2014 at 7:14 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

 FDW's join pushing down is one of the valuable use-cases of this
 interface,
 but not all. As you might know, my motivation is to implement GPU
 acceleration
 feature on top of this interface, that offers alternative way to scan or
 join
 relations or potentially sort or aggregate.


 I'm curious how this relates to the pluggable storage idea discussed here
 https://wiki.postgresql.org/wiki/PgCon_2013_Developer_Meeting and here
 http://www.databasesoup.com/2013/05/postgresql-new-development-priorities-2.html

 I haven't seen a specific proposal about how much functionality should be
 encapsulated by a pluggable storage system.  But I wonder if that would be
 the best place for specialized table-scan code to end up?

If you are interested in designing your own storage layer (thus it needs to
have own scan/writer implementation), FDW is an option currently available.
It defines a set of interface that allows extensions to generate things to be
there on the fly. It does not force to perform as a client of remote database,
even though it was originally designed for dblink purpose.
In other words, FDW is a feature to translate a particular data source into
something visible according to the table definition. As long as driver can
intermediate table definition and data format of your own storage layer,
it shall work.

On the other hands, custom-scan interface, basically, allows extensions to
implement alternative way to access the data. If we have wiser way to
scan or join relations than built-in logic (yep, it will be a wiser
logic to scan
a result set of remote-join than local join on a couple of remote scan results),
this interface suggest the backend I have such a wise strategy, then planner
will choose one of them; including either built-in or additional one, according
to the cost.

Let's back to your question. This interface is, right now, not designed to
implement pluggable storage layer. FDW is an option now, and maybe
a development item in v9.5 cycle if we want regular tables being pluggable.
Because I'm motivated to implement my GPU acceleration feature to
perform on regular relations, I put my higher priority on the interface to
allow extension to suggest how to scan it.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dynamic shared memory and locks

2014-01-23 Thread Kohei KaiGai
Isn't it necessary to have an interface to initialize an LWLock structure
allocated on a dynamic shared memory segment?
Even though the LWLock structure is exposed in lwlock.h, we have no common
way to initialize it.

How about having the following function?

void
InitLWLock(LWLock *lock)
{
    SpinLockInit(&lock->lock.mutex);
    lock->lock.releaseOK = true;
    lock->lock.exclusive = 0;
    lock->lock.shared = 0;
    lock->lock.head = NULL;
    lock->lock.tail = NULL;
}

Thanks,

2014/1/22 KaiGai Kohei kai...@ak.jp.nec.com:
 (2014/01/22 1:37), Robert Haas wrote:

 On Mon, Jan 20, 2014 at 11:23 PM, KaiGai Kohei kai...@ak.jp.nec.com
 wrote:

 I briefly checked the patch. Most of lines are mechanical replacement
 from LWLockId to LWLock *, and compiler didn't claim anything with
 -Wall -Werror option.

 My concern is around LWLockTranche mechanism. Isn't it too complicated
 towards the purpose?
 For example, it may make sense to add char lockname[NAMEDATALEN]; at
 the tail of LWLock structure if LOCK_DEBUG is enabled. Probably, it
 also adds an argument of LWLockAssign() to gives the human readable
 name. Is the additional 64bytes (= NAMEDATALEN) per lock too large
 for recent hardware?


 Well, we'd need it when either LOCK_DEBUG was defined or when
 LWLOCK_STATS was defined or when --enable-dtrace was used, and while
 the first two are probably rarely enough used that that would be OK, I
 think the third case is probably fairly common, and I don't think we
 want to have such a potentially performance-critical difference
 between builds with and without dtrace.

 Also... yeah, it's a lot of memory.  If we add an additional 64 bytes
 to the structure, then we're looking at 96 bytes per lwlock instead of
 32, after padding out to a 32-byte boundary to avoid crossing cache
 lines.  We need 2 lwlocks per buffer, so that's an additional 128
 bytes per 8kB buffer.  For shared_buffers = 8GB, that's an extra 128MB
 for storing lwlocks.  I'm not willing to assume nobody cares about
 that.  And while I agree that this is a bit complex, I don't think
 it's really as bad as all that.  We've gotten by for a long time
 without people being able to put lwlocks in other parts of memory, and
 most extension authors have gotten by with LWLockAssign() just fine
 and can continue doing things that way; only users with particularly
 sophisticated needs should bother to worry about the tranche stuff.

 Hmm... 1/64 of main memory (if large buffer system) might not be
 an ignorable memory consumption.


 One idea I just had is to improve the dsm_toc module so that it can
 optionally set up a tranche of lwlocks for you, and provide some
 analogues of RequestAddinLWLocks and LWLockAssign for that case.  That
 would probably make this quite a bit simpler to use, at least for
 people using it with dynamic shared memory.  But I think that's a
 separate patch.

 I agree with this idea. It seems to me quite natural to keep properties
 of objects held on shared memory (LWLock) on shared memory.
 Also, a LWLock once assigned shall not be never released. So, I think
 dsm_toc can provide a well suitable storage for them.


 Thanks,
 --
 OSS Promotion Center / The PG-Strom Project
 KaiGai Kohei kai...@ak.jp.nec.com


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dynamic shared memory and locks

2014-01-23 Thread Kohei KaiGai
2014/1/23 Andres Freund and...@2ndquadrant.com:
 On 2014-01-23 23:03:40 +0900, Kohei KaiGai wrote:
 Isn't it necessary to have an interface to initialize LWLock structure being
 allocated on a dynamic shared memory segment?
 Even though LWLock structure is exposed at lwlock.h, we have no common
 way to initialize it.

 There's LWLockInitialize() or similar in the patch afair.

Ahh, I overlooked the code around tranche-id. Sorry for the noise.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sepgsql: label regression test failed

2013-12-24 Thread Kohei KaiGai
Hello,

It seems to me that changes in the base security policy on Fedora affected
the regression test. Our test cases for sepgsql_setcon() utilize the MCS
rules, which prevent domain transitions from narrow categories to wider ones,
to control the success cases and failure cases.

However, its coverage has changed. It used to be applied to all the domains
in the system, so the unconfined_t domain was also constrained by the MCS
rules. But now, it is applied only to domains with the mcs_constrained_type
attribute.

[kaigai@vmlinux tmp]$ diff -up old/policy/mcs new/policy/mcs
  :
 snip
  :
 mlsconstrain process { transition dyntransition }
-   (( h1 dom h2 ) or ( t1 == mcssetcats ));
+   (( h1 dom h2 ) or ( t1 != mcs_constrained_type ));

Probably, we need to define a domain by ourselves for the regression test to
ensure test stability, instead of using the system unconfined domain, whose
meaning differs between releases.

I'll make a patch. Please wait for a while.

Thanks for your test & reports.

2013/12/18 Sergey Muraviov sergey.k.murav...@gmail.com:
 # semodule -l | grep sepgslq
 sepgsql-regtest 1.07

 Full list of modules is in attachment.


 2013/12/18 Kohei KaiGai kai...@kaigai.gr.jp

 Could you show me semodule -l on your environment?
 I believe security policy has not been changed between F19 and F20...

 Thanks,

 2013/12/18 Sergey Muraviov sergey.k.murav...@gmail.com:
  Hi
 
  I've tried to test postgres 9.3.2 and 9.4devel with selinux on Fedora 20
  and
  met with a label regression test failure.
 
  PS
  I've got some warning during build process.
 
  --
  Best regards,
  Sergey Muraviov
 
 
  --
  Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
  To make changes to your subscription:
  http://www.postgresql.org/mailpref/pgsql-hackers
 



 --
 KaiGai Kohei kai...@kaigai.gr.jp




 --
 Best regards,
 Sergey Muraviov



-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sepgsql: label regression test failed

2013-12-18 Thread Kohei KaiGai
Could you show me "semodule -l" on your environment?
I believe the security policy has not changed between F19 and F20...

Thanks,

2013/12/18 Sergey Muraviov sergey.k.murav...@gmail.com:
 Hi

 I've tried to test postgres 9.3.2 and 9.4devel with selinux on Fedora 20 and
 met with a label regression test failure.

 PS
 I've got some warning during build process.

 --
 Best regards,
 Sergey Muraviov


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] shared memory message queues

2013-12-08 Thread Kohei KaiGai
2013/12/6 Kohei KaiGai kai...@kaigai.gr.jp:
 What will happen if sender tries to send a large chunk that needs to
 be split into multiple sub-chunks and receiver concurrently detaches
 itself from the queue during the writes by sender?
 It seems to me the sender gets SHM_MQ_DETACHED and only
 earlier half of the chunk still remains on the queue even though
 its total length was already in the message queue.
 It may eventually lead infinite loop on the receiver side when another
 receiver appeared again later, then read incomplete chunk.
 Does it a feasible scenario? If so, it might be a solution to prohibit
 enqueuing something without receiver, and reset queue when a new
 receiver is attached.

It is not an intended usage to attach a peer process to a message
queue that has once been detached, is it?
If so, it may be a solution to put an ereport() in shm_mq_set_receiver()
and shm_mq_set_sender() to prohibit assigning a process to a
message queue with mq_detached = true. That would make the situation
simpler; a rough sketch follows.
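
Here is a minimal sketch of that guard, assuming it is added inside
shm_mq.c (the error wording and the exact placement are my assumption,
not part of the submitted patch):

void
shm_mq_set_receiver(shm_mq *mq, PGPROC *proc)
{
    SpinLockAcquire(&mq->mq_mutex);
    Assert(mq->mq_receiver == NULL);
    if (mq->mq_detached)
    {
        /* the previous peer already went away; refuse to attach */
        SpinLockRelease(&mq->mq_mutex);
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("cannot attach to a message queue that was already detached")));
    }
    mq->mq_receiver = proc;
    SpinLockRelease(&mq->mq_mutex);
}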

Regarding the test-shm-mq-v1.patch, setup_background_workers()
tries to launch nworkers background worker processes; however, it
may fail during launching if max_worker_processes is not large enough.
That is a situation to attach the BGWORKER_EPHEMERAL flag once
your patch gets committed, isn't it?
I think it is a waste of effort to add error handling here instead of the
core support to be added; however, it makes sense to put a source
code comment there so we do not forget to add this flag when it arrives.

Also, test_shm_mq_setup() waits for the background worker processes to
complete their startup. I'm uncertain whether that is really
needed, because this shared memory message queue allows sending a byte
stream without a receiver, and it also blocks until a byte
stream comes from a peer that is set later.
This module is designed for testing purposes, so I think it makes more
sense if the test conditions cover more of the corner cases.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] shared memory message queues

2013-12-05 Thread Kohei KaiGai
Sorry for my late reply.

I checked the part-3 (shm-mq-v1.patch) portion as below.
Your comments on part-1 and part-2 are reasonable to me,
so I have no argument about those portions.

Even though shm_mq_create() expects the address to be aligned, there is
no mechanism to ensure it. How about putting an Assert() here?


shm_mq_send_bytes() has an Assert() to check the length to be
added to mq_bytes_written. Is MAXALIGN64() the right check on a
32-bit architecture?
The sendnow value may be Min(ringsize - used, nbytes - sent),
and the ringsize comes from mq->mq_ring_size, which is aligned
using MAXALIGN(_DOWN), which has 32-bit width there.

/*
 * Update count of bytes written, with alignment padding.  Note
 * that this will never actually insert any padding except at the
 * end of a run of bytes, because the buffer size is a multiple of
 * MAXIMUM_ALIGNOF, and each read is as well.
 */
Assert(sent == nbytes || sendnow == MAXALIGN64(sendnow));
shm_mq_inc_bytes_written(mq, MAXALIGN64(sendnow));


What will happen if the sender tries to send a large chunk that needs to
be split into multiple sub-chunks and the receiver concurrently detaches
itself from the queue during the sender's writes?
It seems to me the sender gets SHM_MQ_DETACHED and only the
earlier half of the chunk remains on the queue, even though
its total length was already written into the message queue.
That may eventually lead to an infinite loop on the receiver side when
another receiver appears again later and reads the incomplete chunk.
Is that a feasible scenario? If so, it might be a solution to prohibit
enqueuing something without a receiver, and to reset the queue when a new
receiver is attached.


The source code comment on shm_mq_wait_internal() is a little bit
misleading to me. It says nothing shall be written before the peer
process has attached; however, we have no check for that as long as
nsend is less than the queue size.

 * If handle != NULL, the queue can be read or written even before the
 * other process has attached.  We'll wait for it to do so if needed.  The
 * handle must be for a background worker initialized with bgw_notify_pid
 * equal to our PID.

Right now, that's all I can comment on. I'll do follow-up code reading
over the weekend.

Thanks,

2013/11/20 Robert Haas robertmh...@gmail.com:
 On Tue, Nov 19, 2013 at 12:33 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 * on-dsm-detach-v2.patch
 It reminded me the hook registration/invocation mechanism on apache/httpd.
 It defines five levels for invocation order (REALLY_FIRST, FIRST, MIDDLE,
 LAST, REALLY_LAST), but these are alias of integer values, in other words,
 they are just kicks the hook in order to the priority value associated with a
 function pointer.
 These flexibility may make sense. We may want to control the order to
 release resources more fine grained in the future. For example, isn't it
 a problem if we have only two levels when a stuff in a queue needs to be
 released earlier than the queue itself?
 I'm not 100% certain on this suggestion because on_shmen_exit is a hook
 that does not host so much callbacks, thus extension may implement
 their own way on the SHMEM_EXIT_EARLY or SHMEM_EXIT_LATE stage
 of course.

 I don't really see much point in adding more flexibility here until we
 need it, but I can imagine that we might someday need it, for reasons
 that are not now obvious.

 * shm-toc-v1.patch

 From my experience, it makes sense to put a magic number on the tail of
 toc segment to detect shared-memory usage overrun. It helps to find bugs
 bug hard to find because of concurrent jobs.
 Probably, shm_toc_freespace() is a point to put assertion.

 How many entries is shm_toc_lookup() assumed to chain?
 It assumes a liner search from the head of shm_toc segment, and it is
 prerequisite for lock-less reference, isn't it?
 Does it make a problem if shm_toc host many entries, like 100 or 1000?
 Or, it is not an expected usage?

 It is not an expected usage.In typical usage, I expect that the
 number of TOC entries will be about N+K, where K is a small constant
 ( 10) and N is the number of cooperating parallel workers.  It's
 possible that we'll someday be in a position to leverage 100 or 1000
 parallel workers on the same task, but I don't expect it to be soon.
 And, actually, I doubt that revising the data structure would pay off
 at N=100.  At N=1000, maybe.  At N=1, probably.  But we are
 *definitely* not going to need that kind of scale any time soon, and I
 don't think it makes sense to design a complex data structure to
 handle that case when there are so many more basic problems that need
 to be solved before we can go there.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2013-12-04 Thread Kohei KaiGai
Hanada-san,

Thanks for your reviewing,

2013/12/4 Shigeru Hanada shigeru.han...@gmail.com:
 I first reviewed postgres_fdw portion of the patches to learn the
 outline of Custom Plan.  Wiki page is also a good textbook of the
 feature.  I have some random comments about the basic design of Custom
 Plan:

 (1) IIUC add_join_path and add_scan_path are added to allow extensions
 to plug their code into planner.

Almost, yes. To be more precise, these hooks allow extensions to plug
the paths they can provide into a particular join or scan. Then the planner
will choose the cheapest one according to the cost value.
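
Roughly, an extension plugs in like the sketch below. The hook name follows
the patch in this thread, but its exact signature here is my assumption for
illustration, and the path-building body is elided:

static add_scan_path_hook_type prev_add_scan_path = NULL;

static void
my_add_scan_path(PlannerInfo *root, RelOptInfo *baserel, RangeTblEntry *rte)
{
    /* keep any previously installed hook working */
    if (prev_add_scan_path)
        prev_add_scan_path(root, baserel, rte);

    /*
     * Build a CustomPath with our own startup/total cost and register it
     * with add_path(); the planner keeps it only if it wins on cost.
     */
}

void
_PG_init(void)
{
    prev_add_scan_path = add_scan_path_hook;
    add_scan_path_hook = my_add_scan_path;
}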

 (2) FDW framework has executor callbacks based on existing executor
 nodes.  Is there any plan to integrate them into one way, or wrap on
 by another?  I'm not sure that we should have two similar framework
 side by side.
 # I'm sorry if I've missed the past discussion about this issue.

Probably, FDW has a different role from the CustomScan API.
As the name suggests, FDW acts as a bridge between the relation form and
an opaque external data source, mediating between two different worlds
on behalf of a foreign table.
On the other hand, CustomScan allows providing alternative logic
to scan or join particular relations, in addition to the built-in ones,
but it does not act on behalf of foreign tables.

The existing FDW is designed to implement a scan on an intangible
relation, thus it can assume some things; for example, that a tuple
returned from the FDW has a TupleDesc equivalent to the table definition,
or that it can always use ExecScan() for all cases.
So, I don't think these two frameworks should be consolidated,
because that would cause confusion for existing extensions that
assume FDW callbacks always have a particular foreign table
definition.

 (3) Internal routines such as is_self_managed_relation and
 has_wholerow_reference seem to be useful for other FDWs.  Is it able
 to move them  into core?

Probably, src/backend/foreign/foreign.c is a good host for them.

 (4) postgres_fdw estimates costs of join by calculating local numbers.
  How about to support remote estimation by throwing EXPLALAIN query
 when use_remote_estimates = true.

I'm uncertain whether the cost value from a remote EXPLAIN represents
the right difficulty on the local side, because it indeed represents the
difficulty of joining two relations on the remote side, but it does not
represent the local job, which just fetches tuples from the result set of
the remote query with the table join.
What is your opinion? Is the remote cost estimation value comparable
with the local value?

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2013-12-04 Thread Kohei KaiGai
Thanks for fixing many of my careless mistakes.
I didn't know "seek" was an irregular verb...

Best regards,

2013/12/4 Shigeru Hanada shigeru.han...@gmail.com:
 2013/11/29 Kohei KaiGai kai...@kaigai.gr.jp:
 I merged all the propositions from Jim. Thanks, it made the documentation
 quality better. Also, I fixed up cosmetic stuff around whitespace - tab.

 I found some typos in documents and comments.  Please see attached
 patch for detail.

 --
 Shigeru HANADA



-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Status of FDW pushdowns

2013-11-21 Thread Kohei KaiGai
2013/11/21 Bruce Momjian br...@momjian.us:
 Where are we on the remaining possible pushdowns for foreign data
 wrappers, particularly the Postgres one?  I know we do WHERE restriction
 pushdowns in 9.3, but what about join and aggregate pushdowns?  Is
 anyone working on those?

 I know join pushdowns seem insignificant, but it helps to restrict what
 data must be passed back because you would only pass back joined rows.

 Do we document these missing features anywhere?

Probably, the custom-scan API will provide a more flexible way to push down
aggregates, sorts or other operations performed on regular tables, not only
foreign tables.
It allows extensions to offer alternative scan/join paths at the planning
stage; the executor then calls back into the custom logic instead of the
built-in one if its cost is cheaper.

Right now, it works on relation scans and joins only. However, we will be
able to apply the same concept to aggregation.
For example, an aggregation node on top of a foreign table scan is a good
candidate to push down, because it can be replaced with custom
logic that scans the materialized result of the remote aggregation query,
if its cost is cheap enough compared to local aggregation.
Probably, we need to add a hook and some logic to compare the
built-in aggregation with the alternative paths provided by extensions.
It would also be helpful for people who want to implement something like
parallel aggregation on regular tables, not only foreign tables.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] shared memory message queues

2013-11-18 Thread Kohei KaiGai
Hello,

I tried to look at patches #1 and #2 first; I shall look at the rest of the
portions later.

* basic checks
All the patches (not only #1 and #2) can be applied without any problem on
top of the latest master branch. The build succeeded with -Werror.
The regression tests work fine on the core and contrib/test_shm_mq.

* on-dsm-detach-v2.patch
It reminded me of the hook registration/invocation mechanism in apache/httpd.
It defines five levels for invocation order (REALLY_FIRST, FIRST, MIDDLE,
LAST, REALLY_LAST), but these are aliases for integer values; in other words,
it just kicks the hooks in the order of the priority value associated with
each function pointer.
This flexibility may make sense. We may want to control the order of
resource release in a more fine-grained way in the future. For example,
isn't it a problem, if we have only two levels, when some stuff in a queue
needs to be released earlier than the queue itself?
I'm not 100% certain about this suggestion, because on_shmem_exit is a hook
that does not host so many callbacks, so extensions may of course implement
their own way at the SHMEM_EXIT_EARLY or SHMEM_EXIT_LATE stage.

* shm-toc-v1.patch

From my experience, it makes sense to put a magic number at the tail of the
toc segment to detect shared-memory usage overruns. It helps to find bugs
that are hard to track down because of concurrent jobs; see the sketch below.
Probably, shm_toc_freespace() is a good place to put such an assertion.
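
A minimal illustration of the idea, as an assumption for the sake of the
example rather than part of the submitted patch (the macro and function
names are made up):

#define SHM_TOC_TAIL_MAGIC  0x544F4321      /* arbitrary marker value */

static void
shm_toc_set_tail_magic(char *region, Size total_bytes)
{
    /* stamp the very last word of the region at creation time */
    *(uint32 *) (region + total_bytes - sizeof(uint32)) = SHM_TOC_TAIL_MAGIC;
}

static void
shm_toc_check_tail_magic(const char *region, Size total_bytes)
{
    /* fires if anything overran its allocation and clobbered the tail */
    Assert(*(const uint32 *) (region + total_bytes - sizeof(uint32)) ==
           SHM_TOC_TAIL_MAGIC);
}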

How many entries is shm_toc_lookup() expected to chain through?
It assumes a linear search from the head of the shm_toc segment, and that is
a prerequisite for lock-less reference, isn't it?
Does it become a problem if the shm_toc hosts many entries, like 100 or 1000?
Or is that not an expected usage?

#3 and #4 will be looked at later...

Thanks,

2013/11/8 Robert Haas robertmh...@gmail.com:
 On Wed, Nov 6, 2013 at 9:48 AM, Peter Eisentraut pete...@gmx.net wrote:
 This patch needs to be rebased.

 Thanks.  You didn't mention which of the four needed rebasing, but it
 seems that it's just the first one.  New version of just that patch
 attached; please use the prior versions of the remaining three.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] What's needed for cache-only table scan?

2013-11-12 Thread Kohei KaiGai
Hello,

This is a brief design proposal for a feature I'd like to implement on top of
the custom-scan APIs. Because it (probably) requires a few additional base
features beyond custom-scan, I'd like to get feedback from the hackers.

The cache-only table scan in the subject line is an alternative scan
logic to sequential scan, usable when all the referenced columns are cached.
It allows scanning a particular table without storage access, and thus
improves scan performance.
So what? How does that differ from a large shared_buffers configuration?
This mechanism intends to cache only the columns referenced
in the query, not whole records. It makes sense for workloads
that scan a table with many columns while the qualifiers reference just a few
of them, as is typical for analytic queries, because it reduces the memory
consumed by the cache and thus allows more records to be cached.
In addition, it has another role from my standpoint. It also performs as
a fast data supplier for GPU/MIC devices. When we move data to a
GPU device, the source address has to be a region marked as page-
locked, which is exempted from concurrent swap-out, if we want CUDA
or OpenCL to run in asynchronous DMA transfer mode, the fastest one.

Probably, there is no problem with the construction of this table cache.
All we need to do is inject a custom-scan node instead of the seq-scan;
it can then build the table cache concurrently with a regular seq-
scan, even though the first access becomes a little bit slower.

My concern is how to handle the case when the table gets modified.
A straightforward idea is that each cached entry being modified
shall be invalidated by a callback mechanism.
Triggers can help in the case of INSERT, UPDATE, DELETE and
TRUNCATE. Furthermore, it would be better if an extension could inject
its own trigger definitions at RelationBuildTriggers() on the fly,
to perform the feature transparently.
On the other hand, we have no way to get control around VACUUM.
I want to have a hook that allows extensions to get control when
a page gets vacuumed. Once we have such a hook, it enables
invalidating cached entries that are indexed by tid but already vacuumed.
Its best location is probably lazy_scan_heap(), calling back the extension
for each page getting vacuumed; a rough sketch of such a hook is below.
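
The sketch below is only an assumption of what the hook could look like --
no such hook exists in core, and the names and the exact argument list are
made up for illustration:

/* called from lazy_scan_heap() once per heap page after it was vacuumed */
typedef void (*heap_page_vacuumed_hook_type) (Relation onerel,
                                              BlockNumber blkno,
                                              OffsetNumber *unused,
                                              int nunused);

extern PGDLLIMPORT heap_page_vacuumed_hook_type heap_page_vacuumed_hook;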

What is your opinion?

I'd like to find the best way to implement this table-caching
mechanism within the scope of the v9.4 functionality set.
Any ideas, comments or suggestions are welcome.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What's needed for cache-only table scan?

2013-11-12 Thread Kohei KaiGai
2013/11/12 Tom Lane t...@sss.pgh.pa.us:
 Kohei KaiGai kai...@kaigai.gr.jp writes:
 The cache-only table scan, being in subject line, is an alternative scan
 logic towards sequential scan if all the referenced columns are cached.

 This seems like a pretty dubious idea to me --- you're talking about
 expending large amounts of memory on a single-purpose cache, without
 any clear way to know if the data will ever be used before it gets
 invalidated (and the life expectancy of the cache doesn't sound like
 it'd be very high either).

Thanks for your comments. I assume this table cache shall be applied
to tables that hold data sets for analytic workloads. Even though it
may consume a large amount of memory, it will be hit by the major workload
in this use scenario.
Probably, it will have an upper limit on cache memory usage, and need
a mechanism to reclaim cache blocks starting from the oldest one. There
is nothing special here. Also, even though I call it a table cache, it is
similar to an in-memory database backed by rich memory hardware.

 I'd like to find out the best way to implement this table-caching
 mechanism within scope of v9.4 functionality set.

 There's no possible way you'll finish this for 9.4.  Keep in mind that
 CF3 starts Friday, and CF4 starts 2014-01-15, and project policy is that
 major feature patches should arrive (in at least rough form) by CF3.
 Even ignoring that policy, this is more than 2 months worth of work.

Yes, I understand it is not possible to submit the whole patch by the
CF3 deadline. So, I'd like to find a way to implement it as an
extension using facilities that are supported, or will be enhanced, in v9.4.

Best regards,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What's needed for cache-only table scan?

2013-11-12 Thread Kohei KaiGai
2013/11/12 Robert Haas robertmh...@gmail.com:
 On Tue, Nov 12, 2013 at 9:45 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 It is a brief design proposal of a feature I'd like to implement on top of
 custom-scan APIs. Because it (probably) requires a few additional base
 features not only custom-scan, I'd like to see feedback from the hackers.

 The cache-only table scan, being in subject line, is an alternative scan
 logic towards sequential scan if all the referenced columns are cached.
 It shall allow to scan a particular table without storage access, thus
 make scan performance improved.
 So what? Which is difference from large shared_buffers configuration?
 This mechanism intends to cache a part of columns being referenced
 in the query, not whole of the records. It makes sense to the workloads
 that scan a table with many columns but qualifier references just a few
 columns, typically used to analytic queries, because it enables to
 reduce memory consumption to be cached, thus more number of records
 can be cached.
 In addition, it has another role from my standpoint. It also performs as
 fast data supplier towards GPU/MIC devices. When we move data to
 GPU device, the source address has to be a region marked as page-
 locked that is exempted from concurrent swap out, if we want CUDA
 or OpenCL to run asynchronous DMA transfer mode; the fastest one.

 Probably, here is no problem on construction of this table cache.
 All we need to do is inject a custom-scan node instead of seq-scan,
 then it can construct table cache in concurrence with regular seq-
 scan, even though the first access become a little bit slow down.

 My concern is how to handle a case when table gets modified.
 A straightforward idea is that each cached entries being modified
 shall be invalidated by callback mechanism.
 Trigger can help in case of INSERT, UPDATE, DELETE and
 TRUNCATE. Furthermore, it's better if extension could inject
 its own trigger definition at RelationBuildTriggers() on the fly,
 to perform the feature transparently.
 On the other hand, we have no way to get control around VACUUM.
 I want to have a hook that allows extensions to get control when
 a page got vacuumed. Once we have such a hook, it enables to
 invalidate cached entries being indexed by tid, but already vacuumed.
 Its best location is probably lazy_scan_heap() to call back extension
 for each page getting vacuumed, with

 How about your opinion?

 I think it's hard to say in the abstract.  I'd suggest adding the
 hooks you feel like you need and see what the resulting patch looks
 like.  Then post that and we can talk about how (and whether) to do it
 better.  My personal bet is that triggers are the wrong way to do
 something like this; I'd look to do it all with hooks.  Of course,
 figuring how to position those hooks so that they are maintainable and
 don't affect performance when not used is the tricky part.

Yep, right now the whole idea is still in my head, so it will take some
additional time to shape it into patch form.
The reason I'm inclined to use triggers is that they are already supported
at the places where I want to get control. So they let me avoid the effort
of adding and maintaining similar features within the limited time slot.

 I'd like to find out the best way to implement this table-caching
 mechanism within scope of v9.4 functionality set.
 Any ideas, comments or suggestions are welcome.

 I think that Tom is right: there's not time to get something like this
 done for 9.4.  If you start working relatively soon, I think you could
 hope to have a good proposal in time for 9.5.

Yes, I agree his suggestion is realistic. It is nearly impossible to
get the whole table-cache mechanism done by the CF3 deadline this week,
so I plan to implement it as an extension using facilities in v9.4.

It seems to me the last piece needed to implement this feature (custom-scan
is a prerequisite, of course) is a hook in vacuum.
Does it make sense to propose some other worthwhile but small contrib
module that utilizes this last piece? For example, a contrib module that
reports the progress of vacuum...
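
To make that last piece concrete, a rough sketch of such a page-vacuum hook
follows; the hook name, its signature and the invalidation callback are all
assumptions for illustration, not existing PostgreSQL symbols:

/* Hypothetical sketch only: none of these symbols exist in PostgreSQL today. */
#include "postgres.h"
#include "storage/block.h"
#include "utils/rel.h"

/* assumed hook: called by lazy_scan_heap() once per page it has vacuumed,
 * e.g.
 *     if (heap_page_vacuumed_hook)
 *         (*heap_page_vacuumed_hook) (onerel, blkno);
 */
typedef void (*heap_page_vacuumed_hook_type) (Relation onerel, BlockNumber blkno);
heap_page_vacuumed_hook_type heap_page_vacuumed_hook = NULL;

/* extension side: drop cached entries indexed by ctid on the vacuumed page */
static void
ccache_page_vacuumed(Relation onerel, BlockNumber blkno)
{
    /* placeholder for the real cache lookup/invalidation logic */
    elog(DEBUG1, "invalidate columnar cache of relation %u, block %u",
         RelationGetRelid(onerel), blkno);
}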

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What's needed for cache-only table scan?

2013-11-12 Thread Kohei KaiGai
2013/11/12 Claudio Freire klaussfre...@gmail.com:
 On Tue, Nov 12, 2013 at 11:45 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 Hello,

 It is a brief design proposal of a feature I'd like to implement on top of
 custom-scan APIs. Because it (probably) requires a few additional base
 features not only custom-scan, I'd like to see feedback from the hackers.

 The cache-only table scan, being in subject line, is an alternative scan
 logic towards sequential scan if all the referenced columns are cached.
 It shall allow to scan a particular table without storage access, thus
 make scan performance improved.
 So what? Which is difference from large shared_buffers configuration?
 This mechanism intends to cache a part of columns being referenced
 in the query, not whole of the records. It makes sense to the workloads
 that scan a table with many columns but qualifier references just a few
 columns, typically used to analytic queries, because it enables to
 reduce memory consumption to be cached, thus more number of records
 can be cached.
 In addition, it has another role from my standpoint. It also performs as
 fast data supplier towards GPU/MIC devices. When we move data to
 GPU device, the source address has to be a region marked as page-
 locked that is exempted from concurrent swap out, if we want CUDA
 or OpenCL to run asynchronous DMA transfer mode; the fastest one.


 Wouldn't a columnar heap format be a better solution to this?

I've implemented this using FDW; however, it requires the application to
adjust its SQL to replace CREATE TABLE with CREATE FOREIGN TABLE. In
addition, it loses good features of the regular heap, such as index scan
when its cost is smaller than a sequential columnar scan.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What's needed for cache-only table scan?

2013-11-12 Thread Kohei KaiGai
2013/11/12 Tom Lane t...@sss.pgh.pa.us:
 Kohei KaiGai kai...@kaigai.gr.jp writes:
 2013/11/12 Tom Lane t...@sss.pgh.pa.us:
 There's no possible way you'll finish this for 9.4.

 Yes, I understand it is not possible to submit whole of the patch until
 CF3 deadline. So, I'd like to find out a way to implement it as an
 extension using facilities being supported or to be enhanced on v9.4.

 Oh!  Okay, I misunderstood the context --- you meant this as an example
 use-case for the custom plan feature, right?  Makes more sense now.

 I'm still dubious that it'd actually be very useful in itself, but it
 seems reasonable as a test case to make sure that a set of hooks for
 custom plans are sufficient to do something useful with.

Yes. I intend to build most of this table-caching feature on top of the
custom-scan API set, even though it may need additional hooks, due to its
nature, that are independent of the planner and executor structure.

So, do you think it is a feasible approach to focus on the custom-scan
APIs during the upcoming CF3, and then on the table-caching feature as a
use case of these APIs in CF4?

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What's needed for cache-only table scan?

2013-11-12 Thread Kohei KaiGai
2013/11/12 Tom Lane t...@sss.pgh.pa.us:
 Kohei KaiGai kai...@kaigai.gr.jp writes:
 So, are you thinking it is a feasible approach to focus on custom-scan
 APIs during the upcoming CF3, then table-caching feature as use-case
 of this APIs on CF4?

 Sure.  If you work on this extension after CF3, and it reveals that the
 custom scan stuff needs some adjustments, there would be time to do that
 in CF4.  The policy about what can be submitted in CF4 is that we don't
 want new major features that no one has seen before, not that you can't
 make fixes to previously submitted stuff.  Something like a new hook
 in vacuum wouldn't be a major feature, anyway.

Thanks for this clarification.
Three days is too short to write a patch; however, two months may be
sufficient to develop a feature on top of the scheme discussed in the
previous commitfest.

Best regards,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Exempting superuser from row-security isn't enough. Run predicates as DEFINER?

2013-11-11 Thread Kohei KaiGai
Hi Craig,

I'd like to vote for the last option. It is a separate problem (or it
might even be the intended specification), I think.

According to the documentation of views,
http://www.postgresql.org/docs/devel/static/sql-createview.html

| Access to tables referenced in the view is determined by permissions of
| the view owner. In some cases, this can be used to provide secure but
| restricted access to the underlying tables. However, not all views are secure
| against tampering; see Section 38.5 for details. Functions called in the view
| are treated the same as if they had been called directly from the query using
| the view. Therefore the user of a view must have permissions to call all
| functions used by the view.

We check permissions on the tables underlying a view with the view owner's
privileges, but permissions on the functions are checked as the user who
issued the query.
I'm not certain of the reason for this behavior, other than implementation
reasons: table permissions are checked in ExecCheckRTEPerms(), while
function permissions are checked at ExecEvalFunc(), so the invocation
context is uncertain.
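
Roughly speaking, and simplified (this is an illustration, not the verbatim
backend code; the wrapper function and object names are invented), the two
checks look like this:

#include "postgres.h"
#include "miscadmin.h"
#include "nodes/parsenodes.h"
#include "utils/acl.h"

/* illustration only: contrast of the two privilege checks */
static void
show_permission_asymmetry(RangeTblEntry *rte, Oid funcid)
{
    Oid     userid;

    /* table side, as in ExecCheckRTEPerms(): checked as the view owner
     * when the rtable entry carries checkAsUser */
    userid = OidIsValid(rte->checkAsUser) ? rte->checkAsUser : GetUserId();
    if (pg_class_aclcheck(rte->relid, userid, ACL_SELECT) != ACLCHECK_OK)
        aclcheck_error(ACLCHECK_NO_PRIV, ACL_KIND_CLASS, "some_table");

    /* function side: EXECUTE is checked as the current (calling) user
     * when the executor first invokes the function */
    if (pg_proc_aclcheck(funcid, GetUserId(), ACL_EXECUTE) != ACLCHECK_OK)
        aclcheck_error(ACLCHECK_NO_PRIV, ACL_KIND_PROC, "some_function");
}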

Anyway, it is a possible issue independent of the existence of the RLS
feature. So we may need to consider another solution, if we reach a
consensus that it is a problem to be tackled.
However, I think RLS is not a suitable solution for this scenario.

Thanks,

2013/11/11 Craig Ringer cr...@2ndquadrant.com:
 Hi all

 I'm thinking about a possible solution for one of the row-security
 issues - the ability of a malicious user to write a row-security policy
 containing a malicious predicate function, then trick another user into
 SELECTing from the table and running the function.

 What about running the row-security predicate subquery as the predicate
 definer - the user who SET the predicate - or the owner of the object
 the predicate applies to? How expensive are security context and user id
 changes - and is it even practical to do this within the context of a
 security barrier subquery?

 Oracle and Teradata get around this by making the assignment of row
 security constraints a highly privileged operation - table owners can't
 set their own. That's the other option IMO.

 We already have this as a known issue with VIEWs,
 CHECK constraints, etc, as shown below - so I'm tempted to
 hand-wave around it as a separate issue, and just say that you shouldn't
 access objects created by untrusted users.

 The problem is that CHECK constraints, VIEW access, etc doesn't affect
 pg_dump, it won't run view predicates or check constraint code. By
 contrast, pg_dump *does* read tables, so it needs to be exempted from
 row-security predicates defined by untrusted users. An exemption option
 is needed for performance anyway.

 Protecting pg_dump and the superuser alone aren't good enough, though.
 SuperSecretUser shouldn't have to fear SELECTing from a view written by
 user NewThisWeek.

 On a side note, pg_restore and psql running dump scripts *do* affect
 restores, which is kind of nasty.

 Here's a demo showing how to create a new superuser with a known
 password as a regular unprivileged user if you can trick the superuser
 into looking at one of your objects:


 CREATE USER user1;

 SET SESSION AUTHORIZATION user1;

 CREATE OR REPLACE FUNCTION haha() RETURNS boolean AS $$
 BEGIN
 RAISE NOTICE 'haha running as: %',current_user;
 CREATE USER haha PASSWORD 'haha' SUPERUSER;
 RETURN true;
 END;
 $$ LANGUAGE plpgsql;

 CREATE TABLE check_exploit ( a integer check (haha()) );

 CREATE VIEW view_exploit AS SELECT * FROM check_exploit WHERE haha();

 GRANT ALL ON check_exploit TO postgres;
 GRANT ALL ON view_exploit TO postgres;

 -- later, superuser comes along and looks at the table/view:
 SET SESSION AUTHORIZATION postgres;


 regress=# select * from view_exploit;
 NOTICE:  haha running as: postgres
  a
 ---
  1
 (1 row)

 regress=# \du haha
List of roles
  Role name | Attributes | Member of
 ---++---
  haha  | Superuser  | {}

 regress=# DROP USER haha;



 or for an admin reason adds/ modifies a row in the table:

 regress=# INSERT INTO check_exploit VALUES (100);
 NOTICE:  haha running as: postgres
 INSERT 0 1


 This even works with SECURITY BARRIER views, since they do nothing to
 control the user ID the view predicate runs as.

 If the superuser dumps this database then restores the schema and data
 as two separate passes, haha will run via the check constraint in that
 case too. Ouch.


 So, we've got a few options:

 * Run the RS constraint subqueries as DEFINER or table owner (if
 possible and performant)

 * Restrict the ability to set RS predicates to superuser by default, and
 create a right to allow it to be delegated.

 * Call it a separate problem and deal with it later, since similar
 issues already apply to VIEW, CHECK, etc. Document that running pg_dump
 as a user without RS exemption rights can run untrusted code written by
 other users.

 --
  

Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2013-11-11 Thread Kohei KaiGai
Hi,

I tried to write up a wiki page introducing how custom-scan works.

https://wiki.postgresql.org/wiki/CustomScanAPI

Any comments are welcome.

2013/11/6 Kohei KaiGai kai...@kaigai.gr.jp:
 The attached patches provide a feature to implement custom scan node
 that allows extension to replace a part of plan tree with its own code
 instead of the built-in logic.
 In addition to the previous proposition, it enables us to integrate custom
 scan as a part of candidate paths to be chosen by optimizer.
 Here is two patches. The first one (pgsql-v9.4-custom-scan-apis) offers
 a set of API stuff and a simple demonstration module that implement
 regular table scan using inequality operator on ctid system column.
 The second one (pgsql-v9.4-custom-scan-remote-join) enhances
 postgres_fdw to support remote join capability.

 Below is an example to show how does custom-scan work.

 We usually run sequential scan even if clause has inequality operator
 that references ctid system column.

 postgres=# EXPLAIN SELECT ctid,* FROM t1 WHERE ctid  '(10,0)'::tid;
QUERY PLAN
 
  Seq Scan on t1  (cost=0.00..209.00 rows= width=43)
Filter: (ctid  '(10,0)'::tid)
 (2 rows)

 An extension that performs as custom-scan provider suggests
 an alternative path, and its cost was less than sequential scan,
 thus optimizer choose it.

 postgres=# LOAD 'ctidscan';
 LOAD
 postgres=# EXPLAIN SELECT ctid,* FROM t1 WHERE ctid  '(10,0)'::tid;
   QUERY PLAN
 --
  Custom Scan (ctidscan) on t1  (cost=0.00..100.00 rows= width=43)
Filter: (ctid  '(10,0)'::tid)
 (2 rows)

 Of course, more cost effective plan will win if exists.

 postgres=# EXPLAIN SELECT ctid,* FROM t1 WHERE ctid  '(10,0)'::tid AND a = 
 200;
 QUERY PLAN
 ---
  Index Scan using t1_pkey on t1  (cost=0.29..8.30 rows=1 width=43)
Index Cond: (a = 200)
Filter: (ctid  '(10,0)'::tid)
 (3 rows)

 One other worthwhile example is remote-join enhancement on the
 postgres_fdw as follows. Both of ft1 and ft2 are foreign table being
 managed by same foreign server.

 postgres=# EXPLAIN (verbose) SELECT * FROM ft1 JOIN ft2 ON a = x
   WHERE f_leak(b) AND y
 like '%aaa%';
QUERY PLAN
 --
  Custom Scan (postgres-fdw)  (cost=100.00..100.01 rows=0 width=72)
Output: a, b, x, y
Filter: f_leak(b)
Remote SQL: SELECT r1.a, r1.b, r2.x, r2.y FROM (public.ft1 r1 JOIN
 public.ft2 r2 ON ((r1.a = r2.x))) WHERE ((r2.y ~~ '%aaa%'::text))
 (4 rows)

 ---
 How does it works
 ---
 This patch adds two hooks (for base and join relations) around allpaths.c
 and joinpaths.c. It allows extensions to add alternative paths to handle
 scanning on the base relation or join of two relations.

 Its callback routine can add CustomPath using add_path() to inform
 optimizer this alternative scan path. Every custom-scan provider is
 identified by its name being registered preliminary using the following
 function.

   void register_custom_provider(const CustomProvider *provider);

 CustomProvider is a set of name string and function pointers of callbacks.
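
 For instance, under the proposed API an extension's _PG_init() might look
 roughly like the following; only register_custom_provider() is spelled out
 above, so the CustomProvider field names and the callback set are
 assumptions:

/* Sketch against the *proposed* API; field/callback names are assumed. */
#include "postgres.h"
#include "fmgr.h"
#include "nodes/execnodes.h"
/* plus the (proposed) header that would define CustomProvider */

PG_MODULE_MAGIC;

static void ctidscan_begin(ScanState *ss);      /* defined further below */

static CustomProvider ctidscan_provider;

void
_PG_init(void)
{
    memset(&ctidscan_provider, 0, sizeof(ctidscan_provider));
    snprintf(ctidscan_provider.name, sizeof(ctidscan_provider.name),
             "ctidscan");                               /* assumed field */
    ctidscan_provider.BeginCustomScan = ctidscan_begin; /* assumed field */

    register_custom_provider(&ctidscan_provider);       /* proposed API */
}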

 Once CustomPath got chosen, create_scan_plan() construct a custom-
 scan plan and calls back extension to initialize the node.
 Rest of portions are similar to foreign scan, however, some of detailed
 portions are different. For example, foreign scan is assumed to return
 a tuple being formed according to table definition. On the other hand,
 custom-scan does not have such assumption, so extension needs to
 set tuple-descriptor on the scan tuple slot of ScanState structure by
 itself.
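
 Continuing the sketch, the begin-scan callback would have to install the
 scan tuple descriptor by itself; the callback signature is again an
 assumption:

/* continuation of the sketch: core does not set up the scan tuple slot
 * for a custom scan node, so the extension must do it */
#include "executor/executor.h"
#include "utils/rel.h"

static void
ctidscan_begin(ScanState *ss)
{
    Relation    rel = ss->ss_currentRelation;

    /* unlike foreign scan, nobody chooses the descriptor for us;
     * here we simply mirror the underlying table definition */
    ExecAssignScanType(ss, RelationGetDescr(rel));
}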

 In case of join, custom-scan performs as like a regular scan but it
 returns tuples being already joined on underlying relations.
 The patched postgres_fdw utilizes a hook at joinpaths.c to run
 remote join.

 
 Issues
 
 I'm not 100% certain whether arguments of add_join_path_hook is
 reasonable. I guess the first 7 arguments are minimum necessity.
 The mergeclause_list and semifactors might be useful if someone
 tries to implement its own mergejoin or semijoin. Also, I'm not
 good at usage of path parameterization, but the last two arguments
 are related to. Where is the best code to learn about its usage?

 +/* Hook for plugins to add custom join path, in addition to default ones */
 +typedef void (*add_join_path_hook_type)(PlannerInfo *root,
 +   RelOptInfo *joinrel,
 +   RelOptInfo *outerrel,
 +   RelOptInfo *innerrel

Re: [HACKERS] [v9.4] row level security

2013-11-06 Thread Kohei KaiGai
2013/11/6 Craig Ringer cr...@2ndquadrant.com:
 On 11/05/2013 09:36 PM, Robert Haas wrote:
 I haven't studied this patch in detail, but I see why there's some
 unhappiness about that code: it's an RLS-specific kluge.  Just
 shooting from the hip here, maybe we should attack the problem of
 making security-barrier views updatable first, as a separate patch.

 That's the approach I've been considering. There are a few wrinkles with
 it, though:

 (a) Updatable views are implemented in the rewriter, not the planner.
 The rewriter is not re-run when plans are invalidated or when the
 session authorization changes, etc. This means that we can't simply omit
 the RLS predicate for superuser because the same rewritten parse tree
 might get used for both superuser and non-superuser queries.

 Options:

 * Store the before-rewrite parse tree when RLS is discovered on one of
 the rels in the tree. Re-run the rewriter when re-planning. Ensure a
 change in current user always invalidates plans.

 * Declare that it's not legal to run a query originally parsed and
 rewritten as superuser as a non-superuser or vice versa. This would
 cause a great deal of pain with PL/PgSQL.

 * Always add the RLS predicate and solve the problem of reliably
 short-circuiting the user-supplied predicate for RLS-exempt users.  We'd
 need a way to allow direct (not query-based) COPY TO for tables with RLS
 applied, preventing the rewriting of direct table access into subqueries
 for COPY, but since we never save plans for COPY that may be fine.

 * ... ?

How about an idea that uses two different types of rules: the existing
kind is expanded prior to the planner stage as we do now, and the newer
kind is expanded at the head of the planner stage.
The argument of planner() is still a parse tree, so it seems to me there
is no serious problem in calling the rewriter again to handle second-stage
rules. If we go with this approach, ALTER TABLE ... SET ROW SECURITY
would become a synonym for declaring a rule with a special attribute.
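
A minimal sketch of how planner() itself could be adjusted, assuming a
hypothetical expand_row_security_rules() that performs the late expansion:

#include "postgres.h"
#include "optimizer/planner.h"

/* hypothetical: expands the "late" (row-security) rules in the parse tree */
extern Query *expand_row_security_rules(Query *parse);

PlannedStmt *
planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
{
    /*
     * Expand second-stage rules here, at the head of the planner, so the
     * expansion happens every time we (re)plan and is never reused across
     * a change of the current user.
     */
    parse = expand_row_security_rules(parse);   /* hypothetical call */

    if (planner_hook)
        return (*planner_hook) (parse, cursorOptions, boundParams);

    return standard_planner(parse, cursorOptions, boundParams);
}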

 (b) Inheritance is a problem when RLS is done in the rewriter. As I
 understood it from Kohei KaiGai's description to me earlier, there was a
 strong preference on -hackers to enforce RLS predicates for child and
 parent tables completely independently. That's how RLS currently works,
 but it might be hard to get the same effect when applying RLS in the
 rewriter. We'd need to solve that, or redefine RLS's behaviour so that
 the predicate on a parent table applies to any child tables too.
 Personally I'd prefer the latter.

I'm not certain whether it was a strong preference, even though I followed
the consensus at that time. So I think it makes sense to discuss how the
RLS policy should be enforced on child tables.
As long as we get a consistent view of a child table even when it is
referenced without its parent, I have no argument against your preference.
It also makes the implementation simpler than the approach I tried, which
enforces the RLS policy of each table individually by utilizing the
existing rule mechanism.
It is not difficult to enforce the parent's RLS policy on a child relation
even when it is referenced individually: all we need to do specially is
append the RLS policy of its parent, not only the child's own, if the
referenced table has a parent.
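
A rough sketch of that last step; every helper below is hypothetical and
only shows the intended shape:

#include "postgres.h"
#include "nodes/pg_list.h"

/* hypothetical lookups */
extern Node *get_row_security_qual(Oid relid);   /* policy of one relation */
extern Oid   get_parent_relid(Oid relid);        /* pg_inherits walk, or InvalidOid */

/*
 * Collect the row-security quals to enforce on 'relid': its own policy
 * plus the policies of all its ancestors, so a child referenced directly
 * still shows a view consistent with scans through the parent.
 */
static List *
collect_row_security_quals(Oid relid)
{
    List   *quals = NIL;

    while (OidIsValid(relid))
    {
        Node   *qual = get_row_security_qual(relid);

        if (qual != NULL)
            quals = lappend(quals, qual);
        relid = get_parent_relid(relid);
    }
    return quals;
}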

 (c) RLS might interact differently with rules declared on tables if
 implemented in the rewriter, so some investigation into that would be
 needed.

 I
 would think that if we make that work, this will also work without,
 hopefully, any special hackery.  And we'd get a separate,
 independently useful feature out of it, too.

 I tend to agree. I'm just a bit concerned about dealing with the issues
 around RLS-exempt operations and users.

 --
  Craig Ringer   http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training  Services



-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How should row-security affects ON UPDATE RESTRICT / CASCADE ?

2013-10-29 Thread Kohei KaiGai
2013/10/29 Tom Lane t...@sss.pgh.pa.us:
 Craig Ringer cr...@2ndquadrant.com writes:
 During my testing of Kohei KaiGai's row-security patches I've been
 looking into how foreign keys should be and are handled. There are some
 interesting wrinkles around FK cascades, the rights under which FK
 checks execute, and about the consistency effects of changing or
 applying an RLS policy.

 As I recall, I've been saying since day one that row-level security cannot
 sensibly coexist with foreign-key constraints, and I've been told that the
 potential users of such a feature don't care.  I'm glad to see somebody
 else complaining.

Not only with RLS: it is unavoidable that someone can infer invisible records
using FK constraints, even if either the referencing or the referenced
records are protected by column-level database privileges.
I don't remember how many times we have discussed this topic.
Its conclusion was that access control by itself cannot prevent this
information leak (one bit: whether a particular key exists or not) through
an FK constraint; however, the feature as a whole still makes sense as long
as the environment where RLS+PostgreSQL is installed tolerates such a small
amount of information leakage.
In cases where the environment cannot allow even one bit to leak, it is not
a reasonable solution, although I don't know of a reasonable solution under
that prerequisite.
All the other commercial databases stand on the same assumption.
Even though their promotional white papers might not say so, their solutions
of course have the same weakness and may leak something.

 Here's another example wherein there basically isn't a sensible solution:
 suppose you have delete rights on table A, and there is a table B
 with a foreign-key reference to A, and RLS says that there are rows in
 B that you can't see.  You try to delete some row in A that is referenced
 by an invisible-to-you row in B.  There are only two possible outcomes:
 the system refuses your request, and thereby exposes to you the fact that
 a referencing row exists; or the system allows the FK constraint to be
 violated.

My vote is that the system should maintain referential integrity as if no
RLS policy were configured. That is more fundamental than any per-user
RLS policy.

 As far as the points you're making go, I think we must say that RLS checks
 are not applied during FK trigger queries, ie the FK triggers can always
 see everything even though they don't run as superuser.

My existing implementation does exactly that. If a record is referenced
by invisible records, its deletion fails in spite of the information
leakage.

  Otherwise you're
 going to end up with constraint violations, and as a database weenie
 I consider that unacceptable.  This will mean that a poorly-chosen FK
 arrangement will allow some leakage of row-existence info, but I don't
 believe that that can be avoided anyway, per the above example.

OK, let's drop table-level and column-level privileges as well: they can
also leak the existence of invisible records, even if the user has no
privilege to reference them. :-)
Every tool has its expected usage and the situations it suits.
The important thing is to use a feature with an understanding of its purpose
and limitations. As everybody knows, there is no silver bullet for security,
but a useful tool can help us, depending on the situation.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How should row-security affects ON UPDATE RESTRICT / CASCADE ?

2013-10-29 Thread Kohei KaiGai
2013/10/30 Craig Ringer cr...@2ndquadrant.com:
 On 10/30/2013 10:50 AM, Tom Lane wrote:
 Craig Ringer cr...@2ndquadrant.com writes:
  I'd kind of like to see FK constraints affected by RLS for
  non-superusers, at least as an option.
 I think that's a complete nonstarter.  Aside from the fact that such a
 constraint will have no definable semantics, even the possibility that a
 constraint doesn't mean what it appears to mean will prevent us from
 making use of FK constraints for optimization --- something that's
 pretty high on the planner to-do list.

 Good point. That's another good argument for FK constraints to disregard
 RLS. In which case, is there actually any way to determine when an SPI
 query is being invoked directly from an FK constraint? We'll need a way
 to tell so RLS can skip adding the row-security check predicate.

For your reference, my implementation patches ri_PerformCheck() as
follows. It does not skip the check in every case (only when the PK side
is modified); however, the overall idea can be shared.

--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -3008,6 +3008,7 @@ ri_PerformCheck(const RI_ConstraintInfo *riinfo,
int spi_result;
Oid save_userid;
int save_sec_context;
+   int temp_sec_context;
Datum   vals[RI_MAX_NUMKEYS * 2];
charnulls[RI_MAX_NUMKEYS * 2];

@@ -3087,8 +3088,18 @@ ri_PerformCheck(const RI_ConstraintInfo *riinfo,

/* Switch to proper UID to perform check as */
GetUserIdAndSecContext(save_userid, save_sec_context);
+
+   /*
+* Row-level security should be disabled in case when foreign-key
+* relation is queried to check existence of tuples that references
+* the primary-key being modified.
+*/
+   temp_sec_context = save_sec_context | SECURITY_LOCAL_USERID_CHANGE;
+   if (source_is_pk)
+   temp_sec_context |= SECURITY_ROW_LEVEL_DISABLED;
+
SetUserIdAndSecContext(RelationGetForm(query_rel)-relowner,
-  save_sec_context | SECURITY_LOCAL_USERID_CHANGE);
+  temp_sec_context);

/* Finally we can run the query. */
spi_result = SPI_execute_snapshot(qplan,
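
On the consumer side, the RLS expansion would then skip appending the
security-barrier sub-query whenever that bit is set; roughly like this
(sketch only, the surrounding function is invented, and
SECURITY_ROW_LEVEL_DISABLED is the flag proposed above):

#include "postgres.h"
#include "miscadmin.h"

/* sketch: decide whether the row-security qual should be applied here */
static bool
apply_row_security_now(void)
{
    Oid     curr_userid;
    int     curr_sec_context;

    GetUserIdAndSecContext(&curr_userid, &curr_sec_context);

    /*
     * ri_PerformCheck() above sets SECURITY_ROW_LEVEL_DISABLED before
     * running its internal query, so FK integrity checks see the table
     * without the RLS qual appended.
     */
    if (curr_sec_context & SECURITY_ROW_LEVEL_DISABLED)
        return false;

    return true;
}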

-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.4] row level security

2013-10-16 Thread Kohei KaiGai
Since the second commitfest has ended, this was moved to the next
commitfest.

In my personal opinion, a few pieces of groundwork are probably needed to
get RLS committable, prior to RLS itself.

One is correct handling of the scenario that Korotkov pointed out. (How
was that concluded?) I think it is a problem we can fix in a reasonable
way. The likely solution is to prohibit showing the number of rows being
filtered when the plan node underlies a sub-query with the
security-barrier attribute.

The other thing I'm concerned about is that the existing implementation
assumes a certain rtable entry acts as both the source relation and the
result relation at the same time on DELETE and UPDATE.
We usually scan a regular relation to fetch the ctid of the tuples to be
modified, then ModifyTable deletes or updates the identified tuples. This
has not been a problem for a long period, even though the same rtable
entry is used to point to the relation being scanned and the relation
being modified.
However, it makes the RLS implementation complicated, because this feature
tries to replace the rtable entry of a relation with a row-level security
policy by a simple sub-query with the security-barrier attribute. Our
implementation does not assume that a sub-query acts as a result relation,
so I had to add some ad-hoc coding, like adjusting the varno/varattno of
Var objects, to avoid problems.
I think the solution is to separate the result-relation rtable entry from
the rtable entry to be scanned. It makes the RLS feature easier to
implement, because all we need to do is replace the rtable entry for the
scanning stage, with no need to care about ModifyTable operating on a
sub-query.

Is this the right direction to go?

Thanks,

2013/10/10 Marc Munro m...@bloodnok.com:
 On Wed, 2013-09-04 at 14:35 +, Robert Haas wrote:

 On Fri, Aug 30, 2013 at 3:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  I think it's entirely sensible to question whether we should reject
 (not
  hold up) RLS if it has major covert-channel problems.

 We've already had this argument before, about the security_barrier
 [ . . . ]

 Sorry for following up on this so late, I have just been trying to catch
 up with the mailing lists.

 I am the developer of Veil, which this thread mentioned a number of
 times.  I wanted to state/confirm a number of things:

 Veil is not up to date wrt Postgres versions.  I didn't release a new
 version for 9.2, and when no-one complained I figured no-one other than
 me was using it.  I'll happily update it if anyone wants it.

 Veil makes no attempt to avoid covert channels.  It can't.

 Veil is a low-level toolset designed for optimising queries about
 privileges.  It allows you to build RLS with reasonable performance, but
 it is not in itself a solution for RLS.

 I wish the Postgres RLS project well and look forward to its release in
 Postgres 9.4.

 __
 Marc





-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

