Re: [HACKERS] list of credits for release notes

2017-10-03 Thread Kohei KaiGai
2017-10-03 6:09 GMT+09:00 Tatsuo Ishii :
>> On Wed, Sep 27, 2017 at 8:29 PM, Michael Paquier
>>  wrote:
>>> Looking at this list, the first name is followed by the family name,
>>> so there are inconsistencies with some Japanese names:
>>> - Fujii Masao should be Masao Fujii.
>>> - KaiGai Kohei should be Kohei Kaigai.
>>
>> But his emails say Fujii Masao, not Masao Fujii.
>
> Michael is correct.
>
> Sometimes people choose the family name first in their emails.  However, I am
> sure "Fujii" is the family name and "Masao" is the first name.
>
>> KaiGai's case is a bit trickier, as his email name has changed over time.
>
> Michael is correct about Kaigai too.
>
I set up my personal e-mail account in the ${FAMILYNAME} ${FIRSTNAME}
style, following the eastern convention.
However, my last company enforced a centralized e-mail account policy,
so ${FIRSTNAME} ${LASTNAME} was shown whenever I posted a message
from the jp.nec.com domain.
That is why my email name has changed.

Thanks,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] Float value 'Infinity' is cast to numeric 1 on Windows

2017-09-27 Thread Kohei KaiGai
Hello,

Does it make sense to put an "isinf(val)" check and ereport() prior to
the snprintf()?
It is a lighter-weight check than going through set_var_from_str().
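A minimal sketch of what I have in mind, inside float8_numeric() (assuming the
usual out-of-range error code and message wording; just an illustration, not
the final patch):

    if (isnan(val))
        PG_RETURN_NUMERIC(make_result(&const_nan));

    /* reject +/-Infinity before snprintf(), which Visual Studio renders as "1.#INF" */
    if (isinf(val))
        ereport(ERROR,
                (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                 errmsg("cannot convert infinity to numeric")));

    snprintf(buf, sizeof(buf), "%.*g", DBL_DIG, val);
    (void) set_var_from_str(buf, buf, &result);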

Thanks,

2017-09-27 19:41 GMT+09:00 Taiki Kondo :
> Hi all,
>
> I built PostgreSQL 9.6.5 with Visual Studio 2013 on Windows, and I found weird
> behavior on it.
>
> I executed the following SQL; an ERROR was expected, but I got the mysterious
> value 1.
>
>
> postgres=# select (float8 'inf')::numeric;
>  numeric
> -
>1
> (1 row)
>
>
> On Windows, at least with Visual Studio, the string for a float/double value
> meaning infinity is "1.#INF", not "inf". So, set_var_from_str() called from
> float8_numeric() in utils/adt/numeric.c will return the numeric value 1, and
> no one checks this situation.
>
> The same situation also occurs when casting from float4.
>
> I wrote a patch to add a check for this situation.
> Please find it attached.
>
>
> Sincerely,
>
> --
> Taiki Kondo
> NEC Solution Innovators, Ltd.
>
>
>
>
>



-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 




Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-24 Thread Kohei KaiGai
2017-02-25 1:40 GMT+09:00 Claudio Freire :
> On Fri, Feb 24, 2017 at 1:24 PM, Robert Haas  wrote:
>> On Fri, Feb 24, 2017 at 7:29 PM, Kohei KaiGai  wrote:
>>> The attached patch re-designs the previous v2 according to Robert's
>>> comment.
>>> All I had to do was put an entrypoint for ForeignScan/CustomScan at
>>> ExecShutdownNode, because it is already modified to call the child nodes
>>> first, earlier than ExecShutdownGather, which releases the DSM segment.
>>
>> Aside from the documentation, which needs some work, this looks fine
>> to me on a quick read.
>
> LGTM too.
>
> You may want to clarify in the docs when the hook is called in
> relation to other hooks too.
>
I tried to add a description of how the custom-scan callbacks behave under the
executor, and roughly when they are invoked.
However, it is fundamentally not easy for most people, because it assumes the
FDW/CSP author understands the overall behavior of the optimizer and executor,
not only the API specifications.

Thanks,
-- 
KaiGai Kohei 


parallel-finish-fdw_csp.v4.patch
Description: Binary data



Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-24 Thread Kohei KaiGai
Hello,

The attached patch re-designs the previous v2 according to Robert's comment.
All I had to do was put an entrypoint for ForeignScan/CustomScan at
ExecShutdownNode, because it is already modified to call the child nodes
first, earlier than ExecShutdownGather, which releases the DSM segment.
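Roughly, the shutdown walk looks like the following (a simplified sketch;
ExecShutdownForeignScan/ExecShutdownCustomScan stand for the entrypoints added
by the patch):

    bool
    ExecShutdownNode(PlanState *node)
    {
        if (node == NULL)
            return false;

        /* shut down the children first, while their state is still valid */
        planstate_tree_walker(node, ExecShutdownNode, NULL);

        switch (nodeTag(node))
        {
            case T_GatherState:
                ExecShutdownGather((GatherState *) node);           /* detaches the DSM */
                break;
            case T_ForeignScanState:
                ExecShutdownForeignScan((ForeignScanState *) node); /* FDW entrypoint */
                break;
            case T_CustomScanState:
                ExecShutdownCustomScan((CustomScanState *) node);   /* CSP entrypoint */
                break;
            default:
                break;
        }
        return false;
    }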

Thanks,

2017-02-20 9:25 GMT+09:00 Kouhei Kaigai :
>> -Original Message-
>> From: pgsql-hackers-ow...@postgresql.org
>> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
>> Sent: Monday, February 20, 2017 2:20 AM
>> To: Kaigai Kouhei(海外 浩平) 
>> Cc: Claudio Freire ; Amit Kapila
>> ; pgsql-hackers 
>> Subject: Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside
>> ExecEndGather)
>>
>> On Fri, Feb 17, 2017 at 12:46 PM, Kouhei Kaigai 
>> wrote:
>> > The attached patch is revised one.
>> >
>> > Invocation of Exec(Foreign|Custom)ParallelFinish was moved to
>> > ExecParallelRetrieveInstrumentation() so as not to walk the plan-state
>> > tree twice.
>> > One (hypothetical) downside is that FDW/CSP can retrieve its own run-time
>> > statistics only when the query is executed under EXPLAIN ANALYZE.
>> >
>> > This enhancement allows FDW/CSP to collect specific run-time
>> > statistics beyond Instrumentation, then show them in the output of
>> > EXPLAIN. My expected examples are GPU kernel execution time, DMA
>> > transfer ratio and so on. These statistics will never appear in the
>> > Instrumentation structure; however, they can be the hot spot of a performance
>> > bottleneck if CustomScan works on background workers.
>>
>> Would gather_shutdown_children_first.patch from
>> https://www.postgresql.org/message-id/CAFiTN-s5KuRuDrQCEpiHHzmVf7JTtbn
>> b8eb10c-6aywjdxb...@mail.gmail.com
>> help with this problem also?  Suppose we did that, and then also added an
>> ExecShutdownCustom method.  Then you'd definitely be able to get control
>> before the DSM went away, either from ExecEndNode() or ExecShutdownNode().
>>
> Ah, yes, I couldn't find any problem with the above approach.
> ExecShutdownGather() can be called by either ExecShutdownNode() or
> ExecEndGather(). This patch allows CSP/FDW to have an entrypoint
> prior to the release of the DSM.
>
> Thanks,
> 
> PG-Strom Project / NEC OSS Promotion Center
> KaiGai Kohei 
>
>



-- 
KaiGai Kohei 


parallel-finish-fdw_csp.v3.patch
Description: Binary data



Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2017-01-02 Thread Kohei KaiGai
Oops, I overlooked that this patch was marked as "returned with feedback",
not "moved to the next CF".

Its status has not changed since the last update. (The code was revised
according to the last comment by Jeevan, but CF-Nov ran out of time at
that point.)

How should I handle the patch?

2016-12-05 16:49 GMT+09:00 Kouhei Kaigai :
> Hello,
>
> Sorry for my late response.
> The attached patch reflects your comments.
>
>> Here are few comments on latest patch:
>>
>>
>> 1.
>> make/make check is fine, however I am getting regression failure in
>> postgres_fdw contrib module (attached regression.diff).
>> Please investigate and fix.
>>
> It was an incorrect interaction when postgres_fdw tries to push down
> sorting to the remote side. We cannot attach a LIMIT clause to the plain
> scan path across a SORT; however, the previous version estimated the cost
> of the plain scan with the LIMIT clause even if local sorting is needed.
> If the remote scan may return just 10 rows, the estimated cost of the local
> sort is very lightweight, thus this unreasonable path was chosen.
> (On the other hand, its query execution results were correct because
> ps_numTuples is not delivered across the Sort node, so ForeignScan eventually
> scanned all the remote tuples. It produced correct results but was not optimal
> from the viewpoint of performance.)
>
> The v3 patch estimates the cost with the remote LIMIT clause only if the
> supplied pathkey strictly matches the final output order of the query, thus
> no local sorting is expected.
>
> Some of the regression test cases still have different plans, but that is due
> to the new optimization by the remote LIMIT clause.
> Without the remote LIMIT clause, some regression test cases preferred
> remote-JOIN + local-SORT then local-LIMIT.
> Once we have the remote-LIMIT option, it allows the cost of the remote-SORT
> to be discounted by choosing a top-k heap sort.
> That changed the optimizer's decision on some test cases.
>
> Potential one big change is the test case below.
>
>  -- CROSS JOIN, not pushed down
>  EXPLAIN (VERBOSE, COSTS OFF)
>  SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 
> OFFSET 100 LIMIT 10;
>
> It assumed the CROSS JOIN was not pushed down due to the cost of network
> traffic; however, the remote LIMIT reduced the estimated number of tuples
> to be moved. So, the whole CROSS JOIN + ORDER BY + LIMIT came to run
> on the remote side.
>
>> 2.
>> + *
>> + * MEMO: root->limit_tuples is not attached when query
>> contains
>> + * grouping-clause or aggregate functions. So, we don's adjust
>> + * rows even if LIMIT  is supplied.
>>
>> Can you please explain why you are not doing this for grouping-clause or
>> aggregate functions.
>>
> See grouping_planner() in optimizer/plan/planner.c.
> It puts an invalid value into root->limit_tuples if the query has a GROUP BY
> clause, so we cannot know the number of tuples to be returned even if we
> actually have an upper limit.
>
> /*
>  * Figure out whether there's a hard limit on the number of rows that
>  * query_planner's result subplan needs to return.  Even if we know a
>  * hard limit overall, it doesn't apply if the query has any
>  * grouping/aggregation operations, or SRFs in the tlist.
>  */
> if (parse->groupClause ||
> parse->groupingSets ||
> parse->distinctClause ||
> parse->hasAggs ||
> parse->hasWindowFuncs ||
> parse->hasTargetSRFs ||
> root->hasHavingQual)
> root->limit_tuples = -1.0;
> else
> root->limit_tuples = limit_tuples;
>
>> 3.
>> Typo:
>>
>> don's  => don't
>>
> Fixed,
>
> best regards,
> 
> PG-Strom Project / NEC OSS Promotion Center
> KaiGai Kohei 
>
>
>
>



-- 
KaiGai Kohei 


passdown-limit-fdw.v3.patch
Description: Binary data



Re: [HACKERS] varlena beyond 1GB and matrix

2016-12-22 Thread Kohei KaiGai
2016-12-23 8:24 GMT+09:00 Robert Haas :
> On Wed, Dec 7, 2016 at 11:01 PM, Kohei KaiGai  wrote:
>> Regardless of the ExpandedObject, does the flatten format need to
>> contain fully flatten data chunks?
>
> I suspect it does, and I think that's why this isn't going to get very
> far without a super-varlena format.
>
Yep, I'm now investigating how to implement the typlen == -3
approach. Likely, it will be the most straightforward infrastructure for
potential use cases beyond matrix/vector.

Thanks,
-- 
KaiGai Kohei 




Re: [HACKERS] varlena beyond 1GB and matrix

2016-12-22 Thread Kohei KaiGai
2016-12-23 8:23 GMT+09:00 Robert Haas :
> On Wed, Dec 7, 2016 at 10:44 PM, Kohei KaiGai  wrote:
>>> Handling objects >1GB at all seems like the harder part of the
>>> problem.
>>>
>> I could almost get your point. Does the last line above refer to the
>> amount of data in an object >1GB, even if the "super-varlena" format
>> allows a 64-bit length?
>
> Sorry, I can't understand your question about what I wrote.
>
I thought you were just pointing out that handling a large amount of data is
always the harder part, even if the data format allows >1GB. Right?

-- 
KaiGai Kohei 




Re: [HACKERS] varlena beyond 1GB and matrix

2016-12-08 Thread Kohei KaiGai
2016-12-08 16:11 GMT+09:00 Craig Ringer :
> On 8 December 2016 at 12:01, Kohei KaiGai  wrote:
>
>>> At a higher level, I don't understand exactly where such giant
>>> ExpandedObjects would come from.  (As you point out, there's certainly
>>> no easy way for a client to ship over the data for one.)  So this feels
>>> like a very small part of a useful solution, if indeed it's part of a
>>> useful solution at all, which is not obvious.
>>>
>> I expect an aggregate function that consumes millions of rows as the source
>> of a large matrix larger than 1GB. Once it is formed into a variable, it is
>> easy to deliver it as an argument of PL functions.
>
> You might be interested in how Java has historically dealt with similar 
> issues.
>
> For a long time the JVM had quite low limits on the maximum amount of
> RAM it could manage, in the single gigabytes for a long time. Even for
> the 64-bit JVM. Once those limitations were lifted, the garbage
> collector algorithm placed a low practical limit on how much RAM it
> could cope with effectively.
>
> If you were doing scientific computing with Java, lots of big
> image/video work, using GPGPUs, doing large scale caching, etc, this
> rapidly became a major pain point. So people introduced external
> memory mappings to Java, where objects could reference and manage
> memory outside the main JVM heap. The most well known is probably
> BigMemory (https://www.terracotta.org/products/bigmemory), but there
> are many others. They exposed this via small opaque handle objects
> that you used to interact with the external memory store via library
> functions.
>
> It might make a lot of sense to apply the same principle to
> PostgreSQL, since it's much less intrusive than true 64-bit VARLENA.
> Rather than extending all of PostgreSQL to handle special-case
> split-up VARLENA extended objects, have your interim representation be
> a simple opaque value that points to externally mapped memory. Your
> operators for the type, etc, know how to work with it. You probably
> don't need a full suite of normal operators, you'll be interacting
> with the data in a limited set of ways.
>
> The main issue would presumably be one of resource management, since
> we currently assume we can just copy a Datum around without telling
> anybody about it or doing any special management. You'd need to know
> when to clobber your external segment, when to copy(!) it if
> necessary, etc. This probably makes sense for working with GPGPUs
> anyway, since they like dealing with big contiguous chunks of memory
> (or used to, may have improved?).
>
> It sounds like only code specifically intended to work with the
> oversized type should be doing much with it except passing it around
> as an opaque handle, right?
>
Thanks for your suggestion. Its characteristics look like a large object,
but volatile (because there is no backing storage).
As you mentioned, I also think its resource management is the core of the issue.
We have no reference count mechanism, so it is uncertain when we should
release an orphan memory chunk, or we have to accept the memory consumption
until clean-up of the relevant memory context.
Moreover, we have to pay attention to the scenario where this opaque identifier
is delivered to a background worker, so it needs to be valid in another
process's context. (Or should we prohibit exchanging it at the planner stage?)

> Do you need to serialize this type to/from disk at all? Or just
> exchange it in chunks with a client? If you do need to, can you
> possibly do TOAST-like or pg_largeobject-like storage where you split
> it up for on disk storage then reassemble for use?
>
Even though I don't expect this big chunk to be entirely exported to / imported
from the client at once, it makes sense to save/load this big chunk to/from
disk, because constructing a big in-memory structure consumes more CPU
cycles than a simple memory copy from buffers.
So, in other words, it does not need valid typinput/typoutput handlers, but
serialization/deserialization for disk I/O is helpful.

Thanks,
-- 
KaiGai Kohei 




Re: [HACKERS] varlena beyond 1GB and matrix

2016-12-07 Thread Kohei KaiGai
2016-12-08 8:36 GMT+09:00 Tom Lane :
> Robert Haas  writes:
>> On Wed, Dec 7, 2016 at 8:50 AM, Kohei KaiGai  wrote:
>>> I like to propose a new optional type handler 'typserialize' to
>>> serialize an in-memory varlena structure (that can have indirect
>>> references) to on-disk format.
>
>> I think it's probably a mistake to conflate objects with substructure
>> with objects > 1GB.  Those are two somewhat orthogonal needs.
>
> Maybe.  I think where KaiGai-san is trying to go with this is being
> able to turn an ExpandedObject (which could contain very large amounts
> of data) directly into a toast pointer or vice versa.  There's nothing
> really preventing a TOAST OID from having more than 1GB of data
> attached, and if you had a side channel like this you could transfer
> the data without ever having to form a larger-than-1GB tuple.
>
> The hole in that approach, to my mind, is that there are too many places
> that assume that they can serialize an ExpandedObject into part of an
> in-memory tuple, which might well never be written to disk, or at least
> not written to disk in a table.  (It might be intended to go into a sort
> or hash join, for instance.)  This design can't really work for that case,
> and unfortunately I think it will be somewhere between hard and impossible
> to remove all the places where that assumption is made.
>
Regardless of the ExpandedObject, does the flattened format need to
contain fully flattened data chunks?
If a data type internally contains multiple toast pointers, like an array,
its flattened image is likely small enough that we can store it using the
existing varlena mechanism.
One problem is that VARSIZE() will never tell us the exact total length of
the data if it references multiple GB-scale chunks.
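As a purely illustrative sketch (a hypothetical layout, not part of any
existing patch), such a flattened image could carry only a small header plus
toast pointers to the sub-chunks:

    /* hypothetical flat image: a small varlena head plus references to
     * sub-chunks, each of which is individually below the 1GB limit */
    typedef struct
    {
        int32       vl_len_;        /* standard 4-byte varlena header          */
        Oid         element_type;   /* element type of the matrix              */
        int32       nblocks_x;      /* number of sub-matrix blocks, horizontal */
        int32       nblocks_y;      /* number of sub-matrix blocks, vertical   */
        /* followed by nblocks_x * nblocks_y toast pointers (varatt_external) */
    } MatrixFlatHead;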

> At a higher level, I don't understand exactly where such giant
> ExpandedObjects would come from.  (As you point out, there's certainly
> no easy way for a client to ship over the data for one.)  So this feels
> like a very small part of a useful solution, if indeed it's part of a
> useful solution at all, which is not obvious.
>
I expect an aggregate function that consumes millions of rows as the source
of a large matrix larger than 1GB. Once it is formed into a variable, it is
easy to deliver it as an argument of PL functions.

> FWIW, ExpandedObjects themselves are far from a fully fleshed out
> concept, one of the main problems being that they don't have very long
> lifespans except in the case that they're the value of a plpgsql
> variable.  I think we would need to move things along quite a bit in
> that area before it would get to be useful to think in terms of
> ExpandedObjects containing multiple GB of data.  Otherwise, the
> inevitable flattenings and re-expansions are just going to kill you.q
>
> Likewise, the need for clients to be able to transfer data in chunks
> gets pressing well before you get to 1GB.  So there's a lot here that
> really should be worked on before we try to surmount that barrier.
>
You are pointing out the problem around the client<->server protocol, aren't you?
Likely, we will eventually need this enhancement. I agree.

Thanks,
-- 
KaiGai Kohei 




Re: [HACKERS] varlena beyond 1GB and matrix

2016-12-07 Thread Kohei KaiGai
2016-12-08 8:04 GMT+09:00 Robert Haas :
> On Wed, Dec 7, 2016 at 8:50 AM, Kohei KaiGai  wrote:
>> I like to propose a new optional type handler 'typserialize' to
>> serialize an in-memory varlena structure (that can have indirect
>> references) to on-disk format.
>> If any, it shall be involced on the head of toast_insert_or_update()
>> than indirect references are transformed to something other which
>> is safe to save. (My expectation is, the 'typserialize' handler
>> preliminary saves the indirect chunks to the toast relation, then
>> put toast pointers instead.)
>
> This might not work.  The reason is that we have important bits of
> code that expect that they can figure out how to do some operation on
> a datum (find its length, copy it, serialize it) based only on typlen
> and typbyval.  See src/backend/utils/adt/datum.c for assorted
> examples.  Note also the lengthy comment near the top of the file,
> which explains that typlen > 0 indicates a fixed-length type, typlen
> == -1 indicates a varlena, and typlen == -2 indicates a cstring.  I
> think there's probably room for other typlen values; for example, we
> could have typlen == -3 indicate some new kind of thing -- a
> super-varlena that has a higher length limit, or some other kind of
> thing altogether.
>
> Now, you can imagine trying to treat what you're talking about as a
> new type of TOAST pointer, but I think that's not going to help,
> because at some point the TOAST pointer gets de-toasted into a varlena
> ... which is still limited to 1GB.  So that's not really going to
> work.  And it brings out another point, which is that if you define a
> new typlen code, like -3, for super-big things, they won't be
> varlenas, which means they won't work with the existing TOAST
> interfaces.  Still, you might be able to fix that.  You would probably
> have to do some significant surgery on the wire protocol, per the
> commit message for fa2fa995528023b2e6ba1108f2f47558c6b66dcd.
>
Hmm... The reason why I didn't introduce the idea of a 64-bit varlena format is
that this approach seemed too invasive for the existing PostgreSQL core and
extensions, because I assumed this "long" variable-length datum would utilize
and enhance the existing varlena infrastructure.
However, once we have an infrastructure completely independent from the existing
varlena, we may not need to risk mixing the two kinds of code.
That seems to me an advantage.
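Just to make the typlen == -3 idea concrete, a hypothetical "super-varlena"
header might look like this (illustration only; no such structure exists today):

    /* hypothetical header for a typlen == -3 "super-varlena" */
    typedef struct
    {
        uint64      len64;      /* total size in bytes, free of the 1GB limit */
        char        data[FLEXIBLE_ARRAY_MEMBER];
    } SuperVarlena;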

My concern about the ExpandedObject is that its flat format still needs the
32-bit varlena header, which restricts the total length to 1GB, so it is not
possible to represent a large data chunk.
So, I didn't think the current ExpandedObject is a solution for us.

Regarding the protocol stuff, I want to consider support for large
records after the internal data format, because my expected primary
usage is internal use for in-database analytics, where the user gets
the results of complicated logic described in a PL function.

> I think it's probably a mistake to conflate objects with substructure
> with objects > 1GB.  Those are two somewhat orthogonal needs.  As Jim
> also points out, expanded objects serve the first need.  Of course,
> those assume that we're dealing with a varlena, so if we made a
> super-varlena, we'd still need to create an equivalent.  But perhaps
> it would be a fairly simple adaptation of what we've already got.
> Handling objects >1GB at all seems like the harder part of the
> problem.
>
I could almost get your point. Does the last line above refer to the
amount of data in an object >1GB, even if the "super-varlena" format
allows a 64-bit length?

Thanks,
-- 
KaiGai Kohei 




[HACKERS] varlena beyond 1GB and matrix

2016-12-07 Thread Kohei KaiGai
This is a design proposal for a matrix data type which can be larger
than 1GB. Beyond support for a new data type, it also needs a platform
enhancement because the existing varlena has a hard limit (1GB).
We had a discussion about this topic at the developer unconference
at Tokyo/Akihabara, the day before PGconf.ASIA. The design proposal
below stands on the overall consensus of that discussion.


============
 BACKGROUND
============

The varlena format has either a short (1-byte) or long (4-byte) header.
We use the long header for the in-memory structure which is referenced
by VARSIZE()/VARDATA(), or the on-disk structure which is larger than
126 bytes but less than the TOAST threshold. Otherwise, the short format is
used if the varlena is less than 126 bytes or is stored externally in the toast
relation. No varlena representation supports a data size larger than 1GB.
On the other hand, some use cases which handle (relatively) big data
in the database are interested in variable-length datums larger than 1GB.

One example is what I presented at PGconf.SV.
A PL/CUDA function takes two arguments (a 2D-array of int4 instead of a
matrix), then returns the top-N combinations of the chemical compounds
according to their similarity, like this:

SELECT knn_gpu_similarity(5, Q.matrix, D.matrix)
  FROM (SELECT array_matrix(bitmap) matrix
          FROM finger_print_query
         WHERE tag LIKE '%abc%') Q,
       (SELECT array_matrix(bitmap) matrix
          FROM finger_print_10m) D;

array_matrix() is an aggregate function that generates a 2D-array
containing the entire input relation stream.
It works fine as long as the data size is less than 1GB. Once it exceeds
that boundary, the user has to split the 2D-array manually, even though
it is not uncommon for recent GPU models to have more than 10GB of RAM.

It is really problematic if we cannot split the mathematical problem
into several portions appropriately. And it is not comfortable for
users who cannot use the full capability of the GPU device.

=========
 PROBLEM
=========
Our problem is that the varlena format does not permit a variable-length
datum larger than 1GB, even though our "matrix" type wants to move
a bunch of data larger than 1GB.
So, we need to solve the varlena format restriction prior to the
matrix type implementation.

In the developer unconference, people discussed three ideas for
the problem. The overall consensus was the idea of a special data type
which can contain multiple indirect references to other data chunks.
Both the main part and the referenced data chunks are less than 1GB,
but the total amount of data we can represent is more than 1GB.

For example, even if we have a large matrix of around 8GB, its sub-matrices
separated into 9 portions (3x3) are individually less than 1GB.
It is problematic when we try to save a matrix which contains indirect
references to the sub-matrices, because toast_save_datum() writes out
the flat portion just after the varlena head onto a tuple or a toast
relation as is.
If the main portion of the matrix contains pointers (indirect references),
it is obviously problematic. We need an infrastructure to
serialize the indirect references prior to saving.

BTW, the other ideas with less acknowledgement were a 64-bit varlena header
and utilization of large objects. The former changes
the current data format, and thus would have unexpected side-effects on
the existing code of the PostgreSQL core and extensions. The latter
requires users to construct a large object beforehand. That makes it
impossible to use the interim result of a sub-query, and leads to unnecessary
I/O for preparation.

==========
 SOLUTION
==========
I would like to propose a new optional type handler 'typserialize' to
serialize an in-memory varlena structure (that can have indirect
references) to the on-disk format.
If present, it shall be invoked at the head of toast_insert_or_update(), so
that indirect references are transformed into something else which
is safe to save. (My expectation is that the 'typserialize' handler
saves the indirect chunks to the toast relation beforehand, then
puts toast pointers in their place.)
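As a purely hypothetical sketch of that hook (neither the catalog column nor
the call below exists today; the names are made up for illustration):

    /* hypothetical: if the type provides a 'typserialize' function, call it
     * at the head of toast_insert_or_update() so that indirect references
     * are replaced by toast pointers before the flat image is written out */
    if (OidIsValid(typserialize_oid))       /* looked up from the type entry */
        value = OidFunctionCall1(typserialize_oid, value);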

On the other hand, it is uncertain whether we need a 'typdeserialize'
handler symmetrically. Because all the functions/operators which
support the special data type should know its internal format,
it is possible to load the indirectly referenced data chunks on
demand.
That will be beneficial from the performance perspective if functions
/operators touch only a part of the large structure, because the
rest of the portions need not be loaded into memory.

One thing I'm not certain about is whether we can update the datum
supplied as an argument to functions/operators.
Let me assume the data structure below:

  struct {
  int32   varlena_head;
  Oid element_id;
  int matrix_type;
  int blocksz_x; /* horizontal size of each block */
  int blocksz_y; /* vertical size of each block */
  int gridsz_x;  /* horizon

Re: [HACKERS] raw output from copy

2016-12-06 Thread Kohei KaiGai
2016-12-06 16:59 GMT+09:00 Pavel Stehule :
>
>
> 2016-12-06 1:50 GMT+01:00 Kohei KaiGai :
>>
>> 2016-12-05 22:45 GMT+09:00 Pavel Stehule :
>> >
>> > There are more goals:
>> >
>> > 1. user friendly import of text or binary data - import text data (with
>> > psql) from file is possible - but you have to load a content to psql
>> > variable. For binary data you should to use workaround based on LO and
>> > transformation from LO to bytea.
>> >
>> > 2. user friendly export text or binary data - now, the binary data can
>> > be
>> > exported only via transformation to LO. The XML has very interesting
>> > features when is passing from/to client binary. This feature is
>> > impossible
>> > in psql now.
>> >
>>   :
>> 
>>   :
>> >> It seems to me extend of COPY statement for this optimization is a bit
>> >> overkill
>> >> solution. Do we find out an alternative solution that we can build on
>> >> the existing
>> >> infrastructure?
>> >
>> > The advantage and sense of COPY RAW was reusing existing interface. The
>> > question was: How I can export/import binary data simply from psql
>> > console?
>> >
>> OK, I could get your point.
>>
>> Likeky, we can implement the feature without COPY statement enhancement
>> by adding a special purpose function and \xxx command on psql.
>>
>> Let's assume the two commands below on psql:
>>
>> \blob_import   (STDIN|)
>> \blob_export  (STDOUT|)
>>
>> On \blob_import, the psql command reads the binary contents from either
>> stdin or file, than call a special purpose function that takes three
>> arguments; table name, column name and a binary data chunk.
>> PQexecParams() of libpq allows to deliver the data chunk with keeping
>> binary data format, then the special purpose function will be able to
>> lookup the destination table/column and construct a tuple that contains
>> the supplied data chunk. (I think _recv handler shall be used for
>> data validation, but not an element of this feature.)
>>
>>
>> On \blob_export, the psql command also set up a simple query as follows:
>>   SELECT blob_export((...))
>> For example,
>>   \blob_export SELECT binary_data FROM my_table WHERE id = 10   /tmp/aaa
>> shall be transformed to
>>   SELECT blob_export((SELECT binary_data FROM my_table WHERE id = 10))
>
>
> This is reason why I prefer a COPY statement - because it does all necessary
> things natural.  But if there is a disagreement against COPY RAW it can be
> implemented as psql commands.
>
Yes, both approaches will be able to implement what you want to do.
I agree it is valuable if psql can import/export a particular item with
a simple shell-script description; however, there is no consensus on how
to implement it.

If psql supports the special \xxx commands, it is equally convenient
from the standpoint of users, with no enhancement of the COPY statement.

I hope committers will comment on the approach we should take.

Thanks,

> export should be similar like \g, \gset feature
>
> so
>
> SELECT xmldoc FROM 
> \gbinary_store .xxx
>
> import is maybe better solved by proposed file references in queries
>
> Regards
>
> Pavel
>
>
>>
>>
>> This function is declared as:
>>   blob_export(anyelement) RETURNS bytea
>> So, as long as the user supplied query returns exactly one column and
>> one row, it can transform the argument to the binary stream, then psql
>> command receive it and dump somewhere; stdout or file.
>>
>> How about your thought?
>>
>> Thanks,
>> --
>> KaiGai Kohei 
>
>



-- 
KaiGai Kohei 




Re: [HACKERS] raw output from copy

2016-12-05 Thread Kohei KaiGai
2016-12-05 22:45 GMT+09:00 Pavel Stehule :
>
> There are more goals:
>
> 1. user friendly import of text or binary data - import text data (with
> psql) from file is possible - but you have to load a content to psql
> variable. For binary data you should to use workaround based on LO and
> transformation from LO to bytea.
>
> 2. user friendly export text or binary data - now, the binary data can be
> exported only via transformation to LO. The XML has very interesting
> features when is passing from/to client binary. This feature is impossible
> in psql now.
>
  :

  :
>> It seems to me extend of COPY statement for this optimization is a bit
>> overkill
>> solution. Do we find out an alternative solution that we can build on
>> the existing
>> infrastructure?
>
> The advantage and sense of COPY RAW was reusing existing interface. The
> question was: How I can export/import binary data simply from psql console?
>
OK, I get your point.

Likely, we can implement the feature without a COPY statement enhancement
by adding a special-purpose function and \xxx commands in psql.

Let's assume the two commands below in psql:

\blob_import   (STDIN|)
\blob_export  (STDOUT|)

On \blob_import, the psql command reads the binary contents from either
stdin or a file, then calls a special-purpose function that takes three
arguments: table name, column name and a binary data chunk.
PQexecParams() of libpq allows the data chunk to be delivered while keeping
the binary data format, then the special-purpose function will be able to
look up the destination table/column and construct a tuple that contains
the supplied data chunk. (I think the _recv handler shall be used for
data validation, but that is not an element of this feature.)
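A minimal client-side sketch of that idea, assuming a hypothetical server-side
function blob_import(regclass, name, bytea) and a connection plus a buffer
(conn, chunk, chunk_len) already at hand (illustration only):

    /* send one binary chunk with PQexecParams(), keeping the binary format */
    const char *paramValues[3];
    int         paramLengths[3] = {0, 0, 0};
    int         paramFormats[3] = {0, 0, 1};    /* only the last argument is binary */
    PGresult   *res;

    paramValues[0] = "my_table";        /* hypothetical target table  */
    paramValues[1] = "binary_data";     /* hypothetical target column */
    paramValues[2] = chunk;             /* raw bytes read from stdin or a file */
    paramLengths[2] = chunk_len;        /* length is required for binary params */

    res = PQexecParams(conn,
                       "SELECT blob_import($1, $2, $3)",
                       3, NULL, paramValues, paramLengths, paramFormats, 0);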


On \blob_export, the psql command also set up a simple query as follows:
  SELECT blob_export((




Re: [HACKERS] raw output from copy

2016-12-05 Thread Kohei KaiGai
Sorry for my late response.

I've briefly checked the series of past discussions.
I understand the target/purpose of this patch is to provide a fast interface
to import/export a particular cell of a relation, skipping the text<->binary
transformation. Its typical use cases are the XML and JSON data types. Right?

If so, how about the idea of using the fast-path invocation protocol to call
functions that import/export these document types?
It allows a binary form of the data stream to be accepted, with minimum overhead.

It seems to me that extending the COPY statement for this optimization is a bit
of an overkill solution. Can we find an alternative solution that we can build
on the existing infrastructure?

Best regards,

2016-12-05 14:16 GMT+09:00 Haribabu Kommi :
>
>
> On Tue, Nov 22, 2016 at 10:48 PM, Haribabu Kommi 
> wrote:
>>
>>  Hi,
>>
>> This is a gentle reminder.
>>
>> you assigned as reviewer to the current patch in the 11-2016 commitfest.
>> But you haven't shared your review yet. Please share your review about
>> the patch. This will help us in smoother operation of commitfest.
>>
>> Please Ignore if you already shared your review.
>
>
> Patch is not applying properly to HEAD.
> Moved to next CF with "waiting on author" status.
>
>
> Regards,
> Hari Babu
> Fujitsu Australia



-- 
KaiGai Kohei 




Re: [HACKERS] Use of pg_proc.probin is legal?

2016-11-17 Thread Kohei KaiGai
2016-11-16 7:46 GMT-08:00 Tom Lane :
> Kohei KaiGai  writes:
>> On the other hands, interpret_AS_clause() raises an ereport if SQL
>> function tries to use probin except
>> for C-language. Is it illegal for other languages to use probin field
>> to store something useful?
>
> Well, there's no convention about how to use it.
>
>> In my case, PL/CUDA language allows to define SQL function with a CUDA
>> code block.
>> It saves a raw CUDA source code on the pg_proc.prosrc and its
>> intermediate representation
>> on the pg_proc.probin; which is automatically constructed on the
>> validator callback of the language
>> handler.
>
> I have precisely zero sympathy for such a kluge.  The validator exists
> to validate, it is not supposed to modify the pg_proc row.
>
> We could imagine extending the PL API to allow storage of a compiled
> version in probin, but overloading the validator isn't the way to do
> that IMO.  I'd prefer to see a separate "compile" function for it.
> Existence of a compile function could be the trigger that instructs,
> eg, pg_dump not to include the probin value in the dump.
>
> (There once was a LANCOMPILER option in the CREATE LANGUAGE syntax,
> which I imagine was meant to do something like this, but it was never
> fully implemented and we got rid of it years ago.)
>
> The bigger question though is whether it's really worth the trouble.
> All existing PLs deal with this by caching compiled (to one degree
> or another) representations in process memory.  If you keep it in
> probin you can save some compilation time once per session, but on the
> other hand you're limited to representations that can fit in a flat blob.
>
After thinking it over with a fresh brain, it may not be a serious problem.
Right now, it registers a dependency between a PL/CUDA function and
its helper functions (*1), then puts the helper function's OID in the
intermediate representation stored in pg_proc.probin, so it keeps working
even if the function gets renamed.
However, it is a situation where we can simply raise an ereport if the helper
function is not found, instead of using the dependency mechanism.
In addition, the heavier compilation step is from the intermediate form to
the real binary form rather than the source-to-intermediate step,
although I don't know about other script languages.

Thanks,

(*1) PL/CUDA allows some kinds of helper functions to be put in its
program source for particular purposes. For example, the directive
below tells the PL/CUDA runtime to allocate a particular
amount of device memory according to the result (int8) of
my_plcuda_workbuf_sz(), which has an argument set identical to
the PL/CUDA function's.

$$
  :
#plcuda_workbuf_sz my_plcuda_func_workbuf_sz
  :
$$

-- 
KaiGai Kohei 




[HACKERS] Use of pg_proc.probin is legal?

2016-11-15 Thread Kohei KaiGai
Hello,

I have a question, as per the subject line.

https://www.postgresql.org/docs/devel/static/catalog-pg-proc.html
According to the documentation, the purpose of pg_proc.probin is
introduced as follows:
| Additional information about how to invoke the function. Again, the
interpretation is language-specific.

On the other hand, interpret_AS_clause() raises an ereport if a SQL
function tries to use probin, except
for the C language. Is it illegal for other languages to use the probin field
to store something useful?

In my case, the PL/CUDA language allows a SQL function to be defined with a
CUDA code block.
It saves the raw CUDA source code in pg_proc.prosrc and its intermediate
representation in pg_proc.probin; the latter is automatically constructed in
the validator callback of the language handler.

I noticed the problematic scenario because pg_dump generates a CREATE FUNCTION
statement with two AS arguments, and then pg_restore tries to run the statement
but fails.

The solution seems to me to be one of these:
1. Allow CREATE FUNCTION with two AS arguments. If the language does not
support it, probin is simply ignored.
2. pg_dump does not dump pg_proc.probin if the language is not C.
3. Update the documentation to say not to provide probin for languages other
than C.

Thanks,
-- 
KaiGai Kohei 




Re: [HACKERS] Academic help for Postgres

2016-05-11 Thread Kohei KaiGai
2016-05-11 23:20 GMT+09:00 Bruce Momjian :
> I am giving a keynote at an IEEE database conference in Helsinki next
> week (http://icde2016.fi/).  (Yes, I am not attending PGCon Ottawa
> because I accepted the Helsinki conference invitation before the PGCon
> Ottawa date was changed from June to May).
>
> As part of the keynote, I would like to mention areas where academia can
> help us.  The topics I can think of are:
>
> Query optimization
> Optimizer statistics
> Indexing structures
> Reducing function call overhead
> CPU locality
> Sorting
> Parallelism
> Sharding
>
> Any others?
>
How about NVRAM utilization?

-- 
KaiGai Kohei 




Re: [HACKERS] textlike under the LIKE operator for char(n)

2016-05-06 Thread Kohei KaiGai
2016-05-06 23:17 GMT+09:00 Kevin Grittner :
> On Fri, May 6, 2016 at 8:58 AM, Kohei KaiGai  wrote:
>
>> postgres=# select 'abcd'::char(20) LIKE 'ab%cd';
>>  ?column?
>> --
>>  f
>> (1 row)
>>
>> postgres=# select 'abcd'::char(4) LIKE 'ab%cd';
>>  ?column?
>> --
>>  t
>> (1 row)
>>
>> LIKE operator (that is eventually processed by textlike) considers the
>> padding space of char(n) data type as a part of string.
>
> The SQL standard generally requires this for CHAR(n) columns.
>
>> On the other hands, equal operator ignores the padding space when it
>> compares two strings.
>>
>> postgres=# select 'abcd'::char(20) = 'abcd';
>>  ?column?
>> --
>>  t
>> (1 row)
>>
>> postgres=# select 'abcd'::char(4) = 'abcd';
>>  ?column?
>> --
>>  t
>> (1 row)
>
> The SQL standard specifically requires this exception to the
> general rule.
>
>> Is this behavior as expected? or, bug?
>
> This has been discussed on community lists multiple times in the
> past; you might want to search the archives.  I'm not inclined to
> dig through the standard for details on this point again right now,
> but in general the behaviors we provide for CHAR(n) are mandated by
> standard.  It would not entirely shock me if there are some corner
> cases where different behavior could be allowed or even more
> correct, but my recollection is that what you have shown is all
> required to work that way.
>
Thanks. I couldn't quickly figure out the reason for the behavior.
A requirement of the SQL standard is clear guidance, even if it looks
a bit mysterious.

> Generally, I recommend avoiding CHAR(n) columns like the plague.
>
Yep, I agree. I found this matter while porting the LIKE operator to GPU,
not when some real-life query tried to use char(n).

Best regards,
-- 
KaiGai Kohei 




[HACKERS] textlike under the LIKE operator for char(n)

2016-05-06 Thread Kohei KaiGai
Hi,

I found mysterious behavior when we use the LIKE operator on the char(n) data type.


postgres=# select 'abcd'::char(20) LIKE 'ab%cd';
 ?column?
--
 f
(1 row)

postgres=# select 'abcd'::char(4) LIKE 'ab%cd';
 ?column?
--
 t
(1 row)

The LIKE operator (which is eventually processed by textlike) considers the
padding spaces of the char(n) data type as part of the string.

On the other hand, the equal operator ignores the padding spaces when it
compares two strings.

postgres=# select 'abcd'::char(20) = 'abcd';
 ?column?
--
 t
(1 row)

postgres=# select 'abcd'::char(4) = 'abcd';
 ?column?
--
 t
(1 row)

The LIKE operator on char(n) data type is implemented by textlike().

at pg_proc.h:
DATA(insert OID = 1631 (  bpcharlike   PGNSP PGUID 12 1 0 0 0 f f
f f t f i s 2 0 16 "1042 25" _null_ _null_ _null_ _null_ _null_
textlike _null_ _null_ _null_ ));

It calls GenericMatchText() with the length of the target string,
calculated by VARSIZE_ANY_EXHDR; however, that includes the padding
spaces.
It seems to me bcTruelen() gives the correct length for char(n) data
types, instead of this macro.
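For reference, the trailing-blank trimming that bcTruelen() performs is roughly
the following (a paraphrased sketch, not the exact source):

    /* count the length of a bpchar value ignoring trailing blanks */
    static int
    bcTruelen(BpChar *arg)
    {
        char   *s = VARDATA_ANY(arg);
        int     i;

        /* walk backwards over the space padding */
        for (i = VARSIZE_ANY_EXHDR(arg) - 1; i >= 0; i--)
        {
            if (s[i] != ' ')
                break;
        }
        return i + 1;
    }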

Is this behavior as expected? or, bug?

Thanks,
-- 
KaiGai Kohei 




Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-02-12 Thread Kohei KaiGai
2016-02-13 0:11 GMT+09:00 Robert Haas :
> On Fri, Feb 12, 2016 at 9:56 AM, Andres Freund  wrote:
>> On 2016-02-10 23:26:20 -0500, Robert Haas wrote:
>>> I think the part about whacking around the FDW API is a little more
>>> potentially objectionable to others, so I want to hold off doing that
>>> unless a few more people chime in with +1.  Perhaps we could start a
>>> new thread to talk about that specific idea.  This is useful even
>>> without that, though.
>>
>> FWIW, I can delete a couple hundred lines of code from citusdb thanks to
>> this...
>>
>> A quick questions about what you committed:
>>
>>> @@ -527,10 +532,17 @@ extern PGDLLIMPORT Node *newNodeMacroHolder;
>>>   */
>>>  extern char *nodeToString(const void *obj);
>>>
>>> +struct Bitmapset;/* not to include bitmapset.h here */
>>> +struct StringInfoData;   /* not to include stringinfo.h here */
>>> +extern void outToken(struct StringInfoData *str, const char *s);
>>> +extern void outBitmapset(struct StringInfoData *str,
>>> +  const struct Bitmapset *bms);
>>> +
>>>  /*
>>>   * nodes/{readfuncs.c,read.c}
>>>   */
>>>  extern void *stringToNode(char *str);
>>> +extern struct Bitmapset *readBitmapset(void);
>>
>> why exactly did you expose read/writeBitmapset(), and nothing else?
>> Afaics a lot of the other read routines are also pretty necessary from
>> the outside?
>
> Because that's what KaiGai submitted and I had no reason to
> second-guess it.  I'm happy to expose other stuff along similar lines
> if somebody writes a patch for it.
>
Although we have various service macros in outfuncs.c and readfuncs.c,
it is not a big burden for extension authors to write equivalent code,
except for strings and bitmaps. On the other hand, a string needs _outToken
and a bitmap needs _outBitmap. It is not a good idea to force extensions to
carry copy-and-pasted code for no clear reason.

Thanks,
-- 
KaiGai Kohei 




Re: [HACKERS] cash_mul_int8 / cash_div_int8

2015-10-07 Thread Kohei KaiGai
Also, cash_pl, cash_mi, cash_mul_int4 and so on... do not have overflow checks
like int8pl has.

Of course, most people don't need to worry about 64-bit overflow for
money... :-).
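A minimal sketch of an int8pl-style check for cash_pl, just to illustrate (not
a committed patch; the SAMESIGN idiom is borrowed from int8.c):

    #define SAMESIGN(a,b)   (((a) < 0) == ((b) < 0))

    Datum
    cash_pl(PG_FUNCTION_ARGS)
    {
        Cash        c1 = PG_GETARG_CASH(0);
        Cash        c2 = PG_GETARG_CASH(1);
        Cash        result = c1 + c2;

        /* overflow is only possible when both inputs have the same sign */
        if (SAMESIGN(c1, c2) && !SAMESIGN(result, c1))
            ereport(ERROR,
                    (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                     errmsg("money out of range")));

        PG_RETURN_CASH(result);
    }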

2015-10-08 0:03 GMT+09:00 Kohei KaiGai :
> I noticed cash_mul_int8 / cash_div_int8 are defined in cash.c,
> however, pg_proc.h and pg_operator.h contains no relevant entries.
>
> Is it just a careless oversight?
>
> Thanks,
> --
> KaiGai Kohei 



-- 
KaiGai Kohei 




[HACKERS] cash_mul_int8 / cash_div_int8

2015-10-07 Thread Kohei KaiGai
I noticed cash_mul_int8 / cash_div_int8 are defined in cash.c; however,
pg_proc.h and pg_operator.h contain no relevant entries.

Is it just a careless oversight?

Thanks,
-- 
KaiGai Kohei 




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Kohei KaiGai
2015-08-19 21:29 GMT+09:00 Simon Riggs :
> On 19 August 2015 at 12:55, Kohei KaiGai  wrote:
>>
>> 2015-08-19 20:12 GMT+09:00 Simon Riggs :
>> > On 12 June 2015 at 00:29, Tomas Vondra 
>> > wrote:
>> >
>> >>
>> >> I see two ways to fix this:
>> >>
>> >> (1) enforce the 1GB limit (probably better for back-patching, if that's
>> >> necessary)
>> >>
>> >> (2) make it work with hash tables over 1GB
>> >>
>> >> I'm in favor of (2) if there's a good way to do that. It seems a bit
>> >> stupid not to be able to use fast hash table because there's some
>> >> artificial
>> >> limit. Are there any fundamental reasons not to use the
>> >> MemoryContextAllocHuge fix, proposed by KaiGai-san?
>> >
>> >
>> > If there are no objections, I will apply the patch for 2) to HEAD and
>> > backpatch to 9.5.
>> >
>> Please don't rush. :-)
>
>
> Please explain what rush you see?
>
Unless we have a fail-safe mechanism for the case where the planner estimates
a much larger number of tuples than are actually needed, a strange estimation
will consume a massive amount of RAM. That is a bad side effect.
My previous patch didn't pay attention to that scenario, so the patch needs
to be revised.

Thanks,

>> It is not difficult to replace palloc() with palloc_huge(); however, it may
>> lead to another problem once the planner gives us a crazy estimation.
>> Below is my comment from the other thread.
>
>
>  Yes, I can read both threads and would apply patches for each problem.
>
> --
> Simon Riggshttp://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



-- 
KaiGai Kohei 




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Kohei KaiGai
2015-08-19 20:12 GMT+09:00 Simon Riggs :
> On 12 June 2015 at 00:29, Tomas Vondra  wrote:
>
>>
>> I see two ways to fix this:
>>
>> (1) enforce the 1GB limit (probably better for back-patching, if that's
>> necessary)
>>
>> (2) make it work with hash tables over 1GB
>>
>> I'm in favor of (2) if there's a good way to do that. It seems a bit
>> stupid not to be able to use fast hash table because there's some artificial
>> limit. Are there any fundamental reasons not to use the
>> MemoryContextAllocHuge fix, proposed by KaiGai-san?
>
>
> If there are no objections, I will apply the patch for 2) to HEAD and
> backpatch to 9.5.
>
Please don't rush. :-)

It is not difficult to replace palloc() with palloc_huge(); however, it may
lead to another problem once the planner gives us a crazy estimation.
Below is my comment from the other thread.

==
Also, we may need to pay attention to the reliability of the planner's scale
estimation.
Even though the plan below says that the Join generates 60521928028 rows,
it actually generates 776157676 rows (about 1.3% of the estimate).


tpcds100=# EXPLAIN ANALYZE select
ws1.ws_order_number,ws1.ws_warehouse_sk wh1,ws2.ws_warehouse_sk wh2
 from web_sales ws1,web_sales ws2
 where ws1.ws_order_number = ws2.ws_order_number
   and ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk;
                                  QUERY PLAN
--------------------------------------------------------------------------------
 Merge Join  (cost=25374644.08..1160509591.61 rows=60521928028
width=24) (actual time=138347.979..491889.343 rows=776157676 loops=1)
   Merge Cond: (ws1.ws_order_number = ws2.ws_order_number)
   Join Filter: (ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
   Rows Removed by Join Filter: 127853313
   ->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16)
(actual time=73252.300..79017.420 rows=72001237 loops=1)
 Sort Key: ws1.ws_order_number
 Sort Method: quicksort  Memory: 7083296kB
 ->  Seq Scan on web_sales ws1  (cost=0.00..3290612.48
rows=72001248 width=16) (actual time=0.023..39951.201 rows=72001237
loops=1)
   ->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16)
(actual time=65095.655..128885.811 rows=904010978 loops=1)
 Sort Key: ws2.ws_order_number
 Sort Method: quicksort  Memory: 7083296kB
 ->  Seq Scan on web_sales ws2  (cost=0.00..3290612.48
rows=72001248 width=16) (actual time=0.014..31046.888 rows=72001237
loops=1)
 Planning time: 0.232 ms
 Execution time: 530176.521 ms
(14 rows)


So, even if we allow nodeHash.c to allocate hash buckets larger than
1GB, the initial size should be determined carefully.
Probably, 1GB is a good starting point even if it is expanded later.

-- 
KaiGai Kohei 




Re: [HACKERS] security labels on databases are bad for dump & restore

2015-07-14 Thread Kohei KaiGai
2015-07-15 2:39 GMT+09:00 Ted Toth :
> That's exactly what I'm talking about like I said KaiGais branch was
> never merged into the mainline so I do not believe that it is used at
> all.
>
It depends on the definition of "integrated".
The PostgreSQL core offers an infrastructure for label-based security
mechanisms, not only selinux. Also, an extension module that is
usually distributed with PostgreSQL bridges the world of the database and
the world of selinux (even though not all the features I initially designed
are implemented yet). I would say it is integrated.

> On Tue, Jul 14, 2015 at 12:28 PM, Robert Haas  wrote:
>> On Tue, Jul 14, 2015 at 1:22 PM, Ted Toth  wrote:
>>> I'm sort of new to this so maybe I'm missing something but since the
>>> sepgsql SELinux userspace object manager was never integrated into
>>> postgresql (AFAIK KaiGais branch was never merged into the mainline)
>>> who uses these labels? What use are they?
>>
>> See contrib/sepgsql
>>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company



-- 
KaiGai Kohei 




Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-06-24 Thread Kohei KaiGai
Does it make sense to put the result tuple of the remote join into every
estate->es_epqTupleSet[] slot represented by this ForeignScan if
scanrelid == 0?

It allows the qualifier to be rechecked for each LockRows node that intends to
lock a base foreign table underlying the remote join.
ForeignScan->fdw_relids tells us which rtindexes are represented
by this ForeignScan, so the infrastructure side may be able to handle it.
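A rough sketch of that idea, assuming the fdw_relids field mentioned above and
that the EPQ state and the joined tuple (epqstate, joined_tuple) are at hand in
the ForeignScanState's recheck path (illustration only):

    /* scanrelid == 0 means a pushed-down join; store the joined tuple into
     * the EPQ slot of every base relation this ForeignScan represents */
    ForeignScan *fscan = (ForeignScan *) node->ss.ps.plan;
    int          rtindex = -1;

    while ((rtindex = bms_next_member(fscan->fdw_relids, rtindex)) >= 0)
        EvalPlanQualSetTuple(epqstate, rtindex,
                             heap_copytuple(joined_tuple));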

Thanks,


2015-06-24 11:40 GMT+09:00 Etsuro Fujita :
> Hi,
>
> While reviewing the foreign join pushdown core patch, I noticed that the
> patch doesn't perform an EvalPlanQual recheck properly.  The example
> that crashes the server will be shown below (it uses the postgres_fdw
> patch [1]).  I think the reason for that is because the ForeignScan node
> performing the foreign join remotely has scanrelid = 0 while
> ExecScanFetch assumes that its scan node has scanrelid > 0.
>
> I think this is a bug.  I've not figured out how to fix this yet, but I
> thought we would also need another plan that evaluates the join locally
> for the test tuples for EvalPlanQual.  Maybe I'm missing something, though.
>
> Create an environment:
>
> postgres=# create table tab (a int, b int);
> CREATE TABLE
> postgres=# create foreign table foo (a int) server myserver options
> (table_name 'foo');
> CREATE FOREIGN TABLE
> postgres=# create foreign table bar (a int) server myserver options
> (table_name 'bar');
> CREATE FOREIGN TABLE
> postgres=# insert into tab values (1, 1);
> INSERT 0 1
> postgres=# insert into foo values (1);
> INSERT 0 1
> postgres=# insert into bar values (1);
> INSERT 0 1
> postgres=# analyze tab;
> ANALYZE
> postgres=# analyze foo;
> ANALYZE
> postgres=# analyze bar;
> ANALYZE
>
> Run the example:
>
> [Terminal 1]
> postgres=# begin;
> BEGIN
> postgres=# update tab set b = b + 1 where a = 1;
> UPDATE 1
>
> [Terminal 2]
> postgres=# explain verbose select tab.* from tab, foo, bar where tab.a =
> foo.a and foo.a = bar.a for update;
>
>  QUERY PLAN
>
> --------------------------------------------------------------------------
>  LockRows  (cost=100.00..101.18 rows=4 width=70)
>Output: tab.a, tab.b, tab.ctid, foo.*, bar.*
>->  Nested Loop  (cost=100.00..101.14 rows=4 width=70)
>  Output: tab.a, tab.b, tab.ctid, foo.*, bar.*
>  Join Filter: (foo.a = tab.a)
>  ->  Seq Scan on public.tab  (cost=0.00..1.01 rows=1 width=14)
>Output: tab.a, tab.b, tab.ctid
>  ->  Foreign Scan  (cost=100.00..100.08 rows=4 width=64)
>Output: foo.*, foo.a, bar.*, bar.a
>Relations: (public.foo) INNER JOIN (public.bar)
>Remote SQL: SELECT l.a1, l.a2, r.a1, r.a2 FROM (SELECT
> ROW(l.a9), l.a9 FROM (SELECT a a9 FROM public.foo FOR UPDATE) l) l (a1,
> a2) INNER
> JOIN (SELECT ROW(r.a9), r.a9 FROM (SELECT a a9 FROM public.bar FOR
> UPDATE) r) r (a1, a2) ON ((l.a2 = r.a2))
> (11 rows)
>
> postgres=# select tab.* from tab, foo, bar where tab.a = foo.a and foo.a
> = bar.a for update;
>
> [Terminal 1]
> postgres=# commit;
> COMMIT
>
> [Terminal 2]
> (After the commit in Terminal 1, Terminal 2 will show the following.)
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> !>
>
> Best regards,
> Etsuro Fujita
>
> [1]
> http://www.postgresql.org/message-id/CAEZqfEe9KGy=1_wagh2rgzpg0o4pqgd+iauyaj8wtze+cyj...@mail.gmail.com
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers



-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-06-11 Thread Kohei KaiGai
2015-06-11 23:33 GMT+09:00 Tomas Vondra :
> Hi,
>
> On 06/11/15 16:20, Jan Wieck wrote:
>>
>> On 06/11/2015 09:53 AM, Kouhei Kaigai wrote:

 curious: what was work_mem set to?

>>> work_mem=48GB
>>>
>>> My machine mounts 256GB physical RAM.
>>
>>
>> work_mem can be allocated several times per backend. Nodes like sort
>> and hash_aggregate may each allocate that much. You should set
>> work_mem to a fraction of physical-RAM / concurrent-connections
>> depending on the complexity of your queries. 48GB does not sound
>> reasonable.
>
>
> That's true, but there are cases where values like this may be useful (e.g.
> for a particular query). We do allow such work_mem values, so I consider
> this failure to be a bug.
>
> It probably existed in the past, but was amplified by the hash join
> improvements I did for 9.5, because that uses NTUP_PER_BUCKET=1 instead of
> NTUP_PER_BUCKET=10. So the arrays of buckets are much larger, and we also
> much more memory than we had in the past.
>
> Interestingly, the hash code checks for INT_MAX overflows on a number of
> places, but does not check for this ...
>
Which number should be changed in this case?

Indeed, nbuckets is declared as int, so INT_MAX is the hard limit on the
number of hash slots. However, some extreme usage can easily create a
situation where we hit this restriction.

Should we make nbuckets a long int?

-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-06-11 Thread Kohei KaiGai
2015-06-11 23:28 GMT+09:00 Robert Haas :
> On Wed, Jun 10, 2015 at 10:57 PM, Kouhei Kaigai  wrote:
>> The attached patch replaces this palloc0() by MemoryContextAllocHuge() + 
>> memset().
>> Indeed, this hash table is constructed towards the relation with 
>> nrows=119994544,
>> so, it is not strange even if hash-slot itself is larger than 1GB.
>
> You forgot to attach the patch, I think.
>
Oops, I indeed forgot to attach it.

>  It looks to me like the size
> of a HashJoinTuple is going to be 16 bytes, so 1GB/16 = ~64 million.
> That's a lot of buckets, but maybe not unreasonably many if you've got
> enough memory.
>
EXPLAIN says this Hash node takes an underlying SeqScan with
119994544 (~119 million) rows, yet the hash table built from it is
still much smaller than my work_mem setting.
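
For reference, the change is essentially the following (a simplified
sketch of what the quoted description says -- replace the plain
palloc0() of the bucket array with MemoryContextAllocHuge() plus
memset(); see the attached patch for the real diff):

    /* the bucket array may legitimately exceed MaxAllocSize (1GB - 1) */
    hashtable->buckets = (HashJoinTuple *)
        MemoryContextAllocHuge(hashtable->batchCxt,
                               (Size) nbuckets * sizeof(HashJoinTuple));
    memset(hashtable->buckets, 0,
           (Size) nbuckets * sizeof(HashJoinTuple));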

-- 
KaiGai Kohei 


hashslot-allocation-by-huge-alloc.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-06-11 Thread Kohei KaiGai
2015-06-11 23:20 GMT+09:00 Jan Wieck :
> On 06/11/2015 09:53 AM, Kouhei Kaigai wrote:
>>>
>>> curious: what was work_mem set to?
>>>
>> work_mem=48GB
>>
>> My machine mounts 256GB physical RAM.
>
>
> work_mem can be allocated several times per backend. Nodes like sort and
> hash_aggregate may each allocate that much. You should set work_mem to a
> fraction of physical-RAM / concurrent-connections depending on the
> complexity of your queries. 48GB does not sound reasonable.
>
A small max_connections combined with a large work_mem is a usual
configuration for typical OLAP workloads.

Even if the configuration were not reasonable, this is not the right
error message; people cannot tell from it how to fix the problem.

psql:query21.sql:50: ERROR:  invalid memory alloc request size 1073741824


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] trust authentication behavior

2015-05-18 Thread Kohei KaiGai
2015-05-18 15:15 GMT+09:00 Denis Kirjanov :
>
>
> - Ursprüngliche Mail -
> Von: "Kohei KaiGai" 
> An: "Robert Haas" 
> CC: "David G. Johnston" , "Denis Kirjanov" 
> , pgsql-hackers@postgresql.org, "Kohei KaiGai" 
> 
> Gesendet: Samstag, 16. Mai 2015 03:32:55
> Betreff: Re: [HACKERS] trust authentication behavior
>
> 2015-05-16 5:13 GMT+09:00 Robert Haas :
>> On Thu, May 14, 2015 at 3:52 PM, David G. Johnston
>>  wrote:
>>> On Thu, May 14, 2015 at 12:22 PM, Denis Kirjanov  wrote:
>>>>
>>>> Yeah, but the idea is to do that without the pg_hba.conf
>>>
>>> You may want to try describing the problem and not just ask if the chosen
>>> solution is possible - of which I am doubtful but I have never used selinux
>>> or studied it in any depth.  pg_hba.conf is the chosen tool for this kind of
>>> thing so pointing out why it cannot be used is a much more useful first
>>> step.
>>
>> In mandatory access control systems like SE-Linux, the system security
>> policy is supposed to centralize all security decisions, and it should
>> be possible to enforce any necessary access control rule by modifying
>> that policy.  At least that's my understanding.  sepgsql lets the
>> kernel's mandatory access control policies filter down into access
>> control decisions that PostgreSQL makes.  sepgsql consults the
>> operating system policy when faced with an access control decision of
>> a type that it supports, and accepts or rejects the connect based on
>> that.
>>
>> So the question is whether the sepgsql integration points include
>> anything that can block a connection, rather than, say, allowing the
>> connection but blocking access to particular tables.  Looking at the
>> code, it appears that it vaguely contemplates a db_database:{access}
>> permission, which sounds like about the right thing, and it's also
>> mentioned at 
>> https://wiki.postgresql.org/wiki/SEPostgreSQL/Permissions#Connection
>> as maybe being the right thing, but I can't find anyplace that it is
>> actually enforce.  That's rather disappointing...
>>
>> KaiGai, any thoughts?
>>
> I'd like to understand what Denis Kirjanov actually wants to do
> first of all.
>
> If he wants to control accesses whole of the PostgreSQL instances
> according to the credential on operating system, it is a configuration
> on operating system side.
> He can write up self security policy module that allows "someone_t"
> domain to connect PostgreSQL instance, not per database basis.
> It is permission around the pair of someone_t and postgresql_t, than
> can be wrapped up by postgresql_stream_connect() in selinux's policy.
>
> If he wants to control accesses per database basis based on selinux
> policy, it is a right choice to consider sepgsql module.
> However, some of permissions are not still implemented, like
> db_database:{access} because of priority of permissions (and I had to
> focus on GPU acceleration infrastructure in v9.5 cycle...).
>
> If he wants to control accesses based on the credential of operating
> system, not limited to selinux, IDENT method is available, isn't it?
>
> Also, he may additionally needs labeled networking configuration,
> if he wants to carry security label information over the TCP/IP
> connection. It is a point to be considered for his requirement.
>
> Thanks,
> --
> KaiGai Kohei 
> - Ursprüngliche Mail -
>
>
> The idea is to put all the security decisions into the selinux policy.
>
Yep, that is the fundamental idea of the sepgsql module.

> As I wrote before it concerns the security label command behavior(done 
> through the policy) and
> the incoming connections (for now it's incoming  remove trust-based)
>
Are you concerned about the client's security label when the connection
comes over a TCP/IP network? Or is there some other concern?

If you want to get the client's label over a TCP/IP network, you need
additional configuration of peer-to-peer IPsec to exchange the security
label over the network.
Alternatively, SELinux also offers a mechanism to assign the client's
security label based on the source IP address.

If you have another concern, please describe it.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] trust authentication behavior

2015-05-15 Thread Kohei KaiGai
2015-05-16 5:13 GMT+09:00 Robert Haas :
> On Thu, May 14, 2015 at 3:52 PM, David G. Johnston
>  wrote:
>> On Thu, May 14, 2015 at 12:22 PM, Denis Kirjanov  wrote:
>>>
>>> Yeah, but the idea is to do that without the pg_hba.conf
>>
>> You may want to try describing the problem and not just ask if the chosen
>> solution is possible - of which I am doubtful but I have never used selinux
>> or studied it in any depth.  pg_hba.conf is the chosen tool for this kind of
>> thing so pointing out why it cannot be used is a much more useful first
>> step.
>
> In mandatory access control systems like SE-Linux, the system security
> policy is supposed to centralize all security decisions, and it should
> be possible to enforce any necessary access control rule by modifying
> that policy.  At least that's my understanding.  sepgsql lets the
> kernel's mandatory access control policies filter down into access
> control decisions that PostgreSQL makes.  sepgsql consults the
> operating system policy when faced with an access control decision of
> a type that it supports, and accepts or rejects the connect based on
> that.
>
> So the question is whether the sepgsql integration points include
> anything that can block a connection, rather than, say, allowing the
> connection but blocking access to particular tables.  Looking at the
> code, it appears that it vaguely contemplates a db_database:{access}
> permission, which sounds like about the right thing, and it's also
> mentioned at 
> https://wiki.postgresql.org/wiki/SEPostgreSQL/Permissions#Connection
> as maybe being the right thing, but I can't find anyplace that it is
> actually enforce.  That's rather disappointing...
>
> KaiGai, any thoughts?
>
First of all, I'd like to understand what Denis Kirjanov actually wants
to do.

If he wants to control access to the PostgreSQL instance as a whole
according to operating-system credentials, that is a configuration on
the operating-system side.
He can write his own security policy module that allows the "someone_t"
domain to connect to the PostgreSQL instance, not on a per-database
basis.
It is a permission on the pair of someone_t and postgresql_t, which can
be wrapped up by postgresql_stream_connect() in SELinux's policy.

If he wants to control access on a per-database basis according to the
SELinux policy, considering the sepgsql module is the right choice.
However, some permissions are still not implemented, like
db_database:{access}, because of the priority of permissions (and I had
to focus on GPU acceleration infrastructure in the v9.5 cycle...).

If he wants to control access based on operating-system credentials,
not limited to SELinux, the IDENT method is available, isn't it?

Also, he may additionally need a labeled networking configuration if he
wants to carry security label information over a TCP/IP connection.
That is a point to be considered for his requirement.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] One question about security label command

2015-05-13 Thread Kohei KaiGai
2015-05-13 21:45 GMT+09:00 Robert Haas :
> On Sun, May 10, 2015 at 3:15 AM, Kohei KaiGai  wrote:
>> 2015-05-01 9:52 GMT+09:00 Kohei KaiGai :
>>> 2015-05-01 7:40 GMT+09:00 Alvaro Herrera :
>>>> Kouhei Kaigai wrote:
>>>>> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
>>>>> > > The idea of making the regression test entirely independent of the
>>>>> > > system's policy would presumably solve this problem, so I'd kind of
>>>>> > > like to see progress on that front.
>>>>> >
>>>>> > Apologies, I guess it wasn't clear, but that's what I was intending to
>>>>> > advocate.
>>>>> >
>>>>> OK, I'll try to design a new regression test policy that is independent
>>>>> from the system's policy assumption, like unconfined domain.
>>>>>
>>>>> Please give me time for this work.
>>>>
>>>> Any progress here?
>>>>
>>> Not done.
>>> The last version I rebuild had a trouble on user/role transition from
>>> unconfined_u/unconfined_r to the self defined user/role...
>>> So, I'm trying to keep the user/role field (that is not redefined for
>>> several years) but to define self domain/types (that have been
>>> redefined multiple times) for the regression test at this moment.
>>>
>> The second approach above works.
>> I defined a own privileged domain (sepgsql_regtest_superuser_t)
>> instead of system's unconfined_t domain.
>> The reason why regression test gets failed was, definition of
>> unconfined_t in the system default policy was changed to bypass
>> multi-category rules; which our regression test depends on.
>> So, the new sepgsql_regtest_superuser_t domain performs almost
>> like as unconfined_t, but restricted by multi-category policy as
>> traditional unconfined_t did.
>> It is self defined domain, so will not affected by system policy
>> change.
>> Even though the sepgsql-regtest.te still uses unconfined_u and
>> unconfined_r pair for selinux-user and role, it requires users to
>> define additional selinux-user by hand if we try to define own one.
>> In addition, its definition has not been changed for several years.
>> So, I thought it has less risk to rely on unconfined_u/unconfined_r
>> field unlike unconfined_t domain.
>
> Can you add this to the next CommitFest?
>
OK, done

https://commitfest.postgresql.org/5/249/

-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-05-10 Thread Kohei KaiGai
Tom,

I briefly checked your updates.
Even though it is not described in the commit-log, I noticed a
problematic change.

This commit makes create_plan_recurse() a static function again. That
means an extension cannot construct child plan nodes, even if it wants
to add custom join logic.
Consider a simple case:
  SELECT * FROM t0, t1 WHERE t0.a = t1.x;

An extension adds a custom join path, and its PlanCustomPath method is
called back to create a plan node once the path gets chosen by the
planner.
The role of PlanCustomPath is to construct a plan node for itself, and
also the plan nodes of the source relations.
If create_plan_recurse() is static, we have no way to initialize the
plan nodes for the t0 and t1 scans, even if the join logic itself is
more powerful than the built-in ones.

This was already discussed upthread and was the consensus.
Please make create_plan_recurse() non-static again, as in the initial
commit.
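
To be concrete, a custom join provider's PlanCustomPath callback needs
to do something like the following (a rough sketch only -- the argument
list is simplified and the child plans could just as well be kept in a
custom field, so the real code will differ):

    static Plan *
    my_plan_custom_join_path(PlannerInfo *root, RelOptInfo *joinrel,
                             CustomPath *best_path, List *tlist,
                             List *clauses)
    {
        CustomScan *cscan = makeNode(CustomScan);
        Path       *outer_path = linitial(best_path->custom_paths);
        Path       *inner_path = lsecond(best_path->custom_paths);

        /* the child paths have to be turned into plans here ... */
        outerPlan(cscan) = create_plan_recurse(root, outer_path);
        innerPlan(cscan) = create_plan_recurse(root, inner_path);

        /* ... which is impossible if create_plan_recurse() is static */
        cscan->scan.plan.targetlist = tlist;
        cscan->scan.scanrelid = 0;
        /* (set methods, quals and custom fields here) */
        return &cscan->scan.plan;
    }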

Also, regarding the *_scan_tlist:

> I don't have time to pursue this idea right now, but I think it would be
> a good change to squeeze into 9.5, just so that we could have some test
> coverage on those parts of this patch.
>
Do you want just a test purpose module and regression test cases?

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-05-09 Thread Kohei KaiGai
2015-05-09 8:18 GMT+09:00 Kohei KaiGai :
> 2015-05-09 2:46 GMT+09:00 Tom Lane :
>> Kouhei Kaigai  writes:
>>>> I've been trying to code-review this patch, because the documentation
>>>> seemed several bricks shy of a load, and I find myself entirely confused
>>>> by the fdw_ps_tlist and custom_ps_tlist fields.
>>
>>> Main-point of your concern is lack of documentation/comments to introduce
>>> how does the pseudo-scan targetlist works here, isn't it??
>>
>> Well, there's a bunch of omissions and outright errors in the docs and
>> comments, but this is the main issue that I was uncertain how to fix
>> from looking at the patch.
>>
>>>> Also,
>>>> if that is what they're for (ie, to allow the FDW to redefine the scan
>>>> tuple contents) it would likely be better to decouple that feature from
>>>> whether the plan node is for a simple scan or a join.
>>
>>> In this version, we don't intend FDW/CSP to redefine the contents of
>>> scan tuples, even though I want off-loads heavy targetlist calculation
>>> workloads to external computing resources in *the future version*.
>>
>> I do not think it's a good idea to introduce such a field now and then
>> redefine how it works and what it's for in a future version.  We should
>> not be moving the FDW APIs around more than we absolutely have to,
>> especially not in ways that wouldn't throw an obvious compile error
>> for un-updated code.  Also, the longer we wait to make a change that
>> we know we want, the more pain we inflict on FDW authors (simply because
>> there will be more of them a year from now than there are today).
>>
> Ah, above my sentence don't intend to reuse the existing field for
> different works in the future version. It's just what I want to support
> in the future version.
> Yep, I see. It is not a good idea to redefine the existing field for
> different purpose silently. It's not my plan.
>
>>>> The business about
>>>> resjunk columns in that list also seems a bit half baked, or at least
>>>> underdocumented.
>>
>>> I'll add source code comments to introduce how does it works any when
>>> does it have resjunk=true. It will be a bit too deep to be introduced
>>> in the SGML file.
>>
>> I don't actually see a reason for resjunk marking in that list at all,
>> if what it's for is to define the contents of the scan tuple.  I think we
>> should just s/ExecCleanTypeFromTL/ExecTypeFromTL/ in nodeForeignscan and
>> nodeCustom, and get rid of the "sanity check" in create_foreignscan_plan
>> (which is pretty pointless anyway, considering the number of other ways
>> you could screw up that tlist without it being detected).
>>
> http://www.postgresql.org/message-id/9a28c8860f777e439aa12e8aea7694f8010d7...@bpxm15gp.gisp.nec.co.jp
>
> Does the introduction in above post make sense?
> The *_ps_tlist is not only used for a basic of scan-tuple descriptor, but
> also used to solve var-node if varno==INDEX_VAR in EXPLAIN command.
> On the other hands, existence of the junk entries (which are referenced in
> external computing resources only) may cause unnecessary projection.
> So, I want to discriminate target-entries for basis of scan-tuple descriptor
> from other ones just for EXPLAIN command.
>
>> I'm also inclined to rename the fields to
>> fdw_scan_tlist/custom_scan_tlist, which would better reflect what they do,
>> and to change the API of make_foreignscan() to add a parameter
>> corresponding to the scan tlist.  It's utterly bizarre and error-prone
>> that this patch has added a field that the FDW is supposed to set and
>> not changed make_foreignscan to match.
>>
> OK, I'll do the both of changes. The name of ps_tlist is a shorten of
> "pseudo-scan target-list". So, fdw_scan_tlist/custom_scan_tlist are
> almost intentional.
>
The attached patch renames *_ps_tlist to *_scan_tlist according to
the suggestion.
It also puts a few detailed source code comments around this
alternative scan tlist.

Thanks,
-- 
KaiGai Kohei 


custom-join-rename-ps_tlist.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-05-09 Thread Kohei KaiGai
2015-05-09 8:32 GMT+09:00 Kohei KaiGai :
> 2015-05-09 3:51 GMT+09:00 Tom Lane :
>> Robert Haas  writes:
>>> On Fri, May 8, 2015 at 1:46 PM, Tom Lane  wrote:
>>>> That's nice, but 9.5 feature freeze is only a week away.  I don't have a
>>>> lot of confidence that this stuff is actually in a state where we won't
>>>> regret shipping it in 9.5.
>>
>>> Yeah.  The POC you were asking for upthread certainly exists and has
>>> for a while, or I would not have committed this.  But I do not think
>>> it likely that the  postgres_fdw support will be ready for 9.5.
>>
>> Well, we have two alternatives.  I can keep hacking on this and get it
>> to a state where it seems credible to me, but we won't have any proof
>> that it actually works (though perhaps we could treat any problems
>> as bugs that should hopefully get found before 9.5 ships, if a
>> postgres_fdw patch shows up in the next few months).  Or we could
>> revert the whole thing and bounce it to the 9.6 cycle.  I don't really
>> like doing the latter, but I'm pretty uncomfortable with committing to
>> published FDW APIs that are (a) as messy as this and (b) practically
>> untested.  The odds that something slipped through the cracks are high.
>>
>> Aside from the other gripes I raised, I'm exceedingly unhappy with the
>> ad-hoc APIs proposed for GetForeignJoinPaths and set_join_pathlist_hook.
>> It's okay for internal calls in joinpath.c to look like that, but
>> exporting that set of parameters seems like pure folly.  We've changed
>> those parameter lists repeatedly (for instance in 9.2 and again in 9.3);
>> the odds that they'll need to change again in future approach 100%.
>>
>> One way we could reduce the risk of code breakage there is to stuff all
>> or most of those parameters into a struct.  This might result in a small
>> slowdown for the internal calls, or then again maybe not --- there
>> probably aren't many architectures that can pass 10 parameters in
>> registers anyway.
>>
> Is it like a following structure definition?
>
>   typedef struct
>   {
> PlannerInfo *root;
> RelOptInfo *joinrel;
> RelOptInfo *outerrel;
> RelOptInfo *innerrel;
> List *restrictlist;
> JoinType jointype;
> SpecialJoinInfo *sjinfo;
> SemiAntiJoinFactors *semifactors;
> Relids param_source_rels;
> Relids extra_lateral_rels;
>   } SetJoinPathListArgs;
>
> I agree the idea. It also helps CSP driver implementation where it calls
> next driver that was already chained on its installation.
>
>   if (set_join_pathlist_next)
>   set_join_pathlist_next(args);
>
> is more stable manner than
>
>   if (set_join_pathlist_next)
>   set_join_pathlist_next(root,
>joinrel,
>outerrel,
>innerrel,
>restrictlist,
>jointype,
>sjinfo,
>semifactors,
>param_source_rels,
>extra_lateral_rels);
>
The attached patch newly defines an ExtraJoinPathArgs struct to pack
all the necessary information to be delivered to GetForeignJoinPaths
and set_join_pathlist_hook.

Its definition is below:
  typedef struct
  {
      PlannerInfo *root;
      RelOptInfo  *joinrel;
      RelOptInfo  *outerrel;
      RelOptInfo  *innerrel;
      List        *restrictlist;
      JoinType     jointype;
      SpecialJoinInfo *sjinfo;
      SemiAntiJoinFactors *semifactors;
      Relids       param_source_rels;
      Relids       extra_lateral_rels;
  } ExtraJoinPathArgs;

Then the hook invocation becomes much simpler:

/*
 * 6. Finally, give extensions a chance to manipulate the path list.
 */
if (set_join_pathlist_hook)
    set_join_pathlist_hook(&jargs);

Thanks,
-- 
KaiGai Kohei 


custom-join-argument-by-struct.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-05-09 Thread Kohei KaiGai
2015-05-09 11:21 GMT+09:00 Robert Haas :
> On Fri, May 8, 2015 at 5:48 PM, Tom Lane  wrote:
>> ... btw, I just noticed something that had escaped me because it seems so
>> obviously wrong that I had not even stopped to consider the possibility
>> that the code was doing what it's doing.  To wit, that the planner
>> supposes that two foreign tables are potentially remote-joinable if they
>> share the same underlying FDW handler function.  Not the same server, and
>> not even the same pg_foreign_data_wrapper entry, but the pg_proc entry for
>> the handler function.  I think this is fundamentally bogus.  Under what
>> circumstances are we not just laying off the need to check same server
>> origin onto the FDW?  How is it that the urgent need for the FDW to check
>> for that isn't even mentioned in the documentation?
>>
>> I think that we'd really be better off insisting on same server (as in
>> same pg_foreign_server OID), hence automatically same FDW, and what's
>> even more important, same user mapping for any possible query execution
>> context.  The possibility that there are some corner cases where some FDWs
>> could optimize other scenarios seems to me to be poor return for the bugs
>> and security holes that will arise any time typical FDWs forget to check
>> this.
>
> I originally wanted to go quite the other way with this and check for
> join pushdown via handler X any time at least one of the two relations
> involved used handler X, even if the other one used some other handler
> or was a plain table.  In particular, it seems to me quite plausible
> to want to teach an FDW that a certain local table is replicated on a
> remote node, allowing a join between a foreign table and a plain table
> to be pushed down.  This infrastructure can't be used that way anyhow,
> so maybe there's no harm in tightening it up, but I'm wary of
> circumscribing what FDW authors can do.  I think it's better to be
> rather expansive in terms of when we call them and let them return
> without doing anything some of them time than to define the situations
> in which we call them too narrowly and end up ruling out interesting
> use cases.
>
Joining a foreign table and a replicated local relation on the remote
side is probably a relatively minor case. Even if the rough check based
on the sameness of the foreign server id does not invoke
GetForeignJoinPaths, an FDW driver can still implement its own
arbitrary logic using set_join_pathlist_hook at its own risk, can't it?

The attached patch changes the logic that checks the joinability of two
foreign relations. As discussed upthread, it checks the foreign server
id instead of the handler function.
build_join_rel() sets fdw_server of RelOptInfo if the inner and outer
foreign relations have the same value, which eventually allows
GetForeignJoinPaths to be kicked off in add_paths_to_joinrel().
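
In other words, the new check is along these lines (a simplified sketch
of the idea; fdw_server is the field name used in the patch, so the
actual code may look a little different):

    /* in build_join_rel(): remember a common foreign server only when
     * both sides of the join come from the same one */
    if (OidIsValid(outer_rel->fdw_server) &&
        outer_rel->fdw_server == inner_rel->fdw_server)
        joinrel->fdw_server = outer_rel->fdw_server;
    else
        joinrel->fdw_server = InvalidOid;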

Thanks,
-- 
KaiGai Kohei 


custom-join-fdw-pushdown-check-by-server.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-05-08 Thread Kohei KaiGai
2015-05-09 3:51 GMT+09:00 Tom Lane :
> Robert Haas  writes:
>> On Fri, May 8, 2015 at 1:46 PM, Tom Lane  wrote:
>>> That's nice, but 9.5 feature freeze is only a week away.  I don't have a
>>> lot of confidence that this stuff is actually in a state where we won't
>>> regret shipping it in 9.5.
>
>> Yeah.  The POC you were asking for upthread certainly exists and has
>> for a while, or I would not have committed this.  But I do not think
>> it likely that the  postgres_fdw support will be ready for 9.5.
>
> Well, we have two alternatives.  I can keep hacking on this and get it
> to a state where it seems credible to me, but we won't have any proof
> that it actually works (though perhaps we could treat any problems
> as bugs that should hopefully get found before 9.5 ships, if a
> postgres_fdw patch shows up in the next few months).  Or we could
> revert the whole thing and bounce it to the 9.6 cycle.  I don't really
> like doing the latter, but I'm pretty uncomfortable with committing to
> published FDW APIs that are (a) as messy as this and (b) practically
> untested.  The odds that something slipped through the cracks are high.
>
> Aside from the other gripes I raised, I'm exceedingly unhappy with the
> ad-hoc APIs proposed for GetForeignJoinPaths and set_join_pathlist_hook.
> It's okay for internal calls in joinpath.c to look like that, but
> exporting that set of parameters seems like pure folly.  We've changed
> those parameter lists repeatedly (for instance in 9.2 and again in 9.3);
> the odds that they'll need to change again in future approach 100%.
>
> One way we could reduce the risk of code breakage there is to stuff all
> or most of those parameters into a struct.  This might result in a small
> slowdown for the internal calls, or then again maybe not --- there
> probably aren't many architectures that can pass 10 parameters in
> registers anyway.
>
Is it like the following structure definition?

  typedef struct
  {
PlannerInfo *root;
RelOptInfo *joinrel;
RelOptInfo *outerrel;
RelOptInfo *innerrel;
List *restrictlist;
JoinType jointype;
SpecialJoinInfo *sjinfo;
SemiAntiJoinFactors *semifactors;
Relids param_source_rels;
Relids extra_lateral_rels;
  } SetJoinPathListArgs;

I agree with the idea. It also helps a CSP driver implementation that
calls the next driver already chained at installation time.

  if (set_join_pathlist_next)
      set_join_pathlist_next(args);

is a more stable manner than

  if (set_join_pathlist_next)
      set_join_pathlist_next(root,
                             joinrel,
                             outerrel,
                             innerrel,
                             restrictlist,
                             jointype,
                             sjinfo,
                             semifactors,
                             param_source_rels,
                             extra_lateral_rels);

Thanks,
--
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-05-08 Thread Kohei KaiGai
2015-05-09 2:46 GMT+09:00 Tom Lane :
> Kouhei Kaigai  writes:
>>> I've been trying to code-review this patch, because the documentation
>>> seemed several bricks shy of a load, and I find myself entirely confused
>>> by the fdw_ps_tlist and custom_ps_tlist fields.
>
>> Main-point of your concern is lack of documentation/comments to introduce
>> how does the pseudo-scan targetlist works here, isn't it??
>
> Well, there's a bunch of omissions and outright errors in the docs and
> comments, but this is the main issue that I was uncertain how to fix
> from looking at the patch.
>
>>> Also,
>>> if that is what they're for (ie, to allow the FDW to redefine the scan
>>> tuple contents) it would likely be better to decouple that feature from
>>> whether the plan node is for a simple scan or a join.
>
>> In this version, we don't intend FDW/CSP to redefine the contents of
>> scan tuples, even though I want off-loads heavy targetlist calculation
>> workloads to external computing resources in *the future version*.
>
> I do not think it's a good idea to introduce such a field now and then
> redefine how it works and what it's for in a future version.  We should
> not be moving the FDW APIs around more than we absolutely have to,
> especially not in ways that wouldn't throw an obvious compile error
> for un-updated code.  Also, the longer we wait to make a change that
> we know we want, the more pain we inflict on FDW authors (simply because
> there will be more of them a year from now than there are today).
>
Ah, my sentence above doesn't mean that we would reuse the existing
field for different work in a future version; it is just what I want to
support in a future version.
Yep, I see. It is not a good idea to silently redefine an existing
field for a different purpose. That is not my plan.

>>> The business about
>>> resjunk columns in that list also seems a bit half baked, or at least
>>> underdocumented.
>
>> I'll add source code comments to introduce how does it works any when
>> does it have resjunk=true. It will be a bit too deep to be introduced
>> in the SGML file.
>
> I don't actually see a reason for resjunk marking in that list at all,
> if what it's for is to define the contents of the scan tuple.  I think we
> should just s/ExecCleanTypeFromTL/ExecTypeFromTL/ in nodeForeignscan and
> nodeCustom, and get rid of the "sanity check" in create_foreignscan_plan
> (which is pretty pointless anyway, considering the number of other ways
> you could screw up that tlist without it being detected).
>
http://www.postgresql.org/message-id/9a28c8860f777e439aa12e8aea7694f8010d7...@bpxm15gp.gisp.nec.co.jp

Does the introduction in the above post make sense?
The *_ps_tlist is not only used as the basis of the scan-tuple
descriptor, but also used to resolve Var nodes with varno==INDEX_VAR in
the EXPLAIN command.
On the other hand, the existence of junk entries (which are referenced
only by external computing resources) may cause unnecessary projection.
So, I want to distinguish the target entries that form the basis of the
scan-tuple descriptor from the ones that exist only for the EXPLAIN
command.
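
For illustration, the distinction amounts to marking the EXPLAIN-only
entries as resjunk when the tlist is built (a hypothetical sketch, not
code taken from the patch; var, expr and scan_tlist stand for local
variables in the provider's planning code):

    TargetEntry *tle;

    /* ordinary entry: becomes a column of the scan tuple */
    tle = makeTargetEntry((Expr *) var,
                          list_length(scan_tlist) + 1,
                          NULL, false);
    scan_tlist = lappend(scan_tlist, tle);

    /* junk entry: only referenced to resolve Vars in EXPLAIN output */
    tle = makeTargetEntry((Expr *) expr,
                          list_length(scan_tlist) + 1,
                          NULL, true);
    scan_tlist = lappend(scan_tlist, tle);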

> I'm also inclined to rename the fields to
> fdw_scan_tlist/custom_scan_tlist, which would better reflect what they do,
> and to change the API of make_foreignscan() to add a parameter
> corresponding to the scan tlist.  It's utterly bizarre and error-prone
> that this patch has added a field that the FDW is supposed to set and
> not changed make_foreignscan to match.
>
OK, I'll make both changes. The name ps_tlist is short for "pseudo-scan
target list", so fdw_scan_tlist/custom_scan_tlist match the intention
well.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-05-08 Thread Kohei KaiGai
2015-05-09 6:48 GMT+09:00 Tom Lane :
> ... btw, I just noticed something that had escaped me because it seems so
> obviously wrong that I had not even stopped to consider the possibility
> that the code was doing what it's doing.  To wit, that the planner
> supposes that two foreign tables are potentially remote-joinable if they
> share the same underlying FDW handler function.  Not the same server, and
> not even the same pg_foreign_data_wrapper entry, but the pg_proc entry for
> the handler function.  I think this is fundamentally bogus.  Under what
> circumstances are we not just laying off the need to check same server
> origin onto the FDW?  How is it that the urgent need for the FDW to check
> for that isn't even mentioned in the documentation?
>
Indeed. Comparison of fdw_handler may cause unexpected behavior.
I agree it needs to be fixed up.

> I think that we'd really be better off insisting on same server (as in
> same pg_foreign_server OID), hence automatically same FDW, and what's
> even more important, same user mapping for any possible query execution
> context.  The possibility that there are some corner cases where some FDWs
> could optimize other scenarios seems to me to be poor return for the bugs
> and security holes that will arise any time typical FDWs forget to check
> this.
>
A former version of the foreign/custom-join patch did check for
joinable relations using the FDW server id; however, it was changed to
the current form because that may open an additional optimization
opportunity - the case in which multiple foreign servers share the same
remote host, access credentials and so on.
Also, I understand your concern about potential security holes caused
by oversight.
It is a trade-off; however, it seems to me the benefit from the
potential optimization does not outweigh the security-hole risk.
So, I'll make a patch to change the logic that checks for joinable
foreign tables.

Thanks,
--
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] feature freeze and beta schedule

2015-05-02 Thread Kohei KaiGai
2015-05-02 1:37 GMT+09:00 Andres Freund :
> * ctidscan as an example of custom-scan
>   Hasn't really gotten sufficient review.
>   => Move
>
I have to agree.

> * Join pushdown support for foreign tables
>   Hasn't gotten particularly much review yet. And it doesn't look
>   entirely ready to me on a very quick scan. But I don't know that much
>   about the area.
>   => Move?
>
I think its overall design has been discussed and people reached a
consensus. Even though some documentation / comment work is still
required, that is not a fundamental design change, is it?
I expect the join pushdown of postgres_fdw will reach a committable
state within two weeks.

Thanks,
--
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] One question about security label command

2015-04-30 Thread Kohei KaiGai
2015-05-01 7:40 GMT+09:00 Alvaro Herrera :
> Kouhei Kaigai wrote:
>> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> > > The idea of making the regression test entirely independent of the
>> > > system's policy would presumably solve this problem, so I'd kind of
>> > > like to see progress on that front.
>> >
>> > Apologies, I guess it wasn't clear, but that's what I was intending to
>> > advocate.
>> >
>> OK, I'll try to design a new regression test policy that is independent
>> from the system's policy assumption, like unconfined domain.
>>
>> Please give me time for this work.
>
> Any progress here?
>
Not done.
The last version I rebuilt had trouble with the user/role transition
from unconfined_u/unconfined_r to the self-defined user/role...
So, at this moment I'm trying to keep the user/role field (which has
not been redefined for several years) but to define our own
domains/types (which have been redefined multiple times) for the
regression test.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] One question about security label command

2015-03-11 Thread Kohei KaiGai
2015-03-12 1:27 GMT+09:00 Alvaro Herrera :
> Robert Haas wrote:
>> On Tue, Mar 10, 2015 at 6:58 PM, Kohei KaiGai  wrote:
>> > ERRCODE_FEATURE_NOT_SUPPORTED is suitable error code here.
>> > Please see the attached one.
>>
>> Committed.  I did not bother back-patching this, but I can do that if
>> people think it's important.
>
> I don't really care myself.
>
>> The sepgsql regression tests don't seem
>> to pass for me any more; I wonder if some expected-output changes are
>> needed as a result of core changes.
>> I'm guessing these tests are not running in an automated fashion anywhere?
>
> Oops, that's bad.  I vaguely recall asking someone for a buildfarm
> animal running these tests, but I guess that didn't happen.
>
This regression test failure comes from the base security policy of
SELinux.
In recent selinux-policy packages, the "unconfined" domain was changed
to have literally unrestricted permissions. This test case relies on
the multi-category policy restricting the unconfined domain, so its
assumption is no longer correct.
The attached patch fixes the policy module of the regression test.
However, I also think we should stop relying on the permission set of
pre-defined SELinux domains. Instead of a pre-defined one,
sepgsql-regtest.te ought to define its own domain with an appropriate
permission set, independent of the base selinux-policy version.
Please give me time to investigate.

Thanks,
-- 
KaiGai Kohei 


sepgsql-fixup-regtest-policy.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] One question about security label command

2015-03-10 Thread Kohei KaiGai
ERRCODE_FEATURE_NOT_SUPPORTED is a suitable error code here.
Please see the attached patch.
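
For reference, the revised error path is an ereport() along these lines
(a rough sketch combining this errcode with the message shown elsewhere
in the thread; "object" stands for the ObjectAddress passed to the
sepgsql relabel hook -- see the attached patch for the exact code):

    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("sepgsql provider does not support labels on %s",
                    getObjectTypeDescription(object))));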

Thanks,


2015-03-11 4:34 GMT+09:00 Robert Haas :
> On Tue, Mar 10, 2015 at 9:41 AM, Alvaro Herrera
>  wrote:
>> And perhaps make it an ereport also, with errcode etc.
>
> Yeah, definitely.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



-- 
KaiGai Kohei 


security-label-errmsg.v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] One question about security label command

2015-03-10 Thread Kohei KaiGai
The attached patch revises the error message issued when a security
label is specified on an unsupported object.
getObjectTypeDescription() may be better than printing the catalog OID.

postgres=# SECURITY LABEL FOR selinux ON ROLE kaigai
postgres-#   IS 'system_u:object_r:unlabeled_t:s0';
ERROR:  sepgsql provider does not support labels on role

2015-03-09 23:55 GMT+09:00 Robert Haas :
> On Tue, Mar 3, 2015 at 5:01 AM, Kouhei Kaigai  wrote:
>> From standpoint of SQL syntax, yep, SECURITY LABEL command support
>> the object types below, however, it fully depends on security label
>> provider; sepgsql.so in this case.
>> At this moment, it supports database, schema, function, tables and
>> column are supported by sepgsql. So, it is expected behavior.
>
> If the core system supports labels on other object types and sepgsql
> does not, it should give a better error for those cases, like:
>
> ERROR: sepgsql provider does not support labels on roles
>
> Errors like "ERROR:  unsupported object type: 1260" are a good way to
> report a failure that is never expected to happen, but they shouldn't
> be used as user-facing error messages.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers



-- 
KaiGai Kohei 


security-label-errmsg.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-02-23 Thread Kohei KaiGai
> Amit and I had a long discussion about this on Friday while in Boston
> together.  I previously argued that the master and the slave should be
> executing the same node, ParallelSeqScan.  However, Amit argued
> persuasively that what the master is doing is really pretty different
> from what the worker is doing, and that they really ought to be
> running two different nodes.  This led us to cast about for a better
> design, and we came up with something that I think will be much
> better.
>
> The basic idea is to introduce a new node called Funnel.  A Funnel
> node will have a left child but no right child, and its job will be to
> fire up a given number of workers.  Each worker will execute the plan
> which is the left child of the funnel.  The funnel node itself will
> pull tuples from all of those workers, and can also (if there are no
> tuples available from any worker) execute the plan itself.  So a
> parallel sequential scan will look something like this:
>
> Funnel
> Workers: 4
> -> Partial Heap Scan on xyz
>
> What this is saying is that each worker is going to scan part of the
> heap for xyz; together, they will scan the whole thing.
>

What is the best way to determine the number of workers?
A fixed number is one idea. It may also make sense to add a new common
field to the Path node describing how much of the node's execution can
be parallelized, or whether it cannot run in parallel at all.
Rather than at plan time, we may be able to determine the number at
execution time according to the number of concurrent workers and the
number of CPU cores.

> The neat thing about this way of separating things out is that we can
> eventually write code to push more stuff down into the funnel.  For
> example, consider this:
>
> Nested Loop
> -> Seq Scan on foo
> -> Index Scan on bar
> Index Cond: bar.x = foo.x
>
> Now, if a parallel sequential scan is cheaper than a regular
> sequential scan, we can instead do this:
>
> Nested Loop
> -> Funnel
> -> Partial Heap Scan on foo
> -> Index Scan on bara
> Index Cond: bar.x = foo.x
>
> The problem with this is that the nested loop/index scan is happening
> entirely in the master.  But we can have logic that fixes that by
> knowing that a nested loop can be pushed through a funnel, yielding
> this:
>
> Funnel
> -> Nested Loop
> -> Partial Heap Scan on foo
> -> Index Scan on bar
> Index Cond: bar.x = foo.x
>
> Now that's pretty neat.  One can also imagine doing this with
> aggregates.  Consider:
>
I guess the planner enhancement would live around add_paths_to_joinrel().
Whenever the underlying join paths support multi-node execution, the
new code would add a Funnel node on top of these join paths. Just my
thought.

> HashAggregate
> -> Funnel
> -> Partial Heap Scan on foo
> Filter: x = 1
>
> Here, we can't just push the HashAggregate through the filter, but
> given infrastructure for we could convert that to something like this:
>
> HashAggregateFinish
> -> Funnel
> -> HashAggregatePartial
> -> Partial Heap Scan on foo
>  Filter: x = 1
>
> That'd be swell.
>
> You can see that something like this will also work for breaking off
> an entire plan tree and shoving it down into a worker.  The master
> can't participate in the computation in that case, but it's otherwise
> the same idea.
>
I believe the entire vision we discussed in the combining-aggregate-
function thread is what you describe above, although people primarily
consider applying that feature to aggregate push-down across joins.

One key piece of infrastructure may be the capability to define a
combining function for an aggregate. It informs the planner that the
aggregate supports two-stage execution. In addition to this, we may
need a planner enhancement to inject the partial aggregate node during
path construction.
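
Just to illustrate what I mean by a combining function (a hypothetical
example, not an existing API): the partial stage on each worker
produces a transition state, and the final stage merges those states
instead of consuming raw rows. For a sum-like aggregate over bigint
whose transition state is a plain int64, it could be as simple as:

    #include "postgres.h"
    #include "fmgr.h"

    PG_FUNCTION_INFO_V1(int8_sum_combine);

    Datum
    int8_sum_combine(PG_FUNCTION_ARGS)
    {
        int64   state1 = PG_GETARG_INT64(0);   /* state from one worker */
        int64   state2 = PG_GETARG_INT64(1);   /* state from another */

        PG_RETURN_INT64(state1 + state2);
    }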

Probably, we have to set a flag to inform the later stage (which
constructs the Agg plan) that the underlying scan/join node performs
partial aggregation, and thus the final aggregation has to expect state
data instead of the usual row-by-row arguments.

Also, I think a HashJoin with a very large outer relation but a much
smaller inner relation is a good candidate for distribution across
multiple nodes.
Even if a multi-node HashJoin has to read the small inner relation N
times, splitting the very large outer relation will still be a gain.

Thanks,
--
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Commit fest 2015-12 enters money time

2015-02-16 Thread Kohei KaiGai
2015-02-16 12:25 GMT+09:00 Michael Paquier :
>
>
> On Fri, Feb 13, 2015 at 10:06 AM, Michael Paquier
>  wrote:
>>
>> In order to move on to the next CF, I am going to go through all the
>> remaining patches and update their status accordingly. And sorry for
>> slacking a bit regarding that. If you wish to complain, of course feel
>> free to post messages on this thread or on the thread related to the
>> patch itself.
>
>
> As all the entries have been either moved or updated, CF 2014-12 is closed,
> and CF 2015-02 has been changed to "In Progress".
>
I tried to add my name to the CF entry for the "Join pushdown support
for foreign tables" patch; however, it was unintentionally marked as
rejected in the last CF, even though it should be "returned with
feedback".

https://commitfest.postgresql.org/3/20/

Is it possible to move it to the current CF?
I couldn't find an operation to add a new entry to the current CF.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-01-10 Thread Kohei KaiGai
2015-01-11 10:40 GMT+09:00 Jim Nasby :
> On 1/9/15, 8:51 PM, Kohei KaiGai wrote:
>>
>> 2015-01-10 9:56 GMT+09:00 Jim Nasby :
>>>
>>> On 1/9/15, 6:54 PM, Jim Nasby wrote:
>>>>
>>>>
>>>> On 1/9/15, 6:44 PM, Petr Jelinek wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> Yep, I had a same impression when I looked at the code first time,
>>>>>> however, it is defined as below. Not a manner of custom-scan itself.
>>>>>>
>>>>>> /*
>>>>>>* ==
>>>>>>* Scan nodes
>>>>>>* ==
>>>>>>*/
>>>>>> typedef struct Scan
>>>>>> {
>>>>>>   Planplan;
>>>>>>   Index   scanrelid;  /* relid is index into the range
>>>>>> table
>>>>>> */
>>>>>> } Scan;
>>>>>>
>>>>>
>>>>> Yeah there are actually several places in the code where "relid" means
>>>>> index in range table and not oid of relation, it still manages to
>>>>> confuse
>>>>> me. Nothing this patch can do about that.
>>>>
>>>>
>>>>
>>>> Well, since it's confused 3 of us now... should we change it (as a
>>>> separate patch)? I'm willing to do that work but don't want to waste
>>>> time if
>>>> it'll just be rejected.
>>>>
>>>> Any other examples of this I should fix too?
>>>
>>>
>>>
>>> Sorry, to clarify... any other items besides Scan.scanrelid that I should
>>> fix?
>>>
>> This naming is a little bit confusing, however, I don't think it "should"
>> be
>> changed because this structure has been used for a long time, so reworking
>> will prevent back-patching when we find bugs around "scanrelid".
>
>
> We can still backpatch; it just requires more work. And how many bugs do we
> actually expect to find around this anyway?
>
> If folks think this just isn't worth fixing fine, but I find the
> backpatching argument rather dubious.
>
Even though there is no problem with the Scan structure itself, a
bugfix may use the variable name "scanrelid". If we renamed it in v9.5,
we would also need a small adjustment to apply such a bugfix to prior
versions.
That seems to me a waste of committers' time.
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-01-09 Thread Kohei KaiGai
2015-01-10 9:56 GMT+09:00 Jim Nasby :
> On 1/9/15, 6:54 PM, Jim Nasby wrote:
>>
>> On 1/9/15, 6:44 PM, Petr Jelinek wrote:
>
>
 Yep, I had a same impression when I looked at the code first time,
 however, it is defined as below. Not a manner of custom-scan itself.

 /*
   * ==
   * Scan nodes
   * ==
   */
 typedef struct Scan
 {
  Planplan;
  Index   scanrelid;  /* relid is index into the range table
 */
 } Scan;

>>>
>>> Yeah there are actually several places in the code where "relid" means
>>> index in range table and not oid of relation, it still manages to confuse
>>> me. Nothing this patch can do about that.
>>
>>
>> Well, since it's confused 3 of us now... should we change it (as a
>> separate patch)? I'm willing to do that work but don't want to waste time if
>> it'll just be rejected.
>>
>> Any other examples of this I should fix too?
>
>
> Sorry, to clarify... any other items besides Scan.scanrelid that I should
> fix?
>
This naming is a little bit confusing; however, I don't think it
"should" be changed, because this structure has been used for a long
time and renaming it would hinder back-patching when we find bugs
around "scanrelid".

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-01-09 Thread Kohei KaiGai
2015-01-10 8:18 GMT+09:00 Jim Nasby :
> On 1/6/15, 5:43 PM, Kouhei Kaigai wrote:

 scan_relid != InvalidOid
 > >
>>>
>>> >
>>> >Ideally, they should be OidIsValid(scan_relid)
>>> >
>>
>> Scan.scanrelid is an index of range-tables list, not an object-id.
>> So, InvalidOid or OidIsValid() are not a good choice.
>
>
> I think the name needs to change then; scan_relid certainly looks like the
> OID of a relation.
>
> scan_index?
>
Yep, I had the same impression when I looked at the code the first
time; however, it is defined as below. This is not something specific
to custom scan itself.

/*
 * ==
 * Scan nodes
 * ==
 */
typedef struct Scan
{
    Plan        plan;
    Index       scanrelid;      /* relid is index into the range table */
} Scan;

-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] "port/atomics/arch-*.h" are missing from installation

2014-10-02 Thread Kohei KaiGai
I got the following error when I tried to build my extension
against the latest master branch.

Are the "port/atomics/*.h" files forgotten by "make install"?

[kaigai@magro pg_strom]$ make
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -fexcess-precision=standard -g -fpic -Wall -O0
-DPGSTROM_DEBUG=1 -I. -I./ -I/usr/local/pgsql/include/server
-I/usr/local/pgsql/include/internal -D_GNU_SOURCE   -c -o shmem.o
shmem.c
In file included from /usr/local/pgsql/include/server/storage/barrier.h:21:0,
 from shmem.c:18:
/usr/local/pgsql/include/server/port/atomics.h:65:36: fatal error:
port/atomics/arch-x86.h: No such file or directory
 # include "port/atomics/arch-x86.h"
^
compilation terminated.
make: *** [shmem.o] Error 1


Even though the source directory has header files...

[kaigai@magro sepgsql]$ find ./src | grep atomics
./src/include/port/atomics
./src/include/port/atomics/generic-xlc.h
./src/include/port/atomics/arch-x86.h
./src/include/port/atomics/generic-acc.h
./src/include/port/atomics/arch-ppc.h
./src/include/port/atomics/generic.h
./src/include/port/atomics/arch-hppa.h
./src/include/port/atomics/generic-msvc.h
./src/include/port/atomics/arch-ia64.h
./src/include/port/atomics/generic-sunpro.h
./src/include/port/atomics/arch-arm.h
./src/include/port/atomics/generic-gcc.h
./src/include/port/atomics/fallback.h
./src/include/port/atomics.h
./src/backend/port/atomics.c

the install destination has only atomics.h

[kaigai@magro sepgsql]$ find /usr/local/pgsql/include | grep atomics
/usr/local/pgsql/include/server/port/atomics.h

The attached patch is probably the right remedy.

Thanks,
-- 
KaiGai Kohei 


pgsql-v9.5-fixup-makefile-for-atomics.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.5] Custom Plan API

2014-10-01 Thread Kohei KaiGai
2014-10-02 0:41 GMT+09:00 Merlin Moncure :
> On Tue, Jul 8, 2014 at 6:55 AM, Kouhei Kaigai  wrote:
>> * Syntax also reflects what the command does more. New syntax to
>>   define custom plan provider is:
>> CREATE CUSTOM PLAN PROVIDER 
>>   FOR  HANDLER ;
>
> -1 on 'cpp' prefix.  I don't see acronyms used in the syntax
> documentation and cpp will make people reflexively think 'c++'.  How
> about   and ?
>
That is no longer living code. I already eliminated the SQL syntax
portion in favor of an internal interface (register_custom_path_provider)
that registers callbacks at extension load time.
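
So an extension now does something like this in its loading hook (a
hypothetical usage sketch; the exact argument type of
register_custom_path_provider() is defined by the patch and may differ,
and my_custom_path_methods is just a placeholder name):

    void
    _PG_init(void)
    {
        /* register callbacks that add CustomPath nodes during planning,
         * instead of declaring a provider through DDL */
        register_custom_path_provider(&my_custom_path_methods);
    }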

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.5] Custom Plan API

2014-09-29 Thread Kohei KaiGai
2014-09-29 20:26 GMT+09:00 Thom Brown :
> On 29 September 2014 09:48, Kouhei Kaigai  wrote:
>>> On Wed, Sep 17, 2014 at 7:40 PM, Kouhei Kaigai  wrote:
>>> > At this moment, I revised the above portion of the patches.
>>> > create_custom_plan() was modified to call "PlanCustomPath" callback
>>> > next to the initialization of tlist and clauses.
>>> >
>>> > It's probably same as what you suggested.
>>>
>>> create_custom_plan() is mis-named.  It's actually only applicable to the
>>> custom-scan case, because it's triggered by create_plan_recurse() getting
>>> a path node with a T_CustomScan pathtype.  Now, we could change that;
>>> although in general create_plan_recurse() dispatches on pathtype, we could
>>> make CustomPath an exception; the top of that function could say if
>>> (IsA(best_path, CustomPath)) { /* do custom stuff */ }.  But the problem
>>> with that idea is that, when the custom path is specifically a custom scan,
>>> rather than a join or some other thing, you want to do all of the same
>>> processing that's in create_scan_plan().
>>>
>>> So I think what should happen is that create_plan_recurse() should handle
>>> T_CustomScan the same way it handles T_SeqScan, T_IndexScan, et
>>> al: by calling create_scan_plan().  The switch inside that function can
>>> then call a function create_customscan_plan() if it sees T_CustomScan.  And
>>> that function will be simpler than the
>>> create_custom_plan() that you have now, and it will be named correctly,
>>> too.
>>>
>> Fixed, according to what you suggested. It seems to me 
>> create_customscan_plan()
>> became more simplified than before.
>> Probably, it will minimize the portion of special case handling if CustomScan
>> with scanrelid==0 replaces built-in join plan in the future version.
>>
>>> In ExplainNode(), I think sname should be set to "Custom Scan", not 
>>> "Custom".
>>> And further down, the custom_name should be printed as "Custom Plan
>>> Provider" not just "Custom".
>>>
>> Fixed. I added an additional regression test to check EXPLAIN output
>> if not a text format.
>>
>>> setrefs.c has remaining handling for the scanrelid = 0 case; please remove
>>> that.
>>>
>> Sorry, I removed it, and checked the patch again to ensure here is no similar
>> portions.
>>
>> Thanks for your reviewing.
>
> pgsql-v9.5-custom-scan.part-2.v11.patch
>
> +GetSpecialCustomVar(CustomPlanState *node,
> +Var *varnode,
> +PlanState **child_ps);
>
> This doesn't seem to strictly match the actual function:
>
> +GetSpecialCustomVar(PlanState *ps, Var *varnode, PlanState **child_ps)
>
It is more convenient if the first argument is a PlanState, because
GetSpecialCustomVar() is called for every suspicious special
var-node that might be managed by a custom-plan provider.
If the caller had to ensure that the first argument is a CustomPlanState,
it would make the function's invocation more complicated.
Also, the callback itself is invoked only when the PlanState is a
CustomPlanState, so it is natural for the callback interface to take
a CustomPlanState argument.
Do we really need to match the wrapper function's prototype with the callback's?
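To illustrate the split I have in mind, here is a minimal sketch; the return
type, the T_CustomPlanState node tag and the "methods" field are my
assumptions, only the argument lists come from the patch quoted above.

Node *
GetSpecialCustomVar(PlanState *ps, Var *varnode, PlanState **child_ps)
{
    CustomPlanState *cps;

    /* the caller simply passes whatever PlanState it is looking at ... */
    Assert(IsA(ps, CustomPlanState));
    cps = (CustomPlanState *) ps;

    /*
     * ... and only here do we hand off to the provider's callback, which
     * naturally takes the CustomPlanState.
     */
    return cps->methods->GetSpecialCustomVar(cps, varnode, child_ps);
}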

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.5] Custom Plan API

2014-08-30 Thread Kohei KaiGai
2014-08-29 13:33 GMT-04:00 Robert Haas :
> On Wed, Aug 27, 2014 at 6:51 PM, Kouhei Kaigai  wrote:
>>> > I'd like to follow this direction, and start stripping the DDL support.
>>>
>>> ...please make it so.
>>>
>> The attached patch eliminates DDL support.
>>
>> Instead of the new CREATE CUSTOM PLAN PROVIDER statement,
>> it adds an internal function; register_custom_scan_provider
>> that takes custom plan provider name and callback function
>> to add alternative scan path (should have a form of CustomPath)
>> during the query planner is finding out the cheapest path to
>> scan the target relation.
>> Also, documentation stuff is revised according to the latest
>> design.
>> Any other stuff keeps the previous design.
>
> Comments:
>
> 1. There seems to be no reason for custom plan nodes to have MultiExec
> support; I think this as an area where extensibility is extremely
> unlikely to work out.  The MultiExec mechanism is really only viable
> between closely-cooperating nodes, like Hash and HashJoin, or
> BitmapIndexScan, BitmapAnd, BitmapOr, and BitmapHeapScan; and arguably
> those things could have been written as a single, more complex node.
> Are we really going to want to support a custom plan that can
> substitute for a Hash or BitmapAnd node?  I really doubt that's very
> useful.
>
This is intended to allow a particular custom-scan provider to exchange
its internal data when multiple custom-scan nodes are stacked.
So it can be considered a facility to implement closely-cooperating nodes,
both of which are managed by the same custom-scan provider.
An example is a GPU-accelerated version of hash-join that takes an underlying
custom-scan node returning a hash table in a GPU-friendly data structure,
which should not be exposed through the row-by-row interface.
I believe it is valuable for some use cases, even though I couldn't find
a use case for it in the ctidscan example.

> 2. This patch is still sort of on the fence about whether we're
> implementing custom plans (of any type) or custom scans (thus, of some
> particular relation).  I previously recommended that we confine
> ourselves initially to the task of adding custom *scans* and leave the
> question of other kinds of custom plan nodes to a future patch.  After
> studying the latest patch, I'm inclined to suggest a slightly revised
> strategy.  This patch is really adding THREE kinds of custom objects:
> CustomPlanState, CustomPlan, and CustomPath. CustomPlanState inherits
> from ScanState, so it is not really a generic CustomPlan, but
> specifically a CustomScan; likewise, CustomPlan inherits from Scan,
> and is therefore a CustomScan, not a CustomPlan.  But CustomPath is
> different: it's just a Path.  Even if we only have the hooks to inject
> CustomPaths that are effectively scans at this point, I think that
> part of the infrastructure could be somewhat generic.  Perhaps
> eventually we have CustomPath which can generate either CustomScan or
> CustomJoin which in turn could generate CustomScanState and
> CustomJoinState.
>
The suggestion seems reasonable to me. The reason CustomPlanState
inherits ScanState and CustomPlan inherits Scan is simply convenience for
extension implementations. Some useful internal APIs, like ExecScan(),
take a ScanState argument, so it was a better strategy to choose
Scan/ScanState instead of the bare Plan/PlanState.
Anyway, I'd like to follow the perspective that treats CustomScan as one
derivative of CustomPath. It is more flexible.

> For now, I propose that we rename CustomPlan and CustomPlanState to
> CustomScan and CustomScanState, because that's what they are; but that
> we leave CustomPath as-is.  For ease of review, I also suggest
> splitting this into a series of three patches: (1) add support for
> CustomPath; (2) add support for CustomScan and CustomScanState; (3)
> ctidscan.
>
OK, I'll do that.

> 3. Is it really a good idea to invoke custom scan providers for RTEs
> of every type?  It's pretty hard to imagine that a custom scan
> provider can do anything useful with, say, RTE_VALUES.  Maybe an
> accelerated scan of RTE_CTE or RTE_SUBQUERY is practical somehow, but
> even that feels like an awfully big stretch.  At least until clear use
> cases emerge, I'd be inclined to restrict this to RTE_RELATION scans
> where rte->relkind != RELKIND_FOREIGN_TABLE; that is, put the logic in
> set_plain_rel_pathlist() rather than set_rel_pathlist().
>
I agree. Indeed, it's not easy to imagine a use case for custom
logic on non-plain relations.

> (We might even want to consider whether the hook in
> set_plain_rel_pathlist() ought to be allowed to inject a non-custom
> plan; e.g. substitute a scan of relation B for a scan of relation A.
> For example, imagine that B contains all rows from A that satisfy some
> predicate. This could even be useful for foreign tables; e.g.
> substitute a scan of a local copy of a foreign table for a reference
> to that table.  But I put all of these ideas in parentheses because
> they're only go

Re: [HACKERS] [v9.5] Custom Plan API

2014-08-22 Thread Kohei KaiGai
2014-08-23 0:39 GMT+09:00 Robert Haas :
> On Thu, Jul 17, 2014 at 3:38 PM, Tom Lane  wrote:
>> Alvaro Herrera  writes:
>>> I haven't followed this at all, but I just skimmed over it and noticed
>>> the CustomPlanMarkPos thingy; apologies if this has been discussed
>>> before.  It seems a bit odd to me; why isn't it sufficient to have a
>>> boolean flag in regular CustomPlan to indicate that it supports
>>> mark/restore?
>>
>> Yeah, I thought that was pretty bogus too, but it's well down the
>> list of issues that were there last time I looked at this ...
>
> I think the threshold question for this incarnation of the patch is
> whether we're happy with new DDL (viz, CREATE CUSTOM PLAN PROVIDER) as
> a way of installing new plan providers into the database.  If we are,
> then I can go ahead and enumerate a long list of things that will need
> to be fixed to make that code acceptable (such as adding pg_dump
> support).  But if we're not, there's no point in spending any time on
> that part of the patch.
>
Even though I'm the patch author, I agree with this approach.
In fact, the previous custom-plan interface I proposed during the v9.4
development cycle did not include DDL support to register the
custom-plan providers, and it worked fine.

One thing that was pointed out to me, and the reason I implemented
DDL support, is that going through a C-language function also
loads the extension module implicitly. That is an advantage, but
I'm not sure whether it needs to be supported from the beginning.

> I can see a couple of good reasons to think that this approach might
> be reasonable:
>
> - In some ways, a custom plan provider (really, at this point, a
> custom scan provider) is very similar to a foreign data wrapper.  To
> the guts of PostgreSQL, an FDW is a sort of black box that knows how
> to scan some data not managed by PostgreSQL.  A custom plan provider
> is similar, except that it scans data that *is* managed by PostgreSQL.
>
> - There's also some passing resemblance between a custom plan provider
> and an access method.  Access methods provide a set of APIs for fast
> access to data via an index, while custom plan providers provide an
> API for fast access to data via magic that the core system doesn't
> (and need not) understand.  While access methods don't have associated
> SQL syntax, they do include catalog structure, so perhaps this should
> too, by analogy.
>
> All that having been said, I'm having a hard time mustering up
> enthusiasm for this way of doing things.  As currently constituted,
> the pg_custom_plan_provider catalog contains only a name, a "class"
> that is always 's' for scan, and a handler function OID.  Quite
> frankly, that's a whole lot of nothing.  If we got rid of the
> pg_catalog structure and just had something like
> RegisterCustomPlanProvider(char *name, void (*)(customScanArg *),
> which could be invoked from _PG_init(), hundreds and hundreds of lines
> of code could go away and we wouldn't lose any actual functionality;
> you'd just list your custom plan providers in shared_preload_libraries
> or local_preload_libraries instead of listing them in a system
> catalog.  In fact, you might even have more functionality, because you
> could load providers into particular sessions rather than system-wide,
> which isn't possible with this design.
>
Indeed. That is an advantage of the approach without a system catalog.


> I think the underlying issue here really has to do with when custom
> plan providers get invoked - what triggers that?  For foreign data
> wrappers, we have some relations that are plain tables (relkind = 'r')
> and no foreign data wrapper code is invoked.  We have others that are
> flagged as foreign tables (relkind = 'f') and for those we look up the
> matching FDW (via ftserver) and run the code.  Similarly, for an index
> AM, we notice that the relation is an index (relkind = 'r') and then
> consult relam to figure out which index AM we should invoke.  But as
> KaiGai is conceiving this feature, it's quite different.  Rather than
> applying only to particular relations, and being mutually exclusive
> with other options that might apply to those relations, it applies to
> *everything* in the database in addition to whatever other options may
> be present.  The included ctidscan implementation is just as good an
> example as PG-Strom: you inspect the query and see, based on the
> operators present, whether there's any hope of accelerating things.
> In other words, there's no user configuration - and also, not
> irrelevantly, no persistent on-disk state the way you have for an
> index, or even an FDW, which has on disk state to the extent that
> there have to be catalog entries tying a particular FDW to a
> particular table.
>
Yes, that's my point. In the case of an FDW or index AM, the query planner
can have some expectation of how the relevant executor node will handle
the given relation scan, based on the persistent state.
However, custom-plan is a black-box to the query pla

Re: [HACKERS] RLS Design

2014-07-08 Thread Kohei KaiGai
2014-07-09 15:07 GMT+09:00 Stephen Frost :
> KaiGai,
>
> * Kohei KaiGai (kai...@kaigai.gr.jp) wrote:
>> What I'd like to implement is adjustment of query like:
>>   SELECT * FROM t1 WHERE (x like '%abc%') AND (quals by built-in RLS)
>>   AND (quals by extension-1) AND ... AND (quals by extension-N);
>> I never mind even if qualifiers in the second block are connected with OR'd
>> manner, however, I want RLS infrastructure to accept additional security
>> models provided by extensions.
>
> Would having a table-level 'AND'-vs-'OR' modifier for the RLS policies
> on that table be sufficient for what you're looking for?  That seems a
> simple enough addition which would still allow more complex groups to be
> developed later on...
>
What I'm considering is probably simpler.
If a table has multiple built-in RLS policies, its expression node will be
represented as a BoolExpr with OR_EXPR and every policy linked into its
args field, won't it? We assume the built-in RLS model merges multiple
policies in an OR manner.
When an extension wants to apply an additional security model on top of
the RLS infrastructure, a straightforward way is to add its own rules
in addition to the built-in ones. If the extension can get control to modify
the above expression node, and the RLS infrastructure works correctly on the
modified expression node, I think that is sufficient to implement multiple
security models on the RLS infrastructure.
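As a rough illustration (a sketch only; the policy-qual list and the
extension's qual are hypothetical variables, but makeBoolExpr() is the
existing helper in the backend):

#include "postgres.h"
#include "nodes/makefuncs.h"
#include "nodes/pg_list.h"

/*
 * Sketch: merge the built-in RLS policy quals with OR, as assumed above,
 * then AND an extension-provided qual on top of the resulting node.
 */
static Expr *
build_rls_qual(List *builtin_policy_quals, Expr *extension_qual)
{
    Expr   *rls_qual;

    /* built-in policies are merged in an OR manner */
    rls_qual = (Expr *) makeBoolExpr(OR_EXPR, builtin_policy_quals, -1);

    /* the additional security model must also pass, so AND it on top */
    if (extension_qual != NULL)
        rls_qual = (Expr *) makeBoolExpr(AND_EXPR,
                                         list_make2(rls_qual, extension_qual),
                                         -1);
    return rls_qual;
}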

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RLS Design

2014-07-08 Thread Kohei KaiGai
2014-07-06 14:19 GMT+09:00 Stephen Frost :
> Kaigai,
>
> * Kouhei Kaigai (kai...@ak.jp.nec.com) wrote:
>> > Can you clarify where this is coming from..?  It sounds like you're
>> > referring to an existing implementation and, if so, it'd be good to get
>> > more information on how that works exactly.
>>
>> Oracle VPD - Multiple Policies for Each Table, View, or Synonym
>> http://docs.oracle.com/cd/B19306_01/network.102/b14266/apdvpoli.htm#i1008351
>>
>> It says - Note that all policies applied to a table are enforced with AND 
>> syntax.
>
> While I'm not against using this as an example to consider, it's much
> more complex than what we're talking about here- and it supports
> application contexts which allow groups of RLS rights to be applied or
> not applied; essentially it allows both "AND" and "OR" for sets of RLS
> policies, along with "default" policies which are applied no matter
> what.
>
>> Not only Oracle VPD, it fits attitude of defense in depth.
>> Please assume a system that installs network firewall, unix permissions
>> and selinux. If somebody wants to reference an information asset within
>> a file, he has to connect the server from the network address being allowed
>> by the firewall configuration AND both of DAC and MAC has to allow his
>> access.
>
> These are not independent systems and your argument would apply to our
> GRANT system also, which I hope it's agreed would make it far less
> useful.  Note also that SELinux brings in another complexity- it needs
> to make system calls out to check the access.
>
>> Usually, we have to pass all the access control to reference the target
>> information, not one of the access control stuffs being installed.
>
> This is true in some cases, and not in others.  Only one role you are a
> member of needs to have access to a relation, not all of them.  There
> are other examples of 'OR'-style security policies, this is merely one.
> I'm simply not convinced that it applies in the specific case we're
> talking about.
>
> In the end, I expect that either way people will be upset because they
> won't be able to specify fully which should be AND vs. which should be
> OR with the kind of flexibility other systems provide.  What I'm trying
> to get to is an initial implementation which is generally useful and is
> able to add such support later.
>
>> > This is similar to how roles work- your overall access includes all access
>> > granted to any roles you are a member of. You don't need SELECT rights 
>> > granted
>> > to every role you are a member of to select from the table. Additionally,
>> > if an admin wants to AND the quals together then they can simply create
>> > a policy which does that rather than have 2 policies.
>> >
>> It seems to me a pain on database administration, if we have to pay attention
>> not to conflict each RLS-policy.
>
> This notion of 'conflict' doesn't make much sense to me.  What is
> 'conflicting' here?  Each policy would simply need to stand on its own
> for the role which it's being applied to.  That's very simple and
> straight-forward.
>
>> I expect 90% of RLS-policy will be configured
>> to PUBLIC user, to apply everybody same rules on access. In this case, DBA
>> has to ensure the target table has no policy or existing policy does not
>> conflict with the new policy to be set.
>> I don't think it is a good idea to enforce DBA these checks.
>
> If the DBA only uses PUBLIC then they have to ensure that each policy
> they set up for PUBLIC can stand on its own- though, really, I expect if
> they go that route they'd end up with just one policy that calls a
> stored procedure...
>
>> >  You are suggesting instead that if application 2 sets up policies on the
>> > table and then application 1 adds another policy that it should reduce what
>> > application 2's users can see?  That doesn't make any sense to me.  I'd
>> > actually expect these applications to at least use different roles anyway,
>> > which means they could each have a single role specific policy which only
>> > returns what that application is allowed to see.
>> >
>> I don't think this assumption is reasonable.
>> Please expect two applications: app-X that is a database security product
>> to apply access control based on remote ip-address of the client for any
>> table accesses by any database roles. app-Y that is a usual enterprise
>> package for daily business data, with RLS-policy.
>> What is the expected behavior in this case?
>
> That the DBA manage the rights on the tables.  I expect that will be
> required for quite a while with PG.  It's nice to think of these
> application products that will manage all access for users by setting up
> their own policies, but we have yet to even discuss how they would have
> appropriate rights on the table to be able to do so (and to not
> interfere with each other..).
>
> Let's at least get something which is generally useful in.  I'm all for
> trying to plan out how to get there and would welcome suggestions you
> 

Re: [HACKERS] 9.5 CF1

2014-07-07 Thread Kohei KaiGai
>> Custom Plan API
>> Shigeru Hanada has said he plans to post a design review soon.
>
> Any updates? Should this be moved to the next CF?
>
I'm now revising the patch according to his suggestion;
it will be completed within a couple of days.
A few issues need design-level suggestions from committers.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] No toast table for pg_shseclabel but for pg_seclabel

2014-07-04 Thread Kohei KaiGai
There is no reason other than what Alvaro mentioned upthread.
We intended to store SELinux security labels (less than 100 bytes at most),
so I didn't think it would actually lead to any problem.

On the other hand, pg_seclabel was merged in another development cycle.
We didn't have a deep discussion about the necessity of a toast table for
pg_seclabel; I added its toast table mechanically.

It does not mean we need a toast table for local objects but not for shared
database objects.

Thanks,

2014-07-04 19:11 GMT+09:00 Andres Freund :
> On 2014-07-04 11:50:17 +0200, Andres Freund wrote:
>> Hi,
>>
>> postgres=# SELECT oid::regclass, reltoastrelid FROM pg_class WHERE relname 
>> IN ('pg_seclabel', 'pg_shseclabel');
>>   oid  | reltoastrelid
>> ---+---
>>  pg_seclabel   |  3598
>>  pg_shseclabel | 0
>> (2 rows)
>>
>> Isn't that a somewhat odd choice? Why do we assume that there cannot be
>> lengthy seclabels on shared objects? Granted, most shared objects aren't
>> candidates for large amounts of data, but both users and databases don't
>> seem to fall into that category.
>
> Hm. It seems they were explicitly removed around
> http://archives.postgresql.org/message-id/1309888389-sup-3853%40alvh.no-ip.org
>
> I don't understand the reasoning there. There's a toast table for
> non-shared objects. Why would we expect less data for the shared ones,
> even though they're pretty basic objects and more likely to be used to
> store policies and such?
>
> Greetings,
>
> Andres Freund
>
> --
>  Andres Freund http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services



-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] NUMA packaging and patch

2014-06-26 Thread Kohei KaiGai
Hello,

Let me comment on this patch.

It applies to the head of the master branch, builds, and passes the
regression tests successfully.
What this patch tries to do is quite simple and obvious:
it tells the operating system to distribute physical pages across
all NUMA nodes on allocation.

One thing I'm concerned about is that it may conflict with the NUMA-balancing
feature supported in recent Linux kernels, which migrates physical pages
according to the location of the tasks referencing them, beyond the NUMA zone.
# I'm not sure whether it is applied to shared memory regions.
# Please correct me if I misunderstood, but it looks to me as if
# physical pages in shared memory are also moved.
http://events.linuxfoundation.org/sites/events/files/slides/summit2014_riel_chegu_w_0340_automatic_numa_balancing_0.pdf

The interleave policy should probably work well for OLTP workloads.
What about OLAP workloads, if physical pages are transparently migrated
to the local node by the operating system?
In the OLAP case, less concurrency is required, but a query runs
complicated logic (usually including a full scan) on a particular
CPU.

Wouldn't it make sense to have a GUC to control the NUMA policy?
In some cases, it makes sense to allocate physical memory
according to the operating system's choice.
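For instance, something along these lines (a sketch only; libnuma is assumed
to be linked in, the GUC name is hypothetical, and shmem_base / shmem_size
stand for the main shared memory segment touched by the patch):

#include "postgres.h"
#include "utils/guc.h"
#include <numa.h>

static bool numa_interleave = true;     /* hypothetical GUC */

static void
maybe_interleave_shmem(void *shmem_base, Size shmem_size)
{
    if (!numa_interleave || numa_available() < 0)
        return;                 /* leave page placement to the OS */

    /* spread the shared memory pages across every NUMA node */
    numa_interleave_memory(shmem_base, shmem_size, numa_all_nodes_ptr);
}

void
register_numa_guc(void)
{
    /* registered as a custom variable here just for the sketch */
    DefineCustomBoolVariable("numa_interleave",
                             "Interleave shared memory across NUMA nodes.",
                             NULL,
                             &numa_interleave,
                             true,
                             PGC_POSTMASTER,
                             0,
                             NULL, NULL, NULL);
}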

Thanks,

2014-06-11 2:34 GMT+09:00 Kevin Grittner :
> Josh Berkus  wrote:
>> On 06/08/2014 03:45 PM, Kevin Grittner wrote:
>>> By default, the OS cache and buffers are allocated in the memory
>>> node with the shortest "distance" from the CPU a process is
>>> running on.
>>
>> Note that this will stop being the default in future Linux kernels.
>> However, we'll have to deal with the old ones for some time to come.
>
> I was not aware of that.  Thanks.  Do you have a URL handy?
>
> In any event, that is the part of the problem which I think falls
> into the realm of packagers and/or sysadmins; a patch for that
> doesn't seem sensible, given how cpusets are implemented.  I did
> figure we would want to add some documentation around it, though.
> Do you agree that is worthwhile?
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers



-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sepgsql: label regression test failed

2014-05-16 Thread Kohei KaiGai
2014-05-16 16:26 GMT+09:00 Heikki Linnakangas :
> On 05/14/2014 07:33 AM, Sergey Muraviov wrote:
>>
>> I've got this compiler warning:
>>   relation.c: In function ‘sepgsql_relation_drop’:
>> relation.c:472:25: warning: ‘tclass’ may be used uninitialized in this
>> function [-Wmaybe-uninitialized]
>>sepgsql_avc_check_perms(&object,
>>   ^
>
> KaiGei, could you take a look at this warning, too? It looks like a genuine
> bug to me, but I'm not sure what we should do there instead.
>
This warning is harmless, because the code path that does not initialize
the "tclass" variable (the case when the dropped relation is an index) never
reaches the code path that references "tclass".
It just checks the schema's {remove_name} permission, then jumps to
a separate code path for indexes and never comes back.
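To make the pattern concrete, here is a sketch of the control flow gcc is
worried about; the helper functions are hypothetical, and a dummy initial
value like this is one way to silence the warning:

static void
drop_object(bool is_index)
{
    uint16      tclass = 0;     /* dummy value, never read for indexes */

    if (is_index)
    {
        check_schema_remove_name();     /* hypothetical helper */
        return;                         /* tclass is never referenced */
    }

    tclass = SEPG_CLASS_DB_TABLE;       /* real code picks a class per relkind */
    check_drop_permission(tclass);      /* hypothetical helper */
}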

By the way, I could not reproduce this message in my environment with -Wall
(Fedora 20, gcc 4.8.2). Is it a newer compiler's wisdom?

Thanks,
-- 
KaiGai Kohei 


sepgsql-fixup-maybe-uninitialized-warnning.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sepgsql: label regression test failed

2014-05-15 Thread Kohei KaiGai
Sorry, I had forgotten about the report.

The failure in the label test comes from a specification change in the MCS policy.
Previously it was applied to all domains, including unconfined_t, but now
it is applied only to domains with the "mcs_constrained_type" attribute.

This regression test runs sepgsql_setcon() in the system "unconfined_t" domain
and observes the behavior when a process tries to move to a wider or narrower
range of categories, so it was affected by the system policy change, even
though sepgsql itself behaves as intended.

The attached patch adds the "mcs_constrained_type" attribute to the domain used
by this regression test, if the attribute exists, so it will work on both F20
and older systems.

Regarding the regression tests for ddl and alter, it looks to me as if the
hook invocations around recomputeNamespacePath() are gone, because search on
the schema had already been allowed and checked.
Has the behavior around recomputeNamespacePath() been updated recently?
At least it is not a problem here, since the {search} permission on
"regtest_schema_2" is checked in this test scenario.

Thanks,

2014-05-14 13:33 GMT+09:00 Sergey Muraviov :
> Hi.
>
> Some regression tests for sepgsql still not work on Fedora 20:
>
> == running regression test queries==
> test label... FAILED
> test dml  ... ok
> test ddl  ... FAILED
> test alter... FAILED
> test misc ... ok
>
> ==
>  3 of 5 tests failed.
> ==
>
> $ sestatus
> SELinux status: enabled
> SELinuxfs mount:/sys/fs/selinux
> SELinux root directory: /etc/selinux
> Loaded policy name: targeted
> Current mode:   enforcing
> Mode from config file:  enforcing
> Policy MLS status:  enabled
> Policy deny_unknown status: allowed
> Max kernel policy version:  29
>
> $ uname -i -o -r
> 3.14.3-200.fc20.x86_64 x86_64 GNU/Linux
>
> $ /usr/local/pgsql/bin/postgres --version
> postgres (PostgreSQL) 9.4beta1
>
> PS
> I've got this compiler warning:
>  relation.c: In function ‘sepgsql_relation_drop’:
> relation.c:472:25: warning: ‘tclass’ may be used uninitialized in this
> function [-Wmaybe-uninitialized]
>   sepgsql_avc_check_perms(&object,
>  ^
>
>
> 2013-12-25 0:34 GMT+04:00 Kohei KaiGai :
>
>> Hello,
>>
>> It seems to me changes in the base security policy on Fedora affected to
>> the regression test. Our test cases for sepgsql_setcon() utilizes the MCS
>> rules, that prevents domain transition from narrow categories to wider
>> ones,
>> to control the success cases and failure cases.
>>
>> However, its coverage was changed. It was applied all the domains in the
>> system, thus "unconfined_t" domain had been enforced by MCS rules.
>> But now, it shall be applied only domains with "mcs_constrained_type"
>> attribute.
>>
>> [kaigai@vmlinux tmp]$ diff -up old/policy/mcs new/policy/mcs
>>   :
>>  
>>   :
>>  mlsconstrain process { transition dyntransition }
>> -   (( h1 dom h2 ) or ( t1 == mcssetcats ));
>> +   (( h1 dom h2 ) or ( t1 != mcs_constrained_type ));
>>
>> Probably, we need to define a domain by ourselves for regression test to
>> ensure
>> the test stability, not using the system "unconfined" domain that has
>> different
>> meaning by release.
>>
>> I'll make a patch. Please wait for a while.
>>
>> Thanks for your test & reports.
>>
>> 2013/12/18 Sergey Muraviov :
>> > # semodule -l | grep sepgslq
>> > sepgsql-regtest 1.07
>> >
>> > Full list of modules is in attachment.
>> >
>> >
>> > 2013/12/18 Kohei KaiGai 
>> >>
>> >> Could you show me semodule -l on your environment?
>> >> I believe security policy has not been changed between F19 and F20...
>> >>
>> >> Thanks,
>> >>
>> >> 2013/12/18 Sergey Muraviov :
>> >> > Hi
>> >> >
>> >> > I've tried to test postgres 9.3.2 and 9.4devel with selinux on Fedora
>> >> > 20
>> >> > and
>> >> > met with a label regression test failure.
>> >> >
>> >> > PS
>> >> > I've got some warning during build process.
>> >> >
>> >> > --
>> >> > Best regards,
>> >> > Sergey Muraviov
>> >> >
>> >> >
>> >> > --
>> >> > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
>> >> > To make changes to your subscription:
>> >> > http://www.postgresql.org/mailpref/pgsql-hackers
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> KaiGai Kohei 
>> >
>> >
>> >
>> >
>> > --
>> > Best regards,
>> > Sergey Muraviov
>>
>>
>>
>> --
>> KaiGai Kohei 
>
>
>
>
> --
> Best regards,
> Sergey Muraviov



-- 
KaiGai Kohei 


sepgsql-fixup-regtest-policy.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-04-28 Thread Kohei KaiGai
>> Yeah.  I'm still not exactly convinced that custom-scan will ever allow
>> independent development of new plan types (which, with all due respect to
>> Robert, is what it was being sold as last year in Ottawa).  But I'm not
>> opposed in principle to committing it, if we can find a way to have a cleaner
>> API for things like setrefs.c.  It seems like late-stage planner processing
>> in general is an issue for this patch (createplan.c and subselect.c are
>> also looking messy).  EXPLAIN isn't too great either.
>>
>> I'm not sure exactly what to do about those cases, but I wonder whether
>> things would get better if we had the equivalent of
>> expression_tree_walker/mutator capability for plan nodes.  The state of
>> affairs in setrefs and subselect, at least, is a bit reminiscent of the
>> bad old days when we had lots of different bespoke code for traversing
>> expression trees.
>>
> Hmm. If we have something like expression_tree_walker/mutator for plan nodes,
> we can pass a walker/mutator function's pointer instead of exposing static
> functions that takes recursive jobs.
> If custom-plan provider (that has sub-plans) got a callback with walker/
> mutator pointer, all it has to do for sub-plans are calling this new
> plan-tree walking support routine with supplied walker/mutator.
> It seems to me more simple design than what I did.
>
I tried to write similar walker/mutator functions for the plan-node tree;
however, it was not possible to implement these routines simply enough,
because the jobs of the walker/mutator functions are not uniform, so the
caller side also has to carry large switch-case branches.

I picked setrefs.c for my investigation.
set_plan_refs() applies fix_scan_list() to the expression trees appearing
in a plan node if it is derived from Scan; however, it also applies
set_join_references() to subclasses of Join, or
set_dummy_tlist_references() to some other plan nodes.
This implies that walker/mutator functions for Plan nodes have to apply
different operations according to the type of the Plan node, and I'm not
certain how many different forms are needed.
(In addition, set_plan_refs() usually behaves like a walker, but
sometimes behaves as a mutator for trivial subqueries.)

I'm expecting a function like the one below. It calls the plan_walker
function for each plan node and the expr_walker function for each
expression node on the plan node.

bool
plan_tree_walker(Plan *plan,
 bool (*plan_walker) (),
 bool (*expr_walker) (),
 void *context)

I'd like to hear whether some other form would be better for implementing this routine.
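To make the idea a bit more concrete, a rough skeleton could look like the
following sketch (the function-pointer prototypes are written out here for
readability); real code would still need per-node cases for SubqueryScan,
Append, BitmapAnd and friends, which is exactly the bulk I'm worried about:

bool
plan_tree_walker(Plan *plan,
                 bool (*plan_walker) (Plan *plan, void *context),
                 bool (*expr_walker) (Node *node, void *context),
                 void *context)
{
    if (plan == NULL)
        return false;

    /* let the caller look at the plan node itself first */
    if (plan_walker(plan, context))
        return true;

    /* expressions that every plan node carries */
    if (expr_walker((Node *) plan->targetlist, context) ||
        expr_walker((Node *) plan->qual, context))
        return true;

    /* recurse into the standard child slots */
    if (plan_tree_walker(plan->lefttree, plan_walker, expr_walker, context) ||
        plan_tree_walker(plan->righttree, plan_walker, expr_walker, context))
        return true;

    /*
     * Node-specific children (the subplan of SubqueryScan, the list of
     * Append, initPlans and so on) would need a switch on nodeTag(plan) here.
     */
    return false;
}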


One alternative idea to give the custom-plan provider a chance to
handle its subplans is to pass it function pointers (1) to handle
recursion over the plan tree and (2) to set up the backend's internal
state.
In the case of setrefs.c, set_plan_refs() and fix_expr_common()
are the minimum necessity for extensions. That would also remove the need
to export static functions.

What do you think?
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Best way to know frequency of column reference?

2014-03-21 Thread Kohei KaiGai
Hello,

As people may know, I've implemented a relation cache mechanism on top of the
custom-plan interface that holds the contents of particular columns only; thus
it does not need storage access as long as the user's query refers only to the
columns in the in-memory cache.
The key factor in how this mechanism works is the selection of the columns to
be cached on the initial read, or re-read, of the related heap.
The simplest idea is just to pick up the columns referenced by the query on
demand, and reconstruct the cache later if a further query references a wider
set of columns than the previous one; however, that is not a good strategy.
So I'd like to investigate ways to select the columns to be cached adaptively.
One better idea is probably column selection according to the frequency of
column references within a particular time slot.
Right now pg_statistic does not record that kind of information, if I
understand correctly. Is there any way to retrieve how many times columns
have been referenced? Or do I need to implement my own extension to track it?
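If an extension turns out to be the only way, something like the sketch below
is what I have in mind: hook query analysis and count Var nodes per
(relation, attribute). The record_column_reference() helper is hypothetical;
the walker machinery itself is the existing one in nodeFuncs.c.

#include "postgres.h"
#include "nodes/nodeFuncs.h"
#include "parser/analyze.h"
#include "parser/parsetree.h"

/* hypothetical helper that bumps a (relid, attno) counter somewhere */
extern void record_column_reference(Oid relid, AttrNumber attno);

typedef struct ColRefWalkerContext
{
    List       *rtable;         /* range table of the current query level */
} ColRefWalkerContext;

static bool
colref_walker(Node *node, ColRefWalkerContext *ctx)
{
    if (node == NULL)
        return false;
    if (IsA(node, Var))
    {
        Var        *var = (Var *) node;

        if (var->varlevelsup == 0 && var->varattno > 0)
        {
            RangeTblEntry *rte = rt_fetch(var->varno, ctx->rtable);

            if (rte->rtekind == RTE_RELATION)
                record_column_reference(rte->relid, var->varattno);
        }
        return false;
    }
    if (IsA(node, Query))
    {
        /* sub-queries have their own range table */
        ColRefWalkerContext subctx = { ((Query *) node)->rtable };

        return query_tree_walker((Query *) node, colref_walker, &subctx, 0);
    }
    return expression_tree_walker(node, colref_walker, ctx);
}

/* intended to be installed as post_parse_analyze_hook from _PG_init() */
static void
colref_post_parse_analyze(ParseState *pstate, Query *query)
{
    ColRefWalkerContext ctx = { query->rtable };

    query_tree_walker(query, colref_walker, &ctx, 0);
}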

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Creating tables for columns

2014-03-21 Thread Kohei KaiGai
I have implemented similar code on top of the FDW API:
  https://github.com/kaigai/pg_strom/blob/old_cuda/utilcmds.c#L244

heap_create_with_catalog() is probably what you are looking for.

2014-03-21 22:57 GMT+09:00 Rajashree Mandaogane :
> We are working on a project in which we need to create tables for each
> column. So which function should we call in recursion to create the tables?



-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: contrib/cache_scan (Re: [HACKERS] What's needed for cache-only table scan?)

2014-03-06 Thread Kohei KaiGai
2014-03-06 18:17 GMT+09:00 Haribabu Kommi :
>
> On Tue, Mar 4, 2014 at 3:07 PM, Kouhei Kaigai  wrote:
>>
>> > > 4. + cchunk = ccache_vacuum_tuple(ccache, ccache->root_chunk, &ctid);
>> > > + if (pchunk != NULL && pchunk != cchunk)
>> > >
>> > > + ccache_merge_chunk(ccache, pchunk);
>> > >
>> > > + pchunk = cchunk;
>> > >
>> > >
>> > >   The merge_chunk is called only when the heap tuples are spread
>> > > across two cache chunks. Actually one cache chunk can accommodate one
>> > > or more than heap pages. it needs some other way of handling.
>> > >
>> > I adjusted the logic to merge the chunks as follows:
>> >
>> > Once a tuple is vacuumed from a chunk, it also checks whether it can be
>> > merged with its child leafs. A chunk has up to two child leafs; left one
>> > has less ctid that the parent, and right one has greater ctid. It means
>> > a chunk without right child in the left sub-tree or a chunk without left
>> > child in the right sub-tree are neighbor of the chunk being vacuumed. In
>> > addition, if vacuumed chunk does not have either (or both) of children,
>> > it can be merged with parent node.
>> > I modified ccache_vacuum_tuple() to merge chunks during t-tree
>> > walk-down,
>> > if vacuumed chunk has enough free space.
>> >
>
>
> Patch looks good.
>
Thanks for volunteering.

> Regarding merging of the nodes, instead of checking whether merge is
> possible or not for every tuple which is vacuumed,
> can we put some kind of threshold as whenever the node is 50% free then try
> to merge it from leaf nodes until 90% is full.
> The rest of the 10% will be left for the next inserts on the node. what do
> you say?
>
Hmm, indeed, that makes sense. How about an idea that kicks off chunk merging
only if the "expected" free space of the merged chunk is less than 50%?
If the threshold depends on the (expected) usage of the merged chunk, it can
avoid over-merging.
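In other words, something like this sketch; the chunk fields here are
hypothetical stand-ins for the ones in the patch:

typedef struct ccache_chunk_usage       /* hypothetical, mirrors the patch */
{
    Size        chunk_size;     /* total bytes in the chunk */
    Size        usage;          /* bytes currently occupied */
} ccache_chunk_usage;

static bool
ccache_merge_is_reasonable(ccache_chunk_usage *a, ccache_chunk_usage *b)
{
    Size        total_size = a->chunk_size + b->chunk_size;
    Size        total_used = a->usage + b->usage;

    /*
     * Merge only when the merged chunk is expected to stay at least half
     * full, so that we do not have to split it again right away.
     */
    return (total_size - total_used) < total_size / 2;
}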

> I will update you later regarding the performance test results.
>
Thanks,

Also, I'll rebase the patch on top of the new custom-scan interfaces
according to Tom's suggestion, even though the main logic of cache_scan
is unchanged.

Best regards,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-03-04 Thread Kohei KaiGai
2014-03-05 5:52 GMT+09:00 Stephen Frost :
> * Robert Haas (robertmh...@gmail.com) wrote:
>> On Tue, Mar 4, 2014 at 2:34 PM, Stephen Frost  wrote:
>> > Alright- so do you feel that the simple ctidscan use-case is a
>> > sufficient justification and example of how this can be generally
>> > useful that we should be adding these hooks to core..?  I'm willing to
>> > work through the patch and clean it up this weekend if we agree that
>> > it's useful and unlikely to immediately be broken by expected changes..
>>
>> Yeah, I think it's useful.  But based on Tom's concurrently-posted
>> review, I think there's probably a good deal of work left here.
>
> Yeah, it certainly looks like it.
>
> KaiGai- will you have time to go over and address Tom's concerns..?
>
Yes, I need to. Let me work through it in the latter half of this week and
over the weekend; I'd like to submit a revised version by next Monday.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-03-04 Thread Kohei KaiGai
2014-03-04 23:10 GMT+09:00 Stephen Frost :
>> The "cache_scan" module that I and Haribabu are discussing in another
>> thread also might be a good demonstration for custom-scan interface,
>> however, its code scale is a bit larger than ctidscan.
>
> That does sound interesting though I'm curious about the specifics...
>
This module caches a subset of the columns, not all of them, and thus can hold
a much larger number of records for a given amount of RAM than the
standard buffer cache.
It is built on top of the custom-scan node, and also adds a new
hook that invokes a callback on page vacuuming to invalidate its cache entries.
(By the way, I originally designed this module to demonstrate the on-vacuum hook,
because I had already made ctidscan and the postgres_fdw enhancement for
the custom-scan node.)

>> > For one thing, an example where you could have this CustomScan node calling
>> > other nodes underneath would be interesting.  I realize the CTID scan can't
>> > do that directly but I would think your GPU-based system could; after all,
>> > if you're running a join or an aggregate with the GPU, the rows could come
>> > from nearly anything.  Have you considered that, or is the expectation that
>> > users will just go off and access the heap and/or whatever indexes 
>> > directly,
>> > like ctidscan does?  How would such a requirement be handled?
>> >
>> In case when custom-scan node has underlying nodes, it shall be invoked using
>> ExecProcNode as built-in node doing, then it will be able to fetch tuples
>> come from underlying nodes. Of course, custom-scan provider can perform the
>> tuples come from somewhere as if it came from underlying relation. It is
>> responsibility of extension module. In some cases, it shall be required to
>> return junk system attribute, like ctid, for row-level locks or table 
>> updating.
>> It is also responsibility of the extension module (or, should not add custom-
>> path if this custom-scan provider cannot perform as required).
>
> Right, tons of work to do to make it all fit together and play nice-
> what I was trying to get at is: has this actually been done?  Is the GPU
> extension that you're talking about as the use-case for this been
> written?
>
It's a chicken-and-egg problem, because the implementation of the extension
module fully depends on the interface provided by the backend. Unlike the
commitfest, there is no deadline for my extension module, so I put higher
priority on the submission of the custom-scan node than on the extension.
However, the GPU extension is not purely theoretical. I implemented
a prototype using the FDW APIs, and it could accelerate sequential scans when
the query had sufficiently complicated qualifiers.

See the movie (from 2:45). The table t1 is a regular table, and t2 is a foreign
table. Both of them have the same contents; however, the response time of the
query is much faster when GPU acceleration is working.
http://www.youtube.com/watch?v=xrUBffs9aJ0
So I'm confident that GPU acceleration will yield a performance gain once it
can run on regular tables, not only foreign tables.

> How does it handle all of the above?  Or are we going through
> all these gyrations in vain hope that it'll actually all work when
> someone tries to use it for something real?
>
I'm not talking about anything difficult. If a junk attribute requires returning
the "ctid" of the tuple, the custom-scan provider reads a tuple of the underlying
relation and then includes a correct item pointer. If the custom scan is designed
to run on the cache, all it needs to do is reconstruct a tuple with the correct
item pointer (thus this cache needs to hold the ctid as well). That is all I did
in the cache_scan module.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-03-04 Thread Kohei KaiGai
2014-03-04 23:09 GMT+09:00 Robert Haas :
> On Mon, Mar 3, 2014 at 5:15 PM, Stephen Frost  wrote:
>>> As I mentioned
>>> up-thread, I'd really like to see FDW join push-down, FDW aggregate
>>> push-down, parallel query execution, and parallel remote-FDW execution
>>> and I don't see this CustomScan approach as the right answer to any of
>>> those.
>>
>> In accordance with the above, what I'd like to see with this patch is
>> removal of the postgres_fdw changes and any changes which were for that
>> support.  In addition, I'd like to understand why 'ctidscan' makes any
>> sense to have as an example of what to use this for- if that's valuable,
>> why wouldn't we simply implement that in core?  I do want an example in
>> contrib of how to properly use this capability, but I don't think that's
>> it.
>
> I suggested that example to KaiGai at last year's PGCon.  It may
> indeed be something we want to have in core, but right now we don't.
>
> More generally, I think this discussion is focusing on the wrong set
> of issues.  The threshold issue for this patch is whether there is a
> set of hook points that enable a workable custom-scan functionality,
> and whether KaiGai has correctly identified them.  In other words, I
> think we should be worrying about whether KaiGai's found all of the
> places that need to be modified to support a custom scan, and whether
> the modifications he's made to each of those places are correct and
> adequate.  Whether he's picked the best possible example does not
> strike me as a matter of principal concern, and it's far too late to
> tell him he's got to go pick a different one at this point anyway.
>
That is definitely the point to be discussed here. Even though I *believe*
I have put in the callbacks needed to implement an alternative join / scan,
other people may reach a different conclusion from their viewpoints.

At least, I could implement a custom scan as an alternative to a join
using postgres_fdw; however, it's uncertain whether I have covered
all the possible cases we should care about.
So I'd like to see comments from others.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Triggers on foreign tables

2014-03-03 Thread Kohei KaiGai
I checked the latest (v8) patch again and could not find any
problem in your design change from v7.
As Noah pointed out, it uses a per-query-depth tuplestore that is released
in AfterTriggerEndSubXact.

So, may I mark it as "ready for committer"?

2014-03-03 15:48 GMT+09:00 Ronan Dunklau :
> Hello.
>
> Did you have time to review the latest version of this patch ? Is there
> anything I can do to get this "ready for commiter" ?
>
> Thank you for all the work performed so far.
>
>
>
>
> Le mardi 4 février 2014 13:16:22 Ronan Dunklau a écrit :
>> Le lundi 3 février 2014 23:28:45 Noah Misch a écrit :
>> > On Sun, Feb 02, 2014 at 11:53:51AM +0100, Ronan Dunklau wrote:
>> > > Le jeudi 30 janvier 2014 14:05:08 Noah Misch a écrit :
>> > > > On Thu, Jan 23, 2014 at 03:17:35PM +0100, Ronan Dunklau wrote:
>> > > > > What do you think about this approach ? Is there something I missed
>> > > > > which
>> > > > > would make it not sustainable ?
>> > > >
>> > > > Seems basically reasonable.  I foresee multiple advantages from having
>> > > > one
>> > > > tuplestore per query level as opposed to one for the entire
>> > > > transaction.
>> > > > You would remove the performance trap of backing up the tuplestore by
>> > > > rescanning. It permits reclaiming memory and disk space in
>> > > > AfterTriggerEndQuery() rather than at end of transaction.  You could
>> > > > remove
>> > > > ate_ptr1 and ate_ptr2 from AfterTriggerEventDataFDW and just store the
>> > > > flags word: depending on AFTER_TRIGGER_2CTIDS, grab either the next
>> > > > one
>> > > > or
>> > > > the next two tuples from the tuplestore.  Using work_mem per
>> > > > AfterTriggerBeginQuery() instead of per transaction is no problem.
>> > > > What
>> > > > do
>> > > > you think of that design change?
>> > >
>> > > I agree that this design is better, but I have some objections.
>> > >
>> > > We can remove ate_ptr2 and rely on the AFTER_TRIGGER_2CTIDS flag, but
>> > > the
>> > > rescanning and ate_ptr1 (renamed ate_tupleindex in the attached patch)
>> > > can't go away.
>> > >
>> > > Consider for example the case of a foreign table with more than one
>> > > AFTER
>> > > UPDATE triggers. Unless we store the tuples once for each trigger, we
>> > > will
>> > > have to rescan the tuplestore.
>> >
>> > Will we?  Within a given query level, when do (non-deferred) triggers
>> > execute in an order other than the enqueue order?
>>
>> Let me explain what I had in mind.
>>
>> Looking at the code in AfterTriggerSaveEvent:
>>
>> - we build a "template" AfterTriggerEvent, and store the tuple(s)
>> - for each suitable after trigger that matches the trigger type, as well as
>> the WHEN condition if any, a copy of the previously built AfterTriggerEvent
>> is queued
>>
>> Later, those events are fired in order.
>>
>> This means that more than one event can be fired for one tuple.
>>
>> Take this example:
>>
>> CREATE TRIGGER trig_row_after1
>> AFTER UPDATE ON rem2
>> FOR EACH ROW
>> WHEN (NEW.f1 % 5 < 3)
>> EXECUTE PROCEDURE trigger_func('TRIG1');
>>
>> CREATE TRIGGER trig_row_after2
>> AFTER UPDATE ON rem2
>> FOR EACH ROW
>> WHEN (NEW.f1 % 5 < 4)
>> EXECUTE PROCEDURE trigger_func('TRIG2');
>>
>> UPDATE rem2 set f2 = 'something';
>>
>> Assuming 5 rows with f1 as a serial, the fired AfterTriggerEvent's
>> ate_tupleindex will be, in that order. Ass
>>
>> 0-0-2-2-4-8-8
>>
>> So, at least a backward seek is required for trig_row_after2 to be able to
>> retrieve a tuple that was already consumed when firing trig_row_after1.
>>
>> On a side note, this made me realize that it is better to avoid storing a
>> tuple entirely if there is no enabled trigger (the f1 = 4 case above). The
>> attached patch does that, so the previous sequence becomes:
>>
>> 0-0-2-2-4-6-6
>>
>> It also prevents from initalizing a tuplestore at all if its not needed.
>>
>> > > To mitigate the effects of this behaviour, I added the option to perform
>> > > a
>> > > reverse_seek when the looked-up tuple is nearer from the current index
>> > > than
>> > > from the start.
>> >
>> > If there's still a need to seek within the tuplestore, that should get rid
>> > of the O(n^2) effect.  I'm hoping that per-query-level tuplestores will
>> > eliminate the need to seek entirely.
>>
>> I think the only case when seeking is still needed is when there are more
>> than one after trigger that need to be fired, since the abovementioned
>> change prevents from seeking to skip tuples.
>>
>> > > > If you do pursue that change, make sure the code still does the right
>> > > > thing
>> > > > when it drops queued entries during subxact abort.
>> > >
>> > > I don't really understand what should be done at that stage. Since
>> > > triggers on foreign tables are not allowed to be deferred, everything
>> > > should be cleaned up at the end of each query, right ? So, there
>> > > shouldn't be any queued entries.
>> >
>> > I suspect that's right.  If you haven't looked over
>> > AfterTriggerEndSubXact(), please do so and ensure all its ac

Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-03-01 Thread Kohei KaiGai
2014-03-02 10:38 GMT+09:00 Robert Haas :
> On Wed, Feb 26, 2014 at 10:23 AM, Stephen Frost  wrote:
>> * Kouhei Kaigai (kai...@ak.jp.nec.com) wrote:
>>> IIUC, his approach was integration of join-pushdown within FDW APIs,
>>> however, it does not mean the idea of remote-join is rejected.
>>
>> For my part, trying to consider doing remote joins *without* going
>> through FDWs is just nonsensical.
>
> That is, of course, true by definition, but I think it's putting the
> focus in the wrong place.  It's possible that there are other cases
> when a scan might a plausible path for a joinrel even if there are no
> foreign tables in play.  For example, you could cache the joinrel
> output and then inject a cache scan as a path for the joinrel.
>
That might be a better way to demonstrate the usage of the custom-scan node
than the (ad-hoc) enhancement of postgres_fdw.
As I have discussed in another thread, it is possible to switch a heap
reference to a cache reference on the fly, which would be a possible use
case for the custom-scan node.

So I'm inclined to drop the postgres_fdw portion from my submission
to focus on the custom-scan capability.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-03-01 Thread Kohei KaiGai
2014-03-02 10:29 GMT+09:00 Stephen Frost :
> * Kohei KaiGai (kai...@kaigai.gr.jp) wrote:
>> As you mentioned, it is a headache for packagers, and does not make
>> sense for us if packager disabled the feature that requires proprietary
>> drivers.
>
> No, I disagree with that.  I don't expect this use-case to be very
> common to begin with and telling individuals that they have to compile
> it themselves is certainly not out of the question.
>
>> In fact, Fedora / RHEL does not admit to distribute software
>> under the none open source software license.
>
> I'm pretty confident you can get RPMs for those distributions.
>
>> Obviously, nvidia's cuda
>> is a library being distributed under the proprietary license, thus out of
>> the scope for the Linux distributors.
>
> This also doesn't make any sense to me- certainly the CUDA libraries are
> available under Debian derivatives, along with open-source wrappers for
> them like pycuda.
>
>> It also leads them to turn off the
>> feature that shall be linked with proprietary drivers.
>> All we can do is to implement these features as extension, then offer
>> an option for users to use or not to use.
>
> No, we can tell individuals who want it that they're going to need to
> build with support for it.  We don't offer OpenSSL as an extension (I
> certainly wish we did- and had a way to replace it w/ GNUTLS or one of
> the better licensed options).
>
I know there are some alternative ways. However, they require users to acquire
additional knowledge and spend setup effort, and they also lose the benefit of
using the distributor's Linux if alternative RPMs are required.
I don't want to recommend such a complicated setup procedure to users.

>> What I'd like to implement is GPU acceleration that can perform on
>> regular tables, not only foreign tables. Also, regarding to the backlog
>> in the developer meeting, pluggable executor node is also required
>> feature by PG-XC folks to work their project with upstream.
>> I think custom-scan feature is capable to host these requirement,
>> and does not prevent enhancement FDW features.
>
> I think you're conflating things again- while it might be possible to
> use CustomScan to implement FDW join-pushdown or FDW aggregate-pushdown,
> *I* don't think it's the right approach.  Regarding the PG-XC
> requirement, I expect they're looking for FDW join/aggregate-pushdown
> and also see that it *could* be done w/ CustomScan.
>
The reason I submitted the part-3 patch (which enhances postgres_fdw
for remote join using a custom scan) is that it makes it easy to demonstrate
replacing a join with a special scan, with a minimal amount of patch to be
reviewed. If we have another way to demonstrate it, I'm not attached to the
remote-join feature on foreign tables.
Regarding PG-XC, I don't know their exact needs because I didn't
attend the cluster meeting, but someone mentioned pluggable
plan/exec nodes in this context.

> We could punt on the whole thing and drop in hooks which could be used
> to replace everything done from the planner through to the executor and
> then anyone *could* implement any of the above, and parallel query too.
> That doesn't make it the right approach.
>
That is a problem I pointed out at the last developer meeting. Because there is
no way for an extension to enhance just a part of the plan/exec logic, an
extension has to replace the whole planner/executor using hooks. That is painful
for extension authors.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-03-01 Thread Kohei KaiGai
2014-03-02 9:51 GMT+09:00 Stephen Frost :
> KaiGai,
>
> * Kohei KaiGai (kai...@kaigai.gr.jp) wrote:
>> Now we have two options for GPU programming: CUDA or OpenCL.
>> Both of libraries and drivers are provided under the proprietary license,
>> so it does not fit for the core implementation of PostgreSQL, but
>> extensions that shall be installed on user's responsibility.
>
> Being able to work with libraries which are not BSD licensed doesn't
> change the license under which PostgreSQL code is released.  Nor does it
> require PostgreSQL to be licensed in any different way from how it is
> today.  Where it would get a bit ugly, I agree, is for the packagers who
> have to decide if they want to build against those libraries or not.  We
> might be able to make things a bit easier by having a startup-time
> determination of if these nodes are able to be used or not.  This isn't
> unlike OpenSSL which certainly isn't BSD nor is it even GPL-compatible,
> a situation which causes a great deal of difficulty already due to the
> whole readline nonsense- but it's difficulty for the packagers, not for
> the PostgreSQL project, per se.
>
As you mentioned, it is a headache for packagers, and it does not make
sense for us if packagers disable a feature that requires proprietary
drivers. In fact, Fedora / RHEL does not permit distributing software
under a non-open-source license. Obviously, NVIDIA's CUDA
is a library distributed under a proprietary license, and thus out of
scope for the Linux distributors. That also leads them to turn off any
feature that has to be linked with proprietary drivers.
All we can do is implement these features as an extension, and then offer
users the option to use them or not.

>> Because of the story, I brought up a discussion about pluggable
>> planner/executor node (as a basis of GPU acceleration) in the last
>> developer meeting, then has worked for custom-scan node interface.
>
> And I'm still unconvinced of this approach and worry that it's going to
> break more often than it works.  That's my 2c on it, but I won't get in
> the way if someone else wants to step up and support it.  As I mentioned
> up-thread, I'd really like to see FDW join push-down, FDW aggregate
> push-down, parallel query execution, and parallel remote-FDW execution
> and I don't see this CustomScan approach as the right answer to any of
> those.
>
It's the right approach for enhancing FDW functionality; I have never opposed it.

What I'd like to implement is GPU acceleration that can run on regular
tables, not only foreign tables. Also, regarding the backlog from the
developer meeting, a pluggable executor node is also a feature the PG-XC
folks require to align their project with upstream.
I think the custom-scan feature is capable of hosting these requirements,
and it does not prevent enhancing the FDW features.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-03-01 Thread Kohei KaiGai
2014-03-01 22:38 GMT+09:00 Stephen Frost :
> KaiGai,
>
> * Kohei KaiGai (kai...@kaigai.gr.jp) wrote:
>> BTW, this kind of discussion looks like a talk with a ghost because
>> we cannot see the new interface according to the parallel execution
>> right now, so we cannot have tangible investigation whether it becomes
>> really serious backward incompatibility, or not.
>
> Yeah, it would certainly be nice if we had all of the answers up-front.
> What I keep hoping for is that someone who has been working on this area
> (eg: Robert) would speak up...
>
I'd also like to see his opinion.

>> My bet is minor one. I cannot imagine plan-node interface that does
>> not support existing non-parallel SeqScan or NestLoop and so on.
>
> Sure you can- because once we change the interface, we're probably going
> to go through and make everything use the new one rather than have to
> special-case things.  That's more-or-less exactly my point here because
> having an external hook like CustomScan would make that kind of
> wholesale change more difficult.
>
I think we should follow the general rule in the case of custom-scan as
well. In other words, it is the extension author's responsibility to keep
up with the latest specification of the interfaces.
For example, I have an extension module that is unable to work on the
latest PostgreSQL code because of interface changes to ProcessUtility_hook.
Is that a matter of backward incompatibility? Definitely not. Catching up
is my job.

> That does *not* mean I'm against using GPUs and GPU optimizations.  What
> it means is that I'd rather see that done in core, which would allow us
> to simply change that interface along with the rest when doing wholesale
> changes and not have to worry about backwards compatibility and breaking
> other people's code.
>
I also have to mention some background from a previous discussion.
Right now we have two options for GPU programming: CUDA or OpenCL.
Both the libraries and the drivers are provided under proprietary licenses,
so they do not fit the core implementation of PostgreSQL, but rather
extensions that are installed on the user's own responsibility.
Because of this, I brought up a discussion about a pluggable
planner/executor node (as a basis for GPU acceleration) at the last
developer meeting, and have since worked on the custom-scan node interface.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-02-28 Thread Kohei KaiGai
2014-03-01 0:36 GMT+09:00 Stephen Frost :
> * Kouhei Kaigai (kai...@ak.jp.nec.com) wrote:
>> * Stephen Frost (sfr...@snowman.net) wrote:
>> > I don't see how you can be when there hasn't been any discussion that I've
>> > seen about how parallel query execution is going to change things for us.
>> >
>> If parallel query execution changes whole of the structure of plan nodes,
>> it will also affect to the interface of custom-scan because it is a thin-
>> abstraction of plan-node. However, if parallel execution feature is
>> implemented as one of new plan node in addition to existing one, I cannot
>> imagine a scenario that affects to the structure of another plan node.
>
> Let's just say that I have doubts that we'll be able to implement
> parallel execution *without* changing the plan node interface in some
> way which will require, hopefully minor, changes to all of the nodes.
> The issue is that even a minor change would break the custom-scan API
> and we'd immediately be in the boat of dealing with complaints regarding
> backwards compatibility.  Perhaps we can hand-wave that, and we've had
> some success changing hook APIs between major releases, but such changes
> may also be in ways which wouldn't break in obvious ways or even
> possibly be changes which have to be introduced into back-branches.
> Parallel query is going to be brand-new real soon and it's reasonable to
> think we'll need to make bug-fix changes to it after it's out which
> might even involve changes to the API which is developed for it.
>
Even if we change the plan-node interface in a future release, it should
not be a change that makes the existing stuff impossible.
The custom-scan API is designed to provide an alternative way to scan
or join relations, in addition to the existing logic like SeqScan or
NestLoop. If such a change broke the plan-node interfaces so that the
existing stuff could not be implemented, it would be a problem for
everything, not only the custom-scan node. I don't think a change that
makes it impossible to implement the existing logic would be merged.
Most likely, the new parallel execution feature can work together with
the existing sequential logic and the custom-scan interface.

BTW, this kind of discussion feels like talking with a ghost, because we
cannot see the new interface for parallel execution right now, so we
cannot make a tangible investigation of whether it will really become a
serious backward incompatibility or not.
My bet is on a minor one. I cannot imagine a plan-node interface that
does not support the existing non-parallel SeqScan, NestLoop and so on.

>> > The issue here is that we're going to be expected to maintain an interface
>> > once we provide it and so that isn't something we should be doing lightly.
>> > Particularly when it's as involved as this kind of change is with what's
>> > going on in the backend where we are nearly 100% sure to be changing things
>> > in the next release or two.
>> >
>> FDW APIs are also revised several times in the recent releases. If we can
>> design "perfect" interface from the beginning, it's best but usually hard
>> to design.
>
> Sure, but FDWs also have a *much* broader set of use-cases, in my view,
> which is also why I was pushing to work on join-push-down to happen
> there instead of having this kind of a hook interface, which I don't
> think we'd want to directly expose as part of the 'official FDW API' as
> it ends up putting all the work on the FDW with little aide, making it
> terribly likely to end up with a bunch of duplciated code in the FDWs
> from the backend to deal with it, particularly for individuals writing
> FDWs who aren't familiar with what PG already has.
>
It might not have been a good idea to use postgres_fdw as the basis of a
proof of concept to demonstrate that custom-scan can effectively host an
alternative way to join, fetching the result set of a remote join as if it
were a relation scan, even though it demonstrated that this is possible.
So, I don't mind at all if the part-3 portion (remote join of postgres_fdw
on top of custom-scan) is dropped from the submission.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-02-25 Thread Kohei KaiGai
>
> Most of this information is available through corresponding RelOptInfo, or
> we should make RelOptInfo contain all the information related to every
> relation required to be computed during the query. So, any function which
> creates paths can just take that RelOptInfo as an argument and produce the
> path/s. That way there is lesser chance that the function signatures change.
>
Hmm. That would be inconvenient for writing extensions. I want the
variables in the first and second groups to be delivered to the hook, even
though they may see minor modifications in future releases.
Relation joins are at the heart of an RDBMS, so I'd like to believe these
arguments are among the most stable parts.

>> Below is the information we can reproduce but complicated steps:
>>  - List *mergeclause_list
>>  - bool mergejoin_allow
>>  - Relids param_source_rels
>>  - Relids extra_lateral_rels
>>
>> Below is the information we can reproduce easily:
>>  - SemiAntiJoinFactors *semifactors
>>
>> I think, the first two categories or the first category (if functions to
>> reproduce the second group is exposed) should be informed to extension,
>> however, priority of the third group is not high.
>>
>>
>> > BTW, is it a good idea for custom nodes to also affect other paths like
>> > append, group etc.? Will it need separate hooks for each of those?
>> >
>> Yes. I plan to support above plan node, in addition to scan / join only.
>> The custom-scan node is thin abstraction towards general executor
>> behavior,
>> so I believe it is not hard to enhance this node, without new plan node
>> for each of them.
>> Of course, it will need separate hook to add alternative path on the
>> planner
>> stage, but no individual plan nodes. (Sorry, it was unclear for me what
>> does the "hook" mean.)
>>
>
> If we represent all the operation like grouping, sorting, aggregation, as
> some sort of relation, we can create paths for each of the relation like we
> do (I am heavily borrowing from Tom's idea of pathifying those operations).
> We will need much lesser hooks in custom scan node.
>
> BTW, from the patch, I do not see this change to be light weight. I was
> expecting more of a list of hooks to be defined by the user and this
> infrastructure just calling them at appropriate places.
>
Let's focus on the scan and join support we are currently working on.
Even if we need separate node types for grouping or sorting, it will not
be necessary to construct the whole framework from scratch.
For example, the definition of the CustomProvider table will be reusable
for other classes of operations, because most of them are thin abstractions
of the existing executor interface.

Thanks,

>> Thanks,
>> --
>> NEC OSS Promotion Center / PG-Strom Project
>> KaiGai Kohei 
>>
>>
>> > -Original Message-
>> > From: Ashutosh Bapat [mailto:ashutosh.ba...@enterprisedb.com]
>> > Sent: Tuesday, February 25, 2014 5:59 PM
>> > To: Kohei KaiGai
>> > Cc: Kaigai, Kouhei(海外, 浩平); Stephen Frost; Shigeru Hanada; Jim
>> > Mlodgenski; Robert Haas; Tom Lane; PgHacker; Peter Eisentraut
>> > Subject: Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)
>> >
>> >
>> >
>> >
>> > On Sun, Feb 23, 2014 at 6:54 PM, Kohei KaiGai 
>> > wrote:
>> >
>> >
>> >   Folks,
>> >
>> >   Let me remind the custom-scan patches; that is a basis feature of
>> >   remote join of postgres_fdw, cache-only scan, (upcoming) GPU
>> >   acceleration feature or various alternative ways to scan/join
>> > relations.
>> >   Unfortunately, small amount of discussion we could have in this
>> > commit
>> >   fest, even though Hanada-san volunteered to move the patches into
>> >   "ready for committer" state at the CF-Nov.
>> >
>> >
>> >
>> > Sorry for jumping into this late.
>> >
>> > Instead of custom node, it might be better idea to improve FDW
>> > infrastructure
>> > to push join. For the starters, is it possible for the custom scan node
>> > hooks to create a ForeignScan node? In general, I think, it might be
>> > better
>> > for the custom scan hooks to create existing nodes if they serve the
>> > purpose.
>> >
>> >
>> >
>> >
>> >   Prior to time-up, I'd like to ask hacker's opinion about its
>> > potential
>> >   arguable points (from my standpoint) if it needs to

Re: [HACKERS] New hook after raw parsing, before analyze

2014-02-13 Thread Kohei KaiGai
See the discussion of Custom-Scan API.
https://commitfest.postgresql.org/action/patch_view?id=1282

I believe my third patch is what you really want to do...

> This rewritten query would be handled by the FDW table that I previously 
> added to the catalog.
>
> The reason I want this new hook is that I don't want tableA and tableB to be 
> in the catalog.
>
I'd like to understand why you want the pseudo table "fdw_tableA_tableB"
to be in the catalog instead of "tableA" and "tableB". In addition, the
parser will raise an error if the referenced columns (as part of "tableA"
or "tableB") are not in the catalog, because their name lookup fails.

Thanks,


2014-02-13 19:01 GMT+09:00 David Beck :
> Hello Hackers,
>
> I work on a foreign data wrapper for a legacy system. I generally find the 
> hook system very useful and flexible way to extend Postgres.
> The post parse analyze hook almost fits what I need, but I have a few use 
> cases where I would need to tap right into the parsed queries but before any 
> catalog based validation is done.
> Please find the attached trivial patch for this new hook.
>
> One of the use cases I have is this:
>
> I have table like data structures in the source system for the FDW I work on.
> These tables are sometimes too big and the source system is able to filter 
> and join them with limitations, thus it is not optimal to transfer the data 
> to Postgres.
> At the same time I want the users to think in terms of the original tables.
>
> The idea is to rewrite the SQL queries like this:
>
>   "SELECT * FROM tableA a, tableB b WHERE a.id=b.id AND a.col1=1234 AND 
> b.col2=987"
>
> to:
>
>   "SELECT * FROM fdw_tableA_tableB ab WHERE ab.col1=1234 AND ab.col2=987"
>
>
> This rewritten query would be handled by the FDW table that I previously 
> added to the catalog.
>
> The reason I want this new hook is that I don't want tableA and tableB to be 
> in the catalog.
>
> Looking forward to hear your thoughts, opinions, comments.
>
> Best regards, David
>
>
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: contrib/cache_scan (Re: [HACKERS] What's needed for cache-only table scan?)

2014-02-12 Thread Kohei KaiGai
2014-02-12 14:59 GMT+09:00 Haribabu Kommi :
> I reviewed all the three patches. The first 1 and 2 core PostgreSQL patches
> are fine.
> And I have comments in the third patch related to cache scan.
>
Thanks for volunteering.

> 1. +# contrib/dbcache/Makefile
>
>Makefile header comment is not matched with file name location.
>
Ahh, that is an old name from when I started the implementation.

> 2.+   /*
> +   * Estimation of average width of cached tuples - it does not make
> +   * sense to construct a new cache if its average width is more than
> +   * 30% of the raw data.
> +   */
>
>Move the estimation of average width calculation of cached tuples into
> the case where the cache is not found,
>otherwise it is an overhead for cache hit scenario.
>
You are right. If an existing cache is found and matches, its average
width is obviously already less than 30% of the total width.

> 3. + if (old_cache)
> + attrs_used = bms_union(attrs_used, &old_cache->attrs_used);
>
>can't we need the check to see the average width is more than 30%? During
> estimation it doesn't
>include the existing other attributes.
>
Indeed. It should drop some attributes from the existing cache if the
total average width grows beyond the threshold. Probably, we need a
statistical variable to track how often or how recently each attribute is
referenced.

> 4. + lchunk->right = cchunk;
> + lchunk->l_depth = TTREE_DEPTH(lchunk->right);
>
>   I think it should be lchunk->r_depth needs to be set in a clock wise
> rotation.
>
Oops, nice catch.

> 5. can you add some comments in the code with how the block is used?
>
Sorry, I'll add them. A block is consumed from the head to store pointers
to tuples, and from the tail to store the contents of the tuples. A block
can hold multiple tuples as long as the tuple-pointer array growing from
the head does not cross into the area used for tuple contents. Anyway,
I'll put that in the source code; roughly, the layout is as sketched below.
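Just to illustrate the scheme here, a minimal sketch could look like the
following. The names and layout are only illustrative, not the actual
contrib/cache_scan definitions:

#include "postgres.h"

#define CCACHE_BLOCK_SIZE   8192

typedef struct ccache_block
{
    uint32      ntups;      /* number of tuples stored in this block */
    uint32      usage;      /* offset where the tuple-body area begins */
    uint32      offsets[1]; /* tuple offsets; grows upward from the head */
    /* tuple bodies are copied in from CCACHE_BLOCK_SIZE back toward the head */
} ccache_block;

/* a new tuple fits only if the pointer array and the body area never cross */
static bool
ccache_block_has_room(ccache_block *block, Size tuple_len)
{
    Size    head_usage = offsetof(ccache_block, offsets)
                         + sizeof(uint32) * (block->ntups + 1);

    return head_usage + MAXALIGN(tuple_len) <= block->usage;
}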

> 6. In do_insert_tuple function I felt moving the tuples and rearranging
> their addresses is little bit costly. How about the following way?
>
>Always insert the tuple from the bottom of the block where the empty
> space is started and store their corresponding reference pointers
>in the starting of the block in an array. As and when the new tuple
> inserts this array increases from block start and tuples from block end.
>Just need to sort this array based on item pointers, no need to update
> their reference pointers.
>
>In this case the movement is required only when the tuple is moved from
> one block to another block and also whenever if the continuous
>free space is not available to insert the new tuple. you can decide based
> on how frequent the sorting will happen in general.
>
It seems a reasonable suggestion to me.
Probably, an easier implementation is to replace an old block containing
dead space with a new block that contains only the valid tuples, if and
when the dead space grows beyond a threshold of block usage.

> 7. In ccache_find_tuple function this Assert(i_min + 1 < cchunk->ntups); can
> go wrong when only one tuple present in the block
>with the equal item pointer what we are searching in the forward scan
> direction.
>
It shouldn't happen, because the first or second ItemPointerCompare will
handle that condition. Suppose cchunk->ntups == 1. In this case, any given
ctid will match one of those branches, because it is less than, equal to,
or greater than the only cached tuple, and thus the search moves to the
right or left node according to the scan direction.

> 8. I am not able to find a protection mechanism in insert/delete and etc of
> a tuple in Ttree. As this is a shared memory it can cause problems.
>
For design simplification, I put a giant lock on each columnar cache.
So, the routines in cscan.c acquire the LWLock in exclusive mode before
invoking ccache_insert_tuple / ccache_delete_tuple, roughly as in the
sketch below.
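For illustration only, the call sites follow this locking discipline (a
paraphrased sketch; the ccache_head type, its lock field and the function
names are not copied verbatim from the patch):

/* paraphrased sketch of the per-cache giant lock; details differ in the patch */
static void
cs_store_tuple(ccache_head *ccache, Relation rel, HeapTuple tuple)
{
    /* the single exclusive LWLock serializes every update of this cache */
    LWLockAcquire(ccache->lock, LW_EXCLUSIVE);
    ccache_insert_tuple(ccache, rel, tuple);
    LWLockRelease(ccache->lock);
    /* if the insert raises ERROR, transaction abort releases the LWLock */
}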

> 9. + /* merge chunks if this chunk has enough space to merge */
> + ccache_merge_chunk(ccache, cchunk);
>
>   calling the merge chunks for every call back of heap page prune is a
> overhead for vacuum. After the merge which may again leads
>   to node splits because of new data.
>
OK, I'll check the condition to merge the chunks, to prevent too frequent
merge / split.

> 10. "columner" is present in some places of the patch. correct it.
>
Ahh, it should be "columnar".

> 11. In cache_scan_next function, incase of cache insert fails because of
> shared memory the tuple pointer is not reset and cache is NULL.
> Because of this during next record fetch it leads to assert as cache !=
> NULL.
>
You are right. I need to reset the scan state as if it were the normal
seqscan path, not just set csstate->ccache to NULL.
I had left an older approach in place from trial and error during the
implementation.

> 12. + if (ccache->status != CCACHE_STATUS_IN_PROGRESS)
>   + cs_put_ccache(ccache);
>
> The cache is created with refcnt as 2 and in some times two times put
> cache is called to eliminate it and in some times with a different approach.
> It is little bit confusing, can 

Re: [HACKERS] dynamic shared memory and locks

2014-02-10 Thread Kohei KaiGai
2014-02-08 4:52 GMT+09:00 Robert Haas :
> On Tue, Jan 21, 2014 at 11:37 AM, Robert Haas  wrote:
>> One idea I just had is to improve the dsm_toc module so that it can
>> optionally set up a tranche of lwlocks for you, and provide some
>> analogues of RequestAddinLWLocks and LWLockAssign for that case.  That
>> would probably make this quite a bit simpler to use, at least for
>> people using it with dynamic shared memory.  But I think that's a
>> separate patch.
>
> I played with this a bit today and it doesn't actually seem to
> simplify things very much.  The backend that creates the DSM needs to
> do this:
>
> lwlocks = shm_toc_allocate(toc, sizeof(LWLockPadded) * nlwlocks);
> for (i = 0; i < nlwlocks; ++i)
> LWLockInitialize(&lwlocks[i].lock, tranche_id);
>
> Since that's all of three lines, encapsulating it doesn't look all
> that helpful.  Then each backend needs to do something like this:
>
> static LWLockTranche mytranche;
> mytranche.name = "some descriptive module name";
> mytranche.array_base = lwlocks;
> mytranche.array_stride = sizeof(LWLockPadded);
> LWLockRegisterTranche(tranche_id, &mytranche);
>
> That's not a lot of code either, and there's no obvious way to reduce
> it much by hooking into shm_toc.
>
> I thought maybe we needed some cleanup when the dynamic shared memory
> segment is unmapped, but looks like we really don't.  One can do
> something like LWLockRegisterTranche(tranche_id, NULL) for debugging
> clarity, but it isn't strictly needed; one can release all lwlocks
> from the tranche or assert that none are held, but that should really
> only be a problem if the user does something like
> LWLockAcquire(lock_in_the_segment); dsm_detach(seg), because error
> recovery starts by releasing *all* lwlocks.  And if the user does it
> explicitly, I'm kinda OK with that just seg faulting.  After all, the
> user could equally well have done dsm_detach(seg);
> LWLockAcquire(lock_in_the_segment) and there's no way at all to
> cross-check for that sort of mistake.
>
Would it cause any problem if dsm_detach() also released the LWLocks that
were allocated on the DSM segment being released?
The LWLocks currently held are tracked in the held_lwlocks[] array, whose
length is usually 100. When dsm_detach() is called for the segment spanning
addr ~ (addr + length), it seems an easy job to me to walk that array and
find the LWLocks within that range, as in the sketch below.
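A minimal sketch of that walk, assuming the held_lwlocks[] /
num_held_lwlocks bookkeeping in lwlock.c (so a helper like this would have
to live in that file; the function itself is hypothetical):

/* hypothetical helper inside lwlock.c, to be called from dsm_detach() */
static void
LWLockReleaseAllInRange(char *base, Size length)
{
    int     i;

    /*
     * Iterate from the top down: LWLockRelease() compacts held_lwlocks[]
     * by shifting higher entries, so entries below the current index keep
     * their positions and none are skipped.
     */
    for (i = num_held_lwlocks - 1; i >= 0; i--)
    {
        char   *addr = (char *) held_lwlocks[i];

        if (addr >= base && addr < base + length)
            LWLockRelease(held_lwlocks[i]);
    }
}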


> I do see one thing about the status quo that does look reasonably
> annoying: the code expects that tranche IDs are going to stay
> relatively small.  For example, consider a module foobar that uses DSM
> + LWLocks.  It won't do at all for each backend, on first use of the
> foobar module, to do LWLockNewTrancheId() and then reuse that
> tranche_id repeatedly for each new DSM - because over a system
> lifetime of months, tranche IDs could grow into the millions, causing
> LWLockTrancheArray to get really big (and eventually repalloc will
> fail).  Rather, the programmer needs to ensure that
> LWLockNewTrancheId() gets called *once per postmaster lifetime*,
> perhaps by allocating a chunk of permanent shared memory and using
> that to store the tranche_id that should be used each time an
> individual backend fires up a DSM.  Considering that one of the goals
> of dynamic shared memory is to allow modules to be loaded after
> postmaster startup and still be able to do useful stuff, that's kind
> of obnoxious.  I don't have a good idea what to do about it, though;
> making LWLockTrancheArray anything other than a flat array looks
> likely to slow down --enable-dtrace builds unacceptably.
>
Just my rough idea: wouldn't it make sense to have an offset from the head
of the DSM segment that contains the LWLock, instead of the identifier
form, to point at a cstring datum? It does not prevent allocating it later,
after the postmaster has started up, and there is no problem even if the
number of dynamic LWLocks grows to millions or more.
If an LWLock has a "tranche_offset" rather than a "tranche_id", all it
needs to do is:
1. find the DSM segment or the regular shared memory segment that contains
the supplied LWLock;
2. calculate mapped_address + tranche_offset, which is the local address
where the cstring form is stored.

It probably needs a common utility routine in dsm.c to find the particular
DSM segment that contains a supplied address
(the inverse of dsm_segment_address).
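A rough sketch of those two steps; both dsm_find_segment_by_address() (the
hypothetical inverse of dsm_segment_address()) and the tranche_offset
field do not exist, and the main shared memory segment would be handled
analogously with its own base address:

/* hypothetical sketch only; neither the helper nor the field exists */
static const char *
LWLockGetTrancheName(LWLock *lock)
{
    /* step 1: find the mapped segment that contains this lock */
    dsm_segment *seg = dsm_find_segment_by_address(lock);
    char        *base = (char *) dsm_segment_address(seg);

    /* step 2: the name cstring lives at a segment-relative offset */
    return base + lock->tranche_offset;
}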

How about my ideas?

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2014-01-29 Thread Kohei KaiGai
2014-01-29 Christian Convey :
>
> On Mon, Jan 27, 2014 at 7:14 PM, Kouhei Kaigai  wrote:
>>
>> FDW's join pushing down is one of the valuable use-cases of this
>> interface,
>> but not all. As you might know, my motivation is to implement GPU
>> acceleration
>> feature on top of this interface, that offers alternative way to scan or
>> join
>> relations or potentially sort or aggregate.
>
>
> I'm curious how this relates to the pluggable storage idea discussed here
> https://wiki.postgresql.org/wiki/PgCon_2013_Developer_Meeting and here
> http://www.databasesoup.com/2013/05/postgresql-new-development-priorities-2.html
>
> I haven't seen a specific proposal about how much functionality should be
> encapsulated by a pluggable storage system.  But I wonder if that would be
> the best place for specialized table-scan code to end up?
>
If you are interested in designing your own storage layer (which therefore
needs its own scan/writer implementation), FDW is the option currently
available. It defines a set of interfaces that allows extensions to
generate "things to be there" on the fly. It does not force you to act as
a client of a remote database, even though it was originally designed for
dblink-like purposes.
In other words, FDW is a feature that translates a particular data source
into something visible according to the table definition. As long as a
driver can mediate between the table definition and the data format of
your own storage layer, it should work.

On the other hand, the custom-scan interface basically allows extensions
to implement an "alternative way to access" the data. If we have a smarter
way to scan or join relations than the built-in logic (yes, scanning the
result set of a remote join is smarter than a local join over a couple of
remote scan results), this interface tells the backend "I have such a
strategy", and the planner will then choose one of the paths, either a
built-in or an additional one, according to the cost.

Let's get back to your question. This interface is, right now, not
designed to implement a pluggable storage layer. FDW is an option now, and
pluggable regular tables may become a development item in the v9.5 cycle.
Because I'm motivated to make my GPU acceleration feature perform on
regular relations, I put higher priority on an interface that allows an
extension to suggest "how to scan" them.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dynamic shared memory and locks

2014-01-23 Thread Kohei KaiGai
2014/1/23 Andres Freund :
> On 2014-01-23 23:03:40 +0900, Kohei KaiGai wrote:
>> Isn't it necessary to have an interface to initialize LWLock structure being
>> allocated on a dynamic shared memory segment?
>> Even though LWLock structure is exposed at lwlock.h, we have no common
>> way to initialize it.
>
> There's LWLockInitialize() or similar in the patch afair.
>
Ahh, I overlooked the code around tranche-id. Sorry for the noise.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dynamic shared memory and locks

2014-01-23 Thread Kohei KaiGai
Isn't it necessary to have an interface to initialize an LWLock structure
allocated on a dynamic shared memory segment?
Even though the LWLock structure is exposed in lwlock.h, we have no common
way to initialize it.

How about having a function like the following?

void
InitLWLock(LWLock *lock)
{
    SpinLockInit(&lock->lock.mutex);
    lock->lock.releaseOK = true;
    lock->lock.exclusive = 0;
    lock->lock.shared = 0;
    lock->lock.head = NULL;
    lock->lock.tail = NULL;
}

Thanks,

2014/1/22 KaiGai Kohei :
> (2014/01/22 1:37), Robert Haas wrote:
>>
>> On Mon, Jan 20, 2014 at 11:23 PM, KaiGai Kohei 
>> wrote:
>>>
>>> I briefly checked the patch. Most of lines are mechanical replacement
>>> from LWLockId to LWLock *, and compiler didn't claim anything with
>>> -Wall -Werror option.
>>>
>>> My concern is around LWLockTranche mechanism. Isn't it too complicated
>>> towards the purpose?
>>> For example, it may make sense to add "char lockname[NAMEDATALEN];" at
>>> the tail of LWLock structure if LOCK_DEBUG is enabled. Probably, it
>>> also adds an argument of LWLockAssign() to gives the human readable
>>> name. Is the additional 64bytes (= NAMEDATALEN) per lock too large
>>> for recent hardware?
>>
>>
>> Well, we'd need it when either LOCK_DEBUG was defined or when
>> LWLOCK_STATS was defined or when --enable-dtrace was used, and while
>> the first two are probably rarely enough used that that would be OK, I
>> think the third case is probably fairly common, and I don't think we
>> want to have such a potentially performance-critical difference
>> between builds with and without dtrace.
>>
>> Also... yeah, it's a lot of memory.  If we add an additional 64 bytes
>> to the structure, then we're looking at 96 bytes per lwlock instead of
>> 32, after padding out to a 32-byte boundary to avoid crossing cache
>> lines.  We need 2 lwlocks per buffer, so that's an additional 128
>> bytes per 8kB buffer.  For shared_buffers = 8GB, that's an extra 128MB
>> for storing lwlocks.  I'm not willing to assume nobody cares about
>> that.  And while I agree that this is a bit complex, I don't think
>> it's really as bad as all that.  We've gotten by for a long time
>> without people being able to put lwlocks in other parts of memory, and
>> most extension authors have gotten by with LWLockAssign() just fine
>> and can continue doing things that way; only users with particularly
>> sophisticated needs should bother to worry about the tranche stuff.
>>
> Hmm... 1/64 of main memory (if large buffer system) might not be
> an ignorable memory consumption.
>
>
>> One idea I just had is to improve the dsm_toc module so that it can
>> optionally set up a tranche of lwlocks for you, and provide some
>> analogues of RequestAddinLWLocks and LWLockAssign for that case.  That
>> would probably make this quite a bit simpler to use, at least for
>> people using it with dynamic shared memory.  But I think that's a
>> separate patch.
>>
> I agree with this idea. It seems to me quite natural to keep properties
> of objects held on shared memory (LWLock) on shared memory.
> Also, a LWLock once assigned shall not be never released. So, I think
> dsm_toc can provide a well suitable storage for them.
>
>
> Thanks,
> --
> OSS Promotion Center / The PG-Strom Project
> KaiGai Kohei 
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers



-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sepgsql: label regression test failed

2013-12-24 Thread Kohei KaiGai
Hello,

It seems to me that changes in the base security policy on Fedora affected
the regression test. Our test cases for sepgsql_setcon() utilize the MCS
rules, which prevent domain transitions from narrower categories to wider
ones, to control the success and failure cases.

However, their coverage has changed. They used to apply to all domains in
the system, so the "unconfined_t" domain was constrained by the MCS rules.
Now, they apply only to domains with the "mcs_constrained_type" attribute.

[kaigai@vmlinux tmp]$ diff -up old/policy/mcs new/policy/mcs
  :
 
  :
 mlsconstrain process { transition dyntransition }
-   (( h1 dom h2 ) or ( t1 == mcssetcats ));
+   (( h1 dom h2 ) or ( t1 != mcs_constrained_type ));

Probably, we need to define a domain of our own for the regression test to
ensure test stability, instead of using the system "unconfined" domain,
whose meaning differs from release to release.

I'll make a patch. Please wait for a while.

Thanks for your test & reports.

2013/12/18 Sergey Muraviov :
> # semodule -l | grep sepgslq
> sepgsql-regtest 1.07
>
> Full list of modules is in attachment.
>
>
> 2013/12/18 Kohei KaiGai 
>>
>> Could you show me semodule -l on your environment?
>> I believe security policy has not been changed between F19 and F20...
>>
>> Thanks,
>>
>> 2013/12/18 Sergey Muraviov :
>> > Hi
>> >
>> > I've tried to test postgres 9.3.2 and 9.4devel with selinux on Fedora 20
>> > and
>> > met with a label regression test failure.
>> >
>> > PS
>> > I've got some warning during build process.
>> >
>> > --
>> > Best regards,
>> > Sergey Muraviov
>> >
>> >
>> > --
>> > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
>> > To make changes to your subscription:
>> > http://www.postgresql.org/mailpref/pgsql-hackers
>> >
>>
>>
>>
>> --
>> KaiGai Kohei 
>
>
>
>
> --
> Best regards,
> Sergey Muraviov



-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sepgsql: label regression test failed

2013-12-18 Thread Kohei KaiGai
Could you show me the output of "semodule -l" on your environment?
I believe the security policy has not changed between F19 and F20...

Thanks,

2013/12/18 Sergey Muraviov :
> Hi
>
> I've tried to test postgres 9.3.2 and 9.4devel with selinux on Fedora 20 and
> met with a label regression test failure.
>
> PS
> I've got some warning during build process.
>
> --
> Best regards,
> Sergey Muraviov
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] shared memory message queues

2013-12-08 Thread Kohei KaiGai
2013/12/6 Kohei KaiGai :
> What will happen if sender tries to send a large chunk that needs to
> be split into multiple sub-chunks and receiver concurrently detaches
> itself from the queue during the writes by sender?
> It seems to me the sender gets SHM_MQ_DETACHED and only
> earlier half of the chunk still remains on the queue even though
> its total length was already in the message queue.
> It may eventually lead infinite loop on the receiver side when another
> receiver appeared again later, then read incomplete chunk.
> Does it a feasible scenario? If so, it might be a solution to prohibit
> enqueuing something without receiver, and reset queue when a new
> receiver is attached.
>
Attaching a peer process to a message queue that has once been detached is
not an intended usage, is it?
If so, it may be a solution to put an ereport() in shm_mq_set_receiver()
and shm_mq_set_sender() to prohibit assigning a process to a message queue
with mq_detached = true. That would keep the situation simple; roughly,
the guard would look like the sketch below.
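For instance, a guard at the top of shm_mq_set_receiver() and
shm_mq_set_sender() could look like this (illustrative only; no such check
exists in the patch, and mq_detached is private to shm_mq.c):

/* illustrative guard, not part of the submitted patch */
bool    detached;

SpinLockAcquire(&mq->mq_mutex);
detached = mq->mq_detached;
SpinLockRelease(&mq->mq_mutex);

if (detached)
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("cannot attach to a message queue whose peer has already detached")));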

Regarding test-shm-mq-v1.patch, setup_background_workers() tries to launch
nworkers background worker processes, but it may fail during the launch if
max_worker_processes is not large enough.
Isn't this a situation where the BGWORKER_EPHEMERAL flag should be attached
once your patch gets committed?
I think it is a waste of effort to add error handling here instead of
waiting for the core support to be added, but it makes sense to put a
source code comment there so we don't forget to add this flag once it
arrives.

Also, test_shm_mq_setup() waits for the background worker processes to
finish starting up. I'm uncertain whether that is really needed, because
this shared memory message queue allows sending a byte stream without a
receiver, and also blocks until a byte stream arrives from a peer that is
attached later.
This module is designed for test purposes, so I think it makes more sense
if the test conditions cover more corner cases.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] shared memory message queues

2013-12-05 Thread Kohei KaiGai
Sorry for my late reply.

I checked the part-3 (shm-mq-v1.patch) portion as below.
Your comments on part-1 and part-2 are reasonable to me,
so I have no arguments about those portions.

Even though shm_mq_create() expects "address" to be aligned, there is no
mechanism to ensure it. How about putting an Assert() there?


shm_mq_send_bytes() has an Assert() to check the length to be added to
mq_bytes_written. Is MAXALIGN64() the right check on a 32-bit architecture?
The sendnow value may be Min(ringsize - used, nbytes - sent), and the
ringsize comes from mq->mq_ring_size, which is aligned using
MAXALIGN(_DOWN), a macro that has only 32-bit width there.

/*
 * Update count of bytes written, with alignment padding.  Note
 * that this will never actually insert any padding except at the
 * end of a run of bytes, because the buffer size is a multiple of
 * MAXIMUM_ALIGNOF, and each read is as well.
 */
Assert(sent == nbytes || sendnow == MAXALIGN64(sendnow));
shm_mq_inc_bytes_written(mq, MAXALIGN64(sendnow));


What will happen if the sender tries to send a large chunk that needs to
be split into multiple sub-chunks, and the receiver concurrently detaches
itself from the queue during the sender's writes?
It seems to me that the sender gets SHM_MQ_DETACHED while only the earlier
half of the chunk remains on the queue, even though its total length has
already been written into the message queue.
That may eventually lead to an infinite loop on the receiver side if
another receiver appears later and reads the incomplete chunk.
Is this a feasible scenario? If so, one solution might be to prohibit
enqueuing anything while there is no receiver, and to reset the queue when
a new receiver is attached.


The source code comment for shm_mq_wait_internal() is a little bit
misleading to me. It says that nothing shall be written without the peer
process being attached; however, there are no checks as long as nsend is
less than the queue size.

 * If handle != NULL, the queue can be read or written even before the
 * other process has attached.  We'll wait for it to do so if needed.  The
 * handle must be for a background worker initialized with bgw_notify_pid
 * equal to our PID.

Right now, that's all I can comment on. I'll do follow-up code reading
over the weekend.

Thanks,

2013/11/20 Robert Haas :
> On Tue, Nov 19, 2013 at 12:33 AM, Kohei KaiGai  wrote:
>> * on-dsm-detach-v2.patch
>> It reminded me the hook registration/invocation mechanism on apache/httpd.
>> It defines five levels for invocation order (REALLY_FIRST, FIRST, MIDDLE,
>> LAST, REALLY_LAST), but these are alias of integer values, in other words,
>> they are just kicks the hook in order to the priority value associated with a
>> function pointer.
>> These flexibility may make sense. We may want to control the order to
>> release resources more fine grained in the future. For example, isn't it
>> a problem if we have only two levels when a stuff in a queue needs to be
>> released earlier than the queue itself?
>> I'm not 100% certain on this suggestion because on_shmen_exit is a hook
>> that does not host so much callbacks, thus extension may implement
>> their own way on the SHMEM_EXIT_EARLY or SHMEM_EXIT_LATE stage
>> of course.
>
> I don't really see much point in adding more flexibility here until we
> need it, but I can imagine that we might someday need it, for reasons
> that are not now obvious.
>
>> * shm-toc-v1.patch
>>
>> From my experience, it makes sense to put a magic number on the tail of
>> toc segment to detect shared-memory usage overrun. It helps to find bugs
>> bug hard to find because of concurrent jobs.
>> Probably, shm_toc_freespace() is a point to put assertion.
>>
>> How many entries is shm_toc_lookup() assumed to chain?
>> It assumes a liner search from the head of shm_toc segment, and it is
>> prerequisite for lock-less reference, isn't it?
>> Does it make a problem if shm_toc host many entries, like 100 or 1000?
>> Or, it is not an expected usage?
>
> It is not an expected usage.In typical usage, I expect that the
> number of TOC entries will be about N+K, where K is a small constant
> (< 10) and N is the number of cooperating parallel workers.  It's
> possible that we'll someday be in a position to leverage 100 or 1000
> parallel workers on the same task, but I don't expect it to be soon.
> And, actually, I doubt that revising the data structure would pay off
> at N=100.  At N=1000, maybe.  At N=1, probably.  But we are
> *definitely* not going to need that kind of scale any time soon, and I
> don't think it makes sense to design a complex data structure to
> handle that case when there are so many more basic problems that nee

Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2013-12-04 Thread Kohei KaiGai
Thanks for fixing my many careless mistakes.
I didn't know "seek" was an irregular verb...

Best regards,

2013/12/4 Shigeru Hanada :
> 2013/11/29 Kohei KaiGai :
>> I merged all the propositions from Jim. Thanks, it made the documentation
>> quality better. Also, I fixed up cosmetic stuff around whitespace <-> tab.
>
> I found some typos in documents and comments.  Please see attached
> patch for detail.
>
> --
> Shigeru HANADA



-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2013-12-04 Thread Kohei KaiGai
Hanada-san,

Thanks for your review,

2013/12/4 Shigeru Hanada :
> I first reviewed postgres_fdw portion of the patches to learn the
> outline of Custom Plan.  Wiki page is also a good textbook of the
> feature.  I have some random comments about the basic design of Custom
> Plan:
>
> (1) IIUC add_join_path and add_scan_path are added to allow extensions
> to plug their code into planner.
>
Almost. To be more precise, these hooks allow extensions to plug the paths
they can provide into a particular join or scan. The planner will then
choose the cheapest one according to the cost values, along the lines of
the sketch below.
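For illustration, an extension-side implementation of such a hook might
look roughly like this; the hook signature, the CustomPath construction
and the cost numbers are paraphrased, not the exact definitions from the
submitted patches:

/* paraphrased sketch of an extension offering an alternative scan path */
static void
my_add_scan_path(PlannerInfo *root, RelOptInfo *baserel, RangeTblEntry *rte)
{
    CustomPath *cpath = makeNode(CustomPath);

    cpath->path.pathtype = T_CustomScan;
    cpath->path.parent = baserel;
    cpath->path.rows = baserel->rows;
    cpath->path.startup_cost = 10.0;    /* the extension's own estimates */
    cpath->path.total_cost = 10.0 + baserel->rows * 0.01;
    cpath->path.pathkeys = NIL;

    /* add_path() keeps this path only if it beats the built-in ones on cost */
    add_path(baserel, &cpath->path);
}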

> (2) FDW framework has executor callbacks based on existing executor
> nodes.  Is there any plan to integrate them into one way, or wrap on
> by another?  I'm not sure that we should have two similar framework
> side by side.
> # I'm sorry if I've missed the past discussion about this issue.
>
Probably, FDW has a different role from the CustomScan API.
As the name implies, FDW acts as a bridge between relation form and an
opaque external data source, mediating between two different worlds on
behalf of a foreign table.
On the other hand, CustomScan allows providing alternative logic to scan
or join particular relations, in addition to the built-in ones, but it
does not act on behalf of foreign tables.

The existing FDW is designed to implement a scan on an intangible
relation, so it can assume certain things, such as that a tuple returned
from the FDW has a TupleDesc equivalent to the table definition, or that
it can always use ExecScan() in all cases.
So, I don't think these two frameworks should be consolidated, because it
would confuse existing extensions that assume FDW callbacks always have a
particular foreign table definition.

> (3) Internal routines such as is_self_managed_relation and
> has_wholerow_reference seem to be useful for other FDWs.  Is it able
> to move them  into core?
>
Probably, src/backend/foreign/foreign.c is a good host for them.

> (4) postgres_fdw estimates costs of join by calculating local numbers.
>  How about to support remote estimation by throwing EXPLALAIN query
> when use_remote_estimates = true.
>
I'm uncertain whether the cost value from a remote EXPLAIN properly
represents the difficulty on the local side. It does represent the
difficulty of joining two relations on the remote side, but it does not
represent the local job, which just fetches tuples from the result set of
the remote join query.
What is your opinion? Is the remote cost estimate comparable with the
local values?

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Status of FDW pushdowns

2013-11-21 Thread Kohei KaiGai
2013/11/21 Bruce Momjian :
> Where are we on the remaining possible pushdowns for foreign data
> wrappers, particularly the Postgres one?  I know we do WHERE restriction
> pushdowns in 9.3, but what about join and aggregate pushdowns?  Is
> anyone working on those?
>
> I know join pushdowns seem insignificant, but it helps to restrict what
> data must be passed back because you would only pass back joined rows.
>
> Do we document these missing features anywhere?
>
Probably, the custom-scan API will provide a more flexible way to push
down aggregates, sorts and other operations performed on regular tables,
not only foreign tables.
It allows extensions to offer an alternative scan/join path at the
planning stage; the executor then calls back into the custom logic instead
of the built-in one, if its cost is cheaper.

Right now, it covers relation scans and joins only. However, we will be
able to apply the same concept to aggregation.
For example, an aggregation node on top of a foreign table scan is a good
candidate for pushdown, because it can be replaced with custom logic that
scans a materialized result of the remote aggregation query, if its cost
is cheap enough compared with local aggregation.
Probably, we need to add a hook and some logic to compare the built-in
aggregation with the alternative paths provided by extensions.
It would also be helpful for people who want to implement something like
"parallel aggregate" on regular tables, not only foreign tables.
Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] shared memory message queues

2013-11-18 Thread Kohei KaiGai
Hello,

I tried to look at patches #1 and #2 first; I shall review the rest later.

* basic checks
All the patches (not only #1 and #2) apply without any problem on top of
the latest master branch. The build succeeded with -Werror.
The regression tests work fine for core and contrib/test_shm_mq.

* on-dsm-detach-v2.patch
This reminded me of the hook registration/invocation mechanism in
apache/httpd. It defines five levels of invocation order (REALLY_FIRST,
FIRST, MIDDLE, LAST, REALLY_LAST), but these are aliases for integer
values; in other words, it just invokes the hooks in order of the priority
value associated with each function pointer.
That flexibility may make sense. We may want to control the order of
releasing resources in a more fine-grained way in the future. For example,
if we have only two levels, isn't it a problem when something in a queue
needs to be released earlier than the queue itself?
I'm not 100% certain about this suggestion, because on_shmem_exit is a
hook that does not host that many callbacks, so an extension can of course
implement its own ordering within the SHMEM_EXIT_EARLY or SHMEM_EXIT_LATE
stage.

* shm-toc-v1.patch

From my experience, it makes sense to put a magic number at the tail of
the toc segment to detect shared-memory usage overruns. It helps to find
bugs that are hard to track down because of concurrent jobs.
Probably, shm_toc_freespace() is the place to put such an assertion.

How many entries is shm_toc_lookup() assumed to chain through?
It performs a linear search from the head of the shm_toc segment, and that
is a prerequisite for lock-less references, isn't it?
Does it become a problem if the shm_toc hosts many entries, like 100 or
1000? Or is that not an expected usage?

I shall look at #3 and #4 later...

Thanks,

2013/11/8 Robert Haas :
> On Wed, Nov 6, 2013 at 9:48 AM, Peter Eisentraut  wrote:
>> This patch needs to be rebased.
>
> Thanks.  You didn't mention which of the four needed rebasing, but it
> seems that it's just the first one.  New version of just that patch
> attached; please use the prior versions of the remaining three.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What's needed for cache-only table scan?

2013-11-12 Thread Kohei KaiGai
2013/11/12 Tom Lane :
> Kohei KaiGai  writes:
>> So, are you thinking it is a feasible approach to focus on custom-scan
>> APIs during the upcoming CF3, then table-caching feature as use-case
>> of this APIs on CF4?
>
> Sure.  If you work on this extension after CF3, and it reveals that the
> custom scan stuff needs some adjustments, there would be time to do that
> in CF4.  The policy about what can be submitted in CF4 is that we don't
> want new major features that no one has seen before, not that you can't
> make fixes to previously submitted stuff.  Something like a new hook
> in vacuum wouldn't be a "major feature", anyway.
>
Thanks for this clarification.
Three days is too short to write a patch, but two months may be sufficient
to develop a feature on top of the scheme discussed in the previous
commitfest.

Best regards,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What's needed for cache-only table scan?

2013-11-12 Thread Kohei KaiGai
2013/11/12 Tom Lane :
> Kohei KaiGai  writes:
>> 2013/11/12 Tom Lane :
>>> There's no possible way you'll finish this for 9.4.
>
>> Yes, I understand it is not possible to submit whole of the patch until
>> CF3 deadline. So, I'd like to find out a way to implement it as an
>> extension using facilities being supported or to be enhanced on v9.4.
>
> Oh!  Okay, I misunderstood the context --- you meant this as an example
> use-case for the custom plan feature, right?  Makes more sense now.
>
> I'm still dubious that it'd actually be very useful in itself, but it
> seems reasonable as a test case to make sure that a set of hooks for
> custom plans are sufficient to do something useful with.
>
Yes. I intend to build most of this table-caching feature on top of the
custom-scan API set, even though, due to its nature, it may require
additional hooks independent of the planner and executor structure.

So, do you think it is a feasible approach to focus on the custom-scan
APIs during the upcoming CF3, and then on the table-caching feature as a
use case of these APIs in CF4?

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What's needed for cache-only table scan?

2013-11-12 Thread Kohei KaiGai
2013/11/12 Claudio Freire :
> On Tue, Nov 12, 2013 at 11:45 AM, Kohei KaiGai  wrote:
>> Hello,
>>
>> It is a brief design proposal of a feature I'd like to implement on top of
>> custom-scan APIs. Because it (probably) requires a few additional base
>> features not only custom-scan, I'd like to see feedback from the hackers.
>>
>> The cache-only table scan, being in subject line, is an alternative scan
>> logic towards sequential scan if all the referenced columns are cached.
>> It shall allow to scan a particular table without storage access, thus
>> make scan performance improved.
>> So what? Which is difference from large shared_buffers configuration?
>> This mechanism intends to cache a part of columns being referenced
>> in the query, not whole of the records. It makes sense to the workloads
>> that scan a table with many columns but qualifier references just a few
>> columns, typically used to analytic queries, because it enables to
>> reduce memory consumption to be cached, thus more number of records
>> can be cached.
>> In addition, it has another role from my standpoint. It also performs as
>> fast data supplier towards GPU/MIC devices. When we move data to
>> GPU device, the source address has to be a region marked as "page-
>> locked" that is exempted from concurrent swap out, if we want CUDA
>> or OpenCL to run asynchronous DMA transfer mode; the fastest one.
>
>
> Wouldn't a columnar heap format be a better solution to this?
>
I've implemented that using FDW; however, it requires the application to
adjust its SQL to replace "CREATE TABLE" with "CREATE FOREIGN TABLE". In
addition, it loses good features of the regular heap, such as an index
scan when its cost is smaller than a sequential columnar scan.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What's needed for cache-only table scan?

2013-11-12 Thread Kohei KaiGai
2013/11/12 Robert Haas :
> On Tue, Nov 12, 2013 at 9:45 AM, Kohei KaiGai  wrote:
>> It is a brief design proposal of a feature I'd like to implement on top of
>> custom-scan APIs. Because it (probably) requires a few additional base
>> features not only custom-scan, I'd like to see feedback from the hackers.
>>
>> The cache-only table scan, being in subject line, is an alternative scan
>> logic towards sequential scan if all the referenced columns are cached.
>> It shall allow to scan a particular table without storage access, thus
>> make scan performance improved.
>> So what? Which is difference from large shared_buffers configuration?
>> This mechanism intends to cache a part of columns being referenced
>> in the query, not whole of the records. It makes sense to the workloads
>> that scan a table with many columns but qualifier references just a few
>> columns, typically used to analytic queries, because it enables to
>> reduce memory consumption to be cached, thus more number of records
>> can be cached.
>> In addition, it has another role from my standpoint. It also performs as
>> fast data supplier towards GPU/MIC devices. When we move data to
>> GPU device, the source address has to be a region marked as "page-
>> locked" that is exempted from concurrent swap out, if we want CUDA
>> or OpenCL to run asynchronous DMA transfer mode; the fastest one.
>>
>> Probably, here is no problem on construction of this table cache.
>> All we need to do is inject a custom-scan node instead of seq-scan,
>> then it can construct table cache in concurrence with regular seq-
>> scan, even though the first access become a little bit slow down.
>>
>> My concern is how to handle a case when table gets modified.
>> A straightforward idea is that each cached entries being modified
>> shall be invalidated by callback mechanism.
>> Trigger can help in case of INSERT, UPDATE, DELETE and
>> TRUNCATE. Furthermore, it's better if extension could inject
>> its own trigger definition at RelationBuildTriggers() on the fly,
>> to perform the feature transparently.
>> On the other hand, we have no way to get control around VACUUM.
>> I want to have a hook that allows extensions to get control when
>> a page got vacuumed. Once we have such a hook, it enables to
>> invalidate cached entries being indexed by tid, but already vacuumed.
>> Its best location is probably lazy_scan_heap() to call back extension
>> for each page getting vacuumed, with
>>
>> How about your opinion?
>
> I think it's hard to say in the abstract.  I'd suggest adding the
> hooks you feel like you need and see what the resulting patch looks
> like.  Then post that and we can talk about how (and whether) to do it
> better.  My personal bet is that triggers are the wrong way to do
> something like this; I'd look to do it all with hooks.  Of course,
> figuring how to position those hooks so that they are maintainable and
> don't affect performance when not used is the tricky part.
>
Yep, right now the whole idea is still in my head, so it will take
additional time to shape it into patch form.
The reason I'm inclined to use triggers is that they are already
supported at the places where I want to get control, so I can avoid
the effort of adding and maintaining a similar mechanism within the
limited time slot.
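Just to make it concrete, here is a rough sketch of such a trigger
function. The ccache_invalidate_item() helper is a placeholder for the
cache side, which is not written yet; the only thing the trigger
itself really needs is the old tuple's ctid.

#include "postgres.h"
#include "fmgr.h"
#include "commands/trigger.h"
#include "storage/itemptr.h"
#include "utils/rel.h"

PG_MODULE_MAGIC;

/* placeholder for the real cache code; drops one entry keyed by ctid */
static void
ccache_invalidate_item(Oid relid, ItemPointer tid)
{
    elog(DEBUG1, "invalidate cache entry of rel %u at (%u,%u)",
         relid,
         ItemPointerGetBlockNumber(tid),
         ItemPointerGetOffsetNumber(tid));
}

PG_FUNCTION_INFO_V1(ccache_invalidation_trigger);

Datum
ccache_invalidation_trigger(PG_FUNCTION_ARGS)
{
    TriggerData *trigdata = (TriggerData *) fcinfo->context;

    if (!CALLED_AS_TRIGGER(fcinfo))
        elog(ERROR, "not called as trigger");

    /*
     * UPDATE and DELETE carry the old tuple, whose ctid identifies the
     * cached entry to drop. INSERT has nothing cached yet, and TRUNCATE
     * would reset the whole cache (not shown here).
     */
    if (TRIGGER_FIRED_BY_UPDATE(trigdata->tg_event) ||
        TRIGGER_FIRED_BY_DELETE(trigdata->tg_event))
        ccache_invalidate_item(RelationGetRelid(trigdata->tg_relation),
                               &trigdata->tg_trigtuple->t_self);

    return PointerGetDatum(trigdata->tg_trigtuple);
}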

>> I'd like to find out the best way to implement this table-caching
>> mechanism within scope of v9.4 functionality set.
>> Any ideas, comments or suggestions are welcome.
>
> I think that Tom is right: there's not time to get something like this
> done for 9.4.  If you start working relatively soon, I think you could
> hope to have a good proposal in time for 9.5.
>
Yes, I agree his suggestion is realistic. It is nearly impossible to
get the whole table-cache mechanism done by the CF3 deadline this
week, so I plan to implement it as an extension using the facilities
available in v9.4.

It seems to me the last piece needed to implement this feature
(custom-scan being a prerequisite, of course) is a hook in vacuum.
Does it make sense to propose, alongside it, a small but independently
worthwhile contrib module that exercises this last piece? For example,
a contrib module that reports the progress of vacuum...
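To illustrate what I'm asking for: the vacuum_page_hook below is
hypothetical (no such hook exists in core today; adding its invocation
to lazy_scan_heap() is exactly the proposal), and the callback is the
sort of trivial progress reporter I mean.

#include "postgres.h"
#include "fmgr.h"
#include "storage/block.h"
#include "utils/rel.h"

PG_MODULE_MAGIC;

/*
 * Proposed (hypothetical) hook, invoked by lazy_scan_heap() once per
 * heap page; this declaration would live in a core header.
 */
typedef void (*vacuum_page_hook_type) (Relation rel,
                                       BlockNumber blkno,
                                       BlockNumber nblocks);
extern PGDLLIMPORT vacuum_page_hook_type vacuum_page_hook;

static vacuum_page_hook_type prev_vacuum_page_hook = NULL;

/* toy consumer: report vacuum progress every 1000 pages */
static void
report_vacuum_progress(Relation rel, BlockNumber blkno,
                       BlockNumber nblocks)
{
    if (prev_vacuum_page_hook)
        prev_vacuum_page_hook(rel, blkno, nblocks);

    if (blkno % 1000 == 0)
        elog(LOG, "vacuum of \"%s\": %u of %u pages done",
             RelationGetRelationName(rel), blkno, nblocks);
}

void
_PG_init(void)
{
    prev_vacuum_page_hook = vacuum_page_hook;
    vacuum_page_hook = report_vacuum_progress;
}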

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What's needed for cache-only table scan?

2013-11-12 Thread Kohei KaiGai
2013/11/12 Tom Lane :
> Kohei KaiGai  writes:
>> The cache-only table scan, being in subject line, is an alternative scan
>> logic towards sequential scan if all the referenced columns are cached.
>
> This seems like a pretty dubious idea to me --- you're talking about
> expending large amounts of memory on a single-purpose cache, without
> any clear way to know if the data will ever be used before it gets
> invalidated (and the life expectancy of the cache doesn't sound like
> it'd be very high either).
>
Thanks for your comments. I assume this table cache would be applied
to tables that hold data sets for analytic workloads. Even though it
may consume a large amount of memory, it will be hit by the main
workload in that scenario.
It will probably have an upper limit on cache memory usage, plus a
mechanism to reclaim cache blocks starting from the oldest one;
nothing special there. Also, even though I call it a "table cache",
it is similar to an in-memory database backed by memory-rich hardware.
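Concretely, the bookkeeping I have in mind is just an LRU list of
cache chunks kept under a configurable ceiling. A minimal sketch using
the backend's dlist follows; the ccache_chunk layout and the limit are
made-up placeholders, not taken from an actual patch.

#include "postgres.h"
#include "lib/ilist.h"

/* hypothetical per-column cache chunk; layout is illustrative only */
typedef struct ccache_chunk
{
    dlist_node  lru_chain;      /* position in the LRU list */
    Size        usage;          /* bytes consumed by this chunk */
    /* ... cached column arrays would follow ... */
} ccache_chunk;

static dlist_head  ccache_lru = DLIST_STATIC_INIT(ccache_lru);
static Size        ccache_usage = 0;
static Size        ccache_limit = 0;    /* would come from a GUC */

/* move a chunk to the front whenever a scan touches it */
static void
ccache_touch(ccache_chunk *chunk)
{
    dlist_delete(&chunk->lru_chain);
    dlist_push_head(&ccache_lru, &chunk->lru_chain);
}

/* reclaim from the tail (the coldest chunks) until we fit the limit */
static void
ccache_reclaim(Size required)
{
    while (ccache_usage + required > ccache_limit &&
           !dlist_is_empty(&ccache_lru))
    {
        ccache_chunk *victim =
            dlist_container(ccache_chunk, lru_chain,
                            dlist_tail_node(&ccache_lru));

        dlist_delete(&victim->lru_chain);
        ccache_usage -= victim->usage;
        pfree(victim);
    }
}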

>> I'd like to find out the best way to implement this table-caching
>> mechanism within scope of v9.4 functionality set.
>
> There's no possible way you'll finish this for 9.4.  Keep in mind that
> CF3 starts Friday, and CF4 starts 2014-01-15, and project policy is that
> major feature patches should arrive (in at least rough form) by CF3.
> Even ignoring that policy, this is more than 2 months worth of work.
>
Yes, I understand it is not possible to submit the whole patch by the
CF3 deadline. So, I'd like to find a way to implement it as an
extension using facilities that are already supported, or will be
enhanced, in v9.4.

Best regards,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] What's needed for cache-only table scan?

2013-11-12 Thread Kohei KaiGai
Hello,

This is a brief design proposal for a feature I'd like to implement on
top of the custom-scan APIs. Because it (probably) requires a few
additional base features beyond custom-scan itself, I'd like to get
feedback from the hackers.

The cache-only table scan in the subject line is an alternative scan
logic to sequential scan, usable whenever all the referenced columns
are cached. It allows a table to be scanned without any storage
access, and thus improves scan performance.
So what? How is that different from a large shared_buffers
configuration? This mechanism intends to cache only the columns
referenced in the query, not whole records. That pays off for
workloads that scan a table with many columns while the qualifiers
reference just a few of them, which is typical of analytic queries,
because it reduces the memory needed per cached row, so many more
records fit in the cache.
In addition, it has another role from my standpoint: it also acts as a
fast data supplier for GPU/MIC devices. When we move data to a GPU
device, the source address has to be a region marked as "page-locked",
exempt from concurrent swap-out, if we want CUDA or OpenCL to run in
asynchronous DMA transfer mode; the fastest one.

Construction of this table cache is probably not a problem. All we
need to do is inject a custom-scan node in place of the seq-scan; it
can then build the table cache concurrently with a regular sequential
scan, even though the first access becomes a little slower.
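On the planner side, the decision is simply "does the cache cover
every column this query references?". Here is a minimal sketch against
the 9.4-era tree, using set_rel_pathlist_hook just for illustration
(the custom-scan patch provides its own hook for this purpose); the
ccache_cached_columns() and add_ccache_scan_path() functions are
hypothetical pieces of the extension, and hook registration in
_PG_init() is omitted.

#include "postgres.h"
#include "nodes/relation.h"
#include "optimizer/paths.h"
#include "optimizer/var.h"

/* hypothetical: columns of relid currently held in the cache,
 * using the same attno representation as pull_varattnos() */
extern Bitmapset *ccache_cached_columns(Oid relid);
/* hypothetical: build and add_path() the cache-only scan path */
extern void add_ccache_scan_path(PlannerInfo *root, RelOptInfo *rel,
                                 RangeTblEntry *rte);

static set_rel_pathlist_hook_type prev_set_rel_pathlist_hook = NULL;

static void
ccache_set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
                        Index rti, RangeTblEntry *rte)
{
    Bitmapset  *referenced = NULL;
    ListCell   *lc;

    if (prev_set_rel_pathlist_hook)
        prev_set_rel_pathlist_hook(root, rel, rti, rte);

    if (rte->rtekind != RTE_RELATION)
        return;

    /* collect every column referenced by the targetlist and quals */
    pull_varattnos((Node *) rel->reltargetlist, rti, &referenced);
    foreach(lc, rel->baserestrictinfo)
    {
        RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);

        pull_varattnos((Node *) rinfo->clause, rti, &referenced);
    }

    /* offer the cache-only path only if the cache covers all of them */
    if (bms_is_subset(referenced, ccache_cached_columns(rte->relid)))
        add_ccache_scan_path(root, rel, rte);
}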

My concern is how to handle the case where the table gets modified.
A straightforward idea is that each cached entry being modified is
invalidated through a callback mechanism.
Triggers can help in the case of INSERT, UPDATE, DELETE and TRUNCATE.
Furthermore, it would be better if the extension could inject its own
trigger definitions at RelationBuildTriggers() on the fly, so the
feature works transparently.
On the other hand, we have no way to get control around VACUUM.
I want a hook that allows extensions to get control when a page gets
vacuumed. Once we have such a hook, it enables invalidation of cached
entries that are indexed by tid but have already been vacuumed away.
Its best location is probably lazy_scan_heap(), calling back into the
extension for each page getting vacuumed.

How about your opinion?

I'd like to find the best way to implement this table-caching
mechanism within the scope of the v9.4 functionality set.
Any ideas, comments or suggestions are welcome.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom Scan APIs (Re: [HACKERS] Custom Plan node)

2013-11-11 Thread Kohei KaiGai
Hi,

I tried to write up a wikipage to introduce how custom-scan works.

https://wiki.postgresql.org/wiki/CustomScanAPI

Any comments are welcome.

2013/11/6 Kohei KaiGai :
> The attached patches provide a feature to implement custom scan node
> that allows extension to replace a part of plan tree with its own code
> instead of the built-in logic.
> In addition to the previous proposition, it enables us to integrate custom
> scan as a part of candidate paths to be chosen by optimizer.
> Here is two patches. The first one (pgsql-v9.4-custom-scan-apis) offers
> a set of API stuff and a simple demonstration module that implement
> regular table scan using inequality operator on ctid system column.
> The second one (pgsql-v9.4-custom-scan-remote-join) enhances
> postgres_fdw to support remote join capability.
>
> Below is an example to show how does custom-scan work.
>
> We usually run sequential scan even if clause has inequality operator
> that references ctid system column.
>
> postgres=# EXPLAIN SELECT ctid,* FROM t1 WHERE ctid > '(10,0)'::tid;
>QUERY PLAN
> 
>  Seq Scan on t1  (cost=0.00..209.00 rows= width=43)
>Filter: (ctid > '(10,0)'::tid)
> (2 rows)
>
> An extension that performs as custom-scan provider suggests
> an alternative path, and its cost was less than sequential scan,
> thus optimizer choose it.
>
> postgres=# LOAD 'ctidscan';
> LOAD
> postgres=# EXPLAIN SELECT ctid,* FROM t1 WHERE ctid > '(10,0)'::tid;
>   QUERY PLAN
> --
>  Custom Scan (ctidscan) on t1  (cost=0.00..100.00 rows= width=43)
>Filter: (ctid > '(10,0)'::tid)
> (2 rows)
>
> Of course, more cost effective plan will win if exists.
>
> postgres=# EXPLAIN SELECT ctid,* FROM t1 WHERE ctid > '(10,0)'::tid AND a = 
> 200;
> QUERY PLAN
> ---
>  Index Scan using t1_pkey on t1  (cost=0.29..8.30 rows=1 width=43)
>Index Cond: (a = 200)
>Filter: (ctid > '(10,0)'::tid)
> (3 rows)
>
> One other worthwhile example is remote-join enhancement on the
> postgres_fdw as follows. Both of ft1 and ft2 are foreign table being
> managed by same foreign server.
>
> postgres=# EXPLAIN (verbose) SELECT * FROM ft1 JOIN ft2 ON a = x
>   WHERE f_leak(b) AND y
> like '%aaa%';
>QUERY PLAN
> --
>  Custom Scan (postgres-fdw)  (cost=100.00..100.01 rows=0 width=72)
>Output: a, b, x, y
>Filter: f_leak(b)
>Remote SQL: SELECT r1.a, r1.b, r2.x, r2.y FROM (public.ft1 r1 JOIN
> public.ft2 r2 ON ((r1.a = r2.x))) WHERE ((r2.y ~~ '%aaa%'::text))
> (4 rows)
>
> ---
> How does it works
> ---
> This patch adds two hooks (for base and join relations) around allpaths.c
> and joinpaths.c. It allows extensions to add alternative paths to handle
> scanning on the base relation or join of two relations.
>
> Its callback routine can add CustomPath using add_path() to inform
> optimizer this alternative scan path. Every custom-scan provider is
> identified by its name being registered preliminary using the following
> function.
>
>   void register_custom_provider(const CustomProvider *provider);
>
> CustomProvider is a set of name string and function pointers of callbacks.
>
> Once CustomPath got chosen, create_scan_plan() construct a custom-
> scan plan and calls back extension to initialize the node.
> Rest of portions are similar to foreign scan, however, some of detailed
> portions are different. For example, foreign scan is assumed to return
> a tuple being formed according to table definition. On the other hand,
> custom-scan does not have such assumption, so extension needs to
> set tuple-descriptor on the scan tuple slot of ScanState structure by
> itself.
>
> In case of join, custom-scan performs as like a regular scan but it
> returns tuples being already joined on underlying relations.
> The patched postgres_fdw utilizes a hook at joinpaths.c to run
> remote join.
>
> 
> Issues
> 
> I'm not 100% certain whether arguments of add_join_path_hook is
> reasonable. I guess the first 7 arguments are minimum necessity.
> The mergeclause_list and semifactors might be useful if some

[HACKERS] Re: Exempting superuser from row-security isn't enough. Run predicates as DEFINER?

2013-11-11 Thread Kohei KaiGai
Hi Craig,

I'd like to vote for the last option. It is a separate problem (or
perhaps intended behavior), I think.

According to the documentation on views,
http://www.postgresql.org/docs/devel/static/sql-createview.html

| Access to tables referenced in the view is determined by permissions of
| the view owner. In some cases, this can be used to provide secure but
| restricted access to the underlying tables. However, not all views are secure
| against tampering; see Section 38.5 for details. Functions called in the view
| are treated the same as if they had been called directly from the query using
| the view. Therefore the user of a view must have permissions to call all
| functions used by the view.

We check permissions on the tables underlying a view with the view
owner's privileges, but permissions on the functions are checked
against the user who issued the query.
I'm not certain of the reason for this behavior beyond implementation
convenience: table permissions are checked in ExecCheckRTEPerms(),
while function permissions are checked from ExecEvalFunc(), so the
invocation context there is less clear-cut.
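For reference, the asymmetry is visible in the two checks themselves.
The following is a simplified paraphrase against the 9.3/9.4 sources,
not a verbatim copy (the real table check also handles column-level
ACLs and whatever bits rte->requiredPerms demands):

#include "postgres.h"
#include "miscadmin.h"
#include "nodes/parsenodes.h"
#include "utils/acl.h"
#include "utils/lsyscache.h"

/* table privilege: checked as rte->checkAsUser, i.e. the view owner */
static void
check_table_select_privilege(RangeTblEntry *rte)
{
    Oid         userid = rte->checkAsUser ? rte->checkAsUser : GetUserId();
    AclResult   aclresult;

    aclresult = pg_class_aclcheck(rte->relid, userid, ACL_SELECT);
    if (aclresult != ACLCHECK_OK)
        aclcheck_error(aclresult, ACL_KIND_CLASS,
                       get_rel_name(rte->relid));
}

/* function privilege: always checked as the user running the query */
static void
check_function_execute_privilege(Oid funcid)
{
    AclResult   aclresult;

    aclresult = pg_proc_aclcheck(funcid, GetUserId(), ACL_EXECUTE);
    if (aclresult != ACLCHECK_OK)
        aclcheck_error(aclresult, ACL_KIND_PROC, get_func_name(funcid));
}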

Anyway, this is a potential issue independent of the existence of the
RLS feature, so we may need to consider another solution if we reach a
consensus that it is a problem worth tackling.
However, I don't think RLS is a suitable solution for this scenario.

Thanks,

2013/11/11 Craig Ringer :
> Hi all
>
> I'm thinking about a possible solution for one of the row-security
> issues - the ability of a malicious user to write a row-security policy
> containing a malicious predicate function, then trick another user into
> SELECTing from the table and running the function.
>
> What about running the row-security predicate subquery as the predicate
> definer - the user who SET the predicate - or the owner of the object
> the predicate applies to? How expensive are security context and user id
> changes - and is it even practical to do this within the context of a
> security barrier subquery?
>
> Oracle and Teradata get around this by making the assignment of row
> security constraints a highly privileged operation - table owners can't
> set their own. That's the other option IMO.
>
> We already have this as a known issue with VIEWs,
> CHECK constraints, etc, as shown below - so I'm tempted to
> hand-wave around it as a separate issue, and just say that you shouldn't
> access objects created by untrusted users.
>
> The problem is that CHECK constraints, VIEW access, etc doesn't affect
> pg_dump, it won't run view predicates or check constraint code. By
> contrast, pg_dump *does* read tables, so it needs to be exempted from
> row-security predicates defined by untrusted users. An exemption option
> is needed for performance anyway.
>
> Protecting pg_dump and the superuser alone aren't good enough, though.
> SuperSecretUser shouldn't have to fear SELECTing from a view written by
> user NewThisWeek.
>
> On a side note, pg_restore and psql running dump scripts *do* affect
> restores, which is kind of nasty.
>
> Here's a demo showing how to create a new superuser with a known
> password as a regular unprivileged user if you can trick the superuser
> into looking at one of your objects:
>
>
> CREATE USER user1;
>
> SET SESSION AUTHORIZATION user1;
>
> CREATE OR REPLACE FUNCTION haha() RETURNS boolean AS $$
> BEGIN
> RAISE NOTICE 'haha running as: %',current_user;
> CREATE USER haha PASSWORD 'haha' SUPERUSER;
> RETURN true;
> END;
> $$ LANGUAGE plpgsql;
>
> CREATE TABLE check_exploit ( a integer check (haha()) );
>
> CREATE VIEW view_exploit AS SELECT * FROM check_exploit WHERE haha();
>
> GRANT ALL ON check_exploit TO postgres;
> GRANT ALL ON view_exploit TO postgres;
>
> -- later, superuser comes along and looks at the table/view:
> SET SESSION AUTHORIZATION postgres;
>
>
> regress=# select * from view_exploit;
> NOTICE:  haha running as: postgres
>  a
> ---
>  1
> (1 row)
>
> regress=# \du haha
>List of roles
>  Role name | Attributes | Member of
> ---++---
>  haha  | Superuser  | {}
>
> regress=# DROP USER haha;
>
>
>
> or for an admin reason adds/ modifies a row in the table:
>
> regress=# INSERT INTO check_exploit VALUES (100);
> NOTICE:  haha running as: postgres
> INSERT 0 1
>
>
> This even works with SECURITY BARRIER views, since they do nothing to
> control the user ID the view predicate runs as.
>
> If the superuser dumps this database then restores the schema and data
> as two separate passes, "haha" will run via the check constraint in that
> case too. Ouch.
>
>
> So, we've got a few options:
>
> * Run the RS constraint subqueries as DEFINER or table owner (if
> possible and performant)
>
> * Restrict the ability to set RS predicates to superuser by default, and
> create a right to allow it to be delegated.
>
> * Call it a separate problem and deal with it later, since similar
> issues already apply to VIEW, CHECK, etc. Document that running pg_dump
> as a user without RS 

Re: [HACKERS] [v9.4] row level security

2013-11-06 Thread Kohei KaiGai
2013/11/6 Craig Ringer :
> On 11/05/2013 09:36 PM, Robert Haas wrote:
>> I haven't studied this patch in detail, but I see why there's some
>> unhappiness about that code: it's an RLS-specific kluge.  Just
>> shooting from the hip here, maybe we should attack the problem of
>> making security-barrier views updatable first, as a separate patch.
>
> That's the approach I've been considering. There are a few wrinkles with
> it, though:
>
> (a) Updatable views are implemented in the rewriter, not the planner.
> The rewriter is not re-run when plans are invalidated or when the
> session authorization changes, etc. This means that we can't simply omit
> the RLS predicate for superuser because the same rewritten parse tree
> might get used for both superuser and non-superuser queries.
>
> Options:
>
> * Store the before-rewrite parse tree when RLS is discovered on one of
> the rels in the tree. Re-run the rewriter when re-planning. Ensure a
> change in current user always invalidates plans.
>
> * Declare that it's not legal to run a query originally parsed and
> rewritten as superuser as a non-superuser or vice versa. This would
> cause a great deal of pain with PL/PgSQL.
>
> * Always add the RLS predicate and solve the problem of reliably
> short-circuiting the user-supplied predicate for RLS-exempt users.  We'd
> need a way to allow direct (not query-based) COPY TO for tables with RLS
> applied, preventing the rewriting of direct table access into subqueries
> for COPY, but since we never save plans for COPY that may be fine.
>
> * ... ?
>
How about an idea that uses two different types of rules: the existing
kind, expanded before the planner stage as we do now, and a new kind,
expanded at the head of the planner stage?
The argument of planner() is still a parse tree, so it seems to me
there is no serious problem with calling the rewriter again there to
handle the second-stage rules.
If we go with this approach, ALTER TABLE ... SET ROW SECURITY becomes
a synonym for declaring a rule with a special attribute.
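A self-contained way to picture it, framed as an extension via
planner_hook only so the sketch stands on its own (in core it would
simply be a call at the top of planner()); expand_second_stage_rules()
is a hypothetical function that expands only the rules flagged as
second-stage, and the hook signatures are those of the 9.4 development
cycle:

#include "postgres.h"
#include "fmgr.h"
#include "nodes/params.h"
#include "optimizer/planner.h"

PG_MODULE_MAGIC;

/* hypothetical: expand only the rules flagged as "second stage",
 * e.g. the ones created by ALTER TABLE ... SET ROW SECURITY */
extern Query *expand_second_stage_rules(Query *query);

static planner_hook_type prev_planner_hook = NULL;

static PlannedStmt *
rls_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
{
    /*
     * Running the deferred expansion here means it happens every time
     * the query is (re)planned, under the session's current identity,
     * so a cached rewritten tree can never leak across users.
     */
    parse = expand_second_stage_rules(parse);

    if (prev_planner_hook)
        return prev_planner_hook(parse, cursorOptions, boundParams);
    return standard_planner(parse, cursorOptions, boundParams);
}

void
_PG_init(void)
{
    prev_planner_hook = planner_hook;
    planner_hook = rls_planner;
}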

> (b) Inheritance is a problem when RLS is done in the rewriter. As I
> understood it from Kohei KaiGai's description to me earlier, there was a
> strong preference on -hackers to enforce RLS predicates for child and
> parent tables completely independently. That's how RLS currently works,
> but it might be hard to get the same effect when applying RLS in the
> rewriter. We'd need to solve that, or redefine RLS's behaviour so that
> the predicate on a parent table applies to any child tables too.
> Personally I'd prefer the latter.
>
I'm not certain it was a "strong preference", even though I followed
the consensus at that time, so I think it makes sense to discuss how
RLS policy should be enforced on child tables.
As long as we get a consistent view of a child table even when it is
referenced without its parent, I have no objection to your preference.
It also makes the implementation simpler than the approach I tried,
which enforced the RLS policy of each table individually by reusing
the existing rule mechanism.
It is not difficult to enforce the parent's RLS policy on a child
relation even when it is referenced on its own: all we have to do
specially is append the parent's RLS policy, not only the child's,
whenever the referenced table has a parent.

> (c) RLS might interact differently with rules declared on tables if
> implemented in the rewriter, so some investigation into that would be
> needed.
>
>> I
>> would think that if we make that work, this will also work without,
>> hopefully, any special hackery.  And we'd get a separate,
>> independently useful feature out of it, too.
>
> I tend to agree. I'm just a bit concerned about dealing with the issues
> around RLS-exempt operations and users.
>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services



-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How should row-security affects ON UPDATE RESTRICT / CASCADE ?

2013-10-29 Thread Kohei KaiGai
2013/10/30 Craig Ringer :
> On 10/30/2013 10:50 AM, Tom Lane wrote:
>> Craig Ringer  writes:
>>> > I'd kind of like to see FK constraints affected by RLS for
>>> > non-superusers, at least as an option.
>> I think that's a complete nonstarter.  Aside from the fact that such a
>> constraint will have no definable semantics, even the possibility that a
>> constraint doesn't mean what it appears to mean will prevent us from
>> making use of FK constraints for optimization --- something that's
>> pretty high on the planner to-do list.
>
> Good point. That's another good argument for FK constraints to disregard
> RLS. In which case, is there actually any way to determine when an SPI
> query is being invoked directly from an FK constraint? We'll need a way
> to tell so RLS can skip adding the row-security check predicate.
>
For your reference, my implementation patches ri_PerformCheck() as
follows. It does not skip the check in every case (only when the PK
side is modified); however, the overall idea should carry over.

--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -3008,6 +3008,7 @@ ri_PerformCheck(const RI_ConstraintInfo *riinfo,
int spi_result;
Oid save_userid;
int save_sec_context;
+   int temp_sec_context;
Datum   vals[RI_MAX_NUMKEYS * 2];
charnulls[RI_MAX_NUMKEYS * 2];

@@ -3087,8 +3088,18 @@ ri_PerformCheck(const RI_ConstraintInfo *riinfo,

/* Switch to proper UID to perform check as */
GetUserIdAndSecContext(&save_userid, &save_sec_context);
+
+   /*
+* Row-level security should be disabled in case when foreign-key
+* relation is queried to check existence of tuples that references
+* the primary-key being modified.
+*/
+   temp_sec_context = save_sec_context | SECURITY_LOCAL_USERID_CHANGE;
+   if (source_is_pk)
+   temp_sec_context |= SECURITY_ROW_LEVEL_DISABLED;
+
SetUserIdAndSecContext(RelationGetForm(query_rel)->relowner,
-  save_sec_context | SECURITY_LOCAL_USERID_CHANGE);
+  temp_sec_context);

/* Finally we can run the query. */
spi_result = SPI_execute_snapshot(qplan,

-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How should row-security affects ON UPDATE RESTRICT / CASCADE ?

2013-10-29 Thread Kohei KaiGai
2013/10/29 Tom Lane :
> Craig Ringer  writes:
>> During my testing of Kohei KaiGai's row-security patches I've been
>> looking into how foreign keys should be and are handled. There are some
>> interesting wrinkles around FK cascades, the rights under which FK
>> checks execute, and about the consistency effects of changing or
>> applying an RLS policy.
>
> As I recall, I've been saying since day one that row-level security cannot
> sensibly coexist with foreign-key constraints, and I've been told that the
> potential users of such a feature don't care.  I'm glad to see somebody
> else complaining.
>
This is not specific to RLS: it is unavoidable that someone can infer
the existence of invisible records via FK constraints, even when the
referencing or referenced records are protected by column-level
database privileges.
I don't remember how many times we have discussed this topic.
The conclusion was that access control by itself cannot prevent this
information leak (one bit: whether a particular key exists or not)
through an FK constraint; nevertheless, the feature as a whole still
makes sense as long as the environment in which RLS + PostgreSQL is
deployed tolerates such a small leak.
If the environment cannot tolerate leaking even one bit, it is not a
reasonable solution, though I don't know of any "reasonable solution"
under that requirement.
All the other commercial databases stand on the same assumption: even
if their promotional white papers don't say so, their solutions of
course have the same weakness and may leak something.

> Here's another example wherein there basically isn't a sensible solution:
> suppose you have delete rights on table A, and there is a table B
> with a foreign-key reference to A, and RLS says that there are rows in
> B that you can't see.  You try to delete some row in A that is referenced
> by an invisible-to-you row in B.  There are only two possible outcomes:
> the system refuses your request, and thereby exposes to you the fact that
> a referencing row exists; or the system allows the FK constraint to be
> violated.
>
My vote is that the system should maintain referential integrity
exactly as if no RLS policy were configured; that is more fundamental
than any per-user RLS policy.

> As far as the points you're making go, I think we must say that RLS checks
> are not applied during FK trigger queries, ie the FK triggers can always
> see everything even though they don't run as superuser.
>
My existing implementation does exactly that. If a record is
referenced by invisible rows, its deletion fails, in spite of the
information leakage.

>  Otherwise you're
> going to end up with constraint violations, and as a database weenie
> I consider that unacceptable.  This will mean that a poorly-chosen FK
> arrangement will allow some leakage of row-existence info, but I don't
> believe that that can be avoided anyway, per the above example.
>
OK, then let's drop table-level and column-level privileges too: they
can also leak the existence of invisible records, even when the user
has no privilege to reference them. :-)
Every tool has an intended usage and situations it suits. The
important thing is to use a feature with an understanding of its
purpose and limitations. As everybody knows, there are no silver
bullets for security, but a useful tool can help, depending on the
situation.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

