Re: Large files for relations

2023-05-12 Thread Jim Mlodgenski
On Thu, May 11, 2023 at 7:38 PM Thomas Munro  wrote:

> On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski  wrote:
> > On Mon, May 1, 2023 at 9:29 PM Thomas Munro  wrote:
> >> I am not aware of any modern/non-historic filesystem[2] that can't do
> >> large files with ease.  Anyone know of anything to worry about on that
> >> front?
> >
> > There is some trouble in the ambiguity of what we mean by "modern" and
> > "large files". There are still a large number of users of ext4 where the
> > max file size is 16TB. Switching to a single large file per relation would
> > effectively cut the max table size in half for those users. How would a
> > user with say a 20TB table running on ext4 be impacted by this change?
>
> Hrmph.  Yeah, that might be a bit of a problem.  I see it discussed in
> various places that MySQL/InnoDB can't have tables bigger than 16TB on
> ext4 because of this, when it's in its default one-file-per-object
> mode (as opposed to its big-tablespace-files-to-hold-all-the-objects
> mode like DB2, Oracle etc, in which case I think you can have multiple
> 16TB segment files and get past that ext4 limit).  It's frustrating
> because 16TB is still really, really big and you probably should be
> using partitions, or more partitions, to avoid all kinds of other
> scalability problems at that size.  But however hypothetical the
> scenario might be, it should work.
>

Agreed, it is frustrating, but it is not hypothetical. I have seen a number
of users with single tables larger than 16TB who don't use partitioning
because of the limitations we have today. The most common reason is needing
multiple unique constraints on the table that don't include the partition
key, something like a user_id and an email. There are workarounds for those
cases, but usually it's easier to deal with a single large table than with
the sharp edges those workarounds introduce.
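
To illustrate the limitation (table and column names are hypothetical), a
unique constraint on a partitioned table must include all of the
partitioning columns, so a second unique key on a different column cannot
be enforced:

```sql
CREATE TABLE users (
    user_id bigint NOT NULL,
    email   text   NOT NULL
) PARTITION BY HASH (user_id);

-- Works: the constraint includes the partition key.
ALTER TABLE users ADD CONSTRAINT users_pk PRIMARY KEY (user_id);

-- Fails with "unique constraint on partitioned table must include all
-- partitioning columns": email alone cannot be enforced globally.
ALTER TABLE users ADD CONSTRAINT users_email_uq UNIQUE (email);
```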


Re: Large files for relations

2023-05-11 Thread Jim Mlodgenski
On Mon, May 1, 2023 at 9:29 PM Thomas Munro  wrote:

>
> I am not aware of any modern/non-historic filesystem[2] that can't do
> large files with ease.  Anyone know of anything to worry about on that
> front?


There is some trouble in the ambiguity of what we mean by "modern" and
"large files". There are still a large number of users of ext4 where the
max file size is 16TB. Switching to a single large file per relation would
effectively cut the max table size in half for those users. How would a
user with say a 20TB table running on ext4 be impacted by this change?
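
For context on the numbers, a back-of-the-envelope sketch of the "cut in
half" claim, assuming PostgreSQL's default 8 kB block size and 32-bit block
numbers, and ext4's commonly cited 16 TiB file-size limit with 4 kB
filesystem blocks:

```python
TIB = 2**40

pg_block_size = 8192               # default BLCKSZ (8 kB pages)
pg_max_blocks = 2**32              # block numbers are 32-bit
pg_max_table = pg_block_size * pg_max_blocks   # internal ceiling

ext4_max_file = 16 * TIB           # ext4 limit with 4 kB filesystem blocks

print(pg_max_table // TIB)         # 32 -> PostgreSQL can address ~32 TiB
print(ext4_max_file // TIB)        # 16 -> a single file caps at 16 TiB
```

So a 20TB table that fits comfortably in today's 1 GB segment files would
exceed the single-file limit on ext4.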


Re: support for CREATE MODULE

2022-02-16 Thread Jim Mlodgenski
On Wed, Feb 16, 2022 at 12:20 AM Pavel Stehule  wrote:

>
> The main issue in this case is fact, so plpgsql is fully integrated to
> Postgres (on second hand this integration is a big performance win). It is
> pretty different from PL/SQL. In Oracle you have a package, or any other
> similar features, because PL/SQL is an "independent" environment to the
> database engine. You cannot do the same with PL/pgSQL. And if you try to
> implement some enhancement of object hierarchy for PL/pgSQL, then you have
> to do it in the PostgreSQL core engine first. I'm 100% for enhancing stored
> procedures about full modularization, but this feature cannot be
> implemented step by step because you can break compatibility in any step.
> We need a robust solution. The solution, that helps with something, but it
> is not robust, it is not progress.
>
>
I'm not sure I understand your feedback. The proposed design for modules
is implemented in the engine and is orthogonal to PL/pgSQL. A module can
contain a mix of PL/pgSQL, PL/Perl, and PL/Tcl functions if one wants
to do something like that.

Do you have any thoughts on how some sort of modularization or grouping
should be implemented? The Module route was the first thought on this
because it's the way the spec tells us we should solve this problem. We
can invent something new if we have a better way of solving this.


Re: support for CREATE MODULE

2022-02-04 Thread Jim Mlodgenski
On Fri, Feb 4, 2022 at 5:12 PM Tom Lane  wrote:

>
> On the whole I'm kind of allergic to inventing an entire new concept
> that has as much overlap with extensions as modules seem to.  I'd
> rather try to understand what functional requirements we're missing
> and see if we can add them to extensions.  Yeah, we won't end up being
> bug-compatible with Oracle's feature, but that's not a project goal
> anyway --- and where we have tried to emulate Oracle closely, it's
> often not worked out well (poster child: to_date).
>
>
Developers need a way to group related objects in some fashion so
that they can be more easily reasoned about. Modules are just the
way to do this in the spec, but if we want to leverage extensions,
that will work too. Many users who need this only have access through
a database connection. They wouldn't have access to the file system
to add a control file nor a script to add the objects. Enhancing
CREATE EXTENSION to be able to create some sort of empty extension
and then having the ability to add and remove objects from that
extension may be the minimum amount of functionality we would need
to give users the ability to group their objects.
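
A sketch of what that minimum might look like (the control-file-less
CREATE EXTENSION form is hypothetical; the ALTER EXTENSION ... ADD/DROP
member syntax already exists):

```sql
-- Hypothetical: today CREATE EXTENSION requires a control file on disk,
-- so this "empty extension" form does not yet exist.
CREATE EXTENSION payroll;

-- Existing syntax: membership can already be managed from SQL.
CREATE FUNCTION payroll_total(month date) RETURNS numeric
    LANGUAGE sql RETURN 0;
ALTER EXTENSION payroll ADD FUNCTION payroll_total(date);
ALTER EXTENSION payroll DROP FUNCTION payroll_total(date);
```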


Re: Hook for extensible parsing.

2021-09-15 Thread Jim Mlodgenski
On Wed, Sep 15, 2021 at 9:25 AM Simon Riggs  wrote:
>
> The general rule has always been that we don't just put hooks in, we
> always require an in-core use for those hooks. I was reminded of that
> myself recently.
>
That's not historically what has happened. There are several hooks with
no in-core use, such as emit_log_hook and ExplainOneQuery_hook. The recent
openssl_tls_init_hook only has a usage in src/test/modules.

> What we need is something in core that actually makes use of this. The
> reason for that is not politics, but a simple test of whether the
> feature makes sense AND includes all required bells and whistles to be
> useful in the real world.
>
Agreed. There should be something in src/test/modules to exercise this,
but probably more to flesh things out. Maybe extend adminpack to use this
so that, if enabled, it can use syntax like:
FILE READ 'foo.txt'

> Core doesn't need a LOL parser and I don't think we should commit that.
>
+1

> If we do this, I think it should have CREATE LANGUAGE support, so that
> each plugin can be seen as an in-core object and allow security around
> which users can execute which language types, allow us to switch
> between languages and have default languages for specific users or
> databases.
>
This hook allows extension developers to supplement syntax in addition
to adding a whole new language allowing the extension to appear more
native to the end user. For example, pglogical could use this to add
syntax to do a CREATE NODE instead of calling the function create_node.
Adding CREATE LANGUAGE support around this would just be for a narrow
set of use cases where a new language is added.
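
To make the pglogical example concrete (the CREATE NODE form is
hypothetical syntax a hook could enable, not anything pglogical ships):

```sql
-- pglogical today, through its function-call interface:
SELECT pglogical.create_node(
    node_name := 'provider1',
    dsn := 'host=10.0.0.1 dbname=app');

-- Hypothetical syntax a parser hook could map to the same call:
CREATE NODE provider1 WITH (dsn = 'host=10.0.0.1 dbname=app');
```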




Re: Hook for extensible parsing.

2021-07-07 Thread Jim Mlodgenski
On Wed, Jul 7, 2021 at 5:26 AM Julien Rouhaud  wrote:
>
> Also, if this patch is eventually committed and having some code to
> experience the hook is wanted it would probably be better to have a
> very naive parser (based on a few strcmp() calls or something like
> that) to validate the behavior rather than having a real parser.
>

The test module is very useful to show how to use the hook, but it isn't
useful to the general user the way most other things in contrib are. It
probably fits better in src/test/modules.




Re: Hook for extensible parsing.

2021-07-06 Thread Jim Mlodgenski
On Sat, Jun 12, 2021 at 4:29 AM Julien Rouhaud  wrote:

> I'd like to propose an alternative approach, which is to allow multiple
> parsers to coexist, and let third-party parsers optionally fallback on the
> core parsers.  I'm sending this now as a follow-up of [1] and to avoid
> duplicated efforts, as multiple people are interested in that topic.

The patches all build properly and pass all regressions tests.

> pg_parse_query() will instruct plugins to parse a query at a time.  They're
> free to ignore that mode if they want to implement the 3rd mode.  If so, they
> should either return multiple RawStmt, a single RawStmt with a 0 or
> strlen(query_string) stmt_len, or error out.  Otherwise, they will implement
> either mode 1 or 2, and they should always return a List containing a single
> RawStmt with properly set stmt_len, even if the underlying statement is NULL.
> This is required to properly skip valid strings that don't contain a
> statements, and pg_parse_query() will skip RawStmt that don't contain an
> underlying statement.

Wouldn't we want to loop through the individual statements only if
parser_hook exists? The current patch seems to go through the new code path
regardless of whether the hook is set.




Re: Support for CREATE MODULE?

2021-06-04 Thread Jim Mlodgenski
On Thu, Jun 3, 2021 at 8:49 AM Peter Eisentraut  wrote:
> Given that, as you said, the concept of modules is in the SQL standard,
> there is surely some guidance in there about how this is supposed to
> affect name resolution.  So let's start with that.  Maybe we won't like
> it in the end or whatever, but we should surely look there first.

Studying the spec further, catalog/schema/module are all used to
identify a module-level routine. I don't see it spelled out that it
needs to be in the format catalog.schema.module.routine to fully
qualify the routine, but it would likely be awkward for users to come
up with an alternative syntax like (catalog.schema.module).routine or
catalog.schema.module->routine.

The way the spec is worded, I read it as schemas taking precedence over
modules in path resolution. So, for example, with 2-level naming, if
there is a schema 'foo' and a module 'public.foo', both with a function
'bar', then 'foo.bar' would refer to the schema-level function, not the
module-level function. I've not found guidance on throwing the catalog
into the mix with 3-level naming. Say we had a catalog 'postgres' with a
schema 'foo' containing a function 'bar', and a schema 'postgres' with a
module 'foo' containing a function 'bar'. What would 'postgres.foo.bar'
refer to? If the SQL was executed from a catalog other than 'postgres',
we'd have no way of knowing whether 'foo.bar' existed there. So if it's
implementation dependent, saying schemas take precedence over catalogs
may make sense, and 'postgres.foo.bar' refers to the module-level
function in the 'postgres' schema.
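
The 2-level case, spelled out (hypothetical throughout, since CREATE
MODULE is not committed syntax and follows the POC patch's form):

```sql
CREATE SCHEMA foo;
CREATE FUNCTION foo.bar() RETURNS text LANGUAGE sql RETURN 'schema';

CREATE MODULE public.foo
  CREATE FUNCTION bar() RETURNS text LANGUAGE sql RETURN 'module';

-- Under the schema-precedence reading of the spec, this resolves to
-- the schema-level function and returns 'schema':
SELECT foo.bar();
```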




Re: Support for CREATE MODULE?

2021-06-02 Thread Jim Mlodgenski
On Wed, Jun 2, 2021 at 9:58 AM Tom Lane  wrote:
>
> In the first place, what do you think the primary key of pg_namespace is now?

In the patch the unique constraint is (nspname, nspnamespace) which is
certainly awkward. I initially went down the pg_module route to avoid
adding another catalog, but in retrospect, that may be a cleaner way.


> It's already quite hard to tell which part
> of a multiply.qualified.name is which, given that SQL says that you can
> optionally put a "catalog" (database) name in front of the others.
> I really doubt there is a way to shoehorn sub-schemas in there without
> creating terrible ambiguities.  Is "a.b.c" a reference to object c in
> schema b in database a, or is it a reference to object c in sub-schema b
> in schema a?

That was the area I found most difficult to reason about. I tried to make
some simplifying assumptions by checking if "a" was the current database.
Since we don't support cross-database access, if it was not, I assumed "a"
was a schema. I'm not sure that would be valid, but it did scope things
to a more manageable problem.

> I wonder whether it'd be better to consider modules as a kind of
> extension, or at least things with the same sort of ownership relations
> as extensions have.

That would solve the problem of associating objects which is the larger
problem for users today. The objects can all live in their respective
schemas with the module tying them all together.




Support for CREATE MODULE?

2021-06-02 Thread Jim Mlodgenski
Working with users over the years, some have large libraries of server-side
code, sometimes consisting of 100k+ lines of code over 1000+ functions
and procedures. This usually comes from a migration of a commercial
database like Oracle where it was best practice to put all of your
business logic into stored procedures. In these types of apps, just
managing the code is a challenge. To help classify objects, schemas
are used, but you are at the mercy of a naming convention to show
association. For example, a frequent naming convention would be having
related schemas with the names of foo_bar and foo_baz. For devs, that's
akin to keeping a file like xlog.c in a directory structure like
backend_access_transam instead of backend/access/transam. IMHO, having
a hierarchy makes it simpler to reason about related code bits.

The SQL spec does have a concept of modules that help address this. It's
defined as a persistent object within a schema that contains one or more
routines. It also defines other things like local temporary tables and
path specifications. There are other databases like DB2 that have
implemented module support each with their own way of defining the
routines within the module. The spec doesn't really give guidance on
how to manipulate the objects within the module.

Attached is a POC patch for modules. I modeled it as a sub-schema because
that is more what it seems like to me. It adds additional columns to
pg_namespace and allows for 3-part (or 4 with the database name) naming
of objects within the module. This simple example works with the patch.

CREATE SCHEMA foo;
CREATE MODULE foo.bar
  CREATE FUNCTION hello() RETURNS text
 LANGUAGE sql
 RETURN 'hello'
  CREATE FUNCTION world() RETURNS text
 LANGUAGE sql
 RETURN 'world';
SELECT foo.bar.hello();

Questions
- Do we want to add module support?

- If we do, should it be implemented as a type of namespace or should it
  be its own object type that lives in something like pg_module?

- How should users interact with objects within a module? They could be
  mostly independent like the current POC or we can introduce a path like
  ALTER MODULE foo ADD FUNCTION blah

--Jim


modules_poc_1.patch
Description: Binary data


modules_poc_2.patch
Description: Binary data


modules_poc_3.patch
Description: Binary data


modules_poc_4.patch
Description: Binary data


Re: Parser Hook

2021-03-15 Thread Jim Mlodgenski
On Mon, Mar 15, 2021 at 12:58 PM Joel Jacobson  wrote:

> On Mon, Mar 15, 2021, at 16:48, Jim Mlodgenski wrote:
>
> The example I have is adding a CREATE JOB command that a scheduler may
> use.
>
>
> This CREATE JOB thing sounds interesting.
>
> Are you working on adding the ability to schedule SQL-commands to run in
> the background,
> similar to cronjob and/or adding ampersand ("&") to a command in the
> terminal?
>

No, it was just a sample of how the parser could be extended to allow an
extension like pg_cron to use CREATE JOB instead of calling a function
like SELECT cron.schedule(...).
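
Side by side (the CREATE JOB form is hypothetical syntax, not anything
pg_cron ships; the cron.schedule call is pg_cron's real interface):

```sql
-- Today, with the pg_cron extension:
SELECT cron.schedule('nightly-vacuum', '0 3 * * *', 'VACUUM');

-- Hypothetical syntax an extended parser could accept instead:
CREATE JOB nightly_vacuum SCHEDULE '0 3 * * *' EXECUTE 'VACUUM';
```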


Re: Parser Hook

2021-03-15 Thread Jim Mlodgenski
On Mon, Mar 15, 2021 at 12:43 PM Julien Rouhaud  wrote:

> On Mon, Mar 15, 2021 at 11:48:58AM -0400, Jim Mlodgenski wrote:
> >
> > Going deeper on this, I created another POC as an example. Yes, having a
> > hook at the top of the parser does mean an extension needs to copy the
> > existing grammar and modify it. Without a total redesign of how the
> > grammar is handled, I'm not seeing how else this could be accomplished.
> > The example I have is adding a CREATE JOB command that a scheduler may
> > use. The amount of effort needed for an extension maintainer doesn't
> > appear to be that onerous. It's not ideal having to copy and patch
> > gram.y, but certainly doable for someone wanting to extend the parser.
>
> AFAIK nothing in bison prevents you from silently ignoring unhandled
> grammar rather than erroring out.  So you could have a parser hook called
> first, and if no valid command was recognized fall back on the original
> parser.  I'm not saying that it's a good idea or will be performant
> (although the added grammar will likely be very small, so it may not be
> that bad), but you could definitely avoid the need to duplicate the whole
> grammar in each and every extension, and allow multiple extensions
> extending the grammar.
>
>
That's a good point. That does simplify it.


> That won't reduce the difficulty of producing a correct parse tree if you
> want to implement some syntactic sugar for already handled DML though.
>

Complex DML like Oracle's outer join syntax is tricky no matter which way
you slice it.


> > However we might want to modify the parser to make it more pluggable
> > in the future, we would very likely need to have a hook at the top of
> > the parser to initialize things like keywords. Having a hook at the top
> > of the parser along with the existing ProcessUtility_hook allows
> > extensions to add their own utility commands if they wish. I would
> > imagine that covers many existing use cases for extensions today.
>
> What happens if multiple extensions want to add their own new grammar?
> There will at least be possible conflicts with the additional node tags.
>

The extensions would need to play nice with one another like they do with
other hooks and properly call the previous hook.


> Also, I'm not sure that many extensions would really benefit from custom
> utility command, as you can already do pretty much anything you want
> using SQL functions.  For instance it would be nice for hypopg to be able
> to support
>
> CREATE HYPOTHETICAL INDEX ...
>
> rather than
>
> SELECT hypopg_create_index('CREATE INDEX...')
>
> But really the only benefit would be autocompletion, which still wouldn't
> be possible as psql autocompletion won't be extended.  And even if it
> somehow was, I wouldn't expect all psql clients to be setup as needed.
>

Having the functionality exposed through DDL gives it a more native feel
for users, and would make some of them more likely to use the extensions.


Re: Parser Hook

2021-03-15 Thread Jim Mlodgenski
On Mon, Feb 22, 2021 at 3:52 PM Andres Freund  wrote:

> Hi,
>
> On 2021-02-22 11:20:54 -0500, Jim Mlodgenski wrote:
> > As Jan mentioned in his thread about a pluggable wire protocol [0], AWS
> > is working on a set of extensions for Babelfish. The intention is to not
> > necessarily have it as a single monolithic extension, but be possible
> > for people to use pieces of it as they need when they are migrating to
> > PostgreSQL. Some may just need the functions or data types. Others may
> > need the stored procedure language. Many times when enterprises are
> > migrating databases, they have satellite applications that they may not
> > be able to change or they are on different schedules than the main
> > application so the database still needs to support some of the old
> > syntax. A common need in these situations is the parser.
> >
> > Attached is a patch to place a hook at the top of the parser to allow
> > for a pluggable parser. It is modeled after the planner_hook [1]. To
> > test the hook, I have also attached a simple proof of concept that
> > wraps the parser in a TRY/CATCH block to catch any parse errors. That
> > could potentially help a class of users who are sensitive to parse
> > errors ending up in the logs and leaking PII data or passwords.
>
> I don't think these are really comparable. In case of the planner hook
> you can reuse the normal planner pieces, and just deal with the one part
> you need to extend. But we have pretty much no infrastructure to use the
> parser in a piecemeal fashion (there's a tiny bit for plpgsql).
>
> Which in turn means that to effectively use the proposed hook to
> *extend* what postgres accepts, you need to copy the existing parser,
> and hack in your extensions. Which in turn invariably will lead to
> complaints about parser changes / breakages the community will get
> complaints about in minor releases etc.
>

Going deeper on this, I created another POC as an example. Yes, having a
hook at the top of the parser does mean an extension needs to copy the
existing grammar and modify it. Without a total redesign of how the grammar
is handled, I'm not seeing how else this could be accomplished. The example
I have is adding a CREATE JOB command that a scheduler may use. The amount
of effort needed for an extension maintainer doesn't appear to be that
onerous. It's not ideal having to copy and patch gram.y, but certainly
doable for someone wanting to extend the parser. I also extended the patch
to add another hook in parse_expr.c to see what we would need to add
another keyword and have it call a function like SYSDATE. That appears to
be a lot of work to get all of the potential hook points that an extension
may want, and there may not be that many use cases worth the effort.



> I think the cost incurred for providing a hook that only allows
> extensions to replace the parser with a modified copy of ours will be
> higher than the gain.  Note that I'm not saying that I'm against
> extending the parser, or hooks - just that I don't think just adding the
> hook is a step worth doing on its own.
>
>
However we might want to modify the parser to make it more pluggable
in the future, we would very likely need to have a hook at the top of the
parser to initialize things like keywords. Having a hook at the top of the
parser along with the existing ProcessUtility_hook allows extensions to add
their own utility commands if they wish. I would imagine that covers many
existing use cases for extensions today.
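
For reference, such a hook could follow the same chaining convention as
planner_hook. This is only a sketch of the pattern under discussion:
parser_hook, parser_hook_type, and my_extension_raw_parser are assumptions
based on the POC patch rather than committed PostgreSQL API, and the
signatures are abbreviated:

```c
/* Sketch only; names and signatures are assumptions from the POC patch. */
static parser_hook_type prev_parser_hook = NULL;

static List *
my_parser(const char *query_string)
{
    List *stmts;

    /* Try the extension's grammar first (e.g. CREATE JOB ...). */
    stmts = my_extension_raw_parser(query_string);
    if (stmts != NIL)
        return stmts;

    /* Fall back to the previous hook, or to the core parser. */
    if (prev_parser_hook)
        return prev_parser_hook(query_string);
    return raw_parser(query_string);
}

void
_PG_init(void)
{
    prev_parser_hook = parser_hook;
    parser_hook = my_parser;
}
```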


> Imo a functional approach would really need to do the work to allow to
> extend & reuse the parser in a piecemeal fashion and *then* add a hook.
>
> Greetings,
>
> Andres Freund
>


poc_parser_hook.patch
Description: Binary data


poc_extended_parser.tar.gz
Description: GNU Zip compressed data


Parser Hook

2021-02-22 Thread Jim Mlodgenski
As Jan mentioned in his thread about a pluggable wire protocol [0], AWS is
working on a set of extensions for Babelfish. The intention is to not
necessarily have it as a single monolithic extension, but be possible for
people to use pieces of it as they need when they are migrating to
PostgreSQL. Some may just need the functions or data types. Others may need
the stored procedure language. Many times when enterprises are migrating
databases, they have satellite applications that they may not be able to
change or that are on different schedules than the main application, so
the database still needs to support some of the old syntax. A common need
in these situations is the parser.

Attached is a patch to place a hook at the top of the parser to allow for a
pluggable parser. It is modeled after the planner_hook [1]. To test the
hook, I have also attached a simple proof of concept that wraps the parser
in a TRY/CATCH block to catch any parse errors. That could potentially help
a class of users who are sensitive to parse errors ending up in the logs
and leaking PII data or passwords.

-- Jim
-- Amazon Web Services

[0] -
https://www.postgresql.org/message-id/flat/CAGBW59d5SjLyJLt-jwNv%2BoP6esbD8SCB%3D%3D%3D11WVe5%3DdOHLQ5wQ%40mail.gmail.com
[1] -
https://www.postgresql.org/message-id/flat/27516.1180053940%40sss.pgh.pa.us


parser_hook.patch
Description: Binary data


quiet_parser.c
Description: Binary data


Re: Extensibility of the PostgreSQL wire protocol

2021-02-11 Thread Jim Mlodgenski
On Thu, Feb 11, 2021 at 10:29 AM Andrew Dunstan  wrote:

>
> On 2/11/21 10:06 AM, Tom Lane wrote:
> > Robert Haas  writes:
> >> On Thu, Feb 11, 2021 at 9:42 AM Jonah H. Harris  wrote:
> >>> As Jan said in his last email, they're not proposing all the different
> >>> aspects needed. In fact, nothing has actually been proposed yet. This
> >>> is an entirely philosophical debate. I don't even know what's being
> >>> proposed at this point - I just know it *could* be useful. Let's just
> >>> wait and see what is actually proposed before shooting it down, yes?
> >> I don't think I'm trying to shoot anything down, because as I said, I
> >> like extensibility and am generally in favor of it. Rather, I'm
> >> expressing a concern which seems to me to be justified, based on what
> >> was posted. I'm sorry that my tone seems to have aggravated you, but
> >> it wasn't intended to do so.
> > Likewise, the point I was trying to make is that a "pluggable wire
> > protocol" is only a tiny part of what would be needed to have a credible
> > MySQL, Oracle, or whatever clone.  There are large semantic differences
> > from those products; there are maintenance issues arising from the fact
> > that we whack structures like parse trees around all the time; and so on.
> > Maybe there is some useful thing that can be accomplished here, but we
> > need to consider the bigger picture rather than believing (without proof)
> > that a few hook variables will be enough to do anything.
>
>
>
> Yeah. I think we'd need a fairly fully worked implementation to see
> where it goes. Is Amazon going to release (under TPL) its TDS
> implementation of this? That might go a long way to convincing me this
> is worth considering.
>

Everything is planned to be released under the Apache 2.0 license so
people are free to do with it as they choose.