Re: [sqlalchemy] postgresql large objects

2014-07-08 Thread Jonathan Vanasco
Working on this level of integration is way over my head, but I'd love to 
see this happen and wanted to quickly +1. 

I just had to refactor some code to store user-submitted media in 
Postgres for background processing.  While doing a bit of research, I 
learned that the streaming functionality is rather important even for 
small files.  Postgres has to materialize the entire BYTEA value in memory 
when selecting, and when inserting, a 2MB binary file can balloon into an 
8MB encoded string.  According to the pg-users list, this can often max out 
memory allocations and/or seriously degrade performance.


[sqlalchemy] postgresql large objects

2014-06-25 Thread Jason Newton
Hi,

I'm wondering how I might get at PostgreSQL's large object type (LOB).  
It's not to be confused with the TOASTed BYTEA, which is currently limited 
to 1 GiB but in practice tops out much lower (around 400 MiB for me) - it's 
a special table + API designed to handle very large binary objects, from a 
few hundred MiBs to, more recently, TiBs. I don't see appropriate 
definitions anywhere and can't find any real mention of it in sqlalchemy. 
psycopg2 has support for it and calls it lobject; it provides a file-like 
interface to the LOB, which is a good mapping, since with libpq you use 
lo_creat, lo_seek, lo_write, and lo_read to work with these beasts.
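
For illustration, here's roughly what the psycopg2 side looks like (a
minimal untested sketch - DSN and payload are made up, error handling
omitted):

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # made-up DSN
    with conn:  # lobject is only usable inside a transaction
        lob = conn.lobject(0, "wb")  # oid=0: server allocates a new LOB
        oid = lob.oid                # the OID to store in a referencing table
        lob.write(b"... very large binary payload ...")
        lob.close()

        # reading back uses the same file-like interface
        lob = conn.lobject(oid, "rb")
        data = lob.read()
        lob.close()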

I took a look at UserDefinedType, but its bind_processor doesn't 
distinguish between inserts and updates.  With inserts, you'd use an OID 
allocated from lo_creat within the transaction; on updates, you'd use 
lo_trunc/lo_write.  As one more constraint, you must be in a transaction 
before any of these functions are usable.  To reference large objects, 
since they are explicitly out-of-table storage, the PostgreSQL-specific 
OID type is used (which allows garbage collection, referential integrity 
checks, etc.).

I'll also mention that other tables reference these large objects via 
OIDs - something like smart pointers in Postgres. 
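
Concretely, the referencing side is just an OID-typed column; the table
and column names below are made up (vacuumlo can later GC orphaned
objects):

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # made-up DSN
    with conn:
        cur = conn.cursor()
        cur.execute("""
            CREATE TABLE results (
                id      serial PRIMARY KEY,
                name    text,
                payload oid   -- "smart pointer" into pg_largeobject
            )
        """)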

It'd be great to plug large objects into sqlalchemy properly - but can it 
be done?

-Jason


Re: [sqlalchemy] postgresql large objects

2014-06-25 Thread Mike Bayer

On 6/25/14, 2:35 AM, Jason Newton wrote:
 Hi,

 I'm wondering how I might get at PostgreSQL's large object type
 (LOB).  It's not to be confused with the TOASTed BYTEA, which is
 currently limited to 1 GiB but in practice tops out much lower (around
 400 MiB for me) - it's a special table + API designed to handle very
 large binary objects, from a few hundred MiBs to, more recently, TiBs.
 I don't see appropriate definitions anywhere and can't find any real
 mention of it in sqlalchemy. psycopg2 has support for it and calls it
 lobject; it provides a file-like interface to the LOB, which is a good
 mapping, since with libpq you use lo_creat, lo_seek, lo_write, and
 lo_read to work with these beasts.

 I took a look at UserDefinedType, but its bind_processor doesn't
 distinguish between inserts and updates.  With inserts, you'd use an
 OID allocated from lo_creat within the transaction; on updates, you'd
 use lo_trunc/lo_write.  As one more constraint, you must be in a
 transaction before any of these functions are usable.  To reference
 large objects, since they are explicitly out-of-table storage, the
 PostgreSQL-specific OID type is used (which allows garbage collection,
 referential integrity checks, etc.).

 I'll also mention that other tables reference these large objects via
 OIDs - something like smart pointers in Postgres.

 It'd be great to plug large objects into sqlalchemy properly - but can
 it be done?
Well, we just added the OID type in 0.9.5, so you at least have that.

The PG LOB feature is very sparsely documented - in PG's docs, they only
seem to document the C API
(www.postgresql.org/docs/9.2/static/lo-interfaces.html); feel free to
point me to better documentation on this.   

As far as transparently embedding this into INSERT/UPDATE, my first
thought is that this might be unwise considering how far removed these
functions are from any kind of mainstream usage in such statements -
particularly if separate statements have to be called per value to get
at OIDs or similar.That PG's docs barely mention this whole feature
much less any kind of regular SQL integration is a red flag.   PG's
BYTEA type is already arbitrarily large so there is probably not much
interest in a type like this.If it's the streaming feature you're
looking for, SQLA's usual approach such as that of Oracle's LOB is to
pre-stream it on the result set side (necessary, because fetching a
batch of rows requires it), and cx_Oracle doesn't provide too much
option to stream on the write side.   I've dealt a lot with streaming
datatypes back in the day but sometime in the early 2000's everyone just
stopped using them.

As the feature involves SQL functions, I don't think you'd be dealing
only with bind_processor(); the SQL functions themselves would probably
be rendered via SQL-level processing, see
http://docs.sqlalchemy.org/en/rel_0_9/core/types.html#applying-sql-level-bind-result-processing.
It's true these functions aren't given access to the compiler context
where you'd be able to discern INSERT from UPDATE, so I'd probably use
two variants of datatype, an InsertLOB and UpdateLOB datatype, then I'd
apply these to an INSERT or UPDATE statement at runtime probably using a
@compiles decorator -
http://docs.sqlalchemy.org/en/rel_0_9/core/compiler.html, that is, take
the existing LOB type and just swap it out for InsertLOB/UpdateLOB.   
Adding compiler context to TypeEngine is something we could look into as
a feature as well so that bind_expression() has this available somehow
and switching around types wouldn't be needed.
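
To sketch the shape of what I mean (completely untested, names invented;
the payload itself would still have to be written separately via the
driver):

    from sqlalchemy import func
    from sqlalchemy.types import UserDefinedType

    class LOB(UserDefinedType):
        # what the user declares on the column
        def get_col_spec(self):
            return "OID"

    class InsertLOB(LOB):
        def bind_expression(self, bindvalue):
            # INSERT: have the server allocate a fresh large object
            return func.lo_creat(-1)

    class UpdateLOB(LOB):
        def bind_expression(self, bindvalue):
            # UPDATE: keep the existing OID; contents rewritten client-side
            return bindvalue

A @compiles hook on Insert/Update constructs would then swap LOB out for
the appropriate variant at compile time.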

Building this would involve stringing together hooks that are not quite
set up for this; however, for me to really assist here I'd need to see
exact examples of what INSERT, UPDATE, and anything else look like in
conjunction with these functions.

I would note that psycopg2 itself also provides for extension types,
including custom Connection and Cursor subclasses.   If a lot of things
have to happen when these types are in play, it might be something that
can occur at that level; PG's type API is obviously a lot more PG-specific.
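
For instance, something along these lines (a sketch; the names are
invented):

    import psycopg2
    import psycopg2.extensions

    class LOBConnection(psycopg2.extensions.connection):
        # convenience wrapper: write bytes, hand back the new OID
        def write_lob(self, data):
            lob = self.lobject(0, "wb")
            try:
                lob.write(data)
                return lob.oid
            finally:
                lob.close()

    conn = psycopg2.connect("dbname=test",
                            connection_factory=LOBConnection)

With SQLAlchemy you'd pass the factory through
create_engine(..., connect_args={"connection_factory": LOBConnection}).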




 -Jason

Re: [sqlalchemy] postgresql large objects

2014-06-25 Thread Jason Newton
Hi,

I've replied inline below.


On Wed, Jun 25, 2014 at 6:46 AM, Mike Bayer mike...@zzzcomputing.com
wrote:


  Well, we just added the OID type in 0.9.5, so you at least have that.


I came across the entry on the issue tracker a little bit after
submitting.  As usual for my luck, its support wasn't added until very recently.


 The PG LOB feature is very sparsely documented - in PG's docs, they only
 seem to document the C API (
 www.postgresql.org/docs/9.2/static/lo-interfaces.html); feel free to
 point me to better documentation on this.


There are also server-side functions; they operate almost exactly the same
as the client-side API: http://www.postgresql.org/docs/9.3/static/lo-funcs.html .
There is no better documentation than those two pages that I know of, but
they were sufficient for me.
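
To be concrete, the server-side flavor looks about like this from SQLA
(an untested sketch; the engine URL is made up):

    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql+psycopg2://localhost/test")
    with engine.begin() as conn:  # the lo_* functions require a transaction
        oid = conn.execute(text("SELECT lo_creat(-1)")).scalar()  # allocate
        # ... write via server-side lo_open/lowrite/lo_close,
        #     or via psycopg2's client-side lobject ...
        conn.execute(text("SELECT lo_unlink(:oid)"), oid=oid)     # delete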


 As far as transparently embedding this into INSERT/UPDATE, my first
 thought is that this might be unwise considering how far removed these
 functions are from any kind of mainstream usage in such statements -
 particularly if separate statements have to be called per value to get at
 OIDs or similar.That PG's docs barely mention this whole feature much
 less any kind of regular SQL integration is a red flag.   PG's BYTEA type
 is already arbitrarily large so there is probably not much interest in a
 type like this.If it's the streaming feature you're looking for,
 SQLA's usual approach such as that of Oracle's LOB is to pre-stream it on
 the result set side (necessary, because fetching a batch of rows requires
 it), and cx_Oracle doesn't provide too much option to stream on the write
 side.   I've dealt a lot with streaming datatypes back in the day but
 sometime in the early 2000's everyone just stopped using them.


That seems to fly in the face of the point of SQLA, although the
integration difficulties are appreciated.  Most advanced postgresql drivers
in any language's bindings have added support for this type, although none
of them can hide that it's file-like.  PG's BYTEA is NOT arbitrarily large:
it has quite real practical limits, and I've hit them regularly in storing
compressed HDF5 documents in the database as part of a bench-testing
framework.  The theoretical limit is 1GB, but the practical limit is far
lower (
http://www.postgresql.org/message-id/cafj8pracfkoinp2uxeizod5krx29n2ofsoldh0w6ej7rxko...@mail.gmail.com
 , http://www.postgresql.org/docs/9.3/static/storage-toast.html ).  I'm not
looking for streaming, just retrieving/setting binary buffers.  It'd be
nice to translate them transparently to in-memory HDF5 Python objects
(i.e. h5py).



 As the feature involves SQL functions, I don't think you'd be dealing only
 with bind_processor(); the SQL functions themselves would probably be
 rendered via SQL-level processing, see
 http://docs.sqlalchemy.org/en/rel_0_9/core/types.html#applying-sql-level-bind-result-processing.
 It's true these functions aren't given access to the compiler context where
 you'd be able to discern INSERT from UPDATE, so I'd probably use two
 variants of datatype, an InsertLOB and UpdateLOB datatype, then I'd apply
 these to an INSERT or UPDATE statement at runtime probably using a
 @compiles decorator -
 http://docs.sqlalchemy.org/en/rel_0_9/core/compiler.html, that is, take
 the existing LOB type and just swap it out for InsertLOB/UpdateLOB.
 Adding compiler context to TypeEngine is something we could look into as a
 feature as well so that bind_expression() has this available somehow and
 switching around types wouldn't be needed.


Separate datatypes are clearly not a good approach to this.


 building this would involve stringing together hooks that are not quite
 set up for this, however for me to really assist here I'd need to see exact
 examples of what INSERT, UPDATE and anything else looks like in conjunction
 with these functions.


Everything starts with a transaction block.  Hand-crafted inserts use
server-side lo_create(-1) (in SQL), which allocates a new large object,
with a RETURNING clause to get the OID back to the client side in one go.
Then I start using the lobject API on the returned OID.   In patches I've
made to cppdb, in the bind of std::istream I call client-side lo_create,
lo_open, a sequence of lo_writes, and finally lo_close.  In retrospect
this doesn't handle updates particularly well, and maybe it could be used
in sqlalchemy as-is too - an update just allocates a new large object and
unreferences the old one.  This would leave it up to vacuumlo to GC those
updated lobs.  In my workload, however, my lobs so far are immutable - they
are results of simulations, and this is the way I've worked to date.  It
probably won't stay that way forever.
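
In rough psycopg2 terms, the insert path I described looks like this
(an untested sketch; table and column names are made up):

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # made-up DSN
    with conn:
        cur = conn.cursor()
        # allocate the large object server-side, get its OID back in one go
        cur.execute(
            "INSERT INTO results (name, payload) "
            "VALUES (%s, lo_creat(-1)) RETURNING payload",
            ("sim-run-42",),
        )
        oid = cur.fetchone()[0]
        # then stream the bytes in through the client-side API
        lob = conn.lobject(oid, "wb")
        lob.write(b"... compressed HDF5 document ...")
        lob.close()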



 I would note that psycopg2 itself also provides for extension types,
 including custom Connection and Cursor subclasses.   If a lot of things
 have to happen when these types are in play it might be something that can
 occur at that level, PG's type API is obviously a lot more PG specific.


-Jason


Re: [sqlalchemy] postgresql large objects

2014-06-25 Thread Mike Bayer

On 6/25/14, 2:53 PM, Jason Newton wrote:
 That seems to fly in the face of the point of SQLA, although the
 integration difficulties are appreciated.  Most advanced postgresql
 drivers in any language's bindings have added support for this type,
 although none of them can hide that it's file-like. 
SQLA doesn't go out of its way for unusual, very database-specific
features that up till now have demonstrated little to no real-world use
(nobody's ever asked for this feature before, and googling about
Postgresql LOBs turns up very little).  There are tons of examples of
features like this across many different database backends.  If they are
easy to add, we add them, or if they are easy to address via a recipe,
we add the recipe to the wiki.

But if the behavior requires substantial changes to the core and
dialect, and the ratio of complexity of impact to sparseness of need is
really high, it's not worth it, and actually kind of damaging to most
users, to complicate the library for use cases that are extremely rare
and can just as well be addressed by dropping down to raw DBAPI code. 
Complications, destabilizations, and performance degradations hoisted
onto the whole userbase for the benefit of a single feature that is
virtually never needed are the wrong choice to make; I'm presented
with this choice all the time, and there's nearly always work to be done
in extricating ill-conceived features and behaviors that went in too
quickly.  I'm pretty confident that this feature won't require any of
that, but that remains to be seen.

Just in case this wasn't apparent, you certainly *can* use psycopg2's
bindings when you're in an otherwise SQLAlchemy app.  Worst case, you can
retrieve a raw psycopg2 connection using connection.raw_connection and
do whatever you need.  If you truly have some use for LOBs, SQLAlchemy
isn't preventing you from using them; it's just not providing any
niceties around them.   The fact that these unusual use cases are never
prohibited by SQLA further raises the bar to adding first-class support
for them.


 PG's BYTEA is NOT arbitrarily large: it has quite real practical
 limits, and I've hit them regularly in storing compressed HDF5
 documents in the database as part of a bench-testing framework.  The
 theoretical limit is 1GB, but the practical limit is far lower
 (http://www.postgresql.org/message-id/cafj8pracfkoinp2uxeizod5krx29n2ofsoldh0w6ej7rxko...@mail.gmail.com,
 http://www.postgresql.org/docs/9.3/static/storage-toast.html).  I'm
 not looking for streaming, just retrieving/setting binary buffers.  It'd
 be nice to translate them transparently to in-memory HDF5 Python objects
 (i.e. h5py).
Python translation is very easy in SQLA; it's just that if you have special
needs for SQL syntaxes, that's where special behaviors may be needed. 
So far it sounds like the only blocking factor is that bind_expression()
needs to distinguish between INSERT and UPDATE.  That's not a terribly
tall order, though it is inconvenient in that the API would need a
backwards-compatibility layer.


  


 As the feature involves SQL functions, I don't think you'd be
 dealing only with bind_processor(); the SQL functions themselves
 would probably be rendered via SQL-level processing, see
 http://docs.sqlalchemy.org/en/rel_0_9/core/types.html#applying-sql-level-bind-result-processing.

 It's true these functions aren't given access to the compiler
 context where you'd be able to discern INSERT from UPDATE, so I'd
 probably use two variants of datatype, an InsertLOB and UpdateLOB
 datatype, then I'd apply these to an INSERT or UPDATE statement at
 runtime probably using a @compiles decorator -
 http://docs.sqlalchemy.org/en/rel_0_9/core/compiler.html, that is,
 take the existing LOB type and just swap it out for
 InsertLOB/UpdateLOB.Adding compiler context to TypeEngine is
 something we could look into as a feature as well so that
 bind_expression() has this available somehow and switching around
 types wouldn't be needed.


 Separate datatypes is clearly not a good approach to this.

Let me clarify that these separate datatypes would be totally invisible
to the user.   The user would work with a single LOB type.  Translation
to Insert/Update versions would not be explicit and would occur at the
point at which the insert/update construct is compiled.   At the moment
this may be the best approach short of modifying the library (but then
again I can't say much about the approach because I have little idea
what the SQL we're talking about looks like).
 
 Everything starts with a transaction block. 
The psycopg2 DBAPI is implicitly in a transaction block at all times
unless autocommit mode is set up, so there shouldn't be any problem here.

 Hand-crafted inserts use server-side lo_create(-1) (in SQL), which
 allocates a new large object, with a RETURNING clause to get the OID
 back to the client side in one go.  Then I start using the lobject API
 on the returned OID. 
I'd like to see explicit SQL, preferably in the form of a psycopg2 script
that illustrates all the operations you wish to support and specifically
how they must interact with the database.

Re: [sqlalchemy] postgresql large objects

2014-06-25 Thread Jason Newton
 Just in case this wasn't apparent, you certainly *can* use psycopg2's
 bindings when you're in an otherwise SQLAlchemy app.  Worst case, you can
 retrieve a raw psycopg2 connection using connection.raw_connection and do
 whatever you need.  If you truly have some use for LOBs, SQLAlchemy isn't
 preventing you from using them; it's just not providing any niceties
 around them.   The fact that these unusual use cases are never prohibited
 by SQLA further raises the bar to adding first-class support for them.

 Yes, this much I knew would work; it's just clunkier than it could be.





  PG's BYTEA is NOT arbitrarily large: it has quite real practical limits,
 and I've hit them regularly in storing compressed HDF5 documents in the
 database as part of a bench-testing framework.  The theoretical limit is
 1GB, but the practical limit is far lower (
 http://www.postgresql.org/message-id/cafj8pracfkoinp2uxeizod5krx29n2ofsoldh0w6ej7rxko...@mail.gmail.com
 , http://www.postgresql.org/docs/9.3/static/storage-toast.html ).  I'm not
 looking for streaming, just retrieving/setting binary buffers.  It'd be
 nice to translate them transparently to in-memory HDF5 Python objects
 (i.e. h5py).

 Python translation is very easy in SQLA; it's just that if you have
 special needs for SQL syntaxes, that's where special behaviors may be
 needed.  So far it sounds like the only blocking factor is that
 bind_expression() needs to distinguish between INSERT and UPDATE.  That's
 not a terribly tall order, though it is inconvenient in that the API would
 need a backwards-compatibility layer.


Perhaps you can keep backwards compatibility by introspecting the callbacks
to determine the arity of their arguments.
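
For example, something like this could dispatch both styles (a sketch):

    import inspect

    def call_bind_hook(fn, bindvalue, compiler_context):
        # old-style hooks take only the value; new-style also get context
        nargs = len(inspect.getargspec(fn).args)
        if nargs >= 2:
            return fn(bindvalue, compiler_context)
        return fn(bindvalue)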








 As the feature involves SQL functions, I don't think you'd be dealing only
 with bind_processor(); the SQL functions themselves would probably be
 rendered via SQL-level processing, see
 http://docs.sqlalchemy.org/en/rel_0_9/core/types.html#applying-sql-level-bind-result-processing.
 It's true these functions aren't given access to the compiler context where
 you'd be able to discern INSERT from UPDATE, so I'd probably use two
 variants of datatype, an InsertLOB and UpdateLOB datatype, then I'd apply
 these to an INSERT or UPDATE statement at runtime probably using a
 @compiles decorator -
 http://docs.sqlalchemy.org/en/rel_0_9/core/compiler.html, that is, take
 the existing LOB type and just swap it out for InsertLOB/UpdateLOB.
 Adding compiler context to TypeEngine is something we could look into as a
 feature as well so that bind_expression() has this available somehow and
 switching around types wouldn't be needed.


  Separate datatypes are clearly not a good approach to this.


 Let me clarify that these separate datatypes would be totally invisible to
 the user.   The user would work with a single LOB type.  Translation to
 Insert/Update versions would not be explicit and would occur at the point
 at which the insert/update construct is compiled.   At the moment this may
 be the best approach short of modifying the library (but then again I
 can't say much about the approach because I have little idea what the SQL
 we're talking about looks like).


Ah.  I thought it would be done in the declarative table descriptions.
Being relatively new to SQLA, I'm not sure what that would look like.




   Everything starts with a transaction block.

 the psycopg2 DBAPI is implicitly in a transaction block at all times
 unless autocommit mode is set up, so there shouldn't be any problem here.

 AFAIK it won't work with autocommit, but I don't have documentation to
prove that; I just remember getting errors pertaining to it in (I think) JDBC.



Hand-crafted inserts use server-side lo_create(-1) (in SQL), which
 allocates a new large object, with a RETURNING clause to get the OID back
 to the client side in one go.  Then I start using the lobject API on the
 returned OID.

 I'd like to see explicit SQL, preferably in the form of a psycopg2 script
 that illustrates all the operations you wish to support and specifically
 how they must interact with the database.


I've attached example usages.  As I indicated in my prior email, right now
I only do inserts/selects.



In patches I've made to cppdb, in the bind of std::istream I call
 client side lo_create, lo_open, a sequence of lo_writes, and finally
 lo_close.  In retrospect this doesn't handle updates particularly well and
 maybe this could be used in sqlalchemy as is too - an update just allocates
 a new large object and unreferences the old one.

 I wonder why the mechanics of these functions can't be more cleanly
 wrapped into server-side stored procedures?    If they can be distilled
 into simple insert_lob() / update_lob() functions, that would potentially
 be a better separation of duties.


I think it's because only small lobs could rely on that - huge lobs that
exceed working memory fit better with a file-like API.  It would be a nice
convenience to have, perhaps.  These same small lobs (i.e. lobs that fit in

Re: [sqlalchemy] postgresql large objects

2014-06-25 Thread Mike Bayer

On 6/25/14, 5:35 PM, Jason Newton wrote:

  


 Hand-crafted inserts use server-side lo_create(-1) (in SQL), which
 allocates a new large object, with a RETURNING clause to get the OID
 back to the client side in one go.  Then I start using the
 lobject API on the returned OID. 
 I'd like to see explicit SQL, preferably in the form of a psycopg2
 script that illustrates all the operations you wish to support and
 specifically how they must interact with the database.


 I've attached example usages.  As I indicated in my prior email, right
 now I only do inserts/selects.
Here's my immediate thought about the INSERT (and the UPDATE): the
first way uses events; the second would attempt to move this system
into something more native to the psycopg2 dialect.

Use the before_cursor_execute() and after_cursor_execute() events to get
at this:

http://docs.sqlalchemy.org/en/latest/core/events.html#sqlalchemy.events.ConnectionEvents.before_cursor_execute
http://docs.sqlalchemy.org/en/latest/core/events.html#sqlalchemy.events.ConnectionEvents.after_cursor_execute

1. Create a new type called PGLOB or whatever you want.  For starters,
this can just be a blank UserDefinedType subclass.

in before_cursor_execute:

2. Determine whether it's an INSERT or an UPDATE using context.isinsert / context.isupdate.

3. Examine the datatypes SQLAlchemy is using here, by looking at
context.compiled.binds.  The values are BindParameter objects; you
want to find the ones whose type is PGLOB.

4. From context.compiled.binds, you have the names of the bound params
with that type.  Search the statement for occurrences of those bound
parameters and replace them with lo_creat(-1) or whatever you need
there.

5. Also append to the RETURNING clause those cols you need.

6. The statement you return will be used, provided you set up the event
with retval=True (see the docs).

in after_cursor_execute:

7. In after_cursor_execute, call fetchone() to get the RETURNING
values.  Get the OID you care about, then do that work with conn.lobject
and all that.  Hopefully this doesn't mess up the existing cursor state.

8.  Now the tricky part.  SQLAlchemy needs that row, if you're doing
implicit returning, to get at primary key values.   psycopg2's cursor
seems to have a scroll() method that works for client-side cursors.  
I'd scroll it back one so that SQLA gets the state it expects.
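
In skeleton form, the event-based version might look like this (untested;
PGLOB and the parameter rewriting are purely illustrative, and engine is
assumed to be your Engine):

    from sqlalchemy import event
    from sqlalchemy.types import UserDefinedType

    class PGLOB(UserDefinedType):
        def get_col_spec(self):
            return "OID"

    @event.listens_for(engine, "before_cursor_execute", retval=True)
    def rewrite_lob_params(conn, cursor, statement, parameters,
                           context, executemany):
        if context.isinsert or context.isupdate:
            for name, bind in context.compiled.binds.items():
                if isinstance(bind.type, PGLOB):
                    # swap the pyformat param for a server-side allocation
                    statement = statement.replace(
                        "%%(%s)s" % name, "lo_creat(-1)")
        return statement, parameters

    @event.listens_for(engine, "after_cursor_execute")
    def write_lob_payloads(conn, cursor, statement, parameters,
                           context, executemany):
        # fetchone() for the RETURNING row, write via conn.lobject(oid),
        # then cursor.scroll(-1) so SQLA still sees the row it expects
        ...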

Alternative system, more plugged in:

1. We would be creating new features on
sqlalchemy/dialects/postgresql/psycopg2.py -
PGExecutionContext_psycopg2.  Similar hooks are available here which you
can use to accomplish similar tasks;  you'd want to look at the
pre_exec(), post_exec() and possibly post_insert() methods, and maybe
even _fetch_implicit_returning().If SQLA were to support this more
natively, things would be happening at this level.   But again, I really
wouldn't want all kinds of hardcoded expectations of LOB objects in this
object taking up processing time for the vast majority of use cases that
don't use LOBs, so the extra logic here should be contained within
something that can easily be placed aside based on options or similar.



