Re: [sqlalchemy] Disable `session_identity_map` for `_instance_processor`

Антонио Антуан Thu, 23 Nov 2017 05:44:39 -0800

> A Query can have lots of entities in it, and if you're doing sharding a
> single result set can refer to any number of shard identifiers within
> not just a single result set but within a single row; they might have
> come from dozens of different databases at once


In my case it is not possible: all entities in query can be gotten only
from one particular shard. We have totally the same database structure for
each shard. The difference is just data stored into database. No `shard_id`
or any other key as part of primary key for any table. If I want to make
query for particular database I always want to retrieve data ONLY from that
database. And even more than that: ONLY one database during one session
transaction (or, in other words, one http-request to our app).

> This could only be suited with a very open plugin point that is carefully
> architected, tested, and documented and I don't have the resources to
> envision this for a short-term use case.

I've seen a lot of questions (for example in stackoverflow) "how to manage
several binds with sqlalchemy" and all answers are "Use separated
sessions". I really not understand, what is the problem to implement
"multibound" session. I make it in my project and it's really beautiful,
clear and... Don't have ideal vocabulary to explain how our team like it :)
Currently I see only one problem: loading instances. Of course, after
fixing other problems may appear...

I can (and want) make it as part of SQLAlchemy library. Fully-tested part,
of course. If you say that it is bad idea, ok then. I can make it as a
plugin, but there is a problem: functions in `loading` module are
monolithic and it needs some refactor for the plugin. May I suggest
refactor as pull request? And if so: could it be merged not only for major
release but for, at least, 1.0.* (yes, in our project we use 1.0.19 :) )?
I really don't want (of course!) to copy entire `loading` module for
additional logic of `identitykey` construction. But currently I do not see
any other way to implement it for my project :(

> when you query two different databases, you are using
> two independent transactions in any case;

So, what is the difference, if there are two transactions in any case? :)

> I don't understand why you can't use independent sessions

The first problem is `query` property of `Base` instances. If we use
several sessions, we need to use the same amount of `Base` classes and,
consequently, the same amount of models, don't we?
Another problem for us is already existed code. We can use sessions
registry, but it take a lot of month to override entire project. Another
way:  append into `Query.__iter__` such a code:
>> self.with_session(sessions.get(self._shard_id))
>> return super(Query, self).__iter__()
But it has no effect for UPDATE and INSERT queries. Also, I'm not sure that
there is no problems in that way...


I have one more thought. Don't you think that it is some kind of bug: I
make query for one bind and got entity from another. Yes, that behavior is
not foreseen by library. But from other point of view, library docs have
examples how to use several binds within one session. So, problem may
happens not only in my case.

Anyway, can my suggestion (
https://gist.github.com/aCLr/746f92dedb4d303a49033c0db22beced) has any
effect for classic one-bound `Session`? If it can't, so, what's the
problem? :)


Excuse me for wasting your time.
And excuse me if my suggestions are idiotic :)

Appreciate your help.

чт, 23 нояб. 2017 г. в 0:20, Mike Bayer <[email protected]>:

> On Wed, Nov 22, 2017 at 4:56 AM, Антонио Антуан <[email protected]>
> wrote:
> > Glad to see that you remember my messages :)
> >
> > I've dived into `loading` module and I see that currently it is really
> > complicated to store additional data into pkey for each instance.
> >
> > Unfortunately, suggested solutions not good for my project.
> > Also, I think that `shard` meaning in my case is not the same as usual.
> >
> > I want to describe structure of out project, maybe it can help.
> >
> > Here is definition of our databases structure:
> > http://joxi.ru/nAyJVvGiXMv0Dr.
> > We got master db and several geo databases. Catalogs like `users`,
> `groups`,
> > `offers` and other are replicating to geo databases, so that data is
> always
> > the same.
> > But also we have tables like `clicks` and `leads`. Each app instance
> > contains the data about them in database, related to its geo:
> > europe-instance into europe-db, usa-instance into usa-database and so on.
> > Periodically master-app pulls clicks and leads to master-database. Synced
> > objects always have different ids into master- and get-db, so it is ok.
> >
> > But one time project owner came and said: "I need SAAS".
> > We see, that in current structure it's very hard (and really ugly) to
> > implement saas-solution. Amount of `Base*`, `Session*`, `Order*` and
> other
> > models will be multiplied with tenants amount.
> >
> > I discovered that I can override `get_bind` with another logic and it was
> > great: we can remove several `Base` classes, several Sessions and several
> > `Orders`.
> >
> > Mechanism looks like this:
> > - we use one instance on each geo for all tenants.
> > - we create separated databases for each tenant: this will be multiplied
> > with tenants amount: http://joxi.ru/nAyJVvGiXMv0Dr.
> > - we detect `tenant_id` using `request.host` (we use flask): each domain
> > binds with particular tenant;
> > - we store `tenant_id` into global storage.
> > - we use stored `tenant_id` into `Session.get_bind`:
> > https://gist.github.com/aCLr/9f578a2eeb225e7d65099ffb49aa8f3a
> > - into flask `teardown_request` we clear `tenant_id` storage and call
> > `Session.remove()`
> > - if we need to read from another get, just write `query =
> > query.set_shard_id(GEO)` `tenant_id`
> >
> > For celery we use this:
> > https://gist.github.com/aCLr/d8c5ac38956947da092375b2f89d7b50
> > Clear to.
> >
> > All this leads us only to pros, without cons: any developer has no need
> to
> > think about database chosing, just write code like there is only one
> > database. If you need to read from another geo-database, just call
> > `query.set_shard(GEO)`, tenant will be appended automatically to it.
> >
> > Problems begin when we tried to test non-flask and non-celery scripts,
> like
> > cron tasks: we may want to query several tenant-databases during one
> > SQLA-transaction, somethins like in my first example:
> > https://gist.github.com/aCLr/ff9462b634031ee6bccbead8d913c41f
> > (`assert_got_correct_objects_with_remove` and
> > `assert_got_cached_objects_without_remove`). The result you know.
> >
> >
> > During writing this message, I found out, that we need only one
> additional
> > data for primary key: `connection.bind.url`. I see, that SQLA already
> have
> > it inside `_instance_processor`, so it exists inside `_instance`. I
> think,
> > that `identity_key` should be constructed (in my case) with this code:
> > https://gist.github.com/aCLr/746f92dedb4d303a49033c0db22beced. Clear,
> don't
> > you think so?
>
> that's where something needs to happen but SQLAlchemy can't do this in
> such a way that is hardcoded to exactly your particular use case.   A
> Query can have lots of entities in it, and if you're doing sharding a
> single result set can refer to any number of shard identifiers within
> not just a single result set but within a single row; they might have
> come from dozens of different databases at once.  This could only be
> suited with a very open plugin point that is carefully architected,
> tested, and documented and I don't have the resources to envision this
> for a short-term use case.
>
> > Problems begin when we tried to test non-flask and non-celery scripts,
> like
> > cron tasks: we may want to query several tenant-databases during one
> > SQLA-transaction,
>
> I don't understand why you can't use independent sessions for this,
> because when you query two different databases, you are using two
> independent transactions in any case;  they are only coordinated if
> one is using two-phase transactions which I doubt is the case here
> (while SQLAlchemy put a lot of work into making that possible, I don't
> think anyone has ever used that feature).
>
> Instead of:
>
> Session.query(Foo).set_shard(id)
>
> you say:
>
> sessions.get(shard_id).query(Foo)
>
>
>
>
>
> >
> >
> >
> > вт, 21 нояб. 2017 г. в 19:15, Mike Bayer <[email protected]>:
> >>
> >> I've looked to see how hard it would be to allow "supplemental"
> >> attributes to form part of the mapper's primary key tuple, and it
> >> would be pretty hard.   The "easy" part is getting the mapper to set
> >> itself up with some extra attributes that can deliver some kind of
> >> supplemental value to the identity key.  the harder part is then in
> >> loading.py where we get new rows from the DB and need this
> >> value...which means some whole new kind of system would need to
> >> deliver this for any arbitrary part of the result set given a mapping
> >> and the selectable we're looking at (keep in mind a Query can have
> >> lots of the same mapping in a single row with aliases).   This would
> >> be very complicated to implement and test.   I am not seeing any quick
> >> way to suit this use case, which has also not ever been requested
> >> before.
> >>
> >>
> >>
> >> On Tue, Nov 21, 2017 at 10:12 AM, Mike Bayer <[email protected]>
> >> wrote:
> >> > On Tue, Nov 21, 2017 at 7:39 AM, Антонио Антуан <[email protected]>
> >> > wrote:
> >> >> Hi guys.
> >> >>
> >> >> I got this code example:
> >> >> https://gist.github.com/aCLr/ff9462b634031ee6bccbead8d913c41f.
> >> >>
> >> >> Here I make custom `Session` and custom `Query`. As you see,
> `Session`
> >> >> has
> >> >> several binds.
> >> >>
> >> >> Also, you can see that there are two functions:
> >> >> `assert_got_correct_objects_with_remove` and
> >> >> `assert_got_cached_objects_without_remove`.
> >> >>
> >> >> The first checks that we got correct results if `Session.remove`
> >> >> called.
> >> >> The second checks, that we got incorrect results if `Session.remove`
> >> >> not
> >> >> called.
> >> >>
> >> >> I understand, that behavior is correct: we don't remove session - so,
> >> >> we got
> >> >> same result from "cache-like"
> >> >> `sqlalchemy.orm.loading._instance_processor.session_identity_map`.
> >> >>
> >> >> I want to avoid that mechanism and don't want to use
> >> >> `session_identity_map`
> >> >> for different binds. In ideal, bind should be used as part of key for
> >> >> `session_identity_map`, but I guess, that it is not possible.
> >> >> Another way, acceptable for me: disable this mechanism. But I do not
> >> >> found
> >> >> ways to achieve this.
> >> >> And the third option: construct instances manually. Looks like I
> should
> >> >> copy
> >> >> code from `loading` module and add that method to `CustomSession`:
> >> >
> >> >
> >> > there's really no reason at all to use a "ShardedSession" if you have
> >> > overlapping primary key spaces from each of your binds.   I'm not sure
> >> > if I mentioned this at the beginning of the emails regarding this
> >> > project but I hope that I mentioned just using separate Session
> >> > objects is vastly simpler for non-intricate sharding cases, such as
> >> > where you always know which shard you care about and you don't care
> >> > about any of the others for a certain operation.     The point of
> >> > ShardedSession is so that objects pulled from multiple databases can
> >> > be intermingled in the same query and in the same transaction - which
> >> > by definition means they have unique primary keys.   If that's not
> >> > what you're doing here I don't see what advantage all this complexity
> >> > is getting you.
> >> >
> >> > If you're still convinced you need to be using a monolithic
> >> > ShardedSession then there needs to be some kind of translation of data
> >> > such that the mapper sees unique primary keys across the shards, or
> >> > unique classes.
> >> >
> >> > I've tried to think of ways to do this without too much difficulty but
> >> > none of them are really worth the complexity and hackiness it would
> >> > require.   The absolutely quickest and most well-supported, no hacks
> >> > required way would be to properly create a composite primary key on
> >> > your classes, where the second column is your shard id:
> >> >
> >> > class A(Base):
> >> >     __tablename__ = 'a'
> >> >
> >> >     id = Column(Integer, primary_key=True)
> >> >     shard_id = Column(Integer, primary_key=True)
> >> >
> >> > I tried to see if the "shard_id" column can be some kind of expression
> >> > that is not a Column on the Table but the mapper() is not set up to
> >> > support this unless you mapped the whole class to a select()
> >> > construct, which would make for too-complicated SQL, and you'd still
> >> > need to intercept this select() using events to put the right shard id
> >> > in.  Another is to create a custom column that renders in a special
> >> > way, but again you need to create events to intercept it in every case
> >> > to put the right shard id in, and/or remove it from things like
> >> > insert() statements.
> >> >
> >> > by far your two best solutions are: 1. use separate Session objects
> >> > per shard  2. make sure your data actually has shard-specific primary
> >> > keys
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >>
> >> >> def instances(self, cursor, __context=None):
> >> >>     context = __context
> >> >>     if context is None:
> >> >>         context = QueryContext(self)
> >> >>     return self._custom_instances(self, cursor, context)
> >> >>
> >> >>
> >> >>
> >> >> def custom_instances(query, cursor, context):
> >> >>      """copied from `loading.instances` code with disabled
> >> >> `session_identity_map`"""
> >> >>
> >> >>
> >> >>
> >> >> The third way is the most ugly and I want to avoid it.
> >> >>
> >> >> Could you help me with my hard choice and, maybe, suggest any other
> >> >> ways and
> >> >> options? :)
> >> >>
> >> >> Thank you.
> >> >>
> >> >> --
> >> >> SQLAlchemy -
> >> >> The Python SQL Toolkit and Object Relational Mapper
> >> >>
> >> >> http://www.sqlalchemy.org/
> >> >>
> >> >> To post example code, please provide an MCVE: Minimal, Complete, and
> >> >> Verifiable Example. See http://stackoverflow.com/help/mcve for a
> full
> >> >> description.
> >> >> ---
> >> >> You received this message because you are subscribed to the Google
> >> >> Groups
> >> >> "sqlalchemy" group.
> >> >> To unsubscribe from this group and stop receiving emails from it,
> send
> >> >> an
> >> >> email to [email protected].
> >> >> To post to this group, send email to [email protected].
> >> >> Visit this group at https://groups.google.com/group/sqlalchemy.
> >> >> For more options, visit https://groups.google.com/d/optout.
> >>
> >> --
> >> SQLAlchemy -
> >> The Python SQL Toolkit and Object Relational Mapper
> >>
> >> http://www.sqlalchemy.org/
> >>
> >> To post example code, please provide an MCVE: Minimal, Complete, and
> >> Verifiable Example.  See  http://stackoverflow.com/help/mcve for a full
> >> description.
> >> ---
> >> You received this message because you are subscribed to the Google
> Groups
> >> "sqlalchemy" group.
> >> To unsubscribe from this group and stop receiving emails from it, send
> an
> >> email to [email protected].
> >> To post to this group, send email to [email protected].
> >> Visit this group at https://groups.google.com/group/sqlalchemy.
> >> For more options, visit https://groups.google.com/d/optout.
> >
> > --
> >
> > Антон
> >
> > --
> > SQLAlchemy -
> > The Python SQL Toolkit and Object Relational Mapper
> >
> > http://www.sqlalchemy.org/
> >
> > To post example code, please provide an MCVE: Minimal, Complete, and
> > Verifiable Example. See http://stackoverflow.com/help/mcve for a full
> > description.
> > ---
> > You received this message because you are subscribed to the Google Groups
> > "sqlalchemy" group.
> > To unsubscribe from this group and stop receiving emails from it, send an
> > email to [email protected].
> > To post to this group, send email to [email protected].
> > Visit this group at https://groups.google.com/group/sqlalchemy.
> > For more options, visit https://groups.google.com/d/optout.
>
> --
> SQLAlchemy -
> The Python SQL Toolkit and Object Relational Mapper
>
> http://www.sqlalchemy.org/
>
> To post example code, please provide an MCVE: Minimal, Complete, and
> Verifiable Example.  See  http://stackoverflow.com/help/mcve for a full
> description.
> ---
> You received this message because you are subscribed to the Google Groups
> "sqlalchemy" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/sqlalchemy.
> For more options, visit https://groups.google.com/d/optout.
>
-- 

Антон

-- 
SQLAlchemy - 
The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable 
Example.  See  http://stackoverflow.com/help/mcve for a full description.
--- 
You received this message because you are subscribed to the Google Groups 
"sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/d/optout.

Re: [sqlalchemy] Disable `session_identity_map` for `_instance_processor`

Reply via email to