Re: [sqlalchemy] Slow entities pickling for caching purposes

2019-10-11 Thread Alexander Egorov
I've considered manual queries and building classes by myself. This option 
is good in terms of performance, but it requires a lot of code duplication 
and kills code reuse. So I want to use the same mapper classes for both 
read and modify operations. Also, I want to have the same query 
description. I've tried to build classes using Query.values and calling 
constructor and it gave ~3 times boost. What I want is to have the same 
functionality in generic form, without need to manually write these loops. 
Pseudocode to illustrate my probably crazy idea:
session.query(Entity).all()  # loads entities with modification tracking, 
with _sa_instance_state filled and added to the session.
session.query(Entity).readonly().all()  # loads entities without 
modification tracking, without _sa_instance_state and not added to the 
session.

# Probably it would be useful in cases when the Entity has a read-only link 
to another entity via relationship.
session.query(Entity).options(sa.orm.joinedload(Entity.another_entity, 
readonly_mapping=True)).get(123)

PyPy is an option I'm considering too. Its pickle module seems to be 
significantly slower, however. I need 3.6 version which is in beta now, so 
I have to wait.
Name (time in ms) PyPy 3.5   CPython 3.6.3
--
test_unpickle_dicts   12.2969 (1.0)  2.3764
test_load_from_database   26.4408 (2.15)91.3482
test_unpickle_classes 30.7339 (2.50) 9.4188
--



пятница, 11 октября 2019 г., 20:52:57 UTC+7 пользователь Mike Bayer написал:
>
>
>
> On Fri, Oct 11, 2019, at 5:23 AM, Александр Егоров wrote:
>
> Thank you very much for your answers!
>
> Some more questions
> 1) I see you use sa.inspect(self).dict in __getstate__. Is it safe to 
> replace it with self.__dict__? I didn't find any difference in the result, 
> but __dict__ works 30 times faster.
>
>
> there's no difference.   there's an old extension that allows objects to 
> persist in some form other than __dict__ but I'm sure you're not using it.
>
>
> 2) I'm going to implement __getstate__ and __setstate__ in a Base class. 
> Is it a risky idea from your point of view?
>
>
> not really, pickle is already flaky enough
>
>
>
> 3) We've greatly optimized the unpickling for read-only entities. Is it 
> possible to do the same during loading from the database? When I read data, 
> I don't need ORM state, because I'm not going to modify the data and don't 
> want lazy loads.
>
>
> I would recommend using a plain result object from Query.statement or 
> Core.statement and build up the objects yourself in that case. the "ORM 
> state" you refer towards is intrinsic to everything the ORM does including 
> all the relationship loading, etc.you'd need to re-implement all of 
> that yourself.
>
> Since your performance requirements appear to be well beyond the norm have 
> you considered using pypy ?   This would appear to be a much simpler way to 
> double or triple your performance without building your own ORM.
>
>
>
> On Friday, 11 October 2019 00:34:03 UTC+7, Mike Bayer wrote:
>
>
>
> On Thu, Oct 10, 2019, at 2:48 AM, Александр Егоров wrote:
>
> Hello!
>
> I want to cache query results in a distributed cache (Redis), so I need to 
> serialize/deserialize fetched entities very fast.
> However, it turned that SQLAlchemy entities are very heavy for pickle to 
> dump/load. Regular classes and dicts with the same structure takes 
> significantly less amount of time to reconstruct from bytes.
> It's even slower than fetching and mapping the entities from a database.
>
> Here are benchmark results, loading 1 simple entities:
>
> *class *Entity(Base):
> __tablename__ = *'test_entity'*
> id = sa.Column(sa.Integer, primary_key=*True*, autoincrement=*True*)
> field1 = sa.Column(sa.String)
> field2 = sa.Column(sa.Integer)
>
>
>
> Name (time in ms)  MedianStdDev   
> Dump bytes
> 
> test_unpickle_dicts2.3764 (1.0)  0.0797 (1.09)  
> 337806
> test_unpickle_dataclasses  9.2546 (3.89) 0.0734 (1.0)   
> 437861
> test_unpickle_classes  9.4188 (3.96) 0.1230 (1.68)  
> 437862
> test_load_from_database   91.3482 (38.44)0.6686 (9.10)
> test_unpickle_unsaved_entities   108.6726 (45.73)0.5154 (7.02) 
> 1448169
> test_unpickle_persisted_entities 166.4531 (70.04)0.3787 (5.16) 
> 1697849
> -
>
>
> *Environment*:
> OS: Ubuntu 16.04
> Python: CPython 3.6.3
> SQLAlchemy: 1.3.0
> Database: in-memory SQLite, PostgreSQL (all gives similar results)
> The full 

Re: [sqlalchemy] Slow entities pickling for caching purposes

2019-10-11 Thread Mike Bayer


On Fri, Oct 11, 2019, at 5:23 AM, Александр Егоров wrote:
> Thank you very much for your answers!
> 
> Some more questions
> 1) I see you use sa.inspect(self).dict in __getstate__. Is it safe to replace 
> it with self.__dict__? I didn't find any difference in the result, but 
> __dict__ works 30 times faster.

there's no difference. there's an old extension that allows objects to persist 
in some form other than __dict__ but I'm sure you're not using it.


> 2) I'm going to implement __getstate__ and __setstate__ in a Base class. Is 
> it a risky idea from your point of view?

not really, pickle is already flaky enough



> 3) We've greatly optimized the unpickling for read-only entities. Is it 
> possible to do the same during loading from the database? When I read data, I 
> don't need ORM state, because I'm not going to modify the data and don't want 
> lazy loads.

I would recommend using a plain result object from Query.statement or 
Core.statement and build up the objects yourself in that case. the "ORM state" 
you refer towards is intrinsic to everything the ORM does including all the 
relationship loading, etc. you'd need to re-implement all of that yourself.

Since your performance requirements appear to be well beyond the norm have you 
considered using pypy ? This would appear to be a much simpler way to double or 
triple your performance without building your own ORM.


> 
> On Friday, 11 October 2019 00:34:03 UTC+7, Mike Bayer wrote:
>> 
>> 
>> On Thu, Oct 10, 2019, at 2:48 AM, Александр Егоров wrote:
>>> Hello!
>>> 
>>> I want to cache query results in a distributed cache (Redis), so I need to 
>>> serialize/deserialize fetched entities very fast.
>>> However, it turned that SQLAlchemy entities are very heavy for pickle to 
>>> dump/load. Regular classes and dicts with the same structure takes 
>>> significantly less amount of time to reconstruct from bytes.
>>> It's even slower than fetching and mapping the entities from a database.
>>> 
>>> Here are benchmark results, loading 1 simple entities:
>>> 
>>> *class *Entity(Base):
>>>     __tablename__ = *'test_entity'**
**    *id = sa.Column(sa.Integer, primary_key=*True*, autoincrement=*True*)
>>>     field1 = sa.Column(sa.String)
>>>     field2 = sa.Column(sa.Integer)
>>> 
>>> 
>>> Name (time in ms)                      Median            StdDev           
>>> Dump bytes
>>> 
>>> test_unpickle_dicts                    2.3764 (1.0)      0.0797 (1.09)      
>>> 337806
>>> test_unpickle_dataclasses              9.2546 (3.89)     0.0734 (1.0)       
>>> 437861
>>> test_unpickle_classes                  9.4188 (3.96)     0.1230 (1.68)      
>>> 437862
>>> test_load_from_database               91.3482 (38.44)    0.6686 (9.10)
>>> test_unpickle_unsaved_entities       108.6726 (45.73)    0.5154 (7.02)     
>>> 1448169
>>> test_unpickle_persisted_entities     166.4531 (70.04)    0.3787 (5.16)     
>>> 1697849
>>> -
>>> 
>>> *Environment*:
>>> OS: Ubuntu 16.04
>>> Python: CPython 3.6.3
>>> SQLAlchemy: 1.3.0
>>> Database: in-memory SQLite, PostgreSQL (all gives similar results)
>>> The full benchmark code is in a single attachment file.
>>> To run: 
>>> pytest test_ser.py --benchmark-columns=median,stddev --benchmark-disable-gc
>>> Requirements:
>>> pip install sqlalchemy==1.3.0 pytest==3.6.0 pytest-benchmark==3.1.1 
>>> dataclasses==0.6
>>> 
>>> *Questions:*
>>> 1) What strategy for caching I should consider while using SQLAlchemy? 
>>> Currently, the only option I see is to have a duplicated declaration of 
>>> entities in a form of simple classes and use it when I don't need 
>>> modification. Needles to say, it's a lot of code duplication.
>>> 2) Is it possible to have a read-only loader in SQLAlchemy? I can imagine 
>>> mapping loaded data to an entity class but without instrumentation overhead 
>>> which is not needed to read the data.
>>> If continue the idea further, I can imagine a situation when I load the 
>>> entity for modification, and it has a read-only relationship to another 
>>> entity. So that I don't pay for things I don't use.
>>> 
>>> Probably I'm making some crazy things and all wrong, feel free to point me 
>>> out. My ultimate goal is to manage caching with SQLAlchemy and understand 
>>> SQLAlchemy more in this direction.
>>> Thank you in advance! 
>> 
>> Hi there -
>> 
>> there's no simple answer for this, have you considered writing your own 
>> serializer for your objects rather than using pickle's default strategy? 
>> that is, if you implement __getstate__ and __setstate__, you can allow the 
>> serialization to not include the "_sa_instance_state" that's associated with 
>> the objects which is where you'd get some of the slowness on dumping / 
>> loading, it also would save lots of memory / network overhead by 

Re: [sqlalchemy] Slow entities pickling for caching purposes

2019-10-11 Thread Александр Егоров
Thank you very much for your answers!

Some more questions
1) I see you use sa.inspect(self).dict in __getstate__. Is it safe to 
replace it with self.__dict__? I didn't find any difference in the result, 
but __dict__ works 30 times faster.
2) I'm going to implement __getstate__ and __setstate__ in a Base class. Is 
it a risky idea from your point of view?
3) We've greatly optimized the unpickling for read-only entities. Is it 
possible to do the same during loading from the database? When I read data, 
I don't need ORM state, because I'm not going to modify the data and don't 
want lazy loads.

On Friday, 11 October 2019 00:34:03 UTC+7, Mike Bayer wrote:
>
>
>
> On Thu, Oct 10, 2019, at 2:48 AM, Александр Егоров wrote:
>
> Hello!
>
> I want to cache query results in a distributed cache (Redis), so I need to 
> serialize/deserialize fetched entities very fast.
> However, it turned that SQLAlchemy entities are very heavy for pickle to 
> dump/load. Regular classes and dicts with the same structure takes 
> significantly less amount of time to reconstruct from bytes.
> It's even slower than fetching and mapping the entities from a database.
>
> Here are benchmark results, loading 1 simple entities:
>
> *class *Entity(Base):
> __tablename__ = *'test_entity'*
> id = sa.Column(sa.Integer, primary_key=*True*, autoincrement=*True*)
> field1 = sa.Column(sa.String)
> field2 = sa.Column(sa.Integer)
>
>
>
> Name (time in ms)  MedianStdDev   
> Dump bytes
> 
> test_unpickle_dicts2.3764 (1.0)  0.0797 (1.09)  
> 337806
> test_unpickle_dataclasses  9.2546 (3.89) 0.0734 (1.0)   
> 437861
> test_unpickle_classes  9.4188 (3.96) 0.1230 (1.68)  
> 437862
> test_load_from_database   91.3482 (38.44)0.6686 (9.10)
> test_unpickle_unsaved_entities   108.6726 (45.73)0.5154 (7.02) 
> 1448169
> test_unpickle_persisted_entities 166.4531 (70.04)0.3787 (5.16) 
> 1697849
> -
>
>
> *Environment*:
> OS: Ubuntu 16.04
> Python: CPython 3.6.3
> SQLAlchemy: 1.3.0
> Database: in-memory SQLite, PostgreSQL (all gives similar results)
> The full benchmark code is in a single attachment file.
> To run: 
> pytest test_ser.py --benchmark-columns=median,stddev --benchmark-disable-
> gc
> Requirements:
> pip install sqlalchemy==1.3.0 pytest==3.6.0 pytest-benchmark==3.1.1 
> dataclasses==0.6
>
> *Questions:*
> 1) What strategy for caching I should consider while using SQLAlchemy? 
> Currently, the only option I see is to have a duplicated declaration of 
> entities in a form of simple classes and use it when I don't need 
> modification. Needles to say, it's a lot of code duplication.
> 2) Is it possible to have a read-only loader in SQLAlchemy? I can imagine 
> mapping loaded data to an entity class but without instrumentation overhead 
> which is not needed to read the data.
> If continue the idea further, I can imagine a situation when I load the 
> entity for modification, and it has a read-only relationship to another 
> entity. So that I don't pay for things I don't use.
>
> Probably I'm making some crazy things and all wrong, feel free to point me 
> out. My ultimate goal is to manage caching with SQLAlchemy and understand 
> SQLAlchemy more in this direction.
> Thank you in advance! 
>
>
> Hi there -
>
> there's no simple answer for this, have you considered writing your own 
> serializer for your objects rather than using pickle's default strategy?  
> that is, if you implement __getstate__ and __setstate__,  you can allow the 
> serialization to not include the "_sa_instance_state" that's associated 
> with the objects which is where you'd get some of the slowness on dumping / 
> loading, it also would save lots of memory / network overhead by not 
> storing lots of data that you can probably forego for storing in your cache.
>
> For bringing the classes back,  you can implement a setstate that returns 
> them as non-ORM classes, or patches them in as ORM classes.
>
> I've altered your test to include only ORM mapped classes, comparing your 
> baseline to one that uses __getstate__, and then several variants which 
> will grant more or less ORM functionality once you have unpickled them.
> the very fastest one which omits any ORM functionality at all, and looks 
> like what you propose, has a mean of 26 ms compared to 380 ms for the most 
> expensive.  
>
> there's one ORM-level guarantee you would want me to add for this most 
> optimized version, that is if you have an ORM object that has no 
> _sa_instance_state at all, you can still access attributes on it.  This 
> works but I would want to add test support so that I keep this as a 
> supported use case.
>
>
>

Re: [sqlalchemy] Slow entities pickling for caching purposes

2019-10-10 Thread Mike Bayer


On Thu, Oct 10, 2019, at 2:48 AM, Александр Егоров wrote:
> Hello!
> 
> I want to cache query results in a distributed cache (Redis), so I need to 
> serialize/deserialize fetched entities very fast.
> However, it turned that SQLAlchemy entities are very heavy for pickle to 
> dump/load. Regular classes and dicts with the same structure takes 
> significantly less amount of time to reconstruct from bytes.
> It's even slower than fetching and mapping the entities from a database.
> 
> Here are benchmark results, loading 1 simple entities:
> 
> *class *Entity(Base):
>     __tablename__ = *'test_entity'**
**    *id = sa.Column(sa.Integer, primary_key=*True*, autoincrement=*True*)
>     field1 = sa.Column(sa.String)
>     field2 = sa.Column(sa.Integer)
> 
> 
> Name (time in ms)                      Median            StdDev           
> Dump bytes
> 
> test_unpickle_dicts                    2.3764 (1.0)      0.0797 (1.09)      
> 337806
> test_unpickle_dataclasses              9.2546 (3.89)     0.0734 (1.0)       
> 437861
> test_unpickle_classes                  9.4188 (3.96)     0.1230 (1.68)      
> 437862
> test_load_from_database               91.3482 (38.44)    0.6686 (9.10)
> test_unpickle_unsaved_entities       108.6726 (45.73)    0.5154 (7.02)     
> 1448169
> test_unpickle_persisted_entities     166.4531 (70.04)    0.3787 (5.16)     
> 1697849
> -
> 
> *Environment*:
> OS: Ubuntu 16.04
> Python: CPython 3.6.3
> SQLAlchemy: 1.3.0
> Database: in-memory SQLite, PostgreSQL (all gives similar results)
> The full benchmark code is in a single attachment file.
> To run: 
> pytest test_ser.py --benchmark-columns=median,stddev --benchmark-disable-gc
> Requirements:
> pip install sqlalchemy==1.3.0 pytest==3.6.0 pytest-benchmark==3.1.1 
> dataclasses==0.6
> 
> *Questions:*
> 1) What strategy for caching I should consider while using SQLAlchemy? 
> Currently, the only option I see is to have a duplicated declaration of 
> entities in a form of simple classes and use it when I don't need 
> modification. Needles to say, it's a lot of code duplication.
> 2) Is it possible to have a read-only loader in SQLAlchemy? I can imagine 
> mapping loaded data to an entity class but without instrumentation overhead 
> which is not needed to read the data.
> If continue the idea further, I can imagine a situation when I load the 
> entity for modification, and it has a read-only relationship to another 
> entity. So that I don't pay for things I don't use.
> 
> Probably I'm making some crazy things and all wrong, feel free to point me 
> out. My ultimate goal is to manage caching with SQLAlchemy and understand 
> SQLAlchemy more in this direction.
> Thank you in advance! 

Hi there -

there's no simple answer for this, have you considered writing your own 
serializer for your objects rather than using pickle's default strategy? that 
is, if you implement __getstate__ and __setstate__, you can allow the 
serialization to not include the "_sa_instance_state" that's associated with 
the objects which is where you'd get some of the slowness on dumping / loading, 
it also would save lots of memory / network overhead by not storing lots of 
data that you can probably forego for storing in your cache.

For bringing the classes back, you can implement a setstate that returns them 
as non-ORM classes, or patches them in as ORM classes.

I've altered your test to include only ORM mapped classes, comparing your 
baseline to one that uses __getstate__, and then several variants which will 
grant more or less ORM functionality once you have unpickled them. the very 
fastest one which omits any ORM functionality at all, and looks like what you 
propose, has a mean of 26 ms compared to 380 ms for the most expensive. 

there's one ORM-level guarantee you would want me to add for this most 
optimized version, that is if you have an ORM object that has no 
_sa_instance_state at all, you can still access attributes on it. This works 
but I would want to add test support so that I keep this as a supported use 
case.








> 

> --
>  SQLAlchemy - 
>  The Python SQL Toolkit and Object Relational Mapper
> 
> http://www.sqlalchemy.org/
> 
>  To post example code, please provide an MCVE: Minimal, Complete, and 
> Verifiable Example. See http://stackoverflow.com/help/mcve for a full 
> description.
>  --- 
>  You received this message because you are subscribed to the Google Groups 
> "sqlalchemy" group.
>  To unsubscribe from this group and stop receiving emails from it, send an 
> email to sqlalchemy+unsubscr...@googlegroups.com.
>  To view this discussion on the web visit 
> https://groups.google.com/d/msgid/sqlalchemy/75aa92b7-3c94-4b5a-afed-062a203792fd%40googlegroups.com
>  
>