[sqlalchemy] Slow entities pickling for caching purposes

2019-10-10 Thread Александр Егоров
Hello!

I want to cache query results in a distributed cache (Redis), so I need to 
serialize/deserialize fetched entities very fast.
However, it turned out that SQLAlchemy entities are very heavy for pickle to 
dump/load. Regular classes and dicts with the same structure take 
significantly less time to reconstruct from bytes.
Unpickling entities is even slower than fetching and mapping them from a database.
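
For context, the intended cache round trip looks roughly like this (a sketch 
only; the redis-py usage and the key name are illustrative, and Entity/session 
refer to the model and fixture from the attached benchmark):

import pickle

import redis  # assumes the redis-py package


r = redis.Redis()


def get_entities_cached(session, key='entities:all'):
    # Try the cache first; on a miss, query the database and store the pickle.
    payload = r.get(key)
    if payload is not None:
        return pickle.loads(payload)
    entities = session.query(Entity).all()
    r.set(key, pickle.dumps(entities))
    return entities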

Here are benchmark results, loading 1 simple entities:

class Entity(Base):
    __tablename__ = 'test_entity'

    id = sa.Column(sa.Integer, primary_key=True, autoincrement=True)
    field1 = sa.Column(sa.String)
    field2 = sa.Column(sa.Integer)



Name (time in ms)                     Median            StdDev         Dump bytes
----------------------------------------------------------------------------------
test_unpickle_dicts                   2.3764 (1.0)      0.0797 (1.09)      337806
test_unpickle_dataclasses             9.2546 (3.89)     0.0734 (1.0)       437861
test_unpickle_classes                 9.4188 (3.96)     0.1230 (1.68)      437862
test_load_from_database              91.3482 (38.44)    0.6686 (9.10)
test_unpickle_unsaved_entities      108.6726 (45.73)    0.5154 (7.02)     1448169
test_unpickle_persisted_entities    166.4531 (70.04)    0.3787 (5.16)     1697849
----------------------------------------------------------------------------------


*Environment*:
OS: Ubuntu 16.04
Python: CPython 3.6.3
SQLAlchemy: 1.3.0
Database: in-memory SQLite, PostgreSQL (both give similar results)
The full benchmark code is in a single attachment file.
To run: 
pytest test_ser.py --benchmark-columns=median,stddev --benchmark-disable-gc
Requirements: 
pip install sqlalchemy==1.3.0 pytest==3.6.0 pytest-benchmark==3.1.1 
dataclasses==0.6

*Questions:*
1) What caching strategy should I consider while using SQLAlchemy? 
Currently, the only option I see is to keep a duplicate declaration of the 
entities in the form of simple classes and use those when I don't need 
modification. Needless to say, that's a lot of code duplication.
2) Is it possible to have a read-only loader in SQLAlchemy? I can imagine 
mapping loaded data onto an entity class but without the instrumentation 
overhead, which is not needed just to read the data (see the sketch after 
these questions).
Taking the idea further, I can imagine a situation where I load an entity 
for modification, and it has a read-only relationship to another entity, 
so that I don't pay for things I don't use.
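
For illustration of what I mean in (2), the closest existing approach I've 
found is selecting columns rather than entities, which returns plain named 
tuples with no per-instance instrumentation (a sketch using the Entity model 
above):

# Column-based queries return lightweight named tuples instead of
# instrumented Entity instances; fine for read-only access.
rows = session.query(Entity.id, Entity.field1, Entity.field2).all()
first = rows[0]
print(first.id, first.field1, first.field2)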

Probably I'm doing something crazy and have it all wrong; feel free to 
point that out. My ultimate goal is to manage caching with SQLAlchemy and 
to understand SQLAlchemy better in this area.
Thank you in advance! 

import pytest
from dataclasses import dataclass
from sqlalchemy.ext.declarative import declarative_base

import pickle
import sqlalchemy as sa

Base = declarative_base()


class Entity(Base):
    __tablename__ = 'test_entity'

    id = sa.Column(sa.Integer, primary_key=True, autoincrement=True)

    field1 = sa.Column(sa.String)
    field2 = sa.Column(sa.Integer)

    def __init__(self, id_: int, field1: str, field2: int):
        self.id = id_
        self.field1 = field1
        self.field2 = field2

    def some_method(self) -> str:
        return self.field1 + str(self.field2)


@pytest.fixture(scope='session')
def dbengine():
    engine = sa.create_engine('sqlite://')
    Base.metadata.create_all(engine)
    yield engine


@pytest.fixture(scope='function')
def session(dbengine):
    session_factory = sa.orm.sessionmaker(bind=dbengine)
    Session = sa.orm.scoped_session(session_factory)
    yield Session
    Session.query(Entity).delete()
    Session.remove()


@dataclass
class EntityData:
    id: int
    field1: str
    field2: int

    def some_method(self) -> str:
        return self.field1 + str(self.field2)


class EntityClass:
    def __init__(self, id_: int, field1: str, field2: int):
        self.id = id_
        self.field1 = field1
        self.field2 = field2

    def some_method(self) -> str:
        return self.field1 + str(self.field2)


def generate_data():
    for i in range(1):
        yield i + 1, str(i), i * 2


def test_unpickle_unsaved_entities(benchmark):
    # Pickle entities that were never attached to a Session, then time
    # how long pickle.loads takes to reconstruct them.
    entities = [Entity(*data) for data in generate_data()]
    dump_unsaved_entities = pickle.dumps(entities)
    print(f'\nDump unsaved entities: {len(dump_unsaved_entities)}')
    benchmark(pickle.loads, dump_unsaved_entities)

Re: [sqlalchemy] Slow entities pickling for caching purposes

2019-10-11 Thread Александр Егоров
Thank you very much for your answers!

Some more questions:
1) I see you use sa.inspect(self).dict in __getstate__. Is it safe to 
replace it with self.__dict__? I didn't find any difference in the result, 
but __dict__ works 30 times faster.
2) I'm going to implement __getstate__ and __setstate__ on a Base class (a 
sketch of what I have in mind follows below). Is that a risky idea from 
your point of view?
3) We've greatly optimized unpickling for read-only entities. Is it 
possible to do the same when loading from the database? When I read data, 
I don't need ORM state, because I'm not going to modify the data and don't 
want lazy loads.
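
A minimal sketch of the Base-class idea from question 2, assuming only plain 
column values need to survive the round trip (the PickleMixin name is my own):

from sqlalchemy.ext.declarative import declarative_base


class PickleMixin:
    def __getstate__(self):
        # Copy the instance dict and drop the ORM bookkeeping entry so it
        # is never serialized.
        d = self.__dict__.copy()
        d.pop('_sa_instance_state', None)
        return d

    def __setstate__(self, state):
        # Restore plain attribute values only. Column attributes stay
        # readable because instrumented attributes return values found in
        # __dict__, but lazy loads and Session operations won't work on
        # such objects.
        self.__dict__.update(state)


# declarative_base() accepts a custom base class via cls=...
Base = declarative_base(cls=PickleMixin)

With this in place, pickle.dumps(session.query(Entity).all()) would store only 
the mapped attribute values, and the unpickled objects behave as read-only 
records.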

On Friday, 11 October 2019 00:34:03 UTC+7, Mike Bayer wrote:
>
> On Thu, Oct 10, 2019, at 2:48 AM, Александр Егоров wrote:
>
> Hi there -
>
> there's no simple answer for this. Have you considered writing your own 
> serializer for your objects rather than using pickle's default strategy? 
> That is, if you implement __getstate__ and __setstate__, you can keep the 
> serialization from including the "_sa_instance_state" that's associated 
> with the objects, which is where some of the slowness on dumping / 
> loading comes from. It also saves a lot of memory / network overhead by 
> not storing data that you can probably forego in your cache.
>
> For bringing the classes back, you can implement a __setstate__ that 
> restores them as non-ORM objects, or patches them in as ORM classes.
>
> I've altered your test to include only ORM-mapped classes, comparing your 
> baseline to one that uses __getstate__, and then several variants which 
> grant more or less ORM functionality once you have unpickled them. The 
> very fastest one, which omits any ORM functionality at all and looks like 
> what you propose, has a mean of 26 ms, compared to 380 ms for the most 
> expensive.
>
> there's one ORM-l