Re: [sqlalchemy] Improving performance for ORM queries: skip IdentityMap / create detached instances

Theo Nonsense Mon, 10 Dec 2012 15:33:42 -0800

Thanks for the detailed reply!  Comments inline below.

On Friday, December 7, 2012 12:45:42 PM UTC-8, Michael Bayer wrote:
>
>
> On Dec 7, 2012, at 3:01 PM, Theo Nonsense wrote:
>
> I like using the ORM to query and have results translated to objects.  I'm 
> currently using declarative for mapping and I'm trying to figure out a good 
> way to ignore the overhead of the IdentityMap and other ORM niceties when 
> they're not needed.  Specifically, when dealing with a relatively low 
> number of results (~30k), the overhead of mapper._instance(), 
> identity.add(), mapper.populate_state(), and other functions that generally 
> keep track of the state / changes of an object is out weighing the benefit 
> for some cases.  In circumstances where the results are retrieved, but 
> never need a database connection after that point, I'd like to be able to 
> avoid that overhead.  In other words, I'd like to have simple objects with 
> business logic that are used throughout the codebase, but I also want to 
> use the ORM wherever it is not causing a performance bottleneck.  Something 
> like:
>
> Session.query(User).options(instrument_results=False, 
> create_detached=True).all()
>
> This way whenever there is some location where the results do not need to 
> be tracked, I can just specify it in the query and not incur the cost of 
> the tracking.  This is just an example.  I'm looking for ways to achieve 
> the same goal.  They don't necessarily have to be args to .options().
>
> - The first option is to use the ORM to build the queries, but issue them 
> directly through the Core / Session.execute().  The results would then be 
> translated to objects manually, which is the first dislike with this 
> approach.  The ORM already knows how to create the objects.  Also, I'd like 
> to use the same objects as the ORM ones so that the business logic can all 
> live in the same place.  However, creating the ORM objects means that 
> they'll be instrumented, which I'd like to avoid.
>
> Also, I've seen a few other posts here and on StackExchange regarding the 
> notion of read-only or long-lived objects, but none seem to be what I'm 
> looking for.
>
> - Custom Mapper / ClassManager / Instrumentation manager for immutable 
> domain models - https://github.com/andreypopp/saimmutable
>
> This approach is interesting, but doesn't seem to allow toggling for 
> queries / loads that are performance bottlenecks.  I'd like to be able to 
> only enable the quicker / simpler path when needed.  I suppose I could have 
> a table mapped to two different classes through different mappers.  One 
> mapper is the default one and one that ignores instrumentation.  It might 
> be possible to make one a sub-class of the other or provide a mixin for 
> business logic.  Or it might be possible to have one class mapped through 
> two different mappers?  Even if this all worked, it would mean multiple 
> classes for each table and doesn't avoid the overhead of the IdentityMap.
>
> - Detaching the instances from the session - 
> https://groups.google.com/forum/?fromgroups=#!searchin/sqlalchemy/detached/sqlalchemy/8rFy5JGGfeo/IN28lfg-Je8J
>
> This approach incurs the cost of the IdentityMap and Instrumentation when 
> translating the results.  By the time expunge() can be called, it is too 
> late.
>
>


> Well there's a bit of a contradiction here, you're saying, you don't want 
> the identity map or mapper._instance() or any of that, but then you're 
> saying, "the ORM already knows how to create the objects".    I'd advise a 
> deep dive into the mechanics to learn intimately how that all works.  In 
> particular, the identity map is extremely central to how relationship 
> loading works, both eager loading where such a construct is required, as 
> well as lazy loading, where it provides a critical performance boost by 
> allowing objects that are already present to be used without any SQL, or at 
> least without being loaded redundantly.  Instrumentation is required for 
> lazy loading - without lazy loading, you'd need to ensure that all queries 
> occur up front for all attributes.
>

I spoke too broadly when mentioning parts that I'd like to avoid. 
 Mapper._instance() does a lot of work and I'd only like to have a flag to 
ignore some of that work, some of the time.  Specifically, I'm looking at 
some timing profiles and noticing that there is a significant amount of 
overall time spent in the following (decreasing order of time spent):

sqlalchemy/orm/identity.py:119(add)
sqlalchemy/orm/mapper.py:1996(_instance)
sqlalchemy/orm/state.py:42(__init__)
sqlalchemy/orm/mapper.py:1953(populate_state)
sqlalchemy/orm/attributes.py:160(__get__)
sqlalchemy/orm/state.py:401(commit_all)
sqlalchemy/orm/instrumentation.py:278(new_instance)
sqlalchemy/orm/identity.py:140(get)
...

The time for each call seems acceptable for a lower quantity of results 
given the ease and features of the ORM.  When the quantity gets larger, 
around 30k for the models I'm dealing with, the overall time is not always 
worth the extra time.

I should've also mentioned that eager loading all data that will be needed 
is acceptable and what I'm currently doing via the Core.  When using both 
ORM and Core, I have a more complex set of models.  Some base models that 
contain the business logic and the ORM models for persistence, which are 
sub-classes / mixins of the base models.  It seems unnecessary to maintain 
all of that, if I'm able to inform the ORM to create the base models in the 
cases where performance is needed.
 

> If you don't like the performance hit of identity map, less effort would 
> be, contribute one for us written in C.   Or see if pypy can help.   
>

Good point.  Both of these options should increase overall performance and 
not just for the case I'm talking about.  They may be enough to address the 
performance impact I'm concerned about.  However, I believe there is still 
a use case here and am interested in a solution.
 

> I would note that Query can load individual columns, where you do get to 
> skip all the overhead of object loads, and you get back a named-tuple-like 
> object.  So if you don't care about relationship loading and just want 
> tuple-like objects, that mechanism is there right now, and it wouldn't be 
> much effort at all to add a helper that expands a given mapped object into 
> it's individual per-column attributes.
>

I don't think tuple results are the right approach here.  It would seem 
that I'd need some tuple->object conversion if I'd want to be able to use 
the objects throughout the rest of the application, along with their 
attached business logic.  As soon as I create those objects, I'd incur the 
instrumentation hit.
 

> It's really relationships that require a lot of the complexity to loading. 
>     Other ORMs have the approach where a relationship attribute basically 
> lazy-loads the related collection every time.   SQLAlchemy's approach saves 
> on SQL as an already-loaded object caches its related collections and 
> object references.
>

In a lot of circumstances, the ORM is wonderful.  In the cases where it 
isn't quite as performant, I'd like to be able to skip the pieces that 
won't be needed.

I'm not familiar enough to discuss the internals of _instance() and the 
IdentityMap.  That's the next item to investigate.  I'm hoping it is 
possible to create a custom Mapper / MapperProperty that can be dynamically 
changed on a per query invocation.

Example:

Base = declarative_base()

class Article(Base):
    __tablename__ = 'articles'
    article_id = Column(Integer, primary_key=True)
    article_title = Column(String)
    
    def __init__(self, article_id=None, article_title=None):
        self.article_id = article_id
        self.article_title = article_title
    
    def do_interesting(self):
        # do some interesting business logic here
        return
    
class Tag(Base):
    __tablename__ = 'tags'
    tag_id = Column(Integer, primary_key=True)
    tag_name = Column(String, unique=True)
    
    articles = relationship('Article', backref='tags')

# querying using ORM
###################################
articles = session.query(Article).filter(Article.article_title=='a cool 
title').all()

# querying using Core from ORM -- avoids IdentityMap, but not 
instrumentation
###################################
q_tbl = Article.__table__
query = q_table.select(q_table.c.article_title=='a cool title')
results = session.execute(query)

objs = []
for row in results:
    # custom code to create an Article from a row.
    # incurs penalty of instrumentation
    obj = Article(row[q_table.c.article_id], row[q_table.c.article_title])
    objs.append(obj)

# to avoid instrumentation and IdentityMap, we'd need unmapped classes
###################################

# unmapped article with business logic... i.e. do_interesting()
class SimpleArticle(object):
    def __init__(self, article_id=None, article_title=None):
        self.article_id = article_id
        self.article_title = article_title
    
    def do_interesting(self):
        # do some interesting business logic here
        return

# mapped article - mixin SimpleArticle
class Article(SimpleArticle, Base):
    __tablename__ = 'articles'
    article_id = Column(Integer, primary_key=True)
    article_title = Column(String)

# unmapped tag
class SimpleTag(object):
    def __init__(self, tag_id=None, article_title=None):
        self.tag_id = tag_id
        self.article_title = article_title

# mapped tag
class Tag(SimpleTag, Base):
    __tablename__ = 'tags'
    tag_id = Column(Integer, primary_key=True)
    tag_name = Column(String, unique=True)
    
    articles = relationship('Article', backref='tags')

# query...
q_tbl = Article.__table__
query = q_table.select(q_table.c.article_title=='a cool title')
results = session.execute(query)

objs = []
for row in results:
    # custom code to create a SimpleArticle from a row.
    obj = SimpleArticle(row[q_table.c.article_id], 
row[q_table.c.article_title])
    objs.append(obj)


# desired query using ORM with flag
###################################
query = session.query(Article).filter(Article.article_title=='a cool title')
query = query.options(instrument=False, detached=True)
articles = query.all()

# end example

Notice how the last one is almost identical to the normal ORM case with the 
added .options().  No extra unmapped classes to define or deal with.  No 
need to drop to Core and write the row conversions for each query.

-- 
You received this message because you are subscribed to the Google Groups 
"sqlalchemy" group.
To view this discussion on the web visit 
https://groups.google.com/d/msg/sqlalchemy/-/q3-tiXt75EoJ.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/sqlalchemy?hl=en.

Re: [sqlalchemy] Improving performance for ORM queries: skip IdentityMap / create detached instances

Reply via email to