Hi folks,

I hope some of you will have time to help me with this. I'm reviving a project that's been on the back burner for a while: access-pattern profiling. The goal is to provide better information to drive optimization of Storm-based applications.

Before I start asking questions, which I will do in separate threads, let me introduce my Storm profiling prototype. If you see anything wrong here, I'd be grateful if you could bring it up on the list so I can improve it!

The bzr branch is here:

    https://code.launchpad.net/~jtv/storm/profile-fetches

And now, a "quick" (ahem) run-through of the design:


= Profiles =

When you look at a Storm-backed object in memory, it got there in one of three ways:

1. It's just been created by the application.
2. It was returned by a database query (think Store.find).
3. The application followed a reference from another object.

Case 3 is the big problem with ORMs: it's far too easy to end up with lots of inefficient small queries. It'd be nice to optimize those away.
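To make case 3 concrete, here is a toy illustration of the pattern the profiler is meant to expose. This is not Storm code; it's a minimal sketch in plain Python that simulates lazy reference loading and counts the queries issued:

```python
# Toy sketch (not Storm): a lazy reference that issues one query per
# access -- the classic "N+1 queries" pattern case 3 leads to.
class FakeStore:
    def __init__(self):
        self.query_count = 0
        self.companies = {1: "Acme", 2: "Globex"}

    def find_people(self):
        self.query_count += 1          # one query for the whole result set
        return [Person(self, cid) for cid in (1, 2, 1)]

    def load_company(self, company_id):
        self.query_count += 1          # one query per reference followed
        return self.companies[company_id]

class Person:
    def __init__(self, store, company_id):
        self._store = store
        self._company_id = company_id

    @property
    def company(self):
        # Case 3: following the reference triggers its own small query.
        return self._store.load_company(self._company_id)

store = FakeStore()
people = store.find_people()           # 1 query (case 2: Store.find)
names = [p.company for p in people]    # 3 more queries, one per object
print(store.query_count)
```

With pre-fetching, those three per-object queries could collapse into one join or one batched lookup; knowing *where* to do that is what the profiler is for.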

So as an application developer you want to know: which reference? Where did that other object come from, all the way back to an object that arrived via case 1 or 2? How many objects were loaded into memory from the database at each step along that trail? Which objects should I pre-fetch, and where? The profiler tries to answer exactly these questions.

The design as it stands only cares when objects first come into the cache. When the application accesses objects that are already in cache — hopefully most of the time! — the profiler doesn't count anything; it just stays out of the way. Maybe we'll want information about cached objects later, but I'd like to see how far we can get with minimal profiling overhead first.


= Contexts =

Profiles are grouped per "context." The application gets to define these. Contexts are named and nestable, much like the functions in a call stack, so if you have a function that may be used in different ways from different places in the application, you can have separate contexts for those uses. Each "call stack" of contexts is profiled separately, so you'll be able to see whether the different uses of your function need separate optimization or not.
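Here is a hypothetical sketch of what nestable, named contexts might look like; the prototype's actual API may well differ. The point is that profiles are keyed by the full "call stack" of context names, so the same function used from two different places is profiled separately:

```python
# Hypothetical sketch of nestable, named profiling contexts.
# The names and API here are illustrative, not the prototype's.
from collections import Counter
from contextlib import contextmanager

_stack = []
profiles = Counter()   # (context, context, ...) -> objects loaded

@contextmanager
def profiling_context(name):
    _stack.append(name)
    try:
        yield
    finally:
        _stack.pop()

def record_load(count=1):
    # Called when objects first come into the cache from the database.
    profiles[tuple(_stack)] += count

with profiling_context("request"):
    with profiling_context("render_sidebar"):
        record_load(5)
with profiling_context("cron"):
    with profiling_context("render_sidebar"):
        record_load(2)

# The two uses of render_sidebar land under distinct keys:
# ("request", "render_sidebar") and ("cron", "render_sidebar").
```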

Each "trail" of accesses is tracked in the context where its _original_ object was first loaded or created. It doesn't matter where the later accesses happen; they are all tracked in the original context. So the profile for a context shows you not the "past" of objects in that context, but their "future" — and how you can improve that future by pre-fetching.
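A minimal sketch of that attribution rule, again with made-up names and a simplified trail key (origin context plus reference name; the real design tracks each step along the trail):

```python
# Hypothetical sketch: each object remembers the context in which it was
# first loaded, and later reference-following is charged to that
# original context, no matter where the access actually happens.
from collections import defaultdict

trails = defaultdict(int)   # (origin_context, reference_name) -> loads

class TrackedObject:
    def __init__(self, origin_context):
        self.origin = origin_context

    def follow(self, reference_name):
        # Charge this load to the trail rooted at the object's origin.
        trails[(self.origin, reference_name)] += 1
        # The trail continues from the same root.
        return TrackedObject(self.origin)

obj = TrackedObject("list_view")   # first loaded in context "list_view"
owner = obj.follow("owner")        # accessed later, possibly elsewhere
team = owner.follow("team")        # still charged to "list_view"
```

This is what makes the profile show an object's "future" rather than its "past": everything downstream of a load accumulates under the context where the trail began.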

There can be multiple free-form queries in one context, and that may be a little confusing. I'd be happy to discuss better ways to do this; I figured that Python call stacks or the file/line location of a query would be too fine-grained to be very useful. Plus, I'd like to support saving and reloading of profiles across program upgrades, to get meaningful long-term profiling data.


= Future =

The profiler is really just the first step in what I hope will be a longer-term project. Here's what I'm hoping we can do over time:
 * Access profiling.
 * Automated optimization suggestions.
 * Long-term data management (load/save, scrubbing).
 * Query-time accounting.
 * Automated dynamic, profile-driven optimization.
 * "Long tail" of optimization tuning.

The ideal end product is something very close to how we use ORMs today, but without the headaches of manual optimization: lost time, abstraction leaks, unreadable code, painful rewrites when circumstances change. I believe we can win back a good portion of the performance we gave up for convenience, and keep the convenience. And I hope you're willing to join me on this adventure!


Jeroen

--
storm mailing list
[email protected]
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/storm