AFAICT, Christian's problem is that an SQL database would not be such a
good fit, due to the "time travel" requirement. IIUC, he has to look
up records as they were in the past, including whatever fields they
had in the past, even if those fields are no longer part of the schema
for the current data. To do this in SQL:

 * EITHER he creates a new SQL database (or a new set of tables in the
same database) for each new revision of the incoming information. Each
database (or set of tables) would then be free to have its own schema,
but the data that hasn't changed between revisions would be heavily
duplicated across databases,

 * OR he'll have to build a time-travel superstructure on top of his
one database [1], adding time-range columns to each table indicating
the validity of each record, and accepting lots of near-duplicate
records differing only in the time range and a few fields. Not to
mention that the schema for these tables would have to contain the
union of all columns that were ever valid at any point in time, with
lots of NULLs in those columns.

[1] http://en.wikipedia.org/wiki/Temporal_database
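To make the second option concrete, here is a minimal sketch of such a
time-travel superstructure using Python's built-in sqlite3. The table
and column names (a `drug` table with a `pzn` key and a `price`) are
invented for illustration; Christian's actual schema is of course
different:

```python
import sqlite3

# One table with validity-range columns; '9999-12-31' marks the
# currently valid row version (exclusive upper bound).
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE drug (
        pzn        TEXT,   -- primary key of the original record
        price      REAL,
        valid_from TEXT,   -- ISO date: start of validity
        valid_to   TEXT    -- ISO date: end of validity (exclusive)
    )
""")
con.executemany(
    "INSERT INTO drug VALUES (?, ?, ?, ?)",
    [
        ("04711", 9.99,  "2013-01-01", "2013-07-01"),  # old price
        ("04711", 10.49, "2013-07-01", "9999-12-31"),  # current price
    ],
)

def price_on(date):
    """'Time travel': fetch the price as it was on a given date."""
    row = con.execute(
        "SELECT price FROM drug WHERE pzn = ? "
        "AND valid_from <= ? AND ? < valid_to",
        ("04711", date, date),
    ).fetchone()
    return row[0]

print(price_on("2013-03-15"))  # 9.99
print(price_on("2013-08-01"))  # 10.49
```

Note that even this small example duplicates the whole record for a
one-field price change, which is exactly the redundancy Christian
wants to avoid.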

So, I believe Christian is considering the flexible (or rather,
non-existent) schema of ZODB (and perhaps its built-in time-travel
capabilities) a pretty good fit for his problem.

But he seems to be worried about the data volume and its impact on
performance. He's also wondering how best to design the storage of
this data in ZODB, taking into account that the schema changes
frequently.

If (as he indicates) he stores the data as tuples in BTrees (one BTree
per "table", keyed by the primary key of the original table), he'll be
forced to rewrite all the tuples of each BTree (table) whose schema
changes, which could mean almost as much duplication as the "one SQL
database per revision" case.
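A tiny sketch of why the row-per-tuple layout is expensive. Plain
dicts stand in for BTrees.OOBTree here (same mapping interface); the
record fields are invented:

```python
# One "table": primary key -> record tuple (dict standing in for
# a BTrees.OOBTree; the point is the layout, not the container).
table = {
    "04711": ("Aspirin", 9.99),
    "08150": ("Ibuprofen", 4.50),
}

# Schema change: a new 'pack_size' column is added. Every tuple in
# the BTree must be rewritten, even for records whose data did not
# change at all.
table = {key: rec + (None,) for key, rec in table.items()}

print(table["04711"])  # ('Aspirin', 9.99, None)
```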

On the other hand, he seems to speculate that perhaps he could store
one BTree per table COLUMN (per revision?), keyed by the primary key
of the original table. This way, each new incoming data revision would
only need to touch the data that actually changed, and schema changes
would mean the deletion or addition of entire BTrees, without having
to touch the unchanged data.
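That column-store idea, again sketched with plain dicts standing in
for one BTrees.OOBTree per column (column names invented):

```python
# One mapping per column, all keyed by the original primary key.
# In ZODB each inner mapping would be its own BTrees.OOBTree.
columns = {
    "name":  {"04711": "Aspirin", "08150": "Ibuprofen"},
    "price": {"04711": 9.99,      "08150": 4.50},
}

# Incremental update: only the cells that actually changed are
# touched; everything else stays untouched on disk.
columns["price"]["04711"] = 10.49

# Schema change: dropping or adding a column is one whole-BTree
# operation; the unchanged columns are never rewritten.
del columns["name"]
columns["pack_size"] = {}

print(sorted(columns))  # ['pack_size', 'price']
```

The trade-off is that reassembling a full row then costs one lookup
per column, but for mostly-static data with frequent schema changes
that may be a good deal.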

Cheers,

Leo


On Sun, Aug 18, 2013 at 3:07 PM, Claudiu Saftoiu <csaft...@gmail.com> wrote:
> I wonder, if you have a problem which an SQL database would be so good for
> that you're mimicking an SQL database with ZODB, why not just use an SQL
> database? It doesn't sound like you'll gain much from being able to persist
> objects, which is one of the main reasons to use an object database...
>
>
> On Aug 18, 2013, at 12:17 PM, Christian Tismer <tis...@stackless.com> wrote:
>
>> On 18.08.13 17:09, Jim Fulton wrote:
>>> On Fri, Aug 16, 2013 at 11:49 PM, Christian Tismer <tis...@stackless.com> 
>>> wrote:
>>> <snip>
>>
>> Explaining very concisely, now.
>>
>>> I don't think I/we understand your problem well enough to answer. If data
>>> has a very low shelf life, then replacing it frequently might make sense.
>>> If the schema changes that frequently, I'd ask why. If this is a data
>>> analysis application, you might be better served by tools designed for that.
>>>> Is Python still the way to go, or should I stop this and use something like
>>>> PostgreSQL? (And I doubt that this would give a benefit, actually).
>>> Ditto,
>>>
>>>> Would you implement a column store, and how would you do that?
>>> Ditto.
>>>
>>>> Right now, everything gets too large, and I'm quite desperate. Therefore,
>>>> I'm
>>>> asking the master, which you definitely are!
>>> "large" can mean many things. The examples you give don't
>>> seem very large in terms of storage, at least not for ZODB.
>>>
>>> Beyond that there are lots of dimensions of scale that ZODB
>>> doesn't handle well (e.g. large transaction rates, very
>>> high availability).
>>>
>>> It's really hard to make specific recommendations without
>>> knowing more about the problem. (And it's likely that someone
>>> wouldn't be able to spend the time necessary to learn more
>>> about the problem without a stake in it. IOW, don't assume I'll
>>> read a much longer post getting into details. :)
>>>
>>
>> Ok, just the sketch of it to make things clearer, don't waste time on this 
>> ;-)
>>
>> We get a medication prescription database in a certain serialized format
>> which is standard in Germany for all pharmacy support companies.
>>
>> This database comes in ~25 files == tables in a zip file every two weeks.
>> The DB is actually a structured set of SQL tables with references et al.
>>
>> I actually did not want to change the design and simply recreated the table
>> structure that they have, using ZODB, with tables as BTrees that contain
>> tuples for the records. So this is basically the SQL model, mimicked in ZODB.
>>
>> What is bothersome is that the database gets incremental updates all
>> the time:
>> changed prices, packing info, etc.
>> We need to cope with millions of prescriptions that come from certain dates
>> and therefore need to query different versions of the database.
>>
>> I just hate the huge redundancy that these database versions would have
>> and tried to find a way to put this all into a single ZODB with a way to
>> time-travel to every version.
>>
>> The weird thing is that the DB also changes its structure over time:
>>
>> - new fields are added, old fields dropped.
>>
>> That's the reason why I thought to store the tables by column, with each
>> column being
>> a BTree of its own. Is that feasible at all?
>>
>> Of the 25 tables, there are 4 quite large, like
>> 4 tables x 500,000 rows x 100 columns,
>> == 200,000,000 cells in one database.
>>
>> With a btree bucket size of ~60, this gives ~ 3,333,333 buckets.
>> With multiple versions, this will be even more.
>>
>> -- Can Zodb handle so many objects and still open the db fast?
>> -- Or will the huge index kill performance?
>>
>> That's all I'm asking before doing another experiment ;-)
>>
>> but don't waste time, just telling you the story -- chris
>>
>> --
>> Christian Tismer             :^)   <mailto:tis...@stackless.com>
>> Software Consulting          :     Have a break! Take a ride on Python's
>> Karl-Liebknecht-Str. 121     :    *Starship* http://starship.python.net/
>> 14482 Potsdam                :     PGP key -> http://pgp.uni-mainz.de
>> phone +49 173 24 18 776  fax +49 (30) 700143-0023
>> PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
>>      whom do you want to sponsor today?   http://www.stackless.com/
>>
>> _______________________________________________
>> For more information about ZODB, see http://zodb.org/
>>
>> ZODB-Dev mailing list  -  ZODB-Dev@zope.org
>> https://mail.zope.org/mailman/listinfo/zodb-dev