The best thing to do would be to experiment with various schemas and
see what works best (and maybe read some SQL books, or books on web
log data mining, if you feel you want some background).  The primary
key of your table would be easiest as a surrogate integer key.
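
A minimal sketch of that, using declarative (all names here are just
illustrative):

from sqlalchemy import Column, Integer, String, DateTime
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

# one row per log line; the surrogate "id" sidesteps the
# duplicate-timestamp problem entirely
class LogEntry(Base):
    __tablename__ = 'log_entries'
    id = Column(Integer, primary_key=True)    # surrogate key
    timestamp = Column(DateTime, index=True)
    username = Column(String, index=True)
    ip = Column(String)
    path = Column(String, index=True)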
GROUP BY is provided by the "group_by()" method of "select()" or
"Query".
You also might want to investigate the star schema, which is popular
for this kind of "slicing and dicing":
http://en.wikipedia.org/wiki/Star_schema
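
A star schema here would be one narrow fact table of hits with small
dimension tables hanging off it (reusing the declarative setup above;
again, all names are illustrative):

from sqlalchemy import ForeignKey

class UserDim(Base):
    __tablename__ = 'user_dim'
    id = Column(Integer, primary_key=True)
    username = Column(String, unique=True)

class PageDim(Base):
    __tablename__ = 'page_dim'
    id = Column(Integer, primary_key=True)
    path = Column(String, unique=True)

# the fact table: one row per hit; everything descriptive joins out
class HitFact(Base):
    __tablename__ = 'hit_fact'
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey('user_dim.id'), index=True)
    page_id = Column(Integer, ForeignKey('page_dim.id'), index=True)
    timestamp = Column(DateTime, index=True)

The drill-down views then group the fact table by user_id, month or
page_id, and join the dimensions only for display, which keeps the
aggregates cheap.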


On May 25, 2009, at 12:08 PM, Chris Withers wrote:

>
> I wish I could make pathetic whining noises and puppydog eyes...
>
> If anyone with some relational nous could give me even some vague
> pointers that would be great...
>
> Chris
>
> Chris Withers wrote:
>> Hi All,
>>
>> I'm looking to do (something like) weblog analysis and thought it'd
>> be interesting to try an RDB through SQLAlchemy.
>>
>> Apologies in advance: my knowledge of RDBs and SA is much poorer
>> than I'd like, so simple pointers may help...
>>
>> So, assuming I have data something like:
>>
>> 2009-05-20-00:01:02.345 user1 192.168.1.1  /some/folder/some/path
>> 2009-05-20-00:01:03.452 user1 192.168.1.1  /some/folder/other/path
>> 2009-05-20-00:01:02.345 user2 192.168.1.10 /some/folder/some/path
>>
>> ...what would people recommend for storage?
>>
>> One table mapped to a Row class or some such?
>> What would the primary key be? (since two entries could theoretically
>> happen at exactly the same time, I can't use the date...)
>>
>> I'd like to then aggregate first by user, then by month, giving a
>> UI like:
>>
>> Users
>> =====
>>
>> Username   Total Pages    Last Viewed
>> --------   -----------    -----------
>> user1      32434          27th July 2009 16:05
>> user2      15000          1st Jan 2009 01:05
>> ...lots more users, probably batched...
>>
>> Clicking on a username gives:
>>
>> User1
>> =====
>>
>> Month       Total Pages  Last Viewed
>> ----------  -----------  -----------
>>   May 2009  1000         15th May 2009 13:50
>> April 2009  1000         30th April 2009 23:55
>> ...lots of months...
>>
>> Clicking on a month gives:
>>
>> User1 -> May 2009
>> =================
>>
>> Page                     Hits   Last Viewed
>> -----------------------  -----  -----------
>> /some/folder/some/path   20     15th May 2009 13:50
>> /some/folder/other/path  33     1st May 2009 13:50
>>
>> What queries would I use to generate these views? I'm envisaging
>> some group-bys here, but don't know how I'd plug those into SA.
>> Would I have classes for users, months and pages with selects
>> mapped to them, or something else?
>>
>> How would I batch them such that if there are several thousand rows
>> in a table I can show the old "1 to 20 of 3000" without having to
>> load all 3000 rows?
>>
>> Another option I thought of was to not actually store the raw rows,
>> but have several tables:
>>
>> class Users(Base):
>>     __tablename__ = 'users'
>>     name = Column(String, primary_key=True)
>>     total_pages = Column(Integer)
>>     last_viewed = Column(DateTime)
>>
>> class Month(Base):
>>     __tablename__ = 'months'
>>     username = Column(
>>                 String, ForeignKey('users.name'), primary_key=True
>>                 )
>>     monthname = Column(String, primary_key=True)
>>     total_pages = Column(Integer)
>>     last_viewed = Column(DateTime)
>>
>> class Page(Base):
>>     __tablename__ = 'pages'
>>     __table_args__ = (
>>                 ForeignKeyConstraint(
>>                     ['username', 'monthname'],
>>                     ['months.username', 'months.monthname']),
>>                 {})
>>     username = Column(String, primary_key=True)
>>     monthname = Column(String, primary_key=True)
>>     path = Column(String, primary_key=True)
>>     hits = Column(Integer)
>>     last_viewed = Column(DateTime)
>>
>> ...and then populate them while parsing the logs. I can see this
>> storing a lot less data, but would it be faster? Still, how would I
>> do batching of results?
>> Ideally, I'd prefer not to go down this route as it limits different
>> types of analysis later down the line, but...
>>
>> What do people think? What's the normal/correct way of handling large
>> amounts of data to be sliced and diced like this?
>>
>> cheers,
>>
>> Chris
>>
>>
>
> -- 
> Simplistix - Content Management, Zope & Python Consulting
>            - http://www.simplistix.co.uk
>

