Re: [sqlite] Sqlite optim

2009-02-24 Thread Jérôme Loyet
> On Monday 23 February 2009 13:09:58 Jérôme Loyet wrote:
>> My questions:
>> 1- Is there a better way to populate the `siblings` table? (optimize
>> my SQL query)
>
> You may use a composite index on (cookie, referer, date) and the REAL
> datatype for dates.

I chose INTEGER for the dates since everything is stored as Unix
timestamps, so it's just a matter of comparing integers. But maybe I'm wrong.
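
(For reference, the composite index suggested above would look
something like the statement below; the index name is my own invention:)

CREATE INDEX IF NOT EXISTS idx_requests_cookie_referer_date  -- hypothetical name
  ON requests (cookie, referer, date);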

>
>> 2- What can I do to optimize the whole process? (play with some PRAGMA
>> parameters maybe)
>
> pragma page_size=4096;
> pragma default_cache_size=200000;
> pragma cache_size=200000;
>
> (the cache with these settings is 200 000 * 4 KB = 800 MB).
>
>> 3- Is SQLite a good choice for this case? Would MySQL or PostgreSQL be
>> a better choice?
>
> SQLite is a good choice, but a streaming data-flow parser is better than
> SQL queries for real-time analysis. You could write a daemon that prepares
> the data and loads it into the database periodically inside transactions.

I'm not planning to do real-time analysis. I want to compute statistics
afterwards (over several days or months) to study user behaviour on the
website.
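
(For what it's worth, the batching suggested above boils down to the
sketch below; the column values are made up for illustration:)

BEGIN TRANSACTION;
-- group many rows per transaction so the fsync cost is paid once per batch
INSERT INTO requests (date, request, method, code, referer, response_time, vhost, cookie)
  VALUES (1234967398, '/index.html', 'GET', 200, NULL, 12, 'www.example.com', 1);
-- ... thousands more INSERT statements, all illustrative ...
COMMIT;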

>
> Best regards.

Thx a lot
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] Is SQLite a good choice in my case? And how can I optimize my SQL queries?

2009-02-23 Thread Jérôme Loyet
Hello everyone,

I'm doing some web log analysis. I decided to use an SQLite database
because the application has to be easily portable to another
workstation or server and there is absolutely no need for multiple
clients. Moreover, I don't want to spend my time administering the
database.

What I want to do is quite simple. I want to parse the logs from a
high-traffic Apache website (load-balanced across 8 front servers) and
load them into a database. That's the easy part. In the Apache
configuration I enabled mod_usertrack so that I can follow each user
through the logs.

Then I want to create parent/child relations between the requests of a
single user (identified by mod_usertrack). The final goal is to see all
users; then, for each user, the requests issued by a click on a web
page (manual human interaction --> parent); and finally, for each
parent request (click), all the sub-requests issued by the parent page
(automatic fetches of images, CSS, JavaScript, ...).

The overall aim is to analyse the behaviour of visitors on our
website, extract different populations of users, and compute
statistics.

Here are the tables I use:

CREATE TABLE IF NOT EXISTS requests (
  id integer primary key autoincrement,
  date integer not null,            -- unix timestamp
  request varchar not null,         -- requested URL
  method varchar not null,
  code integer not null,            -- HTTP status code
  referer varchar,
  response_time integer not null,
  vhost varchar DEFAULT NULL,
  cookie integer not null REFERENCES cookies(id));

CREATE TABLE IF NOT EXISTS cookies (
  id integer primary key autoincrement,
  cookie varchar not null UNIQUE);

CREATE TABLE IF NOT EXISTS siblings (
  parent integer not null REFERENCES requests(id),
  child integer not null REFERENCES requests(id));
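
To illustrate how the tables fit together, a query like the one below
(the parent id 42 is just a placeholder) lists the sub-requests
attached to one click:

SELECT c.date, c.request
FROM siblings s
JOIN requests c ON c.id = s.child
WHERE s.parent = 42;  -- 42 is a made-up parent request id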

I ran a test on two hours of logs from the 18th of February (from
10:00 AM to 11:59 AM).

The first step is to parse the logs in order to populate the
`requests` and `cookies` tables. It took about 2 minutes to parse and
insert 238,304 cookies and 1,163,728 requests. After this step the
database is about 317 MB.

Then I have to populate the `siblings` table. I use the following query:

INSERT INTO siblings (parent, child)
SELECT a.id, b.id
FROM requests a
INNER JOIN requests b ON a.cookie = b.cookie  -- same user
WHERE a.request = b.referer                   -- child's referer is the parent's URL
  AND b.date BETWEEN a.date AND a.date + 20;  -- child fetched within 20 seconds

It took about 3 hours and 30 minutes. At the end there are 7,158,001
rows in the `siblings` table and the database is about 600 MB.
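
(One quick way to see whether SQLite can use an index for this join,
instead of re-scanning `requests` for every row, is the built-in
EXPLAIN QUERY PLAN:)

EXPLAIN QUERY PLAN
SELECT a.id, b.id
FROM requests a
INNER JOIN requests b ON a.cookie = b.cookie
WHERE a.request = b.referer
  AND b.date BETWEEN a.date AND a.date + 20;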

My questions:
1- Is there a better way to populate the `siblings` table? (optimize
my SQL query)
2- What can I do to optimize the whole process? (play with some PRAGMA
parameters maybe)
3- Is SQLite a good choice for this case? Would MySQL or PostgreSQL be
a better choice?

Thanks in advance for your time.

++ Jerome

