Re: [sqlite] How to optimize a select that gets 17 rows from a 700k row DB (via perl using DBI)?

2007-01-13 Thread Jay Sprenkle

Have you considered running a process that summarizes the data from the table
and just fetching the last summary for display? Will a periodic snapshot work
for your reporting or do you need realtime summarization?
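
To make that concrete, here is a minimal sketch of what such a summary table could look like; the name latest_stats and the exact refresh steps are hypothetical, not something from the original post:

-- one-time setup: an empty table with the same columns as stats (hypothetical name)
CREATE TABLE latest_stats AS SELECT * FROM stats WHERE 0;

-- run by the minutely cron job after it finishes inserting new rows, so the
-- expensive per-machine max(Timestamp) work happens off the page load:
DELETE FROM latest_stats;
INSERT INTO latest_stats
  SELECT a.* FROM stats a,
         (SELECT Machine, max(Timestamp) AS M FROM stats GROUP BY Machine) b
   WHERE a.Machine = b.Machine AND a.Timestamp = b.M;

-- the CGI page then only reads ~17 pre-summarized rows:
SELECT * FROM latest_stats ORDER BY Load, Mem*MemPctFree DESC, Scratch DESC;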


On 1/12/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

Hi all.  The basic issue is that I have a Perl CGI script that shows the
most recent load stats for a bunch of servers we use at our site for long,
computing-intensive programs.  As the number of records in my database has
increased, the page load time has also increased, now to about 10 seconds,
which is unacceptable, IMO.  [...]  The SQLite database is INSERTed into
because I want to keep historical data.

Re: [sqlite] How to optimize a select that gets 17 rows from a 700k row DB (via perl using DBI)?

2007-01-13 Thread Nicolas Williams
On Sat, Jan 13, 2007 at 12:57:43AM -0500, [EMAIL PROTECTED] wrote:
>  The SQLite database 
> is INSERTed into because I want to keep historical data.  The rationale 
> for this is explained later.

For your main application (finding the least loaded machine) you don't
need historical data, so put the historical data in a separate table, or
else add a column that you set to NULL for historical rows and make part
of an index/primary key so that historical data is not indexed.  And since
you have so few "live" rows, you might as well just have two tables; for
the non-historical table, don't even bother with indexes.
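
A rough sketch of that two-table layout, with hypothetical table names (the
original schema has only the single stats table):

-- live table: at most one row per machine, overwritten every minute;
-- the primary key on Machine is what lets INSERT OR REPLACE update in place,
-- and with ~17 rows its index cost is negligible
CREATE TABLE stats_current (
    Machine text not null primary key,
    Load real, Scratch text, Mem int, MemPctFree int,
    Procs int, Users int, Timestamp text not null, Message text
);

-- append-only archive for the historical data; only this table keeps growing
CREATE TABLE stats_history (
    Machine text not null, Load real, Scratch text, Mem int, MemPctFree int,
    Procs int, Users int, Timestamp text not null, Message text
);

-- the cron job writes each observation to both tables, e.g.
--   INSERT OR REPLACE INTO stats_current VALUES (?,?,?,?,?,?,?,?,?);
--   INSERT INTO stats_history            VALUES (?,?,?,?,?,?,?,?,?);

-- the CGI page then scans only the ~17 live rows:
SELECT * FROM stats_current ORDER BY Load, Mem*MemPctFree DESC, Scratch DESC;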

Nico
-- 




[sqlite] How to optimize a select that gets 17 rows from a 700k row DB (via perl using DBI)?

2007-01-12 Thread Jonathan . W . Crane
Hi all.  I will try to post all the data I can think of.  The basic issue is
that I have a Perl CGI script that shows the most recent load stats for a
bunch of servers we use at our site for long, computing-intensive programs.
The environment is all Red Hat Enterprise Linux 4 (2.6.9-42.26.ELsmp #1
SMP i686 i686 i386 GNU/Linux), and the SQLite version is 3.2.2.  As the
number of records in my database has increased, the page load time has also
increased, now to about 10 seconds, which is unacceptable, IMO.  The idea
is to give users an idea of the least loaded machine to help them decide
which one to use for their next program.  How best to do this is, of course,
a complicated issue in and of itself; at my org. there is no other system
in place, so this is how I do it, and I've gotten a big positive response
from users.

The previous implementation used a CSV file, and each iteration of gathering
the data from the servers took long enough that the cron job could only run
every five minutes.  Since upgrading to SQLite, the cron job runs every
minute and, as far as I know, has never failed to complete in under one
minute.  I'd like to focus on optimizing my current implementation first,
and then, if relevant, talk about other, perhaps better ways to do it.
Currently, both the old five-minute CSV version and the current beta
(SQLite) version run in parallel, and the CSV version is still more widely
used since its page actually loads in a reasonable amount of time.  The CSV
data are overwritten each time, so the process of reading in that data is
always the same.  The SQLite database is INSERTed into because I want to
keep historical data; the rationale for this is explained later.
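
(For context, a sketch of the kind of append the minutely cron job presumably
issues for each machine; the actual collection script is not shown in this
post, and the values would be bound from whatever it measures:)

insert into stats (Machine, Load, Scratch, Mem, MemPctFree, Procs, Users, Timestamp, Message)
values (?, ?, ?, ?, ?, ?, ?, ?, ?);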

The current system uses a DB with a table called stats that has more or 
less minutely data on various conditions for ~17 so-called peak machines. 
The table can be described thus:

CREATE TABLE stats (
    Machine text not null,
    Load real,
    Scratch text(4),
    Mem int,
    MemPctFree int,
    Procs int,
    Users int,
    Timestamp text(20) not null,
    Message text
);

My CGI script fetches the most recent observation for each machine with a 
select statement of the form:

select a.*
  from stats a,
       (select Machine, max(Timestamp) as M from stats group by Machine) b
 where a.Machine = b.Machine
   and a.Timestamp = b.M
 order by Load, Mem*MemPctFree desc, Scratch desc;

I floundered about trying to find a select statement that would simply 
give me the right answer; I suspect the main opportunity for optimization 
is in rewriting that statement. 
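
For illustration, the same "latest observation per machine" intent can also
be written as a correlated subquery; this is just an equivalent sketch of one
possible rewrite, and whether it is any faster on SQLite 3.2.2 would have to
be measured:

select s.*
  from stats s
 where s.Timestamp = (select max(Timestamp) from stats where Machine = s.Machine)
 order by Load, Mem*MemPctFree desc, Scratch desc;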

The CGI script uses DBI, and the exact statement is

    my $aref = $dbh->selectall_arrayref($ss2);

where $ss2 is the select statement above.

The idea is that the least loaded machines appear first in the list.  The
usefulness of the stats reported is of course debatable, but again, let's
address that question separately.  Some of the machines have very large
scratch partitions, which makes them particularly suitable for jobs that
generate very large files.

A typical group of rows might look like this:


(Columns: Machine, Load, Scratch space available, Total memory (GB),
Free memory (%), Running processes, User(s), Timestamp, Message; the
Message field is empty in every row shown.)

Machine    Load  Scratch  Mem(GB)  MemFree(%)  Procs  Users  Timestamp
peaklx6    0     21G      12       96          1      1      [EMAIL PROTECTED]:28:35
peaklx1    0     58G      12       88          1      1      [EMAIL PROTECTED]:29:03
peaklx5    0     19G      12       86          1      1      [EMAIL PROTECTED]:28:33
peaklx4    0              12       84          1      4      [EMAIL PROTECTED]:28:31
peaklx2    0     59G      12       65          1      2      [EMAIL PROTECTED]:29:05
peaklx21   0              4        14          1      2      [EMAIL PROTECTED]:29:07
peaklx3    0.1            12       78          1      3      [EMAIL PROTECTED]:28:27
fstlx1     0.2   20G      4        5           1      11     [EMAIL PROTECTED]:28:37
peaklx22   1              4        53          2      0      [EMAIL PROTECTED]:29:09
peaklx24   2              4        58          3      0      [EMAIL PROTECTED]:29:12
peaklx29   2              4        40          3      3      [EMAIL PROTECTED]:28:25
peaklx23   2              4        39          3      0      [EMAIL PROTECTED]:29:11
peaklx30   2              4        29          3      0      [EMAIL PROTECTED]:28:29
peaklx28   2              4        28          3      0      [EMAIL PROTECTED]:28:23
peaklx26   3              12       69          4      0      [EMAIL PROTECTED]:29:17
peaklx25   4              4        21          5      0      [EMAIL PROTECTED]:29:15
peaklx27   4              4        17          5      2      [EMAIL PROTECTED]:28:21

The problem I am having is that the page load times are getting very, very
long.  The table now has about 700,000 rows, and the select statement is
taking about 10 seconds to execute (I benchmarked various parts of the
program and it's not the DB connect or loading the modules or whatnot).
The DB file lives on an NFS-mounted share, but running the same query
locally and interactively takes approximately the same amount of time.
The size of the DB file itself is currently 89 MB.  The web server (and
all the servers in question) are dual-processor 2.8 GHz Xeons with a
minimum of 4 GB of memory, all running the same OS.  Some have
hyperthreading turned on and others don't.

I tried putting a copy of the DB file in the same directory as the script; 
no dice.

I tried timing the subquery and found that the query 'select Machine, 
max(Timestamp) as M from stats group by machine' takes about 2 seconds 
while the entire query takes about 10.  I tried adding several indices to