-What I'am trying to do:
Bit hard to explain. I've got a table consisting of ip addresses
(ipv4_src), destination addresses (ipv4_dst), and port numbers
(port_dst) and some other irrelevant columns. Ultimately my goal is to
find a linear order in a subset of ports. For example, host A connects
to B on port 1, 2, 3, 4,...20, I would like to find this linear
relation. To achieve this, I have written some mathematical formula's.
Unfortunately there is one downside to my formula: It can be fooled by
adding some extreme values. It cannot find a linear relation in this
list: 1, 2, 3, 4,...20, 45000. Although there are 20 numbers lined up,
the 45000 number ruïns the result of the algorithm. So the query I've
submitted, is ment to remove extreme values. The methods
boxplot(pF.port_dst,LOW) and boxplot(pF.port_dst,HIGH) calculate
in linear time the allowed range of numbers. Extreme values won't be
included in this range. So in the example, the range would be [1,20]
therby omitting the value 45000. Finally I would like to filter my
table with port numbers for every ipv4_srcipv4_dst tuple and
remove all the numbers not fitting in the range.
-In human readable pseudo code this is the query:
SELECT source,dest,port,octets FROM
(
SELECT source,dest,boxplot(port,LOW) AS low,boxplot(port,HIGH) AS
high FROM --Calculate the LOW and HIGH values for each source,dest
pair.
(
SELECT source,dest,port,octets FROM... GROUP BY source,dest,port
--This removes the duplicate entries.
) pF GROUP BY source,dest
) boxplot
(
SELECT source,dest,port,octets FROM... GROUP BY source,dest,port
--This removes the duplicate entries (again!).
) filter
WHERE filter.source=boxplot.source AND filter.dest=boxplot.dest AND
filter.port=boxplot.LOW AND filter.port=boxplot.HIGH --Relate the
tables 'boxplot' and 'filter' to eachother AND select only the
source,dest,port tuples where port is in the range [LOW,HIGH] from
the filter table.
-Here is the original query I would like to optimize again:
SELECT filter.ipv4_src, filter.ipv4_dst, filter.port_dst, filter.octets FROM
(
SELECT pF.ipv4_src, pF.ipv4_dst, boxplot(pF.port_dst,LOW) AS
low,boxplot(pF.port_dst,HIGH) AS high FROM
(
SELECT ipv4_src, ipv4_dst, port_dst, octets FROM ... GROUP BY
ipv4_src, ipv4_dst, port_dst
) pF GROUP BY pF.ipv4_src, pF.ipv4_dst
) boxplot,
(
SELECT ipv4_src, ipv4_dst, port_dst, octets FROM ... GROUP BY
ipv4_src, ipv4_dst, port_dst
)filter
WHERE filter.ipv4_src=boxplot.ipv4_src AND
filter.ipv4_dst=boxplot.ipv4_dst AND filter.port_dst=boxplot.low AND
filter.port_dst=boxplot.low
I'll try the 'EXPLAIN' query tonight, but therefor I've got to create
a small database first to speed up the results.
Thnx in advance for your help.
On Jan 22, 2008 8:15 AM, Sebastian Mendel [EMAIL PROTECTED] wrote:
Joris Kinable schrieb:
Optimize query
I've got one query, which I would like to improve a lot since it takes
very long (24 hours) to execute. Here is the idea:
1. Take the table ipv4_srcipv4_dstport_dst (other rows in this
table are not mentioned for clearity) and remove all duplicate
tuple's. This is done by subquery 'filter'.
2. The same query is performed by the boxplot query, but this time an
aditional group by command is executed, therby calculating a User
Defined Function boxplot(row,type) which returns a double value.
3. Finally the results of the query in step 2 are used to select a
subset of results from the 'filter' table.
4. As you can see, the subquery 'pF' used in step 2 is identical to
the query 'filter'. It's an extreme waste to calculate the same table
twice. I've tried to create a temporary table from filter, but
unfortunately Mysql doesn't allow you to access a temporary table
twice in the same query. I prefer a 1 query answer, instead of
creating views, or temporary tables.
Is there a way to improve this query, therby improving the execution time?
Query:
SELECT filter.ipv4_src, filter.ipv4_dst, filter.port_dst, filter.octets FROM
(
SELECT pF.ipv4_src, pF.ipv4_dst, boxplot(pF.port_dst,LOW) AS
low,boxplot(pF.port_dst,HIGH) AS high FROM
(
SELECT ipv4_src, ipv4_dst, port_dst, octets FROM mytable WHERE
prot=6 GROUP BY ipv4_src, ipv4_dst, port_dst ORDER BY
ipv4_src,ipv4_dst,port_dst ASC
) pF GROUP BY pF.ipv4_src, pF.ipv4_dst HAVING
COUNT(filter.port_dst)10
) boxplot,
(
SELECT ipv4_src, ipv4_dst, port_dst, octets FROM mytable WHERE prot=6
GROUP BY ipv4_src, ipv4_dst, port_dst ORDER BY
ipv4_src,ipv4_dst,port_dst ASC
) filter
WHERE filter.ipv4_src=boxplot.ipv4_src AND
filter.ipv4_dst=boxplot.ipv4_dst AND filter.port_dst=boxplot.low AND
filter.port_dst=boxplot.low
what you are trying to do?
and how about formating your query in a human readable way?
did you tried EXPLAIN?
what type of syntax is this: boxplot(pF.port_dst,LOW) ?
--
Sebastian