Strange result of Group by on real values

2008-01-30 Thread Joris Kinable
Evening,

Could someone explain the result of the Query 2? Query 2 does a GROUP
BY avgOctets, but still there are duplicate avgOctets in my result
set! Boxplot(octets,AVG,0) is a UDF which returns REAL values. In
this case, the real values have no decimals.
The only reasonable explanation I could think of is that the REAL
value returned by boxplot(...) is an estimation and therefore should
be casted to an INTEGER before comparison by GROUP BY is possible?


Inner Query: SELECT boxplot(octets,AVG,0) AS avgOctets FROM
joris_filter GROUP BY ipv4_src, ipv4_dst

Result:

avgOctets
46
46
46
46
46
46
46
40
46
46
46
144
48
48



Query 2: (With inner query as inner query):

SELECT avgOctets, COUNT(*) AS frequency FROM
  (
SELECT boxplot(octets,AVG,0) AS avgOctets FROM joris_filter
GROUP BY ipv4_src, ipv4_dst
) avarages
GROUP BY avgOctets ORDER BY avgOctets;

Result:

avgOctetsfrequency
40   350026
41   1
41   1
41   6
41   2
41   3
41   1
42   1
...

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



Re: Query optimization

2008-01-22 Thread Joris Kinable
-What I'am trying to do:
Bit hard to explain. I've got a table consisting of ip addresses
(ipv4_src), destination addresses (ipv4_dst), and port numbers
(port_dst) and some other irrelevant columns. Ultimately my goal is to
find a linear order in a subset of ports. For example, host A connects
to B on port 1, 2, 3, 4,...20, I would like to find this linear
relation. To achieve this, I have written some mathematical formula's.
Unfortunately there is one downside to my formula: It can be fooled by
adding some extreme values. It cannot find a linear relation in this
list: 1, 2, 3, 4,...20, 45000. Although there are 20 numbers lined up,
the 45000 number ruïns the result of the algorithm. So the query I've
submitted, is ment to remove extreme values. The methods
boxplot(pF.port_dst,LOW) and boxplot(pF.port_dst,HIGH) calculate
in linear time the allowed range of numbers. Extreme values won't be
included in this range. So in the example, the range would be [1,20]
therby omitting the value 45000. Finally I would like to filter my
table with port numbers for every ipv4_srcipv4_dst tuple and
remove all the numbers not fitting in the range.

-In human readable pseudo code this is the query:

SELECT source,dest,port,octets FROM
(
SELECT source,dest,boxplot(port,LOW) AS low,boxplot(port,HIGH) AS
high FROM --Calculate the LOW and HIGH values for each source,dest
pair.
(
SELECT source,dest,port,octets FROM... GROUP BY source,dest,port
--This removes the duplicate entries.
) pF GROUP BY source,dest
) boxplot
(
SELECT source,dest,port,octets FROM... GROUP BY source,dest,port
--This removes the duplicate entries (again!).
) filter
WHERE filter.source=boxplot.source AND filter.dest=boxplot.dest AND
filter.port=boxplot.LOW AND filter.port=boxplot.HIGH --Relate the
tables 'boxplot' and 'filter' to eachother AND select only the
source,dest,port tuples where port is in the range [LOW,HIGH] from
the filter table.


-Here is the original query I would like to optimize again:

SELECT filter.ipv4_src, filter.ipv4_dst, filter.port_dst, filter.octets FROM
(
SELECT pF.ipv4_src, pF.ipv4_dst, boxplot(pF.port_dst,LOW) AS
low,boxplot(pF.port_dst,HIGH) AS high FROM
(
SELECT ipv4_src, ipv4_dst, port_dst, octets FROM ... GROUP BY
ipv4_src, ipv4_dst, port_dst
) pF GROUP BY pF.ipv4_src, pF.ipv4_dst
) boxplot,
(
SELECT ipv4_src, ipv4_dst, port_dst, octets FROM ... GROUP BY
ipv4_src, ipv4_dst, port_dst
)filter
WHERE filter.ipv4_src=boxplot.ipv4_src AND
filter.ipv4_dst=boxplot.ipv4_dst AND filter.port_dst=boxplot.low AND
filter.port_dst=boxplot.low

I'll try the 'EXPLAIN' query tonight, but therefor I've got to create
a small database first to speed up the results.

Thnx in advance for your help.



On Jan 22, 2008 8:15 AM, Sebastian Mendel [EMAIL PROTECTED] wrote:
 Joris Kinable schrieb:

  Optimize query
 
  I've got one query, which I would like to improve a lot since it takes
  very long (24 hours) to execute. Here is the idea:
  1. Take the table ipv4_srcipv4_dstport_dst (other rows in this
  table are not mentioned for clearity) and remove all duplicate
  tuple's. This is done by subquery 'filter'.
  2. The same query is performed by the boxplot query, but this time an
  aditional group by command is executed, therby calculating a User
  Defined Function boxplot(row,type) which returns a double value.
  3. Finally the results of the query in step 2 are used to select a
  subset of results from the 'filter' table.
  4. As you can see, the subquery 'pF' used in step 2 is identical to
  the query 'filter'. It's an extreme waste to calculate the same table
  twice. I've tried to create a temporary table from filter, but
  unfortunately Mysql doesn't allow you to access a temporary table
  twice in the same query. I prefer a 1 query answer, instead of
  creating views, or temporary tables.
 
  Is there a way to improve this query, therby improving the execution time?
 
  Query:
 
  SELECT filter.ipv4_src, filter.ipv4_dst, filter.port_dst, filter.octets FROM
  (
SELECT pF.ipv4_src, pF.ipv4_dst, boxplot(pF.port_dst,LOW) AS
  low,boxplot(pF.port_dst,HIGH) AS high FROM
(
SELECT ipv4_src, ipv4_dst, port_dst, octets FROM mytable WHERE
  prot=6 GROUP BY ipv4_src, ipv4_dst, port_dst ORDER BY
  ipv4_src,ipv4_dst,port_dst ASC
) pF GROUP BY pF.ipv4_src, pF.ipv4_dst HAVING 
  COUNT(filter.port_dst)10
  ) boxplot,
  (
SELECT ipv4_src, ipv4_dst, port_dst, octets FROM mytable WHERE prot=6
  GROUP BY ipv4_src, ipv4_dst, port_dst ORDER BY
  ipv4_src,ipv4_dst,port_dst ASC
  ) filter
  WHERE filter.ipv4_src=boxplot.ipv4_src AND
  filter.ipv4_dst=boxplot.ipv4_dst AND filter.port_dst=boxplot.low AND
  filter.port_dst=boxplot.low

 what you are trying to do?

 and how about formating your query in a human readable way?

 did you tried EXPLAIN?

 what type of syntax is this: boxplot(pF.port_dst,LOW) ?


 --
 Sebastian

Query optimization

2008-01-21 Thread Joris Kinable
Optimize query

I've got one query, which I would like to improve a lot since it takes
very long (24 hours) to execute. Here is the idea:
1. Take the table ipv4_srcipv4_dstport_dst (other rows in this
table are not mentioned for clearity) and remove all duplicate
tuple's. This is done by subquery 'filter'.
2. The same query is performed by the boxplot query, but this time an
aditional group by command is executed, therby calculating a User
Defined Function boxplot(row,type) which returns a double value.
3. Finally the results of the query in step 2 are used to select a
subset of results from the 'filter' table.
4. As you can see, the subquery 'pF' used in step 2 is identical to
the query 'filter'. It's an extreme waste to calculate the same table
twice. I've tried to create a temporary table from filter, but
unfortunately Mysql doesn't allow you to access a temporary table
twice in the same query. I prefer a 1 query answer, instead of
creating views, or temporary tables.

Is there a way to improve this query, therby improving the execution time?

Query:

SELECT filter.ipv4_src, filter.ipv4_dst, filter.port_dst, filter.octets FROM
(
SELECT pF.ipv4_src, pF.ipv4_dst, boxplot(pF.port_dst,LOW) AS
low,boxplot(pF.port_dst,HIGH) AS high FROM
(
SELECT ipv4_src, ipv4_dst, port_dst, octets FROM mytable WHERE
prot=6 GROUP BY ipv4_src, ipv4_dst, port_dst ORDER BY
ipv4_src,ipv4_dst,port_dst ASC
) pF GROUP BY pF.ipv4_src, pF.ipv4_dst HAVING COUNT(filter.port_dst)10
) boxplot,
(
SELECT ipv4_src, ipv4_dst, port_dst, octets FROM mytable WHERE prot=6
GROUP BY ipv4_src, ipv4_dst, port_dst ORDER BY
ipv4_src,ipv4_dst,port_dst ASC
) filter
WHERE filter.ipv4_src=boxplot.ipv4_src AND
filter.ipv4_dst=boxplot.ipv4_dst AND filter.port_dst=boxplot.low AND
filter.port_dst=boxplot.low

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



creating indexes with myisamchk

2007-11-18 Thread Joris Kinable
Good evening,

I've got to create a very large table: 180GB of data has to be stored.
In order to to this I'm using the following steps:

1. Create database structure including keys.
2. Disable keys: ALTER TABLE ut_netflow_4 DISABLE KEYS
3. Load data into the database.
4. Generate all index keys: myisamchk --sort_buffer_size=2700M
--key_buffer_size=2700M -r --tmpdir=/tmpdir --quick /ut_netflow_4.MYI
5. flush privileges: FLUSH TABLE ut_netflow_4
6. restart server.

Everything up to step 3 goes great. Step 4 and 5 on the other hand do
not seem to work. After running the command at step 4, I've got a huge
ut_netflow_4.MYI index file, but none of the indexes seem to be
enabled. If I use phpmysql to view the table, it says that the
cardinalities of the indexes are 0. Instead of the command at step 4,
I've also tried:
-myisamchk --sort_buffer_size=2700M --key_buffer_size=2700M
--sort-recover /ut_netflow_4.MYI
-myisamchk --sort_buffer_size=2700M --key_buffer_size=2700M
--parallel-recover /ut_netflow_4.MYI
-myisamchk --sort_buffer_size=2700M --key_buffer_size=2700M -r /ut_netflow_4.MYI
None of these seem to work either. The index file becomes larger, but
phpmyadmin keeps indicating that there are no indexes.

What am I doing wrong? How can I create and enable the indexes after
loading the data into the table?

Thnx in advance,

Joris

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]