Re: [PERFORM] Help with bulk read performance

2010-12-16 Thread Krzysztof Nienartowicz
Hello Daniel,
We have the same scenario for the native Java arrays, so we are storing bytea 
and doing conversion at the client side, but for the server side SQL,  plJava 
comes very handy:

Not sure how you want to create stored procedures to do the conversion internally, but this is how we do it:

One has to define conversion routines in Java, then deploy them to plJava. Scanning through this field will still be CPU-bound, around 2x slower than with native arrays and 6x slower than with blobs, but at least one has the ability. It's even possible to pass the arrays to plR for some statistical processing directly, so depending on the operations you run it may still be cheaper than streaming the data over the wire to a regular JDBC client.

1. Deploy a class like this within plJava (null handling left out for brevity):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class myArrayConversion
{
    public myArrayConversion() {}

    /** Serialize a double array to a bytea. */
    public static byte[] convertDoubleArrayToBytea(double[] obj) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(baos);
        oos.writeObject(obj);
        oos.flush();  // flush the ObjectOutputStream buffer before grabbing the bytes
        return baos.toByteArray();
    }

    /** Serialize an int array to a bytea. */
    public static byte[] convertIntToBytea(int[] obj) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(baos);
        oos.writeObject(obj);
        oos.flush();
        return baos.toByteArray();
    }

    /** Deserialize a bytea back to a double array. */
    public static double[] convertToDoubleArray(byte[] obj) throws IOException,
            ClassNotFoundException {
        // Deserialize from a byte array
        ObjectInputStream ios = new ObjectInputStream(new ByteArrayInputStream(obj));
        return (double[]) ios.readObject();
    }

    /** Deserialize a bytea back to an int array. */
    public static int[] convertIntToArray(byte[] obj) throws IOException,
            ClassNotFoundException {
        // Deserialize from a byte array
        ObjectInputStream ios = new ObjectInputStream(new ByteArrayInputStream(obj));
        return (int[]) ios.readObject();
    }

    // conversions for other array types...
    // ...
}
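To make the class callable from SQL, the compiled jar is installed with plJava's sqlj functions; the jar path and jar name below are placeholders:

SELECT sqlj.install_jar('file:///path/to/conversion.jar', 'conversion', true);
SELECT sqlj.set_classpath('public', 'conversion');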

2. Then create the mapping functions as the database owner:

CREATE OR REPLACE FUNCTION public.convertDoubleArrayToBytea(double precision[])
  RETURNS bytea AS
'myArrayConversion.convertDoubleArrayToBytea(double[])'
  LANGUAGE 'javau' IMMUTABLE
  COST 50;

GRANT EXECUTE ON FUNCTION public.convertDoubleArrayToBytea(double precision[]) TO public;


CREATE OR REPLACE FUNCTION public.convertToDoubleArray(bytea)
  RETURNS double precision[] AS
'myArrayConversion.convertToDoubleArray(byte[])'
  LANGUAGE 'javau' IMMUTABLE
  COST 50;

GRANT EXECUTE ON FUNCTION public.convertToDoubleArray(bytea) TO public;


Then you can convert either way:

select convertToDoubleArray(convertDoubleArrayToBytea(array[i::float8,1.1,100.1,i*0.1]::float8[]))
from generate_series(1,100) i;

so you'd also be able to create bytea objects from native SQL arrays within SQL.
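For example, an existing float8[] column could be migrated to bytea with something like the following (the table and column names are hypothetical):

-- hypothetical schema: ts(vals float8[]); add a bytea twin and fill it
ALTER TABLE ts ADD COLUMN vals_bytea bytea;
UPDATE ts SET vals_bytea = convertDoubleArrayToBytea(vals);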

PLJava seems to be enjoying a revival these days thanks to Johann 'Myrkraverk' Oskarsson, who has fixed several long-standing bugs. Check out the plJava list for details.


Krzysztof



On Dec 16, 2010, at 10:22 AM, pgsql-performance-ow...@postgresql.org wrote:

 From: Dan Schaffer daniel.s.schaf...@noaa.gov
 Date: December 15, 2010 9:15:14 PM GMT+01:00
 To: Andy Colson a...@squeakycode.net
 Cc: Jim Nasby j...@nasby.net, pgsql-performance@postgresql.org, Nick 
 Matheson nick.d.mathe...@noaa.gov
 Subject: Re: Help with bulk read performance
 Reply-To: daniel.s.schaf...@noaa.gov
 
 
 Hi,
 My name is Dan and I'm a co-worker of Nick Matheson, who initially submitted this question (because the mail group had me blacklisted for a while for some reason).
 
 
 Thank you for all of the suggestions. We were able to improve our bulk read performance from 3 MB/s to 60 MB/s (assuming the data are NOT in cache in both cases) by doing the following:
 
 1. Storing the data in a bytea column instead of an array column.
 2. Retrieving the data via the Postgres 9 CopyManager#copyOut(String sql, OutputStream stream) method.
 
 The key to the dramatic improvement appears to be the reduction in packing and unpacking time on the server and the client, respectively. The server-side packing occurs when the retrieved data are packed into a bytestream for sending across the network; storing the data as a simple byte array reduces this time substantially. The client-side unpacking time is spent generating a ResultSet object; by unpacking the bytestream into the desired arrays of floats by hand instead, this time became close to
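A minimal sketch of that approach with the pgjdbc driver might look like this (the connection details and the table/column names are hypothetical, and the binary COPY framing is only indicated, not fully parsed):

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.sql.Connection;
import java.sql.DriverManager;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class BulkReadSketch {
    public static void main(String[] args) throws Exception {
        // hypothetical connection details
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "secret");
        CopyManager cm = ((PGConnection) conn).getCopyAPI();

        // Stream the bytea column out in one pass, bypassing ResultSet creation
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        cm.copyOut("COPY (SELECT vals FROM ts) TO STDOUT (FORMAT binary)", buf);

        // The binary COPY stream carries a 19-byte header plus per-tuple
        // field-count and per-field length words; after skipping those,
        // the payload can be decoded by hand:
        ByteBuffer bb = ByteBuffer.wrap(buf.toByteArray());
        // ... skip the framing, then read big-endian doubles with bb.getDouble() ...
    }
}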

[PERFORM] Re: [BUGS] Query causing explosion of temp space with join involving partitioning

2010-06-02 Thread Krzysztof Nienartowicz
I made a brute-force check and indeed, for one of the parameters the query was switching to sequential scans (or bitmap scans with the condition on survey_pk=16 only, if sequential scans were off). After a closer look at the plan cardinalities I thought it would be worth increasing the histogram size, so I raised the statistics target on sources(srcid) from the default 10 to 1000. It fixed the plan! The sources table is around 100M rows, so the skewness in this range must have looked odd to the planner.
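For reference, that change amounts to something like this (reconstructed, not quoted from the original session):

ALTER TABLE sources ALTER COLUMN srcid SET STATISTICS 1000;
ANALYZE sources;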
Thank you for the hints!
Best Regards,
Krzysztof
On May 27, 2010, at 6:41 PM, Tom Lane wrote:

 Krzysztof Nienartowicz krzysztof.nienartow...@unige.ch writes:
 The logs of the system running the queries are not entirely clear, so chasing down the parameters of the explosive query is not simple (the logs are shared between multiple threads), but from what I can see there is no difference between them, and the plan looks like (this time without removing the irrelevant parameters; most of them are float8, but also bytea)
 [ nestloop with inner index scans over the inherited table ]
 
 Well, that type of plan isn't going to consume much memory or disk
 space.  What I suspect is happening is that sometimes, depending on the
 specific parameter values called out in the query, the planner is
 switching to another plan type that does consume lots of space (probably
 via sort or hash temp files).  The most obvious guess is that that will
 happen when the range limits on srcid get far enough apart to make a
 nestloop not look cheap.  You could try experimenting with EXPLAIN and
 different constant values to see what you get.
 
   regards, tom lane
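Following that suggestion, comparing plans for a narrow and a wide srcid range (the bounds below are hypothetical) would show where the planner flips strategies:

EXPLAIN SELECT t1.survey_pk, t1.source_pk, t1.tstype, t1.vals
  FROM sources t0, ts t1
 WHERE t0.surveyid = 16
   AND t0.srcid >= 203510110032281 AND t0.srcid <= 203510110032999  -- narrow range
   AND t0.surveyid = t1.survey_pk AND t0.srcid = t1.source_pk;
-- then widen the srcid range and watch the nestloop give way to a merge or hash join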




[PERFORM] Query causing explosion of temp space with join involving partitioning

2010-05-27 Thread Krzysztof Nienartowicz
Hello,

Sorry for the re-post - I'm not sure which list is the relevant one; I included a slightly changed query in the previous message, sent to the bugs list.

I have ORM-generated queries where the parent table's keys are used to fetch records from the child table (with the relevant FK indices), and the child table is partitioned. My understanding is that Postgres is unable to properly use constraint exclusion to query only the relevant partitions: half of the join condition is propagated down, while the other half is not.

Table sources has PK (surveyid, srcid); ts has FK (survey_pk, source_pk) referencing sources (surveyid, srcid), plus another index on (survey_pk, source_pk, tstype) (not used in the query).

This is very unfortunate, as the queries are auto-generated and I cannot move the predicate to apply it directly to the partitioned table.

The plan below includes all the partitions; a later snippet shows that exclusion works when the condition is applied directly to the partitioned table.

SELECT t1.SURVEY_PK, t1.SOURCE_PK, t1.TSTYPE, t1.VALS
  FROM sources t0, TS t1
 WHERE (t0.SURVEYID = 16 AND t0.SRCID >= 203510110032281 AND
        t0.SRCID <= 203520107001677 AND t0.SURVEYID = t1.SURVEY_PK AND
        t0.SRCID = t1.SOURCE_PK)
 ORDER BY t0.SURVEYID ASC, t0.SRCID ASC;

                                  QUERY PLAN
------------------------------------------------------------------------------
Merge Join  (cost=11575858.83..11730569.40 rows=3448336 width=60)
  Merge Cond: (t0.srcid = t1.source_pk)
  ->  Index Scan using sources_pkey on sources t0  (cost=0.00..68407.63 rows=37817 width=12)
        Index Cond: ((surveyid = 16) AND (srcid >= 203510110032281::bigint) AND (srcid <= 203520107001677::bigint))
  ->  Materialize  (cost=11575858.83..11618963.03 rows=3448336 width=48)
        ->  Sort  (cost=11575858.83..11584479.67 rows=3448336 width=48)
              Sort Key: t1.source_pk
              ->  Append  (cost=0.00..11049873.18 rows=3448336 width=48)
                    ->  Index Scan using ts_pkey on ts t1  (cost=0.00..8.27 rows=1 width=853)
                          Index Cond: (survey_pk = 16)
                    ->  Index Scan using ts_part_bs3000l0_ts_pkey on ts_part_bs3000l0 t1  (cost=0.00..8.27 rows=1 width=48)
                          Index Cond: (survey_pk = 16)
                    ->  Bitmap Heap Scan on ts_part_bs3000l1_cg0346l0 t1  (cost=5760.36..1481735.21 rows=462422 width=48)
                          Recheck Cond: (survey_pk = 16)
                          ->  Bitmap Index Scan on ts_part_bs3000l1_cg0346l0_ts_pkey  (cost=0.00..5644.75 rows=462422 width=0)
                                Index Cond: (survey_pk = 16)
                    ->  Bitmap Heap Scan on ts_part_cg0346l1_cg0816k0 t1  (cost=5951.07..1565423.79 rows=488582 width=48)
                          Recheck Cond: (survey_pk = 16)
                          ->  Bitmap Index Scan on ts_part_cg0346l1_cg0816k0_ts_pkey  (cost=0.00..5828.93 rows=488582 width=0)
                                Index Cond: (survey_pk = 16)
                    ->  Bitmap Heap Scan on ts_part_cg0816k1_cg1180k0 t1  (cost=5513.54..1432657.90 rows=447123 width=48)
                          Recheck Cond: (survey_pk = 16)
                          ->  Bitmap Index Scan on ts_part_cg0816k1_cg1180k0_ts_pkey  (cost=0.00..5401.75 rows=447123 width=0)
                                Index Cond: (survey_pk = 16)
                    ->  Bitmap Heap Scan on ts_part_cg1180k1_cg6204k0 t1  (cost=5212.63..1329884.46 rows=415019 width=48)
                          Recheck Cond: (survey_pk = 16)
                          ->  Bitmap Index Scan on ts_part_cg1180k1_cg6204k0_ts_pkey  (cost=0.00..5108.87 rows=415019 width=0)
                                Index Cond: (survey_pk = 16)
                    ->  Bitmap Heap Scan on ts_part_cg6204k1_lm0022n0 t1  (cost=5450.37..1371917.76 rows=428113 width=48)
                          Recheck Cond: (survey_pk = 16)
                          ->  Bitmap Index Scan on ts_part_cg6204k1_lm0022n0_ts_pkey  (cost=0.00..5343.35 rows=428113 width=0)
                                Index Cond: (survey_pk = 16)
                    ->  Bitmap Heap Scan on ts_part_lm0022n1_lm0276m0 t1  (cost=5136.71..1298542.32 rows=405223 width=48)
                          Recheck Cond: (survey_pk = 16)
                          ->  Bitmap Index Scan on ts_part_lm0022n1_lm0276m0_ts_pkey  (cost=0.00..5035.40 rows=405223 width=0)
                                Index Cond: (survey_pk = 16)
                    ->  Bitmap Heap Scan on ts_part_lm0276m1_lm0584k0 t1  (cost=5770.98..1525737.42 rows=476204 width=48)
                          Recheck Cond: (survey_pk = 16)
                          ->  Bitmap Index Scan on ts_part_lm0276m1_lm0584k0_ts_pkey  (cost=0.00..5651.93 rows=476204 width=0)
                                Index Cond: (survey_pk = 16)

[PERFORM] Re: [BUGS] Query causing explosion of temp space with join involving partitioning

2010-05-27 Thread Krzysztof Nienartowicz
) AND (t1.source_pk = t0.srcid))
  ->  Index Scan using ts_part_lm0584k1_sm0073k0_ts_pkey on ts_part_lm0584k1_sm0073k0 t1  (cost=0.00..103.47 rows=93 width=1242) (actual time=0.004..0.004 rows=0 loops=500)
        Index Cond: ((t1.survey_pk = 16) AND (t1.source_pk = t0.srcid))
 Total runtime: 585.566 ms
(28 rows)

Time: 588.102 ms


I would be grateful for any pointers, as restarting the server is the only option once such a query starts thrashing the disk.

Best Regards,
Krzysztof


Krzysztof Nienartowicz krzysztof.nienartowicz.c...@gmail.com writes:
 SELECT t1.SURVEY_PK, t1.SOURCE_PK, t1.TSTYPE, t1.VALS
 FROM sources t0, TS t1 WHERE
 (t0.SURVEYID = 16 AND t0.SRCID >= 203510110032281 AND
 t0.SRCID <= 203520107001677 AND t0.SURVEYID = t1.SURVEY_PK AND
 t0.SRCID = t1.SOURCE_PK) ORDER BY t0.SURVEYID ASC, t0.SRCID ASC

We don't make any attempt to infer derived inequality conditions,
so no, those constraints on t0.srcid won't be propagated over to
t1.source_pk.  Sorry.  It's been suggested before, but it would be
a lot of new mechanism and expense in the planner, and for most
queries it'd just slow things down to try to do that.
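(The usual workaround, where the query text can be edited, is to repeat the bounds against the child table so each partition index sees them directly; a sketch based on the query above:)

SELECT t1.survey_pk, t1.source_pk, t1.tstype, t1.vals
  FROM sources t0, ts t1
 WHERE t0.surveyid = 16
   AND t0.srcid >= 203510110032281 AND t0.srcid <= 203520107001677
   AND t1.survey_pk = 16                                -- bounds repeated manually
   AND t1.source_pk >= 203510110032281                  -- bounds repeated manually
   AND t1.source_pk <= 203520107001677                  -- bounds repeated manually
   AND t0.surveyid = t1.survey_pk AND t0.srcid = t1.source_pk
 ORDER BY t0.surveyid, t0.srcid;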

 I have around 30 clients running the same query with different parameters, but the query always returns 1000 rows (the boundary values are pre-calculated, so it is like a traversal of an equi-width histogram over srcid/source_pk), and the rows from parallel queries cannot overlap. Usually the query returns within about a second. I noticed, however, that some queries hang for many hours, and most curiously, some of them have created several GB of temp files.

Can you show us the query plan for the slow cases?

regards, tom lane