Re: [GENERAL] Import large data set into a table and resolve duplicates?

2015-02-15 Thread Paul A Jungwirth
Hi Eugene,

> Now I need to import the patch into the database, and produce another file as
> - if the passed series field exists in the database, then return ID:series
> - otherwise insert a new row to the table and generate new ID and return
>   ID:series
> for each row in the source file.

I think Francisco's approach is good, and I agree that ~200k rows is
hardly anything. My approach is similar but uses CTEs to combine a lot
of Francisco's queries into one. I still have a separate COPY command
though. (It'd be great if you could COPY into a CTE, but I guess
COPYing into a temporary table is pretty close.) Anyway, when I run
this on my machine, the import finishes in a few seconds:

# Makefile
database=dictionary
port=5432
words=/usr/share/dict/american-english
SHELL=/bin/bash

initial.txt:
  for i in {1..3}; do \
    cat ${words} | while read line; do \
      echo $$i $$line; \
    done; \
  done > initial.txt

tables: initial.txt
  sudo su postgres -c 'psql -p ${port} ${database} -f tables.sql < initial.txt'

a.txt:
  for i in {1,4}; do \
    cat ${words} | while read line; do \
      echo $$i $$line; \
    done; \
  done > a.txt

b.txt:
  for i in {4,5}; do \
    cat ${words} | while read line; do \
      echo $$i $$line; \
    done; \
  done > b.txt

a: a.txt
  sudo su postgres -c 'psql -p ${port} ${database} -f import.sql < a.txt'

b: b.txt
  sudo su postgres -c 'psql -p ${port} ${database} -f import.sql < b.txt'

clean:
  rm -f initial.txt a.txt b.txt

.PHONY: tables a b clean


# tables.sql
DROP TABLE IF EXISTS dictionary;
CREATE TABLE dictionary (id SERIAL PRIMARY KEY, series VARCHAR NOT NULL);
\copy dictionary (series) from pstdin
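-- Build the unique index after the bulk load; that's faster than
-- maintaining it row by row during COPY.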
CREATE UNIQUE INDEX idx_series ON dictionary (series);


# import.sql
CREATE TEMPORARY TABLE staging (
  series VARCHAR NOT NULL
);
\copy staging (series) from pstdin
CREATE INDEX idx_staging_series ON staging (series);

WITH already AS (
  -- Pair each staged series with its existing dictionary id (NULL if new).
  SELECT  dictionary.id, staging.series
  FROM    staging
  LEFT OUTER JOIN dictionary
  ON      dictionary.series = staging.series
),
adding AS (
  -- Insert the series that had no match, returning their new ids.
  INSERT INTO dictionary
  (series)
  SELECT  series::text
  FROM    already
  WHERE   id IS NULL
  RETURNING id, series
)
-- Return ID:series pairs for both new and pre-existing rows.
SELECT  id, series
FROM    adding
UNION
SELECT  id, series
FROM    already WHERE id IS NOT NULL
;


Good luck!
Paul




Re: [GENERAL] On using doubles as primary keys

2015-04-17 Thread Paul A Jungwirth
On Apr 17, 2015 8:35 AM, Kynn Jones kyn...@gmail.com wrote:
> (The only reason for wanting to transfer this data to a Pg table
> is the hope that it will be easier to work with it by using SQL

800 million 8-byte numbers doesn't seem totally unreasonable for
Python/R/Matlab, if you have a lot of memory. Are you sure you want it in
Postgres? Load the file once then filter it as you like. If you don't have
the memory I can see how using Postgres to get fewer rows at a time might
help. Fewer columns at a time would help even more if that's possible.

 In its simplest form, this would mean using
 doubles as primary keys, but this seems to me a bit weird.

I'd avoid that and just include an integer PK with your data. Data frames in
the languages above support that, or you can just slice off the PK column
before doing your matrix math.

Also instead of 401 columns per row maybe store all 400 doubles in an array
column? Not sure if that's useful for you but maybe it's worth considering.

Also if you put the metadata in the same table as the doubles, can you
leave off the PKs altogether? Why join if you don't have to? It sounds like
the tables are 1-to-1? Even if some metadata is not, maybe you can finesse
it with hstore/arrays.
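
To make those last two suggestions concrete, here is a minimal sketch
(table and column names are just invented for illustration):

CREATE TABLE vectors (
  id    integer PRIMARY KEY,  -- optional; drop it if you never need to join
  label text,                 -- 1-to-1 metadata can live right in this table
  vals  float8[] NOT NULL     -- all 400 doubles in one array column
);

-- Pull one coordinate out of every vector without touching the other 399:
SELECT label, vals[42] FROM vectors;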

Good luck!

Paul


Re: [GENERAL] Possible to create canonicalized range type without being superuser?

2016-07-05 Thread Paul A Jungwirth
On Tue, Jul 5, 2016 at 3:28 PM, David G. Johnston
<david.g.johns...@gmail.com> wrote:
> On Tue, Jul 5, 2016 at 5:37 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Paul Jungwirth <p...@illuminatedcomputing.com> writes:
>> > The problem is this (tried on 9.3 and 9.5):
>>
>> The only other obvious way to deal with this is to allow the canonical
>> function to be defined after the range type is created, and then added to
>> the type via an ALTER TYPE command.
>
> Can the canonical function be definitionally optional but runtime required?

Thanks Tom and David for your help! I actually tried to find something
in ALTER TYPE that would do just that, but of course it's not there.
It seems like an appealing solution though.

With Tom's warning that a shell type would require writing the
function in C, I figured I might as well just package it all up as an
extension, so here it is:

https://github.com/pjungwir/inetrange

Of course it still requires superuser privileges, but at least the C
part is just `make && sudo make install`.

Thanks again!
Paul




Re: [GENERAL] Possible to create canonicalized range type without being superuser?

2016-07-05 Thread Paul A Jungwirth
On Tue, Jul 5, 2016 at 10:17 PM, Paul A Jungwirth
<p...@illuminatedcomputing.com> wrote:
> db=> create type inetrange;

Here is a follow-up question about creating inet ranges. Is there any
way to prevent someone from doing this?:

db=> select inetrange('1.2.3.4',
'2001:0db8:0000:0042:0000:8a2e:0370:7334', '[]');
                inetrange
-----------------------------------------
 [1.2.3.4,2001:db8:0:42:0:8a2e:370:7334]

Thanks!
Paul




Re: [GENERAL] ERROR: query returned no rows

2017-06-26 Thread Paul A Jungwirth
On Mon, Jun 26, 2017 at 12:47 PM, Adrian Klaver
<adrian.kla...@aklaver.com> wrote:
> On 06/26/2017 12:03 PM, Paul Jungwirth wrote:
>> Perhaps
>> you should see what is line 85 when you do `\sf words_skip_game` (rather
>> than line 85 in your own source code).
>
> Or easier yet:
>
> https://www.postgresql.org/docs/9.5/static/app-psql.html
> "
> \ef [ function_description [ line_number ] ]

I am always nervous about answering questions here when so many actual
Postgres maintainers are around. But it's pretty great that so often
when I do it provokes an even better answer. It's not the first time!
:-)

Paul




[GENERAL] Partitioning and Table Inheritance

2017-05-08 Thread Paul A Jungwirth
I'm working on a problem where partitioning seems to be the right
approach, but we would need a lot of partitions (say 10k or 100k).
Everything I read says that after ~100 child tables you start to
experience problems. I have a few questions about that:

1. Is it true that the only disadvantage to 10k children is query
planning time when hitting the parent table? Is there anything else to
worry about? (Unlike ext3, ext4 seems to have no problem with lots of
files in one directory (e.g. 1,000,000). It doesn't break and it even
performs fine.) I know a lot of tables increases memory use. There is
clutter in \dt too, but I think I can use schemas to cut down on that.
Anything else?

2. Is there any reason I shouldn't bypass the query planning cost by
having the app just SELECT from the correct child table (which is easy
for us 99% of the time)? Similarly I could bypass trigger time by
INSERTing directly into the child, etc.
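
For example (table names made up), rather than querying the parent:

SELECT * FROM measurements WHERE logdate = '2017-05-08';

the app would hit the known child directly:

SELECT * FROM measurements_2017_05_08 WHERE logdate = '2017-05-08';
INSERT INTO measurements_2017_05_08 (logdate, val) VALUES ('2017-05-08', 1.5);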

3. Is it true that query planning time should scale linearly as I add
more child tables? I'm seeing ~16ms planning time from EXPLAIN ANALYZE
with 1000 tables. I haven't tried 10k tables yet, but so far 16ms or
even 160ms seems tolerable if it's only for the 1% of queries that
can't SELECT directly from a known child table.

4. I tried a scheme where instead of one parent table with 1000 direct
children, I have 1 parent with 10 children, and they each have 10
children, and they each have 10 children. I thought by using
increasingly-specific constraints I could maybe make query planning
scale by O(log n) instead of O(n), but instead it went up! Upon
investigating it looks like maybe the planner flattens the inheritance
hierarchy before doing constraint_exclusion work. Is that true? If so,
is there any interest in changing this in future versions? (I might be
willing to take a stab at a patch for it.)
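
In case it helps to picture the scheme in question 4, the nesting looked
roughly like this (names and ranges simplified):

CREATE TABLE t (id int, val float8);
CREATE TABLE t_0   (CHECK (id >= 0 AND id < 100)) INHERITS (t);
CREATE TABLE t_0_0 (CHECK (id >= 0 AND id < 10))  INHERITS (t_0);
-- ...ten children per level, each with a narrower CHECK than its parent,
-- hoping the planner could exclude whole subtrees at once.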

Btw this is on 9.4 but it could be 9.5 or 9.6 if that would help any.

Thanks,
Paul




Re: [GENERAL] Performance appending to an array column

2017-09-21 Thread Paul A Jungwirth
> It's going to suck big-time :-(.

Ha ha that's what I thought, but thank you for confirming. :-)

> We ended up keeping
> the time series data outside the DB; I doubt the conclusion would be
> different today.

Interesting. That seems a little radical to me, but I'll consider it
more seriously now. I also tried cstore_fdw for this, but my queries
(building a 2-D histogram) were taking 4+ seconds, compared to 500ms
using arrays. Putting everything into regular files gives up filtering
and other SQL built-ins, but maybe I could write my own extension to
load regular files into Postgres arrays, sort of getting the best of
both worlds.

Anyway, thanks for sharing your experience!

Yours,
Paul




[GENERAL] Performance appending to an array column

2017-09-21 Thread Paul A Jungwirth
I'm considering a table structure where I'd be continuously appending
to long arrays of floats (10 million elements or more). Keeping the
data in arrays gives me much faster SELECT performance vs keeping it
in millions of rows.

But since these arrays keep growing, I'm wondering about the UPDATE
performance. I was reading this commit message about improving
performance of *overwriting* individual array elements, and I was
wondering if there is anything similar for growing an array column?:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1dc5ebc9077ab742

Is there a faster way to append to an array than just this?:

UPDATE measurements
SET vals = vals || ARRAY[5.0, 4.2, 9.9]::float[]
;

Thanks!
Paul




[GENERAL] Is float8 a reference type?

2017-09-22 Thread Paul A Jungwirth
The docs say that a Datum can be 4 bytes or 8 depending on the machine:

https://www.postgresql.org/docs/9.5/static/sql-createtype.html

Is a Datum always 8 bytes for 64-bit architectures?

And if so, can my C extension skip a loop like this when compiling
there, and just do a memcpy (or even a cast)?:

float8 *floats;
Datum *datums;

datums = palloc(arrlen * sizeof(Datum));
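/* Box each float8 into a Datum, one element at a time. */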
for (i = 0; i < arrlen; i++) {
  datums[i] = Float8GetDatum(floats[i]);
}

Thanks!
Paul




Re: [GENERAL] Is float8 a reference type?

2017-09-22 Thread Paul A Jungwirth
On Fri, Sep 22, 2017 at 8:05 PM, Pavel Stehule  wrote:
> yes, it is 8 bytes on 64-bit.

Thanks!

> I don't think so it is good idea to write 64bit only extensions.

I agree, but how about this?:

if (FLOAT8PASSBYVAL) {
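  /* An 8-byte Datum holds the float8 bit pattern directly, so reinterpret in place. */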
  datums = (Datum *)floats;
} else {
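  /* Pass-by-reference float8: box each element into a separately palloc'd Datum. */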
  datums = palloc0(arrlen * sizeof(Datum));
  for (i = 0; i < arrlen; i++) {
datums[i] = Float8GetDatum(floats[i]);
  }
}

Thanks,
Paul




Re: [GENERAL] Is float8 a reference type?

2017-09-22 Thread Paul A Jungwirth
On Fri, Sep 22, 2017 at 7:52 PM, Paul A Jungwirth
<p...@illuminatedcomputing.com> wrote:
> Is a Datum always 8 bytes for 64-bit architectures?

Never mind, I found this in `pg_config.h`:

/* float8, int8, and related values are passed by value if 'true', by
   reference if 'false' */
#define FLOAT8PASSBYVAL true

Paul




Re: [GENERAL] Is float8 a reference type?

2017-09-23 Thread Paul A Jungwirth
On Sat, Sep 23, 2017 at 9:40 AM, Tom Lane  wrote:
> I wonder whether you're using up-to-date Postgres headers (ones
> where Float8GetDatum is a static inline function).

I'm building against 9.6.3 on both machines. I'm not doing anything
special to change the compilation options. Here is my whole Makefile:

MODULES = floatfile
EXTENSION = floatfile
EXTENSION_VERSION = 1.0.0
DATA = floatfile--$(EXTENSION_VERSION).sql

PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

But what I'm really interested in is this: What are the bad things
that can happen if I do `datums = (Datum *)floats`, as long as it's
only when Datums are 8 bytes wide? Is there a platform with
pass-by-val float8s where that won't work?

Thanks,
Paul




Re: [GENERAL] Is float8 a reference type?

2017-09-23 Thread Paul A Jungwirth
On Fri, Sep 22, 2017 at 8:38 PM, Tom Lane  wrote:
> "Premature optimization is the root of all evil".  Do you have good reason
> to think that it's worth your time to write unsafe/unportable code?  Do
> you know that your compiler doesn't turn Float8GetDatum into a no-op
> already?  (Mine does, on a 64-bit machine.)

Ha ha, thank you for keeping me honest! But can you explain what is
unsafe about the cast? For a little more context: I've loaded a float8
array from a file, but I need to pass a Datum array to
construct_md_array. With an 8-byte Datum, I can just pass the original
float array, right? But with smaller Datums I need to go through the
array and convert each element. (I'm not really worried about these
files being moved between machines, so I'm willing to make the on-disk
format the same as the in-memory format.)

Since I'm expecting ~10 million elements per array, it seems like
skipping the conversion will have a real effect. I checked the
assembly and do see a difference (on both Mac+clang and Linux+gcc).
Here is the Mac command line:

platter:floatfile paul$ clang -Wall -Wmissing-prototypes
-Wpointer-arith -Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -Wno-unused-command-line-argument -O2  -I. -I./
-I/usr/local/Cellar/postgresql@9.6/9.6.3/include/server
-I/usr/local/Cellar/postgresql@9.6/9.6.3/include/internal
-I/usr/local/opt/gettext/include -I/usr/local/opt/openldap/include
-I/usr/local/opt/openssl/include -I/usr/local/opt/readline/include
-I/usr/local/opt/tcl-tk/include -g -S -o floatfile.s floatfile.c

Here is the assembly for the cast:

  .loc  2 391 23 is_stmt 1        ## floatfile.c:391:23
  movq  -48(%rbp), %r15
Ltmp176:
  ##DEBUG_VALUE: load_floatfile:datums <- %R15

Here is the assembly for the loop (after just changing the code to `if
(FLOAT8PASSBYVAL && false)`):

  .loc  2 393 21 is_stmt 1        ## floatfile.c:393:21
  movslq  %r15d, %r13
  .loc  2 393 28 is_stmt 0        ## floatfile.c:393:28
  leaq  (,%r13,8), %rdi
  .loc  2 393 14                  ## floatfile.c:393:14
  callq  _palloc
  movq  %rax, %r12
Ltmp177:
  ##DEBUG_VALUE: load_floatfile:i <- 0
  .loc  2 394 19 is_stmt 1 discriminator 1 ## floatfile.c:394:19
  testl  %r15d, %r15d
Ltmp178:
  .loc  2 394 5 is_stmt 0 discriminator 1 ## floatfile.c:394:5
  je  LBB7_11
Ltmp179:
## BB#9:
  ##DEBUG_VALUE: load_floatfile:arrlen <- %R15D
  ##DEBUG_VALUE: load_floatfile:nulls <- [%RBP+-80]
  ##DEBUG_VALUE: load_floatfile:floats <- [%RBP+-72]
  ##DEBUG_VALUE: load_floatfile:filename <- %RBX
  .loc  2 0 5 discriminator 1     ## floatfile.c:0:5
  movq  -72(%rbp), %rbx
Ltmp180:
  ##DEBUG_VALUE: load_floatfile:floats <- %RBX
  xorl  %r14d, %r14d
Ltmp181:
  .p2align  4, 0x90
LBB7_10:                          ## =>This Inner Loop Header: Depth=1
  ##DEBUG_VALUE: load_floatfile:floats <- %RBX
  ##DEBUG_VALUE: load_floatfile:arrlen <- %R15D
  ##DEBUG_VALUE: load_floatfile:nulls <- [%RBP+-80]
  .loc  2 395 34 is_stmt 1        ## floatfile.c:395:34
  movsd  (%rbx,%r14,8), %xmm0     ## xmm0 = mem[0],zero
  .loc  2 395 19 is_stmt 0        ## floatfile.c:395:19
  callq  _Float8GetDatum
  .loc  2 395 17                  ## floatfile.c:395:17
  movq  %rax, (%r12,%r14,8)
Ltmp182:
  .loc  2 394 30 is_stmt 1 discriminator 2 ## floatfile.c:394:30
  incq  %r14
  .loc  2 394 19 is_stmt 0 discriminator 1 ## floatfile.c:394:19
  cmpq  %r13, %r14
Ltmp183:
  .loc  2 394 5 discriminator 1   ## floatfile.c:394:5
  jl  LBB7_10
Ltmp184:
LBB7_11:
  ##DEBUG_VALUE: load_floatfile:arrlen <- %R15D
  ##DEBUG_VALUE: load_floatfile:nulls <- [%RBP+-80]

I get the same results on gcc too: the palloc, the loop, and even
`call Float8GetDatum@PLT`.

I'll do some timing of each version too, but it doesn't look like a
pointless optimization. I'd still like to know what is unsafe about it
though.

Thanks!
Paul




Re: [GENERAL] pg_dump throws too many command-line arguments in Postgres 10

2017-10-18 Thread Paul A Jungwirth
On Wed, Oct 18, 2017 at 8:05 AM, Andrus  wrote:
> pg_dump.exe -b -f b.backup -Fc -h  -U admin -p 5432 mydb
>
> causes error
>
> pg_dump: too many command-line arguments (first is "-p")

Don't you need a hostname after -h? I think right now pg_dump thinks
your hostname is "-U", your database is "admin", and everything after
that is extra.
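
For example (hostname here is just a placeholder):

pg_dump.exe -b -f b.backup -Fc -h localhost -U admin -p 5432 mydb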

Yours,
Paul

