Re: [pgsql-advocacy] [PERFORM] Postgres and really huge tables

2007-01-19 Thread Oleg Bartunov

On Thu, 18 Jan 2007, Tom Lane wrote:


Brian Hurt [EMAIL PROTECTED] writes:

Is there any experience with PostgreSQL and really huge tables?  I'm
talking about terabytes (plural) here in a single table.


The 2MASS sky survey point-source catalog
http://www.ipac.caltech.edu/2mass/releases/allsky/doc/sec2_2a.html
is 470 million rows by 60 columns; I don't have it loaded up but
a very conservative estimate would be a quarter terabyte.  (I've
got a copy of the data ... 5 double-sided DVDs, gzipped ...)
I haven't heard from Rae Stiening recently but I know he's been using
Postgres to whack that data around since about 2001 (PG 7.1 or so,
which is positively medieval compared to current releases).  So at
least for static data, it's certainly possible to get useful results.
What are your processing requirements?


We are working in production with 2MASS and other catalogues, and
2MASS is not the biggest.  The NOMAD catalogue has more than a billion records.
You can query them online at
http://vo.astronet.ru/cas/conesearch.php
Everything is in PostgreSQL 8.1.5, and we are currently migrating to 8.2.1,
which is going very slowly because of slow COPY performance.
The hardware we use is an HP rx1620, dual Itanium2, with an MSA 20, currently
4.5 TB.





regards, tom lane




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PERFORM] DB benchmark and pg config file help

2007-01-19 Thread Merlin Moncure

On 1/17/07, Kevin Hunter [EMAIL PROTECTED] wrote:

Hello List,

Not sure to which list I should post (gray lines, and all that), so
point me in the right direction if'n it's a problem.

I am in the process of learning some of the art/science of benchmarking.
  Given novnov's recent post about the comparison of MS SQL vs
PostgreSQL, I felt it was time to do a benchmark comparison of sorts for
myself . . . more for me and the benchmark learning process than the
DB's, but I'm interested in DB's in general, so it's a good fit.  (If I
find anything interesting/new, I will of course share the results.)


Just remember that all the major commercial databases have
anti-benchmark clauses in their license agreements.  So, if you decide
to publish your results (especially in a formal benchmark), you can't
mention the big boys by name. [yes this is cowardice]

merlin



Re: [PERFORM] DB benchmark and pg config file help

2007-01-19 Thread Kevin Hunter

On 19 Jan 2007 at 8:45a -0500, Merlin Moncure wrote:

On 1/17/07, Kevin Hunter [hunteke∈earlham.edu] wrote:
I am in the process of learning some of the art/science of benchmarking. 
Given novnov's recent post about the comparison of MS SQL vs 
PostgreSQL, I felt it was time to do a benchmark comparison of sorts for 
myself . . . more for me and the benchmark learning process than the 
DB's, but I'm interested in DB's in general, so it's a good fit.  (If I 
find anything interesting/new, I will of course share the results.)


Just remember that all the major commercial databases have 
anti-benchmark clauses in their license agreements.  So, if you decide 
to publish your results (especially in a formal benchmark), you can't 
mention the big boys by name. [yes this is cowardice]


Anti-benchmark clauses in the license agreements?!?  Cowardice indeed! 
<wry_look>So, by implication, I should do my benchmarking with 
borrowed copies, right?  No sale, no agreement . . . </wry_look>


Seriously though, that would have bitten me.  Thank you, I did not know 
that.  Does that mean that I can't publish the results outside of my 
work/research/personal unit at all?  Or do I just need to obscure which 
DB I'm talking about?  (Like Vendor {1,2,3,...} Product.)


Appreciatively,

Kevin



Re: [PERFORM] Postgres and really huge tables

2007-01-19 Thread Merlin Moncure

On 1/18/07, Brian Hurt [EMAIL PROTECTED] wrote:

Is there any experience with PostgreSQL and really huge tables?  I'm
talking about terabytes (plural) here in a single table.  Obviously the
table will be partitioned, and probably spread among several different
file systems.  Any other tricks I should know about?


A pretty effective partitioning strategy that works in some cases is
to identify a criterion in your dataset that isolates your data on a
session basis.  For example, if you have a company_id that divides up
your company data and a session only needs to deal with one company_id,
you can separate out all your tables based on company_id into
different schemas and have the session set the search_path variable
when it logs in.  Data that does not partition on your criterion sits
in public schemas that all the companies can see.
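
A minimal sketch of what that can look like (the schema, table, and column
names here are invented for illustration, not taken from the original post):

  -- One schema per company, with an identical table layout in each.
  CREATE SCHEMA company_1001;
  CREATE SCHEMA company_1002;

  CREATE TABLE company_1001.orders (order_id int PRIMARY KEY, total numeric);
  CREATE TABLE company_1002.orders (order_id int PRIMARY KEY, total numeric);

  -- Shared reference data stays in the public schema, visible to everyone.
  CREATE TABLE public.countries (code char(2) PRIMARY KEY, name text);

  -- At login, the session points itself at its own company's schema first.
  SET search_path = company_1001, public;

  -- Unqualified names now resolve to company_1001.orders.
  SELECT count(*) FROM orders;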

This takes advantage of a special property of stored procedures: they
do not attach to tables until the first time they are executed in a
session -- keeping you from having to make a separate function for
each schema.  (Note: views do not have this property.)  You can
still cross-query using views and the like, or hand-rolled SQL.
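
To illustrate that behaviour (again a hypothetical sketch, not code from the
thread): a plpgsql function body is not bound to tables at creation time, so
one function can serve every schema, resolving its table references against
whichever search_path is in effect when a session first calls it.

  -- One function shared by all company schemas.  'orders' is deliberately
  -- unqualified; it is resolved (and the plan cached) on the first call in
  -- each session, following that session's search_path.
  CREATE OR REPLACE FUNCTION order_total() RETURNS numeric AS $$
  BEGIN
      RETURN (SELECT coalesce(sum(total), 0) FROM orders);
  END;
  $$ LANGUAGE plpgsql;

  SET search_path = company_1001, public;
  SELECT order_total();   -- sums company_1001.orders

  -- Another session with search_path = company_1002, public would sum
  -- company_1002.orders using the very same function.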

I would call this type of partitioning "logical partitioning", since you
are leveraging logical divisions in your data.  It obviously doesn't
work in all cases, but when it does, it works great.


We have a problem of that form here.  When I asked why postgres wasn't
being used, the opinion that postgres would just [expletive] die was
given.  Personally, I'd bet money postgres could handle the problem (and
better than the ad-hoc solution we're currently using).  But I'd like a
couple of replies of the form "yeah, we do that here -- no problem" to
wave around.


PG will of course not die when your dataset hits a certain
threshold.  It will become slower, following well-known mathematical
patterns that grow with your working set size.  One of the few things
that gets to be a pain with large tables is vacuum -- since you can't
vacuum a piece of a table, and there are certain annoyances with having a
long-running vacuum, this is something to think about.

Speaking broadly about table partitioning, it optimizes one case at
the expense of another.  Your focus (IMO) should be on reducing your
working set size under certain conditions -- not the physical file
size.  If you have a properly laid out and logical dataset and can
identify special cases where you need some information and not other
information, the partitioning strategy should fall into place, whether
it is to do nothing, isolate data into separate schemas/tables/files,
or use the built-in table partitioning feature (which, to be honest, I
am not crazy about).
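
For reference, a minimal sketch of the built-in (inheritance plus
constraint-exclusion) partitioning available in 8.1/8.2; the table and column
names are made up for illustration:

  -- A parent table, with children carved up by CHECK constraints on the key.
  CREATE TABLE measurements (
      id        bigint,
      logged_at date NOT NULL,
      value     numeric
  );

  CREATE TABLE measurements_2006 (
      CHECK (logged_at >= DATE '2006-01-01' AND logged_at < DATE '2007-01-01')
  ) INHERITS (measurements);

  CREATE TABLE measurements_2007 (
      CHECK (logged_at >= DATE '2007-01-01' AND logged_at < DATE '2008-01-01')
  ) INHERITS (measurements);

  -- With constraint_exclusion on, the planner skips children whose CHECK
  -- constraint rules them out for a given query.
  SET constraint_exclusion = on;

  SELECT avg(value)
  FROM measurements
  WHERE logged_at BETWEEN DATE '2007-01-01' AND DATE '2007-01-31';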

merlin



Re: [PERFORM] DB benchmark and pg config file help

2007-01-19 Thread Bruno Wolff III
On Fri, Jan 19, 2007 at 09:05:35 -0500,
  Kevin Hunter [EMAIL PROTECTED] wrote:
 
 Seriously though, that would have bitten me.  Thank you, I did not know 
 that.  Does that mean that I can't publish the results outside of my 
 work/research/personal unit at all?  Or do I just need to obscure about 
 which DB I'm talking?  (Like Vendor {1,2,3,...} Product).

Check with your lawyer. Depending on where you are, those clauses may not even
be valid.



Re: [PERFORM] Configuration Advice

2007-01-19 Thread Merlin Moncure

On 1/17/07, Steve [EMAIL PROTECTED] wrote:

Hey there;
I've been lurking on this list awhile, and I've been working with postgres
for a number of years so I'm not exactly new to this.  But I'm still
having trouble getting a good balance of settings and I'd like to see what
other people think.  We may also be willing to hire a contractor to help
tackle this problem if anyone is interested.


I happen to be something of a COBOL/SQL expert; if you are interested
in some advice, you can contact me off-list.  I converted an enterprise
COBOL (AcuCOBOL) app to PostgreSQL by plugging PG into the COBOL
system via a custom C++ ISAM driver.


I've got an application here that runs large (in terms of length -- the
queries have a lot of conditions in them) queries that can potentially
return millions of rows but on average probably return tens of thousands
of rows.  It's read only for most of the day, and pretty much all the
queries except one are really fast.


If it's just one query I think I'd focus on optimizing that query, not
.conf settings.  In my opinion .conf tuning (a few gotchas aside)
doesn't really get you all that much.


However, each night we load data from a legacy cobol system into the SQL
system and then we summarize that data to make the reports faster.  This
load process is intensely insert/update driven but also has a hefty
amount of selects as well.  This load process is taking ever longer to
complete.


SO ... our goal here is to make this load process take less time.  It
seems the big part is building the big summary table; this big summary
table is currently 9 million rows big.  Every night, we drop the table,
re-create it, build the 9 million rows of data (we use COPY to put the
data in when it's prepared, not INSERT), and then build the indexes on it
-- of which there are many.  Unfortunately this table gets queried
in a lot of different ways and needs these indexes; also unfortunately, we
have operator class indexes to support both ASC and DESC sorting on


I have some very specific advice here.  Check out the row-wise comparison
feature introduced in 8.2.
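
For what it's worth, a hypothetical sketch of that 8.2 feature (the table and
column names are invented): a row-value comparison lets you resume a
multi-column ordering from a known position using a single index on the sort
columns.

  -- Fetch the next page of rows ordered by (last_name, first_name, id),
  -- starting just after a known row; 8.2 can satisfy this with one index
  -- scan on (last_name, first_name, id).
  SELECT id, last_name, first_name
  FROM summary
  WHERE (last_name, first_name, id) > ('Smith', 'John', 4200)
  ORDER BY last_name, first_name, id
  LIMIT 50;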


columns so these are for all intents and purposes duplicate but required
under Postgres 8.1 (we've recently upgraded to Postgres 8.2, is this still
a requirement?)



Building these indexes takes forever!  It's a long grind through inserts,
and then building the indexes takes a hefty amount of time too (about 9
hours).  Now, the application is likely part at fault, and we're working
to make it more efficient, but it has nothing to do with the index
building time.  I'm wondering what we can do to make this better, if
anything; would it be better to leave the indexes on?  It doesn't seem to
be.  Would it be better to use INSERTs instead of copies?  Doesn't seem to


No.

Probably any optimization strategy would focus on reducing the
amount of data you have to load.

merlin



Re: [PERFORM] DB benchmark and pg config file help

2007-01-19 Thread Kevin Hunter

On 19 Jan 2007 at 10:56a -0600, Bruno Wolff III wrote:

On Fri, Jan 19, 2007 at 09:05:35 -0500,
  Kevin Hunter [EMAIL PROTECTED] wrote:
Seriously though, that would have bitten me.  Thank you, I did not know 
that.  Does that mean that I can't publish the results outside of my 
work/research/personal unit at all?  Or do I just need to obscure which 
DB I'm talking about?  (Like Vendor {1,2,3,...} Product.)


Check with your lawyer. Depending on where you are, those clauses may not even
be valid.


<grins />

/me = student = no money . . . lawyer?  You /are/ my lawyers.  ;)

Well, sounds like America's legal system/red tape will at least slow my 
efforts against the non-open source DBs, until I get a chance to find 
out for sure.


I really do appreciate the warnings/heads ups.

Kevin

BTW: I'm currently located in Richmond, IN, USA.  A pin for someone's 
map. :)




Re: [PERFORM] [pgsql-advocacy] Postgres and really huge tables

2007-01-19 Thread Josh Berkus

 A lot of data, but not a lot of records... I don't know if that's
 valid. I guess the people at Greenplum and/or Sun have more exciting
 stories ;)

Not really.  Multi-terabyte tables are pretty much fine on vanilla 
PostgreSQL if you can stick to partitioned and/or indexed access.  If you 
need to do unindexed fishing expeditions on 5 TB of data, then talk to 
Greenplum.

http://www.powerpostgresql.com/Downloads/terabytes_osc2005.pdf
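
As a quick illustration of indexed access versus a fishing expedition (the
table and column names are invented, not from the talk): EXPLAIN ANALYZE shows
whether a query against a huge table hits an index (or a pruned partition) or
falls back to scanning the whole thing.

  -- Does this query use the index on (ra, decl), or does it read the
  -- entire multi-terabyte table?
  EXPLAIN ANALYZE
  SELECT count(*)
  FROM point_sources
  WHERE ra   BETWEEN 180.00 AND 180.10
    AND decl BETWEEN  -1.00 AND   1.00;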

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
