Paul Sheer wrote:
Hadoop backend for PostgreSQL
Resurrecting an old thread, it seems some guys at Yale implemented
something very similar to what this thread was discussing.
http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html
It's an open source stack that
As far as I can tell, the PG storage manager API is at the wrong level
of abstraction for pretty much everything. These days, everything we do
is atop the Unix filesystem API, and anything that smgr might have been
Is there a complete list of filesystem API calls somewhere that I can get
why not just stream it in via set-returning functions and make sure
that we can mark a set returning function as STREAMABLE or so (to
prevent joins, whatever).
it is the easiest way to get it right and it helps in many other cases.
i think that the storage manager is definitely the wrong
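The streaming idea above can be illustrated with a short sketch (Python, purely conceptual; the STREAMABLE marker and the function here are hypothetical, not PostgreSQL syntax). A streaming set-returning function hands back one row at a time without materializing the set, and, as noted, a second scan over the same stream fails, which is why joins against it would have to be forbidden:

```python
def stream_rows(source):
    """Hypothetical streaming set-returning function: yields one row
    at a time instead of materializing the whole result set."""
    for line in source:
        yield tuple(line.split(","))

# One forward pass works fine:
src = iter(["1,a", "2,b", "3,c"])
rows = stream_rows(src)
first_pass = list(rows)

# A rescan (as a nested-loop join would need) finds the stream exhausted:
second_pass = list(rows)
```

Here `first_pass` contains the three rows, while `second_pass` is empty, which is exactly the "cannot read data again" limitation discussed in the thread.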
Tom Lane wrote:
It's interesting to speculate about where we could draw an abstraction
boundary that would be more useful. I don't think the MySQL guys got it
right either...
The supposed smgr abstraction of PostgreSQL, which tells more or less
how to get a byte to the disk, is quite far
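To make the "wrong level of abstraction" point concrete, here is a toy sketch of what the smgr layer amounts to (Python, names simplified; these are not the actual smgr function signatures). The whole interface is "move block N of this relation to or from storage", far below the level of a logical, location-independent data object:

```python
BLOCK_SIZE = 8192  # PostgreSQL's default block size

class InMemorySmgr:
    """Toy stand-in for a storage manager: the contract is purely
    block-level I/O against a relation, nothing higher."""
    def __init__(self):
        self.blocks = {}  # (relation, blockno) -> bytes

    def extend(self, rel, blockno, data):
        # Add a new block at the end of the relation
        self.blocks[(rel, blockno)] = data

    def read(self, rel, blockno):
        return self.blocks[(rel, blockno)]

    def write(self, rel, blockno, data):
        # Overwrite an existing block in place
        self.blocks[(rel, blockno)] = data

    def nblocks(self, rel):
        return sum(1 for (r, _) in self.blocks if r == rel)

mgr = InMemorySmgr()
mgr.extend("t1", 0, b"\x00" * BLOCK_SIZE)
mgr.write("t1", 0, b"x" * BLOCK_SIZE)
```

A replacement backend (an alternative to md.c) would have to implement exactly this block-granularity contract, which is why distribution concerns do not fit naturally at this layer.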
With a distributed data store, the data would become a logical
object - no adding or removal of machines would affect the data.
This is an ideal that would remove a tremendous maintenance
burden from many sites. Well, at least the ones I have worked
at, as far as I can see.
Two things:
It would only be possible to have the actual PostgreSQL backends
running on a single node anyway, because they use shared memory to
This is not a problem: performance is a secondary consideration (at least
as far as the problem I was referring to).
The primary usefulness is to have the data be a
On Mon, Feb 23, 2009 at 9:08 AM, Paul Sheer paulsh...@gmail.com wrote:
It would only be possible to have the actual PostgreSQL backends
running on a single node anyway, because they use shared memory to
This is not a problem: performance is a secondary consideration (at least
as far as the
Paul Sheer wrote:
I have also found it's no use having RAID or ZFS. Each of these ties
the data to an OS installation. If the OS needs to be reinstalled, all
the data has to be manually moved in a way that is, well... dangerous.
How about network storage, fiber attach? If you move the db you
Hi,
Paul Sheer wrote:
This is not a problem: performance is a secondary consideration (at least
as far as the problem I was referring to).
Well, if you don't mind your database running... ehm... creeping several
orders of magnitude slower, you might also be interested in
Single-System Image
On Sun, Feb 22, 2009 at 3:47 PM, Robert Haas robertmh...@gmail.com wrote:
In theory, I think you could make postgres work on any type of
underlying storage you like by writing a second smgr implementation
that would exist alongside md.c. The fly in the ointment is that
you'd need a more
Jonah H. Harris jonah.har...@gmail.com writes:
I believe there is more than that which would need to be done nowadays. I
seem to recall that the storage manager abstraction has slowly been
dedicated/optimized for md over the past 6 years or so.
As far as I can tell, the PG storage manager API
I believe there is more than that which would need to be done
nowadays. I seem to recall that the storage manager
abstraction has slowly been dedicated/optimized for md over the past 6
years or so. It may even be easier/preferred
to write a hadoop specific access method
hi ...
i think the easiest way to do this is to simply add a mechanism to
functions which allows a function to stream data through.
it would basically mean losing join support as you cannot read data
again in a way which is good enough for joining with the
function providing
On Sat, Feb 21, 2009 at 9:37 PM, pi song pi.so...@gmail.com wrote:
1) Hadoop file system is very optimized for mostly-read operations
2) As of a few months ago, hdfs doesn't support file appending.
There might be a bit of impedance to make them go together.
However, I think it should be a very
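The impedance mentioned above can be sketched briefly (Python, conceptual; assuming HDFS's no-in-place-write, no-append semantics of the time). On an append-only store, an "update" cannot overwrite a page as PostgreSQL's heap does; it has to be written as a new appended record, with readers resolving the latest version:

```python
class AppendOnlyStore:
    """Sketch of an append-only file: records are only ever appended,
    never rewritten, so updates become new versions."""
    def __init__(self):
        self.log = []  # appended records, never modified in place

    def append(self, key, value):
        self.log.append((key, value))

    def latest(self, key):
        # Readers must scan back for the newest version of a key
        for k, v in reversed(self.log):
            if k == key:
                return v
        return None

store = AppendOnlyStore()
store.append("row42", "v1")
store.append("row42", "v2")  # "update" = append a new version
```

After these two appends, `latest("row42")` resolves to `"v2"` while both records remain in the log, which is roughly the adaptation a heap-style storage layer would need on top of such a filesystem.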
One more problem is that data placement on HDFS is inherent, meaning you
have no explicit control. Thus, you cannot place two sets of data which are
likely to be joined together on the same node = uncontrollable latency
during query processing.
Pi Song
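For contrast, the explicit placement that HDFS lacks is usually achieved by hash partitioning. A minimal sketch (Python; node count and key names are made up for illustration): if both tables partition on the join key with the same deterministic hash, matching rows land on the same node and the join stays local:

```python
import zlib

N_NODES = 4  # hypothetical cluster size

def node_for(key, n_nodes=N_NODES):
    # Deterministic hash so every table agrees on placement;
    # zlib.crc32 is used because Python's built-in hash() is
    # randomized per process for strings.
    return zlib.crc32(key.encode()) % n_nodes

# Rows from two tables sharing a join key map to the same node:
orders_node = node_for("customer_17")
customers_node = node_for("customer_17")
```

With HDFS choosing block locations on its own, no such co-location guarantee exists, hence the uncontrollable join latency described above.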
On Mon, Feb 23, 2009 at 7:47 AM, Robert Haas
On Sun, Feb 22, 2009 at 5:18 PM, pi song pi.so...@gmail.com wrote:
One more problem is that data placement on HDFS is inherent, meaning you
have no explicit control. Thus, you cannot place two sets of data which are
likely to be joined together on the same node = uncontrollable latency
during
On Mon, Feb 23, 2009 at 3:56 PM, pi song pi.so...@gmail.com wrote:
I think the point that you can access more system cache is right, but that
doesn't mean it will be more efficient than accessing from your local disk.
Take Hadoop for example: your request for file content will have to go to
Hadoop backend for PostgreSQL
A problem that my client has, and one that I come across often,
is that a database seems to always be associated with a particular
physical machine, a physical machine that has to be upgraded,
replaced, or otherwise maintained.
Even if the database is
1) Hadoop file system is very optimized for mostly-read operations
2) As of a few months ago, hdfs doesn't support file appending.
There might be a bit of impedance to make them go together.
However, I think it should be a very good initiative to come up with ideas to
be able to run postgres on