PostgreSQL is a free software object-relational database server (database management system), released under the flexible BSD-style license. It offers an alternative to other open-source database systems (such as MySQL, Firebird, and MaxDB), as well as to proprietary systems such as Oracle, IBM's DB2 and Microsoft SQL Server.
PostgreSQL's unusual-looking name gives some readers pause in trying to pronounce it, especially those who pronounce "SQL" as "sequel". PostgreSQL's developers pronounce it "post-gress-Q-L". (Audio sample, 5.6k MP3).
PostgreSQL has had a lengthy evolution, starting with the Ingres project at UC Berkeley. The project lead, Michael Stonebraker had left Berkeley to commercialize Ingres in 1982, but eventually returned to academia. After returning to Berkeley in 1985, Stonebraker started a post-Ingres project to address the problems with contemporary database systems that had become increasing clear during the early 1980s.
The resulting project, named Postgres, aimed at introducing the minimum number of features needed to add complete support for types. These features included the ability to define types, but also the ability to fully describe relationships – up until this time widely used but maintained entirely by the user. In Postgres the database "understood" relationships, and could retrieve information in related tables in a natural way using rules.
Starting in 1986 the team released a number of papers describing the basis of the system, and by 1988 the project had a prototype version up and running. The team released version 1 to a small number of users in June 1989, followed by Version 2 with a re-written rules system in June 1990. 1991's Version 3 re-wrote the rules system again, but also added support for multiple storage managers and for an improved query engine. By 1993 a huge number of users existed and began to overwhelm the project with requests for support and features. After releasing a Version 4 --primarily as a cleanup -- the project ended.
Although the Postgres project had officially ended, the BSD license (under which Postgres was released) enabled Open Source developers to obtain copies and to develop the system further. In 1994 two UC Berkeley graduate students, Andrew Yu and Jolly Chen, added a SQL language interpreter to replace the earlier Ingres-based QUEL system, creating Postgres95. The code was subsequently released to the web to find its own way in the world. 1996 saw a re-naming of the project: in order to reflect the database's new SQL query language, Postgres95 became PostgreSQL.
The first PostgreSQL release formed version 6.0. Subsequently a group of database developers and volunteers from around the world, coordinated via the Internet, have maintained the software. Since 6.0, many subsequent releases have appeared, and many improvements have occurred in the system; as of August 2004, the current release series is version 7.4, with version 8.0 in beta testing.
Although the license allowed for the commercialization of Postgres, the Postgres code did not develop commercially with the same rapidity as Ingres -- somewhat surprisingly considering the advantages Postgres offered. The main offshoot originated when Paula Hawthorn (an original Ingres team member who moved from Ingres) and Michael Stonebraker formed Illustra Information Technologies to commercialize Postgres.
A cursory examination of PostgreSQL might suggest that the system resembles other database systems. PostgreSQL uses the SQL language to run queries on data. That data exists as a series of tables with foreign keys linking related data together. One might characterise the primary advantage of PostgreSQL over some of its competitors as programmability: PostgreSQL makes it much easier to build real-world applications using data taken from the database.
The SQL data stores simple data types in "flat tables", requiring the user to gather up related information using queries. This contrasts with the way the data itself ends up being used, typically in a high-level language with rich data types where all of the related data is considered as a complete unit of its own, typically referred to as a record or object (depending on the language).
Converting information from the SQL world into the object-oriented programming world presents difficulties because the two have very different models of data organization. This problem is widely known as impedance mismatch in the industry; mapping from one model to the other might take up to 40% of a project's development time. A number of mapping solutions, typically referred to as object-relational mapping, address the issue, but they tend to be expensive and have problems of their own, causing poor performance or forcing all data access to take place through the one language that the mapping supports.
PostgreSQL can solve many of these issues directly in the database. PostgreSQL allows the user to define new types based on the normal SQL types, allowing the database itself to understand complex data. For instance, you can define an address to consist of several strings for things like street number, city and country. From that point on one can easily create tables containing all the fields needed to hold an address with a single line.
PostgreSQL also allows types to include inheritance, one of the major concepts in object-oriented programming. For instance, one could define a post_code type, and then create us_zip_code and canadian_postal_code based on it. Addresses in the database could then take either us_address or canadian_address form, and specialized rules could validate the data in each case. In early versions of PostgreSQL, implementing new types required writing C extensions and compiling them into the database server; in version 7.4, it has become much easier to create and use custom types via CREATE DOMAIN.
Another very useful feature of PostgreSQL involves direct understanding of the relationships that exist between tables. People in the real world typically have several addresses, which the relational model approaches by storing the addresses in one table and the rest of the user information in another. The addresses become "related" to a particular user by storing some unique information, say the user's name, in the address table itself. In order to find all the addresses for "Bob Smith", the user writes a query that "joins" the data back together, by selecting a particular name from the users table and then searching for that name in the address table. Doing a search for all the users in New York can become somewhat complex, requiring the database to find all the user names in the address table, then search the user table for those users. A typical search might look like this:
SELECT u.* FROM user u, address a WHERE a.city='New York' AND a.user_name=u.user_name
PostgreSQL can explicitly define the relationship between users and addresses. Once defined, the address becomes a property of the user, so the search can be greatly simplified to:
SELECT * FROM user WHERE address.city='New York'
This code requires no "join": the database itself understands the user.address relationship.
A related example shows the power of types. If one uses PostgreSQL to do:
SELECT address FROM user
the database filters the results automatically, returning only those addresses for users, not those for companies or other objects that might also use the address table.
The programming of the database itself can profit greatly from the use of functions. Most SQL systems allow users to write a stored procedure, a block of SQL code that other SQL statements can call. However SQL itself remains unsuitable as a programming language, and SQL users can experience great difficulty in constructing complex logic. Worse, SQL itself does not support many of the most basic operations in a programming language, like branching and looping. Instead each vendor has written their own extensions to the SQL language to add these features, and such extensions do not necessarily operate across database platforms.
In PostgreSQL programmers can write such logic in any one of a considerable set of supported languages.
The programmer can insert the code into the server as a function, a small wrapper that makes the code resemble a stored procedure. In this way SQL code can call (for instance) C code and vice-versa.
These advantages add up to making PostgreSQL arguably the most advanced database system from a programming perspective, which helps to explain the success of Illustra. Using PostgreSQL can dramatically reduce overall programming time on many projects, with its advantages growing with project complexity.
Some features of PostgreSQL rarely found in other relational databases include:
In addition, PostgreSQL supports almost all the constructs expected from an enterprise-level database, including:
One can loosely separate the shortcomings PostgreSQL into:
PostgreSQL has lacked a good replication solution; third-party packages exist which address this, however. Systems use replication in situations where a single database server cannot keep up with query workloads and/or where a requirement exists to minimise database downtime.
Perceived deficiencies of lesser importance include point-in-time recovery and nested transactions. (PostgreSQL 8.0 will contain support for point-in-time recovery and nested transactions).
PostgreSQL currently lacks the ability to automatically infer the rules needed to make views support updates; presently, application developers need to define these rules themselves.
In the second group of concerns one should note the need for periodic VACUUMing. Due to MVCC, when a row is updated or deleted by a transaction the old version of the row remains in the database, since another transaction may still have that data in use, or the transaction which modified it may rollback. This provides the substantial benefit that rows do not usually need to be locked, even though one transaction may need a consistent view of database state while others concurrently update data. The process of VACUUMing marks (as stale) that data which the DBMS definitely no longer requires. Old versions of PostgreSQL required an exclusive lock on a table to perform the VACUUM, but recent versions can perform VACUUMing concurrently with normal database access. This substantially reduces the impact of VACUUMing on other database activity. Nevertheless, it is still considered good practice to schedule VACUUMs during periods of low database load. Failure to VACUUM regularly may result in wasteful use of disk space (since one cannot re-use stale row versions until VACUUM has been performed) and performance degradation (because stale row versions must be skipped during query execution). In addition, one must VACUUM the entire database at least once every billion transactions, or else no new transactions may be performed.
Standard aggregate functions, such as COUNT, also perform unusually poorly compared to some other database systems in cases where the query aggregates the entire table. Specifically, some other databases will make use of indexes or system metadata to process a query such as SELECT COUNT(*) FROM table; efficiently. Since index entries do not contain MVCC information, PostgreSQL cannot improve performance by merely reading the index; it must also examine the tuples themselves to verify which ones are visible to the currently-running transaction.
Also, considerable demand exists for a native port to the Microsoft Windows environment. PostgreSQL can run under Windows using the Cygwin Unix emulation library, but offers less than optimal performance and suffers from a complicated installation procedure. The upcoming 8.0 release will include a native Win32 port.