Vectors are the new JSON in PostgreSQL

Mon, Jun 26, 2023 10-minute read

Vectors are the new JSON.

That in itself is an interesting statement, given vectors are a well-studied mathematical structure, and JSON is a data interchange format. And yet in the world of data storage and retrieval, both of these data representations have become the lingua franca of their domains and are either essential, or soon-to-be-essential, ingredients in modern application development. And if current trends continue (I think they will), vectors will be as crucial as JSON is for building applications.

Generative AI and all the buzz around it have caused developers to look for convenient ways to store and run queries against the outputs of these systems, with PostgreSQL being a natural choice for a lot of reasons. But even with the hype around generative AI, this is not a new data pattern. Vectors, as a mathematical concept, have been around for hundreds of years. Machine learning has over a half-century's worth of research behind it. The array – the fundamental data structure for a vector – is taught in most introductory computer science classes. Even PostgreSQL has had support for vector operations for over 20 years (more on that later)!

So, what is new? It’s the accessibility of these AI/ML algorithms and how easy it is to represent some “real world” structure (text, images, video) as a vector and store it for some future use by an application. And again, while folks may point out that it’s not new to store the output of these systems (“embeddings”) in data storage systems, the emergent pattern is being able to query and return this data in near real-time from almost any application.

What does this have to do with PostgreSQL? Everything! Efficient storage and retrieval of a data type used in a common pattern greatly simplifies app development, lets people keep their related data in the same place, and can work with existing tooling. We saw this with JSON over 10 years ago, and now we’re seeing it with vector data.

To understand why vectors are the new JSON, let’s rewind and look back at what happened as JSON emerged as the de facto data type for web communications.

A very brief history of JSON in PostgreSQL

Back during the “rise of JSON” I was very much still an app developer. The systems I was building either emitted JSON data to the frontend so that it could complete some sort of action (e.g. rendering an updatable widget) or worked with a “modern” API that returned its data in JSON format. What was nice about JSON was its simplicity (very easy to both read and manipulate), yet its relative expressiveness for a data interchange format. There are some things I would have loved to see in JSON – skewing towards the database side, I’m definitely a fan of having schemas – but JSON did simplify communicating between systems efficiently, both from a development and operations standpoint.

While JSON was meant first and foremost to be an interchange format, people did ask “well, why can’t I just store and query this natively?” This led to the emergence of specialized data storage systems that let you store and query JSON documents. While I had tried out a few different ad hoc JSON storage systems for a very specific problem, I wasn’t sure if I wanted to bring them into my application stack due to performance and maintainability reasons (I won’t name names, as I did this analysis well over a decade ago and a lot has changed). This led to the question – is it possible to store JSON data in PostgreSQL?

I remember going to PostgreSQL events eagerly awaiting what updates there were on PostgreSQL-native support for storage and retrieval of JSON documents. I remember the excitement of PostgreSQL 9.2 adding the text-based JSON type. The initial support for JSON in PostgreSQL validated that the content you were storing was valid JSON, and came with a few functions and operators that helped you extract data contained within the document. There was no native indexing support, but you could build expression indexes if you planned to frequently query a key within the document.
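As a rough illustration (the table and key names here are made up), an expression index on a frequently queried JSON key looks something like this in modern PostgreSQL:

    -- A table storing raw API responses as the text-based json type
    CREATE TABLE api_logs (
        id bigserial PRIMARY KEY,
        payload json NOT NULL
    );

    -- No native indexing on json, but an expression index on a frequently
    -- queried key makes lookups on that key efficient
    CREATE INDEX api_logs_customer_id_idx
        ON api_logs ((payload ->> 'customer_id'));

    -- Queries that use the same expression can use the index
    SELECT * FROM api_logs WHERE payload ->> 'customer_id' = '12345';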

This initial JSON support in PostgreSQL helped solve several problems for me, specifically, snapshotting the state of several tables in my database, and logging output from APIs I interfaced with. The initial text-based JSON data type did not have much in the way of search capabilities: it was possible to build expression indexes to query on a particular key in a JSON document, but pragmatically I would store that key in a column next to the JSON document.

There’s a key element in here: the initial support for JSON had limited utility as a “JSON database.” Yes, we could now store JSON, and we had some limited querying abilities for it, but it was clear it needed more work to compete with the functionality of ad hoc JSON databases. However, PostgreSQL was still good enough for many of these use-cases, and in part developers were okay with these limitations so long as they could use the text-based JSON data type with their existing application infrastructure. PostgreSQL was also the first relational database to add support for JSON, setting a trend that ultimately led to the adoption of JSON in the SQL standard.

However, the viability of PostgreSQL as a “JSON database” changed with the release of PostgreSQL 9.4. This release added JSONB, a binary representation of the JSON data type, which came with the ability to use GIN indexing to search on arbitrary data within JSON documents. From a performance standpoint, this put PostgreSQL on par with JSON databases, while still providing all the other benefits of keeping data in a relational database. While it took a few years, PostgreSQL was able to adapt and evolve to support the workloads of applications.
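A minimal sketch of what that unlocked (the table and document contents are made up):

    -- Documents stored as jsonb can be indexed with GIN
    CREATE TABLE events (
        id bigserial PRIMARY KEY,
        doc jsonb NOT NULL
    );

    CREATE INDEX events_doc_idx ON events USING GIN (doc);

    -- The @> containment operator can use the GIN index to search on
    -- arbitrary keys and values within the documents
    SELECT * FROM events WHERE doc @> '{"status": "shipped"}';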

Support for JSON in PostgreSQL has continued to evolve and improve through the years, and no doubt will continue to do so as PostgreSQL continues to implement and adopt SQL/JSON. I’ve talked to PostgreSQL users who use its JSON support to store tens of terabytes of JSON documents within a PostgreSQL database – and they have positive feedback on the experience!

The key part of this story is that developers were willing to bet on PostgreSQL to have a competitive JSON storage system, and worked with its initial limitations until there was more robust support. Which brings us to vectors.

The rise of the vector: “a new kind of JSON”

Vectors are not new, but they’re having a surge in popularity these days. As mentioned earlier, this is due to the newfound accessibility of AI/ML systems, and that the output of these systems are vectors. A common use-case is to build a model on stored data (text, sound, video), convert it to vector format, and then use it for “semantic search.” In this case, semantic search is performed when you take a new input, convert it to its corresponding vector, and find the most similar results in the database. Similarity is found using a distance function, such as Euclidean or cosine distance, and the results are often capped at the “k nearest neighbors” (K-NN), or k most similar objects. It can take a lot of time to encode the “training set” of vectors, so it makes sense to “cache” them in a permanent data storage system, such as a database, and perform K-NN queries there. Having a set of vectors that are ready to be queried for semantic searches makes a generally better experience for users, which has given rise to the notion of needing a “vector database.”

Storing a vector is not new in PostgreSQL. The PostgreSQL array type was around when PostgreSQL was first open-sourced in 1996(!), though there were many improvements made to it through the years. In fact, PostgreSQL “arrays” is a bit of a misnomer, as they can store multiple dimensions of data (e.g. a matrix). Natively, though, PostgreSQL arrays have limited functionality for the operations that are common on vectors, e.g. calculating the “distance” between two arrays. It’s possible to write stored procedures to handle this, but that puts extra work on the developer.
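As a rough sketch of that extra work (the function below is my own illustration, not anything built into PostgreSQL), a Euclidean distance function over arrays might look like this:

    -- Euclidean distance between two float8 arrays of the same length
    CREATE FUNCTION array_euclidean_distance(a float8[], b float8[])
    RETURNS float8
    LANGUAGE sql IMMUTABLE STRICT
    AS $$
        SELECT sqrt(sum((x.v - y.v) ^ 2))
        FROM unnest(a) WITH ORDINALITY AS x(v, i)
        JOIN unnest(b) WITH ORDINALITY AS y(v, i) USING (i);
    $$;

    -- sqrt(27) ≈ 5.196
    SELECT array_euclidean_distance(ARRAY[1,2,3]::float8[], ARRAY[4,5,6]::float8[]);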

Fortunately, the cube data type overcomes these limitations. cube has also been around for 20+ years in the PostgreSQL code base, and is designed for performing operations on higher dimensional vectors (more on that in a second). cube contains most of the common distance functions used in vector similarity searches, including Euclidean distance, and can use GiST indexes to perform efficient K-NN queries! However, cube is capped at storing vectors with 100 dimensions, and many modern AI/ML systems have dimensionality that far exceeds that.
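For a sense of what that looks like (the table and column names are made up), here's a minimal sketch using cube:

    CREATE EXTENSION IF NOT EXISTS cube;

    CREATE TABLE items (
        id bigserial PRIMARY KEY,
        embedding cube NOT NULL  -- capped at 100 dimensions
    );

    -- A GiST index supports index-assisted nearest-neighbor ordering
    CREATE INDEX items_embedding_idx ON items USING gist (embedding);

    -- <-> is cube's Euclidean distance operator; ORDER BY ... LIMIT gives K-NN
    SELECT id
    FROM items
    ORDER BY embedding <-> '(0.1, 0.2, 0.3)'::cube
    LIMIT 5;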

So, if arrays can handle vector dimensionality, but not operations, and cubes can handle operations, but not dimensionality, what can we do?

pgvector: an open source extension for storing and searching vectors in PostgreSQL

One of the foundations of PostgreSQL is its extensibility: PostgreSQL has interfaces to create new data types and new indexing methods. This gives us pgvector, an open source PostgreSQL extension that provides an indexable vector data type. In a nutshell, pgvector lets you store vectors in PostgreSQL and perform K-NN queries with an assortment of distance metrics: Euclidean, cosine, and inner product. As of today, pgvector comes with one index, ivfflat, which implements the IVF FLAT method of vector indexing.
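Here's a minimal sketch of what that looks like (the table, column names, and sample data are made up; I'm using 3 dimensions to keep it readable, whereas real embedding models produce hundreds or thousands):

    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE documents (
        id bigserial PRIMARY KEY,
        content text,
        embedding vector(3)
    );

    INSERT INTO documents (content, embedding) VALUES
        ('a cat sat on a mat', '[0.11, 0.92, 0.31]'),
        ('the dog chased a ball', '[0.05, 0.87, 0.45]');

    -- K-NN query: <-> is Euclidean distance, <=> is cosine distance,
    -- <#> is negative inner product
    SELECT id, content
    FROM documents
    ORDER BY embedding <=> '[0.10, 0.90, 0.30]'
    LIMIT 5;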

What happens when you query indexed vector data may be a bit different than how you’re used to querying data in PostgreSQL. Due to the computational expense of performing nearest-neighbor searches over high-dimensional vectors, many vector indexing methods look for “approximate” answers that are “close enough” to the correct answer. This has led to the field of “Approximate Nearest Neighbor” (ANN) searches. The key tradeoff people look at for ANN queries is between performance and “recall”, where “recall” is the percentage of relevant results returned.

Let’s look at the ivfflat method as an example. When building an ivfflat index, you decide how many lists you want to have in it. Each list represents a “center”; these centers are calculated using a k-means algorithm. Once you determine all of your centers, ivfflat determines what center each vector is closest to and adds it to the index. When it’s time to query your vector data, you then decide how many centers to check, which is determined by the ivfflat.probes parameter. This is where you see the ANN performance/recall tradeoff: the more centers you visit, the more precise your results, but at the expense of performance.
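To make that concrete, here's a hedged sketch continuing the table from above, showing the lists and ivfflat.probes knobs:

    -- Build an ivfflat index with 100 lists (centers), using cosine distance
    CREATE INDEX documents_embedding_idx
        ON documents
        USING ivfflat (embedding vector_cosine_ops)
        WITH (lists = 100);

    -- At query time, choose how many centers to check; more probes means
    -- better recall at the expense of performance
    SET ivfflat.probes = 10;

    SELECT id, content
    FROM documents
    ORDER BY embedding <=> '[0.10, 0.90, 0.30]'
    LIMIT 5;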

Because of the popularity of storing the output of AI/ML in “vector databases” and of pgvector, there are plenty of examples for how to use pgvector. So instead, I’ll focus on where things are heading.

The next steps for better support for vectors in PostgreSQL

Similar to the PostgreSQL 9.2 days of JSON, we’re in the early stages of how we store vector data in PostgreSQL – and while a lot of what we see in both PostgreSQL and pgvector is very good, it’s about to get a whole lot better.

pgvector can already handle many common use cases for AI/ML data – I’ve already seen many users successfully deploy apps with it! – so the next step is to help it scale. This is not too different from what happened with JSON and JSONB in PostgreSQL, but having pgvector as an extension will help it iterate more rapidly.

At PGCon 2023, which is a PostgreSQL conference where many internals developers gather, I presented a lightning talk called Vectors are the new JSON where I shared use-cases and some upcoming challenges with improving PostgreSQL and pgvector performance for querying vector data. Some problems to tackle (many of which are in progress!) involve adding more parallelism to pgvector, adding support for indexing vectors with more than 2,000 dimensions, and leveraging hardware acceleration where possible to speed up calculations. The good news is that some of these things are not too hard to add, they just require open source contributions!

There is a lot of excitement around using PostgreSQL as a vector database (emphasis on database ;-), and I expect that, as history has shown with JSON, the PostgreSQL community will find a way to support this emergent workload in a way that’s scalable and safe. I do encourage you to provide feedback – both on PostgreSQL itself and pgvector – on how you’re working with vector data in PostgreSQL, or how you want to work with data in PostgreSQL, as that will help guide the community on providing optimal support for vector queries.