pgvector/README.md

# pgvector

Open-source vector similarity search for Postgres

```sql
CREATE TABLE table (column vector(3));
CREATE INDEX ON table USING ivfflat (column vector_l2_ops);
SELECT * FROM table ORDER BY column <-> '[1,2,3]' LIMIT 5;
```

Supports L2 distance, inner product, and cosine distance

[![Build Status](https://github.com/pgvector/pgvector/workflows/build/badge.svg?branch=master)](https://github.com/pgvector/pgvector/actions)

## Installation

Compile and install the extension (supports Postgres 9.6+)

```sh
git clone --branch v0.2.4 https://github.com/pgvector/pgvector.git
cd pgvector
make
make install # may need sudo
```

Then load it in databases where you want to use it

```sql
CREATE EXTENSION vector;
```

You can also install it with [Docker](#docker), [Homebrew](#homebrew), or [PGXN](#pgxn)

## Getting Started

Create a vector column with 3 dimensions (replace `table` and `column` with non-reserved names)

```sql
CREATE TABLE table (column vector(3));
```

Insert values

```sql
INSERT INTO table VALUES ('[1,2,3]'), ('[4,5,6]');
```

Get the nearest neighbor by L2 distance

```sql
SELECT * FROM table ORDER BY column <-> '[3,1,2]' LIMIT 1;
```

Also supports inner product (`<#>`) and cosine distance (`<=>`)

Note: `<#>` returns the negative inner product since Postgres only supports `ASC` order index scans on operators

## Indexing

Speed up queries with an approximate index. Add an index for each distance function you want to use.

L2 distance

```sql
CREATE INDEX ON table USING ivfflat (column vector_l2_ops);
```

Inner product

```sql
CREATE INDEX ON table USING ivfflat (column vector_ip_ops);
```

Cosine distance

```sql
CREATE INDEX ON table USING ivfflat (column vector_cosine_ops);
```

Indexes should be created after the table has data for optimal clustering. If the distribution of data changes significantly, you can reindex without downtime:

```sql
-- Postgres 12+
REINDEX INDEX CONCURRENTLY index_name;

-- Postgres < 12
CREATE INDEX CONCURRENTLY temp_name ON table USING ivfflat (column opclass);
DROP INDEX CONCURRENTLY index_name;
ALTER INDEX temp_name RENAME TO index_name;
```

Also, unlike typical indexes which only affect performance, you may see different results for queries after adding an approximate index.

### Index Options

Specify the number of inverted lists (100 by default)

```sql
CREATE INDEX ON table USING ivfflat (column opclass) WITH (lists = 100);
```

A [good place to start](https://github.com/facebookresearch/faiss/issues/112) is `4 * sqrt(rows)`

### Query Options

Specify the number of probes (1 by default)

```sql
SET ivfflat.probes = 1;
```

A higher value improves recall at the cost of speed.

Use `SET LOCAL` inside a transaction to set it for a single query

```sql
BEGIN;
SET LOCAL ivfflat.probes = 1;
SELECT ...
COMMIT;
```

### Indexing Progress

Check [indexing progress](https://www.postgresql.org/docs/current/progress-reporting.html#CREATE-INDEX-PROGRESS-REPORTING) with Postgres 12+

```sql
SELECT phase, tuples_done, tuples_total FROM pg_stat_progress_create_index;
```

The phases are:

1. `initializing`
2. `sampling table`
3. `performing k-means`
4. `sorting tuples`
5. `loading tuples`

Note: `tuples_done` and `tuples_total` are only populated during the `loading tuples` phase

### Partial Indexes

Consider [partial indexes](https://www.postgresql.org/docs/current/indexes-partial.html) for queries with a `WHERE` clause

```sql
CREATE INDEX ON table USING ivfflat (column opclass) WHERE (other_column = 123);
```

To index many different values of `other_column`, consider [partitioning](https://www.postgresql.org/docs/current/ddl-partitioning.html) on `other_column`.

## Performance

To speed up queries without an index, increase `max_parallel_workers_per_gather`.

```sql
SET max_parallel_workers_per_gather = 4;
```

To speed up queries with an index, increase the number of inverted lists (at the expense of recall).

```sql
CREATE INDEX ON table USING ivfflat (column opclass) WITH (lists = 1000);
```

## Reference

### Vector Type

Each vector takes `4 * dimensions + 8` bytes of storage. Each element is a float, and all elements must be finite (no `NaN`, `Infinity` or `-Infinity`). Vectors can have up to 1024 dimensions.

### Vector Operators

Operator | Description
--- | ---
\+ | element-wise addition
\- | element-wise subtraction
<-> | Euclidean distance
<#> | negative inner product
<=> | cosine distance

### Vector Functions

Function | Description
--- | ---
cosine_distance(vector, vector) | cosine distance
inner_product(vector, vector) | inner product
l2_distance(vector, vector) | Euclidean distance
vector_dims(vector) | number of dimensions
vector_norm(vector) | Euclidean norm

## Libraries

Libraries that use pgvector:

- [pgvector-python](https://github.com/pgvector/pgvector-python) (Python)
- [Neighbor](https://github.com/ankane/neighbor) (Ruby)
- [pgvector-node](https://github.com/pgvector/pgvector-node) (Node.js)
- [pgvector-go](https://github.com/pgvector/pgvector-go) (Go)
- [pgvector-rust](https://github.com/pgvector/pgvector-rust) (Rust)
- [pgvector-cpp](https://github.com/pgvector/pgvector-cpp) (C++)

## Frequently Asked Questions

#### How many vectors can be stored in a single table?

A non-partitioned table has a limit of 32 TB by default in Postgres. A partitioned table can have thousands of partitions of that size.

#### Is replication supported?

Yes, pgvector uses the write-ahead log (WAL), which allows for replication and point-in-time recovery.

#### What if my data has more than 1024 dimensions?

Two things you can try are:

1. use dimensionality reduction
2. compile Postgres with a larger block size (`./configure --with-blocksize=32`) and edit the limit in `src/vector.h`

## Additional Installation Methods

### Docker

Get the [Docker image](https://hub.docker.com/r/ankane/pgvector) with:

```sh
docker pull ankane/pgvector
```

This adds pgvector to the [Postgres image](https://hub.docker.com/_/postgres).

You can also build the image manually

```sh
git clone --branch v0.2.4 https://github.com/pgvector/pgvector.git
cd pgvector
docker build -t pgvector .
```

### Homebrew

On Mac with Homebrew Postgres, you can use:

```sh
brew install pgvector/brew/pgvector
```

### PGXN

Install from the [PostgreSQL Extension Network](https://pgxn.org/dist/vector) with:

```sh
pgxn install vector
```

## Hosted Postgres

Some Postgres providers only support specific extensions. To request a new extension:

- Amazon RDS - follow the instructions on [this page](https://aws.amazon.com/rds/postgresql/faqs/)
- Google Cloud SQL - follow the instructions on [this page](https://cloud.google.com/sql/docs/postgres/extensions#requesting-support-for-a-new-extension)
- DigitalOcean Managed Databases - vote or comment on [this page](https://ideas.digitalocean.com/app-framework-services/p/pgvector-extension-for-postgresql)
- Azure Database for PostgreSQL - follow the instructions on [this page](https://docs.microsoft.com/en-us/azure/postgresql/concepts-extensions#next-steps)

## Upgrading

Install the latest version and run:

```sql
ALTER EXTENSION vector UPDATE;
```

## Thanks

Thanks to:

- [PASE: PostgreSQL Ultra-High-Dimensional Approximate Nearest Neighbor Search Extension](https://dl.acm.org/doi/pdf/10.1145/3318464.3386131)
- [Faiss: A Library for Efficient Similarity Search and Clustering of Dense Vectors](https://github.com/facebookresearch/faiss)
- [Using the Triangle Inequality to Accelerate k-means](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf)
- [k-means++: The Advantage of Careful Seeding](https://theory.stanford.edu/~sergei/papers/kMeansPP-soda.pdf)
- [Concept Decompositions for Large Sparse Text Data using Clustering](https://www.cs.utexas.edu/users/inderjit/public_papers/concept_mlj.pdf)

## History

View the [changelog](https://github.com/pgvector/pgvector/blob/master/CHANGELOG.md)

## Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

- [Report bugs](https://github.com/pgvector/pgvector/issues)
- Fix bugs and [submit pull requests](https://github.com/pgvector/pgvector/pulls)
- Write, clarify, or fix documentation
- Suggest or add new features

To get started with development:

```sh
git clone https://github.com/pgvector/pgvector.git
cd pgvector
make
make install
```

To run all tests:

```sh
make installcheck        # regression tests
make prove_installcheck  # TAP tests
```

To run single tests:

```sh
make installcheck REGRESS=functions                    # regression test
make prove_installcheck PROVE_TESTS=test/t/001_wal.pl  # TAP test
```

Resources for contributors

- [Extension Building Infrastructure](https://www.postgresql.org/docs/current/extend-pgxs.html)
- [Index Access Method Interface Definition](https://www.postgresql.org/docs/current/indexam.html)
- [Generic WAL Records](https://www.postgresql.org/docs/13/generic-wal.html)