pgvector/README.md
2022-02-06 12:14:43 -08:00

# pgvector
Open-source vector similarity search for Postgres
```sql
CREATE TABLE table (column vector(3));
CREATE INDEX ON table USING ivfflat (column vector_l2_ops);
SELECT * FROM table ORDER BY column <-> '[1,2,3]' LIMIT 5;
```
Supports L2 distance, inner product, and cosine distance

[![Build Status](https://github.com/pgvector/pgvector/workflows/build/badge.svg?branch=master)](https://github.com/pgvector/pgvector/actions)
## Installation
Compile and install the extension (supports Postgres 9.6+)
```sh
git clone --branch v0.2.4 https://github.com/pgvector/pgvector.git
cd pgvector
make
make install # may need sudo
```
Then load it in each database where you want to use it
```sql
CREATE EXTENSION vector;
```
You can also install it with [Docker](#docker), [Homebrew](#homebrew), or [PGXN](#pgxn)
## Getting Started
Create a vector column with 3 dimensions (`table` and `column` are placeholders and reserved words in SQL, so replace them with your own names)
```sql
CREATE TABLE table (column vector(3));
```
Insert values
```sql
INSERT INTO table VALUES ('[1,2,3]'), ('[4,5,6]');
```
Get the nearest neighbor by L2 distance
```sql
SELECT * FROM table ORDER BY column <-> '[3,1,2]' LIMIT 1;
```
Inner product (`<#>`) and cosine distance (`<=>`) are also supported

Note: `<#>` returns the negative inner product, since Postgres only supports `ASC` order index scans on operators
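To make the three operators concrete, here is a minimal pure-Python sketch of the quantities they compute, using the vectors from the examples above. The function names are illustrative only and are not part of pgvector; the extension performs these calculations inside Postgres.

```python
import math

def l2_distance(a, b):
    # Euclidean distance, the quantity returned by <->
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):
    # <#> returns the inner product with its sign flipped, so that
    # ascending order puts the largest inner product first
    return -sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity, the quantity returned by <=>
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

rows = [[1, 2, 3], [4, 5, 6]]
query = [3, 1, 2]

# Equivalent of ORDER BY column <-> '[3,1,2]' LIMIT 1
nearest_l2 = min(rows, key=lambda v: l2_distance(v, query))
# Equivalent of ORDER BY column <#> '[3,1,2]' LIMIT 1
nearest_ip = min(rows, key=lambda v: negative_inner_product(v, query))
```

With these two rows, the L2 nearest neighbor is `[1, 2, 3]`, while ordering by negative inner product returns `[4, 5, 6]`, because its inner product with the query (29) is larger than that of `[1, 2, 3]` (11).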
## Indexing
Speed up queries with an approximate index. Add an index for each distance function you want to use.

L2 distance
```sql
CREATE INDEX ON table USING ivfflat (column vector_l2_ops);
```
Inner product
```sql
CREATE INDEX ON table USING ivfflat (column vector_ip_ops);
```
Cosine distance
```sql
CREATE INDEX ON table USING ivfflat (column vector_cosine_ops);
```
Indexes should be created after the table has data for optimal clustering. If the distribution of data changes significantly, you can reindex without downtime:
```sql
-- Postgres 12+
REINDEX INDEX CONCURRENTLY index_name;
-- Postgres < 12
CREATE INDEX CONCURRENTLY temp_name ON table USING ivfflat (column opclass);
DROP INDEX CONCURRENTLY index_name;
ALTER INDEX temp_name RENAME TO index_name;
```
Unlike typical indexes, which only affect performance, an approximate index can change query results: you may see different rows returned for the same query after adding one.
### Index Options
Specify the number of inverted lists (100 by default)
```sql
CREATE INDEX ON table USING ivfflat (column opclass) WITH (lists = 100);
```
A [good place to start](https://github.com/facebookresearch/faiss/issues/112) is `4 * sqrt(rows)`
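As a quick worked example of that starting point, the sketch below turns a row count into a value for `lists`. The helper name `suggested_lists` is hypothetical, and rounding to an integer and clamping to at least 1 are choices made for this sketch rather than anything pgvector does for you.

```python
import math

def suggested_lists(rows):
    # Starting point from the text: 4 * sqrt(rows), rounded to an integer.
    # The lower bound of 1 is a guard added in this sketch.
    return max(1, round(4 * math.sqrt(rows)))

print(suggested_lists(10_000))     # 400
print(suggested_lists(1_000_000))  # 4000
```

Treat the result as an initial value to benchmark against your own data, not a tuned setting.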
### Query Options
Specify the number of probes (1 by default)
```sql
SET ivfflat.probes = 1;
```
A higher value improves recall at the cost of speed.

Use `SET LOCAL` inside a transaction to set it for a single query
```sql
BEGIN;
SET LOCAL ivfflat.probes = 1;
SELECT ...
COMMIT;
```
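The effect of probes on recall can be seen in a toy inverted-list search. This is a simplified pure-Python sketch, not pgvector's implementation: the list centers are hand-picked here, whereas a real index derives them with k-means, and all names are illustrative.

```python
import math

# Hypothetical, hand-picked list centers (a real index computes these with k-means)
centroids = [(0, 0), (10, 0)]
points = [(1, 0), (6, 0)]

# Build the inverted lists: each point goes to the list of its nearest centroid
lists = {i: [] for i in range(len(centroids))}
for p in points:
    nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
    lists[nearest].append(p)

def search(query, probes):
    # Scan only the `probes` lists whose centroids are closest to the query
    order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    candidates = [p for i in order[:probes] for p in lists[i]]
    return min(candidates, key=lambda p: math.dist(p, query))

query = (4, 0)
print(search(query, probes=1))  # (1, 0): the true nearest neighbor sits in the other list
print(search(query, probes=2))  # (6, 0): scanning both lists gives the exact answer
```

With one probe, only the list nearest to the query is scanned, so a closer point stored in a neighboring list is missed. Scanning more lists recovers it at the cost of more distance computations.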
### Indexing Progress
Check [indexing progress](https://www.postgresql.org/docs/current/progress-reporting.html#CREATE-INDEX-PROGRESS-REPORTING) with Postgres 12+
```sql
SELECT phase, tuples_done, tuples_total FROM pg_stat_progress_create_index;
```
The phases are:
1. `initializing`
2. `sampling table`
3. `performing k-means`
4. `sorting tuples`
5. `loading tuples`

Note: `tuples_done` and `tuples_total` are only populated during the `loading tuples` phase
### Partial Indexes
Consider [partial indexes](https://www.postgresql.org/docs/current/indexes-partial.html) for queries with a `WHERE` clause
```sql
CREATE INDEX ON table USING ivfflat (column opclass) WHERE (other_column = 123);
```
To index many different values of `other_column`, consider [partitioning](https://www.postgresql.org/docs/current/ddl-partitioning.html) on `other_column`.
## Performance
To speed up queries without an index, increase `max_parallel_workers_per_gather`.
```sql
SET max_parallel_workers_per_gather = 4;
```
To speed up queries with an index, increase the number of inverted lists (at the expense of recall).
```sql
CREATE INDEX ON table USING ivfflat (column opclass) WITH (lists = 1000);
```
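A rough way to see why more lists speeds up indexed queries is to estimate how many rows a query has to compare against. The sketch below assumes rows are spread evenly across lists, which real data rarely is, and the function name is hypothetical.

```python
def estimated_rows_scanned(rows, lists, probes=1):
    # Each probe scans one list; with evenly sized lists, a list holds rows / lists rows.
    # Probes beyond the number of lists cannot scan more than the whole table.
    return rows * min(probes, lists) / lists

print(estimated_rows_scanned(1_000_000, lists=100))   # 10000.0
print(estimated_rows_scanned(1_000_000, lists=1000))  # 1000.0
```

Fewer rows per list means less work per query, but also a higher chance that the true nearest neighbors fall in a list that is not scanned, which is the recall cost mentioned above.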
## Reference
### Vector Type
Each vector takes `4 * dimensions + 8` bytes of storage. Each element is a single-precision (4-byte) float, and all elements must be finite (no `NaN`, `Infinity`, or `-Infinity`). Vectors can have up to 1024 dimensions.
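The storage formula and constraints can be checked with a few lines of Python. This is only a sketch of the rules as described here, not pgvector's validation code; the helper names and the lower bound of one dimension are assumptions of the sketch.

```python
import math

MAX_DIMENSIONS = 1024  # limit stated in the text

def vector_storage_bytes(dimensions):
    # 4 bytes per element plus 8 bytes of overhead, per the formula in the text
    return 4 * dimensions + 8

def check_vector(values):
    # Mirrors the documented constraints; the minimum of 1 dimension is assumed here
    if not 1 <= len(values) <= MAX_DIMENSIONS:
        raise ValueError("vector must have between 1 and 1024 dimensions")
    if not all(math.isfinite(v) for v in values):
        raise ValueError("all elements must be finite")
    return values

print(vector_storage_bytes(3))     # 20
print(vector_storage_bytes(1024))  # 4104
```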
### Vector Operators
Operator | Description
--- | ---
\+ | element-wise addition
\- | element-wise subtraction
<-> | Euclidean distance
<#> | negative inner product
<=> | cosine distance
### Vector Functions
Function | Description
--- | ---
cosine_distance(vector, vector) | cosine distance
inner_product(vector, vector) | inner product
l2_distance(vector, vector) | Euclidean distance
vector_dims(vector) | number of dimensions
vector_norm(vector) | Euclidean norm
## Libraries
Libraries that use pgvector:
- [pgvector-python](https://github.com/pgvector/pgvector-python) (Python)
- [Neighbor](https://github.com/ankane/neighbor) (Ruby)
- [pgvector-node](https://github.com/pgvector/pgvector-node) (Node.js)
- [pgvector-go](https://github.com/pgvector/pgvector-go) (Go)
- [pgvector-rust](https://github.com/pgvector/pgvector-rust) (Rust)
- [pgvector-cpp](https://github.com/pgvector/pgvector-cpp) (C++)
## Frequently Asked Questions
#### How many vectors can be stored in a single table?
A non-partitioned table has a limit of 32 TB by default in Postgres. A partitioned table can have thousands of partitions of that size.
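For a rough sense of scale, the table size limit can be divided by the per-vector storage size from the Reference section. The sketch below is a loose upper bound under stated assumptions: it reads 32 TB as binary terabytes and ignores row headers, page overhead, and any other columns, so real capacity is lower. The names are hypothetical.

```python
TABLE_LIMIT_BYTES = 32 * 1024**4  # 32 TB, read here as binary terabytes (an assumption)

def max_vectors_upper_bound(dimensions):
    # Per-vector size from the Reference section: 4 * dimensions + 8 bytes.
    # Row and page overhead are ignored, so this overestimates real capacity.
    return TABLE_LIMIT_BYTES // (4 * dimensions + 8)

print(max_vectors_upper_bound(1024))
```

Even at the maximum of 1024 dimensions, the bound is in the billions of vectors for a single non-partitioned table.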
#### Is replication supported?
Yes, pgvector uses the write-ahead log (WAL), which allows for replication and point-in-time recovery.
#### What if my data has more than 1024 dimensions?
Two things you can try are:
1. use dimensionality reduction
2. compile Postgres with a larger block size (`./configure --with-blocksize=32`) and edit the limit in `src/vector.h`
## Additional Installation Methods
### Docker
Get the [Docker image](https://hub.docker.com/r/ankane/pgvector) with:
```sh
docker pull ankane/pgvector
```
This adds pgvector to the [Postgres image](https://hub.docker.com/_/postgres).
You can also build the image manually
```sh
git clone --branch v0.2.4 https://github.com/pgvector/pgvector.git
cd pgvector
docker build -t pgvector .
```
### Homebrew
On Mac with Homebrew Postgres, you can use:
```sh
brew install pgvector/brew/pgvector
```
### PGXN
Install from the [PostgreSQL Extension Network](https://pgxn.org/dist/vector) with:
```sh
pgxn install vector
```
## Hosted Postgres
Some Postgres providers only support specific extensions. To request a new extension:
- Amazon RDS - follow the instructions on [this page](https://aws.amazon.com/rds/postgresql/faqs/)
- Google Cloud SQL - follow the instructions on [this page](https://cloud.google.com/sql/docs/postgres/extensions#requesting-support-for-a-new-extension)
- DigitalOcean Managed Databases - vote or comment on [this page](https://ideas.digitalocean.com/app-framework-services/p/pgvector-extension-for-postgresql)
- Azure Database for PostgreSQL - follow the instructions on [this page](https://docs.microsoft.com/en-us/azure/postgresql/concepts-extensions#next-steps)
## Upgrading
Install the latest version and run:
```sql
ALTER EXTENSION vector UPDATE;
```
## Thanks
Thanks to:
- [PASE: PostgreSQL Ultra-High-Dimensional Approximate Nearest Neighbor Search Extension](https://dl.acm.org/doi/pdf/10.1145/3318464.3386131)
- [Faiss: A Library for Efficient Similarity Search and Clustering of Dense Vectors](https://github.com/facebookresearch/faiss)
- [Using the Triangle Inequality to Accelerate k-means](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf)
- [k-means++: The Advantage of Careful Seeding](https://theory.stanford.edu/~sergei/papers/kMeansPP-soda.pdf)
- [Concept Decompositions for Large Sparse Text Data using Clustering](https://www.cs.utexas.edu/users/inderjit/public_papers/concept_mlj.pdf)
## History
View the [changelog](https://github.com/pgvector/pgvector/blob/master/CHANGELOG.md)
## Contributing
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- [Report bugs](https://github.com/pgvector/pgvector/issues)
- Fix bugs and [submit pull requests](https://github.com/pgvector/pgvector/pulls)
- Write, clarify, or fix documentation
- Suggest or add new features

To get started with development:
```sh
git clone https://github.com/pgvector/pgvector.git
cd pgvector
make
make install
```
To run all tests:
```sh
make installcheck # regression tests
make prove_installcheck # TAP tests
```
To run single tests:
```sh
make installcheck REGRESS=functions # regression test
make prove_installcheck PROVE_TESTS=test/t/001_wal.pl # TAP test
```
Resources for contributors
- [Extension Building Infrastructure](https://www.postgresql.org/docs/current/extend-pgxs.html)
- [Index Access Method Interface Definition](https://www.postgresql.org/docs/current/indexam.html)
- [Generic WAL Records](https://www.postgresql.org/docs/13/generic-wal.html)