796 Commits

Author SHA1 Message Date
Andrew Kane
2cbd08b6c0 Moved unions and macros [skip ci] 2024-10-10 09:41:26 -07:00
Andrew Kane
fa6782985a Added HnswQuery struct for query data 2024-10-09 23:45:47 -07:00
Andrew Kane
32ab27d72a Added HnswSupport struct for support functions 2024-10-09 23:10:26 -07:00
Andrew Kane
064db12de7 Moved procinfo initialization for inserts [skip ci] 2024-10-09 21:59:21 -07:00
Andrew Kane
45a6eef9e0 Improved variable name [skip ci] 2024-10-09 21:52:10 -07:00
Andrew Kane
17266ed409 Use inMemory for conditionals 2024-10-09 21:49:32 -07:00
Andrew Kane
a98534e5ab DRY HNSW procinfo 2024-10-09 21:03:18 -07:00
Andrew Kane
57c05c59a2 DRY code for forming index value 2024-10-09 20:50:17 -07:00
Andrew Kane
3126fbdb6f Use double for distance [skip ci] 2024-10-09 17:04:25 -07:00
Andrew Kane
f4b67b078f DRY HNSW distance calculations 2024-10-09 17:01:49 -07:00
Andrew Kane
77688b4309 Improve total cost for cost estimation (#686) 2024-10-08 12:42:03 -07:00
Andrew Kane
d5f4a0e435 Fixed memory context leak in HnswUpdateNeighborsOnDisk - fixes #692 2024-10-08 12:21:26 -07:00
Andrew Kane
57248ba128 Use separate memory context for updating neighbors, which improves performance around 10% for larger vectors 2024-09-30 11:15:27 -07:00
Andrew Kane
ff6da4fcea Moved logic to get update neighbor on disk to separate function 2024-09-30 10:30:01 -07:00
Andrew Kane
a8b4b6675a Moved logic to get update index to separate function 2024-09-30 10:14:52 -07:00
Andrew Kane
d148b4e61b Fixed insert logic 2024-09-30 09:59:12 -07:00
Andrew Kane
658d74e2f6 Use Size for memory [skip ci] 2024-09-29 23:48:58 -07:00
Andrew Kane
7ba593c492 Improved SelectNeighbors signature [skip ci] 2024-09-29 23:03:02 -07:00
Andrew Kane
525e3b81e1 Improved HnswUpdateConnection parameters [skip ci] 2024-09-29 19:47:25 -07:00
Andrew Kane
8eb8cdf0f3 Moved insert-specific code to hnswinsert.c 2024-09-29 19:44:11 -07:00
Andrew Kane
4c72f91206 Improved variable name [skip ci] 2024-09-29 19:26:15 -07:00
Andrew Kane
4ac86f62a1 Improved variable names [skip ci] 2024-09-29 19:22:35 -07:00
Andrew Kane
648dd8af78 Moved LoadElementsForInsert to separate function and removed unused code path 2024-09-29 19:12:38 -07:00
Andrew Kane
ee43ee9b16 Use HnswLoadNeighborTids for inserts 2024-09-29 18:52:12 -07:00
Andrew Kane
5ce367e18b Removed lc from HnswUpdateConnection [skip ci] 2024-09-29 18:18:42 -07:00
Andrew Kane
f371eb119b Removed lc from SelectNeighbors [skip ci] 2024-09-29 18:14:28 -07:00
Andrew Kane
382a25aefb Split loading neighbor TIDs into separate function [skip ci] 2024-09-29 17:20:54 -07:00
Andrew Kane
0b6214aad6 Moved HnswLoadNeighbors to hnswinsert.c [skip ci] 2024-09-29 15:49:01 -07:00
Andrew Kane
f2afd11257 Use sc for search candidates [skip ci] 2024-09-29 15:09:54 -07:00
Andrew Kane
cae3458329 Updated distance to use double 2024-09-29 15:06:50 -07:00
Andrew Kane
dc23752618 Fixed uninitialized variable [skip ci] 2024-09-28 19:18:52 -07:00
Andrew Kane
54fa16e3e3 Added safety check [skip ci] 2024-09-26 08:32:44 -07:00
Andrew Kane
5776a4d937 Only adjust for TOAST [skip ci] 2024-09-25 15:39:56 -07:00
Andrew Kane
242a12b7d5 Added same cost adjustment to HNSW as IVFFlat since TOAST not included in seq scan cost - #682 [skip ci] 2024-09-25 15:33:57 -07:00
Andrew Kane
1370dd6e86 Removed unneeded floor and fixed comment formatting [skip ci] 2024-09-25 14:13:02 -07:00
Andrew Kane
a100dc67e5 Ran pgindent [skip ci] 2024-09-25 14:03:51 -07:00
Jonathan S. Katz
2df9f24aad Update HNSW cost estimation to utilize search and index info (#682)
Previously, the cost estimation formula for an HNSW index scan only
factored in the entry level for the scan and the "m" index parameter,
which reflects the number of tuples (or vectors) to scan at each step
of an HNSW graph traversal. While this would bias the PostgreSQL query
planner to choose an HNSW index scan over other available paths, it
could lead to suboptimal index selection, for example, choosing an
HNSW index instead of an available B-tree index that has better
selectivity.

The number of tuples scanned during HNSW graph traversal is principally
influenced by these factors:

 * The number of tuples stored in the index
 * `m` - the number of tuples that are scanned in each step of the graph
   traversal
 * `hnsw.ef_search` - which influences the total number of steps it
   takes for the scan to converge on the approximated nearest neighbors

Through testing different source models for vectors, we also observed
that the correlation of vectors in models would impact this convergence.
For this first iteration, we've opted to hardcode a constant scaling
factor and set it to `0.55`, though a future commit may turn this into
a configurable parameter.

The high-level formula for estimating the cost of an HNSW index scan
is:

```
(entryLevel * m) + (layer0TuplesMax * layer0Selectivity)
```

where

- `(entryLevel * m)` is the lower bound of tuples to scan, as it
accounts for the graph traversal to layer 0 (L0). (L1 and above use ef=1.)
- `layer0TuplesMax` is an estimate of the maximum number of tuples to
scan at L0. This accounts for tuples that may end up being discarded
because they were already visited. Testing shows that the number of steps
until convergence is similar to the value of `hnsw.ef_search`, thus we can
estimate tuples max at `hnsw.ef_search * m * 2`.
- `layer0Selectivity` estimates the percentage of tuples that will
actually be scanned during the index traversal, multiplied by the scaling
factor.
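
As a rough illustration of the formula above, here is a minimal C sketch.
It is not the code from this commit: the function names and the
`baseSelectivity` input are hypothetical, since the message does not say
how the unscaled selectivity is derived. Only the `0.55` scaling factor
and the `hnsw.ef_search * m * 2` estimate come from the text.

```c
#include <assert.h>

/* Scaling factor hardcoded per the commit message. */
#define HNSW_SCALING_FACTOR 0.55

/* Estimated maximum tuples scanned at layer 0: hnsw.ef_search * m * 2. */
double
Layer0TuplesMax(int efSearch, int m)
{
	return (double) efSearch * m * 2;
}

/*
 * Estimated tuples scanned:
 *   (entryLevel * m) + (layer0TuplesMax * layer0Selectivity)
 *
 * baseSelectivity is a hypothetical input in [0, 1] standing in for the
 * unscaled fraction of layer 0 tuples scanned (an assumption of this
 * sketch, not something the commit message defines).
 */
double
EstimateHnswTuplesScanned(int entryLevel, int m, int efSearch,
						  double baseSelectivity)
{
	double		layer0Selectivity;

	assert(entryLevel >= 0 && m > 0 && efSearch > 0);
	assert(baseSelectivity >= 0.0 && baseSelectivity <= 1.0);

	/* Apply the constant scaling factor to the selectivity estimate. */
	layer0Selectivity = baseSelectivity * HNSW_SCALING_FACTOR;

	return (double) entryLevel * m +
		Layer0TuplesMax(efSearch, m) * layer0Selectivity;
}
```

For example, with `entryLevel = 3`, `m = 16`, `hnsw.ef_search = 40`, and an
assumed base selectivity of `0.5`, the sketch yields about
`48 + 1280 * 0.275 = 400` tuples.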

In addition to the `m` build parameter and `hnsw.ef_search`, cost
estimates can be influenced by standard PostgreSQL costing parameters,
though adjusting those (e.g. `random_page_cost`) should be done with
care.

Co-authored-by: @ankane
2024-09-25 14:01:33 -07:00
Andrew Kane
8e979ed377 Do not adjust index selectivity based on probes [skip ci] 2024-09-25 13:48:24 -07:00
Andrew Kane
87ac108bf7 Removed code for Postgres 12 [skip ci] 2024-09-23 15:26:31 -07:00
Andrew Kane
97cf990e0f Free TupleDesc [skip ci] 2024-09-21 19:15:34 -07:00
Andrew Kane
55dc735e1a Moved allocations out of GetScanItems [skip ci] 2024-09-21 19:10:25 -07:00
Andrew Kane
be4e9a9df2 Added macros for IvfflatScanList [skip ci] 2024-09-21 18:10:37 -07:00
Andrew Kane
d5e8fc96a5 Changed HnswPairingHeapNode to HnswSearchCandidate to reduce allocations and improve code 2024-09-21 12:07:44 -07:00
Andrew Kane
6d2af6d3f9 Improved code [skip ci] 2024-09-20 15:21:57 -07:00
Andrew Kane
a6ab5d07c0 Fixed CI 2024-09-19 20:50:51 -07:00
Andrew Kane
aa77346103 Improved code [skip ci] 2024-09-19 19:57:16 -07:00
Andrew Kane
b0da2d95d9 Fixed array_to_sparsevec on Windows [skip ci] 2024-09-19 19:52:16 -07:00
Andrew Kane
3fb05eb847 Added casts for arrays to sparsevec - #604
Co-authored-by: Narek Galstyan <narekg@berkeley.edu>
Co-authored-by: Di Qi <di@lantern.dev>
2024-09-19 19:17:05 -07:00
Andrew Kane
b738ffecc1 Dropped support for Postgres 12 2024-09-19 18:13:54 -07:00
Heikki Linnakangas
7117513532 Add error codes to a few errors (#657)
With elog(), you get XX000 "internal_error", which sounds scary.

It's not self-evident what the right error codes for some of these
errors are, but I tried to use my best judgment.
2024-09-19 18:04:23 -07:00