Commit Graph

177 Commits

Author SHA1 Message Date
Andrew Kane
c46f078e3c Merge branch 'master' into hnsw-streaming 2024-10-10 01:14:33 -07:00
Andrew Kane
c0f6570c4a Debug updates [skip ci] 2024-10-08 22:58:09 -07:00
Andrew Kane
77688b4309 Improve total cost for cost estimation (#686) 2024-10-08 12:42:03 -07:00
Andrew Kane
8b9333d468 Merge branch 'master' into hnsw-streaming 2024-09-29 15:13:51 -07:00
Andrew Kane
e1c2d03dba Improved cost estimation [skip ci] 2024-09-28 16:04:07 -07:00
Andrew Kane
158d9340bc Added distance filters to cost tests [skip ci] 2024-09-28 14:50:23 -07:00
Andrew Kane
ec4a23fe49 Added cost estimation [skip ci] 2024-09-25 16:45:04 -07:00
Andrew Kane
38207f5640 Merge branch 'master' into hnsw-streaming 2024-09-25 16:09:09 -07:00
Andrew Kane
b8c27914d4 Improved test [skip ci] 2024-09-25 15:58:41 -07:00
Andrew Kane
2d85af51a8 Added test for IVFFlat costs [skip ci] 2024-09-25 15:53:34 -07:00
Andrew Kane
e0ad441306 Added test for costs [skip ci] 2024-09-25 15:49:03 -07:00
Jonathan S. Katz
2df9f24aad Update HNSW cost estimatation to utilize search and index info (#682)
Previously, the cost estimation formula for a HNSW index scan utilized
a methodology that only factored in the entry level for an HNSW scan
and the "m" index parameter, which reflects the number of tuples (or
vectors) to scan at each step of a HNSW graph traversal. While this
would bias the PostgreSQL query planner to choose an HNSW index scan
over other available paths, this could lead to potential suboptimal
index selection, for example, choosing to use a HNSW index instead of
an available B-tree index that has better selectivity.

The number of tuples scanned during HNSW graph traversal is principally
influenced by these factors:

 * The number of tuples stored in the index
 * `m` - the number of tuples that are scanned in each step of the graph
   traversal
 * `hnsw.ef_search` - which influences the total number of steps it
   takes for the scan to converge on the approximated nearest neighbors

Through testing different source models for vectors, we also observed
that the correlation of vectors in mdoels would impact this convergence.
For this first iteration, we've opted to hardcode a constant scaling
factor and set it to `0.55`, though a future commit may turn this into
a configurable parameter.

The high-level formula for estimating the cost of a HNSW index scan is
as such:

```
(entryLevel * m) + (layer0TuplesMax * layer0Selectivity)
```

where

- `(entryLevel * m)` is the lower bound of tuples to scan, as it
accounts for the graph traversal to layer 0 (L0). (L1 and above has an ef=1)
- `layer0TuplesMax` is an estimate of the maximum number of tuples to
scan at L0. This accounts for tuples that may end up being discarded due
to them already being visited. Testing shows that the number of steps
until converge is similar to the value of `hnsw.ef_search`, thus we can
estimate tuples max at `hnsw.ef_search * m * 2`
- `layer0Selectivity` - estimates the percentage of tuples that will
actually be scanned during the index traversal, multipled by the scaling
factor

In addition to the `m` build parameter and `hsnw.ef_search`, costs
estimates can be influenced by standard PostgreSQL costing parameters,
though adjusting those (e.g. `random_page_cost`) should be done with
care.

Co-authored-by: @ankane
2024-09-25 14:01:33 -07:00
Andrew Kane
8e979ed377 Do not adjust index selectivity based on probes [skip ci] 2024-09-25 13:48:24 -07:00
Andrew Kane
77b3d1f2a8 Added test for join with attribute filtering [skip ci] 2024-09-24 23:21:34 -07:00
Andrew Kane
ecd0738728 Improved test [skip ci] 2024-09-24 23:13:30 -07:00
Andrew Kane
62ffc3641c Added test for join [skip ci] 2024-09-24 23:12:27 -07:00
Andrew Kane
11e4d040d9 Fixed test [skip ci] 2024-09-24 19:38:06 -07:00
Andrew Kane
721d4b7e3f Improved test for ef_stream [skip ci] 2024-09-22 18:51:38 -07:00
Andrew Kane
28066d8fe4 Added test for ef_stream [skip ci] 2024-09-22 18:35:35 -07:00
Andrew Kane
80cbd32dab Added streaming option for HNSW 2024-09-22 12:02:48 -07:00
Andrew Kane
020d3edaa9 Changed warnings to errors for TAP tests [skip ci] 2024-07-27 12:28:03 -07:00
Andrew Kane
1e9e355175 Updated TAP tests to use PostgreSQL::Test packages [skip ci] 2024-07-27 12:24:15 -07:00
Andrew Kane
8684c2ba62 Updated formatting [skip ci] 2024-07-27 08:25:09 -07:00
Andrew Kane
8c5a4bfb6c Fixed failed to add index item error with sparsevec - fixes #625 2024-07-19 13:54:36 -07:00
Andrew Kane
24c8a2ff40 Fixed flaky tests [skip ci] 2024-04-29 13:54:30 -07:00
Andrew Kane
b52beefbc6 Added basic fuzz testing for input functions 2024-04-27 10:49:45 -07:00
Andrew Kane
c9fb66d54d Fixed flaky tests 2024-04-26 13:20:27 -07:00
Andrew Kane
48e68e5e42 Improved HNSW recall tests - #535 2024-04-26 13:08:48 -07:00
Andrew Kane
78d32943ac Added test for halfvec sum 2024-04-25 22:03:34 -07:00
Andrew Kane
cf494f15ac Added aggregate test for halfvec [skip ci] 2024-04-25 21:42:10 -07:00
Andrew Kane
1475c06902 Reordered TAP tests [skip ci] 2024-04-25 21:08:55 -07:00
Andrew Kane
7140a18283 Reordered TAP tests [skip ci] 2024-04-25 21:04:23 -07:00
Andrew Kane
7dcdaef96c Renamed TAP tests [skip ci] 2024-04-25 20:57:41 -07:00
Andrew Kane
914f9aa04a Fixed flaky test [skip ci] 2024-04-25 11:57:40 -07:00
Andrew Kane
c67dc6f9b0 Added test for bit with duplicate centers 2024-04-25 10:29:28 -07:00
Andrew Kane
c39cb25c32 Fixed flaky tests [skip ci] 2024-04-24 22:26:08 -07:00
Andrew Kane
b2f7dad8a7 Removed support for L1 distance and Jaccard distance from ivfflat due to non-optimal clustering 2024-04-22 14:11:29 -07:00
Andrew Kane
f9c071a761 Improved tests for L1 distance with halfvec 2024-04-22 13:14:45 -07:00
Andrew Kane
9f4b770db3 Added support for indexing sparsevec with L1 distance [skip ci] 2024-04-22 13:08:12 -07:00
Andrew Kane
70b299a7ff Added support for indexing halfvec with L1 distance [skip ci] 2024-04-22 13:00:59 -07:00
Andrew Kane
af9d50481d Added support for indexing L1 distance 2024-04-22 12:44:03 -07:00
Andrew Kane
77ec24641e Fixed flaky test [skip ci] 2024-04-17 00:27:37 -07:00
Andrew Kane
c361f80465 Synced recall test [skip ci] 2024-04-16 17:27:06 -07:00
Andrew Kane
e730fef99d Fixed flaky test [skip ci] 2024-04-16 17:25:50 -07:00
Andrew Kane
04af15c9d6 Added support for bit to IVFFlat 2024-04-16 17:12:27 -07:00
Andrew Kane
31dfd3d1a6 Added type to assertion message [skip ci] 2024-04-16 13:11:37 -07:00
Andrew Kane
8df8dd01b9 Added halfvec to distance functions TAP test 2024-04-16 13:09:44 -07:00
Andrew Kane
b9b30cc16e Added function name to assertion message [skip ci] 2024-04-16 12:20:22 -07:00
Andrew Kane
e1565af0dc Added TAP test for sparsevec distance functions 2024-04-16 12:11:54 -07:00
Andrew Kane
26d6fbb1d0 Updated comment [skip ci] 2024-04-16 11:45:19 -07:00