Commit Graph

341 Commits

Author SHA1 Message Date
Andrew Kane
c46f078e3c Merge branch 'master' into hnsw-streaming 2024-10-10 01:14:33 -07:00
Andrew Kane
c0f6570c4a Debug updates [skip ci] 2024-10-08 22:58:09 -07:00
Andrew Kane
77688b4309 Improve total cost for cost estimation (#686) 2024-10-08 12:42:03 -07:00
Andrew Kane
8b9333d468 Merge branch 'master' into hnsw-streaming 2024-09-29 15:13:51 -07:00
Andrew Kane
e1c2d03dba Improved cost estimation [skip ci] 2024-09-28 16:04:07 -07:00
Andrew Kane
158d9340bc Added distance filters to cost tests [skip ci] 2024-09-28 14:50:23 -07:00
Andrew Kane
ec4a23fe49 Added cost estimation [skip ci] 2024-09-25 16:45:04 -07:00
Andrew Kane
38207f5640 Merge branch 'master' into hnsw-streaming 2024-09-25 16:09:09 -07:00
Andrew Kane
b8c27914d4 Improved test [skip ci] 2024-09-25 15:58:41 -07:00
Andrew Kane
2d85af51a8 Added test for IVFFlat costs [skip ci] 2024-09-25 15:53:34 -07:00
Andrew Kane
e0ad441306 Added test for costs [skip ci] 2024-09-25 15:49:03 -07:00
Jonathan S. Katz
2df9f24aad Update HNSW cost estimatation to utilize search and index info (#682)
Previously, the cost estimation formula for a HNSW index scan utilized
a methodology that only factored in the entry level for an HNSW scan
and the "m" index parameter, which reflects the number of tuples (or
vectors) to scan at each step of a HNSW graph traversal. While this
would bias the PostgreSQL query planner to choose an HNSW index scan
over other available paths, this could lead to potential suboptimal
index selection, for example, choosing to use a HNSW index instead of
an available B-tree index that has better selectivity.

The number of tuples scanned during HNSW graph traversal is principally
influenced by these factors:

 * The number of tuples stored in the index
 * `m` - the number of tuples that are scanned in each step of the graph
   traversal
 * `hnsw.ef_search` - which influences the total number of steps it
   takes for the scan to converge on the approximated nearest neighbors

Through testing different source models for vectors, we also observed
that the correlation of vectors in mdoels would impact this convergence.
For this first iteration, we've opted to hardcode a constant scaling
factor and set it to `0.55`, though a future commit may turn this into
a configurable parameter.

The high-level formula for estimating the cost of a HNSW index scan is
as such:

```
(entryLevel * m) + (layer0TuplesMax * layer0Selectivity)
```

where

- `(entryLevel * m)` is the lower bound of tuples to scan, as it
accounts for the graph traversal to layer 0 (L0). (L1 and above has an ef=1)
- `layer0TuplesMax` is an estimate of the maximum number of tuples to
scan at L0. This accounts for tuples that may end up being discarded due
to them already being visited. Testing shows that the number of steps
until converge is similar to the value of `hnsw.ef_search`, thus we can
estimate tuples max at `hnsw.ef_search * m * 2`
- `layer0Selectivity` - estimates the percentage of tuples that will
actually be scanned during the index traversal, multipled by the scaling
factor

In addition to the `m` build parameter and `hsnw.ef_search`, costs
estimates can be influenced by standard PostgreSQL costing parameters,
though adjusting those (e.g. `random_page_cost`) should be done with
care.

Co-authored-by: @ankane
2024-09-25 14:01:33 -07:00
Andrew Kane
8e979ed377 Do not adjust index selectivity based on probes [skip ci] 2024-09-25 13:48:24 -07:00
Andrew Kane
77b3d1f2a8 Added test for join with attribute filtering [skip ci] 2024-09-24 23:21:34 -07:00
Andrew Kane
ecd0738728 Improved test [skip ci] 2024-09-24 23:13:30 -07:00
Andrew Kane
62ffc3641c Added test for join [skip ci] 2024-09-24 23:12:27 -07:00
Andrew Kane
11e4d040d9 Fixed test [skip ci] 2024-09-24 19:38:06 -07:00
Andrew Kane
721d4b7e3f Improved test for ef_stream [skip ci] 2024-09-22 18:51:38 -07:00
Andrew Kane
28066d8fe4 Added test for ef_stream [skip ci] 2024-09-22 18:35:35 -07:00
Andrew Kane
80cbd32dab Added streaming option for HNSW 2024-09-22 12:02:48 -07:00
Andrew Kane
3fb05eb847 Added casts for arrays to sparsevec - #604
Co-authored-by: Narek Galstyan <narekg@berkeley.edu>
Co-authored-by: Di Qi <di@lantern.dev>
2024-09-19 19:17:05 -07:00
Andrew Kane
020d3edaa9 Changed warnings to errors for TAP tests [skip ci] 2024-07-27 12:28:03 -07:00
Andrew Kane
1e9e355175 Updated TAP tests to use PostgreSQL::Test packages [skip ci] 2024-07-27 12:24:15 -07:00
Andrew Kane
8684c2ba62 Updated formatting [skip ci] 2024-07-27 08:25:09 -07:00
Andrew Kane
8c5a4bfb6c Fixed failed to add index item error with sparsevec - fixes #625 2024-07-19 13:54:36 -07:00
Andrew Kane
24c8a2ff40 Fixed flaky tests [skip ci] 2024-04-29 13:54:30 -07:00
Andrew Kane
6df583a6f6 Fixed regression test for vector type 2024-04-29 13:48:04 -07:00
Andrew Kane
b52beefbc6 Added basic fuzz testing for input functions 2024-04-27 10:49:45 -07:00
Andrew Kane
6f2afb16ff Use consistent error message for sparsevec index out of bounds [skip ci] 2024-04-26 17:27:09 -07:00
Andrew Kane
1e94907179 Improved sparsevec error messages [skip ci] 2024-04-26 17:11:11 -07:00
Andrew Kane
c9fb66d54d Fixed flaky tests 2024-04-26 13:20:27 -07:00
Andrew Kane
48e68e5e42 Improved HNSW recall tests - #535 2024-04-26 13:08:48 -07:00
Andrew Kane
78d32943ac Added test for halfvec sum 2024-04-25 22:03:34 -07:00
Andrew Kane
cf494f15ac Added aggregate test for halfvec [skip ci] 2024-04-25 21:42:10 -07:00
Andrew Kane
1475c06902 Reordered TAP tests [skip ci] 2024-04-25 21:08:55 -07:00
Andrew Kane
7140a18283 Reordered TAP tests [skip ci] 2024-04-25 21:04:23 -07:00
Andrew Kane
7dcdaef96c Renamed TAP tests [skip ci] 2024-04-25 20:57:41 -07:00
Andrew Kane
708da0e058 Improved copy test [skip ci] 2024-04-25 15:39:47 -07:00
Andrew Kane
80d34830f6 Condensed regression tests [skip ci] 2024-04-25 15:35:36 -07:00
Andrew Kane
68ac05e11e Condensed regression tests [skip ci] 2024-04-25 15:30:38 -07:00
Andrew Kane
914f9aa04a Fixed flaky test [skip ci] 2024-04-25 11:57:40 -07:00
Andrew Kane
c67dc6f9b0 Added test for bit with duplicate centers 2024-04-25 10:29:28 -07:00
Andrew Kane
c39cb25c32 Fixed flaky tests [skip ci] 2024-04-24 22:26:08 -07:00
Andrew Kane
d244a040e1 Increased max sparsevec dimensions to 1B [skip ci] 2024-04-24 11:17:25 -07:00
Andrew Kane
c3448a25e2 Improved error messages for sparsevec input 2024-04-24 11:12:28 -07:00
Andrew Kane
9696835a19 Improved tests for sparsevec input [skip ci] 2024-04-24 09:58:27 -07:00
Andrew Kane
b2a5259607 Switched to strtoint for sparsevec input 2024-04-24 09:56:09 -07:00
Andrew Kane
c198fd58ee Added more tests for subvector function [skip ci] 2024-04-24 01:31:50 -07:00
Andrew Kane
8c408759dc Added more tests for subvector function [skip ci] 2024-04-24 01:28:25 -07:00
Heikki Linnakangas
14b351bc92 Fix integer overflow in subvector() function (#530)
`end = start + count` can overflow if `start` is very large. That
leads to a segfault later in the function. Add test case for it.
2024-04-24 01:20:16 -07:00