Commit Graph

1612 Commits

Author SHA1 Message Date
Andrew Kane
e1c2d03dba Improved cost estimation [skip ci] 2024-09-28 16:04:07 -07:00
Andrew Kane
ba8e29600b Added todo [skip ci] 2024-09-28 15:24:21 -07:00
Andrew Kane
158d9340bc Added distance filters to cost tests [skip ci] 2024-09-28 14:50:23 -07:00
Andrew Kane
49e05fb5ba Updated readme [skip ci] 2024-09-28 13:13:10 -07:00
Andrew Kane
9b42662188 Only adjust cost if scanning less than half of the tuples [skip ci] 2024-09-28 12:21:20 -07:00
Andrew Kane
7265927fd6 Updated readme [skip ci] 2024-09-28 12:07:09 -07:00
Andrew Kane
ab57217f48 Added todo [skip ci] 2024-09-28 10:07:09 -07:00
Andrew Kane
5f6e031ccc Updated changelog [skip ci] 2024-09-28 09:57:05 -07:00
Andrew Kane
5ee0471ead Updated readme [skip ci] 2024-09-28 09:23:41 -07:00
Andrew Kane
54f8d9733d Updated default Postgres version in Dockerfile [skip ci] 2024-09-27 16:19:57 -07:00
Andrew Kane
cf419f448b Updated Postgres version for Docker [skip ci] 2024-09-27 16:19:06 -07:00
Andrew Kane
8a2eebd6a4 Added note about Postgres 17 on Windows - #669 [skip ci] 2024-09-27 14:05:36 -07:00
Andrew Kane
daf9c5c743 Updated package versions in readme [skip ci] 2024-09-27 13:52:11 -07:00
Andrew Kane
2bca4e406b Restored quarterly package version for FreeBSD in readme [skip ci] 2024-09-27 13:50:57 -07:00
Andrew Kane
74020a90da Updated package versions in readme [skip ci] 2024-09-27 13:49:43 -07:00
Andrew Kane
44d8d28b40 Added note about postgresql@17 formula [skip ci] 2024-09-27 13:39:54 -07:00
Andrew Kane
1a1221f905 Merge branch 'master' into hnsw-streaming 2024-09-26 08:34:07 -07:00
Andrew Kane
54fa16e3e3 Added safety check [skip ci] 2024-09-26 08:32:44 -07:00
Andrew Kane
40c3e402c7 Removed todo [skip ci] 2024-09-25 17:29:20 -07:00
Andrew Kane
058248fdcc Improved cost code [skip ci] 2024-09-25 17:23:29 -07:00
Andrew Kane
73c5145b77 Use int for ef [skip ci] 2024-09-25 16:52:41 -07:00
Andrew Kane
ec4a23fe49 Added cost estimation [skip ci] 2024-09-25 16:45:04 -07:00
Andrew Kane
38207f5640 Merge branch 'master' into hnsw-streaming 2024-09-25 16:09:09 -07:00
Andrew Kane
46de265a24 Updated changelog [skip ci] 2024-09-25 16:03:58 -07:00
Andrew Kane
b8c27914d4 Improved test [skip ci] 2024-09-25 15:58:41 -07:00
Andrew Kane
2d85af51a8 Added test for IVFFlat costs [skip ci] 2024-09-25 15:53:34 -07:00
Andrew Kane
e0ad441306 Added test for costs [skip ci] 2024-09-25 15:49:03 -07:00
Andrew Kane
5776a4d937 Only adjust for TOAST [skip ci] 2024-09-25 15:39:56 -07:00
Andrew Kane
242a12b7d5 Added same cost adjustment to HNSW as IVFFlat since TOAST not included in seq scan cost - #682 [skip ci] 2024-09-25 15:33:57 -07:00
Andrew Kane
1370dd6e86 Removed unneeded floor and fixed comment formatting [skip ci] 2024-09-25 14:13:02 -07:00
Andrew Kane
a100dc67e5 Ran pgindent [skip ci] 2024-09-25 14:03:51 -07:00
Jonathan S. Katz
2df9f24aad Update HNSW cost estimatation to utilize search and index info (#682)
Previously, the cost estimation formula for a HNSW index scan utilized
a methodology that only factored in the entry level for an HNSW scan
and the "m" index parameter, which reflects the number of tuples (or
vectors) to scan at each step of a HNSW graph traversal. While this
would bias the PostgreSQL query planner to choose an HNSW index scan
over other available paths, this could lead to potential suboptimal
index selection, for example, choosing to use a HNSW index instead of
an available B-tree index that has better selectivity.

The number of tuples scanned during HNSW graph traversal is principally
influenced by these factors:

 * The number of tuples stored in the index
 * `m` - the number of tuples that are scanned in each step of the graph
   traversal
 * `hnsw.ef_search` - which influences the total number of steps it
   takes for the scan to converge on the approximated nearest neighbors

Through testing different source models for vectors, we also observed
that the correlation of vectors in mdoels would impact this convergence.
For this first iteration, we've opted to hardcode a constant scaling
factor and set it to `0.55`, though a future commit may turn this into
a configurable parameter.

The high-level formula for estimating the cost of a HNSW index scan is
as such:

```
(entryLevel * m) + (layer0TuplesMax * layer0Selectivity)
```

where

- `(entryLevel * m)` is the lower bound of tuples to scan, as it
accounts for the graph traversal to layer 0 (L0). (L1 and above has an ef=1)
- `layer0TuplesMax` is an estimate of the maximum number of tuples to
scan at L0. This accounts for tuples that may end up being discarded due
to them already being visited. Testing shows that the number of steps
until converge is similar to the value of `hnsw.ef_search`, thus we can
estimate tuples max at `hnsw.ef_search * m * 2`
- `layer0Selectivity` - estimates the percentage of tuples that will
actually be scanned during the index traversal, multipled by the scaling
factor

In addition to the `m` build parameter and `hsnw.ef_search`, costs
estimates can be influenced by standard PostgreSQL costing parameters,
though adjusting those (e.g. `random_page_cost`) should be done with
care.

Co-authored-by: @ankane
2024-09-25 14:01:33 -07:00
Andrew Kane
8e979ed377 Do not adjust index selectivity based on probes [skip ci] 2024-09-25 13:48:24 -07:00
Andrew Kane
4e35c6abe3 Updated readme [skip ci] 2024-09-24 23:24:48 -07:00
Andrew Kane
77b3d1f2a8 Added test for join with attribute filtering [skip ci] 2024-09-24 23:21:34 -07:00
Andrew Kane
ecd0738728 Improved test [skip ci] 2024-09-24 23:13:30 -07:00
Andrew Kane
62ffc3641c Added test for join [skip ci] 2024-09-24 23:12:27 -07:00
Andrew Kane
11e4d040d9 Fixed test [skip ci] 2024-09-24 19:38:06 -07:00
Andrew Kane
87ac108bf7 Removed code for Postgres 12 [skip ci] 2024-09-23 15:26:31 -07:00
Andrew Kane
b2fa625255 Fixed crash with empty index [skip ci] 2024-09-23 09:42:31 -07:00
Andrew Kane
a8e699c927 Improved message [skip ci] 2024-09-22 22:31:48 -07:00
Andrew Kane
91541fece6 Fixed example [skip ci] 2024-09-22 22:28:39 -07:00
Andrew Kane
f3de487da2 Started readme updates [skip ci] 2024-09-22 22:26:27 -07:00
Andrew Kane
721d4b7e3f Improved test for ef_stream [skip ci] 2024-09-22 18:51:38 -07:00
Andrew Kane
28066d8fe4 Added test for ef_stream [skip ci] 2024-09-22 18:35:35 -07:00
Andrew Kane
495041e43b Added option to limit tuples [skip ci] 2024-09-22 18:10:19 -07:00
Andrew Kane
52c385c03a Only pass discarded when streaming [skip ci] 2024-09-22 17:47:10 -07:00
Andrew Kane
80cbd32dab Added streaming option for HNSW 2024-09-22 12:02:48 -07:00
Andrew Kane
97cf990e0f Free TupleDesc [skip ci] 2024-09-21 19:15:34 -07:00
Andrew Kane
55dc735e1a Moved allocations out of GetScanItems [skip ci] 2024-09-21 19:10:25 -07:00