pgvector

mirror of https://github.com/pgvector/pgvector.git synced 2026-07-22 03:57:34 +08:00

Author	SHA1	Message	Date
Andrew Kane	c46f078e3c	Merge branch 'master' into hnsw-streaming	2024-10-10 01:14:33 -07:00
Andrew Kane	c0f6570c4a	Debug updates [skip ci]	2024-10-08 22:58:09 -07:00
Andrew Kane	77688b4309	Improve total cost for cost estimation (#686 )	2024-10-08 12:42:03 -07:00
Andrew Kane	8b9333d468	Merge branch 'master' into hnsw-streaming	2024-09-29 15:13:51 -07:00
Andrew Kane	e1c2d03dba	Improved cost estimation [skip ci]	2024-09-28 16:04:07 -07:00
Andrew Kane	158d9340bc	Added distance filters to cost tests [skip ci]	2024-09-28 14:50:23 -07:00
Andrew Kane	ec4a23fe49	Added cost estimation [skip ci]	2024-09-25 16:45:04 -07:00
Andrew Kane	38207f5640	Merge branch 'master' into hnsw-streaming	2024-09-25 16:09:09 -07:00
Andrew Kane	b8c27914d4	Improved test [skip ci]	2024-09-25 15:58:41 -07:00
Andrew Kane	2d85af51a8	Added test for IVFFlat costs [skip ci]	2024-09-25 15:53:34 -07:00
Andrew Kane	e0ad441306	Added test for costs [skip ci]	2024-09-25 15:49:03 -07:00
Jonathan S. Katz	2df9f24aad	Update HNSW cost estimatation to utilize search and index info (#682 ) Previously, the cost estimation formula for a HNSW index scan utilized a methodology that only factored in the entry level for an HNSW scan and the "m" index parameter, which reflects the number of tuples (or vectors) to scan at each step of a HNSW graph traversal. While this would bias the PostgreSQL query planner to choose an HNSW index scan over other available paths, this could lead to potential suboptimal index selection, for example, choosing to use a HNSW index instead of an available B-tree index that has better selectivity. The number of tuples scanned during HNSW graph traversal is principally influenced by these factors: * The number of tuples stored in the index * `m` - the number of tuples that are scanned in each step of the graph traversal * `hnsw.ef_search` - which influences the total number of steps it takes for the scan to converge on the approximated nearest neighbors Through testing different source models for vectors, we also observed that the correlation of vectors in mdoels would impact this convergence. For this first iteration, we've opted to hardcode a constant scaling factor and set it to `0.55`, though a future commit may turn this into a configurable parameter. The high-level formula for estimating the cost of a HNSW index scan is as such: ``` (entryLevel * m) + (layer0TuplesMax * layer0Selectivity) ``` where - `(entryLevel * m)` is the lower bound of tuples to scan, as it accounts for the graph traversal to layer 0 (L0). (L1 and above has an ef=1) - `layer0TuplesMax` is an estimate of the maximum number of tuples to scan at L0. This accounts for tuples that may end up being discarded due to them already being visited. Testing shows that the number of steps until converge is similar to the value of `hnsw.ef_search`, thus we can estimate tuples max at `hnsw.ef_search * m * 2` - `layer0Selectivity` - estimates the percentage of tuples that will actually be scanned during the index traversal, multipled by the scaling factor In addition to the `m` build parameter and `hsnw.ef_search`, costs estimates can be influenced by standard PostgreSQL costing parameters, though adjusting those (e.g. `random_page_cost`) should be done with care. Co-authored-by: @ankane	2024-09-25 14:01:33 -07:00
Andrew Kane	8e979ed377	Do not adjust index selectivity based on probes [skip ci]	2024-09-25 13:48:24 -07:00
Andrew Kane	77b3d1f2a8	Added test for join with attribute filtering [skip ci]	2024-09-24 23:21:34 -07:00
Andrew Kane	ecd0738728	Improved test [skip ci]	2024-09-24 23:13:30 -07:00
Andrew Kane	62ffc3641c	Added test for join [skip ci]	2024-09-24 23:12:27 -07:00
Andrew Kane	11e4d040d9	Fixed test [skip ci]	2024-09-24 19:38:06 -07:00
Andrew Kane	721d4b7e3f	Improved test for ef_stream [skip ci]	2024-09-22 18:51:38 -07:00
Andrew Kane	28066d8fe4	Added test for ef_stream [skip ci]	2024-09-22 18:35:35 -07:00
Andrew Kane	80cbd32dab	Added streaming option for HNSW	2024-09-22 12:02:48 -07:00
Andrew Kane	3fb05eb847	Added casts for arrays to sparsevec - #604 Co-authored-by: Narek Galstyan <narekg@berkeley.edu> Co-authored-by: Di Qi <di@lantern.dev>	2024-09-19 19:17:05 -07:00
Andrew Kane	020d3edaa9	Changed warnings to errors for TAP tests [skip ci]	2024-07-27 12:28:03 -07:00
Andrew Kane	1e9e355175	Updated TAP tests to use PostgreSQL::Test packages [skip ci]	2024-07-27 12:24:15 -07:00
Andrew Kane	8684c2ba62	Updated formatting [skip ci]	2024-07-27 08:25:09 -07:00
Andrew Kane	8c5a4bfb6c	Fixed failed to add index item error with sparsevec - fixes #625	2024-07-19 13:54:36 -07:00
Andrew Kane	24c8a2ff40	Fixed flaky tests [skip ci]	2024-04-29 13:54:30 -07:00
Andrew Kane	6df583a6f6	Fixed regression test for vector type	2024-04-29 13:48:04 -07:00
Andrew Kane	b52beefbc6	Added basic fuzz testing for input functions	2024-04-27 10:49:45 -07:00
Andrew Kane	6f2afb16ff	Use consistent error message for sparsevec index out of bounds [skip ci]	2024-04-26 17:27:09 -07:00
Andrew Kane	1e94907179	Improved sparsevec error messages [skip ci]	2024-04-26 17:11:11 -07:00
Andrew Kane	c9fb66d54d	Fixed flaky tests	2024-04-26 13:20:27 -07:00
Andrew Kane	48e68e5e42	Improved HNSW recall tests - #535	2024-04-26 13:08:48 -07:00
Andrew Kane	78d32943ac	Added test for halfvec sum	2024-04-25 22:03:34 -07:00
Andrew Kane	cf494f15ac	Added aggregate test for halfvec [skip ci]	2024-04-25 21:42:10 -07:00
Andrew Kane	1475c06902	Reordered TAP tests [skip ci]	2024-04-25 21:08:55 -07:00
Andrew Kane	7140a18283	Reordered TAP tests [skip ci]	2024-04-25 21:04:23 -07:00
Andrew Kane	7dcdaef96c	Renamed TAP tests [skip ci]	2024-04-25 20:57:41 -07:00
Andrew Kane	708da0e058	Improved copy test [skip ci]	2024-04-25 15:39:47 -07:00
Andrew Kane	80d34830f6	Condensed regression tests [skip ci]	2024-04-25 15:35:36 -07:00
Andrew Kane	68ac05e11e	Condensed regression tests [skip ci]	2024-04-25 15:30:38 -07:00
Andrew Kane	914f9aa04a	Fixed flaky test [skip ci]	2024-04-25 11:57:40 -07:00
Andrew Kane	c67dc6f9b0	Added test for bit with duplicate centers	2024-04-25 10:29:28 -07:00
Andrew Kane	c39cb25c32	Fixed flaky tests [skip ci]	2024-04-24 22:26:08 -07:00
Andrew Kane	d244a040e1	Increased max sparsevec dimensions to 1B [skip ci]	2024-04-24 11:17:25 -07:00
Andrew Kane	c3448a25e2	Improved error messages for sparsevec input	2024-04-24 11:12:28 -07:00
Andrew Kane	9696835a19	Improved tests for sparsevec input [skip ci]	2024-04-24 09:58:27 -07:00
Andrew Kane	b2a5259607	Switched to strtoint for sparsevec input	2024-04-24 09:56:09 -07:00
Andrew Kane	c198fd58ee	Added more tests for subvector function [skip ci]	2024-04-24 01:31:50 -07:00
Andrew Kane	8c408759dc	Added more tests for subvector function [skip ci]	2024-04-24 01:28:25 -07:00
Heikki Linnakangas	14b351bc92	Fix integer overflow in subvector() function (#530 ) `end = start + count` can overflow if `start` is very large. That leads to a segfault later in the function. Add test case for it.	2024-04-24 01:20:16 -07:00

1 2 3 4 5 ...

341 Commits