From 82f1ef2def8e580a8a9a943b33a2484f7c5a0f43 Mon Sep 17 00:00:00 2001
From: Ben Heckmann <79015931+benheckmann@users.noreply.github.com>
Date: Thu, 9 Jan 2025 14:35:59 +0100
Subject: [PATCH] DRAFT add Data Health Checker (#1574)

* #854 implement first data health checker draft

* #854 added support for qlib's data format, implemented factor check, reformatted summary

* adaptation current dataset

* format with black

* add data health check to docs

* fix sphinx error

* fix pylint error

* update code

* format with black

* format with pylint

---------

Co-authored-by: Linlang
---
 README.md                    |  10 ++
 docs/component/data.rst      |  51 +++++++++
 scripts/check_data_health.py | 203 +++++++++++++++++++++++++++++++++++
 3 files changed, 264 insertions(+)
 create mode 100644 scripts/check_data_health.py

diff --git a/README.md b/README.md
index e2810d328..4de34cf79 100644
--- a/README.md
+++ b/README.md
@@ -264,6 +264,16 @@ We recommend users to prepare their own data if they have a high-quality dataset
   * *trading_date*: start of trading day
   * *end_date*: end of trading day(not included)
 
+### Checking the health of the data
+  * We provide a script to check the health of the data, you can run the following commands to check whether the data is healthy or not.
+    ```
+    python scripts/check_data_health.py check_data --qlib_dir ~/.qlib/qlib_data/cn_data
+    ```
+  * Of course, you can also add some parameters to adjust the test results, such as this.
+    ```
+    python scripts/check_data_health.py check_data --qlib_dir ~/.qlib/qlib_data/cn_data --missing_data_num 30055 --large_step_threshold_volume 94485 --large_step_threshold_price 20
+    ```
+  * If you want more information about `check_data_health`, please refer to the [documentation](https://qlib.readthedocs.io/en/latest/component/data.html#checking-the-health-of-the-data).