mirror of https://github.com/microsoft/qlib.git synced 2026-06-29 09:01:18 +08:00

Compare commits

...

50 Commits

Author SHA1 Message Date
you-n-g
27f476b311 Update __init__.py 2023-06-26 00:00:46 +08:00
you-n-g
0e61cac6a8 Update release-drafter.yml (#1569)
* Update release-drafter.yml

* Update release-drafter.yml
2023-06-25 23:48:37 +08:00
Linlang
21f0b394e7 change get_data url (#1558)
* change_url

* fix_CI

* fix_CI_2

* fix_CI_3

* fix_CI_4

* fix_CI_5

* fix_CI_6

* fix_CI_7

* fix_CI_8

* fix_CI_9

* fix_CI_10

* fix_CI_11

* fix_CI_12

* fix_CI_13

* fix_CI_13

* fix_CI_14

* fix_CI_15

* fix_CI_16

* fix_CI_17

* fix_CI_18

* fix_CI_19

* fix_CI_20

* fix_CI_21

* fix_CI_22

* fix_CI_23

* fix_CI_24

* fix_CI_25

* fix_CI_26

* fix_CI_27

* fix_get_data_error

* fix_get_data_error2

* modify_get_data

* modify_get_data2

* modify_get_data3

* modify_get_data4

* fix_CI_28

* fix_CI_29

* fix_CI_30

---------

Co-authored-by: Linlang <v-linlanglv@microsoft.com>
2023-06-25 23:39:11 +08:00
Wendi Li
cd4ab998fb Update on Dynamic Benchmark (#1539)
* move config file to benchmark_dynamic & switch default sim task model to GBDT

* Update benchmark_dynamic results

* Change the default value of alpha of DDG-DA
2023-06-03 08:42:24 +08:00
you-n-g
0e9ac9dce7 Fix CI (#1529) 2023-05-31 08:39:52 +08:00
yaxuan999
efffb2819a added KRNN and Sandwich models and their example results based on Alpha360 (#1414)
* Update README.md

updated the result of KRNN and Sandwich models based on Alpha360

* Update README.md

* Update README.md

* Add files via upload

* Update README.md

* Update README.md

* Update README.md

* Add files via upload

* Delete pytorch_krnn.py

* Delete pytorch_sandwich.py

* Add files via upload

* Update pytorch_sandwich.py

* Update pytorch_krnn.py

* Update pytorch_sandwich.py

* Update pytorch_krnn.py

* Update README.md

* Update README.md

* Update requirements.txt

* Update requirements.txt

* Update README.md

* Update README.md

* Update pytorch_sandwich.py

* Update link on index

---------

Co-authored-by: Young <afe.young@gmail.com>
2023-05-26 18:42:58 +08:00
Fivele-Li
19a0eb78bc Fix TCN model input dimension mismatch (#1520)
* transpose dimension 1 and 2 to match nn.Conv1d input

* 1.update TCN benchmarks;
2.Emphasize updating the benchmark table;

* replace specific version with main

---------

Co-authored-by: lijinhui <362237642@qq.com>
2023-05-26 14:44:34 +08:00
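The fix in this commit rests on the layout `nn.Conv1d` expects, `(batch, channels, length)`. Assuming the features arrive as `(batch, length, features)`, the last two axes have to be swapped before the convolution. A framework-free sketch of that swap on nested lists follows; the helper `swap_axes_1_2` is hypothetical and not part of Qlib or PyTorch.

```python
from typing import List

Tensor3D = List[List[List[float]]]


def swap_axes_1_2(batch: Tensor3D) -> Tensor3D:
    """Swap axes 1 and 2 of a nested-list 'tensor' (hypothetical helper)."""
    # zip(*sample) turns the rows of each sample into columns.
    return [[list(column) for column in zip(*sample)] for sample in batch]


# One sample, 3 time steps, 2 features: shape (1, 3, 2).
x = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
y = swap_axes_1_2(x)

# Shape after the swap: (1, 2, 3), i.e. (batch, channels, length).
print(len(y), len(y[0]), len(y[0][0]))
```

Applying the swap twice returns the original layout, which is a quick sanity check when debugging a dimension mismatch.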
Fivele-Li
370477288d fix_DDG-DA_workflow_bug (#1516)
* 1.specify group_keys=False to avoid FutureWarning;
2.fix get train_start from dict unexpected problem;

* fix black

* Add comments

* Add make file

---------

Co-authored-by: Young <afe.young@gmail.com>
2023-05-24 15:49:58 +08:00
you-n-g
94268619c4 Update README.md 2023-05-23 09:50:00 +08:00
Huoran Li
8d60a6a02b Resolve RL FIXMES (#1503)
* Solve several small FIXMEs left in RL

* Add TODO in example

* Minor bugfix

* black
2023-05-17 16:57:08 +08:00
Fivele-Li
7234308651 Add base config in yml (#1500)
* path on Windows contains double '/' which may cause open file failed.

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* add baseConfig in yml,user can add new keys or update/drop keys in baseConfig;

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* pip release version 23.1 on Apr.15 2023, CI failed to run, Please refer to #1495 for detailed logs. The pip version has been temporarily fixed to 23.0.1.

* 1.Search for baseConfig in multiple directories;
2.Add user instructions in qrun;

* fix format with black

* 1.modify baseConfig key to BASE_CONFIG_PATH;
2.only find config file in absolute path and relative path;

* load BASE_CONFIG_PATH on absolute path & relative path;

* fix Lint with black

---------

Co-authored-by: lijinhui <362237642@qq.com>
2023-05-12 17:35:37 +08:00
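The commit message describes a base config whose keys a workflow file can add to or update. A minimal sketch of that kind of layering is a recursive dictionary merge. The function `merge_config` and the sample values are assumptions made for illustration; this is not the code Qlib uses to resolve `BASE_CONFIG_PATH`, and dropping keys is not modelled here.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_config(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where `override` is layered on top of `base` (hypothetical helper)."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # New key, or a non-mapping value: the override wins.
            merged[key] = deepcopy(value)
    return merged


# Illustrative values only.
base_config = {
    "qlib_init": {"region": "cn"},
    "task": {"model": {"class": "LGBModel", "kwargs": {"lr": 0.1}}},
}
user_config = {"task": {"model": {"kwargs": {"lr": 0.05, "max_depth": 8}}}}

config = merge_config(base_config, user_config)
print(config["task"]["model"])
```

The merge copies rather than mutates, so the base config can be reused by several workflow files.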
Chaoying
acf5df27ce Add support for redis password (#1508) 2023-05-08 16:17:15 +08:00
Chaoying
37a59f28d3 Fix deprecated syntax in numpy (#1507)
* Fix deprecated syntax in numpy

* Replace np.bool with bool
2023-05-08 16:17:02 +08:00
YQ Tsui
b084c352f5 provide dtype to empty series to suppress warning; fix type (#1449) 2023-05-05 17:47:44 +08:00
Maksim Zayakin
9e22e5168b Remove unused DNNModelPytorch params (#1470)
* Remove lr_decay and lr_decay_steps params

More flexible way to pass a scheduler (via callable function) is already
supported

* remove lr_decay and lr_decay_steps from mlp workflow configs
2023-04-28 17:48:40 +08:00
Fivele-Li
dceff7b471 Specify the tianshou version to match the dev environment to avoid the error in issue #1477. (#1502) 2023-04-28 13:50:25 +08:00
Huoran Li
7f1e8c5206 Refine Qlib RL data format (#1480)
* wip

* wip

* wip

* Fix naming errors

* Backtest test passed

* Why training stuck?

* Minor

* Refine train configs

* Use dummy in training

* Remove pickle_dataframe

* CI

* CI

* Add more strict condition to filter orders

* Pass test

* Add TODO in example

---------

Co-authored-by: Young <afe.young@gmail.com>
2023-04-26 21:14:30 +08:00
Fivele-Li
46264dfec9 normpath for Windows (#1495)
* path on Windows contains double '/' which may cause open file failed.

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* locate import numpy error

* pip release version 23.1 on Apr.15 2023, CI failed to run, Please refer to #1495 for detailed logs. The pip version has been temporarily fixed to 23.0.1.

---------

Co-authored-by: lijinhui <362237642@qq.com>
2023-04-26 16:26:12 +08:00
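The problem named in this commit, a path containing a doubled separator, is what path normalization removes. A small sketch using the standard library follows; the paths are made up, and `ntpath`/`posixpath` are used directly so the result does not depend on the machine running it (`os.path` resolves to one of the two depending on platform).

```python
import ntpath
import posixpath

# Hypothetical Windows-style path with a doubled '/'.
windows_path = "C:/Users/me//.qlib/qlib_data/cn_data"
normalized_windows = ntpath.normpath(windows_path)

# Hypothetical POSIX path with a doubled '/' and a redundant './'.
posix_path = "/home/me//.qlib/./qlib_data/cn_data"
normalized_posix = posixpath.normpath(posix_path)

print(normalized_windows)
print(normalized_posix)
```

On Windows the normalized form also converts forward slashes to backslashes, so comparisons against un-normalized strings should be avoided.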
Fivele-Li
754799ab05 update ubuntu CI version; (#1488)
* update ubuntu CI version;
(End of standard support for 18.04 LTS - 31 May 2023)

* update ubuntu CI version;

---------

Co-authored-by: lijinhui <362237642@qq.com>
2023-04-10 17:06:48 +08:00
you-n-g
32c3070b73 Refine DDG-DA (#1472)
* Run ddg-da successfully

* Support include valid; More parameters

* Support L2 reg & visualization

* Blackformat

* Enable fill_method

* Support specify handler & optim dataset

* Fix Pylint
2023-04-07 15:00:21 +08:00
you-n-g
40de67265a Update Docs about some concepts in DataHandler (#1485) 2023-04-07 10:02:16 +08:00
saurabh dave
e6f9a94fc5 fix: removed extra blank link between sections (#1451) 2023-04-03 17:32:01 +08:00
Fivele-Li
73937863f1 Merge pull request #1475 from qianyun210603/bugfix
[BUGFIX] potential file// url parsing error
2023-03-24 11:22:57 +08:00
BookSword
d010219ba6 Merge branch 'main' into bugfix 2023-03-23 16:11:19 +08:00
BookSword
4fc8a5f25f merge 2023-03-23 16:05:09 +08:00
Linlang
0e8bfcb5d3 fix_pylint_w0719 (#1463)
* fix_pylint_w0719

* remove_fixme
2023-03-17 19:25:49 +08:00
you-n-g
e457ca8511 Improve annotation & documentation for handler (#1312)
* Improve annotation & documentation for handler

* Add type
2023-03-15 21:15:40 +08:00
Huoran Li
4dbb8ecb86 Remove (#1464) 2023-03-15 15:26:44 +08:00
Huoran Li
653c082e7a Order execution open source (#1447)
* Waiting for bin data

* Complete readme

* CI

* Add inst filter by time

* Update qlib/data/dataset/processor.py

* typo

* Fix time filter bug

* Add Filter and set Universe

* Complete data pipeline

* Fix Provider Logger Info Args

* Add DQN; a minor bugfix in ppo reward.

* update readme. modify assertion logic in strategy check.

* Fix Doc issues and fix black

* Fix pylint Error

---------

Co-authored-by: Young <afe.young@gmail.com>
Co-authored-by: you-n-g <you-n-g@users.noreply.github.com>
2023-03-13 12:06:28 +08:00
you-n-g
f98e04ca9d Fix Field Name Error 2023-03-03 16:28:47 +08:00
Cadenza-Li
76f2fb1a1a Add ipynb format check (#1439)
* Update test_qlib_from_source.yml

* add ipynb format check to workflow

* test ipynb CI

* modify nbqa check path

* add pylint flake8 mypy check to ipynb

* check ipynb with black and pylint

* reformat .ipynb files

* format line length

nbqa black . -l 120

* update nbqa .ipynb format CI

* format old ipynb files

* add nbconvert check to CI

* adjust CI order to avoid repeating download data
2023-02-21 09:23:22 +08:00
Huoran Li
5eb5ac1f1f RL backtest pipeline on 5-min data (#1417)
* Workflow runnable

* CI

* Slight changes to make the workflow runnable. The changes of handler/provider should be reverted before merging.

* Train experiment successful

* Refine handler & provider

* test passed

* Ready to test on server

* Minor

* Test passed

* TWAP training

* Add PPOReward

* Add a FIXME

* Refine PPO reward according to PR comments

* Minor

* Resolve PR comments

* CI issues

* CI issues

* CI issues
2023-02-13 12:43:22 +08:00
Young
6295939346 Update to Dev Version 2023-01-29 18:55:23 +08:00
Young
5f3e322784 Update Version 2023-01-29 18:53:25 +08:00
you-n-g
691b7f1f60 Remove Json
Because it is a standard library of Python.
2023-01-20 09:03:08 +08:00
Huoran Li
d8fc9aea6b RL Training pipeline on 5-min data (#1415)
* Workflow runnable

* CI

* Slight changes to make the workflow runnable. The changes of handler/provider should be reverted before merging.

* Train experiment successful

* Refine handler & provider

* CI issues

* Resolve PR comments

* Resolve PR comments

* CI issues

* Fix test issue

* Black
2023-01-18 16:17:06 +08:00
YQ Tsui
d8764660dc [BUGFIX] allow sell in limit-up case and allow buy in limit-down case in topk strategy (#1407)
* 1) check limit_up/down should consider direction; 2) fix some typo, typehint etc

* fix error

* Update test_all_pipeline.py

Believe it's just some arbitrary number.
The excess return is expected to change when trading logic changes.

* add flag forbid_all_trade_at_limit to keep previous behaviour for backward compatibility
2023-01-10 09:46:18 +08:00
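The rule this commit describes can be sketched as a direction-aware check: at limit-up only buying is blocked, at limit-down only selling is blocked, and `forbid_all_trade_at_limit` restores the older behaviour of blocking both. The function `is_tradable`, the `Direction` enum, and the way the price change is passed in are assumptions for illustration, not Qlib's actual exchange code; the default threshold of 0.095 is taken from the configs later on this page.

```python
from enum import Enum


class Direction(Enum):
    BUY = "buy"
    SELL = "sell"


def is_tradable(
    change: float,
    direction: Direction,
    limit_threshold: float = 0.095,
    forbid_all_trade_at_limit: bool = False,
) -> bool:
    """Hypothetical check of whether an order can trade given the day's price change."""
    at_limit_up = change >= limit_threshold
    at_limit_down = change <= -limit_threshold
    if forbid_all_trade_at_limit:
        # Older behaviour: no trading in either direction at a limit.
        return not (at_limit_up or at_limit_down)
    if direction is Direction.BUY:
        # Buying is only blocked when the price is at limit-up.
        return not at_limit_up
    # Selling is only blocked when the price is at limit-down.
    return not at_limit_down


print(is_tradable(0.10, Direction.SELL))  # selling at limit-up
print(is_tradable(0.10, Direction.BUY))   # buying at limit-up
```

Keeping the flag as an explicit parameter makes the backward-compatible path easy to test alongside the new one.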
Linlang
7f08e6c7b3 fix subprocess.check_output bug (#1409)
* fix_check_output_bug

* change_log_info

* recover_feature
2023-01-06 21:44:23 +08:00
Linlang
0f3abfed74 fix_labeler_bug (#1406) 2023-01-03 14:10:56 +08:00
Huoran Li
44ce91ee9d Simple RL notebook (#1395)
* Simple RL notebook

* Add link to the notebook

Co-authored-by: Young <afe.young@gmail.com>
2023-01-03 00:17:18 +08:00
Wendi Li
ebb8ec34f3 [DDG-DA] Update crowd-sourced data results (#1405)
* [DDG-DA] Update crowd-sourced data experiments

* Remove internal data version

* Modify README
2023-01-03 00:15:50 +08:00
YQ Tsui
4fe3ffccfd fix typo, staticmethod etc. (#1402)
* config.py: fix typo; static method

* fix typo in qlib/utils/paral

* 1) limit numpy version as numba support for 1.24+ has not been released; 2) no need to use custom numba version for pytest.

* remove useless argument

Co-authored-by: you-n-g <you-n-g@users.noreply.github.com>
2022-12-31 08:02:05 +08:00
YQ Tsui
2f5ce3dc01 Plot enhancement (#1390)
* horizontally put the bar figures

* 1) use rangebreaks to handle gaps in datetime axis instead of make them string; 2) allow simultaneously plot rankic in ic_figure

* pylint improvement

* fix black lint

* better axis formatting

* default not show gaps

* resolve doc built error

* fix pylint

* Update qlib/contrib/report/analysis_model/analysis_model_performance.py

More detailed description

Co-authored-by: you-n-g <you-n-g@users.noreply.github.com>

* Update qlib/contrib/report/analysis_model/analysis_model_performance.py

for Python backward compatibility

Co-authored-by: you-n-g <you-n-g@users.noreply.github.com>

* add doc string

* fix black

* 1) limit numpy version as numba support for 1.24+ has not been released; 2) no need to use custom numba version for pytest.

Co-authored-by: you-n-g <you-n-g@users.noreply.github.com>
2022-12-31 07:58:41 +08:00
Linlang
756bd0f65b Fix ZScoreNorm processor bug (#1398)
* fix_ZScoreNorm_bug

* fix_CI_error

* fix_CI_error

* add_test_processor

* fix_pylint_error

* fix_some_error_and_optimize_code

* modify_terrible_code

* optimize_code

* optimize_code
2022-12-30 20:42:37 +08:00
Linlang
667fb0e4d9 add label to PR Automatically (#1393)
* auto_add_label

* add md file to rule

* change name and rules

* change_label_name

* change_rule_syntax

* change match rule

* change label name
2022-12-17 00:12:33 +08:00
you-n-g
f326f83fae Remove Wrong Package Name (#1394)
* Remove Wrong Package Name

* Update requirements.txt
2022-12-16 08:10:36 +08:00
Chia-hung Tai
cbd69fb0ed The limit threshold in Taiwan stock market is also 10%. (#1391)
* The limit threshold in Taiwan stock market is also 10%.

* Warning limit_threshold when it is None.
2022-12-12 21:37:01 +08:00
YQ Tsui
5e3924d7a6 fix some typo in doc/comments (#1389)
* fix typo in docstrings

* fix typo

* fix typo

* fix black lint

* fix black lint
2022-12-11 14:29:16 +08:00
Linlang
57f9813f85 optimize_yahoo_collector (#1388) 2022-12-11 12:05:54 +08:00
Young
26d24b5b23 Bump to Dev Version 2022-12-09 18:21:39 +08:00
137 changed files with 4025 additions and 1153 deletions

6
.github/labeler.yml vendored Normal file
View File

@@ -0,0 +1,6 @@
documentation:
- 'docs/**/*'
- '**/*.md'
waiting for triage:
- any: ['**/*', '!docs/**/*', '!**/*.md']

View File

@@ -14,6 +14,9 @@ categories:
label:
- 'doc'
- 'documentation'
- title: '🧹 Maintenance'
label:
- 'maintenance'
change-template: '- $TITLE @$AUTHOR (#$NUMBER)'
change-title-escapes: '\<*_&' # You can add # and @ to disable mentions, and add ` to disable code blocks.
version-resolver:
@@ -30,4 +33,4 @@ version-resolver:
template: |
## Changes
$CHANGES
$CHANGES

14
.github/workflows/labeler.yml vendored Normal file
View File

@@ -0,0 +1,14 @@
name: "Add label automatically"
on:
- pull_request_target
jobs:
triage:
permissions:
contents: read
pull-requests: write
runs-on: ubuntu-latest
steps:
- uses: actions/labeler@v4
with:
repo-token: "${{ secrets.GITHUB_TOKEN }}"

View File

@@ -13,7 +13,7 @@ jobs:
runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [windows-latest, ubuntu-18.04, ubuntu-20.04, macos-11, macos-latest]
os: [windows-latest, ubuntu-20.04, ubuntu-22.04, macos-11, macos-latest]
# not supporting 3.6 due to annotations is not supported https://stackoverflow.com/a/52890129
python-version: [3.7, 3.8]

View File

@@ -14,22 +14,34 @@ jobs:
runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [windows-latest, ubuntu-18.04, ubuntu-20.04, macos-11, macos-latest]
os: [windows-latest, ubuntu-20.04, ubuntu-22.04, macos-11, macos-latest]
# not supporting 3.6 due to annotations is not supported https://stackoverflow.com/a/52890129
python-version: [3.7, 3.8]
steps:
- name: Test qlib from source
uses: actions/checkout@v2
uses: actions/checkout@v3
# Since version 3.7 of python for MacOS is installed in CI, version 3.7.17, this version causes "_bz not found error".
# So we make the version number of python 3.7 for MacOS more specific.
# refs: https://github.com/actions/setup-python/issues/682
- name: Set up Python ${{ matrix.python-version }}
if: (matrix.os == 'macos-latest' && matrix.python-version == '3.7') || (matrix.os == 'macos-11' && matrix.python-version == '3.7')
uses: actions/setup-python@v4
with:
python-version: "3.7.16"
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
if: (matrix.os != 'macos-latest' || matrix.python-version != '3.7') && (matrix.os != 'macos-11' || matrix.python-version != '3.7')
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Update pip to the latest version
# pip release version 23.1 on Apr.15 2023, CI failed to run, Please refer to #1495 for detailed logs.
# The pip version has been temporarily fixed to 23.0
run: |
python -m pip install --upgrade pip
python -m pip install pip==23.0
- name: Installing pytorch for macos
if: ${{ matrix.os == 'macos-11' || matrix.os == 'macos-latest' }}
@@ -37,15 +49,13 @@ jobs:
python -m pip install torch torchvision torchaudio
- name: Installing pytorch for ubuntu
if: ${{ matrix.os == 'ubuntu-18.04' || matrix.os == 'ubuntu-20.04' }}
if: ${{ matrix.os == 'ubuntu-20.04' || matrix.os == 'ubuntu-22.04' }}
run: |
python -m pip install --upgrade pip
python -m pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
- name: Installing pytorch for windows
if: ${{ matrix.os == 'windows-latest' }}
run: |
python -m pip install --upgrade pip
python -m pip install torch torchvision torchaudio
- name: Set up Python tools
@@ -120,12 +130,16 @@ jobs:
run: |
mypy qlib --install-types --non-interactive || true
mypy qlib --verbose
- name: Check Qlib ipynb with nbqa
run: |
nbqa black . -l 120 --check --diff
nbqa pylint . --disable=C0104,C0114,C0115,C0116,C0301,C0302,C0411,C0413,C1802,R0401,R0801,R0902,R0903,R0911,R0912,R0913,R0914,R0915,R1720,W0105,W0123,W0201,W0511,W0613,W1113,W1514,E0401,E1121,C0103,C0209,R0402,R1705,R1710,R1725,R1735,W0102,W0212,W0221,W0223,W0231,W0237,W0612,W0621,W0622,W0703,W1309,E1102,E1136,W0719,W0104,W0404,C0412,W0611,C0410 --const-rgx='[a-z_][a-z0-9_]{2,30}$'
- name: Test data downloads
run: |
python scripts/get_data.py qlib_data --name qlib_data_simple --target_dir ~/.qlib/qlib_data/cn_data --interval 1d --region cn
azcopy copy https://qlibpublic.blob.core.windows.net/data/rl /tmp/qlibpublic/data --recursive
mv /tmp/qlibpublic/data tests/.data
python scripts/get_data.py download_data --file_name rl_data.zip --target_dir tests/.data/rl
- name: Install Lightgbm for MacOS
if: ${{ matrix.os == 'macos-11' || matrix.os == 'macos-latest' }}
@@ -138,12 +152,15 @@ jobs:
brew unlink libomp
brew install libomp.rb
# Run after data downloads
- name: Check Qlib ipynb with nbconvert
run: |
# add more ipynb files in future
jupyter nbconvert --to notebook --execute examples/workflow_by_code.ipynb
- name: Test workflow by config (install from source)
run: |
# Version 0.52.0 of numba must be installed manually in CI, otherwise it will cause incompatibility with the latest version of numpy.
python -m pip install numba==0.52.0
# You must update numpy manually, because when installing python tools, it will try to uninstall numpy and cause CI to fail.
python -m pip install --upgrade numpy
python -m pip install numba
python qlib/workflow/cli.py examples/benchmarks/LightGBM/workflow_config_lightgbm_Alpha158.yaml
- name: Unit tests with Pytest
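
The two `Set up Python` steps in this workflow carry `if` conditions that are meant to be complementary, so that every matrix entry runs exactly one of them. A quick way to convince oneself is to enumerate the matrix and evaluate both conditions. The sketch below transcribes the conditions into Python; modelling the Python versions as strings is an assumption of the sketch, not a statement about how GitHub Actions compares values.

```python
from itertools import product

# Matrix values as they appear in the updated workflow.
OSES = ["windows-latest", "ubuntu-20.04", "ubuntu-22.04", "macos-11", "macos-latest"]
PYTHONS = ["3.7", "3.8"]


def pinned_step(os_name: str, py: str) -> bool:
    """Condition of the step that pins Python 3.7.16 on macOS."""
    return (os_name == "macos-latest" and py == "3.7") or (os_name == "macos-11" and py == "3.7")


def default_step(os_name: str, py: str) -> bool:
    """Condition of the step that uses the matrix version as-is."""
    return (os_name != "macos-latest" or py != "3.7") and (os_name != "macos-11" or py != "3.7")


matrix = list(product(OSES, PYTHONS))
pinned = [entry for entry in matrix if pinned_step(*entry)]
# Exactly one of the two steps should run for every matrix entry.
exactly_one = all(pinned_step(*entry) != default_step(*entry) for entry in matrix)

print(pinned)
print(exactly_one)
```

The second condition is the De Morgan negation of the first, which is why the two steps never both run.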

View File

@@ -14,23 +14,34 @@ jobs:
runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [windows-latest, ubuntu-18.04, ubuntu-20.04, macos-11, macos-latest]
os: [windows-latest, ubuntu-20.04, ubuntu-22.04, macos-11, macos-latest]
# not supporting 3.6 due to annotations is not supported https://stackoverflow.com/a/52890129
python-version: [3.7, 3.8]
steps:
- name: Test qlib from source slow
uses: actions/checkout@v2
uses: actions/checkout@v3
# Since version 3.7 of python for MacOS is installed in CI, version 3.7.17, this version causes "_bz not found error".
# So we make the version number of python 3.7 for MacOS more specific.
# refs: https://github.com/actions/setup-python/issues/682
- name: Set up Python ${{ matrix.python-version }}
if: (matrix.os == 'macos-latest' && matrix.python-version == '3.7') || (matrix.os == 'macos-11' && matrix.python-version == '3.7')
uses: actions/setup-python@v4
with:
python-version: "3.7.16"
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
if: (matrix.os != 'macos-latest' || matrix.python-version != '3.7') && (matrix.os != 'macos-11' || matrix.python-version != '3.7')
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Set up Python tools
# pip release version 23.1 on Apr.15 2023, CI failed to run, Please refer to #1495 for detailed logs.
# The pip version has been temporarily fixed to 23.0
run: |
python -m pip install --upgrade pip
# python -m pip is necessary to upgrade pip.
python -m pip install pip==23.0
pip install --upgrade cython numpy
pip install -e .[dev]

3
.gitignore vendored
View File

@@ -10,7 +10,6 @@ _build
build/
dist/
*.pkl
*.hd5
*.csv
@@ -27,6 +26,8 @@ examples/estimator/estimator_example/
examples/rl/data/
examples/rl/checkpoints/
examples/rl/outputs/
examples/rl_order_execution/data/
examples/rl_order_execution/outputs/
*.egg-info/

View File

@@ -11,6 +11,8 @@
Recent released features
| Feature | Status |
| -- | ------ |
| KRNN and Sandwich models | :chart_with_upwards_trend: [Released](https://github.com/microsoft/qlib/pull/1414/) on May 26, 2023 |
| Release Qlib v0.9.0 | :octocat: [Released](https://github.com/microsoft/qlib/releases/tag/v0.9.0) on Dec 9, 2022 |
| RL Learning Framework | :hammer: :chart_with_upwards_trend: Released on Nov 10, 2022. [#1332](https://github.com/microsoft/qlib/pull/1332), [#1322](https://github.com/microsoft/qlib/pull/1322), [#1316](https://github.com/microsoft/qlib/pull/1316),[#1299](https://github.com/microsoft/qlib/pull/1299),[#1263](https://github.com/microsoft/qlib/pull/1263), [#1244](https://github.com/microsoft/qlib/pull/1244), [#1169](https://github.com/microsoft/qlib/pull/1169), [#1125](https://github.com/microsoft/qlib/pull/1125), [#1076](https://github.com/microsoft/qlib/pull/1076)|
| HIST and IGMTF models | :chart_with_upwards_trend: [Released](https://github.com/microsoft/qlib/pull/1040) on Apr 10, 2022 |
| Qlib [notebook tutorial](https://github.com/microsoft/qlib/tree/main/examples/tutorial) | 📖 [Released](https://github.com/microsoft/qlib/pull/1037) on Apr 7, 2022 |
@@ -41,13 +43,11 @@ Features released before 2021 are not listed here.
<img src="http://fintech.msra.cn/images_v070/logo/1.png" />
</p>
Qlib is an open-source, AI-oriented quantitative investment platform that aims to realize the potential, empower research, and create value using AI technologies in quantitative investment, from exploring ideas to implementing productions. Qlib supports diverse machine learning modeling paradigms, including supervised learning, market dynamics modeling, and reinforcement learning.
Qlib is an AI-oriented quantitative investment platform, which aims to realize the potential, empower the research, and create the value of AI technologies in quantitative investment.
An increasing number of SOTA Quant research works/papers in diverse paradigms are being released in Qlib to collaboratively solve key challenges in quantitative investment. For example, 1) using supervised learning to mine the market's complex non-linear patterns from rich and heterogeneous financial data, 2) modeling the dynamic nature of the financial market using adaptive concept drift technology, and 3) using reinforcement learning to model continuous investment decisions and assist investors in optimizing their trading strategies.
It contains the full ML pipeline of data processing, model training, back-testing; and covers the entire chain of quantitative investment: alpha seeking, risk modeling, portfolio optimization, and order execution.
With Qlib, users can easily try ideas to create better Quant investment strategies.
For more details, please refer to our paper ["Qlib: An AI-oriented Quantitative Investment Platform"](https://arxiv.org/abs/2009.11189).
@@ -354,6 +354,8 @@ Here is a list of models built on `Qlib`.
- [ADD based on pytorch (Hongshun Tang, et al.2020)](examples/benchmarks/ADD/)
- [IGMTF based on pytorch (Wentao Xu, et al.2021)](examples/benchmarks/IGMTF/)
- [HIST based on pytorch (Wentao Xu, et al.2021)](examples/benchmarks/HIST/)
- [KRNN based on pytorch](examples/benchmarks/KRNN/)
- [Sandwich based on pytorch](examples/benchmarks/Sandwich/)
Your PR of new Quant models is highly welcomed.

View File

@@ -119,7 +119,7 @@ Here are some example:
for daily data:
.. code-block:: bash
python scripts/get_data.py csv_data_cn --target_dir ~/.qlib/csv_data/cn_data
python scripts/get_data.py download_data --file_name csv_data_cn.zip --target_dir ~/.qlib/csv_data/cn_data
for 1min data:
.. code-block:: bash

View File

@@ -42,4 +42,8 @@ As you may have noticed, a training vessel itself holds all the required compone
With a training vessel, the trainer could finally launch the training pipeline by simple, Scikit-learn-like interfaces (i.e., ``trainer.fit()``).
The API for Trainer and TrainingVessel and can be found `here <../../reference/api.html#module-qlib.rl.trainer>`__.
The API for Trainer and TrainingVessel and can be found `here <../../reference/api.html#module-qlib.rl.trainer>`__.
The RL module is designed in a loosely-coupled way. Currently, RL examples are integrated with concrete business logic.
But the core part of RL is much simpler than what you see.
To demonstrate the simple core of RL, `a dedicated notebook <https://github.com/microsoft/qlib/tree/main/examples/rl/simple_example.ipynb>`__ for RL without business loss is created.

View File

@@ -0,0 +1,8 @@
# KRNN
* Code: [https://github.com/microsoft/FOST/blob/main/fostool/model/krnn.py](https://github.com/microsoft/FOST/blob/main/fostool/model/krnn.py)
# Introductions about the settings/configs.
* Torch_geometric is used in the original model in FOST, but we didn't use it.
* Make sure your CUDA version matches the torch version to allow the usage of GPU; we use CUDA==10.2 and torch.__version__==1.12.1

View File

@@ -0,0 +1,2 @@
numpy==1.23.4
pandas==1.5.2

View File

@@ -0,0 +1,91 @@
qlib_init:
provider_uri: "~/.qlib/qlib_data/cn_data"
region: cn
market: &market csi300
benchmark: &benchmark SH000300
data_handler_config: &data_handler_config
start_time: 2008-01-01
end_time: 2020-08-01
fit_start_time: 2008-01-01
fit_end_time: 2014-12-31
instruments: *market
infer_processors:
- class: RobustZScoreNorm
kwargs:
fields_group: feature
clip_outlier: true
- class: Fillna
kwargs:
fields_group: feature
learn_processors:
- class: DropnaLabel
- class: CSRankNorm
kwargs:
fields_group: label
label: ["Ref($close, -2) / Ref($close, -1) - 1"]
port_analysis_config: &port_analysis_config
strategy:
class: TopkDropoutStrategy
module_path: qlib.contrib.strategy
kwargs:
signal:
- <MODEL>
- <DATASET>
topk: 50
n_drop: 5
backtest:
start_time: 2017-01-01
end_time: 2020-08-01
account: 100000000
benchmark: *benchmark
exchange_kwargs:
limit_threshold: 0.095
deal_price: close
open_cost: 0.0005
close_cost: 0.0015
min_cost: 5
task:
model:
class: KRNN
module_path: qlib.contrib.model.pytorch_krnn
kwargs:
fea_dim: 6
cnn_dim: 8
cnn_kernel_size: 3
rnn_dim: 8
rnn_dups: 2
rnn_layers: 2
n_epochs: 200
lr: 0.001
early_stop: 20
batch_size: 2000
metric: loss
GPU: 0
dataset:
class: DatasetH
module_path: qlib.data.dataset
kwargs:
handler:
class: Alpha360
module_path: qlib.contrib.data.handler
kwargs: *data_handler_config
segments:
train: [2008-01-01, 2014-12-31]
valid: [2015-01-01, 2016-12-31]
test: [2017-01-01, 2020-08-01]
record:
- class: SignalRecord
module_path: qlib.workflow.record_temp
kwargs:
model: <MODEL>
dataset: <DATASET>
- class: SigAnaRecord
module_path: qlib.workflow.record_temp
kwargs:
ana_long_short: False
ann_scaler: 252
- class: PortAnaRecord
module_path: qlib.workflow.record_temp
kwargs:
config: *port_analysis_config
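
The `label` expression in this config, `Ref($close, -2) / Ref($close, -1) - 1`, uses negative offsets, which in Qlib's expression language refer to future values. The label for day t is therefore the return from the close of day t+1 to the close of day t+2. A plain-Python sketch of that arithmetic follows; the function name and the price series are made up, and Qlib itself evaluates the expression per instrument through its own operators.

```python
from typing import List, Optional


def forward_return_label(close: List[float]) -> List[Optional[float]]:
    """Compute close[t+2] / close[t+1] - 1 for each day t (hypothetical helper)."""
    labels: List[Optional[float]] = []
    for t in range(len(close)):
        if t + 2 < len(close):
            labels.append(close[t + 2] / close[t + 1] - 1)
        else:
            # The last two days have no t+2 close, so no label exists.
            labels.append(None)
    return labels


# Illustrative closing prices for four consecutive trading days.
close = [10.0, 11.0, 12.1, 12.1]
labels = forward_return_label(close)
print(labels)
```

Because the label looks two days ahead, the final two rows of any series carry no label, which is what the `DropnaLabel` processor in the config removes during training.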

View File

@@ -29,13 +29,13 @@ class Avg15minHandler(DataHandlerLP):
fit_end_time=None,
process_type=DataHandlerLP.PTYPE_A,
filter_pipe=None,
inst_processor=None,
inst_processors=None,
**kwargs,
):
infer_processors = check_transform_proc(infer_processors, fit_start_time, fit_end_time)
learn_processors = check_transform_proc(learn_processors, fit_start_time, fit_end_time)
data_loader = Avg15minLoader(
config=self.loader_config(), filter_pipe=filter_pipe, freq=freq, inst_processor=inst_processor
config=self.loader_config(), filter_pipe=filter_pipe, freq=freq, inst_processors=inst_processors
)
super().__init__(
instruments=instruments,

View File

@@ -18,7 +18,7 @@ data_handler_config: &data_handler_config
label: day
feature: 1min
# with label as reference
inst_processor:
inst_processors:
feature:
- class: Resample1minProcessor
module_path: features_sample.py

View File

@@ -19,7 +19,7 @@ data_handler_config: &data_handler_config
feature_15min: 1min
feature_day: day
# with label as reference
inst_processor:
inst_processors:
feature_15min:
- class: ResampleNProcessor
module_path: features_resample_N.py

View File

@@ -64,8 +64,6 @@ task:
kwargs:
loss: mse
lr: 0.002
lr_decay: 0.96
lr_decay_steps: 100
optimizer: adam
max_steps: 8000
batch_size: 8192

View File

@@ -64,8 +64,6 @@ task:
kwargs:
loss: mse
lr: 0.002
lr_decay: 0.96
lr_decay_steps: 100
optimizer: adam
max_steps: 8000
batch_size: 8192

View File

@@ -52,8 +52,6 @@ task:
kwargs:
loss: mse
lr: 0.002
lr_decay: 0.96
lr_decay_steps: 100
optimizer: adam
max_steps: 8000
batch_size: 4096

View File

@@ -52,8 +52,6 @@ task:
kwargs:
loss: mse
lr: 0.002
lr_decay: 0.96
lr_decay_steps: 100
optimizer: adam
max_steps: 8000
batch_size: 4096

View File

@@ -26,7 +26,7 @@ The numbers shown below demonstrate the performance of the entire `workflow` of
| Model Name | Dataset | IC | ICIR | Rank IC | Rank ICIR | Annualized Return | Information Ratio | Max Drawdown |
|------------------------------------------|-------------------------------------|-------------|-------------|-------------|-------------|-------------------|-------------------|--------------|
| TCN(Shaojie Bai, et al.) | Alpha158 | 0.0275±0.00 | 0.2157±0.01 | 0.0411±0.00 | 0.3379±0.01 | 0.0190±0.02 | 0.2887±0.27 | -0.1202±0.03 |
| TCN(Shaojie Bai, et al.) | Alpha158 | 0.0279±0.00 | 0.2181±0.01 | 0.0421±0.00 | 0.3429±0.01 | 0.0262±0.02 | 0.4133±0.25 | -0.1090±0.03 |
| TabNet(Sercan O. Arik, et al.) | Alpha158 | 0.0204±0.01 | 0.1554±0.07 | 0.0333±0.00 | 0.2552±0.05 | 0.0227±0.04 | 0.3676±0.54 | -0.1089±0.08 |
| Transformer(Ashish Vaswani, et al.) | Alpha158 | 0.0264±0.00 | 0.2053±0.02 | 0.0407±0.00 | 0.3273±0.02 | 0.0273±0.02 | 0.3970±0.26 | -0.1101±0.02 |
| GRU(Kyunghyun Cho, et al.) | Alpha158(with selected 20 features) | 0.0315±0.00 | 0.2450±0.04 | 0.0428±0.00 | 0.3440±0.03 | 0.0344±0.02 | 0.5160±0.25 | -0.1017±0.02 |
@@ -68,6 +68,8 @@ The numbers shown below demonstrate the performance of the entire `workflow` of
| TRA(Hengxu Lin, et al.) | Alpha360 | 0.0485±0.00 | 0.3787±0.03 | 0.0587±0.00 | 0.4756±0.03 | 0.0920±0.03 | 1.2789±0.42 | -0.0834±0.02 |
| IGMTF(Wentao Xu, et al.) | Alpha360 | 0.0480±0.00 | 0.3589±0.02 | 0.0606±0.00 | 0.4773±0.01 | 0.0946±0.02 | 1.3509±0.25 | -0.0716±0.02 |
| HIST(Wentao Xu, et al.) | Alpha360 | 0.0522±0.00 | 0.3530±0.01 | 0.0667±0.00 | 0.4576±0.01 | 0.0987±0.02 | 1.3726±0.27 | -0.0681±0.01 |
| KRNN | Alpha360 | 0.0173±0.01 | 0.1210±0.06 | 0.0270±0.01 | 0.2018±0.04 | -0.0465±0.05 | -0.5415±0.62 | -0.2919±0.13 |
| Sandwich | Alpha360 | 0.0258±0.00 | 0.1924±0.04 | 0.0337±0.00 | 0.2624±0.03 | 0.0005±0.03 | 0.0001±0.33 | -0.1752±0.05 |
- The selected 20 features are based on the feature importance of a lightgbm-based model.
@@ -134,7 +136,7 @@ If you want to contribute your new models, you can follow the steps below.
- `README.md`: a brief introduction to your models
- `workflow_config_<model name>_<dataset>.yaml`: a configuration which can read by `qrun`. You are encouraged to run your model in all datasets.
3. You can integrate your model as a module [in this folder](https://github.com/microsoft/qlib/tree/main/qlib/contrib/model).
4. Please updated your results in the benchmark tables, e.g. [Alpha360](#alpha158-dataset), [Alpha158](#alpha158-dataset)(the values of each metric are the mean and std calculated based on 20 runs with different random seeds, if you don't have enough computational resource, you can ask for help in the PR).
4. Please update your results in the above **Benchmark Tables**, e.g. [Alpha360](#alpha158-dataset), [Alpha158](#alpha158-dataset)(the values of each metric are the mean and std calculated based on **20 Runs** with different random seeds. You can accomplish the above operations through the automated [script](https://github.com/microsoft/qlib/blob/main/examples/run_all_model.py#LL286C22-L286C22) provided by Qlib, and get the final result in the .md file. if you don't have enough computational resource, you can ask for help in the PR).
5. Update the info in the index page in the [news list](https://github.com/microsoft/qlib#newspaper-whats-new----sparkling_heart) and [model list](https://github.com/microsoft/qlib#quant-model-paper-zoo).
Finally, you can send a PR for review. ([here is an example](https://github.com/microsoft/qlib/pull/1040))
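
The benchmark tables report each metric as mean±std over runs with different random seeds. A minimal sketch of that aggregation, assuming the per-run metric values have already been collected into a plain list. The helper name `summarize_runs`, the choice of the sample standard deviation, and the sample values are all hypothetical and not taken from Qlib:

```python
from statistics import mean, stdev


def summarize_runs(values, mean_digits=4, std_digits=2):
    """Format per-seed metric values as 'mean±std', like the benchmark tables."""
    if len(values) < 2:
        raise ValueError("need at least two runs to compute a standard deviation")
    # stdev() is the sample standard deviation (ddof=1), the pandas default.
    return f"{mean(values):.{mean_digits}f}±{stdev(values):.{std_digits}f}"


# Hypothetical IC values from five seeds (the README asks for 20 runs).
ic_per_seed = [0.050, 0.048, 0.052, 0.049, 0.051]
summary = summarize_runs(ic_per_seed)
print(summary)
```

The four-digit mean and two-digit std mirror the precision used in the table rows above.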

View File

@@ -0,0 +1,8 @@
# Sandwich
* Code: [https://github.com/microsoft/FOST/blob/main/fostool/model/sandwich.py](https://github.com/microsoft/FOST/blob/main/fostool/model/sandwich.py)
# Introduction to the settings/configs.
* Torch_geometric is used in the original model in FOST, but we didn't use it.
* Make sure your CUDA version matches the torch version so that the GPU can be used; we use CUDA==10.2 and torch.version==1.12.1.

View File

@@ -0,0 +1,2 @@
numpy==1.23.4
pandas==1.5.2

View File

@@ -0,0 +1,93 @@
qlib_init:
    provider_uri: "~/.qlib/qlib_data/cn_data"
    region: cn
market: &market csi300
benchmark: &benchmark SH000300
data_handler_config: &data_handler_config
    start_time: 2008-01-01
    end_time: 2020-08-01
    fit_start_time: 2008-01-01
    fit_end_time: 2014-12-31
    instruments: *market
    infer_processors:
        - class: RobustZScoreNorm
          kwargs:
              fields_group: feature
              clip_outlier: true
        - class: Fillna
          kwargs:
              fields_group: feature
    learn_processors:
        - class: DropnaLabel
        - class: CSRankNorm
          kwargs:
              fields_group: label
    label: ["Ref($close, -2) / Ref($close, -1) - 1"]
port_analysis_config: &port_analysis_config
    strategy:
        class: TopkDropoutStrategy
        module_path: qlib.contrib.strategy
        kwargs:
            signal:
                - <MODEL>
                - <DATASET>
            topk: 50
            n_drop: 5
    backtest:
        start_time: 2017-01-01
        end_time: 2020-08-01
        account: 100000000
        benchmark: *benchmark
        exchange_kwargs:
            limit_threshold: 0.095
            deal_price: close
            open_cost: 0.0005
            close_cost: 0.0015
            min_cost: 5
task:
    model:
        class: Sandwich
        module_path: qlib.contrib.model.pytorch_sandwich
        kwargs:
            fea_dim: 6
            cnn_dim_1: 16
            cnn_dim_2: 16
            cnn_kernel_size: 3
            rnn_dim_1: 8
            rnn_dim_2: 8
            rnn_dups: 2
            rnn_layers: 2
            n_epochs: 200
            lr: 0.001
            early_stop: 20
            batch_size: 2000
            metric: loss
            GPU: 0
    dataset:
        class: DatasetH
        module_path: qlib.data.dataset
        kwargs:
            handler:
                class: Alpha360
                module_path: qlib.contrib.data.handler
                kwargs: *data_handler_config
            segments:
                train: [2008-01-01, 2014-12-31]
                valid: [2015-01-01, 2016-12-31]
                test: [2017-01-01, 2020-08-01]
    record:
        - class: SignalRecord
          module_path: qlib.workflow.record_temp
          kwargs:
              model: <MODEL>
              dataset: <DATASET>
        - class: SigAnaRecord
          module_path: qlib.workflow.record_temp
          kwargs:
              ana_long_short: False
              ann_scaler: 252
        - class: PortAnaRecord
          module_path: qlib.workflow.record_temp
          kwargs:
              config: *port_analysis_config
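
The `label` expression in the config above, `Ref($close, -2) / Ref($close, -1) - 1`, is the return from the next day's close to the close one day after that, so the label for day t uses only the prices of t+1 and t+2. A minimal pure-Python sketch of the same arithmetic on a single price series, assuming Qlib's convention that a negative `Ref` offset looks forward in time. The function name `forward_return_label` and the prices are illustrative only:

```python
def forward_return_label(close):
    """Label for day t: close[t+2] / close[t+1] - 1 (None where the future is unknown)."""
    labels = []
    for t in range(len(close)):
        if t + 2 < len(close):
            labels.append(close[t + 2] / close[t + 1] - 1)
        else:
            # The last two days have no t+2 price, so no label can be formed.
            labels.append(None)
    return labels


prices = [10.0, 11.0, 12.1, 12.1]
labels = forward_return_label(prices)
print(labels)
```

The trailing entries without a label are the kind of rows that a processor such as `DropnaLabel` would be expected to discard before training.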

View File

@@ -25,59 +25,65 @@
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib\n",
"sns.set(style='white')\n",
"matplotlib.rcParams['pdf.fonttype'] = 42\n",
"matplotlib.rcParams['ps.fonttype'] = 42\n",
"\n",
"sns.set(style=\"white\")\n",
"matplotlib.rcParams[\"pdf.fonttype\"] = 42\n",
"matplotlib.rcParams[\"ps.fonttype\"] = 42\n",
"\n",
"from tqdm.auto import tqdm\n",
"from joblib import Parallel, delayed\n",
"\n",
"\n",
"def func(x, N=80):\n",
" ret = x.ret.copy()\n",
" x = x.rank(pct=True)\n",
" x['ret'] = ret\n",
" x[\"ret\"] = ret\n",
" diff = x.score.sub(x.label)\n",
" r = x.nlargest(N, columns='score').ret.mean()\n",
" r -= x.nsmallest(N, columns='score').ret.mean()\n",
" return pd.Series({\n",
" 'MSE': diff.pow(2).mean(), \n",
" 'MAE': diff.abs().mean(), \n",
" 'IC': x.score.corr(x.label),\n",
" 'R': r\n",
" })\n",
" \n",
" r = x.nlargest(N, columns=\"score\").ret.mean()\n",
" r -= x.nsmallest(N, columns=\"score\").ret.mean()\n",
" return pd.Series(\n",
" {\n",
" \"MSE\": diff.pow(2).mean(),\n",
" \"MAE\": diff.abs().mean(),\n",
" \"IC\": x.score.corr(x.label),\n",
" \"R\": r,\n",
" }\n",
" )\n",
"\n",
"\n",
"ret = pd.read_pickle(\"data/ret.pkl\").clip(-0.1, 0.1)\n",
"\n",
"\n",
"def backtest(fname, **kwargs):\n",
" pred = pd.read_pickle(fname).loc['2018-09-21':'2020-06-30'] # test period\n",
" pred['ret'] = ret\n",
" pred = pd.read_pickle(fname).loc[\"2018-09-21\":\"2020-06-30\"] # test period\n",
" pred[\"ret\"] = ret\n",
" dates = pred.index.unique(level=0)\n",
" res = Parallel(n_jobs=-1)(delayed(func)(pred.loc[d], **kwargs) for d in dates)\n",
" res = {\n",
" dates[i]: res[i]\n",
" for i in range(len(dates))\n",
" }\n",
" res = {dates[i]: res[i] for i in range(len(dates))}\n",
" res = pd.DataFrame(res).T\n",
" r = res['R'].copy()\n",
" r = res[\"R\"].copy()\n",
" r.index = pd.to_datetime(r.index)\n",
" r = r.reindex(pd.date_range(r.index[0], r.index[-1])).fillna(0) # paper use 365 days\n",
" return {\n",
" 'MSE': res['MSE'].mean(),\n",
" 'MAE': res['MAE'].mean(),\n",
" 'IC': res['IC'].mean(),\n",
" 'ICIR': res['IC'].mean()/res['IC'].std(),\n",
" 'AR': r.mean()*365,\n",
" 'AV': r.std()*365**0.5,\n",
" 'SR': r.mean()/r.std()*365**0.5,\n",
" 'MDD': (r.cumsum().cummax() - r.cumsum()).max()\n",
" \"MSE\": res[\"MSE\"].mean(),\n",
" \"MAE\": res[\"MAE\"].mean(),\n",
" \"IC\": res[\"IC\"].mean(),\n",
" \"ICIR\": res[\"IC\"].mean() / res[\"IC\"].std(),\n",
" \"AR\": r.mean() * 365,\n",
" \"AV\": r.std() * 365**0.5,\n",
" \"SR\": r.mean() / r.std() * 365**0.5,\n",
" \"MDD\": (r.cumsum().cummax() - r.cumsum()).max(),\n",
" }, r\n",
"\n",
"\n",
"def fmt(x, p=3, scale=1, std=False):\n",
" _fmt = '{:.%df}'%p\n",
" _fmt = \"{:.%df}\" % p\n",
" string = _fmt.format((x.mean() if not isinstance(x, (float, np.floating)) else x) * scale)\n",
" if std and len(x) > 1:\n",
" string += ' ('+_fmt.format(x.std()*scale)+')'\n",
" string += \" (\" + _fmt.format(x.std() * scale) + \")\"\n",
" return string\n",
"\n",
"\n",
"def backtest_multi(files, **kwargs):\n",
" res = []\n",
" pnl = []\n",
@@ -88,14 +94,14 @@
" res = pd.DataFrame(res)\n",
" pnl = pd.concat(pnl, axis=1)\n",
" return {\n",
" 'MSE': fmt(res['MSE'], std=True),\n",
" 'MAE': fmt(res['MAE'], std=True),\n",
" 'IC': fmt(res['IC']),\n",
" 'ICIR': fmt(res['ICIR']),\n",
" 'AR': fmt(res['AR'], scale=100, p=1)+'%',\n",
" 'VR': fmt(res['AV'], scale=100, p=1)+'%',\n",
" 'SR': fmt(res['SR']),\n",
" 'MDD': fmt(res['MDD'], scale=100, p=1)+'%'\n",
" \"MSE\": fmt(res[\"MSE\"], std=True),\n",
" \"MAE\": fmt(res[\"MAE\"], std=True),\n",
" \"IC\": fmt(res[\"IC\"]),\n",
" \"ICIR\": fmt(res[\"ICIR\"]),\n",
" \"AR\": fmt(res[\"AR\"], scale=100, p=1) + \"%\",\n",
" \"VR\": fmt(res[\"AV\"], scale=100, p=1) + \"%\",\n",
" \"SR\": fmt(res[\"SR\"]),\n",
" \"MDD\": fmt(res[\"MDD\"], scale=100, p=1) + \"%\",\n",
" }, pnl"
]
},
@@ -124,16 +130,20 @@
"outputs": [],
"source": [
"exps = {\n",
" 'Linear': ['output/Linear/pred.pkl'],\n",
" 'LightGBM': ['output/GBDT/lr0.05_leaves128/pred.pkl'],\n",
" 'MLP': glob.glob('output/search/MLP/hs128_bs512_do0.3_lr0.001_seed*/pred.pkl'),\n",
" 'SFM': glob.glob('output/search/SFM/hs32_bs512_do0.5_lr0.001_seed*/pred.pkl'),\n",
" 'ALSTM': glob.glob('output/search/LSTM_Attn/hs256_bs1024_do0.1_lr0.0002_seed*/pred.pkl'),\n",
" 'Trans.': glob.glob('output/search/Transformer/head4_hs64_bs1024_do0.1_lr0.0002_seed*/pred.pkl'),\n",
" 'ALSTM+TS':glob.glob('output/LSTM_Attn_TS/hs256_bs1024_do0.1_lr0.0002_seed*/pred.pkl'),\n",
" 'Trans.+TS':glob.glob('output/Transformer_TS/head4_hs64_bs1024_do0.1_lr0.0002_seed*/pred.pkl'),\n",
" 'ALSTM+TRA(Ours)': glob.glob('output/search/finetune/LSTM_Attn_tra/K10_traHs16_traSrcLR_TPE_traLamb2.0_hs256_bs1024_do0.1_lr0.0001_seed*/pred.pkl'),\n",
" 'Trans.+TRA(Ours)': glob.glob('output/search/finetune/Transformer_tra/K3_traHs16_traSrcLR_TPE_traLamb1.0_head4_hs64_bs512_do0.1_lr0.0005_seed*/pred.pkl')\n",
" \"Linear\": [\"output/Linear/pred.pkl\"],\n",
" \"LightGBM\": [\"output/GBDT/lr0.05_leaves128/pred.pkl\"],\n",
" \"MLP\": glob.glob(\"output/search/MLP/hs128_bs512_do0.3_lr0.001_seed*/pred.pkl\"),\n",
" \"SFM\": glob.glob(\"output/search/SFM/hs32_bs512_do0.5_lr0.001_seed*/pred.pkl\"),\n",
" \"ALSTM\": glob.glob(\"output/search/LSTM_Attn/hs256_bs1024_do0.1_lr0.0002_seed*/pred.pkl\"),\n",
" \"Trans.\": glob.glob(\"output/search/Transformer/head4_hs64_bs1024_do0.1_lr0.0002_seed*/pred.pkl\"),\n",
" \"ALSTM+TS\": glob.glob(\"output/LSTM_Attn_TS/hs256_bs1024_do0.1_lr0.0002_seed*/pred.pkl\"),\n",
" \"Trans.+TS\": glob.glob(\"output/Transformer_TS/head4_hs64_bs1024_do0.1_lr0.0002_seed*/pred.pkl\"),\n",
" \"ALSTM+TRA(Ours)\": glob.glob(\n",
" \"output/search/finetune/LSTM_Attn_tra/K10_traHs16_traSrcLR_TPE_traLamb2.0_hs256_bs1024_do0.1_lr0.0001_seed*/pred.pkl\"\n",
" ),\n",
" \"Trans.+TRA(Ours)\": glob.glob(\n",
" \"output/search/finetune/Transformer_tra/K3_traHs16_traSrcLR_TPE_traLamb1.0_head4_hs64_bs512_do0.1_lr0.0005_seed*/pred.pkl\"\n",
" ),\n",
"}"
]
},
@@ -160,14 +170,8 @@
}
],
"source": [
"res = {\n",
" name: backtest_multi(exps[name])\n",
" for name in tqdm(exps)\n",
"}\n",
"report = pd.DataFrame({\n",
" k: v[0]\n",
" for k, v in res.items()\n",
"}).T"
"res = {name: backtest_multi(exps[name]) for name in tqdm(exps)}\n",
"report = pd.DataFrame({k: v[0] for k, v in res.items()}).T"
]
},
{
@@ -385,24 +389,40 @@
}
],
"source": [
"df = pd.read_pickle('output/search/finetune/Transformer_tra/K3_traHs16_traSrcLR_TPE_traLamb0.0_head4_hs64_bs512_do0.1_lr0.0005_seed1000/pred.pkl')\n",
"code = 'SH600157'\n",
"date = '2018-09-28'\n",
"df = pd.read_pickle(\n",
" \"output/search/finetune/Transformer_tra/K3_traHs16_traSrcLR_TPE_traLamb0.0_head4_hs64_bs512_do0.1_lr0.0005_seed1000/pred.pkl\"\n",
")\n",
"code = \"SH600157\"\n",
"date = \"2018-09-28\"\n",
"lookbackperiod = 50\n",
"\n",
"prob = df.iloc[:, -3:].loc(axis=0)[:, code].reset_index(level=1, drop=True).loc[date:].iloc[:lookbackperiod]\n",
"pred = df.loc[:,[\"score_0\",\"score_1\",\"score_2\",\"label\"]].loc(axis=0)[:, code].reset_index(level=1, drop=True).loc[date:].iloc[:lookbackperiod]\n",
"e_all = pred.iloc[:,:-1].sub(pred.iloc[:,-1], axis=0).pow(2)\n",
"pred = (\n",
" df.loc[:, [\"score_0\", \"score_1\", \"score_2\", \"label\"]]\n",
" .loc(axis=0)[:, code]\n",
" .reset_index(level=1, drop=True)\n",
" .loc[date:]\n",
" .iloc[:lookbackperiod]\n",
")\n",
"e_all = pred.iloc[:, :-1].sub(pred.iloc[:, -1], axis=0).pow(2)\n",
"e_all = e_all.sub(e_all.min(axis=1), axis=0)\n",
"e_all.columns = [r'$\\theta_%d$'%d for d in range(1, 4)]\n",
"e_all.columns = [r\"$\\theta_%d$\" % d for d in range(1, 4)]\n",
"prob = pd.Series(np.argmax(prob.values, axis=1), index=prob.index).rolling(7).mean().round()\n",
"\n",
"fig, axes = plt.subplots(1, 2, figsize=(7, 3))\n",
"e_all.plot(ax=axes[0], xlabel='', rot=30)\n",
"prob.plot(ax=axes[1], xlabel='', rot=30, color='red', linestyle='None', marker='^', markersize=5)\n",
"e_all.plot(ax=axes[0], xlabel=\"\", rot=30)\n",
"prob.plot(\n",
" ax=axes[1],\n",
" xlabel=\"\",\n",
" rot=30,\n",
" color=\"red\",\n",
" linestyle=\"None\",\n",
" marker=\"^\",\n",
" markersize=5,\n",
")\n",
"plt.yticks(np.array([0, 1, 2]), e_all.columns.values)\n",
"axes[0].set_ylabel('Predictor Loss')\n",
"axes[1].set_ylabel('Router Selection')\n",
"axes[0].set_ylabel(\"Predictor Loss\")\n",
"axes[1].set_ylabel(\"Router Selection\")\n",
"plt.tight_layout()\n",
"# plt.savefig('select.pdf', bbox_inches='tight')\n",
"plt.show()"
@@ -428,10 +448,18 @@
"outputs": [],
"source": [
"exps = {\n",
" 'Random': glob.glob('output/search/LSTM_Attn_tra/K10_traHs16_traSrcNONE_traLamb1.0_hs256_bs1024_do0.1_lr0.0001_seed*/pred.pkl'),\n",
" 'LR': glob.glob('output/search/LSTM_Attn_tra/K10_traHs16_traSrcLR_traLamb1.0_hs256_bs1024_do0.1_lr0.0001_seed*/pred.pkl'),\n",
" 'TPE': glob.glob('output/search/LSTM_Attn_tra/K10_traHs16_traSrcTPE_traLamb1.0_hs256_bs1024_do0.1_lr0.0001_seed*/pred.pkl'),\n",
" 'LR+TPE': glob.glob('output/search/finetune/LSTM_Attn_tra/K10_traHs16_traSrcLR_TPE_traLamb2.0_hs256_bs1024_do0.1_lr0.0001_seed*/pred.pkl')\n",
" \"Random\": glob.glob(\n",
" \"output/search/LSTM_Attn_tra/K10_traHs16_traSrcNONE_traLamb1.0_hs256_bs1024_do0.1_lr0.0001_seed*/pred.pkl\"\n",
" ),\n",
" \"LR\": glob.glob(\n",
" \"output/search/LSTM_Attn_tra/K10_traHs16_traSrcLR_traLamb1.0_hs256_bs1024_do0.1_lr0.0001_seed*/pred.pkl\"\n",
" ),\n",
" \"TPE\": glob.glob(\n",
" \"output/search/LSTM_Attn_tra/K10_traHs16_traSrcTPE_traLamb1.0_hs256_bs1024_do0.1_lr0.0001_seed*/pred.pkl\"\n",
" ),\n",
" \"LR+TPE\": glob.glob(\n",
" \"output/search/finetune/LSTM_Attn_tra/K10_traHs16_traSrcLR_TPE_traLamb2.0_hs256_bs1024_do0.1_lr0.0001_seed*/pred.pkl\"\n",
" ),\n",
"}"
]
},
@@ -456,14 +484,8 @@
}
],
"source": [
"res = {\n",
" name: backtest_multi(exps[name])\n",
" for name in tqdm(exps)\n",
"}\n",
"report = pd.DataFrame({\n",
" k: v[0]\n",
" for k, v in res.items()\n",
"}).T"
"res = {name: backtest_multi(exps[name]) for name in tqdm(exps)}\n",
"report = pd.DataFrame({k: v[0] for k, v in res.items()}).T"
]
},
{
@@ -597,18 +619,22 @@
}
],
"source": [
"a = pd.read_pickle('output/search/finetune/Transformer_tra/K3_traHs16_traSrcLR_TPE_traLamb0.0_head4_hs64_bs512_do0.1_lr0.0005_seed3000/pred.pkl')\n",
"b = pd.read_pickle('output/search/finetune/Transformer_tra/K3_traHs16_traSrcLR_TPE_traLamb2.0_head4_hs64_bs512_do0.1_lr0.0005_seed3000/pred.pkl')\n",
"a = pd.read_pickle(\n",
" \"output/search/finetune/Transformer_tra/K3_traHs16_traSrcLR_TPE_traLamb0.0_head4_hs64_bs512_do0.1_lr0.0005_seed3000/pred.pkl\"\n",
")\n",
"b = pd.read_pickle(\n",
" \"output/search/finetune/Transformer_tra/K3_traHs16_traSrcLR_TPE_traLamb2.0_head4_hs64_bs512_do0.1_lr0.0005_seed3000/pred.pkl\"\n",
")\n",
"a = a.iloc[:, -3:]\n",
"b = b.iloc[:, -3:]\n",
"b = np.eye(3)[b.values.argmax(axis=1)]\n",
"a = np.eye(3)[a.values.argmax(axis=1)]\n",
"\n",
"res = pd.DataFrame({\n",
" 'with OT': b.sum(axis=0) / b.sum(),\n",
" 'without OT': a.sum(axis=0)/ a.sum() \n",
"},index=[r'$\\theta_1$',r'$\\theta_2$',r'$\\theta_3$'])\n",
"res.plot.bar(rot=30, figsize=(5, 4), color=['b', 'g'])\n",
"res = pd.DataFrame(\n",
" {\"with OT\": b.sum(axis=0) / b.sum(), \"without OT\": a.sum(axis=0) / a.sum()},\n",
" index=[r\"$\\theta_1$\", r\"$\\theta_2$\", r\"$\\theta_3$\"],\n",
")\n",
"res.plot.bar(rot=30, figsize=(5, 4), color=[\"b\", \"g\"])\n",
"del a, b"
]
},
@@ -633,11 +659,19 @@
"outputs": [],
"source": [
"exps = {\n",
" 'K=1': glob.glob('output/search/LSTM_Attn/hs256_bs1024_do0.1_lr0.0002_seed*/info.json'),\n",
" 'K=3': glob.glob('output/search/finetune/LSTM_Attn_tra/K3_traHs16_traSrcLR_TPE_traLamb2.0_hs256_bs1024_do0.1_lr0.0001_seed*/info.json'),\n",
" 'K=5': glob.glob('output/search/finetune/LSTM_Attn_tra/K5_traHs16_traSrcLR_TPE_traLamb2.0_hs256_bs1024_do0.1_lr0.0001_seed*/info.json'),\n",
" 'K=10': glob.glob('output/search/finetune/LSTM_Attn_tra/K10_traHs16_traSrcLR_TPE_traLamb2.0_hs256_bs1024_do0.1_lr0.0001_seed*/info.json'),\n",
" 'K=20': glob.glob('output/search/finetune/LSTM_Attn_tra/K20_traHs16_traSrcLR_TPE_traLamb2.0_hs256_bs1024_do0.1_lr0.0001_seed*/info.json')\n",
" \"K=1\": glob.glob(\"output/search/LSTM_Attn/hs256_bs1024_do0.1_lr0.0002_seed*/info.json\"),\n",
" \"K=3\": glob.glob(\n",
" \"output/search/finetune/LSTM_Attn_tra/K3_traHs16_traSrcLR_TPE_traLamb2.0_hs256_bs1024_do0.1_lr0.0001_seed*/info.json\"\n",
" ),\n",
" \"K=5\": glob.glob(\n",
" \"output/search/finetune/LSTM_Attn_tra/K5_traHs16_traSrcLR_TPE_traLamb2.0_hs256_bs1024_do0.1_lr0.0001_seed*/info.json\"\n",
" ),\n",
" \"K=10\": glob.glob(\n",
" \"output/search/finetune/LSTM_Attn_tra/K10_traHs16_traSrcLR_TPE_traLamb2.0_hs256_bs1024_do0.1_lr0.0001_seed*/info.json\"\n",
" ),\n",
" \"K=20\": glob.glob(\n",
" \"output/search/finetune/LSTM_Attn_tra/K20_traHs16_traSrcLR_TPE_traLamb2.0_hs256_bs1024_do0.1_lr0.0001_seed*/info.json\"\n",
" ),\n",
"}"
]
},
@@ -649,16 +683,11 @@
"source": [
"report = dict()\n",
"for k, v in exps.items():\n",
" \n",
" tmp = dict()\n",
" for fname in v:\n",
" with open(fname) as f:\n",
" info = json.load(f)\n",
" tmp[fname] = (\n",
" {\n",
" \"IC\":info[\"metric\"][\"IC\"],\n",
" \"MSE\":info[\"metric\"][\"MSE\"]\n",
" })\n",
" tmp[fname] = {\"IC\": info[\"metric\"][\"IC\"], \"MSE\": info[\"metric\"][\"MSE\"]}\n",
" tmp = pd.DataFrame(tmp).T\n",
" report[k] = tmp.mean()\n",
"report = pd.DataFrame(report).T"
@@ -681,13 +710,14 @@
}
],
"source": [
"fig, axes = plt.subplots(1, 2, figsize=(6,3)); axes = axes.flatten()\n",
"report['IC'].plot.bar(rot=30, ax=axes[0])\n",
"fig, axes = plt.subplots(1, 2, figsize=(6, 3))\n",
"axes = axes.flatten()\n",
"report[\"IC\"].plot.bar(rot=30, ax=axes[0])\n",
"axes[0].set_ylim(0.045, 0.062)\n",
"axes[0].set_title('IC performance')\n",
"report['MSE'].astype(float).plot.bar(rot=30, ax=axes[1], color='green')\n",
"axes[0].set_title(\"IC performance\")\n",
"report[\"MSE\"].astype(float).plot.bar(rot=30, ax=axes[1], color=\"green\")\n",
"axes[1].set_ylim(0.155, 0.1585)\n",
"axes[1].set_title('MSE performance')\n",
"axes[1].set_title(\"MSE performance\")\n",
"plt.tight_layout()\n",
"# plt.savefig('sensitivity.pdf')"
]
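
The `backtest` helper in the notebook diff above annualizes daily long-short returns with a 365-day calendar and measures drawdown on the cumulative (summed) return curve. A standalone sketch of the same formulas using only the standard library; the function name `summarize_returns` and the sample returns are illustrative and not part of the notebook:

```python
from itertools import accumulate
from statistics import mean, stdev


def summarize_returns(daily_returns, days_per_year=365):
    """Annualized return, volatility, Sharpe ratio and max drawdown of daily returns."""
    mu = mean(daily_returns)
    sigma = stdev(daily_returns)  # sample std, matching pandas' default ddof=1
    cum = list(accumulate(daily_returns))     # cumulative sum of returns
    running_max = list(accumulate(cum, max))  # highest cumulative return so far
    return {
        "AR": mu * days_per_year,
        "AV": sigma * days_per_year**0.5,
        "SR": mu / sigma * days_per_year**0.5,
        "MDD": max(peak - value for peak, value in zip(running_max, cum)),
    }


stats = summarize_returns([0.01, -0.02, 0.03, -0.01])
print(stats)
```

Because returns are summed rather than compounded, the drawdown here is measured in additive return units, as in the notebook's `r.cumsum()` expression.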

View File

@@ -0,0 +1,4 @@
.PHONY: clean
clean:
	-rm -r *.pkl mlruns || true

View File

@@ -0,0 +1,107 @@
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(color_codes=True)
plt.rcParams["font.sans-serif"] = "SimHei"
plt.rcParams["axes.unicode_minus"] = False
from tqdm.auto import tqdm

# tqdm.pandas() # for progress_apply
# %matplotlib inline
# %load_ext autoreload

# # Meta Input

# +
with open("./internal_data_s20.pkl", "rb") as f:
    data = pickle.load(f)

data.data_ic_df.columns.names = ["start_date", "end_date"]
data_sim = data.data_ic_df.droplevel(axis=1, level="end_date")
data_sim.index.name = "test datetime"
# -

plt.figure(figsize=(40, 20))
sns.heatmap(data_sim)

plt.figure(figsize=(40, 20))
sns.heatmap(data_sim.rolling(20).mean())

# # Meta Model

from qlib import auto_init

auto_init()
from qlib.workflow import R

exp = R.get_exp(experiment_name="DDG-DA")
meta_rec = exp.list_recorders(rtype="list", max_results=1)[0]
meta_m = meta_rec.load_object("model")

pd.DataFrame(meta_m.tn.twm.linear.weight.detach().numpy()).T[0].plot()

pd.DataFrame(meta_m.tn.twm.linear.weight.detach().numpy()).T[0].rolling(5).mean().plot()

# # Meta Output

# +
with open("./tasks_s20.pkl", "rb") as f:
    tasks = pickle.load(f)

task_df = {}
for t in tasks:
    test_seg = t["dataset"]["kwargs"]["segments"]["test"]
    if None not in test_seg:
        # The last rolling is skipped.
        task_df[test_seg] = t["reweighter"].time_weight
task_df = pd.concat(task_df)

task_df.index.names = ["OS_start", "OS_end", "IS_start", "IS_end"]
task_df = task_df.droplevel(["OS_end", "IS_end"])
task_df = task_df.unstack("OS_start")
# -

plt.figure(figsize=(40, 20))
sns.heatmap(task_df.T)

plt.figure(figsize=(40, 20))
sns.heatmap(task_df.rolling(10).mean().T)

# # Sub Models
#
# NOTE:
# - this section assumes that the model is a linear model!!
# - other models do not support this analysis

exp = R.get_exp(experiment_name="rolling_ds")


def show_linear_weight(exp):
    coef_df = {}
    for r in exp.list_recorders("list"):
        t = r.load_object("task")
        if None in t["dataset"]["kwargs"]["segments"]["test"]:
            continue
        m = r.load_object("params.pkl")
        coef_df[t["dataset"]["kwargs"]["segments"]["test"]] = pd.Series(m.coef_)
    coef_df = pd.concat(coef_df)
    coef_df.index.names = ["test_start", "test_end", "coef_idx"]
    coef_df = coef_df.droplevel("test_end").unstack("coef_idx").T
    plt.figure(figsize=(40, 20))
    sns.heatmap(coef_df)
    plt.show()


show_linear_weight(R.get_exp(experiment_name="rolling_ds"))

show_linear_weight(R.get_exp(experiment_name="rolling_models"))

View File

@@ -10,8 +10,10 @@ import pandas as pd
import fire
import sys
import pickle
from typing import Optional
from qlib import auto_init
from qlib.model.trainer import TrainerR
from qlib.typehint import Literal
from qlib.utils import init_instance_by_config
from qlib.workflow import R
from qlib.tests.data import GetData
@@ -30,7 +32,33 @@ class DDGDA:
- `rm -r mlruns`
"""
def __init__(self, sim_task_model="linear", forecast_model="linear"):
def __init__(
self,
sim_task_model: Literal["linear", "gbdt"] = "gbdt",
forecast_model: Literal["linear", "gbdt"] = "linear",
h_path: Optional[str] = None,
test_end: Optional[str] = None,
train_start: Optional[str] = None,
meta_1st_train_end: Optional[str] = None,
task_ext_conf: Optional[dict] = None,
alpha: float = 0.01,
proxy_hd: str = "handler_proxy.pkl",
):
"""
Parameters
----------
train_start: Optional[str]
the start datetime for data. It is used as the training start time (for both tasks & meta learning)
test_end: Optional[str]
the end datetime for data. It is used in test end time
meta_1st_train_end: Optional[str]
the datetime of training end of the first meta_task
alpha: float
Setting the L2 regularization for ridge
The `alpha` is only passed to MetaModelDS (it is not passed to sim_task_model currently..)
"""
self.step = 20
# NOTE:
# the horizon must match the meaning in the base task template
@@ -38,10 +66,19 @@ class DDGDA:
self.meta_exp_name = "DDG-DA"
self.sim_task_model = sim_task_model # The model to capture the distribution of data.
self.forecast_model = forecast_model # downstream forecasting models' type
self.rb_kwargs = {
"h_path": h_path,
"test_end": test_end,
"train_start": train_start,
"task_ext_conf": task_ext_conf,
}
self.alpha = alpha
self.meta_1st_train_end = meta_1st_train_end
self.proxy_hd = proxy_hd
def get_feature_importance(self):
# this must be lightGBM, because it needs to get the feature importance
rb = RollingBenchmark(model_type="gbdt")
rb = RollingBenchmark(model_type="gbdt", **self.rb_kwargs)
task = rb.basic_task()
with R.start(experiment_name="feature_importance"):
@@ -69,7 +106,7 @@ class DDGDA:
fi = self.get_feature_importance()
col_selected = fi.nlargest(topk)
rb = RollingBenchmark(model_type=self.sim_task_model)
rb = RollingBenchmark(model_type=self.sim_task_model, **self.rb_kwargs)
task = rb.basic_task()
dataset = init_instance_by_config(task["dataset"])
prep_ds = dataset.prepare(slice(None), col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
@@ -79,7 +116,9 @@ class DDGDA:
feature_selected = feature_df.loc[:, col_selected.index]
feature_selected = feature_selected.groupby("datetime").apply(lambda df: (df - df.mean()).div(df.std()))
feature_selected = feature_selected.groupby("datetime", group_keys=False).apply(
lambda df: (df - df.mean()).div(df.std())
)
feature_selected = feature_selected.fillna(0.0)
df_all = {
@@ -96,7 +135,7 @@ class DDGDA:
"kwargs": {"config": DIRNAME / "fea_label_df.pkl"},
}
)
handler.to_pickle(DIRNAME / "handler_proxy.pkl", dump_all=True)
handler.to_pickle(DIRNAME / self.proxy_hd, dump_all=True)
@property
def _internal_data_path(self):
@@ -108,7 +147,7 @@ class DDGDA:
This function will dump the input data for meta model
"""
# According to the experiments, the choice of the model type is very important for achieving good results
rb = RollingBenchmark(model_type=self.sim_task_model)
rb = RollingBenchmark(model_type=self.sim_task_model, **self.rb_kwargs)
sim_task = rb.basic_task()
if self.sim_task_model == "gbdt":
@@ -122,24 +161,28 @@ class DDGDA:
with self._internal_data_path.open("wb") as f:
pickle.dump(internal_data, f)
def train_meta_model(self):
def train_meta_model(self, fill_method="max"):
"""
training a meta model based on a simplified linear proxy model;
"""
# 1) leverage the simplified proxy forecasting model to train meta model.
# - Only the dataset part is important, in current version of meta model will integrate the
rb = RollingBenchmark(model_type=self.sim_task_model)
rb = RollingBenchmark(model_type=self.sim_task_model, **self.rb_kwargs)
sim_task = rb.basic_task()
# the train_start for training meta model does not necessarily align with final rolling
train_start = "2008-01-01" if self.rb_kwargs.get("train_start") is None else self.rb_kwargs.get("train_start")
train_end = "2010-12-31" if self.meta_1st_train_end is None else self.meta_1st_train_end
test_start = (pd.Timestamp(train_end) + pd.Timedelta(days=1)).strftime("%Y-%m-%d")
proxy_forecast_model_task = {
# "model": "qlib.contrib.model.linear.LinearModel",
"dataset": {
"class": "qlib.data.dataset.DatasetH",
"kwargs": {
"handler": f"file://{(DIRNAME / 'handler_proxy.pkl').absolute()}",
"handler": f"file://{(DIRNAME / self.proxy_hd).absolute()}",
"segments": {
"train": ("2008-01-01", "2010-12-31"),
"test": ("2011-01-01", sim_task["dataset"]["kwargs"]["segments"]["test"][1]),
"train": (train_start, train_end),
"test": (test_start, sim_task["dataset"]["kwargs"]["segments"]["test"][1]),
},
},
},
@@ -156,7 +199,7 @@ class DDGDA:
segments=0.62, # keep test period consistent with the dataset yaml
trunc_days=1 + self.horizon,
hist_step_n=30,
fill_method="max",
fill_method=fill_method,
rolling_ext_days=0,
)
# NOTE:
@@ -165,12 +208,15 @@ class DDGDA:
# So the misalignment will not affect the effectiveness of the method.
with self._internal_data_path.open("rb") as f:
internal_data = pickle.load(f)
md = MetaDatasetDS(exp_name=internal_data, **kwargs)
# 3) train and logging meta model
with R.start(experiment_name=self.meta_exp_name):
R.log_params(**kwargs)
mm = MetaModelDS(step=self.step, hist_step_n=kwargs["hist_step_n"], lr=0.001, max_epoch=200, seed=43)
mm = MetaModelDS(
step=self.step, hist_step_n=kwargs["hist_step_n"], lr=0.001, max_epoch=30, seed=43, alpha=self.alpha
)
mm.fit(md)
R.save_objects(model=mm)
@@ -203,7 +249,7 @@ class DDGDA:
hist_step_n = int(param["hist_step_n"])
fill_method = param.get("fill_method", "max")
rb = RollingBenchmark(model_type=self.forecast_model)
rb = RollingBenchmark(model_type=self.forecast_model, **self.rb_kwargs)
task_l = rb.create_rolling_tasks()
# 2.2) create meta dataset for final dataset
@@ -233,13 +279,13 @@ class DDGDA:
"""
with self._task_path.open("rb") as f:
tasks = pickle.load(f)
rb = RollingBenchmark(rolling_exp="rolling_ds", model_type=self.forecast_model)
rb = RollingBenchmark(rolling_exp="rolling_ds", model_type=self.forecast_model, **self.rb_kwargs)
rb.train_rolling_tasks(tasks)
rb.ens_rolling()
rb.update_rolling_rec()
def run_all(self):
# 1) file: handler_proxy.pkl
# 1) file: handler_proxy.pkl (self.proxy_hd)
self.dump_data_for_proxy_model()
# 2)
# file: internal_data_s20.pkl
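
`train_meta_model` in the diff above falls back to fixed dates when `train_start` or `meta_1st_train_end` is not given, and starts the test segment on the day after the training end. The sketch below isolates that date logic with the standard library only. `resolve_meta_segments` is a hypothetical helper, not a function in the diff, and it takes the test end as a plain argument instead of reading it from the simulated task:

```python
from datetime import date, timedelta


def resolve_meta_segments(train_start=None, meta_1st_train_end=None, test_end=None):
    """Return (train, test) date tuples using the default dates seen in the diff."""
    start = date.fromisoformat(train_start or "2008-01-01")
    train_end = date.fromisoformat(meta_1st_train_end or "2010-12-31")
    # The test segment begins one calendar day after the training segment ends.
    test_start = train_end + timedelta(days=1)
    end = date.fromisoformat(test_end) if test_end else None
    return (start, train_end), (test_start, end)


default_train, default_test = resolve_meta_segments()
train, test = resolve_meta_segments(meta_1st_train_end="2012-06-30", test_end="2020-08-01")
print(train, test)
```

With no arguments the helper reproduces the previously hard-coded split, 2008-01-01 to 2010-12-31 for training and 2011-01-01 onward for testing.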

View File

@@ -4,15 +4,23 @@ So adapting the forecasting models/strategies to market dynamics is very importa
The table below shows the performances of different solutions on different forecasting models.
## Alpha158 dataset
## Alpha158 Dataset
Here is the [crowd sourced version of qlib data](data_collector/crowd_source/README.md): https://github.com/chenditc/investment_data/releases
```bash
wget https://github.com/chenditc/investment_data/releases/download/20220720/qlib_bin.tar.gz
mkdir -p ~/.qlib/qlib_data/cn_data
tar -zxvf qlib_bin.tar.gz -C ~/.qlib/qlib_data/cn_data --strip-components=2
rm -f qlib_bin.tar.gz
```
| Model Name | Dataset | IC | ICIR | Rank IC | Rank ICIR | Annualized Return | Information Ratio | Max Drawdown |
|------------------|---------|----|------|---------|-----------|-------------------|-------------------|--------------|
| RR[Linear] |Alpha158 |0.088|0.570|0.102 |0.622 |0.077 |1.175 |-0.086 |
| DDG-DA[Linear] |Alpha158 |0.093|0.622|0.106 |0.670 |0.085 |1.213 |-0.093 |
| RR[LightGBM] |Alpha158 |0.079|0.566|0.088 |0.592 |0.075 |1.226 |-0.096 |
| DDG-DA[LightGBM] |Alpha158 |0.084|0.639|0.093 |0.664 |0.099 |1.442 |-0.071 |
|------------------|---------|------|------|---------|-----------|-------------------|-------------------|--------------|
| RR[Linear] |Alpha158 |0.0945|0.5989|0.1069 |0.6495 |0.0857 |1.3682 |-0.0986 |
| DDG-DA[Linear] |Alpha158 |0.0983|0.6157|0.1108 |0.6646 |0.0764 |1.1904 |-0.0769 |
| RR[LightGBM] |Alpha158 |0.0816|0.5887|0.0912 |0.6263 |0.0771 |1.3196 |-0.0909 |
| DDG-DA[LightGBM] |Alpha158 |0.0878|0.6185|0.0975 |0.6524 |0.1261 |2.0096 |-0.0744 |
- The label horizon of the `Alpha158` dataset is set to 20.
- The rolling time intervals are set to 20 trading days.
- The test rolling periods are from January 2017 to August 2020.
- The results are based on the crowd-sourced version. The Yahoo version of qlib data does not contain `VWAP`, so all related factors are missing and filled with 0, which leads to a rank-deficient matrix (a matrix that does not have full rank) and makes the lower-level optimization of DDG-DA unsolvable.
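
The last note can be illustrated with a tiny least-squares problem: a feature column that is entirely zero makes the Gram matrix X^T X singular, so the normal equations have no unique solution. The sketch below is a generic pure-Python illustration, not DDG-DA code. It also shows that adding an L2 (ridge) term to the diagonal restores invertibility, which is a general property of ridge regression and not a claim about how Qlib handles this case:

```python
def gram_matrix(rows):
    """Compute X^T X for a two-column feature matrix given as a list of rows."""
    g = [[0.0, 0.0], [0.0, 0.0]]
    for x0, x1 in rows:
        g[0][0] += x0 * x0
        g[0][1] += x0 * x1
        g[1][0] += x1 * x0
        g[1][1] += x1 * x1
    return g


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# The second feature stands in for a missing factor that was filled with 0.
features = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
gram = gram_matrix(features)
singular = det2(gram) == 0.0

# Ridge adds alpha to the diagonal, which makes the matrix invertible again.
alpha = 0.01
ridge = [[gram[0][0] + alpha, gram[0][1]], [gram[1][0], gram[1][1] + alpha]]
print(singular, det2(ridge))
```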

View File

@@ -1,13 +1,17 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from typing import Optional
from qlib.model.ens.ensemble import RollingEnsemble
from qlib.utils import init_instance_by_config
import fire
import yaml
import pandas as pd
from qlib import auto_init
from pathlib import Path
from tqdm.auto import tqdm
from qlib.model.trainer import TrainerR
from qlib.log import get_module_logger
from qlib.utils.data import update_config
from qlib.workflow import R
from qlib.tests.data import GetData
@@ -25,23 +29,57 @@ class RollingBenchmark:
"""
def __init__(self, rolling_exp="rolling_models", model_type="linear") -> None:
def __init__(
self,
rolling_exp: str = "rolling_models",
model_type: str = "linear",
h_path: Optional[str] = None,
train_start: Optional[str] = None,
test_end: Optional[str] = None,
task_ext_conf: Optional[dict] = None,
) -> None:
"""
Parameters
----------
rolling_exp : str
The name for the experiments for rolling
model_type : str
The model to be boosted.
h_path : Optional[str]
the dumped data handler;
test_end : Optional[str]
the test end for the data. It is typically used together with the handler
train_start : Optional[str]
the train start for the data. It is typically used together with the handler.
task_ext_conf : Optional[dict]
some option to update the
"""
self.step = 20
self.horizon = 20
self.rolling_exp = rolling_exp
self.model_type = model_type
self.h_path = h_path
self.train_start = train_start
self.test_end = test_end
self.logger = get_module_logger("RollingBenchmark")
self.task_ext_conf = task_ext_conf
def basic_task(self):
"""For fast training rolling"""
if self.model_type == "gbdt":
conf_path = DIRNAME.parent.parent / "benchmarks" / "LightGBM" / "workflow_config_lightgbm_Alpha158.yaml"
conf_path = DIRNAME / "workflow_config_lightgbm_Alpha158.yaml"
# dump the processed data on to disk for later loading to speed up the processing
h_path = DIRNAME / "lightgbm_alpha158_handler_horizon{}.pkl".format(self.horizon)
elif self.model_type == "linear":
conf_path = DIRNAME.parent.parent / "benchmarks" / "Linear" / "workflow_config_linear_Alpha158.yaml"
# We use ridge regression to stabilize the performance
conf_path = DIRNAME / "workflow_config_linear_Alpha158.yaml"
h_path = DIRNAME / "linear_alpha158_handler_horizon{}.pkl".format(self.horizon)
else:
raise AssertionError("Model type is not supported!")
if self.h_path is not None:
h_path = Path(self.h_path)
with conf_path.open("r") as f:
conf = yaml.safe_load(f)
@@ -52,6 +90,9 @@ class RollingBenchmark:
task = conf["task"]
if self.task_ext_conf is not None:
task = update_config(task, self.task_ext_conf)
if not h_path.exists():
h_conf = task["dataset"]["kwargs"]["handler"]
h = init_instance_by_config(h_conf)
@@ -59,6 +100,15 @@ class RollingBenchmark:
task["dataset"]["kwargs"]["handler"] = f"file://{h_path}"
task["record"] = ["qlib.workflow.record_temp.SignalRecord"]
if self.train_start is not None:
seg = task["dataset"]["kwargs"]["segments"]["train"]
task["dataset"]["kwargs"]["segments"]["train"] = pd.Timestamp(self.train_start), seg[1]
if self.test_end is not None:
seg = task["dataset"]["kwargs"]["segments"]["test"]
task["dataset"]["kwargs"]["segments"]["test"] = seg[0], pd.Timestamp(self.test_end)
self.logger.info(task)
return task
def create_rolling_tasks(self):
@@ -93,7 +143,7 @@ class RollingBenchmark:
"""
Evaluate the combined rolling results
"""
for rid, rec in R.list_recorders(experiment_name=self.COMB_EXP).items():
for _, rec in R.list_recorders(experiment_name=self.COMB_EXP).items():
for rt_cls in SigAnaRecord, PortAnaRecord:
rt = rt_cls(recorder=rec, skip_existing=True)
rt.generate()
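
`basic_task` in the diff above builds the data handler only when `h_path` does not exist yet, and afterwards points the task at the dumped file so later runs can skip the preprocessing. The same build-once-then-reload pattern, sketched with the standard library; `load_or_build` and the toy builder are hypothetical stand-ins for the handler logic:

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(path, builder):
    """Return the object cached at `path`, building and pickling it on first use."""
    path = Path(path)
    if not path.exists():
        obj = builder()  # the expensive step, e.g. feature preprocessing
        with path.open("wb") as f:
            pickle.dump(obj, f)
    with path.open("rb") as f:
        return pickle.load(f)


calls = []


def expensive_builder():
    calls.append(1)  # record how many times the builder actually runs
    return {"features": [1, 2, 3]}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "handler_cache.pkl"
    first = load_or_build(cache_file, expensive_builder)
    second = load_or_build(cache_file, expensive_builder)
print(first, second, len(calls))
```

Passing a custom path, as the new `h_path` argument allows, lets several experiments share one cached file as long as they agree on the data range.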

View File

@@ -0,0 +1,72 @@
qlib_init:
    provider_uri: "~/.qlib/qlib_data/cn_data"
    region: cn
market: &market csi300
benchmark: &benchmark SH000300
data_handler_config: &data_handler_config
    start_time: 2008-01-01
    end_time: 2020-08-01
    fit_start_time: 2008-01-01
    fit_end_time: 2014-12-31
    instruments: *market
port_analysis_config: &port_analysis_config
    strategy:
        class: TopkDropoutStrategy
        module_path: qlib.contrib.strategy
        kwargs:
            model: <MODEL>
            dataset: <DATASET>
            topk: 50
            n_drop: 5
    backtest:
        start_time: 2017-01-01
        end_time: 2020-08-01
        account: 100000000
        benchmark: *benchmark
        exchange_kwargs:
            limit_threshold: 0.095
            deal_price: close
            open_cost: 0.0005
            close_cost: 0.0015
            min_cost: 5
task:
    model:
        class: LGBModel
        module_path: qlib.contrib.model.gbdt
        kwargs:
            loss: mse
            colsample_bytree: 0.8879
            learning_rate: 0.2
            subsample: 0.8789
            lambda_l1: 205.6999
            lambda_l2: 580.9768
            max_depth: 8
            num_leaves: 210
            num_threads: 20
    dataset:
        class: DatasetH
        module_path: qlib.data.dataset
        kwargs:
            handler:
                class: Alpha158
                module_path: qlib.contrib.data.handler
                kwargs: *data_handler_config
            segments:
                train: [2008-01-01, 2014-12-31]
                valid: [2015-01-01, 2016-12-31]
                test: [2017-01-01, 2020-08-01]
    record:
        - class: SignalRecord
          module_path: qlib.workflow.record_temp
          kwargs:
              model: <MODEL>
              dataset: <DATASET>
        - class: SigAnaRecord
          module_path: qlib.workflow.record_temp
          kwargs:
              ana_long_short: False
              ann_scaler: 252
        - class: PortAnaRecord
          module_path: qlib.workflow.record_temp
          kwargs:
              config: *port_analysis_config

View File

@@ -0,0 +1,79 @@
qlib_init:
provider_uri: "~/.qlib/qlib_data/cn_data"
region: cn
market: &market csi300
benchmark: &benchmark SH000300
data_handler_config: &data_handler_config
start_time: 2008-01-01
end_time: 2020-08-01
fit_start_time: 2008-01-01
fit_end_time: 2014-12-31
instruments: *market
infer_processors:
- class: RobustZScoreNorm
kwargs:
fields_group: feature
clip_outlier: true
- class: Fillna
kwargs:
fields_group: feature
learn_processors:
- class: DropnaLabel
- class: CSRankNorm
kwargs:
fields_group: label
port_analysis_config: &port_analysis_config
strategy:
class: TopkDropoutStrategy
module_path: qlib.contrib.strategy
kwargs:
signal:
- <MODEL>
- <DATASET>
topk: 50
n_drop: 5
backtest:
start_time: 2017-01-01
end_time: 2020-08-01
account: 100000000
benchmark: *benchmark
exchange_kwargs:
limit_threshold: 0.095
deal_price: close
open_cost: 0.0005
close_cost: 0.0015
min_cost: 5
task:
model:
class: LinearModel
module_path: qlib.contrib.model.linear
kwargs:
estimator: ridge
alpha: 0.05
dataset:
class: DatasetH
module_path: qlib.data.dataset
kwargs:
handler:
class: Alpha158
module_path: qlib.contrib.data.handler
kwargs: *data_handler_config
segments:
train: [2008-01-01, 2014-12-31]
valid: [2015-01-01, 2016-12-31]
test: [2017-01-01, 2020-08-01]
record:
- class: SignalRecord
module_path: qlib.workflow.record_temp
kwargs:
model: <MODEL>
dataset: <DATASET>
- class: SigAnaRecord
module_path: qlib.workflow.record_temp
kwargs:
ana_long_short: True
ann_scaler: 252
- class: PortAnaRecord
module_path: qlib.workflow.record_temp
kwargs:
config: *port_analysis_config

View File

@@ -1,55 +0,0 @@
This folder contains a simple example of how to run Qlib RL. It contains:
```
.
├── experiment_config
│ ├── backtest # Backtest config
│ └── training # Training config
├── README.md # Readme (the current file)
└── scripts # Scripts for data pre-processing
```
## Data preparation
Use [AzCopy](https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10) to download data:
```
azcopy copy https://qlibpublic.blob.core.windows.net/data/rl/qlib_rl_example_data ./ --recursive
mv qlib_rl_example_data data
```
The downloaded data will be placed at `./data`. The original data are in `data/csv`. To create all data needed by the case, run:
```
bash scripts/data_pipeline.sh
```
After the execution finishes, the `data/` directory should be like:
```
data
├── backtest_orders.csv
├── bin
├── csv
├── pickle
├── pickle_dataframe
└── training_order_split
```
## Run training
Run:
```
python -m qlib.rl.contrib.train_onpolicy --config_path ./experiment_config/training/config.yml
```
After training, checkpoints will be stored under `checkpoints/`.
## Run backtest
```
python -m qlib.rl.contrib.backtest --config_path ./experiment_config/backtest/config.yml
```
The backtest workflow will use the trained model in `checkpoints/`. The backtest summary can be found in `outputs/`.

View File

@@ -1,57 +0,0 @@
order_file: ./data/backtest_orders.csv
start_time: "9:45"
end_time: "14:44"
qlib:
provider_uri_1min: ./data/bin
feature_root_dir: ./data/pickle
feature_columns_today: [
"$open", "$high", "$low", "$close", "$vwap", "$volume",
]
feature_columns_yesterday: [
"$open_v1", "$high_v1", "$low_v1", "$close_v1", "$vwap_v1", "$volume_v1",
]
exchange:
limit_threshold: ['$close == 0', '$close == 0']
deal_price: ["If($close == 0, $vwap, $close)", "If($close == 0, $vwap, $close)"]
volume_threshold:
all: ["cum", "0.2 * DayCumsum($volume, '9:45', '14:44')"]
buy: ["current", "$close"]
sell: ["current", "$close"]
strategies:
30min:
class: TWAPStrategy
module_path: qlib.contrib.strategy.rule_strategy
kwargs: {}
1day:
class: SAOEIntStrategy
module_path: qlib.rl.order_execution.strategy
kwargs:
state_interpreter:
class: FullHistoryStateInterpreter
module_path: qlib.rl.order_execution.interpreter
kwargs:
max_step: 8
data_ticks: 240
data_dim: 6
processed_data_provider:
class: PickleProcessedDataProvider
module_path: qlib.rl.data.pickle_styled
kwargs:
data_dir: ./data/pickle_dataframe/feature
action_interpreter:
class: CategoricalActionInterpreter
module_path: qlib.rl.order_execution.interpreter
kwargs:
values: 14
max_step: 8
network:
class: Recurrent
module_path: qlib.rl.order_execution.network
kwargs: {}
policy:
class: PPO
module_path: qlib.rl.order_execution.policy
kwargs:
lr: 1.0e-4
weight_file: ./checkpoints/latest.pth
concurrency: 5

View File

@@ -1,59 +0,0 @@
simulator:
time_per_step: 30
vol_limit: null
env:
concurrency: 1
parallel_mode: dummy
action_interpreter:
class: CategoricalActionInterpreter
kwargs:
values: 14
max_step: 8
module_path: qlib.rl.order_execution.interpreter
state_interpreter:
class: FullHistoryStateInterpreter
kwargs:
data_dim: 6
data_ticks: 240
max_step: 8
processed_data_provider:
class: PickleProcessedDataProvider
module_path: qlib.rl.data.pickle_styled
kwargs:
data_dir: ./data/pickle_dataframe/feature
module_path: qlib.rl.order_execution.interpreter
reward:
class: PAPenaltyReward
kwargs:
penalty: 100.0
module_path: qlib.rl.order_execution.reward
data:
source:
order_dir: ./data/training_order_split
data_dir: ./data/pickle_dataframe/backtest
total_time: 240
default_start_time: 0
default_end_time: 240
proc_data_dim: 6
num_workers: 0
queue_size: 20
network:
class: Recurrent
module_path: qlib.rl.order_execution.network
policy:
class: PPO
kwargs:
lr: 0.0001
module_path: qlib.rl.order_execution.policy
runtime:
seed: 42
use_cuda: false
trainer:
max_epoch: 2
repeat_per_collect: 5
earlystop_patience: 2
episode_per_collect: 20
batch_size: 16
val_every_n_epoch: 1
checkpoint_path: ./checkpoints
checkpoint_every_n_iters: 1

View File

@@ -1,21 +0,0 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import os
import pickle
import pandas as pd
from tqdm import tqdm
os.makedirs(os.path.join("data", "pickle_dataframe"), exist_ok=True)
for tag in ("backtest", "feature"):
df = pickle.load(open(os.path.join("data", "pickle", f"{tag}.pkl"), "rb"))
df = pd.concat(list(df.values())).reset_index()
df["date"] = df["datetime"].dt.date.astype("datetime64")
instruments = sorted(set(df["instrument"]))
os.makedirs(os.path.join("data", "pickle_dataframe", tag), exist_ok=True)
for instrument in tqdm(instruments):
cur = df[df["instrument"] == instrument].sort_values(by=["datetime"])
cur = cur.set_index(["instrument", "datetime", "date"])
pickle.dump(cur, open(os.path.join("data", "pickle_dataframe", tag, f"{instrument}.pkl"), "wb"))

View File

@@ -1,14 +0,0 @@
# Generate `bin` format data
set -e
python ../../scripts/dump_bin.py dump_all --csv_path ./data/csv --qlib_dir ./data/bin --include_fields open,close,high,low,vwap,volume --symbol_field_name symbol --date_field_name date --freq 1min
# Generate pickle format data
python scripts/gen_pickle_data.py -c scripts/pickle_data_config.yml
if [ -e stat/ ]; then
rm -r stat/
fi
python scripts/collect_pickle_dataframe.py
# Sample orders
python scripts/gen_training_orders.py
python scripts/gen_backtest_orders.py

View File

@@ -1,55 +0,0 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import argparse
import os
import pandas as pd
import numpy as np
import pickle
parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=20220926)
parser.add_argument("--num_order", type=int, default=10)
args = parser.parse_args()
np.random.seed(args.seed)
path = os.path.join("data", "pickle", "backtesttest.pkl")
df = pickle.load(open(path, "rb")).reset_index()
df["date"] = df["datetime"].dt.date.astype("datetime64")
instruments = sorted(set(df["instrument"]))
# TODO: The example is expected to be able to handle data containing missing values.
# TODO: Currently, we just simply skip dates that contain missing data. We will add
# TODO: this feature in the future.
skip_dates = {}
for instrument in instruments:
csv_df = pd.read_csv(os.path.join("data", "csv", f"{instrument}.csv"))
csv_df = csv_df[csv_df["close"].isna()]
dates = set([str(d).split(" ")[0] for d in csv_df["date"]])
skip_dates[instrument] = dates
df_list = []
for instrument in instruments:
print(instrument)
cur_df = df[df["instrument"] == instrument]
dates = sorted(set([str(d).split(" ")[0] for d in cur_df["date"]]))
dates = [date for date in dates if date not in skip_dates[instrument]]
n = args.num_order
df_list.append(
pd.DataFrame(
{
"date": sorted(np.random.choice(dates, size=n, replace=False)),
"instrument": [instrument] * n,
"amount": np.random.randint(low=3, high=11, size=n) * 100.0,
"order_type": np.random.randint(low=0, high=2, size=n),
}
).set_index(["date", "instrument"]),
)
total_df = pd.concat(df_list)
total_df.to_csv("data/backtest_orders.csv")

View File

@@ -1,39 +0,0 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import argparse
import os
import pandas as pd
import numpy as np
import pickle
parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=20220926)
parser.add_argument("--stock", type=str, default="AAPL")
parser.add_argument("--train_size", type=int, default=10)
parser.add_argument("--valid_size", type=int, default=2)
parser.add_argument("--test_size", type=int, default=2)
args = parser.parse_args()
np.random.seed(args.seed)
os.makedirs(os.path.join("data", "training_order_split"), exist_ok=True)
for group, n in zip(("train", "valid", "test"), (args.train_size, args.valid_size, args.test_size)):
path = os.path.join("data", "pickle", f"backtest{group}.pkl")
df = pickle.load(open(path, "rb")).reset_index()
df["date"] = df["datetime"].dt.date.astype("datetime64")
dates = sorted(set([str(d).split(" ")[0] for d in df["date"]]))
data_df = pd.DataFrame(
{
"date": sorted(np.random.choice(dates, size=n, replace=False)),
"instrument": [args.stock] * n,
"amount": np.random.randint(low=3, high=11, size=n) * 100.0,
"order_type": [0] * n,
}
).set_index(["date", "instrument"])
os.makedirs(os.path.join("data", "training_order_split", group), exist_ok=True)
pickle.dump(data_df, open(os.path.join("data", "training_order_split", group, f"{args.stock}.pkl"), "wb"))

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,100 @@
# RL Example for Order Execution
This folder contains an example of Reinforcement Learning (RL) workflows for the order execution scenario, including both training and backtest workflows.
## Data Processing
### Get Data
```
python -m qlib.run.get_data qlib_data qlib_data --target_dir ./data/bin --region hs300 --interval 5min
```
### Generate Pickle-Style Data
To run the code in this example, we need data in pickle format. To generate it, run the following commands (this might take a few minutes):
[//]: # (TODO: Instead of dumping dataframe with different format &#40;like `_gen_dataset` and `_gen_day_dataset` in `qlib/contrib/data/highfreq_provider.py`&#41;, we encourage to implement different subclass of `Dataset` and `DataHandler`. This will keep the workflow cleaner and interfaces more consistent, and move all the complexity to the subclass.)
```
python scripts/gen_pickle_data.py -c scripts/pickle_data_config.yml
python scripts/gen_training_orders.py
python scripts/merge_orders.py
```
When finished, the structure under `data/` should be:
```
data
├── bin
├── orders
└── pickle
```
## Training
Each training task is specified by a config file. The config file for task `TASKNAME` is `exp_configs/train_TASKNAME.yml`. This example provides two training tasks:
- **PPO**: Method proposed by the IJCAI 2020 paper "[An End-to-End Optimal Trade Execution Framework based on Proximal Policy Optimization](https://www.ijcai.org/proceedings/2020/0627.pdf)".
- **OPDS**: Method proposed by AAAI 2021 paper "[Universal Trading for Order Execution with Oracle Policy Distillation](https://arxiv.org/abs/2103.10860)".
The main difference between these two methods is their reward functions. Please see their config files for details.
Taking OPDS as an example, run the training workflow with:
```
python -m qlib.rl.contrib.train_onpolicy --config_path exp_configs/train_opds.yml --run_backtest
```
Metrics, logs, and checkpoints will be stored under `outputs/opds` (configured by `exp_configs/train_opds.yml`).
## Backtest
Once the training workflow has completed, the trained model can be used for the backtesting workflow. Still taking OPDS as an example, once training is finished, the latest checkpoint of the model can be found at `outputs/opds/checkpoints/latest.pth`. To run backtest workflow:
1. Uncomment the `weight_file` parameter in `exp_configs/backtest_opds.yml` (it is commented out by default). While it is possible to run the backtesting workflow without setting a checkpoint, the model would then be randomly initialized, making the results meaningless.
2. Run `python -m qlib.rl.contrib.backtest --config_path exp_configs/backtest_opds.yml`.
The backtest result is stored in `outputs/checkpoints/backtest_result.csv`.
In addition to OPDS and PPO, we also provide TWAP ([Time-weighted average price](https://en.wikipedia.org/wiki/Time-weighted_average_price)) as a weak baseline. The config file for TWAP is `exp_configs/backtest_twap.yml`.
### Gap between backtest and training pipeline's testing
It is worth noting that the results of the backtesting process may differ from the results of the testing process used during training.
This is because different simulators are used to simulate market conditions during training and backtesting.
In the training pipeline, a simplified simulator called `SingleAssetOrderExecutionSimple` is used for efficiency reasons.
`SingleAssetOrderExecutionSimple` makes no restriction to trading amounts.
No matter what the amount of the order is, it can be completely executed.
However, during backtesting, a more realistic simulator called `SingleAssetOrderExecution` is used.
It takes into account practical constraints in more real-world scenarios (for example, the trading volume must be a multiple of the smallest trading unit).
As a result, the amount of an order that is actually executed during backtesting may differ from the amount expected to be executed.
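The effect of such a constraint can be sketched in a few lines. This is only an illustration, not Qlib's simulator code: the helper `executable_amount` and the trade unit of 100 are assumptions chosen for the example.

```python
def executable_amount(requested: float, trade_unit: int = 100) -> float:
    """Round a requested amount down to a whole number of trade units.

    `trade_unit` is a hypothetical smallest trading unit; the real value
    depends on the market and on the simulator configuration.
    """
    if trade_unit <= 0:
        raise ValueError("trade_unit must be positive")
    return float(int(requested // trade_unit) * trade_unit)


requested = 1234.0
simple_fill = requested  # simplified simulator: the whole amount is executed
realistic_fill = executable_amount(requested)  # constrained simulator
print(simple_fill, realistic_fill)  # 1234.0 1200.0
```

Under these assumptions, 34 of the 1234 requested shares are never traded in the constrained case, which is enough to shift the reported metrics slightly.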
If you would like to obtain results that are exactly the same as those obtained during testing in the training pipeline, you can run the training pipeline with only the backtest phase.
To do this:
- Modify the training config. Add the path of the checkpoint you want to use (see the example below).
- Run `python -m qlib.rl.contrib.train_onpolicy --config_path PATH/TO/CONFIG --run_backtest --no_training`
```yaml
...
policy:
class: PPO # PPO, DQN
kwargs:
lr: 0.0001
weight_file: PATH/TO/CHECKPOINT
module_path: qlib.rl.order_execution.policy
...
```
## Benchmarks (TBD)
To accurately evaluate the performance of models using Reinforcement Learning algorithms, it's best to run experiments multiple times and compute the average performance across all trials. However, given the time-consuming nature of model training, this is not always feasible. An alternative approach is to run each training task only once, selecting the 10 checkpoints with the highest validation performance to simulate multiple trials. In this example, we use "Price Advantage (PA)" as the metric for selecting these checkpoints. The average performance of these 10 checkpoints on the testing set is as follows:
| **Model** | **PA mean with std.** |
|-----------------------------|-----------------------|
| OPDS (with PPO policy) | 0.4785 ± 0.7815 |
| OPDS (with DQN policy) | -0.0114 ± 0.5780 |
| PPO | -1.0935 ± 0.0922 |
| TWAP | ≈ 0.0 ± 0.0 |
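The checkpoint selection described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `summarize_top_checkpoints`, the tuple layout, and the use of the sample standard deviation are illustrative choices, and the numbers in the usage example are made up rather than taken from the table.

```python
from statistics import mean, stdev


def summarize_top_checkpoints(results, k=10):
    """Pick the k checkpoints with the best validation PA and summarize test PA.

    `results` is a list of (name, valid_pa, test_pa) tuples (hypothetical layout).
    Returns (mean, std) of the test PA over the selected checkpoints.
    """
    top = sorted(results, key=lambda r: r[1], reverse=True)[:k]
    test_pas = [r[2] for r in top]
    spread = stdev(test_pas) if len(test_pas) > 1 else 0.0
    return mean(test_pas), spread


# Illustrative values only.
results = [("ckpt_a", 0.5, 1.0), ("ckpt_b", 0.1, 3.0), ("ckpt_c", 0.9, 2.0)]
avg, spread = summarize_top_checkpoints(results, k=2)
print(f"{avg:.4f} ± {spread:.4f}")
```

Note that selection uses only the validation PA, while the reported mean and standard deviation are computed on the test PA of the selected checkpoints.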
The table above also includes TWAP as a rule-based baseline. Ideally, the PA of TWAP is 0.0. In this example, however, order execution is split in two steps: the order is first divided equally across the half-hour intervals, and then across the five-minute bars within each half hour. Since trading is forbidden during the last five minutes of the day, this differs slightly from a traditional TWAP over the full day (the last "half hour" is missing 5 minutes). The PA of TWAP can therefore be considered close to 0.0. To verify this, you may run a TWAP backtest and check the results.
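The difference between the two-step split and a flat TWAP can be seen with a small calculation. The sketch below assumes 8 half-hour steps of 6 five-minute bars each, with one bar missing from the last step; the function `two_step_twap` is hypothetical and is not the `TWAPStrategy` implementation.

```python
def two_step_twap(total, n_steps=8, bars_per_step=6, missing_bars=1):
    """Split `total` equally per step, then equally per bar within each step.

    The last step has `missing_bars` fewer bars, mimicking the forbidden
    final five minutes of the trading day (an assumption of this sketch).
    """
    per_step = total / n_steps
    schedule = []
    for step in range(n_steps):
        is_last = step == n_steps - 1
        n_bars = bars_per_step - (missing_bars if is_last else 0)
        schedule.extend([per_step / n_bars] * n_bars)
    return schedule


schedule = two_step_twap(4800.0)
flat_amount = 4800.0 / len(schedule)  # a flat TWAP over the same 47 bars
print(len(schedule), schedule[0], schedule[-1])  # 47 100.0 120.0
```

With these numbers, each bar in the last half hour trades 120 shares instead of 100, whereas a flat TWAP would trade about 102 shares in every bar. This uneven weighting is why the measured PA is close to, but not exactly, 0.0.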

View File

@@ -0,0 +1,53 @@
order_file: ./data/orders/test_orders.pkl
start_time: "9:30"
end_time: "14:54"
data_granularity: "5min"
qlib:
provider_uri_5min: ./data/bin/
exchange:
limit_threshold: null
deal_price: ["$close", "$close"]
volume_threshold: null
strategies:
1day:
class: SAOEIntStrategy
kwargs:
data_granularity: 5
action_interpreter:
class: CategoricalActionInterpreter
kwargs:
max_step: 8
values: 4
module_path: qlib.rl.order_execution.interpreter
network:
class: Recurrent
kwargs: {}
module_path: qlib.rl.order_execution.network
policy:
class: PPO # PPO, DQN
kwargs:
lr: 0.0001
# Restore `weight_file` once the training workflow finishes. You can change the checkpoint file you want to use.
# weight_file: outputs/opds/checkpoints/latest.pth
module_path: qlib.rl.order_execution.policy
state_interpreter:
class: FullHistoryStateInterpreter
kwargs:
data_dim: 5
data_ticks: 48
max_step: 8
processed_data_provider:
class: HandlerProcessedDataProvider
kwargs:
data_dir: ./data/pickle/
feature_columns_today: ["$high", "$low", "$open", "$close", "$volume"]
feature_columns_yesterday: ["$high_1", "$low_1", "$open_1", "$close_1", "$volume_1"]
module_path: qlib.rl.data.native
module_path: qlib.rl.order_execution.interpreter
module_path: qlib.rl.order_execution.strategy
30min:
class: TWAPStrategy
kwargs: {}
module_path: qlib.contrib.strategy.rule_strategy
concurrency: 16
output_dir: outputs/opds/

View File

@@ -0,0 +1,53 @@
order_file: ./data/orders/test_orders.pkl
start_time: "9:30"
end_time: "14:54"
data_granularity: "5min"
qlib:
provider_uri_5min: ./data/bin/
exchange:
limit_threshold: null
deal_price: ["$close", "$close"]
volume_threshold: null
strategies:
1day:
class: SAOEIntStrategy
kwargs:
data_granularity: 5
action_interpreter:
class: CategoricalActionInterpreter
kwargs:
max_step: 8
values: 4
module_path: qlib.rl.order_execution.interpreter
network:
class: Recurrent
kwargs: {}
module_path: qlib.rl.order_execution.network
policy:
class: PPO # PPO, DQN
kwargs:
lr: 0.0001
# Restore `weight_file` once the training workflow finishes. You can change the checkpoint file you want to use.
# weight_file: outputs/ppo/checkpoints/latest.pth
module_path: qlib.rl.order_execution.policy
state_interpreter:
class: FullHistoryStateInterpreter
kwargs:
data_dim: 5
data_ticks: 48
max_step: 8
processed_data_provider:
class: HandlerProcessedDataProvider
kwargs:
data_dir: ./data/pickle/
feature_columns_today: ["$high", "$low", "$open", "$close", "$volume"]
feature_columns_yesterday: ["$high_1", "$low_1", "$open_1", "$close_1", "$volume_1"]
module_path: qlib.rl.data.native
module_path: qlib.rl.order_execution.interpreter
module_path: qlib.rl.order_execution.strategy
30min:
class: TWAPStrategy
kwargs: {}
module_path: qlib.contrib.strategy.rule_strategy
concurrency: 16
output_dir: outputs/ppo/

View File

@@ -0,0 +1,21 @@
order_file: ./data/orders/test_orders.pkl
start_time: "9:30"
end_time: "14:54"
data_granularity: "5min"
qlib:
provider_uri_5min: ./data/bin/
exchange:
limit_threshold: null
deal_price: ["$close", "$close"]
volume_threshold: null
strategies:
1day:
class: TWAPStrategy
kwargs: {}
module_path: qlib.contrib.strategy.rule_strategy
30min:
class: TWAPStrategy
kwargs: {}
module_path: qlib.contrib.strategy.rule_strategy
concurrency: 16
output_dir: outputs/twap/

View File

@@ -0,0 +1,66 @@
simulator:
data_granularity: 5
time_per_step: 30
vol_limit: null
env:
concurrency: 32
parallel_mode: dummy
action_interpreter:
class: CategoricalActionInterpreter
kwargs:
values: 4
max_step: 8
module_path: qlib.rl.order_execution.interpreter
state_interpreter:
class: FullHistoryStateInterpreter
kwargs:
data_dim: 5
data_ticks: 48 # 48 = 240 min / 5 min
max_step: 8
processed_data_provider:
class: HandlerProcessedDataProvider
kwargs:
data_dir: ./data/pickle/
feature_columns_today: ["$high", "$low", "$open", "$close", "$volume"]
feature_columns_yesterday: ["$high_1", "$low_1", "$open_1", "$close_1", "$volume_1"]
backtest: false
module_path: qlib.rl.data.native
module_path: qlib.rl.order_execution.interpreter
reward:
class: PAPenaltyReward
kwargs:
penalty: 4.0
scale: 0.01
module_path: qlib.rl.order_execution.reward
data:
source:
order_dir: ./data/orders
feature_root_dir: ./data/pickle/
feature_columns_today: ["$close0", "$volume0"]
feature_columns_yesterday: []
total_time: 240
default_start_time_index: 0
default_end_time_index: 235
proc_data_dim: 5
num_workers: 0
queue_size: 20
network:
class: Recurrent
module_path: qlib.rl.order_execution.network
policy:
class: PPO # PPO, DQN
kwargs:
lr: 0.0001
module_path: qlib.rl.order_execution.policy
runtime:
seed: 42
use_cuda: false
trainer:
max_epoch: 500
repeat_per_collect: 25
earlystop_patience: 50
episode_per_collect: 10000
batch_size: 1024
val_every_n_epoch: 4
checkpoint_path: ./outputs/opds
checkpoint_every_n_iters: 1

View File

@@ -0,0 +1,67 @@
simulator:
data_granularity: 5
time_per_step: 30
vol_limit: null
env:
concurrency: 32
parallel_mode: dummy
action_interpreter:
class: CategoricalActionInterpreter
kwargs:
values: 4
max_step: 8
module_path: qlib.rl.order_execution.interpreter
state_interpreter:
class: FullHistoryStateInterpreter
kwargs:
data_dim: 5
data_ticks: 48 # 48 = 240 min / 5 min
max_step: 8
processed_data_provider:
class: HandlerProcessedDataProvider
kwargs:
data_dir: ./data/pickle/
feature_columns_today: ["$high", "$low", "$open", "$close", "$volume"]
feature_columns_yesterday: ["$high_1", "$low_1", "$open_1", "$close_1", "$volume_1"]
backtest: false
module_path: qlib.rl.data.native
module_path: qlib.rl.order_execution.interpreter
reward:
class: PPOReward
kwargs:
max_step: 8
start_time_index: 0
end_time_index: 46 # 46 = (240 - 5) min / 5 min - 1
module_path: qlib.rl.order_execution.reward
data:
source:
order_dir: ./data/orders
feature_root_dir: ./data/pickle/
feature_columns_today: ["$close0", "$volume0"]
feature_columns_yesterday: []
total_time: 240
default_start_time_index: 0
default_end_time_index: 235
proc_data_dim: 5
num_workers: 0
queue_size: 20
network:
class: Recurrent
module_path: qlib.rl.order_execution.network
policy:
class: PPO # PPO, DQN
kwargs:
lr: 0.0001
module_path: qlib.rl.order_execution.policy
runtime:
seed: 42
use_cuda: false
trainer:
max_epoch: 500
repeat_per_collect: 25
earlystop_patience: 50
episode_per_collect: 10000
batch_size: 1024
val_every_n_epoch: 4
checkpoint_path: ./outputs/ppo
checkpoint_every_n_iters: 1

View File

@@ -4,6 +4,7 @@
import yaml
import argparse
import os
import shutil
from copy import deepcopy
from qlib.contrib.data.highfreq_provider import HighFreqProvider
@@ -41,3 +42,5 @@ if __name__ == "__main__":
if args.split == "stock" or args.split == "both":
provider._gen_stock_dataset(deepcopy(provider.feature_conf), "feature")
provider._gen_stock_dataset(deepcopy(provider.backtest_conf), "backtest")
shutil.rmtree("stat/", ignore_errors=True)

View File

@@ -0,0 +1,53 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import os
import numpy as np
import pandas as pd
from pathlib import Path
DATA_PATH = Path(os.path.join("data", "pickle", "backtest"))
OUTPUT_PATH = Path(os.path.join("data", "orders"))
def generate_order(stock: str, start_idx: int, end_idx: int) -> bool:
dataset = pd.read_pickle(DATA_PATH / f"{stock}.pkl")
df = dataset.handler.fetch(level=None).reset_index()
if len(df) == 0 or df.isnull().values.any() or min(df["$volume0"]) < 1e-5:
return False
df["date"] = df["datetime"].dt.date.astype("datetime64")
df = df.set_index(["instrument", "datetime", "date"])
df = df.groupby("date").take(range(start_idx, end_idx)).droplevel(level=0)
order_all = pd.DataFrame(df.groupby(level=(2, 0)).mean().dropna())
order_all["amount"] = np.random.lognormal(-3.28, 1.14) * order_all["$volume0"]
order_all = order_all[order_all["amount"] > 0.0]
order_all["order_type"] = 0
order_all = order_all.drop(columns=["$volume0"])
order_train = order_all[order_all.index.get_level_values(0) <= pd.Timestamp("2021-06-30")]
order_test = order_all[order_all.index.get_level_values(0) > pd.Timestamp("2021-06-30")]
order_valid = order_test[order_test.index.get_level_values(0) <= pd.Timestamp("2021-09-30")]
order_test = order_test[order_test.index.get_level_values(0) > pd.Timestamp("2021-09-30")]
for order, tag in zip((order_train, order_valid, order_test, order_all), ("train", "valid", "test", "all")):
path = OUTPUT_PATH / tag
os.makedirs(path, exist_ok=True)
if len(order) > 0:
order.to_pickle(path / f"{stock}.pkl.target")
return True
np.random.seed(1234)
file_list = sorted(os.listdir(DATA_PATH))
stocks = [f.replace(".pkl", "") for f in file_list]
np.random.shuffle(stocks)
cnt = 0
for stock in stocks:
if generate_order(stock, 0, 240 // 5 - 1):
cnt += 1
if cnt == 100:
break

View File

@@ -0,0 +1,15 @@
import pickle
import os
import pandas as pd
from tqdm import tqdm
for tag in ["test", "valid"]:
files = os.listdir(os.path.join("data/orders/", tag))
dfs = []
for f in tqdm(files):
df = pickle.load(open(os.path.join("data/orders/", tag, f), "rb"))
df = df.drop(["$close0"], axis=1)
dfs.append(df)
total_df = pd.concat(dfs)
pickle.dump(total_df, open(os.path.join("data", "orders", f"{tag}_orders.pkl"), "wb"))

View File

@@ -1,15 +1,16 @@
# start & end time for training/validation/test datasets
start_time: !!str &start 2020-01-01
end_time: !!str &end 2020-07-31
train_end_time: !!str &tend 2020-03-31
valid_start_time: !!str &vstart 2020-04-01
valid_end_time: !!str &vend 2020-05-31
test_start_time: !!str &tstart 2020-06-01
end_time: !!str &end 2021-12-31
train_end_time: !!str &tend 2021-06-30
valid_start_time: !!str &vstart 2021-07-01
valid_end_time: !!str &vend 2021-09-30
test_start_time: !!str &tstart 2021-10-01
# the instrument set
instruments: &ins all
instruments: &ins csi300s19_22
# qlib related configuration
qlib_conf:
provider_uri: ./data/bin # path to generated qlib bin
provider_uri:
5min: ./data/bin # path to generated qlib bin
redis_port: 233
feature_conf:
path: ./data/pickle/feature.pkl # output path of feature
@@ -26,14 +27,23 @@ feature_conf:
fit_end_time: *tend
instruments: *ins
day_length: 240 # how many minutes in one trading day
freq: 5min
columns: ["$open", "$high", "$low", "$close"]
infer_processors:
- class: HighFreqNorm
module_path: qlib.contrib.data.highfreq_processor
kwargs:
feature_save_dir: ./stat/ # output path of statistics of features (for feature normalization)
norm_groups:
price: 10
price: 8
volume: 2
inst_processors:
- class: TimeRangeFlt
module_path: qlib.data.dataset.processor
kwargs:
start_time: "2020-01-01"
end_time: "2021-12-31"
freq: 5min
segments:
train: !!python/tuple [*start, *tend]
valid: !!python/tuple [*vstart, *vend]
@@ -51,7 +61,17 @@ backtest_conf:
end_time: *end
instruments: *ins
day_length: 240
freq: 5min
columns: ["$close", "$volume"]
inst_processors:
- class: TimeRangeFlt
module_path: qlib.data.dataset.processor
kwargs:
start_time: "2020-01-01"
end_time: "2021-12-31"
freq: 5min
segments:
train: !!python/tuple [*start, *tend]
valid: !!python/tuple [*vstart, *vend]
test: !!python/tuple [*tstart, *end]
freq: 5min

View File

@@ -88,6 +88,7 @@
"outputs": [],
"source": [
"from qlib.tests.data import GetData\n",
"\n",
"GetData().qlib_data(exists_skip=True)"
]
},
@@ -99,6 +100,7 @@
"outputs": [],
"source": [
"import qlib\n",
"\n",
"qlib.init()"
]
},
@@ -134,7 +136,8 @@
"outputs": [],
"source": [
"from qlib.data import D\n",
"D.calendar(start_time='2010-01-01', end_time='2017-12-31', freq='day')[:2] # calendar data"
"\n",
"print(D.calendar(start_time=\"2010-01-01\", end_time=\"2017-12-31\", freq=\"day\")[:2]) # calendar data"
]
},
{
@@ -152,7 +155,12 @@
"metadata": {},
"outputs": [],
"source": [
"df = D.features(['SH601216'], ['$open', '$high', '$low', '$close', '$factor'], start_time='2020-05-01', end_time='2020-05-31') "
"df = D.features(\n",
" [\"SH601216\"],\n",
" [\"$open\", \"$high\", \"$low\", \"$close\", \"$factor\"],\n",
" start_time=\"2020-05-01\",\n",
" end_time=\"2020-05-31\",\n",
")"
]
},
{
@@ -163,11 +171,18 @@
"outputs": [],
"source": [
"import plotly.graph_objects as go\n",
"fig = go.Figure(data=[go.Candlestick(x=df.index.get_level_values(\"datetime\"),\n",
" open=df['$open'],\n",
" high=df['$high'],\n",
" low=df['$low'],\n",
" close=df['$close'])])\n",
"\n",
"fig = go.Figure(\n",
" data=[\n",
" go.Candlestick(\n",
" x=df.index.get_level_values(\"datetime\"),\n",
" open=df[\"$open\"],\n",
" high=df[\"$high\"],\n",
" low=df[\"$low\"],\n",
" close=df[\"$close\"],\n",
" )\n",
" ]\n",
")\n",
"fig.show()"
]
},
@@ -197,11 +212,18 @@
"outputs": [],
"source": [
"import plotly.graph_objects as go\n",
"fig = go.Figure(data=[go.Candlestick(x=df.index.get_level_values(\"datetime\"),\n",
" open=df['$open'] / df['$factor'],\n",
" high=df['$high'] / df['$factor'],\n",
" low=df['$low'] / df['$factor'],\n",
" close=df['$close'] / df['$factor'])])\n",
"\n",
"fig = go.Figure(\n",
" data=[\n",
" go.Candlestick(\n",
" x=df.index.get_level_values(\"datetime\"),\n",
" open=df[\"$open\"] / df[\"$factor\"],\n",
" high=df[\"$high\"] / df[\"$factor\"],\n",
" low=df[\"$low\"] / df[\"$factor\"],\n",
" close=df[\"$close\"] / df[\"$factor\"],\n",
" )\n",
" ]\n",
")\n",
"fig.show()"
]
},
@@ -240,7 +262,7 @@
"outputs": [],
"source": [
"# dynamic universe\n",
"universe = D.list_instruments(D.instruments('csi100'), start_time='2010-01-01', end_time='2020-12-31')\n",
"universe = D.list_instruments(D.instruments(\"csi100\"), start_time=\"2010-01-01\", end_time=\"2020-12-31\")\n",
"pprint(universe)"
]
},
@@ -271,8 +293,8 @@
"metadata": {},
"outputs": [],
"source": [
"df = D.features(D.instruments('csi100'), ['$close'], start_time='2010-01-01', end_time='2020-12-31') \n",
"df.groupby('datetime').size().plot()"
"df = D.features(D.instruments(\"csi100\"), [\"$close\"], start_time=\"2010-01-01\", end_time=\"2020-12-31\")\n",
"df.groupby(\"datetime\").size().plot()"
]
},
{
@@ -313,8 +335,7 @@
" !cd ../../scripts/data_collector/pit/ && pip install -r requirements.txt\n",
" !cd ../../scripts/data_collector/pit/ && python collector.py download_data --source_dir ~/.qlib/stock_data/source/pit --start 2000-01-01 --end 2020-01-01 --interval quarterly --symbol_regex \"^(600519|000725).*\"\n",
" !cd ../../scripts/data_collector/pit/ && python collector.py normalize_data --interval quarterly --source_dir ~/.qlib/stock_data/source/pit --normalize_dir ~/.qlib/stock_data/source/pit_normalized\n",
" !cd ../../scripts/ && python dump_pit.py dump --csv_path ~/.qlib/stock_data/source/pit_normalized --qlib_dir ~/.qlib/qlib_data/cn_data --interval quarterly\n",
" pass"
" !cd ../../scripts/ && python dump_pit.py dump --csv_path ~/.qlib/stock_data/source/pit_normalized --qlib_dir ~/.qlib/qlib_data/cn_data --interval quarterly"
]
},
{
@@ -338,7 +359,13 @@
"outputs": [],
"source": [
"instruments = [\"sh600519\"]\n",
"data = D.features(instruments, ['P($$roewa_q)'], start_time=\"2019-01-01\", end_time=\"2019-07-19\", freq=\"day\")"
"data = D.features(\n",
" instruments,\n",
" [\"P($$roewa_q)\"],\n",
" start_time=\"2019-01-01\",\n",
" end_time=\"2019-07-19\",\n",
" freq=\"day\",\n",
")"
]
},
{
@@ -366,7 +393,10 @@
"metadata": {},
"outputs": [],
"source": [
"D.features([\"sh600519\"], ['(EMA($close, 12) - EMA($close, 26))/$close - EMA((EMA($close, 12) - EMA($close, 26))/$close, 9)/$close'])"
"D.features(\n",
" [\"sh600519\"],\n",
" [\"(EMA($close, 12) - EMA($close, 26))/$close - EMA((EMA($close, 12) - EMA($close, 26))/$close, 9)/$close\"],\n",
")"
]
},
{
@@ -418,7 +448,7 @@
"metadata": {},
"outputs": [],
"source": [
"qdl = QlibDataLoader(config=(['$close / Ref($close, 10)'], ['RET10']))"
"qdl = QlibDataLoader(config=([\"$close / Ref($close, 10)\"], [\"RET10\"]))"
]
},
{
@@ -428,7 +458,7 @@
"metadata": {},
"outputs": [],
"source": [
"qdl.load(instruments=['sh600519'], start_time='20190101', end_time='20191231')"
"qdl.load(instruments=[\"sh600519\"], start_time=\"20190101\", end_time=\"20191231\")"
]
},
{
@@ -456,7 +486,7 @@
"metadata": {},
"outputs": [],
"source": [
"df = qdl.load(instruments=['sh600519'], start_time='20190101', end_time='20191231')"
"df = qdl.load(instruments=[\"sh600519\"], start_time=\"20190101\", end_time=\"20191231\")"
]
},
{
@@ -476,7 +506,7 @@
"metadata": {},
"outputs": [],
"source": [
"df.plot(kind='hist')"
"df.plot(kind=\"hist\")"
]
},
{
@@ -508,9 +538,16 @@
"source": [
"# NOTE: normally, the training & validation time range will be `fit_start_time` `fit_end_time`\n",
"# however, all the components are decomposed, so the training & validation time range is unknown when preprocessing.\n",
"dh = DataHandlerLP(instruments=['sh600519'], start_time='20170101', end_time='20191231',\n",
" infer_processors=[ZScoreNorm(fit_start_time='20170101', fit_end_time='20181231'), Fillna()],\n",
" data_loader=qdl)"
"dh = DataHandlerLP(\n",
" instruments=[\"sh600519\"],\n",
" start_time=\"20170101\",\n",
" end_time=\"20191231\",\n",
" infer_processors=[\n",
" ZScoreNorm(fit_start_time=\"20170101\", fit_end_time=\"20181231\"),\n",
" Fillna(),\n",
" ],\n",
" data_loader=qdl,\n",
")"
]
},
{
@@ -550,7 +587,7 @@
"metadata": {},
"outputs": [],
"source": [
"df.plot(kind='hist')"
"df.plot(kind=\"hist\")"
]
},
{
@@ -586,7 +623,7 @@
"metadata": {},
"outputs": [],
"source": [
"ds = DatasetH(dh, segments={\"train\": ('20180101', '20181231'), \"valid\": ('20190101', '20191231')})"
"ds = DatasetH(dh, segments={\"train\": (\"20180101\", \"20181231\"), \"valid\": (\"20190101\", \"20191231\")})"
]
},
{
@@ -596,7 +633,7 @@
"metadata": {},
"outputs": [],
"source": [
"ds.prepare('train')"
"ds.prepare(\"train\")"
]
},
{
@@ -606,7 +643,7 @@
"metadata": {},
"outputs": [],
"source": [
"ds.prepare('valid')"
"ds.prepare(\"valid\")"
]
},
{
@@ -628,8 +665,12 @@
"metadata": {},
"outputs": [],
"source": [
"ds = TSDatasetH(step_len=10, handler=dh, segments={\"train\": ('20180101', '20181231'), \"valid\": ('20190101', '20191231')})\n",
"train_sampler = ds.prepare('train')"
"ds = TSDatasetH(\n",
" step_len=10,\n",
" handler=dh,\n",
" segments={\"train\": (\"20180101\", \"20181231\"), \"valid\": (\"20190101\", \"20191231\")},\n",
")\n",
"train_sampler = ds.prepare(\"train\")"
]
},
{
@@ -649,7 +690,7 @@
"metadata": {},
"outputs": [],
"source": [
"train_sampler[0] # Retrieving the first example"
"train_sampler[0] # Retrieving the first example"
]
},
{
@@ -659,7 +700,7 @@
"metadata": {},
"outputs": [],
"source": [
"train_sampler['2018-01-08', 'sh600519'] # get the time series by <'timestamp', 'instrument_id'> index"
"train_sampler[\"2018-01-08\", \"sh600519\"] # get the time series by <'timestamp', 'instrument_id'> index"
]
},
{
@@ -682,11 +723,11 @@
"outputs": [],
"source": [
"handler_kwargs = {\n",
" \"start_time\": \"2008-01-01\",\n",
" \"end_time\": \"2020-08-01\",\n",
" \"fit_start_time\": \"2008-01-01\",\n",
" \"fit_end_time\": \"2014-12-31\",\n",
" \"instruments\": MARKET,\n",
" \"start_time\": \"2008-01-01\",\n",
" \"end_time\": \"2020-08-01\",\n",
" \"fit_start_time\": \"2008-01-01\",\n",
" \"fit_end_time\": \"2014-12-31\",\n",
" \"instruments\": MARKET,\n",
"}\n",
"handler_conf = {\n",
" \"class\": \"Alpha158\",\n",
@@ -735,6 +776,7 @@
"outputs": [],
"source": [
"from qlib.contrib.data.handler import Alpha158\n",
"\n",
"hd = Alpha158(**handler_kwargs)"
]
},
@@ -826,7 +868,7 @@
"metadata": {},
"outputs": [],
"source": [
"hd.process_type # appending type"
"hd.process_type # appending type"
]
},
{
@@ -857,16 +899,16 @@
"outputs": [],
"source": [
"dataset_conf = {\n",
" \"class\": \"DatasetH\",\n",
" \"module_path\": \"qlib.data.dataset\",\n",
" \"kwargs\": {\n",
" \"handler\": hd,\n",
" \"segments\": {\n",
" \"train\": (\"2008-01-01\", \"2014-12-31\"),\n",
" \"valid\": (\"2015-01-01\", \"2016-12-31\"),\n",
" \"test\": (\"2017-01-01\", \"2020-08-01\"),\n",
" },\n",
" \"class\": \"DatasetH\",\n",
" \"module_path\": \"qlib.data.dataset\",\n",
" \"kwargs\": {\n",
" \"handler\": hd,\n",
" \"segments\": {\n",
" \"train\": (\"2008-01-01\", \"2014-12-31\"),\n",
" \"valid\": (\"2015-01-01\", \"2016-12-31\"),\n",
" \"test\": (\"2017-01-01\", \"2020-08-01\"),\n",
" },\n",
" },\n",
"}"
]
},
@@ -908,7 +950,8 @@
"metadata": {},
"outputs": [],
"source": [
"model = init_instance_by_config({\n",
"model = init_instance_by_config(\n",
" {\n",
" \"class\": \"LGBModel\",\n",
" \"module_path\": \"qlib.contrib.model.gbdt\",\n",
" \"kwargs\": {\n",
@@ -922,7 +965,8 @@
" \"num_leaves\": 210,\n",
" \"num_threads\": 20,\n",
" },\n",
"})"
" }\n",
")"
]
},
{
@@ -938,7 +982,7 @@
" R.save_objects(trained_model=model)\n",
"\n",
" rec = R.get_recorder()\n",
" rid = rec.id # save the record id\n",
" rid = rec.id # save the record id\n",
"\n",
" # Inference and saving signal\n",
" sr = SignalRecord(model, dataset, rec)\n",
@@ -1001,12 +1045,11 @@
"\n",
"# backtest and analysis\n",
"with R.start(experiment_name=EXP_NAME, recorder_id=rid, resume=True):\n",
"\n",
" # signal-based analysis\n",
" rec = R.get_recorder()\n",
" sar = SigAnaRecord(rec)\n",
" sar.generate()\n",
" \n",
"\n",
" # portfolio-based analysis: backtest\n",
" par = PortAnaRecord(rec, port_analysis_config, \"day\")\n",
" par.generate()"
@@ -1137,7 +1180,7 @@
"outputs": [],
"source": [
"label_df = dataset.prepare(\"test\", col_set=\"label\")\n",
"label_df.columns = ['label']"
"label_df.columns = [\"label\"]"
]
},
{


@@ -38,7 +38,7 @@
" # install qlib\n",
" ! pip install --upgrade numpy\n",
" ! pip install pyqlib\n",
" if 'google.colab' in sys.modules:\n",
" if \"google.colab\" in sys.modules:\n",
" # The Google colab environment is a little outdated. We have to downgrade the pyyaml to make it compatible with other packages\n",
" ! pip install pyyaml==5.4.1\n",
" # reload\n",
@@ -50,7 +50,8 @@
" scripts_dir = Path(\"~/tmp/qlib_code/scripts\").expanduser().resolve()\n",
" scripts_dir.mkdir(parents=True, exist_ok=True)\n",
" import requests\n",
" with requests.get(\"https://raw.githubusercontent.com/microsoft/qlib/main/scripts/get_data.py\") as resp:\n",
"\n",
" with requests.get(\"https://raw.githubusercontent.com/microsoft/qlib/main/scripts/get_data.py\", timeout=10) as resp:\n",
" with open(scripts_dir.joinpath(\"get_data.py\"), \"wb\") as fp:\n",
" fp.write(resp.content)"
]
@@ -61,14 +62,13 @@
"metadata": {},
"outputs": [],
"source": [
"\n",
"import qlib\n",
"import pandas as pd\n",
"from qlib.constant import REG_CN\n",
"from qlib.utils import exists_qlib_data, init_instance_by_config\n",
"from qlib.workflow import R\n",
"from qlib.workflow.record_temp import SignalRecord, PortAnaRecord\n",
"from qlib.utils import flatten_dict\n"
"from qlib.utils import flatten_dict"
]
},
{
@@ -86,6 +86,7 @@
" print(f\"Qlib data is not found in {provider_uri}\")\n",
" sys.path.append(str(scripts_dir))\n",
" from get_data import GetData\n",
"\n",
" GetData().qlib_data(target_dir=provider_uri, region=REG_CN)\n",
"qlib.init(provider_uri=provider_uri, region=REG_CN)"
]
@@ -169,7 +170,7 @@
" R.log_params(**flatten_dict(task))\n",
" model.fit(dataset)\n",
" R.save_objects(trained_model=model)\n",
" rid = R.get_recorder().id\n"
" rid = R.get_recorder().id"
]
},
{
@@ -238,7 +239,7 @@
"\n",
" # backtest & analysis\n",
" par = PortAnaRecord(recorder, port_analysis_config, \"day\")\n",
" par.generate()\n"
" par.generate()"
]
},
{
@@ -256,6 +257,7 @@
"source": [
"from qlib.contrib.report import analysis_model, analysis_position\n",
"from qlib.data import D\n",
"\n",
"recorder = R.get_recorder(recorder_id=ba_rid, experiment_name=\"backtest_analysis\")\n",
"print(recorder)\n",
"pred_df = recorder.load_object(\"pred.pkl\")\n",
@@ -317,7 +319,7 @@
"outputs": [],
"source": [
"label_df = dataset.prepare(\"test\", col_set=\"label\")\n",
"label_df.columns = ['label']"
"label_df.columns = [\"label\"]"
]
},
{


@@ -2,7 +2,7 @@
# Licensed under the MIT License.
from pathlib import Path
__version__ = "0.9.0"
__version__ = "0.9.2"
__version__bak = __version__ # This version is backup for QlibConfig.reset_qlib_version
import os
from typing import Union


@@ -40,8 +40,8 @@ def get_exchange(
open_cost: float = 0.0015,
close_cost: float = 0.0025,
min_cost: float = 5.0,
limit_threshold: Union[Tuple[str, str], float, None] = None,
deal_price: Union[str, Tuple[str, str], List[str]] = None,
limit_threshold: Union[Tuple[str, str], float, None] | None = None,
deal_price: Union[str, Tuple[str, str], List[str]] | None = None,
**kwargs: Any,
) -> Exchange:
"""get_exchange
@@ -284,7 +284,7 @@ def collect_data(
account: Union[float, int, dict] = 1e9,
exchange_kwargs: dict = {},
pos_type: str = "Position",
return_value: dict = None,
return_value: dict | None = None,
) -> Generator[object, None, None]:
"""initialize the strategy and executor, then collect the trade decision data for rl training


@@ -152,7 +152,9 @@ class Account:
# trading related metrics(e.g. high-frequency trading)
self.indicator = Indicator()
def reset(self, freq: str = None, benchmark_config: dict = None, port_metr_enabled: bool = None) -> None:
def reset(
self, freq: str | None = None, benchmark_config: dict | None = None, port_metr_enabled: bool | None = None
) -> None:
"""reset freq and report of account
Parameters


@@ -55,7 +55,7 @@ def collect_data_loop(
end_time: Union[pd.Timestamp, str],
trade_strategy: BaseStrategy,
trade_executor: BaseExecutor,
return_value: dict = None,
return_value: dict | None = None,
) -> Generator[BaseTradeDecision, Optional[BaseTradeDecision], None]:
"""Generator for collecting the trade decision data for rl training


@@ -254,7 +254,7 @@ class IdxTradeRange(TradeRange):
self._start_idx = start_idx
self._end_idx = end_idx
def __call__(self, trade_calendar: TradeCalendarManager = None) -> Tuple[int, int]:
def __call__(self, trade_calendar: TradeCalendarManager | None = None) -> Tuple[int, int]:
return self._start_idx, self._end_idx
def clip_time_range(self, start_time: pd.Timestamp, end_time: pd.Timestamp) -> Tuple[pd.Timestamp, pd.Timestamp]:
@@ -315,7 +315,7 @@ class BaseTradeDecision(Generic[DecisionType]):
2. Same as `case 1.3`
"""
def __init__(self, strategy: BaseStrategy, trade_range: Union[Tuple[int, int], TradeRange] = None) -> None:
def __init__(self, strategy: BaseStrategy, trade_range: Union[Tuple[int, int], TradeRange, None] = None) -> None:
"""
Parameters
----------
@@ -554,7 +554,7 @@ class TradeDecisionWO(BaseTradeDecision[Order]):
self,
order_list: List[Order],
strategy: BaseStrategy,
trade_range: Union[Tuple[int, int], TradeRange] = None,
trade_range: Union[Tuple[int, int], TradeRange, None] = None,
) -> None:
super().__init__(strategy, trade_range=trade_range)
self.order_list = cast(List[Order], order_list)


@@ -18,7 +18,7 @@ import pandas as pd
from qlib.backtest.position import BasePosition
from ..config import C
from ..constant import REG_CN
from ..constant import REG_CN, REG_TW
from ..data.data import D
from ..log import get_module_logger
from .decision import Order, OrderDir, OrderHelper
@@ -41,10 +41,10 @@ class Exchange:
start_time: Union[pd.Timestamp, str] = None,
end_time: Union[pd.Timestamp, str] = None,
codes: Union[list, str] = "all",
deal_price: Union[str, Tuple[str, str], List[str]] = None,
deal_price: Union[str, Tuple[str, str], List[str], None] = None,
subscribe_fields: list = [],
limit_threshold: Union[Tuple[str, str], float, None] = None,
volume_threshold: Union[tuple, dict] = None,
volume_threshold: Union[tuple, dict, None] = None,
open_cost: float = 0.0015,
close_cost: float = 0.0025,
min_cost: float = 5.0,
@@ -148,10 +148,10 @@ class Exchange:
# It is just for performance consideration.
self.limit_type = self._get_limit_type(limit_threshold)
if limit_threshold is None:
if C.region == REG_CN:
if C.region in [REG_CN, REG_TW]:
self.logger.warning(f"limit_threshold not set. The stocks hit the limit may be bought/sold")
elif self.limit_type == self.LT_FLT and abs(cast(float, limit_threshold)) > 0.1:
if C.region == REG_CN:
if C.region in [REG_CN, REG_TW]:
self.logger.warning(f"limit_threshold may not be set to a reasonable value")
if isinstance(deal_price, str):
@@ -340,7 +340,7 @@ class Exchange:
stock_id: str,
start_time: pd.Timestamp,
end_time: pd.Timestamp,
direction: int = None,
direction: int | None = None,
) -> bool:
"""
Parameters
@@ -406,7 +406,7 @@ class Exchange:
stock_id: str,
start_time: pd.Timestamp,
end_time: pd.Timestamp,
direction: int = None,
direction: int | None = None,
) -> bool:
# check if stock can be traded
return not (
@@ -421,8 +421,8 @@ class Exchange:
def deal_order(
self,
order: Order,
trade_account: Account = None,
position: BasePosition = None,
trade_account: Account | None = None,
position: BasePosition | None = None,
dealt_order_amount: Dict[str, float] = defaultdict(float),
) -> Tuple[float, float, float]:
"""
@@ -586,7 +586,7 @@ class Exchange:
)
return amount_dict
def get_real_deal_amount(self, current_amount: float, target_amount: float, factor: float = None) -> float:
def get_real_deal_amount(self, current_amount: float, target_amount: float, factor: float | None = None) -> float:
"""
Calculate the real adjust deal amount when considering the trading unit
:param current_amount:
@@ -712,8 +712,8 @@ class Exchange:
def _get_factor_or_raise_error(
self,
factor: float = None,
stock_id: str = None,
factor: float | None = None,
stock_id: str | None = None,
start_time: pd.Timestamp = None,
end_time: pd.Timestamp = None,
) -> float:
@@ -728,8 +728,8 @@ class Exchange:
def get_amount_of_trade_unit(
self,
factor: float = None,
stock_id: str = None,
factor: float | None = None,
stock_id: str | None = None,
start_time: pd.Timestamp = None,
end_time: pd.Timestamp = None,
) -> Optional[float]:
@@ -762,8 +762,8 @@ class Exchange:
def round_amount_by_trade_unit(
self,
deal_amount: float,
factor: float = None,
stock_id: str = None,
factor: float | None = None,
stock_id: str | None = None,
start_time: pd.Timestamp = None,
end_time: pd.Timestamp = None,
) -> float:


@@ -31,8 +31,8 @@ class BaseExecutor:
generate_portfolio_metrics: bool = False,
verbose: bool = False,
track_data: bool = False,
trade_exchange: Exchange = None,
common_infra: CommonInfrastructure = None,
trade_exchange: Exchange | None = None,
common_infra: CommonInfrastructure | None = None,
settle_type: str = BasePosition.ST_NO,
**kwargs: Any,
) -> None:
@@ -161,7 +161,7 @@ class BaseExecutor:
"""
return self.level_infra.get("trade_calendar")
def reset(self, common_infra: CommonInfrastructure = None, **kwargs: Any) -> None:
def reset(self, common_infra: CommonInfrastructure | None = None, **kwargs: Any) -> None:
"""
- reset `start_time` and `end_time`, used in trade calendar
- reset `common_infra`, used to reset `trade_account`, `trade_exchange`, etc.
@@ -227,7 +227,7 @@ class BaseExecutor:
def collect_data(
self,
trade_decision: BaseTradeDecision,
return_value: dict = None,
return_value: dict | None = None,
level: int = 0,
) -> Generator[Any, Any, List[object]]:
"""Generator for collecting the trade decision data for rl training
@@ -327,7 +327,7 @@ class NestedExecutor(BaseExecutor):
track_data: bool = False,
skip_empty_decision: bool = True,
align_range_limit: bool = True,
common_infra: CommonInfrastructure = None,
common_infra: CommonInfrastructure | None = None,
**kwargs: Any,
) -> None:
"""
@@ -534,7 +534,7 @@ class SimulatorExecutor(BaseExecutor):
generate_portfolio_metrics: bool = False,
verbose: bool = False,
track_data: bool = False,
common_infra: CommonInfrastructure = None,
common_infra: CommonInfrastructure | None = None,
trade_type: str = TT_SERIAL,
**kwargs: Any,
) -> None:


@@ -1,6 +1,7 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from __future__ import annotations
from datetime import timedelta
from typing import Any, Dict, List, Union
@@ -320,7 +321,7 @@ class Position(BasePosition):
self.position[stock]["price"] = price_dict[stock]
self.position["now_account_value"] = self.calculate_value()
def _init_stock(self, stock_id: str, amount: float, price: float = None) -> None:
def _init_stock(self, stock_id: str, amount: float, price: float | None = None) -> None:
"""
initialize the stock in the current position


@@ -1,6 +1,7 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from __future__ import annotations
import pathlib
from collections import OrderedDict
@@ -86,7 +87,7 @@ class PortfolioMetrics:
self.benches: dict = OrderedDict()
self.latest_pm_time: Optional[pd.TimeStamp] = None
def init_bench(self, freq: str = None, benchmark_config: dict = None) -> None:
def init_bench(self, freq: str | None = None, benchmark_config: dict | None = None) -> None:
if freq is not None:
self.freq = freq
self.benchmark_config = benchmark_config
@@ -149,15 +150,15 @@ class PortfolioMetrics:
self,
trade_start_time: Union[str, pd.Timestamp] = None,
trade_end_time: Union[str, pd.Timestamp] = None,
account_value: float = None,
cash: float = None,
return_rate: float = None,
total_turnover: float = None,
turnover_rate: float = None,
total_cost: float = None,
cost_rate: float = None,
stock_value: float = None,
bench_value: float = None,
account_value: float | None = None,
cash: float | None = None,
return_rate: float | None = None,
total_turnover: float | None = None,
turnover_rate: float | None = None,
total_cost: float | None = None,
cost_rate: float | None = None,
stock_value: float | None = None,
bench_value: float | None = None,
) -> None:
# check data
if None in [
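Note on the `X | None` annotations in the hunks above: a minimal sketch, assuming the added `from __future__ import annotations` line is there to keep the new syntax importable on interpreters older than Python 3.10. Under postponed evaluation, annotations are stored as strings and are not evaluated when the function is defined; the quoted annotations below imitate that for a single function. `init_stock` is a hypothetical stand-in, not the real `Position._init_stock` method.

```python
def init_stock(stock_id: str, amount: float, price: "float | None" = None) -> "float | None":
    """Hypothetical stand-in: return the price unchanged to keep the sketch small."""
    return price


# The annotation is kept as plain text, so `float | None` is never evaluated here.
price_annotation = init_stock.__annotations__["price"]
print(price_annotation)

# Omitting the argument yields the explicit None default.
print(init_stock("sh600519", 100.0))
print(init_stock("sh600519", 100.0, price=12.5))
```

Spelling out `| None` changes nothing at runtime; it only makes the `None` default visible to type checkers that reject implicit optionals.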


@@ -31,7 +31,7 @@ class TradeCalendarManager:
freq: str,
start_time: Union[str, pd.Timestamp] = None,
end_time: Union[str, pd.Timestamp] = None,
level_infra: LevelInfrastructure = None,
level_infra: LevelInfrastructure | None = None,
) -> None:
"""
Parameters
@@ -99,7 +99,7 @@ class TradeCalendarManager:
def get_trade_step(self) -> int:
return self.trade_step
def get_step_time(self, trade_step: int = None, shift: int = 0) -> Tuple[pd.Timestamp, pd.Timestamp]:
def get_step_time(self, trade_step: int | None = None, shift: int = 0) -> Tuple[pd.Timestamp, pd.Timestamp]:
"""
Get the left and right endpoints of the trade_step'th trading interval


@@ -75,7 +75,8 @@ class Config:
def set_conf_from_C(self, config_c):
self.update(**config_c.__dict__["_config"])
def register_from_C(self, config, skip_register=True):
@staticmethod
def register_from_C(config, skip_register=True):
from .utils import set_log_with_config # pylint: disable=C0415
if C.registered and skip_register:
@@ -146,6 +147,7 @@ _default_config = {
"redis_host": "127.0.0.1",
"redis_port": 6379,
"redis_task_db": 1,
"redis_password": None,
# This value can be reset via qlib.init
"logging_level": logging.INFO,
# Global configuration of qlib log
@@ -202,7 +204,7 @@ _default_config = {
"task_url": "mongodb://localhost:27017/",
"task_db_name": "default_task_db",
},
# Shift minute for highfreq minite data, used in backtest
# Shift minute for highfreq minute data, used in backtest
# if min_data_shift == 0, use default market time [9:30, 11:29, 1:00, 2:59]
# if min_data_shift != 0, use shifted market time [9:30, 11:29, 1:00, 2:59] - shift*minute
"min_data_shift": 0,


@@ -56,7 +56,7 @@ class Alpha360(DataHandlerLP):
fit_start_time=None,
fit_end_time=None,
filter_pipe=None,
inst_processor=None,
inst_processors=None,
**kwargs
):
infer_processors = check_transform_proc(infer_processors, fit_start_time, fit_end_time)
@@ -71,7 +71,7 @@ class Alpha360(DataHandlerLP):
},
"filter_pipe": filter_pipe,
"freq": freq,
"inst_processor": inst_processor,
"inst_processors": inst_processors,
},
}
@@ -152,7 +152,7 @@ class Alpha158(DataHandlerLP):
fit_end_time=None,
process_type=DataHandlerLP.PTYPE_A,
filter_pipe=None,
inst_processor=None,
inst_processors=None,
**kwargs
):
infer_processors = check_transform_proc(infer_processors, fit_start_time, fit_end_time)
@@ -167,7 +167,7 @@ class Alpha158(DataHandlerLP):
},
"filter_pipe": filter_pipe,
"freq": freq,
"inst_processor": inst_processor,
"inst_processors": inst_processors,
},
}
super().__init__(


@@ -44,7 +44,7 @@ class HighFreqHandler(DataHandlerLP):
names = []
template_if = "If(IsNull({1}), {0}, {1})"
template_paused = "Select(Gt($hx_paused_num, 1.001), {0})"
template_paused = "Select(Gt($paused_num, 1.001), {0})"
def get_normalized_price_feature(price_field, shift=0):
# norm with the close price of 237th minute of yesterday.
@@ -113,8 +113,12 @@ class HighFreqGeneralHandler(DataHandlerLP):
fit_end_time=None,
drop_raw=True,
day_length=240,
freq="1min",
columns=["$open", "$high", "$low", "$close", "$vwap"],
inst_processors=None,
):
self.day_length = day_length
self.columns = columns
infer_processors = check_transform_proc(infer_processors, fit_start_time, fit_end_time)
learn_processors = check_transform_proc(learn_processors, fit_start_time, fit_end_time)
@@ -124,7 +128,8 @@ class HighFreqGeneralHandler(DataHandlerLP):
"kwargs": {
"config": self.get_feature_config(),
"swap_level": False,
"freq": "1min",
"freq": freq,
"inst_processors": inst_processors,
},
}
super().__init__(
@@ -160,19 +165,13 @@ class HighFreqGeneralHandler(DataHandlerLP):
)
return feature_ops
fields += [get_normalized_price_feature("$open", 0)]
fields += [get_normalized_price_feature("$high", 0)]
fields += [get_normalized_price_feature("$low", 0)]
fields += [get_normalized_price_feature("$close", 0)]
fields += [get_normalized_price_feature("$vwap", 0)]
names += ["$open", "$high", "$low", "$close", "$vwap"]
for column_name in self.columns:
fields.append(get_normalized_price_feature(column_name, 0))
names.append(column_name)
fields += [get_normalized_price_feature("$open", self.day_length)]
fields += [get_normalized_price_feature("$high", self.day_length)]
fields += [get_normalized_price_feature("$low", self.day_length)]
fields += [get_normalized_price_feature("$close", self.day_length)]
fields += [get_normalized_price_feature("$vwap", self.day_length)]
names += ["$open_1", "$high_1", "$low_1", "$close_1", "$vwap_1"]
for column_name in self.columns:
fields.append(get_normalized_price_feature(column_name, self.day_length))
names.append(column_name + "_1")
# calculate and fill nan with 0
fields += [
@@ -258,14 +257,19 @@ class HighFreqGeneralBacktestHandler(DataHandler):
start_time=None,
end_time=None,
day_length=240,
freq="1min",
columns=["$close", "$vwap", "$volume"],
inst_processors=None,
):
self.day_length = day_length
self.columns = set(columns)
data_loader = {
"class": "QlibDataLoader",
"kwargs": {
"config": self.get_feature_config(),
"swap_level": False,
"freq": "1min",
"freq": freq,
"inst_processors": inst_processors,
},
}
super().__init__(
@@ -279,21 +283,24 @@ class HighFreqGeneralBacktestHandler(DataHandler):
fields = []
names = []
template_paused = f"Cut({{0}}, {self.day_length * 2}, None)"
template_fillnan = "FFillNan({0})"
template_if = "If(IsNull({1}), {0}, {1})"
fields += [
template_paused.format(template_fillnan.format("$close")),
]
names += ["$close0"]
if "$close" in self.columns:
template_paused = f"Cut({{0}}, {self.day_length * 2}, None)"
template_fillnan = "FFillNan({0})"
template_if = "If(IsNull({1}), {0}, {1})"
fields += [
template_paused.format(template_fillnan.format("$close")),
]
names += ["$close0"]
fields += [
template_paused.format(template_if.format(template_fillnan.format("$close"), "$vwap")),
]
names += ["$vwap0"]
if "$vwap" in self.columns:
fields += [
template_paused.format(template_if.format(template_fillnan.format("$close"), "$vwap")),
]
names += ["$vwap0"]
fields += [template_paused.format("If(IsNull({0}), 0, {0})".format("$volume"))]
names += ["$volume0"]
if "$volume" in self.columns:
fields += [template_paused.format("If(IsNull({0}), 0, {0})".format("$volume"))]
names += ["$volume0"]
return fields, names
@@ -308,6 +315,7 @@ class HighFreqOrderHandler(DataHandlerLP):
learn_processors=[],
fit_start_time=None,
fit_end_time=None,
inst_processors=None,
drop_raw=True,
):
@@ -320,6 +328,7 @@ class HighFreqOrderHandler(DataHandlerLP):
"config": self.get_feature_config(),
"swap_level": False,
"freq": "1min",
"inst_processors": inst_processors,
},
}
super().__init__(
@@ -479,7 +488,7 @@ class HighFreqBacktestOrderHandler(DataHandler):
names = []
template_if = "If(IsNull({1}), {0}, {1})"
template_paused = "Select(Gt($hx_paused_num, 1.001), {0})"
template_paused = "Select(Gt($paused_num, 1.001), {0})"
template_fillnan = "FFillNan({0})"
fields += [
template_fillnan.format(template_paused.format("$close")),
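Note on the `HighFreqGeneralHandler` hunk above, which replaces ten hard-coded `fields +=` lines with two loops over `self.columns`: the sketch below checks that the loops reproduce the old name order for the default column list. It is an illustration under assumptions: `get_normalized_price_feature` here is a hypothetical stub (the real helper builds a qlib expression), and `build_feature_config` is a made-up free function, not the handler's method.

```python
def get_normalized_price_feature(price_field, shift=0):
    # Hypothetical stub: the real helper returns a normalized qlib expression.
    return f"Ref({price_field}, {shift})" if shift else price_field


def build_feature_config(columns, day_length=240):
    """Build (fields, names) with the two-loop layout used in the new hunk."""
    fields, names = [], []
    # Current-day features keep the raw column name.
    for column_name in columns:
        fields.append(get_normalized_price_feature(column_name, 0))
        names.append(column_name)
    # Features shifted by one trading day get a "_1" suffix.
    for column_name in columns:
        fields.append(get_normalized_price_feature(column_name, day_length))
        names.append(column_name + "_1")
    return fields, names


default_columns = ["$open", "$high", "$low", "$close", "$vwap"]
fields, names = build_feature_config(default_columns)
print(names)
```

With the default columns the names match the previously hard-coded lists, so existing consumers should see the same layout; passing a shorter list simply yields fewer features.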


@@ -28,6 +28,7 @@ class HighFreqProvider:
feature_conf: dict,
label_conf: Optional[dict] = None,
backtest_conf: dict = None,
freq: str = "1min",
**kwargs,
) -> None:
self.start_time = start_time
@@ -42,6 +43,7 @@ class HighFreqProvider:
self.backtest_conf = backtest_conf
self.qlib_conf = qlib_conf
self.logger = get_module_logger("HighFreqProvider")
self.freq = freq
def get_pre_datasets(self):
"""Generate the training, validation and test datasets for prediction
@@ -116,8 +118,8 @@ class HighFreqProvider:
# This code used the copy-on-write feature of Linux
# to avoid calculating the calendar multiple times in the subprocess.
# This may speed things up, but may not be useful on Windows and macOS
Cal.calendar(freq="1min")
get_calendar_day(freq="1min")
Cal.calendar(freq=self.freq)
get_calendar_day(freq=self.freq)
def _gen_dataframe(self, config, datasets=["train", "valid", "test"]):
try:
@@ -126,7 +128,7 @@ class HighFreqProvider:
raise ValueError("Must specify the path to save the dataset.") from e
if os.path.isfile(path):
start = time.time()
self.logger.info("Dataset exists, load from disk.", __name__)
self.logger.info(f"[{__name__}]Dataset exists, load from disk.")
# res = dataset.prepare(['train', 'valid', 'test'])
with open(path, "rb") as f:
@@ -135,11 +137,11 @@ class HighFreqProvider:
res = [data[i] for i in datasets]
else:
res = data.prepare(datasets)
self.logger.info(f"Data loaded, time cost: {time.time() - start:.2f}", __name__)
self.logger.info(f"[{__name__}]Data loaded, time cost: {time.time() - start:.2f}")
else:
if not os.path.exists(os.path.dirname(path)):
os.makedirs(os.path.dirname(path))
self.logger.info("Generating dataset", __name__)
self.logger.info(f"[{__name__}]Generating dataset")
start_time = time.time()
self._prepare_calender_cache()
dataset = init_instance_by_config(config)
@@ -158,7 +160,7 @@ class HighFreqProvider:
with open(path[:-4] + "test.pkl", "wb") as f:
pkl.dump(testset, f)
res = [data[i] for i in datasets]
self.logger.info(f"Data generated, time cost: {(time.time() - start_time):.2f}", __name__)
self.logger.info(f"[{__name__}]Data generated, time cost: {(time.time() - start_time):.2f}")
return res
def _gen_data(self, config, datasets=["train", "valid", "test"]):
@@ -168,7 +170,7 @@ class HighFreqProvider:
raise ValueError("Must specify the path to save the dataset.") from e
if os.path.isfile(path):
start = time.time()
self.logger.info("Dataset exists, load from disk.", __name__)
self.logger.info(f"[{__name__}]Dataset exists, load from disk.")
# res = dataset.prepare(['train', 'valid', 'test'])
with open(path, "rb") as f:
@@ -177,18 +179,18 @@ class HighFreqProvider:
res = [data[i] for i in datasets]
else:
res = data.prepare(datasets)
self.logger.info(f"Data loaded, time cost: {time.time() - start:.2f}", __name__)
self.logger.info(f"[{__name__}]Data loaded, time cost: {time.time() - start:.2f}")
else:
if not os.path.exists(os.path.dirname(path)):
os.makedirs(os.path.dirname(path))
self.logger.info("Generating dataset", __name__)
self.logger.info(f"[{__name__}]Generating dataset")
start_time = time.time()
self._prepare_calender_cache()
dataset = init_instance_by_config(config)
dataset.config(dump_all=True, recursive=True)
dataset.to_pickle(path)
res = dataset.prepare(datasets)
self.logger.info(f"Data generated, time cost: {(time.time() - start_time):.2f}", __name__)
self.logger.info(f"[{__name__}]Data generated, time cost: {(time.time() - start_time):.2f}")
return res
def _gen_dataset(self, config):
@@ -198,21 +200,21 @@ class HighFreqProvider:
raise ValueError("Must specify the path to save the dataset.") from e
if os.path.isfile(path):
start = time.time()
self.logger.info("Dataset exists, load from disk.", __name__)
self.logger.info(f"[{__name__}]Dataset exists, load from disk.")
with open(path, "rb") as f:
dataset = pkl.load(f)
self.logger.info(f"Data loaded, time cost: {time.time() - start:.2f}", __name__)
self.logger.info(f"[{__name__}]Data loaded, time cost: {time.time() - start:.2f}")
else:
start = time.time()
if not os.path.exists(os.path.dirname(path)):
os.makedirs(os.path.dirname(path))
self.logger.info("Generating dataset", __name__)
self.logger.info(f"[{__name__}]Generating dataset")
self._prepare_calender_cache()
dataset = init_instance_by_config(config)
self.logger.info(f"Dataset init, time cost: {time.time() - start:.2f}", __name__)
self.logger.info(f"[{__name__}]Dataset init, time cost: {time.time() - start:.2f}")
dataset.prepare(["train", "valid", "test"])
self.logger.info(f"Dataset prepared, time cost: {time.time() - start:.2f}", __name__)
self.logger.info(f"[{__name__}]Dataset prepared, time cost: {time.time() - start:.2f}")
dataset.config(dump_all=True, recursive=True)
dataset.to_pickle(path)
return dataset
@@ -225,22 +227,22 @@ class HighFreqProvider:
if os.path.isfile(path + "tmp_dataset.pkl"):
start = time.time()
self.logger.info("Dataset exists, load from disk.", __name__)
self.logger.info(f"[{__name__}]Dataset exists, load from disk.")
else:
start = time.time()
if not os.path.exists(os.path.dirname(path)):
os.makedirs(os.path.dirname(path))
self.logger.info("Generating dataset", __name__)
self.logger.info(f"[{__name__}]Generating dataset")
self._prepare_calender_cache()
dataset = init_instance_by_config(config)
self.logger.info(f"Dataset init, time cost: {time.time() - start:.2f}", __name__)
self.logger.info(f"[{__name__}]Dataset init, time cost: {time.time() - start:.2f}")
dataset.config(dump_all=False, recursive=True)
dataset.to_pickle(path + "tmp_dataset.pkl")
with open(path + "tmp_dataset.pkl", "rb") as f:
new_dataset = pkl.load(f)
time_list = D.calendar(start_time=self.start_time, end_time=self.end_time, freq="1min")[::240]
time_list = D.calendar(start_time=self.start_time, end_time=self.end_time, freq=self.freq)[::240]
def generate_dataset(times):
if os.path.isfile(path + times.strftime("%Y-%m-%d") + ".pkl"):
@@ -266,15 +268,15 @@ class HighFreqProvider:
if os.path.isfile(path + "tmp_dataset.pkl"):
start = time.time()
self.logger.info("Dataset exists, load from disk.", __name__)
self.logger.info(f"[{__name__}]Dataset exists, load from disk.")
else:
start = time.time()
if not os.path.exists(os.path.dirname(path)):
os.makedirs(os.path.dirname(path))
self.logger.info("Generating dataset", __name__)
self.logger.info(f"[{__name__}]Generating dataset")
self._prepare_calender_cache()
dataset = init_instance_by_config(config)
self.logger.info(f"Dataset init, time cost: {time.time() - start:.2f}", __name__)
self.logger.info(f"[{__name__}]Dataset init, time cost: {time.time() - start:.2f}")
dataset.config(dump_all=False, recursive=True)
dataset.to_pickle(path + "tmp_dataset.pkl")
@@ -283,7 +285,7 @@ class HighFreqProvider:
instruments = D.instruments(market="all")
stock_list = D.list_instruments(
instruments=instruments, start_time=self.start_time, end_time=self.end_time, freq="1min", as_list=True
instruments=instruments, start_time=self.start_time, end_time=self.end_time, freq=self.freq, as_list=True
)
def generate_dataset(stock):


@@ -55,8 +55,10 @@ class InternalData:
# The handler is initialized for only once.
if not trainer.has_worker():
self.dh = init_task_handler(perf_task_tpl)
self.dh.config(dump_all=False) # in some cases, the data handler is saved to disk with `dump_all=True`
else:
self.dh = init_instance_by_config(perf_task_tpl["dataset"]["kwargs"]["handler"])
assert self.dh.dump_all is False # otherwise, it will save all the detailed data
seg = perf_task_tpl["dataset"]["kwargs"]["segments"]
@@ -77,7 +79,7 @@ class InternalData:
get_module_logger("Internal Data").info("the data has been initialized")
else:
# train new models
assert 0 == len(recorders), "An empty experiment is required for setup `InternalData``"
assert 0 == len(recorders), "An empty experiment is required for setup `InternalData`"
trainer.train(gen_task)
# 2) extract the similarity matrix
@@ -119,6 +121,7 @@ class MetaTaskDS(MetaTask):
def __init__(self, task: dict, meta_info: pd.DataFrame, mode: str = MetaTask.PROC_MODE_FULL, fill_method="max"):
"""
The description of the processed data
time_perf: An array with shape <hist_step_n * step, data pieces> -> data piece performance
@@ -132,6 +135,10 @@ class MetaTaskDS(MetaTask):
[0., 0., 0., ..., 0., 0., 1.],
[0., 0., 0., ..., 0., 0., 1.]])
Parameters
----------
meta_info: pd.DataFrame
please refer to the docs of _prepare_meta_ipt for detailed explanation.
"""
super().__init__(task, meta_info)
self.fill_method = fill_method
@@ -180,12 +187,41 @@ class MetaTaskDS(MetaTask):
self.processed_meta_input = data_to_tensor(self.processed_meta_input)
def _get_processed_meta_info(self):
meta_info_norm = self.meta_info.sub(self.meta_info.mean(axis=1), axis=0) # .fillna(0.)
if self.fill_method == "max":
meta_info_norm = meta_info_norm.T.fillna(
meta_info_norm.max(axis=1)
).T # fill it with row max to align with previous implementation
meta_info_norm = self.meta_info.sub(self.meta_info.mean(axis=1), axis=0)
if self.fill_method.startswith("max"):
suffix = self.fill_method[len("max") :]  # slice off the prefix; str.lstrip("max") strips any leading 'm', 'a', 'x' characters
if suffix == "seg":
fill_value = {}
for col in meta_info_norm.columns:
fill_value[col] = meta_info_norm.loc[meta_info_norm[col].isna(), :].dropna(axis=1).mean().max()
fill_value = pd.Series(fill_value).sort_index()
# The NaN values are filled segment-wise. Below is an example of fill_value
# 2009-01-05 2009-02-06 0.145809
# 2009-02-09 2009-03-06 0.148005
# 2009-03-09 2009-04-03 0.090385
# 2009-04-07 2009-05-05 0.114318
# 2009-05-06 2009-06-04 0.119328
# ...
meta_info_norm = meta_info_norm.fillna(fill_value)
else:
if len(suffix) > 0:
get_module_logger("MetaTaskDS").warning(
f"fill_method={self.fill_method}; the info after can't be correctly parsed. Please check your parameters."
)
fill_value = meta_info_norm.max(axis=1)
# Fill with the row max to align with the previous implementation.
# This will magnify the data similarity when the data is in daily freq.
# The fill value corresponds to data like the example below:
# there is one performance value for each day,
# and the performance values are obtained from other models on that day.
# 2009-01-16 0.276320
# 2009-01-19 0.280603
# ...
# 2011-06-27 0.203773
meta_info_norm = meta_info_norm.T.fillna(fill_value).T
elif self.fill_method == "zero":
# It will fillna(0.0) at the end.
pass
else:
raise NotImplementedError(f"fill_method={self.fill_method} is not supported")
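The two `max`-style fill strategies in the hunk above can be compared on a toy frame. This is only a sketch under stated assumptions: the frame `perf`, its labels and its values are made up for illustration, and the mean-centering step from `_get_processed_meta_info` is left out.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are test days, columns are models.
perf = pd.DataFrame(
    {
        "m1": [np.nan, np.nan, 0.2, 0.4],
        "m2": [0.1, 0.3, np.nan, 0.2],
        "m3": [0.5, 0.1, 0.6, np.nan],
    },
    index=["d1", "d2", "d3", "d4"],
)

# fill_method="max": every NaN takes the max of its own row.
# DataFrame.fillna(Series) aligns the Series labels with column labels,
# so the frame is transposed to make the row labels act as columns.
row_max = perf.max(axis=1)
filled_by_row = perf.T.fillna(row_max).T

# fill_method="maxseg": one fill value per column (segment).
# For each column, look only at the rows where it is NaN, average the
# other columns over those rows, and keep the largest average.
fill_value = {}
for col in perf.columns:
    fill_value[col] = perf.loc[perf[col].isna(), :].dropna(axis=1).mean().max()
fill_value = pd.Series(fill_value).sort_index()
filled_by_seg = perf.fillna(fill_value)
```

In this toy frame the two NaN cells of `m1` receive different values under the row-wise fill and a single shared value under the segment-wise fill.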
@@ -286,7 +322,33 @@ class MetaDatasetDS(MetaTaskDataset):
logger.warning(f"ValueError: {e}")
assert len(self.meta_task_l) > 0, "No meta tasks found. Please check the data and setting"
def _prepare_meta_ipt(self, task):
def _prepare_meta_ipt(self, task) -> pd.DataFrame:
"""
Please refer to `self.internal_data.setup` for detailed information about `self.internal_data.data_ic_df`
Indices with format below can be successfully sliced by `ic_df.loc[:end, pd.IndexSlice[:, :end]]`
2021-06-21 2021-06-04 .. 2021-03-22 2021-03-08
2021-07-02 2021-06-18 .. 2021-04-02 None
Returns
-------
a pd.DataFrame with similar content below.
- each column corresponds to a trained model named by the training data range
- each row corresponds to a day of data tested by the models of the columns
- The cells in rows that overlap with the data used by the columns are masked
2009-01-05 2009-02-09 ... 2011-04-27 2011-05-26
2009-02-06 2009-03-06 ... 2011-05-25 2011-06-23
datetime ...
2009-01-13 NaN 0.310639 ... -0.169057 0.137792
2009-01-14 NaN 0.261086 ... -0.143567 0.082581
... ... ... ... ... ...
2011-06-30 -0.054907 -0.020219 ... -0.023226 NaN
2011-07-01 -0.075762 -0.026626 ... -0.003167 NaN
"""
ic_df = self.internal_data.data_ic_df
segs = task["dataset"]["kwargs"]["segments"]
@@ -294,15 +356,19 @@ class MetaDatasetDS(MetaTaskDataset):
ic_df_avail = ic_df.loc[:end, pd.IndexSlice[:, :end]]
# the meta dataset focuses on the **information** instead of preprocessing
# 1) filter the future info
def mask_future(s):
"""mask future information"""
# from qlib.utils import get_date_by_shift
# 1) filter the overlap info
def mask_overlap(s):
"""
mask overlap information
data after self.name[end] with self.trunc_days that contains future info are also considered as overlap info
Approximately the diagnal + horizon length of data are masked.
"""
start, end = s.name
end = get_date_by_shift(trading_date=end, shift=self.trunc_days - 1, future=True)
return s.mask((s.index >= start) & (s.index <= end))
ic_df_avail = ic_df_avail.apply(mask_future) # apply to each col
ic_df_avail = ic_df_avail.apply(mask_overlap) # apply to each col
# 2) filter the info with too long periods
total_len = self.step * self.hist_step_n
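The masking rule in `mask_overlap` can be tried out on a small frame. This is a sketch, not the qlib code: the dates and values are made up, `TRUNC_DAYS` is a hypothetical stand-in for `self.trunc_days`, and plain calendar days replace the trading-calendar shift that `get_date_by_shift` performs.

```python
import pandas as pd

# Hypothetical data: 8 test days, two models named by their (start, end) training range.
days = pd.date_range("2021-01-01", periods=8, freq="D")
cols = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2021-01-01"), pd.Timestamp("2021-01-03")),
        (pd.Timestamp("2021-01-04"), pd.Timestamp("2021-01-06")),
    ]
)
ic_df = pd.DataFrame(1.0, index=days, columns=cols)

TRUNC_DAYS = 2  # assumed horizon, stands in for self.trunc_days


def mask_overlap(s: pd.Series) -> pd.Series:
    # With apply() over columns, s.name is the (start, end) column label.
    start, end = s.name
    # Assumption: shift by calendar days instead of trading days.
    end = end + pd.Timedelta(days=TRUNC_DAYS - 1)
    # Rows inside [start, shifted end] overlap with the training data.
    return s.mask((s.index >= start) & (s.index <= end))


masked = ic_df.apply(mask_overlap)  # applied to each column
```

Each column loses its own training range plus the extra horizon, which is the "diagonal + horizon" band described in the docstring.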


@@ -52,6 +52,7 @@ class MetaModelDS(MetaTaskModel):
lr=0.0001,
max_epoch=100,
seed=43,
alpha=0.0,
):
self.step = step
self.hist_step_n = hist_step_n
@@ -61,6 +62,7 @@ class MetaModelDS(MetaTaskModel):
self.lr = lr
self.max_epoch = max_epoch
self.fitted = False
self.alpha = alpha
torch.manual_seed(seed)
def run_epoch(self, phase, task_list, epoch, opt, loss_l, ignore_weight=False):
@@ -144,7 +146,11 @@ class MetaModelDS(MetaTaskModel):
) # debug: record when the test phase starts
self.tn = PredNet(
step=self.step, hist_step_n=self.hist_step_n, clip_weight=self.clip_weight, clip_method=self.clip_method
step=self.step,
hist_step_n=self.hist_step_n,
clip_weight=self.clip_weight,
clip_method=self.clip_method,
alpha=self.alpha,
)
opt = optim.Adam(self.tn.parameters(), lr=self.lr)


@@ -41,11 +41,18 @@ class TimeWeightMeta(SingleMetaBase):
class PredNet(nn.Module):
def __init__(self, step, hist_step_n, clip_weight=None, clip_method="tanh"):
def __init__(self, step, hist_step_n, clip_weight=None, clip_method="tanh", alpha: float = 0.0):
"""
Parameters
----------
alpha : float
the regularization strength for the sub model (useful when aligning the meta model with a linear sub model)
"""
super().__init__()
self.step = step
self.twm = TimeWeightMeta(hist_step_n=hist_step_n, clip_weight=clip_weight, clip_method=clip_method)
self.init_paramters(hist_step_n)
self.alpha = alpha
def get_sample_weights(self, X, time_perf, time_belong, ignore_weight=False):
weights = torch.from_numpy(np.ones(X.shape[0])).float().to(X.device)
@@ -59,7 +66,7 @@ class PredNet(nn.Module):
"""Please refer to the docs of MetaTaskDS for the description of the variables"""
weights = self.get_sample_weights(X, time_perf, time_belong, ignore_weight=ignore_weight)
X_w = X.T * weights.view(1, -1)
theta = torch.inverse(X_w @ X) @ X_w @ y
theta = torch.inverse(X_w @ X + self.alpha * torch.eye(X_w.shape[0], device=X_w.device, dtype=X_w.dtype)) @ X_w @ y
return X_test @ theta, weights
def init_paramters(self, hist_step_n):
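The `alpha` term added to `PredNet` turns the weighted least-squares solve into a ridge-style one: theta = (X^T W X + alpha * I)^-1 X^T W y, with W the diagonal matrix of sample weights. A minimal NumPy sketch follows; the function name `weighted_ridge` and the synthetic data are hypothetical, and `np.linalg.solve` is swapped in for the explicit `torch.inverse` used in the diff.

```python
import numpy as np


def weighted_ridge(X, y, weights, alpha=0.0):
    """Solve (X^T W X + alpha * I) theta = X^T W y with W = diag(weights)."""
    X_w = X.T * weights.reshape(1, -1)  # X^T W, shape [n_features, n_samples]
    gram = X_w @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X_w @ y)


# Hypothetical noiseless data, so the unregularized solve recovers true_theta.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta
w = np.ones(50)

theta_plain = weighted_ridge(X, y, w, alpha=0.0)
theta_ridge = weighted_ridge(X, y, w, alpha=10.0)

# A duplicated feature makes X^T W X singular; alpha > 0 keeps the system solvable.
X_dup = np.hstack([X, X[:, :1]])
theta_dup = weighted_ridge(X_dup, y, w, alpha=1e-3)
```

With `alpha=0.0` the sketch reduces to the previous behavior, which matches the default in the diff.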


@@ -5,6 +5,9 @@ import numpy as np
import torch
from torch import nn
from qlib.constant import EPS
from qlib.log import get_module_logger
class ICLoss(nn.Module):
def forward(self, pred, y, idx, skip_size=50):
@@ -24,6 +27,7 @@ class ICLoss(nn.Module):
diff_point.append(i)
prev = date
diff_point.append(None)
# diff_point now has one more element than the number of daily segments
ic_all = 0.0
skip_n = 0
@@ -34,13 +38,23 @@ class ICLoss(nn.Module):
skip_n += 1
continue
y_focus = y[start_i:end_i]
if pred_focus.std() < EPS or y_focus.std() < EPS:
# These cases often happen at the end of test data.
# Usually caused by fillna(0.)
skip_n += 1
continue
ic_day = torch.dot(
(pred_focus - pred_focus.mean()) / np.sqrt(pred_focus.shape[0]) / pred_focus.std(),
(y_focus - y_focus.mean()) / np.sqrt(y_focus.shape[0]) / y_focus.std(),
)
ic_all += ic_day
if len(diff_point) - 1 - skip_n <= 0:
raise ValueError("No enough data for calculating iC")
raise ValueError("Not enough data for calculating IC")
if skip_n > 0:
get_module_logger("ICLoss").info(
f"{skip_n} days are skipped due to zero std or small scale of valid samples."
)
ic_mean = ic_all / (len(diff_point) - 1 - skip_n)
return -ic_mean # ic loss
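The per-day term and the new zero-std guard in `ICLoss` can be reproduced in NumPy. This is a sketch under assumptions: `daily_ic` and `mean_ic` are hypothetical helper names, the `EPS` value is assumed rather than read from `qlib.constant`, and the `skip_size` check from the original loop is omitted. `ddof=1` mirrors the unbiased default of `torch.std`, so the term equals the Pearson correlation scaled by (n - 1) / n.

```python
import numpy as np

EPS = 1e-12  # assumed threshold; the diff imports EPS from qlib.constant


def daily_ic(pred, y):
    """Per-day term of the loss, or None when either std is close to zero."""
    p_std, y_std = pred.std(ddof=1), y.std(ddof=1)
    if p_std < EPS or y_std < EPS:
        return None  # e.g. a day filled with fillna(0.)
    n = pred.shape[0]
    return np.dot(
        (pred - pred.mean()) / np.sqrt(n) / p_std,
        (y - y.mean()) / np.sqrt(n) / y_std,
    )


def mean_ic(groups):
    """Average the daily terms over the days that were not skipped."""
    vals = [daily_ic(p, y) for p, y in groups]
    kept = [v for v in vals if v is not None]
    if not kept:
        raise ValueError("Not enough data for calculating IC")
    return float(np.mean(kept)), len(vals) - len(kept)


# Hypothetical two-day example: a perfectly correlated day and a constant-prediction day.
groups = [
    (np.array([1.0, 2.0, 3.0, 4.0]), np.array([2.0, 4.0, 6.0, 8.0])),
    (np.zeros(4), np.array([1.0, 2.0, 3.0, 4.0])),
]
ic, skipped = mean_ic(groups)
```

Without the guard, the second day would divide by a zero standard deviation and push NaN into the mean.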


@@ -4,6 +4,7 @@
import numpy as np
import pandas as pd
from typing import Text, Union
from qlib.log import get_module_logger
from qlib.data.dataset.weight import Reweighter
from scipy.optimize import nnls
from sklearn.linear_model import LinearRegression, Ridge, Lasso
@@ -29,7 +30,7 @@ class LinearModel(Model):
RIDGE = "ridge"
LASSO = "lasso"
def __init__(self, estimator="ols", alpha=0.0, fit_intercept=False):
def __init__(self, estimator="ols", alpha=0.0, fit_intercept=False, include_valid: bool = False):
"""
Parameters
----------
@@ -39,6 +40,9 @@ class LinearModel(Model):
l1 or l2 regularization parameter
fit_intercept : bool
whether fit intercept
include_valid: bool
Should the validation data be included for training?
The validation data should be included
"""
assert estimator in [self.OLS, self.NNLS, self.RIDGE, self.LASSO], f"unsupported estimator `{estimator}`"
self.estimator = estimator
@@ -49,9 +53,16 @@ class LinearModel(Model):
self.fit_intercept = fit_intercept
self.coef_ = None
self.include_valid = include_valid
def fit(self, dataset: DatasetH, reweighter: Reweighter = None):
df_train = dataset.prepare("train", col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
if self.include_valid:
try:
df_valid = dataset.prepare("valid", col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
df_train = pd.concat([df_train, df_valid])
except KeyError:
get_module_logger("LinearModel").info("include_valid=True, but valid does not exist")
if df_train.empty:
raise ValueError("Empty data from dataset, please check your dataset config.")
if reweighter is not None:
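The `include_valid` branch follows a simple pattern: try to fetch the validation segment, append it to the training frame, and fall back to the training data alone if the segment is missing. A minimal sketch with a hypothetical stand-in for the dataset (`ToyDataset` is not a qlib class, and its `prepare` takes only a segment name):

```python
import pandas as pd


class ToyDataset:
    """Hypothetical stand-in: maps segment names to frames, KeyError if missing."""

    def __init__(self, segments):
        self._segments = segments

    def prepare(self, name):
        return self._segments[name]


def training_frame(dataset, include_valid=False):
    df_train = dataset.prepare("train")
    if include_valid:
        try:
            df_train = pd.concat([df_train, dataset.prepare("valid")])
        except KeyError:
            pass  # no validation segment configured; train on "train" only
    return df_train


train = pd.DataFrame({"feature": [1.0, 2.0], "label": [0.1, 0.2]})
valid = pd.DataFrame({"feature": [3.0], "label": [0.3]})

with_valid = training_frame(ToyDataset({"train": train, "valid": valid}), include_valid=True)
without_valid = training_frame(ToyDataset({"train": train}), include_valid=True)
```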


@@ -56,7 +56,7 @@ class ADARNN(Model):
n_splits=2,
GPU=0,
seed=None,
**kwargs
**_
):
# Set logger.
self.logger = get_module_logger("ADARNN")
@@ -81,7 +81,7 @@ class ADARNN(Model):
self.optimizer = optimizer.lower()
self.loss = loss
self.n_splits = n_splits
self.device = torch.device("cuda:%d" % (GPU) if torch.cuda.is_available() and GPU >= 0 else "cpu")
self.device = torch.device("cuda:%d" % GPU if torch.cuda.is_available() and GPU >= 0 else "cpu")
self.seed = seed
self.logger.info(
@@ -213,7 +213,8 @@ class ADARNN(Model):
weight_mat = self.transform_type(out_weight_list)
return weight_mat, None
def calc_all_metrics(self, pred):
@staticmethod
def calc_all_metrics(pred):
"""pred is a pandas dataframe that has two attributes: score (pred) and label (real)"""
res = {}
ic = pred.groupby(level="datetime").apply(lambda x: x.label.corr(x.score))
@@ -259,8 +260,6 @@ class ADARNN(Model):
save_path = get_or_create_path(save_path)
stop_steps = 0
best_score = -np.inf
best_epoch = 0
evals_result["train"] = []
evals_result["valid"] = []
@@ -400,7 +399,7 @@ class AdaRNN(nn.Module):
self.model_type = model_type
self.trans_loss = trans_loss
self.len_seq = len_seq
self.device = torch.device("cuda:%d" % (GPU) if torch.cuda.is_available() and GPU >= 0 else "cpu")
self.device = torch.device("cuda:%d" % GPU if torch.cuda.is_available() and GPU >= 0 else "cpu")
in_size = self.n_input
features = nn.ModuleList()
@@ -499,7 +498,8 @@ class AdaRNN(nn.Module):
res = self.softmax(weight).squeeze()
return res
def get_features(self, output_list):
@staticmethod
def get_features(output_list):
fea_list_src, fea_list_tar = [], []
for fea in output_list:
fea_list_src.append(fea[0 : fea.size(0) // 2])
@@ -561,7 +561,7 @@ class TransferLoss:
"""
self.loss_type = loss_type
self.input_dim = input_dim
self.device = torch.device("cuda:%d" % (GPU) if torch.cuda.is_available() and GPU >= 0 else "cpu")
self.device = torch.device("cuda:%d" % GPU if torch.cuda.is_available() and GPU >= 0 else "cpu")
def compute(self, X, Y):
"""Compute adaptation loss
@@ -676,7 +676,8 @@ class MMD_loss(nn.Module):
self.fix_sigma = None
self.kernel_type = kernel_type
def guassian_kernel(self, source, target, kernel_mul=2.0, kernel_num=5, fix_sigma=None):
@staticmethod
def guassian_kernel(source, target, kernel_mul=2.0, kernel_num=5, fix_sigma=None):
n_samples = int(source.size()[0]) + int(target.size()[0])
total = torch.cat([source, target], dim=0)
total0 = total.unsqueeze(0).expand(int(total.size(0)), int(total.size(0)), int(total.size(1)))
@@ -691,7 +692,8 @@ class MMD_loss(nn.Module):
kernel_val = [torch.exp(-L2_distance / bandwidth_temp) for bandwidth_temp in bandwidth_list]
return sum(kernel_val)
def linear_mmd(self, X, Y):
@staticmethod
def linear_mmd(X, Y):
delta = X.mean(axis=0) - Y.mean(axis=0)
loss = delta.dot(delta.T)
return loss
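`linear_mmd` uses no instance state, which is why it can become a `@staticmethod`: it is the squared distance between the feature means of the two samples. A NumPy sketch of the same computation (the data here is synthetic and only for illustration):

```python
import numpy as np


def linear_mmd(X, Y):
    """Squared Euclidean distance between the column means of X and Y."""
    delta = X.mean(axis=0) - Y.mean(axis=0)
    return float(delta @ delta)


rng = np.random.default_rng(0)
source = rng.normal(size=(100, 4))
target = source + 0.5  # every feature mean shifted by 0.5

loss_same = linear_mmd(source, source)
loss_shifted = linear_mmd(source, target)  # 4 features * 0.5**2
```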


@@ -0,0 +1,511 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from __future__ import division
from __future__ import print_function
import numpy as np
import pandas as pd
from typing import Text, Union
import copy
from ...utils import get_or_create_path
from ...log import get_module_logger
import torch
import torch.nn as nn
import torch.optim as optim
from ...model.base import Model
from ...data.dataset import DatasetH
from ...data.dataset.handler import DataHandlerLP
########################################################################
########################################################################
########################################################################
class CNNEncoderBase(nn.Module):
def __init__(self, input_dim, output_dim, kernel_size, device):
"""Build a basic CNN encoder
Parameters
----------
input_dim : int
The input dimension
output_dim : int
The output dimension
kernel_size : int
The size of convolutional kernels
"""
super().__init__()
self.input_dim = input_dim
self.output_dim = output_dim
self.kernel_size = kernel_size
self.device = device
# set padding to ensure the same length
# it is correct only when kernel_size is odd, dilation is 1, stride is 1
self.conv = nn.Conv1d(input_dim, output_dim, kernel_size, padding=(kernel_size - 1) // 2)
def forward(self, x):
"""
Parameters
----------
x : torch.Tensor
input data
Returns
-------
torch.Tensor
Updated representations
"""
# flattened input: [batch_size, seq_len * input_dim]
# after view: [batch_size, seq_len, input_dim]; permuted to [batch_size, input_dim, seq_len] for Conv1d
x = x.view(x.shape[0], -1, self.input_dim).permute(0, 2, 1).to(self.device)
y = self.conv(x) # [batch_size, output_dim, conved_seq_len]
y = y.permute(0, 2, 1) # [batch_size, conved_seq_len, output_dim]
return y
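The comment in `CNNEncoderBase.__init__` says the padding keeps the sequence length only for an odd kernel size with stride 1 and dilation 1. That can be checked with the output-length formula documented for `torch.nn.Conv1d`. The helper names below are hypothetical and the sketch needs no PyTorch:

```python
def conv1d_out_len(seq_len, kernel_size, padding, stride=1, dilation=1):
    """Output length of a 1-D convolution, per the formula in the Conv1d docs."""
    return (seq_len + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def same_padding(kernel_size):
    """The padding used in the diff: (kernel_size - 1) // 2."""
    return (kernel_size - 1) // 2


len_k3 = conv1d_out_len(60, 3, same_padding(3))  # odd kernel: length preserved
len_k5 = conv1d_out_len(60, 5, same_padding(5))  # odd kernel: length preserved
len_k4 = conv1d_out_len(60, 4, same_padding(4))  # even kernel: one step shorter
```

With an even kernel the convolved sequence is one step shorter than the input, so the "last time step" read by `out_fc` shifts accordingly.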
class KRNNEncoderBase(nn.Module):
def __init__(self, input_dim, output_dim, dup_num, rnn_layers, dropout, device):
"""Build K parallel RNNs
Parameters
----------
input_dim : int
The input dimension
output_dim : int
The output dimension
dup_num : int
The number of parallel RNNs
rnn_layers: int
The number of RNN layers
"""
super().__init__()
self.input_dim = input_dim
self.output_dim = output_dim
self.dup_num = dup_num
self.rnn_layers = rnn_layers
self.dropout = dropout
self.device = device
self.rnn_modules = nn.ModuleList()
for _ in range(dup_num):
self.rnn_modules.append(nn.GRU(input_dim, output_dim, num_layers=self.rnn_layers, dropout=dropout))
def forward(self, x):
"""
Parameters
----------
x : torch.Tensor
Input data
Returns
-------
torch.Tensor
Updated representations
"""
# input shape: [batch_size, seq_len, input_dim]
# output shape: [batch_size, seq_len, output_dim]
# [seq_len, batch_size, input_dim]
batch_size, seq_len, input_dim = x.shape
x = x.permute(1, 0, 2).to(self.device)
hids = []
for rnn in self.rnn_modules:
h, _ = rnn(x) # [seq_len, batch_size, output_dim]
hids.append(h)
# [seq_len, batch_size, output_dim, num_dups]
hids = torch.stack(hids, dim=-1)
hids = hids.view(seq_len, batch_size, self.output_dim, self.dup_num)
hids = hids.mean(dim=3)
hids = hids.permute(1, 0, 2)
return hids
class CNNKRNNEncoder(nn.Module):
def __init__(
self, cnn_input_dim, cnn_output_dim, cnn_kernel_size, rnn_output_dim, rnn_dup_num, rnn_layers, dropout, device
):
"""Build an encoder composed of CNN and KRNN
Parameters
----------
cnn_input_dim : int
The input dimension of CNN
cnn_output_dim : int
The output dimension of CNN
cnn_kernel_size : int
The size of convolutional kernels
rnn_output_dim : int
The output dimension of KRNN
rnn_dup_num : int
The number of parallel duplicates for KRNN
rnn_layers : int
The number of RNN layers
"""
super().__init__()
self.cnn_encoder = CNNEncoderBase(cnn_input_dim, cnn_output_dim, cnn_kernel_size, device)
self.krnn_encoder = KRNNEncoderBase(cnn_output_dim, rnn_output_dim, rnn_dup_num, rnn_layers, dropout, device)
def forward(self, x):
"""
Parameters
----------
x : torch.Tensor
Input data
Returns
-------
torch.Tensor
Updated representations
"""
cnn_out = self.cnn_encoder(x)
krnn_out = self.krnn_encoder(cnn_out)
return krnn_out
class KRNNModel(nn.Module):
def __init__(self, fea_dim, cnn_dim, cnn_kernel_size, rnn_dim, rnn_dups, rnn_layers, dropout, device, **params):
"""Build a KRNN model
Parameters
----------
fea_dim : int
The feature dimension
cnn_dim : int
The hidden dimension of CNN
cnn_kernel_size : int
The size of convolutional kernels
rnn_dim : int
The hidden dimension of KRNN
rnn_dups : int
The number of parallel duplicates
rnn_layers: int
The number of RNN layers
"""
super().__init__()
self.encoder = CNNKRNNEncoder(
cnn_input_dim=fea_dim,
cnn_output_dim=cnn_dim,
cnn_kernel_size=cnn_kernel_size,
rnn_output_dim=rnn_dim,
rnn_dup_num=rnn_dups,
rnn_layers=rnn_layers,
dropout=dropout,
device=device,
)
self.out_fc = nn.Linear(rnn_dim, 1)
self.device = device
def forward(self, x):
# x: [batch_size, seq_len * fea_dim]; the encoder reshapes it to [batch_size, seq_len, fea_dim]
encode = self.encoder(x)
out = self.out_fc(encode[:, -1, :]).squeeze().to(self.device)
return out
class KRNN(Model):
"""KRNN Model
Parameters
----------
fea_dim : int
input dimension for each time step
metric: str
the evaluation metric used in early stop
optimizer : str
optimizer name
GPU : int
the GPU ID used for training; a negative value selects the CPU
"""
def __init__(
self,
fea_dim=6,
cnn_dim=64,
cnn_kernel_size=3,
rnn_dim=64,
rnn_dups=3,
rnn_layers=2,
dropout=0,
n_epochs=200,
lr=0.001,
metric="",
batch_size=2000,
early_stop=20,
loss="mse",
optimizer="adam",
GPU=0,
seed=None,
**kwargs
):
# Set logger.
self.logger = get_module_logger("KRNN")
self.logger.info("KRNN pytorch version...")
# set hyper-parameters.
self.fea_dim = fea_dim
self.cnn_dim = cnn_dim
self.cnn_kernel_size = cnn_kernel_size
self.rnn_dim = rnn_dim
self.rnn_dups = rnn_dups
self.rnn_layers = rnn_layers
self.dropout = dropout
self.n_epochs = n_epochs
self.lr = lr
self.metric = metric
self.batch_size = batch_size
self.early_stop = early_stop
self.optimizer = optimizer.lower()
self.loss = loss
self.device = torch.device("cuda:%d" % (GPU) if torch.cuda.is_available() and GPU >= 0 else "cpu")
self.seed = seed
self.logger.info(
"KRNN parameters setting:"
"\nfea_dim : {}"
"\ncnn_dim : {}"
"\ncnn_kernel_size : {}"
"\nrnn_dim : {}"
"\nrnn_dups : {}"
"\nrnn_layers : {}"
"\ndropout : {}"
"\nn_epochs : {}"
"\nlr : {}"
"\nmetric : {}"
"\nbatch_size: {}"
"\nearly_stop : {}"
"\noptimizer : {}"
"\nloss_type : {}"
"\nvisible_GPU : {}"
"\nuse_GPU : {}"
"\nseed : {}".format(
fea_dim,
cnn_dim,
cnn_kernel_size,
rnn_dim,
rnn_dups,
rnn_layers,
dropout,
n_epochs,
lr,
metric,
batch_size,
early_stop,
optimizer.lower(),
loss,
GPU,
self.use_gpu,
seed,
)
)
if self.seed is not None:
np.random.seed(self.seed)
torch.manual_seed(self.seed)
self.krnn_model = KRNNModel(
fea_dim=self.fea_dim,
cnn_dim=self.cnn_dim,
cnn_kernel_size=self.cnn_kernel_size,
rnn_dim=self.rnn_dim,
rnn_dups=self.rnn_dups,
rnn_layers=self.rnn_layers,
dropout=self.dropout,
device=self.device,
)
if optimizer.lower() == "adam":
self.train_optimizer = optim.Adam(self.krnn_model.parameters(), lr=self.lr)
elif optimizer.lower() == "gd":
self.train_optimizer = optim.SGD(self.krnn_model.parameters(), lr=self.lr)
else:
raise NotImplementedError("optimizer {} is not supported!".format(optimizer))
self.fitted = False
self.krnn_model.to(self.device)
@property
def use_gpu(self):
return self.device != torch.device("cpu")
def mse(self, pred, label):
loss = (pred - label) ** 2
return torch.mean(loss)
def loss_fn(self, pred, label):
mask = ~torch.isnan(label)
if self.loss == "mse":
return self.mse(pred[mask], label[mask])
raise ValueError("unknown loss `%s`" % self.loss)
def metric_fn(self, pred, label):
mask = torch.isfinite(label)
if self.metric in ("", "loss"):
return -self.loss_fn(pred[mask], label[mask])
raise ValueError("unknown metric `%s`" % self.metric)
def get_daily_inter(self, df, shuffle=False):
# organize the train data into daily batches
daily_count = df.groupby(level=0).size().values
daily_index = np.roll(np.cumsum(daily_count), 1)
daily_index[0] = 0
if shuffle:
# shuffle data
daily_shuffle = list(zip(daily_index, daily_count))
np.random.shuffle(daily_shuffle)
daily_index, daily_count = zip(*daily_shuffle)
return daily_index, daily_count
def train_epoch(self, x_train, y_train):
x_train_values = x_train.values
y_train_values = np.squeeze(y_train.values)
self.krnn_model.train()
indices = np.arange(len(x_train_values))
np.random.shuffle(indices)
for i in range(len(indices))[:: self.batch_size]:
if len(indices) - i < self.batch_size:
break
feature = torch.from_numpy(x_train_values[indices[i : i + self.batch_size]]).float().to(self.device)
label = torch.from_numpy(y_train_values[indices[i : i + self.batch_size]]).float().to(self.device)
pred = self.krnn_model(feature)
loss = self.loss_fn(pred, label)
self.train_optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_value_(self.krnn_model.parameters(), 3.0)
self.train_optimizer.step()
def test_epoch(self, data_x, data_y):
# prepare evaluation data
x_values = data_x.values
y_values = np.squeeze(data_y.values)
self.krnn_model.eval()
scores = []
losses = []
indices = np.arange(len(x_values))
for i in range(len(indices))[:: self.batch_size]:
if len(indices) - i < self.batch_size:
break
feature = torch.from_numpy(x_values[indices[i : i + self.batch_size]]).float().to(self.device)
label = torch.from_numpy(y_values[indices[i : i + self.batch_size]]).float().to(self.device)
pred = self.krnn_model(feature)
loss = self.loss_fn(pred, label)
losses.append(loss.item())
score = self.metric_fn(pred, label)
scores.append(score.item())
return np.mean(losses), np.mean(scores)
def fit(
self,
dataset: DatasetH,
evals_result=dict(),
save_path=None,
):
df_train, df_valid, df_test = dataset.prepare(
["train", "valid", "test"],
col_set=["feature", "label"],
data_key=DataHandlerLP.DK_L,
)
if df_train.empty or df_valid.empty:
raise ValueError("Empty data from dataset, please check your dataset config.")
x_train, y_train = df_train["feature"], df_train["label"]
x_valid, y_valid = df_valid["feature"], df_valid["label"]
save_path = get_or_create_path(save_path)
stop_steps = 0
train_loss = 0
best_score = -np.inf
best_epoch = 0
evals_result["train"] = []
evals_result["valid"] = []
# train
self.logger.info("training...")
self.fitted = True
for step in range(self.n_epochs):
self.logger.info("Epoch%d:", step)
self.logger.info("training...")
self.train_epoch(x_train, y_train)
self.logger.info("evaluating...")
train_loss, train_score = self.test_epoch(x_train, y_train)
val_loss, val_score = self.test_epoch(x_valid, y_valid)
self.logger.info("train %.6f, valid %.6f" % (train_score, val_score))
evals_result["train"].append(train_score)
evals_result["valid"].append(val_score)
if val_score > best_score:
best_score = val_score
stop_steps = 0
best_epoch = step
best_param = copy.deepcopy(self.krnn_model.state_dict())
else:
stop_steps += 1
if stop_steps >= self.early_stop:
self.logger.info("early stop")
break
self.logger.info("best score: %.6lf @ %d" % (best_score, best_epoch))
self.krnn_model.load_state_dict(best_param)
torch.save(best_param, save_path)
if self.use_gpu:
torch.cuda.empty_cache()
def predict(self, dataset: DatasetH, segment: Union[Text, slice] = "test"):
if not self.fitted:
raise ValueError("model is not fitted yet!")
x_test = dataset.prepare(segment, col_set="feature", data_key=DataHandlerLP.DK_I)
index = x_test.index
self.krnn_model.eval()
x_values = x_test.values
sample_num = x_values.shape[0]
preds = []
for begin in range(sample_num)[:: self.batch_size]:
if sample_num - begin < self.batch_size:
end = sample_num
else:
end = begin + self.batch_size
x_batch = torch.from_numpy(x_values[begin:end]).float().to(self.device)
with torch.no_grad():
pred = self.krnn_model(x_batch).detach().cpu().numpy()
preds.append(pred)
return pd.Series(np.concatenate(preds), index=index)
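One detail of the file above: `train_epoch` and `test_epoch` stop at the first incomplete batch, while `predict` keeps a shorter final batch. The two slicing rules can be written as small helpers. This is a sketch with hypothetical function names and plain integer bounds instead of tensors:

```python
def train_batches(n, batch_size):
    """Bounds as in train_epoch/test_epoch: a short final batch is dropped."""
    return [(i, i + batch_size) for i in range(0, n, batch_size) if n - i >= batch_size]


def predict_batches(n, batch_size):
    """Bounds as in predict: the final batch is kept and may be shorter."""
    return [(i, min(i + batch_size, n)) for i in range(0, n, batch_size)]


train_bounds = train_batches(5, 2)
predict_bounds = predict_batches(5, 2)
tiny_bounds = train_batches(1, 2)  # fewer rows than batch_size: no batch at all
```

When a segment has fewer rows than `batch_size`, the first rule yields no batch, so `test_epoch` would average empty lists. The default `batch_size=2000` is worth checking against small validation sets.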


@@ -47,10 +47,6 @@ class DNNModelPytorch(Model):
layer sizes
lr : float
learning rate
lr_decay : float
learning rate decay
lr_decay_steps : int
learning rate decay steps
optimizer : str
optimizer name
GPU : int
@@ -64,8 +60,6 @@ class DNNModelPytorch(Model):
batch_size=2000,
early_stop_rounds=50,
eval_steps=20,
lr_decay=0.96,
lr_decay_steps=100,
optimizer="gd",
loss="mse",
GPU=0,
@@ -93,8 +87,6 @@ class DNNModelPytorch(Model):
self.batch_size = batch_size
self.early_stop_rounds = early_stop_rounds
self.eval_steps = eval_steps
self.lr_decay = lr_decay
self.lr_decay_steps = lr_decay_steps
self.optimizer = optimizer.lower()
self.loss_type = loss
if isinstance(GPU, str):
@@ -116,8 +108,6 @@ class DNNModelPytorch(Model):
f"\nbatch_size : {batch_size}"
f"\nearly_stop_rounds : {early_stop_rounds}"
f"\neval_steps : {eval_steps}"
f"\nlr_decay : {lr_decay}"
f"\nlr_decay_steps : {lr_decay_steps}"
f"\noptimizer : {optimizer}"
f"\nloss_type : {loss}"
f"\nseed : {seed}"

View File

@@ -0,0 +1,381 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from __future__ import division
from __future__ import print_function
import numpy as np
import pandas as pd
from typing import Text, Union
import copy
from ...utils import get_or_create_path
from ...log import get_module_logger
import torch
import torch.nn as nn
import torch.optim as optim
from ...model.base import Model
from ...data.dataset import DatasetH
from ...data.dataset.handler import DataHandlerLP
from .pytorch_krnn import CNNKRNNEncoder
class SandwichModel(nn.Module):
def __init__(
self,
fea_dim,
cnn_dim_1,
cnn_dim_2,
cnn_kernel_size,
rnn_dim_1,
rnn_dim_2,
rnn_dups,
rnn_layers,
dropout,
device,
**params
):
"""Build a Sandwich model
Parameters
----------
fea_dim : int
The feature dimension
cnn_dim_1 : int
The hidden dimension of the first CNN
cnn_dim_2 : int
The hidden dimension of the second CNN
cnn_kernel_size : int
The size of convolutional kernels
rnn_dim_1 : int
The hidden dimension of the first KRNN
rnn_dim_2 : int
The hidden dimension of the second KRNN
rnn_dups : int
The number of parallel duplicates
rnn_layers: int
The number of RNN layers
"""
super().__init__()
self.first_encoder = CNNKRNNEncoder(
cnn_input_dim=fea_dim,
cnn_output_dim=cnn_dim_1,
cnn_kernel_size=cnn_kernel_size,
rnn_output_dim=rnn_dim_1,
rnn_dup_num=rnn_dups,
rnn_layers=rnn_layers,
dropout=dropout,
device=device,
)
self.second_encoder = CNNKRNNEncoder(
cnn_input_dim=rnn_dim_1,
cnn_output_dim=cnn_dim_2,
cnn_kernel_size=cnn_kernel_size,
rnn_output_dim=rnn_dim_2,
rnn_dup_num=rnn_dups,
rnn_layers=rnn_layers,
dropout=dropout,
device=device,
)
self.out_fc = nn.Linear(rnn_dim_2, 1)
self.device = device
def forward(self, x):
# x: [batch_size, seq_len * fea_dim]; the first encoder reshapes it to [batch_size, seq_len, fea_dim]
encode = self.first_encoder(x)
encode = self.second_encoder(encode)
out = self.out_fc(encode[:, -1, :]).squeeze().to(self.device)
return out
class Sandwich(Model):
"""Sandwich Model
Parameters
----------
fea_dim : int
input dimension for each time step
metric: str
the evaluation metric used in early stop
optimizer : str
optimizer name
GPU : int
the GPU ID used for training; a negative value selects the CPU
"""
def __init__(
self,
fea_dim=6,
cnn_dim_1=64,
cnn_dim_2=32,
cnn_kernel_size=3,
rnn_dim_1=16,
rnn_dim_2=8,
rnn_dups=3,
rnn_layers=2,
dropout=0,
n_epochs=200,
lr=0.001,
metric="",
batch_size=2000,
early_stop=20,
loss="mse",
optimizer="adam",
GPU=0,
seed=None,
**kwargs
):
# Set logger.
self.logger = get_module_logger("Sandwich")
self.logger.info("Sandwich pytorch version...")
# set hyper-parameters.
self.fea_dim = fea_dim
self.cnn_dim_1 = cnn_dim_1
self.cnn_dim_2 = cnn_dim_2
self.cnn_kernel_size = cnn_kernel_size
self.rnn_dim_1 = rnn_dim_1
self.rnn_dim_2 = rnn_dim_2
self.rnn_dups = rnn_dups
self.rnn_layers = rnn_layers
self.dropout = dropout
self.n_epochs = n_epochs
self.lr = lr
self.metric = metric
self.batch_size = batch_size
self.early_stop = early_stop
self.optimizer = optimizer.lower()
self.loss = loss
self.device = torch.device("cuda:%d" % (GPU) if torch.cuda.is_available() and GPU >= 0 else "cpu")
self.seed = seed
self.logger.info(
"Sandwich parameters setting:"
"\nfea_dim : {}"
"\ncnn_dim_1 : {}"
"\ncnn_dim_2 : {}"
"\ncnn_kernel_size : {}"
"\nrnn_dim_1 : {}"
"\nrnn_dim_2 : {}"
"\nrnn_dups : {}"
"\nrnn_layers : {}"
"\ndropout : {}"
"\nn_epochs : {}"
"\nlr : {}"
"\nmetric : {}"
"\nbatch_size: {}"
"\nearly_stop : {}"
"\noptimizer : {}"
"\nloss_type : {}"
"\nvisible_GPU : {}"
"\nuse_GPU : {}"
"\nseed : {}".format(
fea_dim,
cnn_dim_1,
cnn_dim_2,
cnn_kernel_size,
rnn_dim_1,
rnn_dim_2,
rnn_dups,
rnn_layers,
dropout,
n_epochs,
lr,
metric,
batch_size,
early_stop,
optimizer.lower(),
loss,
GPU,
self.use_gpu,
seed,
)
)
if self.seed is not None:
np.random.seed(self.seed)
torch.manual_seed(self.seed)
self.sandwich_model = SandwichModel(
fea_dim=self.fea_dim,
cnn_dim_1=self.cnn_dim_1,
cnn_dim_2=self.cnn_dim_2,
cnn_kernel_size=self.cnn_kernel_size,
rnn_dim_1=self.rnn_dim_1,
rnn_dim_2=self.rnn_dim_2,
rnn_dups=self.rnn_dups,
rnn_layers=self.rnn_layers,
dropout=self.dropout,
device=self.device,
)
if optimizer.lower() == "adam":
self.train_optimizer = optim.Adam(self.sandwich_model.parameters(), lr=self.lr)
elif optimizer.lower() == "gd":
self.train_optimizer = optim.SGD(self.sandwich_model.parameters(), lr=self.lr)
else:
raise NotImplementedError("optimizer {} is not supported!".format(optimizer))
self.fitted = False
self.sandwich_model.to(self.device)
@property
def use_gpu(self):
return self.device != torch.device("cpu")
def mse(self, pred, label):
loss = (pred - label) ** 2
return torch.mean(loss)
def loss_fn(self, pred, label):
mask = ~torch.isnan(label)
if self.loss == "mse":
return self.mse(pred[mask], label[mask])
raise ValueError("unknown loss `%s`" % self.loss)
def metric_fn(self, pred, label):
mask = torch.isfinite(label)
if self.metric in ("", "loss"):
return -self.loss_fn(pred[mask], label[mask])
raise ValueError("unknown metric `%s`" % self.metric)
def train_epoch(self, x_train, y_train):
x_train_values = x_train.values
y_train_values = np.squeeze(y_train.values)
self.sandwich_model.train()
indices = np.arange(len(x_train_values))
np.random.shuffle(indices)
for i in range(len(indices))[:: self.batch_size]:
if len(indices) - i < self.batch_size:
break
feature = torch.from_numpy(x_train_values[indices[i : i + self.batch_size]]).float().to(self.device)
label = torch.from_numpy(y_train_values[indices[i : i + self.batch_size]]).float().to(self.device)
pred = self.sandwich_model(feature)
loss = self.loss_fn(pred, label)
self.train_optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_value_(self.sandwich_model.parameters(), 3.0)
self.train_optimizer.step()
def test_epoch(self, data_x, data_y):
# prepare evaluation data
x_values = data_x.values
y_values = np.squeeze(data_y.values)
self.sandwich_model.eval()
scores = []
losses = []
indices = np.arange(len(x_values))
for i in range(len(indices))[:: self.batch_size]:
if len(indices) - i < self.batch_size:
break
feature = torch.from_numpy(x_values[indices[i : i + self.batch_size]]).float().to(self.device)
label = torch.from_numpy(y_values[indices[i : i + self.batch_size]]).float().to(self.device)
pred = self.sandwich_model(feature)
loss = self.loss_fn(pred, label)
losses.append(loss.item())
score = self.metric_fn(pred, label)
scores.append(score.item())
return np.mean(losses), np.mean(scores)
def fit(
self,
dataset: DatasetH,
evals_result=dict(),
save_path=None,
):
df_train, df_valid, df_test = dataset.prepare(
["train", "valid", "test"],
col_set=["feature", "label"],
data_key=DataHandlerLP.DK_L,
)
if df_train.empty or df_valid.empty:
raise ValueError("Empty data from dataset, please check your dataset config.")
x_train, y_train = df_train["feature"], df_train["label"]
x_valid, y_valid = df_valid["feature"], df_valid["label"]
save_path = get_or_create_path(save_path)
stop_steps = 0
train_loss = 0
best_score = -np.inf
best_epoch = 0
evals_result["train"] = []
evals_result["valid"] = []
# train
self.logger.info("training...")
self.fitted = True
for step in range(self.n_epochs):
self.logger.info("Epoch%d:", step)
self.logger.info("training...")
self.train_epoch(x_train, y_train)
self.logger.info("evaluating...")
train_loss, train_score = self.test_epoch(x_train, y_train)
val_loss, val_score = self.test_epoch(x_valid, y_valid)
self.logger.info("train %.6f, valid %.6f" % (train_score, val_score))
evals_result["train"].append(train_score)
evals_result["valid"].append(val_score)
if val_score > best_score:
best_score = val_score
stop_steps = 0
best_epoch = step
best_param = copy.deepcopy(self.sandwich_model.state_dict())
else:
stop_steps += 1
if stop_steps >= self.early_stop:
self.logger.info("early stop")
break
self.logger.info("best score: %.6lf @ %d" % (best_score, best_epoch))
self.sandwich_model.load_state_dict(best_param)
torch.save(best_param, save_path)
if self.use_gpu:
torch.cuda.empty_cache()
def predict(self, dataset: DatasetH, segment: Union[Text, slice] = "test"):
if not self.fitted:
raise ValueError("model is not fitted yet!")
x_test = dataset.prepare(segment, col_set="feature", data_key=DataHandlerLP.DK_I)
index = x_test.index
self.sandwich_model.eval()
x_values = x_test.values
sample_num = x_values.shape[0]
preds = []
for begin in range(sample_num)[:: self.batch_size]:
if sample_num - begin < self.batch_size:
end = sample_num
else:
end = begin + self.batch_size
x_batch = torch.from_numpy(x_values[begin:end]).float().to(self.device)
with torch.no_grad():
pred = self.sandwich_model(x_batch).detach().cpu().numpy()
preds.append(pred)
return pd.Series(np.concatenate(preds), index=index)
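One detail of the batching above is easy to miss: `train_epoch` and `test_epoch` `break` as soon as fewer than `batch_size` samples remain, so a trailing partial batch is skipped, while `predict` clamps `end` to `sample_num` and scores every row. The following is a minimal, framework-free sketch of that index arithmetic only; the helper names `full_batches` and `all_batches` are illustrative and do not exist in the source.

```python
def full_batches(n_samples, batch_size):
    """Mimic train_epoch/test_epoch: stop before a trailing partial batch."""
    batches = []
    for i in range(0, n_samples, batch_size):
        if n_samples - i < batch_size:
            break
        batches.append((i, i + batch_size))
    return batches


def all_batches(n_samples, batch_size):
    """Mimic predict: the last batch is clamped to the number of samples."""
    return [(b, min(b + batch_size, n_samples)) for b in range(0, n_samples, batch_size)]


print(full_batches(10, 4))  # [(0, 4), (4, 8)]
print(all_batches(10, 4))   # [(0, 4), (4, 8), (8, 10)]
```

With 10 samples and a batch size of 4, the last two samples never reach the loss or metric computation, but they still receive a prediction.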


@@ -168,7 +168,8 @@ class TCN(Model):
self.TCN_model.train()
for data in data_loader:
feature = data[:, :, 0:-1].to(self.device)
data = torch.transpose(data, 1, 2)
feature = data[:, 0:-1, :].to(self.device)
label = data[:, -1, -1].to(self.device)
pred = self.TCN_model(feature.float())
@@ -187,8 +188,8 @@ class TCN(Model):
losses = []
for data in data_loader:
feature = data[:, :, 0:-1].to(self.device)
data = torch.transpose(data, 1, 2)
feature = data[:, 0:-1, :].to(self.device)
# feature[torch.isnan(feature)] = 0
label = data[:, -1, -1].to(self.device)


@@ -70,7 +70,7 @@ class DayCumsum(ElemOperator):
Otherwise, the value is zero.
"""
def __init__(self, feature, start: str = "9:30", end: str = "14:59"):
def __init__(self, feature, start: str = "9:30", end: str = "14:59", data_granularity: int = 1):
self.feature = feature
self.start = datetime.strptime(start, "%H:%M")
self.end = datetime.strptime(end, "%H:%M")
@@ -80,15 +80,17 @@ class DayCumsum(ElemOperator):
self.noon_open = datetime.strptime("13:00", "%H:%M")
self.noon_close = datetime.strptime("15:00", "%H:%M")
self.start_id = time_to_day_index(self.start)
self.end_id = time_to_day_index(self.end)
self.data_granularity = data_granularity
self.start_id = time_to_day_index(self.start) // self.data_granularity
self.end_id = time_to_day_index(self.end) // self.data_granularity
assert 240 % self.data_granularity == 0
def period_cusum(self, df):
df = df.copy()
assert len(df) == 240
assert len(df) == 240 // self.data_granularity
df.iloc[0 : self.start_id] = 0
df = df.cumsum()
df.iloc[self.end_id + 1 : 240] = 0
df.iloc[self.end_id + 1 : 240 // self.data_granularity] = 0
return df
def _load_internal(self, instrument, start_index, end_index, freq):
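The effect of `data_granularity` on the start and end indices can be sketched without pandas. This assumes the trading day is 9:30 to 11:30 plus 13:00 to 15:00 (240 minutes, matching the assertion in the diff); `minute_index` is a hypothetical stand-in for `time_to_day_index`, whose implementation is not shown here, and `period_cumsum` works on a plain list rather than a DataFrame.

```python
from datetime import datetime
from itertools import accumulate

MINUTES_PER_DAY = 240  # 9:30-11:30 plus 13:00-15:00, as asserted in the diff


def minute_index(hhmm):
    """Hypothetical stand-in for time_to_day_index: minutes since the open, skipping lunch."""
    t = datetime.strptime(hhmm, "%H:%M")
    am_open = datetime.strptime("9:30", "%H:%M")
    am_close = datetime.strptime("11:30", "%H:%M")
    pm_open = datetime.strptime("13:00", "%H:%M")
    pm_close = datetime.strptime("15:00", "%H:%M")
    if am_open <= t < am_close:
        return int((t - am_open).total_seconds() // 60)
    if pm_open <= t < pm_close:
        return 120 + int((t - pm_open).total_seconds() // 60)
    raise ValueError(f"{hhmm} is outside trading hours")


def period_cumsum(values, start="9:30", end="14:59", data_granularity=1):
    """Cumulative sum between start and end on bars of `data_granularity` minutes."""
    assert MINUTES_PER_DAY % data_granularity == 0
    n_bars = MINUTES_PER_DAY // data_granularity
    assert len(values) == n_bars
    start_id = minute_index(start) // data_granularity
    end_id = minute_index(end) // data_granularity
    out = list(values)
    out[:start_id] = [0] * start_id                  # zero everything before the window
    out = list(accumulate(out))                      # running total
    out[end_id + 1 :] = [0] * (n_bars - end_id - 1)  # zero everything after the window
    return out


bars = [1] * 48  # one day of 5-minute bars
result = period_cumsum(bars, start="9:35", end="14:49", data_granularity=5)
print(result[0], result[1], result[45], result[46])  # 0 1 45 0
```

Integer division maps a minute index onto the bar that contains it, which is why the day length must be divisible by the granularity.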


@@ -1,5 +1,6 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from functools import partial
import pandas as pd
@@ -10,7 +11,11 @@ import matplotlib.pyplot as plt
from scipy import stats
from typing import Sequence
from qlib.typehint import Literal
from ..graph import ScatterGraph, SubplotsGraph, BarGraph, HeatmapGraph
from ..utils import guess_plotly_rangebreaks
def _group_return(pred_label: pd.DataFrame = None, reverse: bool = False, N: int = 5, **kwargs) -> tuple:
@@ -48,12 +53,13 @@ def _group_return(pred_label: pd.DataFrame = None, reverse: bool = False, N: int
t_df["long-average"] = t_df["Group1"] - pred_label.groupby(level="datetime")["label"].mean()
t_df = t_df.dropna(how="all") # for days which do not contain a label
# FIXME: support HIGH-FREQ
t_df.index = t_df.index.strftime("%Y-%m-%d")
# Cumulative Return By Group
group_scatter_figure = ScatterGraph(
t_df.cumsum(),
layout=dict(title="Cumulative Return", xaxis=dict(type="category", tickangle=45)),
layout=dict(
title="Cumulative Return",
xaxis=dict(tickangle=45, rangebreaks=kwargs.get("rangebreaks", guess_plotly_rangebreaks(t_df.index))),
),
).figure
t_df = t_df.loc[:, ["long-short", "long-average"]]
@@ -110,22 +116,36 @@ def _plot_qq(data: pd.Series = None, dist=stats.norm) -> go.Figure:
return fig
def _pred_ic(pred_label: pd.DataFrame = None, rank: bool = False, **kwargs) -> tuple:
def _pred_ic(
pred_label: pd.DataFrame = None, methods: Sequence[Literal["IC", "Rank IC"]] = ("IC", "Rank IC"), **kwargs
) -> tuple:
"""
:param pred_label:
:param rank:
:param pred_label: pd.DataFrame
must contain one column of realized return named `label` and one column of predicted score named `score`.
:param methods: Sequence[Literal["IC", "Rank IC"]]
IC series to plot.
IC is sectional pearson correlation between label and score
Rank IC is the spearman correlation between label and score
For the Monthly IC, IC histogram, IC Q-Q plot. Only the first type of IC will be plotted.
:return:
"""
if rank:
ic = pred_label.groupby(level="datetime").apply(
lambda x: x["label"].rank(pct=True).corr(x["score"].rank(pct=True))
)
else:
ic = pred_label.groupby(level="datetime").apply(lambda x: x["label"].corr(x["score"]))
_methods_mapping = {"IC": "pearson", "Rank IC": "spearman"}
_index = ic.index.get_level_values(0).astype("str").str.replace("-", "").str.slice(0, 6)
_monthly_ic = ic.groupby(_index).mean()
def _corr_series(x, method):
return x["label"].corr(x["score"], method=method)
ic_df = pd.concat(
[
pred_label.groupby(level="datetime").apply(partial(_corr_series, method=_methods_mapping[m])).rename(m)
for m in methods
],
axis=1,
)
_ic = ic_df.iloc(axis=1)[0]
_index = _ic.index.get_level_values(0).astype("str").str.replace("-", "").str.slice(0, 6)
_monthly_ic = _ic.groupby(_index).mean()
_monthly_ic.index = pd.MultiIndex.from_arrays(
[_monthly_ic.index.str.slice(0, 4), _monthly_ic.index.str.slice(4, 6)],
names=["year", "month"],
@@ -148,27 +168,27 @@ def _pred_ic(pred_label: pd.DataFrame = None, rank: bool = False, **kwargs) -> t
_monthly_ic = _monthly_ic.reindex(fill_index)
_ic_df = ic.to_frame("ic")
ic_bar_figure = ic_figure(_ic_df, kwargs.get("show_nature_day", True))
ic_bar_figure = ic_figure(ic_df, kwargs.get("show_nature_day", False))
ic_heatmap_figure = HeatmapGraph(
_monthly_ic.unstack(),
layout=dict(title="Monthly IC", yaxis=dict(tickformat=",d")),
layout=dict(title="Monthly IC", xaxis=dict(dtick=1), yaxis=dict(tickformat="04d", dtick=1)),
graph_kwargs=dict(xtype="array", ytype="array"),
).figure
dist = stats.norm
_qqplot_fig = _plot_qq(ic, dist)
_qqplot_fig = _plot_qq(_ic, dist)
if isinstance(dist, stats.norm.__class__):
dist_name = "Normal"
else:
dist_name = "Unknown"
_ic_df = _ic.to_frame("IC")
_bin_size = ((_ic_df.max() - _ic_df.min()) / 20).min()
_sub_graph_data = [
(
"ic",
"IC",
dict(
row=1,
col=1,
@@ -202,12 +222,13 @@ def _pred_autocorr(pred_label: pd.DataFrame, lag=1, **kwargs) -> tuple:
pred = pred_label.copy()
pred["score_last"] = pred.groupby(level="instrument")["score"].shift(lag)
ac = pred.groupby(level="datetime").apply(lambda x: x["score"].rank(pct=True).corr(x["score_last"].rank(pct=True)))
# FIXME: support HIGH-FREQ
_df = ac.to_frame("value")
_df.index = _df.index.strftime("%Y-%m-%d")
ac_figure = ScatterGraph(
_df,
layout=dict(title="Auto Correlation", xaxis=dict(type="category", tickangle=45)),
layout=dict(
title="Auto Correlation",
xaxis=dict(tickangle=45, rangebreaks=kwargs.get("rangebreaks", guess_plotly_rangebreaks(_df.index))),
),
).figure
return (ac_figure,)
@@ -233,32 +254,33 @@ def _pred_turnover(pred_label: pd.DataFrame, N=5, lag=1, **kwargs) -> tuple:
"Bottom": bottom,
}
)
# FIXME: support HIGH-FREQ
r_df.index = r_df.index.strftime("%Y-%m-%d")
turnover_figure = ScatterGraph(
r_df,
layout=dict(title="Top-Bottom Turnover", xaxis=dict(type="category", tickangle=45)),
layout=dict(
title="Top-Bottom Turnover",
xaxis=dict(tickangle=45, rangebreaks=kwargs.get("rangebreaks", guess_plotly_rangebreaks(r_df.index))),
),
).figure
return (turnover_figure,)
def ic_figure(ic_df: pd.DataFrame, show_nature_day=True, **kwargs) -> go.Figure:
"""IC figure
r"""IC figure
:param ic_df: ic DataFrame
:param show_nature_day: whether to display the abscissa of non-trading day
:param \*\*kwargs: contains some parameters to control plot style in plotly. Currently, supports
- `rangebreaks`: https://plotly.com/python/time-series/#Hiding-Weekends-and-Holidays
:return: plotly.graph_objs.Figure
"""
if show_nature_day:
date_index = pd.date_range(ic_df.index.min(), ic_df.index.max())
ic_df = ic_df.reindex(date_index)
# FIXME: support HIGH-FREQ
ic_df.index = ic_df.index.strftime("%Y-%m-%d")
ic_bar_figure = BarGraph(
ic_df,
layout=dict(
title="Information Coefficient (IC)",
xaxis=dict(type="category", tickangle=45),
xaxis=dict(tickangle=45, rangebreaks=kwargs.get("rangebreaks", guess_plotly_rangebreaks(ic_df.index))),
),
).figure
return ic_bar_figure
@@ -272,9 +294,10 @@ def model_performance_graph(
rank=False,
graph_names: list = ["group_return", "pred_ic", "pred_autocorr"],
show_notebook: bool = True,
show_nature_day=True,
show_nature_day: bool = False,
**kwargs,
) -> [list, tuple]:
"""Model performance
r"""Model performance
:param pred_label: index is **pd.MultiIndex**, index name is **[instrument, datetime]**; column names are **[score, label]**.
It is usually same as the label of model training(e.g. "Ref($close, -2)/Ref($close, -1) - 1").
@@ -297,17 +320,14 @@ def model_performance_graph(
:param graph_names: graph names; default ['group_return', 'pred_ic', 'pred_autocorr'].
:param show_notebook: whether to display graphics in notebook, the default is `True`.
:param show_nature_day: whether to display the abscissa of non-trading day.
:param \*\*kwargs: contains some parameters to control plot style in plotly. Currently, supports
- `rangebreaks`: https://plotly.com/python/time-series/#Hiding-Weekends-and-Holidays
:return: if show_notebook is True, display in notebook; else return `plotly.graph_objs.Figure` list.
"""
figure_list = []
for graph_name in graph_names:
fun_res = eval(f"_{graph_name}")(
pred_label=pred_label,
lag=lag,
N=N,
reverse=reverse,
rank=rank,
show_nature_day=show_nature_day,
pred_label=pred_label, lag=lag, N=N, reverse=reverse, rank=rank, show_nature_day=show_nature_day, **kwargs
)
figure_list += fun_res
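The two entries of `_methods_mapping` differ only in what is correlated: "IC" uses the raw values (Pearson), "Rank IC" uses their ranks (Spearman). The diff delegates this to pandas via `corr(..., method=...)`; the sketch below restates the idea with the standard library only, for a single cross-section. The function names `pearson`, `average_ranks` and `spearman` are illustrative and are not part of qlib.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def average_ranks(xs):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# One cross-section (a single datetime): scores and realized labels per instrument.
score = [0.1, 0.2, 0.3, 0.4, 0.5]
label = [1.0, 2.0, 4.0, 8.0, 100.0]  # monotonic in score, but far from linear

ic = pearson(label, score)
rank_ic = spearman(label, score)
print(round(ic, 3), round(rank_ic, 3))
```

Because the label is perfectly ordered by the score, the Rank IC is 1 while the plain IC is pulled below 1 by the outlier, which is the usual reason to plot both series.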


@@ -119,7 +119,7 @@ def _get_risk_analysis_figure(analysis_df: pd.DataFrame) -> Iterable[py.Figure]:
_figure = SubplotsGraph(
_get_all_risk_analysis(analysis_df),
kind_map=dict(kind="BarGraph", kwargs={}),
subplots_kwargs={"rows": 4, "cols": 1},
subplots_kwargs={"rows": 1, "cols": 4},
).figure
return (_figure,)


@@ -4,6 +4,7 @@
import pandas as pd
from ..graph import ScatterGraph
from ..utils import guess_plotly_rangebreaks
def _get_score_ic(pred_label: pd.DataFrame):
@@ -19,7 +20,7 @@ def _get_score_ic(pred_label: pd.DataFrame):
return pd.DataFrame({"ic": _ic, "rank_ic": _rank_ic})
def score_ic_graph(pred_label: pd.DataFrame, show_notebook: bool = True) -> [list, tuple]:
def score_ic_graph(pred_label: pd.DataFrame, show_notebook: bool = True, **kwargs) -> [list, tuple]:
"""score IC
Example:
@@ -53,11 +54,13 @@ def score_ic_graph(pred_label: pd.DataFrame, show_notebook: bool = True) -> [lis
:return: if show_notebook is True, display in notebook; else return **plotly.graph_objs.Figure** list.
"""
_ic_df = _get_score_ic(pred_label)
# FIXME: support HIGH-FREQ
_ic_df.index = _ic_df.index.strftime("%Y-%m-%d")
_figure = ScatterGraph(
_ic_df,
layout=dict(title="Score IC", xaxis=dict(type="category", tickangle=45)),
layout=dict(
title="Score IC",
xaxis=dict(tickangle=45, rangebreaks=kwargs.get("rangebreaks", guess_plotly_rangebreaks(_ic_df.index))),
),
graph_kwargs={"mode": "lines+markers"},
).figure
if show_notebook:


@@ -139,8 +139,8 @@ class FeaACAna(FeaAnalyser):
class FeaSkewTurt(NumFeaAnalyser):
def calc_stat_values(self):
self._skew = datetime_groupby_apply(self._dataset, "skew", skip_group=True)
self._kurt = datetime_groupby_apply(self._dataset, pd.DataFrame.kurt, skip_group=True)
self._skew = datetime_groupby_apply(self._dataset, "skew")
self._kurt = datetime_groupby_apply(self._dataset, pd.DataFrame.kurt)
def plot_single(self, col, ax):
self._skew[col].plot(ax=ax, label="skew")


@@ -1,6 +1,7 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import matplotlib.pyplot as plt
import pandas as pd
def sub_fig_generator(sub_fs=(3, 3), col_n=10, row_n=1, wspace=None, hspace=None, sharex=False, sharey=False):
@@ -43,3 +44,31 @@ def sub_fig_generator(sub_fs=(3, 3), col_n=10, row_n=1, wspace=None, hspace=None
res = res.item()
yield res
plt.show()
def guess_plotly_rangebreaks(dt_index: pd.DatetimeIndex):
"""
This function `guesses` the rangebreaks required to remove gaps in datetime index.
It basically calculates the difference between a `continuous` datetime index and index given.
For more details on `rangebreaks` params in plotly, see
https://plotly.com/python/reference/layout/xaxis/#layout-xaxis-rangebreaks
Parameters
----------
dt_index: pd.DatetimeIndex
The datetimes of the data.
Returns
-------
the `rangebreaks` to be passed into plotly axis.
"""
dt_idx = dt_index.sort_values()
gaps = dt_idx[1:] - dt_idx[:-1]
min_gap = gaps.min()
gaps_to_break = {}
for gap, d in zip(gaps, dt_idx[:-1]):
if gap > min_gap:
gaps_to_break.setdefault(gap - min_gap, []).append(d + min_gap)
return [dict(values=v, dvalue=int(k.total_seconds() * 1000)) for k, v in gaps_to_break.items()]
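To see what `guess_plotly_rangebreaks` produces, the gap-grouping logic can be replayed on plain `datetime` objects. This is a standard-library restatement for illustration, not the qlib function itself; the name `guess_rangebreaks` and the sample dates are made up for the example.

```python
from datetime import datetime, timedelta


def guess_rangebreaks(timestamps):
    """Stdlib restatement of the gap-grouping logic; not the qlib function itself."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts[:-1], ts[1:])]
    min_gap = min(gaps)
    gaps_to_break = {}
    for gap, start in zip(gaps, ts[:-1]):
        if gap > min_gap:
            # The break begins one regular step after `start` and lasts `gap - min_gap`.
            gaps_to_break.setdefault(gap - min_gap, []).append(start + min_gap)
    return [
        {"values": v, "dvalue": int(k.total_seconds() * 1000)}
        for k, v in gaps_to_break.items()
    ]


# Thursday, Friday, then Monday and Tuesday: the weekend is the only gap.
days = [datetime(2023, 6, 22), datetime(2023, 6, 23), datetime(2023, 6, 26), datetime(2023, 6, 27)]
breaks = guess_rangebreaks(days)
print(breaks)
# [{'values': [datetime.datetime(2023, 6, 24, 0, 0)], 'dvalue': 172800000}]
```

The smallest observed gap is treated as the regular sampling step, so each longer gap becomes a break that starts one step after the last observation and spans the remainder, expressed in milliseconds as plotly expects for `dvalue`.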


@@ -635,7 +635,7 @@ class FileOrderStrategy(BaseStrategy):
self.order_df = file
else:
with get_io_object(file) as f:
self.order_df = pd.read_csv(f, dtype={"datetime": np.str})
self.order_df = pd.read_csv(f, dtype={"datetime": str})
self.order_df["datetime"] = self.order_df["datetime"].apply(pd.Timestamp)
self.order_df = self.order_df.set_index(["datetime", "instrument"])


@@ -7,6 +7,7 @@ import numpy as np
import pandas as pd
from typing import Dict, List, Text, Tuple, Union
from abc import ABC
from qlib.data import D
from qlib.data.dataset import Dataset
@@ -17,11 +18,11 @@ from qlib.backtest.signal import Signal, create_signal_from
from qlib.backtest.decision import Order, OrderDir, TradeDecisionWO
from qlib.log import get_module_logger
from qlib.utils import get_pre_trading_date, load_dataset
from qlib.contrib.strategy.order_generator import OrderGenWOInteract
from qlib.contrib.strategy.order_generator import OrderGenerator, OrderGenWOInteract
from qlib.contrib.strategy.optimizer import EnhancedIndexingOptimizer
class BaseSignalStrategy(BaseStrategy):
class BaseSignalStrategy(BaseStrategy, ABC):
def __init__(
self,
*,
@@ -47,7 +48,7 @@ class BaseSignalStrategy(BaseStrategy):
- If `trade_exchange` is None, self.trade_exchange will be set with common_infra
- It allows different trade_exchanges to be used in different executions.
- For example:
- In daily execution, both daily exchange and minutely are usable, but the daily exchange is recommended because it run faster.
- In daily execution, both daily exchange and minutely are usable, but the daily exchange is recommended because it runs faster.
- In minutely execution, the daily exchange is not usable, only the minutely exchange is recommended.
"""
@@ -64,7 +65,7 @@ class BaseSignalStrategy(BaseStrategy):
def get_risk_degree(self, trade_step=None):
"""get_risk_degree
Return the proportion of your total value you will used in investment.
Return the proportion of your total value you will use in investment.
Dynamically risk_degree will result in Market timing.
"""
# It will use 95% amount of your total value by default
@@ -76,6 +77,7 @@ class TopkDropoutStrategy(BaseSignalStrategy):
# 1. Supporting leverage the get_range_limit result from the decision
# 2. Supporting alter_outer_trade_decision
# 3. Supporting checking the availability of trade decision
# 4. Regenerate results with forbid_all_trade_at_limit set to false and flip the default to false, as it is consistent with reality.
def __init__(
self,
*,
@@ -85,6 +87,7 @@ class TopkDropoutStrategy(BaseSignalStrategy):
method_buy="top",
hold_thresh=1,
only_tradable=False,
forbid_all_trade_at_limit=True,
**kwargs,
):
"""
@@ -111,6 +114,17 @@ class TopkDropoutStrategy(BaseSignalStrategy):
else:
strategy will make buy sell decision without checking the tradable state of the stock.
forbid_all_trade_at_limit : bool
if forbid all trades when limit_up or limit_down reached.
if forbid_all_trade_at_limit:
strategy will not do any trade when price reaches limit up/down, even not sell at limit up nor buy at
limit down, though allowed in reality.
else:
strategy will sell at limit up and buy at limit down.
"""
super().__init__(**kwargs)
self.topk = topk
@@ -119,6 +133,7 @@ class TopkDropoutStrategy(BaseSignalStrategy):
self.method_buy = method_buy
self.hold_thresh = hold_thresh
self.only_tradable = only_tradable
self.forbid_all_trade_at_limit = forbid_all_trade_at_limit
def generate_trade_decision(self, execute_result=None):
# get the number of trading step finished, trade_step can be [0, 1, 2, ..., trade_len - 1]
@@ -161,7 +176,7 @@ class TopkDropoutStrategy(BaseSignalStrategy):
]
else:
# Otherwise, the stock will make decision with out the stock tradable info
# Otherwise, the stock will make decision without the stock tradable info
def get_first_n(li, n):
return list(li)[:n]
@@ -171,7 +186,7 @@ class TopkDropoutStrategy(BaseSignalStrategy):
def filter_stock(li):
return li
current_temp = copy.deepcopy(self.trade_position)
current_temp: Position = copy.deepcopy(self.trade_position)
# generate order list for this adjust date
sell_order_list = []
buy_order_list = []
@@ -216,7 +231,10 @@ class TopkDropoutStrategy(BaseSignalStrategy):
buy = today[: len(sell) + self.topk - len(last)]
for code in current_stock_list:
if not self.trade_exchange.is_stock_tradable(
stock_id=code, start_time=trade_start_time, end_time=trade_end_time
stock_id=code,
start_time=trade_start_time,
end_time=trade_end_time,
direction=None if self.forbid_all_trade_at_limit else OrderDir.SELL,
):
continue
if code in sell:
@@ -244,7 +262,7 @@ class TopkDropoutStrategy(BaseSignalStrategy):
cash += trade_val - trade_cost
# buy new stock
# note the current has been changed
current_stock_list = current_temp.get_stock_list()
# current_stock_list = current_temp.get_stock_list()
value = cash * self.risk_degree / len(buy) if len(buy) > 0 else 0
# open_cost should be considered in the real trading environment, while the backtest in evaluate.py does not
@@ -253,7 +271,10 @@ class TopkDropoutStrategy(BaseSignalStrategy):
for code in buy:
# check is stock suspended
if not self.trade_exchange.is_stock_tradable(
stock_id=code, start_time=trade_start_time, end_time=trade_end_time
stock_id=code,
start_time=trade_start_time,
end_time=trade_end_time,
direction=None if self.forbid_all_trade_at_limit else OrderDir.BUY,
):
continue
# buy order
@@ -296,15 +317,15 @@ class WeightStrategyBase(BaseSignalStrategy):
- It allows different trade_exchanges to be used in different executions.
- For example:
- In daily execution, both daily exchange and minutely are usable, but the daily exchange is recommended because it run faster.
- In daily execution, both daily exchange and minutely are usable, but the daily exchange is recommended because it runs faster.
- In minutely execution, the daily exchange is not usable, only the minutely exchange is recommended.
"""
super().__init__(**kwargs)
if isinstance(order_generator_cls_or_obj, type):
self.order_generator = order_generator_cls_or_obj()
self.order_generator: OrderGenerator = order_generator_cls_or_obj()
else:
self.order_generator = order_generator_cls_or_obj
self.order_generator: OrderGenerator = order_generator_cls_or_obj
def generate_target_weight_position(self, score, current, trade_start_time, trade_end_time):
"""
@@ -316,9 +337,8 @@ class WeightStrategyBase(BaseSignalStrategy):
pred score for this trade date, index is stock_id, contain 'score' column.
current : Position()
current position.
trade_exchange : Exchange()
trade_date : pd.Timestamp
trade date.
trade_start_time: pd.Timestamp
trade_end_time: pd.Timestamp
"""
raise NotImplementedError()
@@ -428,7 +448,7 @@ class EnhancedIndexingStrategy(WeightStrategyBase):
specific_risk = load_dataset(root + "/" + self.specific_risk_path, index_col=[0])
if not factor_exp.index.equals(specific_risk.index):
# NOTE: for stocks missing specific_risk, we always assume it have the highest volatility
# NOTE: for stocks missing specific_risk, we always assume it has the highest volatility
specific_risk = specific_risk.reindex(factor_exp.index, fill_value=specific_risk.max())
universe = factor_exp.index.tolist()
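The new `forbid_all_trade_at_limit` flag only changes the `direction` argument handed to `is_stock_tradable`. The sketch below shows how that argument could change the outcome. The limit rule inside `blocked_by_limit` is an assumption taken from the docstring (buying is blocked at limit up, selling at limit down, `None` blocks both); the exchange implementation is not part of this diff, and `Direction` is an illustrative stand-in for `OrderDir`.

```python
from enum import Enum
from typing import Optional


class Direction(Enum):
    """Illustrative stand-in for qlib's OrderDir."""

    SELL = 0
    BUY = 1


def blocked_by_limit(at_limit_up: bool, at_limit_down: bool, direction: Optional[Direction]) -> bool:
    """Assumed rule: buying is blocked at limit up, selling at limit down,
    and direction=None blocks whenever either limit is hit."""
    if direction is None:
        return at_limit_up or at_limit_down
    if direction is Direction.BUY:
        return at_limit_up
    return at_limit_down


def check_direction(forbid_all_trade_at_limit: bool, side: Direction) -> Optional[Direction]:
    """Mirror of the expression added in the diff."""
    return None if forbid_all_trade_at_limit else side


# A held stock hits limit up and the strategy wants to sell it.
for forbid in (True, False):
    direction = check_direction(forbid, Direction.SELL)
    print(forbid, blocked_by_limit(at_limit_up=True, at_limit_down=False, direction=direction))
# True True
# False False
```

With the default `True`, the sell is skipped even though a sale at limit up is possible in a real market, which is the inconsistency the added TODO item refers to.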


@@ -783,7 +783,7 @@ class LocalPITProvider(PITProvider):
index_path = C.dpm.get_data_uri() / "financial" / instrument.lower() / f"{field}.index"
data_path = C.dpm.get_data_uri() / "financial" / instrument.lower() / f"{field}.data"
if not (index_path.exists() and data_path.exists()):
raise FileNotFoundError("No file is found. Raise exception and ")
raise FileNotFoundError("No file is found.")
# NOTE: The most significant performance loss is here.
# Does the acceleration that makes the program complicated really matter?
# - It makes parameters of the interface complicated
@@ -797,14 +797,14 @@ class LocalPITProvider(PITProvider):
cur_time_int = int(cur_time.year) * 10000 + int(cur_time.month) * 100 + int(cur_time.day)
loc = np.searchsorted(data["date"], cur_time_int, side="right")
if loc <= 0:
return pd.Series()
return pd.Series(dtype=C.pit_record_type["value"])
last_period = data["period"][:loc].max() # return the latest quarter
first_period = data["period"][:loc].min()
period_list = get_period_list(first_period, last_period, quarterly)
if period is not None:
# NOTE: `period` has higher priority than `start_index` & `end_index`
if period not in period_list:
return pd.Series()
return pd.Series(dtype=C.pit_record_type["value"])
else:
period_list = [period]
else:
@@ -868,7 +868,7 @@ class LocalExpressionProvider(ExpressionProvider):
# Ensure that each column type is consistent
# FIXME:
# 1) The stock data is currently float. If there are other types of data, this part needs to be re-implemented.
# 2) The the precision should be configurable
# 2) The precision should be configurable
try:
series = series.astype(np.float32)
except ValueError:


@@ -417,7 +417,7 @@ class TSDataSampler:
# NOTE: bool(np.nan) is True !!!!!!!!
# make sure reindex comes first. Otherwise extra NaN may appear.
flt_data = flt_data.swaplevel()
flt_data = flt_data.reindex(self.data_index).fillna(False).astype(np.bool)
flt_data = flt_data.reindex(self.data_index).fillna(False).astype(bool)
self.flt_data = flt_data.values
self.idx_map = self.flt_idx_map(self.flt_data, self.idx_map)
self.data_index = self.data_index[np.where(self.flt_data)[0]]


@@ -7,6 +7,7 @@ from typing import Callable, Union, Tuple, List, Iterator, Optional
import pandas as pd
from qlib.typehint import Literal
from ...log import get_module_logger, TimeInspector
from ...utils import init_instance_by_config
from ...utils.serial import Serializable
@@ -49,6 +50,8 @@ class DataHandler(Serializable):
- Fetching data with `col_set=CS_RAW` will return the raw data and may avoid pandas from copying the data when calling `loc`
"""
_data: pd.DataFrame # underlying data.
def __init__(
self,
instruments=None,
@@ -155,6 +158,11 @@ class DataHandler(Serializable):
"""
fetch data from underlying data source
Design motivation:
- providing a unified interface for underlying data.
- Potential to make the interface more friendly.
- User can improve performance when fetching data in this extra layer
Parameters
----------
selector : Union[pd.Timestamp, slice, str]
@@ -328,6 +336,9 @@ class DataHandler(Serializable):
yield cur_date, self.fetch(selector, **kwargs)
DATA_KEY_TYPE = Literal["raw", "infer", "learn"]
class DataHandlerLP(DataHandler):
"""
DataHandler with **(L)earnable (P)rocessor**
@@ -346,17 +357,28 @@ class DataHandlerLP(DataHandler):
- These processors only apply to the learning phase.
Tips to improve the performance of data handler
Tips for data handler
- To reduce the memory cost
- `drop_raw=True`: this will modify the data inplace on raw data;
- Please note processed data like `self._infer` or `self._learn` are concepts different from `segments` in Qlib's `Dataset` like "train" and "test"
- Processed data like `self._infer` or `self._learn` are underlying data processed with different processors
- `segments` in Qlib's `Dataset` like "train" and "test" are simply the time segmentations when querying data("train" are often before "test" in time-series).
- For example, you can query `data._infer` processed by `infer_processors` in the "train" time segmentation.
"""
# based on `self._data`, _infer and _learn are generated after processors
_infer: pd.DataFrame # data for inference
_learn: pd.DataFrame # data for learning models
# data key
DK_R = "raw"
DK_I = "infer"
DK_L = "learn"
DK_R: DATA_KEY_TYPE = "raw"
DK_I: DATA_KEY_TYPE = "infer"
DK_L: DATA_KEY_TYPE = "learn"
# map data_key to attribute name
ATTR_MAP = {DK_R: "_data", DK_I: "_infer", DK_L: "_learn"}
# process type
@@ -600,7 +622,7 @@ class DataHandlerLP(DataHandler):
# TODO: Be able to cache handler data. Save the memory for data processing
def _get_df_by_key(self, data_key: str = DK_I) -> pd.DataFrame:
def _get_df_by_key(self, data_key: DATA_KEY_TYPE = DK_I) -> pd.DataFrame:
if data_key == self.DK_R and self.drop_raw:
raise AttributeError(
"DataHandlerLP has not attribute _data, please set drop_raw = False if you want to use raw data"
@@ -613,7 +635,7 @@ class DataHandlerLP(DataHandler):
selector: Union[pd.Timestamp, slice, str] = slice(None, None),
level: Union[str, int] = "datetime",
col_set=DataHandler.CS_ALL,
data_key: str = DK_I,
data_key: DATA_KEY_TYPE = DK_I,
squeeze: bool = False,
proc_func: Callable = None,
) -> pd.DataFrame:
@@ -647,7 +669,7 @@ class DataHandlerLP(DataHandler):
proc_func=proc_func,
)
def get_cols(self, col_set=DataHandler.CS_ALL, data_key: str = DK_I) -> list:
def get_cols(self, col_set=DataHandler.CS_ALL, data_key: DATA_KEY_TYPE = DK_I) -> list:
"""
get the column names
@@ -655,7 +677,7 @@ class DataHandlerLP(DataHandler):
----------
col_set : str
select a set of meaningful columns.(e.g. features, columns).
data_key : str
data_key : DATA_KEY_TYPE
the data to fetch: DK_*.
Returns
@@ -698,3 +720,26 @@ class DataHandlerLP(DataHandler):
]:
setattr(new_hd, key, getattr(handler, key, None))
return new_hd
@classmethod
def from_df(cls, df: pd.DataFrame) -> "DataHandlerLP":
"""
Motivation:
- When the user wants to get a quick data handler.
The created data handler will have only one shared Dataframe without processors.
After creating the handler, user may often want to dump the handler for reuse
Here is a typical use case
.. code-block:: python
from qlib.data.dataset import DataHandlerLP
dh = DataHandlerLP.from_df(df)
dh.to_pickle(fname, dump_all=True)
TODO:
- The StaticDataLoader is quite slow. It doesn't have to copy the data again...
"""
loader = data_loader_module.StaticDataLoader(df)
return cls(data_loader=loader)
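The relationship between the `Literal`-typed data keys and `ATTR_MAP` can be shown with a toy class. This is a sketch under stated assumptions: `TinyHandler` and `get_by_key` are invented for illustration and hold plain lists, and `typing.Literal` is imported directly, whereas the diff imports it from `qlib.typehint`.

```python
from typing import Literal, get_args

DATA_KEY_TYPE = Literal["raw", "infer", "learn"]

# Same idea as ATTR_MAP in the diff: a data key selects an attribute name.
ATTR_MAP = {"raw": "_data", "infer": "_infer", "learn": "_learn"}


class TinyHandler:
    """Toy holder with the three data attributes; not DataHandlerLP."""

    def __init__(self, data, infer, learn, drop_raw=False):
        self._data, self._infer, self._learn = data, infer, learn
        self.drop_raw = drop_raw

    def get_by_key(self, data_key: DATA_KEY_TYPE = "infer"):
        if data_key not in get_args(DATA_KEY_TYPE):
            raise ValueError(f"unknown data_key {data_key!r}")
        if data_key == "raw" and self.drop_raw:
            raise AttributeError("raw data was dropped; set drop_raw=False to keep it")
        return getattr(self, ATTR_MAP[data_key])


handler = TinyHandler(data=[1], infer=[2], learn=[3])
print(handler.get_by_key("learn"))  # [3]
print(handler.get_by_key())         # [2]
```

The data key picks which processed copy is read; it is independent of the "train"/"test" time segments, which only restrict the rows returned from that copy.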


@@ -153,7 +153,7 @@ class QlibDataLoader(DLWParser):
filter_pipe: List = None,
swap_level: bool = True,
freq: Union[str, dict] = "day",
inst_processor: dict = None,
inst_processors: Union[dict, list] = None,
):
"""
Parameters
@@ -167,16 +167,19 @@ class QlibDataLoader(DLWParser):
freq: dict or str
If type(config) == dict and type(freq) == str, load config data using freq.
If type(config) == dict and type(freq) == dict, load config[<group_name>] data using freq[<group_name>]
inst_processor: dict
If inst_processor is not None and type(config) == dict; load config[<group_name>] data using inst_processor[<group_name>]
inst_processors: dict | list
If inst_processors is not None and type(config) == dict; load config[<group_name>] data using inst_processors[<group_name>]
If inst_processors is a list, then it will be applied to all groups.
"""
self.filter_pipe = filter_pipe
self.swap_level = swap_level
self.freq = freq
# sample
self.inst_processor = inst_processor if inst_processor is not None else {}
assert isinstance(self.inst_processor, dict), f"inst_processor(={self.inst_processor}) must be dict"
self.inst_processors = inst_processors if inst_processors is not None else {}
assert isinstance(
self.inst_processors, (dict, list)
), f"inst_processors(={self.inst_processors}) must be dict or list"
super().__init__(config)
@@ -187,8 +190,8 @@ class QlibDataLoader(DLWParser):
if _gp not in freq:
raise ValueError(f"freq(={freq}) missing group(={_gp})")
assert (
self.inst_processor
), f"freq(={self.freq}), inst_processor(={self.inst_processor}) cannot be None/empty"
self.inst_processors
), f"freq(={self.freq}), inst_processors(={self.inst_processors}) cannot be None/empty"
def load_group_df(
self,
@@ -208,9 +211,10 @@ class QlibDataLoader(DLWParser):
warnings.warn("`filter_pipe` is not None, but it will not be used with `instruments` as list")
freq = self.freq[gp_name] if isinstance(self.freq, dict) else self.freq
df = D.features(
instruments, exprs, start_time, end_time, freq=freq, inst_processors=self.inst_processor.get(gp_name, [])
inst_processors = (
self.inst_processors if isinstance(self.inst_processors, list) else self.inst_processors.get(gp_name, [])
)
df = D.features(instruments, exprs, start_time, end_time, freq=freq, inst_processors=inst_processors)
df.columns = names
if self.swap_level:
df = df.swaplevel().sort_index() # NOTE: if swaplevel, return <datetime, instrument>
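The dict-or-list rule for `inst_processors` reduces to a small lookup. The sketch below isolates it; `resolve_inst_processors` is a hypothetical helper name, and strings stand in for real processor objects or configs.

```python
def resolve_inst_processors(inst_processors, group_name):
    """Pick the processors for one column group, following the dict-or-list rule."""
    if inst_processors is None:
        inst_processors = {}
    assert isinstance(inst_processors, (dict, list)), "inst_processors must be dict or list"
    if isinstance(inst_processors, list):
        return inst_processors                  # a list applies to every group
    return inst_processors.get(group_name, [])  # a dict is looked up per group


shared = ["resample_1min"]
per_group = {"feature": ["resample_1min"]}

print(resolve_inst_processors(shared, "label"))      # ['resample_1min']
print(resolve_inst_processors(per_group, "label"))   # []
print(resolve_inst_processors(per_group, "feature")) # ['resample_1min']
```

A list is the shorter form when every group needs the same processing; a dict stays useful when only some groups do.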


@@ -2,7 +2,7 @@
# Licensed under the MIT License.
import abc
from typing import Union, Text
from typing import Union, Text, Optional
import numpy as np
import pandas as pd
@@ -11,6 +11,8 @@ from ...constant import EPS
from .utils import fetch_df_by_index
from ...utils.serial import Serializable
from ...utils.paral import datetime_groupby_apply
from qlib.data.inst_processor import InstProcessor
from qlib.data import D
def get_group_columns(df: pd.DataFrame, group: Union[Text, None]):
@@ -211,16 +213,19 @@ class MinMaxNorm(Processor):
self.min_val = np.nanmin(df[cols].values, axis=0)
self.max_val = np.nanmax(df[cols].values, axis=0)
self.ignore = self.min_val == self.max_val
# For speed, set `min_val` to `0` and `max_val` to `1` for the columns that do not need to be processed.
# The uniform formula `(x - min_val) / (max_val - min_val)` then reduces to `(x - 0) / (1 - 0)` for them,
# so those columns pass through unchanged.
for _i, _con in enumerate(self.ignore):
if _con:
self.min_val[_i] = 0
self.max_val[_i] = 1
self.cols = cols
def __call__(self, df):
def normalize(x, min_val=self.min_val, max_val=self.max_val, ignore=self.ignore):
if (~ignore).all():
return (x - min_val) / (max_val - min_val)
for i in range(ignore.size):
if not ignore[i]:
x[i] = (x[i] - min_val) / (max_val - min_val)
return x
def normalize(x, min_val=self.min_val, max_val=self.max_val):
return (x - min_val) / (max_val - min_val)
df.loc(axis=1)[self.cols] = normalize(df[self.cols].values)
return df
@@ -242,16 +247,19 @@ class ZScoreNorm(Processor):
self.mean_train = np.nanmean(df[cols].values, axis=0)
self.std_train = np.nanstd(df[cols].values, axis=0)
self.ignore = self.std_train == 0
# For speed, set `std_train` to `1` and `mean_train` to `0` for the columns that do not need to be processed.
# The uniform formula `(x - mean_train) / std_train` then reduces to `(x - 0) / 1` for them,
# so those columns pass through unchanged.
for _i, _con in enumerate(self.ignore):
if _con:
self.std_train[_i] = 1
self.mean_train[_i] = 0
self.cols = cols
def __call__(self, df):
def normalize(x, mean_train=self.mean_train, std_train=self.std_train, ignore=self.ignore):
if (~ignore).all():
return (x - mean_train) / std_train
for i in range(ignore.size):
if not ignore[i]:
x[i] = (x[i] - mean_train) / std_train
return x
def normalize(x, mean_train=self.mean_train, std_train=self.std_train):
return (x - mean_train) / std_train
df.loc(axis=1)[self.cols] = normalize(df[self.cols].values)
return df
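The rewritten `MinMaxNorm` and `ZScoreNorm` hunks above rely on one idea: neutralise the fitted parameters of constant columns so that a single vectorised expression can be applied to every column. A minimal sketch of that idea follows, assuming plain NumPy arrays; `fit_min_max` and `apply_min_max` are hypothetical helper names, and `np.where` stands in for the explicit loop used in the diff.

```python
import numpy as np


def fit_min_max(values: np.ndarray):
    """Return per-column (min, max) with constant columns neutralised."""
    min_val = np.nanmin(values, axis=0)
    max_val = np.nanmax(values, axis=0)
    ignore = min_val == max_val  # constant columns would divide by zero
    # For ignored columns use min=0, max=1 so that (x - 0) / (1 - 0) == x.
    min_val = np.where(ignore, 0.0, min_val)
    max_val = np.where(ignore, 1.0, max_val)
    return min_val, max_val


def apply_min_max(values: np.ndarray, min_val: np.ndarray, max_val: np.ndarray) -> np.ndarray:
    """Apply the same formula to every column; no per-column branching."""
    return (values - min_val) / (max_val - min_val)


data = np.array([[1.0, 5.0], [3.0, 5.0], [2.0, 5.0]])  # second column is constant
lo, hi = fit_min_max(data)
scaled = apply_min_max(data, lo, hi)
```

Moving the special case into the fit step keeps the transform a single broadcast operation, which is the speed-up the comments in the diff describe.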
@@ -361,7 +369,7 @@ class CSZFillna(Processor):
def __call__(self, df):
cols = get_group_columns(df, self.fields_group)
df[cols] = df[cols].groupby("datetime").apply(lambda x: x.fillna(x.mean()))
df[cols] = df[cols].groupby("datetime", group_keys=False).apply(lambda x: x.fillna(x.mean()))
return df
@@ -372,3 +380,42 @@ class HashStockFormat(Processor):
from .storage import HashingStockStorage # pylint: disable=C0415
return HashingStockStorage.from_df(df)
class TimeRangeFlt(InstProcessor):
"""
This is a filter for stocks.
Only keep the stocks whose data exist from start_time to end_time (existence in the middle is not checked).
WARNING: It may induce leakage!!!
"""
def __init__(
self,
start_time: Optional[Union[pd.Timestamp, str]] = None,
end_time: Optional[Union[pd.Timestamp, str]] = None,
freq: str = "day",
):
"""
Parameters
----------
start_time : Optional[Union[pd.Timestamp, str]]
The data must start earlier than (or at) `start_time`
None indicates data will not be filtered based on `start_time`
end_time : Optional[Union[pd.Timestamp, str]]
similar to start_time
freq : str
The frequency of the calendar
"""
# Align to calendar before filtering
cal = D.calendar(start_time=start_time, end_time=end_time, freq=freq)
self.start_time = None if start_time is None else cal[0]
self.end_time = None if end_time is None else cal[-1]
def __call__(self, df: pd.DataFrame, instrument, *args, **kwargs):
if (
df.empty
or (self.start_time is None or df.index.min() <= self.start_time)
and (self.end_time is None or df.index.max() >= self.end_time)
):
return df
return df.head(0)
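The `__call__` logic of `TimeRangeFlt` above can be sketched in isolation. This is a simplified illustration, not the class itself: it assumes a frame indexed by datetime, skips the calendar alignment done in `__init__`, and `keep_if_covers` is a hypothetical name.

```python
from typing import Optional

import pandas as pd


def keep_if_covers(
    df: pd.DataFrame,
    start_time: Optional[pd.Timestamp],
    end_time: Optional[pd.Timestamp],
) -> pd.DataFrame:
    """Return df unchanged if it spans [start_time, end_time], else an empty frame."""
    if df.empty:
        return df
    # A bound of None disables the corresponding check.
    starts_ok = start_time is None or df.index.min() <= start_time
    ends_ok = end_time is None or df.index.max() >= end_time
    # head(0) keeps the columns and dtypes but drops every row.
    return df if (starts_ok and ends_ok) else df.head(0)


idx = pd.date_range("2020-01-01", periods=5, freq="D")
df = pd.DataFrame({"close": range(5)}, index=idx)

full = keep_if_covers(df, pd.Timestamp("2020-01-02"), pd.Timestamp("2020-01-04"))
short = keep_if_covers(df, pd.Timestamp("2019-12-01"), None)
```

Only the first and last timestamps are inspected, so gaps in the middle of the range go undetected, as the docstring warns.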


@@ -2,9 +2,8 @@
# Licensed under the MIT License.
from __future__ import annotations
import pandas as pd
from typing import Union, List
from typing import Union, List, TYPE_CHECKING
from qlib.utils import init_instance_by_config
from typing import TYPE_CHECKING
if TYPE_CHECKING:
from qlib.data.dataset import DataHandler
@@ -121,7 +120,7 @@ def convert_index_format(df: Union[pd.DataFrame, pd.Series], level: str = "datet
return df
def init_task_handler(task: dict) -> Union[DataHandler, None]:
def init_task_handler(task: dict) -> DataHandler:
"""
initialize the handler part of the task **inplace**
@@ -142,5 +141,6 @@ def init_task_handler(task: dict) -> Union[DataHandler, None]:
if h_conf is not None:
handler = init_instance_by_config(h_conf, accept_types=DataHandler)
task["dataset"]["kwargs"]["handler"] = handler
return handler
else:
raise ValueError("The task does not contain a handler part.")


@@ -18,7 +18,7 @@ class StructuredCovEstimator(RiskModel):
`B` is the regression coefficients matrix for all observations (row) on
all factors (columns), and `U` is the residual matrix with shape like `X`.
Therefore the structured covariance can be estimated by
Therefore, the structured covariance can be estimated by
cov(X.T) = F @ cov(B.T) @ F.T + diag(var(U))
In finance domain, there are mainly three methods to design `F` [1][2]:


@@ -28,14 +28,15 @@ from qlib.typehint import Literal
def _get_multi_level_executor_config(
strategy_config: dict,
cash_limit: float = None,
cash_limit: float | None = None,
generate_report: bool = False,
data_granularity: str = "1min",
) -> dict:
executor_config = {
"class": "SimulatorExecutor",
"module_path": "qlib.backtest.executor",
"kwargs": {
"time_per_step": "1min",
"time_per_step": data_granularity,
"verbose": False,
"trade_type": SimulatorExecutor.TT_PARAL if cash_limit is not None else SimulatorExecutor.TT_SERIAL,
"generate_report": generate_report,
@@ -127,7 +128,7 @@ def single_with_simulator(
backtest_config: dict,
orders: pd.DataFrame,
split: Literal["stock", "day"] = "stock",
cash_limit: float = None,
cash_limit: float | None = None,
generate_report: bool = False,
) -> Union[Tuple[pd.DataFrame, dict], pd.DataFrame]:
"""Run backtest in a single thread with SingleAssetOrderExecution simulator. The orders will be executed day by day.
@@ -154,12 +155,7 @@ def single_with_simulator(
-------
If generate_report is True, return execution records and the generated report. Otherwise, return only records.
"""
if split == "stock":
stock_id = orders.iloc[0].instrument
init_qlib(backtest_config["qlib"], part=stock_id)
else:
day = orders.iloc[0].datetime
init_qlib(backtest_config["qlib"], part=day)
init_qlib(backtest_config["qlib"])
stocks = orders.instrument.unique().tolist()
@@ -181,13 +177,14 @@ def single_with_simulator(
strategy_config=backtest_config["strategies"],
cash_limit=cash_limit,
generate_report=generate_report,
data_granularity=backtest_config["data_granularity"],
)
exchange_config = copy.deepcopy(backtest_config["exchange"])
exchange_config.update(
{
"codes": stocks,
"freq": "1min",
"freq": backtest_config["data_granularity"],
}
)
@@ -202,7 +199,7 @@ def single_with_simulator(
reports.append(simulator.report_dict)
decisions += simulator.decisions
indicator_1day_objs = [report["indicator"]["1day"][1] for report in reports]
indicator_1day_objs = [report["indicator_dict"]["1day"][1] for report in reports]
indicator_info = {k: v for obj in indicator_1day_objs for k, v in obj.order_indicator_his.items()}
records = _convert_indicator_to_dataframe(indicator_info)
assert records is None or not np.isnan(records["ffr"]).any()
@@ -226,7 +223,7 @@ def single_with_collect_data_loop(
backtest_config: dict,
orders: pd.DataFrame,
split: Literal["stock", "day"] = "stock",
cash_limit: float = None,
cash_limit: float | None = None,
generate_report: bool = False,
) -> Union[Tuple[pd.DataFrame, dict], pd.DataFrame]:
"""Run backtest in a single thread with collect_data_loop.
@@ -253,12 +250,7 @@ def single_with_collect_data_loop(
If generate_report is True, return execution records and the generated report. Otherwise, return only records.
"""
if split == "stock":
stock_id = orders.iloc[0].instrument
init_qlib(backtest_config["qlib"], part=stock_id)
else:
day = orders.iloc[0].datetime
init_qlib(backtest_config["qlib"], part=day)
init_qlib(backtest_config["qlib"])
trade_start_time = orders["datetime"].min()
trade_end_time = orders["datetime"].max()
@@ -280,13 +272,14 @@ def single_with_collect_data_loop(
strategy_config=backtest_config["strategies"],
cash_limit=cash_limit,
generate_report=generate_report,
data_granularity=backtest_config["data_granularity"],
)
exchange_config = copy.deepcopy(backtest_config["exchange"])
exchange_config.update(
{
"codes": stocks,
"freq": "1min",
"freq": backtest_config["data_granularity"],
}
)
@@ -357,7 +350,10 @@ def backtest(backtest_config: dict, with_simulator: bool = False) -> pd.DataFram
if not output_path.exists():
os.makedirs(output_path)
res.to_csv(output_path / "summary.csv")
if "pa" in res.columns:
res["pa"] = res["pa"] * 10000.0 # align with training metrics
res.to_csv(output_path / "backtest_result.csv")
return res


@@ -98,8 +98,9 @@ def get_backtest_config_fromfile(path: str) -> dict:
"debug_single_day": None,
"concurrency": -1,
"multiplier": 1.0,
"output_dir": "outputs/",
"output_dir": "outputs_backtest/",
"generate_report": False,
"data_granularity": "1min",
}
backtest_config = merge_a_into_b(a=backtest_config, b=backtest_config_default)


@@ -1,8 +1,12 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from __future__ import annotations
import argparse
import os
import random
import sys
import warnings
from pathlib import Path
from typing import cast, List, Optional
@@ -13,14 +17,15 @@ import yaml
from qlib.backtest import Order
from qlib.backtest.decision import OrderDir
from qlib.constant import ONE_MIN
from qlib.rl.data.pickle_styled import load_simple_intraday_backtest_data
from qlib.rl.data.native import load_handler_intraday_processed_data
from qlib.rl.interpreter import ActionInterpreter, StateInterpreter
from qlib.rl.order_execution import SingleAssetOrderExecutionSimple
from qlib.rl.reward import Reward
from qlib.rl.trainer import Checkpoint, train
from qlib.rl.trainer import Checkpoint, backtest, train
from qlib.rl.trainer.callbacks import Callback, EarlyStopping, MetricsWriter
from qlib.rl.utils.log import CsvWriter
from qlib.utils import init_instance_by_config
from tianshou.policy import BasePolicy
from torch import nn
from torch.utils.data import Dataset
@@ -46,19 +51,17 @@ def _read_orders(order_dir: Path) -> pd.DataFrame:
class LazyLoadDataset(Dataset):
def __init__(
self,
data_dir: str,
order_file_path: Path,
data_dir: Path,
default_start_time_index: int,
default_end_time_index: int,
) -> None:
self._default_start_time_index = default_start_time_index
self._default_end_time_index = default_end_time_index
self._order_file_path = order_file_path
self._order_df = _read_orders(order_file_path).reset_index()
self._data_dir = data_dir
self._ticks_index: Optional[pd.DatetimeIndex] = None
self._data_dir = Path(data_dir)
def __len__(self) -> int:
return len(self._order_df)
@@ -71,12 +74,17 @@ class LazyLoadDataset(Dataset):
# TODO: We only load ticks index once based on the assumption that ticks index of different dates
# TODO: in one experiment are all the same. If that assumption does not hold, we need to load ticks index
# TODO: of all dates.
backtest_data = load_simple_intraday_backtest_data(
data = load_handler_intraday_processed_data(
data_dir=self._data_dir,
stock_id=row["instrument"],
date=date,
feature_columns_today=[],
feature_columns_yesterday=[],
backtest=True,
index_only=True,
)
self._ticks_index = [t - date for t in backtest_data.get_time_index()]
self._ticks_index = [t - date for t in data.today.index]
order = Order(
stock_id=row["instrument"],
@@ -98,93 +106,132 @@ def train_and_test(
action_interpreter: ActionInterpreter,
policy: BasePolicy,
reward: Reward,
run_training: bool,
run_backtest: bool,
) -> None:
order_root_path = Path(data_config["source"]["order_dir"])
data_granularity = simulator_config.get("data_granularity", 1)
def _simulator_factory_simple(order: Order) -> SingleAssetOrderExecutionSimple:
return SingleAssetOrderExecutionSimple(
order=order,
data_dir=Path(data_config["source"]["data_dir"]),
data_dir=data_config["source"]["feature_root_dir"],
feature_columns_today=data_config["source"]["feature_columns_today"],
feature_columns_yesterday=data_config["source"]["feature_columns_yesterday"],
data_granularity=data_granularity,
ticks_per_step=simulator_config["time_per_step"],
deal_price_type=data_config["source"].get("deal_price_column", "close"),
vol_threshold=simulator_config["vol_limit"],
)
train_dataset = LazyLoadDataset(
order_file_path=order_root_path / "train",
data_dir=Path(data_config["source"]["data_dir"]),
default_start_time_index=data_config["source"]["default_start_time"],
default_end_time_index=data_config["source"]["default_end_time"],
)
valid_dataset = LazyLoadDataset(
order_file_path=order_root_path / "valid",
data_dir=Path(data_config["source"]["data_dir"]),
default_start_time_index=data_config["source"]["default_start_time"],
default_end_time_index=data_config["source"]["default_end_time"],
)
assert data_config["source"]["default_start_time_index"] % data_granularity == 0
assert data_config["source"]["default_end_time_index"] % data_granularity == 0
callbacks = []
if "checkpoint_path" in trainer_config:
callbacks.append(
Checkpoint(
dirpath=Path(trainer_config["checkpoint_path"]),
every_n_iters=trainer_config["checkpoint_every_n_iters"],
save_latest="copy",
),
if run_training:
train_dataset, valid_dataset = [
LazyLoadDataset(
data_dir=data_config["source"]["feature_root_dir"],
order_file_path=order_root_path / tag,
default_start_time_index=data_config["source"]["default_start_time_index"] // data_granularity,
default_end_time_index=data_config["source"]["default_end_time_index"] // data_granularity,
)
for tag in ("train", "valid")
]
callbacks: List[Callback] = []
if "checkpoint_path" in trainer_config:
callbacks.append(MetricsWriter(dirpath=Path(trainer_config["checkpoint_path"])))
callbacks.append(
Checkpoint(
dirpath=Path(trainer_config["checkpoint_path"]) / "checkpoints",
every_n_iters=trainer_config.get("checkpoint_every_n_iters", 1),
save_latest="copy",
),
)
if "earlystop_patience" in trainer_config:
callbacks.append(
EarlyStopping(
patience=trainer_config["earlystop_patience"],
monitor="val/pa",
)
)
train(
simulator_fn=_simulator_factory_simple,
state_interpreter=state_interpreter,
action_interpreter=action_interpreter,
policy=policy,
reward=reward,
initial_states=cast(List[Order], train_dataset),
trainer_kwargs={
"max_iters": trainer_config["max_epoch"],
"finite_env_type": env_config["parallel_mode"],
"concurrency": env_config["concurrency"],
"val_every_n_iters": trainer_config.get("val_every_n_epoch", None),
"callbacks": callbacks,
},
vessel_kwargs={
"episode_per_iter": trainer_config["episode_per_collect"],
"update_kwargs": {
"batch_size": trainer_config["batch_size"],
"repeat": trainer_config["repeat_per_collect"],
},
"val_initial_states": valid_dataset,
},
)
trainer_kwargs = {
"max_iters": trainer_config["max_epoch"],
"finite_env_type": env_config["parallel_mode"],
"concurrency": env_config["concurrency"],
"val_every_n_iters": trainer_config.get("val_every_n_epoch", None),
"callbacks": callbacks,
}
vessel_kwargs = {
"episode_per_iter": trainer_config["episode_per_collect"],
"update_kwargs": {
"batch_size": trainer_config["batch_size"],
"repeat": trainer_config["repeat_per_collect"],
},
"val_initial_states": valid_dataset,
}
if run_backtest:
test_dataset = LazyLoadDataset(
data_dir=data_config["source"]["feature_root_dir"],
order_file_path=order_root_path / "test",
default_start_time_index=data_config["source"]["default_start_time_index"] // data_granularity,
default_end_time_index=data_config["source"]["default_end_time_index"] // data_granularity,
)
train(
simulator_fn=_simulator_factory_simple,
state_interpreter=state_interpreter,
action_interpreter=action_interpreter,
policy=policy,
reward=reward,
initial_states=cast(List[Order], train_dataset),
trainer_kwargs=trainer_kwargs,
vessel_kwargs=vessel_kwargs,
)
backtest(
simulator_fn=_simulator_factory_simple,
state_interpreter=state_interpreter,
action_interpreter=action_interpreter,
initial_states=test_dataset,
policy=policy,
logger=CsvWriter(Path(trainer_config["checkpoint_path"])),
reward=reward,
finite_env_type=env_config["parallel_mode"],
concurrency=env_config["concurrency"],
)
def main(config: dict) -> None:
def main(config: dict, run_training: bool, run_backtest: bool) -> None:
if not run_training and not run_backtest:
warnings.warn("Skip the entire job since training and backtest are both skipped.")
return
if "seed" in config["runtime"]:
seed_everything(config["runtime"]["seed"])
state_config = config["state_interpreter"]
state_interpreter: StateInterpreter = init_instance_by_config(state_config)
for extra_module_path in config["env"].get("extra_module_paths", []):
sys.path.append(extra_module_path)
state_interpreter: StateInterpreter = init_instance_by_config(config["state_interpreter"])
action_interpreter: ActionInterpreter = init_instance_by_config(config["action_interpreter"])
reward: Reward = init_instance_by_config(config["reward"])
additional_policy_kwargs = {
"obs_space": state_interpreter.observation_space,
"action_space": action_interpreter.action_space,
}
# Create torch network
if "kwargs" not in config["network"]:
config["network"]["kwargs"] = {}
config["network"]["kwargs"].update({"obs_space": state_interpreter.observation_space})
network: nn.Module = init_instance_by_config(config["network"])
if "network" in config:
if "kwargs" not in config["network"]:
config["network"]["kwargs"] = {}
config["network"]["kwargs"].update({"obs_space": state_interpreter.observation_space})
additional_policy_kwargs["network"] = init_instance_by_config(config["network"])
# Create policy
config["policy"]["kwargs"].update(
{
"network": network,
"obs_space": state_interpreter.observation_space,
"action_space": action_interpreter.action_space,
}
)
if "kwargs" not in config["policy"]:
config["policy"]["kwargs"] = {}
config["policy"]["kwargs"].update(additional_policy_kwargs)
policy: BasePolicy = init_instance_by_config(config["policy"])
use_cuda = config["runtime"].get("use_cuda", False)
@@ -200,20 +247,22 @@ def main(config: dict) -> None:
state_interpreter=state_interpreter,
policy=policy,
reward=reward,
run_training=run_training,
run_backtest=run_backtest,
)
if __name__ == "__main__":
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)
parser = argparse.ArgumentParser()
parser.add_argument("--config_path", type=str, required=True, help="Path to the config file")
parser.add_argument("--no_training", action="store_true", help="Skip training workflow.")
parser.add_argument("--run_backtest", action="store_true", help="Run backtest workflow.")
args = parser.parse_args()
with open(args.config_path, "r") as input_stream:
config = yaml.safe_load(input_stream)
main(config)
main(config, run_training=not args.no_training, run_backtest=args.run_backtest)
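The two new command-line switches map onto the `run_training` and `run_backtest` arguments of `main`. A small sketch of that mapping, using only `argparse` from the standard library; `parse_flags` is a hypothetical wrapper and `exp.yml` is a placeholder path.

```python
import argparse


def parse_flags(argv):
    """Parse the same three options as the script and derive the two booleans."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config_path", type=str, required=True, help="Path to the config file")
    parser.add_argument("--no_training", action="store_true", help="Skip training workflow.")
    parser.add_argument("--run_backtest", action="store_true", help="Run backtest workflow.")
    args = parser.parse_args(argv)
    # Training is on by default and switched off explicitly;
    # backtest is off by default and switched on explicitly.
    return {"run_training": not args.no_training, "run_backtest": args.run_backtest}


default_flags = parse_flags(["--config_path", "exp.yml"])
backtest_only = parse_flags(["--config_path", "exp.yml", "--no_training", "--run_backtest"])
```

Passing `--no_training` without `--run_backtest` leaves both booleans false, which is the case `main` handles with a warning and an early return.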


@@ -8,48 +8,14 @@ TODO: The implementation here is kind of adhoc. It is better to design a more un
from __future__ import annotations
import pickle
from pathlib import Path
from typing import List
import cachetools
import numpy as np
import pandas as pd
import qlib
from qlib.constant import REG_CN
from qlib.contrib.ops.high_freq import BFillNan, Cut, Date, DayCumsum, DayLast, FFillNan, IsInf, IsNull, Select
from qlib.data.dataset import DatasetH
dataset = None
class DataWrapper:
def __init__(
self,
feature_dataset: DatasetH,
backtest_dataset: DatasetH,
columns_today: List[str],
columns_yesterday: List[str],
_internal: bool = False,
):
assert _internal, "Init function of data wrapper is for internal use only."
self.feature_dataset = feature_dataset
self.backtest_dataset = backtest_dataset
self.columns_today = columns_today
self.columns_yesterday = columns_yesterday
@cachetools.cached( # type: ignore
cache=cachetools.LRUCache(100),
key=lambda _, stock_id, date, backtest: (stock_id, date.replace(hour=0, minute=0, second=0), backtest),
)
def get(self, stock_id: str, date: pd.Timestamp, backtest: bool = False) -> pd.DataFrame:
start_time, end_time = date.replace(hour=0, minute=0, second=0), date.replace(hour=23, minute=59, second=59)
dataset = self.backtest_dataset if backtest else self.feature_dataset
return dataset.handler.fetch(pd.IndexSlice[stock_id, start_time:end_time], level=None)
def init_qlib(qlib_config: dict, part: str = None) -> None:
def init_qlib(qlib_config: dict) -> None:
"""Initialize necessary resources to launch the workflow, including data directory, feature columns, etc.
Parameters
@@ -72,20 +38,15 @@ def init_qlib(qlib_config: dict, part: str = None) -> None:
"$bidV_1", "$bidV1_1", "$bidV3_1", "$bidV5_1", "$askV_1", "$askV1_1", "$askV3_1", "$askV5_1",
],
}
part
Identifying which part (stock / date) to load.
"""
global dataset # pylint: disable=W0603
def _convert_to_path(path: str | Path) -> Path:
return path if isinstance(path, Path) else Path(path)
provider_uri_map = {}
if "provider_uri_day" in qlib_config:
provider_uri_map["day"] = _convert_to_path(qlib_config["provider_uri_day"]).as_posix()
if "provider_uri_1min" in qlib_config:
provider_uri_map["1min"] = _convert_to_path(qlib_config["provider_uri_1min"]).as_posix()
for granularity in ["1min", "5min", "day"]:
if f"provider_uri_{granularity}" in qlib_config:
provider_uri_map[f"{granularity}"] = _convert_to_path(qlib_config[f"provider_uri_{granularity}"]).as_posix()
qlib.init(
region=REG_CN,
@@ -119,47 +80,3 @@ def init_qlib(qlib_config: dict, part: str = None) -> None:
redis_port=-1,
clear_mem_cache=False, # init_qlib will be called for multiple times. Keep the cache for improving performance
)
if part == "skip":
return
# this won't work if it's put outside in case of multiprocessing
from qlib.data import D # noqa pylint: disable=C0415,W0611
if part is None:
feature_path = Path(qlib_config["feature_root_dir"]) / "feature.pkl"
backtest_path = Path(qlib_config["feature_root_dir"]) / "backtest.pkl"
else:
feature_path = Path(qlib_config["feature_root_dir"]) / "feature" / (part + ".pkl")
backtest_path = Path(qlib_config["feature_root_dir"]) / "backtest" / (part + ".pkl")
with feature_path.open("rb") as f:
feature_dataset = pickle.load(f)
with backtest_path.open("rb") as f:
backtest_dataset = pickle.load(f)
dataset = DataWrapper(
feature_dataset,
backtest_dataset,
qlib_config["feature_columns_today"],
qlib_config["feature_columns_yesterday"],
_internal=True,
)
def fetch_features(stock_id: str, date: pd.Timestamp, yesterday: bool = False, backtest: bool = False) -> pd.DataFrame:
assert dataset is not None, "You must call init_qlib() before doing this."
if backtest:
fields = ["$close", "$volume"]
else:
fields = dataset.columns_yesterday if yesterday else dataset.columns_today
data = dataset.get(stock_id, date, backtest)
if data is None or len(data) == 0:
# create a fake index, but RL doesn't care about index
data = pd.DataFrame(0.0, index=np.arange(240), columns=fields, dtype=np.float32) # FIXME: hardcode here
else:
data = data.rename(columns={c: c.rstrip("0") for c in data.columns})
data = data[fields]
return data
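The granularity loop that replaces the two hard-coded `provider_uri_*` checks in `init_qlib` can be tried on its own. The sketch below assumes only the standard library; `build_provider_uri_map` is a hypothetical name, and the config keys follow the `provider_uri_{granularity}` pattern from the diff.

```python
from pathlib import Path


def build_provider_uri_map(qlib_config: dict) -> dict:
    """Collect the provider URIs present in the config, keyed by granularity."""
    provider_uri_map = {}
    for granularity in ["1min", "5min", "day"]:
        key = f"provider_uri_{granularity}"
        if key in qlib_config:
            # Accept both str and Path values; store a POSIX-style string.
            provider_uri_map[granularity] = Path(qlib_config[key]).as_posix()
    return provider_uri_map


uri_map = build_provider_uri_map(
    {"provider_uri_day": "data/day", "provider_uri_5min": Path("data/5min")}
)
```

Supporting another granularity then only requires extending the list, rather than adding another `if` branch.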


@@ -2,17 +2,29 @@
# Licensed under the MIT License.
from __future__ import annotations
from typing import cast
from pathlib import Path
from typing import cast, List
import cachetools
import pandas as pd
import pickle
import os
from qlib.backtest import Exchange, Order
from qlib.backtest.decision import TradeRange, TradeRangeByTime
from qlib.rl.order_execution.utils import get_ticks_slice
from qlib.constant import EPS_T
from .base import BaseIntradayBacktestData, BaseIntradayProcessedData, ProcessedDataProvider
from .integration import fetch_features
def get_ticks_slice(
ticks_index: pd.DatetimeIndex,
start: pd.Timestamp,
end: pd.Timestamp,
include_end: bool = False,
) -> pd.DatetimeIndex:
if not include_end:
end = end - EPS_T
return ticks_index[ticks_index.slice_indexer(start, end)]
class IntradayBacktestData(BaseIntradayBacktestData):
@@ -71,6 +83,31 @@ class IntradayBacktestData(BaseIntradayBacktestData):
return pd.DatetimeIndex([e[1] for e in list(self._exchange.quote_df.index)])
class DataframeIntradayBacktestData(BaseIntradayBacktestData):
"""Backtest data from dataframe"""
def __init__(self, df: pd.DataFrame, price_column: str = "$close0", volume_column: str = "$volume0") -> None:
self.df = df
self.price_column = price_column
self.volume_column = volume_column
def __repr__(self) -> str:
with pd.option_context("memory_usage", False, "display.max_info_columns", 1, "display.large_repr", "info"):
return f"{self.__class__.__name__}({self.df})"
def __len__(self) -> int:
return len(self.df)
def get_deal_price(self) -> pd.Series:
return self.df[self.price_column]
def get_volume(self) -> pd.Series:
return self.df[self.volume_column]
def get_time_index(self) -> pd.DatetimeIndex:
return cast(pd.DatetimeIndex, self.df.index)
@cachetools.cached( # type: ignore
cache=cachetools.LRUCache(100),
key=lambda order, _, __: order.key_by_day,
@@ -103,13 +140,18 @@ def load_backtest_data(
return backtest_data
class NTIntradayProcessedData(BaseIntradayProcessedData):
"""Subclass of IntradayProcessedData. Used to handle NT style data."""
class HandlerIntradayProcessedData(BaseIntradayProcessedData):
"""Subclass of IntradayProcessedData. Used to handle handler (bin format) style data."""
def __init__(
self,
data_dir: Path,
stock_id: str,
date: pd.Timestamp,
feature_columns_today: List[str],
feature_columns_yesterday: List[str],
backtest: bool = False,
index_only: bool = False,
) -> None:
def _drop_stock_id(df: pd.DataFrame) -> pd.DataFrame:
df = df.reset_index()
@@ -117,8 +159,18 @@ class NTIntradayProcessedData(BaseIntradayProcessedData):
df = df.drop(columns=["instrument"])
return df.set_index(["datetime"])
self.today = _drop_stock_id(fetch_features(stock_id, date))
self.yesterday = _drop_stock_id(fetch_features(stock_id, date, yesterday=True))
path = os.path.join(data_dir, "backtest" if backtest else "feature", f"{stock_id}.pkl")
start_time, end_time = date.replace(hour=0, minute=0, second=0), date.replace(hour=23, minute=59, second=59)
with open(path, "rb") as fstream:
dataset = pickle.load(fstream)
data = dataset.handler.fetch(pd.IndexSlice[stock_id, start_time:end_time], level=None)
if index_only:
self.today = _drop_stock_id(data[[]])
self.yesterday = _drop_stock_id(data[[]])
else:
self.today = _drop_stock_id(data[feature_columns_today])
self.yesterday = _drop_stock_id(data[feature_columns_yesterday])
def __repr__(self) -> str:
with pd.option_context("memory_usage", False, "display.max_info_columns", 1, "display.large_repr", "info"):
@@ -127,12 +179,42 @@ class NTIntradayProcessedData(BaseIntradayProcessedData):
@cachetools.cached( # type: ignore
cache=cachetools.LRUCache(100), # 100 * 50K = 5MB
key=lambda data_dir, stock_id, date, feature_columns_today, feature_columns_yesterday, backtest, index_only: (
stock_id,
date,
backtest,
index_only,
),
)
def load_nt_intraday_processed_data(stock_id: str, date: pd.Timestamp) -> NTIntradayProcessedData:
return NTIntradayProcessedData(stock_id, date)
def load_handler_intraday_processed_data(
data_dir: Path,
stock_id: str,
date: pd.Timestamp,
feature_columns_today: List[str],
feature_columns_yesterday: List[str],
backtest: bool = False,
index_only: bool = False,
) -> HandlerIntradayProcessedData:
return HandlerIntradayProcessedData(
data_dir, stock_id, date, feature_columns_today, feature_columns_yesterday, backtest, index_only
)
class NTProcessedDataProvider(ProcessedDataProvider):
class HandlerProcessedDataProvider(ProcessedDataProvider):
def __init__(
self,
data_dir: str,
feature_columns_today: List[str],
feature_columns_yesterday: List[str],
backtest: bool = False,
) -> None:
super().__init__()
self.data_dir = Path(data_dir)
self.feature_columns_today = feature_columns_today
self.feature_columns_yesterday = feature_columns_yesterday
self.backtest = backtest
def get_data(
self,
stock_id: str,
@@ -140,4 +222,12 @@ class NTProcessedDataProvider(ProcessedDataProvider):
feature_dim: int,
time_index: pd.Index,
) -> BaseIntradayProcessedData:
return load_nt_intraday_processed_data(stock_id, date)
return load_handler_intraday_processed_data(
self.data_dir,
stock_id,
date,
self.feature_columns_today,
self.feature_columns_yesterday,
backtest=self.backtest,
index_only=False,
)
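The `get_ticks_slice` helper shown in this file's diff turns an inclusive label slice into a half-open one by subtracting a small epsilon from `end`. A runnable sketch follows; the value of `EPS_T` here (one second) is an assumption made for illustration, since the diff only shows it being imported from `qlib.constant`.

```python
import pandas as pd

# Assumption: a one-second epsilon; the real value lives in qlib.constant.
EPS_T = pd.Timedelta("1s")


def get_ticks_slice(
    ticks_index: pd.DatetimeIndex,
    start: pd.Timestamp,
    end: pd.Timestamp,
    include_end: bool = False,
) -> pd.DatetimeIndex:
    if not include_end:
        # slice_indexer is inclusive on both ends, so step back slightly
        # to exclude the tick that falls exactly on `end`.
        end = end - EPS_T
    return ticks_index[ticks_index.slice_indexer(start, end)]


ticks = pd.date_range("2023-01-03 09:30", periods=5, freq="1min")
half_open = get_ticks_slice(ticks, ticks[1], ticks[3])
closed = get_ticks_slice(ticks, ticks[1], ticks[3], include_end=True)
```

The epsilon only needs to be smaller than the tick spacing; with one-minute ticks, one second is enough to drop the right endpoint without touching its neighbour.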


@@ -83,7 +83,16 @@ def _find_pickle(filename_without_suffix: Path) -> Path:
@lru_cache(maxsize=10) # 10 * 40M = 400MB
def _read_pickle(filename_without_suffix: Path) -> pd.DataFrame:
return pd.read_pickle(_find_pickle(filename_without_suffix))
df = pd.read_pickle(_find_pickle(filename_without_suffix))
index_cols = df.index.names
df = df.reset_index()
for date_col_name in ["date", "datetime"]:
if date_col_name in df:
df[date_col_name] = pd.to_datetime(df[date_col_name])
df = df.set_index(index_cols)
return df
class SimpleIntradayBacktestData(BaseIntradayBacktestData):
@@ -95,7 +104,7 @@ class SimpleIntradayBacktestData(BaseIntradayBacktestData):
stock_id: str,
date: pd.Timestamp,
deal_price: DealPriceType = "close",
order_dir: int = None,
order_dir: int | None = None,
) -> None:
super(SimpleIntradayBacktestData, self).__init__()
@@ -149,8 +158,8 @@ class SimpleIntradayBacktestData(BaseIntradayBacktestData):
return cast(pd.DatetimeIndex, self.data.index)
class IntradayProcessedData(BaseIntradayProcessedData):
"""Subclass of IntradayProcessedData. Used to handle Dataset Handler style data."""
class PickleIntradayProcessedData(BaseIntradayProcessedData):
"""Subclass of IntradayProcessedData. Used to handle pickle-styled data."""
def __init__(
self,
@@ -161,6 +170,7 @@ class IntradayProcessedData(BaseIntradayProcessedData):
time_index: pd.Index,
) -> None:
proc = _read_pickle((data_dir if isinstance(data_dir, Path) else Path(data_dir)) / stock_id)
# We have to infer the names here because,
# unfortunately they are not included in the original data.
cnames = _infer_processed_data_column_names(feature_dim)
@@ -198,7 +208,7 @@ def load_simple_intraday_backtest_data(
stock_id: str,
date: pd.Timestamp,
deal_price: DealPriceType = "close",
order_dir: int = None,
order_dir: int | None = None,
) -> SimpleIntradayBacktestData:
return SimpleIntradayBacktestData(data_dir, stock_id, date, deal_price, order_dir)
@@ -207,14 +217,14 @@ def load_simple_intraday_backtest_data(
cache=cachetools.LRUCache(100), # 100 * 50K = 5MB
key=lambda data_dir, stock_id, date, feature_dim, time_index: hashkey(data_dir, stock_id, date),
)
def load_pickled_intraday_processed_data(
def load_pickle_intraday_processed_data(
data_dir: Path,
stock_id: str,
date: pd.Timestamp,
feature_dim: int,
time_index: pd.Index,
) -> BaseIntradayProcessedData:
return IntradayProcessedData(data_dir, stock_id, date, feature_dim, time_index)
return PickleIntradayProcessedData(data_dir, stock_id, date, feature_dim, time_index)
class PickleProcessedDataProvider(ProcessedDataProvider):
@@ -230,7 +240,7 @@ class PickleProcessedDataProvider(ProcessedDataProvider):
feature_dim: int,
time_index: pd.Index,
) -> BaseIntradayProcessedData:
return load_pickled_intraday_processed_data(
return load_pickle_intraday_processed_data(
data_dir=self._data_dir,
stock_id=stock_id,
date=date,


@@ -53,6 +53,18 @@ class FullHistoryObs(TypedDict):
position_history: Any
class DummyStateInterpreter(StateInterpreter[SAOEState, dict]):
"""Dummy interpreter for policies that do not need inputs (for example, AllOne)."""
def interpret(self, state: SAOEState) -> dict:
# TODO: A fake state, used to pass `check_nan_observation`. Find a better way in the future.
return {"DUMMY": _to_int32(1)}
@property
def observation_space(self) -> spaces.Dict:
return spaces.Dict({"DUMMY": spaces.Box(-np.inf, np.inf, shape=(), dtype=np.int32)})
class FullHistoryStateInterpreter(StateInterpreter[SAOEState, FullHistoryObs]):
"""The observation of all the history, including today (until this moment), and yesterday.


@@ -12,11 +12,11 @@ import torch
 import torch.nn as nn
 from gym.spaces import Discrete
 from tianshou.data import Batch, ReplayBuffer, to_torch
-from tianshou.policy import BasePolicy, PPOPolicy
+from tianshou.policy import BasePolicy, PPOPolicy, DQNPolicy
 from qlib.rl.trainer.trainer import Trainer
-__all__ = ["AllOne", "PPO"]
+__all__ = ["AllOne", "PPO", "DQN"]
 # baselines #
@@ -32,7 +32,7 @@ class NonLearnablePolicy(BasePolicy):
         super().__init__()
     def learn(self, batch: Batch, **kwargs: Any) -> Dict[str, Any]:
-        pass
+        return {}
     def process_fn(
         self,
@@ -40,7 +40,7 @@ class NonLearnablePolicy(BasePolicy):
         buffer: ReplayBuffer,
         indices: np.ndarray,
     ) -> Batch:
-        pass
+        return Batch({})
 class AllOne(NonLearnablePolicy):
@@ -49,13 +49,18 @@ class AllOne(NonLearnablePolicy):
     Useful when implementing some baselines (e.g., TWAP).
     """
+    def __init__(self, obs_space: gym.Space, action_space: gym.Space, fill_value: float | int = 1.0) -> None:
+        super().__init__(obs_space, action_space)
+        self.fill_value = fill_value
     def forward(
         self,
         batch: Batch,
         state: dict | Batch | np.ndarray = None,
         **kwargs: Any,
     ) -> Batch:
-        return Batch(act=np.full(len(batch), 1.0), state=state)
+        return Batch(act=np.full(len(batch), self.fill_value), state=state)
 # ppo #
@@ -153,6 +158,56 @@ class PPO(PPOPolicy):
             set_weight(self, Trainer.get_policy_state_dict(weight_file))
+DQNModel = PPOActor # Reuse PPOActor.
+class DQN(DQNPolicy):
+    """A wrapper of tianshou DQNPolicy.
+    Differences:
+    - Auto-create model network. Supports discrete action space only.
+    - Support a ``weight_file`` that supports loading checkpoint.
+    """
+    def __init__(
+        self,
+        network: nn.Module,
+        obs_space: gym.Space,
+        action_space: gym.Space,
+        lr: float,
+        weight_decay: float = 0.0,
+        discount_factor: float = 0.99,
+        estimation_step: int = 1,
+        target_update_freq: int = 0,
+        reward_normalization: bool = False,
+        is_double: bool = True,
+        clip_loss_grad: bool = False,
+        weight_file: Optional[Path] = None,
+    ) -> None:
+        assert isinstance(action_space, Discrete)
+        model = DQNModel(network, action_space.n)
+        optimizer = torch.optim.Adam(
+            model.parameters(),
+            lr=lr,
+            weight_decay=weight_decay,
+        )
+        super().__init__(
+            model,
+            optimizer,
+            discount_factor=discount_factor,
+            estimation_step=estimation_step,
+            target_update_freq=target_update_freq,
+            reward_normalization=reward_normalization,
+            is_double=is_double,
+            clip_loss_grad=clip_loss_grad,
+        )
+        if weight_file is not None:
+            set_weight(self, Trainer.get_policy_state_dict(weight_file))
 # utilities: these should be put in a separate (common) file. #
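Reviewer note: the new `DQN` wrapper defaults to `is_double=True`. Assuming this flag selects the standard Double DQN target (action chosen by the online network, value read from the target network), the one-step target can be sketched as follows. `q_target` and its arguments are hypothetical names for illustration only and are not part of qlib or tianshou.

```python
from typing import Sequence


def q_target(
    reward: float,
    next_q_online: Sequence[float],
    next_q_target: Sequence[float],
    discount: float = 0.99,
    done: bool = False,
    is_double: bool = True,
) -> float:
    """One-step bootstrapped target for a single transition."""
    if done:
        # No bootstrapping past a terminal state.
        return reward
    if is_double:
        # Double DQN: pick the action with the online network,
        # then evaluate that action with the target network.
        best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
        bootstrap = next_q_target[best_action]
    else:
        # Vanilla DQN: the target network both selects and evaluates.
        bootstrap = max(next_q_target)
    return reward + discount * bootstrap


double_target = q_target(1.0, [0.2, 0.9], [0.5, 0.1], discount=0.5)
vanilla_target = q_target(1.0, [0.2, 0.9], [0.5, 0.1], discount=0.5, is_double=False)
```

With these inputs the Double DQN target bootstraps from the target network's value for action 1, while the vanilla target bootstraps from the larger value at action 0, which shows how the double variant reduces over-estimation.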

@@ -7,6 +7,7 @@ from typing import cast
 import numpy as np
+from qlib.backtest.decision import OrderDir
 from qlib.rl.order_execution.state import SAOEMetrics, SAOEState
 from qlib.rl.reward import Reward
@@ -21,10 +22,13 @@ class PAPenaltyReward(Reward[SAOEState]):
     ----------
     penalty
         The penalty for large volume in a short time.
+    scale
+        The weight used to scale up or down the reward.
     """
-    def __init__(self, penalty: float = 100.0):
+    def __init__(self, penalty: float = 100.0, scale: float = 1.0) -> None:
         self.penalty = penalty
+        self.scale = scale
     def reward(self, simulator_state: SAOEState) -> float:
         whole_order = simulator_state.order.amount
@@ -43,4 +47,53 @@ class PAPenaltyReward(Reward[SAOEState]):
         self.log("reward/pa", pa)
         self.log("reward/penalty", penalty)
-        return reward
+        return reward * self.scale
+class PPOReward(Reward[SAOEState]):
+    """Reward proposed by paper "An End-to-End Optimal Trade Execution Framework based on Proximal Policy Optimization".
+    Parameters
+    ----------
+    max_step
+        Maximum number of steps.
+    start_time_index
+        First time index that allowed to trade.
+    end_time_index
+        Last time index that allowed to trade.
+    """
+    def __init__(self, max_step: int, start_time_index: int = 0, end_time_index: int = 239) -> None:
+        self.max_step = max_step
+        self.start_time_index = start_time_index
+        self.end_time_index = end_time_index
+    def reward(self, simulator_state: SAOEState) -> float:
+        if simulator_state.cur_step == self.max_step - 1 or simulator_state.position < 1e-6:
+            if simulator_state.history_exec["deal_amount"].sum() == 0.0:
+                vwap_price = cast(
+                    float,
+                    np.average(simulator_state.history_exec["market_price"]),
+                )
+            else:
+                vwap_price = cast(
+                    float,
+                    np.average(
+                        simulator_state.history_exec["market_price"],
+                        weights=simulator_state.history_exec["deal_amount"],
+                    ),
+                )
+            twap_price = simulator_state.backtest_data.get_deal_price().mean()
+            if simulator_state.order.direction == OrderDir.SELL:
+                ratio = vwap_price / twap_price if twap_price != 0 else 1.0
+            else:
+                ratio = twap_price / vwap_price if vwap_price != 0 else 1.0
+            if ratio < 1.0:
+                return -1.0
+            elif ratio < 1.1:
+                return 0.0
+            else:
+                return 1.0
+        else:
+            return 0.0
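Reviewer note: the terminal branch of `PPOReward.reward` compares the achieved VWAP with the day's TWAP and maps the ratio to one of three reward values. The sketch below restates that arithmetic in plain Python so the thresholds can be checked by hand. `vwap` and `terminal_reward` are hypothetical helper names, and plain lists stand in for the simulator's `history_exec` columns.

```python
from typing import Sequence


def vwap(prices: Sequence[float], amounts: Sequence[float]) -> float:
    """Volume-weighted average price; falls back to a plain mean when nothing was dealt."""
    total = sum(amounts)
    if total == 0.0:
        return sum(prices) / len(prices)
    return sum(p * a for p, a in zip(prices, amounts)) / total


def terminal_reward(vwap_price: float, twap_price: float, is_sell: bool) -> float:
    """Map the VWAP/TWAP ratio to {-1, 0, 1} as in the diff above."""
    if is_sell:
        # Selling above TWAP is good, so a larger VWAP gives a larger ratio.
        ratio = vwap_price / twap_price if twap_price != 0 else 1.0
    else:
        # Buying below TWAP is good, so the ratio is inverted.
        ratio = twap_price / vwap_price if vwap_price != 0 else 1.0
    if ratio < 1.0:
        return -1.0
    if ratio < 1.1:
        return 0.0
    return 1.0


executed_vwap = vwap([10.0, 20.0], [1.0, 3.0])  # (10*1 + 20*3) / 4
no_fill_vwap = vwap([10.0, 20.0], [0.0, 0.0])  # plain mean fallback
```

Note that the positive reward requires beating TWAP by at least 10%, so under this mapping most episodes that only slightly outperform TWAP receive 0.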

Some files were not shown because too many files have changed in this diff.