mirror of https://github.com/microsoft/qlib.git synced 2026-06-29 09:01:18 +08:00

Compare commits

62 Commits

Author SHA1 Message Date
you-n-g
97aa16a078 Update __init__.py 2022-01-20 02:02:56 +08:00
you-n-g
094be9be86 Update python-publish.yml 2022-01-20 01:56:35 +08:00
you-n-g
d9b9386032 Update __init__.py 2022-01-20 01:49:53 +08:00
Young
b86a30aae7 Bump to 0.8.2 2022-01-20 01:43:26 +08:00
you-n-g
2c5a4691f3 fall back error (#875) 2022-01-20 01:39:24 +08:00
you-n-g
54344c4426 Update config.py (#871) 2022-01-19 19:51:36 +08:00
you-n-g
303cdb8ce3 update required package for test 2022-01-19 13:10:46 +08:00
you-n-g
1a0ac1ab6d Remove arctic from Qlib core to Contrib (#865)
* Remove arctic from Qlib core to Contrib

* fix empty df bug
2022-01-19 10:39:37 +08:00
Wangwuyi123
a79e446724 Update README.md (#863) 2022-01-19 09:57:11 +08:00
you-n-g
bdf1fb29a6 Fix pytorch_nn.py step bug (#864)
* Update pytorch_nn.py

* Update pytorch_nn.py
2022-01-18 22:39:19 +08:00
dependabot[bot]
86e1265f69 Bump numpy from 1.17.4 to 1.21.0 in /examples/benchmarks/ADARNN (#870)
Bumps [numpy](https://github.com/numpy/numpy) from 1.17.4 to 1.21.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt)
- [Commits](https://github.com/numpy/numpy/compare/v1.17.4...v1.21.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-18 22:17:25 +08:00
dependabot[bot]
628eb7fa73 Bump numpy from 1.17.4 to 1.21.0 in /examples/benchmarks/ADD (#869)
Bumps [numpy](https://github.com/numpy/numpy) from 1.17.4 to 1.21.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt)
- [Commits](https://github.com/numpy/numpy/compare/v1.17.4...v1.21.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-18 22:17:15 +08:00
dependabot[bot]
2a1b512cd2 Bump numpy from 1.17.4 to 1.21.0 in /examples/benchmarks/ALSTM (#868)
Bumps [numpy](https://github.com/numpy/numpy) from 1.17.4 to 1.21.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt)
- [Commits](https://github.com/numpy/numpy/compare/v1.17.4...v1.21.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-18 22:17:04 +08:00
dependabot[bot]
50e7901e87 Bump numpy from 1.17.4 to 1.21.0 in /examples/benchmarks/CatBoost (#867)
Bumps [numpy](https://github.com/numpy/numpy) from 1.17.4 to 1.21.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt)
- [Commits](https://github.com/numpy/numpy/compare/v1.17.4...v1.21.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-18 22:16:47 +08:00
dependabot[bot]
3ba54cd1ab Bump numpy from 1.17.4 to 1.21.0 in /examples/benchmarks/DoubleEnsemble (#866)
Bumps [numpy](https://github.com/numpy/numpy) from 1.17.4 to 1.21.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt)
- [Commits](https://github.com/numpy/numpy/compare/v1.17.4...v1.21.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-18 22:16:23 +08:00
dependabot[bot]
483d01f0c1 Bump numpy from 1.17.4 to 1.21.0 in /examples/benchmarks/GRU (#833)
Bumps [numpy](https://github.com/numpy/numpy) from 1.17.4 to 1.21.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt)
- [Commits](https://github.com/numpy/numpy/compare/v1.17.4...v1.21.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-18 22:16:13 +08:00
dependabot[bot]
61836cba3d Bump numpy from 1.17.4 to 1.21.0 in /examples/benchmarks/LightGBM (#830)
Bumps [numpy](https://github.com/numpy/numpy) from 1.17.4 to 1.21.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt)
- [Commits](https://github.com/numpy/numpy/compare/v1.17.4...v1.21.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-18 22:16:03 +08:00
dependabot[bot]
aeb5e40c77 Bump numpy from 1.17.4 to 1.21.0 in /examples/benchmarks/SFM (#829)
Bumps [numpy](https://github.com/numpy/numpy) from 1.17.4 to 1.21.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt)
- [Commits](https://github.com/numpy/numpy/compare/v1.17.4...v1.21.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-18 22:14:50 +08:00
dependabot[bot]
116f0fa7a7 Bump numpy from 1.17.4 to 1.21.0 in /examples/benchmarks/TCTS (#834)
Bumps [numpy](https://github.com/numpy/numpy) from 1.17.4 to 1.21.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt)
- [Commits](https://github.com/numpy/numpy/compare/v1.17.4...v1.21.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-18 22:13:57 +08:00
dependabot[bot]
5296cce725 Bump numpy from 1.17.4 to 1.21.0 in /examples/benchmarks/GATs (#831)
Bumps [numpy](https://github.com/numpy/numpy) from 1.17.4 to 1.21.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt)
- [Commits](https://github.com/numpy/numpy/compare/v1.17.4...v1.21.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-18 22:13:29 +08:00
dependabot[bot]
292fcc9e98 Bump numpy from 1.17.4 to 1.21.0 in /examples/benchmarks/TRA (#832)
Bumps [numpy](https://github.com/numpy/numpy) from 1.17.4 to 1.21.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt)
- [Commits](https://github.com/numpy/numpy/compare/v1.17.4...v1.21.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-18 22:13:23 +08:00
dependabot[bot]
d3fbf066cf Bump numpy from 1.17.4 to 1.21.0 in /examples/benchmarks/Localformer (#835)
Bumps [numpy](https://github.com/numpy/numpy) from 1.17.4 to 1.21.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt)
- [Commits](https://github.com/numpy/numpy/compare/v1.17.4...v1.21.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-18 22:13:06 +08:00
dependabot[bot]
52ecb79e0b Bump numpy from 1.17.4 to 1.21.0 in /examples/benchmarks/MLP (#836)
Bumps [numpy](https://github.com/numpy/numpy) from 1.17.4 to 1.21.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt)
- [Commits](https://github.com/numpy/numpy/compare/v1.17.4...v1.21.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-18 22:12:57 +08:00
dependabot[bot]
59c52eac0a Bump numpy from 1.17.4 to 1.21.0 in /examples/benchmarks/TCN (#837)
Bumps [numpy](https://github.com/numpy/numpy) from 1.17.4 to 1.21.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt)
- [Commits](https://github.com/numpy/numpy/compare/v1.17.4...v1.21.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-18 22:12:42 +08:00
dependabot[bot]
f455305a2a Bump numpy from 1.17.4 to 1.21.0 in /examples/benchmarks/LSTM (#838)
Bumps [numpy](https://github.com/numpy/numpy) from 1.17.4 to 1.21.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt)
- [Commits](https://github.com/numpy/numpy/compare/v1.17.4...v1.21.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-18 22:12:34 +08:00
you-n-g
a67f67db6e Update README.md 2022-01-18 10:20:07 +08:00
you-n-g
5c2e99aee3 Update .readthedocs.yml 2022-01-18 09:25:30 +08:00
luocy16
2bb8a4ce0e Supporting Arctic Backend Provider & Orderbook, Tick Data Example (#744)
* change weight_decay & batchsize

* del weight_decay

* big weight_decay

* mid weight_decay

* small layer

* 2 layer

* full layer

* no weight decay

* divide into two data source

* change parse field

* delete some debug

* add Toperator

* new format of arctic

* fix cache bug to arctic read

* fix connection problem

* add some operator

* final version for arcitc

* clear HZ cache

* remove not used function

* add topswrappers

* successfully import data and run first test

* A simpler version to support arctic

* Successfully run all high-freq expressions

* Black format and fix add docs

* Add docs for download and test data

* update scripts and docs

* Add docs

* fix bug

* Refine docs

* fix test bug

* fix CI error

* clean code

Co-authored-by: bxdd <bxddream@gmail.com>
Co-authored-by: wangwenxi.handsome <wangwenxi.handsome@gmail.com>
Co-authored-by: Young <afe.young@gmail.com>
2022-01-18 09:13:11 +08:00
you-n-g
7f274b1e4e Fix code and docs for issues (#853)
* Docs for model and strategy

* add some docs about workflow and online

* safe_load yaml

* DDG-DA paper link and comments for code
2022-01-17 13:57:44 +08:00
Pengrong Zhu
2aee9e0145 Add future calendar collector (#795)
* fix Windows mount

* add future_calendar_collector

* update docs

Co-authored-by: Young <afe.young@gmail.com>
Co-authored-by: you-n-g <you-n-g@users.noreply.github.com>
2022-01-16 10:14:27 +08:00
you-n-g
a62e2ec4de Update __init__.py 2022-01-15 23:07:31 +08:00
Young
e7954bdb32 update version 2022-01-15 22:49:14 +08:00
you-n-g
d6f69aefea Update data.rst 2022-01-15 19:22:31 +08:00
you-n-g
1bebe9780e Fix the read the docs error (#852) 2022-01-15 19:15:06 +08:00
you-n-g
7a4a92bc69 Update data.rst 2022-01-14 13:17:52 +08:00
you-n-g
271782c9dd Update data.rst 2022-01-14 09:19:12 +08:00
you-n-g
d0113ea7df pylint code refine & Fix nested example (#848)
* refine code by CI

* fix argument error

* fix nested eample
2022-01-14 09:09:21 +08:00
you-n-g
c3996955ef Update README.md 2022-01-13 15:29:43 +08:00
Jiabao Qu
8261965015 fix: highfreq_gdbt_model of prepare data (#846)
Co-authored-by: Jiabao Qu <qujiabao@logiocean.com>
2022-01-12 21:36:23 +08:00
Jiabao Qu
6f71f8a46b chore: remove hard code input dimension of model pytorch_tcts (#843)
Co-authored-by: Jiabao Qu <qujiabao@logiocean.com>
2022-01-12 19:12:20 +08:00
Chia-hung Tai
edd8badeaf [840] - Test case for operators. (#841)
* [840] - Test case for operators.

* Move import to the head of file and add test_setting.
2022-01-11 18:44:15 +08:00
Young
19689024d4 Fix exp uri CI bug 2022-01-10 17:29:27 +08:00
you-n-g
0304df0d5b Update README.md 2022-01-10 16:56:18 +08:00
Young
181ee3c070 FIX File Name 2022-01-10 16:55:20 +08:00
you-n-g
cf35562e84 DDG-DA paper code (#743)
* Merge data selection to main

* Update trainer for reweighter

* Typos fixed.

* update data selection interface

* successfully run exp after refactor some interface

* data selection share handler &  trainer

* fix meta model time series bug

* fix online workflow set_uri bug

* fix set_uri bug

* updawte ds docs and delay trainer bug

* docs

* resume reweighter

* add reweighting result

* fix qlib model import

* make recorder more friendly

* fix experiment workflow bug

* commit for merging master incase of conflictions

* Successful run DDG-DA with a single command

* remove unused code

* asdd more docs

* Update README.md

* Update & fix some bugs.

* Update configuration & remove debug functions

* Update README.md

* Modfify horizon from code rather than yaml

* Update performance in README.md

* fix part comments

* Remove unfinished TCTS.

* Fix some details.

* Update meta docs

* Update README.md of the benchmarks_dynamic

* Update README.md files

* Add README.md to the rolling_benchmark baseline.

* Refine the docs and link

* Rename README.md in benchmarks_dynamic.

* Remove comments.

* auto download data

Co-authored-by: wendili-cs <wendili.academic@qq.com>
Co-authored-by: demon143 <785696300@qq.com>
2022-01-10 16:52:37 +08:00
Chia-hung Tai
184ce34a34 [807] Move the REG_CONSTANT/EPS to constant.py. (#811)
* [807] Move the REG_CONSTANT to constant.py.

* import REG_US.

* Move EPS to constant.py.
2022-01-09 21:39:46 +08:00
Chia-hung Tai
382ababc01 Add description of the pu template. (#812) 2022-01-09 21:14:11 +08:00
Chia-hung Tai
bcf18c14de Fix typos and comments. (#815)
* Fix typos and comments.

* Add comma before and.
2022-01-09 21:13:25 +08:00
Chia-hung Tai
6c1332f604 Fix some warnings in log.py. (#805)
* Fix some warnings in log.py.

* Fix typo and using black format.

* Fix black.

* Rename dict_ to attrs
2022-01-06 15:36:00 +08:00
you-n-g
93088485c3 Update README.md (#802)
* Update README.md

* Update README.md

* Update README.md

* Update README.md
2022-01-04 19:16:04 +08:00
Chia-hung Tai
c633d3fec0 Fix BaseStrategy path. (#801)
qlib.strategy.base.BaseStrategy is the current path.
2022-01-04 18:55:40 +08:00
you-n-g
0b6d99bd38 Add a more understandable example of data workflow (#797)
* Update data.rst

* Update data.rst
2022-01-04 09:07:44 +08:00
you-n-g
03cce8c908 Some Optimization of online code (#784)
* Some Optimization of online code

* more flexible updater and load_object & fix p*_uri

* make recorder more friendly

* remove unused import
2022-01-03 15:52:03 +08:00
安阁锐
e76b409d9a Fix $volume normalization issue (#792)
* Fix $volume normalization issue

Fix: https://github.com/microsoft/qlib/issues/765

* black formatting

black formatting

* black formatting

black formatting

* black formatting

black formatting
2022-01-01 23:44:17 +08:00
Arthur Cui
3e79a088ef Add Crypto dataset from coingecko (#733)
* add crypto symbols collectors

* add crypto data collector

* add crypto symbols collectors

* add crypto data collector

* solver region and source problem

* fix merge

* fix merge

* clean all cn information

Co-authored-by: DefangCui <170007807@pku.edu.cn>
2021-12-31 22:24:26 +08:00
SunsetWolf
dfc0ed3c01 fix_typo (#790)
Signed-off-by: unknown <lv.linlang@qq.com>
2021-12-31 22:14:47 +08:00
you-n-g
f59cfe51e0 Fix account shared bug (#791)
* Fix account shared bug

* fix bug in nested executor
2021-12-31 15:56:21 +08:00
Pengrong Zhu
1ecdfd45fe fix dump_bin:DumpDataUpdate (#783) 2021-12-29 09:29:08 +08:00
Chao Ning
622303b83a add map_location to torch.load to make it work when cuda is unavailable (#782) 2021-12-29 00:02:04 +08:00
Chao Ning
6bafd0a09b Reformat example data names: use {region}_data for 1-day data, and {region}_data_1min for 1-min data (#781)
* Fix high-freq data name from `yahoo_cn_1min` to `cn_data_1min`

* re-format example data names using `qlib_{region}_{feq}`, e.g. qlib_cn_1d

* re-format example data names using `{region}_{feq}`, e.g. us_1d and cn_1min

* keep using `{region}_data` for 1day data, and change 1min data to `{region}_data_1min`
2021-12-28 23:58:49 +08:00
you-n-g
aed9c09091 Update news 2021-12-28 19:54:30 +08:00
Dong Zhou
1b8f0b4575 support optimization based strategy (#754)
* support optimization based strategy

* fix riskdata not found & update doc

* refactor signal_strategy

* add portfolio example

* Update examples/portfolio/prepare_riskdata.py

Co-authored-by: you-n-g <you-n-g@users.noreply.github.com>

* fix typo

Co-authored-by: you-n-g <you-n-g@users.noreply.github.com>

* fix typo

Co-authored-by: you-n-g <you-n-g@users.noreply.github.com>

* update doc

* fix riskmodel doc

Co-authored-by: you-n-g <you-n-g@users.noreply.github.com>

Co-authored-by: you-n-g <you-n-g@users.noreply.github.com>
2021-12-28 18:44:20 +08:00
185 changed files with 5152 additions and 1155 deletions

View File

@@ -8,6 +8,7 @@
<!--- Why is this change required? What problem does it solve? -->
## How Has This Been Tested?
<! --- Put an `x` in all the boxes that apply: --->
- [ ] Pass the test by running: `pytest qlib/tests/test_all_pipeline.py` under upper directory of `qlib`.
- [ ] If you are adding a new feature, test on your own test scripts.

View File

@@ -12,7 +12,8 @@ jobs:
runs-on: ${{ matrix.os }}
strategy:
matrix:
-os: [windows-latest, macos-latest, macos-11]
+os: [windows-latest, macos-11]
+# FIXME: macos-latest will raise error now.
# not supporting 3.6 due to annotations is not supported https://stackoverflow.com/a/52890129
python-version: [3.7, 3.8]

View File

@@ -60,7 +60,7 @@ jobs:
python -m pip install --upgrade cython
python -m pip install numpy jupyter jupyter_contrib_nbextensions
python -m pip install -U scipy scikit-learn # installing without this line will cause errors on GitHub Actions, while instsalling locally won't
-python setup.py install
+pip install -e .
- name: Install test dependencies
run: |
python -m pip install --upgrade pip

View File

@@ -17,5 +17,5 @@ python:
version: 3.7
install:
- requirements: docs/requirements.txt
-- method: setuptools
-path: .
+- method: pip
+path: .

View File

@@ -30,7 +30,7 @@ Version 0.2.1
--------------------
- Support registering user-defined ``Provider``.
- Support use operators in string format, e.g. ``['Ref($close, 1)']`` is valid field format.
-- Support dynamic fields in ``$some_field`` format. And exising fields like ``Close()`` may be deprecated in the future.
+- Support dynamic fields in ``$some_field`` format. And existing fields like ``Close()`` may be deprecated in the future.
Version 0.2.2
--------------------
@@ -78,7 +78,7 @@ Version 0.3.5
- Support multi-label training, you can provide multiple label in ``handler``. (But LightGBM doesn't support due to the algorithm itself)
- Refactor ``handler`` code, dataset.py is no longer used, and you can deploy your own labels and features in ``feature_label_config``
- Handler only offer DataFrame. Also, ``trainer`` and model.py only receive DataFrame
-- Change ``split_rolling_data``, we roll the data on market calender now, not on normal date
+- Change ``split_rolling_data``, we roll the data on market calendar now, not on normal date
- Move some date config from ``handler`` to ``trainer``
Version 0.4.0
@@ -167,11 +167,11 @@ Version 0.8.0
- There are lots of changes for daily trading, it is hard to list all of them. But a few important changes could be noticed
- The trading limitation is more accurate;
- In `previous version <https://github.com/microsoft/qlib/blob/v0.7.2/qlib/contrib/backtest/exchange.py#L160>`_, longing and shorting actions share the same action.
-- In `current verison <https://github.com/microsoft/qlib/blob/7c31012b507a3823117bddcc693fc64899460b2a/qlib/backtest/exchange.py#L304>`_, the trading limitation is different between loging and shorting action.
+- In `current version <https://github.com/microsoft/qlib/blob/7c31012b507a3823117bddcc693fc64899460b2a/qlib/backtest/exchange.py#L304>`_, the trading limitation is different between logging and shorting action.
- The constant is different when calculating annualized metrics.
- `Current version <https://github.com/microsoft/qlib/blob/7c31012b507a3823117bddcc693fc64899460b2a/qlib/contrib/evaluate.py#L42>`_ uses more accurate constant than `previous version <https://github.com/microsoft/qlib/blob/v0.7.2/qlib/contrib/evaluate.py#L22>`_
- `A new version <https://github.com/microsoft/qlib/blob/7c31012b507a3823117bddcc693fc64899460b2a/qlib/tests/data.py#L17>`_ of data is released. Due to the unstability of Yahoo data source, the data may be different after downloading data again.
-- Users could chec kout the backtesting results between `Current version <https://github.com/microsoft/qlib/tree/7c31012b507a3823117bddcc693fc64899460b2a/examples/benchmarks>`_ and `previous version <https://github.com/microsoft/qlib/tree/v0.7.2/examples/benchmarks>`_
+- Users could check out the backtesting results between `Current version <https://github.com/microsoft/qlib/tree/7c31012b507a3823117bddcc693fc64899460b2a/examples/benchmarks>`_ and `previous version <https://github.com/microsoft/qlib/tree/v0.7.2/examples/benchmarks>`_
Other Versions

View File

@@ -11,21 +11,24 @@
Recent released features
| Feature | Status |
| -- | ------ |
-| Release Qlib v0.8.0 | [Released](https://github.com/microsoft/qlib/releases/tag/v0.8.0) on Dec 8, 2021 |
-| ADD model | [Released](https://github.com/microsoft/qlib/pull/704) on Nov 22, 2021 |
-| ADARNN model | [Released](https://github.com/microsoft/qlib/pull/689) on Nov 14, 2021 |
-| TCN model | [Released](https://github.com/microsoft/qlib/pull/668) on Nov 4, 2021 |
-| Nested Decision Framework | [Released](https://github.com/microsoft/qlib/pull/438) on Oct 1, 2021. [Example](https://github.com/microsoft/qlib/blob/main/examples/nested_decision_execution/workflow.py) and [Doc](https://qlib.readthedocs.io/en/latest/component/highfreq.html) |
-|Temporal Routing Adaptor (TRA) | [Released](https://github.com/microsoft/qlib/pull/531) on July 30, 2021 |
-| Transformer & Localformer | [Released](https://github.com/microsoft/qlib/pull/508) on July 22, 2021 |
-| Release Qlib v0.7.0 | [Released](https://github.com/microsoft/qlib/releases/tag/v0.7.0) on July 12, 2021 |
-| TCTS Model | [Released](https://github.com/microsoft/qlib/pull/491) on July 1, 2021 |
-| Online serving and automatic model rolling | :star: [Released](https://github.com/microsoft/qlib/pull/290) on May 17, 2021 |
-| DoubleEnsemble Model | [Released](https://github.com/microsoft/qlib/pull/286) on Mar 2, 2021 |
-| High-frequency data processing example | [Released](https://github.com/microsoft/qlib/pull/257) on Feb 5, 2021 |
-| High-frequency trading example | [Part of code released](https://github.com/microsoft/qlib/pull/227) on Jan 28, 2021 |
-| High-frequency data(1min) | [Released](https://github.com/microsoft/qlib/pull/221) on Jan 27, 2021 |
-| Tabnet Model | [Released](https://github.com/microsoft/qlib/pull/205) on Jan 22, 2021 |
+| Arctic Provider Backend & Orderbook data example | :hammer: [Rleased](https://github.com/microsoft/qlib/pull/744) on Jan 17, 2022 |
+| Meta-Learning-based framework & DDG-DA | :chart_with_upwards_trend: :hammer: [Released](https://github.com/microsoft/qlib/pull/743) on Jan 10, 2022 |
+| Planning-based portfolio optimization | :hammer: [Released](https://github.com/microsoft/qlib/pull/754) on Dec 28, 2021 |
+| Release Qlib v0.8.0 | :octocat: [Released](https://github.com/microsoft/qlib/releases/tag/v0.8.0) on Dec 8, 2021 |
+| ADD model | :chart_with_upwards_trend: [Released](https://github.com/microsoft/qlib/pull/704) on Nov 22, 2021 |
+| ADARNN model | :chart_with_upwards_trend: [Released](https://github.com/microsoft/qlib/pull/689) on Nov 14, 2021 |
+| TCN model | :chart_with_upwards_trend: [Released](https://github.com/microsoft/qlib/pull/668) on Nov 4, 2021 |
+| Nested Decision Framework | :hammer: [Released](https://github.com/microsoft/qlib/pull/438) on Oct 1, 2021. [Example](https://github.com/microsoft/qlib/blob/main/examples/nested_decision_execution/workflow.py) and [Doc](https://qlib.readthedocs.io/en/latest/component/highfreq.html) |
+| Temporal Routing Adaptor (TRA) | :chart_with_upwards_trend: [Released](https://github.com/microsoft/qlib/pull/531) on July 30, 2021 |
+| Transformer & Localformer | :chart_with_upwards_trend: [Released](https://github.com/microsoft/qlib/pull/508) on July 22, 2021 |
+| Release Qlib v0.7.0 | :octocat: [Released](https://github.com/microsoft/qlib/releases/tag/v0.7.0) on July 12, 2021 |
+| TCTS Model | :chart_with_upwards_trend: [Released](https://github.com/microsoft/qlib/pull/491) on July 1, 2021 |
+| Online serving and automatic model rolling | :hammer: [Released](https://github.com/microsoft/qlib/pull/290) on May 17, 2021 |
+| DoubleEnsemble Model | :chart_with_upwards_trend: [Released](https://github.com/microsoft/qlib/pull/286) on Mar 2, 2021 |
+| High-frequency data processing example | :hammer: [Released](https://github.com/microsoft/qlib/pull/257) on Feb 5, 2021 |
+| High-frequency trading example | :chart_with_upwards_trend: [Part of code released](https://github.com/microsoft/qlib/pull/227) on Jan 28, 2021 |
+| High-frequency data(1min) | :rice: [Released](https://github.com/microsoft/qlib/pull/221) on Jan 27, 2021 |
+| Tabnet Model | :chart_with_upwards_trend: [Released](https://github.com/microsoft/qlib/pull/205) on Jan 22, 2021 |
Features released before 2021 are not listed here.
@@ -49,9 +52,12 @@ For more details, please refer to our paper ["Qlib: An AI-oriented Quantitative
- [Data Preparation](#data-preparation)
- [Auto Quant Research Workflow](#auto-quant-research-workflow)
- [Building Customized Quant Research Workflow by Code](#building-customized-quant-research-workflow-by-code)
-- [**Quant Model(Paper) Zoo**](#quant-model-paper-zoo)
-- [Run a single model](#run-a-single-model)
-- [Run multiple models](#run-multiple-models)
+- [Main Challenges & Solutions in Quant Research](#main-challenges--solutions-in-quant-research)
+- [Forecasting: Finding Valuable Signals/Patterns](#forecasting-finding-valuable-signalspatterns)
+- [**Quant Model (Paper) Zoo**](#quant-model-paper-zoo)
+- [Run a Single Model](#run-a-single-model)
+- [Run Multiple Models](#run-multiple-models)
+- [Adapting to Market Dynamics](#adapting-to-market-dynamics)
- [**Quant Dataset Zoo**](#quant-dataset-zoo)
- [More About Qlib](#more-about-qlib)
- [Offline Mode and Online Mode](#offline-mode-and-online-mode)
@@ -66,10 +72,7 @@ New features under development(order by estimated release time).
Your feedbacks about the features are very important.
| Feature | Status |
| -- | ------ |
-| Planning-based portfolio optimization | Under review: https://github.com/microsoft/qlib/pull/280 |
-| Fund data supporting and analysis | Under review: https://github.com/microsoft/qlib/pull/292 |
| Point-in-Time database | Under review: https://github.com/microsoft/qlib/pull/343 |
-| Meta-Learning-based data selection | Initial opensource version under development |
# Framework of Qlib
@@ -112,6 +115,7 @@ This table demonstrates the supported Python version of `Qlib`:
1. **Conda** is suggested for managing your Python environment.
1. Please pay attention that installing cython in Python 3.6 will raise some error when installing ``Qlib`` from source. If users use Python 3.6 on their machines, it is recommended to *upgrade* Python to version 3.7 or use `conda`'s Python to install ``Qlib`` from source.
1. For Python 3.9, `Qlib` supports running workflows such as training models, doing backtest and plot most of the related figures (those included in [notebook](examples/workflow_by_code.ipynb)). However, plotting for the *model performance* is not supported for now and we will fix this when the dependent packages are upgraded in the future.
+1. `Qlib`Requires `tables` package, `hdf5` in tables does not support python3.9.
### Install with pip
Users can easily install ``Qlib`` by pip according to the following command.
@@ -133,17 +137,11 @@ Also, users can install the latest dev version ``Qlib`` by the source code accor
```
* Clone the repository and install ``Qlib`` as follows.
-* If you haven't installed qlib by the command ``pip install pyqlib`` before:
```bash
git clone https://github.com/microsoft/qlib.git && cd qlib
-python setup.py install
-```
-* If you have already installed the stable version by the command ``pip install pyqlib``:
-```bash
-git clone https://github.com/microsoft/qlib.git && cd qlib
pip install .
```
-**Note**: **Only** the command ``pip install .`` **can** overwrite the stable version installed by ``pip install pyqlib``, while the command ``python setup.py install`` **can't**.
+**Note**: You can install Qlib with `python setup.py install` as well. But it is not the recommanded approach. It will skip `pip` and cause obscure problems. For example, **only** the command ``pip install .`` **can** overwrite the stable version installed by ``pip install pyqlib``, while the command ``python setup.py install`` **can't**.
**Tips**: If you fail to install `Qlib` or run the examples in your environment, comparing your steps and the [CI workflow](.github/workflows/test.yml) may help you find the problem.
@@ -195,7 +193,7 @@ We recommend users to prepare their own data if they have a high-quality dataset
```python
import qlib
from qlib.data import D
-from qlib.config import REG_CN
+from qlib.constant import REG_CN
# Initialization
mount_path = "~/.qlib/qlib_data/cn_data" # target_dir
@@ -280,8 +278,18 @@ Qlib provides a tool named `qrun` to run the whole workflow automatically (inclu
## Building Customized Quant Research Workflow by Code
The automatic workflow may not suit the research workflow of all Quant researchers. To support a flexible Quant research workflow, Qlib also provides a modularized interface to allow researchers to build their own workflow by code. [Here](examples/workflow_by_code.ipynb) is a demo for customized Quant research workflow by code.
+# Main Challenges & Solutions in Quant Research
+Quant investment is an very unique scenario with lots of key challenges to be solved.
+Currently, Qlib provides some solutions for several of them.
-# [Quant Model (Paper) Zoo](examples/benchmarks)
+## Forecasting: Finding Valuable Signals/Patterns
+Accurate forecasting of the stock price trend is a very important part to construct profitable portfolios.
+However, huge amount of data with various formats in the financial market which make it challenging to build forecasting models.
+An increasing number of SOTA Quant research works/papers, which focus on building forecasting models to mine valuable signals/patterns in complex financial data, are released in `Qlib`
+### [Quant Model (Paper) Zoo](examples/benchmarks)
Here is a list of models built on `Qlib`.
- [GBDT based on XGBoost (Tianqi Chen, et al. KDD 2016)](examples/benchmarks/XGBoost/)
@@ -308,7 +316,7 @@ Your PR of new Quant models is highly welcomed.
The performance of each model on the `Alpha158` and `Alpha360` dataset can be found [here](examples/benchmarks/README.md).
## Run a single model
### Run a single model
All the models listed above are runnable with ``Qlib``. Users can find the config files we provide and some details about the model through the [benchmarks](examples/benchmarks) folder. More information can be retrieved at the model files listed above.
`Qlib` provides three different ways to run a single model, users can pick the one that fits their cases best:
@@ -318,7 +326,7 @@ All the models listed above are runnable with ``Qlib``. Users can find the confi
- Users can use the script [`run_all_model.py`](examples/run_all_model.py) listed in the `examples` folder to run a model. Here is an example of the specific shell command to be used: `python run_all_model.py run --models=lightgbm`, where the `--models` arguments can take any number of models listed above(the available models can be found in [benchmarks](examples/benchmarks/)). For more use cases, please refer to the file's [docstrings](examples/run_all_model.py).
- **NOTE**: Each baseline has different environment dependencies, please make sure that your python version aligns with the requirements(e.g. TFT only supports Python 3.6~3.7 due to the limitation of `tensorflow==1.15.0`)
## Run multiple models
### Run multiple models
`Qlib` also provides a script [`run_all_model.py`](examples/run_all_model.py) which can run multiple models for several iterations. (**Note**: the script only supports *Linux* for now. Other OSes will be supported in the future. It also doesn't support running the same model multiple times in parallel; this will be fixed in future development.)
The script will create a unique virtual environment for each model, and delete the environments after training. Thus, only experiment results such as `IC` and `backtest` results will be generated and stored.
@@ -330,6 +338,14 @@ python run_all_model.py run 10
It also provides the API to run specific models at once. For more use cases, please refer to the file's [docstrings](examples/run_all_model.py).
## [Adapting to Market Dynamics](examples/benchmarks_dynamic)
Due to the non-stationary nature of the environment of the financial market, the data distribution may change in different periods, which makes the performance of models built on training data decay on future test data.
So adapting the forecasting models/strategies to market dynamics is very important to their performance.
Here is a list of solutions built on `Qlib`.
- [Rolling Retraining](examples/benchmarks_dynamic/baseline/)
- [DDG-DA on pytorch (Wendi, et al. AAAI 2022)](examples/benchmarks_dynamic/DDG-DA/)
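
As a rough illustration of the rolling retraining idea, here is a minimal sketch that only generates the sliding train/test windows. It is not Qlib's implementation: the function name `rolling_windows` and its parameters are hypothetical, and it counts calendar days instead of trading days for simplicity.

```python
from datetime import date, timedelta


def rolling_windows(start, end, train_days, test_days):
    """Yield (train_start, train_end, test_start, test_end) tuples.

    Hypothetical helper for illustration only. Each test window directly
    follows its training window, and both are shifted forward by
    `test_days` at every step, so the model is always retrained on the
    most recent data before it is evaluated.
    """
    train_start = start
    while True:
        train_end = train_start + timedelta(days=train_days - 1)
        test_start = train_end + timedelta(days=1)
        test_end = test_start + timedelta(days=test_days - 1)
        if test_end > end:
            # Stop once the test window would run past the available data.
            break
        yield train_start, train_end, test_start, test_end
        train_start += timedelta(days=test_days)


windows = list(
    rolling_windows(date(2020, 1, 1), date(2020, 12, 31), train_days=180, test_days=30)
)
for train_start, train_end, test_start, test_end in windows:
    print(f"train {train_start}..{train_end} -> test {test_start}..{test_end}")
```

In a real workflow, a model would be refit on each training window and evaluated on the test window that follows it; the solutions listed above handle that part.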
# Quant Dataset Zoo
Dataset plays a very important role in Quant. Here is a list of the datasets built on `Qlib`:
@@ -418,6 +434,16 @@ For example, if you want to contribute to Qlib's document/code, you can follow t
<img src="https://github.com/demon143/qlib/blob/main/docs/_static/img/change%20doc.gif" />
</p>
If you don't know how to start to contribute, you can refer to the following examples.
| Type | Examples |
| -- | -- |
| Solving issues | [Answer a question](https://github.com/microsoft/qlib/issues/749); [issuing](https://github.com/microsoft/qlib/issues/765) or [fixing](https://github.com/microsoft/qlib/pull/792) a bug |
| Docs | [Improve docs quality](https://github.com/microsoft/qlib/pull/797/files) ; [Fix a typo](https://github.com/microsoft/qlib/pull/774) |
| Feature | Implement a [requested feature](https://github.com/microsoft/qlib/projects) like [this](https://github.com/microsoft/qlib/pull/754); [Refactor interfaces](https://github.com/microsoft/qlib/pull/539/files) |
| Dataset | [Add a dataset](https://github.com/microsoft/qlib/pull/733) |
| Models | [Implement a new model](https://github.com/microsoft/qlib/pull/689) |
If you would like to become one of Qlib's maintainers to contribute more (e.g. help merge PR, triage issues), please contact us by email([qlib@microsoft.com](mailto:qlib@microsoft.com)). We are glad to help you to set the right permission.
## Licence
Most contributions require you to agree to a

View File

@@ -1 +0,0 @@
0.8.0.99

View File

@@ -21,6 +21,12 @@ The introduction of ``Data Layer`` includes the following parts.
- Cache
- Data and Cache File Structure
Here is a typical example of Qlib data workflow
- Users download data and convert it into Qlib format (with filename suffix `.bin`). In this step, typically only some basic data are stored on disk (such as OHLCV).
- Users create some basic features based on Qlib's expression engine (e.g. "Ref($close, 60) / $close", the ratio of the close price 60 trading days ago to the current close price, which reflects the return over the last 60 trading days). Supported operators in the expression engine can be found `here <https://github.com/microsoft/qlib/blob/main/qlib/data/ops.py>`_. This step is typically implemented in Qlib's `Data Loader <https://qlib.readthedocs.io/en/latest/component/data.html#data-loader>`_ which is a component of `Data Handler <https://qlib.readthedocs.io/en/latest/component/data.html#data-handler>`_ .
- If users require more complicated data processing (e.g. data normalization), `Data Handler <https://qlib.readthedocs.io/en/latest/component/data.html#data-handler>`_ supports user-customized processors to process data (some predefined processors can be found `here <https://github.com/microsoft/qlib/blob/main/qlib/data/dataset/processor.py>`_). Processors are different from operators in the expression engine: they are designed for complicated data processing methods that are hard to support with operators in the expression engine.
- At last, `Dataset <https://qlib.readthedocs.io/en/latest/component/data.html#dataset>`_ is responsible for preparing model-specific datasets from the processed data of Data Handler.
Data Preparation
============================
@@ -46,6 +52,7 @@ Also, ``Qlib`` provides a high-frequency dataset. Users can run a high-frequency
Qlib Format Dataset
--------------------
``Qlib`` has provided an off-the-shelf dataset in `.bin` format, users could use the script ``scripts/get_data.py`` to download the China-Stock dataset as follows.
The price volume data look different from the actual dealing price because they are **adjusted** (`adjusted price <https://www.investopedia.com/terms/a/adjusted_closing_price.asp>`_). You may also find that the adjusted prices differ across data sources. This is because data sources may vary in the way they adjust prices. Qlib normalizes the price on the first trading day of each stock to 1 when adjusting them.
.. code-block:: bash
@@ -213,7 +220,7 @@ The `trade unit` defines the unit number of stocks can be used in a trade, and t
.. code-block:: python
from qlib.config import REG_CN
from qlib.constant import REG_CN
qlib.init(provider_uri='~/.qlib/qlib_data/cn_data', region=REG_CN)

View File

@@ -14,7 +14,7 @@ To get the join trading performance of daily and intraday trading, they must int
In order to support the joint backtest strategies in multiple levels, a corresponding framework is required. None of the publicly available high-frequency trading frameworks considers multi-level joint trading, which make the backtesting aforementioned inaccurate.
Besides backtesting, the optimization of strategies from different levels is not standalone and can be affected by each other.
For example, the best portfolio management strategy may change with the performance of order executions(e.g. a portfolio with higher turnover may becomes a better choice when we imporve the order execution strategies).
For example, the best portfolio management strategy may change with the performance of order executions (e.g. a portfolio with higher turnover may become a better choice when we improve the order execution strategies).
To achieve the overall good performance , it is necessary to consider the interaction of strategies in different level.
Therefore, building a new framework for trading in multiple levels becomes necessary to solve the various problems mentioned above, for which we designed a nested decision execution framework that consider the interaction of strategies.

68
docs/component/meta.rst Normal file
View File

@@ -0,0 +1,68 @@
.. _meta:
======================================================
Meta Controller: Meta-Task & Meta-Dataset & Meta-Model
======================================================
.. currentmodule:: qlib
Introduction
=============
``Meta Controller`` provides guidance to ``Forecast Model``, which aims to learn regular patterns among a series of forecasting tasks and use learned patterns to guide forthcoming forecasting tasks. Users can implement their own meta-model instance based on ``Meta Controller`` module.
Meta Task
=============
A `Meta Task` instance is the basic element in the meta-learning framework. It saves the data that can be used for the `Meta Model`. Multiple `Meta Task` instances may share the same `Data Handler`, controlled by `Meta Dataset`. Users should use `prepare_task_data()` to obtain the data that can be directly fed into the `Meta Model`.
.. autoclass:: qlib.model.meta.task.MetaTask
:members:
Meta Dataset
=============
`Meta Dataset` controls the meta-information generating process. It is responsible for providing data for training the `Meta Model`. Users should use `prepare_tasks` to retrieve a list of `Meta Task` instances.
.. autoclass:: qlib.model.meta.dataset.MetaTaskDataset
:members:
Meta Model
=============
General Meta Model
------------------
A `Meta Model` instance is the part that controls the workflow. The usage of the `Meta Model` includes:
1. Users train their `Meta Model` with the `fit` function.
2. The `Meta Model` instance guides the workflow by giving useful information via the `inference` function.
.. autoclass:: qlib.model.meta.model.MetaModel
:members:
Meta Task Model
------------------
This type of meta-model may interact with task definitions directly, and `Meta Task Model` is the class such models inherit from. They guide the base tasks by modifying the base task definitions. The function `prepare_tasks` can be used to obtain the modified base task definitions.
.. autoclass:: qlib.model.meta.model.MetaTaskModel
:members:
Meta Guide Model
------------------
This type of meta-model participates in the training process of the base forecasting model. The meta-model may guide the base forecasting models during their training to improve their performances.
.. autoclass:: qlib.model.meta.model.MetaGuideModel
:members:
Example
=============
``Qlib`` provides an implementation of ``Meta Model`` module, ``DDG-DA``,
which adapts to the market dynamics.
``DDG-DA`` includes four steps:
1. Calculate meta-information and encapsulate it into ``Meta Task`` instances. All the meta-tasks form a ``Meta Dataset`` instance.
2. Train ``DDG-DA`` based on the training data of the meta-dataset.
3. Do the inference of the ``DDG-DA`` to get guide information.
4. Apply guide information to the forecasting models to improve their performances.
The `above example <https://github.com/microsoft/qlib/tree/main/examples/benchmarks_dynamic/DDG-DA>`_ can be found in ``examples/benchmarks_dynamic/DDG-DA/workflow.py``.

View File

@@ -106,6 +106,9 @@ Example
`SignalRecord` is the `Record Template` in ``Qlib``, please refer to `Workflow <recorder.html#record-template>`_.
Also, the above example has been given in ``examples/train_backtest_analyze.ipynb``.
Technically, the meaning of the model prediction depends on the label setting designed by the user.
By default, the score is normally the forecasting model's rating of the instruments. The higher the score, the more profitable the instrument is expected to be.
Custom Model
===================

View File

@@ -23,6 +23,10 @@ The `examples <https://github.com/microsoft/qlib/tree/main/examples/online_srv>`
**NOTE**: Users should keep their data source updated to support online serving. For example, Qlib provides `a batch of scripts <https://github.com/microsoft/qlib/blob/main/scripts/data_collector/yahoo/README.md#automatic-update-of-daily-frequency-datafrom-yahoo-finance>`_ to help users update Yahoo daily data.
Known limitations currently
- Currently, the daily updating prediction for the next trading day is supported. But generating orders for the next trading day is not supported due to the `limitations of public data <https://github.com/microsoft/qlib/issues/215#issuecomment-766293563>`_.
Online Manager
=============

View File

@@ -37,7 +37,7 @@ Here is a general view of the structure of the system:
This experiment management system defines a set of interfaces and provides a concrete implementation ``MLflowExpManager``, which is based on the machine learning platform: ``MLFlow`` (`link <https://mlflow.org/>`_).
If users set the implementation of ``ExpManager`` to be ``MLflowExpManager``, they can use the command `mlflow ui` to visualize and check the experiment results. For more information, pleaes refer to the related documents `here <https://www.mlflow.org/docs/latest/cli.html#mlflow-ui>`_.
If users set the implementation of ``ExpManager`` to be ``MLflowExpManager``, they can use the command `mlflow ui` to visualize and check the experiment results. For more information, please refer to the related documents `here <https://www.mlflow.org/docs/latest/cli.html#mlflow-ui>`_.
Qlib Recorder
===================

View File

@@ -8,7 +8,7 @@ Portfolio Strategy: Portfolio Management
Introduction
===================
``Portfolio Strategy`` is designed to adopt different portfolio strategies, which means that users can adopt different algorithms to generate investment portfolios based on the prediction scores of the ``Forecast Model``. Users can use the ``Portfolio Strategy`` in an automatic workflow by ``Workflow`` module, please refer to `Workflow: Workflow Management <workflow.html>`_.
``Portfolio Strategy`` is designed to adopt different portfolio strategies, which means that users can adopt different algorithms to generate investment portfolios based on the prediction scores of the ``Forecast Model``. Users can use the ``Portfolio Strategy`` in an automatic workflow by ``Workflow`` module, please refer to `Workflow: Workflow Management <workflow.html>`_.
Because the components in ``Qlib`` are designed in a loosely-coupled way, ``Portfolio Strategy`` can be used as an independent module also.
@@ -22,20 +22,22 @@ Base Class & Interface
BaseStrategy
------------------
Qlib provides a base class ``qlib.contrib.strategy.BaseStrategy``. All strategy classes need to inherit the base class and implement its interface.
Qlib provides a base class ``qlib.strategy.base.BaseStrategy``. All strategy classes need to inherit the base class and implement its interface.
- `get_risk_degree`
Return the proportion of your total value you will use in investment. A dynamic risk_degree will result in market timing.
- `generate_order_list`
Return the order list.
Return the order list.
The frequency of calling this method depends on the executor frequency ("time_per_step"="day" by default). But the trading frequency can be decided by the user's implementation.
For example, if the user wants to trade weekly while the `time_per_step` is "day" in the executor, the user can return a non-empty TradeDecision weekly (otherwise return an empty one like `this <https://github.com/microsoft/qlib/blob/main/qlib/contrib/strategy/signal_strategy.py#L132>`_ ).
Users can inherit `BaseStrategy` to customize their strategy class.
WeightStrategyBase
--------------------
Qlib also provides a class ``qlib.contrib.strategy.WeightStrategyBase`` that is a subclass of `BaseStrategy`.
Qlib also provides a class ``qlib.contrib.strategy.WeightStrategyBase`` that is a subclass of `BaseStrategy`.
`WeightStrategyBase` only focuses on the target positions, and automatically generates an order list based on positions. It provides the `generate_target_weight_position` interface.
@@ -71,17 +73,27 @@ TopkDropoutStrategy
- `Topk`: The number of stocks held
- `Drop`: The number of stocks sold on each trading day
Currently, the number of held stocks is `Topk`.
On each trading day, the `Drop` number of held stocks with the worst `prediction score` will be sold, and the same number of unheld stocks with the best `prediction score` will be bought.
.. image:: ../_static/img/topk_drop.png
:alt: Topk-Drop
``TopkDrop`` algorithm sells `Drop` stocks every trading day, which guarantees a fixed turnover rate.
- Generate the order list from the target amount
EnhancedIndexingStrategy
------------------------
`EnhancedIndexingStrategy` implements enhanced indexing, which combines the arts of active management and passive management,
with the aim of outperforming a benchmark index (e.g., S&P 500) in terms of portfolio return while controlling
the risk exposure (a.k.a. tracking error).
For more information, please refer to `qlib.contrib.strategy.signal_strategy.EnhancedIndexingStrategy`
and `qlib.contrib.strategy.optimizer.enhanced_indexing.EnhancedIndexingOptimizer`.
Usage & Example
====================

View File

@@ -124,9 +124,47 @@ Configuration File
===================
Let's get into details of ``qrun`` in this section.
Before using ``qrun``, users need to prepare a configuration file. The following content shows how to prepare each part of the configuration file.
The design logic of the configuration file is very simple. It predefines fixed workflows and provides this yaml interface for users to define how to initialize each component.
It follows the design of `init_instance_by_config <https://github.com/microsoft/qlib/blob/2aee9e0145decc3e71def70909639b5e5a6f4b58/qlib/utils/__init__.py#L264>`_ . It defines the initialization of each component of Qlib, which typically includes the class and the initialization arguments.
For example, the following yaml and code are equivalent.
.. code-block:: YAML

    model:
        class: LGBModel
        module_path: qlib.contrib.model.gbdt
        kwargs:
            loss: mse
            colsample_bytree: 0.8879
            learning_rate: 0.0421
            subsample: 0.8789
            lambda_l1: 205.6999
            lambda_l2: 580.9768
            max_depth: 8
            num_leaves: 210
            num_threads: 20

.. code-block:: python

    from qlib.contrib.model.gbdt import LGBModel

    kwargs = {
        "loss": "mse",
        "colsample_bytree": 0.8879,
        "learning_rate": 0.0421,
        "subsample": 0.8789,
        "lambda_l1": 205.6999,
        "lambda_l2": 580.9768,
        "max_depth": 8,
        "num_leaves": 210,
        "num_threads": 20,
    }
    # Unpack the dict so that each entry is passed as a keyword argument
    LGBModel(**kwargs)
Qlib Init Section
--------------------

View File

@@ -31,7 +31,7 @@ Let's see an example,
First make sure you have the latest version of `qlib` installed.
Then, you need to privide a configuration to setup the experiment.
Then, you need to provide a configuration to setup the experiment.
We write a simple configuration example as following,
.. code-block:: YAML
@@ -217,13 +217,13 @@ The tuner pipeline contains different tuners, and the `tuner` program will proce
Each part represents a tuner, and its modules which are to be tuned. Space in each part is the hyper-parameters' space of a certain module, you need to create your searching space and modify it in `/qlib/contrib/tuner/space.py`. We use `hyperopt` package to help us to construct the space, you can see the detail of how to use it in https://github.com/hyperopt/hyperopt/wiki/FMin .
- model
You need to provide the `class` and the `space` of the model. If the model is user's own implementation, you need to privide the `module_path`.
You need to provide the `class` and the `space` of the model. If the model is user's own implementation, you need to provide the `module_path`.
- trainer
You need to proveide the `class` of the trainer. If the trainer is user's own implementation, you need to privide the `module_path`.
You need to provide the `class` of the trainer. If the trainer is user's own implementation, you need to provide the `module_path`.
- strategy
You need to provide the `class` and the `space` of the strategy. If the strategy is user's own implementation, you need to privide the `module_path`.
You need to provide the `class` and the `space` of the strategy. If the strategy is user's own implementation, you need to provide the `module_path`.
- data_label
The label of the data, you can search which kinds of labels will lead to a better result. This part is optional, and you only need to provide `space`.
@@ -273,7 +273,7 @@ You need to use the same dataset to evaluate your different `estimator` experime
About the data and backtest
~~~~~~~~~~~~~~~~~~~~~~~~~~~
`data` and `backtest` are all same in the whole `tuner` experiment. Different `estimator` experiments must use the same data and backtest method. So, these two parts of config are same with that in `estimator` configuration. You can see the precise defination of these parts in `estimator` introduction. We only provide an example here.
`data` and `backtest` are the same throughout the whole `tuner` experiment. Different `estimator` experiments must use the same data and backtest method. So, these two parts of the config are the same as those in the `estimator` configuration. You can see the precise definition of these parts in the `estimator` introduction. We only provide an example here.
.. code-block:: YAML

View File

@@ -36,10 +36,11 @@ Document Structure
:caption: COMPONENTS:
Workflow: Workflow Management <component/workflow.rst>
Data Layer: Data Framework&Usage <component/data.rst>
Data Layer: Data Framework & Usage <component/data.rst>
Forecast Model: Model Training & Prediction <component/model.rst>
Portfolio Management and Backtest <component/strategy.rst>
Nested Decision Execution: High-Frequency Trading <component/highfreq.rst>
Meta Controller: Meta-Task & Meta-Dataset & Meta-Model <component/meta.rst>
Qlib Recorder: Experiment Management <component/recorder.rst>
Analysis: Evaluation & Results Analysis <component/report.rst>
Online Serving: Online Management & Strategy & Tool <component/online.rst>

View File

@@ -31,7 +31,7 @@ Users can easily intsall ``Qlib`` according to the following steps:
git clone https://github.com/microsoft/qlib.git && cd qlib
python setup.py install
To kown more about `installation`, please refer to `Qlib Installation <../start/installation.html>`_.
To know more about `installation`, please refer to `Qlib Installation <../start/installation.html>`_.
Prepare Data
==============
@@ -44,7 +44,7 @@ Load and prepare data by running the following code:
This dataset is created by public data collected by crawler scripts in ``scripts/data_collector/``, which have been released in the same repository. Users could create the same dataset with it.
To kown more about `prepare data`, please refer to `Data Preparation <../component/data.html#data-preparation>`_.
To know more about `prepare data`, please refer to `Data Preparation <../component/data.html#data-preparation>`_.
Auto Quant Research Workflow
====================================

View File

@@ -3,3 +3,4 @@ cmake
numpy
scipy
scikit-learn
pandas

View File

@@ -27,7 +27,7 @@ Initialize Qlib before calling other APIs: run following code in python.
import qlib
# region in [REG_CN, REG_US]
from qlib.config import REG_CN
from qlib.constant import REG_CN
provider_uri = "~/.qlib/qlib_data/cn_data" # target_dir
qlib.init(provider_uri=provider_uri, region=REG_CN)
@@ -42,10 +42,10 @@ Besides `provider_uri` and `region`, `qlib.init` has other parameters. The follo
- `provider_uri`
Type: str. The URI of the Qlib data. For example, it could be the location where the data loaded by ``get_data.py`` are stored.
- `region`
Type: str, optional parameter(default: `qlib.config.REG_CN`).
Currently: ``qlib.config.REG_US`` ('us') and ``qlib.config.REG_CN`` ('cn') is supported. Different value of `region` will result in different stock market mode.
- ``qlib.config.REG_US``: US stock market.
- ``qlib.config.REG_CN``: China stock market.
Type: str, optional parameter(default: `qlib.constant.REG_CN`).
Currently: ``qlib.constant.REG_US`` ('us') and ``qlib.constant.REG_CN`` ('cn') are supported. Different values of `region` will result in different stock market modes.
- ``qlib.constant.REG_US``: US stock market.
- ``qlib.constant.REG_CN``: China stock market.
Different modes will result in different trading limitations and costs.
The region is just `shortcuts for defining a batch of configurations <https://github.com/microsoft/qlib/blob/main/qlib/config.py#L239>`_. Users can set the key configurations manually if the existing region setting can't meet their requirements.

View File

@@ -1,4 +1,4 @@
pandas==1.1.2
numpy==1.17.4
numpy==1.21.0
scikit_learn==0.23.2
torch==1.7.0

View File

@@ -1,4 +1,4 @@
numpy==1.17.4
numpy==1.21.0
pandas==1.1.2
scikit_learn==0.23.2
torch==1.7.0

View File

@@ -1,4 +1,4 @@
numpy==1.17.4
numpy==1.21.0
pandas==1.1.2
scikit_learn==0.23.2
torch==1.7.0

View File

@@ -1,3 +1,3 @@
pandas==1.1.2
numpy==1.17.4
numpy==1.21.0
catboost==0.24.3

View File

@@ -1,3 +1,3 @@
pandas==1.1.2
numpy==1.17.4
numpy==1.21.0
lightgbm==3.1.0

View File

@@ -1,4 +1,4 @@
pandas==1.1.2
numpy==1.17.4
numpy==1.21.0
scikit_learn==0.23.2
torch==1.7.0

View File

@@ -1,4 +1,4 @@
numpy==1.17.4
numpy==1.21.0
pandas==1.1.2
scikit_learn==0.23.2
torch==1.7.0

View File

@@ -1,4 +1,4 @@
numpy==1.17.4
numpy==1.21.0
pandas==1.1.2
scikit_learn==0.23.2
torch==1.7.0

View File

@@ -1,3 +1,3 @@
pandas==1.1.2
numpy==1.17.4
numpy==1.21.0
lightgbm==3.1.0

View File

@@ -22,7 +22,6 @@ data_handler_config: &data_handler_config
- class: CSRankNorm
kwargs:
fields_group: label
label: ["Ref($close, -2) / Ref($close, -1) - 1"]
port_analysis_config: &port_analysis_config
strategy:
class: TopkDropoutStrategy

View File

@@ -1,3 +1,3 @@
numpy==1.17.4
numpy==1.21.0
pandas==1.1.2
torch==1.2.0

View File

@@ -1,4 +1,4 @@
pandas==1.1.2
numpy==1.17.4
numpy==1.21.0
scikit_learn==0.23.2
torch==1.7.0

View File

@@ -9,7 +9,7 @@ Here are the results of each benchmark model running on Qlib's `Alpha360` and `A
The numbers shown below demonstrate the performance of the entire `workflow` of each model. We will update the `workflow` as well as models in the near future for better results.
<!--
> If you need to reproduce the results below, please use the **v1** dataset: `python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/qlib_cn_1d --region cn --version v1`
> If you need to reproduce the results below, please use the **v1** dataset: `python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/cn_data --region cn --version v1`
>
> In the new version of qlib, the default dataset is **v2**. Since the data is collected from the YahooFinance API (which is not very stable), the results of *v2* and *v1* may differ -->

View File

@@ -1,4 +1,4 @@
pandas==1.1.2
numpy==1.17.4
numpy==1.21.0
scikit_learn==0.23.2
torch==1.7.0

View File

@@ -1,4 +1,4 @@
numpy==1.17.4
numpy==1.21.0
pandas==1.1.2
scikit_learn==0.23.2
torch==1.7.0

View File

@@ -1,4 +1,4 @@
pandas==1.1.2
numpy==1.17.4
numpy==1.21.0
scikit_learn==0.23.2
torch==1.7.0

View File

@@ -32,7 +32,7 @@ import abc
import enum
# Type defintions
# Type definitions
class DataTypes(enum.IntEnum):
"""Defines numerical types of each column."""

View File

@@ -254,9 +254,9 @@ class DistributedHyperparamOptManager(HyperparamOptManager):
param_ranges: Discrete hyperparameter range for random search.
fixed_params: Fixed model parameters per experiment.
root_model_folder: Folder to store optimisation artifacts.
worker_number: Worker index definining which set of hyperparameters to
worker_number: Worker index defining which set of hyperparameters to
test.
search_iterations: Maximum numer of random search iterations.
search_iterations: Maximum number of random search iterations.
num_iterations_per_worker: How many iterations are handled per worker.
clear_serialised_params: Whether to regenerate hyperparameter
combinations.
@@ -330,7 +330,7 @@ class DistributedHyperparamOptManager(HyperparamOptManager):
if os.path.exists(self.serialised_ranges_folder):
df = pd.read_csv(self.serialised_ranges_path, index_col=0)
else:
print("Unable to load - regenerating serach ranges instead")
print("Unable to load - regenerating search ranges instead")
df = self.update_serialised_hyperparam_df()
return df

View File

@@ -342,7 +342,7 @@ class TFTDataCache:
@classmethod
def contains(cls, key):
"""Retuns boolean indicating whether key is present in cache."""
"""Returns boolean indicating whether key is present in cache."""
return key in cls._data_cache
@@ -1120,10 +1120,10 @@ class TemporalFusionTransformer:
Args:
df: Input dataframe
return_targets: Whether to also return outputs aligned with predictions to
faciliate evaluation
facilitate evaluation
Returns:
Input dataframe or tuple of (input dataframe, algined output dataframe).
Input dataframe or tuple of (input dataframe, aligned output dataframe).
"""
data = self._batch_data(df)

View File

@@ -209,7 +209,6 @@ class TFTModel(ModelFT):
fixed_params = self.data_formatter.get_experiment_params()
params = self.data_formatter.get_default_model_params()
# Wendi: 合并调优的参数和非调优的参数
params = {**params, **fixed_params}
if not os.path.exists(self.model_folder):
@@ -295,7 +294,7 @@ class TFTModel(ModelFT):
def to_pickle(self, path: Union[Path, str]):
"""
Tensorflow model can't be dumped directly.
So the data should be save seperatedly
So the data should be saved separately
**TODO**: Please implement the function to load the files

View File

@@ -57,7 +57,7 @@ And here are two ways to run the model:
python example.py --config_file configs/config_alstm.yaml
```
Here we trained TRA on a pretrained backbone model. Therefore we run `*_init.yaml` before TRA's scipts.
Here we trained TRA on a pretrained backbone model. Therefore we run `*_init.yaml` before TRA's scripts.
### Results

View File

@@ -1,5 +1,5 @@
pandas==1.1.2
numpy==1.17.4
numpy==1.21.0
scikit_learn==0.23.2
torch==1.7.0
seaborn

View File

@@ -124,7 +124,7 @@ class TRAModel(Model):
loss = (pred - label).pow(2).mean()
L = (all_preds.detach() - label[:, None]).pow(2)
L -= L.min(dim=-1, keepdim=True).values # normalize & ensure postive input
L -= L.min(dim=-1, keepdim=True).values # normalize & ensure positive input
data_set.assign_data(index, L) # save loss to memory
@@ -165,7 +165,7 @@ class TRAModel(Model):
L = (all_preds - label[:, None]).pow(2)
L -= L.min(dim=-1, keepdim=True).values # normalize & ensure postive input
L -= L.min(dim=-1, keepdim=True).values # normalize & ensure positive input
data_set.assign_data(index, L) # save loss to memory
@@ -484,7 +484,7 @@ class TRA(nn.Module):
"""Temporal Routing Adaptor (TRA)
TRA takes historical prediction erros & latent representation as inputs,
TRA takes historical prediction errors & latent representation as inputs,
then routes the input sample to a specific predictor for training & inference.
Args:

View File

@@ -0,0 +1,30 @@
# Introduction
This is the implementation of `DDG-DA` based on `Meta Controller` component provided by `Qlib`.
Please refer to the paper for more details: *DDG-DA: Data Distribution Generation for Predictable Concept Drift Adaptation* [[arXiv](https://arxiv.org/abs/2201.04038)]
## Background
In many real-world scenarios, we often deal with streaming data that is sequentially collected over time. Due to the non-stationary nature of the environment, the streaming data distribution may change in unpredictable ways, which is known as concept drift. To handle concept drift, previous methods first detect when/where the concept drift happens and then adapt models to fit the distribution of the latest data. However, there are still many cases that some underlying factors of environment evolution are predictable, making it possible to model the future concept drift trend of the streaming data, while such cases are not fully explored in previous work.
Therefore, we propose a novel method `DDG-DA`, that can effectively forecast the evolution of data distribution and improve the performance of models. Specifically, we first train a predictor to estimate the future data distribution, then leverage it to generate training samples, and finally train models on the generated data.
## Dataset
The data used in the paper are private, so we conduct experiments on Qlib's public dataset instead.
Though the dataset is different, the conclusion remains the same. By applying `DDG-DA`, users can see rising trends in the test phase in both the proxy models' ICs and the performance of the forecasting models.
## Run the Code
Users can try `DDG-DA` by running the following command:
```bash
python workflow.py run_all
```
The default forecasting models are `Linear`. Users can choose other forecasting models by changing the `forecast_model` parameter when `DDG-DA` is initialized. For example, users can try `LightGBM` forecasting models by running the following command:
```bash
python workflow.py --forecast_model="gbdt" run_all
```
## Results
The results of related methods on Qlib's public dataset can be found [here](../)

View File

@@ -0,0 +1 @@
torch==1.10.0

View File

@@ -0,0 +1,261 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from pathlib import Path
from qlib.model.meta.task import MetaTask
from qlib.contrib.meta.data_selection.model import MetaModelDS
from qlib.contrib.meta.data_selection.dataset import InternalData, MetaDatasetDS
from qlib.data.dataset.handler import DataHandlerLP
import pandas as pd
import fire
import sys
from tqdm.auto import tqdm
import yaml
import pickle
from qlib import auto_init
from qlib.model.trainer import TrainerR, task_train
from qlib.utils import init_instance_by_config
from qlib.workflow.task.gen import RollingGen, task_generator
from qlib.workflow import R
from qlib.tests.data import GetData
DIRNAME = Path(__file__).absolute().resolve().parent
sys.path.append(str(DIRNAME.parent / "baseline"))
from rolling_benchmark import RollingBenchmark # NOTE: sys.path is changed for import RollingBenchmark


class DDGDA:
    """
    Please run `python workflow.py run_all` to run the full workflow of the experiment.

    **NOTE**
    Before running the example, please clean your previous results with the following command
    - `rm -r mlruns`
    """

    def __init__(self, sim_task_model="linear", forecast_model="linear"):
        self.step = 20
        # NOTE:
        # the horizon must match the meaning in the base task template
        self.horizon = 20
        self.meta_exp_name = "DDG-DA"
        self.sim_task_model = sim_task_model  # The model to capture the distribution of data.
        self.forecast_model = forecast_model  # downstream forecasting models' type

    def get_feature_importance(self):
        # this must be LightGBM, because it needs to get the feature importance
        rb = RollingBenchmark(model_type="gbdt")
        task = rb.basic_task()

        model = init_instance_by_config(task["model"])
        dataset = init_instance_by_config(task["dataset"])
        model.fit(dataset)

        fi = model.get_feature_importance()

        # Because the model uses numpy arrays instead of DataFrames to train LightGBM,
        # we must take the following extra steps to get the right feature importance
        df = dataset.prepare(segments=slice(None), col_set="feature", data_key=DataHandlerLP.DK_R)
        cols = df.columns
        fi_named = {cols[int(k.split("_")[1])]: imp for k, imp in fi.to_dict().items()}

        return pd.Series(fi_named)

    def dump_data_for_proxy_model(self):
        """
        Dump data for training meta model.
        The meta model will be trained upon the proxy forecasting model.
        This dataset is for the proxy forecasting model.
        """
        topk = 30
        fi = self.get_feature_importance()
        col_selected = fi.nlargest(topk)

        rb = RollingBenchmark(model_type=self.sim_task_model)
        task = rb.basic_task()
        dataset = init_instance_by_config(task["dataset"])
        prep_ds = dataset.prepare(slice(None), col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)

        feature_df = prep_ds["feature"]
        label_df = prep_ds["label"]

        feature_selected = feature_df.loc[:, col_selected.index]

        feature_selected = feature_selected.groupby("datetime").apply(lambda df: (df - df.mean()).div(df.std()))
        feature_selected = feature_selected.fillna(0.0)

        df_all = {
            "label": label_df.reindex(feature_selected.index),
            "feature": feature_selected,
        }
        df_all = pd.concat(df_all, axis=1)
        df_all.to_pickle(DIRNAME / "fea_label_df.pkl")

        # dump data in handler format for aligning the interface
        handler = DataHandlerLP(
            data_loader={
                "class": "qlib.data.dataset.loader.StaticDataLoader",
                "kwargs": {"config": DIRNAME / "fea_label_df.pkl"},
            }
        )
        handler.to_pickle(DIRNAME / "handler_proxy.pkl", dump_all=True)

    @property
    def _internal_data_path(self):
        return DIRNAME / f"internal_data_s{self.step}.pkl"

    def dump_meta_ipt(self):
        """
        Dump data for training meta model.
        This function will dump the input data for meta model
        """
        # According to the experiments, the choice of the model type is very important for achieving good results
        rb = RollingBenchmark(model_type=self.sim_task_model)
        sim_task = rb.basic_task()

        if self.sim_task_model == "gbdt":
            sim_task["model"].setdefault("kwargs", {}).update({"early_stopping_rounds": None, "num_boost_round": 150})

        exp_name_sim = f"data_sim_s{self.step}"

        internal_data = InternalData(sim_task, self.step, exp_name=exp_name_sim)
        internal_data.setup(trainer=TrainerR)

        with self._internal_data_path.open("wb") as f:
            pickle.dump(internal_data, f)

    def train_meta_model(self):
        """
        Train a meta model based on a simplified linear proxy model.
        """

        # 1) leverage the simplified proxy forecasting model to train meta model.
        # - Only the dataset part is important, in current version of meta model will integrate the
        rb = RollingBenchmark(model_type=self.sim_task_model)
        sim_task = rb.basic_task()
        proxy_forecast_model_task = {
            # "model": "qlib.contrib.model.linear.LinearModel",
            "dataset": {
                "class": "qlib.data.dataset.DatasetH",
                "kwargs": {
                    "handler": f"file://{(DIRNAME / 'handler_proxy.pkl').absolute()}",
                    "segments": {
                        "train": ("2008-01-01", "2010-12-31"),
                        "test": ("2011-01-01", sim_task["dataset"]["kwargs"]["segments"]["test"][1]),
                    },
                },
            },
            # "record": ["qlib.workflow.record_temp.SignalRecord"]
        }
        # the proxy_forecast_model_task will be used to create meta tasks.
        # The test date of the first task will be 2011-01-01. Each test segment will be about 20 days
        # The tasks include all training tasks and test tasks.

        # 2) preparing meta dataset
        kwargs = dict(
            task_tpl=proxy_forecast_model_task,
            step=self.step,
            segments=0.62,  # keep test period consistent with the dataset yaml
            trunc_days=1 + self.horizon,
            hist_step_n=30,
            fill_method="max",
            rolling_ext_days=0,
        )
        # NOTE:
        # the input of the meta model (internal data) is shared between the proxy model and the final forecasting model
        # but their task test segments are not aligned! It worked in my previous experiment.
        # So the misalignment will not affect the effectiveness of the method.
        with self._internal_data_path.open("rb") as f:
            internal_data = pickle.load(f)
        md = MetaDatasetDS(exp_name=internal_data, **kwargs)

        # 3) train and log the meta model
        with R.start(experiment_name=self.meta_exp_name):
            R.log_params(**kwargs)
            mm = MetaModelDS(step=self.step, hist_step_n=kwargs["hist_step_n"], lr=0.001, max_epoch=200, seed=43)
            mm.fit(md)
            R.save_objects(model=mm)

    @property
    def _task_path(self):
        return DIRNAME / f"tasks_s{self.step}.pkl"

    def meta_inference(self):
        """
        Leverage meta-model for inference:
        - Given
            - baseline tasks
            - input for meta model(internal data)
            - meta model (its learnt knowledge on proxy forecasting model is expected to transfer to normal forecasting model)
        """
        # 1) get meta model
        exp = R.get_exp(experiment_name=self.meta_exp_name)
        rec = exp.list_recorders(rtype=exp.RT_L)[0]
        meta_model: MetaModelDS = rec.load_object("model")

        # 2)
        # we are transferring the knowledge of the meta model to the final forecasting tasks.
        # Create MetaTaskDataset for the final forecasting tasks
        # Aligning its settings with the MetaTaskDataset used when training the meta model is necessary

        # 2.1) get previous config
        param = rec.list_params()
        trunc_days = int(param["trunc_days"])
        step = int(param["step"])
        hist_step_n = int(param["hist_step_n"])
        fill_method = param.get("fill_method", "max")

        rb = RollingBenchmark(model_type=self.forecast_model)
        task_l = rb.create_rolling_tasks()

        # 2.2) create meta dataset for final dataset
        kwargs = dict(
            task_tpl=task_l,
            step=step,
            segments=0.0,  # all the tasks are for testing
            trunc_days=trunc_days,
            hist_step_n=hist_step_n,
            fill_method=fill_method,
            task_mode=MetaTask.PROC_MODE_TRANSFER,
        )

        with self._internal_data_path.open("rb") as f:
            internal_data = pickle.load(f)
        mds = MetaDatasetDS(exp_name=internal_data, **kwargs)

        # 3) meta model make inference and get new qlib task
        new_tasks = meta_model.inference(mds)
        with self._task_path.open("wb") as f:
            pickle.dump(new_tasks, f)

    def train_and_eval_tasks(self):
        """
        Training the tasks generated by meta model
        Then evaluate it
        """
        with self._task_path.open("rb") as f:
            tasks = pickle.load(f)
        rb = RollingBenchmark(rolling_exp="rolling_ds", model_type=self.forecast_model)
        rb.train_rolling_tasks(tasks)
        rb.ens_rolling()
        rb.update_rolling_rec()

    def run_all(self):
        # 1) file: handler_proxy.pkl
        self.dump_data_for_proxy_model()
        # 2)
        # file: internal_data_s20.pkl
        # mlflow: data_sim_s20, models for calculating meta_ipt
        self.dump_meta_ipt()
        # 3) meta model will be stored in `DDG-DA`
        self.train_meta_model()
        # 4) new_tasks are saved in "tasks_s20.pkl" (reweighter is added)
        self.meta_inference()
        # 5) load the saved tasks and train model
        self.train_and_eval_tasks()


if __name__ == "__main__":
    GetData().qlib_data(exists_skip=True)
    auto_init()
    fire.Fire(DDGDA)

View File

@@ -0,0 +1,18 @@
# Introduction
Due to the non-stationary nature of the financial market environment, the data distribution may change across periods, which makes the performance of models built on training data decay on future test data.
So adapting the forecasting models/strategies to market dynamics is very important to their performance.
The table below shows the performances of different solutions on different forecasting models.
## Alpha158 dataset
| Model Name | Dataset | IC | ICIR | Rank IC | Rank ICIR | Annualized Return | Information Ratio | Max Drawdown |
|------------------|---------|----|------|---------|-----------|-------------------|-------------------|--------------|
| RR[Linear] |Alpha158 |0.088|0.570|0.102 |0.622 |0.077 |1.175 |-0.086 |
| DDG-DA[Linear] |Alpha158 |0.093|0.622|0.106 |0.670 |0.085 |1.213 |-0.093 |
| RR[LightGBM] |Alpha158 |0.079|0.566|0.088 |0.592 |0.075 |1.226 |-0.096 |
| DDG-DA[LightGBM] |Alpha158 |0.084|0.639|0.093 |0.664 |0.099 |1.442 |-0.071 |
- The label horizon of the `Alpha158` dataset is set to 20.
- The rolling time intervals are set to 20 trading days.
- The test rolling periods are from January 2017 to August 2020.
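
To make the comparison easier to read, the relative gains of `DDG-DA` over `RR` can be computed directly from the table. The short script below is only a convenience sketch: the numbers are copied from the IC and Annualized Return columns above, and the helper name `relative_gain` is an assumption introduced for illustration.

```python
# (IC, annualized return) copied from the table above.
results = {
    "Linear": {"RR": (0.088, 0.077), "DDG-DA": (0.093, 0.085)},
    "LightGBM": {"RR": (0.079, 0.075), "DDG-DA": (0.084, 0.099)},
}


def relative_gain(new, old):
    """Relative change of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


gains = {}
for model, rows in results.items():
    ic_gain = relative_gain(rows["DDG-DA"][0], rows["RR"][0])
    ret_gain = relative_gain(rows["DDG-DA"][1], rows["RR"][1])
    gains[model] = (ic_gain, ret_gain)
    print(f"{model}: IC {ic_gain:+.1f}%, annualized return {ret_gain:+.1f}%")
```

For example, the LightGBM annualized return rises from 0.075 to 0.099, a relative gain of about 32%.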

View File

@@ -0,0 +1,15 @@
# Introduction
This is a framework for periodically retraining forecasting models, called Rolling Retrain (RR). RR adapts to market dynamics by periodically retraining on up-to-date data.
## Run the Code
Users can try RR by running the following command:
```bash
python rolling_benchmark.py run_all
```
The default forecasting models are `Linear`. Users can choose other forecasting models by changing the `model_type` parameter.
For example, users can try `LightGBM` forecasting models by running the following command:
```bash
python rolling_benchmark.py --model_type="gbdt" run_all
```
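
The idea of rolling retraining can be sketched with plain trading-day indices. This is a simplified illustration, not the logic of Qlib's `RollingGen`: the helper `rolling_segments`, its `train_len` parameter, and the 100-day toy calendar are assumptions. The values `step = 20` and `trunc_days = horizon + 1` mirror the settings used in `rolling_benchmark.py`.

```python
def rolling_segments(n_days, train_len, step, trunc_days):
    """Yield (train_range, test_range) pairs over trading-day indices 0..n_days-1.

    Ranges are half-open [start, end). The last `trunc_days` days before each
    test window are dropped from training, because their labels would look
    into the test window.
    """
    test_start = train_len
    while test_start < n_days:
        test_end = min(test_start + step, n_days)
        train_end = test_start - trunc_days
        train_start = max(0, train_end - train_len)
        yield (train_start, train_end), (test_start, test_end)
        test_start = test_end


step, horizon = 20, 20  # values used in this example directory
segments = list(rolling_segments(n_days=100, train_len=40, step=step, trunc_days=horizon + 1))
for train, test in segments:
    print("train", train, "test", test)
```

Each iteration moves the test window forward by `step` days and retrains on the most recent data that does not overlap the label horizon.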

View File

@@ -0,0 +1,114 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from qlib.model.ens.ensemble import RollingEnsemble
from qlib.utils import init_instance_by_config
import fire
import yaml
from qlib import auto_init
from pathlib import Path
from tqdm.auto import tqdm
from qlib.model.trainer import TrainerR
from qlib.workflow import R
from qlib.tests.data import GetData
DIRNAME = Path(__file__).absolute().resolve().parent
from qlib.workflow.task.gen import task_generator, RollingGen
from qlib.workflow.task.collect import RecorderCollector
from qlib.workflow.record_temp import PortAnaRecord, SigAnaRecord


class RollingBenchmark:
    """
    **NOTE**
    before running the example, please clean your previous results with the following command
    - `rm -r mlruns`
    """

    def __init__(self, rolling_exp="rolling_models", model_type="linear") -> None:
        self.step = 20
        self.horizon = 20
        self.rolling_exp = rolling_exp
        self.model_type = model_type

    def basic_task(self):
        """For fast training rolling"""
        if self.model_type == "gbdt":
            conf_path = DIRNAME.parent.parent / "benchmarks" / "LightGBM" / "workflow_config_lightgbm_Alpha158.yaml"
            # dump the processed data on to disk for later loading to speed up the processing
            h_path = DIRNAME / "lightgbm_alpha158_handler_horizon{}.pkl".format(self.horizon)
        elif self.model_type == "linear":
            conf_path = DIRNAME.parent.parent / "benchmarks" / "Linear" / "workflow_config_linear_Alpha158.yaml"
            h_path = DIRNAME / "linear_alpha158_handler_horizon{}.pkl".format(self.horizon)
        else:
            raise AssertionError("Model type is not supported!")
        with conf_path.open("r") as f:
            conf = yaml.safe_load(f)

        # modify dataset horizon
        conf["task"]["dataset"]["kwargs"]["handler"]["kwargs"]["label"] = [
            "Ref($close, -{}) / Ref($close, -1) - 1".format(self.horizon + 1)
        ]

        task = conf["task"]

        if not h_path.exists():
            h_conf = task["dataset"]["kwargs"]["handler"]
            h = init_instance_by_config(h_conf)
            h.to_pickle(h_path, dump_all=True)

        task["dataset"]["kwargs"]["handler"] = f"file://{h_path}"
        task["record"] = ["qlib.workflow.record_temp.SignalRecord"]
        return task

    def create_rolling_tasks(self):
        task = self.basic_task()
        task_l = task_generator(
            task, RollingGen(step=self.step, trunc_days=self.horizon + 1)
        )  # the last two days should be truncated to avoid information leakage
        return task_l

    def train_rolling_tasks(self, task_l=None):
        if task_l is None:
            task_l = self.create_rolling_tasks()
        trainer = TrainerR(experiment_name=self.rolling_exp)
        trainer(task_l)

    COMB_EXP = "rolling"

    def ens_rolling(self):
        rc = RecorderCollector(
            experiment=self.rolling_exp,
            artifacts_key=["pred", "label"],
            process_list=[RollingEnsemble()],
            # rec_key_func=lambda rec: (self.COMB_EXP, rec.info["id"]),
            artifacts_path={"pred": "pred.pkl", "label": "label.pkl"},
        )
        res = rc()
        with R.start(experiment_name=self.COMB_EXP):
            R.log_params(exp_name=self.rolling_exp)
            R.save_objects(**{"pred.pkl": res["pred"], "label.pkl": res["label"]})

    def update_rolling_rec(self):
        """
        Evaluate the combined rolling results
        """
        for rid, rec in R.list_recorders(experiment_name=self.COMB_EXP).items():
            for rt_cls in SigAnaRecord, PortAnaRecord:
                rt = rt_cls(recorder=rec, skip_existing=True)
                rt.generate()
        print(f"Your evaluation results can be found in the experiment named `{self.COMB_EXP}`.")

    def run_all(self):
        # the results will be saved in mlruns.
        # 1) each rolling task is saved in rolling_models
        self.train_rolling_tasks()
        # 2) combined rolling tasks and evaluation results are saved in rolling
        self.ens_rolling()
        self.update_rolling_rec()


if __name__ == "__main__":
    GetData().qlib_data(exists_skip=True)
    auto_init()
    fire.Fire(RollingBenchmark)

View File

@@ -150,7 +150,7 @@ class Cut(ElemOperator):
self.l = l
self.r = r
if (self.l is not None and self.l <= 0) or (self.r is not None and self.r >= 0):
raise ValueError("Cut operator l shoud > 0 and r should < 0")
raise ValueError("Cut operator l should > 0 and r should < 0")
super(Cut, self).__init__(feature)

View File

@@ -1,5 +1,6 @@
import numpy as np
import pandas as pd
from qlib.constant import EPS
from qlib.data.dataset.processor import Processor
from qlib.data.dataset.utils import fetch_df_by_index
@@ -27,7 +28,7 @@ class HighFreqNorm(Processor):
part_values = np.log1p(part_values)
self.feature_med[name] = np.nanmedian(part_values)
part_values = part_values - self.feature_med[name]
self.feature_std[name] = np.nanmedian(np.absolute(part_values)) * 1.4826 + 1e-12
self.feature_std[name] = np.nanmedian(np.absolute(part_values)) * 1.4826 + EPS
part_values = part_values / self.feature_std[name]
self.feature_vmax[name] = np.nanmax(part_values)
self.feature_vmin[name] = np.nanmin(part_values)

View File

@@ -5,7 +5,8 @@ import fire
import qlib
import pickle
from qlib.config import REG_CN, HIGH_FREQ_CONFIG
from qlib.constant import REG_CN
from qlib.config import HIGH_FREQ_CONFIG
from qlib.utils import init_instance_by_config
from qlib.data.dataset.handler import DataHandlerLP
@@ -82,7 +83,7 @@ class HighfreqWorkflow:
def _init_qlib(self):
"""initialize qlib"""
# use yahoo_cn_1min data
# use cn_data_1min data
QLIB_INIT_CONFIG = {**HIGH_FREQ_CONFIG, **self.SPEC_CONF}
provider_uri = QLIB_INIT_CONFIG.get("provider_uri")
GetData().qlib_data(target_dir=provider_uri, interval="1min", region=REG_CN, exists_skip=True)

View File

@@ -1,6 +1,6 @@
import qlib
import optuna
from qlib.config import REG_CN
from qlib.constant import REG_CN
from qlib.utils import init_instance_by_config
from qlib.tests.config import CSI300_DATASET_CONFIG
from qlib.tests.data import GetData

View File

@@ -1,6 +1,6 @@
import qlib
import optuna
from qlib.config import REG_CN
from qlib.constant import REG_CN
from qlib.utils import init_instance_by_config
from qlib.tests.data import GetData
from qlib.tests.config import get_dataset_config, CSI300_MARKET, DATASET_ALPHA360_CLASS

View File

@@ -3,7 +3,7 @@
import qlib
from qlib.config import REG_CN
from qlib.constant import REG_CN
from qlib.utils import init_instance_by_config
from qlib.tests.data import GetData

View File

@@ -11,7 +11,7 @@ from pprint import pprint
import fire
import qlib
from qlib.config import REG_CN
from qlib.constant import REG_CN
from qlib.workflow import R
from qlib.workflow.task.gen import RollingGen, task_generator
from qlib.workflow.task.manage import TaskManager, run_task

View File

@@ -100,7 +100,8 @@ from copy import deepcopy
import qlib
import fire
import pandas as pd
from qlib.config import REG_CN, HIGH_FREQ_CONFIG
from qlib.constant import REG_CN
from qlib.config import HIGH_FREQ_CONFIG
from qlib.data import D
from qlib.utils import exists_qlib_data, init_instance_by_config, flatten_dict
from qlib.workflow import R
@@ -154,6 +155,8 @@ class NestedDecisionExecutionWorkflow:
},
}
exp_name = "nested"
port_analysis_config = {
"executor": {
"class": "NestedExecutor",
@@ -229,7 +232,7 @@ class NestedDecisionExecutionWorkflow:
qlib.init(provider_uri=provider_uri_map, dataset_cache=None, expression_cache=None)
def _train_model(self, model, dataset):
with R.start(experiment_name="train"):
with R.start(experiment_name=self.exp_name):
R.log_params(**flatten_dict(self.task))
model.fit(dataset)
R.save_objects(**{"params.pkl": model})
@@ -256,7 +259,7 @@ class NestedDecisionExecutionWorkflow:
self.port_analysis_config["strategy"] = strategy_config
self.port_analysis_config["backtest"]["benchmark"] = self.benchmark
with R.start(experiment_name="backtest"):
with R.start(experiment_name=self.exp_name, resume=True):
recorder = R.get_recorder()
par = PortAnaRecord(
recorder,
@@ -298,7 +301,7 @@ class NestedDecisionExecutionWorkflow:
# - Aligning the profit calculation between multiple levels and single levels.
# 2) comparing different backtest
# - Basic test idea:
# - the daily backtest will be similar as multi-level(the data quality makes this gap samller)
# - the daily backtest will be similar as multi-level(the data quality makes this gap smaller)
def check_diff_freq(self):
self._init_qlib()
@@ -381,7 +384,7 @@ class NestedDecisionExecutionWorkflow:
}
pa_conf["backtest"]["benchmark"] = self.benchmark
with R.start(experiment_name="backtest"):
with R.start(experiment_name=self.exp_name, resume=True):
recorder = R.get_recorder()
par = PortAnaRecord(recorder, pa_conf)
par.generate()

View File

@@ -10,7 +10,7 @@ Next, we will finish updating online predictions.
import copy
import fire
import qlib
from qlib.config import REG_CN
from qlib.constant import REG_CN
from qlib.model.trainer import task_train
from qlib.workflow.online.utils import OnlineToolR
from qlib.tests.config import CSI300_GBDT_TASK

View File

@@ -0,0 +1,52 @@
# Introduction
This example demonstrates how Qlib supports data that does not share a fixed frequency.
For example,
- Daily price and volume data are fixed-frequency data. The data arrive at a fixed frequency (i.e. daily).
- Orders are not fixed-frequency data; they may arrive at any point in time.
To support such non-fixed-frequency data, Qlib implements an Arctic-based backend.
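
As a rough illustration of the difference (plain Python, independent of Qlib and Arctic), irregular events such as orders can be mapped onto a fixed one-minute grid by keeping the last value seen in each minute. The helper `last_per_minute` and the sample prices are assumptions made up for this sketch.

```python
from datetime import datetime


def last_per_minute(events):
    """Map irregular (timestamp, value) events to the last value seen in each minute."""
    buckets = {}
    for ts, value in sorted(events):
        minute = ts.replace(second=0, microsecond=0)
        buckets[minute] = value  # later events in the same minute overwrite earlier ones
    return buckets


orders = [
    (datetime(2020, 12, 31, 9, 30, 1, 250000), 10.01),
    (datetime(2020, 12, 31, 9, 30, 47), 10.03),
    (datetime(2020, 12, 31, 9, 32, 5), 10.02),  # nothing arrived during 09:31
]
bars = last_per_minute(orders)
for minute, price in bars.items():
    print(minute, price)
```

Note that the 09:31 bucket is simply absent: irregular data has no value for every point on a shared calendar, which is why it cannot be treated like daily bars.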
The sections below show how to import and query data based on the Arctic backend.
# Installation
Please refer to [the installation docs](https://docs.mongodb.com/manual/installation/) of MongoDB.
With its default settings, the current version of the script tries to connect to localhost **via the default port without authentication**.
Run the following commands to install the necessary libraries
```
pip install pytest coverage
pip install arctic  # NOTE: pip may fail to resolve the right package dependencies! Please make sure the dependencies are satisfied.
```
# Importing example data
1. (Optional) Please follow the first part of [this section](https://github.com/microsoft/qlib#data-preparation) to **get 1min data** of Qlib.
2. Please follow the steps below to download the example data
```bash
cd examples/orderbook_data/
wget http://fintech.msra.cn/stock_data/downloads/highfreq_orderboook_example_data.tar.bz2
tar xf highfreq_orderboook_example_data.tar.bz2
```
3. Please import the example data into your MongoDB database
```bash
cd examples/orderbook_data/
python create_dataset.py initialize_library  # Initialize the libraries
python create_dataset.py import_data  # Import the example data
```
# Query Examples
After importing the data, you can run `example.py` to create some high-frequency features.
```bash
cd examples/orderbook_data/
pytest -s --disable-warnings example.py  # If you want to run all examples
pytest -s --disable-warnings example.py::TestClass::test_exp_10  # If you want to run a specific example
```
# Known limitations
Expression computation across different frequencies is not supported yet.

View File

@@ -0,0 +1,315 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
"""
NOTE:
- This script is a demo for importing example data into Qlib
- !!!!!!!!!!!!!!!TODO!!!!!!!!!!!!!!!!!!!:
    - Its structure is not well designed and very ugly; contributions that make importing datasets easier are welcome
"""
from datetime import date, datetime as dt
import os
from pathlib import Path
import random
import shutil
import time
import traceback
from arctic import Arctic, chunkstore
import arctic
from arctic import Arctic, CHUNK_STORE
from arctic.chunkstore.chunkstore import CHUNK_SIZE
import fire
from joblib import Parallel, delayed, parallel
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas.core.indexes.datetimes import date_range
from pymongo.mongo_client import MongoClient
DIRNAME = Path(__file__).absolute().resolve().parent
# CONFIG
N_JOBS = -1  # -1 lets joblib use all CPU cores (use -2 to leave one core free)
LOG_FILE_PATH = DIRNAME / "log_file"
DATA_PATH = DIRNAME / "raw_data"
DATABASE_PATH = DIRNAME / "orig_data"
DATA_INFO_PATH = DIRNAME / "data_info"
DATA_FINISH_INFO_PATH = DIRNAME / "./data_finish_info"
DOC_TYPE = ["Tick", "Order", "OrderQueue", "Transaction", "Day", "Minute"]
MAX_SIZE = 3000 * 1024 * 1024 * 1024
ALL_STOCK_PATH = DATABASE_PATH / "all.txt"
ARCTIC_SRV = "127.0.0.1"


def get_library_name(doc_type):
    if str.lower(doc_type) == str.lower("Tick"):
        return "ticks"
    else:
        return str.lower(doc_type)


def is_stock(exchange_place, code):
    if exchange_place == "SH" and code[0] != "6":
        return False
    if exchange_place == "SZ" and code[0] != "0" and code[:2] != "30":
        return False
    return True


def add_one_stock_daily_data(filepath, type, exchange_place, arc, date):
    """
    exchange_place: "SZ" OR "SH"
    type: "tick", "orderbook", ...
    filepath: the path of csv
    arc: arclink created by a process
    """
    code = os.path.split(filepath)[-1].split(".csv")[0]
    if exchange_place == "SH" and code[0] != "6":
        return
    if exchange_place == "SZ" and code[0] != "0" and code[:2] != "30":
        return

    df = pd.read_csv(filepath, encoding="gbk", dtype={"code": str})
    code = os.path.split(filepath)[-1].split(".csv")[0]

    def format_time(day, hms):
        day = str(day)
        hms = str(hms)
        if hms[0] == "1":  # >=10,
            return (
                "-".join([day[0:4], day[4:6], day[6:8]]) + " " + ":".join([hms[:2], hms[2:4], hms[4:6] + "." + hms[6:]])
            )
        else:
            return (
                "-".join([day[0:4], day[4:6], day[6:8]]) + " " + ":".join([hms[:1], hms[1:3], hms[3:5] + "." + hms[5:]])
            )

    ## Discard the entire row if a wrong data timestamp is encountered.
    timestamp = list(zip(list(df["date"]), list(df["time"])))
    error_index_list = []
    for index, t in enumerate(timestamp):
        try:
            pd.Timestamp(format_time(t[0], t[1]))
        except Exception:
            error_index_list.append(index)  ## The row number of the error line

    # to-do: writing to logs

    if len(error_index_list) > 0:
        print("error: {}, {}".format(filepath, len(error_index_list)))

    df = df.drop(error_index_list)
    timestamp = list(zip(list(df["date"]), list(df["time"])))  ## The cleaned timestamp
    # generate timestamp
    pd_timestamp = pd.DatetimeIndex(
        [pd.Timestamp(format_time(timestamp[i][0], timestamp[i][1])) for i in range(len(df["date"]))]
    )
    df = df.drop(columns=["date", "time", "name", "code", "wind_code"])
    # df = pd.DataFrame(data=df.to_dict("list"), index=pd_timestamp)
    df["date"] = pd.to_datetime(pd_timestamp)
    df.set_index("date", inplace=True)

    if str.lower(type) == "orderqueue":
        ## extract ab1~ab50
        df["ab"] = [
            ",".join([str(int(row["ab" + str(i + 1)])) for i in range(0, row["ab_items"])])
            for timestamp, row in df.iterrows()
        ]
        df = df.drop(columns=["ab" + str(i) for i in range(1, 51)])

    type = get_library_name(type)
    # arc.initialize_library(type, lib_type=CHUNK_STORE)
    lib = arc[type]

    symbol = "".join([exchange_place, code])
    if symbol in lib.list_symbols():
        print("update {0}, date={1}".format(symbol, date))
        if df.empty:
            return error_index_list
        lib.update(symbol, df, chunk_size="D")
    else:
        print("write {0}, date={1}".format(symbol, date))
        lib.write(symbol, df, chunk_size="D")
    return error_index_list


def add_one_stock_daily_data_wrapper(filepath, type, exchange_place, index, date):
    pid = os.getpid()
    code = os.path.split(filepath)[-1].split(".csv")[0]
    arc = Arctic(ARCTIC_SRV)
    try:
        if index % 100 == 0:
            print("index = {}, filepath = {}".format(index, filepath))
        error_index_list = add_one_stock_daily_data(filepath, type, exchange_place, arc, date)
        if error_index_list is not None and len(error_index_list) > 0:
            f = open(os.path.join(LOG_FILE_PATH, "temp_timestamp_error_{0}_{1}_{2}.txt".format(pid, date, type)), "a+")
            f.write("{}, {}, {}\n".format(filepath, error_index_list, exchange_place + "_" + code))
            f.close()

    except Exception as e:
        info = traceback.format_exc()
        print("error:" + str(e))
        f = open(os.path.join(LOG_FILE_PATH, "temp_fail_{0}_{1}_{2}.txt".format(pid, date, type)), "a+")
        f.write("fail:" + str(filepath) + "\n" + str(e) + "\n" + str(info) + "\n")
        f.close()

    finally:
        arc.reset()


def add_data(tick_date, doc_type, stock_name_dict):
    pid = os.getpid()

    if doc_type not in DOC_TYPE:
        print("doc_type not in {}".format(DOC_TYPE))
        return
    try:
        begin_time = time.time()
        os.system(f"cp {DATABASE_PATH}/{tick_date + '_{}.tar.gz'.format(doc_type)} {DATA_PATH}/")

        os.system(
            f"tar -xvzf {DATA_PATH}/{tick_date + '_{}.tar.gz'.format(doc_type)} -C {DATA_PATH}/ {tick_date + '_' + doc_type}/SH"
        )
        os.system(
            f"tar -xvzf {DATA_PATH}/{tick_date + '_{}.tar.gz'.format(doc_type)} -C {DATA_PATH}/ {tick_date + '_' + doc_type}/SZ"
        )
        os.system(f"chmod 777 {DATA_PATH}")
        os.system(f"chmod 777 {DATA_PATH}/{tick_date + '_' + doc_type}")
        os.system(f"chmod 777 {DATA_PATH}/{tick_date + '_' + doc_type}/SH")
        os.system(f"chmod 777 {DATA_PATH}/{tick_date + '_' + doc_type}/SZ")
        os.system(f"chmod 777 {DATA_PATH}/{tick_date + '_' + doc_type}/SH/{tick_date}")
        os.system(f"chmod 777 {DATA_PATH}/{tick_date + '_' + doc_type}/SZ/{tick_date}")

        print("tick_date={}".format(tick_date))

        temp_data_path_sh = os.path.join(DATA_PATH, tick_date + "_" + doc_type, "SH", tick_date)
        temp_data_path_sz = os.path.join(DATA_PATH, tick_date + "_" + doc_type, "SZ", tick_date)
        is_files_exist = {"sh": os.path.exists(temp_data_path_sh), "sz": os.path.exists(temp_data_path_sz)}

        sz_files = (
            (
                set([i.split(".csv")[0] for i in os.listdir(temp_data_path_sz) if i[:2] == "30" or i[0] == "0"])
                & set(stock_name_dict["SZ"])
            )
            if is_files_exist["sz"]
            else set()
        )
        sz_file_nums = len(sz_files) if is_files_exist["sz"] else 0
        sh_files = (
            (
                set([i.split(".csv")[0] for i in os.listdir(temp_data_path_sh) if i[0] == "6"])
                & set(stock_name_dict["SH"])
            )
            if is_files_exist["sh"]
            else set()
        )
        sh_file_nums = len(sh_files) if is_files_exist["sh"] else 0
        print("sz_file_nums:{}, sh_file_nums:{}".format(sz_file_nums, sh_file_nums))

        f = (DATA_INFO_PATH / "data_info_log_{}_{}".format(doc_type, tick_date)).open("w+")
        f.write("sz:{}, sh:{}, date:{}:".format(sz_file_nums, sh_file_nums, tick_date) + "\n")
        f.close()

        if sh_file_nums > 0:
            # write is not thread-safe, update may be thread-safe
            Parallel(n_jobs=N_JOBS)(
                delayed(add_one_stock_daily_data_wrapper)(
                    os.path.join(temp_data_path_sh, name + ".csv"), doc_type, "SH", index, tick_date
                )
                for index, name in enumerate(list(sh_files))
            )
        if sz_file_nums > 0:
            # write is not thread-safe, update may be thread-safe
            Parallel(n_jobs=N_JOBS)(
                delayed(add_one_stock_daily_data_wrapper)(
                    os.path.join(temp_data_path_sz, name + ".csv"), doc_type, "SZ", index, tick_date
                )
                for index, name in enumerate(list(sz_files))
            )

        os.system(f"rm -f {DATA_PATH}/{tick_date + '_{}.tar.gz'.format(doc_type)}")
        os.system(f"rm -rf {DATA_PATH}/{tick_date + '_' + doc_type}")
        total_time = time.time() - begin_time
        f = (DATA_FINISH_INFO_PATH / "data_info_finish_log_{}_{}".format(doc_type, tick_date)).open("w+")
        f.write("finish: date:{}, consume_time:{}, end_time: {}".format(tick_date, total_time, time.time()) + "\n")
        f.close()

    except Exception as e:
        info = traceback.format_exc()
        print("date error:" + str(e))
        f = open(os.path.join(LOG_FILE_PATH, "temp_fail_{0}_{1}_{2}.txt".format(pid, tick_date, doc_type)), "a+")
        f.write("fail:" + str(tick_date) + "\n" + str(e) + "\n" + str(info) + "\n")
        f.close()


class DSCreator:
    """Dataset creator"""

    def clear(self):
        client = MongoClient(ARCTIC_SRV)
        client.drop_database("arctic")

    def initialize_library(self):
        arc = Arctic(ARCTIC_SRV)
        for doc_type in DOC_TYPE:
            arc.initialize_library(get_library_name(doc_type), lib_type=CHUNK_STORE)

    def _get_empty_folder(self, fp: Path):
        fp = Path(fp)
        if fp.exists():
            shutil.rmtree(fp)
        fp.mkdir(parents=True, exist_ok=True)

    def import_data(self, doc_type_l=["Tick", "Transaction", "Order"]):
        # clear all the old files
        for fp in LOG_FILE_PATH, DATA_INFO_PATH, DATA_FINISH_INFO_PATH, DATA_PATH:
            self._get_empty_folder(fp)

        arc = Arctic(ARCTIC_SRV)
        for doc_type in DOC_TYPE:
            # arc.initialize_library(get_library_name(doc_type), lib_type=CHUNK_STORE)
            arc.set_quota(get_library_name(doc_type), MAX_SIZE)
        arc.reset()

        # doc_type = 'Day'
        for doc_type in doc_type_l:
            date_list = list(set([int(path.split("_")[0]) for path in os.listdir(DATABASE_PATH) if doc_type in path]))
            date_list.sort()
            date_list = [str(date) for date in date_list]

            f = open(ALL_STOCK_PATH, "r")
            stock_name_list = [lines.split("\t")[0] for lines in f.readlines()]
            f.close()
            stock_name_dict = {
                "SH": [stock_name[2:] for stock_name in stock_name_list if "SH" in stock_name],
                "SZ": [stock_name[2:] for stock_name in stock_name_list if "SZ" in stock_name],
            }

            lib_name = get_library_name(doc_type)
            a = Arctic(ARCTIC_SRV)
            # a.initialize_library(lib_name, lib_type=CHUNK_STORE)

            stock_name_exist = a[lib_name].list_symbols()
            lib = a[lib_name]
            initialize_count = 0
            for stock_name in stock_name_list:
                if stock_name not in stock_name_exist:
                    initialize_count += 1
                    # A placeholder for stocks
                    pdf = pd.DataFrame(index=[pd.Timestamp("1900-01-01")])
                    pdf.index.name = "date"  # a column named date is necessary
                    lib.write(stock_name, pdf)
            print("initialize count: {}".format(initialize_count))
            print("tasks: {}".format(date_list))
            a.reset()

            # date_list = [files.split("_")[0] for files in os.listdir("./raw_data_price") if "tar" in files]
            # print(len(date_list))
            date_list = ["20201231"]  # for test
            Parallel(n_jobs=min(2, len(date_list)))(
                delayed(add_data)(date, doc_type, stock_name_dict) for date in date_list
            )


if __name__ == "__main__":
    fire.Fire(DSCreator)

View File

@@ -0,0 +1,312 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from arctic.arctic import Arctic
import qlib
from qlib.data import D
import unittest
class TestClass(unittest.TestCase):
"""
Useful commands
- run all tests: pytest examples/orderbook_data/example.py
- run a single test: pytest -s --pdb --disable-warnings examples/orderbook_data/example.py::TestClass::test_basic01
"""
def setUp(self):
"""
Configure for arctic
"""
provider_uri = "~/.qlib/qlib_data/yahoo_cn_1min"
qlib.init(
provider_uri=provider_uri,
mem_cache_size_limit=1024 ** 3 * 2,
mem_cache_type="sizeof",
kernels=1,
expression_provider={"class": "LocalExpressionProvider", "kwargs": {"time2idx": False}},
feature_provider={
"class": "ArcticFeatureProvider",
"module_path": "qlib.contrib.data.data",
"kwargs": {"uri": "127.0.0.1"},
},
dataset_provider={
"class": "LocalDatasetProvider",
"kwargs": {
"align_time": False, # Order book is not fixed, so it can't be align to a shared fixed frequency calendar
},
},
)
# self.stocks_list = ["SH600519"]
self.stocks_list = ["SZ000725"]
def test_basic(self):
# NOTE: this data contains a lot of zeros in $askX and $bidX
df = D.features(
self.stocks_list,
fields=["$ask1", "$ask2", "$bid1", "$bid2"],
freq="ticks",
start_time="20201230",
end_time="20210101",
)
print(df)
def test_basic_without_time(self):
df = D.features(self.stocks_list, fields=["$ask1"], freq="ticks")
print(df)
def test_basic01(self):
df = D.features(
self.stocks_list,
fields=["TResample($ask1, '1min', 'last')"],
freq="ticks",
start_time="20201230",
end_time="20210101",
)
print(df)
def test_basic02(self):
df = D.features(
self.stocks_list,
fields=["$function_code"],
freq="transaction",
start_time="20201230",
end_time="20210101",
)
print(df)
def test_basic03(self):
df = D.features(
self.stocks_list,
fields=["$function_code"],
freq="order",
start_time="20201230",
end_time="20210101",
)
print(df)
# Here are some popular expressions for high-frequency data
# 1) some shared expressions
expr_sum_buy_ask_1 = "(TResample($ask1, '1min', 'last') + TResample($bid1, '1min', 'last'))"
total_volume = (
"TResample("
+ "+".join([f"${name}{i}" for i in range(1, 11) for name in ["asize", "bsize"]])
+ ", '1min', 'sum')"
)
@staticmethod
def total_func(name, method):
return "TResample(" + "+".join([f"${name}{i}" for i in range(1, 11)]) + ",'1min', '{}')".format(method)
def test_exp_01(self):
exprs = []
names = []
for name in ["asize", "bsize"]:
for i in range(1, 11):
exprs.append(f"TResample(${name}{i}, '1min', 'mean') / ({self.total_volume})")
names.append(f"v_{name}_{i}")
df = D.features(self.stocks_list, fields=exprs, freq="ticks")
df.columns = names
print(df)
# 2) some expressions often used in papers
def test_exp_02(self):
spread_func = (
lambda index: f"2 * TResample($ask{index} - $bid{index}, '1min', 'last') / {self.expr_sum_buy_ask_1}"
)
mid_func = (
lambda index: f"2 * TResample(($ask{index} + $bid{index})/2, '1min', 'last') / {self.expr_sum_buy_ask_1}"
)
exprs = []
names = []
for i in range(1, 11):
exprs.extend([spread_func(i), mid_func(i)])
names.extend([f"p_spread_{i}", f"p_mid_{i}"])
df = D.features(self.stocks_list, fields=exprs, freq="ticks")
df.columns = names
print(df)
def test_exp_03(self):
expr3_func1 = (
lambda name, index_left, index_right: f"2 * TResample(Abs(${name}{index_left} - ${name}{index_right}), '1min', 'last') / {self.expr_sum_buy_ask_1}"
)
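# collect every adjacent-level difference instead of overwriting the lists on each iteration
exprs = []
names = []
for name in ["ask", "bid"]:
for i in range(1, 10):
exprs.append(expr3_func1(name, i + 1, i))
names.append(f"p_diff_{name}_{i}_{i+1}")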
exprs.extend([expr3_func1("ask", 10, 1), expr3_func1("bid", 1, 10)])
names.extend(["p_diff_ask_10_1", "p_diff_bid_1_10"])
df = D.features(self.stocks_list, fields=exprs, freq="ticks")
df.columns = names
print(df)
def test_exp_04(self):
exprs = []
names = []
for name in ["asize", "bsize"]:
exprs.append(f"(({ self.total_func(name, 'mean')}) / 10) / {self.total_volume}")
names.append(f"v_avg_{name}")
df = D.features(self.stocks_list, fields=exprs, freq="ticks")
df.columns = names
print(df)
def test_exp_05(self):
exprs = [
f"2 * Sub({ self.total_func('ask', 'last')}, {self.total_func('bid', 'last')})/{self.expr_sum_buy_ask_1}",
f"Sub({ self.total_func('asize', 'mean')}, {self.total_func('bsize', 'mean')})/{self.total_volume}",
]
names = ["p_accspread", "v_accspread"]
df = D.features(self.stocks_list, fields=exprs, freq="ticks")
df.columns = names
print(df)
# (p|v)_diff_(ask|bid|asize|bsize)_(time_interval)
def test_exp_06(self):
t = 3
expr6_price_func = (
lambda name, index, method: f'2 * (TResample(${name}{index}, "{t}s", "{method}") - Ref(TResample(${name}{index}, "{t}s", "{method}"), 1)) / {t}'
)
exprs = []
names = []
for i in range(1, 11):
for name in ["bid", "ask"]:
exprs.append(
f"TResample({expr6_price_func(name, i, 'last')}, '1min', 'mean') / {self.expr_sum_buy_ask_1}"
)
names.append(f"p_diff_{name}{i}_{t}s")
for i in range(1, 11):
for name in ["asize", "bsize"]:
exprs.append(f"TResample({expr6_price_func(name, i, 'mean')}, '1min', 'mean') / {self.total_volume}")
names.append(f"v_diff_{name}{i}_{t}s")
df = D.features(self.stocks_list, fields=exprs, freq="ticks")
df.columns = names
print(df)
# TODOs:
# Following expressions may be implemented in the future
# expr7_2 = lambda funccode, bsflag, time_interval: \
# "TResample(TRolling(TEq(@transaction.function_code, {}) & TEq(@transaction.bs_flag ,{}), '{}s', 'sum') / \
# TRolling(@transaction.function_code, '{}s', 'count') , '1min', 'mean')".format(ord(funccode), bsflag,time_interval,time_interval)
# create_dataset(7, "SH600000", [expr7_2("C")] + [expr7(funccode, ordercode) for funccode in ['B','S'] for ordercode in ['0','1']])
# create_dataset(7, ["SH600000"], [expr7_2("C", 48)] )
@staticmethod
def expr7_init(funccode, ordercode, time_interval):
# NOTE: based on order frequency (i.e. freq="order")
return f"Rolling(Eq($function_code, {ord(funccode)}) & Eq($order_kind ,{ord(ordercode)}), '{time_interval}s', 'sum') / Rolling($function_code, '{time_interval}s', 'count')"
# (la|lb|ma|mb|ca|cb)_intensity_(time_interval)
def test_exp_07_1(self):
# NOTE: based on transaction frequency (i.e. freq="transaction")
expr7_3 = (
lambda funccode, code, time_interval: f"TResample(Rolling(Eq($function_code, {ord(funccode)}) & {code}($ask_order, $bid_order) , '{time_interval}s', 'sum') / Rolling($function_code, '{time_interval}s', 'count') , '1min', 'mean')"
)
exprs = [expr7_3("C", "Gt", "3"), expr7_3("C", "Lt", "3")]
names = ["ca_intensity_3s", "cb_intensity_3s"]
df = D.features(self.stocks_list, fields=exprs, freq="transaction")
df.columns = names
print(df)
trans_dict = {"B": "a", "S": "b", "0": "l", "1": "m"}
def test_exp_07_2(self):
# NOTE: based on order frequency
expr7 = (
lambda funccode, ordercode, time_interval: f"TResample({self.expr7_init(funccode, ordercode, time_interval)}, '1min', 'mean')"
)
exprs = []
names = []
for funccode in ["B", "S"]:
for ordercode in ["0", "1"]:
exprs.append(expr7(funccode, ordercode, "3"))
names.append(self.trans_dict[ordercode] + self.trans_dict[funccode] + "_intensity_3s")
df = D.features(self.stocks_list, fields=exprs, freq="order")
df.columns = names
print(df)
@staticmethod
def expr7_3_init(funccode, code, time_interval):
# NOTE: It depends on transaction frequency
return f"Rolling(Eq($function_code, {ord(funccode)}) & {code}($ask_order, $bid_order) , '{time_interval}s', 'sum') / Rolling($function_code, '{time_interval}s', 'count')"
# (la|lb|ma|mb|ca|cb)_relative_intensity_(time_interval_small)_(time_interval_big)
def test_exp_08_1(self):
expr8_1 = (
lambda funccode, ordercode, time_interval_short, time_interval_long: f"TResample(Gt({self.expr7_init(funccode, ordercode, time_interval_short)},{self.expr7_init(funccode, ordercode, time_interval_long)}), '1min', 'mean')"
)
exprs = []
names = []
for funccode in ["B", "S"]:
for ordercode in ["0", "1"]:
exprs.append(expr8_1(funccode, ordercode, "10", "900"))
names.append(self.trans_dict[ordercode] + self.trans_dict[funccode] + "_relative_intensity_10s_900s")
df = D.features(self.stocks_list, fields=exprs, freq="order")
df.columns = names
print(df)
def test_exp_08_2(self):
# NOTE: It depends on transaction frequency
expr8_2 = (
lambda funccode, ordercode, time_interval_short, time_interval_long: f"TResample(Gt({self.expr7_3_init(funccode, ordercode, time_interval_short)},{self.expr7_3_init(funccode, ordercode, time_interval_long)}), '1min', 'mean')"
)
exprs = [expr8_2("C", "Gt", "10", "900"), expr8_2("C", "Lt", "10", "900")]
names = ["ca_relative_intensity_10s_900s", "cb_relative_intensity_10s_900s"]
df = D.features(self.stocks_list, fields=exprs, freq="transaction")
df.columns = names
print(df)
## v9(la|lb|ma|mb|ca|cb)_diff_intensity_(time_interval1)_(time_interval2)
# 1) calculating the original data
# 2) Resample data to 3s and calculate the changing rate
# 3) Resample data to 1min
def test_exp_09_trans(self):
exprs = [
f'TResample(Div(Sub(TResample({self.expr7_3_init("C", "Gt", "3")}, "3s", "last"), Ref(TResample({self.expr7_3_init("C", "Gt", "3")}, "3s","last"), 1)), 3), "1min", "mean")',
f'TResample(Div(Sub(TResample({self.expr7_3_init("C", "Lt", "3")}, "3s", "last"), Ref(TResample({self.expr7_3_init("C", "Lt", "3")}, "3s","last"), 1)), 3), "1min", "mean")',
]
names = ["ca_diff_intensity_3s_3s", "cb_diff_intensity_3s_3s"]
df = D.features(self.stocks_list, fields=exprs, freq="transaction")
df.columns = names
print(df)
def test_exp_09_order(self):
exprs = []
names = []
for funccode in ["B", "S"]:
for ordercode in ["0", "1"]:
exprs.append(
f'TResample(Div(Sub(TResample({self.expr7_init(funccode, ordercode, "3")}, "3s", "last"), Ref(TResample({self.expr7_init(funccode, ordercode, "3")},"3s", "last"), 1)), 3) ,"1min", "mean")'
)
names.append(self.trans_dict[ordercode] + self.trans_dict[funccode] + "_diff_intensity_3s_3s")
df = D.features(self.stocks_list, fields=exprs, freq="order")
df.columns = names
print(df)
def test_exp_10(self):
exprs = []
names = []
for i in [5, 10, 30, 60]:
exprs.append(
f'TResample(Ref(TResample($ask1 + $bid1, "1s", "ffill"), {-i}) / TResample($ask1 + $bid1, "1s", "ffill") - 1, "1min", "mean" )'
)
names.append(f"lag_{i}_change_rate")
df = D.features(self.stocks_list, fields=exprs, freq="ticks")
df.columns = names
print(df)
if __name__ == "__main__":
unittest.main()

View File

@@ -0,0 +1,46 @@
# Portfolio Optimization Strategy
## Introduction
In `qlib/examples/benchmarks` we have various **alpha** models that predict
stock returns. We also use a simple rule-based `TopkDropoutStrategy` to
evaluate the investing performance of these models. However, such a strategy
is too simple to control portfolio risks such as correlation and volatility.
To this end, an optimization-based strategy should be used to manage the
trade-off between return and risk. In this doc, we will show how to use
`EnhancedIndexingStrategy` to maximize portfolio return while minimizing
tracking error relative to a benchmark.
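To make the trade-off concrete, here is a toy numerical sketch of the kind of objective an enhanced indexing strategy balances: expected active return minus a penalty on squared tracking error. It is not Qlib's optimizer. The function name, the `lamb` risk-aversion parameter, and all numbers are assumptions made up for illustration.

```python
import numpy as np


def enhanced_indexing_objective(w, w_bench, exp_ret, cov, lamb=1.0):
    """Illustrative objective: active return minus lamb * tracking variance."""
    d = np.asarray(w, dtype=float) - np.asarray(w_bench, dtype=float)  # active weights
    exp_ret = np.asarray(exp_ret, dtype=float)
    cov = np.asarray(cov, dtype=float)
    active_return = float(exp_ret @ d)  # expected return relative to the benchmark
    tracking_var = float(d @ cov @ d)  # squared tracking error
    return active_return - lamb * tracking_var, float(np.sqrt(tracking_var))


# Hypothetical three-stock universe
w_bench = np.array([0.5, 0.3, 0.2])  # benchmark weights
w = np.array([0.6, 0.3, 0.1])  # portfolio weights
exp_ret = np.array([0.02, 0.01, 0.00])  # predicted returns from an alpha model
cov = np.diag([0.04, 0.09, 0.16])  # return covariance from a risk model

score, tracking_error = enhanced_indexing_objective(w, w_bench, exp_ret, cov)
print(score, tracking_error)
```

A larger `lamb` pushes the portfolio toward the benchmark weights, while a smaller one lets the alpha predictions dominate.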
## Preparation
We use China stock market data for our example.
1. Prepare CSI300 weight:
```bash
wget http://fintech.msra.cn/stock_data/downloads/csi300_weight.zip
unzip -d ~/.qlib/qlib_data/cn_data csi300_weight.zip
rm -f csi300_weight.zip
```
2. Prepare risk model data:
```bash
python prepare_riskdata.py
```
Here we use a **Statistical Risk Model** implemented in `qlib.model.riskmodel`.
However, users are strongly encouraged to use other risk models for better quality:
* **Fundamental Risk Model** like MSCI BARRA
* [Deep Risk Model](https://arxiv.org/abs/2107.05201)
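Whichever risk model is used, `prepare_riskdata.py` stores three pieces per date: factor exposures, factor covariance, and specific risk (saved as volatility). As a rough sketch, assuming the usual factor-model structure (covariance = exposures × factor covariance × exposures transposed, plus a diagonal of specific variances), the full covariance can be rebuilt as follows. The helper name and the numbers are hypothetical.

```python
import numpy as np


def full_covariance(factor_exp, factor_cov, specific_risk):
    """Rebuild an N x N covariance from factor-model components.

    factor_exp:    N x K factor exposures
    factor_cov:    K x K factor covariance
    specific_risk: length-N specific volatility (not variance)
    """
    F = np.asarray(factor_exp, dtype=float)
    cov_b = np.asarray(factor_cov, dtype=float)
    var_u = np.asarray(specific_risk, dtype=float) ** 2  # volatility -> variance
    return F @ cov_b @ F.T + np.diag(var_u)


# Hypothetical example: 3 stocks, 2 factors
factor_exp = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
factor_cov = np.array([[0.04, 0.0], [0.0, 0.01]])
specific_risk = np.array([0.1, 0.2, 0.3])

cov = full_covariance(factor_exp, factor_cov, specific_risk)
print(cov)
```

Storing the decomposed pieces rather than the full matrix keeps the data small when the number of stocks is much larger than the number of factors.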
## End-to-End Workflow
You can run the whole workflow with `EnhancedIndexingStrategy` by running
`qrun config_enhanced_indexing.yaml`.
In this config, we mainly changed the strategy section compared to
`qlib/examples/benchmarks/workflow_config_lightgbm_Alpha158.yaml`.
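
The difference can be pictured as swapping one section of a nested config while leaving the rest untouched. The sketch below uses plain dictionaries. The `swap_strategy` helper is hypothetical, and the baseline `TopkDropoutStrategy` kwargs (`topk`, `n_drop`) are assumed for illustration rather than copied from the benchmark file.

```python
import copy

# Strategy section as written in config_enhanced_indexing.yaml
enhanced_indexing_strategy = {
    "class": "EnhancedIndexingStrategy",
    "module_path": "qlib.contrib.strategy",
    "kwargs": {
        "model": "<MODEL>",
        "dataset": "<DATASET>",
        "riskmodel_root": "./riskdata",
    },
}


def swap_strategy(port_analysis_config, new_strategy):
    """Return a copy of the config with only the strategy section replaced."""
    updated = copy.deepcopy(port_analysis_config)
    updated["strategy"] = copy.deepcopy(new_strategy)
    return updated


# Hypothetical baseline config; the strategy kwargs here are assumptions
baseline = {
    "strategy": {
        "class": "TopkDropoutStrategy",
        "module_path": "qlib.contrib.strategy",
        "kwargs": {"model": "<MODEL>", "dataset": "<DATASET>", "topk": 50, "n_drop": 5},
    },
    "backtest": {
        "start_time": "2017-01-01",
        "end_time": "2020-08-01",
        "account": 100000000,
        "benchmark": "SH000300",
    },
}

config = swap_strategy(baseline, enhanced_indexing_strategy)
print(config["strategy"]["class"])
```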

View File

@@ -0,0 +1,71 @@
qlib_init:
provider_uri: "~/.qlib/qlib_data/cn_data"
region: cn
market: &market csi300
benchmark: &benchmark SH000300
data_handler_config: &data_handler_config
start_time: 2008-01-01
end_time: 2020-08-01
fit_start_time: 2008-01-01
fit_end_time: 2014-12-31
instruments: *market
port_analysis_config: &port_analysis_config
strategy:
class: EnhancedIndexingStrategy
module_path: qlib.contrib.strategy
kwargs:
model: <MODEL>
dataset: <DATASET>
riskmodel_root: ./riskdata
backtest:
start_time: 2017-01-01
end_time: 2020-08-01
account: 100000000
benchmark: *benchmark
exchange_kwargs:
limit_threshold: 0.095
deal_price: close
open_cost: 0.0005
close_cost: 0.0015
min_cost: 5
task:
model:
class: LGBModel
module_path: qlib.contrib.model.gbdt
kwargs:
loss: mse
colsample_bytree: 0.8879
learning_rate: 0.2
subsample: 0.8789
lambda_l1: 205.6999
lambda_l2: 580.9768
max_depth: 8
num_leaves: 210
num_threads: 20
dataset:
class: DatasetH
module_path: qlib.data.dataset
kwargs:
handler:
class: Alpha158
module_path: qlib.contrib.data.handler
kwargs: *data_handler_config
segments:
train: [2008-01-01, 2014-12-31]
valid: [2015-01-01, 2016-12-31]
test: [2017-01-01, 2020-08-01]
record:
- class: SignalRecord
module_path: qlib.workflow.record_temp
kwargs:
model: <MODEL>
dataset: <DATASET>
- class: SigAnaRecord
module_path: qlib.workflow.record_temp
kwargs:
ana_long_short: False
ann_scaler: 252
- class: PortAnaRecord
module_path: qlib.workflow.record_temp
kwargs:
config: *port_analysis_config

View File

@@ -0,0 +1,55 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import os
import numpy as np
import pandas as pd
from qlib.data import D
from qlib.model.riskmodel import StructuredCovEstimator
def prepare_data(riskdata_root="./riskdata", T=240, start_time="2016-01-01"):
universe = D.features(D.instruments("csi300"), ["$close"], start_time=start_time).swaplevel().sort_index()
price_all = (
D.features(D.instruments("all"), ["$close"], start_time=start_time).squeeze().unstack(level="instrument")
)
# StructuredCovEstimator is a statistical risk model
riskmodel = StructuredCovEstimator()
for i in range(T - 1, len(price_all)):
date = price_all.index[i]
ref_date = price_all.index[i - T + 1]
print(date)
codes = universe.loc[date].index
price = price_all.loc[ref_date:date, codes]
# calculate return and remove extreme return
ret = price.pct_change()
ret.clip(ret.quantile(0.025), ret.quantile(0.975), axis=1, inplace=True)
# run risk model
F, cov_b, var_u = riskmodel.predict(ret, is_price=False, return_decomposed_components=True)
# save risk data
root = riskdata_root + "/" + date.strftime("%Y%m%d")
os.makedirs(root, exist_ok=True)
pd.DataFrame(F, index=codes).to_pickle(root + "/factor_exp.pkl")
pd.DataFrame(cov_b).to_pickle(root + "/factor_cov.pkl")
# for specific_risk we follow the convention to save volatility
pd.Series(np.sqrt(var_u), index=codes).to_pickle(root + "/specific_risk.pkl")
if __name__ == "__main__":
import qlib
qlib.init(provider_uri="~/.qlib/qlib_data/cn_data")
prepare_data()

View File

@@ -6,7 +6,7 @@ import fire
import pickle
from datetime import datetime
from qlib.config import REG_CN
from qlib.constant import REG_CN
from qlib.data.dataset.handler import DataHandlerLP
from qlib.utils import init_instance_by_config
from qlib.tests.data import GetData

View File

@@ -20,7 +20,6 @@ from operator import xor
from pprint import pprint
import qlib
from qlib.config import REG_CN
from qlib.workflow import R
from qlib.tests.data import GetData
@@ -187,7 +186,7 @@ def gen_and_save_md_table(metrics, dataset):
# read yaml, remove seed kwargs of model, and then save file in the temp_dir
def gen_yaml_file_without_seed_kwargs(yaml_path, temp_dir):
with open(yaml_path, "r") as fp:
config = yaml.load(fp)
config = yaml.safe_load(fp)
try:
del config["task"]["model"]["kwargs"]["seed"]
except KeyError:

View File

@@ -61,7 +61,7 @@
"\n",
"import qlib\n",
"import pandas as pd\n",
"from qlib.config import REG_CN\n",
"from qlib.constant import REG_CN\n",
"from qlib.utils import exists_qlib_data, init_instance_by_config\n",
"from qlib.workflow import R\n",
"from qlib.workflow.record_temp import SignalRecord, PortAnaRecord\n",

View File

@@ -2,7 +2,7 @@
# Licensed under the MIT License.
import qlib
from qlib.config import REG_CN
from qlib.constant import REG_CN
from qlib.utils import init_instance_by_config, flatten_dict
from qlib.workflow import R
from qlib.workflow.record_temp import SignalRecord, PortAnaRecord, SigAnaRecord

View File

@@ -2,7 +2,7 @@
# Licensed under the MIT License.
from pathlib import Path
__version__ = "0.8.0.99"
__version__ = "0.8.3"
__version__bak = __version__ # This version is backup for QlibConfig.reset_qlib_version
import os
from typing import Union
@@ -12,18 +12,23 @@ import platform
import subprocess
from .log import get_module_logger
# init qlib
def init(default_conf="client", **kwargs):
"""
Parameters
----------
default_conf: str
the default value is client. Accepted values: client/server.
**kwargs :
clear_mem_cache: str
the default value is True;
Will the memory cache be clear.
It is often used to improve performance when init will be called for multiple times
skip_if_reg: bool: str
the default value is True;
When using the recorder, skip_if_reg can set to True to avoid loss of recorder.
"""
from .config import C
from .data.cache import H
@@ -57,7 +62,7 @@ def init(default_conf="client", **kwargs):
else:
logger.warning(f"auto_path is False, please make sure {mount_path} is mounted")
elif uri_type == C.NFS_URI:
_mount_nfs_uri(provider_uri, mount_path, C["auto_mount"])
_mount_nfs_uri(provider_uri, C.dpm.get_data_uri(_freq), C["auto_mount"])
else:
raise NotImplementedError(f"This type of URI is not supported")
@@ -90,7 +95,7 @@ def _mount_nfs_uri(provider_uri, mount_path, auto_mount: bool = False):
sys_type = platform.system()
if "win" in sys_type.lower():
# system: window
exec_result = os.popen("mount -o anon %s %s" % (provider_uri, mount_path + ":"))
exec_result = os.popen(f"mount -o anon {provider_uri} {mount_path}")
result = exec_result.read()
if "85" in result:
LOG.warning(f"{provider_uri} on Windows:{mount_path} is already mounted")
@@ -180,7 +185,7 @@ def get_project_path(config_name="config.yaml", cur_path: Union[Path, str, None]
- There is a file named `config.yaml` in qlib.
For example:
If your project file system stucuture follows such a pattern
If your project file system structure follows such a pattern
<project_path>/
- config.yaml
@@ -225,7 +230,7 @@ def auto_init(**kwargs):
Here are two examples of the configuration
Example 1)
If you want create a new project-specific config based on a shared configure, you can use `conf_type: ref`
If you want to create a new project-specific config based on a shared configure, you can use `conf_type: ref`
.. code-block:: yaml
@@ -241,7 +246,7 @@ def auto_init(**kwargs):
default_exp_name: "Experiment"
Example 2)
If you wan to create simple a stand alone config, you can use following config(a.k.a `conf_type: origin`)
If you want to create a simple standalone config, you can use the following config (a.k.a. `conf_type: origin`)
.. code-block:: python
@@ -271,8 +276,8 @@ def auto_init(**kwargs):
init_from_yaml_conf(conf_pp, **kwargs)
elif conf_type == "ref":
# This config type will be more convenient in following scenario
# - There is a shared configure file and you don't want to edit it inplace.
# - The shared configure may be updated later and you don't want to copy it.
# - There is a shared configure file, and you don't want to edit it inplace.
# - The shared configure may be updated later, and you don't want to copy it.
# - You have some customized config.
qlib_conf_path = conf.get("qlib_cfg", None)

View File

@@ -31,7 +31,7 @@ rtn & earning in the Account
class AccumulatedInfo:
"""
accumulated trading info, including accumulated return/cost/turnover
AccumulatedInfo should be shared accross different levels
AccumulatedInfo should be shared across different levels
"""
def __init__(self):
@@ -199,7 +199,7 @@ class Account:
# if stock is sold out, no stock price information in Position, then we should update account first, then update current position
# if stock is bought, there is no stock in current position, update current, then update account
# The cost will be substracted from the cash at last. So the trading logic can ignore the cost calculation
# The cost will be subtracted from the cash at last. So the trading logic can ignore the cost calculation
if order.direction == Order.SELL:
# sell stock
self._update_state_from_order(order, trade_val, cost, trade_price)
@@ -378,7 +378,7 @@ class Account:
)
def get_portfolio_metrics(self):
"""get the history portfolio_metrics and postions instance"""
"""get the history portfolio_metrics and positions instance"""
if self.is_port_metr_enabled():
_portfolio_metrics = self.portfolio_metrics.generate_portfolio_metrics_dataframe()
_positions = self.get_hist_positions()

View File

@@ -13,7 +13,7 @@ from tqdm.auto import tqdm
def backtest_loop(start_time, end_time, trade_strategy: BaseStrategy, trade_executor: BaseExecutor):
"""backtest funciton for the interaction of the outermost strategy and executor in the nested decision execution
"""backtest function for the interaction of the outermost strategy and executor in the nested decision execution
please refer to the docs of `collect_data_loop`

View File

@@ -505,8 +505,8 @@ class BaseTradeDecision:
`inner_trade_decision` will be changed **inplaced**.
Motivation of the `mod_inner_decision`
- Leave a hook for outer decision to affact the decision generated by the inner strategy
- e.g. the outmost strategy generate a time range for trading. But the upper layer can only affact the
- Leave a hook for outer decision to affect the decision generated by the inner strategy
- e.g. the outmost strategy generate a time range for trading. But the upper layer can only affect the
nearest layer in the original design. With `mod_inner_decision`, the decision can passed through multiple
layers

View File

@@ -14,7 +14,8 @@ import numpy as np
import pandas as pd
from ..data.data import D
from ..config import C, REG_CN
from ..config import C
from ..constant import REG_CN
from ..log import get_module_logger
from .decision import Order, OrderDir, OrderHelper
from .high_performance_ds import BaseQuote, PandasQuote, NumpyQuote
@@ -103,7 +104,7 @@ class Exchange:
Necessary fields:
$close is for calculating the total value at end of each day.
Optional fields:
$volume is only necessary when we limit the trade amount or caculate PA(vwap) indicator
$volume is only necessary when we limit the trade amount or calculate PA(vwap) indicator
$vwap is only necessary when we use the $vwap price as the deal price
$factor is for rounding to the trading unit
limit_sell will be set to False by default(False indicates we can sell this
@@ -505,7 +506,7 @@ class Exchange:
Note: some future information is used in this function
Parameter:
target_position : dict { stock_id : amount }
current_postion : dict { stock_id : amount}
current_position : dict { stock_id : amount}
trade_unit : trade_unit
down sample : for amount 321 and trade_unit 100, deal_amount is 300
deal order on trade_date
@@ -535,7 +536,7 @@ class Exchange:
deal_amount = self.get_real_deal_amount(current_amount, target_amount, factor)
if deal_amount == 0:
continue
elif deal_amount > 0:
if deal_amount > 0:
# buy stock
buy_order_list.append(
Order(
@@ -686,9 +687,7 @@ class Exchange:
orig_deal_amount = order.deal_amount
order.deal_amount = max(min(vol_limit_min, orig_deal_amount), 0)
if vol_limit_min < orig_deal_amount:
self.logger.debug(
f"Order clipped due to volume limitation: {order}, {[(vol, rule) for vol, rule in zip(vol_limit_num, vol_limit)]}"
)
self.logger.debug(f"Order clipped due to volume limitation: {order}, {list(zip(vol_limit_num, vol_limit))}")
def _get_buy_amount_by_cash_limit(self, trade_price, cash, cost_ratio):
"""return the real order amount after cash limit for buying.

View File

@@ -41,7 +41,7 @@ class BaseExecutor:
Parameters
----------
time_per_step : str
trade time per trading step, used for genreate the trade calendar
trade time per trading step, used for generate the trade calendar
show_indicator: bool, optional
whether to show indicators, :
- 'pa', the price advantage
@@ -118,7 +118,7 @@ class BaseExecutor:
self.dealt_order_amount = defaultdict(float)
self.deal_day = None
def reset_common_infra(self, common_infra):
def reset_common_infra(self, common_infra, copy_trade_account=False):
"""
reset infrastructure for trading
- reset trade_account
@@ -129,9 +129,14 @@ class BaseExecutor:
self.common_infra.update(common_infra)
if common_infra.has("trade_account"):
# NOTE: there is a trick in the code.
# shallow copy is used instead of deepcopy. So positions are shared
self.trade_account: Account = copy.copy(common_infra.get("trade_account"))
if copy_trade_account:
# NOTE: there is a trick in the code.
# shallow copy is used instead of deepcopy.
# 1. So positions are shared
# 2. Others are not shared, so each level has it own metrics (portfolio and trading metrics)
self.trade_account: Account = copy.copy(common_infra.get("trade_account"))
else:
self.trade_account = common_infra.get("trade_account")
self.trade_account.reset(freq=self.time_per_step, port_metr_enabled=self.generate_portfolio_metrics)
@property
@@ -189,7 +194,7 @@ class BaseExecutor:
return return_value.get("execute_result")
@abstractclassmethod
def _collect_data(self, trade_decision: BaseTradeDecision, level: int = 0) -> Tuple[List[object], dict]:
def _collect_data(cls, trade_decision: BaseTradeDecision, level: int = 0) -> Tuple[List[object], dict]:
"""
Please refer to the doc of collect_data
The only difference between `_collect_data` and `collect_data` is that some common steps are moved into
@@ -342,14 +347,18 @@ class NestedExecutor(BaseExecutor):
**kwargs,
)
def reset_common_infra(self, common_infra):
def reset_common_infra(self, common_infra, copy_trade_account=False):
"""
reset infrastructure for trading
- reset inner_strategyand inner_executor common infra
"""
super(NestedExecutor, self).reset_common_infra(common_infra)
# NOTE: please refer to the docs of BaseExecutor.reset_common_infra for the meaning of `copy_trade_account`
self.inner_executor.reset_common_infra(common_infra)
# The first level follow the `copy_trade_account` from the upper level
super(NestedExecutor, self).reset_common_infra(common_infra, copy_trade_account=copy_trade_account)
# The lower level have to copy the trade_account
self.inner_executor.reset_common_infra(common_infra, copy_trade_account=True)
self.inner_strategy.reset_common_infra(common_infra)
def _init_sub_trading(self, trade_decision):
@@ -360,12 +369,12 @@ class NestedExecutor(BaseExecutor):
self.inner_strategy.reset(level_infra=sub_level_infra, outer_trade_decision=trade_decision)
def _update_trade_decision(self, trade_decision: BaseTradeDecision) -> BaseTradeDecision:
# outter strategy have chance to update decision each iterator
# outer strategy have chance to update decision each iterator
updated_trade_decision = trade_decision.update(self.inner_executor.trade_calendar)
if updated_trade_decision is not None:
trade_decision = updated_trade_decision
# NEW UPDATE
# create a hook for inner strategy to update outter decision
# create a hook for inner strategy to update outer decision
self.inner_strategy.alter_outer_trade_decision(trade_decision)
return trade_decision

View File

@@ -400,7 +400,7 @@ class BaseOrderIndicator:
indicators : List[BaseOrderIndicator]
the list of all inner indicators.
metrics : Union[str, List[str]]
all metrics needs ot be sumed.
all metrics that need to be summed.
fill_value : float, optional
fill np.NaN with value. By default None.
"""

View File

@@ -20,7 +20,7 @@ class BasePosition:
Please refer to the `Position` class for the position
"""
def __init__(self, cash=0.0, *args, **kwargs):
def __init__(self, *args, cash=0.0, **kwargs):
self._settle_type = self.ST_NO
def skip_update(self) -> bool:
@@ -152,7 +152,7 @@ class BasePosition:
"""
generate stock weight dict {stock_id : value weight of stock in the position}
it is meaningful in the beginning or the end of each trade step
- During execution of each trading step, the weight may be not consistant with the portfolio value
- During execution of each trading step, the weight may be not consistent with the portfolio value
Parameters
----------

View File

@@ -39,7 +39,7 @@ def get_benchmark_weight(
if not path:
path = Path(C.dpm.get_data_uri(freq)).expanduser() / "raw" / "AIndexMembers" / "weights.csv"
# TODO: the storage of weights should be implemented in a more elegent way
# TODO: The benchmark is not consistant with the filename in instruments.
# TODO: The benchmark is not consistent with the filename in instruments.
bench_weight_df = pd.read_csv(path, usecols=["code", "date", "index", "weight"])
bench_weight_df = bench_weight_df[bench_weight_df["index"] == bench]
bench_weight_df["date"] = pd.to_datetime(bench_weight_df["date"])
@@ -156,16 +156,16 @@ def decompose_portofolio(stock_weight_df, stock_group_df, stock_ret_df):
group_weight, stock_weight_in_group = decompose_portofolio_weight(stock_weight_df, stock_group_df)
group_ret = {}
for group_key in stock_weight_in_group:
stock_weight_in_group_start_date = min(stock_weight_in_group[group_key].index)
stock_weight_in_group_end_date = max(stock_weight_in_group[group_key].index)
for group_key, val in stock_weight_in_group.items():
stock_weight_in_group_start_date = min(val.index)
stock_weight_in_group_end_date = max(val.index)
temp_stock_ret_df = stock_ret_df[
(stock_ret_df.index >= stock_weight_in_group_start_date)
& (stock_ret_df.index <= stock_weight_in_group_end_date)
]
group_ret[group_key] = (temp_stock_ret_df * stock_weight_in_group[group_key]).sum(axis=1)
group_ret[group_key] = (temp_stock_ret_df * val).sum(axis=1)
# If no weight is assigned, then the return of group will be np.nan
group_ret[group_key][group_weight[group_key] == 0.0] = np.nan

View File

@@ -73,7 +73,7 @@ class PortfolioMetrics:
self.init_bench(freq=freq, benchmark_config=benchmark_config)
def init_vars(self):
self.accounts = OrderedDict() # account postion value for each trade time
self.accounts = OrderedDict() # account position value for each trade time
self.returns = OrderedDict() # daily return rate for each trade time
self.total_turnovers = OrderedDict() # total turnover for each trade time
self.turnovers = OrderedDict() # turnover for each trade time
@@ -212,7 +212,8 @@ class PortfolioMetrics:
path: str/ pathlib.Path()
"""
path = pathlib.Path(path)
r = pd.read_csv(open(path, "rb"), index_col=0)
with path.open("rb") as f:
r = pd.read_csv(f, index_col=0)
r.index = pd.DatetimeIndex(r.index)
index = r.index
@@ -236,7 +237,7 @@ class Indicator:
"""
`Indicator` is implemented in a aggregate way.
All the metrics are calculated aggregately.
All the metrics are calculated for a seperated stock and in a specific step on a specific level.
All the metrics are calculated for a separated stock and in a specific step on a specific level.
| indicator | desc. |
|--------------+--------------------------------------------------------------|

View File

@@ -93,7 +93,7 @@ class TradeCalendarManager:
About the endpoints:
- Qlib uses the closed interval in time-series data selection, which has the same performance as pandas.Series.loc
# - The returned right endpoints should minus 1 seconds becasue of the closed interval representation in Qlib.
# - The returned right endpoints should minus 1 seconds because of the closed interval representation in Qlib.
# Note: Qlib supports up to minutely decision execution, so 1 seconds is less than any trading time interval.
Parameters
@@ -205,10 +205,7 @@ class BaseInfrastructure:
warnings.warn(f"infra {infra_name} is not found!")
def has(self, infra_name):
if infra_name in self.get_support_infra() and hasattr(self, infra_name):
return True
else:
return False
return infra_name in self.get_support_infra() and hasattr(self, infra_name)
def update(self, other):
support_infra = other.get_support_infra()

View File

@@ -4,7 +4,7 @@
About the configs
=================
The config will based on _default_config.
The config will be based on _default_config.
Two modes are supported
- client
- server
@@ -19,16 +19,18 @@ import logging
import platform
import multiprocessing
from pathlib import Path
from typing import Optional, Union
from typing import Callable, Optional, Union
from typing import TYPE_CHECKING
from qlib.constant import REG_CN, REG_US
if TYPE_CHECKING:
from qlib.utils.time import Freq
class Config:
def __init__(self, default_conf):
self.__dict__["_default_config"] = copy.deepcopy(default_conf) # avoiding conflictions with __getattr__
self.__dict__["_default_config"] = copy.deepcopy(default_conf) # avoiding conflicts with __getattr__
self.reset()
def __getitem__(self, key):
@@ -38,7 +40,7 @@ class Config:
if attr in self.__dict__["_config"]:
return self.__dict__["_config"][attr]
raise AttributeError(f"No such {attr} in self._config")
raise AttributeError(f"No such `{attr}` in self._config")
def get(self, key, default=None):
return self.__dict__["_config"].get(key, default)
@@ -74,10 +76,6 @@ class Config:
self.update(**config_c.__dict__["_config"])
# REGION CONST
REG_CN = "cn"
REG_US = "us"
# pickle.dump protocol version: https://docs.python.org/3/library/pickle.html#data-stream-format
PROTOCOL_VERSION = 4
@@ -114,6 +112,8 @@ _default_config = {
"calendar_cache": None,
# for simple dataset cache
"local_cache_path": None,
# kernels can be a fixed value or a callable function like `def (freq: str) -> int`
# If the kernels are arctic_kernels, `min(NUM_USABLE_CPU, 30)` may be a good value
"kernels": NUM_USABLE_CPU,
# pickle.dump protocol version
"dump_protocol_version": PROTOCOL_VERSION,
@@ -123,11 +123,10 @@ _default_config = {
"joblib_backend": "multiprocessing",
"default_disk_cache": 1, # 0:skip/1:use
"mem_cache_size_limit": 500,
"mem_cache_limit_type": "length",
# memory cache expire second, only in used 'DatasetURICache' and 'client D.calendar'
# default 1 hour
"mem_cache_expire": 60 * 60,
# memory cache space limit, default 5GB, only in used client
"mem_cache_space_limit": 1024 * 1024 * 1024 * 5,
# cache dir name
"dataset_cache_dir_name": "dataset_cache",
"features_cache_dir_name": "features_cache",
@@ -217,8 +216,9 @@ MODE_CONF = {
"provider_uri": "~/.qlib/qlib_data/cn_data",
# cache
# Using parameter 'remote' to announce the client is using server_cache, and the writing access will be disabled.
"expression_cache": DISK_EXPRESSION_CACHE,
"dataset_cache": DISK_DATASET_CACHE,
# Disable cache by default. Avoid introducing advanced features for beginners
"expression_cache": None,
"dataset_cache": None,
# SimpleDatasetCache directory
"local_cache_path": Path("~/.cache/qlib_simple_cache").expanduser().resolve(),
"calendar_cache": None,
@@ -240,7 +240,7 @@ MODE_CONF = {
}
HIGH_FREQ_CONFIG = {
"provider_uri": "~/.qlib/qlib_data/yahoo_cn_1min",
"provider_uri": "~/.qlib/qlib_data/cn_data_1min",
"dataset_cache": None,
"expression_cache": "DiskExpressionCache",
"region": REG_CN,
@@ -271,7 +271,19 @@ class QlibConfig(Config):
self._registered = False
class DataPathManager:
"""
Motivation:
- get the right path (e.g. data uri) for accessing data based on given information(e.g. provider_uri, mount_path and frequency)
- some helper functions to process uri.
"""
def __init__(self, provider_uri: Union[str, Path, dict], mount_path: Union[str, Path, dict]):
"""
The relation of `provider_uri` and `mount_path`
- `mount_path` is used only if provider_uri is an NFS path
- otherwise, provider_uri will be used for accessing data
"""
self.provider_uri = provider_uri
self.mount_path = mount_path
@@ -302,6 +314,9 @@ class QlibConfig(Config):
return QlibConfig.LOCAL_URI
def get_data_uri(self, freq: Optional[Union[str, Freq]] = None) -> Path:
"""
please refer to DataPathManager's __init__ and class doc
"""
if freq is not None:
freq = str(freq) # converting Freq to string
if freq is None or freq not in self.provider_uri:
@@ -312,7 +327,8 @@ class QlibConfig(Config):
elif self.get_uri_type(_provider_uri) == QlibConfig.NFS_URI:
if "win" in platform.system().lower():
# windows, mount_path is the drive
return Path(f"{self.mount_path[freq]}:\\")
_path = str(self.mount_path[freq])
return Path(f"{_path}:\\") if ":" not in _path else Path(_path)
return Path(self.mount_path[freq])
else:
raise NotImplementedError(f"This type of uri is not supported")
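The Windows branch above accepts either a bare drive letter or a value that already contains a colon. A minimal sketch of that rule follows. `windows_mount_root` is a hypothetical helper name, and `PureWindowsPath` is used here (instead of `Path`, as in the diff) only so the sketch can run on any operating system:

```python
from pathlib import PureWindowsPath


def windows_mount_root(mount_path: str) -> PureWindowsPath:
    """Hypothetical helper mirroring the Windows branch of `get_data_uri`."""
    if ":" not in mount_path:
        # A bare drive letter such as "Z" becomes the drive root "Z:\".
        return PureWindowsPath(f"{mount_path}:\\")
    # A value that already names a drive (or a full path) is used unchanged.
    return PureWindowsPath(mount_path)
```

With this rule, both `"Z"` and `"Z:\\"` resolve to the same drive root, which is the case the old single-line `f"{...}:\\"` form could not handle.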
@@ -349,9 +365,7 @@ class QlibConfig(Config):
for _freq in _provider_uri.keys():
# mount_path
_mount_path[_freq] = (
_mount_path[_freq]
if _mount_path[_freq] is None
else str(Path(_mount_path[_freq]).expanduser().resolve())
_mount_path[_freq] if _mount_path[_freq] is None else str(Path(_mount_path[_freq]).expanduser())
)
self["provider_uri"] = _provider_uri
self["mount_path"] = _mount_path
@@ -360,10 +374,10 @@ class QlibConfig(Config):
"""
configure qlib based on the input parameters
The configure will act like a dictionary.
The configuration will act like a dictionary.
Normally, it literally replace the value according to the keys.
However, sometimes it is hard for users to set the config when the configure is nested and complicated
Normally, it literally replaces the value according to the keys.
However, sometimes it is hard for users to set the config when the configuration is nested and complicated
So this API provides some special parameters for users to set the keys in a more convenient way.
- region: REG_CN, REG_US
@@ -450,6 +464,12 @@ class QlibConfig(Config):
# Due to a bug? that converting __version__ to _QlibConfig__version__bak
# Using __version__bak instead of __version__
def get_kernels(self, freq: str):
"""get number of processors given frequency"""
if isinstance(self["kernels"], Callable):
return self["kernels"](freq)
return self["kernels"]
@property
def registered(self):
return self._registered
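The `kernels` setting in this diff may be either a fixed integer or a callable that receives the frequency string. A minimal sketch of that lookup is given below as a standalone function rather than a `QlibConfig` method. `resolve_kernels` and `kernels_by_freq` are hypothetical names used only for illustration, and the worker counts are arbitrary:

```python
from typing import Callable, Union


def resolve_kernels(kernels: Union[int, Callable[[str], int]], freq: str) -> int:
    """Return the worker count for `freq` (hypothetical stand-in for `get_kernels`)."""
    if callable(kernels):
        # A callable decides the worker count per frequency.
        return kernels(freq)
    # Otherwise the configured value applies to every frequency.
    return kernels


def kernels_by_freq(freq: str) -> int:
    """Illustrative policy (an assumption): fewer workers for minute-level data."""
    return 4 if freq == "1min" else 16
```

Passing a callable keeps the default (`NUM_USABLE_CPU`) untouched for users who never set `kernels`, while letting a backend with connection limits cap parallelism for specific frequencies.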

9
qlib/constant.py Normal file

@@ -0,0 +1,9 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
# REGION CONST
REG_CN = "cn"
REG_US = "us"
# Epsilon for avoiding division by zero.
EPS = 1e-12

55
qlib/contrib/data/data.py Normal file

@@ -0,0 +1,55 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
# We remove arctic from core framework of Qlib to contrib due to
# - Arctic has very strict limitation on pandas and numpy version
# - https://github.com/man-group/arctic/pull/908
# - pip fails to compute the right version number
# - Maybe we can solve this problem by poetry
# FIXME: So if you want to use arctic-based provider, please install arctic manually
# `pip install arctic` may not be enough.
from arctic import Arctic
import pandas as pd
import pymongo
from qlib.data.data import FeatureProvider
class ArcticFeatureProvider(FeatureProvider):
def __init__(
self, uri="127.0.0.1", retry_time=0, market_transaction_time_list=[("09:15", "11:30"), ("13:00", "15:00")]
):
super().__init__()
self.uri = uri
# TODO:
# retry connecting if error occurs
# does it really matter?
self.retry_time = retry_time
# NOTE: this is especially important for TResample operator
self.market_transaction_time_list = market_transaction_time_list
def feature(self, instrument, field, start_index, end_index, freq):
field = str(field)[1:]
with pymongo.MongoClient(self.uri) as client:
# TODO: this will result in frequently connecting the server and performance issue
arctic = Arctic(client)
if freq not in arctic.list_libraries():
raise ValueError("lib {} not in arctic".format(freq))
if instrument not in arctic[freq].list_symbols():
# the instrument does not exist
return pd.Series()
else:
df = arctic[freq].read(instrument, columns=[field], chunk_range=(start_index, end_index))
s = df[field]
if not s.empty:
s = pd.concat(
[
s.between_time(time_tuple[0], time_tuple[1])
for time_tuple in self.market_transaction_time_list
]
)
return s


@@ -63,9 +63,7 @@ def _get_date_parse_fn(target):
get_date_parse_fn('20120101')('2017-01-01') => '20170101'
get_date_parse_fn(20120101)('2017-01-01') => 20170101
"""
if isinstance(target, pd.Timestamp):
_fn = lambda x: pd.Timestamp(x) # Timestamp('2020-01-01')
elif isinstance(target, int):
if isinstance(target, int):
_fn = lambda x: int(str(x).replace("-", "")[:8]) # 20200201
elif isinstance(target, str) and len(target) == 8:
_fn = lambda x: str(x).replace("-", "")[:8] # '20200201'
@@ -158,7 +156,7 @@ class MTSDatasetH(DatasetH):
try:
df = self.handler._learn.copy() # use copy otherwise recorder will fail
# FIXME: currently we cannot support switching from `_learn` to `_infer` for inference
except:
except Exception:
warnings.warn("cannot access `_learn`, will load raw data")
df = self.handler._data.copy()
df.index = df.index.swaplevel()


@@ -126,9 +126,9 @@ class Alpha360(DataHandlerLP):
fields += ["$vwap/$close"]
names += ["VWAP0"]
for i in range(59, 0, -1):
fields += ["Ref($volume, %d)/$volume" % (i)]
fields += ["Ref($volume, %d)/($volume+1e-12)" % (i)]
names += ["VOLUME%d" % (i)]
fields += ["$volume/$volume"]
fields += ["$volume/($volume+1e-12)"]
names += ["VOLUME0"]
return fields, names
@@ -249,7 +249,7 @@ class Alpha158(DataHandlerLP):
names += [field.upper() + str(d) for d in windows]
if "volume" in config:
windows = config["volume"].get("windows", range(5))
fields += ["Ref($volume, %d)/$volume" % d if d != 0 else "$volume/$volume" for d in windows]
fields += ["Ref($volume, %d)/($volume+1e-12)" % d if d != 0 else "$volume/($volume+1e-12)" for d in windows]
names += ["VOLUME" + str(d) for d in windows]
if "rolling" in config:
windows = config["rolling"].get("windows", [5, 10, 20, 30, 60])


@@ -18,8 +18,8 @@ class SepDataFrame:
"""
(Sep)erate DataFrame
We usually concat multiple dataframe to be processed together(Such as feature, label, weight, filter).
However, they are usally be used seperately at last.
This will result in extra cost for concating and spliting data(reshaping and copying data in the memory is very expensive)
However, they are usually used separately in the end.
This will result in extra cost for concatenating and splitting data(reshaping and copying data in the memory is very expensive)
SepDataFrame tries to act like a DataFrame whose columns form a MultiIndex
"""


@@ -371,7 +371,7 @@ def long_short_backtest(
def t_run():
pred_FN = "./check_pred.csv"
pred = pd.read_csv(pred_FN)
pred: pd.DataFrame = pd.read_csv(pred_FN)
pred["datetime"] = pd.to_datetime(pred["datetime"])
pred = pred.set_index([pred.columns[0], pred.columns[1]])
pred = pred.iloc[:9000]


@@ -38,11 +38,11 @@ def _get_position_value_from_df(evaluate_date, position, close_data_df):
def get_position_value(evaluate_date, position):
"""sum of close*amount
get value of postion
get value of position
use close price
postions:
positions:
{
Timestamp('2016-01-05 00:00:00'):
{


@@ -0,0 +1,4 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from .data_selection import MetaTaskDS, MetaDatasetDS, MetaModelDS


@@ -0,0 +1,5 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from .dataset import MetaDatasetDS, MetaTaskDS
from .model import MetaModelDS


@@ -0,0 +1,325 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from copy import deepcopy
from qlib.data.dataset.utils import init_task_handler
from qlib.utils.data import deepcopy_basic_type
from qlib.contrib.torch import data_to_tensor
from qlib.workflow.task.utils import TimeAdjuster
from qlib.model.meta.task import MetaTask
from typing import Dict, List, Union, Text, Tuple
from qlib.data.dataset.handler import DataHandler
from qlib.log import get_module_logger
from qlib.utils import auto_filter_kwargs, get_date_by_shift, init_instance_by_config
from qlib.workflow import R
from qlib.workflow.task.gen import RollingGen, task_generator
from joblib import Parallel, delayed
from qlib.model.meta.dataset import MetaTaskDataset
from qlib.model.trainer import task_train, TrainerR
from qlib.data.dataset import DatasetH
from tqdm.auto import tqdm
import pandas as pd
import numpy as np
class InternalData:
def __init__(self, task_tpl: dict, step: int, exp_name: str):
self.task_tpl = task_tpl
self.step = step
self.exp_name = exp_name
def setup(self, trainer=TrainerR, trainer_kwargs={}):
"""
After running this function, `self.data_ic_df` will be set.
Each column represents a data segment.
Each row represents the timestamp at which the performance of that data segment is measured.
For example,
.. code-block:: python
2021-06-21 2021-06-04 2021-05-21 2021-05-07 2021-04-20 2021-04-06 2021-03-22 2021-03-08 ...
2021-07-02 2021-06-18 2021-06-03 2021-05-20 2021-05-06 2021-04-19 2021-04-02 2021-03-19 ...
datetime ...
2018-01-02 0.079782 0.115975 0.070866 0.028849 -0.081170 0.140380 0.063864 0.110987 ...
2018-01-03 0.123386 0.107789 0.071037 0.045278 -0.060782 0.167446 0.089779 0.124476 ...
2018-01-04 0.140775 0.097206 0.063702 0.042415 -0.078164 0.173218 0.098914 0.114389 ...
2018-01-05 0.030320 -0.037209 -0.044536 -0.047267 -0.081888 0.045648 0.059947 0.047652 ...
2018-01-08 0.107201 0.009219 -0.015995 -0.036594 -0.086633 0.108965 0.122164 0.108508 ...
... ... ... ... ... ... ... ... ... ...
"""
# 1) prepare the prediction of proxy models
perf_task_tpl = deepcopy(self.task_tpl) # this task is supposed to contain no complicated objects
trainer = auto_filter_kwargs(trainer)(experiment_name=self.exp_name, **trainer_kwargs)
# NOTE:
# The handler is initialized only once.
if not trainer.has_worker():
self.dh = init_task_handler(perf_task_tpl)
else:
self.dh = init_instance_by_config(perf_task_tpl["dataset"]["kwargs"]["handler"])
seg = perf_task_tpl["dataset"]["kwargs"]["segments"]
# We want to split the training time period into small segments.
perf_task_tpl["dataset"]["kwargs"]["segments"] = {
"train": (DatasetH.get_min_time(seg), DatasetH.get_max_time(seg)),
"test": (None, None),
}
# NOTE:
# we play a trick here
# treat the training segments as test to create the rolling tasks
rg = RollingGen(step=self.step, test_key="train", train_key=None, task_copy_func=deepcopy_basic_type)
gen_task = task_generator(perf_task_tpl, [rg])
recorders = R.list_recorders(experiment_name=self.exp_name)
if len(gen_task) == len(recorders):
get_module_logger("Internal Data").info("the data has been initialized")
else:
# train new models
assert 0 == len(recorders), "An empty experiment is required for setup `InternalData`"
trainer.train(gen_task)
# 2) extract the similarity matrix
label_df = self.dh.fetch(col_set="label")
# for
recorders = R.list_recorders(experiment_name=self.exp_name)
key_l = []
ic_l = []
for _, rec in tqdm(recorders.items(), desc="calc"):
pred = rec.load_object("pred.pkl")
task = rec.load_object("task")
data_key = task["dataset"]["kwargs"]["segments"]["train"]
key_l.append(data_key)
ic_l.append(delayed(self._calc_perf)(pred.iloc[:, 0], label_df.iloc[:, 0]))
ic_l = Parallel(n_jobs=-1)(ic_l)
self.data_ic_df = pd.DataFrame(dict(zip(key_l, ic_l)))
self.data_ic_df = self.data_ic_df.sort_index().sort_index(axis=1)
del self.dh # handler is not useful now
def _calc_perf(self, pred, label):
df = pd.DataFrame({"pred": pred, "label": label})
df = df.groupby("datetime").corr(method="spearman")
corr = df.loc(axis=0)[:, "pred"]["label"].droplevel(axis=0, level=-1)
return corr
def update(self):
"""update the data for online trading"""
# TODO:
# when new data are totally(including label) available
# - update the prediction
# - update the data similarity map(if applied)
class MetaTaskDS(MetaTask):
"""Meta Task for Data Selection"""
def __init__(self, task: dict, meta_info: pd.DataFrame, mode: str = MetaTask.PROC_MODE_FULL, fill_method="max"):
"""
The description of the processed data
time_perf: An array with shape <hist_step_n * step, data pieces> -> data piece performance
time_belong: An array with shape <sample, data pieces> -> belong or not (1. or 0.)
array([[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 1.],
[0., 0., 0., ..., 0., 0., 1.],
[0., 0., 0., ..., 0., 0., 1.]])
"""
super().__init__(task, meta_info)
self.fill_method = fill_method
time_perf = self._get_processed_meta_info()
self.processed_meta_input = {"time_perf": time_perf}
# FIXME: memory issue in this step
if mode == MetaTask.PROC_MODE_FULL:
# process metainfo_
ds = self.get_dataset()
# these three lines occupied 70% of the time of initializing MetaTaskDS
d_train, d_test = ds.prepare(["train", "test"], col_set=["feature", "label"])
prev_size = d_test.shape[0]
d_train = d_train.dropna(axis=0)
d_test = d_test.dropna(axis=0)
if prev_size == 0 or d_test.shape[0] / prev_size <= 0.1:
raise ValueError(f"Most of samples are dropped. Please check this task: {task}")
assert (
d_test.groupby("datetime").size().shape[0] >= 5
), "In this segment, the number of trading dates is less than 5; please check the data."
sample_time_belong = np.zeros((d_train.shape[0], time_perf.shape[1]))
for i, col in enumerate(time_perf.columns):
# these two lines of code occupied 20% of the time of initializing MetaTaskDS
slc = slice(*d_train.index.slice_locs(start=col[0], end=col[1]))
sample_time_belong[slc, i] = 1.0
# If you want that last month also belongs to the last time_perf
# Assumptions: the latest data has similar performance like the last month
sample_time_belong[sample_time_belong.sum(axis=1) != 1, -1] = 1.0
self.processed_meta_input.update(
dict(
X=d_train["feature"],
y=d_train["label"].iloc[:, 0],
X_test=d_test["feature"],
y_test=d_test["label"].iloc[:, 0],
time_belong=sample_time_belong,
test_idx=d_test["label"].index,
)
)
# TODO: set device: I think this is not necessary to converting data format.
self.processed_meta_input = data_to_tensor(self.processed_meta_input)
def _get_processed_meta_info(self):
meta_info_norm = self.meta_info.sub(self.meta_info.mean(axis=1), axis=0) # .fillna(0.)
if self.fill_method == "max":
meta_info_norm = meta_info_norm.T.fillna(
meta_info_norm.max(axis=1)
).T # fill it with row max to align with previous implementation
elif self.fill_method == "zero":
pass
else:
raise NotImplementedError(f"This type of input is not supported")
meta_info_norm = meta_info_norm.fillna(0.0) # always fill zero in case of NaN
return meta_info_norm
def get_meta_input(self):
return self.processed_meta_input
class MetaDatasetDS(MetaTaskDataset):
def __init__(
self,
*,
task_tpl: Union[dict, list],
step: int,
trunc_days: int = None,
rolling_ext_days: int = 0,
exp_name: Union[str, InternalData],
segments: Union[Dict[Text, Tuple], float],
hist_step_n: int = 10,
task_mode: str = MetaTask.PROC_MODE_FULL,
fill_method: str = "max",
):
"""
A dataset for meta model.
Parameters
----------
task_tpl : Union[dict, list]
Decide what tasks are used.
- dict : the task template the prepared task is generated with `step`, `trunc_days` and `RollingGen`
- list : when list, use the list of tasks directly
the list is supposed to be sorted according to the timeline
step : int
the rolling step
trunc_days: int
days to be truncated based on the test start
rolling_ext_days: int
sometimes users want to train meta models for a longer test period but with smaller rolling steps for more task samples.
the total length of test periods will be `step + rolling_ext_days`
exp_name : Union[str, InternalData]
Decide what meta_info are used for prediction.
- str: the name of the experiment to store the performance of data
- InternalData: a prepared internal data
segments: Union[Dict[Text, Tuple], float]
the segments to divide data
both left and right
if segments is a float:
the float represents the percentage of data for training
hist_step_n: int
length of historical steps for the meta information
task_mode : str
Please refer to the docs of MetaTask
"""
super().__init__(segments=segments)
if isinstance(exp_name, InternalData):
self.internal_data = exp_name
else:
self.internal_data = InternalData(task_tpl, step=step, exp_name=exp_name)
self.internal_data.setup()
self.task_tpl = deepcopy(task_tpl) # FIXME: if the handler is shared, how to avoid the explosion of the memory.
self.trunc_days = trunc_days
self.hist_step_n = hist_step_n
self.step = step
if isinstance(task_tpl, dict):
rg = RollingGen(
step=step, trunc_days=trunc_days, task_copy_func=deepcopy_basic_type
) # NOTE: trunc_days is very important !!!!
task_iter = rg(task_tpl)
if rolling_ext_days > 0:
self.ta = TimeAdjuster(future=True)
for t in task_iter:
t["dataset"]["kwargs"]["segments"]["test"] = self.ta.shift(
t["dataset"]["kwargs"]["segments"]["test"], step=rolling_ext_days, rtype=RollingGen.ROLL_EX
)
if task_mode == MetaTask.PROC_MODE_FULL:
# Only pre-initialize the task when the full task is required:
# initialize the handler and share it.
init_task_handler(task_tpl)
else:
assert isinstance(task_tpl, list)
task_iter = task_tpl
self.task_list = []
self.meta_task_l = []
logger = get_module_logger("MetaDatasetDS")
logger.info(f"Example task for training meta model: {task_iter[0]}")
for t in tqdm(task_iter, desc="creating meta tasks"):
try:
self.meta_task_l.append(
MetaTaskDS(t, meta_info=self._prepare_meta_ipt(t), mode=task_mode, fill_method=fill_method)
)
self.task_list.append(t)
except ValueError as e:
logger.warning(f"ValueError: {e}")
assert len(self.meta_task_l) > 0, "No meta tasks found. Please check the data and setting"
def _prepare_meta_ipt(self, task):
ic_df = self.internal_data.data_ic_df
segs = task["dataset"]["kwargs"]["segments"]
end = max([segs[k][1] for k in ("train", "valid") if k in segs])
ic_df_avail = ic_df.loc[:end, pd.IndexSlice[:, :end]]
# meta data set focus on the **information** instead of preprocess
# 1) filter the future info
def mask_future(s):
"""mask future information"""
# from qlib.utils import get_date_by_shift
start, end = s.name
end = get_date_by_shift(trading_date=end, shift=self.trunc_days - 1, future=True)
return s.mask((s.index >= start) & (s.index <= end))
ic_df_avail = ic_df_avail.apply(mask_future) # apply to each col
# 2) filter the info with too long periods
total_len = self.step * self.hist_step_n
if ic_df_avail.shape[0] >= total_len:
return ic_df_avail.iloc[-total_len:]
else:
raise ValueError("the history of distribution data is not long enough.")
def _prepare_seg(self, segment: Text) -> List[MetaTask]:
if isinstance(self.segments, float):
train_task_n = int(len(self.meta_task_l) * self.segments)
if segment == "train":
return self.meta_task_l[:train_task_n]
elif segment == "test":
return self.meta_task_l[train_task_n:]
else:
raise NotImplementedError(f"This type of input is not supported")
else:
raise NotImplementedError(f"This type of input is not supported")
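The `time_belong` matrix built in `MetaTaskDS` marks, for each training sample, which data piece it falls into, and assigns any unmatched sample to the last piece. The sketch below shows the same idea in plain Python. `build_time_belong` is a hypothetical name, dates are assumed to be ISO-formatted strings (so string comparison follows date order), and the original code works on a pandas index with `slice_locs` instead of nested loops:

```python
from typing import List, Tuple


def build_time_belong(sample_dates: List[str], pieces: List[Tuple[str, str]]) -> List[List[float]]:
    """Hypothetical pure-Python analogue of the `sample_time_belong` construction."""
    belong = [[0.0] * len(pieces) for _ in sample_dates]
    for j, (start, end) in enumerate(pieces):
        for i, date in enumerate(sample_dates):
            # Both endpoints are inclusive, like label-based slicing in pandas.
            if start <= date <= end:
                belong[i][j] = 1.0
    for row in belong:
        # Samples not covered by exactly one piece are assigned to the last piece,
        # assuming the latest data behaves like the most recent piece.
        if sum(row) != 1:
            row[-1] = 1.0
    return belong
```

Each row can then be multiplied with a vector of per-piece weights to obtain a per-sample weight, which is how `time_belong @ weights` is used in the network code later in this change set.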


@@ -0,0 +1,182 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from qlib.log import get_module_logger
import pandas as pd
import numpy as np
from qlib.model.meta.task import MetaTask
import torch
from torch import nn
from torch import optim
from tqdm.auto import tqdm
import collections
import copy
from typing import Union, List, Tuple, Dict
from ....data.dataset.weight import Reweighter
from ....model.meta.dataset import MetaTaskDataset
from ....model.meta.model import MetaModel, MetaTaskModel
from ....workflow import R
from .utils import ICLoss
from .dataset import MetaDatasetDS
from qlib.contrib.meta.data_selection.net import PredNet
from qlib.data.dataset.weight import Reweighter
from qlib.log import get_module_logger
logger = get_module_logger("data selection")
class TimeReweighter(Reweighter):
def __init__(self, time_weight: pd.Series):
self.time_weight = time_weight
def reweight(self, data: Union[pd.DataFrame, pd.Series]):
# TODO: handling TSDataSampler
w_s = pd.Series(1.0, index=data.index)
for k, w in self.time_weight.items():
w_s.loc[slice(*k)] = w
logger.info(f"Reweighting result: {w_s}")
return w_s
class MetaModelDS(MetaTaskModel):
"""
The meta-model for meta-learning-based data selection.
"""
def __init__(
self,
step,
hist_step_n,
clip_method="tanh",
clip_weight=2.0,
criterion="ic_loss",
lr=0.0001,
max_epoch=100,
seed=43,
):
self.step = step
self.hist_step_n = hist_step_n
self.clip_method = clip_method
self.clip_weight = clip_weight
self.criterion = criterion
self.lr = lr
self.max_epoch = max_epoch
self.fitted = False
torch.manual_seed(seed)
def run_epoch(self, phase, task_list, epoch, opt, loss_l, ignore_weight=False):
if phase == "train":
self.tn.train()
torch.set_grad_enabled(True)
else:
self.tn.eval()
torch.set_grad_enabled(False)
running_loss = 0.0
pred_y_all = []
for task in tqdm(task_list, desc=f"{phase} Task", leave=False):
meta_input = task.get_meta_input()
pred, weights = self.tn(
meta_input["X"],
meta_input["y"],
meta_input["time_perf"],
meta_input["time_belong"],
meta_input["X_test"],
ignore_weight=ignore_weight,
)
if self.criterion == "mse":
criterion = nn.MSELoss()
loss = criterion(pred, meta_input["y_test"])
elif self.criterion == "ic_loss":
criterion = ICLoss()
try:
loss = criterion(pred, meta_input["y_test"], meta_input["test_idx"], skip_size=50)
except ValueError as e:
get_module_logger("MetaModelDS").warning(f"Exception `{e}` when calculating IC loss")
continue
assert not np.isnan(loss.detach().item()), "NaN loss!"
if phase == "train":
opt.zero_grad()
norm_loss = nn.MSELoss()
loss.backward()
opt.step()
elif phase == "test":
pass
pred_y_all.append(
pd.DataFrame(
{
"pred": pd.Series(pred.detach().cpu().numpy(), index=meta_input["test_idx"]),
"label": pd.Series(meta_input["y_test"].detach().cpu().numpy(), index=meta_input["test_idx"]),
}
)
)
running_loss += loss.detach().item()
running_loss = running_loss / len(task_list)
loss_l.setdefault(phase, []).append(running_loss)
pred_y_all = pd.concat(pred_y_all)
ic = pred_y_all.groupby("datetime").apply(lambda df: df["pred"].corr(df["label"], method="spearman")).mean()
R.log_metrics(**{f"loss/{phase}": running_loss, "step": epoch})
R.log_metrics(**{f"ic/{phase}": ic, "step": epoch})
def fit(self, meta_dataset: MetaDatasetDS):
"""
The meta-learning-based data selection interacts directly with meta-dataset due to the closed-form proxy measurement.
Parameters
----------
meta_dataset : MetaDatasetDS
The meta-model takes the meta-dataset for its training process.
"""
if not self.fitted:
for k in set(["lr", "step", "hist_step_n", "clip_method", "clip_weight", "criterion", "max_epoch"]):
R.log_params(**{k: getattr(self, k)})
# FIXME: get test tasks for just checking the performance
phases = ["train", "test"]
meta_tasks_l = meta_dataset.prepare_tasks(phases)
if len(meta_tasks_l[1]):
R.log_params(
**dict(proxy_test_begin=meta_tasks_l[1][0].task["dataset"]["kwargs"]["segments"]["test"])
) # debug: record when the test phase starts
self.tn = PredNet(
step=self.step, hist_step_n=self.hist_step_n, clip_weight=self.clip_weight, clip_method=self.clip_method
)
opt = optim.Adam(self.tn.parameters(), lr=self.lr)
# run once with the weights ignored and once with the initial weights
for phase, task_list in zip(phases, meta_tasks_l):
self.run_epoch(f"{phase}_noweight", task_list, 0, opt, {}, ignore_weight=True)
self.run_epoch(f"{phase}_init", task_list, 0, opt, {})
# run training
loss_l = {}
for epoch in tqdm(range(self.max_epoch), desc="epoch"):
for phase, task_list in zip(phases, meta_tasks_l):
self.run_epoch(phase, task_list, epoch, opt, loss_l)
R.save_objects(**{"model.pkl": self.tn})
self.fitted = True
def _prepare_task(self, task: MetaTask) -> dict:
meta_ipt = task.get_meta_input()
weights = self.tn.twm(meta_ipt["time_perf"])
weight_s = pd.Series(weights.detach().cpu().numpy(), index=task.meta_info.columns)
task = copy.copy(task.task) # NOTE: this is a shallow copy.
task["reweighter"] = TimeReweighter(weight_s)
return task
def inference(self, meta_dataset: MetaTaskDataset) -> List[dict]:
res = []
for mt in meta_dataset.prepare_tasks("test"):
res.append(self._prepare_task(mt))
return res
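`TimeReweighter.reweight` starts every sample at weight 1.0 and overwrites the samples that fall inside each `(start, end)` key of `time_weight`. A minimal pure-Python sketch of that behaviour is shown below. `reweight_by_time` is a hypothetical name, and dates are assumed to be ISO-formatted strings instead of a pandas index:

```python
from typing import Dict, List, Tuple


def reweight_by_time(dates: List[str], time_weight: Dict[Tuple[str, str], float]) -> List[float]:
    """Hypothetical list-based analogue of `TimeReweighter.reweight`."""
    weights = [1.0] * len(dates)  # default weight for samples outside every range
    for (start, end), w in time_weight.items():
        for i, date in enumerate(dates):
            # Inclusive on both ends, like `w_s.loc[slice(start, end)]`.
            if start <= date <= end:
                weights[i] = w
    return weights
```

If ranges overlap, later entries overwrite earlier ones, which matches the sequential assignment in the original loop.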


@@ -0,0 +1,68 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import pandas as pd
import numpy as np
import torch
from torch import nn
from .utils import preds_to_weight_with_clamp, SingleMetaBase
class TimeWeightMeta(SingleMetaBase):
def __init__(self, hist_step_n, clip_weight=None, clip_method="clamp"):
# clip_method includes "tanh" or "clamp"
super().__init__(hist_step_n, clip_weight, clip_method)
self.linear = nn.Linear(hist_step_n, 1)
self.k = nn.Parameter(torch.Tensor([8.0]))
def forward(self, time_perf, time_belong=None, return_preds=False):
hist_step_n = self.linear.in_features
# NOTE: the reshape order is very important
time_perf = time_perf.reshape(hist_step_n, time_perf.shape[0] // hist_step_n, *time_perf.shape[1:])
time_perf = torch.mean(time_perf, dim=1, keepdim=False)
preds = []
for i in range(time_perf.shape[1]):
preds.append(self.linear(time_perf[:, i]))
preds = torch.cat(preds)
preds = preds - torch.mean(preds) # avoid using future information
preds = preds * self.k
if return_preds:
if time_belong is None:
return preds
else:
return time_belong @ preds
else:
weights = preds_to_weight_with_clamp(preds, self.clip_weight, self.clip_method)
if time_belong is None:
return weights
else:
return time_belong @ weights
class PredNet(nn.Module):
def __init__(self, step, hist_step_n, clip_weight=None, clip_method="tanh"):
super().__init__()
self.step = step
self.twm = TimeWeightMeta(hist_step_n=hist_step_n, clip_weight=clip_weight, clip_method=clip_method)
self.init_paramters(hist_step_n)
def get_sample_weights(self, X, time_perf, time_belong, ignore_weight=False):
weights = torch.from_numpy(np.ones(X.shape[0])).float().to(X.device)
if not ignore_weight:
if time_perf is not None:
weights_t = self.twm(time_perf, time_belong)
weights = weights * weights_t
return weights
def forward(self, X, y, time_perf, time_belong, X_test, ignore_weight=False):
"""Please refer to the docs of MetaTaskDS for the description of the variables"""
weights = self.get_sample_weights(X, time_perf, time_belong, ignore_weight=ignore_weight)
X_w = X.T * weights.view(1, -1)
theta = torch.inverse(X_w @ X) @ X_w @ y
return X_test @ theta, weights
def init_paramters(self, hist_step_n):
self.twm.linear.weight.data = 1.0 / hist_step_n + self.twm.linear.weight.data * 0.01
self.twm.linear.bias.data.fill_(0.0)
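The `forward` pass of `PredNet` can be read as a closed-form weighted least-squares fit. As a sketch, write $W = \mathrm{diag}(w)$ for the sample weights returned by `get_sample_weights`; the symbols $W$, $\theta$ and $\hat{y}_{\text{test}}$ are notation introduced here, not names from the code:

```latex
\theta = \left(X^{\top} W X\right)^{-1} X^{\top} W y,
\qquad
\hat{y}_{\text{test}} = X_{\text{test}}\,\theta
```

Here $X^{\top} W$ corresponds to `X_w`, and $\theta$ minimizes $\sum_i w_i \,(y_i - x_i^{\top}\theta)^2$ provided $X^{\top} W X$ is invertible. Because the weights enter through differentiable operations, the test loss can be back-propagated to the parameters of `TimeWeightMeta`.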


@@ -0,0 +1,98 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import pandas as pd
import numpy as np
import torch
from torch import nn
from qlib.contrib.torch import data_to_tensor
class ICLoss(nn.Module):
def forward(self, pred, y, idx, skip_size=50):
"""forward.
:param pred:
:param y:
:param idx: Assume the level of the idx is (date, inst), and it is sorted
"""
prev = None
diff_point = []
for i, (date, inst) in enumerate(idx):
if date != prev:
diff_point.append(i)
prev = date
diff_point.append(None)
ic_all = 0.0
skip_n = 0
for start_i, end_i in zip(diff_point, diff_point[1:]):
pred_focus = pred[start_i:end_i] # TODO: just for fake
if pred_focus.shape[0] < skip_size:
# skip days that have too few stocks.
skip_n += 1
continue
y_focus = y[start_i:end_i]
ic_day = torch.dot(
(pred_focus - pred_focus.mean()) / np.sqrt(pred_focus.shape[0]) / pred_focus.std(),
(y_focus - y_focus.mean()) / np.sqrt(y_focus.shape[0]) / y_focus.std(),
)
ic_all += ic_day
if len(diff_point) - 1 - skip_n <= 0:
raise ValueError("Not enough data for calculating IC")
ic_mean = ic_all / (len(diff_point) - 1 - skip_n)
return -ic_mean # ic loss
def preds_to_weight_with_clamp(preds, clip_weight=None, clip_method="tanh"):
"""
Clip the weights.
Parameters
----------
clip_weight: float
The clip threshold.
clip_method: str
The clip method. Current available: "clamp", "tanh", and "sigmoid".
"""
if clip_weight is not None:
if clip_method == "clamp":
weights = torch.exp(preds)
weights = weights.clamp(1.0 / clip_weight, clip_weight)
elif clip_method == "tanh":
weights = torch.exp(torch.tanh(preds) * np.log(clip_weight))
elif clip_method == "sigmoid":
# intuitively assume its sum is 1
if clip_weight == 0.0:
weights = torch.ones_like(preds)
else:
sm = nn.Sigmoid()
weights = sm(preds) * clip_weight # TODO: The clip_weight is useless here.
weights = weights / torch.sum(weights) * weights.numel()
else:
raise ValueError("Unknown clip_method")
else:
weights = torch.exp(preds)
return weights
class SingleMetaBase(nn.Module):
def __init__(self, hist_n, clip_weight=None, clip_method="clamp"):
# method can be tanh or clamp
super().__init__()
self.clip_weight = clip_weight
if clip_method in ["tanh", "clamp"]:
if self.clip_weight is not None and self.clip_weight < 1.0:
self.clip_weight = 1 / self.clip_weight
self.clip_method = clip_method
def is_enabled(self):
if self.clip_weight is None:
return True
if self.clip_method == "sigmoid":
if self.clip_weight > 0.0:
return True
else:
if self.clip_weight > 1.0:
return True
return False
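
The `forward` method above averages a per-day correlation between predictions and labels and returns its negative as a loss. A minimal dependency-free sketch of that idea follows, using plain lists instead of tensors. The names `pearson` and `ic_loss` and the small `skip_size` default are illustrative assumptions, not part of the commit. This sketch computes the exact Pearson correlation, whereas the code above relies on `torch.std`, whose default is the sample (n-1) standard deviation, so the two would differ by a factor of (n-1)/n per day.

```python
import math
from itertools import groupby


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def ic_loss(pred, y, dates, skip_size=2):
    """Negative mean daily IC.

    `dates` is assumed to be sorted so that equal dates are adjacent,
    mirroring the (date, inst) index assumption in the original method.
    """
    daily_ic = []
    pos = 0
    for _, group in groupby(dates):
        n = len(list(group))
        p, t = pred[pos:pos + n], y[pos:pos + n]
        pos += n
        if n < skip_size:
            continue  # skip days with too few instruments
        daily_ic.append(pearson(p, t))
    if not daily_ic:
        raise ValueError("Not enough data for calculating IC")
    return -sum(daily_ic) / len(daily_ic)


# Day d1 is perfectly correlated, day d2 perfectly anti-correlated,
# and day d3 has a single sample and is skipped.
dates = ["d1", "d1", "d1", "d2", "d2", "d2", "d3"]
pred = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 9.0]
y = [2.0, 4.0, 6.0, 3.0, 2.0, 1.0, 0.0]
loss = ic_loss(pred, y, dates)
```

Because the loss is the negated mean IC, minimizing it pushes the model toward predictions that rank instruments in the same order as the labels on each day.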


@@ -11,6 +11,7 @@ from ...model.base import Model
 from ...data.dataset import DatasetH
 from ...data.dataset.handler import DataHandlerLP
 from ...model.interpret.base import FeatureInt
+from ...data.dataset.weight import Reweighter
 class CatBoostModel(Model, FeatureInt):
@@ -31,6 +32,7 @@ class CatBoostModel(Model, FeatureInt):
         early_stopping_rounds=50,
         verbose_eval=20,
         evals_result=dict(),
+        reweighter=None,
         **kwargs
     ):
         df_train, df_valid = dataset.prepare(
@@ -49,8 +51,17 @@ class CatBoostModel(Model, FeatureInt):
         else:
             raise ValueError("CatBoost doesn't support multi-label training")
-        train_pool = Pool(data=x_train, label=y_train_1d)
-        valid_pool = Pool(data=x_valid, label=y_valid_1d)
+        if reweighter is None:
+            w_train = None
+            w_valid = None
+        elif isinstance(reweighter, Reweighter):
+            w_train = reweighter.reweight(df_train).values
+            w_valid = reweighter.reweight(df_valid).values
+        else:
+            raise ValueError("Unsupported reweighter type.")
+        train_pool = Pool(data=x_train, label=y_train_1d, weight=w_train)
+        valid_pool = Pool(data=x_valid, label=y_valid_1d, weight=w_valid)
         # Initialize the catboost model
         self._params["iterations"] = num_boost_round
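
The same three-way handling of `reweighter` (absent, a `Reweighter` instance, anything else) recurs in several models in this diff. A minimal sketch of the pattern follows. The `Reweighter` class here is a simplified stand-in for the one imported from `data.dataset.weight`, and `RecencyReweighter` and `resolve_weights` are hypothetical names introduced only for illustration.

```python
class Reweighter:
    """Simplified stand-in for the Reweighter base class used in the diff."""

    def reweight(self, data):
        raise NotImplementedError


class RecencyReweighter(Reweighter):
    """Hypothetical reweighter: later samples get linearly larger weights."""

    def reweight(self, data):
        n = len(data)
        return [(i + 1) / n for i in range(n)]


def resolve_weights(reweighter, data):
    """Return per-sample weights, or None when no reweighting is requested."""
    if reweighter is None:
        return None
    if isinstance(reweighter, Reweighter):
        return reweighter.reweight(data)
    raise ValueError("Unsupported reweighter type.")


samples = ["a", "b", "c", "d"]
no_weights = resolve_weights(None, samples)
weights = resolve_weights(RecencyReweighter(), samples)
```

Returning `None` in the default case lets the downstream library fall back to its own unweighted behaviour, so existing callers that never pass `reweighter` are unaffected.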


@@ -4,59 +4,73 @@
 import numpy as np
 import pandas as pd
 import lightgbm as lgb
-from typing import Text, Union
+from typing import List, Text, Tuple, Union
 from ...model.base import ModelFT
 from ...data.dataset import DatasetH
 from ...data.dataset.handler import DataHandlerLP
 from ...model.interpret.base import LightGBMFInt
+from ...data.dataset.weight import Reweighter
 class LGBModel(ModelFT, LightGBMFInt):
     """LightGBM Model"""
-    def __init__(self, loss="mse", early_stopping_rounds=50, **kwargs):
+    def __init__(self, loss="mse", early_stopping_rounds=50, num_boost_round=1000, **kwargs):
         if loss not in {"mse", "binary"}:
             raise NotImplementedError
         self.params = {"objective": loss, "verbosity": -1}
         self.params.update(kwargs)
         self.early_stopping_rounds = early_stopping_rounds
+        self.num_boost_round = num_boost_round
         self.model = None
-    def _prepare_data(self, dataset: DatasetH):
-        df_train, df_valid = dataset.prepare(
-            ["train", "valid"], col_set=["feature", "label"], data_key=DataHandlerLP.DK_L
-        )
-        if df_train.empty or df_valid.empty:
-            raise ValueError("Empty data from dataset, please check your dataset config.")
-        x_train, y_train = df_train["feature"], df_train["label"]
-        x_valid, y_valid = df_valid["feature"], df_valid["label"]
+    def _prepare_data(self, dataset: DatasetH, reweighter=None) -> List[Tuple[lgb.Dataset, str]]:
+        """
+        The motivation of current version is to make validation optional
+        - train segment is necessary;
+        """
+        ds_l = []
+        assert "train" in dataset.segments
+        for key in ["train", "valid"]:
+            if key in dataset.segments:
+                df = dataset.prepare(key, col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
+                if df.empty:
+                    raise ValueError("Empty data from dataset, please check your dataset config.")
+                x, y = df["feature"], df["label"]
-        # Lightgbm need 1D array as its label
-        if y_train.values.ndim == 2 and y_train.values.shape[1] == 1:
-            y_train, y_valid = np.squeeze(y_train.values), np.squeeze(y_valid.values)
-        else:
-            raise ValueError("LightGBM doesn't support multi-label training")
+                # Lightgbm need 1D array as its label
+                if y.values.ndim == 2 and y.values.shape[1] == 1:
+                    y = np.squeeze(y.values)
+                else:
+                    raise ValueError("LightGBM doesn't support multi-label training")
-        dtrain = lgb.Dataset(x_train, label=y_train)
-        dvalid = lgb.Dataset(x_valid, label=y_valid)
-        return dtrain, dvalid
+                if reweighter is None:
+                    w = None
+                elif isinstance(reweighter, Reweighter):
+                    w = reweighter.reweight(df)
+                else:
+                    raise ValueError("Unsupported reweighter type.")
+                ds_l.append((lgb.Dataset(x.values, label=y, weight=w), key))
+        return ds_l
     def fit(
         self,
         dataset: DatasetH,
-        num_boost_round=1000,
+        num_boost_round=None,
         early_stopping_rounds=None,
         verbose_eval=20,
         evals_result=dict(),
+        reweighter=None,
         **kwargs
     ):
-        dtrain, dvalid = self._prepare_data(dataset)
+        ds_l = self._prepare_data(dataset, reweighter)
+        ds, names = list(zip(*ds_l))
         self.model = lgb.train(
             self.params,
-            dtrain,
-            num_boost_round=num_boost_round,
-            valid_sets=[dtrain, dvalid],
-            valid_names=["train", "valid"],
+            ds[0],  # training dataset
+            num_boost_round=self.num_boost_round if num_boost_round is None else num_boost_round,
+            valid_sets=ds,
+            valid_names=names,
             early_stopping_rounds=(
                 self.early_stopping_rounds if early_stopping_rounds is None else early_stopping_rounds
             ),
@@ -64,8 +78,8 @@ class LGBModel(ModelFT, LightGBMFInt):
             evals_result=evals_result,
             **kwargs
         )
-        evals_result["train"] = list(evals_result["train"].values())[0]
-        evals_result["valid"] = list(evals_result["valid"].values())[0]
+        for k in names:
+            evals_result[k] = list(evals_result[k].values())[0]
     def predict(self, dataset: DatasetH, segment: Union[Text, slice] = "test"):
         if self.model is None:
@@ -73,7 +87,7 @@ class LGBModel(ModelFT, LightGBMFInt):
         x_test = dataset.prepare(segment, col_set="feature", data_key=DataHandlerLP.DK_I)
         return pd.Series(self.model.predict(x_test.values), index=x_test.index)
-    def finetune(self, dataset: DatasetH, num_boost_round=10, verbose_eval=20):
+    def finetune(self, dataset: DatasetH, num_boost_round=10, verbose_eval=20, reweighter=None):
         """
         finetune model
@@ -87,7 +101,7 @@ class LGBModel(ModelFT, LightGBMFInt):
             verbose level
         """
         # Based on existing model and finetune by train more rounds
-        dtrain, _ = self._prepare_data(dataset)
+        dtrain, _ = self._prepare_data(dataset, reweighter)
         if dtrain.empty:
             raise ValueError("Empty data from dataset, please check your dataset config.")
         self.model = lgb.train(
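
The reworked `_prepare_data` returns a list of `(dataset, name)` pairs so that the validation segment becomes optional, and `fit` splits that list with `zip(*ds_l)`. A small sketch of that pattern with plain Python objects is shown below. The `prepare_segments` helper and the dictionary standing in for `dataset.segments` are assumptions made for illustration.

```python
def prepare_segments(segments):
    """Collect (data, name) pairs; "train" is required, "valid" is optional."""
    assert "train" in segments
    pairs = []
    for key in ["train", "valid"]:
        if key in segments:
            pairs.append((segments[key], key))
    return pairs


# With both segments present, zip(*pairs) yields two parallel tuples.
both = prepare_segments({"train": [1, 2, 3], "valid": [4, 5]})
ds_both, names_both = zip(*both)

# With only the training segment, the same code still works.
only_train = prepare_segments({"train": [1, 2, 3]})
ds_train, names_train = zip(*only_train)

# The first dataset is always the training one.
train_data = ds_train[0]
```

One consequence worth keeping in mind: a caller that unpacks the returned list directly into two names receives `(dataset, name)` pairs rather than bare datasets, and the unpacking itself fails when only the training segment exists.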


@@ -56,7 +56,7 @@ class HFLGBModel(ModelFT, LightGBMFInt):
     def hf_signal_test(self, dataset: DatasetH, threhold=0.2):
         """
-        Test the sigal in high frequency test set
+        Test the signal in high frequency test set
         """
         if self.model == None:
             raise ValueError("Model hasn't been trained yet")
@@ -86,7 +86,7 @@ class HFLGBModel(ModelFT, LightGBMFInt):
             raise ValueError("Empty data from dataset, please check your dataset config.")
         x_train, y_train = df_train["feature"], df_train["label"]
-        x_valid, y_valid = df_train["feature"], df_valid["label"]
+        x_valid, y_valid = df_valid["feature"], df_valid["label"]
         if y_train.values.ndim == 2 and y_train.values.shape[1] == 1:
             l_name = df_train["label"].columns[0]
             # Convert label into alpha


@@ -4,6 +4,7 @@
 import numpy as np
 import pandas as pd
 from typing import Text, Union
+from qlib.data.dataset.weight import Reweighter
 from scipy.optimize import nnls
 from sklearn.linear_model import LinearRegression, Ridge, Lasso
@@ -49,33 +50,40 @@ class LinearModel(Model):
         self.coef_ = None
-    def fit(self, dataset: DatasetH):
+    def fit(self, dataset: DatasetH, reweighter: Reweighter = None):
         df_train = dataset.prepare("train", col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
         if df_train.empty:
             raise ValueError("Empty data from dataset, please check your dataset config.")
+        if reweighter is not None:
+            w: pd.Series = reweighter.reweight(df_train)
+            w = w.values
+        else:
+            w = None
         X, y = df_train["feature"].values, np.squeeze(df_train["label"].values)
         if self.estimator in [self.OLS, self.RIDGE, self.LASSO]:
-            self._fit(X, y)
+            self._fit(X, y, w)
         elif self.estimator == self.NNLS:
-            self._fit_nnls(X, y)
+            self._fit_nnls(X, y, w)
         else:
             raise ValueError(f"unknown estimator `{self.estimator}`")
         return self
-    def _fit(self, X, y):
+    def _fit(self, X, y, w):
         if self.estimator == self.OLS:
             model = LinearRegression(fit_intercept=self.fit_intercept, copy_X=False)
         else:
             model = {self.RIDGE: Ridge, self.LASSO: Lasso}[self.estimator](
                 alpha=self.alpha, fit_intercept=self.fit_intercept, copy_X=False
             )
-        model.fit(X, y)
+        model.fit(X, y, sample_weight=w)
         self.coef_ = model.coef_
         self.intercept_ = model.intercept_
-    def _fit_nnls(self, X, y):
+    def _fit_nnls(self, X, y, w=None):
+        if w is not None:
+            raise NotImplementedError("TODO: support nnls with weight")  # TODO
         if self.fit_intercept:
             X = np.c_[X, np.ones(len(X))]  # NOTE: mem copy
         coef = nnls(X, y)[0]
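
Passing `sample_weight` to the linear estimators changes the objective from a plain sum of squared errors to a weighted one. The sketch below illustrates the effect for a single feature with an intercept, using the closed-form weighted least-squares solution in plain Python. The function name `weighted_linear_fit` and the toy data are assumptions for illustration, and the sketch does not reproduce scikit-learn's implementation.

```python
def weighted_linear_fit(xs, ys, ws=None):
    """Fit y ~ a * x + b by minimising sum(w_i * (y_i - a * x_i - b) ** 2)."""
    if ws is None:
        ws = [1.0] * len(xs)  # unweighted case: every sample counts equally
    total = sum(ws)
    # Weighted means of x and y.
    x_bar = sum(w * x for w, x in zip(ws, xs)) / total
    y_bar = sum(w * y for w, y in zip(ws, ys)) / total
    # Weighted covariance and variance terms.
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 100.0]  # the last point is an outlier off the line y = 2x + 1

a_plain, b_plain = weighted_linear_fit(xs, ys)
a_weighted, b_weighted = weighted_linear_fit(xs, ys, [1.0, 1.0, 1.0, 0.0])
```

Giving the outlier zero weight recovers the underlying line, while the unweighted fit is pulled strongly toward the outlier. A reweighter plays the same role at scale by deciding how much each training sample should influence the fit.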


@@ -554,7 +554,7 @@ class AdaRNN(nn.Module):
         return fc_out
-class TransferLoss(object):
+class TransferLoss:
     def __init__(self, loss_type="cosine", input_dim=512):
         """
         Supported loss_type: mmd(mmd_lin), mmd_rbf, coral, cosine, kl, js, mine, adv


@@ -22,6 +22,8 @@ from .pytorch_utils import count_parameters
 from ...model.base import Model
 from ...data.dataset import DatasetH, TSDatasetH
 from ...data.dataset.handler import DataHandlerLP
+from ...model.utils import ConcatDataset
+from ...data.dataset.weight import Reweighter
 class ALSTM(Model):
@@ -139,15 +141,18 @@ class ALSTM(Model):
     def use_gpu(self):
         return self.device != torch.device("cpu")
-    def mse(self, pred, label):
-        loss = (pred - label) ** 2
+    def mse(self, pred, label, weight):
+        loss = weight * (pred - label) ** 2
         return torch.mean(loss)
-    def loss_fn(self, pred, label):
+    def loss_fn(self, pred, label, weight=None):
         mask = ~torch.isnan(label)
+        if weight is None:
+            weight = torch.ones_like(label)
         if self.loss == "mse":
-            return self.mse(pred[mask], label[mask])
+            return self.mse(pred[mask], label[mask], weight[mask])
         raise ValueError("unknown loss `%s`" % self.loss)
@@ -164,12 +169,12 @@ class ALSTM(Model):
         self.ALSTM_model.train()
-        for data in data_loader:
+        for (data, weight) in data_loader:
             feature = data[:, :, 0:-1].to(self.device)
             label = data[:, -1, -1].to(self.device)
             pred = self.ALSTM_model(feature.float())
-            loss = self.loss_fn(pred, label)
+            loss = self.loss_fn(pred, label, weight.to(self.device))
             self.train_optimizer.zero_grad()
             loss.backward()
@@ -183,7 +188,7 @@ class ALSTM(Model):
         scores = []
         losses = []
-        for data in data_loader:
+        for (data, weight) in data_loader:
             feature = data[:, :, 0:-1].to(self.device)
             # feature[torch.isnan(feature)] = 0
@@ -191,7 +196,7 @@ class ALSTM(Model):
             with torch.no_grad():
                 pred = self.ALSTM_model(feature.float())
-                loss = self.loss_fn(pred, label)
+                loss = self.loss_fn(pred, label, weight.to(self.device))
                 losses.append(loss.item())
                 score = self.metric_fn(pred, label)
@@ -204,6 +209,7 @@ class ALSTM(Model):
         dataset,
         evals_result=dict(),
         save_path=None,
+        reweighter=None,
     ):
         dl_train = dataset.prepare("train", col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
         dl_valid = dataset.prepare("valid", col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
@@ -213,11 +219,28 @@ class ALSTM(Model):
         dl_train.config(fillna_type="ffill+bfill")  # process nan brought by dataloader
         dl_valid.config(fillna_type="ffill+bfill")  # process nan brought by dataloader
+        if reweighter is None:
+            wl_train = np.ones(len(dl_train))
+            wl_valid = np.ones(len(dl_valid))
+        elif isinstance(reweighter, Reweighter):
+            wl_train = reweighter.reweight(dl_train)
+            wl_valid = reweighter.reweight(dl_valid)
+        else:
+            raise ValueError("Unsupported reweighter type.")
         train_loader = DataLoader(
-            dl_train, batch_size=self.batch_size, shuffle=True, num_workers=self.n_jobs, drop_last=True
+            ConcatDataset(dl_train, wl_train),
+            batch_size=self.batch_size,
+            shuffle=True,
+            num_workers=self.n_jobs,
+            drop_last=True,
         )
         valid_loader = DataLoader(
-            dl_valid, batch_size=self.batch_size, shuffle=False, num_workers=self.n_jobs, drop_last=True
+            ConcatDataset(dl_valid, wl_valid),
+            batch_size=self.batch_size,
+            shuffle=False,
+            num_workers=self.n_jobs,
+            drop_last=True,
         )
         save_path = get_or_create_path(save_path)
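
The updated `mse` and `loss_fn` multiply each squared error by a per-sample weight after masking out samples whose label is NaN, then take a plain mean over the remaining samples. A dependency-free sketch of that computation on Python lists follows. The function name `weighted_mse` and the sample values are assumptions for illustration, and tensors are replaced by lists.

```python
import math


def weighted_mse(pred, label, weight=None):
    """Mean of weight * (pred - label) ** 2 over samples with a non-NaN label."""
    if weight is None:
        weight = [1.0] * len(label)  # same default as torch.ones_like(label)
    kept = [
        (p, l, w)
        for p, l, w in zip(pred, label, weight)
        if not math.isnan(l)  # mirrors mask = ~torch.isnan(label)
    ]
    # Like torch.mean, this divides by the number of kept samples,
    # not by the sum of the weights. It assumes at least one valid label.
    return sum(w * (p - l) ** 2 for p, l, w in kept) / len(kept)


pred = [1.0, 2.0, 3.0]
label = [0.0, float("nan"), 1.0]

plain = weighted_mse(pred, label)
weighted = weighted_mse(pred, label, [2.0, 5.0, 0.5])
```

Because the mean divides by the sample count, scaling all weights by a constant scales the loss by the same constant. Weights that average to one keep the loss on the same scale as the unweighted version.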


@@ -260,7 +260,7 @@ class GATs(Model):
         if self.model_path is not None:
             self.logger.info("Loading pretrained model...")
-            pretrained_model.load_state_dict(torch.load(self.model_path))
+            pretrained_model.load_state_dict(torch.load(self.model_path, map_location=self.device))
         model_dict = self.GAT_model.state_dict()
         pretrained_dict = {k: v for k, v in pretrained_model.state_dict().items() if k in model_dict}

Some files were not shown because too many files have changed in this diff.