mirror of https://github.com/microsoft/qlib.git synced 2026-06-29 09:01:18 +08:00

Compare commits

58 Commits

Author SHA1 Message Date
Young
e7954bdb32 update version 2022-01-15 22:49:14 +08:00
you-n-g
d6f69aefea Update data.rst 2022-01-15 19:22:31 +08:00
you-n-g
1bebe9780e Fix the read the docs error (#852) 2022-01-15 19:15:06 +08:00
you-n-g
7a4a92bc69 Update data.rst 2022-01-14 13:17:52 +08:00
you-n-g
271782c9dd Update data.rst 2022-01-14 09:19:12 +08:00
you-n-g
d0113ea7df pylint code refine & Fix nested example (#848)
* refine code by CI

* fix argument error

* fix nested eample
2022-01-14 09:09:21 +08:00
you-n-g
c3996955ef Update README.md 2022-01-13 15:29:43 +08:00
Jiabao Qu
8261965015 fix: highfreq_gdbt_model of prepare data (#846)
Co-authored-by: Jiabao Qu <qujiabao@logiocean.com>
2022-01-12 21:36:23 +08:00
Jiabao Qu
6f71f8a46b chore: remove hard code input dimension of model pytorch_tcts (#843)
Co-authored-by: Jiabao Qu <qujiabao@logiocean.com>
2022-01-12 19:12:20 +08:00
Chia-hung Tai
edd8badeaf [840] - Test case for operators. (#841)
* [840] - Test case for operators.

* Move import to the head of file and add test_setting.
2022-01-11 18:44:15 +08:00
Young
19689024d4 Fix exp uri CI bug 2022-01-10 17:29:27 +08:00
you-n-g
0304df0d5b Update README.md 2022-01-10 16:56:18 +08:00
Young
181ee3c070 FIX File Name 2022-01-10 16:55:20 +08:00
you-n-g
cf35562e84 DDG-DA paper code (#743)
* Merge data selection to main

* Update trainer for reweighter

* Typos fixed.

* update data selection interface

* successfully run exp after refactor some interface

* data selection share handler &  trainer

* fix meta model time series bug

* fix online workflow set_uri bug

* fix set_uri bug

* updawte ds docs and delay trainer bug

* docs

* resume reweighter

* add reweighting result

* fix qlib model import

* make recorder more friendly

* fix experiment workflow bug

* commit for merging master incase of conflictions

* Successful run DDG-DA with a single command

* remove unused code

* asdd more docs

* Update README.md

* Update & fix some bugs.

* Update configuration & remove debug functions

* Update README.md

* Modfify horizon from code rather than yaml

* Update performance in README.md

* fix part comments

* Remove unfinished TCTS.

* Fix some details.

* Update meta docs

* Update README.md of the benchmarks_dynamic

* Update README.md files

* Add README.md to the rolling_benchmark baseline.

* Refine the docs and link

* Rename README.md in benchmarks_dynamic.

* Remove comments.

* auto download data

Co-authored-by: wendili-cs <wendili.academic@qq.com>
Co-authored-by: demon143 <785696300@qq.com>
2022-01-10 16:52:37 +08:00
Chia-hung Tai
184ce34a34 [807] Move the REG_CONSTANT/EPS to constant.py. (#811)
* [807] Move the REG_CONSTANT to constant.py.

* import REG_US.

* Move EPS to constant.py.
2022-01-09 21:39:46 +08:00
Chia-hung Tai
382ababc01 Add description of the pu template. (#812) 2022-01-09 21:14:11 +08:00
Chia-hung Tai
bcf18c14de Fix typos and comments. (#815)
* Fix typos and comments.

* Add comma before and.
2022-01-09 21:13:25 +08:00
Chia-hung Tai
6c1332f604 Fix some warnings in log.py. (#805)
* Fix some warnings in log.py.

* Fix typo and using black format.

* Fix black.

* Rename dict_ to attrs
2022-01-06 15:36:00 +08:00
you-n-g
93088485c3 Update README.md (#802)
* Update README.md

* Update README.md

* Update README.md

* Update README.md
2022-01-04 19:16:04 +08:00
Chia-hung Tai
c633d3fec0 Fix BaseStrategy path. (#801)
qlib.strategy.base.BaseStrategy is the current path.
2022-01-04 18:55:40 +08:00
you-n-g
0b6d99bd38 Add a more understandable example of data workflow (#797)
* Update data.rst

* Update data.rst
2022-01-04 09:07:44 +08:00
you-n-g
03cce8c908 Some Optimization of online code (#784)
* Some Optimization of online code

* more flexible updater and load_object & fix p*_uri

* make recorder more friendly

* remove unused import
2022-01-03 15:52:03 +08:00
安阁锐
e76b409d9a Fix $volume normalization issue (#792)
* Fix $volume normalization issue

Fix: https://github.com/microsoft/qlib/issues/765

* black formatting

black formatting

* black formatting

black formatting

* black formatting

black formatting
2022-01-01 23:44:17 +08:00
Arthur Cui
3e79a088ef Add Crypto dataset from coingecko (#733)
* add crypto symbols collectors

* add crypto data collector

* add crypto symbols collectors

* add crypto data collector

* solver region and source problem

* fix merge

* fix merge

* clean all cn information

Co-authored-by: DefangCui <170007807@pku.edu.cn>
2021-12-31 22:24:26 +08:00
SunsetWolf
dfc0ed3c01 fix_typo (#790)
Signed-off-by: unknown <lv.linlang@qq.com>
2021-12-31 22:14:47 +08:00
you-n-g
f59cfe51e0 Fix account shared bug (#791)
* Fix account shared bug

* fix bug in nested executor
2021-12-31 15:56:21 +08:00
Pengrong Zhu
1ecdfd45fe fix dump_bin:DumpDataUpdate (#783) 2021-12-29 09:29:08 +08:00
Chao Ning
622303b83a add map_location to torch.load to make it work when cuda is unavailable (#782) 2021-12-29 00:02:04 +08:00
Chao Ning
6bafd0a09b Reformat example data names: use {region}_data for 1-day data, and {region}_data_1min for 1-min data (#781)
* Fix high-freq data name from `yahoo_cn_1min` to `cn_data_1min`

* re-format example data names using `qlib_{region}_{feq}`, e.g. qlib_cn_1d

* re-format example data names using `{region}_{feq}`, e.g. us_1d and cn_1min

* keep using  for 1day data, and change 1min data to
2021-12-28 23:58:49 +08:00
you-n-g
aed9c09091 Update news 2021-12-28 19:54:30 +08:00
Dong Zhou
1b8f0b4575 support optimization based strategy (#754)
* support optimization based strategy

* fix riskdata not found & update doc

* refactor signal_strategy

* add portfolio example

* Update examples/portfolio/prepare_riskdata.py

Co-authored-by: you-n-g <you-n-g@users.noreply.github.com>

* fix typo

Co-authored-by: you-n-g <you-n-g@users.noreply.github.com>

* fix typo

Co-authored-by: you-n-g <you-n-g@users.noreply.github.com>

* update doc

* fix riskmodel doc

Co-authored-by: you-n-g <you-n-g@users.noreply.github.com>

Co-authored-by: you-n-g <you-n-g@users.noreply.github.com>
2021-12-28 18:44:20 +08:00
you-n-g
4709909782 Add hook for supporting RL strategy (#768) 2021-12-27 12:16:36 +08:00
Pengrong Zhu
a0f49fe2e7 fix cn_index collector (#780) 2021-12-26 14:12:48 +08:00
you-n-g
2840570dd3 Fix Typo in README.md 2021-12-26 00:42:16 +08:00
you-n-g
00ad122175 Update Contributor list (#779) 2021-12-26 00:25:03 +08:00
you-n-g
3493f29e16 Enhance Task Dict Var (#778) 2021-12-26 00:18:44 +08:00
you-n-g
e33de44cb9 Update Docs of Alpha360 (#777) 2021-12-25 18:07:44 +08:00
Chia-hung Tai
e843e021a2 Use encoding="utf-8" in open. (#773) 2021-12-25 18:00:56 +08:00
Chia-hung Tai
5aa5a6f356 Replace scripts/get_data.py to get_data.py. (#775)
For the consitency in this page, replace scripts/get_data.py to get_data.py.
2021-12-25 16:12:04 +08:00
Chia-hung Tai
f490708025 Fix typo leanable to learnable. (#774) 2021-12-25 16:07:40 +08:00
you-n-g
41a5778684 Update strategy.rst
Add docs for the prediction score
2021-12-25 15:24:58 +08:00
you-n-g
ef161715f7 Add docs about the patameters (#771) 2021-12-24 15:26:27 +08:00
you-n-g
d087054a59 Add Cache to avoid frequently loading calendar (#766) 2021-12-23 09:08:52 +08:00
cuicorey
350fbe91c9 Change BCELoss in MLP model (#756) 2021-12-20 19:03:33 +08:00
you-n-g
2aca74cd21 Black Format 2021-12-20 18:21:31 +08:00
you-n-g
92ff3d20b9 Update processor.py 2021-12-20 18:18:59 +08:00
you-n-g
0552120a2e Update documents for qlib_uri 2021-12-20 14:18:53 +08:00
you-n-g
3480fd932f Update README.md 2021-12-18 12:29:36 +08:00
Pengrong Zhu
957f9a18e9 fix IndexError of the last trading day in backtest calendar (#751) 2021-12-17 11:11:56 +08:00
you-n-g
6c83632fc4 Update README.md 2021-12-14 18:13:04 +08:00
Arthur Cui
125922b77a solve VERSION.txt bug (#732)
* solve VERSION.txt bug

* back to main version

* change setup and init to follow pypi type

* add read function

* solve black format

Co-authored-by: DefangCui <170007807@pku.edu.cn>
2021-12-12 12:02:20 +08:00
Pengrong Zhu
5e69d089c0 add description of dataset document (#742) 2021-12-12 09:49:10 +08:00
Pengrong Zhu
c10c349b20 remove unneeded code from workflow_by_code.ipynb && fix analysis_model_performance (#740) 2021-12-11 13:23:00 +08:00
upgradvisor-bot
7cb1f7cee0 Hyperopt upgrade (#741)
* Upgrade hyperopt

* Do not use newly added progress bar

Co-authored-by: Raphael Sofaer <rsofaer@gmail.com>
2021-12-11 12:37:08 +08:00
you-n-g
d0ff5eea9d Update README.md 2021-12-10 17:39:15 +08:00
you-n-g
e99f00b445 Add method parameter for volume (#734) 2021-12-09 10:45:25 +08:00
you-n-g
e50ad4309e Update news 2021-12-08 10:24:58 +08:00
Young
d89ae2370f update version to dev 2021-12-08 08:25:28 +08:00
165 changed files with 4375 additions and 1152 deletions

View File

@@ -8,6 +8,7 @@
<!--- Why is this change required? What problem does it solve? -->
## How Has This Been Tested?
<! --- Put an `x` in all the boxes that apply: --->
- [ ] Pass the test by running: `pytest qlib/tests/test_all_pipeline.py` under upper directory of `qlib`.
- [ ] If you are adding a new feature, test on your own test scripts.

View File

@@ -30,7 +30,7 @@ Version 0.2.1
--------------------
- Support registering user-defined ``Provider``.
- Support use operators in string format, e.g. ``['Ref($close, 1)']`` is valid field format.
- Support dynamic fields in ``$some_field`` format. And exising fields like ``Close()`` may be deprecated in the future.
- Support dynamic fields in ``$some_field`` format. And existing fields like ``Close()`` may be deprecated in the future.
Version 0.2.2
--------------------
@@ -78,7 +78,7 @@ Version 0.3.5
- Support multi-label training, you can provide multiple label in ``handler``. (But LightGBM doesn't support due to the algorithm itself)
- Refactor ``handler`` code, dataset.py is no longer used, and you can deploy your own labels and features in ``feature_label_config``
- Handler only offer DataFrame. Also, ``trainer`` and model.py only receive DataFrame
- Change ``split_rolling_data``, we roll the data on market calender now, not on normal date
- Change ``split_rolling_data``, we roll the data on market calendar now, not on normal date
- Move some date config from ``handler`` to ``trainer``
Version 0.4.0
@@ -167,11 +167,11 @@ Version 0.8.0
- There are lots of changes for daily trading, it is hard to list all of them. But a few important changes could be noticed
- The trading limitation is more accurate;
- In `previous version <https://github.com/microsoft/qlib/blob/v0.7.2/qlib/contrib/backtest/exchange.py#L160>`_, longing and shorting actions share the same action.
- In `current verison <https://github.com/microsoft/qlib/blob/7c31012b507a3823117bddcc693fc64899460b2a/qlib/backtest/exchange.py#L304>`_, the trading limitation is different between loging and shorting action.
- In `current version <https://github.com/microsoft/qlib/blob/7c31012b507a3823117bddcc693fc64899460b2a/qlib/backtest/exchange.py#L304>`_, the trading limitation is different between longing and shorting actions.
- The constant is different when calculating annualized metrics.
- `Current version <https://github.com/microsoft/qlib/blob/7c31012b507a3823117bddcc693fc64899460b2a/qlib/contrib/evaluate.py#L42>`_ uses more accurate constant than `previous version <https://github.com/microsoft/qlib/blob/v0.7.2/qlib/contrib/evaluate.py#L22>`_
- `A new version <https://github.com/microsoft/qlib/blob/7c31012b507a3823117bddcc693fc64899460b2a/qlib/tests/data.py#L17>`_ of data is released. Due to the instability of the Yahoo data source, the data may be different after downloading data again.
- Users could chec kout the backtesting results between `Current version <https://github.com/microsoft/qlib/tree/7c31012b507a3823117bddcc693fc64899460b2a/examples/benchmarks>`_ and `previous version <https://github.com/microsoft/qlib/tree/v0.7.2/examples/benchmarks>`_
- Users could check out the backtesting results between `Current version <https://github.com/microsoft/qlib/tree/7c31012b507a3823117bddcc693fc64899460b2a/examples/benchmarks>`_ and `previous version <https://github.com/microsoft/qlib/tree/v0.7.2/examples/benchmarks>`_
Other Versions

View File

@@ -11,9 +11,13 @@
Recent released features
| Feature | Status |
| -- | ------ |
| Meta-Learning-based framework & DDG-DA | [Released](https://github.com/microsoft/qlib/pull/743) on Jan 10, 2022 |
| Planning-based portfolio optimization | [Released](https://github.com/microsoft/qlib/pull/754) on Dec 28, 2021 |
| Release Qlib v0.8.0 | [Released](https://github.com/microsoft/qlib/releases/tag/v0.8.0) on Dec 8, 2021 |
| ADD model | [Released](https://github.com/microsoft/qlib/pull/704) on Nov 22, 2021 |
| ADARNN model | [Released](https://github.com/microsoft/qlib/pull/689) on Nov 14, 2021 |
| TCN model | [Released](https://github.com/microsoft/qlib/pull/668) on Nov 4, 2021 |
| Nested Decision Framework | [Released](https://github.com/microsoft/qlib/pull/438) on Oct 1, 2021. [Example](https://github.com/microsoft/qlib/blob/main/examples/nested_decision_execution/workflow.py) and [Doc](https://qlib.readthedocs.io/en/latest/component/highfreq.html) |
|Temporal Routing Adaptor (TRA) | [Released](https://github.com/microsoft/qlib/pull/531) on July 30, 2021 |
| Transformer & Localformer | [Released](https://github.com/microsoft/qlib/pull/508) on July 22, 2021 |
| Release Qlib v0.7.0 | [Released](https://github.com/microsoft/qlib/releases/tag/v0.7.0) on July 12, 2021 |
@@ -47,9 +51,12 @@ For more details, please refer to our paper ["Qlib: An AI-oriented Quantitative
- [Data Preparation](#data-preparation)
- [Auto Quant Research Workflow](#auto-quant-research-workflow)
- [Building Customized Quant Research Workflow by Code](#building-customized-quant-research-workflow-by-code)
- [**Quant Model(Paper) Zoo**](#quant-model-paper-zoo)
- [Run a single model](#run-a-single-model)
- [Run multiple models](#run-multiple-models)
- [Main Challenges & Solutions in Quant Research](#main-challenges--solutions-in-quant-research)
- [Forecasting: Finding Valuable Signals/Patterns](#forecasting-finding-valuable-signalspatterns)
- [**Quant Model (Paper) Zoo**](#quant-model-paper-zoo)
- [Run a Single Model](#run-a-single-model)
- [Run Multiple Models](#run-multiple-models)
- [Adapting to Market Dynamics](#adapting-to-market-dynamics)
- [**Quant Dataset Zoo**](#quant-dataset-zoo)
- [More About Qlib](#more-about-qlib)
- [Offline Mode and Online Mode](#offline-mode-and-online-mode)
@@ -64,11 +71,8 @@ New features under development(order by estimated release time).
Your feedbacks about the features are very important.
| Feature | Status |
| -- | ------ |
| Planning-based portfolio optimization | Under review: https://github.com/microsoft/qlib/pull/280 |
| Fund data supporting and analysis | Under review: https://github.com/microsoft/qlib/pull/292 |
| Point-in-Time database | Under review: https://github.com/microsoft/qlib/pull/343 |
| High-frequency trading | Under review: https://github.com/microsoft/qlib/pull/408 |
| Meta-Learning-based data selection | Initial opensource version under development |
| Orderbook database | Under review: https://github.com/microsoft/qlib/pull/744 |
# Framework of Qlib
@@ -159,15 +163,17 @@ Load and prepare data by running the following code:
This dataset is created by public data collected by [crawler scripts](scripts/data_collector/), which have been released in
the same repository.
Users could create the same dataset with it.
Users could create the same dataset with it. [Description of dataset](https://github.com/microsoft/qlib/tree/main/scripts/data_collector#description-of-dataset)
*Please pay **ATTENTION** that the data is collected from [Yahoo Finance](https://finance.yahoo.com/lookup), and the data might not be perfect.
We recommend users to prepare their own data if they have a high-quality dataset. For more information, users can refer to the [related document](https://qlib.readthedocs.io/en/latest/component/data.html#converting-csv-format-into-qlib-format)*.
### Automatic update of daily frequency data (from yahoo finance)
> This step is *Optional* if users only want to try their models and strategies on history data.
>
> It is recommended that users update the data manually once (--trading_date 2021-05-25) and then set it to update automatically.
> For more information refer to: [yahoo collector](https://github.com/microsoft/qlib/tree/main/scripts/data_collector/yahoo#automatic-update-of-daily-frequency-datafrom-yahoo-finance)
>
> For more information, please refer to: [yahoo collector](https://github.com/microsoft/qlib/tree/main/scripts/data_collector/yahoo#automatic-update-of-daily-frequency-datafrom-yahoo-finance)
* Automatic update of data to the "qlib" directory each trading day(Linux)
* use *crontab*: `crontab -e`
@@ -192,7 +198,7 @@ We recommend users to prepare their own data if they have a high-quality dataset
```python
import qlib
from qlib.data import D
from qlib.config import REG_CN
from qlib.constant import REG_CN
# Initialization
mount_path = "~/.qlib/qlib_data/cn_data" # target_dir
@@ -277,8 +283,18 @@ Qlib provides a tool named `qrun` to run the whole workflow automatically (inclu
## Building Customized Quant Research Workflow by Code
The automatic workflow may not suit the research workflow of all Quant researchers. To support a flexible Quant research workflow, Qlib also provides a modularized interface to allow researchers to build their own workflow by code. [Here](examples/workflow_by_code.ipynb) is a demo for customized Quant research workflow by code.
# Main Challenges & Solutions in Quant Research
Quant investment is a unique scenario with many key challenges to be solved.
Currently, Qlib provides some solutions for several of them.
# [Quant Model (Paper) Zoo](examples/benchmarks)
## Forecasting: Finding Valuable Signals/Patterns
Accurate forecasting of the stock price trend is a very important part of constructing profitable portfolios.
However, the huge amount of data in various formats in the financial market makes it challenging to build forecasting models.
An increasing number of SOTA Quant research works/papers, which focus on building forecasting models to mine valuable signals/patterns in complex financial data, are released in `Qlib`.
### [Quant Model (Paper) Zoo](examples/benchmarks)
Here is a list of models built on `Qlib`.
- [GBDT based on XGBoost (Tianqi Chen, et al. KDD 2016)](examples/benchmarks/XGBoost/)
@@ -305,7 +321,7 @@ Your PR of new Quant models is highly welcomed.
The performance of each model on the `Alpha158` and `Alpha360` dataset can be found [here](examples/benchmarks/README.md).
## Run a single model
### Run a single model
All the models listed above are runnable with ``Qlib``. Users can find the config files we provide and some details about the model through the [benchmarks](examples/benchmarks) folder. More information can be retrieved at the model files listed above.
`Qlib` provides three different ways to run a single model, users can pick the one that fits their cases best:
@@ -315,7 +331,7 @@ All the models listed above are runnable with ``Qlib``. Users can find the confi
- Users can use the script [`run_all_model.py`](examples/run_all_model.py) listed in the `examples` folder to run a model. Here is an example of the specific shell command to be used: `python run_all_model.py run --models=lightgbm`, where the `--models` arguments can take any number of models listed above(the available models can be found in [benchmarks](examples/benchmarks/)). For more use cases, please refer to the file's [docstrings](examples/run_all_model.py).
- **NOTE**: Each baseline has different environment dependencies, please make sure that your python version aligns with the requirements(e.g. TFT only supports Python 3.6~3.7 due to the limitation of `tensorflow==1.15.0`)
## Run multiple models
### Run multiple models
`Qlib` also provides a script [`run_all_model.py`](examples/run_all_model.py) which can run multiple models for several iterations. (**Note**: the script only supports *Linux* for now. Other OSes will be supported in the future. It also does not yet support running the same model multiple times in parallel; this will be fixed in future development.)
The script will create a unique virtual environment for each model, and delete the environments after training. Thus, only experiment results such as `IC` and `backtest` results will be generated and stored.
@@ -327,6 +343,14 @@ python run_all_model.py run 10
It also provides the API to run specific models at once. For more use cases, please refer to the file's [docstrings](examples/run_all_model.py).
## [Adapting to Market Dynamics](examples/benchmarks_dynamic)
Due to the non-stationary nature of the financial market environment, the data distribution may change across periods, which makes the performance of models built on training data decay on future test data.
So adapting the forecasting models/strategies to market dynamics is very important to their performance.
Here is a list of solutions built on `Qlib`.
- [Rolling Retraining](examples/benchmarks_dynamic/baseline/)
- [DDG-DA on pytorch (Wendi, et al. AAAI 2022)](examples/benchmarks_dynamic/DDG-DA/)
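
One simple way to adapt to a shifting data distribution is to retrain on a sliding window of recent data. The sketch below only generates the index ranges of such windows. `rolling_windows` and its parameters are hypothetical names chosen for this illustration, not part of Qlib's API, and the linked baseline may differ in its details.

```python
def rolling_windows(n_periods, train_size, test_size, step):
    """Return (train, test) index ranges for a sliding-window retraining schedule.

    Hypothetical helper for illustration only; not a Qlib function.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    windows = []
    start = 0
    # Keep sliding while a full train window plus a full test window still fits.
    while start + train_size + test_size <= n_periods:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        windows.append((train, test))
        start += step
    return windows


# Toy schedule: 10 periods, train on 4, test on the next 2, slide by 2.
windows = rolling_windows(n_periods=10, train_size=4, test_size=2, step=2)
for train, test in windows:
    print(list(train), "->", list(test))
```

Each test window starts right after its training window, so a model is always evaluated on data that is later than anything it was trained on.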
# Quant Dataset Zoo
Dataset plays a very important role in Quant. Here is a list of the datasets built on `Qlib`:
@@ -397,17 +421,36 @@ Join IM discussion groups:
|![image](http://fintech.msra.cn/images_v070/qrcode/gitter_qr.png)|
# Contributing
We appreciate all contributions and thank all the contributors!
<a href="https://github.com/microsoft/qlib/graphs/contributors"><img src="https://contrib.rocks/image?repo=microsoft/qlib" /></a>
Before we released Qlib as an open-source project on GitHub in Sep 2020, Qlib was an internal project in our group. Unfortunately, the internal commit history was not kept. Many members of our group have also contributed a lot to Qlib, including Ruihua Wang, Yinda Zhang, Haisu Yu, Shuyu Wang, Bochen Pang, and [Dong Zhou](https://github.com/evanzd/evanzd). Special thanks to [Dong Zhou](https://github.com/evanzd/evanzd) for his initial version of Qlib.
## Guidance
This project welcomes contributions and suggestions.
**Here are some
[code standards](docs/developer/code_standard.rst) when you submit a pull request.**
[code standards](docs/developer/code_standard.rst) for submitting a pull request.**
If you want to contribute to Qlib's document, you can follow the steps in the figure below.
Making contributions is not a hard thing. Solving an issue(maybe just answering a question raised in [issues list](https://github.com/microsoft/qlib/issues) or [gitter](https://gitter.im/Microsoft/qlib)), fixing/issuing a bug, improving the documents and even fixing a typo are important contributions to Qlib.
For example, if you want to contribute to Qlib's document/code, you can follow the steps in the figure below.
<p align="center">
<img src="https://github.com/demon143/qlib/blob/main/docs/_static/img/change%20doc.gif" />
</p>
If you don't know how to start to contribute, you can refer to the following examples.
| Type | Examples |
| -- | -- |
| Solving issues | [Answer a question](https://github.com/microsoft/qlib/issues/749); [issuing](https://github.com/microsoft/qlib/issues/765) or [fixing](https://github.com/microsoft/qlib/pull/792) a bug |
| Docs | [Improve docs quality](https://github.com/microsoft/qlib/pull/797/files) ; [Fix a typo](https://github.com/microsoft/qlib/pull/774) |
| Feature | Implement a [requested feature](https://github.com/microsoft/qlib/projects) like [this](https://github.com/microsoft/qlib/pull/754); [Refactor interfaces](https://github.com/microsoft/qlib/pull/539/files) |
| Dataset | [Add a dataset](https://github.com/microsoft/qlib/pull/733) |
| Models | [Implement a new model](https://github.com/microsoft/qlib/pull/689) |
If you would like to become one of Qlib's maintainers to contribute more (e.g. help merge PR, triage issues), please contact us by email([qlib@microsoft.com](mailto:qlib@microsoft.com)). We are glad to help you to set the right permission.
## Licence
Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the right to use your contribution. For details, visit https://cla.opensource.microsoft.com.

View File

@@ -1 +0,0 @@
0.8.0

View File

@@ -21,6 +21,12 @@ The introduction of ``Data Layer`` includes the following parts.
- Cache
- Data and Cache File Structure
Here is a typical example of Qlib data workflow
- Users download data and convert it into Qlib format (with filename suffix `.bin`). In this step, typically only some basic data (such as OHLCV) are stored on disk.
- Basic features are created with Qlib's expression engine (e.g. "Ref($close, 60) / $close", the return of the last 60 trading days). Supported operators in the expression engine can be found `here <https://github.com/microsoft/qlib/blob/main/qlib/data/ops.py>`_. This step is typically implemented in Qlib's `Data Loader <https://qlib.readthedocs.io/en/latest/component/data.html#data-loader>`_, which is a component of `Data Handler <https://qlib.readthedocs.io/en/latest/component/data.html#data-handler>`_.
- If users require more complicated data processing (e.g. data normalization), `Data Handler <https://qlib.readthedocs.io/en/latest/component/data.html#data-handler>`_ supports user-customized processors to process data (some predefined processors can be found `here <https://github.com/microsoft/qlib/blob/main/qlib/data/dataset/processor.py>`_). Processors are different from operators in the expression engine: they are designed for complicated data processing methods that are hard to support with expression-engine operators.
- Finally, `Dataset <https://qlib.readthedocs.io/en/latest/component/data.html#dataset>`_ is responsible for preparing the model-specific dataset from the data processed by Data Handler.
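
As a rough illustration of the expression step, ``Ref($close, 60) / $close`` divides the close from 60 trading days earlier by the current close. The sketch below reproduces that arithmetic in plain Python on a toy series. ``ref`` and ``ref_ratio`` are hypothetical helpers written for this example, not Qlib functions, and an offset of 2 stands in for 60.

```python
import math


def ref(series, n):
    """Shift a series by n periods: position t holds the value from t - n.

    Positions without enough history are filled with NaN.
    Hypothetical helper for illustration only; not a Qlib function.
    """
    pad = min(n, len(series))
    return [math.nan] * pad + list(series[: len(series) - pad])


def ref_ratio(close, n):
    """Compute the equivalent of Ref(close, n) / close element by element."""
    return [prev / cur for prev, cur in zip(ref(close, n), close)]


close = [10.0, 11.0, 12.0, 12.0, 15.0]
ratio = ref_ratio(close, 2)
print(ratio)
```

The first two entries are NaN because no close exists two periods before them; every later entry compares a close with the one two periods earlier.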
Data Preparation
============================
@@ -46,6 +52,7 @@ Also, ``Qlib`` provides a high-frequency dataset. Users can run a high-frequency
Qlib Format Dataset
--------------------
``Qlib`` has provided an off-the-shelf dataset in `.bin` format, users could use the script ``scripts/get_data.py`` to download the China-Stock dataset as follows.
The price and volume data look different from the actual dealing price because they are **adjusted** (`adjusted price <https://www.investopedia.com/terms/a/adjusted_closing_price.asp>`_). You may also find that the adjusted price differs between data sources, because data sources vary in how they adjust prices. Qlib normalizes the price on the first trading day of each stock to 1 when adjusting them.
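
The normalization described above can be pictured as dividing every adjusted price of a stock by its price on the first trading day. This is a minimal sketch of that arithmetic only: ``normalize_to_first_day`` is a hypothetical helper, and the actual adjustment code in Qlib's data collectors may work differently.

```python
def normalize_to_first_day(adjusted_prices):
    """Rescale a price series so that its first value becomes 1.

    Hypothetical helper for illustration only; not a Qlib function.
    """
    if not adjusted_prices:
        raise ValueError("price series is empty")
    first = adjusted_prices[0]
    if first == 0:
        raise ValueError("first price must be non-zero")
    return [price / first for price in adjusted_prices]


adjusted_close = [25.0, 26.0, 24.5, 30.0]
normalized = normalize_to_first_day(adjusted_close)
print(normalized)
```

Relative changes are preserved by this rescaling, so returns computed from the normalized series match returns computed from the adjusted series.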
.. code-block:: bash
@@ -213,7 +220,7 @@ The `trade unit` defines the unit number of stocks can be used in a trade, and t
.. code-block:: python
from qlib.config import REG_CN
from qlib.constant import REG_CN
qlib.init(provider_uri='~/.qlib/qlib_data/cn_data', region=REG_CN)
@@ -338,7 +345,7 @@ DataHandlerLP
In addition to use ``Data Handler`` in an automatic workflow with ``qrun``, ``Data Handler`` can be used as an independent module, by which users can easily preprocess data (standardization, remove NaN, etc.) and build datasets.
In order to achieve so, ``Qlib`` provides a base class `qlib.data.dataset.DataHandlerLP <../reference/api.html#qlib.data.dataset.handler.DataHandlerLP>`_. The core idea of this class is that: we will have some leanable ``Processors`` which can learn the parameters of data processing(e.g., parameters for zscore normalization). When new data comes in, these `trained` ``Processors`` can then process the new data and thus processing real-time data in an efficient way becomes possible. More information about ``Processors`` will be listed in the next subsection.
In order to achieve so, ``Qlib`` provides a base class `qlib.data.dataset.DataHandlerLP <../reference/api.html#qlib.data.dataset.handler.DataHandlerLP>`_. The core idea of this class is that: we will have some learnable ``Processors`` which can learn the parameters of data processing(e.g., parameters for zscore normalization). When new data comes in, these `trained` ``Processors`` can then process the new data and thus processing real-time data in an efficient way becomes possible. More information about ``Processors`` will be listed in the next subsection.
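
The idea of a learnable processor can be sketched as an object that learns its parameters once and then reuses them on new data. The class below is a hypothetical stand-in written for this illustration; its name and its ``fit``/``transform`` methods are assumptions and do not reproduce the interface of Qlib's own ``Processors``.

```python
import statistics


class ZScoreProcessor:
    """Hypothetical learnable processor: learns mean/std once, then reuses them."""

    def __init__(self):
        self.mean = None
        self.std = None

    def fit(self, values):
        # Learn the normalization parameters from the training data only.
        self.mean = statistics.mean(values)
        self.std = statistics.pstdev(values)
        return self

    def transform(self, values):
        if self.mean is None:
            raise RuntimeError("call fit() before transform()")
        # Guard against a constant training series (zero standard deviation).
        scale = self.std if self.std > 0 else 1.0
        return [(v - self.mean) / scale for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
processor = ZScoreProcessor().fit(train)

# New data is normalized with the parameters learned from `train`.
new_scores = processor.transform([3.0, 6.0])
print(new_scores)
```

Because the parameters come from the training data alone, new observations can be processed one at a time without recomputing statistics over the full history.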
Interface

View File

@@ -14,7 +14,7 @@ To get the join trading performance of daily and intraday trading, they must int
In order to support joint backtesting of strategies at multiple levels, a corresponding framework is required. None of the publicly available high-frequency trading frameworks considers multi-level joint trading, which makes the aforementioned backtesting inaccurate.
Besides backtesting, the optimization of strategies at different levels is not standalone, and the strategies can affect each other.
For example, the best portfolio management strategy may change with the performance of order executions(e.g. a portfolio with higher turnover may becomes a better choice when we imporve the order execution strategies).
For example, the best portfolio management strategy may change with the performance of order executions (e.g. a portfolio with higher turnover may become a better choice when we improve the order execution strategies).
To achieve good overall performance, it is necessary to consider the interaction of strategies at different levels.
Therefore, building a new framework for trading at multiple levels becomes necessary to solve the various problems mentioned above, for which we designed a nested decision execution framework that considers the interaction of strategies.

68
docs/component/meta.rst Normal file
View File

@@ -0,0 +1,68 @@
.. _meta:
======================================================
Meta Controller: Meta-Task & Meta-Dataset & Meta-Model
======================================================
.. currentmodule:: qlib
Introduction
=============
``Meta Controller`` provides guidance to ``Forecast Model``, which aims to learn regular patterns among a series of forecasting tasks and use learned patterns to guide forthcoming forecasting tasks. Users can implement their own meta-model instance based on ``Meta Controller`` module.
Meta Task
=============
A `Meta Task` instance is the basic element in the meta-learning framework. It saves the data that can be used for the `Meta Model`. Multiple `Meta Task` instances may share the same `Data Handler`, controlled by `Meta Dataset`. Users should use `prepare_task_data()` to obtain the data that can be directly fed into the `Meta Model`.
.. autoclass:: qlib.model.meta.task.MetaTask
:members:
Meta Dataset
=============
`Meta Dataset` controls the meta-information generating process. It is responsible for providing data for training the `Meta Model`. Users should use `prepare_tasks` to retrieve a list of `Meta Task` instances.
.. autoclass:: qlib.model.meta.dataset.MetaTaskDataset
:members:
Meta Model
=============
General Meta Model
------------------
A `Meta Model` instance is the part that controls the workflow. The usage of the `Meta Model` includes:
1. Users train their `Meta Model` with the `fit` function.
2. The `Meta Model` instance guides the workflow by giving useful information via the `inference` function.
.. autoclass:: qlib.model.meta.model.MetaModel
:members:
Meta Task Model
------------------
This type of meta-model may interact with task definitions directly. Such meta-models should inherit from the `Meta Task Model` class. They guide the base tasks by modifying the base task definitions. The function `prepare_tasks` can be used to obtain the modified base task definitions.
.. autoclass:: qlib.model.meta.model.MetaTaskModel
:members:
Meta Guide Model
------------------
This type of meta-model participates in the training process of the base forecasting model. The meta-model may guide the base forecasting models during their training to improve their performances.
.. autoclass:: qlib.model.meta.model.MetaGuideModel
    :members:
Example
=============
``Qlib`` provides an implementation of the ``Meta Model`` module, ``DDG-DA``,
which adapts to the market dynamics.
``DDG-DA`` includes four steps:
1. Calculate meta-information and encapsulate it into ``Meta Task`` instances. All the meta-tasks form a ``Meta Dataset`` instance.
2. Train ``DDG-DA`` based on the training data of the meta-dataset.
3. Do the inference of the ``DDG-DA`` to get guide information.
4. Apply guide information to the forecasting models to improve their performances.
The `above example <https://github.com/microsoft/qlib/tree/main/examples/benchmarks_dynamic/DDG-DA>`_ can be found in ``examples/benchmarks_dynamic/DDG-DA/workflow.py``.

View File

@@ -37,7 +37,7 @@ Here is a general view of the structure of the system:
This experiment management system defines a set of interfaces and provides a concrete implementation, ``MLflowExpManager``, which is based on the machine learning platform ``MLFlow`` (`link <https://mlflow.org/>`_).
If users set the implementation of ``ExpManager`` to be ``MLflowExpManager``, they can use the command `mlflow ui` to visualize and check the experiment results. For more information, pleaes refer to the related documents `here <https://www.mlflow.org/docs/latest/cli.html#mlflow-ui>`_.
If users set the implementation of ``ExpManager`` to be ``MLflowExpManager``, they can use the command `mlflow ui` to visualize and check the experiment results. For more information, please refer to the related documents `here <https://www.mlflow.org/docs/latest/cli.html#mlflow-ui>`_.
Qlib Recorder
===================

View File

@@ -8,7 +8,7 @@ Portfolio Strategy: Portfolio Management
Introduction
===================
``Portfolio Strategy`` is designed to adopt different portfolio strategies, which means that users can adopt different algorithms to generate investment portfolios based on the prediction scores of the ``Forecast Model``. Users can use the ``Portfolio Strategy`` in an automatic workflow by ``Workflow`` module, please refer to `Workflow: Workflow Management <workflow.html>`_.
``Portfolio Strategy`` is designed to adopt different portfolio strategies, which means that users can adopt different algorithms to generate investment portfolios based on the prediction scores of the ``Forecast Model``. Users can use the ``Portfolio Strategy`` in an automatic workflow by ``Workflow`` module, please refer to `Workflow: Workflow Management <workflow.html>`_.
Because the components in ``Qlib`` are designed in a loosely-coupled way, ``Portfolio Strategy`` can be used as an independent module also.
@@ -22,20 +22,20 @@ Base Class & Interface
BaseStrategy
------------------
Qlib provides a base class ``qlib.contrib.strategy.BaseStrategy``. All strategy classes need to inherit the base class and implement its interface.
Qlib provides a base class ``qlib.strategy.base.BaseStrategy``. All strategy classes need to inherit the base class and implement its interface.
- `get_risk_degree`
Return the proportion of your total value you will use in investment. Dynamically risk_degree will result in Market timing.
- `generate_order_list`
Return the order list.
Return the order list.
Users can inherit `BaseStrategy` to customize their strategy class.
WeightStrategyBase
--------------------
Qlib also provides a class ``qlib.contrib.strategy.WeightStrategyBase`` that is a subclass of `BaseStrategy`.
Qlib also provides a class ``qlib.contrib.strategy.WeightStrategyBase`` that is a subclass of `BaseStrategy`.
`WeightStrategyBase` only focuses on the target positions, and automatically generates an order list based on positions. It provides the `generate_target_weight_position` interface.
@@ -71,17 +71,27 @@ TopkDropoutStrategy
- `Topk`: The number of stocks held
- `Drop`: The number of stocks sold on each trading day
Currently, the number of held stocks is `Topk`.
On each trading day, the `Drop` number of held stocks with the worst `prediction score` will be sold, and the same number of unheld stocks with the best `prediction score` will be bought.
.. image:: ../_static/img/topk_drop.png
:alt: Topk-Drop
``TopkDrop`` algorithm sells `Drop` stocks every trading day, which guarantees a fixed turnover rate.
- Generate the order list from the target amount
EnhancedIndexingStrategy
------------------------
`EnhancedIndexingStrategy`: enhanced indexing combines the arts of active management and passive management,
with the aim of outperforming a benchmark index (e.g., S&P 500) in terms of portfolio return while controlling
the risk exposure (a.k.a. tracking error).
For more information, please refer to `qlib.contrib.strategy.signal_strategy.EnhancedIndexingStrategy`
and `qlib.contrib.strategy.optimizer.enhanced_indexing.EnhancedIndexingOptimizer`.
Usage & Example
====================
@@ -112,6 +122,9 @@ A prediction sample is shown as follows.
``Forecast Model`` module can make predictions, please refer to `Forecast Model: Model Training & Prediction <model.html>`_.
Normally, the prediction score is the output of the models. But some models are learned from a label with a different scale, so the scale of the prediction score may differ from what you expect (e.g. the return of instruments).
Qlib does not add a step to rescale the prediction score to a unified scale, because not every trading strategy cares about the scale (e.g. `TopkDropoutStrategy` only cares about the order). The strategy is therefore responsible for rescaling the prediction score (e.g. some portfolio-optimization-based strategies may require a meaningful scale).
Running backtest
-----------------
@@ -283,4 +296,4 @@ The backtest results are in the following form:
Reference
===================
To know more about the `prediction score` `pred_score` output by ``Forecast Model``, please refer to `Forecast Model: Model Training & Prediction <model.html>`_.
To know more about the `prediction score` `pred_score` output by ``Forecast Model``, please refer to `Forecast Model: Model Training & Prediction <model.html>`_.

View File

@@ -31,7 +31,7 @@ Let's see an example,
First make sure you have the latest version of `qlib` installed.
Then, you need to privide a configuration to setup the experiment.
Then, you need to provide a configuration to setup the experiment.
We write a simple configuration example as following,
.. code-block:: YAML
@@ -217,13 +217,13 @@ The tuner pipeline contains different tuners, and the `tuner` program will proce
Each part represents a tuner, and its modules which are to be tuned. Space in each part is the hyper-parameters' space of a certain module, you need to create your searching space and modify it in `/qlib/contrib/tuner/space.py`. We use `hyperopt` package to help us to construct the space, you can see the detail of how to use it in https://github.com/hyperopt/hyperopt/wiki/FMin .
- model
You need to provide the `class` and the `space` of the model. If the model is user's own implementation, you need to privide the `module_path`.
You need to provide the `class` and the `space` of the model. If the model is user's own implementation, you need to provide the `module_path`.
- trainer
You need to proveide the `class` of the trainer. If the trainer is user's own implementation, you need to privide the `module_path`.
You need to provide the `class` of the trainer. If the trainer is user's own implementation, you need to provide the `module_path`.
- strategy
You need to provide the `class` and the `space` of the strategy. If the strategy is user's own implementation, you need to privide the `module_path`.
You need to provide the `class` and the `space` of the strategy. If the strategy is user's own implementation, you need to provide the `module_path`.
- data_label
The label of the data, you can search which kinds of labels will lead to a better result. This part is optional, and you only need to provide `space`.
@@ -273,7 +273,7 @@ You need to use the same dataset to evaluate your different `estimator` experime
About the data and backtest
~~~~~~~~~~~~~~~~~~~~~~~~~~~
`data` and `backtest` are all same in the whole `tuner` experiment. Different `estimator` experiments must use the same data and backtest method. So, these two parts of config are same with that in `estimator` configuration. You can see the precise defination of these parts in `estimator` introduction. We only provide an example here.
`data` and `backtest` are all same in the whole `tuner` experiment. Different `estimator` experiments must use the same data and backtest method. So, these two parts of config are same with that in `estimator` configuration. You can see the precise definition of these parts in `estimator` introduction. We only provide an example here.
.. code-block:: YAML

View File

@@ -36,10 +36,11 @@ Document Structure
:caption: COMPONENTS:
Workflow: Workflow Management <component/workflow.rst>
Data Layer: Data Framework&Usage <component/data.rst>
Data Layer: Data Framework & Usage <component/data.rst>
Forecast Model: Model Training & Prediction <component/model.rst>
Portfolio Management and Backtest <component/strategy.rst>
Nested Decision Execution: High-Frequency Trading <component/highfreq.rst>
Meta Controller: Meta-Task & Meta-Dataset & Meta-Model <component/meta.rst>
Qlib Recorder: Experiment Management <component/recorder.rst>
Analysis: Evaluation & Results Analysis <component/report.rst>
Online Serving: Online Management & Strategy & Tool <component/online.rst>

View File

@@ -31,7 +31,7 @@ Users can easily intsall ``Qlib`` according to the following steps:
git clone https://github.com/microsoft/qlib.git && cd qlib
python setup.py install
To kown more about `installation`, please refer to `Qlib Installation <../start/installation.html>`_.
To know more about `installation`, please refer to `Qlib Installation <../start/installation.html>`_.
Prepare Data
==============
@@ -44,7 +44,7 @@ Load and prepare data by running the following code:
This dataset is created by public data collected by crawler scripts in ``scripts/data_collector/``, which have been released in the same repository. Users could create the same dataset with it.
To kown more about `prepare data`, please refer to `Data Preparation <../component/data.html#data-preparation>`_.
To know more about `prepare data`, please refer to `Data Preparation <../component/data.html#data-preparation>`_.
Auto Quant Research Workflow
====================================

View File

@@ -3,3 +3,4 @@ cmake
numpy
scipy
scikit-learn
pandas

View File

@@ -27,7 +27,7 @@ Initialize Qlib before calling other APIs: run following code in python.
import qlib
# region in [REG_CN, REG_US]
from qlib.config import REG_CN
from qlib.constant import REG_CN
provider_uri = "~/.qlib/qlib_data/cn_data" # target_dir
qlib.init(provider_uri=provider_uri, region=REG_CN)
@@ -42,10 +42,10 @@ Besides `provider_uri` and `region`, `qlib.init` has other parameters. The follo
- `provider_uri`
Type: str. The URI of the Qlib data. For example, it could be the location where the data loaded by ``get_data.py`` are stored.
- `region`
Type: str, optional parameter(default: `qlib.config.REG_CN`).
Currently: ``qlib.config.REG_US`` ('us') and ``qlib.config.REG_CN`` ('cn') is supported. Different value of `region` will result in different stock market mode.
- ``qlib.config.REG_US``: US stock market.
- ``qlib.config.REG_CN``: China stock market.
Type: str, optional parameter(default: `qlib.constant.REG_CN`).
Currently: ``qlib.constant.REG_US`` ('us') and ``qlib.constant.REG_CN`` ('cn') is supported. Different value of `region` will result in different stock market mode.
- ``qlib.constant.REG_US``: US stock market.
- ``qlib.constant.REG_CN``: China stock market.
Different modes will result in different trading limitations and costs.
The region is just `shortcuts for defining a batch of configurations <https://github.com/microsoft/qlib/blob/main/qlib/config.py#L239>`_. Users can set the key configurations manually if the existing region setting can't meet their requirements.

View File

@@ -22,7 +22,6 @@ data_handler_config: &data_handler_config
- class: CSRankNorm
kwargs:
fields_group: label
label: ["Ref($close, -2) / Ref($close, -1) - 1"]
port_analysis_config: &port_analysis_config
strategy:
class: TopkDropoutStrategy

View File

@@ -9,7 +9,7 @@ Here are the results of each benchmark model running on Qlib's `Alpha360` and `A
The numbers shown below demonstrate the performance of the entire `workflow` of each model. We will update the `workflow` as well as models in the near future for better results.
<!--
> If you need to reproduce the results below, please use the **v1** dataset: `python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/qlib_cn_1d --region cn --version v1`
> If you need to reproduce the results below, please use the **v1** dataset: `python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/cn_data --region cn --version v1`
>
> In the new version of qlib, the default dataset is **v2**. Since the data is collected from the YahooFinance API (which is not very stable), the results of *v2* and *v1* may differ -->

View File

@@ -32,7 +32,7 @@ import abc
import enum
# Type defintions
# Type definitions
class DataTypes(enum.IntEnum):
"""Defines numerical types of each column."""

View File

@@ -254,9 +254,9 @@ class DistributedHyperparamOptManager(HyperparamOptManager):
param_ranges: Discrete hyperparameter range for random search.
fixed_params: Fixed model parameters per experiment.
root_model_folder: Folder to store optimisation artifacts.
worker_number: Worker index definining which set of hyperparameters to
worker_number: Worker index defining which set of hyperparameters to
test.
search_iterations: Maximum numer of random search iterations.
search_iterations: Maximum number of random search iterations.
num_iterations_per_worker: How many iterations are handled per worker.
clear_serialised_params: Whether to regenerate hyperparameter
combinations.
@@ -330,7 +330,7 @@ class DistributedHyperparamOptManager(HyperparamOptManager):
if os.path.exists(self.serialised_ranges_folder):
df = pd.read_csv(self.serialised_ranges_path, index_col=0)
else:
print("Unable to load - regenerating serach ranges instead")
print("Unable to load - regenerating search ranges instead")
df = self.update_serialised_hyperparam_df()
return df

View File

@@ -342,7 +342,7 @@ class TFTDataCache:
@classmethod
def contains(cls, key):
"""Retuns boolean indicating whether key is present in cache."""
"""Returns boolean indicating whether key is present in cache."""
return key in cls._data_cache
@@ -1120,10 +1120,10 @@ class TemporalFusionTransformer:
Args:
df: Input dataframe
return_targets: Whether to also return outputs aligned with predictions to
faciliate evaluation
facilitate evaluation
Returns:
Input dataframe or tuple of (input dataframe, algined output dataframe).
Input dataframe or tuple of (input dataframe, aligned output dataframe).
"""
data = self._batch_data(df)

View File

@@ -209,7 +209,6 @@ class TFTModel(ModelFT):
fixed_params = self.data_formatter.get_experiment_params()
params = self.data_formatter.get_default_model_params()
# Wendi: merge the tuned and non-tuned parameters
params = {**params, **fixed_params}
if not os.path.exists(self.model_folder):
@@ -295,7 +294,7 @@ class TFTModel(ModelFT):
def to_pickle(self, path: Union[Path, str]):
"""
Tensorflow model can't be dumped directly.
So the data should be save seperatedly
So the data should be save separately
**TODO**: Please implement the function to load the files

View File

@@ -57,7 +57,7 @@ And here are two ways to run the model:
python example.py --config_file configs/config_alstm.yaml
```
Here we trained TRA on a pretrained backbone model. Therefore we run `*_init.yaml` before TRA's scipts.
Here we trained TRA on a pretrained backbone model. Therefore we run `*_init.yaml` before TRA's scripts.
### Results

View File

@@ -124,7 +124,7 @@ class TRAModel(Model):
loss = (pred - label).pow(2).mean()
L = (all_preds.detach() - label[:, None]).pow(2)
L -= L.min(dim=-1, keepdim=True).values # normalize & ensure postive input
L -= L.min(dim=-1, keepdim=True).values # normalize & ensure positive input
data_set.assign_data(index, L) # save loss to memory
@@ -165,7 +165,7 @@ class TRAModel(Model):
L = (all_preds - label[:, None]).pow(2)
L -= L.min(dim=-1, keepdim=True).values # normalize & ensure postive input
L -= L.min(dim=-1, keepdim=True).values # normalize & ensure positive input
data_set.assign_data(index, L) # save loss to memory
@@ -484,7 +484,7 @@ class TRA(nn.Module):
"""Temporal Routing Adaptor (TRA)
TRA takes historical prediction erros & latent representation as inputs,
TRA takes historical prediction errors & latent representation as inputs,
then routes the input sample to a specific predictor for training & inference.
Args:

View File

@@ -0,0 +1,27 @@
# Introduction
This is the implementation of `DDG-DA` based on the `Meta Controller` component provided by `Qlib`.
## Background
In many real-world scenarios, we often deal with streaming data that is sequentially collected over time. Due to the non-stationary nature of the environment, the streaming data distribution may change in unpredictable ways, which is known as concept drift. To handle concept drift, previous methods first detect when/where the concept drift happens and then adapt models to fit the distribution of the latest data. However, there are still many cases in which some underlying factors of environment evolution are predictable, making it possible to model the future concept drift trend of the streaming data; such cases have not been fully explored in previous work.
Therefore, we propose a novel method, `DDG-DA`, that can effectively forecast the evolution of the data distribution and improve the performance of models. Specifically, we first train a predictor to estimate the future data distribution, then leverage it to generate training samples, and finally train models on the generated data.
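The last step, training a model on data shifted toward the predicted future distribution, can be illustrated with a weighted least-squares fit. This is only a minimal sketch and not the `DDG-DA` implementation: the function `weighted_linear_fit`, the toy data, and the hand-picked weights are all assumptions made for illustration, whereas `DDG-DA` learns how to re-weight samples with a meta-model.

```python
import numpy as np


def weighted_linear_fit(X, y, w):
    """Solve min_beta sum_i w_i * (y_i - x_i @ beta)^2 (hypothetical helper)."""
    sw = np.sqrt(w)
    # Scaling rows by sqrt(w) turns the weighted problem into ordinary least squares.
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta


# Toy stream with concept drift: y = 1 * x in the old period, y = 3 * x in the recent one.
x_old = np.linspace(-1.0, 1.0, 50)
x_new = np.linspace(-1.0, 1.0, 50)
X = np.concatenate([x_old, x_new])[:, None]
y = np.concatenate([1.0 * x_old, 3.0 * x_new])

uniform = np.ones(100)
# Assumed weights that emphasize the recent period (hand-picked, not learned).
recent_heavy = np.concatenate([np.full(50, 0.01), np.ones(50)])

beta_uniform = weighted_linear_fit(X, y, uniform)
beta_recent = weighted_linear_fit(X, y, recent_heavy)
print(beta_uniform, beta_recent)
```

With uniform weights the fitted slope averages the two regimes, while the recent-heavy weights pull it close to the recent slope, which is the effect a good re-weighting is meant to achieve.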
## Dataset
The data in the paper are private. So we conduct experiments on Qlib's public dataset.
Though the dataset is different, the conclusion remains the same. By applying `DDG-DA`, users can see rising trends at the test phase both in the proxy models' ICs and the performances of the forecasting models.
## Run the Code
Users can try `DDG-DA` by running the following command:
```bash
python workflow.py run_all
```
The default forecasting models are `Linear`. Users can choose other forecasting models by changing the `forecast_model` parameter when `DDG-DA` initializes. For example, users can try `LightGBM` forecasting models by running the following command:
```bash
python workflow.py --forecast_model="gbdt" run_all
```
## Results
The results of related methods in Qlib's public dataset can be found [here](../)

View File

@@ -0,0 +1 @@
torch==1.10.0

View File

@@ -0,0 +1,258 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from pathlib import Path
from qlib.model.meta.task import MetaTask
from qlib.contrib.meta.data_selection.model import MetaModelDS
from qlib.contrib.meta.data_selection.dataset import InternalData, MetaDatasetDS
from qlib.data.dataset.handler import DataHandlerLP
import pandas as pd
import fire
import sys
from tqdm.auto import tqdm
import yaml
import pickle
from qlib import auto_init
from qlib.model.trainer import TrainerR, task_train
from qlib.utils import init_instance_by_config
from qlib.workflow.task.gen import RollingGen, task_generator
from qlib.workflow import R
from qlib.tests.data import GetData
DIRNAME = Path(__file__).absolute().resolve().parent
sys.path.append(str(DIRNAME.parent / "baseline"))
from rolling_benchmark import RollingBenchmark # NOTE: sys.path is changed for import RollingBenchmark
class DDGDA:
    """
    Please run `python workflow.py run_all` to run the full workflow of the experiment.

    **NOTE**
    Before running the example, please clean your previous results with the following command
    - `rm -r mlruns`
    """

    def __init__(self, sim_task_model="linear", forecast_model="linear"):
        self.step = 20
        # NOTE:
        # the horizon must match the meaning in the base task template
        self.horizon = 20
        self.meta_exp_name = "DDG-DA"
        self.sim_task_model = sim_task_model  # The model to capture the distribution of data.
        self.forecast_model = forecast_model  # downstream forecasting models' type

    def get_feature_importance(self):
        # this must be LightGBM, because it needs to get the feature importance
        rb = RollingBenchmark(model_type="gbdt")
        task = rb.basic_task()

        model = init_instance_by_config(task["model"])
        dataset = init_instance_by_config(task["dataset"])
        model.fit(dataset)

        fi = model.get_feature_importance()

        # Because the model uses numpy arrays instead of dataframes to train LightGBM,
        # the following extra steps are needed to map the feature importance back to column names
        df = dataset.prepare(segments=slice(None), col_set="feature", data_key=DataHandlerLP.DK_R)
        cols = df.columns
        fi_named = {cols[int(k.split("_")[1])]: imp for k, imp in fi.to_dict().items()}

        return pd.Series(fi_named)

    def dump_data_for_proxy_model(self):
        """
        Dump data for training meta model.
        The meta model will be trained upon the proxy forecasting model.
        This dataset is for the proxy forecasting model.
        """
        topk = 30
        fi = self.get_feature_importance()
        col_selected = fi.nlargest(topk)

        rb = RollingBenchmark(model_type=self.sim_task_model)
        task = rb.basic_task()
        dataset = init_instance_by_config(task["dataset"])
        prep_ds = dataset.prepare(slice(None), col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)

        feature_df = prep_ds["feature"]
        label_df = prep_ds["label"]

        feature_selected = feature_df.loc[:, col_selected.index]

        # cross-sectional z-score of the selected features on each day
        feature_selected = feature_selected.groupby("datetime").apply(lambda df: (df - df.mean()).div(df.std()))
        feature_selected = feature_selected.fillna(0.0)

        df_all = {
            "label": label_df.reindex(feature_selected.index),
            "feature": feature_selected,
        }
        df_all = pd.concat(df_all, axis=1)
        df_all.to_pickle(DIRNAME / "fea_label_df.pkl")

        # dump data in handler format for aligning the interface
        handler = DataHandlerLP(
            data_loader={
                "class": "qlib.data.dataset.loader.StaticDataLoader",
                "kwargs": {"config": DIRNAME / "fea_label_df.pkl"},
            }
        )
        handler.to_pickle(DIRNAME / "handler_proxy.pkl", dump_all=True)

    @property
    def _internal_data_path(self):
        return DIRNAME / f"internal_data_s{self.step}.pkl"

    def dump_meta_ipt(self):
        """
        Dump data for training meta model.
        This function will dump the input data for meta model
        """
        # According to the experiments, the choice of the model type is very important for achieving good results
        rb = RollingBenchmark(model_type=self.sim_task_model)
        sim_task = rb.basic_task()

        if self.sim_task_model == "gbdt":
            sim_task["model"].setdefault("kwargs", {}).update({"early_stopping_rounds": None, "num_boost_round": 150})

        exp_name_sim = f"data_sim_s{self.step}"

        internal_data = InternalData(sim_task, self.step, exp_name=exp_name_sim)
        internal_data.setup(trainer=TrainerR)

        with self._internal_data_path.open("wb") as f:
            pickle.dump(internal_data, f)

    def train_meta_model(self):
        """
        training a meta model based on a simplified linear proxy model;
        """
        # 1) leverage the simplified proxy forecasting model to train meta model.
        # - Only the dataset part is important, in current version of meta model will integrate the
        rb = RollingBenchmark(model_type=self.sim_task_model)
        sim_task = rb.basic_task()
        proxy_forecast_model_task = {
            # "model": "qlib.contrib.model.linear.LinearModel",
            "dataset": {
                "class": "qlib.data.dataset.DatasetH",
                "kwargs": {
                    "handler": f"file://{(DIRNAME / 'handler_proxy.pkl').absolute()}",
                    "segments": {
                        "train": ("2008-01-01", "2010-12-31"),
                        "test": ("2011-01-01", sim_task["dataset"]["kwargs"]["segments"]["test"][1]),
                    },
                },
            },
            # "record": ["qlib.workflow.record_temp.SignalRecord"]
        }
        # 2) preparing meta dataset
        kwargs = dict(
            task_tpl=proxy_forecast_model_task,
            step=self.step,
            segments=0.62,  # keep test period consistent with the dataset yaml
            trunc_days=1 + self.horizon,
            hist_step_n=30,
            fill_method="max",
            rolling_ext_days=0,
        )
        # NOTE:
        # the input of meta model (internal data) are shared between proxy model and final forecasting model
        # but their task test segment are not aligned! It worked in my previous experiment.
        # So the misalignment will not affect the effectiveness of the method.
        with self._internal_data_path.open("rb") as f:
            internal_data = pickle.load(f)
        md = MetaDatasetDS(exp_name=internal_data, **kwargs)

        # 3) train and logging meta model
        with R.start(experiment_name=self.meta_exp_name):
            R.log_params(**kwargs)
            mm = MetaModelDS(step=self.step, hist_step_n=kwargs["hist_step_n"], lr=0.001, max_epoch=200, seed=43)
            mm.fit(md)
            R.save_objects(model=mm)

    @property
    def _task_path(self):
        return DIRNAME / f"tasks_s{self.step}.pkl"

    def meta_inference(self):
        """
        Leverage meta-model for inference:
        - Given
            - baseline tasks
            - input for meta model(internal data)
            - meta model (its learnt knowledge on proxy forecasting model is expected to transfer to normal forecasting model)
        """
        # 1) get meta model
        exp = R.get_exp(experiment_name=self.meta_exp_name)
        rec = exp.list_recorders(rtype=exp.RT_L)[0]
        meta_model: MetaModelDS = rec.load_object("model")

        # 2)
        # we transfer the knowledge of the meta model to the final forecasting tasks.
        # Create MetaTaskDataset for the final forecasting tasks
        # Aligning its settings with the MetaTaskDataset used when training the meta model is necessary

        # 2.1) get previous config
        param = rec.list_params()
        trunc_days = int(param["trunc_days"])
        step = int(param["step"])
        hist_step_n = int(param["hist_step_n"])
        fill_method = param.get("fill_method", "max")

        rb = RollingBenchmark(model_type=self.forecast_model)
        task_l = rb.create_rolling_tasks()

        # 2.2) create meta dataset for final dataset
        kwargs = dict(
            task_tpl=task_l,
            step=step,
            segments=0.0,  # all the tasks are for testing
            trunc_days=trunc_days,
            hist_step_n=hist_step_n,
            fill_method=fill_method,
            task_mode=MetaTask.PROC_MODE_TRANSFER,
        )

        with self._internal_data_path.open("rb") as f:
            internal_data = pickle.load(f)
        mds = MetaDatasetDS(exp_name=internal_data, **kwargs)

        # 3) meta model make inference and get new qlib task
        new_tasks = meta_model.inference(mds)
        with self._task_path.open("wb") as f:
            pickle.dump(new_tasks, f)

    def train_and_eval_tasks(self):
        """
        Training the tasks generated by meta model
        Then evaluate it
        """
        with self._task_path.open("rb") as f:
            tasks = pickle.load(f)
        rb = RollingBenchmark(rolling_exp="rolling_ds", model_type=self.forecast_model)
        rb.train_rolling_tasks(tasks)
        rb.ens_rolling()
        rb.update_rolling_rec()

    def run_all(self):
        # 1) file: handler_proxy.pkl
        self.dump_data_for_proxy_model()
        # 2)
        # file: internal_data_s20.pkl
        # mlflow: data_sim_s20, models for calculating meta_ipt
        self.dump_meta_ipt()
        # 3) meta model will be stored in `DDG-DA`
        self.train_meta_model()
        # 4) new_tasks are saved in "tasks_s20.pkl" (reweighter is added)
        self.meta_inference()
        # 5) load the saved tasks and train model
        self.train_and_eval_tasks()


if __name__ == "__main__":
    GetData().qlib_data(exists_skip=True)
    auto_init()
    fire.Fire(DDGDA)

View File

@@ -0,0 +1,18 @@
# Introduction
Due to the non-stationary nature of the financial market environment, the data distribution may change across periods, which makes the performance of models built on training data decay on future test data.
So adapting the forecasting models/strategies to market dynamics is very important to their performance.
The table below shows the performances of different solutions on different forecasting models.
## Alpha158 dataset
| Model Name | Dataset | IC | ICIR | Rank IC | Rank ICIR | Annualized Return | Information Ratio | Max Drawdown |
|------------------|---------|----|------|---------|-----------|-------------------|-------------------|--------------|
| RR[Linear] |Alpha158 |0.088|0.570|0.102 |0.622 |0.077 |1.175 |-0.086 |
| DDG-DA[Linear] |Alpha158 |0.093|0.622|0.106 |0.670 |0.085 |1.213 |-0.093 |
| RR[LightGBM] |Alpha158 |0.079|0.566|0.088 |0.592 |0.075 |1.226 |-0.096 |
| DDG-DA[LightGBM] |Alpha158 |0.084|0.639|0.093 |0.664 |0.099 |1.442 |-0.071 |
- The label horizon of the `Alpha158` dataset is set to 20.
- The rolling time intervals are set to 20 trading days.
- The test rolling periods are from January 2017 to August 2020.
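
The IC-related columns in the table are conventionally computed per trading day across instruments: IC as the Pearson correlation between prediction and label, Rank IC as the correlation of their ranks, and ICIR as the mean of the daily IC divided by its standard deviation. The following is a minimal sketch under those assumptions, not Qlib's evaluation code; the function `ic_summary`, the column names `pred` and `label`, and the toy data are hypothetical.

```python
import pandas as pd


def ic_summary(df: pd.DataFrame) -> dict:
    """Summarize daily cross-sectional IC / Rank IC (hypothetical helper).

    `df` is assumed to be indexed by (datetime, instrument) with columns `pred` and `label`.
    """
    by_day = df.groupby(level="datetime")
    # Pearson correlation between prediction and label on each day.
    ic = by_day.apply(lambda d: d["pred"].corr(d["label"]))
    # Spearman correlation, computed as the Pearson correlation of the ranks.
    rank_ic = by_day.apply(lambda d: d["pred"].rank().corr(d["label"].rank()))
    return {
        "IC": ic.mean(),
        "ICIR": ic.mean() / ic.std(),
        "Rank IC": rank_ic.mean(),
        "Rank ICIR": rank_ic.mean() / rank_ic.std(),
    }


index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2017-01-03", "2017-01-04"]), ["A", "B", "C", "D"]],
    names=["datetime", "instrument"],
)
toy = pd.DataFrame(
    {
        "pred": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
        "label": [0.1, 0.2, 0.3, 0.4, 1.0, 2.0, 4.0, 3.0],
    },
    index=index,
)
summary = ic_summary(toy)
print(summary)
```

On the toy data the first day is perfectly ordered and the second day swaps the top two instruments, so the average IC is high but below one.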

View File

@@ -0,0 +1,15 @@
# Introduction
This is the framework for periodic Rolling Retrain (RR) of forecasting models. RR adapts to market dynamics by periodically retraining on up-to-date data.
## Run the Code
Users can try RR by running the following command:
```bash
python rolling_benchmark.py run_all
```
The default forecasting models are `Linear`. Users can choose other forecasting models by changing the `model_type` parameter.
For example, users can try `LightGBM` forecasting models by running the following command:
```bash
python rolling_benchmark.py --model_type="gbdt" run_all
```
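
The rolling scheme can be pictured as a train window and a test window that both slide forward by a fixed step, with a gap between them so that training labels never look into the test period. The sketch below works on integer positions in a trading calendar and is not Qlib's `RollingGen`; the function `rolling_segments` and its parameters are hypothetical, with `step=20` and a gap of `horizon + 1 = 21` days chosen to mirror the settings used in this benchmark.

```python
def rolling_segments(n_days, train_len, step, trunc_days):
    """Return inclusive (train_start, train_end, test_start, test_end) positions.

    Hypothetical helper: `trunc_days` days are left out between the end of the
    training window and the start of the test window to avoid label leakage.
    """
    segments = []
    test_start = train_len + trunc_days
    while test_start < n_days:
        test_end = min(test_start + step, n_days) - 1
        train_end = test_start - trunc_days - 1
        train_start = train_end - train_len + 1
        segments.append((train_start, train_end, test_start, test_end))
        test_start += step  # slide both windows forward by one step
    return segments


horizon = 20
segments = rolling_segments(n_days=100, train_len=40, step=20, trunc_days=horizon + 1)
for seg in segments:
    print(seg)
```

With a label that needs prices up to `horizon + 1` days ahead, the last training day in each segment only uses prices from before the corresponding test window starts.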

View File

@@ -0,0 +1,114 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from qlib.model.ens.ensemble import RollingEnsemble
from qlib.utils import init_instance_by_config
import fire
import yaml
from qlib import auto_init
from pathlib import Path
from tqdm.auto import tqdm
from qlib.model.trainer import TrainerR
from qlib.workflow import R
from qlib.tests.data import GetData
DIRNAME = Path(__file__).absolute().resolve().parent
from qlib.workflow.task.gen import task_generator, RollingGen
from qlib.workflow.task.collect import RecorderCollector
from qlib.workflow.record_temp import PortAnaRecord, SigAnaRecord
class RollingBenchmark:
    """
    **NOTE**
    before running the example, please clean your previous results with following command
    - `rm -r mlruns`
    """

    def __init__(self, rolling_exp="rolling_models", model_type="linear") -> None:
        self.step = 20
        self.horizon = 20
        self.rolling_exp = rolling_exp
        self.model_type = model_type

    def basic_task(self):
        """For fast training rolling"""
        if self.model_type == "gbdt":
            conf_path = DIRNAME.parent.parent / "benchmarks" / "LightGBM" / "workflow_config_lightgbm_Alpha158.yaml"
            # dump the processed data on to disk for later loading to speed up the processing
            h_path = DIRNAME / "lightgbm_alpha158_handler_horizon{}.pkl".format(self.horizon)
        elif self.model_type == "linear":
            conf_path = DIRNAME.parent.parent / "benchmarks" / "Linear" / "workflow_config_linear_Alpha158.yaml"
            h_path = DIRNAME / "linear_alpha158_handler_horizon{}.pkl".format(self.horizon)
        else:
            raise AssertionError("Model type is not supported!")
        with conf_path.open("r") as f:
            conf = yaml.safe_load(f)

        # modify dataset horizon
        conf["task"]["dataset"]["kwargs"]["handler"]["kwargs"]["label"] = [
            "Ref($close, -{}) / Ref($close, -1) - 1".format(self.horizon + 1)
        ]

        task = conf["task"]

        if not h_path.exists():
            h_conf = task["dataset"]["kwargs"]["handler"]
            h = init_instance_by_config(h_conf)
            h.to_pickle(h_path, dump_all=True)

        task["dataset"]["kwargs"]["handler"] = f"file://{h_path}"
        task["record"] = ["qlib.workflow.record_temp.SignalRecord"]
        return task

    def create_rolling_tasks(self):
        task = self.basic_task()
        task_l = task_generator(
            task, RollingGen(step=self.step, trunc_days=self.horizon + 1)
        )  # the last `horizon + 1` days are truncated to avoid information leakage
        return task_l

    def train_rolling_tasks(self, task_l=None):
        if task_l is None:
            task_l = self.create_rolling_tasks()
        trainer = TrainerR(experiment_name=self.rolling_exp)
        trainer(task_l)

    COMB_EXP = "rolling"

    def ens_rolling(self):
        rc = RecorderCollector(
            experiment=self.rolling_exp,
            artifacts_key=["pred", "label"],
            process_list=[RollingEnsemble()],
            # rec_key_func=lambda rec: (self.COMB_EXP, rec.info["id"]),
            artifacts_path={"pred": "pred.pkl", "label": "label.pkl"},
        )
        res = rc()
        with R.start(experiment_name=self.COMB_EXP):
            R.log_params(exp_name=self.rolling_exp)
            R.save_objects(**{"pred.pkl": res["pred"], "label.pkl": res["label"]})

    def update_rolling_rec(self):
        """
        Evaluate the combined rolling results
        """
        for rid, rec in R.list_recorders(experiment_name=self.COMB_EXP).items():
            for rt_cls in SigAnaRecord, PortAnaRecord:
                rt = rt_cls(recorder=rec, skip_existing=True)
                rt.generate()
        print(f"Your evaluation results can be found in the experiment named `{self.COMB_EXP}`.")
def run_all(self):
# the results will be saved in mlruns.
# 1) each rolling task is saved in rolling_models
self.train_rolling_tasks()
# 2) combined rolling tasks and evaluation results are saved in rolling
self.ens_rolling()
self.update_rolling_rec()
if __name__ == "__main__":
GetData().qlib_data(exists_skip=True)
auto_init()
fire.Fire(RollingBenchmark)
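
The label rewrite in `basic_task` and the `trunc_days` argument in `create_rolling_tasks` both derive from `self.horizon`. A minimal sketch of that relationship follows; the helper names `build_label_expr` and `trunc_days_for` are hypothetical and not part of Qlib, and only the format string and the `horizon + 1` offset are taken from the code above.

```python
def build_label_expr(horizon: int) -> str:
    """Hypothetical helper: same format string as the label set in `basic_task`."""
    return "Ref($close, -{}) / Ref($close, -1) - 1".format(horizon + 1)


def trunc_days_for(horizon: int) -> int:
    """Hypothetical helper: same value as passed to `RollingGen(trunc_days=...)`."""
    return horizon + 1


if __name__ == "__main__":
    for h in (1, 5, 20):
        # The label looks `h + 1` days ahead, so the same number of days is truncated.
        print(h, build_label_expr(h), trunc_days_for(h))
```

Keeping both values tied to one number means a longer horizon automatically truncates more trailing days from each rolling segment.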

View File

@@ -1,15 +1,20 @@
# High-Frequency Dataset
# Introduction
This folder contains 2 examples
- A high-frequency dataset example
- An example of predicting the price trend in high-frequency data
## High-Frequency Dataset
This dataset is an example for RL high frequency trading.
## Get High-Frequency Data
### Get High-Frequency Data
Get high-frequency data by running the following command:
```bash
python workflow.py get_data
```
## Dump & Reload & Reinitialize the Dataset
### Dump & Reload & Reinitialize the Dataset
The High-Frequency Dataset is implemented as `qlib.data.dataset.DatasetH` in `workflow.py`. `DatasetH` is a subclass of [`qlib.utils.serial.Serializable`](https://qlib.readthedocs.io/en/latest/advanced/serial.html), whose state can be dumped to or loaded from disk in `pickle` format.
@@ -27,9 +32,9 @@ Run the example by running the following command:
python workflow.py dump_and_load_dataset
```
## Benchmarks Performance
### Signal Test
Here are the results of signal test for benchmark models. We will keep updating benchmark models in future.
## Benchmarks Performance (predicting the price trend in high-frequency data)
Here are the results of models for predicting the price trend in high-frequency data. We will keep updating benchmark models in the future.
| Model Name | Dataset | IC | ICIR | Rank IC | Rank ICIR | Long precision| Short Precision | Long-Short Average Return | Long-Short Average Sharpe |
|---|---|---|---|---|---|---|---|---|---|

View File

@@ -150,7 +150,7 @@ class Cut(ElemOperator):
self.l = l
self.r = r
if (self.l is not None and self.l <= 0) or (self.r is not None and self.r >= 0):
raise ValueError("Cut operator l shoud > 0 and r should < 0")
raise ValueError("Cut operator l should > 0 and r should < 0")
super(Cut, self).__init__(feature)

View File

@@ -1,5 +1,6 @@
import numpy as np
import pandas as pd
from qlib.constant import EPS
from qlib.data.dataset.processor import Processor
from qlib.data.dataset.utils import fetch_df_by_index
@@ -27,7 +28,7 @@ class HighFreqNorm(Processor):
part_values = np.log1p(part_values)
self.feature_med[name] = np.nanmedian(part_values)
part_values = part_values - self.feature_med[name]
self.feature_std[name] = np.nanmedian(np.absolute(part_values)) * 1.4826 + 1e-12
self.feature_std[name] = np.nanmedian(np.absolute(part_values)) * 1.4826 + EPS
part_values = part_values / self.feature_std[name]
self.feature_vmax[name] = np.nanmax(part_values)
self.feature_vmin[name] = np.nanmin(part_values)
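
The hunk above centers each feature on its median and scales it by the median absolute deviation times 1.4826, with a small constant added so the divisor is never zero. A standalone sketch of that robust scaling follows. It assumes `EPS = 1e-12` as a stand-in for `qlib.constant.EPS` (the literal it replaces in the diff), and the function names are illustrative rather than part of `HighFreqNorm`.

```python
import numpy as np

EPS = 1e-12  # assumption: stands in for `qlib.constant.EPS`


def robust_fit(values: np.ndarray):
    """Return (median, scale), where scale = MAD * 1.4826 + EPS, ignoring NaNs."""
    med = np.nanmedian(values)
    centered = values - med
    scale = np.nanmedian(np.absolute(centered)) * 1.4826 + EPS
    return med, scale


def robust_transform(values: np.ndarray, med: float, scale: float) -> np.ndarray:
    """Center by the median and divide by the robust scale."""
    return (values - med) / scale


if __name__ == "__main__":
    data = np.array([1.0, 2.0, 3.0, 4.0, 100.0, np.nan])
    med, scale = robust_fit(data)
    print(med, scale, robust_transform(data, med, scale))
```

The added constant only matters for degenerate features: when every value is identical the MAD is zero, and the constant keeps the division finite.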

View File

@@ -5,7 +5,8 @@ import fire
import qlib
import pickle
from qlib.config import REG_CN, HIGH_FREQ_CONFIG
from qlib.constant import REG_CN
from qlib.config import HIGH_FREQ_CONFIG
from qlib.utils import init_instance_by_config
from qlib.data.dataset.handler import DataHandlerLP
@@ -82,7 +83,7 @@ class HighfreqWorkflow:
def _init_qlib(self):
"""initialize qlib"""
# use yahoo_cn_1min data
# use cn_data_1min data
QLIB_INIT_CONFIG = {**HIGH_FREQ_CONFIG, **self.SPEC_CONF}
provider_uri = QLIB_INIT_CONFIG.get("provider_uri")
GetData().qlib_data(target_dir=provider_uri, interval="1min", region=REG_CN, exists_skip=True)

View File

@@ -1,6 +1,6 @@
import qlib
import optuna
from qlib.config import REG_CN
from qlib.constant import REG_CN
from qlib.utils import init_instance_by_config
from qlib.tests.config import CSI300_DATASET_CONFIG
from qlib.tests.data import GetData

View File

@@ -1,6 +1,6 @@
import qlib
import optuna
from qlib.config import REG_CN
from qlib.constant import REG_CN
from qlib.utils import init_instance_by_config
from qlib.tests.data import GetData
from qlib.tests.config import get_dataset_config, CSI300_MARKET, DATASET_ALPHA360_CLASS

View File

@@ -3,7 +3,7 @@
import qlib
from qlib.config import REG_CN
from qlib.constant import REG_CN
from qlib.utils import init_instance_by_config
from qlib.tests.data import GetData

View File

@@ -11,13 +11,13 @@ from pprint import pprint
import fire
import qlib
from qlib.config import REG_CN
from qlib.constant import REG_CN
from qlib.workflow import R
from qlib.workflow.task.gen import RollingGen, task_generator
from qlib.workflow.task.manage import TaskManager, run_task
from qlib.workflow.task.collect import RecorderCollector
from qlib.model.ens.group import RollingGroup
from qlib.model.trainer import TrainerRM, task_train
from qlib.model.trainer import TrainerR, TrainerRM, task_train
from qlib.tests.config import CSI100_RECORD_LGB_TASK_CONFIG, CSI100_RECORD_XGBOOST_TASK_CONFIG
@@ -29,7 +29,7 @@ class RollingTaskExample:
task_url="mongodb://10.0.0.4:27017/",
task_db_name="rolling_db",
experiment_name="rolling_exp",
task_pool="rolling_task",
task_pool=None,  # set a pool name such as "rolling_task" to train with TrainerRM instead of TrainerR
task_config=None,
rolling_step=550,
rolling_type=RollingGen.ROLL_SD,
@@ -43,14 +43,19 @@ class RollingTaskExample:
}
qlib.init(provider_uri=provider_uri, region=region, mongo=mongo_conf)
self.experiment_name = experiment_name
self.task_pool = task_pool
if task_pool is None:
self.trainer = TrainerR(experiment_name=self.experiment_name)
else:
self.task_pool = task_pool
self.trainer = TrainerRM(self.experiment_name, self.task_pool)
self.task_config = task_config
self.rolling_gen = RollingGen(step=rolling_step, rtype=rolling_type)
# Reset all things to the first status, be careful to save important data
def reset(self):
print("========== reset ==========")
TaskManager(task_pool=self.task_pool).remove()
if isinstance(self.trainer, TrainerRM):
TaskManager(task_pool=self.task_pool).remove()
exp = R.get_exp(experiment_name=self.experiment_name)
for rid in exp.list_recorders():
exp.delete_recorder(rid)
@@ -66,10 +71,10 @@ class RollingTaskExample:
def task_training(self, tasks):
print("========== task_training ==========")
trainer = TrainerRM(self.experiment_name, self.task_pool)
trainer.train(tasks)
self.trainer.train(tasks)
def worker(self):
# NOTE: this is only used for TrainerRM
# train tasks in other processes or on other machines for multiprocessing. It is the same as TrainerRM.worker.
print("========== worker ==========")
run_task(task_train, self.task_pool, experiment_name=self.experiment_name)

View File

@@ -100,7 +100,8 @@ from copy import deepcopy
import qlib
import fire
import pandas as pd
from qlib.config import REG_CN, HIGH_FREQ_CONFIG
from qlib.constant import REG_CN
from qlib.config import HIGH_FREQ_CONFIG
from qlib.data import D
from qlib.utils import exists_qlib_data, init_instance_by_config, flatten_dict
from qlib.workflow import R
@@ -154,6 +155,8 @@ class NestedDecisionExecutionWorkflow:
},
}
exp_name = "nested"
port_analysis_config = {
"executor": {
"class": "NestedExecutor",
@@ -229,7 +232,7 @@ class NestedDecisionExecutionWorkflow:
qlib.init(provider_uri=provider_uri_map, dataset_cache=None, expression_cache=None)
def _train_model(self, model, dataset):
with R.start(experiment_name="train"):
with R.start(experiment_name=self.exp_name):
R.log_params(**flatten_dict(self.task))
model.fit(dataset)
R.save_objects(**{"params.pkl": model})
@@ -256,7 +259,7 @@ class NestedDecisionExecutionWorkflow:
self.port_analysis_config["strategy"] = strategy_config
self.port_analysis_config["backtest"]["benchmark"] = self.benchmark
with R.start(experiment_name="backtest"):
with R.start(experiment_name=self.exp_name, resume=True):
recorder = R.get_recorder()
par = PortAnaRecord(
recorder,
@@ -298,7 +301,7 @@ class NestedDecisionExecutionWorkflow:
# - Aligning the profit calculation between multiple levels and single levels.
# 2) comparing different backtest
# - Basic test idea:
# - the daily backtest will be similar as multi-level(the data quality makes this gap samller)
# - the daily backtest will be similar to the multi-level one (the data quality makes this gap smaller)
def check_diff_freq(self):
self._init_qlib()
@@ -381,7 +384,7 @@ class NestedDecisionExecutionWorkflow:
}
pa_conf["backtest"]["benchmark"] = self.benchmark
with R.start(experiment_name="backtest"):
with R.start(experiment_name=self.exp_name, resume=True):
recorder = R.get_recorder()
par = PortAnaRecord(recorder, pa_conf)
par.generate()

View File

@@ -10,7 +10,7 @@ Next, we will finish updating online predictions.
import copy
import fire
import qlib
from qlib.config import REG_CN
from qlib.constant import REG_CN
from qlib.model.trainer import task_train
from qlib.workflow.online.utils import OnlineToolR
from qlib.tests.config import CSI300_GBDT_TASK

View File

@@ -0,0 +1,46 @@
# Portfolio Optimization Strategy
## Introduction
In `qlib/examples/benchmarks` we have various **alpha** models that predict
the stock returns. We also use a simple rule based `TopkDropoutStrategy` to
evaluate the investing performance of these models. However, such a strategy
is too simple to control the portfolio risk like correlation and volatility.
To this end, an optimization based strategy should be used for the
trade-off between return and risk. In this doc, we will show how to use
`EnhancedIndexingStrategy` to maximize portfolio return while minimizing
tracking error relative to a benchmark.
## Preparation
We use China stock market data for our example.
1. Prepare CSI300 weight:
```bash
wget http://fintech.msra.cn/stock_data/downloads/csi300_weight.zip
unzip -d ~/.qlib/qlib_data/cn_data csi300_weight.zip
rm -f csi300_weight.zip
```
2. Prepare risk model data:
```bash
python prepare_riskdata.py
```
Here we use a **Statistical Risk Model** implemented in `qlib.model.riskmodel`.
However, users are strongly encouraged to use other risk models for better quality:
* **Fundamental Risk Model** like MSCI BARRA
* [Deep Risk Model](https://arxiv.org/abs/2107.05201)
## End-to-End Workflow
You can run the whole workflow with `EnhancedIndexingStrategy` by running
`qrun config_enhanced_indexing.yaml`.
In this config, we mainly changed the strategy section compared to
`qlib/examples/benchmarks/workflow_config_lightgbm_Alpha158.yaml`.

View File

@@ -0,0 +1,71 @@
qlib_init:
provider_uri: "~/.qlib/qlib_data/cn_data"
region: cn
market: &market csi300
benchmark: &benchmark SH000300
data_handler_config: &data_handler_config
start_time: 2008-01-01
end_time: 2020-08-01
fit_start_time: 2008-01-01
fit_end_time: 2014-12-31
instruments: *market
port_analysis_config: &port_analysis_config
strategy:
class: EnhancedIndexingStrategy
module_path: qlib.contrib.strategy
kwargs:
model: <MODEL>
dataset: <DATASET>
riskmodel_root: ./riskdata
backtest:
start_time: 2017-01-01
end_time: 2020-08-01
account: 100000000
benchmark: *benchmark
exchange_kwargs:
limit_threshold: 0.095
deal_price: close
open_cost: 0.0005
close_cost: 0.0015
min_cost: 5
task:
model:
class: LGBModel
module_path: qlib.contrib.model.gbdt
kwargs:
loss: mse
colsample_bytree: 0.8879
learning_rate: 0.2
subsample: 0.8789
lambda_l1: 205.6999
lambda_l2: 580.9768
max_depth: 8
num_leaves: 210
num_threads: 20
dataset:
class: DatasetH
module_path: qlib.data.dataset
kwargs:
handler:
class: Alpha158
module_path: qlib.contrib.data.handler
kwargs: *data_handler_config
segments:
train: [2008-01-01, 2014-12-31]
valid: [2015-01-01, 2016-12-31]
test: [2017-01-01, 2020-08-01]
record:
- class: SignalRecord
module_path: qlib.workflow.record_temp
kwargs:
model: <MODEL>
dataset: <DATASET>
- class: SigAnaRecord
module_path: qlib.workflow.record_temp
kwargs:
ana_long_short: False
ann_scaler: 252
- class: PortAnaRecord
module_path: qlib.workflow.record_temp
kwargs:
config: *port_analysis_config

View File

@@ -0,0 +1,55 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import os
import numpy as np
import pandas as pd
from qlib.data import D
from qlib.model.riskmodel import StructuredCovEstimator
def prepare_data(riskdata_root="./riskdata", T=240, start_time="2016-01-01"):
universe = D.features(D.instruments("csi300"), ["$close"], start_time=start_time).swaplevel().sort_index()
price_all = (
D.features(D.instruments("all"), ["$close"], start_time=start_time).squeeze().unstack(level="instrument")
)
# StructuredCovEstimator is a statistical risk model
riskmodel = StructuredCovEstimator()
for i in range(T - 1, len(price_all)):
date = price_all.index[i]
ref_date = price_all.index[i - T + 1]
print(date)
codes = universe.loc[date].index
price = price_all.loc[ref_date:date, codes]
# calculate return and remove extreme return
ret = price.pct_change()
ret.clip(ret.quantile(0.025), ret.quantile(0.975), axis=1, inplace=True)
# run risk model
F, cov_b, var_u = riskmodel.predict(ret, is_price=False, return_decomposed_components=True)
# save risk data
root = riskdata_root + "/" + date.strftime("%Y%m%d")
os.makedirs(root, exist_ok=True)
pd.DataFrame(F, index=codes).to_pickle(root + "/factor_exp.pkl")
pd.DataFrame(cov_b).to_pickle(root + "/factor_cov.pkl")
# for specific_risk we follow the convention to save volatility
pd.Series(np.sqrt(var_u), index=codes).to_pickle(root + "/specific_risk.pkl")
if __name__ == "__main__":
import qlib
qlib.init(provider_uri="~/.qlib/qlib_data/cn_data")
prepare_data()
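
The script above stores three pieces per date: factor exposures (`factor_exp.pkl`), the factor covariance (`factor_cov.pkl`), and specific risk saved as volatility (`specific_risk.pkl`). A minimal sketch of how such pieces are usually recombined into a full covariance matrix follows. It assumes the standard factor-model decomposition `F @ cov_b @ F.T + diag(var_u)`; the function names are hypothetical, and this is not presented as the code that `EnhancedIndexingStrategy` runs.

```python
import numpy as np


def assemble_covariance(factor_exp, factor_cov, specific_risk) -> np.ndarray:
    """Rebuild the stock covariance from decomposed risk data.

    specific_risk is a volatility (as saved by the script), so it is squared
    to recover the specific variance on the diagonal.
    """
    F = np.asarray(factor_exp, dtype=float)
    cov_b = np.asarray(factor_cov, dtype=float)
    vol_u = np.asarray(specific_risk, dtype=float)
    return F @ cov_b @ F.T + np.diag(vol_u**2)


def tracking_error(weight, bench_weight, cov) -> float:
    """Volatility of the active (portfolio minus benchmark) weights."""
    active = np.asarray(weight, dtype=float) - np.asarray(bench_weight, dtype=float)
    return float(np.sqrt(active @ cov @ active))


if __name__ == "__main__":
    F = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 stocks, 2 factors (toy numbers)
    cov_b = [[0.04, 0.0], [0.0, 0.04]]
    vol_u = [0.1, 0.2, 0.3]
    cov = assemble_covariance(F, cov_b, vol_u)
    print(cov)
    print(tracking_error([0.5, 0.3, 0.2], [0.4, 0.4, 0.2], cov))
```

Storing volatility rather than variance follows the convention noted in the script's comment, so any consumer has to square it before placing it on the diagonal.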

View File

@@ -6,7 +6,7 @@ import fire
import pickle
from datetime import datetime
from qlib.config import REG_CN
from qlib.constant import REG_CN
from qlib.data.dataset.handler import DataHandlerLP
from qlib.utils import init_instance_by_config
from qlib.tests.data import GetData

View File

@@ -20,7 +20,6 @@ from operator import xor
from pprint import pprint
import qlib
from qlib.config import REG_CN
from qlib.workflow import R
from qlib.tests.data import GetData
@@ -248,7 +247,7 @@ class ModelRunner:
determines the dataset to be used for each model.
qlib_uri : str
the uri to install qlib with pip
it could be url on the we or local path
it could be a url on the web or a local path (NOTE: the local path must be an absolute path)
exp_folder_name: str
the name of the experiment folder
wait_before_rm_env : bool

View File

@@ -61,13 +61,7 @@
"\n",
"import qlib\n",
"import pandas as pd\n",
"from qlib.config import REG_CN\n",
"from qlib.contrib.model.gbdt import LGBModel\n",
"from qlib.contrib.data.handler import Alpha158\n",
"from qlib.contrib.evaluate import (\n",
" backtest as normal_backtest,\n",
" risk_analysis,\n",
")\n",
"from qlib.constant import REG_CN\n",
"from qlib.utils import exists_qlib_data, init_instance_by_config\n",
"from qlib.workflow import R\n",
"from qlib.workflow.record_temp import SignalRecord, PortAnaRecord\n",

View File

@@ -2,7 +2,7 @@
# Licensed under the MIT License.
import qlib
from qlib.config import REG_CN
from qlib.constant import REG_CN
from qlib.utils import init_instance_by_config, flatten_dict
from qlib.workflow import R
from qlib.workflow.record_temp import SignalRecord, PortAnaRecord, SigAnaRecord

View File

@@ -2,8 +2,7 @@
# Licensed under the MIT License.
from pathlib import Path
_version_path = Path(__file__).absolute().parent / "VERSION.txt" # This file is copyed from setup.py
__version__ = _version_path.read_text(encoding="utf-8").strip()
__version__ = "0.8.1"
__version__bak = __version__ # This version is backup for QlibConfig.reset_qlib_version
import os
from typing import Union
@@ -16,6 +15,22 @@ from .log import get_module_logger
# init qlib
def init(default_conf="client", **kwargs):
"""
Parameters
----------
default_conf: str
the default value is client. Accepted values: client/server.
**kwargs :
clear_mem_cache: bool
the default value is True;
Whether to clear the memory cache.
It is often used to improve performance when init will be called multiple times
skip_if_reg: bool
the default value is True;
When using the recorder, skip_if_reg can be set to True to avoid loss of recorder.
"""
from .config import C
from .data.cache import H
@@ -29,7 +44,9 @@ def init(default_conf="client", **kwargs):
logger.warning("Skip initialization because `skip_if_reg is True`")
return
H.clear()
clear_mem_cache = kwargs.pop("clear_mem_cache", True)
if clear_mem_cache:
H.clear()
C.set(default_conf, **kwargs)
# mount nfs
@@ -169,7 +186,7 @@ def get_project_path(config_name="config.yaml", cur_path: Union[Path, str, None]
- There is a file named `config.yaml` in qlib.
For example:
If your project file system stucuture follows such a pattern
If your project file system structure follows such a pattern
<project_path>/
- config.yaml
@@ -214,7 +231,7 @@ def auto_init(**kwargs):
Here are two examples of the configuration
Example 1)
If you want create a new project-specific config based on a shared configure, you can use `conf_type: ref`
If you want to create a new project-specific config based on a shared configure, you can use `conf_type: ref`
.. code-block:: yaml
@@ -230,7 +247,7 @@ def auto_init(**kwargs):
default_exp_name: "Experiment"
Example 2)
If you wan to create simple a stand alone config, you can use following config(a.k.a `conf_type: origin`)
If you want to create a simple standalone config, you can use the following config (a.k.a. `conf_type: origin`)
.. code-block:: python
@@ -260,8 +277,8 @@ def auto_init(**kwargs):
init_from_yaml_conf(conf_pp, **kwargs)
elif conf_type == "ref":
# This config type will be more convenient in following scenario
# - There is a shared configure file and you don't want to edit it inplace.
# - The shared configure may be updated later and you don't want to copy it.
# - There is a shared configure file, and you don't want to edit it inplace.
# - The shared configure may be updated later, and you don't want to copy it.
# - You have some customized config.
qlib_conf_path = conf.get("qlib_cfg", None)
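
The `clear_mem_cache` change in `init` above pops the flag from `kwargs` before the remaining keyword arguments are passed on to the config. A minimal sketch of that pattern follows. The dictionaries `mem_cache` and `settings` are hypothetical stand-ins for Qlib's `H` and `C` objects, so this illustrates the control flow only, not Qlib's actual initialization.

```python
mem_cache = {"calendar": [1, 2, 3]}  # hypothetical stand-in for the memory cache `H`
settings = {}  # hypothetical stand-in for the global config `C`


def init(default_conf="client", **kwargs):
    # Pop the flag first so it is not forwarded as a config entry.
    clear_mem_cache = kwargs.pop("clear_mem_cache", True)
    if clear_mem_cache:
        mem_cache.clear()
    settings.clear()
    settings.update(default_conf=default_conf, **kwargs)


if __name__ == "__main__":
    init(clear_mem_cache=False, region="cn")
    print(mem_cache, settings)  # cache is kept, flag is absent from settings
    init(region="cn")
    print(mem_cache, settings)  # cache is cleared by default
```

Using `pop` with a default of `True` keeps the previous behavior for callers that never pass the flag.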

View File

@@ -31,7 +31,7 @@ rtn & earning in the Account
class AccumulatedInfo:
"""
accumulated trading info, including accumulated return/cost/turnover
AccumulatedInfo should be shared accross different levels
AccumulatedInfo should be shared across different levels
"""
def __init__(self):
@@ -199,7 +199,7 @@ class Account:
# if stock is sold out, no stock price information in Position, then we should update account first, then update current position
# if stock is bought, there is no stock in current position, update current, then update account
# The cost will be substracted from the cash at last. So the trading logic can ignore the cost calculation
# The cost will be subtracted from the cash at last. So the trading logic can ignore the cost calculation
if order.direction == Order.SELL:
# sell stock
self._update_state_from_order(order, trade_val, cost, trade_price)
@@ -378,7 +378,7 @@ class Account:
)
def get_portfolio_metrics(self):
"""get the history portfolio_metrics and postions instance"""
"""get the history portfolio_metrics and positions instance"""
if self.is_port_metr_enabled():
_portfolio_metrics = self.portfolio_metrics.generate_portfolio_metrics_dataframe()
_positions = self.get_hist_positions()

View File

@@ -13,7 +13,7 @@ from tqdm.auto import tqdm
def backtest_loop(start_time, end_time, trade_strategy: BaseStrategy, trade_executor: BaseExecutor):
"""backtest funciton for the interaction of the outermost strategy and executor in the nested decision execution
"""backtest function for the interaction of the outermost strategy and executor in the nested decision execution
please refer to the docs of `collect_data_loop`

View File

@@ -505,8 +505,8 @@ class BaseTradeDecision:
`inner_trade_decision` will be changed **inplaced**.
Motivation of the `mod_inner_decision`
- Leave a hook for outer decision to affact the decision generated by the inner strategy
- e.g. the outmost strategy generate a time range for trading. But the upper layer can only affact the
- Leave a hook for outer decision to affect the decision generated by the inner strategy
- e.g. the outermost strategy generates a time range for trading. But the upper layer can only affect the
nearest layer in the original design. With `mod_inner_decision`, the decision can be passed through multiple
layers

View File

@@ -14,7 +14,8 @@ import numpy as np
import pandas as pd
from ..data.data import D
from ..config import C, REG_CN
from ..config import C
from ..constant import REG_CN
from ..log import get_module_logger
from .decision import Order, OrderDir, OrderHelper
from .high_performance_ds import BaseQuote, PandasQuote, NumpyQuote
@@ -103,7 +104,7 @@ class Exchange:
Necessary fields:
$close is for calculating the total value at end of each day.
Optional fields:
$volume is only necessary when we limit the trade amount or caculate PA(vwap) indicator
$volume is only necessary when we limit the trade amount or calculate PA(vwap) indicator
$vwap is only necessary when we use the $vwap price as the deal price
$factor is for rounding to the trading unit
limit_sell will be set to False by default(False indicates we can sell this
@@ -401,9 +402,9 @@ class Exchange:
def get_close(self, stock_id, start_time, end_time, method="ts_data_last"):
return self.quote.get_data(stock_id, start_time, end_time, field="$close", method=method)
def get_volume(self, stock_id, start_time, end_time):
def get_volume(self, stock_id, start_time, end_time, method="sum"):
"""get the total deal volume of stock with `stock_id` between the time interval [start_time, end_time)"""
return self.quote.get_data(stock_id, start_time, end_time, field="$volume", method="sum")
return self.quote.get_data(stock_id, start_time, end_time, field="$volume", method=method)
def get_deal_price(self, stock_id, start_time, end_time, direction: OrderDir, method="ts_data_last"):
if direction == OrderDir.SELL:
@@ -505,7 +506,7 @@ class Exchange:
Note: some future information is used in this function
Parameter:
target_position : dict { stock_id : amount }
current_postion : dict { stock_id : amount}
current_position : dict { stock_id : amount}
trade_unit : trade_unit
down sample : for amount 321 and trade_unit 100, deal_amount is 300
deal order on trade_date
@@ -535,7 +536,7 @@ class Exchange:
deal_amount = self.get_real_deal_amount(current_amount, target_amount, factor)
if deal_amount == 0:
continue
elif deal_amount > 0:
if deal_amount > 0:
# buy stock
buy_order_list.append(
Order(
@@ -686,9 +687,7 @@ class Exchange:
orig_deal_amount = order.deal_amount
order.deal_amount = max(min(vol_limit_min, orig_deal_amount), 0)
if vol_limit_min < orig_deal_amount:
self.logger.debug(
f"Order clipped due to volume limitation: {order}, {[(vol, rule) for vol, rule in zip(vol_limit_num, vol_limit)]}"
)
self.logger.debug(f"Order clipped due to volume limitation: {order}, {list(zip(vol_limit_num, vol_limit))}")
def _get_buy_amount_by_cash_limit(self, trade_price, cash, cost_ratio):
"""return the real order amount after cash limit for buying.

View File

@@ -41,7 +41,7 @@ class BaseExecutor:
Parameters
----------
time_per_step : str
trade time per trading step, used for genreate the trade calendar
trade time per trading step, used to generate the trade calendar
show_indicator: bool, optional
whether to show indicators, :
- 'pa', the price advantage
@@ -118,7 +118,7 @@ class BaseExecutor:
self.dealt_order_amount = defaultdict(float)
self.deal_day = None
def reset_common_infra(self, common_infra):
def reset_common_infra(self, common_infra, copy_trade_account=False):
"""
reset infrastructure for trading
- reset trade_account
@@ -129,9 +129,14 @@ class BaseExecutor:
self.common_infra.update(common_infra)
if common_infra.has("trade_account"):
# NOTE: there is a trick in the code.
# shallow copy is used instead of deepcopy. So positions are shared
self.trade_account: Account = copy.copy(common_infra.get("trade_account"))
if copy_trade_account:
# NOTE: there is a trick in the code.
# shallow copy is used instead of deepcopy.
# 1. So positions are shared
# 2. Others are not shared, so each level has its own metrics (portfolio and trading metrics)
self.trade_account: Account = copy.copy(common_infra.get("trade_account"))
else:
self.trade_account = common_infra.get("trade_account")
self.trade_account.reset(freq=self.time_per_step, port_metr_enabled=self.generate_portfolio_metrics)
@property
@@ -189,7 +194,7 @@ class BaseExecutor:
return return_value.get("execute_result")
@abstractclassmethod
def _collect_data(self, trade_decision: BaseTradeDecision, level: int = 0) -> Tuple[List[object], dict]:
def _collect_data(cls, trade_decision: BaseTradeDecision, level: int = 0) -> Tuple[List[object], dict]:
"""
Please refer to the doc of collect_data
The only difference between `_collect_data` and `collect_data` is that some common steps are moved into
@@ -342,14 +347,18 @@ class NestedExecutor(BaseExecutor):
**kwargs,
)
def reset_common_infra(self, common_infra):
def reset_common_infra(self, common_infra, copy_trade_account=False):
"""
reset infrastructure for trading
- reset inner_strategy and inner_executor common infra
"""
super(NestedExecutor, self).reset_common_infra(common_infra)
# NOTE: please refer to the docs of BaseExecutor.reset_common_infra for the meaning of `copy_trade_account`
self.inner_executor.reset_common_infra(common_infra)
# The first level follows the `copy_trade_account` from the upper level
super(NestedExecutor, self).reset_common_infra(common_infra, copy_trade_account=copy_trade_account)
# The lower level has to copy the trade_account
self.inner_executor.reset_common_infra(common_infra, copy_trade_account=True)
self.inner_strategy.reset_common_infra(common_infra)
def _init_sub_trading(self, trade_decision):
@@ -360,12 +369,12 @@ class NestedExecutor(BaseExecutor):
self.inner_strategy.reset(level_infra=sub_level_infra, outer_trade_decision=trade_decision)
def _update_trade_decision(self, trade_decision: BaseTradeDecision) -> BaseTradeDecision:
# outter strategy have chance to update decision each iterator
# the outer strategy has a chance to update the decision in each iteration
updated_trade_decision = trade_decision.update(self.inner_executor.trade_calendar)
if updated_trade_decision is not None:
trade_decision = updated_trade_decision
# NEW UPDATE
# create a hook for inner strategy to update outter decision
# create a hook for inner strategy to update outer decision
self.inner_strategy.alter_outer_trade_decision(trade_decision)
return trade_decision
@@ -395,9 +404,25 @@ class NestedExecutor(BaseExecutor):
if not self._align_range_limit or start_idx <= sub_cal.get_trade_step() <= end_idx:
# if force align the range limit, skip the steps outside the decision range limit
_inner_trade_decision: BaseTradeDecision = self.inner_strategy.generate_trade_decision(
_inner_execute_result
)
res = self.inner_strategy.generate_trade_decision(_inner_execute_result)
# NOTE: !!!!!
# the two lines below are for a special case in RL
# They resolve the following conflict:
# - Normally, the user creates a strategy and embeds it into Qlib's executor and simulator interaction loop
# For a _nested qlib example_, (Qlib Strategy) <=> (Qlib Executor[(inner Qlib Strategy) <=> (inner Qlib Executor)])
# - However, an RL-based framework has its own script to run the loop
# For an _RL learning example_, (RL Policy) <=> (RL Env[(inner Qlib Executor)])
# To make it possible to run the _nested qlib example_ and the _RL learning example_ together, the solution below is proposed
# - The entry script follows the _RL learning example_ to be compatible with all kinds of RL frameworks
# - Each step of (RL Env) moves (inner Qlib Executor) one step forward
# - (inner Qlib Strategy) is a proxy strategy; it hands control of the program to (RL Env) via `yield from` and waits for the action from the policy
# So the two lines below implement the yielding of control
if isinstance(res, GeneratorType):
res = yield from res
_inner_trade_decision: BaseTradeDecision = res
trade_decision.mod_inner_decision(_inner_trade_decision) # propagate part of decision information
# NOTE sub_cal.get_step_time() must be called before collect_data in case of step shifting
@@ -407,6 +432,7 @@ class NestedExecutor(BaseExecutor):
_inner_execute_result = yield from self.inner_executor.collect_data(
trade_decision=_inner_trade_decision, level=level + 1
)
self.post_inner_exe_step(_inner_execute_result)
execute_result.extend(_inner_execute_result)
inner_order_indicators.append(
@@ -418,6 +444,17 @@ class NestedExecutor(BaseExecutor):
return execute_result, {"inner_order_indicators": inner_order_indicators, "decision_list": decision_list}
def post_inner_exe_step(self, inner_exe_res):
"""
A hook for doing something after each step of the inner strategy
Parameters
----------
inner_exe_res :
the execution result of inner task
"""
pass
def get_all_executors(self):
"""get all executors, including self and inner_executor.get_all_executors()"""
return [self, *self.inner_executor.get_all_executors()]
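
The `isinstance(res, GeneratorType)` check added to `NestedExecutor` lets an inner strategy either return a decision directly or hand control to an outer loop through `yield from`. A self-contained sketch of that control flow follows. All names here (`plain_strategy`, `proxy_strategy`, `collect_data`, `drive`) are hypothetical toy stand-ins, not Qlib classes; only the `GeneratorType` check and the `yield from` delegation mirror the hunk above.

```python
from types import GeneratorType


def plain_strategy(execute_result):
    """Ordinary strategy: returns a decision directly."""
    return f"decision_after_{execute_result}"


def proxy_strategy(execute_result):
    """Proxy strategy: yields an observation and waits for an action from outside."""
    action = yield f"obs_{execute_result}"
    return f"decision_from_{action}"


def collect_data(strategy, steps):
    """Toy executor loop: supports both plain and generator-based strategies."""
    decisions = []
    for step in range(steps):
        res = strategy(step)
        if isinstance(res, GeneratorType):
            # Delegate to the strategy's generator; its return value is the decision.
            res = yield from res
        decisions.append(res)
    return decisions


def drive(gen, policy):
    """Toy outer loop: answer each yielded observation with policy(obs)."""
    try:
        obs = next(gen)
        while True:
            obs = gen.send(policy(obs))
    except StopIteration as stop:
        return stop.value


if __name__ == "__main__":
    print(drive(collect_data(plain_strategy, 2), str.upper))
    print(drive(collect_data(proxy_strategy, 2), str.upper))
```

With a plain strategy the loop never yields, so the outer driver receives the final result immediately; with the proxy strategy each step pauses until the policy supplies an action.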

View File

@@ -400,7 +400,7 @@ class BaseOrderIndicator:
indicators : List[BaseOrderIndicator]
the list of all inner indicators.
metrics : Union[str, List[str]]
all metrics needs ot be sumed.
all metrics that need to be summed.
fill_value : float, optional
fill np.NaN with value. By default None.
"""

View File

@@ -20,7 +20,7 @@ class BasePosition:
Please refer to the `Position` class for the position
"""
def __init__(self, cash=0.0, *args, **kwargs):
def __init__(self, *args, cash=0.0, **kwargs):
self._settle_type = self.ST_NO
def skip_update(self) -> bool:
@@ -152,7 +152,7 @@ class BasePosition:
"""
generate stock weight dict {stock_id : value weight of stock in the position}
it is meaningful in the beginning or the end of each trade step
- During execution of each trading step, the weight may be not consistant with the portfolio value
- During execution of each trading step, the weight may not be consistent with the portfolio value
Parameters
----------

View File

@@ -39,7 +39,7 @@ def get_benchmark_weight(
if not path:
path = Path(C.dpm.get_data_uri(freq)).expanduser() / "raw" / "AIndexMembers" / "weights.csv"
# TODO: the storage of weights should be implemented in a more elegant way
# TODO: The benchmark is not consistant with the filename in instruments.
# TODO: The benchmark is not consistent with the filename in instruments.
bench_weight_df = pd.read_csv(path, usecols=["code", "date", "index", "weight"])
bench_weight_df = bench_weight_df[bench_weight_df["index"] == bench]
bench_weight_df["date"] = pd.to_datetime(bench_weight_df["date"])
@@ -156,16 +156,16 @@ def decompose_portofolio(stock_weight_df, stock_group_df, stock_ret_df):
group_weight, stock_weight_in_group = decompose_portofolio_weight(stock_weight_df, stock_group_df)
group_ret = {}
for group_key in stock_weight_in_group:
stock_weight_in_group_start_date = min(stock_weight_in_group[group_key].index)
stock_weight_in_group_end_date = max(stock_weight_in_group[group_key].index)
for group_key, val in stock_weight_in_group.items():
stock_weight_in_group_start_date = min(val.index)
stock_weight_in_group_end_date = max(val.index)
temp_stock_ret_df = stock_ret_df[
(stock_ret_df.index >= stock_weight_in_group_start_date)
& (stock_ret_df.index <= stock_weight_in_group_end_date)
]
group_ret[group_key] = (temp_stock_ret_df * stock_weight_in_group[group_key]).sum(axis=1)
group_ret[group_key] = (temp_stock_ret_df * val).sum(axis=1)
# If no weight is assigned, then the return of group will be np.nan
group_ret[group_key][group_weight[group_key] == 0.0] = np.nan

View File

@@ -73,7 +73,7 @@ class PortfolioMetrics:
self.init_bench(freq=freq, benchmark_config=benchmark_config)
def init_vars(self):
self.accounts = OrderedDict() # account postion value for each trade time
self.accounts = OrderedDict() # account position value for each trade time
self.returns = OrderedDict() # daily return rate for each trade time
self.total_turnovers = OrderedDict() # total turnover for each trade time
self.turnovers = OrderedDict() # turnover for each trade time
@@ -212,7 +212,8 @@ class PortfolioMetrics:
path: str/ pathlib.Path()
"""
path = pathlib.Path(path)
r = pd.read_csv(open(path, "rb"), index_col=0)
with path.open("rb") as f:
r = pd.read_csv(f, index_col=0)
r.index = pd.DatetimeIndex(r.index)
index = r.index
@@ -236,7 +237,7 @@ class Indicator:
"""
`Indicator` is implemented in an aggregate way.
All the metrics are calculated aggregately.
All the metrics are calculated for a seperated stock and in a specific step on a specific level.
All the metrics are calculated for a separated stock and in a specific step on a specific level.
| indicator | desc. |
|--------------+--------------------------------------------------------------|

View File

@@ -55,9 +55,9 @@ class TradeCalendarManager:
self.start_time = pd.Timestamp(start_time) if start_time else None
self.end_time = pd.Timestamp(end_time) if end_time else None
_calendar = Cal.calendar(freq=freq)
_calendar = Cal.calendar(freq=freq, future=True)
self._calendar = _calendar
_, _, _start_index, _end_index = Cal.locate_index(start_time, end_time, freq=freq)
_, _, _start_index, _end_index = Cal.locate_index(start_time, end_time, freq=freq, future=True)
self.start_index = _start_index
self.end_index = _end_index
self.trade_len = _end_index - _start_index + 1
@@ -93,7 +93,7 @@ class TradeCalendarManager:
About the endpoints:
- Qlib uses the closed interval in time-series data selection, which has the same performance as pandas.Series.loc
# - The returned right endpoints should minus 1 seconds becasue of the closed interval representation in Qlib.
# - The returned right endpoints should minus 1 seconds because of the closed interval representation in Qlib.
# Note: Qlib supports up to minutely decision execution, so 1 seconds is less than any trading time interval.
Parameters
@@ -205,10 +205,7 @@ class BaseInfrastructure:
warnings.warn(f"infra {infra_name} is not found!")
def has(self, infra_name):
if infra_name in self.get_support_infra() and hasattr(self, infra_name):
return True
else:
return False
return infra_name in self.get_support_infra() and hasattr(self, infra_name)
def update(self, other):
support_infra = other.get_support_infra()

View File

@@ -4,12 +4,13 @@
About the configs
=================
The config will based on _default_config.
The config will be based on _default_config.
Two modes are supported
- client
- server
"""
from __future__ import annotations
import os
import re
@@ -18,12 +19,18 @@ import logging
import platform
import multiprocessing
from pathlib import Path
from typing import Union
from typing import Optional, Union
from typing import TYPE_CHECKING
from qlib.constant import REG_CN, REG_US
if TYPE_CHECKING:
from qlib.utils.time import Freq
class Config:
def __init__(self, default_conf):
self.__dict__["_default_config"] = copy.deepcopy(default_conf) # avoiding conflictions with __getattr__
self.__dict__["_default_config"] = copy.deepcopy(default_conf) # avoiding conflicts with __getattr__
self.reset()
def __getitem__(self, key):
@@ -69,10 +76,6 @@ class Config:
self.update(**config_c.__dict__["_config"])
# REGION CONST
REG_CN = "cn"
REG_US = "us"
# pickle.dump protocol version: https://docs.python.org/3/library/pickle.html#data-stream-format
PROTOCOL_VERSION = 4
@@ -235,7 +238,7 @@ MODE_CONF = {
}
HIGH_FREQ_CONFIG = {
"provider_uri": "~/.qlib/qlib_data/yahoo_cn_1min",
"provider_uri": "~/.qlib/qlib_data/cn_data_1min",
"dataset_cache": None,
"expression_cache": "DiskExpressionCache",
"region": REG_CN,
@@ -266,7 +269,11 @@ class QlibConfig(Config):
self._registered = False
class DataPathManager:
def __init__(self, provider_uri: Union[str, Path, dict], mount_path: Union[str, Path, dict]):
def __init__(
self,
provider_uri: Union[str, Path, dict],
mount_path: Union[str, Path, dict],
):
self.provider_uri = provider_uri
self.mount_path = mount_path
@@ -296,7 +303,9 @@ class QlibConfig(Config):
else:
return QlibConfig.LOCAL_URI
def get_data_uri(self, freq: str = None) -> Path:
def get_data_uri(self, freq: Optional[Union[str, Freq]] = None) -> Path:
if freq is not None:
freq = str(freq) # converting Freq to string
if freq is None or freq not in self.provider_uri:
freq = QlibConfig.DEFAULT_FREQ
_provider_uri = self.provider_uri[freq]
@@ -353,10 +362,10 @@ class QlibConfig(Config):
"""
configure qlib based on the input parameters
The configure will act like a dictionary.
The configuration will act like a dictionary.
Normally, it literally replace the value according to the keys.
However, sometimes it is hard for users to set the config when the configure is nested and complicated
Normally, it literally replaces the value according to the keys.
However, sometimes it is hard for users to set the config when the configuration is nested and complicated
So this API provides some special parameters for users to set the keys in a more convenient way.
- region: REG_CN, REG_US
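
The `get_data_uri` change above first converts `freq` to a string and then falls back to the default frequency when the key is unknown. A minimal sketch of that lookup follows. It assumes a plain dict in place of `provider_uri`; the `FakeFreq` class, the `resolve_provider_uri` helper, the `DEFAULT_FREQ` value and the first example path are hypothetical and are not taken from the diff.

```python
from typing import Optional, Union

DEFAULT_FREQ = "__DEFAULT_FREQ"  # assumed placeholder key, for illustration only


class FakeFreq:
    """Hypothetical stand-in for a Freq object; only __str__ matters here."""

    def __init__(self, text: str):
        self.text = text

    def __str__(self) -> str:
        return self.text


def resolve_provider_uri(provider_uri: dict, freq: Optional[Union[str, FakeFreq]] = None) -> str:
    # Convert Freq-like objects to str so they can be used as dict keys.
    if freq is not None:
        freq = str(freq)
    # Fall back to the default entry when freq is missing or unknown.
    if freq is None or freq not in provider_uri:
        freq = DEFAULT_FREQ
    return provider_uri[freq]


uris = {
    DEFAULT_FREQ: "~/.qlib/qlib_data/cn_data",  # example path, assumed
    "1min": "~/.qlib/qlib_data/cn_data_1min",
}
print(resolve_provider_uri(uris, FakeFreq("1min")))
print(resolve_provider_uri(uris, "5min"))  # unknown key, uses the default entry
```

Without the `str(freq)` conversion, a non-string frequency object would generally not match the string keys of the dict, and the lookup would silently fall back to the default entry.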

9
qlib/constant.py Normal file
View File

@@ -0,0 +1,9 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
# REGION CONST
REG_CN = "cn"
REG_US = "us"
# Epsilon for avoiding division by zero.
EPS = 1e-12
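
As a rough illustration of how a constant like `EPS` is typically used, the sketch below adds it to a denominator before dividing. The `safe_ratio` helper is hypothetical and is not part of this commit.

```python
EPS = 1e-12  # same value as the constant defined above


def safe_ratio(numerator: float, denominator: float, eps: float = EPS) -> float:
    """Divide while guarding against a zero denominator (hypothetical helper)."""
    return numerator / (denominator + eps)


print(safe_ratio(0.0, 0.0))      # 0.0 instead of raising ZeroDivisionError
print(safe_ratio(200.0, 100.0))  # very close to 2.0
```

This guard only helps for non-negative denominators such as trading volume. A denominator of exactly `-eps` would still divide by zero.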

View File

@@ -63,9 +63,7 @@ def _get_date_parse_fn(target):
get_date_parse_fn('20120101')('2017-01-01') => '20170101'
get_date_parse_fn(20120101)('2017-01-01') => 20170101
"""
if isinstance(target, pd.Timestamp):
_fn = lambda x: pd.Timestamp(x) # Timestamp('2020-01-01')
elif isinstance(target, int):
if isinstance(target, int):
_fn = lambda x: int(str(x).replace("-", "")[:8]) # 20200201
elif isinstance(target, str) and len(target) == 8:
_fn = lambda x: str(x).replace("-", "")[:8] # '20200201'
@@ -158,7 +156,7 @@ class MTSDatasetH(DatasetH):
try:
df = self.handler._learn.copy() # use copy otherwise recorder will fail
# FIXME: currently we cannot support switching from `_learn` to `_infer` for inference
except:
except Exception:
warnings.warn("cannot access `_learn`, will load raw data")
df = self.handler._data.copy()
df.index = df.index.swaplevel()

View File

@@ -90,7 +90,13 @@ class Alpha360(DataHandlerLP):
return (["Ref($close, -2)/Ref($close, -1) - 1"], ["LABEL0"])
def get_feature_config(self):
# NOTE:
# Alpha360 tries to provide a dataset with original price data
# the original price data includes the prices and volume in the last 60 days.
# To make it easier to learn models from this dataset, all the prices and volume
# are normalized by the latest price and volume data (dividing by $close, $volume)
# So the latest normalized $close will be 1 (with name CLOSE0), the latest normalized $volume will be 1 (with name VOLUME0)
# If further normalization is executed (e.g. centralization), CLOSE0 and VOLUME0 will be 0.
fields = []
names = []
@@ -120,9 +126,9 @@ class Alpha360(DataHandlerLP):
fields += ["$vwap/$close"]
names += ["VWAP0"]
for i in range(59, 0, -1):
fields += ["Ref($volume, %d)/$volume" % (i)]
fields += ["Ref($volume, %d)/($volume+1e-12)" % (i)]
names += ["VOLUME%d" % (i)]
fields += ["$volume/$volume"]
fields += ["$volume/($volume+1e-12)"]
names += ["VOLUME0"]
return fields, names
@@ -243,7 +249,7 @@ class Alpha158(DataHandlerLP):
names += [field.upper() + str(d) for d in windows]
if "volume" in config:
windows = config["volume"].get("windows", range(5))
fields += ["Ref($volume, %d)/$volume" % d if d != 0 else "$volume/$volume" for d in windows]
fields += ["Ref($volume, %d)/($volume+1e-12)" % d if d != 0 else "$volume/($volume+1e-12)" for d in windows]
names += ["VOLUME" + str(d) for d in windows]
if "rolling" in config:
windows = config["rolling"].get("windows", [5, 10, 20, 30, 60])
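
The volume expressions changed above all follow one pattern: a lagged `$volume` divided by the latest `$volume` plus a small epsilon. The sketch below rebuilds that list of expression strings and names for a short window. The `volume_fields` helper and its parameters are illustrative only, and it only formats strings; it does not evaluate any Qlib expression.

```python
def volume_fields(n_days: int = 60, eps: float = 1e-12):
    """Build (fields, names) for normalized volume, oldest lag first (illustrative helper)."""
    fields, names = [], []
    for i in range(n_days - 1, 0, -1):
        # Lagged volume divided by the latest volume; eps avoids dividing by zero.
        fields.append("Ref($volume, %d)/($volume+%s)" % (i, eps))
        names.append("VOLUME%d" % i)
    # The latest day divided by itself is (almost exactly) 1 whenever volume is non-zero.
    fields.append("$volume/($volume+%s)" % eps)
    names.append("VOLUME0")
    return fields, names


fields, names = volume_fields(n_days=3)
for name, field in zip(names, fields):
    print(name, "=", field)
```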

View File

@@ -18,8 +18,8 @@ class SepDataFrame:
"""
(Sep)erate DataFrame
We usually concat multiple dataframe to be processed together(Such as feature, label, weight, filter).
However, they are usally be used seperately at last.
This will result in extra cost for concating and spliting data(reshaping and copying data in the memory is very expensive)
However, they are usually used separately in the end.
This will result in extra cost for concatenating and splitting data(reshaping and copying data in the memory is very expensive)
SepDataFrame tries to act like a DataFrame whose column with multiindex
"""

View File

@@ -371,7 +371,7 @@ def long_short_backtest(
def t_run():
pred_FN = "./check_pred.csv"
pred = pd.read_csv(pred_FN)
pred: pd.DataFrame = pd.read_csv(pred_FN)
pred["datetime"] = pd.to_datetime(pred["datetime"])
pred = pred.set_index([pred.columns[0], pred.columns[1]])
pred = pred.iloc[:9000]

View File

@@ -38,11 +38,11 @@ def _get_position_value_from_df(evaluate_date, position, close_data_df):
def get_position_value(evaluate_date, position):
"""sum of close*amount
get value of postion
get value of position
use close price
postions:
positions:
{
Timestamp('2016-01-05 00:00:00'):
{

View File

@@ -0,0 +1,4 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from .data_selection import MetaTaskDS, MetaDatasetDS, MetaModelDS

View File

@@ -0,0 +1,5 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from .dataset import MetaDatasetDS, MetaTaskDS
from .model import MetaModelDS

View File

@@ -0,0 +1,325 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from copy import deepcopy
from qlib.data.dataset.utils import init_task_handler
from qlib.utils.data import deepcopy_basic_type
from qlib.contrib.torch import data_to_tensor
from qlib.workflow.task.utils import TimeAdjuster
from qlib.model.meta.task import MetaTask
from typing import Dict, List, Union, Text, Tuple
from qlib.data.dataset.handler import DataHandler
from qlib.log import get_module_logger
from qlib.utils import auto_filter_kwargs, get_date_by_shift, init_instance_by_config
from qlib.workflow import R
from qlib.workflow.task.gen import RollingGen, task_generator
from joblib import Parallel, delayed
from qlib.model.meta.dataset import MetaTaskDataset
from qlib.model.trainer import task_train, TrainerR
from qlib.data.dataset import DatasetH
from tqdm.auto import tqdm
import pandas as pd
import numpy as np
class InternalData:
def __init__(self, task_tpl: dict, step: int, exp_name: str):
self.task_tpl = task_tpl
self.step = step
self.exp_name = exp_name
def setup(self, trainer=TrainerR, trainer_kwargs={}):
"""
After running this function, `self.data_ic_df` will be set.
Each column represents a piece of data, identified by its (start, end) time segment.
Each row represents the timestamp at which the performance of that data is evaluated.
For example,
.. code-block:: python
2021-06-21 2021-06-04 2021-05-21 2021-05-07 2021-04-20 2021-04-06 2021-03-22 2021-03-08 ...
2021-07-02 2021-06-18 2021-06-03 2021-05-20 2021-05-06 2021-04-19 2021-04-02 2021-03-19 ...
datetime ...
2018-01-02 0.079782 0.115975 0.070866 0.028849 -0.081170 0.140380 0.063864 0.110987 ...
2018-01-03 0.123386 0.107789 0.071037 0.045278 -0.060782 0.167446 0.089779 0.124476 ...
2018-01-04 0.140775 0.097206 0.063702 0.042415 -0.078164 0.173218 0.098914 0.114389 ...
2018-01-05 0.030320 -0.037209 -0.044536 -0.047267 -0.081888 0.045648 0.059947 0.047652 ...
2018-01-08 0.107201 0.009219 -0.015995 -0.036594 -0.086633 0.108965 0.122164 0.108508 ...
... ... ... ... ... ... ... ... ... ...
"""
# 1) prepare the prediction of proxy models
perf_task_tpl = deepcopy(self.task_tpl) # this task is supposed to contain no complicated objects
trainer = auto_filter_kwargs(trainer)(experiment_name=self.exp_name, **trainer_kwargs)
# NOTE:
# The handler is initialized only once.
if not trainer.has_worker():
self.dh = init_task_handler(perf_task_tpl)
else:
self.dh = init_instance_by_config(perf_task_tpl["dataset"]["kwargs"]["handler"])
seg = perf_task_tpl["dataset"]["kwargs"]["segments"]
# We want to split the training time period into small segments.
perf_task_tpl["dataset"]["kwargs"]["segments"] = {
"train": (DatasetH.get_min_time(seg), DatasetH.get_max_time(seg)),
"test": (None, None),
}
# NOTE:
# we play a trick here
# treat the training segments as test to create the rolling tasks
rg = RollingGen(step=self.step, test_key="train", train_key=None, task_copy_func=deepcopy_basic_type)
gen_task = task_generator(perf_task_tpl, [rg])
recorders = R.list_recorders(experiment_name=self.exp_name)
if len(gen_task) == len(recorders):
get_module_logger("Internal Data").info("the data has been initialized")
else:
# train new models
assert 0 == len(recorders), "An empty experiment is required for setting up `InternalData`"
trainer.train(gen_task)
# 2) extract the similarity matrix
label_df = self.dh.fetch(col_set="label")
# for
recorders = R.list_recorders(experiment_name=self.exp_name)
key_l = []
ic_l = []
for _, rec in tqdm(recorders.items(), desc="calc"):
pred = rec.load_object("pred.pkl")
task = rec.load_object("task")
data_key = task["dataset"]["kwargs"]["segments"]["train"]
key_l.append(data_key)
ic_l.append(delayed(self._calc_perf)(pred.iloc[:, 0], label_df.iloc[:, 0]))
ic_l = Parallel(n_jobs=-1)(ic_l)
self.data_ic_df = pd.DataFrame(dict(zip(key_l, ic_l)))
self.data_ic_df = self.data_ic_df.sort_index().sort_index(axis=1)
del self.dh # handler is not useful now
def _calc_perf(self, pred, label):
df = pd.DataFrame({"pred": pred, "label": label})
df = df.groupby("datetime").corr(method="spearman")
corr = df.loc(axis=0)[:, "pred"]["label"].droplevel(axis=0, level=-1)
return corr
def update(self):
"""update the data for online trading"""
# TODO:
# when new data are totally(including label) available
# - update the prediction
# - update the data similarity map(if applied)
class MetaTaskDS(MetaTask):
"""Meta Task for Data Selection"""
def __init__(self, task: dict, meta_info: pd.DataFrame, mode: str = MetaTask.PROC_MODE_FULL, fill_method="max"):
"""
The description of the processed data
time_perf: An array with shape <hist_step_n * step, data pieces> -> data piece performance
time_belong: An array with shape <sample, data pieces> -> belong or not (1. or 0.)
array([[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 1.],
[0., 0., 0., ..., 0., 0., 1.],
[0., 0., 0., ..., 0., 0., 1.]])
"""
super().__init__(task, meta_info)
self.fill_method = fill_method
time_perf = self._get_processed_meta_info()
self.processed_meta_input = {"time_perf": time_perf}
# FIXME: memory issue in this step
if mode == MetaTask.PROC_MODE_FULL:
# process metainfo_
ds = self.get_dataset()
# these three lines occupied 70% of the time of initializing MetaTaskDS
d_train, d_test = ds.prepare(["train", "test"], col_set=["feature", "label"])
prev_size = d_test.shape[0]
d_train = d_train.dropna(axis=0)
d_test = d_test.dropna(axis=0)
if prev_size == 0 or d_test.shape[0] / prev_size <= 0.1:
raise ValueError(f"Most of samples are dropped. Please check this task: {task}")
assert (
d_test.groupby("datetime").size().shape[0] >= 5
), "In this segment, the number of trading dates is less than 5; please check the data."
sample_time_belong = np.zeros((d_train.shape[0], time_perf.shape[1]))
for i, col in enumerate(time_perf.columns):
# these two lines of code occupied 20% of the time of initializing MetaTaskDS
slc = slice(*d_train.index.slice_locs(start=col[0], end=col[1]))
sample_time_belong[slc, i] = 1.0
# If you want the last month to also belong to the last time_perf
# Assumption: the latest data performs similarly to the last month
sample_time_belong[sample_time_belong.sum(axis=1) != 1, -1] = 1.0
self.processed_meta_input.update(
dict(
X=d_train["feature"],
y=d_train["label"].iloc[:, 0],
X_test=d_test["feature"],
y_test=d_test["label"].iloc[:, 0],
time_belong=sample_time_belong,
test_idx=d_test["label"].index,
)
)
# TODO: set device: I think this is not necessary for converting the data format.
self.processed_meta_input = data_to_tensor(self.processed_meta_input)
def _get_processed_meta_info(self):
meta_info_norm = self.meta_info.sub(self.meta_info.mean(axis=1), axis=0) # .fillna(0.)
if self.fill_method == "max":
meta_info_norm = meta_info_norm.T.fillna(
meta_info_norm.max(axis=1)
).T # fill it with row max to align with previous implementation
elif self.fill_method == "zero":
pass
else:
raise NotImplementedError(f"This type of input is not supported")
meta_info_norm = meta_info_norm.fillna(0.0) # always fill zero in case of NaN
return meta_info_norm
def get_meta_input(self):
return self.processed_meta_input
class MetaDatasetDS(MetaTaskDataset):
def __init__(
self,
*,
task_tpl: Union[dict, list],
step: int,
trunc_days: int = None,
rolling_ext_days: int = 0,
exp_name: Union[str, InternalData],
segments: Union[Dict[Text, Tuple], float],
hist_step_n: int = 10,
task_mode: str = MetaTask.PROC_MODE_FULL,
fill_method: str = "max",
):
"""
A dataset for meta model.
Parameters
----------
task_tpl : Union[dict, list]
Decide what tasks are used.
- dict : the task template; the prepared tasks are generated with `step`, `trunc_days` and `RollingGen`
- list : when list, use the list of tasks directly
the list is supposed to be sorted according to the timeline
step : int
the rolling step
trunc_days: int
days to be truncated based on the test start
rolling_ext_days: int
sometimes users want to train meta models for a longer test period but with smaller rolling steps for more task samples.
the total length of test periods will be `step + rolling_ext_days`
exp_name : Union[str, InternalData]
Decide what meta_info are used for prediction.
- str: the name of the experiment to store the performance of data
- InternalData: a prepared internal data
segments: Union[Dict[Text, Tuple], float]
the segments to divide data
both left and right
if segments is a float:
the float represents the percentage of data for training
hist_step_n: int
length of historical steps for the meta information
task_mode : str
Please refer to the docs of MetaTask
"""
super().__init__(segments=segments)
if isinstance(exp_name, InternalData):
self.internal_data = exp_name
else:
self.internal_data = InternalData(task_tpl, step=step, exp_name=exp_name)
self.internal_data.setup()
self.task_tpl = deepcopy(task_tpl) # FIXME: if the handler is shared, how to avoid the explosion of the memory.
self.trunc_days = trunc_days
self.hist_step_n = hist_step_n
self.step = step
if isinstance(task_tpl, dict):
rg = RollingGen(
step=step, trunc_days=trunc_days, task_copy_func=deepcopy_basic_type
) # NOTE: trunc_days is very important !!!!
task_iter = rg(task_tpl)
if rolling_ext_days > 0:
self.ta = TimeAdjuster(future=True)
for t in task_iter:
t["dataset"]["kwargs"]["segments"]["test"] = self.ta.shift(
t["dataset"]["kwargs"]["segments"]["test"], step=rolling_ext_days, rtype=RollingGen.ROLL_EX
)
if task_mode == MetaTask.PROC_MODE_FULL:
# Only pre-initialize the task when the full task is required
# initializing handler and share it.
init_task_handler(task_tpl)
else:
assert isinstance(task_tpl, list)
task_iter = task_tpl
self.task_list = []
self.meta_task_l = []
logger = get_module_logger("MetaDatasetDS")
logger.info(f"Example task for training meta model: {task_iter[0]}")
for t in tqdm(task_iter, desc="creating meta tasks"):
try:
self.meta_task_l.append(
MetaTaskDS(t, meta_info=self._prepare_meta_ipt(t), mode=task_mode, fill_method=fill_method)
)
self.task_list.append(t)
except ValueError as e:
logger.warning(f"ValueError: {e}")
assert len(self.meta_task_l) > 0, "No meta tasks found. Please check the data and setting"
def _prepare_meta_ipt(self, task):
ic_df = self.internal_data.data_ic_df
segs = task["dataset"]["kwargs"]["segments"]
end = max([segs[k][1] for k in ("train", "valid") if k in segs])
ic_df_avail = ic_df.loc[:end, pd.IndexSlice[:, :end]]
# the meta dataset focuses on the **information** rather than on preprocessing
# 1) filter the future info
def mask_future(s):
"""mask future information"""
# from qlib.utils import get_date_by_shift
start, end = s.name
end = get_date_by_shift(trading_date=end, shift=self.trunc_days - 1, future=True)
return s.mask((s.index >= start) & (s.index <= end))
ic_df_avail = ic_df_avail.apply(mask_future) # apply to each col
# 2) filter the info with too long periods
total_len = self.step * self.hist_step_n
if ic_df_avail.shape[0] >= total_len:
return ic_df_avail.iloc[-total_len:]
else:
raise ValueError("the history of distribution data is not long enough.")
def _prepare_seg(self, segment: Text) -> List[MetaTask]:
if isinstance(self.segments, float):
train_task_n = int(len(self.meta_task_l) * self.segments)
if segment == "train":
return self.meta_task_l[:train_task_n]
elif segment == "test":
return self.meta_task_l[train_task_n:]
else:
raise NotImplementedError(f"This type of input is not supported")
else:
raise NotImplementedError(f"This type of input is not supported")
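
The `time_belong` matrix built in `MetaTaskDS` marks which data piece each training sample falls into, and attaches samples covered by no piece to the last column. The sketch below reproduces that idea with plain Python lists instead of a pandas index and a NumPy array. The `build_time_belong` helper and the example dates are hypothetical.

```python
from typing import List, Tuple


def build_time_belong(sample_dates: List[str], segments: List[Tuple[str, str]]) -> List[List[float]]:
    """Return a <sample, data piece> membership matrix of 0.0 / 1.0 values."""
    matrix = [[0.0] * len(segments) for _ in sample_dates]
    for row, date in enumerate(sample_dates):
        for col, (start, end) in enumerate(segments):
            # Closed interval; ISO date strings compare in chronological order.
            if start <= date <= end:
                matrix[row][col] = 1.0
        # Samples that do not belong to exactly one piece are attached to the
        # last piece, mirroring the assumption stated in the original code.
        if sum(matrix[row]) != 1:
            matrix[row][-1] = 1.0
    return matrix


segments = [("2021-01-01", "2021-01-15"), ("2021-01-16", "2021-01-31"), ("2021-02-01", "2021-02-10")]
dates = ["2021-01-04", "2021-01-20", "2021-02-03", "2021-02-25"]
time_belong = build_time_belong(dates, segments)
for date, row in zip(dates, time_belong):
    print(date, row)
```

Multiplying this matrix by a vector with one weight per data piece yields one weight per sample, which is how `time_belong @ weights` is used in the network code.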

View File

@@ -0,0 +1,182 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from qlib.log import get_module_logger
import pandas as pd
import numpy as np
from qlib.model.meta.task import MetaTask
import torch
from torch import nn
from torch import optim
from tqdm.auto import tqdm
import collections
import copy
from typing import Union, List, Tuple, Dict
from ....data.dataset.weight import Reweighter
from ....model.meta.dataset import MetaTaskDataset
from ....model.meta.model import MetaModel, MetaTaskModel
from ....workflow import R
from .utils import ICLoss
from .dataset import MetaDatasetDS
from qlib.contrib.meta.data_selection.net import PredNet
logger = get_module_logger("data selection")
class TimeReweighter(Reweighter):
def __init__(self, time_weight: pd.Series):
self.time_weight = time_weight
def reweight(self, data: Union[pd.DataFrame, pd.Series]):
# TODO: handling TSDataSampler
w_s = pd.Series(1.0, index=data.index)
for k, w in self.time_weight.items():
w_s.loc[slice(*k)] = w
logger.info(f"Reweighting result: {w_s}")
return w_s
class MetaModelDS(MetaTaskModel):
"""
The meta-model for meta-learning-based data selection.
"""
def __init__(
self,
step,
hist_step_n,
clip_method="tanh",
clip_weight=2.0,
criterion="ic_loss",
lr=0.0001,
max_epoch=100,
seed=43,
):
self.step = step
self.hist_step_n = hist_step_n
self.clip_method = clip_method
self.clip_weight = clip_weight
self.criterion = criterion
self.lr = lr
self.max_epoch = max_epoch
self.fitted = False
torch.manual_seed(seed)
def run_epoch(self, phase, task_list, epoch, opt, loss_l, ignore_weight=False):
if phase == "train":
self.tn.train()
torch.set_grad_enabled(True)
else:
self.tn.eval()
torch.set_grad_enabled(False)
running_loss = 0.0
pred_y_all = []
for task in tqdm(task_list, desc=f"{phase} Task", leave=False):
meta_input = task.get_meta_input()
pred, weights = self.tn(
meta_input["X"],
meta_input["y"],
meta_input["time_perf"],
meta_input["time_belong"],
meta_input["X_test"],
ignore_weight=ignore_weight,
)
if self.criterion == "mse":
criterion = nn.MSELoss()
loss = criterion(pred, meta_input["y_test"])
elif self.criterion == "ic_loss":
criterion = ICLoss()
try:
loss = criterion(pred, meta_input["y_test"], meta_input["test_idx"], skip_size=50)
except ValueError as e:
get_module_logger("MetaModelDS").warning(f"Exception `{e}` when calculating IC loss")
continue
assert not np.isnan(loss.detach().item()), "NaN loss!"
if phase == "train":
opt.zero_grad()
norm_loss = nn.MSELoss()
loss.backward()
opt.step()
elif phase == "test":
pass
pred_y_all.append(
pd.DataFrame(
{
"pred": pd.Series(pred.detach().cpu().numpy(), index=meta_input["test_idx"]),
"label": pd.Series(meta_input["y_test"].detach().cpu().numpy(), index=meta_input["test_idx"]),
}
)
)
running_loss += loss.detach().item()
running_loss = running_loss / len(task_list)
loss_l.setdefault(phase, []).append(running_loss)
pred_y_all = pd.concat(pred_y_all)
ic = pred_y_all.groupby("datetime").apply(lambda df: df["pred"].corr(df["label"], method="spearman")).mean()
R.log_metrics(**{f"loss/{phase}": running_loss, "step": epoch})
R.log_metrics(**{f"ic/{phase}": ic, "step": epoch})
def fit(self, meta_dataset: MetaDatasetDS):
"""
The meta-learning-based data selection interacts directly with meta-dataset due to the closed-form proxy measurement.
Parameters
----------
meta_dataset : MetaDatasetDS
The meta-model takes the meta-dataset for its training process.
"""
if not self.fitted:
for k in set(["lr", "step", "hist_step_n", "clip_method", "clip_weight", "criterion", "max_epoch"]):
R.log_params(**{k: getattr(self, k)})
# FIXME: get test tasks for just checking the performance
phases = ["train", "test"]
meta_tasks_l = meta_dataset.prepare_tasks(phases)
if len(meta_tasks_l[1]):
R.log_params(
**dict(proxy_test_begin=meta_tasks_l[1][0].task["dataset"]["kwargs"]["segments"]["test"])
) # debug: record when the test phase starts
self.tn = PredNet(
step=self.step, hist_step_n=self.hist_step_n, clip_weight=self.clip_weight, clip_method=self.clip_method
)
opt = optim.Adam(self.tn.parameters(), lr=self.lr)
# run with no weight
for phase, task_list in zip(phases, meta_tasks_l):
self.run_epoch(f"{phase}_noweight", task_list, 0, opt, {}, ignore_weight=True)
self.run_epoch(f"{phase}_init", task_list, 0, opt, {})
# run training
loss_l = {}
for epoch in tqdm(range(self.max_epoch), desc="epoch"):
for phase, task_list in zip(phases, meta_tasks_l):
self.run_epoch(phase, task_list, epoch, opt, loss_l)
R.save_objects(**{"model.pkl": self.tn})
self.fitted = True
def _prepare_task(self, task: MetaTask) -> dict:
meta_ipt = task.get_meta_input()
weights = self.tn.twm(meta_ipt["time_perf"])
weight_s = pd.Series(weights.detach().cpu().numpy(), index=task.meta_info.columns)
task = copy.copy(task.task) # NOTE: this is a shallow copy.
task["reweighter"] = TimeReweighter(weight_s)
return task
def inference(self, meta_dataset: MetaTaskDataset) -> List[dict]:
res = []
for mt in meta_dataset.prepare_tasks("test"):
res.append(self._prepare_task(mt))
return res
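
`TimeReweighter.reweight` starts from a weight of 1.0 for every sample and overwrites it for each (start, end) range found in `time_weight`. A simplified sketch of that behaviour, using plain lists of ISO date strings instead of a pandas index, could look like this. The `reweight_by_time` helper and the example values are hypothetical.

```python
from typing import Dict, List, Tuple


def reweight_by_time(dates: List[str], time_weight: Dict[Tuple[str, str], float]) -> List[float]:
    """Assign one weight per sample according to the time range it falls into."""
    weights = [1.0] * len(dates)  # samples outside every range keep weight 1.0
    for (start, end), w in time_weight.items():
        for i, d in enumerate(dates):
            if start <= d <= end:  # closed interval on both ends
                weights[i] = w
    return weights


dates = ["2020-12-31", "2021-01-05", "2021-01-20", "2021-02-02"]
time_weight = {("2021-01-01", "2021-01-15"): 0.5, ("2021-01-16", "2021-01-31"): 2.0}
print(reweight_by_time(dates, time_weight))
```

If ranges overlap, the range visited last wins, which matches the sequential assignment in the loop of the original method.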

View File

@@ -0,0 +1,68 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import pandas as pd
import numpy as np
import torch
from torch import nn
from .utils import preds_to_weight_with_clamp, SingleMetaBase
class TimeWeightMeta(SingleMetaBase):
def __init__(self, hist_step_n, clip_weight=None, clip_method="clamp"):
# clip_method includes "tanh" or "clamp"
super().__init__(hist_step_n, clip_weight, clip_method)
self.linear = nn.Linear(hist_step_n, 1)
self.k = nn.Parameter(torch.Tensor([8.0]))
def forward(self, time_perf, time_belong=None, return_preds=False):
hist_step_n = self.linear.in_features
# NOTE: the reshape order is very important
time_perf = time_perf.reshape(hist_step_n, time_perf.shape[0] // hist_step_n, *time_perf.shape[1:])
time_perf = torch.mean(time_perf, dim=1, keepdim=False)
preds = []
for i in range(time_perf.shape[1]):
preds.append(self.linear(time_perf[:, i]))
preds = torch.cat(preds)
preds = preds - torch.mean(preds) # avoid using future information
preds = preds * self.k
if return_preds:
if time_belong is None:
return preds
else:
return time_belong @ preds
else:
weights = preds_to_weight_with_clamp(preds, self.clip_weight, self.clip_method)
if time_belong is None:
return weights
else:
return time_belong @ weights
class PredNet(nn.Module):
def __init__(self, step, hist_step_n, clip_weight=None, clip_method="tanh"):
super().__init__()
self.step = step
self.twm = TimeWeightMeta(hist_step_n=hist_step_n, clip_weight=clip_weight, clip_method=clip_method)
self.init_paramters(hist_step_n)
def get_sample_weights(self, X, time_perf, time_belong, ignore_weight=False):
weights = torch.from_numpy(np.ones(X.shape[0])).float().to(X.device)
if not ignore_weight:
if time_perf is not None:
weights_t = self.twm(time_perf, time_belong)
weights = weights * weights_t
return weights
def forward(self, X, y, time_perf, time_belong, X_test, ignore_weight=False):
"""Please refer to the docs of MetaTaskDS for the description of the variables"""
weights = self.get_sample_weights(X, time_perf, time_belong, ignore_weight=ignore_weight)
X_w = X.T * weights.view(1, -1)
theta = torch.inverse(X_w @ X) @ X_w @ y
return X_test @ theta, weights
def init_paramters(self, hist_step_n):
self.twm.linear.weight.data = 1.0 / hist_step_n + self.twm.linear.weight.data * 0.01
self.twm.linear.bias.data.fill_(0.0)
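
`PredNet.forward` fits a proxy linear model in closed form: with sample weights on the diagonal of W, it computes theta = (X^T W X)^-1 X^T W y. For a single feature without intercept this reduces to theta = sum(w*x*y) / sum(w*x*x). The sketch below shows only that scalar special case in plain Python; the `weighted_ls_1d` helper and the numbers are hypothetical.

```python
from typing import List


def weighted_ls_1d(x: List[float], y: List[float], w: List[float]) -> float:
    """Closed-form weighted least squares for a single feature (no intercept)."""
    num = sum(wi * xi * yi for xi, yi, wi in zip(x, y, w))
    den = sum(wi * xi * xi for xi, wi in zip(x, w))
    return num / den


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 9.0]  # the last sample does not follow y = 2 * x

theta_equal = weighted_ls_1d(x, y, [1.0, 1.0, 1.0])
theta_down = weighted_ls_1d(x, y, [1.0, 1.0, 0.0])  # ignore the last sample
print(theta_equal, theta_down)
```

Changing the weights changes the fitted coefficient, and because the solution is a differentiable function of the weights, gradients can flow from the test loss back to the weighting network.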

View File

@@ -0,0 +1,98 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import pandas as pd
import numpy as np
import torch
from torch import nn
from qlib.contrib.torch import data_to_tensor
class ICLoss(nn.Module):
def forward(self, pred, y, idx, skip_size=50):
"""forward.
:param pred:
:param y:
:param idx: Assume the level of the idx is (date, inst), and it is sorted
"""
prev = None
diff_point = []
for i, (date, inst) in enumerate(idx):
if date != prev:
diff_point.append(i)
prev = date
diff_point.append(None)
ic_all = 0.0
skip_n = 0
for start_i, end_i in zip(diff_point, diff_point[1:]):
pred_focus = pred[start_i:end_i] # TODO: just for fake
if pred_focus.shape[0] < skip_size:
# skip days that have very few stocks.
skip_n += 1
continue
y_focus = y[start_i:end_i]
ic_day = torch.dot(
(pred_focus - pred_focus.mean()) / np.sqrt(pred_focus.shape[0]) / pred_focus.std(),
(y_focus - y_focus.mean()) / np.sqrt(y_focus.shape[0]) / y_focus.std(),
)
ic_all += ic_day
if len(diff_point) - 1 - skip_n <= 0:
raise ValueError("Not enough data for calculating IC")
ic_mean = ic_all / (len(diff_point) - 1 - skip_n)
return -ic_mean # ic loss
def preds_to_weight_with_clamp(preds, clip_weight=None, clip_method="tanh"):
"""
Clip the weights.
Parameters
----------
clip_weight: float
The clip threshold.
clip_method: str
The clip method. Currently available: "clamp", "tanh", and "sigmoid".
"""
if clip_weight is not None:
if clip_method == "clamp":
weights = torch.exp(preds)
weights = weights.clamp(1.0 / clip_weight, clip_weight)
elif clip_method == "tanh":
weights = torch.exp(torch.tanh(preds) * np.log(clip_weight))
elif clip_method == "sigmoid":
# intuitively assume its sum is 1
if clip_weight == 0.0:
weights = torch.ones_like(preds)
else:
sm = nn.Sigmoid()
weights = sm(preds) * clip_weight # TODO: The clip_weight is useless here.
weights = weights / torch.sum(weights) * weights.numel()
else:
raise ValueError("Unknown clip_method")
else:
weights = torch.exp(preds)
return weights
class SingleMetaBase(nn.Module):
def __init__(self, hist_n, clip_weight=None, clip_method="clamp"):
# method can be tanh or clamp
super().__init__()
self.clip_weight = clip_weight
if clip_method in ["tanh", "clamp"]:
if self.clip_weight is not None and self.clip_weight < 1.0:
self.clip_weight = 1 / self.clip_weight
self.clip_method = clip_method
def is_enabled(self):
if self.clip_weight is None:
return True
if self.clip_method == "sigmoid":
if self.clip_weight > 0.0:
return True
else:
if self.clip_weight > 1.0:
return True
return False
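
The `"tanh"` branch of `preds_to_weight_with_clamp` maps an unbounded score to a weight between `1 / clip_weight` and `clip_weight`, with a score of 0 giving a weight of 1. A scalar sketch using only the standard library is shown below. The `tanh_clip_weight` helper is hypothetical and works on floats rather than tensors.

```python
import math


def tanh_clip_weight(pred: float, clip_weight: float = 2.0) -> float:
    """Map an unbounded score to a weight between 1/clip_weight and clip_weight."""
    # tanh squashes pred into [-1, 1]; scaling by log(clip_weight) and
    # exponentiating turns that range into [1/clip_weight, clip_weight].
    return math.exp(math.tanh(pred) * math.log(clip_weight))


for p in (-50.0, -1.0, 0.0, 1.0, 50.0):
    print(p, tanh_clip_weight(p))
```

Compared with the `"clamp"` branch, this mapping stays smooth near the bounds, so gradients shrink gradually instead of dropping to zero once a bound is hit.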

View File

@@ -11,6 +11,7 @@ from ...model.base import Model
from ...data.dataset import DatasetH
from ...data.dataset.handler import DataHandlerLP
from ...model.interpret.base import FeatureInt
from ...data.dataset.weight import Reweighter
class CatBoostModel(Model, FeatureInt):
@@ -31,6 +32,7 @@ class CatBoostModel(Model, FeatureInt):
early_stopping_rounds=50,
verbose_eval=20,
evals_result=dict(),
reweighter=None,
**kwargs
):
df_train, df_valid = dataset.prepare(
@@ -49,8 +51,17 @@ class CatBoostModel(Model, FeatureInt):
else:
raise ValueError("CatBoost doesn't support multi-label training")
train_pool = Pool(data=x_train, label=y_train_1d)
valid_pool = Pool(data=x_valid, label=y_valid_1d)
if reweighter is None:
w_train = None
w_valid = None
elif isinstance(reweighter, Reweighter):
w_train = reweighter.reweight(df_train).values
w_valid = reweighter.reweight(df_valid).values
else:
raise ValueError("Unsupported reweighter type.")
train_pool = Pool(data=x_train, label=y_train_1d, weight=w_train)
valid_pool = Pool(data=x_valid, label=y_valid_1d, weight=w_valid)
# Initialize the catboost model
self._params["iterations"] = num_boost_round

View File

@@ -4,59 +4,73 @@
import numpy as np
import pandas as pd
import lightgbm as lgb
from typing import Text, Union
from typing import List, Text, Tuple, Union
from ...model.base import ModelFT
from ...data.dataset import DatasetH
from ...data.dataset.handler import DataHandlerLP
from ...model.interpret.base import LightGBMFInt
from ...data.dataset.weight import Reweighter
class LGBModel(ModelFT, LightGBMFInt):
"""LightGBM Model"""
def __init__(self, loss="mse", early_stopping_rounds=50, **kwargs):
def __init__(self, loss="mse", early_stopping_rounds=50, num_boost_round=1000, **kwargs):
if loss not in {"mse", "binary"}:
raise NotImplementedError
self.params = {"objective": loss, "verbosity": -1}
self.params.update(kwargs)
self.early_stopping_rounds = early_stopping_rounds
self.num_boost_round = num_boost_round
self.model = None
def _prepare_data(self, dataset: DatasetH):
df_train, df_valid = dataset.prepare(
["train", "valid"], col_set=["feature", "label"], data_key=DataHandlerLP.DK_L
)
if df_train.empty or df_valid.empty:
raise ValueError("Empty data from dataset, please check your dataset config.")
x_train, y_train = df_train["feature"], df_train["label"]
x_valid, y_valid = df_valid["feature"], df_valid["label"]
def _prepare_data(self, dataset: DatasetH, reweighter=None) -> List[Tuple[lgb.Dataset, str]]:
"""
The current version makes the validation segment optional:
- the train segment is required;
- the valid segment is used only when the dataset provides it.
ds_l = []
assert "train" in dataset.segments
for key in ["train", "valid"]:
if key in dataset.segments:
df = dataset.prepare(key, col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
if df.empty:
raise ValueError("Empty data from dataset, please check your dataset config.")
x, y = df["feature"], df["label"]
# Lightgbm need 1D array as its label
if y_train.values.ndim == 2 and y_train.values.shape[1] == 1:
y_train, y_valid = np.squeeze(y_train.values), np.squeeze(y_valid.values)
else:
raise ValueError("LightGBM doesn't support multi-label training")
# LightGBM needs a 1D array as its label
if y.values.ndim == 2 and y.values.shape[1] == 1:
y = np.squeeze(y.values)
else:
raise ValueError("LightGBM doesn't support multi-label training")
dtrain = lgb.Dataset(x_train, label=y_train)
dvalid = lgb.Dataset(x_valid, label=y_valid)
return dtrain, dvalid
if reweighter is None:
w = None
elif isinstance(reweighter, Reweighter):
w = reweighter.reweight(df)
else:
raise ValueError("Unsupported reweighter type.")
ds_l.append((lgb.Dataset(x.values, label=y, weight=w), key))
return ds_l
def fit(
self,
dataset: DatasetH,
num_boost_round=1000,
num_boost_round=None,
early_stopping_rounds=None,
verbose_eval=20,
evals_result=dict(),
reweighter=None,
**kwargs
):
dtrain, dvalid = self._prepare_data(dataset)
ds_l = self._prepare_data(dataset, reweighter)
ds, names = list(zip(*ds_l))
self.model = lgb.train(
self.params,
dtrain,
num_boost_round=num_boost_round,
valid_sets=[dtrain, dvalid],
valid_names=["train", "valid"],
ds[0], # training dataset
num_boost_round=self.num_boost_round if num_boost_round is None else num_boost_round,
valid_sets=ds,
valid_names=names,
early_stopping_rounds=(
self.early_stopping_rounds if early_stopping_rounds is None else early_stopping_rounds
),
@@ -64,8 +78,8 @@ class LGBModel(ModelFT, LightGBMFInt):
evals_result=evals_result,
**kwargs
)
evals_result["train"] = list(evals_result["train"].values())[0]
evals_result["valid"] = list(evals_result["valid"].values())[0]
for k in names:
evals_result[k] = list(evals_result[k].values())[0]
def predict(self, dataset: DatasetH, segment: Union[Text, slice] = "test"):
if self.model is None:
@@ -73,7 +87,7 @@ class LGBModel(ModelFT, LightGBMFInt):
x_test = dataset.prepare(segment, col_set="feature", data_key=DataHandlerLP.DK_I)
return pd.Series(self.model.predict(x_test.values), index=x_test.index)
def finetune(self, dataset: DatasetH, num_boost_round=10, verbose_eval=20):
def finetune(self, dataset: DatasetH, num_boost_round=10, verbose_eval=20, reweighter=None):
"""
finetune model
@@ -87,7 +101,7 @@ class LGBModel(ModelFT, LightGBMFInt):
verbose level
"""
# Fine-tune the existing model by training it for more rounds
dtrain, _ = self._prepare_data(dataset)
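# _prepare_data now returns a list of (lgb.Dataset, name) pairs; the "train" pair always comes first
dtrain = self._prepare_data(dataset, reweighter)[0][0]
# empty data is already rejected inside _prepare_data, and lgb.Dataset has no `.empty` attribute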
self.model = lgb.train(
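Review note: the reworked `_prepare_data` returns a list of `(dataset, name)` pairs so that the validation segment becomes optional, and `fit` falls back to the constructor's `num_boost_round` when the argument is `None`. A minimal plain-Python sketch of both ideas follows. `prepare_segments` and `resolve` are hypothetical helper names, and plain lists stand in for `lgb.Dataset` objects.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def prepare_segments(
    segments: Dict[str, Sequence], keys: Sequence[str] = ("train", "valid")
) -> List[Tuple[Sequence, str]]:
    """Collect (data, name) pairs for the segments that exist; "train" is mandatory."""
    if "train" not in segments:
        raise ValueError("the train segment is required")
    pairs = []
    for key in keys:
        if key in segments:
            data = segments[key]
            if len(data) == 0:
                raise ValueError("Empty data from dataset, please check your dataset config.")
            pairs.append((data, key))
    return pairs


def resolve(override: Optional[int], default: int) -> int:
    """Use the per-call value when given, otherwise the value stored at construction time."""
    return default if override is None else override
```

Because the pairs keep their order, `zip(*pairs)` splits them into one tuple of datasets and one tuple of names, and the first dataset is always the training set.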

View File

@@ -56,7 +56,7 @@ class HFLGBModel(ModelFT, LightGBMFInt):
def hf_signal_test(self, dataset: DatasetH, threhold=0.2):
"""
Test the sigal in high frequency test set
Test the signal in high frequency test set
"""
if self.model is None:
raise ValueError("Model hasn't been trained yet")
@@ -86,7 +86,7 @@ class HFLGBModel(ModelFT, LightGBMFInt):
raise ValueError("Empty data from dataset, please check your dataset config.")
x_train, y_train = df_train["feature"], df_train["label"]
x_valid, y_valid = df_train["feature"], df_valid["label"]
x_valid, y_valid = df_valid["feature"], df_valid["label"]
if y_train.values.ndim == 2 and y_train.values.shape[1] == 1:
l_name = df_train["label"].columns[0]
# Convert label into alpha

View File

@@ -4,6 +4,7 @@
import numpy as np
import pandas as pd
from typing import Text, Union
from qlib.data.dataset.weight import Reweighter
from scipy.optimize import nnls
from sklearn.linear_model import LinearRegression, Ridge, Lasso
@@ -49,33 +50,40 @@ class LinearModel(Model):
self.coef_ = None
def fit(self, dataset: DatasetH):
def fit(self, dataset: DatasetH, reweighter: Reweighter = None):
df_train = dataset.prepare("train", col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
if df_train.empty:
raise ValueError("Empty data from dataset, please check your dataset config.")
if reweighter is not None:
w: pd.Series = reweighter.reweight(df_train)
w = w.values
else:
w = None
X, y = df_train["feature"].values, np.squeeze(df_train["label"].values)
if self.estimator in [self.OLS, self.RIDGE, self.LASSO]:
self._fit(X, y)
self._fit(X, y, w)
elif self.estimator == self.NNLS:
self._fit_nnls(X, y)
self._fit_nnls(X, y, w)
else:
raise ValueError(f"unknown estimator `{self.estimator}`")
return self
def _fit(self, X, y):
def _fit(self, X, y, w):
if self.estimator == self.OLS:
model = LinearRegression(fit_intercept=self.fit_intercept, copy_X=False)
else:
model = {self.RIDGE: Ridge, self.LASSO: Lasso}[self.estimator](
alpha=self.alpha, fit_intercept=self.fit_intercept, copy_X=False
)
model.fit(X, y)
model.fit(X, y, sample_weight=w)
self.coef_ = model.coef_
self.intercept_ = model.intercept_
def _fit_nnls(self, X, y):
def _fit_nnls(self, X, y, w=None):
if w is not None:
raise NotImplementedError("TODO: support nnls with weight") # TODO
if self.fit_intercept:
X = np.c_[X, np.ones(len(X))] # NOTE: mem copy
coef = nnls(X, y)[0]
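Review note: weighted NNLS is left as a TODO in this change. One standard reduction, offered here as an assumption rather than the approach the authors intend, is to scale each row of `X` and `y` by the square root of its weight and then run the unweighted solver, since sum_i w_i (x_i b - y_i)^2 equals the unweighted squared error on the scaled rows. The plain-Python sketch below shows this for a single coefficient without intercept. `nnls_1d` and `weighted_nnls_1d` are hypothetical names.

```python
import math
from typing import Sequence


def nnls_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Non-negative least squares for one coefficient: argmin over b >= 0 of sum((x * b - y) ** 2)."""
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs)
    if den == 0:
        return 0.0
    # For a single coefficient the constrained optimum is the clipped unconstrained optimum.
    return max(0.0, num / den)


def weighted_nnls_1d(xs: Sequence[float], ys: Sequence[float], ws: Sequence[float]) -> float:
    """Weighted variant: scale every row by sqrt(w), then reuse the unweighted solver."""
    roots = [math.sqrt(w) for w in ws]
    return nnls_1d([r * x for r, x in zip(roots, xs)], [r * y for r, y in zip(roots, ys)])
```

The same row scaling would apply to a full design matrix before calling `scipy.optimize.nnls`, although that path is not exercised here.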

View File

@@ -554,7 +554,7 @@ class AdaRNN(nn.Module):
return fc_out
class TransferLoss(object):
class TransferLoss:
def __init__(self, loss_type="cosine", input_dim=512):
"""
Supported loss_type: mmd(mmd_lin), mmd_rbf, coral, cosine, kl, js, mine, adv

View File

@@ -22,6 +22,8 @@ from .pytorch_utils import count_parameters
from ...model.base import Model
from ...data.dataset import DatasetH, TSDatasetH
from ...data.dataset.handler import DataHandlerLP
from ...model.utils import ConcatDataset
from ...data.dataset.weight import Reweighter
class ALSTM(Model):
@@ -139,15 +141,18 @@ class ALSTM(Model):
def use_gpu(self):
return self.device != torch.device("cpu")
def mse(self, pred, label):
loss = (pred - label) ** 2
def mse(self, pred, label, weight):
loss = weight * (pred - label) ** 2
return torch.mean(loss)
def loss_fn(self, pred, label):
def loss_fn(self, pred, label, weight=None):
mask = ~torch.isnan(label)
if weight is None:
weight = torch.ones_like(label)
if self.loss == "mse":
return self.mse(pred[mask], label[mask])
return self.mse(pred[mask], label[mask], weight[mask])
raise ValueError("unknown loss `%s`" % self.loss)
@@ -164,12 +169,12 @@ class ALSTM(Model):
self.ALSTM_model.train()
for data in data_loader:
for (data, weight) in data_loader:
feature = data[:, :, 0:-1].to(self.device)
label = data[:, -1, -1].to(self.device)
pred = self.ALSTM_model(feature.float())
loss = self.loss_fn(pred, label)
loss = self.loss_fn(pred, label, weight.to(self.device))
self.train_optimizer.zero_grad()
loss.backward()
@@ -183,7 +188,7 @@ class ALSTM(Model):
scores = []
losses = []
for data in data_loader:
for (data, weight) in data_loader:
feature = data[:, :, 0:-1].to(self.device)
# feature[torch.isnan(feature)] = 0
@@ -191,7 +196,7 @@ class ALSTM(Model):
with torch.no_grad():
pred = self.ALSTM_model(feature.float())
loss = self.loss_fn(pred, label)
loss = self.loss_fn(pred, label, weight.to(self.device))
losses.append(loss.item())
score = self.metric_fn(pred, label)
@@ -204,6 +209,7 @@ class ALSTM(Model):
dataset,
evals_result=dict(),
save_path=None,
reweighter=None,
):
dl_train = dataset.prepare("train", col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
dl_valid = dataset.prepare("valid", col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
@@ -213,11 +219,28 @@ class ALSTM(Model):
dl_train.config(fillna_type="ffill+bfill") # process nan brought by dataloader
dl_valid.config(fillna_type="ffill+bfill") # process nan brought by dataloader
if reweighter is None:
wl_train = np.ones(len(dl_train))
wl_valid = np.ones(len(dl_valid))
elif isinstance(reweighter, Reweighter):
wl_train = reweighter.reweight(dl_train)
wl_valid = reweighter.reweight(dl_valid)
else:
raise ValueError("Unsupported reweighter type.")
train_loader = DataLoader(
dl_train, batch_size=self.batch_size, shuffle=True, num_workers=self.n_jobs, drop_last=True
ConcatDataset(dl_train, wl_train),
batch_size=self.batch_size,
shuffle=True,
num_workers=self.n_jobs,
drop_last=True,
)
valid_loader = DataLoader(
dl_valid, batch_size=self.batch_size, shuffle=False, num_workers=self.n_jobs, drop_last=True
ConcatDataset(dl_valid, wl_valid),
batch_size=self.batch_size,
shuffle=False,
num_workers=self.n_jobs,
drop_last=True,
)
save_path = get_or_create_path(save_path)
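Review note: the new `mse` multiplies each squared error by its sample weight and then takes a plain mean over the entries whose label is not NaN. The result is therefore divided by the number of valid samples, not by the sum of the weights. The plain-Python sketch below reproduces that arithmetic on lists. `weighted_mse` is a hypothetical name, and lists stand in for tensors.

```python
import math
from typing import Optional, Sequence


def weighted_mse(pred: Sequence[float], label: Sequence[float], weight: Optional[Sequence[float]] = None) -> float:
    """Mean of weight * (pred - label) ** 2 over the entries whose label is not NaN."""
    if weight is None:
        weight = [1.0] * len(label)  # same default as torch.ones_like(label) in the diff
    terms = [w * (p - y) ** 2 for p, y, w in zip(pred, label, weight) if not math.isnan(y)]
    if not terms:
        return float("nan")  # nothing left after masking
    return sum(terms) / len(terms)
```

With this definition, doubling every weight doubles the loss, so weights act as a per-sample learning-rate scale unless they are normalized beforehand.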

View File

@@ -260,7 +260,7 @@ class GATs(Model):
if self.model_path is not None:
self.logger.info("Loading pretrained model...")
pretrained_model.load_state_dict(torch.load(self.model_path))
pretrained_model.load_state_dict(torch.load(self.model_path, map_location=self.device))
model_dict = self.GAT_model.state_dict()
pretrained_dict = {k: v for k, v in pretrained_model.state_dict().items() if k in model_dict}

View File

@@ -276,7 +276,7 @@ class GATs(Model):
if self.model_path is not None:
self.logger.info("Loading pretrained model...")
pretrained_model.load_state_dict(torch.load(self.model_path))
pretrained_model.load_state_dict(torch.load(self.model_path, map_location=self.device))
model_dict = self.GAT_model.state_dict()
pretrained_dict = {k: v for k, v in pretrained_model.state_dict().items() if k in model_dict}

View File

@@ -21,6 +21,8 @@ from .pytorch_utils import count_parameters
from ...model.base import Model
from ...data.dataset import DatasetH, TSDatasetH
from ...data.dataset.handler import DataHandlerLP
from ...model.utils import ConcatDataset
from ...data.dataset.weight import Reweighter
class GRU(Model):
@@ -138,15 +140,18 @@ class GRU(Model):
def use_gpu(self):
return self.device != torch.device("cpu")
def mse(self, pred, label):
loss = (pred - label) ** 2
def mse(self, pred, label, weight):
loss = weight * (pred - label) ** 2
return torch.mean(loss)
def loss_fn(self, pred, label):
def loss_fn(self, pred, label, weight=None):
mask = ~torch.isnan(label)
if weight is None:
weight = torch.ones_like(label)
if self.loss == "mse":
return self.mse(pred[mask], label[mask])
return self.mse(pred[mask], label[mask], weight[mask])
raise ValueError("unknown loss `%s`" % self.loss)
@@ -163,12 +168,12 @@ class GRU(Model):
self.GRU_model.train()
for data in data_loader:
for (data, weight) in data_loader:
feature = data[:, :, 0:-1].to(self.device)
label = data[:, -1, -1].to(self.device)
pred = self.GRU_model(feature.float())
loss = self.loss_fn(pred, label)
loss = self.loss_fn(pred, label, weight.to(self.device))
self.train_optimizer.zero_grad()
loss.backward()
@@ -182,7 +187,7 @@ class GRU(Model):
scores = []
losses = []
for data in data_loader:
for (data, weight) in data_loader:
feature = data[:, :, 0:-1].to(self.device)
# feature[torch.isnan(feature)] = 0
@@ -190,7 +195,7 @@ class GRU(Model):
with torch.no_grad():
pred = self.GRU_model(feature.float())
loss = self.loss_fn(pred, label)
loss = self.loss_fn(pred, label, weight.to(self.device))
losses.append(loss.item())
score = self.metric_fn(pred, label)
@@ -203,6 +208,7 @@ class GRU(Model):
dataset,
evals_result=dict(),
save_path=None,
reweighter=None,
):
dl_train = dataset.prepare("train", col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
dl_valid = dataset.prepare("valid", col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
@@ -212,11 +218,28 @@ class GRU(Model):
dl_train.config(fillna_type="ffill+bfill") # process nan brought by dataloader
dl_valid.config(fillna_type="ffill+bfill") # process nan brought by dataloader
if reweighter is None:
wl_train = np.ones(len(dl_train))
wl_valid = np.ones(len(dl_valid))
elif isinstance(reweighter, Reweighter):
wl_train = reweighter.reweight(dl_train)
wl_valid = reweighter.reweight(dl_valid)
else:
raise ValueError("Unsupported reweighter type.")
train_loader = DataLoader(
dl_train, batch_size=self.batch_size, shuffle=True, num_workers=self.n_jobs, drop_last=True
ConcatDataset(dl_train, wl_train),
batch_size=self.batch_size,
shuffle=True,
num_workers=self.n_jobs,
drop_last=True,
)
valid_loader = DataLoader(
dl_valid, batch_size=self.batch_size, shuffle=False, num_workers=self.n_jobs, drop_last=True
ConcatDataset(dl_valid, wl_valid),
batch_size=self.batch_size,
shuffle=False,
num_workers=self.n_jobs,
drop_last=True,
)
save_path = get_or_create_path(save_path)
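Review note: wrapping the samples and their weights in `ConcatDataset` is what lets the loops above unpack `(data, weight)` from each batch. The sketch below imitates that behaviour without PyTorch. `PairedDataset` and `iter_batches` are hypothetical stand-ins, and the assumption that the real `ConcatDataset` returns one tuple per index is inferred from how the loaders are consumed in this diff.

```python
from typing import Iterator, List, Sequence, Tuple


class PairedDataset:
    """Hypothetical stand-in: index i returns the i-th item of every source as a tuple."""

    def __init__(self, *sources: Sequence):
        self.sources = sources

    def __len__(self) -> int:
        return min(len(s) for s in self.sources)

    def __getitem__(self, idx: int) -> Tuple:
        return tuple(s[idx] for s in self.sources)


def iter_batches(dataset: PairedDataset, batch_size: int, drop_last: bool = True) -> Iterator[Tuple[List, ...]]:
    """Group consecutive items into batches and regroup each batch column-wise."""
    batch = []
    for i in range(len(dataset)):
        batch.append(dataset[i])
        if len(batch) == batch_size:
            yield tuple(list(col) for col in zip(*batch))
            batch = []
    if batch and not drop_last:
        yield tuple(list(col) for col in zip(*batch))
```

Note that with `drop_last=True`, which both loaders above use, a trailing incomplete batch is skipped, so a few validation samples may never be scored.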

View File

@@ -20,6 +20,8 @@ from torch.utils.data import DataLoader
from ...model.base import Model
from ...data.dataset import DatasetH, TSDatasetH
from ...data.dataset.handler import DataHandlerLP
from ...model.utils import ConcatDataset
from ...data.dataset.weight import Reweighter
class LSTM(Model):
@@ -134,15 +136,18 @@ class LSTM(Model):
def use_gpu(self):
return self.device != torch.device("cpu")
def mse(self, pred, label):
loss = (pred - label) ** 2
def mse(self, pred, label, weight):
loss = weight * (pred - label) ** 2
return torch.mean(loss)
def loss_fn(self, pred, label, weight=None):
mask = ~torch.isnan(label)
if weight is None:
weight = torch.ones_like(label)
if self.loss == "mse":
return self.mse(pred[mask], label[mask])
return self.mse(pred[mask], label[mask], weight[mask])
raise ValueError("unknown loss `%s`" % self.loss)
@@ -159,12 +164,12 @@ class LSTM(Model):
self.LSTM_model.train()
for data in data_loader:
for (data, weight) in data_loader:
feature = data[:, :, 0:-1].to(self.device)
label = data[:, -1, -1].to(self.device)
pred = self.LSTM_model(feature.float())
loss = self.loss_fn(pred, label)
loss = self.loss_fn(pred, label, weight.to(self.device))
self.train_optimizer.zero_grad()
loss.backward()
@@ -178,14 +183,14 @@ class LSTM(Model):
scores = []
losses = []
for data in data_loader:
for (data, weight) in data_loader:
feature = data[:, :, 0:-1].to(self.device)
# feature[torch.isnan(feature)] = 0
label = data[:, -1, -1].to(self.device)
pred = self.LSTM_model(feature.float())
loss = self.loss_fn(pred, label)
loss = self.loss_fn(pred, label, weight.to(self.device))
losses.append(loss.item())
score = self.metric_fn(pred, label)
@@ -198,6 +203,7 @@ class LSTM(Model):
dataset,
evals_result=dict(),
save_path=None,
reweighter=None,
):
dl_train = dataset.prepare("train", col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
dl_valid = dataset.prepare("valid", col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
@@ -207,11 +213,28 @@ class LSTM(Model):
dl_train.config(fillna_type="ffill+bfill") # process nan brought by dataloader
dl_valid.config(fillna_type="ffill+bfill") # process nan brought by dataloader
if reweighter is None:
wl_train = np.ones(len(dl_train))
wl_valid = np.ones(len(dl_valid))
elif isinstance(reweighter, Reweighter):
wl_train = reweighter.reweight(dl_train)
wl_valid = reweighter.reweight(dl_valid)
else:
raise ValueError("Unsupported reweighter type.")
train_loader = DataLoader(
dl_train, batch_size=self.batch_size, shuffle=True, num_workers=self.n_jobs, drop_last=True
ConcatDataset(dl_train, wl_train),
batch_size=self.batch_size,
shuffle=True,
num_workers=self.n_jobs,
drop_last=True,
)
valid_loader = DataLoader(
dl_valid, batch_size=self.batch_size, shuffle=False, num_workers=self.n_jobs, drop_last=True
ConcatDataset(dl_valid, wl_valid),
batch_size=self.batch_size,
shuffle=False,
num_workers=self.n_jobs,
drop_last=True,
)
save_path = get_or_create_path(save_path)

View File

@@ -19,6 +19,7 @@ from .pytorch_utils import count_parameters
from ...model.base import Model
from ...data.dataset import DatasetH
from ...data.dataset.handler import DataHandlerLP
from ...data.dataset.weight import Reweighter
from ...utils import unpack_archive_with_buffer, save_multiple_parts_file, get_or_create_path
from ...log import get_module_logger
from ...workflow import R
@@ -97,7 +98,6 @@ class DNNModelPytorch(Model):
"\nlr_decay_steps : {}"
"\noptimizer : {}"
"\nloss_type : {}"
"\neval_steps : {}"
"\nseed : {}"
"\ndevice : {}"
"\nuse_GPU : {}"
@@ -112,7 +112,6 @@ class DNNModelPytorch(Model):
lr_decay_steps,
optimizer,
loss,
eval_steps,
seed,
self.device,
self.use_gpu,
@@ -166,18 +165,22 @@ class DNNModelPytorch(Model):
evals_result=dict(),
verbose=True,
save_path=None,
reweighter=None,
):
df_train, df_valid = dataset.prepare(
["train", "valid"], col_set=["feature", "label"], data_key=DataHandlerLP.DK_L
)
x_train, y_train = df_train["feature"], df_train["label"]
x_valid, y_valid = df_valid["feature"], df_valid["label"]
try:
wdf_train, wdf_valid = dataset.prepare(["train", "valid"], col_set=["weight"], data_key=DataHandlerLP.DK_L)
w_train, w_valid = wdf_train["weight"], wdf_valid["weight"]
except KeyError as e:
if reweighter is None:
w_train = pd.DataFrame(np.ones_like(y_train.values), index=y_train.index)
w_valid = pd.DataFrame(np.ones_like(y_valid.values), index=y_valid.index)
elif isinstance(reweighter, Reweighter):
w_train = pd.DataFrame(reweighter.reweight(df_train))
w_valid = pd.DataFrame(reweighter.reweight(df_valid))
else:
raise ValueError("Unsupported reweighter type.")
save_path = get_or_create_path(save_path)
stop_steps = 0
@@ -257,7 +260,7 @@ class DNNModelPytorch(Model):
self.scheduler.step(cur_loss_val)
# restore the optimal parameters after training
self.dnn_model.load_state_dict(torch.load(save_path))
self.dnn_model.load_state_dict(torch.load(save_path, map_location=self.device))
if self.use_gpu:
torch.cuda.empty_cache()
@@ -267,7 +270,7 @@ class DNNModelPytorch(Model):
loss = torch.mul(sqr_loss, w).mean()
return loss
elif loss_type == "binary":
loss = nn.BCELoss(weight=w)
loss = nn.BCEWithLogitsLoss(weight=w)
return loss(pred, target)
else:
raise NotImplementedError("loss {} is not supported!".format(loss_type))
@@ -296,7 +299,7 @@ class DNNModelPytorch(Model):
]
_model_path = os.path.join(model_dir, _model_name)
# Load model
self.dnn_model.load_state_dict(torch.load(_model_path))
self.dnn_model.load_state_dict(torch.load(_model_path, map_location=self.device))
self.fitted = True
@@ -326,24 +329,16 @@ class Net(nn.Module):
dnn_layers = []
drop_input = nn.Dropout(0.05)
dnn_layers.append(drop_input)
for i, (input_dim, hidden_units) in enumerate(zip(layers[:-1], layers[1:])):
fc = nn.Linear(input_dim, hidden_units)
for i, (_input_dim, hidden_units) in enumerate(zip(layers[:-1], layers[1:])):
fc = nn.Linear(_input_dim, hidden_units)
activation = nn.LeakyReLU(negative_slope=0.1, inplace=False)
bn = nn.BatchNorm1d(hidden_units)
seq = nn.Sequential(fc, bn, activation)
dnn_layers.append(seq)
drop_input = nn.Dropout(0.05)
dnn_layers.append(drop_input)
if loss == "mse":
fc = nn.Linear(hidden_units, output_dim)
dnn_layers.append(fc)
elif loss == "binary":
fc = nn.Linear(hidden_units, output_dim)
sigmoid = nn.Sigmoid()
dnn_layers.append(nn.Sequential(fc, sigmoid))
else:
raise NotImplementedError("loss {} is not supported!".format(loss))
fc = nn.Linear(hidden_units, output_dim)
dnn_layers.append(fc)
# optimizer
self.dnn_layers = nn.ModuleList(dnn_layers)
self._weight_init()
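Review note: the network now ends in a plain linear layer for both losses, and the binary loss switches from `nn.BCELoss` to `nn.BCEWithLogitsLoss`, which applies the sigmoid internally. The plain-Python sketch below shows why the two formulations agree on moderate inputs and why the logits form is numerically safer. The function names are hypothetical, and the stable formula used is the standard one for this loss, not code taken from PyTorch.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bce_on_probability(p: float, y: float) -> float:
    """Binary cross-entropy on a probability, as used after an explicit sigmoid layer."""
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def bce_with_logits(x: float, y: float) -> float:
    """Binary cross-entropy on a raw logit, in the numerically stable form."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
```

For a logit such as 1000.0 the sigmoid rounds to exactly 1.0, so the probability form would need `log(0)`, while the logits form simply returns 1000.0 for a label of 0. A consequence worth checking in callers is that the model's raw outputs are no longer probabilities.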

View File

@@ -160,7 +160,7 @@ class TabnetModel(Model):
self.logger.info("Pretrain...")
self.pretrain_fn(dataset, self.pretrain_file)
self.logger.info("Load Pretrain model")
self.tabnet_model.load_state_dict(torch.load(self.pretrain_file))
self.tabnet_model.load_state_dict(torch.load(self.pretrain_file, map_location=self.device))
# adding one more linear layer to fit the final output dimension
self.tabnet_model = FinetuneModel(self.out_dim, self.final_out_dim, self.tabnet_model).to(self.device)
@@ -446,7 +446,7 @@ class TabNet(nn.Module):
Args:
n_d: dimension of the features used to calculate the final results
n_a: dimension of the features input to the attention transformer of the next step
n_shared: numbr of shared steps in feature transfomer(optional)
n_shared: number of shared steps in feature transformer (optional)
n_ind: number of independent steps in feature transformer
n_steps: number of passes through TabNet
relax coefficient:
@@ -479,7 +479,7 @@ class TabNet(nn.Module):
out = torch.zeros(x.size(0), self.n_d).to(x.device)
for step in self.steps:
x_te, l = step(x, x_a, priors)
out += F.relu(x_te[:, : self.n_d]) # split the feautre from feat_transformer
out += F.relu(x_te[:, : self.n_d]) # split the feature from feat_transformer
x_a = x_te[:, self.n_d :]
sparse_loss.append(l)
return self.fc(out), sum(sparse_loss)

View File

@@ -56,6 +56,7 @@ class TCTS(Model):
loss="mse",
fore_optimizer="adam",
weight_optimizer="adam",
input_dim=360,
output_dim=5,
fore_lr=5e-7,
weight_lr=5e-7,
@@ -83,6 +84,7 @@ class TCTS(Model):
self.device = torch.device("cuda:%d" % (GPU) if torch.cuda.is_available() else "cpu")
self.use_gpu = torch.cuda.is_available()
self.seed = seed
self.input_dim = input_dim
self.output_dim = output_dim
self.fore_lr = fore_lr
self.weight_lr = weight_lr
@@ -139,7 +141,6 @@ class TCTS(Model):
raise NotImplementedError("mode {} is not supported!".format(self.mode))
def train_epoch(self, x_train, y_train, x_valid, y_valid):
x_train_values = x_train.values
y_train_values = np.squeeze(y_train.values)
@@ -297,7 +298,7 @@ class TCTS(Model):
dropout=self.dropout,
)
self.weight_model = MLPModel(
d_feat=360 + 3 * self.output_dim + 1,
d_feat=self.input_dim + 3 * self.output_dim + 1,
hidden_size=self.hidden_size,
num_layers=self.num_layers,
dropout=self.dropout,
@@ -350,9 +351,9 @@ class TCTS(Model):
break
print("best loss:", best_loss, "@", best_epoch)
best_param = torch.load(save_path + "_fore_model.bin")
best_param = torch.load(save_path + "_fore_model.bin", map_location=self.device)
self.fore_model.load_state_dict(best_param)
best_param = torch.load(save_path + "_weight_model.bin")
best_param = torch.load(save_path + "_weight_model.bin", map_location=self.device)
self.weight_model.load_state_dict(best_param)
self.fitted = True

View File

@@ -19,12 +19,13 @@ import torch.nn.functional as F
try:
from torch.utils.tensorboard import SummaryWriter
except:
except ImportError:
SummaryWriter = None
from tqdm import tqdm
from qlib.utils import get_or_create_path
from qlib.constant import EPS
from qlib.log import get_module_logger
from qlib.model.base import Model
from qlib.contrib.data.dataset import MTSDatasetH
@@ -232,7 +233,7 @@ class TRAModel(Model):
choice_all.append(pd.DataFrame(choice.detach().cpu().numpy(), index=index))
decay = self.rho ** (self.global_step // 100) # decay every 100 steps
lamb = 0 if is_pretrain else self.lamb * decay
reg = prob.log().mul(P).sum(dim=1).mean() # train router to predict OT assignment
reg = prob.log().mul(P).sum(dim=1).mean() # train router to predict TO assignment
if self._writer is not None and not is_pretrain:
self._writer.add_scalar("training/router_loss", -reg.item(), self.global_step)
self._writer.add_scalar("training/reg_loss", loss.item(), self.global_step)
@@ -256,7 +257,7 @@ class TRAModel(Model):
total_loss += loss.item()
total_count += 1
if self.use_daily_transport and len(P_all):
if self.use_daily_transport and len(P_all) > 0:
P_all = pd.concat(P_all, axis=0)
prob_all = pd.concat(prob_all, axis=0)
choice_all = pd.concat(choice_all, axis=0)
@@ -663,7 +664,7 @@ class TRA(nn.Module):
"""Temporal Routing Adaptor (TRA)
TRA takes historical prediction erros & latent representation as inputs,
TRA takes historical prediction errors & latent representation as inputs,
then routes the input sample to a specific predictor for training & inference.
Args:
@@ -791,7 +792,7 @@ def minmax_norm(x):
xmin = x.min(dim=-1, keepdim=True).values
xmax = x.max(dim=-1, keepdim=True).values
mask = (xmin == xmax).squeeze()
x = (x - xmin) / (xmax - xmin + 1e-12)
x = (x - xmin) / (xmax - xmin + EPS)
x[mask] = 1
return x
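Review note: `minmax_norm` guards the division with a small constant and then overwrites rows whose minimum equals their maximum with 1. The plain-Python sketch below applies the same rule to a single row. The value `EPS = 1e-12` is an assumption based on the literal that this hunk replaces, and the function name is reused only for illustration.

```python
from typing import List, Sequence

EPS = 1e-12  # assumed to match qlib.constant.EPS, based on the literal replaced in the diff


def minmax_norm(row: Sequence[float]) -> List[float]:
    """Scale one row to [0, 1]; a constant row maps to all ones, as the mask does in the diff."""
    lo, hi = min(row), max(row)
    if lo == hi:
        return [1.0] * len(row)
    return [(v - lo) / (hi - lo + EPS) for v in row]
```

Without the constant-row rule, such a row would come out as all zeros, because every numerator is zero.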

View File

@@ -33,5 +33,5 @@ def count_parameters(models_or_parameters, unit="m"):
elif unit == "gb" or unit == "g":
counts /= 2 ** 30
elif unit is not None:
raise ValueError("Unknow unit: {:}".format(unit))
raise ValueError("Unknown unit: {:}".format(unit))
return counts

View File

@@ -9,6 +9,7 @@ from ...model.base import Model
from ...data.dataset import DatasetH
from ...data.dataset.handler import DataHandlerLP
from ...model.interpret.base import FeatureInt
from ...data.dataset.weight import Reweighter
class XGBModel(Model, FeatureInt):
@@ -26,6 +27,7 @@ class XGBModel(Model, FeatureInt):
early_stopping_rounds=50,
verbose_eval=20,
evals_result=dict(),
reweighter=None,
**kwargs
):
@@ -43,8 +45,17 @@ class XGBModel(Model, FeatureInt):
else:
raise ValueError("XGBoost doesn't support multi-label training")
dtrain = xgb.DMatrix(x_train, label=y_train_1d)
dvalid = xgb.DMatrix(x_valid, label=y_valid_1d)
if reweighter is None:
w_train = None
w_valid = None
elif isinstance(reweighter, Reweighter):
w_train = reweighter.reweight(df_train)
w_valid = reweighter.reweight(df_valid)
else:
raise ValueError("Unsupported reweighter type.")
dtrain = xgb.DMatrix(x_train.values, label=y_train_1d, weight=w_train)
dvalid = xgb.DMatrix(x_valid.values, label=y_valid_1d, weight=w_valid)
self.model = xgb.train(
self._params,
dtrain=dtrain,

View File

@@ -36,7 +36,7 @@ def save_instance(instance, file_path):
save(dump) an instance to a pickle file
Parameter
instance :
data to te dumped
data to be dumped
file_path : string / pathlib.Path()
path of file to be dumped
"""

View File

@@ -57,7 +57,7 @@ def _group_return(pred_label: pd.DataFrame = None, reverse: bool = False, N: int
).figure
t_df = t_df.loc[:, ["long-short", "long-average"]]
_bin_size = ((t_df.max() - t_df.min()) / 20).min()
_bin_size = float(((t_df.max() - t_df.min()) / 20).min())
group_hist_figure = SubplotsGraph(
t_df,
kind_map=dict(kind="DistplotGraph", kwargs=dict(bin_size=_bin_size)),

View File

@@ -15,7 +15,6 @@ from plotly.figure_factory import create_distplot
class BaseGraph:
""" """
_name = None
@@ -297,8 +296,8 @@ class SubplotsGraph:
:return:
"""
self._sub_graph_data = list()
self._subplot_titles = list()
self._sub_graph_data = []
self._subplot_titles = []
for i, column_name in enumerate(self._df.columns):
row = math.ceil((i + 1) / self.__cols)

View File

@@ -5,6 +5,7 @@
from .signal_strategy import (
TopkDropoutStrategy,
WeightStrategyBase,
EnhancedIndexingStrategy,
)
from .rule_strategy import (

View File

@@ -47,7 +47,7 @@ class SoftTopkStrategy(WeightStrategyBase):
Return the proportion of your total value you will used in investment.
Dynamically risk_degree will result in Market timing
"""
# It will use 95% amoutn of your total value by default
# It uses 95% of your total value by default
return self.risk_degree
def generate_target_weight_position(self, score, current, trade_start_time, trade_end_time):

View File

@@ -0,0 +1,203 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import numpy as np
import cvxpy as cp
import pandas as pd
from typing import Union, Optional, Dict, Any, List
from qlib.log import get_module_logger
from .base import BaseOptimizer
logger = get_module_logger("EnhancedIndexingOptimizer")
class EnhancedIndexingOptimizer(BaseOptimizer):
"""
Portfolio Optimizer for Enhanced Indexing
Notations:
w0: current holding weights
wb: benchmark weight
r: expected return
F: factor exposure
cov_b: factor covariance
var_u: residual variance (diagonal)
lamb: risk aversion parameter
delta: total turnover limit
b_dev: benchmark deviation limit
f_dev: factor deviation limit
Also denote:
d = w - wb: benchmark deviation
v = d @ F: factor deviation
The optimization problem for enhanced indexing:
max_w d @ r - lamb * (v @ cov_b @ v + var_u @ d**2)
s.t. w >= 0
sum(w) == 1
sum(|w - w0|) <= delta
d >= -b_dev
d <= b_dev
v >= -f_dev
v <= f_dev
"""
def __init__(
self,
lamb: float = 1,
delta: Optional[float] = 0.2,
b_dev: Optional[float] = 0.01,
f_dev: Optional[Union[List[float], np.ndarray]] = None,
scale_return: bool = True,
epsilon: float = 5e-5,
solver_kwargs: Optional[Dict[str, Any]] = {},
):
"""
Args:
lamb (float): risk aversion parameter (larger `lamb` means more focus on risk)
delta (float): total turnover limit
b_dev (float): benchmark deviation limit
f_dev (list): factor deviation limit
scale_return (bool): whether scale return to match estimated volatility
epsilon (float): minimum weight
solver_kwargs (dict): kwargs for cvxpy solver
"""
assert lamb >= 0, "risk aversion parameter `lamb` should be positive"
self.lamb = lamb
assert delta >= 0, "turnover limit `delta` should be positive"
self.delta = delta
assert b_dev is None or b_dev >= 0, "benchmark deviation limit `b_dev` should be positive"
self.b_dev = b_dev
if isinstance(f_dev, float):
assert f_dev >= 0, "factor deviation limit `f_dev` should be positive"
elif f_dev is not None:
f_dev = np.array(f_dev)
assert all(f_dev >= 0), "factor deviation limit `f_dev` should be positive"
self.f_dev = f_dev
self.scale_return = scale_return
self.epsilon = epsilon
self.solver_kwargs = solver_kwargs
def __call__(
self,
r: np.ndarray,
F: np.ndarray,
cov_b: np.ndarray,
var_u: np.ndarray,
w0: np.ndarray,
wb: np.ndarray,
mfh: Optional[np.ndarray] = None,
mfs: Optional[np.ndarray] = None,
) -> np.ndarray:
"""
Args:
r (np.ndarray): expected returns
F (np.ndarray): factor exposure
cov_b (np.ndarray): factor covariance
var_u (np.ndarray): residual variance
w0 (np.ndarray): current holding weights
wb (np.ndarray): benchmark weights
mfh (np.ndarray): mask force holding
mfs (np.ndarray): mask force selling
Returns:
np.ndarray: optimized portfolio allocation
"""
# scale return to match volatility
if self.scale_return:
r = r / r.std()
r *= np.sqrt(np.mean(np.diag(F @ cov_b @ F.T) + var_u))
# target weight
w = cp.Variable(len(r), nonneg=True)
w.value = wb # for warm start
# precompute exposure
d = w - wb # benchmark exposure
v = d @ F # factor exposure
# objective
ret = d @ r # excess return
risk = cp.quad_form(v, cov_b) + var_u @ (d ** 2) # tracking error
obj = cp.Maximize(ret - self.lamb * risk)
# weight bounds
lb = np.zeros_like(wb)
ub = np.ones_like(wb)
# bench bounds
if self.b_dev is not None:
lb = np.maximum(lb, wb - self.b_dev)
ub = np.minimum(ub, wb + self.b_dev)
# force holding
if mfh is not None:
lb[mfh] = w0[mfh]
ub[mfh] = w0[mfh]
# force selling
# NOTE: this will override mfh
if mfs is not None:
lb[mfs] = 0
ub[mfs] = 0
# constraints
# TODO: currently we assume fullly invest in the stocks,
# in the future we should support holding cash as an asset
cons = [cp.sum(w) == 1, w >= lb, w <= ub]
# factor deviation
if self.f_dev is not None:
cons.extend([v >= -self.f_dev, v <= self.f_dev])
# total turnover constraint
t_cons = []
if self.delta is not None:
if w0 is not None and w0.sum() > 0:
t_cons.extend([cp.norm(w - w0, 1) <= self.delta])
# optimize
# trial 1: use all constraints
success = False
try:
prob = cp.Problem(obj, cons + t_cons)
prob.solve(solver=cp.ECOS, warm_start=True, **self.solver_kwargs)
assert prob.status == "optimal"
success = True
except Exception as e:
logger.warning(f"trial 1 failed {e} (status: {prob.status})")
# trial 2: remove turnover constraint
if not success and len(t_cons):
logger.info("try removing turnover constraint as the last optimization failed")
try:
w.value = wb
prob = cp.Problem(obj, cons)
prob.solve(solver=cp.ECOS, warm_start=True, **self.solver_kwargs)
assert prob.status in ["optimal", "optimal_inaccurate"]
success = True
except Exception as e:
logger.warning(f"trial 2 failed {e} (status: {prob.status})")
# return current weight if not success
if not success:
logger.warning("optimization failed, will return current holding weight")
return w0
if prob.status == "optimal_inaccurate":
logger.warning("the optimization is inaccurate")
# remove small weight
w = np.asarray(w.value)
w[w < self.epsilon] = 0
w /= w.sum()
return w
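The objective built above (`ret - lamb * risk`, with `risk` split into a factor part and a specific part) can be checked by hand on a tiny example. The following is a minimal sketch in plain Python, without cvxpy; the helper name `tracking_error_objective` and the two-stock, one-factor numbers are hypothetical and only illustrate the arithmetic, they are not part of qlib.

```python
def tracking_error_objective(w, wb, r, F, cov_b, var_u, lamb):
    """Evaluate ret - lamb * risk for one candidate weight vector.

    Mirrors the quantities in the optimizer above:
      d    = w - wb                      (benchmark exposure)
      v    = d @ F                       (factor exposure)
      risk = v' cov_b v + var_u . d**2   (tracking error)
    """
    n, k = len(w), len(cov_b)
    d = [w[i] - wb[i] for i in range(n)]
    v = [sum(d[i] * F[i][j] for i in range(n)) for j in range(k)]
    ret = sum(d[i] * r[i] for i in range(n))
    factor_risk = sum(v[a] * cov_b[a][b] * v[b] for a in range(k) for b in range(k))
    specific_risk = sum(var_u[i] * d[i] ** 2 for i in range(n))
    return ret - lamb * (factor_risk + specific_risk)


# Hypothetical inputs: two stocks, one factor.
value = tracking_error_objective(
    w=[0.6, 0.4],          # candidate weights
    wb=[0.5, 0.5],         # benchmark weights
    r=[0.02, 0.01],        # expected returns
    F=[[1.0], [0.5]],      # factor exposure, shape (n_stocks, n_factors)
    cov_b=[[0.04]],        # factor covariance
    var_u=[0.01, 0.02],    # residual variance
    lamb=1.0,
)
print(value)  # roughly 0.0006 (excess return 0.001 minus risk 0.0004)
```

Holding exactly the benchmark gives `d = 0`, so both the excess return and the tracking error vanish and the objective is zero; the optimizer only deviates from the benchmark when the expected excess return outweighs `lamb` times the added risk.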


@@ -8,7 +8,7 @@ import pandas as pd
import scipy.optimize as so
from typing import Optional, Union, Callable, List
from qlib.portfolio.optimizer import BaseOptimizer
from .base import BaseOptimizer
class PortfolioOptimizer(BaseOptimizer):
@@ -35,7 +35,7 @@ class PortfolioOptimizer(BaseOptimizer):
lamb: float = 0,
delta: float = 0,
alpha: float = 0.0,
scale_alpha: bool = True,
scale_return: bool = True,
tol: float = 1e-8,
):
"""
@@ -44,7 +44,7 @@ class PortfolioOptimizer(BaseOptimizer):
lamb (float): risk aversion parameter (larger `lamb` means more focus on risk)
delta (float): turnover rate limit
alpha (float): l2 norm regularizer
scale_alpha (bool): if to scale alpha to match the volatility of the covariance matrix
scale_return (bool): whether to scale the expected return to match the volatility of the covariance matrix
tol (float): tolerance for optimization termination
"""
assert method in [self.OPT_GMV, self.OPT_MVO, self.OPT_RP, self.OPT_INV], f"method `{method}` is not supported"
@@ -60,18 +60,18 @@ class PortfolioOptimizer(BaseOptimizer):
self.alpha = alpha
self.tol = tol
self.scale_alpha = scale_alpha
self.scale_return = scale_return
def __call__(
self,
S: Union[np.ndarray, pd.DataFrame],
u: Optional[Union[np.ndarray, pd.Series]] = None,
r: Optional[Union[np.ndarray, pd.Series]] = None,
w0: Optional[Union[np.ndarray, pd.Series]] = None,
) -> Union[np.ndarray, pd.Series]:
"""
Args:
S (np.ndarray or pd.DataFrame): covariance matrix
u (np.ndarray or pd.Series): expected returns (a.k.a., alpha)
r (np.ndarray or pd.Series): expected return
w0 (np.ndarray or pd.Series): initial weights (for turnover control)
Returns:
@@ -83,12 +83,12 @@ class PortfolioOptimizer(BaseOptimizer):
index = S.index
S = S.values
# transform alpha
if u is not None:
assert len(u) == len(S), "`u` has mismatched shape"
if isinstance(u, pd.Series):
assert u.index.equals(index), "`u` has mismatched index"
u = u.values
# transform return
if r is not None:
assert len(r) == len(S), "`r` has mismatched shape"
if isinstance(r, pd.Series):
assert r.index.equals(index), "`r` has mismatched index"
r = r.values
# transform initial weights
if w0 is not None:
@@ -97,13 +97,13 @@ class PortfolioOptimizer(BaseOptimizer):
assert w0.index.equals(index), "`w0` has mismatched index"
w0 = w0.values
# scale alpha to match volatility
if u is not None and self.scale_alpha:
u = u / u.std()
u *= np.mean(np.diag(S)) ** 0.5
# scale return to match volatility
if r is not None and self.scale_return:
r = r / r.std()
r *= np.sqrt(np.mean(np.diag(S)))
# optimize
w = self._optimize(S, u, w0)
w = self._optimize(S, r, w0)
# restore index if needed
if index is not None:
@@ -111,30 +111,30 @@ class PortfolioOptimizer(BaseOptimizer):
return w
def _optimize(self, S: np.ndarray, u: Optional[np.ndarray] = None, w0: Optional[np.ndarray] = None) -> np.ndarray:
def _optimize(self, S: np.ndarray, r: Optional[np.ndarray] = None, w0: Optional[np.ndarray] = None) -> np.ndarray:
# inverse volatility
if self.method == self.OPT_INV:
if u is not None:
warnings.warn("`u` is set but will not be used for `inv` portfolio")
if r is not None:
warnings.warn("`r` is set but will not be used for `inv` portfolio")
if w0 is not None:
warnings.warn("`w0` is set but will not be used for `inv` portfolio")
return self._optimize_inv(S)
# global minimum variance
if self.method == self.OPT_GMV:
if u is not None:
warnings.warn("`u` is set but will not be used for `gmv` portfolio")
if r is not None:
warnings.warn("`r` is set but will not be used for `gmv` portfolio")
return self._optimize_gmv(S, w0)
# mean-variance
if self.method == self.OPT_MVO:
return self._optimize_mvo(S, u, w0)
return self._optimize_mvo(S, r, w0)
# risk parity
if self.method == self.OPT_RP:
if u is not None:
warnings.warn("`u` is set but will not be used for `rp` portfolio")
if r is not None:
warnings.warn("`r` is set but will not be used for `rp` portfolio")
return self._optimize_rp(S, w0)
def _optimize_inv(self, S: np.ndarray) -> np.ndarray:
@@ -155,17 +155,17 @@ class PortfolioOptimizer(BaseOptimizer):
return self._solve(len(S), self._get_objective_gmv(S), *self._get_constrains(w0))
def _optimize_mvo(
self, S: np.ndarray, u: Optional[np.ndarray] = None, w0: Optional[np.ndarray] = None
self, S: np.ndarray, r: Optional[np.ndarray] = None, w0: Optional[np.ndarray] = None
) -> np.ndarray:
"""optimize mean-variance portfolio
This method solves the following optimization problem
min_w - w' u + lamb * w' S w
min_w - w' r + lamb * w' S w
s.t. w >= 0, sum(w) == 1
where `S` is the covariance matrix, `r` is the expected returns,
and `lamb` is the risk aversion parameter.
"""
return self._solve(len(S), self._get_objective_mvo(S, u), *self._get_constrains(w0))
return self._solve(len(S), self._get_objective_mvo(S, r), *self._get_constrains(w0))
def _optimize_rp(self, S: np.ndarray, w0: Optional[np.ndarray] = None) -> np.ndarray:
"""optimize risk parity portfolio
@@ -189,16 +189,16 @@ class PortfolioOptimizer(BaseOptimizer):
return func
def _get_objective_mvo(self, S: np.ndarray, u: np.ndarray = None) -> Callable:
def _get_objective_mvo(self, S: np.ndarray, r: np.ndarray = None) -> Callable:
"""mean-variance optimization objective
Optimization objective
min_w - w' u + lamb * w' S w
min_w - w' r + lamb * w' S w
"""
def func(x):
risk = x @ S @ x
ret = x @ u
ret = x @ r
return -ret + self.lamb * risk
return func
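The "scale return to match volatility" step in `__call__` above can be illustrated in isolation. This is a minimal sketch using only the standard library; the function name `scale_return` and the sample numbers are hypothetical. It assumes the population standard deviation, which is what `ndarray.std()` computes by default.

```python
import math
import statistics


def scale_return(r, S):
    """Rescale expected returns so their spread matches the average asset volatility.

    r: list of expected returns
    S: covariance matrix as a list of lists
    """
    # square root of the mean of the diagonal of S, i.e. np.sqrt(np.mean(np.diag(S)))
    target = math.sqrt(statistics.fmean(S[i][i] for i in range(len(S))))
    # population standard deviation (ddof=0), matching numpy's default
    std = statistics.pstdev(r)
    return [x / std * target for x in r]


# Hypothetical raw scores and a diagonal covariance matrix.
scores = [1.0, 2.0, 3.0]
cov = [[0.04, 0.0, 0.0], [0.0, 0.09, 0.0], [0.0, 0.0, 0.02]]
scaled = scale_return(scores, cov)
print(scaled)
```

After scaling, the ordering of the scores is unchanged but their standard deviation equals the square root of the mean variance, so the return term and the risk term in the mean-variance objective are on comparable scales. The sketch does not guard against a zero standard deviation (all scores equal).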


@@ -24,7 +24,7 @@ class TWAPStrategy(BaseStrategy):
NOTE:
- This TWAP strategy will ceiling round when trading. This will make the TWAP trading strategy produce the order
ealier when the total trade unit of amount is less than the trading step
earlier when the total trade unit of amount is less than the trading step
"""
def reset(self, outer_trade_decision: BaseTradeDecision = None, **kwargs):
@@ -43,8 +43,8 @@ class TWAPStrategy(BaseStrategy):
def generate_trade_decision(self, execute_result=None):
# NOTE: corner cases!!!
# - If using upperbound round, please don't sell the amount which should be sold in the next step
# - the coordinate of the amount between steps is hard to be dealed between steps in the same level. It
# is easier to be dealed in upper steps
# - the coordinate of the amount between steps is hard to be dealt between steps in the same level. It
# is easier to be dealt in upper steps
# strategy is not available. Give an empty decision
if len(self.outer_trade_decision.get_decision()) == 0:


@@ -1,70 +1,49 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import os
import copy
from qlib.backtest.signal import Signal, create_signal_from
from typing import Dict, List, Text, Tuple, Union
from qlib.data.dataset import Dataset
from qlib.model.base import BaseModel
from qlib.backtest.position import Position
import warnings
import cvxpy as cp
import numpy as np
import pandas as pd
from ...utils.resam import resam_ts_data
from ...strategy.base import BaseStrategy
from ...backtest.decision import Order, BaseTradeDecision, OrderDir, TradeDecisionWO
from typing import Dict, List, Text, Tuple, Union
from .order_generator import OrderGenWInteract
from qlib.data import D
from qlib.data.dataset import Dataset
from qlib.model.base import BaseModel
from qlib.strategy.base import BaseStrategy
from qlib.backtest.position import Position
from qlib.backtest.signal import Signal, create_signal_from
from qlib.backtest.decision import Order, BaseTradeDecision, OrderDir, TradeDecisionWO
from qlib.log import get_module_logger
from qlib.utils import get_pre_trading_date, load_dataset
from qlib.utils.resam import resam_ts_data
from qlib.contrib.strategy.order_generator import OrderGenWInteract, OrderGenWOInteract
from qlib.contrib.strategy.optimizer import EnhancedIndexingOptimizer
class TopkDropoutStrategy(BaseStrategy):
# TODO:
# 1. Supporting leverage the get_range_limit result from the decision
# 2. Supporting alter_outer_trade_decision
# 3. Supporting checking the availability of trade decision
class BaseSignalStrategy(BaseStrategy):
def __init__(
self,
*,
topk,
n_drop,
signal: Union[Signal, Tuple[BaseModel, Dataset], List, Dict, Text, pd.Series, pd.DataFrame] = None,
method_sell="bottom",
method_buy="top",
risk_degree=0.95,
hold_thresh=1,
only_tradable=False,
model=None,
dataset=None,
risk_degree: float = 0.95,
trade_exchange=None,
level_infra=None,
common_infra=None,
model=None,
dataset=None,
**kwargs,
):
"""
Parameters
-----------
topk : int
the number of stocks in the portfolio.
n_drop : int
number of stocks to be replaced in each trading date.
signal :
the information to describe a signal. Please refer to the docs of `qlib.backtest.signal.create_signal_from`
the decision of the strategy will base on the given signal
method_sell : str
dropout method_sell, random/bottom.
method_buy : str
dropout method_buy, random/top.
risk_degree : float
position percentage of total value.
hold_thresh : int
minimum holding days
before sell stock , will check current.get_stock_count(order.stock_id) >= self.hold_thresh.
only_tradable : bool
will the strategy only consider the tradable stock when buying and selling.
if only_tradable:
strategy will make buy sell decision without checking the tradable state of the stock.
else:
strategy will make decision with the tradable state of the stock info and avoid buy and sell them.
trade_exchange : Exchange
exchange that provides market info, used to deal order and generate report
- If `trade_exchange` is None, self.trade_exchange will be set with common_infra
@@ -74,16 +53,9 @@ class TopkDropoutStrategy(BaseStrategy):
- In minutely execution, the daily exchange is not usable, only the minutely exchange is recommended.
"""
super(TopkDropoutStrategy, self).__init__(
level_infra=level_infra, common_infra=common_infra, trade_exchange=trade_exchange, **kwargs
)
self.topk = topk
self.n_drop = n_drop
self.method_sell = method_sell
self.method_buy = method_buy
super().__init__(level_infra=level_infra, common_infra=common_infra, trade_exchange=trade_exchange, **kwargs)
self.risk_degree = risk_degree
self.hold_thresh = hold_thresh
self.only_tradable = only_tradable
# This is trying to be compatible with previous version of qlib task config
if model is not None and dataset is not None:
@@ -97,15 +69,65 @@ class TopkDropoutStrategy(BaseStrategy):
Return the proportion of your total value you will use in investment.
A dynamic risk_degree will result in market timing.
"""
# It will use 95% amoutn of your total value by default
# It will use 95% amount of your total value by default
return self.risk_degree
class TopkDropoutStrategy(BaseSignalStrategy):
# TODO:
# 1. Supporting leverage the get_range_limit result from the decision
# 2. Supporting alter_outer_trade_decision
# 3. Supporting checking the availability of trade decision
def __init__(
self,
*,
topk,
n_drop,
method_sell="bottom",
method_buy="top",
hold_thresh=1,
only_tradable=False,
**kwargs,
):
"""
Parameters
-----------
topk : int
the number of stocks in the portfolio.
n_drop : int
number of stocks to be replaced in each trading date.
method_sell : str
dropout method_sell, random/bottom.
method_buy : str
dropout method_buy, random/top.
hold_thresh : int
minimum holding days
before selling a stock, the strategy will check current.get_stock_count(order.stock_id) >= self.hold_thresh.
only_tradable : bool
will the strategy only consider the tradable stock when buying and selling.
if only_tradable:
strategy will make decision with the tradable state of the stock info and avoid buying and selling untradable stocks.
else:
strategy will make buy and sell decisions without checking the tradable state of the stock.
"""
super().__init__(**kwargs)
self.topk = topk
self.n_drop = n_drop
self.method_sell = method_sell
self.method_buy = method_buy
self.hold_thresh = hold_thresh
self.only_tradable = only_tradable
def generate_trade_decision(self, execute_result=None):
# get the number of trading step finished, trade_step can be [0, 1, 2, ..., trade_len - 1]
trade_step = self.trade_calendar.get_trade_step()
trade_start_time, trade_end_time = self.trade_calendar.get_step_time(trade_step)
pred_start_time, pred_end_time = self.trade_calendar.get_step_time(trade_step, shift=1)
pred_score = self.signal.get_signal(start_time=pred_start_time, end_time=pred_end_time)
# NOTE: the current version of topk dropout strategy can't handle pd.DataFrame (multiple signals)
# So it only leverages the first column of the signal
if isinstance(pred_score, pd.DataFrame):
pred_score = pred_score.iloc[:, 0]
if pred_score is None:
return TradeDecisionWO([], self)
if self.only_tradable:
@@ -253,7 +275,7 @@ class TopkDropoutStrategy(BaseStrategy):
return TradeDecisionWO(sell_order_list + buy_order_list, self)
class WeightStrategyBase(BaseStrategy):
class WeightStrategyBase(BaseSignalStrategy):
# TODO:
# 1. Supporting leverage the get_range_limit result from the decision
# 2. Supporting alter_outer_trade_decision
@@ -261,11 +283,7 @@ class WeightStrategyBase(BaseStrategy):
def __init__(
self,
*,
signal: Union[Signal, Tuple[BaseModel, Dataset], List, Dict, Text, pd.Series, pd.DataFrame],
order_generator_cls_or_obj=OrderGenWInteract,
trade_exchange=None,
level_infra=None,
common_infra=None,
order_generator_cls_or_obj=OrderGenWOInteract,
**kwargs,
):
"""
@@ -280,24 +298,13 @@ class WeightStrategyBase(BaseStrategy):
- In daily execution, both daily exchange and minutely are usable, but the daily exchange is recommended because it runs faster.
- In minutely execution, the daily exchange is not usable, only the minutely exchange is recommended.
"""
super(WeightStrategyBase, self).__init__(
level_infra=level_infra, common_infra=common_infra, trade_exchange=trade_exchange, **kwargs
)
super().__init__(**kwargs)
if isinstance(order_generator_cls_or_obj, type):
self.order_generator = order_generator_cls_or_obj()
else:
self.order_generator = order_generator_cls_or_obj
self.signal: Signal = create_signal_from(signal)
def get_risk_degree(self, trade_step=None):
"""get_risk_degree
Return the proportion of your total value you will used in investment.
Dynamically risk_degree will result in Market timing.
"""
# It will use 95% amoutn of your total value by default
return 0.95
def generate_target_weight_position(self, score, current, trade_start_time, trade_end_time):
"""
Generate target position from score for this date and the current position. The cash is not considered in the position
@@ -341,3 +348,154 @@ class WeightStrategyBase(BaseStrategy):
trade_end_time=trade_end_time,
)
return TradeDecisionWO(order_list, self)
class EnhancedIndexingStrategy(WeightStrategyBase):
"""Enhanced Indexing Strategy
Enhanced indexing combines the arts of active management and passive management,
with the aim of outperforming a benchmark index (e.g., S&P 500) in terms of
portfolio return while controlling the risk exposure (a.k.a. tracking error).
Users need to prepare their risk model data like below:
├── /path/to/riskmodel
├──── 20210101
├────── factor_exp.{csv|pkl|h5}
├────── factor_cov.{csv|pkl|h5}
├────── specific_risk.{csv|pkl|h5}
├────── blacklist.{csv|pkl|h5} # optional
The risk model data can be obtained from risk data provider. You can also use
`qlib.model.riskmodel.structured.StructuredCovEstimator` to prepare these data.
Args:
riskmodel_root (str): risk model path
name_mapping (dict): alternative file names
"""
FACTOR_EXP_NAME = "factor_exp.pkl"
FACTOR_COV_NAME = "factor_cov.pkl"
SPECIFIC_RISK_NAME = "specific_risk.pkl"
BLACKLIST_NAME = "blacklist.pkl"
def __init__(
self,
*,
riskmodel_root,
market="csi500",
turn_limit=None,
name_mapping={},
optimizer_kwargs={},
verbose=False,
**kwargs,
):
super().__init__(**kwargs)
self.logger = get_module_logger("EnhancedIndexingStrategy")
self.riskmodel_root = riskmodel_root
self.market = market
self.turn_limit = turn_limit
self.factor_exp_path = name_mapping.get("factor_exp", self.FACTOR_EXP_NAME)
self.factor_cov_path = name_mapping.get("factor_cov", self.FACTOR_COV_NAME)
self.specific_risk_path = name_mapping.get("specific_risk", self.SPECIFIC_RISK_NAME)
self.blacklist_path = name_mapping.get("blacklist", self.BLACKLIST_NAME)
self.optimizer = EnhancedIndexingOptimizer(**optimizer_kwargs)
self.verbose = verbose
self._riskdata_cache = {}
def get_risk_data(self, date):
if date in self._riskdata_cache:
return self._riskdata_cache[date]
root = self.riskmodel_root + "/" + date.strftime("%Y%m%d")
if not os.path.exists(root):
return None
factor_exp = load_dataset(root + "/" + self.factor_exp_path, index_col=[0])
factor_cov = load_dataset(root + "/" + self.factor_cov_path, index_col=[0])
specific_risk = load_dataset(root + "/" + self.specific_risk_path, index_col=[0])
if not factor_exp.index.equals(specific_risk.index):
# NOTE: for stocks missing specific_risk, we always assume they have the highest volatility
specific_risk = specific_risk.reindex(factor_exp.index, fill_value=specific_risk.max())
universe = factor_exp.index.tolist()
blacklist = []
if os.path.exists(root + "/" + self.blacklist_path):
blacklist = load_dataset(root + "/" + self.blacklist_path).index.tolist()
self._riskdata_cache[date] = factor_exp.values, factor_cov.values, specific_risk.values, universe, blacklist
return self._riskdata_cache[date]
def generate_target_weight_position(self, score, current, trade_start_time, trade_end_time):
trade_date = trade_start_time
pre_date = get_pre_trading_date(trade_date, future=True) # previous trade date
# load risk data
outs = self.get_risk_data(pre_date)
if outs is None:
self.logger.warning(f"no risk data for {pre_date:%Y-%m-%d}, skip optimization")
return None
factor_exp, factor_cov, specific_risk, universe, blacklist = outs
# transform score
# NOTE: for stocks missing score, we always assume they have the lowest score
score = score.reindex(universe).fillna(score.min()).values
# get current weight
# NOTE: if a stock is not in universe, its current weight will be zero
cur_weight = current.get_stock_weight_dict(only_stock=False)
cur_weight = np.array([cur_weight.get(stock, 0) for stock in universe])
assert all(cur_weight >= 0), "current weight has negative values"
cur_weight = cur_weight / self.get_risk_degree(trade_date) # sum of weight should be risk_degree
if cur_weight.sum() > 1 and self.verbose:
self.logger.warning(f"previous total holdings exceed risk degree (current: {cur_weight.sum()})")
# load bench weight
bench_weight = D.features(
D.instruments("all"), [f"${self.market}_weight"], start_time=pre_date, end_time=pre_date
).squeeze()
bench_weight.index = bench_weight.index.droplevel(level="datetime")
bench_weight = bench_weight.reindex(universe).fillna(0).values
# whether stock tradable
# NOTE: currently we use last day volume to check whether tradable
tradable = D.features(D.instruments("all"), ["$volume"], start_time=pre_date, end_time=pre_date).squeeze()
tradable.index = tradable.index.droplevel(level="datetime")
tradable = tradable.reindex(universe).gt(0).values
mask_force_hold = ~tradable
# mask force sell
mask_force_sell = np.array([stock in blacklist for stock in universe], dtype=bool)
# optimize
weight = self.optimizer(
r=score,
F=factor_exp,
cov_b=factor_cov,
var_u=specific_risk ** 2,
w0=cur_weight,
wb=bench_weight,
mfh=mask_force_hold,
mfs=mask_force_sell,
)
target_weight_position = {stock: weight for stock, weight in zip(universe, weight) if weight > 0}
if self.verbose:
self.logger.info("trade date: {:%Y-%m-%d}".format(trade_date))
self.logger.info("number of holding stocks: {}".format(len(target_weight_position)))
self.logger.info("total holding weight: {:.6f}".format(weight.sum()))
return target_weight_position
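The refactor in this file moves the shared `signal` and `risk_degree` handling into `BaseSignalStrategy`, and each subclass keeps only its own keyword-only arguments while forwarding the rest through `**kwargs`. A minimal sketch of that pattern follows; the `*Sketch` class names are hypothetical stand-ins and none of qlib's real signal or exchange logic is reproduced.

```python
class BaseStrategySketch:
    """Stand-in for the infrastructure-level base class."""

    def __init__(self, *, trade_exchange=None, level_infra=None, common_infra=None):
        self.trade_exchange = trade_exchange
        self.level_infra = level_infra
        self.common_infra = common_infra


class SignalStrategySketch(BaseStrategySketch):
    """Owns the arguments shared by all signal-driven strategies."""

    def __init__(self, *, signal=None, risk_degree=0.95, **kwargs):
        super().__init__(**kwargs)  # infrastructure arguments pass through untouched
        self.signal = signal
        self.risk_degree = risk_degree

    def get_risk_degree(self, trade_step=None):
        # Defined once here instead of being duplicated in every subclass.
        return self.risk_degree


class TopkSketch(SignalStrategySketch):
    """Keeps only the arguments specific to this strategy."""

    def __init__(self, *, topk, n_drop, **kwargs):
        super().__init__(**kwargs)
        self.topk = topk
        self.n_drop = n_drop


strategy = TopkSketch(topk=50, n_drop=5, signal="score", risk_degree=0.9)
print(strategy.get_risk_degree())
```

Because every layer is keyword-only and the bottom layer accepts no `**kwargs`, a misspelled argument in this sketch fails with a `TypeError` at construction time instead of being silently ignored.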

qlib/contrib/torch.py Normal file

@@ -0,0 +1,31 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
"""
This module is not a necessary part of Qlib.
It is just a collection of tools for convenience.
It should not be imported into the core part of qlib
"""
import torch
import numpy as np
import pandas as pd
def data_to_tensor(data, device="cpu", raise_error=False):
if isinstance(data, torch.Tensor):
if device == "cpu":
return data.cpu()
else:
return data.to(device)
if isinstance(data, (pd.DataFrame, pd.Series)):
return data_to_tensor(torch.from_numpy(data.values).float(), device)
elif isinstance(data, np.ndarray):
return data_to_tensor(torch.from_numpy(data).float(), device)
elif isinstance(data, (tuple, list)):
return [data_to_tensor(i, device) for i in data]
elif isinstance(data, dict):
return {k: data_to_tensor(v, device) for k, v in data.items()}
else:
if raise_error:
raise ValueError(f"Unsupported data type: {type(data)}.")
else:
return data
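The recursion in `data_to_tensor` (convert leaves, rebuild lists and dicts around them) can be tried without installing torch. The following is a simplified analog, not the qlib function: `convert_nested`, `leaf_types` and `convert` are hypothetical names, and a plain `float` conversion stands in for the tensor conversion. Unlike the function above, this sketch forwards `raise_error` into the recursive calls.

```python
def convert_nested(data, leaf_types, convert, raise_error=False):
    """Apply `convert` to every leaf of type `leaf_types` inside nested containers."""
    if isinstance(data, leaf_types):
        return convert(data)
    if isinstance(data, (tuple, list)):
        # As in data_to_tensor, tuples come back as lists.
        return [convert_nested(i, leaf_types, convert, raise_error) for i in data]
    if isinstance(data, dict):
        return {k: convert_nested(v, leaf_types, convert, raise_error) for k, v in data.items()}
    if raise_error:
        raise ValueError(f"Unsupported data type: {type(data)}.")
    # Unsupported leaves are returned unchanged when raise_error is False.
    return data


batch = {"feature": [1, (2, 3)], "name": "x"}
print(convert_nested(batch, int, float))
```

With `raise_error=False` the string under `"name"` passes through untouched, which matches the permissive default of the helper above; passing `raise_error=True` turns the same input into a `ValueError`.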


@@ -46,6 +46,7 @@ class Tuner:
space=self.space,
algo=tpe.suggest,
max_evals=self.max_evals,
show_progressbar=False,
)
self.logger.info("Local best params: {} ".format(self.best_params))
TimeInspector.log_cost_time(
@@ -89,7 +90,7 @@ class QLibTuner(Tuner):
def objective(self, params):
# 1. Setup an config for a spcific estimator process
# 1. Set up a config for a specific estimator process
estimator_path = self.setup_estimator_config(params)
self.logger.info("Searching params: {} ".format(params))


@@ -359,7 +359,7 @@ class ExpressionCache(BaseProviderCache):
def update(self, cache_uri: Union[str, Path], freq: str = "day"):
"""Update expression cache to latest calendar.
Overide this method to define how to update expression cache corresponding to users' own cache mechanism.
Override this method to define how to update expression cache corresponding to users' own cache mechanism.
Parameters
----------
@@ -445,7 +445,7 @@ class DatasetCache(BaseProviderCache):
def update(self, cache_uri: Union[str, Path], freq: str = "day"):
"""Update dataset cache to latest calendar.
Overide this method to define how to update dataset cache corresponding to users' own cache mechanism.
Override this method to define how to update dataset cache corresponding to users' own cache mechanism.
Parameters
----------
@@ -543,7 +543,7 @@ class DiskExpressionCache(ExpressionCache):
# instance
series = self.provider.expression(instrument, field, _calendar[0], _calendar[-1], freq)
if not series.empty:
# This expresion is empty, we don't generate any cache for it.
# This expression is empty, we don't generate any cache for it.
with CacheUtils.writer_lock(self.r, f"{str(C.dpm.get_data_uri(freq))}:expression-{_cache_uri}"):
self.gen_expression_cache(
expression_data=series,
@@ -858,7 +858,7 @@ class DiskDatasetCache(DatasetCache):
"""gen_dataset_cache
.. note:: This function does not consider the cache read write lock. Please
Aquire the lock outside this function
Acquire the lock outside this function
The format the cache contains 3 parts(followed by typical filename).
@@ -1035,7 +1035,7 @@ class DiskDatasetCache(DatasetCache):
# FIXME:
# Because the feature cache are stored as .bin file.
# So the series read from features are all float32.
# However, the first dataset cache is calulated based on the
# However, the first dataset cache is calculated based on the
# raw data. So the data type may be float64.
# Different data type will result in failure of appending data
if "/{}".format(DatasetCache.HDF_KEY) in store.keys():


@@ -58,7 +58,7 @@ class Client:
msg_proc_func : func
the function to process the message when receiving response, should have arg `*args`.
msg_queue: Queue
The queue to pass the messsage after callback.
The queue to pass the message after callback.
"""
head_info = {"version": qlib.__version__}
