mirror of
https://github.com/microsoft/qlib.git
synced 2026-06-06 05:51:17 +08:00
DDG-DA paper code (#743)
* Merge data selection to main
* Update trainer for reweighter
* Typos fixed.
* update data selection interface
* successfully run exp after refactor some interface
* data selection share handler & trainer
* fix meta model time series bug
* fix online workflow set_uri bug
* fix set_uri bug
* updawte ds docs and delay trainer bug
* docs
* resume reweighter
* add reweighting result
* fix qlib model import
* make recorder more friendly
* fix experiment workflow bug
* commit for merging master incase of conflictions
* Successful run DDG-DA with a single command
* remove unused code
* asdd more docs
* Update README.md
* Update & fix some bugs.
* Update configuration & remove debug functions
* Update README.md
* Modfify horizon from code rather than yaml
* Update performance in README.md
* fix part comments
* Remove unfinished TCTS.
* Fix some details.
* Update meta docs
* Update README.md of the benchmarks_dynamic
* Update README.md files
* Add README.md to the rolling_benchmark baseline.
* Refine the docs and link
* Rename README.md in benchmarks_dynamic.
* Remove comments.
* auto download data

Co-authored-by: wendili-cs <wendili.academic@qq.com>
Co-authored-by: demon143 <785696300@qq.com>
33
README.md
@@ -11,6 +11,7 @@
|
||||
Recent released features
|
||||
| Feature | Status |
|
||||
| -- | ------ |
|
||||
| Meta-Learning-based framework & DDG-DA | [Released](https://github.com/microsoft/qlib/pull/743) on Jan 10, 2022 |
|
||||
| Planning-based portfolio optimization | [Released](https://github.com/microsoft/qlib/pull/754) on Dec 28, 2021 |
|
||||
| Release Qlib v0.8.0 | [Released](https://github.com/microsoft/qlib/releases/tag/v0.8.0) on Dec 8, 2021 |
|
||||
| ADD model | [Released](https://github.com/microsoft/qlib/pull/704) on Nov 22, 2021 |
|
||||
@@ -50,9 +51,12 @@ For more details, please refer to our paper ["Qlib: An AI-oriented Quantitative
|
||||
- [Data Preparation](#data-preparation)
|
||||
- [Auto Quant Research Workflow](#auto-quant-research-workflow)
|
||||
- [Building Customized Quant Research Workflow by Code](#building-customized-quant-research-workflow-by-code)
|
||||
- [Main Challenges & Solutions in Quant Research](#main-challenges--solutions-in-quant-research)
|
||||
- [Forecasting: Finding Valuable Signals/Patterns](#forecasting-finding-valuable-signalspatterns)
|
||||
- [**Quant Model (Paper) Zoo**](#quant-model-paper-zoo)
|
||||
- [Run a single model](#run-a-single-model)
|
||||
- [Run multiple models](#run-multiple-models)
|
||||
- [Run a Single Model](#run-a-single-model)
|
||||
- [Run Multiple Models](#run-multiple-models)
|
||||
- [Adapting to Market Dynamics](#adapting-to-market-dynamics)
|
||||
- [**Quant Dataset Zoo**](#quant-dataset-zoo)
|
||||
- [More About Qlib](#more-about-qlib)
|
||||
- [Offline Mode and Online Mode](#offline-mode-and-online-mode)
|
||||
@@ -69,7 +73,6 @@ Your feedbacks about the features are very important.
|
||||
| -- | ------ |
|
||||
| Point-in-Time database | Under review: https://github.com/microsoft/qlib/pull/343 |
|
||||
| Orderbook database | Under review: https://github.com/microsoft/qlib/pull/744 |
|
||||
| Meta-Learning-based data selection | Under review: https://github.com/microsoft/qlib/pull/743 |
|
||||
|
||||
# Framework of Qlib
|
||||
|
||||
@@ -280,8 +283,18 @@ Qlib provides a tool named `qrun` to run the whole workflow automatically (inclu
|
||||
## Building Customized Quant Research Workflow by Code
|
||||
The automatic workflow may not suit the research workflow of all Quant researchers. To support a flexible Quant research workflow, Qlib also provides a modularized interface to allow researchers to build their own workflow by code. [Here](examples/workflow_by_code.ipynb) is a demo for customized Quant research workflow by code.
|
||||
|
||||
# Main Challenges & Solutions in Quant Research
|
||||
Quant investment is a unique scenario with many key challenges to be solved.
|
||||
Currently, Qlib provides some solutions for several of them.
|
||||
|
||||
# [Quant Model (Paper) Zoo](examples/benchmarks)
|
||||
## Forecasting: Finding Valuable Signals/Patterns
|
||||
Accurate forecasting of the stock price trend is a very important part of constructing profitable portfolios.
|
||||
However, the huge amount of data in various formats in the financial market makes it challenging to build forecasting models.
|
||||
|
||||
An increasing number of SOTA Quant research works/papers, which focus on building forecasting models to mine valuable signals/patterns in complex financial data, have been released in `Qlib`.
|
||||
|
||||
|
||||
### [Quant Model (Paper) Zoo](examples/benchmarks)
|
||||
|
||||
Here is a list of models built on `Qlib`.
|
||||
- [GBDT based on XGBoost (Tianqi Chen, et al. KDD 2016)](examples/benchmarks/XGBoost/)
|
||||
@@ -308,7 +321,7 @@ Your PR of new Quant models is highly welcomed.
|
||||
|
||||
The performance of each model on the `Alpha158` and `Alpha360` dataset can be found [here](examples/benchmarks/README.md).
|
||||
|
||||
## Run a single model
|
||||
### Run a single model
|
||||
All the models listed above are runnable with ``Qlib``. Users can find the config files we provide and some details about the model through the [benchmarks](examples/benchmarks) folder. More information can be retrieved at the model files listed above.
|
||||
|
||||
`Qlib` provides three different ways to run a single model, users can pick the one that fits their cases best:
|
||||
@@ -318,7 +331,7 @@ All the models listed above are runnable with ``Qlib``. Users can find the confi
|
||||
- Users can use the script [`run_all_model.py`](examples/run_all_model.py) listed in the `examples` folder to run a model. Here is an example of the specific shell command to be used: `python run_all_model.py run --models=lightgbm`, where the `--models` arguments can take any number of models listed above(the available models can be found in [benchmarks](examples/benchmarks/)). For more use cases, please refer to the file's [docstrings](examples/run_all_model.py).
|
||||
- **NOTE**: Each baseline has different environment dependencies, please make sure that your python version aligns with the requirements(e.g. TFT only supports Python 3.6~3.7 due to the limitation of `tensorflow==1.15.0`)
|
||||
|
||||
## Run multiple models
|
||||
### Run multiple models
|
||||
`Qlib` also provides a script [`run_all_model.py`](examples/run_all_model.py) which can run multiple models for several iterations. (**Note**: the script only supports *Linux* for now. Other operating systems will be supported in the future. It also does not yet support running the same model multiple times in parallel; this will be fixed in future development as well.)
|
||||
|
||||
The script will create a unique virtual environment for each model, and delete the environments after training. Thus, only experiment results such as `IC` and `backtest` results will be generated and stored.
|
||||
@@ -330,6 +343,14 @@ python run_all_model.py run 10
|
||||
|
||||
It also provides the API to run specific models at once. For more use cases, please refer to the file's [docstrings](examples/run_all_model.py).
|
||||
|
||||
## [Adapting to Market Dynamics](examples/benchmarks_dynamic)
|
||||
|
||||
Due to the non-stationary nature of the financial market environment, the data distribution may change across periods, which makes the performance of models built on training data decay on future test data.
|
||||
Therefore, adapting the forecasting models/strategies to market dynamics is very important to their performance.
|
||||
|
||||
Here is a list of solutions built on `Qlib`.
|
||||
- [Rolling Retraining](examples/benchmarks_dynamic/baseline/)
|
||||
- [DDG-DA on pytorch (Wendi, et al. AAAI 2022)](examples/benchmarks_dynamic/DDG-DA/)
|
||||
|
||||
# Quant Dataset Zoo
|
||||
Dataset plays a very important role in Quant. Here is a list of the datasets built on `Qlib`:
|
||||
|
||||
68
docs/component/meta.rst
Normal file
@@ -0,0 +1,68 @@
.. _meta:

======================================================
Meta Controller: Meta-Task & Meta-Dataset & Meta-Model
======================================================
.. currentmodule:: qlib


Introduction
=============
``Meta Controller`` provides guidance to ``Forecast Model``. It aims to learn regular patterns among a series of forecasting tasks and use the learned patterns to guide forthcoming forecasting tasks. Users can implement their own meta-model instance based on the ``Meta Controller`` module.

Meta Task
=============

A `Meta Task` instance is the basic element in the meta-learning framework. It saves the data that can be used for the `Meta Model`. Multiple `Meta Task` instances may share the same `Data Handler`, controlled by `Meta Dataset`. Users should use `prepare_task_data()` to obtain the data that can be directly fed into the `Meta Model`.

.. autoclass:: qlib.model.meta.task.MetaTask
    :members:

Meta Dataset
=============

`Meta Dataset` controls the meta-information generating process. It is responsible for providing data for training the `Meta Model`. Users should use `prepare_tasks` to retrieve a list of `Meta Task` instances.

.. autoclass:: qlib.model.meta.dataset.MetaTaskDataset
    :members:

Meta Model
=============

General Meta Model
------------------
A `Meta Model` instance is the part that controls the workflow. The usage of the `Meta Model` includes:

1. Users train their `Meta Model` with the `fit` function.
2. The `Meta Model` instance guides the workflow by giving useful information via the `inference` function.

.. autoclass:: qlib.model.meta.model.MetaModel
    :members:

Meta Task Model
------------------
This type of meta-model may interact with task definitions directly, and `Meta Task Model` is the class for such models to inherit from. They guide the base tasks by modifying the base task definitions. The function `prepare_tasks` can be used to obtain the modified base task definitions.

.. autoclass:: qlib.model.meta.model.MetaTaskModel
    :members:

Meta Guide Model
------------------
This type of meta-model participates in the training process of the base forecasting model. The meta-model may guide the base forecasting models during their training to improve their performances.

.. autoclass:: qlib.model.meta.model.MetaGuideModel
    :members:


Example
=============
``Qlib`` provides an implementation of the ``Meta Model`` module, ``DDG-DA``,
which adapts to the market dynamics.

``DDG-DA`` includes four steps:

1. Calculate meta-information and encapsulate it into ``Meta Task`` instances. All the meta-tasks form a ``Meta Dataset`` instance.
2. Train ``DDG-DA`` based on the training data of the meta-dataset.
3. Do the inference of the ``DDG-DA`` to get guide information.
4. Apply guide information to the forecasting models to improve their performances.

The `above example <https://github.com/microsoft/qlib/tree/main/examples/benchmarks_dynamic/DDG-DA>`_ can be found in ``examples/benchmarks_dynamic/DDG-DA/workflow.py``.
|
||||
@@ -40,6 +40,7 @@ Document Structure
|
||||
Forecast Model: Model Training & Prediction <component/model.rst>
|
||||
Portfolio Management and Backtest <component/strategy.rst>
|
||||
Nested Decision Execution: High-Frequency Trading <component/highfreq.rst>
|
||||
Meta Controller: Meta-Task & Meta-Dataset & Meta-Model <component/meta.rst>
|
||||
Qlib Recorder: Experiment Management <component/recorder.rst>
|
||||
Analysis: Evaluation & Results Analysis <component/report.rst>
|
||||
Online Serving: Online Management & Strategy & Tool <component/online.rst>
|
||||
|
||||
@@ -22,7 +22,6 @@ data_handler_config: &data_handler_config
|
||||
- class: CSRankNorm
|
||||
kwargs:
|
||||
fields_group: label
|
||||
label: ["Ref($close, -2) / Ref($close, -1) - 1"]
|
||||
port_analysis_config: &port_analysis_config
|
||||
strategy:
|
||||
class: TopkDropoutStrategy
|
||||
|
||||
@@ -209,7 +209,6 @@ class TFTModel(ModelFT):
|
||||
fixed_params = self.data_formatter.get_experiment_params()
|
||||
params = self.data_formatter.get_default_model_params()
|
||||
|
||||
# Wendi: merge the tuned and non-tuned parameters
|
||||
params = {**params, **fixed_params}
|
||||
|
||||
if not os.path.exists(self.model_folder):
|
||||
|
||||
27
examples/benchmarks_dynamic/DDG-DA/README.md
Normal file
@@ -0,0 +1,27 @@
# Introduction
This is the implementation of `DDG-DA` based on the `Meta Controller` component provided by `Qlib`.

## Background
In many real-world scenarios, we often deal with streaming data that is sequentially collected over time. Due to the non-stationary nature of the environment, the streaming data distribution may change in unpredictable ways, which is known as concept drift. To handle concept drift, previous methods first detect when/where the concept drift happens and then adapt models to fit the distribution of the latest data. However, there are still many cases in which some underlying factors of environment evolution are predictable, making it possible to model the future concept drift trend of the streaming data; such cases have not been fully explored in previous work.

Therefore, we propose a novel method, `DDG-DA`, that can effectively forecast the evolution of data distribution and improve the performance of models. Specifically, we first train a predictor to estimate the future data distribution, then leverage it to generate training samples, and finally train models on the generated data.
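
The effect of training on reweighted historical data can be illustrated with a toy example. This is only a minimal sketch and not the `DDG-DA` implementation: it assumes that the estimate of the future distribution has already been turned into per-sample weights (the weight values below are made up), and `weighted_slope` is a hypothetical helper that fits a one-feature linear model without intercept.

```python
def weighted_slope(xs, ys, weights):
    """Fit y ~ b * x (no intercept) by weighted least squares and return b."""
    num = sum(w * x * y for x, y, w in zip(xs, ys, weights))
    den = sum(w * x * x for x, w in zip(xs, weights))
    return num / den


# Two historical periods whose feature/label relationship differs.
old_x, old_y = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]  # slope 1 in the older period
new_x, new_y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # slope 2 in the newer period
xs, ys = old_x + new_x, old_y + new_y

# Hypothetical weights: uniform vs. weights that favour the newer period.
uniform = [1.0] * 6
recent_heavy = [0.1] * 3 + [1.0] * 3

slope_uniform = weighted_slope(xs, ys, uniform)
slope_reweighted = weighted_slope(xs, ys, recent_heavy)
print(slope_uniform, slope_reweighted)
```

With uniform weights the fit averages both periods; with the second weight vector it moves toward the relationship of the newer period. In the real method the weights come from the learned meta model rather than being fixed by hand.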

## Dataset
The data in the paper are private, so we conduct experiments on Qlib's public dataset.
Though the dataset is different, the conclusion remains the same. By applying `DDG-DA`, users can see rising trends at the test phase both in the proxy models' ICs and the performances of the forecasting models.

## Run the Code
Users can try `DDG-DA` by running the following command:
```bash
python workflow.py run_all
```

The default forecasting model is `Linear`. Users can choose other forecasting models by changing the `forecast_model` parameter when `DDG-DA` initializes. For example, users can try `LightGBM` forecasting models by running the following command:
```bash
python workflow.py --forecast_model="gbdt" run_all
```


## Results

The results of other methods on Qlib's public dataset can be found [here](../).
1
examples/benchmarks_dynamic/DDG-DA/requirements.txt
Normal file
@@ -0,0 +1 @@
torch==1.10.0
258
examples/benchmarks_dynamic/DDG-DA/workflow.py
Normal file
@@ -0,0 +1,258 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from pathlib import Path
from qlib.model.meta.task import MetaTask
from qlib.contrib.meta.data_selection.model import MetaModelDS
from qlib.contrib.meta.data_selection.dataset import InternalData, MetaDatasetDS
from qlib.data.dataset.handler import DataHandlerLP

import pandas as pd
import fire
import sys
from tqdm.auto import tqdm
import yaml
import pickle
from qlib import auto_init
from qlib.model.trainer import TrainerR, task_train
from qlib.utils import init_instance_by_config
from qlib.workflow.task.gen import RollingGen, task_generator
from qlib.workflow import R
from qlib.tests.data import GetData

DIRNAME = Path(__file__).absolute().resolve().parent
sys.path.append(str(DIRNAME.parent / "baseline"))
from rolling_benchmark import RollingBenchmark  # NOTE: sys.path is changed for import RollingBenchmark


class DDGDA:
    """
    please run `python workflow.py run_all` to run the full workflow of the experiment

    **NOTE**
    before running the example, please clean your previous results with following command
    - `rm -r mlruns`
    """

    def __init__(self, sim_task_model="linear", forecast_model="linear"):
        self.step = 20
        # NOTE:
        # the horizon must match the meaning in the base task template
        self.horizon = 20
        self.meta_exp_name = "DDG-DA"
        self.sim_task_model = sim_task_model  # The model to capture the distribution of data.
        self.forecast_model = forecast_model  # downstream forecasting models' type

    def get_feature_importance(self):
        # this must be lightGBM, because it needs to get the feature importance
        rb = RollingBenchmark(model_type="gbdt")
        task = rb.basic_task()

        model = init_instance_by_config(task["model"])
        dataset = init_instance_by_config(task["dataset"])
        model.fit(dataset)

        fi = model.get_feature_importance()

        # Because the model uses numpy instead of dataframe for training lightgbm,
        # we must use the following extra steps to get the right feature importance
        df = dataset.prepare(segments=slice(None), col_set="feature", data_key=DataHandlerLP.DK_R)
        cols = df.columns
        fi_named = {cols[int(k.split("_")[1])]: imp for k, imp in fi.to_dict().items()}

        return pd.Series(fi_named)

    def dump_data_for_proxy_model(self):
        """
        Dump data for training meta model.
        The meta model will be trained upon the proxy forecasting model.
        This dataset is for the proxy forecasting model.
        """
        topk = 30
        fi = self.get_feature_importance()
        col_selected = fi.nlargest(topk)

        rb = RollingBenchmark(model_type=self.sim_task_model)
        task = rb.basic_task()
        dataset = init_instance_by_config(task["dataset"])
        prep_ds = dataset.prepare(slice(None), col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)

        feature_df = prep_ds["feature"]
        label_df = prep_ds["label"]

        feature_selected = feature_df.loc[:, col_selected.index]

        # cross-sectional z-score normalization within each trading day
        feature_selected = feature_selected.groupby("datetime").apply(lambda df: (df - df.mean()).div(df.std()))
        feature_selected = feature_selected.fillna(0.0)

        df_all = {
            "label": label_df.reindex(feature_selected.index),
            "feature": feature_selected,
        }
        df_all = pd.concat(df_all, axis=1)
        df_all.to_pickle(DIRNAME / "fea_label_df.pkl")

        # dump data in handler format for aligning the interface
        handler = DataHandlerLP(
            data_loader={
                "class": "qlib.data.dataset.loader.StaticDataLoader",
                "kwargs": {"config": DIRNAME / "fea_label_df.pkl"},
            }
        )
        handler.to_pickle(DIRNAME / "handler_proxy.pkl", dump_all=True)

    @property
    def _internal_data_path(self):
        return DIRNAME / f"internal_data_s{self.step}.pkl"

    def dump_meta_ipt(self):
        """
        Dump data for training meta model.
        This function will dump the input data for meta model
        """
        # According to the experiments, the choice of the model type is very important for achieving good results
        rb = RollingBenchmark(model_type=self.sim_task_model)
        sim_task = rb.basic_task()

        if self.sim_task_model == "gbdt":
            sim_task["model"].setdefault("kwargs", {}).update({"early_stopping_rounds": None, "num_boost_round": 150})

        exp_name_sim = f"data_sim_s{self.step}"

        internal_data = InternalData(sim_task, self.step, exp_name=exp_name_sim)
        internal_data.setup(trainer=TrainerR)

        with self._internal_data_path.open("wb") as f:
            pickle.dump(internal_data, f)

    def train_meta_model(self):
        """
        training a meta model based on a simplified linear proxy model;
        """

        # 1) leverage the simplified proxy forecasting model to train meta model.
        # - Only the dataset part is important, in current version of meta model will integrate the
        rb = RollingBenchmark(model_type=self.sim_task_model)
        sim_task = rb.basic_task()
        proxy_forecast_model_task = {
            # "model": "qlib.contrib.model.linear.LinearModel",
            "dataset": {
                "class": "qlib.data.dataset.DatasetH",
                "kwargs": {
                    "handler": f"file://{(DIRNAME / 'handler_proxy.pkl').absolute()}",
                    "segments": {
                        "train": ("2008-01-01", "2010-12-31"),
                        "test": ("2011-01-01", sim_task["dataset"]["kwargs"]["segments"]["test"][1]),
                    },
                },
            },
            # "record": ["qlib.workflow.record_temp.SignalRecord"]
        }

        # 2) preparing meta dataset
        kwargs = dict(
            task_tpl=proxy_forecast_model_task,
            step=self.step,
            segments=0.62,  # keep test period consistent with the dataset yaml
            trunc_days=1 + self.horizon,
            hist_step_n=30,
            fill_method="max",
            rolling_ext_days=0,
        )
        # NOTE:
        # the input of meta model (internal data) are shared between proxy model and final forecasting model
        # but their task test segment are not aligned! It worked in my previous experiment.
        # So the misalignment will not affect the effectiveness of the method.
        with self._internal_data_path.open("rb") as f:
            internal_data = pickle.load(f)
        md = MetaDatasetDS(exp_name=internal_data, **kwargs)

        # 3) train and logging meta model
        with R.start(experiment_name=self.meta_exp_name):
            R.log_params(**kwargs)
            mm = MetaModelDS(step=self.step, hist_step_n=kwargs["hist_step_n"], lr=0.001, max_epoch=200, seed=43)
            mm.fit(md)
            R.save_objects(model=mm)

    @property
    def _task_path(self):
        return DIRNAME / f"tasks_s{self.step}.pkl"

    def meta_inference(self):
        """
        Leverage meta-model for inference:
        - Given
            - baseline tasks
            - input for meta model(internal data)
            - meta model (its learnt knowledge on proxy forecasting model is expected to transfer to normal forecasting model)
        """
        # 1) get meta model
        exp = R.get_exp(experiment_name=self.meta_exp_name)
        rec = exp.list_recorders(rtype=exp.RT_L)[0]
        meta_model: MetaModelDS = rec.load_object("model")

        # 2)
        # we transfer the knowledge of the meta model to the final forecasting tasks.
        # Create MetaTaskDataset for the final forecasting tasks
        # Aligning the setting of it to the MetaTaskDataset when training Meta model is necessary

        # 2.1) get previous config
        param = rec.list_params()
        trunc_days = int(param["trunc_days"])
        step = int(param["step"])
        hist_step_n = int(param["hist_step_n"])
        fill_method = param.get("fill_method", "max")

        rb = RollingBenchmark(model_type=self.forecast_model)
        task_l = rb.create_rolling_tasks()

        # 2.2) create meta dataset for final dataset
        kwargs = dict(
            task_tpl=task_l,
            step=step,
            segments=0.0,  # all the tasks are for testing
            trunc_days=trunc_days,
            hist_step_n=hist_step_n,
            fill_method=fill_method,
            task_mode=MetaTask.PROC_MODE_TRANSFER,
        )

        with self._internal_data_path.open("rb") as f:
            internal_data = pickle.load(f)
        mds = MetaDatasetDS(exp_name=internal_data, **kwargs)

        # 3) meta model make inference and get new qlib task
        new_tasks = meta_model.inference(mds)
        with self._task_path.open("wb") as f:
            pickle.dump(new_tasks, f)

    def train_and_eval_tasks(self):
        """
        Training the tasks generated by meta model
        Then evaluate it
        """
        with self._task_path.open("rb") as f:
            tasks = pickle.load(f)
        rb = RollingBenchmark(rolling_exp="rolling_ds", model_type=self.forecast_model)
        rb.train_rolling_tasks(tasks)
        rb.ens_rolling()
        rb.update_rolling_rec()

    def run_all(self):
        # 1) file: handler_proxy.pkl
        self.dump_data_for_proxy_model()
        # 2)
        # file: internal_data_s20.pkl
        # mlflow: data_sim_s20, models for calculating meta_ipt
        self.dump_meta_ipt()
        # 3) meta model will be stored in `DDG-DA`
        self.train_meta_model()
        # 4) new_tasks are saved in "tasks_s20.pkl" (reweighter is added)
        self.meta_inference()
        # 5) load the saved tasks and train model
        self.train_and_eval_tasks()


if __name__ == "__main__":
    GetData().qlib_data(exists_skip=True)
    auto_init()
    fire.Fire(DDGDA)
18
examples/benchmarks_dynamic/REAMDE.md
Normal file
@@ -0,0 +1,18 @@
# Introduction
Due to the non-stationary nature of the financial market environment, the data distribution may change across periods, which makes the performance of models built on training data decay on future test data.
Therefore, adapting the forecasting models/strategies to market dynamics is very important to their performance.

The table below shows the performances of different solutions on different forecasting models.

## Alpha158 dataset

| Model Name       | Dataset  | IC    | ICIR  | Rank IC | Rank ICIR | Annualized Return | Information Ratio | Max Drawdown |
|------------------|----------|-------|-------|---------|-----------|-------------------|-------------------|--------------|
| RR[Linear]       | Alpha158 | 0.088 | 0.570 | 0.102   | 0.622     | 0.077             | 1.175             | -0.086       |
| DDG-DA[Linear]   | Alpha158 | 0.093 | 0.622 | 0.106   | 0.670     | 0.085             | 1.213             | -0.093       |
| RR[LightGBM]     | Alpha158 | 0.079 | 0.566 | 0.088   | 0.592     | 0.075             | 1.226             | -0.096       |
| DDG-DA[LightGBM] | Alpha158 | 0.084 | 0.639 | 0.093   | 0.664     | 0.099             | 1.442             | -0.071       |

- The label horizon of the `Alpha158` dataset is set to 20.
- The rolling time intervals are set to 20 trading days.
- The test rolling periods are from January 2017 to August 2020.
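
To make the label horizon concrete: the baseline code in this commit builds the label expression as `"Ref($close, -{}) / Ref($close, -1) - 1".format(horizon + 1)`. The sketch below is only an illustration of what that expression means on a plain list of prices, assuming that `Ref($close, -d)` refers to the close price `d` trading days in the future; `forward_return_labels` and `label_expression` are hypothetical helpers, not Qlib functions.

```python
def label_expression(horizon):
    """Build the label expression string in the same format the baseline uses."""
    return "Ref($close, -{}) / Ref($close, -1) - 1".format(horizon + 1)


def forward_return_labels(close, horizon):
    """Return close[t + horizon + 1] / close[t + 1] - 1 for every day t.

    Days that do not have enough future prices get None.
    """
    labels = []
    for t in range(len(close)):
        start, end = t + 1, t + horizon + 1
        if end < len(close):
            labels.append(close[end] / close[start] - 1)
        else:
            labels.append(None)
    return labels


close = [10.0, 11.0, 12.1, 13.31, 14.641]  # toy prices growing 10% per day
labels = forward_return_labels(close, horizon=2)
print(label_expression(20))
print(labels)
```

In this toy series the last `horizon + 1` days have no label, because the label of day `t` looks `horizon + 1` days ahead. This is consistent with the workflow passing `trunc_days=horizon + 1` when generating rolling tasks.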
15
examples/benchmarks_dynamic/baseline/README.md
Normal file
@@ -0,0 +1,15 @@
# Introduction

This is the framework of periodically Rolling Retrain (RR) forecasting models. RR adapts to market dynamics by utilizing the up-to-date data periodically.

## Run the Code
Users can try RR by running the following command:
```bash
python rolling_benchmark.py run_all
```

The default forecasting model is `Linear`. Users can choose other forecasting models by changing the `model_type` parameter.
For example, users can try `LightGBM` forecasting models by running the following command:
```bash
python rolling_benchmark.py --model_type="gbdt" run_all
```
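
The rolling idea can be sketched over a plain integer trading calendar. This is a simplified illustration under stated assumptions and not Qlib's `RollingGen`: `rolling_tasks` is a hypothetical helper, segments are half-open index ranges, there is no validation segment, and the training window has a fixed maximum length. The `step` and `trunc_days` values mirror the ones used in `rolling_benchmark.py` (`step=20`, `trunc_days=horizon + 1`).

```python
def rolling_tasks(n_days, train_len, test_start, step, trunc_days):
    """Return a list of ((train_start, train_end), (test_start, test_end)) pairs.

    All ranges are half-open indices into a trading calendar of n_days days.
    The training window ends trunc_days before the test window starts, so that
    labels looking into the future do not overlap the test period.
    """
    tasks = []
    while test_start < n_days:
        test_end = min(test_start + step, n_days)
        train_end = test_start - trunc_days
        train_start = max(train_end - train_len, 0)
        tasks.append(((train_start, train_end), (test_start, test_end)))
        test_start += step  # roll the test window forward by one step
    return tasks


tasks = rolling_tasks(n_days=100, train_len=40, test_start=60, step=20, trunc_days=21)
for train, test in tasks:
    print("train", train, "test", test)
```

Each task would then be trained separately and the per-window predictions concatenated, which is the role `train_rolling_tasks` and `ens_rolling` play in the baseline script.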
114
examples/benchmarks_dynamic/baseline/rolling_benchmark.py
Normal file
@@ -0,0 +1,114 @@
|
||||
# Copyright (c) Microsoft Corporation.
|
||||
# Licensed under the MIT License.
|
||||
from qlib.model.ens.ensemble import RollingEnsemble
|
||||
from qlib.utils import init_instance_by_config
|
||||
import fire
|
||||
import yaml
|
||||
from qlib import auto_init
|
||||
from pathlib import Path
|
||||
from tqdm.auto import tqdm
|
||||
from qlib.model.trainer import TrainerR
|
||||
from qlib.workflow import R
|
||||
from qlib.tests.data import GetData
|
||||
|
||||
DIRNAME = Path(__file__).absolute().resolve().parent
|
||||
from qlib.workflow.task.gen import task_generator, RollingGen
|
||||
from qlib.workflow.task.collect import RecorderCollector
|
||||
from qlib.workflow.record_temp import PortAnaRecord, SigAnaRecord
|
||||
|
||||
|
||||
class RollingBenchmark:
|
||||
"""
|
||||
**NOTE**
|
||||
before running the example, please clean your previous results with following command
|
||||
- `rm -r mlruns`
|
||||
|
||||
"""
|
||||
|
||||
def __init__(self, rolling_exp="rolling_models", model_type="linear") -> None:
|
||||
self.step = 20
|
||||
self.horizon = 20
|
||||
self.rolling_exp = rolling_exp
|
||||
self.model_type = model_type
|
||||
|
||||
def basic_task(self):
|
||||
"""For fast training rolling"""
|
||||
if self.model_type == "gbdt":
|
||||
conf_path = DIRNAME.parent.parent / "benchmarks" / "LightGBM" / "workflow_config_lightgbm_Alpha158.yaml"
|
||||
# dump the processed data on to disk for later loading to speed up the processing
|
||||
h_path = DIRNAME / "lightgbm_alpha158_handler_horizon{}.pkl".format(self.horizon)
|
||||
elif self.model_type == "linear":
|
||||
conf_path = DIRNAME.parent.parent / "benchmarks" / "Linear" / "workflow_config_linear_Alpha158.yaml"
|
||||
h_path = DIRNAME / "linear_alpha158_handler_horizon{}.pkl".format(self.horizon)
|
||||
else:
|
||||
raise AssertionError("Model type is not supported!")
|
||||
with conf_path.open("r") as f:
|
||||
conf = yaml.safe_load(f)
|
||||
|
||||
# modify dataset horizon
|
||||
conf["task"]["dataset"]["kwargs"]["handler"]["kwargs"]["label"] = [
|
||||
"Ref($close, -{}) / Ref($close, -1) - 1".format(self.horizon + 1)
|
||||
]
|
||||
|
||||
task = conf["task"]
|
||||
|
||||
if not h_path.exists():
|
||||
h_conf = task["dataset"]["kwargs"]["handler"]
|
||||
h = init_instance_by_config(h_conf)
|
||||
h.to_pickle(h_path, dump_all=True)
|
||||
|
||||
task["dataset"]["kwargs"]["handler"] = f"file://{h_path}"
|
||||
task["record"] = ["qlib.workflow.record_temp.SignalRecord"]
|
||||
return task
|
||||
|
||||
def create_rolling_tasks(self):
|
||||
task = self.basic_task()
|
||||
task_l = task_generator(
|
||||
task, RollingGen(step=self.step, trunc_days=self.horizon + 1)
|
||||
) # the last two days should be truncated to avoid information leakage
|
||||
return task_l
|
||||
|
||||
def train_rolling_tasks(self, task_l=None):
|
||||
if task_l is None:
|
||||
task_l = self.create_rolling_tasks()
|
||||
trainer = TrainerR(experiment_name=self.rolling_exp)
|
||||
trainer(task_l)
|
||||
|
||||
COMB_EXP = "rolling"
|
||||
|
||||
def ens_rolling(self):
|
||||
rc = RecorderCollector(
|
||||
experiment=self.rolling_exp,
|
||||
artifacts_key=["pred", "label"],
|
||||
process_list=[RollingEnsemble()],
|
||||
# rec_key_func=lambda rec: (self.COMB_EXP, rec.info["id"]),
|
||||
artifacts_path={"pred": "pred.pkl", "label": "label.pkl"},
|
||||
)
|
||||
res = rc()
|
||||
with R.start(experiment_name=self.COMB_EXP):
|
||||
R.log_params(exp_name=self.rolling_exp)
|
||||
R.save_objects(**{"pred.pkl": res["pred"], "label.pkl": res["label"]})
|
||||
|
||||
def update_rolling_rec(self):
|
||||
"""
|
||||
Evaluate the combined rolling results
|
||||
"""
|
||||
for rid, rec in R.list_recorders(experiment_name=self.COMB_EXP).items():
|
||||
for rt_cls in SigAnaRecord, PortAnaRecord:
|
||||
rt = rt_cls(recorder=rec, skip_existing=True)
|
||||
rt.generate()
|
||||
print(f"Your evaluation results can be found in the experiment named `{self.COMB_EXP}`.")
|
||||
|
||||
def run_all(self):
|
||||
# the results will be save in mlruns.
|
||||
# 1) each rolling task is saved in rolling_models
|
||||
self.train_rolling_tasks()
|
||||
# 2) combined rolling tasks and evaluation results are saved in rolling
|
||||
self.ens_rolling()
|
||||
self.update_rolling_rec()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
GetData().qlib_data(exists_skip=True)
|
||||
auto_init()
|
||||
fire.Fire(RollingBenchmark)
|
||||
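The rolling setup in `create_rolling_tasks` can be pictured with plain integer day indices. The sketch below is a simplified illustration, not qlib's `RollingGen`: the function name `rolling_segments`, its parameters, and the integer calendar are assumptions made for this example. It slides the train and test windows forward by `step` and cuts each training window so that it ends at least `trunc_days` before the test start, which is the purpose of `trunc_days=self.horizon + 1` above.

```python
def rolling_segments(train, test, step, trunc_days, last_day):
    """Return a list of (train, test) ranges over an integer day calendar.

    Hypothetical helper: days are plain integers, ranges are inclusive.
    """
    (tr_s, tr_e), (te_s, te_e) = train, test
    tasks = []
    while te_s <= last_day:
        # Labels near the end of training look `horizon` days ahead, so the
        # training window must stop `trunc_days` before the test window starts.
        safe_tr_e = min(tr_e, te_s - trunc_days)
        tasks.append(((tr_s, safe_tr_e), (te_s, min(te_e, last_day))))
        # Slide every boundary forward by one rolling step.
        tr_s, tr_e, te_s, te_e = tr_s + step, tr_e + step, te_s + step, te_e + step
    return tasks


tasks = rolling_segments(train=(0, 99), test=(100, 119), step=20, trunc_days=3, last_day=150)
for train_seg, test_seg in tasks:
    print("train", train_seg, "test", test_seg)
```

Each generated task keeps a gap between the last training day and the first test day, so no label used for training overlaps the test period.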
4
qlib/contrib/meta/__init__.py
Normal file
@@ -0,0 +1,4 @@
|
||||
# Copyright (c) Microsoft Corporation.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
from .data_selection import MetaTaskDS, MetaDatasetDS, MetaModelDS
|
||||
5
qlib/contrib/meta/data_selection/__init__.py
Normal file
@@ -0,0 +1,5 @@
|
||||
# Copyright (c) Microsoft Corporation.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
from .dataset import MetaDatasetDS, MetaTaskDS
|
||||
from .model import MetaModelDS
|
||||
325
qlib/contrib/meta/data_selection/dataset.py
Normal file
@@ -0,0 +1,325 @@
|
||||
# Copyright (c) Microsoft Corporation.
|
||||
# Licensed under the MIT License.
|
||||
from copy import deepcopy
|
||||
from qlib.data.dataset.utils import init_task_handler
|
||||
from qlib.utils.data import deepcopy_basic_type
|
||||
from qlib.contrib.torch import data_to_tensor
|
||||
from qlib.workflow.task.utils import TimeAdjuster
|
||||
from qlib.model.meta.task import MetaTask
|
||||
from typing import Dict, List, Union, Text, Tuple
|
||||
from qlib.data.dataset.handler import DataHandler
|
||||
from qlib.log import get_module_logger
|
||||
from qlib.utils import auto_filter_kwargs, get_date_by_shift, init_instance_by_config
|
||||
from qlib.workflow import R
|
||||
from qlib.workflow.task.gen import RollingGen, task_generator
|
||||
from joblib import Parallel, delayed
|
||||
from qlib.model.meta.dataset import MetaTaskDataset
|
||||
from qlib.model.trainer import task_train, TrainerR
|
||||
from qlib.data.dataset import DatasetH
|
||||
from tqdm.auto import tqdm
|
||||
import pandas as pd
|
||||
import numpy as np
|
||||
|
||||
|
||||
class InternalData:
|
||||
def __init__(self, task_tpl: dict, step: int, exp_name: str):
|
||||
self.task_tpl = task_tpl
|
||||
self.step = step
|
||||
self.exp_name = exp_name
|
||||
|
||||
def setup(self, trainer=TrainerR, trainer_kwargs={}):
|
||||
"""
|
||||
After running this function, `self.data_ic_df` will be set.
|
||||
Each column represents a data segment (a start/end date pair).
|
||||
Each row holds the performance (IC) of that data segment at the given timestamp.
|
||||
For example,
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
2021-06-21 2021-06-04 2021-05-21 2021-05-07 2021-04-20 2021-04-06 2021-03-22 2021-03-08 ...
|
||||
2021-07-02 2021-06-18 2021-06-03 2021-05-20 2021-05-06 2021-04-19 2021-04-02 2021-03-19 ...
|
||||
datetime ...
|
||||
2018-01-02 0.079782 0.115975 0.070866 0.028849 -0.081170 0.140380 0.063864 0.110987 ...
|
||||
2018-01-03 0.123386 0.107789 0.071037 0.045278 -0.060782 0.167446 0.089779 0.124476 ...
|
||||
2018-01-04 0.140775 0.097206 0.063702 0.042415 -0.078164 0.173218 0.098914 0.114389 ...
|
||||
2018-01-05 0.030320 -0.037209 -0.044536 -0.047267 -0.081888 0.045648 0.059947 0.047652 ...
|
||||
2018-01-08 0.107201 0.009219 -0.015995 -0.036594 -0.086633 0.108965 0.122164 0.108508 ...
|
||||
... ... ... ... ... ... ... ... ... ...
|
||||
|
||||
"""
|
||||
|
||||
# 1) prepare the prediction of proxy models
|
||||
perf_task_tpl = deepcopy(self.task_tpl) # this task is supposed to contain no complicated objects
|
||||
|
||||
trainer = auto_filter_kwargs(trainer)(experiment_name=self.exp_name, **trainer_kwargs)
|
||||
# NOTE:
|
||||
# The handler is initialized only once.
|
||||
if not trainer.has_worker():
|
||||
self.dh = init_task_handler(perf_task_tpl)
|
||||
else:
|
||||
self.dh = init_instance_by_config(perf_task_tpl["dataset"]["kwargs"]["handler"])
|
||||
|
||||
seg = perf_task_tpl["dataset"]["kwargs"]["segments"]
|
||||
|
||||
# We want to split the training time period into small segments.
|
||||
perf_task_tpl["dataset"]["kwargs"]["segments"] = {
|
||||
"train": (DatasetH.get_min_time(seg), DatasetH.get_max_time(seg)),
|
||||
"test": (None, None),
|
||||
}
|
||||
|
||||
# NOTE:
|
||||
# we play a trick here
|
||||
# treat the training segments as test to create the rolling tasks
|
||||
rg = RollingGen(step=self.step, test_key="train", train_key=None, task_copy_func=deepcopy_basic_type)
|
||||
gen_task = task_generator(perf_task_tpl, [rg])
|
||||
|
||||
recorders = R.list_recorders(experiment_name=self.exp_name)
|
||||
if len(gen_task) == len(recorders):
|
||||
get_module_logger("Internal Data").info("the data has been initialized")
|
||||
else:
|
||||
# train new models
|
||||
assert 0 == len(recorders), "An empty experiment is required to set up `InternalData`"
|
||||
trainer.train(gen_task)
|
||||
|
||||
# 2) extract the similarity matrix
|
||||
label_df = self.dh.fetch(col_set="label")
|
||||
# for
|
||||
recorders = R.list_recorders(experiment_name=self.exp_name)
|
||||
|
||||
key_l = []
|
||||
ic_l = []
|
||||
for _, rec in tqdm(recorders.items(), desc="calc"):
|
||||
pred = rec.load_object("pred.pkl")
|
||||
task = rec.load_object("task")
|
||||
data_key = task["dataset"]["kwargs"]["segments"]["train"]
|
||||
key_l.append(data_key)
|
||||
ic_l.append(delayed(self._calc_perf)(pred.iloc[:, 0], label_df.iloc[:, 0]))
|
||||
|
||||
ic_l = Parallel(n_jobs=-1)(ic_l)
|
||||
self.data_ic_df = pd.DataFrame(dict(zip(key_l, ic_l)))
|
||||
self.data_ic_df = self.data_ic_df.sort_index().sort_index(axis=1)
|
||||
|
||||
del self.dh # handler is not useful now
|
||||
|
||||
def _calc_perf(self, pred, label):
|
||||
df = pd.DataFrame({"pred": pred, "label": label})
|
||||
df = df.groupby("datetime").corr(method="spearman")
|
||||
corr = df.loc(axis=0)[:, "pred"]["label"].droplevel(axis=0, level=-1)
|
||||
return corr
|
||||
|
||||
def update(self):
|
||||
"""update the data for online trading"""
|
||||
# TODO:
|
||||
# when new data (including labels) are fully available
|
||||
# - update the prediction
|
||||
# - update the data similarity map(if applied)
|
||||
|
||||
|
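The per-date performance computed by `_calc_perf` is a rank correlation (Spearman IC) between prediction and label within each trading date. The following is a minimal sketch of that idea, assuming a `(datetime, instrument)` MultiIndex; the helper name `daily_rank_ic` and the toy data are hypothetical. It ranks both series within each date and takes the Pearson correlation of the ranks, which is the definition of Spearman correlation.

```python
import pandas as pd


def daily_rank_ic(pred: pd.Series, label: pd.Series) -> pd.Series:
    """Spearman correlation between pred and label, computed per date."""
    df = pd.DataFrame({"pred": pred, "label": label}).dropna()
    # Spearman correlation = Pearson correlation of the ranks.
    return df.groupby(level="datetime").apply(
        lambda g: g["pred"].rank().corr(g["label"].rank())
    )


idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2021-01-04", "2021-01-05"]), ["A", "B", "C", "D"]],
    names=["datetime", "instrument"],
)
pred = pd.Series([1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0], index=idx)
label = pd.Series([0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4], index=idx)
ic = daily_rank_ic(pred, label)
print(ic)
```

On the first date the prediction orders the instruments exactly like the label, and on the second date in exactly the opposite order, so the two daily ICs sit at the two extremes of the range.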
||||
class MetaTaskDS(MetaTask):
|
||||
"""Meta Task for Data Selection"""
|
||||
|
||||
def __init__(self, task: dict, meta_info: pd.DataFrame, mode: str = MetaTask.PROC_MODE_FULL, fill_method="max"):
|
||||
"""
|
||||
The description of the processed data
|
||||
|
||||
time_perf: An array with shape <hist_step_n * step, data pieces> -> data piece performance
|
||||
|
||||
time_belong: An array with shape <sample, data pieces> -> belong or not (1. or 0.)
|
||||
array([[1., 0., 0., ..., 0., 0., 0.],
|
||||
[1., 0., 0., ..., 0., 0., 0.],
|
||||
[1., 0., 0., ..., 0., 0., 0.],
|
||||
...,
|
||||
[0., 0., 0., ..., 0., 0., 1.],
|
||||
[0., 0., 0., ..., 0., 0., 1.],
|
||||
[0., 0., 0., ..., 0., 0., 1.]])
|
||||
|
||||
"""
|
||||
super().__init__(task, meta_info)
|
||||
self.fill_method = fill_method
|
||||
|
||||
time_perf = self._get_processed_meta_info()
|
||||
self.processed_meta_input = {"time_perf": time_perf}
|
||||
# FIXME: memory issue in this step
|
||||
if mode == MetaTask.PROC_MODE_FULL:
|
||||
# process metainfo_
|
||||
ds = self.get_dataset()
|
||||
|
||||
# these three lines occupied 70% of the time of initializing MetaTaskDS
|
||||
d_train, d_test = ds.prepare(["train", "test"], col_set=["feature", "label"])
|
||||
prev_size = d_test.shape[0]
|
||||
d_train = d_train.dropna(axis=0)
|
||||
d_test = d_test.dropna(axis=0)
|
||||
if prev_size == 0 or d_test.shape[0] / prev_size <= 0.1:
|
||||
raise ValueError(f"Most of samples are dropped. Please check this task: {task}")
|
||||
|
||||
assert (
|
||||
d_test.groupby("datetime").size().shape[0] >= 5
|
||||
), "This segment has fewer than 5 trading dates; please check the data."
|
||||
|
||||
sample_time_belong = np.zeros((d_train.shape[0], time_perf.shape[1]))
|
||||
for i, col in enumerate(time_perf.columns):
|
||||
# these two lines of code occupied 20% of the time of initializing MetaTaskDS
|
||||
slc = slice(*d_train.index.slice_locs(start=col[0], end=col[1]))
|
||||
sample_time_belong[slc, i] = 1.0
|
||||
|
||||
# Samples that belong to no data piece (e.g. the latest month) are assigned to the last time_perf column
|
||||
# Assumption: the latest data performs similarly to the last month
|
||||
sample_time_belong[sample_time_belong.sum(axis=1) != 1, -1] = 1.0
|
||||
|
||||
self.processed_meta_input.update(
|
||||
dict(
|
||||
X=d_train["feature"],
|
||||
y=d_train["label"].iloc[:, 0],
|
||||
X_test=d_test["feature"],
|
||||
y_test=d_test["label"].iloc[:, 0],
|
||||
time_belong=sample_time_belong,
|
||||
test_idx=d_test["label"].index,
|
||||
)
|
||||
)
|
||||
|
||||
# TODO: set device. Converting the data format is probably not necessary.
|
||||
self.processed_meta_input = data_to_tensor(self.processed_meta_input)
|
||||
|
||||
def _get_processed_meta_info(self):
|
||||
meta_info_norm = self.meta_info.sub(self.meta_info.mean(axis=1), axis=0) # .fillna(0.)
|
||||
if self.fill_method == "max":
|
||||
meta_info_norm = meta_info_norm.T.fillna(
|
||||
meta_info_norm.max(axis=1)
|
||||
).T # fill it with row max to align with previous implementation
|
||||
elif self.fill_method == "zero":
|
||||
pass
|
||||
else:
|
||||
raise NotImplementedError(f"This type of input is not supported")
|
||||
meta_info_norm = meta_info_norm.fillna(0.0) # always fill zero in case of NaN
|
||||
return meta_info_norm
|
||||
|
||||
def get_meta_input(self):
|
||||
return self.processed_meta_input
|
||||
|
||||
|
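The `time_belong` matrix built in `MetaTaskDS.__init__` is a one-hot membership table: row `i`, column `j` is 1 when training sample `i` falls inside data piece `j`. A minimal sketch of that construction follows, assuming a sorted `DatetimeIndex` of sample dates; the helper name `build_time_belong` and the toy dates are hypothetical.

```python
import numpy as np
import pandas as pd


def build_time_belong(sample_dates: pd.DatetimeIndex, segments) -> np.ndarray:
    """Return a <sample, data pieces> membership matrix of 0./1. values."""
    belong = np.zeros((len(sample_dates), len(segments)))
    for j, (start, end) in enumerate(segments):
        # slice_locs gives the positional range covered by [start, end].
        lo, hi = sample_dates.slice_locs(start=start, end=end)
        belong[lo:hi, j] = 1.0
    # Samples not covered by exactly one piece are attached to the last piece.
    belong[belong.sum(axis=1) != 1, -1] = 1.0
    return belong


dates = pd.to_datetime(
    ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04", "2021-01-05"]
)
segments = [
    (pd.Timestamp("2021-01-01"), pd.Timestamp("2021-01-02")),
    (pd.Timestamp("2021-01-03"), pd.Timestamp("2021-01-04")),
]
time_belong = build_time_belong(dates, segments)
print(time_belong)
```

Multiplying this matrix by a vector of per-piece weights yields one weight per sample, which is how `time_belong @ weights` is used in the network code.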
||||
class MetaDatasetDS(MetaTaskDataset):
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
task_tpl: Union[dict, list],
|
||||
step: int,
|
||||
trunc_days: int = None,
|
||||
rolling_ext_days: int = 0,
|
||||
exp_name: Union[str, InternalData],
|
||||
segments: Union[Dict[Text, Tuple], float],
|
||||
hist_step_n: int = 10,
|
||||
task_mode: str = MetaTask.PROC_MODE_FULL,
|
||||
fill_method: str = "max",
|
||||
):
|
||||
"""
|
||||
A dataset for meta model.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
task_tpl : Union[dict, list]
|
||||
Decide what tasks are used.
|
||||
- dict : the task template, the prepared task is generated with `step`, `trunc_days` and `RollingGen`
|
||||
- list : when list, use the list of tasks directly
|
||||
the list is supposed to be sorted by time
|
||||
step : int
|
||||
the rolling step
|
||||
trunc_days: int
|
||||
days to be truncated based on the test start
|
||||
rolling_ext_days: int
|
||||
sometimes users want to train meta models for a longer test period but with smaller rolling steps for more task samples.
|
||||
the total length of test periods will be `step + rolling_ext_days`
|
||||
|
||||
exp_name : Union[str, InternalData]
|
||||
Decide what meta_info are used for prediction.
|
||||
- str: the name of the experiment to store the performance of data
|
||||
- InternalData: a prepared internal data
|
||||
segments: Union[Dict[Text, Tuple], float]
|
||||
the segments to divide data
|
||||
both the left and right endpoints are included
|
||||
if segments is a float:
|
||||
the float represents the percentage of data for training
|
||||
hist_step_n: int
|
||||
length of historical steps for the meta information
|
||||
task_mode : str
|
||||
Please refer to the docs of MetaTask
|
||||
"""
|
||||
super().__init__(segments=segments)
|
||||
if isinstance(exp_name, InternalData):
|
||||
self.internal_data = exp_name
|
||||
else:
|
||||
self.internal_data = InternalData(task_tpl, step=step, exp_name=exp_name)
|
||||
self.internal_data.setup()
|
||||
self.task_tpl = deepcopy(task_tpl) # FIXME: if the handler is shared, how do we avoid a memory explosion?
|
||||
self.trunc_days = trunc_days
|
||||
self.hist_step_n = hist_step_n
|
||||
self.step = step
|
||||
|
||||
if isinstance(task_tpl, dict):
|
||||
rg = RollingGen(
|
||||
step=step, trunc_days=trunc_days, task_copy_func=deepcopy_basic_type
|
||||
) # NOTE: trunc_days is very important !!!!
|
||||
task_iter = rg(task_tpl)
|
||||
if rolling_ext_days > 0:
|
||||
self.ta = TimeAdjuster(future=True)
|
||||
for t in task_iter:
|
||||
t["dataset"]["kwargs"]["segments"]["test"] = self.ta.shift(
|
||||
t["dataset"]["kwargs"]["segments"]["test"], step=rolling_ext_days, rtype=RollingGen.ROLL_EX
|
||||
)
|
||||
if task_mode == MetaTask.PROC_MODE_FULL:
|
||||
# Only pre-initialize the task when the full task is required
|
||||
# initialize the handler and share it.
|
||||
init_task_handler(task_tpl)
|
||||
else:
|
||||
assert isinstance(task_tpl, list)
|
||||
task_iter = task_tpl
|
||||
|
||||
self.task_list = []
|
||||
self.meta_task_l = []
|
||||
logger = get_module_logger("MetaDatasetDS")
|
||||
logger.info(f"Example task for training meta model: {task_iter[0]}")
|
||||
for t in tqdm(task_iter, desc="creating meta tasks"):
|
||||
try:
|
||||
self.meta_task_l.append(
|
||||
MetaTaskDS(t, meta_info=self._prepare_meta_ipt(t), mode=task_mode, fill_method=fill_method)
|
||||
)
|
||||
self.task_list.append(t)
|
||||
except ValueError as e:
|
||||
logger.warning(f"ValueError: {e}")
|
||||
assert len(self.meta_task_l) > 0, "No meta tasks found. Please check the data and setting"
|
||||
|
||||
def _prepare_meta_ipt(self, task):
|
||||
ic_df = self.internal_data.data_ic_df
|
||||
|
||||
segs = task["dataset"]["kwargs"]["segments"]
|
||||
end = max([segs[k][1] for k in ("train", "valid") if k in segs])
|
||||
ic_df_avail = ic_df.loc[:end, pd.IndexSlice[:, :end]]
|
||||
|
||||
# the meta dataset focuses on the **information** rather than on preprocessing
|
||||
# 1) filter the future info
|
||||
def mask_future(s):
|
||||
"""mask future information"""
|
||||
# from qlib.utils import get_date_by_shift
|
||||
start, end = s.name
|
||||
end = get_date_by_shift(trading_date=end, shift=self.trunc_days - 1, future=True)
|
||||
return s.mask((s.index >= start) & (s.index <= end))
|
||||
|
||||
ic_df_avail = ic_df_avail.apply(mask_future) # apply to each col
|
||||
|
||||
# 2) filter the info with too long periods
|
||||
total_len = self.step * self.hist_step_n
|
||||
if ic_df_avail.shape[0] >= total_len:
|
||||
return ic_df_avail.iloc[-total_len:]
|
||||
else:
|
||||
raise ValueError("the history of distribution data is not long enough.")
|
||||
|
||||
def _prepare_seg(self, segment: Text) -> List[MetaTask]:
|
||||
if isinstance(self.segments, float):
|
||||
train_task_n = int(len(self.meta_task_l) * self.segments)
|
||||
if segment == "train":
|
||||
return self.meta_task_l[:train_task_n]
|
||||
elif segment == "test":
|
||||
return self.meta_task_l[train_task_n:]
|
||||
else:
|
||||
raise NotImplementedError(f"This type of input is not supported")
|
||||
else:
|
||||
raise NotImplementedError(f"This type of input is not supported")
|
||||
182
qlib/contrib/meta/data_selection/model.py
Normal file
@@ -0,0 +1,182 @@
|
||||
# Copyright (c) Microsoft Corporation.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
from qlib.log import get_module_logger
|
||||
import pandas as pd
|
||||
import numpy as np
|
||||
from qlib.model.meta.task import MetaTask
|
||||
import torch
|
||||
from torch import nn
|
||||
from torch import optim
|
||||
from tqdm.auto import tqdm
|
||||
import collections
|
||||
import copy
|
||||
from typing import Union, List, Tuple, Dict
|
||||
|
||||
from ....data.dataset.weight import Reweighter
|
||||
from ....model.meta.dataset import MetaTaskDataset
|
||||
from ....model.meta.model import MetaModel, MetaTaskModel
|
||||
from ....workflow import R
|
||||
|
||||
from .utils import ICLoss
|
||||
from .dataset import MetaDatasetDS
|
||||
from qlib.contrib.meta.data_selection.net import PredNet
|
||||
from qlib.data.dataset.weight import Reweighter
|
||||
from qlib.log import get_module_logger
|
||||
|
||||
logger = get_module_logger("data selection")
|
||||
|
||||
|
||||
class TimeReweighter(Reweighter):
|
||||
def __init__(self, time_weight: pd.Series):
|
||||
self.time_weight = time_weight
|
||||
|
||||
def reweight(self, data: Union[pd.DataFrame, pd.Series]):
|
||||
# TODO: handling TSDataSampler
|
||||
w_s = pd.Series(1.0, index=data.index)
|
||||
for k, w in self.time_weight.items():
|
||||
w_s.loc[slice(*k)] = w
|
||||
logger.info(f"Reweighting result: {w_s}")
|
||||
return w_s
|
||||
|
||||
|
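`TimeReweighter.reweight` starts from a weight of 1.0 for every sample and overwrites the samples inside each `(start, end)` period with that period's weight. The sketch below shows the same label-based slicing on a plain `DatetimeIndex`; the function name `reweight_by_time` and the dictionary input are assumptions for illustration (the class above takes a `pd.Series` keyed by period).

```python
import pandas as pd


def reweight_by_time(index: pd.DatetimeIndex, time_weight: dict) -> pd.Series:
    """Assign each sample the weight of the period that contains it."""
    w = pd.Series(1.0, index=index)  # default weight for uncovered samples
    for (start, end), weight in time_weight.items():
        # Label slicing on a sorted index includes both endpoints.
        w.loc[start:end] = weight
    return w


index = pd.date_range("2021-01-01", periods=6)
time_weight = {
    (pd.Timestamp("2021-01-01"), pd.Timestamp("2021-01-02")): 0.5,
    (pd.Timestamp("2021-01-03"), pd.Timestamp("2021-01-04")): 2.0,
}
weights = reweight_by_time(index, time_weight)
print(weights)
```

Samples outside every period keep the default weight, so a model that receives these weights treats them as ordinary, unscaled observations.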
||||
class MetaModelDS(MetaTaskModel):
|
||||
"""
|
||||
The meta-model for meta-learning-based data selection.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
step,
|
||||
hist_step_n,
|
||||
clip_method="tanh",
|
||||
clip_weight=2.0,
|
||||
criterion="ic_loss",
|
||||
lr=0.0001,
|
||||
max_epoch=100,
|
||||
seed=43,
|
||||
):
|
||||
self.step = step
|
||||
self.hist_step_n = hist_step_n
|
||||
self.clip_method = clip_method
|
||||
self.clip_weight = clip_weight
|
||||
self.criterion = criterion
|
||||
self.lr = lr
|
||||
self.max_epoch = max_epoch
|
||||
self.fitted = False
|
||||
torch.manual_seed(seed)
|
||||
|
||||
def run_epoch(self, phase, task_list, epoch, opt, loss_l, ignore_weight=False):
|
||||
if phase == "train":
|
||||
self.tn.train()
|
||||
torch.set_grad_enabled(True)
|
||||
else:
|
||||
self.tn.eval()
|
||||
torch.set_grad_enabled(False)
|
||||
running_loss = 0.0
|
||||
pred_y_all = []
|
||||
for task in tqdm(task_list, desc=f"{phase} Task", leave=False):
|
||||
meta_input = task.get_meta_input()
|
||||
pred, weights = self.tn(
|
||||
meta_input["X"],
|
||||
meta_input["y"],
|
||||
meta_input["time_perf"],
|
||||
meta_input["time_belong"],
|
||||
meta_input["X_test"],
|
||||
ignore_weight=ignore_weight,
|
||||
)
|
||||
if self.criterion == "mse":
|
||||
criterion = nn.MSELoss()
|
||||
loss = criterion(pred, meta_input["y_test"])
|
||||
elif self.criterion == "ic_loss":
|
||||
criterion = ICLoss()
|
||||
try:
|
||||
loss = criterion(pred, meta_input["y_test"], meta_input["test_idx"], skip_size=50)
|
||||
except ValueError as e:
|
||||
get_module_logger("MetaModelDS").warning(f"Exception `{e}` when calculating IC loss")
|
||||
continue
|
||||
|
||||
assert not np.isnan(loss.detach().item()), "NaN loss!"
|
||||
|
||||
if phase == "train":
|
||||
opt.zero_grad()
|
||||
norm_loss = nn.MSELoss()
|
||||
loss.backward()
|
||||
opt.step()
|
||||
elif phase == "test":
|
||||
pass
|
||||
|
||||
pred_y_all.append(
|
||||
pd.DataFrame(
|
||||
{
|
||||
"pred": pd.Series(pred.detach().cpu().numpy(), index=meta_input["test_idx"]),
|
||||
"label": pd.Series(meta_input["y_test"].detach().cpu().numpy(), index=meta_input["test_idx"]),
|
||||
}
|
||||
)
|
||||
)
|
||||
running_loss += loss.detach().item()
|
||||
running_loss = running_loss / len(task_list)
|
||||
loss_l.setdefault(phase, []).append(running_loss)
|
||||
|
||||
pred_y_all = pd.concat(pred_y_all)
|
||||
ic = pred_y_all.groupby("datetime").apply(lambda df: df["pred"].corr(df["label"], method="spearman")).mean()
|
||||
|
||||
R.log_metrics(**{f"loss/{phase}": running_loss, "step": epoch})
|
||||
R.log_metrics(**{f"ic/{phase}": ic, "step": epoch})
|
||||
|
||||
def fit(self, meta_dataset: MetaDatasetDS):
|
||||
"""
|
||||
The meta-learning-based data selection interacts directly with meta-dataset due to the closed-form proxy measurement.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
meta_dataset : MetaDatasetDS
|
||||
The meta-model takes the meta-dataset for its training process.
|
||||
"""
|
||||
|
||||
if not self.fitted:
|
||||
for k in set(["lr", "step", "hist_step_n", "clip_method", "clip_weight", "criterion", "max_epoch"]):
|
||||
R.log_params(**{k: getattr(self, k)})
|
||||
|
||||
# FIXME: get test tasks for just checking the performance
|
||||
phases = ["train", "test"]
|
||||
meta_tasks_l = meta_dataset.prepare_tasks(phases)
|
||||
|
||||
if len(meta_tasks_l[1]):
|
||||
R.log_params(
|
||||
**dict(proxy_test_begin=meta_tasks_l[1][0].task["dataset"]["kwargs"]["segments"]["test"])
|
||||
) # debug: record when the test phase starts
|
||||
|
||||
self.tn = PredNet(
|
||||
step=self.step, hist_step_n=self.hist_step_n, clip_weight=self.clip_weight, clip_method=self.clip_method
|
||||
)
|
||||
|
||||
opt = optim.Adam(self.tn.parameters(), lr=self.lr)
|
||||
|
||||
# baseline runs: first without weights, then with the initial weights
|
||||
for phase, task_list in zip(phases, meta_tasks_l):
|
||||
self.run_epoch(f"{phase}_noweight", task_list, 0, opt, {}, ignore_weight=True)
|
||||
self.run_epoch(f"{phase}_init", task_list, 0, opt, {})
|
||||
|
||||
# run training
|
||||
loss_l = {}
|
||||
for epoch in tqdm(range(self.max_epoch), desc="epoch"):
|
||||
for phase, task_list in zip(phases, meta_tasks_l):
|
||||
self.run_epoch(phase, task_list, epoch, opt, loss_l)
|
||||
R.save_objects(**{"model.pkl": self.tn})
|
||||
self.fitted = True
|
||||
|
||||
def _prepare_task(self, task: MetaTask) -> dict:
|
||||
meta_ipt = task.get_meta_input()
|
||||
weights = self.tn.twm(meta_ipt["time_perf"])
|
||||
|
||||
weight_s = pd.Series(weights.detach().cpu().numpy(), index=task.meta_info.columns)
|
||||
task = copy.copy(task.task) # NOTE: this is a shallow copy.
|
||||
task["reweighter"] = TimeReweighter(weight_s)
|
||||
return task
|
||||
|
||||
def inference(self, meta_dataset: MetaTaskDataset) -> List[dict]:
|
||||
res = []
|
||||
for mt in meta_dataset.prepare_tasks("test"):
|
||||
res.append(self._prepare_task(mt))
|
||||
return res
|
||||
68
qlib/contrib/meta/data_selection/net.py
Normal file
@@ -0,0 +1,68 @@
|
||||
# Copyright (c) Microsoft Corporation.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
import pandas as pd
|
||||
import numpy as np
|
||||
import torch
|
||||
from torch import nn
|
||||
|
||||
from .utils import preds_to_weight_with_clamp, SingleMetaBase
|
||||
|
||||
|
||||
class TimeWeightMeta(SingleMetaBase):
|
||||
def __init__(self, hist_step_n, clip_weight=None, clip_method="clamp"):
|
||||
# clip_method can be "tanh" or "clamp"
|
||||
super().__init__(hist_step_n, clip_weight, clip_method)
|
||||
self.linear = nn.Linear(hist_step_n, 1)
|
||||
self.k = nn.Parameter(torch.Tensor([8.0]))
|
||||
|
||||
def forward(self, time_perf, time_belong=None, return_preds=False):
|
||||
hist_step_n = self.linear.in_features
|
||||
# NOTE: the reshape order is very important
|
||||
time_perf = time_perf.reshape(hist_step_n, time_perf.shape[0] // hist_step_n, *time_perf.shape[1:])
|
||||
time_perf = torch.mean(time_perf, dim=1, keepdim=False)
|
||||
|
||||
preds = []
|
||||
for i in range(time_perf.shape[1]):
|
||||
preds.append(self.linear(time_perf[:, i]))
|
||||
preds = torch.cat(preds)
|
||||
preds = preds - torch.mean(preds) # avoid using future information
|
||||
preds = preds * self.k
|
||||
if return_preds:
|
||||
if time_belong is None:
|
||||
return preds
|
||||
else:
|
||||
return time_belong @ preds
|
||||
else:
|
||||
weights = preds_to_weight_with_clamp(preds, self.clip_weight, self.clip_method)
|
||||
if time_belong is None:
|
||||
return weights
|
||||
else:
|
||||
return time_belong @ weights
|
||||
|
||||
|
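The reshape in `TimeWeightMeta.forward` groups the `hist_step_n * step` daily rows of `time_perf` into `hist_step_n` blocks of `step` consecutive days and averages each block. A NumPy sketch of that step is shown here; NumPy stands in for torch (both reshape in row-major order), and the helper name `average_per_step` is hypothetical.

```python
import numpy as np


def average_per_step(time_perf: np.ndarray, hist_step_n: int) -> np.ndarray:
    """Collapse <hist_step_n * step, pieces> into <hist_step_n, pieces>."""
    total_days, pieces = time_perf.shape
    step = total_days // hist_step_n
    # Row-major reshape keeps consecutive days in the same historical step.
    blocks = time_perf.reshape(hist_step_n, step, pieces)
    return blocks.mean(axis=1)


time_perf = np.arange(12, dtype=float).reshape(6, 2)  # 6 days, 2 data pieces
step_perf = average_per_step(time_perf, hist_step_n=3)
print(step_perf)
```

Reshaping to `(step, hist_step_n, ...)` instead would interleave days from different steps, which is why the source comment stresses the reshape order.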
||||
class PredNet(nn.Module):
|
||||
def __init__(self, step, hist_step_n, clip_weight=None, clip_method="tanh"):
|
||||
super().__init__()
|
||||
self.step = step
|
||||
self.twm = TimeWeightMeta(hist_step_n=hist_step_n, clip_weight=clip_weight, clip_method=clip_method)
|
||||
self.init_paramters(hist_step_n)
|
||||
|
||||
def get_sample_weights(self, X, time_perf, time_belong, ignore_weight=False):
|
||||
weights = torch.from_numpy(np.ones(X.shape[0])).float().to(X.device)
|
||||
if not ignore_weight:
|
||||
if time_perf is not None:
|
||||
weights_t = self.twm(time_perf, time_belong)
|
||||
weights = weights * weights_t
|
||||
return weights
|
||||
|
||||
def forward(self, X, y, time_perf, time_belong, X_test, ignore_weight=False):
|
||||
"""Please refer to the docs of MetaTaskDS for the description of the variables"""
|
||||
weights = self.get_sample_weights(X, time_perf, time_belong, ignore_weight=ignore_weight)
|
||||
X_w = X.T * weights.view(1, -1)
|
||||
theta = torch.inverse(X_w @ X) @ X_w @ y
|
||||
return X_test @ theta, weights
|
||||
|
||||
def init_paramters(self, hist_step_n):
|
||||
self.twm.linear.weight.data = 1.0 / hist_step_n + self.twm.linear.weight.data * 0.01
|
||||
self.twm.linear.bias.data.fill_(0.0)
|
||||
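`PredNet.forward` fits a linear proxy model in closed form: with sample weights collected in a diagonal matrix W, the coefficients are theta = (X^T W X)^{-1} X^T W y. The NumPy sketch below mirrors that formula under stated assumptions: the function name `weighted_least_squares` and the synthetic data are hypothetical, and `np.linalg.solve` is used in place of the explicit `torch.inverse` call as a numerically safer substitute.

```python
import numpy as np


def weighted_least_squares(X: np.ndarray, y: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Solve (X^T W X) theta = X^T W y for theta, with W = diag(w)."""
    X_w = X.T * w.reshape(1, -1)  # same as X.T @ diag(w), without building W
    return np.linalg.solve(X_w @ X, X_w @ y)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true  # noiseless labels, so any positive weights recover theta_true
w = rng.uniform(0.5, 2.0, size=50)
theta_hat = weighted_least_squares(X, y, w)
print(theta_hat)
```

Because the solution is a differentiable function of the weights, gradients of the test loss can flow back to the network that produces them.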
98
qlib/contrib/meta/data_selection/utils.py
Normal file
@@ -0,0 +1,98 @@
|
||||
# Copyright (c) Microsoft Corporation.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
import pandas as pd
|
||||
import numpy as np
|
||||
import torch
|
||||
from torch import nn
|
||||
from qlib.contrib.torch import data_to_tensor
|
||||
|
||||
|
||||
class ICLoss(nn.Module):
|
||||
def forward(self, pred, y, idx, skip_size=50):
|
||||
"""forward.
|
||||
|
||||
:param pred:
|
||||
:param y:
|
||||
:param idx: Assume the level of the idx is (date, inst), and it is sorted
|
||||
"""
|
||||
prev = None
|
||||
diff_point = []
|
||||
for i, (date, inst) in enumerate(idx):
|
||||
if date != prev:
|
||||
diff_point.append(i)
|
||||
prev = date
|
||||
diff_point.append(None)
|
||||
|
||||
ic_all = 0.0
|
||||
skip_n = 0
|
||||
for start_i, end_i in zip(diff_point, diff_point[1:]):
|
||||
pred_focus = pred[start_i:end_i] # TODO: just for fake
|
||||
if pred_focus.shape[0] < skip_size:
|
||||
# skip days that have too few stocks.
|
||||
skip_n += 1
|
||||
continue
|
||||
y_focus = y[start_i:end_i]
|
||||
ic_day = torch.dot(
|
||||
(pred_focus - pred_focus.mean()) / np.sqrt(pred_focus.shape[0]) / pred_focus.std(),
|
||||
(y_focus - y_focus.mean()) / np.sqrt(y_focus.shape[0]) / y_focus.std(),
|
||||
)
|
||||
ic_all += ic_day
|
||||
if len(diff_point) - 1 - skip_n <= 0:
|
||||
raise ValueError("Not enough data for calculating IC")
|
||||
ic_mean = ic_all / (len(diff_point) - 1 - skip_n)
|
||||
return -ic_mean # ic loss
|
||||
|
||||
|
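The per-day term in `ICLoss.forward` is close to, but not exactly, the Pearson correlation: each side is divided by `sqrt(n)` and by a standard deviation that, with torch's default `std()`, uses Bessel's correction. The result is the Pearson coefficient scaled by `(n - 1) / n`. The NumPy sketch below checks this numerically; `ddof=1` is used to match torch's default, and the helper name `ic_as_in_loss` is hypothetical.

```python
import numpy as np


def ic_as_in_loss(pred: np.ndarray, y: np.ndarray) -> float:
    """Reproduce the per-day IC expression with a sample (ddof=1) std."""
    n = pred.shape[0]
    p = (pred - pred.mean()) / np.sqrt(n) / pred.std(ddof=1)
    q = (y - y.mean()) / np.sqrt(n) / y.std(ddof=1)
    return float(p @ q)


rng = np.random.default_rng(1)
pred = rng.normal(size=60)
y = 0.3 * pred + rng.normal(size=60)
n = pred.shape[0]
ic = ic_as_in_loss(pred, y)
pearson = float(np.corrcoef(pred, y)[0, 1])
print(ic, pearson * (n - 1) / n)
```

With at least `skip_size=50` stocks per day the factor `(n - 1) / n` is close to 1, so the loss behaves like the negative mean daily Pearson IC.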
||||
def preds_to_weight_with_clamp(preds, clip_weight=None, clip_method="tanh"):
|
||||
"""
|
||||
Clip the weights.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
clip_weight: float
|
||||
The clip threshold.
|
||||
clip_method: str
|
||||
The clip method. Currently available: "clamp", "tanh", and "sigmoid".
|
||||
"""
|
||||
if clip_weight is not None:
|
||||
if clip_method == "clamp":
|
||||
weights = torch.exp(preds)
|
||||
weights = weights.clamp(1.0 / clip_weight, clip_weight)
|
||||
elif clip_method == "tanh":
|
||||
weights = torch.exp(torch.tanh(preds) * np.log(clip_weight))
|
||||
elif clip_method == "sigmoid":
|
||||
# intuitively assume its sum is 1
|
||||
if clip_weight == 0.0:
|
||||
weights = torch.ones_like(preds)
|
||||
else:
|
||||
sm = nn.Sigmoid()
|
||||
weights = sm(preds) * clip_weight # TODO: The clip_weight is useless here.
|
||||
weights = weights / torch.sum(weights) * weights.numel()
|
||||
else:
|
||||
raise ValueError("Unknown clip_method")
|
||||
else:
|
||||
weights = torch.exp(preds)
|
||||
return weights
|
||||
|
||||
|
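Both the "clamp" and "tanh" branches of `preds_to_weight_with_clamp` keep the weights inside `[1 / clip_weight, clip_weight]`: "clamp" cuts `exp(preds)` off at the bounds, while "tanh" approaches them smoothly. A NumPy sketch of the two branches follows; NumPy stands in for torch and the function name `preds_to_weight` is hypothetical.

```python
import numpy as np


def preds_to_weight(preds: np.ndarray, clip_weight: float, method: str = "tanh") -> np.ndarray:
    """Map raw scores to positive weights bounded by clip_weight."""
    if method == "clamp":
        # Hard bounds: the gradient is zero once a weight hits a bound.
        return np.clip(np.exp(preds), 1.0 / clip_weight, clip_weight)
    if method == "tanh":
        # Smooth bounds: tanh squashes scores into (-1, 1) before exponentiation.
        return np.exp(np.tanh(preds) * np.log(clip_weight))
    raise ValueError("Unknown clip_method")


preds = np.array([-10.0, 0.0, 10.0])
w_clamp = preds_to_weight(preds, clip_weight=2.0, method="clamp")
w_tanh = preds_to_weight(preds, clip_weight=2.0, method="tanh")
print(w_clamp, w_tanh)
```

A score of zero maps to a weight of 1 in both branches, so an untrained network that outputs near-zero scores starts close to uniform weighting.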
||||
class SingleMetaBase(nn.Module):
|
||||
def __init__(self, hist_n, clip_weight=None, clip_method="clamp"):
|
||||
# method can be tanh or clamp
|
||||
super().__init__()
|
||||
self.clip_weight = clip_weight
|
||||
if clip_method in ["tanh", "clamp"]:
|
||||
if self.clip_weight is not None and self.clip_weight < 1.0:
|
||||
self.clip_weight = 1 / self.clip_weight
|
||||
self.clip_method = clip_method
|
||||
|
||||
def is_enabled(self):
|
||||
if self.clip_weight is None:
|
||||
return True
|
||||
if self.clip_method == "sigmoid":
|
||||
if self.clip_weight > 0.0:
|
||||
return True
|
||||
else:
|
||||
if self.clip_weight > 1.0:
|
||||
return True
|
||||
return False
|
||||
@@ -11,6 +11,7 @@ from ...model.base import Model
|
||||
from ...data.dataset import DatasetH
|
||||
from ...data.dataset.handler import DataHandlerLP
|
||||
from ...model.interpret.base import FeatureInt
|
||||
from ...data.dataset.weight import Reweighter
|
||||
|
||||
|
||||
class CatBoostModel(Model, FeatureInt):
|
||||
@@ -31,6 +32,7 @@ class CatBoostModel(Model, FeatureInt):
|
||||
early_stopping_rounds=50,
|
||||
verbose_eval=20,
|
||||
evals_result=dict(),
|
||||
reweighter=None,
|
||||
**kwargs
|
||||
):
|
||||
df_train, df_valid = dataset.prepare(
|
||||
@@ -49,8 +51,17 @@ class CatBoostModel(Model, FeatureInt):
|
||||
else:
|
||||
raise ValueError("CatBoost doesn't support multi-label training")
|
||||
|
||||
train_pool = Pool(data=x_train, label=y_train_1d)
|
||||
valid_pool = Pool(data=x_valid, label=y_valid_1d)
|
||||
if reweighter is None:
|
||||
w_train = None
|
||||
w_valid = None
|
||||
elif isinstance(reweighter, Reweighter):
|
||||
w_train = reweighter.reweight(df_train).values
|
||||
w_valid = reweighter.reweight(df_valid).values
|
||||
else:
|
||||
raise ValueError("Unsupported reweighter type.")
|
||||
|
||||
train_pool = Pool(data=x_train, label=y_train_1d, weight=w_train)
|
||||
valid_pool = Pool(data=x_valid, label=y_valid_1d, weight=w_valid)
|
||||
|
||||
# Initialize the catboost model
|
||||
self._params["iterations"] = num_boost_round
|
||||
|
||||
@@ -4,59 +4,73 @@
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
import lightgbm as lgb
|
||||
from typing import Text, Union
|
||||
from typing import List, Text, Tuple, Union
|
||||
from ...model.base import ModelFT
|
||||
from ...data.dataset import DatasetH
|
||||
from ...data.dataset.handler import DataHandlerLP
|
||||
from ...model.interpret.base import LightGBMFInt
|
||||
from ...data.dataset.weight import Reweighter
|
||||
|
||||
|
||||
class LGBModel(ModelFT, LightGBMFInt):
|
||||
"""LightGBM Model"""
|
||||
|
||||
def __init__(self, loss="mse", early_stopping_rounds=50, **kwargs):
|
||||
def __init__(self, loss="mse", early_stopping_rounds=50, num_boost_round=1000, **kwargs):
|
||||
if loss not in {"mse", "binary"}:
|
||||
raise NotImplementedError
|
||||
self.params = {"objective": loss, "verbosity": -1}
|
||||
self.params.update(kwargs)
|
||||
self.early_stopping_rounds = early_stopping_rounds
|
||||
self.num_boost_round = num_boost_round
|
||||
self.model = None
|
||||
|
||||
def _prepare_data(self, dataset: DatasetH):
|
||||
df_train, df_valid = dataset.prepare(
|
||||
["train", "valid"], col_set=["feature", "label"], data_key=DataHandlerLP.DK_L
|
||||
)
|
||||
if df_train.empty or df_valid.empty:
|
||||
def _prepare_data(self, dataset: DatasetH, reweighter=None) -> List[Tuple[lgb.Dataset, str]]:
|
||||
"""
|
||||
The motivation of the current version is to make validation optional
|
||||
- train segment is necessary;
|
||||
"""
|
||||
ds_l = []
|
||||
assert "train" in dataset.segments
|
||||
for key in ["train", "valid"]:
|
||||
if key in dataset.segments:
|
||||
df = dataset.prepare(key, col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
|
||||
if df.empty:
|
||||
raise ValueError("Empty data from dataset, please check your dataset config.")
|
||||
x_train, y_train = df_train["feature"], df_train["label"]
|
||||
x_valid, y_valid = df_valid["feature"], df_valid["label"]
|
||||
x, y = df["feature"], df["label"]
|
||||
|
||||
# LightGBM needs a 1D array as its label
|
||||
if y_train.values.ndim == 2 and y_train.values.shape[1] == 1:
|
||||
y_train, y_valid = np.squeeze(y_train.values), np.squeeze(y_valid.values)
|
||||
if y.values.ndim == 2 and y.values.shape[1] == 1:
|
||||
y = np.squeeze(y.values)
|
||||
else:
|
||||
raise ValueError("LightGBM doesn't support multi-label training")
|
||||
|
||||
dtrain = lgb.Dataset(x_train, label=y_train)
|
||||
dvalid = lgb.Dataset(x_valid, label=y_valid)
|
||||
return dtrain, dvalid
|
||||
if reweighter is None:
|
||||
w = None
|
||||
elif isinstance(reweighter, Reweighter):
|
||||
w = reweighter.reweight(df)
|
||||
else:
|
||||
raise ValueError("Unsupported reweighter type.")
|
||||
ds_l.append((lgb.Dataset(x.values, label=y, weight=w), key))
|
||||
return ds_l
|
||||
|
||||
def fit(
|
||||
self,
|
||||
dataset: DatasetH,
|
||||
num_boost_round=1000,
|
||||
num_boost_round=None,
|
||||
early_stopping_rounds=None,
|
||||
verbose_eval=20,
|
||||
evals_result=dict(),
|
||||
reweighter=None,
|
||||
**kwargs
|
||||
):
|
||||
dtrain, dvalid = self._prepare_data(dataset)
|
||||
ds_l = self._prepare_data(dataset, reweighter)
|
||||
ds, names = list(zip(*ds_l))
|
||||
self.model = lgb.train(
|
||||
self.params,
|
||||
dtrain,
|
||||
num_boost_round=num_boost_round,
|
||||
valid_sets=[dtrain, dvalid],
|
||||
valid_names=["train", "valid"],
|
||||
ds[0], # training dataset
|
||||
num_boost_round=self.num_boost_round if num_boost_round is None else num_boost_round,
|
||||
valid_sets=ds,
|
||||
valid_names=names,
|
||||
early_stopping_rounds=(
|
||||
self.early_stopping_rounds if early_stopping_rounds is None else early_stopping_rounds
|
||||
),
|
||||
@@ -64,8 +78,8 @@ class LGBModel(ModelFT, LightGBMFInt):
|
||||
evals_result=evals_result,
|
||||
**kwargs
|
||||
)
|
||||
evals_result["train"] = list(evals_result["train"].values())[0]
|
||||
evals_result["valid"] = list(evals_result["valid"].values())[0]
|
||||
for k in names:
|
||||
evals_result[k] = list(evals_result[k].values())[0]
|
||||
|
||||
def predict(self, dataset: DatasetH, segment: Union[Text, slice] = "test"):
|
||||
if self.model is None:
|
||||
@@ -73,7 +87,7 @@ class LGBModel(ModelFT, LightGBMFInt):
|
||||
x_test = dataset.prepare(segment, col_set="feature", data_key=DataHandlerLP.DK_I)
|
||||
return pd.Series(self.model.predict(x_test.values), index=x_test.index)
|
||||
|
||||
def finetune(self, dataset: DatasetH, num_boost_round=10, verbose_eval=20):
|
||||
def finetune(self, dataset: DatasetH, num_boost_round=10, verbose_eval=20, reweighter=None):
|
||||
"""
|
||||
finetune model
|
||||
|
||||
@@ -87,7 +101,7 @@ class LGBModel(ModelFT, LightGBMFInt):
|
||||
verbose level
|
||||
"""
|
||||
# Finetune the existing model by training for more rounds
|
||||
dtrain, _ = self._prepare_data(dataset)
|
||||
dtrain, _ = self._prepare_data(dataset, reweighter)
|
||||
if dtrain.empty:
|
||||
raise ValueError("Empty data from dataset, please check your dataset config.")
|
||||
self.model = lgb.train(
|
||||
|
||||
@@ -4,6 +4,7 @@
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
from typing import Text, Union
|
||||
from qlib.data.dataset.weight import Reweighter
|
||||
from scipy.optimize import nnls
|
||||
from sklearn.linear_model import LinearRegression, Ridge, Lasso
|
||||
|
||||
@@ -49,33 +50,40 @@ class LinearModel(Model):
|
||||
|
||||
self.coef_ = None
|
||||
|
||||
def fit(self, dataset: DatasetH):
|
||||
def fit(self, dataset: DatasetH, reweighter: Reweighter = None):
|
||||
df_train = dataset.prepare("train", col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
|
||||
if df_train.empty:
|
||||
raise ValueError("Empty data from dataset, please check your dataset config.")
|
||||
if reweighter is not None:
|
||||
w: pd.Series = reweighter.reweight(df_train)
|
||||
w = w.values
|
||||
else:
|
||||
w = None
|
||||
X, y = df_train["feature"].values, np.squeeze(df_train["label"].values)
|
||||
|
||||
if self.estimator in [self.OLS, self.RIDGE, self.LASSO]:
|
||||
self._fit(X, y)
|
||||
self._fit(X, y, w)
|
||||
elif self.estimator == self.NNLS:
|
||||
self._fit_nnls(X, y)
|
||||
self._fit_nnls(X, y, w)
|
||||
else:
|
||||
raise ValueError(f"unknown estimator `{self.estimator}`")
|
||||
|
||||
return self
|
||||
|
||||
def _fit(self, X, y):
|
||||
def _fit(self, X, y, w):
|
||||
if self.estimator == self.OLS:
|
||||
model = LinearRegression(fit_intercept=self.fit_intercept, copy_X=False)
|
||||
else:
|
||||
model = {self.RIDGE: Ridge, self.LASSO: Lasso}[self.estimator](
|
||||
alpha=self.alpha, fit_intercept=self.fit_intercept, copy_X=False
|
||||
)
|
||||
model.fit(X, y)
|
||||
model.fit(X, y, sample_weight=w)
|
||||
self.coef_ = model.coef_
|
||||
self.intercept_ = model.intercept_
|
||||
|
||||
def _fit_nnls(self, X, y):
|
||||
def _fit_nnls(self, X, y, w=None):
|
||||
if w is not None:
|
||||
raise NotImplementedError("TODO: support nnls with weight") # TODO
|
||||
if self.fit_intercept:
|
||||
X = np.c_[X, np.ones(len(X))] # NOTE: mem copy
|
||||
coef = nnls(X, y)[0]
|
||||
|
||||
@@ -22,6 +22,8 @@ from .pytorch_utils import count_parameters
|
||||
from ...model.base import Model
|
||||
from ...data.dataset import DatasetH, TSDatasetH
|
||||
from ...data.dataset.handler import DataHandlerLP
|
||||
from ...model.utils import ConcatDataset
|
||||
from ...data.dataset.weight import Reweighter
|
||||
|
||||
|
||||
class ALSTM(Model):
|
||||
@@ -139,15 +141,18 @@ class ALSTM(Model):
|
||||
def use_gpu(self):
|
||||
return self.device != torch.device("cpu")
|
||||
|
||||
def mse(self, pred, label):
|
||||
loss = (pred - label) ** 2
|
||||
def mse(self, pred, label, weight):
|
||||
loss = weight * (pred - label) ** 2
|
||||
return torch.mean(loss)
|
||||
|
||||
def loss_fn(self, pred, label):
|
||||
def loss_fn(self, pred, label, weight=None):
|
||||
mask = ~torch.isnan(label)
|
||||
|
||||
if weight is None:
|
||||
weight = torch.ones_like(label)
|
||||
|
||||
if self.loss == "mse":
|
||||
return self.mse(pred[mask], label[mask])
|
||||
return self.mse(pred[mask], label[mask], weight[mask])
|
||||
|
||||
raise ValueError("unknown loss `%s`" % self.loss)
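The weighted loss above multiplies each squared error by its sample weight after dropping samples whose label is NaN. A rough pure-Python sketch of the same arithmetic, assuming lists instead of tensors; `masked_weighted_mse` is a hypothetical helper name, not a Qlib function:

```python
import math
from typing import List, Optional


def masked_weighted_mse(
    pred: List[float], label: List[float], weight: Optional[List[float]] = None
) -> float:
    """Mean of weight * (pred - label) ** 2 over samples with a non-NaN label."""
    if weight is None:
        weight = [1.0] * len(label)  # default: every sample counts equally
    terms = [
        w * (p - y) ** 2
        for p, y, w in zip(pred, label, weight)
        if not math.isnan(y)  # mask out samples without a label
    ]
    if not terms:
        return float("nan")  # nothing left after masking
    return sum(terms) / len(terms)


pred = [1.0, 2.0, 3.0]
label = [1.0, float("nan"), 5.0]
unweighted = masked_weighted_mse(pred, label)
weighted = masked_weighted_mse(pred, label, weight=[1.0, 1.0, 0.5])
print(unweighted, weighted)
```

Note that the mean divides by the number of kept samples, not by the sum of the weights, so scaling all weights also scales the loss.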
|
||||
|
||||
@@ -164,12 +169,12 @@ class ALSTM(Model):
|
||||
|
||||
self.ALSTM_model.train()
|
||||
|
||||
for data in data_loader:
|
||||
for (data, weight) in data_loader:
|
||||
feature = data[:, :, 0:-1].to(self.device)
|
||||
label = data[:, -1, -1].to(self.device)
|
||||
|
||||
pred = self.ALSTM_model(feature.float())
|
||||
loss = self.loss_fn(pred, label)
|
||||
loss = self.loss_fn(pred, label, weight.to(self.device))
|
||||
|
||||
self.train_optimizer.zero_grad()
|
||||
loss.backward()
|
||||
@@ -183,7 +188,7 @@ class ALSTM(Model):
|
||||
scores = []
|
||||
losses = []
|
||||
|
||||
for data in data_loader:
|
||||
for (data, weight) in data_loader:
|
||||
|
||||
feature = data[:, :, 0:-1].to(self.device)
|
||||
# feature[torch.isnan(feature)] = 0
|
||||
@@ -191,7 +196,7 @@ class ALSTM(Model):
|
||||
|
||||
with torch.no_grad():
|
||||
pred = self.ALSTM_model(feature.float())
|
||||
loss = self.loss_fn(pred, label)
|
||||
loss = self.loss_fn(pred, label, weight.to(self.device))
|
||||
losses.append(loss.item())
|
||||
|
||||
score = self.metric_fn(pred, label)
|
||||
@@ -204,6 +209,7 @@ class ALSTM(Model):
|
||||
dataset,
|
||||
evals_result=dict(),
|
||||
save_path=None,
|
||||
reweighter=None,
|
||||
):
|
||||
dl_train = dataset.prepare("train", col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
|
||||
dl_valid = dataset.prepare("valid", col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
|
||||
@@ -213,11 +219,28 @@ class ALSTM(Model):
|
||||
dl_train.config(fillna_type="ffill+bfill") # process nan brought by dataloader
|
||||
dl_valid.config(fillna_type="ffill+bfill") # process nan brought by dataloader
|
||||
|
||||
if reweighter is None:
|
||||
wl_train = np.ones(len(dl_train))
|
||||
wl_valid = np.ones(len(dl_valid))
|
||||
elif isinstance(reweighter, Reweighter):
|
||||
wl_train = reweighter.reweight(dl_train)
|
||||
wl_valid = reweighter.reweight(dl_valid)
|
||||
else:
|
||||
raise ValueError("Unsupported reweighter type.")
|
||||
|
||||
train_loader = DataLoader(
|
||||
dl_train, batch_size=self.batch_size, shuffle=True, num_workers=self.n_jobs, drop_last=True
|
||||
ConcatDataset(dl_train, wl_train),
|
||||
batch_size=self.batch_size,
|
||||
shuffle=True,
|
||||
num_workers=self.n_jobs,
|
||||
drop_last=True,
|
||||
)
|
||||
valid_loader = DataLoader(
|
||||
dl_valid, batch_size=self.batch_size, shuffle=False, num_workers=self.n_jobs, drop_last=True
|
||||
ConcatDataset(dl_valid, wl_valid),
|
||||
batch_size=self.batch_size,
|
||||
shuffle=False,
|
||||
num_workers=self.n_jobs,
|
||||
drop_last=True,
|
||||
)
|
||||
|
||||
save_path = get_or_create_path(save_path)
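The loaders above wrap the samples and their weights together so that each iteration yields a `(data, weight)` pair. The following sketch illustrates that pairing idea with a hypothetical `PairedDataset` class; it is an assumption about the general pattern, not the actual `ConcatDataset` implementation from `qlib.model.utils`:

```python
class PairedDataset:
    """Hypothetical stand-in: zips several indexable sources sample by sample."""

    def __init__(self, *sources):
        self.sources = sources

    def __len__(self):
        # The shortest source bounds the number of usable samples
        return min(len(s) for s in self.sources)

    def __getitem__(self, i):
        # One item per source, returned together as a tuple
        return tuple(s[i] for s in self.sources)


samples = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights = [1.0, 0.5, 0.25]
paired = PairedDataset(samples, weights)

for data, weight in (paired[i] for i in range(len(paired))):
    print(data, weight)
```

When no reweighter is given, passing a vector of ones as the second source keeps the loop body identical for the weighted and unweighted cases.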
|
||||
|
||||
@@ -21,6 +21,8 @@ from .pytorch_utils import count_parameters
|
||||
from ...model.base import Model
|
||||
from ...data.dataset import DatasetH, TSDatasetH
|
||||
from ...data.dataset.handler import DataHandlerLP
|
||||
from ...model.utils import ConcatDataset
|
||||
from ...data.dataset.weight import Reweighter
|
||||
|
||||
|
||||
class GRU(Model):
|
||||
@@ -138,15 +140,18 @@ class GRU(Model):
|
||||
def use_gpu(self):
|
||||
return self.device != torch.device("cpu")
|
||||
|
||||
def mse(self, pred, label):
|
||||
loss = (pred - label) ** 2
|
||||
def mse(self, pred, label, weight):
|
||||
loss = weight * (pred - label) ** 2
|
||||
return torch.mean(loss)
|
||||
|
||||
def loss_fn(self, pred, label):
|
||||
def loss_fn(self, pred, label, weight=None):
|
||||
mask = ~torch.isnan(label)
|
||||
|
||||
if weight is None:
|
||||
weight = torch.ones_like(label)
|
||||
|
||||
if self.loss == "mse":
|
||||
return self.mse(pred[mask], label[mask])
|
||||
return self.mse(pred[mask], label[mask], weight[mask])
|
||||
|
||||
raise ValueError("unknown loss `%s`" % self.loss)
|
||||
|
||||
@@ -163,12 +168,12 @@ class GRU(Model):
|
||||
|
||||
self.GRU_model.train()
|
||||
|
||||
for data in data_loader:
|
||||
for (data, weight) in data_loader:
|
||||
feature = data[:, :, 0:-1].to(self.device)
|
||||
label = data[:, -1, -1].to(self.device)
|
||||
|
||||
pred = self.GRU_model(feature.float())
|
||||
loss = self.loss_fn(pred, label)
|
||||
loss = self.loss_fn(pred, label, weight.to(self.device))
|
||||
|
||||
self.train_optimizer.zero_grad()
|
||||
loss.backward()
|
||||
@@ -182,7 +187,7 @@ class GRU(Model):
|
||||
scores = []
|
||||
losses = []
|
||||
|
||||
for data in data_loader:
|
||||
for (data, weight) in data_loader:
|
||||
|
||||
feature = data[:, :, 0:-1].to(self.device)
|
||||
# feature[torch.isnan(feature)] = 0
|
||||
@@ -190,7 +195,7 @@ class GRU(Model):
|
||||
|
||||
with torch.no_grad():
|
||||
pred = self.GRU_model(feature.float())
|
||||
loss = self.loss_fn(pred, label)
|
||||
loss = self.loss_fn(pred, label, weight.to(self.device))
|
||||
losses.append(loss.item())
|
||||
|
||||
score = self.metric_fn(pred, label)
|
||||
@@ -203,6 +208,7 @@ class GRU(Model):
|
||||
dataset,
|
||||
evals_result=dict(),
|
||||
save_path=None,
|
||||
reweighter=None,
|
||||
):
|
||||
dl_train = dataset.prepare("train", col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
|
||||
dl_valid = dataset.prepare("valid", col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
|
||||
@@ -212,11 +218,28 @@ class GRU(Model):
|
||||
dl_train.config(fillna_type="ffill+bfill") # process nan brought by dataloader
|
||||
dl_valid.config(fillna_type="ffill+bfill") # process nan brought by dataloader
|
||||
|
||||
if reweighter is None:
|
||||
wl_train = np.ones(len(dl_train))
|
||||
wl_valid = np.ones(len(dl_valid))
|
||||
elif isinstance(reweighter, Reweighter):
|
||||
wl_train = reweighter.reweight(dl_train)
|
||||
wl_valid = reweighter.reweight(dl_valid)
|
||||
else:
|
||||
raise ValueError("Unsupported reweighter type.")
|
||||
|
||||
train_loader = DataLoader(
|
||||
dl_train, batch_size=self.batch_size, shuffle=True, num_workers=self.n_jobs, drop_last=True
|
||||
ConcatDataset(dl_train, wl_train),
|
||||
batch_size=self.batch_size,
|
||||
shuffle=True,
|
||||
num_workers=self.n_jobs,
|
||||
drop_last=True,
|
||||
)
|
||||
valid_loader = DataLoader(
|
||||
dl_valid, batch_size=self.batch_size, shuffle=False, num_workers=self.n_jobs, drop_last=True
|
||||
ConcatDataset(dl_valid, wl_valid),
|
||||
batch_size=self.batch_size,
|
||||
shuffle=False,
|
||||
num_workers=self.n_jobs,
|
||||
drop_last=True,
|
||||
)
|
||||
|
||||
save_path = get_or_create_path(save_path)
|
||||
|
||||
@@ -20,6 +20,8 @@ from torch.utils.data import DataLoader
|
||||
from ...model.base import Model
|
||||
from ...data.dataset import DatasetH, TSDatasetH
|
||||
from ...data.dataset.handler import DataHandlerLP
|
||||
from ...model.utils import ConcatDataset
|
||||
from ...data.dataset.weight import Reweighter
|
||||
|
||||
|
||||
class LSTM(Model):
|
||||
@@ -134,15 +136,18 @@ class LSTM(Model):
|
||||
def use_gpu(self):
|
||||
return self.device != torch.device("cpu")
|
||||
|
||||
def mse(self, pred, label):
|
||||
loss = (pred - label) ** 2
|
||||
def mse(self, pred, label, weight):
|
||||
loss = weight * (pred - label) ** 2
|
||||
return torch.mean(loss)
|
||||
|
||||
def loss_fn(self, pred, label):
|
||||
mask = ~torch.isnan(label)
|
||||
|
||||
if weight is None:
|
||||
weight = torch.ones_like(label)
|
||||
|
||||
if self.loss == "mse":
|
||||
return self.mse(pred[mask], label[mask])
|
||||
return self.mse(pred[mask], label[mask], weight[mask])
|
||||
|
||||
raise ValueError("unknown loss `%s`" % self.loss)
|
||||
|
||||
@@ -159,12 +164,12 @@ class LSTM(Model):
|
||||
|
||||
self.LSTM_model.train()
|
||||
|
||||
for data in data_loader:
|
||||
for (data, weight) in data_loader:
|
||||
feature = data[:, :, 0:-1].to(self.device)
|
||||
label = data[:, -1, -1].to(self.device)
|
||||
|
||||
pred = self.LSTM_model(feature.float())
|
||||
loss = self.loss_fn(pred, label)
|
||||
loss = self.loss_fn(pred, label, weight.to(self.device))
|
||||
|
||||
self.train_optimizer.zero_grad()
|
||||
loss.backward()
|
||||
@@ -178,14 +183,14 @@ class LSTM(Model):
|
||||
scores = []
|
||||
losses = []
|
||||
|
||||
for data in data_loader:
|
||||
for (data, weight) in data_loader:
|
||||
|
||||
feature = data[:, :, 0:-1].to(self.device)
|
||||
# feature[torch.isnan(feature)] = 0
|
||||
label = data[:, -1, -1].to(self.device)
|
||||
|
||||
pred = self.LSTM_model(feature.float())
|
||||
loss = self.loss_fn(pred, label)
|
||||
loss = self.loss_fn(pred, label, weight.to(self.device))
|
||||
losses.append(loss.item())
|
||||
|
||||
score = self.metric_fn(pred, label)
|
||||
@@ -198,6 +203,7 @@ class LSTM(Model):
|
||||
dataset,
|
||||
evals_result=dict(),
|
||||
save_path=None,
|
||||
reweighter=None,
|
||||
):
|
||||
dl_train = dataset.prepare("train", col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
|
||||
dl_valid = dataset.prepare("valid", col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
|
||||
@@ -207,11 +213,28 @@ class LSTM(Model):
|
||||
dl_train.config(fillna_type="ffill+bfill") # process nan brought by dataloader
|
||||
dl_valid.config(fillna_type="ffill+bfill") # process nan brought by dataloader
|
||||
|
||||
if reweighter is None:
|
||||
wl_train = np.ones(len(dl_train))
|
||||
wl_valid = np.ones(len(dl_valid))
|
||||
elif isinstance(reweighter, Reweighter):
|
||||
wl_train = reweighter.reweight(dl_train)
|
||||
wl_valid = reweighter.reweight(dl_valid)
|
||||
else:
|
||||
raise ValueError("Unsupported reweighter type.")
|
||||
|
||||
train_loader = DataLoader(
|
||||
dl_train, batch_size=self.batch_size, shuffle=True, num_workers=self.n_jobs, drop_last=True
|
||||
ConcatDataset(dl_train, wl_train),
|
||||
batch_size=self.batch_size,
|
||||
shuffle=True,
|
||||
num_workers=self.n_jobs,
|
||||
drop_last=True,
|
||||
)
|
||||
valid_loader = DataLoader(
|
||||
dl_valid, batch_size=self.batch_size, shuffle=False, num_workers=self.n_jobs, drop_last=True
|
||||
ConcatDataset(dl_valid, wl_valid),
|
||||
batch_size=self.batch_size,
|
||||
shuffle=False,
|
||||
num_workers=self.n_jobs,
|
||||
drop_last=True,
|
||||
)
|
||||
|
||||
save_path = get_or_create_path(save_path)
|
||||
|
||||
@@ -19,6 +19,7 @@ from .pytorch_utils import count_parameters
|
||||
from ...model.base import Model
|
||||
from ...data.dataset import DatasetH
|
||||
from ...data.dataset.handler import DataHandlerLP
|
||||
from ...data.dataset.weight import Reweighter
|
||||
from ...utils import unpack_archive_with_buffer, save_multiple_parts_file, get_or_create_path
|
||||
from ...log import get_module_logger
|
||||
from ...workflow import R
|
||||
@@ -166,18 +167,22 @@ class DNNModelPytorch(Model):
|
||||
evals_result=dict(),
|
||||
verbose=True,
|
||||
save_path=None,
|
||||
reweighter=None,
|
||||
):
|
||||
df_train, df_valid = dataset.prepare(
|
||||
["train", "valid"], col_set=["feature", "label"], data_key=DataHandlerLP.DK_L
|
||||
)
|
||||
x_train, y_train = df_train["feature"], df_train["label"]
|
||||
x_valid, y_valid = df_valid["feature"], df_valid["label"]
|
||||
try:
|
||||
wdf_train, wdf_valid = dataset.prepare(["train", "valid"], col_set=["weight"], data_key=DataHandlerLP.DK_L)
|
||||
w_train, w_valid = wdf_train["weight"], wdf_valid["weight"]
|
||||
except KeyError as e:
|
||||
|
||||
if reweighter is None:
|
||||
w_train = pd.DataFrame(np.ones_like(y_train.values), index=y_train.index)
|
||||
w_valid = pd.DataFrame(np.ones_like(y_valid.values), index=y_valid.index)
|
||||
elif isinstance(reweighter, Reweighter):
|
||||
w_train = pd.DataFrame(reweighter.reweight(df_train))
|
||||
w_valid = pd.DataFrame(reweighter.reweight(df_valid))
|
||||
else:
|
||||
raise ValueError("Unsupported reweighter type.")
|
||||
|
||||
save_path = get_or_create_path(save_path)
|
||||
stop_steps = 0
|
||||
|
||||
@@ -9,6 +9,7 @@ from ...model.base import Model
|
||||
from ...data.dataset import DatasetH
|
||||
from ...data.dataset.handler import DataHandlerLP
|
||||
from ...model.interpret.base import FeatureInt
|
||||
from ...data.dataset.weight import Reweighter
|
||||
|
||||
|
||||
class XGBModel(Model, FeatureInt):
|
||||
@@ -26,6 +27,7 @@ class XGBModel(Model, FeatureInt):
|
||||
early_stopping_rounds=50,
|
||||
verbose_eval=20,
|
||||
evals_result=dict(),
|
||||
reweighter=None,
|
||||
**kwargs
|
||||
):
|
||||
|
||||
@@ -43,8 +45,17 @@ class XGBModel(Model, FeatureInt):
|
||||
else:
|
||||
raise ValueError("XGBoost doesn't support multi-label training")
|
||||
|
||||
dtrain = xgb.DMatrix(x_train, label=y_train_1d)
|
||||
dvalid = xgb.DMatrix(x_valid, label=y_valid_1d)
|
||||
if reweighter is None:
|
||||
w_train = None
|
||||
w_valid = None
|
||||
elif isinstance(reweighter, Reweighter):
|
||||
w_train = reweighter.reweight(df_train)
|
||||
w_valid = reweighter.reweight(df_valid)
|
||||
else:
|
||||
raise ValueError("Unsupported reweighter type.")
|
||||
|
||||
dtrain = xgb.DMatrix(x_train.values, label=y_train_1d, weight=w_train)
|
||||
dvalid = xgb.DMatrix(x_valid.values, label=y_valid_1d, weight=w_valid)
|
||||
self.model = xgb.train(
|
||||
self._params,
|
||||
dtrain=dtrain,
|
||||
|
||||
@@ -124,6 +124,10 @@ class TopkDropoutStrategy(BaseSignalStrategy):
|
||||
trade_start_time, trade_end_time = self.trade_calendar.get_step_time(trade_step)
|
||||
pred_start_time, pred_end_time = self.trade_calendar.get_step_time(trade_step, shift=1)
|
||||
pred_score = self.signal.get_signal(start_time=pred_start_time, end_time=pred_end_time)
|
||||
# NOTE: the current version of the topk dropout strategy can't handle a pd.DataFrame (multiple signals)
|
||||
# So it only leverages the first column of the signal
|
||||
if isinstance(pred_score, pd.DataFrame):
|
||||
pred_score = pred_score.iloc[:, 0]
|
||||
if pred_score is None:
|
||||
return TradeDecisionWO([], self)
|
||||
if self.only_tradable:
|
||||
|
||||
31
qlib/contrib/torch.py
Normal file
@@ -0,0 +1,31 @@
|
||||
# Copyright (c) Microsoft Corporation.
|
||||
# Licensed under the MIT License.
|
||||
"""
|
||||
This module is not a necessary part of Qlib.
|
||||
It only provides some tools for convenience.
|
||||
It should not be imported into the core part of Qlib.
|
||||
"""
|
||||
import torch
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
|
||||
|
||||
def data_to_tensor(data, device="cpu", raise_error=False):
|
||||
if isinstance(data, torch.Tensor):
|
||||
if device == "cpu":
|
||||
return data.cpu()
|
||||
else:
|
||||
return data.to(device)
|
||||
if isinstance(data, (pd.DataFrame, pd.Series)):
|
||||
return data_to_tensor(torch.from_numpy(data.values).float(), device)
|
||||
elif isinstance(data, np.ndarray):
|
||||
return data_to_tensor(torch.from_numpy(data).float(), device)
|
||||
elif isinstance(data, (tuple, list)):
|
||||
return [data_to_tensor(i, device) for i in data]
|
||||
elif isinstance(data, dict):
|
||||
return {k: data_to_tensor(v, device) for k, v in data.items()}
|
||||
else:
|
||||
if raise_error:
|
||||
raise ValueError(f"Unsupported data type: {type(data)}.")
|
||||
else:
|
||||
return data
|
||||
@@ -1,5 +1,5 @@
|
||||
from ...utils.serial import Serializable
|
||||
from typing import Union, List, Tuple, Dict, Text, Optional
|
||||
from typing import Callable, Union, List, Tuple, Dict, Text, Optional
|
||||
from ...utils import init_instance_by_config, np_ffill, time_to_slc_point
|
||||
from ...log import get_module_logger
|
||||
from .handler import DataHandler, DataHandlerLP
|
||||
@@ -235,6 +235,28 @@ class DatasetH(Dataset):
|
||||
else:
|
||||
raise NotImplementedError(f"This type of input is not supported")
|
||||
|
||||
# helper functions
|
||||
@staticmethod
|
||||
def get_min_time(segments):
|
||||
return DatasetH._get_extrema(segments, 0, (lambda a, b: a > b))
|
||||
|
||||
@staticmethod
|
||||
def get_max_time(segments):
|
||||
return DatasetH._get_extrema(segments, 1, (lambda a, b: a < b))
|
||||
|
||||
@staticmethod
|
||||
def _get_extrema(segments, idx: int, cmp: Callable, key_func=pd.Timestamp):
|
||||
"""it will act like sort and return the max value or None"""
|
||||
candidate = None
|
||||
for k, seg in segments.items():
|
||||
point = seg[idx]
|
||||
if point is None:
|
||||
# None indicates unbounded, return directly
|
||||
return None
|
||||
elif candidate is None or cmp(key_func(candidate), key_func(point)):
|
||||
candidate = point
|
||||
return candidate
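The comparison lambdas decide which endpoint wins: `a > b` keeps the earliest start and `a < b` keeps the latest end, while a `None` endpoint short-circuits to `None`. A standalone sketch of that behavior, assuming `datetime.fromisoformat` as a stand-in for `pd.Timestamp`; `get_extrema` and the sample segments are illustrative only:

```python
from datetime import datetime
from typing import Callable, Dict, Optional, Tuple

Segment = Tuple[Optional[str], Optional[str]]


def get_extrema(
    segments: Dict[str, Segment],
    idx: int,
    cmp: Callable,
    key_func: Callable = datetime.fromisoformat,
) -> Optional[str]:
    """Return the endpoint at `idx` preferred by `cmp`, or None if any is unbounded."""
    candidate = None
    for seg in segments.values():
        point = seg[idx]
        if point is None:
            return None  # an unbounded endpoint makes the extremum unbounded
        if candidate is None or cmp(key_func(candidate), key_func(point)):
            candidate = point  # `point` beats the current candidate
    return candidate


segments = {
    "train": ("2008-01-01", "2014-12-31"),
    "valid": ("2015-01-01", "2016-12-31"),
    "test": ("2017-01-01", "2020-08-01"),
}
min_time = get_extrema(segments, 0, lambda a, b: a > b)
max_time = get_extrema(segments, 1, lambda a, b: a < b)
unbounded = get_extrema({"train": ("2008-01-01", None)}, 1, lambda a, b: a < b)
print(min_time, max_time, unbounded)
```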
|
||||
|
||||
|
||||
class TSDataSampler:
|
||||
"""
|
||||
|
||||
@@ -2,6 +2,8 @@
|
||||
# Licensed under the MIT License.
|
||||
|
||||
import abc
|
||||
import pickle
|
||||
from pathlib import Path
|
||||
import warnings
|
||||
import pandas as pd
|
||||
|
||||
@@ -10,6 +12,7 @@ from typing import Tuple, Union, List
|
||||
from qlib.data import D
|
||||
from qlib.utils import load_dataset, init_instance_by_config, time_to_slc_point
|
||||
from qlib.log import get_module_logger
|
||||
from qlib.utils.serial import Serializable
|
||||
|
||||
|
||||
class DataLoader(abc.ABC):
|
||||
@@ -216,12 +219,14 @@ class QlibDataLoader(DLWParser):
|
||||
return df
|
||||
|
||||
|
||||
class StaticDataLoader(DataLoader):
|
||||
class StaticDataLoader(DataLoader, Serializable):
|
||||
"""
|
||||
DataLoader that supports loading data from file or as provided.
|
||||
"""
|
||||
|
||||
def __init__(self, config: dict, join="outer"):
|
||||
include_attr = ["_config"]
|
||||
|
||||
def __init__(self, config: Union[dict, str], join="outer"):
|
||||
"""
|
||||
Parameters
|
||||
----------
|
||||
@@ -230,7 +235,7 @@ class StaticDataLoader(DataLoader):
|
||||
join : str
|
||||
How to align different dataframes
|
||||
"""
|
||||
self.config = config
|
||||
self._config = config  # using "_" to avoid a conflict with the method `config` of Serializable
|
||||
self.join = join
|
||||
self._data = None
|
||||
|
||||
@@ -254,12 +259,16 @@ class StaticDataLoader(DataLoader):
|
||||
def _maybe_load_raw_data(self):
|
||||
if self._data is not None:
|
||||
return
|
||||
if isinstance(self._config, dict):
|
||||
self._data = pd.concat(
|
||||
{fields_group: load_dataset(path_or_obj) for fields_group, path_or_obj in self.config.items()},
|
||||
{fields_group: load_dataset(path_or_obj) for fields_group, path_or_obj in self._config.items()},
|
||||
axis=1,
|
||||
join=self.join,
|
||||
)
|
||||
self._data.sort_index(inplace=True)
|
||||
elif isinstance(self._config, (str, Path)):
|
||||
with Path(self._config).open("rb") as f:
|
||||
self._data = pickle.load(f)
|
||||
|
||||
|
||||
class DataLoaderDH(DataLoader):
|
||||
|
||||
@@ -6,6 +6,7 @@ from typing import Union, Text
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
|
||||
from qlib.utils.data import robust_zscore, zscore
|
||||
from ...constant import EPS
|
||||
from .utils import fetch_df_by_index
|
||||
from ...utils.serial import Serializable
|
||||
@@ -293,14 +294,22 @@ class RobustZScoreNorm(Processor):
|
||||
class CSZScoreNorm(Processor):
|
||||
"""Cross Sectional ZScore Normalization"""
|
||||
|
||||
def __init__(self, fields_group=None):
|
||||
def __init__(self, fields_group=None, method="zscore"):
|
||||
self.fields_group = fields_group
|
||||
if method == "zscore":
|
||||
self.zscore_func = zscore
|
||||
elif method == "robust":
|
||||
self.zscore_func = robust_zscore
|
||||
else:
|
||||
raise NotImplementedError(f"This type of input is not supported")
|
||||
|
||||
def __call__(self, df):
|
||||
# try not to modify the original dataframe
|
||||
cols = get_group_columns(df, self.fields_group)
|
||||
df[cols] = df[cols].groupby("datetime").apply(lambda x: (x - x.mean()).div(x.std()))
|
||||
|
||||
if not isinstance(self.fields_group, list):
|
||||
self.fields_group = [self.fields_group]
|
||||
for g in self.fields_group:
|
||||
cols = get_group_columns(df, g)
|
||||
df[cols] = df[cols].groupby("datetime").apply(self.zscore_func)
|
||||
return df
|
||||
|
||||
|
||||
|
||||
@@ -1,8 +1,13 @@
|
||||
# Copyright (c) Microsoft Corporation.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
from __future__ import annotations
|
||||
import pandas as pd
|
||||
from typing import Union, List
|
||||
from qlib.utils import init_instance_by_config
|
||||
from typing import TYPE_CHECKING
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from qlib.data.dataset import DataHandler
|
||||
|
||||
|
||||
def get_level_index(df: pd.DataFrame, level=Union[str, int]) -> int:
|
||||
@@ -111,3 +116,28 @@ def convert_index_format(df: Union[pd.DataFrame, pd.Series], level: str = "datet
|
||||
if get_level_index(df, level=level) == 1:
|
||||
df = df.swaplevel().sort_index()
|
||||
return df
|
||||
|
||||
|
||||
def init_task_handler(task: dict) -> Union[DataHandler, None]:
|
||||
"""
|
||||
initialize the handler part of the task **inplace**
|
||||
|
||||
Parameters
|
||||
----------
|
||||
task : dict
|
||||
the task to be handled
|
||||
|
||||
Returns
|
||||
-------
|
||||
Union[DataHandler, None]:
|
||||
the initialized handler, or None if the task defines no handler
|
||||
"""
|
||||
# avoid recursive import
|
||||
from .handler import DataHandler
|
||||
|
||||
h_conf = task["dataset"]["kwargs"].get("handler")
|
||||
if h_conf is not None:
|
||||
handler = init_instance_by_config(h_conf, accept_types=DataHandler)
|
||||
task["dataset"]["kwargs"]["handler"] = handler
|
||||
|
||||
return handler
|
||||
|
||||
34
qlib/data/dataset/weight.py
Normal file
@@ -0,0 +1,34 @@
|
||||
# Copyright (c) Microsoft Corporation.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
import pandas as pd
|
||||
import numpy as np
|
||||
from typing import Union, List, Tuple
|
||||
from ...data.dataset import TSDataSampler
|
||||
from ...data.dataset.utils import get_level_index
|
||||
from ...utils import lazy_sort_index
|
||||
|
||||
|
||||
class Reweighter:
|
||||
def __init__(self, *args, **kwargs):
|
||||
"""
|
||||
To initialize the Reweighter, users should provide specific methods to let reweighter do the reweighting (such as sample-wise, rule-based).
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
def reweight(self, data: object) -> object:
|
||||
"""
|
||||
Get weights for data
|
||||
|
||||
Parameters
|
||||
----------
|
||||
data : object
|
||||
The input data.
|
||||
The first dimension is the index of samples
|
||||
|
||||
Returns
|
||||
-------
|
||||
object:
|
||||
the weights info for the data
|
||||
"""
|
||||
raise NotImplementedError(f"This type of input is not supported")
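Since the base `__init__` raises `NotImplementedError`, a concrete reweighter has to override both `__init__` and `reweight`. The sketch below shows one possible rule-based subclass; the `Reweighter` class here is a local stand-in mirroring the interface above, and `LinearDecayReweighter` with its `min_weight` parameter is a hypothetical example rather than anything shipped with Qlib:

```python
class Reweighter:
    """Local stand-in mirroring the interface of the base class."""

    def __init__(self, *args, **kwargs):
        raise NotImplementedError()

    def reweight(self, data):
        raise NotImplementedError("This type of input is not supported")


class LinearDecayReweighter(Reweighter):
    """Hypothetical rule-based reweighter: later samples get larger weights."""

    def __init__(self, min_weight: float = 0.5):
        # Deliberately does not call super().__init__(), which always raises
        self.min_weight = min_weight

    def reweight(self, data):
        n = len(data)  # the first dimension indexes the samples
        if n == 1:
            return [1.0]
        step = (1.0 - self.min_weight) / (n - 1)
        # Weights grow linearly from `min_weight` (oldest) to 1.0 (newest)
        return [self.min_weight + i * step for i in range(n)]


reweighter = LinearDecayReweighter(min_weight=0.5)
weights = reweighter.reweight(["2020-01", "2020-02", "2020-03"])
print(weights)
```

An instance like this passes the `isinstance(reweighter, Reweighter)` check that the model `fit` methods perform before calling `reweight`.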
|
||||
@@ -4,6 +4,7 @@ import abc
|
||||
from typing import Text, Union
|
||||
from ..utils.serial import Serializable
|
||||
from ..data.dataset import Dataset
|
||||
from ..data.dataset.weight import Reweighter
|
||||
|
||||
|
||||
class BaseModel(Serializable, metaclass=abc.ABCMeta):
|
||||
@@ -22,7 +23,7 @@ class BaseModel(Serializable, metaclass=abc.ABCMeta):
|
||||
class Model(BaseModel):
|
||||
"""Learnable Models"""
|
||||
|
||||
def fit(self, dataset: Dataset):
|
||||
def fit(self, dataset: Dataset, reweighter: Reweighter):
|
||||
"""
|
||||
Learn model from the base model
|
||||
|
||||
|
||||
@@ -107,6 +107,8 @@ class RollingGroup(Group):
|
||||
for key, values in rolling_dict.items():
|
||||
if isinstance(key, tuple):
|
||||
grouped_dict.setdefault(key[:-1], {})[key[-1]] = values
|
||||
else:
|
||||
raise TypeError(f"Expected `tuple` type, but got a value `{key}`")
|
||||
return grouped_dict
|
||||
|
||||
def __init__(self, ens=RollingEnsemble()):
|
||||
|
||||
5
qlib/model/meta/__init__.py
Normal file
@@ -0,0 +1,5 @@
|
||||
# Copyright (c) Microsoft Corporation.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
from .task import MetaTask
|
||||
from .dataset import MetaTaskDataset
|
||||
76
qlib/model/meta/dataset.py
Normal file
@@ -0,0 +1,76 @@
|
||||
# Copyright (c) Microsoft Corporation.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
import abc
|
||||
from qlib.model.meta.task import MetaTask
|
||||
from typing import Dict, Union, List, Tuple, Text
|
||||
from ...workflow.task.gen import RollingGen, task_generator
|
||||
from ...data.dataset.handler import DataHandler
|
||||
from ...utils.serial import Serializable
|
||||
|
||||
|
||||
class MetaTaskDataset(Serializable, metaclass=abc.ABCMeta):
|
||||
"""
|
||||
A dataset fetching the data in a meta-level.
|
||||
|
||||
A Meta Dataset is responsible for
|
||||
- taking input tasks (e.g. Qlib tasks) and preparing meta tasks
|
||||
- meta task contains more information than normal tasks (e.g. input data for meta model)
|
||||
|
||||
The learnt pattern could transfer to other meta-datasets. The following cases should be supported
|
||||
- A meta-model trained on meta-dataset A and then applied to meta-dataset B
|
||||
- Some patterns are shared between meta-dataset A and B, so the meta-input of meta-dataset A is used when the meta-model is applied to meta-dataset B
|
||||
"""
|
||||
|
||||
def __init__(self, segments: Union[Dict[Text, Tuple], float], *args, **kwargs):
|
||||
"""
|
||||
The meta-dataset maintains a list of meta-tasks when it is initialized.
|
||||
|
||||
The segments indicate how to divide the data
|
||||
|
||||
The duty of the `__init__` function of MetaTaskDataset
|
||||
- initialize the tasks
|
||||
"""
|
||||
super().__init__(*args, **kwargs)
|
||||
self.segments = segments
|
||||
|
||||
def prepare_tasks(self, segments: Union[List[Text], Text], *args, **kwargs) -> List[MetaTask]:
|
||||
"""
|
||||
Prepare the data in each meta-task so that it is ready for training.
|
||||
|
||||
The following code example shows how to retrieve a list of meta-tasks from the `meta_dataset`:
|
||||
|
||||
.. code-block:: Python
|
||||
|
||||
# get the train segment and the test segment, both of them are lists
|
||||
train_meta_tasks, test_meta_tasks = meta_dataset.prepare_tasks(["train", "test"])
|
||||
|
||||
Parameters
|
||||
----------
|
||||
segments: Union[List[Text], Tuple[Text], Text]
|
||||
the info to select data
|
||||
|
||||
Returns
|
||||
-------
|
||||
list:
|
||||
A list of the prepared data of each meta-task for training the meta-model. For multiple segments [seg1, seg2, ... , segN], the returned list will be [[tasks in seg1], [tasks in seg2], ... , [tasks in segN]].
|
||||
Each task is a meta task
|
||||
"""
|
||||
if isinstance(segments, (list, tuple)):
|
||||
return [self._prepare_seg(seg) for seg in segments]
|
||||
elif isinstance(segments, str):
|
||||
return self._prepare_seg(segments)
|
||||
else:
|
||||
raise NotImplementedError(f"This type of input is not supported")
|
||||
|
||||
@abc.abstractmethod
|
||||
def _prepare_seg(self, segment: Text):
|
||||
"""
|
||||
prepare a single segment of data for training
|
||||
|
||||
Parameters
|
||||
----------
|
||||
segment : Text
|
||||
the name of the segment
|
||||
"""
|
||||
pass
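The return shape of `prepare_tasks` depends on the argument: a list of segment names yields one task list per segment, while a single name yields a flat task list. A small sketch of a concrete subclass, assuming a simplified stand-in base class without `Serializable`; `MetaTaskDatasetSketch`, `ToyMetaDataset`, and the dict-based tasks are hypothetical:

```python
import abc


class MetaTaskDatasetSketch(abc.ABC):
    """Simplified stand-in for the abstract meta-dataset shown above."""

    def __init__(self, segments):
        self.segments = segments

    def prepare_tasks(self, segments):
        if isinstance(segments, (list, tuple)):
            return [self._prepare_seg(seg) for seg in segments]
        elif isinstance(segments, str):
            return self._prepare_seg(segments)
        raise NotImplementedError("This type of input is not supported")

    @abc.abstractmethod
    def _prepare_seg(self, segment):
        """Prepare the meta-tasks of a single named segment."""


class ToyMetaDataset(MetaTaskDatasetSketch):
    """Hypothetical dataset: assigns tasks to segments by their end date."""

    def __init__(self, segments, tasks):
        super().__init__(segments)
        self.tasks = tasks

    def _prepare_seg(self, segment):
        start, end = self.segments[segment]
        # ISO date strings compare correctly as plain strings
        return [t for t in self.tasks if start <= t["end"] <= end]


meta_dataset = ToyMetaDataset(
    segments={"train": ("2010-01-01", "2015-12-31"), "test": ("2016-01-01", "2020-12-31")},
    tasks=[
        {"name": "t1", "end": "2012-06-30"},
        {"name": "t2", "end": "2014-06-30"},
        {"name": "t3", "end": "2018-06-30"},
    ],
)
train_meta_tasks, test_meta_tasks = meta_dataset.prepare_tasks(["train", "test"])
only_test = meta_dataset.prepare_tasks("test")
print(len(train_meta_tasks), len(test_meta_tasks), only_test)
```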
|
||||
79
qlib/model/meta/model.py
Normal file
@@ -0,0 +1,79 @@
|
||||
# Copyright (c) Microsoft Corporation.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
import abc
|
||||
from qlib.contrib.meta.data_selection.dataset import MetaDatasetDS
|
||||
from typing import Union, List, Tuple
|
||||
|
||||
from qlib.model.meta.task import MetaTask
|
||||
from .dataset import MetaTaskDataset
|
||||
|
||||
|
||||
class MetaModel(metaclass=abc.ABCMeta):
|
||||
"""
|
||||
The meta-model guiding the model learning.
|
||||
|
||||
`Guiding` falls into two categories, depending on the stage of model learning:
|
||||
- The definition of learning tasks: Please refer to docs of `MetaTaskModel`
|
||||
- Controlling the learning process of models: Please refer to the docs of `MetaGuideModel`
|
||||
"""
|
||||
|
||||
@abc.abstractmethod
|
||||
def fit(self, *args, **kwargs):
|
||||
"""
|
||||
The training process of the meta-model.
|
||||
"""
|
||||
pass
|
||||
|
||||
@abc.abstractmethod
|
||||
def inference(self, *args, **kwargs) -> object:
|
||||
"""
|
||||
The inference process of the meta-model.
|
||||
|
||||
Returns
|
||||
-------
|
||||
object:
|
||||
Some information to guide the model learning
|
||||
"""
|
||||
pass
|
||||
|
||||
|
||||
class MetaTaskModel(MetaModel):
|
||||
"""
|
||||
This type of meta-model deals with base task definitions. The meta-model creates tasks for training new base forecasting models after it is trained. `prepare_tasks` directly modifies the task definitions.
|
||||
"""
|
||||
|
||||
def fit(self, meta_dataset: MetaTaskDataset):
|
||||
"""
|
||||
The MetaTaskModel is expected to get prepared MetaTask from meta_dataset.
|
||||
And then it will learn knowledge from the meta tasks
|
||||
"""
|
||||
raise NotImplementedError("Please implement the `fit` method")
|
||||
|
||||
def inference(self, meta_dataset: MetaTaskDataset) -> List[dict]:
|
||||
"""
|
||||
MetaTaskModel will make inference on the meta_dataset
|
||||
The MetaTaskModel is expected to get prepared MetaTask from meta_dataset.
|
||||
Then it will create modified tasks in Qlib format, which can be executed by the Qlib trainer.
|
||||
|
||||
Returns
|
||||
-------
|
||||
List[dict]:
|
||||
A list of modified task definitions.
|
||||
|
||||
"""
|
||||
raise NotImplementedError("Please implement the `inference` method")
|
||||
|
||||
|
||||
class MetaGuideModel(MetaModel):
|
||||
"""
|
||||
This type of meta-model aims to guide the training process of the base model. The meta-model interacts with the base forecasting models during their training process.
|
||||
"""
|
||||
|
||||
@abc.abstractmethod
|
||||
def fit(self, *args, **kwargs):
|
||||
pass
|
||||
|
||||
@abc.abstractmethod
|
||||
def inference(self, *args, **kwargs):
|
||||
pass
|
||||
53
qlib/model/meta/task.py
Normal file
@@ -0,0 +1,53 @@
|
||||
# Copyright (c) Microsoft Corporation.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
import abc
|
||||
from typing import Union, List, Tuple
|
||||
|
||||
from qlib.data.dataset import Dataset
|
||||
from ...utils import init_instance_by_config
|
||||
|
||||
|
||||
class MetaTask:
|
||||
"""
|
||||
A single meta-task; a meta-dataset contains a list of them.
|
||||
It serves as a component of a meta-dataset such as MetaDatasetDS.
|
||||
|
||||
The data processing is different
|
||||
- the processed input may be different between training and testing
|
||||
- When training, the X, y, X_test, y_test in training tasks are necessary (# PROC_MODE_FULL #)
|
||||
but not necessary in test tasks. (# PROC_MODE_TEST #)
|
||||
- When the meta model can be transferred to other datasets, only meta_info is necessary (# PROC_MODE_TRANSFER #)
|
||||
"""
|
||||
|
||||
PROC_MODE_FULL = "full"
|
||||
PROC_MODE_TEST = "test"
|
||||
PROC_MODE_TRANSFER = "transfer"
|
||||
|
||||
def __init__(self, task: dict, meta_info: object, mode: str = PROC_MODE_FULL):
|
||||
"""
|
||||
The `__init__` func is responsible for
|
||||
- store the task
|
||||
- store the origin input data for
|
||||
- process the input data for meta data
|
||||
|
||||
Parameters
|
||||
----------
|
||||
task : dict
|
||||
the task to be enhanced by meta model
|
||||
|
||||
meta_info : object
|
||||
the input for meta model
|
||||
"""
|
||||
self.task = task
|
||||
self.meta_info = meta_info # the original meta input information, it will be processed later
|
||||
self.mode = mode
|
||||
|
||||
def get_dataset(self) -> Dataset:
|
||||
return init_instance_by_config(self.task["dataset"], accept_types=Dataset)
|
||||
|
||||
def get_meta_input(self) -> object:
|
||||
"""
|
||||
Return the **processed** meta_info
|
||||
"""
|
||||
return self.meta_info
|
||||
@@ -20,14 +20,12 @@ from tqdm.auto import tqdm
|
||||
from qlib.data.dataset import Dataset
|
||||
from qlib.log import get_module_logger
|
||||
from qlib.model.base import Model
|
||||
from qlib.utils import flatten_dict, get_callable_kwargs, init_instance_by_config
|
||||
from qlib.utils import flatten_dict, get_callable_kwargs, init_instance_by_config, auto_filter_kwargs, fill_placeholder
|
||||
from qlib.workflow import R
|
||||
from qlib.workflow.record_temp import SignalRecord
|
||||
from qlib.workflow.recorder import Recorder
|
||||
from qlib.workflow.task.manage import TaskManager, run_task
|
||||
|
||||
|
||||
# from qlib.data.dataset.weight import Reweighter
|
||||
from qlib.data.dataset.weight import Reweighter
|
||||
|
||||
|
||||
def _log_task_info(task_config: dict):
|
||||
@@ -41,11 +39,9 @@ def _exe_task(task_config: dict):
|
||||
# model & dataset initiation
|
||||
model: Model = init_instance_by_config(task_config["model"])
|
||||
dataset: Dataset = init_instance_by_config(task_config["dataset"])
|
||||
# FIXME: resume reweighter after merging data selection
|
||||
# reweighter: Reweighter = task_config.get("reweighter", None)
|
||||
reweighter: Reweighter = task_config.get("reweighter", None)
|
||||
# model training
|
||||
# auto_filter_kwargs(model.fit)(dataset, reweighter=reweighter)
|
||||
model.fit(dataset)
|
||||
auto_filter_kwargs(model.fit)(dataset, reweighter=reweighter)
|
||||
R.save_objects(**{"params.pkl": model})
|
||||
# this dataset is saved for online inference. So the concrete data should not be dumped
|
||||
dataset.config(dump_all=False, recursive=True)
|
||||
@@ -87,103 +83,6 @@ def begin_task_train(task_config: dict, experiment_name: str, recorder_name: str
|
||||
return R.get_recorder()
|
||||
|
||||
|
||||
def get_item_from_obj(config: dict, name_path: str) -> object:
|
||||
"""
|
||||
Follow the name_path to get values from config
|
||||
For example:
|
||||
If we follow the example in the Parameters section,
|
||||
Timestamp('2014-12-31 00:00:00') will be returned
|
||||
|
||||
Parameters
|
||||
----------
|
||||
config : dict
|
||||
e.g.
|
||||
{'dataset': {'class': 'DatasetH',
|
||||
'kwargs': {'handler': {'class': 'Alpha158',
|
||||
'kwargs': {'end_time': '2020-08-01',
|
||||
'fit_end_time': '<dataset.kwargs.segments.train.1>',
|
||||
'fit_start_time': '<dataset.kwargs.segments.train.0>',
|
||||
'instruments': 'csi100',
|
||||
'start_time': '2008-01-01'},
|
||||
'module_path': 'qlib.contrib.data.handler'},
|
||||
'segments': {'test': (Timestamp('2017-01-03 00:00:00'),
|
||||
Timestamp('2019-04-08 00:00:00')),
|
||||
'train': (Timestamp('2008-01-02 00:00:00'),
|
||||
Timestamp('2014-12-31 00:00:00')),
|
||||
'valid': (Timestamp('2015-01-05 00:00:00'),
|
||||
Timestamp('2016-12-30 00:00:00'))}}
|
||||
}}
|
||||
name_path : str
|
||||
e.g.
|
||||
"dataset.kwargs.segments.train.1"
|
||||
|
||||
Returns
|
||||
-------
|
||||
object
|
||||
the retrieved object
|
||||
"""
|
||||
cur_cfg = config
|
||||
for k in name_path.split("."):
|
||||
if isinstance(cur_cfg, dict):
|
||||
cur_cfg = cur_cfg[k]
|
||||
elif k.isdigit():
|
||||
cur_cfg = cur_cfg[int(k)]
|
||||
else:
|
||||
raise ValueError(f"Error when getting {k} from cur_cfg")
|
||||
return cur_cfg
|
||||
|
||||
|
||||
def fill_placeholder(config: dict, config_extend: dict):
|
||||
"""
|
||||
Detect placeholder in config and fill them with config_extend.
|
||||
Each item of the dict must be a single item (int, str, etc.), a dict, or a list. Tuples are not supported.
|
||||
There are two types of variables:
|
||||
- user-defined variables :
|
||||
e.g. when config_extend is `{"<MODEL>": model, "<DATASET>": dataset}`, "<MODEL>" and "<DATASET>" in `config` will be replaced with `model` and `dataset`
|
||||
- variables extracted from `config` :
|
||||
e.g. the variables like "<dataset.kwargs.segments.train.0>" will be replaced with the values from `config`
|
||||
|
||||
Parameters
|
||||
----------
|
||||
config : dict
|
||||
the parameter dict to be filled
|
||||
config_extend : dict
|
||||
the value of all placeholders
|
||||
|
||||
Returns
|
||||
-------
|
||||
dict
|
||||
the parameter dict
|
||||
"""
|
||||
# check the format of config_extend
|
||||
for placeholder in config_extend.keys():
|
||||
assert re.match(r"<[^<>]+>", placeholder)
|
||||
|
||||
# bfs
|
||||
top = 0
|
||||
tail = 1
|
||||
item_queue = [config]
|
||||
while top < tail:
|
||||
now_item = item_queue[top]
|
||||
top += 1
|
||||
if isinstance(now_item, list):
|
||||
item_keys = range(len(now_item))
|
||||
elif isinstance(now_item, dict):
|
||||
item_keys = now_item.keys()
|
||||
for key in item_keys:
|
||||
if isinstance(now_item[key], list) or isinstance(now_item[key], dict):
|
||||
item_queue.append(now_item[key])
|
||||
tail += 1
|
||||
elif isinstance(now_item[key], str):
|
||||
if now_item[key] in config_extend.keys():
|
||||
now_item[key] = config_extend[now_item[key]]
|
||||
else:
|
||||
m = re.match(r"<(?P<name_path>[^<>]+)>", now_item[key])
|
||||
if m is not None:
|
||||
now_item[key] = get_item_from_obj(config, m.groupdict()["name_path"])
|
||||
return config
|
||||
|
||||
|
||||
def end_task_train(rec: Recorder, experiment_name: str) -> Recorder:
|
||||
"""
|
||||
Finish task training with real model fitting and saving.
|
||||
@@ -349,7 +248,7 @@ class TrainerR(Trainer):
|
||||
if experiment_name is None:
|
||||
experiment_name = self.experiment_name
|
||||
recs = []
|
||||
for task in tqdm(tasks):
|
||||
for task in tqdm(tasks, desc="train tasks"):
|
||||
rec = train_func(task, experiment_name, **kwargs)
|
||||
rec.set_tags(**{self.STATUS_KEY: self.STATUS_BEGIN})
|
||||
recs.append(rec)
|
||||
@@ -606,13 +505,17 @@ class DelayTrainerRM(TrainerRM):
|
||||
tasks = [tasks]
|
||||
if len(tasks) == 0:
|
||||
return []
|
||||
return super().train(
|
||||
_skip_run_task = self.skip_run_task
|
||||
self.skip_run_task = False # The task preparation can't be skipped
|
||||
res = super().train(
|
||||
tasks,
|
||||
train_func=train_func,
|
||||
experiment_name=experiment_name,
|
||||
after_status=TaskManager.STATUS_PART_DONE,
|
||||
**kwargs,
|
||||
)
|
||||
self.skip_run_task = _skip_run_task
|
||||
return res
|
||||
|
||||
def end_train(self, recs, end_train_func=None, experiment_name: str = None, **kwargs) -> List[Recorder]:
|
||||
"""
|
||||
|
||||
15
qlib/model/utils.py
Normal file
@@ -0,0 +1,15 @@
|
||||
# Copyright (c) Microsoft Corporation.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
from torch.utils.data import Dataset
|
||||
|
||||
|
||||
class ConcatDataset(Dataset):
|
||||
def __init__(self, *datasets):
|
||||
self.datasets = datasets
|
||||
|
||||
def __getitem__(self, i):
|
||||
return tuple(d[i] for d in self.datasets)
|
||||
|
||||
def __len__(self):
|
||||
return min(len(d) for d in self.datasets)
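The indexing rule of this class can be illustrated without PyTorch. The sketch below is a hedged stand-in: `ZipDataset` is a hypothetical name, and plain lists replace real `torch.utils.data.Dataset` objects. It shows that index `i` returns one item from every wrapped dataset and that the length is that of the shortest one. Note that this pairs datasets side by side, unlike `torch.utils.data.ConcatDataset`, which chains them end to end.

```python
class ZipDataset:
    """Torch-free stand-in with the same indexing rule as the class above."""

    def __init__(self, *datasets):
        self.datasets = datasets

    def __getitem__(self, i):
        # one element from each wrapped dataset, bundled in a tuple
        return tuple(d[i] for d in self.datasets)

    def __len__(self):
        # the shortest dataset bounds the number of usable indices
        return min(len(d) for d in self.datasets)


features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
labels = [1, 0]  # deliberately shorter than `features`
paired = ZipDataset(features, labels)
samples = [paired[i] for i in range(len(paired))]
```

Because the length is the minimum, the third feature row is never reached; callers that need every row should make sure the wrapped datasets have equal length.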
|
||||
@@ -31,6 +31,12 @@ GBDT_MODEL = {
|
||||
}
|
||||
|
||||
|
||||
SA_RC = {
|
||||
"class": "SigAnaRecord",
|
||||
"module_path": "qlib.workflow.record_temp",
|
||||
}
|
||||
|
||||
|
||||
RECORD_CONFIG = [
|
||||
{
|
||||
"class": "SignalRecord",
|
||||
@@ -40,10 +46,7 @@ RECORD_CONFIG = [
|
||||
"model": "<MODEL>",
|
||||
},
|
||||
},
|
||||
{
|
||||
"class": "SigAnaRecord",
|
||||
"module_path": "qlib.workflow.record_temp",
|
||||
},
|
||||
SA_RC,
|
||||
]
|
||||
|
||||
|
||||
|
||||
@@ -16,6 +16,7 @@ import redis
|
||||
import bisect
|
||||
import shutil
|
||||
import difflib
|
||||
import inspect
|
||||
import hashlib
|
||||
import warnings
|
||||
import datetime
|
||||
@@ -30,7 +31,7 @@ from pathlib import Path
|
||||
from typing import Dict, Union, Tuple, Any, Text, Optional, Callable
|
||||
from types import ModuleType
|
||||
from urllib.parse import urlparse
|
||||
|
||||
from .file import get_or_create_path, save_multiple_parts_file, unpack_archive_with_buffer, get_tmp_file_with_buffer
|
||||
from ..config import C
|
||||
from ..log import get_module_logger, set_log_with_config
|
||||
|
||||
@@ -191,6 +192,24 @@ def get_module_by_module_path(module_path: Union[str, ModuleType]):
|
||||
return module
|
||||
|
||||
|
||||
def split_module_path(module_path: str) -> Tuple[str, str]:
|
||||
"""
|
||||
|
||||
Parameters
|
||||
----------
|
||||
module_path : str
|
||||
e.g. "a.b.c.ClassName"
|
||||
|
||||
Returns
|
||||
-------
|
||||
Tuple[str, str]
|
||||
e.g. ("a.b.c", "ClassName")
|
||||
"""
|
||||
*m_path, cls = module_path.split(".")
|
||||
m_path = ".".join(m_path)
|
||||
return m_path, cls
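As a quick check of the star-unpacking above, the following standalone copy of the function shows both input shapes. It is a sketch for illustration only; the input strings are made up.

```python
from typing import Tuple


def split_module_path(module_path: str) -> Tuple[str, str]:
    # everything before the last dot is the module path, the last part is the name
    *m_path, cls = module_path.split(".")
    return ".".join(m_path), cls


dotted = split_module_path("a.b.c.ClassName")
bare = split_module_path("ClassName")
```

A bare name produces an empty module path, which is the case where `get_callable_kwargs` falls back to `module_path` from the config or to `default_module`.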
|
||||
|
||||
|
||||
def get_callable_kwargs(config: Union[dict, str], default_module: Union[str, ModuleType] = None) -> (type, dict):
|
||||
"""
|
||||
extract class/func and kwargs from config info
|
||||
@@ -212,17 +231,24 @@ def get_callable_kwargs(config: Union[dict, str], default_module: Union[str, Mod
|
||||
the class/func object and its arguments.
|
||||
"""
|
||||
if isinstance(config, dict):
|
||||
if isinstance(config["class"], str):
|
||||
module = get_module_by_module_path(config.get("module_path", default_module))
|
||||
# raise AttributeError
|
||||
_callable = getattr(module, config["class" if "class" in config else "func"])
|
||||
key = "class" if "class" in config else "func"
|
||||
if isinstance(config[key], str):
|
||||
# 1) get module and class
|
||||
# - case 1): "a.b.c.ClassName"
|
||||
# - case 2): {"class": "ClassName", "module_path": "a.b.c"}
|
||||
m_path, cls = split_module_path(config[key])
|
||||
if m_path == "":
|
||||
m_path = config.get("module_path", default_module)
|
||||
module = get_module_by_module_path(m_path)
|
||||
|
||||
# 2) get callable
|
||||
_callable = getattr(module, cls) # may raise AttributeError
|
||||
else:
|
||||
_callable = config["class"] # the class type itself is passed in
|
||||
_callable = config[key] # the class type itself is passed in
|
||||
kwargs = config.get("kwargs", {})
|
||||
elif isinstance(config, str):
|
||||
# a.b.c.ClassName
|
||||
*m_path, cls = config.split(".")
|
||||
m_path = ".".join(m_path)
|
||||
m_path, cls = split_module_path(config)
|
||||
module = get_module_by_module_path(default_module if m_path == "" else m_path)
|
||||
|
||||
_callable = getattr(module, cls)
|
||||
@@ -352,153 +378,6 @@ def compare_dict_value(src_data: dict, dst_data: dict):
|
||||
return changes
|
||||
|
||||
|
||||
def get_or_create_path(path: Optional[Text] = None, return_dir: bool = False):
|
||||
"""Create or get a file or directory given the path and return_dir.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
path: a string indicates the path or None indicates creating a temporary path.
|
||||
return_dir: if True, create and return a directory; otherwise create and return a file.
|
||||
|
||||
"""
|
||||
if path:
|
||||
if return_dir and not os.path.exists(path):
|
||||
os.makedirs(path)
|
||||
elif not return_dir: # return a file, thus we need to create its parent directory
|
||||
xpath = os.path.abspath(os.path.join(path, ".."))
|
||||
if not os.path.exists(xpath):
|
||||
os.makedirs(xpath)
|
||||
else:
|
||||
temp_dir = os.path.expanduser("~/tmp")
|
||||
if not os.path.exists(temp_dir):
|
||||
os.makedirs(temp_dir)
|
||||
if return_dir:
|
||||
path = tempfile.mkdtemp(dir=temp_dir)  # mkdtemp returns the path only, not a tuple
|
||||
else:
|
||||
_, path = tempfile.mkstemp(dir=temp_dir)
|
||||
return path
|
||||
|
||||
|
||||
@contextlib.contextmanager
|
||||
def save_multiple_parts_file(filename, format="gztar"):
|
||||
"""Save multiple parts file
|
||||
|
||||
Implementation process:
|
||||
1. get the absolute path to 'filename'
|
||||
2. create a 'filename' directory
|
||||
3. user does something with file_path('filename/')
|
||||
4. make_archive 'filename' directory
|
||||
5. remove 'filename' directory, and rename 'archive file' to filename
|
||||
|
||||
:param filename: result model path
|
||||
:param format: archive format: one of "zip", "tar", "gztar", "bztar", or "xztar"
|
||||
:return: real model path
|
||||
|
||||
Usage::
|
||||
|
||||
>>> # The following code will create an archive file('~/tmp/test_file') containing 'test_doc_i'(i is 0-9) files.
|
||||
>>> with save_multiple_parts_file('~/tmp/test_file') as filename_dir:
|
||||
... for i in range(10):
|
||||
... temp_path = os.path.join(filename_dir, 'test_doc_{}'.format(str(i)))
|
||||
... with open(temp_path, 'w') as fp:
|
||||
... fp.write(str(i))
|
||||
...
|
||||
|
||||
"""
|
||||
|
||||
if filename.startswith("~"):
|
||||
filename = os.path.expanduser(filename)
|
||||
|
||||
file_path = os.path.abspath(filename)
|
||||
|
||||
# Create model dir
|
||||
if os.path.exists(file_path):
|
||||
raise FileExistsError("ERROR: file exists: {}, cannot create the directory.".format(file_path))
|
||||
|
||||
os.makedirs(file_path)
|
||||
|
||||
# return model dir
|
||||
yield file_path
|
||||
|
||||
# filename dir to filename.tar.gz file
|
||||
tar_file = shutil.make_archive(file_path, format=format, root_dir=file_path)
|
||||
|
||||
# Remove filename dir
|
||||
if os.path.exists(file_path):
|
||||
shutil.rmtree(file_path)
|
||||
|
||||
# filename.tar.gz rename to filename
|
||||
os.rename(tar_file, file_path)
|
||||
|
||||
|
||||
@contextlib.contextmanager
|
||||
def unpack_archive_with_buffer(buffer, format="gztar"):
|
||||
"""Unpack archive with archive buffer
|
||||
After the call is finished, the archive file and directory will be deleted.
|
||||
|
||||
Implementation process:
|
||||
1. create 'tempfile' in '~/tmp/' and directory
|
||||
2. write 'buffer' to 'tempfile'
|
||||
3. unpack archive file('tempfile')
|
||||
4. user does something with file_path('tempfile/')
|
||||
5. remove 'tempfile' and 'tempfile directory'
|
||||
|
||||
:param buffer: bytes
|
||||
:param format: archive format: one of "zip", "tar", "gztar", "bztar", or "xztar"
|
||||
:return: unpack archive directory path
|
||||
|
||||
Usage::
|
||||
|
||||
>>> # The following code is to print all the file names in 'test_unpack.tar.gz'
|
||||
>>> with open('test_unpack.tar.gz', 'rb') as fp:
|
||||
... buffer = fp.read()
|
||||
...
|
||||
>>> with unpack_archive_with_buffer(buffer) as temp_dir:
|
||||
... for f_n in os.listdir(temp_dir):
|
||||
... print(f_n)
|
||||
...
|
||||
|
||||
"""
|
||||
temp_dir = os.path.expanduser("~/tmp")
|
||||
if not os.path.exists(temp_dir):
|
||||
os.makedirs(temp_dir)
|
||||
with tempfile.NamedTemporaryFile("wb", delete=False, dir=temp_dir) as fp:
|
||||
fp.write(buffer)
|
||||
file_path = fp.name
|
||||
|
||||
try:
|
||||
tar_file = file_path + ".tar.gz"
|
||||
os.rename(file_path, tar_file)
|
||||
# Create dir
|
||||
os.makedirs(file_path)
|
||||
shutil.unpack_archive(tar_file, format=format, extract_dir=file_path)
|
||||
|
||||
# Return temp dir
|
||||
yield file_path
|
||||
|
||||
except Exception as e:
|
||||
log.error(str(e))
|
||||
finally:
|
||||
# Remove temp tar file
|
||||
if os.path.exists(tar_file):
|
||||
os.unlink(tar_file)
|
||||
|
||||
# Remove temp model dir
|
||||
if os.path.exists(file_path):
|
||||
shutil.rmtree(file_path)
|
||||
|
||||
|
||||
@contextlib.contextmanager
|
||||
def get_tmp_file_with_buffer(buffer):
|
||||
temp_dir = os.path.expanduser("~/tmp")
|
||||
if not os.path.exists(temp_dir):
|
||||
os.makedirs(temp_dir)
|
||||
with tempfile.NamedTemporaryFile("wb", delete=True, dir=temp_dir) as fp:
|
||||
fp.write(buffer)
|
||||
file_path = fp.name
|
||||
yield file_path
|
||||
|
||||
|
||||
def remove_repeat_field(fields):
|
||||
"""remove repeat field
|
||||
|
||||
@@ -845,6 +724,134 @@ def flatten_dict(d, parent_key="", sep=".") -> dict:
|
||||
return dict(items)
|
||||
|
||||
|
||||
def get_item_from_obj(config: dict, name_path: str) -> object:
|
||||
"""
|
||||
Follow the name_path to get values from config
|
||||
For example:
|
||||
If we follow the example in the Parameters section,
|
||||
Timestamp('2014-12-31 00:00:00') will be returned
|
||||
|
||||
Parameters
|
||||
----------
|
||||
config : dict
|
||||
e.g.
|
||||
{'dataset': {'class': 'DatasetH',
|
||||
'kwargs': {'handler': {'class': 'Alpha158',
|
||||
'kwargs': {'end_time': '2020-08-01',
|
||||
'fit_end_time': '<dataset.kwargs.segments.train.1>',
|
||||
'fit_start_time': '<dataset.kwargs.segments.train.0>',
|
||||
'instruments': 'csi100',
|
||||
'start_time': '2008-01-01'},
|
||||
'module_path': 'qlib.contrib.data.handler'},
|
||||
'segments': {'test': (Timestamp('2017-01-03 00:00:00'),
|
||||
Timestamp('2019-04-08 00:00:00')),
|
||||
'train': (Timestamp('2008-01-02 00:00:00'),
|
||||
Timestamp('2014-12-31 00:00:00')),
|
||||
'valid': (Timestamp('2015-01-05 00:00:00'),
|
||||
Timestamp('2016-12-30 00:00:00'))}}
|
||||
}}
|
||||
name_path : str
|
||||
e.g.
|
||||
"dataset.kwargs.segments.train.1"
|
||||
|
||||
Returns
|
||||
-------
|
||||
object
|
||||
the retrieved object
|
||||
"""
|
||||
cur_cfg = config
|
||||
for k in name_path.split("."):
|
||||
if isinstance(cur_cfg, dict):
|
||||
cur_cfg = cur_cfg[k]
|
||||
elif k.isdigit():
|
||||
cur_cfg = cur_cfg[int(k)]
|
||||
else:
|
||||
raise ValueError(f"Error when getting {k} from cur_cfg")
|
||||
return cur_cfg
|
||||
|
||||
|
||||
def fill_placeholder(config: dict, config_extend: dict):
|
||||
"""
|
||||
Detect placeholder in config and fill them with config_extend.
|
||||
Each item of the dict must be a single item (int, str, etc.), a dict, or a list. Tuples are not supported.
|
||||
There are two types of variables:
|
||||
- user-defined variables :
|
||||
e.g. when config_extend is `{"<MODEL>": model, "<DATASET>": dataset}`, "<MODEL>" and "<DATASET>" in `config` will be replaced with `model` and `dataset`
|
||||
- variables extracted from `config` :
|
||||
e.g. the variables like "<dataset.kwargs.segments.train.0>" will be replaced with the values from `config`
|
||||
|
||||
Parameters
|
||||
----------
|
||||
config : dict
|
||||
the parameter dict to be filled
|
||||
config_extend : dict
|
||||
the value of all placeholders
|
||||
|
||||
Returns
|
||||
-------
|
||||
dict
|
||||
the parameter dict
|
||||
"""
|
||||
# check the format of config_extend
|
||||
for placeholder in config_extend.keys():
|
||||
assert re.match(r"<[^<>]+>", placeholder)
|
||||
|
||||
# bfs
|
||||
top = 0
|
||||
tail = 1
|
||||
item_queue = [config]
|
||||
while top < tail:
|
||||
now_item = item_queue[top]
|
||||
top += 1
|
||||
if isinstance(now_item, list):
|
||||
item_keys = range(len(now_item))
|
||||
elif isinstance(now_item, dict):
|
||||
item_keys = now_item.keys()
|
||||
for key in item_keys:
|
||||
if isinstance(now_item[key], list) or isinstance(now_item[key], dict):
|
||||
item_queue.append(now_item[key])
|
||||
tail += 1
|
||||
elif isinstance(now_item[key], str):
|
||||
if now_item[key] in config_extend.keys():
|
||||
now_item[key] = config_extend[now_item[key]]
|
||||
else:
|
||||
m = re.match(r"<(?P<name_path>[^<>]+)>", now_item[key])
|
||||
if m is not None:
|
||||
now_item[key] = get_item_from_obj(config, m.groupdict()["name_path"])
|
||||
return config
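The two kinds of placeholders can be seen in a small run. The block below is a condensed, standalone re-statement of `get_item_from_obj` and `fill_placeholder` for illustration; the config contents and the `"my_model"` value are made up, and the segments are written as lists because tuples are not traversed.

```python
import re


def get_item_from_obj(config, name_path):
    cur = config
    for k in name_path.split("."):
        if isinstance(cur, dict):
            cur = cur[k]
        elif k.isdigit():
            cur = cur[int(k)]  # numeric path parts index into sequences
        else:
            raise ValueError(f"Error when getting {k} from cur_cfg")
    return cur


def fill_placeholder(config, config_extend):
    for placeholder in config_extend:
        assert re.match(r"<[^<>]+>", placeholder)
    queue = [config]  # breadth-first walk over nested dicts and lists
    while queue:
        item = queue.pop(0)
        keys = range(len(item)) if isinstance(item, list) else list(item.keys())
        for key in keys:
            value = item[key]
            if isinstance(value, (list, dict)):
                queue.append(value)
            elif isinstance(value, str):
                if value in config_extend:
                    item[key] = config_extend[value]  # user-defined variable
                else:
                    m = re.match(r"<(?P<name_path>[^<>]+)>", value)
                    if m is not None:  # variable looked up in `config` itself
                        item[key] = get_item_from_obj(config, m.group("name_path"))
    return config


config = {
    "dataset": {
        "kwargs": {
            "handler": {
                "kwargs": {
                    "fit_start_time": "<dataset.kwargs.segments.train.0>",
                    "fit_end_time": "<dataset.kwargs.segments.train.1>",
                }
            },
            "segments": {"train": ["2008-01-02", "2014-12-31"]},
        }
    },
    "record": [{"kwargs": {"model": "<MODEL>"}}],
}
filled = fill_placeholder(config, {"<MODEL>": "my_model"})
```

The config is modified in place and also returned, so `filled` and `config` refer to the same dict.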
|
||||
|
||||
|
||||
def auto_filter_kwargs(func: Callable) -> Callable:
|
||||
"""
|
||||
This works like a decorator.
|
||||
|
||||
The decorated function ignores any keyword argument that the original function does not accept and logs a warning for it.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
func : Callable
|
||||
The original function
|
||||
|
||||
Returns
|
||||
-------
|
||||
Callable:
|
||||
the new callable function
|
||||
"""
|
||||
|
||||
def _func(*args, **kwargs):
|
||||
spec = inspect.getfullargspec(func)
|
||||
new_kwargs = {}
|
||||
for k, v in kwargs.items():
|
||||
# if `func` doesn't accept variable keyword arguments like `**kwargs` and has no matching named argument
|
||||
if spec.varkw is None and k not in spec.args:
|
||||
log.warning(f"The parameter `{k}` with value `{v}` is ignored.")
|
||||
else:
|
||||
new_kwargs[k] = v
|
||||
return func(*args, **new_kwargs)
|
||||
|
||||
return _func
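The filtering can be demonstrated with two toy `fit` functions, mirroring how the trainer passes `reweighter` to models that may or may not accept it. This is a sketch under assumptions: it re-states the wrapper standalone, uses `warnings.warn` in place of the module logger, and `fit_plain` / `fit_weighted` are hypothetical.

```python
import inspect
import warnings
from typing import Callable


def auto_filter_kwargs(func: Callable) -> Callable:
    def _func(*args, **kwargs):
        spec = inspect.getfullargspec(func)
        new_kwargs = {}
        for k, v in kwargs.items():
            # drop the keyword when `func` has neither `**kwargs` nor a parameter named `k`
            if spec.varkw is None and k not in spec.args:
                warnings.warn(f"The parameter `{k}` with value `{v}` is ignored.")
            else:
                new_kwargs[k] = v
        return func(*args, **new_kwargs)

    return _func


def fit_plain(dataset):
    return ("plain", dataset)


def fit_weighted(dataset, reweighter=None):
    return ("weighted", dataset, reweighter)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    plain_result = auto_filter_kwargs(fit_plain)("ds", reweighter="rw")
    weighted_result = auto_filter_kwargs(fit_weighted)("ds", reweighter="rw")
```

Only the call to `fit_plain` loses its `reweighter` argument and produces a warning. One caveat of the check as written: it consults `spec.args` only, so keyword-only parameters are treated as unacceptable as well.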
|
||||
|
||||
|
||||
#################### Wrapper #####################
|
||||
class Wrapper:
|
||||
"""Wrapper class for anything that needs to set up during qlib.init"""
|
||||
@@ -920,6 +927,7 @@ def fname_to_code(fname: str):
|
||||
----------
|
||||
fname: str
|
||||
"""
|
||||
|
||||
prefix = "_qlib_"
|
||||
if fname.startswith(prefix):
|
||||
fname = fname[len(prefix) :]  # slice off the literal prefix; str.lstrip would strip a character set
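When removing a prefix such as `"_qlib_"`, it matters that `str.lstrip` treats its argument as a set of characters rather than as a literal prefix. The short sketch below contrasts the two behaviours; the file name `"_qlib_bin"` is a made-up example.

```python
prefix = "_qlib_"
fname = "_qlib_bin"

# lstrip removes every leading character that appears in "_qlib_" (that is: _, q, l, i, b)
stripped_by_charset = fname.lstrip(prefix)

# slicing removes exactly the prefix and nothing more
stripped_by_slice = fname[len(prefix):] if fname.startswith(prefix) else fname
```

Names whose first characters after the prefix also occur in the prefix are the ones affected; slicing (or `str.removeprefix` on Python 3.9+) avoids the problem.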
|
||||
|
||||
56
qlib/utils/data.py
Normal file
@@ -0,0 +1,56 @@
|
||||
# Copyright (c) Microsoft Corporation.
|
||||
# Licensed under the MIT License.
|
||||
from typing import Union
|
||||
import pandas as pd
|
||||
import numpy as np
|
||||
|
||||
|
||||
def robust_zscore(x: pd.Series, zscore=False):
|
||||
"""Robust ZScore Normalization
|
||||
|
||||
Use robust statistics for Z-Score normalization:
|
||||
mean(x) = median(x)
|
||||
std(x) = MAD(x) * 1.4826
|
||||
|
||||
Reference:
|
||||
https://en.wikipedia.org/wiki/Median_absolute_deviation.
|
||||
"""
|
||||
x = x - x.median()
|
||||
mad = x.abs().median()
|
||||
x = np.clip(x / mad / 1.4826, -3, 3)
|
||||
if zscore:
|
||||
x -= x.mean()
|
||||
x /= x.std()
|
||||
return x
|
||||
|
||||
|
||||
def zscore(x: Union[pd.Series, pd.DataFrame]):
|
||||
return (x - x.mean()).div(x.std())
|
||||
|
||||
|
||||
def deepcopy_basic_type(obj: object) -> object:
|
||||
"""
|
||||
Deep-copy an object without copying the complicated objects inside it.
|
||||
This is useful when you want to generate Qlib tasks and share the handler
|
||||
|
||||
NOTE:
|
||||
- This function can't handle recursive objects!
|
||||
|
||||
Parameters
|
||||
----------
|
||||
obj : object
|
||||
the object to be copied
|
||||
|
||||
Returns
|
||||
-------
|
||||
object:
|
||||
The copied object
|
||||
"""
|
||||
if isinstance(obj, tuple):
|
||||
return tuple(deepcopy_basic_type(i) for i in obj)
|
||||
elif isinstance(obj, list):
|
||||
return list(deepcopy_basic_type(i) for i in obj)
|
||||
elif isinstance(obj, dict):
|
||||
return {k: deepcopy_basic_type(v) for k, v in obj.items()}
|
||||
else:
|
||||
return obj
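The effect of `deepcopy_basic_type` on a task config can be sketched as follows. The function body is repeated so the example is standalone; `Handler` is a hypothetical placeholder for an expensive object that should be shared rather than copied.

```python
def deepcopy_basic_type(obj):
    # containers are rebuilt recursively; every other object is passed through as-is
    if isinstance(obj, tuple):
        return tuple(deepcopy_basic_type(i) for i in obj)
    elif isinstance(obj, list):
        return list(deepcopy_basic_type(i) for i in obj)
    elif isinstance(obj, dict):
        return {k: deepcopy_basic_type(v) for k, v in obj.items()}
    else:
        return obj


class Handler:
    """Hypothetical heavy object, e.g. something holding loaded data."""


handler = Handler()
task = {
    "dataset": {"kwargs": {"handler": handler, "segments": {"train": ("2008-01-01", "2014-12-31")}}},
    "tags": ["base"],
}
copied = deepcopy_basic_type(task)
copied["tags"].append("variant")  # mutating the copy leaves the original untouched
```

The nested dicts, lists, and tuples are new objects, while the handler instance is the same object in both tasks.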
|
||||
@@ -1,11 +1,165 @@
|
||||
# Copyright (c) Microsoft Corporation.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
# TODO: move file related utils into this module
|
||||
import contextlib
|
||||
from typing import IO, Union
|
||||
import os
|
||||
import shutil
|
||||
import tempfile
|
||||
import contextlib
|
||||
from typing import Optional, Text, IO, Union
|
||||
from pathlib import Path
|
||||
|
||||
from qlib.log import get_module_logger
|
||||
|
||||
log = get_module_logger("utils.file")
|
||||
|
||||
|
||||
def get_or_create_path(path: Optional[Text] = None, return_dir: bool = False):
|
||||
"""Create or get a file or directory given the path and return_dir.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
path: a string indicates the path or None indicates creating a temporary path.
|
||||
return_dir: if True, create and return a directory; otherwise create and return a file.
|
||||
|
||||
"""
|
||||
if path:
|
||||
if return_dir and not os.path.exists(path):
|
||||
os.makedirs(path)
|
||||
elif not return_dir: # return a file, thus we need to create its parent directory
|
||||
xpath = os.path.abspath(os.path.join(path, ".."))
|
||||
if not os.path.exists(xpath):
|
||||
os.makedirs(xpath)
|
||||
else:
|
||||
temp_dir = os.path.expanduser("~/tmp")
|
||||
if not os.path.exists(temp_dir):
|
||||
os.makedirs(temp_dir)
|
||||
if return_dir:
|
||||
path = tempfile.mkdtemp(dir=temp_dir)  # mkdtemp returns the path only, not a tuple
|
||||
else:
|
||||
_, path = tempfile.mkstemp(dir=temp_dir)
|
||||
return path
|
||||
|
||||
|
||||
@contextlib.contextmanager
|
||||
def save_multiple_parts_file(filename, format="gztar"):
|
||||
"""Save multiple parts file
|
||||
|
||||
Implementation process:
|
||||
1. get the absolute path to 'filename'
|
||||
2. create a 'filename' directory
|
||||
3. user does something with file_path('filename/')
|
||||
4. make_archive 'filename' directory
|
||||
5. remove 'filename' directory, and rename 'archive file' to filename
|
||||
|
||||
:param filename: result model path
|
||||
:param format: archive format: one of "zip", "tar", "gztar", "bztar", or "xztar"
|
||||
:return: real model path
|
||||
|
||||
Usage::
|
||||
|
||||
>>> # The following code will create an archive file('~/tmp/test_file') containing 'test_doc_i'(i is 0-9) files.
|
||||
>>> with save_multiple_parts_file('~/tmp/test_file') as filename_dir:
|
||||
... for i in range(10):
|
||||
... temp_path = os.path.join(filename_dir, 'test_doc_{}'.format(str(i)))
|
||||
... with open(temp_path, 'w') as fp:
|
||||
... fp.write(str(i))
|
||||
...
|
||||
|
||||
"""
|
||||
|
||||
if filename.startswith("~"):
|
||||
filename = os.path.expanduser(filename)
|
||||
|
||||
file_path = os.path.abspath(filename)
|
||||
|
||||
# Create model dir
|
||||
if os.path.exists(file_path):
|
||||
raise FileExistsError("ERROR: file exists: {}, cannot create the directory.".format(file_path))
|
||||
|
||||
os.makedirs(file_path)
|
||||
|
||||
# return model dir
|
||||
yield file_path
|
||||
|
||||
# filename dir to filename.tar.gz file
|
||||
tar_file = shutil.make_archive(file_path, format=format, root_dir=file_path)
|
||||
|
||||
# Remove filename dir
|
||||
if os.path.exists(file_path):
|
||||
shutil.rmtree(file_path)
|
||||
|
||||
# filename.tar.gz rename to filename
|
||||
os.rename(tar_file, file_path)
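The directory-then-archive flow above can be tried end to end in a throwaway location. The following is a minimal sketch, not the Qlib function itself: `save_parts_sketch`, the temporary directory, and the `part_i.txt` file names are assumptions made for the demo.

```python
import contextlib
import os
import shutil
import tarfile
import tempfile


@contextlib.contextmanager
def save_parts_sketch(filename, format="gztar"):
    file_path = os.path.abspath(os.path.expanduser(filename))
    if os.path.exists(file_path):
        raise FileExistsError(f"file exists: {file_path}")
    os.makedirs(file_path)  # working directory named like the target file
    yield file_path  # the caller writes its parts into this directory
    archive = shutil.make_archive(file_path, format=format, root_dir=file_path)
    shutil.rmtree(file_path)  # drop the working directory
    os.rename(archive, file_path)  # the archive takes over the original name


workdir = tempfile.mkdtemp()  # throwaway location for the demo
target = os.path.join(workdir, "model_parts")
with save_parts_sketch(target) as parts_dir:
    for i in range(3):
        with open(os.path.join(parts_dir, f"part_{i}.txt"), "w") as fp:
            fp.write(str(i))

# after the block exits, `target` is a single archive file containing the parts
with tarfile.open(target) as tf:
    names = sorted(os.path.basename(m.name) for m in tf.getmembers() if m.isfile())
```

Inside the `with` block the target path is a directory; once the block exits it is a regular file holding the archive, so callers see one artifact regardless of how many parts were written.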
|
||||
|
||||
|
||||
@contextlib.contextmanager
|
||||
def unpack_archive_with_buffer(buffer, format="gztar"):
|
||||
"""Unpack archive with archive buffer
|
||||
After the call is finished, the archive file and directory will be deleted.
|
||||
|
||||
Implementation process:
|
||||
1. create 'tempfile' in '~/tmp/' and directory
|
||||
2. write 'buffer' to 'tempfile'
|
||||
3. unpack archive file('tempfile')
|
||||
4. user does something with file_path('tempfile/')
|
||||
5. remove 'tempfile' and 'tempfile directory'
|
||||
|
||||
:param buffer: bytes
|
||||
:param format: archive format: one of "zip", "tar", "gztar", "bztar", or "xztar"
|
||||
:return: unpack archive directory path
|
||||
|
||||
Usage::
|
||||
|
||||
>>> # The following code is to print all the file names in 'test_unpack.tar.gz'
|
||||
>>> with open('test_unpack.tar.gz', 'rb') as fp:
|
||||
... buffer = fp.read()
|
||||
...
|
||||
>>> with unpack_archive_with_buffer(buffer) as temp_dir:
|
||||
... for f_n in os.listdir(temp_dir):
|
||||
... print(f_n)
|
||||
...
|
||||
|
||||
"""
|
||||
temp_dir = os.path.expanduser("~/tmp")
|
||||
if not os.path.exists(temp_dir):
|
||||
os.makedirs(temp_dir)
|
||||
with tempfile.NamedTemporaryFile("wb", delete=False, dir=temp_dir) as fp:
|
||||
fp.write(buffer)
|
||||
file_path = fp.name
|
||||
|
||||
try:
|
||||
tar_file = file_path + ".tar.gz"
|
||||
os.rename(file_path, tar_file)
|
||||
# Create dir
|
||||
os.makedirs(file_path)
|
||||
shutil.unpack_archive(tar_file, format=format, extract_dir=file_path)
|
||||
|
||||
# Return temp dir
|
||||
yield file_path
|
||||
|
||||
except Exception as e:
|
||||
log.error(str(e))
|
||||
finally:
|
||||
# Remove temp tar file
|
||||
if os.path.exists(tar_file):
|
||||
os.unlink(tar_file)
|
||||
|
||||
# Remove temp model dir
|
||||
if os.path.exists(file_path):
|
||||
shutil.rmtree(file_path)
|
||||
|
||||
|
||||
@contextlib.contextmanager
|
||||
def get_tmp_file_with_buffer(buffer):
|
||||
temp_dir = os.path.expanduser("~/tmp")
|
||||
if not os.path.exists(temp_dir):
|
||||
os.makedirs(temp_dir)
|
||||
with tempfile.NamedTemporaryFile("wb", delete=True, dir=temp_dir) as fp:
|
||||
fp.write(buffer)
|
||||
file_path = fp.name
|
||||
yield file_path
|
||||
|
||||
|
||||
@contextlib.contextmanager
|
||||
def get_io_object(file: Union[IO, str, Path], *args, **kwargs) -> IO:
|
||||
|
||||
@@ -11,23 +11,41 @@ from ..config import C
|
||||
class Serializable:
|
||||
"""
|
||||
Serializable will change the behaviors of pickle.
|
||||
- It only saves the state whose name **does not** start with `_`
|
||||
|
||||
The rules below decide whether an attribute is kept or dropped when dumping.
|
||||
The rule with higher priorities is on the top
|
||||
- in the config attribute list -> always dropped
|
||||
- in the include attribute list -> always kept
|
||||
- in the exclude attribute list -> always dropped
|
||||
- name not starts with `_` -> kept
|
||||
- name starts with `_` -> kept if `dump_all` is true else dropped
|
||||
|
||||
It provides syntactic sugar for distinguishing the attributes which the user doesn't want.
|
||||
- For example, a learnable Datahandler just wants to save the parameters without data when dumping to disk
|
||||
"""
|
||||
|
||||
pickle_backend = "pickle" # another optional value is "dill" which can pickle more things of python.
|
||||
default_dump_all = False # if dump all things
|
||||
config_attr = ["_include", "_exclude"]
|
||||
exclude_attr = [] # exclude_attr have lower priorities than `self._exclude`
|
||||
include_attr = [] # include_attr have lower priorities than `self._include`
|
||||
FLAG_KEY = "_qlib_serial_flag"
|
||||
|
||||
def __init__(self):
|
||||
self._dump_all = self.default_dump_all
|
||||
self._exclude = []
|
||||
self._exclude = None # this attribute have higher priorities than `exclude_attr`
|
||||
|
||||
def _is_kept(self, key):
|
||||
if key in self.config_attr:
|
||||
return False
|
||||
if key in self._get_attr_list("include"):
|
||||
return True
|
||||
if key in self._get_attr_list("exclude"):
|
||||
return False
|
||||
return self.dump_all or not key.startswith("_")
|
||||
|
||||
def __getstate__(self) -> dict:
|
||||
return {
|
||||
k: v for k, v in self.__dict__.items() if k not in self.exclude and (self.dump_all or not k.startswith("_"))
|
||||
}
|
||||
return {k: v for k, v in self.__dict__.items() if self._is_kept(k)}
|
||||
|
||||
def __setstate__(self, state: dict):
|
||||
self.__dict__.update(state)
|
||||
@@ -39,52 +57,77 @@ class Serializable:
|
||||
"""
|
||||
return getattr(self, "_dump_all", False)
|
||||
|
||||
@property
|
||||
def exclude(self):
|
||||
def _get_attr_list(self, attr_type: str) -> list:
|
||||
"""
|
||||
What attribute will not be dumped
|
||||
"""
|
||||
return getattr(self, "_exclude", [])
|
||||
What attribute will not be in specific list
|
||||
|
||||
def config(self, dump_all: bool = None, exclude: list = None, recursive=False):
|
||||
Parameters
|
||||
----------
|
||||
attr_type : str
|
||||
"include" or "exclude"
|
||||
|
||||
Returns
|
||||
-------
|
||||
list:
|
||||
"""
|
||||
if hasattr(self, f"_{attr_type}"):
|
||||
res = getattr(self, f"_{attr_type}", [])
|
||||
else:
|
||||
res = getattr(self.__class__, f"{attr_type}_attr", [])
|
||||
if res is None:
|
||||
return []
|
||||
return res
|
||||
|
||||
def config(self, recursive=False, **kwargs):
|
||||
"""
|
||||
configure the serializable object
|
||||
|
||||
Parameters
|
||||
----------
|
||||
kwargs may include following keys
|
||||
|
||||
dump_all : bool
|
||||
will the object dump all object
|
||||
exclude : list
|
||||
What attribute will not be dumped
|
||||
include : list
|
||||
What attribute will be dumped
|
||||
|
||||
recursive : bool
|
||||
will the configuration be recursive
|
||||
"""
|
||||
|
||||
params = {"dump_all": dump_all, "exclude": exclude}
|
||||
|
||||
for k, v in params.items():
|
||||
if v is not None:
|
||||
keys = {"dump_all", "exclude", "include"}
|
||||
for k, v in kwargs.items():
|
||||
if k in keys:
|
||||
attr_name = f"_{k}"
|
||||
setattr(self, attr_name, v)
|
||||
else:
|
||||
raise KeyError(f"Unknown parameter: {k}")
|
||||
|
||||
if recursive:
|
||||
for obj in self.__dict__.values():
|
||||
# set flag to prevent endless loop
|
||||
self.__dict__[self.FLAG_KEY] = True
|
||||
if isinstance(obj, Serializable) and self.FLAG_KEY not in obj.__dict__:
|
||||
obj.config(**params, recursive=True)
|
||||
obj.config(recursive=True, **kwargs)
|
||||
del self.__dict__[self.FLAG_KEY]
|
||||
|
||||
def to_pickle(self, path: Union[Path, str], dump_all: bool = None, exclude: list = None):
|
||||
def to_pickle(self, path: Union[Path, str], **kwargs):
|
||||
"""
|
||||
Dump self to a pickle file.
|
||||
|
||||
Args:
|
||||
path (Union[Path, str]): the path to dump
|
||||
dump_all (bool, optional): if need to dump all things. Defaults to None.
|
||||
exclude (list, optional): will exclude the attributes in this list when dumping. Defaults to None.
|
||||
|
||||
kwargs may include following keys
|
||||
|
||||
dump_all : bool
|
||||
will the object dump all object
|
||||
exclude : list
|
||||
What attribute will not be dumped
|
||||
include : list
|
||||
What attribute will be dumped
|
||||
"""
|
||||
self.config(dump_all=dump_all, exclude=exclude)
|
||||
self.config(**kwargs)
|
||||
with Path(path).open("wb") as f:
|
||||
# pickle interface like backend; such as dill
|
||||
self.get_backend().dump(self, f, protocol=C.dump_protocol_version)
|
||||
|
||||
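Note on the `Serializable` hunks above: the docstring lists the keep/drop priority as config list, then include list, then exclude list, then the leading-underscore check. The following is a minimal standalone sketch of that ordering, not the qlib class itself. `KeepRule`, its constructor arguments, and the sample `state` dict are hypothetical names used only for illustration.

```python
class KeepRule:
    """Toy stand-in for the attribute filter described in the docstring above."""

    # Names that hold the filter configuration itself; these are never dumped.
    config_attr = ["_include", "_exclude"]

    def __init__(self, include=None, exclude=None, dump_all=False):
        self._include = include or []
        self._exclude = exclude or []
        self._dump_all = dump_all

    def is_kept(self, key: str) -> bool:
        # Rules are checked from highest to lowest priority.
        if key in self.config_attr:
            return False
        if key in self._include:
            return True
        if key in self._exclude:
            return False
        # Fallback: private names are kept only when dump_all is set.
        return self._dump_all or not key.startswith("_")


rule = KeepRule(include=["_cache"], exclude=["data"])
state = {"_include": 1, "_cache": 2, "data": 3, "params": 4, "_tmp": 5}
kept = {k: v for k, v in state.items() if rule.is_kept(k)}
```

In this sketch `_cache` survives because the include list outranks the underscore rule, while `data` is dropped even though its name is public.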
@@ -2,7 +2,7 @@
|
||||
# Licensed under the MIT License.
|
||||
|
||||
from contextlib import contextmanager
|
||||
from typing import Any, Dict, Text, Optional
|
||||
from typing import Text, Optional, Any, Dict, Text, Optional
|
||||
from .expm import ExpManager
|
||||
from .exp import Experiment
|
||||
from .recorder import Recorder
|
||||
@@ -15,7 +15,7 @@ class QlibRecorder:
|
||||
A global system that helps to manage the experiments.
|
||||
"""
|
||||
|
||||
def __init__(self, exp_manager):
|
||||
def __init__(self, exp_manager: ExpManager):
|
||||
self.exp_manager: ExpManager = exp_manager
|
||||
|
||||
def __repr__(self):
|
||||
@@ -341,6 +341,10 @@ class QlibRecorder:
|
||||
def set_uri(self, uri: Optional[Text]):
|
||||
"""
|
||||
Method to reset the current uri of current experiment manager.
|
||||
|
||||
NOTE:
|
||||
- When the uri refers to a file path, please use the absolute path instead of strings like "~/mlruns/"
|
||||
The backend doesn't support strings like this.
|
||||
"""
|
||||
self.exp_manager.set_uri(uri)
|
||||
|
||||
@@ -501,13 +505,13 @@ class QlibRecorder:
|
||||
raise ValueError(
|
||||
"You can choose only one of `local_path`(save the files in a path) or `kwargs`(pass in the objects directly)"
|
||||
)
|
||||
self.get_exp().get_recorder().save_objects(local_path, artifact_path, **kwargs)
|
||||
self.get_exp().get_recorder(start=True).save_objects(local_path, artifact_path, **kwargs)
|
||||
|
||||
def load_object(self, name: Text):
|
||||
"""
|
||||
Method for loading an object from artifacts in the experiment in the uri.
|
||||
"""
|
||||
return self.get_exp().get_recorder().load_object(name)
|
||||
return self.get_exp().get_recorder(start=True).load_object(name)
|
||||
|
||||
def log_params(self, **kwargs):
|
||||
"""
|
||||
@@ -532,7 +536,7 @@ class QlibRecorder:
|
||||
keyword argument:
|
||||
name1=value1, name2=value2, ...
|
||||
"""
|
||||
self.get_exp().get_recorder().log_params(**kwargs)
|
||||
self.get_exp().get_recorder(start=True).log_params(**kwargs)
|
||||
|
||||
def log_metrics(self, step=None, **kwargs):
|
||||
"""
|
||||
@@ -557,7 +561,7 @@ class QlibRecorder:
|
||||
keyword argument:
|
||||
name1=value1, name2=value2, ...
|
||||
"""
|
||||
self.get_exp().get_recorder().log_metrics(step, **kwargs)
|
||||
self.get_exp().get_recorder(start=True).log_metrics(step, **kwargs)
|
||||
|
||||
def set_tags(self, **kwargs):
|
||||
"""
|
||||
@@ -582,7 +586,7 @@ class QlibRecorder:
|
||||
keyword argument:
|
||||
name1=value1, name2=value2, ...
|
||||
"""
|
||||
self.get_exp().get_recorder().set_tags(**kwargs)
|
||||
self.get_exp().get_recorder(start=True).set_tags(**kwargs)
|
||||
|
||||
|
||||
class RecorderWrapper(Wrapper):
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
# Copyright (c) Microsoft Corporation.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
from typing import Dict, Union
|
||||
from typing import Dict, List, Union
|
||||
import mlflow, logging
|
||||
from mlflow.entities import ViewType
|
||||
from mlflow.exceptions import MlflowException
|
||||
@@ -22,6 +22,7 @@ class Experiment:
|
||||
self.id = id
|
||||
self.name = name
|
||||
self.active_recorder = None # only one recorder can running each time
|
||||
self._default_rec_name = "abstract_recorder"
|
||||
|
||||
def __repr__(self):
|
||||
return "{name}(id={id}, info={info})".format(name=self.__class__.__name__, id=self.id, info=self.info)
|
||||
@@ -150,7 +151,7 @@ class Experiment:
|
||||
create : boolean
|
||||
create the recorder if it hasn't been created before.
|
||||
start : boolean
|
||||
start the new recorder if one is created.
|
||||
start the new recorder if one is **created**.
|
||||
|
||||
Returns
|
||||
-------
|
||||
@@ -214,7 +215,10 @@ class Experiment:
|
||||
"""
|
||||
raise NotImplementedError(f"Please implement the `_get_recorder` method")
|
||||
|
||||
def list_recorders(self, **flt_kwargs) -> Dict[str, Recorder]:
|
||||
RT_D = "dict" # return type dict
|
||||
RT_L = "list" # return type list
|
||||
|
||||
def list_recorders(self, rtype: str = RT_D, **flt_kwargs) -> Union[List[Recorder], Dict[str, Recorder]]:
|
||||
"""
|
||||
List all the existing recorders of this experiment. Please first get the experiment instance before calling this method.
|
||||
If user want to use the method `R.list_recorders()`, please refer to the related API document in `QlibRecorder`.
|
||||
@@ -225,7 +229,11 @@ class Experiment:
|
||||
|
||||
Returns
|
||||
-------
|
||||
The return type depends on `rtype`
|
||||
if `rtype` == "dict":
|
||||
A dictionary (id -> recorder) of recorder information that being stored.
|
||||
elif `rtype` == "list":
|
||||
A list of Recorder.
|
||||
"""
|
||||
raise NotImplementedError(f"Please implement the `list_recorders` method.")
|
||||
|
||||
@@ -326,9 +334,16 @@ class MLflowExperiment(Experiment):
|
||||
UNLIMITED = 50000 # FIXME: Mlflow can only list 50000 records at most!!!!!!!
|
||||
|
||||
def list_recorders(
|
||||
self, max_results: int = UNLIMITED, status: Union[str, None] = None, filter_string: str = ""
|
||||
) -> Dict[str, Recorder]:
|
||||
self,
|
||||
rtype=Experiment.RT_D,
|
||||
max_results: int = UNLIMITED,
|
||||
status: Union[str, None] = None,
|
||||
filter_string: str = "",
|
||||
):
|
||||
"""
|
||||
Quoting docs of search_runs
|
||||
> The default ordering is to sort by start_time DESC, then run_id.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
max_results : int
|
||||
@@ -342,10 +357,17 @@ class MLflowExperiment(Experiment):
|
||||
runs = self._client.search_runs(
|
||||
self.id, run_view_type=ViewType.ACTIVE_ONLY, max_results=max_results, filter_string=filter_string
|
||||
)
|
||||
recorders = dict()
|
||||
rids = []
|
||||
recorders = []
|
||||
for i in range(len(runs)):
|
||||
recorder = MLflowRecorder(self.id, self._uri, mlflow_run=runs[i])
|
||||
if status is None or recorder.status == status:
|
||||
recorders[runs[i].info.run_id] = recorder
|
||||
rids.append(runs[i].info.run_id)
|
||||
recorders.append(recorder)
|
||||
|
||||
if rtype == Experiment.RT_D:
|
||||
return dict(zip(rids, recorders))
|
||||
elif rtype == Experiment.RT_L:
|
||||
return recorders
|
||||
else:
|
||||
raise NotImplementedError(f"This type of input is not supported")
|
||||
|
||||
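Note on the `list_recorders` hunks above: the new `rtype` argument switches the return value between an id-keyed dict and a plain list that keeps the order of the search results. A rough sketch of that dispatch follows, assuming each run is a plain dict with `run_id` and `status` keys. That run shape and the free function are illustrative assumptions; no MLflow call is made.

```python
RT_D = "dict"  # return type dict
RT_L = "list"  # return type list


def list_recorders(runs, rtype=RT_D, status=None):
    """Filter toy run records by status and return them as a dict or a list."""
    rids = []
    recorders = []
    for run in runs:
        if status is None or run["status"] == status:
            rids.append(run["run_id"])
            recorders.append(run)

    if rtype == RT_D:
        return dict(zip(rids, recorders))
    elif rtype == RT_L:
        return recorders
    else:
        raise NotImplementedError("This type of input is not supported")


runs = [
    {"run_id": "a1", "status": "FINISHED"},
    {"run_id": "b2", "status": "FAILED"},
    {"run_id": "c3", "status": "FINISHED"},
]
as_dict = list_recorders(runs, status="FINISHED")
as_list = list_recorders(runs, rtype=RT_L, status="FINISHED")
```

Collecting ids and recorders in parallel lists lets one filtering pass serve both return types.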
@@ -17,7 +17,7 @@ from .recorder import Recorder
|
||||
from ..log import get_module_logger
|
||||
from ..utils.exceptions import ExpAlreadyExistError
|
||||
|
||||
logger = get_module_logger("workflow", logging.INFO)
|
||||
logger = get_module_logger("workflow")
|
||||
|
||||
|
||||
class ExpManager:
|
||||
@@ -279,7 +279,8 @@ class ExpManager:
|
||||
|
||||
"""
|
||||
if uri is None:
|
||||
logger.info("No tracking URI is provided. Use the default tracking URI.")
|
||||
if self._current_uri is None:
|
||||
logger.debug("No tracking URI is provided. Use the default tracking URI.")
|
||||
self._current_uri = self.default_uri
|
||||
else:
|
||||
# Temporarily re-set the current uri as the uri argument.
|
||||
@@ -290,6 +291,7 @@ class ExpManager:
|
||||
def _set_uri(self):
|
||||
"""
|
||||
Customized features for subclasses' set_uri function.
|
||||
This method is designed for the underlying experiment backend storage.
|
||||
"""
|
||||
raise NotImplementedError(f"Please implement the `_set_uri` method.")
|
||||
|
||||
@@ -351,8 +353,6 @@ class MLflowExpManager(ExpManager):
|
||||
if self.active_experiment is not None:
|
||||
self.active_experiment.end(recorder_status)
|
||||
self.active_experiment = None
|
||||
# When an experiment end, we will release the current uri.
|
||||
self._current_uri = None
|
||||
|
||||
def create_exp(self, experiment_name: Optional[Text] = None):
|
||||
assert experiment_name is not None
|
||||
|
||||
@@ -14,8 +14,9 @@ from ..data.dataset import DatasetH
|
||||
from ..data.dataset.handler import DataHandlerLP
|
||||
from ..backtest import backtest as normal_backtest
|
||||
from ..log import get_module_logger
|
||||
from ..utils import flatten_dict, class_casting
|
||||
from ..utils import fill_placeholder, flatten_dict, class_casting, get_date_by_shift
|
||||
from ..utils.time import Freq
|
||||
from ..utils.data import deepcopy_basic_type
|
||||
from ..contrib.eva.alpha import calc_ic, calc_long_short_return, calc_long_short_prec
|
||||
|
||||
|
||||
@@ -175,9 +176,10 @@ class SignalRecord(RecordTemp):
|
||||
del params["data_key"]
|
||||
# The backend handler should be DataHandler
|
||||
raw_label = dataset.prepare(**params)
|
||||
except AttributeError:
|
||||
except AttributeError as e:
|
||||
# The data handler is initialized with `drop_raw=True`...
|
||||
# So raw_label is not available
|
||||
logger.warning(f"Exception: {e}")
|
||||
raw_label = None
|
||||
return raw_label
|
||||
|
||||
@@ -203,6 +205,35 @@ class SignalRecord(RecordTemp):
|
||||
return ["pred.pkl", "label.pkl"]
|
||||
|
||||
|
||||
class ACRecordTemp(RecordTemp):
|
||||
"""Automatically checking record template"""
|
||||
|
||||
def __init__(self, recorder, skip_existing=False):
|
||||
self.skip_existing = skip_existing
|
||||
super().__init__(recorder=recorder)
|
||||
|
||||
def generate(self, *args, **kwargs):
|
||||
"""automatically checking the files and then run the concrete generating task"""
|
||||
if self.skip_existing:
|
||||
try:
|
||||
self.check(include_self=True, parents=False)
|
||||
except FileNotFoundError:
|
||||
pass # continue to generating metrics
|
||||
else:
|
||||
logger.info("The results has previously generated, Generation skipped.")
|
||||
return
|
||||
|
||||
try:
|
||||
self.check()
|
||||
except FileNotFoundError:
|
||||
logger.warning("The dependent data does not exists. Generation skipped.")
|
||||
return
|
||||
return self._generate(*args, **kwargs)
|
||||
|
||||
def _generate(self, *args, **kwargs):
|
||||
raise NotImplementedError(f"Please implement the `_generate` method")
|
||||
|
||||
|
||||
class HFSignalRecord(SignalRecord):
|
||||
"""
|
||||
This is the Signal Analysis Record class that generates the analysis results such as IC and IR. This class inherits the ``RecordTemp`` class.
|
||||
@@ -250,7 +281,7 @@ class HFSignalRecord(SignalRecord):
|
||||
return ["ic.pkl", "ric.pkl", "long_pre.pkl", "short_pre.pkl", "long_short_r.pkl", "long_avg_r.pkl"]
|
||||
|
||||
|
||||
class SigAnaRecord(RecordTemp):
|
||||
class SigAnaRecord(ACRecordTemp):
|
||||
"""
|
||||
This is the Signal Analysis Record class that generates the analysis results such as IC and IR. This class inherits the ``RecordTemp`` class.
|
||||
"""
|
||||
@@ -259,39 +290,23 @@ class SigAnaRecord(RecordTemp):
|
||||
depend_cls = SignalRecord
|
||||
|
||||
def __init__(self, recorder, ana_long_short=False, ann_scaler=252, label_col=0, skip_existing=False):
|
||||
super().__init__(recorder=recorder)
|
||||
super().__init__(recorder=recorder, skip_existing=skip_existing)
|
||||
self.ana_long_short = ana_long_short
|
||||
self.ann_scaler = ann_scaler
|
||||
self.label_col = label_col
|
||||
self.skip_existing = skip_existing
|
||||
|
||||
def generate(self, label: Optional[pd.DataFrame] = None, **kwargs):
|
||||
def _generate(self, label: Optional[pd.DataFrame] = None, **kwargs):
|
||||
"""
|
||||
Parameters
|
||||
----------
|
||||
label : Optional[pd.DataFrame]
|
||||
Label should be a dataframe.
|
||||
"""
|
||||
if self.skip_existing:
|
||||
try:
|
||||
self.check(include_self=True, parents=False)
|
||||
except FileNotFoundError:
|
||||
pass # continue to generating metrics
|
||||
else:
|
||||
logger.info("The results has previously generated, Generation skipped.")
|
||||
return
|
||||
|
||||
try:
|
||||
self.check()
|
||||
except FileNotFoundError:
|
||||
logger.warning("The dependent data does not exists. Generation skipped.")
|
||||
return
|
||||
|
||||
pred = self.load("pred.pkl")
|
||||
if label is None:
|
||||
label = self.load("label.pkl")
|
||||
if label is None or not isinstance(label, pd.DataFrame) or label.empty:
|
||||
logger.warn(f"Empty label.")
|
||||
logger.warning(f"Empty label.")
|
||||
return
|
||||
ic, ric = calc_ic(pred.iloc[:, 0], label.iloc[:, self.label_col])
|
||||
metrics = {
|
||||
@@ -328,7 +343,7 @@ class SigAnaRecord(RecordTemp):
|
||||
return paths
|
||||
|
||||
|
||||
class PortAnaRecord(RecordTemp):
|
||||
class PortAnaRecord(ACRecordTemp):
|
||||
"""
|
||||
This is the Portfolio Analysis Record class that generates the analysis results such as those of backtest. This class inherits the ``RecordTemp`` class.
|
||||
|
||||
@@ -339,14 +354,35 @@ class PortAnaRecord(RecordTemp):
|
||||
"""
|
||||
|
||||
artifact_path = "portfolio_analysis"
|
||||
depend_cls = SignalRecord
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
recorder,
|
||||
config,
|
||||
config: dict = { # Default config for daily trading
|
||||
"strategy": {
|
||||
"class": "TopkDropoutStrategy",
|
||||
"module_path": "qlib.contrib.strategy",
|
||||
"kwargs": {"signal": "<PRED>", "topk": 50, "n_drop": 5},
|
||||
},
|
||||
"backtest": {
|
||||
"start_time": None,
|
||||
"end_time": None,
|
||||
"account": 100000000,
|
||||
"benchmark": "SH000300",
|
||||
"exchange_kwargs": {
|
||||
"limit_threshold": 0.095,
|
||||
"deal_price": "close",
|
||||
"open_cost": 0.0005,
|
||||
"close_cost": 0.0015,
|
||||
"min_cost": 5,
|
||||
},
|
||||
},
|
||||
},
|
||||
risk_analysis_freq: Union[List, str] = None,
|
||||
indicator_analysis_freq: Union[List, str] = None,
|
||||
indicator_analysis_method=None,
|
||||
skip_existing=False,
|
||||
**kwargs,
|
||||
):
|
||||
"""
|
||||
@@ -363,7 +399,12 @@ class PortAnaRecord(RecordTemp):
|
||||
indicator_analysis_method : str, optional, default by None
|
||||
the candidated values include 'mean', 'amount_weighted', 'value_weighted'
|
||||
"""
|
||||
super().__init__(recorder=recorder, **kwargs)
|
||||
super().__init__(recorder=recorder, skip_existing=skip_existing, **kwargs)
|
||||
|
||||
# We only deepcopy_basic_type because
|
||||
# - We don't want to affect the config outside.
|
||||
# - We don't want to deepcopy complex object to avoid overhead
|
||||
config = deepcopy_basic_type(config)
|
||||
|
||||
self.strategy_config = config["strategy"]
|
||||
_default_executor_config = {
|
||||
@@ -405,7 +446,21 @@ class PortAnaRecord(RecordTemp):
|
||||
ret_freq.extend(self._get_report_freq(executor_config["kwargs"]["inner_executor"]))
|
||||
return ret_freq
|
||||
|
||||
def generate(self, **kwargs):
|
||||
def _generate(self, **kwargs):
|
||||
pred = self.load("pred.pkl")
|
||||
|
||||
# replace the "<PRED>" with prediction saved before
|
||||
placehorder_value = {"<PRED>": pred}
|
||||
for k in "executor_config", "strategy_config":
|
||||
setattr(self, k, fill_placeholder(getattr(self, k), placehorder_value))
|
||||
|
||||
# if the backtesting time range is not set, it will automatically extract time range from the prediction file
|
||||
dt_values = pred.index.get_level_values("datetime")
|
||||
if self.backtest_config["start_time"] is None:
|
||||
self.backtest_config["start_time"] = dt_values.min()
|
||||
if self.backtest_config["end_time"] is None:
|
||||
self.backtest_config["end_time"] = get_date_by_shift(dt_values.max(), 1)
|
||||
|
||||
# custom strategy and get backtest
|
||||
portfolio_metric_dict, indicator_dict = normal_backtest(
|
||||
executor=self.executor_config, strategy=self.strategy_config, **self.backtest_config
|
||||
|
||||
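Note on the `ACRecordTemp` hunks above: the refactor moves the "skip if already generated" and "skip if dependencies are missing" checks into a shared `generate`, and subclasses implement only `_generate`. A minimal sketch of that template-method flow, assuming artifacts are just names in an in-memory set. `ToyRecord`, `store`, `depends`, and `outputs` are hypothetical and do not mirror the real recorder API.

```python
class ToyRecord:
    """Hypothetical record template; artifacts are names stored in a set."""

    depends = ["pred.pkl"]  # files that must exist before generating
    outputs = ["ic.pkl"]    # files this record produces

    def __init__(self, store: set, skip_existing: bool = False):
        self.store = store
        self.skip_existing = skip_existing
        self.runs = 0

    def _check(self, names):
        missing = [n for n in names if n not in self.store]
        if missing:
            raise FileNotFoundError(missing)

    def generate(self):
        if self.skip_existing:
            try:
                self._check(self.outputs)
            except FileNotFoundError:
                pass  # outputs are absent, so continue to generation
            else:
                return None  # outputs already exist; skip
        try:
            self._check(self.depends)
        except FileNotFoundError:
            return None  # dependent data is missing; skip
        return self._generate()

    def _generate(self):
        # Concrete work lives here; subclasses would override this method.
        self.runs += 1
        self.store.update(self.outputs)
        return list(self.outputs)


store = set()
rec = ToyRecord(store, skip_existing=True)
first = rec.generate()      # dependency missing
store.add("pred.pkl")
second = rec.generate()     # runs the concrete step
third = rec.generate()      # outputs exist, so skipped
```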
@@ -306,6 +306,7 @@ class MLflowRecorder(Recorder):
|
||||
self.end_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
|
||||
if self.status != Recorder.STATUS_S:
|
||||
self.status = status
|
||||
if self.async_log is not None:
|
||||
with TimeInspector.logt("waiting `async_log`"):
|
||||
self.async_log.wait()
|
||||
self.async_log = None
|
||||
|
||||
@@ -6,7 +6,7 @@ TaskGenerator module can generate many tasks based on TaskGen and some task temp
|
||||
import abc
|
||||
import copy
|
||||
import pandas as pd
|
||||
from typing import List, Union, Callable
|
||||
from typing import Dict, List, Union, Callable
|
||||
|
||||
from qlib.utils import transform_end_date
|
||||
from .utils import TimeAdjuster
|
||||
@@ -119,14 +119,38 @@ def handler_mod(task: dict, rolling_gen):
|
||||
pass
|
||||
except TypeError:
|
||||
# Maybe the handler is a string. `"handler.pkl"["kwargs"]` will raise TypeError
|
||||
# e.g. a dumped file like file:///<file>/
|
||||
pass
|
||||
|
||||
|
||||
def trunc_segments(ta: TimeAdjuster, segments: Dict[str, pd.Timestamp], days, test_key="test"):
|
||||
"""
|
||||
To avoid the leakage of future information, the segments should be truncated according to the test start_time
|
||||
|
||||
NOTE:
|
||||
This function will change segments **inplace**
|
||||
"""
|
||||
# adjust segment
|
||||
test_start = min(t for t in segments[test_key] if t is not None)
|
||||
for k in list(segments.keys()):
|
||||
if k != test_key:
|
||||
segments[k] = ta.truncate(segments[k], test_start, days)
|
||||
|
||||
|
||||
class RollingGen(TaskGen):
|
||||
ROLL_EX = TimeAdjuster.SHIFT_EX # fixed start date, expanding end date
|
||||
ROLL_SD = TimeAdjuster.SHIFT_SD # fixed segments size, slide it from start date
|
||||
|
||||
def __init__(self, step: int = 40, rtype: str = ROLL_EX, ds_extra_mod_func: Union[None, Callable] = handler_mod):
|
||||
def __init__(
|
||||
self,
|
||||
step: int = 40,
|
||||
rtype: str = ROLL_EX,
|
||||
ds_extra_mod_func: Union[None, Callable] = handler_mod,
|
||||
test_key="test",
|
||||
train_key="train",
|
||||
trunc_days: int = None,
|
||||
task_copy_func: Callable = copy.deepcopy,
|
||||
):
|
||||
"""
|
||||
Generate tasks for rolling
|
||||
|
||||
@@ -139,14 +163,20 @@ class RollingGen(TaskGen):
|
||||
ds_extra_mod_func: Callable
|
||||
A method like: handler_mod(task: dict, rg: RollingGen)
|
||||
Do some extra action after generating a task. For example, use ``handler_mod`` to modify the end time of the handler of a dataset.
|
||||
trunc_days: int
|
||||
trunc some data to avoid future information leakage
|
||||
task_copy_func: Callable
|
||||
the function to copy entire task. This is very useful when user want to share something between tasks
|
||||
"""
|
||||
self.step = step
|
||||
self.rtype = rtype
|
||||
self.ds_extra_mod_func = ds_extra_mod_func
|
||||
self.ta = TimeAdjuster(future=True)
|
||||
|
||||
self.test_key = "test"
|
||||
self.train_key = "train"
|
||||
self.test_key = test_key
|
||||
self.train_key = train_key
|
||||
self.trunc_days = trunc_days
|
||||
self.task_copy_func = task_copy_func
|
||||
|
||||
def _update_task_segs(self, task, segs):
|
||||
# update segments of this task
|
||||
@@ -191,7 +221,7 @@ class RollingGen(TaskGen):
|
||||
break
|
||||
|
||||
prev_seg = segments
|
||||
t = copy.deepcopy(task) # deepcopy is necessary to avoid modify task inplace
|
||||
t = self.task_copy_func(task) # deepcopy is necessary to avoid replace task inplace
|
||||
self._update_task_segs(t, segments)
|
||||
yield t
|
||||
|
||||
@@ -247,7 +277,7 @@ class RollingGen(TaskGen):
|
||||
"""
|
||||
res = []
|
||||
|
||||
t = copy.deepcopy(task)
|
||||
t = self.task_copy_func(task)
|
||||
|
||||
# calculate segments
|
||||
|
||||
@@ -258,6 +288,8 @@ class RollingGen(TaskGen):
|
||||
# 2) and init test segments
|
||||
test_start_idx = self.ta.align_idx(segments[self.test_key][0])
|
||||
segments[self.test_key] = (self.ta.get(test_start_idx), self.ta.get(test_start_idx + self.step - 1))
|
||||
if self.trunc_days is not None:
|
||||
trunc_segments(self.ta, segments, self.trunc_days, self.test_key)
|
||||
|
||||
# update segments of this task
|
||||
self._update_task_segs(t, segments)
|
||||
@@ -313,10 +345,7 @@ class MultiHorizonGenBase(TaskGen):
|
||||
|
||||
# adjust segment
|
||||
segments = self.ta.align_seg(t["dataset"]["kwargs"]["segments"])
|
||||
test_start = min(t for t in segments[self.test_key] if t is not None)
|
||||
for k in list(segments.keys()):
|
||||
if k != self.test_key:
|
||||
segments[k] = self.ta.truncate(segments[k], test_start, hr + self.label_leak_n)
|
||||
trunc_segments(self.ta, segments, days=hr + self.label_leak_n, test_key=self.test_key)
|
||||
t["dataset"]["kwargs"]["segments"] = segments
|
||||
res.append(t)
|
||||
return res
|
||||
|
||||
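Note on `trunc_segments` above: every non-test segment is truncated relative to the earliest non-`None` test boundary so that labels cannot reach into the test period. The sketch below illustrates the idea on integer calendar positions. The body of `truncate` is an assumption (each endpoint is capped at `days` positions before the test start), since the real `TimeAdjuster.truncate` implementation is not shown in this diff.

```python
def truncate(segment, test_start, days):
    """Assumed behavior: cap each endpoint at `days` positions before test_start."""
    limit = test_start - days
    return tuple(min(tp, limit) for tp in segment)


def trunc_segments(segments, days, test_key="test"):
    """Truncate all non-test segments in place, mirroring the helper above."""
    test_start = min(t for t in segments[test_key] if t is not None)
    for k in list(segments.keys()):
        if k != test_key:
            segments[k] = truncate(segments[k], test_start, days)


# Integer positions stand in for trading-calendar timestamps.
segments = {"train": (0, 95), "valid": (96, 99), "test": (100, 139)}
trunc_segments(segments, days=3)
```

Here only the end of `valid` moves (from 99 to 97), because it is the only endpoint within 3 positions of the test start.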
@@ -100,7 +100,7 @@ class TimeAdjuster:
|
||||
idx : int
|
||||
index of the calendar
|
||||
"""
|
||||
if idx >= len(self.cals):
|
||||
if idx is None or idx >= len(self.cals):
|
||||
return None
|
||||
return self.cals[idx]
|
||||
|
||||
@@ -123,6 +123,9 @@ class TimeAdjuster:
|
||||
-------
|
||||
index : int
|
||||
"""
|
||||
if time_point is None:
|
||||
# `None` indicates unbounded index/border
|
||||
return None
|
||||
time_point = pd.Timestamp(time_point)
|
||||
if tp_type == "start":
|
||||
idx = bisect.bisect_left(self.cals, time_point)
|
||||
@@ -158,6 +161,8 @@ class TimeAdjuster:
|
||||
Returns:
|
||||
pd.Timestamp
|
||||
"""
|
||||
if time_point is None:
|
||||
return None
|
||||
return self.cals[self.align_idx(time_point, tp_type=tp_type)]
|
||||
|
||||
def align_seg(self, segment: Union[dict, tuple]) -> Union[dict, tuple]:
|
||||
@@ -201,6 +206,10 @@ class TimeAdjuster:
|
||||
days : int
|
||||
The trading days to be truncated
|
||||
the data in this segment may need 'days' data
|
||||
`days` are based on the `test_start`.
|
||||
For example, if the label contains the information of 2 days in the near future and the prediction horizon is 1 day
|
||||
(e.g. the prediction target is `Ref($close, -2)/Ref($close, -1) - 1`)
|
||||
the days should be 2 + 1 == 3 days.
|
||||
|
||||
Returns
|
||||
---------
|
||||
@@ -220,10 +229,17 @@ class TimeAdjuster:
|
||||
SHIFT_SD = "sliding"
|
||||
SHIFT_EX = "expanding"
|
||||
|
||||
def _add_step(self, index, step):
|
||||
if index is None:
|
||||
return None
|
||||
return index + step
|
||||
|
||||
def shift(self, seg: tuple, step: int, rtype=SHIFT_SD) -> tuple:
|
||||
"""
|
||||
Shift the datetime of segment
|
||||
|
||||
If there are None (which indicates unbounded index) in the segment, this method will return None.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
seg :
|
||||
@@ -245,13 +261,13 @@ class TimeAdjuster:
|
||||
if isinstance(seg, tuple):
|
||||
start_idx, end_idx = self.align_idx(seg[0], tp_type="start"), self.align_idx(seg[1], tp_type="end")
|
||||
if rtype == self.SHIFT_SD:
|
||||
start_idx += step
|
||||
end_idx += step
|
||||
start_idx = self._add_step(start_idx, step)
|
||||
end_idx = self._add_step(end_idx, step)
|
||||
elif rtype == self.SHIFT_EX:
|
||||
end_idx += step
|
||||
end_idx = self._add_step(end_idx, step)
|
||||
else:
|
||||
raise NotImplementedError(f"This type of input is not supported")
|
||||
if start_idx > len(self.cals):
|
||||
if start_idx is not None and start_idx > len(self.cals):
|
||||
raise KeyError("The segment is out of valid calendar")
|
||||
return self.get(start_idx), self.get(end_idx)
|
||||
else:
|
||||
|
||||
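Note on the `TimeAdjuster` hunks above: `None` now marks an unbounded segment border, and the index helpers pass it through instead of failing on arithmetic. A standalone sketch of that behavior on a toy integer calendar follows. The free functions, the `cals` list, and the `bisect_right(...) - 1` rule for the "end" alignment are assumptions for illustration; only the "start" alignment via `bisect_left` is visible in the diff.

```python
import bisect

SHIFT_SD = "sliding"
SHIFT_EX = "expanding"


def align_idx(cals, time_point, tp_type="start"):
    if time_point is None:
        return None  # None indicates an unbounded border
    if tp_type == "start":
        return bisect.bisect_left(cals, time_point)
    return bisect.bisect_right(cals, time_point) - 1  # assumed "end" rule


def add_step(index, step):
    return None if index is None else index + step


def get(cals, idx):
    if idx is None or idx >= len(cals):
        return None
    return cals[idx]


def shift(cals, seg, step, rtype=SHIFT_SD):
    start_idx = align_idx(cals, seg[0], "start")
    end_idx = align_idx(cals, seg[1], "end")
    if rtype == SHIFT_SD:
        start_idx = add_step(start_idx, step)
        end_idx = add_step(end_idx, step)
    elif rtype == SHIFT_EX:
        end_idx = add_step(end_idx, step)  # start stays fixed
    else:
        raise NotImplementedError("This type of input is not supported")
    if start_idx is not None and start_idx > len(cals):
        raise KeyError("The segment is out of valid calendar")
    return get(cals, start_idx), get(cals, end_idx)


cals = [10, 20, 30, 40, 50]
open_start = shift(cals, (None, 30), 1)           # unbounded start stays None
expanding = shift(cals, (10, 30), 1, SHIFT_EX)    # only the end moves
open_end = shift(cals, (10, None), 2)             # unbounded end stays None
```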