mirror of
https://github.com/microsoft/qlib.git
synced 2026-07-06 04:20:57 +08:00
split code into core and contrib for data&model
@@ -195,8 +195,8 @@ Your PR of new Quant models is highly welcomed.
 
 # Quant Dataset Zoo
 Dataset plays a very important role in Quant. Here is a list of the datasets built on `Qlib`.
-- [Alpha360](./qlib/contrib/estimator/handler.py)
-- [Alpha158](./qlib/contrib/estimator/handler.py)
+- [Alpha360](./qlib/contrib/data/handler.py)
+- [Alpha158](./qlib/contrib/data/handler.py)
 
 [Here](https://qlib.readthedocs.io/en/latest/advanced/alpha.html) is a tutorial to build dataset with `Qlib`.
 Your PR to build new Quant dataset is highly welcomed.
@@ -49,7 +49,7 @@ Users can use ``Data Handler`` to build formulaic alphas `MACD` in qlib:
 
 .. code-block:: python
 
-    >> from qlib.contrib.estimator.handler import QLibDataHandler
+    >> from qlib.data.dataset.handler import QLibDataHandler
     >> MACD_EXP = '(EMA($close, 12) - EMA($close, 26))/$close - EMA((EMA($close, 12) - EMA($close, 26))/$close, 9)/$close'
     >> fields = [MACD_EXP] # MACD
     >> names = ['MACD']
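Reviewer note: for readers unfamiliar with the expression syntax, the `MACD_EXP` string in this hunk can be approximated in plain pandas. This is only a sketch. It assumes that `EMA(x, N)` in the expression behaves like a span-`N` exponential moving average (`Series.ewm(span=N)`), which the diff itself does not confirm. The helper names `ema` and `macd_like` and the sample prices are hypothetical.

```python
import pandas as pd


def ema(series: pd.Series, span: int) -> pd.Series:
    # Assumption: EMA(x, N) is a span-N exponential moving average.
    return series.ewm(span=span, min_periods=1, adjust=False).mean()


def macd_like(close: pd.Series) -> pd.Series:
    # Mirrors MACD_EXP term by term:
    # (EMA(c, 12) - EMA(c, 26))/c - EMA((EMA(c, 12) - EMA(c, 26))/c, 9)/c
    dif = (ema(close, 12) - ema(close, 26)) / close
    dea = ema(dif, 9) / close
    return dif - dea


# Hypothetical closing prices for a single instrument.
close = pd.Series([10.0, 10.2, 10.1, 10.4, 10.3, 10.6])
macd = macd_like(close)

# A flat price series has no momentum, so the indicator stays at zero.
flat = macd_like(pd.Series([5.0] * 8))
```

In Qlib the same quantity would be requested through the expression string rather than computed by hand; the sketch is only meant to make the formula readable.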
@@ -156,12 +156,12 @@ Data Handler
 
 Users can use ``Data Handler`` in an automatic workflow by ``Estimator``, refer to `Estimator: Workflow Management <estimator.html>`_ for more details.
 
-Also, ``Data Handler`` can be used as an independent module, by which users can easily preprocess data(standardization, remove NaN, etc.) and build datasets. It is a subclass of ``qlib.contrib.estimator.handler.BaseDataHandler``, which provides some interfaces as follows.
+Also, ``Data Handler`` can be used as an independent module, by which users can easily preprocess data(standardization, remove NaN, etc.) and build datasets. It is a subclass of ``qlib.data.dataset.handler.BaseDataHandler``, which provides some interfaces as follows.
 
 Base Class & Interface
 ----------------------
 
-Qlib provides a base class `qlib.contrib.estimator.BaseDataHandler <../reference/api.html#qlib.contrib.estimator.handler.BaseDataHandler>`_, which provides the following interfaces:
+Qlib provides a base class `qlib.data.dataset.BaseDataHandler <../reference/api.html#qlib.data.dataset.handler.BaseDataHandler>`_, which provides the following interfaces:
 
 - `setup_feature`
 Implement the interface to load the data features.
@@ -182,7 +182,7 @@ Qlib also provides two functions to help users init the data handler, users can
 Users can init the raw df, feature names, and label names of data handler in this function.
 If the index of feature df and label df are not the same, users need to override this method to merge them (e.g. inner, left, right merge).
 
-If users want to load features and labels by config, users can inherit ``qlib.contrib.estimator.handler.ConfigDataHandler``, ``Qlib`` also provides some preprocess method in this subclass.
+If users want to load features and labels by config, users can inherit ``qlib.data.dataset.handler.ConfigDataHandler``, ``Qlib`` also provides some preprocess method in this subclass.
 If users want to use qlib data, `QLibDataHandler` is recommended. Users can inherit their custom class from `QLibDataHandler`, which is also a subclass of `ConfigDataHandler`.
 
 
@@ -214,7 +214,7 @@ Qlib provides implemented data handler `Alpha158`. The following example shows h
 
 .. code-block:: Python
 
-    from qlib.contrib.estimator.handler import Alpha158
+    from qlib.contrib.data.handler import Alpha158
     from qlib.contrib.model.gbdt import LGBModel
 
     DATA_HANDLER_CONFIG = {
@@ -251,7 +251,7 @@ Also, the above example has been given in ``examples.estimator.train_backtest_an
 API
 ---------
 
-To know more about ``Data Handler``, please refer to `Data Handler API <../reference/api.html#module-qlib.contrib.estimator.handler>`_.
+To know more about ``Data Handler``, please refer to `Data Handler API <../reference/api.html#module-qlib.data.dataset.handler>`_.
 
 Cache
 ==========
@@ -266,7 +266,7 @@ Users can use a specified model by configuration with hyper-parameters.
 Custom Models
 ~~~~~~~~~~~~~~~~~
 
-Qlib supports custom models, but it must be a subclass of the `qlib.contrib.model.Model`, the config for a custom model may be as following.
+Qlib supports custom models, but it must be a subclass of the `qlib.model.Model`, the config for a custom model may be as following.
 
 .. code-block:: YAML
 
@@ -284,7 +284,7 @@ To know more about ``Interday Model``, please refer to `Interday Model: Training
 Data Section
 -----------------
 
-``Data Handler`` can be used to load raw data, prepare features and label columns, preprocess data (standardization, remove NaN, etc.), split training, validation, and test sets. It is a subclass of `qlib.contrib.estimator.handler.BaseDataHandler`.
+``Data Handler`` can be used to load raw data, prepare features and label columns, preprocess data (standardization, remove NaN, etc.), split training, validation, and test sets. It is a subclass of `qlib.data.dataset.handler.BaseDataHandler`.
 
 Users can use the specified data handler by config as follows.
 
@@ -315,10 +315,10 @@ Users can use the specified data handler by config as follows.
 fend_time: 2018-12-11
 
 - `class`
-Data handler class, str type, which should be a subclass of `qlib.contrib.estimator.handler.BaseDataHandler`, and implements 5 important interfaces for loading features, loading raw data, preprocessing raw data, slicing train, validation, and test data. The default value is `ALPHA360`. If users want to write a data handler to retrieve the data in ``Qlib``, `QlibDataHandler` is suggested.
+Data handler class, str type, which should be a subclass of `qlib.data.dataset.handler.BaseDataHandler`, and implements 5 important interfaces for loading features, loading raw data, preprocessing raw data, slicing train, validation, and test data. The default value is `ALPHA360`. If users want to write a data handler to retrieve the data in ``Qlib``, `QlibDataHandler` is suggested.
 
 - `module_path`
-The module path, str type, absolute url is also supported, indicates the path of the `class` implementation of the data processor class. The default value is `qlib.contrib.estimator.handler`.
+The module path, str type, absolute url is also supported, indicates the path of the `class` implementation of the data processor class. The default value is `qlib.data.dataset.handler`.
 
 - `args`
 Parameters used for ``Data Handler`` initialization.
@@ -376,7 +376,7 @@ Qlib support custom data handler, but it must be a subclass of the ``qlib.contri
 
 The class `SomeDataHandler` should be in the module `custom_data_handler`, and ``Qlib`` could parse the `module_path` to load the class.
 
-If users want to load features and labels by config, they can inherit ``qlib.contrib.estimator.handler.ConfigDataHandler``, ``Qlib`` also has provided some preprocess methods in this subclass.
+If users want to load features and labels by config, they can inherit ``qlib.data.dataset.handler.ConfigDataHandler``, ``Qlib`` also has provided some preprocess methods in this subclass.
 If users want to use qlib data, `QLibDataHandler` is recommended, from which users can inherit the custom class. `QLibDataHandler` is also a subclass of `ConfigDataHandler`.
 
 To know more about ``Data Handler``, please refer to `Data Framework&Usage <data.html>`_.
@@ -13,7 +13,7 @@ Because the components in ``Qlib`` are designed in a loosely-coupled way, ``Inte
 Base Class & Interface
 ======================
 
-``Qlib`` provides a base class `qlib.contrib.model.base.Model <../reference/api.html#module-qlib.contrib.model.base>`_ from which all models should inherit.
+``Qlib`` provides a base class `qlib.model.base.Model <../reference/api.html#module-qlib.model.base>`_ from which all models should inherit.
 
 The base class provides the following interfaces:
 
@@ -110,7 +110,7 @@ The base class provides the following interfaces:
 The format of `w_test` is same as `w_train` in `fit` method.
 - Return: float type, evaluation score
 
-For other interfaces such as `save`, `load`, `finetune`, please refer to `Model API <../reference/api.html#module-qlib.contrib.model.base>`_.
+For other interfaces such as `save`, `load`, `finetune`, please refer to `Model API <../reference/api.html#module-qlib.model.base>`_.
 
 Example
 ==================
@@ -121,7 +121,7 @@ Example
 - Run the following code to get the `prediction score` `pred_score`
     .. code-block:: Python
 
-        from qlib.contrib.estimator.handler import Alpha158
+        from qlib.contrib.data.handler import Alpha158
         from qlib.contrib.model.gbdt import LGBModel
 
         DATA_HANDLER_CONFIG = {
@@ -175,4 +175,4 @@ Qlib supports custom models. If users are interested in customizing their own mo
 
 API
 ===================
-Please refer to `Model API <../reference/api.html#module-qlib.contrib.model.base>`_.
+Please refer to `Model API <../reference/api.html#module-qlib.model.base>`_.
@@ -63,12 +63,12 @@ Contrib
 
 Data Handler
 ---------------
-.. automodule:: qlib.contrib.estimator.handler
+.. automodule:: qlib.data.dataset.handler
     :members:
 
 Model
 --------------------
-.. automodule:: qlib.contrib.model.base
+.. automodule:: qlib.model.base
     :members:
 
 Strategy
@@ -9,13 +9,13 @@ Introduction
 
 Users can integrate their own custom models according to the following steps.
 
-- Define a custom model class, which should be a subclass of the `qlib.contrib.model.base.Model <../reference/api.html#module-qlib.contrib.model.base>`_.
+- Define a custom model class, which should be a subclass of the `qlib.model.base.Model <../reference/api.html#module-qlib.model.base>`_.
 - Write a configuration file that describes the path and parameters of the custom model.
 - Test the custom model.
 
 Custom Model Class
 ===========================
-The Custom models need to inherit `qlib.contrib.model.base.Model <../reference/api.html#module-qlib.contrib.model.base>`_ and override the methods in it.
+The Custom models need to inherit `qlib.model.base.Model <../reference/api.html#module-qlib.model.base>`_ and override the methods in it.
 
 - Override the `__init__` method
     - ``Qlib`` passes the initialized parameters to the \_\_init\_\_ method.
@@ -63,7 +63,7 @@ The Custom models need to inherit `qlib.contrib.model.base.Model <../reference/a
 - Override the `predict` method
     - The parameters include the test features.
     - Return the `prediction score`.
-    - Please refer to `Model API <../reference/api.html#module-qlib.contrib.model.base>`_ for the parameter types of the fit method.
+    - Please refer to `Model API <../reference/api.html#module-qlib.model.base>`_ for the parameter types of the fit method.
     - Code Example: In the following example, users need to use dnn to predict the label(such as `preds`) of test data `x_test` and return it.
         .. code-block:: Python
 
@@ -143,4 +143,4 @@ Also, ``Model`` can also be tested as a single module. An example has been given
 Reference
 =====================
 
-To know more about ``Interday Model``, please refer to `Interday Model: Model Training & Prediction <../component/model.html>`_ and `Model API <../reference/api.html#module-qlib.contrib.model.base>`_.
+To know more about ``Interday Model``, please refer to `Interday Model: Model Training & Prediction <../component/model.html>`_ and `Model API <../reference/api.html#module-qlib.model.base>`_.
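Reviewer note: the documentation hunks above describe the custom-model steps (inherit the base `Model`, override `__init__`, `fit` and `predict`) only in prose. A minimal sketch of that shape follows. The `Model` class here is a hypothetical stand-in so the snippet runs without Qlib; in a real project it would presumably be imported from `qlib.model.base` after this commit. The method signatures and the `RidgeModel` example are assumptions, not the library's confirmed API.

```python
import numpy as np


class Model:
    """Hypothetical stand-in for the Qlib base model class."""

    def fit(self, x_train, y_train, x_valid=None, y_valid=None, **kwargs):
        raise NotImplementedError

    def predict(self, x_test, **kwargs):
        raise NotImplementedError


class RidgeModel(Model):
    def __init__(self, alpha=1.0):
        # Hyper-parameters arrive through __init__ (the `args` section of a config).
        self.alpha = alpha
        self.coef_ = None

    def fit(self, x_train, y_train, x_valid=None, y_valid=None, **kwargs):
        x = np.asarray(x_train, dtype=float)
        y = np.asarray(y_train, dtype=float).ravel()
        # Closed-form ridge regression: (X'X + alpha * I)^-1 X'y
        gram = x.T @ x + self.alpha * np.eye(x.shape[1])
        self.coef_ = np.linalg.solve(gram, x.T @ y)
        return self

    def predict(self, x_test, **kwargs):
        if self.coef_ is None:
            raise ValueError("model is not fitted yet")
        # The returned array plays the role of the prediction score.
        return np.asarray(x_test, dtype=float) @ self.coef_


# Toy data generated from known coefficients [2, -1].
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = x @ np.array([2.0, -1.0])
model = RidgeModel(alpha=0.0).fit(x, y)
pred_score = model.predict(x)
```

With `alpha=0.0` the fit reduces to ordinary least squares, so the toy coefficients are recovered exactly up to floating-point error.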
@@ -5,7 +5,7 @@ experiment:
 
 model:
     class: LGBModel
-    module_path: qlib.contrib.model.gbdt
+    module_path: qlib.gbdt.model.gbdt
     args:
         loss: mse
         colsample_bytree: 0.8879
@@ -4,7 +4,7 @@ experiment:
     mode: train
 
 model:
-    module_path: qlib.contrib.model.pytorch_nn
+    module_path: qlib.model.pytorch_nn
     class: DNNModelPytorch
     args:
         loss: mse
@@ -8,7 +8,7 @@ import qlib
 import pandas as pd
 from qlib.config import REG_CN
 from qlib.contrib.model.gbdt import LGBModel
-from qlib.contrib.estimator.handler import Alpha158
+from qlib.contrib.data.handler import Alpha158
 from qlib.contrib.strategy.strategy import TopkDropoutStrategy
 from qlib.contrib.evaluate import (
     backtest as normal_backtest,
63
qlib/contrib/data/handler.py
Normal file
@@ -0,0 +1,63 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
+from ...data.dataset.handler import ConfigQLibDataHandler
+from ...log import TimeInspector
+
+
+class ALPHA360(ConfigQLibDataHandler):
+    config_template = {
+        "price": {"windows": range(60)},
+        "volume": {"windows": range(60)},
+    }
+
+
+class QLibDataHandlerV1(ConfigQLibDataHandler):
+    config_template = {
+        "kbar": {},
+        "price": {
+            "windows": [0],
+            "feature": ["OPEN", "HIGH", "LOW", "VWAP"],
+        },
+        "rolling": {},
+    }
+
+    def __init__(self, start_date, end_date, processors=None, **kwargs):
+        if processors is None:
+            processors = ["PanelProcessor"] # V1 default processor
+        super().__init__(start_date, end_date, processors, **kwargs)
+
+    def setup_label(self):
+        """
+        load the labels df
+        :return: df_labels
+        """
+        TimeInspector.set_time_mark()
+
+        df_labels = super().setup_label()
+
+        ## calculate new labels
+        df_labels["LABEL1"] = df_labels["LABEL0"].groupby(level="datetime").apply(lambda x: (x - x.mean()) / x.std())
+
+        df_labels = df_labels.drop(["LABEL0"], axis=1)
+
+        TimeInspector.log_cost_time("Finished loading labels.")
+
+        return df_labels
+
+
+class Alpha158(QLibDataHandlerV1):
+    config_template = {
+        "kbar": {},
+        "price": {
+            "windows": [0],
+            "feature": ["OPEN", "HIGH", "LOW", "CLOSE"],
+        },
+        "rolling": {},
+    }
+
+    def _init_kwargs(self, **kwargs):
+        kwargs["labels"] = ["Ref($close, -2)/Ref($close, -1) - 1"]
+        super(Alpha158, self)._init_kwargs(**kwargs)
+
+
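Reviewer note: `QLibDataHandlerV1.setup_label` replaces the raw label `LABEL0` with `LABEL1`, a z-score computed within each `datetime` cross-section. The sketch below reproduces that step on a toy frame. The instrument codes and label values are made up, and `transform` is used in place of the `apply` call from the diff. That swap is a deliberate choice on my side: in recent pandas versions `groupby(...).apply` may prepend the group key to the result index, while `transform` always keeps the original index.

```python
import pandas as pd

# Hypothetical (instrument, datetime) panel with one raw label column.
index = pd.MultiIndex.from_product(
    [["SH600000", "SH600004", "SH600009"], pd.to_datetime(["2020-01-02", "2020-01-03"])],
    names=["instrument", "datetime"],
)
df_labels = pd.DataFrame({"LABEL0": [0.01, 0.03, -0.02, 0.00, 0.04, -0.03]}, index=index)

# Standardize the label within each trading day (cross-sectional z-score).
by_day = df_labels["LABEL0"].groupby(level="datetime")
df_labels["LABEL1"] = (df_labels["LABEL0"] - by_day.transform("mean")) / by_day.transform("std")

# Drop the raw label, as the handler does.
df_labels = df_labels.drop(columns=["LABEL0"])

daily_mean = df_labels["LABEL1"].groupby(level="datetime").mean()
daily_std = df_labels["LABEL1"].groupby(level="datetime").std()
```

After the transformation every trading day has a label mean of zero and a sample standard deviation of one, which is what makes labels comparable across days.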
@@ -103,7 +103,7 @@ class DataConfig(object):
         :param config: The config dict for data
         :param CONFIG_MANAGER: The estimator config manager
         """
-        self.handler_module_path = config.get("module_path", "qlib.contrib.estimator.handler")
+        self.handler_module_path = config.get("module_path", "qlib.contrib.data.handler")
         self.handler_class = config.get("class", "ALPHA360")
         self.handler_parameters = config.get("args", dict())
         self.handler_filter = config.get("filter", dict())
@@ -118,7 +118,7 @@ class ModelConfig(object):
         :param CONFIG_MANAGER: The estimator config manager
         """
         self.model_class = config.get("class", "Model")
-        self.model_module_path = config.get("module_path", "qlib.contrib.model")
+        self.model_module_path = config.get("module_path", "qlib.model")
         self.save_dir = os.path.join(CONFIG_MANAGER.ex_config.tmp_run_dir, "model")
         self.save_path = config.get("save_path", os.path.join(self.save_dir, "model.bin"))
         self.parameters = config.get("args", dict())
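Reviewer note: these two hunks only change default `module_path` strings, so it may help to see why those defaults matter. A `module_path` / `class` pair is typically resolved with a dynamic import, roughly as sketched below. The helper `load_class` is hypothetical and is not the estimator's actual loader (which, per the docs above, also accepts file paths). Standard-library modules stand in for the Qlib modules so the sketch runs anywhere.

```python
import importlib


def load_class(config: dict, default_module: str, default_class: str):
    # Fall back to the defaults when the config omits a key,
    # mirroring the config.get(..., default) pattern in the diff.
    module_path = config.get("module_path", default_module)
    class_name = config.get("class", default_class)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Explicit config: both keys are given.
handler_cls = load_class(
    {"module_path": "collections", "class": "OrderedDict"},
    default_module="collections",
    default_class="Counter",
)

# Empty config: the defaults decide what gets loaded.
default_cls = load_class({}, default_module="collections", default_class="Counter")
```

Because the defaults are plain strings, a stale default module path would only fail at import time, which is why every default has to move together with the code it names.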
0
qlib/contrib/model/__init__.py
Normal file
@@ -9,7 +9,7 @@ import numpy as np
 import lightgbm as lgb
 from sklearn.metrics import roc_auc_score, mean_squared_error
 
-from .base import Model
+from ...model.base import Model
 from ...utils import drop_nan_by_y_index
 
 
@@ -17,7 +17,7 @@ import torch
 import torch.nn as nn
 import torch.optim as optim
 
-from .base import Model
+from ...model.base import Model
 
 
 class DNNModelPytorch(Model):
0
qlib/data/dataset/__init__.py
Normal file
@@ -513,73 +513,3 @@ class ConfigQLibDataHandler(QLibDataHandler):
         if "labels" not in kwargs:
             kwargs["labels"] = ["Ref($vwap, -2)/Ref($vwap, -1) - 1"]
         super()._init_kwargs(**kwargs)
-
-
-class ALPHA360(ConfigQLibDataHandler):
-    config_template = {
-        "price": {"windows": range(60)},
-        "volume": {"windows": range(60)},
-    }
-
-
-class QLibDataHandlerV1(ConfigQLibDataHandler):
-    config_template = {
-        "kbar": {},
-        "price": {
-            "windows": [0],
-            "feature": ["OPEN", "HIGH", "LOW", "VWAP"],
-        },
-        "rolling": {},
-    }
-
-    def __init__(self, start_date, end_date, processors=None, **kwargs):
-        if processors is None:
-            processors = ["PanelProcessor"] # V1 default processor
-        super().__init__(start_date, end_date, processors, **kwargs)
-
-    def setup_label(self):
-        """
-        load the labels df
-        :return: df_labels
-        """
-        TimeInspector.set_time_mark()
-
-        df_labels = super().setup_label()
-
-        ## calculate new labels
-        df_labels["LABEL1"] = df_labels["LABEL0"].groupby(level="datetime").apply(lambda x: (x - x.mean()) / x.std())
-
-        df_labels = df_labels.drop(["LABEL0"], axis=1)
-
-        TimeInspector.log_cost_time("Finished loading labels.")
-
-        return df_labels
-
-
-class Alpha158(QLibDataHandlerV1):
-    config_template = {
-        "kbar": {},
-        "price": {
-            "windows": [0],
-            "feature": ["OPEN", "HIGH", "LOW", "CLOSE"],
-        },
-        "rolling": {},
-    }
-
-    def _init_kwargs(self, **kwargs):
-        kwargs["labels"] = ["Ref($close, -2)/Ref($close, -1) - 1"]
-        super(Alpha158, self)._init_kwargs(**kwargs)
-
-
-# if __name__ == '__main__':
-# import qlib
-#
-# qlib.init()
-#
-# handler = ALPHA80('2010-01-01', '2018-12-31')
-# data = handler.get_split_data(
-# pd.Timestamp('2010-01-01'), pd.Timestamp('2014-01-01'),
-# pd.Timestamp('2015-01-01'), pd.Timestamp('2016-01-01'),
-# pd.Timestamp('2017-01-01'), pd.Timestamp('2018-01-01'))
-# print(data[0])
-# data[0].to_pickle('alpha80.pkl')
249
qlib/data/dataset/processor.py
Normal file
@@ -0,0 +1,249 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
+import abc
+import numpy as np
+import pandas as pd
+
+from ...log import TimeInspector
+
+EPS = 1e-12
+
+
+class Processor(abc.ABC):
+    def __init__(self, feature_names, label_names, **kwargs):
+        self.feature_names = feature_names
+        self.label_names = label_names
+
+    @abc.abstractmethod
+    def __call__(self, df_train, df_valid, df_test):
+        pass
+
+
+class PanelProcessor(Processor):
+    """Panel Preprocessor"""
+
+    STD_NORM = "Std"
+    MINMAX_NORM = "MinMax"
+
+    def __init__(self, feature_names, label_names, **kwargs):
+        super().__init__(feature_names, label_names)
+        # Options.
+        self.dropna_label = kwargs.get("dropna_label", True)
+        self.dropna_feature = kwargs.get("dropna_feature", False)
+        self.normalize_method = kwargs.get("normalize_method", None)
+        self.replace_inf = kwargs.get("replace_inf_feature", False)
+
+    def __call__(self, df_train, df_valid, df_test):
+        """
+        Preprocess the data
+        :param df: the dataframe to process data.
+        """
+        # Drop null labels.
+        if self.dropna_label:
+            df_train, df_valid, df_test = self._process_drop_null_label(df_train, df_valid, df_test)
+
+        # Dropna if need.
+        if self.dropna_feature:
+            df_train, df_valid, df_test = self._process_drop_null_feature(df_train, df_valid, df_test)
+
+        # replace the 'inf' with the mean the corresponding dimension
+        if self.replace_inf:
+            df_train, df_valid, df_test = self._process_replace_inf_feature(df_train, df_valid, df_test)
+
+        # normalize data in given method.
+        if self.normalize_method is not None:
+            df_train, df_valid, df_test = self._process_normalize_feature(df_train, df_valid, df_test)
+
+        return df_train, df_valid, df_test
+
+    def _process_drop_null_label(self, df_train, df_valid, df_test):
+        """
+        Drop null labels.
+        """
+        TimeInspector.set_time_mark()
+        df_train = df_train.dropna(subset=self.label_names)
+        df_valid = df_valid.dropna(subset=self.label_names)
+        # The test data's label is Unkown. They can not be seen when preprocessing
+        TimeInspector.log_cost_time("Finished dropping null labels.")
+
+        return df_train, df_valid, df_test
+
+    def _process_drop_null_feature(self, df_train, df_valid, df_test):
+        """
+        Drop data which contain null features if needed.
+        """
+        # TODO - `Pandas.dropna` is a low performance method.
+        TimeInspector.set_time_mark()
+        df_train = df_train.dropna(subset=self.feature_names)
+        df_valid = df_valid.dropna(subset=self.feature_names)
+        df_test = df_test.dropna(subset=self.feature_names)
+        TimeInspector.log_cost_time("Finished dropping nan.")
+
+        return df_train, df_valid, df_test
+
+    def _process_replace_inf_feature(self, df_train, df_valid, df_test):
+        """
+        replace the 'inf' in feature with the mean of this dimension.
+        """
+        TimeInspector.set_time_mark()
+
+        def replace_inf(data):
+            def process_inf(df):
+                for col in df.columns:
+                    df[col] = df[col].replace([np.inf, -np.inf], df[col][~np.isinf(df[col])].mean())
+                return df
+
+            data = data.groupby("datetime").apply(process_inf)
+            data.sort_index(inplace=True)
+            return data
+
+        df_train = replace_inf(df_train)
+        df_valid = replace_inf(df_valid)
+        df_test = replace_inf(df_test)
+        TimeInspector.log_cost_time("Finished replace inf.")
+
+        return df_train, df_valid, df_test
+
+    def _process_normalize_feature(self, df_train, df_valid, df_test):
+        """
+        Normalize data if needed, we provide two method now: min-max normalization and standard normalization.
+        """
+        TimeInspector.set_time_mark()
+
+        if self.normalize_method == self.MINMAX_NORM:
+            min_train = np.nanmin(df_train[self.feature_names].values, axis=0)
+            max_train = np.nanmax(df_train[self.feature_names].values, axis=0)
+            ignore = min_train == max_train
+
+            def normalize(x, min_train=min_train, max_train=max_train, ignore=ignore):
+                if (~ignore).all():
+                    return (x - min_train) / (max_train - min_train)
+                for i in range(ignore.size):
+                    if not ignore[i]:
+                        x[i] = (x[i] - min_train) / (max_train - min_train)
+                return x
+
+        elif self.normalize_method == self.STD_NORM:
+            mean_train = np.nanmean(df_train[self.feature_names].values, axis=0)
+            std_train = np.nanstd(df_train[self.feature_names].values, axis=0)
+            ignore = std_train == 0
+
+            def normalize(x, mean_train=mean_train, std_train=std_train, ignore=ignore):
+                if (~ignore).all():
+                    return (x - mean_train) / std_train
+                for i in range(ignore.size):
+                    if not ignore[i]:
+                        x[i] = (x[i] - mean_train) / std_train
+                return x
+
+        else:
+            raise ValueError("Normalize method {} is not allowed".format(self.normalize_method))
+
+        df_train.loc(axis=1)[self.feature_names] = normalize(df_train[self.feature_names].values)
+        df_valid.loc(axis=1)[self.feature_names] = normalize(df_valid[self.feature_names].values)
+        df_test.loc(axis=1)[self.feature_names] = normalize(df_test[self.feature_names].values)
+
+        TimeInspector.log_cost_time("Finished normalizing data.")
+
+        return df_train, df_valid, df_test
+
+
+class ConfigSectionProcessor(Processor):
+    def __init__(self, feature_names, label_names, **kwargs):
+        super().__init__(feature_names, label_names)
+        # Options
+        self.fillna_feature = kwargs.get("fillna_feature", True)
+        self.fillna_label = kwargs.get("fillna_label", True)
+        self.clip_feature_outlier = kwargs.get("clip_feature_outlier", False)
+        self.shrink_feature_outlier = kwargs.get("shrink_feature_outlier", True)
+        self.clip_label_outlier = kwargs.get("clip_label_outlier", False)
+
+    def __call__(self, *args):
+        return [self._transform(x) for x in args]
+
+    def _transform(self, df):
+        def _label_norm(x):
+            x = x - x.mean() # copy
+            x /= x.std()
+            if self.clip_label_outlier:
+                x.clip(-3, 3, inplace=True)
+            if self.fillna_label:
+                x.fillna(0, inplace=True)
+            return x
+
+        def _feature_norm(x):
+            x = x - x.median() # copy
+            x /= x.abs().median() * 1.4826
+            if self.clip_feature_outlier:
+                x.clip(-3, 3, inplace=True)
+            if self.shrink_feature_outlier:
+                x.where(x <= 3, 3 + (x - 3).div(x.max() - 3) * 0.5, inplace=True)
+                x.where(x >= -3, -3 - (x + 3).div(x.min() + 3) * 0.5, inplace=True)
+            if self.fillna_feature:
+                x.fillna(0, inplace=True)
+            return x
+
+        TimeInspector.set_time_mark()
+
+        # Copy
+        df_new = df.copy()
+
+        # Label
+        cols = df.columns[df.columns.str.contains("^LABEL")]
+        df_new[cols] = df[cols].groupby(level="datetime").apply(_label_norm)
+
+        # Features
+        cols = df.columns[df.columns.str.contains("^KLEN|^KLOW|^KUP")]
+        df_new[cols] = df[cols].apply(lambda x: x ** 0.25).groupby(level="datetime").apply(_feature_norm)
+
+        cols = df.columns[df.columns.str.contains("^KLOW2|^KUP2")]
+        df_new[cols] = df[cols].apply(lambda x: x ** 0.5).groupby(level="datetime").apply(_feature_norm)
+
+        _cols = [
+            "KMID",
+            "KSFT",
+            "OPEN",
+            "HIGH",
+            "LOW",
+            "CLOSE",
+            "VWAP",
+            "ROC",
+            "MA",
+            "BETA",
+            "RESI",
+            "QTLU",
+            "QTLD",
+            "RSV",
+            "SUMP",
+            "SUMN",
+            "SUMD",
+            "VSUMP",
+            "VSUMN",
+            "VSUMD",
+        ]
+        pat = "|".join(["^" + x for x in _cols])
+        cols = df.columns[df.columns.str.contains(pat) & (~df.columns.isin(["HIGH0", "LOW0"]))]
+        df_new[cols] = df[cols].groupby(level="datetime").apply(_feature_norm)
+
+        cols = df.columns[df.columns.str.contains("^STD|^VOLUME|^VMA|^VSTD")]
+        df_new[cols] = df[cols].apply(np.log).groupby(level="datetime").apply(_feature_norm)
+
+        cols = df.columns[df.columns.str.contains("^RSQR")]
+        df_new[cols] = df[cols].fillna(0).groupby(level="datetime").apply(_feature_norm)
+
+        cols = df.columns[df.columns.str.contains("^MAX|^HIGH0")]
+        df_new[cols] = df[cols].apply(lambda x: (x - 1) ** 0.5).groupby(level="datetime").apply(_feature_norm)
+
+        cols = df.columns[df.columns.str.contains("^MIN|^LOW0")]
+        df_new[cols] = df[cols].apply(lambda x: (1 - x) ** 0.5).groupby(level="datetime").apply(_feature_norm)
+
+        cols = df.columns[df.columns.str.contains("^CORR|^CORD")]
+        df_new[cols] = df[cols].apply(np.exp).groupby(level="datetime").apply(_feature_norm)
+
+        cols = df.columns[df.columns.str.contains("^WVMA")]
+        df_new[cols] = df[cols].apply(np.log1p).groupby(level="datetime").apply(_feature_norm)
+
+        TimeInspector.log_cost_time("Finished preprocessing data.")
+
+        return df_new
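Reviewer note on `PanelProcessor._process_normalize_feature` in the file above: the fallback loop iterates `i` over `ignore.size`, which is the number of feature columns, but indexes `x[i]`, which selects a row of the 2-D value array. If the intent is to leave constant columns untouched and scale the rest with training-set statistics, a column mask may be closer to it. The sketch below illustrates that reading for the min-max case. It is a suggestion under that assumption, not the committed behavior, and `minmax_by_column` is a hypothetical helper name.

```python
import numpy as np


def minmax_by_column(x_train: np.ndarray, x: np.ndarray) -> np.ndarray:
    # Statistics come from the training split only, as in the processor.
    min_train = np.nanmin(x_train, axis=0)
    max_train = np.nanmax(x_train, axis=0)
    # Columns with zero range cannot be scaled; leave them unchanged.
    keep = min_train != max_train

    out = np.array(x, dtype=float, copy=True)
    out[:, keep] = (out[:, keep] - min_train[keep]) / (max_train[keep] - min_train[keep])
    return out


# Two features: the first varies, the second is constant in the training set.
x_train = np.array([[1.0, 5.0], [3.0, 5.0], [2.0, 5.0]])
scaled = minmax_by_column(x_train, x_train)
```

The same mask works for the standard-score branch by swapping in the training mean and standard deviation and using `std_train != 0` as the mask.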
@@ -2,12 +2,12 @@
|
||||
# Licensed under the MIT License.
|
||||
|
||||
|
||||
import logging
|
||||
import logging.handlers
|
||||
import os
|
||||
import re
|
||||
import logging
|
||||
from time import time
|
||||
import logging.handlers
|
||||
from logging import config as logging_config
|
||||
from time import time
|
||||
|
||||
from .config import C
|
||||
|
||||
|
||||
@@ -13,7 +13,7 @@ import qlib
 from qlib.config import REG_CN
 from qlib.utils import drop_nan_by_y_index
 from qlib.contrib.model.gbdt import LGBModel
-from qlib.contrib.estimator.handler import Alpha158
+from qlib.contrib.data.handler import Alpha158
 from qlib.contrib.strategy.strategy import TopkDropoutStrategy
 from qlib.contrib.evaluate import (
     backtest as normal_backtest,