Update docs and fix tabnet

2026-07-02 10:31:00 +08:00 · 2020-11-26 00:55:26 +08:00
parent 5be847909f
commit 87cee85cea
27 changed files with 624 additions and 495 deletions
--- a/docs/component/data.rst
+++ b/docs/component/data.rst
@@ -1,7 +1,7 @@
 .. _data:

 ================================
-Data Layer: Data Framework&Usage
+Data Layer: Data Framework & Usage
 ================================

 Introduction
@@ -15,7 +15,9 @@ The introduction of ``Data Layer`` includes the following parts.

 - Data Preparation
 - Data API
+- Data Loader
 - Data Handler
+- Dataset
 - Cache
 - Data and Cache File Structure

@@ -146,43 +148,161 @@ Filter

 To know more about ``Filter``, please refer to `Filter API <../reference/api.html#module-qlib.data.filter>`_.

-
 Reference
 -------------

 To know more about ``Data API``, please refer to `Data API <../reference/api.html#data>`_.

+
+Data Loader
+=================
+
+``Data Loader`` in ``Qlib`` is designed to load raw data from the original data source. It will be loaded and used in the ``Data Handler`` module.
+
+The ``QlibDataLoader`` class in ``Qlib`` is such an interface that allows users to load raw data from the data source.
+
+Interface
+------------
+
+Here are some interfaces of the ``QlibDataLoader`` class:
+
+- `load(instruments, start_time=None, end_time=None)`
+    - This method loads the data as pd.DataFrame
+    - Parameters:
+        - `instruments` : str or dict
+            it can either be the market name or the config file of instruments generated by InstrumentProvider.
+        - `start_time` : str
+            start of the time range.
+        - `end_time` : str
+            end of the time range.
+    - Returns:
+        - The data being loaded with type `pd.DataFrame`
+
+- `load_group_df(instruments, exprs: list, names: list, start_time=None, end_time=None)`
+    - This method loads the DataFrame for a specific group.
+    - Parameters:
+        - `instruments` : str or dict
+            it can either be the market name or the config file of instruments generated by InstrumentProvider.
+        - `exprs` : list
+            the expressions to describe the content of the data.
+        - `names` : list
+            the name of the data.
+        - `start_time` : str
+            start of the time range.
+        - `end_time` : str
+            end of the time range.
+    - Returns:
+        - The queried data in type `pd.DataFrame`.
+
+API
+-----------
+
+To know more about ``Data Loader``, please refer to `Data Loader API <../reference/api.html#module-qlib.data.dataset.loader>`_.
+
+
 Data Handler
 =================

-Users can use ``Data Handler`` in an automatic workflow by ``Estimator``, refer to `Estimator: Workflow Management <estimator.html>`_ for more details. 
+The ``Data Handler`` module in ``Qlib`` is designed to handle the common data processing methods that are used by most models.
+
+Users can use ``Data Handler`` in an automatic workflow via ``qrun``; refer to `Workflow: Workflow Management <workflow.html>`_ for more details.

-Also, ``Data Handler`` can be used as an independent module, by which users can easily preprocess data(standardization, remove NaN, etc.) and build datasets. It is a subclass of ``qlib.data.dataset.handler.DataHandlerLP``, which provides some interfaces as follows.

 Base Class & Interface
 ----------------------

-Qlib provides a base class `qlib.data.dataset.DataHandlerLP <../reference/api.html#qlib.data.dataset.handler.DataHandlerLP>`_, which provides the following interfaces:
+In addition to using ``Data Handler`` in an automatic workflow with ``qrun``, users can run it as an independent module to preprocess data (standardization, removing NaN, etc.) and build datasets.

- `load_feature`    
-    Implement the interface to load the data features.
+To support this, ``Qlib`` provides a base class `qlib.data.dataset.DataHandlerLP <../reference/api.html#qlib.data.dataset.handler.DataHandlerLP>`_. The core idea of this class is to have learnable ``Processors`` that learn the parameters of data processing. When new data comes in, these `trained` ``Processors`` can be applied to it, so real-time data is processed efficiently. More information about ``Processors`` is given in the next subsection.

- `load_label`   
-    Implement the interface to load the data labels and calculate the users' labels. 
+Here are some important interfaces that ``DataHandlerLP`` provides:

- `setup_processed_data`    
-    Implement the interface for data preprocessing, such as preparing feature columns, discarding blank lines, and so on.
+- `__init__(instruments=None, start_time=None, end_time=None, data_loader: Tuple[dict, str, DataLoader] = None, infer_processors=[], learn_processors=[], process_type=PTYPE_A, **kwargs)`
+    - Initialization of the class.
+    - Parameters:
+        - `infer_processors` : list
+            - list of <description info> of processors to generate data for inference
+            - example of <description info>:

-Qlib also provides two functions to help users init the data handler, users can override them for users' needs.
+            .. code-block::
+            
+                1) classname & kwargs:
+                    {
+                        "class": "MinMaxNorm",
+                        "kwargs": {
+                            "fit_start_time": "20080101",
+                            "fit_end_time": "20121231"
+                        }
+                    }
+                2) Only classname:
+                    "DropnaFeature"
+                3) object instance of Processor

- `_init_raw_data`
-    Users can init the raw df, feature names, and label names of data handler in this function. 
-    If the index of feature df and label df are not the same, users need to override this method to merge them (e.g. inner, left, right merge).
+        - `learn_processors` : list
+            similar to infer_processors, but for generating data for learning models
+
+        - `process_type`: str
+            - PTYPE_I = 'independent'
+                - self._infer will be processed by infer_processors
+                - self._learn will be processed by learn_processors
+            - PTYPE_A = 'append'
+                - self._infer will be processed by infer_processors
+                - self._learn will be processed by infer_processors + learn_processors
+                    - (i.e. self._learn is self._infer further processed by learn_processors)
+
+- `fetch(selector: Union[pd.Timestamp, slice, str] = slice(None, None), level: Union[str, int] = "datetime", col_set=DataHandler.CS_ALL, data_key: str = DK_I)`    
+    - This method fetches data from the underlying data source.
+    - Parameters:
+        - `selector` : Union[pd.Timestamp, slice, str]
+            describe how to select data by index.
+        - `level` : Union[str, int]
+            which index level to select the data.
+        - `col_set` : str
+            select a set of meaningful columns (e.g. ``feature``, ``label``).
+        - `data_key` : str
+            The data to fetch:  DK_*.
+    - Returns:
+        - The retrieved results in the type: `pd.DataFrame`.
+
+- `get_cols(col_set=DataHandler.CS_ALL, data_key: str = DK_I)`   
+    - This method gets the column names.
+    - Parameters:
+        - `col_set` : str
+            select a set of meaningful columns (e.g. ``feature``, ``label``).
+        - `data_key` : str
+            the data to fetch:  DK_*.
+    - Returns:
+        - A list of column names.

 If users want to load features and labels by config, users can inherit ``qlib.data.dataset.handler.ConfigDataHandler``, ``Qlib`` also provides some preprocess method in this subclass.
 If users want to use qlib data, `QLibDataHandler` is recommended. Users can inherit their custom class from `QLibDataHandler`, which is also a subclass of `ConfigDataHandler`.


+Processor
+----------
+
+The ``Processor`` module in ``Qlib`` is designed to be learnable and it is responsible for handling data processing such as `normalization` and `drop none/nan features/labels`.
+
+``Qlib`` provides the following ``Processors``:
+
+- ``DropnaProcessor``: `processor` that drops N/A features.
+- ``DropnaLabel``: `processor` that drops N/A labels.
+- ``TanhProcess``: `processor` that uses `tanh` to process noise data.
+- ``ProcessInf``: `processor` that handles infinity values, which are replaced by the mean of the column.
+- ``Fillna``: `processor` that handles N/A values by filling them with 0 or another given number.
+- ``MinMaxNorm``: `processor` that applies min-max normalization.
+- ``ZscoreNorm``: `processor` that applies z-score normalization.
+- ``CSZScoreNorm``: `processor` that applies cross sectional z-score normalization.
+- ``CSRankNorm``: `processor` that applies cross sectional rank normalization.
+
+Users can also create their own `processor` by inheriting the base class of ``Processor``. Please refer to the implementation of all the processors for more information (`Processor Link <https://github.com/microsoft/qlib/blob/main/qlib/data/dataset/processor.py>`_). 
+
+API
+---------
+
+To know more about ``Processor``, please refer to `Processor API <../reference/api.html#module-qlib.data.dataset.processor>`_.
+
+
 Usage
 --------------

@@ -194,15 +314,12 @@ Usage
 - `get_rolling_data`
    - According to the start and end dates, and `rolling_period`, an iterator is returned, which can be used to traverse the features and labels used for rolling.

-
-
-
 Example
 --------------

-``Data Handler`` can be run with ``estimator`` by modifying the configuration file, and can also be used as a single module. 
+``Data Handler`` can be run with ``qrun`` by modifying the configuration file, and can also be used as a single module. 

-Know more about how to run ``Data Handler`` with ``Estimator``, please refer to `Estimator: Workflow Management <estimator.html>`_
+To know more about how to run ``Data Handler`` with ``qrun``, please refer to `Workflow: Workflow Management <workflow.html>`_.

 Qlib provides implemented data handler `Alpha158`. The following example shows how to run `Alpha158` as a single module.

@@ -211,45 +328,70 @@ Qlib provides implemented data handler `Alpha158`. The following example shows h

 .. code-block:: Python

+    import qlib
    from qlib.contrib.data.handler import Alpha158
-    from qlib.contrib.model.gbdt import LGBModel

-    DATA_HANDLER_CONFIG = {
-        "dropna_label": True,
-        "start_date": "2007-01-01",
-        "end_date": "2020-08-01",
-        "market": "csi300",
+    data_handler_config = {
+        "start_time": "2008-01-01",
+        "end_time": "2020-08-01",
+        "fit_start_time": "2008-01-01",
+        "fit_end_time": "2014-12-31",
+        "instruments": "csi300",
    }

-    TRAINER_CONFIG = {
-        "train_start_date": "2007-01-01",
-        "train_end_date": "2014-12-31",
-        "validate_start_date": "2015-01-01",
-        "validate_end_date": "2016-12-31",
-        "test_start_date": "2017-01-01",
-        "test_end_date": "2020-08-01",
-    }
+    if __name__ == "__main__":
+        qlib.init()
+        h = Alpha158(**data_handler_config)

-    exampleDataHandler = Alpha158(**DATA_HANDLER_CONFIG)
+        # get all the columns of the data
+        print(h.get_cols())

-    # example of 'get_split_data'
-    x_train, y_train, x_validate, y_validate, x_test, y_test = exampleDataHandler.get_split_data(**TRAINER_CONFIG)
+        # fetch all the labels
+        print(h.fetch(col_set="label"))

-    # example of 'get_rolling_data'
-
-    for (x_train, y_train, x_validate, y_validate, x_test, y_test) in exampleDataHandler.get_rolling_data(**TRAINER_CONFIG):
-        print(x_train, y_train, x_validate, y_validate, x_test, y_test) 
-
-
-.. note:: (x_train, y_train, x_validate, y_validate, x_test, y_test) can be used as arguments for the `fit`, `predic``, and `score` methods of the ``Interday Model`` , please refer to `Model <model.html#base-class-interface>`_.
-
-Also, the above example has been given in ``examples.estimator.train_backtest_analyze.ipynb``.
+        # fetch all the features
+        print(h.fetch(col_set="feature"))

 API
 ---------

 To know more about ``Data Handler``, please refer to `Data Handler API <../reference/api.html#module-qlib.data.dataset.handler>`_.

+
+Dataset
+=================
+
+The ``Dataset`` module in ``Qlib`` aims to prepare data for model training and inference.
+
+The motivation of this module is to maximize the flexibility of different models to handle data in the way that suits them. This module gives each model the right to process its data in a unique way. For instance, models such as ``GBDT`` may work well on data that contains `nan` or `None` values, while neural networks such as ``DNN`` will break down on such data.
+
+The ``DatasetH`` class is the `dataset` with `Data Handler`. Here is the most important interface of the class:
+
+- `prepare(segments: Union[List[str], Tuple[str], str, slice], col_set=DataHandler.CS_ALL, data_key=DataHandlerLP.DK_I, **kwargs)`
+    - This method prepares the data for learning and inference.
+    - Parameters:
+        - `segments` : Union[List[str], Tuple[str], str, slice]
+            Describe the scope of the data to be prepared
+            Here are some examples:
+
+            - 'train'
+
+            - ['train', 'valid']
+
+        - `col_set` : str
+            The col_set will be passed to self._handler when fetching data.
+        - `data_key` : str
+            The data to fetch:  DK_*
+            Default is DK_I, which indicates fetching data for **inference**.
+
+
+API
+---------
+
+To know more about ``Dataset``, please refer to `Dataset API <../reference/api.html#module-qlib.data.dataset.__init__>`_.
+
+
+
 Cache
 ==========

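Reviewer note on docs/component/data.rst: the ``process_type`` semantics documented above (``independent`` versus ``append``) may be easier to follow with a toy example. The following is a minimal sketch in plain Python, not Qlib code. ``MinMaxSketch``, ``drop_missing``, ``run`` and ``process`` are hypothetical names introduced for illustration, and rows are plain dicts instead of a ``pd.DataFrame``. It assumes only what the text states: a learnable processor is fitted once and then applied to new data, and in ``append`` mode the learning data is the inference data further processed by ``learn_processors``.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[float]]
Processor = Callable[[List[Row]], List[Row]]


class MinMaxSketch:
    """Hypothetical learnable processor (not Qlib's MinMaxNorm)."""

    def __init__(self, column: str) -> None:
        self.column = column
        self.lo: Optional[float] = None
        self.hi: Optional[float] = None

    def fit(self, rows: List[Row]) -> None:
        # Learn the scaling parameters from the fitting window only.
        values = [r[self.column] for r in rows if r[self.column] is not None]
        self.lo, self.hi = min(values), max(values)

    def __call__(self, rows: List[Row]) -> List[Row]:
        if self.lo is None or self.hi is None:
            raise ValueError("processor is not fitted yet")
        span = (self.hi - self.lo) or 1.0  # avoid division by zero
        out = []
        for r in rows:
            v = r[self.column]
            out.append({**r, self.column: None if v is None else (v - self.lo) / span})
        return out


def drop_missing(rows: List[Row]) -> List[Row]:
    """Stateless processor: drop rows that contain a missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def run(processors: List[Processor], rows: List[Row]) -> List[Row]:
    """Apply the processors in order."""
    for p in processors:
        rows = p(rows)
    return rows


def process(rows: List[Row], infer_processors: List[Processor],
            learn_processors: List[Processor],
            process_type: str = "append") -> Tuple[List[Row], List[Row]]:
    """Return (infer_data, learn_data) following the documented semantics."""
    infer = run(infer_processors, rows)
    if process_type == "append":
        learn = run(learn_processors, infer)  # infer_processors + learn_processors
    elif process_type == "independent":
        learn = run(learn_processors, rows)   # learn_processors only
    else:
        raise ValueError(f"unknown process_type: {process_type}")
    return infer, learn


# Fit on a historical window, then process new rows in both modes.
fit_rows = [{"x": 0.0, "y": 0.0}, {"x": 10.0, "y": 0.0}]
new_rows = [{"x": 5.0, "y": 1.0}, {"x": 10.0, "y": None}]
norm = MinMaxSketch("x")
norm.fit(fit_rows)
infer_a, learn_a = process(new_rows, [norm], [drop_missing], "append")
infer_i, learn_i = process(new_rows, [norm], [drop_missing], "independent")
```

In this sketch the inference data keeps the row with a missing label in both modes, while the learning data drops it. Only in ``append`` mode is the learning data also normalized.
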
--- a/docs/component/model.rst
+++ b/docs/component/model.rst
@@ -7,7 +7,7 @@ Interday Model: Model Training & Prediction
 Introduction
 ===================

-``Interday Model`` is designed to make the `prediction score` about stocks. Users can use the ``Interday Model`` in an automatic workflow by ``Estimator``, please refer to `Estimator: Workflow Management <estimator.html>`_.  
+``Interday Model`` is designed to make the `prediction score` about stocks. Users can use the ``Interday Model`` in an automatic workflow by ``qrun``, please refer to `Workflow: Workflow Management <workflow.html>`_.  

 Because the components in ``Qlib`` are designed in a loosely-coupled way, ``Interday Model`` can be used as an independent module also.

@@ -20,151 +20,125 @@ The base class provides the following interfaces:

 - `__init__(**kwargs)`
    - Initialization.
-    - If users use ``Estimator`` to start an `experiment`, the parameter of `__init__` method shoule be consistent with the hyperparameters in the configuration file.

- `fit(self, x_train, y_train, x_valid, y_valid, w_train=None, w_valid=None, **kwargs)`
+- `fit(self, dataset, **kwargs)`
    - Train model.
    - Parameter:
-        - `x_train`, pd.DataFrame type, train feature
-            The following example explains the value of `x_train`:
+        - `dataset`, ``Qlib``'s ``DatasetH`` type. For more information about ``DatasetH``, users can refer to the related document: `Qlib Dataset <../component/data.html#dataset>`_.
+            The `dataset` is passed to the model's method because each model has its own data preprocessing needs; this gives every model maximum flexibility to handle the data in the way that suits it.
+            The following code example shows how to retrieve `x_train`, `y_train` and `w_train` from the `dataset`:

-            .. code-block:: YAML
-                                
-                                        KMID      KLEN      KMID2     KUP       KUP2
-                instrument  datetime                                                       
-                SH600004    2012-01-04  0.000000  0.017685  0.000000  0.012862  0.727275   
-                            2012-01-05 -0.006473  0.025890 -0.250001  0.012945  0.499998   
-                            2012-01-06  0.008117  0.019481  0.416666  0.008117  0.416666   
-                            2012-01-09  0.016051  0.025682  0.624998  0.006421  0.250001   
-                            2012-01-10  0.017323  0.026772  0.647057  0.003150  0.117648   
-                ...                         ...       ...       ...       ...       ...   
-                SZ300273    2014-12-25 -0.005295  0.038697 -0.136843  0.016293  0.421052   
-                            2014-12-26 -0.022486  0.041701 -0.539215  0.002453  0.058824   
-                            2014-12-29 -0.031526  0.039092 -0.806451  0.000000  0.000000   
-                            2014-12-30 -0.010000  0.032174 -0.310811  0.013913  0.432433   
-                            2014-12-31  0.010917  0.020087  0.543479  0.001310  0.065216   
+            .. code-block:: Python

-            
-            `x_train` is a pandas DataFrame, whose index is MultiIndex <instrument(str), datetime(pd.Timestamp)>. Each column of `x_train` corresponds to a feature, and the column name is the feature name. 
-            
-            .. note::
-            
-                The number and names of the columns are determined by the data handler, please refer to `Data Handler <data.html#data-handler>`_ and `Estimator Data Section <estimator.html#data-section>`_.
-            
-        - `y_train`, pd.DataFrame type, train label
-            The following example explains the value of `y_train`:
+                # get features and labels
+                df_train, df_valid = dataset.prepare(
+                    ["train", "valid"], col_set=["feature", "label"], data_key=DataHandlerLP.DK_L
+                )
+                x_train, y_train = df_train["feature"], df_train["label"]
+                x_valid, y_valid = df_valid["feature"], df_valid["label"]

-             .. code-block:: YAML
-                                
-                                        LABEL
-                instrument  datetime            
-                SH600004    2012-01-04 -0.798456
-                            2012-01-05 -1.366716
-                            2012-01-06 -0.491026
-                            2012-01-09  0.296900
-                            2012-01-10  0.501426
-                ...                         ...
-                SZ300273    2014-12-25 -0.465540
-                            2014-12-26  0.233864
-                            2014-12-29  0.471368
-                            2014-12-30  0.411914
-                            2014-12-31  1.342723
-            
-            `y_train` is a pandas DataFrame, whose index is MultiIndex <instrument(str), datetime(pd.Timestamp)>. The `LABEL` column represents the value of train label.
-
-            .. note::
-
-                The number and names of the columns are determined by the ``Data Handler``, please refer to `Data Handler <data.html#data-handler>`_.
-
-        - `x_valid`, pd.DataFrame type, validation feature
-            The format of `x_valid` is same as `x_train`
-
-
-        - `y_valid`, pd.DataFrame type, validation label
-            The format of `y_valid` is same as `y_train`
-
-        - `w_train`(Optional args, default is None), pd.DataFrame type, train weight
-            `w_train` is a pandas DataFrame, whose shape and index is same as `x_train`. The float value in `w_train` represents the weight of the feature at the same position in `x_train`.
-
-        - `w_train`(Optional args, default is None), pd.DataFrame type, validation weight
-            `w_train` is a pandas DataFrame, whose shape and index is the same as `x_valid`. The float value in `w_train` represents the weight of the feature at the same position in `x_train`.
-
- `predict(self, x_test, **kwargs)`
-    - Predict test data 'x_test'
-    - Parameter:
-        - `x_test`, pd.DataFrame type, test features
-            The form of `x_test` is same as `x_train` in 'fit' method.
-    - Return: 
-        - `label`, np.ndarray type, test label
-            The label of `x_test` that predicted by model.
-
- `score(self, x_test, y_test, w_test=None, **kwargs)`
-    - Evaluate model with test feature/label
-    - Parameter:
-        - `x_test`, pd.DataFrame type, test feature
-            The format of `x_test` is same as `x_train` in `fit` method.
+                # get weights
+                try:
+                    wdf_train, wdf_valid = dataset.prepare(["train", "valid"], col_set=["weight"], data_key=DataHandlerLP.DK_L)
+                    w_train, w_valid = wdf_train["weight"], wdf_valid["weight"]
+                except KeyError as e:
+                    w_train = pd.DataFrame(np.ones_like(y_train.values), index=y_train.index)
+                    w_valid = pd.DataFrame(np.ones_like(y_valid.values), index=y_valid.index)
        
-        - `x_test`, pd.DataFrame type, test label
-            The format of `y_test` is same as `y_train` in `fit` method.
+- `predict(self, dataset, **kwargs)`
+    - Predict test data.
+    - Parameter:
+        - `dataset`, ``Qlib``'s ``DatasetH`` type. The usage is similar to the example above.
+    - Returns:
+        - Prediction results with type: `pandas.Series`.

-        - `w_test`, pd.DataFrame type, test weight
-            The format of `w_test` is same as `w_train` in `fit` method.
-    - Return: float type, evaluation score
+- `finetune(self, dataset, **kwargs)`
+    - Finetune the model.
+    - Parameter:
+        - `dataset`, ``Qlib``'s ``DatasetH`` type. The usage is similar to the example above.

-For other interfaces such as `save`, `load`, `finetune`, please refer to `Model API <../reference/api.html#module-qlib.model.base>`_.
+    
+For other interfaces such as `finetune`, please refer to `Model API <../reference/api.html#module-qlib.model.base>`_.

 Example
 ==================

-``Qlib`` provides ``LightGBM`` and ``DNN`` models as the baseline, the following steps show how to run`` LightGBM`` as an independent module.
+``Qlib``'s `Model Zoo` includes models such as ``LightGBM``, ``DNN``, ``LSTM``, etc. These models are treated as the baselines of ``Interday Model``. The following steps show how to run ``LightGBM`` as an independent module.

 - Initialize ``Qlib`` with `qlib.init` first, please refer to `Initialization <../start/initialization.html>`_.
 - Run the following code to get the `prediction score` `pred_score`
    .. code-block:: Python

-        from qlib.contrib.data.handler import Alpha158
        from qlib.contrib.model.gbdt import LGBModel
+        from qlib.contrib.data.handler import Alpha158
+        from qlib.utils import init_instance_by_config, flatten_dict
+        from qlib.workflow import R
+        from qlib.workflow.record_temp import SignalRecord, PortAnaRecord

-        DATA_HANDLER_CONFIG = {
-            "dropna_label": True,
-            "start_date": "2007-01-01",
-            "end_date": "2020-08-01",
-            "market": MARKET,
+        market = "csi300"
+        benchmark = "SH000300"
+
+        data_handler_config = {
+            "start_time": "2008-01-01",
+            "end_time": "2020-08-01",
+            "fit_start_time": "2008-01-01",
+            "fit_end_time": "2014-12-31",
+            "instruments": market,
        }

-        TRAINER_CONFIG = {
-            "train_start_date": "2007-01-01",
-            "train_end_date": "2014-12-31",
-            "validate_start_date": "2015-01-01",
-            "validate_end_date": "2016-12-31",
-            "test_start_date": "2017-01-01",
-            "test_end_date": "2020-08-01",
+        task = {
+            "model": {
+                "class": "LGBModel",
+                "module_path": "qlib.contrib.model.gbdt",
+                "kwargs": {
+                    "loss": "mse",
+                    "colsample_bytree": 0.8879,
+                    "learning_rate": 0.0421,
+                    "subsample": 0.8789,
+                    "lambda_l1": 205.6999,
+                    "lambda_l2": 580.9768,
+                    "max_depth": 8,
+                    "num_leaves": 210,
+                    "num_threads": 20,
+                },
+            },
+            "dataset": {
+                "class": "DatasetH",
+                "module_path": "qlib.data.dataset",
+                "kwargs": {
+                    "handler": {
+                        "class": "Alpha158",
+                        "module_path": "qlib.contrib.data.handler",
+                        "kwargs": data_handler_config,
+                    },
+                    "segments": {
+                        "train": ("2008-01-01", "2014-12-31"),
+                        "valid": ("2015-01-01", "2016-12-31"),
+                        "test": ("2017-01-01", "2020-08-01"),
+                    },
+                },
+            },
        }
+        
+        # model initialization
+        model = init_instance_by_config(task["model"])
+        dataset = init_instance_by_config(task["dataset"])

-        x_train, y_train, x_validate, y_validate, x_test, y_test = Alpha158(
-            **DATA_HANDLER_CONFIG
-        ).get_split_data(**TRAINER_CONFIG)
+        # start exp
+        with R.start(experiment_name="workflow"):
+            # train
+            R.log_params(**flatten_dict(task))
+            model.fit(dataset)

+            # prediction
+            recorder = R.get_recorder()
+            sr = SignalRecord(model, dataset, recorder)
+            sr.generate()

-        MODEL_CONFIG = {
-            "loss": "mse",
-            "colsample_bytree": 0.8879,
-            "learning_rate": 0.0421,
-            "subsample": 0.8789,
-            "lambda_l1": 205.6999,
-            "lambda_l2": 580.9768,
-            "max_depth": 8,
-            "num_leaves": 210,
-            "num_threads": 20,
-        }
-        # use default model
-        model = LGBModel(**MODEL_CONFIG)
-        model.fit(x_train, y_train, x_validate, y_validate)
-        _pred = model.predict(x_test)
-        pred_score = pd.DataFrame(index=_pred.index)
-        pred_score["score"] = _pred.iloc(axis=1)[0]
-
-    .. note:: `Alpha158` is the data handler provided by ``Qlib``, please refer to `Data Handler <data.html#data-handler>`_.
+    .. note:: 
+        
+        `Alpha158` is the data handler provided by ``Qlib``, please refer to `Data Handler <data.html#data-handler>`_.
+        `SignalRecord` is the `Record Template` in ``Qlib``, please refer to `Workflow <recorder.html#record-template>`_.

 Also, the above example has been given in ``examples/train_backtest_analyze.ipynb``.

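Reviewer note on docs/component/model.rst: the new ``fit(dataset)`` / ``predict(dataset)`` contract means the model, not the caller, decides which segments to pull from the dataset. The following is a minimal sketch of that pattern in plain Python, not Qlib code. ``ToyDataset`` and ``MeanModel`` are hypothetical stand-ins, and the only behaviour assumed from the text is that ``prepare`` accepts either one segment name or a list of names.

```python
from statistics import mean
from typing import Dict, List, Tuple, Union

Record = Tuple[str, float, float]  # (ISO date, feature, label)


class ToyDataset:
    """Hypothetical stand-in exposing a DatasetH-like ``prepare`` method."""

    def __init__(self, rows: List[Record], segments: Dict[str, Tuple[str, str]]) -> None:
        self.rows = rows
        self.segments = segments

    def _slice(self, name: str) -> List[Record]:
        start, end = self.segments[name]
        # ISO-formatted dates compare correctly as strings.
        return [r for r in self.rows if start <= r[0] <= end]

    def prepare(self, segments: Union[str, List[str]]):
        # One name returns one segment; a list returns one result per name.
        if isinstance(segments, str):
            return self._slice(segments)
        return [self._slice(name) for name in segments]


class MeanModel:
    """Toy model that predicts the mean training label."""

    def __init__(self) -> None:
        self.value = None
        self.valid_mse = None

    def fit(self, dataset: ToyDataset) -> None:
        # The model chooses the segments it needs.
        train, valid = dataset.prepare(["train", "valid"])
        self.value = mean(label for _, _, label in train)
        self.valid_mse = mean((label - self.value) ** 2 for _, _, label in valid)

    def predict(self, dataset: ToyDataset) -> Dict[str, float]:
        if self.value is None:
            raise ValueError("model is not fitted yet!")
        test = dataset.prepare("test")
        return {date: self.value for date, _, _ in test}


rows = [
    ("2014-01-02", 0.1, 1.0),
    ("2014-06-03", 0.2, 3.0),
    ("2015-03-04", 0.3, 2.0),
    ("2017-02-01", 0.4, 5.0),
]
segments = {
    "train": ("2008-01-01", "2014-12-31"),
    "valid": ("2015-01-01", "2016-12-31"),
    "test": ("2017-01-01", "2020-08-01"),
}
dataset = ToyDataset(rows, segments)
model = MeanModel()
model.fit(dataset)
pred = model.predict(dataset)
```

Because the caller passes a single object, a different model could request other segments or apply its own preprocessing without any change to the calling code.
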
--- a/docs/component/recorder.rst
+++ b/docs/component/recorder.rst
@@ -402,8 +402,8 @@ Record Template

 The ``RecordTemp`` class is a class that enables generate experiment results such as IC and backtest in a certain format. We have provided three different `Record Template` class:

- ``SignalRecord``: This class generates the `preidction` of the model.
- ``SigAnaRecord``: This class generates the `IC`, `ICIR`, `Rank IC` and `Rank ICIR`.
+- ``SignalRecord``: This class generates the `prediction` results of the model.
+- ``SigAnaRecord``: This class generates the `IC`, `ICIR`, `Rank IC` and `Rank ICIR` of the model.
 - ``PortAnaRecord``: This class generates the results of `backtest`. The detailed information about `backtest` as well as the available `strategy`, users can refer to `Strategy <../component/strategy.html>`_ and `Backtest <../component/backtest.html>`_.

-For more information, please refer to `Record Template API <../reference/api.html#module-qlib.workflow.record_temp>`_.
+For more information about the APIs, please refer to `Record Template API <../reference/api.html#module-qlib.workflow.record_temp>`_.
--- a/docs/reference/api.rst
+++ b/docs/reference/api.rst
@@ -60,12 +60,26 @@ Cache
 Contrib
 ====================

+Data Loader
+---------------
+.. automodule:: qlib.data.dataset.loader
+    :members:

 Data Handler
 ---------------
 .. automodule:: qlib.data.dataset.handler
    :members:

+Processor
+---------------
+.. automodule:: qlib.data.dataset.processor
+    :members:
+
+Dataset
+---------------
+.. automodule:: qlib.data.dataset.__init__
+    :members:
+
 Model
 --------------------
 .. automodule:: qlib.model.base
--- a/docs/start/integration.rst
+++ b/docs/start/integration.rst
@@ -5,7 +5,7 @@ Custom Model Integration
 Introduction
 ===================

-``Qlib`` provides ``lightGBM`` and ``Dnn`` model as the baseline of ``Interday Model``. In addition to the default model, users can integrate their own custom models into ``Qlib``.
+``Qlib``'s `Model Zoo` includes models such as ``LightGBM``, ``DNN``, ``LSTM``, etc. These models are treated as the baselines of ``Interday Model``. In addition to the default models ``Qlib`` provides, users can integrate their own custom models into ``Qlib``.

 Users can integrate their own custom models according to the following steps.

@@ -32,79 +32,76 @@ The Custom models need to inherit `qlib.model.base.Model <../reference/api.html#

 - Override the `fit` method
    - ``Qlib`` calls the fit method to train the model
-    - The parameters must include training feature `x_train`, training label `y_train`, test feature `x_valid`, test label `y_valid` at least.
-    - The parameters could include some optional parameters with default values, such as train weight `w_train`, test weight `w_valid` and `num_boost_round = 1000`.
+    - The parameters must include `dataset`, from which the training and validation data are prepared.
+    - The parameters could include some optional parameters with default values, such as `num_boost_round = 1000` for `GBDT`.
    - Code Example: In the following example, `num_boost_round = 1000` is an optional parameter.
    .. code-block:: Python
    
-        def fit(self, x_train:pd.DataFrame, y_train:pd.DataFrame, x_valid:pd.DataFrame, y_valid:pd.DataFrame,
-            w_train:pd.DataFrame = None, w_valid:pd.DataFrame = None, num_boost_round = 1000, **kwargs):
+        def fit(self, dataset: DatasetH, num_boost_round=1000, early_stopping_rounds=50, verbose_eval=20, evals_result=None, **kwargs):
+
+            # prepare dataset for lgb training and evaluation
+            df_train, df_valid = dataset.prepare(
+                ["train", "valid"], col_set=["feature", "label"], data_key=DataHandlerLP.DK_L
+            )
+            x_train, y_train = df_train["feature"], df_train["label"]
+            x_valid, y_valid = df_valid["feature"], df_valid["label"]

            # Lightgbm need 1D array as its label
            if y_train.values.ndim == 2 and y_train.values.shape[1] == 1:
-                y_train_1d, y_valid_1d = np.squeeze(y_train.values), np.squeeze(y_valid.values)
+                y_train, y_valid = np.squeeze(y_train.values), np.squeeze(y_valid.values)
            else:
-                raise ValueError('LightGBM doesn\'t support multi-label training')
+                raise ValueError("LightGBM doesn't support multi-label training")

-            w_train_weight = None if w_train is None else w_train.values
-            w_valid_weight = None if w_valid is None else w_valid.values
+            dtrain = lgb.Dataset(x_train.values, label=y_train)
+            dvalid = lgb.Dataset(x_valid.values, label=y_valid)

-            dtrain = lgb.Dataset(x_train.values, label=y_train_1d, weight=w_train_weight)
-            dvalid = lgb.Dataset(x_valid.values, label=y_valid_1d, weight=w_valid_weight)
-            self._model = lgb.train(
-                self._params, 
-                dtrain, 
+            # fit the model
+            self.model = lgb.train(
+                self.params,
+                dtrain,
                num_boost_round=num_boost_round,
                valid_sets=[dtrain, dvalid],
-                valid_names=['train', 'valid'],
+                valid_names=["train", "valid"],
+                early_stopping_rounds=early_stopping_rounds,
+                verbose_eval=verbose_eval,
+                evals_result=evals_result,
                **kwargs
            )

 - Override the `predict` method
-    - The parameters include the test features.
+    - The parameters must include `dataset`, which will be used to get the test dataset.
    - Return the `prediction score`.
    - Please refer to `Model API <../reference/api.html#module-qlib.model.base>`_ for the parameter types of the fit method.
-    - Code Example: In the following example, users need to use dnn to predict the label(such as `preds`) of test data `x_test` and return it.
+    - Code Example: In the following example, users use `LightGBM` to predict the labels (such as `preds`) of the test data `x_test` and return them.
    .. code-block:: Python

-        def predict(self, x_test:pd.DataFrame, **kwargs)-> numpy.ndarray:
-            if self._model is None:
-                raise ValueError('model is not fitted yet!')
-            return self._model.predict(x_test.values)
+        def predict(self, dataset: DatasetH, **kwargs) -> pd.Series:
+            if self.model is None:
+                raise ValueError("model is not fitted yet!")
+            x_test = dataset.prepare("test", col_set="feature", data_key=DataHandlerLP.DK_I)
+            return pd.Series(self.model.predict(x_test.values), index=x_test.index)

- Override the `save` method & `load` method
-    - The `save` method parameter includes the a `filename` that represents an absolute path, user need to save model into the path.
-    - The `load` method parameter includes the a `buffer` read from the `filename` passed in the `save` method, users need to load model from the `buffer`.
-    - Code Example:
+- Override the `finetune` method
+    - The parameters must include training feature `dataset`.
+    - Code Example: In the following example, users will use `LightGBM` as the model and finetune it.
    .. code-block:: Python

-        def save(self, filename):
-            if self._model is None:
-                raise ValueError('model is not fitted yet!')
-            self._model.save_model(filename)
-
-        def load(self, buffer):
-            self._model = lgb.Booster(params={'model_str': buffer.decode('utf-8')})
-
-.. Without tuner, this part will not be used
-.. - Override the `score` method(This step is optional)
-..     - The parameters include the test features and test labels.
-..     - Return the evaluation score of the model. It's recommended to adopt the loss between labels and `prediction score`.
-..     - Code Example: In the following example, users need to calculate the weighted loss with test data `x_test`,  test label `y_test` and the weight `w_test`.
-..     .. code-block:: Python
-..
-..         def score(self, x_test:pd.Dataframe, y_test:pd.Dataframe, w_test:pd.DataFrame = None) -> float:
-..             # Remove rows from x, y and w, which contain Nan in any columns in y_test.
-..             x_test, y_test, w_test = drop_nan_by_y_index(x_test, y_test, w_test)
-..             preds = self.predict(x_test)
-..             w_test_weight = None if w_test is None else w_test.values
-..             scorer = mean_squared_error if self.loss_type == 'mse' else roc_auc_score
-..             return scorer(y_test.values, preds, sample_weight=w_test_weight)
+        def finetune(self, dataset: DatasetH, num_boost_round=10, verbose_eval=20):
+            dtrain, _ = self._prepare_data(dataset)  # assumes a helper that builds the lgb.Dataset objects as in `fit`
+            self.model = lgb.train(
+                self.params,
+                dtrain,
+                num_boost_round=num_boost_round,
+                init_model=self.model,
+                valid_sets=[dtrain],
+                valid_names=["train"],
+                verbose_eval=verbose_eval,
+            )

 Configuration File
 =======================

-The configuration file is described in detail in the `estimator <../component/estimator.html#complete-example>`_ document. In order to integrate the custom model into ``Qlib``, users need to modify the "model" field in the configuration file.
+The configuration file is described in detail in the `Workflow <../component/workflow.html#complete-example>`_ document. In order to integrate the custom model into ``Qlib``, users need to modify the "model" field in the configuration file.

 - Example: The following example describes the `model` field of configuration file about the custom lightgbm model mentioned above, where `module_path` is the module path, `class` is the class name, and `args` is the hyperparameter passed into the __init__ method. All parameters in the field is passed to `self._params` by `\*\*kwargs` in `__init__` except `loss = mse`. 

@@ -124,20 +121,20 @@ The configuration file is described in detail in the `estimator <../component/es
            num_leaves: 210
            num_threads: 20

-Users could find configuration file of the baseline of the ``Model`` in ``qlib/examples/estimator/estimator_config.yaml`` and ``qlib/examples/estimator/estimator_config_dnn.yaml``
+Users can find the configuration files of the ``Model`` baselines in ``examples/benchmarks``. The configurations of each model are listed under the corresponding model folder.

 Model Testing
 =====================
-Assuming that the configuration file is ``examples/estimator/estimator_config.yaml``, users can run the following command to test the custom model:
+Assuming that the configuration file is ``examples/benchmarks/LightGBM/workflow_config_lightgbm.yaml``, users can run the following command to test the custom model:

 .. code-block:: bash

    cd examples  # Avoid running program under the directory contains `qlib`
-    estimator -c estimator/estimator_config.yaml
+    qrun benchmarks/LightGBM/workflow_config_lightgbm.yaml

-.. note:: ``estimator`` is a built-in command of ``Qlib``.
+.. note:: ``qrun`` is a built-in command of ``Qlib``.

-Also, ``Model`` can also be tested as a single module. An example has been given in ``examples/train_backtest_analyze.ipynb``. 
+``Model`` can also be tested as a single module. An example is given in ``examples/workflow_by_code.ipynb``.


 Reference