Update docs and fix tabnet

2026-07-03 19:10:58 +08:00 · 2020-11-26 00:55:26 +08:00
parent 5be847909f
commit 87cee85cea
27 changed files with 624 additions and 495 deletions
--- a/docs/component/model.rst
+++ b/docs/component/model.rst
@@ -7,7 +7,7 @@ Interday Model: Model Training & Prediction
 Introduction
 ===================

-``Interday Model`` is designed to make the `prediction score` about stocks. Users can use the ``Interday Model`` in an automatic workflow by ``Estimator``, please refer to `Estimator: Workflow Management <estimator.html>`_.  
+``Interday Model`` is designed to produce the `prediction score` for stocks. Users can run the ``Interday Model`` in an automatic workflow with ``qrun``; please refer to `Workflow: Workflow Management <workflow.html>`_.

 Because the components in ``Qlib`` are designed in a loosely-coupled way, ``Interday Model`` can be used as an independent module also.

@@ -20,151 +20,125 @@ The base class provides the following interfaces:

 - `__init__(**kwargs)`
    - Initialization.
-    - If users use ``Estimator`` to start an `experiment`, the parameter of `__init__` method shoule be consistent with the hyperparameters in the configuration file.

- `fit(self, x_train, y_train, x_valid, y_valid, w_train=None, w_valid=None, **kwargs)`
+- `fit(self, dataset, **kwargs)`
    - Train model.
    - Parameter:
-        - `x_train`, pd.DataFrame type, train feature
-            The following example explains the value of `x_train`:
+        - `dataset`, ``Qlib``'s ``DatasetH`` type. For more information about ``DatasetH``, users can refer to the related document: `Qlib Dataset <../component/data.html#dataset>`_.
+            The `dataset` is passed to the `model`'s method because each model may need its own data preprocessing procedure. Passing the whole `dataset` gives each model maximum flexibility to prepare the data in the way that suits it.
+            The following code example shows how to retrieve `x_train`, `y_train` and `w_train` from the `dataset`:

-            .. code-block:: YAML
-                                
-                                        KMID      KLEN      KMID2     KUP       KUP2
-                instrument  datetime                                                       
-                SH600004    2012-01-04  0.000000  0.017685  0.000000  0.012862  0.727275   
-                            2012-01-05 -0.006473  0.025890 -0.250001  0.012945  0.499998   
-                            2012-01-06  0.008117  0.019481  0.416666  0.008117  0.416666   
-                            2012-01-09  0.016051  0.025682  0.624998  0.006421  0.250001   
-                            2012-01-10  0.017323  0.026772  0.647057  0.003150  0.117648   
-                ...                         ...       ...       ...       ...       ...   
-                SZ300273    2014-12-25 -0.005295  0.038697 -0.136843  0.016293  0.421052   
-                            2014-12-26 -0.022486  0.041701 -0.539215  0.002453  0.058824   
-                            2014-12-29 -0.031526  0.039092 -0.806451  0.000000  0.000000   
-                            2014-12-30 -0.010000  0.032174 -0.310811  0.013913  0.432433   
-                            2014-12-31  0.010917  0.020087  0.543479  0.001310  0.065216   
+            .. code-block:: Python

-            
-            `x_train` is a pandas DataFrame, whose index is MultiIndex <instrument(str), datetime(pd.Timestamp)>. Each column of `x_train` corresponds to a feature, and the column name is the feature name. 
-            
-            .. note::
-            
-                The number and names of the columns are determined by the data handler, please refer to `Data Handler <data.html#data-handler>`_ and `Estimator Data Section <estimator.html#data-section>`_.
-            
-        - `y_train`, pd.DataFrame type, train label
-            The following example explains the value of `y_train`:
+                # get features and labels
+                df_train, df_valid = dataset.prepare(
+                    ["train", "valid"], col_set=["feature", "label"], data_key=DataHandlerLP.DK_L
+                )
+                x_train, y_train = df_train["feature"], df_train["label"]
+                x_valid, y_valid = df_valid["feature"], df_valid["label"]

-             .. code-block:: YAML
-                                
-                                        LABEL
-                instrument  datetime            
-                SH600004    2012-01-04 -0.798456
-                            2012-01-05 -1.366716
-                            2012-01-06 -0.491026
-                            2012-01-09  0.296900
-                            2012-01-10  0.501426
-                ...                         ...
-                SZ300273    2014-12-25 -0.465540
-                            2014-12-26  0.233864
-                            2014-12-29  0.471368
-                            2014-12-30  0.411914
-                            2014-12-31  1.342723
-            
-            `y_train` is a pandas DataFrame, whose index is MultiIndex <instrument(str), datetime(pd.Timestamp)>. The `LABEL` column represents the value of train label.
-
-            .. note::
-
-                The number and names of the columns are determined by the ``Data Handler``, please refer to `Data Handler <data.html#data-handler>`_.
-
-        - `x_valid`, pd.DataFrame type, validation feature
-            The format of `x_valid` is same as `x_train`
-
-
-        - `y_valid`, pd.DataFrame type, validation label
-            The format of `y_valid` is same as `y_train`
-
-        - `w_train`(Optional args, default is None), pd.DataFrame type, train weight
-            `w_train` is a pandas DataFrame, whose shape and index is same as `x_train`. The float value in `w_train` represents the weight of the feature at the same position in `x_train`.
-
-        - `w_train`(Optional args, default is None), pd.DataFrame type, validation weight
-            `w_train` is a pandas DataFrame, whose shape and index is the same as `x_valid`. The float value in `w_train` represents the weight of the feature at the same position in `x_train`.
-
- `predict(self, x_test, **kwargs)`
-    - Predict test data 'x_test'
-    - Parameter:
-        - `x_test`, pd.DataFrame type, test features
-            The form of `x_test` is same as `x_train` in 'fit' method.
-    - Return: 
-        - `label`, np.ndarray type, test label
-            The label of `x_test` that predicted by model.
-
- `score(self, x_test, y_test, w_test=None, **kwargs)`
-    - Evaluate model with test feature/label
-    - Parameter:
-        - `x_test`, pd.DataFrame type, test feature
-            The format of `x_test` is same as `x_train` in `fit` method.
+                # get weights
+                try:
+                    wdf_train, wdf_valid = dataset.prepare(["train", "valid"], col_set=["weight"], data_key=DataHandlerLP.DK_L)
+                    w_train, w_valid = wdf_train["weight"], wdf_valid["weight"]
+                except KeyError as e:
+                    w_train = pd.DataFrame(np.ones_like(y_train.values), index=y_train.index)
+                    w_valid = pd.DataFrame(np.ones_like(y_valid.values), index=y_valid.index)
        
-        - `x_test`, pd.DataFrame type, test label
-            The format of `y_test` is same as `y_train` in `fit` method.
+- `predict(self, dataset, **kwargs)`
+    - Predict test data.
+    - Parameter:
+        - `dataset`, ``Qlib``'s ``DatasetH`` type. The usage is similar to the example above.
+    - Returns:
+        - Prediction results of type `pandas.Series`.

-        - `w_test`, pd.DataFrame type, test weight
-            The format of `w_test` is same as `w_train` in `fit` method.
-    - Return: float type, evaluation score
+- `finetune(self, dataset, **kwargs)`
+    - Finetune the model.
+    - Parameter:
+        - `dataset`, ``Qlib``'s ``DatasetH`` type. The usage is similar to the example above.

-For other interfaces such as `save`, `load`, `finetune`, please refer to `Model API <../reference/api.html#module-qlib.model.base>`_.
+    
+For other interfaces such as `finetune`, please refer to `Model API <../reference/api.html#module-qlib.model.base>`_.

 Example
 ==================

-``Qlib`` provides ``LightGBM`` and ``DNN`` models as the baseline, the following steps show how to run`` LightGBM`` as an independent module.
+``Qlib``'s `Model Zoo` includes models such as ``LightGBM``, ``DNN``, and ``LSTM``. These models are treated as the baselines of ``Interday Model``. The following steps show how to run ``LightGBM`` as an independent module.

 - Initialize ``Qlib`` with `qlib.init` first, please refer to `Initialization <../start/initialization.html>`_.
 - Run the following code to get the `prediction score` `pred_score`
    .. code-block:: Python

-        from qlib.contrib.data.handler import Alpha158
        from qlib.contrib.model.gbdt import LGBModel
+        from qlib.contrib.data.handler import Alpha158
+        from qlib.utils import init_instance_by_config, flatten_dict
+        from qlib.workflow import R
+        from qlib.workflow.record_temp import SignalRecord, PortAnaRecord

-        DATA_HANDLER_CONFIG = {
-            "dropna_label": True,
-            "start_date": "2007-01-01",
-            "end_date": "2020-08-01",
-            "market": MARKET,
+        market = "csi300"
+        benchmark = "SH000300"
+
+        data_handler_config = {
+            "start_time": "2008-01-01",
+            "end_time": "2020-08-01",
+            "fit_start_time": "2008-01-01",
+            "fit_end_time": "2014-12-31",
+            "instruments": market,
        }

-        TRAINER_CONFIG = {
-            "train_start_date": "2007-01-01",
-            "train_end_date": "2014-12-31",
-            "validate_start_date": "2015-01-01",
-            "validate_end_date": "2016-12-31",
-            "test_start_date": "2017-01-01",
-            "test_end_date": "2020-08-01",
+        task = {
+            "model": {
+                "class": "LGBModel",
+                "module_path": "qlib.contrib.model.gbdt",
+                "kwargs": {
+                    "loss": "mse",
+                    "colsample_bytree": 0.8879,
+                    "learning_rate": 0.0421,
+                    "subsample": 0.8789,
+                    "lambda_l1": 205.6999,
+                    "lambda_l2": 580.9768,
+                    "max_depth": 8,
+                    "num_leaves": 210,
+                    "num_threads": 20,
+                },
+            },
+            "dataset": {
+                "class": "DatasetH",
+                "module_path": "qlib.data.dataset",
+                "kwargs": {
+                    "handler": {
+                        "class": "Alpha158",
+                        "module_path": "qlib.contrib.data.handler",
+                        "kwargs": data_handler_config,
+                    },
+                    "segments": {
+                        "train": ("2008-01-01", "2014-12-31"),
+                        "valid": ("2015-01-01", "2016-12-31"),
+                        "test": ("2017-01-01", "2020-08-01"),
+                    },
+                },
+            },
        }
+        
+        # model initialization
+        model = init_instance_by_config(task["model"])
+        dataset = init_instance_by_config(task["dataset"])

-        x_train, y_train, x_validate, y_validate, x_test, y_test = Alpha158(
-            **DATA_HANDLER_CONFIG
-        ).get_split_data(**TRAINER_CONFIG)
+        # start exp
+        with R.start(experiment_name="workflow"):
+            # train
+            R.log_params(**flatten_dict(task))
+            model.fit(dataset)

+            # prediction
+            recorder = R.get_recorder()
+            sr = SignalRecord(model, dataset, recorder)
+            sr.generate()

-        MODEL_CONFIG = {
-            "loss": "mse",
-            "colsample_bytree": 0.8879,
-            "learning_rate": 0.0421,
-            "subsample": 0.8789,
-            "lambda_l1": 205.6999,
-            "lambda_l2": 580.9768,
-            "max_depth": 8,
-            "num_leaves": 210,
-            "num_threads": 20,
-        }
-        # use default model
-        model = LGBModel(**MODEL_CONFIG)
-        model.fit(x_train, y_train, x_validate, y_validate)
-        _pred = model.predict(x_test)
-        pred_score = pd.DataFrame(index=_pred.index)
-        pred_score["score"] = _pred.iloc(axis=1)[0]
-
-    .. note:: `Alpha158` is the data handler provided by ``Qlib``, please refer to `Data Handler <data.html#data-handler>`_.
+    .. note:: 
+        
+        `Alpha158` is the data handler provided by ``Qlib``. Please refer to `Data Handler <data.html#data-handler>`_.
+        `SignalRecord` is a `Record Template` in ``Qlib``. Please refer to `Workflow <recorder.html#record-template>`_.

 Also, the above example has been given in ``examples/train_backtest_analyze.ipynb``.
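
Note on the new interface: the `fit(dataset)` / `predict(dataset)` contract and the weight fallback described in the updated docs can be tried without Qlib installed. The sketch below is an illustration only, not Qlib code. `ToyDataset` is a hypothetical stand-in that mimics just the `prepare(segments, col_set=...)` call shape of `DatasetH` (the `data_key` argument is left out), and `MeanModel` and `make_frame` are made-up names for this example.

```python
import numpy as np
import pandas as pd


class ToyDataset:
    """Hypothetical stand-in for DatasetH; only mimics the prepare() call shape."""

    def __init__(self, frames):
        # frames: dict mapping segment name -> DataFrame with two-level columns
        self.frames = frames

    def prepare(self, segments, col_set):
        single = isinstance(segments, str)
        names = [segments] if single else list(segments)
        # df[c] raises KeyError when the column group (e.g. "weight") is absent
        out = [
            pd.concat({c: self.frames[name][c] for c in col_set}, axis=1)
            for name in names
        ]
        return out[0] if single else out


class MeanModel:
    """Made-up model: predicts the weighted mean of the training label."""

    def fit(self, dataset):
        df_train = dataset.prepare("train", col_set=["feature", "label"])
        y_train = df_train["label"]
        try:
            w_train = dataset.prepare("train", col_set=["weight"])["weight"]
        except KeyError:
            # no weight column in the dataset: fall back to uniform weights
            w_train = pd.DataFrame(np.ones_like(y_train.values), index=y_train.index)
        self.mean_ = float(
            np.average(y_train.values.ravel(), weights=w_train.values.ravel())
        )
        return self

    def predict(self, dataset):
        x_test = dataset.prepare("test", col_set=["feature"])["feature"]
        # return a pandas.Series indexed like the test features
        return pd.Series(self.mean_, index=x_test.index)


def make_frame(labels):
    """Build a tiny frame indexed by <instrument, datetime> with feature/label groups."""
    index = pd.MultiIndex.from_product(
        [["SH600004"], pd.date_range("2012-01-04", periods=len(labels))],
        names=["instrument", "datetime"],
    )
    columns = pd.MultiIndex.from_tuples([("feature", "KMID"), ("label", "LABEL")])
    data = np.column_stack([np.zeros(len(labels)), labels])
    return pd.DataFrame(data, index=index, columns=columns)


dataset = ToyDataset(
    {"train": make_frame([1.0, 2.0, 3.0]), "test": make_frame([0.0, 0.0])}
)
pred = MeanModel().fit(dataset).predict(dataset)
```

Because the toy frames carry no `weight` group, `fit` takes the `except KeyError` branch and uses uniform weights, which mirrors the fallback in the documented snippet. A real model would receive a `DatasetH` and pass `data_key` as shown in the docs.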