mirror of
https://github.com/microsoft/qlib.git
synced 2026-07-04 11:30:57 +08:00
Improve the style of documentation (#1132)
This commit improves the documentation (rst files) only in the following three ways: * Aligned section headers with their underline/overline punctuation characters * Deleted all trailling whitespaces in rst files * Deleted a few trailling newlines at the end of the rst files Co-authored-by: Bingyao Liu <Bingyao.Liu@sofund.com>
This commit is contained in:
@@ -1,13 +1,13 @@
|
||||
.. _data:
|
||||
|
||||
================================
|
||||
==================================
|
||||
Data Layer: Data Framework & Usage
|
||||
================================
|
||||
==================================
|
||||
|
||||
Introduction
|
||||
============================
|
||||
============
|
||||
|
||||
``Data Layer`` provides user-friendly APIs to manage and retrieve data. It provides high-performance data infrastructure.
|
||||
``Data Layer`` provides user-friendly APIs to manage and retrieve data. It provides high-performance data infrastructure.
|
||||
|
||||
It is designed for quantitative investment. For example, users could build formulaic alphas with ``Data Layer`` easily. Please refer to `Building Formulaic Alphas <../advanced/alpha.html>`_ for more details.
|
||||
|
||||
@@ -23,16 +23,16 @@ The introduction of ``Data Layer`` includes the following parts.
|
||||
|
||||
Here is a typical example of Qlib data workflow
|
||||
|
||||
- Users download data and converting data into Qlib format(with filename suffix `.bin`). In this step, typically only some basic data are stored on disk(such as OHLCV).
|
||||
- Users download data and converting data into Qlib format(with filename suffix `.bin`). In this step, typically only some basic data are stored on disk(such as OHLCV).
|
||||
- Creating some basic features based on Qlib's expression Engine(e.g. "Ref($close, 60) / $close", the return of last 60 trading days). Supported operators in the expression engine can be found `here <https://github.com/microsoft/qlib/blob/main/qlib/data/ops.py>`_. This step is typically implemented in Qlib's `Data Loader <https://qlib.readthedocs.io/en/latest/component/data.html#data-loader>`_ which is a component of `Data Handler <https://qlib.readthedocs.io/en/latest/component/data.html#data-handler>`_ .
|
||||
- If users require more complicated data processing (e.g. data normalization), `Data Handler <https://qlib.readthedocs.io/en/latest/component/data.html#data-handler>`_ support user-customized processors to process data(some predefined processors can be found `here <https://github.com/microsoft/qlib/blob/main/qlib/data/dataset/processor.py>`_). The processors are different from operators in expression engine. It is designed for some complicated data processing methods which is hard to supported in operators in expression engine.
|
||||
- At last, `Dataset <https://qlib.readthedocs.io/en/latest/component/data.html#dataset>`_ is responsible to prepare model-specific dataset from the processed data of Data Handler
|
||||
|
||||
Data Preparation
|
||||
============================
|
||||
================
|
||||
|
||||
Qlib Format Data
|
||||
------------------
|
||||
----------------
|
||||
|
||||
We've specially designed a data structure to manage financial data, please refer to the `File storage design section in Qlib paper <https://arxiv.org/abs/2009.11189>`_ for detailed information.
|
||||
Such data will be stored with filename suffix `.bin` (We'll call them `.bin` file, `.bin` format, or qlib format). `.bin` file is designed for scientific computing on finance data.
|
||||
@@ -50,9 +50,9 @@ Alpha158 √ √
|
||||
Also, ``Qlib`` provides a high-frequency dataset. Users can run a high-frequency dataset example through this `link <https://github.com/microsoft/qlib/tree/main/examples/highfreq>`_.
|
||||
|
||||
Qlib Format Dataset
|
||||
--------------------
|
||||
-------------------
|
||||
``Qlib`` has provided an off-the-shelf dataset in `.bin` format, users could use the script ``scripts/get_data.py`` to download the China-Stock dataset as follows.
|
||||
The price volume data look different from the actual dealling price because of they are **adjusted** (`adjusted price <https://www.investopedia.com/terms/a/adjusted_closing_price.asp>`_). And then you may find that the adjusted price may be different from different data sources. This is because different data sources may vary in the way of adjusting prices. Qlib normalize the price on first trading day of each stock to 1 when adjusting them.
|
||||
The price volume data look different from the actual dealling price because of they are **adjusted** (`adjusted price <https://www.investopedia.com/terms/a/adjusted_closing_price.asp>`_). And then you may find that the adjusted price may be different from different data sources. This is because different data sources may vary in the way of adjusting prices. Qlib normalize the price on first trading day of each stock to 1 when adjusting them.
|
||||
Users can leverage `$factor` to get the original trading price (e.g. `$close / $factor` to get the original close price).
|
||||
|
||||
.. code-block:: bash
|
||||
@@ -104,7 +104,7 @@ Automatic update of daily frequency data
|
||||
|
||||
|
||||
Converting CSV Format into Qlib Format
|
||||
-------------------------------------------
|
||||
--------------------------------------
|
||||
|
||||
``Qlib`` has provided the script ``scripts/dump_bin.py`` to convert **any** data in CSV format into `.bin` files (``Qlib`` format) as long as they are in the correct format.
|
||||
|
||||
@@ -126,16 +126,16 @@ Users can also provide their own data in CSV format. However, the CSV data **mus
|
||||
- CSV file is named after a specific stock *or* the CSV file includes a column of the stock name
|
||||
|
||||
- Name the CSV file after a stock: `SH600000.csv`, `AAPL.csv` (not case sensitive).
|
||||
|
||||
|
||||
- CSV file includes a column of the stock name. User **must** specify the column name when dumping the data. Here is an example:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
python scripts/dump_bin.py dump_all ... --symbol_field_name symbol
|
||||
|
||||
|
||||
where the data are in the following format:
|
||||
|
||||
.. code-block::
|
||||
.. code-block::
|
||||
|
||||
symbol,close
|
||||
SH600000,120
|
||||
@@ -145,10 +145,10 @@ Users can also provide their own data in CSV format. However, the CSV data **mus
|
||||
.. code-block:: bash
|
||||
|
||||
python scripts/dump_bin.py dump_all ... --date_field_name date
|
||||
|
||||
|
||||
where the data are in the following format:
|
||||
|
||||
.. code-block::
|
||||
.. code-block::
|
||||
|
||||
symbol,date,close,open,volume
|
||||
SH600000,2020-11-01,120,121,12300000
|
||||
@@ -172,7 +172,7 @@ After conversion, users can find their Qlib format data in the directory `~/.qli
|
||||
.. note::
|
||||
|
||||
The arguments of `--include_fields` should correspond with the column names of CSV files. The columns names of dataset provided by ``Qlib`` should include open, close, high, low, volume and factor at least.
|
||||
|
||||
|
||||
- `open`
|
||||
The adjusted opening price
|
||||
- `close`
|
||||
@@ -186,11 +186,11 @@ After conversion, users can find their Qlib format data in the directory `~/.qli
|
||||
- `factor`
|
||||
The Restoration factor. Normally, ``factor = adjusted_price / original_price``, `adjusted price` reference: `split adjusted <https://www.investopedia.com/terms/s/splitadjusted.asp>`_
|
||||
|
||||
In the convention of `Qlib` data processing, `open, close, high, low, volume, money and factor` will be set to NaN if the stock is suspended.
|
||||
In the convention of `Qlib` data processing, `open, close, high, low, volume, money and factor` will be set to NaN if the stock is suspended.
|
||||
If you want to use your own alpha-factor which can't be calculate by OCHLV, like PE, EPS and so on, you could add it to the CSV files with OHCLV together and then dump it to the Qlib format data.
|
||||
|
||||
Stock Pool (Market)
|
||||
--------------------------------
|
||||
-------------------
|
||||
|
||||
``Qlib`` defines `stock pool <https://github.com/microsoft/qlib/blob/main/examples/benchmarks/LightGBM/workflow_config_lightgbm_Alpha158.yaml#L4>`_ as stock list and their date ranges. Predefined stock pools (e.g. csi300) may be imported as follows.
|
||||
|
||||
@@ -200,7 +200,7 @@ Stock Pool (Market)
|
||||
|
||||
|
||||
Multiple Stock Modes
|
||||
--------------------------------
|
||||
--------------------
|
||||
|
||||
``Qlib`` now provides two different stock modes for users: China-Stock Mode & US-Stock Mode. Here are some different settings of these two modes:
|
||||
|
||||
@@ -218,23 +218,23 @@ The `trade unit` defines the unit number of stocks can be used in a trade, and t
|
||||
- Download china-stock in qlib format, please refer to section `Qlib Format Dataset <#qlib-format-dataset>`_.
|
||||
- Initialize ``Qlib`` in china-stock mode
|
||||
Supposed that users download their Qlib format data in the directory ``~/.qlib/qlib_data/cn_data``. Users only need to initialize ``Qlib`` as follows.
|
||||
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from qlib.constant import REG_CN
|
||||
qlib.init(provider_uri='~/.qlib/qlib_data/cn_data', region=REG_CN)
|
||||
|
||||
|
||||
|
||||
- If users use ``Qlib`` in US-stock mode, US-stock data is required. ``Qlib`` also provides a script to download US-stock data. Users can use ``Qlib`` in US-stock mode according to the following steps:
|
||||
- Download us-stock in qlib format, please refer to section `Qlib Format Dataset <#qlib-format-dataset>`_.
|
||||
- Initialize ``Qlib`` in US-stock mode
|
||||
Supposed that users prepare their Qlib format data in the directory ``~/.qlib/qlib_data/us_data``. Users only need to initialize ``Qlib`` as follows.
|
||||
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from qlib.config import REG_US
|
||||
qlib.init(provider_uri='~/.qlib/qlib_data/us_data', region=REG_US)
|
||||
|
||||
|
||||
|
||||
.. note::
|
||||
|
||||
@@ -242,14 +242,14 @@ The `trade unit` defines the unit number of stocks can be used in a trade, and t
|
||||
|
||||
|
||||
Data API
|
||||
========================
|
||||
========
|
||||
|
||||
Data Retrieval
|
||||
---------------
|
||||
--------------
|
||||
Users can use APIs in ``qlib.data`` to retrieve data, please refer to `Data Retrieval <../start/getdata.html>`_.
|
||||
|
||||
Feature
|
||||
------------------
|
||||
-------
|
||||
|
||||
``Qlib`` provides `Feature` and `ExpressionOps` to fetch the features according to users' needs.
|
||||
|
||||
@@ -264,7 +264,7 @@ Feature
|
||||
To know more about ``Feature``, please refer to `Feature API <../reference/api.html#module-qlib.data.base>`_.
|
||||
|
||||
Filter
|
||||
-------------------
|
||||
------
|
||||
``Qlib`` provides `NameDFilter` and `ExpressionDFilter` to filter the instruments according to users' needs.
|
||||
|
||||
- `NameDFilter`
|
||||
@@ -272,7 +272,7 @@ Filter
|
||||
|
||||
- `ExpressionDFilter`
|
||||
Expression dynamic instrument filter. Filter the instruments based on a certain expression. An expression rule indicating a certain feature field is required.
|
||||
|
||||
|
||||
- `basic features filter`: rule_expression = '$close/$open>5'
|
||||
- `cross-sectional features filter` \: rule_expression = '$rank($close)<10'
|
||||
- `time-sequence features filter`: rule_expression = '$Ref($close, 3)>100'
|
||||
@@ -299,29 +299,29 @@ Here is a simple example showing how to use filter in a basic ``Qlib`` workflow
|
||||
To know more about ``Filter``, please refer to `Filter API <../reference/api.html#module-qlib.data.filter>`_.
|
||||
|
||||
Reference
|
||||
-------------
|
||||
---------
|
||||
|
||||
To know more about ``Data API``, please refer to `Data API <../reference/api.html#data>`_.
|
||||
|
||||
|
||||
Data Loader
|
||||
=================
|
||||
===========
|
||||
|
||||
``Data Loader`` in ``Qlib`` is designed to load raw data from the original data source. It will be loaded and used in the ``Data Handler`` module.
|
||||
|
||||
QlibDataLoader
|
||||
---------------
|
||||
--------------
|
||||
|
||||
The ``QlibDataLoader`` class in ``Qlib`` is such an interface that allows users to load raw data from the ``Qlib`` data source.
|
||||
|
||||
StaticDataLoader
|
||||
---------------
|
||||
----------------
|
||||
|
||||
The ``StaticDataLoader`` class in ``Qlib`` is such an interface that allows users to load raw data from file or as provided.
|
||||
|
||||
|
||||
Interface
|
||||
------------
|
||||
---------
|
||||
|
||||
Here are some interfaces of the ``QlibDataLoader`` class:
|
||||
|
||||
@@ -329,28 +329,28 @@ Here are some interfaces of the ``QlibDataLoader`` class:
|
||||
:members:
|
||||
|
||||
API
|
||||
-----------
|
||||
---
|
||||
|
||||
To know more about ``Data Loader``, please refer to `Data Loader API <../reference/api.html#module-qlib.data.dataset.loader>`_.
|
||||
|
||||
|
||||
Data Handler
|
||||
=================
|
||||
============
|
||||
|
||||
The ``Data Handler`` module in ``Qlib`` is designed to handler those common data processing methods which will be used by most of the models.
|
||||
|
||||
Users can use ``Data Handler`` in an automatic workflow by ``qrun``, refer to `Workflow: Workflow Management <workflow.html>`_ for more details.
|
||||
Users can use ``Data Handler`` in an automatic workflow by ``qrun``, refer to `Workflow: Workflow Management <workflow.html>`_ for more details.
|
||||
|
||||
DataHandlerLP
|
||||
--------------
|
||||
-------------
|
||||
|
||||
In addition to use ``Data Handler`` in an automatic workflow with ``qrun``, ``Data Handler`` can be used as an independent module, by which users can easily preprocess data (standardization, remove NaN, etc.) and build datasets.
|
||||
In addition to use ``Data Handler`` in an automatic workflow with ``qrun``, ``Data Handler`` can be used as an independent module, by which users can easily preprocess data (standardization, remove NaN, etc.) and build datasets.
|
||||
|
||||
In order to achieve so, ``Qlib`` provides a base class `qlib.data.dataset.DataHandlerLP <../reference/api.html#qlib.data.dataset.handler.DataHandlerLP>`_. The core idea of this class is that: we will have some learnable ``Processors`` which can learn the parameters of data processing(e.g., parameters for zscore normalization). When new data comes in, these `trained` ``Processors`` can then process the new data and thus processing real-time data in an efficient way becomes possible. More information about ``Processors`` will be listed in the next subsection.
|
||||
|
||||
|
||||
Interface
|
||||
----------------------
|
||||
---------
|
||||
|
||||
Here are some important interfaces that ``DataHandlerLP`` provides:
|
||||
|
||||
@@ -364,7 +364,7 @@ Also, users can pass ``qlib.contrib.data.processor.ConfigSectionProcessor`` that
|
||||
|
||||
|
||||
Processor
|
||||
----------
|
||||
---------
|
||||
|
||||
The ``Processor`` module in ``Qlib`` is designed to be learnable and it is responsible for handling data processing such as `normalization` and `drop none/nan features/labels`.
|
||||
|
||||
@@ -382,14 +382,14 @@ The ``Processor`` module in ``Qlib`` is designed to be learnable and it is respo
|
||||
- ``CSRankNorm``: `processor` that applies cross sectional rank normalization.
|
||||
- ``CSZFillna``: `processor` that fills N/A values in a cross sectional way by the mean of the column.
|
||||
|
||||
Users can also create their own `processor` by inheriting the base class of ``Processor``. Please refer to the implementation of all the processors for more information (`Processor Link <https://github.com/microsoft/qlib/blob/main/qlib/data/dataset/processor.py>`_).
|
||||
Users can also create their own `processor` by inheriting the base class of ``Processor``. Please refer to the implementation of all the processors for more information (`Processor Link <https://github.com/microsoft/qlib/blob/main/qlib/data/dataset/processor.py>`_).
|
||||
|
||||
To know more about ``Processor``, please refer to `Processor API <../reference/api.html#module-qlib.data.dataset.processor>`_.
|
||||
|
||||
Example
|
||||
--------------
|
||||
-------
|
||||
|
||||
``Data Handler`` can be run with ``qrun`` by modifying the configuration file, and can also be used as a single module.
|
||||
``Data Handler`` can be run with ``qrun`` by modifying the configuration file, and can also be used as a single module.
|
||||
|
||||
Know more about how to run ``Data Handler`` with ``qrun``, please refer to `Workflow: Workflow Management <workflow.html>`_
|
||||
|
||||
@@ -427,17 +427,17 @@ Qlib provides implemented data handler `Alpha158`. The following example shows h
|
||||
.. note:: In the ``Alpha158``, ``Qlib`` uses the label `Ref($close, -2)/Ref($close, -1) - 1` that means the change from T+1 to T+2, rather than `Ref($close, -1)/$close - 1`, of which the reason is that when getting the T day close price of a china stock, the stock can be bought on T+1 day and sold on T+2 day.
|
||||
|
||||
API
|
||||
---------
|
||||
---
|
||||
|
||||
To know more about ``Data Handler``, please refer to `Data Handler API <../reference/api.html#module-qlib.data.dataset.handler>`_.
|
||||
|
||||
|
||||
Dataset
|
||||
=================
|
||||
=======
|
||||
|
||||
The ``Dataset`` module in ``Qlib`` aims to prepare data for model training and inferencing.
|
||||
|
||||
The motivation of this module is that we want to maximize the flexibility of different models to handle data that are suitable for themselves. This module gives the model the flexibility to process their data in an unique way. For instance, models such as ``GBDT`` may work well on data that contains `nan` or `None` value, while neural networks such as ``MLP`` will break down on such data.
|
||||
The motivation of this module is that we want to maximize the flexibility of different models to handle data that are suitable for themselves. This module gives the model the flexibility to process their data in an unique way. For instance, models such as ``GBDT`` may work well on data that contains `nan` or `None` value, while neural networks such as ``MLP`` will break down on such data.
|
||||
|
||||
If user's model need process its data in a different way, user could implement his own ``Dataset`` class. If the model's
|
||||
data processing is not special, ``DatasetH`` can be used directly.
|
||||
@@ -448,18 +448,18 @@ The ``DatasetH`` class is the `dataset` with `Data Handler`. Here is the most im
|
||||
:members:
|
||||
|
||||
API
|
||||
---------
|
||||
---
|
||||
|
||||
To know more about ``Dataset``, please refer to `Dataset API <../reference/api.html#dataset>`_.
|
||||
|
||||
|
||||
Cache
|
||||
==========
|
||||
=====
|
||||
|
||||
``Cache`` is an optional module that helps accelerate providing data by saving some frequently-used data as cache file. ``Qlib`` provides a `Memcache` class to cache the most-frequently-used data in memory, an inheritable `ExpressionCache` class, and an inheritable `DatasetCache` class.
|
||||
|
||||
Global Memory Cache
|
||||
---------------------
|
||||
-------------------
|
||||
|
||||
`Memcache` is a global memory cache mechanism that composes of three `MemCacheUnit` instances to cache **Calendar**, **Instruments**, and **Features**. The `MemCache` is defined globally in `cache.py` as `H`. Users can use `H['c'], H['i'], H['f']` to get/set `memcache`.
|
||||
|
||||
@@ -471,7 +471,7 @@ Global Memory Cache
|
||||
|
||||
|
||||
ExpressionCache
|
||||
-----------------
|
||||
---------------
|
||||
|
||||
`ExpressionCache` is a cache mechanism that saves expressions such as **Mean($close, 5)**. Users can inherit this base class to define their own cache mechanism that saves expressions according to the following steps.
|
||||
|
||||
@@ -486,7 +486,7 @@ The following shows the details about the interfaces:
|
||||
``Qlib`` has currently provided implemented disk cache `DiskExpressionCache` which inherits from `ExpressionCache` . The expressions data will be stored in the disk.
|
||||
|
||||
DatasetCache
|
||||
-----------------
|
||||
------------
|
||||
|
||||
`DatasetCache` is a cache mechanism that saves datasets. A certain dataset is regulated by a stock pool configuration (or a series of instruments, though not recommended), a list of expressions or static feature fields, the start time, and end time for the collected features and the frequency. Users can inherit this base class to define their own cache mechanism that saves datasets according to the following steps.
|
||||
|
||||
@@ -503,7 +503,7 @@ The following shows the details about the interfaces:
|
||||
|
||||
|
||||
Data and Cache File Structure
|
||||
==================================
|
||||
=============================
|
||||
|
||||
We've specially designed a file structure to manage data and cache, please refer to the `File storage design section in Qlib paper <https://arxiv.org/abs/2009.11189>`_ for detailed information. The file structure of data and cache is listed as follows.
|
||||
|
||||
@@ -536,4 +536,3 @@ We've specially designed a file structure to manage data and cache, please refer
|
||||
- .meta : an assorted meta file recording the stockpool config, field names and visit times
|
||||
- .index : an assorted index file recording the line index of all calendars
|
||||
- ...
|
||||
|
||||
|
||||
Reference in New Issue
Block a user