mirror of https://github.com/microsoft/qlib.git synced 2026-06-06 05:51:17 +08:00

Qlib data doc (#1207)

* Explain data crawler structure

* Add documentation for data and feature

* Update scripts/data_collector/yahoo/README.md

Co-authored-by: you-n-g <you-n-g@users.noreply.github.com>

* Remove some confusing wording

* Add third party data source

* Fix command typo

* Update commands

Co-authored-by: you-n-g <you-n-g@users.noreply.github.com>
Di
2022-07-22 09:24:58 +08:00
committed by GitHub
parent 8199822ca0
commit 86f08e47e8
5 changed files with 97 additions and 2 deletions


@@ -67,3 +67,10 @@ from qlib.constant import REG_CN
provider_uri = "~/.qlib/qlib_data/cn_data" # target_dir
qlib.init(provider_uri=provider_uri, region=REG_CN)
```
## Use Crowd Sourced Data
There is also a [crowd-sourced version of qlib data](data_collector/crowd_source/README.md), released at https://github.com/chenditc/investment_data/releases
```bash
wget https://github.com/chenditc/investment_data/releases/download/20220720/qlib_bin.tar.gz
tar -zxvf qlib_bin.tar.gz -C ~/.qlib/qlib_data/cn_data --strip-components=2
```


@@ -0,0 +1,32 @@
# Crowd Source Data
## Initiative
Public data sources such as Yahoo are flawed: they may miss data for delisted stocks, and some of the data they do have may be wrong. This can introduce survivorship bias into the training process.
The crowd-sourced data merges data from multiple sources and cross-validates them against each other, so that:
1. We have a more complete historical record.
2. We can identify anomalous data and apply corrections when necessary.
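As a rough illustration of the cross-validation idea above, the sketch below merges closing prices from two sources and flags the dates on which they disagree. The function name, the tolerance, and the sample prices are all hypothetical; this is not the logic actually used in the investment_data repo.

```python
from typing import Dict, List, Tuple


def merge_and_flag(
    source_a: Dict[str, float],
    source_b: Dict[str, float],
    rel_tol: float = 0.01,
) -> Tuple[Dict[str, float], List[str]]:
    """Merge two date -> close mappings and list the dates where they disagree.

    Hypothetical helper, for illustration only.
    """
    merged: Dict[str, float] = {}
    anomalies: List[str] = []
    for date in sorted(set(source_a) | set(source_b)):
        a = source_a.get(date)
        b = source_b.get(date)
        if a is None or b is None:
            # Only one source covers this date: keep it to fill the gap.
            merged[date] = b if a is None else a
            continue
        # Both sources cover the date: compare them before trusting either.
        if abs(a - b) > rel_tol * max(abs(a), abs(b)):
            anomalies.append(date)
        # Arbitrary choice: prefer source A; flagged dates still need review.
        merged[date] = a
    return merged, anomalies


# Made-up sample prices, not real market data.
yahoo_like = {"2022-07-18": 10.0, "2022-07-19": 10.2}
other_source = {"2022-07-18": 12.0, "2022-07-19": 10.2, "2022-07-20": 10.5}
merged, anomalies = merge_and_flag(yahoo_like, other_source)
```

Here the merged record covers three dates even though neither source alone is guaranteed to, and the one date with conflicting prices is reported for manual correction.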
## Related Repo
The raw data is hosted in a DoltHub repo: https://www.dolthub.com/repositories/chenditc/investment_data
The processing scripts and SQL are hosted in a GitHub repo: https://github.com/chenditc/investment_data
The packaged Docker runtime is hosted on Docker Hub: https://hub.docker.com/repository/docker/chenditc/investment_data
## How to use it in qlib
### Option 1: Download release bin data
Users can download the data in qlib bin format and use it directly: https://github.com/chenditc/investment_data/releases/tag/20220720
```bash
wget https://github.com/chenditc/investment_data/releases/download/20220720/qlib_bin.tar.gz
tar -zxvf qlib_bin.tar.gz -C ~/.qlib/qlib_data/cn_data --strip-components=2
```
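Where `tar` is not available, the extraction step can be approximated in Python. This is a minimal sketch using only the standard library; `extract_stripped` is a hypothetical helper that mimics `tar --strip-components`, and it assumes the archive has already been downloaded.

```python
import shutil
import tarfile
from pathlib import Path, PurePosixPath
from typing import List


def extract_stripped(archive: str, target_dir: str, strip: int = 2) -> List[Path]:
    """Extract regular files from a .tar.gz archive into target_dir,
    dropping the first `strip` path components of every member."""
    target = Path(target_dir).expanduser()
    written: List[Path] = []
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # parent directories are recreated below; links are skipped
            parts = PurePosixPath(member.name).parts[strip:]
            if not parts or ".." in parts or parts[0] == "/":
                continue  # nothing left after stripping, or an unsafe path
            dest = target.joinpath(*parts)
            dest.parent.mkdir(parents=True, exist_ok=True)
            with tar.extractfile(member) as src, open(dest, "wb") as out:
                shutil.copyfileobj(src, out)
            written.append(dest)
    return written


# Usage (assumes qlib_bin.tar.gz is in the current directory):
# extract_stripped("qlib_bin.tar.gz", "~/.qlib/qlib_data/cn_data", strip=2)
```

Stripping two components mirrors the `--strip-components=2` flag in the command above, so the archive's inner folders land directly in the target directory.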
### Option 2: Generate qlib data from dolthub
The DoltHub data is updated daily, so users who want up-to-date data can dump the qlib bin files using Docker:
```
docker run -v /<some output directory>:/output -it --rm chenditc/investment_data bash -c "bash dump_qlib_bin.sh && cp ./qlib_bin.tar.gz /output/"
```
## FAQ and other info
See: https://github.com/chenditc/investment_data/blob/main/README.md


@@ -36,7 +36,7 @@ pip install -r requirements.txt
- `target_dir`: save dir, by default *~/.qlib/qlib_data/cn_data*
- `version`: dataset version, value from [`v1`, `v2`], by default `v1`
- `v2` end date is *2021-06*, `v1` end date is *2020-09*
- user can append data to `v2`: [automatic update of daily frequency data](#automatic-update-of-daily-frequency-datafrom-yahoo-finance)
- If users want to incrementally update data, they need to use yahoo collector to [collect data from scratch](#collector-yahoofinance-data-to-qlib).
- **the [benchmarks](https://github.com/microsoft/qlib/tree/main/examples/benchmarks) for qlib use `v1`**, *due to the unstable access to historical data by YahooFinance, there are some differences between `v2` and `v1`*
- `interval`: `1d` or `1min`, by default `1d`
- `region`: `cn` or `us` or `in`, by default `cn`
@@ -62,6 +62,8 @@ pip install -r requirements.txt
> collect *YahooFinance* data and *dump* it into `qlib` format.
> If the above ready-made data can't meet users' requirements, users can follow this section to crawl the latest data and convert it to qlib-data.
1. download data to csv: `python scripts/data_collector/yahoo/collector.py download_data`
This downloads the raw data (high, low, open, close, and adjclose prices) from Yahoo to a local directory, one file per symbol.
- parameters:
- `source_dir`: the directory in which to save the downloaded data
@@ -99,6 +101,10 @@ pip install -r requirements.txt
```
2. normalize data: `python scripts/data_collector/yahoo/collector.py normalize_data`
This will:
1. Normalize high, low, close, open price using adjclose.
2. Normalize the high, low, close, open price so that the first valid trading date's close price is 1.
- parameters:
- `source_dir`: csv directory
- `normalize_dir`: result directory
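The two normalization steps listed above can be sketched as follows. This is a minimal illustration, assuming the adjustment factor is `adjclose / close`; the `normalize` function and the sample rows are hypothetical, and the exact computation in `collector.py` may differ.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]
FIELDS = ("open", "high", "low", "close")


def normalize(rows: List[Row]) -> List[Row]:
    """Adjust prices by adjclose, then rescale so the first valid close is 1."""
    adjusted: List[Row] = []
    for row in rows:
        close, adjclose = row["close"], row["adjclose"]
        if close is None or adjclose is None or close == 0:
            # No usable factor for this day: treat it as invalid.
            adjusted.append({f: None for f in FIELDS})
            continue
        factor = adjclose / close  # assumed adjustment factor
        adjusted.append(
            {f: None if row[f] is None else row[f] * factor for f in FIELDS}
        )
    # Step 2: divide everything by the first valid adjusted close.
    first_close = next((r["close"] for r in adjusted if r["close"] is not None), None)
    if first_close is None:
        return adjusted
    return [
        {f: None if r[f] is None else r[f] / first_close for f in FIELDS}
        for r in adjusted
    ]


# Made-up rows resembling a 2-for-1 split between the two days.
rows = [
    {"open": 20.0, "high": 22.0, "low": 19.0, "close": 20.0, "adjclose": 10.0},
    {"open": 10.0, "high": 12.0, "low": 10.0, "close": 11.0, "adjclose": 11.0},
]
normalized = normalize(rows)
```

After normalization the first day's close is 1 and the split no longer shows up as a price jump: the second day's close is about 1.1 rather than roughly half of the first.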
@@ -136,6 +142,8 @@ pip install -r requirements.txt
```
3. dump data: `python scripts/dump_bin.py dump_all`
This will convert the normalized csv files into numpy arrays in the `feature` directory, storing the data as one directory per symbol and one file per column.
- parameters:
- `csv_path`: stock data path or directory, **normalize result(normalize_dir)**
- `qlib_dir`: qlib(dump) data directory
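
The layout described for this step (one directory per symbol, one file per column) can be pictured with the sketch below. It uses only the standard library rather than numpy; the file naming, the little-endian `float32` encoding, and the `dump_symbol` helper are assumptions for illustration, and the real `dump_bin.py` output may differ in detail.

```python
import csv
import struct
from pathlib import Path
from typing import List


def dump_symbol(csv_file: str, out_dir: str, fields: List[str]) -> List[Path]:
    """Write each requested column of one symbol's csv to its own binary file."""
    symbol = Path(csv_file).stem.lower()  # assumed: csv file name is the symbol
    symbol_dir = Path(out_dir) / symbol   # one directory per symbol
    symbol_dir.mkdir(parents=True, exist_ok=True)
    with open(csv_file, newline="") as fh:
        rows = list(csv.DictReader(fh))
    written: List[Path] = []
    for field in fields:
        values = [float(row[field]) for row in rows]
        path = symbol_dir / f"{field}.bin"  # one file per column
        # Pack the column as consecutive little-endian 32-bit floats.
        path.write_bytes(struct.pack(f"<{len(values)}f", *values))
        written.append(path)
    return written


# Usage with hypothetical paths:
# dump_symbol("normalize/SH600000.csv", "qlib_data/features", ["open", "close"])
```

Storing each column separately means a reader that needs only `close` for a symbol can load that single file without parsing the other fields.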