From 5e69d089c0fcbbc6666cbadf7348733093334238 Mon Sep 17 00:00:00 2001
From: Pengrong Zhu
Date: Sun, 12 Dec 2021 09:49:10 +0800
Subject: [PATCH] add description of dataset document (#742)

---
 README.md                        |  2 +-
 scripts/data_collector/README.md | 60 ++++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+), 1 deletion(-)
 create mode 100644 scripts/data_collector/README.md

diff --git a/README.md b/README.md
index 7ed0ef42c..169e5372c 100644
--- a/README.md
+++ b/README.md
@@ -160,7 +160,7 @@ Load and prepare data by running the following code:
 
 This dataset is created by public data collected by [crawler scripts](scripts/data_collector/), which have been released in the same repository.
-Users could create the same dataset with it.
+Users could create the same dataset with it. [Description of dataset](https://github.com/microsoft/qlib/tree/main/scripts/data_collector#description-of-dataset)
 
 *Please pay **ATTENTION** that the data is collected from [Yahoo Finance](https://finance.yahoo.com/lookup), and the data might not be perfect. We recommend users to prepare their own data if they have a high-quality dataset. For more information, users can refer to the [related document](https://qlib.readthedocs.io/en/latest/component/data.html#converting-csv-format-into-qlib-format)*.
diff --git a/scripts/data_collector/README.md b/scripts/data_collector/README.md
new file mode 100644
index 000000000..d0058b33e
--- /dev/null
+++ b/scripts/data_collector/README.md
@@ -0,0 +1,60 @@
+# Data Collector
+
+## Introduction
+
+Scripts for data collection
+
+- yahoo: get *US/CN* stock data from *Yahoo Finance*
+- fund: get fund data from *http://fund.eastmoney.com*
+- cn_index: get *CN index* from *http://www.csindex.com.cn*, *CSI300*/*CSI100*
+- us_index: get *US index* from *https://en.wikipedia.org/wiki*, *SP500*/*NASDAQ100*/*DJIA*/*SP400*
+- contrib: scripts for some auxiliary functions
+
+
+## Custom Data Collection
+
+> Specific implementation reference: https://github.com/microsoft/qlib/tree/main/scripts/data_collector/yahoo
+
+1. Create a dataset code directory in the current directory
+2. Add `collector.py`
+    - add collector class:
+        ```python
+        CUR_DIR = Path(__file__).resolve().parent
+        sys.path.append(str(CUR_DIR.parent.parent))
+        from data_collector.base import BaseCollector, BaseNormalize, BaseRun
+        class UserCollector(BaseCollector):
+            ...
+        ```
+    - add normalize class:
+        ```python
+        class UserNormalize(BaseNormalize):
+            ...
+        ```
+    - add `CLI` class:
+        ```python
+        class Run(BaseRun):
+            ...
+        ```
+3. Add `README.md`
+4. Add `requirements.txt`
+
+
+## Description of dataset
+
+|             | Basic data |
+|-------------|------------|
+| Features    | **Price/Volume**:<br>- $close/$open/$low/$high/$volume/$change/$factor |
+| Calendar    | **\.txt**:<br>- day.txt<br>- 1min.txt |
+| Instruments | **\.txt**:<br>- required: **all.txt**;<br>- csi300.txt/csi500.txt/sp500.txt |
+
+- `Features`: data, **digital**
+    - if not **adjusted**, **factor=1**
+
+### Data-dependent component
+
+> For a component to run correctly, the data it depends on must be available
+
+| Component      | Required data |
+|----------------|---------------|
+| Data retrieval | Features, Calendar, Instruments |
+| Backtest       | **Features[Price/Volume]**, Calendar, Instruments |
\ No newline at end of file
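
Note on the "Data-dependent component" table: the dependency check it describes can be sketched in a few lines of Python. This is only an illustration, not part of the patch and not qlib's API. The directory names `features`, `calendars` and `instruments`, and the helper names `available_data` and `missing_data`, are assumptions made for the example; only `day.txt`, `all.txt` and the component/data names come from the tables above.

```python
from pathlib import Path
from typing import List, Set

# Required data per component, transcribed from the table in the patch.
REQUIRED_DATA = {
    "Data retrieval": ["Features", "Calendar", "Instruments"],
    "Backtest": ["Features", "Calendar", "Instruments"],
}


def available_data(data_dir: Path) -> Set[str]:
    """Report which kinds of data are present under data_dir.

    ASSUMPTION: the sub-directory names below are hypothetical and
    chosen for illustration; the patch does not state the layout.
    """
    found = set()
    features = data_dir / "features"
    # Features count as present if the directory holds at least one entry.
    if features.is_dir() and any(features.iterdir()):
        found.add("Features")
    # The calendar table lists day.txt as one of the calendar files.
    if (data_dir / "calendars" / "day.txt").is_file():
        found.add("Calendar")
    # The instruments table marks all.txt as required.
    if (data_dir / "instruments" / "all.txt").is_file():
        found.add("Instruments")
    return found


def missing_data(component: str, data_dir: Path) -> List[str]:
    """Return the required data kinds that component lacks in data_dir."""
    have = available_data(data_dir)
    return [name for name in REQUIRED_DATA[component] if name not in have]
```

Calling `missing_data("Backtest", Path("my_dataset"))` (with `my_dataset` as a placeholder path) returns the list of data kinds still to be prepared, and an empty list once all three are in place.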