From 5e69d089c0fcbbc6666cbadf7348733093334238 Mon Sep 17 00:00:00 2001
From: Pengrong Zhu <zhu.pengrong@foxmail.com>
Date: Sun, 12 Dec 2021 09:49:10 +0800
Subject: [PATCH] add description of dataset document (#742)

---
 README.md                        |  2 +-
 scripts/data_collector/README.md | 60 ++++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+), 1 deletion(-)
 create mode 100644 scripts/data_collector/README.md
diff --git a/README.md b/README.md
index 7ed0ef42c..169e5372c 100644
--- a/README.md
+++ b/README.md
@@ -160,7 +160,7 @@ Load and prepare data by running the following code:
 
 This dataset is created by public data collected by [crawler scripts](scripts/data_collector/), which have been released in
 the same repository.
-Users could create the same dataset with it. 
+Users can create the same dataset with them. See the [description of the dataset](https://github.com/microsoft/qlib/tree/main/scripts/data_collector#description-of-dataset).
 
 *Please pay **ATTENTION** that the data is collected from [Yahoo Finance](https://finance.yahoo.com/lookup), and the data might not be perfect.
 We recommend users to prepare their own data if they have a high-quality dataset. For more information, users can refer to the [related document](https://qlib.readthedocs.io/en/latest/component/data.html#converting-csv-format-into-qlib-format)*.
diff --git a/scripts/data_collector/README.md b/scripts/data_collector/README.md
new file mode 100644
index 000000000..d0058b33e
--- /dev/null
+++ b/scripts/data_collector/README.md
@@ -0,0 +1,60 @@
+# Data Collector
+
+## Introduction
+
+Scripts for data collection
+
+- yahoo: get *US/CN* stock data from *Yahoo Finance*
+- fund: get fund data from *http://fund.eastmoney.com*
+- cn_index: get *CN index* from *http://www.csindex.com.cn*, *CSI300*/*CSI100*
+- us_index: get *US index* from *https://en.wikipedia.org/wiki*, *SP500*/*NASDAQ100*/*DJIA*/*SP400*
+- contrib: scripts for some auxiliary functions
+
+
+## Custom Data Collection
+
+> For a concrete implementation, see: https://github.com/microsoft/qlib/tree/main/scripts/data_collector/yahoo
+
+1. Create a dataset code directory in the current directory
+2. Add `collector.py`
+   - add collector class:
+     ```python
+     CUR_DIR = Path(__file__).resolve().parent
+     sys.path.append(str(CUR_DIR.parent.parent))
+     from data_collector.base import BaseCollector, BaseNormalize, BaseRun
+     class UserCollector(BaseCollector):
+         ...
+     ```
+   - add normalize class:
+     ```python
+     class UserNormalize(BaseNormalize):
+         ...
+     ```
+   - add `CLI` class:
+     ```python
+     class Run(BaseRun):
+         ...
+     ```
+3. Add `README.md`
+4. Add `requirements.txt`
+
+
+## Description of dataset
+
+  |             | Basic data                                                                                                       |
+  |-------------|------------------------------------------------------------------------------------------------------------------|
+  | Features    | **Price/Volume**: <br>&nbsp;&nbsp; - $close/$open/$low/$high/$volume/$change/$factor                             |
+  | Calendar    | **\<freq>.txt**: <br>&nbsp;&nbsp; - day.txt<br>&nbsp;&nbsp;  - 1min.txt                                          |
+  | Instruments | **\<market>.txt**: <br>&nbsp;&nbsp; - required: **all.txt**; <br>&nbsp;&nbsp;  - csi300.txt/csi500.txt/sp500.txt |
+
+  - `Features`: the data must be **numeric**
+    - if prices are not **adjusted**, set **factor=1**
+
+### Data-dependent component
+
+> Each component needs the data listed below in order to run correctly
+
+  | Component      | Required data                                     |
+  |----------------|---------------------------------------------------|
+  | Data retrieval | Features, Calendar, Instruments                   |
+  | Backtest       | **Features[Price/Volume]**, Calendar, Instruments |
\ No newline at end of file
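
The component-to-data table in the new README can be read as a simple lookup. The sketch below is illustrative only and is not part of this patch or of Qlib's API: the names `REQUIRED_DATA`, `SUBDIRS`, `available_data`, and `missing_data` are hypothetical, and the sub-directory names `features`, `calendars`, and `instruments` are an assumption about how a prepared data directory is laid out.

```python
from pathlib import Path

# Required data per component, copied from the README table.
# For Backtest, only the Price/Volume part of Features is needed.
REQUIRED_DATA = {
    "Data retrieval": {"Features", "Calendar", "Instruments"},
    "Backtest": {"Features", "Calendar", "Instruments"},
}

# Hypothetical mapping from data kind to sub-directory name (an assumption).
SUBDIRS = {
    "Features": "features",
    "Calendar": "calendars",
    "Instruments": "instruments",
}


def available_data(data_dir):
    """Return the data kinds whose sub-directory exists and is non-empty."""
    root = Path(data_dir)
    found = set()
    for kind, sub in SUBDIRS.items():
        path = root / sub
        if path.is_dir() and any(path.iterdir()):
            found.add(kind)
    return found


def missing_data(component, data_dir):
    """Return the sorted data kinds that `component` still lacks."""
    return sorted(REQUIRED_DATA[component] - available_data(data_dir))
```

Calling `missing_data("Backtest", some_dir)` before a run gives a quick list of what still has to be collected; an empty list means every kind of data named in the table was found under the assumed layout.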