mirror of https://github.com/microsoft/qlib.git synced 2026-07-03 02:50:58 +08:00

Collect Data From Yahoo Finance

Please note that the data is collected from Yahoo Finance and might not be perfect. We recommend that users who have a high-quality dataset prepare their own data. For more information, refer to the related document.

Examples of abnormal data

We have taken stock price adjustment into account, but some price series still look very abnormal.

Requirements

pip install -r requirements.txt

Collect Data

Get Qlib data (bin file)

qlib-data from YahooFinance is data that has already been dumped and can be used directly in qlib.

  • get data: python scripts/get_data.py qlib_data
  • parameters:
    • target_dir: save dir, by default ~/.qlib/qlib_data/cn_data
    • version: dataset version, value from [v1, v2], by default v1
      • v2 ends at 2021-06; v1 ends at 2020-09
      • users can append data to v2; see "Automatic update of daily frequency data"
      • the qlib benchmarks use v1; because access to historical data from YahooFinance is unstable, there are some differences between v2 and v1
    • interval: 1d or 1min, by default 1d
    • region: cn or us or in, by default cn
    • delete_old: delete existing data from target_dir (features, calendars, instruments, dataset_cache, features_cache), value from [True, False], by default True
    • exists_skip: if data already exists in target_dir, skip get_data; value from [True, False], by default False
  • examples:
    # cn 1d
    python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/qlib_cn_1d --region cn
    # cn 1min
    python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/qlib_cn_1min --region cn --interval 1min
    # us 1d
    python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/qlib_us_1d --region us --interval 1d
    # us 1min
    python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/qlib_us_1min --region us --interval 1min
    # in 1d
    python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/qlib_in_1d --region in --interval 1d
    # in 1min
    python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/qlib_in_1min --region in --interval 1min
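
The examples above differ only in region and interval, so the commands can be generated rather than typed out. The following is a minimal sketch and not part of qlib: `build_get_data_cmd` is a hypothetical helper, and the `qlib_<region>_<interval>` directory naming is simply copied from the examples. It uses only the flags documented above.

```python
from itertools import product
from typing import List


def build_get_data_cmd(region: str, interval: str, root: str = "~/.qlib/qlib_data") -> List[str]:
    """Build the argument list for one `get_data.py qlib_data` call (hypothetical helper)."""
    # Directory naming follows the examples above: qlib_<region>_<interval>.
    target_dir = f"{root}/qlib_{region}_{interval}"
    return [
        "python", "scripts/get_data.py", "qlib_data",
        "--target_dir", target_dir,
        "--region", region,
        "--interval", interval,
    ]


# One command per documented region/interval combination.
commands = [build_get_data_cmd(r, i) for r, i in product(["cn", "us", "in"], ["1d", "1min"])]
for cmd in commands:
    print(" ".join(cmd))
```

To execute the commands instead of printing them, each list could be passed to `subprocess.run(cmd, check=True)` from the repository root. In that case no shell expands `~`, so expanding it first with `os.path.expanduser` is the safer choice.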
    

Collect YahooFinance data to qlib

Collect YahooFinance data and dump it into qlib format

  1. download data to csv: python scripts/data_collector/yahoo/collector.py download_data

    • parameters:
      • source_dir: directory where the downloaded csv files are saved
      • interval: 1d or 1min, by default 1d

        due to the limitation of the YahooFinance API, only the last month's data is available in 1min

      • region: CN or US or IN, by default CN
      • delay: time.sleep(delay), by default 0.5
      • start: start datetime, by default "2000-01-01"; closed interval (including start)
      • end: end datetime, by default pd.Timestamp(datetime.datetime.now() + pd.Timedelta(days=1)); open interval (excluding end)
      • max_workers: number of symbols collected concurrently, by default 1; changing this parameter is not recommended, in order to maintain the integrity of the symbol data
      • check_data_length: check the number of rows per symbol, by default None

        if len(symbol_df) < check_data_length, the symbol is re-fetched; the number of re-fetches is set by the max_collector_count parameter

      • max_collector_count: number of retries for "failed" symbols, by default 2
    • examples:
      # cn 1d data
      python collector.py download_data --source_dir ~/.qlib/stock_data/source/cn_1d --start 2020-01-01 --end 2020-12-31 --delay 1 --interval 1d --region CN
      # cn 1min data
      python collector.py download_data --source_dir ~/.qlib/stock_data/source/cn_1min --delay 1 --interval 1min --region CN
      # us 1d data
      python collector.py download_data --source_dir ~/.qlib/stock_data/source/us_1d --start 2020-01-01 --end 2020-12-31 --delay 1 --interval 1d --region US
      # us 1min data
      python collector.py download_data --source_dir ~/.qlib/stock_data/source/us_1min --delay 1 --interval 1min --region US
      # in 1d data
      python collector.py download_data --source_dir ~/.qlib/stock_data/source/in_1d --start 2020-01-01 --end 2020-12-31 --delay 1 --interval 1d --region IN
      # in 1min data
      python collector.py download_data --source_dir ~/.qlib/stock_data/source/in_1min --delay 1 --interval 1min --region IN
      
  2. normalize data: python scripts/data_collector/yahoo/collector.py normalize_data

    • parameters:
      • source_dir: csv directory
      • normalize_dir: result directory
      • max_workers: number of concurrent workers, by default 1
      • interval: 1d or 1min, by default 1d

        if interval == 1min, qlib_data_1d_dir cannot be None

      • region: CN or US or IN, by default CN
      • date_field_name: column name identifying time in csv files, by default date
      • symbol_field_name: column name identifying symbol in csv files, by default symbol
      • end_date: if not None, the last date kept in the normalized data (including end_date); if None, this parameter is ignored; by default None
      • qlib_data_1d_dir: qlib directory(1d data)
        if interval==1min, qlib_data_1d_dir cannot be None, normalize 1min needs to use 1d data;
        
            qlib_data_1d can be obtained like this:
                $ python scripts/get_data.py qlib_data --target_dir <qlib_data_1d_dir> --interval 1d
                $ python scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir <qlib_data_1d_dir> --trading_date 2021-06-01
            or:
                download 1d data from YahooFinance
        
        
    • examples:
      # normalize 1d cn
      python collector.py normalize_data --source_dir ~/.qlib/stock_data/source/cn_1d --normalize_dir ~/.qlib/stock_data/source/cn_1d_nor --region CN --interval 1d
      # normalize 1min cn
      python collector.py normalize_data --qlib_data_1d_dir ~/.qlib/qlib_data/qlib_cn_1d --source_dir ~/.qlib/stock_data/source/cn_1min --normalize_dir ~/.qlib/stock_data/source/cn_1min_nor --region CN --interval 1min
      
  3. dump data: python scripts/dump_bin.py dump_all

    • parameters:
      • csv_path: stock data path or directory, normalize result(normalize_dir)
      • qlib_dir: qlib (dump) data directory
      • freq: transaction frequency, by default day

        freq_map = {1d: day, 1min: 1min}

      • max_workers: number of threads, by default 16
      • include_fields: dump fields, by default ""
      • exclude_fields: fields not dumped, by default ""

        dump_fields = include_fields if include_fields else set(symbol_df.columns) - set(exclude_fields) if exclude_fields else symbol_df.columns

      • symbol_field_name: column name identifying symbol in csv files, by default symbol
      • date_field_name: column name identifying time in csv files, by default date
    • examples:
      # dump 1d cn
      python dump_bin.py dump_all --csv_path ~/.qlib/stock_data/source/cn_1d_nor --qlib_dir ~/.qlib/qlib_data/qlib_cn_1d --freq day --exclude_fields date,symbol
      # dump 1min cn
      python dump_bin.py dump_all --csv_path ~/.qlib/stock_data/source/cn_1min_nor --qlib_dir ~/.qlib/qlib_data/qlib_cn_1min --freq 1min --exclude_fields date,symbol
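
The field selection rule for dump_all (include_fields takes priority, otherwise exclude_fields is subtracted, otherwise every column is dumped) can be sketched as below. This is an illustration under stated assumptions, not the code in dump_bin.py: `select_dump_fields` and `FREQ_MAP` are hypothetical names, the fields are assumed to arrive as comma-separated strings as in the examples, and the sketch keeps the column order where the rule above uses a set difference.

```python
from typing import Iterable, List

# Interval names used by the collector mapped to the freq values used by dump_bin.py.
FREQ_MAP = {"1d": "day", "1min": "1min"}


def _split(fields: str) -> List[str]:
    """Turn a comma-separated string such as "date,symbol" into a clean list."""
    return [name.strip() for name in fields.split(",") if name.strip()]


def select_dump_fields(columns: Iterable[str], include_fields: str = "", exclude_fields: str = "") -> List[str]:
    """Return the columns to dump (hypothetical helper, not dump_bin.py's own code)."""
    include = _split(include_fields)
    exclude = set(_split(exclude_fields))
    if include:
        # An explicit include list wins over everything else.
        return include
    if exclude:
        # Assumption: keep the original column order instead of a set difference.
        return [name for name in columns if name not in exclude]
    # Neither option given: dump every column.
    return list(columns)


csv_columns = ["date", "symbol", "open", "close", "volume"]
print(select_dump_fields(csv_columns, exclude_fields="date,symbol"))
print(select_dump_fields(csv_columns, include_fields="close"))
print(FREQ_MAP["1d"])
```

With `--exclude_fields date,symbol`, as in the examples above, only the price and volume columns remain.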
      

Automatic update of daily frequency data(from yahoo finance)

It is recommended that users update the data manually once (--trading_date 2021-05-25) and then set it to update automatically.

  • Automatic update of data to the "qlib" directory each trading day(Linux)

    • use crontab: crontab -e

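    • set up timed tasks (as written, the schedule below fires every minute from Monday to Friday; set the minute and hour fields to run it once per trading day):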

      * * * * 1-5 python <script path> update_data_to_bin --qlib_data_1d_dir <user data dir>
      
      • script path: scripts/data_collector/yahoo/collector.py
  • Manual update of data

    python scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir <user data dir> --trading_date <start date> --end_date <end date>
    
    • trading_date: first trading day to update
    • end_date: end of the update range (not included)
    • check_data_length: check the number of rows per symbol, by default None

      if len(symbol_df) < check_data_length, the symbol is re-fetched; the number of re-fetches is set by the max_collector_count parameter

  • scripts/data_collector/yahoo/collector.py update_data_to_bin parameters:

    • source_dir: the directory where the raw data collected from the Internet is saved, default "Path(__file__).parent/source"
    • normalize_dir: directory for normalized data, default "Path(__file__).parent/normalize"
    • qlib_data_1d_dir: the qlib data to be updated for yahoo, usually obtained by downloading qlib data (see the "Get Qlib data" section)
    • trading_date: trading days to be updated, by default datetime.datetime.now().strftime("%Y-%m-%d")
    • end_date: end datetime, default pd.Timestamp(trading_date + pd.Timedelta(days=1)); open interval (excluding end)
    • region: region, value from ["CN", "US"], default "CN"
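
The interaction between check_data_length and max_collector_count can be sketched as a re-fetch loop. This is a minimal sketch, not the collector's actual implementation: `collect_with_retries` and `fake_fetch` are hypothetical names, treating max_collector_count as the total number of passes over the symbols is an assumption, and so is dropping symbols that stay too short.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def collect_with_retries(
    symbols: Sequence[str],
    fetch: Callable[[str], Sequence],
    check_data_length: Optional[int] = None,
    max_collector_count: int = 2,
) -> Tuple[Dict[str, Sequence], List[str]]:
    """Fetch every symbol, re-fetching those with fewer rows than check_data_length."""
    results: Dict[str, Sequence] = {}
    pending = list(symbols)
    # Assumption: max_collector_count bounds the total number of passes.
    for _ in range(max_collector_count):
        if not pending:
            break
        failed = []
        for symbol in pending:
            rows = fetch(symbol)
            if check_data_length is not None and len(rows) < check_data_length:
                failed.append(symbol)  # too short: try again on the next pass
            else:
                results[symbol] = rows
        pending = failed
    return results, pending


# Stand-in for a real download: "B" returns too few rows on its first attempt only.
calls: Dict[str, int] = {}


def fake_fetch(symbol: str) -> List[int]:
    calls[symbol] = calls.get(symbol, 0) + 1
    if symbol == "B" and calls[symbol] == 1:
        return [1]
    return [1, 2, 3]


data, still_short = collect_with_retries(["A", "B"], fake_fetch, check_data_length=3)
print(sorted(data), still_short)  # ['A', 'B'] []
```

With check_data_length left at None the length check is skipped and every symbol is fetched exactly once.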

Using qlib data

import qlib
from qlib.data import D

# 1d data cn
# freq=day, freq default day
qlib.init(provider_uri="~/.qlib/qlib_data/qlib_cn_1d", region="cn")
df = D.features(D.instruments("all"), ["$close"], freq="day")

# 1min data cn
# freq=1min
qlib.init(provider_uri="~/.qlib/qlib_data/qlib_cn_1min", region="cn")
inst = D.list_instruments(D.instruments("all"), freq="1min", as_list=True)
# get 100 symbols
df = D.features(inst[:100], ["$close"], freq="1min")
# get all symbol data
# df = D.features(D.instruments("all"), ["$close"], freq="1min")

# 1d data us
qlib.init(provider_uri="~/.qlib/qlib_data/qlib_us_1d", region="us")
df = D.features(D.instruments("all"), ["$close"], freq="day")

# 1min data us
qlib.init(provider_uri="~/.qlib/qlib_data/qlib_us_1min", region="us")
inst = D.list_instruments(D.instruments("all"), freq="1min", as_list=True)
# get 100 symbols
df = D.features(inst[:100], ["$close"], freq="1min")
# get all symbol data
# df = D.features(D.instruments("all"), ["$close"], freq="1min")