add_baostock_collector (#1641)

* add_baostock_collector * modify_comments * fix_pylint_error * solve_duplication_methods * modified the logic of update_data_to_bin * modified the logic of update_data_to_bin * optimize code * optimize pylint issue * fix pylint error * changes suggested by the review * fix CI faild * fix CI faild * fix issue 1121 * format with black * optimize code logic * optimize code logic * fix error code * drop warning during code runs * optimize code * format with black * fix bug * format with black * optimize code * optimize code * add comments
2026-07-22 03:37:34 +08:00 · 2023-11-21 20:31:47 +08:00
parent ceff886f49
commit 98f569eed2
17 changed files with 724 additions and 320 deletions
--- a/scripts/data_collector/baostock_5min/README.md
+++ b/scripts/data_collector/baostock_5min/README.md
@@ -0,0 +1,81 @@
+## Collector Data
+
+### Get Qlib data(`bin file`)
+
+  - get data: `python scripts/get_data.py qlib_data`
+  - parameters:
+    - `target_dir`: save dir, by default *~/.qlib/qlib_data/cn_data_5min*
+    - `version`: dataset version, value from [`v2`], by default `v2`
+      - `v2` end date is *2022-12*
+    - `interval`: `5min`
+    - `region`: `hs300`
+    - `delete_old`: delete existing data from `target_dir`(*features, calendars, instruments, dataset_cache, features_cache*), value from [`True`, `False`], by default `True`
+    - `exists_skip`: skip `get_data` if data already exists in `target_dir`, value from [`True`, `False`], by default `False`
+  - examples:
+    ```bash
+    # hs300 5min
+    python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/hs300_data_5min --region hs300 --interval 5min
+    ```
+    
+### Collector *Baostock high frequency* data to qlib
+> Collect *Baostock high frequency* data and *dump* it into `qlib` format.
+> If the ready-made data above doesn't meet your requirements, follow this section to crawl the latest data and convert it to qlib-data.
+  1. download data to csv: `python scripts/data_collector/baostock_5min/collector.py download_data`
+     
+     This will download the raw data such as date, symbol, open, high, low, close, volume, amount, adjustflag from baostock to a local directory. One file per symbol.
+     - parameters:
+          - `source_dir`: directory where the downloaded csv files are saved
+          - `interval`: `5min`
+          - `region`: `HS300`
+          - `start`: start datetime, by default *None*
+          - `end`: end datetime, by default *None*
+     - examples:
+          ```bash
+          # cn 5min data
+          python collector.py download_data --source_dir ~/.qlib/stock_data/source/hs300_5min_original --start 2022-01-01 --end 2022-01-30 --interval 5min --region HS300
+          ```
+  2. normalize data: `python scripts/data_collector/baostock_5min/collector.py normalize_data`
+     
+     This will:
+     1. Normalize high, low, close, open price using adjclose.
+     2. Normalize the high, low, close, open price so that the first valid trading date's close price is 1. 
+     - parameters:
+          - `source_dir`: csv directory
+          - `normalize_dir`: result directory
+          - `interval`: `5min`
+            > if **`interval == 5min`**, `qlib_data_1d_dir` cannot be `None`
+          - `region`: `HS300`
+          - `date_field_name`: column *name* identifying time in csv files, by default `date`
+          - `symbol_field_name`: column *name* identifying symbol in csv files, by default `symbol`
+          - `end_date`: if not `None`, normalize the last date saved (*including end_date*); if `None`, it will ignore this parameter; by default `None`
+          - `qlib_data_1d_dir`: qlib directory(1d data)
+            if interval==5min, qlib_data_1d_dir cannot be None, normalize 5min needs to use 1d data;
+            ```
+                # qlib_data_1d can be obtained like this:
+                python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/cn_data --interval 1d --region cn --version v3
+            ```
+      - examples:
+        ```bash
+        # normalize 5min cn
+        python collector.py normalize_data --qlib_data_1d_dir ~/.qlib/qlib_data/cn_data --source_dir ~/.qlib/stock_data/source/hs300_5min_original --normalize_dir ~/.qlib/stock_data/source/hs300_5min_nor --region HS300 --interval 5min
+        ```
+  3. dump data: `python scripts/dump_bin.py dump_all`
+    
+     This will convert the normalized csv files into numpy arrays under the `features` directory, storing one file per column and one directory per symbol. 
+    
+     - parameters:
+       - `csv_path`: stock data path or directory, **normalize result(normalize_dir)**
+       - `qlib_dir`: qlib(dump) data directory
+       - `freq`: transaction frequency, by default `day`
+         > `freq_map = {1d: day, 5min: 5min}`
+       - `max_workers`: number of threads, by default *16*
+       - `include_fields`: dump fields, by default `""`
+       - `exclude_fields`: fields not dumped, by default `""`
+         > dump_fields = `include_fields if include_fields else set(symbol_df.columns) - set(exclude_fields) if exclude_fields else symbol_df.columns`
+       - `symbol_field_name`: column *name* identifying symbol in csv files, by default `symbol`
+       - `date_field_name`: column *name* identifying time in csv files, by default `date`
+     - examples:
+       ```bash
+       # dump 5min cn
+       python dump_bin.py dump_all --csv_path ~/.qlib/stock_data/source/hs300_5min_nor --qlib_dir ~/.qlib/qlib_data/hs300_5min_bin --freq 5min --exclude_fields date,symbol
+       ```
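The `dump_fields` rule quoted in the README above can be sketched as follows. This is an illustrative reading of that one-line expression, not the code in `dump_bin.py`; the helper name `select_dump_fields` and the comma-separated string inputs are assumptions, and unlike the set difference in the README expression this sketch keeps the original column order.

```python
from typing import Iterable, List


def select_dump_fields(columns: Iterable[str], include_fields: str = "", exclude_fields: str = "") -> List[str]:
    """Hypothetical helper mirroring the README's dump_fields rule."""
    columns = list(columns)
    # Parse the comma-separated CLI-style values, ignoring empty entries.
    include = [f.strip() for f in include_fields.split(",") if f.strip()]
    exclude = {f.strip() for f in exclude_fields.split(",") if f.strip()}
    if include:
        # include_fields takes precedence over everything else.
        return include
    if exclude:
        # Otherwise drop the excluded columns.
        return [c for c in columns if c not in exclude]
    # Neither option set: dump every column.
    return columns


print(select_dump_fields(["date", "symbol", "open", "close"], exclude_fields="date,symbol"))
```

With `--exclude_fields date,symbol`, as in the example command above, only the price and volume style columns remain.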
--- a/scripts/data_collector/baostock_5min/collector.py
+++ b/scripts/data_collector/baostock_5min/collector.py
@@ -0,0 +1,328 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
+
+import sys
+import copy
+import fire
+import numpy as np
+import pandas as pd
+import baostock as bs
+from tqdm import tqdm
+from pathlib import Path
+from loguru import logger
+from typing import Iterable, List
+
+import qlib
+from qlib.data import D
+
+CUR_DIR = Path(__file__).resolve().parent
+sys.path.append(str(CUR_DIR.parent.parent))
+
+from data_collector.base import BaseCollector, BaseNormalize, BaseRun
+from data_collector.utils import generate_minutes_calendar_from_daily, calc_adjusted_price
+
+
+class BaostockCollectorHS3005min(BaseCollector):
+    def __init__(
+        self,
+        save_dir: [str, Path],
+        start=None,
+        end=None,
+        interval="5min",
+        max_workers=4,
+        max_collector_count=2,
+        delay=0,
+        check_data_length: int = None,
+        limit_nums: int = None,
+    ):
+        """
+
+        Parameters
+        ----------
+        save_dir: str
+            stock save dir
+        max_workers: int
+            workers, default 4
+        max_collector_count: int
+            default 2
+        delay: float
+            time.sleep(delay), default 0
+        interval: str
+            freq, value from [5min], default 5min
+        start: str
+            start datetime, default None
+        end: str
+            end datetime, default None
+        check_data_length: int
+            check data length, by default None
+        limit_nums: int
+            used for debugging, by default None
+        """
+        bs.login()
+        super(BaostockCollectorHS3005min, self).__init__(
+            save_dir=save_dir,
+            start=start,
+            end=end,
+            interval=interval,
+            max_workers=max_workers,
+            max_collector_count=max_collector_count,
+            delay=delay,
+            check_data_length=check_data_length,
+            limit_nums=limit_nums,
+        )
+
+    def get_trade_calendar(self):
+        _format = "%Y-%m-%d"
+        start = self.start_datetime.strftime(_format)
+        end = self.end_datetime.strftime(_format)
+        rs = bs.query_trade_dates(start_date=start, end_date=end)
+        calendar_list = []
+        while rs.error_code == "0" and rs.next():
+            calendar_list.append(rs.get_row_data())
+        calendar_df = pd.DataFrame(calendar_list, columns=rs.fields)
+        trade_calendar_df = calendar_df[~calendar_df["is_trading_day"].isin(["0"])]
+        return trade_calendar_df["calendar_date"].values
+
+    @staticmethod
+    def process_interval(interval: str):
+        if interval == "1d":
+            return {"interval": "d", "fields": "date,code,open,high,low,close,volume,amount,adjustflag"}
+        if interval == "5min":
+            return {"interval": "5", "fields": "date,time,code,open,high,low,close,volume,amount,adjustflag"}
+
+    def get_data(
+        self, symbol: str, interval: str, start_datetime: pd.Timestamp, end_datetime: pd.Timestamp
+    ) -> pd.DataFrame:
+        df = self.get_data_from_remote(
+            symbol=symbol, interval=interval, start_datetime=start_datetime, end_datetime=end_datetime
+        )
+        df.columns = ["date", "time", "symbol", "open", "high", "low", "close", "volume", "amount", "adjustflag"]
+        df["time"] = pd.to_datetime(df["time"], format="%Y%m%d%H%M%S%f")
+        df["date"] = df["time"].dt.strftime("%Y-%m-%d %H:%M:%S")
+        df["date"] = df["date"].map(lambda x: pd.Timestamp(x) - pd.Timedelta(minutes=5))
+        df.drop(["time"], axis=1, inplace=True)
+        df["symbol"] = df["symbol"].map(lambda x: str(x).replace(".", "").upper())
+        return df
+
+    @staticmethod
+    def get_data_from_remote(
+        symbol: str, interval: str, start_datetime: pd.Timestamp, end_datetime: pd.Timestamp
+    ) -> pd.DataFrame:
+        df = pd.DataFrame()
+        rs = bs.query_history_k_data_plus(
+            symbol,
+            BaostockCollectorHS3005min.process_interval(interval=interval)["fields"],
+            start_date=str(start_datetime.strftime("%Y-%m-%d")),
+            end_date=str(end_datetime.strftime("%Y-%m-%d")),
+            frequency=BaostockCollectorHS3005min.process_interval(interval=interval)["interval"],
+            adjustflag="3",
+        )
+        if rs.error_code == "0" and len(rs.data) > 0:
+            data_list = rs.data
+            columns = rs.fields
+            df = pd.DataFrame(data_list, columns=columns)
+        return df
+
+    def get_hs300_symbols(self) -> List[str]:
+        hs300_stocks = []
+        trade_calendar = self.get_trade_calendar()
+        with tqdm(total=len(trade_calendar)) as p_bar:
+            for date in trade_calendar:
+                rs = bs.query_hs300_stocks(date=date)
+                while rs.error_code == "0" and rs.next():
+                    hs300_stocks.append(rs.get_row_data())
+                p_bar.update()
+        return sorted({e[1] for e in hs300_stocks})
+
+    def get_instrument_list(self):
+        logger.info("get HS stock symbols......")
+        symbols = self.get_hs300_symbols()
+        logger.info(f"get {len(symbols)} symbols.")
+        return symbols
+
+    def normalize_symbol(self, symbol: str):
+        return str(symbol).replace(".", "").upper()
+
+
+class BaostockNormalizeHS3005min(BaseNormalize):
+    COLUMNS = ["open", "close", "high", "low", "volume"]
+    AM_RANGE = ("09:30:00", "11:29:00")
+    PM_RANGE = ("13:00:00", "14:59:00")
+
+    def __init__(
+        self, qlib_data_1d_dir: [str, Path], date_field_name: str = "date", symbol_field_name: str = "symbol", **kwargs
+    ):
+        """
+
+        Parameters
+        ----------
+        qlib_data_1d_dir: str, Path
+            qlib 1d data directory; the local 1d data is used to normalize the 5min data
+        date_field_name: str
+            date field name, default is date
+        symbol_field_name: str
+            symbol field name, default is symbol
+        """
+        bs.login()
+        qlib.init(provider_uri=qlib_data_1d_dir)
+        self.all_1d_data = D.features(D.instruments("all"), ["$paused", "$volume", "$factor", "$close"], freq="day")
+        super(BaostockNormalizeHS3005min, self).__init__(date_field_name, symbol_field_name)
+
+    @staticmethod
+    def calc_change(df: pd.DataFrame, last_close: float) -> pd.Series:
+        df = df.copy()
+        _tmp_series = df["close"].fillna(method="ffill")
+        _tmp_shift_series = _tmp_series.shift(1)
+        if last_close is not None:
+            _tmp_shift_series.iloc[0] = float(last_close)
+        change_series = _tmp_series / _tmp_shift_series - 1
+        return change_series
+
+    def _get_calendar_list(self) -> Iterable[pd.Timestamp]:
+        return self.generate_5min_from_daily(self.calendar_list_1d)
+
+    @property
+    def calendar_list_1d(self):
+        calendar_list_1d = getattr(self, "_calendar_list_1d", None)
+        if calendar_list_1d is None:
+            calendar_list_1d = self._get_1d_calendar_list()
+            setattr(self, "_calendar_list_1d", calendar_list_1d)
+        return calendar_list_1d
+
+    @staticmethod
+    def normalize_baostock(
+        df: pd.DataFrame,
+        calendar_list: list = None,
+        date_field_name: str = "date",
+        symbol_field_name: str = "symbol",
+        last_close: float = None,
+    ):
+        if df.empty:
+            return df
+        symbol = df.loc[df[symbol_field_name].first_valid_index(), symbol_field_name]
+        columns = copy.deepcopy(BaostockNormalizeHS3005min.COLUMNS)
+        df = df.copy()
+        df.set_index(date_field_name, inplace=True)
+        df.index = pd.to_datetime(df.index)
+        df = df[~df.index.duplicated(keep="first")]
+        if calendar_list is not None:
+            df = df.reindex(
+                pd.DataFrame(index=calendar_list)
+                .loc[pd.Timestamp(df.index.min()).date() : pd.Timestamp(df.index.max()).date() + pd.Timedelta(days=1)]
+                .index
+            )
+        df.sort_index(inplace=True)
+        df.loc[(df["volume"] <= 0) | np.isnan(df["volume"]), list(set(df.columns) - {symbol_field_name})] = np.nan
+
+        df["change"] = BaostockNormalizeHS3005min.calc_change(df, last_close)
+
+        columns += ["change"]
+        df.loc[(df["volume"] <= 0) | np.isnan(df["volume"]), columns] = np.nan
+
+        df[symbol_field_name] = symbol
+        df.index.names = [date_field_name]
+        return df.reset_index()
+
+    def generate_5min_from_daily(self, calendars: Iterable) -> pd.Index:
+        return generate_minutes_calendar_from_daily(
+            calendars, freq="5min", am_range=self.AM_RANGE, pm_range=self.PM_RANGE
+        )
+
+    def adjusted_price(self, df: pd.DataFrame) -> pd.DataFrame:
+        df = calc_adjusted_price(
+            df=df,
+            _date_field_name=self._date_field_name,
+            _symbol_field_name=self._symbol_field_name,
+            frequence="5min",
+            _1d_data_all=self.all_1d_data,
+        )
+        return df
+
+    def _get_1d_calendar_list(self) -> Iterable[pd.Timestamp]:
+        return list(D.calendar(freq="day"))
+
+    def normalize(self, df: pd.DataFrame) -> pd.DataFrame:
+        # normalize
+        df = self.normalize_baostock(df, self._calendar_list, self._date_field_name, self._symbol_field_name)
+        # adjusted price
+        df = self.adjusted_price(df)
+        return df
+
+
+class Run(BaseRun):
+    def __init__(self, source_dir=None, normalize_dir=None, max_workers=1, interval="5min", region="HS300"):
+        """
+        Changed the default value of: scripts.data_collector.base.BaseRun.
+        """
+        super().__init__(source_dir, normalize_dir, max_workers, interval)
+        self.region = region
+
+    @property
+    def collector_class_name(self):
+        return f"BaostockCollector{self.region.upper()}{self.interval}"
+
+    @property
+    def normalize_class_name(self):
+        return f"BaostockNormalize{self.region.upper()}{self.interval}"
+
+    @property
+    def default_base_dir(self) -> [Path, str]:
+        return CUR_DIR
+
+    def download_data(
+        self,
+        max_collector_count=2,
+        delay=0.5,
+        start=None,
+        end=None,
+        check_data_length=None,
+        limit_nums=None,
+    ):
+        """download data from Baostock
+
+        Notes
+        -----
+            check_data_length, example:
+                hs300 5min, a week: 4 * 60 * 5
+
+        Examples
+        ---------
+            # get hs300 5min data
+            $ python collector.py download_data --source_dir ~/.qlib/stock_data/source/hs300_5min_original --start 2022-01-01 --end 2022-01-30 --interval 5min --region HS300
+        """
+        super(Run, self).download_data(max_collector_count, delay, start, end, check_data_length, limit_nums)
+
+    def normalize_data(
+        self,
+        date_field_name: str = "date",
+        symbol_field_name: str = "symbol",
+        end_date: str = None,
+        qlib_data_1d_dir: str = None,
+    ):
+        """normalize data
+
+        Attention
+        ---------
+        qlib_data_1d_dir cannot be None, normalize 5min needs to use 1d data;
+
+            qlib_data_1d can be obtained like this:
+                $ python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/cn_data --interval 1d --region cn --version v3
+            or:
+                download 1d data, reference: https://github.com/microsoft/qlib/tree/main/scripts/data_collector/yahoo#1d-from-yahoo
+
+        Examples
+        ---------
+            $ python collector.py normalize_data --qlib_data_1d_dir ~/.qlib/qlib_data/cn_data --source_dir ~/.qlib/stock_data/source/hs300_5min_original --normalize_dir ~/.qlib/stock_data/source/hs300_5min_nor --region HS300 --interval 5min
+        """
+        if qlib_data_1d_dir is None or not Path(qlib_data_1d_dir).expanduser().exists():
+            raise ValueError(
+                "If normalize 5min, the qlib_data_1d_dir parameter must be set: --qlib_data_1d_dir <user qlib 1d data >, Reference: https://github.com/microsoft/qlib/tree/main/scripts/data_collector/yahoo#automatic-update-of-daily-frequency-datafrom-yahoo-finance"
+            )
+        super(Run, self).normalize_data(
+            date_field_name, symbol_field_name, end_date=end_date, qlib_data_1d_dir=qlib_data_1d_dir
+        )
+
+
+if __name__ == "__main__":
+    fire.Fire(Run)
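The timestamp and symbol handling in `get_data` above can be sketched with the standard library alone. This is a minimal illustration, not the collector's code: the function names and the sample values are hypothetical, the `time` format is the one passed to `pd.to_datetime` in the diff, and the five-minute shift presumably re-keys each bar from its end time to its start time.

```python
from datetime import datetime, timedelta


def bar_start(raw_time: str, minutes: int = 5) -> datetime:
    """Parse a time string in the '%Y%m%d%H%M%S%f' layout and shift it back one bar."""
    parsed = datetime.strptime(raw_time, "%Y%m%d%H%M%S%f")
    return parsed - timedelta(minutes=minutes)


def normalize_symbol(code: str) -> str:
    """Drop the dot and upper-case the code, as normalize_symbol does in the collector."""
    return str(code).replace(".", "").upper()


# Hypothetical sample values, chosen only to match the format strings in the diff.
print(bar_start("20220104093500000"))
print(normalize_symbol("sh.600000"))
```

The collector itself does the same parsing with `pd.to_datetime` and subtracts `pd.Timedelta(minutes=5)`.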
--- a/scripts/data_collector/baostock_5min/requirements.txt
+++ b/scripts/data_collector/baostock_5min/requirements.txt
@@ -0,0 +1,13 @@
+loguru
+fire
+requests
+numpy
+pandas
+tqdm
+lxml
+yahooquery
+joblib
+beautifulsoup4
+bs4
+soupsieve
+baostock
--- a/scripts/data_collector/base.py
+++ b/scripts/data_collector/base.py
@@ -8,7 +8,7 @@ import datetime
 import importlib
 from pathlib import Path
 from typing import Type, Iterable
-from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
+from concurrent.futures import ProcessPoolExecutor

 import pandas as pd
 from tqdm import tqdm
@@ -290,7 +290,7 @@ class Normalize:

        # some symbol_field values such as TRUE, NA are decoded as True(bool), NaN(np.float) by pandas default csv parsing.
        # manually defines dtype and na_values of the symbol_field.
-        default_na = pd._libs.parsers.STR_NA_VALUES
+        default_na = pd._libs.parsers.STR_NA_VALUES  # pylint: disable=I1101
        symbol_na = default_na.copy()
        symbol_na.remove("NA")
        columns = pd.read_csv(file_path, nrows=0).columns
--- a/scripts/data_collector/br_index/collector.py
+++ b/scripts/data_collector/br_index/collector.py
@@ -3,7 +3,6 @@
 from functools import partial
 import sys
 from pathlib import Path
-import importlib
 import datetime

 import fire
@@ -98,7 +97,7 @@ class IBOVIndex(IndexBase):
        now = datetime.datetime.now()
        current_year = now.year
        current_month = now.month
-        for year in [item for item in range(init_year, current_year)]:
+        for year in [item for item in range(init_year, current_year)]:  # pylint: disable=R1721
            for el in four_months_period:
                self.years_4_month_periods.append(str(year) + "_" + el)
        # For current year the logic must be a little different
--- a/scripts/data_collector/cn_index/collector.py
+++ b/scripts/data_collector/cn_index/collector.py
@@ -4,7 +4,6 @@
 import re
 import abc
 import sys
-import datetime
 from io import BytesIO
 from typing import List, Iterable
 from pathlib import Path
@@ -39,7 +38,7 @@ def retry_request(url: str, method: str = "get", exclude_status: List = None):
    if exclude_status is None:
        exclude_status = []
    method_func = getattr(requests, method)
-    _resp = method_func(url, headers=REQ_HEADERS)
+    _resp = method_func(url, headers=REQ_HEADERS, timeout=None)
    _status = _resp.status_code
    if _status not in exclude_status and _status != 200:
        raise ValueError(f"response status: {_status}, url={url}")
--- a/scripts/data_collector/crypto/collector.py
+++ b/scripts/data_collector/crypto/collector.py
@@ -5,7 +5,6 @@ from abc import ABC
 from pathlib import Path

 import fire
-import requests
 import pandas as pd
 from loguru import logger
 from dateutil.tz import tzlocal
@@ -31,15 +30,15 @@ def get_cg_crypto_symbols(qlib_data_path: [str, Path] = None) -> list:
    -------
        crypto symbols in given exchanges list of coingecko
    """
-    global _CG_CRYPTO_SYMBOLS
+    global _CG_CRYPTO_SYMBOLS  # pylint: disable=W0603

    @deco_retry
    def _get_coingecko():
        try:
            cg = CoinGeckoAPI()
            resp = pd.DataFrame(cg.get_coins_markets(vs_currency="usd"))
-        except:
-            raise ValueError("request error")
+        except Exception as e:
+            raise ValueError("request error") from e
        try:
            _symbols = resp["id"].to_list()
        except Exception as e:
--- a/scripts/data_collector/fund/collector.py
+++ b/scripts/data_collector/fund/collector.py
@@ -107,7 +107,7 @@ class FundCollector(BaseCollector):
            url = INDEX_BENCH_URL.format(
                index_code=symbol, numberOfHistoricalDaysToCrawl=10000, startDate=start, endDate=end
            )
-            resp = requests.get(url, headers={"referer": "http://fund.eastmoney.com/110022.html"})
+            resp = requests.get(url, headers={"referer": "http://fund.eastmoney.com/110022.html"}, timeout=None)

            if resp.status_code != 200:
                raise ValueError("request error")
@@ -116,8 +116,8 @@ class FundCollector(BaseCollector):

            # Some funds don't show the net value, example: http://fundf10.eastmoney.com/jjjz_010288.html
            SYType = data["Data"]["SYType"]
-            if (SYType == "每万份收益") or (SYType == "每百份收益") or (SYType == "每百万份收益"):
-                raise Exception("The fund contains 每*份收益")
+            if SYType in {"每万份收益", "每百份收益", "每百万份收益"}:
+                raise ValueError("The fund contains 每*份收益")

            # TODO: should we sort the value by datetime?
            _resp = pd.DataFrame(data["Data"]["LSJZList"])
--- a/scripts/data_collector/future_calendar_collector.py
+++ b/scripts/data_collector/future_calendar_collector.py
@@ -53,7 +53,7 @@ class CollectorFutureCalendar:
        return datetime_d.strftime(self.calendar_format)

    def write_calendar(self, calendar: Iterable):
-        calendars_list = list(map(lambda x: self._format_datetime(x), sorted(set(self.calendar_list + calendar))))
+        calendars_list = [self._format_datetime(x) for x in sorted(set(self.calendar_list + calendar))]
        np.savetxt(self.future_path, calendars_list, fmt="%s", encoding="utf-8")

    @abc.abstractmethod
--- a/scripts/data_collector/us_index/collector.py
+++ b/scripts/data_collector/us_index/collector.py
@@ -4,7 +4,6 @@
 import abc
 from functools import partial
 import sys
-import importlib
 from pathlib import Path
 from concurrent.futures import ThreadPoolExecutor
 from typing import List
@@ -113,7 +112,7 @@ class WIKIIndex(IndexBase):
        return _calendar_list

    def _request_new_companies(self) -> requests.Response:
-        resp = requests.get(self._target_url)
+        resp = requests.get(self._target_url, timeout=None)
        if resp.status_code != 200:
            raise ValueError(f"request error: {self._target_url}")

@@ -164,7 +163,7 @@ class NASDAQ100Index(WIKIIndex):
            df = pd.read_pickle(cache_path)
        else:
            url = self.HISTORY_COMPANIES_URL.format(trade_date=trade_date)
-            resp = requests.post(url)
+            resp = requests.post(url, timeout=None)
            if resp.status_code != 200:
                raise ValueError(f"request error: {url}")
            df = pd.DataFrame(resp.json()["aaData"])
--- a/scripts/data_collector/utils.py
+++ b/scripts/data_collector/utils.py
@@ -2,6 +2,7 @@
 #  Licensed under the MIT License.

 import re
+import copy
 import importlib
 import time
 import bisect
@@ -68,7 +69,7 @@ def get_calendar_list(bench_code="CSI300") -> List[pd.Timestamp]:
    logger.info(f"get calendar list: {bench_code}......")

    def _get_calendar(url):
-        _value_list = requests.get(url).json()["data"]["klines"]
+        _value_list = requests.get(url, timeout=None).json()["data"]["klines"]
        return sorted(map(lambda x: pd.Timestamp(x.split(",")[0]), _value_list))

    calendar = _CALENDAR_MAP.get(bench_code, None)
@@ -85,12 +86,14 @@ def get_calendar_list(bench_code="CSI300") -> List[pd.Timestamp]:
                def _get_calendar(month):
                    _cal = []
                    try:
-                        resp = requests.get(SZSE_CALENDAR_URL.format(month=month, random=random.random)).json()
+                        resp = requests.get(
+                            SZSE_CALENDAR_URL.format(month=month, random=random.random), timeout=None
+                        ).json()
                        for _r in resp["data"]:
                            if int(_r["jybz"]):
                                _cal.append(pd.Timestamp(_r["jyrq"]))
                    except Exception as e:
-                        raise ValueError(f"{month}-->{e}")
+                        raise ValueError(f"{month}-->{e}") from e
                    return _cal

                month_range = pd.date_range(start="2000-01", end=pd.Timestamp.now() + pd.Timedelta(days=31), freq="M")
@@ -109,7 +112,7 @@ def get_calendar_list(bench_code="CSI300") -> List[pd.Timestamp]:

 def return_date_list(date_field_name: str, file_path: Path):
    date_list = pd.read_csv(file_path, sep=",", index_col=0)[date_field_name].to_list()
-    return sorted(map(lambda x: pd.Timestamp(x), date_list))
+    return sorted([pd.Timestamp(x) for x in date_list])


 def get_calendar_list_by_ratio(
@@ -155,7 +158,7 @@ def get_calendar_list_by_ratio(
                if date_list:
                    all_oldest_list.append(date_list[0])
                for date in date_list:
-                    if date not in _dict_count_trade.keys():
+                    if date not in _dict_count_trade:
                        _dict_count_trade[date] = 0

                    _dict_count_trade[date] += 1
@@ -163,7 +166,7 @@ def get_calendar_list_by_ratio(
                p_bar.update()

    logger.info(f"count how many funds have founded in this day......")
-    _dict_count_founding = {date: _number_all_funds for date in _dict_count_trade.keys()}  # dict{date:count}
+    _dict_count_founding = {date: _number_all_funds for date in _dict_count_trade}  # dict{date:count}
    with tqdm(total=_number_all_funds) as p_bar:
        for oldest_date in all_oldest_list:
            for date in _dict_count_founding.keys():
@@ -171,9 +174,7 @@ def get_calendar_list_by_ratio(
                    _dict_count_founding[date] -= 1

    calendar = [
-        date
-        for date in _dict_count_trade
-        if _dict_count_trade[date] >= max(int(_dict_count_founding[date] * threshold), minimum_count)
+        date for date, count in _dict_count_trade.items() if count >= max(int(_dict_count_founding[date] * threshold), minimum_count)
    ]

    return calendar
@@ -186,16 +187,16 @@ def get_hs_stock_symbols() -> list:
    -------
        stock symbols
    """
-    global _HS_SYMBOLS
+    global _HS_SYMBOLS  # pylint: disable=W0603

    def _get_symbol():
        _res = set()
        for _k, _v in (("ha", "ss"), ("sa", "sz"), ("gem", "sz")):
-            resp = requests.get(HS_SYMBOLS_URL.format(s_type=_k))
+            resp = requests.get(HS_SYMBOLS_URL.format(s_type=_k), timeout=None)
            _res |= set(
                map(
-                    lambda x: "{}.{}".format(re.findall(r"\d+", x)[0], _v),
-                    etree.HTML(resp.text).xpath("//div[@class='result']/ul//li/a/text()"),
+                    lambda x: "{}.{}".format(re.findall(r"\d+", x)[0], _v),  # pylint: disable=W0640
+                    etree.HTML(resp.text).xpath("//div[@class='result']/ul//li/a/text()"),  # pylint: disable=I1101
                )
            )
            time.sleep(3)
@@ -230,12 +231,12 @@ def get_us_stock_symbols(qlib_data_path: [str, Path] = None) -> list:
    -------
        stock symbols
    """
-    global _US_SYMBOLS
+    global _US_SYMBOLS  # pylint: disable=W0603

    @deco_retry
    def _get_eastmoney():
        url = "http://4.push2.eastmoney.com/api/qt/clist/get?pn=1&pz=10000&fs=m:105,m:106,m:107&fields=f12"
-        resp = requests.get(url)
+        resp = requests.get(url, timeout=None)
        if resp.status_code != 200:
            raise ValueError("request error")

@@ -277,7 +278,7 @@ def get_us_stock_symbols(qlib_data_path: [str, Path] = None) -> list:
            "maxResultsPerPage": 10000,
            "filterToken": "",
        }
-        resp = requests.post(url, json=_parms)
+        resp = requests.post(url, json=_parms, timeout=None)
        if resp.status_code != 200:
            raise ValueError("request error")

@@ -317,7 +318,7 @@ def get_in_stock_symbols(qlib_data_path: [str, Path] = None) -> list:
    -------
        stock symbols
    """
-    global _IN_SYMBOLS
+    global _IN_SYMBOLS  # pylint: disable=W0603

    @deco_retry
    def _get_nifty():
@@ -358,7 +359,7 @@ def get_br_stock_symbols(qlib_data_path: [str, Path] = None) -> list:
    -------
        B3 stock symbols
    """
-    global _BR_SYMBOLS
+    global _BR_SYMBOLS  # pylint: disable=W0603

    @deco_retry
    def _get_ibovespa():
@@ -367,7 +368,7 @@ def get_br_stock_symbols(qlib_data_path: [str, Path] = None) -> list:

        # Request
        agent = {"User-Agent": "Mozilla/5.0"}
-        page = requests.get(url, headers=agent)
+        page = requests.get(url, headers=agent, timeout=None)

        # BeautifulSoup
        soup = BeautifulSoup(page.content, "html.parser")
@@ -375,7 +376,7 @@ def get_br_stock_symbols(qlib_data_path: [str, Path] = None) -> list:

        children = tbody.findChildren("a", recursive=True)
        for child in children:
-            _symbols.append(str(child).split('"')[-1].split(">")[1].split("<")[0])
+            _symbols.append(str(child).rsplit('"', maxsplit=1)[-1].split(">")[1].split("<")[0])

        return _symbols

@@ -409,12 +410,12 @@ def get_en_fund_symbols(qlib_data_path: [str, Path] = None) -> list:
    -------
        fund symbols in China
    """
-    global _EN_FUND_SYMBOLS
+    global _EN_FUND_SYMBOLS  # pylint: disable=W0603

    @deco_retry
    def _get_eastmoney():
        url = "http://fund.eastmoney.com/js/fundcode_search.js"
-        resp = requests.get(url)
+        resp = requests.get(url, timeout=None)
        if resp.status_code != 200:
            raise ValueError("request error")
        try:
@@ -605,5 +606,177 @@ def get_instruments(
    getattr(obj, method)()


+def _get_all_1d_data(_date_field_name: str, _symbol_field_name: str, _1d_data_all: pd.DataFrame):
+    df = copy.deepcopy(_1d_data_all)
+    df.reset_index(inplace=True)
+    df.rename(columns={"datetime": _date_field_name, "instrument": _symbol_field_name}, inplace=True)
+    df.columns = list(map(lambda x: x[1:] if x.startswith("$") else x, df.columns))
+    return df
+
+
+def get_1d_data(
+    _date_field_name: str,
+    _symbol_field_name: str,
+    symbol: str,
+    start: str,
+    end: str,
+    _1d_data_all: pd.DataFrame,
+) -> pd.DataFrame:
+    """get 1d data
+
+    Returns
+    ------
+        data_1d: pd.DataFrame
+            data_1d.columns = [_date_field_name, _symbol_field_name, "paused", "volume", "factor", "close"]
+
+    """
+    _all_1d_data = _get_all_1d_data(_date_field_name, _symbol_field_name, _1d_data_all)
+    return _all_1d_data[
+        (_all_1d_data[_symbol_field_name] == symbol.upper())
+        & (_all_1d_data[_date_field_name] >= pd.Timestamp(start))
+        & (_all_1d_data[_date_field_name] < pd.Timestamp(end))
+    ]
+
+
+def calc_adjusted_price(
+    df: pd.DataFrame,
+    _1d_data_all: pd.DataFrame,
+    _date_field_name: str,
+    _symbol_field_name: str,
+    frequence: str,
+    consistent_1d: bool = True,
+    calc_paused: bool = True,
+) -> pd.DataFrame:
+    """calc adjusted price
+    This method does 4 things.
+    1. Adds the `paused` field.
+        - The added paused field comes from the paused field of the 1d data.
+    2. Aligns the time of the 1d data.
+    3. Reweights the data.
+        - The reweighting method:
+            - volume / factor
+            - open * factor
+            - high * factor
+            - low * factor
+            - close * factor
+    4. Calls the `calc_paused_num` method to add the `paused_num` field.
+        - `paused_num` is the number of consecutive trading days with data since the last suspension; it is 0 on suspended days.
+    """
+    # TODO: using daily data factor
+    if df.empty:
+        return df
+    df = df.copy()
+    df.drop_duplicates(subset=_date_field_name, inplace=True)
+    df.sort_values(_date_field_name, inplace=True)
+    symbol = df.iloc[0][_symbol_field_name]
+    df[_date_field_name] = pd.to_datetime(df[_date_field_name])
+    # get 1d data from qlib
+    _start = pd.Timestamp(df[_date_field_name].min()).strftime("%Y-%m-%d")
+    _end = (pd.Timestamp(df[_date_field_name].max()) + pd.Timedelta(days=1)).strftime("%Y-%m-%d")
+    data_1d: pd.DataFrame = get_1d_data(_date_field_name, _symbol_field_name, symbol, _start, _end, _1d_data_all)
+    data_1d = data_1d.copy()
+    if data_1d is None or data_1d.empty:
+        df["factor"] = 1 / df.loc[df["close"].first_valid_index()]["close"]
+        # TODO: np.nan or 1 or 0
+        df["paused"] = np.nan
+    else:
+        # NOTE: paused = 1 when volume is np.nan or volume <= 0
+        # FIXME: find a more accurate data source
+        data_1d["paused"] = 0
+        data_1d.loc[(data_1d["volume"].isna()) | (data_1d["volume"] <= 0), "paused"] = 1
+        data_1d = data_1d.set_index(_date_field_name)
+
+        # add factor from 1d data
+        # NOTE: 1d data info:
+        #   - Close price adjusted for splits. Adjusted close price adjusted for both dividends and splits.
+        #   - data_1d.adjclose: Adjusted close price adjusted for both dividends and splits.
+        #   - data_1d.close: `data_1d.adjclose / (close for the first trading day that is not np.nan)`
+        def _calc_factor(df_1d: pd.DataFrame):
+            try:
+                _date = pd.Timestamp(pd.Timestamp(df_1d[_date_field_name].iloc[0]).date())
+                df_1d["factor"] = data_1d.loc[_date]["close"] / df_1d.loc[df_1d["close"].last_valid_index()]["close"]
+                df_1d["paused"] = data_1d.loc[_date]["paused"]
+            except Exception:
+                df_1d["factor"] = np.nan
+                df_1d["paused"] = np.nan
+            return df_1d
+
+        df = df.groupby([df[_date_field_name].dt.date], group_keys=False).apply(_calc_factor)
+        if consistent_1d:
+            # the date sequence is consistent with 1d
+            df.set_index(_date_field_name, inplace=True)
+            df = df.reindex(
+                generate_minutes_calendar_from_daily(
+                    calendars=pd.to_datetime(data_1d.reset_index()[_date_field_name].drop_duplicates()),
+                    freq=frequence,
+                    am_range=("09:30:00", "11:29:00"),
+                    pm_range=("13:00:00", "14:59:00"),
+                )
+            )
+            df[_symbol_field_name] = df.loc[df[_symbol_field_name].first_valid_index()][_symbol_field_name]
+            df.index.names = [_date_field_name]
+            df.reset_index(inplace=True)
+    for _col in ["open", "close", "high", "low", "volume"]:
+        if _col not in df.columns:
+            continue
+        if _col == "volume":
+            df[_col] = df[_col] / df["factor"]
+        else:
+            df[_col] = df[_col] * df["factor"]
+    if calc_paused:
+        df = calc_paused_num(df, _date_field_name, _symbol_field_name)
+    return df
+
+
+def calc_paused_num(df: pd.DataFrame, _date_field_name, _symbol_field_name):
+    """calc paused num
+    This method adds the `paused_num` field.
+        - `paused_num` is the number of consecutive trading days with data since the last suspension; it is 0 on suspended days.
+    """
+    _symbol = df.iloc[0][_symbol_field_name]
+    df = df.copy()
+    df["_tmp_date"] = df[_date_field_name].apply(lambda x: pd.Timestamp(x).date())
+    # drop leading and trailing days on which all checked fields are `np.nan` (or volume is 0) for the whole day
+    all_data = []
+    # number of consecutive all-nan days seen most recently; used to trim such days from the end
+    all_nan_nums = 0
+    # number of consecutive trading days that are not all-nan
+    not_nan_nums = 0
+    for _date, _df in df.groupby("_tmp_date"):
+        _df["paused"] = 0
+        if not _df.loc[_df["volume"] < 0].empty:
+            logger.warning(f"volume < 0, will fill np.nan: {_date} {_symbol}")
+            _df.loc[_df["volume"] < 0, "volume"] = np.nan
+
+        check_fields = set(_df.columns) - {
+            "_tmp_date",
+            "paused",
+            "factor",
+            _date_field_name,
+            _symbol_field_name,
+        }
+        if _df.loc[:, list(check_fields)].isna().values.all() or (_df["volume"] == 0).all():
+            all_nan_nums += 1
+            not_nan_nums = 0
+            _df["paused"] = 1
+            if all_data:
+                _df["paused_num"] = not_nan_nums
+                all_data.append(_df)
+        else:
+            all_nan_nums = 0
+            not_nan_nums += 1
+            _df["paused_num"] = not_nan_nums
+            all_data.append(_df)
+    all_data = all_data[: len(all_data) - all_nan_nums]
+    if all_data:
+        df = pd.concat(all_data, sort=False)
+    else:
+        logger.warning(f"data is empty: {_symbol}")
+        df = pd.DataFrame()
+        return df
+    del df["_tmp_date"]
+    return df
+
+
 if __name__ == "__main__":
    assert len(get_hs_stock_symbols()) >= MINIMUM_SYMBOLS_NUM
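
The reweighting step in `calc_adjusted_price` can be sketched without pandas. This is a minimal illustration for a single trading day, not the code from this commit: `last_valid` and `reweight_day` are hypothetical helpers, and the bars are made-up numbers. The factor is the daily close divided by the last valid intraday close; prices are multiplied by it and volume is divided by it.

```python
from math import isnan


def last_valid(values):
    """Return the last value that is not NaN, or None if there is none."""
    for v in reversed(values):
        if not isnan(v):
            return v
    return None


def reweight_day(bars, daily_close):
    """Rescale one day's intraday bars against the daily close.

    bars: list of dicts with "open", "high", "low", "close", "volume".
    daily_close: close of the same day taken from the 1d data.
    """
    ref = last_valid([b["close"] for b in bars])
    # no valid intraday close: the factor (and every rescaled value) becomes NaN
    factor = float("nan") if ref is None else daily_close / ref
    out = []
    for b in bars:
        row = {k: b[k] * factor for k in ("open", "high", "low", "close")}
        row["volume"] = b["volume"] / factor
        row["factor"] = factor
        out.append(row)
    return out


# hypothetical 5min bars for one day
bars = [
    {"open": 10.0, "high": 10.4, "low": 9.9, "close": 10.2, "volume": 1000.0},
    {"open": 10.2, "high": 10.6, "low": 10.1, "close": 10.0, "volume": 2000.0},
]
adjusted = reweight_day(bars, daily_close=5.0)
```

After rescaling, the last intraday close equals the daily close, which is what keeps the high-frequency series consistent with the 1d data.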
--- a/scripts/data_collector/yahoo/README.md
+++ b/scripts/data_collector/yahoo/README.md
@@ -121,7 +121,7 @@ pip install -r requirements.txt
        
                qlib_data_1d can be obtained like this:
                    $ python scripts/get_data.py qlib_data --target_dir <qlib_data_1d_dir> --interval 1d
-                    $ python scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir <qlib_data_1d_dir> --trading_date 2021-06-01
+                    $ python scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir <qlib_data_1d_dir> --end_date <end_date>
                or:
                    download 1d data from YahooFinance
            
@@ -180,9 +180,8 @@ pip install -r requirements.txt

  * Manual update of data
      ```
-      python scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir <user data dir> --trading_date <start date> --end_date <end date>
+      python scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir <user data dir> --end_date <end date>
      ```
-      * `trading_date`: start of trading day
      * `end_date`: end of trading day(not included)
      * `check_data_length`: check the number of rows per *symbol*, by default `None`
        > if `len(symbol_df) < check_data_length`, it will be re-fetched, with the number of re-fetches coming from the `max_collector_count` parameter
@@ -191,10 +190,10 @@ pip install -r requirements.txt
      * `source_dir`: The directory where the raw data collected from the Internet is saved, default "Path(__file__).parent/source"
      * `normalize_dir`: Directory for normalize data, default "Path(__file__).parent/normalize"
      * `qlib_data_1d_dir`: the qlib data to be updated for yahoo, usually from: [download qlib data](https://github.com/microsoft/qlib/tree/main/scripts#download-cn-data)
-      * `trading_date`: trading days to be updated, by default ``datetime.datetime.now().strftime("%Y-%m-%d")``
      * `end_date`: end datetime, default ``pd.Timestamp(trading_date + pd.Timedelta(days=1))``; open interval(excluding end)
      * `region`: region, value from ["CN", "US"], default "CN"
-
+      * `interval`: data interval, default "1d" (currently only 1d data is supported)
+      * `exists_skip`: if the target directory already contains data, skip `get_data`, by default False

 ## Using qlib data

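
With `trading_date` removed from the command line, the collector change further down derives the start of the update window from the last entry of `calendars/day.txt`. A minimal sketch of that derivation, using only the standard library instead of the pandas calls in the commit; `default_update_window` is a hypothetical name and the calendar entries are made up:

```python
from datetime import date, timedelta


def default_update_window(calendar_days, end_date=None):
    """Return (trading_date, end_date) as ISO date strings.

    calendar_days: ISO date strings, one per line of calendars/day.txt.
    end_date: optional ISO date string; the end of the window is exclusive.
    """
    last_day = date.fromisoformat(calendar_days[-1].strip())
    # start one calendar day before the last stored day
    trading_date = last_day - timedelta(days=1)
    if end_date is None:
        # default end is the day after trading_date
        end_date = (trading_date + timedelta(days=1)).isoformat()
    return trading_date.isoformat(), end_date


# hypothetical calendar content
calendar_days = ["2021-05-28", "2021-05-31", "2021-06-01"]
window = default_update_window(calendar_days)
explicit = default_update_window(calendar_days, end_date="2021-06-10")
```

Because the end is exclusive, passing an explicit `--end_date` is usually what a manual update needs; the default only covers the single day before the last stored calendar day.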
--- a/scripts/data_collector/yahoo/collector.py
+++ b/scripts/data_collector/yahoo/collector.py
@@ -2,7 +2,6 @@
 # Licensed under the MIT License.

 import abc
-from re import I
 import sys
 import copy
 import time
@@ -21,6 +20,8 @@ from loguru import logger
 from yahooquery import Ticker
 from dateutil.tz import tzlocal

+import qlib
+from qlib.data import D
 from qlib.tests.data import GetData
 from qlib.utils import code_to_fname, fname_to_code, exists_qlib_data
 from qlib.constant import REG_CN as REGION_CN
@@ -38,6 +39,7 @@ from data_collector.utils import (
    get_in_stock_symbols,
    get_br_stock_symbols,
    generate_minutes_calendar_from_daily,
+    calc_adjusted_price,
 )

 INDEX_BENCH_URL = "http://push2his.eastmoney.com/api/qt/stock/kline/get?secid=1.{index_code}&fields1=f1%2Cf2%2Cf3%2Cf4%2Cf5&fields2=f51%2Cf52%2Cf53%2Cf54%2Cf55%2Cf56%2Cf57%2Cf58&klt=101&fqt=0&beg={begin}&end={end}"
@@ -229,9 +231,9 @@ class YahooCollectorCN1d(YahooCollectorCN):
                df = pd.DataFrame(
                    map(
                        lambda x: x.split(","),
-                        requests.get(INDEX_BENCH_URL.format(index_code=_index_code, begin=_begin, end=_end)).json()[
-                            "data"
-                        ]["klines"],
+                        requests.get(
+                            INDEX_BENCH_URL.format(index_code=_index_code, begin=_begin, end=_end), timeout=None
+                        ).json()["data"]["klines"],
                    )
                )
            except Exception as e:
@@ -316,7 +318,7 @@ class YahooCollectorIN1min(YahooCollectorIN):


 class YahooCollectorBR(YahooCollector, ABC):
-    def retry(cls):
+    def retry(cls):  # pylint: disable=E0213
        """
        The reason to use retry=2 is due to the fact that
        Yahoo Finance unfortunately does not keep track of some
@@ -356,12 +358,10 @@ class YahooCollectorBR(YahooCollector, ABC):

 class YahooCollectorBR1d(YahooCollectorBR):
    retry = 2
-    pass


 class YahooCollectorBR1min(YahooCollectorBR):
    retry = 2
-    pass


 class YahooNormalize(BaseNormalize):
@@ -393,6 +393,7 @@ class YahooNormalize(BaseNormalize):
        df = df.copy()
        df.set_index(date_field_name, inplace=True)
        df.index = pd.to_datetime(df.index)
+        df.index = df.index.tz_localize(None)
        df = df[~df.index.duplicated(keep="first")]
        if calendar_list is not None:
            df = df.reindex(
@@ -522,78 +523,39 @@ class YahooNormalize1dExtend(YahooNormalize1d):
            symbol field name, default is symbol
        """
        super(YahooNormalize1dExtend, self).__init__(date_field_name, symbol_field_name)
-        self._first_close_field = "first_close"
-        self._ori_close_field = "ori_close"
+        self.column_list = ["open", "high", "low", "close", "volume", "factor", "change"]
        self.old_qlib_data = self._get_old_data(old_qlib_data_dir)

    def _get_old_data(self, qlib_data_dir: [str, Path]):
-        import qlib
-        from qlib.data import D
-
        qlib_data_dir = str(Path(qlib_data_dir).expanduser().resolve())
        qlib.init(provider_uri=qlib_data_dir, expression_cache=None, dataset_cache=None)
-        df = D.features(D.instruments("all"), ["$close/$factor", "$adjclose/$close"])
-        df.columns = [self._ori_close_field, self._first_close_field]
+        df = D.features(D.instruments("all"), ["$" + col for col in self.column_list])
+        df.columns = self.column_list
        return df

-    def _get_close(self, df: pd.DataFrame, field_name: str):
-        _symbol = df.loc[df[self._symbol_field_name].first_valid_index()][self._symbol_field_name].upper()
-        _df = self.old_qlib_data.loc(axis=0)[_symbol]
-        _close = _df.loc[_df.last_valid_index()][field_name]
-        return _close
-
-    def _get_first_close(self, df: pd.DataFrame) -> float:
-        try:
-            _close = self._get_close(df, field_name=self._first_close_field)
-        except KeyError:
-            _close = super(YahooNormalize1dExtend, self)._get_first_close(df)
-        return _close
-
-    def _get_last_close(self, df: pd.DataFrame) -> float:
-        try:
-            _close = self._get_close(df, field_name=self._ori_close_field)
-        except KeyError:
-            _close = None
-        return _close
-
-    def _get_last_date(self, df: pd.DataFrame) -> pd.Timestamp:
-        _symbol = df.loc[df[self._symbol_field_name].first_valid_index()][self._symbol_field_name].upper()
-        try:
-            _df = self.old_qlib_data.loc(axis=0)[_symbol]
-            _date = _df.index.max()
-        except KeyError:
-            _date = None
-        return _date
-
    def normalize(self, df: pd.DataFrame) -> pd.DataFrame:
-        _last_close = self._get_last_close(df)
-        # reindex
-        _last_date = self._get_last_date(df)
-        if _last_date is not None:
-            df = df.set_index(self._date_field_name)
-            df.index = pd.to_datetime(df.index)
-            df = df[~df.index.duplicated(keep="first")]
-            _max_date = df.index.max()
-            df = df.reindex(self._calendar_list).loc[:_max_date].reset_index()
-            df = df[df[self._date_field_name] > _last_date]
-            if df.empty:
-                return pd.DataFrame()
-            _si = df["close"].first_valid_index()
-            if _si > df.index[0]:
-                logger.warning(
-                    f"{df.loc[_si][self._symbol_field_name]} missing data: {df.loc[:_si - 1][self._date_field_name].to_list()}"
-                )
-        # normalize
-        df = self.normalize_yahoo(
-            df, self._calendar_list, self._date_field_name, self._symbol_field_name, last_close=_last_close
-        )
-        # adjusted price
-        df = self.adjusted_price(df)
-        df = self._manual_adj_data(df)
-        return df
+        df = super(YahooNormalize1dExtend, self).normalize(df)
+        df.set_index(self._date_field_name, inplace=True)
+        symbol_name = df[self._symbol_field_name].iloc[0]
+        old_symbol_list = self.old_qlib_data.index.get_level_values("instrument").unique().to_list()
+        if str(symbol_name).upper() not in old_symbol_list:
+            return df.reset_index()
+        old_df = self.old_qlib_data.loc[str(symbol_name).upper()]
+        latest_date = old_df.index[-1]
+        df = df.loc[latest_date:]
+        new_latest_data = df.iloc[0]
+        old_latest_data = old_df.loc[latest_date]
+        for col in self.column_list[:-1]:
+            if col == "volume":
+                df[col] = df[col] / (new_latest_data[col] / old_latest_data[col])
+            else:
+                df[col] = df[col] * (old_latest_data[col] / new_latest_data[col])
+        return df.drop(df.index[0]).reset_index()


 class YahooNormalize1min(YahooNormalize, ABC):
+    """Normalised to 1min using local 1d data"""
+
    AM_RANGE = None  # type: tuple  # eg: ("09:30:00", "11:29:00")
    PM_RANGE = None  # type: tuple  # eg: ("13:00:00", "14:59:00")

@@ -601,160 +563,6 @@ class YahooNormalize1min(YahooNormalize, ABC):
    CONSISTENT_1d = True
    CALC_PAUSED_NUM = True

-    @property
-    def calendar_list_1d(self):
-        calendar_list_1d = getattr(self, "_calendar_list_1d", None)
-        if calendar_list_1d is None:
-            calendar_list_1d = self._get_1d_calendar_list()
-            setattr(self, "_calendar_list_1d", calendar_list_1d)
-        return calendar_list_1d
-
-    def generate_1min_from_daily(self, calendars: Iterable) -> pd.Index:
-        return generate_minutes_calendar_from_daily(
-            calendars, freq="1min", am_range=self.AM_RANGE, pm_range=self.PM_RANGE
-        )
-
-    def get_1d_data(self, symbol: str, start: str, end: str) -> pd.DataFrame:
-        """get 1d data
-
-        Returns
-        ------
-            data_1d: pd.DataFrame
-                data_1d.columns = [self._date_field_name, self._symbol_field_name, "paused", "volume", "factor", "close"]
-
-        """
-        data_1d = YahooCollector.get_data_from_remote(self.symbol_to_yahoo(symbol), interval="1d", start=start, end=end)
-        if not (data_1d is None or data_1d.empty):
-            _class_name = self.__class__.__name__.replace("min", "d")
-            _class: type(YahooNormalize) = getattr(importlib.import_module("collector"), _class_name)
-            data_1d_obj = _class(self._date_field_name, self._symbol_field_name)
-            data_1d = data_1d_obj.normalize(data_1d)
-        return data_1d
-
-    def adjusted_price(self, df: pd.DataFrame) -> pd.DataFrame:
-        # TODO: using daily data factor
-        if df.empty:
-            return df
-        df = df.copy()
-        df = df.sort_values(self._date_field_name)
-        symbol = df.iloc[0][self._symbol_field_name]
-        # get 1d data from yahoo
-        _start = pd.Timestamp(df[self._date_field_name].min()).strftime(self.DAILY_FORMAT)
-        _end = (pd.Timestamp(df[self._date_field_name].max()) + pd.Timedelta(days=1)).strftime(self.DAILY_FORMAT)
-        data_1d: pd.DataFrame = self.get_1d_data(symbol, _start, _end)
-        data_1d = data_1d.copy()
-        if data_1d is None or data_1d.empty:
-            df["factor"] = 1 / df.loc[df["close"].first_valid_index()]["close"]
-            # TODO: np.nan or 1 or 0
-            df["paused"] = np.nan
-        else:
-            # NOTE: volume is np.nan or volume <= 0, paused = 1
-            # FIXME: find a more accurate data source
-            data_1d["paused"] = 0
-            data_1d.loc[(data_1d["volume"].isna()) | (data_1d["volume"] <= 0), "paused"] = 1
-            data_1d = data_1d.set_index(self._date_field_name)
-
-            # add factor from 1d data
-            # NOTE: yahoo 1d data info:
-            #   - Close price adjusted for splits. Adjusted close price adjusted for both dividends and splits.
-            #   - data_1d.adjclose: Adjusted close price adjusted for both dividends and splits.
-            #   - data_1d.close: `data_1d.adjclose / (close for the first trading day that is not np.nan)`
-            def _calc_factor(df_1d: pd.DataFrame):
-                try:
-                    _date = pd.Timestamp(pd.Timestamp(df_1d[self._date_field_name].iloc[0]).date())
-                    df_1d["factor"] = (
-                        data_1d.loc[_date]["close"] / df_1d.loc[df_1d["close"].last_valid_index()]["close"]
-                    )
-                    df_1d["paused"] = data_1d.loc[_date]["paused"]
-                except Exception:
-                    df_1d["factor"] = np.nan
-                    df_1d["paused"] = np.nan
-                return df_1d
-
-            df = df.groupby([df[self._date_field_name].dt.date]).apply(_calc_factor)
-
-            if self.CONSISTENT_1d:
-                # the date sequence is consistent with 1d
-                df.set_index(self._date_field_name, inplace=True)
-                df = df.reindex(
-                    self.generate_1min_from_daily(
-                        pd.to_datetime(data_1d.reset_index()[self._date_field_name].drop_duplicates())
-                    )
-                )
-                df[self._symbol_field_name] = df.loc[df[self._symbol_field_name].first_valid_index()][
-                    self._symbol_field_name
-                ]
-                df.index.names = [self._date_field_name]
-                df.reset_index(inplace=True)
-        for _col in self.COLUMNS:
-            if _col not in df.columns:
-                continue
-            if _col == "volume":
-                df[_col] = df[_col] / df["factor"]
-            else:
-                df[_col] = df[_col] * df["factor"]
-
-        if self.CALC_PAUSED_NUM:
-            df = self.calc_paused_num(df)
-        return df
-
-    def calc_paused_num(self, df: pd.DataFrame):
-        _symbol = df.iloc[0][self._symbol_field_name]
-        df = df.copy()
-        df["_tmp_date"] = df[self._date_field_name].apply(lambda x: pd.Timestamp(x).date())
-        # remove data that starts and ends with `np.nan` all day
-        all_data = []
-        # Record the number of consecutive trading days where the whole day is nan, to remove the last trading day where the whole day is nan
-        all_nan_nums = 0
-        # Record the number of consecutive occurrences of trading days that are not nan throughout the day
-        not_nan_nums = 0
-        for _date, _df in df.groupby("_tmp_date"):
-            _df["paused"] = 0
-            if not _df.loc[_df["volume"] < 0].empty:
-                logger.warning(f"volume < 0, will fill np.nan: {_date} {_symbol}")
-                _df.loc[_df["volume"] < 0, "volume"] = np.nan
-
-            check_fields = set(_df.columns) - {
-                "_tmp_date",
-                "paused",
-                "factor",
-                self._date_field_name,
-                self._symbol_field_name,
-            }
-            if _df.loc[:, check_fields].isna().values.all() or (_df["volume"] == 0).all():
-                all_nan_nums += 1
-                not_nan_nums = 0
-                _df["paused"] = 1
-                if all_data:
-                    _df["paused_num"] = not_nan_nums
-                    all_data.append(_df)
-            else:
-                all_nan_nums = 0
-                not_nan_nums += 1
-                _df["paused_num"] = not_nan_nums
-                all_data.append(_df)
-        all_data = all_data[: len(all_data) - all_nan_nums]
-        if all_data:
-            df = pd.concat(all_data, sort=False)
-        else:
-            logger.warning(f"data is empty: {_symbol}")
-            df = pd.DataFrame()
-            return df
-        del df["_tmp_date"]
-        return df
-
-    @abc.abstractmethod
-    def symbol_to_yahoo(self, symbol):
-        raise NotImplementedError("rewrite symbol_to_yahoo")
-
-    @abc.abstractmethod
-    def _get_1d_calendar_list(self) -> Iterable[pd.Timestamp]:
-        raise NotImplementedError("rewrite _get_1d_calendar_list")
-
-
-class YahooNormalize1minOffline(YahooNormalize1min):
-    """Normalised to 1min using local 1d data"""
-
    def __init__(
        self, qlib_data_1d_dir: [str, Path], date_field_name: str = "date", symbol_field_name: str = "symbol", **kwargs
    ):
@@ -769,42 +577,45 @@ class YahooNormalize1minOffline(YahooNormalize1min):
        symbol_field_name: str
            symbol field name, default is symbol
        """
-        self.qlib_data_1d_dir = qlib_data_1d_dir
-        super(YahooNormalize1minOffline, self).__init__(date_field_name, symbol_field_name)
-        self._all_1d_data = self._get_all_1d_data()
+        super(YahooNormalize1min, self).__init__(date_field_name, symbol_field_name)
+        qlib.init(provider_uri=qlib_data_1d_dir)
+        self.all_1d_data = D.features(D.instruments("all"), ["$paused", "$volume", "$factor", "$close"], freq="day")

    def _get_1d_calendar_list(self) -> Iterable[pd.Timestamp]:
-        import qlib
-        from qlib.data import D
-
-        qlib.init(provider_uri=self.qlib_data_1d_dir)
        return list(D.calendar(freq="day"))

-    def _get_all_1d_data(self):
-        import qlib
-        from qlib.data import D
+    @property
+    def calendar_list_1d(self):
+        calendar_list_1d = getattr(self, "_calendar_list_1d", None)
+        if calendar_list_1d is None:
+            calendar_list_1d = self._get_1d_calendar_list()
+            setattr(self, "_calendar_list_1d", calendar_list_1d)
+        return calendar_list_1d

-        qlib.init(provider_uri=self.qlib_data_1d_dir)
-        df = D.features(D.instruments("all"), ["$paused", "$volume", "$factor", "$close"], freq="day")
-        df.reset_index(inplace=True)
-        df.rename(columns={"datetime": self._date_field_name, "instrument": self._symbol_field_name}, inplace=True)
-        df.columns = list(map(lambda x: x[1:] if x.startswith("$") else x, df.columns))
+    def generate_1min_from_daily(self, calendars: Iterable) -> pd.Index:
+        return generate_minutes_calendar_from_daily(
+            calendars, freq="1min", am_range=self.AM_RANGE, pm_range=self.PM_RANGE
+        )
+
+    def adjusted_price(self, df: pd.DataFrame) -> pd.DataFrame:
+        df = calc_adjusted_price(
+            df=df,
+            _date_field_name=self._date_field_name,
+            _symbol_field_name=self._symbol_field_name,
+            frequence="1min",
+            consistent_1d=self.CONSISTENT_1d,
+            calc_paused=self.CALC_PAUSED_NUM,
+            _1d_data_all=self.all_1d_data,
+        )
        return df

-    def get_1d_data(self, symbol: str, start: str, end: str) -> pd.DataFrame:
-        """get 1d data
+    @abc.abstractmethod
+    def symbol_to_yahoo(self, symbol):
+        raise NotImplementedError("rewrite symbol_to_yahoo")

-        Returns
-        ------
-            data_1d: pd.DataFrame
-                data_1d.columns = [self._date_field_name, self._symbol_field_name, "paused", "volume", "factor", "close"]
-
-        """
-        return self._all_1d_data[
-            (self._all_1d_data[self._symbol_field_name] == symbol.upper())
-            & (self._all_1d_data[self._date_field_name] >= pd.Timestamp(start))
-            & (self._all_1d_data[self._date_field_name] < pd.Timestamp(end))
-        ]
+    @abc.abstractmethod
+    def _get_1d_calendar_list(self) -> Iterable[pd.Timestamp]:
+        raise NotImplementedError("rewrite _get_1d_calendar_list")


 class YahooNormalizeUS:
@@ -821,7 +632,7 @@ class YahooNormalizeUS1dExtend(YahooNormalizeUS, YahooNormalize1dExtend):
    pass


-class YahooNormalizeUS1min(YahooNormalizeUS, YahooNormalize1minOffline):
+class YahooNormalizeUS1min(YahooNormalizeUS, YahooNormalize1min):
    CALC_PAUSED_NUM = False

    def _get_calendar_list(self) -> Iterable[pd.Timestamp]:
@@ -844,7 +655,7 @@ class YahooNormalizeIN1d(YahooNormalizeIN, YahooNormalize1d):
    pass


-class YahooNormalizeIN1min(YahooNormalizeIN, YahooNormalize1minOffline):
+class YahooNormalizeIN1min(YahooNormalizeIN, YahooNormalize1min):
    CALC_PAUSED_NUM = False

    def _get_calendar_list(self) -> Iterable[pd.Timestamp]:
@@ -872,7 +683,7 @@ class YahooNormalizeCN1dExtend(YahooNormalizeCN, YahooNormalize1dExtend):
    pass


-class YahooNormalizeCN1min(YahooNormalizeCN, YahooNormalize1minOffline):
+class YahooNormalizeCN1min(YahooNormalizeCN, YahooNormalize1min):
    AM_RANGE = ("09:30:00", "11:29:00")
    PM_RANGE = ("13:00:00", "14:59:00")

@@ -899,7 +710,7 @@ class YahooNormalizeBR1d(YahooNormalizeBR, YahooNormalize1d):
    pass


-class YahooNormalizeBR1min(YahooNormalizeBR, YahooNormalize1minOffline):
+class YahooNormalizeBR1min(YahooNormalizeBR, YahooNormalize1min):
    CALC_PAUSED_NUM = False

    def _get_calendar_list(self) -> Iterable[pd.Timestamp]:
@@ -1123,10 +934,10 @@ class Run(BaseRun):
    def update_data_to_bin(
        self,
        qlib_data_1d_dir: str,
-        trading_date: str = None,
        end_date: str = None,
        check_data_length: int = None,
        delay: float = 1,
+        exists_skip: bool = False,
    ):
        """update yahoo data to bin

@@ -1135,14 +946,14 @@ class Run(BaseRun):
        qlib_data_1d_dir: str
            the qlib data to be updated for yahoo, usually from: https://github.com/microsoft/qlib/tree/main/scripts#download-cn-data

-        trading_date: str
-            trading days to be updated, by default ``datetime.datetime.now().strftime("%Y-%m-%d")``
        end_date: str
            end datetime, default ``pd.Timestamp(trading_date + pd.Timedelta(days=1))``; open interval(excluding end)
        check_data_length: int
            check data length, if not None and greater than 0, each symbol will be considered complete if its data length is greater than or equal to this value, otherwise it will be fetched again, the maximum number of fetches being (max_collector_count). By default None.
        delay: float
            time.sleep(delay), default 1
+        exists_skip: bool
+            if the target directory already contains data, skip `get_data`; by default False
        Notes
        -----
            If the data in qlib_data_dir is incomplete, np.nan will be populated to trading_date for the previous trading day
@@ -1150,24 +961,24 @@ class Run(BaseRun):
        Examples
        -------
            $ python collector.py update_data_to_bin --qlib_data_1d_dir <user data dir> --trading_date <start date> --end_date <end date>
-            # get 1m data
        """

        if self.interval.lower() != "1d":
            logger.warning(f"currently supports 1d data updates: --interval 1d")

-        # start/end date
-        if trading_date is None:
-            trading_date = datetime.datetime.now().strftime("%Y-%m-%d")
-            logger.warning(f"trading_date is None, use the current date: {trading_date}")
-
-        if end_date is None:
-            end_date = (pd.Timestamp(trading_date) + pd.Timedelta(days=1)).strftime("%Y-%m-%d")
-
        # download qlib 1d data
        qlib_data_1d_dir = str(Path(qlib_data_1d_dir).expanduser().resolve())
        if not exists_qlib_data(qlib_data_1d_dir):
-            GetData().qlib_data(target_dir=qlib_data_1d_dir, interval=self.interval, region=self.region)
+            GetData().qlib_data(
+                target_dir=qlib_data_1d_dir, interval=self.interval, region=self.region, exists_skip=exists_skip
+            )
+
+        # start/end date
+        calendar_df = pd.read_csv(Path(qlib_data_1d_dir).joinpath("calendars/day.txt"))
+        trading_date = (pd.Timestamp(calendar_df.iloc[-1, 0]) - pd.Timedelta(days=1)).strftime("%Y-%m-%d")
+
+        if end_date is None:
+            end_date = (pd.Timestamp(trading_date) + pd.Timedelta(days=1)).strftime("%Y-%m-%d")

        # download data from yahoo
        # NOTE: when downloading data from YahooFinance, max_workers is recommended to be 1