2019-06-17 14:55:19 230浏览
今天千锋扣丁学堂大数据培训老师给大家分享一篇关于Pandas必备技能之“时间序列数据处理”的详细介绍,首先时间序列数据TimeSeriesData是在不同时间上收集到的数据,这类数据是按时间顺序收集到的,用于所描述现象随时间变化的情况。
In [8]: data = pd.read_csv('unemployment.csv') In [9]: data.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 209 entries, 0 to 208 Data columns (total 2 columns): date 209 non-null object UNRATE 209 non-null float64 dtypes: float64(1), object(1) memory usage: 3.3+ KB
In [11]: data = pd.read_csv('unemployment.csv', parse_dates=['date'], index_col='date') In [12]: data.info() <class 'pandas.core.frame.DataFrame'> DatetimeIndex: 209 entries, 2000-01-01 to 2017-05-01 Data columns (total 1 columns): UNRATE 209 non-null float64 dtypes: float64(1) memory usage: 13.3 KB
import tushare as ts import pandas as pd pd.set_option('expand_frame_repr', False) # 列太多时不换行 pro = ts.pro_api() df = pro.daily(ts_code='000001.SZ', start_date='20180701', end_date='20180718') df.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 13 entries, 0 to 12 Data columns (total 11 columns): ts_code 13 non-null object trade_date 13 non-null object open 13 non-null float64 high 13 non-null float64 low 13 non-null float64 close 13 non-null float64 pre_close 13 non-null float64 change 13 non-null float64 pct_chg 13 non-null float64 vol 13 non-null float64 amount 13 non-null float64 dtypes: float64(9), object(2) memory usage: 1.2+ KB None df['trade_date'] = pd.to_datetime(df['trade_date']) df.set_index('trade_date', inplace=True) df.sort_values('trade_date', ascending=True, inplace=True) # 升序排列 df.info() <class 'pandas.core.frame.DataFrame'> DatetimeIndex: 13 entries, 2018-07-02 to 2018-07-18 Data columns (total 10 columns): ts_code 13 non-null object open 13 non-null float64 high 13 non-null float64 low 13 non-null float64 close 13 non-null float64 pre_close 13 non-null float64 change 13 non-null float64 pct_chg 13 non-null float64 vol 13 non-null float64 amount 13 non-null float64 dtypes: float64(9), object(1) memory usage: 1.1+ KB
df.head() Out[15]: ts_code open high low close pre_close change pct_chg vol amount trade_date 2018-07-02 000001.SZ 9.05 9.05 8.55 8.61 9.09 -0.48 -5.28 1315520.13 1158545.868 2018-07-03 000001.SZ 8.69 8.70 8.45 8.67 8.61 0.06 0.70 1274838.57 1096657.033 2018-07-04 000001.SZ 8.63 8.75 8.61 8.61 8.67 -0.06 -0.69 711153.37 617278.559 2018-07-05 000001.SZ 8.62 8.73 8.55 8.60 8.61 -0.01 -0.12 835768.77 722169.579 2018-07-06 000001.SZ 8.61 8.78 8.45 8.66 8.60 0.06 0.70 988282.69 852071.526
df = df[['ts_code', 'open', 'high', 'low', 'close', 'vol']] # 单位:成交量 (手) ts_code open high low close vol trade_date 2018-07-02 000001.SZ 9.05 9.05 8.55 8.61 1315520.13 2018-07-03 000001.SZ 8.69 8.70 8.45 8.67 1274838.57 2018-07-04 000001.SZ 8.63 8.75 8.61 8.61 711153.37 2018-07-05 000001.SZ 8.62 8.73 8.55 8.60 835768.77 2018-07-06 000001.SZ 8.61 8.78 8.45 8.66 988282.69 2018-07-09 000001.SZ 8.69 9.03 8.68 9.03 1409954.60 2018-07-10 000001.SZ 9.02 9.02 8.89 8.98 896862.02 2018-07-11 000001.SZ 8.76 8.83 8.68 8.78 851296.70 2018-07-12 000001.SZ 8.60 8.97 8.58 8.88 1140492.31 2018-07-13 000001.SZ 8.92 8.94 8.82 8.88 603378.21 2018-07-16 000001.SZ 8.85 8.90 8.69 8.73 689845.58 2018-07-17 000001.SZ 8.74 8.75 8.66 8.72 375356.33 2018-07-18 000001.SZ 8.75 8.85 8.69 8.70 525152.77
为了方便大家观察,把这段时间的日历附在下面,'2018-07-02'正好是星期一。
转换的思路是这样的,以日历中的周进行聚合,如'20180702'-'20180708',取该周期内,日线开盘价的第一个值作为周开盘价,日线最高价的最大值作为周最高价,日线最低价的最小值作为周最低价,日线收盘价的最后一个值作为周最收盘价,日线最高价的最大值作为周最高价,日线成交量的求和作为周成交量(手),如下图黄色方框所示。
freq = '1W' df_weekly = df[['open']].resample(rule=freq).first() df_weekly['high'] = df['high'].resample(rule=freq).max() df_weekly['low'] = df['low'].resample(rule=freq).min() df_weekly['close'] = df['close'].resample(rule=freq).last() df_weekly['vol'] = df['vol'].resample(rule=freq).sum() df_weekly Out[33]: open high low close vol trade_date 2018-07-08 9.05 9.05 8.45 8.66 5125563.53 2018-07-15 8.69 9.03 8.58 8.88 4901983.84 2018-07-22 8.85 8.90 8.66 8.70 1590354.68
df_daily = df_weekly.resample('D').asfreq() print(df_daily) Out[52]: open high low close vol trade_date 2018-07-08 9.05 9.05 8.45 8.66 5125563.53 2018-07-09 NaN NaN NaN NaN NaN 2018-07-10 NaN NaN NaN NaN NaN 2018-07-11 NaN NaN NaN NaN NaN 2018-07-12 NaN NaN NaN NaN NaN 2018-07-13 NaN NaN NaN NaN NaN 2018-07-14 NaN NaN NaN NaN NaN 2018-07-15 8.69 9.03 8.58 8.88 4901983.84 2018-07-16 NaN NaN NaN NaN NaN 2018-07-17 NaN NaN NaN NaN NaN 2018-07-18 NaN NaN NaN NaN NaN 2018-07-19 NaN NaN NaN NaN NaN 2018-07-20 NaN NaN NaN NaN NaN 2018-07-21 NaN NaN NaN NaN NaN 2018-07-22 8.85 8.90 8.66 8.70 1590354.68
df_daily = df_weekly.resample('D').ffill() df_daily Out[54]: open high low close vol trade_date 2018-07-08 9.05 9.05 8.45 8.66 5125563.53 2018-07-09 9.05 9.05 8.45 8.66 5125563.53 2018-07-10 9.05 9.05 8.45 8.66 5125563.53 2018-07-11 9.05 9.05 8.45 8.66 5125563.53 2018-07-12 9.05 9.05 8.45 8.66 5125563.53 2018-07-13 9.05 9.05 8.45 8.66 5125563.53 2018-07-14 9.05 9.05 8.45 8.66 5125563.53 2018-07-15 8.69 9.03 8.58 8.88 4901983.84 2018-07-16 8.69 9.03 8.58 8.88 4901983.84 2018-07-17 8.69 9.03 8.58 8.88 4901983.84 2018-07-18 8.69 9.03 8.58 8.88 4901983.84 2018-07-19 8.69 9.03 8.58 8.88 4901983.84 2018-07-20 8.69 9.03 8.58 8.88 4901983.84 2018-07-21 8.69 9.03 8.58 8.88 4901983.84 2018-07-22 8.85 8.90 8.66 8.70 1590354.68
df_daily = df_weekly.resample('D').bfill() df_daily Out[55]: open high low close vol trade_date 2018-07-08 9.05 9.05 8.45 8.66 5125563.53 2018-07-09 8.69 9.03 8.58 8.88 4901983.84 2018-07-10 8.69 9.03 8.58 8.88 4901983.84 2018-07-11 8.69 9.03 8.58 8.88 4901983.84 2018-07-12 8.69 9.03 8.58 8.88 4901983.84 2018-07-13 8.69 9.03 8.58 8.88 4901983.84 2018-07-14 8.69 9.03 8.58 8.88 4901983.84 2018-07-15 8.69 9.03 8.58 8.88 4901983.84 2018-07-16 8.85 8.90 8.66 8.70 1590354.68 2018-07-17 8.85 8.90 8.66 8.70 1590354.68 2018-07-18 8.85 8.90 8.66 8.70 1590354.68 2018-07-19 8.85 8.90 8.66 8.70 1590354.68 2018-07-20 8.85 8.90 8.66 8.70 1590354.68 2018-07-21 8.85 8.90 8.66 8.70 1590354.68 2018-07-22 8.85 8.90 8.66 8.70 1590354.68
df = df[['ts_code', 'close']] df Out[58]: ts_code close trade_date 2018-07-02 000001.SZ 8.61 2018-07-03 000001.SZ 8.67 2018-07-04 000001.SZ 8.61 2018-07-05 000001.SZ 8.60 2018-07-06 000001.SZ 8.66 2018-07-09 000001.SZ 9.03 2018-07-10 000001.SZ 8.98 2018-07-11 000001.SZ 8.78 2018-07-12 000001.SZ 8.88 2018-07-13 000001.SZ 8.88 2018-07-16 000001.SZ 8.73 2018-07-17 000001.SZ 8.72 2018-07-18 000001.SZ 8.70
df['MA3'] = df['close'].rolling(3).mean() df Out[76]: ts_code close MA3 trade_date 2018-07-02 000001.SZ 8.61 NaN 2018-07-03 000001.SZ 8.67 NaN 2018-07-04 000001.SZ 8.61 8.630000 2018-07-05 000001.SZ 8.60 8.626667 2018-07-06 000001.SZ 8.66 8.623333 2018-07-09 000001.SZ 9.03 8.763333 2018-07-10 000001.SZ 8.98 8.890000 2018-07-11 000001.SZ 8.78 8.930000 2018-07-12 000001.SZ 8.88 8.880000 2018-07-13 000001.SZ 8.88 8.846667 2018-07-16 000001.SZ 8.73 8.830000 2018-07-17 000001.SZ 8.72 8.776667 2018-07-18 000001.SZ 8.70 8.716667
df = df[['ts_code', 'pct_chg']] # 列pct_chg单位是(%) Out[71]: ts_code pct_chg trade_date 2018-07-02 000001.SZ -5.28 2018-07-03 000001.SZ 0.70 2018-07-04 000001.SZ -0.69 2018-07-05 000001.SZ -0.12 2018-07-06 000001.SZ 0.70 2018-07-09 000001.SZ 4.27 2018-07-10 000001.SZ -0.55 2018-07-11 000001.SZ -2.23 2018-07-12 000001.SZ 2.78 2018-07-13 000001.SZ 0.00 2018-07-16 000001.SZ -1.69 2018-07-17 000001.SZ -0.11 2018-07-18 000001.SZ -0.23
df['cum_pct_chg'] = df['pct_chg'].expanding().sum() df Out[78]: ts_code pct_chg cum_pct_chg trade_date 2018-07-02 000001.SZ -5.28 -5.28 2018-07-03 000001.SZ 0.70 -4.58 2018-07-04 000001.SZ -0.69 -5.27 2018-07-05 000001.SZ -0.12 -5.39 2018-07-06 000001.SZ 0.70 -4.69 2018-07-09 000001.SZ 4.27 -0.42 2018-07-10 000001.SZ -0.55 -0.97 2018-07-11 000001.SZ -2.23 -3.20 2018-07-12 000001.SZ 2.78 -0.42 2018-07-13 000001.SZ 0.00 -0.42 2018-07-16 000001.SZ -1.69 -2.11 2018-07-17 000001.SZ -0.11 -2.22 2018-07-18 000001.SZ -0.23 -2.45
【关注微信公众号获取更多学习资料】 【扫码进入Python全栈开发免费公开课】