Pandas
创建
import numpy as np
import pandas as pd
dates = pd.date_range('20210101', periods=6)
df=pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>> df
A B C D
2021-01-01 1.215216 0.068544 -1.756434 1.207404
2021-01-02 -1.326274 -0.511758 -0.430175 -0.328281
2021-01-03 -0.658326 -1.520871 -0.877119 -1.202443
2021-01-04 0.934461 1.235513 0.174699 -0.257215
2021-01-05 0.941209 -0.113666 0.510728 -0.283018
2021-01-06 -0.795331 1.446177 -0.133865 0.008041
获取数据
-
df.A / df['A'] 获取单列,产生Series
>>> df.A
2021-01-01 1.215216
2021-01-02 -1.326274
2021-01-03 -0.658326
2021-01-04 0.934461
2021-01-05 0.941209
2021-01-06 -0.795331
Freq: D, Name: A, dtype: float64 -
用[]切片行
>>> df[0:3]
A B C D
2021-01-01 1.215216 0.068544 -1.756434 1.207404
2021-01-02 -1.326274 -0.511758 -0.430175 -0.328281
2021-01-03 -0.658326 -1.520871 -0.877119 -1.202443
按标签选择(即按index或者列名选择)
-
按标签提取一行
df.loc[dates[0]] -
用标签提取多列
df.loc[:, ['A', 'B']]同下,获取全部行的A和B列
-
用标签切片,包含行与列结束点
df.loc['20130102':'20130104', ['A', 'B']]>>> df.loc['2021-01-01':'2021-01-03',['A','C']]
A C
2021-01-01 1.215216 -1.756434
2021-01-02 -1.326274 -0.430175
2021-01-03 -0.658326 -0.877119 -
提取标量值
df.loc[dates[0], 'A']或者df.at[dates[0], 'A'] -
返回对象降维
df.loc['20130102', ['A', 'B']]>>> df.loc[dates[0],['A','B']]
A 1.215216
B 0.068544
Name: 2021-01-01 00:00:00, dtype: float64
按位置选择(即按照二维数组的坐标位置选择)
df.iloc(row_0:row_m, col_0:col_n) iloc支持两个参数
- 第一个是针对行的,可以是单个元素
0,也可以是一个1:3这样的切片范围,也可以是[1,3]这样的整数列表 - 第二个是针对列的,可以是单个元素
1,也可以是一个1:3这样的切片范围,也可以是[1,3]这样的整数列表
可以按行切片,按列切片,精确定位某一个值
>>> df.iloc[0]
A 1.215216
B 0.068544
C -1.756434
D 1.207404
Name: 2021-01-01 00:00:00, dtype: float64
>>> df.iloc[0:3,[0,2]]
A C
2021-01-01 1.215216 -1.756434
2021-01-02 -1.326274 -0.430175
2021-01-03 -0.658326 -0.877119
>>> df.iloc[0:2,]
A B C D
2021-01-01 1.215216 0.068544 -1.756434 1.207404
2021-01-02 -1.326274 -0.511758 -0.430175 -0.328281
>>> df.iloc[0:2, :]
A B C D
2021-01-01 1.215216 0.068544 -1.756434 1.207404
2021-01-02 -1.326274 -0.511758 -0.430175 -0.328281
>>> df.iloc[:,1:3]
B C
2021-01-01 0.068544 -1.756434
2021-01-02 -0.511758 -0.430175
2021-01-03 -1.520871 -0.877119
2021-01-04 1.235513 0.174699
2021-01-05 -0.113666 0.510728
2021-01-06 1.446177 -0.133865
>>> df.iloc[1,1]
-0.5117581034724134
>>> df.iat[1,1]
-0.5117581034724134
布尔索引
-
单列值筛选数据
df[df.A>0] -
整体筛选满足条件的值
df[df>0]>>> df[df.A > 0]
A B C D
2021-01-01 1.215216 0.068544 -1.756434 1.207404
2021-01-04 0.934461 1.235513 0.174699 -0.257215
2021-01-05 0.941209 -0.113666 0.510728 -0.283018
>>> df[df > 0]
A B C D
2021-01-01 1.215216 0.068544 NaN 1.207404
2021-01-02 NaN NaN NaN NaN
2021-01-03 NaN NaN NaN NaN
2021-01-04 0.934461 1.235513 0.174699 NaN
2021-01-05 0.941209 NaN 0.510728 NaN
2021-01-06 NaN 1.446177 NaN 0.008041 -
用isin筛选
>>> df2=df.copy()
>>> df2
A B C D
2021-01-01 1.215216 0.068544 -1.756434 1.207404
2021-01-02 -1.326274 -0.511758 -0.430175 -0.328281
2021-01-03 -0.658326 -1.520871 -0.877119 -1.202443
2021-01-04 0.934461 1.235513 0.174699 -0.257215
2021-01-05 0.941209 -0.113666 0.510728 -0.283018
2021-01-06 -0.795331 1.446177 -0.133865 0.008041
>>> df2['E'] = ['1','2','3','4','4','3']
>>> df2
A B C D E
2021-01-01 1.215216 0.068544 -1.756434 1.207404 1
2021-01-02 -1.326274 -0.511758 -0.430175 -0.328281 2
2021-01-03 -0.658326 -1.520871 -0.877119 -1.202443 3
2021-01-04 0.934461 1.235513 0.174699 -0.257215 4
2021-01-05 0.941209 -0.113666 0.510728 -0.283018 4
2021-01-06 -0.795331 1.446177 -0.133865 0.008041 3
>>> df2[df2['E'].isin(['1','4'])]
A B C D E
2021-01-01 1.215216 0.068544 -1.756434 1.207404 1
2021-01-04 0.934461 1.235513 0.174699 -0.257215 4
2021-01-05 0.941209 -0.113666 0.510728 -0.283018 4
赋值
-
用索引自动对齐新增列的数据
>>> s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20210101', periods=6))
>>> s1
2021-01-01 1
2021-01-02 2
2021-01-03 3
2021-01-04 4
2021-01-05 5
2021-01-06 6
Freq: D, dtype: int64
>>> df['F'] = s1
>>> df
A B C D F
2021-01-01 1.215216 0.068544 -1.756434 1.207404 1
2021-01-02 -1.326274 -0.511758 -0.430175 -0.328281 2
2021-01-03 -0.658326 -1.520871 -0.877119 -1.202443 3
2021-01-04 0.934461 1.235513 0.174699 -0.257215 4
2021-01-05 0.941209 -0.113666 0.510728 -0.283018 5
2021-01-06 -0.795331 1.446177 -0.133865 0.008041 6 -
赋值
-
按标签赋值
df.at[dates[0], 'A'] = 0 -
按位置赋值
df.iat[0,1] = 0 -
按NumPy数组赋值
df.loc[:, "D"] = np.array([5]*len(df))>>> df.at[dates[0], 'A'] = 0
>>> df
A B C D F
2021-01-01 0.000000 0.068544 -1.756434 1.207404 1
2021-01-02 -1.326274 -0.511758 -0.430175 -0.328281 2
2021-01-03 -0.658326 -1.520871 -0.877119 -1.202443 3
2021-01-04 0.934461 1.235513 0.174699 -0.257215 4
2021-01-05 0.941209 -0.113666 0.510728 -0.283018 5
2021-01-06 -0.795331 1.446177 -0.133865 0.008041 6
>>> df.iat[0,2] = 0
>>> df
A B C D F
2021-01-01 0.000000 0.068544 0.000000 1.207404 1
2021-01-02 -1.326274 -0.511758 -0.430175 -0.328281 2
2021-01-03 -0.658326 -1.520871 -0.877119 -1.202443 3
2021-01-04 0.934461 1.235513 0.174699 -0.257215 4
2021-01-05 0.941209 -0.113666 0.510728 -0.283018 5
2021-01-06 -0.795331 1.446177 -0.133865 0.008041 6
>>> df.loc[:,'D'] = np.array([5]*len(df))
>>> df
A B C D F
2021-01-01 0.000000 0.068544 0.000000 5 1
2021-01-02 -1.326274 -0.511758 -0.430175 5 2
2021-01-03 -0.658326 -1.520871 -0.877119 5 3
2021-01-04 0.934461 1.235513 0.174699 5 4
2021-01-05 0.941209 -0.113666 0.510728 5 5
2021-01-06 -0.795331 1.446177 -0.133865 5 6
-