跳到主要内容

Pandas

创建

import numpy as np
import pandas as pd

dates = pd.date_range('20210101', periods=6)
df=pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

>>> df
A B C D
2021-01-01 1.215216 0.068544 -1.756434 1.207404
2021-01-02 -1.326274 -0.511758 -0.430175 -0.328281
2021-01-03 -0.658326 -1.520871 -0.877119 -1.202443
2021-01-04 0.934461 1.235513 0.174699 -0.257215
2021-01-05 0.941209 -0.113666 0.510728 -0.283018
2021-01-06 -0.795331 1.446177 -0.133865 0.008041

获取数据

  • df.A / df['A'] 获取单列,产生Series

    >>> df.A
    2021-01-01 1.215216
    2021-01-02 -1.326274
    2021-01-03 -0.658326
    2021-01-04 0.934461
    2021-01-05 0.941209
    2021-01-06 -0.795331
    Freq: D, Name: A, dtype: float64
  • 用[]切片行

    >>> df[0:3]
    A B C D
    2021-01-01 1.215216 0.068544 -1.756434 1.207404
    2021-01-02 -1.326274 -0.511758 -0.430175 -0.328281
    2021-01-03 -0.658326 -1.520871 -0.877119 -1.202443

按标签选择(即按index或者列名选择)

  • 按标签提取一行 df.loc[dates[0]]

  • 用标签提取多列 df.loc[:, ['A', 'B']]

    同下,获取全部行的A和B列

  • 用标签切片,包含行与列结束点 df.loc['20130102':'20130104', ['A', 'B']]

    >>> df.loc['2021-01-01':'2021-01-03',['A','C']]
    A C
    2021-01-01 1.215216 -1.756434
    2021-01-02 -1.326274 -0.430175
    2021-01-03 -0.658326 -0.877119
  • 提取标量值 df.loc[dates[0], 'A'] 或者 df.at[dates[0], 'A']

  • 返回对象降维 df.loc['20130102', ['A', 'B']]

    >>> df.loc[dates[0],['A','B']]
    A 1.215216
    B 0.068544
    Name: 2021-01-01 00:00:00, dtype: float64

按位置选择(即按照二维数组的坐标位置选择)

df.iloc(row_0:row_m, col_0:col_n) iloc支持两个参数

  • 第一个是针对行的,可以是单个元素0,也可以是一个1:3这样的切片范围,也可以是[1,3]这样的整数列表
  • 第二个是针对列的,可以是单个元素1,也可以是一个1:3这样的切片范围,也可以是[1,3]这样的整数列表

可以按行切片,按列切片,精确定位某一个值

>>> df.iloc[0]
A 1.215216
B 0.068544
C -1.756434
D 1.207404
Name: 2021-01-01 00:00:00, dtype: float64
>>> df.iloc[0:3,[0,2]]
A C
2021-01-01 1.215216 -1.756434
2021-01-02 -1.326274 -0.430175
2021-01-03 -0.658326 -0.877119
>>> df.iloc[0:2,]
A B C D
2021-01-01 1.215216 0.068544 -1.756434 1.207404
2021-01-02 -1.326274 -0.511758 -0.430175 -0.328281
>>> df.iloc[0:2, :]
A B C D
2021-01-01 1.215216 0.068544 -1.756434 1.207404
2021-01-02 -1.326274 -0.511758 -0.430175 -0.328281
>>> df.iloc[:,1:3]
B C
2021-01-01 0.068544 -1.756434
2021-01-02 -0.511758 -0.430175
2021-01-03 -1.520871 -0.877119
2021-01-04 1.235513 0.174699
2021-01-05 -0.113666 0.510728
2021-01-06 1.446177 -0.133865
>>> df.iloc[1,1]
-0.5117581034724134
>>> df.iat[1,1]
-0.5117581034724134

布尔索引

  • 单列值筛选数据 df[df.A>0]

  • 整体筛选满足条件的值 df[df>0]

    >>> df[df.A > 0]
    A B C D
    2021-01-01 1.215216 0.068544 -1.756434 1.207404
    2021-01-04 0.934461 1.235513 0.174699 -0.257215
    2021-01-05 0.941209 -0.113666 0.510728 -0.283018
    >>> df[df > 0]
    A B C D
    2021-01-01 1.215216 0.068544 NaN 1.207404
    2021-01-02 NaN NaN NaN NaN
    2021-01-03 NaN NaN NaN NaN
    2021-01-04 0.934461 1.235513 0.174699 NaN
    2021-01-05 0.941209 NaN 0.510728 NaN
    2021-01-06 NaN 1.446177 NaN 0.008041
  • 用isin筛选

    >>> df2=df.copy()
    >>> df2
    A B C D
    2021-01-01 1.215216 0.068544 -1.756434 1.207404
    2021-01-02 -1.326274 -0.511758 -0.430175 -0.328281
    2021-01-03 -0.658326 -1.520871 -0.877119 -1.202443
    2021-01-04 0.934461 1.235513 0.174699 -0.257215
    2021-01-05 0.941209 -0.113666 0.510728 -0.283018
    2021-01-06 -0.795331 1.446177 -0.133865 0.008041
    >>> df2['E'] = ['1','2','3','4','4','3']
    >>> df2
    A B C D E
    2021-01-01 1.215216 0.068544 -1.756434 1.207404 1
    2021-01-02 -1.326274 -0.511758 -0.430175 -0.328281 2
    2021-01-03 -0.658326 -1.520871 -0.877119 -1.202443 3
    2021-01-04 0.934461 1.235513 0.174699 -0.257215 4
    2021-01-05 0.941209 -0.113666 0.510728 -0.283018 4
    2021-01-06 -0.795331 1.446177 -0.133865 0.008041 3
    >>> df2[df2['E'].isin(['1','4'])]
    A B C D E
    2021-01-01 1.215216 0.068544 -1.756434 1.207404 1
    2021-01-04 0.934461 1.235513 0.174699 -0.257215 4
    2021-01-05 0.941209 -0.113666 0.510728 -0.283018 4

赋值

  • 用索引自动对齐新增列的数据

    >>> s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20210101', periods=6))
    >>> s1
    2021-01-01 1
    2021-01-02 2
    2021-01-03 3
    2021-01-04 4
    2021-01-05 5
    2021-01-06 6
    Freq: D, dtype: int64
    >>> df['F'] = s1
    >>> df
    A B C D F
    2021-01-01 1.215216 0.068544 -1.756434 1.207404 1
    2021-01-02 -1.326274 -0.511758 -0.430175 -0.328281 2
    2021-01-03 -0.658326 -1.520871 -0.877119 -1.202443 3
    2021-01-04 0.934461 1.235513 0.174699 -0.257215 4
    2021-01-05 0.941209 -0.113666 0.510728 -0.283018 5
    2021-01-06 -0.795331 1.446177 -0.133865 0.008041 6
  • 赋值

    • 按标签赋值 df.at[dates[0], 'A'] = 0

    • 按位置赋值 df.iat[0,1] = 0

    • 按NumPy数组赋值 df.loc[:, "D"] = np.array([5]*len(df))

      >>> df.at[dates[0], 'A'] = 0
      >>> df
      A B C D F
      2021-01-01 0.000000 0.068544 -1.756434 1.207404 1
      2021-01-02 -1.326274 -0.511758 -0.430175 -0.328281 2
      2021-01-03 -0.658326 -1.520871 -0.877119 -1.202443 3
      2021-01-04 0.934461 1.235513 0.174699 -0.257215 4
      2021-01-05 0.941209 -0.113666 0.510728 -0.283018 5
      2021-01-06 -0.795331 1.446177 -0.133865 0.008041 6
      >>> df.iat[0,2] = 0
      >>> df
      A B C D F
      2021-01-01 0.000000 0.068544 0.000000 1.207404 1
      2021-01-02 -1.326274 -0.511758 -0.430175 -0.328281 2
      2021-01-03 -0.658326 -1.520871 -0.877119 -1.202443 3
      2021-01-04 0.934461 1.235513 0.174699 -0.257215 4
      2021-01-05 0.941209 -0.113666 0.510728 -0.283018 5
      2021-01-06 -0.795331 1.446177 -0.133865 0.008041 6
      >>> df.loc[:,'D'] = np.array([5]*len(df))
      >>> df
      A B C D F
      2021-01-01 0.000000 0.068544 0.000000 5 1
      2021-01-02 -1.326274 -0.511758 -0.430175 5 2
      2021-01-03 -0.658326 -1.520871 -0.877119 5 3
      2021-01-04 0.934461 1.235513 0.174699 5 4
      2021-01-05 0.941209 -0.113666 0.510728 5 5
      2021-01-06 -0.795331 1.446177 -0.133865 5 6

运算

  • 描述性统计 df.describe() 计算出count, mean,std,min,max等,主要计算每一列的数据

    如下,计算A-F列,每一列的count,mean(平均值)等数据

    df.mean()计算每一列的平均值,结果按列给出

    df.mean(1)计算每一行的平均值,结果是按行列出

    >>> df.describe()
    A B C D F
    count 6.000000 6.000000 6.000000 6.0 6.000000
    mean -0.150710 0.100657 -0.125955 5.0 3.500000
    std 0.943161 1.109026 0.483422 0.0 1.870829
    min -1.326274 -1.520871 -0.877119 5.0 1.000000
    25% -0.761080 -0.412235 -0.356097 5.0 2.250000
    50% -0.329163 -0.022561 -0.066933 5.0 3.500000
    75% 0.700846 0.943771 0.131024 5.0 4.750000
    max 0.941209 1.446177 0.510728 5.0 6.000000

    >>> df.mean()
    A -0.150710
    B 0.100657
    C -0.125955
    D 5.000000
    F 3.500000
    dtype: float64

    >>> df.mean(1)
    2021-01-01 1.213709
    2021-01-02 0.946359
    2021-01-03 0.988737
    2021-01-04 2.268935
    2021-01-05 2.267654
    2021-01-06 2.303396
    Freq: D, dtype: float64
  • apply函数

    • df.apply(np.cumsum) 对df的每一列累加求和
    • df.apply(lambda x: x.max() - x.min()) 对df的每一列,求列最大值减去列最小值的差值
  • 直方图 用于Series,s.value_counts()

    s=pd.Series(np.random.randint(0,7, size=10))
    >>> s
    0 0
    1 5
    2 6
    3 0
    4 5
    5 3
    6 4
    7 4
    8 1
    9 6
    dtype: int64
    >>> s.value_counts()
    0 2
    5 2
    6 2
    4 2
    3 1
    1 1
    dtype: int64

合并

  • concat 用于连接 Pandas 对象,如Series和DataFrame

    df=pd.DataFrame(np.random.randn(10,4))
    pieces=[df[:3], df[3:7], df[7:]]
    pd.concat(pieces)
  • join sql风格的合并

    left=pd.DataFrame({'key':['foo', 'foo'], 'lval':[1, 2]})
    right=pd.DataFrame({'key':['foo', 'foo'], 'rval':[4, 5]})
    pd.merge(left, right, on='key')

    left=pd.DataFrame({'key':['foo', 'bar'], 'lval':[1, 2]})
    right=pd.DataFrame({'key':['foo', 'bar'], 'rval':[4, 5]})
    pd.merge(left, right, on='key')
  • append 为DataFrame追加行

    df=pd.DataFrame(np.random.randn(8,4), columns=list("ABCD"))
    s=df.iloc[3]
    df.append(s, ignore_index=True)

分组(Grouping)

“group by” 指的是涵盖下列一项或多项步骤的处理流程:

  • 分割:按条件把数据分割成多组;
  • 应用:为每组单独应用函数;
  • 组合:将处理结果组合成一个数据结构。

df.groupby('A').sum() 对df按列A分组统计,然后按组求和(只能求和数字列)

重塑(Reshaping)

  • 堆叠(stack) 多层索引

数据透视表(Pivot tables)

分类统计

pd.pivot_table(df, values='D', index=['A','B'], columns=['C'])

以A,B为分类索引,C为表头,计算D的值