技术标签: # python - 数据分析
在许多数据分析工作中,缺失数据是经常发生的。pandas的目标之一就是尽量轻松地处理缺失数据。例如,pandas对象的所有描述性统计默认都不包括缺失数据。
缺失数据在pandas中呈现的方式有些不完美,但对于大多数用户可以保证功能正常。对于数值数据,pandas使用浮点值NaN(Not a Number)表示缺失数据。我们称其为哨兵值,可以方便的检测出来:
import pandas as pd
import numpy as np
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
--------------------------------------------------------------------
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
string_data.isnull()
-----------------------------------------------------------------
0 False
1 False
2 True
3 False
dtype: bool
string_data.describe()
------------------------------------------------------------------------------
count 3
unique 3
top artichoke
freq 1
dtype: object
在pandas中将缺失值表示为NA,它表示不可用not available。在统计应用中,NA数据可能是不存在的数据或者虽然存在,但是没有观察到(例如,数据采集中发生了问题)。当进行数据清洗以进行分析时,最好直接对缺失数据进行分析,以判断数据采集的问题或缺失数据可能导致的偏差。
Python内置的None值在对象数组中也可以作为NA:
string_data
-----------------------------------------------
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
string_data[0] = None
string_data.isnull()
----------------------------------------------------------------------------
0 True
1 False
2 True
3 False
dtype: bool
下表列出了一些关于缺失数据处理的函数:
过滤掉缺失数据的办法有很多种。你可以通过pandas.isnull或布尔索引的手工方法,但dropna可能会更实用一些。对于一个Series,dropna返回一个仅含非空数据和索引值的Series:
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data
---------------------------------------------------------------------------
0 1.0
1 NaN
2 3.5
3 NaN
4 7.0
dtype: float64
data.dropna()
-----------------------------------------------------------------------
0 1.0
2 3.5
4 7.0
dtype: float64
这等价于:
data[data.notnull()]
-----------------------------------------------------------------------
0 1.0
2 3.5
4 7.0
dtype: float64
- data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
[NA, NA, NA], [NA, 6.5, 3.]])
data
-----------------------------------------------------------------------------------
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
******************************************************************
cleaned = data.dropna()
cleaned
------------------------------------------
0 1 2
0 1.0 6.5 3.0
data.dropna(how='all')
--------------------------------------
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
data[4] = NA
data
data.dropna(axis=1, how='all')
-------------------------------------------------
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
df = pd.DataFrame(np.random.randn(7,3))
df.iloc[:4,1] = NA
df.iloc[:2,2] = NA
df
df.dropna()
df.dropna(thresh=2)
--------------------------------------------------------------------
0 1 2
0 0.603531 NaN NaN
1 -0.664478 NaN NaN
2 1.574308 NaN -2.026062
3 -0.796888 NaN -0.550593
4 -0.146118 0.090815 1.564721
5 0.766400 0.971270 -0.779148
6 -1.802502 -1.139013 -1.289828
0 1 2
4 -0.146118 0.090815 1.564721
5 0.766400 0.971270 -0.779148
6 -1.802502 -1.139013 -1.289828
0 1 2
2 1.574308 NaN -2.026062
3 -0.796888 NaN -0.550593
4 -0.146118 0.090815 1.564721
5 0.766400 0.971270 -0.779148
6 -1.802502 -1.139013 -1.289828
df.fillna(0)
-----------------------------------------------------------
0 1 2
0 0.603531 0.000000 0.000000
1 -0.664478 0.000000 0.000000
2 1.574308 0.000000 -2.026062
3 -0.796888 0.000000 -0.550593
4 -0.146118 0.090815 1.564721
5 0.766400 0.971270 -0.779148
6 -1.802502 -1.139013 -1.289828
fillna
,就可以实现对不同的列填充不同的值:df.fillna({
1:0.5, 2:0})
inplace
对现有对象进行就地修改:df.fillna(0, inplace=True)
df
----------------------------------------------------------------------
0 1 2
0 0.603531 0.000000 0.000000
1 -0.664478 0.000000 0.000000
2 1.574308 0.000000 -2.026062
3 -0.796888 0.000000 -0.550593
4 -0.146118 0.090815 1.564721
5 0.766400 0.971270 -0.779148
6 -1.802502 -1.139013 -1.289828
reindexing
有效的那些插值方法也可用于fillna
:df = pd.DataFrame(np.random.randn(6,3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
df.fillna(method='ffill')
df.fillna(method='ffill', limit=2)
----------------------------------------------------------------------------
0 1 2
0 0.786754 -1.180053 2.165359
1 -0.802301 0.124125 -0.737388
2 0.285469 NaN 0.210286
3 0.647529 NaN 1.124584
4 -0.706682 NaN NaN
5 1.418289 NaN NaN
0 1 2
0 0.786754 -1.180053 2.165359
1 -0.802301 0.124125 -0.737388
2 0.285469 0.124125 0.210286
3 0.647529 0.124125 1.124584
4 -0.706682 0.124125 1.124584
5 1.418289 0.124125 1.124584
0 1 2
0 0.786754 -1.180053 2.165359
1 -0.802301 0.124125 -0.737388
2 0.285469 0.124125 0.210286
3 0.647529 0.124125 1.124584
4 -0.706682 NaN 1.124584
5 1.418289 NaN 1.124584
data = pd.Series([1., NA, 3.5, NA, 7])
data
data.fillna(data.mean())
-------------------------------------------------------------------
0 1.0
1 NaN
2 3.5
3 NaN
4 7.0
dtype: float64
0 1.000000
1 3.833333
2 3.500000
3 3.833333
4 7.000000
dtype: float64
DataFrame中出现重复行有多种原因。下面就是一个例子:
data = pd.DataFrame({
'k1': ['one', 'two'] * 3 + ['two'],
'k2': [1, 1, 2, 3, 3, 4, 4]})
data
data.duplicated()
data.drop_duplicates()
----------------------------------------------------------------
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
这两个方法默认会判断全部列,你也可以指定部分列进行重复项判断。假设我们还有一列值,且只希望根据k1列过滤重复项:
data['v1'] = range(7)
data
data.drop_duplicates(['k1'])
----------------------------------------------------------------------
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
5 two 4 5
6 two 4 6
k1 k2 v1
0 one 1 0
1 two 1 1
duplicated
和drop_duplicates
默认保留的是第一个出现的值组合。传入keep='last'
则保留最后一个:
data.drop_duplicates(['k1', 'k2'], keep='last')
----------------------------------------------------------------------
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
6 two 4 6
data = pd.DataFrame({
'food': ['bacon', 'pulled pork', 'bacon',
'Pastrami', 'corned beef', 'Bacon',
'pastrami', 'honey ham', 'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
----------------------------------------------------------------------
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
map
方法可以接受一个函数或含有映射关系的字典型对象,但是这里有一个小问题,即有些肉类的首字母大写了,而另一些则没有。因此,我们还需要使用Series的str.lower
方法,将各个值转换为小写:lowercased = data['food'].str.lower()
lowercased
data['animal'] = lowercased.map(meat_to_animal)
data
-----------------------------------------------------------------------------
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
data['food'].map(lambda x: meat_to_animal[x.lower()])
-----------------------------------------------------------
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
map
是一种实现元素级转换以及其他数据清理工作的便捷方式。fillna
方法填充缺失数据可以看做值替换的一种特殊情况。前面已经看到,map可用于修改对象的数据子集,而replace
则提供了一种实现该功能的更简单、更灵活的方式。我们来看看下面这个Series:data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
----------------------------------------------------------------
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
inplace=True
):data.replace(-999, np.nan)
-----------------------------------------------------------------
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
data.replace([-999, -1000], np.nan)
-----------------------------------------------------------------
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
要让每个值有不同的替换值,可以传递一个替换列表:
data.replace([-999, -1000], [np.nan, 0])
-----------------------------------------------------------------
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
传入的参数也可以是字典:
data.replace({
-999: np.nan, -1000: 0})
-----------------------------------------------------------------------
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
index=['Ohio', 'Colorado', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
-----------------------------------------------------------------------
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
transform = lambda x: x[:4].upper()
data.index.map(transform)
-------------------------------------------------------------------------
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data.index = data.index.map(transform)
data
-------------------------------------------------------------------------
one two three four
OHIO 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
如果想要创建数据集的转换版(而不是修改原始数据),比较实用的方法是rename
:
data.rename(index=str.title, columns=str.upper)
str.title('this is a test')
-------------------------------------------------------------------------
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colo 4 5 6 7
New 8 9 10 11
'This Is A Test'
rename
可以结合字典型对象实现对部分轴标签的更新:data.rename(index={
'OHIO': 'INDIANA'},
columns={
'three': 'peekaboo'})
-------------------------------------------------------------------------
one two peekaboo four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
rename
可以实现复制DataFrame并对其索引和列标签进行赋值。如果希望就地修改某个数据集,传入inplace=True
即可data.rename(index={
'OHIO': 'INDIANA'}, inplace=True)
data
-------------------------------------------------------------------------
one two three four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
为了便于分析,连续数据常常被离散化或拆分为“面元”(bin)。假设有一组人员数据,而你希望将它们划分为不同的年龄组:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
cut
函数:bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
-------------------------------------------------------------------------
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
pandas返回的是一个特殊的Categorical对象。结果展示了pandas.cut
划分的面元。你可以将其看做一组表示面元名称的字符串。它的底层含有一个表示不同分类名称的类型数组,以及一个codes
属性中的年龄数据的标签:
cats.codes
-------------------------------------------------------------------------
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
-------------------------------------------------------------------------
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
closed='right',
dtype='interval[int64]')
# pd.value_counts(cats)是pandas.cut结果的面元计数。
pd.value_counts(cats)
-------------------------------------------------------------------------
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
right=False
进行修改:pd.cut(ages, [18, 26, 36, 61, 100], right=False)
-------------------------------------------------------------------------
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)
-------------------------------------------------------------------------
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
data = np.random.rand(20)
pd.cut(data, 4, precision=2)
-------------------------------------------------------------------------
[(0.67, 0.9], (0.67, 0.9], (0.011, 0.23], (0.23, 0.45], (0.23, 0.45], ..., (0.45, 0.67], (0.011, 0.23], (0.23, 0.45], (0.011, 0.23], (0.67, 0.9]]
Length: 20
Categories (4, interval[float64]): [(0.011, 0.23] < (0.23, 0.45] < (0.45, 0.67] < (0.67, 0.9]]
qcut
是一个非常类似于cut
的函数,它可以根据样本分位数对数据进行面元划分。根据数据的分布情况,cut
可能无法使各个面元中含有相同数量的数据点。而qcut
由于使用的是样本分位数,因此可以得到大小基本相等的面元:data = np.random.randn(1000) # Normally distributed
cats = pd.qcut(data, 4) # Cut into quartiles
cats
-------------------------------------------------------------------------
[(-3.627, -0.665], (-0.665, -0.0236], (-3.627, -0.665], (-3.627, -0.665], (0.595, 3.352], ..., (-3.627, -0.665], (0.595, 3.352], (-0.0236, 0.595], (-0.0236, 0.595], (-0.0236, 0.595]]
Length: 1000
Categories (4, interval[float64]): [(-3.627, -0.665] < (-0.665, -0.0236] < (-0.0236, 0.595] < (0.595, 3.352]]
pd.value_counts(cats)
-------------------------------------------------------------------------
(0.595, 3.352] 250
(-0.0236, 0.595] 250
(-0.665, -0.0236] 250
(-3.627, -0.665] 250
dtype: int64
cut
类似,你也可以传递自定义的分位数(0到1之间的数值,包含端点):pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
----------------------------------------------------------------------------------
[(-1.319, -0.0236], (-1.319, -0.0236], (-3.627, -1.319], (-1.319, -0.0236], (1.218, 3.352], ..., (-3.627, -1.319], (1.218, 3.352], (-0.0236, 1.218], (-0.0236, 1.218], (-0.0236, 1.218]]
Length: 1000
Categories (4, interval[float64]): [(-3.627, -1.319] < (-1.319, -0.0236] < (-0.0236, 1.218] < (1.218, 3.352]]
data = pd.DataFrame(np.random.randn(1000,4))
data.describe()
----------------------------------------------------------------------------------
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.000284 0.003834 0.052079 0.021614
std 1.010268 1.006346 1.007687 1.029508
min -3.261766 -3.762325 -3.544134 -3.124230
25% -0.741905 -0.643554 -0.613592 -0.651136
50% -0.019509 -0.039399 0.036878 0.094302
75% 0.695546 0.676507 0.737588 0.712619
max 3.026675 3.482009 3.390924 3.116602
# 假设你想要找出某列中绝对值大小超过3的值:
col = data[2]
col[np.abs(col) > 3]
----------------------------------------------------------------------------------
156 3.174432
406 3.312773
708 -3.544134
788 3.390924
Name: 2, dtype: float64
data[(np.abs(data) > 3).any(1)]
----------------------------------------------------------------------------------
0 1 2 3
32 -0.464433 -3.762325 0.593019 -3.124230
156 0.771649 0.483199 3.174432 0.117988
257 1.056537 1.435697 -1.161737 3.095851
307 -3.056003 -1.987528 -0.628526 0.297810
406 0.912561 -0.073800 3.312773 -0.802398
526 3.026675 -0.463901 -0.486543 -1.458392
628 -1.211783 -1.272122 -1.051027 3.116602
708 -0.276584 -0.283202 -3.544134 -0.418665
713 -3.261766 -1.122042 0.869456 -0.035866
779 0.082870 3.482009 0.439070 -0.291009
788 -0.111096 -0.608896 3.390924 -0.688649
837 0.884890 -3.544678 -0.799940 -1.402440
883 -0.140538 -3.086551 -2.524166 -2.525502
901 -0.949264 3.354125 1.301728 0.192881
# sign()是Python的Numpy中的取数字符号(数字前的正负号)的函数
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
----------------------------------------------------------------------------------
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.000575 0.004392 0.051745 0.021525
std 1.009208 0.999045 1.003169 1.028508
min -3.000000 -3.000000 -3.000000 -3.000000
25% -0.741905 -0.643554 -0.613592 -0.651136
50% -0.019509 -0.039399 0.036878 0.094302
75% 0.695546 0.676507 0.737588 0.712619
max 3.000000 3.000000 3.000000 3.000000
np.sign(data).head()
----------------------------------------------------------------------------------
0 1 2 3
0 1.0 1.0 1.0 -1.0
1 1.0 1.0 -1.0 -1.0
2 1.0 1.0 1.0 1.0
3 -1.0 1.0 -1.0 -1.0
4 -1.0 1.0 -1.0 -1.0
numpy.random.permutation
函数可以轻松实现对Series
或DataFrame
的列的排列工作(permuting,随机重排序)。通过需要排列的轴的长度调用permutation
,可产生一个表示新顺序的整数数组:df = pd.DataFrame(np.arange(5 * 4).reshape((5,4)))
sampler = np.random.permutation(5)
sampler
----------------------------------------------------------------------------------
array([1, 0, 2, 3, 4])
iloc
的索引操作或take
函数中使用该数组了:df
----------------------------------------------------------------------------------
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
df.take(sampler)
----------------------------------------------------------------------------------
0 1 2 3
1 4 5 6 7
0 0 1 2 3
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
df.sample(n=3)
----------------------------------------------------------------------------------
0 1 2 3
2 8 9 10 11
4 16 17 18 19
1 4 5 6 7
replace=True
到sample
:choices = pd.Series([5,7,-1,6,4])
draws = choices.sample(n=10, replace=True)
draws
----------------------------------------------------------------------------------
4 4
2 -1
1 7
0 5
0 5
3 6
3 6
3 6
1 7
4 4
dtype: int64
另一种常用于统计建模或机器学习的转换方式是:将分类变量(categorical variable)转换为“哑变量”或“指标矩阵”。
如果DataFrame的某一列中含有k个不同的值,则可以派生出一个k列矩阵或DataFrame(其值全为1和0)。pandas有一个get_dummies
函数可以实现该功能(其实自己动手做一个也不难)。使用之前的一个DataFrame例子:
df = pd.DataFrame({
'key': ['b', 'b', 'a', 'c', 'a', 'b'],
'data1': range(6)})
df
----------------------------------------------------------------------------------
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 b
pd.get_dummies(df['key'])
----------------------------------------------------------------------------------
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
有时候,你可能想给指标DataFrame
的列加上一个前缀,以便能够跟其他数据进行合并。get_dummies
的prefix
参数可以实现该功能:
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
----------------------------------------------------------------------------------
data1 key_a key_b key_c
0 0 0 1 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 1 0 0
5 5 0 1 0
如果DataFrame中的某行同属于多个分类,则事情就会有点复杂。看一下MovieLens 1M数据集
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('data/movie/movies.dat', sep='::',
header=None, names=mnames)
movies[:10]
----------------------------------------------------------------------------------
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
5 6 Heat (1995) Action|Crime|Thriller
6 7 Sabrina (1995) Comedy|Romance
7 8 Tom and Huck (1995) Adventure|Children's
8 9 Sudden Death (1995) Action
9 10 GoldenEye (1995) Action|Adventure|Thriller
all_genres = []
for x in movies.genres:
all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
----------------------------------------------------------------------------------
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
'Western'], dtype=object)
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
现在,迭代每一部电影,并将dummies各行的条目设为1。要这么做,我们使用dummies.columns
来计算每个类型的列索引:
gen = movies.genres[0]
gen.split('|')
----------------------------------------------------------------------------------
['Animation', "Children's", 'Comedy']
dummies.columns.get_indexer(gen.split('|'))
----------------------------------------------------------------------------------
array([0, 1, 2])
.iloc
设定值:for i, gen in enumerate(movies.genres):
indices = dummies.columns.get_indexer(gen.split('|'))
dummies.iloc[i, indices] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
----------------------------------------------------------------------------------
movie_id 1
title Toy Story (1995)
genres Animation|Children's|Comedy
Genre_Animation 1
Genre_Children's 1
Genre_Comedy 1
Genre_Adventure 0
Genre_Fantasy 0
Genre_Romance 0
Genre_Drama 0
Genre_Action 0
Genre_Crime 0
Genre_Thriller 0
Genre_Horror 0
Genre_Sci-Fi 0
Genre_Documentary 0
Genre_War 0
Genre_Musical 0
Genre_Mystery 0
Genre_Film-Noir 0
Genre_Western 0
Name: 0, dtype: object
Python能够成为流行的数据处理语言,部分原因是其简单易用的字符串和文本处理功能。大部分文本运算都直接做成了字符串对象的内置方法。对于更为复杂的模式匹配和文本操作,则可能需要用到正则表达式。pandas对此进行了加强,它使你能够对整组数据应用字符串表达式和正则表达式,而且能处理烦人的缺失数据。
对于许多字符串处理和脚本应用,内置的字符串方法已经能够满足要求了。例如,以逗号分隔的字符串可以用split
拆分成数段:
val = 'a,b, guido'
val.split(',')
----------------------------------------------------------------------------------
['a', 'b', ' guido']
strip
一起使用,以去除空白符(包括换行符):pieces = [x.strip() for x in val.split(',')]
pieces
----------------------------------------------------------------------------------
['a', 'b', 'guido']
first, second, third = pieces
first + '::' + second + '::' + third
----------------------------------------------------------------------------------
'a::b::guido'
'::'.join(pieces)
----------------------------------------------------------------------------------
'a::b::guido'
in
关键字,还可以使用index
和find
:'guido' in val
True
----------------------------------------------------------------------------------
val.index(',')
1
----------------------------------------------------------------------------------
val.find(':')
-1
val.index(':')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-105-2c016e7367ac> in <module>()
----> 1 val.index(':')
ValueError: substring not found
与此相关,count
可以返回指定子串的出现次数:
val.count(',')
2
replace
用于将指定模式替换为另一个模式。通过传入空字符串,它也常常用于删除模式:
val.replace(',', '::')
'a::b:: guido'
val.replace(',','')
'ab guido'
下表列出了Python内置的字符串方法,这些运算大部分都能使用正则表达式实现:
正则表达式提供了一种灵活的在文本中搜索或匹配(通常比前者复杂)字符串模式的方式。正则表达式,常称作regex,是根据正则表达式语言编写的字符串。Python内置的re模块负责对字符串应用正则表达式。
re模块的函数可以分为三个大类:模式匹配、替换以及拆分。当然,它们之间是相辅相成的。一个regex描述了需要在文本中定位的一个模式,它可以用于许多目的。我们先来看一个简单的例子:假设想要拆分一个字符串,分隔符为数量不定的一组空白符(制表符、空格、换行符等)。描述一个或多个空白符的regex是\s+:
import re
text = 'foo bar\t baz \tqux'
re.split('\s+', text)
----------------------------------------------------------------------------------
['foo', 'bar', 'baz', 'qux']
re.split('\s+',text)
时,正则表达式会先被编译,然后再在text上调用其split
方法。你可以用re.compile
自己编译regex
以得到一个可重用的regex
对象:regex = re.compile('\s+')
regex.split(text)
----------------------------------------------------------------------------------
['foo', 'bar', 'baz', 'qux']
如果只希望得到匹配regex的所有模式,则可以使用findall
方法:
regex.findall(text)
----------------------------------------------------------------------------------
[' ', '\t ', ' \t']
如果打算对许多字符串应用同一条正则表达式,强烈建议通过re.compile
创建regex
对象。这样将可以节省大量的CPU时间。
match
和search跟findall
功能类似。findall返回的是字符串中所有的匹配项,而search则只返回第一个匹配项。match更加严格,它只匹配字符串的首部。来看一个小例子,假设我们有一段文本以及一条能够识别大部分电子邮件地址的正则表达式:
text = """Dave [email protected]
Steve [email protected]
Rob [email protected]
Ryan [email protected]
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)
# 对text使用findall将得到一组电子邮件地址:
regex.findall(text)
----------------------------------------------------------------------------------------
['[email protected]', '[email protected]', '[email protected]', '[email protected]']
m = regex.search(text)
m
----------------------------------------------------------------------------------------
<_sre.SRE_Match object; span=(5, 20), match='[email protected]'>
text[m.start():m.end()]
----------------------------------------------------------------------------------------
'[email protected]'
regex.match
则将返回None,因为它只匹配出现在字符串开头的模式:print(regex.match(text))
----------------------------------------------------------------------------------------
None
sub
方法可以将匹配到的模式替换为指定字符串,并返回所得到的新字符串:print(regex.sub('REDACTED', text))
----------------------------------------------------------------------------------------
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
groups
方法返回一个由模式各段组成的元组:m = regex.match('[email protected]')
m.groups()
----------------------------------------------------------------------------------------
('wesm', 'bright', 'net')
对于带有分组功能的模式,findall会返回一个元组列表:
regex.findall(text)
----------------------------------------------------------------------------------------
[('dave', 'google', 'com'),
('steve', 'gmail', 'com'),
('rob', 'gmail', 'com'),
('ryan', 'yahoo', 'com')]
sub
还能通过诸如\1、\2之类的特殊符号访问各匹配项中的分组。符号\1对应第一个匹配的组,\2对应第二个匹配的组,以此类推:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
----------------------------------------------------------------------------------------
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
清理待分析的散乱数据时,常常需要做一些字符串规整化工作。更为复杂的情况是,含有字符串的列有时还含有缺失数据:
data = {
'Dave': '[email protected]', 'Steve': '[email protected]',
'Rob': '[email protected]', 'Wes': np.nan}
data = pd.Series(data)
data
-----------------------------------------------------------------------
Dave dave@google.com
Rob rob@gmail.com
Steve steve@gmail.com
Wes NaN
dtype: object
data.isnull()
-----------------------------------------------------------------------
Dave False
Rob False
Steve False
Wes True
dtype: bool
data.map
,所有字符串和正则表达式方法都能被应用于(传入lambda表达式或其他函数)各个值,但是如果存在NA(null)就会报错。为了解决这个问题,Series有一些能够跳过NA值的面向数组方法,进行字符串操作。通过Series的str属性即可访问这些方法。例如,我们可以通过str.contains
检查各个电子邮件地址是否含有"gmail":data.str.contains('gmail')
-----------------------------------------------------------------------
Dave False
Rob True
Steve True
Wes NaN
dtype: object
re
选项(如IGNORECASE):pattern
-----------------------------------------------------------------------
'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
data.str.findall(pattern, flags=re.IGNORECASE)
-----------------------------------------------------------------------
Dave [(dave, google, com)]
Rob [(rob, gmail, com)]
Steve [(steve, gmail, com)]
Wes NaN
dtype: object
str.get
,要么在str属性上使用索引:matches = data.str.match(pattern, flags=re.IGNORECASE)
matches
---------------------------------------------------------------------------
Dave True
Rob True
Steve True
Wes NaN
dtype: object
data.str.get(1)
---------------------------------------------------------------------------
Dave a
Rob o
Steve t
Wes NaN
dtype: object
data.str[0]
---------------------------------------------------------------------------
Dave d
Rob r
Steve s
Wes NaN
dtype: object
data.str[:5]
---------------------------------------------------------------------------
Dave dave@
Rob rob@g
Steve steve
Wes NaN
dtype: object
pandas
字符串方法:前面介绍的这些向量化字符串操作方法非常适合用来处理现实中那些凌 乱的数据。下面将通过一个从不同网站获取的公开食谱数据库的案例来 进行演示。我们的目标是将这些食谱数据解析为食材列表,这样就可以 根据现有的食材快速找到食谱。
获取数据的脚本可以在https://github.com/fictivekin/openrecipes
上找到, 那里还有最新版的数据库链接。
这个数据库是 JSON 格式的,来试试通过 pd.read_json
读取数据:
try:
recipes = pd.read_json('data/openrecipes.txt')
except ValueError as e:
print("ValueError:", e)
---------------------------------------------------------------------------
ValueError: Trailing data
提示数据里有“trailing data”(数据断行)的 ValueError 错误。原因是虽然文件中的每一行都是一个有效的 JSON 对象,但是全文却不是这样。来看看文件是不是这样:
with open('data/openrecipes.txt') as f:
line = f.readline()
print(line)
---------------------------------------------------------------------------
{
"name": "Easter Leftover Sandwich", "ingredients": "12 whole Hard Boiled Eggs\n1/2 cup Mayonnaise\n3 Tablespoons Grainy Dijon Mustard\n Salt And Pepper, to taste\n Several Dashes Worcestershire Sauce\n Leftover Baked Ham, Sliced\n Kaiser Rolls Or Other Bread\n Extra Mayonnaise And Dijon, For Spreading\n Swiss Cheese Or Other Cheese Slices\n Thinly Sliced Red Onion\n Avocado Slices\n Sliced Tomatoes\n Lettuce, Spinach, Or Arugula", "url": "http://thepioneerwoman.com/cooking/2013/04/easter-leftover-sandwich/", "image": "http://static.thepioneerwoman.com/cooking/files/2013/03/leftoversandwich.jpg", "cookTime": "PT", "recipeYield": "8", "datePublished": "2013-04-01", "prepTime": "PT15M", "description": "Got leftover Easter eggs? Got leftover Easter ham? Got a hearty appetite? Good! You've come to the right place! I..."}
显然每一行都是一个有效的 JSON 对象,因此需要将这些字符串连接在 一起。解决这个问题的一种方法就是新建一个字符串,将所有行 JSON 对象连接起来,然后再通过 pd.read_json
来读取所有数据:
import pandas as pd
# 将文件内容读取成Python数组
with open('data/openrecipes.txt', 'r') as f:
# 提取每一行内容
data = (line.strip() for line in f)
# 将所有内容合并成一个列表
data_json = "[{0}]".format(','.join(data))
# 用JSON形式读取数据
recipes = pd.read_json(data_json)
recipes.shape
---------------------------------------------------------------------------
(1042, 9)
recipes.iloc[0]
---------------------------------------------------------------------------
cookTime PT
datePublished 2013-04-01
description Got leftover Easter eggs? Got leftover East...
image http://static.thepioneerwoman.com/cooking/file...
ingredients 12 whole Hard Boiled Eggs\n1/2 cup Mayonnaise\...
name Easter Leftover Sandwich
prepTime PT15M
recipeYield 8
url http://thepioneerwoman.com/cooking/2013/04/eas...
Name: 0, dtype: object
这里有一堆信息,而且其中有不少都和从网站上抓取的数据一样,字段 形式混乱。值得关注的是,食材列表是字符串形式,我们需要从中抽取 感兴趣的信息。下面来仔细看看这个字段:
recipes.ingredients.str.len().describe()
---------------------------------------------------------------------------
count 1042.000000
mean 358.645873
std 187.332133
min 22.000000
25% 246.250000
50% 338.000000
75% 440.000000
max 3160.000000
Name: ingredients, dtype: float64
食材列表平均 250 个字符,最短的字符串是 22,最长的竟然3000多字符!
来看看这个拥有最长食材列表的究竟是哪道菜:
recipes.name[np.argmax(recipes.ingredients.str.len())]
---------------------------------------------------------------------------
'A Nice Berry Pie'
我们还可以再做一些累计探索,例如看看哪些食谱是早餐:
recipes.description.str.contains('[Bb]reakfast').sum()
---------------------------------------------------------------------------
11
或者看看有多少食谱用肉桂(cinnamon)作为食材:
recipes.ingredients.str.contains('[Cc]innamon').sum()
---------------------------------------------------------------------------
79
还可以看看究竟是哪些食谱里把肉桂错写成了“cinamon”:
recipes.ingredients.str.contains('[Cc]inamon').sum()
---------------------------------------------------------------------------
0
现在让我们更进一步,来制作一个简易的美食推荐系统:如果用户 提供一些食材,系统就会推荐使用了所有食材的食谱。这说起来是 容易,但是由于大量不规则(heterogeneity)数据的存在,这个任 务变得十分复杂,例如并没有一个简单直接的办法可以从每一行数 据中清理出一份干净的食材列表。因此,我们在这里简化处理:首 先提供一些常见食材列表,然后通过简单搜索判断这些食材是否在 食谱中。为了简化任务,这里只列举常用的香料和调味料:
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley', 'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
现在就可以通过一个由 True 与 False 构成的布尔类型的 DataFrame 来判断食材是否出现在某个食谱中:
import time
start = time.time()
spice_df = pd.DataFrame(dict((spice, recipes.ingredients.str.contains(spice, re.IGNORECASE,)) for spice in spice_list))
print(time.time() - start)
#spice_df[spice_df.any(1)]
0.011888265609741211
现在,来找一份使用了pepper、salt和sage这三种食材的食谱。我们可以通过 DataFrame 的 query()
方法来快速完成计算
%timeit selection = spice_df.query('pepper & salt & sage')
# selection
# len(selection)
1.57 ms ± 6.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit spice_df["pepper"] & spice_df["salt"]& spice_df["sage"]
391 µs ± 28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
recipes.iloc[selection.index][['name', 'ingredients']]
name ingredients
687 1 quart (4 cups) water\n1 bay leaf\n2 sage lea...
1014 Spiced Candied Walnuts Recipe Peanut or canola oil for deep-frying (hs note:...
recipes.head()
cookTime datePublished description image ingredients name prepTime recipeYield url
0 PT 2013-04-01 Got leftover Easter eggs? Got leftover East... http://static.thepioneerwoman.com/cooking/file... 12 whole Hard Boiled Eggs\n1/2 cup Mayonnaise\... Easter Leftover Sandwich PT15M 8 http://thepioneerwoman.com/cooking/2013/04/eas...
1 PT10M 2011-06-06 I finally have basil in my garden. Basil I can... http://static.thepioneerwoman.com/cooking/file... 3/4 cups Fresh Basil Leaves\n1/2 cup Grated Pa... Pasta with Pesto Cream Sauce PT6M 8 http://thepioneerwoman.com/cooking/2011/06/pas...
2 PT15M 2011-09-15 This was yummy. And easy. And pretty! And it t... http://static.thepioneerwoman.com/cooking/file... 2 whole Pork Tenderloins\n Salt And Pepper, to... Herb Roasted Pork Tenderloin with Preserves PT5M 12 http://thepioneerwoman.com/cooking/2011/09/her...
3 PT20M 2012-04-23 I made this for a late lunch Saturday, and it ... http://static.thepioneerwoman.com/cooking/file... 1 pound Penne\n4 whole Boneless, Skinless Chic... Chicken Florentine Pasta PT10M 10 http://thepioneerwoman.com/cooking/2012/04/chi...
4 PT 2011-06-13 Iced coffee is my life. When I wake up, often ... http://static.thepioneerwoman.com/cooking/file... 1 pound Ground Coffee (good, Rich Roast)\n8 qu... Perfect Iced Coffee PT8H 24 http://thepioneerwoman.com/cooking/2011/06/per...
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley', 'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
---------------------------------------------------------------------------
#init dict
dict_inverse_index = {
}
for i in spice_list:
dict_inverse_index[i] = set()
ingredients = recipes.ingredients.values
---------------------------------------------------------------------------
start = time.time()
for recipe_index in range(len(ingredients)):
for spice in spice_list:
if(spice in ingredients[recipe_index]):
dict_inverse_index[spice].add(recipe_index)
print(time.time() - start)
0.005366086959838867
#selection = spice_df.query('pepper | salt | sage')
%timeit dict_inverse_index["pepper"] & dict_inverse_index["salt"] & dict_inverse_index["sage"]
2.95 µs ± 24.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
文章浏览阅读1.5w次,点赞3次,收藏29次。Spring Cloud不仅提供了使用Ribbon进行客户端负载均衡,还提供了Spring Cloud LoadBalancer。相比较于Ribbon,Spring Cloud LoadBalancer不仅能够支持RestTemplate,还支持WebClient。WeClient是Spring Web Flux中提供的功能,可以实现响应式异步请求,因此学习Spring Cloud LoadBalancer之前,建议先了解下Spring Web Flux。 无论使用Ribbo..._reactiveloadbalancerclientfilter
文章浏览阅读2k次,点赞2次,收藏7次。1.安装依赖包//一键安装上面四个依赖sudo apt-get install gcc zlib zlib-devel pcre-devel openssl openssl-devel2.下载并解压安装包//创建一个文件夹cd /usr/localmkdir nginxcd nginx//下载tar包wget http://nginx.org/download/nginx-1.13.7.tar.gztar -xvf nginx-1.13.7.tar.gz3.安装nginx/_ubuntu 安装nginx cd.current_salt[0] = ~salt[0];
文章浏览阅读1.6k次。微信公众号搜索 DevOps和k8s全栈技术 ,即可关注公众号,也可扫描文章最后的二维码关注公众号,每天会分享技术文章供大家阅读参考哈~前言Ingress-Nginx是一个K8S ing..._"nginx.ingress.kubernetes.io/canary-by-header-value: \"vip"
文章浏览阅读695次。FFmpeg从入门到放弃官网下载源码包下载地址: https://ffmpeg.org/releases/ffmpeg-snapshot.tar.bz2链下载地址: 本地资源暂时未上传后续在修改上传解压解压 : ffmpeg-snapshot-git.tar.bz2tar -zxvf ffmpeg-snapshot-git.tar.bz2cd ffmpeg编译linux 默认基础编译及其效果ubuntu@VM-192-8-ubuntu:~/ffmpeg$ ./configure -_avx512 ffmpeg
文章浏览阅读2.4k次。相关文章阅读记录:JAVA解析各种编码密钥对(DER、PEM、openssh公钥)https://blog.csdn.net/jiftlixu/article/details/19836405【Java密码学】使用Bouncy Castle生成数字签名、数字信封https://www.cnblogs.com/qiuxiangmuyu/p/6195215.htmlJava生成CS..._android csr
文章浏览阅读768次。1.使用构造函数。显式调用: Stock food = Stock("adasdas",2250);隐式调用:Stock food("adasdas",2250);使用动态内存分配调用:Stock *pstock = new Stock("adasdas",2250);注意:无法使用对象来调用构造函数,因为在构造函数构造出对象前,对象是不存在的。因此,构造函数被用来创建对象,而不是通过对象来调用。2.const成员函数。声明:void show() const;定义:void stack
文章浏览阅读4.6k次。见个人博客: https://ziyubiti.github.io/2018/04/04/5gnewwave2/_w/f-ofdm
文章浏览阅读950次。image.jsp Color getRandColor(int fc,int bc){Random random = new Random();if(fc>255) fc=255;if(bc>255) bc=255;int r=fc+random.nextInt(bc-fc)_if (fc > 255) fc = 255; if (bc > 255) bc = 255; int r = fc random.nextint(bc
文章浏览阅读4.8k次,点赞5次,收藏4次。1、设置相机和网络适配器的ip地址为静态ip ,如果多相机多网卡,则最好不同的相机设置不同的网段。2、利用Vpp应用程序,和windows的任务计划程序,实现开机自动(或延时)启动Vpp程序。..._visionpro 设置开机运行
文章浏览阅读2.8k次。华为SNS2624光纤交换机默认管理口IP:10.77.77.77默认账号密码:admin/Huawei12#$登录光交,输入命令switchshow,可以看到没有一个端口激活输入以下命令激活8个端口SNS2624:admin> licenseport --reserve 0-7SNS2624:admin> portenable 0-7再次输入命令switchshow,就可以看到0-7端口激活了。..._华为sns2624光纤交换机配置手册
文章浏览阅读1.3w次,点赞4次,收藏16次。微动探测原理及仪器介绍1、微动探测原理1.1 基本理论微动是由体波和面波组成,其中微动能量的百分之七十以上都来自于面波,并且它与体波的主要区别是在不均匀的介质中传播时面波会发生频散,而体波不存在频散现象,因此面波是微动探测的主要对象。对于介质的弹性参数(如横波速度、层薄厚、压缩波速度、密度等),面波传播速度相对比较敏感,特别是对横波的敏感性很高,因此可以利用面波来反演横波速度。在时间和空间上,微动信号具有以下特性:高度变化、无规律性、无重复性等,有研究发现全世界各处记录的微动信号都有相似_微动探测
文章浏览阅读482次。在C语言中,我们可以编写程序来将给定的正整数分解为其质因数。在本文中,我将展示一个简单的C语言程序,用于分解正整数的质因数,并提供相应的源代码和解析。希望通过这个示例程序和解析,您能够理解如何在C语言中实现正整数的质因数分解。这是一个基本的算法,可以用于解决数论和因数分解相关的问题。在函数内部,我们使用一个循环来迭代从2到给定正整数的所有可能的因数。是否等于1,如果不等于1,则打印" × ",表示这个因数还没有完全分解。在这个例子中,我们将正整数84分解为其质因数:2、2、3和7。如果可以整除,我们将。_正整数分解质因数c语言