Personal consumer lending has grown rapidly, but it also exposes lenders to substantial credit risk. Information asymmetry is pronounced in lending: historically, borrowers had a fairly complete picture of their own finances, repayment capacity, and willingness to repay, while financial institutions could not fully observe a borrower's risk level, or learned of it only with a significant lag. This information disadvantage means a lender's risk assessment can diverge from reality, producing loan losses that directly erode profits. Today, however, financial institutions can combine data from multiple sources to assess a customer's risk level in advance and make the credit decision accordingly.
Predicting default with classification algorithms
The available training data comprise users' basic attributes (user_info.txt), bank transaction records (bank_detail.txt), browsing behavior (browse_history.txt), credit-card bill records (bill_detail.txt), loan disbursement times (loan_time.txt), and a record of whether each user later defaulted (overdue.txt).
Correspondingly, there are test-set counterparts for the basic attributes, bank transactions, credit-card bills, browsing behavior, and loan times, along with the list of user ids to be predicted.
All tables have been de-identified: (a) user id information is masked; (b) all user attributes are numerically encoded; and (c) timestamps and all monetary amounts have been transformed.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
plt.rcParams['font.sans-serif']=['SimHei']
plt.rcParams['axes.unicode_minus']=False
# Load the bank transaction records
columns_BankDetail = ['用户id','时间戳','交易类型','交易金额','工资收入标记']
df_bank_detail_train = pd.read_csv(r'D:\python数据分析\数据分析项目\CSDN\个人征信预测数据加代码\bank_detail_train.txt',names=columns_BankDetail,sep=',')
df_bank_detail_train.head()
用户id 时间戳 交易类型 交易金额 工资收入标记
0 6965 5894316387 0 13.756664 0
1 6965 5894321388 1 13.756664 0
2 6965 5897553564 0 14.449810 0
3 6965 5897563463 1 10.527763 0
4 6965 5897564598 1 13.651303 0
df_bank_detail_train.info()
# Check for missing values
df_bank_detail_train.isnull().sum()
# Total number of records in the bank transaction table
len(df_bank_detail_train.用户id)
# Number of distinct users in the bank transaction table
len(set(df_bank_detail_train.用户id))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6070197 entries, 0 to 6070196
Data columns (total 5 columns):
# Column Dtype
--- ------ -----
0 用户id int64
1 时间戳 int64
2 交易类型 int64
3 交易金额 float64
4 工资收入标记 int64
dtypes: float64(1), int64(4)
memory usage: 231.6 MB
用户id 0
时间戳 0
交易类型 0
交易金额 0
工资收入标记 0
dtype: int64
6070197
9294
# Load the other five training tables
columns_UserInfo = ['用户id','性别','职业','教育程度','婚姻状态','户口类型']
columns_BillDetail = ['用户id','账单时间戳','银行id','上期账单金额','上期还款金额','信用卡额度','本期账单余额','本期账单最低还款额','消费笔数','本期账单金额','调整金额','循环利息','可用金额','预借现金额度','还款状态']
columns_BrowseHistory = ['用户id','时间戳','浏览行为数据','浏览子行为编号']
columns_LoanTime= ['用户id','放款时间']
columns_overdue = ['用户id','样本标签']
df_UserInfo_train= pd.read_csv(r'D:\python数据分析\数据分析项目\CSDN\个人征信预测数据加代码\user_info_train.txt',names=columns_UserInfo,sep=',')
df_overdue_train = pd.read_csv(r'D:\python数据分析\数据分析项目\CSDN\个人征信预测数据加代码\overdue_train.txt',names=columns_overdue,sep=',')
df_BillDetail_train = pd.read_csv(r'D:\python数据分析\数据分析项目\CSDN\个人征信预测数据加代码\bill_detail_train.txt',names=columns_BillDetail,sep=',')
df_BrowseHistory_train = pd.read_csv(r'D:\python数据分析\数据分析项目\CSDN\个人征信预测数据加代码\browse_history_train.txt',names=columns_BrowseHistory,sep=',')
df_LoanTime_train = pd.read_csv(r'D:\python数据分析\数据分析项目\CSDN\个人征信预测数据加代码\loan_time_train.txt',names=columns_LoanTime,sep=',')
df_UserInfo_train.head()
df_UserInfo_train.info()
df_UserInfo_train.isnull().sum()
len(df_UserInfo_train.用户id)
len(df_UserInfo_train.用户id.unique())
用户id 性别 职业 教育程度 婚姻状态 户口类型
0 3150 1 2 4 1 4
1 6965 1 2 4 3 2
2 1265 1 3 4 3 1
3 6360 1 2 4 3 2
4 2583 2 2 2 1 1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55596 entries, 0 to 55595
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 用户id 55596 non-null int64
1 性别 55596 non-null int64
2 职业 55596 non-null int64
3 教育程度 55596 non-null int64
4 婚姻状态 55596 non-null int64
5 户口类型 55596 non-null int64
dtypes: int64(6)
memory usage: 2.5 MB
用户id 0
性别 0
职业 0
教育程度 0
婚姻状态 0
户口类型 0
dtype: int64
55596
55596
df_overdue_train.head()
df_overdue_train.info()
df_overdue_train.isnull().sum()
len(df_overdue_train.用户id)
len(df_overdue_train.用户id.unique())
用户id 样本标签
0 1 0
1 2 0
2 3 0
3 4 1
4 5 0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55596 entries, 0 to 55595
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 用户id 55596 non-null int64
1 样本标签 55596 non-null int64
dtypes: int64(2)
memory usage: 868.8 KB
用户id 0
样本标签 0
dtype: int64
55596
55596
df_BillDetail_train.head()
df_BillDetail_train.info()
df_BillDetail_train.isnull().sum()
len(df_BillDetail_train.用户id)
len(df_BillDetail_train.用户id.unique())
用户id 账单时间戳 银行id 上期账单金额 上期还款金额 信用卡额度 本期账单余额 本期账单最低还款额 消费笔数 本期账单金额 调整金额 循环利息 可用金额 预借现金额度 还款状态
0 3150 5906744363 6 18.626118 18.661937 20.664418 18.905766 17.847133 1 0.0 0.0 0.0 0.0 19.971271 0
1 3150 5906744401 6 18.905766 18.909954 20.664418 19.113305 17.911506 1 0.0 0.0 0.0 0.0 19.971271 0
2 3150 5906744427 6 19.113305 19.150290 20.664418 19.300194 17.977610 1 0.0 0.0 0.0 0.0 19.971271 0
3 3150 5906744515 6 19.300194 19.300280 21.000890 20.303240 18.477177 1 0.0 0.0 0.0 0.0 20.307743 0
4 3150 5906744562 6 20.303240 20.307744 21.000890 20.357134 18.510985 1 0.0 0.0 0.0 0.0 20.307743 0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2338118 entries, 0 to 2338117
Data columns (total 15 columns):
# Column Dtype
--- ------ -----
0 用户id int64
1 账单时间戳 int64
2 银行id int64
3 上期账单金额 float64
4 上期还款金额 float64
5 信用卡额度 float64
6 本期账单余额 float64
7 本期账单最低还款额 float64
8 消费笔数 int64
9 本期账单金额 float64
10 调整金额 float64
11 循环利息 float64
12 可用金额 float64
13 预借现金额度 float64
14 还款状态 int64
dtypes: float64(10), int64(5)
memory usage: 267.6 MB
用户id 0
账单时间戳 0
银行id 0
上期账单金额 0
上期还款金额 0
信用卡额度 0
本期账单余额 0
本期账单最低还款额 0
消费笔数 0
本期账单金额 0
调整金额 0
循环利息 0
可用金额 0
预借现金额度 0
还款状态 0
dtype: int64
2338118
53174
df_BrowseHistory_train.head()
df_BrowseHistory_train.info()
df_BrowseHistory_train.isnull().sum()
len(df_BrowseHistory_train.用户id)
len(df_BrowseHistory_train.用户id.unique())
用户id 时间戳 浏览行为数据 浏览子行为编号
0 34801 5926003545 173 1
1 34801 5926003545 164 4
2 34801 5926003545 38 7
3 34801 5926003545 45 1
4 34801 5926003545 110 7
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22919547 entries, 0 to 22919546
Data columns (total 4 columns):
# Column Dtype
--- ------ -----
0 用户id int64
1 时间戳 int64
2 浏览行为数据 int64
3 浏览子行为编号 int64
dtypes: int64(4)
memory usage: 699.4 MB
用户id 0
时间戳 0
浏览行为数据 0
浏览子行为编号 0
dtype: int64
22919547
47330
df_LoanTime_train.head()
df_LoanTime_train.info()
df_LoanTime_train.isnull().sum()
len(df_LoanTime_train.用户id)
len(df_LoanTime_train.用户id.unique())
用户id 放款时间
0 1 5914855887
1 2 5914855887
2 3 5914855887
3 4 5914855887
4 5 5914855887
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55596 entries, 0 to 55595
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 用户id 55596 non-null int64
1 放款时间 55596 non-null int64
dtypes: int64(2)
memory usage: 868.8 KB
用户id 0
放款时间 0
dtype: int64
55596
55596
print({
    'user ids in user_info': len(df_UserInfo_train.用户id.unique()),
    'user ids in bank_detail': len(df_bank_detail_train.用户id.unique()),
    'user ids in overdue': len(df_overdue_train.用户id.unique()),
    'user ids in bill_detail': len(df_BillDetail_train.用户id.unique()),
    'user ids in browse_history': len(df_BrowseHistory_train.用户id.unique()),
    'user ids in loan_time': len(df_LoanTime_train.用户id.unique())
})
{'user ids in user_info': 55596, 'user ids in bank_detail': 9294, 'user ids in overdue': 55596, 'user ids in bill_detail': 53174, 'user ids in browse_history': 47330, 'user ids in loan_time': 55596}
The user_info, overdue, and loan_time tables each cover 55,596 user ids, while bank_detail covers 9,294, browse_history 47,330, and bill_detail 53,174. In other words, not every user has bank transactions, credit-card bills, and so on, so below we take the users common to all six tables to assemble a complete new table.
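Before running the merges below, the common-user count can be sanity-checked by intersecting the unique id sets directly (a minimal sketch; the actual pipeline additionally keeps only records dated on or before each user's loan time, so its final count of 5,735 is smaller than this raw intersection):
common_ids = (set(df_bank_detail_train.用户id)
              & set(df_BillDetail_train.用户id)
              & set(df_BrowseHistory_train.用户id))
len(common_ids)  # raw intersection, before the loan-time filter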
df = pd.merge(left=df_bank_detail_train, right=df_LoanTime_train, how='left', on='用户id')
df.head()
# Keep only transactions dated on or before the loan disbursement time
t = df[df.时间戳<=df.放款时间]
bank_detail = t[['用户id']]
bank_detail = bank_detail.drop_duplicates(subset='用户id', keep='first')
bank_detail.info()
用户id 时间戳 交易类型 交易金额 工资收入标记 放款时间
0 6965 5894316387 0 13.756664 0 5923841487
1 6965 5894321388 1 13.756664 0 5923841487
2 6965 5897553564 0 14.449810 0 5923841487
3 6965 5897563463 1 10.527763 0 5923841487
4 6965 5897564598 1 13.651303 0 5923841487
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9271 entries, 0 to 6069672
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 用户id 9271 non-null int64
dtypes: int64(1)
memory usage: 144.9 KB
df1 = pd.merge(left=df_BillDetail_train, right=df_LoanTime_train, how='left', on='用户id')
df1.head()
# Keep only bill records dated on or before the loan disbursement time
t1 = df1[df1.账单时间戳<=df1.放款时间]
bill_detail = t1[['用户id']]
bill_detail = bill_detail.drop_duplicates(subset='用户id', keep='first')
bill_detail.info()
用户id 账单时间戳 银行id 上期账单金额 上期还款金额 信用卡额度 本期账单余额 本期账单最低还款额 消费笔数 本期账单金额 调整金额 循环利息 可用金额 预借现金额度 还款状态 放款时间
0 3150 5906744363 6 18.626118 18.661937 20.664418 18.905766 17.847133 1 0.0 0.0 0.0 0.0 19.971271 0 5919867087
1 3150 5906744401 6 18.905766 18.909954 20.664418 19.113305 17.911506 1 0.0 0.0 0.0 0.0 19.971271 0 5919867087
2 3150 5906744427 6 19.113305 19.150290 20.664418 19.300194 17.977610 1 0.0 0.0 0.0 0.0 19.971271 0 5919867087
3 3150 5906744515 6 19.300194 19.300280 21.000890 20.303240 18.477177 1 0.0 0.0 0.0 0.0 20.307743 0 5919867087
4 3150 5906744562 6 20.303240 20.307744 21.000890 20.357134 18.510985 1 0.0 0.0 0.0 0.0 20.307743 0 5919867087
<class 'pandas.core.frame.DataFrame'>
Int64Index: 46739 entries, 0 to 2338115
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 用户id 46739 non-null int64
dtypes: int64(1)
memory usage: 730.3 KB
df2 = pd.merge(left=df_BrowseHistory_train, right=df_LoanTime_train, how='left', on='用户id')
df2.head()
# Keep only browsing records dated on or before the loan disbursement time
t2 = df2[df2.时间戳<=df2.放款时间]
browse_history = t2[['用户id']]
browse_history = browse_history.drop_duplicates(subset='用户id', keep='first')
browse_history.info()
用户id 时间戳 浏览行为数据 浏览子行为编号 放款时间
0 34801 5926003545 173 1 5929543887
1 34801 5926003545 164 4 5929543887
2 34801 5926003545 38 7 5929543887
3 34801 5926003545 45 1 5929543887
4 34801 5926003545 110 7 5929543887
<class 'pandas.core.frame.DataFrame'>
Int64Index: 44945 entries, 0 to 22919430
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 用户id 44945 non-null int64
dtypes: int64(1)
memory usage: 702.3 KB
# Keep only the users present in all three detail tables
user1 = pd.merge(left=bill_detail, right=browse_history, how='inner', on='用户id')
user2 = pd.merge(left=user1, right=bank_detail, how='inner', on='用户id')
user2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5735 entries, 0 to 5734
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 用户id 5735 non-null int64
dtypes: int64(1)
memory usage: 89.6 KB
user2.T
0 1 2 3 4 5 6 7 8 9 ... 5725 5726 5727 5728 5729 5730 5731 5732 5733 5734
用户id 6965 1265 2583 29165 2443 590 10313 55354 20235 694 ... 11532 13999 40491 23264 6949 52736 45989 18855 37530 8917
1 rows × 5735 columns
Filtering to the user ids shared by all six tables yields 5,735 complete user records, which are then preprocessed field by field.
bank_detail_select = pd.merge(left=df_bank_detail_train, right=user2, how='inner', on='用户id')
bank_detail_select.head()
用户id 时间戳 交易类型 交易金额 工资收入标记
0 6965 5894316387 0 13.756664 0
1 6965 5894321388 1 13.756664 0
2 6965 5897553564 0 14.449810 0
3 6965 5897563463 1 10.527763 0
4 6965 5897564598 1 13.651303 0
# Per-user count and sum of income (credit) transactions
b1 = bank_detail_select[bank_detail_select['交易类型']==0].groupby('用户id', as_index=False)
c1 = b1['交易金额'].agg(进账单数='count', 进账金额='sum')
c1.head()
# Per-user count and sum of expenditure (debit) transactions
b2 = bank_detail_select[bank_detail_select['交易类型']==1].groupby('用户id', as_index=False)
c2 = b2['交易金额'].agg(支出单数='count', 支出金额='sum')
c2.head()
# Per-user count and sum of salary-flagged income
b3 = bank_detail_select[bank_detail_select['工资收入标记']==1].groupby('用户id', as_index=False)
c3 = b3['交易金额'].agg(工资笔数='count', 工资收入='sum')
c3.head()
用户id 进账单数 进账金额
0 3 172 2278.873446
1 4 96 1164.342384
2 10 141 1793.642133
3 14 521 6856.993313
4 16 35 478.264186
用户id 支出单数 支出金额
0 3 507 4985.957607
1 4 195 2129.425722
2 10 183 2250.292530
3 14 729 7960.811363
4 16 75 935.201357
用户id 工资笔数 工资收入
0 67 7 85.972926
1 117 17 146.429384
2 119 17 246.778621
3 178 14 207.186980
4 189 7 107.501477
d1 = pd.merge(left=user2, right=c1, how='left', on='用户id')
d1 = d1.fillna(0)
d1.info()
d2 = pd.merge(left=user2, right=c2, how='left', on='用户id')
d2 = d2.fillna(0)
d2.info()
d3 = pd.merge(left=user2, right=c3, how='left', on='用户id')
d3 = d3.fillna(0)
d3.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5735 entries, 0 to 5734
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 用户id 5735 non-null int64
1 进账单数 5735 non-null float64
2 进账金额 5735 non-null float64
dtypes: float64(2), int64(1)
memory usage: 179.2 KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5735 entries, 0 to 5734
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 用户id 5735 non-null int64
1 支出单数 5735 non-null float64
2 支出金额 5735 non-null float64
dtypes: float64(2), int64(1)
memory usage: 179.2 KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5735 entries, 0 to 5734
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 用户id 5735 non-null int64
1 工资笔数 5735 non-null float64
2 工资收入 5735 non-null float64
dtypes: float64(2), int64(1)
memory usage: 179.2 KB
bank_train = d1.merge(d2)
bank_train = bank_train.merge(d3)
bank_train.head()
用户id 进账单数 进账金额 支出单数 支出金额 工资笔数 工资收入
0 6965 75.0 972.850228 289.0 3234.531975 0.0 0.000000
1 1265 125.0 1708.206195 294.0 3662.457063 0.0 0.000000
2 2583 213.0 2736.475318 618.0 7064.310678 0.0 0.000000
3 29165 189.0 2277.607807 473.0 5099.861165 14.0 202.630532
4 2443 252.0 3020.288782 341.0 3762.790364 0.0 0.000000
browse_history_select = pd.merge(left=user2, right=df_BrowseHistory_train, how='left', on='用户id')
g1 = browse_history_select.groupby(['用户id'], as_index=False)
# Per-user count of browsing records
h1 = g1['浏览行为数据'].count()
browse_train = pd.merge(left=user2, right=h1, how='inner', on='用户id')
browse_train.head()
用户id 浏览行为数据
0 6965 1710
1 1265 420
2 2583 702
3 29165 783
4 2443 671
df_BillDetail_train.columns
df_BillDetail_train.head()
Index(['用户id', '账单时间戳', '银行id', '上期账单金额', '上期还款金额', '信用卡额度', '本期账单余额',
'本期账单最低还款额', '消费笔数', '本期账单金额', '调整金额', '循环利息', '可用金额', '预借现金额度',
'还款状态'],
dtype='object')
用户id 账单时间戳 银行id 上期账单金额 上期还款金额 信用卡额度 本期账单余额 本期账单最低还款额 消费笔数 本期账单金额 调整金额 循环利息 可用金额 预借现金额度 还款状态
0 3150 5906744363 6 18.626118 18.661937 20.664418 18.905766 17.847133 1 0.0 0.0 0.0 0.0 19.971271 0
1 3150 5906744401 6 18.905766 18.909954 20.664418 19.113305 17.911506 1 0.0 0.0 0.0 0.0 19.971271 0
2 3150 5906744427 6 19.113305 19.150290 20.664418 19.300194 17.977610 1 0.0 0.0 0.0 0.0 19.971271 0
3 3150 5906744515 6 19.300194 19.300280 21.000890 20.303240 18.477177 1 0.0 0.0 0.0 0.0 20.307743 0
4 3150 5906744562 6 20.303240 20.307744 21.000890 20.357134 18.510985 1 0.0 0.0 0.0 0.0 20.307743 0
bill_select = pd.merge(left=user2, right=df_BillDetail_train, how='right', on='用户id')
bill_select.drop(['账单时间戳','银行id','还款状态'], axis=1, inplace=True)
e1 = bill_select.groupby(['用户id'],as_index=False)
f1 = e1[['上期账单金额', '上期还款金额', '信用卡额度', '本期账单余额', '本期账单最低还款额', '消费笔数', '本期账单金额', '调整金额', '循环利息', '可用金额', '预借现金额度']].mean()
bill_train = pd.merge(left=user2, right=f1, how='left', on='用户id')
bill_train.head()
用户id 上期账单金额 上期还款金额 信用卡额度 本期账单余额 本期账单最低还款额 消费笔数 本期账单金额 调整金额 循环利息 可用金额 预借现金额度
0 6965 10.002659 12.733206 19.971271 19.957631 17.220095 10.750000 18.825107 0.000000 15.520681 0.0 19.624697
1 1265 17.715686 14.191595 19.973385 19.909123 17.866453 1.444444 19.007284 0.000000 5.175483 0.0 9.702118
2 2583 15.192264 15.265601 18.307126 17.736937 10.292788 1.791667 17.199134 0.000000 6.478271 0.0 11.014650
3 29165 -6.973236 12.852082 19.740221 17.921520 15.902257 0.000000 12.969975 0.000000 0.000000 0.0 6.001719
4 2443 16.759482 4.151986 17.309158 19.397134 18.206423 2.251572 15.892834 0.229931 4.082358 0.0 10.873986
The bill-timestamp, bank-id, and repayment-status fields are dropped, and the remaining fields are averaged per user id after grouping.
overdue_train = pd.merge(left=df_overdue_train, right=user2, how='right', on='用户id')
user_train = pd.merge(left=user2, right=df_UserInfo_train, how='left', on='用户id')
overdue_train.head()
user_train.head()
用户id 样本标签
0 3 0
1 4 1
2 10 0
3 14 0
4 16 0
用户id 性别 职业 教育程度 婚姻状态 户口类型
0 6965 1 2 4 3 2
1 1265 1 3 4 3 1
2 2583 2 2 2 1 1
3 29165 1 2 4 1 4
4 2443 1 4 4 3 1
df_train = user_train.merge(bank_train)
df_train = df_train.merge(bill_train)
df_train = df_train.merge(browse_train)
df_train = df_train.merge(overdue_train)
df_train.head()
用户id 性别 职业 教育程度 婚姻状态 户口类型 进账单数 进账金额 支出单数 支出金额 ... 本期账单余额 本期账单最低还款额 消费笔数 本期账单金额 调整金额 循环利息 可用金额 预借现金额度 浏览行为数据 样本标签
0 6965 1 2 4 3 2 75.0 972.850228 289.0 3234.531975 ... 19.957631 17.220095 10.750000 18.825107 0.000000 15.520681 0.0 19.624697 1710 0
1 1265 1 3 4 3 1 125.0 1708.206195 294.0 3662.457063 ... 19.909123 17.866453 1.444444 19.007284 0.000000 5.175483 0.0 9.702118 420 0
2 2583 2 2 2 1 1 213.0 2736.475318 618.0 7064.310678 ... 17.736937 10.292788 1.791667 17.199134 0.000000 6.478271 0.0 11.014650 702 0
3 29165 1 2 4 1 4 189.0 2277.607807 473.0 5099.861165 ... 17.921520 15.902257 0.000000 12.969975 0.000000 0.000000 0.0 6.001719 783 0
4 2443 1 4 4 3 1 252.0 3020.288782 341.0 3762.790364 ... 19.397134 18.206423 2.251572 15.892834 0.229931 4.082358 0.0 10.873986 671 0
df_train.columns
Index(['用户id', '性别', '职业', '教育程度', '婚姻状态', '户口类型', '进账单数', '进账金额', '支出单数',
'支出金额', '工资笔数', '工资收入', '上期账单金额', '上期还款金额', '信用卡额度', '本期账单余额',
'本期账单最低还款额', '消费笔数', '本期账单金额', '调整金额', '循环利息', '可用金额', '预借现金额度',
'浏览行为数据', '样本标签'],
dtype='object')
df_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5735 entries, 0 to 5734
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 用户id 5735 non-null int64
1 性别 5735 non-null int64
2 职业 5735 non-null int64
3 教育程度 5735 non-null int64
4 婚姻状态 5735 non-null int64
5 户口类型 5735 non-null int64
6 进账单数 5735 non-null float64
7 进账金额 5735 non-null float64
8 支出单数 5735 non-null float64
9 支出金额 5735 non-null float64
10 工资笔数 5735 non-null float64
11 工资收入 5735 non-null float64
12 上期账单金额 5735 non-null float64
13 上期还款金额 5735 non-null float64
14 信用卡额度 5735 non-null float64
15 本期账单余额 5735 non-null float64
16 本期账单最低还款额 5735 non-null float64
17 消费笔数 5735 non-null float64
18 本期账单金额 5735 non-null float64
19 调整金额 5735 non-null float64
20 循环利息 5735 non-null float64
21 可用金额 5735 non-null float64
22 预借现金额度 5735 non-null float64
23 浏览行为数据 5735 non-null int64
24 样本标签 5735 non-null int64
dtypes: float64(17), int64(8)
memory usage: 1.1 MB
Next, look mainly at the correlations between features.
bank_train.columns
Index(['用户id', '进账单数', '进账金额', '支出单数', '支出金额', '工资笔数', '工资收入'], dtype='object')
internal_chars = ['进账单数', '进账金额', '支出单数', '支出金额', '工资笔数', '工资收入']
# Correlation matrix
corrmat = bank_train[internal_chars].corr()
# Plot it as a heatmap
plt.subplots(figsize=(10,10))
sns.heatmap(corrmat, square=True, linewidths=.5, annot=True)
plt.show()
The heatmap shows that for income, expenditure, and salary alike, the amount is strongly linearly related to the number of transactions, so a new per-transaction feature will be constructed later: average amount = amount / count. Income and expenditure are also highly correlated with each other, so only one of the two is kept; this project keeps expenditure.
bill_train.columns
Index(['用户id', '上期账单金额', '上期还款金额', '信用卡额度', '本期账单余额', '本期账单最低还款额', '消费笔数',
'本期账单金额', '调整金额', '循环利息', '可用金额', '预借现金额度'],
dtype='object')
internal_chars = ['上期账单金额', '上期还款金额', '信用卡额度', '本期账单余额', '本期账单最低还款额', '消费笔数',
'本期账单金额', '调整金额', '循环利息', '可用金额', '预借现金额度']
# Correlation matrix
corrmat = bill_train[internal_chars].corr()
# Plot it as a heatmap
plt.subplots(figsize=(10,8))
sns.heatmap(corrmat, square=False, linewidths=.5, annot=True)
plt.show()
Rebuilding the feature table
# Construct per-transaction averages and the gap between the previous bill and its repayment
df_train['平均支出'] = df_train.apply(lambda x:x.支出金额/x.支出单数, axis=1)
# df_train['平均进账'] = df_train.apply(lambda x:x.进账金额/x.进账单数, axis=1)
df_train['平均工资收入'] = df_train.apply(lambda x:x.工资收入/x.工资笔数, axis=1)
df_train['上期还款差额'] = df_train.apply(lambda x:x.上期账单金额-x.上期还款金额, axis=1)
df_select = df_train.loc[:,['用户id','性别','教育程度','婚姻状态','平均支出','平均工资收入','上期还款差额','信用卡额度','本期账单余额','本期账单最低还款额','消费笔数','浏览行为数据','样本标签']].fillna(0)
df_select.head()
用户id 性别 教育程度 婚姻状态 平均支出 平均工资收入 上期还款差额 信用卡额度 本期账单余额 本期账单最低还款额 消费笔数 浏览行为数据 样本标签
0 6965 1 4 3 11.192152 0.000000 -2.730547 19.971271 19.957631 17.220095 10.750000 1710 0
1 1265 1 4 3 12.457337 0.000000 3.524092 19.973385 19.909123 17.866453 1.444444 420 0
2 2583 2 2 1 11.430923 0.000000 -0.073337 18.307126 17.736937 10.292788 1.791667 702 0
3 29165 1 4 1 10.781947 14.473609 -19.825318 19.740221 17.921520 15.902257 0.000000 783 0
4 2443 1 4 3 11.034576 0.000000 12.607495 17.309158 19.397134 18.206423 2.251572 671 0
Binarizing the repayment-gap feature 上期还款差额
# That is, convert the numeric feature into a Boolean one. With the default threshold of 0, values greater than 0 are labeled 1 and values less than or equal to 0 are labeled 0.
from sklearn.preprocessing import Binarizer
X = df_select['上期还款差额'].values.reshape(-1,1)
transformer = Binarizer(threshold=0).fit_transform(X)
df_select['上期还款差额标签'] = transformer
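For reference, the same labels can be produced without sklearn (an equivalent one-liner, not part of the original pipeline; it yields ints rather than the floats shown below):
labels = (df_select['上期还款差额'] > 0).astype(int)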
Variance-threshold filtering
df_select.head()
用户id 性别 教育程度 婚姻状态 平均支出 平均工资收入 上期还款差额 信用卡额度 本期账单余额 本期账单最低还款额 消费笔数 浏览行为数据 样本标签 上期还款差额标签
0 6965 1 4 3 11.192152 0.000000 -2.730547 19.971271 19.957631 17.220095 10.750000 1710 0 0.0
1 1265 1 4 3 12.457337 0.000000 3.524092 19.973385 19.909123 17.866453 1.444444 420 0 1.0
2 2583 2 2 1 11.430923 0.000000 -0.073337 18.307126 17.736937 10.292788 1.791667 702 0 0.0
3 29165 1 4 1 10.781947 14.473609 -19.825318 19.740221 17.921520 15.902257 0.000000 783 0 0.0
4 2443 1 4 3 11.034576 0.000000 12.607495 17.309158 19.397134 18.206423 2.251572 671 0 1.0
from sklearn.feature_selection import VarianceThreshold
x = df_select.drop(['用户id','上期还款差额','样本标签'],axis=1)
VTS = VarianceThreshold()
x_01 = VTS.fit_transform(x)
x_01.shape
(5735, 11)
x.shape
(5735, 11)
Correlation filtering with mutual information (variance filtering removed nothing: the shape stays (5735, 11) either way)
from sklearn.feature_selection import mutual_info_classif as MIC
x = df_select.drop(['用户id','上期还款差额','样本标签'],axis=1)
y = df_select['样本标签']
result = MIC(x,y)
result
k = result.shape[0]-sum(result<=0.001)
k
array([0.00523732, 0.0047046 , 0.00220937, 0.0087247 , 0.00120858,
0.01094742, 0. , 0.00375009, 0.0091549 , 0.01545167,
0. ])
9
x.columns
Index(['性别', '教育程度', '婚姻状态', '平均支出', '平均工资收入', '信用卡额度', '本期账单余额', '本期账单最低还款额',
'消费笔数', '浏览行为数据', '上期还款差额标签'],
dtype='object')
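The drop list used in the next line is hand-picked from the MIC output; it can also be derived programmatically from the result array above (a small sanity-check sketch):
# Features whose mutual information with the label is effectively zero
x.columns[result <= 0.001]
# Index(['本期账单余额', '上期还款差额标签'], dtype='object')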
x1 = x.drop(['本期账单余额','上期还款差额标签'], axis=1)
y = df_select['样本标签']
y.value_counts()
0 4899
1 836
Name: 样本标签, dtype: int64
This project uses three models for prediction: logistic regression, a decision tree, and a random forest. Given the roughly 6:1 class imbalance shown above (4,899 non-default vs. 836 default), F1 is used as the cross-validation scoring metric throughout.
Tuning the logistic-regression parameter C, using the features from before mutual-information filtering
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score as cvs
# SMOTE is not applied yet
# Tune C via a cross-validated learning curve, separately for the L1- and L2-penalized models
# C is the inverse of the regularization strength (a positive float, default 1.0): smaller values mean stronger regularization, i.e. a heavier penalty
x = df_select.drop(['用户id','上期还款差额','样本标签'],axis=1)
y = df_select['样本标签']
l1CVS = []
l2CVS = []
for i in np.linspace(0.05,1,30):
    # Fit both models with the current C
    lrl1 = LogisticRegression(penalty='l1', solver='liblinear', C=i, max_iter=1000, random_state=0)
    lrl2 = LogisticRegression(penalty='l2', solver='liblinear', C=i, max_iter=1000, random_state=0)
    once1 = cvs(lrl1, x, y, cv=5, scoring='f1').mean()
    once2 = cvs(lrl2, x, y, cv=5, scoring='f1').mean()
    l1CVS.append(once1)
    l2CVS.append(once2)
plt.plot(np.linspace(0.05,1,30), l1CVS, "r")
plt.plot(np.linspace(0.05,1,30), l2CVS, 'g')
plt.show()
Tuning C, using the features from after mutual-information filtering
# SMOTE is not applied yet
# Tune C via a cross-validated learning curve, separately for the L1- and L2-penalized models
# C is the inverse of the regularization strength (a positive float, default 1.0): smaller values mean stronger regularization, i.e. a heavier penalty
x1 = df_select.drop(['用户id','上期还款差额','样本标签','本期账单余额','上期还款差额标签'],axis=1)
y = df_select['样本标签']
l1CVS = []
l2CVS = []
for i in np.linspace(0.05,1,30):
    # Fit both models with the current C
    lrl1 = LogisticRegression(penalty='l1', solver='liblinear', C=i, max_iter=1000, random_state=0)
    lrl2 = LogisticRegression(penalty='l2', solver='liblinear', C=i, max_iter=1000, random_state=0)
    once1 = cvs(lrl1, x1, y, cv=5, scoring='f1').mean()
    once2 = cvs(lrl2, x1, y, cv=5, scoring='f1').mean()
    l1CVS.append(once1)
    l2CVS.append(once2)
plt.plot(np.linspace(0.05,1,30), l1CVS, 'r')
plt.plot(np.linspace(0.05,1,30), l2CVS, 'g')
plt.show()
Wrapper-based feature selection: with logistic regression as the base classifier, use recursive feature elimination (RFE) to select variables, drawing a cross-validated learning curve to explore the best number of features. At the same time, balance the samples with SMOTE and compare model performance before and after balancing.
Before applying SMOTE:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score as cvs
x1 = df_select.drop(['用户id','上期还款差额','样本标签','本期账单余额','上期还款差额标签'],axis=1)
#x = df_select.drop(['用户id','上期还款差额','样本标签'],axis=1)
y = df_select['样本标签']
LR_1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.6, max_iter=1000, random_state=0)
LR_2 = LogisticRegression(penalty='l2', solver='liblinear', C=0.6, max_iter=1000, random_state=0)
score1 = []
score2 = []
for i in range(1,12,1):
    selector1 = RFE(LR_1, n_features_to_select=i, step=1)
    selector2 = RFE(LR_2, n_features_to_select=i, step=2)
    X_wrapper1 = selector1.fit_transform(x1, y)
    X_wrapper2 = selector2.fit_transform(x1, y)
    once1 = cvs(LR_1, X_wrapper1, y, cv=5, scoring='f1').mean()
    once2 = cvs(LR_2, X_wrapper2, y, cv=5, scoring='f1').mean()
    score1.append(once1)
    score2.append(once2)
plt.plot(range(1,12,1), score1, 'r')
plt.plot(range(1,12,1), score2, 'g')
plt.show()
After applying SMOTE:
from imblearn.over_sampling import SMOTE
from collections import Counter
x1 = df_select.drop(['用户id','上期还款差额','样本标签','本期账单余额','上期还款差额标签'],axis=1)
#x = df_select.drop(['用户id','上期还款差额','样本标签'],axis=1)
y = df_select['样本标签']
over_samples = SMOTE(random_state=111)
over_samples_x, over_samples_y = over_samples.fit_resample(x1, y)
Counter(over_samples_y)
Counter({0: 4899, 1: 4899})
LR_1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.6, max_iter=1000, random_state=0)
LR_2 = LogisticRegression(penalty='l2', solver='liblinear', C=0.6, max_iter=1000, random_state=0)
score1 = []
score2 = []
for i in range(1,12,1):
    selector1 = RFE(LR_1, n_features_to_select=i, step=1)
    selector2 = RFE(LR_2, n_features_to_select=i, step=2)
    X_wrapper1 = selector1.fit_transform(over_samples_x, over_samples_y)
    X_wrapper2 = selector2.fit_transform(over_samples_x, over_samples_y)
    once1 = cvs(LR_1, X_wrapper1, over_samples_y, cv=5, scoring='f1').mean()
    once2 = cvs(LR_2, X_wrapper2, over_samples_y, cv=5, scoring='f1').mean()
    score1.append(once1)
    score2.append(once2)
plt.plot(range(1,12,1), score1, 'r')
plt.plot(range(1,12,1), score2, 'g')
plt.show()
Class balancing improves model performance noticeably, so the samples are balanced before fitting the decision-tree model as well.
Next, plot learning curves over the tree depth max_depth to find its best value.
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report
from sklearn.tree import DecisionTreeClassifier as DTC
#x1 = df_select.drop(['用户id','上期还款差额','样本标签','本期账单余额','上期还款差额标签'],axis=1)
x = df_select.drop(['用户id','上期还款差额','样本标签'],axis=1)
y = df_select['样本标签']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=999)
from imblearn.over_sampling import SMOTE
from collections import Counter
over_samples = SMOTE(random_state=111)
over_samples_x_train, over_samples_y_train = over_samples.fit_resample(x_train, y_train)
over_samples_x_test, over_samples_y_test = over_samples.fit_resample(x_test, y_test)
Counter(over_samples_y_train)
Counter(over_samples_y_test)
Counter({0: 3445, 1: 3445})
Counter({0: 1454, 1: 1454})
dtc = DTC(criterion='gini', random_state=11, splitter='best', min_samples_leaf=10, min_samples_split=25).fit(over_samples_x_train, over_samples_y_train)
cvs(dtc, over_samples_x_train, over_samples_y_train, cv=5, scoring='f1').mean()
print(classification_report(y_test, dtc.predict(x_test)))
print(classification_report(over_samples_y_test, dtc.predict(over_samples_x_test)))
print(classification_report(over_samples_y_train, dtc.predict(over_samples_x_train)))
0.751060846446681
precision recall f1-score support
0 0.85 0.79 0.82 1454
1 0.19 0.27 0.22 267
accuracy 0.71 1721
macro avg 0.52 0.53 0.52 1721
weighted avg 0.75 0.71 0.73 1721
precision recall f1-score support
0 0.67 0.79 0.72 1454
1 0.74 0.60 0.66 1454
accuracy 0.70 2908
macro avg 0.71 0.70 0.69 2908
weighted avg 0.71 0.70 0.69 2908
precision recall f1-score support
0 0.85 0.89 0.87 3445
1 0.89 0.84 0.86 3445
accuracy 0.87 6890
macro avg 0.87 0.87 0.87 6890
weighted avg 0.87 0.87 0.87 6890
L_train = []
L_test = []
L_CVS = []
for i in range(2,11):
    dtc = DTC(criterion='gini', random_state=11, splitter='best', max_depth=i, min_samples_leaf=10, min_samples_split=25)
    dtc.fit(over_samples_x_train, over_samples_y_train)
    once = cvs(dtc, over_samples_x_train, over_samples_y_train, cv=5, scoring='f1').mean()
    L_CVS.append(once)
    L_train.append(f1_score(over_samples_y_train, dtc.predict(over_samples_x_train)))
    L_test.append(f1_score(over_samples_y_test, dtc.predict(over_samples_x_test)))
plt.plot(range(2,11), L_CVS, 'r')   # cross-validation score
plt.plot(range(2,11), L_train, 'g') # training set
plt.plot(range(2,11), L_test, 'b')  # oversampled test set
# The curves point to max_depth=5
dtc = DTC(criterion='gini', random_state=11, splitter='best', max_depth=5, min_samples_leaf=10, min_samples_split=25).fit(over_samples_x_train, over_samples_y_train)
cvs(dtc, over_samples_x_train, over_samples_y_train, cv=5, scoring='f1').mean()
print(classification_report(y_test, dtc.predict(x_test)))
print(classification_report(over_samples_y_test, dtc.predict(over_samples_x_test)))
print(classification_report(over_samples_y_train, dtc.predict(over_samples_x_train)))
0.6666266170104387
precision recall f1-score support
0 0.87 0.74 0.80 1454
1 0.22 0.40 0.29 267
accuracy 0.69 1721
macro avg 0.55 0.57 0.54 1721
weighted avg 0.77 0.69 0.72 1721
precision recall f1-score support
0 0.70 0.74 0.72 1454
1 0.72 0.68 0.70 1454
accuracy 0.71 2908
macro avg 0.71 0.71 0.71 2908
weighted avg 0.71 0.71 0.71 2908
precision recall f1-score support
0 0.73 0.76 0.74 3445
1 0.75 0.72 0.73 3445
accuracy 0.74 6890
macro avg 0.74 0.74 0.74 6890
weighted avg 0.74 0.74 0.74 6890
Feature importances
dtc.feature_importances_
features_imp = pd.Series(dtc.feature_importances_, index=x.columns).sort_values(ascending=False)
features_imp
array([0.10177878, 0.01525748, 0.01266462, 0.06421788, 0.04764356,
0. , 0. , 0.00445492, 0.00373405, 0.04433287,
0.70591584])
上期还款差额标签 0.705916
性别 0.101779
平均支出 0.064218
平均工资收入 0.047644
浏览行为数据 0.044333
教育程度 0.015257
婚姻状态 0.012665
本期账单最低还款额 0.004455
消费笔数 0.003734
本期账单余额 0.000000
信用卡额度 0.000000
dtype: float64
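For a quicker read of this ranking, the importances can also be plotted directly (an optional one-liner on the features_imp series above):
features_imp.sort_values().plot(kind='barh', figsize=(8,6))
plt.show()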
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.model_selection import GridSearchCV
rfc = RFC(n_estimators=200, random_state=90).fit(over_samples_x_train, over_samples_y_train)
score_pre = cvs(rfc, over_samples_x_train, over_samples_y_train, cv=5, scoring='f1').mean()
score_pre
print(classification_report(y_train, rfc.predict(x_train)))
print(classification_report(over_samples_y_train, rfc.predict(over_samples_x_train)))
print(classification_report(over_samples_y_test, rfc.predict(over_samples_x_test)))
0.8429840630249139
precision recall f1-score support
0 1.00 1.00 1.00 3445
1 1.00 1.00 1.00 569
accuracy 1.00 4014
macro avg 1.00 1.00 1.00 4014
weighted avg 1.00 1.00 1.00 4014
precision recall f1-score support
0 1.00 1.00 1.00 3445
1 1.00 1.00 1.00 3445
accuracy 1.00 6890
macro avg 1.00 1.00 1.00 6890
weighted avg 1.00 1.00 1.00 6890
precision recall f1-score support
0 0.69 0.93 0.79 1454
1 0.89 0.58 0.70 1454
accuracy 0.75 2908
macro avg 0.79 0.75 0.75 2908
weighted avg 0.79 0.75 0.75 2908
Model tuning: some hyperparameters have no natural reference point, so a concrete value or range is hard to pick up front. In that case, first narrow down the range with a learning curve, then pin down the best value with a grid search.
Tuning n_estimators
score1 = []
for i in range(0,200,10):
    rfc = RFC(n_estimators=i+1, n_jobs=-1, random_state=90)
    score = cvs(rfc, over_samples_x_train, over_samples_y_train, cv=5, scoring='f1').mean()
    score1.append(score)
print('Best score:', max(score1))
print('Best n_estimators:', (score1.index(max(score1))*10)+1)
plt.plot(range(1,201,10), score1)
plt.show()
Best score: 0.8433495577693615
Best n_estimators: 191
Tuning max_depth
param_grid = {'max_depth': np.arange(1,20,1)}
rfc = RFC(n_estimators=150, random_state=90, n_jobs=-1)
GS = GridSearchCV(rfc, param_grid, cv=5, scoring='f1')
GS.fit(over_samples_x_train, over_samples_y_train)
GS.best_params_
GS.best_score_
GridSearchCV(cv=5,
estimator=RandomForestClassifier(n_estimators=150, n_jobs=-1,
random_state=90),
param_grid={
'max_depth': array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19])},
scoring='f1')
{'max_depth': 18}
0.8397335310010792
Tuning max_features
param_grid = {'max_features': np.arange(4,12,1)}
rfc = RFC(n_estimators=150, random_state=90)
GS = GridSearchCV(rfc, param_grid, cv=5, scoring='f1')
GS.fit(over_samples_x_train, over_samples_y_train)
GS.best_params_
GS.best_score_
GridSearchCV(cv=5,
estimator=RandomForestClassifier(n_estimators=150,
random_state=90),
param_grid={
'max_features': array([ 4, 5, 6, 7, 8, 9, 10, 11])},
scoring='f1')
{'max_features': 6}
0.8449510789812068
Tuning min_samples_leaf
param_grid = {'min_samples_leaf': np.arange(1,11,1)}
rfc = RFC(n_estimators=150, random_state=90, n_jobs=-1)
GS = GridSearchCV(rfc, param_grid, cv=5, scoring='f1')
GS.fit(over_samples_x_train, over_samples_y_train)
GS.best_params_
GS.best_score_
GridSearchCV(cv=5,
estimator=RandomForestClassifier(n_estimators=150, n_jobs=-1,
random_state=90),
param_grid={
'min_samples_leaf': array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])},
scoring='f1')
{'min_samples_leaf': 1}
0.8415850321705683
Tuning min_samples_split
param_grid = {'min_samples_split': np.arange(2,22,1)}
rfc = RFC(n_estimators=150, random_state=90, n_jobs=-1)
GS = GridSearchCV(rfc, param_grid, cv=5, scoring='f1')
GS.fit(over_samples_x_train, over_samples_y_train)
GS.best_params_
GS.best_score_
GridSearchCV(cv=5,
estimator=RandomForestClassifier(n_estimators=150, n_jobs=-1,
random_state=90),
param_grid={
'min_samples_split': array([ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21])},
scoring='f1')
{'min_samples_split': 2}
0.8415850321705683
Tuning criterion
param_grid = {'criterion': ['gini','entropy']}
rfc = RFC(n_estimators=150, random_state=90, n_jobs=-1)
GS = GridSearchCV(rfc, param_grid, cv=5, scoring='f1')
GS.fit(over_samples_x_train, over_samples_y_train)
GS.best_params_
GS.best_score_
GridSearchCV(cv=5,
estimator=RandomForestClassifier(n_estimators=150, n_jobs=-1,
random_state=90),
param_grid={
'criterion': ['gini', 'entropy']}, scoring='f1')
{'criterion': 'entropy'}
0.8416553782456979
rfc = RFC(n_estimators=150, criterion='entropy', max_depth=18, max_features=6, class_weight='balanced',
          min_samples_leaf=1, min_samples_split=2, random_state=90).fit(over_samples_x_train, over_samples_y_train)
score_pre = cvs(rfc, over_samples_x_train, over_samples_y_train, cv=5, scoring='f1').mean()
score_pre
print(classification_report(y_test, rfc.predict(x_test)))
print(classification_report(over_samples_y_train, rfc.predict(over_samples_x_train)))
print(classification_report(over_samples_y_test, rfc.predict(over_samples_x_test)))
0.8378346907385964
precision recall f1-score support
0 0.86 0.92 0.89 1454
1 0.28 0.18 0.22 267
accuracy 0.80 1721
macro avg 0.57 0.55 0.55 1721
weighted avg 0.77 0.80 0.78 1721
precision recall f1-score support
0 1.00 1.00 1.00 3445
1 1.00 1.00 1.00 3445
accuracy 1.00 6890
macro avg 1.00 1.00 1.00 6890
weighted avg 1.00 1.00 1.00 6890
precision recall f1-score support
0 0.69 0.92 0.79 1454
1 0.88 0.58 0.70 1454
accuracy 0.75 2908
macro avg 0.78 0.75 0.74 2908
weighted avg 0.78 0.75 0.74 2908
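The competition ultimately asks for default predictions on the held-out test users, which the walkthrough stops short of. A hedged sketch of that last step, assuming the test tables have been pushed through the same preprocessing to yield a df_select_test frame with the same feature columns (df_select_test and the output filename are hypothetical names):
# Refit the tuned forest on the full, SMOTE-balanced training data
x_full, y_full = SMOTE(random_state=111).fit_resample(x, y)
final_rfc = RFC(n_estimators=150, criterion='entropy', max_depth=18, max_features=6,
                class_weight='balanced', random_state=90).fit(x_full, y_full)
# Probability of default for each test user, written out as a submission file
prob = final_rfc.predict_proba(df_select_test[x.columns])[:, 1]
pd.DataFrame({'userid': df_select_test['用户id'], 'probability': prob}).to_csv('submission.csv', index=False)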