【DS Practice | Coursera】Assignment 4 | Introduction to Data Science in Python (Mart_inn's blog)

Tags: Data Science with Python  python  data analysis  machine learning  data mining  Coursera  


1. Problem Analysis

1.1 Problem Description

Description
In this assignment you must read in a file of metropolitan regions and associated sports teams from assets/wikipedia_data.html and answer some questions about each metropolitan region. Each of these regions may have one or more teams from the “Big 4”: NFL (football, in assets/nfl.csv), MLB (baseball, in assets/mlb.csv), NBA (basketball, in assets/nba.csv) or NHL (hockey, in assets/nhl.csv). Please keep in mind that all questions are from the perspective of the metropolitan region, and that this file is the “source of authority” for the location of a given sports team. Thus teams which are commonly known by a different area (e.g. “Oakland Raiders”) need to be mapped into the metropolitan region given (e.g. San Francisco Bay Area). This will require some human data understanding outside of the data you’ve been given (e.g. you will have to hand-code some names, and might need to google to find out where teams are)!

For each sport I would like you to answer the question: what is the win/loss ratio’s correlation with the population of the city it is in? Win/loss ratio refers to the number of wins over the number of wins plus the number of losses. Remember to calculate the correlation with pearsonr: you are going to send in two ordered lists of values, the populations from the wikipedia_data.html file and the win/loss ratio for a given sport in the same order. Average the win/loss ratios for those cities which have multiple teams of a single sport. Each sport is worth an equal amount (20%*4 = 80%) of the grade for this assignment. You should only use data from year 2018 for your analysis – this is important!

Notes

  1. Do not include data about the MLS or CFL in any of the work you are doing, we’re only interested in the Big 4 in this assignment.
  2. I highly suggest that you first tackle the four correlation questions in order, as they are all similar and worth the majority of grades for this assignment. This is by design!
  3. It’s fair game to talk with peers about high-level strategy as well as the relationship between metropolitan areas and sports teams. However, do not post code solving aspects of the assignment (such as dictionaries mapping areas to teams, or regexes which will clean up names).
  4. There may be more teams than the assert statements test, remember to collapse multiple teams in one city into a single value!

1.2 Solution Approach

  1. Parse the HTML and CSV data into DataFrames

  (For reading CSV files, see the earlier post on the csv module and Python basics)

  (For creating and manipulating DataFrames in Pandas, see the earlier post on the structure, creation, querying and modification of Series & DataFrame)

  Q1-Q4 all ask for the correlation between a Big 4 league's 2018 win/loss ratio and city population. Loading the CSV files under assets into DataFrames, we find they contain only win/loss records, with no host city and no population. Fortunately the description suggests looking up city populations on the Wikipedia page supplied as wikipedia_data.html, so what we need to do is build a second DataFrame from that page linking each city's population to its teams.

  2. Normalize team names with regular expressions

  (For regular expression usage, see the earlier post on the re module and regular expressions)

  However, the team-name formats in the HTML file and the CSV files differ, which is where regular expressions come in. In the HTML table one city maps to several team names that are read in as a single concatenated string, so they must be split apart; some carry footnotes that must be stripped; and they are nicknames only (e.g. 'Red Sox'), whereas the CSV holds full names (e.g. 'Boston Red Sox') whose leading words are not necessarily the host city (see the example in the description). The approach used here first measures the word count of each team name in the HTML-derived DataFrame: if the nickname is one word long, match only the last word of the split CSV name; if it is two words long, match the last two words; and so on.
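  The suffix-matching idea above can be sketched as follows (`match_by_suffix` is a hypothetical helper for illustration, not part of the assignment code):

```python
# Hypothetical sketch of the suffix-matching idea: a nickname from the
# HTML table matches when it equals the tail of the full CSV team name.

def match_by_suffix(nickname_words, full_name_words):
    """Return True if the full name ends with exactly the nickname's words."""
    n = len(nickname_words)
    return full_name_words[-n:] == nickname_words

# 'Red Sox' (nickname) vs 'Boston Red Sox' (full name)
print(match_by_suffix(['Red', 'Sox'], ['Boston', 'Red', 'Sox']))  # True
# 'Raiders' vs 'Oakland Raiders'
print(match_by_suffix(['Raiders'], ['Oakland', 'Raiders']))       # True
```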

  3. Merge the two tables and aggregate by group

  (For merging and grouped aggregation of DataFrames, see the earlier post on DataFrame merging, groupby aggregation and pivot tables)

  Inner-join the HTML-derived DataFrame with the CSV-derived one, keyed on team name, to get each Big 4 team's metropolitan area and that area's population.

  After merging, cast the data to int, group by city, and sum each city's wins and losses separately (if you average the per-team ratios first and then aggregate, your numbers may differ from mine; people on the forum asking why values that are very close still fail are hitting exactly this). Then compute each city's win/loss ratio and pass it, together with the populations, to stats.pearsonr to get the Pearson correlation coefficient.
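  A minimal illustration of why the aggregation order matters (made-up numbers, not assignment data): summing W and L per region first gives a slightly different ratio than averaging per-team ratios whenever the teams played different numbers of games.

```python
import pandas as pd

# Made-up records: 'City A' hosts two teams with different game counts.
df = pd.DataFrame({
    'Region': ['City A', 'City A', 'City B'],
    'W': [50, 30, 55],
    'L': [32, 20, 27],
})

# Sum first, then form the ratio (the approach used in this post)
summed = df.groupby('Region').agg({'W': 'sum', 'L': 'sum'})
sum_first = summed['W'] / (summed['W'] + summed['L'])

# Average the per-team ratios instead
per_team = df['W'] / (df['W'] + df['L'])
avg_first = per_team.groupby(df['Region']).mean()

print(sum_first['City A'])  # 80 / 132
print(avg_first['City A'])  # mean of 50/82 and 30/50 -- not the same value
```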

  4. Compute the t-test p-value matrix

  (For the mathematical theory and implementation of the t-test, see the earlier posts on hypothesis testing, significance levels and t-test theory, and on the scipy.stats module)

  To obtain the t-test p-value matrix, take the win/loss ratios of the cities shared by each pair of leagues as paired variables and run a paired t-test on every pair of leagues.
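  The pairing step can be sketched like this (illustrative win/loss ratios, not from the assignment data):

```python
import numpy as np
from scipy import stats

# Made-up win/loss ratios for cities that host teams in both leagues;
# ttest_rel treats the two arrays as paired observations.
league_a = np.array([0.55, 0.48, 0.61, 0.52, 0.44])
league_b = np.array([0.50, 0.47, 0.58, 0.56, 0.41])

t_stat, p_value = stats.ttest_rel(league_a, league_b)
print(p_value)  # a large p-value means no evidence the leagues differ
```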


2. Code with Comments

2.1 Shared by Q1-Q4: building the merged city win-rate and population table

import pandas as pd
import numpy as np
import scipy.stats as stats
import re

def get_WLrate_Population(Big4):
    # Parameter: the league name, one of 'NFL', 'NBA', 'NHL', 'MLB'

    # Read the HTML file
    cities = pd.read_html("assets/wikipedia_data.html")[1]
    cities = cities.iloc[:-1, [0, 3, 5, 6, 7, 8]]

    # 1. Build the city-population-teams table
    # Rename the population column
    cities.rename(columns={'Population (2016 est.)[8]': 'Population'}, inplace=True)

    # Strip footnotes in square brackets
    cities.replace(to_replace=r'\[[\w ]*\]$', value='', regex=True, inplace=True)

    # Keep the metropolitan area, the population and the current league's teams
    cities = cities[['Metropolitan area', 'Population', Big4]]

    # Drop rows whose value is empty or '—'
    cities.replace(to_replace='—', value=np.nan, regex=True, inplace=True)
    cities.replace(to_replace='', value=np.nan, regex=True, inplace=True)
    cities.dropna(inplace=True)

    # Reset the index and drop the old one
    cities.reset_index(inplace=True)
    cities.drop('index', axis=1, inplace=True)

    # Handle cities with several teams: build a list whose elements are
    # [city name, team name]
    pattern = r'([A-Z]{0,2}[a-z0-9]+ [A-Z]{0,2}[a-z]+|[A-Z]{0,2}[a-z0-9]+)'
    cty_team = []
    for i in range(len(cities)):
        a = re.findall(pattern, cities.iloc[i, 2])
        for j in a:
            cty_team.append([cities.iloc[i, 0], j])

    # Split each team name on spaces so its word count is easy to measure
    for i in cty_team:
        i[1] = re.split(' ', i[1])

    # 2. Build the city-team-record table

    # Keep the 2018 season data and drop the division header rows
    big4_df = pd.read_csv('assets/' + Big4.lower() + '.csv')
    big4_df2018 = big4_df[big4_df['year'] == 2018]

    # Keep only the 'W', 'L' and 'team' columns
    big4_df2018 = big4_df2018[['W', 'L', 'team']]

    # Strip '*'/'+' markers and footnotes from the team names.
    # Note: some names end with whitespace; without handling it,
    # '*$' alone cannot match the asterisk.
    big4_df2018['team'].replace(to_replace=r'\([0-9]{0,2}\)$', value='', regex=True, inplace=True)
    big4_df2018['team'].replace(to_replace=r'[\*+]?[\s]?$', value='', regex=True, inplace=True)

    # Split every full team name on spaces
    lists = []
    for i in range(len(big4_df2018['team'])):
        lists.append(re.split(' ', big4_df2018['team'][i]))

    # Attach the split names as a new 'team_name' column
    big4_df2018['team_name'] = lists

    # Attach the metropolitan area. Logic: if a nickname in cty_team is one
    # word long, the last word of the full name must match; if it is two
    # words long, the last two words must match.
    def get_region(item):
        for i in cty_team:
            if len(i[-1]) == 1 and i[-1][-1] == item[-1]:
                return i[0]
            elif len(i[-1]) == 2 and i[-1][-1] == item[-1] and i[-1][-2] == item[-2]:
                return i[0]
    big4_df2018['Region'] = big4_df2018['team_name'].apply(get_region)

    # Merge the two tables
    merge1 = pd.merge(big4_df2018, cities, left_on='Region', right_on='Metropolitan area')

    # Cast to int first, then group by city and sum
    merge1['W'] = merge1['W'].astype('int')
    merge1['L'] = merge1['L'].astype('int')

    # Grouped aggregation: total wins, total losses and the city population
    merge2 = merge1.groupby('Region').agg({'W': np.sum, 'L': np.sum, 'Population': np.unique})
    merge2['W/L rate'] = merge2['W'] / (merge2['L'] + merge2['W'])
    merge2['Population'] = merge2['Population'].astype('int')

    # Return the per-city win/loss ratio and population DataFrame
    return merge2

As noted in the comments above, if the trailing whitespace is not handled first, '\*$' has nothing to match and the asterisk survives.
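A quick check of why the trailing whitespace matters (the example string is made up for illustration):

```python
import re

name = 'Boston Red Sox* '   # note the trailing space after the asterisk

# '\*$' alone does not match, because '*' is not the last character:
print(re.sub(r'\*$', '', name))            # string is unchanged
# allowing optional trailing whitespace removes both characters:
print(re.sub(r'[\*+]?[\s]?$', '', name))   # 'Boston Red Sox'
```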


2.2 Solving Q1

Just pull out the relevant columns and compute the Pearson coefficient; Q2-Q4 work the same way:

def nhl_correlation(): 
    merge2=get_WLrate_Population('NHL')
    population_by_region = merge2['Population'] # pass in metropolitan area population from cities
    win_loss_by_region = merge2['W/L rate'] # pass in win/loss ratio from nhl_df in the same order as cities["Metropolitan area"]

    assert len(population_by_region) == len(win_loss_by_region), "Q1: Your lists must be the same length"
    assert len(population_by_region) == 28, "Q1: There should be 28 teams being analysed for NHL"
    
    return stats.pearsonr(population_by_region, win_loss_by_region)[0]
nhl_correlation()

Result: 0.012308996455744264
So the NHL win/loss ratio shows no meaningful correlation with city population.
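As a sanity check on the pearsonr call itself (toy data, not assignment data), note that it returns a pair and the assignment wants only the first element:

```python
from scipy import stats

# pearsonr returns (correlation coefficient, p-value); index [0] takes r.
r, p = stats.pearsonr([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.9])
print(r)   # close to 1 for this nearly-linear toy data
```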


2.3 Solving Q2

def nba_correlation():
    merge2=get_WLrate_Population('NBA')
    population_by_region = merge2['Population'] # pass in metropolitan area population from cities
    win_loss_by_region = merge2['W/L rate'] # pass in win/loss ratio from nba_df in the same order as cities["Metropolitan area"]

    assert len(population_by_region) == len(win_loss_by_region), "Q2: Your lists must be the same length"
    assert len(population_by_region) == 28, "Q2: There should be 28 teams being analysed for NBA"

    return stats.pearsonr(population_by_region, win_loss_by_region)[0]
nba_correlation()

Result: -0.17657160252844614
So the NBA win/loss ratio shows no meaningful correlation with city population either.


2.4 Solving Q3

def mlb_correlation(): 
    merge2=get_WLrate_Population('MLB')
    population_by_region = merge2['Population'] # pass in metropolitan area population from cities
    win_loss_by_region = merge2['W/L rate'] # pass in win/loss ratio from mlb_df in the same order as cities["Metropolitan area"]

    assert len(population_by_region) == len(win_loss_by_region), "Q3: Your lists must be the same length"
    assert len(population_by_region) == 26, "Q3: There should be 26 teams being analysed for MLB"

    return stats.pearsonr(population_by_region, win_loss_by_region)[0]
mlb_correlation()

Result: 0.1505230448710485
So the MLB win/loss ratio shows no meaningful correlation with city population.

2.5 Solving Q4

def nfl_correlation(): 

    merge2=get_WLrate_Population('NFL')
    population_by_region = merge2['Population'] # pass in metropolitan area population from cities
    win_loss_by_region = merge2['W/L rate'] # pass in win/loss ratio from nfl_df in the same order as cities["Metropolitan area"]

    assert len(population_by_region) == len(win_loss_by_region), "Q4: Your lists must be the same length"
    assert len(population_by_region) == 29, "Q4: There should be 29 teams being analysed for NFL"

    return stats.pearsonr(population_by_region, win_loss_by_region)[0]
nfl_correlation()

Result: 0.004922112149349409
So the NFL win/loss ratio shows no meaningful correlation with city population.


2.6 Solving Q5

A double for loop over the four leagues fills in the p-value matrix:

import pandas as pd
import numpy as np
import scipy.stats as stats
#import re
# scipy.stats.ttest_rel reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html

mlb_df=pd.read_csv("assets/mlb.csv")
nhl_df=pd.read_csv("assets/nhl.csv")
nba_df=pd.read_csv("assets/nba.csv")
nfl_df=pd.read_csv("assets/nfl.csv")
cities=pd.read_html("assets/wikipedia_data.html")[1]
cities=cities.iloc[:-1,[0,3,5,6,7,8]]

def sports_team_performance():
    # Note: p_values is a full dataframe, so df.loc["NFL","NBA"] should be the same as df.loc["NBA","NFL"] and
    # df.loc["NFL","NFL"] should return np.nan
    sports = ['NFL', 'NBA', 'NHL', 'MLB']
    p_values = pd.DataFrame({k: np.nan for k in sports}, index=sports)
    
    for i in sports:
        for j in sports:
            if i!=j:
                Mi=get_WLrate_Population(i)
                Mj=get_WLrate_Population(j)
                Mi=Mi['W/L rate']
                Mj=Mj['W/L rate']
                merge=pd.merge(Mi,Mj,how='inner',left_index=True,right_index=True)
                #print(merge)
                p_values.loc[i, j]=stats.ttest_rel(merge['W/L rate_x'],merge['W/L rate_y'])[1]
    assert abs(p_values.loc["NBA", "NHL"] - 0.02) <= 1e-2, "The NBA-NHL p-value should be around 0.02"
    assert abs(p_values.loc["MLB", "NFL"] - 0.80) <= 1e-2, "The MLB-NFL p-value should be around 0.80"
    return p_values
sports_team_performance()

The returned p_values is a symmetric 4×4 DataFrame with NaN on the diagonal; for reference, the NBA-NHL entry is around 0.02 and the MLB-NFL entry around 0.80, matching the asserts above.


Summary

This assignment is moderately difficult and calls for flexible use of everything covered before; with enough practice it all falls into place.

Copyright notice: this is an original post by the author, released under the CC 4.0 BY-SA license; please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/Mart_inn/article/details/122312148
