Using the Pandas Python Library¶

http://realpython.com/pandas-python-explore-dataset

In this tutorial, you’ll learn how to:

Calculate metrics about your data
Perform basic queries and aggregations
Discover and handle incorrect data, inconsistencies, and missing values
Visualize your data with plots

You’ll also learn about the differences between the main data structures that Pandas and Python use. To follow along, you can get all of the example code in this tutorial at the link below:

nba dataset¶

import requests

download_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv"
target_csv_path = "nba_all_elo.csv"

response = requests.get(download_url)
response.raise_for_status()    # Check that the request was successful
with open(target_csv_path, "wb") as f:
    f.write(response.content)
print("Download ready.")

Download ready.

import pandas as pd
nba = pd.read_csv("nba_all_elo.csv")
type(ds)

pandas.core.frame.DataFrame

len(nba)

126314

nba.shape

(126314, 23)

nba.head()

pd.set_option("display.precision", 2)

nba.tail()

nba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 23 columns):
gameorder        126314 non-null int64
game_id          126314 non-null object
lg_id            126314 non-null object
_iscopy          126314 non-null int64
year_id          126314 non-null int64
date_game        126314 non-null object
seasongame       126314 non-null int64
is_playoffs      126314 non-null int64
team_id          126314 non-null object
fran_id          126314 non-null object
pts              126314 non-null int64
elo_i            126314 non-null float64
elo_n            126314 non-null float64
win_equiv        126314 non-null float64
opp_id           126314 non-null object
opp_fran         126314 non-null object
opp_pts          126314 non-null int64
opp_elo_i        126314 non-null float64
opp_elo_n        126314 non-null float64
game_location    126314 non-null object
game_result      126314 non-null object
forecast         126314 non-null float64
notes            5424 non-null object
dtypes: float64(6), int64(7), object(10)
memory usage: 22.2+ MB

Getting to know the data¶

nba.describe()

import numpy as np
nba.describe(include=np.object)

nba["team_id"].value_counts()

BOS    5997
NYK    5769
LAL    5078
DET    4985
PHI    4533
       ... 
PIT      60
INJ      60
DTF      60
TRH      60
SDS      11
Name: team_id, Length: 104, dtype: int64

nba["fran_id"].value_counts()

Lakers          6024
Celtics         5997
Knicks          5769
Warriors        5657
Pistons         5650
Sixers          5644
Hawks           5572
Kings           5475
Wizards         4582
Spurs           4309
Bulls           4307
Pacers          4227
Thunder         4178
Rockets         4154
Nuggets         4120
Nets            4106
Suns            4080
Bucks           4034
Trailblazers    3870
Cavaliers       3810
Clippers        3733
Jazz            3555
Mavericks       3013
Heat            2371
Pelicans        2254
Magic           2207
Timberwolves    2131
Grizzlies       1657
Raptors         1634
Hornets          894
Colonels         846
Squires          799
Spirits          777
Stars            756
Sounds           697
Baltimore        467
Floridians       440
Condors          430
Capitols         291
Olympians        282
Sails            274
Stags            260
Bombers          249
Steamrollers     168
Packers           72
Redskins          65
Rebels            63
Denver            62
Waterloo          62
Falcons           60
Ironmen           60
Huskies           60
Jets              60
Name: fran_id, dtype: int64

nba.loc[nba["fran_id"] == "Lakers", "team_id"].value_counts()

LAL    5078
MNL     946
Name: team_id, dtype: int64

nba.loc[nba["team_id"] == "MNL", "date_game"].agg(("min", "max"))

min    1/1/1949
max    4/9/1959
Name: date_game, dtype: object

nba.loc[nba["team_id"] == "MNL", "date_game"].max()

'4/9/1959'

nba.loc[nba["team_id"] == "MNL", "date_game"].min()

'1/1/1949'

Getting to know Pandas Data Structures¶

revenues = pd.Series([5555, 7000, 1980])
revenues

0    5555
1    7000
2    1980
dtype: int64

revenues.values

array([5555, 7000, 1980], dtype=int64)

revenues.index

RangeIndex(start=0, stop=3, step=1)

type(revenues.values)

numpy.ndarray

city_revenues = pd.Series([4200, 8000, 6500], index=["Amsterdam", "London", "New York"])

city_revenues

Amsterdam    4200
London       8000
New York     6500
dtype: int64

city_employee_count = pd.Series({"Amsterdam": 5, "London": 8})
city_employee_count

Amsterdam    5
London       8
dtype: int64

city_employee_count.keys()

Index(['Amsterdam', 'London'], dtype='object')

"Tokyo" in city_employee_count

False

"London" in city_employee_count

True

city_data = pd.DataFrame({"revenue": city_revenues, "employee_count": city_employee_count})

city_data

city_data.index

Index(['Amsterdam', 'London', 'New York'], dtype='object')

city_data.values

array([[4.2e+03, 5.0e+00],
       [8.0e+03, 8.0e+00],
       [6.5e+03,     nan]])

city_data.axes

[Index(['Amsterdam', 'London', 'New York'], dtype='object'),
 Index(['revenue', 'employee_count'], dtype='object')]

city_data.axes[0]

Index(['Amsterdam', 'London', 'New York'], dtype='object')

city_data.axes[1]

Index(['revenue', 'employee_count'], dtype='object')

city_data.keys()

Index(['revenue', 'employee_count'], dtype='object')

"revenue" in city_data

True

"Tokyo" in city_data

False

nba.index

RangeIndex(start=0, stop=126314, step=1)

nba.axes

[RangeIndex(start=0, stop=126314, step=1),
 Index(['gameorder', 'game_id', 'lg_id', '_iscopy', 'year_id', 'date_game',
        'seasongame', 'is_playoffs', 'team_id', 'fran_id', 'pts', 'elo_i',
        'elo_n', 'win_equiv', 'opp_id', 'opp_fran', 'opp_pts', 'opp_elo_i',
        'opp_elo_n', 'game_location', 'game_result', 'forecast', 'notes'],
       dtype='object')]

"pts" in nba.keys()

True

Accessing series elements¶

city_revenues["London"]

8000

city_revenues[1]

8000

city_revenues[-1]

6500

city_revenues[1:]

London      8000
New York    6500
dtype: int64

city_revenues["New York"]

6500

colors = pd.Series(["red", "purple", "blue", "green", "yellow"], index=[1, 2, 3, 5, 8])

colors

1       red
2    purple
3      blue
5     green
8    yellow
dtype: object

colors.loc[1]

'red'

colors.iloc[1]

'purple'

colors.iloc[1:3]

2    purple
3      blue
dtype: object

colors.loc[3:8]

3      blue
5     green
8    yellow
dtype: object

colors.iloc[-2]

'green'

city_data["revenue"]

Amsterdam    4200
London       8000
New York     6500
Name: revenue, dtype: int64

type(city_data["revenue"])

pandas.core.series.Series

city_data.revenue

Amsterdam    4200
London       8000
New York     6500
Name: revenue, dtype: int64

city_data

toys = pd.DataFrame([
    {"name": "ball", "shape": "sphere"},
    {"name": "rubik's cube", "shape": "cube"}    
])

toys["shape"]

0    sphere
1      cube
Name: shape, dtype: object

city_data.loc["Amsterdam"]

revenue           4200.0
employee_count       5.0
Name: Amsterdam, dtype: float64

city_data.iloc[1]

revenue           8000.0
employee_count       8.0
Name: London, dtype: float64

nba.iloc[-2]

gameorder               63157
game_id          201506170CLE
lg_id                     NBA
_iscopy                     0
year_id                  2015
date_game           6/16/2015
seasongame                102
is_playoffs                 1
team_id                   CLE
fran_id             Cavaliers
pts                        97
elo_i                 1.7e+03
elo_n                 1.7e+03
win_equiv                  59
opp_id                    GSW
opp_fran             Warriors
opp_pts                   105
opp_elo_i             1.8e+03
opp_elo_n             1.8e+03
game_location               H
game_result                 L
forecast                 0.48
notes                     NaN
Name: 126312, dtype: object

nba[
    (nba["_iscopy"] == 0) &
    (nba["pts"] > 100) &
    (nba["opp_pts"] > 100) &
    (nba["team_id"] == "BLB")
]

nba[
    (nba["_iscopy"] == 0) &
    (nba["team_id"].str.startswith("LA")) &
    (nba["year_id"] == 1992) &
    (nba["notes"].notnull())
    
]

Grouping and aggregating the data¶

city_revenues.sum()

18700

city_revenues

Amsterdam    4200
London       8000
New York     6500
dtype: int64

city_revenues.max()

8000

points = nba["pts"]

type(points)

pandas.core.series.Series

points.sum()

12976235

nba.groupby("fran_id", sort=False)["pts"].sum()

fran_id
Huskies           3995
Knicks          582497
Stags            20398
Falcons           3797
Capitols         22387
Celtics         626484
Steamrollers     12372
Ironmen           3674
Bombers          17793
Rebels            4474
Warriors        591224
Baltimore        37219
Jets              4482
Pistons         572758
Lakers          637444
Kings           569245
Hawks           567261
Denver            4818
Olympians        22864
Redskins          5372
Waterloo          4921
Packers           6193
Sixers          585891
Wizards         474809
Bulls           437269
Thunder         437735
Squires          91127
Stars            84940
Rockets         432504
Colonels         94435
Pacers          438288
Nuggets         445780
Spurs           453822
Spirits          85874
Sounds           75582
Floridians       49568
Nets            417809
Condors          49642
Bucks           418326
Suns            437486
Clippers        380523
Cavaliers       380416
Trailblazers    402695
Sails            30080
Jazz            363155
Mavericks       309239
Pelicans        220794
Heat            229103
Timberwolves    207693
Magic           219436
Grizzlies       157683
Raptors         158370
Hornets          84489
Name: pts, dtype: int64

maxpoints = nba.groupby("fran_id", sort=False)["pts"].max()

maxpoints.max()

186

nba[
    (nba["fran_id"] == "Spurs") &
    (nba["year_id"] > 2010) 
].groupby(["year_id", "game_result"])["game_id"].count()

year_id  game_result
2011     L              25
         W              63
2012     L              20
         W              60
2013     L              30
         W              73
2014     L              27
         W              78
2015     L              31
         W              58
Name: game_id, dtype: int64

nba[
    (nba["fran_id"] == "Warriors") &
    (nba["year_id"] == 2015) 
].groupby(["is_playoffs", "game_result"])["game_id"].count()

is_playoffs  game_result
0            L              15
             W              67
1            L               5
             W              16
Name: game_id, dtype: int64

Manipulating Columns¶

df = nba.copy()

df.shape

(126314, 23)

df["difference"] = df.pts - df.opp_pts
df.shape

(126314, 24)

df["difference"].max()

68

renamed_df = df.rename(
    columns={"game_result": "result", "game_location": "location"}
)

renamed_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 24 columns):
gameorder      126314 non-null int64
game_id        126314 non-null object
lg_id          126314 non-null object
_iscopy        126314 non-null int64
year_id        126314 non-null int64
date_game      126314 non-null object
seasongame     126314 non-null int64
is_playoffs    126314 non-null int64
team_id        126314 non-null object
fran_id        126314 non-null object
pts            126314 non-null int64
elo_i          126314 non-null float64
elo_n          126314 non-null float64
win_equiv      126314 non-null float64
opp_id         126314 non-null object
opp_fran       126314 non-null object
opp_pts        126314 non-null int64
opp_elo_i      126314 non-null float64
opp_elo_n      126314 non-null float64
location       126314 non-null object
result         126314 non-null object
forecast       126314 non-null float64
notes          5424 non-null object
difference     126314 non-null int64
dtypes: float64(6), int64(8), object(10)
memory usage: 23.1+ MB

df.shape

(126314, 24)

elo_columns = ["elo_i", "elo_n", "opp_elo_i", "opp_elo_n"]

df.drop(elo_columns, inplace=True, axis=1)

df.shape

(126314, 20)

Specifing data types¶

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 20 columns):
gameorder        126314 non-null int64
game_id          126314 non-null object
lg_id            126314 non-null object
_iscopy          126314 non-null int64
year_id          126314 non-null int64
date_game        126314 non-null object
seasongame       126314 non-null int64
is_playoffs      126314 non-null int64
team_id          126314 non-null object
fran_id          126314 non-null object
pts              126314 non-null int64
win_equiv        126314 non-null float64
opp_id           126314 non-null object
opp_fran         126314 non-null object
opp_pts          126314 non-null int64
game_location    126314 non-null object
game_result      126314 non-null object
forecast         126314 non-null float64
notes            5424 non-null object
difference       126314 non-null int64
dtypes: float64(2), int64(8), object(10)
memory usage: 19.3+ MB

df["date_game"] = pd.to_datetime(df["date_game"])

df["game_location"].nunique()

3

df["game_location"].value_counts()

H    63138
A    63138
N       38
Name: game_location, dtype: int64

df["game_location"] = pd.Categorical(df["game_location"])

df["game_location"].dtype

CategoricalDtype(categories=['A', 'H', 'N'], ordered=False)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 20 columns):
gameorder        126314 non-null int64
game_id          126314 non-null object
lg_id            126314 non-null object
_iscopy          126314 non-null int64
year_id          126314 non-null int64
date_game        126314 non-null datetime64[ns]
seasongame       126314 non-null int64
is_playoffs      126314 non-null int64
team_id          126314 non-null object
fran_id          126314 non-null object
pts              126314 non-null int64
win_equiv        126314 non-null float64
opp_id           126314 non-null object
opp_fran         126314 non-null object
opp_pts          126314 non-null int64
game_location    126314 non-null category
game_result      126314 non-null object
forecast         126314 non-null float64
notes            5424 non-null object
difference       126314 non-null int64
dtypes: category(1), datetime64[ns](1), float64(2), int64(8), object(8)
memory usage: 18.4+ MB

df["game_result"].nunique()

2

df["game_result"].value_counts()

L    63157
W    63157
Name: game_result, dtype: int64

df["game_result"] = pd.Categorical(df["game_result"])

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 20 columns):
gameorder        126314 non-null int64
game_id          126314 non-null object
lg_id            126314 non-null object
_iscopy          126314 non-null int64
year_id          126314 non-null int64
date_game        126314 non-null datetime64[ns]
seasongame       126314 non-null int64
is_playoffs      126314 non-null int64
team_id          126314 non-null object
fran_id          126314 non-null object
pts              126314 non-null int64
win_equiv        126314 non-null float64
opp_id           126314 non-null object
opp_fran         126314 non-null object
opp_pts          126314 non-null int64
game_location    126314 non-null category
game_result      126314 non-null category
forecast         126314 non-null float64
notes            5424 non-null object
difference       126314 non-null int64
dtypes: category(2), datetime64[ns](1), float64(2), int64(8), object(7)
memory usage: 17.6+ MB

Cleaning data¶

rows_without_missing_data = nba.dropna()

rows_without_missing_data.shape

(5424, 23)

data_without_missing_columns = nba.dropna(axis=1)

data_without_missing_columns.shape

(126314, 22)

data_with_default_notes = nba.copy()

data_with_default_notes["notes"].fillna(value="no notes at all", inplace=True)

data_with_default_notes["notes"].describe()

count              126314
unique                232
top       no notes at all
freq               120890
Name: notes, dtype: object

nba[nba["pts"] == 0]

nba[(nba["pts"] > nba["opp_pts"]) & (nba["game_result"] != 'W')].empty

True

nba[(nba["pts"] < nba["opp_pts"]) & (nba["game_result"] != 'L')].empty

True

Combining multiple datasets¶

further_city_data = pd.DataFrame(
    {"revenue": [7000, 3400], "employee_count":[2, 2]},
    index=["New York", "Barcelona"]
)

all_city_data = pd.concat([city_data, further_city_data], sort=False)

all_city_data

Querying the dataset¶

current_decade = nba[nba["year_id"] > 2010]

current_decade.shape

(12658, 23)

games_with_notes = nba[nba["notes"].notnull()]

games_with_notes.shape

(5424, 23)

ers = nba[nba["fran_id"].str.endswith("ers")]

ers.shape

(27797, 23)

Visualising the dataframe¶

%matplotlib inline
nba[nba["fran_id"] == "Knicks"].groupby("year_id")["pts"].sum().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x294a51a4e48>

nba["fran_id"].value_counts().head(10).plot(kind="bar")

<matplotlib.axes._subplots.AxesSubplot at 0x294a5279e48>

nba[
    (nba["fran_id"] =="Heat") &
    (nba["year_id"] == 2013)    
]["game_result"].value_counts().plot(kind="pie")

<matplotlib.axes._subplots.AxesSubplot at 0x294a0396488>

	gameorder	_iscopy	year_id	seasongame	is_playoffs	pts	elo_i	elo_n	win_equiv	opp_pts	opp_elo_i	opp_elo_n	forecast
count	126314.00	126314.0	126314.00	126314.00	126314.00	126314.00	126314.00	126314.00	126314.00	126314.00	126314.00	126314.00	126314.00
mean	31579.00	0.5	1988.20	43.53	0.06	102.73	1495.24	1495.24	41.71	102.73	1495.24	1495.24	0.50
std	18231.93	0.5	17.58	25.38	0.24	14.81	112.14	112.46	10.63	14.81	112.14	112.46	0.22
min	1.00	0.0	1947.00	1.00	0.00	0.00	1091.64	1085.77	10.15	0.00	1091.64	1085.77	0.02
25%	15790.00	0.0	1975.00	22.00	0.00	93.00	1417.24	1416.99	34.10	93.00	1417.24	1416.99	0.33
50%	31579.00	0.5	1990.00	43.00	0.00	103.00	1500.95	1500.95	42.11	103.00	1500.95	1500.95	0.50
75%	47368.00	1.0	2003.00	65.00	0.00	112.00	1576.06	1576.29	49.64	112.00	1576.06	1576.29	0.67
max	63157.00	1.0	2015.00	108.00	1.00	186.00	1853.10	1853.10	71.11	186.00	1853.10	1853.10	0.98

	gameorder	game_id	lg_id	_iscopy	year_id	date_game	seasongame	team_id	fran_id	...	win_equiv	opp_id	opp_fran	opp_pts	opp_elo_i	opp_elo_n	game_location	game_result	forecast	notes
0	1	194611010TRH	NBA	0	1947	11/1/1946	1	TRH	Huskies	...	40.294830	NYK	Knicks	68	1300.0000	1306.7233	H	L	0.640065	NaN
1	1	194611010TRH	NBA	1	1947	11/1/1946	1	NYK	Knicks	...	41.705170	TRH	Huskies	66	1300.0000	1293.2767	A	W	0.359935	NaN
2	2	194611020CHS	NBA	0	1947	11/2/1946	1	CHS	Stags	...	42.012257	NYK	Knicks	47	1306.7233	1297.0712	H	W	0.631101	NaN
3	2	194611020CHS	NBA	1	1947	11/2/1946	2	NYK	Knicks	...	40.692783	CHS	Stags	63	1300.0000	1309.6521	A	L	0.368899	NaN
4	3	194611020DTF	NBA	0	1947	11/2/1946	1	DTF	Falcons	...	38.864048	WSC	Capitols	50	1300.0000	1320.3811	H	L	0.640065	NaN

	gameorder	game_id	lg_id	_iscopy	year_id	date_game	seasongame	is_playoffs	team_id	fran_id	...	win_equiv	opp_id	opp_fran	opp_pts	opp_elo_i	opp_elo_n	game_location	game_result	forecast	notes
126309	63155	201506110CLE	NBA	0	2015	6/11/2015	100	1	CLE	Cavaliers	...	60.31	GSW	Warriors	103	1790.96	1809.98	H	L	0.55	NaN
126310	63156	201506140GSW	NBA	0	2015	6/14/2015	102	1	GSW	Warriors	...	68.01	CLE	Cavaliers	91	1704.39	1700.74	H	W	0.77	NaN
126311	63156	201506140GSW	NBA	1	2015	6/14/2015	101	1	CLE	Cavaliers	...	60.01	GSW	Warriors	104	1809.98	1813.63	A	L	0.23	NaN
126312	63157	201506170CLE	NBA	0	2015	6/16/2015	102	1	CLE	Cavaliers	...	59.29	GSW	Warriors	105	1813.63	1822.29	H	L	0.48	NaN
126313	63157	201506170CLE	NBA	1	2015	6/16/2015	103	1	GSW	Warriors	...	68.52	CLE	Cavaliers	97	1700.74	1692.09	A	W	0.52	NaN

	game_id	lg_id	date_game	team_id	fran_id	opp_id	opp_fran	game_location	game_result	notes
count	126314	126314	126314	126314	126314	126314	126314	126314	126314	5424
unique	63157	2	12426	104	53	104	53	3	2	231
top	199611290DAL	NBA	4/16/2014	BOS	Lakers	BOS	Lakers	H	L	at New York NY
freq	2	118016	30	5997	6024	5997	6024	63138	63157	440

	gameorder	game_id	lg_id	year_id	date_game	seasongame	team_id	fran_id	...	win_equiv	opp_id	opp_fran	opp_pts	opp_elo_i	opp_elo_n	game_location	game_result	forecast	notes
1726	864	194902260BLB	NBA	1949	2/26/1949	53	BLB	Baltimore	...	38.56	MNL	Lakers	115	1637.99	1640.49	H	L	0.34	NaN
4890	2446	195301100BLB	NBA	1953	1/10/1953	32	BLB	Baltimore	...	25.80	BOS	Celtics	105	1591.14	1563.17	H	W	0.28	NaN
4909	2455	195301140BLB	NBA	1953	1/14/1953	34	BLB	Baltimore	...	24.88	MNL	Lakers	112	1665.44	1668.91	H	L	0.22	NaN
5208	2605	195303110BLB	NBA	1953	3/11/1953	66	BLB	Baltimore	...	19.58	NYK	Knicks	113	1649.15	1651.44	H	L	0.18	at Boston MA
5825	2913	195402220BLB	NBA	1954	2/22/1954	60	BLB	Baltimore	...	20.74	BOS	Celtics	111	1591.49	1593.27	H	L	0.25	at Worcester MA

	gameorder	game_id	lg_id	_iscopy	year_id	date_game	seasongame	is_playoffs	team_id	fran_id	...	win_equiv	opp_id	opp_fran	opp_pts	opp_elo_i	opp_elo_n	game_location	game_result	forecast	notes
68901	34451	199205030LAC	NBA	0	1992	5/3/1992	86	1	LAC	Clippers	...	47.89	UTA	Jazz	107	1647.51	1638.41	H	W	0.50	at Anaheim, CA (Anaheim Convention Center)
68903	34452	199205030LAL	NBA	0	1992	5/3/1992	86	1	LAL	Lakers	...	39.40	POR	Trailblazers	102	1655.12	1670.37	H	L	0.41	at Las Vegas, NV

	revenue	employee_count
Amsterdam	4200	5.0
London	8000	8.0
New York	6500	NaN
New York	7000	2.0
Barcelona	3400	2.0