Boosting Engineer Productivity: Python DataFrame - CSV & Plot

Download the code: https://github.com/allenlu2009/colab/blob/master/dataframe_demo.ipynb

Python DataFrame

Create DataFrame

Direct input
Use dict. Method 1: add the records one at a time.

import pandas as pd
dict1 = {'Name': 'Allen' , 'Sex': 'male', 'Age': 33}
dict2 = {'Name': 'Alice' , 'Sex': 'female', 'Age': 22}
dict3 = {'Name': 'Bob' , 'Sex': 'male', 'Age': 11}
data = [dict1, dict2, dict3]
df = pd.DataFrame(data)
df

	Name	Sex	Age
0	Allen	male	33
1	Alice	female	22
2	Bob	male	11

Method 2: add all the data at once.

name = ['Allen', 'Alice', 'Bob']
sex = ['male', 'female', 'male']
age = [33, 22, 11]
all_dict = {
    "Name": name,
    "Sex": sex,
    "Age": age
}
df = pd.DataFrame(all_dict)
df[['Name', 'Age']]

	Name	Age
0	Allen	33
1	Alice	22
2	Bob	11
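The two construction methods above should produce the same table. A minimal sketch to check this, assuming the same three records; the variable names `rows`, `cols`, `df_rows`, and `df_cols` are made up for illustration:

```python
import pandas as pd

# Method 1: a list of row dicts (one dict per record).
rows = [
    {"Name": "Allen", "Sex": "male", "Age": 33},
    {"Name": "Alice", "Sex": "female", "Age": 22},
    {"Name": "Bob", "Sex": "male", "Age": 11},
]
df_rows = pd.DataFrame(rows)

# Method 2: a dict of column lists (one list per column).
cols = {
    "Name": ["Allen", "Alice", "Bob"],
    "Sex": ["male", "female", "male"],
    "Age": [33, 22, 11],
}
df_cols = pd.DataFrame(cols)

# equals() compares shape, labels, and values of the two frames.
same = df_rows.equals(df_cols)
print(same)
```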

DataFrame attributes

ndim: number of dimensions, 2 for a DataFrame; axis 0 => rows; axis 1 => columns
shape: (number of rows, number of columns), not counting the index
dtypes: data type of each column (e.g. object or int64)

df.ndim

2

df.shape

(3, 3)

df.dtypes

Name    object
Sex     object
Age      int64
dtype: object

df.columns

Index(['Name', 'Sex', 'Age'], dtype='object')

df.index

RangeIndex(start=0, stop=3, step=1)

Read CSV

Download a test CSV file from https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html and pick biostats.csv.

Before reading the CSV from Google Drive, mount the drive in Colab first (see the referenced Medium article).

Use the read_csv function to read the file, and add skipinitialspace=True to strip the space that follows each delimiter.
There are two ways to call read_csv: (1) load the CSV file directly; (2) load it from a URL.
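To see why skipinitialspace matters, here is a small sketch that parses an in-memory string instead of the real file. The two sample rows imitate the layout of biostats.csv (a space after each comma); `csv_text`, `raw`, and `clean` are names made up for this illustration:

```python
import io
import pandas as pd

# A tiny sample in the style of biostats.csv: a space follows each comma.
csv_text = '"Name", "Sex", "Age"\n"Alex", "M", 41\n"Elly", "F", 30\n'

# Without skipinitialspace, the space stays part of each later field.
raw = pd.read_csv(io.StringIO(csv_text))

# With skipinitialspace=True, the space after the delimiter is dropped.
clean = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

print(list(raw.columns))    # the later column names are not clean here
print(list(clean.columns))  # plain column names
```

With the clean column names, expressions such as clean['Sex'] work as expected.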

import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
#!ls 'drive/My Drive/Colab Notebooks/'
df = pd.read_csv('drive/My Drive/Colab Notebooks/biostats.csv', skipinitialspace=True)
df

Mounted at /content/drive

	Name	Sex	Age	Height (in)	Weight (lbs)
0	Alex	M	41	74	170
1	Bert	M	42	68	166
2	Carl	M	32	70	155
3	Dave	M	39	72	167
4	Elly	F	30	66	124
5	Fran	F	33	66	115
6	Gwen	F	26	64	121
7	Hank	M	30	71	158
8	Ivan	M	53	72	175
9	Jake	M	32	69	143
10	Kate	F	47	69	139
11	Luke	M	34	72	163
12	Myra	F	23	62	98
13	Neil	M	36	75	160
14	Omar	M	38	70	145
15	Page	F	31	67	135
16	Quin	M	29	71	176
17	Ruth	F	28	65	131

url = "https://people.sc.fsu.edu/~jburkardt/data/csv/biostats.csv"
df = pd.read_csv(url, skipinitialspace=True)
df

	Name	Sex	Age	Height (in)	Weight (lbs)
0	Alex	M	41	74	170
1	Bert	M	42	68	166
2	Carl	M	32	70	155
3	Dave	M	39	72	167
4	Elly	F	30	66	124
5	Fran	F	33	66	115
6	Gwen	F	26	64	121
7	Hank	M	30	71	158
8	Ivan	M	53	72	175
9	Jake	M	32	69	143
10	Kate	F	47	69	139
11	Luke	M	34	72	163
12	Myra	F	23	62	98
13	Neil	M	36	75	160
14	Omar	M	38	70	145
15	Page	F	31	67	135
16	Quin	M	29	71	176
17	Ruth	F	28	65	131

print(df.columns); print(df.index)

Index(['Name', 'Sex', 'Age', 'Height (in)', 'Weight (lbs)'], dtype='object')
RangeIndex(start=0, stop=18, step=1)

df.ndim

2

df.shape

(18, 5)

df.dtypes

Name            object
Sex             object
Age              int64
Height (in)      int64
Weight (lbs)     int64
dtype: object

Basic Viewing Command

df.head(3)

	Name	Sex	Age	Height (in)	Weight (lbs)
0	Alex	M	41	74	170
1	Bert	M	42	68	166
2	Carl	M	32	70	155

df.tail(3)

	Name	Sex	Age	Height (in)	Weight (lbs)
15	Page	F	31	67	135
16	Quin	M	29	71	176
17	Ruth	F	28	65	131

df.shape

(18, 5)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 5 columns):
Name            18 non-null object
Sex             18 non-null object
Age             18 non-null int64
Height (in)     18 non-null int64
Weight (lbs)    18 non-null int64
dtypes: int64(3), object(2)
memory usage: 848.0+ bytes

df[7:10]

	Name	Sex	Age	Height (in)	Weight (lbs)
7	Hank	M	30	71	158
8	Ivan	M	53	72	175
9	Jake	M	32	69	143

df['Name'][7:10]

7    Hank
8    Ivan
9    Jake
Name: Name, dtype: object

df[['Name', 'Age', 'Sex']][7:10]

	Name	Age	Sex
7	Hank	30	M
8	Ivan	53	M
9	Jake	32	M

df.loc[7:10, ['Name', 'Age', 'Sex']] # compare with df[7:10]: loc slices by label and includes the end label 10

	Name	Age	Sex
7	Hank	30	M
8	Ivan	53	M
9	Jake	32	M
10	Kate	47	F
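The difference between the two slices is easy to miss, so here is a small sketch on a made-up five-row frame (the names `by_position`, `by_label`, and `by_iloc` are only for illustration). Plain `[]` slicing with integers works by position and excludes the stop, while loc slicing works by label and includes it:

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alex", "Bert", "Carl", "Dave", "Elly"],
     "Age": [41, 42, 32, 39, 30]}
)

by_position = df[1:3]     # rows at positions 1 and 2; the stop is excluded
by_label = df.loc[1:3]    # rows labelled 1, 2, and 3; the stop is included
by_iloc = df.iloc[1:3]    # explicit positional slice, same rows as df[1:3]

print(len(by_position), len(by_label))
```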

df.count()

Name            18
Sex             18
Age             18
Height (in)     18
Weight (lbs)    18
dtype: int64

Basic Index Operation

The index is a very useful key for a DataFrame. The default index is the row number, running from 0 to N-1, where N is the number of rows. Besides the row number, a unique feature such as a name, ID, or phone number is also commonly used as the index.

Turning a column into the index

Method 1: specify index_col directly in read_csv. The numeric index disappears and is replaced by the Name column.

df = pd.read_csv('drive/My Drive/Colab Notebooks/biostats.csv', skipinitialspace=True, index_col='Name')
df

	Sex	Age	Height (in)	Weight (lbs)
Name
Alex	M	41	74	170
Bert	M	42	68	166
Carl	M	32	70	155
Dave	M	39	72	167
Elly	F	30	66	124
Fran	F	33	66	115
Gwen	F	26	64	121
Hank	M	30	71	158
Ivan	M	53	72	175
Jake	M	32	69	143
Kate	F	47	69	139
Luke	M	34	72	163
Myra	F	23	62	98
Neil	M	36	75	160
Omar	M	38	70	145
Page	F	31	67	135
Quin	M	29	71	176
Ruth	F	28	65	131

df.index shows the labels stored in the index.

df.index

Index(['Alex', 'Bert', 'Carl', 'Dave', 'Elly', 'Fran', 'Gwen', 'Hank', 'Ivan',
       'Jake', 'Kate', 'Luke', 'Myra', 'Neil', 'Omar', 'Page', 'Quin', 'Ruth'],
      dtype='object', name='Name')

Calling reset_index brings back the numeric index.

df.reset_index()

	Name	Sex	Age	Height (in)	Weight (lbs)
0	Alex	M	41	74	170
1	Bert	M	42	68	166
2	Carl	M	32	70	155
3	Dave	M	39	72	167
4	Elly	F	30	66	124
5	Fran	F	33	66	115
6	Gwen	F	26	64	121
7	Hank	M	30	71	158
8	Ivan	M	53	72	175
9	Jake	M	32	69	143
10	Kate	F	47	69	139
11	Luke	M	34	72	163
12	Myra	F	23	62	98
13	Neil	M	36	75	160
14	Omar	M	38	70	145
15	Page	F	31	67	135
16	Quin	M	29	71	176
17	Ruth	F	28	65	131

Looking at df again, it has not changed. Many DataFrame methods leave the original df untouched and return a new object, i.e. they default to inplace=False. To replace the original df, pass inplace=True!

df

	Sex	Age	Height (in)	Weight (lbs)
Name
Alex	M	41	74	170
Bert	M	42	68	166
Carl	M	32	70	155
Dave	M	39	72	167
Elly	F	30	66	124
Fran	F	33	66	115
Gwen	F	26	64	121
Hank	M	30	71	158
Ivan	M	53	72	175
Jake	M	32	69	143
Kate	F	47	69	139
Luke	M	34	72	163
Myra	F	23	62	98
Neil	M	36	75	160
Omar	M	38	70	145
Page	F	31	67	135
Quin	M	29	71	176
Ruth	F	28	65	131

df.reset_index(inplace=True)
df

	Name	Sex	Age	Height (in)	Weight (lbs)
0	Alex	M	41	74	170
1	Bert	M	42	68	166
2	Carl	M	32	70	155
3	Dave	M	39	72	167
4	Elly	F	30	66	124
5	Fran	F	33	66	115
6	Gwen	F	26	64	121
7	Hank	M	30	71	158
8	Ivan	M	53	72	175
9	Jake	M	32	69	143
10	Kate	F	47	69	139
11	Luke	M	34	72	163
12	Myra	F	23	62	98
13	Neil	M	36	75	160
14	Omar	M	38	70	145
15	Page	F	31	67	135
16	Quin	M	29	71	176
17	Ruth	F	28	65	131
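Assigning the returned frame back to the same name is another common way to keep the result, without using inplace. A minimal sketch on a made-up two-row frame; `returned` and `name_after_call` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame(
    {"Sex": ["M", "F"], "Age": [41, 30]},
    index=pd.Index(["Alex", "Elly"], name="Name"),
)

# reset_index() returns a new frame; df itself is not modified.
returned = df.reset_index()
name_after_call = df.index.name   # df still uses "Name" as its index

# Rebinding the name keeps the result, much like inplace=True.
df = df.reset_index()
print(df)
```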

What happens if reset_index() is called once more? Here the default inplace=False is used. The result gains an extra index column.

df.reset_index()

	index	Name	Sex	Age	Height (in)	Weight (lbs)
0	0	Alex	M	41	74	170
1	1	Bert	M	42	68	166
2	2	Carl	M	32	70	155
3	3	Dave	M	39	72	167
4	4	Elly	F	30	66	124
5	5	Fran	F	33	66	115
6	6	Gwen	F	26	64	121
7	7	Hank	M	30	71	158
8	8	Ivan	M	53	72	175
9	9	Jake	M	32	69	143
10	10	Kate	F	47	69	139
11	11	Luke	M	34	72	163
12	12	Myra	F	23	62	98
13	13	Neil	M	36	75	160
14	14	Omar	M	38	70	145
15	15	Page	F	31	67	135
16	16	Quin	M	29	71	176
17	17	Ruth	F	28	65	131

df

	Name	Sex	Age	Height (in)	Weight (lbs)
0	Alex	M	41	74	170
1	Bert	M	42	68	166
2	Carl	M	32	70	155
3	Dave	M	39	72	167
4	Elly	F	30	66	124
5	Fran	F	33	66	115
6	Gwen	F	26	64	121
7	Hank	M	30	71	158
8	Ivan	M	53	72	175
9	Jake	M	32	69	143
10	Kate	F	47	69	139
11	Luke	M	34	72	163
12	Myra	F	23	62	98
13	Neil	M	36	75	160
14	Omar	M	38	70	145
15	Page	F	31	67	135
16	Quin	M	29	71	176
17	Ruth	F	28	65	131
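If the extra index column is not wanted, reset_index accepts drop=True, which discards the old index instead of turning it into a column. A short sketch on a made-up three-row frame, sorted first so that the old index is out of order:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alex", "Elly", "Ivan"], "Age": [41, 30, 53]})

by_age = df.sort_values(by="Age")            # index order becomes 1, 0, 2

kept = by_age.reset_index()                  # old index saved as an "index" column
renumbered = by_age.reset_index(drop=True)   # old index discarded

print(kept)
print(renumbered)
```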

Method 2: use set_index()

df.set_index('Name', inplace=True)
df

	Sex	Age	Height (in)	Weight (lbs)
Name
Alex	M	41	74	170
Bert	M	42	68	166
Carl	M	32	70	155
Dave	M	39	72	167
Elly	F	30	66	124
Fran	F	33	66	115
Gwen	F	26	64	121
Hank	M	30	71	158
Ivan	M	53	72	175
Jake	M	32	69	143
Kate	F	47	69	139
Luke	M	34	72	163
Myra	F	23	62	98
Neil	M	36	75	160
Omar	M	38	70	145
Page	F	31	67	135
Quin	M	29	71	176
Ruth	F	28	65	131

loc[]

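Using loc[] with index labels is a convenient way to retrieve data.
With the default integer index, use df.loc[0], df.loc[3], etc. (or a slice such as df[0:4]); plain df[0] looks for a column named 0 and raises a KeyError.
With another column as the index, e.g. Name, both df[2] and df['Hank'] are wrong; use df.loc['Hank']
or df.loc[['Hank', 'Ruth', 'Page']].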

df.loc['Hank']

Sex               M
Age              30
Height (in)      71
Weight (lbs)    158
Name: Hank, dtype: object

df.loc[:, ['Sex', 'Age']]

	Sex	Age
Name
Alex	M	41
Bert	M	42
Carl	M	32
Dave	M	39
Elly	F	30
Fran	F	33
Gwen	F	26
Hank	M	30
Ivan	M	53
Jake	M	32
Kate	F	47
Luke	M	34
Myra	F	23
Neil	M	36
Omar	M	38
Page	F	31
Quin	M	29
Ruth	F	28

df.loc[ ['Hank', 'Ruth', 'Page'] ]

	Sex	Age	Height (in)	Weight (lbs)
Name
Hank	M	30	71	158
Ruth	F	28	65	131
Page	F	31	67	135
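The sketch below shows what goes wrong with plain brackets on a label index. It uses a made-up three-row frame; `column_lookup_failed`, `row`, and `subset` are illustrative names. A single label inside `[]` is treated as a column name, so it raises a KeyError when no such column exists:

```python
import pandas as pd

df = pd.DataFrame(
    {"Sex": ["M", "F", "F"], "Age": [30, 28, 31]},
    index=pd.Index(["Hank", "Ruth", "Page"], name="Name"),
)

try:
    df["Hank"]                      # looks for a column called "Hank"
    column_lookup_failed = False
except KeyError:
    column_lookup_failed = True

row = df.loc["Hank"]                # one row, returned as a Series
subset = df.loc[["Ruth", "Hank"]]   # several rows, in the requested order

print(column_lookup_failed)
print(subset)
```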

loc[] also accepts a row label and a column label to return the corresponding element, which may look unusual at first.

df.loc['Hank', 'Age']
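For a single cell, pandas also provides the `at` accessor, which takes the same row and column labels. A small sketch on a made-up two-row frame; `age_loc` and `age_at` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame(
    {"Sex": ["M", "F"], "Age": [30, 28]},
    index=pd.Index(["Hank", "Ruth"], name="Name"),
)

age_loc = df.loc["Hank", "Age"]   # row label, column label
age_at = df.at["Hank", "Age"]     # same cell through the scalar accessor

# The same addressing can be used to assign a new value to one cell.
df.at["Ruth", "Age"] = 29
print(age_loc, age_at)
```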

iloc[]

Even when a column is used as the index, iloc[] can still retrieve data by integer position.

df.iloc[0]

Sex               M
Age              41
Height (in)      74
Weight (lbs)    170
Name: Alex, dtype: object

df.iloc[1:10]

	Sex	Age	Height (in)	Weight (lbs)
Name
Bert	M	42	68	166
Carl	M	32	70	155
Dave	M	39	72	167
Elly	F	30	66	124
Fran	F	33	66	115
Gwen	F	26	64	121
Hank	M	30	71	158
Ivan	M	53	72	175
Jake	M	32	69	143

df.iloc[[1, 4, 6]]

	Sex	Age	Height (in)	Weight (lbs)
Name
Bert	M	42	68	166
Elly	F	30	66	124
Gwen	F	26	64	121
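iloc[] can also take a second argument for column positions, and negative positions count from the end. A brief sketch on a made-up three-row frame; the result names are illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {"Sex": ["M", "M", "F"], "Age": [41, 42, 30], "Height (in)": [74, 68, 66]},
    index=pd.Index(["Alex", "Bert", "Elly"], name="Name"),
)

first_two = df.iloc[0:2, [0, 1]]   # rows 0-1, columns 0 and 1
last_row = df.iloc[-1]             # last row, whatever its label is
one_cell = df.iloc[2, 1]           # row position 2, column position 1

print(first_two)
```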

Sorting

There are two kinds of sorting:

sort_index()
sort_values()

df.sort_index()

	Sex	Age	Height (in)	Weight (lbs)
Name
Alex	M	41	74	170
Bert	M	42	68	166
Carl	M	32	70	155
Dave	M	39	72	167
Elly	F	30	66	124
Fran	F	33	66	115
Gwen	F	26	64	121
Hank	M	30	71	158
Ivan	M	53	72	175
Jake	M	32	69	143
Kate	F	47	69	139
Luke	M	34	72	163
Myra	F	23	62	98
Neil	M	36	75	160
Omar	M	38	70	145
Page	F	31	67	135
Quin	M	29	71	176
Ruth	F	28	65	131

df.sort_values(by = 'Age')

	Sex	Age	Height (in)	Weight (lbs)
Name
Myra	F	23	62	98
Gwen	F	26	64	121
Ruth	F	28	65	131
Quin	M	29	71	176
Elly	F	30	66	124
Hank	M	30	71	158
Page	F	31	67	135
Carl	M	32	70	155
Jake	M	32	69	143
Fran	F	33	66	115
Luke	M	34	72	163
Neil	M	36	75	160
Omar	M	38	70	145
Dave	M	39	72	167
Alex	M	41	74	170
Bert	M	42	68	166
Kate	F	47	69	139
Ivan	M	53	72	175
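Both sorting methods accept an ascending argument, and sort_values can sort by several columns at once. A minimal sketch using four rows taken from the table above, with made-up result names:

```python
import pandas as pd

df = pd.DataFrame(
    {"Sex": ["M", "F", "M", "F"], "Age": [41, 30, 53, 33]},
    index=pd.Index(["Alex", "Elly", "Ivan", "Fran"], name="Name"),
)

oldest_first = df.sort_values(by="Age", ascending=False)
by_sex_then_age = df.sort_values(by=["Sex", "Age"])   # Sex first, then Age
reverse_names = df.sort_index(ascending=False)

print(by_sex_then_age)
```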

Rename and Drop Column(s) and Index(s)

df.rename(columns={"Height (in)": "Height", "Weight (lbs)": "Weight"}, inplace=True)
df

	Sex	Age	Height	Weight
Name
Alex	M	41	74	170
Bert	M	42	68	166
Carl	M	32	70	155
Dave	M	39	72	167
Elly	F	30	66	124
Fran	F	33	66	115
Gwen	F	26	64	121
Hank	M	30	71	158
Ivan	M	53	72	175
Jake	M	32	69	143
Kate	F	47	69	139
Luke	M	34	72	163
Myra	F	23	62	98
Neil	M	36	75	160
Omar	M	38	70	145
Page	F	31	67	135
Quin	M	29	71	176
Ruth	F	28	65	131

df.rename(index={"Alex": "Allen", "Bert": "Bob"}, inplace=True)
df

	Sex	Age	Height	Weight
Name
Allen	M	41	74	170
Bob	M	42	68	166
Carl	M	32	70	155
Dave	M	39	72	167
Elly	F	30	66	124
Fran	F	33	66	115
Gwen	F	26	64	121
Hank	M	30	71	158
Ivan	M	53	72	175
Jake	M	32	69	143
Kate	F	47	69	139
Luke	M	34	72	163
Myra	F	23	62	98
Neil	M	36	75	160
Omar	M	38	70	145
Page	F	31	67	135
Quin	M	29	71	176
Ruth	F	28	65	131

df.drop(labels=['Sex', 'Weight'], axis="columns") # axis=1 is equivalent to axis="columns"

	Age	Height
Name
Allen	41	74
Bob	42	68
Carl	32	70
Dave	39	72
Elly	30	66
Fran	33	66
Gwen	26	64
Hank	30	71
Ivan	53	72
Jake	32	69
Kate	47	69
Luke	34	72
Myra	23	62
Neil	36	75
Omar	38	70
Page	31	67
Quin	29	71
Ruth	28	65

df.drop(labels=['Allen', 'Ruth'], axis="index") # axis=0 is equivalent to axis="index"

	Sex	Age	Height	Weight
Name
Bob	M	42	68	166
Carl	M	32	70	155
Dave	M	39	72	167
Elly	F	30	66	124
Fran	F	33	66	115
Gwen	F	26	64	121
Hank	M	30	71	158
Ivan	M	53	72	175
Jake	M	32	69	143
Kate	F	47	69	139
Luke	M	34	72	163
Myra	F	23	62	98
Neil	M	36	75	160
Omar	M	38	70	145
Page	F	31	67	135
Quin	M	29	71	176
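drop() also accepts the columns= and index= keywords, which avoid the axis argument altogether. As in the calls above, nothing is passed for inplace, so df itself stays unchanged. A short sketch on a made-up three-row frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"Sex": ["M", "M", "F"], "Age": [41, 42, 28],
     "Height": [74, 68, 65], "Weight": [170, 166, 131]},
    index=pd.Index(["Allen", "Bob", "Ruth"], name="Name"),
)

slim = df.drop(columns=["Sex", "Weight"])   # same as labels=..., axis="columns"
fewer = df.drop(index=["Allen", "Ruth"])    # same as labels=..., axis="index"

print(slim)
print(df.shape)   # the original frame is untouched
```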

Advanced Techniques

MultiIndex (multiple index levels)

This is a very useful technique: call set_index with a list of keys.

df = pd.read_csv('drive/My Drive/Colab Notebooks/biostats.csv', skipinitialspace=True)
df  # show the original dataframe

	Name	Sex	Age	Height (in)	Weight (lbs)
0	Alex	M	41	74	170
1	Bert	M	42	68	166
2	Carl	M	32	70	155
3	Dave	M	39	72	167
4	Elly	F	30	66	124
5	Fran	F	33	66	115
6	Gwen	F	26	64	121
7	Hank	M	30	71	158
8	Ivan	M	53	72	175
9	Jake	M	32	69	143
10	Kate	F	47	69	139
11	Luke	M	34	72	163
12	Myra	F	23	62	98
13	Neil	M	36	75	160
14	Omar	M	38	70	145
15	Page	F	31	67	135
16	Quin	M	29	71	176
17	Ruth	F	28	65	131

df.set_index(keys=['Name', 'Sex'])
# Notice that the "Name" and "Sex" headers sit one row lower than the other column headers

		Age	Height (in)	Weight (lbs)
Name	Sex
Alex	M	41	74	170
Bert	M	42	68	166
Carl	M	32	70	155
Dave	M	39	72	167
Elly	F	30	66	124
Fran	F	33	66	115
Gwen	F	26	64	121
Hank	M	30	71	158
Ivan	M	53	72	175
Jake	M	32	69	143
Kate	F	47	69	139
Luke	M	34	72	163
Myra	F	23	62	98
Neil	M	36	75	160
Omar	M	38	70	145
Page	F	31	67	135
Quin	M	29	71	176
Ruth	F	28	65	131

df.set_index(keys=['Sex', 'Name'], inplace=True)
df
# Note that the key order matters, and consecutive rows with the same outer index value are displayed as a group
# Note that inplace=True replaces the original df
# This is useful for displaying sorted groups

		Age	Height (in)	Weight (lbs)
Sex	Name
M	Alex	41	74	170
	Bert	42	68	166
	Carl	32	70	155
	Dave	39	72	167
F	Elly	30	66	124
	Fran	33	66	115
	Gwen	26	64	121
M	Hank	30	71	158
	Ivan	53	72	175
	Jake	32	69	143
F	Kate	47	69	139
M	Luke	34	72	163
F	Myra	23	62	98
M	Neil	36	75	160
M	Omar	38	70	145
F	Page	31	67	135
M	Quin	29	71	176
F	Ruth	28	65	131

df.index

MultiIndex([('M', 'Alex'),
            ('M', 'Bert'),
            ('M', 'Carl'),
            ('M', 'Dave'),
            ('F', 'Elly'),
            ('F', 'Fran'),
            ('F', 'Gwen'),
            ('M', 'Hank'),
            ('M', 'Ivan'),
            ('M', 'Jake'),
            ('F', 'Kate'),
            ('M', 'Luke'),
            ('F', 'Myra'),
            ('M', 'Neil'),
            ('M', 'Omar'),
            ('F', 'Page'),
            ('M', 'Quin'),
            ('F', 'Ruth')],
           names=['Sex', 'Name'])

df.index.names

FrozenList(['Sex', 'Name'])

type(df.index)  # MultiIndex

pandas.core.indexes.multi.MultiIndex

df.sort_index(inplace=True)
df
# sorting is based on "Sex", and then "Name"

		Age	Height (in)	Weight (lbs)
Sex	Name
F	Elly	30	66	124
	Fran	33	66	115
	Gwen	26	64	121
	Kate	47	69	139
	Myra	23	62	98
	Page	31	67	135
	Ruth	28	65	131
M	Alex	41	74	170
	Bert	42	68	166
	Carl	32	70	155
	Dave	39	72	167
	Hank	30	71	158
	Ivan	53	72	175
	Jake	32	69	143
	Luke	34	72	163
	Neil	36	75	160
	Omar	38	70	145
	Quin	29	71	176
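Once the MultiIndex is sorted, rows can be selected by the outer level, by a full (Sex, Name) tuple, or by an inner level through xs(). The sketch below assumes a made-up four-row frame; `females`, `hank_age`, and `fran` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame(
    {"Sex": ["M", "F", "F", "M"],
     "Name": ["Alex", "Elly", "Fran", "Hank"],
     "Age": [41, 30, 33, 30]}
).set_index(["Sex", "Name"]).sort_index()

females = df.loc["F"]                    # all rows under the outer label "F"
hank_age = df.loc[("M", "Hank"), "Age"]  # a full index tuple plus a column
fran = df.xs("Fran", level="Name")       # select on the inner level only

print(females)
```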

Groupby Command

groupby comes from SQL syntax: it splits the data into groups by the values of a given column, which makes lookups and summaries easier.
In SQL, the GROUP BY statement is often used with aggregate functions (COUNT, MAX, MIN, SUM, AVG) to group the result set by one or more columns.

df = pd.read_csv('drive/My Drive/Colab Notebooks/biostats.csv', index_col="Name", skipinitialspace=True)
df  # show the dataframe with "Name" index column

	Sex	Age	Height (in)	Weight (lbs)
Name
Alex	M	41	74	170
Bert	M	42	68	166
Carl	M	32	70	155
Dave	M	39	72	167
Elly	F	30	66	124
Fran	F	33	66	115
Gwen	F	26	64	121
Hank	M	30	71	158
Ivan	M	53	72	175
Jake	M	32	69	143
Kate	F	47	69	139
Luke	M	34	72	163
Myra	F	23	62	98
Neil	M	36	75	160
Omar	M	38	70	145
Page	F	31	67	135
Quin	M	29	71	176
Ruth	F	28	65	131

grpBySex = df.groupby('Sex')  # output is a DataFrameGroupBy object
type(grpBySex)

pandas.core.groupby.generic.DataFrameGroupBy

grpBySex.groups
# a dict mapping each group key to its row labels; use get_group() to obtain each sub-group

{'F': Index(['Elly', 'Fran', 'Gwen', 'Kate', 'Myra', 'Page', 'Ruth'], dtype='object', name='Name'),
 'M': Index(['Alex', 'Bert', 'Carl', 'Dave', 'Hank', 'Ivan', 'Jake', 'Luke', 'Neil',
        'Omar', 'Quin'],
       dtype='object', name='Name')}

grpBySex.size()  # size() shows the counts of each group

Sex
F     7
M    11
dtype: int64

grpBySex.get_group('M')  # get_group() returns a DataFrame object

	Sex	Age	Height (in)	Weight (lbs)
Name
Alex	M	41	74	170
Bert	M	42	68	166
Carl	M	32	70	155
Dave	M	39	72	167
Hank	M	30	71	158
Ivan	M	53	72	175
Jake	M	32	69	143
Luke	M	34	72	163
Neil	M	36	75	160
Omar	M	38	70	145
Quin	M	29	71	176

Groupby Operation

After grouping, various aggregations can be applied: sum(), mean(), max(), min()

grpBySex.sum()

	Age	Height (in)	Weight (lbs)
Sex
F	218	459	863
M	406	784	1778

grpBySex.mean()

	Age	Height (in)	Weight (lbs)
Sex
F	31.142857	65.571429	123.285714
M	36.909091	71.272727	161.636364

grpBySex.max()

	Age	Height (in)	Weight (lbs)
Sex
F	47	69	139
M	53	75	176

grpBySex.min()

	Age	Height (in)	Weight (lbs)
Sex
F	23	62	98
M	29	68	143
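Several aggregations can also be computed in one call with agg(). A minimal sketch using four rows from the table above; `stats` and `per_column` are names chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {"Sex": ["M", "M", "F", "F"],
     "Age": [41, 42, 30, 33],
     "Weight (lbs)": [170, 166, 124, 115]},
    index=pd.Index(["Alex", "Bert", "Elly", "Fran"], name="Name"),
)

# Several statistics of one column, one row per group.
stats = df.groupby("Sex")["Age"].agg(["min", "max", "mean"])

# A different statistic for each column.
per_column = df.groupby("Sex").agg({"Age": "max", "Weight (lbs)": "mean"})

print(stats)
print(per_column)
```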

Cleaning Data with NaN

Detecting NaN

isnull()
notnull()

Handling NaN

dropna()
fillna()

import numpy as np
import pandas as pd

groups = ["Modern Web", "DevOps", np.nan, "Big Data", "Security", "自我挑戰組"]
ironmen = [59, 9, 19, 14, 6, np.nan]

ironmen_dict = {
                "groups": groups,
                "ironmen": ironmen
}

# Build the data frame
ironmen_df = pd.DataFrame(ironmen_dict)

print(ironmen_df.loc[:, "groups"].isnull()) # which group names are missing
print("---") # separator
print(ironmen_df.loc[:, "ironmen"].notnull()) # which ironmen counts are not missing

ironmen_df_na_dropped = ironmen_df.dropna() # drop every row that has a missing value
print(ironmen_df_na_dropped)
print("---") # separator
ironmen_df_na_filled = ironmen_df.fillna(0) # fill missing values with 0
print(ironmen_df_na_filled)
print("---") # separator
ironmen_df_na_filled = ironmen_df.fillna({"groups": "Cloud", "ironmen": 71}) # fill missing values per column
print(ironmen_df_na_filled)

0    False
1    False
2     True
3    False
4    False
5    False
Name: groups, dtype: bool
---
0     True
1     True
2     True
3     True
4     True
5    False
Name: ironmen, dtype: bool
       groups  ironmen
0  Modern Web     59.0
1      DevOps      9.0
3    Big Data     14.0
4    Security      6.0
---
       groups  ironmen
0  Modern Web     59.0
1      DevOps      9.0
2           0     19.0
3    Big Data     14.0
4    Security      6.0
5       自我挑戰組      0.0
---
       groups  ironmen
0  Modern Web     59.0
1      DevOps      9.0
2       Cloud     19.0
3    Big Data     14.0
4    Security      6.0
5       自我挑戰組     71.0
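Two related patterns may be handy here: counting the missing values per column, and restricting dropna() to selected columns. The sketch below uses a shortened, made-up version of the frame above; filling with the column mean is just one possible choice, not something the original data calls for:

```python
import numpy as np
import pandas as pd

ironmen_df = pd.DataFrame(
    {"groups": ["Modern Web", "DevOps", np.nan, "Big Data"],
     "ironmen": [59, 9, 19, np.nan]}
)

# isnull() gives booleans; summing them counts the NaN per column.
missing_per_column = ironmen_df.isnull().sum()

# Drop a row only when its "groups" value is missing.
named_only = ironmen_df.dropna(subset=["groups"])

# Fill missing counts with the mean of the known counts.
mean_filled = ironmen_df["ironmen"].fillna(ironmen_df["ironmen"].mean())

print(missing_per_column)
print(named_only)
```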

Plot

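An important feature of DataFrame is that it can visualize data through the matplotlib.pyplot plotting functions.
There are two ways: (1) call df.plot directly; (2) call pyplot's plot.
(1) is a quick way to plot.
(2) gives access to all of pyplot's functionality.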

import matplotlib.pyplot as plt
df = pd.read_csv('drive/My Drive/Colab Notebooks/biostats.csv', index_col="Name", skipinitialspace=True)
df.plot(title="Generated Plot", grid=True, figsize=(8,4))

<matplotlib.axes._subplots.AxesSubplot at 0x7f952bc52240>

df.columns
plt.plot(df[['Age', 'Height (in)']])
plt.xlabel('Name')
plt.ylabel('Number')
plt.title('Generated Plot')
plt.grid()
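The two approaches can also be combined: df.plot returns a matplotlib Axes object, which can then be customized further. A minimal sketch on a made-up three-row frame; the Agg backend and the in-memory buffer are assumptions made so the code also runs without a display, and are not needed inside Colab:

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless run; Colab renders inline instead
import pandas as pd

df = pd.DataFrame(
    {"Age": [41, 30, 53], "Height (in)": [74, 66, 72]},
    index=pd.Index(["Alex", "Elly", "Ivan"], name="Name"),
)

# Quick plot through pandas; the return value is a matplotlib Axes.
ax = df.plot(title="Generated Plot", grid=True, figsize=(8, 4))

# Further customization through the Axes object.
ax.set_xlabel("Name")
ax.set_ylabel("Number")

# Render the figure to an in-memory PNG instead of a file on disk.
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")
```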