Boosting Engineer Productivity: Python DataFrame - CSV & Plot

Download the code: https://github.com/allenlu2009/colab/blob/master/dataframe_demo.ipynb

Python DataFrame

Create DataFrame

  • Direct input

  • Use dict: Method 1: add the records one at a time (a list of dicts, one dict per row).

import pandas as pd
dict1 = {'Name': 'Allen' , 'Sex': 'male', 'Age': 33}
dict2 = {'Name': 'Alice' , 'Sex': 'female', 'Age': 22}
dict3 = {'Name': 'Bob' , 'Sex': 'male', 'Age': 11}
data = [dict1, dict2, dict3]
df = pd.DataFrame(data)
df
Name Sex Age
0 Allen male 33
1 Alice female 22
2 Bob male 11

Method 2: add all the data at once (a dict of columns, one list per column).

name = ['Allen', 'Alice', 'Bob']
sex = ['male', 'female', 'male']
age = [33, 22, 11]
all_dict = {
    "Name": name,
    "Sex": sex,
    "Age": age
}
df = pd.DataFrame(all_dict)
df[['Name', 'Age']]
Name Age
0 Allen 33
1 Alice 22
2 Bob 11
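
The "Direct input" bullet above comes without an example. One plausible reading, sketched here as an assumption rather than the author's own method, is to pass the rows as a list of lists together with the column names (the name df_direct is made up for this sketch):

```python
import pandas as pd

# Each inner list is one row; the column names are supplied separately.
rows = [
    ["Allen", "male", 33],
    ["Alice", "female", 22],
    ["Bob", "male", 11],
]
df_direct = pd.DataFrame(rows, columns=["Name", "Sex", "Age"])
print(df_direct)
```

The result should match the frames built from dicts in Method 1 and Method 2.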

DataFrame Attributes

  • ndim: 2 for a 2D DataFrame; axis 0 => rows; axis 1 => columns
  • shape: (number of rows, number of columns); the index is not counted as a column
  • dtypes: the data type (e.g. object or int64) of each column
df.ndim
2

df.shape
(3, 3)

df.dtypes
Name    object
Sex     object
Age      int64
dtype: object

df.columns
Index(['Name', 'Sex', 'Age'], dtype='object')

df.index
RangeIndex(start=0, stop=3, step=1)

Read CSV

Download a test CSV file from https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html and pick biostats.csv.

Before reading the CSV from Google Drive, mount the drive in Colab first (see the Medium article on importing Google Drive).

  • Read the CSV with the read_csv function, and add skipinitialspace=True to strip the leading space after each delimiter.
  • Two ways to call read_csv: (1) load the CSV file directly; (2) load it from a URL
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
#!ls 'drive/My Drive/Colab Notebooks/'
df = pd.read_csv('drive/My Drive/Colab Notebooks/biostats.csv', skipinitialspace=True)
df
Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive
Name Sex Age Height (in) Weight (lbs)
0 Alex M 41 74 170
1 Bert M 42 68 166
2 Carl M 32 70 155
3 Dave M 39 72 167
4 Elly F 30 66 124
5 Fran F 33 66 115
6 Gwen F 26 64 121
7 Hank M 30 71 158
8 Ivan M 53 72 175
9 Jake M 32 69 143
10 Kate F 47 69 139
11 Luke M 34 72 163
12 Myra F 23 62 98
13 Neil M 36 75 160
14 Omar M 38 70 145
15 Page F 31 67 135
16 Quin M 29 71 176
17 Ruth F 28 65 131
url = "https://people.sc.fsu.edu/~jburkardt/data/csv/biostats.csv"
df = pd.read_csv(url, skipinitialspace=True)
df
Name Sex Age Height (in) Weight (lbs)
0 Alex M 41 74 170
1 Bert M 42 68 166
2 Carl M 32 70 155
3 Dave M 39 72 167
4 Elly F 30 66 124
5 Fran F 33 66 115
6 Gwen F 26 64 121
7 Hank M 30 71 158
8 Ivan M 53 72 175
9 Jake M 32 69 143
10 Kate F 47 69 139
11 Luke M 34 72 163
12 Myra F 23 62 98
13 Neil M 36 75 160
14 Omar M 38 70 145
15 Page F 31 67 135
16 Quin M 29 71 176
17 Ruth F 28 65 131
print(df.columns); print(df.index)
Index(['Name', 'Sex', 'Age', 'Height (in)', 'Weight (lbs)'], dtype='object')
RangeIndex(start=0, stop=18, step=1)

df.ndim
2

df.shape
(18, 5)

df.dtypes
Name            object
Sex             object
Age              int64
Height (in)      int64
Weight (lbs)     int64
dtype: object
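
To see what skipinitialspace changes, here is a minimal sketch on a made-up two-row CSV held in memory. The sample text and the names raw, plain and stripped are assumptions for illustration; the real biostats.csv is not reproduced here.

```python
import io
import pandas as pd

# A tiny CSV in which a space follows every comma.
raw = "Name, Sex, Age\nAlex, M, 41\nBert, M, 42\n"

plain = pd.read_csv(io.StringIO(raw))                             # default parsing
stripped = pd.read_csv(io.StringIO(raw), skipinitialspace=True)   # strip space after the delimiter

print(list(plain.columns))     # column names keep their leading space
print(list(stripped.columns))  # leading spaces are gone
```

Without the flag, a lookup such as plain['Sex'] fails because the column is actually named ' Sex'.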

Basic Viewing Command

df.head(3)
Name Sex Age Height (in) Weight (lbs)
0 Alex M 41 74 170
1 Bert M 42 68 166
2 Carl M 32 70 155
df.tail(3)
Name Sex Age Height (in) Weight (lbs)
15 Page F 31 67 135
16 Quin M 29 71 176
17 Ruth F 28 65 131
df.shape
(18, 5)

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 5 columns):
Name            18 non-null object
Sex             18 non-null object
Age             18 non-null int64
Height (in)     18 non-null int64
Weight (lbs)    18 non-null int64
dtypes: int64(3), object(2)
memory usage: 848.0+ bytes
df[7:10]
Name Sex Age Height (in) Weight (lbs)
7 Hank M 30 71 158
8 Ivan M 53 72 175
9 Jake M 32 69 143
df['Name'][7:10]
7    Hank
8    Ivan
9    Jake
Name: Name, dtype: object
df[['Name', 'Age', 'Sex']][7:10]
Name Age Sex
7 Hank 30 M
8 Ivan 53 M
9 Jake 32 M
df.loc[7:10, ['Name', 'Age', 'Sex']]  # loc slices by label, so the end label 10 is included
Name Age Sex
7 Hank 30 M
8 Ivan 53 M
9 Jake 32 M
10 Kate 47 F
df.count()
Name            18
Sex             18
Age             18
Height (in)     18
Weight (lbs)    18
dtype: int64
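
The difference between df[7:10] and df.loc[7:10] seen above can be isolated on a small made-up frame. This is only a sketch; df_demo and its values are assumptions, not part of the biostats data.

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": list("ABCDEF"), "Age": [10, 20, 30, 40, 50, 60]})

by_position = df_demo[1:3]     # positional slice: end is excluded, rows 1 and 2
by_iloc = df_demo.iloc[1:3]    # same positional rule, written explicitly
by_label = df_demo.loc[1:3]    # label slice: end label is included, rows 1, 2 and 3

print(len(by_position), len(by_iloc), len(by_label))
```

With the default numeric index the labels happen to equal the positions, which is why the two forms are easy to confuse.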

Basic Index Operation

The index is a very useful key for a DataFrame. The default index is the row number, running from 0 to N-1, where N is the number of rows. Besides the row number, a unique feature such as a name, ID, or phone number is commonly used as the index.

Turning a Column into the Index

  • Method 1: specify index_col directly in read_csv. The index numbers disappear and are replaced by the Name column.
df = pd.read_csv('drive/My Drive/Colab Notebooks/biostats.csv', skipinitialspace=True, index_col='Name')
df
Sex Age Height (in) Weight (lbs)
Name
Alex M 41 74 170
Bert M 42 68 166
Carl M 32 70 155
Dave M 39 72 167
Elly F 30 66 124
Fran F 33 66 115
Gwen F 26 64 121
Hank M 30 71 158
Ivan M 53 72 175
Jake M 32 69 143
Kate F 47 69 139
Luke M 34 72 163
Myra F 23 62 98
Neil M 36 75 160
Omar M 38 70 145
Page F 31 67 135
Quin M 29 71 176
Ruth F 28 65 131
  • df.index shows the labels in the index column.
df.index
Index(['Alex', 'Bert', 'Carl', 'Dave', 'Elly', 'Fran', 'Gwen', 'Hank', 'Ivan',
       'Jake', 'Kate', 'Luke', 'Myra', 'Neil', 'Omar', 'Page', 'Quin', 'Ruth'],
      dtype='object', name='Name')
  • Calling reset_index() brings back the numeric index.
df.reset_index()
Name Sex Age Height (in) Weight (lbs)
0 Alex M 41 74 170
1 Bert M 42 68 166
2 Carl M 32 70 155
3 Dave M 39 72 167
4 Elly F 30 66 124
5 Fran F 33 66 115
6 Gwen F 26 64 121
7 Hank M 30 71 158
8 Ivan M 53 72 175
9 Jake M 32 69 143
10 Kate F 47 69 139
11 Luke M 34 72 163
12 Myra F 23 62 98
13 Neil M 36 75 160
14 Omar M 38 70 145
15 Page F 31 67 135
16 Quin M 29 71 176
17 Ruth F 28 65 131

Looking at df again, it has not changed. Many DataFrame methods keep the original df and create a new object, i.e. inplace=False by default. To replace the original df, pass inplace=True.

df
Sex Age Height (in) Weight (lbs)
Name
Alex M 41 74 170
Bert M 42 68 166
Carl M 32 70 155
Dave M 39 72 167
Elly F 30 66 124
Fran F 33 66 115
Gwen F 26 64 121
Hank M 30 71 158
Ivan M 53 72 175
Jake M 32 69 143
Kate F 47 69 139
Luke M 34 72 163
Myra F 23 62 98
Neil M 36 75 160
Omar M 38 70 145
Page F 31 67 135
Quin M 29 71 176
Ruth F 28 65 131
df.reset_index(inplace=True)
df
Name Sex Age Height (in) Weight (lbs)
0 Alex M 41 74 170
1 Bert M 42 68 166
2 Carl M 32 70 155
3 Dave M 39 72 167
4 Elly F 30 66 124
5 Fran F 33 66 115
6 Gwen F 26 64 121
7 Hank M 30 71 158
8 Ivan M 53 72 175
9 Jake M 32 69 143
10 Kate F 47 69 139
11 Luke M 34 72 163
12 Myra F 23 62 98
13 Neil M 36 75 160
14 Omar M 38 70 145
15 Page F 31 67 135
16 Quin M 29 71 176
17 Ruth F 28 65 131

What happens if reset_index() is called once more? Here the default inplace=False is used. The result gains an extra column named index that holds the old numeric index.

df.reset_index()
index Name Sex Age Height (in) Weight (lbs)
0 0 Alex M 41 74 170
1 1 Bert M 42 68 166
2 2 Carl M 32 70 155
3 3 Dave M 39 72 167
4 4 Elly F 30 66 124
5 5 Fran F 33 66 115
6 6 Gwen F 26 64 121
7 7 Hank M 30 71 158
8 8 Ivan M 53 72 175
9 9 Jake M 32 69 143
10 10 Kate F 47 69 139
11 11 Luke M 34 72 163
12 12 Myra F 23 62 98
13 13 Neil M 36 75 160
14 14 Omar M 38 70 145
15 15 Page F 31 67 135
16 16 Quin M 29 71 176
17 17 Ruth F 28 65 131
df
Name Sex Age Height (in) Weight (lbs)
0 Alex M 41 74 170
1 Bert M 42 68 166
2 Carl M 32 70 155
3 Dave M 39 72 167
4 Elly F 30 66 124
5 Fran F 33 66 115
6 Gwen F 26 64 121
7 Hank M 30 71 158
8 Ivan M 53 72 175
9 Jake M 32 69 143
10 Kate F 47 69 139
11 Luke M 34 72 163
12 Myra F 23 62 98
13 Neil M 36 75 160
14 Omar M 38 70 145
15 Page F 31 67 135
16 Quin M 29 71 176
17 Ruth F 28 65 131
  • Method 2: use set_index()
df.set_index('Name', inplace=True)
df
Sex Age Height (in) Weight (lbs)
Name
Alex M 41 74 170
Bert M 42 68 166
Carl M 32 70 155
Dave M 39 72 167
Elly F 30 66 124
Fran F 33 66 115
Gwen F 26 64 121
Hank M 30 71 158
Ivan M 53 72 175
Jake M 32 69 143
Kate F 47 69 139
Luke M 34 72 163
Myra F 23 62 98
Neil M 36 75 160
Omar M 38 70 145
Page F 31 67 135
Quin M 29 71 176
Ruth F 28 65 131
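
The inplace behaviour described in this section can be checked on a small made-up frame. A minimal sketch, with all variable names assumed for illustration:

```python
import pandas as pd

original = pd.DataFrame({"Name": ["Alex", "Bert"], "Age": [41, 42]})

# Default inplace=False: a new DataFrame is returned, the original is untouched.
indexed = original.set_index("Name")

# inplace=True: the frame itself is modified and the call returns None.
in_place_target = original.copy()
return_value = in_place_target.set_index("Name", inplace=True)

print(list(original.columns))   # still has the Name column
print(return_value)             # None
```

Assigning the result back, as in df = df.set_index('Name'), is an alternative to inplace=True that reaches the same end state.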

loc[]

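Using loc[] with index labels is a convenient way to pull out data.
With the default numeric index, rows can be taken with a slice such as df[0:1] or df[3:4] (a bare df[0] or df[3] looks for a column with that name, not a row).
With another column as the index, e.g. Name, both df[2] and df["Hank"] are wrong; use df.loc['Hank'],
or df.loc[['Hank', 'Ruth', 'Page']] for several rows.
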
df.loc['Hank']
Sex               M
Age              30
Height (in)      71
Weight (lbs)    158
Name: Hank, dtype: object
df.loc[:, ['Sex', 'Age']]
Sex Age
Name
Alex M 41
Bert M 42
Carl M 32
Dave M 39
Elly F 30
Fran F 33
Gwen F 26
Hank M 30
Ivan M 53
Jake M 32
Kate F 47
Luke M 34
Myra F 23
Neil M 36
Omar M 38
Page F 31
Quin M 29
Ruth F 28
df.loc[ ['Hank', 'Ruth', 'Page'] ]
Sex Age Height (in) Weight (lbs)
Name
Hank M 30 71 158
Ruth F 28 65 131
Page F 31 67 135

loc[] also accepts a row label and a column label together and returns the single element at that position, which may look unusual at first.

df.loc['Hank', 'Age']
30

iloc[]

Even when a column is used as the index, iloc[] still retrieves rows by integer position.

df.iloc[0]
Sex               M
Age              41
Height (in)      74
Weight (lbs)    170
Name: Alex, dtype: object
df.iloc[1:10]
Sex Age Height (in) Weight (lbs)
Name
Bert M 42 68 166
Carl M 32 70 155
Dave M 39 72 167
Elly F 30 66 124
Fran F 33 66 115
Gwen F 26 64 121
Hank M 30 71 158
Ivan M 53 72 175
Jake M 32 69 143
df.iloc[[1, 4, 6]]
Sex Age Height (in) Weight (lbs)
Name
Bert M 42 68 166
Elly F 30 66 124
Gwen F 26 64 121
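
As a side-by-side check of loc[] (labels) and iloc[] (positions), here is a small sketch on a three-row frame indexed by Name. The frame df_demo is a made-up subset, and the try/except is only there to show that plain brackets look up columns rather than rows.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Sex": ["M", "M", "F"], "Age": [41, 42, 30]},
    index=pd.Index(["Alex", "Bert", "Elly"], name="Name"),
)

age_by_label = df_demo.loc["Bert", "Age"]   # row label, column label
age_by_position = df_demo.iloc[1, 1]        # row position, column position

# Plain brackets search the column names, so a row label raises KeyError.
try:
    df_demo["Bert"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(age_by_label, age_by_position, lookup_failed)
```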

Sorting

There are two kinds of sorting:

  • sort_index()
  • sort_values()
df.sort_index()
Sex Age Height (in) Weight (lbs)
Name
Alex M 41 74 170
Bert M 42 68 166
Carl M 32 70 155
Dave M 39 72 167
Elly F 30 66 124
Fran F 33 66 115
Gwen F 26 64 121
Hank M 30 71 158
Ivan M 53 72 175
Jake M 32 69 143
Kate F 47 69 139
Luke M 34 72 163
Myra F 23 62 98
Neil M 36 75 160
Omar M 38 70 145
Page F 31 67 135
Quin M 29 71 176
Ruth F 28 65 131
df.sort_values(by = 'Age')
Sex Age Height (in) Weight (lbs)
Name
Myra F 23 62 98
Gwen F 26 64 121
Ruth F 28 65 131
Quin M 29 71 176
Elly F 30 66 124
Hank M 30 71 158
Page F 31 67 135
Carl M 32 70 155
Jake M 32 69 143
Fran F 33 66 115
Luke M 34 72 163
Neil M 36 75 160
Omar M 38 70 145
Dave M 39 72 167
Alex M 41 74 170
Bert M 42 68 166
Kate F 47 69 139
Ivan M 53 72 175
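
Both sort methods take further options. The sketch below, on a made-up four-row subset, uses the ascending parameter and a list of keys for by; the frame and variable names are assumptions for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Sex": ["M", "F", "M", "F"], "Age": [41, 30, 53, 47]},
    index=pd.Index(["Alex", "Elly", "Ivan", "Kate"], name="Name"),
)

oldest_first = df_demo.sort_values(by="Age", ascending=False)  # descending by Age
by_sex_then_age = df_demo.sort_values(by=["Sex", "Age"])       # Sex first, then Age
reverse_names = df_demo.sort_index(ascending=False)            # index labels Z to A

print(oldest_first)
```

Like the calls above, these return new frames and leave df_demo unchanged.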

Rename and Drop Column(s) and Index(s)

df.rename(columns={"Height (in)": "Height", "Weight (lbs)": "Weight"}, inplace=True)
df
Sex Age Height Weight
Name
Alex M 41 74 170
Bert M 42 68 166
Carl M 32 70 155
Dave M 39 72 167
Elly F 30 66 124
Fran F 33 66 115
Gwen F 26 64 121
Hank M 30 71 158
Ivan M 53 72 175
Jake M 32 69 143
Kate F 47 69 139
Luke M 34 72 163
Myra F 23 62 98
Neil M 36 75 160
Omar M 38 70 145
Page F 31 67 135
Quin M 29 71 176
Ruth F 28 65 131
df.rename(index={"Alex": "Allen", "Bert": "Bob"}, inplace=True)
df
Sex Age Height Weight
Name
Allen M 41 74 170
Bob M 42 68 166
Carl M 32 70 155
Dave M 39 72 167
Elly F 30 66 124
Fran F 33 66 115
Gwen F 26 64 121
Hank M 30 71 158
Ivan M 53 72 175
Jake M 32 69 143
Kate F 47 69 139
Luke M 34 72 163
Myra F 23 62 98
Neil M 36 75 160
Omar M 38 70 145
Page F 31 67 135
Quin M 29 71 176
Ruth F 28 65 131
df.drop(labels=['Sex', 'Weight'], axis="columns")  # axis=1 is equivalent to axis="columns"
Age Height
Name
Allen 41 74
Bob 42 68
Carl 32 70
Dave 39 72
Elly 30 66
Fran 33 66
Gwen 26 64
Hank 30 71
Ivan 53 72
Jake 32 69
Kate 47 69
Luke 34 72
Myra 23 62
Neil 36 75
Omar 38 70
Page 31 67
Quin 29 71
Ruth 28 65
df.drop(labels=['Allen', 'Ruth'], axis="index")  # axis=0 is equivalent to axis="index"
Sex Age Height Weight
Name
Bob M 42 68 166
Carl M 32 70 155
Dave M 39 72 167
Elly F 30 66 124
Fran F 33 66 115
Gwen F 26 64 121
Hank M 30 71 158
Ivan M 53 72 175
Jake M 32 69 143
Kate F 47 69 139
Luke M 34 72 163
Myra F 23 62 98
Neil M 36 75 160
Omar M 38 70 145
Page F 31 67 135
Quin M 29 71 176

Advanced Techniques

Multiple Index (MultiIndex)

This is a very useful technique: call set_index with a list of keys.

df = pd.read_csv('drive/My Drive/Colab Notebooks/biostats.csv', skipinitialspace=True)
df  # show the original dataframe
Name Sex Age Height (in) Weight (lbs)
0 Alex M 41 74 170
1 Bert M 42 68 166
2 Carl M 32 70 155
3 Dave M 39 72 167
4 Elly F 30 66 124
5 Fran F 33 66 115
6 Gwen F 26 64 121
7 Hank M 30 71 158
8 Ivan M 53 72 175
9 Jake M 32 69 143
10 Kate F 47 69 139
11 Luke M 34 72 163
12 Myra F 23 62 98
13 Neil M 36 75 160
14 Omar M 38 70 145
15 Page F 31 67 135
16 Quin M 29 71 176
17 Ruth F 28 65 131
df.set_index(keys=['Name', 'Sex'])
# Notice that the "Name" and "Sex" index headers sit one row lower than the other column headers
Age Height (in) Weight (lbs)
Name Sex
Alex M 41 74 170
Bert M 42 68 166
Carl M 32 70 155
Dave M 39 72 167
Elly F 30 66 124
Fran F 33 66 115
Gwen F 26 64 121
Hank M 30 71 158
Ivan M 53 72 175
Jake M 32 69 143
Kate F 47 69 139
Luke M 34 72 163
Myra F 23 62 98
Neil M 36 75 160
Omar M 38 70 145
Page F 31 67 135
Quin M 29 71 176
Ruth F 28 65 131
df.set_index(keys=['Sex', 'Name'], inplace=True)
df
# Note that the key order matters, and consecutive identical index values are displayed as one group
# Note that inplace=True replaces the original df
# This is useful for displaying sorted groups
Age Height (in) Weight (lbs)
Sex Name
M Alex 41 74 170
Bert 42 68 166
Carl 32 70 155
Dave 39 72 167
F Elly 30 66 124
Fran 33 66 115
Gwen 26 64 121
M Hank 30 71 158
Ivan 53 72 175
Jake 32 69 143
F Kate 47 69 139
M Luke 34 72 163
F Myra 23 62 98
M Neil 36 75 160
Omar 38 70 145
F Page 31 67 135
M Quin 29 71 176
F Ruth 28 65 131
df.index
MultiIndex([('M', 'Alex'),
            ('M', 'Bert'),
            ('M', 'Carl'),
            ('M', 'Dave'),
            ('F', 'Elly'),
            ('F', 'Fran'),
            ('F', 'Gwen'),
            ('M', 'Hank'),
            ('M', 'Ivan'),
            ('M', 'Jake'),
            ('F', 'Kate'),
            ('M', 'Luke'),
            ('F', 'Myra'),
            ('M', 'Neil'),
            ('M', 'Omar'),
            ('F', 'Page'),
            ('M', 'Quin'),
            ('F', 'Ruth')],
           names=['Sex', 'Name'])
df.index.names
FrozenList(['Sex', 'Name'])

type(df.index)  # MultiIndex
pandas.core.indexes.multi.MultiIndex

df.sort_index(inplace=True)
df
# sorting is based on "Sex", and then "Name"
Age Height (in) Weight (lbs)
Sex Name
F Elly 30 66 124
Fran 33 66 115
Gwen 26 64 121
Kate 47 69 139
Myra 23 62 98
Page 31 67 135
Ruth 28 65 131
M Alex 41 74 170
Bert 42 68 166
Carl 32 70 155
Dave 39 72 167
Hank 30 71 158
Ivan 53 72 175
Jake 32 69 143
Luke 34 72 163
Neil 36 75 160
Omar 38 70 145
Quin 29 71 176
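
Once the MultiIndex is sorted, rows can be selected by one or both levels. A minimal sketch on a made-up four-row subset, where df_demo and the result names are assumptions for illustration:

```python
import pandas as pd

df_demo = (
    pd.DataFrame({
        "Sex": ["M", "F", "M", "F"],
        "Name": ["Alex", "Elly", "Hank", "Kate"],
        "Age": [41, 30, 30, 47],
    })
    .set_index(["Sex", "Name"])
    .sort_index()
)

females = df_demo.loc["F"]                    # outer level only: result is indexed by Name
hank_age = df_demo.loc[("M", "Hank"), "Age"]  # both levels given as a tuple
kate = df_demo.xs("Kate", level="Name")       # select on the inner level

print(females)
```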

Groupby Command

groupby comes from SQL: it splits the rows into groups according to the values of one column, which makes each group easy to look up.
In SQL, the GROUP BY statement is often used with aggregate functions (COUNT, MAX, MIN, SUM, AVG) to group the result set by one or more columns.

df = pd.read_csv('drive/My Drive/Colab Notebooks/biostats.csv', index_col="Name", skipinitialspace=True)
df  # show the dataframe with "Name" index column
Sex Age Height (in) Weight (lbs)
Name
Alex M 41 74 170
Bert M 42 68 166
Carl M 32 70 155
Dave M 39 72 167
Elly F 30 66 124
Fran F 33 66 115
Gwen F 26 64 121
Hank M 30 71 158
Ivan M 53 72 175
Jake M 32 69 143
Kate F 47 69 139
Luke M 34 72 163
Myra F 23 62 98
Neil M 36 75 160
Omar M 38 70 145
Page F 31 67 135
Quin M 29 71 176
Ruth F 28 65 131
grpBySex = df.groupby('Sex')  # output is a DataFrameGroupBy object
type(grpBySex)
pandas.core.groupby.generic.DataFrameGroupBy

grpBySex.groups
# the output is a dict; use get_group() to obtain each sub-group
{'F': Index(['Elly', 'Fran', 'Gwen', 'Kate', 'Myra', 'Page', 'Ruth'], dtype='object', name='Name'),
 'M': Index(['Alex', 'Bert', 'Carl', 'Dave', 'Hank', 'Ivan', 'Jake', 'Luke', 'Neil',
        'Omar', 'Quin'],
       dtype='object', name='Name')}
grpBySex.size()  # size() shows the row count of each group
Sex
F     7
M    11
dtype: int64
grpBySex.get_group('M')  # get_group() returns a DataFrame object
Sex Age Height (in) Weight (lbs)
Name
Alex M 41 74 170
Bert M 42 68 166
Carl M 32 70 155
Dave M 39 72 167
Hank M 30 71 158
Ivan M 53 72 175
Jake M 32 69 143
Luke M 34 72 163
Neil M 36 75 160
Omar M 38 70 145
Quin M 29 71 176

Groupby Operation

After grouping, various aggregations can be applied: sum(), mean(), max(), min()

grpBySex.sum()
Age Height (in) Weight (lbs)
Sex
F 218 459 863
M 406 784 1778
grpBySex.mean()
Age Height (in) Weight (lbs)
Sex
F 31.142857 65.571429 123.285714
M 36.909091 71.272727 161.636364
grpBySex.max()
Age Height (in) Weight (lbs)
Sex
F 47 69 139
M 53 75 176
grpBySex.min()
Age Height (in) Weight (lbs)
Sex
F 23 62 98
M 29 68 143
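
Several aggregations can also be requested in one call with agg(). The sketch below uses a made-up four-row frame; df_demo, summary and per_column are names assumed for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Sex": ["M", "F", "M", "F"],
    "Age": [41, 30, 30, 47],
    "Height": [74, 66, 71, 69],
})

# Several statistics for one column.
summary = df_demo.groupby("Sex")["Age"].agg(["min", "max", "mean"])

# A different statistic for each column.
per_column = df_demo.groupby("Sex").agg({"Age": "mean", "Height": "max"})

print(summary)
print(per_column)
```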

Cleaning Data with NaN

Detecting NaN

  • isnull()
  • notnull()

Handling NaN

  • dropna()
  • fillna()
import numpy as np
import pandas as pd

groups = ["Modern Web", "DevOps", np.nan, "Big Data", "Security", "自我挑戰組"]
ironmen = [59, 9, 19, 14, 6, np.nan]

ironmen_dict = {
    "groups": groups,
    "ironmen": ironmen
}

# Build the data frame
ironmen_df = pd.DataFrame(ironmen_dict)

print(ironmen_df.loc[:, "groups"].isnull())  # which rows have a missing group name
print("---")  # separator
print(ironmen_df.loc[:, "ironmen"].notnull())  # which rows have a non-missing ironmen count

ironmen_df_na_dropped = ironmen_df.dropna()  # drop every row that contains a missing value
print(ironmen_df_na_dropped)
print("---")  # separator
ironmen_df_na_filled = ironmen_df.fillna(0)  # fill every missing value with 0
print(ironmen_df_na_filled)
print("---")  # separator
ironmen_df_na_filled = ironmen_df.fillna({"groups": "Cloud", "ironmen": 71})  # fill missing values per column
print(ironmen_df_na_filled)
0    False
1    False
2     True
3    False
4    False
5    False
Name: groups, dtype: bool
---
0     True
1     True
2     True
3     True
4     True
5    False
Name: ironmen, dtype: bool
       groups  ironmen
0  Modern Web     59.0
1      DevOps      9.0
3    Big Data     14.0
4    Security      6.0
---
       groups  ironmen
0  Modern Web     59.0
1      DevOps      9.0
2           0     19.0
3    Big Data     14.0
4    Security      6.0
5       自我挑戰組      0.0
---
       groups  ironmen
0  Modern Web     59.0
1      DevOps      9.0
2       Cloud     19.0
3    Big Data     14.0
4    Security      6.0
5       自我挑戰組     71.0
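
A few common variations on the four functions above, sketched on a shortened, made-up version of the same data: counting missing values per column, dropping rows only when a chosen column is missing, and filling a numeric column with its mean. The variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

ironmen_df = pd.DataFrame({
    "groups": ["Modern Web", "DevOps", np.nan, "Big Data"],
    "ironmen": [59, 9, 19, np.nan],
})

# Number of missing values in each column.
missing_per_column = ironmen_df.isnull().sum()

# Drop a row only when its group name is missing.
only_groups_required = ironmen_df.dropna(subset=["groups"])

# Fill the numeric column with its own mean (NaN is skipped when averaging).
filled_with_mean = ironmen_df["ironmen"].fillna(ironmen_df["ironmen"].mean())

print(missing_per_column)
```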

Plot

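An important feature of DataFrame is that it can visualize data through the plotting functions of matplotlib.pyplot.
There are two ways: (1) call df.plot directly; (2) use pyplot's plot.
(1) is a quick way to plot.
(2) gives access to all of pyplot's features.
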
import matplotlib.pyplot as plt
df = pd.read_csv('drive/My Drive/Colab Notebooks/biostats.csv', index_col="Name", skipinitialspace=True)
df.plot(title="Generated Plot", grid=True, figsize=(8,4))
<matplotlib.axes._subplots.AxesSubplot at 0x7f952bc52240>

[Figure: "Generated Plot", line plot produced by df.plot with a grid, indexed by Name]

df.columns
plt.plot(df[['Age', 'Height (in)']])
plt.xlabel('Name')
plt.ylabel('Number')
plt.title('Generated Plot')
plt.grid()

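[Figure: "Generated Plot", Age and Height (in) plotted against Name with pyplot; x label "Name", y label "Number"]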
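
As one example of what the pyplot route (2) adds, each column can be drawn separately so that it gets its own label and marker. This is a sketch on a made-up three-row subset, not the notebook's own code; df_demo, fig and ax are assumed names.

```python
import matplotlib.pyplot as plt
import pandas as pd

df_demo = pd.DataFrame(
    {"Age": [41, 42, 32], "Height (in)": [74, 68, 70]},
    index=pd.Index(["Alex", "Bert", "Carl"], name="Name"),
)

fig, ax = plt.subplots(figsize=(8, 4))
for column in ["Age", "Height (in)"]:
    # One line per column, labelled so the legend can tell them apart.
    ax.plot(df_demo.index, df_demo[column], marker="o", label=column)

ax.set_xlabel("Name")
ax.set_ylabel("Number")
ax.set_title("Generated Plot")
ax.grid(True)
ax.legend()
```

In a notebook the figure is displayed after the cell runs; in a script, plt.show() or fig.savefig(...) would be needed.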