import pandas as pd
%pip install lxml
The installation above may not be needed in every environment, but pd.read_html() relies on an HTML parser such as lxml, so some setups require it.
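As a rough way to tell whether the install step is needed, the sketch below checks which parser packages can be imported in the current environment. The helper name available_html_parsers is hypothetical; lxml, bs4 and html5lib are the parser packages that pd.read_html() can work with.

```python
import importlib.util


def available_html_parsers():
    """Return the installed parser packages that pandas.read_html can use."""
    candidates = ["lxml", "bs4", "html5lib"]
    # find_spec returns None when a top-level package is not installed.
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


print(available_html_parsers())
```

If the list contains neither lxml nor both bs4 and html5lib, installing one of them as shown above should be enough.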
pd.read_html(): a function for reading HTML tables into pandas DataFrames
Basic usage
Reading HTML tables
Let’s use a Wikipedia page that contains several tables as a working example. pd.read_html() returns a list with one DataFrame for each table it finds on the page.
url = 'https://en.wikipedia.org/wiki/List_of_Indian_states_and_union_territories_by_literacy_rate'
tables = pd.read_html(url)
display(tables)
[ States and union territories of India ordered by
0 Area Population GDP (per capita) Abbreviations...
1 vte,
State or UT Census 2011[2] \
State or UT Average Male Female
0 India 74.04 82.14 65.46
1 A&N islands[UT][citation needed] 86.63 90.27 82.43
2 Andhra Pradesh 67.02 74.88 59.15
3 Arunachal Pradesh 65.38 72.55 57.70
4 Assam 72.19 77.85 66.27
5 Bihar 61.80 71.20 51.50
6 Chhattisgarh 70.28 80.27 60.24
7 Chandigarh[UT] 86.05 89.99 81.19
8 Dadra and Nagar Haveli[UT] 76.34 85.17 64.32
9 Daman & Diu[UT] 87.10 91.54 79.55
10 Delhi[UT] 86.21 90.94 80.76
11 Goa 88.70 92.65 84.66
12 Gujarat 78.03 85.75 69.68
13 Haryana 75.55 84.06 65.94
14 Himachal Pradesh 82.80 89.53 75.93
15 Jammu and Kashmir 67.16 76.75 56.43
16 Jharkhand 66.41 76.84 55.42
17 Karnataka 75.36 82.47 68.08
18 Kerala 94.00 96.11 91.07
19 Lakshadweep[UT] 91.85 95.56 87.95
20 Madhya Pradesh 69.32 78.73 59.24
21 Maharashtra 82.34 88.38 75.87
22 Manipur 76.94 83.58 70.26
23 Meghalya 74.43 75.95 72.89
24 Mizoram 91.33 93.35 89.27
25 Nagaland 79.55 82.75 76.11
26 Odisha 72.87 81.59 64.01
27 Puducherry[UT] 85.85 91.26 80.67
28 Punjab 75.84 80.44 70.73
29 Rajasthan 66.11 79.19 52.12
30 Sikkim 81.42 86.55 75.61
31 Tamil Nadu 80.09 86.77 73.44
32 Telangana - - -
33 Tripura 87.22 92.53 82.73
34 Uttarakhand 78.82 87.40 70.01
35 Uttar Pradesh 67.68 77.28 57.18
36 West Bengal 76.36 81.69 70.54
NSO survey (2017)[3]
Average Male Female
0 77.7 84.7 70.3
1 86.27 90.11 81.84
2 66.9 80 59.5
3 66.95 73.4 59.50
4 85.9 90.1 81.2
5 70.9 79.7 60.5
6 77.3 85.4 68.8
7 86.43 90.54 81.38
8 77.65 86.46 77.65
9 87.07 91.48 87.07
10 88.70 82.40 93.70
11 87.4 92.81 81.84
12 82.4 89.5 74.8
13 80.4 88.0 71.3
14 86.6 92.9 80.5
15 77.3 85.7 68.0
16 74.3 83.0 64.7
17 77.2 83.4 70.5
18 96.2 97.4 95.2
19 92.28 96.11 88.25
20 73.7 81.2 65.5
21 84.8 90.7 78.4
22 79.85 86.49 73.17
23 75.48 77.17 73.78
24 91.58 93.72 89.40
25 80.11 83.29 80.11
26 77.3 84.0 70.3
27 86.55 92.12 86.55
28 83.7 88.5 78.5
29 69.7 80.8 57.6
30 82.2 87.29 76.43
31 82.9 87.9 77.9
32 - - -
33 87.75 92.18 83.15
34 87.6 94.3 80.7
35 73.0 81.8 63.4
36 80.50 84.80 76.10 ,
State/UT 1951 1961 1971 1981 1991 2001 2011
0 A&N islands 30.30 40.07 51.15 63.19 73.02 81.30 86.63
1 Andhra Pradesh - 21.19 24.57 35.66 44.08 60.47 67.02
2 Arunachal Pradesh - 7.13 11.29 25.55 41.59 54.34 65.38
3 Assam 18.53 32.95 33.94 - 52.89 63.25 72.19
4 Bihar 13.49 21.95 23.17 32.32 37.49 47.00 61.80
5 Chandigarh - - 70.43 74.80 77.81 81.94 86.05
6 Chhattisgarh 9.41 18.14 24.08 32.63 42.91 64.66 70.28
7 Dadra and Nagar Haveli - - 18.13 32.90 40.71 57.63 76.24
8 Daman and Diu - - - - 71.20 78.18 87.10
9 Delhi - 61.95 65.08 71.94 75.29 81.67 86.21
10 Goa 23.48 35.41 51.96 65.71 75.51 82.01 88.70
11 Gujarat 21.82 31.47 36.95 44.92 61.29 69.14 78.03
12 Haryana - - 25.71 37.13 55.85 67.91 75.55
13 Himachal Pradesh - - - - 63.86 76.48 82.80
14 Jammu and Kashmir - 12.95 21.71 30.64 - 55.52 67.16
15 Jharkhand 12.93 21.14 23.87 35.03 41.39 53.56 66.41
16 Karnataka - 29.80 36.83 46.21 56.04 66.06 75.36
17 Kerala 47.18 55.08 69.75 78.85 89.81 90.86 94.00
18 Lakshadweep 15.23 27.15 51.76 68.42 81.78 86.66 91.85
19 Madhya Pradesh 13.16 21.41 27.27 38.63 44.67 63.74 69.32
20 Maharashtra 27.91 35.08 45.77 57.24 64.87 76.84 82.34
21 Manipur 12.57 36.04 38.47 49.66 59.89 70.50 76.94
22 Meghalya - 26.92 29.49 42.05 49.10 62.56 74.43
23 Mizoram 31.14 44.01 53.80 59.88 82.26 88.80 91.33
24 Nagaland 10.52 21.95 33.78 50.28 61.65 66.59 79.55
25 Odisha 15.80 21.66 26.18 33.62 49.09 63.08 72.87
26 Puducherry - 43.65 53.38 65.14 74.74 81.24 85.85
27 Punjab - - 34.12 43.37 58.51 69.65 75.84
28 Rajasthan 8.5 18.12 22.57 30.11 38.55 60.41 66.11
29 Sikkim - - 17.74 34.05 56.94 68.81 81.42
30 Tamil Nadu - 36.39 45.40 54.39 62.66 73.45 80.33
31 Tripura - 20.24 30.98 50.10 60.44 73.19 87.22
32 Uttar Pradesh 12.02 20.87 23.99 32.65 40.71 56.27 67.68
33 Uttarakhand 18.93 18.05 33.26 46.06 57.75 71.62 78.82
34 West Bengal 24.61 34.46 38.86 48.65 57.70 68.64 76.26
35 India 18.33 28.30 34.45 43.57 52.21 64.84 74.04,
Social Group Rural Urban Rural + Urban
Social Group Male Female Male Female Male Female
0 ST 75.6 58.8 91.3 79.6 77.5 61.3
1 SC 78.0 60.9 88.4 75.3 80.3 63.9
2 OBC 81.7 64.2 91.1 80.5 84.4 68.9
3 OTHERS 87.6 74.5 95.0 88.6 90.8 80.6
4 ALL 81.5 65.0 92.2 82.8 84.7 70.3,
Social Group Rural Urban Rural + Urban
Social Group Male Female Male Female Male Female
0 Hindu 81.8 64.5 93.4 83.8 85.1 70.0
1 Muslim 77.4 64.8 85.8 75.6 80.6 68.8
2 Christian 84.4 77.0 95.5 91.4 88.2 82.2
3 Sikh 92.7 96.4 94.2 95.3 88.9 90.1
4 ALL 81.5 65.0 92.2 82.8 84.7 70.0]
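Dumping the whole list is hard to read, so it may help to summarise each table first. The sketch below does this on a small inline HTML snippet, a hypothetical stand-in for the Wikipedia page that reuses a few values from the output above so it runs offline. It also passes na_values=["-"], on the assumption that the "-" placeholders seen above should be treated as missing values.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the downloaded page: two small tables.
html = """
<table>
  <tr><th>State/UT</th><th>1951</th><th>2011</th></tr>
  <tr><td>Kerala</td><td>47.18</td><td>94.00</td></tr>
  <tr><td>Andhra Pradesh</td><td>-</td><td>67.02</td></tr>
</table>
<table>
  <tr><th>Social Group</th><th>Male</th><th>Female</th></tr>
  <tr><td>ST</td><td>77.5</td><td>61.3</td></tr>
</table>
"""

# Treat "-" as a missing value while parsing, so numeric columns stay numeric.
tables = pd.read_html(StringIO(html), na_values=["-"])

# One line per table: position in the list, shape, and column labels.
for i, table in enumerate(tables):
    print(i, table.shape, [str(c) for c in table.columns])

# Count the missing entries in the second column (1951) of the first table.
history = tables[0]
missing_1951 = int(history.iloc[:, 1].isna().sum())
print(missing_1951)
```

Once the position of the table you want is known, it can be picked out with tables[i] instead of displaying the entire list.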
Demonstrating some features of pandas.read_html()
Using match:
The match argument keeps only the tables that contain text matching the given string or regular expression. It is useful for pages with multiple tables.
tables = pd.read_html(url, match='Social Group')
display(tables[0].head())
  Social Group  Rural          Urban          Rural + Urban
  Social Group   Male  Female   Male  Female           Male  Female
0           ST   75.6    58.8   91.3    79.6           77.5    61.3
1           SC   78.0    60.9   88.4    75.3           80.3    63.9
2          OBC   81.7    64.2   91.1    80.5           84.4    68.9
3       OTHERS   87.6    74.5   95.0    88.6           90.8    80.6
4          ALL   81.5    65.0   92.2    82.8           84.7    70.3
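The matched table has a two-level column header (Rural, Urban and Rural + Urban over Male and Female), which pandas represents as a MultiIndex. The sketch below rebuilds the first two rows by hand, so it needs no network access, and shows one way to select a sub-column and to flatten the header. The flattened label format is an arbitrary choice for illustration.

```python
import pandas as pd

# Rebuild the first two rows of the matched table with its two-level header.
columns = pd.MultiIndex.from_tuples([
    ("Social Group", "Social Group"),
    ("Rural", "Male"), ("Rural", "Female"),
    ("Urban", "Male"), ("Urban", "Female"),
    ("Rural + Urban", "Male"), ("Rural + Urban", "Female"),
])
df = pd.DataFrame(
    [
        ["ST", 75.6, 58.8, 91.3, 79.6, 77.5, 61.3],
        ["SC", 78.0, 60.9, 88.4, 75.3, 80.3, 63.9],
    ],
    columns=columns,
)

# Select a single sub-column with a (top level, second level) tuple.
rural_male = df[("Rural", "Male")]

# Flatten the header: join both levels, collapsing labels that repeat.
flat = df.copy()
flat.columns = [
    top if top == bottom else f"{top} {bottom}" for top, bottom in df.columns
]
print(list(flat.columns))
```

Flat labels are often easier to work with when exporting to CSV or filtering columns by name.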
tables = pd.read_html(url, match='Social Group')
display(tables[1].head())
  Social Group  Rural          Urban          Rural + Urban
  Social Group   Male  Female   Male  Female           Male  Female
0        Hindu   81.8    64.5   93.4    83.8           85.1    70.0
1       Muslim   77.4    64.8   85.8    75.6           80.6    68.8
2    Christian   84.4    77.0   95.5    91.4           88.2    82.2
3         Sikh   92.7    96.4   94.2    95.3           88.9    90.1
4          ALL   81.5    65.0   92.2    82.8           84.7    70.0
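To see how match behaves without downloading the page, the sketch below runs it on a small inline HTML snippet (a hypothetical stand-in with two tables). It shows a plain-text match, a regular-expression match, and the ValueError that pandas raises when no table matches.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the page: two small tables.
html = """
<table>
  <tr><th>State/UT</th><th>2011</th></tr>
  <tr><td>Kerala</td><td>94.00</td></tr>
</table>
<table>
  <tr><th>Social Group</th><th>Male</th><th>Female</th></tr>
  <tr><td>ST</td><td>77.5</td><td>61.3</td></tr>
</table>
"""

# Plain text: only the table containing "Social Group" is returned.
by_text = pd.read_html(StringIO(html), match="Social Group")

# The string is treated as a regular expression, so alternation works too.
by_regex = pd.read_html(StringIO(html), match="State/UT|Social Group")

# When nothing matches, read_html raises ValueError instead of returning [].
try:
    pd.read_html(StringIO(html), match="Religion")
    found = True
except ValueError:
    found = False

print(len(by_text), len(by_regex), found)
```

Because a failed match raises an exception, wrapping the call in try/except is a reasonable precaution when a page's layout may change.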
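Using index_col and header:
index_col sets the column to use as the row index, and header sets the row (or rows) to use as the column labels, which gives the DataFrame a cleaner structure.
df = pd.read_html(url, index_col=0)[1]
display(df.head())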
                                  Census 2011[2]              NSO survey (2017)[3]
State or UT                       Average   Male  Female      Average   Male  Female
India                               74.04  82.14   65.46         77.7   84.7    70.3
A&N islands[UT][citation needed]    86.63  90.27   82.43        86.27  90.11   81.84
Andhra Pradesh                      67.02  74.88   59.15         66.9     80    59.5
Arunachal Pradesh                   65.38  72.55   57.70        66.95   73.4   59.50
Assam                               72.19  77.85   66.27         85.9   90.1    81.2
Now “State or UT” is the index instead of a regular column. Passing header=1 instead uses only the second header row (counting from 0) as the column labels:
df = pd.read_html(url, header=1)[1]
display(df.head())
0  India                             74.04  82.14  65.46   77.7   84.7   70.3
1  A&N islands[UT][citation needed]  86.63  90.27  82.43  86.27  90.11  81.84
2  Andhra Pradesh                    67.02  74.88  59.15   66.9     80   59.5
3  Arunachal Pradesh                 65.38  72.55  57.70  66.95   73.4  59.50
4  Assam                             72.19  77.85  66.27   85.9   90.1   81.2
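The two options can also be combined. The sketch below uses a small inline table with a two-row header (a hypothetical, simplified stand-in for the literacy table, reusing two rows of its Census 2011 values) to compare the default result with header=1 and index_col=0 used together.

```python
from io import StringIO

import pandas as pd

# Hypothetical, simplified stand-in for the literacy table.
html = """
<table>
  <thead>
    <tr><th>State or UT</th><th colspan="2">Census 2011</th></tr>
    <tr><th>State or UT</th><th>Male</th><th>Female</th></tr>
  </thead>
  <tbody>
    <tr><td>Kerala</td><td>96.11</td><td>91.07</td></tr>
    <tr><td>Bihar</td><td>71.20</td><td>51.50</td></tr>
  </tbody>
</table>
"""

# Default: both header rows are kept, giving two-level column labels.
default_df = pd.read_html(StringIO(html))[0]
print(default_df.columns.nlevels)

# header=1 keeps only the second header row; index_col=0 makes the
# first column ("State or UT") the row index.
flat_df = pd.read_html(StringIO(html), header=1, index_col=0)[0]
print(flat_df)

# With the state names as the index, lookups by label become direct.
kerala_female = flat_df.loc["Kerala", "Female"]
```

Dropping the top header row loses the grouping label (here Census 2011), so this works best when the second row alone identifies each column.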