Handling CSV files - an example¶
In this chapter, we’ll look at handling structured data from spreadsheets in Python. The CSV module’s main documentation is at CSV File Reading and Writing.
Example dataset¶
For our example we will be using aquaculture data which is freely available from Fiskeridirektoratet.
You can get a copy of the dataset also from this local link
Akvakulturregisteret.csv
(Source: Fiskeridirektoratet, 13.04.2020).
The dataset contains an overview of all registered aquaculture setups in Norway, showing which species they grow, where they’re located, and so on.
The CSV file format is the most common import and export format for spreadsheets and databases, and is a plain text file. Let’s open it in an editor to see what columns it contains. Spreadsheets often have the columns explained in the first few lines:
AKVAKULTURTILLATELSER PR. 13.04.2020;;;;;;;;;;;;;(Enhet i stk. oppgis i 1000) ;;;;;;;;(Enhet i stk. oppgis i 1000) ;;;
TILL_NR;ORG.NR/PERS.NR;NAVN;ADRESSE;POSTNR;POSTSTED;TILDELINGSTIDSPUNKT;TIDSBEGRENSET;TILL_KOMNR;TILL_KOM;FORMÅL;PRODUKSJONSFORM;ART;TILL_KAP;TILL_ENHET;LOK_NR;LOK_NAVN;LOK_KOMNR;LOK_KOM;LOK_PLASS;VANNMILJØ;LOK_KAP;LOK_ENHET;N_GEOWGS84;Ø_GEOWGS84
Here, it looks like the second line is what we need. The rest of the file contains the data, one line for each data point, and the entries are separated with semicolons.
CSV module¶
Since the data is stored as plain text, we could use our usual file reading code
to read it in, and use line.split(';')
to separate out the columns.
The python library also comes with a specialised module to read CSV files, that
does a lot of the repetitive work for us. Take a look at the
CSV File Reading and Writing
section in the Python manual. Let’s try to use the simple example they give there
with our data instead. Compared to the first example on the Python CSV website,
I have changed the delimiter option to ;
and removed the
quotechar option, since it looks like no quotes are used in our CSV file.
import csv
with open('Akvakulturregisteret.csv', newline='') as csvfile:
akvareader = csv.reader(csvfile, delimiter=';')
for row in akvareader:
print(row)
Trying to run this, you may see
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc5 in position 230: invalid continuation byte
.
This indicates that the encoding of the text in the file is not in the UTF-8
format that your Python expects as standard. Given it’s a Norwegian dataset, the
other likely encoding is the older iso-8859-1
format. Let’s specify that
explicitly:
import csv
with open('Akvakulturregisteret.csv', newline='', encoding='iso-8859-1') as csvfile:
akvareader = csv.reader(csvfile, delimiter=';')
for row in akvareader:
print(row)
This seems to work! We get a printout of all datasets.
Application examples¶
The last two fields in each entry are latitude and longitude. Let’s try to plot the locations on a map. First, we check if we can extract the values:
import csv
lats = []
lons = []
with open('Akvakulturregisteret.csv', newline='', encoding='iso-8859-1') as csvfile:
akvareader = csv.reader(csvfile, delimiter=';')
for row in akvareader:
lat = float(row[-2]) # latitude is second last
lon = float(row[-1]) # longitude is last
lats.append(lat)
lons.append(lon)
print(lats,lons)
We get stuck on the first line in the CSV-file: ValueError: could not convert string to float: ''
.
Right, they didn’t contain any data, just the header information. One way
of skipping them is to handle this exception with try ... except
. If we can’t
get a valid latitude, let’s continue to the next line:
import csv
lats = []
lons = []
with open('Akvakulturregisteret.csv', newline='', encoding='iso-8859-1') as csvfile:
akvareader = csv.reader(csvfile, delimiter=';')
for row in akvareader:
try:
lat = float(row[-2]) # latitude is second last
lon = float(row[-1]) # longitude is last
except ValueError:
continue
lats.append(lat)
lons.append(lon)
print(lats,lons)
This works and prints long lists of numbers. Let’s plot them:
import csv
lats = []
lons = []
with open('Akvakulturregisteret.csv', newline='', encoding='iso-8859-1') as csvfile:
akvareader = csv.reader(csvfile, delimiter=';')
for row in akvareader:
try:
lat = float(row[-2]) # latitude is second last
lon = float(row[-1]) # longitude is last
except ValueError:
continue
lats.append(lat)
lons.append(lon)
try:
import matplotlib.pyplot as plt
plt.plot(lons,lats,'+')
plt.show()
except (ImportError, ModuleNotFoundError) as e:
print(f'Import of matplotlib failed: {e}')
print(f'We have {len(lats)} latitudes and {len(lons)} longitudes')