Amospheric pressure and correlations ==================================== Let's take another geosciences application. Here we will explore how the atmospheric pressure is related to environmental variables such as wind speed, precipitation and humidity with the help of numpy and matplotlib. Understanding how these features evolve, both in time and in space, are crucial for weather forecasting. Low-pressure (cyclones, L) and high-pressure (anti-cyclones, H) systems are associated to cold-outbreaks (L), storms (L), hurricanes (L) and droughts and natural bushfires (H), for example. Knowing beforehand that an intense storm is advancing allows forecasters to emit alerts informing bad conditions for shipping and oil exploration due to high waves, flying and even prevent life losses caused by floods. As a short note, Bergen was the house of the so-called “Bergen School of Meteorology” founded by Prof. Vilhelm Bjerknes in 1917, known by its fundamental contributions in operational forecasting and dynamic meteorology. Our main goal in this assignment is to calculate the correlation between the atmospheric pressure recorded by a weather station in the Flesland airport and the three aforementioned environmental variables. Which of them has the highest correlation? Which has the lowest? For this, you will have to implement the Pearson’s correlation equation and apply it on the pairs of variables contained in a file. Part 1 ****** Task 1: ------- .. note:: This task may look familiar to you from week 8! Here it is in context of a larger problem: The Pearson’s correlation equation is given by: .. math:: r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum(x - \bar{x})^2 \sum(y - \bar{y})^2}} where :math:`(x, y)` are your time series and (:math:`\bar{x}`, :math:`\bar{y}`) are the corresponding averages. In simple words, the Pearson’s correlation coefficient is the *covariance* between :math:`x` and :math:`y` divided by their *standard deviations*. **Implement the previous equation** in the ``pearson_corr()`` function, which takes as arguments two time series (:math:`x`and :math:`y`) and return the coefficient :math:`r` (-1 :math:`\leq` *r* :math:`\leq` 1). Proceed with the following steps to accomplish that: - **Step 1:** Take the mean of :math:`x` (``mean_x``) and :math:`y` (``mean_y``); - **Step 2:** Create the lists ``dx`` and ``dy`` where the anomalies (:math:`x - \bar{x}`) and (:math:`y - \bar{y}`) will be stored. - **Step 3:** Take the sum of the anomalies product (``dx`` :math:`*` ``dy``). Let’s call it ``sum_dxdy``. - **Step 4:** Define as ``dx2`` and ``dy2`` the sum of the squared anomalies [(:math:`x - \bar{x})\ ^{2}`, (:math:`y - \bar{y})\ ^{2}`]. - **Step 5:** Multiply ``dx2`` and ``dy2``. Take the square root of this product and store in a new variable called ``sqr_dx2dy2``. - **Step 6:** Define the correlation coefficient ``r`` as the ratio between ``sum_dxdy`` and ``sqr_dx2dy2``. **Tip:** Use the built-in function ``sum`` to represent the summations together with the built-in function ``zip`` and list comprehension. .. code:: python def pearson_corr(x, y): mean_x = sum(x)/len(x) # calculate mean of x mean_y = sum(y)/len(y) # calculate mean of y dx = [(i - mean_x) for i in x] # (x - x_mean) dy = [(j - mean_y) for j in y] # (y - y_mean) sum_dxdy = sum([i*j for i,j in zip(dx, dy)]) # covariance dx2 = sum([i*i for i in dx]) # x-standard deviation dy2 = sum([j*j for j in dy]) # y-standard deviation sqr_dx2dy2 = (dx2*dy2)**0.5 r = sum_dxdy/sqr_dx2dy2 # coefficient return r We have created the function to calculate the Pearson’s correlation coefficient (``r``). We have to test this function before using it with real data. For this, create two arrays with length > 10, in which :math:`(x = y)`, and use your function ``pearson_corr()`` to obtain ``r``. **We can easily use** ``numpy`` **to create these arrays!** For example, to create an array of 30 random numbers from a normal distribution, you can use ``np.random.randn()``: .. code:: python import numpy as np x = np.random.randn(30) y = x r = pearson_corr(x,y) print("The value of r is %f"%(r)) # The value of r is 1.000000. As expected, the correlation between x and y is 1. Why? Task 2 ------ Calculate the correlation between :math:`x` and :math:`y` for the case when :math:`y = -x`. What does this result represent? .. code:: python x1 = np.random.randn(30) y1 = (-1)*x1 r1 = pearson_corr(x1,y1) print("The value of r is %f"%(r1)) # The value of r is -1.000000 Part 2 ****** In the last assignment you implemented the ``pearson_corr()`` function to obtain the correlation between two variables :math:`x` and :math:`y`. Now, you will reuse it to verify how the atmospheric pression is related to other environmental data. The file :download:`flesland_eklima.csv ` contains time series `(08-04-2021, 08-05-2021)` of four variables obtained in a weather station located in the Flesland airport, Bergen. Both data and weather station are under the Norwegian Meteorological Institute (MET) responsibility and are considered reliable. This dataset and others can be easily downloaded through the Eklima service made available by MET. Task 1 ------ Open the :download:`flesland_eklima.csv ` data with the ``Pandas`` function ``read_csv()`` as a dataframe (``df``). .. code:: python import pandas as pd df = pd.read_csv('flesland_eklima.csv') Let's check the contents of your new dataframe. .. code:: python df .. raw:: html
Name Station Time Prec Press Rel_hum Ws
0 Flesland SN50500 08.04.2021 18:00 0.7 992.1 92.0 10.6
1 Flesland SN50500 08.04.2021 19:00 1.9 991.9 82.0 6.7
2 Flesland SN50500 08.04.2021 20:00 0.0 991.0 77.0 7.4
3 Flesland SN50500 08.04.2021 21:00 0.1 990.5 80.0 7.8
4 Flesland SN50500 08.04.2021 22:00 0.0 989.6 86.0 6.7
... ... ... ... ... ... ... ...
714 Flesland SN50500 08.05.2021 12:00 0.0 1010.3 66.0 4.0
715 Flesland SN50500 08.05.2021 13:00 0.0 1010.1 59.0 4.1
716 Flesland SN50500 08.05.2021 14:00 0.0 1009.8 57.0 3.6
717 Flesland SN50500 08.05.2021 15:00 0.0 1009.4 50.0 2.9
718 Data valid for 08.05.2021 (CC BY 4.0) The Norwegian Meteorological Institute (MET N... NaN NaN NaN NaN NaN

719 rows × 7 columns

In this table, ``Prec``, ``Press``, ``Rel_hum`` and ``Ws`` stand for precipitation, atmospheric pressure at sea level, relative humidity and wind speed, respectively. We see that the file was read correctly, but the last line presents text information that might cause us some troubles. To delete it, use `pandas.DataFrame.drop `__ and specify the row to be deleted (row 718). **Tip:** To delete the last row, use ``-1`` as the index input. .. code:: python df = df.drop(df.index[[-1]]) The first two columns (Name and Station) are not important for us, but the other five are. Let’s save these df values into the variables ``time``, ``prec``, ``press``, ``rel_hum`` and ``wnd_spd``. .. code:: python time = df.Time # time prec = df.Prec # precipitation press = df.Press # atmospheric pressure at sea level rel_hum = df.Rel_hum # relative humidity wnd_spd = df.Ws # wind speed Now let's have a look at how atmospheric pressure ``press`` correlates with each of the following variables: ``prec``, ``rel_hum``, and ``wnd_spd`` with the help of our ``pearson_corr()``. .. code:: python press_prec = pearson_corr(press,prec) # atmospheric pressure x precipitation print("The correlation value between atmospheric pressure and precipitation is %f"%(press_prec)) # The correlation value between atmospheric pressure and precipitation is -0.184995 press_relh = pearson_corr(press,rel_hum) # atmospheric pressure x relative humidity print("The correlation value between atmospheric pressure and relative humidty is %f"%(press_relh)) # The correlation value between atmospheric pressure and relative humidty is -0.075288 press_wnd = pearson_corr(press,wnd_spd) # atmospheric pressure x wind speed print("The correlation value between atmospheric pressure and wind speed is %f"%(press_wnd)) # The correlation value between atmospheric pressure and wind speed is -0.177363 We have negative correlations between atmospheric pressure and the precipitation (approx. -0.18), a slightly negative correlation between atmospheric pressure and wind speed (approx. -0.18), and near zero with relative humidity. What does this tell us? The negative values mean that whenever there is a **decrease** in atmospheric pressure, one may observe an **increase** in wind speed and precipitation. Task 2 ------ In order to validate your results, compare the values obtained with your ``pearson_corr()`` function to the ``Pandas`` built-in correlation function. .. code:: python # Using Pandas built-in correlation function. press_prec_sc = press.corr(prec) # atmospheric pressure x precipitation print("The correlation value between atmospheric pressure and precipitation is %f"%(press_prec)) # The correlation value between atmospheric pressure and precipitation is -0.184995 press_relh_sc = press.corr(rel_hum) # atmospheric pressure x relative humidity print("The correlation value between atmospheric pressure and relative humidty is %f"%(press_relh)) # The correlation value between atmospheric pressure and relative humidty is -0.075288 press_wnd_sc = press.corr(wnd_spd) # atmospheric pressure x wind speed print("The correlation value between atmospheric pressure and wind speed is %f"%(press_wnd)) # The correlation value between atmospheric pressure and wind speed is -0.177363 How do the results of your ``pearson_corr()`` compare with the ``Pandas`` built-in function? Task 3 ------ What must be made clear is that correlation does not provide information about how much two time series are ‘equal’. **Step 1:** Going back to the tide example that we showed in `Week 12 `_, create two pure sine signals (``ssh``, ``ssh1``) with the same period (12 hours) and phase (:math:`0\ ^{\circ}`, but different amplitudes (15 cm and 3.4 cm). Consider a length of 15 days. .. code:: python import matplotlib.pyplot as plt a = 15 # amplitude 1 a1 = 3.4 # amplitude 2 phase = 0 period = 12 omega = (2*np.pi/period) # convert period to rad t = np.arange(0, 15*24+1, 1) # define time # calculate the sine signals ssh = a*np.sin(omega*t - phase) # first ssh ssh1 = a1*np.sin(omega*t - phase) # first ssh # plot thw two waves plt.plot(ssh, 'r-') plt.plot(ssh1, 'k-') plt.show() .. image:: oppgaver3_16_0.png **Step 2:** Using your ``pearson_corr()`` function, calculate the correlation coefficient between ssh and ssh1. .. code:: python r_ssh = pearson_corr(ssh, ssh1) print("The correlation value between ssh and ssh is %f"%(r_ssh)) # The correlation value between ssh and ssh is 1.000000 Ok, so the correlation between them is maximum, i.e., 1. However, they are not really the same, right? The Pearson correlation just tells us the relationship between two time series, or how they vary in relation to each other: if :math:`y` increases (or decreases) at the same time that :math:`x` increases (or decreases), then :math:`r = 1 (-1)`. In some applications, for instance comparing model outputs to observed data, it is necessary to know more than this relationship. You also must know if they have the same magnitudes. How can we determine this?