How Does Geography Impact Climate?

In this post, I will be creating several interesting, interactive data graphics using the NOAA climate data that we’ve explored in the first several weeks of lectures.

§1. Required Packages

To follow this tutorial, the following packages are required.

import numpy as np
import pandas as pd
import sqlite3
from plotly import express as px
import datetime
from sklearn.linear_model import LinearRegression

§2. Downloading Data

First, we need to load the three datasets into Python.

# '../' is included in the path since the files used are one directory above
temps = pd.read_csv("../temps_stacked.csv")
temps.head()
ID Year Month Temp
0 ACW00011604 1961 1 -0.89
1 ACW00011604 1961 2 2.36
2 ACW00011604 1961 3 4.72
3 ACW00011604 1961 4 7.73
4 ACW00011604 1961 5 11.28
countries = pd.read_csv('../countries.csv')

# whitespaces in column names are bad for SQL
countries = countries.rename(columns={"FIPS 10-4": "FIPS_10_4", "ISO 3166": "ISO_3166" })
countries.head()
FIPS_10_4 ISO_3166 Name
0 AF AF Afghanistan
1 AX - Akrotiri
2 AL AL Albania
3 AG DZ Algeria
4 AQ AS American Samoa
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/noaa-ghcn/station-metadata.csv"
stations = pd.read_csv(url)
stations.head()
ID LATITUDE LONGITUDE STNELEV NAME
0 ACW00011604 57.7667 11.8667 18.0 SAVE
1 AE000041196 25.3330 55.5170 34.0 SHARJAH_INTER_AIRP
2 AEM00041184 25.6170 55.9330 31.0 RAS_AL_KHAIMAH_INTE
3 AEM00041194 25.2550 55.3640 10.4 DUBAI_INTL
4 AEM00041216 24.4300 54.4700 3.0 ABU_DHABI_BATEEN_AIR

§3. Creating a Database

Next, we will create a database containing the datasets so that we can interact with them using SQL.

conn = sqlite3.connect("../temps.db")

# Use if_exists to prevent errors when temps.db already exists.
temps.to_sql("temperatures", conn, if_exists="replace", index=False)
countries.to_sql("countries", conn, if_exists="replace", index=False)
stations.to_sql("stations", conn, if_exists="replace", index=False)

# Always remember to close a connection after use.
conn.close()

§4. Average Temperature Change within a Country

The first question we want to answer is:

How does the average yearly change in temperature vary within a given country?

To answer this, we first need to create a function that can query the database for relevant data. The necessary data is as follows:

  • The station name.
  • The latitude of the station.
  • The longitude of the station.
  • The name of the country in which the station is located.
  • The year in which the reading was taken.
  • The month in which the reading was taken.
  • The average temperature at the specified station during the specified year and month.
def query_climate_database(country, year_begin, year_end, month):
    """
    Gets a country's temperature readings from its stations for a specified month across the provided timespan.

    Args:
        country (str): The name of a country (e.g. ‘South Korea’) for which data should be returned.
        year_begin (int): The earliest year for which data should be returned.
        year_end (int): The latest year for which data should be returned.
        month (int): The month of the year for which data should be returned.

    Returns:
        A Pandas dataframe of temperature readings for the specified country, in the specified date range, in the specified month of the year. 
    """
    
    conn = sqlite3.connect("../temps.db")

    cmd = \
    f"""
    SELECT S.name, latitude, longitude, C.name Country, year, month, temp
    FROM temperatures T
    LEFT JOIN stations S on T.id=S.id
    LEFT JOIN countries C on SUBSTR(T.id,1,2)=fips_10_4
    WHERE Country='{country}' AND month={month} AND year BETWEEN {year_begin} AND {year_end}
    """

    df = pd.read_sql_query(cmd, conn)
    conn.close()
    
    return df

As can be seen below, we can now use query_climate_database() to gather data for a specified country in a chosen timeframe.

query_climate_database(country = "India", 
                       year_begin = 1980, 
                       year_end = 2020,
                       month = 1)
NAME LATITUDE LONGITUDE Country Year Month Temp
0 PBO_ANANTAPUR 14.583 77.633 India 1980 1 23.48
1 PBO_ANANTAPUR 14.583 77.633 India 1981 1 24.57
2 PBO_ANANTAPUR 14.583 77.633 India 1982 1 24.19
3 PBO_ANANTAPUR 14.583 77.633 India 1983 1 23.51
4 PBO_ANANTAPUR 14.583 77.633 India 1984 1 24.81
... ... ... ... ... ... ... ...
3147 DARJEELING 27.050 88.270 India 1983 1 5.10
3148 DARJEELING 27.050 88.270 India 1986 1 6.90
3149 DARJEELING 27.050 88.270 India 1994 1 8.10
3150 DARJEELING 27.050 88.270 India 1995 1 5.60
3151 DARJEELING 27.050 88.270 India 1997 1 5.70

3152 rows × 7 columns

We are looking for the average yearly change for each station, which is essentially the slopes from fitting a linear regression model to each station’s temperatures. The following helper function will make these slopes easy to find.

def coef(data_group):
    """
    Calculates the average change in temperature over each year.

    Args:
        data_group (DataFrameGroupBy): Data that includes temperature recordings by year.
        
    Returns:
        Returns the average change in temperature.
    """
    X = data_group[["Year"]]
    y = data_group["Temp"]
    LR = LinearRegression()
    LR.fit(X, y)
    slope = LR.coef_[0]
    return slope

Next, we want to create an interactive geographic scatterplot, using Plotly Express, with a point for each station, such that the color of the point reflects an estimate of the yearly change in temperature during the specified month and time period at that station.

def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
    """
    Plots a country's temperature readings from its stations for a specified month across the provided timespan.

    Args:
        country (str): The name of a country (e.g. ‘South Korea’) for which data should be returned.
        year_begin (int): The earliest year for which data should be returned.
        year_end (int): The latest year for which data should be returned.
        month (int): The month of the year for which data should be returned.
        min_obs (int): The minimum required number of years of data for any given station. 
        **kwargs: Additional keyword arguments passed to px.scatter_mapbox()
        
    Returns:
        Returns a plot with an estimate of the yearly change in temperature during the specified month and time period for stations in a country.
    """

    # Use the created query
    df = query_climate_database(country, year_begin, year_end, month)
    
    # Filter stations with minimum observations.
    df['count'] = df.groupby(["NAME"])["Year"].transform(len)
    df = df[df['count'] >= min_obs]
    
    # Calculate average increase in temp using coef() and round the values to 4 decimal places.
    coefs = df.groupby(["NAME"]).apply(coef)
    coefs = coefs.reset_index()
    df = df.merge(coefs.round(4))
    df.rename(columns={0:'Estimated Yearly<br>Increase \N{DEGREE SIGN}C'},inplace=True)

    # Get name of the month.
    M = datetime.datetime.strptime(str(month), "%m").strftime("%B")
    fig = px.scatter_mapbox(df, 
                            lat = "LATITUDE",
                            lon = "LONGITUDE", 
                            hover_name = "NAME", 
                            color = "Estimated Yearly<br>Increase \N{DEGREE SIGN}C", 
                            title= f"Estimates of yearly increase in temperature in {M} <br> for stations in {country}, years {year_begin} - {year_end}",
                            **kwargs)

    # Set the margins to make room for the title.
    fig.update_layout(margin={"r":0,"t":60,"l":0,"b":0})
    
    # Center the colorbar at 0
    fig.update_coloraxes(cmid=0)
    return fig

Now that we have created temperature_coefficient_plot(), we can use it to create a plot of estimated yearly increases in temperature during the month of January, in the interval 1980-2020, in India.

color_map = px.colors.diverging.RdGy_r # choose a colormap

fig = temperature_coefficient_plot("India", 1980, 2020, 1, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map)

fig.show()

Let’s also create a plot of estimated yearly increases in temperature during the month of May, in the interval 1980-2020, in the United Kingdom.

fig = temperature_coefficient_plot("United Kingdom", 1980, 2020, 5, 
                                   min_obs = 10,
                                   zoom = 3,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map)

fig.show()

§5. How do Elevation and Latitude Impact Temperature?

The next question we would like to answer is:

How does station elevation and latitude relate to the average temperature within a given country?

To answer this, we first need to create a function that can query the database for relevant data. The necessary data is as follows:

  • The station name.
  • The latitude of the station.
  • The elevation of the station.
  • The name of the country in which the station is located.
  • The year in which the reading was taken.
  • The average temperature at the specified station during the specified year and month.
def query_climate_database2(country, year, month):
    """
    Gets a country's temperature readings from its stations for a specified month across the provided timespan.

    Args:
        country (str): The name of a country (e.g. ‘South Korea’) for which data should be returned.
        year_begin (int): The earliest year for which data should be returned.
        year_end (int): The latest year for which data should be returned.
        month (int): The month of the year for which data should be returned.

    Returns:
        A Pandas dataframe of temperature readings for the specified country, in the specified date range, in the specified month of the year. 
    """
    
    conn = sqlite3.connect("../temps.db")

    cmd = \
    f"""
    SELECT S.name, latitude, stnelev, C.name Country, year, temp
    FROM temperatures T
    LEFT JOIN stations S on T.id=S.id
    LEFT JOIN countries C on SUBSTR(T.id,1,2)=fips_10_4
    WHERE Country='{country}' AND year = {year} AND month = {month}
    """

    df = pd.read_sql_query(cmd, conn)
    conn.close()
    
    return df

Next, we want to create an interactive scatterplot, using Plotly Express, with a point for each station, such that its position reflects the latitude and average temperature and the color of the point reflects the elevation.

def temperature_elevation_plot(country, year, month, **kwargs):
    """
    Plots a country's temperature readings from its stations for a specified year against the stations' elevation and latitude.

    Args:
        country (str): The name of a country (e.g. ‘South Korea’) for which data should be returned.
        year (int): The year for which data should be returned.
        month (int): They month for which data should be returned.
        **kwargs: Additional keyword arguments passed to px.scatter()
        
    Returns:
        Returns a plot with the average temperature during the specified month and time period for stations in a country.
    """

    # Use the created query
    df = query_climate_database2(country, year, month)

    # Get name of the month.
    M = datetime.datetime.strptime(str(month), "%m").strftime("%B") # Get name of month.
    fig = px.scatter(df,
                     x = "LATITUDE",
                     y = "Temp",
                     color = "STNELEV",
                     hover_name = "NAME",
                     labels={"LATITUDE": "Latitude", # Make the variable names readable.
                             "Temp": "Average Temperature \N{DEGREE SIGN}C",
                             "STNELEV": "Station Elevation"}, 
                     title= f"Average temperature for stations in {country} in {M}, {year}",
                     **kwargs)

    fig.update_layout(margin={"r":0,"t":30,"l":0,"b":0})
    
    # Edit the colorbar title to fit better
    fig.update_coloraxes(colorbar_title_text="Elevation of<br>Station (m)")
    return fig

Let’s create a plot of the average temperatures recorded at each station in the United States during Janurary, 2020, against the stations’ elevations and latitudes.

color_map = px.colors.sequential.matter # choose a colormap
fig = temperature_elevation_plot("United States", 2020, 1,
                                 color_continuous_scale=color_map,
                                 opacity = 0.7)

fig.show()

From this scatterplot, it there appears to be a linear relationship between latitude and average temperature and elevation and average temperature.

§6. How do Temperatures Differ in the Northern and Southern Hemispheres?

The last question we would like to answer is:

How do the distibutions of average temperatures differ in the northern and southern hemispheres over a year?

To answer this, we first need to create a function that can query the database for relevant data. The necessary data is as follows:

  • The name of the country in which the station is located.
  • The latitude of the station.
  • The year in which the reading was taken.
  • The month in which the reading was taken.
  • The average temperature for the country over all of its stations during the specified year and month.
def query_climate_database3(year):
    """
    Gets a country's temperature readings from its stations for a specified month across the provided timespan.

    Args:
        country (str): The name of a country (e.g. ‘South Korea’) for which data should be returned.
        year_begin (int): The earliest year for which data should be returned.
        year_end (int): The latest year for which data should be returned.
        month (int): The month of the year for which data should be returned.

    Returns:
        A Pandas dataframe of temperature readings for the specified country, in the specified date range, in the specified month of the year. 
    """
    
    conn = sqlite3.connect("../temps.db")

    cmd = \
    f"""
    SELECT C.name Country, latitude, year, month, AVG(temp) mean_temp
    FROM temperatures T
    LEFT JOIN stations S on T.id=S.id
    LEFT JOIN countries C on SUBSTR(T.id,1,2)=fips_10_4
    WHERE year = {year}
    GROUP BY Country, month
    """

    df = pd.read_sql_query(cmd, conn)
    conn.close()
    
    return df

Next, we want to create an interactive histogram, using Plotly Express, that reflects all of the countries’ temperature readings for each month in a given year.

def temperature_distribution_plot(year, **kwargs):
    """
    Plots the world's monthly temperature distributions for the given year

    Args:
        year (int): The year for which data should be returned.
        month (int): They month for which data should be returned.
        **kwargs: Additional keyword arguments passed to px.histogram()
        
    Returns:
        Returns a plot with temperature distributions for each month.
    """

    # Use the created query
    df = query_climate_database3(year)

    # Filter and label each country's hemisphere.
    df['hem'] = df['LATITUDE'] > 0
    df['hem'].replace({True: 'North', False: 'South'}, inplace=True)
    
    fig = px.histogram(df,
                 x='mean_temp',
                 color="hem",
                 facet_row="Month",
                 labels={"hem": "Hemisphere",
                        "mean_temp": "Average Temperature \N{DEGREE SIGN}C"},
                 height = 1500, # Increase the height of the figure to fit all 12 distributions
                 title= f"Average temperature distributions for countries in the<br>Northern and Southern hemispheres in {year}"
                )

    fig.update_layout(margin={"r":0,"t":60,"l":0,"b":0})
    return fig

Let’s create distributions of the average temperatures recorded in the northern and southern hemispheres in 2020.

fig = temperature_distribution_plot(2020)
fig.show()

From these distributions, it appears that a higher percentage of countries in the northern hemisphere experience a more drastic change in average temperature over the period of a year compared to countries in the southern hemisphere.

Written on April 15, 2022