Course: Geospatial Analysis - Prof. Edier Aristizábal - Universidad Nacional de Colombia, Medellín campus

GEOSPATIAL ANALYSIS

Prof. Edier Aristizábal

First Law of Geography

“Everything is related to everything else, but near things are more related than distant things.”

Waldo R. Tobler (1970)

Introduction

The data era

Data store

Kilobytes were stored on disks, megabytes on hard drives, terabytes on disk arrays, and petabytes are stored in the cloud.

Source: Melvin M. Vopson (2021)

Geospatial Data Science

Geospatial data science (GDS) is a subset of Data Science that focuses on the unique characteristics of spatial data, moving beyond simply looking at where things happen to understand why they happen there.


https://carto.com/what-is-spatial-data-science/

Geospatial data science

The extraction of meaningful information from data involving location, geographic proximity and/or spatial interaction through the use of techniques specifically designed to deal appropriately with spatial data.

Source: Anselin (2000)

Geospatial technology

Source: Components of Geospatial Technology – Credits: Geospatial Global Outlook Report 2017/ GeoBuiz Report 2017

Charles Picquet (1832)

Source: MundoGIS

Dr. John Snow (1854)

Aerial photography (1858)

Remote sensing (1972)

Roger F. Tomlinson (1960)

Big Data

Spatial analysis

It is a broad term that includes:

  • Spatial data manipulation through geographical information systems (GIS),

  • Spatial data analysis in a descriptive and exploratory way,

  • Spatial statistics that employ statistical procedures to investigate if inferences can be made,

  • Spatial modeling, which involves the construction of models to identify relationships and predict outcomes in a spatial context.
Source: O'Sullivan & Unwin (2010)

Why is spatial special?

Source: HEAVY.AI

Working environment

TIOBE index - Programming language popularity

Python

Python code is fast to develop: Because the code does not need to be compiled and built, Python code can be changed and executed much more readily. This makes for a fast development cycle.

Python code is not as fast in execution: Because the code is not compiled directly to machine code and an additional layer, the Python virtual machine, is responsible for execution, Python code runs somewhat slower than conventional compiled languages such as C or C++.

Python is interpreted: Many programming languages require that a program be converted from the source language into binary code that the computer can understand. Python does not need a separate compilation step to binary code, which makes it easier to work with and more portable than many other programming languages.

Python is object oriented: Python is an object-oriented programming language, as are many modern programming languages. ArcGIS and QGIS are designed to work with object-oriented languages, and Python qualifies in this respect.

Packages

JupyterLab

Conda

pip

Docker

JavaScript

Google Earth Engine

Sentinel EO Browser

Spatial data I

Models are simplifications of reality

Measurement scales

Spatial data models

  • Data can be defined as verifiable facts about the real world.
  • Information is data organized to reveal patterns, and to facilitate search.
  • Data model: an abstraction of the real world which incorporates only those properties thought to be relevant to the application
  • Data structure: a representation of the data model
  • File format: the representation of the data in storage hardware

Real world data must be described in terms of a data model, then a data structure must be chosen to represent the data model, and finally a file format must be selected that is suitable for that data structure.

Spatial data

Spatial data is geographically referenced data, given at known locations and often represented visually through maps. That geographic reference, or the location component of the data, may be represented using any number of coordinate reference systems, for example, longitude and latitude.

In other words, spatial data are spatially dependent or correlated, so independence between observations, a common assumption for many statistical techniques, is not satisfied.

A spatial pattern may arise because the variables depend strictly on location, or because of direct interactions between the points.

Geospatial data

Geospatial data is data about objects, events, or phenomena that have a location on the surface of the earth, including location information, attribute information (the characteristics of the object, event, or phenomenon concerned), and often also temporal information (the time or life span at which the location and attributes exist).

Spatial Data Models

Spatial data may be of various broad types: points, lines, areas, and fields. Each type typically requires different techniques and approaches.

The relationship between real geographic entities and spatial data is complex and scale-dependent.

Types of spatial models

  • Object-based (feature) model: In the object view, we consider the world as a series of entities located in space. Entities are (usually) real. An object is a digital representation of all or part of an entity, which can be described in detail by its boundary lines and by the other objects that constitute it or are related to it.

  • Field model: In the field view, the world consists of properties continuously varying across space. It represents data that are considered to be continuously changing in two-dimensional or three-dimensional space. In a field, every location has a value (including "not here" or zero) and sets of values taken together define the field.

Object-based model

Vector

Objects are frequently not as simple as this geometric view leads one to assume. They may exist in three spatial dimensions, move and change over time, have a representation that is strongly scale-dependent, relate to entities that are themselves fuzzy and/or have indeterminate boundaries, or even be fractal.

Shapefile

A shapefile is a file-based data format native to ArcView software. Conceptually, a shapefile is a feature class: it stores a collection of features that have the same geometry type (point, line, or polygon), the same attributes, and a common spatial extent. Despite what its name may imply, a “single” shapefile is actually composed of at least three files, and as many as eight. Each file that makes up a “shapefile” has a common filename but a different extension.

Arc-Info Interchange (e00)

An ArcInfo interchange file, also known as an export file, is used to transfer a coverage, grid or TIN, and an associated INFO table, between different machines. This file has the .e00 extension.

File Geodatabase

A file geodatabase is a relational database storage format. It’s a far more complex data structure than the shapefile and consists of a .gdb folder housing dozens of files. Its complexity renders it more versatile allowing it to store multiple feature classes and enabling topological definitions. An example of the contents of a geodatabase is shown in the following figure.

GeoPackage

This is a relatively new data format that follows open format standards (i.e. it is non-proprietary). It’s built on top of SQLite (a self-contained relational database). Its one big advantage over many other vector formats is its compactness: coordinate values, metadata, attribute table, projection information, etc. are all stored in a single file, which facilitates portability. Its filename usually ends in .gpkg. Applications such as QGIS (2.12 and up), R and ArcGIS will recognize this format (ArcGIS version 10.2.2 and above will read the file from ArcCatalog but requires a script to create a GeoPackage).

GeoJSON

GeoJSON is an open standard format designed for representing simple geographical features, along with their non-spatial attributes.

Source: Introduction to web mapping by Michael Dorman

GeoJSON -- Multi-part geometry

Multi-part geometry types are similar to their single-part counterparts. The only difference is that one more hierarchical level is added into the coordinates array, for specifying multiple shapes.

Source: Introduction to web mapping by Michael Dorman

GeoJSON -- Geometry collections

A geometry collection is a set of several geometries, where each geometry is one of the previously listed six types, i.e., any geometry type excluding "GeometryCollection". For example, a "GeometryCollection" consisting of two geometries, a "Point" and a "MultiLineString", can be defined as follows:
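The example the slide refers to did not survive extraction. As a minimal sketch, the same structure can be written as a Python dictionary and serialized with the standard `json` module. The coordinate values are made up for illustration only.

```python
import json

# A "GeometryCollection" holding a "Point" and a "MultiLineString".
# All coordinates are hypothetical [x, y] pairs.
geometry_collection = {
    "type": "GeometryCollection",
    "geometries": [
        {"type": "Point", "coordinates": [100.0, 0.0]},
        {
            "type": "MultiLineString",
            "coordinates": [
                [[100.0, 0.0], [101.0, 1.0]],  # first line
                [[102.0, 2.0], [103.0, 3.0]],  # second line
            ],
        },
    ],
}

# Serialize to a GeoJSON string.
geojson_text = json.dumps(geometry_collection, indent=2)
print(geojson_text)
```

Note how the "MultiLineString" coordinates carry one more level of nesting than a single "LineString" would.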

Source: Introduction to web mapping by Michael Dorman

GeoJSON -- Feature

A "Feature" is formed when a geometry is combined with non-spatial attributes, to form a single object. The non-spatial attributes are encompassed in a property named "properties", containing one or more name-value pairs—one for each attribute. For example, the following "Feature" represents a geometry with two attributes, named "color" and "area":
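The original example is missing from the extracted text. A minimal sketch of such a "Feature", written as a Python dictionary, could look like this. The geometry and the attribute values are hypothetical; only the attribute names "color" and "area" come from the text.

```python
import json

# A "Feature": one geometry plus a "properties" object with two attributes.
# Coordinates and attribute values are made up for illustration.
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
    "properties": {"color": "red", "area": 3272386},
}

print(json.dumps(feature, indent=2))
```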

Source: Introduction to web mapping by Michael Dorman

GeoJSON -- Feature Collections

A "FeatureCollection" is, like the name suggests, a collection of "Feature" objects. The separate features are contained in an array, comprising the "features" property. For example, a "FeatureCollection" composed of four features can be specified as follows:
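The original example is not present in the extracted text. The sketch below builds a "FeatureCollection" with four point features in Python. The helper `make_point_feature`, the coordinates and the "color" values are all hypothetical.

```python
import json

def make_point_feature(x, y, color):
    """Build a GeoJSON "Feature" with a "Point" geometry (hypothetical helper)."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [x, y]},
        "properties": {"color": color},
    }

# Four made-up features collected in the "features" array.
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        make_point_feature(0.0, 0.0, "red"),
        make_point_feature(1.0, 0.0, "green"),
        make_point_feature(1.0, 1.0, "blue"),
        make_point_feature(0.0, 1.0, "yellow"),
    ],
}

print(json.dumps(feature_collection, indent=2))
```

A string produced this way can be pasted into a viewer such as geojson.io for a quick visual check.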

Source: Introduction to web mapping by Michael Dorman

http://geojson.io/

Mapshaper

Keyhole Markup Language (KML)

An XML-based file format used to visualize spatial data and modelling information such as lines, shapes, 3D images and points in Google Earth.

Geography Markup Language (GML)

An XML-based format used by the Open Geospatial Consortium (OGC) for storing geographical data in a standard interchangeable format.

SVG (Scalable Vector Graphics)

An XML-based vector image format for two-dimensional graphics. Any program that recognizes XML can display the SVG image.

DWG

DWG is the native format of AutoCAD. A DWG file is a database of 2D or 3D drawings.

Tidy data

DataFrame

GeoDataFrame

Field-based model

In the field view, the world consists of properties continuously varying across space.

Raster

Raster GIS File Format

txt / ASCII (American Standard Code for Information Interchange)

Standard text document that contains plain text. It can be opened and edited in any text-editing or word-processing program.

Imagine

Imagine file format by ERDAS. It consists of a single .img file. It is sometimes accompanied by an .xml file which usually stores metadata information about the raster layer.

GeoTIFF

A GeoTIFF is a TIFF file that ends in the three-letter .tif extension just like other TIFF files, but a GeoTIFF contains additional tags that provide projection information for that image as specified by the GeoTIFF standard.

Enhanced Compression Wavelet (ECW)

A wavelet-based compressed raster format from ERDAS, often lossy.

Network Common Data Form (NetCDF)

NetCDF is a file format commonly used with the Climate and Forecast (CF) metadata conventions for earth science data. It allows direct web access to subsets or aggregations of maps through the OPeNDAP protocol.

HDF5

HDF5 is an open source file format that supports large, complex, heterogeneous data. It uses a "file directory"-like structure that lets you organize data within the file in many different structured ways, as you might do with files on your computer.

Spatial data II

Spatial statistics vs Classical statistics

There is a fundamental difference between classical and spatial statistics. In classical statistics, we make a basic assumption regarding the sample: it is a collection of independent observations that follow a specific, usually normal, distribution. In contrast, in spatial statistics, because of the inherent spatial dependence and the fact that spatial autocorrelation (usually) exists, the focus is on adopting techniques for detecting and describing these correlations.

In other words, in classical statistics, observation independence should exist while, in spatial statistics, spatial dependence usually exists. Classical statistics should be modified accordingly to adapt to this condition.

Types

  • Point Pattern Analysis: spatial distribution of events.

  • Geostatistical Analysis: continuous surface modeling.

  • Lattice Data Analysis (Area): spatial patterns of attributes observed for discrete spatial objects, where the spatial regions can be regular shapes (grid or pixels) or irregular shapes (polygons).

Spatial data analysis in this course

  • Spatial operations
  • Spatial mapping/geovisualization --> Showing interesting patterns.
  • Spatial statistical analysis --> Discovering interesting patterns
  • Spatial model --> Explaining interesting patterns.
  • Spatial database management
  • Spatial model base management

Spatial data

Spatial is special?

Three fundamental properties of spatial data:

  • Spatial dependence...it is the rule, not the exception
  • Spatial heterogeneity
  • Spatial scale.

Spatial dependence

Spatial autocorrelation is a complicated name for the obvious fact that data from locations near one another in space are more likely to be similar than data from locations remote from one another.

The existence of spatial autocorrelation is therefore a given in geography. Unfortunately, it is also an impediment to the application of conventional statistics.

Spatial autocorrelation introduces redundancy into data.

Spatial heterogeneity

Global measures of spatial autocorrelation may confirm the existence of positive or negative self-similarity with regard to distance, but this comes at the cost of a fundamental assumption. The parameters (mean and variance) of the random function representing the process are assumed to be constant. This is called the stationarity of the random function associated with that process, and when it is violated (called a nonstationary process), the process is heterogeneous.

In other words, a spatial process is said to be stationary when the difference between values of an attribute is only explained by the distance between the points or units. Another source of spatial heterogeneity is when the spatial dependence is different in various directions (anisotropy).

Source: Nikparvar & Thill (2021)

Scale

Scale is also important because it can inform about sampling for training experience. Learning is more reliable when the distribution of the samples in the training experience is similar to the distribution of the test experience. In many geographic studies, training occurs on data from a specific geographic area. This makes it challenging to use the trained model for other geographic regions because the distribution of the test and train data sets is not similar, due to spatial heterogeneity.

This means that the sampling strategy for the training data set is essential to cover the heterogeneity of the phenomena of interest over the spatial frame of study. By increasing the extent of the study area, more processes and contextual environmental factors may alter the variable and result in non-stationarity by interweaving spatial patterns of different scales or inconsistent effect of processes in different regions.

Source: Nikparvar & Thill (2021)

First and second order effects

Tree density distribution can be influenced by 1st order effects such as elevation gradient or spatial distribution of soil characteristics; and by 2nd order effects such as seed dispersal processes where the process is independent of location and, instead, dependent on the presence of other trees.

Source: Intro to GIS and Spatial Analysis by Manuel Gimond (2020)

MAUP

The Modifiable Areal Unit Problem (MAUP) refers to the influence that the zone design has on the outcomes of the analysis. A different zone designation would probably lead to different results.

Source: https://en.wikipedia.org/wiki/Modifiable_areal_unit_problem

There are two types of biases for the MAUP: the zonal effect and the scale effect.

Source: Spatial Modelling for Data Scientist by Francisco Rowe and Dani Arribas-Bel (2022)

Zonal effect

The zonal effect occurs when you group data by various artificial boundaries. In this type of MAUP error, each subsequent boundary yields major analytical differences.

https://gisgeography.com/maup-modifiable-areal-unit-problem/

Scale effect

The scale effect occurs when maps show different analytical results at different levels of aggregation. Despite using the same points, each successive smaller unit consequently changes the pattern.

https://gisgeography.com/maup-modifiable-areal-unit-problem/


Source: Intro to GIS and Spatial Analysis by Manuel Gimond (2022)

The Edge Effects Problem

In the edge effects problem, spatial units that lie in the center of the study area tend to have neighbors in all directions, whereas spatial units at the edges of the study area have neighbors only in some specific directions.

Edge effect

Ecological Fallacy

This problem occurs when a relationship that is statistically significant at one level of analysis is assumed to hold true at a more detailed level as well. This is a typical mistake that occurs when we use aggregated data to describe the behavior of individuals.

Source: https://commons.wikimedia.org/wiki/File:Simpsons_paradox_-_animation.gif

Neighborhood effect

A characteristic of neighboring units may influence the same characteristic in a given unit.

“If block group A is next to a high crime neighborhood, then block group A has high crime.”

Spillover effect

Spillover effects are also called externalities. An externality is a cost or benefit imposed on others without compensation.

A characteristic of neighboring units may influence a different characteristic in a given unit.

“If block group A is next to a shopping mall, then block group A will experience high crime.”

Web mapping

Web mapping

A web map is an interactive display of geographic information, in the form of a web page, that you can use to tell stories and answer questions. Web maps are interactive. The term interactive implies that the viewer can interact with the map. This can mean selecting different map data layers or features to view, zooming into a particular part of the map that you are interested in, inspecting feature properties, editing existing content, or submitting new content, and so on.

Web maps are useful for various purposes, such as data visualization in journalism (and elsewhere), displaying real-time spatial data, powering spatial queries in online catalogs and search tools, providing computational tools, reporting, and collaborative mapping.

Earth weather
Stuff in space
Real-time flight locations

Tools

Source: Introduction to web mapping

Architecture

Source: Maptimeboston

Tile layers

Tile layers are a fundamental technology behind web maps. They comprise the background layer in most web maps, thus helping the viewer to locate the foreground layers in geographical space. The word tile in tile layers comes from the fact that the layer is split into individual rectangular tiles. Tile layers come in two forms, which we are going to cover next: raster tiles and vector tiles.

Raster tiles: raster tile layers are usually composed of PNG images. Traditionally, each PNG image is 256 × 256 pixels in size.

Vector tiles: vector tiles are distinguished by the ability to rotate the map while the labels keep their horizontal orientation, and by the ability to zoom in or out smoothly, without the strict division into discrete zoom levels that raster tile layers have.

Tile layers

https://a.tile.openstreetmap.org/2/1/3.png

  • zoom level 2
  • column 1
  • row 3
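As a sketch of how such a tile address can be derived, the functions below convert a longitude/latitude pair to a column and row at a given zoom level, using the standard Web Mercator tiling formula, and then build the URL following the zoom/column/row pattern shown above. The function names are hypothetical.

```python
import math

def lonlat_to_tile(lon, lat, zoom):
    """Return (column, row) of the Web Mercator tile containing lon/lat."""
    n = 2 ** zoom  # number of tiles per axis at this zoom level
    col = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    row = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp to the valid range so that edge coordinates stay inside the grid.
    col = min(max(col, 0), n - 1)
    row = min(max(row, 0), n - 1)
    return col, row

def tile_url(zoom, col, row):
    """Build a tile URL following the pattern of the example above."""
    return f"https://a.tile.openstreetmap.org/{zoom}/{col}/{row}.png"

col, row = lonlat_to_tile(-45.0, -70.0, 2)
print(tile_url(2, col, row))
```

Each increase in zoom level doubles the number of columns and rows, so zoom level 2 has a 4 × 4 grid of tiles.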

Zoom level

    Source: Maptimeboston

    Vector tiles

    Source: Gaffuri (2012)

    Ejemplo

    Data distribution

    Underlying data distribution

    Before making modeling decisions, you need to know the underlying data distribution.

    E. Taskesen

    Frequency Distribution and Histograms

    A frequency distribution table stores the categories (also called “bins”), the frequency, the relative frequency and the cumulative relative frequency of a single continuous (interval) variable.

    The frequency of a particular category or value (also called “observation”) of a variable is the number of times the category or the value appears in the dataset.

    Relative frequency is the proportion (%) of the observations that belong to a category. It is used to understand how a sample or population is distributed across bins (relative frequency = frequency / n, where n is the total number of observations).

    The cumulative relative frequency of each row is the sum of the relative frequencies of that row and all rows above it. It tells us what percentage of the population (observations) falls at or below this bin. The final row should be 100%.

    A probability density histogram is defined so that (i) the area of each box equals the relative frequency (probability) of the corresponding bin, and (ii) the total area of the histogram equals 1.
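The definitions above can be sketched in plain Python. The sample values and bin edges are made up, and the function name `frequency_table` is hypothetical. Bins are treated as half-open intervals, with the last bin closed on the right (an assumed convention).

```python
def frequency_table(values, edges):
    """Frequency, relative, cumulative relative frequency and density per bin."""
    n = len(values)
    rows = []
    cumulative = 0.0
    for left, right in zip(edges[:-1], edges[1:]):
        is_last = right == edges[-1]
        # Count values in [left, right); the last bin also includes its right edge.
        freq = sum(1 for v in values if left <= v < right or (is_last and v == right))
        relative = freq / n
        cumulative += relative
        width = right - left
        rows.append({
            "bin": (left, right),
            "width": width,
            "frequency": freq,
            "relative": relative,
            "cumulative": cumulative,
            "density": relative / width,  # box height, so box area = relative frequency
        })
    return rows

values = [2.1, 3.4, 3.9, 4.2, 5.0, 5.5, 6.1, 7.3, 8.8, 9.6]  # hypothetical sample
edges = [2, 4, 6, 8, 10]                                      # hypothetical bin edges
rows = frequency_table(values, edges)
for r in rows:
    print(r)
```

Multiplying each density by its bin width recovers the relative frequency, which is why the total area of the density histogram is 1.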

    Histograms & bins

    Frequency distribution

    Box plot

    A boxplot is a graphical representation of the key descriptive statistics of a distribution.

    The characteristics of a boxplot are:

    • The box is defined by the lower quartile Q1 (25%; left vertical edge of the box) and the upper quartile Q3 (75%; right vertical edge of the box). The length of the box equals the interquartile range IQR = Q3 - Q1.
    • The median is depicted by a line inside the box. If the median is not centered, then skewness exists.
    • To trace and depict outliers, we have to calculate the whiskers, which are the lines starting from the edges of the box and extending to the last object not considered an outlier.
    • Objects lying more than 1.5 IQR beyond the edges of the box (below Q1 - 1.5 IQR or above Q3 + 1.5 IQR) are considered outliers.
    • Objects lying more than 3.0 IQR beyond the edges of the box are considered extreme outliers, and those between 1.5 IQR and 3.0 IQR are considered mild outliers. One may change the 1.5 or 3.0 coefficient to another value according to the study’s needs, but most statistical programs use these values by default.
    • Whiskers do not necessarily stretch up to 1.5 IQR but to the last object lying within this distance from the upper or lower quartiles.
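The rules above can be sketched with the standard `statistics` module. The data are made up, the function name `boxplot_summary` is hypothetical, and the "inclusive" quartile method is one of several conventions, so other software may report slightly different quartiles.

```python
from statistics import quantiles

def boxplot_summary(data, inner=1.5, outer=3.0):
    """Quartiles, IQR, whisker ends and outliers following the 1.5/3.0 IQR rules."""
    ordered = sorted(data)
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_in, high_in = q1 - inner * iqr, q3 + inner * iqr      # outlier limits
    low_out, high_out = q1 - outer * iqr, q3 + outer * iqr    # extreme outlier limits
    inside = [v for v in ordered if low_in <= v <= high_in]
    mild = [v for v in ordered
            if (low_out <= v < low_in) or (high_in < v <= high_out)]
    extreme = [v for v in ordered if v < low_out or v > high_out]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        # Whiskers end at the last objects that are not outliers.
        "whiskers": (inside[0], inside[-1]),
        "mild": mild, "extreme": extreme,
    }

summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 17, 30])  # hypothetical data
print(summary)
```

In this sample the upper whisker stops at 9, well short of Q3 + 1.5 IQR = 16, which illustrates the last bullet.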

    QQ plot

    The normal QQ plot is a graphical technique that plots the quantiles of the data against the quantiles of a theoretical normal distribution. If the data are normally distributed, the points fall along a straight line.

    A normal QQ plot is used to identify whether the data are normally distributed.

    If data points deviate from the straight line and curves appear (especially in the beginning or at the end of the line), the normality assumption is violated.
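A minimal sketch of how the points of a normal QQ plot can be computed with the standard library follows. The plotting position (i - 0.5) / n is an assumed convention (several exist), and the function name and sample are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(sample):
    """Return (theoretical quantile, standardized sample value) pairs."""
    ordered = sorted(sample)
    n = len(ordered)
    m, s = mean(ordered), stdev(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, value in enumerate(ordered, start=1):
        p = (i - 0.5) / n                        # assumed plotting position
        theoretical = standard_normal.inv_cdf(p)
        points.append((theoretical, (value - m) / s))
    return points

qq = normal_qq_points([2.0, 4.0, 6.0, 8.0, 10.0])  # hypothetical sample
for theoretical, observed in qq:
    print(round(theoretical, 3), round(observed, 3))
```

Plotting these pairs and comparing them with the line y = x gives the visual check described above.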

    Learn by example

    Scatter plot

    A scatter plot displays the values of two variables as a set of point coordinates.

    A scatter plot is used to identify the relations between two variables and trace potential outliers.

    Inspecting a scatter plot allows one to identify linear or other types of associations.

    If points tend to form a linear pattern, a linear relationship between variables is evident. If data points are scattered, the linear correlation is close to zero, and no association is observed between the two variables. Data points that lie further away in the x or y direction (or both) are potential outliers.

    D3.js visualization: 4 variables

    Covariance matrix

    Covariance is a measure of the extent to which two variables vary together (i.e., change in the same linear direction). The covariance $\text{cov}(X,Y)$ is calculated as:

    $\text{cov}(X,Y)=\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{n-1}$

    where $x_i$ is the score of variable X of the i-th object, $y_i$ is the score of variable Y of the i-th object, $\bar{x}$ is the mean value of variable X, $\bar{y}$ is the mean value of variable Y, and $n$ is the number of objects.

    For positive covariance, if variable X increases, then variable Y tends to increase as well. If the covariance is negative, then the variables change in opposite ways (one increases, the other decreases). Zero covariance indicates no linear relationship between the variables.
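The covariance formula above translates directly into Python. The sample values below are made up for illustration.

```python
def covariance(x, y):
    """Sample covariance with the n - 1 denominator, as in the formula above."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

x = [1, 2, 3, 4, 5]        # hypothetical scores of variable X
y = [2, 4, 6, 8, 10]       # hypothetical scores of variable Y
print(covariance(x, y))    # positive: Y increases with X
print(covariance(x, y[::-1]))  # negative: Y decreases as X increases
```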

    Correlation coefficient

    The correlation coefficient $r_{(x, y)}$ measures how strongly two variables (X, Y) are linearly related. Among the correlation coefficient metrics available, the most widely used is Pearson’s correlation coefficient (also called the Pearson product-moment correlation):

    $r_{(x, y)} = \frac{\text{cov}(X,Y)}{s_x s_y}$

    where $s_x$ and $s_y$ are the sample standard deviations of X and Y.

    Correlation is a measure of association and not of causation.
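As a minimal sketch, Pearson's coefficient can be computed by dividing the sample covariance by the product of the sample standard deviations. The data are made up and the function name is hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: sample covariance over the product of sample std devs."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    return cov / (s_x * s_y)

x = [1, 2, 3, 4, 5]           # hypothetical data
print(pearson_r(x, [2, 4, 6, 8, 10]))   # perfectly linear, increasing
print(pearson_r(x, [3, 1, 4, 1, 5]))    # weaker association
```

Unlike covariance, the result is unitless and always lies between -1 and 1.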

    Lattice data analysis in Python

    Object-based model

    Point pattern analysis

    Point pattern analysis

    A point pattern consists of a set of events at a set of locations, where each event represents a single instance of the phenomenon of interest.

    Most point pattern analysis techniques deal only with the location of the events and not with other attributes they might carry.

    Process vs Pattern

    A spatial process is a description of how a spatial pattern can be generated.

    There are three main types of spatial process:

    • Complete spatial randomness process --> Random spatial pattern
      • There is an equal probability of event occurrence at any location in the study region (also called first-order stationary).
      • The location of an event is independent of the locations of other events (also called second-order stationary).

    • Competitive process --> Dispersed: a process that leads events to be arranged as far away from each other as possible, so events tend to be uniformly distributed.

    • Aggregating process --> Clustered: a process where events tend to cluster as a result of some pulling action. The events create clusters in some parts of the study area, and the pattern has a large variation.

    Point pattern analysis

    There are two main (interrelated) methods of analyzing point patterns: density-based methods and distance-based methods.

    • Density-based methods --> Absolute location: they use the intensity of event occurrence across space and therefore describe first-order effects better. Quadrat counts and kernel density estimation (KDE) are common density-based methods. In quadrat count methods, space is divided into a regular grid (such as a grid of squares or hexagons) of unit area. Each unit region contains a different number of points due to a spatial process. The analysis of this distribution and its correspondence to a spatial pattern is based on probabilistic and statistical methods. KDE is the more widely used of the two.
    • Distance-based methods --> Relative location: they employ the distances among events and describe second-order effects. Such methods include the nearest neighbor method, the G and F distance functions, and Ripley’s K distance function and its transformation.

    Centrography

    A very basic form of point pattern analysis involves summary statistics such as the mean center, standard distance and standard deviational ellipse.
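Two of these summary statistics can be sketched in a few lines. The mean center is the average of the x and y coordinates, and the standard distance is taken here as the root mean squared distance of the events from the mean center (dividing by n, which is an assumed convention). The points are made up.

```python
import math

def mean_center(points):
    """Average x and average y of the events."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def standard_distance(points):
    """Root mean squared distance of the events from their mean center."""
    cx, cy = mean_center(points)
    n = len(points)
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)

events = [(0, 0), (2, 0), (2, 2), (0, 2)]  # hypothetical event locations
print(mean_center(events))
print(standard_distance(events))
```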

    Source: Intro to GIS and Spatial Analysis by Manuel Gimond (2020)

    Standard deviational ellipse

    It is a measure of dispersion (spread) that calculates standard distance separately in the x and y directions. The standard deviational ellipse reveals dispersion and directional trend.

    Convex Hull

    The convex hull of a point pattern pp is the smallest convex set that contains pp.
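One common way to compute a convex hull is Andrew's monotone chain algorithm, sketched below. It is one of several possible algorithms and is not necessarily the one used in the course materials. The points are made up.

```python
def cross(o, a, b):
    """Z component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Return hull vertices in counterclockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop the last vertex while it does not make a counterclockwise turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]

pp = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3)]  # hypothetical point pattern
print(convex_hull(pp))
```

Interior points such as (2, 2) and (1, 3) are discarded; only the corner points remain as hull vertices.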

    Quadrat density

    This technique requires that the study area be divided into sub-regions (aka quadrats). Then, the point density is computed for each quadrat by dividing the number of points in each quadrat by the quadrat’s area. Quadrats can take on many different shapes such as hexagons and triangles.
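A minimal sketch for rectangular quadrats: the study area is split into an nx by ny grid, points are counted per cell, and each count is divided by the cell area. The function name, extent and points are hypothetical.

```python
def quadrat_density(points, xmin, ymin, xmax, ymax, nx, ny):
    """Point density per rectangular quadrat, as a list of rows (bottom to top)."""
    width = (xmax - xmin) / nx
    height = (ymax - ymin) / ny
    counts = [[0] * nx for _ in range(ny)]
    for x, y in points:
        if not (xmin <= x <= xmax and ymin <= y <= ymax):
            continue  # ignore points outside the study area
        # Points on the upper boundary are assigned to the last quadrat.
        col = min(int((x - xmin) / width), nx - 1)
        row = min(int((y - ymin) / height), ny - 1)
        counts[row][col] += 1
    cell_area = width * height
    return [[count / cell_area for count in row] for row in counts]

events = [(0.5, 0.5), (1.5, 0.5), (0.2, 0.3), (3.5, 3.5)]  # hypothetical events
density = quadrat_density(events, 0, 0, 4, 4, nx=2, ny=2)
print(density)
```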

    Kernel Density Function

    The kernel density approach is an extension of the quadrat method. Kernel density estimation is a nonparametric method that uses kernel functions to create smooth maps of density values, in which the density at each location indicates the concentration of points within the neighboring area (high concentrations as peaks, low concentrations as valleys).
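As a sketch, the density at one location can be estimated by summing a kernel centered on every event. The Gaussian kernel and the fixed bandwidth used here are assumptions (other kernels and bandwidth rules are common), and the events are made up.

```python
import math

def gaussian_kde_2d(points, location, bandwidth):
    """Kernel density estimate at `location` using a 2D Gaussian kernel."""
    x0, y0 = location
    n = len(points)
    total = 0.0
    for x, y in points:
        # Squared distance from the event to the location, scaled by the bandwidth.
        u2 = ((x - x0) ** 2 + (y - y0) ** 2) / bandwidth ** 2
        total += math.exp(-0.5 * u2) / (2 * math.pi)
    return total / (n * bandwidth ** 2)

events = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0)]  # hypothetical events
print(gaussian_kde_2d(events, (1.0, 1.0), bandwidth=0.5))  # inside the cluster: peak
print(gaussian_kde_2d(events, (3.0, 3.0), bandwidth=0.5))  # empty area: valley
```

Evaluating the function on a regular grid of locations produces the smooth density surface described above; a larger bandwidth gives a smoother surface.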

    Modeling intensity as a function of a covariate

    It is often more interesting to model the relationship between the distribution of points and some underlying covariate by defining that relationship mathematically. This can be done by exploring the changes in point density as a function of a covariate.

    $Pr(X_i) = \frac{\exp(\beta_0 + \beta_1 X_i)}{1 + \exp(\beta_0 + \beta_1 X_i)}$

    NN analysis

    The method compares the observed spatial distribution to a random theoretical one. The Average Nearest Neighbor (NN) tool measures the distance between each feature centroid and its nearest neighbor's centroid location. It then averages all these nearest neighbor distances. If the average distance is less than the average for a hypothetical random distribution, the distribution of the features being analyzed is considered clustered. If the average distance is greater than a hypothetical random distribution, the features are considered dispersed.
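A minimal sketch of this comparison follows. The expected mean nearest neighbor distance under a random pattern is taken as 0.5 / sqrt(n / A), where A is the study area, and no edge correction is applied (both are simplifying assumptions). The point sets are made up.

```python
import math

def average_nearest_neighbor(points, area):
    """Return (observed mean NN distance, expected under randomness, ratio)."""
    n = len(points)
    nn_distances = []
    for i, (xi, yi) in enumerate(points):
        nearest = min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        nn_distances.append(nearest)
    observed = sum(nn_distances) / n
    expected = 0.5 / math.sqrt(n / area)  # random pattern with the same density
    return observed, expected, observed / expected

clustered = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1)]  # hypothetical, tightly packed
dispersed = [(0, 0), (5, 0), (0, 5), (5, 5)]          # hypothetical, spread out
print(average_nearest_neighbor(clustered, area=100.0))  # ratio below 1
print(average_nearest_neighbor(dispersed, area=100.0))  # ratio above 1
```

A ratio below 1 suggests clustering and a ratio above 1 suggests dispersion; a formal conclusion would also need a significance test.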

    An extension of this idea is to plot the ANN values for different order neighbors, that is for the first closest point, then the second closest point, and so forth.

    Ripley's K function

    It is a method of analyzing point patterns based on a distance function. The outcome of the function is the expected number of events within a radius d of an event, scaled by the intensity (density) of events. It is calculated for a series of incremental distances d centered on each of the events in turn.
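A naive estimator can be sketched as follows: for each distance d, count the ordered pairs of distinct events that lie within d of each other, then divide by n times the intensity (n / A). This sketch applies no edge correction, which production implementations normally add, and the events are made up.

```python
import math

def ripley_k(points, area, distances):
    """Naive Ripley's K estimate (no edge correction) for each distance."""
    n = len(points)
    intensity = n / area  # events per unit area
    k_values = []
    for d in distances:
        pairs = 0
        for i, (xi, yi) in enumerate(points):
            for j, (xj, yj) in enumerate(points):
                if i != j and math.hypot(xi - xj, yi - yj) <= d:
                    pairs += 1
        k_values.append(pairs / (n * intensity))
    return k_values

events = [(0, 0), (1, 0), (0, 1), (1, 1)]  # hypothetical events
print(ripley_k(events, area=4.0, distances=[0.5, 1.0, 1.5]))
```

Under complete spatial randomness K(d) is approximately pi * d^2, so estimates above that value suggest clustering at distance d and estimates below it suggest dispersion.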

    Clustering

    The goal is to identify subgroups in the data such that the observations within each subgroup (cluster) are very similar, while observations in different subgroups are very different.

    Distance

    • Hierarchical clustering: hierarchical decomposition using some criterion; methods can be agglomerative (bottom-up) or divisive (top-down). They do not need K at the start.
    • Partitioning methods (k-means, PAM, CLARA): built from partitions, which are evaluated by some criterion. They need K at the start.
    • Density-based clustering: based on connectivity functions and density functions.
    • Model-based clustering: a model is used to group the data.
    • Fuzzy clustering: clusters are separated or grouped using fuzzy logic.

    Dendrogram

    K-means

    Elbow method

    Silhouette method

    DBSCAN

    DBSCAN is a density-based clustering method, which means that points that are closely packed together are assigned into the same cluster and given the same ID. The DBSCAN algorithm has two parameters, which the user needs to specify:

    • ε: the maximal distance between points to be considered within the same cluster
    • minPts: the minimal number of points required to form a cluster

    In short, all groups of at least minPts points, where each point is within ε or less from at least one other point in the group, are considered to be separate clusters and assigned with unique IDs. All other points are considered “noise” and are not assigned with an ID.
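The following is a minimal pure-Python sketch of the classic DBSCAN formulation, in which a cluster grows from "core" points that have at least minPts points (including themselves) within ε. This is slightly stricter than the simplified description above, and library implementations may differ in detail. The points and parameter values are made up.

```python
import math

NOISE = -1  # label used for points that belong to no cluster

def region_query(points, index, eps):
    """Indices of all points within eps of points[index] (including itself)."""
    xi, yi = points[index]
    return [j for j, (xj, yj) in enumerate(points)
            if math.hypot(xi - xj, yi - yj) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster ID per point, or NOISE."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed by a cluster as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
        cluster_id += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]  # hypothetical points: two groups and one isolated point
labels = dbscan(pts, eps=1.5, min_pts=3)
print(labels)
```

Unlike k-means, the number of clusters is not specified in advance; it follows from ε and minPts.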

    Geovisualization

    Choropleth maps

    Choropleth maps are thematic maps in which areas are rendered according to the values of the variable displayed

    Cloropleth maps are used to obtain a graphical perspective of the spatial distribution of the values of a specific variable across the study area.

    There are two main categories of variables displayed in choropleth maps:

    • Spatially extensive variables: each polygon is rendered based on a measured value that holds for the entire polygon, e.g. total population.

    • Spatially intensive variables: the values of the variable are adjusted for the area or some other variable, e.g. population density.

    Breaks

    Spatial association

    Spatial dependence

    Spatial dependence characterizes spatial data in terms of spatial autocorrelation and spatial heterogeneity. Many statistical tests used for nonspatial data are based on the hypothesis that samples are randomly selected and observations are independent. When we collect spatial data, however, this hypothesis is usually violated. This phenomenon is described as “spatial dependence.”

    Spatial autocorrelation

    Spatial autocorrelation is a formal property that measures the degree to which near and distant things are related.

    It refers to systematic spatial changes that are observed as clusters of similar values or as a systematic spatial pattern.

    Spatial heterogeneity

    Spatial heterogeneity refers to structural relationships that change with the location of the object. These changes can be abrupt (e.g. countryside–town) or continuous.

    Spatial heterogeneity refers to the uneven distribution of a trait, event, or relationship across a region

    Spatial weight matrix (W)

    Spatial weights are numbers that reflect some sort of distance, time or cost between a target spatial object and every other object in the dataset or specified neighborhood. Spatial weights quantify the spatial or spatiotemporal relationships among the spatial features of a neighborhood.

    Neighborhood

    Neighborhood in the spatial analysis context is a geographically localized area to which local spatial analysis and statistics are applied based on the hypothesis that objects within the neighborhood are likely to interact more than those outside it.

    • Neighbours by contiguity: areas that share common boundaries
    • Neighbours by distance: areas are defined as neighbours if they are within a specified radius

    First & Second order processes

    Neighbourhood can be first order, second order or higher. A first-order neighbourhood means that only neighbours of the examined object are considered (according to the contiguity criterion), while in the case of the second-order neighbourhood matrix, neighbours’ neighbours are also included (also according to the contiguity criterion).

    First order process

    It is one that produces a variation in point density in response to some causal variable.

    E.g. the density of cases of malaria echoes the density of particular species of mosquito.

    Second-order process

    It results from interactions, when the presence of one point makes others more likely in the immediate vicinity.

    E.g. patterns of contagious disease reflect second-order processes, when the disease is passed from an initial carrier to family members, co-workers, sexual partners, and others who come into contact with infectious carriers.

    Competition for space provides a familiar example of a form of second-order process that results in an exception to Tobler’s First Law.

    E.g. the presence of a shopping center in an area generally discourages other shopping centers from locating nearby.

    Spatial Relationships

    Adjacency (Contiguity)

    Adjacency can be thought of as the nominal, or binary, equivalent of distance. Two spatial entities are either adjacent or they are not.

    Contiguity among features means the features have common borders. We have three types of contiguity:

    • Rook contiguity: the features share common edges
    • Bishop contiguity: the features share only common vertices (corners)
    • Queen contiguity: the features share common edges or corners
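
    On a regular grid the three criteria reduce to different sets of cell offsets, which the following sketch makes explicit. The function name `grid_neighbours` and the row-by-row cell numbering are assumptions chosen for illustration.

```python
def grid_neighbours(rows, cols, rule="queen"):
    """Return {cell: [neighbour cells]} for a regular grid numbered row by row."""
    rook = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # shared edges
    bishop = [(-1, -1), (-1, 1), (1, -1), (1, 1)]    # shared corners only
    offsets = {"rook": rook, "bishop": bishop, "queen": rook + bishop}[rule]
    result = {}
    for r in range(rows):
        for c in range(cols):
            result[r * cols + c] = sorted(
                (r + dr) * cols + (c + dc)
                for dr, dc in offsets
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            )
    return result

# 3 x 3 grid: cell 4 is the centre, cell 0 is the top-left corner
rook_w = grid_neighbours(3, 3, "rook")
bishop_w = grid_neighbours(3, 3, "bishop")
queen_w = grid_neighbours(3, 3, "queen")
```

    For the centre cell, rook contiguity gives four neighbours, bishop contiguity gives the four diagonal cells, and queen contiguity gives all eight.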

    Contiguity

    Example

    Distance

    Among the most common distance measures used in geographical analysis are the Euclidean distance, the Manhattan distance, the Minkowski distance, the Pearson correlation distance, the Spearman correlation distance, the network distance, and the geodetic distance. In spatial statistics, the Euclidean and Manhattan distances are the most widely used.

    Using the coordinates of the centres of the areas, one can also create a matrix of spatial weights according to the criterion of neighbourhood in a radius of d km. This means that the neighbour will be an object whose centre is not more than d km away in a straight line. A special case of such a matrix is the inclusion of all areas as neighbours.

    Minkowski Distance

    The Minkowski distance is a generalized form of the Euclidean distance (if p=2) and the Manhattan distance (if p=1).

    $\left(\sum_{i = 1}^n |x_i-y_i|^p\right)^{1 / p}$

    Euclidean distance

    $\sqrt{\sum_{i = 1}^n (x_i-y_i)^2}$

    Manhattan distance

    $\sum_{i = 1}^n |x_i-y_i|$
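
    The relationship between the three distances can be checked numerically. The sketch below implements the Minkowski formula directly; the function name `minkowski` and the two sample points are hypothetical.

```python
def minkowski(x, y, p):
    """Minkowski distance between two coordinate tuples for a given order p."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

a, b = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(a, b, 1)  # p = 1: sum of absolute differences
euclidean = minkowski(a, b, 2)  # p = 2: straight-line distance
```

    For these points the Manhattan distance is 7 (3 + 4) and the Euclidean distance is 5 (the hypotenuse of a 3-4-5 triangle).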

    Interaction

    Interaction may be considered as a combination of distance and contiguity, and rests on the intuitively obvious idea that nearer things are more closely related than distant things

    Various types of functions can be used, including reciprocal function (or inverse distance), negative power (or inverse distance squared for a power of 2), negative exponential or linear with a negative slope (which is uncommon).

    Matrix of k nearest neighbours (knn)

    The matrix of k nearest neighbours (knn) is usually constructed for point data because, unlike the contiguity matrix, it needs only point locations (without referring to areas). One can also create a knn matrix for area data by first determining the area centroids (centres of gravity of regions – spatial geometries) and operating on these points. In the case of point data, the knn matrix is a natural analytical solution, although the number of neighbours is most often chosen on the basis of modelling or experience.

    Standardized Spatial Weights
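
    A knn neighbour list can be built by sorting distances from each point, as in this minimal sketch. The function name `knn_weights` and the sample centroids are hypothetical; the weights are binary (neighbour or not).

```python
import math

def knn_weights(points, k):
    """Binary knn weights as {i: [indices of the k nearest other points]}."""
    weights = {}
    for i, p in enumerate(points):
        # Sort all other points by distance to p and keep the first k
        others = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        weights[i] = [j for _, j in others[:k]]
    return weights

# Hypothetical centroids along a line, with one remote point
centroids = [(0, 0), (1, 0), (3, 0), (7, 0)]
w = knn_weights(centroids, k=2)
```

    Note that knn neighbourhoods are not necessarily symmetric: here point 2 is a neighbour of the remote point 3, but point 3 is not among the two nearest neighbours of point 2.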

    Row standardization is recommended when there is a potential bias in the distribution of spatial objects and their attribute values due to poorly designed sampling procedures.

    Row standardization should also be used when polygon features refer to administrative boundaries or any type of man-made zones.
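
    Row standardization divides every weight by the sum of its row, so that the weights of each object's neighbours add up to 1. The sketch below assumes a dense list-of-lists matrix; the function name `row_standardize` and the four-area example are hypothetical.

```python
def row_standardize(w):
    """Divide each row of a weights matrix by its row sum (empty rows stay zero)."""
    out = []
    for row in w:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out

# Hypothetical binary contiguity matrix for four areas arranged in a line: 0-1-2-3
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
W_std = row_standardize(W)
```

    After standardization an area with two neighbours gives each a weight of 0.5, while an area with a single neighbour gives it a weight of 1.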

    Example

    Spatial lag

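    The spatial lag of a variable at place i is the weighted average of its values at the neighbouring places j. A spatial lag effect occurs when the dependent variable y in place i is affected by the independent variables in both place i and j.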
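
    With a row-standardized weights matrix, the spatial lag of a variable is simply the average of its values over each object's neighbours. The sketch below assumes a dense matrix; the function name `spatial_lag` and the income values are hypothetical.

```python
def spatial_lag(w_std, y):
    """Spatial lag: for each object, the weighted sum of y over its neighbours."""
    return [sum(wij * yj for wij, yj in zip(row, y)) for row in w_std]

# Row-standardized contiguity weights for four areas in a line: 0-1-2-3
W_std = [
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]
income = [10.0, 20.0, 30.0, 40.0]
income_lag = spatial_lag(W_std, income)
```

    Area 1, for instance, receives the mean of the values in areas 0 and 2.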

    Global indicator of Spatial Association (GISA)

    Global indicators are measures (test statistics) of the existence of spatial autocorrelation in the data as a whole, that is, they focus on whether there is any spatial autocorrelation in the data.

    Moran's I

    The positive value of global Moran implies the existence of a positive autocorrelation, and conversely, the negative value implies the existence of a negative autocorrelation

    If there is no relationship between Income and Incomelag, the slope will be close to flat (resulting in a Moran’s I value near 0).
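
    The global statistic can be computed directly from a weights matrix and the deviations from the mean, using I = (n / S0) · Σ w_ij z_i z_j / Σ z_i², where S0 is the sum of all weights. This is a minimal sketch without a significance test; the function name `morans_i` and both value series are hypothetical.

```python
def morans_i(w, y):
    """Global Moran's I for values y and a dense spatial weights matrix w."""
    n = len(y)
    mean = sum(y) / n
    z = [v - mean for v in y]                 # deviations from the mean
    s0 = sum(sum(row) for row in w)           # sum of all weights
    num = sum(w[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    den = sum(v * v for v in z)
    return (n / s0) * num / den

# Binary contiguity for four areas in a line: 0-1-2-3
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
i_trend = morans_i(W, [10.0, 20.0, 30.0, 40.0])   # similar values are adjacent
i_alt = morans_i(W, [10.0, 40.0, 10.0, 40.0])     # high and low values alternate
```

    The smoothly increasing series gives a positive value, while the alternating series gives a negative one.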

    Moran’s I at different lags

    Moran’s I at different spatial lags defined by a 50 km width annulus at 50 km distance increments. Red dots indicate Moran I values for which a P-value was 0.05 or less.

    Local indicators of Spatial Association (LISA)

    A local statistic is any descriptive statistic associated with a spatial data set whose value varies from place to place.

    Moran's I scatter plot

    Red points and polygons highlight counties with high income values surrounded by high income counties. Blue points and polygons highlight counties with low income values surrounded by low income counties.

    Moran's I scatter plot

    Significantly High-High and Low-Low clusters with P-values less than or equal to 0.5.

    Lattice data analysis in Python

    Field model

    Geostatistics

    The type of spatial statistical analysis dealing with continuous field variables is named “geostatistics”

    Geostatistics focus on the description of the spatial variation in a set of observed values and on their prediction at unsampled locations

    Spatial interpolation

    Interpolation methods are the techniques used with points that represent samples of a continuous field.

    Here, our point data represent sampled observations of an entity that can be measured anywhere within our study area.

    There are many interpolation tools available, but these tools can usually be grouped into two categories: deterministic and statistical interpolation methods.

    Proximity interpolation

    It was introduced by Alfred H. Thiessen more than a century ago. The goal is simple: Assign to all unsampled locations the value of the closest sampled location. This generates a tessellated surface whereby lines that split the midpoint between each sampled location are connected thus enclosing an area. Each area ends up enclosing a sample point whose value it inherits.
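
    Evaluating a proximity (Thiessen) interpolation at a location amounts to looking up the nearest sample, without building the polygons explicitly. The function name `nearest_value` and the three samples below are hypothetical.

```python
import math

def nearest_value(samples, x, y):
    """Return the value of the sample closest to (x, y); samples are (x, y, value)."""
    return min(samples, key=lambda s: math.hypot(s[0] - x, s[1] - y))[2]

# Hypothetical sampled locations with a measured value each
samples = [(0.0, 0.0, 10.0), (10.0, 0.0, 20.0), (0.0, 10.0, 30.0)]
v1 = nearest_value(samples, 2.0, 1.0)
v2 = nearest_value(samples, 9.0, 2.0)
v3 = nearest_value(samples, 1.0, 8.0)
```

    Every location inherits the value of the sample whose Thiessen polygon contains it, so the resulting surface changes abruptly at polygon boundaries.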

    Voronoi diagram

    Source: Wikipedia

    Voronoi & Delaunay triangulation

    Source: Francesco Bellelli in towardsdatascience

    Inverse Distance Weighted (IDW)

    The IDW technique computes an average value for unsampled locations using values from nearby weighted locations. The weights are proportional to the proximity of the sampled points to the unsampled location and can be specified by the IDW power coefficient.

    $\hat{Z}_j = \frac{\sum_i Z_i / d_{ij}^{n}}{\sum_i 1 / d_{ij}^{n}}$

    So a large n results in nearby points wielding a much greater influence on the unsampled location than a point further away resulting in an interpolated output looking like a Thiessen interpolation. On the other hand, a very small value of n will give all points within the search radius equal weight such that all unsampled locations will represent nothing more than the mean values of all sampled points within the search radius.
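
    The effect of the power coefficient can be explored with a direct implementation of the formula. This sketch uses all samples rather than a search radius, which is a simplifying assumption; the function name `idw` and the two samples are hypothetical.

```python
import math

def idw(samples, x, y, power=2.0):
    """Inverse distance weighted estimate at (x, y); samples are (x, y, value)."""
    num = den = 0.0
    for sx, sy, z in samples:
        d = math.hypot(sx - x, sy - y)
        if d == 0.0:
            return z  # the location coincides with a sample
        w = 1.0 / d ** power
        num += w * z
        den += w
    return num / den

samples = [(0.0, 0.0, 10.0), (10.0, 0.0, 20.0)]
z_mid = idw(samples, 5.0, 0.0)              # equidistant from both samples
z_flat = idw(samples, 2.0, 0.0, power=0.0)  # power 0: plain mean of the samples
z_sharp = idw(samples, 2.0, 0.0, power=8.0) # large power: dominated by nearest sample
```

    With a power of 0 the estimate is the plain mean of the samples, while a large power pulls the estimate towards the value of the nearest sample, as in a Thiessen interpolation.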

    Kriging

    Several forms of kriging interpolators exist: ordinary, universal and simple, just to name a few. This section will focus on ordinary kriging (OK) interpolation. This form of kriging usually involves the following steps:

    • Removing any spatial trend in the data
    • Computing the experimental variogram, $γ$, which is a measure of spatial autocorrelation
    • Defining a variogram model that best characterizes the spatial autocorrelation in the data
    • Interpolating the surface using the fitted variogram model
    • Adding the kriged interpolated surface to the trend interpolated surface to produce the final output

    We are interested in how these attribute values vary as the distance between location point pairs increases. For each pair of points, we can compute the semivariance, $γ$, by squaring the difference between their values and then dividing by 2.

    $\gamma = \frac{(Z_2 - Z_1) ^ 2}{2} = \frac{(-1.2 - (1.6)) ^ 2}{2} = 3.92$
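
    Repeating this calculation for every pair of points and averaging within distance bins gives an experimental variogram. The sketch below is a simplified illustration: the function names, the bin width and the transect of four samples are assumptions, with the first two values taken from the worked example above.

```python
import math
from collections import defaultdict

def semivariance(z1, z2):
    """Semivariance of a single pair of values."""
    return (z2 - z1) ** 2 / 2

def experimental_variogram(samples, lag_width):
    """Mean semivariance per distance bin; samples are (x, y, value) tuples."""
    bins = defaultdict(list)
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            xi, yi, zi = samples[i]
            xj, yj, zj = samples[j]
            h = math.hypot(xi - xj, yi - yj)        # separation distance
            bins[int(h // lag_width)].append(semivariance(zi, zj))
    # Key each bin by its centre distance
    return {(b + 0.5) * lag_width: sum(g) / len(g) for b, g in sorted(bins.items())}

pair_gamma = semivariance(1.6, -1.2)  # the pair from the worked example
transect = [(0.0, 0.0, 1.6), (1.0, 0.0, -1.2), (2.0, 0.0, 0.4), (3.0, 0.0, 2.0)]
gamma = experimental_variogram(transect, lag_width=1.0)
```

    Plotting the bin centres against the mean semivariances produces the cloud of points to which a variogram model is later fitted.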

    Experimental variogram

    Experimental semivariogram

    Variogram models

    Parameters in a variogram model

    Spherical model fit

    Earth observation

    Remote sensing

    Stefan-Boltzmann law & Wien's law

    Units

    • Radiant energy: total energy radiated in all directions (J).
    • Radiant flux: energy radiated in all directions per unit time (W).
    • Irradiance: radiant flux incident per unit area (W/m²).
    • Radiance: radiant flux emitted or reflected per unit area and per unit solid angle (W/(m² sr)).
    • Emissivity: ratio between the emittance of a surface and that of a perfect emitter (black body).
    • Reflectance: ratio of the flux reflected by a surface to the incident flux.
    • Absorptance: ratio of the flux absorbed by a surface to the incident flux.
    • Transmittance: ratio of the flux transmitted by a surface to the incident flux.

    Solar radiation & terrestrial radiation

    Sensor types

    Atmospheric interaction

    Spectral signature

    The spectral signature is defined as the distinctive behaviour of the radiation reflected (reflectance) or emitted (emittance) by a given type of surface or object across the different ranges of the electromagnetic spectrum. A graphical way to study this behaviour is to plot reflectance (%) on the Y axis and wavelength λ on the X axis. Joining the points with a continuous line yields a two-dimensional representation of the spectral signature.

    Raster

    Raster data (also known as grid data) represents surfaces. A raster, in its most basic form, is a matrix of cells (or pixels) grouped into rows and columns (or a grid), each cell containing a value reflecting information such as temperature. Each pixel corresponds to a particular geographic location.

    The extent and cell size of the raster, the number of rows and columns, and the coordinate reference system (CRS) are all factors to consider.

    Raster

    Representation of raster data. Source: National Ecological Observatory Network (NEON) via datacarpentry

    Image resolution

    Spatial resolution

    For film (analogue) → resolving power of the film: resolution is a function of the size distribution of the silver halide grains in the emulsion. Films with coarse grains have lower resolution but are more sensitive (faster) to light; films with finer grains have higher resolution but are less sensitive (slower) to light.

    Scale

    Source: National Ecological Observatory Network (NEON)

    IFOV --> GSD --> Pixel --> Resolution

    IFOV

    Pixel size

    Relationship between scale and resolution

    Image processing is concerned with:
    • Detection: discerning that discrete objects are present
    • Recognition: determining what type of object it is
    • Identification: identifying the specific object

    Scale vs. resolution

    Minimum mappable area (Área Mínima Cartografiable, AMC)

    The relationship between spatial resolution and scale is mediated by the AMC (the minimum area of a feature that must be represented on a map).

    Recommended pixel size

    Waldo Tobler's rule --> Map scale denominator = raster resolution (in meters) x 2 x 1000
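
    The rule can be applied in both directions, from pixel size to map scale and back. The two helper names below are hypothetical; the arithmetic is exactly the rule stated above.

```python
def map_scale_denominator(resolution_m):
    """Map scale denominator suggested for a raster resolution in meters."""
    return resolution_m * 2 * 1000

def resolution_for_scale(scale_denominator):
    """Raster resolution in meters suggested for a map scale 1:scale_denominator."""
    return scale_denominator / (2 * 1000)

scale_30m = map_scale_denominator(30)      # e.g. a 30 m pixel
res_25k = resolution_for_scale(25000)      # e.g. a 1:25,000 map
```

    A 30 m pixel is therefore appropriate for a map at about 1:60,000, and a 1:25,000 map calls for a pixel of about 12.5 m.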

    Spectral resolution

    Radiometric resolution

    Temporal scale --> Revisit time

    Temporal resolution vs. spatial resolution

    Resolution trade-off

    Satellite image processing

    There are many procedures for analysing satellite images. In this course we will focus on four of them:

    • Image pre-processing
    • Image enhancement
    • Image transformations
    • Image classification

    QA bands & Bitmasks

    Most optical satellite imagery products come with one or more QA bands that allow the user to assess the quality of each pixel and extract the pixels that meet their requirements.
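
    A QA band packs several yes/no flags into the bits of one integer per pixel, and a bitmask selects the flags of interest. The sketch below assumes a hypothetical layout in which bit 3 flags cloud and bit 4 flags cloud shadow; the actual bit positions depend on the product and must be taken from its documentation.

```python
# Hypothetical bit layout (an assumption for illustration only)
CLOUD_BIT = 3
SHADOW_BIT = 4

def bit_is_set(qa_value, bit):
    """True if the given bit of the QA value is 1."""
    return bool((qa_value >> bit) & 1)

def is_clear(qa_value):
    """True if neither the cloud bit nor the shadow bit is set."""
    mask = (1 << CLOUD_BIT) | (1 << SHADOW_BIT)
    return (qa_value & mask) == 0

# Four example QA values: clear, cloud, shadow, cloud and shadow
qa_pixels = [0b00000, 0b01000, 0b10000, 0b11000]
clear_mask = [is_clear(v) for v in qa_pixels]
```

    Only the first pixel passes the mask; the others would be excluded from further analysis.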

    Image pre-processing

    Any image acquired by a remote sensor exhibits a series of radiometric and geometric distortions.

    Striping

    Line drop

    Bit or noisy error - "salt-and-pepper" effect

    Geometric correction

    The processes of georeferencing (alignment of imagery to its correct geographic location) and orthorectifying (correction for the effects of relief and view direction on pixel location) are components of geometric correction necessary to ensure the exact positioning of an image.

    Landsat Level-1 products are precision registered and orthorectified through a systematic process that involves ground control points and a digital elevation model (DEM).

    Orthographic projection

    Image Orthorectification

    IKONOS Satellite Image Orthorectification

    Resampling

    Solar correction

    Solar correction accounts for solar influences on pixel values. Solar correction converts at-sensor radiance to top-of-atmosphere (TOA) reflectance by incorporating exoatmospheric solar irradiance (power of the sun), Earth-Sun distance, and solar elevation angle.
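
    A commonly used form of this conversion is ρ = π · L · d² / (ESUN · sin θ), where L is the at-sensor radiance, d the Earth-Sun distance in astronomical units, ESUN the exoatmospheric solar irradiance for the band and θ the solar elevation angle. The sketch below applies it to made-up numbers; the function and parameter names are hypothetical, and real values come from the image metadata.

```python
import math

def toa_reflectance(radiance, esun, earth_sun_dist_au, sun_elev_deg):
    """TOA reflectance from at-sensor radiance (band units must be consistent)."""
    sun_elev = math.radians(sun_elev_deg)
    return (math.pi * radiance * earth_sun_dist_au ** 2) / (esun * math.sin(sun_elev))

# Made-up inputs for illustration
r_overhead = toa_reflectance(50.0, 1570.0, 1.0, 90.0)  # sun directly overhead
r_low_sun = toa_reflectance(50.0, 1570.0, 1.0, 30.0)   # same radiance, lower sun
```

    For the same measured radiance, a lower sun implies a higher reflectance, because less solar energy reaches each unit of surface.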

    Atmospheric correction

    The energy that is captured by Landsat sensors is influenced by the Earth’s atmosphere. These effects include scattering and absorption due to interactions of the electromagnetic radiation with atmospheric particles (gases, water vapor, and aerosols)

    Topographic correction

    Topographic correction accounts for illumination effects from slope, aspect, and elevation that can cause variations in reflectance values for similar features with different terrain positions (Riaño et al. 2003).

    TOA (Level 1) vs BOA (Level 2)

    Thermal radiation

    Kinetic temperature is the internal manifestation of the average translational energy of the molecules that make up a body. As a consequence, objects radiate energy as a function of their temperature (radiant temperature); in addition, this sensed temperature corresponds to the first 50 cm and may not be representative of the whole object. However, because objects differ in emissivity, bodies can have the same kinetic temperature and still have different radiance. Only black bodies satisfy Trad = Tkin; for all other bodies the radiant temperature is always lower, because their emissivity is less than 1.

    Image Enhancement

    Histogram adjustment

    Filters

    Ratios

    Pan-sharpening

    Index

    NDVI Index

    Soil Vegetation Wetness Index (SVWI)

    Composite bands

    Band combination ---> True color

    Band combination ---> False color

    Combination of "bands" ---> ratios

    Image transformations

    Principal components

    Tasselled cap

    RGB --> IHS

    Image classification

    Unsupervised method

    Evaluation

    Cohen's kappa

    What is Google Earth Engine?

    • A cloud-based platform for planetary scale geospatial analysis
    • Uses Google's computational resources to reduce processing time
    • A massive archive of remote sensing data

    Google Earth Engine

    Source: Earth Engine Code Editor

    Networks

    Image by Elias Wilberg. Bike-sharing bikes in Helsinki and Espoo from one summer weekday in 2021

    Multivariate & Global Spatial Regression Models

    Simple Linear Regression model

    $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{\epsilon}_i$
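
    For a single predictor, the least-squares estimates have a closed form: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the means. The sketch below also returns $R^2$; the function name `fit_line` and the data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx                 # slope
    b0 = my - b1 * mx              # intercept
    residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
    ss_res = sum(e * e for e in residuals)
    ss_tot = sum((b - my) ** 2 for b in y)
    return b0, b1, 1 - ss_res / ss_tot

# Made-up observations lying close to the line y = 2x
x_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_obs = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, r2 = fit_line(x_obs, y_obs)
```

    The residuals returned by such a fit are what the assumptions of the model (normality, constant variance, no autocorrelation) refer to.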

    Linear regression

    Multivariate regression model

    $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \hat{\beta}_2 X_{i2} + \dots + \hat{\beta}_n X_{in} +\hat{\epsilon}_i$

    Ordinary least squares regression (OLS)

    Assumptions

    • Linear relationship between the dependent and independent variables
    • Multivariate normality: the residuals of the linear model should be normally distributed
    • No multicollinearity between independent variables, i.e. they should not be correlated with each other
    • Homoscedasticity: the errors/residuals should have constant variance (no trends)
    • No autocorrelation: residuals (errors) in the model should not be correlated in any way

    Results

    Source: Medium (Stuti Singh, 2020)

    $R^2$

    Adjusted $R^2$

    Spatial regression

    Spatial regression is about explicitly introducing space or geographical context into the statistical framework of a regression

    Introducing spatial dependence in a regression can be done by considering:

    • Exogenous effects (Wx): take into account the spatial lag of the explanatory variable
    • Spatial error model: include the spatial lag in the error term of the equation
    • Spatial lag model (Wy): introduce a spatial lag of the dependent variable

    Introducing spatial heterogeneity in a regression can be done by considering:

    • Proximity variables: consider closeness to factors / environmental characteristics that might influence the modelled phenomena as an explanatory variable
    • Spatial fixed effects (FE) / spatial regimes (SR): consider the uniqueness of a place and (i) allow the constant term to vary geographically (FE) or (ii) allow the explanatory variables (in addition to the constant) to vary geographically as well (SR)

    Source: Spatial data science for sustainable development

    Geographically Weighted Regression (GWR)

    GWR is an extension of the linear regression model that allows the regression coefficients to vary across geographical space.

    $\hat{Y}_i = \hat{\beta}_0 (u_i,v_i) + \sum_{k=1}^{m}\hat{\beta}_k (u_i,v_i) X_{ik} +\hat{\epsilon}_i$

    where $(u_i, v_i)$ are the spatial coordinates of observation $i$, and $\hat{\beta}_k (u_i, v_i)$ are the coefficients estimated at those locations.

    Thus, in contrast to global linear regression models (LRMs), GWR conducts local regression at a series of locations to estimate local coefficients (the geographical part of GWR), using observations weighted by their distances to the location at the center of the moving window/kernel (the weighted part).
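
    The local fitting step can be sketched for a single predictor as a weighted least-squares regression centred on a target location. This is a simplified illustration, assuming a Gaussian kernel with a fixed bandwidth; the function name `gwr_local_fit` and the data (two distant groups with different slopes) are hypothetical.

```python
import math

def gwr_local_fit(coords, x, y, target, bandwidth):
    """Weighted least squares for y = b0 + b1 * x around a target location."""
    # Gaussian distance-decay weights relative to the target
    w = [math.exp(-0.5 * (math.dist(c, target) / bandwidth) ** 2) for c in coords]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    return my - b1 * mx, b1

# Western group follows y = x, eastern group follows y = 3x
coords = [(0, 0), (1, 0), (2, 0), (100, 0), (101, 0), (102, 0)]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [1.0, 2.0, 3.0, 3.0, 6.0, 9.0]
_, slope_west = gwr_local_fit(coords, x, y, target=(1, 0), bandwidth=5.0)
_, slope_east = gwr_local_fit(coords, x, y, target=(101, 0), bandwidth=5.0)
```

    The local slopes recover the two different relationships, whereas a single global regression on these data would report an intermediate slope.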

    Parameters

    Bandwidth is the distance band or number of neighbors used for each local regression equation and is perhaps the most important parameter to consider for Geographically Weighted Regression, as it controls the degree of smoothing in the model.

    It can be based on either Number of Neighbors or Distance Band. When Number of Neighbors is used, the neighborhood size is a function of a specified number of neighbors, which allows neighborhoods to be smaller where features are dense and larger where features are sparse. When Distance Band is used, the neighborhood size remains constant for each feature in the study area, resulting in more features per neighborhood where features are dense and fewer per neighborhood where they are sparse.

    Parameters

    The power of GWR is that it applies a geographical weighting to the features used in each of the local regression equations. Features that are farther away from the regression point are given less weight and thus have less influence on the regression results for the target feature; features that are closer have more weight in the regression equation. The weights are determined using a kernel, which is a distance decay function that determines how quickly weights decrease as distances increase. The Geographically Weighted Regression tool provides two kernel options in the Local Weighting Scheme parameter, Gaussian and Bisquare.
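
    The two kernels differ mainly in how they treat distant features. The sketch below uses common textbook forms of the Gaussian and bisquare functions; the exact formulas used by a particular software tool may differ, and the function names are hypothetical.

```python
import math

def gaussian_kernel(d, bandwidth):
    """Gaussian weight: decays smoothly and never reaches exactly zero."""
    return math.exp(-0.5 * (d / bandwidth) ** 2)

def bisquare_kernel(d, bandwidth):
    """Bisquare weight: decays to exactly zero at the bandwidth and beyond."""
    return (1 - (d / bandwidth) ** 2) ** 2 if d < bandwidth else 0.0

bandwidth = 10.0
g_near, g_edge = gaussian_kernel(0.0, bandwidth), gaussian_kernel(10.0, bandwidth)
b_half, b_edge = bisquare_kernel(5.0, bandwidth), bisquare_kernel(10.0, bandwidth)
```

    With the Gaussian kernel every feature keeps some influence, while the bisquare kernel excludes all features beyond the bandwidth.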

    Single bandwidth

    A single bandwidth is used in GWR under the assumption that the response-to-predictor relationships operate over the same scales for all of the variables contained in the model. This may be unrealistic because some relationships can operate at larger scales and others at smaller ones. A standard GWR will nullify these differences and find a “best-on-average” scale of relationship non-stationarity (geographical variation).

    Multiscale GWR

    In multiscale GWR, the bandwidth for each relationship is determined separately, allowing the scale of individual response-to-predictor relationships to vary.

    Error distribution