Introduction to Spatial Data
Overview
Teaching: 30 min
Exercises: ?? minQuestions
TBD
Objectives
TBD
This tutorial provides an overview of key spatial data concepts, including structures and common storage and transfer formats.
Skill Level: This tutorial provides conceptual background for the Data Carpentry geospatial tutorial series. The concepts outlined in this tutorial relate to many programming languages and data types!
Goals / Objectives
After completing this activity, you will:
- Understand the data structures used to represent spatial information, including their strengths and weaknesses
- Become familiar with common storage and transfer formats
All of the topics below are covered in more detail in later episodes. This episode just provides enough background to help you get started.
Structures: Raster and Vector
Raster
The raster data structure is a regular grid of equally-sized cells, each holding a single numeric value. The cell values can represent a continuous surface (e.g. elevation) or a categorical classification (e.g. land use). If this sounds familiar, it is because this data structure is very common: its how we represent any digital image. A geospatial raster is only different from a digital photo in that it is accompanied by spatial information that connects the data to a particular location. This includes the raster’s extent and cell size, the number of rows and columns, and its coordinate reference system (CRS, more on this later).
Raster data has some important advantages:
- representation of continuous surfaces
- potentially very high levels of detail
- Data is ‘unweighted’ across its extent - the geometry doesn’t implicitly highlight features (wording?? unbiased?)
- Cell-by-cell calculations can be very fast and efficient
The downsides of raster data are
- very large file sizes as cell size gets smaller
- currently popular formats don’t embed metadata well (more on this later!)
- can be difficult to represent complex information
Satellite imagery is probably the most complex raster representation you will work with. Satellites like Landsat capture information at multiple wavelengths, from both human-visible and infrared bands. Final products are split into specific wavelength ranges, so to make a colour image of the land surface, one may have to combine red, green, and blue datasets.
Vector
Vector data structures aim to represent specific features on the Earth’s surface, and then assign attributes to those features. All vector datasets are based around coordinate points - usually paired x,y values in 2D space. There are many possible way to arrange and connect a set of points, so some standards have been developed to keep vector formats interoperable. The most common standard you will encounter is OGC Simple Features.
Simple Features defines 17 types of vector geometry, and the vast majority of data uses just three of these classes - Points, Lines, and Polygons. Points are assembled into more complex structures by straight lines only.
< insert image >
A point is just a single coordinate pair. A line made when at least two points are grouped together. A polygon requires at least three points, and then a fourth point that matches the first one, closing the loop. The points that make up lines and polygons also need to be arranged in a sensible sequence to be valid - if you draw straight lines between each point, those lines should never cross. Following these rules makes it possible to do complex geometric operations by layering vector datasets together.
Vector data has some important advantages:
- The geometry itself contains information about what the dataset creator thought was important
- The geometry structures hold information in themselves - why choose point over polygon, for instance?
- Each geometry feature can carry multiple attributes instead of just one, e.g. a database of cities can have attributes for name, country, population, etc
- Data storage can be very efficient compared to rasters
The downsides of vector data are
- potential loss of detail compared to raster
- potential bias in datasets - what didn’t get recorded?
- Calculations involving multiple vector layers need to do math on the geometry as well as the attributes, so can be slow compared to raster math
Vector datasets are in use in many industries besides geospatial. For instance, computer graphics are largely vector-based, although the data structures in use tend to join points using arcs and complex curves rather than straight lines. Computer-aided design (CAD) is also vector-based. The difference is, again, that geospatial datasets are accompanied by information tying their features to real-world locations. Its time to talk about how that works.
Coordinate Reference Systems
A data structure cannot be considered geospatial unless it is accompanied by spatial reference system (SRS) information, in a format that geospatial applications can use to display and manipulate the data correctly. SRS information connects data to the Earth’s surface using a mathematical model.
Note: ‘CRS’ (‘coordinate reference system’) is a common acronym used interchangeably with SRS.
CRS information has three components:
CRS = Datum + Projection + Additional Parameters formatting
A common analogy employed to teach projections is the orange peel analogy. If you imagine that the earth is an orange, how you peel it and then flatten the peel is similar to how projections get made. We will also use it here.
Datum
A Datum is a model of the shape of the earth. It has angular units (i.e. degrees) and defines the starting point (i.e. where is (0,0)?) so the angles reference a meaningful spot on the earth. Common global datums are WGS84 and NAD83. Datums can also be local, fit to a particular area of the globe, but ill-fitting outside the area of intended use.
When datums are used by themselves they are called a Geographic Coordinate System.
Orange Peel Analogy: a datum is your choice of fruit to use in the orange peel analogy. Is the earth an orange, a lemon, a lime, a grapefruit?

Image source: https://github.com/MicheleTobias/R-Projections-Workshop
Projection
A Projection is a mathematical transformation of the angular measurements on a round earth to a flat surface (i.e. paper or a computer screen).
The units associated with a given projection are usually linear (feet, meters, etc.).
Many people use the term “projection” when they actually mean “coordinate reference system”.
Orange Peel Analogy: a projection is how you peel your orange and then flatten the peel.

Image source: http://blogs.lincoln.ac.nz/gis/2017/03/29/where-on-earth-are-we/
Additional Parameters
Additional parameters are often necessary to create the full coordinate reference system. For example, one common additional parameter is a definition of the center of the map. The number of required additional parameters depends on what is needed by each specific projection.
Orange Peel Analogy: an additional parameter could include a definition of the location of the stem of the fruit.
Which CRS/projection should I use?
To decide if a projection is right for your data, answer these questions:
- What is the area of minimal distortion?
- What aspect of the data does it preserve?
University of Colorado’s Map Projections and the Department of Geo-Information Processing has a good discussion of these aspects of projections. Online tools like Projection Wizard can also help you discover projections that might be a good fit for your data.
Comments from the pros
Take the time to figure identify a projection that is suited for your project. You don’t have to stick to the ones that are popular.
Describing Coordinate Reference Systems
There are several common systems in use for storing and transmitting CRS information, as well as translating between different CRSs. These systems generally comply with ISO 19111. EPSG, PROJ, and OGC WKT are the most common. They aren’t usually used on their own, but are built in to geospatial software.
EPSG
The EPSG system is a database of CRS information maintained by the International Association of Oil and Gas Producers. The dataset contains both CRS definitions and information on how to safely convert data from one CRS to another. Using EPSG is easy as every CRS has a integer identifier, e.g. WGS84 is EPSG:4326. The downside is that you can only use the CRSs EPSG defines and cannot customise them.
Detailed information on the structure of the EPSG dataset is available here.
PROJ
PROJ is an open-source library for storing, representing and transforming CRS information. PROJ.5 has been recently released, but PROJ.4 was in use for 25 years so you will still mostly see PROJ referred to as PROJ.4.
PROJ represents CRS information as a text string of key-value pairs, which makes it easy to customise (and with a little practice, easy to read and interpret).
OGC Well-known text (WKT)
The OGC WKT standard is used by a number of important geospatial apps and software libraries. WKT is a nested list of geodetic parameters. The structure of the information is defined here. WKT is valuable in that the CRS information is more transparent than in EPSG, but can be more difficult to read and compare than PROJ. Additionally, the WKT standard is implemented inconsistently across various software platforms, and the spec itself has some known issues (more information here).
Translating between CRS systems
CRS information can generally be translated between EPSG, PROJ and WKT representations without too much trouble, but some mistranslations are possible, especially with obscure projections. For convenience, the website spatialreference.org holds descriptions of many common projections data in several formats. Users should that the site does not appear to be actively maintained at present, with the last update made in 2013. The GDAL library (more on this in Lesson 3) has a function called gdalsrsinfo that will report a file’s CRS information in the format of your choice.
Metadata
Spatial data is useless without metadata. Essential metadata is, of course, the CRS information, but proper spatial metadata encompasses more than that. History and provenance of a dataset (how it was made), who is in charge of maintaining it, and appropriate (and inappropriate!) use cases should also be documented in metadata. This information should accompany a spatial dataset wherever it goes.
In practice this can be difficult, as many spatial data formats don’t have a built-in place to hold this kind of information. Metadata often has to be stored in a companion file, generated and maintained manually.
Common storage formats
Raster
Many geospatial raster formats are just existing image formats with an extended definition that allows CRS information to be embedded in the file. GeoTIFF is one of the most common of these, along with MrSID, JPEG2000 and IMG. Other formats are ASCII-based (GRD, XYZ, ASC), with a few rows of plain-text header information followed by cell values arranged in rows and columns. These are often less useful due to their inefficient storage. More robust, but more complicated formats like NetCDF are available, but not commonly used outside of research. Other formats are industry-specific, like GRIB for meteorology.
Vector
Many vector file formats (particularly ESRI SHP and MapInfo TAB) are really several interrelated files on disk - one holding a table of attributes, usually in DBF format, one holding related geometric data, and various index and header files. This can make them difficult to move around without placing them in an archive format like zip. Despite this and several other problems, the Shapefile (SHP) is still the most commonly used vector data format.
GeoPackage is an SQLite database with an extended definition that allows spatial data storage. It has the advantage of being a single file on disk, along with stronger internal rules around data and encoding.
XML and JSON-style formats for vector spatial data also exist, notably KML (popularised in Google Earth) and GeoJSON. These formats are commonly used by software developers for delivering geospatial data over web services. They have the advantage of being streamable (you don’t need to download and open a whole file, you can just access part of it), but like any plain-text format, file size becomes very large very quickly. GeoJSON also only supports the WGS84 coordinate reference system.
Why not both?
Very few formats can contain both raster and vector data - in fact, most are even more restrictive than that. Vector datasets are usually locked to one geometry type, e.g. points only. Raster datasets can usually only encode one data type, for example you can’t have a multiband GeoTIFF where one layer is integer data and another is floating-point.
There are sound reasons for this - format standards are easier to define and maintain, and so is metadata. The effects of particular data manipulations are more predictable if you are confident that all of your input data has the same characteristics. Even so, some limited support for mixed vector geometries is available in R’s sf package, and mixed raster datatypes in raster. Such objects can only be saved in R’s native object storage format, RDS, and should be used with caution.
Format interoperability
Many existing file formats were invented by GIS software developers, often in a closed-source environment. This led to the large number of formats on offer today, and considerable problems transferring data between software environments. Some companies have built their own file translation capabilities into their software, but maintaining this capability takes a lot of work. The Geospatial Data Abstraction Library (GDAL) is an open-source answer to this issue.
GDAL is a set of software tools that translate between almost any geospatial format in common use today (and some not so common ones). GDAL also contains tools for editing and manipulating both raster and vector files, including reprojecting data to different CRSs. GDAL can be used as a standalone command-line tool, or built in to other GIS software. Several open-source GIS programs use GDAL for all file import/export operations.
Key Points
TBD