In this post we will see how to convert a single shapefile containing vector features of several classes (where the class is stored as an attribute of the features) to multiple single-class presence-absence rasters, one for each class. More concretely, we have one shapefile containing a bunch of vector features, representing many classes. For each class, we want to extract one single band presence-absence raster, that is a raster where a pixel equals 1 if there is an vector of this class overlapping it, 0 else.

A typical use case is to have a shapefile containing many species occurrences (point features) and extract a presence-absence raster for each species. We illustrate the process with such an example: occurrences of 3 different species in New-Caledonia.

To achieve this task we rely on the Python 3 programming language (note that the same can be achieved with Python 2) and the GeoPandas and Rasterio libraries. GeoPandas is a geospatial extension of the Pandas data analysis library which allows the manipulation of vector data. On the other hand, Rasterio is mainly based on the GDAL library and provides an Python API to read, manipulate and write raster data through NumPy N-dimensional arrays.

1 - Read the occurrence shapefile with GeoPandas

Our first step consists in reading the shapefile with GeoPandas, which is quite straightforward: we import the library and store the content of the shapefile into a DataFrame with the read_file function.

 1 2 3  import geopandas as gpd features = gpd.read_file("path_to_the_shapefile") class_column = "species_id" 

Below is the DataFrame that we get with our example dataset:

id species_id geometry
0 1 POINT (164.6773504277914 -20.69403200188892)
1 1 POINT (165.1705275154343 -21.06129153524009)
2 1 POINT (165.4853214011639 -21.51249610478582)
3 1 POINT (164.9291855363749 -21.06129153524009)
4 1 POINT (164.3625565420617 -20.33726559806206)
5 1 POINT (165.9260328411853 -21.58594801145605)
6 1 POINT (165.1390481268613 -21.17671596000761)
7 2 POINT (165.0131305725695 -20.87241520380234)
8 2 POINT (164.8032679820831 -20.9563602399969)
9 2 POINT (165.4013763649693 -21.35509916192103)
10 2 POINT (165.8525809345151 -21.76433121336948)
11 2 POINT (166.7654832031308 -22.14208387624497)
12 3 POINT (166.2618129859635 -21.60693427050469)
13 3 POINT (166.1778679497689 -21.69087930669925)
14 3 POINT (168.0771243936707 -21.53348236383446)
15 3 POINT (167.1747152545793 -20.72551139046188)
16 3 POINT (166.4297030583526 -22.12109761719634)

Note: The unit of the coordinates and any induced measure depends on the dataset’s Coordinate Reference System (CRS). In our case, we encoded our occurrences locations using WGS84, our data is thus expressed in degrees.

2 - Define the bounding box and the output resolution

We want our output rasters to cover a certain geographical extent (more or less 1 pixel), which is usually called the bounding box. A bounding box is tuple of 4 values, $$(x_{min}, y_{min}, x_{max}, y_{max})$$. In the case of geographic data, $$x_{min}$$ and $$x_{max}$$ are the west and east longitudes, and $$y_{min}$$ and $$y_{max}$$ are the south and north latitudes. Here, we define our bounding box such that it covers New Caledonia’s boundaries: (163.5423, -22.6963, 168.1895, -19.535). Here is a nice online tool for defining a bounding box.

From the bounding box, We can define the output resolution $$\text{pixel_width} \times \text{pixel_height}$$, that is the dimension of the output rasters pixels, in the CRS unit (we want square pixels in this example, so $$\text{pixel_width} = \text{pixel_height}$$). For instance, if we want $$n_{cols}$$ columns, we compute the resolution as $$\text{pixel_width} = (x_{max} - x_{min}) / n_{cols}$$, we then compute the number of rows as $$n_{rows} = \lceil(y_{max} - y_{min}) / \text{pixel_width}\rceil$$. $$\lceil x \rceil$$ means the ceiling of $$x$$, that is the smallest integer greater than or equal to $$x$$, we use it to ensure that the whole bounding box will be included in our rasters. Note that we could also define the number of columns and rows and compute the resolution, but we would end with rectangular pixels (which can be desired in some cases).

 1 2 3 4 5  import math x_min, y_min, x_max, y_max = 163.5423, -22.6963, 168.1895, -19.535 n_cols = 600 pixel_width = pixel_height = (x_max - x_min) / n_cols n_rows = math.ceil((y_max - y_min) / pixel_width) 

3 - Define the affine transformation

The affine transformation is a tool from matrix algebra that allows the transformation between classical raster grid coordinates and geographical coordinates. This article provides a quick explanation of what are affine transformations and how to use them with Python.

 1 2  from affine import Affine affine = Affine(pixel_width, 0, x_min, 0, -pixel_height, y_max) 

4 - Iterate over the species ids and rasterize the corresponding features with Rasterio

We now have everything we need for generating our output rasters. First, we iterate over the classes (obtained in line 4). Then, for each class (line 15), we extract the corresponding features (line 16). We create a list of couples from the geometries, each couple containing the geometry and the value 1 (line 17). This value is the one that will be given to the pixels overlapping the corresponding geometry. We then use the rasterize function from rasterio.features to generate a $$n_{rows} \times n_{cols}$$ NumPy ndarray (lines 19-24). We finaly store this array in a 1-band raster file, using the open function of Rasterio (lines 25-26).

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26  import os import rasterio import rasterio.features classes = features[class_column].unique() output_base_path = "base_path_of_the_output_rasters" out_meta = { 'width': n_cols, 'height': n_rows, 'transform': affine, 'crs': features.crs, 'driver': "GTiff", 'count': 1, 'dtype': rasterio.int16 } for class_value in classes: feats = features[features[class_column] == class_value] shapes = [(geom, 1) for geom in feats.geometry] output_path = os.path.join(output_base_path, "{}.tif".format(class_value)) rasterized = rasterio.features.rasterize( shapes=shapes, out_shape=(out_meta['height'], out_meta['width']), dtype=out_meta['dtype'], transform=out_meta['transform'] ) with rasterio.open(output_path, 'w', **out_meta) as out: out.write_band(1, rasterized) 

5 - Putting all together into a generic function

The whole process can be encapsulated into a generic python function, that can also be used as a command line function. Below is the code of such a generic function. You can also find it on this gist, with a command-line enabled version.

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78  import os import math import geopandas as gpd import rasterio import rasterio.features from affine import Affine def multiclass_vector_to_singleclass_rasters(input_vector, class_column, bounding_box, output_base_path, pixel_size=None, n_cols=None, n_rows=None, file_prefix='', file_suffix='', all_touched=False): # Load the features features = gpd.read_file(input_vector) # Compute resolution x_min, y_min, x_max, y_max = bounding_box # - Case 1: pixel size is directly given if pixel_size is not None: if isinstance(pixel_size, (list, tuple)) and len(pixel_size) >= 2: pixel_width = pixel_size[0] pixel_height = pixel_size[1] elif isinstance(pixel_size, (int, float)): pixel_width = pixel_size pixel_height = pixel_size else: raise TypeError() n_rows = math.ceil((y_max - y_min) / pixel_height) n_cols = math.ceil((x_max - x_min) / pixel_width) # - Case 2: pixel size determined from the number of columns and/or rows elif n_cols is not None: if n_rows is not None: # - Case 2.1: from number of columns and rows pixel_width = (x_max - x_min) / n_cols pixel_height = (y_max - y_min) / n_rows else: # - Case 2.2: from number of columns only (compute n_rows) pixel_width = (x_max - x_min) / n_cols pixel_height = pixel_width n_rows = math.ceil((y_max - y_min) / pixel_width) elif n_rows is not None: # - Case 2.3: from number of rows only (compute n_cols) pixel_height = (y_max - y_min) / n_rows pixel_width = pixel_height n_cols = math.ceil((x_max - x_min) / pixel_height) else: raise TypeError() # Define the affine transformation affine = Affine(pixel_width, 0, x_min, 0, -pixel_height, y_max) # Generate the rasters out_meta = { 'width': n_cols, 'height': n_rows, 'transform': affine, 'crs': features.crs, 'driver': 'GTiff', 'count': 1, 'dtype': rasterio.int16 } classes = features[class_column].unique() for class_value in classes: feats = features[features[class_column] == class_value] shapes = [(geom, 1) for geom in feats.geometry] file_name = "{}{}{}.tif".format(file_prefix, class_value, file_suffix) if not os.path.exists(output_base_path): os.makedirs(output_base_path) output_path = os.path.join(output_base_path, file_name) rasterized = rasterio.features.rasterize( shapes=shapes, out_shape=(out_meta['height'], out_meta['width']), dtype=out_meta['dtype'], transform=out_meta['transform'], all_touched=all_touched ) with rasterio.open(output_path, 'w', **out_meta) as out: out.write_band(1, rasterized)