An introduction to eDITH

Overview

eDITH (eDNA Integrating Transport and Hydrology) allows spatially explicit interpretation and prediction of environmental DNA (eDNA) data collected from river networks. It implements the eDITH model, which couples a geomorphological and hydrological characterization of a catchment, eDNA transport and decay dynamics, and a species distribution model, to transform pointwise water eDNA data (sensu (Pawlowski, Apothéloz-Perret-Gentil, and Altermatt 2020)) collected at a catchment into space-filling predicted maps of taxon density.

Features:

It provides estimations of detection probability of any number of taxa across a whole catchment based on spatially replicated eDNA data of these taxa.
It can handle both DNA concentration data (e.g., from qPCR) and metabarcoding (read counts) data of any taxonomic group, and can be used to predict space-filling maps of density of taxa (say, at species, genus or family level), OTUs (Operational Taxonomic Units) or ASVs (Amplicon Sequence Variants).
Model fit can be performed via Bayesian techniques or optimization algorithms.
Covariates can be specified by the user and/or selected by means of Asymmetric Eigenvector Maps (AEMs).
An interface to the DHARMa package for residual diagnostics is provided.

eDITH requires the use of river networks defined as river objects, which can be built via the rivnet package.

Installing the package

eDITH can be installed from CRAN:

install.packages("eDITH")

The development version can be installed from Github:

devtools::install_github("lucarraro/eDITH")

Among other packages, eDITH depends on rivnet, which in turn depends on the traudem package. traudem relies on the TauDEM library. traudem provides a guide to correct installation of TauDEM and its dependencies for different operating systems, and offers wrapper commands to call TauDEM methods from R. Please read the traudem documentation carefully.

The eDITH model in a nutshell

Overview and underlying assumptions

The eDITH model (Carraro et al. 2018; Carraro, Mächler, et al. 2020; Carraro, Stauffer, and Altermatt 2021; Carraro, Blackman, and Altermatt 2023) makes use of spatially replicated eDNA measurements within a river network to infer the spatial distribution of any taxon (or OTU, or ASV; hereafter simply “taxon”) of interest across that river network. The key underlying concept is that eDNA particles are advected downstream by streamflow, and hence an eDNA sample is not only representative of the location where it is taken, but it provides information about taxon abundance (in the case of qPCR single-species data) or biodiversity (for metabarcoding data) for a certain area upstream of the sampling location. By exploiting information from multiple sampling sites distributed in space, and embedding a model for transport and concurrent decay of DNA, the eDITH model is able to disentangle the various sources of DNA shedding (and hence, of the target taxon’s abundance). As a result, eDITH complements pointwise eDNA measurements by projecting taxon distributions (and thus biodiversity information) into space-filling catchment maps.

DNA production and decay rates

The DNA production (i.e., shedding) rate of a taxon in stream water is assumed to be proportional to its density¹ (Lodge et al. 2012; Apothéloz-Perret-Gentil et al. 2017). The validity of this assumption is restricted to the time point when eDNA is sampled. Thus, it is irrelevant if a taxon varies in its release of DNA across the season, as long as DNA released is proportionate to taxon density at the time of sampling. For the same reason, hydrological variables (such as discharge and velocity) are assumed to be constant in time (not in space!) across the duration of a sampling campaign. For instance, if multiple days are required to perform eDNA sampling across all sampling sites, these days would be assumed to constitute a unique time point, during which hydrological conditions can be adequately represented by time-averaged values of discharge and velocity.

The decay of DNA is assumed to be expressed by first order kinetics (i.e., the rate of change of DNA concentration decreases linearly with time). Similarly to the production rate, it is assumed that a single value of decay time is representative for a given taxon at a given time point.

Sampling design

The spatial extent of the eDNA sampling must adequately cover the entire river network, in order to allow the model to make robust predictions. As a rule of thumb, there should be one sampling site per each 10–20 km² of drained area.

Moreover, all main tributaries of a river network should be sampled in a spatially hierarchical design. Ideally, sampling sites should be located just upstream of a confluence, so that the independent signals from the joining tributaries can be gauged. See (Carraro, Stauffer, and Altermatt 2021) for further information on optimal eDNA sampling strategies.

The mean distance (along the river network) between eDNA samples should not be too large, otherwise the eDNA found at an upstream site would be totally depleted before reaching the next downstream site. In this case, the eDITH model could not make inferences based on these multiple sites. To make an example, if we consider first order kinetics, a water velocity of 1 ms^-1 and a decay time of 4 h (i.e., a half life of 2.77 h), we would be able to measure only about 50% of the concentration that we would measure 10 km upstream, and 3% of the concentration that we would measure 50 km upstream². As a rule of thumb, sites that are more than 50 km apart (along the river network) should be considered as independent samples. See also (Deiner and Altermatt 2014; Pont et al. 2018) with respect to transport distances of eDNA.

Finally, the model assumes that eDNA is well mixed in the water column, and/or that samples at a site adequately span the river cross section (e.g., sampling at both banks and at the centre, for large rivers, while in small rivers sampling at a single bank might be sufficient, by assuming that eDNA is sufficiently well mixed). Analogous care should be taken in the choice of a representative sampling volume to be filtered (Altermatt et al. 2023).

Measurement errors

Variability in DNA detection through sampling (e.g., filtration) or laboratory procedures (e.g. DNA extraction, Polymerase Chain Reaction (PCR), or sequencing) do not introduce a systematic bias in DNA concentrations across samples.

If eDITH is used to model metabarcoding data, the expected read count at a given site for a given taxon is assumed to be proportional to the underlying DNA concentration.

Replicated eDNA measures at the same site and time point are treated as independent measures. No distinctions is made on whether these are physical (i.e., different water samples) or laboratory (different PCR runs) replicates.

What taxa can be modelled?

In principle, the eDITH model is suitable for aquatic macroorganisms that shed their DNA into stream water. Nonetheless, the model can also be extended for microorganisms that are sampled in their entirety (e.g., free floating bacteria), in which case their upstream sources can be considered as biofilm colonies, and the decay time does not refer to DNA molecules but rather to the bacteria lifetime.

The eDITH model does not make a distinction on whether the eDNA data is referred to an assigned taxon (say, at a species, genus or family level) or an unassigned cluster such as OTUs or ASVs. What matters is that the assumptions on production and decay rates (see DNA production and decay rates) must hold.

In principle, it is also possible to estimate the spatial distribution of terrestrial, rather than aquatic, taxa, provided that the assumption of proportionality between DNA production rate and taxon density holds. In this case, the spatial unit on which predictions are performed is the subcatchment, i.e., the portion of land that directly drains towards the associated reach.

River network model

The domain on which the eDITH model is applied is a river network discretized into reaches, i.e. segments of river not interrupted by confluences and treated as smallest spatial units. Each reach is considered as a node of the network, and the ensemble of the reaches covers the entire river network. It is assumed that reaches have no internal variability (e.g., the exact coordinates of a sampling site do not matter, provided that the sampling site is associated to the same reach). Note that the along-stream distance between two consecutive confluences can be covered by a sequence of different reaches, to allow for a finer discretization of the river network. Indeed, number and maximum length of reaches can be tuned by using function aggregate_river of rivnet. See also (Carraro, Bertuzzo, et al. 2020) and the documentation of the OCNet package for details on aggregation of a river network into reaches.

Lakes and braided channels

The eDITH model is designed to work for river networks not containing lakes or reservoirs. Dynamics of eDNA in lakes are still largely unclear, but the very large residence times of water particles in lakes compared to those of a river reach imply that all dissolved DNA entering a lake either degrades or is retained by substrate particles. Particle-bound DNA can eventually be resuspended (Shogren et al. 2017), but this likely occurs at long time scales (say, monthly, yearly or even longer), hence the assumption of first-order decay kinetics with a single decay time value across the river network would be violated. Similarly, the river networks produced by rivnet do not admit bifurcations in the downstream direction (e.g., braided or artificial channels creating loops in the river network). Hence, the different braids or channels cannot be treated as independent reaches in the eDITH model. In such a case, the eDITH model could still be applied by considering a single channel as a conceptual equivalent of the multiple real braids or channels.

The governing equations

The main equation of the eDITH model results from a mass balance of eDNA across a cross-section of a river, and it reads:

$\begin{equation} C_j = \frac{1}{Q_j}\sum_{i \in \gamma(j)} p_i A_{S,i} \exp\left(-\frac{L_{ij}}{\overline{v_{ij}} \tau}\right) \end{equation}$

where:

$C_j$ is the DNA concentration at a sampling site $j$ (strictly speaking, $j$ is the reach where a sampling site is located);
$Q_j$ is the water discharge in $j$ ;
$\gamma(j)$ identifies the set of reaches upstream of $j$ ;
$p_i$ is the DNA production rate at an upstream reach $i$ ;
$A_{S,i}$ is the source area in $i$ (i.e., extent of the node; considering a river reach and an aquatic taxon, this could e.g. be equal to the product of its length and width);
$L_{ij}$ is the length of the along-stream path joining $i$ to $j$ ;
$\overline{v_{ij}}$ is the average water velocity along $L_{ij}$ ;
$\tau$ is a characteristic decay time for DNA in stream water.

Assuming that the morphology and hydrology of the river network are known (and thus lengths, areas and discharges), the above equation links DNA concentrations to the unknown parameters $\tau$ and $\mathbf{p}=\left( p_1, \dots, p_N\right)$ (where $N$ is the total number of reaches). While the former is linked to the behaviour of DNA in stream water (and could in principle be measured, or at least its value be constrained), the estimation of the latter is the actual goal of the eDITH model. Contrasting observed and modelled DNA concentrations thus enables the estimation of maps of $\mathbf{p}$ across all $N$ reaches constituting the river network, and hence of relative taxon density, given the initial assumption.

It is often convenient (both from a modelling and interpretation viewpoint) to express the DNA production rate $p_i$ as a function of environmental covariates, possibly related to the spatial patterns of the investigated taxon:

$\begin{equation} p_i = p_0 \exp\left( \boldsymbol\beta^T \mathbf{X}(i) \right) \end{equation}$

where $\mathbf{X}(i)$ is a vector of covariates evaluated at reach $i$ ; $\boldsymbol\beta$ a vector of covariate effect sizes; and $p_0$ a baseline production rate, i.e., the DNA production rate at a site where all covariates are at a null level. This reduces the number of unknowns from $N$ (size of $\mathbf{p}$ ) to the number of selected covariates (length of $\boldsymbol\beta$ ) plus one ( $p_0$ ).

If the eDNA data are in the form of read counts, the above equations remain applicable. Indeed, $C_j$ would play the role of the expected read number at site $j$ (proportional to the underlying DNA concentration, as per the initial hypothesis). Consequently, $\mathbf{p}$ would in this case represent the DNA production rates multiplied by such constant of proportionality.

Estimating detection probability

Once model parameters are estimated (either via a Bayesian method or likelihood maximization), it is possible to transform DNA production rates $\mathbf{p}$ into corresponding detection probabilities. This is done by exploiting the assumption on the probability distribution used to model measurement errors (and hence formulate the likelihood, see Likelihood function and model parameters).

In particular, this is done by calculating, for each reach $j$ , the expected eDNA value $\widetilde{C_j}$ (concentration or read count) if that reach were detached from the river network (that is, in the absence of upstream inputs, with the water discharge in the reach being equal to the locally produced discharge³). This is then transformed into a detection probability value, calculated as the probability of observing a non-null eDNA value under the assumed error probability distribution and the expected value $\widetilde{C_j}$ .

The so-obtained detection probability maps can be further transformed into presence-absence maps and used to assess biodiversity patterns (e.g., Carraro, Blackman, and Altermatt 2023).

Required data

eDNA data must be provided in the data field of run_eDITH_BT and run_eDITH_optim as a data frame with components values (eDNA values measured for a given taxon and a given time point) and ID (identifiers of the network nodes (i.e., reaches) at the AG level⁴ where the sampling sites are located). To identify the latter, function locate_site from rivnet can be used.

The river network must be provided as a river object, obtained via the rivnet package. It is fundamental that the river be aggregated into reaches (via aggregate_river) and that it contains hydrological data, such that discharges and water velocities can be used as input in the eDITH model. This can be obtained via the hydro_river function of rivnet. In its simplest setting, a single value of discharge (or depth) and width, not necessarily at the same reach, are required in order to extrapolate hydraulic variables across the whole network. The exact method with which hydrological variables are extrapolated to the river network depends on the number and type of data provided; please refer to the documentation of hydro_river for further details.

Optionally, covariates can be passed to run_eDITH_BT and run_eDITH_optim as a data frame. Function covariate_river of rivnet allows computing covariate values from raster maps and a river object. If covariates are not provided, asymmetric eigenvector maps (AEMs) are calculated on the river network and used as covariates. AEMs (Blanchet, Legendre, and Borcard 2008) are mutually orthogonal spatial variables obtained by a spatial filtering technique that considers space in an asymmetric way, and are thus suitable to model species distributions in river networks. It is of course possible to combine user-provided covariates and AEMs, as detailed in the flowchart below:

Flowchart for the choice of covariates used to fit the eDITH model.

Likelihood function and model parameters

By default, eDITH implements four probability distributions to model errors between observed and modelled data:

the normal (ll.type = "norm") and log-normal (ll.type = "lnorm") distributions are suitable to model DNA concentration data.
the geometric (ll.type = "geom") and negative binomial (ll.type = "nbinom") distributions are suitable for read count data.

Note that the user is expected to set option ll.type according to the nature of the data provided in run_eDITH_BT or run_eDITH_optim.

Moreover, the optional argument no.det = TRUE can be used to produce a zero-inflated error distribution. This is mandatory if ll.type = "lnorm", as the log-normal distribution does not admit zeros.

Default model parameters are as follows:

tau represents decay time $\tau$ in hours.
log_p0 is the logarithm in base 10 of the baseline production rate $p_0$ . The units of $p_0$ are equal to the units of the inputted data$values multiplied by a length unit and divided by a time unit. For instance, if data$values contains DNA concentration data in mol m^-3 and the river object contains discharge (river$AG$discharge) in m³s^-1 and areas⁵ in m², $p_0$ is expressed in mol m^-2 s^-1 (i.e. amount of DNA shed per unit area and unit time). In the case of read count data, the unit for $p_0$ might look weird; this is because, in this case, $p_0$ embeds a constant that transforms expected DNA concentration predicted by the model into expected read number.
beta_X is the effect size of covariate X, where X is replaced by the object names in the covariates data frame (if provided), and/or by e.g. AEM1 when AEM eigenfunctions are used. In this case, the number indicates the corresponding eigenfunction.

Additional parameters can be added depending on the likelihood definition:

sigma is the standard deviation of the error (when ll.type = "norm" or "lnorm"). Note that, if a log-normal distribution is used, sigma is the standard deviation of the non-log-transformed values (and is hence different than parameter sdlog of dlnorm).
omega is the overdispersion parameter (when ll.type = "nbinom"), which is defined as the ratio between variance and mean (Lindén and Mäntyniemi 2011).
no additional error parameter is present if ll.type = "geom".
Cstar is a further parameter added when the optional argument no.det = TRUE is passed. In this case, the probability of non-detection is expressed as exp(-C/Cstar), where C is the modelled eDNA value (expected concentration or read number) at a given site. Increased Cstar increases the non-detection probability, all else being equal. See Carraro et al. (2018) for details.

A minimal example

Generating a `river` object

To illustrate the functioning of eDITH, let’s run an example in the river Wigger (Switzerland), whose river object can be called via data(wigger). Such an object can be built via the rivnet package as follows. Please refer to rivnet documentation for details on its functions.

# Extract river from DEM
river <- rivnet::extract_river(outlet=c(637478,237413),
                    EPSG=21781, #CH1903/LV03 coordinate system
                    ext=c(6.2e5,6.6e5,2e5,2.5e5),
                    z=9)

# Aggregate river - default thrA and maxReachLength = 2500 m
river <- rivnet::aggregate_river(river, maxReachLength = 2500)

# Hydraulic data: width = 8 m, discharge = 15 m3/s at the outlet
hydrodata <- data.frame(data = c(8, 15), 
                        type = c("w", "Q"), 
                        node = river$AG$outlet*c(1, 1))

# Assign hydraulic variables across the river network
river <- rivnet::hydro_river(hydrodata, river)

# Attribute landcover classes as covariates
r1 <- terra::rast(system.file("extdata/landcover.tif", 
                       package = "rivnet"))
river <- rivnet::covariate_river(r1, river)

Preparing and displaying eDNA data

An example of DNA concentration data for a given target taxon can be called as:

data(dataC)

dataC contains the following objects:

ID: identifiers of river network nodes at the AG levels where eDNA has been sampled (these can be retrieved via rivnet::locate_site based on the site coordinates).
values: corresponding values of DNA concentration measured (in mol m^-3)

dataC contains eDNA values for 24 sampling sites, each of which has been sampled three times. In such case, the three different eDNA values are inserted in dataC as independent rows:

dataC[which(dataC$ID==2),]
#>    ID       values
#> 1   2 0.000000e+00
#> 25  2 1.037331e-12
#> 49  2 8.176798e-13

We can use function rivnet::points_colorscale to display the mean DNA concentration values across the sampling sites:

sites <- unique(dataC$ID)
values <- numeric(length(sites))
for (ind in 1:length(sites)){
  s <- sites[ind]
  values[ind] <- mean(dataC$values[dataC$ID==s])
}

plot(river)
rivnet::points_colorscale(river$AG$X[unique(dataC$ID)], river$AG$Y[unique(dataC$ID)],
                          values)
title("Mean observed DNA concentration [mol m-3]")

Running the eDITH model

We can define a set of covariates containing main landcover classes (urban areas, agricultural areas and forest) in addition to elevation and log-scaled drainage area:

covariates <- data.frame(urban = river$SC$locCov$landcover_1,
                         agriculture = river$SC$locCov$landcover_2,
                         forest = river$SC$locCov$landcover_3,
                         elev = river$AG$Z,
                         log_drainageArea = log(river$AG$A))

Bayesian methods

We can use a Bayesian approach to fit the eDITH model. Function run_eDITH_BT interfaces with the BayesianTools package, which contains several different Bayesian samplers. The default choice in run_eDITH_BT is DREAMzs (Carraro, Blackman, and Altermatt 2023). We can run eDITH with the above-specified covariates:

set.seed(1)
out.bt.cov <- run_eDITH_BT(dataC, river, covariates)

The default option for ll.type is norm, which is suitable for DNA concentration data.

Alternatively, we can run eDITH via AEM covariates. By default, the first 10% AEM eigenfunctions are used as covariates.

set.seed(1)
out.bt.aem <- run_eDITH_BT(dataC, river)

Optimization

Alternatively, an optimization method can be used to estimate the best-fit parameters of the eDITH model. Function run_eDITH_optim internally calls optim and its default Nelder-Mead method. The main options such as covariates and ll.type work in the same way as in run_eDITH_BT:

set.seed(27)
out.opt.aem <- run_eDITH_optim(dataC, river, n.AEM = 10, 
    n.attempts = 1)

For illustrative purposes, we here use a reduced number of AEM eigenfunctions (10) and a single optimization attempt. It is advisable to increase the number of optimization attempts (as in the default settings) to ensure convergence.

We can qualitatively check the goodness of fit of the model:

plot(out.opt.aem$C[dataC$ID], dataC$values, 
     xlim=c(0,8e-12), ylim=c(0, 8e-12), asp=1,
     xlab = "Modelled concentration [mol m-3]", 
     ylab = "Observed concentration [mol m-3]")
abline(0,1)

And also display modelled concentrations across the river network, by using rivnet’s plot method:

plot(river, out.opt.aem$C, colLevels=c(0, max(values), 1000), addLegend = FALSE, 
     colPalette = hcl.colors(1000, "Reds 3", rev=T))
rivnet::points_colorscale(river$AG$X[unique(dataC$ID)], river$AG$Y[unique(dataC$ID)],
                          values)
title("DNA concentration [mol m-3]")

Finally, we can have a look at the predicted map of relative taxon density (i.e., DNA production rate $\mathbf{p}$ )

plot(river, out.opt.aem$p)
title('DNA production rate [mol m-2 s-1]')

And the corresponding detection probability $\mathbf{p_D}$ :

plot(river, out.opt.aem$probDet)
title('Detection probability')

References

Altermatt, F., L. Carraro, M. Antonetti, C. Albouy, Y. Zhang, A. Lyet, X. Zhang, and L. Pellissier. 2023. “Quantifying Biodiversity Using eDNA from Water Bodies. General Principles and Recommendations for Sampling Designs.” Environmental DNA. https://doi.org/10.1002/edn3.430.

Apothéloz-Perret-Gentil, L., A. Cordonier, F. Straub, J. Iseli, and J. Pawlowski. 2017. “Taxonomy-Free Molecular Diatom Index for High-Throughput eDNA Biomonitoring.” Molecular Ecology Resources. https://doi.org/10.1111/1755-0998.12668.

Blanchet, F. G., P. Legendre, and D. Borcard. 2008. “Modelling Directional Spatial Processes in Ecological Data.” Ecological Modelling. https://doi.org/10.1016/j.ecolmodel.2008.04.001.

Carraro, L., E. Bertuzzo, E. A. Fronhofer, R. Furrer, I. Gounand, A. Rinaldo, and F. Altermatt. 2020. “Generation and Application of River Network Analogues for Use in Ecology and Evolution.” Ecology and Evolution. https://doi.org/10.1002/ece3.6479.

Carraro, L., R. C. Blackman, and F. Altermatt. 2023. “Modelling Environmental DNA Transport in Rivers Reveals Highly Resolved Spatio-Temporal Biodiversity Patterns.” Scientific Reports. https://doi.org/10.1038/s41598-023-35614-6.

Carraro, L., H. Hartikainen, J. Jokela, E. Bertuzzo, and A. Rinaldo. 2018. “Estimating Species Distribution and Abundance in River Networks Using Environmental DNA.” Proceedings of the National Academy of Sciences of the United States of America. https://doi.org/10.1073/pnas.1813843115.

Carraro, L., E. Mächler, R. Wüthrich, and F. Altermatt. 2020. “Environmental DNA Allows Upscaling Spatial Patterns of Biodiversity in Freshwater Ecosystems.” Nature Communications. https://doi.org/10.1038/s41467-020-17337-8.

Carraro, L., J. B. Stauffer, and F. Altermatt. 2021. “How to Design Optimal eDNA Sampling Strategies for Biomonitoring in River Networks.” Environmental DNA. https://doi.org/10.1002/edn3.137.

Deiner, K., and F. Altermatt. 2014. “Transport Distance of Invertebrate Environmental DNA in a Natural River.” PLOS ONE. https://doi.org/10.1371/journal.pone.0088786.

Lindén, A., and S. Mäntyniemi. 2011. “Using the Negative Binomial Distribution to Model Overdispersion in Ecological Count Data.” Ecology. https://doi.org/10.1890/10-1831.1.

Lodge, D. M., C. R. Turner, C. L. Jerde, M. A. Barnes, L. Chadderton, S. P. Egan, J. L. Feder, A. R. Mahon, and M. E. Pfrender. 2012. “Conservation in a Cup of Water. Estimating Biodiversity and Population Abundance from Environmental DNA.” Molecular Ecology. https://doi.org/10.1111/j.1365-294X.2012.05600.x.

Pawlowski, J., L. Apothéloz-Perret-Gentil, and F. Altermatt. 2020. “Environmental DNA. What’s Behind the Term? Clarifying the Terminology and Recommendations for Its Future Use in Biomonitoring.” Molecular Ecology. https://doi.org/10.1111/mec.15643.

Pont, D., M. Rocle, A. Valentini, R. Civade, J. Pauline, A. Maire, N. Roset, M. Schabuss, H. Zornig, and T. Dejean. 2018. “Environmental DNA Reveals Quantitative Patterns of Fish Biodiversity in Large Rivers Despite Its Downstream Transportation.” Scientific Reports. https://doi.org/10.1038/s41598-018-28424-8.

Shogren, A. J., J. L. Tank, E. Andruszkiewicz, B. Olds, A. R. Mahon, C. L. Jerde, and D. Bolster. 2017. “Controls on eDNA Movement in Streams. Transport, Retention, and Resuspension.” Scientific Reports. https://doi.org/10.1038/s41598-017-05223-1.

Yates, M. C., D. M. Glaser, J. R. Post, M. E. Cristescu, D. J. Fraser, and A. M. Derry. 2020. “The Relationship Between eDNA Particle Concentration and Organism Abundance in Nature Is Strengthened by Allometric Scaling.” Molecular Ecology. https://doi.org/10.1111/mec.15543.

Luca Carraro, Florian Altermatt