ggBRT - Tutorial

Jean-Baptiste Jouffray

2019-01-16

Vignette Info

This document is a package vignette for ggBRT designed to explore and visualise the results of boosted regression trees fitted with the gbm.step routine in the dismo package. The vignette shows the use of every function in the package by using data from Jouffray et al. (2019), for which ggBRT was originally created.

Introduction

The tutorial builds on data from 620 sites across the Hawaiian islands classified into four reef regimes. Each site is associated to a set of 20 biophysical and anthropogenic predictors. We use boosted regression trees (Elith et al. 2008) to model the occurence of each regime at each site based on those predictors. We start by computing boosted regression trees (BRT) with the gbm.step routine and then provide examples for every function included in ggBRT.

# Load ggBRT 
library(ggBRT)

# Import data
dat<-Hawaii_RegimesPredictors
PredictorsCat<-Hawaii_PredictorsCategories

# Select the set of 20 predictors
pred<-c(15:34)
# Run the boosted regression trees analyses
brt1 <- gbm.step(data=dat, gbm.x = pred, gbm.y = which(names(dat) == 'Regime1'),
                 family = "bernoulli", tree.complexity = 3,
                 learning.rate = 0.01, bag.fraction = 0.75)
brt2 <- gbm.step(data=dat, gbm.x = pred, gbm.y = which(names(dat) == 'Regime2'),
                 family = "bernoulli", tree.complexity = 3,
                 learning.rate = 0.01, bag.fraction = 0.75)
brt3 <- gbm.step(data=dat, gbm.x = pred, gbm.y = which(names(dat) == 'Regime3'),
                 family = "bernoulli", tree.complexity = 3,
                 learning.rate = 0.01, bag.fraction = 0.75)

ggPerformance

Function to extract the models performance of one or several boosted regression trees. Returns a data frame with mean total deviance, mean residual deviance, correlation, area under the receiver operating curve (AUC), percentage deviance explained, cross-validated deviance, cross-validated correlation, cross-validated AUC and cross-validated percentage deviance explained, calculated with the formula 1-(cross-validated deviance/Total.Deviance).

ggPerformance(Regime1=brt1,Regime2=brt2,Regime3=brt3)
#>                      Regime1    Regime2    Regime3
#> Total.Deviance     1.1778978  1.1809989  1.0801094
#> Residual.Deviance  0.4217460  0.5241492  0.3731044
#> Correlation        0.8551102  0.8088271  0.8657369
#> AUC                0.9751000  0.9572000  0.9837000
#> Per.Expl          64.1950265 55.6181480 65.4567942
#> cvDeviance         0.6947494  0.7500610  0.6546351
#> cvCorrelation      0.6896335  0.6542699  0.6577353
#> cvAUC              0.8966000  0.8818000  0.9008700
#> cvPer.Expl        41.0178546 36.4892719 39.3917820

ggInfluence

Function to return a data frame with the relative influence of each predictor in the boosted regression trees and a bar plot. If signif= TRUE, retains only the “significant” predictors (100/number of predictors, Muller et al. 2013) and rescale their influence to 100. There are several parameters to customise the aesthetics of the plot, see ?ggInfluence.

ggInfluence(brt1)

#>                                 rel.inf
#> Fishing_NonComm_Boat_Total  21.51630207
#> Complexity                  20.06472410
#> Fishing_Comm_Total          11.60439750
#> Depth                        8.84157882
#> Effluent                     5.51566196
#> WAV_ANOM_F                   4.88705511
#> WAV_CLIM_M                   3.56151977
#> CHL_CLIM_M                   3.46760599
#> PAR_STD                      3.39546441
#> Fishing_NonComm_Shore_Line   2.62461046
#> SST_STD                      2.59124138
#> Fishing_NonComm_Shore_Spear  2.48290879
#> Fishing_NonComm_Shore_Net    2.33254295
#> SST_CLIM_M                   1.97082226
#> CHL_ANOM_F                   1.87677332
#> Sedimentation                1.74553360
#> PAR_CLIM_M                   0.80488729
#> New_Development              0.67903956
#> Invasive_Algae               0.02503618
#> Habitat_Modification         0.01229448
ggInfluence(brt1, signif=T,col.bar = "darkred",main = "Relative influence - Regime 1")

#>                              rel.inf
#> Fishing_NonComm_Boat_Total 31.855868
#> Complexity                 29.706741
#> Fishing_Comm_Total         17.180841
#> Depth                      13.090361
#> Effluent                    8.166189

ggMultiInfluence

Function to return and plot (heatmap) the relative influence of predictors from several boosted regression trees. The models need to have the same set of predictors. There are several parameters to customise the aesthetics of the plot, see ?ggMultiInfluence.

ggMultiInfluence(Regime1=brt1,Regime2=brt2,Regime3=brt3)

#>                      Predictor Regime1 Regime2 Regime3
#> 1   Fishing_NonComm_Boat_Total 21.5163  7.0412   4.282
#> 2                   Complexity 20.0647  8.4717   7.623
#> 3           Fishing_Comm_Total 11.6044  2.5704   2.878
#> 4                        Depth  8.8416  8.6967   8.872
#> 5                     Effluent  5.5157  1.8023   2.630
#> 6                   WAV_ANOM_F  4.8871  1.7214   3.590
#> 7                   WAV_CLIM_M  3.5615 36.4942   8.220
#> 8                   CHL_CLIM_M  3.4676  0.8984  13.556
#> 9                      PAR_STD  3.3955  3.5059   8.998
#> 10  Fishing_NonComm_Shore_Line  2.6246  0.8620   2.162
#> 11                     SST_STD  2.5912  7.8055   2.886
#> 12 Fishing_NonComm_Shore_Spear  2.4829  4.2890   4.834
#> 13   Fishing_NonComm_Shore_Net  2.3325  0.4240   1.029
#> 14                  SST_CLIM_M  1.9708  9.3312   6.521
#> 15                  CHL_ANOM_F  1.8768  1.7004   7.659
#> 16               Sedimentation  1.7455  0.9092   0.881
#> 17                  PAR_CLIM_M  0.8049  1.5126   6.902
#> 18             New_Development  0.6790  1.9322   5.650
#> 19              Invasive_Algae  0.0250  0.0319   0.827
#> 20        Habitat_Modification  0.0123  0.0000   0.000

ggCatInfluence

Function to return the relative influence per category of predictors. If signif= TRUE, retains only the “significant” predictors (100/number of predictors, Muller et al. 2013) and rescale their influence to 100.

# all predictors
ggCatInfluence(brt1,brt2,brt3,category = "Category",data=PredictorsCat)
#>        Category   Model1   Model2   Model3
#> 1 Anthropogenic 48.53833 19.86211 25.17297
#> 2   Biophysical 51.46167 80.13789 74.82703
# significant predictors 
ggCatInfluence(brt1,brt2,brt3,category = "Category",data=PredictorsCat, signif=T)
#>        Category  Model1    Model2    Model3
#> 1 Anthropogenic 57.2029  9.045641  7.634679
#> 2   Biophysical 42.7971 90.954359 92.365321

ggPD

Function to draw partial dependency plots with fitted function. There are several parameters to customise the aesthetics of the plot: see ?ggPD.

ggPD(brt1,rug = T)

ggPDfit

Function to draw partial dependency plots with fitted values. There are several parameters to customise the aesthetics of the plot: see ?ggPDfit.

ggPDfit(brt1)

ggPD_boot

Function to draw partial dependency plots with fitted function and confidence intervals obtained through bootstrapping (with replacement). There are several parameters to customise the aesthetics of the plot: see ?ggPDboot.

# Boostrap the BRT 1000 times to build confidence intervals
brt1.prerun<- plot.gbm.4list(brt1)
brt1.boot <- gbm.bootstrap.functions(brt1, list.predictors=brt1.prerun, n.reps=100)
# Draw partial dependency plots a given predictor (i.e., "Complexity")
ggPD_boot(brt1, predictor="Complexity", list.4.preds=brt1.prerun, booted.preds=brt1.boot$function.preds, type.ci = "ribbon",rug = T)

ggMultiPD

Function to draw partial dependency plots with fitted functions from several boosted regression trees. The models need to have the same set of predictors. There are several parameters to customise the aesthetics of the plot: see ?ggMultiPD.

ggMultiPD(brt1, brt3, predictor = "Complexity", legend.pos = "top", col.lines =c("blue","red"))

ggMultiPDfit

Function to draw partial dependency plots with fitted values from several boosted regression trees. The models need to have the same set of predictors. There are several parameters to customise the aesthetics of the plot: see ?ggMultiPDfit.

ggMultiPDfit(brt1, brt3, predictor = "Complexity", legend.pos = "top", col.dots =c("blue","red"), cex.dot=1.5, cex.smooth = 0.8)

ggInteract_list

Function to return the interactions sizes for a given boosted regression trees.

ggInteract_list(brt2,index = F)
#>    var1.names                  var2.names int.size
#> 1  WAV_CLIM_M                     SST_STD    64.82
#> 2       Depth                  WAV_CLIM_M    18.51
#> 3     SST_STD                  SST_CLIM_M    18.17
#> 4  Complexity                  SST_CLIM_M     9.96
#> 5       Depth                     SST_STD     7.55
#> 6  WAV_CLIM_M  Fishing_NonComm_Boat_Total     7.17
#> 7  WAV_CLIM_M                  SST_CLIM_M     6.31
#> 8  WAV_CLIM_M                     PAR_STD     2.08
#> 9  Complexity                  WAV_CLIM_M     2.07
#> 10 Complexity Fishing_NonComm_Shore_Spear     1.52
#> 11    PAR_STD                  PAR_CLIM_M     1.19
#> 12      Depth                  SST_CLIM_M     0.87
#> 13 WAV_CLIM_M                  CHL_ANOM_F     0.84
#> 14 SST_CLIM_M  Fishing_NonComm_Boat_Total     0.78
#> 15 Complexity                     SST_STD     0.71
#> 16    PAR_STD                     SST_STD     0.71
#> 17 WAV_CLIM_M          Fishing_Comm_Total     0.57
#> 18      Depth                     PAR_STD     0.46
#> 19 WAV_ANOM_F                  WAV_CLIM_M     0.31
#> 20 WAV_CLIM_M                    Effluent     0.31

ggInteract_boot

Function to randomize the response n times to generate a distribution under the null hypothesis of no interaction among predictors for one or several pairs of predictors (following Pinsky and Byler 2015). Model parameters should be the same as for the observed interaction strength. Results allow to compare with observed interaction strength.


# Randomization of the response to test significance of the two strongest interactions
Interact_boot_brt2<-ggInteract_boot(c('WAV_CLIM_M','SST_STD'),c('Depth','WAV_CLIM_M'),nboots = 10,data=dat, predictors = pred, response=which(names(dat) == 'Regime2'),family = "bernoulli", tc = 5, lr = 0.001, bf= 0.75,global.env=F)

ggInteract_2D

Function to draw 2D interaction plot showing predicted values for two predictors from a boosted regression trees. Values for all other variables are set at their mean by default. There are several parameters to customise the aesthetics of the plot: see ?ggInteract_2D.

ggInteract_2D(gbm.object = brt2,x="WAV_CLIM_M",y="SST_STD",col.gradient = c("white","#5576AF"),show.dot = T,col.dot = "grey20",alpha.dot = 0.5,cex.dot = 0.2,label.contour = T,col.contour = "#254376",show.axis = T,legend = T)
#> maximum value =  0.42

ggInteract_3D

Function to draw 3D interaction plot showing predicted values for two predictors from a boosted regression trees. Values for all other variables are set at their mean by default. Possibility to export an interactive version of the plot in html format. There are several parameters to customise the aesthetics of the plot: see ?ggInteract_3D.

ggInteract_3D(brt2,x="WAV_CLIM_M",y="SST_STD")
#> maximum value =  0.42

References

Elith, J., Leathwick, J. R., & Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology, 77(4), 802-813.

Jouffray JB, Wedding L.M., Norstrom A.V., Donovan M.K., Williams G.J., Crowder L.B., Erickson A.L., Friedlander A.M., Graham N.A.J., Gove J.M., Kappel C.V., Kittinger J.N., Lecky J., Oleson K.L.L., Selkoe K.A., White C., Williams I.D., Nystrom M. (2019). Parsing Human and Biophysical Drivers of Coral Reef Regimes. Proc. R. Soc. B.

Müller, D., Leitão, P. J., & Sikor, T. (2013). Comparing the determinants of cropland abandonment in Albania and Romania using boosted regression trees. Agricultural Systems, 117, 66-77.

Pinsky, M. L., & Byler, D. (2015). Fishing, fast growth and climate variability increase the risk of collapse. In Proc. R. Soc. B (Vol. 282, No. 1813, p. 20151053).