Supplementary Material

Supporting information for the paper ‘Examining properness in the external validation of survival models with squared and logarithmic losses’
Published: May 19, 2025

High variability for generated survival and censoring distributions

To illustrate the diversity of the Weibull distributions used in our simulation design, we plot the density f(t) and survival S(t) functions for 10 randomly sampled shape (\alpha) and scale (\sigma) parameter pairs. Across simulations, these distributions serve as the true survival, predicted survival, or censoring distributions. The wide parameter range induces substantial variability, ensuring a rich family of distributions and increasing the sensitivity of our evaluation to violations of properness. By construction, all data-generating mechanisms assume Y is independent of C, satisfying the random censoring assumption.
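
For reference, in the shape/scale parameterization used by dweibull() and pweibull() below, a Weibull(\alpha, \sigma) distribution has density and survival functions

f(t) = \frac{\alpha}{\sigma} \left(\frac{t}{\sigma}\right)^{\alpha - 1} \exp\!\left[-\left(\frac{t}{\sigma}\right)^{\alpha}\right], \qquad S(t) = \exp\!\left[-\left(\frac{t}{\sigma}\right)^{\alpha}\right], \qquad t > 0.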

Code
library(ggplot2)
library(patchwork)
library(dplyr)

set.seed(20250408)

shape = runif(10, 0.5, 5)
scale = runif(10, 0.5, 5)

# Density values over a grid of 41 time points for each (shape, scale) pair;
# melt() stacks the 41 x 10 matrix column-wise, so each group spans 41 consecutive rows
dw = mapply(function(shape, scale) dweibull(seq.int(1, 5, 0.1), shape, scale), shape, scale) |>
  reshape2::melt() |>
  mutate(x = rep(seq.int(1, 5, 0.1), 10),
         group = rep(sprintf("(%.2f,%.2f)", shape, scale), each = 41)) |>
  select(x, group, value)
# Same for the survival function S(t) = P(T > t)
pw = mapply(function(shape, scale) pweibull(seq.int(1, 5, 0.1), shape, scale, lower.tail = FALSE), shape, scale) |>
  reshape2::melt() |>
  mutate(x = rep(seq.int(1, 5, 0.1), 10),
         group = rep(sprintf("(%.2f,%.2f)", shape, scale), each = 41)) |>
  select(x, group, value)

(ggplot(dw, aes(x = x, y = value, group = group, color = group)) + ylab("f(x)")) /
(ggplot(pw, aes(x = x, y = value, group = group, color = group)) + ylab("S(x)")) +
  plot_layout(guides='collect') &
  geom_line() &
  theme_classic() &
  guides(color = guide_legend(title='(α,σ)')) &
  scale_color_viridis_d(option = "turbo")

Weibull density (top) and survival (bottom) curves for randomly generated shape (α) and scale (σ) parameters. Each curve corresponds to a different parameter pair, illustrating the broad variability across simulated distributions.
Code
# ggsave("weibull_plots.png", device = png, dpi=600, width =7, height = 4)

Table D1

Note

This section supports the “Simulation Experiments” section of the paper.

Use properness_test.R to run the experiments. To run for different sample sizes n, use run_tests.sh. Merged simulation results for all n are available here.

Load libraries and explore data:

Code
library(dplyr)
library(tibble)
library(tidyr)
library(DT)

res = readRDS("results/res_sims10000_distrs1000_0.rds")

res[sample(1:nrow(res), 10), ] |>
  DT::datatable(rownames = FALSE, options = list(searching = FALSE)) |>
  formatRound(columns = 3:22, digits = c(rep(3,8), rep(5,12)))

Each row corresponds to one simulation, using m = 1000 draws from Weibull survival distributions at a fixed sample size n \in \{10, \dots, 1000\}.

The output columns are:

  • sim: simulation index (k in the paper)
  • n: sample size
  • {surv|pred|cens}_{scale|shape}: the scale and shape parameters of the Weibull distributions for the true, predicted and censoring times, respectively
  • prop_cens: proportion of censoring in each simulation
  • tv_dist: the total variation distance between the true and the predicted Weibull distribution (closer to 0 means more similar distributions; a numerical sketch follows this list)
  • {score}_diff, {score}_sd: the mean and standard deviation of the score difference D between true and predicted distributions across draws
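
The total variation distance between two continuous distributions equals half the L1 distance between their densities. Below is a minimal numerical sketch for two Weibull distributions; it is illustrative only (not necessarily the implementation used in properness_test.R) and the parameter values are hypothetical.

Code
# Total variation distance between two Weibull densities:
# TV = 0.5 * integral over t >= 0 of |f1(t) - f2(t)| dt
tv_weibull = function(shape1, scale1, shape2, scale2) {
  integrand = function(t) abs(dweibull(t, shape1, scale1) - dweibull(t, shape2, scale2))
  0.5 * integrate(integrand, lower = 0, upper = Inf)$value
}

tv_weibull(1.5, 2.0, 1.5, 2.0) # identical distributions -> 0
tv_weibull(1.5, 1.0, 4.0, 4.0) # nearly non-overlapping distributions -> TV close to 1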

Scoring rules analyzed:

  • The SBS at the 10th, 50th and 90th percentiles of observed times
  • The ISBS, using 50 equidistant points between the 5th and 80th percentile of observed times
  • The RCLL and its re-weighted rRCLL (RCLL* in the paper)
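
For orientation, a commonly used inverse-probability-of-censoring-weighted (IPCW) form of the SBS at a fixed time \tau, with \hat{G} the censoring survival function used for weighting, is

\text{SBS}(\tau) = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\hat{S}_i(\tau)^2 \, \mathbb{1}(t_i \le \tau, \delta_i = 1)}{\hat{G}(t_i)} + \frac{(1 - \hat{S}_i(\tau))^2 \, \mathbb{1}(t_i > \tau)}{\hat{G}(\tau)} \right],

and the RCLL contribution of observation i, with predicted density \hat{f}_i and survival \hat{S}_i, is

\text{RCLL}_i = -\log\!\left[ \hat{f}_i(t_i)^{\delta_i} \, \hat{S}_i(t_i)^{1 - \delta_i} \right].

The ISBS integrates \text{SBS}(\tau) over the evaluation time points. Weighting conventions vary slightly across implementations; the exact definitions (including the re-weighting that yields RCLL*) are given in the main paper.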

We compute 95% confidence intervals (CIs) for the mean score difference D using a t-distribution. A violation is flagged as statistically significant if both of the following hold:

  • The mean score difference exceeds a threshold: \bar{D} > 0.001
  • The CI excludes zero (i.e., \text{CI}_\text{lower} > 0)
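
Concretely, with \bar{D} the mean and s_D the standard deviation of the score differences over the m = 1000 draws of a simulation, the interval and criterion implemented in the code below are

\text{CI} = \bar{D} \pm t_{0.975,\, m-1} \cdot \frac{s_D}{\sqrt{m}}, \qquad \text{significant violation} \iff \bar{D} > 0.001 \ \text{and} \ \text{CI}_\text{lower} > 0.
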
Code
res = res |>
  select(!matches("shape|scale|prop_cens|tv_dist"))

measures = c("SBS_median", "SBS_q10", "SBS_q90", "ISBS", "RCLL", "rRCLL")
n_distrs = 1000 # `m` value from paper's experiment
tcrit = qt(0.975, df = n_distrs - 1) # 95% t-test
threshold = 1e-3

data = measures |>
  lapply(function(m) {
    mean_diff_col = paste0(m, "_diff")
    mean_sd_col = paste0(m, "_sd")
    
    res |>
      select(sim, n, !!mean_diff_col, !!mean_sd_col) |>
      mutate(
        se = .data[[mean_sd_col]] / sqrt(n_distrs),
        CI_lower = .data[[mean_diff_col]] - tcrit * se,
        CI_upper = .data[[mean_diff_col]] + tcrit * se,
        signif_violation = .data[[mean_diff_col]] > threshold & CI_lower > 0
      ) |>
      select(!se) |>
      mutate(metric = !!m) |>
      relocate(metric, .after = 1) |>
      rename(diff = !!mean_diff_col,
             sd   = !!mean_sd_col)
  }) |>
  bind_rows()

data$metric = factor(
  data$metric,
  levels = measures,
  labels = c("SBS (Median)", "SBS (Q10)", "SBS (Q90)", "ISBS", "RCLL", "RCLL*")
)

For example, the violation results for the first simulation and n = 250 across all scoring rules are:

Code
data |> 
  filter(sim == 1, n == 250) |>
  DT::datatable(rownames = FALSE, options = list(searching = FALSE)) |>
  formatRound(columns = 4:7, digits = 5)

We summarize violations across simulations by computing, for each scoring rule and sample size:

  • Number of significant violations
  • Violation rate
  • Average, median, min and max score difference among simulations where violations occurred
Code
all_stats = 
  data |> 
  group_by(n, metric) |> 
  summarize(
    n_violations = sum(signif_violation),
    violation_rate = mean(signif_violation),
    diff_mean = if (any(signif_violation)) mean(diff[signif_violation]) else NA_real_,
    diff_median = if (any(signif_violation)) median(diff[signif_violation]) else NA_real_,
    diff_min = if (any(signif_violation)) min(diff[signif_violation]) else NA_real_,
    diff_max = if (any(signif_violation)) max(diff[signif_violation]) else NA_real_,
    .groups = "drop"
  )

all_stats |>
  arrange(metric) |>
  DT::datatable(
    rownames = FALSE,
    options = list(searching = FALSE), 
    caption = htmltools::tags$caption(
      style = "caption-side: top; text-align: center; font-size:150%",
      "Table D1: Empirical violations of properness")
  ) |>
  formatRound(columns = 4:8, digits = 5)
Summary
  • RCLL and RCLL* show no violations across any simulation or sample size.
  • SBS shows time- and sample-size-dependent violations, mostly at smaller n, especially for early evaluation times (\tau = q_{0.1}). All differences are small (typically < 0.01).
  • ISBS shows minor violations only at n = 10; none for n > 50.

Table D2

Note

The only difference from the previous experiment is that S_C(t) is now estimated using the marginal Kaplan-Meier model via survival::survfit() instead of the true Weibull censoring distribution (see helper.R). For SBS/ISBS we use constant interpolation of the censoring survival function S_C(t). For RCLL* we use linear interpolation of S_C(t) to mitigate density estimation issues, i.e. for f_C(t).
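
A minimal sketch of this setup (illustrative only, not the code in helper.R): the censoring survival function can be estimated by flipping the status indicator and fitting a marginal Kaplan-Meier curve with survival::survfit(); constant and linear interpolation then correspond to a step function and a piecewise-linear function through the estimated curve. The times and statuses below are hypothetical.

Code
library(survival)

# Hypothetical observed data: 1 = event, 0 = censored
time   = c(2, 3, 5, 7, 9, 12)
status = c(1, 0, 1, 1, 0, 0)

# Marginal Kaplan-Meier estimate of the censoring distribution S_C(t):
# treat censorings as "events" by flipping the status indicator
km_cens = survfit(Surv(time, 1 - status) ~ 1)

# Constant (step-function) interpolation, as used for SBS/ISBS
SC_const = stepfun(km_cens$time, c(1, km_cens$surv))

# Linear interpolation, as used for RCLL*; its negative slope yields a crude f_C(t)
SC_lin = approxfun(c(0, km_cens$time), c(1, km_cens$surv), method = "linear", rule = 2)

SC_const(6)
SC_lin(6)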

Use properness_test.R to run the experiments. To run for different sample sizes n, use run_tests.sh, changing estimate_cens to TRUE. Merged simulation results for all n are available here.

Load results (same output columns as in the previous section):

Code
res = readRDS("results/res_sims10000_distrs1000_1.rds") |>
  select(!matches("shape|scale|prop_cens|tv_dist")) # remove columns

As before, we compute 95% confidence intervals for the score differences across m = 1000 draws per simulation:

Code
measures = c("SBS_median", "SBS_q10", "SBS_q90", "ISBS", "RCLL", "rRCLL")
n_distrs = 1000 # `m` value from paper's experiment
tcrit = qt(0.975, df = n_distrs - 1) # 95% t-test
threshold = 1e-3

data = measures |>
  lapply(function(m) {
    mean_diff_col = paste0(m, "_diff")
    mean_sd_col = paste0(m, "_sd")
    
    res |>
      select(sim, n, !!mean_diff_col, !!mean_sd_col) |>
      mutate(
        se = .data[[mean_sd_col]] / sqrt(n_distrs),
        CI_lower = .data[[mean_diff_col]] - tcrit * se,
        CI_upper = .data[[mean_diff_col]] + tcrit * se,
        signif_violation = .data[[mean_diff_col]] > threshold & CI_lower > 0
      ) |>
      select(!se) |>
      mutate(metric = !!m) |>
      relocate(metric, .after = 1) |>
      rename(diff = !!mean_diff_col,
             sd   = !!mean_sd_col)
  }) |>
  bind_rows()

data$metric = factor(
  data$metric,
  levels = measures,
  labels = c("SBS (Median)", "SBS (Q10)", "SBS (Q90)", "ISBS", "RCLL", "RCLL*")
)

Lastly, we summarize the significant violations across sample sizes and scoring rules:

Code
all_stats = 
  data |> 
  group_by(n, metric) |> 
  summarize(
    n_violations = sum(signif_violation),
    violation_rate = mean(signif_violation),
    diff_mean = if (any(signif_violation)) mean(diff[signif_violation]) else NA_real_,
    diff_median = if (any(signif_violation)) median(diff[signif_violation]) else NA_real_,
    diff_min = if (any(signif_violation)) min(diff[signif_violation]) else NA_real_,
    diff_max = if (any(signif_violation)) max(diff[signif_violation]) else NA_real_,
    .groups = "drop"
  )

all_stats |>
  arrange(metric) |>
  DT::datatable(
    rownames = FALSE,
    options = list(searching = FALSE), 
    caption = htmltools::tags$caption(
      style = "caption-side: top; text-align: center; font-size:150%",
      "Table D2: Empirical violations of properness using G(t)")
  ) |>
  formatRound(columns = 4:8, digits = 5)
Summary

Estimating S_C(t) via Kaplan-Meier produced nearly identical results to using the true distribution, with no violations for RCLL/RCLL* and only small-sample effects for SBS and ISBS.

Degenerate Model Exploits ISBS Scoring Rule

Note

This section demonstrates how a deliberately simple, uninformative survival model (the degenerate model) can outperform established methods under the Integrated Survival Brier Score (ISBS). The model ignores covariates entirely and outputs the same survival curve for every observation: a step function that drops from 1 to 0 at a fixed quantile of the observed times.

This model is deliberately constructed to expose a key weakness in ISBS: it can be minimized by predictions that offer no individualization or clinical utility. This reinforces our theoretical findings that ISBS is not a proper scoring rule.

Below, we define the degenerate model in mlr3proba, tune it over the quantile cutoff with ISBS, and compare its ISBS score and other evaluation measures to Cox, Kaplan-Meier, and Random Survival Forest (RSF) learners on the survival::rats dataset.

Load libraries:

Code
library(R6)
library(mlr3proba)
library(mlr3extralearners)
library(mlr3tuning)

Define the degenerate model:

Code
LearnerSurvDegenerate = R6Class("LearnerSurvDegenerate",
  inherit = LearnerSurv,
  public = list(
    initialize = function() {
      super$initialize(
        id = "surv.degenerate",
        predict_types = c("distr"),
        feature_types = c("logical", "integer", "numeric", "character", "factor", "ordered"),
        properties = "missings",
        packages = c("survival", "distr6"),
        label = "Degenerate Estimator",
        param_set = ps(
           quantile = p_dbl(0, 1)
        )
      )
    }
  ),

  private = list(
    .train = function(task) {
      list(time = task$truth()[,1L]) # store observed times
    },

    .predict = function(task) {
      quantile_ps = self$param_set$values$quantile
      times = sort(unique(self$model$time))
      surv = matrix(nrow = task$nrow, ncol = length(times))

      q_t = quantile(seq.int(length(times)), quantile_ps)[[1]]
      
      # same S for all test observations, sharp drop from 1 to 0 at q_t
      surv[, 1:floor(q_t)] = 1
      surv[, ceiling(q_t):ncol(surv)] = 0
      colnames(surv) = times
      .surv_return(times = times, surv = surv)
    }
  )
)

We tune the degenerate model to find the quantile at which the predicted survival probability should drop from 1 to 0:

Code
l = LearnerSurvDegenerate$new()
l$param_set$values$quantile = to_tune(0, 1)

# ISBS
m = msr("surv.graf")

# Tune the quantile parameter of the degenerate model
at = auto_tuner(
  tuner = tnr("grid_search", resolution = 20),
  learner = l,
  resampling = rsmp("holdout"),
  measure = m
)

In our benchmark experiment we compare the tuned degenerate model to the Cox PH, Random Survival Forest, and Kaplan-Meier learners using 3-fold outer cross-validation.

Code
# Seed for reproducibility
set.seed(20250418)

# Cox PH, Kaplan-Meier and RSF learners, plus the tuned degenerate model
learners = c(lrns(c("surv.coxph", "surv.kaplan", "surv.rfsrc")), at)

r = rsmp("cv", folds = 3)
bm = benchmark(benchmark_grid(tasks = tsk("rats"), learners = learners, resamplings = r))

To evaluate the benchmark we use Harrell’s C, D-calibration, RCLL, the SBS evaluated at the median observed time, and the ISBS:

Code
q50 = quantile(tsk("rats")$times(), 0.50)

measures = c(msrs(
  c("surv.cindex", "surv.dcalib", "surv.rcll")),
  msr("surv.graf", integrated = FALSE, times = q50, id = "SBS_q0.5"),
  msr("surv.graf", id = "ISBS")
)

Finally, we aggregate and display the scores:

Code
# keep the learner id and the five measure columns
df = bm$aggregate(measures)[, c(4, 7:11)]

df |>
  DT::datatable(
    rownames = FALSE,
    options = list(searching = FALSE, order = list(list(5, "desc")))
  ) |>
  formatRound(columns = c(2, 4, 5, 6), digits = 5) |>
  formatSignif(columns = 3, digits = 3) # scientific notation for surv.dcalib
Summary

The degenerate model performs best with respect to ISBS, but worse with respect to all other measures.

Real-World Benchmark Evaluation

Note

To run the experiment use rrcll_benchmark.R.

We assess the behaviour of several survival evaluation measures (RCLL, RCLL*, ISBS, D-calibration, Harrell’s C-index) on real-world survival datasets using the mlr3proba framework, with 10 repetitions of 5-fold cross-validation on the following learners:

  • Kaplan-Meier (non-parametric)
  • Cox proportional hazards model
  • Random survival forests (RSF)
Code
# (normal) RCLL
rcll = msr("surv.rrcll", id = "RCLL", proper = FALSE)
# (re-weighted) rRCLL/RCLL*
rrcll = msr("surv.rrcll", id = "RRCLL", proper = TRUE)

# get all survival tasks in mlr3proba
keys = as.data.table(mlr_tasks)[task_type == "surv"][["key"]]
tasks = lapply(keys, function(key) {
  tsk(key)
})

set.seed(42)
bm_grid = benchmark_grid(
  tasks = tasks,
  learners = lrns(c("surv.kaplan", "surv.ranger", "surv.coxph")),
  resamplings = rsmp("repeated_cv", folds = 5)
)

bm = benchmark(bm_grid)
res = bm$aggregate(
  measures = c(rcll, rrcll, msr("surv.cindex"), msr("surv.dcalib"), msr("surv.graf"))
)

Below we show the aggregated performance metrics across all tasks and learners:

Code
res = readRDS("results/rrcll_bm.rds")

res |>
  rename(`RCLL*` = RRCLL) |>
  DT::datatable(rownames = FALSE, options = list(searching = FALSE)) |>
  formatRound(columns = 3:7, digits = 3) |>
  formatSignif(columns = 6, digits = 3) # scientific notation for surv.dcalib
Summary

Our experiment demonstrates substantial divergences in model assessment depending on the chosen metric. Although RCLL* is the only proper scoring rule in a theoretical sense, its practical behavior differs meaningfully from that of other commonly used metrics.

For instance, on datasets such as veteran and whas, which exhibit non-proportional hazards, RSF is better suited than CoxPH to modelling non-linear effects. In these cases, RCLL* identifies RSF as superior, whereas RCLL and ISBS still prefer CoxPH, suggesting potential practical limitations of the latter metrics in such settings.

Additionally, RCLL* appears to better capture both discrimination and calibration. In the gbcs dataset, CoxPH achieves the best RCLL, ISBS and C-index scores, yet shows extremely poor calibration (\text{D-calib} \approx 15.7), while RCLL* ranks Kaplan-Meier as the best — possibly reflecting better-calibrated risk estimates. This discrepancy becomes even more pronounced in the lung dataset, where CoxPH again yields high RCLL/C-index/ISBS performance despite severe calibration failure (\text{D-calib} \approx 23.5), whereas RCLL* selects RSF instead.

These findings highlight a critical need for further research: not all metrics labeled as “proper” in theory behave equally in practice.

R session info

Code
utils::sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_DK.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_DK.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_DK.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_DK.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Oslo
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] mlr3tuning_1.2.1             paradox_1.0.1               
 [3] mlr3extralearners_1.0.0-9000 mlr3proba_0.7.4             
 [5] mlr3_0.23.0                  R6_2.5.1                    
 [7] DT_0.33                      tidyr_1.3.1                 
 [9] tibble_3.2.1                 dplyr_1.1.4                 
[11] patchwork_1.3.0              ggplot2_3.5.1               

loaded via a namespace (and not attached):
 [1] gtable_0.3.6          xfun_0.49             bslib_0.8.0          
 [4] visNetwork_2.1.2      htmlwidgets_1.6.4     lattice_0.22-6       
 [7] vctrs_0.6.5           tools_4.4.2           crosstalk_1.2.1      
[10] generics_0.1.3        parallel_4.4.2        fansi_1.0.6          
[13] pkgconfig_2.0.3       Matrix_1.7-1          data.table_1.16.4    
[16] checkmate_2.3.2       RColorBrewer_1.1-3    randomForestSRC_3.3.3
[19] mlr3pipelines_0.7.1   uuid_1.2-1            lifecycle_1.0.4      
[22] set6_0.2.6            compiler_4.4.2        farver_2.1.2         
[25] stringr_1.5.1         munsell_0.5.1         data.tree_1.1.0      
[28] codetools_0.2-20      bbotk_1.5.0           htmltools_0.5.8.1    
[31] sass_0.4.9            yaml_2.3.10           pracma_2.4.4         
[34] pillar_1.9.0          crayon_1.5.3          jquerylib_0.1.4      
[37] cachem_1.1.0          parallelly_1.42.0     tidyselect_1.2.1     
[40] digest_0.6.37         stringi_1.8.4         future_1.34.0        
[43] reshape2_1.4.4        purrr_1.0.2           listenv_0.9.1        
[46] splines_4.4.2         labeling_0.4.3        fastmap_1.2.0        
[49] grid_4.4.2            colorspace_2.1-1      cli_3.6.3            
[52] DiagrammeR_1.0.11     magrittr_2.0.3        survival_3.7-0       
[55] utf8_1.2.4            future.apply_1.11.3   withr_3.0.2          
[58] scales_1.3.0          backports_1.5.0       ooplah_0.2.0         
[61] rmarkdown_2.29        globals_0.16.3        distr6_1.8.4         
[64] evaluate_1.0.3        knitr_1.49            viridisLite_0.4.2    
[67] dictionar6_0.1.3      mlr3misc_0.16.0       rlang_1.1.4          
[70] Rcpp_1.0.13-1         glue_1.8.0            param6_0.2.4         
[73] palmerpenguins_0.1.1  rstudioapi_0.17.1     jsonlite_1.8.9       
[76] lgr_0.4.4             plyr_1.8.9