Supplementary Material

Supporting information for the paper ‘Examining properness in the external validation of survival models with squared and logarithmic losses’
Published: May 19, 2025

High variability for generated survival and censoring distributions

To illustrate the diversity of the Weibull distributions used in our simulation design, we plot the density f(t) and survival S(t) functions for 10 randomly sampled shape (\alpha) and scale (\sigma) parameter pairs. Across simulations, these distributions serve as the true survival, predicted survival, or censoring distributions. The wide parameter range induces substantial variability, ensuring a rich family of distributions and increasing the sensitivity of our evaluation to violations of properness. By construction, all data-generating mechanisms assume Y is independent of C, satisfying the random censoring assumption.
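
For reference, in the shape/scale parameterization used by dweibull() and pweibull() below, a Weibull(\alpha, \sigma) distribution has density and survival functions

f(t) = \frac{\alpha}{\sigma} \left(\frac{t}{\sigma}\right)^{\alpha - 1} \exp\!\left[-\left(\frac{t}{\sigma}\right)^{\alpha}\right], \qquad S(t) = \exp\!\left[-\left(\frac{t}{\sigma}\right)^{\alpha}\right], \qquad t > 0.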

Code
library(ggplot2)
library(patchwork)
library(dplyr)

set.seed(20250408)

shape = runif(10, 0.5, 5)
scale = runif(10, 0.5, 5)

# Density values over a grid of 41 time points for each (shape, scale) pair;
# melt() stacks the 41 x 10 matrix column-wise, so each group spans 41 consecutive rows
dw = mapply(function(shape, scale) dweibull(seq.int(1, 5, 0.1), shape, scale), shape, scale) |>
  reshape2::melt() |>
  mutate(x = rep(seq.int(1, 5, 0.1), 10),
         group = rep(sprintf("(%.2f,%.2f)", shape, scale), each = 41)) |>
  select(x, group, value)
# Same for the survival function S(t) = P(T > t)
pw = mapply(function(shape, scale) pweibull(seq.int(1, 5, 0.1), shape, scale, lower.tail = FALSE), shape, scale) |>
  reshape2::melt() |>
  mutate(x = rep(seq.int(1, 5, 0.1), 10),
         group = rep(sprintf("(%.2f,%.2f)", shape, scale), each = 41)) |>
  select(x, group, value)

(ggplot(dw, aes(x = x, y = value, group = group, color = group)) + ylab("f(x)")) /
(ggplot(pw, aes(x = x, y = value, group = group, color = group)) + ylab("S(x)")) +
  plot_layout(guides='collect') &
  geom_line() &
  theme_classic() &
  guides(color = guide_legend(title='(α,σ)')) &
  scale_color_viridis_d(option = "turbo")

Weibull density (top) and survival (bottom) curves for randomly generated shape (α) and scale (σ) parameters. Each curve corresponds to a different parameter pair, illustrating the broad variability across simulated distributions.
Code
# ggsave("weibull_plots.png", device = png, dpi=600, width =7, height = 4)

Table D1

Note

This section supports the “Simulation Experiments” section of the paper.

Use properness_test.R to run the experiments. To run for different sample sizes n, use run_tests.sh. Merged simulation results for all n are available here.

Load libraries and explore data:

Code
library(dplyr)
library(tibble)
library(tidyr)
library(DT)

res = readRDS("results/res_sims10000_distrs1000_0.rds")

res[sample(1:nrow(res), 10), ] |>
  DT::datatable(rownames = FALSE, options = list(searching = FALSE)) |>
  formatRound(columns = 3:22, digits = c(rep(3,8), rep(5,12)))

Each row corresponds to one simulation, using m = 1000 draws from Weibull survival distributions at a fixed sample size n \in \{10, \dots, 1000\}.

The output columns are:

  • sim: simulation index (k in the paper)
  • n: sample size
  • {surv|pred|cens}_{scale|shape}: the scale and shape parameters of the Weibull distributions for the true, predicted and censoring times, respectively
  • prop_cens: proportion of censoring in each simulation
  • tv_dist: the total variation distance between the true and the predicted Weibull distribution (closer to 0 means more similar distributions; a numerical sketch follows this list)
  • {score}_diff, {score}_sd: the mean and standard deviation of the score difference D between true and predicted distributions across draws
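
The total variation distance between two continuous distributions equals half the L1 distance between their densities. Below is a minimal numerical sketch for two Weibull distributions; it is illustrative only (not necessarily the implementation used in properness_test.R) and the parameter values are hypothetical.

Code
# Total variation distance between two Weibull densities:
# TV = 0.5 * integral over t >= 0 of |f1(t) - f2(t)| dt
tv_weibull = function(shape1, scale1, shape2, scale2) {
  integrand = function(t) abs(dweibull(t, shape1, scale1) - dweibull(t, shape2, scale2))
  0.5 * integrate(integrand, lower = 0, upper = Inf)$value
}

tv_weibull(1.5, 2.0, 1.5, 2.0) # identical distributions -> 0
tv_weibull(1.5, 1.0, 4.0, 4.0) # nearly non-overlapping distributions -> TV close to 1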

Scoring rules analyzed:

  • The SBS at the 10th, 50th and 90th percentiles of observed times
  • The ISBS, using 50 equidistant points between the 5th and 80th percentile of observed times
  • The RCLL and its re-weighted rRCLL (RCLL* in the paper)
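
For orientation, a commonly used inverse-probability-of-censoring-weighted (IPCW) form of the SBS at a fixed time \tau, with \hat{G} the censoring survival function used for weighting, is

\text{SBS}(\tau) = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\hat{S}_i(\tau)^2 \, \mathbb{1}(t_i \le \tau, \delta_i = 1)}{\hat{G}(t_i)} + \frac{(1 - \hat{S}_i(\tau))^2 \, \mathbb{1}(t_i > \tau)}{\hat{G}(\tau)} \right],

and the RCLL contribution of observation i, with predicted density \hat{f}_i and survival \hat{S}_i, is

\text{RCLL}_i = -\log\!\left[ \hat{f}_i(t_i)^{\delta_i} \, \hat{S}_i(t_i)^{1 - \delta_i} \right].

The ISBS integrates \text{SBS}(\tau) over the evaluation time points. Weighting conventions vary slightly across implementations; the exact definitions (including the re-weighting that yields RCLL*) are given in the main paper.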

We compute 95% confidence intervals (CIs) for the mean score difference D using a t-distribution. A violation is flagged as statistically significant if both of the following hold:

  • The mean score difference exceeds a threshold: \bar{D} > 0.001
  • The CI excludes zero (i.e., \text{CI}_\text{lower} > 0)
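
Concretely, with \bar{D} the mean and s_D the standard deviation of the score differences over the m = 1000 draws of a simulation, the interval and criterion implemented in the code below are

\text{CI} = \bar{D} \pm t_{0.975,\, m-1} \cdot \frac{s_D}{\sqrt{m}}, \qquad \text{significant violation} \iff \bar{D} > 0.001 \ \text{and} \ \text{CI}_\text{lower} > 0.
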
Code
res = res |>
  select(!matches("shape|scale|prop_cens|tv_dist"))

measures = c("SBS_median", "SBS_q10", "SBS_q90", "ISBS", "RCLL", "rRCLL")
n_distrs = 1000 # `m` value from paper's experiment
tcrit = qt(0.975, df = n_distrs - 1) # 95% t-test
threshold = 1e-3

data = measures |>
  lapply(function(m) {
    mean_diff_col = paste0(m, "_diff")
    mean_sd_col = paste0(m, "_sd")
    
    res |>
      select(sim, n, !!mean_diff_col, !!mean_sd_col) |>
      mutate(
        se = .data[[mean_sd_col]] / sqrt(n_distrs),
        CI_lower = .data[[mean_diff_col]] - tcrit * se,
        CI_upper = .data[[mean_diff_col]] + tcrit * se,
        signif_violation = .data[[mean_diff_col]] > threshold & CI_lower > 0
      ) |>
      select(!se) |>
      mutate(metric = !!m) |>
      relocate(metric, .after = 1) |>
      rename(diff = !!mean_diff_col,
             sd   = !!mean_sd_col)
  }) |>
  bind_rows()

data$metric = factor(
  data$metric,
  levels = measures,
  labels = c("SBS (Median)", "SBS (Q10)", "SBS (Q90)", "ISBS", "RCLL", "RCLL*")
)

For example, the violation results for the first simulation and n = 250 across all scoring rules are:

Code
data |> 
  filter(sim == 1, n == 250) |>
  DT::datatable(rownames = FALSE, options = list(searching = FALSE)) |>
  formatRound(columns = 4:7, digits = 5)

We summarize violations across simulations by computing, for each scoring rule and sample size:

  • Number of significant violations
  • Violation rate
  • Average, median, min and max score difference among simulations where violations occurred
Code
all_stats = 
  data |> 
  group_by(n, metric) |> 
  summarize(
    n_violations = sum(signif_violation),
    violation_rate = mean(signif_violation),
    diff_mean = if (any(signif_violation)) mean(diff[signif_violation]) else NA_real_,
    diff_median = if (any(signif_violation)) median(diff[signif_violation]) else NA_real_,
    diff_min = if (any(signif_violation)) min(diff[signif_violation]) else NA_real_,
    diff_max = if (any(signif_violation)) max(diff[signif_violation]) else NA_real_,
    .groups = "drop"
  )

all_stats |>
  arrange(metric) |>
  DT::datatable(
    rownames = FALSE,
    options = list(searching = FALSE), 
    caption = htmltools::tags$caption(
      style = "caption-side: top; text-align: center; font-size:150%",
      "Table D1: Empirical violations of properness")
  ) |>
  formatRound(columns = 4:8, digits = 5)
Summary
  • RCLL and RCLL* show no violations across any simulation or sample size.
  • SBS shows time- and sample-size-dependent violations, mostly at smaller n, especially for early evaluation times (\tau = q_{0.1}). All differences are small (typically < 0.01).
  • ISBS shows minor violations only at n = 10; none for n > 50.

Table D2

Note

The only difference from the previous experiment is that S_C(t) is now estimated using the marginal Kaplan-Meier model via survival::survfit() instead of the true Weibull censoring distribution (see helper.R). For SBS/ISBS we use constant interpolation of the censoring survival function S_C(t). For RCLL* we use linear interpolation of S_C(t) to mitigate density estimation issues, i.e. for f_C(t).
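
A minimal sketch of this setup (illustrative only, not the code in helper.R): the censoring survival function can be estimated by flipping the status indicator and fitting a marginal Kaplan-Meier curve with survival::survfit(); constant and linear interpolation then correspond to a step function and a piecewise-linear function through the estimated curve. The times and statuses below are hypothetical.

Code
library(survival)

# Hypothetical observed data: 1 = event, 0 = censored
time   = c(2, 3, 5, 7, 9, 12)
status = c(1, 0, 1, 1, 0, 0)

# Marginal Kaplan-Meier estimate of the censoring distribution S_C(t):
# treat censorings as "events" by flipping the status indicator
km_cens = survfit(Surv(time, 1 - status) ~ 1)

# Constant (step-function) interpolation, as used for SBS/ISBS
SC_const = stepfun(km_cens$time, c(1, km_cens$surv))

# Linear interpolation, as used for RCLL*; its negative slope yields a crude f_C(t)
SC_lin = approxfun(c(0, km_cens$time), c(1, km_cens$surv), method = "linear", rule = 2)

SC_const(6)
SC_lin(6)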

Use properness_test.R to run the experiments. To run for different sample sizes n, use run_tests.sh, changing estimate_cens to TRUE. Merged simulation results for all n are available here.

Load results (same output columns as in the previous section):

Code
res = readRDS("results/res_sims10000_distrs1000_1.rds") |>
  select(!matches("shape|scale|prop_cens|tv_dist")) # remove columns

As before, we compute 95% confidence intervals for the score differences across m = 1000 draws per simulation:

Code
measures = c("SBS_median", "SBS_q10", "SBS_q90", "ISBS", "RCLL", "rRCLL")
n_distrs = 1000 # `m` value from paper's experiment
tcrit = qt(0.975, df = n_distrs - 1) # 95% t-test
threshold = 1e-3

data = measures |>
  lapply(function(m) {
    mean_diff_col = paste0(m, "_diff")
    mean_sd_col = paste0(m, "_sd")
    
    res |>
      select(sim, n, !!mean_diff_col, !!mean_sd_col) |>
      mutate(
        se = .data[[mean_sd_col]] / sqrt(n_distrs),
        CI_lower = .data[[mean_diff_col]] - tcrit * se,
        CI_upper = .data[[mean_diff_col]] + tcrit * se,
        signif_violation = .data[[mean_diff_col]] > threshold & CI_lower > 0
      ) |>
      select(!se) |>
      mutate(metric = !!m) |>
      relocate(metric, .after = 1) |>
      rename(diff = !!mean_diff_col,
             sd   = !!mean_sd_col)
  }) |>
  bind_rows()

data$metric = factor(
  data$metric,
  levels = measures,
  labels = c("SBS (Median)", "SBS (Q10)", "SBS (Q90)", "ISBS", "RCLL", "RCLL*")
)

Lastly, we summarize the significant violations across sample sizes and scoring rules:

Code
all_stats = 
  data |> 
  group_by(n, metric) |> 
  summarize(
    n_violations = sum(signif_violation),
    violation_rate = mean(signif_violation),
    diff_mean = if (any(signif_violation)) mean(diff[signif_violation]) else NA_real_,
    diff_median = if (any(signif_violation)) median(diff[signif_violation]) else NA_real_,
    diff_min = if (any(signif_violation)) min(diff[signif_violation]) else NA_real_,
    diff_max = if (any(signif_violation)) max(diff[signif_violation]) else NA_real_,
    .groups = "drop"
  )

all_stats |>
  arrange(metric) |>
  DT::datatable(
    rownames = FALSE,
    options = list(searching = FALSE), 
    caption = htmltools::tags$caption(
      style = "caption-side: top; text-align: center; font-size:150%",
      "Table D2: Empirical violations of properness using G(t)")
  ) |>
  formatRound(columns = 4:8, digits = 5)
Summary

Estimating S_C(t) via Kaplan-Meier produced nearly identical results to using the true distribution, with no violations for RCLL/RCLL* and only small-sample effects for SBS and ISBS.

Degenerate Model Exploits ISBS Scoring Rule

Note

This section demonstrates how a deliberately simple, uninformative survival model (the degenerate model) can outperform established methods under the Integrated Survival Brier Score (ISBS). The model ignores covariates entirely and outputs the same survival curve for every observation: a step function that drops from 1 to 0 at a fixed quantile of the observed times.

This model is deliberately constructed to expose a key weakness in ISBS: it can be minimized by predictions that offer no individualization or clinical utility. This reinforces our theoretical findings that ISBS is not a proper scoring rule.

Below, we define the degenerate model in mlr3proba, tune it over the quantile cutoff with ISBS, and compare its ISBS score and other evaluation measures to Cox, Kaplan-Meier, and Random Survival Forest (RSF) learners on the survival::rats dataset.

Load libraries:

Code
library(R6)
library(mlr3proba)
library(mlr3extralearners)
library(mlr3tuning)

Define the degenerate model:

Code
LearnerSurvDegenerate = R6Class("LearnerSurvDegenerate",
  inherit = LearnerSurv,
  public = list(
    initialize = function() {
      super$initialize(
        id = "surv.degenerate",
        predict_types = c("distr"),
        feature_types = c("logical", "integer", "numeric", "character", "factor", "ordered"),
        properties = "missings",
        packages = c("survival", "distr6"),
        label = "Degenerate Estimator",
        param_set = ps(
           quantile = p_dbl(0, 1)
        )
      )
    }
  ),

  private = list(
    .train = function(task) {
      list(time = task$truth()[,1L]) # store observed times
    },

    .predict = function(task) {
      quantile_ps = self$param_set$values$quantile
      times = sort(unique(self$model$time))
      surv = matrix(nrow = task$nrow, ncol = length(times))

      q_t = quantile(seq.int(length(times)), quantile_ps)[[1]]
      
      # same S for all test observations, sharp drop from 1 to 0 at q_t
      surv[, 1:floor(q_t)] = 1
      surv[, ceiling(q_t):ncol(surv)] = 0
      colnames(surv) = times
      .surv_return(times = times, surv = surv)
    }
  )
)

We tune the degenerate model to find the quantile at which the predicted survival probability should drop from 1 to 0:

Code
l = LearnerSurvDegenerate$new()
l$param_set$values$quantile = to_tune(0, 1)

# ISBS
m = msr("surv.graf")

# Tune the quantile parameter of the degenerate model
at = auto_tuner(
  tuner = tnr("grid_search", resolution = 20),
  learner = l,
  resampling = rsmp("holdout"),
  measure = m
)

In our benchmark experiment we compare the tuned degenerate model to the Cox PH, Random Survival Forest, and Kaplan-Meier learners using 3-fold outer cross-validation.

Code
# Seed for reproducibility
set.seed(20250418)

# Cox PH, Kaplan-Meier and RSF learners, plus the tuned degenerate model
learners = c(lrns(c("surv.coxph", "surv.kaplan", "surv.rfsrc")), at)

r = rsmp("cv", folds = 3)
bm = benchmark(benchmark_grid(tasks = tsk("rats"), learners = learners, resamplings = r))

To evaluate the benchmark we use Harrell’s C, D-calibration, RCLL, the SBS evaluated at the median observed time, and the ISBS:

Code
q50 = quantile(tsk("rats")$times(), 0.50)

measures = c(msrs(
  c("surv.cindex", "surv.dcalib", "surv.rcll")),
  msr("surv.graf", integrated = FALSE, times = q50, id = "SBS_q0.5"),
  msr("surv.graf", id = "ISBS")
)

Finally, we aggregate and display the scores:

Code
# keep the learner id and the five measure columns
df = bm$aggregate(measures)[, c(4, 7:11)]

df |>
  DT::datatable(
    rownames = FALSE,
    options = list(searching = FALSE, order = list(list(5, "desc")))
  ) |>
  formatRound(columns = c(2, 4, 5, 6), digits = 5) |>
  formatSignif(columns = 3, digits = 3) # scientific notation for surv.dcalib
Summary

The degenerate model performs best with respect to ISBS, but worse with respect to all other measures.

Real-World Benchmark Evaluation

Note

To run the experiment use rrcll_benchmark.R.

We assess the behaviour of several survival evaluation measures (RCLL, RCLL*, ISBS, D-calibration, Harrell’s C-index) on real-world survival datasets using the mlr3proba framework, with 10 repetitions of 5-fold cross-validation on the following learners:

  • Kaplan-Meier (non-parametric)
  • Cox proportional hazards model
  • Random survival forests (RSF)
Code
# (normal) RCLL
rcll = msr("surv.rrcll", id = "RCLL", proper = FALSE)
# (re-weighted) rRCLL/RCLL*
rrcll = msr("surv.rrcll", id = "RRCLL", proper = TRUE)

# get all survival tasks in mlr3proba
keys = as.data.table(mlr_tasks)[task_type == "surv"][["key"]]
tasks = lapply(keys, function(key) {
  tsk(key)
})

set.seed(42)
bm_grid = benchmark_grid(
  tasks = tasks,
  learners = lrns(c("surv.kaplan", "surv.ranger", "surv.coxph")),
  resamplings = rsmp("repeated_cv", folds = 5)
)

bm = benchmark(bm_grid)
res = bm$aggregate(
  measures = c(rcll, rrcll, msr("surv.cindex"), msr("surv.dcalib"), msr("surv.graf"))
)

Below we show the aggregated performance metrics across all tasks and learners:

Code
res = readRDS("results/rrcll_bm.rds")

res |>
  rename(`RCLL*` = RRCLL) |>
  DT::datatable(rownames = FALSE, options = list(searching = FALSE)) |>
  formatRound(columns = 3:7, digits = 3) |>
  formatSignif(columns = 6, digits = 3) # scientific notation for surv.dcalib
Summary

Our experiment demonstrates substantial divergences in model assessment depending on the chosen metric. Although RCLL* is the only proper scoring rule in a theoretical sense, its practical behavior differs meaningfully from that of other commonly used metrics.

For instance, on datasets such as veteran and whas, which exhibit non-proportional hazards, RSF is better suited than CoxPH to modelling non-linear effects. In these cases, RCLL* identifies RSF as superior, whereas RCLL and ISBS still prefer CoxPH, suggesting potential practical limitations of the latter metrics in such settings.

Additionally, RCLL* appears to better capture both discrimination and calibration. In the gbcs dataset, CoxPH achieves the best RCLL, ISBS and C-index scores, yet shows extremely poor calibration (\text{D-calib} \approx 15.7), while RCLL* ranks Kaplan-Meier as the best — possibly reflecting better-calibrated risk estimates. This discrepancy becomes even more pronounced in the lung dataset, where CoxPH again yields high RCLL/C-index/ISBS performance despite severe calibration failure (\text{D-calib} \approx 23.5), whereas RCLL* selects RSF instead.

These findings highlight a critical need for further research: not all metrics labeled as “proper” in theory behave equally in practice.

R session info

Code
utils::sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_DK.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_DK.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_DK.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_DK.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Oslo
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] mlr3tuning_1.2.1             paradox_1.0.1               
 [3] mlr3extralearners_1.0.0-9000 mlr3proba_0.7.4             
 [5] mlr3_0.23.0                  R6_2.5.1                    
 [7] DT_0.33                      tidyr_1.3.1                 
 [9] tibble_3.2.1                 dplyr_1.1.4                 
[11] patchwork_1.3.0              ggplot2_3.5.1               

loaded via a namespace (and not attached):
 [1] gtable_0.3.6          xfun_0.49             bslib_0.8.0          
 [4] visNetwork_2.1.2      htmlwidgets_1.6.4     lattice_0.22-6       
 [7] vctrs_0.6.5           tools_4.4.2           crosstalk_1.2.1      
[10] generics_0.1.3        parallel_4.4.2        fansi_1.0.6          
[13] pkgconfig_2.0.3       Matrix_1.7-1          data.table_1.16.4    
[16] checkmate_2.3.2       RColorBrewer_1.1-3    randomForestSRC_3.3.3
[19] mlr3pipelines_0.7.1   uuid_1.2-1            lifecycle_1.0.4      
[22] set6_0.2.6            compiler_4.4.2        farver_2.1.2         
[25] stringr_1.5.1         munsell_0.5.1         data.tree_1.1.0      
[28] codetools_0.2-20      bbotk_1.5.0           htmltools_0.5.8.1    
[31] sass_0.4.9            yaml_2.3.10           pracma_2.4.4         
[34] pillar_1.9.0          crayon_1.5.3          jquerylib_0.1.4      
[37] cachem_1.1.0          parallelly_1.42.0     tidyselect_1.2.1     
[40] digest_0.6.37         stringi_1.8.4         future_1.34.0        
[43] reshape2_1.4.4        purrr_1.0.2           listenv_0.9.1        
[46] splines_4.4.2         labeling_0.4.3        fastmap_1.2.0        
[49] grid_4.4.2            colorspace_2.1-1      cli_3.6.3            
[52] DiagrammeR_1.0.11     magrittr_2.0.3        survival_3.7-0       
[55] utf8_1.2.4            future.apply_1.11.3   withr_3.0.2          
[58] scales_1.3.0          backports_1.5.0       ooplah_0.2.0         
[61] rmarkdown_2.29        globals_0.16.3        distr6_1.8.4         
[64] evaluate_1.0.3        knitr_1.49            viridisLite_0.4.2    
[67] dictionar6_0.1.3      mlr3misc_0.16.0       rlang_1.1.4          
[70] Rcpp_1.0.13-1         glue_1.8.0            param6_0.2.4         
[73] palmerpenguins_0.1.1  rstudioapi_0.17.1     jsonlite_1.8.9       
[76] lgr_0.4.4             plyr_1.8.9