
Create counting-process dataset for time-dependent Cox model
Source: R/cox_analysis.R
cox_create_data.Rd
Constructs a long-format dataset suitable for survival::coxph()
using counting-process notation Surv(tstart, tstop, event).
The function prepares a follow-up cohort starting at baseline and encodes
a time-dependent exposure diagnosis, so that its effect on the hazard of a
response diagnosis can be estimated.
Arguments
- dpop
A data frame containing population-level variables. Must include ID, DATE_BIRTH, DATE_MIGRATION, DATE_DEATH, and the diagnosis dates exp.DATE and resp.DATE.
- data_dates
A data frame containing baseline dates. Must include ID and the baseline date variable vpvmbl.
- data_socioeconomic
A data frame containing socioeconomic and baseline questionnaire variables. Must include ID and covariates such as edu, bmi, bmi_cat1, and bmi_cat2.
- reference_values
A named list defining reference levels for factor variables (e.g., education or BMI categories). Passed internally to healthpopR:::.relevel_by_reference().
- censoring_date
Administrative censoring date. Default is as.Date("2024-12-31").
Value
A long-format data frame with the following variables:
- ID
Individual identifier
- tstart
Start of interval (days since baseline)
- tstop
End of interval (days since baseline)
- event
Event indicator at tstop (1 = response diagnosis, 0 = censored or no event in this interval)
- exposure_td
Time-dependent exposure indicator (0/1)
- age_bs
Age at baseline (years)
- edu
Education level (factor)
- bmi
Baseline BMI
The output is ready for use in:
coxph(Surv(tstart, tstop, event) ~ exposure_td + ..., data = output)
Details
Baseline covariates (e.g., age, education, BMI) are treated as fixed. Exposure is handled as a time-dependent variable that switches from 0 to 1 at the exposure diagnosis date, if it occurs before the end of follow-up.
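As an illustration of this layout, the rows for one exposed individual might look like the sketch below. The values are made up (exposure diagnosed 400 days after baseline, response diagnosed at day 1000); only the column names come from the Value section above.

```r
# Hypothetical individual: exposure at day 400, response diagnosis at day 1000.
# The fixed baseline covariate (age_bs) is repeated on every row, while
# exposure_td switches from 0 to 1 at the exposure date.
person <- data.frame(
  ID          = c(1L, 1L),
  tstart      = c(0, 400),
  tstop       = c(400, 1000),
  event       = c(0L, 1L),   # the event is recorded only on the last interval
  exposure_td = c(0L, 1L),
  age_bs      = c(52.3, 52.3)
)

# Person-time attributed to each exposure state:
# 400 unexposed days and 600 exposed days.
person_days <- tapply(person$tstop - person$tstart, person$exposure_td, sum)
person_days
```

The time before the exposure diagnosis counts as unexposed follow-up, which is the reason for splitting the record instead of coding exposure as a fixed baseline covariate.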
The function performs the following steps:
1. Filters individuals with a valid baseline date.
2. Computes age at baseline.
3. Recodes exposure and response diagnoses occurring before baseline to the baseline date.
4. Defines follow-up end as the earliest of migration, death, administrative censoring, and response diagnosis.
5. Computes follow-up times (in days) from baseline.
6. Splits follow-up into one or two intervals depending on whether exposure occurs before the end of follow-up.
Each row in the output represents a time interval during which exposure status is constant.
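The steps above can be sketched in base R on toy data. This is not the package's implementation: it is a minimal illustration that assumes all inputs are already merged into one data frame, that age is computed as days divided by 365.25, and that an exposure recoded to the baseline date yields a single exposed interval. The helper split_one() and the toy values are hypothetical.

```r
# Toy input (hypothetical values); column names follow the Arguments section.
d <- data.frame(
  ID             = 1:5,
  DATE_BIRTH     = as.Date(c("1960-01-01", "1955-06-15", "1970-03-10",
                             "1965-09-30", "1950-02-20")),
  DATE_MIGRATION = as.Date(c(NA, NA, "2015-01-01", NA, NA)),
  DATE_DEATH     = as.Date(c(NA, "2018-05-01", NA, NA, NA)),
  exp.DATE       = as.Date(c("2012-01-01", NA, NA, NA, "2005-01-01")),
  resp.DATE      = as.Date(c("2016-01-01", NA, NA, NA, "2020-01-01")),
  vpvmbl         = as.Date(c("2010-01-01", "2010-01-01", "2010-01-01",
                             NA, "2010-01-01"))
)
censoring_date <- as.Date("2024-12-31")

# 1. Keep individuals with a valid baseline date (drops ID 4).
d <- d[!is.na(d$vpvmbl), ]

# 2. Age at baseline in years (assumption: 365.25 days per year).
d$age_bs <- as.numeric(d$vpvmbl - d$DATE_BIRTH) / 365.25

# 3. Recode diagnoses dated before baseline to the baseline date.
pre_exp  <- !is.na(d$exp.DATE)  & d$exp.DATE  < d$vpvmbl
pre_resp <- !is.na(d$resp.DATE) & d$resp.DATE < d$vpvmbl
d$exp.DATE[pre_exp]   <- d$vpvmbl[pre_exp]
d$resp.DATE[pre_resp] <- d$vpvmbl[pre_resp]

# 4. End of follow-up: earliest of migration, death, response, censoring.
d$end   <- pmin(d$DATE_MIGRATION, d$DATE_DEATH, d$resp.DATE,
                censoring_date, na.rm = TRUE)
d$event <- as.integer(!is.na(d$resp.DATE) & d$resp.DATE == d$end)

# 5. Follow-up times in days since baseline (t_exp is NA if never exposed).
d$t_end <- as.numeric(d$end - d$vpvmbl)
d$t_exp <- as.numeric(d$exp.DATE - d$vpvmbl)

# 6. Split each person's follow-up into one or two intervals.
split_one <- function(id, t_end, t_exp, event) {
  if (is.na(t_exp) || t_exp >= t_end) {
    # No exposure during follow-up: one unexposed interval.
    data.frame(ID = id, tstart = 0, tstop = t_end,
               event = event, exposure_td = 0L)
  } else if (t_exp <= 0) {
    # Exposed from baseline (assumption): one exposed interval.
    data.frame(ID = id, tstart = 0, tstop = t_end,
               event = event, exposure_td = 1L)
  } else {
    # Exposure during follow-up: unexposed interval, then exposed interval.
    data.frame(ID = id, tstart = c(0, t_exp), tstop = c(t_exp, t_end),
               event = c(0L, event), exposure_td = c(0L, 1L))
  }
}
long <- do.call(rbind, Map(split_one, d$ID, d$t_end, d$t_exp, d$event))
rownames(long) <- NULL

# Attach the fixed baseline covariate to every interval.
long$age_bs <- d$age_bs[match(long$ID, d$ID)]
long
```

In this toy example only ID 1 contributes two rows, because it is the only individual whose exposure falls strictly between baseline and the end of follow-up. The resulting columns match those expected by Surv(tstart, tstop, event).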