PISA case study analysis with intsvy

Assessing the effects of gender and country on reading skills; the effects of country and the size of place of residence on science skills; and the effect of SES on the probability of missing school for over 3 months due to problematic behavior. Descriptive statistics, as well as linear and logistic regression analyses are utilized in the tutorial.

Author

Jan Szczypiński

Published

October 2, 2025

1 Introduction

This document demonstrates the preparation and analysis of the PISA 2022 (Programme for International Student Assessment 2022) dataset using the R intsvy package. PISA assesses the achievements of 15-year-old students in mathematics, reading, and science.
The analysis presented here focuses on differences in reading skills by gender and country.
We will also assess the impact of country and the size of the place of residence on science skills.
Lastly, we will evaluate the relationship between socioeconomic status (SES) and missing school for over 3 months due to problematic behavior, such as violence or drug use.

The analysis will be performed using the intsvy package, which is designed for analyzing complex survey data with replication weights.

Note

More about the intsvy package can be found in our tutorial.

2 Setup

First, let’s load the required packages and set up our environment:

library(intsvy) # for complex survey analysis 
library(dplyr) # for data manipulation
library(tidyr) # for data reshaping
library(knitr) # for creating pretty html tables
library(ggplot2) # for data visualization

# Set data path to your data directory
data_path <- "data/PISA/"

3 Data download, loading and preparation

Data Download

The data can be downloaded from the PISA public data repository.
You should download data in .sav (SPSS file format) and save it in a chosen directory.
The select.merge() function from the intsvy package will accept only .sav files.

The variables we will focus on are:

  • ST004D01T - student’s gender
  • CNT- country name
  • ESCS - socioeconomic status score
  • SC001Q01TA - size/type of the place of residence
  • ST261Q02JA - miss school for 3+ months: I was suspended (e.g. for violence, use or sell drugs) [Yes/No]
  • PV1READ to PV10READ - plausible values for reading score
  • PV1SCIE to PV10SCIE - plausible values for science score

We will use the pisa.select.merge() function to read and merge student and school datasets. We will focus on four countries: Morocco, Finland, Singapore, and Poland.

pisa22 <- pisa.select.merge(
  folder = data_path,
  school.file = "CY08MSP_SCH_QQQ.SAV",
  student.file = "CY08MSP_STU_QQQ.SAV",
  student = c("ESCS", "ST004D01T", "ST261Q02JA"),
  school = c("SC001Q01TA"),
  countries = c("Finland", "Poland", "Singapore", "Morocco")
)

We have the dataset ready for analysis. Let’s proceed.

4 Differences in scores in reading by country and gender

In this part, we will compare the means and standard deviations of reading results by country and gender.

Mean and standard deviation of reading by country and gender

# Mean scores in reading by country and gender in PISA
descriptives <- pisa.mean.pv(
  pvlabel = paste0("PV", 1:10, "READ"), 
  by = c("CNT", "ST004D01T"), 
  data = pisa22
)

kable(descriptives, 
  digits = 2, 
  col.names = c("Country", "Gender", "N", "Mean", "Mean s.e.", "SD", "SD s.e."),
  align = c("l", "l", "r", "r", "r", "r", "r")
)
Country Gender N Mean Mean s.e. SD SD s.e.
Finland Female 4995 513.01 2.57 97.40 1.45
Finland Male 5244 468.32 2.77 105.49 1.36
Morocco Female 3401 350.37 3.94 73.63 1.97
Morocco Male 3466 328.65 4.21 76.01 2.15
Poland Female 3009 503.24 3.25 98.09 2.06
Poland Male 3002 474.60 3.20 107.56 2.54
Singapore Female 3248 552.55 2.28 101.74 1.42
Singapore Male 3358 532.95 2.21 108.87 1.85

The intsvy package has built-in plotting functions for descriptive statistics and can be used to visualize the results.
It allows for minor customizations of the plots.

desc_plot <- plot.intsvy.mean(descriptives) 
desc_plot[["labels"]][["colour"]] <- "Gender"
desc_plot[["labels"]][["x"]] <- "Country"
desc_plot

Interpretation

The results for reading are similar in Poland and Finland, yet Polish boys score slightly higher than Finnish boys, while Finnish girls score slightly higher than Polish girls.

Singapore has the highest scores among the three countries, while Morocco has the lowest scores.

5 Do science skills differ depending on the place of residence?

We will answer this question using linear regression, with science as the outcome variable and the place of residence as the predictor. The “village, hamlet or rural area (fewer than 3 000 people)” will be the reference category.

Recode place of residence variable

dplyr:

Apart from base R, we will also use the dplyr package for data manipulation. The dplyr package uses verb functions connected by pipes (%>%). You can read more about the pipe operator here: / Pipe function documentation
The “Pipes” chapter in “R for Data Science”

The place of residence variable (SC001Q01TA) has very long category names.
Let’s add labels to this factor for more readable outputs.

kable(with(pisa22, table(SC001Q01TA)))
SC001Q01TA Freq
A village, hamlet or rural area (fewer than 3 000 people) 2198
A small town (3 000 to about 15 000 people) 3059
A town (15 000 to about 100 000 people) 7538
A city (100 000 to about 1 000 000 people) 8223
A large city (1 000 000 to about 10 000 000 people) 8003
A megacity (with over 10 000 000 people) 380
pisa22[["SC001Q01TA"]] <- factor(pisa22[["SC001Q01TA"]],
  levels = c("A village, hamlet or rural area (fewer than 3 000 people)",
             "A small town (3 000 to about 15 000 people)",
             "A town (15 000 to about 100 000 people)",
             "A city (100 000 to about 1 000 000 people)",
             "A large city (1 000 000 to about 10 000 000 people)",
             "A megacity (with over 10 000 000 people)"),
  labels = c(" Rural <3k",
             " Small town 3k-15k",
             " Town 15k-100k",
             " City 100k-1M",
             " Large city 1M-10M",
             " Megacity >10M")
)

Run linear regression model

Significance of the results

To determine if the predictor is significant, we should look at the t-values in the regression output.
A t-value greater than 1.96 (in absolute value) indicate that the predictor is statistically significant at the 0.05 significance level.

Alternatively, we can calculate p-value from the t-value using the following formula:
p-value = 2 * (1 - pt(abs(t-value), df)) where df is the number of replicate weights - number of parameters to estimate.

# The impact of place of residence on science skills
city_lm <- pisa.reg.pv(
  pvlabel = paste0("PV", 1:10, "SCIE"),
  x = c("SC001Q01TA"), 
  data = pisa22
)

# set dfs for p value calculation
# 80 replicate weights - 1 (intercept) - 5 levels of the factor (6-1 (reference))
dfs <- 80 - 1 - (length(levels(pisa22[["SC001Q01TA"]])) - 1)

city_lm_tab <- city_lm[["reg"]]
city_lm_tab[["p-value"]] <- 2 * (1 - pt(abs(city_lm_tab[["t value"]]), dfs))
city_lm_tab[["p-value"]] <- ifelse(city_lm_tab[["p-value"]] < 0.001,
                                  "<0.001", 
                                  round(city_lm_tab[["p-value"]], 3)
 )
rownames(city_lm_tab) <- c("Intercept", levels(pisa22[["SC001Q01TA"]])[-1], "R²")

kable(city_lm_tab, 
      digits = 2, 
      col.names = c("Variable", "Estimate", "Std. Error", "t-value", "p-value"),
      align = c("l", "r", "r", "r", "r")
      )
Variable Estimate Std. Error t-value p-value
Intercept 366.81 5.78 63.47 <0.001
Small town 3k-15k 58.93 10.39 5.67 <0.001
Town 15k-100k 84.18 8.27 10.19 <0.001
City 100k-1M 84.70 9.42 8.99 <0.001
Large city 1M-10M 86.85 11.04 7.87 <0.001
Megacity >10M 15.90 19.37 0.82 0.414
0.07 0.01 5.36 <0.001

intsvy allows for the plotting of some basic regression coefficients.
However, to plot a regression with a factor predictor, with several levels we have to use another package, such as ggplot2.

First we need to calculate CIs and add a column with color depending on the direction of the effect.

city_lm_plot <- city_lm_tab %>%
   mutate(
    conf.low = Estimate - 1.96 * `Std. Error`,
    conf.high = Estimate + 1.96 * `Std. Error`,
    color = ifelse(Estimate < 0, "#377EB8", "#E41A1C"),
    city = factor(rownames(city_lm_tab), levels = rev(levels(pisa22[["SC001Q01TA"]])[-1]))
  ) %>%
    filter(city != "Intercept" & city != "R²")

city_lm_plot %>%
  ggplot(aes(Estimate, city)) +
  geom_point(aes(colour = color), size = 3) +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high, colour = color), linewidth = 1.5, height = 0.2) + 
  geom_vline(xintercept = 0, lty = 2) +
  labs(
    x = "Effect on science results vs rural areas (<3k residents)",
    y = "Place of residence"
      ) +
  scale_colour_manual(values = unique(as.character(city_lm_plot[["color"]]))) +
  theme_minimal(base_size = 10) +
  theme(legend.position = "none")

Interpretation

As the size of the place of residence increases, so do the science scores at a level of significance, with the exception of cities over 10 000 000 people.
For megacities (>10M residents), the scores do not significantly differ from those of students living in rural areas (< 3000 residents).
Nevertheless, there are only 380 students from megacities in the sample, which may affect the results.

6 SES as a predictor of being absent from school for 3+ months due to problematic behavior

In this section, we will use logistic regression to examine whether socioeconomic status (SES) predicts the likelihood of missing school for 3 or more months due to problematic behavior.

Cross table for country and the variable of interest

First we should look at the cross table for the country and the variable of interest (ST261Q02JA).

cross_tab <- with(pisa22, table(CNT, ST261Q02JA))
kable(cross_tab, digits = 2,
  col.names = c("Country", "Yes, n", "No, n")
)
Country Yes, n No, n
Finland 34 250
Morocco 116 812
Poland 29 264
Singapore 18 272
prop_tab <- prop.table(cross_tab, margin = 1) * 100
kable(prop_tab, digits = 2,
  col.names = c("Country", "Yes, %", "No, %")
  )
Country Yes, % No, %
Finland 11.97 88.03
Morocco 12.50 87.50
Poland 9.90 90.10
Singapore 6.21 93.79

The proportion of students who missed school for 3 or more months due to problematic behavior is relatively low across all countries, but there are noticeable differences.
Singapore has the lowest proportion (6.2%), while in Morocco the proportion is the highest (12.5%).
Nevertheless, we want to examine if SES predicts problematic behavior in these 4 countries.

Running the logistic regression model

# Probability of missing school due to problematic behavior by SES
escs_log <- pisa.log(
  y = "ST261Q02JA",
  x = "ESCS",
  by = "CNT",
  data = pisa22
)

# set dfs for p value calculation
# 80 replicate weights - 1 (intercept) - 1 (predictor)
dfs_log <- 80 - 1 - 1

escs_all_log <- do.call(rbind, list(escs_log[["Finland"]][["reg"]],
                                    escs_log[["Morocco"]][["reg"]],
                                    escs_log[["Poland"]][["reg"]],
                                    escs_log[["Singapore"]][["reg"]]
                                    )
                        )

escs_all_log[["p-value"]] <- 2 * (1 - pt(abs(escs_all_log[["t value"]]), dfs_log))
escs_all_log[["p-value"]] <- ifelse(escs_all_log[["p-value"]] < 0.001,
                                  "<0.001", 
                                  round(escs_all_log[["p-value"]], 3)
 )

rownames(escs_all_log) <- paste0(rep(c("Finland", "Morocco", "Poland", "Singapore"), each = 2), c(" Intercept", ""))

kable(escs_all_log, 
      digits = 2, 
      col.names = c("Variable","Estimate", "Std. Error", "t-value", "Odds Ratio", "lower CI95", "upper CI95", "p-value"),
      row.names = TRUE,
      align = c("l", "r", "r", "r", "r", "r", "r", "r")
    ) 
Variable Estimate Std. Error t-value Odds Ratio lower CI95 upper CI95 p-value
Finland Intercept 2.11 0.26 8.11 8.28 4.97 13.80 <0.001
Finland -0.49 0.21 -2.32 0.61 0.40 0.93 0.023
Morocco Intercept 1.27 0.16 7.91 3.56 2.60 4.88 <0.001
Morocco -0.45 0.08 -5.47 0.64 0.54 0.75 <0.001
Poland Intercept 2.24 0.25 8.87 9.36 5.71 15.35 <0.001
Poland -0.05 0.40 -0.12 0.95 0.44 2.07 0.904
Singapore Intercept 2.98 0.28 10.49 19.72 11.29 34.42 <0.001
Singapore 0.56 0.27 2.07 1.74 1.03 2.95 0.041

As with the linear regression plot, we also have to use ggplot2 to prepare the plot for logistic regression.

escs_log_plot <- escs_all_log %>%
   mutate(
    conf.low = `Coef.` - 1.96 * `Std. Error`,
    conf.high = `Coef.` + 1.96 * `Std. Error`,
    color = ifelse(`Coef.` < 0, "#377EB8", "#E41A1C"),
    country = rownames(escs_all_log)
  ) %>%
    filter(!(endsWith(country, "Intercept")))

escs_log_plot %>%
  ggplot(aes(`Coef.`, country)) +
  geom_point(aes(colour = color), size = 3) +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high, colour = color), linewidth = 1.5, height = 0.2) + 
  geom_vline(xintercept = 0, lty = 2) +
  labs(
    x = "Effect of SES on missing school for >3 months due to problematic behavior",
    y = "Country"
      ) +
  scale_colour_manual(values = unique(as.character(escs_log_plot[["color"]]))) +
  theme_minimal(base_size = 10) +
  theme(legend.position = "none")

Interpretation

Interestingly, SES significantly reduces the probability of missing school for 3+ months due to problematic behavior in Morocco and Finland.
Yet, we observe an inverse relationship in Singapore - SES increases the probability of missing school.
Finally, SES is not a significant predictor in Poland of missing school for 3+ months due to problematic behavior.

7 Summary

This document demonstrates how to prepare and analyze the PISA 2022 study using the R intsvy package. We covered data loading and merging, and conducting descriptive statistics and regression analyses while accounting for the complex survey design. The results provide insights into differences in reading skills by country and gender. Next, we explore the relationship between the size of the place of residence and science skills. Lastly, we examine the impact of socio-economic status on the probability of missing school due to problematic behavior.