MATERIAL UNDER DEVELOPMENT

Introduction

SPIAT (Spatial Image Analysis of Tissues) is an R package with a suite of data processing, quality control, visualization, data handling and data analysis tools (Yang et al. 2020). SPIAT is directly compatible with Opal multiplex immunohistochemistry images analysed through the HALO and InForm analysis software, but its intuitive implementation allows use with a diversity of platforms.

The Opal multiplex immunohistochemistry staining protocol enables 6-8 tissue markers to be used simultaneously on a single slide. It is suitable for use with formalin fixed paraffin embedded (FFPE) tissue sections, so it is also of relevance for clinical practice. The fluorescence of each marker is measured through imaging, and each cell is assigned an X,Y coordinate of its location. In effect this provides single-cell resolution. The fluorescence intensity of individual markers are then combined to identify the phenotype of the cells in the tissue.

SPIAT includes novel algorithms for the identification of cell clusters, cell margins and cell gradients, the calculation of neighbourhood proportions, and algorithms for the prediction of cell phenotypes in tissue images. SPIAT also includes speedy implementations of the calculation of cell distances and detection of cell communities. An overview of the functions available is shown in the figure below.

Setting up the data

First we load the SPIAT library.

library(SPIAT)

We can use format_image_to_sce to read in and format raw data for SPIAT.format_image_to_sce creates a SingleCellExperiment object. It requires at least 4 arguments:

format: “INFORM,” “HALO,” “Visium,” “CODEX” “cellprolifer, or”general"
image: path to the raw InForm or HALO image data file
markers: names of markers used in the OPAL staining. These must be in the same order as the marker columns in the input file, and for InForm must match the marker name used in the input file. One of the markers must be DAPI.
locations: locations of the markers, either Nucleus, Cytoplasm or Membrane. These must be in the order of the markers. The locations are used to auto-detect the intensity (and dye) columns. Alternatively, the column names can be specified, see below for more information.

Simple import

The simplest way to read in both InForm and HALO files, is to specify the locations of the markers as shown below. This will automatically detect the appropriate intensity (and dye) columns to use. Here we use a tiny InForm file that’s part of the SPIAT package for raw_inform_data but you would replace system.file with /path/to/your/input/file. The tiny file is just for demonstrating importing a raw data file, later in the Example data section we load a larger preformatted dataset. For information on the InForm and HALO formats see below. For “Visium,” “CODEX” or “cellprolifer see the help (?format_image_to_sce). Format”general" allows you to input a matrix of intensities (intensity_matrix), and a vector of phenotypes, cell phenotypes in the same order in which they appear in intensity_matrix. They should be of the form “CD3,CD8,” as opposed to cell names (e.g. “cytotoxic T cells”), as it does some matching with the marker names. If no phenotypes are available, then a vector of NA can be used as input. For general format, you also need to provide a vector with the X coordinates of the cells and one with the Y (coord_x and coord_y). The cells must be in the same order as in the intensity_matrix. If you have Xmin and Xmax columns (and Ymin, Ymax) you can calculate the average and input that.

raw_inform_data <- system.file("extdata", "tiny_inform.txt.gz", package = "SPIAT")
markers <- c("DAPI", "CD3", "PDL-1", "CD4", "CD8", "AMACR")
locations <- c("Nucleus", "Cytoplasm", "Membrane", "Cytoplasm", "Cytoplasm", "Cytoplasm")
formatted_image <- format_image_to_sce(
                          format="INFORM",
                          path=raw_inform_data,
                          markers=markers,
                          locations=locations)

InForm input

Intensity columns

If for some reason you don’t want to specify the locations, you can specify the intensity columns with intensity_columns_interest as shown below. This will give the same result as specifying the locations above.

raw_inform_data <- system.file("extdata", "tiny_inform.txt.gz", package = "SPIAT")
markers <- c("DAPI", "CD3", "PDL-1", "CD4", "CD8", "AMACR")
intensity_columns_interest <- c(
  "Nucleus DAPI (DAPI) Mean (Normalized Counts, Total Weighting)",
  "Cytoplasm CD3 (Opal 520) Mean (Normalized Counts, Total Weighting)", 
  "Membrane PDL-1 (Opal 540) Mean (Normalized Counts, Total Weighting)",
  "Cytoplasm CD4 (Opal 620) Mean (Normalized Counts, Total Weighting)",
  "Cytoplasm CD8 (Opal 650) Mean (Normalized Counts, Total Weighting)", 
  "Cytoplasm AMACR (Opal 690) Mean (Normalized Counts, Total Weighting)"
  )
formatted_image <- format_image_to_sce(
                          format="INFORM",
                          path=raw_inform_data,
                          markers=markers,
                    intensity_columns_interest=intensity_columns_interest)

X and Y location columns

format_image_to_sce uses the Cell X Position and Cell Y Position columns in the InForm raw data.

Phenotype column

format_image_to_sce uses the Phenotype column in the InForm raw data. The phenotype of a cell can be a single marker, for example, “CD3” (T cell), or a combination of markers, such as “CD3,CD4” (helper T cell). “OTHER” means a cell was positive for DAPI but no other marker. The phenotypes must be based on the markers (e.g. CD3,CD4), rather than names of cells (e.g. cytotoxic T cells). The names of the cells can be added later using the define_celltypes function.

Cell properties columns

The following cell properties columns are required to be present in the InForm input file: Entire Cell Area (pixels), Nucleus Area (pixels), Nucleus Compactness, Nucleus Axis Ratio, and Entire Cell Axis Ratio.

HALO input

Intensity columns

For HALO you can also just specify the locations to auto-detect the columns as shown above. But if you want to specify the columns instead, you can do so with intensity_columns_interest, as shown in the example below. Note that then you also must specify the dye_columns_interest.

Dye columns

format_image_to_sce requires the HALO dye columns to assign the phenotype. If not using locations to auto-detect, these can be specified with dye_columns_interest. These columns have the marker status (1 or 0 for whether the cell is positive or negative for the marker) and these are used to create the phenotype column. For example, if HALO has assigned a cell a marker status of 1 for CD3 and 1 for CD4, SPIAT will give it the Phenotype “CD3,CD4.” Cells that have a marker status of 1 for DAPI and no other marker, are given the phenotype “OTHER.”

raw_halo_data <- "path/to/halo/file"
markers <- c("DAPI", "CD3", "PDL-1", "CD4", "CD8", "AMACR")
intensity_columns_interest <- c("Dye 1 Nucleus Intensity",
                                "Dye 2 Cytoplasm Intensity",
                                "Dye 3 Membrane Intensity",
                                "Dye 4 Cytoplasm Intensity",
                                "Dye 5 Cytoplasm Intensity",
                                "Dye 6 Cytoplasm Intensity")
dye_columns_interest <- c("Dye 1 Positive Nucleus",
                          "Dye 2 Positive Cytoplasm",
                          "Dye 3 Positive Membrane",
                          "Dye 4 Positive Cytoplasm",
                          "Dye 5 Positive Cytoplasm",
                          "Dye 6 Positive Cytoplasm")
formatted_image <- format_image_to_sce(
                          format="HALO",
                          path=raw_halo_data,
                          markers=markers,
                          intensity_columns_interest=intensity_columns_interest,
                          dye_columns_interest=dye_columns_interest
                          )

X and Y location columns

format_image_to_sce takes the average of the HALO X min and X max columns for each cell to create the Cell.X.Position column. It takes the average of the Y min and Y max to create the Cell.Y.Position column.

Phenotype column

format_image_to_sce creates the Phenotype column from the dye columns as described under the Dye columns section above.

Cell properties columns

The following cell properties columns are required to be present in the HALO input file: Cell Area, Nucleus Area, Cytoplasm Area.

Splitting images

In the case of large images, or images where there are two independent tissue sections, it is recommended to split images into sections defined by the user. This can be performed with image_splitter after format_image_to_sce. It can also plot the different combinations of markers within the image segments.

split_image <- image_splitter(formatted_image, number_of_splits=3, plot = FALSE)

Example data

In this vignette we will use an InForm data file that’s already been formatted for SPIAT with format_image_to_sce, which we can load with data.

data("formatted_image")

This is SingleCellExperiment format.

class(formatted_image)

## [1] "SingleCellExperiment"
## attr(,"package")
## [1] "SingleCellExperiment"

This example data has 6 markers and 8419 cells.

dim(formatted_image)

## [1]    6 8419

assay stores the intensity level of every marker (rows) for every cell (columns).

# take a look at first 5 columns
assay(formatted_image)[, 1:5]

##       Cell_1 Cell_2 Cell_3 Cell_4 Cell_5
## DAPI  14.557 17.588 21.262 18.951 18.631
## CD3    0.169  0.229  0.206  0.226  0.212
## PDL-1  0.274  0.117  0.103  0.126  0.091
## CD4    1.655  1.188  0.924  1.367  1.266
## CD8    0.096  0.087  0.053  0.037  0.046
## AMACR  0.054  2.074  1.521  2.462  1.968

colData stores the phenotype, x and y coordinates, and the cell properties.

# take a look at first 5 rows
colData(formatted_image)[1:5, ]

## DataFrame with 5 rows and 8 columns
##          Phenotype Cell.X.Position Cell.Y.Position Cell.Area Nucleus.Area
##        <character>       <numeric>       <numeric> <numeric>    <numeric>
## Cell_1       OTHER              82              30       477          160
## Cell_2       AMACR             171              22       464          177
## Cell_3       AMACR             184              38       553          212
## Cell_4       AMACR             201              52       462          239
## Cell_5       AMACR             219              63       876          451
##        Nucleus.Compactness Nucleus.Axis.Ratio Cell.Axis.Ratio
##                  <numeric>          <numeric>       <numeric>
## Cell_1                0.52               2.05            1.48
## Cell_2                0.54               1.84            1.00
## Cell_3                0.51               1.92            1.21
## Cell_4                0.53               1.47            1.09
## Cell_5                0.53               1.19            1.34

We can check what phenotypes are present with print_column, which will print any specified column in the data.

print_column(formatted_image, column = "Phenotype")

## [1] "OTHER"   "AMACR"   "CD3,CD4" "CD3,CD8" "CD3"     "PDL-1"

The phenotypes in this example data can be interpreted as follows:

AMACR = prostate cancer cells
CD3,CD4 = helper T cells
CD3,CD8 = cytotoxic T cells
CD3 = T cells that are not helper or cytotoxic
PDL-1 = immune checkpoint cells
OTHER = other cell types

AMACR identifies which cells in the image are prostate cancer cells. CD3,CD4, CD3,CD8, CD3 which types of T cells (immune cells) are present, and PDL-1 is used to help determine if the cancer may benefit from immunotherapy treatment.

Predicting cell phenotypes

SPIAT can predict cell phenotypes using the marker intensity levels with predict_phenotypes. This can be used to check the phenotypes that have been assigned by InForm and HALO. It can also potentially be used to automate the manual phenotypying performed with InForm/HALO. predict_phenotypes produces a density plot that shows the cutoff for calling a cell positive for a marker. It also prints to the console the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). It returns a table containing the phenotypes predicted by SPIAT and the actual phenotypes from InForm/HALO. This can help show if we need to go back to Inform or HALO to refine the phenotyping. If you don’t have a tumour_marker you can use a marker that is present in many cells, that has at least one peak more than the main marker peak, a background peak, required by the algorithm. Of note, this algorithm does not take into account cell shape or size, so if these are required for phenotyping, manual phenotyping with HALO or InForm is recommended.

predicted_image <- predict_phenotypes(formatted_image,
                                      thresholds = NULL,
                                      tumour_marker = "AMACR",
                                      baseline_markers = c("CD3", "CD4", "CD8"),
                                      reference_phenotypes = FALSE)

## [1] "DAPI  threshold intensity:  31.0120578435792"

## [1] "CD3  threshold intensity:  0.585067686263291"

## Warning: Removed 6 rows containing non-finite values (stat_density).

## [1] "PDL-1  threshold intensity:  0.41961026013577"

## [1] "CD4  threshold intensity:  2.94518565168331"

## Warning: Removed 6 rows containing non-finite values (stat_density).

## [1] "CD8  threshold intensity:  0.490401084591211"

## Warning: Removed 6 rows containing non-finite values (stat_density).

## [1] "AMACR  threshold intensity:  0.560351632207539"

## Warning: Removed 6 rows containing non-finite values (stat_density).

We can use marker_prediction_plot to plot the predicted cell phenotypes and the ones obtained using HALO or InForm, for comparison.

#marker_prediction_plot(predicted_image, marker="CD3")

The plot shows CD3+ cells in the tissue. On the left are the CD3+ cells defined by InForm and on the right are the CD3+ cells predicted using SPIAT.

Specifying cell types

Specifying each Phenotype can be tedious if the Phenotype consists of multiple markers. To make it easier, SPIAT define_celltypes can be used to add a column indicating the cell type. By default the column is called Cell.Type. The difference between Phenotype and Cell.Type is that phenotype represents all the markers present on the cell but Cell.Type is the identity of the cell.

formatted_image <- define_celltypes(formatted_image, phenotypes = c("AMACR", "CD3,CD4", "CD3,CD8", "CD3", "PDL-1"), names = c("Tumour", "T_helper", "T_cyto", "T_other", "PDL1"))

## [1] "AMACR"   "CD3,CD4" "CD3,CD8" "CD3"     "PDL-1"  
## [1] "Tumour"   "T_helper" "T_cyto"   "T_other"  "PDL1"

Quality control

Boxplots of marker intensity

Incorrect or unreliable phenotyping of cells may also be identified through comparing the marker level for cells labelled positive and negative for a marker. Cells positive for a marker should have high levels of the marker. An unclear separation of marker intensities between positive and negative cells would suggest phenotypes have not been accurately assigned. We can use marker_intensity_boxplot to produce a boxplot for cells phenotyped as being positive or negative for a marker.

marker_intensity_boxplot(formatted_image, "CD3")

Note that some negative cells will have high marker intensity, and vice versa. This is because HALO and InForm use machine learning to determine positive cells, and not a strict threshold, and also take into account properties such as cell shape, nucleus size etc. .

Scatter plots of marker level

Uneven marker staining or high background intensity can be identified with plot_cell_marker_levels. This produces a scatter plot of the intensity of a marker in each cell. Cells that were not phenotyped as being positive for the particular marker are excluded.

plot_cell_marker_levels(formatted_image, "CD3")

Heatmaps of marker level

For large images, there is also the option of ‘blurring’ the image, where the image is split into multiple small areas, and marker intensities are averaged within each. The image is blurred based on the num_splits parameter.

plot_marker_level_heatmap(formatted_image, num_splits = 100, "AMACR")

## Identifying incorrect phenotypes

We may see incorrect phenotypes present in the input data. For example, cells might be incorrectly assigned a phenotype such as “AMACR,CD3,” both a tumour cell (AMACR) and an immune cell phenotype (CD3), which is biologically implausible. Incorrect cell phenotypes may be present due to low cell segmentation quality, antibody ‘bleeding’ from one cell to another or inadequate marker thresholding.

We can use marker_permutation to help identify if incorrect cell phenotypes are present. This permutes the marker labels of cells to create a null distribution, and then calculates the empirical p-value of whether an image is enriched or depleted in a particular combination of markers. A low P value for Depletion.p suggests the phenotype is unreliable. However, it is not absolute and it is recommended to review the results.

# permute marker labels of cells
sig <- marker_permutation(formatted_image, num_iter = 100)

# sort by Observed_cell_number
sig.sorted <- sig[order(-sig$Observed_cell_number), ]

# take a look
head(sig.sorted)

##         Observed_cell_number Percentage_of_iterations_where_present
## AMACR                   4446                                    100
## CD3,CD4                  512                                    100
## CD3,CD8                  138                                     99
## CD3                       19                                    100
## PDL-1                      4                                     89
## CD4                        0                                    100
##         Average_bootstrap_cell_number Enrichment.p Depletion.p
## AMACR                         3777.30         0.01        1.00
## CD3,CD4                         19.39         0.01        1.00
## CD3,CD8                          4.71         0.01        1.00
## CD3                            292.05         1.00        0.01
## PDL-1                            1.60         0.01        1.00
## CD4                            215.85         1.00        0.01

Removing incorrect phenotypes

If you identify incorrect phenotypes or have any you want to exclude you can do that with select_phenotypes.

data_subset <- select_phenotypes(formatted_image, keep=TRUE,
                                     phenotypes = c("AMACR",
                                                    "CD3,CD8",
                                                    "PDL-1"))
print_column(data_subset, column = "Phenotype")

## [1] "AMACR"   "CD3,CD8" "PDL-1"

In this vignette we will work with all the original phenotypes present in formatted_image.

Cell percentages

We can obtain the number and proportion of each cell type with calculate_cell_proportions. We can use reference_celltypes to specify cell types to use as the reference. For example, “Total” will calculate the proportion of each cell type against all cells. We can exclude any cell types that are not of interest e.g. “Undefined” with celltypes_to_exclude.

p_cells <- calculate_cell_proportions(formatted_image, reference_celltypes=c("Total"), column="Cell.Type")
p_cells

##   Cell_type Number_of_celltype Proportion Percentage  Proportion_name Reference
## 6 Undefined               3299        Inf        Inf Undefined/Custom     Total
## 5    Tumour               4446        Inf        Inf    Tumour/Custom     Total
## 4   T_other                 19        Inf        Inf   T_other/Custom     Total
## 3  T_helper                513        Inf        Inf  T_helper/Custom     Total
## 2    T_cyto                138        Inf        Inf    T_cyto/Custom     Total
## 1      PDL1                  4        Inf        Inf      PDL1/Custom     Total

To visualize the cell type proportions as barplots we can use plot_cell_percentages.

plot_cell_percentages(p_cells)

Visualizing tissues

In addition to the marker level tissue plots for QC, SPIAT has other methods for visualizing markers and phenotypes in tissues.

3D surface plot

We can visualize a selected marker in 3D with marker_surface_plot. The image is blurred based on the num_splits parameter.

marker_surface_plot(formatted_image, num_splits=15, marker="CD3")

3D stacked surface plot

To visualize multiple markers in 3D in a single plot we can use marker_surface_plot_stack. This shows normalized intensity level of specified markers and enables the identification of co-occurring and mutually exclusive markers.

marker_surface_plot_stack(formatted_image, num_splits=15, markers=c("AMACR", "CD3"))

The stacked surface plots of the AMACR (tumour cell) and CD3 (T cell) markers in this prostate tissue shows how AMACR and CD3 are mutually exclusive as the peaks and valleys are opposite.

Categorical dot plot

We can see the location of all cell types (or any column in the data) in the tissue with plot_cell_basic. Each dot in the plot corresponds to a cell and cells are coloured by cell type. Any cell types present in the data but not in the cell types of interest will be put in the category “OTHER” and coloured lightgrey.

my_colors <- c("darkgrey", "skyBlue", "darkcyan", "white", "orange")
  
plot_cell_basic(formatted_image, c("Tumour", "T_helper", "T_cyto", "T_other", "PDL1"), my_colors, "Cell.Type")

Cell distances

We can calculate the distances between two cell types (cell type A and cell type B) by identifying the closest cell of cell type B to each of the cells of cell type A with calculate_all_distances_between_cell_types.

distances <- calculate_all_distances_between_cell_types(formatted_image, cell_types_of_interest = c("T_helper", "T_cyto"), column="Cell.Type")

This creates a distribution of minimum distances between phenotypes A and B, which can be visualized as a violin plot with plot_cell_distances_violin. Visualization of this distribution often reveals whether pairs of cells are evenly spaced across the image, or whether there are clusters of pairs of phenotypes.

plot_cell_distances_violin(distances)

From this plot we can see that T_helper cells seem to be more closely interacting with each other, as there are shorter distances between those cells, than between T_cyto cells or between T_helper and T_cyto cells.

We can also calculate summary statistics for the distances between each combination of cell types, the mean, median and standard deviation, with calculate_summary_distances_between_cell_types.

summary_distances <- calculate_summary_distances_between_cell_types(formatted_image, column = "Cell.Type")

## [1] "All markers are used in pair-wise distance calculation: "
## [1] "Undefined" "Tumour"    "T_helper"  "T_cyto"    "T_other"   "PDL1"

These cell distances can then be visualized as a heatmap with plot_distance_heatmap.

plot_distance_heatmap(summary_distances)

This plot shows that Tumour cells are interacting most closely with T_helper, followed by T_cyto, while PDL1 cells are more distant.

Cell mixing

With SPIAT we can quantify how much two cell types are mixing, and thus potentially interacting, through a variety of ways.

Cells within radius

To determine whether two cell types are interacting with another, or if there is repulsion between them, SPIAT uses the concept of gradients. The assumption is that, if marker A intensity is high close to cells of marker B, and this value decreases as the distance from B increases, it suggests the cells are close and interacting. Conversely, if we see that marker A intensity is low when close to cells of marker B but increases with distance, it suggests there is repulsion between the two cell types.

plot_average_intensity calculates the average intensity of a target marker for a number of user-supplied radii values. It plots the intensity level at each specified radius as a line graph. The radius unit is pixels.

plot_average_intensity(formatted_image, reference_marker="CD8", target_marker="CD4", c(30, 35, 40, 45, 50, 75, 100))

This plot shows that high levels of CD8 were observed in cells near CD4 cells and these levels decreased at larger radii. This suggests CD4 and CD8 cells may be closely interacting in this tissue.

We can use average_marker_intensity_within_radius to calculate the average intensity of the target_marker within a radius from the cells positive for the reference marker. Note that it pools all cells with the target marker that are within the specific radius of any reference cell. Results represent the average intensities within a radius, but do not correspond to metrics for each cell. This function can also be used to help identify radii values for plot_average_intensity.

average_marker_intensity_within_radius(formatted_image,
                                        reference_marker ="CD8",
                                        target_marker = "CD4",
                                        radius=30)

## [1] 10.13677

We can calculate the average percentage of cells of one cell type (target) within a radius of another cell type (reference).

average_percentage_of_cells_within_radius(formatted_image, reference_phenotypes = "T_helper", target_phenotypes = "T_cyto", radius=100, 
column="Cell.Type")

## [1] 5.613999

Mixing score

This score was originally defined as the number of immune-tumor interactions divided by the number of immune-immune interactions (Keren et al. 2018). SPIAT generalizes this method to allow calculation of any two cell types. mixing_score_summary returns the mixing score between a reference cell type and a target cell type. This mixing score is defined as the number of target-reference interactions/number of reference-reference interactions within a specified radius. The higher the score the greater the mixing of the two cell types. The normalised score is normalised for the number of target and reference cells in the image.

mixing_score_summary(formatted_image, reference_marker = "T_helper", target_marker = "T_cyto", radius=100, column="Cell.Type")

##   Reference Target Number_of_reference_cells Number_of_target_cells
## 2  T_helper T_cyto                       513                    138
##   Reference_target_interaction Reference_reference_interaction Mixing_score
## 2                         1646                            8082    0.2036625
##   Normalised_mixing_score
## 2               0.7570931

K-Cross function

K-cross function is similar to mixing score, but calculates the score across all possible distances. It also compares the behaviour of the image with an image that has randomly distributed points of the same number. There are 4 patterns that can be distinguished from K-cross function, as illustrated in the plots below (generated here).

Here, the black line represents our image, the red line represents a randomly distibuted point pattern.

1st plot: The red line and black line are close to each other, it means the two types of points are randomly independently distributed.
2nd plot: The red line is under the black line, and in the middle the difference is dramatic, it means the points are mixed and split into clusters.
3rd plot: With the increase of radius, the black line diverges further from the red line, it means that there is one mixed cluster of two types of points.
4th plot: The red line is above the black line, it means that the two types of points form separated clusters.

We can calculate the cross K-function using SPIAT.

df_cross <- bivariate_dependence(formatted_image, method = "Kcross", phenotypes = c("T_helper","T_cyto"), column ="Cell.Type")

The plot tells us that the T_cyto and T_helper form one mixed cluster, as it looks like the 3rd plot above.

We can calculate the area under the curve (AUC) of the cross K-function. This tells us the two types of cells are:

separate clusters - the more negative it is
independent - the closer it is to zero
mixed - the more positive it is

AUC_of_cross_function(df_cross)

## [1] 146677210

The AUC score is very positive so this tells us that the two types of cells form mixed clusters, similar to the plot.

Entropy

We can calculate the entropy of the cells.

entropy(formatted_image, types_of_interest = c("T_helper","T_cyto"), column = "Cell.Type")

## [1] 0.2243436

Gradients

We can use the gradient function to calculate metrics (entropy, mixing score, percentage of cells within radius, marker intensity) for a range of radii. We specify the metric we want to calculate with FUN=.

gradient_positions <- c(30, 50, 100)
gradient_entropy <- gradient(formatted_image, radii=gradient_positions, FUN=entropy, types_of_interest = c("T_helper","T_cyto"), column = "Cell.Type")

Tissue structures

If the image contains tumour, it may be a clear tumour structure or unclear (dispersed). The function average_nearest_neighbor_index can inform us whether there is a clear tumour structure (clustered) or unclear (dispersed), along with a P value for the estimate.

average_nearest_neighbor_index(formatted_image, reference_cell="Tumour", column="Cell.Type")

## $pattern
## [1] "Clustered"
## 
## $`p-value`
## [1] 1.132853e-19

When there is a clear tumour region in the image, we define four tumour structures: “Inside,” “Outside,” “Infiltration front” and “Invasive front,” as shown in the figure below. Why are we interested in defining such structure? For some types of cancer, there can be small clusters of tumour cells, like globes. We may be interested in not only whether immune cells infiltrate into the whole tumour region but also whether the immune cells infiltrate into the small globes. The identification of the tumour structure can be achieved by calculating the distance from each cell to the tumour boundary, after the identification of bordering cells.

We can identify borders with identify_bordering_cells. This uses the alpha hull method (Pateiro-Lopez, Rodriguez-Casal, and. 2019) via the hull2spatial package. Here we use tumour cells (AMACR) as the reference to identify the bordering cells but any cell type can be used.

formatted_border <- identify_bordering_cells(formatted_image, reference_cell = "Tumour", column="Cell.Type")

## [1] "The alpha of Polygon is: 85.9125"

We then calculate the distance of each cell to the borders.

formatted_distance <- calculate_min_distance_to_borders(formatted_border)

We define the area within a 5-cell distance to the boundary as “infiltration front” (inside) and “invasive front” (outside). The areas located more than a 5 cell-distance away from the boundary are defined as “in” (inside) and “out” (outside).

names_of_immune_cells <- c("T_helper", "T_cyto")

formatted_structure <- define_structure(formatted_distance, names_of_immune_cells = names_of_immune_cells, 
n_margin_layers = 5)

## [1] "How many layers of cells in the external/internal margin: 5"
## [1] "The width of internal/external margin: 79.6210533312122"

categories <- print_column(formatted_structure, "Structure")

## [1] "External.margin"        "Border"                 "Internal.margin"       
## [4] "Stromal.immune"         "Outside"                "External.margin.immune"
## [7] "Inside"                 "Internal.margin.immune" "Infiltrated.immune"

We can plot and colour these structure categories.

my_colors <- c("yellow", "darkblue", "darkgreen", "black", "purple", "orange", "darkgreen", "pink", "green")

plot_cell_basic(formatted_structure, categories, 
                  my_colors,
                  column = "Structure")

We can calculate the proportion of immune cells in the tumour structure.

immune_proportions <- calculate_proportions_of_immune_cells_in_structure(formatted_structure, immune_cells = names_of_immune_cells, column="Cell.Type")

immune_proportions

##   Cell.Type2     Relative_to P.Infiltrated.Immune P.Internal.Margin.Immune
## 1   T_helper    To_all_cells          0.010325346              0.016913319
## 2     T_cyto    To_all_cells          0.003117086              0.007188161
## 3   T_helper To_immune_cells          0.768115942              0.701754386
## 4     T_cyto To_immune_cells          0.231884058              0.298245614
##   P.External.Margin.Immune P.Stromal.Immune
## 1               0.12205567       0.10551724
## 2               0.04068522       0.02310345
## 3               0.75000000       0.82037534
## 4               0.25000000       0.17962466

We can calculate summaries of the distances for immune cells in the tumour structure.

immune_distances <- calculate_summary_distances_of_immune_cells_to_borders(formatted_structure, names_of_immune_cells, "Cell.Type")

immune_distances

##    Cell.Type       Area    Min_d    Max_d    Mean_d  Median_d  St.dev_d
## 1 All_immune Tumor_area 13.00000 336.7432  98.97223  82.82800  69.53922
## 2 All_immune     Stroma 14.31782 558.1684 154.49817 131.04579 102.04069
## 3   T_helper Tumor_area 13.00000 336.7432 103.04387  84.20214  72.87082
## 4   T_helper     Stroma 15.13275 558.1684 156.44987 135.19738 102.17890
## 5     T_cyto Tumor_area 17.69181 288.8910  87.49760  73.23933  58.64129
## 6     T_cyto     Stroma 14.31782 468.3289 146.69138 121.16518 101.59620

Clusters and communities

We can determine if there are aggregates (or a lack of aggregates) of a particular phenotype or combination of cell types. SPIAT can be used to identify two types of cell aggregates, clusters and communities.

Clusters are cell aggregates composed of non-tumour cells, often by immune cells. For the detection of clusters, identify_cell_clusters only considers cell types of interest defined by the user. Euclidean distances between cells are calculated, and pairs of cells with a distance less than a threshold are considered to be ‘interacting,’ with the rest being ‘non-interacting.’ Hierarchical clustering is then used to separate the clusters. We need to specify a radius. The radius corresponds to how lenient we are in defining clusters. A bigger radius would mean we are more likely to merge individual clusters. The radius is chosen based on what looks ‘reasonable.’ Cells not assigned to clusters (“free” cells) are assigned to Cluster_NA in the output table.

clusters <- identify_cell_clusters(formatted_image, cell_types_of_interest = c("T_helper", "T_cyto"), radius = 100, column = "Cell.Type")

This plot shows clusters of CD3,CD4 and CD3,CD8 cells. Each number and colour corresponds to a distinct cluster. Black cells correspond to ‘free,’ un-clustered cells.

Users are recommended to test out different thresholds and then visualize the clustering results. But SPIAT also has average_minimum_distance, which calculates the average minimum distance between all cells in an image, and can be used as a starting point.

average_minimum_distance(formatted_image)

## [1] 15.92421

Communities correspond to micro-niches or micro-ecosystems of cells that are located close to each other (Jackson et al. 2020). The main distinction between clusters and communities is that the algorithm for the detection of communities used by identify_cell_communities does not take into account cell phenotype. Therefore, communities often consist of a combination of different cell types, including tumour cells. dbscan is used as the clustering algorithm to detect communities (Hahsler, Piekenbrock, and Doran 2019). Cells without a phenotype are excluded.

communities <- identify_cell_communities(formatted_image, radius=100)

This plot shows communities of cells in the tissue. Each number and colour corresponds to a distinct community.

We can visualize the cell composition of clusters or communities. This allows us to determine whether there are regions where cells of different types are interacting, or whether there is little mixing between celltypes. To do this, we can use composition_of_clusters_and_communities to obtain the percentages of cells with a specific marker within each cluster and the number of cells in the cluster.

clusters_vis <- composition_of_clusters_and_communities(clusters, type_of_aggregate = "Cluster", column = "Cell.Type")
clusters_vis <- clusters_vis[clusters_vis$Total_number_of_cells >=5,]

Then we can use plot_composition_heatmap to produce a heatmap showing the marker percentages within each cluster.

plot_composition_heatmap(clusters_vis, type_of_aggregate = "Cluster", column="Cell.Type")

This plot shows that most clusters are composed of a mixture of T_helper and T_cyto cells and some are more dominated by T_helper than others.

We can visualize the composition of the communities in a similiar way.

communities_vis <- composition_of_clusters_and_communities(communities, type_of_aggregate="Community", column = "Cell.Type")
plot_composition_heatmap(communities_vis, type_of_aggregate = "Community", column="Cell.Type")

This plot shows that several communities detected have a combination of tumour (AMACR) and immune cells while some are dominated by tumour or immune cells.

Reproducibility

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] SPIAT_0.4                   SingleCellExperiment_1.14.1
##  [3] SummarizedExperiment_1.22.0 Biobase_2.52.0             
##  [5] GenomicRanges_1.44.0        GenomeInfoDb_1.28.4        
##  [7] IRanges_2.26.0              S4Vectors_0.30.1           
##  [9] BiocGenerics_0.38.0         MatrixGenerics_1.4.3       
## [11] matrixStats_0.61.0          BiocStyle_2.20.2           
## 
## loaded via a namespace (and not attached):
##   [1] scattermore_0.7        R.methodsS3_1.8.1      SeuratObject_4.0.2    
##   [4] ragg_1.1.3             tidyr_1.1.3            ggplot2_3.3.5         
##   [7] bit64_4.0.5            knitr_1.34             irlba_2.3.3           
##  [10] DelayedArray_0.18.0    R.utils_2.11.0         data.table_1.14.0     
##  [13] rpart_4.1-15           RCurl_1.98-1.5         doParallel_1.0.16     
##  [16] generics_0.1.0         cowplot_1.1.1          RANN_2.6.1            
##  [19] future_1.22.1          bit_4.0.4              tzdb_0.1.2            
##  [22] Rphenograph_0.99.1     spatstat.data_2.1-0    lubridate_1.7.10      
##  [25] httpuv_1.6.3           xfun_0.26              hms_1.1.1             
##  [28] jquerylib_0.1.4        evaluate_0.14          promises_1.2.0.1      
##  [31] fansi_0.5.0            readxl_1.3.1           igraph_1.2.6          
##  [34] htmlwidgets_1.5.4      xROI_0.9.20            apcluster_1.4.8       
##  [37] spatstat.geom_2.2-2    purrr_0.3.4            ellipsis_0.3.2        
##  [40] crosstalk_1.1.1        dplyr_1.0.7            ggpubr_0.4.0          
##  [43] backports_1.2.1        bookdown_0.24          moments_0.14          
##  [46] deldir_0.2-10          vctrs_0.3.8            Cairo_1.5-12.2        
##  [49] ROCR_1.0-11            abind_1.4-5            cachem_1.0.6          
##  [52] rgdal_1.5-27           vroom_1.5.5            sctransform_0.3.2     
##  [55] mmand_1.6.1            dittoSeq_1.4.3         goftest_1.2-2         
##  [58] cluster_2.1.2          splancs_2.01-42        lazyeval_0.2.2        
##  [61] crayon_1.4.1           pkgconfig_2.0.3        labeling_0.4.2        
##  [64] nlme_3.1-153           rlang_0.4.11           globals_0.14.0        
##  [67] lifecycle_1.0.1        miniUI_0.1.1.1         colourpicker_1.1.0    
##  [70] sgeostat_1.0-27        dbscan_1.1-8           cellranger_1.1.0      
##  [73] rprojroot_2.0.2        polyclip_1.10-0        lmtest_0.9-38         
##  [76] tiff_0.1-8             Matrix_1.3-4           divo_1.0.1            
##  [79] raster_3.4-13          carData_3.0-4          zoo_1.8-9             
##  [82] ggridges_0.5.3         GlobalOptions_0.1.2    pheatmap_1.0.12       
##  [85] png_0.1-7              viridisLite_0.4.0      rjson_0.2.20          
##  [88] bitops_1.0-7           R.oo_1.24.0            KernSmooth_2.23-20    
##  [91] shape_1.4.6            stringr_1.4.0          parallelly_1.28.1     
##  [94] jpeg_0.1-9             rstatix_0.7.0          ggsignif_0.6.3        
##  [97] tripack_1.3-9.1        scales_1.1.1           memoise_2.0.0         
## [100] magrittr_2.0.1         plyr_1.8.6             ica_1.0-2             
## [103] zlibbioc_1.38.0        compiler_4.1.0         RColorBrewer_1.1-2    
## [106] clue_0.3-59            fitdistrplus_1.1-5     cli_3.0.1             
## [109] XVector_0.32.0         listenv_0.8.0          patchwork_1.1.1       
## [112] pbapply_1.5-0          MASS_7.3-54            mgcv_1.8-37           
## [115] tidyselect_1.1.1       stringi_1.7.4          forcats_0.5.1         
## [118] textshaping_0.3.5      hull2spatial_0.1.0     highr_0.9             
## [121] yaml_2.2.1             ggrepel_0.9.1          grid_4.1.0            
## [124] sass_0.4.0             spatstat.linnet_2.3-0  tools_4.1.0           
## [127] future.apply_1.8.1     rio_0.5.27             circlize_0.4.13       
## [130] foreach_1.5.1          foreign_0.8-81         gridExtra_2.3         
## [133] alphahull_2.2          farver_2.1.0           Rtsne_0.15            
## [136] digest_0.6.28          BiocManager_1.30.16    shiny_1.7.0           
## [139] pracma_2.3.3           Rcpp_1.0.7             car_3.0-11            
## [142] broom_0.7.9            later_1.3.0            RcppAnnoy_0.0.19      
## [145] httr_1.4.2             ComplexHeatmap_2.8.0   colorspace_2.0-2      
## [148] fs_1.5.0               tensor_1.5             reticulate_1.22       
## [151] splines_4.1.0          uwot_0.1.10            spatstat.utils_2.2-0  
## [154] pkgdown_1.6.1          sp_1.4-5               plotly_4.9.4.1        
## [157] systemfonts_1.0.2      xtable_1.8-4           jsonlite_1.7.2        
## [160] spatstat_2.2-0         R6_2.5.1               pillar_1.6.3          
## [163] htmltools_0.5.2        mime_0.11              glue_1.4.2            
## [166] fastmap_1.1.0          codetools_0.2-18       utf8_1.2.2            
## [169] lattice_0.20-45        bslib_0.3.0            spatstat.sparse_2.0-0 
## [172] tibble_3.1.4           curl_4.3.2             leiden_0.3.9          
## [175] gtools_3.9.2           zip_2.2.0              shinyjs_2.0.0         
## [178] openxlsx_4.2.4         survival_3.2-13        rmarkdown_2.11        
## [181] desc_1.3.0             munsell_0.5.0          GetoptLong_1.0.5      
## [184] GenomeInfoDbData_1.2.6 iterators_1.0.13       haven_2.4.3           
## [187] reshape2_1.4.4         gtable_0.3.0           spatstat.core_2.3-0   
## [190] Seurat_4.0.4

Author Contributions

AT, YF, TY, ML, JZ, VO, MD are authors of the package code. MD wrote the vignette. AT designed the package.

References

Hahsler, Michael, Matthew Piekenbrock, and Derek Doran. 2019. “Dbscan: Fast Density-Based Clustering with r.” Journal of Statistical Software 91 (1): 1–30.

Jackson, Hartland W, Jana R Fischer, Vito RT Zanotelli, H Raza Ali, Robert Mechera, Savas D Soysal, Holger Moch, et al. 2020. “The Single-Cell Pathology Landscape of Breast Cancer.” Nature 578 (7796): 615–20.

Keren, Leeat, Marc Bosse, Diana Marquez, Roshan Angoshtari, Samir Jain, Sushama Varma, Soo-Ryum Yang, et al. 2018. “A Structured Tumor-Immune Microenvironment in Triple Negative Breast Cancer Revealed by Multiplexed Ion Beam Imaging.” Cell 174 (6): 1373–87.

Pateiro-Lopez, Beatriz, Alberto Rodriguez-Casal, and. 2019. Alphahull: Generalization of the Convex Hull of a Sample of Points in the Plane. https://CRAN.R-project.org/package=alphahull.

Yang, Tianpei, Volkan Ozcoban, Anu Pasam, Nikolce Kocovski, Angela Pizzolla, Yu-Kuan Huang, Greg Bass, et al. 2020. “SPIAT: An r Package for the Spatial Image Analysis of Cells in Tissues.” bioRxiv.

Overview of the SPIAT package

Anna Trigos, Yuzhou Feng, Tom Yang, Mabel Li, John Zhu, Volkan Ozcoban, Maria Doyle

28 September 2021

MATERIAL UNDER DEVELOPMENT

Introduction

Setting up the data

Simple import

InForm input

Intensity columns

X and Y location columns

Phenotype column

Cell properties columns

HALO input

Intensity columns

Dye columns

X and Y location columns

Phenotype column

Cell properties columns

Splitting images

Example data

Predicting cell phenotypes

Specifying cell types

Quality control

Boxplots of marker intensity

Scatter plots of marker level

Heatmaps of marker level

Removing incorrect phenotypes

Cell percentages

Visualizing tissues

3D surface plot

3D stacked surface plot

Categorical dot plot

Cell distances

Cell mixing

Cells within radius

Mixing score

K-Cross function

Entropy

Gradients

Tissue structures

Clusters and communities

Reproducibility

Author Contributions

References