Plotting data distributions with Swift Charts

In this post I'm going to explore how to visualise data distributions with the Swift Charts framework.

I want to look at a dataset of some survey data collected from penguin colonies. The dataset captures many different anatomical details, such as weight, flipper length and bill length. To start my investigation, I will look at the distribution of values for each of these parameters.

To build a histogram, I need to group my data into bins and count how many samples fall within each bin. Swift Charts provides a simple way to do the binning operation with the NumberBins data type.

The NumberBins data type has a range of initialisers. The simplest one to use is init(data:desiredCount:), which determines the placement of the bin thresholds from the input data.

let bins = NumberBins(
    data: dataset.map(\.billLength),
    desiredCount: 30
)

However, NumberBins alone is not enough: it doesn't group the data into bins, it only provides a way to determine which bin each value belongs to. To group the data, I need to use the Dictionary.init(grouping:by:) initialiser together with the bins.

let groups = Dictionary(
    grouping: dataset.map(\.billLength),
    by: bins.index
)

Now I have a dictionary that maps the bin index to an array of all the values in that bin.
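To make the shape of that dictionary concrete, here is a minimal sketch that runs without Swift Charts. The sample values are made up, and binIndex(for:lowerBound:binWidth:) is a hypothetical equal-width stand-in for NumberBins.index(for:), which chooses its own thresholds.

```swift
// Hypothetical bill lengths in millimetres (not the real dataset).
let billLengths: [Float] = [39.1, 39.5, 40.3, 36.7, 46.5, 50.0, 45.4, 49.9]

// Hand-rolled stand-in for NumberBins.index(for:): equal-width bins
// that start at `lowerBound`. Swift Charts picks its own thresholds.
func binIndex(for value: Float, lowerBound: Float, binWidth: Float) -> Int {
    Int(((value - lowerBound) / binWidth).rounded(.down))
}

// Group every value under the index of the bin it falls into.
let groups = Dictionary(grouping: billLengths) { value in
    binIndex(for: value, lowerBound: 35, binWidth: 5)
}

// Dictionaries are unordered, so sort the keys before printing.
for index in groups.keys.sorted() {
    print(index, groups[index] ?? [])
}
```

Each key is a bin index and each value is the array of samples that landed in that bin, in their original order.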

To plot a histogram, I need to count how many data samples are in each bin.

let preparedData = groups.map { key, values in
    (
        index: key,
        range: bins[key],
        frequency: values.count
    )
}

I'm counting the samples in each bin and, at the same time, retrieving the range of values that corresponds to that bin.
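The frequency value above is a raw count per bin. If a share of the total is wanted instead, one possible approach is to divide each count by the number of samples. The sketch below uses made-up counts and is plain Swift, so it can be tried outside of a chart view.

```swift
// Hypothetical counts per bin index, shaped like `groups.mapValues(\.count)`.
let counts: [Int: Int] = [0: 3, 1: 1, 2: 3, 3: 1]

// Total number of samples across all bins.
let total = counts.values.reduce(0, +)

// Convert each count to a share of all samples and sort by bin index,
// because Dictionary iteration order is not defined.
let proportions = counts
    .map { (index: $0.key, proportion: Double($0.value) / Double(total)) }
    .sorted { $0.index < $1.index }

for entry in proportions {
    print(entry.index, entry.proportion)
}
```

The shares sum to 1, which makes histograms of datasets with different sample counts easier to compare.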

Since the number of samples is small enough, I will just place all of the above work into a computed property.

struct PenguinChart: View {
    let dataset: [PenguinsDataPoint]
    
    var binnedData: [
        (
            index: Int, 
            range: ChartBinRange<Float>,
            frequency: Int
        )
    ] {
        let bins = NumberBins(
            data: dataset.map(\.billLength),
            desiredCount: 30
        )
        
        let groups = Dictionary(
            grouping: dataset.map(\.billLength),
            by: bins.index
        )
        
        let preparedData = groups.map { key, values in
            return (
                index: key,
                range: bins[key],
                frequency: values.count
            )
        }
        
        return preparedData
    }
    
    var body: some View { ... }
} 

With the data aggregated into bins, it's time to plot a bar chart.

struct PenguinChart: View {
    let dataset: [PenguinsDataPoint]
    
    var binnedData: [(...)] { ... }
    
    var body: some View {
        Chart(
            self.binnedData, id: \.index
        ) { element in
            BarMark(
                x: .value(
                    "Bill Length",
                    element.range
                ),
                y: .value(
                    "Frequency",
                    element.frequency
                )
            )
        }
    }
}
A histogram showing the distribution of bill length with the chart starting at 0

The chart looks a little off, since by default the framework includes 0 in the axis domain. Using the chartXScale(domain: .automatic(includesZero: false)) modifier, I can override that default.

struct PenguinChart: View {
    let dataset: [PenguinsDataPoint]
    
    var binnedData: [(...)] { ... }
    
    var body: some View {
        Chart(self.binnedData, id: \.index) { element in
            BarMark(
                x: .value(
                    "Bill Length",
                    element.range
                ),
                y: .value(
                    "Frequency",
                    element.frequency
                )
            )
        }
        .chartXScale(
            domain: .automatic(includesZero: false)
        )
    }
}
A histogram showing the distribution of bill length with the chart starting where the data starts

Let's explore the other numerical properties of the data set. I can adjust the computed binnedData to aggregate over other fields of the dataset.

Three histograms showing the distribution of bill length, flipper length and body mass
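One way to aggregate over different fields without duplicating the computed property is to pass the field in as a key path. The sketch below is an assumption about how that could look, not the code from the post: the PenguinsDataPoint struct is a hypothetical stand-in with only three fields, binCounts(in:for:binWidth:) is a made-up helper, and simple equal-width binning replaces NumberBins so that the example runs without Swift Charts.

```swift
// Hypothetical stand-in for the post's PenguinsDataPoint type.
struct PenguinsDataPoint {
    let billLength: Float
    let flipperLength: Float
    let bodyMass: Float
}

// Counts samples per equal-width bin for whichever field `keyPath` selects.
// The hand-rolled binning stands in for NumberBins.
func binCounts(
    in dataset: [PenguinsDataPoint],
    for keyPath: KeyPath<PenguinsDataPoint, Float>,
    binWidth: Float
) -> [(index: Int, count: Int)] {
    let values = dataset.map { $0[keyPath: keyPath] }
    guard let minimum = values.min() else { return [] }

    let groups = Dictionary(grouping: values) { value in
        Int(((value - minimum) / binWidth).rounded(.down))
    }

    // Sort by bin index, since Dictionary order is not defined.
    return groups
        .map { (index: $0.key, count: $0.value.count) }
        .sorted { $0.index < $1.index }
}

// Made-up sample rows.
let sample = [
    PenguinsDataPoint(billLength: 39.1, flipperLength: 181, bodyMass: 3750),
    PenguinsDataPoint(billLength: 46.5, flipperLength: 217, bodyMass: 5200),
    PenguinsDataPoint(billLength: 40.3, flipperLength: 195, bodyMass: 3250),
]

let byBill = binCounts(in: sample, for: \.billLength, binWidth: 5)
let byFlipper = binCounts(in: sample, for: \.flipperLength, binWidth: 10)
print(byBill, byFlipper)
```

The same key-path parameter could be threaded through a NumberBins-based property, so a single chart view can plot any numerical field.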

Plotting just one attribute at a time is useful, but I would love to drill down into the relationship between these variables.

# Building a 2D density plot

I feel it would be best to look into the relationship between flipper length and bill length, as both of these appear to be multimodal distributions.

I need to adjust how I'm aggregating the data. To start with, I will create bins for both flipper length and bill length.

let flipperLengthBins = NumberBins(
    data: dataset.map(\.flipperLength),
    desiredCount: 30
)

let billLengthBins = NumberBins(
    data: dataset.map(\.billLength),
    desiredCount: 30
)

To represent each grid square, I will create an intermediate data type that combines a flipperLengthBin index and a billLengthBin index.

struct TwoDimensionalBinIndex: Hashable {
    let xBinIndex: Int
    let yBinIndex: Int
} 

TwoDimensionalBinIndex must conform to Hashable, so that it can be used with Dictionary.init(grouping:by:). When grouping the data, I will create the corresponding TwoDimensionalBinIndex for each value.

let groupedData = Dictionary(
    grouping: dataset
) { element in
    TwoDimensionalBinIndex(
        xBinIndex:
            flipperLengthBins
                .index(for: element.flipperLength),
        yBinIndex:
            billLengthBins
                .index(for: element.billLength)
    )
}

I'm calling the index(for:) method of each NumberBins instance for every element in my data set and assigning the resulting bin indices to the x and y attributes of the TwoDimensionalBinIndex, which is then used to group the data.
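The reason the Hashable conformance matters can be seen in a small sketch that leaves Swift Charts out. The index pairs below are made up. Swift synthesises Hashable for a struct whose stored properties are all Hashable, so two values with the same x and y indices land in the same dictionary entry.

```swift
struct TwoDimensionalBinIndex: Hashable {
    let xBinIndex: Int
    let yBinIndex: Int
}

// Hypothetical (x, y) bin indices for five samples.
let sampleIndices = [(0, 1), (0, 1), (2, 3), (0, 2), (2, 3)]

// Group by grid cell, then reduce each group to its sample count.
let cellCounts = Dictionary(grouping: sampleIndices) { pair in
    TwoDimensionalBinIndex(xBinIndex: pair.0, yBinIndex: pair.1)
}.mapValues(\.count)

for (cell, count) in cellCounts {
    print(cell.xBinIndex, cell.yBinIndex, count)
}
```

Only grid cells that contain at least one sample appear in the dictionary, so empty cells produce no entry and no rectangle later on.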

As with the regular histogram, I now need to aggregate these values for each group. This time, rather than returning just a single range, I will need to return a range for x and a range for y.

let values = groupedData
    .map { key, values in
        return (
            index: key,
            xDataRange: flipperLengthBins[
                key.xBinIndex
            ],
            yDataRange: billLengthBins[
                key.yBinIndex
            ],
            frequency: values.count
        )
    }

Here the subscript(position:) method on the respective NumberBins instances is used to retrieve the ChartBinRange values for each axis. To bring this all together, it's best to first declare a typealias for the aggregated binned values.

typealias BinnedValue = (
    index: TwoDimensionalBinIndex,
    xDataRange: ChartBinRange<Float>,
    yDataRange: ChartBinRange<Float>,
    frequency: Int
)

In my chart view, as with the histogram, I use a regular computed property.

struct PenguinChart: View {
    let dataset: [PenguinsDataPoint]
    
    struct TwoDimensionalBinIndex: Hashable {
        let xBinIndex: Int
        let yBinIndex: Int
    }
    
    typealias BinnedValue = (
        index: TwoDimensionalBinIndex,
        xDataRange: ChartBinRange<Float>,
        yDataRange: ChartBinRange<Float>,
        frequency: Int
    )
    
    
    var bins: [BinnedValue] {
        let flipperLengthBins = NumberBins(
            data: dataset.map(\.flipperLength),
            desiredCount: 25
        )

        let billLengthBins = NumberBins(
            data: dataset.map(\.billLength),
            desiredCount: 25
        )
        
        let groupedData = Dictionary(
            grouping: dataset
        ) { element in
            TwoDimensionalBinIndex(
                xBinIndex: flipperLengthBins
                    .index(for: element.flipperLength),
                yBinIndex: billLengthBins
                    .index(for: element.billLength)
            )
        }

        let values = groupedData
            .map { key, values in
                return (
                    index: key,
                    xDataRange: flipperLengthBins[
                        key.xBinIndex
                    ],
                    yDataRange: billLengthBins[
                        key.yBinIndex
                    ],
                    frequency: values.count
                )
            }
        
        return values
    }
    
    var body: some View { ... }
}

With all these bins computed, it's now a matter of plotting the values.

struct PenguinChart: View {
    let dataset: [PenguinsDataPoint]
    
    ...
    
    var bins: [BinnedValue] { ... }
    
    var body: some View {
        Chart(
            self.bins, id: \.index
        ) { element in
            RectangleMark(
                x: .value(
                    "Flipper Length",
                    element.xDataRange
                ),
                y: .value(
                    "Bill Length",
                    element.yDataRange
                )
            )
            .foregroundStyle(
                by: .value(
                    "Frequency",
                    element.frequency
                )
            )
        }
        .chartXScale(
            domain: .automatic(
                includesZero: false
            )
        )
        .chartYScale(
            domain: .automatic(
                includesZero: false
            )
        )
    }
}

I use a RectangleMark to draw a rectangle for each x and y region, then I colour that region based on the frequency value. Furthermore, as with the histogram, I adjust the x and y axes to no longer start at 0.

A 2D density plot for flipper length and bill length with default colouring in white and blue

Clearly, some styling is needed. I will adjust the colour gradient and get rid of the spacing between the rectangles. The colour gradient can be adjusted using the chartForegroundStyleScale(range:) modifier, and by passing width and height values to the RectangleMark initialiser I can control the spacing.

struct PenguinChart: View {
    
    ...
    
    var body: some View {
        Chart(
            self.bins, id: \.index
        ) { element in
            RectangleMark(
                x: .value(
                    "Flipper Length",
                    element.xDataRange
                ),
                y: .value(
                    "Bill Length",
                    element.yDataRange
                ),
                width: .ratio(1),
                height: .ratio(1)
            )
            .foregroundStyle(
                by: .value(
                    "Frequency",
                    element.frequency
                )
            )
        }
        .chartXScale(
            domain: .automatic(
                includesZero: false
            )
        )
        .chartYScale(
            domain: .automatic(
                includesZero: false
            )
        )
        .chartForegroundStyleScale(
            range: Gradient(
                colors: [
                    Color.red.opacity(0.1),
                    Color.yellow
                ]
            )
        )
    }
}
A 2D density plot for flipper length and bill length with a gentle gradient from red to yellow

While exploring this, I also tried using BarMark instead of RectangleMark. By varying the width and height, I was able to create some interesting effects. These could be very useful when animating between different plotted attributes, but that will need to wait for a follow-up post.

A 2D density plot for flipper length and bill length made out of small rounded rectangles

You can find the code for this post in our GitHub project, including the code to download and parse the CSV file.