Skip to content

Categorical × Numerical Data

In this section, following data will be used as an example:

import numpy as np
from whitecanvas import new_canvas

rng = np.random.default_rng(3)
df = {
    "category": ["A"] * 40 + ["B"] * 50,
    "observation": np.concatenate([rng.normal(2.0, size=40), rng.normal(3.3, size=50)]),
    "replicate": [0] * 23 + [1] * 17 + [0] * 22 + [1] * 28,
    "temperature": rng.normal(scale=2.8, size=90) + 22.0,
}

How can we visualize the distributions for each category? There are several plots that use categorical axis as either the x- or y-axis, and numerical axis as the other. Examples are:

  • Strip plot
  • Swarm plot
  • Violin plot
  • Box plot

Aside from the categorical axis, data points may further be grouped by other features, such as the marker symbol and the marker size. Things are even more complicated when the markers represent numerical values, such as their size being proportional to the value, or colored by a colormap.

whitecanvas provides a consistent and simple interface to handle all these cases. Methods used for this purpose are cat_x and cat_y, where cat_x will deem the x-axis as categorical, and cat_y will do the same for the y-axis.

canvas = new_canvas("matplotlib")

# create the categorical plotter.
cat_plt_x = canvas.cat_x(df, x="category", y="observation")
cat_plt_y = canvas.cat_y(df, x="observation", y="category")

cat_x and cat_y use the argument x= and y= to specify the columns that are used for the plot, where x= is the categorical axis for cat_x and y= for cat_y.

This is one of the important difference between `seaborn`. In `seaborn`, `orient` are
used to specify the orientation of the plots. This design forces the user to add the
argument `orient=` to every plot even though the orientation rarely changes during the
use of the same figure. In `whitecanvas`, you don't have to specify the orientation
once a categorical plotter is created by either `cat_x` or `cat_y`.

Multiplt columns can be used for the categorical axis, but only one column can be used for the numerical axis.

# OK
canvas.cat_x(df, x=["category", "replicate"], y="observation")
# OK
canvas.cat_y(df, x="observation", y=["category", "replicate"])
# NG
canvas.cat_x(df, x="category", y=["observation", "temperature"])

Non-marker-type Plots

Since plots without data point markers are more straightforward, we will start with them. It includes add_violinplot, add_boxplot, add_pointplot and add_barplot.

canvas = new_canvas("matplotlib")
canvas.cat_x(df, x="category", y="observation").add_violinplot()
canvas.show()

Violins can also be shown in different color. Specify the color= argument to do that.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_violinplot(color="replicate")
)
canvas.show()

By default, groups with different colors do not overlap. This is controlled by the dodge= argument. Set dodge=False to make them overlap (although it is not the way we usually do).

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_violinplot(color="replicate", dodge=False)
)
canvas.show()

hatch= can also be specified in a similar way. It will change the hatch pattern of the violins.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_violinplot(hatch="replicate")
)
canvas.show()

color and hatch can overlap with each other or the x= or y= argument.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_violinplot(color="category")
)
canvas.show()

The color palette of the canvas is used to paint categories. If you want to change it after the layer is added, use update_color_palette method.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_violinplot(color="replicate")
    .update_color_palette(["pink", "teal"])
)
canvas.show()

add_boxplot, add_pointplot and add_barplot is very similar to add_violinplot.

from whitecanvas import new_row

canvas = new_row(3, size=(1600, 600), backend="matplotlib")

c0 = canvas.add_canvas(0)
c0.cat_x(df, x="category", y="observation").add_boxplot()
c0.title = "boxplot"

c1 = canvas.add_canvas(1)
c1.cat_x(df, x="category", y="observation").add_pointplot()
c1.title = "pointplot"

c2 = canvas.add_canvas(2)
c2.cat_x(df, x="category", y="observation").add_barplot()
c2.title = "barplot"

canvas.show()

Marker-type Plots

Marker-type plots use a marker to represent each data point.

Strip plot

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_stripplot(color="replicate")
)
canvas.show()

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_stripplot(color="replicate", dodge=True)
)
canvas.show()

As for the Markers layer, as_edge_only will convert the face features to the edge features.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_stripplot(color="replicate", dodge=True)
    .as_edge_only(width=2)
)
canvas.show()

with_hover_template is also defined. All the column names can be used in the template format string.

canvas = new_canvas("plotly", size=(400, 300))
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_stripplot(color="replicate", dodge=True)
    .with_hover_template("{category} (rep={replicate})")
)
canvas.show()

Each marker size can represent a numerical value. update_size will map the numerical values of a column to the size of the markers.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_stripplot()
    .update_size("temperature")
)
canvas.show()

Similarly, each marker color can represent a numerical value. update_colormap will map the value with an arbitrary colormap.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_stripplot()
    .update_colormap("temperature", cmap="coolwarm")
)
canvas.show()

Swarm plot

Swarm plot (or beeswarm plot) is another way to visualize all the data points with markers. In swarm plot, the outline of the markers represents the distribution of the data.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_swarmplot(sort=True)
    .update_colormap("temperature", cmap="coolwarm")
)
canvas.show()

Rug plot

Although rug plot does not directly use markers, it also use a line to represent each data point.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_rugplot(color="replicate", dodge=True)
)
canvas.show()

Some methods defined for marker-type plots can also be used for rug plot. For example, update_colormap will change the color of the rug lines based on the values of the specified column.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_rugplot()
    .update_colormap("temperature", cmap="coolwarm")
)
canvas.show()

scale_by_density will change the length of the rugs to represent the density of the data points.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_rugplot(color="replicate", dodge=True)
    .scale_by_density()
)
canvas.show()

Overlaying Plots

Different types of plots have their own strengths and weaknesses. To make the plot more informative, it is often necessary to overlay different types of plots.

You can simply call different methds to overlay different types of plots, but in some cases it is not that easy. For example, to add rug plot to violin plot, you have to correctly set the lengths of the rug lines so that their edges exactly match the edges of the violins.

Some types of plots are implemented with methods to efficiently overlay them with other plots. All of them use method chaining so that the API is very clean.

Rug plot over violin plot

Violin plot can be overlaid with rug plot using with_rug method. Edges of the rug lines match exactly with the edges of the violins. Of cource, you can hover over the rug lines to see the details.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_violinplot(color="replicate")
    .with_rug(color="purple")
)
canvas.show()

Box plot over violin plot

Violin plot can be overlaid with box plot using with_box method. Color of the box plot follows the convention of other plotting softwares by default.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_violinplot(color="replicate")
    .with_box(width=2.0, extent=0.05)
)
canvas.show()

If the violins are edge only, the box plot will be filled with the same color.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_violinplot(color="replicate")
    .as_edge_only()
    .with_box(width=2.0, extent=0.05)
)
canvas.show()

Markers over violin plot

Violin plot has with_strip and with_swarm methods to overlay markers.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_violinplot(color="replicate")
    .with_strip(symbol="D", size=8, color="black")
)

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_violinplot(color="replicate")
    .with_swarm(size=8, color="black")
)

Add outliers

Box plot and violin plot are usually combined with outlier markers, as these plots are not good at showing the details of the sparse data points. For these plots, with_outliers method will add outliers, and optionally change the whisker lengths for the box plot.

This is the example of adding outliers to the box plot. Because outliers are shown as a strip plot, arguments specific to strip plot (symbol, size, extent and seed) can be used.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_boxplot(color="replicate")
    .with_outliers(size=8)
)

If the box plot is edge only, the outliers will be the same.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_boxplot(color="replicate")
    .as_edge_only()
    .with_outliers()
)

Setting update_whiskers to False will not change the whisker lengths.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_boxplot(color="replicate")
    .with_outliers(update_whiskers=False)
)

Violin plot also supports with_outliers method.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .add_violinplot(color="replicate")
    .with_outliers(size=8)
)

Sort categorical axis

By default, the order of the categories is determined by the order of appearance in the data. In the example data, the order of "category" is "A" and "B".

If you want to sort the categories as you like, you can use the sorting methods of the categorical plotters.

Sort in ascending or descending order

The sort method will sort the categories in ascending or descending order. The default is ascending. Use ascending=False to sort in descending order.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .sort(ascending=False)
    .add_violinplot(color="replicate")
)

Sorting works similarly for the categorical axis with multiple columns.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x=["category", "replicate"], y="observation")
    .sort(ascending=False)
    .add_violinplot(color="replicate")
)

Sort in any order

If you already know the category names and want to sort them in a certain order, use the sort_in_order method.

canvas = new_canvas("matplotlib")
(
    canvas
    .cat_x(df, x="category", y="observation")
    .sort_in_order(["B", "A"])
    .add_violinplot(color="replicate")
)