Alternative clustering methods#

Alternative methods overview#

In this section, we show two different clustering methods which were used on top of Leiden clustering (our preferred method).

These two methods are agglomerative clustering (hierarchical clustering) and gaussian mixture model-based (GMM) clustering. We show that these methods are not as good as Leiden clustering.

Agglomerative clustering#

Agglomerative clustering is a hierarchical clustering method. It starts with each point in its own cluster and then merges the two closest clusters iteratively until there is only one cluster left. We use the Euclidean distance as the distance metric, with ward linkage (cluster linkage which minimises intra-cluster variance), which was by far the best mon linkage methods tried. We use the AgglomerativeClustering class from sklearn.cluster to perform agglomerative clustering.

Gaussian mixture model-based clustering#

Gaussian mixture model-based clustering is a clustering method which assumes that the data is generated from a mixture of Gaussian distributions. We use the GaussianMixture class from sklearn.mixture to perform Gaussian mixture model-based clustering.

Data loading and preprocessing#

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path
import utils
import sklearn.cluster
import sklearn.mixture
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from umap import umap_ as umap
import plotly.express as px
import plotly.offline as pyo

pyo.init_notebook_mode()
DATA_PATH = Path.cwd() / "../data"
data = {
    Path(f).stem: pd.read_csv(f, index_col=0) for f in DATA_PATH.glob("combined_*.csv")
}
print(list(data.keys()))
['combined_metrics_finished_edges', 'combined_metrics_finished_paths', 'combined_metrics_unfinished_edges', 'combined_metrics_unfinished_paths']
features_finished_paths = data["combined_metrics_finished_paths"].reset_index(drop=True)
features_unfinished_paths = data["combined_metrics_unfinished_paths"].reset_index(
    drop=True
)
features_used=utils.FEATURES_COLS_USED_FOR_CLUSTERING
combined_df = pd.concat([features_finished_paths, features_unfinished_paths], axis=0)
combined_df['finished'] = [1] * len(features_finished_paths) + [0] * len(features_unfinished_paths)
combined_df[features_used] = utils.normalize_features(combined_df[utils.FEATURES_COLS_USED_FOR_CLUSTERING])
X = combined_df[features_used].values
#find number of NaNs in each column
nans = np.isnan(X).sum(axis=0)
print(nans)
#drop rows with NaNs
mask_vector = ~np.isnan(X).any(axis=1)
X = X[~np.isnan(X).any(axis=1)]
[0 0 3 0 0 0 0 0 0 0]
# drop the path sum metrics
combined_df
hashedIpAddress timestamp durationInSec path rating backtrack numberOfPath path_length coarse_mean_time position_mean ... semantic_similarity path_degree_abs_sum path_clustering_abs_sum path_degree_centrality_abs_sum path_betweenness_abs_sum path_closeness_abs_sum index target type finished
0 6a3701d319fc3754 2011-02-15 03:26:49 0.094978 ['14th_century', '15th_century', '16th_century... NaN -0.280785 -0.360322 1.237829 -0.293402 0.011529 ... 1.882018 5.197422 -3.233355 -3.234431 -5.004633 -0.430555 NaN NaN NaN 1
1 3824310e536af032 2012-08-12 06:36:52 -0.422557 ['14th_century', 'Europe', 'Africa', 'Atlantic... 3.0 -0.280785 0.358481 -0.067835 -0.333922 -0.627881 ... 1.190749 7.010212 -2.318889 -1.421642 -3.524030 0.953234 NaN NaN NaN 1
2 415612e93584d30e 2012-10-03 21:10:40 -0.055666 ['14th_century', 'Niger', 'Nigeria', 'British_... NaN -0.280785 0.544508 0.976194 -0.351290 -0.369856 ... 0.770837 5.236336 -3.212821 -3.195518 -4.962358 -0.403703 NaN NaN NaN 1
3 64dd5cd342e3780c 2010-02-08 07:25:25 -1.129090 ['14th_century', 'Renaissance', 'Ancient_Greec... NaN -0.280785 -0.559082 -0.563509 -0.890117 -0.442103 ... 0.535863 NaN NaN NaN NaN 0.000000 NaN NaN NaN 1
4 015245d773376aab 2013-04-23 15:27:08 0.138033 ['14th_century', 'Italy', 'Roman_Catholic_Chur... 3.0 -0.280785 -0.898865 0.679578 -0.030457 0.062407 ... -0.815607 6.315864 -2.737924 -2.115989 -4.991059 0.562986 NaN NaN NaN 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
18976 109ed71f571d86e9 2014-01-15 12:00:12 0.170016 ['Turks_and_Caicos_Islands', 'United_States', ... NaN -0.280785 0.843977 0.337160 0.136738 0.774194 ... 0.089435 7.555088 -1.505405 -0.876766 -2.304742 1.021614 24869.0 Iraq_War restart 0
18977 109ed71f571d86e9 2014-01-15 12:06:45 0.161005 ['Franz_Kafka', 'Tuberculosis', 'World_Health_... NaN 0.337716 0.871218 0.337160 -0.121555 -0.310926 ... -0.462655 3.669951 -3.305998 -4.761902 -7.644179 -1.914494 24870.0 Cholera restart 0
18978 2e09a7224600a7cd 2014-01-15 15:06:40 2.082769 ['Computer_programming', 'Linguistics', 'Cultu... NaN 0.337716 -0.109914 -1.202543 2.322458 1.156748 ... -0.102215 4.760035 -2.802979 -3.671818 -6.478384 0.412224 24872.0 The_Beatles timeout 0
18979 60af9e2138051b96 2014-01-15 15:24:41 2.084055 ['Jamaica', 'United_Kingdom', 'World_War_II', ... NaN -0.280785 -0.898865 -0.563509 2.516759 -1.107312 ... -0.259947 7.158709 -1.836719 -1.273145 -2.951980 0.864835 24873.0 Alan_Turing timeout 0
18980 1cf0cbb3281049ab 2014-01-15 21:54:01 0.707915 ['Mark_Antony', 'Rome', 'Tennis', 'Hawk-Eye', ... NaN -0.280785 1.008917 -0.067835 0.864710 0.666220 ... -1.801801 5.595124 -1.730546 -2.836729 -5.768889 0.465042 24874.0 Feather restart 0

70290 rows × 26 columns

Finding the dimensionality of the data will not be reiterated here, please refer to the Leiden clustering notebook for more details.

We will keep only the 7 first dimensions.

pca = PCA(n_components=7, random_state=42)
X_pca = pca.fit_transform(X)

Hierarchical clustering#

The code below defines the denedogram plotting function, which is used to visualise the decision tree of the hierarchical clustering.

Setup#

def plot_dendrogram(model, n_clust, **kwargs): # From sklearn plot clustering dendogram docs: https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

    # Draw a line at the threshold (which is derived from the number of clusters)
    plt.axhline(y=(linkage_matrix[-n_clust, 2] + linkage_matrix[-n_clust + 1, 2]) / 2, c='r', linestyle='--', alpha=0.5)

    plt.xlabel("Number of points in node (or index of point if no parenthesis).")
    plt.show()
X_pca.shape
(70287, 7)

Visualisation of the hierarchical clustering#

Here we plot a dendrogram of the clustering, and the resulting clusters (on a 3D projection of the data)

# Select a random subset of samples
SUBSAMPLE_SIZE = 20000

np.random.seed(42)
subset_indices = np.random.choice(X_pca.shape[0], size=SUBSAMPLE_SIZE, replace=True)
X_subset = X_pca[subset_indices]
NUM_CLUSTERS = 6

# Run hierarchical clustering on the subset of data (maximum distance between clusters)
model = sklearn.cluster.AgglomerativeClustering(n_clusters=NUM_CLUSTERS, compute_full_tree=True, compute_distances=True, linkage='ward')
Z = model.fit_predict(X_subset)

plot_dendrogram(model, truncate_mode='level', p=5, n_clust=NUM_CLUSTERS, color_threshold=0)
_images/899375077246448d77c62a6b6ff86cdc0df061f63fbaed32f62fb4ac631421dd.png

UMAP plot#

# umap the data down to 3 dimensions
reducer = umap.UMAP(random_state=42, n_components=3, init='random')
# reducer = PCA(n_components=2, random_state=42)
# reducer = TSNE(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X_subset)

fig = px.scatter_3d(
	x=X_umap[:, 0],
	y=X_umap[:, 1],
	z=X_umap[:, 2],
	color=model.labels_,
	opacity=0.8,
	labels={"color": "cluster"},
	title=f"UMAP of agglomerative clustering of {SUBSAMPLE_SIZE} samples",
)
utils.format_3d_plot(fig)
display(fig)
c:\Users\Cyril\anaconda3\envs\DLbiomed\lib\site-packages\umap\umap_.py:1943: UserWarning:

n_jobs value -1 overridden to 1 by setting random_state. Use no seed for parallelism.

c:\Users\Cyril\anaconda3\envs\DLbiomed\lib\site-packages\pynndescent\pynndescent_.py:346: NumbaPendingDeprecationWarning:

Code using Numba extension API maybe depending on 'old_style' error-capturing, which is deprecated and will be replaced by 'new_style' in a future release. See details at https://numba.readthedocs.io/en/latest/reference/deprecation.html#deprecation-of-old-style-numba-captured-errors
Exception origin:
  File "c:\Users\Cyril\anaconda3\envs\DLbiomed\lib\site-packages\numba\core\types\functions.py", line 486, in __getnewargs__
    raise ReferenceError("underlying object has vanished")


c:\Users\Cyril\anaconda3\envs\DLbiomed\lib\site-packages\pynndescent\pynndescent_.py:348: NumbaPendingDeprecationWarning:

Code using Numba extension API maybe depending on 'old_style' error-capturing, which is deprecated and will be replaced by 'new_style' in a future release. See details at https://numba.readthedocs.io/en/latest/reference/deprecation.html#deprecation-of-old-style-numba-captured-errors
Exception origin:
  File "c:\Users\Cyril\anaconda3\envs\DLbiomed\lib\site-packages\numba\core\types\functions.py", line 486, in __getnewargs__
    raise ReferenceError("underlying object has vanished")


c:\Users\Cyril\anaconda3\envs\DLbiomed\lib\site-packages\pynndescent\pynndescent_.py:358: NumbaPendingDeprecationWarning:

Code using Numba extension API maybe depending on 'old_style' error-capturing, which is deprecated and will be replaced by 'new_style' in a future release. See details at https://numba.readthedocs.io/en/latest/reference/deprecation.html#deprecation-of-old-style-numba-captured-errors
Exception origin:
  File "c:\Users\Cyril\anaconda3\envs\DLbiomed\lib\site-packages\numba\core\types\functions.py", line 486, in __getnewargs__
    raise ReferenceError("underlying object has vanished")

# reducer = umap.UMAP(random_state=42, n_components=2, init='random')
# reducer = PCA(n_components=2, random_state=42)
reducer = TSNE(n_components=3, random_state=42)
X_umap = reducer.fit_transform(X_subset)

t-SNE plot#

fig2=px.scatter_3d(
	x=X_umap[:, 0],
	y=X_umap[:, 1],
	z=X_umap[:, 2],
	color=model.labels_.astype(str),
	opacity=0.8,
	labels={"color": "cluster"},
	title=f"TSNE of agglomerative clustering of {SUBSAMPLE_SIZE} samples",
)
utils.format_3d_plot(fig2)
fig.update_layout(scene=dict(xaxis_title="t-SNE 1", yaxis_title="t-SNE 2", zaxis_title="t-SNE 3"))
display(fig)

One can observe that the clustering obtained from this method follows the UMAP and TSNE projections (position in space determines cluster). One of the selected clusters only has 3 points, which is definitely not what we want! (See second line crossing threshold from left.)

Feature distributions per cluster#

with plt.style.context("dark_background"):
    plot_data = combined_df[utils.FEATURES_COLS_USED_FOR_CLUSTERING].copy().dropna().iloc[subset_indices]
    plot_data["cluster"] = model.labels_
    plot_data["cluster"] = plot_data["cluster"].astype("category")
    n_features = len(plot_data.columns) - 1
    n_cols = 3
    n_rows = int(np.ceil(n_features / n_cols))
    fig, axs = plt.subplots(n_rows, n_cols, figsize=(20, 20))
    axs = axs.flatten()
    for i, col in enumerate(plot_data.columns):
        if col == "cluster":
            continue
        sns.histplot(
            data=plot_data,
            x=col,
            hue="cluster",
            ax=axs[i],
            stat="density",
            common_norm=False,
            palette="tab10",
        )
        utils.set_axis_style(axs, i)

    plt.suptitle("Histograms of normalized features colored by cluster", fontsize=20)
    plt.subplots_adjust(top=0.95)
    plt.show()
_images/6fad8b2e379e45cd30da3f35d751c33619fec79b896bff04fd4d01811e00a270.png

Firstly, these clusters have been computed on a subset, due to memory limitations of the method. Secondly, the clusters are not as well defined as the ones obtained from Leiden clustering. One can observe the presence of significant outliers in the 5th cluster, and very similar values for the rest.

with plt.style.context("dark_background"):
	combined_df["finished_normalized"] = (
		combined_df["finished"] - combined_df["finished"].mean()
	) / combined_df["finished"].std()
	plot_data = combined_df[utils.FEATURES_COLS_USED_FOR_CLUSTERING + ["finished_normalized"]].copy().iloc[subset_indices]

	plot_data.index = np.arange(len(plot_data))

	plot_data["cluster"] = model.labels_

	plot_data["cluster"] = plot_data["cluster"].astype("category")
	n_features = len(plot_data.columns) - 1
	n_cols = 4
	n_rows = int(np.ceil(n_features / n_cols))
	fig, axs = plt.subplots(n_rows, n_cols, figsize=(20, 20))
	fig.suptitle("Boxplots of normalized features colored by cluster", fontsize=20)
	plt.subplots_adjust(top=0.95)
	axs = axs.flatten()
	for i, col in enumerate(plot_data.columns):
		if col == "cluster":
			continue
		sns.boxplot(data=plot_data, x="cluster", y=col, ax=axs[i], hue="cluster", palette="tab10")
		utils.set_axis_style(axs, i)
	plt.show()
_images/37486551f3afe3fb7fdb7512202470283f8d56be966a8926a142172e1533998a.png

A lot of features do not vary by cluster. This is not wanted behaviour, as we want to find features which vary by cluster.

clusters_names,n_points = np.unique(plot_data["cluster"], return_counts=True)
with plt.style.context("dark_background"):
    sns.barplot(x=clusters_names, y=n_points,hue=range(len(clusters_names)), palette="tab10").set_title("Number of paths per cluster", fontsize=20)
    # make background grey
    plt.gca().patch.set_facecolor("#d3d3d3")
    # update legend style
    leg = plt.gca().get_legend()
    leg.get_frame().set_facecolor("#d3d3d3")
    leg.set_title("Cluster")
    leg_title = leg.get_title()
    leg_title.set_color("black")
    # change all legend text to black
    [text.set_color("black") for text in leg.get_texts()]
    plt.xlabel("Cluster", fontsize=14)
    plt.ylabel("Number of paths", fontsize=14)
plt.show()
_images/8de357c1b6ff894af232d1c9514401866a1f6c697078b0f5946097a89ba4d838.png

This is a relatively nice distribution.

analysis_df = combined_df.copy().iloc[subset_indices]
analysis_df['cluster'] = Z
feat_labels = utils.get_feature_names_labels()
with plt.style.context("dark_background"):
    means = plot_data.groupby(plot_data["cluster"]).mean()
    sns.heatmap(means, cmap="coolwarm", annot=True, fmt=".1f", vmin=-1, vmax=1)
    plt.title("Heatmap of normalized features\nmeans per cluster", fontsize=25)
    plt.ylabel("Cluster", fontsize=20)
    plt.xlabel("Feature", fontsize=20)
    plt.xticks(labels=feat_labels+["Finished (norm)"], ticks=np.arange(len(feat_labels)+1)+0.5, rotation=90, fontsize=15)
    plt.show()
_images/eb7ee4dece676666caa47ee68b2536909f7eddc1ecdb56f49e57569b0f07b79a.png

Here, interestingly enough, we can sort of identify player behaviour through these clusters.

  1. Cluster 5 comprises players which are more often on longer paths. These players are likely experienced and capable of finishing more on average.

  2. Cluster 3 comprises player which struggle more with completing the task: they take longer, take less optimal paths, and often do not finish.

  3. Clusters 0, 1, 2, and 4 are too related to distinguish.

In the end, unfortunately, these clusters are not very interesting.

Gaussian mixed-model clustering#

Visualisation#

import numpy as np
# Select a random subset of samples
np.random.seed(42)
subset_indices = np.random.choice(X_pca.shape[0], size=SUBSAMPLE_SIZE, replace=False)
X_subset = X_pca[subset_indices]

# Run hierarchical clustering on the subset of data
model = sklearn.mixture.GaussianMixture(n_components=5)
Z = model.fit_predict(X_subset)

reducer = umap.UMAP(random_state=42, n_components=3, init='random')
X_umap = reducer.fit_transform(X_subset)
c:\Users\Cyril\anaconda3\envs\DLbiomed\lib\site-packages\umap\umap_.py:1943: UserWarning:

n_jobs value -1 overridden to 1 by setting random_state. Use no seed for parallelism.
fig = px.scatter_3d(
	x=X_umap[:, 0],
	y=X_umap[:, 1],
	z=X_umap[:, 2],
	color=Z.astype(str),
	opacity=0.8,
	labels={"color": "cluster"},
	title=f"Gaussian Mixture clustering of {SUBSAMPLE_SIZE} samples",
)
utils.format_3d_plot(fig)
display(fig)
with plt.style.context("dark_background"):
    plot_data = combined_df[utils.FEATURES_COLS_USED_FOR_CLUSTERING].copy().dropna().iloc[subset_indices]
    plot_data["cluster"] = Z
    plot_data["cluster"] = plot_data["cluster"].astype("category")
    n_features = len(plot_data.columns) - 1
    n_cols = 3
    n_rows = int(np.ceil(n_features / n_cols))
    fig, axs = plt.subplots(n_rows, n_cols, figsize=(20, 20))
    axs = axs.flatten()
    for i, col in enumerate(plot_data.columns):
        if col == "cluster":
            continue
        sns.histplot(
            data=plot_data,
            x=col,
            hue="cluster",
            ax=axs[i],
            stat="density",
            common_norm=False,
            palette="tab10",
        )
        utils.set_axis_style(axs, i)

    plt.suptitle("Histograms of normalized features colored by cluster", fontsize=20)
    plt.subplots_adjust(top=0.95)
    plt.show()
_images/f7c13ef6be0c77983be923a5799b339e501f4134b0a6331d381872669041c33e.png
with plt.style.context("dark_background"):
	combined_df["finished_normalized"] = (
		combined_df["finished"] - combined_df["finished"].mean()
	) / combined_df["finished"].std()
	plot_data = combined_df[utils.FEATURES_COLS_USED_FOR_CLUSTERING + ["finished_normalized"]].copy().iloc[subset_indices]

	plot_data.index = np.arange(len(plot_data))

	plot_data["cluster"] = Z

	plot_data["cluster"] = plot_data["cluster"].astype("category")
	n_features = len(plot_data.columns) - 1
	n_cols = 4
	n_rows = int(np.ceil(n_features / n_cols))
	fig, axs = plt.subplots(n_rows, n_cols, figsize=(20, 20))
	fig.suptitle("Boxplots of normalized features colored by cluster", fontsize=20)
	plt.subplots_adjust(top=0.95)
	axs = axs.flatten()
	for i, col in enumerate(plot_data.columns):
		if col == "cluster":
			continue
		sns.boxplot(data=plot_data, x="cluster", y=col, ax=axs[i], hue="cluster", palette="tab10")
		utils.set_axis_style(axs, i)
	plt.show()
_images/b17c8314a24b9056798fff09ddfcf7caeedba7a860309e571f7b2074f45ad271.png
clusters_names,n_points = np.unique(plot_data["cluster"], return_counts=True)
with plt.style.context("dark_background"):
    sns.barplot(x=clusters_names, y=n_points,hue=range(len(clusters_names)), palette="tab10").set_title("Number of paths per cluster", fontsize=20)
    # make background grey
    plt.gca().patch.set_facecolor("#d3d3d3")
    # update legend style
    leg = plt.gca().get_legend()
    leg.get_frame().set_facecolor("#d3d3d3")
    leg.set_title("Cluster")
    leg_title = leg.get_title()
    leg_title.set_color("black")
    # change all legend text to black
    [text.set_color("black") for text in leg.get_texts()]
    plt.xlabel("Cluster", fontsize=14)
    plt.ylabel("Number of paths", fontsize=14)
plt.show()
_images/4c63c45e313491055338047a6a5fb957bb457cf5d35fb1fdba28257da1c4e839.png

We encounter the same problem as before, but slightly different. Instead of one cluster being way too small, we have one cluster that is much too large here!

feat_labels = utils.get_feature_names_labels()
with plt.style.context("dark_background"):
    means = plot_data.groupby(plot_data["cluster"]).mean()
    sns.heatmap(means, cmap="coolwarm", annot=True, fmt=".1f", vmin=-1, vmax=1)
    plt.title("Heatmap of normalized features\nmeans per cluster", fontsize=25)
    plt.ylabel("Cluster", fontsize=20)
    plt.xlabel("Feature", fontsize=20)
    plt.xticks(labels=feat_labels+["Finished (norm)"], ticks=np.arange(len(feat_labels)+1)+0.5, rotation=90, fontsize=15)
    plt.show()
_images/e054cf1108602cd308bee441c2bddcd666aeae16c2ef1fc47e449d6f54b80723.png

The problem with this clustering method is that due to soft clustering assignment, clusters are a “blur” of sorts, and therefore the differences between them are much less pronounced. This is indeed the case, these clusters are present, but they don’t show us anything!! A single interesting cluster is found, but it’s the same as with hierarchical clustering, so there’s no point in using this method.


Conclusion#

Due to the fact that we’ve shown that these clustering methods are inferior to Leiden clustering, we will not be performing any statistical clustering analysis, nor continuing with the use of these methods.