Combining principal component analysis (PCA) and kmeans clustering seems to be a pretty popular 1-2 punch in data science. While there is some debate about whether combining dimensionality reduction and clustering is something we should ever do¹, I'm not here to debate that. I'm here to illustrate the potential advantages of upgrading your PCA + kmeans workflow to Uniform Manifold Approximation and Projection (UMAP) + Gaussian Mixture Model (GMM), as noted in my reply here.

For this demonstration, I'll be using this data set pointed out here, which includes over 100 stats for players from soccer's "Big 5" leagues. Preprocessing boils down to dropping low-minute players and zeroing out missing values (using {dplyr} and {tidyr}):

```r
# Let's only use players with 10 matches' worth of minutes
# (data frame and column names here are assumed).
players <- players_raw |>
  filter(minutes >= 10 * 90) |>
  mutate(across(where(is.double), ~replace_na(.x, 0)))
```

Also, we don't really have "labels" here (more on this later), so clustering can be useful for learning something from our data. Trying to infer something from the correlation matrix doesn't get you very far, so one can see why dimensionality reduction will be useful.

We'll be feeding the results from the dimensionality reduction (either PCA or UMAP) into a clustering method (either kmeans or GMM). So, since clustering comes last, all we need to do is figure out how to judge the clustering; that will tell us something about how "good" the combination of dimensionality reduction and clustering is overall.

I'll save you from Googling and just tell you that within-cluster sum of squares (WSS) is typically used for kmeans, and the Bayesian Information Criterion (BIC) is the go-to metric for GMM. WSS and BIC are not on the same scale, so we can't directly compare kmeans and GMM at this point. A sketch of the four combinations follows below.
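To make the pipeline concrete, here's a minimal sketch of the four combinations. I'm assuming the {uwot} package for UMAP and {mclust} for the GMM, plus the `players` data frame from the preprocessing above; the cluster count `k` and the two-component embeddings are arbitrary choices for illustration, not tuned values.

```r
library(dplyr)
library(uwot)    # UMAP implementation
library(mclust)  # GMMs via Mclust()

# Keep just the numeric stat columns, then standardize.
players_num <- select(players, where(is.numeric))
x <- scale(players_num)

# Dimensionality reduction, two components each.
pca_2d  <- prcomp(x)$x[, 1:2]
umap_2d <- umap(x, n_components = 2)

k <- 6  # cluster count, fixed arbitrarily for this sketch

# Clustering on each embedding.
km_pca   <- kmeans(pca_2d,  centers = k, nstart = 25)
km_umap  <- kmeans(umap_2d, centers = k, nstart = 25)
gmm_pca  <- Mclust(pca_2d,  G = k)
gmm_umap <- Mclust(umap_2d, G = k)

# WSS for the kmeans fits (lower is better)...
c(pca = km_pca$tot.withinss, umap = km_umap$tot.withinss)
# ...and BIC for the GMM fits ({mclust} defines BIC so higher is better).
c(pca = gmm_pca$bic, umap = gmm_umap$bic)
```

Note that WSS always decreases as you add clusters, so on its own it only supports within-method comparisons (think elbow plots), whereas {mclust}'s BIC already penalizes model complexity. That's another way of seeing why the two numbers can't be compared head-to-head.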