Packt: The Unsupervised Learning Workshop — an in-depth review
The detailed comments of a grumpy old data scientist
tl;dr A lack of added value and a confused target audience mean you’d do better with other (free) resources
Back in February, Packt gave me a copy of “The Unsupervised Learning Workshop” in return for my genuine feedback. I took some time going through it before getting caught up in other stuff. You know, life. I owe them that review. Here it is. I suspect I won’t be asked to do another.
Who’s it for?
The contents of this workshop suggest it would be much more honestly pitched if the stated target audience were business/data analysts rather than data scientists. Yes, it is true there is much job title inflation and conflation there. But nevertheless, the topics in this book are really for someone tasked with finding patterns in data, not building predictive models. The chapters on market basket analysis and hotspot analysis feel out of place and just tacked on for the sake of it as far as a title on unsupervised learning is concerned. They make a little more sense in the context of a title aimed at business analysts. I’d go as far as to say it is simply wrong to claim the audience will be wanting to learn how to implement machine learning algorithms to build predictive models. If that is you, this is definitely not the book for you!
This leads to the next misfit. I think such people (in the business analytics space) should not reasonably be expected to have a solid understanding of Python and be editing classes and functions. Much of that code editing is necessitated by the painful pedagogical experience of being forced to crawl over barbed wire. It was somewhat surprising to see instructions to download Python. The authors do mention Anaconda; I’d have gone straight to recommending that. And if you suggest your audience should have a solid understanding of Python, then perhaps you should just expect them to already have Python installed and set up. The provision of a requirements.txt file was a nice attention to detail though. The source code repo and interactive Jupyter notebook lab environments are also very welcome. However, the value of these lies in the value of the content, and that is lacking.
Chapter 1 Introduction to clustering
I was rapidly unimpressed and uninspired by the chapter contents. Starting with this chapter, I can’t say I found the explanation of what unsupervised learning is particularly edifying. The whole “imagine you’re dropped on planet Earth” story was positively confusing. For the given target audience, talking about having a dataset of customers’ purchase history and wanting to cluster the customers for some marketing campaign would get to the point much sooner and with more relevance.
The first exercise (1.01) demonstrates the poor practice, all too frequently seen in students’ work, of assuming that the algorithm found the one, true number of clusters. It wrongly, in my view, suggests the human would be wrong and the computer correct. Frankly, if you try that attitude with the marketing team in your organization, you’re likely not to get asked to segment the customer base again. Poor practice and confusion continue in exercise 1.02. The “crawl over barbed wire” approach is even more painful here for not using NumPy for sqrt and exponentiation. This is doubly confusing because NumPy is imported but not used in the code snippet. It is entirely appropriate with the stated positioning of this book to say
You’re a new data scientist, so you will have encountered NumPy and vectorised operations before. Here’s a diagram illustrating the Euclidean and Manhattan distances, and this is how to calculate each using NumPy and iterate through a K-means implementation from scratch.
If the workshop is actually pitched at analysts coming from Excel such that you need to walk them through really basic maths in Python, then don’t pitch it at people with a solid understanding of Python in the introduction.
I did like the use of make_blobs to create a simple dataset. But we then quickly jump into importing cdist from scipy.spatial.distance and calling reshape and relying on broadcasting. So within a single chapter we have bemused the audience with a rather convoluted picture of unsupervised learning, vexed the more experienced programmers by dragging them through really basic stuff, and baffled the rest by jumping into more advanced NumPy usage.
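To make that concrete, here’s roughly the treatment I have in mind — a sketch of my own on hypothetical make_blobs data, not the book’s code: both distance metrics as NumPy one-liners, then a bare-bones K-means loop that reuses one of them via broadcasting.

```python
import numpy as np
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

def euclidean(a, b):
    # sqrt of summed squared differences along the last axis
    return np.sqrt(np.sum((a - b) ** 2, axis=-1))

def manhattan(a, b):
    # sum of absolute differences along the last axis
    return np.sum(np.abs(a - b), axis=-1)

def kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # broadcast to a (n_samples, k) distance matrix
        d = euclidean(X[:, None, :], centroids[None, :, :])
        labels = d.argmin(axis=1)
        # recompute centroids; keep the old one if a cluster empties out
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

labels, centroids = kmeans(X, k=3)
```

The `X[:, None, :]` shape trick is exactly the reshape-and-broadcast idiom the book later springs on the reader unannounced.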
In summary of chapter 1, then, it’s a bit of a dog’s dinner. It felt like an incongruous mix of hands-on from-scratch coding (using a curious assortment of packages) and algebra, combined with a dose of teaching-grandma-to-suck-eggs language. There would have been more mileage in presenting the K-means algorithm with some diagrams and an implementation in NumPy before introducing Scikit-learn. Any data scientist in the stated target audience should be able to follow that and it would have been a better learning experience. The attempt to explore what clustering is struggled to show why it’s useful in a variety of problems.
Chapter 2 Hierarchical clustering
I’ve frequently got more mileage out of hierarchical clustering, so it was good to see this up next. The chapter introduction disappoints. The concept of a hierarchy is introduced okay, but the reader is then faced with the dissonance of a rather wordy layman’s explanation of the six steps in hierarchical clustering immediately followed by an example walkthrough that hits them with tables of numbers. And step 2 of the clustering process (page 43) really fluffs the concept of linkage (which “bits” of adjacent clusters you measure distances between) by needlessly throwing it in as an unnamed concept. This all really would have been so much clearer with some well thought through diagrams. Really, at this point, I just want to direct the reader to Scikit-learn’s page on hierarchical clustering.
I’ve some sympathy for the approach of providing an explicit linkage and dendrogram step (page 49), although the text is a little misleading in how it, again, kicks the topic of linkage further down the road. And if the goal of this exercise is to teach the underlying fundamentals, I’m far from convinced that the wording “the dendrogram function uses the distances calculated in Step 4 to generate a visually clean way of parsing grouped information” achieves that. This is especially the case given how the reader had to navigate a maze of numbers and then stare at a dendrogram to work out how to interpret it. Dendrograms deserved a better introduction here. Then we could more easily just say “and now we call this handy function that takes the clusters at each level in the hierarchy and plots the distance between them”. The fcluster function is pretty much just thrown in for convenience. A horizontal dashed line in the dendrogram at distance=3 would not have gone amiss here to demonstrate that we’re effectively “cutting” the tree at this “height”. But, finally, we now get to linkage. Again, the reader would fare much better in visiting the sklearn page on linkage for the explanation and figures there than the word salad here.
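For what it’s worth, the whole linkage-then-cut step reduces to a few lines of SciPy. This is my sketch on hypothetical toy data, not the book’s exercise; the `t=3` cut is exactly where that horizontal dashed line belongs on the dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=0)

Z = linkage(X, method="average")   # the linkage matrix: one merge per row
# "Cutting" the tree at height 3 is equivalent to drawing a horizontal
# dashed line at distance=3 on the dendrogram and counting the branches below it.
labels = fcluster(Z, t=3, criterion="distance")
print(len(np.unique(labels)), "clusters below the cut")
```

Crucially, change the linkage `method` and the merge distances change too, so `t=3` has to be re-evaluated each time.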
At first glance, I thought it a nice pedagogical exercise to explore the effect of different linkage methods on the clusters. Sadly, this is fundamentally flawed. The text previously used the carefully chosen distance threshold of 3 that split the data into the known number of clusters. This threshold is therefore implicitly optimized for the linkage that was used. To use the same threshold with different linkage methods and then claim this as evidence that the original linkage method was best is just nonsense. You’re calculating different distances, of course the threshold should be re-evaluated each time.
At least, in the chapter summary, they write “Your duty as a practitioner of unsupervised learning is to explore all the options and identify the solution that is both resource-efficient and performant.” — Amen to that.
Chapter 3 Neighbourhood approaches and DBSCAN
This chapter really helped me consolidate my understanding. I came to fully grasp the fact that readers should just visit the Scikit-learn documentation website and not bother with this book. The authors attempt to provide an introduction to DBSCAN in “simple words”. This is laborious and bogs itself down in fluff and a wine analogy that suggests the wine world can be broken down into white and red as well as “more exotic”, expensive varieties. And you can more easily point out customers who have the highest potential for remarketing in a campaign. Ah, wait, we’re clustering through the concept of a neighbourhood (so like k-means and hierarchical clustering then?) and can separate out one-off customers from repeat customers. So, wait, DBSCAN relies on multiple observations of subjects? What is a neighbourhood if not a cluster around a centroid? Haven’t we already covered this? Ah, is DBSCAN something like k-means but with variable sized (distance from a centroid) clusters and you can tweak this based on how many points fall inside?
There are threads of something there you can pull on, if you work at it. Compare all this with the first couple of sentences from Scikit-learn’s documentation for DBSCAN:
The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped.
Boom, I’d argue that this has already given us a more intuitive introduction than a couple of pages of talking about wine stock and infrequent customers.
It also takes the authors some time to get around to introducing the concept of a core point, which they do with a somewhat awkward
If the point under observation has data points greater than the minimum number of points in its neighborhood that make up a cluster, then that point is called a core point of the cluster. All core points within the neighborhood of other core points are part of the same cluster. However, all the core points that are not in same neighborhood are part of another cluster.
Got that? Contrast this with Scikit-learn’s third and fourth sentence on DBSCAN
The central component to the DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure) and a set of non-core samples that are close to a core sample (but are not themselves core samples).
The variable radius circles denoting a neighbourhood on pages 85 and 86 are exactly the sort of additional intuitive visualization with which a book can add value. Some carefully crafted examples illustrating parameter trade-offs in various scenarios are exactly where a workshop could contribute. Instead, we get “During an interview, you are asked to create the DBSCAN algorithm from scratch…” No, just no. No, you don’t need to “earn” the ability to use easier implementations. No, no, just no. But then it’s nice to see the illustration of the effect of varying the min_samples parameter.
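That min_samples illustration is easy to extend into the parameter trade-off exercise I’m describing — a sketch (mine, not the book’s) on a hypothetical non-convex dataset where k-means would struggle:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: density-based clustering territory
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

results = {}
for min_samples in (5, 20, 50):
    labels = DBSCAN(eps=0.2, min_samples=min_samples).fit_predict(X)
    # label -1 marks noise points, so exclude it from the cluster count
    results[min_samples] = len(set(labels)) - (1 if -1 in labels else 0)
    print(min_samples, results[min_samples], int((labels == -1).sum()))
```

Sweeping eps the same way completes the picture: too small and everything is noise, too large and the two moons merge into one cluster.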
Another good use of such a workshop book would be to set the reader up with enough knowledge and tooling so that, after demonstrating various properties of clustering with the example of eight clusters, a new problem dataset with a different number of clusters is introduced and the reader is challenged to vary the clustering parameters to best isolate them. This is where a workshop would really shine, not in rehashing Scikit-learn documentation.
If only there were more of the wisdom demonstrated on page 95: “k-means and hierarchical clustering really excel when you have some idea regarding the number of clusters in your data”. This is something with which I wholeheartedly concur; rarely should you ever be blindly applying k-means and a heuristic such as an elbow plot or a silhouette plot to tell you the “true” number of clusters.
The chapter summary starts with “In this chapter, we discussed hierarchical clustering and DBSCAN, and in what type of situations they are best employed.” I have to disagree with this. I don’t think the reader would have the best appreciation of pros and cons of the approaches covered so far in the book. If they’re anything like me, their heads are buzzing with wine orders that have lost their labels. The overview of clustering methods at the start of the Scikit-learn documentation on clustering, on the other hand…
Chapter 4 Dimensionality reduction techniques and PCA
I started this chapter with some hope remaining. It starts off describing what dimensionality is. I was initially tempted to think this was nugatory, but reminding relatively inexperienced data scientists that dimensionality can mean not only the number of (disparate) feature columns but also image or sequence data (e.g. a time series such as an EKG) is worthwhile. It would have been nice, however, to have illustrated just how such time series or image data can be regarded as living in a high-dimensional space, and how important it is to centre your data consistently. Although the EKG time series is plotted as an example, it’s not clear what you’d then do with it or what issues could arise.
The authors’ vague, hand-waving style continues. Dimensionality reduction as a noise reduction step is introduced with an example image “filtered to the first 20 most significant sources of data”. Sources of data? I can only assume here that it means the first 20 principal components, but “sources of data” can only serve to bemuse readers for the sake of avoiding diving into talk of what such components are. They are, however, happy to throw in mention of eigenvalues. It was a walk down memory lane from my PhD to see active shape modeling mentioned. It doesn’t really add much for the reader though. I hope the business analysts coming from Excel are keeping up.
It was nice to see the curse of dimensionality introduced in the context of sparsity, but I don’t feel the Pac-Man example was very effective and they failed to intuitively make the connection to how this “make[s] statistically valid correlations more difficult”.
When it comes to describing PCA, I simply cannot believe a reader would need a review of mean and standard deviation. The painful, and seemingly nugatory, steps of calculating means, standard deviations, and variances using both NumPy and pandas are just bizarre. This is compounded by the discrepancy apparent between the NumPy and pandas variances (because one uses the biased estimator and the other the unbiased), which remains unremarked, swept under the rug, so to speak, in plain sight. And all of these are never used anyway because our destination is simply to call the .cov() method from both packages. But still, it padded out a couple of pages, hey.
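The discrepancy itself is worth three lines, because it trips up newcomers constantly: NumPy defaults to the biased (population) estimator, pandas to the unbiased (sample) one. A quick sketch of my own:

```python
import numpy as np
import pandas as pd

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(np.var(x))            # 4.0   -- NumPy divides by n (ddof=0)
print(pd.Series(x).var())   # ~4.57 -- pandas divides by n - 1 (ddof=1)
print(np.isclose(np.var(x, ddof=1), pd.Series(x).var()))  # True once ddof matches
```

One sentence acknowledging `ddof` would have turned a silent inconsistency into a teaching moment.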
It could have been nice to motivate PCA as minimizing the sum of square errors and compare this to linear regression, especially nice as this would emphasize PCA’s unsupervised nature compared to linear regression’s supervised. It was nice to see the authors pointing out that using all components would return the original dataset. Or, at least, I see what they were trying to get at. However, we need to be clear what we’re doing here in order to be clear in what we would obtain. If we are performing the dimensionality reduction process, but using all components, then we would retain the full information present in the original data; we would not weed out potentially unwanted variability. However, we would not, technically, obtain the original dataset; we would have a version with rotated axes (linear combinations of the original features) that contained all the information present in the original dataset. We could be talking about reconstructing the original feature space using a subset of components, in which case using all available components would return the exact original dataset. But then that is not compressing the data, as the text implies. All of this could be well illustrated with a simple 2D scatterplot, like the length of kernel vs area of kernel for the seeds data that they use.
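The two readings can be disentangled in a few lines — my sketch on hypothetical random data, not the book’s seeds example:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

pca = PCA(n_components=5).fit(X)
scores = pca.transform(X)               # rotated axes, NOT the original data
X_back = pca.inverse_transform(scores)  # exact reconstruction: all components kept

print(np.allclose(X, scores))   # False: the scores are a (centred) rotation
print(np.allclose(X, X_back))   # True: no information was discarded

# Keeping only 1 of 5 components compresses -- and loses -- information
pca1 = PCA(n_components=1).fit(X)
X_approx = pca1.inverse_transform(pca1.transform(X))
print(np.allclose(X, X_approx))  # False: lossy reconstruction
```

Exactly this distinction, drawn on a 2D scatterplot, is what the chapter needed.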
It is odd to perform a walkthrough of a manual implementation of PCA that uses eigenvalues and eigenvectors after informing the reader that we won’t go into exactly what they are because “it is quite involved”. The approach here, then, could be summarized as “there’s this thing called the covariance matrix, we’re going to waste some time calculating means and standard deviations we won’t use before going on to just calculate the covariance matrix, which we’ll then put through this eig function we haven’t explained.”
It’s nice to see the step of subtracting the mean mentioned in the context of manual PCA vs Scikit-learn’s implementation. It’s also nice to see an example of reconstructing data using just some (one) of the components. It’s an absolute waste of page count, though, to do this both for the manual approach and using Scikit-learn’s implementation.
Chapter 5 Autoencoders
This could be quite an exciting chapter, I thought, and the de-noising example is impressive. For a target audience that has rather painfully been walked through some tedious calculations before, we’re now given a rather blithe mention, pretty much in passing, of activation functions, vanishing gradients, dead cells and bias. We then plot the form of some activation functions. I’d be interested in whether the stated target audience really quite follows what activation functions actually do. I’ve seen much, much better introductions to the basic mechanics of neural networks. I can certainly see the argument that this isn’t the place for an introductory course on the topic, but again it feels that the authors are trying to compensate for the omission of foundational theory by manually performing some calculations and plotting the results. Also, this consumes quite a few pages.
I can’t help but wonder if a more constructive approach might not have been simply to say something along the lines of the following:
Here’s a generic neural network that extracts features before then producing an output of class labels. But instead of aiming to produce an output that agrees with known class labels, we can instead define our output layer to be the same dimensionality as the input layer and aim to produce an output that matches the input as closely as possible. How is this useful? We’ve said that the intermediate part of the network is performing feature extraction. By defining the desired output to be something that closely matches the input, we’ve essentially performed dimensionality reduction. By breaking the network apart at that intermediate layer, therefore, we have access to that reduced dimension representation of the data (that then subsequently encodes a maximal amount of information).
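Since we never get a clean minimal example of that idea, here is one of my own: a hypothetical two-layer linear autoencoder in plain NumPy (nothing to do with the book’s Keras code) that shows the input-as-target trick in a handful of lines.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples in 8 dimensions that really live on a 2-D subspace
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 8))
mse0 = float(np.mean(X ** 2))  # baseline error of predicting all zeros

# Encoder and decoder: input -> 2-D bottleneck -> output of input dimensionality
W_enc = rng.normal(scale=0.1, size=(8, 2))
W_dec = rng.normal(scale=0.1, size=(2, 8))

lr = 0.02
for _ in range(3000):
    H = X @ W_enc        # encode: the reduced-dimension representation
    err = H @ W_dec - X  # decode and compare with the input itself
    # gradient descent on mean squared reconstruction error
    W_dec -= lr * H.T @ err / len(X)
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)

mse = float(np.mean((X @ W_enc @ W_dec - X) ** 2))
print(mse0, "->", mse)
```

A linear bottleneck like this recovers the same subspace as PCA; swap in nonlinear activations and you have the general autoencoder the chapter is describing.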
The many pages of using Keras to build a classifier feel incongruous in a book about unsupervised learning, especially given the sparse treatment of neural networks to this point.
The first page introducing autoencoders is a worthy entry in and of itself. It succinctly summarises what autoencoders are as well as the two main stages. One is tempted to suggest ditching the preceding pages of this chapter without any great loss to the narrative on unsupervised learning. At the end of the section on autoencoders (page 198), if the reader has followed the Keras code then they’ll have an idea what the two autoencoder networks are doing, and how they differ. Again, it feels that this could have been done without the preceding half of the chapter.
It’s nice to see convolutional neural networks introduced, and it’s also nice to see the improvement in reconstructed image quality. That this is after the reader has suddenly, and briefly, had maxpooling and upsampling thrown at them is unfortunately typical of this book’s approach.
Chapter 6 t-SNE
“The Student’s t-distribution … is often used in the Student’s t-test.” Erm… Funny that.
The authors now appear to have tired of walking the reader through painful manual calculations, which is to be welcomed. Instead we have a reasonable, quick overview of what t-SNE does. This is a light touch that could have been fruitfully employed earlier in the book. t-SNE is then performed on MNIST images that have been projected down to 30 dimensions by PCA. Given this is supposed to be a workshop, there is zero explanation of why we’re doing this. Is it because t-SNE doesn’t like noise? Or is it because t-SNE would be too slow on the full 28x28 images? Why isn’t t-SNE performed on the original MNIST images? Then the text comments that we’ve reduced the 784 dimensions down to 2 for visualization. But we haven’t done this using (just) t-SNE. PCA got us down from 784 to 30. Why did we go from 30 to 2 using t-SNE? Why didn’t we just go straight from 784 to 2 using PCA? Comparing the two scatterplots, from PCA and from t-SNE, would have (likely) been most informative. We aren’t even told how much variance those 30 PCA components explained. Hint: you don’t have to do PCA first, but it’s argued to be good practice when starting with high dimensional data.
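The missing explanation fits in a comment or two. Scikit-learn’s own TSNE documentation recommends exactly this pattern (reduce dense data to a modest number of dimensions with PCA first, to suppress noise and speed up the computation). Here’s my sketch on the small sklearn digits dataset, not the book’s MNIST exercise:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 8x8 digit images, 64 dimensions
X, y = X[:500], y[:500]               # subsample to keep the example quick

# Step 1: PCA down to 30 dimensions -- and report how much variance
# that keeps, the number the book never mentions.
pca = PCA(n_components=30, random_state=0)
X30 = pca.fit_transform(X)
print(f"variance kept by 30 components: {pca.explained_variance_ratio_.sum():.2f}")

# Step 2: t-SNE from 30 dimensions down to 2 for visualisation.
X2 = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X30)
```

Plotting `X2` coloured by `y`, next to the first two PCA components alone, would have made the comparison the book never draws.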
In activity 6.01, with wine data, t-SNE again uses PCA as a dimensionality reducing preprocessing step. The reader will have no idea why they’re doing PCA every time, let alone what the desired number of components is. They’re better off just reading Tyler’s article linked to above.
The reader following the text and exercises should at least end up with a feel for the effect of perplexity and number of iterations. This is a nice outcome and, for all the other flaws in this and earlier chapters, points to something that such a workshop could have brought to the topics covered. It’s a shame it misses that mark all too frequently.
The chapter summary has a number of points I’d take issue with. It could give a reader the unfortunate impression that PCA produces high-dimensional representations of data, rather than the opposite. Yes, the number of PCA components we wish to keep may well be >> 2 and so hard to visualize as they are, but this will still be a considerably lower dimensionality than that of the original data. I personally dislike the wording that says PCA and t-SNE were able to cluster the classes. They are not clustering algorithms; they merely produce a low dimensional representation. Any clustering was done by eye using known labels. This may sound like a quibble, but I know many newcomers are frequently confused by such semantics.
Chapter 7 Topic Modelling
There is the promise of Latent Dirichlet Allocation and non-negative matrix factorization in this chapter. How does it fare?
The chapter has a reasonable introduction to topic modelling. A potential real value addition from Packt is the Binder-powered notebooks. Thus it feels like quite a waste to have page after page of the book basically comprise a walkthrough of a notebook. On balance, I felt this chapter didn’t provide a bad, quick intro to why we want to clean and preprocess text data. I’m less decided on how appropriate it was to include NLP in the book. On the one hand, it could be a good little glimpse of an introduction to the steps involved in NLP. On the other hand, a decent introduction really needs a longer treatment, with the result that this is somewhat rushed and tries to cram too much in.
As a non-NLP expert, I was looking forward to this section. LDA has some pretty deep probabilistic underpinnings that are going to be hard to do justice to in a workshop-style publication. And then I wonder if this topic really needs to be here. Would a workshop on unsupervised learning be woefully inadequate without it? Probably not. Will introducing it risk confusing the novice data scientist? Probably.
Seven chapters in, I start off asking myself “in what way will the sklearn documentation be better?” The authors immediately mention Blei, Ng, and Jordan’s 2003 paper, but a link would have been nice. The Scikit-learn documentation starts off with a nicer diagram than the one the workshop authors put on a following page. It is clear from both the Scikit-learn page and this book that there are quite a few moving parts to LDA. The generative process the authors give introduces a beta that isn’t explained, although it appears in the diagram on the following page. It isn’t until the page after that that a meaning for beta is given. The Scikit-learn documentation does give a few useful references as links, including the 2003 paper. Now it is clear that the workshop authors lifted the outline of the generative process directly from the 2003 paper. The paper, however, takes a much more readable, and rigorous, approach to explaining terms. Thus, if I was to recommend a path to someone to learn about LDA, I’d be inclined to point them, yet again, to the Scikit-learn documentation (and its associated reference links, including the 2003 paper). If you’re going to throw LDA at someone, you’re setting yourself up with quite a task to explain the theoretical foundations to at least some extent. This workshop has proven itself on multiple occasions to struggle with getting this balance right.
This is really, really not a well thought through pedagogical roadmap. A far better approach would have been to start off talking about how to represent a corpus numerically, leading to TF-IDF. As we’ve already covered distance metrics, an example of finding similar documents using cosine similarity might have been nice here.
The explanation of perplexity gives quite a nice intuitive motivation. But then using it in a homebrewed function to find an optimum number of topics feels like teaching bad practice. A workshop publication like this would have been so much better off jumping into LDA and treating it as a black box, and doing practically useful things such as using grid search to find the best number of topics and then extracting the top words associated with each topic etc. This approach is demonstrated in this article, which was really easy to find.
Non-negative matrix factorization.
Again, this section suffers from a wordy background and lack of rigorous development before ducking out with “so let’s skip the derivatives and just state the updates”. The cynically minded might think this is because the authors just wanted to justify putting some more equations in the text. Really, it would be quite sufficient here to just introduce NMF as being similar to PCA but it assumes the data and the components are non-negative. Then a workshop could crack on doing something interesting and useful with it. It’s not like we even do anything manually with the update rules anyway!
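To show how little is needed: a sketch (mine, on hypothetical count data) that treats NMF exactly as PCA-with-a-non-negativity-constraint and gets on with it:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Hypothetical non-negative data, e.g. a small document-term count matrix
X = rng.poisson(lam=1.0, size=(20, 10)).astype(float)

model = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)   # non-negative scores, cf. PCA's scores
H = model.components_        # non-negative components, cf. PCA's loadings

# The constraint PCA doesn't impose -- and why NMF parts are interpretable
print((W >= 0).all() and (H >= 0).all())
print(np.linalg.norm(X - W @ H) / np.linalg.norm(X))  # relative reconstruction error
```

From there a workshop could do something genuinely interesting, like factorizing a TF-IDF matrix and reading off the topics.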
Chapter 8 Market basket analysis
This workshop is on unsupervised learning, right? But it throws in some NLP. Okay, it’s an application area. Now what is market basket analysis?
A foundational and reliable algorithm for analyzing transaction data. It’s ubiquitous and undeniably impactful in the retail space… It’s important because it provides insight into why people buy certain items together and whether those item combinations can be leveraged to hasten growth and or increase profitability.
This just doesn’t feel like it belongs in a title on unsupervised learning. Retail analytics, business analytics, data analytics? Fine. Yes, there will be data scientists with stretched job titles doing such business analytics. Yes, this is a problem space without predefined labels. But to put this in a text on unsupervised learning just doesn’t feel appropriate. I’ll admit that by chapter 8 I’m not feeling very charitable when they write
The general idea behind market basket analysis is to identify and quantify which items, or groups of items, are purchased together frequently enough to drive insight into customer behavior and product relationships.
Isn’t this just analytics? I apologise to those people adding real value to their businesses by doing just analytics. But I’ll refer back to the preface of this book, which claims
if you are a data scientist who is just getting started and want to learn how to implement machine learning algorithms to build predictive models, then this book is for you.
Just not this chapter, especially not this chapter, perhaps.
We have, on page 335, pretty much an entire page taken up with echoing something as trivial as the output of the first 10 rows of a dataframe and its shape. A book like this should never be allowed near a printing press; what a waste of paper. If such steps are merely echoing the output of a notebook in a lab online, then keep it in the notebook in the online lab!
From page 340 and for a few pages, the reader is basically forced through some rather tedious, basic data wrangling steps. This is not adding value for a workshop on unsupervised learning. On page 354:
Imagine you work for an online retailer. You are given all the transaction data from the last month and told to find all the item sets appearing in at least 1% of the transactions. Once the qualifying item sets are identified, you are subsequently told to identify the distribution of the support values. The distribution of support values will tell all interested parties whether groups of items exist that are purchased together with high probability as well as the mean of the support values. Let’s collect all the information for the company’s leadership and strategists.
This is a great business intelligence / data analyst question. What’s it got to do with unsupervised (machine) learning?
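For the record, the support calculation the exercise describes needs only the standard library. A sketch on hypothetical toy transactions (I’ve raised the exercise’s 1% threshold to 50% so the toy data actually produces frequent pairs):

```python
from itertools import combinations
from collections import Counter

# support(itemset) = fraction of transactions containing every item in the set
transactions = [
    {"milk", "bread"}, {"milk", "eggs"}, {"bread", "eggs", "milk"},
    {"beer", "crisps"}, {"milk", "bread", "butter"}, {"beer", "bread"},
]

counts = Counter()
for t in transactions:
    for size in (1, 2):                      # singletons and pairs
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1

support = {k: v / len(transactions) for k, v in counts.items()}
frequent = {k: s for k, s in support.items() if s >= 0.5}
print(frequent)
```

Useful, certainly. Machine learning? That remains the question.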
What’s the upshot of this chapter? A great panel question: When does business analytics become unsupervised learning? Discuss.
Chapter 9 Hotspot analysis
Is non-parametric density estimation within the remit of unsupervised learning? Answers on a postcard. Is it appropriate for a workshop on unsupervised learning? I don’t think so. To put it another way, there’s so much to business analytics that a practitioner would never be interested in this workshop. There’s so much to unsupervised learning, in the sense of exploring and interpreting clusters, that this workshop adds little to no value.
This is a rather confused effort that fails to clearly advance an introduction to unsupervised learning concepts. It is a chimera that cannot decide whether it is a workshop on unsupervised learning for data scientists or on business analytics for data analysts. It also contains numerous examples of poor practice and outright technical and procedural errors. If you want a good, readable, introduction to unsupervised learning and associated algorithms, you’re much better off just visiting the Scikit-learn documentation site. This book lacks both real fundamentals and useful real-world case studies. It goes heavy on laborious, frequently nugatory, manual implementations of what are often not key steps, and then fails to really deliver much added value from the subsequent exercises. I definitely want to celebrate the code and the live labs, but then the Scikit-learn documentation also gives you those! Overall, I would say save your money. For a much better introduction and discussion, as well as code and demos you can run live in a Jupyter notebook using binder, visit the Scikit-learn pages (for example, here’s topic extraction with non-negative matrix factorization and Latent Dirichlet Allocation). This workshop title is inadequate for teaching the fundamentals and provides uninteresting worked examples.