Similar posts recommendation with Doc2Vec - Part II

Marton Trencseni - Sun 04 December 2022 - machine-learning

Introduction

In the previous post, I used the Doc2Vec neural network architecture to compute the similarities between my blog posts. Using these, I am planning to add "Articles You May Like" recommendations to Bytepawn (this blog), based on the semantic similarity of my blog posts. In this second post, I investigate the results further. The code is up on GitHub.

Similarity matrix

The previous post left off with showing the raw similarity matrix:

[Figure: heatmap of the raw similarity matrix]

I also noted the (red) square of high similarity scores in the above plot, from id=84 to id=89. That's the cluster of articles about Probabilistic spin glasses: they all share the same filename prefix, so their integer ids are sequential, they appear as consecutive rows and columns in the matrix, and they have very high similarity scores.
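To double-check, we can print that block of the matrix directly. A minimal sketch, reusing the similarity_matrix and idx_lookup variables from Part I:

# print the pairwise similarity scores within the id=84..89 spin glass block
for i in range(84, 90):
    print(idx_lookup[i], [round(similarity_matrix[i][j], 2) for j in range(84, 90)])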

Clusters

The matrix visualization doesn't work well: it only visually shows clusters that happen to be sequentially ordered, like above. We need to look at it as a graph, and find connected sub-graphs of similar articles. Plotting the entire graph is pointless, as it would have, for $N=141$ posts, $N^2=19\,881$ edges — nothing would be discernible. We need to reduce the number of edges. After some playing around, I found that keeping only similarities $s(a, b) > 0.55$ works well:

similarity_matrix_truncated = [[y if y > 0.55 else 0 for y in x] for x in similarity_matrix]
for i in range(len(similarity_matrix_truncated)):
    similarity_matrix_truncated[i][i] = 0
plot_matrix(similarity_matrix_truncated)
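Here plot_matrix() is the heatmap helper carried over from Part I; a minimal version, assuming matplotlib and a list-of-lists matrix, might look like this:

import matplotlib.pyplot as plt

def plot_matrix(m):
    # draw the similarity matrix as a heatmap
    plt.figure(figsize=(10, 10))
    plt.imshow(m, cmap='hot')
    plt.colorbar()
    plt.show()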

The truncation code above also zeros out the diagonal elements, to avoid plotting self-loops on the graph later. First, let's see what this truncated matrix looks like as a heatmap:

[Figure: heatmap of the truncated similarity matrix (s > 0.55)]

Looks promising. Now let's plot the graph:

import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

def plot_graph(m):
    # treat the truncated similarity matrix as a weighted adjacency matrix
    labels = list(range(len(m)))
    df = pd.DataFrame(m, index=labels, columns=labels)
    G = nx.from_pandas_adjacency(df)
    # spring layout pulls similar posts close together; node labels are the integer post ids
    plt.figure(figsize=(15, 15))
    nx.draw_networkx(G, nx.spring_layout(G), node_size=0, arrows=False, edge_color='lightgray', font_size=8)
    plt.show()

plot_graph(similarity_matrix_truncated)

Output:

[Figure: graph of posts connected by similarity edges with s > 0.55]

By reading off the ids for some of the clusters, we can check what they are:

print([
        [idx_lookup[63], idx_lookup[64]],
        [idx_lookup[3],  idx_lookup[9]],
        [idx_lookup[37], idx_lookup[47], idx_lookup[48], idx_lookup[106], idx_lookup[133]],
])

Output:

[
    # good cluster, these 2 articles are about GPT-3
    [
        'how-good-a-data-scientist-is-gpt-3',
        'how-good-a-particle-physicist-is-gpt-3'
    ],
    # good cluster, these 2 articles are about A/B testing on social networks
    [
        'ab-testing-and-network-effects',
        'ab-testing-on-social-networks'
    ],
    # good cluster, these 5 articles are about entropy
    [
        'cross-entropy-joint-entropy-conditional-entropy-and-relative-entropy',
        'entropy-in-data-science',
        'entropy-of-an-ideal-gas-with-coarse-graining',
        'the-physical-sackur-tetrode-entropy-of-an-ideal-gas',
        'what-is-the-entropy-of-a-fair-coin-toss'
    ]
]

The results continue to look promising. Instead of checking these clusters manually, let's automate it:

def clusters(m):
    # start with each post in its own singleton cluster
    connected_subgraphs = set([frozenset([i]) for i in range(len(m))])
    def update_subgraphs(i, j):
        # merge every cluster that contains either endpoint of the edge (i, j)
        join_targets = [g for g in connected_subgraphs if i in g or j in g]
        new_subgraph = frozenset({k for g in join_targets for k in g})
        for g in join_targets:
            connected_subgraphs.remove(g)
        connected_subgraphs.add(new_subgraph)
    for i in range(len(m)):
        for j in range(len(m)):
            if m[i][j] > 0:
                update_subgraphs(i, j)
    return connected_subgraphs

def print_clusters(connected_subgraphs):
    for g in connected_subgraphs:
        if len(g) == 1:
            for i in g:
                print(f'Singleton: {idx_lookup[i]}')
    for g in connected_subgraphs:
        if len(g) > 1:
            print('Cluster:')
            for i in g:
                print({idx_lookup[i]})
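As an aside, since plot_graph() already converts the matrix into a networkx graph, the same grouping can presumably be read off with networkx's built-in connected components; a minimal sketch (clusters_nx is a hypothetical helper name):

import pandas as pd
import networkx as nx

def clusters_nx(m):
    # build the same graph as plot_graph(), then read off its connected components
    labels = list(range(len(m)))
    G = nx.from_pandas_adjacency(pd.DataFrame(m, index=labels, columns=labels))
    return [frozenset(c) for c in nx.connected_components(G)]

This should yield the same subgraphs as clusters(), singletons included, so print_clusters() works on its output unchanged.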

And now let's print them:

c = clusters(similarity_matrix_truncated)
print_clusters(c)

Output:

Singleton: understanding-facebook-planout-ab-testing-framework
Singleton: ab-tests-moving-fast-vs-being-sure
Singleton: the-culture-map
Singleton: more-data-scientists-should-learn-sql
...
Cluster:
{'aa-testing-and-false-positives-with-cuped'}
{'correlations-seasonality-lift-and-cuped'}
{'reducing-variance-in-ab-testing-with-cuped'}
{'reducing-variance-in-conversion-ab-testing-with-cuped'}
{'ab-testing-and-multi-armed-bandits'}
{'multi-armed-bandits-and-false-positives'}
{'ab-testing-and-the-historic-lift-paradox'}
{'five-ways-to-reduce-variance-in-ab-testing'}
Cluster:
{'how-good-a-particle-physicist-is-gpt-3'}
{'how-good-a-data-scientist-is-gpt-3'}
Cluster:
{'exploring-prior-beliefs-with-mcmc'}
{'food-deliveries-bayes-and-computational-statistics'}
Cluster:
{'solving-openai-gym-classic-control-problems-with-pytorch'}
{'using-simulated-self-play-to-solve-all-openai-gym-classic-control-problems-with-pytorch'}
{'solving-the-cartpole-reinforcement-learning-problem-with-pytorch'}
...

It still looks good! However, this approach isn't the most representative: many posts are left as singletons, because all of their similarity scores fall below the $s(a, b) > 0.55$ cutoff, even though in practice the recommendation engine would still return the top $n$ posts for them. A better approach is to keep the top $n$ recommendations (edges) for each post (node), irrespective of weight. Let's start with $n=3$:

# another approach, to avoid singletons:
# for each post, keep only its top n most similar posts
def similarity_matrix_top_n(n=3):
    m = [[0 for _ in range(len(posts))] for _ in range(len(posts))]
    for i in range(len(m)):
        # similar_posts() is from Part I; assuming it returns (post, similarity)
        # pairs sorted by descending similarity, slicing keeps the top n
        sp = similar_posts(i)[:n]
        # look up each similar post's integer id, which is its column in the matrix
        idxs = [tagged_posts[p[0]][1][0] for p in sp]
        for k, j in enumerate(idxs):
            m[i][j] = sp[k][1]
    return m

similarity_matrix_truncated = similarity_matrix_top_n(n=3)
plot_matrix(similarity_matrix_truncated)
plot_graph(similarity_matrix_truncated)

The heatmap:

[Figure: heatmap of the top n=3 similarity matrix]

The graph:

[Figure: graph of posts with top n=3 similarity edges per post]

The graph is now connected, with no isolated nodes, which makes sense, since every post contributes 3 edges. The challenge is that it's again hard to make sense of the adjacencies. Let's re-run, but only keep the top $n=2$:

similarity_matrix_truncated = similarity_matrix_top_n(n=2)
plot_graph(similarity_matrix_truncated)

[Figure: graph of posts with top n=2 similarity edges per post]

Most of the graph is still one big connected cluster, so let's try again with $n=1$:

similarity_matrix_truncated = similarity_matrix_top_n(n=1)
plot_graph(similarity_matrix_truncated)

[Figure: graph of posts with top n=1 similarity edge per post]

This now looks interesting. There are reasonably sized clusters, and all posts are in a cluster. Let's look at them explicitly:

c = clusters(similarity_matrix_truncated)
print_clusters(c)

Output (note that there are no singletons):

Cluster:
{'building-a-toy-python-dataclass-decorator'}
{'rxe-literate-and-composable-regular-expressions'}
{'building-a-simple-python-enum-class-part-ii'}
{'building-a-simple-python-enum-class'}
Cluster:
{'automating-a-call-center-with-machine-learning'}
{'backtraced-growth-accounting'}
{'warehouse-locations-with-kmeans'}
{'machine-learning-at-fetchr'}
{'calibration-curves-for-delivery-prediction-with-scikit-learn'}
{'arabic-name-classification-with-scikit-learn-and-pytorch'}
Cluster:
{'the-german-tank-problem-in-world-war-ii'}
{'beyond-the-central-limit-theorem'}
{'ab-testing-and-the-ttest'}
{'ab-testing-and-the-ztest'}
{'ab-testing-and-the-central-limit-theorem'}
...

Note that I'm not implying here that I would only use the top $n=1$ recommendation. This is just to see what kind of clusters are formed by the similarity scores, and whether it makes sense — it's verification.

Conclusion

The similarities look good — I see no reason to invest any more time into this part. In Part III, I will add the recommendations to the website!
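The per-post lookup itself can be built directly from similar_posts(); a hypothetical sketch (the recommendations() helper name and the choice of n are placeholders), reusing posts, tagged_posts and idx_lookup from Part I, and again assuming similar_posts() returns results sorted by descending similarity:

def recommendations(n=3):
    # map each post's slug to the slugs of its top n most similar posts
    recs = {}
    for i in range(len(posts)):
        sp = similar_posts(i)[:n]
        recs[idx_lookup[i]] = [idx_lookup[tagged_posts[p[0]][1][0]] for p in sp]
    return recs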