Similar posts recommendation with Doc2Vec - Part III
Marton Trencseni - Sat 10 December 2022 - machine-learning
Introduction
In the previous posts, I used the Doc2Vec neural network architecture to compute the similarities between my blog posts, and explored the quality of the scores:
- Part I: Computing the scores with Doc2Vec using the gensim library
- Part II: Exploring the results with heatmaps and graphs
In this final post, I show how I added the Articles You May Like recommendation section to the blog. It's live: you can see it by scrolling down to the bottom of this page (or any other page). There is nothing sophisticated happening in this post, just a Python script and some minimal HTML/CSS/Javascript.
Approach
When thinking about how to go about this, I identified 3 options:
- Compute recommendations when the static blog is being generated and emit the recommendations as part of the statically rendered HTML
- Compute recommendations as a static .js file which is loaded by the browser, and show the recommendations with some Javascript code modifying the DOM
- Dynamically request the recommendations for each article over an API, and show the recommendations with some Javascript code modifying the DOM
I decided against the first one because I didn't want to slow down my edit/publish/check workflow: generating the recommendations takes about a minute on the server hosting the blog, while the static blog generation is just a few seconds. The third one is unnecessarily complicated, so that left me with the second option: emit a static .js file, and write some Javascript code to show the recommendations.
Emitting the recommendations
First I created a venv on my server with all the needed Python libraries installed: pip install numpy networkx nltk gensim. I also wanted to use the same Python version (3.9) as on my other computers, so I had to install it from source, which took a bit of fiddling on my server's somewhat dated distro.
Once I had the venv set up, I converted my earlier code to a stand-alone script, recommend.py, which emits the .js file with the recommendations:
import os
import json
import numpy as np
from nltk.tokenize import word_tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def build_post_struct(lines):
    # metadata (slug, title, date) is looked up in the first 10 lines of each post
    slug = next(line[len('slug:'):].strip() for line in lines[:10] if line.lower().startswith('slug:'))
    title = next(line[len('title:'):].strip() for line in lines[:10] if line.lower().startswith('title:'))
    date = next(line[len('date:'):].strip() for line in lines[:10] if line.lower().startswith('date:'))
    return {'slug': slug, 'title': title, 'date': date, 'contents': '\n'.join(lines[10:]).lower()}

def similar_posts(idx, n=3):
    # request n+1 neighbours so that n remain after dropping the post itself
    results = model.dv.most_similar(positive=[model.infer_vector(tagged_posts[idx][0])], topn=n+1)
    results = [i for i, _ in results if idx != i]
    return results[:n]

BLOG_DIR = "content/"
paths = [f'{BLOG_DIR}/{f}' for f in os.listdir(BLOG_DIR) if f.lower().endswith(".md")]
posts = [build_post_struct(open(path, encoding="utf8").read().splitlines()) for path in paths]
tagged_posts = {idx : TaggedDocument(word_tokenize(post['contents']), [idx]) for idx, post in enumerate(posts)}
model = Doc2Vec(tagged_posts.values(), vector_size=100, alpha=0.025, min_count=1, workers=4, epochs=100)
posts = {idx : post for idx, post in enumerate(posts)}
for idx, post in posts.items():
    # the post body is not needed in the browser, only the metadata and the recommendations
    del post['contents']
    post['recommendations'] = similar_posts(idx, n=3)
# emit a Javascript assignment, so the file can be loaded with a plain <script> tag
json_file = f'recommendations = {json.dumps(posts)};'
with open("flex/static/js/recommendations.js", "w") as f:
    f.write(json_file)
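To make the metadata handling concrete, here is a minimal sketch that runs build_post_struct() on a made-up post. The sample text, including its field values, is hypothetical and only assumes that the slug, title and date lines sit within the first 10 lines, as the script expects:

```python
def build_post_struct(lines):
    # same logic as in recommend.py: metadata is looked up in the first 10 lines
    slug = next(line[len('slug:'):].strip() for line in lines[:10] if line.lower().startswith('slug:'))
    title = next(line[len('title:'):].strip() for line in lines[:10] if line.lower().startswith('title:'))
    date = next(line[len('date:'):].strip() for line in lines[:10] if line.lower().startswith('date:'))
    return {'slug': slug, 'title': title, 'date': date, 'contents': '\n'.join(lines[10:]).lower()}

# Hypothetical post, made up for illustration: 10 header lines, then the body.
sample = """Title: A Sample Post
Date: 2022-12-10
Modified: 2022-12-10
Category: Blog
Tags: example
Slug: a-sample-post
Authors: Someone
Summary: Not a real article.
Status: draft
Lang: en
First Line Of The Body.
Second line."""

post = build_post_struct(sample.splitlines())
print(post['slug'])      # a-sample-post
print(post['title'])     # A Sample Post
print(post['date'])      # 2022-12-10
print(post['contents'])  # the two body lines, lowercased
```

Two consequences of this approach are worth keeping in mind. Everything in the first 10 lines is excluded from the contents, whether it is metadata or not. And a post missing any of the three fields makes next() raise StopIteration, so recommend.py stops with an error before its output file is written.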
The output is live here. With some pretty-printing, it looks like:
recommendations = {
    "0": {
        "slug": "mnist-pixel-attacks-with-pytorch",
        "title": "MNIST pixel attacks with Pytorch",
        "date": "2019-06-01",
        "recommendations": [
            87,
            117,
            102
        ]
    },
    "1": {
        "slug": "five-ways-to-reduce-variance-in-ab-testing",
        "title": "Five ways to reduce variance in A/B testing",
        "date": "2021-09-19",
        "recommendations": [
            85,
            5,
            54
        ]
    },
    ...
};
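Since the file is just a JSON object wrapped in a Javascript assignment, it can also be read back in Python for a quick sanity check. The following is a sketch, not part of recommend.py: the helper names parse_recommendations_js() and find_problems() and the three sample posts are made up for illustration.

```python
import json

def parse_recommendations_js(text):
    """Strip the 'recommendations = ' prefix and the trailing ';', then parse the JSON."""
    prefix = "recommendations = "
    body = text.strip()
    if not (body.startswith(prefix) and body.endswith(";")):
        raise ValueError("not in the expected 'recommendations = {...};' form")
    return json.loads(body[len(prefix):-1])

def find_problems(posts):
    """Return human-readable problems; an empty list means the data is consistent."""
    problems = []
    for idx, post in posts.items():
        for rec in post["recommendations"]:
            key = str(rec)  # JSON object keys are strings, recommendations are ints
            if key not in posts:
                problems.append(f"post {idx}: unknown index {rec}")
            elif key == idx:
                problems.append(f"post {idx}: recommends itself")
    return problems

# Made-up data in the same shape as the emitted file.
sample = {
    0: {"slug": "post-a", "title": "Post A", "date": "2020-01-01", "recommendations": [1, 2]},
    1: {"slug": "post-b", "title": "Post B", "date": "2021-01-01", "recommendations": [0, 2]},
    2: {"slug": "post-c", "title": "Post C", "date": "2022-01-01", "recommendations": [0, 1]},
}
sample_js = f'recommendations = {json.dumps(sample)};'

posts = parse_recommendations_js(sample_js)
problems = find_problems(posts)
print(problems)  # []
```

Note that json.dumps() turns the integer dictionary keys into strings, while the indices inside each recommendations list stay integers, which is why the check converts with str() before the lookup. To run it against the real output, the contents of flex/static/js/recommendations.js would be passed in instead of sample_js.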
Makefile
Then I added a new target to the blog's Makefile:
...
clean:
	[ ! -d $(OUTPUTDIR) ] || rm -rf $(OUTPUTDIR)
output: $(INPUTDIR)/* *.py Makefile
	@$(PELICAN) $(INPUTDIR) -o $(OUTPUTDIR) -s $(PUBLISHCONF) $(PELICANOPTS)
...
clone:
	git clone git@github.com:mtrencseni/mtrencseni.github.io.git
recommend:           # <----------
	@./recommend.py  #
publish: output
	@cp -R $(OUTPUTDIR)/* /var/www/bytepawn.com/
...
So my modified workflow is now:
- continue to use make publish to publish new and/or changed articles to the blog
- once an article is final, update the recommendations with make recommend
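Because the two steps are decoupled, a newly published article has no entry in the recommendations until make recommend runs again. A small check could list such articles. This is only a sketch: missing_slugs() is a hypothetical helper and the data is made up.

```python
def missing_slugs(content_slugs, recommendations):
    """Slugs that exist in the content directory but not in the emitted data."""
    emitted = {post["slug"] for post in recommendations.values()}
    return sorted(set(content_slugs) - emitted)

# Made-up data for illustration, in the shape of the emitted file.
recommendations = {
    "0": {"slug": "first-post", "title": "First post", "date": "2022-01-01", "recommendations": [1]},
    "1": {"slug": "second-post", "title": "Second post", "date": "2022-02-01", "recommendations": [0]},
}
content_slugs = ["first-post", "second-post", "brand-new-post"]

pending = missing_slugs(content_slugs, recommendations)
print(pending)  # ['brand-new-post']
```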
Javascript
First, load the recommendations.js file on each article's page:
<script src="recommendations.js"></script>
Next, create a <div> that will contain the recommendations:
<div id="similar_articles" style="display:none; ..."></div>
Finally, the Javascript to render the recommendations:
var body = document.getElementsByTagName("body")[0];
window.onload = function () { show_recommendations("{{ article.slug }}"); }

function get_post_by_slug(slug) {
    for (const k in recommendations) {
        if (recommendations[k]["slug"] == slug)
            return recommendations[k];
    }
    return null;
}

function get_post_by_idx(idx) {
    return recommendations[idx];
}

function show_recommendations(slug) {
    // recommendations.js did not load: leave the <div> hidden
    if (typeof recommendations == 'undefined')
        return;
    const post = get_post_by_slug(slug);
    // the current page has no entry: leave the <div> hidden
    if (post == null)
        return;
    var div_similar_articles = document.getElementById("similar_articles");
    div_similar_articles.innerHTML = "<div style=\"margin-bottom:5px;\"><b>Other Articles You May Like:</b></div>";
    for (const idx of post["recommendations"]) {
        const rp = get_post_by_idx(idx);
        const url = "https://bytepawn.com/" + rp["slug"] + ".html";
        const title = rp["title"];
        const year = rp["date"].slice(0, 4);
        div_similar_articles.innerHTML += "<li><a href=\"" + url + "\">" + title + " (" + year + ")</a></li>";
    }
    div_similar_articles.style.display = "block";
}
Note the if (typeof recommendations == 'undefined') and if (post == null) checks: if the recommendations are missing, or the current page is not part of the recommendations (because I forgot to run make recommend), the page does not break, and the <div> remains hidden in its initial style="display:none" state.
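The same lookup can be mirrored in Python, which is handy for checking the emitted data without opening a browser. This is a sketch under assumptions: recommended_links() is a hypothetical helper, the sample posts are made up, and the URL scheme is copied from the Javascript above.

```python
BASE_URL = "https://bytepawn.com/"  # same prefix as in the Javascript

def recommended_links(recommendations, slug):
    """Return (url, label) pairs for a page, or [] if the page has no entry."""
    post = next((p for p in recommendations.values() if p["slug"] == slug), None)
    if post is None:
        return []  # mirrors the hidden <div>: nothing to show
    links = []
    for idx in post["recommendations"]:
        rp = recommendations[str(idx)]  # JSON object keys are strings
        label = f'{rp["title"]} ({rp["date"][:4]})'
        links.append((BASE_URL + rp["slug"] + ".html", label))
    return links

# Made-up data for illustration, in the shape of the emitted file.
recommendations = {
    "0": {"slug": "post-a", "title": "Post A", "date": "2020-05-01", "recommendations": [1]},
    "1": {"slug": "post-b", "title": "Post B", "date": "2021-06-01", "recommendations": [0]},
}

links = recommended_links(recommendations, "post-a")
print(links)  # [('https://bytepawn.com/post-b.html', 'Post B (2021)')]
print(recommended_links(recommendations, "not-yet-indexed"))  # []
```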
Result
You can see the result on this page, just below this line, below the tags. Enjoy!