Similar posts recommendation with Doc2Vec - Part III

Marton Trencseni - Sat 10 December 2022 - machine-learning

Introduction

In the previous posts, I used the Doc2Vec neural network architecture to compute the similarities between my blog posts, and explored the quality of the scores.

In this final post, I show how I added the Articles You May Like recommendation section to the blog. It's live: scroll down to the bottom of this page (or any other page) to see it. There is nothing sophisticated happening in this post, just a Python script and some minimal HTML/CSS/Javascript.

Approach

When thinking about how to go about this, I identified 3 options:

Compute recommendations when the static blog is being generated and emit the recommendations as part of the statically rendered HTML
Compute recommendations as a static .js file which is loaded by the browser, and the recommendations are shown with some Javascript code modifying the DOM
Dynamically request the recommendations for each article over an API, and show the recommendations with some Javascript code modifying the DOM

I decided against the first one because I didn't want to slow down my edit/publish/check workflow: generating the recommendations takes about a minute on the server hosting the blog, while the static blog generation takes just a few seconds. The third one is unnecessarily complicated, so that left me with the second option: emit a static .js file, and write some Javascript code to show the recommendations.

Emitting the recommendations

First I created a venv on my server and installed all the needed Python libraries with pip install numpy networkx nltk gensim. I also wanted to use the same Python version (3.9) as on my other computers, which took a bit of fiddling: my server's distro is somewhat dated, so I had to install Python from source.

Once I had the venv set up, I converted my earlier code to a stand-alone script recommend.py, which emits the .js file with the recommendations:

import os
import json
from nltk.tokenize import word_tokenize  # needs NLTK's tokenizer data to be downloaded once
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def build_post_struct(lines):
    # the Pelican metadata (slug, title, date) is expected within the first 10 lines of each post
    slug = next(line[len('slug:'):].strip() for line in lines[:10] if line.lower().startswith('slug:'))
    title = next(line[len('title:'):].strip() for line in lines[:10] if line.lower().startswith('title:'))
    date = next(line[len('date:'):].strip() for line in lines[:10] if line.lower().startswith('date:'))
    # everything after the first 10 lines is treated as the post body
    return {'slug': slug, 'title': title, 'date': date, 'contents': '\n'.join(lines[10:]).lower()}

def similar_posts(idx, n=3):
    # ask for n+1 hits, because the post itself is usually among its own nearest neighbors
    results = model.dv.most_similar(positive=[model.infer_vector(tagged_posts[idx].words)], topn=n+1)
    results = [i for i, _ in results if idx != i]
    return results[:n]

BLOG_DIR = "content/"
paths = [os.path.join(BLOG_DIR, f) for f in os.listdir(BLOG_DIR) if f.lower().endswith(".md")]
posts = []
for path in paths:
    with open(path, encoding="utf8") as f:
        posts.append(build_post_struct(f.read().splitlines()))

# one TaggedDocument per post, tagged with the post's index
tagged_posts = {idx : TaggedDocument(word_tokenize(post['contents']), [idx]) for idx, post in enumerate(posts)}
model = Doc2Vec(tagged_posts.values(), vector_size=100, alpha=0.025, min_count=1, workers=4, epochs=100)

posts = {idx : post for idx, post in enumerate(posts)}
for idx, post in posts.items():
    del post['contents']  # the post body is not needed in the emitted file
    post['recommendations'] = similar_posts(idx, n=3)

# emit a Javascript assignment, so the browser can load the file with a plain <script> tag
json_file = f'recommendations = {json.dumps(posts)};'

with open("flex/static/js/recommendations.js", "w") as f:
    f.write(json_file)
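One detail in similar_posts() that is easy to miss is the topn=n+1 request followed by filtering out idx. Here is a minimal sketch of that pattern on its own. It does not use gensim: a brute-force cosine similarity over made-up 2-dimensional vectors stands in for model.dv.most_similar(), and the names cosine, top_n_excluding_self and toy_vectors are hypothetical, introduced only for this illustration.

```python
import math

def cosine(u, v):
    # plain cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_n_excluding_self(vectors, idx, n=3):
    # rank every post, including idx itself, by similarity to post idx
    ranked = sorted(vectors, key=lambda j: cosine(vectors[idx], vectors[j]), reverse=True)
    candidates = ranked[:n + 1]                      # one extra hit, like topn=n+1
    return [j for j in candidates if j != idx][:n]   # drop the post itself, keep n

# made-up vectors, keyed by post index (assumption: not real Doc2Vec output)
toy_vectors = {
    0: [1.0, 0.0],
    1: [0.9, 0.1],
    2: [0.0, 1.0],
    3: [0.7, 0.7],
    4: [-1.0, 0.0],
}

print(top_n_excluding_self(toy_vectors, 0, n=2))  # prints [1, 3]
```

Without the extra hit, the post itself would take up one of the n slots and the reader would see one recommendation fewer. The final [:n] slice covers the case where the post is not among the hits, which can happen in the real script because the query vector is inferred again rather than looked up.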

The output of recommend.py is live here. With some pretty-printing, it looks like:

recommendations = {
  "0": {
    "slug": "mnist-pixel-attacks-with-pytorch",
    "title": "MNIST pixel attacks with Pytorch",
    "date": "2019-06-01",
    "recommendations": [
      87,
      117,
      102
    ]
  },
  "1": {
    "slug": "five-ways-to-reduce-variance-in-ab-testing",
    "title": "Five ways to reduce variance in A/B testing",
    "date": "2021-09-19",
    "recommendations": [
      85,
      5,
      54
    ]
  },
  ...
};
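Since the file is just a JSON object wrapped in a Javascript assignment, it is easy to read back in Python and sanity-check. The following is a minimal sketch, not part of the blog's tooling: load_recommendations and find_problems are hypothetical helper names, and the sample string is made-up data in the same shape as the output above. Note that json.dumps() turns the integer post indexes into string keys ("0", "1", ...), while the entries in the recommendations lists stay integers, so the check converts before comparing.

```python
import json

def load_recommendations(js_text):
    # strip the "recommendations = ...;" wrapper to get at the JSON payload
    prefix = "recommendations = "
    payload = js_text.strip()
    if not payload.startswith(prefix):
        raise ValueError("unexpected file format")
    return json.loads(payload[len(prefix):].rstrip(";"))

def find_problems(recs):
    # report recommendations that point at a missing post or at the post itself
    problems = []
    for key, post in recs.items():
        for idx in post["recommendations"]:
            if str(idx) not in recs:
                problems.append(f"post {key} recommends missing post {idx}")
            elif str(idx) == key:
                problems.append(f"post {key} recommends itself")
    return problems

# made-up sample in the same shape as the emitted file
sample = ('recommendations = {'
          '"0": {"slug": "a", "title": "A", "date": "2019-06-01", "recommendations": [1]}, '
          '"1": {"slug": "b", "title": "B", "date": "2021-09-19", "recommendations": [0, 7]}};')

print(find_problems(load_recommendations(sample)))
# prints ['post 1 recommends missing post 7']
```

On the browser side the string/integer mismatch is harmless, because Javascript converts the index to a string when it is used as a property key.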

Makefile

Then I added a new target to the blog's Makefile:

...
clean:
    [ ! -d $(OUTPUTDIR) ] || rm -rf $(OUTPUTDIR)

output: $(INPUTDIR)/* *.py Makefile
    @$(PELICAN) $(INPUTDIR) -o $(OUTPUTDIR) -s $(PUBLISHCONF) $(PELICANOPTS)
  ...

clone:
    git clone git@github.com:mtrencseni/mtrencseni.github.io.git

recommend:               # <----------
    @./recommend.py        # 

publish: output
    @cp -R $(OUTPUTDIR)/* /var/www/bytepawn.com/
...

So my modified workflow is now:

continue to use make publish to publish new and/or changed articles to the blog
once an article is final, update the recommendations with make recommend
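Because make recommend is a separate manual step, it is possible to publish an article and forget to refresh the recommendations. One way to catch this would be a small check that compares file modification times. This is only a sketch under assumptions: recommendations_stale is a hypothetical helper that is not part of the blog's Makefile, and it assumes the same content/ and flex/static/js/recommendations.js paths that recommend.py uses.

```python
import os

def recommendations_stale(content_dir, js_path):
    # no recommendations file at all counts as stale
    if not os.path.exists(js_path):
        return True
    js_mtime = os.path.getmtime(js_path)
    # stale if any Markdown post was modified after the file was emitted
    return any(
        os.path.getmtime(os.path.join(content_dir, f)) > js_mtime
        for f in os.listdir(content_dir)
        if f.lower().endswith(".md")
    )

if __name__ == "__main__":
    if recommendations_stale("content/", "flex/static/js/recommendations.js"):
        print("recommendations.js is older than the newest post, run make recommend")
```

The check is deliberately coarse: any edit to any post marks the file as stale, even a typo fix that would not change the recommendations.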

Javascript

First, load the recommendations.js file on each article's page:

<script src="recommendations.js"></script>

Next, create a <div> that will contain the recommendations:

<div id="similar_articles" style="display:none; ..."></div>

Finally, the Javascript to render the recommendations:

window.onload = function () { show_recommendations("{{ article.slug }}"); }
function get_post_by_slug(slug) {
  for (const k in recommendations) {
    if (recommendations[k]["slug"] == slug)
      return recommendations[k];
  }
  return null;
}
function get_post_by_idx(idx) {
  return recommendations[idx];
}
function show_recommendations(slug) {
  // recommendations.js is missing or failed to load
  if (typeof recommendations == 'undefined')
    return;
  const post = get_post_by_slug(slug);
  // the current article is not in the recommendations file yet
  if (post == null)
    return;
  const div_similar_articles = document.getElementById("similar_articles");
  div_similar_articles.innerHTML = "<div style=\"margin-bottom:5px;\"><b>Other Articles You May Like:</b></div>";
  for (const idx of post["recommendations"]) {
    const rp = get_post_by_idx(idx);
    const url = "https://bytepawn.com/" + rp["slug"] + ".html";
    const title = rp["title"];
    const year = rp["date"].slice(0, 4);
    div_similar_articles.innerHTML += "<li><a href=\"" + url + "\">" + title + " (" + year + ")</a></li>";
  }
  // only reveal the <div> once it has been filled in
  div_similar_articles.style.display = "block";
}

Note the if (typeof recommendations == 'undefined') and if (post == null) checks: if the recommendations file is missing, or the current page is not part of the recommendations (because I forgot to run make recommend), the page does not break, and the <div> remains hidden in its initial style="display:none" state.

Result

You can see the result on this page, just below this line, below the tags. Enjoy!