Finding similar API functions between Pytorch and Tensorflow with Doc2Vec

Marton Trencseni - Wed 21 December 2022 - machine-learning

Introduction

In a series of previous posts I used Doc2Vec to add recommendations to this blog, which are now live (scroll to the bottom of any page, it's a blue box). These previous posts were:

I wanted to see how Doc2Vec performs out-of-the-box when comparing pages from different domains, i.e. pages that have a different structure. Since I don't have a labeled data set for this, I was looking for a domain with obvious similarities, where I could manually check the quality of the results. It occurred to me that Pytorch and Tensorflow are similar Deep Learning libraries, so I could use Doc2Vec to compute similarities between their API doc pages, and see if it finds obvious "pairs". By pairs I mean that both libraries have a function for, e.g., Cross Entropy, and so on. Let's see how it goes!


The code is up on Github.

Crawling

First, let's download the API docs. Initially I tried to use scrapy for this, but after a few hours of usage, I grew disappointed and abandoned it, for the following reasons:

  • it does not (seem to) have a good default auto-crawl; I needed to specifically tell it which links to crawl
  • it does not (seem to) have good default document extraction; you're on your own with e.g. BeautifulSoup
  • it does not (seem to) have good default error handling; e.g. encountering a javascript: or mailto: link crashes it
  • it uses a multiprocessing library that does not allow multiple crawls/restarts when used from ipython on Windows; I had to restart the whole Python kernel for every new crawl

After a few hours of not getting much bang for my buck, I realized I was better off writing a simple crawler loop myself with requests, urllib and BeautifulSoup. At least for such a simple use-case, I was right: my solution is more robust when used from ipython, simpler, and about the same amount of code as the scrapy driver class:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def extract_text_and_links(html):
    # extract the main article text and all outgoing links from a page
    soup = BeautifulSoup(html, 'html.parser')
    text = ' '.join(soup.find('article').text.split())
    links = [link['href'] for link in soup.find_all('a', href=True)]
    return text, links

def link_prefix(link):
    # strip the fragment and query string, so URL variants dedupe to one page
    if '#' in link:
        link = link.split('#')[0]
    if '?' in link:
        link = link.split('?')[0]
    return link

def resolve_links(base_url, links):
    # turn relative links into absolute URLs and deduplicate
    return set([link_prefix(urljoin(base_url, link)) for link in links])

def filter_links(base_url, links):
    # only follow https links that stay within the docs we're crawling
    return [link for link in links if link.startswith('https://') and base_url in link and '.pdf' not in link]

def crawl_and_save(base_url, num_pages):
    print(f'Crawling {base_url}...')
    urls_queued, urls_crawled, saved_pages = ['https://' + base_url], set(), {}
    while len(urls_queued) > 0 and len(urls_crawled) < num_pages:
        url = urls_queued.pop(0)  # FIFO queue, so this is a breadth-first crawl
        urls_crawled.add(url)
        print(f'Fetching {url}')
        try:
            html = requests.get(url).text
            text, links = extract_text_and_links(html)
            if base_url in url:
                saved_pages[url] = text
            links = resolve_links(url, links)
            links = filter_links(base_url, links)
            links = [link for link in links if (link not in set(urls_queued) and link not in urls_crawled)]
            urls_queued.extend(links)
        except Exception:
            pass  # skip pages that fail to download or parse
    print('Done!')
    print(f'Crawled {len(urls_crawled)} total pages, saved {len(saved_pages)} target pages')
    print(f'Total content extracted: {int(sum([len(v)/1000 for v in saved_pages.values()]))} kbytes')
    return urls_crawled, saved_pages

With this I can now crawl both API docs:

_, tf_saved_pages = crawl_and_save(
    base_url='tensorflow.org/api_docs/python/tf',
    num_pages=5000,
)
_, pt_saved_pages = crawl_and_save(
    base_url='pytorch.org/docs/stable',
    num_pages=5000,
)

The output looks something like:

Crawling tensorflow.org/api_docs/python/tf...
Fetching https://tensorflow.org/api_docs/python/tf
...
Fetching https://www.tensorflow.org/api_docs/python/tf/raw_ops/WriteScalarSummary
Fetching https://www.tensorflow.org/api_docs/python/tf/image/stateless_random_saturation
Done!
Crawled 5000 total pages, saved 5000 target pages
Total content extracted: 14570 kbytes

Crawling pytorch.org/docs/stable...
Fetching https://pytorch.org/docs/stable
...
Fetching https://pytorch.org/docs/stable/_modules/torch/ao/nn/qat/dynamic/modules/www.lfprojects.org/policies/
Fetching https://pytorch.org/docs/stable/_modules/torch/ao/nn/qat/dynamic/modules/www.linuxfoundation.org/policies/
Done!
Crawled 3912 total pages, saved 3912 target pages
Total content extracted: 7921 kbytes

Note: at num_pages=5000, there were still some Tensorflow pages left to crawl, since the crawl used up its full 5,000-page budget. The Pytorch crawl stopped at 3,912 pages, meaning it covered the entire documentation.
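
Since crawling close to 9,000 pages takes a while, it's worth caching the results to disk between runs; a minimal sketch using pickle (the filename is my own choice):

import pickle

# cache the crawled pages so re-runs don't have to hit the network again
with open('saved_pages.pkl', 'wb') as f:
    pickle.dump({'tf': tf_saved_pages, 'pt': pt_saved_pages}, f)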

The next step is to merge the saved pages and, similarly to how we did it in the previous Doc2Vec post, build the model:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# merge both doc sets, tokenize each page, and train a joint Doc2Vec model
pages = {**tf_saved_pages, **pt_saved_pages}
tagged_posts = {url : TaggedDocument(word_tokenize(text), [idx]) for idx, (url, text) in enumerate(pages.items())}
idx_lookup = {idx : url for idx, url in enumerate(pages.keys())}
model = Doc2Vec(tagged_posts.values(), vector_size=100, alpha=0.025, min_count=1, workers=16, epochs=100)
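
Training 100 epochs on roughly 9,000 documents takes a while, so it can also be worth persisting the model; gensim supports this out of the box (the filename is my own choice):

# save the trained model to disk, and load it back later without retraining
model.save('doc2vec_tf_pt.model')
model = Doc2Vec.load('doc2vec_tf_pt.model')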

Similarity extraction

First, let's re-use the similar_pages() function from the previous article and do a consistency check:

def similar_pages(which, n=3):
    if not isinstance(which, str):
        which = idx_lookup[which]
    # at this point which is the url
    if n == 'all':
        # return the raw similarity scores against all documents
        return model.dv.most_similar(positive=[model.infer_vector(tagged_posts[which][0])], topn=None)
    results = model.dv.most_similar(positive=[model.infer_vector(tagged_posts[which][0])], topn=len(pages))
    # drop the query page itself and pages crawled from www. domains
    results = [(idx_lookup[idx], f'{score:.3f}') for idx, score in results
        if idx != tagged_posts[which][1][0] and 'www' not in idx_lookup[idx]]
    return results[:n]

Check:

similar_pages('https://tensorflow.org/api_docs/python/tf/linalg/adjoint')

Returns:

[('https://tensorflow.org/api_docs/python/tf/raw_ops/BatchMatrixSolve', '0.661'),
 ('https://tensorflow.org/api_docs/python/tf/linalg/det',               '0.651'),
 ('https://tensorflow.org/api_docs/python/tf/linalg/logdet',            '0.640')]

This looks reasonable. adjoint is a matrix operation, and it returns related matrix operations. Let's look at another one:

similar_pages('https://tensorflow.org/api_docs/python/tf/keras/losses/BinaryCrossentropy')

Returns:

[('https://tensorflow.org/api_docs/python/tf/keras/losses/BinaryFocalCrossentropy',         '0.885'),
 ('https://tensorflow.org/api_docs/python/tf/keras/losses/SparseCategoricalCrossentropy',   '0.826'),
 ('https://tensorflow.org/api_docs/python/tf/keras/losses/CosineSimilarity',                '0.820')]

This looks reasonable. So similar_pages() returns similar pages, from the same API docs, as expected.
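
As one more sanity check (a minimal sketch, using a URL we already queried above), re-inferring a document's vector should usually return the document itself as the top hit:

url = 'https://tensorflow.org/api_docs/python/tf/linalg/adjoint'
vec = model.infer_vector(tagged_posts[url][0])
# the nearest document to a page's own inferred vector should be the page itself
idx, score = model.dv.most_similar(positive=[vec], topn=1)[0]
assert idx_lookup[idx] == url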

The next step is to write a simple function which returns the top n pages from the other API docs, given a query URL:

def similar_pages_cross(which, n=3):
    if not isinstance(which, str):
        which = idx_lookup[which]
    # at this point which is the url
    results = model.dv.most_similar(positive=[model.infer_vector(tagged_posts[which][0])], topn=len(pages))
    # exclude results from the same API docs as the query page
    exclude = 'tensorflow.org' if 'tensorflow.org' in which else 'pytorch.org'
    results = [(idx_lookup[idx], f'{score:.3f}') for idx, score in results
        if idx != tagged_posts[which][1][0] and exclude not in idx_lookup[idx] and 'www' not in idx_lookup[idx]]
    return results[:n]

Let's see:

similar_pages_cross('https://tensorflow.org/api_docs/python/tf/linalg/adjoint')

Returns:

[('https://pytorch.org/docs/stable/generated/torch.Tensor.tril_.html',          '0.581'),
 ('https://pytorch.org/docs/stable/_modules/torch/_C/_distributed_c10d.html',   '0.581'),
 ('https://pytorch.org/docs/stable/_modules/torch/testing/_creation.html',      '0.578')]

This does not look good; I would have liked to see https://pytorch.org/docs/stable/generated/torch.adjoint.html in there.
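
To quantify how far off the expected page is, a quick check is to find its rank in the full cross-docs result list; a minimal sketch (rank_of is a hypothetical helper, not part of the code above):

# hypothetical helper: the rank of an expected page among the cross-docs
# results for a query URL, or None if it's not in the top n
def rank_of(query_url, expected_url, n=1000):
    results = similar_pages_cross(query_url, n=n)
    for rank, (url, _score) in enumerate(results, start=1):
        if url == expected_url:
            return rank
    return None

rank_of(
    'https://tensorflow.org/api_docs/python/tf/linalg/adjoint',
    'https://pytorch.org/docs/stable/generated/torch.adjoint.html',
)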

Let's try with BCE:

similar_pages_cross('https://tensorflow.org/api_docs/python/tf/keras/losses/BinaryCrossentropy')

Returns:

[('https://pytorch.org/docs/stable/generated/torch.nn.functional.binary_cross_entropy_with_logits.html',    '0.559'),
 ('https://pytorch.org/docs/stable/generated/torch.Tensor.logit_.html',                                     '0.536'),
 ('https://pytorch.org/docs/stable/generated/torch.nn.HuberLoss.html',                                      '0.526')]

This looks reasonable.
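
To eyeball more cases, a small loop over hand-picked query pages does the job (a sketch; the query URLs are the two we already used above):

# print the top cross-docs match for a few hand-picked Tensorflow pages
queries = [
    'https://tensorflow.org/api_docs/python/tf/keras/losses/BinaryCrossentropy',
    'https://tensorflow.org/api_docs/python/tf/linalg/adjoint',
]
for query in queries:
    print(query)
    for url, score in similar_pages_cross(query, n=1):
        print(f'  -> {url} ({score})')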

Conclusion

After playing around with the results more, my conclusion is that the top recommendations from the other API docs are not always what I'd intuitively expect. I.e., just as the pair of adjoint was not found, Doc2Vec does not reliably identify the matching/similar API call in the other library's docs. Based on this very limited experiment, I suspect that this simple version would not be good enough for production use, i.e. to give recommendations to a programmer coming from one API and trying to use the other.