Naming is hard

There are only two hard things in Computer Science: cache invalidation and naming things. — Phil Karlton

I've been a software developer for almost a decade now, and if I had to choose a single quote that has held true for me throughout that time, it would be the one above.

Mapping real-world problems and implementing solutions for them inevitably involves naming concepts and agreeing on a vocabulary first. It is not only the problem domain but also our experience that makes us better at naming classes, methods, variables, and all related concepts.

I was looking for a side project and came up with the idea of a website where you type in a roughly sketched name for a concept and get back a list of suggestions based on:

– general language synonyms of the user query
– and additionally (and more importantly) a list of synonyms based on already existing code from public GitHub repositories

This way, you can find inspiration from other engineers whose minds work similarly to yours (it does not mean they got it right themselves, but at least you get more options to choose from).

Additionally, I was inspired by these two papers:

– Splitting source code identifiers using bidirectional LSTM Recurrent Neural Network
– Public Git Archive: a Big Code dataset for all

Tech stack

This project would be an excellent opportunity to try something new: to solve it with a pinch of data science and to try out some new framework or technology. I'm mainly a Ruby developer, tired of Ruby on Rails, so I decided to give Hanami a try. The data science part was utterly unknown to me, as I had no previous experience in that field, so I had to do my research along the way.

Hanami

My goal was to have one monolithic app written in Hanami, nicely split into in-app modules (this approach was revised later on).

I liked the approach that your core logic is your app and is kept in the lib/ directory, while any interface to that logic is an “app” (be it a web application, a CLI, etc.) and is kept in the apps/ directory. Such a split provides an excellent separation of contexts and allows each module to be replaced and tested easily.
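
For reference, a typical Hanami 1.x project is laid out roughly like this (findn here is just a placeholder for the project name):

apps/
  web/                # HTTP interface: actions (controllers), views, templates
    controllers/
    views/
    templates/
lib/
  findn/              # core business logic: entities, repositories, handlers
  findn.rb
config/
  environment.rb      # wires the apps/ interfaces and the lib/ core together
spec/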

First attempt

My main controller in the Hanami web app looked like this:

module Web
  module Controllers
    module Search
      class Show
        MAX_RESULTS = 5

        include Web::Action
        include ::AutoInject[
          code: 'handlers.code',
          distance: 'handlers.distance',
          synonyms: 'handlers.synonyms',
        ]

        expose :synonyms, :res, :query

        params do
          required(:query).filled(:str?)
        end

        def call(params)
          if params.valid? && current_user
            @query = params.dig(:query)
            @synonyms = synonyms.find(query: @query).take(MAX_RESULTS)
            @synonyms.unshift(@query)

            @res = @synonyms.take(MAX_RESULTS).map do |synonym|
              distance.filter(
                synonym: synonym,
                results: code.find(
                  query: synonym.parameterize,
                  user: current_user
                )
              )
            end.flatten.uniq
          else
            redirect_to routes.root_path
          end
        end
      end
    end
  end
end

Using dependency injection makes it easy to define “processing” objects. By that I mean an object that implements a processing function and does not mutate anything (or even hold any state), so you can pass any object to it, call a function on it, and get results back. Given that those objects can be reused between calls, we define them in a Container file like so:

class Container
  extend Dry::Container::Mixin

  register('code.search_query_builder') do
    Code::SearchQueryBuilder.new
  end

  register('cache.synonyms') do
    Cache::Synonyms.new
  end

    ...

end

The primary method of each controller action in Hanami is the call method. In our case, we do a few things there.

First, we try to find @synonyms based on the user query. This simply makes an API call to the external service Big Huge Thesaurus using the dtuite/dinosaurus gem. Not much to discuss there; we take the results and proceed with those.

The next thing is the main call chain in this controller:

@res = @synonyms.take(MAX_RESULTS).map do |synonym|
  distance.filter(
    synonym: synonym,
    results: code.find(
      query: synonym.parameterize,
      user: current_user
    )
  )
end.flatten.uniq

Starting from the inner block, we fetch code search results directly from GitHub using their search API endpoint.

After getting the results from GitHub, we apply some filtering logic by calculating the Levenshtein distance for each returned token.

We split each line from the GitHub search results, clean it up with a regexp, and decide whether to take a given token into account or not based on the distance calculated between the query (and its synonyms) and the token. If it is below the custom-defined threshold, we show the token. If not – we discard it. More on that later.

In each flow, I also try to use the cache as much as possible, given that all the APIs have limits (especially GitHub).

This is done with:

output = cache.find(query: synonym)
return output if output && !output.empty?

and

cache.save(query: synonym, value: output)

cache, as you may guess, is an object in the Container, defined for each cacheable entity.

register('cache.synonyms') do
  Cache::Synonyms.new
end

register('cache.code') do
  Cache::Code.new
end

register('cache.distance') do
  Cache::Distance.new
end

Each of those classes inherits from a base class and defines its own cache prefix (added to differentiate cached keys in Redis):

module Cache
  class Distance < Base
    def prefix
      'distance'
    end
  end
end

and all the processing is kept in the base class, like so:

module Cache
  class Base
    include ::AutoInject['clients.redis']

    EXPIRATION_TIME = 60 * 60 * 24 * 7 # 1 week, in seconds (Redis EX expects seconds)

    def find(query:)
      result = redis.get(cache_key(query))
      return unless result
      JSON.parse(result, symbolize_names: true)
    end

    def save(query:, value:)
      redis.set(cache_key(query), value.to_json, ex: EXPIRATION_TIME)
    end

    def prefix
      fail NotImplementedError, "#{self.class}##{__method__} not implemented"
    end

    private

    def cache_key(query)
      "#{Cache::PATH_PREFIX}:#{prefix}:#{query.downcase.gsub(/\s/,'')}"
    end
  end
end

Thanks to the dependency injection approach, all of those modules are easily replaceable and easy to unit test.

I liked that everything is kept in one app but, at the same time, organized into multiple modules. The way Hanami structures directories and files, together with the dry-rb gems, makes my code modular and easy to work with.

The biggest issue with this approach was the GitHub API rate limit. It runs out quite fast (until a decent cache is built up, the app is not really usable). Additionally, filtering results based on Levenshtein distance is not accurate enough, and I needed something better.

Data science

A bit more on Levenshtein distance

My first approach was to use something simple. I was looking for a clever way of comparing two words, and Levenshtein distance seemed like a good candidate. Quoting the Wikipedia article:

... the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other ...
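
To make the definition concrete, here is a minimal, illustrative sketch of the metric in Python (the app itself uses an optimized Ruby gem and a normalized variant of this distance, shown below):

def levenshtein(a, b):
    # Classic dynamic-programming edit distance:
    # at the start of row i, prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # => 3
print(levenshtein("fetch", "fetches"))   # => 2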

I used an optimized Ruby gem to calculate the value and tried to fine-tune the threshold to decide which words are meaningful enough to show to the user.

require 'cgi'

module Distance
  class Handler
    include ::AutoInject[
      cache: 'cache.distance',
    ]

    THRESHOLD = 0.7
    CLEANUP_REGEXP = /[^A-Za-z0-9\s_\:\.]/i

    def filter(synonym:, results:)
      output = cache.find(query: synonym)
      return output if output && !output.empty?

      output = results.flatten.map do |r|
        r.split(' ')
      end.flatten.map do |r|
        r.gsub(CLEANUP_REGEXP, '')
      end.map do |r|
        r if Levenshtein.normalized_distance(synonym, r, THRESHOLD) && !r.empty?
      end.compact.flatten

      cache.save(query: synonym, value: output)

      output
    end
  end
end

This approach worked fine in terms of performance, but it was missing one crucial thing – meaning. The Levenshtein distance between two words does not rely on any model of what the words mean; it simply measures how hard it is to change one word into the other by counting how many edits are required to do the job.

That was not enough, but fear not, there is another approach we can take.

Word2Vec

Text embedding was a real revelation for me. I highly recommend watching the 15-minute intro from Computerphile.

In short, word2vec maps each word in the text to a vector with multiple dimensions (how many is up to you) and infers a word's meaning from its vector distances to other words. For someone without data science expertise, like myself, it is enough for now to say that this approach derives a word's meaning from the words that appear around it within a given text window (how many words to look at before and after the current term).
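
Once every word has a vector, “similar meaning” boils down to vector similarity, usually measured as cosine similarity. A tiny illustration with made-up three-dimensional vectors (real models use dozens or hundreds of dimensions):

import numpy as np

def cosine(a, b):
    # Cosine similarity: close to 1.0 means the vectors point the same way,
    # close to 0.0 means the words rarely share contexts.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up embeddings, purely for illustration.
fetch    = np.array([0.9, 0.1, 0.3])
retrieve = np.array([0.8, 0.2, 0.4])
banana   = np.array([0.1, 0.9, 0.0])

print(cosine(fetch, retrieve))  # high: similar contexts, similar "meaning"
print(cosine(fetch, banana))    # low: different contexts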

If you have an hour to spare, I also recommend watching this poetic, more in-depth explanation from Allison Parrish.

Or you can read it in the form of a gist with a Jupyter Notebook, for those who prefer reading and experimenting along.

By that point, I understood that searching GitHub directly via the API would not scale and would take too much time per request to be usable. It is also very easy to hit the API rate limits, which forces users to wait 60 seconds before trying again.

I knew I needed a model and a lot of code to train it on.

My initial plan was to use the Public Git Archive, as I already knew about it from the papers I had read (the ones linked at the beginning). Unfortunately, the datasets from the project's repo are no longer available. I was planning to use the already cleaned identifiers dataset, as this is precisely what I need from those public repositories. I have reached out to the authors, and there is a chance they will be able to upload and share them again in the next few days.

Handling significant amounts of data

Luckily, Google exposes a considerable number of public GitHub repositories in its BigQuery analytics service.

I was able to easily extract ~1 GB of data (public source code) with this fairly self-explanatory SQL query:

WITH
  ruby_repos AS (
  SELECT
    r.repo_name AS repo_name,
    r.watch_count AS watches
  FROM
    `airy-shadow-177520.gh.languages` l,
    UNNEST(LANGUAGE)
  JOIN
    `airy-shadow-177520.gh.sample_repos` r
  ON
    r.repo_name = l.repo_name
  WHERE
    name IN ('Ruby',
      'Javascript',
      'Java',
      'Python'))
SELECT
  content
FROM
  `airy-shadow-177520.gh.sample_contents` c
WHERE
  c.sample_repo_name IN (
  SELECT
    repo_name
  FROM
    ruby_repos
  ORDER BY
    watches DESC
  LIMIT
    3000 )
  AND content IS NOT NULL

OK, I had the data and knew (roughly) what I would like to do with it; now the question was: how? I already knew Jupyter Notebooks, and those are perfect for any analysis, reporting, and debugging tasks. Python is also well known nowadays for its wide variety of data science libraries and for being, along with R, the de facto language of choice among data scientists. The choice was obvious.
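
Before any training, the raw file contents from BigQuery have to be turned into “sentences” of word-like tokens. A rough sketch of that kind of preprocessing (the regex-based identifier extraction and snake_case/camelCase splitting here are illustrative assumptions, not necessarily the exact steps used):

import re

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")

def tokenize(source):
    # One "sentence" per source line: extract identifiers, then split
    # snake_case and camelCase into lowercase sub-words.
    sentences = []
    for line in source.splitlines():
        words = []
        for ident in IDENTIFIER.findall(line):
            for part in ident.split("_"):
                words.extend(w.lower() for w in CAMEL_BOUNDARY.split(part) if w)
        if words:
            sentences.append(words)
    return sentences

print(tokenize("def fetch_user_data(userId):\n    return findUserById(userId)"))
# => [['def', 'fetch', 'user', 'data', 'user', 'id'],
#     ['return', 'find', 'user', 'by', 'id', 'user', 'id']]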

I used a Python library called gensim, which advertises itself as

the most robust, efficient and hassle-free piece of software to realize unsupervised semantic modelling from plain text.

I was able to use it easily in a Jupyter Notebook and train my model. It was also trivially simple to save the trained model to a file for reuse.
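
The training itself can be as short as the sketch below. The hyperparameters and the toy corpus are illustrative, not the actual settings used; note that gensim 4.x calls the dimensionality vector_size, while older 3.x releases call it size:

import os
from gensim.models import Word2Vec

# Toy corpus; in reality this would be the tokenized BigQuery export.
sentences = [
    ["fetch", "user", "data"],
    ["get", "user", "data"],
    ["fetch", "remote", "payload"],
    ["load", "user", "record"],
]

model = Word2Vec(
    sentences=sentences,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context window: tokens before/after the current one
    min_count=1,      # with real data you would raise this to drop rare tokens
    workers=4,
)

# Persist for reuse; the Flask/ZMQ services below load this same file.
os.makedirs("model", exist_ok=True)
model.save("model/model.wv")

# Query synonyms the same way the daemonized receiver does.
print(model.wv.most_similar(positive=["fetch"], topn=7))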

Deploying the model

I have my model. Yay! After experimenting and playing with it in the notebook, the time came to deploy it somewhere and expose it so it could be used by my Hanami app.

Before all of that, it would also be useful to track model versions somehow. That was the easy part, as I already knew about Git LFS (Large File Storage), which makes versioning huge/binary files easy peasy. I added my model files to a new GitHub repo (separate from the Hanami app, as I thought it would be easier to deploy the two separately).

The next step was to stand up a simple web server for the model's REST API. In the Python world, Flask seemed like the obvious choice.

My first take was just this:

from gensim.models import Word2Vec
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/')
def word2vec():
    model = Word2Vec.load("model/model.wv")
    w = request.args['word']
    return jsonify(model.most_similar(positive=[w], topn=15))

if __name__ == '__main__':
    app.run(debug=True)

Can you see the issue here? If not, think about how loading a model of a few hundred MB into RAM on each request could make a cloud server instance die.

So instead of loading the same model multiple times (basically on each request), I decided that separating the model from the Flask part would be the best approach. This way, I can have a daemonized model process, loaded once and responding to queries. I also have monit watching the service and reviving the model whenever it dies.

root@findn-model:~# cat /etc/systemd/system/model_receiver.service
[Unit]
Description=Model Findn Receiver Daemon

[Service]
Type=simple
ExecStart=/bin/bash -c 'cd /var/www/model.findn.name/html/ && source venv/bin/activate && ./receiver.py'
WorkingDirectory=/var/www/model.findn.name/html/
Restart=always
RestartSec=10

[Install]
WantedBy=sysinit.target
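
# monit check (a separate monit config file) that keeps the receiver service alive: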
check program receiver with path "/bin/systemctl --quiet is-active model_receiver"
  start program = "/bin/systemctl start model_receiver"
  stop program = "/bin/systemctl stop model_receiver"
  if status != 0 then restart

As always with two processes, you need to allow some kind of communication between them.

My first take was to use FIFO files, but this approach did not hold up when multiple API requests hit the Flask app. I did not want to spend too much time on debugging, so I decided to use something more reliable. My choice was ØMQ, for which (guess what) there is also an excellent Python library.

Equipped with all those tools, I managed to prepare, first, the Flask API app:

from flask import Flask, jsonify, request
import zmq

app = Flask(__name__)

@app.route('/')
def word2vec():
    w = request.args.get('word')
    if w is None:
        return jsonify([])
    context = zmq.Context()
    socket = context.socket(zmq.REQ)
    socket.connect('tcp://127.0.0.1:5555')
    socket.send_string(w)
    resp = socket.recv_pyobj()
    try:
        return jsonify(resp)
    except Exception as e:
        return jsonify([])

@app.errorhandler(404)
def pageNotFound(error):
    return "page not found"

@app.errorhandler(500)
def raiseError(error):
    return error

if __name__ == '__main__':
    app.run(debug=True)

as well as the daemonized model receiver:

#!/usr/bin/env python3

import zmq
import os
from gensim.models import Word2Vec

model = Word2Vec.load("model/model.wv")

context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind('tcp://127.0.0.1:5555')

while True:
    msg = socket.recv()
    try:
        resp = model.wv.most_similar(positive=[msg.decode('utf-8')], topn=7)
    except Exception as e:
        resp = []
    socket.send_pyobj(resp)
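
With both processes running, a quick local smoke test of the whole chain could look like this (assuming Flask's default development port 5000; the requests library is just a convenient way to poke the endpoint, not part of the project):

import requests

# The Flask app forwards the word over ZMQ to the receiver, which answers
# with (token, similarity) pairs from the word2vec model.
resp = requests.get("http://127.0.0.1:5000/", params={"word": "fetch"})
print(resp.json())
# e.g. [["retrieve", 0.83], ["load", 0.79], ...] (actual output depends on the model)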

The final result can be seen here: https://findn.name

Summary and next steps

This side project was a lot of fun, mostly thanks to its data science part. I did not write any low-level algorithmic code to train the model (as I used an already existing library). Nevertheless, it was still a great experience to see how these algorithms can “learn” the meaning of words, in a human sense, from only a small subset of data, and how such a model can be deployed on a hosted server outside the Jupyter Notebook environment.

The next steps will probably involve improving the model itself and experimenting with different data sets.

Overall, it was a successful side project. Fun for sure!