Naming is hard
There are only two hard things in Computer Science: cache invalidation and naming things. — Phil Karlton
I have been a software developer for almost a decade now, and if I had to choose a single quote that has held true for me throughout that time, it would be the one above.
Mapping real-world problems and implementing solutions for them inevitably starts with naming concepts and agreeing on a vocabulary. It is not only the problem domain but also our experience that makes us better at naming classes, methods, variables, and all related concepts.
I was looking for a side project and came up with the idea of a website where you type in a roughly sketched name for a concept and get back a list of suggestions based on:
– general-language synonyms of your query
– and additionally (and more importantly) synonyms based on already existing code from public GitHub repositories.
This way, you can find inspiration coming from other engineers who thought about similar problems (it does not mean they got the name right either, but at least you get more options to choose from).
Additionally, I was inspired by these two papers:
– Splitting source code identifiers using bidirectional LSTM Recurrent Neural Network
– Public Git Archive: a Big Code dataset for all
Tech stack
This project looked like an excellent opportunity to try something new: solving the problem with a pinch of data science and a framework/technology I had not used before. I'm mainly a Ruby developer, and being tired of Ruby on Rails, I decided to give Hanami a try. The data science part was utterly unknown to me, as I had no previous experience in that field, so I had to do my research along the way.
Hanami
My goal was to have one monolithic app written in Hanami, nicely split into in-app modules (this approach was revised later on).
I liked the approach where your core logic is your application and is kept in the lib/ directory, while any interface to that logic is an "app" (be it a web application, a CLI, etc.) and is kept in the apps/ directory. Such a split provides an excellent separation of contexts and makes each module easy to replace and test.
First attempt
My main controller in the Hanami web app looked like this:
module Web
  module Controllers
    module Search
      class Show
        MAX_RESULTS = 5

        include Web::Action
        include ::AutoInject[
          code: 'handlers.code',
          distance: 'handlers.distance',
          synonyms: 'handlers.synonyms',
        ]

        expose :synonyms, :res, :query

        params do
          required(:query).filled(:str?)
        end

        def call(params)
          if params.valid? && current_user
            @query = params.dig(:query)
            @synonyms = synonyms.find(query: @query).take(MAX_RESULTS)
            @synonyms.unshift(@query)

            @res = @synonyms.take(MAX_RESULTS).map do |synonym|
              distance.filter(
                synonym: synonym,
                results: code.find(
                  query: synonym.parameterize,
                  user: current_user
                )
              )
            end.flatten.uniq
          else
            redirect_to routes.root_path
          end
        end
      end
    end
  end
end
Using dependency injection makes it easy to define "processing" objects. By that I mean an object that implements a processing function and does not mutate its input (or even hold any state of its own), so you can pass anything to it, call a function on it, and get results back. Since those objects can be reused between calls, we define them in a Container file like so:
class Container
  extend Dry::Container::Mixin

  register('code.search_query_builder') do
    Code::SearchQueryBuilder.new
  end

  register('cache.synonyms') do
    Cache::Synonyms.new
  end

  ...
end
The primary entry point for each controller action in Hanami is its call method. In our case, we do a few things there.
First, we try to find @synonyms based on the user query. This simply makes an API call to the external service Big Huge Thesaurus using the dtuite/dinosaurus gem. Not much to discuss there; we take the results and proceed with those.
The next thing is the main call chain in this controller:
@res = @synonyms.take(MAX_RESULTS).map do |synonym|
  distance.filter(
    synonym: synonym,
    results: code.find(
      query: synonym.parameterize,
      user: current_user
    )
  )
end.flatten.uniq
Starting from the inner block, we fetch code search results directly from GitHub using their search API endpoint.
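The code.find handler is not shown in this post; as a rough illustration of the endpoint it wraps, here is a Python sketch of a GitHub code-search request. The token, query qualifiers, and text-match handling are illustrative assumptions rather than the project's actual Ruby code:

import requests

GITHUB_TOKEN = "<personal access token>"  # code search requires authentication

resp = requests.get(
    "https://api.github.com/search/code",
    params={"q": "fetch in:file language:ruby user:octocat", "per_page": 10},
    headers={
        "Authorization": f"token {GITHUB_TOKEN}",
        # ask GitHub to include the matching code fragments in the response
        "Accept": "application/vnd.github.v3.text-match+json",
    },
)
resp.raise_for_status()

for item in resp.json()["items"]:
    for match in item.get("text_matches", []):
        print(match["fragment"])  # the lines that the distance filter tokenizes later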
After getting results from GitHub, we apply some filtering logic by calculating the Levenshtein distance for each returned token. Each line from the GitHub search results is split into tokens and cleaned up with a regexp, and we then decide whether to keep a given token based on the distance between the query (and its synonyms) and that token. If it is below a custom-defined threshold, we show the token; if not, we discard it. More on that later.
In each flow, I also try to utilize the cache as much as possible, given that all the APIs have rate limits (especially GitHub). This is done with:
output = cache.find(query: synonym)
return output if output && !output.empty?
and
cache.save(query: synonym, value: output)
cache, as you may guess, is an object in the Container, defined for each cacheable entity.
register('cache.synonyms') do
  Cache::Synonyms.new
end

register('cache.code') do
  Cache::Code.new
end

register('cache.distance') do
  Cache::Distance.new
end
Each of those classes inherits from a base class and defines its own cache prefix (added to differentiate cached keys in Redis):
module Cache
  class Distance < Base
    def prefix
      'distance'
    end
  end
end
while all the processing is kept in the base class:
module Cache
  class Base
    include ::AutoInject['clients.redis']

    EXPIRATION_TIME = 60 * 60 * 24 * 7 # 1 week in seconds (Redis EX expects seconds)

    def find(query:)
      result = redis.get(cache_key(query))
      return unless result

      JSON.parse(result, symbolize_names: true)
    end

    def save(query:, value:)
      redis.set(cache_key(query), value.to_json, ex: EXPIRATION_TIME)
    end

    def prefix
      fail NotImplementedError, "#{self.class}##{__method__} not implemented"
    end

    private

    def cache_key(query)
      "#{Cache::PATH_PREFIX}:#{prefix}:#{query.downcase.gsub(/\s/, '')}"
    end
  end
end
Thanks to the dependency injection approach, all of those modules are easily replaceable and easy to unit test.
I liked that everything is kept in one app but, at the same time, organized into multiple modules. The way Hanami structures directories and files, combined with the dry-rb gems, makes the code modular and easy to work with.
The biggest issue with this approach was the GitHub API rate limits. Those run out quite fast, so until a decent cache is built up, the app is not really usable. Additionally, filtering results based on Levenshtein distance is not accurate enough, and I needed something better.
Data science
A bit more on Levenshtein distance
My first approach was to use something simple. I was looking for a clever way of comparing two words, and Levenshtein distance seemed like a good candidate. Quoting the Wikipedia article:
... the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other ...
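To make that concrete, here is a minimal Python sketch of the classic dynamic-programming computation (the project itself used an optimized Ruby gem, shown below); dividing by the longer word's length is just one common way to normalize the raw edit count into a 0–1 score:

def levenshtein(a: str, b: str) -> int:
    # each row holds the distances between a prefix of `a` and every prefix of `b`
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if the chars match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))                                       # => 3
print(levenshtein("kitten", "sitting") / max(len("kitten"), len("sitting")))  # => ~0.43

Note that the metric only counts character edits, so "fetch" and "retrieve" end up far apart even though they mean almost the same thing – a limitation discussed further below.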
I used an optimized Ruby gem to calculate the value and tried to fine-tune the threshold that decides which words are meaningful enough to show to the user.
require 'cgi'

module Distance
  class Handler
    include ::AutoInject[
      cache: 'cache.distance',
    ]

    THRESHOLD = 0.7
    CLEANUP_REGEXP = /[^A-Za-z0-9\s_\:\.]/i

    def filter(synonym:, results:)
      output = cache.find(query: synonym)
      return output if output && !output.empty?

      output = results.flatten.map do |r|
        r.split(' ')
      end.flatten.map do |r|
        r.gsub(CLEANUP_REGEXP, '')
      end.map do |r|
        r if Levenshtein.normalized_distance(synonym, r, THRESHOLD) && !r.empty?
      end.compact.flatten

      cache.save(query: synonym, value: output)
      output
    end
  end
end
This approach worked fine in terms of performance, but it was missing one crucial thing – meaning. The Levenshtein distance between two words does not rely on any model of what they mean; it only measures how hard it is to change one word into the other by counting the edits required to do the job.
That was not enough, but fear not, there is another approach we can take.
Word2Vec
Text embedding was a real revelation for me. I highly recommend watching the 15-minute intro from Computerphile.
In short, word2vec maps each word in a text to a vector in a multi-dimensional space (how many dimensions is up to you) and infers a word's meaning from the distances between its vector and the vectors of other words. For someone without data science expertise, like myself, it is enough for now to say that this approach derives a word's meaning from the words that appear around it within a given text window (how many words to look at before and after the current term).
If you have an hour to spare, I also recommend watching this poetic, more in-depth explanation from Allison Parrish.
Or, for those who prefer reading and experimenting along, you can go through the same material as a Jupyter Notebook gist.
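To illustrate what "distance between vectors" means in practice, here is a tiny sketch with made-up three-dimensional vectors (real embeddings use a hundred or more dimensions); cosine similarity is the measure most word2vec tooling reports:

import numpy as np

def cosine_similarity(u, v):
    # 1.0 = pointing in the same direction (very similar), ~0.0 = unrelated
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# hypothetical toy embeddings, purely for illustration
fetch    = np.array([0.9, 0.1, 0.0])
retrieve = np.array([0.8, 0.2, 0.1])
banana   = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(fetch, retrieve))  # high, ~0.98
print(cosine_similarity(fetch, banana))    # low, ~0.01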
By that point, I understood that searching GitHub directly via the API would not scale and took too much time per request to be usable. It is also very easy to exceed the API rate limits, which forces users to wait 60 seconds before trying again.
I knew I needed my own model and a lot of source code to train it on.
My initial plan was to use the Public Git Archive, which I already knew about from the papers I had read (the ones linked at the beginning). Unfortunately, the datasets from the project's repo are no longer available. I was planning to use the already-cleaned identifiers dataset, as this is precisely what I need from those public repositories. I've reached out to the authors, and there is a chance they will be able to upload and share the datasets again in the next few days.
Handling significant amounts of data
Luckily, Google exposes a considerable number of public GitHub repositories in its BigQuery analytics service.
I was able to easily extract ~1 GB of data (public source code) with this fairly self-explanatory SQL query:
WITH ruby_repos AS (
  SELECT
    r.repo_name AS repo_name,
    r.watch_count AS watches
  FROM
    `airy-shadow-177520.gh.languages` l,
    UNNEST(language)
  JOIN
    `airy-shadow-177520.gh.sample_repos` r
  ON
    r.repo_name = l.repo_name
  WHERE
    name IN ('Ruby', 'Javascript', 'Java', 'Python')
)
SELECT
  content
FROM
  `airy-shadow-177520.gh.sample_contents` c
WHERE
  c.sample_repo_name IN (
    SELECT repo_name
    FROM ruby_repos
    ORDER BY watches DESC
    LIMIT 3000
  )
  AND content IS NOT NULL
OK, I had the data and knew (roughly) what I wanted to do with it; the question was how. I already knew Jupyter Notebooks, which are perfect for analysis, reporting, and debugging tasks. Python is also well known for its wide variety of data science libraries and, alongside R, is a de facto language of choice among data scientists. The choice was obvious.
I used a Python library called gensim, which advertises itself as
the most robust, efficient and hassle-free piece of software to realize unsupervised semantic modelling from plain text.
I was able to use it easily in a Jupyter Notebook and train my model. It was also trivially simple to save the trained model to a file for reuse.
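Roughly, the training step looked like the sketch below. The tokenizer, file paths, and hyperparameters are illustrative rather than the exact values I used, and the dummy contents stand in for the BigQuery export (note also that older gensim releases call vector_size simply size):

import re
from gensim.models import Word2Vec

# Illustrative tokenizer: pull identifier-like tokens out of raw source text.
def tokenize(source: str):
    return [t.lower() for t in re.findall(r"[A-Za-z_]{2,}", source)]

# In the real project, `contents` is the ~1 GB of file contents exported from
# BigQuery; two dummy snippets keep this sketch self-contained and runnable.
contents = [
    "def find_user_by_email(email): return User.find_by(email=email)",
    "def fetch_account(account_id): return Account.find(account_id)",
]
sentences = [tokenize(c) for c in contents]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the vectors ("size" in older gensim)
    window=5,         # how many tokens around the current one to look at
    min_count=1,      # keep everything here; raise this on a real corpus
    workers=4,
)

model.save("model/model.wv")             # the file later loaded by the receiver
model = Word2Vec.load("model/model.wv")
print(model.wv.most_similar(positive=["find_by"], topn=3))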
Deploying the model
I had my model. Yay! After experimenting and playing with it in the notebook, the time came to deploy it somewhere and expose it so my Hanami app could use it.
Before all of that, it would also be useful to track model versions somehow. That was the easy part, as I already knew about Git LFS (Large File Storage), which makes versioning huge binary files easy peasy. I added my model files to a new GitHub repo (separate from the Hanami app, as I thought it would be easier to deploy the two separately).
The next step was to put a simple web server in front of the model as a REST API. In the Python world, Flask seemed like an obvious choice.
My first take was just this:
from gensim.models import Word2Vec
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/')
def word2vec():
    model = Word2Vec.load("model/model.wv")
    w = request.args['word']
    return jsonify(model.most_similar(positive=[w], topn=15))

if __name__ == '__main__':
    app.run(debug=True)
Can you see the issue here? If not, think about how loading a model of a few hundred megabytes into RAM on every request could make a cloud server instance die.
So instead of loading the same model multiple times (basically on each request), I decided that separating the model process from the Flask part would be the best approach. This way, I can have a daemonized model, loaded once and responding to queries. Whenever it dies, I also have monit checking the service status and reviving the model.
root@findn-model:~# cat /etc/systemd/system/model_receiver.service
[Unit]
Description=Model Findn Receiver Daemon
[Service]
Type=simple
ExecStart=/bin/bash -c 'cd /var/www/model.findn.name/html/ && source venv/bin/activate && ./receiver.py'
WorkingDirectory=/var/www/model.findn.name/html/
Restart=always
RestartSec=10
[Install]
WantedBy=sysinit.target
and the monit check that keeps it alive:
check program receiver with path "/bin/systemctl --quiet is-active model_receiver"
  start program = "/bin/systemctl start model_receiver"
  stop program = "/bin/systemctl stop model_receiver"
  if status != 0 then restart
As always with two processes, you need some way for them to communicate.
My first take was to use FIFO files, but that approach broke down when multiple API requests hit the Flask app. I did not want to spend too much time debugging it, so I decided to use something more reliable. My choice was ØMQ, for which (guess what) there is also an excellent Python library.
Equipped with all those tools, I managed to prepare, first, the Flask API app:
from flask import Flask, jsonify, request
import zmq

app = Flask(__name__)

@app.route('/')
def word2vec():
    w = request.args['word']
    if w is None:
        return

    context = zmq.Context()
    socket = context.socket(zmq.REQ)
    socket.connect('tcp://127.0.0.1:5555')
    socket.send_string(w)
    resp = socket.recv_pyobj()

    try:
        return jsonify(resp)
    except Exception as e:
        return jsonify([])

@app.errorhandler(404)
def pageNotFound(error):
    return "page not found"

@app.errorhandler(500)
def raiseError(error):
    return error

if __name__ == '__main__':
    app.run(debug=True)
as well as the daemonized model receiver:
#!/usr/bin/env python3
import zmq
import os
from gensim.models import Word2Vec

model = Word2Vec.load("model/model.wv")

context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind('tcp://127.0.0.1:5555')

while True:
    msg = socket.recv()
    try:
        resp = model.wv.most_similar(positive=[msg.decode('utf-8')], topn=7)
    except Exception as e:
        resp = []
    socket.send_pyobj(resp)
The final result can be seen here: https://findn.name
Summary and next steps
This side project was a lot of fun, mostly thanks to its data science part. I did not write any low-level algorithmic code to train the model (I used an existing library). Nevertheless, it was still a great experience to see how these algorithms can "learn" the meaning of words in a human sense from only a small subset of data, and how such a model can be deployed on a hosted server outside the Jupyter Notebook environment.
The next steps will probably involve improving the model itself and experimenting with different data sets.
Overall, it was a successful side project. Fun for sure!