
"Progress in a fixed context is almost always a form of optimisation." -- Alan Kay

The current world situation has forced a significant part of the workforce to stay at home and work from home. A lot of people talk about shifting whole sectors to remote work. I think we forget too often that the current state of affairs has nothing to do with working remotely.

What is remote work for me?

Let's make it simple: it is the possibility of providing my services from any distraction-free place of my choice (or one with distractions I can control), with minimal technical requirements needed to make the work happen.

When you stay at home, and your boss asks you to work from home, and your kids are at home because schools are closed, and your partner stays at home and tries to work from there too – this is not the usual remote work. This is working from home under exceptional circumstances, with distractions you cannot get rid of (read: kids, for example).

Some good might come out of the lockdown after all. I can imagine companies being convinced to allow working remotely more often after the pandemic is over; overall, this is the future for all jobs that require some dose of creativity.

In regards to controlling your environment: for example, I like working with the TV on in the background. This is not mentally exhausting because I'm in control – I can push a button on my remote, and it goes away. I think this is very relevant to any kind of remote work: being in control of your environment has a high impact on your productivity and mental health. Interestingly, for me, the hard-to-control events (friends asking for a quick chat over coffee, visiting co-workers in another building, a walk after lunch with the whole team, etc.) were much more frequent in the office. After some time, I realized I would be able to do the same work, in some extreme cases, in 2-3 hours in an environment of my choice rather than spending 8 hours “being at work” in the office space. As a result, I left my company back then and have not gone back to on-site work ever since (almost ten years now).

Usually, I'm not traveling while working remotely, nor do I spend my life on workations (which, by the way, seem like a terrible idea to me – I like to keep those two separate as much as I can). I do enjoy the possibility of changing my primary place of residence without much hassle if I need to. The thought of being unattached to the site where the company has an office is comforting to me and makes me less anxious than the thought of being tied to a given city/street because of the work I do. Although I'm not like the characters in Flights, I do put down roots; in my case, though, it is mostly a result of family-related life choices, like a pleasant daily life for my kids, and things like having a lovely landscape view through my kitchen window.

Remote work does not need to mean working from home. I think the main point is working from a place where you feel comfortable and productive, with no strings attached to the building where the company is. It means you can live wherever you think is an excellent place for you and your family without struggling with a pointless commute to the city center. This is the kind of freedom that working remotely gives me, and I do feel privileged and grateful for being able to choose where I live and where I work without one depending on the other.

Test-driven development might be dead or might not be. Whatever your approach to testing is – writing tests first or after implementation – having quick feedback from your tests is crucial for effective coding. I cannot imagine changing an existing method, function, or class without the ability to quickly see the output of the related specs.

There are multiple solutions specific to a language or framework, e.g. guard for Ruby projects, one of whose many features is automatically running the specs related to a changed class in a separate terminal window.

Although you can find similar tools for other languages, I like using more straightforward tools (without configuration overhead) that are more in the spirit of the Unix philosophy. Moreover, a language-agnostic tool has the obvious benefit of working without hassle, and in the same way, for whatever project in whatever language you are working on at the moment.

I use entr for multiple tasks: reload a browser tab when CSS/JS/HTML files change, run the spec suite when any lib/ file changes, recompile a Rust project on file change.

Example usage with an Elixir project is as simple as:

ag -l | entr -s 'clear && mix test'

Now, on each change to the files listed by ag (the silver searcher), entr will execute the defined command – clear the screen and run mix test.

This way, you can assemble such a command for any language of your choice and get an instant feedback loop from your tests.

In my past and current projects, there was always a hurdle in sharing knowledge effectively. Even if a project has the best documentation and each story is very well defined, including clear acceptance criteria, there is still a significant amount of knowledge hiding in the shadows.

Shadow knowledge is knowledge that is never documented and is passed around only through “tribe” stories. No matter how hard you try to clarify the reasoning behind a given feature, model, or UI component, there is always this small detail that you unconsciously omit – not purposely, of course. Most of the time, it is something so evident to you that you assume everyone knows it.

That shadow knowledge is everywhere, especially in long-lasting legacy projects. It prevents building a shared understanding not only among teams but also within an organization. The same applies to new joiners. The only difference, in that case, is the gap in knowledge that needs to be filled, and this is rarely done thoroughly by asking people to read the documentation and the code itself. Too many times, the documentation is outdated, and reading code is like reading a snapshot of the current system without any historical data to back up its current state.

The solution that I've tried many times, and that seems to work best for me and my peers, is pair programming (and in more extreme scenarios, mob programming, but this is an exciting topic of its own and deserves a separate blog post).

There are different approaches to how pair programming sessions should be performed. Still, one thing that does not change is their purpose – sharing the knowledge that is not easily shareable in the official documentation. Rarely is it written down that a feature was implemented the way it was because of low headcount and time pressure. Pair programming sessions give two people a safe space where they can not only code a solution together but also build trust and share uncomfortable facts about the existing solutions.

I try to keep in mind a few simple rules when doing pair programming that make it effective for me and my colleagues (at least they never said otherwise :)).

Keep them remote

I have been working remotely for about eight years now, so “remote” is a natural environment for me. Still, even when I was working in the office, I encouraged my peers to do pair programming sessions remotely. The main benefit is that each peer can work from their own computer and use their own environment, editor settings, etc. This removes the problem of having to use the same settings everywhere, which always gets in the way of effectiveness. It is straightforward nowadays to share code via, e.g., a git branch and get the same code state in a second if we switch roles. Another significant benefit is that both peers are kept “in the zone” by staying in a phone-booth space or at least by using headphones (especially ones with active noise cancellation). Pairing in person was always too vulnerable to all the office-related interruptions.

Do not assume anything

An assumption is an enemy of understanding and finding common ground. Whenever I pair program with someone senior, someone who works in the same problem domain, or even someone who is considered an expert where I'm not, I try to pretend I'm pairing with someone who is junior and lacks domain knowledge. If we both use the same tactic, it is easy to build up trust and become more comfortable sharing all the legacy-specific details about the code we are working with. I'm also sure I do not forget to share any insights that seem obvious to me.

Do not tell how

This rule is crucial for successful pair programming. I never tell my peer what to type in or how to approach a problem. Quite the opposite: I try to give as much historical background and context on the current code as I can, so that the other side can propose a solution on their own. This way, I can challenge my peer to come up with a solution that I might never have thought of. When they are going in a wrong direction (e.g., I've been there already and tried that), I do not cut the conversation short. I let them try it and fail; this way, we end up on the same page and understand the limitations and possibilities of the current code much better.

Last but not least, there is also a social aspect of pair programming. After all, what is a more delightful way to get to know your colleagues than sharing your passion for programming with them and doing it together at the same time?

There are only two hard things in Computer Science: cache invalidation and naming things. — Phil Karlton

I have been a software developer for almost a decade now, and if I had to choose a single quote that has held true for me over that time, it would be the one above.

Undoubtedly, mapping real-world problems and implementing solutions for them must involve naming concepts in the first place and agreeing on vocabulary. It is not only the problem domain but also our experience that makes us better at naming classes, methods, variables, and all related concepts.

I was looking for a side project and came up with the idea of a website where you type in a roughly sketched name for a concept and get back a list of suggestions based on:

– general-language synonyms of the user query,
– and additionally (and more importantly), synonyms based on already existing code from GitHub public repositories.

This way, you can find inspiration from other engineers whose minds work similarly to yours (it does not mean they were right themselves, but at least you get more options to choose from).

Additionally, I got inspired by these two papers:

– Splitting source code identifiers using bidirectional LSTM Recurrent Neural Network
– Public Git Archive: a Big Code dataset for all

Tech stack

This project was an excellent opportunity to try something new: solve a problem with a pinch of data science and try out some new framework/technology. I'm mainly a Ruby developer who has grown tired of Ruby on Rails, so I decided to give Hanami a try. The data science part was utterly unknown to me, as I had no previous experience in that field, so I had to do my research along the way.

Hanami

My goal was to have one monolithic app written in Hanami, nicely split into in-app modules (this approach was revised later on).

I liked the approach where your core logic is your application and is kept in the lib/ directory, while any interface to that logic is an “app” (a web application, CLI interface, etc.) kept in the apps/ directory. Such a split provides an excellent separation of contexts and allows for easy replacement and testing of each module.

First attempt

My main controller in the Hanami web app looked like this:

module Web
  module Controllers
    module Search
      class Show
        MAX_RESULTS = 5

        include Web::Action
        include ::AutoInject[
          code: 'handlers.code',
          distance: 'handlers.distance',
          synonyms: 'handlers.synonyms',
        ]

        expose :synonyms, :res, :query

        params do
          required(:query).filled(:str?)
        end

        def call(params)
          if params.valid? && current_user
            @query = params.dig(:query)
            @synonyms = synonyms.find(query: @query).take(MAX_RESULTS)
            @synonyms.unshift(@query)

            @res = @synonyms.take(MAX_RESULTS).map do |synonym|
              distance.filter(
                synonym: synonym,
                results: code.find(
                  query: synonym.parameterize,
                  user: current_user
                )
              )
            end.flatten.uniq
          else
            redirect_to routes.root_path
          end
        end
      end
    end
  end
end

Using dependency injection makes it easy to define “processing” objects. By that I mean an object that implements a processing function and does not mutate anything (or even hold any state), so you can pass any object to it, call a function on it, and get results back. Given that those objects can be reused between calls, we define them in a Container file like so:

class Container
  extend Dry::Container::Mixin

  register('code.search_query_builder') do
    Code::SearchQueryBuilder.new
  end

  register('cache.synonyms') do
    Cache::Synonyms.new
  end

    ...

end

The primary method of each controller action in Hanami is the call method. In our case, we do a few things there.

First, we try to find @synonyms based on the user query. This simply makes an API call to the external service Big Huge Thesaurus using the dtuite/dinosaurus gem. Not much to discuss there; we take the results and proceed with those.

The next thing is the main call chain in this controller:

@res = @synonyms.take(MAX_RESULTS).map do |synonym|
  distance.filter(
    synonym: synonym,
    results: code.find(
      query: synonym.parameterize,
      user: current_user
    )
  )
end.flatten.uniq

Starting from the inner block, we fetch code search results directly from GitHub using their search API endpoint.

After getting the results from GitHub, we apply some logic by calculating the Levenshtein distance for each returned token.

For each line of the GitHub search results, we split it into tokens, clean them up with a regexp, and decide whether to count a given token in or not based on the distance calculated between the query (and its synonyms) and the token. If it is below the custom-defined threshold, we show such a token; if not, we discard it. More on that later.

In each flow, I also try to utilize the cache as much as possible, given that all the APIs have limits (especially GitHub's).

This is done with:

output = cache.find(query: synonym)
return output if output && !output.empty?

and

cache.save(query: synonym, value: output)

cache, as you may guess, is an object in the Container, defined for each cacheable entity.

register('cache.synonyms') do
  Cache::Synonyms.new
end

register('cache.code') do
  Cache::Code.new
end

register('cache.distance') do
  Cache::Distance.new
end

Each of those classes inherits from the base class and defines its own cache prefix (added to differentiate cached keys in Redis):

module Cache
  class Distance < Base
    def prefix
      'distance'
    end
  end
end

and all the processing is kept in the base class:

module Cache
  class Base
    include ::AutoInject['clients.redis']

    EXPIRATION_TIME = 60 * 60 * 24 * 7 # 1 week, in seconds (Redis EX expects seconds)

    def find(query:)
      result = redis.get(cache_key(query))
      return unless result
      JSON.parse(result, symbolize_names: true)
    end

    def save(query:, value:)
      redis.set(cache_key(query), value.to_json, ex: EXPIRATION_TIME)
    end

    def prefix
      fail NotImplementedError, "#{self.class}##{__method__} not implemented"
    end

    private

    def cache_key(query)
      "#{Cache::PATH_PREFIX}:#{prefix}:#{query.downcase.gsub(/\s/,'')}"
    end
  end
end

Thanks to the dependency injection approach, all of those modules are easily replaceable and easy to unit test.

I liked that everything is kept in one app but, at the same time, organized into multiple modules. The way Hanami structures directories/files, together with the dry-rb gems, makes my code modular and easy to work with.

The biggest issue with this approach was the limits on the GitHub APIs. Those run out quite fast (until a decent cache is built up, the app is not really usable). Additionally, filtering results based on Levenshtein distance was not accurate enough, and I needed something better.

Data science

A bit more on Levenshtein distance

My first approach was to use something simple. I was looking for a clever way of comparing two words, and Levenshtein distance seemed like a good candidate. Quoting the Wikipedia article:

... the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other ...
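To make the definition concrete, here is a quick sanity check in Python using the python-Levenshtein package (just an illustration of the metric; the app itself uses a Ruby gem, as shown below):

# pip install python-Levenshtein
import Levenshtein

# "kitten" -> "sitten" -> "sittin" -> "sitting": three single-character edits
print(Levenshtein.distance("kitten", "sitting"))  # => 3

# identical strings need zero edits
print(Levenshtein.distance("query", "query"))     # => 0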

I used an optimized Ruby gem to calculate the value and tried to fine-tune the threshold that decides which words are meaningful enough to be worth showing to a user.

require 'cgi'

module Distance
  class Handler
    include ::AutoInject[
      cache: 'cache.distance',
    ]

    THRESHOLD = 0.7
    CLEANUP_REGEXP = /[^A-Za-z0-9\s_\:\.]/i

    def filter(synonym:, results:)
      output = cache.find(query: synonym)
      return output if output && !output.empty?

      output = results.flatten.map do |r|
        r.split(' ')
      end.flatten.map do |r|
        r.gsub(CLEANUP_REGEXP, '')
      end.map do |r|
        r if Levenshtein.normalized_distance(synonym, r, THRESHOLD) && !r.empty?
      end.compact.flatten

      cache.save(query: synonym, value: output)

      output
    end
  end
end

This approach worked fine in terms of performance, but it was missing one crucial thing – meaning. The Levenshtein distance between two words does not rely on any model of their meaning; it merely compares them by checking how hard it is to change one word into the other, i.e., by counting how many edits are required to do the job.

That was not enough, but fear not, there is another approach we can take.

Word2Vec

Text embedding was a real revelation for me. I highly recommend watching the 15-minute intro from Computerphile.

In short, word2vec maps each word in the text to a vector in multiple dimensions (how many is up to you), and based on the distances to other word vectors, it infers the word's meaning. For someone without data science expertise, like myself, it is enough, for now, to say that this approach defines a word's meaning by measuring vector distances between words within a given text window (how many words to look at before and after the current term).
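To give a feel for what that looks like in practice, here is a minimal gensim sketch (the query words and the model path are illustrative assumptions, matching the file used later in this post):

from gensim.models import Word2Vec

# load a previously trained model (path assumed, see the deployment section)
model = Word2Vec.load("model/model.wv")

# every word is just a dense vector...
vector = model.wv["user"]

# ...and "meaning" falls out of distances between those vectors
print(model.wv.similarity("user", "account"))            # cosine similarity
print(model.wv.most_similar(positive=["user"], topn=5))  # nearest neighbours in vector space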

If you have an hour to spare, I also recommend watching this poetic, more in-depth explanation from Allison Parrish.

Or you can read it in the form of a Jupyter Notebook gist, for those who prefer reading and experimenting along.

By that point, I understood that searching GitHub directly via the API would not scale and would take too much time on each request to be usable. Also, it is very easy to exceed the API rate limits, which forces users to wait 60 seconds before trying again.

I knew I needed a model of the data and a lot of source code to train it on.

My initial take was the Public Git Archive, as I already knew about it from the papers I had read (the ones linked at the beginning). Unfortunately, the datasets from the project's repo are no longer available. I was planning to use the already cleaned identifiers dataset, as this is precisely what I need from those public repositories. I've reached out to the authors, and there is a chance they will be able to upload and share them in the next few days.

Handling significant amounts of data

Luckily, Google exposes a considerable number of public GitHub repositories in its BigQuery analytics service.

I was able to easily extract ~1GB of data (public source code) with this fairly self-explanatory SQL query:

WITH
  ruby_repos AS (
  SELECT
    r.repo_name AS repo_name,
    r.watch_count AS watches
  FROM
    `airy-shadow-177520.gh.languages` l,
    UNNEST(LANGUAGE)
  JOIN
    `airy-shadow-177520.gh.sample_repos` r
  ON
    r.repo_name = l.repo_name
  WHERE
    name IN ('Ruby',
      'Javascript',
      'Java',
      'Python'))
SELECT
  content
FROM
  `airy-shadow-177520.gh.sample_contents` c
WHERE
  c.sample_repo_name IN (
  SELECT
    repo_name
  FROM
    ruby_repos
  ORDER BY
    watches DESC
  LIMIT
    3000 )
  AND content IS NOT NULL

OK, I had the data, and I knew (roughly) what I would like to do with it; now the question was: how? I already knew Jupyter Notebooks, and those are perfect for any analysis, reporting, and debugging tasks. Python is also well known nowadays for its wide variety of data science libraries and for being a de facto language of choice among data scientists, on a par with R. The choice was obvious.

I used a Python library called gensim, which advertises itself as

the most robust, efficient and hassle-free piece of software to realize unsupervised semantic modelling from plain text.

I was able to use it easily in a Jupyter Notebook and train my model. It was also trivially simple to save the trained model to a file for reuse.
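Roughly, the training step boils down to something like this – a sketch, not the exact notebook code; the token lists below are made up for illustration:

from gensim.models import Word2Vec

# each "sentence" is a list of tokens; in my case, tokenized lines of source
# code extracted from the BigQuery dump (these example tokens are made up)
sentences = [
    ["def", "find_user_by_email", "email"],
    ["class", "AccountsController"],
    # ... millions more tokenized lines from public repositories ...
]

# train the embeddings (gensim 3.x API; in gensim 4.x `size` is called `vector_size`)
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

# save for later reuse – this is the file loaded by the receiver described below
model.save("model/model.wv")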

Deploying model

I do have my model. Yay! After experimenting and playing with it in the notebook, the time came to deploy it somewhere and expose it to be used by my Hanami app.

Before all of that, it would also be useful to track model versions somehow. That was the easy part, as I already knew about Git LFS (Large File Storage), which makes versioning huge/binary files easy peasy. I added my model files to a new GitHub repo (separate from the Hanami app, as I thought it would be easier to deploy the two separately).

The next step was to use a simple web server to expose the model as a REST API. In the Python world, Flask seemed like the obvious choice.

My first take was just this:

from gensim.models import Word2Vec
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/')
def word2vec():
    model = Word2Vec.load("model/model.wv")
    w = request.args['word']
    return jsonify(model.most_similar(positive=[w], topn=15))

if __name__ == '__main__':
    app.run(debug=True)

Can you see the issue here? If not, think about how loading a model of a few hundred MB into RAM on each request could make a cloud server instance die.

So instead of loading the same model multiple times (on each request, basically), I decided that separating the model from the Flask part would be the best approach. This way, I can have a daemonized model, loaded once and responding to queries. I also have monit watching the process and reviving the model whenever it dies.

# cat /etc/systemd/system/model_receiver.service
[Unit]
Description=Model Findn Receiver Daemon

[Service]
Type=simple
ExecStart=/bin/bash -c 'cd /var/www/model.findn.name/html/ && source venv/bin/activate && ./receiver.py'
WorkingDirectory=/var/www/model.findn.name/html/
Restart=always
RestartSec=10

[Install]
WantedBy=sysinit.target

and the corresponding monit check:

check program receiver with path "/bin/systemctl --quiet is-active model_receiver"
  start program = "/bin/systemctl start model_receiver"
  stop program = "/bin/systemctl stop model_receiver"
  if status != 0 then restart

As always with two processes, you need some way for them to communicate.

My first take was to use FIFO files, but this approach did not work out with multiple API requests made to the Flask app. I did not want to spend too much time debugging, so I decided to use something more reliable. My choice was ØMQ, for which (guess what) there is also an excellent Python library.

Equipped with all those tools, I managed to prepare, first, the Flask API app:

from flask import Flask, jsonify, request
import zmq

app = Flask(__name__)

@app.route('/')
def word2vec():
    w = request.args.get('word')
    if w is None:
        return jsonify([])
    context = zmq.Context()
    socket = context.socket(zmq.REQ)
    socket.connect('tcp://127.0.0.1:5555')
    socket.send_string(w)
    resp = socket.recv_pyobj()
    try:
        return jsonify(resp)
    except Exception as e:
        return jsonify([])

@app.errorhandler(404)
def pageNotFound(error):
    return "page not found"

@app.errorhandler(500)
def raiseError(error):
    return error

if __name__ == '__main__':
    app.run(debug=True)

as well as the daemonized model receiver:

#!/usr/bin/env python3

import zmq
from gensim.models import Word2Vec

model = Word2Vec.load("model/model.wv")

context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind('tcp://127.0.0.1:5555')

while True:
    msg = socket.recv()
    try:
        resp = model.wv.most_similar(positive=[msg.decode('utf-8')], topn=7)
    except Exception as e:
        resp = []
    socket.send_pyobj(resp)

The final result can be seen here: https://findn.name

Summary and next steps

This side project was a lot of fun, mostly thanks to the data science part of it. I have not written any low-level algorithmic code to train the model (as I used an already existing library). Nevertheless, it was still a great experience to see how those algorithms can “learn” the meaning of words, in a human sense, using only a small subset of data – and how such a model can be deployed on a hosted server outside the Jupyter Notebook environment.

The next step would probably involve improvements to the model itself and experimenting with different data sets.

Overall, it was a successful side project. Fun for sure!
