
"Progress in a fixed context is almost always a form of optimisation." -- Alan Kay

The current world situation has forced a significant part of the workforce to stay at home and work from home. A lot of people talk about shifting whole sectors to remote work. I think we forget too often that the current state of affairs has nothing to do with working remotely.

What is remote work for me?

Let's make it simple: it is the possibility of providing my services from any distraction-free place of my choice (or one with distractions I can control), with minimal technical requirements needed to make the work happen.

When you stay at home, and your boss asks you to work from home, and your kids are at home because schools are closed, and your partner stays at home and tries to work from there too – this is not the usual remote work. This is working from home under exceptional circumstances, with distractions you cannot get rid of (read: kids, for example).

Some good might come out of the lockdown after all. I can imagine companies being convinced to allow working remotely more often after the pandemic is over; overall, this is the future for all jobs that require some dose of creativity.

In regards to controlling your environment: for example, I like working with the TV on in the background. This is not mentally exhausting because I'm in control – I can push a button on my remote, and it goes away. I think this is very relevant to any kind of remote work: being in control of your environment has a high impact on your productivity and mental health. Interestingly, for me, the hard-to-control events (friends asking for a quick chat over coffee, visiting co-workers in another building, a walk after lunch with the whole team, etc.) were much more frequent in the office. After some time, I realized I would be able to do the same work, in some extreme cases, in 2-3 hours in an environment of my choice rather than spending 8 hours “being at work” in the office space. As a result, I left my company back then and have not gone back to on-site work ever since (almost ten years now).

Usually, I'm not traveling while working remotely, nor do I spend my life on workations (which, by the way, seem like a terrible idea to me – I like to keep those two separate as much as I can). I do enjoy the possibility of changing my primary place of residence without much hassle if I need to. The thought of being unattached to the site where the company has an office is comforting to me and makes me less anxious than the thought of being tied to a given city/street because of the work I do. Although I'm not like the characters in Flights, I do put down roots; in my case, though, it is mostly a result of family-related life choices, like a pleasant daily life for my kids, and things like having a lovely landscape view through my kitchen window.

Remote work does not need to mean working from home. I think the main point is working from a place where you feel comfortable and productive, with no strings attached to the building where the company is. It means you can live wherever you think is an excellent place for you and your family without struggling with a pointless commute to the city center. This is the kind of freedom that working remotely gives me, and I do feel privileged and grateful for being able to choose where I live and where I work without one depending on the other.

Test-driven development might be dead or might not be. Whatever your approach to testing is – writing tests first or after implementation – having quick feedback from your tests is crucial for effective coding. I cannot imagine changing an existing method, function, or class without the ability to quickly see the output of the related specs.

There are multiple solutions specific to a language or framework, e.g. guard for Ruby projects, one of whose many features is automatically running the specs related to a changed class in a separate terminal window.

Although you can find similar tools for other languages, I like using more straightforward tools (without configuration overhead) that are more in the spirit of the Unix philosophy. Moreover, a language-agnostic tool has the obvious benefit of working without hassle, and in the same way, for whatever project in whatever language you are working on at the moment.

I use entr for multiple tasks: reload a browser tab when CSS/JS/HTML files change, run the spec suite when any lib/ file changes, recompile a Rust project on file change.

Example usage with an Elixir project is as simple as:

ag -l | entr -s 'clear && mix test'

Now, on each change to the files listed by ag (the silver searcher), entr will execute the defined command – clear the screen and run mix test.

This way, you can assemble such a command for any language of your choice and get an instant feedback loop from your tests.

In my past and current projects, there was always a hurdle in sharing knowledge effectively. Even if a project has the best documentation and each story is very well defined, including clear acceptance criteria, there is still a significant amount of knowledge hiding in the shadows.

Shadow knowledge is knowledge that is never documented and is passed around only through “tribe” stories. No matter how hard you try to clarify the reasoning behind a given feature, model, or UI component, there is always this small detail that you unconsciously omit – not purposely, of course. Most of the time, it is something so evident to you that you assume everyone knows it.

That shadow knowledge is everywhere, especially in long-lasting legacy projects. It prevents building a shared understanding not only among teams but also within an organization. The same applies to new joiners. The only difference, in that case, is the gap in knowledge that needs to be filled, and this is rarely done thoroughly by asking people to read the documentation and the code itself. Too many times, the documentation is outdated, and reading code is like reading a snapshot of the current system without any historical data to back up its current state.

The solution that I've tried many times, and that seems to work best for me and my peers, is pair programming (and in more extreme scenarios, mob programming, but this is an exciting topic of its own and deserves a separate blog post).

There are different approaches to how pair programming sessions should be performed. Still, one thing that does not change is their purpose – sharing the knowledge that is not easily shareable in the official documentation. Rarely is it written down that a feature was implemented the way it was because of low headcount and time pressure. Pair programming sessions give two people a safe space where they can not only code a solution together but also build trust and share uncomfortable facts about the existing solutions.

I try to keep in mind a few simple rules when doing pair programming that make it effective for me and my colleagues (at least they never said otherwise :)).

Keep them remote

I have been working remotely for about eight years now, so “remote” is a natural environment for me. Still, even when I was working in the office, I encouraged my peers to do pair programming sessions remotely. The main benefit is that each peer can work from their own computer and use their own environment, editor settings, etc. This removes the problem of having to use the same settings everywhere, which always gets in the way of effectiveness. It is straightforward nowadays to share code via, e.g., a git branch and get the same code state in a second if we switch roles. Another significant benefit is that both peers are kept “in the zone” by staying in a phone-booth space or at least by using headphones (especially ones with active noise cancellation). Pairing in person was always too vulnerable to all the office-related interruptions.

Do not assume anything

An assumption is an enemy of understanding and finding common ground. Whenever I pair program with someone senior, someone who works in the same problem domain, or even someone who is considered an expert where I'm not, I try to pretend I'm pairing with someone who is junior and lacks domain knowledge. If we both use the same tactic, it is easy to build up trust and become more comfortable sharing all the legacy-specific details about the code we are working with. I'm also sure I do not forget to share any insights that seem obvious to me.

Do not tell how

This rule is crucial for successful pair programming. I never tell my peer what to type in or how to approach a problem. Quite the opposite: I try to give as much historical background and context on the current code as I can, so that the other side can propose a solution on their own. This way, I can challenge my peer to come up with a solution that I might never have thought of. When they are going in a wrong direction (e.g., I've been there already and tried that), I do not cut the conversation short. I let them try it and fail; this way, we end up on the same page and understand the limitations and possibilities of the current code much better.

Last but not least, there is also a social aspect of pair programming. After all, what is a more delightful way to get to know your colleagues than sharing your passion for programming with them and doing it together at the same time?

There are only two hard things in Computer Science: cache invalidation and naming things. — Phil Karlton

I have been a software developer for almost a decade now, and if I had to choose a single quote that has held true for me over that time, it would be the one above.

Undoubtedly, mapping real-world problems and implementing solutions for them must involve naming concepts in the first place and agreeing on vocabulary. It is not only the problem domain but also our experience that makes us better at naming classes, methods, variables, and all related concepts.

I was looking for a side project and came up with the idea of a website where you type in a roughly sketched name for a concept and get back a list of suggestions based on:

– general-language synonyms of the user query,
– and additionally (and more importantly), synonyms based on already existing code from GitHub public repositories.

This way, you can find inspiration from other engineers whose minds work similarly to yours (it does not mean they were right themselves, but at least you get more options to choose from).

Additionally, I got inspired by these two papers:

– Splitting source code identifiers using bidirectional LSTM Recurrent Neural Network
– Public Git Archive: a Big Code dataset for all

Tech stack

This project was an excellent opportunity to try something new: solve a problem with a pinch of data science and try out some new framework/technology. I'm mainly a Ruby developer who has grown tired of Ruby on Rails, so I decided to give Hanami a try. The data science part was utterly unknown to me, as I had no previous experience in that field, so I had to do my research along the way.

Hanami

My goal was to have one monolithic app written in Hanami, nicely split into in-app modules (this approach was revised later on).

I liked the approach where your core logic is your application and is kept in the lib/ directory, while any interface to that logic is an “app” (a web application, CLI interface, etc.) kept in the apps/ directory. Such a split provides an excellent separation of contexts and allows for easy replacement and testing of each module.

First attempt

My main controller in the Hanami web app looked like this:

module Web
  module Controllers
    module Search
      class Show
        MAX_RESULTS = 5

        include Web::Action
        include ::AutoInject[
          code: 'handlers.code',
          distance: 'handlers.distance',
          synonyms: 'handlers.synonyms',
        ]

        expose :synonyms, :res, :query

        params do
          required(:query).filled(:str?)
        end

        def call(params)
          if params.valid? && current_user
            @query = params.dig(:query)
            @synonyms = synonyms.find(query: @query).take(MAX_RESULTS)
            @synonyms.unshift(@query)

            @res = @synonyms.take(MAX_RESULTS).map do |synonym|
              distance.filter(
                synonym: synonym,
                results: code.find(
                  query: synonym.parameterize,
                  user: current_user
                )
              )
            end.flatten.uniq
          else
            redirect_to routes.root_path
          end
        end
      end
    end
  end
end

Using dependency injection makes it easy to define “processing” objects. By that I mean an object that implements a processing function and does not mutate anything (or even hold any state), so you can pass any object to it, call a function on it, and get results back. Given that those objects can be reused between calls, we define them in a Container file like so:

class Container
  extend Dry::Container::Mixin

  register('code.search_query_builder') do
    Code::SearchQueryBuilder.new
  end

  register('cache.synonyms') do
    Cache::Synonyms.new
  end

    ...

end

The primary method of each controller action in Hanami is the call method. In our case, we do a few things there.

First, we try to find @synonyms based on the user query. This simply makes an API call to the external service Big Huge Thesaurus using the dtuite/dinosaurus gem. Not much to discuss there; we take the results and proceed with those.

The next thing is the main call chain in this controller:

@res = @synonyms.take(MAX_RESULTS).map do |synonym|
  distance.filter(
    synonym: synonym,
    results: code.find(
      query: synonym.parameterize,
      user: current_user
    )
  )
end.flatten.uniq

Starting from the inner block, we fetch code search results directly from GitHub using their search API endpoint.

After getting the results from GitHub, we apply some logic by calculating the Levenshtein distance for each returned token.

For each line of the GitHub search results, we split it into tokens, clean them up with a regexp, and decide whether to count a given token in or not based on the distance calculated between the query (and its synonyms) and the token. If it is below the custom-defined threshold, we show such a token; if not, we discard it. More on that later.

In each flow, I also try to utilize the cache as much as possible, given that all the APIs have limits (especially GitHub's).

This is done with:

output = cache.find(query: synonym)
return output if output && !output.empty?

and

cache.save(query: synonym, value: output)

cache, as you may guess, is an object in the Container, defined for each cacheable entity.

register('cache.synonyms') do
  Cache::Synonyms.new
end

register('cache.code') do
  Cache::Code.new
end

register('cache.distance') do
  Cache::Distance.new
end

Each of those classes inherits from the base class and defines its own cache prefix (added to differentiate cached keys in Redis):

module Cache
  class Distance < Base
    def prefix
      'distance'
    end
  end
end

and all the processing is kept in the base class:

module Cache
  class Base
    include ::AutoInject['clients.redis']

    EXPIRATION_TIME = 60 * 60 * 24 * 7 # 1 week, in seconds (Redis EX expects seconds)

    def find(query:)
      result = redis.get(cache_key(query))
      return unless result
      JSON.parse(result, symbolize_names: true)
    end

    def save(query:, value:)
      redis.set(cache_key(query), value.to_json, ex: EXPIRATION_TIME)
    end

    def prefix
      fail NotImplementedError, "#{self.class}##{__method__} not implemented"
    end

    private

    def cache_key(query)
      "#{Cache::PATH_PREFIX}:#{prefix}:#{query.downcase.gsub(/\s/,'')}"
    end
  end
end

Thanks to the dependency injection approach, all of those modules are easily replaceable and easy to unit test.

I liked that everything is kept in one app but, at the same time, organized into multiple modules. The way Hanami structures directories/files, together with the dry-rb gems, makes my code modular and easy to work with.

The biggest issue with this approach was the limits on the GitHub APIs. Those run out quite fast (until a decent cache is built up, the app is not really usable). Additionally, filtering results based on Levenshtein distance was not accurate enough, and I needed something better.

Data science

A bit more on Levenshtein distance

My first approach was to use something simple. I was looking for a clever way of comparing two words, and Levenshtein distance seemed like a good candidate. Quoting the Wikipedia article:

... the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other ...
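To make the definition concrete, here is a quick sanity check in Python using the python-Levenshtein package (just an illustration of the metric; the app itself uses a Ruby gem, as shown below):

# pip install python-Levenshtein
import Levenshtein

# "kitten" -> "sitten" -> "sittin" -> "sitting": three single-character edits
print(Levenshtein.distance("kitten", "sitting"))  # => 3

# identical strings need zero edits
print(Levenshtein.distance("query", "query"))     # => 0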

I used an optimized Ruby gem to calculate the value and tried to fine-tune the threshold that decides which words are meaningful enough to be worth showing to a user.

require 'cgi'

module Distance
  class Handler
    include ::AutoInject[
      cache: 'cache.distance',
    ]

    THRESHOLD = 0.7
    CLEANUP_REGEXP = /[^A-Za-z0-9\s_\:\.]/i

    def filter(synonym:, results:)
      output = cache.find(query: synonym)
      return output if output && !output.empty?

      output = results.flatten.map do |r|
        r.split(' ')
      end.flatten.map do |r|
        r.gsub(CLEANUP_REGEXP, '')
      end.map do |r|
        r if Levenshtein.normalized_distance(synonym, r, THRESHOLD) && !r.empty?
      end.compact.flatten

      cache.save(query: synonym, value: output)

      output
    end
  end
end

This approach worked fine in terms of performance, but it was missing one crucial thing – meaning. The Levenshtein distance between two words does not rely on any model of their meaning; it merely compares them by checking how hard it is to change one word into the other, i.e., by counting how many edits are required to do the job.

That was not enough, but fear not, there is another approach we can take.

Word2Vec

Text embedding was a real revelation for me. I highly recommend watching the 15-minute intro from Computerphile.

In short, word2vec maps each word in the text to a vector in multiple dimensions (how many is up to you), and based on the distances to other word vectors, it infers the word's meaning. For someone without data science expertise, like myself, it is enough, for now, to say that this approach defines a word's meaning by measuring vector distances between words within a given text window (how many words to look at before and after the current term).
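To give a feel for what that looks like in practice, here is a minimal gensim sketch (the query words and the model path are illustrative assumptions, matching the file used later in this post):

from gensim.models import Word2Vec

# load a previously trained model (path assumed, see the deployment section)
model = Word2Vec.load("model/model.wv")

# every word is just a dense vector...
vector = model.wv["user"]

# ...and "meaning" falls out of distances between those vectors
print(model.wv.similarity("user", "account"))            # cosine similarity
print(model.wv.most_similar(positive=["user"], topn=5))  # nearest neighbours in vector space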

If you have an hour to spare, I also recommend watching this poetic, more in-depth explanation from Allison Parrish.

Or you can read it in the form of a Jupyter Notebook gist, for those who prefer reading and experimenting along.

By that point, I understood that searching GitHub directly via the API would not scale and would take too much time on each request to be usable. Also, it is very easy to exceed the API rate limits, which forces users to wait 60 seconds before trying again.

I knew I needed a model of the data and a lot of source code to train it on.

My initial take was the Public Git Archive, as I already knew about it from the papers I had read (the ones linked at the beginning). Unfortunately, the datasets from the project's repo are no longer available. I was planning to use the already cleaned identifiers dataset, as this is precisely what I need from those public repositories. I've reached out to the authors, and there is a chance they will be able to upload and share them in the next few days.

Handling significant amounts of data

Luckily, Google exposes a considerable number of public GitHub repositories in its BigQuery analytics service.

I was able to easily extract ~1GB of data (public source code) with this fairly self-explanatory SQL query:

WITH
  ruby_repos AS (
  SELECT
    r.repo_name AS repo_name,
    r.watch_count AS watches
  FROM
    `airy-shadow-177520.gh.languages` l,
    UNNEST(LANGUAGE)
  JOIN
    `airy-shadow-177520.gh.sample_repos` r
  ON
    r.repo_name = l.repo_name
  WHERE
    name IN ('Ruby',
      'Javascript',
      'Java',
      'Python'))
SELECT
  content
FROM
  `airy-shadow-177520.gh.sample_contents` c
WHERE
  c.sample_repo_name IN (
  SELECT
    repo_name
  FROM
    ruby_repos
  ORDER BY
    watches DESC
  LIMIT
    3000 )
  AND content IS NOT NULL

OK, I had the data, and I knew (roughly) what I would like to do with it; now the question was: how? I already knew Jupyter Notebooks, and those are perfect for any analysis, reporting, and debugging tasks. Python is also well known nowadays for its wide variety of data science libraries and for being a de facto language of choice among data scientists, on a par with R. The choice was obvious.

I used a Python library called gensim, which advertises itself as

the most robust, efficient and hassle-free piece of software to realize unsupervised semantic modelling from plain text.

I was able to use it easily in a Jupyter Notebook and train my model. It was also trivially simple to save the trained model to a file for reuse.
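Roughly, the training step boils down to something like this – a sketch, not the exact notebook code; the token lists below are made up for illustration:

from gensim.models import Word2Vec

# each "sentence" is a list of tokens; in my case, tokenized lines of source
# code extracted from the BigQuery dump (these example tokens are made up)
sentences = [
    ["def", "find_user_by_email", "email"],
    ["class", "AccountsController"],
    # ... millions more tokenized lines from public repositories ...
]

# train the embeddings (gensim 3.x API; in gensim 4.x `size` is called `vector_size`)
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

# save for later reuse – this is the file loaded by the receiver described below
model.save("model/model.wv")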

Deploying model

I do have my model. Yay! After experimenting and playing with it in the notebook, the time came to deploy it somewhere and expose it to be used by my Hanami app.

Before all of that, it would also be useful to track model versions somehow. That was the easy part, as I already knew about Git LFS (Large File Storage), which makes versioning huge/binary files easy peasy. I added my model files to a new GitHub repo (separate from the Hanami app, as I thought it would be easier to deploy the two separately).

The next step was to use a simple web server to expose the model as a REST API. In the Python world, Flask seemed like the obvious choice.

My first take was just this:

from gensim.models import Word2Vec
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/')
def word2vec():
    model = Word2Vec.load("model/model.wv")
    w = request.args['word']
    return jsonify(model.most_similar(positive=[w], topn=15))

if __name__ == '__main__':
    app.run(debug=True)

Can you see the issue here? If not, think about how loading a model of a few hundred MB into RAM on each request could make a cloud server instance die.

So instead of loading the same model multiple times (on each request, basically), I decided that separating the model from the Flask part would be the best approach. This way, I can have a daemonized model, loaded once and responding to queries. I also have monit watching the process and reviving the model whenever it dies.

# cat /etc/systemd/system/model_receiver.service
[Unit]
Description=Model Findn Receiver Daemon

[Service]
Type=simple
ExecStart=/bin/bash -c 'cd /var/www/model.findn.name/html/ && source venv/bin/activate && ./receiver.py'
WorkingDirectory=/var/www/model.findn.name/html/
Restart=always
RestartSec=10

[Install]
WantedBy=sysinit.target

and the corresponding monit check:

check program receiver with path "/bin/systemctl --quiet is-active model_receiver"
  start program = "/bin/systemctl start model_receiver"
  stop program = "/bin/systemctl stop model_receiver"
  if status != 0 then restart

As always with two processes, you need some way for them to communicate.

My first take was to use FIFO files, but this approach did not work out with multiple API requests made to the Flask app. I did not want to spend too much time debugging, so I decided to use something more reliable. My choice was ØMQ, for which (guess what) there is also an excellent Python library.

Equipped with all those tools, I managed to prepare, first, the Flask API app:

from flask import Flask, jsonify, request
import zmq

app = Flask(__name__)

@app.route('/')
def word2vec():
    w = request.args.get('word')
    if w is None:
        return jsonify([])
    context = zmq.Context()
    socket = context.socket(zmq.REQ)
    socket.connect('tcp://127.0.0.1:5555')
    socket.send_string(w)
    resp = socket.recv_pyobj()
    try:
        return jsonify(resp)
    except Exception as e:
        return jsonify([])

@app.errorhandler(404)
def pageNotFound(error):
    return "page not found"

@app.errorhandler(500)
def raiseError(error):
    return error

if __name__ == '__main__':
    app.run(debug=True)

as well as the daemonized model receiver:

#!/usr/bin/env python3

import zmq
from gensim.models import Word2Vec

model = Word2Vec.load("model/model.wv")

context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind('tcp://127.0.0.1:5555')

while True:
    msg = socket.recv()
    try:
        resp = model.wv.most_similar(positive=[msg.decode('utf-8')], topn=7)
    except Exception as e:
        resp = []
    socket.send_pyobj(resp)

The final result can be seen here: https://findn.name

Summary and next steps

This side project was a lot of fun, mostly thanks to the data science part of it. I have not written any low-level algorithmic code to train the model (as I used an already existing library). Nevertheless, it was still a great experience to see how those algorithms can “learn” the meaning of words, in a human sense, using only a small subset of data – and how such a model can be deployed on a hosted server outside the Jupyter Notebook environment.

The next step would probably involve improvements to the model itself and experimenting with different data sets.

Overall, it was a successful side project. Fun for sure!
