with PostgreSQL 9.6 for the database.
Now, let me present to you the problem through which we will frame this discussion: given a number of student records, when displaying a single student, also display the quintile into which their grade falls.
Let us begin by looking at the structure of our students table. The following was generated with a simple Rails migration, the details of which are outside the scope of this article. The table, intentionally simple for the purpose of illustration, only has columns for an id, a name, and a grade.
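The migration itself is not reproduced here; DDL along these lines would produce an equivalent table (the column types are assumptions):

```sql
-- Approximate shape of the students table
CREATE TABLE students (
    id    serial PRIMARY KEY,
    name  character varying,
    grade integer
);
```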
In order to have some data to work with, we want to generate some at random. We use the following command to create a million records with grades between 0 and 100. As you can see from the subsequent queries, we now have a corpus upon which to operate. Since we are working with a million rows, we know each quintile contains exactly 200,000 records, and we can safely assume that the grade boundaries between quintiles will fall fairly close to multiples of 20. We will soon be able to confirm this assumption.
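The original command is not reproduced in this text; a generate_series()-based statement along these lines produces an equivalent corpus (the name format is an assumption):

```sql
-- Insert one million students with random grades between 0 and 100
INSERT INTO students (name, grade)
SELECT 'Student ' || n,
       floor(random() * 101)::integer
FROM generate_series(1, 1000000) AS n;
```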
In order to calculate quantiles, we will be making use of a PostgreSQL feature called window functions. Using the ntile() function, it is actually fairly trivial to calculate the quintile for a row. As can be seen in the following query, this is the case so long as we use a limit clause to only return one row. Using the window function call syntax, we define a computed column quintile using ntile(5) over the grade column, ordered.
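The query itself is not preserved in this text; a close reconstruction:

```sql
-- The window function is computed over all rows before LIMIT applies,
-- so the single row returned carries a correct quintile
SELECT id, name, grade,
       ntile(5) OVER (ORDER BY grade) AS quintile
FROM students
LIMIT 1;
```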
But something interesting happens when we try to use a where clause to get the same result: we get a different, incorrect, result. This is because window functions operate over the set of records returned by the from/where clause and so, in this next case, ntile() only sees one record. This makes our current query useless, since we are under the constraint of needing to display the quintile for a specific record.
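To illustrate, a reconstruction of the failing query:

```sql
-- The WHERE clause filters down to a single row before ntile() runs,
-- so the lone surviving row is always placed in quintile 1
SELECT id, name, grade,
       ntile(5) OVER (ORDER BY grade) AS quintile
FROM students
WHERE id = 123;
```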
At this point, it would also be nice to confirm that our data has the expected shape, but we run into another interesting limitation: namely, that we cannot use window functions inside the group by clause.
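A sketch of the attempt, which PostgreSQL rejects outright:

```sql
-- Window functions are evaluated after grouping, so this is rejected:
SELECT ntile(5) OVER (ORDER BY grade) AS quintile, count(*)
FROM students
GROUP BY ntile(5) OVER (ORDER BY grade);
-- ERROR:  window functions are not allowed in GROUP BY
```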
Luckily, SQL queries, generally speaking, can often be reshaped into a form that is sufficient for our needs. In particular, we can use another feature of PostgreSQL to work around these early teething problems: common table expressions, aka with queries.
Using a common table expression, we generate a temporary table for the duration of this query. This table will contain our quintile value, and we will pull the data from it. For now, we will just call it quintile_table. As an aside, it is worthwhile to note that common table expressions are also capable of significantly more advanced features, including iteration (albeit invoked using the recursive keyword), but we need not concern ourselves with those details here.
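A reconstruction of that query:

```sql
-- The CTE computes quintiles over the full table; the outer WHERE
-- then filters the already-computed rows
WITH quintile_table AS (
    SELECT id, name, grade,
           ntile(5) OVER (ORDER BY grade) AS quintile
    FROM students
)
SELECT *
FROM quintile_table
WHERE id = 123;
```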
That query gave us exactly the output we desire. The current shape of the query also grants us a way around the restriction on using window functions in a group by clause, since we are now grouping on a column from the temporary table, and PostgreSQL treats this like any other column. As evidenced in the following query, the data looks exactly how we expected.
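A sketch of the confirming query:

```sql
-- Grouping on the CTE's quintile column works like any other column;
-- each bucket should hold 200,000 rows with boundaries near
-- multiples of 20
WITH quintile_table AS (
    SELECT grade,
           ntile(5) OVER (ORDER BY grade) AS quintile
    FROM students
)
SELECT quintile, count(*), min(grade), max(grade)
FROM quintile_table
GROUP BY quintile
ORDER BY quintile;
```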
The added layer of complexity, necessary to allow us to filter on an arbitrary where clause, might be expected to increase the computational cost of the query, but that is not actually the case in any substantial way. The start-up cost for the query using common table expressions is less (by 5000) than that of the naive query using a limit clause, which means the output phase should be expected to actually begin earlier. True, the total cost is more by 17499.99, but that seems a slight cost to pay in order to get the desired result from amongst a million records.
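Those figures come from the query planner; the original output is not reproduced here, but prefixing either query with EXPLAIN yields the estimated start-up and total costs being compared:

```sql
-- Compare the planner's estimated costs (start-up..total) of each form
EXPLAIN
WITH quintile_table AS (
    SELECT id, name, grade,
           ntile(5) OVER (ORDER BY grade) AS quintile
    FROM students
)
SELECT * FROM quintile_table WHERE id = 123;
```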
It is worthwhile to note, before going any further, that this method would not be sufficiently performant in a production environment for a data set of this size; some caching or precomputation layer would be necessary, but that is outside the scope of this article.
At this point, we have a query that can handle arbitrary filtering, but we want to use it within an ActiveRecord model. The first thought may be that we can simply use ActiveRecord::QueryMethods#select, but we are about to take a step backwards.
(N.B. We break down the long SQL string into separate clauses using an array and joining it back into a string for the query; while this may not be the best practice, it is a sufficient way to break up long chunks of SQL when prototyping.)
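One plausible reconstruction of that naive attempt, using the array-join trick from the note above (the exact original is not preserved in this text):

```ruby
# Hand the entire CTE statement to #select and hope it replaces the query
quintile_sql = [
  "WITH quintile_table AS (",
  "SELECT id, name, grade,",
  "ntile(5) OVER (ORDER BY grade) AS quintile",
  "FROM students",
  ") SELECT * FROM quintile_table",
].join(" ")

# #select only populates the list of projected fields; ActiveRecord
# still appends its own FROM and WHERE, so this does not become the
# query we wrote
Student.select(quintile_sql).where(id: 123)
```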
The result is, again, not what we desire, and we must step back. Actually reading the documentation for the select method, we see it takes a list of fields we wish to select, but does not overwrite the entire select clause, as we (with intentional naivete) assumed. Instead, we will have to again reshape the query into a form that allows for composability within the constraints of an ActiveRecord model.
Thinking in terms of how we compose queries using scopes in Rails, it may be best to define our optimal interface before going forward. In this case, a standalone with_quintile scope would be optimal, and we would want to be able to use it just like any other scope, with its internals abstracted. Consider the following:
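```ruby
Student.with_quintile.where(id: 1).first.quintile
```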
In order to achieve this result, we will need to abandon our attempts to manipulate the select clause to our ends and, instead, focus on the from clause. Very simply, we can alias our original quintile_table as students, the table_name of our table, thereby tricking all other normal scopes into being well behaved in its presence. As far as they are concerned, the students table has the quintile column there at all times.
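Concretely, a sketch of the trick, reusing the CTE from earlier (ActiveRecord's #from accepts a raw subquery string):

```ruby
# Alias the whole quintile query back to "students" so that every
# other scope composes against it transparently
Student.from(<<~SQL).where(id: 1)
  (WITH quintile_table AS (
     SELECT id, name, grade,
            ntile(5) OVER (ORDER BY grade) AS quintile
     FROM students
   )
   SELECT * FROM quintile_table) AS students
SQL
```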
It would, again, be reasonable to believe additional misdirection such as this would increase the cost of the query, but the limit clause added by the call to ActiveRecord::FinderMethods#first effectively reduces the total cost back to the initial value.
We can now say that we have gotten to a point where we can simply wrap what we have in a scope, and call it good. Were this a feature for a client on a time-sensitive project, I would probably agree; but we can certainly do better than the following.
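That stopgap is not reproduced in this text; it would have been something close to this sketch:

```ruby
class Student < ApplicationRecord
  # Stopgap: the aliased CTE query from above, hard-coded into a scope
  scope :with_quintile, -> {
    from(<<~SQL)
      (WITH quintile_table AS (
         SELECT id, name, grade,
                ntile(5) OVER (ORDER BY grade) AS quintile
         FROM students
       )
       SELECT * FROM quintile_table) AS students
    SQL
  }
end
```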
Normally, it is considered bad practice to hide complexity within modules, so the following is not so much a recommendation as an elaboration on how we would approach generalizing what we already have. On the other hand, having worked on large, long-lived projects, I cannot stress enough the maintenance issues caused by having pieces of hard-coded SQL strewn within model classes. As such, the conversion to use AREL is an explicit recommendation.
Here, we create a new ActiveSupport::Concern module in app/models/concerns/quintile.rb, which we will include in any class we want to be able to call our scope on. In this case, we have removed any explicit reference to both the table name and the column we are using for our calculation. Consequently, this module can already be included in any ActiveRecord model, its quintile_on class macro used, and that is all that is necessary to add a scope for calculating a quintile on a given column. In this particular case, we could have simply defined all the methods from the class_methods block within the Student class itself, thereby obviating the extraneous module.
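The module itself is not preserved in this text; the following is a sketch consistent with the description above. The generated scope name, the helper-method breakdown, and the fixed bucket count of five are all assumptions:

```ruby
# app/models/concerns/quintile.rb
module Quintile
  extend ActiveSupport::Concern

  class_methods do
    # Class macro: `quintile_on :grade` defines a with_quintile_of_grade
    # scope exposing a computed "quintile" column
    def quintile_on(column)
      scope :"with_quintile_of_#{column}", -> {
        from(aliased_quintile_table(column).to_sql)
      }
    end

    # ntile(5) OVER (ORDER BY <column>) AS quintile
    def quintile_function(column)
      Arel::Nodes::NamedFunction
        .new("ntile", [Arel::Nodes.build_quoted(5)])
        .over(Arel::Nodes::Window.new.order(arel_table[column]))
        .as("quintile")
    end

    # WITH <randomized name> AS (SELECT *, ntile(...) FROM <table>)
    # SELECT * FROM <randomized name>
    # The random suffix makes alias collisions with other scopes unlikely
    def quintile_table(column)
      cte = Arel::Table.new("quintile_table_#{SecureRandom.hex(4)}")
      inner = arel_table.project(arel_table[Arel.star], quintile_function(column))
      cte.project(cte[Arel.star]).with(Arel::Nodes::As.new(cte, inner))
    end

    # Alias the whole expression back to table_name so other scopes compose
    def aliased_quintile_table(column)
      quintile_table(column).as(table_name)
    end
  end
end

class Student < ApplicationRecord
  include Quintile
  quintile_on :grade
end
```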
Ultimately, the version written with AREL produces the exact same output, but with a lot of the hard-coded aspects stripped away. While it is more difficult to write up front, and more difficult to follow, each method returns an object that is an Arel::Node, which has a to_sql method. When composing queries in this way for a real production application, this makes it possible to very easily test the SQL generated, helping with long-term maintenance.
As can be seen in our usage of the Student model below, the named scope generated by our class macro has a more fluent name than we had used previously, namely stating on which column the quintile is being processed. Another point of interest for the query generated is our choice to define the quintile_table method such that it includes a random value in the name, with the intent of making name collisions less likely when working with other scopes that require table aliases.
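Putting it together, again assuming the hypothetical scope name from the sketch above:

```ruby
# The scope composes with ordinary query methods; the generated FROM
# clause aliases a randomly named CTE back to "students"
Student.with_quintile_of_grade.where(id: 1).first.quintile
```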
While this is a perfectly reasonable stopping point for this project, having gotten exactly what we need from the database within the confines of a Rails application, there is a lot more we could do. It is easy to imagine, from atop our module specific to quintiles for columns, writing a much more robust system for defining these sorts of calculations. The next logical steps are fairly clear:

- Generalize the quintile aspect of the library, for the ability to have any quantile.
- Encapsulate the table and column, to prevent their constant use as parameters.

Now, ActiveSupport::Inflector may not deal in quantiles, but it would be easy enough to possess our own mapping between integers and words. But all of this is mere speculation, for we have accomplished what we set out to do. Perhaps we will someday return to this, but, for now, that is enough implementation.
Instead, the path upon which we have trod during this exercise gives us a contextual vantage point from which to discuss software construction, methodology, and practice more generally. We started as "close to the metal" as necessary—inside PostgreSQL—but not at a layer so far away from our problem domain that we would lose sight of the goal. In doing so, we learned about and utilized some less common features of the database in an environment that was conducive to exploring these fundamentals. Had we started from ActiveRecord, or worse, AREL, our feedback loop would have been much slower as we coped with extraneous details. Instead, we did the simplest thing we could, found the points at which it broke down, and iterated our implementation before even attempting to integrate it. And when we did so, we had to step back and reiterate the process, given our new set of constraints.
Oftentimes, a problem will appear intractable when approached at too high a level, without enough granular control, or without enough understanding of the underlying architecture. Stepping down helps in these cases, but does not, by itself, produce a satisfactory result. After having gleaned what we could from the database itself, we had to climb back up the ladder of abstraction. In doing so, we went marginally further than necessary, but showed how we could remove a lot of the hard-coded details and make our software more robust. We had the opportunity to stop, but favored the more complete approach. In languages with code generation deeply ingrained, our solutions will often grow to involve a level of abstraction that makes reuse trivial. When it is overkill and when it is ideal is a much more subjective question, based more on the exigencies of the real world than on what would be considered optimal from a software implementation perspective.
And, finally, to get meta and talk about this last section from within itself: it is always important to step back after an exercise that touches many different pieces, contrived or experienced in practice, and ponder what can be gleaned from the process. Just as higher order abstractions in programming come from the recognition of patterns, so too does it arise in higher order thinking about programming.
05 Jun 2017