Thursday, 27 February 2014

How to pay it forward when designing your data models

With my midlife crisis almost complete, I am lucky enough to have scored an internship with Two Red Kites. Although I have pretty much kissed my previous life in Data Warehousing goodbye, I sometimes feel guilty about the power I now wield as a developer of front-end systems. The decisions made when designing an application can have far-reaching impacts further down the line.

Attention should be paid up front in the data entry application to collecting the right data, at the right level of detail, and with input validation to ensure data quality. Without this solid foundation of data integrity, it can be difficult to construct a reliable reporting solution.

A general rule in designing a data warehouse is to store data at the finest level of granularity possible. This produces the most robust design for responding to future queries and new data elements.

However, the level of granularity is limited by what is captured in the source system. Trying to break a data element down into smaller elements is a challenging task, to say the least. Data quality issues, inconsistent formatting, and incomplete fields all create additional headaches.

A good example is an address field. Giving customers a free-text address field can, and will, result in the same address being represented in many different ways:

River View Apartments
Apartment 4503
197 Albert Street
Brisbane
Queensland 4000

River View Apartements
4503/197 Albert Street
Brisbane
QLD 4000

4503/197 Albert St
River View Apartments
Brisbane 
Queenlsand 4000

Unit 4503
179 Albert Street
Brisbane CBD

Apartment 4503
River View Apartments
Albert Street
Brisbane Q 4000

Level 45, Apartment 3
197 Albert Road
Brisbane Q’land 4000

Customers often ask for their reports to be broken down by postcode, Statistical Local Area (SLA) or Local Government Area (LGA). Assigning an SLA or an LGA to an address involves identifying the exact location of the property. If we cannot extract each and every data element from that compound address field, the completeness and accuracy of such a report is questionable at best.

So here is my plea: We should design and develop our applications to collect data in a flexible and robust manner. This facilitates future reporting requirements, placing us in a position where we can make a real and positive difference to our Data Warehousing colleagues.
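For example, if the state and postcode are captured as their own fields, the application can normalise and validate them at entry time. Here is a minimal sketch in plain Ruby (the variant lists and method names are my own assumptions, not a definitive implementation):

```ruby
# Canonical Australian state abbreviations and the free-text variants
# we are willing to accept from a data-entry form.
STATE_VARIANTS = {
  "QLD" => %w[QLD QUEENSLAND Q QLAND],
  "NSW" => %w[NSW],
  "VIC" => %w[VIC VICTORIA],
  "SA"  => %w[SA],
  "WA"  => %w[WA],
  "TAS" => %w[TAS TASMANIA],
  "NT"  => %w[NT],
  "ACT" => %w[ACT],
}

# Map free-text state input to a canonical abbreviation, or nil if unknown.
def normalise_state(input)
  key = input.to_s.strip.upcase.delete("'.")
  pair = STATE_VARIANTS.find { |_, variants| variants.include?(key) }
  pair && pair.first
end

# Australian postcodes are exactly four digits.
def valid_postcode?(input)
  !!(input.to_s =~ /\A\d{4}\z/)
end
```

Rejecting or correcting a value like "Queenlsand" at entry time is far cheaper than trying to untangle it in the warehouse later.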

Sunday, 16 February 2014

Nested Authorisation


In an application we are building, we are using CanCan for authorisation, and we have nested routes for scoping and permissions. One benefit of nesting controllers is that authorisation can be constructed in a hierarchy that matches the hierarchy of the routes. However, the traditional load_and_authorize_resource approach becomes difficult to manage and understand, let alone debug when tests aren't working as expected.

For example, in our routes.rb file we might have resources such as:

resources :clients do
  resources :projects do
    resources :items
  end
end

And in our app/controllers/items_controller.rb we would have:

class ItemsController < ApplicationController

  before_filter :authenticate_user!

  load_and_authorize_resource :client
  load_and_authorize_resource :project, :through => :client
  load_and_authorize_resource :item, :through => :project
  ...

This is pretty straightforward, but as the items may need to appear underneath other resources they relate to, these load_and_authorize_resource usages become more complex. Let's say there is a need to scope items by user, as a single user may have items across multiple projects. Our routes.rb becomes:

resources :clients do
  resources :projects do
    resources :items
  end
  resources :users do
    resources :items
  end
end

And our app/controllers/items_controller.rb becomes:

class ItemsController < ApplicationController

  before_filter :authenticate_user!

  load_and_authorize_resource :client
  load_and_authorize_resource :project, :through => :client, :if => lambda { params[:project_id] }
  load_and_authorize_resource :item, :through => :project, :if => lambda { params[:project_id] }
  load_and_authorize_resource :user, :through => :client, :if => lambda { params[:user_id] }
  load_and_authorize_resource :item, :through => :user, :if => lambda { params[:user_id] }
  ...

As this routing nests a few levels deeper, the combination of load_and_authorize_resource usages becomes very difficult to read, let alone alter without breaking the outcome for other routes.

Additionally, because the CanCan load_and_authorize_resource method creates a before filter, when something goes wrong there is nothing in an exception stack trace indicating which resource was being loaded when the failure occurred.

Since our routes are defined as a hierarchy, I began investigating how to define the authorisation in a similar hierarchy. Thankfully, CanCan is very well designed, and it's not too hard to set up the CanCan controller resource object that does the authorisation from your own code. I created a controller instance method load_and_authorize which takes the same arguments as the CanCan class method load_and_authorize_resource. It also takes a block, and it invokes the block only if the resource was actually loaded by that call; if the resource is already loaded, the call has no effect. Based on the nesting of an enclosing block, it automatically includes the :through => :model option. This allows app/controllers/items_controller.rb to be expressed as:

class ItemsController < ApplicationController

  before_filter :authenticate_user!

  before_filter do
    load_and_authorize :client do
      if params[:project_id]
        load_and_authorize :project do
          load_and_authorize :item
        end
      end
      if params[:user_id]
        load_and_authorize :user do
          load_and_authorize :item
        end
      end
    end
  end
  ...

This is a bit longer, but much easier to maintain, and it keeps changes under one route from adversely affecting another: changes made under the params[:project_id] section only affect those routes. By adding a few little syntactic-sugar helpers, such as load_and_authorize_if_present, we can shorten the controller to:

class ItemsController < ApplicationController

  before_filter do
    authenticate_user!
    load_and_authorize :client do
      load_and_authorize_if_present :project do
        load_and_authorize :item
      end
      load_and_authorize_if_present :user do
        load_and_authorize :item
      end
    end
  end
  ...

Now we have an authorisation definition that is almost identical to our routes definition. This gives us much more clarity about how we construct our authorisation, more visibility into where authorisation failures are actually occurring when tests fail, and much easier maintainability, as changes to our routes and authorisation are made in the same hierarchical manner.

After implementing this in several of our nested controllers, I'm happy to say that this has indeed delivered on these benefits for us.

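A minimal sketch of what such a load_and_authorize helper might look like, built on CanCan's ControllerResource class (this is a reconstruction and an assumption about CanCan's internals; the actual implementation may differ):

```ruby
module NestedAuthorisation
  # Load and authorise a resource by name, nesting :through the resource
  # loaded by the enclosing block. The block runs only if this call
  # actually loaded the resource; an already-loaded resource is left alone.
  def load_and_authorize(name, options = {})
    return if instance_variable_get("@#{name}")

    # Inherit :through from the enclosing block's resource, unless overridden.
    options = { :through => @nesting_parent }.merge(options) if @nesting_parent
    CanCan::ControllerResource.new(self, name, options).load_and_authorize_resource

    if instance_variable_get("@#{name}") && block_given?
      previous, @nesting_parent = @nesting_parent, name
      begin
        yield
      ensure
        @nesting_parent = previous
      end
    end
  end

  # Sugar: only attempt the load when the matching *_id param is present.
  def load_and_authorize_if_present(name, options = {}, &block)
    load_and_authorize(name, options, &block) if params[:"#{name}_id"]
  end
end
```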

Wednesday, 12 February 2014

Polymorphic Path


Every now and then you might end up with something like the following route structure:
resources :projects do
  resources :tasks do
    resources :documents
  end
  resources :users do
    resources :documents
  end
end
which inevitably leads to views like this:
- if @task
    = link_to 'Document', project_task_document_path(@project, @task, @document)
- else
    = link_to 'Document', project_user_document_path(@project, @user, @document)
It's pretty frustrating when you're trying to achieve logicless view nirvana.
Enter polymorphic_path. This little bit of Rails voodoo allows you to achieve the following with the above example:
= link_to 'Document', polymorphic_path([@project, @task || @user, @document])
But what if you want to do the same but with an edit action?
No problem:
= link_to 'Edit Document', edit_polymorphic_path([@project, @task || @user, @document])
New and index actions get a little trickier, but are still simple:
= link_to 'New Document', new_polymorphic_path([@project, @task || @user, :document])
= link_to 'Documents', polymorphic_path([@project, @task || @user, :documents])
polymorphic_path builds on the work in polymorphic_url, which is just another component of the Rails framework. Methods such as form_for, url_for and redirect_to all utilise polymorphic_url to generate their paths.
If you find your view logic getting cluttered with if statements like the above, and you're willing to take a hit on the readability of your path in exchange for cleaner looking views or controller actions, polymorphic_path is your friend.

Friday, 7 February 2014

Using Strings as ID fields in Rails

Scenario

Rails defaults to using integers for IDs, which works fine when you're only working with your own internal data. Sometimes, though, you need to periodically import external data that has a primary key (PK) of a string. Let's talk about the options we have.

Example

Let's say we import companies and contacts daily from an external API, and we have tables and relationships of:

We record the external string company and contact IDs into company_ext_id and contact_ext_id respectively. Each time we import companies and contacts, we need to check whether the keys already exist in order to get the Rails ID for each object.


This gets worse when you are importing contacts, where you also need to match the imported company_id (which will probably be a foreign key to the companies table).
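With integer primary keys, the contact import might look like the following (Company, Contact and the row fields are assumed names; a sketch, not the original code):

```ruby
contacts_data.each do |row|
  # Extra lookup: translate the external string key into our integer id.
  company = Company.where(:company_ext_id => row[:company_id]).first

  contact = Contact.where(:contact_ext_id => row[:id]).first_or_initialize
  contact.company_id = company.id   # integer FK from the lookup
  contact.name       = row[:name]
  contact.save!
end
```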

Now, if we used the external tables' string IDs directly, we could use standard ActiveRecord:
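For example (Contact and the row fields are assumed names; a sketch, not the original code):

```ruby
contacts_data.each do |row|
  contact = Contact.where(:id => row[:id]).first_or_initialize
  contact.company_id = row[:company_id]  # external string key used as-is
  contact.name       = row[:name]
  contact.save!
end
```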


This is far easier and doesn't require a lookup of Company to get the company's integer ID.

So, how do you get Rails to use strings for IDs? Here is an example migration:
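A migration along these lines might look like the following (table and column names are assumptions):

```ruby
class CreateCompaniesAndContacts < ActiveRecord::Migration
  def up
    create_table :companies, :id => false do |t|  # id: false in the create_table line
      t.string :id, :null => false                # the id needs to be defined now
      t.string :name
      t.timestamps
    end
    execute "ALTER TABLE companies ADD PRIMARY KEY (id);"  # manually add the PK

    create_table :contacts, :id => false do |t|
      t.string :id, :null => false
      t.string :company_id                        # relationship needs to be a string as well
      t.string :name
      t.timestamps
    end
    execute "ALTER TABLE contacts ADD PRIMARY KEY (id);"
  end

  def down
    drop_table :contacts
    drop_table :companies
  end
end
```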


Things to note here:

  • id: false       # in the create_table line
  • t.string :id   # the id needs to be defined now
  • t.string :company_id  # the relationship column needs to be a string as well
  • We need to manually add the PK

This allows the standard ActiveRecord belongs_to to work just like normal, without any extra options.
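For instance (assumed model names):

```ruby
class Contact < ActiveRecord::Base
  belongs_to :company   # company_id is a string, but no extra options are needed
end
```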


The Issue

When we set a PK in the migration, it has no effect on the development or production environments (as they are built from the migrations), but it does affect the test environment (which is built from schema.rb). When Rails generates schema.rb, it sees that the id field is a PK and automatically creates the below in schema.rb:
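Reconstructed from the description (column names are assumptions), the generated schema.rb entry would look something like this:

```ruby
# schema.rb: the string id column has disappeared, and :force => true
# recreates the table with Rails' default integer id.
create_table "companies", :force => true do |t|
  t.string   "name"
  t.datetime "created_at"
  t.datetime "updated_at"
end
```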


Rails convention says that if it's an ID and a PK, then it's an integer. Note that the id field is not listed, and with ":force => true" the ID defaults to an integer.
So when you:

rake db:test:prepare

It uses schema.rb, and this gives the test DB an integer ID, causing your tests to fail (if you have any).

To avoid this, we need to use a feature Rails supports that allows the schema dump to be SQL statements instead of Ruby: http://guides.rubyonrails.org/migrations.html#types-of-schema-dumps. To do this, add the following line to your config/application.rb:

config.active_record.schema_format = :sql

Rails will now ignore schema.rb (and no longer write changes to it) and will instead produce a db/structure.sql.

Now when you load your test DB structure, it will correctly build the IDs and PKs as strings.

Another option is to not use a PK at all (just add a unique index) in your migration, and schema.rb will generate the correct format, though this seems a little bit hacky...
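In migration terms, that alternative might be sketched as (assumed table name):

```ruby
create_table :companies, :id => false do |t|
  t.string :id, :null => false
  t.string :name
  t.timestamps
end
add_index :companies, :id, :unique => true  # unique index instead of a real PK
```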

Conclusions

Most people believe that using a string for a PK reduces performance; I have discussed this with a few DBAs, and they say it has little to no effect.
I have used both methods, and I found recording the external ID and constantly relating through it painful and slow. Using strings for IDs takes a little more initial setup, but after that it makes no difference to your general development.