Agile O/R Mapping

Object-Relational mapping is hard. Its usually much easier to change your object model than it is to change your database schema, which can quickly lead to a mismatch between your objects and your database. Some kind of mapping metadata layer is almost inevitable, as I have yet to find a tool that can map even moderately complex object models automatically to a database. Some of them are getting close, and if your project is small and agile enough to allow you to generate your database automatically then you can use your mapping infomation to drive your schema. I’ve been playing around with Hibernate and XDoclet to do just this, and it can work really well. Not having to worry about maintaining table creation scripts is very handy. Its difficult to work out how to scale this approach to big projects, but here are some thoughts.

On a project of any appreciable size, the database will be touched by several other systems, plus ad-hoc reports etc. Dropping and rebuilding the whole thing every iteration probably won’t work too well.

There appear to me to be at least three distinct types of data in a typical application database (in increasing order of ubiquity and variability):

Application metadata
Data used during a build process to code generate type-safe enum objects. Not frequently seen (except on the project I’m currently on), and only of interest at build time. Could in theory be excluded from a deployed database.

Static reference data
Lookup data that changes rarely, like currency codes, country codes etc. Most applications have data like this.

Application data
The active ‘memory’ of the application. Stuff that is constantly being inserted, updated and deleted during normal operation. If your application doesn’t have this type of data, you probably don’t need a database at all.

The first two types are generally easier to deal with, as they change rarely, and the changes are almost always to the data, not the schema. Generally speaking, normal dba management practices should take care of these two.

The third one is where most of the difficulty arises. It is this data that everyone cares about. The developers are trying to map their classes to it, other systems want to feed data to, or take data from it, and unruly crowds of business users want vast numbers of reports run against it.

Problems arise because there are two opposing forces at work. The developers want a totally flexible schema that perfectly maps to their classes. From their perspective a schema that is regenerated during every build based on their mapping metadata would be perfect. This end of the spectrum allows total agility, at the expense of dropping all the mapped tables every time a build is deployed. At the other end of the spectrum lie all the customers of the database – all the inputs, outputs and reports that interact with it. From their perspective, a totally stable schema (assuming the data model is not fundamentally wrong) is the ideal.

The sweet-spot on the spectrum will depend on where the application is in its lifecycle. A brand new system with no pre-existing customers will benefit from the rapid feedback that comes from being driven by the object model. This allows the developers maximum freedom to refactor as the application design improves. At some point during development, it will probably become necessary to stabilise the schema, in order to allow integration work with other systems to take place.

Coming from a development perspective, my desire is to find ways to maximise the ‘agile period’ for the database schema. One approach could be to further subdivide the schema, to insulate the parts of the database that must interact with other systems from the tables used for object persistence. This would allow the object persistence tables to remain auto-generated, while offering some stability in the other tables. The issue then becomes one of synchronisation. If using a database that allows updatable views, these could be used to map the database’s ‘public interface’ to the object persistence tables. Keeping the views in sync. then becomes a maintenance overhead, which could be significant if the object model is changing rapidly.

Note that I haven’t mentioned data migration anywhere, mostly because this is probably even harder to cope with than schema changes. Not being a dba, I’d probably do an injustice to the subject if I tried to address it. With that risk in mind, all I’ll say for now is that if your object model is driving the schema, it will probably be necessary for the acceptance test data to be defined in code, as a big method whose job is just to instantiate objects and send them to the persistence layer. This at least allows the test data to be included in any automated refactorings done in IDEA, for example. More brainpower than I currently have available is needed.

Useful links: