Google App Engine: Lazy Data Migration with Versions
If you have built an application on Google App Engine, you have probably scratched your head over data migration. App Engine is not that great when it comes to Agile-style iterations of web development. In its own way, it forces you to design your models up front instead of making it easy to let them evolve over time.
I have been developing a community application, 'Yes to Politics', for friends in Andhra Pradesh, India, to interact with politics in a strangely different way. More about the app later; for now, let me share what I learned about data migration.
I have around 28000 entities in one model and, trust me, I tried to cover all my needs in the design of the model up front. Well, software development doesn't work that way. I realized I needed another property in the model, and I needed to give it a default value too.
When you add a new property to an existing model, Google App Engine doesn't fill in the default value for the entities that are already stored. So when you access the new property on those existing entities, you will meet with exceptions, because the property never existed for them. You could check whether the property exists before accessing it. Python's built-in hasattr() does that, but under the hood it simply attempts the access and catches the failure for you. So the only real way to find out whether the property exists is to access it and handle the exception when it doesn't. Not a cool way, but that's almost the best method we've got.
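As a small illustration of that check, here is a sketch in plain Python rather than the App Engine API. The Entity class and the "rating" property are hypothetical stand-ins, not names from my application.

```python
class Entity(object):
    """Hypothetical stand-in for a stored entity; not an App Engine class."""

    def __init__(self, **props):
        self.__dict__.update(props)


def read_with_default(entity, name, default):
    """Return entity.<name>, or default when the attribute was never set."""
    # getattr with a third argument handles the AttributeError for us.
    return getattr(entity, name, default)


old = Entity(name="old row")             # saved before the property existed
new = Entity(name="new row", rating=5)   # saved after the property was added

print(read_with_default(old, "rating", 0))
print(read_with_default(new, "rating", 0))
```

This only papers over reads. It does nothing for the stored data, which is why queries remain a problem.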
You also cannot use the new property in queries that need to search the existing set of entities, because entities stored without the property will not match a filter on it. There is a way to fix this.
Define your new property, then loop through all your entities and set the new default value on each. This is no easy job on Google App Engine. First, you need to set up a new URL and an HTTP handler to take care of this maintenance. Second, you can only update so many entities in one go without exceeding the restrictions on time and processing power. So you need to split the task into smaller pieces: say, update 10 entities at a time, and have the handler's page auto-refresh every few seconds until all the updates are done. Then you open that handler in the browser and wait until it finishes.
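The batching idea can be sketched roughly like this. It is plain Python with a list of dicts standing in for the datastore, so it runs anywhere. The names migrate_batch, BATCH_SIZE and "rating" are made up for illustration, and a real handler would fetch and put entities instead of slicing a list.

```python
BATCH_SIZE = 10  # assumption: small enough to finish within one request


def migrate_batch(entities, start, default_value, batch_size=BATCH_SIZE):
    """Set the new property on one slice of entities.

    Returns the offset for the next call, or None when nothing is left.
    """
    batch = entities[start:start + batch_size]
    for entity in batch:
        # Only fill in the default; never overwrite an existing value.
        entity.setdefault("rating", default_value)
    next_start = start + len(batch)
    return next_start if next_start < len(entities) else None


# Stand-in for the datastore: 25 dicts that lack the new "rating" property.
store = [{"id": i} for i in range(25)]

calls = 0
offset = 0
while offset is not None:  # in the real setup, each pass is one page refresh
    offset = migrate_batch(store, offset, default_value=0)
    calls += 1

print(calls)
```

Returning the next offset is what lets the refreshing page pick up where the previous request stopped.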
I have about 28000 entities, which means calling that handler almost 2800 times (10 entities at a time), and it is better to leave a pause between calls to keep the App Engine restrictions happy. For my model, 5 seconds between calls worked fine; any quicker and App Engine threw an exception. It took about 25 minutes for me to finish the process.
I figured it would be a one-time task anyway, so I didn't regret waiting that long. But after all that was done and I had been happily using it for a couple of days, I found that I had to add yet another property.
This time, instead of doing it the hard way, I decided to do something different. Before adding the property I actually had in mind, I added a simple integer that acts as a version number for the entity. I followed the same process as above and waited another half hour for every entity to get it. Then I added the property I really wanted, but deferred filling in its default value until each entity is actually used. The lazy way. So I only need to update my queries with this version number logic, and I don't have to update all my data at once.
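Here is a minimal sketch of the lazy upgrade, again in plain Python with dicts standing in for entities so that it runs outside App Engine. CURRENT_VERSION, upgrade, load and the "rating" property are illustrative names I am assuming here, not my actual code.

```python
CURRENT_VERSION = 2  # hypothetical: bump whenever the model gains a property


def upgrade(entity):
    """Bring one entity up to CURRENT_VERSION; return True if it changed."""
    if entity["version"] >= CURRENT_VERSION:
        return False              # already current, nothing to write
    if entity["version"] < 2:
        entity["rating"] = 0      # default for the property added in version 2
    entity["version"] = CURRENT_VERSION
    return True


def load(store, key):
    """Fetch an entity and migrate it on first access (the lazy part)."""
    entity = store[key]
    if upgrade(entity):
        store[key] = entity       # stands in for saving the entity back
    return entity


store = {
    "a": {"version": 1},               # not migrated yet
    "b": {"version": 2, "rating": 4},  # already current
    "c": {"version": 1},               # never accessed below, so never touched
}

print(load(store, "a"))
print(load(store, "b"))
```

Each later model change would add one more version check inside upgrade, so an entity several versions behind catches up in a single access.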
I realized later that it was an excellent move, because not all my entities really need the new property. Whichever entity needs it gets it when it is accessed for the first time. Your data migration becomes highly scalable. I now begin all my models with a version number, so that I never have to worry so much about data migration when I decide to change them.
This is not without a downside. We have an additional property in the model and the small overhead of a version number comparison every time an entity is accessed. On the other hand, if you keep adding properties that are not required for all entities, the storage space you save by not filling them in everywhere could easily outweigh the cost of this one additional property.
But you decide whether the flexibility of scalable data migration is worth the extra weight and hassle.
Comments
Let's say I have a model called Car. I decide to add a property doorType. Pseudo code to update:
cars = Car.all().fetch(10)
for c in cars:
    doorType = FindOutDoorType(c)
    c.doorType = doorType
    c.put()
How do I ensure that I am not fetching already converted entities? I could do Car.all().filter(somefilter).fetch(10), but what is the somefilter then? I tried doing something like .filter('doorType=', None) but that doesn't work.
Thanks for any ideas.