PostgreSQL: Dealing with half a billion rows

As the primary DBA at the company, I knew a problem was brewing that I wasn’t quite sure how to solve. The short summary is that ever since the very first applications, we have been tracking statistics on app usage. Since we’ve been building apps for over 5 years now, it’s quite a massive data set.

Infancy

When I first started, I wanted to run as little software on the server as possible. This was back when the webserver and the database server were the same physical box, without any performance-increasing RAID setup. I opted to store statistics in SQLite, with one database per show. This made a lot of sense to me at the time, but I was also very naive about how everything would look after several years.

The first issue I ran into was during the first truly major event we had an app for – 2010 International CES. With tens of thousands of users refreshing their apps and submitting their statistics, the syncs were piling up because of how SQLite locks the database during writes. I quickly patched it by having the sync APIs write to a temporary atomic store (accomplished using a list as a queue in Redis, which I already had running for other caching reasons).

This process worked well for the single-server setup, but I knew in the back of my mind it would become a pain someday. I wasn’t quite sure of my scaling strategy yet – would I scale horizontally across more inexpensive hardware, or go for more power with a centralized set of database servers?

Childhood

SQLite is a wonderful, wonderful technology. However, it’s really not meant to be used as a backend for multiple threads, let alone multiple servers. Our next hardware investment was to split the database server from the webserver and have an honest-to-goodness RAID. However, the overhead of accessing SQLite remotely was going to be too much to handle – and what if I added a second webserver someday?

I made the choice to migrate all of the SQLite databases to PostgreSQL. I decided to move away from the one-database-per-customer approach for statistics after evaluating these pros and cons:

Pros:

  • Ability to query and compare multiple shows at once
  • Simpler design
  • Known quantity

Cons:

  • Scaling horizontally would be potentially more challenging
  • Optimizing one large table could be difficult (oh boy, if only I had realized the depth of this con…)

Thus was born the device statistics table. Refactoring the code wasn’t difficult, and because of my abstraction of using Redis to temporarily store the queue of incoming stats, it was easy to cut over – temporarily disable the process that cleared the queue, deploy the code, copy all of the existing data in, then re-enable the process now that it pointed at psql.

Adolescence

As the device stats table grew, I tweaked our indexes and thought that was enough. However, any seasoned DBA knows that indexes are only half the battle. From previous experience I knew the best next step was to rely less on raw queries across the indexed table and instead create an aggregation process.

For those who haven’t dealt with a large data set like this before, what I mean by an aggregation process is simply converting raw “hit” rows into pre-calculated “count” rows. Instead of doing a count(*) over the raw table, you do a sum(count) over the aggregate table, which has far fewer rows and therefore much faster lookups and scans.
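To make that concrete, here’s a minimal sketch of the idea. The table and column names are hypothetical, not our actual schema, and the date ranges are just examples.

    -- Raw "hit" rows: one row per stat submission (hypothetical schema)
    CREATE TABLE device_stats (
        app_id      integer     NOT NULL,
        stat_key    text        NOT NULL,
        device_type text        NOT NULL,
        app_version text        NOT NULL,
        created_at  timestamptz NOT NULL
    );

    -- Pre-calculated counts, rolled up by day
    CREATE TABLE device_stats_daily (
        app_id   integer NOT NULL,
        stat_key text    NOT NULL,
        day      date    NOT NULL,
        count    bigint  NOT NULL
    );

    -- The aggregation step collapses a slice of raw hits into count rows
    INSERT INTO device_stats_daily (app_id, stat_key, day, count)
    SELECT app_id, stat_key, created_at::date, count(*)
    FROM device_stats
    WHERE created_at >= '2014-05-01' AND created_at < '2014-05-02'
    GROUP BY app_id, stat_key, created_at::date;

    -- Reports then do a sum(count) over the small table instead of a
    -- count(*) over hundreds of millions of raw rows
    SELECT app_id, sum(count)
    FROM device_stats_daily
    WHERE day BETWEEN '2014-01-01' AND '2014-03-31'
    GROUP BY app_id;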

Aggregation has its downsides: the process can be time consuming in and of itself, and if it isn’t fast enough, you’re stuck viewing stats that are frequently outdated. While the statistics for our apps are far from life or death, there’s both utility and a sense of “fun” in seeing stats that are nearly real time.

The Teenage Years

The aggregation process worked well, and it still worked moderately well as our data set kept growing. However, even a year ago I noticed several problems that are commonplace with psql at this scale.

The simple aggregation process I wrote used deletes and inserts to ensure rows weren’t counted twice. With PostgreSQL’s MVCC, deletes leave dead row versions behind, and the churn of deletes and inserts left the tables and indexes fragmented and bloated. Reindexing isn’t a fun process either: issuing a true REINDEX acquires locks that prevent inserting new records, which essentially freezes the aggregation process. The benefit of defragmented indexes is significant enough that even a lock can be worth it, but what I tended to do more frequently was create a set of new indexes with the same parameters and then drop the old ones. That avoided the locking problem, but took longer and required more disk space.
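For reference, here’s a sketch of that create-then-drop swap with hypothetical index names. I’m assuming CREATE INDEX CONCURRENTLY here, which is what lets the build run without blocking inserts, at the cost of a slower build.

    -- Build a fresh, unbloated copy of the index alongside the old one.
    -- CONCURRENTLY avoids blocking writes while it builds.
    CREATE INDEX CONCURRENTLY device_stats_app_ts_idx_new
        ON device_stats (app_id, created_at);

    -- Once it's in place, drop the fragmented original.
    DROP INDEX device_stats_app_ts_idx;

    -- A plain REINDEX, by contrast, locks out writes for the duration:
    -- REINDEX INDEX device_stats_app_ts_idx;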

The next approach I took was partitioning the tables. Since aggregated data can to some extent be “locked in” after a certain point, I began splitting old data off by timestamp: 2009 data in its own table, 2010 in its own, 2011, and so on. For 2012 I started splitting by quarter, and then by month for 2013 and 2014.
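At the time there was no declarative partitioning in Postgres, so a split like this is normally done with table inheritance plus CHECK constraints; here’s a rough sketch with hypothetical names and ranges.

    -- Child tables inherit from the main stats table and each hold one slice
    CREATE TABLE device_stats_2013_01 (
        CHECK (created_at >= '2013-01-01' AND created_at < '2013-02-01')
    ) INHERITS (device_stats);

    CREATE TABLE device_stats_2013_02 (
        CHECK (created_at >= '2013-02-01' AND created_at < '2013-03-01')
    ) INHERITS (device_stats);

    -- With constraint exclusion on, the planner skips any child table whose
    -- CHECK constraint can't overlap the WHERE clause
    SET constraint_exclusion = partition;

    SELECT count(*)
    FROM device_stats
    WHERE created_at >= '2013-02-01' AND created_at < '2013-03-01';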

This approach worked well, but eventually the gains just weren’t enough.

The College Years

I started experimenting with several concepts, all of which had merit but came with their own hangups as well.

Per-app Database

I didn’t actually experiment with this much, because after some very quick searching I discovered that many people reported issues with autovacuum, and that psql dedicates a set of files to each database – with thousands of databases you run into max open file limits very easily. I didn’t want to deal with that hassle, but those same search results pointed me to the next concept I experimented with.

Per-app Schema

Schemas are honestly a little weird in psql. I say this not because they aren’t a useful construct, but because “setting the current schema” is done in a way you might not expect.

For example:

    schema public (default)
        table some_global_table
    schema app1
        table device_stats
    schema app2
        table device_stats

By default, you can reference tables in the “public” schema without qualification, e.g. SELECT * FROM some_global_table, and you can reference tables in other schemas by prefixing the schema name, e.g. SELECT * FROM app1.device_stats. The way to change the “current” schema is to set the search path, like this: SET search_path = 'app1'. You can specify a full search path, which is cool, but it still seemed a little “weird” to me.
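Concretely, the per-app layout looked something like this (schema and column names hypothetical):

    -- Each app gets its own schema containing an identically named stats table
    CREATE SCHEMA app1;
    CREATE TABLE app1.device_stats (
        stat_key   text        NOT NULL,
        created_at timestamptz NOT NULL
    );

    -- Fully qualified access always works
    SELECT count(*) FROM app1.device_stats;

    -- Or point the session's search path at one app, keeping public
    -- visible for shared tables
    SET search_path = app1, public;
    SELECT count(*) FROM device_stats;   -- now resolves to app1.device_stats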

However, I began to see a way to keep most of the code the same while splitting every app into its own schema. Sure enough, the split sped things up by virtue of reducing the data set sizes.

Unfortunately after splitting all of the data on another server for testing, I realized that having thousands of schemas just felt really ugly. It solved the problems, but not in a way I was happy with.

Amazon’s Redshift

Redshift is a custom-built solution that tackles big-data problems by enforcing physical sorting of tables and offering a sort of map-reduce functionality from within standard SQL. It’s a really cool technology, and I was able to get our raw stats table loaded up and queries running pretty quickly. I played around with a lot of configurations and decided this was where I wanted to move our stats.

Over the course of the next couple of weeks, we were making other drastic changes to aggregation that made the turnaround much faster but were also more optimized for a Redshift-type setup.

However, those couple weeks led me to really ponder the setup. I had been focused on speeding up aggregation, and rethinking aggregation entirely meant I could focus on trying to speed up the main stats table. Could I salvage our current setup?

CLUSTER

Perhaps one of the worst choices maintainers of technology can make is to choose terms that make Google results hard to pin down. Such is the case with the CLUSTER statement. Postgres calls an individual instance running on a server a “cluster”, so when I initially saw references to clusters, my brain processed the search results as, “Oh, I definitely don’t want to set up a cluster for every single app.”

However, something finally jumped out at me in one of my searches last week. The summary of that link is that CLUSTERing a table physically reorders it based on the order of the index you specify. This is essentially how Redshift operates, just not with the same commands. But would this change be enough to match the gains of custom-designed software like Redshift?
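The statement itself is tiny; the table and index names below are hypothetical. The catch, which I was about to learn the hard way, is that CLUSTER takes an exclusive lock and rewrites the whole table in index order.

    -- Physically rewrite the table in the order of the given index.
    -- This locks the table (no reads or writes) for the duration.
    CLUSTER device_stats USING device_stats_agg_idx;

    -- CLUSTER is a one-time reorder, not an ongoing guarantee: new rows
    -- land wherever they land, so it can be re-run later. Without USING,
    -- it reuses the index recorded by the previous run.
    CLUSTER device_stats;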

Hardware-wise, I was comparing an Amazon Dense Compute cluster:

  • 2 virtual CPUs
  • 7 Elastic Compute Units
  • 15 GB of RAM
  • 160GB of SSD RAID

Against our current standard database node:

  • 2 quad core CPUs
  • Dedicated to running only our databases
  • 48-64 GB of RAM
  • 1.6 TB of 15K SAS RAID

I was convinced we should be able to beat Redshift, but being the amateur DBA who had already spent more time on this problem than I originally wanted to, I was fine going to Redshift if I couldn’t beat it quickly. Redshift didn’t have to be permanent; I could always migrate back off of it in the future once I possessed more wisdom.

And thus a plan was born.

Adulthood

This past weekend I began by disabling aggregation and preparing to migrate to Redshift if needed. I turned off that Resque queue flusher and began my CLUSTER experiment. I first had to create a new index on the half billion rows we had, and since it was a complex index across 5 columns, it took a very long time.

Learning step 1: I was using the default maintenance_work_mem and a smaller work_mem than I should have. The initial index took 16 hours to create, and it was CPU bound, not disk bound. My post-hoc hypothesis is that because I wasn’t letting Postgres read in and operate on enough data in memory at once, it was constantly paging data in and out – which thankfully hit the OS page cache, but still forced psql to work on chunks of data that were too small at each step.
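The fix is simple enough: raise those settings for the session before kicking off a big build like this. The values and the 5-column index definition below are illustrative, not my actual settings or schema.

    -- Session-level bump before a large index build or CLUSTER; values are examples
    SET maintenance_work_mem = '2GB';   -- used by CREATE INDEX, CLUSTER, VACUUM
    SET work_mem = '256MB';             -- per sort/hash operation in regular queries

    -- The kind of wide, multi-column index being built (hypothetical columns)
    CREATE INDEX device_stats_agg_idx
        ON device_stats (app_id, stat_key, device_type, app_version, created_at);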

I raised those settings and started the CLUSTER operation. Unfortunately psql has no progress bar or way to query progress, so I watched iotop in accumulated mode to gauge roughly how much longer I had to wait.

After a painfully long wait (around 28 hours total), the operation succeeded. It took that long because it had to physically reorder the entire table on disk and then rebuild the indexes, since all of the data had moved out from underneath them.

The first step after CLUSTER is to ANALYZE. I kicked that off, and then tested the updated aggregation script against the table. The larger queries that could previously take up to 4 minutes returned in about 1500 ms. And compared to Redshift, it won by about 800 ms on average.

I finished deploying and testing the new code, and everything has been extremely smooth sailing. Our worker queues, which used to always have a backlog of aggregation jobs, now clear up completely before the next round of aggregation starts.

Life’s Lessons

The moral of this story boils down to something software developers already know: when something smells, don’t wait until it breaks to fix it. Start looking for a solution before the stench is overwhelming. I waited a bit longer than I should have on this one and, to carry the same olfactory analogy, by the time I paid attention to the stench I could smell it from down the hall.

By stepping back and rethinking the problem, I was able to cut through to the root of the issues and ultimately discover the (currently) optimal solution for our setup. The reality is that I knew that while half a billion records is a large number, it’s nowhere near a “huge” data set – DBAs all over the world face data sets this large on a daily basis. It was that knowledge that drove me to figure out how to optimize my psql install rather than abandon it.

Tools of the Trade

One of the posts I wrote the last time I restarted this blog was a list of services and tools that have aided me in running my business. Given that I wrote it a few years ago, I figured an updated list was in order.

One stark change for me is that while I still dive in on iOS and Android code, my primary focus these days is on architecture and forward-looking work. As a side effect of scaling the business from where we started 5 years ago to today, I have a larger focus on the “dev-ops” perspective than I used to, and my daily tools reflect those two roles more than they ever have.

Infrastructure

  • NodePing: At the top of my list is the monitoring service I know I had in my previous writeup. They have a slightly different pricing model than before, but they are still the most affordable option I’m aware of, and they’re great people to work with. If you have many websites or services (SMTP, etc.) that need to be monitored, these guys should be your go-to. If you are looking for an inexpensive monitor for an HTTP site that doesn’t need to check every minute, I still highly recommend UptimeRobot, which is free for up to 50 sites and checks every 5 minutes.

  • NewRelic: I think NewRelic has an incredible product, but we’re not paying customers. On their free plan, you get 24-hour retention of some very useful monitoring information. However, the jump from free to Pro is drastic, and while there are some great features that I sometimes want, I just can’t stomach the cost when many of the hosts I want that extra information on are VPSes that cost $20 a month. If I were to pay full price (I know they have discounts) for Pro on all the machines I would want it on, it would cost more than our entire hosting bill.

Their free tier provides some great monitoring and can help narrow down problems quickly. I highly recommend using them for the free tier at least. I just wish there were a plan whose cost I could justify, because I do love the service that much.

  • DigitalOcean: I found out about this VPS hosting provider late last year, and I’ve only heard and experienced great things. They offer a smaller/cheaper tier than Linode, but above that tier Linode and DigitalOcean are very similar in pricing and specs. I like the fresh take that DigitalOcean has. They aren’t nearly as full-featured as Linode, but I loved their SSD offerings before Linode recently switched over to SSDs for their hosting. Definitely a company to check out for your VPS hosting – not the cheapest, but pretty close, with great service and support so far.

  • Linode: The main provider of most of my VPSes is Linode. I was feeling a little stale with them, although I have always been a fan, and their pretty major switch to SSDs reminded me why I should remain one. The only painful part is something they couldn’t really control – I’ve been a customer so long that many of my servers were created back when 32-bit kernels were the recommendation, and to take advantage of the SSD installs you need a 64-bit kernel. So I’m slowly rebuilding those servers – the ones that would benefit from SSDs.

But overall, I want to just say these guys have spectacular support, affordable pricing, and are one of those companies I’m glad to have supported and used all of these years.

  • Redis: Redis has been around as long as my company has. It is a very fast key-value in-memory database with a lot of great features. It has atomic operations, list data types, set operations, and more. I use it in a variety of ways ranging from a simple cache to a place to put incoming information that can be processed by background workers.

It isn’t meant to replace your relational database, although many people have used it as the only database for a project. My recommendation, however, is to keep a relational database as your primary data store and use Redis to help solve problems that are trickier in a relational database.

  • PostgreSQL: In the days before I was a DBA (if I can call myself that), I already had a fondness for PostgreSQL due to MySQL’s history of corrupting user data. That’s not an issue these days if you use the proper table formats, and MySQL is a capable server. However, this video demonstrates many of the reasons why I vastly prefer PostgreSQL.

The summary of that video is that in some cases MySQL takes the sanctity of your data and twists it rather than giving errors. For example, performing an INSERT on a table with an “int NOT NULL” column, with no value specified for that column, inserts a row with 0 in that column rather than raising an error.
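A minimal illustration with a hypothetical table (newer MySQL versions running in strict sql_mode do reject this, but the historical default was to twist the data):

    -- Hypothetical table, purely for illustration
    CREATE TABLE events (
        id   int NOT NULL,
        hits int NOT NULL
    );

    -- Insert that omits the NOT NULL "hits" column
    INSERT INTO events (id) VALUES (1);

    -- PostgreSQL: ERROR - null value in column "hits" violates not-null constraint
    -- MySQL (non-strict sql_mode): inserts hits = 0 and only emits a warning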

Are there ways that MySQL is better than PostgreSQL? Undoubtedly. But there are also many ways that PostgreSQL is better than MySQL. Is there any one true database server? Definitely not. I’m just sticking with PostgreSQL for my own projects.

  • SaltStack: This is one of the more recent additions to my toolkit. There are many distributed system management solutions, and I’m not an expert in any of them. I tried my best to understand and configure one of our simpler webservers with Chef, and got lost. I couldn’t see myself getting much further with Puppet. In the process, I discovered SaltStack and gave it a whirl. I absolutely love it. It still took a bit to really understand, but within a day I had a completely automated solution for spinning up a new VPS on either Linode or DigitalOcean and getting it fully configured as a webserver for one of our products.

Being able to execute commands or update configurations across all servers of a specific product is amazing. I love the power. On any project you anticipate growing, I highly recommend taking the time to set up something like SaltStack to help you manage multiple servers.

Managing a Team

I manage a completely remote development team that spans 4 platforms. Keeping organized and connected is a struggle. I plan on going a bit more in-depth about the challenges of managing a remote company in a separate post, but here I can give a brief overview of some of the tools we use:

  • HipChat: For several years we used Campfire from 37signals (now known as Basecamp). It’s a capable product, but with their recent change in direction they are no longer going to put any effort into it. I really dislike paying for a service the creators have admitted they aren’t going to focus on anymore. I started a search that lasted several months for the next best solution, and HipChat was one of the better options.

HipChat on its surface is similar to Campfire in many ways, but it adds some nice features. It was a little more expensive for our team size, but while our trial was running they announced they were going free for unlimited users. They recently added one-on-one audio and video calls, including screen sharing. It’s a little buggy, but if you don’t use USB audio it actually works nearly flawlessly. Once it stabilizes, I’m confident we’ll be paid customers.

However, the benefits of having persistent group chat for your team at $0 per month are just outstanding. If you don’t have a good group chat setup, definitely give HipChat a try.

  • Skype: For voice calls, Skype just works. I don’t actually love Skype; there are a lot of things I dislike about it. But the pros outweigh the cons: cheap unlimited dialing out from your computer and easy, high-quality voice calls, including conferences with multiple people. Skype fills many gaps and solves many problems that the alternatives just don’t have a good answer for.

  • Redmine: We adopted Redmine within the first year of starting the company. It’s a great open-source ticket tracking/project management solution. If you are looking for a standard ticket/issue tracking system, this should be on your list to consider.

  • Passpack: Securely sharing passwords is a tricky problem, and this is a pretty clever website. It’s not the easiest tool to use, but once you figure out the flow it works stunningly well. With it, I have one paid account; each of my developers who needs access to private keys or passwords links to it via the sharing feature, and I can add them to the proper groups to give them access to view those entries. If they add a new entry, they can share it and transfer ownership to the main account.

Day to Day Tools

  • Sublime Text: Sublime lives up to its name for me. I used to use BBEdit or TextMate. However, they both had their downsides and their benefits. I still use BBEdit for some things, but for 99% of my day-to-day workflow, I live in Sublime Text. It is a wonderful text editor that works the way I want it to, and has a lot of great customizations and add-ons available. The only downside for me is that discovering some of those settings and add-ons can be tough. If you aren’t happy with your text editor, definitely give this one a try.

  • Transmit: I commonly have to transfer files to our servers or from other FTP servers. In the back of my mind I always thought there was probably a pretty good cheaper alternative to Transmit these days. For example, I knew of CyberDuck and recommended it to many of the people I worked with. However, several people in the last year have heard me say, “I use Transmit, but CyberDuck is free,” then gone on to download the Transmit trial after using CyberDuck for a while and ended up buying it.

There’s something about its simplicity and flow that just makes it a joy to use compared to other remote file clients.

  • LastPass: For my personal password management I currently use LastPass. I still use 1Password for some things, but LastPass is inexpensive and cross-platform. I don’t love that it lives in the web browser, but functionality-wise and price-wise I love it. 1Password is a great product, but for the longest time its Chrome integration was poor, which is what led me to try LastPass. I haven’t really looked back, since LastPass works well enough and the cross-platform buy-in for 1Password is a little too high for me.

What else?

I could go on, but those are the highlights of what I consider my core tools. There are many other things I use on a day-to-day basis, but some are either required by everyone who works on that platform (Xcode) or have been beaten to death and most people have heard of them. Hopefully there are a few items on this list you haven’t heard of that might solve some problems you’re facing.

Have any questions about specifics of one of the products above? Just ask in a comment.

Welcome Back, Me!

I doubt I have any followers on this blog anymore, since it’s been such a long time since I wrote anything, and I’m sure the death of Google Reader caused some people to clean up their feeds.

I’ve always had the desire to come back and restart my blog, but I never quite pushed myself to get it done. I finally convinced myself earlier this week that I needed to get it going again – but I’ve forgotten which topic made me want to bring it back in the first place!

As a result, I’m unfortunately writing the typical welcome post as filler, but I hope to put up an interesting post at least once every weekend. My topics won’t always be strictly development – they never were. My last 5 years of building a completely virtual company from the ground up have offered me some interesting insights into a lot of topics that any tech-driven entrepreneur would be interested in hearing about.

If you are still following along, feel free to leave a comment. I’d love to hear from you.