Bernt Habermeier is the SVP of Engineering at Moblyng Games. In this guest article he explains how his team created a multiplayer poker game that runs on poor mobile internet connections.
In this post we'll walk you through how you can make your mobile web game resilient to poor network conditions. Many excellent developers are used to building games for the desktop environment and often don't think about network reliability when they implement their apps. When you find that your app freezes up over a cellular network you can shrug and blame the carrier, or you can roll up your sleeves and fix the problem. In this post we'll teach you how to fix such problems and discuss the creation of multiplayer games that play great even in areas where the user's cellular coverage is not great. As a side note, the techniques discussed here are broadly applicable and go beyond just games, but we'll focus on games in this post.
It turns out that some of the old methods from way back came in handy when I recently wrote the cellular-network-friendly networking code for Moblyng's Social Poker Live. I have a few years' experience working on games over questionable network connections, including 28.8k - 56k modems.
Before we dive into the issues and how to solve them, I'd like to quickly review the game so you can see that the solution I'm describing is one we used in a feature-rich, live, production-quality title. Social Poker Live is a multiplayer poker game that can be played in real time over any Internet connection on iOS and Android mobile devices (phones and tablets), and it's resilient to spotty network conditions. For example, you can play the game on Caltrain going from San Francisco to Redwood City, and it will recover after going through tunnels and the like.
Of course, the game also runs well on a desktop, but in this post we'll focus on the mobile experience. Try out Moblyng's Social Poker Live on your iOS or Android phone at http://poker.moblyng.com.
Here are some screenshots of the typical user experience, starting with the loading screen, all the way to the winning screen. One neat feature is that we'll discover which of your Facebook poker pals are online, and ask you if you'd like to join their table. Here is what that looks like:
I'd like to encourage you to load the game in Chrome, bring up the developer tools, look at the Network tab, and go back to the game to sit at a table. You'll see the data passed back and forth between the client and the server, including packet identifiers and acknowledgements that we'll talk about in the rest of the article.
The common perception is that it's difficult to implement real-time multiplayer games on mobile devices. It's true that this type of game is more involved than a typical social game, which doesn't require down-to-the-second data synchronization across a group of players, but it's not terribly difficult either. There tend to be two problems with multiplayer real-time games: scalability and networking over a cellular network. This post talks about the latter -- how to get reasonable networking performance even over spotty 3G cellular networks using HTML and JavaScript.
Traditional web technologies are decent at issuing requests to servers. The client knows when it wants some piece of information and hits the server with an HTTP request, and as a result the server scrambles to offer up the response as quickly as it can.
In multiplayer games the game client knows it wants game updates as quickly as possible, but it doesn't know when to ask for such information. This doesn't fit well with a model where the client has to initiate the request for data. When do you ask for an update?
You could just hit up the server once a second with an active poll request. Your operations manager could also decide to hit you once a second over your noggin because that's not efficient for either the client or the server. All hitting aside, you won't get sub-second response times -- especially over a cellular network.
I wish I could say I implemented the client/server communication with WebSockets.
WebSockets just make sense because they provide the exact functionality that you want when implementing a responsive multiplayer game; both the server and the client can read/write information whenever they want. More importantly, WebSockets are also much more efficient because they do not send all of the unnecessary HTTP headers with every message. This is a huge benefit, especially for mobile applications that are trying to run over a cellular network. Still not convinced? Take a look at http://websocket.org/quantum.html.
Unfortunately, I can't say that we're using WebSockets, because Android doesn't support them. As I lament the lack of WebSockets on Android, let's move on to the next best thing: the long poll.
What's neat about the long poll is that it turns the notion of the typical client initiated request on its head. There are many variations and tricks people play with long poll implementations, but at the core of the long poll is a simple understanding between the client and server to break with tradition of servicing a client request as quickly as possible. Instead the client expects the server to hold on to the request and keep the underlying TCP connection open for as long as it wants to (within reason). The moment the server has information it would like to share with the client it sends the data over to the client and terminates the connection. This is reasonably efficient because all the pain of setting up the HTTP request is already behind us.
Upon data receipt the client immediately initiates the next request to the server -- again expecting the server to sit on it until there is something of note to send back.
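The request-respond-request cycle described above can be sketched as a small loop. This is a minimal illustration rather than the game's actual code: `fetchUpdates`, `onMessages`, `shouldContinue`, and `retryDelayMs` are hypothetical names, and `fetchUpdates` stands in for whatever Ajax call performs the long poll.

```javascript
// Hypothetical sketch of a long-poll loop. `fetchUpdates` is an assumed
// function that issues one long-poll request and resolves with the
// messages the server eventually sends back.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runLookieLoop(fetchUpdates, onMessages, shouldContinue, retryDelayMs = 1000) {
  while (shouldContinue()) {
    try {
      // The server sits on this request until it has something to share.
      const messages = await fetchUpdates();
      onMessages(messages);
      // No pause here: loop around so a request is always waiting at the server.
    } catch (err) {
      // A failed poll (or a handler error) is not fatal. Pause briefly so a
      // dead connection does not turn into a tight retry loop, then poll again.
      await sleep(retryDelayMs);
    }
  }
}
```

The pause on failure is a design choice of this sketch; the important part is that a successful response is followed immediately by the next request.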
I affectionately call such long poll communication “lookie” requests. I'm sure your implementation will work without resorting to such a cute name, but your technical conversations with your coworkers simply won't be as fun as mine. Example: Duude, how many lookies are we running on our prod server config? (Warms my heart. Every time.)
The long poll is not the only Ajax GET request the client makes. User initiated actions need to be sent to the server right away, and because the client isn't in control over when it needs to issue the next long poll request, it can't use that channel to communicate with the server about user initiated actions. Thus, we have to implement one more Ajax communication channel for asynchronous one-off communications originating on the client and going to the server. These requests are traditional Ajax requests, where the client can expect a fast turn-around from the server.
I call such communication “action” requests.
Cellular networks are notoriously spotty. I'm sure you've found yourself in the situation where you were sure you had a good connection only to realize that you've been talking to yourself for a while. When this happens to me, I always feel a bit sheepish. We can expect that our client/server communication code on mobile devices will be similarly affected -- though I can't comment on its emotional state.
Besides changing your cellular carrier when you get frequent data drops, you'll want to write your code so that it will handle poor conditions and retry sending the data if it doesn't arrive. You can catch exceptions like timeouts and associated errors as part of making the Ajax request, but what you don't know is whether your HTTP request died on the way to the server or on its way back to the client.
Specifically, here are the modes of communication failure and their associated problems when trying to resend the request or data.
| | Failed Client to Server | Failed Server to Client |
| SEND (single Ajax) action | Clean Retry | Danger (duplicate request) |
| RECV (long poll) lookie | Danger (duplicate update) | Danger (lost update) |
For action requests, if the client to server communication fails, nothing bad happens if we just retry sending the same action: It's as if we had never sent the data. However, if the server gets the action request and processes it, but the response message from server to client fails -- then we're in a bit of a bind when the client retries the action request all over again. In this case the server will process the same action once more on the client retry, which is likely to result in some kind of grief later on. So our code will want to catch this.
For lookie requests, if we're not careful we might end up sending duplicate data updates to the client, or we might lose updates going to the client if the server is too optimistic about the client having received data it had sent out. Beyond the loss or duplication of data, we also want to make sure the updates are delivered to the client in the correct order. These are all things you might get wrong if you are not careful.
We'd do well to assume that the connection we are getting over the cellular network is going to be spotty. From the client side, that means being reasonably aggressive with timing out our Ajax requests, and retrying. From both sides it means going with the assumption that data you're trying to send to the other side is not going to get there, so you better hold on to it so you can resend it if you need to.
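One way to sketch "time out aggressively, hold on to the data, and retry" on the client is shown below. Everything here is an assumption made for illustration: `withTimeout`, `sendWithRetry`, `sendOnce`, `timeoutMs`, and `maxAttempts` are hypothetical names, and `sendOnce` stands in for the real Ajax call.

```javascript
// Hypothetical helper: reject if `promise` has not settled within `ms`.
// Note that this only stops waiting; it does not abort the request itself.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("timed out")), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Retry `sendOnce(payload)` until it succeeds or `maxAttempts` is used up.
// The payload is kept for the whole call so the same data can be resent.
async function sendWithRetry(sendOnce, payload, { timeoutMs, maxAttempts }) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await withTimeout(sendOnce(payload), timeoutMs);
    } catch (err) {
      lastError = err; // remember why this attempt failed, then try again
    }
  }
  throw lastError;
}
```

A blind retry like this is only safe if the server can recognize a request it has already processed, which is exactly the duplicate-request danger in the table above.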
The first thing we'll want the server to do is keep the data in memory and label each logical data message with a sequence number. Once we have sequence numbers we have an easy way for the client and server to talk about what data they have and/or need to resend. The general plan is for the server to send data with sequence numbers to the client, and then have the client acknowledge (subsequently called “ack”) these sequence numbers back to the server as a form of proof of receipt. Sequence numbers are meta-information that should only be relevant to the network protocol layer in your code, and for the purposes of the following charts, a box with a number in it implies a data packet that has a specific sequence ID affixed to it.
The server has a message that it wants to send to the client. In this example the message data contains information about a player. The server networking protocol adds a sequence ID to the message as meta-information. This sequence number is relevant to the protocol layer only. The rest of the server or client code should not care about it.
Server-side core logic in the game can make calls that pertain to updating a client from a high-level perspective (ex: sendPlayerInfo, sendInventory, and updatePos). The methods end up calling a networking routine that affixes a message sequence number to each logical message and subsequently pushes all this to an outgoing message buffer array.
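A minimal sketch of that layering might look like the following. The `OutgoingBuffer` class, its `enqueue` method, and the message type strings are hypothetical; only the high-level call names `sendPlayerInfo` and `updatePos` come from the text above.

```javascript
// Hypothetical per-client outgoing message buffer. The class and method
// names are illustrative assumptions, not the game's actual server code.
class OutgoingBuffer {
  constructor() {
    this.nextSeq = 0;  // sequence number the next message will get
    this.pending = []; // messages queued and not yet acknowledged
  }

  // Protocol layer: affix a sequence number and push onto the buffer.
  enqueue(type, payload) {
    const packet = { seq: this.nextSeq++, type, payload };
    this.pending.push(packet);
    return packet.seq;
  }
}

// High-level game calls never deal with sequence numbers themselves.
function sendPlayerInfo(buffer, player) {
  return buffer.enqueue("playerInfo", player);
}

function updatePos(buffer, pos) {
  return buffer.enqueue("pos", pos);
}
```

Keeping the sequence number in a wrapper around the payload means the core game code on either side never has to look at it.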
Below we build up an example of the entire client / server chain of events.
Let's say that the core server logic wants to send various messages to a player. As an example, let's assume it makes 2 calls that put data onto an outgoing message queue, and the networking code annotates the data with sequence numbers: 0 and 1.
The data is ready for pickup by a long-poll from the client.
The client at some point starts its long-poll request, and sends out an ack id of -1 along with the request (meaning it has not received any data yet).
The server responds by sending everything it's got for that client: Packets with sequence number of 0 and 1.
The client receives packets with sequence number of 0 and 1, and immediately continues its long poll, but this time sends an ack id of 1 (confirming that it got packet 0 and 1).
When the client gets the information it can queue up client-side actions that react to the messages in the core client code (i.e., data is passed out of the networking code for processing in the core client code).
The networking code on the server gets the next long-poll message from the client. This time it carries an ack id of 1, which means the client acknowledges receiving everything up to and including packet 1.
Thus, the server deletes packet info 0 and 1 from memory for this client, and the whole chain of events can continue.
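The server's side of this exchange can be sketched in a few lines. The function name `handleLookie` and the packet shape `{ seq, payload }` are assumptions made for illustration; a real server would also hold the request open when there is nothing to send.

```javascript
// Hypothetical server-side handling of one long-poll request.
// `queue` holds this client's outgoing { seq, payload } packets, oldest
// first; `ackId` is the ack id that arrived with the request.
function handleLookie(queue, ackId) {
  // Everything up to and including ackId was received: safe to delete.
  while (queue.length > 0 && queue[0].seq <= ackId) {
    queue.shift();
  }
  // Whatever remains is either new or was never received, so send it all.
  return queue.slice();
}

// Walking through the example above:
const queue = [
  { seq: 0, payload: "player info" },
  { seq: 1, payload: "inventory" },
];
const firstResponse = handleLookie(queue, -1); // sends packets 0 and 1
const secondResponse = handleLookie(queue, 1); // deletes 0 and 1, nothing left to send
```

Because the server only deletes on an explicit ack, a repeated request with an old ack id simply causes the same packets to be sent again.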
Putting it all together, the whole sequence of events looks like this.
Because the server assigns a unique and increasing sequence number to every outgoing message, the client can acknowledge the receipt of this information back to the server. This acknowledgement allows the server to delete the old data, or in some cases, it allows the server to resend the old data because something bad happened and the client never got it. Above is a small example of a data run where nothing went wrong.
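The client half of the exchange can be sketched the same way. This is a hedged illustration, not the game's code: the `LookieReceiver` class and its `deliver` callback are hypothetical, and the sketch assumes sequence numbers start at 0 and increase by one, as in the example above.

```javascript
// Hypothetical client-side receiver for the protocol layer.
class LookieReceiver {
  constructor(deliver) {
    this.ackId = -1;        // -1 means "nothing received yet"
    this.deliver = deliver; // callback into the core client code
  }

  // Process one long-poll response and return the ack id for the next poll.
  receive(packets) {
    const ordered = [...packets].sort((a, b) => a.seq - b.seq);
    for (const packet of ordered) {
      if (packet.seq <= this.ackId) continue;   // duplicate: already handled
      if (packet.seq !== this.ackId + 1) break; // gap: wait for a resend
      this.deliver(packet.payload);             // core code never sees seq
      this.ackId = packet.seq;
    }
    return this.ackId;
  }
}
```

Skipping anything at or below the current ack id handles duplicate updates, and refusing to jump over a gap keeps updates in order.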
What could possibly go wrong?
The whole point of all this is to be able to retry when bad things happen. Let's take a look at a few failure cases.
Above I show what happens when data that is supposed to be coming from the server stalls out. I'm calling this a “stall”, because it's most likely not just a single TCP/IP packet that went astray. TCP/IP is sophisticated and can take care of itself most of the time, but everyone has experienced what it feels like when it can't: A site stops loading (or stops being responsive). How resilient your application is to stalls like this depends on how much your code cares (and retries).
In Social Poker Live, at the moment, we do 10 second long-polls. That means the client waits for a response from the server for 10 seconds, and should it not get any data, it'll abandon the request and send the next poll. That might not be the right frequency; we're still fine-tuning it. Even if it were optimal for poker, it wouldn't necessarily be for your application. Maybe 5 seconds would be better. Maybe 15 seconds.
It's hard to know what the best long-poll time should be, but here are some considerations: As you increase the duration of the long poll, it'll take more time for the client to re-request data that might have been lost. This larger average latency means the application will feel more sluggish when there was a TCP/IP stall. However, as you decrease the duration of the long poll, your server has to deal with more spam from the client that constantly makes new connections to ask about updates, and also it's more likely that the client will re-request data that is already in transit from the server (not lost, just not received yet).
So 10 seconds felt about right for Poker. It's probably just about the time a person would be willing to stare at their browser waiting for a site to recover from a stall before hitting the refresh button in frustration. I'd be happy to hear what you think about this value. Maybe it should be dynamically set based on the quality of the connection?
Above is a case where the long-poll request stalls out. The server never gets the request or the ack id it carries. This means that after the max long-poll duration, the client will re-issue the request with the same ack id (value of 1), and the server then deletes everything up to and including packet 1. Although the client already had packets 0 and 1, the server never got the next long-poll request because of the stall, so it had to keep packets 0 and 1 in memory until the re-issued request finally arrived carrying an ack id of 1, at which point it was safe to delete them.
Above we covered problems when there is a stall with the server to client flow of data, but what about data flowing in the opposite direction? Because the Social Poker Live client will typically only have one outstanding client-initiated request at a time, I didn't need to implement a formal seq / ack solution for actions. However, even though there should only ever be one message outstanding at a time, I still had to take care of the possibility of issuing a single action more than once. To avoid this possibility, the client sends a unique identifier with each action. The server checks whether it has already seen that action, and if it has, it simply returns a pre-cached result of the already serviced call.
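A minimal sketch of that server-side check might look like this. The `ActionDeduper` class, its `handle` method, and the shape of the action objects are hypothetical, and how the client generates its unique identifiers is left open.

```javascript
// Hypothetical server-side guard against processing an action twice.
class ActionDeduper {
  constructor(process) {
    this.process = process;   // the real game logic for one action
    this.results = new Map(); // action id -> cached result
  }

  handle(actionId, action) {
    if (this.results.has(actionId)) {
      // A retry of something already serviced: replay the cached result.
      return this.results.get(actionId);
    }
    const result = this.process(action);
    this.results.set(actionId, result);
    return result;
  }
}
```

This sketch caches every result forever, which would not be acceptable on a real server. With only one action outstanding per client at a time, remembering just the most recent action id and result per client would likely be enough.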
This solution won't work for all games. You have to judge for yourself how chatty the client actions are, and if you're sending a lot of actions to the server, you may well have to also resort to a seq/ack solution for client-to-server data communication.
There may be other (and better) ways you can deal with stalled TCP/IP connections than what I've done, but if you want a resilient application the one thing you can't do is simply trust that anything relying on TCP/IP will do just fine. Your TCP/IP will stall out on cellular networks, your app will freeze, and your users will be bummed. If you want to make multiplayer games that play great even in areas where your cellular coverage isn't, you have to deal with this somehow.
Although I chose to use long-polling, don't assume that switching to WebSockets, or some other higher-level communications framework, means you no longer have to worry about TCP/IP stalls. WebSockets still rely on TCP/IP, and although many things would be cleaner and would have much less overhead, you can still get stalled out.
I haven't looked at all the cool frameworks or libraries that are out there, so I'd like to hear what you've found compelling. One library that looks promising, especially if you want to use WebSockets (but can't because it's not implemented in all the places we care about), is Socket.io. That looks really slick, and it does look like there would be ways of catching TCP/IP stalls.
Here are some great resources available that provide more info about WebSockets and long polling.
Have comments or questions about building resilient multiplayer games using HTML5? Leave a comment and we'll follow up.