For today's interview, we have Neerad Somanchi, a Production Engineer on the Presto team at Meta. Presto powers an incredibly wide variety of internal tools and platforms, including real-time analytics, research, our ads platform, and much more. The Presto production engineering team focuses on evolving and maintaining Meta’s Presto architecture. You can find Neerad on LinkedIn, and you can check out the lessons he learned scaling Presto to meet Meta’s needs in this post.
Presto powers such a wide variety of internal tools and platforms that half of Meta uses Presto even if they don’t realize it. Real-time analytics, research, our ads platform, and much more all use Presto.
My role is really varied, so there are days when I’m doing data analysis, days when I’m investigating production issues, days when I’m heads down coding, and days when I’m brainstorming with my colleagues. I guess a typical day is a mix of all these activities? One of the things I love about my work here is that I can’t easily define a typical day!
There were a couple of factors that led me to join the Presto team. The first is the scale at which Presto is deployed at Meta. It powers a lot of interactive as well as batch analytics usage. I was drawn by the potential to make an impact on such a large service, and also by the fact that I could learn so much from all the talented and experienced engineers on the team.
The second is that Presto is open source. I remember being very excited when I made my first code contribution to Presto OSS. The community aspect of an open source project, the chance to learn from experts in the field outside of Meta, and the knowledge that my contribution might meaningfully impact anyone using Presto in the wild are the reasons that brought me to the Presto team.
Great engineers are spread across many different companies and across different continents. Using open source means using a product built by a team that no single company could put together. I believe that this results in high-quality software. Contributing to open source also means that you can attract great talent to join you and work with you on a daily basis!
There is also the benefit of empowering the open source community by giving them the source code of a product that they can modify and tweak to their heart’s content.
I think establishing well-defined SLAs (and SLOs) is important. Defining SLAs around important metrics in a manner that tracks customer pain points becomes crucial as a platform is scaled up. When there is a large number of users, the lack of proper SLAs can greatly hinder efforts to mitigate production issues, because it becomes hard to determine the impact of an incident.
Proactively setting up monitoring and automated remediation systems is another important aspect of running a system at scale. Thorough monitoring can help identify production issues before their blast radius becomes too big. Automated debugging and remediation tools are valuable when trying to root-cause and mitigate problems.
Finally, implementing demand control and having good models for capacity estimation are best practices that teams can adopt early on.
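To make the point about SLOs a little more concrete, here is a minimal sketch of how a team might check a latency-based objective over a window of queries. It is an illustration only, not how Presto or Meta measures anything: the `QueryResult` type, the `slo_report` function, the latency target, and the objective value are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class QueryResult:
    """One finished query (hypothetical record shape for this sketch)."""
    latency_s: float
    succeeded: bool


def slo_report(
    results: List[QueryResult],
    latency_target_s: float,
    objective: float,
) -> Dict[str, Union[float, bool]]:
    """Summarize how a window of queries did against an assumed SLO.

    A query counts as "good" if it succeeded within latency_target_s.
    objective is the fraction of queries that must be good (e.g. 0.75).
    """
    if not results:
        raise ValueError("no query results to evaluate")

    total = len(results)
    good = sum(
        1 for r in results
        if r.succeeded and r.latency_s <= latency_target_s
    )
    bad = total - good

    # The error budget is the number of bad queries the objective tolerates.
    allowed_bad = (1 - objective) * total

    return {
        "compliance": good / total,
        "meets_objective": good / total >= objective,
        "error_budget_remaining": allowed_bad - bad,
    }


# Example window: eight queries, one of which was too slow.
window = [QueryResult(latency_s=1.0, succeeded=True) for _ in range(7)]
window.append(QueryResult(latency_s=30.0, succeeded=True))
report = slo_report(window, latency_target_s=5.0, objective=0.75)
```

A number like `error_budget_remaining` gives one shared way to describe the impact of an incident, which is the kind of clarity the answer above says is missing when SLAs are not well defined.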
I recently got a 30,000-foot view of work that’s being done in several industries to improve hiring practices, specifically around equitable hiring. I’ve learnt a tiny little bit about diversity, equity, and inclusion in recruitment and I find it exciting that such valuable work is being done!
I think one of the most important aspects is having a mindset that isn't surprised when things go wrong unexpectedly (and badly!). It's almost like being a pessimist: constantly thinking about everything that can go wrong when running a production service helps you design, develop, and iterate on systems that are reliable, efficient, and scalable.
Coming up with wide-reaching solutions is a very good trait to have. Different teams at Meta might have similar problems. Being aware of what other teams are up to means that one can propose ideas that solve problems spanning multiple teams. This prevents duplication of effort and increases the impact that one can have.
Developing a mindset that encourages you to dive headfirst into a problem is very important. This is especially true when dealing with services at Meta scale. It might seem daunting at first, but the upshot is that learning is greatly accelerated! Personally, I got into this mindset during my previous work experience, where I was heavily involved in the commercialization stage of a product launch. This meant that we were stress-testing the product and investigating a whole host of issues. Constantly engaging with problems builds analytical skills, improves debugging and troubleshooting techniques, and eventually helps you design and develop systems that are robust and can run efficiently at scale.
This article was written in collaboration with Neerad Somanchi, a Production Engineer at Meta, and Philip Bell, a Developer Advocate at Meta.