WHAT’S BEHIND LYFT’S CHOICES IN BIG DATA TECH

WorldLine Technology
Jun 18, 2019

Credit by Alex Woodie

Lyft was a late entrant to the ride-sharing business model, at least compared to its competitor Uber, which pioneered the concept and remains the largest provider. They both share an ethos of openness with regard to using and developing technology. That openness is also pervasive at Google, Facebook, Twitter, and other Valley outfits that created much of the big data ecosystem, most of which is, of course, open source. The first notable difference between the firms is Lyft's decision to park its data in cloud object stores, whereas Uber invested in Apache Hadoop infrastructure.

This was one of the topics that Lyft Data Engineer Li Gao discussed in a recent interview with Software Engineering Daily’s Jeff Meyerson, who published the talk in a podcast last month. “If you went to any of the Uber talks in the past, they talk about a lot of the pain points when they’re dealing with large HDFS deployment,” Gao tells Meyerson. Lyft’s AWS bills are quite large, as you might expect from a company that operates in 600 cities and had revenues of $2.1 billion last year. Lyft, which went public earlier this year, reported in its S-1 filing that it pays AWS a minimum of $8 million per month.

In fact, its contract with AWS calls for it to spend at least $300 million through 2021. Despite being tied to AWS, Lyft doesn't use AWS services exclusively. Lyft was a big Redshift user in the early days, but when it started to run into scalability issues around 2016 due to the tight coupling of compute and storage, it migrated from Redshift to Apache Hive, Gao says. In 2018, Lyft upgraded to a faster version of Hive that increased the size and number of ETL jobs that it could run, Gao says.

It also introduced Presto to provide a more powerful query engine. Presto's advantage is how easily it powers ad hoc analytics that involve joining a lot of different data, Gao says. “We have batch processing running mostly on Hive and Spark and interactive system running on Presto serve those interactive datasets.” Lyft engineers build many data pipelines that route incoming data to many of these data marts and processing engines.

Gao says Lyft is currently building a connection between Druid and Presto that will enable developers to utilize the relative advantages of both of those query engines and surface insights within Superset. There's also a smattering of relational databases, including Postgres and MySQL, that serve other needs. On the data science front, Lyft is a big user of Jupyter, a popular notebook-style interface for working with data and machine learning algorithms. Much of the data engineering and data science work is connected via Apache Airflow, a data workflow tool that was originally developed at Airbnb, another disruptive tech firm headquartered in the Bay Area that has enmeshed itself in the open source big data software ecosystem.

Since the Lyft service is time-dependent, it's not surprising to learn that the company uses a mixture of Apache Kafka, Apache Flink, and Spark to build streaming services. It has also played around a bit with Apache Beam. Lyft originally built its real-time data infrastructure atop Amazon Kinesis, but it migrated to Kafka when it discovered scalability limitations in Kinesis, Gao says. “The Kinesis behavior is lagging or inconsistent from what I heard from our streaming team,” he says.

The company today gets its enterprise Kafka service via a contract with Confluent, which is the company that helped to commercialize Kafka as it emerged from LinkedIn. “Kafka is a critical piece for us to move data, the real-time metrics data or an event data around,” Gao says. Kafka is used extensively at Lyft. Lyft chose Flink over Kafka Streams because it offers more powerful transformations, Gao says.

So using Spark can unify those processes in a single platform. Lyft uses a mix of different tools to orchestrate all of these data services, including YARN and Kubernetes. Gao says the company has nearly 100 small Spark clusters running on AWS that rely on YARN as the scheduler. The company is also looking at getting Spark to run on Kubernetes.

It has made some progress in that regard, but it's an ongoing development project. “Spark on Kubernetes is still in this very early stage,” Gao says. “We still see there's a huge gap of [how] we would like to see how Spark perform and what the Spark 2 actually is doing.” YARN, which originated in the Apache Hadoop project, actually does some things better than Kubernetes, Gao says.

“YARN has this different schedule called fair scheduler, or capacity scheduler, or a combination of both, to serve the different use cases to maximize both the job performance and the cluster utilization. So we have to build our own.” Software Engineering Daily's Meyerson pointed out to Gao that Uber has standardized much of its on-premises infrastructure on Mesos, and that it has developed its own scheduler that is general enough that it can be used with Kubernetes. “So you may have your scheduling problems solved by the Uber team,” he says.
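For readers unfamiliar with the fair scheduling idea Gao mentions, the following is a minimal sketch of max-min fair sharing, the general principle behind fair schedulers. It is not YARN's actual implementation, which also handles weights, minimum shares, and hierarchical queues. The function name, the queue names, and the capacity figures are all hypothetical and chosen only for illustration.

```python
def max_min_fair_share(capacity, demands):
    """Split `capacity` among queues using max-min fairness.

    Illustrative only: a simplified water-filling loop, not YARN code.
    `demands` maps a (hypothetical) queue name to the resources it wants.
    """
    allocation = {name: 0.0 for name in demands}
    remaining = float(capacity)
    # Queues that still want more than they have been given.
    unsatisfied = {name: float(d) for name, d in demands.items() if d > 0}

    while unsatisfied and remaining > 1e-9:
        # Offer every unsatisfied queue an equal slice of what is left.
        share = remaining / len(unsatisfied)
        # Queues whose demand fits inside that slice are fully served.
        satisfied = {n: d for n, d in unsatisfied.items() if d <= share}
        if not satisfied:
            # Nobody fits: split the remainder equally and stop.
            for name in unsatisfied:
                allocation[name] += share
            remaining = 0.0
            break
        for name, demand in satisfied.items():
            allocation[name] += demand
            remaining -= demand
            del unsatisfied[name]
    return allocation


if __name__ == "__main__":
    # Hypothetical queues sharing 100 units of cluster capacity.
    print(max_min_fair_share(100, {"etl": 20, "adhoc": 50, "ml": 70}))
```

In this made-up example the small "etl" queue gets its full 20 units, and the two larger queues split the remaining 80 evenly at 40 each. Small jobs are not starved, and no capacity sits idle while demand exists, which is the job performance versus cluster utilization balance Gao describes.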

Gao says he's familiar with the Uber product, called Peloton, but doubts that it will work in Lyft's multi-tenant environment. Instead, Lyft is looking forward to the day when the best of all three schedulers – Mesos, Kubernetes, and YARN – can be used at the same time. And while YARN has some advantages, it's clear Kubernetes has the momentum, including over Mesos, which is another differentiator between Lyft and Uber. It's another reason why you don't want to be the first one at the big data party.

“When we started looking at container orchestration, it was already 2018,” Gao says.