GOTO - Today, Tomorrow and the Future

Unlocking the Power of Real-Time Analytics • Tim Berglund & Adi Polak

June 02, 2023 Tim Berglund, Adi Polak & GOTO Season 3 Episode 22

This interview was recorded for GOTO Unscripted.
gotopia.tech


Tim Berglund - VP DevRel at StarTree & Author of "Gradle Beyond the Basics"
Adi Polak - VP of Developer Experience at Treeverse & Contributing to lakeFS OSS

RESOURCES
Tim
timberglund.com
twitter.com/tlberglund
linkedin.com/in/tlberglund

Adi
twitter.com/AdiPolak
instagram.com/polak.code
linkedin.com/in/polak-adi

Tools & companies
pinot.apache.org
twitter.com/startreedata
linkedin.com/company/startreedata
dev.startree.ai
stree.ai/slack

YT videos
Data Mesh • Zhamak Dehghani
Beyond Microservices • Gwen Shapira

DESCRIPTION
Adi Polak and Tim Berglund explore the concept of analytics and what it truly means in the software development world. They delve into the benefits of real-time analytics for product development, highlighting the fine line between compute and storage and the technical requirements for achieving effective real-time analytics. They also discuss the applications of real-time analytics through the lens of Apache Pinot and StarTree Cloud, exploring use cases such as the popular "Who's Watched My Profile on LinkedIn" feature powered by Apache Pinot.

RECOMMENDED BOOKS
Adi Polak • Scaling Machine Learning with Spark
Tim Berglund • Gradle Beyond the Basics
Tim Berglund & Matthew McCullough • Building and Testing with Gradle
Mark Needham • Building Real-Time Analytics Systems
Gwen Shapira, Todd Palino, Rajini Sivaram & Krit Petty • Kafka: The Definitive Guide



What are analytics?

Adi Polak: Hello, everyone, and thank you so much for joining me for another episode of GOTO Unscripted. Today I have a dear friend, Tim Berglund. Tim is the VP of DevRel at StarTree, a very exciting startup that builds Apache Pinot, and other great technologies in the ecosystem. Hi.

Tim Berglund: Hey, Adi. Great to see you again. Great to talk to you. And we build a cloud service based on Apache Pinot and other layers on top of that.

Adi Polak: Yes, I remember some fascinating things around anomaly detection and things that are really extremely valuable for developers.

Tim Berglund: And more to come.

Adi Polak: And more to come. That's amazing. All right, so I'm curious, there are different schools of thoughts around analytics. So what does analytics mean to you?

Tim Berglund: I love that question. It just gets to the core of things. It's like asking, what's a chair? With questions like that, you kind of get to places where it's actually hard to give an answer, but I think, fundamentally, we could say that analytics is looking back over what has happened, and trying to get insights into what has happened.

Ideally, insights that we could take action on, right? This is usually in a business context, and we're trying to figure out how to drive decisions. So looking back over what has happened in an attempt to generate actionable insights is, I think, a good, short, somewhat essential definition of what analytics is.

What are real-time analytics?

Adi Polak: I think it's extremely valuable. I saw a lot of...especially the real-time one was super critical to identify kinds of trends, recent trends, changes in the ecosystem and customer consumption patterns, and a lot of this good stuff. And I wonder, you know, when you look at that, like, what type of analytics can you do? Yeah.

Tim Berglund: I feel like I should get into that a little bit more. Let me maybe expand on my definition a little bit, because it's looking back over what has happened to try to generate actionable insights, but maybe a contrast helps. So if you're really new to databases, or new to thinking about analytics, "in contrast to what?" is a good question. And there are classically two categories of databases: analytical databases and transactional databases, or OLAP, online analytical processing, and OLTP, online transaction processing, databases. And I just wanna dig into it a little bit more before I get to that question because I feel like it's been a little cutesy just giving that short of an answer. So let me dive into that.

You've got two kinds, broadly these two kinds of databases. The transactional processing part of the ecosystem is more concerned with keeping track of entities in the world, right? In an OLTP query, there's usually a thing that you want to read, write, create, or delete. It's usually like one thing that you're interested in. Somebody creates an order, well, then that one order is a new entity in our world. And maybe there are various SKUs associated with that order. Well, you create those SKUs and you associate them, and that's a collection of things. But there's usually a preoccupation with a thing, an entity, even if it's got some sort of compound existence and it's composed of other entities. It's one thing that you're interested in.

In contrast, in the world of analytics, we're usually interested in looking at a lot of things. So, as I said, you're looking back over what has happened. Those are events, or those are entities that have come into existence or have been mutated, you know, rewritten in some way. So rather than just this one thing you're looking at, it's these things. And you want to scan through a lot of things and add them up or average them or, you know, run some sort of reducing function, compute some sort of aggregate over a particular number in those things.

Whereas in the transactional world, that one thing, you never reduce it to a single number. That's not what you do. You kind of want many of the properties of that entity, or the whole entity, all at once. And that's what you're dealing with. Whereas in analytics, there are usually all these things, and in a particular query, there's one measurement that you're interested in. What was your question again? It was about the time, like the real-timeliness of this. Maybe ask your question again. I...
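
As a rough illustration of the contrast described here, the following Python sketch puts the two query shapes side by side: an OLTP-style lookup that returns one whole entity, and an OLAP-style query that scans many entities and reduces one measurement to a single number. The in-memory orders list and its field names are hypothetical and stand in for a real database table.

```python
# Hypothetical in-memory "orders" table; field names are illustrative only.
orders = [
    {"order_id": 1, "customer": "ana", "total": 12.50},
    {"order_id": 2, "customer": "ben", "total": 30.00},
    {"order_id": 3, "customer": "ana", "total": 7.25},
]


def get_order(order_id):
    """OLTP-style access: fetch one whole entity by its key."""
    for order in orders:
        if order["order_id"] == order_id:
            return order
    return None


def average_total():
    """OLAP-style access: scan many rows and reduce one measurement to a number."""
    totals = [order["total"] for order in orders]
    return sum(totals) / len(totals)


one_order = get_order(2)   # the whole entity, with all of its properties
avg = average_total()      # a single aggregate computed over every entity
```

The lookup cares about every property of one order; the aggregate cares about one property of every order.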

Adi Polak: What do we mean by real-time when we speak about, you know, the analytics world and kind of like the two different, you know, aspects of that that you mentioned?

Tim Berglund: So, that's important. Because that's what we do at StarTree is real-time analytics. Apache Pinot is a real-time analytics database. And I always feel like I need to apologize for that word when I'm talking to other technical people, right? When I'm talking to people who build things with computers, it kind of just sounds like a marketing word. You know, does it have a rigorous definition? Do you just mean it's fast and I'm supposed to be impressed? It's squishy, you know.

I started my career in firmware and that was fun. I miss that. It was a huge amount of fun. And in that world, real-time has a specific definition. The operating systems that you put, and this is changing a lot. A lot of stuff is just like embedded Linux these days, but on small things, you still have this very small kernel that could be literally a few kilobytes of code that's called the real-time operating system. And so, this term real-time has this precise definition in that world. And in that world, what real-time means is predictable latency.

There's a maximum, a bounded amount of time that you can take to process an event. And if you take longer than that, you have failed. So that's satisfying, right? That would be nice if we had something that was that good. That's not really what we mean necessarily in real-time analytics. You know, we still talk about tail latencies the way anybody in the DevOps world would, you know, there's P50 and P75 and P99 latencies that are all a normal distribution. A latency can always be a little bit longer. So it's not like there's some hard limit.
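
To make the P50, P75, and P99 vocabulary concrete, here is a minimal Python sketch that computes nearest-rank percentiles over a list of query latencies. The latency numbers are made up for illustration, and nearest-rank is only one of several common percentile definitions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical query latencies in milliseconds (made-up numbers, one slow outlier).
latencies_ms = [12, 15, 11, 14, 13, 18, 16, 250, 17, 19]

p50 = percentile(latencies_ms, 50)
p75 = percentile(latencies_ms, 75)
p99 = percentile(latencies_ms, 99)
```

A single slow request barely moves the median but dominates the P99, which is why tail latencies are reported separately rather than as one hard bound.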

But to make this something that you can really sink your teeth into and somewhat specific, the way I like to define real-time in the context of real-time analytics is something that you can put in the loop of a UI, something that exists in the interaction layer. So I'll hold up my phone. I have a phone and I'm using some app and I refresh a view or I enter a search or something for, you know, I wanna order a hamburger near me. And then there are five burger restaurants that'll be in the search area, and maybe OLAP queries, analytics queries, associated with those things that are happening as the display refreshes.

So a real-time analytics query would be, for example, what is the average of the delivery times in the last half hour for the burger joints inside this polygon, this geospatial polygon? That's an analytics query because we're looking back over what has happened, and if it's happening while I'm looking at the screen, we can't kick off a Spark job and have it come back 15 seconds later and give us a result, right? That's not going to work. I mean, I'm talking to a woman who's written a book on Spark, an incredibly transformational technology, and you need it to do certain kinds of things, but you also need this other kind of tool to do things that I'm looking at a UI and waiting for.
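
The shape of the burger query described here can be sketched in plain Python: filter events to the last half hour, keep only restaurants inside a polygon, and average one measurement. This is a brute-force scan written only to show what the query asks for; it says nothing about how Pinot executes such a query. The record fields, coordinates, and timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def point_in_polygon(x, y, polygon):
    """Ray-casting test: count how many polygon edges a rightward ray from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # this edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def avg_delivery_minutes(deliveries, polygon, now, window=timedelta(minutes=30)):
    """Average delivery time for recent deliveries from restaurants inside the polygon."""
    recent = [
        d["minutes"]
        for d in deliveries
        if now - d["delivered_at"] <= window
        and point_in_polygon(d["x"], d["y"], polygon)
    ]
    return sum(recent) / len(recent) if recent else None


now = datetime(2023, 6, 2, 12, 0)
search_area = [(0, 0), (10, 0), (10, 10), (0, 10)]  # a square, for simplicity
deliveries = [
    {"x": 2, "y": 3, "minutes": 20, "delivered_at": now - timedelta(minutes=5)},
    {"x": 8, "y": 8, "minutes": 30, "delivered_at": now - timedelta(minutes=25)},
    {"x": 5, "y": 5, "minutes": 90, "delivered_at": now - timedelta(minutes=45)},   # too old
    {"x": 20, "y": 5, "minutes": 60, "delivered_at": now - timedelta(minutes=10)},  # outside
]
result = avg_delivery_minutes(deliveries, search_area, now)
```

The hard part in practice is not expressing this query but answering it fast enough, over far more rows, that a person waiting on a screen does not notice.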

It's not a dashboard that can refresh, it's not a report that can run, it's right in the loop of the interaction layer. And so real-time means, and this is still somewhat soft, right? But basically, I'm willing to look at a display and have that query happen and feel like it's alive and it's not broken, and I'm not getting impatient and going to try DoorDash instead of Uber Eats or whatever. So that's how I define real-time. I can do it without a caching layer while I'm looking at it in a UI.

Adi Polak: That's amazing. It reminds me a little bit of back in the Hadoop days, with Spark and Hadoop being besties, or used-to-be besties, as best I know. It reminded me that...

Tim Berglund: Never felt like they were besties to me.

Adi Polak: Never.

Tim Berglund: No. Felt like Spark was kind of elbowing Hadoop in the face, but that's just my take.

Adi Polak: That's an interesting take on that.

Tim Berglund: Maybe a little more violent than... Anyway, go on, Spark and Hadoop.

The value of real-time analytics

Adi Polak: Yes. Although old MapReduce, I guess, gave light to some ideas that Spark took and ran with. Anyway, what I wanted to say is when I used to work in analytics, like building infrastructure for advanced analytics, I remember we used to prepare reports in advance. So the queries were limited in a way that there were only specific columns the users could query, only specific time windows even, like 30 days, 10 days, 1 day. There wasn't a lot of flexibility around that. And that's because, essentially, we were doing batch processing and preparing all these reports with great indexing so people would be able to query big data fast. But in reality, the underlying system was already prepped.

It gave a real-time experience for the user, yet it was very limited in what they could do. And what I slowly realized, and correct me if I'm wrong, is that actually having an engine that is tailored to doing real-time analytics can give me much more flexibility in looking at the different records and running different types of aggregations that are less, well, essentially more flexible, more agile for me as a user. So it gives me, from a product point of view, it gives me a lot of value.

Tim Berglund: Yes, and I think you're onto something really important there. So starting with Hadoop as a piece of history, and I don't want to just slap it around. It was an incredibly important innovation. I was kind of there watching it, just incredible amounts of data that could be processed in what is now a standard open source framework. Before there was Hadoop, that wouldn't have been possible; it would've been state-of-the-art computer science research that you were doing on your own to have built that kind of thing. And then Cutting and others at Yahoo did it. And now it was this thing you could use, right?

It was that huge step forward and amazing. I think we like to slap MapReduce around because it was terrible to use, the actual experience was awful. But in reality, not very many people wrote map and reduce functions. Like who actually did that? Everything was kind of Hive very quickly, or if you didn't wanna do that, you had things like Cascading, those kinds of layers on top. So, it's easy to slap it around and talk about how terrible it was, but I think that's a little bit of a straw man.

The experience in Hadoop's flowering, which was brief, the experience wasn't as bad as we think, but like you're saying, you're fundamentally scanning everything, right? Now, in your map, you want to reduce what you're scanning, but you don't really have an indexing system in that framework. What you have is a distributed storage tool and a distributed computation framework that sits on top of that distributed file system. And that's great, but that's not a database, that's a file system and a way of doing computation somewhat smartly over big giant blocks of data in that file system. So you can't really call that a database.

Databases, there was, you know, HBase was built on top of it. Again, didn't take over the world. It was kind of for true believers, HBase is what you do if you just really had a big commitment to Hadoop, if you needed that kind of database and like that kind of data model, most of those people went off to Cassandra. But we find that to be a performant database, you really do have to engineer a system according to what performant means, what you're saying that needs to be. And I'm saying, well, there's this interaction layer loop.

I want queries to happen in a user interface or, as it were, on page load, whatever that means these days. When I click something and before my page refreshes, I want this stuff to happen, which means I need things to happen probably under 100 milliseconds. Well, I'm going to need to engineer a system that does that because Hadoop couldn't do that. Hadoop's children like Spark, which was a vast improvement, were not engineered for that, but engineered for other things. Currently, the query aggregation systems like Trino and Presto are designed to be specialist components of a data stack, but not designed to run interesting queries under 100 milliseconds. You gotta build something to do that.

You have to start to answer questions about how storage and compute are gonna be coupled. I'll give a contained answer here, but we can dig into that more if you want. For me, as I look at these systems, a concept that emerges is, what are the trade-offs you're gonna make about storage-compute coupling? And that kind of determines where you're gonna live on that database spectrum.

The fine line between compute and storage

Adi Polak: We definitely see in the industry how the two split, between the more traditional data lake and then the data warehouses, and kind of the in-between where you want to have that separation of compute and storage, and having local storage is kind of good for you when you're running your distributed compute engine.

Tim Berglund: It's very good because it's fast, right? Now you've got an SSD on the other end of a PCIe bus, which is, you know, that's about as fast as you're gonna get. You would love to have that. It's just expensive to have. Let's see if I can describe this. I might get lost without an actual whiteboard, but let's give it a shot. And at some point, depending on when this is published, we might have a light board at StarTree where I draw this on the board, but if you could think of storage going from tightly coupled to loosely coupled, right?

Tightly coupled means, again, there's an SSD on the other end of a PCIe bus from the processor. Loosely coupled is S3. I can make HTTP calls to essentially an infinite sump of storage out there. Now that infinite sump of storage is gonna be a lot cheaper than my SSD. It's also gonna be orders of magnitude slower. So, I've got a cost-performance trade-off here on this storage spectrum where I can go decoupled or tightly coupled, and I can get faster and more expensive or slower and cheaper.

On the other axis, going across this way, there's a question about how persistent my compute is, right? So I could have, think about a legacy data warehouse, right? It's a relational database. I had an ETL job that I hand-coded in C because I hate life. It runs overnight and it writes all this data into my star schema in Oracle because again, self-harm, I need to get some help, and it's in this Oracle database on a database server and that's where the queries run. The storage is tightly coupled, so I guess it's up here, my little diagram here. The storage is tightly coupled and the compute is always there, that computer is there, heating the room no matter what.

On the other end of the spectrum, for compute... Let me take Spark because, I mean, you've got like worker nodes out there, but they're not doing anything until the job starts and then the job is divvied up into tasks and those are distributed. So in a sense, I'm not really doing any compute until I've got a job. So it's kind of this somewhat instant-on or standby compute resource that isn't allocated until I give it a query. That's what this horizontal axis is: whether my compute is dispatchable or permanently allocated. And then I've got this other one that's tightly coupled versus loosely coupled storage, and the modern data systems just live in different places on this quadrant. I think it's an interesting way to divide up the space. I mean, even all analytics systems, like forget about any OLTP database. You can put all the analytics systems on that board somewhere.

Adi Polak: That's really interesting because when we look at the different solutions in the industry, I guess, you know, you can actually see how some solutions are more tailor-made to specific cases and sometimes to specific decision-makers. As you mentioned, we have Spark which gives us the compute, and then there's a whole conversation around whether we should bring compute to the storage level, kind of the indexing, filtering, predicate pushdown, versus bringing the storage to the compute side. And I wonder specifically... Yeah, it's an interesting debate now in the industry.

Tim Berglund: There aren't answers, right? There are only trade-offs. There are no solutions to that question. You just pick where you're going to live and decide on what trade-offs you want based on what you're trying to do. For StarTree Cloud, we have a tiered storage thing now where we will put some... this is not yet a Pinot feature. This is a StarTree thing, so, hashtag, you know, commercial for a second here. But we'll keep some of the data on a local disk and then, after a certain age, some of it in S3, and query that data in S3 in fairly smart ways. So you don't get ten-second queries. You might get, instead of 50 milliseconds, 350 milliseconds or something. So it's like a trade-off. It's slower, but it's not crazy slow, like what you'd normally see when you have decoupled storage and compute.
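
The age-based tiering idea mentioned here can be pictured with a small Python sketch. This is not StarTree's implementation, whose details are not given in the conversation. The seven-day threshold, the tier names, and the function names are assumptions chosen purely for illustration.

```python
from datetime import datetime, timedelta

# Assumed age threshold and tier names, chosen only for illustration.
LOCAL_RETENTION = timedelta(days=7)


def storage_tier(segment_end, now):
    """Pick where a data segment lives based on its age."""
    return "local_disk" if now - segment_end <= LOCAL_RETENTION else "object_store"


def tiers_to_query(segment_ends, query_start, now):
    """Return the set of tiers a query must touch for segments ending at or after query_start."""
    return {storage_tier(end, now) for end in segment_ends if end >= query_start}


now = datetime(2023, 6, 2)
segment_ends = [now - timedelta(days=d) for d in (1, 3, 10, 30)]

recent_only = tiers_to_query(segment_ends, now - timedelta(days=5), now)
full_history = tiers_to_query(segment_ends, now - timedelta(days=60), now)
```

A query over recent data touches only the fast tier, while a query over the full history also pays the slower object-store cost, which is the trade-off being described.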

Apache Pinot & StarTree Cloud: Target audience

Adi Polak: That's really cool. I think that's a critical feature when I think about persistence and availability for both the real-time analytics consumer, at the end of the day, and also the people that are managing the system. It can probably be a bit easier for disaster recovery and so on. So I wonder, now that we have a little bit of a better understanding of the world. So when I was thinking about decision-makers for real-time analytics, right? Regarding the solution that you mentioned, specifically with Pinot or StarTree Cloud, who are the people that need it the most?

Tim Berglund: You're just asking these really elemental questions, what's analytics? And you kind of just asked who is the decision maker. It's like, who is my neighbor? Well, I don't know. So again, historically the way this started like when I was brand new in my career in the early '90s, again, I wasn't doing anything like data warehousing or analytics. I was doing firmware and having a great time.

But the way that stuff was shaping up, with Bill Inmon and Ralph Kimball and their work, they were defining data warehouses. The idea was a printed report, you know, probably came off a laser printer, given to a decision maker. This is somebody high up in the organization. Who is a decision maker? A person of high status in the organization who has lots of capital and human resources to direct. And we need to spend all this money to build this system to get them information so they can make better decisions, because their decisions are very consequential.

That leadership hierarchy is, I guess, relatively durable over human history despite various attempts to substitute something else. So it's not like that has changed in the last 30 years. You've still got important people in corner offices making big decisions. But analytics have moved down the org chart in an important way. And this first really got beaten into my head, I think, by listening to Zhamak Dehghani talk about data mesh. And she made the point that it's not just high-status people in the organization, it's leaf nodes of the org chart that need access to analytics products. They need actionable insights to be able to do their job, which is just a good thing in the structure of organizations, that it's gotten cheaper to produce analytics. And we found that kind of everybody in an org is making decisions, and so empower them with the data of the company.

Okay, cool. We're not done yet. So the answer to the question, who's a decision maker, isn't merely every employee of the company, because right now there's this new wave, and I'm thinking it's a tidal wave, we'll see, of companies like LinkedIn and Stripe and all of the meal delivery companies on the planet. You know, Uber Eats was the first one, but then DoorDash, and Cisco is doing things with their Webex products. All these, you know, companies and sort of marquee brands that we all know about are saying actually decision makers are also the users of the product.

So it's not just people inside the organization making decisions about what the organization is gonna do, but I, as a user, external to the organization, as your customer engaging with your application, whatever it is, it's now worthwhile for you to take the data inside your company and expose that to me in, you know, some actionable insight generating form like we were talking about earlier. You know, what's analytics? It's looking back over what has happened and generating actionable insights from that history. We want to do that for our users now. And it can seem a little scary. Oh no, you know, this is the data inside the company, it's a competitive asset, it's whatever.

But the fact is you can create a much more engaging experience for that person, a stickier experience, however your product people are gonna measure that, by exposing analytics to your users. So who's the decision maker? Everybody who uses your product. And that's one of the fundamental theorems of the real-time analytics revolution, is that it's valuable and a key competitive advantage to build features into your products that are based on the analysis of historical data and recent data in kind of the life of the product and the user's interaction with it.

Adi Polak: I love that because it speaks a lot to our culture of, A, being data-driven, right? We always like to look into data. And B, actually empowering people to take action and have visibility into, you know, their own impact, right? So I can go into the analytics and start querying and see if the action I took created some impact in the system or the other way around, or if I made a mistake. I think it's really critical. And it's fantastic that now we are able to do that in a real-time setting to better serve our customers and users and also to empower other customers to dig into their data, you know, if we are creating some dashboards or kind of analytics services for them.

Tim Berglund: The idea is I, as a user, am better off if I know who's ordering burgers from that burger joint in the last 15 minutes. And I actually don't wanna know that; that list of names is not useful to me. But how long is it gonna take? That is a downstream impact of who's ordering burgers right now. And so with that recent history of events, I'm a better-off user, I like it better. I'm gonna get my meal delivery from you because you give me these kinds of up-to-date, if you will, real-time insights that I can use to take action on. Including me in the community of decision-makers, as somebody who does not work for DoorDash, Uber Eats, or whoever, is a win.

Who's Viewed Your LinkedIn Profile feat. Apache Pinot

Adi Polak: That reminds me of Who Watched My Profile on LinkedIn. I read some articles saying that it's powered by Apache Pinot, so, you know, I'm curious to learn more, if you can share.

Tim Berglund: Yes, it is powered by Apache Pinot, and it's why Apache Pinot was built. So specifically that feature at LinkedIn is why proto-Pinot was developed by the team at LinkedIn, some of whom at least would later leave and found StarTree. And the idea there, like if you remember LinkedIn not so long ago, it was a place where you could put your resume and your Rolodex, to use a somewhat old-fashioned term, and how often are you gonna engage with that, right? You don't update your resume very often, and maybe you go to a conference and you make some connections, you might go and reach out. So you're just not gonna go there much. Who cares? Well, knowing who viewed your profile in real-time, at least for certain types of people, and I would know, I feel seen, that's a very compelling feature, right? You wanna look at that, and that's the thing.

Adi Polak: People care.

Tim Berglund: At least some people do, and that'll keep you coming back to the site. They had to build, basically, a real-time analytics database out of some fairly unsuitable tools that had unfortunate operational characteristics. Like it cost a crap load of money to run and it was too slow. So they designed and built, you know, the first version of Pinot in order to be able to do that. And then that has found its way into every timeline view. And there are other kinds of more niche products on LinkedIn. I'm not a LinkedIn expert, but just, you know, go to linkedin.com and you get your timeline loaded.

There are Pinot queries powering that based on what people you know and interact with have recently done, what you've recently looked at and read, and all those kinds of things. So it's now an integral part of the experience of using LinkedIn. Kafka has a similar story, right? That's another technology I've worked with recently. It was also simply built at LinkedIn to solve a problem that LinkedIn had. There's always something nice about a piece of infrastructure technology that, you know, it's not like somebody said, "Hey, I bet this would be useful, and I know some VCs, let me raise some money and we'll build it and see if anybody wants it." That could be successful, but there's a lot more risk there in speculating. I think this infrastructure might be useful rather than, no, somebody needed this and so they built it, you know? You know it's useful to somebody and it solves a category of problems and it's finding its way into lots more categories of problems from there.

Adi Polak: It's proven, which I think is very, very interesting to see. It's proven on a huge scale. And I think, you know, I wasn't at LinkedIn, although LinkedIn was acquired by Microsoft at some point, so...

Tim Berglund: That's right. You were....

Adi Polak: I don't know how you want to look at that, but putting that aside, I did notice that LinkedIn transformed from an online resume for connecting with recruiters or things like that to a platform that is more community-driven: events, social, people sharing technical content, things that they learned, experiences. So it definitely shows in the whole feeling-seen experience and the timeline, and in the fact that LinkedIn added real-time analytics for the users to be able to query, kind of query, in different features. That really developed it as a community, in my opinion.

Tim Berglund: You know, if you were a skeptic of Pinot, I think there'd be ways to poke holes in this argument. So I'm not presenting this as the most airtight argument in the world, but it's worth observing that LinkedIn is that now. It used to be a thing you went to every once in a while for those two things, resume and Rolodex. Now for many, many people, it's a site they visit, it's the corporate social network, it's the professional social network. Social network, period.

And anecdotally, you know, like I run the DevRel team at StarTree, we make things, we write blog posts, we make videos. When we release a video, we'll post it natively on Twitter, LinkedIn, and YouTube, and we'll socialize the YouTube link, right? LinkedIn, we get several times the engagement on those videos that we released there compared to YouTube. So, remarkably more views of those videos, that's the impact we're trying to make with the stuff we spend a lot of time and effort making happens on LinkedIn. Now, the old resume, and you might say that's cool, Tim, we're talking about real-time analytics, not your DevRel team. Okay. Yes. But the old-time resume and Rolodex website wouldn't have done that. And the transformation from resume and Rolodex to the professional social network, I think the argument can be made compellingly that Pinot is what powered that transformation.

Pinot was built because of the baby steps initiating that transformation, and it underlies all the important features that make the current state of LinkedIn possible. So it's proven at scale and as a driver of significant business and, at the risk of some overstatement, kind of social change in terms of the way people interact professionally. LinkedIn's a part of that. So again, a skeptic could say, well, that's a social media thing, that's a one-off, that's LinkedIn. There are other things and there are actually other cool Pinot use cases to talk about, but I'm glad you asked about that. I wasn't even thinking that. But that Who Viewed My Profile was really a kind of a pivotal moment in the real-time analytics revolution in ways that it's worth it to tell the story of.

Technical requirements for real-time analytics

Adi Polak: That's amazing. So from observation, looking at other social media, there is a lot of emphasis on real-time, right? If we're looking at timelines, feeds, etc., as human beings, we want to consume the latest and greatest news. And in order for the platform to know what to present to us, it needs to be able to act in real time. So, real-time processing of a lot of data sometimes and doing it relatively fast. So, I'm curious, and maybe we'll touch a little bit on the technical needs of Pinot. So, you know, from your experience, what makes it so fast?

Tim Berglund: Well, there are, as an analytics database, you have two jobs. One is to ingest data and the other is to support reads. Ingestion is not trivial, but if you read stuff and you write it quick as you can, that's kind of a little bit more of an open-and-shut thing. On reads, it comes down to a question of indexes, like what kinds of... And you hinted at this earlier when you said in MapReduce you sort of had to have things planned out. Literally writing Java code that wasn't an effective query yet.

Adi Polak: Or Scala.

Tim Berglund: Or Scala, you know, it could be worse. It could even be Scala.

Adi Polak: That's true.

Tim Berglund: Right? So, I'm underselling the pain. You really had to have things mapped out, and changing that is a redeploy. Back in those days, 12, 13 years ago, redeploy was even harder, blah, blah, blah. Now it's SQL. You throw SQL at a database and how does that work? Well, indexes. If you're expected to be able to generate arbitrary SQL and throw it at a Pinot table and get performant results, just like any transactional database ever, the answer is you have to have designed the right indexes, and the database itself needs to support the right indexes to give you those options, to give you the ability to have lots of different kinds of queries that you can execute in that real-time, you know, sub-100-millisecond timeframe.

To be clear, it's perfectly possible to write a query in Pinot that takes longer than 100 milliseconds. Right now in the main branch, in a pre-production form in 0.12, there's the so-called multi-stage query engine that supports arbitrary joins between tables. Lots of arbitrary joins are gonna take longer than 100 milliseconds, but yeah, you've got these rollups that you can do in that timeframe. To do that you have to have the right indexes because fundamentally OLAP queries are about scanning, right? OLTP is this thing, OLAP is these things. I want these things. And so I have to read these things, and there's some column that we call a measurement, some number that we're running through a reducing function to compute an aggregate. That's kind of what we're doing, right? And so, I gotta scan these things.
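The scan-and-reduce shape described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Pinot executes queries; the rows, the column names, and the `scan_and_sum` helper are all made up for the example.

```python
from functools import reduce

# Hypothetical rows: (country, browser, page_views).
# page_views plays the role of the "measurement" column.
rows = [
    ("US", "chrome", 3),
    ("US", "firefox", 5),
    ("DE", "chrome", 2),
    ("US", "chrome", 7),
]


def scan_and_sum(rows, country):
    """Scan every row, keep the matching ones, and fold the measure into one aggregate."""
    matching = (views for c, _, views in rows if c == country)
    # The reducing function here is plain addition, i.e. SUM(page_views).
    return reduce(lambda total, v: total + v, matching, 0)


us_views = scan_and_sum(rows, "US")
```

Every call walks the whole list, which is the cost the rest of the answer is about reducing.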

To be fast means I have two options. I can scan less or I can scan faster. That's like, I can work smarter or I can work harder. That's really it. There isn't magic in the read path. You can either figure out ways to scan less or you can optimize the scanning that you do. Now, number two, scanning faster. You know, there are always optimizations to make and new hotspots to find, and you can always make incremental improvements on the code itself and the I/O subsystem. But really that's about a decision to have tightly coupled storage, right? I'm not reading something off S3. Generally, for those fastest queries, I want the storage to be at the other end of the PCIe bus, on SSD, and not on the other end of an HTTP call.

Of course, as I said, StarTree Cloud kind of does that smartly, and you can put old stuff in a cloud blob store and not give away the farm performance-wise. But scanning faster is generally about locally attached storage. Scanning smarter is about cool indexes that let you filter and focus your scanning on only the right blocks of the right segments of the table. And so those two decisions, a broad array of indexes and locally attached storage for the fastest queries, are the answer. Scan less, scan faster. And digging into scan less and indexes a little bit, Pinot's got this pluggable architecture where it's relatively easy to add new indexes.
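To make "scan less" concrete, here is a minimal sketch of an inverted index over one column, compared with a full scan. The columns, values, and function names are hypothetical, and the index is a plain dictionary rather than anything Pinot-specific; the point is only that both paths return the same sum while touching very different numbers of rows.

```python
from collections import defaultdict

# Hypothetical column-oriented segment: one list per column.
country = ["US", "US", "DE", "US", "FR", "DE"]
views = [3, 5, 2, 7, 1, 4]


def full_scan_sum(target):
    """Read every row; return (sum of views, rows touched)."""
    total = 0
    touched = 0
    for c, v in zip(country, views):
        touched += 1
        if c == target:
            total += v
    return total, touched


def build_inverted_index(column):
    """Map each distinct value to the list of row ids that hold it."""
    index = defaultdict(list)
    for row_id, value in enumerate(column):
        index[value].append(row_id)
    return index


country_index = build_inverted_index(country)


def indexed_sum(target):
    """Read only the rows the index points at; return (sum of views, rows touched)."""
    row_ids = country_index.get(target, [])
    return sum(views[i] for i in row_ids), len(row_ids)
```

With this data, `full_scan_sum("DE")` reads all six rows while `indexed_sum("DE")` reads two, and both report the same total.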

The belle of the ball in the Pinot index pantheon, to mix metaphors there, is the StarTree index. It has nothing to do with StarTree, the company; we just thought it sounded like a cool name, and so we named ourselves after that index. It's kind of like building a pivot table in a spreadsheet and persisting it to disk. So you've got pre-computed aggregates based on some number of dimensions, and you traverse a tree, you know, in log time and you get the pre-computed aggregate out. So you just get crazy, crazy performance. So, that's the cool index right now, but you have a pluggable index architecture, so that probably won't be the coolest index in five years. You know, there'll be something cooler.
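The pivot-table analogy can be sketched as a tiny two-level tree of pre-computed sums. This is a heavily simplified illustration and an assumption on my part, not Pinot's actual index layout: the dimensions, the `*` wildcard marker, and the `build_tree` and `lookup` helpers are invented for the example.

```python
from itertools import product

STAR = "*"  # wildcard node meaning "any value of this dimension"

# Hypothetical rows: (country, browser, page_views).
rows = [
    ("US", "chrome", 3),
    ("US", "firefox", 5),
    ("DE", "chrome", 2),
    ("US", "chrome", 7),
]


def build_tree(rows):
    """Pre-aggregate SUM(page_views) for every (country, browser) path,
    including the paths where one or both dimensions are the wildcard."""
    tree = {}
    for country, browser, views in rows:
        # Each row contributes to its exact path and to every star path above it.
        for c, b in product((country, STAR), (browser, STAR)):
            level = tree.setdefault(c, {})
            level[b] = level.get(b, 0) + views
    return tree


def lookup(tree, country=STAR, browser=STAR):
    """Walk one node per dimension and return the stored aggregate, with no row scan."""
    return tree.get(country, {}).get(browser, 0)


tree = build_tree(rows)
us_total = lookup(tree, country="US")
chrome_total = lookup(tree, browser="chrome")
```

The work happens once at build time; a query afterwards is a short walk down the tree rather than a pass over the rows, which is the trade the pivot-table comparison is pointing at.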

Adi Polak: I remember I attended Gwen Shapira's session on Why My Database, You Make Me Coffee or something like that.

Tim Berglund: Yes, yes for that. Things databases don't do, but should.

Adi Polak: It should. Thank you. I think it was one of her final bullet points, where she described kind of the future of databases, and she discussed custom indexes, specifically saying that for my database, whether it's real-time analytics or something different, it would be beneficial if it learned my patterns as a user and kind of pre-made these indexes for me. Which I thought was very interesting.

Tim Berglund: I feel like there are two layers to that, right? You could probably stand up a not-too-expensive ML model that looks at queries. What are you filtering on and what are you aggregating? And it figures out, of the indexes I have and the columns in your table, let me recommend some indexes. Like, throw traffic at me for a little while and I'll tell you what your table config should be to optimize these. And that's a doable feature. I don't think that's in Pinot's backlog, and I don't know where we'd shake out if somebody proposed it, but you can imagine that. The apotheosis of that idea of Gwen's is some kind of dynamic index builder that's a couple of stages past that, where I'm not picking from the 12 index modules I have, but I'm analyzing traffic and building an index set of rules that is performant. You know, that would be cool. That's truly next-gen stuff. That's like the ChatGPT of database indexes.
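The first layer of that idea, watching traffic and suggesting indexes, can be sketched with a simple frequency count standing in for the ML model. Everything here is hypothetical: the query-log format, the column names, the `recommend_filter_indexes` function, and the 50% threshold are assumptions for illustration, and none of it is an existing Pinot feature.

```python
from collections import Counter

# Hypothetical query log: which columns each query filtered on and aggregated.
query_log = [
    {"filters": ["country"], "aggregates": ["views"]},
    {"filters": ["country", "browser"], "aggregates": ["views"]},
    {"filters": ["country"], "aggregates": ["views"]},
    {"filters": ["device"], "aggregates": ["clicks"]},
]


def recommend_filter_indexes(log, min_share=0.5):
    """Suggest columns to index: those filtered on in at least min_share of queries."""
    # Count each column at most once per query.
    counts = Counter(col for query in log for col in set(query["filters"]))
    return sorted(col for col, n in counts.items() if n / len(log) >= min_share)


suggested = recommend_filter_indexes(query_log)
```

A real recommender would also weigh the aggregated columns, cardinality, and the index types available, which is where a learned model could earn its keep over a counter.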

Adi Polak: Wouldn't that be cool?

Tim Berglund: Yes.

Outro

Adi Polak: Alrighty. Any last thing you wanna share with our audience today? Funny quotes, fun facts, anything?

Tim Berglund: No, I don't have any funny quotes and I can't think of any fun facts that are relevant, but you can join us in Slack. We'll put a link in the show notes. Or check out our resources on YouTube or our developer site. If you wanna know more about Apache Pinot, we'd love to help you.

Adi Polak: Fantastic. So, how to get started with Pinot?

Tim Berglund: Yes.

Adi Polak: Slack, website, docs, YouTube...

Tim Berglund: All the above on dev.startree.ai is the short story. And we'll have links in the show notes.

Adi Polak: Fantastic. Well, Tim, thank you so much for joining me today. It was a pleasure as always.

Tim Berglund: Always a delight, Adi, thank you.

Intro
What are analytics?
What are real-time analytics?
The value of real-time analytics
The fine line between compute & storage
Apache Pinot & StarTree Cloud: Target audience
Who Viewed Your LinkedIn Profile featured by Apache Pinot
Technical requirements for real-time analytics
Outro