Firebolt founder Eldad Farkash and his team have built a new cloud-native (AWS-based) data warehouse with some unique features under the hood: from how it decouples storage and compute, to how it handles data compression, queries, and tackles latency, including through sparse indexing. The company — which aims to take on market darling Snowflake — came out of stealth mode on December 9th, with $37 million in backing from Zeev, TLV, Bessemer, and Angular Ventures and claims to deliver a “sub-second interactive analytics experience” with terabytes to petabytes of data, via a pay-per-use SaaS model with granular hardware control.
We’re making this 50-strong, engineer-dominated team (largely based in Israel), The Stack‘s inaugural “one to watch”: companies we are confident are going to be well-known names in the near future.
Who’s this aimed at?
Firebolt is targeting data engineers for (typically) born-in-the-cloud companies or applications that – from data lake to data warehouse – are running up against latency limitations. Such users want to be able to serve interactive queries to a dashboard at blistering pace, while also reducing the amount of compute and storage (and hence money) that they need to run these workloads in the cloud. Demand is rising: businesses and their customers increasingly want to be able to interact with a dashboard and get answers from formidably large and growing data sets nigh-instantly. Indeed, for many enterprises, this is increasingly central to their offering.
For this to be viable, the query needs to reach where the data is stored, find an answer and then deliver the answer back to the dashboard. If businesses can only query a small fraction of their “Big Data” because it’s too expensive, time-consuming or just downright difficult to query a large dataset, often they end up serving stale or incomplete responses.
Let’s dive a little deeper…
What is Firebolt?
For those wanting semi-structured data analytics with sub-second performance at serious scale, Firebolt is an intriguing proposition. The team — led by Eldad, a former CTO of Business Intelligence (BI) software company Sisense, and co-founder Saar Bitner, previously Sisense’s CMO — is aiming to take a substantial bite out of a cloud analytics market anticipated to be worth over $65 billion by 2025, with a cloud data warehouse designed down to the hardware level to deliver the performance and efficiency companies need to power analytics at petabyte scale.
The Stack’s founder Ed Targett tried to keep up with Firebolt CEO Eldad (who’s a pretty low latency soul himself) to learn more.
Who are you targeting here?
“We’re selling to data engineering teams. And 100% of our users are born in the cloud. Their business was growing from the cloud. They’re all data-driven companies. I’m not talking about using data for insight: I’m talking about their businesses being operated based on the data they gather, and based on how they can utilise the data for their users.
“Scale is pretty much a solved problem in this space. We went for a design that we think tackles the biggest problem going forward, which is speed, latency and efficiency. For data serving… the point is that you want to go from your lake, to your dashboard, to your customer-facing analytics with no steps involved. You want your data warehouse to support intensive, high-performance querying, which is something that doesn’t exist today.
“If you talk to nine out of 10 companies who build and work with data, they all say the same thing: ‘We’ve been spending so much time, energy, and mostly frustration on getting data pipelines, the data lake, everything stitched together — just to discover that people open their BI or their dashboards and still get the same shitty experience.’”
“If you look at what happens past the data warehouse today, you will see crazy stuff that has existed for 20 years. People offload the data from the warehouse, and they massage it, they trim it, they make it smaller, they make it less granular. And with one purpose: to just get this dashboard returning in less than five minutes… [To do that] your only option with cloud-native vendors is to put more money in. If you look at a Snowflake: ‘you double your cluster, and hopefully you’ll get double the speed.’
“At Firebolt, we want to break any perception that speed and latency are a thing you need to pay resources for. People talk about how data grows, but never talk about how queries grow; how much it costs them to get data into the warehouse and then served out to meet actual needs.
You’ve started with around 10 customers, refining the product and are now coming out of stealth. What’s the reaction been like? You need large, real-world data sets to test something like this… Have people been open?
“The demand and interest and the discussion with companies has been [immense] because of the urgency of the problem. Companies are willing to open up their data as well. There is no old-school IT process any more; there is security and compliance, which is a must, of course. But we are finding you get to the person who decides to sign the cheque immediately.
“From my experience, as someone who’s been selling BI — where basically you need to convince everyone at the bottom to use your dashboard; then when you sell data warehousing, you talk to a few people — for us we immediately get in touch with data engineering; it’s great.
“For some, there have been conversations where it is a case of ‘well guys, truthfully, this speedboat was not built for you; maybe we’ll talk in two years’, but we’re even starting to talk to companies we would never have imagined talking to: old-school companies who moved to the cloud and accelerated cloud usage and data usage on the cloud dramatically over the last year. Especially because data warehousing was this heavy, long process where you had to work a lot to even try it out. But Firebolt is a SaaS data warehouse. You don’t really operate it. You just talk to it with the ecosystem. Most of our users don’t directly operate our warehouse. They use Tableau, they use Looker, they use modern ELT and ETL.
Users can filter in a granular way like they never could before. And they smile! Smiling in our space is rare, because frustration drives everything
OK, take us under the hood a little…
“When you operate your data on the cloud, the storage tiers get more complex. For example, we live on S3. Our data is stored on S3. But you can’t really do low latency, high velocity, many reads on S3. It needs to be much smarter than that. And what cloud data warehouse vendors do, is try to optimise how much data they download from S3 to the clusters, the local cache, and to make sure that the cache, the SSD — which is much smaller, much faster — will have the data you need when you run your queries.
“The innovation of Firebolt is in two spaces. One is completely changing… how much data you need to scan on S3 to get your result. The second thing is everything that happens between RAM and CPU. The biggest thing is… in our file format, which we call Triple F. This is a new format that serves us. Our query engine was designed exclusively to run on that.
“How does it work? In a nutshell, without diving in to too many details, we do something that is unique, which is we order and we sort the data while you ingest it. Why is that important? Imagine that you have a table. You have 800 fields in it; typical for a Big Data use case. But most of your queries will be filtered by 10% of those columns… Firebolt introduces the primary index on tables, meaning you define a primary index, and the columns within the index will make sure that your table will be sorted by those columns. Think about if you ingest data as a stream, and then you chunk up the stream. Each chunk becomes a file, a Triple F file.
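The sort-on-ingest step Eldad describes can be sketched in a few lines. This is a hypothetical illustration, not Firebolt’s actual code: incoming rows are buffered into chunks, and each chunk is sorted by the table’s primary-index columns before it is written out as a file. All names here (`ingest`, `primary_index`, the sample columns) are made up for the example.

```python
def ingest(stream, primary_index, chunk_size=4):
    """Split a row stream into chunks sorted by the primary-index columns.

    Each returned 'file' stands in for one sorted Triple F-style chunk.
    """
    sort_key = lambda r: tuple(r[c] for c in primary_index)
    chunk, files = [], []
    for row in stream:
        chunk.append(row)
        if len(chunk) == chunk_size:
            files.append(sorted(chunk, key=sort_key))  # sort before writing out
            chunk = []
    if chunk:  # flush the final partial chunk
        files.append(sorted(chunk, key=sort_key))
    return files

rows = [
    {"customer": "b", "date": 3}, {"customer": "a", "date": 1},
    {"customer": "c", "date": 2}, {"customer": "a", "date": 2},
]
files = ingest(iter(rows), primary_index=["customer", "date"])
# files[0] is now ordered a/1, a/2, b/3, c/2
```

The point of paying this sorting cost up front is that every later trick — ordering-aware compression, sparse indexing, pruning — relies on the data within each file being sorted.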
“The Triple F file is sorted based on this primary index. And because it’s sorted, we can do magic, because if your data is sorted, two things happen. One is you can introduce compression and encoding that exploit the fact that it’s ordered. Which means much smaller payloads sitting in S3. The second thing, which gets more interesting, is introducing sparse indexing. Sparse indexing means that you can get coverage of the data in a column or in a table, but you don’t need to store all the data. A sparse index is something that is very small in terms of memory, in terms of size. But it needs the data to be ordered, or the sparse index will not be efficient. If you order the data, you can apply sparse indexing. When you apply sparse indexing, you can apply much smarter pruning.
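A minimal sketch of why sorting makes sparse indexing work, with illustrative names only: instead of indexing every row, store every Nth key of a sorted column plus the row offset where it starts. Because the column is sorted, a binary search over this tiny index bounds the only row range that could contain a value.

```python
from bisect import bisect_right

def build_sparse_index(sorted_keys, granularity=3):
    """Keep only every Nth (key, row_offset) pair from a sorted column."""
    return [(sorted_keys[i], i) for i in range(0, len(sorted_keys), granularity)]

def candidate_rows(index, key, total_rows, granularity=3):
    """Return the [start, end) row range that could contain `key`."""
    pos = bisect_right([k for k, _ in index], key) - 1
    if pos < 0:
        return (0, 0)  # key sorts before everything in the table
    start = index[pos][1]
    return (start, min(start + granularity, total_rows))

keys = [1, 2, 4, 4, 7, 9, 12, 15, 20]       # a sorted primary-index column
idx = build_sparse_index(keys)               # [(1, 0), (4, 3), (12, 6)]
print(candidate_rows(idx, 9, len(keys)))     # → (3, 6): three rows, not nine
```

Note the index holds three entries for nine rows; at warehouse scale the same ratio lets a RAM-resident index cover enormous tables, at the price of reading a small neighbourhood of rows rather than an exact match.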
“Think about it like in your cluster, in RAM, you’ll have those indexes. They’re very small. And they can span hundreds of terabytes, even petabytes, of data. And when you run a query, things happen. And eventually, one crucial part happens: pruning. The pruning engine will say: ‘You’ve been filtering by a specific product, a time range, and a few customers.’ A typical data warehouse or a typical compute engine will only be able to download data from S3 at file granularity. When you store those in Parquet and the like, your engine can tell you one of two things: ‘I need to download the file to scan it, to process it, to know if those customers are in it,’ or, ‘That’s it.’
OK, you’re losing me, but hopefully some of my readers will stay with you; keep rolling…
“What people [typically] do is they play with the size of the file. A bigger file means you can compress it better. But it also means you download much more than you need. It’s kind of a back and forth over what size to store in S3. Snowflake has the concept of micro-partitions.
“The difference with Firebolt is that we don’t work on files when the query runs. The query engine works on the concept of ranges. A range is a more granular representation of what we need to get. Think like you have a 40 gigabyte compressed Triple F file sitting in S3, because we did merging and compaction while we ingested, growing the file by merging multiple smaller ones. But the query engine will tell you: ‘Well, you don’t need to download the file. I don’t even care about the file. I care about 50 ranges within this file that point exactly to the predicate that you described.’ So, customer, date and product, those represent 50 ranges. They could be 100 megabytes, 2 kilobytes, but it’s the precise range within the file that is stored in S3 that we need to get. And when we scan S3, we scan by range versus by file.
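The scan-by-range idea can be sketched as follows — a toy model, not Firebolt’s implementation. Instead of downloading the whole object, the engine fetches only the byte ranges that pruning identified; against real S3 this corresponds to ranged `GET` requests using the HTTP `Range` header. The object, the ranges, and the function names below are all illustrative.

```python
def fetch_ranges(blob: bytes, ranges):
    """Simulate ranged GETs: return only the requested slices of the object."""
    return [blob[start:end] for start, end in ranges]

# A 40-byte stand-in for a large compressed file sitting in object storage.
obj = bytes(range(40))

# Suppose pruning decided only two small ranges match the query's predicates.
wanted = [(4, 8), (20, 24)]
pieces = fetch_ranges(obj, wanted)

downloaded = sum(len(p) for p in pieces)
print(f"downloaded {downloaded} of {len(obj)} bytes")  # downloaded 8 of 40 bytes
```

The saving scales with how selective the predicates are: the engine pays for the bytes the query actually touches, not for the file size the storage layer happened to choose.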
“The first wow people get when they run a query is: ‘How come the size of your scan’ — because you always see how much data your query is scanning — ‘is so much smaller? Is it because of compression? Is it because of approximation?’ Yes and yes. But the real factor that affects the size of the payload you download for a scan is the concept of working on ranges. Of course, there are a lot of technicalities. You need to completely change local storage. We don’t download the file. It’s not a Parquet sitting in S3 or on SSD. The way we store those ranges on SSD, in local storage, is different from the original file. This is why we call it ‘Triple F’: the same file, the same data, will be represented in three different ways depending on the storage tier the data goes through: S3 on hard disk; the SSD cache mid-range; RAM the last step. [But…] ultimately, I have data sitting in my lake. I run an insert, and this stuff happens behind the scenes, so it’s really designed for people who don’t deal with complexities around that. If you’re a SQL analyst or a data engineer and SQL is your first choice and that’s how you work, it should fit that. And that’s our target audience.
You made that sound deceptively easy at one point there…
“You should look at me having cold showers after a long day, crying and looking for something to show me the light. [Laughs] In our micro-universe, it’s all about hard work and having the best talent with us. And we banged our head over every step. When we started to deal with S3, we immediately realised it was not going to work, because our design for the query was ordered data, sparse indexing, ranges… If you spin up an engine with Firebolt (the equivalent of a cluster), we came to the point where we had to build our own version of the OS. We had to deal with I/O, and we had to deal with the way we store data locally in a very different way than we used to.
“You can find storage engineering within Firebolt. You can find a query engine. We have a big team that exclusively deals with SQL front end: language, compatibility, optimisation landscape, all of that. It’s really spanning across storage, operating system, and pure vectorised in-memory processing. We have everything.
“This is why I love Firebolt. In my previous life, it was all about HPC. But now, it’s also about storage. It’s also about elasticity. I haven’t come to the point of elasticity because you know elasticity: resource isolation and being able to deal with that involves consistency protocols… It’s really about combining a lot of disciplines and being hardcore about solving them. That’s what we’ve been doing over the last two years. No magic: hard work, and a lot of innovation, and getting shit done.
You’ve mentioned Snowflake a few times. Who else are you up against, and how are you pricing this? You earlier blogged about going down the annual subscription rather than the pay-per-use route.
We started on work with real people and real problems
“Our product is actually meant to replace a lot of use cases you would use a BigQuery or a Redshift or a Snowflake for. Compute, as we all know, is extremely expensive. And we make compute very efficient. Previously one or two people were allowed to operate the warehouse, play with it, build new features and actually touch the big data. Now, the whole team does that. A typical data engineer will tell you it takes them two weeks to add a new feature to their product. And that it takes them eight months to turn that feature into a data-driven feature. With Firebolt, I can do that in a few weeks. So for data engineers, it’s no longer a data warehouse mindset. It’s a developer mindset where they look at productivity and time to market.
“That’s very different than my previous life within BI and classic data warehousing, where it was all strategic, it was all trying to understand the need five years ahead. Here, it’s instant. We go crazy with speed.