Cracking the code on high-scale data pipelines
SPONSORED — Enterprises reliant on a steady stream of data need to know when that data is changing.
If you are a global food producer, for example, having real-time insight into production capacity and your customers’ orders is critical to customer relationships – and in turn your own success. Or if you are a major private equity firm, having real-time operational metrics and transactional updates from your portfolio companies is important for tracking their performance and mitigating risk.
This “real-time” component is increasingly critical. Leading businesses are moving from periodic reviews of data at rest to dynamically understanding and adapting to data streams. Too often, however, the digital plumbing underpinning these streams is deeply inefficient, and DevOps teams are under pressure to come up with complex workarounds that accommodate dynamic changes in data.
Change data capture (CDC) has become one popular solution. With CDC, the responsibility for publishing change events moves from applications to the database that serves them, which typically involves tracking row-level changes in source tables. But for enterprises running large, scalable, distributed architectures underpinned by databases like Apache Cassandra, getting CDC working has been challenging.
As databases scale, providing a reliable, consistent stream of change events becomes increasingly difficult. And when it comes to scale, Apache Cassandra – relied on by companies from Netflix and Facebook to Capital One and Expedia Group – is perhaps the most scalable database of them all.
As DataStax VP Chris Latimer notes with regard to earlier CDC on Cassandra efforts: “Changes could originate from one of any number of nodes in a database cluster and be propagated throughout the cluster members.
“This made features that were trivial in relational databases, like sequencing and deduplication, orders of magnitude more complex to solve. Organisations had to decide whether to tackle this complexity head-on or find workarounds, such as batch ETL processes, to detect data changes in their Cassandra database…”
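To make the sequencing and deduplication problem concrete, here is a minimal Python sketch. It assumes each change event carries a row key, a write timestamp and the name of the node that reported it; the `ChangeEvent` type and the `dedupe_and_order` helper are hypothetical names chosen for illustration, not part of Cassandra or any DataStax product. Because several replicas can report the same write, the sketch drops repeats of the same (key, timestamp) pair and then sorts what is left by timestamp.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical change event; the field names are assumptions."""
    key: str        # primary key of the changed row
    timestamp: int  # write timestamp, e.g. in microseconds
    node: str       # replica that reported the change
    value: str      # new value of the row


def dedupe_and_order(events: Iterable[ChangeEvent]) -> List[ChangeEvent]:
    """Keep the first copy of each write, then order by timestamp."""
    seen = set()
    unique = []
    for event in events:
        mutation_id = (event.key, event.timestamp)
        if mutation_id in seen:
            continue  # another replica already reported this write
        seen.add(mutation_id)
        unique.append(event)
    # sorted() is stable, so events with equal timestamps keep arrival order.
    return sorted(unique, key=lambda e: e.timestamp)


if __name__ == "__main__":
    raw = [
        ChangeEvent("order-1", 200, "node-b", "shipped"),
        ChangeEvent("order-1", 100, "node-a", "created"),
        ChangeEvent("order-1", 200, "node-c", "shipped"),  # same write, other replica
    ]
    for event in dedupe_and_order(raw):
        print(event.timestamp, event.node, event.value)
```

The sketch holds everything in one process's memory. In a real cluster the record of already-seen writes would itself have to be shared across nodes and bounded in size, which hints at why the problem is so much harder than on a single relational server.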
Change Data Capture on Cassandra: DataStax’s Astra DB and Astra Streaming provide a world-class option
It was a problem DataStax, dedicated to helping customers mobilise real-time data, was determined to fix. When DataStax turned to the problem of providing reliable CDC for Cassandra, it went to work not only on the database, but also on Apache Pulsar, a cloud-native server-to-server message streaming system.
Ed Anuff, Chief Product Officer at DataStax, notes: “When you have multiple servers, all making multiple changes, this means you actually have to take a very different approach to CDC, if you want it to work.”
“We recognised that turning distributed data into a set of events is a streaming problem. So we embedded a [Pulsar-based] solution into Cassandra, and used that as the basis of our CDC in a pretty unique approach.”
The move was a smart one. Pulsar adoption is soaring because of some unique capabilities that set it apart from its more widely adopted and better-known rival, Apache Kafka, and it has matured fast in recent years.
CDC on Cassandra? Thanks, Pulsar…
Michael Smith, Director of Engineering at investment community platform Commonstock, tells The Stack that the use of Pulsar makes a lot of sense. For his deployment, for example, he initially looked at Kafka, but found “a lot of things that Kafka didn’t do out of the box that we’d have to think about. And one of those was that scaling was more difficult because you had to scale the broker in addition to scaling storage,” as he puts it.
“We’re data heavy, so we’re taking in a lot of trades. And we also have market data that we’re ingesting. So we have all this data, but don’t always need a corresponding amount of compute with it.
“It would be a waste to scale up the brokers just for more storage for stuff that isn’t really going to be touched that often,” he notes, adding that with Apache Pulsar “there was also encryption: [Unlike Kafka] in Pulsar I had out-of-the-box encryption between the client and storage. When we looked into Pulsar the immediate thing that caught my attention though was the ability to scale brokers and storage independently – as well as tiered storage where your old data gets shuffled off to S3 but it’s still available through the same API; so seamless.”
DataStax’s Cassandra CDC solution has been built into Astra Streaming, an elastically scalable, multi-cloud messaging and event streaming platform that is built on Apache Pulsar and fully managed by DataStax, and on which customers can build streaming applications. Using this as a core component of CDC with Cassandra opens up a huge range of possibilities; until now, Cassandra’s scale made creating a reliable stream hugely challenging. CDC for Astra DB (DataStax’s Cassandra-based DBaaS) means customers can now capture changes in instances of Astra DB and publish them to a message topic.
“Now we can build a data pipeline off a giant data set in Cassandra. I can hook in everything else that I want to have hanging off of that – and it will be able to deal with these changes as they happen,” says Anuff.
Speaking to The Stack, the DataStax Chief Product Officer is excited about what this means for customers: “People want their Elasticsearch to be in sync with their database, for example. People use this for data pipelines, where they are sending the data into multiple places; they use it for monitoring,” Anuff explains, adding that for a lot of companies, “if something happens ‘here’, I want multiple systems to know about it, I want to be able to react to it, I want to be able to process it, and I want to be able to connect it to different applications or endpoints.”
“Ultimately if your data engineers can’t connect things together you’re basically selling them a black box: that just won’t play well anymore, because everybody is using multiple systems; maybe they want to string data from Cassandra into Snowflake, for example, because they want to have their business folks doing analytics in a data warehouse. People are mixing and matching data architectures. That’s a beautiful thing, it’s very cool and we want to help make it as easy, efficient, and economical as possible for everyone,” adds Anuff.
He credits DataStax co-founder Jonathan Ellis with the insight which unlocked CDC in Cassandra.
“At a certain point he said, ‘look, we’re actually reinventing the wheel in terms of CDC with what streaming platforms are doing’. When you look at what Pulsar brings to the table by having a persistent ordered log and being able to take out of order data, but deliver it in order – that’s essentially why we went to Pulsar.”
“We did some smart things under the hood, this isn’t just your off-the-shelf Pulsar. We do things where that consistent log and all that stuff is using our same underlying storage tier. So we make Pulsar scale out as well. And there are different guarantees that Pulsar enforces to be able to deliver those sequences in order.”
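The idea of taking out-of-order data and delivering it in order can be illustrated with a small reorder buffer. The Python sketch below is not how Pulsar or Astra Streaming is implemented; it is a generic technique shown under a simplifying assumption, namely that every item carries a gap-free integer sequence number. The `Resequencer` class is a hypothetical name for illustration.

```python
import heapq
from typing import List, Tuple


class Resequencer:
    """Buffers out-of-order items and releases them in sequence order.

    Illustrative only: assumes gap-free integer sequence numbers
    starting at `first_seq`.
    """

    def __init__(self, first_seq: int = 0) -> None:
        self._next = first_seq                  # next sequence number to deliver
        self._heap: List[Tuple[int, str]] = []  # min-heap of waiting items

    def push(self, seq: int, payload: str) -> List[str]:
        """Accept one item; return every payload that is now deliverable."""
        if seq < self._next:
            return []  # already delivered, so this is a late duplicate
        heapq.heappush(self._heap, (seq, payload))
        ready: List[str] = []
        while self._heap and self._heap[0][0] <= self._next:
            head_seq, head_payload = heapq.heappop(self._heap)
            if head_seq == self._next:
                ready.append(head_payload)
                self._next += 1
            # Otherwise head_seq < self._next: a buffered duplicate, skip it.
        return ready
```

Pushing items 2, 0 and 1 in that order releases nothing, then item 0, then items 1 and 2 together. The cost of the approach is that one missing item holds back everything behind it, so a production system also needs persistence and a policy for gaps.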
To Commonstock’s Michael Smith, getting DataStax involved was a no-brainer.
“We found out about Astra Streaming as a managed solution and we’re like, ‘yeah, that’s perfect’,” he says frankly. “Getting it up and running was super straightforward. It didn’t take a ton of developer time – because that’s the most important thing; the mantra really is ship or die – and they really went above and beyond to accommodate us, you know, VPC peering, posting it in a region that was flexible for us; they helped us out immensely with a cluster migration – the customer service aspect of it was super important.”
DataStax says it has designed its open data stack – whether that is Astra DB or Astra Streaming – to work with how organisations’ use of data is evolving. CPO Ed Anuff says: “What we saw was time and again with the modern open data stack that’s emerging is the need for scale-out data at rest and streaming” – and while it has chosen Pulsar to deliver the performance and flexibility needed for CDC streaming from Cassandra, the “bigger bet” is much wider; Apache Cassandra users at every level will benefit from the CDC innovations.
“They may not be aware of it, they’re just saying, wow, I finally have reliable CDC for Cassandra and that’s pretty cool. But it is definitely part of this data architecture transition that’s coming – from thinking about everything in a data-store centric way, which is very much ‘how’s the data sitting in storage’, versus thinking about data in motion – how am I dealing with data in real time context?”
CDC for Astra DB – DataStax’s enterprise DBaaS built on Cassandra – is powered by Astra Streaming, a multi-cloud streaming-as-a-service offering built on Apache Pulsar. Using a simple configuration-based approach, you can enable CDC on one or more of your Astra DB tables and publish the changes to an event topic in Astra Streaming. From there, your real-time applications can subscribe to change events using client libraries in Java, Golang, Python, or Node.js. Additional endpoints support direct subscription via a WebSocket interface or a standard JMS client. If the destination of your CDC data is another platform such as Snowflake, Elasticsearch, Kafka or Redis (to name just a few), Astra Streaming also allows you to create real-time data pipelines through a simple configuration-driven interface using the built-in connector library.
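As a rough idea of what subscribing from Python might look like, the sketch below uses the open source Apache Pulsar Python client (`pulsar-client`). The service URL, topic name, token and the `cdc-demo` subscription name are placeholders, and `drain` and `connect_and_drain` are hypothetical helper names; none of them come from DataStax documentation. The article does not specify how change events are encoded, so the handler simply receives raw bytes.

```python
from typing import Callable


def drain(consumer, handle: Callable[[bytes], None], max_messages: int) -> int:
    """Receive up to max_messages and acknowledge each one that is handled."""
    handled = 0
    for _ in range(max_messages):
        msg = consumer.receive()  # blocks until a message arrives
        try:
            handle(msg.data())
        except Exception:
            consumer.negative_acknowledge(msg)  # ask the broker to redeliver
            continue
        consumer.acknowledge(msg)
        handled += 1
    return handled


def connect_and_drain(service_url: str, topic: str, token: str) -> int:
    """Connect with token authentication and process ten change events."""
    import pulsar  # requires: pip install pulsar-client

    client = pulsar.Client(
        service_url,
        authentication=pulsar.AuthenticationToken(token),
    )
    try:
        # "cdc-demo" is an arbitrary subscription name chosen for this sketch.
        consumer = client.subscribe(topic, subscription_name="cdc-demo")
        return drain(consumer, lambda data: print(len(data), "bytes"), 10)
    finally:
        client.close()
```

A call would look something like `connect_and_drain("pulsar+ssl://<broker-host>:6651", "persistent://<tenant>/<namespace>/<topic>", "<token>")`, with the bracketed values replaced by those of your own streaming tenant. Acknowledging only after the handler succeeds means a failed event is redelivered rather than silently lost.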