You’ve probably relied on a recommendation engine several times this month without noticing: That time you bought a comfortable dressing gown and got recommended some sexy lingerie for more eventful days? The escapist thriller that popped up on your favoured video streaming hub? The alluring city break proposals on that travel site? All were underpinned by data plumbing that may be less glamorous than a cocktail on the perfect beach in your new satin, but which has the potential to dramatically transform customer experience.
Personalisation remains a huge, growing opportunity across most verticals and delivering it in as close to real-time as humanly possible is critical. DataStax wants to be the go-to company for building this infrastructure in a cost-effective, cloud-native way across every aspect of the experience: storage, streaming, compute.
Earlier this year it bought Kaskada, a startup built to expand the reach of machine learning (ML) by stripping the complexity and cost out of feature engineering (using insight from raw event data streams to inform ML model training: call it real-time AI, or simply call it the ability to give customers what they want).
Within weeks of the acquisition, DataStax open-sourced the underlying engine of what had been a proprietary SaaS product, releasing it in late March under a permissive Apache 2.0 licence (the code is available on GitHub).
Today (May 4) DataStax is releasing a new support service for the offering to help grow uptake, branding it “Luna ML”, to sit alongside “Luna” (expertise and support for mission-critical databases) and “Luna Streaming” (enterprise-ready data streaming built on open source Apache Pulsar). The release means DataStax now provides storage, streaming and compute for modern, scalable, cloud-native applications that can support real-time AI.
Luna ML: Real-Time AI support
So why not just rebrand Kaskada’s SaaS and offer it as a turnkey real-time AI solution?
As Kaskada co-founder Davor Bonaci – now EVP at DataStax – puts it to The Stack: “This is about winning the practitioner. Developers, engineers have real decision-making power about how things are designed and architected; how things are built. By putting Luna ML out there and basing it on open source, we want to become the de facto standard for solving this problem and give people a lot of value very quickly.”
“People have been using Cassandra as a key part of their AI stack for a really long time,” he adds.
“Uber, Netflix and so on. DataStax has been continuously adding capabilities to make it easier for companies to build real-time AI. That means streaming, right? If you want to ingest data in real time, asynchronously into the stack, streaming is the right solution for that. Then you need to be able to process data from those asynchronous real time streams, to compute real-time features; to get real-time predictions.
“This is where open source Kaskada and Luna ML come into play.”
Getting a little more down in the weeds…
To drill down a level, what, exactly, does open source Kaskada offer? (Luna ML ultimately offers DataStax’s help and expert support getting set up and optimising it, rather than a managed service, which may follow.)
Bonaci has previously emphasised that his team saw that no one was looking at the process of going from raw, event-based data to computed feature values. What does that mean exactly? “That users had to choose: use SQL and treat the events as a table, losing important information in the process, or use lower-level data pipeline APIs and worry about all the details. Our experience working on data processing systems at Google and as part of Apache Beam led us to create a compute engine designed for the needs of feature engineering.”
Kaskada tackles this by attacking stateful stream processing in a high-level, declarative query language designed specifically for reasoning about events in bulk and in real time. Its query language “builds on the best features of SQL to provide a more expressive way to compute over events,” with queries that are “simple and declarative [and which] unlike SQL are also concise, composable, and designed for processing events…” as Bonaci puts it.
“The power and convenience of Kaskada’s query language come from the fact that it’s built from a new abstraction: the timeline. Timelines give you the declarative transformations and aggregations of SQL without losing the ability to reason about temporal context, time travel, sequencing, timeseries, etc.
“Any query can be used, unchanged, in either batch or streaming mode.”
Computation meanwhile is implemented as a single, chronological pass over the input events, so you can compute over datasets that are significantly larger than available memory. Internally, events are stored on disk as Parquet files: “We find that most computations are bottlenecked on I/O, so using an efficient columnar file format lets us selectively read the columns and row ranges needed to produce a result,” Kaskada’s team says.
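To make the single-pass idea concrete, here is a minimal Python sketch. It is not Kaskada's engine or query language: the `Event` fields, the `running_features` helper and the sample data are all hypothetical, chosen only to show how time-ordered inputs can be merged lazily and turned into per-entity running feature values without loading every event into memory.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    time: int      # event timestamp (hypothetical: seconds since some epoch)
    entity: str    # hypothetical key, e.g. a user ID
    amount: float  # hypothetical payload, e.g. a purchase value


class FeatureRow(NamedTuple):
    time: int
    entity: str
    count: int     # events seen so far for this entity
    total: float   # running sum of amounts for this entity


def running_features(*sources: Iterable[Event]) -> Iterator[FeatureRow]:
    """Merge time-sorted sources and emit per-entity running aggregates."""
    counts: Dict[str, int] = defaultdict(int)
    totals: Dict[str, float] = defaultdict(float)
    # heapq.merge lazily interleaves inputs that are already sorted by time,
    # holding only one pending event per source at any moment.
    for event in heapq.merge(*sources, key=lambda e: e.time):
        counts[event.entity] += 1
        totals[event.entity] += event.amount
        # One output row per input event: the feature value "as of" that time.
        yield FeatureRow(event.time, event.entity,
                         counts[event.entity], totals[event.entity])


if __name__ == "__main__":
    # Two hypothetical event sources, each already in chronological order.
    web = [Event(1, "alice", 10.0), Event(4, "bob", 5.0)]
    app = [Event(2, "alice", 2.5), Event(3, "bob", 1.0)]
    for row in running_features(web, app):
        print(row)
```

The only state kept in this sketch is one small aggregate per entity, which is why a pass of this kind can, in principle, cover far more events than fit in memory. A real engine would additionally read only the needed columns and row ranges from its Parquet files, which this sketch does not attempt.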
The result is a modern event processing engine that installs in seconds without any external dependencies and computes quickly and efficiently. Now it’s freely available under an Apache 2.0 licence too and there’s a go-to, cost-effective place for support in getting set up and tuning how you run it, with DataStax’s Luna ML.
If Bonaci is bemused by seeing his brainchild given away for free after the acquisition, he hides it well: “Making Kaskada part of an open source, open data stack, where every component can talk to the other, everyone can see it and contribute to it? It’s exciting. There is so much potential here for customers to build pipelines that let them do a better job of surfacing the right thing to the right user at the right time or giving insight on the ‘next best action’.” That’s not just across e-commerce, advertising, video streaming etc., he says, but also segmentation work, for example in gaming, where players of similar abilities are matched with each other, or “more canonical use cases like churn prediction. There are a lot of common use cases that we see from predictive AI, where event-based data coming from an application can underpin really powerful predictive capabilities.”