Interest in Apache Pulsar has been growing steadily, with the cloud-native distributed messaging and streaming platform building buzz since it became a top-level Apache Software Foundation project in late 2018. (It was submitted to the Apache incubator by Yahoo! in 2017.)
Apache Pulsar, put crudely, is a way for distributed servers to talk to each other very fast indeed, using a publish-subscribe (“pub-sub”) pattern. It can be used for messaging in event-driven systems and for streaming analytics; it can also be used to decouple applications to boost performance, reliability and scalability. Similar though that may sound to Apache Kafka, there are pronounced differences, including the way in which Pulsar separates compute and storage.
Pulsar, for example, delegates persistence to a separate system, Apache BookKeeper (a dedicated low-latency storage service designed for real-time workloads). Its “brokers”, by contrast, are stateless: they are not responsible for storing messages on their local disk. (Pulsar brokers run an HTTP server with a REST interface for admin and topic lookup, and a dispatcher to handle all message transfers.)
As Jaroslaw Kijanowski, a developer at SoftwareMill, notes tidily: “[Statelessness] makes spinning up or tearing down brokers much less of a hassle… The separation between brokers and the storage allows to scale these two layers independently. If you require more disk space to store more or bigger messages, you scale only BookKeeper. If you have to serve more clients, just add more Pulsar brokers. With Kafka, adding a broker means extending the storage capacity as well as serving capabilities”.
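To make that separation concrete, here is a minimal sketch in plain Python. It is a toy model, not Pulsar's or BookKeeper's actual API: the `LedgerStore` and `Broker` classes and the topic name are hypothetical, and the "storage layer" is just an in-memory dictionary. The point it illustrates is that a broker which keeps no messages of its own can be added at any time and serve existing data immediately.

```python
from collections import defaultdict


class LedgerStore:
    """Hypothetical stand-in for the storage layer (the role BookKeeper plays)."""

    def __init__(self):
        self._log = defaultdict(list)  # topic -> ordered list of messages

    def append(self, topic, message):
        self._log[topic].append(message)
        return len(self._log[topic]) - 1  # offset of the message just written

    def read(self, topic, offset=0):
        return self._log[topic][offset:]


class Broker:
    """Hypothetical stateless broker: it serves clients but stores no messages."""

    def __init__(self, name, store):
        self.name = name
        self._store = store  # shared reference to storage, no local copy

    def publish(self, topic, message):
        return self._store.append(topic, message)

    def consume(self, topic, offset=0):
        return self._store.read(topic, offset)


store = LedgerStore()
broker_a = Broker("broker-a", store)
broker_a.publish("orders", "order-1")
broker_a.publish("orders", "order-2")

# Scale the serving layer: a brand-new broker needs no data copied to it.
broker_b = Broker("broker-b", store)
messages_via_b = broker_b.consume("orders")
print(messages_via_b)  # ['order-1', 'order-2']
```

In this model, growing storage would mean changing only `LedgerStore`, while serving more clients means creating more `Broker` objects, which mirrors the independent scaling Kijanowski describes.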
Pulsar is at the heart of Yahoo! owner Verizon Media’s own architecture, where it handles hundreds of billions of data events each day. (Yahoo! developers described it in 2018 as “an integral part of our hybrid cloud strategy [that] enables us to stream data between our public and private clouds and allows data pipelines to connect across the clouds.”) It has also been deployed at Comcast, Huawei, Splunk, and beyond.
DataStax’s Kesque acquisition is a strategic pivot — and also a vote of confidence in Pulsar’s enterprise readiness.
DataStax’s acquisition this week of Kesque (a fully managed cloud messaging service powered by Apache Pulsar and available to run on AWS, Azure, or GCP) represents the clearest sign yet that Pulsar’s really coming of age in the enterprise space. It’s also an intriguing strategic move by the cloud data infrastructure specialist beyond databases and into streaming itself, with the Santa Clara-headquartered firm wasting no time in offering up “DataStax Luna Streaming”, a “production-ready” and open-source distribution of Pulsar with support on a subscription basis.
Pulsar with SLA- and SLO-backed subscription support may prove a tempting proposition for businesses looking to deploy, but wary of complex provisioning and management challenges in a still small market.
Chet Kapoor, chairman and CEO at DataStax, said in a canned statement on Wednesday: “There is a strong demand for a database as powerful as Cassandra coupled with scale-out event streaming – both halves of an enterprise data architecture. We are excited to innovate beyond the database and support enterprises with both world-class streaming and database technologies to power modern data apps.”
“Organizations begin to adopt event streaming when they realize how widely distributed their processes, systems and data have become, how much broader their ecosystems are, the speed at which they must now operate to compete and how critical it is to collect and correlate massively distributed data at high speeds,” according to Maureen Fleming, IDC’s program vice president for intelligent process automation. “It’s fair to think about event streaming as a superglue for transformation.”
DataStax’s Jonathan Ellis spelled out Apache Pulsar’s attractions in a more detailed blog post, arguing that Kafka came up short as a cloud-native, open source messaging service in four areas.
He added: “In Kafka, the unit of storage is a segment file, but the unit of replication is all the segment files in a partition.
“Each partition is owned by a single leader broker, which replicates to several followers. So when you need to add capacity to your Kafka cluster, some partitions need to be copied to the new node before it can participate in reducing the load on the existing nodes. This means that adding capacity to a Kafka cluster makes it slower before it makes it faster. If your capacity planning is on point, then this is fine, but if business needs change faster than you expected then it could be a serious problem.”
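The cost Ellis describes can be roughed out with a toy calculation. The sketch below is an illustration under loud assumptions, not Kafka's actual reassignment logic: the `plan_rebalance` function, the broker and partition names, and the 50 GB partition size are all hypothetical, and replication is ignored. It simply counts how much data has to be copied to a new broker before that broker owns an even share of partitions.

```python
def plan_rebalance(assignment, sizes_gb, new_broker):
    """Return (moves, copied_gb) needed to give new_broker an even share.

    assignment: dict of broker name -> list of partition names it owns.
    sizes_gb:   dict of partition name -> size on disk in GB.
    Toy model only: one copy per partition, no leaders or followers.
    """
    # Work on a copy so the caller's assignment is left untouched.
    remaining = {broker: list(parts) for broker, parts in assignment.items()}
    total = sum(len(parts) for parts in remaining.values())
    target = total // (len(remaining) + 1)  # partitions the new broker should own

    moves = []
    while len(moves) < target:
        # Take one partition from whichever broker currently owns the most.
        donor = max(remaining, key=lambda b: len(remaining[b]))
        partition = remaining[donor].pop()
        moves.append((partition, donor, new_broker))

    copied_gb = sum(sizes_gb[partition] for partition, _, _ in moves)
    return moves, copied_gb


assignment = {
    "broker-1": ["p0", "p1", "p2", "p3"],
    "broker-2": ["p4", "p5", "p6", "p7"],
    "broker-3": ["p8", "p9", "p10", "p11"],
}
sizes_gb = {f"p{i}": 50 for i in range(12)}

moves, copied_gb = plan_rebalance(assignment, sizes_gb, "broker-4")
print(len(moves), copied_gb)  # 3 150
```

Even in this simplified case, three whole partitions (150 GB) have to cross the network before the fourth broker can take on any load, and that copying competes with live traffic on the existing nodes.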
DataStax has built a new admin console onto Pulsar that makes it easier to manage multiple tenants across multiple regions from a single interface. It covers authentication and authorisation, storage quotas, and isolation policies that, for example, let users “optionally carve out hardware within the cluster that is dedicated to a single tenant”. The company cites three customers on-record, including Liquid Analytics, which uses Pulsar to underpin real-time decisions across Finance, Sales, Marketing, and HR organizations through its AI-based platform, Liquid Decisions.
It may be early days, but expect to hear a lot more about Apache Pulsar in the near future (and no doubt to see the hyperscalers wade into the fray in due course). With its Kesque acquisition, DataStax gets something close to first-mover advantage and takes a significant step towards offering a much more rounded set of subscriptions. A clever buy.