AWS touts a "zero-ETL" future for Aurora, Redshift, Spark

Extract, Transform, Load (ETL) jobs – combining data from multiple sources into a large, central repository – can be a colossal headache, even between AWS services. At a keynote on Tuesday, AWS CEO Adam Selipsky announced a “zero-ETL” drive to help tackle that pain point, including a preview that will enable “near real-time analytics and machine learning using Amazon Redshift on petabytes of data from Aurora.”

The big idea: obviate the need to build and maintain data pipelines for ETL operations by connecting the two services so that “within seconds of transactional data being written into Aurora, the data is available in Amazon Redshift.” (Aurora is a managed relational database service; Redshift is Amazon’s widely used data warehouse.)

This, said Selipsky, “helps solve one of the greatest ETL pain points for our customers.” (That two AWS services interacting sub-optimally was a great customer ETL pain point is something it would be rude to dwell on.)

AWS eyes zero-ETL future, new Redshift integration for Apache Spark

Meanwhile, a new Redshift integration for Apache Spark, designed to help developers build and run Apache Spark applications on Amazon Redshift data, was also announced. Amazon EMR (a cloud big data platform for running large-scale distributed data processing jobs) is also integrated into this set of “zero-ETL” releases.

For Amazon EMR 6.9, the integration is available across all three deployment models for EMR: EC2, EKS, and Serverless. AWS says customers can use these new services to build applications that write directly to Redshift tables as part of their ETL workflows, or that combine data in Redshift with data from other sources.

Developers can load data from Redshift tables to Spark data frames or write data to Redshift tables: “Developers don’t have to worry about downloading open source connectors to connect to Redshift.”
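For a sense of what that looks like in practice, here is a minimal PySpark sketch of reading a Redshift table into a data frame and appending one back. It is an illustration under assumptions, not AWS's reference code: the format string and option names (`url`, `dbtable`, `tempdir`, `aws_iam_role`) follow the open source spark-redshift community connector, and the helper names are hypothetical, as are any cluster endpoint, bucket, and IAM role you pass in.

```python
from typing import Dict

# Data source name of the open source spark-redshift community connector.
# Assumption: the EMR integration exposes the same name; check the EMR docs.
REDSHIFT_FORMAT = "io.github.spark_redshift_community.spark.redshift"


def redshift_options(jdbc_url: str, table: str, temp_dir: str,
                     iam_role_arn: str) -> Dict[str, str]:
    """Collect connector options for one Redshift table (hypothetical helper)."""
    return {
        "url": jdbc_url,               # JDBC endpoint of the Redshift cluster
        "dbtable": table,              # table to read from or write to
        "tempdir": temp_dir,           # S3 location used to stage the data
        "aws_iam_role": iam_role_arn,  # role Redshift assumes to reach S3
    }


def read_table(spark, options: Dict[str, str]):
    """Load a Redshift table into a Spark data frame."""
    return spark.read.format(REDSHIFT_FORMAT).options(**options).load()


def append_table(df, options: Dict[str, str]) -> None:
    """Append the rows of a Spark data frame to a Redshift table."""
    df.write.format(REDSHIFT_FORMAT).options(**options).mode("append").save()
```

On a cluster where the connector is available, `read_table(spark, redshift_options(...))` would return a data frame that can be joined with other sources before `append_table` writes the result back.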