
NextGen Data Lake, Data Pipeline Using AWS ETL

Implementing automated data ingestion pipelines with AWS ETL services for T-Systems network configuration data

September 09, 2021 · Laszlo Hadhazy

Current Solution & Goals

Using an on-premises Oracle Exadata database, the client ingests and merges configuration data from various networks. The process takes more than two hours per technology. These processes must be migrated cost-effectively to the AWS cloud while also reducing the ingestion time. Two main AWS service solutions were evaluated: AWS Glue, a serverless framework running Spark ETL logic, and EMR, a Hadoop environment running on EC2 server instances.

How T-Systems used AWS to implement a cost-effective data ingestion pipeline


Two ETL deployments were tested: both AWS Glue and AWS Elastic MapReduce (EMR) provide mechanisms to create the ETL (Extract, Transform, Load) logic necessary to transform the raw input (a zipped TXT file) into formatted table structures. The advantage of AWS Glue is that it is a serverless service, requiring no manual maintenance of server resources. Compute resources can also be scaled simply by adjusting the number of worker processes (DPUs) used by an AWS Glue job. However, being a managed service, it is also more expensive and, in this case, less flexible than EMR.
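The shape of that transformation can be sketched as follows. The article does not describe the raw record layout, so the semicolon delimiter and the column names below are assumptions for illustration only; in the actual pipeline the same logic would be expressed as a Spark job (a `spark.read` of the TXT export followed by a Parquet write) on Glue or EMR.

```python
import csv
import gzip
import io

# Hypothetical column layout for a raw network-configuration export;
# the real file format is not described in the article.
COLUMNS = ["element_id", "technology", "parameter", "value"]

def parse_raw_export(raw_bytes: bytes) -> list[dict]:
    """Turn a gzipped, semicolon-delimited TXT export into row dicts.

    In the EMR deployment this would be a Spark read, e.g.
    spark.read.option("sep", ";").csv("s3://.../export.txt.gz"),
    followed by df.write.parquet(...); this plain-Python version
    only shows the shape of the transformation.
    """
    text = gzip.decompress(raw_bytes).decode("utf-8")
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return [dict(zip(COLUMNS, row)) for row in reader if row]
```

For example, a two-line export compressed with `gzip.compress(b"NE1;LTE;txPower;43\n...")` would come back as a list of dicts keyed by the assumed column names, ready to be written out as a table.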

EMR provides a much more powerful solution with only a minimal increase in administrative overhead: even though several EC2 instances are spun up during execution of the ingest workflow, the EMR cluster is transient, meaning that it is automatically terminated once all steps are finished.

EMR is also one of the most cost-effective ETL solutions available, especially considering the spot instances that can be specified when launching the cluster. Even though resource scaling is not as seamless as with AWS Glue, the EMR instance fleet option provides a flexible way to mix on-demand and spot instances and to define auto-scaling policies that trigger on EMR metrics.

The EMR cluster was tested with several instance type settings to find a well-balanced solution that optimizes both the runtime of the EMR steps and the cost of the cluster.
As the steps run Spark ETL jobs, a memory-optimized instance type is recommended, such as the r5 or the newer Graviton-based r6g instance family.
The r5-family instances performed consistently better than the corresponding r6g instances, even though in December 2020 Amazon suggested an approximately 15% speed-up for Spark workloads on r6g instances.

For the final result of a 22-minute runtime, the r5.4xlarge instance type was used with one master and two core nodes. This provides a total of 48 vCPUs and 370 GB of memory. (The r5.xlarge did not provide enough memory for the job to complete, and the r5.2xlarge took approximately 25% more time.)
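A cluster of this shape can be sketched as a `run_job_flow` request: one on-demand r5.4xlarge master plus two r5.4xlarge core nodes requested as spot capacity via an instance fleet, with the cluster configured as transient. The release label, role names, and S3 paths are placeholders, not values from the article.

```python
def build_ingest_cluster_request(log_uri: str = "s3://example-logs/emr/") -> dict:
    """Build a boto3 run_job_flow request for the transient ingest cluster.

    Master node runs on-demand for stability; core nodes are requested
    as spot capacity to keep the cluster cost-effective.
    """
    return {
        "Name": "network-config-ingest",
        "ReleaseLabel": "emr-6.2.0",  # assumed release, not stated in the article
        "Applications": [{"Name": "Spark"}],
        "LogUri": log_uri,
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "Instances": {
            # Transient cluster: terminate automatically when all steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
            "InstanceFleets": [
                {
                    "InstanceFleetType": "MASTER",
                    "TargetOnDemandCapacity": 1,
                    "InstanceTypeConfigs": [{"InstanceType": "r5.4xlarge"}],
                },
                {
                    "InstanceFleetType": "CORE",
                    # Spot capacity for the two core nodes.
                    "TargetSpotCapacity": 2,
                    "InstanceTypeConfigs": [{"InstanceType": "r5.4xlarge"}],
                },
            ],
        },
    }

# The request would be submitted with boto3:
#   import boto3
#   emr = boto3.client("emr")
#   response = emr.run_job_flow(**build_ingest_cluster_request())
```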

Orchestration


For the execution of the ingest steps, an AWS Step Functions state machine was implemented. The workflow is triggered by an S3 PutObject event, which signals the arrival of a zipped input file on the S3 bucket from the DataSync task.
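The trigger can be sketched as a small Lambda handler that reacts to the S3 PutObject event and starts a state-machine execution. The state-machine ARN is a placeholder, and the Step Functions client is passed in explicitly here so the handler can be exercised without AWS access; in a deployed Lambda it would simply be `boto3.client("stepfunctions")`.

```python
import json

# Placeholder ARN; the article does not name the state machine.
STATE_MACHINE_ARN = "arn:aws:states:eu-central-1:123456789012:stateMachine:ingest"

def handle_put_object(event: dict, sfn_client) -> str:
    """Start one ingest execution per uploaded object.

    `event` is the standard S3 event-notification payload delivered
    to Lambda; bucket and key are forwarded to the state machine as
    its execution input.
    """
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    response = sfn_client.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps({"bucket": bucket, "key": key}),
    )
    return response["executionArn"]
```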

Ingestion data flow using AWS EMR as the ETL engine

ETL workflow orchestration using AWS Step Functions


Conclusion

Using the ETL workflow implemented with EMR, the ingestion of raw batch data could be completed in approx. 30 minutes or less, as opposed to the two hours taken by the current on-premises solution. AWS EMR provides several powerful big-data ETL engines, such as Apache Spark, on a highly customizable and scalable infrastructure.
By defining instance fleets and groups at the EMR cluster creation stage, one can take advantage of spot instances and thereby deploy a cost-effective solution for such a powerful compute environment.
The result of the ETL business logic is stored in compressed Parquet format, a columnar format generally recommended for analytical data access. Furthermore, the advantage of using the S3 central data lake is that it integrates well with data analytics services such as AWS Glue, EMR, Athena, and Redshift.
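As one example of that integration, the Parquet output in the data lake can be queried in place with Athena. The database, table, and bucket names below are hypothetical; the column names match the illustrative layout assumed earlier, not a documented schema.

```python
def build_athena_query_request(technology: str) -> dict:
    """Build a boto3 start_query_execution request against the
    (hypothetical) Parquet-backed table of ingested configuration data."""
    return {
        "QueryString": (
            "SELECT element_id, parameter, value "
            "FROM network_config.ingested "
            f"WHERE technology = '{technology}'"
        ),
        "QueryExecutionContext": {"Database": "network_config"},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": "s3://example-athena-results/"},
    }

# Submitted with boto3:
#   import boto3
#   athena = boto3.client("athena")
#   athena.start_query_execution(**build_athena_query_request("LTE"))
```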

About the author
Laszlo Hadhazy – AWS Solution Architect


AWS Solution Architect, T-Systems International GmbH
