
Cloudera data platform migration

Overview

Challenge

On-prem workloads to be decommissioned within 18 months.

A large number of analytical products, data ingestion components, data volumes, and platform capabilities to be migrated and modernized. The existing platform:

  • Hosts 55+ ingestion components that acquire data in batch incremental mode from internal and external systems through a variety of interfaces: Oracle DBs, MS SQL DBs, SharePoint, Web APIs, NAS, and SFTP.
  • Hosts 25+ descriptive and predictive analytical products exposing custom reports via Spotfire and Power BI to enable data-driven decisions in areas such as drilling and completions, well productivity, product performance analytics, and drilling cost prediction.
  • Hosts 8+ TB of structured, semi-structured, and unstructured raw and analytical data.
  • Provides data exploration tools for data analysts, data scientists, and auditors under a strict access security model based on company policies.

Migration to an Azure-based data platform had to proceed with minimal refactoring of existing data transformation and business logic, and without business disruption, data quality issues, or performance degradation.

Network connectivity issues with the on-prem data sources and reporting systems that the new platform had to connect to.

High dependency on the Cloudera technology stack (e.g., Oozie, Impala, HDFS, Sqoop), complicating target Azure technology selection as well as the modernization and migration approaches.

A need to solve existing data inconsistency issues and reduce the overhead on the Ops team caused by the schedule-based triggering approach for executing ingestion component and analytical product pipelines.

Approach

The engagement started with a discovery phase focused on gaining a deep understanding of the existing platform architecture, preparing the target architecture, and conducting a 360° analysis of the existing analytical data products. The goal of the analysis was to divide analytical solutions into complexity buckets, agree on migration priority with the business, and select candidates for an MVP to start with.

As a result of the analysis, a migration platform architecture based on Azure PaaS was prepared, together with a list of PoCs to execute and the first analytical products from each complexity bucket to migrate as a first step. The goal of this iterative, MVP-based approach was to test the defined platform architecture by migrating analytical products of different complexity, identify potential issues, and define the accelerators required to streamline migration at scale.

MVP execution produced useful feedback that was incorporated into the platform architecture to facilitate migration efforts and increase the quality of the migration program, specifically:

Validation of Azure PaaS for data platform migration

Performance tests helped to confidently finalize the Azure PaaS technologies used as the core of the migration data platform.

Ingestion framework for SQL & Oracle data extraction

Recurrent patterns across data ingestion components were identified, which resulted in the delivery of a custom ingestion framework that generates ADF-based ingestion components from configuration files to extract data in batch full and incremental modes from SQL and Oracle data sources into the ADLS Gen2 data lake.
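
For illustration, the sketch below shows how such a configuration-driven generator might look. The config fields, dataset names, and pipeline layout are assumptions for this example, not the framework's actual schema.

```python
"""Minimal sketch of a config-driven ADF ingestion-pipeline generator (illustrative only)."""
import json

# Hypothetical ingestion config for one source table (full or incremental load).
CONFIG = {
    "source_type": "oracle",            # "oracle" or "sqlserver"
    "source_dataset": "ds_oracle_wells",
    "table": "WELLS.PRODUCTION",
    "load_mode": "incremental",         # "full" or "incremental"
    "watermark_column": "LAST_MODIFIED",
    "sink_dataset": "ds_adls_raw_parquet",
    "sink_folder": "raw/wells/production",
}

def copy_query(cfg: dict) -> str:
    """Build the extraction query; incremental loads filter on a watermark."""
    query = f"SELECT * FROM {cfg['table']}"
    if cfg["load_mode"] == "incremental":
        # The watermark value would be resolved at runtime from a pipeline parameter.
        query += f" WHERE {cfg['watermark_column']} > @{{pipeline().parameters.watermark}}"
    return query

def generate_pipeline(cfg: dict) -> dict:
    """Emit an ADF pipeline definition with a single Copy activity."""
    source_type = "OracleSource" if cfg["source_type"] == "oracle" else "SqlServerSource"
    reader_key = "oracleReaderQuery" if cfg["source_type"] == "oracle" else "sqlReaderQuery"
    return {
        "name": f"pl_ingest_{cfg['table'].replace('.', '_').lower()}",
        "properties": {
            "parameters": {"watermark": {"type": "string", "defaultValue": "1900-01-01"}},
            "activities": [{
                "name": "CopyToDataLake",
                "type": "Copy",
                "inputs": [{"referenceName": cfg["source_dataset"], "type": "DatasetReference"}],
                "outputs": [{"referenceName": cfg["sink_dataset"], "type": "DatasetReference",
                             "parameters": {"folder": cfg["sink_folder"]}}],
                "typeProperties": {
                    "source": {"type": source_type, reader_key: copy_query(cfg)},
                    "sink": {"type": "ParquetSink"},
                },
            }],
        },
    }

if __name__ == "__main__":
    # The generated JSON can be committed to the ADF repository or deployed via CI/CD.
    print(json.dumps(generate_pipeline(CONFIG), indent=2))
```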

Automated migration accelerator development

Recurrent patterns in the migration steps for analytical products were identified, which resulted in the creation of a custom accelerator that automates the migration routine for analytical products' data pipelines (e.g., generation of ADF pipelines that wrap Oozie workflow execution, auto-refactoring of source code to be compliant with HDInsight).
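
A minimal sketch of the wrapping idea, assuming a simple linear Oozie workflow: the XML handling, activity mapping, and linked-service name below are illustrative, and the real accelerator also covered forks, parameters, and source-code refactoring.

```python
"""Illustrative Oozie-to-ADF wrapping sketch, not the accelerator's actual code."""
import json
import xml.etree.ElementTree as ET

# Map Oozie action types to the ADF activity types that run them on HDInsight.
ACTION_TYPE_MAP = {"hive": "HDInsightHive", "spark": "HDInsightSpark"}

def local(tag: str) -> str:
    """Strip the XML namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]

def oozie_to_adf(workflow_xml: str, pipeline_name: str) -> dict:
    """Translate a linear Oozie workflow into an ADF pipeline definition."""
    root = ET.fromstring(workflow_xml)
    activities, previous = [], None
    for node in root:
        if local(node.tag) != "action":
            continue  # start/end/kill nodes have no ADF counterpart in this sketch
        action_type = next((local(c.tag) for c in node if local(c.tag) in ACTION_TYPE_MAP), None)
        if action_type is None:
            continue
        activity = {
            "name": node.attrib["name"],
            "type": ACTION_TYPE_MAP[action_type],
            "linkedServiceName": {"referenceName": "ls_hdinsight", "type": "LinkedServiceReference"},
            "typeProperties": {},  # script paths / jar arguments would be filled in here
            "dependsOn": [],
        }
        if previous is not None:
            # Preserve the original execution order via activity dependencies.
            activity["dependsOn"].append({"activity": previous, "dependencyConditions": ["Succeeded"]})
        activities.append(activity)
        previous = activity["name"]
    return {"name": pipeline_name, "properties": {"activities": activities}}

if __name__ == "__main__":
    sample = """<workflow-app xmlns="uri:oozie:workflow:0.5" name="wf_daily">
      <start to="clean"/>
      <action name="clean"><hive/><ok to="score"/><error to="fail"/></action>
      <action name="score"><spark/><ok to="end"/><error to="fail"/></action>
      <kill name="fail"><message>failed</message></kill>
      <end name="end"/>
    </workflow-app>"""
    print(json.dumps(oozie_to_adf(sample, "pl_wf_daily"), indent=2))
```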

Data quality validation strategy

Understanding the baseline architecture and finalizing the Azure technologies for the target platform helped define a scalable approach to data quality validation during migration, based on automated data comparison between on-prem data assets and migrated data assets.
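
As an illustration, a PySpark comparison of a source Hive table against its migrated copy might look like the sketch below; the table name, ADLS path, and checksum scheme are assumptions for this example.

```python
"""Illustrative data-comparison sketch: row counts plus an order-independent checksum."""
from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("dq-validation").enableHiveSupport().getOrCreate()

def fingerprint(df: DataFrame) -> dict:
    """Row count plus an order-independent checksum over all columns."""
    hashed = df.select(F.xxhash64(*df.columns).alias("row_hash"))
    row = hashed.agg(F.count("*").alias("rows"), F.sum("row_hash").alias("checksum")).first()
    return {"rows": row["rows"], "checksum": row["checksum"]}

# Hypothetical locations: the on-prem Hive table and its migrated copy in ADLS Gen2.
source = spark.table("warehouse.well_production")
target = spark.read.parquet("abfss://raw@datalake.dfs.core.windows.net/wells/production/")

src_fp, tgt_fp = fingerprint(source), fingerprint(target)
status = "MATCH" if src_fp == tgt_fp else "MISMATCH"
print(f"{status}: source={src_fp} target={tgt_fp}")
```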

As a result, the finalized Azure data platform architecture utilized:

Hub-and-spoke data architecture implementation

A hub-and-spoke approach for the data platform, with Azure resources and data for each analytical solution isolated in separate resource groups and subscriptions, supporting cost segregation, dedicated security perimeters, and a clear ownership model.
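
A minimal sketch of what per-solution isolation could look like using the azure-mgmt-resource SDK; the naming convention, tags, and subscription handling are assumptions, and the same result could equally be codified via IaC templates and Azure Policy.

```python
"""Illustrative provisioning of an isolated 'spoke' resource group for one analytical solution."""
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Hypothetical spoke definition: each analytical solution lands in its own
# subscription/resource group, with tags used for billing and ownership.
SPOKE_SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
SOLUTION = {"name": "drilling-cost-prediction", "owner": "analytics-team-a", "env": "prod"}

client = ResourceManagementClient(DefaultAzureCredential(), SPOKE_SUBSCRIPTION_ID)
client.resource_groups.create_or_update(
    f"rg-{SOLUTION['name']}-{SOLUTION['env']}",
    {
        "location": "westeurope",
        "tags": {  # tags drive cost segregation and the ownership model
            "solution": SOLUTION["name"],
            "owner": SOLUTION["owner"],
            "costCenter": SOLUTION["name"],
        },
    },
)
```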

Custom accelerators for CI/CD and data quality

Custom accelerators for automated creation of ingestion components, migration of analytical product components, validation of data quality, generation of CI/CD pipelines, etc.

ExpressRoute integration for enhanced data transfer

ExpressRoute for on-prem connectivity, providing a robust and fast network channel for data movement from the client's on-prem environment.

Azure HDInsight for optimized data workloads

Azure HDInsight as the PaaS technology for running the data processing workloads of existing data pipeline components, allowing reuse of existing business and transformation logic and reducing migration effort.
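
As a sketch of the reuse idea, an existing Spark job can be submitted to an HDInsight cluster through the Livy batch API. The cluster name, credentials, and job details below are placeholders; the case study does not state the exact submission mechanism.

```python
"""Illustrative submission of an existing Spark job to HDInsight via the Livy batch API."""
import requests

LIVY_URL = "https://mycluster.azurehdinsight.net/livy/batches"  # hypothetical cluster
AUTH = ("admin", "<cluster-password>")                          # HDInsight cluster login

# The existing pipeline jar and its logic are reused as-is; only the submission layer changes.
batch = {
    "file": "abfss://apps@datalake.dfs.core.windows.net/jars/well-productivity.jar",
    "className": "com.example.pipelines.WellProductivityJob",
    "args": ["--run-date", "2020-01-01"],
    "conf": {"spark.executor.instances": "4"},
}

resp = requests.post(LIVY_URL, json=batch, auth=AUTH,
                     headers={"X-Requested-By": "adf"})  # Livy requires this header on POSTs
resp.raise_for_status()
print("Submitted Livy batch id:", resp.json()["id"])
```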

Event-driven data pipeline execution

Event-based execution of data pipelines for ingestion components and analytical solutions, driven by data availability and an internal dependency graph.
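
A minimal sketch of the dependency-driven triggering logic: the graph contents, event shape, and trigger_pipeline stub are assumptions, and in the target platform this kind of logic would typically run in an Azure Function fed by Event Grid, starting ADF pipeline runs once all inputs are ready.

```python
"""Illustrative event-based triggering over a dataset dependency graph."""

# Hypothetical dependency graph: each analytical pipeline lists the datasets it needs.
DEPENDENCIES = {
    "pl_well_productivity": {"raw/wells/production", "raw/wells/completions"},
    "pl_drilling_cost": {"raw/wells/production", "raw/finance/costs"},
}

available: set[str] = set()   # datasets reported as landed so far
triggered: set[str] = set()   # pipelines already started in this cycle

def trigger_pipeline(name: str) -> None:
    """Placeholder for starting an ADF run (e.g., via azure-mgmt-datafactory)."""
    print(f"triggering {name}")

def on_dataset_available(dataset: str) -> None:
    """Handle a 'data landed' event and start every pipeline whose inputs are complete."""
    available.add(dataset)
    for pipeline, needs in DEPENDENCIES.items():
        if pipeline not in triggered and needs <= available:
            triggered.add(pipeline)
            trigger_pipeline(pipeline)

if __name__ == "__main__":
    for event in ["raw/wells/production", "raw/finance/costs", "raw/wells/completions"]:
        on_dataset_available(event)
```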

Achievements

Expedited MVP delivery and migration completion

Delivered the MVP in a 4-month time frame; the planned migration was completed in 16 months, and the on-prem platform was decommissioned according to the client's plans.

Development of migration accelerators

Developed 4 accelerators to speed up various aspects of the migration in a governed way.

Advanced capabilities deployment

Delivered new capabilities that were not available in the on-prem version of the platform, such as a granular security model across all layers, smart monitoring, event-based execution of data pipelines, a granular billing model, platform events, data lineage capturing, on-demand compute infrastructure provisioning, and data ingestion as a service.

Tech stack

Azure services

  • Data Lake Storage Gen2
  • Key Vault
  • Functions
  • Monitor
  • SQL Managed Instance
  • SQL
  • Data Factory
  • SQL Data Warehouse
  • DevOps Services
  • Event Grid
  • Policy
  • ExpressRoute

Big data

  • Databricks
  • Apache Oozie
  • Apache Hive
  • Apache Spark
  • Cloudera
  • Apache Hadoop

DevOps

  • JFrog

Programming languages

  • Java
  • Python

Business Intelligence

  • Power BI
  • Spotfire

Infrastructure automation

  • Ansible
  • PowerShell
