
Cloudera data platform migration

Overview

Challenge

On-prem workloads to be decommissioned within 18 months.

A large number of analytical products, data ingestion components, data volumes, and platform capabilities to be migrated and modernized. The existing platform:

  • Hosts 55+ ingestion components that acquire data in batch incremental mode from internal and external systems through a variety of interfaces: Oracle DBs, MS SQL DBs, SharePoint, Web APIs, NAS, and SFTP.
  • Hosts 25+ descriptive and predictive analytical products exposing custom reports via Spotfire and Power BI to enable data-driven decisions in areas such as drilling and completions, well productivity, product performance analytics, and drilling cost prediction.
  • Hosts 8+ TB of structured, semi-structured, and unstructured raw and analytical data.
  • Provides data exploration tools for data analysts, data scientists, and auditors under a strict access security model based on company policies.

Migration to an Azure-based data platform had to proceed with minimal refactoring of existing data transformation and business logic, and without business disruption, data quality issues, or performance degradation.

Network connectivity issues with the on-prem data sources and reporting systems that the new platform had to connect to.

High dependency on the Cloudera technology stack (e.g., Oozie, Impala, HDFS, Sqoop), complicating target Azure technology selection as well as the modernization and migration approaches.

A need to solve existing data inconsistency issues and reduce the overhead on the Ops team caused by the schedule-based triggering approach for executing ingestion component and analytical product pipelines.

Approach

The engagement started with a discovery phase focused on gaining a deep understanding of the existing platform architecture, preparing the target architecture, and conducting a 360° analysis of the existing analytical data products. The goal of the analysis was to divide analytical solutions into complexity buckets, agree on migration priority with the business, and select candidates for an MVP to start with.

As a result of the analysis, a migration platform architecture based on Azure PaaS was prepared, together with a list of PoCs to execute and the first analytical products from each complexity bucket to migrate as a first step. The goal of this iterative, MVP-based approach was to test the defined platform architecture by migrating analytical products of different complexity, identify potential issues, and define the accelerators required to streamline migration at scale.

MVP execution produced useful feedback that was incorporated into the platform architecture to facilitate migration efforts and increase the quality of the migration program, specifically:

Validation of Azure PaaS for data platform migration

Performance tests helped to confidently finalize the Azure PaaS technologies used as the core of the migration data platform.

Ingestion framework for SQL & Oracle data extraction

Recurrent patterns across data ingestion components were identified, which resulted in the delivery of a custom ingestion framework that generates ADF-based ingestion components from configuration files to extract data in batch full and incremental modes from SQL and Oracle data sources into the ADLS Gen2 data lake.
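
For illustration, the sketch below shows how such a configuration-driven generator might look. The config fields, dataset names, and pipeline layout are assumptions for this example, not the framework's actual schema.

```python
"""Minimal sketch of a config-driven ADF ingestion-pipeline generator (illustrative only)."""
import json

# Hypothetical ingestion config for one source table (full or incremental load).
CONFIG = {
    "source_type": "oracle",            # "oracle" or "sqlserver"
    "source_dataset": "ds_oracle_wells",
    "table": "WELLS.PRODUCTION",
    "load_mode": "incremental",         # "full" or "incremental"
    "watermark_column": "LAST_MODIFIED",
    "sink_dataset": "ds_adls_raw_parquet",
    "sink_folder": "raw/wells/production",
}

def copy_query(cfg: dict) -> str:
    """Build the extraction query; incremental loads filter on a watermark."""
    query = f"SELECT * FROM {cfg['table']}"
    if cfg["load_mode"] == "incremental":
        # The watermark value would be resolved at runtime from a pipeline parameter.
        query += f" WHERE {cfg['watermark_column']} > @{{pipeline().parameters.watermark}}"
    return query

def generate_pipeline(cfg: dict) -> dict:
    """Emit an ADF pipeline definition with a single Copy activity."""
    source_type = "OracleSource" if cfg["source_type"] == "oracle" else "SqlServerSource"
    reader_key = "oracleReaderQuery" if cfg["source_type"] == "oracle" else "sqlReaderQuery"
    return {
        "name": f"pl_ingest_{cfg['table'].replace('.', '_').lower()}",
        "properties": {
            "parameters": {"watermark": {"type": "string", "defaultValue": "1900-01-01"}},
            "activities": [{
                "name": "CopyToDataLake",
                "type": "Copy",
                "inputs": [{"referenceName": cfg["source_dataset"], "type": "DatasetReference"}],
                "outputs": [{"referenceName": cfg["sink_dataset"], "type": "DatasetReference",
                             "parameters": {"folder": cfg["sink_folder"]}}],
                "typeProperties": {
                    "source": {"type": source_type, reader_key: copy_query(cfg)},
                    "sink": {"type": "ParquetSink"},
                },
            }],
        },
    }

if __name__ == "__main__":
    # The generated JSON can be committed to the ADF repository or deployed via CI/CD.
    print(json.dumps(generate_pipeline(CONFIG), indent=2))
```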

Automated migration accelerator development

Recurrent patterns in the migration steps for analytical products were identified, which resulted in the creation of a custom accelerator that automates the migration routine for analytical products' data pipelines (e.g., generation of ADF pipelines that wrap Oozie workflow execution, auto-refactoring of source code to be compliant with HDInsight).
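
A minimal sketch of the wrapping idea, assuming a simple linear Oozie workflow: the XML handling, activity mapping, and linked-service name below are illustrative, and the real accelerator also covered forks, parameters, and source-code refactoring.

```python
"""Illustrative Oozie-to-ADF wrapping sketch, not the accelerator's actual code."""
import json
import xml.etree.ElementTree as ET

# Map Oozie action types to the ADF activity types that run them on HDInsight.
ACTION_TYPE_MAP = {"hive": "HDInsightHive", "spark": "HDInsightSpark"}

def local(tag: str) -> str:
    """Strip the XML namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]

def oozie_to_adf(workflow_xml: str, pipeline_name: str) -> dict:
    """Translate a linear Oozie workflow into an ADF pipeline definition."""
    root = ET.fromstring(workflow_xml)
    activities, previous = [], None
    for node in root:
        if local(node.tag) != "action":
            continue  # start/end/kill nodes have no ADF counterpart in this sketch
        action_type = next((local(c.tag) for c in node if local(c.tag) in ACTION_TYPE_MAP), None)
        if action_type is None:
            continue
        activity = {
            "name": node.attrib["name"],
            "type": ACTION_TYPE_MAP[action_type],
            "linkedServiceName": {"referenceName": "ls_hdinsight", "type": "LinkedServiceReference"},
            "typeProperties": {},  # script paths / jar arguments would be filled in here
            "dependsOn": [],
        }
        if previous is not None:
            # Preserve the original execution order via activity dependencies.
            activity["dependsOn"].append({"activity": previous, "dependencyConditions": ["Succeeded"]})
        activities.append(activity)
        previous = activity["name"]
    return {"name": pipeline_name, "properties": {"activities": activities}}

if __name__ == "__main__":
    sample = """<workflow-app xmlns="uri:oozie:workflow:0.5" name="wf_daily">
      <start to="clean"/>
      <action name="clean"><hive/><ok to="score"/><error to="fail"/></action>
      <action name="score"><spark/><ok to="end"/><error to="fail"/></action>
      <kill name="fail"><message>failed</message></kill>
      <end name="end"/>
    </workflow-app>"""
    print(json.dumps(oozie_to_adf(sample, "pl_wf_daily"), indent=2))
```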

Data quality validation strategy

Understanding the baseline architecture and finalizing the Azure technologies for the target platform helped define a scalable approach to data quality validation during migration, based on automated data comparison between on-prem data assets and migrated data assets.
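
As an illustration, a PySpark comparison of a source Hive table against its migrated copy might look like the sketch below; the table name, ADLS path, and checksum scheme are assumptions for this example.

```python
"""Illustrative data-comparison sketch: row counts plus an order-independent checksum."""
from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("dq-validation").enableHiveSupport().getOrCreate()

def fingerprint(df: DataFrame) -> dict:
    """Row count plus an order-independent checksum over all columns."""
    hashed = df.select(F.xxhash64(*df.columns).alias("row_hash"))
    row = hashed.agg(F.count("*").alias("rows"), F.sum("row_hash").alias("checksum")).first()
    return {"rows": row["rows"], "checksum": row["checksum"]}

# Hypothetical locations: the on-prem Hive table and its migrated copy in ADLS Gen2.
source = spark.table("warehouse.well_production")
target = spark.read.parquet("abfss://raw@datalake.dfs.core.windows.net/wells/production/")

src_fp, tgt_fp = fingerprint(source), fingerprint(target)
status = "MATCH" if src_fp == tgt_fp else "MISMATCH"
print(f"{status}: source={src_fp} target={tgt_fp}")
```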

As a result, the finalized Azure data platform architecture utilized:

Hub-and-spoke data architecture implementation

A hub-and-spoke approach for the data platform, with Azure resources and data for each analytical solution isolated in separate resource groups and subscriptions, supporting cost segregation, dedicated security perimeters, and a clear ownership model.
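
A minimal sketch of what per-solution isolation could look like using the azure-mgmt-resource SDK; the naming convention, tags, and subscription handling are assumptions, and the same result could equally be codified via IaC templates and Azure Policy.

```python
"""Illustrative provisioning of an isolated 'spoke' resource group for one analytical solution."""
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Hypothetical spoke definition: each analytical solution lands in its own
# subscription/resource group, with tags used for billing and ownership.
SPOKE_SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
SOLUTION = {"name": "drilling-cost-prediction", "owner": "analytics-team-a", "env": "prod"}

client = ResourceManagementClient(DefaultAzureCredential(), SPOKE_SUBSCRIPTION_ID)
client.resource_groups.create_or_update(
    f"rg-{SOLUTION['name']}-{SOLUTION['env']}",
    {
        "location": "westeurope",
        "tags": {  # tags drive cost segregation and the ownership model
            "solution": SOLUTION["name"],
            "owner": SOLUTION["owner"],
            "costCenter": SOLUTION["name"],
        },
    },
)
```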

Custom accelerators for CI/CD and data quality

Custom accelerators for automated creation of ingestion components, migration of analytical product components, validation of data quality, generation of CI/CD pipelines, etc.

ExpressRoute integration for enhanced data transfer

ExpressRoute for on-prem connectivity, providing a robust and fast network channel for data movement from the client's on-prem environment.

Azure HDInsight for optimized data workloads

Azure HDInsight as the PaaS technology for running the data processing workloads of existing data pipeline components, allowing reuse of existing business and transformation logic and reducing migration effort.
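
As a sketch of the reuse idea, an existing Spark job can be submitted to an HDInsight cluster through the Livy batch API. The cluster name, credentials, and job details below are placeholders; the case study does not state the exact submission mechanism.

```python
"""Illustrative submission of an existing Spark job to HDInsight via the Livy batch API."""
import requests

LIVY_URL = "https://mycluster.azurehdinsight.net/livy/batches"  # hypothetical cluster
AUTH = ("admin", "<cluster-password>")                          # HDInsight cluster login

# The existing pipeline jar and its logic are reused as-is; only the submission layer changes.
batch = {
    "file": "abfss://apps@datalake.dfs.core.windows.net/jars/well-productivity.jar",
    "className": "com.example.pipelines.WellProductivityJob",
    "args": ["--run-date", "2020-01-01"],
    "conf": {"spark.executor.instances": "4"},
}

resp = requests.post(LIVY_URL, json=batch, auth=AUTH,
                     headers={"X-Requested-By": "adf"})  # Livy requires this header on POSTs
resp.raise_for_status()
print("Submitted Livy batch id:", resp.json()["id"])
```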

Event-driven data pipeline execution

Event-based execution of data pipelines for ingestion components and analytical solutions, driven by data availability and an internal dependency graph.
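
A minimal sketch of the dependency-driven triggering logic: the graph contents, event shape, and trigger_pipeline stub are assumptions, and in the target platform this kind of logic would typically run in an Azure Function fed by Event Grid, starting ADF pipeline runs once all inputs are ready.

```python
"""Illustrative event-based triggering over a dataset dependency graph."""

# Hypothetical dependency graph: each analytical pipeline lists the datasets it needs.
DEPENDENCIES = {
    "pl_well_productivity": {"raw/wells/production", "raw/wells/completions"},
    "pl_drilling_cost": {"raw/wells/production", "raw/finance/costs"},
}

available: set[str] = set()   # datasets reported as landed so far
triggered: set[str] = set()   # pipelines already started in this cycle

def trigger_pipeline(name: str) -> None:
    """Placeholder for starting an ADF run (e.g., via azure-mgmt-datafactory)."""
    print(f"triggering {name}")

def on_dataset_available(dataset: str) -> None:
    """Handle a 'data landed' event and start every pipeline whose inputs are complete."""
    available.add(dataset)
    for pipeline, needs in DEPENDENCIES.items():
        if pipeline not in triggered and needs <= available:
            triggered.add(pipeline)
            trigger_pipeline(pipeline)

if __name__ == "__main__":
    for event in ["raw/wells/production", "raw/finance/costs", "raw/wells/completions"]:
        on_dataset_available(event)
```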

Achievements

Expedited MVP delivery and migration completion

Delivered the MVP in a 4-month time frame; the planned migration was completed in 16 months, and the on-prem platform was decommissioned according to the client's plans.

Development of migration accelerators

Developed 4 accelerators to speed up various aspects of the migration in a governed way.

Advanced capabilities deployment

Delivered new capabilities that were not available in the on-prem version of the platform, such as a granular security model across all layers, smart monitoring, event-based execution of data pipelines, a granular billing model, platform events, data lineage capturing, on-demand compute infrastructure provisioning, and data ingestion as a service.

Tech stack

Azure services

  • Data Lake Storage Gen2
  • Key Vault
  • Functions
  • Monitor
  • SQL Managed Instance
  • SQL
  • Data Factory
  • SQL Data Warehouse
  • DevOps Services
  • Event Grid
  • Policy
  • ExpressRoute

Big data

  • Databricks
  • Apache Oozie
  • Apache Hive
  • Apache Spark
  • Cloudera
  • Apache Hadoop

DevOps

  • JFrog

Programming languages

  • Java
  • Python

Business Intelligence

  • Power BI
  • Spotfire

Infrastructure automation

  • Ansible
  • PowerShell
