Accelerate migration from Hadoop to GCP BigQuery with 80% less manual effort

Written by Arvind S

Hadoop was widely adopted for its ability to manage both structured and unstructured data through a range of tools such as HDFS, HBase, Hive, Sqoop, Oozie, Pig, and Spark. However, each tool operates independently and often requires its own coding expertise. This lack of integration not only complicates maintenance but also hinders agility in system enhancements. Over time, enterprises that run their data platforms on Hadoop face issues with scalability, code maintenance, and environment administration. These issues underscore the need for efficient integration and modernization solutions such as data migration services.

Cloud data platforms offer enterprises the flexibility to scale resources as needed, reducing both capital expenditures (CAPEX) and operational expenditures (OPEX). Google Cloud Platform (GCP), with its BigQuery Platform as a Service (PaaS) Data Warehouse and Cloud Storage, provides an industry-leading data platform that is not only cost-effective but also seamlessly integrated with powerful AI and ML capabilities.

Migrating from on-premises Hadoop to GCP

The existing on-premises Hadoop environment of a large healthcare organization faced challenges in scaling its analytical and predictive models. Member, prospect, and enrollment/claims data were managed in the Hadoop landscape. The data science team and business users observed degradation in Hadoop performance, which impacted their ability to access the right insights at the right time for decision-making. The organization resolved these issues through modernization.

GCP BigQuery was the chosen data modernization platform to replace the existing Cloudera Hadoop environment.

Modernization approach

1. Assessment and Code analysis
The objective of assessment and code analysis is to evaluate the existing Hive queries and understand the changes involved in converting them to BigQuery SQL, uncovering potential challenges and client-specific customizations. During this phase, a detailed report was created that offered insights into the effort needed to migrate to GCP.
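An assessment of this kind typically starts by inventorying which SQL functions the existing Hive scripts actually use, so migration effort can be estimated per construct. A minimal sketch of such a scan is below; the function name `inventory_functions` and the sample queries are illustrative, not part of the A.AVA platform.

```python
import re
from collections import Counter

# Hypothetical sketch: count SQL function usage across a set of Hive
# scripts so migration effort per construct can be estimated.
FUNCTION_PATTERN = re.compile(r"\b([A-Za-z_]+)\s*\(")

def inventory_functions(scripts):
    """Return a Counter of function names used across the given HQL texts."""
    counts = Counter()
    for sql in scripts:
        counts.update(name.upper() for name in FUNCTION_PATTERN.findall(sql))
    return counts

sample = [
    "SELECT cast(id AS INT), year(event_dt) FROM members",
    "SELECT cast(amount AS DECIMAL) FROM claims WHERE year(claim_dt) = 2023",
]
print(inventory_functions(sample).most_common())
```

Running the inventory across all scripts in a repository would surface the high-frequency constructs (such as `CAST` and `YEAR` here) that most warrant tuned translation rules.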

The Ascendion AVA (A.AVA) Low Code Data Modernization Platform was leveraged for the code analysis and migration planning. The platform was set up as a containerized app on a Google Kubernetes Engine (GKE) cluster within the client's GCP environment and went through stringent security and vulnerability testing.

As part of the assessment, we identified commonly used SQL functions and tuned A.AVA for efficient translation tailored to the client landscape. Some of the translations include:

  • CAST() to SAFE_CAST()
  • Array of Struct handling
  • YEAR() to EXTRACT(YEAR FROM <>)
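The first and third translations above can be sketched as simple rule-based rewrites. The snippet below is a minimal illustration, not the A.AVA implementation; a production translator would parse the SQL rather than rely on regular expressions, which mishandle nesting and comments.

```python
import re

# Hypothetical rule-based Hive -> BigQuery rewrites mirroring the
# translations listed above.
RULES = [
    # CAST(x AS T) -> SAFE_CAST(x AS T): SAFE_CAST returns NULL instead of
    # failing the query when a value cannot be converted.
    (re.compile(r"\bCAST\s*\(", re.IGNORECASE), "SAFE_CAST("),
    # YEAR(col) -> EXTRACT(YEAR FROM col): BigQuery has no YEAR() function.
    (re.compile(r"\bYEAR\s*\(\s*([^)]+?)\s*\)", re.IGNORECASE),
     r"EXTRACT(YEAR FROM \1)"),
]

def translate(sql: str) -> str:
    """Apply each rewrite rule in order to the input SQL string."""
    for pattern, replacement in RULES:
        sql = pattern.sub(replacement, sql)
    return sql

print(translate("SELECT CAST(id AS INT64), YEAR(event_dt) FROM members"))
# -> SELECT SAFE_CAST(id AS INT64), EXTRACT(YEAR FROM event_dt) FROM members
```

Note the word-boundary anchor on `CAST`: it prevents the rule from matching an already-translated `SAFE_CAST`, which keeps the rewrite idempotent across iterations.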

2. Proof of Concept
The objective of this phase was to take the HQL scripts, convert them with the A.AVA platform, demonstrate how the platform accelerates the migration journey, and validate the converted scripts. We were given about 30 scripts to convert with A.AVA, and roughly 80% of the script content was converted automatically.

We also compared the A.AVA platform's accuracy against GCP's SQL translation service and one other product.

Factory model migration

Following the successful completion of the PoC, we initiated a factory model to migrate the pipelines to GCP. The following activities were performed as part of the migration strategy.

1. POD Teams
Process Conversion Team: a dedicated POD comprising engineers along with SMEs from the existing Hadoop environment.
Data Migration Team: a dedicated POD comprising engineers along with QA engineers for validating the data migrated from Hadoop to the BigQuery environment.

2. Code conversion and Validation
We published a framework explaining the source code intake process and the activities performed before sharing the converted code back with the client. Code conversion followed an iterative approach; data validation and testing ensured the converted code worked syntactically in BigQuery before it was handed over to the client team for further integration testing.
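A common building block for the data validation step is comparing row counts and an order-independent checksum between the source and target tables. The sketch below illustrates the idea on in-memory rows; the helper names are hypothetical, and in practice the rows would come from Hive and BigQuery query results.

```python
import hashlib

# Hypothetical sketch of a post-migration data validation check: compare
# row counts and an order-independent digest of rows from the source
# (Hive) and target (BigQuery) tables. Plain lists of tuples stand in
# for query results here.
def table_fingerprint(rows):
    """Return (row count, order-independent digest) for the given rows."""
    digest = 0
    for row in rows:
        h = hashlib.sha256(repr(row).encode()).hexdigest()
        digest ^= int(h, 16)  # XOR makes the digest order-independent
    return len(rows), digest

def tables_match(source_rows, target_rows):
    """True when both tables have the same rows, regardless of order."""
    return table_fingerprint(source_rows) == table_fingerprint(target_rows)

print(tables_match([(1, "a"), (2, "b")], [(2, "b"), (1, "a")]))  # -> True
```

Because result ordering is rarely guaranteed across engines, an order-independent digest lets the two sides be compared without an expensive sort; the row count guards against the XOR digest cancelling out duplicated rows.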

During the code conversion and testing process, we identified further enhancements to the A.AVA platform, which accelerated subsequent iterations of the data migration.

For the phase 1 data migration, some of the key benefits delivered by the A.AVA platform include:

  • 80% automation efficiency, saving 2,200+ hours of manual effort
  • Production deployment time reduced from 4 months to 1 month
  • 4,000+ data engineers, data scientists, and BI analysts enabled to deliver insights faster

 

Ascendion AVA Low Code Modernization Platform

The Ascendion AVA Low Code Data Modernization Platform specializes in automating migration from legacy platforms such as Oracle, Informatica, DB2, Teradata, and Hadoop. Its features include Intelligent Assessment, Schema Migration, Data Migration, Data Validation, and Process Conversion, enabling a streamlined and efficient transition to modern platforms such as GCP, AWS, and Azure.

Contact Ascendion to explore cutting-edge data modernization solutions. Experience effective transitions to modern cloud environments.