Hari Krishna Aitha

Copenhagen

Summary

Data Engineer with over 7 years of experience in designing and optimizing data solutions across AWS, Azure, and GCP. Expertise in managing large-scale data pipelines and ensuring data integrity while enhancing performance. Recognized for strong time management and problem-solving abilities, contributing to team success and organizational growth.

Overview

8 years of professional experience

Work History

Citi Bank

04.2024 - Current

Job overview

  • Company Overview: Citi Bank is a global financial institution offering banking, credit, investment, and wealth management services to individuals, businesses, and governments
  • Design, build, and maintain scalable ETL/ELT pipelines using Azure Data Factory (ADF) and Azure Synapse Analytics
  • Automate data ingestion and transformation processes for structured, semi-structured, and unstructured data
  • Developed PySpark applications for ETL operations across multiple data pipelines
  • Used PySpark to improve the performance of existing Hadoop algorithms, working with SparkContext, Spark SQL, DataFrames, and pair RDDs
  • Analyzed and developed a modern data solution with Azure PaaS services to enable data visualization
  • Assessed the application's current production state and the impact of new installations on existing business processes
  • Developed Spark Streaming programs to process near-real-time data from Kafka with both stateless and stateful transformations
  • Develop data pipelines and workflows using Azure Databricks to process and transform large volumes of data, using languages such as Python, Scala, and SQL
  • Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster
  • Created data tables using PyQt to display customer and policy information and to add, delete, and update customer records
  • Work with query languages such as SQL, programming languages such as Python and C#, and scripting languages such as PowerShell, Power Query M, and Windows batch commands
  • Utilized Elasticsearch and Kibana for indexing and visualizing the real-time analytics results, enabling stakeholders to gain actionable insights quickly
  • Involved in various phases of the Software Development Lifecycle (SDLC), including requirements gathering, design, development, deployment, and analysis of the application
  • Involved in loading data into Cassandra NoSQL Database
  • Develop ETL/ELT processes to prepare data for analytics and reporting
  • Implement data governance policies to ensure data quality, consistency, and compliance
  • Use Azure Purview or similar tools for data cataloging, lineage tracking, and metadata management
  • Ensure data security by implementing encryption, access controls, and monitoring using Azure Security Center and Azure Key Vault
  • Monitor data pipelines and systems for performance, reliability, and cost
  • Work closely with data scientists to provide clean, structured data for machine learning models and advanced analytics
  • Use Azure Monitor, Log Analytics, and Application Insights to troubleshoot issues and optimize resource usage
  • Manage and optimize Azure cloud infrastructure, including virtual machines, storage accounts, and networking components
  • Support business analysts by enabling access to data through tools like Power BI or Tableau
  • Use Infrastructure as Code (IaC) tools like Terraform or Azure Resource Manager (ARM) templates for deployment and management
  • Work with IoT data streams from sensors and equipment used in oil and gas operations
  • Use Azure IoT Hub, Azure Stream Analytics, or Apache Kafka for real-time data processing and analytics
  • Document data pipelines, architectures, and processes for future reference and onboarding of new team members
  • Implemented Synapse integration with Azure Databricks notebooks, which reduced development work by about half
  • Achieved performance improvements in Synapse loading by implementing dynamic partition switching
  • Built and configured Jenkins slaves for parallel job execution
  • Installed and configured Jenkins for continuous integration and performed continuous deployments
  • Successfully managed data migration projects, including importing and exporting data to and from MongoDB, ensuring data integrity and consistency throughout the process
  • Worked on Jenkins pipelines to run various steps including unit, integration and static analysis tools
  • Skilled in monitoring servers using Nagios, CloudWatch, and the ELK Stack (Elasticsearch and Kibana)
  • Extensively used Azure Athena to ingest structured data from Azure Blob Storage into systems such as Azure Synapse Analytics and to generate reports
  • Developed and maintained data models and schemas within Snowflake, including the creation of tables, views, and materialized views to support business reporting and analytics requirements
  • Good experience with Continuous Integration and Continuous Delivery (CI/CD) of applications using Bamboo
  • Technologies Used: Analytics, API, Athena, Azure, Azure Synapse Analytics, Blob Storage, Cassandra, CI/CD, Data Lake, Elasticsearch, ETL, Java, Jenkins, Kafka, PaaS, PySpark, Python, Scala, Snowflake, Spark, Spark Streaming, SQL

Lundbeck

09.2022 - 03.2024

Job overview

  • Company Overview: Lundbeck is a global pharmaceutical company specializing in brain diseases, focusing on innovative treatments for psychiatric and neurological disorders
  • Enhanced data pipelines for performance, scalability, and reliability, leveraging modern tools and frameworks such as Apache Airflow, Spark, and cloud-native services
  • Designed, developed, and implemented performant ETL pipelines using the Python API of Apache Spark
  • Created Lambda functions with Boto3 to deregister unused AMIs in all application regions, reducing costs for EC2 resources
  • Worked with the Hive data warehouse infrastructure: created tables, implemented partitioning and bucketing for data distribution, and wrote and optimized HQL queries
  • Imported real-time weblogs using Kafka as the messaging system, ingested the data into Spark Streaming, performed data quality checks, and flagged records as bad or passable
  • Responsible for estimating cluster size, monitoring, and troubleshooting the Spark Databricks cluster
  • Worked with AWS Terraform templates in maintaining the infrastructure as code
  • Involved in various phases of the Software Development Lifecycle (SDLC), including requirements gathering, design, development, deployment, and analysis of the application
  • Used Django Evolution and manual SQL modifications to modify Django models while retaining all data, with the site in production mode
  • Used Jira for ticketing and tracking issues and Jenkins for continuous integration and continuous deployment
  • Ensured data integrity and consistency during migration, resolving compatibility issues with T-SQL scripting
  • Dockerized applications by creating Docker images from Dockerfiles and collaborated with the development support team to set up a continuous deployment environment using Docker
  • Instantiated, created, and maintained CI/CD (continuous integration and deployment) pipelines and applied automation to environments and applications
  • Developed Kibana dashboards based on Logstash data and integrated different source and target systems into Elasticsearch for near-real-time log analysis and end-to-end transaction monitoring
  • Migrated the data from Amazon Redshift data warehouse to Snowflake for further financial reporting
  • Stored the log files in AWS S3
  • Enabled versioning on S3 buckets where highly sensitive information is stored
  • Technologies Used: APIs, AWS, CI/CD, Data Lake, Docker, EC2, Elasticsearch, ETL, HBase, Java, Jenkins, Jira, Kafka, Lambda, Python, Redshift, S3, Snowflake, Spark, Spark Streaming, SQL

Aon

11.2019 - 08.2022

Job overview

  • Company Overview: Aon is a global professional services firm providing risk management, insurance, retirement, and health consulting solutions
  • Designed and implemented efficient data models for BigQuery to support analytics and reporting needs
  • Optimized schemas for performance, scalability, and cost-efficiency
  • Extensively worked on Hive, creating numerous internal and external tables to meet analysis requirements
  • Involved in data validation and reporting using Power BI
  • Created Data Studio reports to review billing and service usage, optimizing queries and contributing to cost-saving measures
  • Worked on NoSQL databases such as HBase, integrated with PySpark for processing and persisting real-time streaming data
  • Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery
  • Created an Amazon VPC with a public-facing subnet for web servers with internet access and backend databases and application servers in a private subnet with no internet access
  • Developed an end-to-end solution that involved ingesting sales data from multiple sources, transforming and aggregating it using Azure Databricks, and visualizing insights through Tableau dashboards
  • Good knowledge of Cloud Shell for various tasks and for deploying services
  • Created batch and real time pipelines using Spark as the main processing framework
  • Used Python to write data into JSON files for testing Django websites and created scripts for data modelling and data import/export
  • Experienced with Google Cloud components, Google Container Builder, GCP client libraries, and the Cloud SDK
  • Managed large datasets using pandas DataFrames and SQL
  • Involved in developing data ingestion pipelines on Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL
  • Worked with Cosmos DB (SQL API and MongoDB API)
  • Monitored BigQuery, Dataproc, and Cloud Dataflow jobs via Stackdriver across all environments
  • Used Sqoop import/export to ingest raw data into Google Cloud Storage by spinning up Cloud Dataproc clusters
  • Used Google Cloud Dataflow with the Python SDK to deploy streaming and batch jobs in GCP for custom cleaning of text and JSON files and writing them to BigQuery
  • Involved in setting up the Apache Airflow service in GCP
  • Technologies Used: Airflow, Apache, API, Azure, BigQuery, Cosmos DB, Data Factory, GCP, HBase, HDInsight, JS, PySpark, Python, SDK, Spark, Spark SQL, SQL, Sqoop, Tableau, VPC

Nokia

05.2017 - 10.2019

Job overview

  • Company Overview: Nokia is a global technology company specializing in telecommunications, networking, and 5G infrastructure solutions
  • Implemented data validation and cleansing processes to maintain data accuracy and reliability
  • Monitored and resolved data pipeline errors to ensure seamless data flow
  • Used AWS Lambda to perform data validation, filtering, sorting, and other transformations on every data change in a database table and loaded the transformed data into another data store, AWS S3, for raw file storage
  • Worked on big data integration and analytics based on Hadoop, Solr, PySpark, Kafka, Storm, and webMethods
  • Worked on partitioning Kafka messages and setting up replication factors in the Kafka cluster
  • Created several Databricks Spark jobs with PySpark to perform table-to-table operations
  • Developed Spark applications for batch processing using PySpark
  • Involved in the entire project lifecycle, including design, development, deployment, testing, implementation, and support
  • Developed database triggers and stored procedures using T-SQL cursors and tables
  • Implemented Apache Airflow for workflow automation and task scheduling, and created DAGs and tasks
  • Built scalable data infrastructure on cloud platforms, such as AWS, using Kubernetes and Docker
  • Conducted query optimization and performance tuning tasks, such as query profiling, indexing, and utilizing Snowflake's automatic clustering to improve query response times and reduce costs
  • Created CI/CD pipelines with Jenkins and deployed applications on AWS EC2 using Docker containers
  • Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB
  • Technologies Used: AWS, CI/CD, Cluster, Data Factory, Docker, DynamoDB, EC2, EMR, ETL, Jenkins, Kafka, Kubernetes, Data Lake, Lambda, PySpark, S3, Snowflake, Spark, SQL, Sqoop, Storm

Education

Kakatiya University

Master of Science in Computer Applications Development
05.2008

Skills

  • AWS Services: S3, EC2, EMR, Redshift, RDS, Lambda, Kinesis, SNS, SQS, AMI, IAM, CloudFormation
  • Hadoop Components / Big Data: HDFS, Hue, MapReduce, Pig, Hive, HCatalog, HBase, Sqoop, Impala, ZooKeeper, Flume, Kafka, YARN, Cloudera Manager, Kerberos, PySpark, Airflow, Snowflake, Spark components
  • Databases: Oracle, Microsoft SQL Server, MySQL, DB2, Teradata
  • Programming Languages: Java, Scala, Impala, Python
  • Web Servers: Apache Tomcat, WebLogic
  • IDE: Eclipse, Dreamweaver
  • NoSQL Databases: HBase, Cassandra, MongoDB
  • Methodologies: Agile (Scrum), Waterfall, UML, Design Patterns, SDLC
  • Currently Exploring: Apache Flink, Drill, Tachyon
  • Cloud Services: AWS, Azure (Azure Data Factory, ETL/ELT/SSIS, Azure Data Lake Storage, Azure Databricks), GCP
  • Teamwork and collaboration
  • ETL Tools: Talend Open Studio & Talend Enterprise Platform
  • Reporting and ETL Tools: Tableau, Power BI, AWS Glue, SSIS, SSRS, Informatica, DataStage
  • Friendly, positive attitude

Timeline

Citi Bank
04.2024 - Current

Lundbeck
09.2022 - 03.2024

Aon
11.2019 - 08.2022

Nokia
05.2017 - 10.2019

Kakatiya University

Master of Science in Computer Applications Development