Hello! I’m Sreenath Vemireddy.
I am a Senior Big Data and Azure Data Engineer with over 10 years of progressive experience in designing, building, and managing end-to-end data engineering solutions across diverse industries including finance, healthcare, insurance, banking, and geospatial mapping. My core strength lies in building scalable, high-performance data pipelines using cutting-edge technologies like Hadoop, Spark, PySpark, and Hive, and integrating them seamlessly with cloud platforms such as Microsoft Azure and AWS.
Professional Summary
I have a proven track record in delivering data migrations—moving terabytes of data from legacy systems to cloud environments—while ensuring data quality, lineage tracking, regulatory compliance, and performance optimization. I specialize in metadata-driven automation frameworks for schema validation, ingestion, transformation, audit logging, and alerting, significantly reducing manual effort and enhancing scalability.
My expertise includes tools like Azure Data Factory (ADF), Azure Blob Storage, Azure SQL Database, Azure Key Vault, and orchestration tools like Apache Airflow, Zena, and Autosys. I have implemented complex DAGs for ETL pipelines, automated failure handling, and ensured secure cloud integration with strict governance practices.
I’ve collaborated with top-tier global organizations such as DBS Bank, PayPal, ICICI Bank, AbbVie, Country Financial, Apple, and Nokia. My work has contributed to mission-critical systems including MAS 637 and PILLAR3 regulatory reporting, SAS exit transformations, CRM cloud migrations, HANA-to-Hadoop reporting, and spatial data integration for Apple and Nokia Maps.
I am well-versed in data governance practices, using tools like Collibra to manage metadata and ensure compliance. I regularly optimize performance of Spark jobs, Hive queries, and ADF pipelines and have utilized Power BI, Adobe Analytics, and Apache Superset for data visualization and reporting.
My approach is guided by Agile methodologies and DevOps principles, including CI/CD automation with Git and Azure DevOps. I’ve also taken on leadership roles, mentoring junior engineers, conducting code reviews, and guiding teams in the delivery of scalable and robust solutions.
I am passionate about transforming raw data into actionable intelligence and continuously strive to enhance data value, accessibility, and governance across organizations. For me, data engineering is not just about infrastructure—it's about enabling smarter, data-driven decisions that drive business success.
I am currently serving as a Cloud Data Engineer at Country Financial, where I lead end-to-end Azure-based data migration projects integrating Databricks, Delta Lake, and metadata-driven automation. I’ve also championed the integration of Collibra with enterprise pipelines to automate lineage and data quality profiling for regulatory compliance.
My Skills
- Big Data: Hadoop, HDFS, Spark, PySpark, Spark SQL, Hive, Impala, Presto, MapReduce
- Cloud: Microsoft Azure (Data Factory, Data Lake Storage Gen2, SQL Database, Blob Storage, Key Vault, Logic Apps, DevOps), Databricks, Delta Lake, AWS S3
- Orchestration and scheduling: Apache Airflow, Zena, Autosys
- Data governance: Collibra (metadata, lineage, data quality)
- Visualization and analytics: Power BI, Apache Superset, Adobe Analytics
- Databases: Oracle, SQL Server, SAP HANA, MongoDB, Vertica
- Languages and scripting: Python, SQL, Scala, Shell Scripting
- Practices: Agile, CI/CD with Git and Azure DevOps
Latest Projects
Comm-Agg End-to-End Data Migration Solution
Description: Designed and developed a metadata‑driven, end‑to‑end data migration and analytics platform to ingest data from Guidewire S3 into Azure SQL, Data Lake, and Databricks Delta Lake, enabling scalable analytics, historical tracking, and regulatory compliance.
- Designed and implemented a metadata-driven Azure Data Factory (ADF) pipeline framework for ingestion from Guidewire S3 buckets to Azure SQL, Data Lake, and Databricks Delta Lake, supporting dynamic schema evolution and schema validation.
- Built modular PySpark transformation jobs in Databricks using Delta Lake, implementing SCD Type 2 logic, historical tracking, and late-arriving dimension support (the merge pattern is sketched after the technology list below).
- Developed parameterized and conditional orchestration using Zena scheduler, enabling seamless pipeline integration based on metadata inputs and business rules.
- Integrated Collibra APIs for automated metadata registration, lineage capture, and data quality profiling of ingested datasets, supporting regulatory compliance.
- Developed comprehensive audit logging, alerting, and reconciliation layers using PySpark, ADF logging tables, and automated email notifications to ensure data integrity and compliance.
- Optimized Spark job performance for large Guidewire datasets (~20TB) by tuning shuffle partitions, caching intermediate results, and restructuring complex join operations.
- Conducted thorough end-to-end unit testing and data validation using PySpark and SQL, reconciling with legacy Oracle systems to ensure fidelity and completeness of migrated data.
- Mentored and guided junior engineers, conducted code reviews, and led weekly sprint demos with product owners and business SMEs to ensure alignment and quality delivery.
- Enabled business users to visualize key insurance metrics by building Power BI dashboards consuming curated Delta Lake tables, improving data accessibility and decision making.
- Automated schema drift handling and late-arriving data management, reducing manual intervention and ensuring continuous pipeline robustness.
Technologies Used: Azure Data Factory (ADF), Azure Data Lake Storage Gen2, Delta Lake, Databricks, Blob Storage, Azure SQL, Python, Hadoop, Hive, Impala, Zena, Shell Scripting
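For illustration, the snippet below is a minimal sketch of the SCD Type 2 merge pattern used in the Databricks layer. It relies on Delta Lake's Python merge API; the table name (curated.policy_dim), key and hash columns (policy_id, policy_hash), and paths are hypothetical stand-ins for the production schema, and the real framework drives these steps from metadata.

```python
# Minimal SCD Type 2 sketch with Delta Lake (illustrative only;
# table, column, and path names are hypothetical).
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Staged Guidewire extract, assumed to hold one row per policy_id.
incoming = (
    spark.read.format("parquet").load("/mnt/raw/guidewire/policy/")
    .withColumn("policy_hash", F.sha2(F.concat_ws("|", "policy_status", "premium_amount"), 256))
    .withColumn("effective_from", F.current_date())
    .withColumn("effective_to", F.lit(None).cast("date"))
    .withColumn("is_current", F.lit(True))
)

# Target dimension table, assumed to share the same columns.
target = DeltaTable.forName(spark, "curated.policy_dim")

# Step 1: close out current rows whose tracked attributes changed.
(
    target.alias("t")
    .merge(incoming.alias("s"), "t.policy_id = s.policy_id AND t.is_current = true")
    .whenMatchedUpdate(
        condition="t.policy_hash <> s.policy_hash",
        set={"is_current": "false", "effective_to": "current_date()"},
    )
    .execute()
)

# Step 2: append new keys and new versions as current rows.
still_current = (
    spark.table("curated.policy_dim")
    .where("is_current = true")
    .select("policy_id", "policy_hash")
)
new_versions = incoming.join(still_current, ["policy_id", "policy_hash"], "left_anti")
new_versions.write.format("delta").mode("append").saveAsTable("curated.policy_dim")
```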
MAS 637 Reporting
Description: The MAS 637 project involved generating reports for the Monetary Authority of Singapore (MAS) to ensure regulatory compliance. The solution analyzed pool statuses across various user types and tracked changes over a 12‑month period.
- Designed and developed regulatory reporting pipelines for MAS 637 leveraging PySpark within the ADA (Automated Data Analytics) framework to automate ingestion, transformation, and aggregation of large-scale financial datasets.
- Integrated end-to-end data lineage and data quality controls using Collibra and custom Python validation scripts, enabling traceability and compliance for all reportable data elements.
- Collaborated with risk, compliance, and business teams to define regulatory requirements, ensuring accurate and timely submission of MAS 637 reports to Singapore authorities.
- Optimized Presto and Hive queries to efficiently process high-volume transactional data, reducing reporting cycle times and improving query performance.
- Provided insights into financial transactions and default status by integrating multiple internal data sources, improving transparency for auditors and stakeholders.
- Delivered high‑quality, compliant reports to the Monetary Authority of Singapore, ensuring all data and analysis met strict regulatory standards.
- Developed automated alerting mechanisms for SLA breaches and compliance exceptions using Airflow DAGs and email triggers, ensuring proactive remediation and transparency (an illustrative DAG is sketched after the technology list below).
- Documented all data flows, transformations, and control points to support internal and external audits for regulatory purposes.
Technologies Used: ADA (In‑house framework), PySpark, Spark SQL, Presto, Hive, Hadoop, Airflow, Jupyter, Collibra, Python, Shell Scripting
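The sketch below illustrates the Airflow-based SLA monitoring and email alerting pattern referenced above. It is a simplified stand-in, not the ADA framework itself; the DAG id, task commands, schedule, and alert address are all hypothetical.

```python
# Illustrative Airflow DAG with SLA monitoring and email alerts
# (DAG id, commands, schedule, and addresses are hypothetical).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "email": ["reg-reporting-alerts@example.com"],
    "email_on_failure": True,          # alert the team when a task fails
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
}

def sla_miss_alert(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Called by Airflow when any task misses its SLA; in practice this
    # could page on-call support or raise a compliance exception ticket.
    print(f"SLA missed for tasks: {task_list}")

with DAG(
    dag_id="mas637_reporting",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",     # daily regulatory batch window
    default_args=default_args,
    sla_miss_callback=sla_miss_alert,
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract_source_data",
        bash_command="spark-submit extract_mas637.py",
        sla=timedelta(hours=2),        # breach triggers sla_miss_alert
    )
    aggregate = BashOperator(
        task_id="aggregate_and_publish",
        bash_command="spark-submit aggregate_mas637.py",
        sla=timedelta(hours=4),
    )
    extract >> aggregate
```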
PILLAR3 Risk Analytics
Description: PILLAR3 reporting focused on risk management, using predictive analytics to evaluate user behavior and default risks based on historical data.
- Developed advanced analytics and dashboards for PILLAR3 risk exposure using PySpark, Presto, and Apache Superset, enabling real-time risk monitoring for senior management.
- Designed and implemented historical snapshot logic for 12-month risk trend analysis, supporting regulatory and business reporting requirements (sketched after the technology list below).
- Enabled interactive drill-down and slice-and-dice capabilities for senior management and regulatory auditors, improving transparency and decision support.
- Implemented robust data masking, row-level security, and access controls to ensure strict adherence to data privacy and security guidelines.
- Collaborated with data governance teams to define and implement Collibra data quality (DQ) rules and lineage modeling for all critical risk datasets.
- Automated validation and exception handling workflows, reducing manual effort and increasing data reliability for risk analytics.
- Applied predictive analytics techniques to evaluate potential default risks, improving the bank’s overall risk‑management strategy.
Technologies Used: PySpark, Spark SQL, Presto, Apache Superset, Hive, Collibra, Airflow
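As a rough illustration of the snapshot-based trend analysis above, the sketch below stamps the current exposure table with a month, appends it to a partitioned history table, and aggregates the last 12 snapshots for the dashboards. Table and column names (risk.exposure_current, customer_segment, exposure_amount, default_probability) are hypothetical.

```python
# Sketch of monthly snapshot capture plus a rolling 12-month risk trend
# (all table and column names are hypothetical).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# 1. Stamp the current exposure data with the snapshot month and append it
#    to the partitioned history table.
snapshot = (
    spark.table("risk.exposure_current")
    .withColumn("snapshot_month", F.trunc(F.current_date(), "month"))
)
(snapshot.write.mode("append")
 .partitionBy("snapshot_month")
 .saveAsTable("risk.exposure_snapshots"))

# 2. Aggregate the latest 12 monthly snapshots per customer segment.
history = spark.table("risk.exposure_snapshots")
trend_12m = (
    history
    .where(F.col("snapshot_month") >= F.add_months(F.trunc(F.current_date(), "month"), -11))
    .groupBy("snapshot_month", "customer_segment")
    .agg(
        F.sum("exposure_amount").alias("total_exposure"),
        F.avg("default_probability").alias("avg_pd"),
    )
    .orderBy("snapshot_month", "customer_segment")
)
trend_12m.write.mode("overwrite").saveAsTable("risk.exposure_trend_12m")
```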
SAS Exit
Description: The SAS Exit project was designed to migrate legacy SAS scripts into DBS Bank’s ADA platform. The migration converted complex SAS reports into PySpark‑based solutions, improving performance and scalability of the reporting system.
- Reverse-engineered and migrated 100+ legacy SAS reporting scripts and ETL workflows to PySpark and Spark SQL within the ADA framework, ensuring full functional parity and improved scalability (a representative conversion is sketched after the technology list below).
- Refactored monolithic SAS jobs into modular, reusable Spark components integrated with Airflow orchestration for automated scheduling and monitoring.
- Developed comprehensive validation, reconciliation, and audit scripts to ensure data accuracy and completeness post-migration, supporting regulatory and business reporting requirements.
- Reduced end-to-end report processing time by 50% and eliminated recurring SAS licensing costs, delivering significant cost savings for the organization.
- Provided detailed documentation, knowledge transfer, and hands-on training to analytics teams and business users on the new PySpark-based reporting framework.
- Collaborated with compliance and data‑governance teams to define domain‑specific policies, ensuring consistent rule enforcement across systems.
- Integrated Collibra with bank‑wide data pipelines to automate lineage tracking and data‑quality profiling.
- Enabled dynamic parameterization and metadata-driven job execution, increasing flexibility and reducing manual intervention.
Technologies Used: SAS, ADA (In-house framework), PySpark, Airflow, Collibra, Python, Spark SQL, Presto, Hive, Hadoop
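To make the migration pattern concrete, here is a hypothetical example of the kind of conversion involved: a simple SAS PROC SQL summary rewritten as Spark SQL inside a PySpark job. The dataset and column names are invented for illustration and are not actual bank code.

```python
# Illustrative SAS-to-PySpark conversion (names are hypothetical).
#
# Original SAS-style logic being replaced:
#   proc sql;
#     create table work.loan_summary as
#     select branch_id, product_code,
#            sum(outstanding_amt) as total_outstanding,
#            count(*) as loan_count
#     from work.loan_book
#     where status = 'ACTIVE'
#     group by branch_id, product_code;
#   quit;
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

loan_summary = spark.sql("""
    SELECT branch_id,
           product_code,
           SUM(outstanding_amt) AS total_outstanding,
           COUNT(*)             AS loan_count
    FROM   staging.loan_book
    WHERE  status = 'ACTIVE'
    GROUP  BY branch_id, product_code
""")

# Persist where the Airflow-scheduled downstream reports expect it.
loan_summary.write.mode("overwrite").saveAsTable("reports.loan_summary")
```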
CRM Data Migration
Description: This project involved migrating CRM data from multiple on‑premises and cloud environments to an Azure‑based platform with the goal of ensuring smooth data transfer while generating insightful business reports.
- Designed and implemented Azure Data Factory (ADF) pipelines to extract, transform, and load CRM data from diverse on-premises (Oracle, SQL Server) and cloud sources into Azure SQL, Blob Storage, and Data Lake.
- Integrated Azure Key Vault for secure credential management, and implemented automated error handling and notification workflows using Azure Logic Apps.
- Developed robust schema validation, data profiling, and reconciliation scripts to ensure data quality, consistency, and completeness throughout the migration process (a reconciliation sketch follows the technology list below).
- Built interactive Power BI dashboards and custom SQL reports to enable business stakeholders to visualize CRM metrics, trends, and KPIs in real time.
- Coordinated with cross-functional business, IT, and QA teams to ensure zero data loss, minimal downtime, and seamless business continuity during cutover and migration phases.
- Documented all migration processes, controls, and data mappings, supporting compliance and audit requirements.
Technologies Used: Azure Data Factory (ADF), Azure Blob Storage, Azure SQL Database, Oracle, Azure Key Vault, Vertica, Azure Logic Apps
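The sketch below illustrates the reconciliation idea from this project: pull credentials from Azure Key Vault, read the same table from the Oracle source and the Azure SQL target over JDBC, and compare row counts plus a simple checksum. The vault, connection, table, and column names are hypothetical, and the real pipelines performed far richer profiling.

```python
# Illustrative source-vs-target reconciliation (names and URLs are hypothetical).
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Credentials resolved at runtime from Key Vault rather than stored in code.
kv = SecretClient(
    vault_url="https://crm-migration-kv.vault.azure.net",
    credential=DefaultAzureCredential(),
)
src_pwd = kv.get_secret("oracle-crm-password").value
tgt_pwd = kv.get_secret("azuresql-crm-password").value

def read_jdbc(url, table, user, password):
    return (
        spark.read.format("jdbc")
        .option("url", url)
        .option("dbtable", table)
        .option("user", user)
        .option("password", password)
        .load()
    )

source = read_jdbc("jdbc:oracle:thin:@//onprem-host:1521/CRM", "CRM.ACCOUNTS", "crm_reader", src_pwd)
target = read_jdbc("jdbc:sqlserver://crm-sql.database.windows.net;database=crm", "dbo.ACCOUNTS", "crm_loader", tgt_pwd)

def profile(df):
    # Row count plus a crude content checksum over key columns.
    return df.agg(
        F.count(F.lit(1)).alias("row_count"),
        F.sum(F.crc32(F.concat_ws("|", "account_id", "balance"))).alias("checksum"),
    ).collect()[0]

src_stats, tgt_stats = profile(source), profile(target)
if (src_stats["row_count"], src_stats["checksum"]) != (tgt_stats["row_count"], tgt_stats["checksum"]):
    raise ValueError(f"Reconciliation mismatch: source={src_stats}, target={tgt_stats}")
```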
H2H (HANA to Hadoop)
Description: The H2H project involved migrating SAP HANA reports to a Hadoop‑based environment, ensuring improved performance and scalability for reporting and analytics.
- Developed scalable PySpark-based ETL pipelines to replicate, transform, and load SAP HANA reporting workloads into Hadoop, Hive, and MongoDB environments, enabling advanced analytics and scalability (one replication job is sketched after the technology list below).
- Built complex SQL queries for advanced analytics and integrated Python scripts to validate, clean, and transform data prior to ingestion.
- Optimized Spark and Hive queries for large-scale reporting, achieving over 60% reduction in query latency and improving overall system performance.
- Designed and implemented comprehensive data reconciliation, validation, and audit frameworks to ensure data consistency, integrity, and completeness post-migration.
- Automated job orchestration, monitoring, and alerting using custom PayPal frameworks and shell scripting, reducing manual intervention and improving reliability.
- Supported business intelligence and analytics teams in building new dashboards, reports, and self-service analytics on migrated data, driving business value and insights.
- Documented all migration logic, controls, and exception handling procedures for operational transparency and audit readiness.
Technologies Used: PySpark, HDFS, Hive, MongoDB, Git, Shell Scripting, Custom PayPal Frameworks
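The sketch below shows what one H2H replication job looks like in outline: read a HANA reporting table over JDBC, apply light cleansing, and land it as a partitioned Hive table. Connection details, table names, and columns are hypothetical, and the real jobs ran inside PayPal's internal orchestration frameworks.

```python
# Minimal H2H replication sketch: SAP HANA -> Hive (names are hypothetical;
# the SAP HANA JDBC driver jar must be on the Spark classpath).
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("h2h_sales_daily")
    .enableHiveSupport()
    .getOrCreate()
)

sales = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sap://hana-host:30015")
    .option("dbtable", "REPORTING.SALES_DAILY")
    .option("user", "h2h_reader")
    .option("password", "***")          # injected at runtime, never hard-coded
    .load()
)

cleaned = (
    sales
    .withColumn("sale_date", F.to_date("sale_date"))
    .dropDuplicates(["order_id"])
    .filter(F.col("net_amount").isNotNull())
)

(cleaned.write.mode("overwrite")
 .format("parquet")
 .partitionBy("sale_date")
 .saveAsTable("reporting.sales_daily"))
```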
IAP (Integrated Analytics Platform)
Description: The IAP project centralized patient data from multiple applications into Hadoop, processing and analyzing patient activity data to help healthcare providers make informed decisions.
- Developed ingestion pipelines for clickstream, Adobe Analytics, and transactional patient data using Apache Spark, Hive, and Hadoop, centralizing data from multiple healthcare applications.
- Developed Spark applications to analyze patient activity data, providing a comprehensive view of patient behavior and health outcomes (a sessionization sketch follows the technology list below).
- Designed and implemented scalable ETL processes to aggregate, transform, and enrich patient activity data for advanced downstream analytics and reporting.
- Automated data validation, error handling, and job scheduling using Autosys JIL scripting and shell scripting, improving reliability and operational efficiency.
- Collaborated with healthcare analysts and business users to design custom reports, dashboards, and analytics solutions for actionable patient insights.
- Ensured HIPAA compliance and data privacy through secure data handling, access controls, and audit logging across all ETL workflows.
- Documented all data flows, validation checks, and security controls to support compliance and audit requirements.
Technologies Used: Clickstream Data, Adobe Analytics, Apache Spark, Hadoop, Hive, Scala, Impala, Shell Scripting, Autosys
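As an illustration of the patient-activity analysis described above, the sketch below sessionizes clickstream events using a 30-minute inactivity window and derives simple per-session metrics. Table and column names are hypothetical, and identifiers are assumed to be already pseudonymized in line with HIPAA controls.

```python
# Illustrative clickstream sessionization (names are hypothetical;
# patient_id is assumed to be a pseudonymized identifier).
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.table("raw.patient_clickstream")

w = Window.partitionBy("patient_id").orderBy("event_ts")

sessions = (
    events
    .withColumn("prev_ts", F.lag("event_ts").over(w))
    .withColumn(
        "new_session",
        (F.col("prev_ts").isNull()
         | (F.col("event_ts").cast("long") - F.col("prev_ts").cast("long") > 30 * 60)
        ).cast("int"),
    )
    .withColumn("session_seq", F.sum("new_session").over(w))   # cumulative count of session starts
    .withColumn("session_id", F.concat_ws("-", "patient_id", "session_seq"))
)

# Per-session activity metrics for downstream reporting.
session_metrics = sessions.groupBy("patient_id", "session_id").agg(
    F.count(F.lit(1)).alias("event_count"),
    F.min("event_ts").alias("session_start"),
    F.max("event_ts").alias("session_end"),
)
session_metrics.write.mode("overwrite").saveAsTable("curated.patient_sessions")
```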
SDC (Spatial Data Collaborator)
Description: This project involved working with geospatial data sources to support Apple Maps, processing and integrating data from multiple sources to create accurate and up-to-date maps.
- Designed and implemented Hadoop-based ETL workflows to ingest, process, and harmonize large-scale geospatial data from USGS, TIGER, and various third-party sources for Apple Maps enhancements.
- Developed MapReduce and Hive jobs for spatial data transformation, enrichment, validation, and quality checks, supporting advanced mapping features and analytics.
- Automated data validation, reconciliation, and update processes to ensure accuracy, currency, and reliability of mapping datasets (a coordinate-quality sketch follows the technology list below).
- Collaborated with Apple Maps engineering and product teams to deliver timely data updates, support new map features, and resolve data quality issues.
- Improved data processing throughput and reduced manual intervention by implementing robust shell scripting automation and workflow monitoring.
- Documented ETL logic, data flows, and validation frameworks for operational transparency and knowledge sharing.
Technologies Used: Hadoop, Hive, HDFS, Python, Shell Scripting, MapReduce, USGS, TIGER Data Sources
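The production validation here ran as Hive and MapReduce jobs; the snippet below is just a compact PySpark stand-in for the kind of coordinate-quality check involved, with hypothetical table names, columns, and thresholds.

```python
# Illustrative geospatial quality check: keep points with plausible
# coordinates, quarantine the rest (names are hypothetical).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

points = spark.table("staging.usgs_points")

validated = points.withColumn(
    "is_valid",
    F.col("latitude").between(-90.0, 90.0)
    & F.col("longitude").between(-180.0, 180.0)
    & F.col("feature_name").isNotNull(),
)

validated.filter("is_valid").drop("is_valid").write.mode("overwrite").saveAsTable("curated.usgs_points")
validated.filter("NOT is_valid").write.mode("overwrite").saveAsTable("quarantine.usgs_points")

# Simple counts surfaced to the monitoring layer for reconciliation.
validated.groupBy("is_valid").count().show()
```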
Mapping Data Integration
Description: This project focused on integrating Nokia's geospatial data into Hadoop for large‑scale analytics, enabling advanced mapping and navigation solutions.
- Developed robust ingestion and transformation pipelines using Spark, Hive, and HDFS for large-scale integration of Nokia’s geospatial and mapping datasets into Hadoop-based analytics environments.
- Automated data quality validation, enrichment, and cleansing processes for spatial data, improving accuracy and usability for downstream analytics.
- Supported analytics and engineering teams in building advanced navigation, routing, and mapping solutions leveraging Hadoop-based data lakes.
- Optimized ETL jobs for high throughput, scalability, and reliability using Python, shell scripting, and Spark performance tuning techniques (sketched after the technology list below).
- Coordinated with Nokia’s global data teams to ensure timely data integration, updates, and resolution of data quality issues.
- Documented all data integration logic, validation workflows, and operational procedures for ongoing maintenance and audit readiness.
Technologies Used: Hadoop, Hive, HDFS, Python, Shell, Spark
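The sketch below shows the flavor of Spark tuning applied to these jobs: sizing shuffle partitions, broadcasting a small lookup table to avoid a shuffle join, and caching a reused intermediate. Table names, the partition count, and the join key are hypothetical.

```python
# Illustrative Spark tuning for a mapping ETL job (names and settings
# are hypothetical and would be sized against actual data volumes).
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("mapping_integration")
    .config("spark.sql.shuffle.partitions", "400")   # sized for the shuffle volume
    .enableHiveSupport()
    .getOrCreate()
)

road_segments = spark.table("staging.road_segments")     # large table
country_codes = spark.table("reference.country_codes")   # small lookup

enriched = (
    road_segments
    .join(F.broadcast(country_codes), "country_id")       # broadcast join avoids a shuffle
    .repartition("region_id")                             # co-locate output by write partition
)

enriched.cache()   # reused by both outputs below

enriched.write.mode("overwrite").partitionBy("region_id").saveAsTable("curated.road_segments")
enriched.groupBy("region_id").count().write.mode("overwrite").saveAsTable("curated.segment_counts")
```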