While the client's existing system could ingest and store time-series data, it lacked a real-time stream processing capability. A stream transformation engine would allow the client to understand, in real time, what is happening across the sensors in their physical infrastructure. In addition, the client had no proper version control, CI/CD or infrastructure-as-code implementation. Having grown out of a proof of concept, the system was ready to benefit from these improvements so the client could adapt and evolve more quickly, safely and accurately.
The client's existing tech stack was built on Microsoft Azure. To meet the need for near-real-time stream processing, Spark Structured Streaming on Databricks was employed. We wrote PySpark code that connects to and ingests data from Azure Cosmos DB containers, performs transformations, aggregations and joins on the fly, and writes the results to new containers in near real time. Time windows are configurable in Spark Structured Streaming, and we achieved latencies as low as 100 milliseconds. In addition to Azure Cosmos DB, we built connections to Azure SQL Databases, both for reading and for writing results in near real time. To manage Cosmos DB throughput as cost-effectively as possible, we implemented a set of Azure Functions that adjust the provisioned throughput of each container and integrated them with the existing Azure Data Factory ETL pipelines. With these changes in place, the client can now understand the operation of the many sensors feeding their system in far more detail and make better decisions based on it.
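The following is a minimal sketch of the streaming pattern described above, assuming the Azure Cosmos DB Spark 3 connector ("cosmos.oltp" / "cosmos.oltp.changeFeed"); the account endpoint, key, database, container names, schema and checkpoint path are illustrative placeholders rather than the client's actual configuration.

```python
# Minimal sketch: stream the change feed of one Cosmos DB container, aggregate per sensor
# over a time window, and write the results to a second container in near real time.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("sensor-streaming").getOrCreate()

cosmos_cfg = {
    "spark.cosmos.accountEndpoint": "https://<cosmos-account>.documents.azure.com:443/",
    "spark.cosmos.accountKey": "<account-key>",  # in practice resolved from Azure Key Vault
    "spark.cosmos.database": "telemetry",
}

sensor_schema = StructType([
    StructField("id", StringType()),
    StructField("sensorId", StringType()),
    StructField("value", DoubleType()),
    StructField("eventTime", TimestampType()),
])

# Ingest the change feed of the raw readings container as a stream.
raw = (spark.readStream
       .format("cosmos.oltp.changeFeed")
       .options(**cosmos_cfg)
       .option("spark.cosmos.container", "raw-readings")
       .option("spark.cosmos.changeFeed.startFrom", "Beginning")
       .schema(sensor_schema)
       .load())

# Windowed aggregation per sensor; the window size is configurable, as noted above.
aggregated = (raw
              .withWatermark("eventTime", "1 minute")
              .groupBy(F.window("eventTime", "10 seconds"), F.col("sensorId"))
              .agg(F.avg("value").alias("avgValue"))
              .select(
                  F.concat_ws("-", "sensorId", F.col("window.start").cast("string")).alias("id"),
                  F.col("window.start").alias("windowStart"),
                  "sensorId",
                  "avgValue"))

# Write the aggregates to a second container, triggering micro-batches every 100 ms.
query = (aggregated.writeStream
         .format("cosmos.oltp")
         .options(**cosmos_cfg)
         .option("spark.cosmos.container", "aggregated-readings")
         .option("checkpointLocation", "/mnt/checkpoints/aggregated-readings")
         .outputMode("append")
         .trigger(processingTime="100 milliseconds")
         .start())
```

For the throughput management piece, one possible shape is an HTTP-triggered Azure Function that Azure Data Factory calls before and after heavy loads. The sketch below is hypothetical and uses the azure-cosmos Python SDK; the environment variable and query parameter names are assumptions.

```python
# Hypothetical sketch of an HTTP-triggered Azure Function (Python) that rescales a
# container's provisioned throughput; environment variable and parameter names are assumed.
import os

import azure.functions as func
from azure.cosmos import CosmosClient


def main(req: func.HttpRequest) -> func.HttpResponse:
    database_name = req.params.get("database")
    container_name = req.params.get("container")
    throughput = int(req.params.get("throughput", "400"))

    client = CosmosClient(os.environ["COSMOS_ENDPOINT"], os.environ["COSMOS_KEY"])
    container = (client.get_database_client(database_name)
                       .get_container_client(container_name))

    # Replace the provisioned RU/s; Data Factory invokes this around its ETL runs.
    container.replace_throughput(throughput)

    return func.HttpResponse(
        f"Throughput for '{container_name}' set to {throughput} RU/s.",
        status_code=200,
    )
```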
Terraform was used to script the entire infrastructure of Azure managed services the client relied on, including Azure SQL Server, Cosmos DB, Function Apps and Databricks. To govern this and all of the client's existing codebases, a new Azure DevOps organisation was created: each component lives in its own Azure Repo with detailed unit testing, artifacts are published to Azure Artifacts, and Azure Pipelines is used together with Azure Key Vault for secrets management and Azure Active Directory for user permissions and service principals. Changes can now be made quickly in the client's day-to-day operations, with the status of every operation captured, and the wider team has gained the confidence to fully exploit the power of the cloud and best practices for data and cloud engineering.
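As an illustration of the unit testing that Azure Pipelines runs for each repo, here is a hypothetical pytest sketch against a local SparkSession; the transformation and column names are invented for the example rather than taken from the client's codebase.

```python
# Hypothetical unit test for a PySpark transformation, run locally and in Azure Pipelines.
import pytest
from pyspark.sql import SparkSession, functions as F


def average_per_sensor(df):
    """Illustrative transformation: mean reading per sensor."""
    return df.groupBy("sensorId").agg(F.avg("value").alias("avgValue"))


@pytest.fixture(scope="session")
def spark():
    # Small local session is enough for transformation-level tests.
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()


def test_average_per_sensor(spark):
    df = spark.createDataFrame(
        [("s1", 1.0), ("s1", 3.0), ("s2", 5.0)],
        ["sensorId", "value"],
    )
    result = {row["sensorId"]: row["avgValue"] for row in average_per_sensor(df).collect()}
    assert result == {"s1": 2.0, "s2": 5.0}
```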
Big Data and Cloud Engineer
2020 — 2021
Saint-Gobain