Many big organisations, including Shell, find themselves in a situation where their datasets live in disparate data sources and, more importantly, are extremely dirty and have never really been examined in order to build deep knowledge or to apply analytics that drive better-informed business decisions. This was the case for this project, which needed experienced engineers to take on those workloads while leveraging cloud capacity, so that similar future projects can cope regardless of data volume and scale.
Leveraging the power of Python, Apache Spark and JupyterLab, initial notebooks were built to handle the first steps: extracting data from various sources, including SQL databases and Azure Storage Accounts, applying business logic to it, and finally building proof-of-concept graphs with the matplotlib and plotly libraries.
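The sketch below shows what one of those early notebook cells might have looked like, assuming a PySpark session and a JDBC-reachable SQL database; the connection string, table and column names are illustrative placeholders rather than the actual sources.

```python
# Minimal sketch of an extract -> transform -> plot notebook cell.
# Connection details, table and column names are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F
import matplotlib.pyplot as plt

spark = SparkSession.builder.appName("poc-extraction").getOrCreate()

# Extract: read a table from a SQL database over JDBC (credentials are placeholders).
readings = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://example-host:1433;databaseName=assets")
    .option("dbtable", "dbo.sensor_readings")
    .option("user", "readonly_user")
    .option("password", "********")
    .load()
)

# Transform: illustrative business logic - drop invalid rows, aggregate per day.
daily = (
    readings
    .filter(F.col("value").isNotNull() & (F.col("value") >= 0))
    .groupBy(F.to_date("timestamp").alias("day"))
    .agg(F.avg("value").alias("avg_value"))
    .orderBy("day")
)

# Visualise: collect the small aggregate to the driver and build a PoC chart.
pdf = daily.toPandas()
plt.plot(pdf["day"], pdf["avg_value"])
plt.xlabel("day")
plt.ylabel("average reading")
plt.title("Daily average reading (PoC)")
plt.show()
```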
Through regular catch-up meetings with business stakeholders, the PoCs transitioned to a more controlled environment as new features were requested and eventually added. We moved the code fully out of JupyterLab into a versioned GitHub repository, built CI/CD on top of it with CircleCI to accommodate quick and accurate changes backed by unit and validation testing, and finally used Docker containers and Helm charts to ship our code to a managed Kubernetes cluster.
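To illustrate the testing side of that pipeline, below is a sketch of the kind of pytest-based unit test CircleCI could run against the factored-out business logic; `daily_average` and the sample data are hypothetical stand-ins, not the actual production code.

```python
# Sketch of a unit/validation test run in CI against the extracted business logic.
# `daily_average` is a hypothetical name for logic factored out of the notebooks.
import pytest
from pyspark.sql import SparkSession, functions as F


def daily_average(readings):
    """Drop null/negative readings and return the average value per day."""
    return (
        readings
        .filter(F.col("value").isNotNull() & (F.col("value") >= 0))
        .groupBy(F.to_date("timestamp").alias("day"))
        .agg(F.avg("value").alias("avg_value"))
    )


@pytest.fixture(scope="module")
def spark():
    # Small local session so the test runs inside the CI container.
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()


def test_daily_average_drops_invalid_rows(spark):
    readings = spark.createDataFrame(
        [("2020-01-01 10:00:00", 10.0),
         ("2020-01-01 12:00:00", 20.0),
         ("2020-01-01 13:00:00", -5.0),   # invalid, should be excluded
         ("2020-01-02 09:00:00", None)],  # invalid, should be excluded
        ["timestamp", "value"],
    )
    result = {r["day"].isoformat(): r["avg_value"]
              for r in daily_average(readings).collect()}
    assert result == {"2020-01-01": 15.0}
```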
A generic framework to extract data from the main databases was also built with reuse in mind, so that other engineers and data scientists could adopt it while keeping the DBAs' sanity intact, for example by never sending `SELECT *` queries to their databases. In this way, teams across different geographical zones are working together and will continue to learn from each other going forward.
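A minimal sketch of what such a reusable extraction helper could look like is below; the function name `read_table`, its parameters and the example connection string are illustrative assumptions, but it captures the key design choice of forcing callers to name their columns instead of defaulting to `SELECT *`.

```python
# Sketch of a re-usable JDBC extraction helper: callers must name the columns
# they need, so ad-hoc `SELECT *` queries never reach the source database.
# The helper name, parameters and connection string are illustrative assumptions.
from pyspark.sql import SparkSession


def read_table(spark, jdbc_url, table, columns, predicate=None, user=None, password=None):
    """Read an explicit set of columns from a table over JDBC."""
    if not columns:
        raise ValueError("An explicit, non-empty column list is required - no SELECT *.")

    # Build a narrow query so the footprint on the source database stays predictable.
    query = f"SELECT {', '.join(columns)} FROM {table}"
    if predicate:
        query += f" WHERE {predicate}"

    reader = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("query", query)
    )
    if user and password:
        reader = reader.option("user", user).option("password", password)
    return reader.load()


# Example usage inside a notebook or pipeline step (values are placeholders):
# spark = SparkSession.builder.getOrCreate()
# wells = read_table(
#     spark,
#     jdbc_url="jdbc:sqlserver://example-host:1433;databaseName=assets",
#     table="dbo.wells",
#     columns=["well_id", "region", "status"],
#     predicate="status = 'ACTIVE'",
# )
```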
Big Data Engineer & DevOps
2019 — 2020
Shell