Challenges
- Needed consultation on evaluating tools and approaches for cloud adoption. The objective was to offload computing from the existing, outdated on-premises MapR cluster to the cloud.
- Needed a solution custom-built for their live data (largest module) for evaluation and decision-making.
- Needed an automated solution for resource configuration, deployment, scheduling, scalability, etc.
- Needed the ability to process incoming incremental data (10 TB or more) more efficiently.
Solutions
- Provided a cloud-optimized solution that spins up on demand for offloading computation, along with a Snowflake-based reporting solution.
- Every week, 5 TB or more of data is extracted from the on-premises MapR cluster and placed in S3 using shell scripts and the AWS CLI, executed by Airflow jobs.
- Based on the size of the data copied over, an AWS EMR cluster is spun up using CloudFormation templates and the AWS CLI to execute Spark and Pig scripts.
- After processing, the resulting data from EMR is pushed into S3 buckets for persistence.
- The AWS EMR cluster has auto-scaling enabled and is terminated once processing completes.
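The steps above can be sketched in Python as a set of helpers that an Airflow task could call. This is only an illustrative sketch, not the project's actual scripts: the sizing rule (`TB_PER_CORE_NODE`, `MIN_CORE_NODES`, `MAX_CORE_NODES`) and the `CoreInstanceCount` template parameter are hypothetical, since the case study only states that the cluster is sized based on data volume. The functions only build AWS CLI command lines; they do not run them.

```python
import math
from typing import List

# Hypothetical sizing rule (assumption): one core node per 0.5 TB of input,
# clamped to a fixed range. The real rule used in the project is not stated.
TB_PER_CORE_NODE = 0.5
MIN_CORE_NODES = 2
MAX_CORE_NODES = 40


def core_node_count(data_tb: float) -> int:
    """Pick an EMR core node count from the extracted data volume in TB."""
    if data_tb < 0:
        raise ValueError("data_tb must be non-negative")
    wanted = math.ceil(data_tb / TB_PER_CORE_NODE)
    return max(MIN_CORE_NODES, min(MAX_CORE_NODES, wanted))


def sync_to_s3_cmd(local_dir: str, s3_uri: str) -> List[str]:
    """Build the AWS CLI command that copies the weekly extract to S3."""
    return ["aws", "s3", "sync", local_dir, s3_uri]


def deploy_cluster_cmd(stack_name: str, template_file: str, data_tb: float) -> List[str]:
    """Build the CloudFormation command that spins up the EMR stack.

    CoreInstanceCount is a hypothetical parameter name; it must match
    whatever parameter the actual template declares.
    """
    return [
        "aws", "cloudformation", "deploy",
        "--stack-name", stack_name,
        "--template-file", template_file,
        "--parameter-overrides", f"CoreInstanceCount={core_node_count(data_tb)}",
    ]


def teardown_cluster_cmd(stack_name: str) -> List[str]:
    """Build the command that removes the stack once processing is done."""
    return ["aws", "cloudformation", "delete-stack", "--stack-name", stack_name]


if __name__ == "__main__":
    # Placeholder paths, bucket, and stack names for illustration only.
    for cmd in (
        sync_to_s3_cmd("/data/weekly_extract", "s3://example-bucket/incoming/"),
        deploy_cluster_cmd("emr-weekly", "emr-cluster.yaml", data_tb=5),
        teardown_cluster_cmd("emr-weekly"),
    ):
        print(" ".join(cmd))
```

Keeping the sizing decision in one small function makes it easy to tune the node count per run while the on-demand create and delete steps stay unchanged.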
Tools & Technologies
Amazon S3, Apache Pig, Apache Spark, AWS CloudFormation, Amazon EMR, MapR, Apache Airflow, Python, R, PowerShell, Snowflake, Bash
Key Benefits
- Provided a cost-efficient, on-demand solution for computation on the AWS platform.
- Added value by providing best-suited recommendations for resource type and configuration for a cost-efficient and optimal solution.
- Offloaded jobs that needed 48 hours on the on-premises server to the cloud, where they were processed within 24 hours.
