Challenges
- The application hosts hundreds of thousands of datasets (free or paid) sourced from thousands of providers.
- Enable enterprise users, decision scientists, and data analysts to upload their organizational datasets.
- Facilitate joins between the uploaded transactional/non-transactional datasets and the publicly hosted datasets.
- These joins should execute within a few seconds for a seamless user experience.
Solutions
- Implemented a responsive, modern frontend app for data scientists using React and Redux.
- Designed and implemented all middle-tier services, including the APIs and the data access layer, in Python with Django.
- Wrote an ANSI SQL code generator in Python that takes all user selections into account, connects with the metadata system, and generates the final query that runs on Snowflake.
- Built search and recommendation systems on Neo4j that help users find features pertinent to their own uploaded datasets.
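
As an illustration of the SQL code generator described above, here is a minimal sketch of how user selections might be turned into an ANSI join query. It is not the project's actual implementation: the `JoinSelection` class, the `generate_join_query` function, the in-memory `metadata` dictionary (a stand-in for the real metadata system), and all table and column names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass

# Join types this sketch accepts (an assumption, not the project's list).
ALLOWED_JOINS = {"INNER", "LEFT", "RIGHT", "FULL"}


@dataclass(frozen=True)
class JoinSelection:
    """Hypothetical record of what the user picked in the UI."""
    uploaded_table: str                 # the user's uploaded dataset
    hosted_table: str                   # a publicly hosted dataset
    join_keys: list[tuple[str, str]]    # (uploaded column, hosted column)
    uploaded_columns: list[str]         # columns to keep from the upload
    hosted_columns: list[str]           # columns to keep from the hosted set
    join_type: str = "LEFT"


def quote_ident(name: str) -> str:
    """Quote an identifier ANSI-style, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def check_columns(table: str, columns: list[str],
                  metadata: dict[str, set[str]]) -> None:
    """Reject tables or columns that the metadata stand-in does not know."""
    if table not in metadata:
        raise ValueError(f"unknown table: {table}")
    missing = [c for c in columns if c not in metadata[table]]
    if missing:
        raise ValueError(f"unknown columns in {table}: {missing}")


def generate_join_query(sel: JoinSelection,
                        metadata: dict[str, set[str]]) -> str:
    """Build one SELECT ... JOIN statement from the user's selections."""
    join_type = sel.join_type.upper()
    if join_type not in ALLOWED_JOINS:
        raise ValueError(f"unsupported join type: {sel.join_type}")
    if not sel.join_keys:
        raise ValueError("at least one join key pair is required")

    left_keys = [left for left, _ in sel.join_keys]
    right_keys = [right for _, right in sel.join_keys]
    check_columns(sel.uploaded_table, sel.uploaded_columns + left_keys, metadata)
    check_columns(sel.hosted_table, sel.hosted_columns + right_keys, metadata)

    select_list = [f"u.{quote_ident(c)}" for c in sel.uploaded_columns]
    select_list += [f"h.{quote_ident(c)}" for c in sel.hosted_columns]
    if not select_list:
        raise ValueError("no output columns selected")

    on_clause = " AND ".join(
        f"u.{quote_ident(left)} = h.{quote_ident(right)}"
        for left, right in sel.join_keys
    )
    return (
        f"SELECT {', '.join(select_list)}\n"
        f"FROM {quote_ident(sel.uploaded_table)} AS u\n"
        f"{join_type} JOIN {quote_ident(sel.hosted_table)} AS h\n"
        f"ON {on_clause}"
    )


if __name__ == "__main__":
    # Hypothetical metadata: table name -> known column names.
    example_metadata = {
        "my_sales": {"zip", "revenue"},
        "census": {"zip_code", "median_income"},
    }
    example = JoinSelection(
        uploaded_table="my_sales",
        hosted_table="census",
        join_keys=[("zip", "zip_code")],
        uploaded_columns=["revenue"],
        hosted_columns=["median_income"],
    )
    print(generate_join_query(example, example_metadata))
```

Validating every name against metadata and quoting every identifier keeps user input out of the raw SQL text. Note that Snowflake treats double-quoted identifiers as case-sensitive, so a real generator would need to match the stored casing of table and column names.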
Tools & Technologies
NumPy, Django, Redux, React, AWS, Snowflake
Key benefits
- Snowflake runs complex joins between large datasets, including various math functions, within seconds, even when the output contains billions of rows.
- It automatically creates additional clusters as the number of concurrent queries grows with the workload.
- Data scientists can iterate on their models quickly and move towards higher accuracy, since they now spend significantly less time finding the most relevant features.
