Table of Contents
What is DataOps?
Enter Data Pipelines
The Tools and Technologies
Putting it all together with a DataOps approach
Adoption and Growth
Call to Action
Data is a big deal
Talk about data being the new oil, or rather the new renewable energy source (to be ‘greener’ with our metaphors here). For an undisputed proof-point of this prediction becoming truer with each passing year, we need to look no further than the recent developments in the corporate world.
Technologically, underneath this huge merger in the world of the data economy, that of S&P Global and IHS Markit, we see two data platforms coming together under one umbrella – Kensho, which S&P Global acquired in 2018, and the IHS Markit Data Lake.
Kensho is an AI-driven research platform specializing in Deep Learning (DL) speech recognition, advanced visualizations, entity recognition, and state-of-the-art search. Kensho’s products power S&P Global and deliver innovative solutions and capabilities to their clients.
IHS Markit Data Lake enables both data monetization and data management. It has vast data assets grouped into a single cataloged platform. The Cloud-based platform stores, catalogs, and governs access to structured and unstructured data. The Data Lake includes access to over 1,000 proprietary data assets, which will be expanded over time, as well as a technology platform allowing clients to manage their own data.
‘Well, that sounds great, but it is just one data point from the financial services sector!’ one might say.
Data is the #1 priority for CFOs going forward
Fair enough! But it seems that executives across all verticals are, and will be, equally enamored with data. If we look at the top priorities within enterprise IT for 2021, data transformation is the highest based on a CFO survey conducted by Gartner on the fiscal readjustments needed in the post-pandemic world. In 2021 and beyond, this new emphasis on data will manifest itself in large-scale and enterprise-wide data transformation efforts aimed at creating key business differentiators such as improved user experience (UX), real-time and self-service analytics, insights powered by AI, including Machine Learning (ML) and DL, and productivity gains from intelligent automation including Robotic Process Automation (RPA).
Before we proceed further, we need to make a significant clarification. We need to make sure that we are all working with the same interpretation of the term ‘data’. The data that we will be talking about here is not any small data. Instead, it is Big Data, of which the transactional and structured data from systems of record will only be a tiny subset. Big Data, just to re-emphasize, is often characterized by the famous 4 Vs – as in higher Volumes, Variety, Velocity, and Value.
In support of the new priorities, enterprises will continue to guide their businesses to be more data-influenced, as opposed to being just data-aware. Being data-influenced blends quantitative data with qualitative data, leveraging the power of human intuition that will be aided by various forms of AI which are infused with Big Data, and especially AI of the explainable kind, one which gives recommended actions along with the rationale as to why – all in near real-time to real-time.
All these years, Big Data has focused on the generation, transformation, and storage of data using Data Integration, Data Lakes or Data Marts. Very little attention has been given to the consumption of data by teams working on it. With organizations becoming more data-influenced, access to data that is curated and compliant becomes paramount. Providing this data more promptly, enabling collaboration between business and data teams, is where DataOps begins to play a more significant role.
What is DataOps?
DataOps – Streamlining and automating steps needed for rapid transformation of ‘Data’ into ‘Business Value’.
Talking about being more data-influenced brings us to another topic of importance: the automation of data life cycle activities. Of late, the suffix ‘Ops’ has become a fashionable way to refer to such automation efforts across a wide range of industries. We have been adding it to many IT (and other enterprise) workstreams; the most popular is DevOps.
By implication, the use of that exalted suffix, ‘Ops’, with data, only refers to streamlining and automating activities related to the rapid transformation of ‘data’ into ‘business value’. While coining the moniker, DataOps, in order for it to be catchy and crisp like ‘DevOps’, the IT industry decided to drop the leading adjective ‘Big’. But it should be very much there in spirit every time we refer to the term ‘data’ in here, let alone ‘DataOps’.
With that IT etymology out of the way, let us now look at what Gartner has to say (paraphrased significantly here), about the greater role for DataOps and the interplay between data and AI.
“AI engineering stands on three core pillars: DataOps, ModelOps, and DevOps,” Gartner says. “DevOps deals mainly with high-speed code changes, but AI projects also experience dynamic changes in AI models and data, in addition to the application. And, all three must be improved. Organizations must apply DevOps principles used for CI/CD pipelines across the Data Pipelines with DataOps and across all AI model pipelines with ModelOps, or more narrowly MLOps for ML models, to reap the benefits of AI engineering.”
“DataOps is a collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and data consumers across an organization,” adds Gartner.
According to Gartner, “As data and analytics teams become critical to supporting more diverse, complex and mission-critical business processes, many are challenged with scaling the work they do in delivering data and insights to support a range of consumers and use cases. Pressure to deliver faster, with higher quality, and with resiliency in the face of constant change is causing data and analytics leaders to rethink how their teams are organized and how they work. Traditional waterfall-oriented methodologies aren’t meeting the need — the distance between requirements definition and delivery of value is too great, the time required too long, and too many critical tasks get lost or degraded across role and team silos.”
In our efforts to optimize the entire data value chain, DataOps helps us manage the ever-increasing complexity of data. This complexity can come through a combination of the following factors that modern data projects must know how to handle:
- Distributed nature of data origination (both in IoT as well as multi-cloud environments)
- Growing need to access real-time data – the trickle-down of real-time data begins to force layers of other data to require real-time data
- A large number of data sources (sometimes in the 1000s) – internal as well as supplier and customer-sourced
- Growing and diverse base of data consumers – no longer is it just IT and a few business users sitting around writing reports
- Need for more ‘actionable analytics’ with a focus on leading indicators, in addition to the ‘passive’ reports on lagging indicators
- Support for multiple data types
- Multiple data platforms (for example, those listed in the mega-merger that we started this discussion with)
- The need for speed
To better understand how this complexity can, and should, be tamed, let us take a look at the various stages in the lifecycle of data. Data originates from an application, a device, or a person (through manual data entry) as a passive snapshot of an event, such as a meter reading or a transaction. It then moves through the lifecycle to become a valuable insight (preferably actionable) and/or a monetizable business asset.
Enter Data Pipelines
DataOps addresses the challenges listed above largely through a more agile, collaborative, and change-friendly approach involving the building and managing Data Pipelines. To that extent, any discussion on DataOps would end up being hollow without a proper understanding of Data Pipelines.
So, ‘What is so new and unique about Data Pipelines?’ one might ask.
In comparison, in the digital world, the data management use cases in question are more challenging and more impactful. They have to be more in tune with helping the enterprise participate actively and effectively in the data economy. Such questions as “Where is the data coming from?”, “Where is it headed?”, “What kind of data or event is it?” and “How fast should it get there?” have significantly different answers as compared to the traditional handling of data.
Data Pipelines technology is a superset of ETL. It offers support for all the protocols and treatments that we expect data to encounter in the new hybrid IT world – across all layers of compute, network, and storage. Data Pipelines offer built-in support for monitoring, with auditing/logging/alerting, to ensure guaranteed timely delivery of the processed data to required destinations. These pipelines also know how to move data across all the boundaries of the new data economy – social media, IoT devices, mobile endpoints in the field, edge locations, data centers, public/private Clouds, commercial data exchanges, and so on. To give it all a jolt of productivity, many data platforms also support low-code or no-code development environments and come with hooks into such DevOps’ linchpins as GitHub and Jenkins so the pipelines can adopt CI/CD (Continuous Integration and Continuous Deployment) best practices for agile development.
Data Pipelines view all data as streaming data and allow for flexible schemas. Regardless of whether they come from static sources (such as a file extract) or real-time sources (such as IoT transmissions), Data Pipelines divide each data stream into smaller chunks that can be processed in parallel if desired.
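To make the chunking idea concrete, here is a minimal Python sketch. It is not taken from any particular Data Pipelines product; the names `chunked`, `process_chunk`, and `run_pipeline`, the chunk size, and the sample readings are all assumptions made for illustration. It shows a static extract and a live source being split into chunks and handed to parallel workers in exactly the same way.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(stream, size):
    """Yield successive lists of up to `size` records from any iterable."""
    iterator = iter(stream)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def process_chunk(chunk):
    """Hypothetical per-chunk treatment: sum the readings in the chunk."""
    return sum(chunk)


def run_pipeline(stream, chunk_size=3, workers=2):
    """Process chunks in parallel and return the results in chunk order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_chunk, chunked(stream, chunk_size)))


# A static extract (a list) and a live source (a generator) look the same here.
static_source = [1, 2, 3, 4, 5, 6, 7]
live_source = (reading for reading in [1, 2, 3, 4, 5, 6, 7])

static_result = run_pipeline(static_source)
live_result = run_pipeline(live_source)
```

Because `chunked` only relies on iteration, the pipeline does not need to know whether its source is a file or a feed, which is the point of treating all data as streaming data.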
Data Pipelines are also polyglot in nature, with support for many powerful new data handling languages and frameworks such as Python/pandas, MapReduce, and Spark, in addition to the good old Java, .NET, and SQL. For example, Jupyter, a popular DataOps platform, is so named because it supports development in Julia, Python, and R. Backward compatibility with SQL in old ETL scripts can be expected, subject to the proprietary restrictions of ETL platforms, and can be leveraged while building new Enterprise Data Platforms.
To understand the art of the possible with Data Pipelines, we should also list the architecture patterns they support for handling different treatments needed by large-scale data platform efforts.
These patterns include:
- Integration: for data from multiple sources
- Messaging: for temporary repositories or pub/sub based queues like Kafka
- Quality: for data quality checks or to standardize data (address validation)
- Match/Merge: for MDM verification and for entity resolution, including deduplication
- Exchange: for sharing data with partners and customers in the required format, such as HL7, SWIFT
- Warehousing: for moving data, for example, to AWS Redshift, Snowflake, SQL data warehouses, or Teradata
- Reporting: for BI/BA tools like Tableau or Power BI
- Lakes: for moving data to Amazon S3, Microsoft ADLS, or Hadoop
- AI: for moving data to feed to AI (ML) models
- Security: for security-related transformations, which include masking, anonymizing, tokenization, or encryption
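As a rough illustration of how such patterns compose, the following Python sketch chains a Quality step and a Match/Merge step into one small pipeline. The record layout, the function names, and the choice of email as the matching key are assumptions made for this example; real MDM and entity resolution tools use far richer matching rules.

```python
def standardize(record):
    """Quality step: trim whitespace and normalize casing."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
    }


def deduplicate(records):
    """Match/Merge step: keep the first record seen for each email."""
    merged = {}
    for record in records:
        merged.setdefault(record["email"], record)
    return list(merged.values())


def run(records):
    """Chain the two steps into one small pipeline."""
    return deduplicate([standardize(r) for r in records])


# Hypothetical raw input with one duplicate hidden by formatting differences.
raw = [
    {"name": "  ada   lovelace", "email": "Ada@Example.com "},
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "alan turing", "email": "alan@example.com"},
]
clean = run(raw)
```

Note that the duplicate only becomes detectable after standardization, which is why the Quality step typically runs before Match/Merge.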
Now, all of that amounts to quite some power that we are ascribing to Data Pipelines. Let us now try to get an appreciation for the applied nature of that power with the help of a couple of end-to-end ‘modern’ data use cases – one which is simple and the other a bit complex:
- When a customer ‘likes’ a new car just purchased, that event could generate data to feed a real-time report that counts social media ‘likes’ for that car, a sentiment analysis application that sends a positive response back expressing thanks, and an application charting each mention on a world map. Though the data is from the same source in all cases, each of these applications will be built on unique Data Pipelines that must be smoothly completed before that customer and other customers interested in that car see the results.
- In the new world of Omni-channel Commerce, the Service Center personnel at a retailer have to get in and out of many systems looking for data on generated orders. Service Center personnel leverage systems in the store and on the Web, deliveries, returns, audio logs, and more, across a multitude of channels just to service a particular request/complaint on unfulfilled orders. All of this data is stored in separate ERP, MRP, CRM, marketing automation, web analytics, and call center platforms, and typically many other systems. Imagine how it can all be streamlined and automated with DataOps to create a Customer Data Platform that offers a single view for service center teams towards greater customer experience and customer retention.
The following visual, inspired by an Informatica blog and highlighting typical scenarios, is pivotal to the rest of the discussion on Data Pipelines and the scope and responsibility of DataOps. It illustrates how different data flows are connected from data sources at one end to value-generating endpoints at the other.
As we can deduce from the visual above, the scope of DataOps includes building and running Data Pipelines for ingestion, data treatments, and feeding of data to downstream steps such as ModelOps, MLOps systems for AI, Data Exchanges, and BI/Analytics platforms. But in addition, DataOps also includes the activities related to governing and monitoring all those steps (as shown at the top). Yet, that is not all. DataOps also supports such DevOps-like activities, involving Data Pipelines, as collaboration and development of ‘user stories’, development, testing, and deployment (as shown with chevrons at the bottom). Along the same lines, there is now talk of DecisionOps to include BI/Reporting workloads to complement DataOps at the consumption end of the life cycle. The visual in Figure 4 illustrates this synergy between DataOps and ‘OtherOps’, or ‘XOps’ as Gartner calls them.
Now, shifting our discussion from process to people, when we focus on the primary actors in the enterprise whom DataOps brings together, and more importantly depends on for its success, we realize that the traditional roles of data modelers, architects, and engineers are now augmented by data scientists (for ML/DL) and data analysts (for self-service and predictive analytics).
The Tools and Technologies
Moving on to the third critical element of DataOps, i.e., technology, it has to be emphasized that, much like DevOps, DataOps is more about process than technology. But we must also realize that without the power that Data Pipelines technology brings to DataOps, the business returns of the Data Management initiatives in question would be impossible to realize.
Just to get a flavor for these offerings, let us now re-paint the logical view of the Data Pipelines patterns from the visual in Figure 3, with some representative and notable offerings for each pattern/step.
If the picture above looks a bit overwhelming, it is only because it is meant to be a poster for a good portion of the entire landscape of Data Pipelines and Data Management technologies/tools, and it lists several enablers for each pattern/step. The purpose of the visual is to represent the excitement and sponsorship these tools have garnered (or, is it ‘Gartnered’?) in recent times. It is also meant to illustrate, in a single chart, the vast number of options available for enterprise data platform teams to pick and choose from to meet the specific objectives of their data initiatives.
We should also note that there are many (and the list is only growing) DataOps (or Data Fabric or Data Sciences) platforms that make this task of composing the end-to-end software stack easy. They do so by integrating/embedding some of the a la carte (or best of breed) components we listed above into a single integrated platform. In Figure 7, we show some of these platforms as a representative list in alphabetical order. We at ACS Solutions have our own Data Sciences platform (called iDSP) in this list.
If you did feel that the visual in Figure 6 was making you feel dizzy, brace yourself. There is more. And, because of that, there is an excellent reason to consider using one of the platforms listed in Figure 7 for one-stop shopping of DataOps tools.
If you look closely, as busy as the visual in Figure 6 is, it doesn’t even include software components to handle the ‘consumption’ end of the Data Pipeline patterns that we see to the extreme right side of the visual in Figure 3.
Here is a list of the notable options for these endpoints that Data Pipelines feed into:
The Data Sciences platforms that we listed in Figure 7 do integrate/embed some of these ‘consumption’ tools/technologies as well, making those platforms that much more integrated and hence worth some serious consideration.
Putting it all together with a DataOps approach
To follow up our discussion on process, people, and technology, let us now try to put it all together into an enterprise-level DataOps approach:
- Setup of Data Pipelines Support Structures:
- The top layer of Figure 3: Governance (for Data Catalog, Data Ownership, Policies, Access Control, and so on), Orchestration (for managing the dynamic behavior of data pipelines, including scheduling and event-driven dependencies), and Monitoring (for troubleshooting and optimizing end-to-end performance with alerts on error rates and latency percentiles)
- The foundational layer of CI/CD, from the bottom layer of Figure 3, as leveraged from and/or influenced by existing DevOps processes/tools (for PLAN, CODE, BUILD, TEST, DEPLOY, tying back to OPERATE, MONITOR)
- Identification of Data Sources and Integration Approach:
- Identifying required data sources and treatments along with the various stops in the data journey and accordingly selecting the Data Pipeline patterns needed for ingestion and integration/transformation. It has to be noted that the selection of Data Pipeline patterns is not a ‘one and done’ proposition. Instead, they can be plugged in as the complexity of an initiative grows or additional initiatives are onboarded onto the enterprise-level program.
- Integration with ‘OtherOps’:
- Complementing DevOps and DataOps disciplines with ‘OtherOps’ as needed, such as ModelOps and MLOps (as shown in Figure 4). This step extends the practice of Agile development all the way to the consumption end of the spectrum and takes data preparation further towards ML’s model selection, model training, evaluation, and fine-tuning, and ultimately insights.
- Emphasis on Training and Building of Knowledge Base:
- Training of data teams and their data partners, such as business stakeholders, enterprise architects, and IT Ops, at each of the above steps, focusing on the end-to-end processes and their roles and responsibilities in the related workflows.
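The Monitoring support structure mentioned in the first step, with its alerts on error rates and latency percentiles, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 5% error-rate threshold, the 500 ms p95 threshold, and the sample run data are all hypothetical and not drawn from any specific monitoring tool.

```python
from statistics import quantiles


def error_rate(outcomes):
    """Fraction of pipeline runs that failed (each outcome is True on success)."""
    return sum(1 for ok in outcomes if not ok) / len(outcomes)


def latency_percentile(latencies_ms, percentile):
    """Return the requested percentile (1 to 99) using inclusive quantiles."""
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[percentile - 1]


def alerts(outcomes, latencies_ms, max_error_rate=0.05, max_p95_ms=500):
    """Compare observed metrics with hypothetical thresholds; return alert messages."""
    messages = []
    if error_rate(outcomes) > max_error_rate:
        messages.append("error rate above threshold")
    if latency_percentile(latencies_ms, 95) > max_p95_ms:
        messages.append("p95 latency above threshold")
    return messages


# Hypothetical metrics from ten pipeline runs: one failure, modest latencies.
run_outcomes = [True] * 9 + [False]
run_latencies_ms = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]

raised = alerts(run_outcomes, run_latencies_ms)
```

Percentiles are used rather than averages because a handful of slow runs can hide behind a healthy-looking mean.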
Adoption and Growth
To gain an insight into how the demand for DataOps is gaining momentum, let us turn to a market analyst who tracks Enterprise Data Management. Reportlinker.com, in their global forecast on the Enterprise Data Management market, announced that “the improved data governance, increased Cloud and DataOps adoption are expected to drive the growth of enterprise data management solutions and services from $77.9 billion in 2020 to $122.9 billion by 2025, at a Compound Annual Growth Rate (CAGR) of 9.5% during the forecast period.”
They expect the following verticals to benefit the most from this anticipated growth:
Call to Action
That’s great, but are DataOps and Data Pipelines for everyone? The answer is a ‘No!’. They are not for everyone, but they are a ‘must’ for enterprises interested in undertaking focused data transformation initiatives that meet certain qualification criteria. Such efforts can’t afford to miss out on the benefits of incorporating DataOps into their data environments.
Improved Customer Experience (CX): Do we have CX gains to focus on, such as customer retention, improved repeat buying, and cross-selling opportunities?
Customer Journey Analytics: Identify contributing factors and exact touchpoints of interactions that result in delighted customers.
AI Chatbots with Hyper-personalization: Product/service recommendations based on order history and buying patterns perceived as value-added and thoughtful services.
Data Monetization / Data Exchanges: Are there opportunities to tap into an otherwise dormant heap of data that can be anonymized and monetized into commercial data sets that can benefit us, our industry consortiums, related data marketplaces, and the well-being of the public at large?
Location Data: Telcos have long been processing a large amount of location-specific data to derive patterns about the places people visit. Retail outlets buy this data for focused target messages and promotions while also tracking the efficacy of such campaigns.
Market Intelligence: This is the use case that we started our discussion on DataOps with. It provides data, research, and tools to help related professionals, public agencies, private corporations, and other organizations gain valuable market intelligence.
Data Exchanges for Epidemiology: The science of epidemiology is all about the use of data and analytics to promote human well-being. From the days of cholera back in the 19th century to today’s COVID-19, healthcare and public safety data exchanges and related analyses have been at the forefront of our offense and defense against such pandemics. CDC promises to be the nation’s leading science-based, data-driven, service organization that protects the public’s health in support of these efforts. Modern DataOps can lend speed, accuracy, integration, collaboration, and governance to the underlying projects across all key players in this ecosystem – public agencies, academia, research groups/labs, healthcare providers, life sciences, and so on.
Improved productivity with AI/ML: Are there significant productivity gains from AI in handling client interactions in our B2C, B2B, and B2E operations?
Reimagining the Customer Call Center: Replacing cumbersome and often irritating IVR systems with conversational AI Call Center agents, a progressive Telecoms vendor has claimed the following business gains:
- 4.5 Million calls per month
- Caller’s intent recognized 90.2% of the time
- 44% reduction in customer abandonment
AI/ML/Big Data in Healthcare: When it comes to data, healthcare represents a case of an ‘embarrassment of riches.’ There is a lot of data all around. Advancements in AI/ML help us find value in it for the entire ecosystem. A report from McKinsey estimates that the use of AI/Big Data could save medicine and pharma up to $100B annually as a result of improved efficiencies in clinical trials and research, better insight for decision-making, and new tools that will help payers, regulators, providers, life sciences, and patients make better decisions.
Self-service BI and analytics: How can we empower our decision-makers with the ability to help themselves with various forms of self-service reports and analytics?
Statistical analysis: For casual and power users, self-service BI helps in adapting canned reports and other statistical tools to navigate and analyze data, run models, and come to their own conclusions.
Faster and cheaper BI: Self-service BI can cut down, or eliminate, the dependency that business analysts and executives have on internal IT and/or external consultants. It can deliver near-immediate answers to support decision-making while reducing costs by up to 50%.
Real-time analytics: How can we move at the speed of business and respond to alerts, breaches, emergencies in real-time?
Anomaly/Fraud Detection: In financial services, companies can respond to potentially fraudulent access and/or transactions by joining a real-time activity stream with historic account usage data as events arrive.
Emergency humanitarian/healthcare services: Data from the handhelds and wearables of those affected can be combined with the movements of relatives, caregivers, and relief workers to deploy necessary emergency services.
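The fraud detection case above boils down to joining each incoming event with the account's history and asking whether it looks unusual. Here is a simplified Python sketch of that join. The account identifiers, amounts, and the z-score threshold of 3.0 are assumptions chosen for illustration; production systems rely on much richer features and models.

```python
from statistics import mean, pstdev

# Hypothetical historic usage: past transaction amounts per account.
HISTORY = {
    "acct-1": [20.0, 25.0, 22.0, 30.0, 28.0],
    "acct-2": [500.0, 450.0, 520.0, 480.0, 510.0],
}


def is_anomalous(event, history, z_threshold=3.0):
    """Flag an event whose amount is far from the account's historic mean."""
    past = history.get(event["account"])
    if not past:
        return True  # no history to compare against: route for review
    mu, sigma = mean(past), pstdev(past)
    if sigma == 0:
        return event["amount"] != mu
    return abs(event["amount"] - mu) / sigma > z_threshold


def flag_stream(events, history):
    """Join each incoming event with history and yield the suspicious ones."""
    for event in events:
        if is_anomalous(event, history):
            yield event


# Hypothetical real-time activity stream.
stream = [
    {"account": "acct-1", "amount": 27.0},
    {"account": "acct-1", "amount": 900.0},
    {"account": "acct-2", "amount": 495.0},
    {"account": "acct-3", "amount": 10.0},
]
flagged = list(flag_stream(stream, HISTORY))
```

Since `flag_stream` is a generator, each event is evaluated as it arrives rather than after the whole batch has landed.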
Rationalization and consolidation of Data Platforms: Are we spending too much on too many sluggish data platforms? Isn’t it time to assess, consolidate, and modernize our disparate divisional/departmental data platforms? Shouldn’t the new platform be Cloud-centric for greater business agility?
Data Warehouse Modernization: On-going modernization keeps data warehouses relevant, but only when modernized appropriately. Today, that modernization is driven by the business needs of digital transformation – for a broader range of analytics, better quality data, modern data models, enriched metadata, multiple data types, scalability, and so on.
Edge-to-Cloud Data Fabric: Data Fabrics serve the needs of Data and Data Platform consolidations efforts. Data Platforms from internal silos and acquisitions need to be consolidated to address TCO, end-of-life, and scalability concerns. And, data consolidation is the effort to enrich, clean, safeguard all data in one central and live repository that we now refer to as a ‘data lake’.
Compliance with Regulatory standards: In our eagerness to tap into the power of data, are we being negligent about the responsibilities such power comes with? What is our readiness for compliance with GDPR/CPRA, HIPAA/PSQIA, and so on? What kind of data is passing through our platforms? What are the regulations that affect our data operations? Are we encrypting data in transit and at rest, and masking or tokenizing it as necessary?
GDPR/CCPA/CPRA: The California Privacy Rights Act (CPRA) of 2020 enhances the California Consumer Privacy Act (CCPA) of 2018. CPRA applies to personal information collected after January 1, 2022, and comes into force on January 1, 2023. It enforces stricter protection of consumer privacy, similar to the European Union’s General Data Protection Regulation (GDPR) of 2016, which has been in place since 2018. New York, Illinois, and Washington states are all expected to enact similar laws.
Protection of PII and PCI data is central to compliance with the above acts of law. Personally Identifiable Information (PII) is any data that can be used to identify a specific individual. Social Security numbers, mailing or email addresses, and phone numbers have most commonly been considered PII. PCI is short for the Payment Card Industry Data Security Standard (PCI DSS), which covers the handling of a consumer’s payment card data.
HIPAA/PSQIA: Enacted in 1996, the Health Insurance Portability and Accountability Act (HIPAA) incorporates provisions for guarding the security and privacy of personal health information. The Patient Safety and Quality Improvement Act (PSQIA) was enacted in 2005, and its implementing rule took effect in 2009. PSQIA provides Federal privilege and confidentiality protections for patient safety information, including information collected and created during the reporting and analysis of patient safety events.
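To give a feel for what masking and tokenizing PII and card data can look like inside a pipeline, here is a small Python sketch. It is an illustration only, not compliance guidance: the field names and sample values are hypothetical, the key is a placeholder that would normally come from a secrets manager, and the keyed hash (HMAC-SHA256) used here is a simple stand-in for the vault-based tokenization services that many platforms provide.

```python
import hashlib
import hmac

# Assumption: in practice this key would come from a secrets manager.
SECRET_KEY = b"demo-key-not-for-production"


def mask_email(email):
    """Masking: keep the first character and the domain, hide the rest."""
    local, _, domain = email.partition("@")
    return local[:1] + "*" * (len(local) - 1) + "@" + domain


def mask_card(number):
    """Masking: show only the last four digits of a card number."""
    digits = [c for c in number if c.isdigit()]
    return "*" * (len(digits) - 4) + "".join(digits[-4:])


def tokenize(value, key=SECRET_KEY):
    """Tokenization stand-in: deterministic keyed hash of the value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def protect(record):
    """Apply the treatments field by field before data leaves the pipeline."""
    return {
        "email": mask_email(record["email"]),
        "card": mask_card(record["card"]),
        "ssn_token": tokenize(record["ssn"]),
    }


# Hypothetical record with made-up values.
safe = protect(
    {"email": "jane@example.com", "card": "4111 1111 1111 1111", "ssn": "078-05-1120"}
)
```

Because the token is deterministic, the same input always maps to the same token, so downstream joins on the tokenized field still work without exposing the original value.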
- Corporate business goals that data can influence
- A commitment to ‘data culture’ – the desire and budget to build and scale up the ‘People, Process, Technology’ infrastructure needed to become a truly data-influenced organization in support of the above goals
- Well-articulated ROI/TCO based business cases for all applicable initiatives like the ones listed above
- Big Data maturity, especially in volumes that have the critical mass (preferably, in the Terabytes and above) to be impactful
- Organizational change management to adopt/promote Agile development, complementing ‘data culture’
The initiatives that we have just gone through are but a small representation of the need for, and in many cases the early momentum behind, modern data management platforms. The starting point for a given enterprise is to pick relevant initiatives of immediate/maximum impact and build the corresponding business cases for approval.
Next, the target ‘data consumers’ for approved initiatives have to be identified along with their specific requirements for different data sets, corresponding sources of data, and the treatments needed. The approach section details how the required Data Pipelines have to be built, orchestrated and managed to prepare high-quality datasets with the most current data.
These datasets then have to be delivered to the intended user groups, preferably as virtualized datasets in ‘data capsules’ through ‘containers’ that should also ‘host’ the processing logic (such as reporting and analytics) and data access privileges.
The objective behind a pilot of this nature is to demonstrate a sense of responsibility and to start building up the momentum for broader enterprise adoption with demonstrable:
Chief Technology Officer
ACS Solutions is a leading global information technology services and consulting organization with 20,000+ employees and has been serving businesses across industries since 1998. A trusted partner to both mid-market and Fortune 500 clients globally, ACS Solutions has been instrumental in each of their unique digital transformation journeys. Our extensive industry-specific expertise and passion for innovation have helped clients envision, build, scale, and run their businesses more efficiently.
We have a proven track record of developing large and complex software and technology solutions for Fortune 500 clients across industries such as Retail, Healthcare & Lifesciences, Manufacturing, Financial Services, Telecom and more. We enable our customers to achieve a digital competitive advantage through flexible and global delivery models, agile methodologies, and battle-proven frameworks. Headquartered in Duluth, GA, and with several locations across North and South America, Europe and the Asia-Pacific regions, ACS Solutions specializes in 360-degree digital transformation and IT consulting services.
For more information, please reach us at – firstname.lastname@example.org