
Data Modernization for AI - 4 Stages of Data Engineering

Updated: Jan 31

Are your firm’s data platform, data engineering and DataOps capable of supporting successful AI?


Expanding the scope of data engineering to meet the needs of Knowledge & Reasoning Apps

 

Summary:

The changing needs of modern AI applications force us to rethink how we do data engineering. Data engineering must be reshaped to enable knowledge-creation and reasoning engines, without giving up the operational and semantic needs of traditional insight generation. A 4-stage data engineering structure, with each stage solving specific AI-enablement issues, may be the answer. This structure is called DataIntelligenceOps.


Two key insights drive the thinking behind this:

1. The existing boundaries that define data engineering do not meet the needs of modern AI apps – these boundaries need to be expanded.

2. Adding semantic richness and operational value to data is necessary to realize the potential ROI of AI apps.


Preface:

Most industries (and sectors) are building knowledge creation, insight generation and reasoning engine applications; these application categories were seldom seen even a few years ago. Some examples of intelligent apps include:


  •       Plan & solve applications to automatically research any area of technology/science

  •       Robotaxis

  •       Full-length movie & advertisement generation with story-line & image sequences

  •       Reinforcement-learning-based inventory optimization

  •       Automated code-writing and deployment tools

  •       Visual verification based on shipping manifests

  •       Summarization & consolidation of meetings, videos and feedback for product firms

  •       Multi-step chains of inference for behavior-based recommendations

  •       Decision-making digital twins in manufacturing

  •       Multi-component ML to decipher & fix lost sales

  •       Automated corrective actions in Salesforce based on customer LTV and churn predictions

  •       Trusted financial insights with no need for tribal-knowledge-based verification


The applications above throw up interesting artifacts that have minimal footprint, if any, in today’s data engineering. Examples of such artifacts include:


  • Cross-inference-step evaluation & verification data sets

  • Reasoning-based function/tool calling

  • Plan & solve agencies deployed as service-as-software

  • Prompt template hubs and agent-flow state machines

  • Semantic chunking for LLM RAG apps

  • Complex embedding-model usage for multi-modal data

  • Few-shot data labeling & generation

  • Prompt-driven data visualizations

  • Text-to-SQL integrations

  • LLM-based Cypher generation

  • Data contracts between producers & consumers of data

  • Fully causally connected DataOps architectures for better MTBF

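To make one of these artifacts concrete, here is a minimal sketch of semantic chunking for an LLM RAG app. It uses a toy bag-of-words embedding as a stand-in for a real sentence-embedding model; the similarity threshold and splitting rule are illustrative assumptions, not a production recipe.

```python
# Minimal sketch of semantic chunking: start a new chunk whenever
# adjacent sentences drift apart semantically. The bag-of-words
# "embedding" below is a toy stand-in for a real embedding model.
import math
import re
from collections import Counter

def embed(sentence: str) -> Counter:
    # Toy embedding: lowercase word counts (replace with a real model).
    return Counter(re.findall(r"[a-z']+", sentence.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(text: str, threshold: float = 0.2) -> list:
    # Split into sentences, then break chunks at low-similarity joints.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for s in sentences:
        if current and cosine(embed(current[-1]), embed(s)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(s)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Real RAG pipelines would layer overlap windows, token budgets and model-based embeddings on top of this skeleton, but the core idea, chunk boundaries driven by semantic similarity rather than fixed character counts, is the same.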

The key message from these shifts: don’t be limited by existing definitions of data engineering.


Leading question: in all the example apps above, even if the industry moves to much better capabilities such as few-shot synthetic data generation or multi-modal knowledge-graph embeddings, expanded data engineering, with all its implications, will remain a key enabler. So the question is: given this significant shift in application needs, why do we think the data engineering of old will meet the needs of these new categories of applications? And if a shift is necessary, what should it look like?


The simple response is that old-style data engineering does not meet these emerging needs, as the list of emerging artifacts suggests; new thinking is necessary.
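One of the emerging artifacts listed above, a data contract between producers and consumers of data, can be sketched as a schema that the consumer validates at ingestion time. The field names and rules below are invented purely for illustration.

```python
# Hedged sketch of a producer-consumer data contract: a declared
# schema the producer pledges and the consumer checks on ingestion.
# The ORDERS_CONTRACT fields are hypothetical examples.
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldSpec:
    name: str
    dtype: type
    nullable: bool = False

ORDERS_CONTRACT = [
    FieldSpec("order_id", str),
    FieldSpec("amount", float),
    FieldSpec("coupon", str, nullable=True),
]

def violations(record: dict, contract: list) -> list:
    # Return human-readable breaches instead of failing the whole pipeline.
    problems = []
    for spec in contract:
        value = record.get(spec.name)
        if value is None:
            if not spec.nullable:
                problems.append(f"missing required field: {spec.name}")
        elif not isinstance(value, spec.dtype):
            problems.append(f"{spec.name}: expected {spec.dtype.__name__}")
    return problems
```

The design choice worth noting: the contract is data, not code buried in a pipeline, so it can be versioned, diffed and negotiated between producing and consuming teams.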


What are the emerging shifts necessary in next gen data engineering?

One way to structure this data engineering shift is to identify four stages and, for each stage, list the added needs and gaps.

The 4 stages defined below facilitate a firm’s journey from insight creation to multi-step reasoning systems. Each stage addresses a specific set of gaps and can be seen as a progression.


Stage 1 and Stage 2:

These two stages represent a new look at insight generation and ML-based prediction, with an added twist. Insight generation has been weighed down by long time-to-insight, the need for tribal-knowledge-based verification, poor MTBF and, in many cases, no way to trigger the actions that should follow from the insights. Traditional ML-based estimation and prediction is losing the support of CFOs, who are not getting the relevant ROI on their investments. The oft-repeated question: where are the QoQ revenue growth and profitability promised by AI?


Some of the gaps addressed by the first two stages are listed below.

Stage 1: Trusted Actionable Insights — gaps addressed:

  • Unverifiable insights

  • Overwhelming need for use of tribal knowledge

  • Dashboard sprawl

  • Low MTBF

  • No integration with insight-driven actions

  • Fragility due to changing source schemas

  • Multi-cloud & multi-pattern ingestion

  • Complex pipelines, fragmented transformations

  • Manual query generation

  • Scale needs custom solutions

Stage 2: Traditional ML for QoQ Revenue/Profit Growth — gaps addressed:

  • Training-serving skew

  • Automated drift remediation

  • Domain-specific feature creation

  • Business metrics mapping & tracking

  • Statistically inefficient data

  • Data augmentation, generation, labeling

  • Inefficient data & storage Ops for ML data

  • Data procurement

  • ML infrastructure healing
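As a concrete example of the training-serving skew gap in Stage 2, a minimal drift check might compare one feature’s distribution between training and serving data using the Population Stability Index. The 10-bucket histogram and the commonly cited 0.2 alert threshold are illustrative conventions, not fixed standards.

```python
# Hedged sketch: detecting training-serving drift on a single feature
# with the Population Stability Index (PSI).
import math

def psi(train: list, serve: list, bins: int = 10) -> float:
    lo, hi = min(train), max(train)

    def bucket_fractions(values):
        counts = [0] * bins
        for v in values:
            i = int((v - lo) / (hi - lo) * bins) if hi > lo else 0
            counts[min(max(i, 0), bins - 1)] += 1
        # Floor each fraction to avoid log(0) on empty buckets.
        return [max(c / len(values), 1e-6) for c in counts]

    p = bucket_fractions(train)   # training (expected) distribution
    q = bucket_fractions(serve)   # serving (actual) distribution
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# PSI near 0 means serving data matches training; values above ~0.2
# are often treated as drift worth remediating automatically.
```

A check like this, run continuously per feature, is the kind of mechanism that turns “automated drift remediation” from a bullet point into an operational capability.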

Stage 3 and Stage 4:

These two stages reflect the needs of more recent categories of applications. LLM apps and vision apps have tremendous promise, but creating actual, accurate products, as opposed to demos, is hard. Accuracy, cost, complexity and performance are all stumbling blocks. Stage 4 is necessitated by the perceived lack of ROI in Gen-AI. David Cahn of Sequoia Capital captured this gap well in his article “AI’s $600B Question” (https://www.sequoiacap.com/article/ais-600b-question/). The data needs and gaps addressed by these two stages are listed below.

Stage 3: Accurate LLM-apps & Vision Products — gaps addressed:

  • Complex information retrieval

  • Chunking strategies

  • Multi-modal data – embeddings, ingestion

  • Populating knowledge graphs

  • Labeling, augmentation of vision data

  • Context strengthening

  • Prompt repositories, management, curation

  • Datasets for fine-tuning, evals

  • Query transformation, enhancements

  • Latency, streaming, throughput performance

  • Composite, hybrid search

Stage 4: Multi-component/Step AI Agents & Systems — gaps addressed:

  • Plan & solve flow repositories, tools

  • State machines for flows – CRUD

  • Multi-category data store Ops

  • Context, model fusion

  • Cross-inference data sets, eval sets

  • Function-call reasoning infrastructure

  • Resource tracking, frugality across inference

  • Multi-data-product co-pilots
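To illustrate the Stage 4 “state machines for flows” gap, here is a minimal sketch of an agent-flow state machine. The state names and transition rules are invented for illustration; real systems would persist these definitions (the CRUD part) in a flow repository.

```python
# Illustrative sketch: a minimal state machine for a "plan & solve"
# agent flow, making multi-step flows explicit and inspectable.
from typing import Callable

Transition = Callable[[dict], str]  # inspects context, returns next state

class FlowStateMachine:
    def __init__(self, start: str, terminal: set):
        self.start, self.terminal = start, terminal
        self.transitions = {}

    def on(self, state: str, fn: Transition) -> None:
        self.transitions[state] = fn

    def run(self, context: dict, max_steps: int = 20) -> list:
        # Walk the flow, recording each state; max_steps bounds retries.
        state, trace = self.start, [self.start]
        for _ in range(max_steps):
            if state in self.terminal:
                break
            state = self.transitions[state](context)
            trace.append(state)
        return trace

# Hypothetical flow: PLAN -> RETRIEVE -> ANSWER, retrying retrieval
# until documents are available in the context.
flow = FlowStateMachine(start="PLAN", terminal={"ANSWER"})
flow.on("PLAN", lambda ctx: "RETRIEVE")
flow.on("RETRIEVE", lambda ctx: "ANSWER" if ctx.get("docs") else "RETRIEVE")
```

The point of the explicit trace is operational: every run of the agent leaves an auditable path through the flow, which is what cross-inference evaluation and resource tracking need as input.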

Note: the above shifts all relate to data semantics, DataOps and the like; application-design shifts are not considered here.


What kinds of intelligence & Ops effectiveness need to be added to each stage of data engineering?


Each stage above needs specific add-on components to enable the semantic intelligence and operational effectiveness required by the modern array of AI apps. The infographic below captures these enhancements. These add-on intelligence and Ops mechanisms, taken together, are called Data-Intelligence-Ops. Note that the specific architectures, implementation designs and processes associated with Data-Intelligence-Ops are defined elsewhere.




Parting thoughts: key messages.

  1. AI apps are rapidly increasing in complexity and capability, so the old boundaries of data engineering no longer apply.

  2. The way to enable this AI led shift is to move to a modern style of data engineering that systematically adds semantic and operational value in 4 different stages of maturity.

  3. In many cases firms will choose to skip a stage to move faster; nothing prevents that.

  4. Existing building blocks such as ingestion mechanisms, pipeline tools, cloud EDW etc., remain unaffected – this is not a rip n’ replace design.

  5. Claims of “but this is not data engineering” need to be rejected.

  6. Data engineering must now support knowledge enablement, reasoning engines & QoQ AI ROI.

 

Arindam Banerji, PhD (banerji.arindam@dakshineshwari.net)


 
 
 
