Data Modernization for AI - 4 Stages of Data Engineering
- Arindam Banerji
- Aug 25, 2024
- 5 min read
Updated: Jan 31
Are your firm’s data platform, data engineering, and DataOps capable of supporting successful AI?
Expanding the scope of data engineering to meet the needs of Knowledge & Reasoning Apps
Summary:
The changing needs of modern AI applications force us to re-examine how we do data engineering. Data engineering today must be reshaped to enable knowledge creation & reasoning engines, without giving up the operational and semantic needs of traditional insight generation. A 4-stage data engineering structure, with each stage solving specific AI-enablement issues, may be the answer. This structure is called Data-Intelligence-Ops.
Two key insights drive this thinking:
1. The existing boundaries that define data engineering do not meet the needs of modern AI apps; these boundaries need to be expanded.
2. Adding semantic richness and operational value to data is necessary to realize the potential ROI of AI apps.
Preface:
Most industries and sectors are building knowledge-creation, insight-generation, and reasoning-engine applications; these application categories were seldom seen even a few years ago. Some examples of intelligent apps include:
Plan & solve applications that automatically research any area of technology or science
Robotaxis
Full-length movies and advertisements with storyline & image-sequence generation
Reinforcement-learning-based inventory optimization
Automated code-writing and deployment tools
Visual verification based on shipping manifests
Summarization & consolidation of meetings, videos, and feedback for product firms
Multi-step chains of inference for behavior-based recommendations
Decision-making digital twins in manufacturing
Multi-component ML to decipher & fix lost sales
Automated corrective actions in Salesforce based on customer LTV and churn predictions
Trusted financial insights with no need for tribal-knowledge-based verification
As noted in the summary, these applications throw up interesting artifacts that have minimal footprint, if any, in today’s data engineering. Examples of such artifacts include:
Cross-inference-step evaluation & verification datasets
Reasoning-based function/tool calling
Plan & solve agencies deployed as service-as-software
Prompt-template hubs and agent-flow state machines
Semantic chunking for LLM RAG apps (a sketch follows this list)
Complex embedding-model usage for multi-modal data
Few-shot data labeling & generation
Prompt-driven data visualizations
Text-to-SQL integrations
LLM-based Cypher generation
Data contracts between producers & consumers of data
Fully causally connected DataOps architectures for better MTBF
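To make one of these artifacts concrete, here is a minimal sketch of semantic chunking for a RAG app. This is a sketch under assumptions: `embed` is a toy, dependency-free stand-in for a real embedding model (a production pipeline would call a sentence-embedding model instead), and the 0.6 similarity threshold is arbitrary. The greedy grouping logic is the point: chunk boundaries fall where meaning shifts, not at fixed character counts.

```python
import hashlib
import math
import re

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy stand-in for a real embedding model: a deterministic
    # pseudo-random vector per token, averaged over the sentence.
    vec = [0.0] * dim
    tokens = re.findall(r"\w+", text.lower())
    for tok in tokens:
        digest = hashlib.md5(tok.encode()).digest()
        for i in range(dim):
            vec[i] += (digest[i % len(digest)] - 128) / 128.0
    n = max(len(tokens), 1)
    return [v / n for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def semantic_chunks(sentences: list[str], threshold: float = 0.6) -> list[str]:
    # Greedily extend the current chunk while consecutive sentences stay
    # semantically similar; start a new chunk when similarity drops.
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev_vec, vec) >= threshold:
            current.append(sent)
        else:
            chunks.append(" ".join(current))
            current = [sent]
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks
```

Even in this toy form, the chunker’s threshold, the embedding-model choice, and the resulting chunk and vector stores are all artifacts that need versioning and operational ownership, which is precisely the footprint missing from old-style data engineering.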
The key message from these shifts: don’t be limited by existing definitions of data engineering.
Leading question: in all the example apps above, even if the industry moves to much better capabilities, such as few-shot synthetic data generation or multi-modal knowledge-graph embeddings, expanded data engineering, with all its implications, will remain a key enabler. So the question is: given this significant shift in application needs, why do we think that data engineering of old will meet the needs of these new categories of applications? And if a shift is necessary, what should it look like?
The simple response is that old-style data engineering does not meet these emerging needs, as the list of emerging artifacts above suggests; new thinking is necessary.
What are the emerging shifts necessary in next-gen data engineering?
One way to define the structure of this data-engineering shift is to identify a set of stages (four here) and, for each stage, list the added needs and gaps.
The four stages defined below facilitate a firm’s journey from insight creation to multi-step reasoning systems. Each stage addresses a specific set of gaps and can be seen as a progression.
Stage 1 and Stage 2:
These two sets of shifts represent a new look at insight generation and ML-based predictions, with an added twist. Insight generation has been weighed down by long time-to-insight, the need for tribal-knowledge-based verification, poor MTBF, and, in many cases, the absence of any way to effect the actions that should follow from these insights. Traditional ML-based estimations and predictions are losing the support of CFOs, who are not seeing the relevant ROI on their investments. The oft-repeated question here is: where are the QoQ revenue growth and profitability promised by AI? (A sketch of the missing insight-to-action loop follows the table below.)
The table below lists some of the gaps addressed by the first two stages.
| Stage 1: Trusted, Actionable Insights | Stage 2: Traditional ML for QoQ Revenue/Profit Growth |
| --- | --- |
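Stage 1’s gap is easiest to see in code. Below is a minimal, hypothetical sketch of the insight-to-action loop mentioned above; `fetch_churn_scores` and `open_retention_case` are invented stand-ins for a prediction-store read and a CRM write (e.g., creating a Salesforce case), not real APIs.

```python
from dataclasses import dataclass

@dataclass
class ChurnInsight:
    account_id: str
    churn_probability: float
    lifetime_value: float

def fetch_churn_scores() -> list[ChurnInsight]:
    # Stand-in for reading scored accounts from a prediction/feature store.
    return [
        ChurnInsight("acct-001", 0.82, 120_000.0),
        ChurnInsight("acct-002", 0.15, 9_000.0),
    ]

def open_retention_case(insight: ChurnInsight) -> None:
    # Stand-in for a CRM write, e.g., creating a Salesforce case.
    print(f"Opening retention case for {insight.account_id} "
          f"(p_churn={insight.churn_probability:.2f}, "
          f"LTV=${insight.lifetime_value:,.0f})")

def act_on_insights(threshold: float = 0.7, min_ltv: float = 50_000.0) -> None:
    # The "actionable" part: insights that clear risk and value thresholds
    # trigger a corrective action instead of sitting in a dashboard.
    for insight in fetch_churn_scores():
        if insight.churn_probability >= threshold and insight.lifetime_value >= min_ltv:
            open_retention_case(insight)

act_on_insights()
```

The thresholds and routing rules here are themselves data artifacts that need ownership, versioning, and monitoring, which is part of what Stage 1 adds to data engineering.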
Stage 3 and Stage 4:
These two stages reflect the needs of more recent categories of applications. LLM apps and vision apps have tremendous promise, but creating actual, accurate products, as opposed to demos, is hard. Accuracy, cost, complexity, and performance are all stumbling blocks. Stage 4 is necessitated by the lack of perceived ROI in Gen-AI; David Cahn of Sequoia Capital captures this gap well in his article “AI’s $600B Question” (https://www.sequoiacap.com/article/ais-600b-question/). The table below captures the data needs and gaps addressed by these two stages, and a sketch of cross-step evaluation follows the table.
| Stage 3: Accurate LLM & Vision Products | Stage 4: Multi-component/Multi-step AI Agents & Systems |
| --- | --- |
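Stage 4’s multi-step agents produce exactly the kind of artifact listed earlier: cross-inference-step evaluation & verification datasets. The sketch below uses invented step names and checks to show the minimal shape of such a harness; in a real system the checks might be schema validations, groundedness tests, or LLM-as-judge calls, and every intermediate output plus its verdict would be persisted as a dataset that data engineering owns.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class StepRecord:
    name: str
    output: str
    passed: bool

@dataclass
class TraceEvaluator:
    # One verification predicate per inference step.
    checks: dict[str, Callable[[str], bool]]
    records: list[StepRecord] = field(default_factory=list)

    def record(self, step: str, output: str) -> bool:
        # Evaluate the step's output and keep the verdict; the
        # accumulated records form the cross-step evaluation dataset.
        ok = self.checks.get(step, lambda _: True)(output)
        self.records.append(StepRecord(step, output, ok))
        return ok

# Illustrative checks for a plan -> retrieve -> answer chain.
evaluator = TraceEvaluator(checks={
    "plan": lambda out: out.startswith("1."),              # plan is enumerated
    "retrieve": lambda out: len(out) > 0,                  # something came back
    "answer": lambda out: "according to" in out.lower(),   # answer cites sources
})

evaluator.record("plan", "1. Find revenue data 2. Compare QoQ")
evaluator.record("retrieve", "Q3 revenue: $4.1M; Q2 revenue: $3.7M")
evaluator.record("answer", "According to the filings, revenue grew ~11% QoQ.")
print([(r.name, r.passed) for r in evaluator.records])
```

Evaluating each step, rather than only the final answer, is what makes the accuracy problems of Stage 3 and the agent systems of Stage 4 debuggable.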
Note: the shifts above all relate to data semantics, DataOps, and so on; application-design shifts have not been considered.
What kinds of intelligence & Ops effectiveness need to be added to each stage of data engineering?
Each stage above needs specific add-on components to enable the semantic intelligence and operational effectiveness that the modern array of AI apps requires. The infographic below captures these enhancements. These add-on intelligence & Ops mechanisms, when aggregated, are called Data-Intelligence-Ops. Note that the specific architectures, implementation designs, and processes associated with Data-Intelligence-Ops are defined elsewhere.
[Infographic: intelligence & Ops add-ons for each of the four stages of Data-Intelligence-Ops]
Parting thoughts: key messages
AI apps are rapidly increasing in complexity and capability, so the old boundaries of data engineering no longer apply.
The way to enable this AI-led shift is to move to a modern style of data engineering that systematically adds semantic and operational value across four stages of maturity.
In many cases, firms will choose to skip a stage to move faster, and nothing prevents that.
Existing building blocks such as ingestion mechanisms, pipeline tools, and cloud EDWs remain unaffected; this is not a rip-and-replace design.
Objections that “this is not data engineering” should be rejected.
Data engineering must now support knowledge enablement, reasoning engines, and QoQ AI ROI.
Arindam Banerji, PhD (banerji.arindam@dakshineshwari.net)