Technical

AI data preparation
before the AI can start.

Data preparation consumes 40–60% of every AI project’s budget. Here’s what actually needs to happen to your data — and what doesn’t.

Shishir Mishra By Shishir Mishra · · 11 min read
The Unsexy Truth 5 Stages Good Enough Mistakes Regulated Industries Is Your Data Ready?
Shishir Mishra
Not sure if your
data is ready?
Get an honest assessment of your data readiness before committing to an AI project. No sales pitch.
or
“The AI is only as good as the data you feed it. That is not a platitude — it is a budget line item.”
Listen to this article
Click play to start listening

AI data preparation is the process of auditing, cleaning, structuring, enriching, and validating your business data so that AI systems can actually use it. It typically consumes 40–60% of an AI project’s budget and involves five stages: data audit (mapping what you have), data cleaning (fixing duplicates, gaps, and inconsistencies), data structuring (unified schemas and naming), data enrichment (adding context and linking datasets), and data validation (automated quality checks and monitoring). Skipping or rushing this work is the single most common reason AI projects fail.

Everyone wants to talk about the AI model. Nobody wants to talk about the data. This is understandable — models are interesting, data cleaning is not. But the uncomfortable truth is that data preparation determines whether your AI project succeeds or fails far more than model selection does. You can have the most sophisticated machine learning architecture available, and it will produce garbage if the data feeding it is inconsistent, incomplete, or poorly structured.

This is not a theoretical problem. In our experience, 40–60% of every AI project’s budget goes to data work — not model building, not deployment, not monitoring. Data work. And the organisations that acknowledge this upfront and budget for it accordingly are the ones whose AI systems actually reach production and deliver measurable ROI.

This article walks you through exactly what needs to happen to your data before AI can use it, what “good enough” actually looks like (it is not perfection), and how to know when you are ready to move from data preparation to model deployment.

40–60%
of AI project budget typically spent on data preparation, not model building
KORIX estimate
AI Data Preparation: What Your Data Actually Needs Before AI Can Use It

Why data preparation is the unsexy truth of AI

The gap between “we want AI” and “our data is ready for AI” is where most projects stall. Vendor demos use clean, pre-structured datasets. Your business has data scattered across CRMs, spreadsheets, legacy databases, email inboxes, and filing cabinets. The demo looked like magic. Your reality looks like a reconciliation project.

This is not a failure on your part. It is the normal state of business data. Most organisations have been collecting data for years or decades without any thought to how an AI system might eventually consume it. Field names are inconsistent. Duplicate records accumulate. Key information lives in free-text fields that no system can parse automatically. Date formats vary between systems. Customer records in one database do not match customer records in another.

An e-commerce company wanted AI demand forecasting across hundreds of SKUs. The data lived in three disconnected systems with inconsistent naming conventions — product identifiers in the sales platform did not match identifiers in the warehouse system, and the marketing calendar was a spreadsheet with its own naming logic. Before any model could be trained, the team spent weeks building a unified data structure, reconciling naming schemes across all three systems, and validating historical records against actual transactions. The AI came after — and only worked because the data foundation was solid.

That story is not unusual. It is the norm. The organisations that treat data preparation as the project — not as a preliminary step before the “real” project begins — are the ones that actually get working AI systems into production.

Three realities about data preparation that most AI vendors will not tell you:

  • It takes longer than the AI build itself. Plan for 2–8 weeks of data work before model development even starts. The exact timeline depends on how many systems your data lives in and how consistent it is.
  • It requires business knowledge, not just technical skills. A data engineer can remove duplicate records. But deciding which record is the “correct” one requires someone who understands the business context behind the data.
  • It is never truly finished. Data degrades over time. Staff change how they enter records. Systems get updated. New data sources get added. Ongoing data quality monitoring is not optional — it is part of operating an AI system.
The vendor demo trap

AI vendor demos use curated, pre-cleaned datasets. Your data will not look like their demo data. Budget for the gap between demo-quality data and your actual data state — that gap is where 40–60% of your project budget will go. Understanding what AI implementation actually costs starts with understanding this reality.

The 5 stages of AI data readiness

Data preparation is not a single task — it is a sequence of five distinct stages, each with its own methods, tools, and success criteria. Skipping stages does not save time. It creates problems that surface later, during model training or production deployment, when they are far more expensive to fix.

Stage 1: Data Audit

Before you can prepare your data, you need to know what data you actually have. A data audit answers four questions: What data exists? Where does it live? Who owns it? How is it accessed?

This sounds simple. In practice, it reveals surprises. The CRM has customer records, but so does the billing system, and they do not match. Sales data lives in a cloud platform, but the historical records before the migration are in an on-premise database with a different schema. The marketing team maintains their own spreadsheets with segmentation data that does not exist anywhere else.

The audit should produce a data inventory: a single document listing every relevant data source, its format, its owner, its access method (API, database query, manual export), its update frequency, and its known quality issues. This inventory becomes the project plan for the remaining four stages.

Stage 2: Data Cleaning

Data cleaning addresses the problems your audit uncovered. The four most common issues are duplicates (the same entity appearing multiple times with slight variations), missing values (fields that should have data but do not), inconsistencies (the same information represented differently across systems), and format mismatches (dates in different formats, currencies without standardisation, text fields with inconsistent capitalisation).

The key decision in data cleaning is not how to fix problems — it is which problems to fix. Not every data quality issue affects your AI use case. A demand forecasting model does not care about inconsistent customer name spellings. A customer matching system does. Clean the data that matters for your specific use case, not every imperfection you find.

Stage 3: Data Structuring

Structuring transforms cleaned data into a format AI systems can consume. This means unified schemas (all data sources using the same field names, data types, and relationships), consistent naming conventions (one standard applied everywhere), and proper typing (dates stored as dates, numbers stored as numbers, categories stored as categories).

Structuring also means establishing clear relationships between datasets. If your AI system needs to connect customer purchase history with support ticket data and marketing engagement, those three datasets need a shared key — a customer identifier that is consistent and reliable across all three sources.

Stage 4: Data Enrichment

Enrichment adds context that makes data more useful for AI. This includes linking related datasets (connecting purchase history to demographic data), deriving new fields from existing ones (calculating customer lifetime value from transaction records), and adding external context where it adds predictive value (industry benchmarks, seasonal patterns, market indicators).

Feature engineering — the process of creating input variables that help an AI model learn patterns — happens in this stage. Good feature engineering is often the difference between a model that performs adequately and one that delivers genuine business value. It requires both technical skill and deep domain knowledge about what factors actually drive the outcome you are trying to predict.

Stage 5: Data Validation

Validation ensures your prepared data actually meets the quality standards your AI system needs — and continues to meet them over time. This means automated quality checks (running on every data update to catch new issues), drift detection (alerting when data patterns change in ways that might affect model performance), and freshness monitoring (ensuring data feeds are updating on schedule).

Validation is the stage most organisations skip or treat as optional. It is not optional. Data that was clean last month may not be clean this month. A new hire entering records differently, a system update changing a field format, a supplier changing their data export structure — any of these can degrade your data quality without anyone noticing until the AI starts producing poor results.

S1
Data Audit
Map all sources, owners, formats, access methods — 1–2 weeks
S2
Data Cleaning
Duplicates, missing values, inconsistencies, format mismatches — 1–3 weeks
S3
Data Structuring
Unified schemas, consistent naming, proper typing — 1–2 weeks
S4
Data Enrichment
Adding context, linking datasets, feature engineering — 1–2 weeks
S5
Data Validation
Automated quality checks, drift detection, freshness monitoring — Ongoing

What “good enough” looks like

Perfection is the enemy of deployment. You do not need 100% clean data — you need data clean enough for your specific use case. The quality threshold varies significantly depending on what you are building. A medical diagnosis system has far stricter data requirements than a marketing lead scoring model.

The most expensive mistake in data preparation is not stopping. Teams that chase perfection in data quality spend months refining data beyond the point of diminishing returns, while the business case for the AI system slowly erodes. The goal is not perfect data. The goal is data that is good enough to deliver reliable results for the specific problem you are solving.

Here is what “good enough” looks like across common AI use cases:

Use CaseData Quality NeededAcceptable Gaps
Document extraction95%+ OCR accuracy, structured templatesMinor formatting variation, occasional handwriting
Lead scoring80%+ field completeness, consistent categorisationSome missing demographic fields, incomplete history for newer leads
Demand forecasting90%+ historical accuracy, no major gapsOccasional missing days, minor rounding differences
Chatbot / FAQComplete knowledge base, categorised queriesSome uncategorised edge-case queries, evolving product info

The practical test is simple: can your AI system produce reliable outputs with the data you have? If a lead scoring model achieves 80% accuracy with 80% field completeness, and the remaining 20% of missing fields would take three more months to collect, the right decision is usually to deploy at 80% and improve iteratively. An 80%-accurate model deployed today is more valuable than a 95%-accurate model deployed in six months.

A company wanted AI for customer support automation. During the initial assessment, their support requests were poorly categorised — tickets were tagged inconsistently, and the knowledge base that agents relied on had not been updated in over a year. The answers documented did not match current products and policies. Rather than trying to build an AI system on top of this shaky foundation, the recommendation was to standardise the knowledge base and ticket classification system first. That data preparation work — which felt like a detour from the AI project — meant the chatbot actually worked when it was deployed six months later. Without that foundation, the AI would have been confidently giving customers outdated information.

The 80/20 rule of data preparation

80% of the value comes from fixing the 20% of data issues that directly affect your AI use case. Identify those critical issues first. Fix them. Deploy. Then improve iteratively with real production feedback rather than hypothetical quality standards.

Not Sure If Your Data Is Ready?

Our AI Readiness Assessment evaluates your data state, identifies gaps, and gives you a realistic timeline for AI deployment — before you commit budget.

Take the Assessment →

Common data preparation mistakes

These five mistakes account for the majority of data preparation failures we see. Each one is avoidable — but only if you know to watch for it before you start.

Mistake 1: Starting with the model, not the data

The most common mistake is selecting an AI model or platform before understanding the data it will need to consume. Teams get excited about a specific technology, sign a contract, and then discover their data is not in a state that technology can use. The correct sequence is always: understand the business question, audit the data that could answer it, prepare that data, and then select the model architecture that fits. Not the other way around.

Mistake 2: Treating data prep as a one-time task

Data preparation is not a phase you complete and move on from. It is an ongoing operational function. Data degrades continuously — new staff enter records differently, systems get updated, business processes change, data sources get deprecated and replaced. If you treat data preparation as a project with a completion date, your AI system’s performance will degrade starting from that date. Build data quality monitoring into your operational workflow, not just your project plan.

Mistake 3: Ignoring data ownership and access controls

Data preparation often requires combining datasets from different departments, each with their own access policies and data governance rules. The team that owns the CRM data may have legitimate concerns about sharing customer records with the analytics team. The finance department may restrict access to revenue data. These are not obstacles to work around — they are governance requirements to address properly. Identify data owners early, get their buy-in, and establish clear access agreements before you start combining datasets.

Mistake 4: Over-engineering the data pipeline for v1

The first version of your data pipeline does not need to be production-grade infrastructure. It needs to work reliably enough to prove the AI use case. Teams that spend three months building a scalable, fault-tolerant, fully automated data pipeline before testing whether the AI model even works on their data have their priorities backwards. Build the minimum viable pipeline first. Prove the use case. Then invest in infrastructure.

Mistake 5: Not validating data against the actual business question

Data can be technically clean — no duplicates, no missing values, consistent formatting — and still be useless for your AI use case. If you are building a demand forecasting model and your historical sales data does not account for stockouts (recording zero sales when the product was out of stock, not when demand was zero), the data is clean but misleading. Always validate data against the specific business question the AI will answer, not just against generic quality standards.

60%
estimated share of AI project budget spent on data preparation, not model building
KORIX estimate
The silent data problem

The most dangerous data quality issues are the ones that do not cause errors — they cause subtly wrong results. A model trained on biased historical data will produce biased predictions without throwing any errors. A forecast trained on stockout-affected sales data will systematically under-predict demand. These problems only surface when someone with domain expertise reviews the outputs critically.

Data preparation for regulated industries

In regulated industries, data preparation is not just about quality — it is about compliance, traceability, and provenance. The data feeding your AI system needs to meet the same governance standards as every other business-critical process in your organisation. In many cases, it needs to meet higher standards, because AI decisions are increasingly subject to regulatory scrutiny.

Financial services

Financial services data preparation adds three layers that other industries can often defer: complete audit trails (every transformation applied to the data must be logged and traceable), data lineage (the ability to trace any piece of data back to its original source), and encryption at rest and in transit (all data must be encrypted regardless of where it is in the pipeline).

For a financial services firm automating loan document analysis, every piece of training data needed a complete audit trail. Data lineage was not optional — regulators required traceability from raw document to AI recommendation. The data preparation phase included building encryption, access logs, and version control into the pipeline itself. This was not overhead bolted on after the fact. It was infrastructure built into the data preparation process from day one, because retrofitting governance into a running AI system is significantly more expensive than building it in from the start.

Healthcare

Healthcare data preparation centres on three requirements: anonymisation (removing or masking personally identifiable information before any AI processing), consent tracking (ensuring every data point used for training has appropriate patient consent), and regulatory compliance (HIPAA in the US, GDPR in Europe, and equivalent frameworks in other jurisdictions). The anonymisation step alone can consume a significant portion of the data preparation timeline, particularly for unstructured data like clinical notes where PII can appear in unpredictable locations.

Cross-industry governance principles

Regardless of your specific industry, governed AI systems require data preparation that includes:

  • Data lineage documentation — Where did each piece of data come from? What transformations were applied?
  • Access control logging — Who accessed which data, when, and for what purpose?
  • Version control — The ability to reconstruct any historical state of the dataset
  • Bias auditing — Systematic checks for demographic, geographic, or temporal bias in training data
  • Retention policies — Clear rules for how long data is kept and when it is purged
Governance is not overhead

Treating data governance as an overhead cost is like treating foundation work as an overhead cost on a building. It is not overhead — it is the foundation. AI systems built on ungoverned data may work in the short term, but they create regulatory, reputational, and operational risk that compounds over time. Build governance into data preparation from the start.

How to know when your data is ready for AI

Data readiness is not binary — it is a spectrum. But there is a practical threshold below which AI projects consistently fail, and above which they have a reasonable chance of success. The following checklist is based on patterns across multiple AI implementations. It is not exhaustive, but it covers the factors that most consistently predict whether a project will reach production.

Answer these six questions honestly:

  1. Can you describe what data you have and where it lives? If you cannot produce a data inventory within a few days, you need a data audit before anything else.
  2. Is there a single owner for each dataset? Data without clear ownership does not get maintained. Unmaintained data degrades. Degraded data produces bad AI outputs.
  3. Are your key fields at least 80% complete? The specific threshold depends on the use case, but below 80% completeness on key fields, most AI models struggle to find reliable patterns.
  4. Can you access the data programmatically? If getting the data requires manual exports, copy-pasting, or emailing someone to pull a report, you have an access problem that needs solving before AI enters the picture.
  5. Do you have at least 6 months of historical data for your use case? AI systems learn from patterns. Patterns need history. Less than 6 months typically does not provide enough signal for most business applications.
  6. Is the data already being used for manual decisions today? If humans are already making decisions based on this data, that is a strong signal the data contains the patterns an AI system could learn from. If nobody uses the data for decisions currently, question whether it actually contains useful signal.

If you answered yes to four or more of these questions, you are likely ready for an AI pilot. Not necessarily a full deployment — but a structured pilot that tests whether AI can deliver value on your specific data.

If you answered yes to fewer than four, that does not mean AI is off the table. It means you have specific data preparation work to complete first. And knowing exactly what that work is — rather than discovering it mid-project — saves significant time and budget. An AI readiness assessment can map those gaps precisely.

The 7 signs your business is ready for AI covers the broader organisational readiness factors beyond data, including leadership alignment, process maturity, and team capacity.

The best time to start preparing your data was six months ago. The second best time is now. Every month of delay is another month of data quality debt accumulating.

The Bottom Line

Your AI is only as good
as the data behind it.

Data preparation is not a preliminary step — it is the foundation that determines whether your AI system delivers value or becomes an expensive experiment. Audit what you have. Clean what matters. Structure it for consumption. Validate it continuously. And accept that “good enough” is better than “perfect but never deployed.”

Shishir Mishra
Founder & Systems Architect, KORIX
19 years building AI and enterprise systems across finance, healthcare, logistics, and real estate. “The organisations that succeed with AI are the ones that invest in data foundations first. Everything else follows from that.”
Learn more about Shishir →
FAQ

Common questions about
AI data preparation.

Have a question not listed here?

Ask us directly →
How long does AI data preparation take?

Typically 2 to 8 weeks depending on the current state of your data. Organisations with data already in structured databases may need only 2 to 3 weeks. Those with data spread across disconnected systems, inconsistent formats, and significant quality issues may need 6 to 8 weeks. The investment is worth it — rushing data preparation is the single most common reason AI projects fail to deliver results.

What percentage of AI projects fail due to bad data?

Approximately 85% of AI projects fail to reach production, and data quality is consistently cited as the number one contributing factor (Source: Gartner). Bad data does not just reduce accuracy — it can make an AI system actively harmful by producing confident but wrong outputs.

Do I need a data scientist for data preparation?

Not necessarily. Structured data preparation — cleaning, deduplication, format standardisation, schema unification — is typically handled by data engineers rather than data scientists. Data scientists become essential later, during feature engineering and model training. For many mid-market organisations, a competent data engineer or technically skilled analyst can handle the bulk of data preparation work.

Can AI clean its own data?

Partially. AI tools can flag anomalies, detect duplicates, and identify pattern inconsistencies faster than manual review. However, human judgment is still essential for business rule validation — deciding whether a flagged record is genuinely wrong or an edge case, and defining what “correct” means for your specific context. The most effective approach combines AI-assisted detection with human-driven decision-making.

What is the minimum data needed for AI?

It depends on the use case, but as a general guideline: at least 6 months of historical data and 80%+ field completeness for key variables. Simple classification tasks may work with less. Complex forecasting models need more. Quality matters more than quantity — 6 months of clean, well-structured data will outperform 3 years of inconsistent data in almost every scenario. Our readiness assessment can evaluate your specific situation.

Ready to find out if
your data is AI-ready?

Free consultation. No commitment. 30 minutes. We will assess your data state and give you an honest timeline for AI deployment.