
April 2, 2026 | Labs

Your AI Knowledge Base Is Only as Good as What You Feed It

This is Part 1 of a two-part series on building AI knowledge bases. Read Part 2: Building an AI Knowledge Base That Actually Works

There's a type of AI system that's becoming increasingly popular in businesses of all sizes. It's called RAG - retrieval-augmented generation - and the basic idea is simple: instead of hoping an AI model knows something about your company, you give it access to your actual documents. Your policies, your proposals, your training materials, your meeting notes. When someone asks a question, the system searches your documents, finds the relevant sections, and generates an answer based on what it found.

Think of it as "Ask your documents questions in plain English."
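The retrieval half of that pipeline can be sketched in a few lines. This is a toy illustration of the shape of the system, not a real implementation: production RAG uses vector embeddings and a language model for the "generate" step, where this sketch just scores chunks by word overlap.

```python
# Toy sketch of RAG retrieval: split documents into chunks, score each
# chunk against a question by shared words, return the best matches.
# Real systems use embeddings + vector search, but the pipeline shape
# (chunk -> index -> retrieve -> answer) is the same.

def chunk(text, size=40):
    """Split text into windows of roughly `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(question, documents, top_k=2):
    """Rank all chunks by how many question words they share."""
    q_words = set(question.lower().split())
    chunks = [c for doc in documents for c in chunk(doc)]
    scored = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The office is closed on public holidays and weekends.",
]
print(retrieve("refund policy for returns", docs, top_k=1))
```

The answer-generation step then hands the retrieved chunks, plus the question, to a language model - which is why the quality of those chunks matters so much.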

It's a genuinely useful idea. The frustration of knowing an answer exists somewhere in your company's files - but not being able to find it - is universal. We've all spent 20 minutes searching through shared drives, email, and chat logs for something we know we read last month.

I've been building and testing RAG systems since 2023, and the technology has matured significantly. But recently I wanted to push further. I took 18 years of my own business documents - roughly 900 files including published books, client proposals, training materials, and research notes - and built a proper knowledge base from scratch.

What I discovered surprised me, and I think it has implications for any business considering this approach.

The 87% noise problem

After running the first indexing process (where the system reads all your documents, breaks them into searchable pieces, and stores them), I checked the numbers. The system had created 370,000 searchable chunks from my 900 files.

That sounds impressive. It was actually a disaster.

87% of those chunks - roughly 320,000 of them - were noise. Not useful content, not searchable information. Just garbage that the system was dutifully treating as knowledge.

What happened? The system had processed PowerPoint presentations and PDF files alongside my text documents. When an AI system processes a PowerPoint file, it doesn't see slides the way you do. It sees the raw file structure: XML code, image data encoded as text, slide templates, formatting instructions. A single congratulations certificate - one slide with maybe 50 words on it - produced 8,809 searchable chunks. A presentation about QR codes produced 8,913.
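You can see the mechanism with a toy fragment. A .pptx file is a zip archive of XML, and the visible slide text lives inside `<a:t>` run elements in the DrawingML namespace. A naive indexer treats the entire markup as content; a proper extractor pulls out only the text runs. The XML below is an illustrative fragment, not a complete slide:

```python
# Contrast naive pptx "text" (the raw XML markup) with proper
# extraction of the <a:t> text runs, using only the standard library.
import xml.etree.ElementTree as ET

slide_xml = (
    '<p:sp xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main" '
    'xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">'
    '<p:txBody><a:p><a:r><a:t>Congratulations!</a:t></a:r></a:p></p:txBody></p:sp>'
)

naive = slide_xml  # what a raw indexer "sees" and chunks as knowledge
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
clean = " ".join(t.text for t in ET.fromstring(slide_xml).iter("{%s}t" % A_NS))

print(len(naive), "characters of markup vs:", clean)
```

Multiply that markup-to-text ratio across every slide, image, and template in a real deck and an 8,000-chunk certificate stops being surprising.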

The result: every question I asked forced the system to search through 320,000 pieces of garbage to find 38,000 pieces of real content. It's like trying to find a specific document in a filing cabinet where 87% of the folders contain shredded paper.

The fix was embarrassingly simple

I removed the file types that don't process well (PowerPoint, PDF, Word documents) from the searchable corpus and kept only the formats that work cleanly: markdown, plain text, and HTML files.

The important part: I didn't delete anything. The original files still exist, organised in a separate location. I just stopped feeding them to the AI system in their raw form. If the content in a presentation or PDF is valuable (and often it is), the right approach is to convert it to a text format first, then add the text version to the knowledge base.
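The triage rule is simple enough to write down as code. A minimal sketch, assuming extension-based routing - the extension lists here are my examples, not a standard:

```python
# Route candidate files: text-native formats are indexed directly,
# convertible formats are queued for a text-conversion step, and
# everything else is set aside. Nothing is deleted.

INDEX_DIRECTLY = {".md", ".txt", ".html"}
CONVERT_FIRST = {".pptx", ".pdf", ".docx"}

def triage(filenames):
    plan = {"index": [], "convert": [], "skip": []}
    for name in filenames:
        ext = "." + name.rsplit(".", 1)[-1].lower() if "." in name else ""
        if ext in INDEX_DIRECTLY:
            plan["index"].append(name)
        elif ext in CONVERT_FIRST:
            plan["convert"].append(name)
        else:
            plan["skip"].append(name)
    return plan

print(triage(["notes.md", "deck.pptx", "lease.pdf", "logo.png"]))
```

The "convert" queue is where a tool or a human turns slides and PDFs into clean text before anything reaches the index.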

After cleaning the corpus, the system had roughly 38,000 chunks. All real content. I ran ten test queries and got accurate, well-sourced answers with zero hallucination (which means the AI didn't make anything up - it only answered based on what it actually found in my documents).

What this means for your business

If you're considering building an AI knowledge base for your organisation, this is the most important thing I can tell you: the quality of what you put in determines the quality of what you get out. Not the AI model you choose. Not the database technology. Not the search algorithm. The data.

Here's what that means in practice:

Not all file formats work equally well. Text-based formats (markdown, plain text, HTML, well-structured web pages) are ideal. PowerPoint and PDF files need to be converted to text first. Raw spreadsheets produce poor results because a table of numbers without context doesn't mean anything when it's broken into fragments.

Bigger isn't always better. I had one file - 29 podcast episodes concatenated into a single document - that produced over 10,000 searchable chunks. That single file represented more than a quarter of my entire useful index. When one document dominates your knowledge base, it skews every search result. Split large files into logical units before adding them.
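Splitting is usually a one-off script. A minimal sketch, assuming the episodes are separated by a heading marker like "## Episode N" - adjust the pattern to whatever delimiter your own files actually use:

```python
# Split one oversized concatenated file into logical units (one
# document per episode) before indexing, using a heading marker as
# the delimiter.
import re

def split_episodes(text):
    """Return one document per '## Episode ...' section."""
    parts = re.split(r"(?m)^(?=## Episode )", text)
    return [p.strip() for p in parts if p.strip()]

big_file = """## Episode 1
Welcome to the show.
## Episode 2
Today we discuss RAG systems.
"""
docs = split_episodes(big_file)
print(len(docs), "documents instead of one giant file")
```

Each episode then gets chunked and searched on its own merits, instead of one file dominating every result.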

Check for sensitive information before you start. This nearly caught me out. My business documents contained real bank account numbers, tax references, a landlord's personal details from a lease agreement, and phone numbers from third-party CVs. I only found these because I ran a sensitivity scan as the very first step. If I'd built the entire knowledge base before discovering the problem, I'd have had to tear it all down and start over.

For any business handling client data, employee information, or financial records, this is not optional. Scan for sensitive data before you do anything else.
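Even a crude first-pass scan catches a lot. A minimal sketch using regular expressions - the patterns below (a UK-style sort code, a UK mobile number, an email address) are illustrative only, and a real scan should use a dedicated PII-detection tool with patterns matched to your jurisdiction:

```python
# Pre-indexing sensitivity scan: flag text that matches known
# sensitive-data patterns before it ever reaches the knowledge base.
import re

PATTERNS = {
    "uk_sort_code": r"\b\d{2}-\d{2}-\d{2}\b",
    "uk_mobile": r"\b(?:\+44\s?7\d{3}|07\d{3})\s?\d{6}\b",
    "email": r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
}

def scan(text):
    """Return (label, match) pairs for every sensitive pattern found."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in re.finditer(pattern, text):
            hits.append((label, m.group()))
    return hits

sample = "Pay into 12-34-56, or call 07700 900123 with questions."
print(scan(sample))
```

Anything flagged gets reviewed and redacted (or the file excluded) before indexing, not after.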

Your knowledge base needs rules. After the 87% noise experience, I wrote a set of explicit rules for what belongs in the corpus and what doesn't. File format rules, size limits, quality checks. Without these, it's only a matter of time before someone drops a 50MB PowerPoint export into the shared drive and silently degrades the entire system's performance.
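Those rules work best as an automated gate rather than a written policy. A minimal sketch - the specific limits are examples of the kind of rules described above, not recommendations:

```python
# Admission gate every candidate file must pass before indexing:
# format must be text-native, and size must be within bounds.

MAX_BYTES = 2_000_000            # example threshold, tune to your corpus
ALLOWED = {".md", ".txt", ".html"}

def admit(filename, size_bytes):
    """Return (ok, reason) for a candidate corpus file."""
    ext = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in ALLOWED:
        return False, f"format {ext or '(none)'} must be converted to text first"
    if size_bytes > MAX_BYTES:
        return False, "file too large: split into logical units first"
    return True, "ok"

print(admit("policies.md", 40_000))
print(admit("export.pptx", 50_000_000))
```

Run the gate in whatever pipeline feeds the index, and the 50MB PowerPoint export gets rejected with a reason instead of silently poisoning search results.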

The data preparation iceberg

Most AI vendors and consultants talk about the exciting parts: the AI model, the chat interface, the "magic" of natural language search. Those things matter, but they're the tip of the iceberg.

Below the waterline is the work that actually determines whether the system is useful: cleaning your data, deciding what belongs in the knowledge base, converting documents to formats that work, removing sensitive information, and setting up rules to keep the quality high over time.

In my experience, you should expect to spend about 80% of a RAG project on data preparation and 20% on the AI system itself. If a vendor tells you they can "just point the AI at your files and it'll work," ask them what happens when someone uploads a 50-slide PowerPoint deck. If they don't have a good answer, they haven't built one of these in production.

What's next

In the next post, I'll cover what I learned about building the actual retrieval system - how the AI finds the right information in your documents, what "chunking strategy" means and why it matters more than which AI model you use, and how to tell whether your system is actually working or just giving you confident-sounding nonsense.

Continue reading: Part 2 - Building an AI Knowledge Base That Actually Works

I help organisations adopt AI practically and responsibly. If you're considering an AI knowledge base for your business, the AI Opportunity Audit is a 90-minute session where we figure out whether it's the right fit and what a realistic implementation looks like. Details at humanspark.ai
