SEO Is Not That Hard

Entities Part 2: How Machines Learn To Read

Edd Dawson Season 1 Episode 322

Send us a text

Keywords don’t tell the whole story—entities do. We take you inside the three-step process machines use to read your content like a detective at a crime scene: highlighting potential entities, using context to resolve ambiguity, and linking each mention to a unique identifier in a global knowledge base. By the end, you’ll see why “Jordan” only makes sense when surrounded by the right clues—and how to present those clues so search engines and AIs make the right call every time.

We start with named entity recognition, the digital highlighter that picks out people, organisations, products, places, and dates across unstructured text. Then we move to entity disambiguation, where context—co-occurring teams, locations, or concepts—guides the system to the correct meaning. Finally, we close with entity linking, the moment a string becomes a node with a library card in Wikipedia or Wikidata. That linkage is the bridge into Google’s Knowledge Graph, powering features like knowledge panels and richer, more confident results.

Along the way, we dig into why Wikipedia and Wikidata matter far beyond vanity. Accurate, well-sourced entries create a feedback loop that improves how machines understand your brand, your founders, and your products. If you don’t meet notability yet, don’t force it; build authority elsewhere with consistent profiles, structured data, and content that names and connects related entities. We also share a simple action: search for your brand, founder, and main product on Wikipedia and Wikidata and assess accuracy. Want more like this? Follow the show, share it with a colleague, and leave a review so we can help more teams make sense of entity-first SEO.

SEO Is Not That Hard is hosted by Edd Dawson and brought to you by KeywordsPeopleUse.com

Help feed the algorithm and leave a review at ratethispodcast.com/seo

You can get your free copy of my 101 Quick SEO Tips at: https://seotips.edddawson.com/101-quick-seo-tips

To get a personal, no-obligation demo of how KeywordsPeopleUse could help you boost your SEO, and a 7 day FREE trial of our Standard Plan, book a demo with me now

See Edd's personal site at edddawson.com

Ask me a question and get on the show: click here to record a question

Find Edd on LinkedIn, Bluesky & Twitter

Find KeywordsPeopleUse on Twitter @kwds_ppl_use

"Werq" Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 4.0 License
http://creativecommons.org/licenses/by/4.0/

SPEAKER_01:

Hello and welcome to SEO Is Not That Hard. I'm your host, Edd Dawson, the founder of the SEO intelligence platform KeywordsPeopleUse.com, where we help you discover the questions people ask online and then how to optimise your content for traffic and authority. I've been in SEO and online marketing for over 20 years and I'm here to share the wealth of knowledge, hints and tips I've amassed over that time.

SPEAKER_00:

Hello, welcome back to SEO Is Not That Hard. It's me here, Edd Dawson, as usual, and this is part two in our mini-series on entities. In the last episode, I introduced the single most important new concept in modern SEO, and that's the entity. We learned that an entity is a thing, not a string. It's a real-world concept like a person, a product, or a place, not just an ambiguous keyword that we type into a search bar. And we used the example of Apple to see how thinking in entities helps search engines resolve that ambiguity and truly understand what we're looking for.

That naturally leads to the next big question. If the web is just a massive, chaotic library of trillions of pages of text, how does a machine like Google, or an AI like ChatGPT, actually read a sentence on your website and figure out what real-world things you're talking about? It's actually a very logical process, and it's called information extraction. Today we're going to look at how this works and the three-step pipeline that machines use to learn how to read and understand what is on a page.

The first thing to understand is that most of the internet is made up of completely unstructured text. Think about it: blog posts, news articles, product descriptions, forum comments. It's all just free-form, flowing language. For a machine to make sense of it, it has to convert that unstructured mess into structured, interconnected data. The analogy that's often used is a detective arriving at a very complex crime scene. They don't just take in the whole room in one go. They have a process: they identify the potential clues, they analyse the context around those clues to figure out what they mean, and then they link them together to solve the case. Machines do something very similar, and their process has three main steps.

Step one is called named entity recognition, or NER. You might also hear it called entity identification or sometimes entity chunking. If we consider our document to be the crime scene, this is the first sweep, that first overall look at the document and the text contained within it. Imagine the machine scanning the sentences on the page with a digital highlighter. Its only job at this stage is to mark any word or phrase that might be a distinct entity, and to sort those highlights into broad, predefined categories like person, organisation, location, product, or date. For example, if it reads the sentence "Michael Jordan, who played for the Chicago Bulls, later became the owner of the Charlotte Hornets", the named entity recognition system would highlight Michael Jordan as a person, Chicago Bulls as an organisation, and Charlotte Hornets as an organisation. This process is really versatile: it can be applied to any kind of text, whether it's a corporate blog post, a dense academic research paper, or even a simple tweet. It's the machine's first step in finding the core concepts in any document.
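If you want to see roughly what that first highlighting pass looks like in practice, here's a minimal sketch in Python using the open-source spaCy library. The episode doesn't name any particular tool, so treat spaCy and the specific model below purely as one illustrative option:

import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

sentence = ("Michael Jordan, who played for the Chicago Bulls, "
            "later became the owner of the Charlotte Hornets.")

doc = nlp(sentence)

# Step one only: mark the spans and assign broad categories.
# There is no attempt yet to decide WHICH Michael Jordan this is.
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")

# Typical output (exact labels depend on the model):
#   Michael Jordan -> PERSON
#   Chicago Bulls -> ORG
#   Charlotte Hornets -> ORG

The output is just labelled strings, which is exactly where the detective analogy picks up again.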
So in our case the machine has highlighted its clues, but this leads to the most critical and most challenging part of the investigation, and that's step two: entity disambiguation. The machine has highlighted the name Jordan. Now it has to figure out which specific real-world Jordan we're talking about, and that's a big challenge.

To solve it, the system first generates a list of possible candidates from its vast knowledge base. For the word Jordan, we've got Jordan the country, the River Jordan, Michael Jordan the person, or Jordan the Nike brand. So how does it choose the right one? It does what any good detective would do: it looks for context. It analyses the surrounding words and other entities in the text to resolve that ambiguity. In our example sentence, the context includes the entities Chicago Bulls and Charlotte Hornets, and the machine knows these are basketball teams. There's far less ambiguity around Chicago Bulls and Charlotte Hornets; they are clearly the basketball teams, because there's nothing else named that. So this surrounding context of basketball means it's overwhelmingly probable that Jordan refers to Michael Jordan, the person. If we had a different sentence that said "the ancient city of Petra is the crown jewel of Jordan, a country in the Middle East", the context of Petra and the Middle East would allow the machine to confidently identify the entity as the country. This contextual analysis is the core of how machines achieve an almost human-like understanding of text. They're not just seeing keywords here. What they're seeing is a web of relationships, and they're using those relationships to figure out what you truly mean.

Once the machine is confident it's identified the correct entity, it moves to the final step of the process, and that's step three: entity linking. This is where the case is closed and filed away. We've solved the puzzle, we've sorted out the ambiguity, we've worked out that Michael Jordan is the person, and now we connect, or link, that entity to its unique identifier in a massive, centralised knowledge base. Think of this knowledge base as a universal library, where every unique entity in the world has its own library card with a unique ID number. Entity linking is the act of stamping the mention on your web page with the exact library card number. This transforms a simple string of text into a rich, structured data point that is now officially part of a larger network of global knowledge.

And this is where it gets really interesting for us as website owners. What are these universal libraries that search engines and AI systems use as their source of truth? For the most part, they're massive, public, human-created knowledge bases. The two most important ones are Wikipedia and its sister project, Wikidata. In the field of AI, the task of linking mentions to their corresponding Wikipedia pages is so common that it has its own name: Wikification. And this shows us there's a deep, symbiotic relationship here. AI systems like Google's rely heavily on the structured, verified information in places like Wikipedia to build their own models of the world. This creates a really powerful feedback loop that you need to be aware of. If your organisation, your founder, or your products are accurately and authoritatively represented on public knowledge bases like Wikidata, you're providing clean, high-quality data that feeds directly into systems like Google's Knowledge Graph.
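To make steps two and three a bit more concrete, here's a minimal sketch of candidate generation and linking against Wikidata. The wbsearchentities call is a genuine public Wikidata API; the context-based scoring below is a deliberately crude toy added for illustration, nothing like the models Google actually uses:

import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

# Step two, part one: generate candidate entities for the mention we highlighted.
# Searching Wikidata for "Michael Jordan" typically returns several different people.
response = requests.get(
    WIKIDATA_API,
    params={
        "action": "wbsearchentities",
        "search": "Michael Jordan",
        "language": "en",
        "format": "json",
        "limit": 5,
    },
    headers={"User-Agent": "entity-linking-sketch/0.1 (example only)"},
)
candidates = response.json().get("search", [])

# Step two, part two: use the surrounding context (Chicago Bulls, Charlotte Hornets)
# to score each candidate by how well its short description matches that context.
context_clues = {"basketball", "nba", "bulls", "hornets", "player", "team"}

def context_score(candidate):
    description = candidate.get("description", "").lower()
    return sum(clue in description for clue in context_clues)

best = max(candidates, key=context_score, default=None)

# Step three: the winning candidate's "id" is its unique identifier (a Q-number),
# the "library card number" that this mention gets linked to.
if best:
    print(best["id"], "-", best.get("label"), "-", best.get("description"))

Which brings us back to that feedback loop.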
And in turn, when Google has high confidence in your entity, it might grant you a knowledge panel in the search results, which reinforces your notability. And notability is a key criterion for being included and maintained in Wikipedia in the first place. It's not easy to get into Wikipedia. So if you're working on smaller sites or with smaller clients, don't get hung up on trying to get into Wikipedia or these data sets; it can be really hard. But it's still really important to understand them, because, as we'll see later on, making sure that when you talk about an entity your content also names and connects the other related entities is really important for boosting the authority of your content. So don't think I'm saying that everybody has to get into Wikipedia. But if you're working with larger clients, or you are a larger client yourself, and you're already in Wikipedia or capable of getting in, then it's important that you get there and that you make sure what's there is correct.

What this means is that a truly comprehensive entity strategy has to extend beyond your own website. You need to actively ensure that you're accurately represented in these foundational knowledge bases. It's quite simply one of the most direct ways of injecting authoritative data into the very core of the ecosystem that both the search engines and the LLMs depend upon.

Wrapping up for today, what we've covered and learned is that machines read our content using a three-step process. Step one is named entity recognition, where they highlight the potential entities. Step two is entity disambiguation, not easy to say, where they use context to figure out exactly which entity you mean. And step three is entity linking, where they connect that entity to a universal knowledge base like Wikipedia or Wikidata.

That brings us to what I think would be great for you to do: go to Wikipedia or Wikidata and search for your brand, your founder and your main product. Does a page exist for them? If it does, is the information 100% accurate and well sourced? The answer will tell you a lot about how well the AI world currently understands you and your business, and it'll show you whether you've got a strong foundational entry in that universal library, or whether there's work to be done.

Next time, we're going to take a close look at the place where Google stores all of this linked information: its gigantic digital brain, the Knowledge Graph. We'll explore how it's built and how it directly impacts what users see in their search results. So until next time, keep optimising, stay curious, and remember: SEO is not that hard when you understand the basics.

SPEAKER_01:

Thanks for listening, it means a lot to me. This is where I get to remind you where you can connect with me and my SEO tools and services. You can find all the links I mentioned here in the show notes. Just remember, with all these places where I use my name, the Edd is spelled with two Ds. You can find me on LinkedIn and Bluesky, just search for Edd Dawson on both. You can record a voice question to get answered on the podcast, the link is in the show notes. You can try out my SEO intelligence platform, KeywordsPeopleUse, at keywordspeopleuse.com, where we can help you discover the questions and keywords people are asking online, group those questions and keywords into related groups so you know what content you need to build topical authority, and finally, connect your Google Search Console account for your sites so we can crawl and understand your actual content, find what keywords you rank for, and then help you optimise and continually refine your content with targeted, personalised advice to keep your traffic growing. If you're interested in learning more about me personally or looking for dedicated consulting advice, then visit www.edddawson.com. Bye for now, and see you in the next episode of SEO Is Not That Hard.
