Knowledge graphs: Encyclopaedias for machines

by Luciano Del Corro in Natural Language Understanding

The term “knowledge graph” is becoming increasingly popular in the field of Artificial Intelligence (AI). The term itself is relatively new, but knowledge graphs have been around for decades: they initially emerged from the database community and were commonly referred to as ontologies or knowledge bases. This post gives an initial overview of knowledge graphs, their main components, and an insight into how fundamental they are for empowering intelligent systems.

A knowledge graph can be intuitively described as an encyclopedia for machines. It consists of a formal description of certain knowledge that can be accessed and reasoned about by computers. Just as a book contains knowledge meant to be read by humans, a knowledge graph is formatted specifically to be understood by computers.

Nowadays, knowledge graph construction is one of the hottest topics in AI. Its ultimate goal is to translate human-generated knowledge (predominantly available in textual form: books, news, scientific papers, etc.) into formal representations that can be easily understood by machines.

You might not have noticed yet, but knowledge graphs are already part of your life, either letting you search for information directly in your preferred search engine or working in the background of your favorite applications. Apple’s Siri, Microsoft’s Cortana, Amazon Echo, and Google Now, for example, rely heavily on knowledge graphs to fulfill your requests. Many of the most popular specialized websites, such as IMDb or TripAdvisor, are built on top of knowledge graphs so they can be read by humans and computers alike. If you are a Facebook or Twitter user, you are yourself part of a knowledge graph that algorithms use to suggest new contacts, posts, or ads to you.

As mentioned, what ultimately makes a knowledge graph distinct from any other information source is that it can be accessed directly by machines without any human intermediation. Since computers only understand formal language, the information contained in a knowledge graph is usually referred to as structured data, as opposed to unstructured data (like plain text), which only humans can readily understand.

From a more technical perspective, a fundamental trade-off must be considered when designing a knowledge graph. On one hand, a knowledge graph should be expressive enough that complex knowledge can be encoded; on the other hand, its representation should be simple enough that a computer can process it quickly. An important branch of AI is devoted to this question, which raises major mathematical, philosophical, and engineering challenges.

There are multiple definitions and descriptions of a knowledge graph according to the needs of the particular applications. For the purpose of this article, we will focus on a representation based on entities, relations, and facts.

Named entities, relations, facts and classes

Probably, the most popular approach to computationally encode knowledge is through the so-called entities and relations. In this view, a knowledge graph is just a set of entities (Lionel Messi, Argentina, etc.), a set of relations between those entities (<plays_for>, <was_born_in>), and a set of facts. Facts are the combination of the previous two (<Messi, plays_for, Argentina>).
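This view can be sketched in a few lines of code. The following is a minimal illustration, not a real knowledge graph API: it stores facts as (subject, relation, object) triples in a Python set, with invented entity and relation names.

```python
# A minimal sketch of a knowledge graph as a set of
# (subject, relation, object) triples. Names are illustrative only.
facts = {
    ("Lionel_Messi", "plays_for", "Argentina"),
    ("Lionel_Messi", "was_born_in", "Rosario"),
}

def has_fact(subject, relation, obj):
    """Check whether a given fact is contained in the graph."""
    return (subject, relation, obj) in facts

print(has_fact("Lionel_Messi", "plays_for", "Argentina"))  # True
```

Real systems store triples in dedicated databases (triple stores), but the underlying idea is the same: knowledge is a collection of entity–relation–entity statements.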

Entities are mostly persons, organizations, locations or products. These are usually called named entities because they refer to real world objects (physical or abstract) that bear a name. Barack Obama, Hawaii, Greece, Batman, or iPhone 7 are examples of named entities.

Since named entities are ambiguous (multiple named entities can share the same name), each named entity in the knowledge graph must be uniquely identified, so that the knowledge graph is aware that they are different. For instance, consider the case of the US President Barack Obama, who has the same name as his father. The knowledge graph must then use a distinct and unique ID for each of them: for example, Barack_Obama_463 for the president and Barack_Obama_732 for his father.

Relations join entities together. One can think of relations as verbs or verbal phrases like <was_born_in>, <graduated_from>, <plays_for>, or <acted_in>. Each relation must be unique, have a precise meaning, and have a given scope, in the sense that it can only join specific classes of entities (<was_born_in> only involves persons and locations; <acted_in> only relates actors with movies, series, or stage plays).
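The notion of scope can be made concrete with a small validity check. The schema below is invented for illustration; real knowledge graphs express such constraints in ontology languages like RDFS or OWL rather than in application code.

```python
# Hedged sketch: enforcing the scope of a relation.
# Each relation maps to the (subject class, object class) it may join.
SCHEMA = {
    "was_born_in": ("person", "location"),
    "acted_in": ("actor", "movie"),
}

# Illustrative class assignments for a handful of entities.
ENTITY_CLASS = {
    "Barack_Obama": "person",
    "Hawaii": "location",
}

def is_valid_fact(subject, relation, obj):
    """Accept a fact only if both arguments match the relation's scope."""
    if relation not in SCHEMA:
        return False
    domain, value_range = SCHEMA[relation]
    return (ENTITY_CLASS.get(subject) == domain
            and ENTITY_CLASS.get(obj) == value_range)

print(is_valid_fact("Barack_Obama", "was_born_in", "Hawaii"))  # True
print(is_valid_fact("Hawaii", "was_born_in", "Barack_Obama"))  # False
```

Checks like this let a knowledge graph reject nonsensical facts before they are stored.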

A fact is formed by joining entities through relations. For instance, <Obama, was_born_in, Hawaii> is a fact about entities Barack Obama and Hawaii, joined by the relation <was_born_in>, describing that the US president was born in the US state of Hawaii. Ultimately, the knowledge contained in a knowledge graph is represented by the set of facts it contains.

One useful characteristic of named entities is that they are categorizable. For instance, Barack Obama belongs to the classes president, Nobel Prize laureate, lawyer, etc. Scarlett Johansson is an American actress and director. Classes are extremely useful for analyzing textual data.
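A single entity can belong to several classes at once, which the following toy mapping illustrates (the class assignments here are examples, not drawn from any real graph):

```python
# Illustrative sketch: entities can belong to multiple classes.
CLASSES = {
    "Barack_Obama": {"person", "president", "lawyer"},
    "Scarlett_Johansson": {"person", "actor", "director"},
}

def entities_of_class(cls):
    """Return all entities that belong to the given class."""
    return {entity for entity, classes in CLASSES.items() if cls in classes}

print(entities_of_class("person"))
```

Queries like “all presidents” or “all actors” reduce to exactly this kind of class lookup.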

As a real-world example, have a look at the IMDb knowledge graph, meant to be used by people and computers alike. Actors, directors, writers, and films are the entities, while <acted_in> and <writer_of> are some of the relations. You can see the facts on each entity page. For instance, on Scarlett Johansson’s page, you will see all the movies she appeared in and where she was born, among other facts.

Why are knowledge graphs useful?

As mentioned, a knowledge graph is the knowledge available to the machine, which, combined with some reasoning capabilities, can power intelligent applications. In principle, a knowledge graph can be pictured as an oracle that a computer can ask anything: a nearby restaurant, a list of US presidents, who Angelina Jolie is, where she was born. The extent to which an application can answer such questions depends on the quality of the knowledge graph and the type of information it contains.

The range of questions (usually referred to as queries) that can be asked of a knowledge graph is broad. A query can involve any combination of relations, entities, classes, or facts. If the knowledge graph is relatively complete, it can provide high-quality answers in a fraction of a second.
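A toy version of such querying can be written as pattern matching over triples, with `None` acting as a wildcard. This is a simplification of what query languages like SPARQL do over real triple stores; all names below are illustrative.

```python
# A toy query over a triple store: None in any position acts as a wildcard.
FACTS = [
    ("Barack_Obama", "was_born_in", "Hawaii"),
    ("Angelina_Jolie", "was_born_in", "Los_Angeles"),
    ("Angelina_Jolie", "acted_in", "Maleficent"),
]

def query(subject=None, relation=None, obj=None):
    """Return all facts matching the pattern; None matches anything."""
    return [
        (s, r, o)
        for s, r, o in FACTS
        if (subject is None or s == subject)
        and (relation is None or r == relation)
        and (obj is None or o == obj)
    ]

# "What do we know about Angelina Jolie?"
print(query(subject="Angelina_Jolie"))
# "Who was born in Hawaii?"
print(query(relation="was_born_in", obj="Hawaii"))
```

Combining several such patterns yields joins, which is how more complex questions (“movies starring people born in Los Angeles”) are answered.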

At this point, we would recommend you take some time to watch the following video, which shows the IBM AI Watson competing in (and winning) the question answering TV show Jeopardy!. You can think of the IBM system as an AI on top of a knowledge graph that translates human-language questions into semantic graph queries, and then translates the graph answers back into human language.

Another common use for knowledge graphs is entity linking: the task of finding named entities mentioned in text and linking them to their entries in a knowledge graph (see our blog post here). The knowledge graph is central in this case because it ultimately determines the collection of named entities that can be recognized in text.
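A deliberately simplified sketch of entity linking is a dictionary lookup of surface strings against knowledge-graph identifiers. Real systems additionally disambiguate between candidate entities sharing a name; the dictionary and IDs below are invented for illustration.

```python
# Hypothetical mapping from surface forms to knowledge-graph entity IDs.
NAME_TO_ID = {
    "Barack Obama": "Barack_Obama_463",
    "Hawaii": "Hawaii_1",
}

def link_entities(text):
    """Return (surface form, entity ID) pairs for names found in the text."""
    return [(name, eid) for name, eid in NAME_TO_ID.items() if name in text]

print(link_entities("Barack Obama was born in Hawaii."))
```

Even this naive matcher shows why the graph matters: only names present in the graph can ever be linked.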


Existing knowledge bases

Knowledge graphs exist for a wide range of domains, such as pharmaceuticals, scientific publications, cinema, business, and travel, as well as for general knowledge, like those generated from Wikipedia. This last group includes the three most prominent general knowledge graphs to date: YAGO, DBpedia, and Wikidata. The first two are built automatically, while the third is curated manually.

Ambiverse’s knowledge graph is based on YAGO, which was part of the IBM Watson system that defeated Jeopardy! champions. YAGO is constructed automatically from Wikipedia: the named entities correspond to Wikipedia pages, and the relations are extracted from Wikipedia infoboxes.

YAGO contains almost 17 million entities, grouped into 570 thousand categories, and 150 million facts. It is part of the linked-data network and is one of the most widely used knowledge graphs in industry and academia. You can have an overview of YAGO here.

If you would like to read more about knowledge graphs, you can have a look at the following references:

  • “YAGO: a multilingual knowledge base from Wikipedia, Wordnet, and Geonames”, Thomas Rebele, Fabian M. Suchanek, Johannes Hoffart, Joanna “Asia” Biega, Erdal Kuzey, Gerhard Weikum. International Semantic Web Conference (ISWC), 2016.
  • “DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia”, Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, Christian Bizer. Semantic Web Journal 6 (2): 167–195, 2015.
  • “Wikidata: a free collaborative knowledgebase”, Denny Vrandečić, Markus Krötzsch. Communications of the ACM 57 (10): 78–85, 2014.
Luciano Del Corro

Chief Innovation Officer & Co-Founder at Ambiverse
Luciano has the ability to solve language understanding problems in a principled but tractable manner, making him the key person for bringing research to applications. He completed a PhD in natural language understanding at the Max Planck Institute for Informatics.