Developing a Knowledge Graph Schema

How do we pioneer a technology that automates the transformation of scientific literature into a knowledge graph? We’ve set an ambitious goal: our system should be capable of processing any scientific article that meets the basic criteria for publication. This breadth of scope simplifies the selection of input texts, as they can be chosen randomly for the automated process. However, the challenge intensifies on the output side: we must design a schema versatile enough to encapsulate the information from any conceivable scientific article.

Having set this goal, it becomes clear that our graph schema must be extremely adaptable, as the information structures in scientific articles can vary greatly. Information is diverse, and written language, using only a simple set of grammatical rules, has become a universal tool for representing this diversity. We believe we can endow our knowledge graph schema with similar capabilities by carefully crafting a new syntax based on entities and relations instead of punctuation and spacing. But this only creates the framework for representing information, so we are left with the question of how to fill that framework. We can look to language again: the versatility of language comes from its vast vocabulary, so we aim to mirror that vocabulary in the graph schema. Essentially, the schema, like language, will be constructed using a straightforward syntax enhanced by an extensive vocabulary.

We also aim for the graph to be intuitive and user-friendly, since its purpose as a search engine can only be met when users can operate it without extensive training. The main pillars for achieving this are the simplicity of its syntax and its alignment with natural language vocabulary.

Moreover, we must ensure that our graph schema is adaptable, keeping pace with changes in information and practices over time.

It is also essential to minimize redundancy and maximize connectivity. Our goal is to structure the data so that information is connected via the shortest possible paths, which may involve introducing new types of edges through an algorithm based on logical principles.
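One way to picture such an algorithm is a small composition table that says which pairs of consecutive edges may be collapsed into a single shortcut edge. The sketch below is only an illustration under assumptions: the edge types, the `COMPOSE` table, and the `add_shortcuts` helper are hypothetical and are not part of the actual schema.

```python
# Hypothetical composition table: (first edge type, second edge type) -> shortcut edge type.
COMPOSE = {
    ("part_of", "part_of"): "part_of",         # part_of is treated as transitive
    ("studies", "part_of"): "studies_within",  # a new, derived edge type
}

# Example edges as (source, edge type, target); invented for illustration.
edges = {
    ("neuron", "part_of", "cortex"),
    ("cortex", "part_of", "brain"),
    ("paper_1", "studies", "neuron"),
}

def add_shortcuts(edges, compose):
    """Repeatedly compose consecutive edges until no new shortcut edge appears."""
    result = set(edges)
    changed = True
    while changed:
        changed = False
        new_edges = set()
        for a, r1, b in result:
            for b2, r2, c in result:
                if b == b2 and (r1, r2) in compose:
                    candidate = (a, compose[(r1, r2)], c)
                    if candidate not in result:
                        new_edges.add(candidate)
        if new_edges:
            result |= new_edges
            changed = True
    return result

closed = add_shortcuts(edges, COMPOSE)
```

In this toy graph, `paper_1` becomes directly connected to `brain` through a `studies_within` edge instead of a two-hop path. Storing every shortcut would work against the goal of minimizing redundancy, so in practice such edges would more plausibly be inferred on demand rather than materialized.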

Controlling the DB size explosion with TypeDB ‘rules’

TypeDB supports rule-based reasoning over the graph. This means that at query time the DB automatically retrieves the rules that affect the query and infers the relations necessary to compute the output. We use this feature to eliminate commonsense relations that are stored as overly specific information. For example, given ‘buying food implies a cost’ and ‘buying a sausage implies a cost’, we can detect such redundancies and, through crowdsourced intervention, control the size explosion by removing these relations and replacing them with a single one that means ‘buying anything implies a cost’.
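The detection step can be sketched in plain Python, independent of TypeDB. This is a minimal illustration under assumptions: the `PARENT` type hierarchy, the triple layout, and the helper names are hypothetical, and the code does not use the TypeDB API or rule syntax.

```python
from collections import defaultdict

# Hypothetical type hierarchy: child type -> parent type.
PARENT = {"sausage": "food", "food": "anything"}

# Stored relations as (action, object type, implied concept) triples.
facts = {
    ("buy", "food", "cost"),
    ("buy", "sausage", "cost"),
    ("eat", "sausage", "satiety"),
}

def ancestors(type_name):
    """Yield every supertype of type_name by walking the PARENT map."""
    while type_name in PARENT:
        type_name = PARENT[type_name]
        yield type_name

def find_overspecified(facts):
    """Return, per (action, implied) pair, the object types already covered by a supertype."""
    groups = defaultdict(set)
    for action, obj, implied in facts:
        groups[(action, implied)].add(obj)
    candidates = {}
    for key, objs in groups.items():
        redundant = {o for o in objs if any(a in objs for a in ancestors(o))}
        if redundant:
            candidates[key] = redundant
    return candidates

def generalize(facts, action, implied, general_type="anything"):
    """Replace all relations for (action, implied) with one general relation."""
    kept = {f for f in facts if (f[0], f[2]) != (action, implied)}
    kept.add((action, general_type, implied))
    return kept

candidates = find_overspecified(facts)
compacted = generalize(facts, "buy", "cost")
```

Here `find_overspecified` only flags candidates; the call to `generalize` stands in for the crowdsourced decision to replace them with the general relation.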

Rules detect contradictions
- A similar example is the detection of incorrect information such as ‘mammals are carnivores’. It is detected automatically, since some members of the mammal type have a relation that raises a contradiction: ‘elephants are not carnivores’.
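The contradiction check can be illustrated with a short Python sketch. It assumes a single-level, hypothetical `IS_A` membership map and claims stored as (subject, property, truth value) triples. It is a simplification for illustration, not TypeDB's reasoning engine.

```python
# Hypothetical membership map: member -> type.
IS_A = {"elephant": "mammal", "lion": "mammal"}

# Claims as (subject, property, truth value); invented for illustration.
type_claims = {("mammal", "carnivore", True)}
member_claims = {
    ("elephant", "carnivore", False),
    ("lion", "carnivore", True),
}

def find_contradictions(type_claims, member_claims, is_a):
    """Return (member, conflicting type claim) pairs where a member contradicts its type."""
    conflicts = []
    for member, prop, value in sorted(member_claims):
        parent = is_a.get(member)
        for t, t_prop, t_value in sorted(type_claims):
            if t == parent and t_prop == prop and t_value != value:
                conflicts.append((member, (t, t_prop, t_value)))
    return conflicts

conflicts = find_contradictions(type_claims, member_claims, IS_A)
```

The elephant claim conflicts with the type-level claim about mammals, so the type-level claim is flagged for review, while the lion claim raises nothing.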