Predicting and Scoring Links in Anatomical Ontology Mapping

This paper presents work performed in the area of automatic and semi-automatic ontology mapping. A method for inferring additional cross-ontology links while mapping anatomical ontologies is described, and the results of experiments performed with various external knowledge sources and scoring schemes are discussed.

Keywords: ontology; graph; directed acyclic graph; ontology mediation; ontology mapping; ontology merging; scoring scheme; probability; knowledge sharing; knowledge reuse; interoperability


I. INTRODUCTION
The term ontology comes from Philosophy and has been applied in Information Systems, Information Retrieval, etc. to denote the formalization of a body of knowledge describing a given domain. Ontologies have become increasingly popular because they help address many of the most challenging problems in the IT field, such as interoperability, information/knowledge sharing and knowledge reuse.
Information sources (and ontologies in particular), even from the same problem domain, are usually heterogeneous. In order to enable interoperation between such information sources (ontologies) and to integrate the information/knowledge from multiple sources, one needs to build mappings between ontologies. These mappings establish the semantic correspondence between concepts and relations in different ontologies.

As we have noted in [10], there are some terminological differences pertaining to the integration of ontologies within the ontology mapping/merging/matching (OM) community. These differences are mostly between the terminology adopted in [1] on one side, and in [11] on the other. In our work, we adopt the terminology of [1]. In the sense of [1], ontology mapping is the process of taking two input ontologies and generating semantic links between their concepts/terms. The generated links are not part of the two input ontologies; they are stored separately from them.

Two other terms are related to ontology mapping: ontology aligning and ontology merging. Ontology aligning [1] can be viewed as automatic or semi-automatic ontology mapping; it denotes the process of discovery of cross-ontology links by a computer program. Again, these links are stored separately from the two input ontologies. Ontology merging [1] is the ultimate goal when integrating/mediating two input ontologies; it comes down to taking two input ontologies and generating an output ontology that unifies the knowledge contained in them. It usually follows the mapping/aligning processes and utilizes the intermediate results produced by them; during merging, some pairs of terms (one from each of the two input ontologies) are merged into single nodes of the output ontology, while the remaining input terms are not paired but are copied unchanged to the output ontology.
This paper discusses some issues in the automatic mapping or aligning of species-specific anatomical ontologies through the utilization of various knowledge sources.

II. PROBLEM FORMULATION
Given the anatomical ontologies of two different species (model organisms), e.g. mouse and zebrafish, our goal is to establish semantic links between the terms of the two ontologies such that: (i) each link is of one of the following types: R1 = synonymy, R2 = hypernymy, R3 = hyponymy, R4 = holonymy, R5 = meronymy; and (ii) each link has some degree of certainty (degree of confidence, or confidence score) which is a real number in the interval [0, 1]. The semantic relation types Rk referred to here are well known and widely utilized in the areas of linguistics, knowledge representation and ontology engineering, so we do not provide formal or informal definitions for them here.
The two input ontologies are represented in the form of OBO files. OBO stands for "Open Biomedical Ontology" and denotes an ontology language and file format [2] for defining ontologies. It has been used mostly for defining ontologies in the biomedical domain. Nowadays OBO is adopted by the GO project [2], [3], the OBO Foundry initiative [4], and other communities.

III. FORMALIZATION OF THE PROBLEM
In mathematical terms, each of the two input anatomical ontologies can be considered a directed acyclic graph together with a function colouring the graph's edges. The colours model the relations defined within the input ontologies (like is a and part of) which we call inner-ontology relations. Typically, there are other inner-ontology relations besides those two. These additional relations usually pertain to the development of the particular organism and not just to its adult/gross anatomy. Examples of such relations are start stage, end stage and develops from, but in practice we do not deal with them, as we are mainly concerned with the organism's adult/gross anatomy, not with its growth and development. Further on, we shall use the following notation: O1 and O2 are the two input anatomical ontologies; DAG1 = (V1, E1) and DAG2 = (V2, E2) are their corresponding directed acyclic graphs; V1 and V2 are the sets of terms of the two input ontologies (each term has an identifier and a name); E1 and E2 are the relations defined within the two input ontologies; F1 and F2 are the edge-colouring functions, mapping the edges of E1 and E2 to a set of colours {c1, c2, ..., cn}. Two terms u1 and u2 are connected with an edge e if and only if the pair of terms (u1, u2) belongs to the relation represented by e.
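The formal model above can be sketched in code. The following is a minimal illustration (Python is used here purely for exposition; the class and identifier names are hypothetical and not taken from any actual implementation) of an ontology as a term set plus a colour-mapped edge set:

```python
# Hypothetical sketch of an input ontology O_j: a set of terms V_j plus
# an edge set E_j whose colouring function F_j maps each edge to a
# relation colour (c1 = is_a, c2 = part_of).
IS_A, PART_OF = "is_a", "part_of"

class Ontology:
    def __init__(self):
        self.terms = {}   # V_j: identifier -> name
        self.edges = {}   # E_j with F_j folded in: (child, parent) -> colour

    def add_term(self, term_id, name):
        self.terms[term_id] = name

    def add_edge(self, child_id, parent_id, colour):
        # the edge (u1, u2) belongs to the relation its colour represents
        self.edges[(child_id, parent_id)] = colour

# The example from the text (identifiers are illustrative):
o1 = Ontology()
o1.add_term("T1:0001", "brain")
o1.add_term("T1:0002", "central nervous system")
o1.add_edge("T1:0001", "T1:0002", PART_OF)   # brain part_of CNS
```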
The relations is a (specialization/generalization) and part of (membership/aggregation) are the two typical examples of inner-ontology relations defined within the ontologies O1 and O2. In our notation, we map relations to colours (through F1 and F2), and we deal with only these two relations, so it can be assumed that n = 2, c1 = is a, c2 = part of. Thus, if for example u1 = "brain" and u2 = "central nervous system", u1, u2 ∈ V1, then there usually exists an edge e between u1 and u2 such that F1(e) = part of (because the brain is part of the central nervous system, and anatomical ontologies of most organisms usually declare this fact explicitly).
Also given are several (typically large) external knowledge sources, which might be either biomedical or general-purpose. They contain anatomical terms and relations (is a, part of, others) between their own terms. Three concrete external knowledge sources have been used for the purposes of this work: T1 = UMLS, T2 = FMA, T3 = WordNet. UMLS [5], [14] and FMA [6], [15] are biomedical knowledge sources, while WordNet [7], [8], [16] is a general-purpose knowledge source. Formally stated, each of these knowledge sources Ts, s = 1, 2, 3, contains the following information:
• Terms. Ms = {ts1, ts2, ..., tsms} is the set of terms in the knowledge source Ts. Here tsk = (idsk; namesk); idsk is the identifier within Ts of the term tsk; namesk is the textual name within Ts of the term tsk; ms (usually 10^6 ≤ ms ≤ 10^7) is the number of terms in the knowledge source Ts.
• Relations. These are the is a and part of relations defined within the external knowledge source Ts. Typically, other relations are also defined within Ts, but only these two are relevant to our work.
Each knowledge source src = Ts, s = 1, 2, 3, is up-front assigned a score f(src) which is based on its preciseness in predicting synonymy and parent-child (is a, part of) relations between terms of the two input ontologies. Details on this evaluation (of the three knowledge sources that we use) can be found in [9].
With the notation introduced above, we now seek to find a set of predictions (a set of 4-tuples) D = {(v1k, v2k, rk, sk)}. Here, for each k, v1k is a term from the input ontology O1, v2k is a term from the input ontology O2, rk is an automatically (i.e. in silico) predicted cross-ontology link of one of the five types defined in the previous section, and sk is a real number denoting the confidence score of the prediction that the terms v1k and v2k are related/linked by a cross-ontology link of the type rk. By requiring that sk ∈ (0, 1], we basically imply that the set D which we seek is in fact a set of cross-ontology predictions, or a set of predicted cross-ontology links between O1 and O2, where each score is probability-based (modeling, given the information we have in the input ontologies and in the available knowledge sources, the probability that the corresponding prediction is actually true).
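As a concrete (hypothetical) data-structure reading of the above, each element of D can be held as a 4-tuple with a typed relation field and a score in (0, 1]; all names and identifiers below are illustrative:

```python
# Sketch of the prediction set D: 4-tuples (v1k, v2k, rk, sk).
from typing import NamedTuple

RELATION_TYPES = ("synonymy", "hypernymy", "hyponymy", "holonymy", "meronymy")

class Prediction(NamedTuple):
    v1: str    # term from O1
    v2: str    # term from O2
    r: str     # one of the five relation types R1..R5
    s: float   # confidence score, a real number in (0, 1]

D = {
    Prediction("T1:0001", "T2:0042", "synonymy", 1.0),
    Prediction("T1:0001", "T2:0043", "hypernymy", 0.85),
}

# e.g. select the links a curator might review first
confident = {d for d in D if d.s >= 0.9}
```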

IV. ALGORITHMIC PROCEDURES
Three algorithmic procedures are applied to the graph structures that were described formally in the previous section. Each of them adds more links to the set D that is being sought. These three procedures are detailed in [12]; here we mention them only briefly.
Within the first procedure, the two input ontologies are scanned for identity matches between the names of their terms. If t1 ∈ V1 and t2 ∈ V2 have the same name, they are marked as synonyms predicted by what we call the direct matching (DM) procedure. The cross-ontology links discovered/predicted this way are assigned the highest possible score of 1.0, as these predictions come from information contained entirely in the two input ontologies.
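Treating each set of terms as an {identifier: name} dictionary, DM reduces to an exact-name join. The sketch below (function name and identifiers hypothetical) assigns every match the maximal score of 1.0:

```python
def direct_matching(v1, v2):
    """DM sketch: exact-name matches between two {id: name} dicts become
    synonymy predictions with the highest possible score, 1.0."""
    by_name = {}
    for term_id, name in v1.items():
        by_name.setdefault(name, []).append(term_id)
    links = []
    for term_id2, name in v2.items():
        for term_id1 in by_name.get(name, []):
            links.append((term_id1, term_id2, 1.0))
    return links

o1 = {"MA:1": "brain", "MA:2": "forebrain"}       # illustrative identifiers
o2 = {"ZFA:1": "brain", "ZFA:2": "hindbrain"}
print(direct_matching(o1, o2))   # [('MA:1', 'ZFA:1', 1.0)]
```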
During the second procedure, using the information (the terms and the relations) in the external knowledge sources, and identity matches between term names of the two input ontologies and term names of the three external knowledge sources, we build a graph model/structure which aligns each of the two input ontologies to each of the three external knowledge sources. This model contains a set of semantic links (of the types Rk, k = 1, 2, ..., 5, that were defined above) between the two input ontologies on the one side, and the three external knowledge sources on the other side. Then a set of logical rules is applied, and conclusions are drawn for the semantic relations that exist between terms t1 ∈ V1 and t2 ∈ V2 of the two input ontologies. The following rules are applied at this stage:
• Rule (A). If t1 ∈ V1 and t2 ∈ V2 have been detected as synonyms of the same term t ∈ Ts (s = 1, 2, 3), then t1 and t2 are marked as predicted cross-ontology synonyms of each other;
• Rule (B). If tj ∈ Vj has been detected as a synonym of t ∈ Ts (s = 1, 2, 3), and if the term t3−j ∈ V3−j has been detected as an (is a/part of) child/parent of t, then tj is marked as a predicted cross-ontology (is a/part of) parent/child of t3−j (here j = 1 or j = 2).
The application of these rules is what we call the source matching predictions (SMP) procedure. Rule (A), when applied, finds the synonymy relations (i.e. the relations of type R1) between terms from the two input ontologies. Rule (B) is a composite (generalized) version of four separate rules (two options for is a/part of by two options for child/parent makes four options in total). These four rules originating from Rule (B), when applied, find the hypernymy, hyponymy, holonymy and meronymy relations (i.e. the relations of types R2, R3, R4, R5) between terms of the two input ontologies. All links predicted through SMP are given the score f(src), where src is the knowledge source confirming/implying the prediction.
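Rules (A) and (B) can be sketched as follows, assuming (hypothetically) that the alignment to a source Ts is given as two maps from input terms to the source terms they were detected as synonyms of, and that the source's is a (or part of) relation is a set of (child, parent) pairs; every emitted prediction carries the score f(src). Only one of the four variants of Rule (B) is shown:

```python
def smp_rules(syn1, syn2, source_child_parent, f_src):
    """SMP sketch. syn1: {t1 in V1 -> source term}; syn2: {t2 in V2 ->
    source term}; source_child_parent: (child, parent) pairs within T_s."""
    preds = []
    # Rule (A): t1 and t2 are synonyms of the same source term.
    for t1, s1 in syn1.items():
        for t2, s2 in syn2.items():
            if s1 == s2:
                preds.append((t1, t2, "synonymy", f_src))
    # One variant of Rule (B): t1 is a synonym of a source term t, and
    # t2 is a synonym of a child of t, so t1 is a predicted
    # cross-ontology parent of t2.
    for t1, t in syn1.items():
        for child, parent in source_child_parent:
            if parent == t:
                for t2, s2 in syn2.items():
                    if s2 == child:
                        preds.append((t1, t2, "parent_of", f_src))
    return preds

syn1 = {"MA:brain": "S:brain"}                      # illustrative data
syn2 = {"ZFA:brain": "S:brain", "ZFA:forebrain": "S:forebrain"}
children = {("S:forebrain", "S:brain")}
print(smp_rules(syn1, syn2, children, 0.9))
```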
Finally, we run a procedure that we denote as the child matching predictions (CMP) procedure. It tries to find R1, R2, R3, R4 and R5 links between terms of the two input ontologies, t1 ∈ V1 and t2 ∈ V2, for which no links have been predicted either by DM or by SMP. The approach CMP takes is to consider patterns of cross-ontology connectivity (found by DM and SMP) between t1 ∈ V1 (parent term 1), t2 ∈ V2 (parent term 2), and the child terms of the two parent terms t1 and t2. Three separate kinds of connectivity patterns are considered by CMP:
• t1 ←− tch1 ←→ tch2 −→ t2 (we call this a U Pattern);
• t1 ←− tch2 ←→ tch1 −→ t2 (we call this an X Pattern);
• t1 ←− tch1 ←→ t2 and t1 ←→ tch2 −→ t2 (we call these two patterns V Patterns).
In this notation, the −→ and ←− arrows denote sets of non-CMP parent-child links (the arrows always point from child to parent); these are asymmetrical links. The ←→ arrows denote sets of non-CMP synonymy links; these are symmetrical links. The terms tch1 and tch2 are child terms from the two input ontologies. Each occurrence of any of these patterns between t1 and t2 (the two parent terms) is called a pattern instance. All arrows within one pattern instance represent either is a or part of links (we do not allow mixing the two within a single pattern instance).
Based on these patterns of connectivity, new cross-ontology links (CMP links) are introduced (one CMP link per pattern instance) between t1 and t2. We call these links individual CMP links. To assign scores to the individual CMP links, the concepts score of a set of non-CMP links between two terms and score of a pattern instance (or score of an individual CMP link) are defined below. Also, we introduce two functions, Conj and Disj, with N ≥ 2 parameters each, which, provided that the probabilities p1, p2, ..., pN of N events are given, define the probabilities of (i) all these events occurring at the same time (Conj), and (ii) at least one of these events occurring (Disj). We call Conj and Disj accumulation functions, as they accumulate scores of non-CMP links to produce a score of an individual CMP link. Finally, all individual CMP links between t1 and t2 are aggregated through what we call an aggregation function (which can be e.g. the max of N ≥ 1 numbers). Next, we define in some more detail the concepts just introduced in relation to CMP.

Definition 3 (score of a non-CMP link): The score of a non-CMP link sij between any two terms (which could be from the same ontology or not) is defined as follows: score(sij) = I if sij is an IO link; score(sij) = D if sij is a DM link; score(sij) = f(src) if sij is an SMP link which came from the source src ∈ {UMLS, FMA, WordNet}.
Here IO stands for inner-ontology, DM stands for direct matching, and SMP stands for source matching predictions; sij is one single non-CMP link (i.e. one single piece of evidence); I and D are constants (typically having the value of 1.0).

Definition 4 (score of a set of non-CMP links): The score of a set of non-CMP links Si (the score of an evidence set) is defined as score(Si) = Disj over the scores of the links sik ∈ Si, where Disj is the function from Definition 2, the sik are non-CMP (i.e. either IO or DM or SMP) links, and the Disj is taken over all non-CMP links taking part in the evidence set Si.
Definition 5 (score of an individual CMP link): The score of an individual CMP link e is defined as score(e) = p · Conj(score(S1), score(S2), ..., score(Sn)), where p ∈ [0, 1] is a CMP penalty constant, Conj is the function from Definition 1, and the Conj is taken over all evidence sets Si that take part in the pattern instance which the link e originates from (note that n = 2 for the V patterns and n = 3 for the X and U patterns).
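Definitions 3-5 compose as follows. The sketch below uses the standard independence-based choices for the accumulation functions (Conj as a product, Disj as a probabilistic sum); the function names and the penalty value are illustrative, not prescribed by the text:

```python
from functools import reduce

def conj(*ps):
    # probability that all N events occur (independence assumption)
    return reduce(lambda a, b: a * b, ps)

def disj(*ps):
    # probability that at least one of N events occurs
    return 1.0 - reduce(lambda a, b: a * (1.0 - b), ps, 1.0)

def evidence_set_score(link_scores):
    # Definition 4: Disj over the scores of all non-CMP links in the set
    return disj(*link_scores)

def individual_cmp_score(evidence_sets, p=0.8):
    # Definition 5: penalty p times Conj over the pattern's evidence-set
    # scores (n = 2 for V patterns, n = 3 for U and X patterns);
    # p = 0.8 is an illustrative value for the CMP penalty constant
    return p * conj(*(evidence_set_score(s) for s in evidence_sets))

def aggregate(individual_scores):
    # Definition 6: max is one possible aggregation function F_agg
    return max(individual_scores)

# A V-pattern instance: one IO link (score I = 1.0) and one DM link (1.0)
print(individual_cmp_score([[1.0], [1.0]]))   # 0.8
```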
Definition 6 (aggregation function): Let K be the number of all individual CMP links drawn between two terms t1 and t2. An aggregation function is a known function Fagg which takes the scores of all these K individual CMP links and produces a single number PCMP(t1, t2) ∈ [0, 1], which we call the score of the aggregated (final) CMP link drawn between t1 and t2.
As a final result of the CMP procedure, this aggregated CMP link is drawn between any two terms t1 and t2 for which at least one pattern instance (of any of the three kinds U, X, V) is found. The score of this link is calculated in the way shown above.

V. SCORING SCHEMES
We have produced several distinct scoring schemes by varying the functions Conj, Disj and F agg which were defined above.
F agg from scoring scheme #1 corresponds to the probability of the union of two events such that one is completely dependent on the other.F agg from scoring scheme #2 coincides with Disj from the same scoring scheme, which equals the probability of the union of two independent events.
Therefore in a probabilistic model the expression s 1 + s 2 − s 1 s 2 is a good choice for combining two independent scores, while max(s 1 , s 2 ) is a good choice for combining scores when one score is completely dependent on the other.
In scoring scheme #3 we design a scoring function whose values lie between the values of the first two scoring functions (#3 is a linear combination of #1 and #2): Disj(s1, s2) = α · max(s1, s2) + (1 − α)(s1 + s2 − s1s2) (3b). The main objective behind this third scoring function is to account for the dependencies between the knowledge sources (UMLS, FMA, WordNet) without completely ignoring the fact that, if more than one of them confirms a certain prediction, that usually improves the odds that the prediction is correct. In scheme #3, α ∈ [0, 1] is the parameter of the linear combination defined in (3b). It varies depending on the knowledge source, or the combination of knowledge sources, which confirm the predictions whose scores are accumulated in (3b). The α parameter acts as a buffer preventing the score from growing too fast when accumulating cumulative predictions (i.e. when the predictions being accumulated are confirmed by several knowledge sources): when α equals 0.0, the value grows the quickest (as it should for independent scores); when α equals 1.0, the value is limited by the maximum of the scores being accumulated.
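The three two-argument accumulation choices can be compared side by side. The sketch below assumes (3b) is the linear combination α · max(s1, s2) + (1 − α)(s1 + s2 − s1s2), which matches the behaviour described for α = 0.0 and α = 1.0; the function names are illustrative:

```python
def disj_dependent(s1, s2):
    # union of two events where one is completely dependent on the other
    return max(s1, s2)

def disj_independent(s1, s2):
    # union of two independent events
    return s1 + s2 - s1 * s2

def disj_scheme3(s1, s2, alpha):
    # linear combination of the two; alpha buffers the growth of the score
    return alpha * max(s1, s2) + (1.0 - alpha) * (s1 + s2 - s1 * s2)

# alpha = 0.0 grows the fastest; alpha = 1.0 is capped by the maximum
print(disj_scheme3(0.6, 0.6, 0.0), disj_scheme3(0.6, 0.6, 1.0))
```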
To experimentally show that the choice of Disj from (3b) is a reasonable one, we generated a set of observations of two dependent random variables x1, x2 with Boolean (1/0, i.e. true/false) values, and we confirmed that if we substitute the scores s1 and s2 in (3b) with the probabilities P(xi = true) (i = 1, 2), and α with the modulus of the correlation coefficient between the two random variables, we obtain a very good approximation of the probability P(z = true) of their Boolean disjunction z = (x1 or x2).
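A small simulation of this check can be sketched as follows (the sampling scheme and its parameters are illustrative; the original experiment's data are not reproduced here): two dependent Boolean variables are drawn, and the (3b)-style combination with α = |correlation coefficient| is compared against the empirical P(z = true):

```python
import math
import random

random.seed(0)
n = 100_000
samples = []
for _ in range(n):
    a = 1 if random.random() < 0.4 else 0
    # x2 copies x1 half the time, otherwise is drawn independently
    b = a if random.random() < 0.5 else (1 if random.random() < 0.4 else 0)
    samples.append((a, b))

p1 = sum(a for a, _ in samples) / n
p2 = sum(b for _, b in samples) / n
p_or = sum(1 for a, b in samples if a or b) / n     # empirical P(z = true)

cov = sum((a - p1) * (b - p2) for a, b in samples) / n
alpha = abs(cov / math.sqrt(p1 * (1 - p1) * p2 * (1 - p2)))
approx = alpha * max(p1, p2) + (1 - alpha) * (p1 + p2 - p1 * p2)
print(abs(p_or - approx))    # small: the combination tracks P(x1 or x2)
```

Incidentally, when the two variables have equal marginal probabilities, this combination reproduces P(x1 or x2) exactly, which is consistent with the good approximation reported.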

VI. RESULTS AND DISCUSSION
Let us consider the following two figures, which illustrate how the scores generated by the three scoring schemes are related to each other and demonstrate the advantages of scheme #3. It can be seen in Fig. 1 that the data in scheme #1 appear clustered around the configured values for the knowledge source scores (and combinations of these), because there is nothing to account for the amount of available evidence gathered from each source (e.g. the number of patterns confirming a prediction). Compared to scheme #1, both schemes #2 and #3 scatter the clusters, because the Fagg values grow when more patterns confirm a given prediction. As Fagg in scheme #3 is limited through the α parameter, it causes a more moderate scattering, as seen in Fig. 1, while scheme #2 causes a very rapid increase.
The main advantage of scheme #3 is that it allows us to control the speed at which additional patterns increase the score, while scheme #2 gives control only over the initial value of that score. Within scheme #2, with one pattern confirming the prediction, the scores start somewhere around the configured CMP score value (defined by the penalty constant) and grow at the same speed up to 1.0. Within scheme #3 this growth can be slowed down and controlled through the α parameter. The difference between schemes #2 and #3 can be seen in Fig. 2 in red, and it clearly shows how easily some scores approach the value 1.0 when scheme #2 is used. The Disj function from scheme #3 also has a softening effect on the score when multiple knowledge sources and algorithmic procedures (DM, SMP, CMP) confirm the prediction, because it allows us to control the speed at which the score grows and even to use the actual correlation coefficient between the distinct knowledge sources. This is not directly visible in the figures at this scale, because it largely produces local shifts in the positions of the clusters and has the biggest effect on the data predicted by the knowledge sources (SMP), which constitute the cluster around score = 1.0.

VII. CONCLUSION
In this paper we presented an original algorithmic approach to inferring (predicting and scoring) cross-ontology links within the automatic mapping of distinct species-specific anatomical ontologies. The full mapping procedure assumes that the auto-generated set of predictions will be carefully checked by a curator (a human anatomy expert) and that his/her input will be utilized to accurately calculate the correlation coefficients between certain pairs of knowledge sources. These correlation coefficients could then be used as values for the α parameters of the scoring scheme. The procedures described briefly here and detailed in [12], and the scoring schemes introduced here, are utilized in the software program AnatOM [10], [13], developed as part of our work on semi-automatic mapping and merging of anatomical ontologies.

Definition 1 (Conj): Conj is a function which takes N arguments (each of them in [0, 1]) and returns a result in [0, 1]. We discuss a possible implementation for it below.

Definition 2 (Disj): Disj is a function which takes N arguments (each of them in [0, 1]) and returns a result in [0, 1]. We discuss possible implementations for it below.