HopRank: How Semantic Structure Influences Teleportation in PageRank (A Case Study on BioPortal)

This paper introduces HopRank, an algorithm for modeling human navigation on semantic networks. HopRank leverages the assumption that users know or can see the whole structure of the network. Therefore, besides following links, they also follow nodes at certain distances (i.e., k-hop neighborhoods), and not at random as suggested by PageRank, which assumes only links are known or visible. We observe such preference towards k-hop neighborhoods on BioPortal, one of the leading repositories of biomedical ontologies on the Web. In general, users navigate within the vicinity of a concept, but they also “jump” to distant concepts, albeit less frequently. We fit our model on 11 ontologies using the transition matrix of clickstreams, and show that semantic structure can influence teleportation in PageRank. This suggests that users, to some extent, utilize knowledge about the underlying structure of ontologies, and leverage it to reach certain pieces of information. Our results help the development and improvement of user interfaces for ontology exploration.


INTRODUCTION
Ontology Engineering and Ontology Learning are two branches of the Semantic Web whose aim is to accurately build and curate ontologies. The former studies new techniques to improve collaboration among humans while editing ontologies [26,29], and the latter introduces new methodologies and algorithms to automatically create ontologies by crawling the Web [5,25]. These efforts represent significant advances in the development of knowledge bases, which represent facts about the real world (e.g., people, diseases). However, there is little knowledge about how users consume such ontologies on the Web. To this end, Walk et al. studied how users browse BioPortal [28]. Their findings suggest that some ontologies influence the way users interact with the website. However, how users navigate through the ontology structure (i.e., from one concept to another) remains unclear.
Problem Statement: In this paper, we study the influence of semantic structure on teleportation (i.e., jumping to any node chosen at random) in PageRank. For example, consider the ontology shown in Figure 1(a), where nodes represent classes (a.k.a. concepts) and edges isASubClassOf relationships. On BioPortal, ontologies are shown vertically as hierarchical trees, and concepts can be explored using the expand-on-demand principle. This means that only top level concepts are shown first, and then users are able to expand and collapse as many concepts as they need at any level of the ontology. In other words, users can use and therefore are potentially aware of a virtually fully connected network in all stages of navigation. Previous studies [21,31] have modeled user navigation using PageRank. However, these assume that navigation paths are constrained by links and random teleportation. In our scenario, where the whole structure of an ontology can be visualized at any time, we believe that teleportation is not fully random, but rather biased towards k-hop neighborhoods.
Approach: Motivated by previous studies on information foraging [8-10, 22], decentralized search [16,18], and PageRank [4,14,15,21,31], we propose HopRank, a method for modeling transitions across k-hop neighborhoods on semantic networks. The key idea of this work relies on the HopPortation vector $\vec{\beta}$, which defines the probabilities of transitioning to each k-hop neighborhood. From the PageRank point of view, we can say that teleportation is not fully random, and the probability of following the structure of a page is not based only on one parameter (i.e., probability of following links), but on k parameters, representing all k-hop neighborhoods reached from the current page. Technically, we pass the HopPortation vector to a random walker to make biased decisions on which neighborhood to go next. Once this decision is made, the random walker uniformly chooses a concept within that neighborhood.
Contributions: The contributions of this paper are:
(1) We empirically show how users leverage the structure of the ontologies on BioPortal by quantifying the proportion of transitions per k-hop neighborhood.
(2) We propose HopRank, an algorithm for modeling human navigation on semantic networks.
(3) We demonstrate that HopRank outperforms traditional navigation and popularity-based models on BioPortal, especially when users browse ontologies directly without search.
(4) We make an implementation of this approach openly available on the Web [11].

RELATED WORK
BioPortal provides users with a tree-like explorer and a local search engine to navigate ontologies. In addition, concepts can be expanded on demand to see their children nodes. Although these functionalities are exploited differently across ontologies [28], it is unclear how users navigate through the ontology structure. Thus, this section covers previous work on search and navigation on networks.
Search. Information Foraging [22] assumes that people, when possible, modify their strategies or the structure of the environment to maximize their gain of valuable information. These patterns are also found in the way humans recall information from memory [17]. Similarly, berrypicking [6], a model of online searching, states that queries are not static, but rather evolve, and users commonly gather information in pieces instead of in one large set.
Navigation. PageRank [21] is the most popular method to measure the importance of web pages based on their incoming and outgoing links. It relies on an imaginary surfer who is randomly clicking on links, and eventually jumps to any node in the network. The probability of following links is given by a damping factor. Multiple variations have been proposed for improving information retrieval systems, e.g., a biased PageRank [15] to capture the importance of a page more accurately by taking topics into account, or a weighted PageRank [31] to assign larger rank values to more popular pages (i.e., preferential attachment) instead of distributing the rank value of a page uniformly to all outgoing pages. Geigl et al. suggest that the behavior of a random surfer is very similar to that of real users, as long as they do not use search engines [13]. They also find that classical navigation structures, such as navigation hierarchies or breadcrumbs, only exercise limited influence on navigation. Experiments in [24] reveal that memory-less Markov chains represent a quite practical model for human navigation on a page level. However, this assumption is violated when the analysis is expanded to a topical level. Helic et al. identify certain configurations of decentralized search that are capable of modeling human navigation in information networks [16]. Their findings suggest that navigation on such networks is a two-phase process that combines the exploitation of the known (i.e., goal-seeking) and the exploration of the unknown (i.e., orientation).
User Interfaces. Human navigation has also been studied for enhancing interfaces. For instance, [12] explores fisheye views to display large information structures such as programs and databases. The intuition behind this paradigm is that users often explore their neighborhood, and distant major landmarks in more detail. Similarly, Van Ham and Perer studied the search, show context, expand on demand browsing model in [27], and proposed techniques to design better graph visualization tools.
We propose HopRank, a biased random walker, to model navigation on semantic networks. HopRank builds upon insights from information foraging [17,22], decentralized search [16,18] and PageRank [21]. More precisely, we replace the damping factor by a HopPortation vector to encode the probabilities of visiting each k-hop neighborhood. The intuition here is that users browse semantically close terms more often than semantically distant ones.

BIOPORTAL
There exist a large number of ontologies in the biomedical domain. They are highly specialized and therefore expensive to develop. To enable ontology adoption and reuse, effective support for browsing and exploring existing ontologies is required. Towards that goal, the National Center for Biomedical Ontology (NCBO) [3,19] features BioPortal [1, 20, 30], one of the leading repositories of biomedical ontologies on the Web, which currently contains more than 700 ontologies with more than 9 million ontology classes. On BioPortal, practitioners and experts can access ontologies via Web services and Web browsers. The latter allows users to navigate ontologies by searching specific classes, or by directly browsing their concept hierarchies within a tree-like explorer [28].
Ontologies. We propose to model human navigation on semantic networks using the structure of the underlying ontology. On BioPortal, ontologies are defined as directed networks, where nodes represent concepts and edges isASubClassOf relationships. Since such edges are usually non-cyclic and have a common root, these ontologies often form trees. Table 1 shows 11 of the most visited ontologies in 2015. For instance, LOINC is the largest ontology, with 175K nodes, 153K edges, and 74K connected components.
Transitions. We analyzed all HTTP requests made in 2015 and extracted 336K valid sessions (i.e., after filtering out sessions with fewer than 2 requests, and requests to ontologies or concepts which do not exist). Each session contains transitions (i.e., a sequence of visited concept pages) triggered by a single user (i.e., IP address) without breaks (i.e., pauses of at least 60 minutes). For simplicity, we only consider transitions within the largest connected component (LCC) of each ontology, and discard ontologies with fewer than 1000 transitions. Overall, we found 11 ontologies and 133K transitions between their concepts; see Table 1 for some key properties.
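The session extraction described above can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline used in the paper: the input format (tuples of IP address, timestamp in seconds, and concept identifier) and the names `extract_sessions` and `session_transitions` are hypothetical.

```python
from collections import defaultdict

BREAK_SECONDS = 60 * 60  # a pause of at least 60 minutes ends a session


def extract_sessions(requests, min_requests=2):
    """Split (ip, timestamp, concept) requests into sessions per IP address.

    `requests` is a hypothetical input format; timestamps are in seconds.
    """
    by_ip = defaultdict(list)
    for ip, timestamp, concept in requests:
        by_ip[ip].append((timestamp, concept))

    sessions = []
    for visits in by_ip.values():
        visits.sort(key=lambda visit: visit[0])  # chronological order
        current = [visits[0][1]]
        for (prev_ts, _), (ts, concept) in zip(visits, visits[1:]):
            if ts - prev_ts >= BREAK_SECONDS:  # break: close the session
                sessions.append(current)
                current = []
            current.append(concept)
        sessions.append(current)

    # Discard sessions with fewer than `min_requests` requests.
    return [s for s in sessions if len(s) >= min_requests]


def session_transitions(session):
    """Pairs of sequentially visited concepts within one session."""
    return list(zip(session, session[1:]))
```

Filtering out requests to non-existent ontologies or concepts, and restricting transitions to the LCC, would be additional steps applied before or after this grouping.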
Navigation Types. Based on the HTTP request headers, we inferred 7 navigation types: Details (DE), Direct Click (DC), Direct URL (DU), Expand (EX), External Link (EL), External Search (ES), and Local Search (LS). DE: all clicks made within the Details tab of a selected concept. DC: all clicks made on concepts within the tree-like explorer. DU: all concept requests without HTTP referrer (e.g., direct URL in the browser). EX: all clicks on the (+) symbol of a concept, which triggers the expansion of the concept to show all its children nodes. Notice that this request is called only once, even if the symbol is clicked multiple times. The opposite behavior (collapse) is not considered. EL: all requests coming from external websites that are not search engines. ES: all requests coming from the top 10 most popular external search engines such as Google and Yahoo. LS: all requests made via the local search functionality of each ontology. Notice that this search is a 3-step process. First users type a keyword, then the system shows auto-suggestions and finally users click on one of the concepts shown in the auto-suggestion list. We only consider the final step a local search transition. ALL: includes all the abovementioned types.
Figure 2 shows the distribution of transitions across navigation types for each ontology. In general, most traffic comes from expanding a concept (EX, 44%), followed by local search (LS, 17%), direct URL (DU, 16%) and details (DE, 14%). Surprisingly, direct clicks on concepts (DC) only represent 7% of all transitions. This suggests that users spend substantial time expanding concepts before they find a concept of interest.

HOPRANK: A BIASED RANDOM WALKER
HopRank models human navigation on semantic networks. Imagine a random walker whose decisions on where to go next are biased towards specific k-hop neighborhoods. This bias is what we call HopPortation, which encodes the probabilities of transitioning to each k-hop neighborhood. In our model, navigation on networks can be explained as a 2-step process. First, a k-hop neighborhood of the current node i is drawn from a categorical distribution. Second, a node j is randomly chosen within that k-hop neighborhood. Note that this process holds only if the walker is fully or partially aware of the structure of the network (i.e., knows or can see it). Without this prerequisite, and if links are not preferred, then random jumps to random pages will be more plausible. In comparison to the classic random walker with teleportation (e.g., PageRank [21]), whose movements are constrained by the damping factor α (i.e., probability of following links), HopRank is constrained by a vector $\vec{\beta}$ containing k different factors, which define the probabilities of going to each k-hop neighborhood from the current location.
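The 2-step process above can be sketched as a single walker step. This is a minimal sketch, not the released implementation [11]: the function names, the adjacency-list input, and the choice to renormalize the HopPortation probabilities over the neighborhoods that actually exist around the current node are assumptions made for illustration.

```python
import random
from collections import deque


def hop_distances(adj, source):
    """Breadth-first search: shortest hop distance from `source` to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def hoprank_step(adj, current, beta, rng=random):
    """One biased step: draw a k-hop neighborhood, then a node uniformly within it.

    `beta` maps a hop distance k to its probability (a hypothetical encoding
    of the HopPortation vector). Returns the pair (k, next_node).
    """
    neighborhoods = {}
    for node, k in hop_distances(adj, current).items():
        if k >= 1:
            neighborhoods.setdefault(k, []).append(node)

    # Assumption: only neighborhoods that exist around `current` are eligible,
    # and their probabilities are renormalized by `rng.choices`.
    ks = [k for k in neighborhoods if beta.get(k, 0.0) > 0.0]
    if not ks:
        return 0, current  # assumption: stay put if no neighborhood is eligible

    k = rng.choices(ks, weights=[beta[k] for k in ks])[0]  # step 1: categorical draw
    return k, rng.choice(neighborhoods[k])                 # step 2: uniform draw
```

For example, on a seven-node binary tree with `beta = {2: 1.0}`, a walker at a leaf can only move to its grandparent or its sibling, since those are the only nodes two hops away.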
Visited k-hop Neighborhoods on BioPortal. We aggregate ALL transitions by the shortest distance between two sequentially visited nodes. This distance is referred to as the k-hop neighborhood. In Figure 3(a) we see that target nodes at large distances are less likely to be visited next. This is expected since, to some extent, larger distances enclose more branches and therefore more target candidates. Note that ontologies are sorted by diameter in descending order from MESH to WHO-ART. Interestingly, users tend to hop as far as the ontology's diameter for $d' \le 12$. For instance, OMIM's diameter is 6 (see Table 1), and 6 is the maximum hop done by users. For $d' > 12$, users hop (roughly) up to two-thirds of the ontology's diameter. For example, MESH's diameter is 31, and the largest hop reached is 19.
Transitions per k-hop Neighborhood on BioPortal. Figure 3(b) shows the average percentage of transitions across k-hop neighborhoods per navigation type. We see that users on average (ALL, grey) prefer to navigate through 2-hop (41%) and 1-hop (23%) neighbors, in particular when navigation is triggered by direct clicks (DC, orange) and expand (EX, red). Notice their fast decay for k > 8. Other types of navigation such as external link (EL, purple) and direct URL (DU, green), which do not leverage the tree-like explorer, tend to reach concepts at larger distances more frequently. Notice their peaks at k = 5 and k = 11, respectively. Interestingly, when users opt for external search (ES, brown), they often click on 2-hop concepts, but also on 12-hop and 15-hop neighbors.
Intuitively, the details tab (DE, blue) helps users to click on nearby concepts at k ≤ 2, more often than local search (LS, pink), which is more likely to reach concepts at k ≥ 2.
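The aggregation of transitions by shortest distance can be sketched as below. This is an illustrative sketch under assumptions: the graph is given as an undirected adjacency list, transition counts as a dictionary keyed by (source, target) pairs, and the function names are hypothetical. Self-transitions, if present, are counted under k = 0.

```python
from collections import Counter, deque


def hop_distances(adj, source):
    """Breadth-first search: shortest hop distance from `source` to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def transitions_per_hop(adj, transition_counts):
    """Share of observed transitions per hop distance k.

    `transition_counts` maps (source, target) to a number of transitions.
    """
    per_hop = Counter()
    cache = {}
    for (i, j), count in transition_counts.items():
        if i not in cache:
            cache[i] = hop_distances(adj, i)
        k = cache[i].get(j)
        if k is None:
            continue  # target not reachable from source: ignored in this sketch
        per_hop[k] += count

    total = sum(per_hop.values())
    if total == 0:
        return {}
    return {k: count / total for k, count in sorted(per_hop.items())}
```

Applied per navigation type, the resulting shares correspond to the kind of distribution plotted in Figure 3(b).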

MODELS OF HUMAN TRANSITIONS
In this section, we formally introduce our HopRank model, and recap popular navigation models for comparison. We state the transition probabilities and the number of parameters for HopRank and 7 other models that we will use later on for model selection.
We formally represent an ontology (more precisely, its largest connected component, LCC) as a graph $G = (V, E)$, with $V = (v_1, \ldots, v_n)$ being a set of N nodes, and $E = \{(v_i, v_j)\} \subseteq V \times V$ a set of undirected edges (directionality of edges is omitted to calculate shortest paths between all pairs of nodes). The ontology structure is captured by the adjacency matrix $A_{N \times N} = (a_{ij})$, where $a_{ij}$ is 1 if the link exists, 0 otherwise. Transitions are represented by the transition matrix $T_{N \times N} = (t_{ij})$, where $t_{ij}$ represents the number of transitions between source node i and target node j.

HopRank (HR). Given the HopPortation vector $\vec{\beta}$, the probability of reaching a k-hop neighborhood is denoted by factor $\beta_k \in \vec{\beta}$. The stochastic k-hop matrix $M_k$ encodes all nodes j with a shortest distance k from i. HopRank uniformly distributes $\beta_k$ across all nodes j at distance k. The limits of k-hop neighborhoods go from 1 (direct edges) to $d'$, the diameter of the ontology G. Noise $1 - \sum_{k=1}^{d'} \beta_k$ is added to allow for random jumps and self-loops. Figure 1(b) illustrates how the HopPortation vector is computed.
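A matrix view of this description can be sketched as follows. Since Equation (1) is not reproduced in this text, the sketch rests on assumptions that should be checked against the original: each $\beta_k$ is spread uniformly over the nodes at distance k, the leftover noise is spread uniformly over all N nodes (including the current one), and each row is renormalized so that it sums to 1. All names are hypothetical.

```python
from collections import deque


def hop_distances(adj, source):
    """Breadth-first search: shortest hop distance from `source` to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def hoprank_matrix(adj, beta):
    """Row-stochastic transition probabilities P[i][j] under the stated assumptions.

    `beta` maps hop distance k (>= 1) to its probability.
    """
    nodes = sorted(adj)
    n = len(nodes)
    noise = max(0.0, 1.0 - sum(beta.values()))  # assumed: leftover probability mass
    P = {}
    for i in nodes:
        dist = hop_distances(adj, i)
        sizes = {}  # number of nodes in each k-hop neighborhood of i
        for k in dist.values():
            sizes[k] = sizes.get(k, 0) + 1

        row = {}
        for j in nodes:
            k = dist.get(j)
            p = noise / n  # assumed: uniform random jump, self-loop included
            if k is not None and k >= 1 and k in beta:
                p += beta[k] / sizes[k]  # beta_k spread uniformly at distance k
            row[j] = p

        total = sum(row.values())
        if total == 0.0:
            P[i] = {j: 1.0 / n for j in nodes}  # fallback: uniform row
        else:
            P[i] = {j: p / total for j, p in row.items()}  # assumed renormalization
    return P
```

The renormalization only matters for nodes that lack some of the k-hop neighborhoods listed in `beta`; for all other nodes the row already sums to 1.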
Preferential Attachment (PA). Given the degree matrix $D_{N \times N} = (d_{ij})$ with $d_{ij} = d_j$, where $d_j$ represents the degree of the target node j, the probability of moving from i to j is proportional to the degree of j. Number of model parameters: 0.
Gravitational (Gr). Given the matrix $S_{N \times N} = (sp(i, j) + \epsilon)^2$, where $sp(i, j)$ denotes the shortest path between nodes i and j, the probability of navigating from i to j is proportional to the degree of node j and inversely proportional to the squared distance between i and j.
We add a smoothing factor $\epsilon$ to avoid overflows when dyads are disconnected. In such cases, we set $\epsilon$ to the diameter $d'$ of G plus 1, to consider these jumps with a very low probability. Similarly, we set the diagonal (i.e., self-loops) to $\epsilon = d' + 2$. Number of model parameters: 0.
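The two parameter-free baselines can be sketched from their verbal descriptions. The equations themselves are not reproduced in this text, so the following is an interpretation: the function names are hypothetical, and the sketch assumes $\epsilon = 0$ for connected pairs of distinct nodes and $sp(i, j) = 0$ for self-loops and disconnected dyads.

```python
from collections import deque


def hop_distances(adj, source):
    """Breadth-first search: shortest hop distance from `source` to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def preferential_attachment_row(adj):
    """P(i -> j) proportional to the degree of j (the same row for every source i)."""
    total_degree = sum(len(neighbors) for neighbors in adj.values())
    return {j: len(adj[j]) / total_degree for j in adj}


def gravitational_row(adj, i, diameter):
    """P(i -> j) proportional to degree(j) / (sp(i, j) + epsilon)^2."""
    dist = hop_distances(adj, i)
    weights = {}
    for j in adj:
        if j == i:
            s = diameter + 2   # self-loop: epsilon = d' + 2
        elif j in dist:
            s = dist[j]        # connected dyad: epsilon assumed to be 0
        else:
            s = diameter + 1   # disconnected dyad: epsilon = d' + 1
        weights[j] = len(adj[j]) / s ** 2
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}
```

Both functions return one row of a right stochastic matrix; neither has parameters to fit, which matches the stated parameter count of 0.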
Random Walker (RW). Given the damping factor α (i.e., probability of following links), the probability of visiting a node j is proportional to α divided by the degree of the source node i, plus a random choice equally distributed among all nodes. Depending on the α value, a random walker can model four different behaviors: (i) α = 0.0: random jumps only, (ii) α ≈ 1.0: navigation over links only, (iii) α = 0.85: PageRank using the commonly used damping factor for navigating the Web [7], and (iv) the empirical PageRank, which learns the parameter α from the transitions data. Number of model parameters: 1 if empirical, 0 otherwise.
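As a small illustration of the random walker baseline, the sketch below computes one row of transition probabilities in the common PageRank form, α/deg(i) for linked nodes plus (1 − α)/N for every node. The exact equation is not reproduced in this text, so this form and the function name are assumptions; the sketch also assumes every node has at least one neighbor.

```python
def random_walker_row(adj, i, alpha):
    """P(i -> j) = alpha * a_ij / degree(i) + (1 - alpha) / N (assumed form)."""
    n = len(adj)
    neighbors = set(adj[i])
    row = {}
    for j in adj:
        # Link-following part: only neighbors of i receive a share of alpha.
        follow = alpha / len(neighbors) if j in neighbors else 0.0
        # Teleportation part: spread uniformly over all N nodes.
        row[j] = follow + (1.0 - alpha) / n
    return row
```

Setting `alpha=0.0` yields purely random jumps, while values close to 1.0 confine the walker to links, mirroring behaviors (i) and (ii) above.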
Markov Chain (MC). We assume that moving to the next node follows a Markov process. Therefore, the probability of moving to a node j only depends on the current node i. These probabilities represent the maximum likelihood, learned from the transition matrix T. Thus, the probability of visiting node j from node i is proportional to the number of transitions $t_{ij}$. Number of model parameters: $N \times (N - 2)$.
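The maximum-likelihood estimate for a first-order Markov chain is the row-normalized transition count, which can be sketched as below. The dictionary-based input format and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def markov_chain_probabilities(transition_counts):
    """Maximum-likelihood estimate: P(i -> j) = t_ij / sum_j t_ij.

    `transition_counts` maps (source, target) to a number of transitions.
    Sources without outgoing transitions get no row in this sketch.
    """
    totals = defaultdict(int)
    for (i, _), count in transition_counts.items():
        totals[i] += count

    probs = defaultdict(dict)
    for (i, j), count in transition_counts.items():
        if count > 0:
            probs[i][j] = count / totals[i]
    return dict(probs)
```

Because every observed dyad gets its own probability, this model fits the data closely but needs far more parameters than the other models.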
Note that $M$, $A$, and all $P_*$ from Equations (1) to (5) are right stochastic matrices (i.e., each row must sum to 1).

EXPERIMENTS
In this section, we compare the performance of HopRank to the baselines on synthetic and real-world networks.

Model Selection
For comparing the models, we employ the Bayesian Information Criterion (BIC) [23] to select the best model, i.e., the one with the lowest BIC score. BIC evaluates log-likelihoods LL (i.e., how likely our transitions are for a given model) and takes into account the number of model parameters and observations (i.e., # of transitions) to avoid over-fitting.
In the log-likelihood, $t_{ij}$ represents the actual number of transitions from node i to node j, and $p_{ij}$ the probability of transitioning from node i to node j for a given model.
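The model comparison can be sketched with the standard definitions, which are assumed here because the equation is not reproduced in this text: $LL = \sum_{ij} t_{ij} \ln p_{ij}$ and $BIC = -2 \cdot LL + p \cdot \ln(n)$, where p is the number of model parameters and n the number of transitions. The function names and input formats are hypothetical.

```python
import math


def log_likelihood(transition_counts, probs):
    """LL = sum over dyads of t_ij * ln(p_ij).

    `transition_counts` maps (i, j) to t_ij; `probs[i][j]` holds p_ij.
    """
    ll = 0.0
    for (i, j), count in transition_counts.items():
        if count == 0:
            continue
        p = probs.get(i, {}).get(j, 0.0)
        if p <= 0.0:
            return float("-inf")  # the model rules out an observed transition
        ll += count * math.log(p)
    return ll


def bic(transition_counts, probs, n_params):
    """Standard BIC (assumed form): -2 * LL + n_params * ln(n_observations)."""
    n_obs = sum(transition_counts.values())
    return -2.0 * log_likelihood(transition_counts, probs) + n_params * math.log(n_obs)
```

A model with a higher likelihood can still lose the comparison if it needs many more parameters, which is the trade-off discussed in the experiments.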

Synthetic Network
Setup. The underlying network (structure) is a binary tree composed of N = 7 nodes and |E| = 6 edges as shown in Figure 1(a). Transitions (curved-thick edges) are biased towards 2-hop and 4-hop neighborhoods. These biases are reflected in the HopPortation vector shown in Figure 1(b).
Results. Probabilities inferred using Equation (1) are depicted in Figure 1(c). Figure 4 (left) shows the number of parameters inferred by each model. While the Markov chain model (MC) requires 35 parameters, HopRank only needs 5. The empirical PageRank (RW E.) learned a damping factor of α = 0.01. This means that users follow links with a probability of only 1%. In Figure 4 (right) we see the comparison of models using BIC scores. In this synthetic network, transitions are best described by the Markov chain model because model parameters (i.e., maximum likelihood) are proportional to the actual transition counts per dyad, and the data structure is very small. In spite of that, HopRank is the second best model and describes navigation better than random (RW 0.0).

Medical Dictionary for Regulatory Activities Terminology (MEDDRA)
Setup. MEDDRA [2] is one of the largest ontologies in our dataset (see Table 1). After pre-processing, its largest connected component (LCC) consists of 23K nodes and 43K transitions.
Results. Figure 5(a) shows the HopPortation vectors learned for each type of navigation in MEDDRA. We see that users mainly navigate through 1, 2, 6, and 8-hop neighbors. For instance, transitions through direct clicks, whether on a concept (DC), its details (DE) or expand (EX), mainly follow 1-hop and 2-hop neighbors. However, when transitions are triggered by direct URLs (DU), local search (LS) or external links (EL), users tend to reach distant target nodes (i.e., 6-hop and 8-hop neighbors). Figure 5(b) shows the ranking of models according to BIC scores (lower is better). We see that in MEDDRA all types of navigation are best explained by HopRank.

Top 11 Ontologies in BioPortal
Setup. We fit HopRank and the baseline models to all transitions by ontology and navigation type. These represent 133K transitions coming from the 11 ontologies described in Table 1.
Results. In Figure 6 we highlight the model that best explains the number of transitions per ontology and navigation type (i.e., the model with the lowest BIC score). Ontologies are sorted by their number of transitions from CPT (largest) to OMIM (smallest). HopRank outperforms the other models 89% of the time, especially when users, regardless of the ontology, browse the tree-like explorer directly via clicks (DC), details (DE) and expand (EX). When there are not enough observations (i.e., the number of transitions is small), the other models tend to outperform HopRank because they require fewer parameters and/or because transitions across different k-hop neighborhoods are less likely to be found. This is the case for 6 ontologies in certain navigation types. For instance, we found

DISCUSSION AND FUTURE WORK
In this section, we discuss decisions made for data processing, and future directions that can be pursued to improve our results.
Largest Connected Component (LCC). Surprisingly, ontologies on BioPortal may have multiple connected components. In those cases, only the branch connected to the root owl:Thing is shown at first in the tree-like explorer. Disconnected (and hidden) nodes or branches need to be accessed from external pages or local search. For simplicity, we opted to work with the LCC of each ontology at the cost of removing 20% of all transitions. Future work should consider the whole network to study the tradeoffs between number of transitions and random teleportation.
HopRank Extensions. More extensions based on network properties or similarity measures between nodes could improve our results. For instance, one could consider ontologies as directed graphs and assume that navigation is constrained not only by distance but also by directionality (top-down or bottom-up).
Other Types of Networks. Even though this paper targets semantic networks, we believe that HopRank can be utilized to model human navigation in other networks, such as the Web or cities. The only assumption required is that users must have background knowledge of the underlying network they are surfing/traveling in.

CONCLUSIONS
In this paper, we introduced the concept of HopPortation, which states that users navigating a known or visible network are biased towards certain k-hop neighborhoods. This is a variation of PageRank, where we assume that teleportation is not fully random but rather distributed non-uniformly across different neighborhoods. We proposed HopRank, a biased random walker, to model navigation on semantic networks. Our findings on BioPortal suggest that semantic structure (i.e., shortest path) influences navigation on networks. In particular, users tend to be biased towards certain k-hop neighborhoods depending on the type of navigation. For instance, when manually browsing the tree-like explorer, users tend to hop to nearby concepts, whereas far-away concepts are more likely to be reached by non-browsing types such as external links. These results advance our understanding of how ontologies are actually navigated and consumed, and help to develop and improve user interfaces for ontology exploration.

Figure 1: HopRank on semantic networks. This example illustrates an instance of navigation on an ontology. (a) Shows the underlying network composed of seven concepts (a-g) and six isASubClassOf relationships (straight-thin grey arrows). Transitions (curved-thick black arrows) are labeled by the actual number of transitions between concepts, as well as the [k-hop] distance (i.e., shortest path) between them. (b) Illustrates how the HopPortation vector $\vec{\beta}$ is built using transition counts per k-hop. (c) Shows the transition probabilities inferred by HopRank, see Equation (1).

Figure 3: Popularity of k-hops. (a) Shows the percentage of dyads that are traversed per k-hop neighborhood. Lines represent ontologies and are sorted by their LCC diameter, in descending order from MESH (dark blue) to WHO-ART (dark red). (b) Shows the distribution of transitions across k-hop neighborhoods per navigation type. Percentages are averages across ontologies, and error bars the respective standard deviation. While several k-hop distances are being traversed non-uniformly, most transitions happen across nearby nodes, especially when browsing (DE, DC, EX, ES) 2-hop neighbors. In contrast, non-browsing types (EL, LS, DU) tend to reach more distant nodes more frequently.

Figure 4: Results on Synthetic Network from Figure 1. The x-axis maps the models of interest. (Left) Number of parameters inferred by each model. (Right) BIC scores: the lower the better at explaining the data. In this example, navigation is best described by Markov chain followed by HopRank.

Table 1: Datasets. This table illustrates network properties of 11 of the most popular ontologies on BioPortal in 2015. Ontologies represent networks whose nodes refer to concepts and edges isASubClassOf relationships. Original number of nodes, edges, and connected components of ontologies are shown under N, E and cc, respectively. Properties of the largest connected component (LCC) of each ontology are shown under N', E', d' and T', where d' refers to the diameter and T' to the number of transitions.
external link (EL, purple), external search (ES, brown) and local search (LS, pink).Most ontologies are mainly navigated by expanding nodes within the tree-like explorer.
Figure 5: Results on MEDDRA. (a) This heatmap shows the HopPortation vectors learned from the transitions in MEDDRA. Cells represent the probabilities of visiting a certain k-hop neighborhood (column) by a given navigation type (row). In general, 2-hop and 1-hop neighborhoods are more likely to be visited next, regardless of navigation type (ALL). However, distant hops are preferred through direct URLs (DU), external links (EL), and local search (LS). (b) This figure shows the comparison of models across navigation types using BIC scores. We see that HopRank outperforms all baseline models.