**Research Metrics Workshop**

**August 16, 2024**

**I4-01-03 Seminar Room, NUS, Singapore**

**View the Schedule Here**

**Speakers (alphabetical order)**

– Hiroka Hamada (The Institute of Statistical Mathematics)

– Thorsten Koch (Technische Universität Berlin & Zuse Institute Berlin)

– Junji Nakano (The Institute of Statistical Mathematics & Chuo University)

– Frederick Kin Hing Phoa (Academia Sinica)

– Guoyang Rong (National University of Singapore)

– Lei Zhou (National University of Singapore)

**Hiroka Hamada**

The Institute of Statistical Mathematics

**Talk Title: PLAYER: A clustering method for research metrics that provides evaluation unit of research and distance**

Abstract: In research IR, which analyzes an institution’s research activities to support decision making, the shortage of indicators to evaluate research is a problem. To solve this problem, we have developed a method for defining evaluation units for research activities. Existing quantitative indicators used for research evaluation have remained quantity-counting indicators, although several correction methods have been proposed. This is because expert knowledge of the research area is necessary to evaluate research properly. Therefore, we propose “PLAYER”, which classifies studies based on natural language information and co-authorship information, and defines distances between studies, thereby aligning studies into units that can be compared and evaluated. A PLAYER is a cluster of articles derived from the embedding vectors provided by SPECTER and from co-authorship network information. First, the features obtained from each data source by the kernel method are combined and then clustered according to the PLAYER requirements. The distance between studies is defined by the Wasserstein distance.

By using PLAYER for research metrics, “collection of similar studies” and “analysis and evaluation of transdisciplinary studies” can be realized without special knowledge. Examples of analyses show that PLAYER is useful for research analysis with different evaluation axes.

**Thorsten Koch**

Technische Universität Berlin & Zuse Institute Berlin

**Talk Title: Integrating Large Citation Data Sets for Measuring Article’s Scientific Prestige**

Abstract: Evaluating scientific impact necessitates precisely measuring individual articles’ impact, commonly assessed through metrics reliant on citation counts. However, these metrics are subject to limitations, notably susceptibility to manipulation within the scholarly community. Recently, there has been a shift towards utilizing knowledge distilled from citation graphs rather than relying solely on citation counts. This shift mandates access to a comprehensive citation graph for more reliable measurement. In this talk, we focus on methods for merging citation data sets incorporating big data to construct a comprehensive citation graph. We present our implementation results for merging two extensive citation databases, containing more than 63 million and 98 million article records, respectively, alongside more than 953 million and 1.3 billion citations. Handling big data presented significant challenges during our implementation, including quality issues from semi-structured data lacking universal identification numbers. Through meticulous deduplication efforts, we streamlined the merged database to a single consistent dataset. Our work led to a citation graph that portrays inter-article associations more accurately than graphs derived from single databases. The presentation outlines our approach to managing big data for constructing the merged citation graph, emphasizing the challenges and our remedies to deal with these challenges.

**Junji Nakano**

The Institute of Statistical Mathematics & Chuo University

**Talk Title: Improving a stochastic model of the citation mechanism for scientific and technical articles**

Abstract: Citations among scientific and technical articles can be represented by a network structure called a citation network, where nodes and directed edges represent articles with discrete publication time and citations, respectively. We have proposed a stochastic generative model in which a citation between two articles is described by a probability based on the type of the citing article, the importance of the cited article, and the difference between their publication times. We consider the out-degree of an article as its type, and the in-degree as its importance. In the model, we assume three structures: a logistic function to represent the expected number of articles published in discrete time, an inverse Gaussian probability distribution function to approximate the aging effect, and an exponential distribution to approximate the out-degree distribution. We also assume two types of generative mechanisms, preferential attachment and triad formation, to perform edge generation. We have shown that the model is able to generate network structures that approximate several scientific citation networks. We also found that the model does not fit patent citations well. Therefore, we have proposed a modified model that treats the triad formation ratio as a random variable instead of the constant parameter used in the first model. After analyzing the patent data in more detail, we find that several improvements are needed in the time series treatment of the data and in the distribution of the triad formation ratio.

**Frederick Kin Hing Phoa**

Academia Sinica

**Talk Title: Weighted Evolving Hypergraph Model with Preferential Attachment**

Abstract: A hypergraph is useful to express the relation between two or more nodes. Real hypergraph data are typically weighted. We propose a weighted evolving hypergraph model that considers preferential attachment. The model allows variability in two basic components of the evolving hypergraph: the number and the size of hyperedges to be connected. Under mild distributional conditions on the two varying quantities, we derive the exact degree distribution, which asymptotically follows a power-law distribution. We find that the limiting power-law exponent is affected by the distribution of hyperedge sizes. The distribution of the number of hyperedges to be connected has a considerable impact on the small-degree range, in which non-power-law behavior is frequently observed in real data. Moreover, we argue that the degree distribution of the model can be expressed as a mixture of the degree distributions with a fixed number of hyperedges to be connected. The validity and usefulness of the model are illustrated through a simulation study and real data analysis.

**Guoyang Rong**

National University of Singapore

**Talk Title: Exploring the critical years for interdisciplinary citations**

Abstract: Revealing interdisciplinary patterns is a cornerstone for the continued evolution of research, education, and societal progress, providing a scaffold upon which to build a more collaborative and integrated approach to knowledge creation. This study presents a novel approach to identifying and analyzing the critical year for interdisciplinary citations (CYIC), which was defined as the year in which qualitative change in interdisciplinary knowledge flow occurred. We conducted two experiments using a Chinese paper dataset spanning 106 disciplines from 1992 to 2022, with the first to pinpoint the occurrence of CYICs and the second to examine three patterns of interdisciplinarity following these CYICs. Our findings revealed that 85% of disciplines exhibit CYICs, often corresponding with a transition from unidirectional output to reciprocal knowledge cooperation. Furthermore, we found that datasets after CYICs are generally characterized by increased interdisciplinarity of knowledge, albeit without a corresponding rise in the interdisciplinarity of disciplines or interdisciplinary diversity. Our results suggest that policy shifts and societal needs are pivotal in driving the formation of interdisciplinary collaborations, as exemplified by the surge in mutual interdisciplinary citations in response to China’s poverty alleviation efforts and western development policies.

**Tracy Zhou**

National University of Singapore

**Talk Title: Is Innovation Slowing Down? Insights from the AIMS Framework of Patent Values**

Abstract: Amidst the unprecedented expansion of scientific and technological knowledge over the past century, concerns persist regarding a slowdown in innovation. To address this, we introduce the AIMS framework, which categorizes patents into four types – Aurora, Invisible, Mirage, and Success – based on their respective inherent scientific values and market-recognized economic values. Utilizing USPTO patent and citation data from 1976 to 2022, our analysis reveals an increasing volume of patent issuances but a concerning dilution in scientific quality starting in the 2000s. This trend is primarily attributed to the rise of low scientific value patents – categorized as Mirage and Invisible – and a modest decline in high-impact scientific patents – categorized as Success and Aurora. Meanwhile, the economic value of patents has risen, especially noted with the growth in Mirage patents since the 2010s, indicating a shift towards strategies that prioritize market-driven patenting. This study highlights the evolving nature of patents from mere indicators of scientific innovation to strategic tools for market dominance, providing an alternative understanding of patent value and its implications for firms’ strategic decisions over patent issuance across different sectors.

#### _________________________________________________________________________________________________________

**Organizers:**

- Ying Chen
- Keisuke Honda
- Thorsten Koch
- Frederick Kin Hing Phoa

#### _________________________________________________________________________________________________________

**Coordinator:**

- Tracy Zhou