译文:谷歌学术让这变得困难——组织出版物的复杂性
摘要:科学家可以在谷歌学术的个人资料页面上维护自己的出版物,而这些作品的引用数会被自动收集并累计。出版物的维护由研究者本人手工完成,包括:删除错误的条目,合并未被识别为同一篇的重复论文,补充遗漏的合作者,以及纠正论文的标题与出处。在 2012 至 2014 年间的网页界面中,出版物以每页 20 篇或 100 篇的形式呈现。(自 2014 年年中起,谷歌学术的个人资料页允许在单个页面上呈现任意篇数的论文。)如果一篇论文的两个版本出现在不同的页面上,界面不允许科学家将它们合并。这不仅意味着想要合并某些出版物子集的科学家有时无法做到,我们还在本文中证明:判定给定的论文子集能否全部合并这一决策问题是 NP 完全的。
关键词:计算复杂性、NP 完全性、归约、三分区、谷歌学术
1 介绍
大部分计算机科学领域的科研工作者都熟悉谷歌学术及其维护出版物与引用的功能。每位科研工作者都有自己的个人资料页,以网页形式展示其出版物列表。谷歌学术统计每篇出版物的引用数,并默认按引用数在该页面上排序罗列。由于数据收集是自动化的,其中难免包含各种错误,而这些错误大多源于其他科学家没有正确给出论文的标题或其他基本信息。其结果是,同一篇论文在列表中可能出现多个版本:标题或出版方略有不同,或是遗漏了合作者(见图 1)。谷歌学术允许科研工作者在个人资料页上修正这些错误,例如把一篇论文的两个版本合二为一,合并后版本的引用数是原始版本之和。当然,也可以直接删除错误的版本,但这会损失一些引用数,进而影响相当重要的 H 指数以及谷歌学术维护的其他统计数据。
谷歌学术默认每页展示 20 篇出版物(也可以把每页篇数改为 100)。自 2014 年年中起,界面的修改使得在同一页展示所有出版物成为可能。但本文假定 2012 年至 2014 年年中所用的界面,当时这并不可行,单页最多只能呈现 100 篇。要合并两篇论文,必须先在网页上同时选中它们,然后才能执行合并操作。然而,只有当两篇论文出现在同一页时才能同时选中,因此只有同属一个 100 篇分组的两篇论文才能被合并。例如,按引用数排列,第 103 位与第 187 位的两篇论文可以合并,但第 97 位与第 105 位的两篇论文则不能。
图 1 作者 Jones 的六篇论文在图中的两个页面上共出现九个版本。Jones 必须完成三次合并才能订正数据,而且合并的顺序也很关键。
图 2 在有序列表中合并不同版本的论文会改变其他论文的位置。
论文合并的顺序很重要。假设有两对需要合并的论文,分别位于第 4 位和第 12 位,以及第 101 位和第 107 位。如果先合并第一对,第二对会移动到第 100 位和第 106 位,落在不同的页面上;而先合并第二对,之后仍然可以合并第一对。注意,一次合并可能使某篇论文的位置前移,也可能使其后移,如图 2 所示。除了想要的合并有时无法完成这一问题之外,判定是否存在可行的合并序列(并由此找出正确的顺序)这一计算问题本身也是难解的。这意味着,给出合并序列的多项式时间算法不太可能存在。
2 难解性的证明
为了证明该问题的难解性,即 NP 完全性,我们首先将其形式化。令 n 为问题实例中论文版本的初始总数,p 为页面大小。设论文版本为 v1,···,vn,版本 vi 被引用 c(vi) 次。一个问题实例由序列 c(v1),···,c(vn) 以及 1,···,n 的一个子集划分组成:同一子集中的两个或更多版本表示它们是同一篇论文的不同版本,因而需要合并为一。谷歌学术合并问题就是判定:是否对每个子集,其中的所有版本都能被合并。两个版本合并后成为一个新版本,引用数相加。每次合并后,新的版本集合按引用数重新排序;引用数相同时,论文按其他某种良定义的次序排列,但这与难解性证明无关,我们将忽略这一点。
定理 1 谷歌学术合并问题是 NP 完全的。
证明 首先验证谷歌学术合并问题属于 NP。这很容易:一个给定的合并顺序可以在平方时间或更短的时间内检验。
其次,我们选取另一个 NP 完全(或 NP 困难)问题,并将其归约到我们的问题,即三分区问题 [2]。一个三分区实例由正整数集合 a1,···,a3m 和整数 B 组成,问能否把 a1,···,a3m 划分为 m 个子集,使每个子集中三个整数之和恰为 B。整数 a1,···,a3m 都严格介于 B/4 与 B/2 之间,这保证任何和为 B 的子集恰好包含三个整数。
下面描述从三分区到谷歌学术合并问题的归约。首先把所有的 ai 和 B 加倍,以保证它们都是偶数;稍作符号滥用,我们继续用 a1,···,a3m 和 B 表示加倍后的值。
我们把页面大小设为 3m。令 D 为某个较大的偶数,例如取 D = 3mB。
我们的实例包含一篇有多个版本的论文 P 和许多只有一个版本的论文,示意见图 3。
首先,论文 P 共有 5m 个版本,引用数分别为:
·a1,···,a3m 次(X 型),
·D − B + 2i 次(Y 型),对每个 i,1 ≤ i ≤ m,
·D + 2i 次(Z 型),对每个 i,1 ≤ i ≤ m。
P 的每个版本的引用数都是偶数,因此它们之间合并所得的版本引用数也是偶数。
其次,我们取许多只有一个版本的论文:
·对每个 i,1 ≤ i ≤ m,取 3m − 1 篇只有一个版本的论文,每篇引用数为 D + 2i − 1,
·取 3m 篇引用数为 D + 2m + 1 的论文,
·取 5m 篇引用数为 B − 1 的论文。
这样的实例恰好填满 m + 4 页,且所有 X 型版本都出现在最后一页。
我们称第二个集合中的论文为单一论文,它们的引用数都是奇数,且不会被合并。引用数相同的至少 3m − 1 篇单一论文构成一个块,实例中共有 m + 2 个块。由于一个块中的论文数至少是页面大小减一,引用数位于同一个块两侧的论文版本永远不可能出现在同一页上。注意,论文 P 的各个 Z 型版本正是被这些块彼此隔开的。要"解救"一个 Z 型版本,把它从毗邻的块中取出来,唯一的办法是先造出一个引用数与它完全相同的论文 P 的版本。该构造确保这只能由恰好一个 Y 型版本和三个 X 型版本实现,证明稍后给出。
图 3 由三分区实例构造出的谷歌学术合并问题初始局面,垂直灰线为页面边界。
我们断言:该谷歌学术合并问题实例有解,当且仅当三分区实例有解。
先设我们能把 a1,···,a3m 划分为 m 个和为 B 的三元组,令 Ti 为第 i 个三元组的整数子集。对 i 从 1 到 m 依次执行:把引用数为 aj(aj ∈ Ti)的三个 X 型版本相互合并。注意,所有合并的中间版本引用数之和都小于 B − 1,因此始终停留在最后一页。三元组合并完成后,我们得到一个引用数为 B 的版本,它会与所有 Y 型版本出现在同一页。接着,把任意一个已合并的三元组与引用数为 D − B + 2i 的 Y 型版本合并,使其引用数变为 D + 2i;再把它与引用数为 D + 2i 的 Z 型版本合并。任何由一个 Z 型、一个 Y 型和三个 X 型版本合并而成的版本都会出现在第一页,因为它的引用数超过引用数最多的块。最后,我们在第一页上合并论文 P 的所有剩余版本。因此,若三分区实例有解,论文 P 的所有版本都可以被合并。
反过来,设实例中论文 P 的所有版本都能被合并。这意味着所有 X 型版本都能与其他 X 型版本合并成引用数至少为 B 的版本,否则它们会停留在引用数为 B − 1 的块之后。由于 X 型版本的引用数总共为 mB,这样的合并版本至多有 m 个。每个这样的合并版本由至少三个 X 型版本组成,因为任何两个 X 型版本合并后引用数至多为 B − 2(因为 X 型版本的引用数 ai 满足 B/4 < ai < B/2,对所有 1 ≤ i ≤ 3m)。
此外,论文 P 的所有版本都能合并这一假设意味着,在合并序列的某个时刻,我们必须得到过引用数分别为 D + 2, D + 4,···,D + 2m 的合并版本,否则各 Z 型版本无法到达同一页。两个 Y 型版本合并后至少有 2D − 2B + 6 > D + 2m 次引用,所以若想得到引用数与某个 Z 型版本相同的版本,至多只能使用一个 Y 型版本。而要把 X 型版本与任何 Z 型版本合并,它们必须在某个时刻处于引用数至少为 D + 2 的合并版本中,但所有 X 型版本的引用数总共只有 mB < D + 2。因此这样的合并版本总是需要一个 Y 型版本,于是我们得出:要造出与任何一个 Z 型版本引用数相同的合并版本,恰好需要一个 Y 型版本。
这意味着我们至少需要 m 个由 X 型版本组成的合并版本,否则没有足够的版本与 Y 型版本合并,以达到与各 Z 型版本相同的引用数;于是我们必须恰好造出 m 个由 X 型版本组成的合并版本。为此,不能有任何合并版本包含四个 X 型版本,否则该合并版本的引用数将大于 B,而由鸽巢原理,m 个合并版本中至少会有一个只含两个版本,引用数小于 B。由此可知,这 m 个由 X 型论文组成的合并版本都是三元组,且每组之和恰为 B,说明三分区实例有解。NP 完全性证毕。
基于 NP 完全问题"划分"(Partition)也可以构造一个类似且稍简单的证明。我们选择三分区的原因是:三分区在强意义下是 NP 完全的 [2],这意味着即便引用总数被 n(输入中的整数)的多项式所界定,三分区仍是 NP 完全的;而基于划分的证明将要求研究者的论文拥有指数级数量的引用。
3 结论
看到一个因界面设计而迫使用户求解 NP 完全问题的用户界面实例,是很有意思的。相比之下,运筹学公司常常提供让用户为某个(通常是 NP 困难的)调度问题制定计划的界面,但那里的界面会辅助排程,而且没有任何界面能让问题本身变容易。另一类例子是某些益智游戏,其大规模实例扩展是 NP 完全甚至 PSPACE 困难的 [3],但在那里,任务本来就应当是困难的。而在谷歌学术合并问题中,NP 完全性是偶然产生的,源于用户界面的设计方式:谷歌学术让研究人员维护自己的谷歌学术个人资料页变得困难。自 2014 年年中界面更改后,该问题不再难解,因为现在有一个"显示更多"按钮,允许用户在同一页上加载更多论文,最终使所有论文都出现在同一页上。
利用合并论文的方式在谷歌学术中最大化 H 指数已被证明是 NP 困难的 [4]。此外,当论文版本之间存在一个兼容图时,即便只是判定 H 指数能否增加 1 也已是 NP 困难的 [1]。
在以下情形中,本文研究的谷歌学术合并问题是否仍是 NP 完全的,尚属开放问题:
·页面大小被一个常数界定,
·任何一篇论文的版本数被一个常数界定,或者,
·页面起点可以放在任意名次处,而不只是页面大小的整数倍加一处。
Source text: Google Scholar makes it hard - the complexity of organizing one's publications.
Hans L. Bodlaender, Marc van Kreveld∗
Department of Information and Computing Sciences, Utrecht University, P.O. Box 80.089, 3508 TB, Utrecht, The Netherlands
Abstract: With Google Scholar, scientists can maintain their publications on personal profile pages, while the citations to these works are automatically collected and counted. Maintenance of publications is done manually by the researcher herself, and involves deleting erroneous ones, merging ones that are the same but which were not recognized as the same, adding forgotten co-authors, and correcting titles of papers and venues. The publications are presented on pages with 20 or 100 papers in the web page interface from 2012–2014. (Since mid 2014, Google Scholar’s profile pages allow any number of papers on a single page.) The interface does not allow a scientist to merge two versions of a paper if they appear on different pages. This not only implies that a scientist who wants to merge certain subsets of publications will sometimes be unable to do so, but also, we show in this note that the decision problem to determine if it is possible to merge given subsets of papers is NP-complete.
Keywords: Computational complexity, NP-completeness, Reduction, 3-partition, Google Scholar
1. Introduction
Most researchers in computer science will be familiar with Google Scholar and its abilities to maintain publications and their citations. Each researcher has his/her own profile which is shown as a web page with a list of publications. Google Scholar determines the number of citations to each publication and by default, lists them in this order on web pages for that researcher. Since the collection of the data is automated, it will contain various mistakes, many of which are caused by other scientists who fail to give the title or other essential information on a paper correctly. As a consequence, a single paper may have various versions in the list, with a slightly different title or publication venue, or with co-authors missing (see Fig. 1). Google Scholar offers researchers the possibility to correct these mistakes on their own profile page, for example by allowing them to merge two paper versions into one. This creates one version with the citations of the original versions summed up. Of course, one could also delete the erroneous version, but this may cost some citations, which in turn can influence the ever-important H-index and other summary statistics that Google Scholar maintains.
Google Scholar by default shows the publication list on pages with 20 papers. It is possible to change this number to 100. Since mid 2014, a change in the interface makes it possible to get all publications on a single page. In this note we assume the interface in use from 2012 until mid 2014, when this was not possible and the maximum was 100 on a single page. To merge two papers, both should be selected on the web page, after which the merge action can be executed. However, selection of two papers is possible only if they appear on the same page, and therefore, merging can be done only if the two papers appear in the same group of 100 papers. For example, if one paper is the 103rd by citation count and another paper is the 187th, then they can be merged, but if one is the 97th and the other is the 105th, then they cannot be merged.
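The page-grouping rule above is easy to state in code. A minimal sketch, assuming 1-based ranks and the fixed group size of 100 from the 2012–2014 interface (the function names are ours):

```python
PAGE_SIZE = 100  # the fixed group size in the 2012-2014 interface

def page_of(rank):
    """1-based page number of the paper at a given 1-based rank."""
    return (rank - 1) // PAGE_SIZE + 1

def can_merge(rank_a, rank_b):
    """Two versions can be selected together only if they share a page."""
    return page_of(rank_a) == page_of(rank_b)

# The example above: ranks 103 and 187 both lie on page 2 (ranks 101-200),
# while ranks 97 and 105 lie on pages 1 and 2 respectively.
assert can_merge(103, 187)
assert not can_merge(97, 105)
```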
The order in which papers are merged is important. If there are two pairs of papers to be merged, for example at positions 4 and 12, and at positions 101 and 107, then merging the first pair first will move the positions of the latter pair to 100 and 106, putting them on different pages. But merging the second pair first still allows the merging of the first pair. Notice that the position of a paper can change both forward and backward due to a merge, see Fig. 2. Besides the problem that desired merges sometimes cannot be done, the computational problem of deciding whether a sequence of merges exists (and therefore, finding the correct order) is computationally intractable. This implies that a polynomial-time algorithm to produce the sequence of merges is unlikely to exist.
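The order dependence can be simulated directly. The sketch below is ours and uses a small hypothetical page size of 4 rather than 100 to keep the example short; the citation counts are illustrative and assumed distinct:

```python
PAGE = 4  # small illustrative page size; the 2012-2014 interface used 100

def page_of(rank):
    """1-based rank -> 1-based page number."""
    return (rank - 1) // PAGE + 1

def try_merge(citations, a, b):
    """Merge the versions cited a and b times if they currently share a page;
    return the new descending citation list, or None if the merge is illegal.
    Assumes all citation counts are distinct (ties are irrelevant here)."""
    rank = {c: i + 1 for i, c in enumerate(citations)}
    if page_of(rank[a]) != page_of(rank[b]):
        return None
    return sorted([c for c in citations if c not in (a, b)] + [a + b],
                  reverse=True)

papers = [100, 90, 80, 70, 60, 50, 45, 30]   # already sorted by citations
# pair (100, 90) sits at ranks 1-2 (page 1); pair (60, 45) at ranks 5 and 7 (page 2)

# Merging the first pair first strands the second pair across a page break:
assert try_merge(try_merge(papers, 100, 90), 60, 45) is None
# Merging the second pair first keeps both merges possible:
assert try_merge(try_merge(papers, 60, 45), 100, 90) is not None
```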
2. A proof of intractability
To prove intractability, or, NP-completeness of the problem, we will formalize it first. Let n be the total number of versions of papers initially in a problem instance, and let p be the page size. Let the paper versions be v1, ..., vn, and assume that paper version vi is cited c(vi) times. A problem instance consists of the sequence c(v1), ..., c(vn), and a partition of 1, ..., n into subsets where two or more versions in the same subset indicates that they are different versions of the same paper, and therefore, they are to be merged into one. The Google Scholar Merge Problem is the problem of deciding whether for every subset, all of its versions can be merged. When two versions are merged, they appear as one new version and their citations are added. After each merge, the new set of versions appears in sorted order on citation count. When citation counts are the same, the papers will appear in some other well-defined order, but this will be irrelevant for the intractability proof and we will ignore this issue.
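To make the formalization concrete, here is a brute-force decision procedure of our own for the Google Scholar Merge Problem; it runs in exponential time, which is consistent with the intractability established in this section. Versions are pairs of (citations, paper id), distinct citation counts are assumed, and versions sharing a paper id form one subset of the partition:

```python
def solvable(versions, page):
    """Exponential-time brute force for the Google Scholar Merge Problem,
    shown only to make the problem statement concrete. `versions` is a list
    of (citations, paper_id) pairs; all versions sharing a paper_id must end
    up merged into one. Distinct citation counts are assumed."""
    def rec(state):
        papers = [p for _, p in state]
        if all(papers.count(p) == 1 for p in papers):
            return True                        # every paper is a single version
        for i in range(len(state)):
            for j in range(i + 1, len(state)):
                if state[i][1] != state[j][1]:
                    continue                   # different papers: never merged
                if i // page != j // page:
                    continue                   # different pages: cannot select
                merged = (state[i][0] + state[j][0], state[i][1])
                rest = state[:i] + state[i + 1:j] + state[j + 1:]
                if rec(tuple(sorted(rest + (merged,), reverse=True))):
                    return True
        return False
    return rec(tuple(sorted(versions, reverse=True)))
```

With page size 4, the instance below is solvable only because the search tries merging paper D before paper A.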
Theorem 1. The Google Scholar Merge Problem is NP-complete.
Proof. First, we will verify that the Google Scholar Merge Problem is in NP. This is easy: a suggested merge order can easily be checked in quadratic time or less.
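The membership argument can be sketched as a polynomial-time certificate checker. The certificate format below (a list of citation-count pairs naming each merge) and the assumption of distinct citation counts are our own choices:

```python
def check_merge_sequence(versions, page, merges):
    """Polynomial-time check of a proposed merge order, as NP membership
    requires. `versions` is a list of (citations, paper_id) pairs and
    `merges` names each merge by the two citation counts involved
    (a certificate format of our own choosing; citation counts are assumed
    distinct, since the tie-breaking order is irrelevant to the proof)."""
    state = sorted(versions, reverse=True)
    for a, b in merges:
        rank = {c: i for i, (c, _) in enumerate(state)}
        if a not in rank or b not in rank:
            return False                      # no such versions at this point
        ia, ib = rank[a], rank[b]
        if state[ia][1] != state[ib][1]:
            return False                      # versions of different papers
        if ia // page != ib // page:
            return False                      # not on the same page
        merged = (a + b, state[ia][1])
        state = sorted([v for v in state if v[0] not in (a, b)] + [merged],
                       reverse=True)
    papers = [p for _, p in state]
    return all(papers.count(p) == 1 for p in papers)  # every paper fully merged
```

Each of the at most n merges costs O(n log n) work, so the whole check is polynomial.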
Second, we use another NP-complete or NP-hard problem and provide a reduction to our problem, namely 3-partition [2].
A 3-partition instance consists of a set a1, ..., a3m of positive integers and an integer B, and asks whether we can partition a1, ..., a3m in m subsets of 3 integers that each sum up to B. The integers a1, ..., a3m are all strictly between B/4 and B/2, which ensures that any subset that sums to B consists of exactly three integers.
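For reference, the source problem can be stated as a naive exponential-time solver (our own illustration; the point of the reduction is precisely that no polynomial-time algorithm is expected):

```python
from itertools import combinations

def three_partition(nums, B):
    """Naive exponential-time solver for 3-partition, shown only to make the
    source problem concrete: can `nums` (3m positive integers) be split into
    m disjoint triples that each sum to B?"""
    if not nums:
        return True
    # The first remaining number must go into some triple; try all partners.
    first, rest = nums[0], nums[1:]
    for b, c in combinations(rest, 2):
        if first + b + c == B:
            remaining = list(rest)
            remaining.remove(b)
            remaining.remove(c)
            if three_partition(remaining, B):
                return True
    return False

# m = 2, B = 90: the triples (25, 30, 35) and (26, 30, 34) both sum to 90,
# and every integer lies strictly between B/4 = 22.5 and B/2 = 45.
assert three_partition([25, 30, 35, 26, 30, 34], 90)
```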
We describe the reduction from 3-partition to the Google Scholar Merge Problem. First, we double all ai and B to ensure that they are even. With slight abuse of notation we continue to use the notation a1, ..., a3m and B for these doubled values.
We set the page size to 3m. Let D be some large, even integer; we can use D = 3mB.
Our instance consists of one paper P with many versions and many papers with one version, see Fig. 3 for a schematic depiction.
First, we take one paper P with 5m versions that are cited:
• a1, ..., a3m times (type X),
• D − B + 2i times (type Y), for each i, 1 ≤ i ≤ m,
• D + 2i times (type Z), for each i, 1 ≤ i ≤ m.
All versions of P have an even number of citations, so any merge of these will also have an even number of citations.
Second, we take many papers with only one version as follows:
• for each i, 1 ≤ i ≤ m, we take 3m − 1 papers with one version, each with D + 2i − 1 citations,
• we take 3m papers with D + 2m + 1 citations,
• we take 5m papers with B − 1 citations.
An instance fills exactly m + 4 pages, and all type X versions appear on the last page.
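The construction can be written out as a small generator. This is a sketch of the construction above in our own representation (pairs of citation count and paper name, with hypothetical names 'P' and 's0', 's1', ... for the papers); the final count confirms that the instance fills exactly m + 4 pages of size 3m:

```python
def reduction_instance(a, B):
    """Build the merge-problem instance of the construction above from a
    3-partition instance (a sketch; the pair representation and the
    single-paper names 's0', 's1', ... are our own). Returns
    (page_size, versions), where 'P' is the one paper with many versions."""
    m = len(a) // 3
    a = [2 * x for x in a]          # double everything so all counts are even
    B = 2 * B
    D = 3 * m * B                   # a large even integer, as in the text
    versions = [(x, 'P') for x in a]                               # type X
    versions += [(D - B + 2 * i, 'P') for i in range(1, m + 1)]    # type Y
    versions += [(D + 2 * i, 'P') for i in range(1, m + 1)]        # type Z
    singles = []
    for i in range(1, m + 1):       # the blocks that separate type Z versions
        singles += [D + 2 * i - 1] * (3 * m - 1)
    singles += [D + 2 * m + 1] * (3 * m)
    singles += [B - 1] * (5 * m)
    versions += [(c, f's{k}') for k, c in enumerate(singles)]
    return 3 * m, versions

# m = 2: 5m = 10 versions of P plus 26 single papers give 36 versions, which
# fill m + 4 = 6 pages of size 3m = 6 exactly.
page_size, versions = reduction_instance([25, 30, 35, 26, 30, 34], 90)
assert page_size == 6 and len(versions) == 6 * page_size
```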
We call the papers of the second set the single papers; they all have an odd number of citations and will not be merged. A group of at least 3m − 1 single papers with the same number of citations is called a block; there are m + 2 blocks in the instance. Since a block has at least as many papers as the page size minus one, paper versions on different sides of a block in citation count can never be on the same page. Notice that the type Z versions of paper P are separated by the blocks. The only way to “rescue” a type Z version and get it out of the adjacent blocks is to create a version of paper P that has exactly the same number of citations. The construction makes sure that this can only be accomplished using exactly one type Y version and three type X versions of paper P; the proof is given shortly.
We claim that this Google Scholar Merge Problem instance has a solution if and only if the 3-partition instance has a solution. Suppose we have a partition of a1, ..., a3m into m triples that each sum up to B. Let Ti be the subset of integers in the ith triple.
For i from 1 to m do the following. Merge the three type X versions with aj citations for the aj ∈ Ti with each other. Note that all intermediate merged versions have less than B − 1 citations together, and thus will appear on the last page. When we merged the triple we have a version with B citations, which will appear on the same page as all type Y versions. Next, we merge any merged triple with a type Y version with D − B + 2i citations, giving it D + 2i citations. Then we merge it with the type Z version that has D + 2i citations.
Any merge of a type Z, a type Y, and three type X versions will appear on the first page because it will have more citations than the block with most citations. We finish by merging all versions of paper P on the first page. Hence, we can merge all versions of paper P if the 3-partition instance has a solution.
Conversely, suppose that we can merge all versions of paper P in the instance. This implies that all type X versions can be merged with other type X versions into versions with at least B citations, otherwise they would stay after the block with B − 1 citations. We can make ≤ m of these merged versions, because the type X versions have mB citations in total. These merged versions consist of at least three type X versions, because any two type X versions merged have at most B − 2 citations (since type X versions are cited ai times with B/4 < ai < B/2, for all 1 ≤ i ≤ 3m).
Furthermore, the assumption that we can merge all versions of paper P implies that at some point in the merge sequence, we must have had merged versions with D + 2, D + 4, ..., D + 2m citations, otherwise the type Z versions cannot get to a common page.
Two type Y versions merged will have at least 2D − 2B + 6 > D + 2m citations, so we can use at most one type Y paper if we want to get a version with as many citations as some type Z version. To get the type X versions merged with any of the type Z versions, they must at some point be in a merged version with at least D + 2 citations, but all of the type X versions together have only mB < D + 2 citations. So a type Y version is always needed in such a merged version, and we conclude that exactly one type Y version is needed in any merged version to make a merged version with the same number of citations as any of the type Z versions.
This implies that we need at least m merged versions of type X versions, otherwise we do not have enough versions to merge with the type Y versions to reach the same citation counts as the type Z versions, so we must be able to make exactly m merged versions of type X papers. To realize this we cannot have a merged version with four type X versions, because this merged version would have > B citations, and then by the pigeonhole principle at least one of the m merged versions has only two versions and hence < B citations. It follows that the m merged versions of type X papers are triples and they sum up to exactly B, showing that the 3-partition instance has a solution. This completes the proof of NP-completeness.
It would have been possible to construct a similar but slightly easier proof based on a reduction from the NP-complete problem Partition. The reason we chose 3-partition is that 3-partition is NP-complete in the strong sense [2], which implies that even if the total number of citations is bounded by a polynomial in n (the integers of the input), then 3-partition is still NP-complete. A proof based on Partition would require exponentially many citations to the papers of a researcher.
3. Conclusions
It is interesting to see an example of a user interface where the user has to solve an NP-complete problem due to the choice of interface. By comparison, operations research companies often provide a user interface where a user makes a schedule for a scheduling problem that is usually NP-hard, but here the user interface assists in the scheduling, and no user interface can make the problem easier. Other examples are certain puzzle games where large-instance extensions of the puzzle type are NP-complete or even PSPACE-hard [3]. Here the idea is clearly that the task is supposed to be hard. In the Google Scholar Merge Problem, the NP-completeness is accidental, and arises from the way the user interface is designed. Google Scholar makes it hard for researchers to maintain their Google Scholar profile pages. Since the interface change in the middle of 2014, the problem is no longer intractable because there is now a button “show more” that allows a user to get more papers on the same page, eventually allowing all papers to be on a single page.
Maximizing the H-index using paper merging in Google Scholar was shown to be NP-hard [4]. Furthermore, when a compatibility graph exists between paper versions, even deciding whether the H-index can be increased by one is already NP-hard [1].
It is open to determine whether the Google Scholar Merge Problem studied in this paper is still NP-complete if:
• the page size is bounded by a constant,
• the number of versions of any paper is bounded by a constant, or
• the page start can be placed at any rank number, not just at multiples of the page size plus one.
References
[1] René van Bevern, Christian Komusiewicz, Rolf Niedermeier, Manuel Sorge, Toby Walsh, Manipulation by merging: models, theory, and experiments, in: Proceedings of the 24th International Joint Conference on Artificial Intelligence, AAAI Press, 2015.
[2] Michael R. Garey, David S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman, 1979.
[3] Robert A. Hearn, Erik D. Demaine, Games, Puzzles, and Computation, A K Peters, 2009.
[4] Bart de Keijzer, Krzysztof R. Apt, The H-index can be easily manipulated, Bull. Eur. Assoc. Theor. Comput. Sci. 110 (2013) 79–85.