Findings

Culturomics

Kevin Lewis

December 17, 2010

Quantitative Analysis of Culture Using Millions of Digitized Books

Jean-Baptiste Michel et al.
Science, forthcoming

Abstract:
We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of "culturomics", focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. "Culturomics" extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.

----------------------

Inferring social ties from geographic coincidences

David Crandall et al.
Proceedings of the National Academy of Sciences, forthcoming

Abstract:
We investigate the extent to which social ties between people can be inferred from co-occurrence in time and space: Given that two people have been in approximately the same geographic locale at approximately the same time, on multiple occasions, how likely are they to know each other? Furthermore, how does this likelihood depend on the spatial and temporal proximity of the co-occurrences? Such issues arise in data originating in both online and offline domains as well as settings that capture interfaces between online and offline behavior. Here we develop a framework for quantifying the answers to such questions, and we apply this framework to publicly available data from a social media site, finding that even a very small number of co-occurrences can result in a high empirical likelihood of a social tie. We then present probabilistic models showing how such large probabilities can arise from a natural model of proximity and co-occurrence in the presence of social ties. In addition to providing a method for establishing some of the first quantifiable estimates of these measures, our findings have potential privacy implications, particularly for the ways in which social structures can be inferred from public online records that capture individuals' physical locations over time.

----------------------

Redrawing the Map of Great Britain from a Network of Human Interactions

Carlo Ratti et al.
PLoS ONE, December 2010, e14248

Abstract:
Do regional boundaries defined by governments respect the more natural ways that people interact across space? This paper proposes a novel, fine-grained approach to regional delineation, based on analyzing networks of billions of individual human transactions. Given a geographical area and some measure of the strength of links between its inhabitants, we show how to partition the area into smaller, non-overlapping regions while minimizing the disruption to each person's links. We tested our method on the largest non-Internet human network, inferred from a large telecommunications database in Great Britain. Our partitioning algorithm yields geographically cohesive regions that correspond remarkably well with administrative regions, while unveiling unexpected spatial structures that had previously only been hypothesized in the literature. We also quantify the effects of partitioning, showing for instance that the effects of a possible secession of Wales from Great Britain would be twice as disruptive for the human network than that of Scotland.

----------------------

Divisions within Academia: Evidence from Faculty Hiring and Placement

Marko Terviö
Review of Economics and Statistics, forthcoming

Abstract:
I look for divisions to clusters among academic departments in three disciplines: economics, mathematics, and comparative literature. I define clusters as subsets of departments with unexpectedly little hiring across the cluster lines. The division within economics is by far the strongest, is consistent with anecdotal evidence about "Freshwater" and "Saltwater" schools of thought, and has been stable over time. There is also a significant division within comparative literature, but the hiring patterns between top mathematics departments are consistent with random matching.

----------------------

There's plenty of time for evolution

Herbert Wilf & Warren Ewens
Proceedings of the National Academy of Sciences, forthcoming

Abstract:
Objections to Darwinian evolution are often based on the time required to carry out the necessary mutations. Seemingly, exponential numbers of mutations are needed. We show that such estimates ignore the effects of natural selection, and that the numbers of necessary mutations are thereby reduced to about K log L, rather than K^L, where L is the length of the genomic "word," and K is the number of possible "letters" that can occupy any position in the word. The required theory makes contact with the theory of radix-exchange sorting in theoretical computer science, and the asymptotic analysis of certain sums that occur there.
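
The gap between the two growth rates can be illustrated with a toy simulation. The sketch below is not taken from the paper. It assumes a simplified guessing model in which every still-incorrect position of the word is re-guessed uniformly at random each round, and a position that becomes correct is kept from then on (standing in for selection). The function names and parameters are hypothetical.

```python
import math
import random


def rounds_with_selection(K, L, rng):
    """Count rounds until all L positions are correct, keeping correct letters.

    Assumed toy model: each still-wrong position is re-guessed uniformly
    among K letters every round and is fixed once it matches.
    """
    remaining = L
    rounds = 0
    while remaining > 0:
        rounds += 1
        # A guess matches the target letter with probability 1/K;
        # count the positions that are still wrong after this round.
        remaining = sum(1 for _ in range(remaining) if rng.randrange(K) != 0)
    return rounds


def mean_rounds(K, L, trials=200, seed=0):
    """Average the round count over several independent trials."""
    rng = random.Random(seed)
    total = sum(rounds_with_selection(K, L, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    K, L = 4, 1000
    print("simulated mean rounds:", mean_rounds(K, L))
    print("K * ln(L):            ", K * math.log(L))
    # Without retention, a whole-word guess succeeds with probability K**-L,
    # so the expected number of attempts is K**L (printed as a digit count).
    print("digits in K**L:       ", len(str(K ** L)))
```

Under these assumptions the simulated mean should come out on the order of K log L, while the no-selection figure K^L has hundreds of digits for the same K and L.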

----------------------

Putting men on a pedestal: Nobel prizes as superhuman myths?

Danny Dorling
Significance, September 2010, Pages 142-144

Abstract:
The Nobel prizes for 2010 are to be announced in October. They recognise the best thinkers, the most beautiful minds in their fields; but does the distribution of Nobels accurately reflect the distribution of elite minds? Danny Dorling looks at laureate statistics and finds arbitrariness lurks.

----------------------

Industry Induces Academic Science to Know Less about More

James Evans
American Journal of Sociology, September 2010, Pages 389-452

Abstract:
How does collaboration between academic research and industry shape science? This article argues that companies' relative indifference to theory nudges their academic partners toward novel, theoretically unanticipated experiments. The article then evaluates this proposition using fieldwork, archival materials, and panel models of all academic research using the popular plant model Arabidopsis thaliana and the companies that support that research. Findings suggest that industry partnerships draw high‐status academics away from confirming theories and toward speculation. For the network of scientific ideas surrounding Arabidopsis, industry sponsorship weaves discoveries around the periphery into looser, more expansive knowledge. Government funding plays a complementary role, sponsoring focused scientific activity in dense hubs that facilitate scientific community and understanding.

----------------------

Does Collocation Inform the Impact of Collaboration?

Kyungjoon Lee, John Brownstein, Richard Mills & Isaac Kohane
PLoS ONE, December 2010, e14279

Background: It has been shown that large interdisciplinary teams working across geography are more likely to be impactful. We asked whether the physical proximity of collaborators remained a strong predictor of the scientific impact of their research as measured by citations of the resulting publications.

Methodology/Principal Findings: Articles published by Harvard investigators from 1993 to 2003 with at least two authors were identified in the domain of biomedical science. Each collaboration was geocoded to the precise three-dimensional location of its authors. Physical distances between any two coauthors were calculated and associated with corresponding citations. Relationship between distance of coauthors and citations for four author relationships (first-last, first-middle, last-middle, and middle-middle) were investigated at different spatial scales. At all sizes of collaborations (from two authors to dozens of authors), geographical proximity between first and last author is highly informative of impact at the microscale (i.e. within building) and beyond. The mean citation for first-last author relationship decreased as the distance between them increased in less than one km range as well as in the three categorized ranges (in the same building, same city, or different city). Such a trend was not seen in other three author relationships.

Conclusions/Significance: Despite the positive impact of emerging communication technologies on scientific research, our results provide striking evidence for the role of physical proximity as a predictor of the impact of collaborations.

----------------------

Lessons From an Oops at Consumer Reports: Consumers Follow Experts; Ignore Invalid Information

Uri Simonsohn
Journal of Marketing Research, forthcoming

Abstract:
In 2007 Consumer Reports released, and two weeks later retracted, a flawed report on the safety of infant carseats. Analyzing data from 5,471 online auctions for carseats ending before, during and after the information was considered valid, I find that (1) consumers responded to the new information, and, more interestingly, that (2) they promptly ceased to do so once it was retracted. This first finding, thanks to the random nature of the flawed ratings, demonstrates that expert advice has a causal effect on consumer demand. The second finding suggests that people's inability to willfully ignore information is not as extreme as the experimental evidence in the psychological literature would suggest.

----------------------

The role of mortality in the transmission of knowledge

Michael Bar & Oksana Leukhina
Journal of Economic Growth, December 2010, Pages 291-321

Abstract:
We investigate, both theoretically and quantitatively, a previously unexplored link between gains in adult mortality and productivity growth. Our mechanism allocates a central role to individuals as carriers of useful ideas and to personal contact as an important means of transferring these ideas. It thus implies that disrupting a human life impedes the process of knowledge transmission across time. We derive a simple and intuitive form of the dependence of aggregate knowledge transfer on adult mortality and incorporate it into a model of endogenous growth. We then quantitatively examine the relevance of the proposed link in application to the long-run growth experience of England. Our calibration exercise suggests that the reduction in adult mortality, by improving knowledge transmission across time and encouraging more innovation, was a quantitatively important force behind the takeoff in output per capita.

----------------------

The Temporal Structure of Scientific Consensus Formation

Uri Shwed & Peter Bearman
American Sociological Review, December 2010, Pages 817-840

Abstract:
This article engages with problems that are usually opaque: What trajectories do scientific debates assume, when does a scientific community consider a proposition to be a fact, and how can we know that? We develop a strategy for evaluating the state of scientific contestation on issues. The analysis builds from Latour's black box imagery, which we observe in scientific citation networks. We show that as consensus forms, the importance of internal divisions to the overall network structure declines. We consider substantive cases that are now considered facts, such as the carcinogenicity of smoking and the non-carcinogenicity of coffee. We then employ the same analysis to currently contested cases: the suspected carcinogenicity of cellular phones, and the relationship between vaccines and autism. Extracting meaning from the internal structure of scientific knowledge carves a niche for renewed sociological commentary on science, revealing a typology of trajectories that scientific propositions may experience en route to consensus.

----------------------

The Structure of Borders in a Small World

Christian Thiemann, Fabian Theis, Daniel Grady, Rafael Brune & Dirk Brockmann
PLoS ONE, November 2010, e15422

Abstract:
Territorial subdivisions and geographic borders are essential for understanding phenomena in sociology, political science, history, and economics. They influence the interregional flow of information and cross-border trade and affect the diffusion of innovation and technology. However, it is unclear if existing administrative subdivisions that typically evolved decades ago still reflect the most plausible organizational structure of today. The complexity of modern human communication, the ease of long-distance movement, and increased interaction across political borders complicate the operational definition and assessment of geographic borders that optimally reflect the multi-scale nature of today's human connectivity patterns. What border structures emerge directly from the interplay of scales in human interactions is an open question. Based on a massive proxy dataset, we analyze a multi-scale human mobility network and compute effective geographic borders inherent to human mobility patterns in the United States. We propose two computational techniques for extracting these borders and for quantifying their strength. We find that effective borders only partially overlap with existing administrative borders, and show that some of the strongest mobility borders exist in unexpected regions. We show that the observed structures cannot be generated by gravity models for human traffic. Finally, we introduce the concept of link significance that clarifies the observed structure of effective borders. Our approach represents a novel type of quantitative, comparative analysis framework for spatially embedded multi-scale interaction networks in general and may yield important insight into a multitude of spatiotemporal phenomena generated by human activity.

----------------------

The spread of innovations in social networks

Andrea Montanari & Amin Saberi
Proceedings of the National Academy of Sciences, 23 November 2010, Pages 20196-20201

Abstract:
Which network structures favor the rapid spread of new ideas, behaviors, or technologies? This question has been studied extensively using epidemic models. Here we consider a complementary point of view and consider scenarios where the individuals' behavior is the result of a strategic choice among competing alternatives. In particular, we study models that are based on the dynamics of coordination games. Classical results in game theory studying this model provide a simple condition for a new action or innovation to become widespread in the network. The present paper characterizes the rate of convergence as a function of the structure of the interaction network. The resulting predictions differ strongly from the ones provided by epidemic models. In particular, it appears that innovation spreads much more slowly on well-connected network structures dominated by long-range links than in low-dimensional ones dominated, for example, by geographic proximity.

----------------------

Superstars and heavy tails in recorded entertainment: Empirical analysis of the market for DVDs

W.D. Walls
Journal of Cultural Economics, November 2010, Pages 261-279

Abstract:
This research presents a systematic empirical analysis of the market for digital versatile discs (DVDs). We examine a sample of 953 DVD titles that appeared on the weekly top-30 sales charts in North America over a 30-month interval. We find that the size distribution of weekly DVD sales revenue does not indicate the presence of increasing returns to information. The empirical results for DVD sales contrast starkly with previous results obtained for motion-picture box-office revenue, where a number of researchers have found evidence of positive feedback in demand. While the distribution of cumulative revenues across DVDs is highly unequal, the DVD market appears not to be characterized by the extreme heavy upper tail that so well describes the winner-take-all nature of the distribution of box-office success across motion pictures.

----------------------

Retractions in the scientific literature: Do authors deliberately commit research fraud?

Grant Steen
Journal of Medical Ethics, forthcoming

Background: Papers retracted for fraud (data fabrication or data falsification) may represent a deliberate effort to deceive, a motivation fundamentally different from papers retracted for error. It is hypothesised that fraudulent authors target journals with a high impact factor (IF), have other fraudulent publications, diffuse responsibility across many co-authors, delay retracting fraudulent papers and publish from countries with a weak research infrastructure.

Methods: All 788 English language research papers retracted from the PubMed database between 2000 and 2010 were evaluated. Data pertinent to each retracted paper were abstracted from the paper and the reasons for retraction were derived from the retraction notice and dichotomised as fraud or error. Data for each retracted article were entered in an Excel spreadsheet for analysis.

Results: Journal IF was higher for fraudulent papers (p<0.001). Roughly 53% of fraudulent papers were written by a first author who had written other retracted papers (‘repeat offender’), whereas only 18% of erroneous papers were written by a repeat offender (χ2=88.40; p<0.0001). Fraudulent papers had more authors (p<0.001) and were retracted more slowly than erroneous papers (p<0.005). Surprisingly, there was significantly more fraud than error among retracted papers from the USA (χ2=8.71; p<0.05) compared with the rest of the world.

Conclusions: This study reports evidence consistent with the ‘deliberate fraud’ hypothesis. The results suggest that papers retracted because of data fabrication or falsification represent a calculated effort to deceive. It is inferred that such behaviour is neither naïve, feckless nor inadvertent.

----------------------

Making words work: Using financial text as a Predictor of financial events

Mark Cecchini, Haldun Aytug, Gary Koehler & Praveen Pathak
Decision Support Systems, December 2010, Pages 164-175

Abstract:
We develop a methodology for automatically analyzing text to aid in discriminating firms that encounter catastrophic financial events. The dictionaries we create from Management Discussion and Analysis Sections (MD&A) of 10-Ks discriminate fraudulent from non-fraudulent firms 75% of the time and bankrupt from nonbankrupt firms 80% of the time. Our results compare favorably with quantitative prediction methods. We further test for complementarities by merging quantitative data with text data. We achieve our best prediction results for both bankruptcy (83.87%) and fraud (81.97%) with the combined data, showing that the text of the MD&A complements the quantitative financial information.

----------------------

Who Believes the Hype? An Experimental Examination of How Language Affects Investor Judgments

Jeffrey Hales, Xi (Jason) Kuang & Shankar Venkataraman
Journal of Accounting Research, forthcoming

Abstract:
This paper investigates the effect of vivid language on investor judgments. Recent research finds that investor judgments are significantly influenced by disclosure tone (positive versus negative). Holding tone constant, we investigate investors' reactions to vivid versus pallid information. Drawing on theories from psychology, we predict that investors will be sensitive to the differences between vivid and pallid language when the underlying information is preference-inconsistent, but not when the information is preference-consistent. Results of two experiments support our prediction. Vivid language significantly influences the judgment of investors who hold contrarian positions (i.e., short investors in a bull market and long investors in a bear market). Interestingly, vivid language has limited influence on the judgment of investors who hold positions consistent with the general tenor of the market. Our results provide evidence regarding when vividness matters and when it does not in financial contexts, thereby contributing to both psychology and a growing literature on disclosure tone in financial reporting. In addition, our results also speak to concerns raised by regulators and academics asserting that vivid language can inflate bubbles and incite panics.

----------------------

Self-Selected or Mandated, Open Access Increases Citation Impact for Higher Quality Research

Yassine Gargouri et al.
PLoS ONE, October 2010, e13636

Background: Articles whose authors have supplemented subscription-based access to the publisher's version by self-archiving their own final draft to make it accessible free for all on the web ("Open Access", OA) are cited significantly more than articles in the same journal and year that have not been made OA. Some have suggested that this "OA Advantage" may not be causal but just a self-selection bias, because authors preferentially make higher-quality articles OA. To test this we compared self-selective self-archiving with mandatory self-archiving for a sample of 27,197 articles published 2002-2006 in 1,984 journals.

Methodology/Principal Findings: The OA Advantage proved just as high for both. Logistic regression analysis showed that the advantage is independent of other correlates of citations (article age; journal impact factor; number of co-authors, references or pages; field; article type; or country) and highest for the most highly cited articles. The OA Advantage is real, independent and causal, but skewed. Its size is indeed correlated with quality, just as citations themselves are (the top 20% of articles receive about 80% of all citations).

Conclusions/Significance: The OA advantage is greater for the more citable articles, not because of a quality bias from authors self-selecting what to make OA, but because of a quality advantage, from users self-selecting what to use and cite, freed by OA from the constraints of selective accessibility to subscribers only. It is hoped that these findings will help motivate the adoption of OA self-archiving mandates by universities, research institutions and research funders.

