In a given paper, researchers might aspire to any of several goals. While determining which knowledge warrants inquiry may be subjective, once the topic is fixed, papers are most valuable to the community when they act in service of the reader, creating foundational knowledge and communicating as clearly as possible.

What sort of papers best serve their readers? We can enumerate desirable characteristics. Recent progress in machine learning comes despite frequent departures from these ideals. In this paper, we focus on the following four patterns that appear to us to be trending in ML scholarship:

Failure to distinguish between explanation and speculation.

Failure to identify the sources of empirical gains.

Misuse of language.

While the causes behind these patterns are uncertain, possibilities include the rapid expansion of the community, the consequent thinness of the reviewer pool, and the often-misaligned incentives between scholarship and short-term measures of success.

As the impact of machine learning widens, and the audience for research papers increasingly includes students, journalists, and policy-makers, these considerations apply to this wider audience as well.

We hope that by communicating more precise information with greater clarity, we can accelerate the pace of research, reduce the on-boarding time for new researchers, and play a more constructive role in the public discourse.

Indeed, many of these problems have recurred cyclically throughout the history of artificial intelligence and, more broadly, in scientific research. Similar discussions recurred throughout the 80s, 90s, and aughts [13, 38, 2].

The current strength of machine learning owes to a large body of rigorous research to date, both theoretical [22, 7, 19] and empirical [34, 25, 5].

By promoting clear scientific thinking and communication, we can sustain the trust and investment currently enjoyed by our community. While we stand by the points represented here, we do not purport to offer a full or balanced viewpoint or to discuss the overall quality of science in ML.

In many aspects, such as reproducibility, the community has advanced standards far beyond what sufficed a decade ago.

We note that these arguments are made by us, against us, by insiders offering a critical introspective look, not as sniping outsiders. The ills that we identify are not specific to any individual or institution. We ourselves have fallen into these patterns, and likely will again in the future.

While we provide concrete examples, our guiding principles are (i) to implicate ourselves, and (ii) to preferentially select from the work of better-established researchers and institutions that we admire, to avoid singling out junior students for whom inclusion in this discussion might have consequences and who lack the opportunity to reply symmetrically.

We are grateful to belong to a community that provides sufficient intellectual freedom to allow us to express critical perspectives. Pointing to weaknesses in individual papers can be a sensitive topic. To limit that sensitivity, we keep examples short and specific.

Speculation

Research into new areas often involves exploration predicated on intuitions that have yet to coalesce into crisp formal representations. We recognize the role of speculation as a means for authors to impart intuitions that may not yet withstand the full weight of scientific scrutiny.

However, papers often offer speculation in the guise of explanations, which are then interpreted as authoritative due to the trappings of a scientific paper and the presumed expertise of the authors. For instance, [33] forms an intuitive theory around a concept called internal covariate shift.

The exposition on internal covariate shift, starting from the abstract, appears to state technical facts. However, key terms are not made crisp enough to conclusively assume a truth value.

For example, the paper states that batch normalization offers improvements by reducing changes in the distribution of hidden activations over the course of training. By which divergence measure is this change quantified? The paper never clarifies, and some work suggests that this explanation of batch normalization may be off the mark [65].
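To illustrate why the choice of divergence measure matters, the following is a minimal sketch, not taken from [33] or [65]. It assumes hypothetical one-dimensional Gaussian "activations" and two illustrative helper functions, `gaussian_kl` and `wasserstein_1d`, and shows that two plausible measures can disagree about which of two distribution changes is larger.

```python
import numpy as np


def gaussian_kl(p_samples, q_samples):
    """KL(P || Q) after fitting a 1-D Gaussian to each sample set."""
    mu_p, var_p = p_samples.mean(), p_samples.var()
    mu_q, var_q = q_samples.mean(), q_samples.var()
    return 0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)


def wasserstein_1d(p_samples, q_samples):
    """Empirical 1-Wasserstein distance between equal-sized 1-D samples."""
    return np.abs(np.sort(p_samples) - np.sort(q_samples)).mean()


# Hypothetical activations of a single hidden unit (assumed Gaussian for simplicity).
rng = np.random.default_rng(0)
n = 10_000
before = rng.normal(0.0, 1.0, n)       # reference distribution
mean_shift = rng.normal(1.0, 1.0, n)   # change A: the mean moves, the scale is unchanged
narrowing = rng.normal(0.0, 0.5, n)    # change B: the scale shrinks, the mean is unchanged

kl_a = gaussian_kl(before, mean_shift)
kl_b = gaussian_kl(before, narrowing)
w_a = wasserstein_1d(before, mean_shift)
w_b = wasserstein_1d(before, narrowing)

print(f"KL:          mean shift = {kl_a:.3f}, narrowing = {kl_b:.3f}")
print(f"Wasserstein: mean shift = {w_a:.3f}, narrowing = {w_b:.3f}")
```

Under these assumptions, the Gaussian KL divergence ranks the narrowing as the larger change, while the Wasserstein distance ranks the mean shift as larger. A claim that a method "reduces the change in the distribution" therefore has no definite truth value until the measure is specified.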

