Influential Statistician Dr. Hadley Wickham Visits DC and Presents on R Code Organization

September 17, 2015 David Kretch

Last night, Summiteers joined Statistical Programming DC for Dr. Hadley Wickham's talk on creating fluent interfaces for R. As the creator of many popular and influential R packages, including ggplot2, plyr, reshape2, dplyr, and tidyr, Dr. Wickham is an authority on developing for R in a readable, reusable, and practical fashion.

(If you haven’t heard of him, check out this profile: “Hadley Wickham, the Man Who Revolutionized R.”)

The focus of Dr. Wickham's presentation was creating readable, reproducible data analysis programs using a technique called piping. The pipe operator, written in R as %>% and provided by the magrittr package, passes the result of one function to the next function as its first argument, as the name implies.

For example,

foo_foo %>%
  hop_through(forest) %>%
  scoop_up(field_mouse) %>%
  bop_on(head)

is equivalent to writing

bop_on(scoop_up(hop_through(foo_foo, forest), field_mouse), head)

Photo: Dr. Hadley Wickham presenting in Washington, D.C. on 9/17/2015

The piped version is much easier to read, though: the steps appear in the order they run rather than from the inside out. Using piping, we can construct complex and readable data analysis programs out of simpler building block functions; piping frees us from having to figure out how to join our blocks together so we can focus on the blocks themselves.
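The foo_foo functions above are illustrative and do not exist, so here is a minimal runnable sketch of the same equivalence using base R functions. It assumes the magrittr package is installed; the variable names are made up for this example.

```r
library(magrittr)  # provides the %>% operator

values <- c(1, 4, 9)

# Nested form: must be read from the inside out.
nested <- sum(sqrt(values))

# Piped form: reads top to bottom, in the order the steps run.
piped <- values %>%
  sqrt() %>%
  sum()

# Both forms perform the same computation.
identical(nested, piped)
```

Each step in the piped form receives the previous result as its first argument, which is why the two forms give the same answer.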

To take advantage of pipes, our functions should have three qualities: purity, predictability, and pipeability.

Purity

Functions should be pure, which is computer science jargon meaning their output depends only on their input and their operation does not change the state of the world. Pure functions are easier to reason about since they can be considered in isolation. For example, the value of sum(1, 2) depends only on the 1 and 2 provided to the sum function; you don’t need to know anything else.

Examples of impure functions are those that read from or write to disk, set options, or generate random numbers. You can’t entirely avoid these, but operations that require impure functions make up only a limited portion of typical data analysis tasks.
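As a rough sketch of the difference (not taken from the talk), compare a function whose result depends only on its arguments with one that also reads a global option. The function names and the option name "demo.tax_rate" are hypothetical.

```r
# Pure: the result depends only on the arguments.
add_tax <- function(price, rate) {
  price * (1 + rate)
}

# Impure: the result also depends on a global option (hidden state).
# "demo.tax_rate" is a made-up option name for this sketch.
add_tax_impure <- function(price) {
  price * (1 + getOption("demo.tax_rate"))
}

options(demo.tax_rate = 0.10)
first <- add_tax_impure(100)

options(demo.tax_rate = 0.20)
second <- add_tax_impure(100)  # same call, different result

# The pure version gives the same result for the same arguments every time.
stable <- add_tax(100, 0.10)
```

To predict what add_tax_impure(100) returns, you have to know what the rest of the program did to the option; add_tax(100, 0.10) can be understood in isolation.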

Predictability

Related functions should be consistent with each other: consistent names, argument orders, output object types, etc. Predictability means we have to learn a convention only once and can then apply it in many places. It makes syntax easier to learn, easier to teach, and easier to read.

Dr. Wickham notes that it’s not always possible to be consistent on all axes. For example, we might have three functions: one takes a dataset and a filename as arguments, one takes only a filename, and one takes only a dataset. We can't consistently put filenames first or datasets first in our argument order! Instead, we can be consistent in our priorities, that is, in which ordering principle takes precedence when two conflict.
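One possible way to read this, sketched with three hypothetical helpers: adopt a single priority rule, such as "the dataset comes first when there is one, otherwise the filename comes first". The function names and the rule itself are assumptions made for illustration, not Dr. Wickham's own convention.

```r
# Hypothetical helpers following one priority rule:
# the dataset comes first when there is one; otherwise the filename does.

# Takes a dataset and a filename: dataset first.
save_table <- function(data, file) {
  write.csv(data, file, row.names = FALSE)
  invisible(data)  # return the data invisibly so the call can be chained
}

# Takes only a filename.
load_table <- function(file) {
  read.csv(file)
}

# Takes only a dataset.
describe_table <- function(data) {
  data.frame(rows = nrow(data), cols = ncol(data))
}

path <- tempfile(fileext = ".csv")
save_table(head(mtcars), path)
summary_table <- describe_table(load_table(path))
```

A reader who knows the rule can guess the argument order of a fourth function in the family without looking it up.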

Pipeability

Functions should accept the object being transformed as their first argument and return an object of the same type. This allows us to chain together our functions with pipes and leverage all other code written to be used with pipes.
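A minimal sketch of what this can look like for data frames, assuming the magrittr package is installed. The two helper functions are hypothetical; each takes the data frame as its first argument and returns a data frame, so they chain with each other and with existing functions such as head().

```r
library(magrittr)  # provides the %>% operator

# Hypothetical helper: data frame in (first argument), data frame out.
keep_columns <- function(data, columns) {
  data[, columns, drop = FALSE]
}

# Hypothetical helper: multiplies one column by a constant factor.
scale_column <- function(data, column, factor) {
  data[[column]] <- data[[column]] * factor
  data
}

result <- mtcars %>%
  keep_columns(c("mpg", "wt")) %>%
  scale_column("wt", 1000) %>%
  head(3)
```

Because every step returns a data frame, the steps can be reordered or extended without changing how they connect.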

By making our functions conform to these guidelines, our code is easier to write and easier to understand.

Thank you to Dr. Wickham for visiting D.C. and sharing such a useful presentation! If you’d like to learn more about his work, check out his website, follow him on Twitter, or browse his popular GitHub.

 
