Workshop Introduction

Hello everyone, and welcome to the BIDS Student Group Best Coding Practices Workshop!

Throughout this session we are going to walk through some of the fundamentals of best coding practices in the R language, as well as how to adapt these practices into good everyday code.

As such this workshop is broken into two parts:

  1. Developing proper R style based on the R Style guidelines

  2. Performing a code review that combines the R style with other rules of thumb to create clean code

While this workshop is R-specific, hopefully the style guidelines and writing concepts will translate into your other language workflows!

For Additional Readings On Which This Walkthrough is Based Check Out:

So without further ado, let’s jump in and think about what it means to write clean, readable code.



Style, Structure, and Code Review

Imagine you are reading an article. There’s an opening paragraph, which gives you a brief overview of what the article is about. There are headings, each with a bunch of paragraphs. The paragraphs are structured with the relevant bits of information grouped together and ordered so that the article “flows” and reads nicely.

Now, imagine the article didn’t have any headings. There are paragraphs, but they are long and in a confusing order. You can’t skim read the article, and have to really dive into the content to get a feel for what the article is about. This can be quite frustrating!

Your code should read like a good article. Think of your classes/files as headings, and your methods as paragraphs. Sentences are the statements in your code.

One way to develop this tool kit is by studying previously written code, also known as code review! Code review is the careful, systematic study of source code by people who are not the original author of the code. It’s analogous to proofreading a term paper.

Code review really has two purposes:

Improving the code. Finding bugs, anticipating possible bugs, checking the clarity of the code, and checking for consistency with the project’s style standards.

Improving the programmer. Code review is an important way that programmers learn and teach each other, about new language features, changes in the design of the project or its coding standards, and new techniques. In open source projects, particularly, much conversation happens in the context of code reviews.

On to the big question of the day — how do you actually write clean code?



Style in R

The foundation of interpretable and reproducible code is readability and consistency. Spaghetti code will run, but it is harder to debug, can hide errors, and will cost collaborators time and frustration, especially when you work on the same code with other people or share your code with others. When you co-code with others, it’s a good idea to agree on a common style up front. Any consistent style is better than chaos, so working with others may mean sacrificing some preferred aspects of your own style. (Based on Advanced R by Hadley Wickham.)


Notation and naming

We already discussed some requirements for naming schemes for variables in R, but let’s talk about good and bad practices in naming not only variables, but also files and functions.

File names

File names should be meaningful and end in .R.

Good

fit-models.R
utility-functions.R  

Bad

foo.r
stuff.r  

If files need to be run in sequence, prefix them with numbers:

0-download.R
1-parse.R
2-explore.R

Object names

“There are only two hard things in Computer Science: cache invalidation and naming things.” - Phil Karlton

Variable and function names should be lowercase. Use an underscore (_) to separate words within a name. Generally, variable names should be nouns and function names should be verbs. Strive for concise and meaningful names.

Good

day_one
day_1

Bad

first_day_of_the_month
DayOne
dayone
djm1

Where possible, avoid using names of existing functions and variables, to avoid confusion for the readers of your code.

Bad

T <- FALSE
c <- 10
mean <- function(x) sum(x)
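
To see why the names above are risky, here is a minimal base R sketch. The variable keep_rows is a hypothetical name used only for illustration. It relies on the fact that T and F are ordinary variables in base R, unlike TRUE and FALSE, which are reserved words.

```r
# T is a regular variable that happens to hold TRUE, so it can be
# overwritten. Any later code that trusts T now takes the wrong branch.
T <- FALSE
keep_rows <- if (T) "all rows" else "no rows"
keep_rows

# Removing the masking variable makes the built-in value visible again.
rm(T)
isTRUE(T)
```

Writing TRUE and FALSE in full avoids the problem entirely, since R refuses to assign to those names.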


Syntax

Spacing

Place spaces around all infix operators (=, +, -, <-, etc.). The same rule applies when using = in function calls. Always put a space after a comma, and never before (just like in regular English).

Good

average <- mean(feet / 12 + inches, na.rm = TRUE)

Bad

average<-mean(feet/12+inches,na.rm=TRUE)  

Exception to this rule: :, :: and ::: don’t need spaces around them!

Good

x <- 1:10
base::get

Bad

x <- 1 : 10
base :: get  

Place a space before left parentheses, except in a function call.

Good

if (debug) do(x)
plot(x, y)

Bad

if(debug)do(x)
plot (x, y)  

Extra spacing (i.e., more than one space in a row) is ok if it improves alignment of equal signs or assignments (<-).

list(
  total = a + b + c, 
  mean  = (a + b + c) / n
)

Do not place spaces around code in parentheses or square brackets (unless there’s a comma, in which case see above).

Good

if (debug) do(x)
diamonds[5, ]

Bad

if ( debug ) do(x)  # No spaces around debug
x[1,]   # Needs a space after the comma
x[1 ,]  # Space goes after comma not before

Curly braces

An opening curly brace should never go on its own line and should always be followed by a new line.

A closing curly brace should always go on its own line, unless it’s followed by else.

Always indent the code inside curly braces.

Good

if (y < 0 && debug) {
  message("Y is negative")
}

if (y == 0) {
  log(x)
} else {
  y ^ x
}

Bad

if (y < 0 && debug)
message("Y is negative")

if (y == 0) {
  log(x)
} 
else {
  y ^ x
}

It’s ok to leave very short statements on the same line:

if (y < 0 && debug) message("Y is negative")


Line length

Strive to limit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font. If you find yourself running out of room, this is a good indication that you should encapsulate some of the work in a separate function.


Indentation

When indenting your code, use two spaces. Never use tabs or mix tabs and spaces.

The only exception is if a function definition runs over multiple lines. In that case, indent the second line to where the definition starts:

long_function_name <- function(a = "a long argument", 
                               b = "another argument",
                               c = "another long argument") {
  # As usual code is indented by two spaces.
}

Assignment

As we already discussed earlier, use <-, not =, for assignment.

Good

x <- 5

Bad

x = 5


Comments

Comment your code. Each line of a comment should begin with the comment symbol and a single space: #. Comments should explain the why, not the what.

Good

# Objects like data frames are treated as leaves
x <- map_if(x, is_bare_list, recurse)

Bad

# Recurse only with bare lists
x <- map_if(x, is_bare_list, recurse)

Comments should be in sentence case, and only end with a full stop if they contain at least two sentences:

Good

# Objects like data frames are treated as leaves
x <- map_if(x, is_bare_list, recurse)

# Do not use `is.list()`. Objects like data frames must be treated
# as leaves.
x <- map_if(x, is_bare_list, recurse)

Bad

# objects like data frames are treated as leaves
x <- map_if(x, is_bare_list, recurse)

# Objects like data frames are treated as leaves.
x <- map_if(x, is_bare_list, recurse)

Organisation

Use commented lines of - and = to break up your file into easily readable chunks:

# Load data ---------------------------

# Plot data ---------------------------


Rules of Thumb

Don’t Repeat Yourself

Duplicated code is a risk to safety. If you have identical or very similar code in two places, then the fundamental risk is that there’s a bug in both copies, and some maintainer fixes the bug in one place but not the other.

Avoid duplication like you’d avoid crossing the street without looking. Copy-and-paste is an enormously tempting programming tool, and you should feel a frisson of danger run down your spine every time you use it. The longer the block you’re copying, the riskier it is. The alternative here would be to write a function, run a loop, split-apply-combine, or any combination of these.

Don’t Repeat Yourself, or DRY for short, has become a programmer’s mantra.
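
As a rough illustration of replacing copy-and-paste with a function, consider the sketch below. The data (heights_cm, weights_kg) and the helper rescale_to_unit() are hypothetical and not part of the workshop materials.

```r
# Hypothetical measurements, for illustration only.
heights_cm <- c(150, 160, 170, 180)
weights_kg <- c(50, 60, 70, 80)

# Copy-and-paste version: the same formula is written twice, so a fix
# applied to one copy can easily be missed in the other.
heights_scaled <- (heights_cm - min(heights_cm)) /
  (max(heights_cm) - min(heights_cm))
weights_scaled <- (weights_kg - min(weights_kg)) /
  (max(weights_kg) - min(weights_kg))

# DRY version: the formula lives in exactly one place.
rescale_to_unit <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

heights_dry <- rescale_to_unit(heights_cm)
weights_dry <- rescale_to_unit(weights_kg)
```

Both versions give the same numbers, but a bug in the formula now only has to be fixed once.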


Comment Only Where Needed

A quick general word about commenting. Good software developers write comments in their code, and do it judiciously. Good comments should make the code easier to understand, safer from bugs (because important assumptions have been documented), and ready for change.


Avoid Magic Numbers

There are really only two constants that computer scientists recognize as valid in and of themselves: 0, 1, and maybe 2. (Okay, three constants.)

Other constant numbers need to be explained. One way to explain them is with a comment, but a far better way is to declare the number as a constant with a good, explanatory name. For instance, the months 2, …, 12 would be far more readable as FEBRUARY, …, DECEMBER.
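
A minimal sketch of this idea in base R follows. The constant names DAYS_PER_MONTH and FEBRUARY are assumptions chosen for illustration; month.name is a built-in vector of English month names.

```r
# Month lengths in a non-leap year, labelled with their month names.
DAYS_PER_MONTH <- setNames(
  c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31),
  month.name
)
FEBRUARY <- match("February", month.name)

# Before: day <- day + 31
# After: the name says which month the 31 belongs to.
month <- 2
day <- 25
if (month == FEBRUARY) {
  day <- day + DAYS_PER_MONTH[["January"]]
}
day
```

A reader no longer has to guess what 2 or 31 stands for, and the month lengths are written down only once.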



Practice Exercises: Code Review

Consider the following piece of code. Based on what we discussed previously, talk with your neighbor about what makes this code easy or hard to read. Make a list of any pros and cons.

dayOfYear <- function(Month, Day, Year) {
  if (Month == 2) {
    Day <- Day+31
  } else if (Month == 3) {
    Day <- Day+59
  } else if (Month == 4) {
    Day <- Day+90
  } else if (Month == 5) {
    Day <- Day+31+28+31+30
  } else if (Month == 6) {
    Day <- Day+31+28+31+30+31
  } else if (Month == 7) {
    Day <- Day+31+28+31+30+31+30
  } else if (Month == 8) {
    Day <- Day+31+28+31+30+31+30+31
  } else if (Month == 9) {
    Day <- Day+31+28+31+30+31+30+31+31
  } else if (Month == 10) {
    Day <- Day+31+28+31+30+31+30+31+31+30
  } else if (Month == 11) {
    Day <- Day+31+28+31+30+31+30+31+31+30+31
  } else if (Month == 12) {
    Day <- Day+31+28+31+30+31+30+31+31+30+31+31
  }
  return(Day)
}

PROS

* Correct use of curly braces in else if statements
* Clear return for the function
* Function name specifies use

CONS

* No spaces around the addition operator
* Variable names are capitalized
* Extreme duplication
* Use of magic numbers

Some of the repetition in dayOfYear() is repeated values. How many times is the number of days in April written in dayOfYear()?

SOLUTION: 8

Each sum of the form 31 + 28 + 31 + 30 + ... is a sum of days in months: 31 (January) + 28 (February) + 31 (March) + 30 (April) + ... There are 8 occurrences of 30 that belong to April.

By the way, the fact that this question couldn’t be obviously answered from the code is an example of the problem of magic numbers, which will be discussed more in a bit.

Repeated code is problematic because any errors have to be fixed in many places, rather than just one. Suppose our calendar changed so that February really has 30 days instead of 28. How many numbers in this code have to be changed?

SOLUTION: 10

The eight explicit occurrences of 28 would have to change, and so would the two numbers 59 and 90, which implicitly depend on the assumption that February has 28 days: 59 = 31 (January) + 28 (February), and 90 = 31 (January) + 28 (February) + 31 (March). These two surprise numbers are magic numbers, which we’ll talk about shortly.
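
The dependence of 59 and 90 on the month lengths can be made explicit with a cumulative sum. This is a small sketch in base R; the names days_per_month and days_before_month are chosen for illustration.

```r
days_per_month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

# Days that have already passed before the first of each month.
days_before_month <- c(0, head(cumsum(days_per_month), -1))
days_before_month[3]  # 59, the days before March
days_before_month[4]  # 90, the days before April

# If February had 30 days, one number changes and the rest follow.
days_per_month[2] <- 30
days_before_month_alt <- c(0, head(cumsum(days_per_month), -1))
days_before_month_alt[3:4]  # 61 92
```

With the month lengths stored once, the hypothetical calendar change requires one edit instead of ten.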

Let’s “clean up” the code slightly by adding in some comments. Which comments are useful additions to the code? Consider each comment independently, as if the other comments weren’t there.

#' @param month month of the year, where January = 1 and December = 12  [C1]
dayOfYear <- function(month, day, year) {
  if (month == 2) {                                 #we're in February  [C2]
    day <- day + 31                   #add in the days of January that already passed [C3]
  } else if (month == 3) {                          #month is 3 here  [C4]
    day <- day + 59
  } else if (month == 4) {
    day <- day + 90
  } else if (month == 5) {
    day <- day + 31 + 28 + 31 + 30
  } else if (month == 6) {
    day <- day + 31 + 28 + 31 + 30 + 31
  } else if (month == 7) {
    day <- day + 31 + 28 + 31 + 30 + 31 + 30
  } else if (month == 8) {
    day <- day + 31 + 28 + 31 + 30 + 31 + 30 + 31
  } else if (month == 9) {
    day <- day + 31 + 28 + 31 + 30 + 31 + 30 + 31 + 31
  } else if (month == 10) {
    day <- day + 31 + 28 + 31 + 30 + 31 + 30 + 31 + 31 + 30
  } else if (month == 11) {
    day <- day + 31 + 28 + 31 + 30 + 31 + 30 + 31 + 31 + 30 + 31
  } else if (month == 12) {
    day <- day + 31 + 28 + 31 + 30 + 31 + 30 + 31 + 31 + 30 + 31 + 31
  }
  return(day)                               #the answer  [C5]
}

Solution:

C1 is definitely a good addition because it clarifies what the month parameter means. In general, you should have a specification comment like this before every function. This specification isn’t complete, but it’s a start.

C2 and C3 help clarify what the numbers 2 and 31 signify. The comments help here, but we’ll see shortly that a better way to explain these lines is not with comments but with descriptive names, like month == February and day <- monthLength[January].

C4 and C5 contribute nothing that a capable reader of R wouldn’t already know.

Additional Thoughts: Suppose the date is February 9, 2020. The correct dayOfYear() result for this date is 40. Think about what would happen if a programmer (mistakenly) called dayOfYear() in each of the following ways. Would they be able to tell something went wrong? Is this a plausible user error?

  • dayOfYear("February", 9, 2020)
  • dayOfYear(2020, 2, 9)
  • dayOfYear(9, 2, 2020)
  • dayOfYear(2, 2020, 9)

SOLUTION:
dayOfYear(2020, 2, 9)
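This is plausible if the programmer is assuming the arguments are in year/month/day order, which is a common international standard (ISO 8601, in fact). A reader can spot the mistake easily, since 2020 is outside the 12 months of the year, but the if/else version of the function does not: no branch matches a month of 2020, so it quietly returns the day argument, 2, unchanged.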

dayOfYear("February", 9, 2020)
This is plausible if the programmer is assuming the month is passed as a string name (in English). R has no static type checking, so nothing rejects the call. Each comparison such as "February" == 2 coerces the number to a string and evaluates to FALSE, so no branch matches and the if/else version quietly returns the day argument, 9.

dayOfYear(9, 2, 2020)
This is plausible if the programmer is assuming the arguments are in day/month/year order, which is the standard almost everywhere in the world except the United States. It quietly produces the wrong answer because dayOfYear() interprets those arguments as September 2.

dayOfYear(2, 2020, 9)
This is implausible because no convention for writing dates puts the year in the middle. It’s unlikely to happen by accident.
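
One way to make some of these mistakes fail loudly is to validate the arguments before using them. The sketch below uses base R's stopifnot(). The helpers check_date_args() and is_accepted() are hypothetical and not part of the workshop's dayOfYear().

```r
# Hypothetical validator: stops with an error when an argument is
# not a single number in a sensible range.
check_date_args <- function(month, day, year) {
  stopifnot(
    is.numeric(month), length(month) == 1, month %in% 1:12,
    is.numeric(day), length(day) == 1, day %in% 1:31,
    is.numeric(year), length(year) == 1
  )
  invisible(TRUE)
}

# tryCatch() turns each failed check into FALSE so the four calls
# can be compared side by side.
is_accepted <- function(month, day, year) {
  tryCatch(check_date_args(month, day, year), error = function(e) FALSE)
}

is_accepted("February", 9, 2020)  # FALSE: month is not numeric
is_accepted(2020, 2, 9)           # FALSE: month is outside 1:12
is_accepted(2, 2020, 9)           # FALSE: day is outside 1:31
is_accepted(9, 2, 2020)           # TRUE: valid, but read as September 2
```

Range checks cannot catch the day/month swap, because September 2 is a perfectly valid date. Clear parameter names and documentation remain the main defense there.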

Work with your neighbor to create a clean version of the dayOfYear function! We will reconvene as a group shortly to discuss our solutions.

Your code should include the following:

  1. Provide Test Cases
  2. Remove Duplication
  3. Include Comments Where Appropriate
  4. Show Outputs

## Functions --------------------------------------------------------

#' Converts a month, day, and year into the day of the year, where
#' January 1st is day 1. Leap years are not handled: February always
#' has 28 days and `year` is currently unused.
#'
#' @param month int, month of the year, where January = 1 and December = 12
#' @param day int, day of the month
#' @param year int
#'
#' @return int
#' @export
#' @examples
#' dayOfYear(2, 25, 2020)
dayOfYear <- function(month, day, year) {

  # Days in each month, starting from January with 31
  daysPerMonth <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

  if (month == 1) {
    # In January, the day of the month is the day of the year
    dayOfYear <- day
  } else {
    # Otherwise, add the days of all previous months. The parentheses
    # matter: 1:month - 1 would parse as (1:month) - 1.
    dayOfYear <- day + sum(daysPerMonth[1:(month - 1)])
  }

  return(dayOfYear)
}

## Process Data -----------------------------------------------------

dayOfYear(2, 25, 2020)
## [1] 56
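
The exercise also asks for test cases. A minimal sketch using base R's stopifnot() is shown here. So that it runs on its own, it defines a compact, hypothetical variant named day_of_year_sketch() rather than reusing the function above. Like the workshop solution, it ignores leap years.

```r
# Hypothetical compact variant. seq_len(0) is empty, so January adds
# nothing and needs no special case.
day_of_year_sketch <- function(month, day, year) {
  days_per_month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
  day + sum(days_per_month[seq_len(month - 1)])
}

# stopifnot() is silent when every check passes and stops with an
# error naming the first check that fails.
stopifnot(
  day_of_year_sketch(1, 1, 2020) == 1,      # first day of the year
  day_of_year_sketch(2, 25, 2020) == 56,    # the workshop example
  day_of_year_sketch(3, 1, 2019) == 60,     # first day after February
  day_of_year_sketch(12, 31, 2019) == 365   # last day of the year
)
```

Checking the first and last day of the year, plus a month boundary, covers the places where off-by-one errors are most likely.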

Create a full script using solar_spike_data.csv and lunar_spike_data.csv

Your code should include the following:

  1. Load Data – Load the Data into RStudio

  2. Functions – Copy your clean dayOfYear Function into R

  3. Processing – Process the datasets by converting the mm/dd/yyyy dates to the day of the year and store the outputs

  4. Plot the data – Create a line scatter plot of the day of solar spikes vs the day of the lunar spikes

#################################################
#                                               #
#       BIDS WORKSHOP CLEAN CODING EXAMPLE      #
#                                               #
#################################################



## Load Data -----------------------------------

solarSpikesData <- read.csv("~/Desktop/BIDS/BIDSworkshop/solar_spikes_data.csv")
lunarSpikesData <- read.csv("~/Desktop/BIDS/BIDSworkshop/lunar_spikes_data.csv")



## Functions -----------------------------------

#' Converts a month, day, and year into the day of the year, where
#' January 1st is day 1. Leap years are not handled: February always
#' has 28 days and `year` is currently unused.
#'
#' @param month int, month of the year, where January = 1 and December = 12
#' @param day int, day of the month
#' @param year int
#'
#' @return int
#' @export
#' @examples
#' dayOfYear(2, 25, 2020)
dayOfYear <- function(month, day, year) {

  # Days in each month, starting from January with 31
  daysPerMonth <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

  if (month == 1) {
    # In January, the day of the month is the day of the year
    dayOfYear <- day
  } else {
    # Otherwise, add the days of all previous months. The parentheses
    # matter: 1:month - 1 would parse as (1:month) - 1.
    dayOfYear <- day + sum(daysPerMonth[1:(month - 1)])
  }

  return(dayOfYear)
}




## Process Data -------------------------------

# Convert solar spike dates to the day of the year
dayOfYearSolar <- apply(solarSpikesData, 1, FUN = function(eventDate) {
  dayOfYear(month = eventDate["month"],
            day   = eventDate["day"],
            year  = eventDate["year"])
})

# Convert lunar spike dates to the day of the year
dayOfYearLunar <- apply(lunarSpikesData, 1, FUN = function(eventDate) {
  dayOfYear(month = eventDate["month"],
            day   = eventDate["day"],
            year  = eventDate["year"])
})
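
Step 3 of the exercise describes dates in mm/dd/yyyy form, while the solution reads separate month, day, and year columns. If the dates arrive as strings instead, base R can parse them directly. This is only a sketch under that assumption: the vector event_dates is hypothetical, since the layout of the CSV files is not shown here.

```r
# Hypothetical date strings in mm/dd/yyyy form.
event_dates <- c("01/01/2020", "02/25/2020", "03/01/2019")

# as.Date() parses the strings, and the "%j" format code returns the
# day of the year as a zero-padded string such as "056".
parsed_dates <- as.Date(event_dates, format = "%m/%d/%Y")
day_of_year <- as.integer(format(parsed_dates, "%j"))
day_of_year
```

Unlike dayOfYear(), this approach accounts for leap years, so results for dates after February in a leap year will be one day higher.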



## Plot Data -----------------------

# Set up plot label formatting
par(font = 2, font.axis = 2, font.lab = 2)

# Plot solar spike day vs lunar spike day
plot(x = dayOfYearSolar,
     y = dayOfYearLunar,
     type = "l",
     col = "orange",
     lwd = 40,
     xlim = c(0, 365),
     ylim = c(0, 365),
     asp = 1,
     xlab = "Solar Spike Day of Year",
     ylab = "Lunar Spike Day of Year",
     main = "You Are a Star!")
box(lwd = 3)


You’ve reached the end of the interactive portion of the workshop! Let’s meet back up and discuss what we have learned!