Training documents and materials for the DAA 2022, held at Uppsala University in Sweden 1-5 August
View this repository on GitHub: glow-gh/daa
This module will introduce you to the coding language R and the integrated development environment software package RStudio. R is a widely used coding language with particularly strong abilities in handling, manipulating, and visualising large amounts of tabular data.
RStudio is an IDE (integrated development environment) designed specifically for use with R, which adds a number of features to the base R Console, including overviews
While we will primarily be working with RStudio in this and the next module, let us just briefly introduce coding in R itself.
Coding in R can be undertaken with the use of the R Console, which is an integral part of a standard R installation. Note that R can also be run directly from the Windows Command Prompt or the macOS Terminal.
To write scripts in R, standard practice is to use an R Editor document, which allows you to save the script as a self-contained file. While you can use the R Console to try out commands, you cannot save scripts from this location.
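getwd()  # show the current working directory
setwd("[path of your Documents folder]")  # move to your Documents folder
dir.create("R")  # create a subfolder named R
setwd("[path of your Documents folder]/R")  # move into the new folder
getwd()  # confirm that the working directory has changed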
When working with R and RStudio, it is standard practice always to start out by checking the current working directory using getwd(), and to set the desired working directory if necessary using setwd().
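The check-then-set routine can also be written so that it cleans up after itself. The following is a minimal sketch, assuming a throwaway folder under tempdir() as a stand-in for your own Documents folder; the names old_wd and target are illustrative only.

```r
# Remember where we started so we can return later
old_wd <- getwd()

# Hypothetical target folder; replace with your own path
target <- file.path(tempdir(), "R")

# Create the folder only if it does not exist yet
if (!dir.exists(target)) {
  dir.create(target)
}

setwd(target)   # switch to the new working directory
getwd()         # confirm the change

setwd(old_wd)   # switch back to the original directory
```

Building paths with file.path() avoids having to type path separators by hand, which differ between Windows and macOS.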
It is perfectly possible to write extensive and complex scripts in the R Console and R Editor, but unless you are a trained coder, it will quickly become difficult to manage the various outputs and files involved. To this end, RStudio provides you with a working environment for writing R scripts and keeping track of files, functions, and libraries.
A central feature in RStudio is the ability to create more elaborate file formats than base R. A widely used format is R Markdown, a variant of the basic Markdown syntax that we saw yesterday (see 1.3 Markdown), which can also include and run R code.
R Markdown files come in two different varieties in RStudio, namely R Markdown and R Notebook. The former is a file format, the latter is a way of viewing and generating .html-formatted output from R Markdown. Both formats are stored in R Markdown, using the file extension .Rmd. For reasons that will become clearer in a moment, we will start by creating an R Notebook.
A couple of remarks about the structure of an R Markdown or Notebook file. The top four lines contain a YAML header, which you encountered during the Markdown module yesterday. This includes the title of the document when converted, and the conversion output, which is set to .html as default for R Notebooks.
Below comes the standard Markdown syntax that you should already be familiar with. R Markdown formats will accept the same Markdown syntax as you were introduced to yesterday.
The highlighted area below is a chunk of R code. R code is included in R Markdown by opening the chunk with three backticks followed by {r} and closing it with three backticks on a line of their own. You can also press the button Insert a new code chunk on the top R Notebook window toolbar. Within this space, R code can be entered and executed.
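To make the file structure concrete, the sketch below writes a minimal notebook file from R and prints it back. It assumes the header that RStudio typically generates for a new R Notebook (title and output: html_notebook); the file name example_notebook.Rmd and the Markdown sentence are made up for illustration.

```r
# Three backticks, the delimiter that opens and closes a code chunk
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",                                   # start of the YAML header
  'title: "R Notebook"',                   # document title
  "output: html_notebook",                 # conversion output
  "---",                                   # end of the YAML header
  "",
  "Some *Markdown* text describing the analysis.",
  "",
  paste0(fence, "{r}"),                    # chunk start
  "getwd()",                               # R code inside the chunk
  fence                                    # chunk end
)

# Hypothetical file location in a temporary folder
rmd_path <- file.path(tempdir(), "example_notebook.Rmd")
writeLines(rmd_lines, rmd_path)

# Print the file exactly as it is stored on disk
cat(readLines(rmd_path), sep = "\n")
```

In practice you would create the file through the RStudio menu; the point here is only that an .Rmd file is plain text with a header, Markdown, and delimited chunks.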
Just like Visual Studio Code has workspaces to manage various file collections, RStudio has projects. Projects are useful because they will help you keep track of files, and also because they can be synced with a GitHub repository to allow for Git version control of R projects and files.
You should now see a number of things happening with RStudio. First, the project button text in the top right corner will have changed from Project (None) to R, indicating which project RStudio is currently working in. Second, the Viewer window in the bottom right will have changed to the Files pane, giving you an overview of all the files contained in the project directory, namely the R folder.
Integrating RStudio with GitHub is easy provided that you already have Git installed on your system, have set up a GitHub user profile, and have a repository that can be cloned.
We have now been through setting up RStudio for working with R code and Markdown. Now, let us look at how R and RStudio handle data. To do this, we should first introduce the concept of libraries, or packages, in R. Libraries are collections of functions or commands that can be downloaded to R and RStudio and launched when required. One of the things that make R such a versatile programming language is the very wide range of libraries available. In the Packages pane of the Viewer window, you can see a list of libraries available on your system, and those that are currently loaded.
Curated libraries can be downloaded in RStudio by selecting Tools > Install Packages… from the top menu. We will try downloading a single library in a little while, to illustrate how that works.
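The same information shown in the Packages pane can be queried from the console. The sketch below is one way to do it, using dplyr purely as an example package name; it does not install anything and only prints a hint if the package is missing.

```r
# Names of all libraries installed on this system
installed <- rownames(installed.packages())
length(installed)          # how many libraries are available
"dplyr" %in% installed     # is dplyr among them?

# Libraries currently attached to the session
search()

# Load dplyr if it is available, otherwise explain what to do
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
} else {
  message("dplyr is not installed; run install.packages(\"dplyr\") first")
}
```

install.packages() is the console counterpart of the Install Packages menu entry and needs an internet connection.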
R comes with a set of native libraries typically referred to as base R, which include the most basic functions for working with a variety of data. Now, we will import some sample data and see how some of these functions work. We will use a basic .csv prepared by the GLoW project, which contains an index of the c. 550 archaeological locations where cuneiform inscriptions have been found. You can read more about this resource in Rattenborg et al. 2021 and see the online location of the resource on Zenodo.
To import a .csv from an online location to R, we can use the command read.csv() with the download URL of the .csv-version of CIGS. Note that file paths in R should always be enclosed in "" or ''.
read.csv("https://zenodo.org/record/5642899/files/CIGS_v1_4_20211101.csv?download=1")
The .csv-version of the CIGS index is a very small file in the context of what R can handle. When importing very large data files, from a local disk or a remote location, it is often preferable to check the format and structure of the file before working with the entire dataset. For this we can use the head() function, which returns the first rows of the data (six by default).
head(read.csv("https://zenodo.org/record/5642899/files/CIGS_v1_4_20211101.csv?download=1"))
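Note that head() is applied after read.csv() has already parsed the whole file. For genuinely large files, read.csv() can instead be told to stop early with its nrows argument. The sketch below illustrates the difference on a small temporary file with made-up rows and hypothetical column names, so that it runs without an internet connection.

```r
# Write a tiny example .csv to a temporary file (values are invented)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("id,site_name,country",
    "1,Site A,Iraq",
    "2,Site B,Syria",
    "3,Site C,Iraq",
    "4,Site D,Turkey"),
  csv_path
)

# Read only the first two data rows of the file
preview <- read.csv(csv_path, nrows = 2)
preview

# Read the whole file, then show only the first three rows
full <- read.csv(csv_path)
head(full, n = 3)
```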
So far, we have performed operations that have not actually stored the data contained in CIGS in RStudio, but rather served to view and inspect the data file. If we want to work further with this data, we need to turn it into an object that can be subjected to data manipulation. In R, tabular data such as CIGS is stored in objects called dataframes. These can be created using an arrow (assignment) operator, -> or <-, which assigns the result of a given command to a variable, or object.
cigs <- read.csv("https://zenodo.org/record/5642899/files/CIGS_v1_4_20211101.csv?download=1")
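To see how assignment behaves without downloading anything, here is a small sketch using a toy dataframe. The object name sites, its columns, and its values are invented for illustration and are not the actual CIGS structure.

```r
# Toy dataframe standing in for an imported table (values are made up)
sites <- data.frame(
  site_name = c("Site A", "Site B", "Site C"),
  country   = c("Iraq", "Syria", "Iraq")
)

# Both arrows assign; the arrow always points at the object name
n_rows <- nrow(sites)
ncol(sites) -> n_cols

class(sites)  # the type of the object
str(sites)    # compact overview of columns and their types
```

The leftward arrow is by far the more common of the two in published R code.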
We now have an object that can be subjected to further analysis and manipulation. First, let us look at some basic tools for examining data.
The cigs dataframe contains 25 columns. You can of course print the entire dataframe, or open it from the Environment window by clicking the object, but printing a list of column names may often be just as useful. We can do so using the names() function.
Results can be turned into objects as well. For example, we can create a vector from the column names returned.
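As a sketch of both steps, the example below prints column names and then stores them as a vector. It uses a toy dataframe with hypothetical column names rather than the real cigs object, so it can be run offline.

```r
# Toy dataframe (invented values, hypothetical column names)
sites <- data.frame(
  site_name = c("Site A", "Site B", "Site C"),
  country   = c("Iraq", "Syria", "Iraq"),
  n_texts   = c(120L, 15L, 48L)
)

# Print the column names
names(sites)

# Store the column names as a character vector
col_names <- names(sites)

length(col_names)  # number of columns
col_names[2]       # second column name
```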
As with OpenRefine presented yesterday (see 1.6 Data Cleaning), we can also print unique values from a column to see what values are included. This can be done using the unique() function.
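A minimal sketch of unique() on a single column follows, again on a toy dataframe with invented values. A column is selected with the $ operator.

```r
# Toy dataframe (invented values, hypothetical column names)
sites <- data.frame(
  site_name = c("Site A", "Site B", "Site C", "Site D"),
  country   = c("Iraq", "Syria", "Iraq", "Turkey")
)

# Distinct values in the country column, in order of first appearance
countries <- unique(sites$country)
countries

# Number of distinct values, and the values in alphabetical order
length(countries)
sort(countries)
```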
Introducing a filter to sort or subset records according to specific criteria is also possible, but somewhat cumbersome in base R. Here, we can load a library package called dplyr, which includes a good filter function.
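The following sketch shows dplyr's filter() on a toy dataframe, assuming dplyr is already installed. The column names and values are invented; the last line gives a base R counterpart for comparison.

```r
library(dplyr)

# Toy dataframe (invented values, hypothetical column names)
sites <- data.frame(
  site_name = c("Site A", "Site B", "Site C"),
  country   = c("Iraq", "Syria", "Iraq"),
  n_texts   = c(120L, 15L, 48L)
)

# Keep only the rows where country equals "Iraq"
iraq_sites <- filter(sites, country == "Iraq")
iraq_sites

# Several conditions separated by commas must all be true
large_iraq <- filter(sites, country == "Iraq", n_texts > 100)
large_iraq

# Base R counterpart of the first filter
sites[sites$country == "Iraq", ]
```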
Finally, a few examples of how data can be manipulated. There are a number of ways to extract parts of a dataframe to a new dataframe using very simple syntax.
With the dplyr library loaded, grouped summaries of a dataframe can be obtained using the count() function.
The result can be stored as a new dataframe.
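A sketch of a grouped summary with count(), assuming dplyr is installed and using a toy dataframe with invented values. count() returns one row per group, with the number of records in a column named n.

```r
library(dplyr)

# Toy dataframe (invented values, hypothetical column names)
sites <- data.frame(
  site_name = c("Site A", "Site B", "Site C"),
  country   = c("Iraq", "Syria", "Iraq")
)

# Number of records per country, printed to the console
count(sites, country)

# The same summary stored as a new dataframe, largest group first
per_country <- count(sites, country, sort = TRUE)
per_country
```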
cigs[1:2, ]  # rows 1 to 2, all columns
cigs[1:2, 1:2]  # rows 1 to 2, columns 1 to 2
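The square bracket syntax takes rows before the comma and columns after it. The sketch below shows a few further variants on a toy dataframe with invented values and hypothetical column names, each stored as a new object.

```r
# Toy dataframe (invented values, hypothetical column names)
sites <- data.frame(
  site_name = c("Site A", "Site B", "Site C"),
  country   = c("Iraq", "Syria", "Iraq"),
  n_texts   = c(120L, 15L, 48L)
)

first_two <- sites[1:2, ]                        # first two rows, all columns
two_cols  <- sites[, c("site_name", "country")]  # all rows, two named columns
name_col  <- sites$site_name                     # one column as a vector
iraq_only <- sites[sites$country == "Iraq", ]    # rows matching a condition
```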
In conclusion, let us demonstrate the application of R Notebooks in building large, annotated scripts for performing various tasks.