P01. Introduction to R, part 2: solutions

A. Read in a data file

myTBdata <- read.table("TB_stats.txt", header=TRUE)

In order to run this, R needs to know where “TB_stats.txt” is: with the file name given as above, R looks for it in your current working directory. You can download the file here.
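If R cannot find the file, it helps to check which folder R is working in before reading anything. The lines below are a minimal sketch, assuming “TB_stats.txt” has been saved in the current working directory; the variable name data_file is just for illustration.

```r
# Sketch: confirm R can find the file before trying to read it.
# Assumes "TB_stats.txt" has been saved in the current working directory.
data_file <- "TB_stats.txt"

getwd()  # shows the folder R is currently working in

if (file.exists(data_file)) {
  myTBdata <- read.table(data_file, header = TRUE)
} else {
  message("Cannot find ", data_file, " in ", getwd(),
          "; move the file there or change folder with setwd().")
}
```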

What does the “header=TRUE” option mean?

Answer: R treats the first row of the file as the column headers (names) rather than as a row of data.
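To see the effect of the option, here is a small sketch that reads the same text with and without it. The text is made up for illustration (it is not taken from TB_stats.txt) and is passed in through the text argument of read.table() so that no file is needed.

```r
# Two made-up rows of data with a header line (not taken from TB_stats.txt).
toy_text <- "Country Deaths
CountryA 10
CountryB 20"

with_header <- read.table(text = toy_text, header = TRUE)
no_header   <- read.table(text = toy_text, header = FALSE)

names(with_header)  # "Country" "Deaths"
nrow(with_header)   # 2 rows of data

names(no_header)    # "V1" "V2" (default names made up by R)
nrow(no_header)     # 3, because the header line is counted as data
```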

B. Have a look at the first few lines

# Now let's investigate the data file
head(myTBdata)
                   Country HIV_neg_TB_mortality HIV_pos_TB_mortality
1                   Angola                11000                 7200
2               Bangladesh                73000                  230
3                   Brazil                 5500                 2200
4                 Cambodia                 8600                  440
5 Central_African_Republic                 2200                 2700
6                    China                35000                 2600
  Total_TB_mortality HIV_pos_TB_incidence Population
1              93000                28000   25000000
2             362000                  630  161000000
3              84000                13000  208000000
4              59000                 1400   15600000
5              19000                 8600    4900000
6             918000                15000 1380000000

How many rows can you see? What is the first row?

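Answer: 1 row of column names followed by 6 rows of data. The output is printed in two blocks because the table is too wide to fit on one line, so each row number appears in both blocks.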

C. What are the names of the columns?

names(myTBdata)
[1] "Country"              "HIV_neg_TB_mortality" "HIV_pos_TB_mortality"
[4] "Total_TB_mortality"   "HIV_pos_TB_incidence" "Population"          

Is this what you expected?

Answer: We would expect there to be 6 elements to correspond to the 6 columns. Each column name is stored as a separate element in a vector.

D. How many rows and columns are there in your data?

dim(myTBdata)
[1] 30  6

What is the first number telling you? And the second?

Answer: dim(…) gives you a two-element vector; the first number is the number of rows, the second number is the number of columns.
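Because dim() returns an ordinary vector, you can pick out either number by position, or ask for each one directly with nrow() and ncol(). A short sketch on a made-up data frame (the name toy and its contents are for illustration only):

```r
# A made-up data frame with 3 rows and 2 columns.
toy <- data.frame(Country = c("CountryA", "CountryB", "CountryC"),
                  Deaths  = c(10, 20, 30))

dim(toy)     # 3 2
dim(toy)[1]  # 3, the number of rows
dim(toy)[2]  # 2, the number of columns

nrow(toy)    # 3
ncol(toy)    # 2
```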

E. How are your data stored?

attributes(myTBdata)
$names
[1] "Country"              "HIV_neg_TB_mortality" "HIV_pos_TB_mortality"
[4] "Total_TB_mortality"   "HIV_pos_TB_incidence" "Population"          

$class
[1] "data.frame"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30

What new piece of information have you learned from the ‘attributes()’ function?

Answer: We now know that our data set is stored as a data.frame.
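If you only want to know how the data are stored, a few other base R functions give the same information more directly. This is a sketch on a made-up data frame; stringsAsFactors = FALSE is set explicitly so the result does not depend on your R version.

```r
# A made-up data frame for illustration.
toy <- data.frame(Country = c("CountryA", "CountryB"),
                  Deaths  = c(10, 20),
                  stringsAsFactors = FALSE)

class(toy)          # "data.frame"
is.data.frame(toy)  # TRUE

# The class of each individual column
sapply(toy, class)  # Country: "character", Deaths: "numeric"

# A compact overview of the whole object
str(toy)
```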

F. Now take a look at some summary statistics for your data

summary(myTBdata)
   Country          HIV_neg_TB_mortality HIV_pos_TB_mortality
 Length:30          Min.   :   780       Min.   :   40       
 Class :character   1st Qu.:  3725       1st Qu.:  935       
 Mode  :character   Median : 14500       Median : 3300       
                    Mean   : 40543       Mean   :11269       
                    3rd Qu.: 29250       3rd Qu.:10800       
                    Max.   :480000       Max.   :73000       
 Total_TB_mortality HIV_pos_TB_incidence   Population       
 Min.   :  12000    Min.   :   450       Min.   :2.140e+06  
 1st Qu.:  43250    1st Qu.:  5050       1st Qu.:1.560e+07  
 Median : 122500    Median : 14000       Median :5.370e+07  
 Mean   : 301600    Mean   : 33376       Mean   :1.546e+08  
 3rd Qu.: 305500    3rd Qu.: 37500       3rd Qu.:1.325e+08  
 Max.   :2840000    Max.   :258000       Max.   :1.380e+09  

Let’s extract some information from our data

G. First, calculate the total number of deaths across all countries.

The following two methods should give you the same answer

total_TB_mortality1 <- sum(myTBdata[,2:3]) # method 1
total_TB_mortality2 <- sum(myTBdata$HIV_pos_TB_mortality + myTBdata$HIV_neg_TB_mortality) # method 2

Do you think one method is better than the other?

Answer: While method 1 is certainly shorter, method 2 is generally better for two reasons: 1) it is easier to read because the column names tell you what is being added; 2) it is less error-prone (if the order of your columns changes for some reason, method 1 will silently add up the wrong columns).
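The second reason is easy to demonstrate. The sketch below uses a made-up data frame with shortened, hypothetical column names (HIV_neg, HIV_pos) and shows what happens to each method when the columns are reordered.

```r
# Made-up numbers; column names are shortened for illustration.
toy <- data.frame(Country    = c("CountryA", "CountryB"),
                  HIV_neg    = c(100, 200),
                  HIV_pos    = c(10, 20),
                  Population = c(1000, 2000))

sum(toy[, 2:3])                 # 330 (method 1, by position)
sum(toy$HIV_neg + toy$HIV_pos)  # 330 (method 2, by name)

# Now move Population into column 2
reordered <- toy[, c("Country", "Population", "HIV_neg", "HIV_pos")]

sum(reordered[, 2:3])                       # 3300: no error, but the wrong columns
sum(reordered$HIV_neg + reordered$HIV_pos)  # 330: still correct
```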

H. Now let’s check that both methods give the same answer.

We’ll use two ways to check this. First, let’s output both answers

total_TB_mortality1
[1] 1554340
total_TB_mortality2
[1] 1554340

Now, let’s ask R to check whether they are both equal

total_TB_mortality1==total_TB_mortality2 
[1] TRUE
# logical expression which gives TRUE if equal and FALSE if not

Why might you prefer the second check (using the logical expression) to the first?

Answer: To check that two numbers are the same, it is easy to print both and compare them by eye. However, for larger datasets or for multiple checks, an automated comparison is much quicker and less likely to miss a difference.
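One caution when automating checks: the totals here are whole numbers, so == works well. For non-integer values (such as rates), exact equality can fail because of rounding in the way computers store decimals. A minimal sketch of a more tolerant check, using the base R function all.equal():

```r
x <- 0.1 + 0.2
y <- 0.3

x == y                   # FALSE, because of tiny rounding differences
isTRUE(all.equal(x, y))  # TRUE, equal up to a small tolerance

# stopifnot() halts a script with an error if a check fails,
# which is handy when many checks are run automatically.
stopifnot(isTRUE(all.equal(x, y)))
```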

I. How different is the TB mortality rate in HIV positive persons in Lesotho compared to Zimbabwe?

First, let’s add the “mortality rate” (TB deaths per 1,000 population, in HIV-positive and HIV-negative persons combined) as another column in our data frame

myTBdata$Mortality_Per1000 <- 1000 * (myTBdata$HIV_pos_TB_mortality + myTBdata$HIV_neg_TB_mortality)/myTBdata$Population

Now subset the dataset to extract the TB mortality rate for both Lesotho and Zimbabwe

Lesotho_mortalityrate <- myTBdata[myTBdata$Country=="Lesotho", "Mortality_Per1000"]
Zimbabwe_mortalityrate <- myTBdata[myTBdata$Country=="Zimbabwe", "Mortality_Per1000"]
Relative_Mortality_Rate <- Lesotho_mortalityrate / Zimbabwe_mortalityrate

How many times higher is the mortality rate for TB in Lesotho than it is in Zimbabwe?

paste("The relative mortality rate is", round(Relative_Mortality_Rate, 2), sep=" ")
[1] "The relative mortality rate is 5.47"
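The same three steps (add a rate column, subset by country, divide) can be checked by hand on a tiny example. The numbers and country names below are made up so that the arithmetic is easy to follow; they are not taken from the TB data.

```r
# Made-up data: 50 deaths in 10,000 people and 30 deaths in 30,000 people.
toy <- data.frame(Country    = c("CountryA", "CountryB"),
                  Deaths     = c(50, 30),
                  Population = c(10000, 30000),
                  stringsAsFactors = FALSE)

# Step 1: deaths per 1,000 population
toy$Deaths_Per1000 <- 1000 * toy$Deaths / toy$Population

# Step 2: pick the row by country and the column by name
rate_A <- toy[toy$Country == "CountryA", "Deaths_Per1000"]  # 5
rate_B <- toy[toy$Country == "CountryB", "Deaths_Per1000"]  # 1

# Step 3: the ratio of the two rates
rate_A / rate_B  # 5, i.e. the rate in CountryA is 5 times that in CountryB
```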

J. Finally in this section, let’s look at what can go wrong when reading in data files.

In order to complete this section, you will need to download readfileexample_1.txt, readfileexample_2.txt, and readfileexample_3.txt.

(a) There is not an equal number of columns in each of the rows.

readFile_a <- read.table("readfileexample_1.txt", header=TRUE)

How do you fix this error? Hint: mark each missing value in the original data file as NA (R’s ‘Not Available’ marker) so that every row has the same number of columns. Try running this line again with the updated file.
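The sketch below reproduces this kind of problem without needing a file. The text is made up (we are not assuming it matches readfileexample_1.txt) and is read through the text argument of read.table(). It also shows the fill = TRUE option, which pads short rows with NA. Note that fill = TRUE pads at the end of the row, so it only gives the right answer when the missing value really is the last one; editing the file is the safer fix.

```r
# Made-up text in which the second data row is one value short.
ragged_text <- "Clinic NumberOfTests NumberOfPositiveTests
1 10 5
2 5
3 6 1"

# Without fill, read.table() stops with an error; tryCatch() captures the
# error message so that the script can carry on.
result <- tryCatch(
  read.table(text = ragged_text, header = TRUE),
  error = function(e) conditionMessage(e)
)
result  # an error message (a character string), not a data frame

# With fill = TRUE, the short row is padded with NA on the right.
filled <- read.table(text = ragged_text, header = TRUE, fill = TRUE)
filled
is.na(filled$NumberOfPositiveTests[2])  # TRUE
```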

(b) The wrong delimiter is used

readFile_b <- read.table("readfileexample_2.txt", header=TRUE)
  Clinic.NumberOfTests.NumberOfPositiveTests
1                                     1,10,5
2                                      2,5,2
3                                      3,6,1
4                                    4,NA,NA
5                                    5,NA,NA

Is an error given? Check out ‘readFile_b’ - is it correct?

Answer: No error is given, but if you type head(readFile_b) you can see that the file has not been read in properly. It should be read in as a 5 x 3 data.frame; instead it is a 5 x 1 data.frame, with each whole line squeezed into a single column.

How do you fix this? Ask R for help (?read.table). Which option do you need to specify?

Answer: We must specify the delimiter of the text file using the sep option.

readFile_b <- read.table("readfileexample_2.txt", sep = ",", header=TRUE)
  Clinic NumberOfTests NumberOfPositiveTests
1      1            10                     5
2      2             5                     2
3      3             6                     1
4      4            NA                    NA
5      5            NA                    NA

Is there another way of fixing this problem?

Answer: We could use the function read.csv(), which uses a comma as the delimiter by default.

readFile_b <- read.csv("readfileexample_2.txt", header=TRUE)
  Clinic NumberOfTests NumberOfPositiveTests
1      1            10                     5
2      2             5                     2
3      3             6                     1
4      4            NA                    NA
5      5            NA                    NA
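Since a wrong delimiter does not produce an error, a quick look at the number of columns is a useful habit after reading any file. A minimal sketch, using made-up comma-separated text passed in through the text argument rather than a file:

```r
# Made-up comma-separated text.
csv_text <- "Clinic,NumberOfTests,NumberOfPositiveTests
1,10,5
2,5,2"

wrong      <- read.table(text = csv_text, header = TRUE)             # default: whitespace
right      <- read.table(text = csv_text, header = TRUE, sep = ",")  # comma given explicitly
also_right <- read.csv(text = csv_text)                              # comma by default

ncol(wrong)       # 1: everything ended up in a single column
ncol(right)       # 3
ncol(also_right)  # 3
```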

(c) The names are read in as data rows rather than names

readFile_c <- read.csv("readfileexample_2.txt", header=FALSE)
      V1            V2                    V3
1 Clinic NumberOfTests NumberOfPositiveTests
2      1            10                     5
3      2             5                     2
4      3             6                     1
5      4          <NA>                  <NA>
6      5          <NA>                  <NA>

Is an error given? Check out ‘readFile_c’ - is it correct?

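Answer: No error is given, but the result is not correct: R has read in the first row of the file (the column names) as a regular row of data and has named the columns V1, V2 and V3 instead.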

Type a new line of code to correct this problem (hint: copy-paste from above and change one of the options)

readFile_c <- read.csv("readfileexample_2.txt", header=TRUE)
  Clinic NumberOfTests NumberOfPositiveTests
1      1            10                     5
2      2             5                     2
3      3             6                     1
4      4            NA                    NA
5      5            NA                    NA
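A side effect worth knowing about: when the names are read in as data, each column contains a mix of text and numbers, so R stores the whole column as text. The sketch below shows this with made-up text; stringsAsFactors = FALSE is set explicitly so the result does not depend on your R version.

```r
# Made-up comma-separated text with a header line.
csv_text <- "Clinic,NumberOfTests
1,10
2,5"

as_data  <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)
as_names <- read.csv(text = csv_text, header = TRUE)

sapply(as_data, class)   # V1 and V2 are both "character"
sapply(as_names, class)  # Clinic and NumberOfTests are both "integer"
```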

(d) One or more of the columns contain values of different classes

readFile_d <- read.table("readfileexample_3.txt", header=TRUE)
  Clinic Catchment_1000s Doctors
1      1             1.2      10
2      2               5      12
3      3             0.4       3
4      4             1.1       5
5      5             ten       6

Is an error given? Check out ‘readFile_d’ - is it correct?

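Answer: No error is given and the output looks OK at first glance. However, one of the values has mistakenly been entered as ‘ten’ rather than ‘10’. Because a column can only hold one class of data, the whole Catchment_1000s column has been read in as text rather than as numbers.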

How do you fix this issue? Hint: check the ‘class’ of the problem column. Try running this line again with an updated file.

Answer: Change the value in the 5th row, 2nd column of the file to the number 10 and re-run the code.
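In a small file the problem value is easy to spot by eye, but in a larger one it helps to let R find it. The sketch below is one possible approach, shown on made-up text rather than readfileexample_3.txt: check the class of the column, try converting it to numbers, and list the rows where the conversion fails.

```r
# Made-up text with one value typed as a word.
txt <- "Clinic Catchment_1000s Doctors
1 1.2 10
2 5 12
3 ten 6"

toy <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)

class(toy$Catchment_1000s)  # "character", not "numeric"

# Values that cannot be converted become NA (the warning is expected,
# so it is suppressed here).
as_number <- suppressWarnings(as.numeric(toy$Catchment_1000s))

# Rows that were not missing originally but failed to convert
bad_rows <- which(is.na(as_number) & !is.na(toy$Catchment_1000s))
bad_rows                       # 3
toy$Catchment_1000s[bad_rows]  # "ten"
```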