
R Practice 1

1 MASS library

The MASS package can be installed with the command install.packages('MASS'). It provides the survey dataset.

To load the library and look at some rows of the survey dataset, the following commands can be used.

library(MASS)
survey
head(survey, 4)

The Sex field has two levels. An independent variable is usually called a factor. A "treatment" is one possible combination of the levels of the independent variables. For example, two factors Sex and Class, with values F/M and A/B, give four treatments.
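
To make the idea of a treatment concrete, the following sketch lists every combination of two factors with expand.grid. The Class factor and its A/B values are only illustrative, taken from the sentence above; the survey dataset has no such column, and the name treatments is chosen here.

```r
# Hypothetical factors: Sex (F/M) and Class (A/B).
# Class is illustrative only; it is not a column of survey.
treatments <- expand.grid(Sex = c("F", "M"), Class = c("A", "B"))

# Each row is one treatment: one level of each factor.
print(treatments)

# Two factors with two levels each give 2 * 2 = 4 treatments.
nrow(treatments)
```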

For more information about the dataset, use the help(survey) command.

1.1 Age Mean

The following code calculates the mean of the values in the Age column.

mean(survey$Age)

1.2 Levels

The next snippet shows the levels of a column (the distinct values it can take).

levels(survey$Sex)
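
Besides listing the levels, it is often useful to count how many observations fall in each one. The sketch below uses a small made-up factor named sex as a stand-in for survey$Sex, so that it runs without MASS; the values are assumptions, not data from the survey.

```r
# Stand-in for survey$Sex (hypothetical values, one of them missing).
sex <- factor(c("Female", "Male", "Male", NA, "Female", "Male"))

levels(sex)                  # the distinct categories
nlevels(sex)                 # how many categories there are
table(sex)                   # observations per category, NA left out
table(sex, useNA = "ifany")  # same counts, with NA reported as well
```

With the real data, the equivalent calls would be nlevels(survey$Sex) and table(survey$Sex).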

1.3 Plot

Plotting in R is done with the plot function, which chooses a suitable type of graphic from the kind of data it receives. There are also more specific functions such as barplot and boxplot.

In the next example, plot puts Sex on the X axis and the writing hand (W.Hnd) on the Y axis.

plot(survey$Sex, survey$W.Hnd)
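
As a sketch of the more specific functions mentioned above, the following draws a bar plot and a box plot. The vectors sex and pulse are made-up stand-ins for survey$Sex and survey$Pulse, and the titles and labels are chosen here for illustration.

```r
# Stand-in data (hypothetical values), shaped like survey$Sex and survey$Pulse.
sex   <- factor(c("Female", "Male", "Male", "Female", "Male", "Female"))
pulse <- c(72, 80, 64, 90, 70, 68)

# barplot expects the heights of the bars, so count the levels first.
counts <- table(sex)
barplot(counts, main = "Count by sex", xlab = "Sex", ylab = "Frequency")

# boxplot with a formula draws one box of pulse values per level of sex.
boxplot(pulse ~ sex, main = "Pulse by sex", xlab = "Sex", ylab = "Pulse")
```

With the real data, the same calls would be barplot(table(survey$Sex)) and boxplot(Pulse ~ Sex, data = survey).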

1.3.1 Labels

plot has several parameters to customize the appearance of the graphic, for example the labels that appear on the title, the X axis and the Y axis. It is also possible to change the colour of each column.

plot(survey$Sex, main="Count by gender", xlab="gender",
     ylab="frequency", col=c("green", "yellow"))

1.4 Correlate without plotting

The cor.test function can be used to test for correlation between two columns. It supports the three most common methods: Pearson's, Kendall's and Spearman's.

A cor value close to 1 (or to -1) means the two columns are strongly related; a value near 0 means little or no linear relationship. In the following example, Pearson's method is used, which is the default and suits numeric columns such as these.

res <- cor.test(survey$Wr.Hnd, survey$NW.Hnd)
print(res)

The result is stored in the res variable. To retrieve the estimated correlation, use the $ operator.

res$estimate
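
The other two methods are selected with the method parameter of cor.test. The following is a minimal sketch on made-up numeric vectors x and y, which stand in for survey$Wr.Hnd and survey$NW.Hnd; the values and variable names are assumptions.

```r
# Stand-in numeric vectors (hypothetical values, no ties).
x <- c(18.5, 19.5, 18.0, 18.8, 20.0, 17.7, 21.4)
y <- c(18.0, 20.5, 13.3, 18.9, 20.2, 17.5, 21.0)

# method picks the coefficient; "pearson" is the default.
pearson  <- cor.test(x, y, method = "pearson")
spearman <- cor.test(x, y, method = "spearman")
kendall  <- cor.test(x, y, method = "kendall")

# Each result is a list: $estimate holds the coefficient
# and $p.value the p-value of the test.
pearson$estimate
spearman$estimate
kendall$estimate
pearson$p.value
```

The estimate is labelled cor for Pearson's, rho for Spearman's and tau for Kendall's method.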

1.5 More Examples

1.5.1 Mean

Calculating the mean is simple. Remember that the mean is sensitive to extreme values. The following example shows how to create a vector with some values and calculate its mean.

r <- c(6, 7, 8, 7, 6)
mean(r)

Another example, where the mean is calculated from more extreme values.

t <- c(10, 10, 10, 2, 2)
mean(t)

Both vectors have the same mean, 6.8, but their values are spread very differently. The closer the standard deviation (SD) is to zero, the better the mean represents the data; in other words, the mean is more meaningful when the SD is close to zero. The following code calculates the standard deviation of each vector with the sd() function.

print(sd(r))
print(sd(t))

1.5.2 Median

The geometric median is calculated by using the nstat package.
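
The ordinary median (as opposed to the geometric median) is available in base R through the median function. The sketch below, on a made-up vector x, contrasts it with the mean when one extreme value is present.

```r
# Made-up values with one extreme observation (100).
x <- c(6, 7, 8, 7, 100)

mean(x)    # pulled upwards by the extreme value
median(x)  # middle value of the sorted vector, not affected by it
```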

1.5.3 Mode

The mode can be calculated by using the modes or the frequency package.

The following snippet creates a table that counts how many times each number is repeated in the vector.

v <- c(1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6, 7)
example <- table(v)
example

Using the modes package gives the following results.

library(modes)
modes(v)
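
If the modes package is not available, the mode can also be read off a frequency table with base R alone. This is only a sketch, not a description of what modes(v) does internally; the names counts and mode_values are chosen here.

```r
v <- c(1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6, 7)

counts <- table(v)

# Keep every value whose count equals the highest count,
# so that ties (several modes) are all reported.
mode_values <- as.numeric(names(counts)[counts == max(counts)])
mode_values
```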
  1. Using Formulae

    The mode can also be found with a formula. For each item, example holds the number of times it appears in the vector.

    Suppose that $i$ is the number of repetitions of one item and $s$ is the length of the vector. The following formula gives the percentage of appearances of that item: $\frac{i}{s} \times 100$

    R can apply the same formula to every item at once by replacing the item count with the table itself.

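    # named rel_freq rather than c, to avoid shadowing the c() function
    rel_freq <- (example / length(v)) * 100
    rel_freq
    

    The following code creates a data frame that displays the frequency and the calculated relative frequency (as a percentage) of each element.

    # as.vector drops the table structure so each column is a plain vector
    req <- data.frame(value = names(example),
                      freq = as.vector(example),
                      relative_freq = as.vector(rel_freq))
    req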
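
As an alternative sketch, base R's prop.table computes the same proportions directly from a table, without dividing by the length by hand. The name rel is chosen here.

```r
v <- c(1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6, 7)

# prop.table divides each count by the total, so the proportions sum to 1;
# multiplying by 100 turns them into percentages.
rel <- prop.table(table(v)) * 100
round(rel, 2)
```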
    

1.5.4 Summary

The summary function provides some information about a dataframe and its columns.

summary(survey)

Sex          Wr.Hnd          NW.Hnd          W.Hnd       Fold          Pulse
Female:118   Min.   :13.00   Min.   :12.50   Left : 18   L on R : 99   Min.   : 35.00
Male  :118   1st Qu.:17.50   1st Qu.:17.50   Right:218   Neither: 18   1st Qu.: 66.00
NA's  :  1   Median :18.50   Median :18.50   NA's :  1   R on L :120   Median : 72.50
             Mean   :18.67   Mean   :18.58                             Mean   : 74.15
             3rd Qu.:19.80   3rd Qu.:19.73                             3rd Qu.: 80.00
             Max.   :23.20   Max.   :23.50                             Max.   :104.00
             NA's   :1       NA's   :1                                 NA's   :45

Clap          Exer       Smoke       Height          M.I            Age
Left   : 39   Freq:115   Heavy: 11   Min.   :150.0   Imperial: 68   Min.   :16.75
Neither: 50   None: 24   Never:189   1st Qu.:165.0   Metric  :141   1st Qu.:17.67
Right  :147   Some: 98   Occas: 19   Median :171.0   NA's    : 28   Median :18.58
NA's   :  1              Regul: 17   Mean   :172.4                  Mean   :20.37
                         NA's :  1   3rd Qu.:180.0                  3rd Qu.:20.17
                                     Max.   :200.0                  Max.   :73.00
                                     NA's   :28

Be careful when NA values are present in a column. For example, the Pulse column has NA values, which can change the results of some functions.

print(min(survey$Age))
print(max(survey$Age))
print(min(survey$Pulse))
print(min(survey$Pulse, na.rm=TRUE))

The min function returns NA when the column contains NA values and they are not ignored. Passing na.rm=TRUE as a parameter makes min ignore them.
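
The same na.rm parameter is accepted by other summary functions such as mean and range. The sketch below uses a made-up vector pulse, a stand-in for survey$Pulse, to show how to count the missing values and how they affect a result.

```r
# Made-up pulse readings with two missing values.
pulse <- c(72, NA, 64, 90, NA, 68)

sum(is.na(pulse))           # number of missing values
mean(pulse)                 # NA, because missing values propagate
mean(pulse, na.rm = TRUE)   # mean of the four observed values
range(pulse, na.rm = TRUE)  # smallest and largest observed values
```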

2 Application of tests

IV: independent variable. DV: dependent variable.

IV                              DV      Non-parametric test
Factor with 2 levels            Likert  Mann-Whitney
Factor with more than 2 levels  Likert  Kruskal-Wallis
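
Both tests are available in base R: wilcox.test runs the Mann-Whitney test (which R calls the Wilcoxon rank-sum test) and kruskal.test runs the Kruskal-Wallis test. The following is a minimal sketch on made-up Likert answers; the data frames, the group labels A/B/C and the scores are assumptions for illustration only.

```r
# Made-up Likert answers (1 to 5) for two groups.
two_groups <- data.frame(
  score = c(1, 2, 2, 3, 3, 4, 4, 5, 5, 5),
  group = factor(rep(c("A", "B"), each = 5))
)

# IV with 2 levels: Mann-Whitney.
# exact = FALSE asks for the normal approximation, since Likert data has ties.
mw <- wilcox.test(score ~ group, data = two_groups, exact = FALSE)
mw$p.value

# Made-up Likert answers for three groups.
three_groups <- data.frame(
  score = c(1, 2, 2, 3, 3, 4, 4, 5, 5, 5, 2, 3, 3, 4, 1),
  group = factor(rep(c("A", "B", "C"), each = 5))
)

# IV with more than 2 levels: Kruskal-Wallis.
kw <- kruskal.test(score ~ group, data = three_groups)
kw$p.value
```

In both cases the formula reads as DV ~ IV: the Likert score on the left and the factor on the right.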

3 License of This Work

This work is licensed under the Creative Commons Attribution-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nd/4.0/.

R Practice 1 by Gimenez Christian is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.