Hello again! I am falling deeply in love with R, so in this post I’d still like to ramble about it. For an R newbie like me, it’s essential to know the basic data types in R, how to obtain data from various sources such as csv files, xls files, or the built-in datasets, what matrices and lists are, and which operators we can use. Without further ado, let’s dive into the very basics of R.
I. Data Type in R
R can handle numeric, text (character), and logical values.
Use the function
class(var_name) to check the data type of a variable.
- Numerical: integer (int), double (dbl)
- Categorical: factor (fct)
- Vectors (one-dimensional arrays) –> can hold numeric, character or logical values. All elements of a vector have the same type.
Created using:
- c(), or
- the vector() function –>
vector(mode = "logical", length = 0) –> produces a vector of the given length and mode.
Example (here df is assumed to be a data frame with 4 columns):
output <- vector("double", ncol(df))
Output:
> output
[1] 0 0 0 0
- Matrices (two-dimensional array) –> can hold numeric, character or logical values. The elements in a matrix all have the same data type.
Created using matrix() .
- Data frames (two-dimensional objects) –> can hold numeric, character or logical values. Within a column, all elements have the same data type, but different columns can be of different data types.
Created using data.frame() .
- Table –> a contingency table of the counts at each combination of factor levels.
- List –> a collection of a variety of objects under one name. These objects can be matrices, vectors, data frames, even other lists, etc.
A list is a super data type: you can store practically any piece of information in it. Created using list() .
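To make the types above concrete, here is a small sketch. All the variable names and values are made up for illustration only; class() is the only thing taken from the notes above.

```r
# Hypothetical example values, chosen only for illustration
num_vec <- c(1.5, 2, 3)            # double vector
int_vec <- c(1L, 2L, 3L)           # integer vector (note the L suffix)
chr_vec <- c("a", "b", "c")        # character vector
lgl_vec <- c(TRUE, FALSE, TRUE)    # logical vector

class(num_vec)   # "numeric"
class(int_vec)   # "integer"
class(chr_vec)   # "character"
class(lgl_vec)   # "logical"

# A matrix holds a single type; a data frame allows one type per column
m  <- matrix(1:6, nrow = 2)
df <- data.frame(id = int_vec, label = chr_vec, ok = lgl_vec,
                 stringsAsFactors = FALSE)
sapply(df, class)   # one class per column

# A list can mix all of the above under one name
stuff <- list(numbers = num_vec, table = df, nested = list(flag = TRUE))
class(stuff)     # "list"
```

Passing stringsAsFactors = FALSE explicitly keeps the label column as character on both old and new versions of R.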
II. Operators in R
- Relational (equal/unequal) : == , != , > , >= , < , <=
- Logical –> AND (&) , OR ( | ) , NOT ( ! ), double AND (&&) , double OR (||)
Notes:
- ‘&’ , ‘&&’, ‘|’, and ‘||’ behave differently in R compared to other languages such as Java and C.
- & and |
- compare every corresponding element of the two vectors –> perform an element-wise operation.
- produce a result with the length of the longer operand.
- && and ||
- work on single (length-one) logical values and return a single logical value.
- In older versions of R they silently examined only the first element of each operand and ignored the rest. Since R 4.3.0, passing an operand longer than one is an error.
- See the example from datamentor.io below to get a clearer understanding (the x&&y and x||y results come from an older R; on R 4.3.0 or later those two calls raise an error):
> x <- c(TRUE,FALSE,0,6)
> y <- c(FALSE,TRUE,FALSE,TRUE)
> !x
[1] FALSE TRUE TRUE FALSE
> x&y
[1] FALSE FALSE FALSE TRUE
> x&&y
[1] FALSE
> x|y
[1] TRUE TRUE FALSE TRUE
> x||y
[1] TRUE
- I must say that the concept of &, &&, |, and || is quite confusing to me. I also find that the documentation is not very enlightening. Luckily, I managed to gather information from various sources such as stackoverflow and csgillespie:
- && and || are what is called “short circuiting”. That means that they will not evaluate the second operand if the first operand is enough to determine the value of the expression.
- For example if the first operand to && is false then there is no point in evaluating the second operand, since it can’t change the value of the expression (false && true and false && false are both false). The same goes for || when the first operand is true.
- && and || are very useful for flow control.
- Still confused? Then just follow this rule:
“For logical comparison, stick to ‘&’ and ‘|’ unless you know you need ‘&&’.
Use && and || when you want to do some flow control like if..else… and while(), or whenever you are sure what you are going to do with && and ||.”
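A short sketch of the difference may help. The vectors and the msg variable are hypothetical; only length-one operands are used with && and || so that it runs on any R version.

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

# Element-wise: one result per pair of elements
x & y    # TRUE FALSE FALSE
x | y    # TRUE TRUE TRUE

# Short-circuit: the right-hand side is never evaluated here,
# so stop() is not triggered
FALSE && stop("never evaluated")   # FALSE
TRUE  || stop("never evaluated")   # TRUE

# Typical flow-control use: guard a test with a length-one condition
v <- numeric(0)
if (length(v) > 0 && v[1] > 10) {
  msg <- "first element is large"
} else {
  msg <- "empty or small"
}
msg   # "empty or small"
```

Because the first condition is FALSE, v[1] > 10 is never looked at, which is exactly what makes && handy inside if().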
- x < y, TRUE if x is less than y
- x <= y, TRUE if x is less than or equal to y
- x == y, TRUE if x equals y
- x != y, TRUE if x does not equal y
- x >= y, TRUE if x is greater than or equal to y
- x > y, TRUE if x is greater than y
- x %in% c(a, b, c), TRUE if x is in the vector c(a, b, c)
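The relational operators in the list above are also element-wise. A tiny sketch with made-up numbers:

```r
x <- c(3, 7, 10)
y <- c(5, 7, 2)

x < y             # TRUE FALSE FALSE
x == y            # FALSE TRUE FALSE
x != y            # TRUE FALSE TRUE

# %in% asks, for each element of x, whether it appears anywhere in the set
x %in% c(7, 10)   # FALSE TRUE TRUE
```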
III. Obtaining Data
- hardcode it
- using 1-d array (vector)
2 functions that we have to know when working with vectors:
1) the combine function c() for creating the vector and
2) the names function names() for naming the elements in the vector.
- Creating a 1-d array (vector)
use the combine function –> c()
Remember: ‘c’ stands for combine; it combines its arguments into a vector.
e.g:
numeric_vector <- c(1, 10, 49)
character_vector <- c("a", "b", "c")
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
Example:
> numeric_vector <- c(140, -50, 20, -120, 240)
> numeric_vector
[1] 140 -50 20 -120 240
We can also name the vector elements using either of these 2 ways:
1. directly naming it when creating the vector. So, it’s like creating an associative array.
2. name it after creating the vector by using the function names(the_vector) <- the_vector_names .
Example:
> # 1st method: directly name it
> numeric_vector <- c("Monday"=140, "Tuesday"=-50, "Wednesday"=20, "Thursday"=-120, "Friday"=240)
> numeric_vector
Monday Tuesday Wednesday Thursday Friday
140 -50 20 -120 240
> # 2nd method: name it later using function names()
> names(numeric_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
> numeric_vector
Monday Tuesday Wednesday Thursday Friday
140 -50 20 -120 240
We can also merge several vectors into 1 single vector.
e.g:
red <- c(12, 23)
yellow <- c(34, 45)
green <- c(56, 67)
traffic_light <- c(red, yellow, green)
- selecting certain elements
a) based on index number enclosed in square bracket
index starts from 1.
Example:
my_vector[2] –> selecting the 2nd element of a vector
my_vector[c(2, 3, 4, 5)] –> selecting the 2nd, 3rd, 4th, and 5th elements of a vector. –> there is a more convenient way as follows:
my_vector[2:5] –> unlike python, the last index is included.
b) based on the names of the elements
> poker_start <- poker_vector[c("Monday", "Tuesday", "Wednesday")]
> poker_start
Monday Tuesday Wednesday
140 -50 20
- Sum of all elements in a vector
use: sum(vector_name)
example:
# Poker winnings from Monday to Friday
poker_vector <- c(140, -50, 20, -120, 240)
total_poker <- sum(poker_vector)
total_poker
- Length of the vector
length(vector_name)
- create categories out of a vector’s values.
use –> factor(x = character(), levels, labels = levels, exclude = NA, ordered = is.ordered(x), nmax = NA)
A factor can be thought of as a categorical vector.
Example 1: Factoring
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
# Convert sex_vector to a factor
factor_sex_vector <- factor(sex_vector)
# Print out factor_sex_vector
factor_sex_vector
# Output:
# Levels: Female Male
Example 2: Refactoring and Relabeling:
# Relabel the race variable
adult$RACEHPR2 <- factor(adult$RACEHPR2, labels = c('Latino', 'Asian', 'African American', 'White'))
- Converting a numeric-like factor from ‘factor’ type to ‘numeric’ type.
- Let’s say we have a factor ‘zf’ that contains numeric-like values.
> z <- c(1, 5, 0, 6, 1)
> zf <- factor(z)
> zf
[1] 1 5 0 6 1
Levels: 0 1 5 6
- unclass it to get its integer codes
> unclass(zf)
[1] 2 3 1 4 2
attr(,"levels")
[1] "0" "1" "5" "6"
- converting it directly to a numeric type will return its integer codes instead of its real values
> as.numeric(zf)
[1] 2 3 1 4 2
- meanwhile, converting it to a character type results in the string version of its values.
> as.character(zf)
[1] "1" "5" "0" "6" "1"
- Thus, to convert a factor with numeric-like values into numeric values, we have to do a double conversion
> as.numeric(as.character(zf))
[1] 1 5 0 6 1
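As a side note, there is a second well-known idiom for the same conversion: convert the levels once and index them with the factor. The sketch below compares the two; the names via_character and via_levels are mine.

```r
zf <- factor(c(1, 5, 0, 6, 1))

# Double conversion: factor -> character -> numeric
via_character <- as.numeric(as.character(zf))

# Alternative: convert the (few) levels once, then index by the integer codes
via_levels <- as.numeric(levels(zf))[zf]

via_character                          # 1 5 0 6 1
identical(via_character, via_levels)   # TRUE
```

Both give the real values back; the second form only converts each distinct level once, which may matter for long factors.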
- Adding a new element to a vector
Just directly add it using the function c() .
Example:
> my_score <- c(10, 25, 39, 40)
> my_score
[1] 10 25 39 40
> my_score <- c(my_score, 53)
> my_score
[1] 10 25 39 40 53
- Load it from the existing datasets that ship with R.
- By default, R (and therefore RStudio) already provides a number of datasets for us to use, such as iris, mtcars, Titanic, and AirPassengers. They live in the datasets package.
- List of functions that we can use:
- data() –> check the list of available datasets.
- data(dataset_name) –> start using the dataset by loading it.
- str(dataset_name) –> check the structure of the dataset.
- glimpse(dataset_name) –> see a glimpse of the dataset (glimpse() is available through the dplyr and tibble packages, not base R).
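A quick sketch of those functions on the built-in iris dataset, using base R only:

```r
data(iris)            # load the built-in iris dataset into the workspace
str(iris)             # structure: number of rows, columns, and column types
head(iris, 3)         # first three rows
class(iris$Species)   # "factor"
```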
- import from other files.
- Flat Files
- using the built-in scan() function
scan(txtfile, skip=0)
- reads plain text files (such as .txt files).
- returns a 1-d vector (not a dataframe).
- Notes:
- txtfile = the text file
- skip = the number of lines of the input file to skip before beginning to read data values.
- Example:
x3 <- scan('x.txt')
- to get a matrix instead of a vector, we can use scan() inside matrix().
Example:
Let’s say we have a text file that contains data as follows:
3.37095845 2.3219253
1.43530183 1.2161611
2.36312841 3.5757275
2.6328626 2.6428993
2.40426832 2.0897606
Using scan() we get a vector as follows:
> xtrim <- scan('x_trimmed.txt')
Read 10 items
> xtrim
[1] 3.370958 2.321925 1.435302 1.216161 2.363128 3.575728 2.632863 2.642899 2.404268 2.089761
To turn it into a matrix, we can use scan() inside matrix():
> xtrim2 <- matrix(scan('x_trimmed.txt'), ncol = 2, byrow = TRUE)
Read 10 items
> xtrim2
[,1] [,2]
[1,] 3.370958 2.321925
[2,] 1.435302 1.216161
[3,] 2.363128 3.575728
[4,] 2.632863 2.642899
[5,] 2.404268 2.089761
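To try scan() without having x_trimmed.txt at hand, here is a self-contained sketch. It writes a small temporary file first; the file contents are made up, and quiet = TRUE just suppresses the "Read n items" message.

```r
# Write a small two-column text file to a temporary location (hypothetical data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("a header line to skip", "1.5 2.5", "3.5 4.5", "5.5 6.5"), tmp)

# skip = 1 ignores the first line; the result is a flat numeric vector
vals <- scan(tmp, skip = 1, quiet = TRUE)
vals   # 1.5 2.5 3.5 4.5 5.5 6.5

# Reshape into a 3 x 2 matrix, filling row by row to mirror the file layout
mat <- matrix(vals, ncol = 2, byrow = TRUE)
mat

unlink(tmp)   # clean up the temporary file
```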
- using built-in utils package
read.xxx
Result: dataframe
- Supporting format:
- flat file –> *.txt, *.csv
-
read.csv(filename, stringsAsFactors = TRUE, header = TRUE, sep = ',')
Notes:
- for comma-delimited files.
- filename includes the path –> must be enclosed in quotes.
- if stringsAsFactors = TRUE –> convert strings in the flat file to factors. This only makes sense if the strings you import represent categorical variables in R. If you set stringsAsFactors to FALSE, the data frame columns corresponding to strings in your text file will be characters. Since R 4.0.0 the default is FALSE, so pass stringsAsFactors = TRUE explicitly if you want factors.
-
read.delim(filename, stringsAsFactors = TRUE, sep = "\t", header = TRUE)
Notes:
- for tab-delimited files.
-
read.table(filename, stringsAsFactors = TRUE, header = FALSE, sep = "", col.names = c("xxx", "yyy", "zzz"), colClasses = c('class1', 'class2', 'class3'))
Notes:
- for other types of delimiters.
- It can be used for comma-delimited or tab-delimited files too, as long as we pass the proper sep argument.
- header = TRUE means the 1st row contains the variable names.
- sep = separator. It’s "" by default, which means any whitespace. We can change it to suit our file, such as "/".
- col.names –> an optional vector of names for the variables. The default is to use "V" followed by the column number.
- colClasses –> specify the column types/classes of the resulting data frame.
–> If a column is set to "NULL" in the colClasses vector, this column will be skipped and will not be loaded into the data frame.
Example:
hotdogs2 <- read.delim("hotdogs.txt", header = FALSE,
col.names = c("type", "calories", "sodium"),
colClasses = c("factor", "NULL", "numeric"))
- databases –> PostgreSQL, MySQL
- web
- other statistical software –> SPSS, Stata
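Here is a runnable sketch of read.table() with a custom separator, col.names, and colClasses. The file is a temporary one with made-up hotdog rows, not the real hotdogs.txt.

```r
# Hypothetical '/'-delimited file without a header row
tmp <- tempfile(fileext = ".txt")
writeLines(c("Beef/186/495", "Meat/173/458", "Poultry/129/430"), tmp)

hotdogs <- read.table(
  tmp,
  header = FALSE,
  sep = "/",
  col.names = c("type", "calories", "sodium"),
  colClasses = c("factor", "NULL", "numeric")  # "NULL" drops the calories column
)
str(hotdogs)   # two columns remain: type (factor) and sodium (numeric)

unlink(tmp)    # clean up the temporary file
```

Setting the types through colClasses means the result does not depend on the stringsAsFactors default of the R version in use.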
- Using readr package by Hadley Wickham
- Result: a tibble
- can handle flat files that are on the internet.
- format: read_ instead of read.
- utils vs readr
No | utils | readr
1 | uses a dot ‘.’ | uses an underscore ‘_’
2 | output: dataframe | output: tibble
3 | read.table() | read_delim()
4 | read.csv() | read_csv()
5 | read.delim() | read_tsv()
-
read_delim("filename", delim = "", col_names = TRUE, col_types = NULL, skip = 0, n_max = Inf)
Notes:
- delim = sep
- col_names = TRUE or FALSE or a vector of column names. If it’s TRUE, the 1st row will be the header.
- col_types = colClasses.
Possible values:
- NULL –> the column types will be guessed based on the first 1000 rows of the input
- a collection of collectors defined using col_something().
In the read_delim() function, it is enclosed in list().
Example:
# The collectors you will need to import the data
fac <- col_factor(levels = c("Beef", "Meat", "Poultry"))
int <- col_integer()
# Edit the col_types argument to import the data correctly: hotdogs_factor
hotdogs_factor <- read_tsv("hotdogs.txt",
col_names = c("type", "calories", "sodium"),
col_types = list(fac, int, int))
Possible collectors:
- col_double()
- col_character()
- col_integer() –> meaning: the column should be interpreted as an integer.
- col_factor(levels = c("x", "y", "z")) –> the column should be interpreted as a factor with the given levels
- a compact string representation where each character represents one column:
c = character, i = integer, n = number, d = double, l = logical, D = date, T = date time, t = time, ? = guess, or _/- to skip the column.
Example:
read_delim("states2.txt", delim = "/", col_types = "ccdd") –> col_types = character, character, double, double.
- skip –> number of rows at the top of the file that will be skipped before we begin importing
- n_max –> max number of rows that will be imported.
- read_csv("filename", col_names = TRUE)
- read_tsv("filename", col_names = TRUE) –> tab-separated value
- Using the package data.table by Matt Dowle and Arun Srinivasan
- use
fread() for reading the imported file.
- suitable for large datasets that read.csv() can’t handle comfortably.
- Similar to
read.table but faster and more convenient.
library(data.table)
fread(filename, drop = NULL, select = NULL)
Notes:
1. drop –> Vector of column names or numbers to drop, keep the rest.
2. select –> Vector of column names or numbers to keep, drop the rest.
Example:
# Import columns 6 and 8 of potatoes.csv: potatoes
potatoes <- fread('potatoes.csv', select = c(6, 8))
- Working with Excel Files
- using package readxl by Hadley Wickham –> a package dedicated to reading Excel files.
- its arguments are similar to those of readr.
- 2 functions:
- excel_sheets('filename') –> find out which sheets are available in the workbook.
Example:
> excel_sheets('urbanpop_nonames.xlsx')
[1] "1960-1966" "1967-1974" "1975-2011"
-
read_excel(path, sheet = 1, col_names = TRUE, col_types = NULL, na = "", skip = 0)
Notes:
- sheet ==> a sheet name or a sheet number; numbering starts from 1
- read_excel() cannot handle xls files that are on the internet (at least, not yet).
- using package gdata by Gregory R. Warnes
- read.xls(filename, sheet = 1)
- the first row of the original data is used for the variable (column) names.
Meaning: the rows after import != the rows before import.
- it skips empty rows
- it returns a dataframe.
- Always compare the data after import to get an understanding of the new rows order.
- can handle .xls files that are on the internet.
- Using package XLConnect
-
install.packages("XLConnect")
library(XLConnect)
# this is the bridge between R and Excel
book <- loadWorkbook(filename)
# To actually import data from a sheet
readWorksheet(book, sheet = num_or_name_sheet, startRow = n, endRow = n, startCol = n, endCol = n, header = FALSE)
Example:
# Import columns 3, 4, and 5 from second sheet in my_book: urbanpop_sel
urbanpop_sel <- readWorksheet(my_book, sheet = 2, startCol = 3, endCol = 5)
- Other functions in XLConnect:
- getSheets(workbook) –> list the sheets in an Excel workbook. The result is a character vector.
- createSheet(workbook, name = "sheet_name") –> add a new sheet to a workbook.
- writeWorksheet(workbook, data, sheet) –> populate a sheet with data
- saveWorkbook(workbook, filename) –> store a workbook in a new file.
- renameSheet()
- if you encounter problems during the XLConnect installation, try the following steps:
- install java
- install jdk
- associate the installed JDK with R
sudo R CMD javareconf
- Install rJava and the rgdal system libraries
sudo apt-get install r-cran-rjava
sudo apt-get install libgdal1-dev libproj-dev
- Install the package in RStudio
install.packages("rJava")
- For working with excel files through R. Bridging between Excel and R –> no need to import the files.
- makes it possible to edit your Excel files from inside R
- create a new sheet.
- populate the sheet with data
- save the results in a new Excel file.
- Working with Database
In this case, we do not download the whole data from the database into R, but rather load only the rows that we need.
Steps:
- Create a connection to the database. Define the db name, host, port, user, and password.
Example for mysql (src_mysql() comes from the dplyr package):
# Set up a connection to the mysql database
my_db <- src_mysql(dbname = "dplyr",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
- Reference a table in the DB –> use function
tbl(conn, table_name) .
Example for mysql:
# Reference a table within that source: nycflights
nycflights <- tbl(my_db, "dplyr")
This will be further discussed in another post.
- Download files from the internet
- use:
download.file(url, dest_pathfile)
- this will save the downloaded file into our local directory
- it only returns a status code (0 for success), not the file contents. Thus, assigning its result to a variable does not give us the data.
- load the downloaded file into our working environment
load(the_downloaded_file) –> restores the saved objects into the environment; it does not return the data itself.
- alternative way
load(url(the_file_url)) –> but the file will not be downloaded into our local directory.
- Working with JSON
- use the library jsonlite
library(jsonlite)
- use the function
fromJSON(json_strings) to convert JSON data into a nicely structured R list.
It also works if you pass a URL as a character string or the path to a local file that contains JSON data.
Example:
# Load the jsonlite package
library(jsonlite)
# wine_json is a JSON
wine_json <- '{"name":"Chateau Migraine", "year":1997, "alcohol_pct":12.4, "color":"red", "awarded":false}'
# Convert wine_json into a list: wine
wine <- fromJSON(wine_json)
# Print structure of wine
print(str(wine))
Output:
List of 5
$ name : chr "Chateau Migraine"
$ year : int 1997
$ alcohol_pct: num 12.4
$ color : chr "red"
$ awarded : logi FALSE
NULL
(The trailing NULL appears because str() itself returns NULL; calling str(wine) alone is enough.)
-
toJSON(x, pretty = FALSE) –> convert an R object back to JSON in the minified format.
Notes:
- pretty = TRUE –> prettify the JSON format
- other JSON-related functions:
- prettify()
- minify()
- Import from other Statistical software
- using haven package by Hadley Wickham
- read_sas() –> for SAS
- read_stata() and read_dta() –> for Stata
-
read_spss()
Summary of the haven package and the statistical software it supports. Image taken from datacamp.
- using foreign package by R Core team
- can handle more statistical software: spss, stata, systat, and weka.
- cannot import sas (.sas7bdat), only sas library (.xport):
-
install.packages('foreign')
library('foreign')
- read stata
read.dta(file, convert.dates = TRUE, convert.factors = TRUE,
missing.type = FALSE,
convert.underscore = FALSE, warn.missing.labels = TRUE)
Notes:
- convert.underscore –> if TRUE, convert the ‘_’ to ‘.’
- read spss
read.spss(file, use.value.labels = TRUE, to.data.frame = FALSE,
max.value.labels = Inf, trim.factor.names = FALSE,
trim_values = TRUE, reencode = NA, use.missings = to.data.frame,
sub = ".", add.undeclared.levels = c("sort", "append", "no"),
duplicated.value.labels = c("append", "condense"),
duplicated.value.labels.infix = "_duplicated_", ...)
Notes:
- to.data.frame = TRUE –> return data frame
to.data.frame = FALSE –> return list.
IV. Using Data
- selecting a certain column –> use the ‘$’ sign.
Example: Let’s say we have the data frame students and the column we want is ‘age’. Thus, we use students$age .
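A minimal sketch of column selection. The students data frame and its values are invented for the example.

```r
# Hypothetical students data frame
students <- data.frame(
  name = c("Ana", "Budi", "Citra"),
  age  = c(21, 19, 23),
  stringsAsFactors = FALSE
)

students$age                    # 21 19 23, a plain numeric vector
students[["age"]]               # same result, handy when the name is stored in a variable
mean(students$age)              # 21
students[students$age > 20, ]   # rows where age is above 20
```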
V. Visualization
- using R built-in plot()
- plot(x, y)
- hist(x, breaks = 10, main = paste("Histogram of" , xname), xlab = xx, ylab = yy)
Notes:
- main = the graph title
- using ggplot2 library.
ggplot(data, aes)
VI. Mathematical Functions
a. Arithmetic
– Using operators:
+ - * / %%
– They work element-wise.
– 2 types of operations:
a) matrix vs constant
b) matrix vs matrix
–
2 * my_matrix –> multiplies every element of my_matrix by two.
–
my_matrix1 * my_matrix2 –> creates a matrix where each element is the product of the corresponding elements in my_matrix1 and my_matrix2 .
“+” vs sum()
+ –> do element-wise summation
sum() –> sum up all elements in the vector(s)
Example:
> linkedin <- c(16, 9, 13, 5, 2, 17, 14)
> facebook <- c(17, 7, 5, 16, 8, 13, 14)
>
> # Element-wise sum vs. total sum
> lifa <- linkedin + facebook
> lifa
[1] 33 16 18 21 10 30 28
> lifa2 <- sum(linkedin + facebook)
> lifa2
[1] 156
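Since * on two matrices is element-wise, it is worth seeing it next to the true matrix product %*% (my addition, shown only for contrast). The two small matrices are made up.

```r
my_matrix1 <- matrix(1:4, nrow = 2)   # columns: (1, 2) and (3, 4)
my_matrix2 <- matrix(5:8, nrow = 2)   # columns: (5, 6) and (7, 8)

2 * my_matrix1              # every element doubled
my_matrix1 * my_matrix2     # element-wise product
my_matrix1 %*% my_matrix2   # true matrix product, shown for contrast
```

The element-wise product has 5 and 21 in its first row, while the matrix product has 23 and 31 there, so the two operators are easy to tell apart.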
b. Mathematic/Statistics
- abs()
- coef(modeling_function)
Extract coefficients from a modeling function.
Example:
> x <- 1:5; coef(lm(c(1:3, 7, 6) ~ x))
(Intercept) x
-0.7 1.5
- mean(vector_name, trim=0, na.rm=FALSE)
Notes:
– trim = the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed.
– na.rm = remove missing values (NA), if it’s TRUE.
Example:
> # The linkedin and facebook vectors have already been created for you
> linkedin <- c(16, 9, 13, 5, 2, 17, 14)
> facebook <- c(17, 7, 5, 16, 8, 13, 14)
>
> # Calculate the mean of the sum
> avg_sum <- mean(linkedin + facebook)
>
> # Calculate the trimmed mean of the sum
> # we trim the first 20% and the last 20% of observations
> avg_sum_trimmed <- mean((linkedin + facebook), trim = 0.2)
>
> # Inspect both new variables
> print(avg_sum)
[1] 22.28571
> print(avg_sum_trimmed)
[1] 22.6
- standard deviation –> sd(x, na.rm = FALSE)
- linear regression related
- lm(formula = y ~ x, data)
- To fit a linear model with a response variable y and explanatory variable x
- example:
US_fit <- lm(percent_yes ~ year, data = US_by_year) .
Output:
> US_fit
Call:
lm(formula = percent_yes ~ year, data = US_by_year)
Coefficients:
(Intercept) year
12.664146 -0.006239
We can also see more details of the lm() output by calling summary() on it.
> summary(US_fit)
Call:
lm(formula = percent_yes ~ year, data = US_by_year)
Residuals:
Min 1Q Median 3Q Max
-0.222491 -0.080635 -0.008661 0.081948 0.194307
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.6641455 1.8379743 6.890 8.48e-08 ***
year -0.0062393 0.0009282 -6.722 1.37e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1062 on 32 degrees of freedom
Multiple R-squared: 0.5854, Adjusted R-squared: 0.5724
F-statistic: 45.18 on 1 and 32 DF, p-value: 1.367e-07
- lm’s methods and properties:
my_lm <- lm(y ~ x, data=my_data) .
- coef(my_lm) –> extract the coefficients of the model.
- summary(my_lm) –> show the full output of the model.
- to make the model output above tidy, use the package
broom and its function
tidy(model) .
Example:
From the model US_fit above, we can tidy it using broom as follows:
> US_tidied <- tidy(US_fit)
> US_tidied
term estimate std.error statistic p.value
1 (Intercept) 12.664145512 1.8379742715 6.890274 8.477089e-08
2 year -0.006239305 0.0009282243 -6.721764 1.366904e-07
- summary(my_lm)$r.squared
–> find the coefficient of determination (R squared). R squared is the proportion of the dependent variable’s variance that is predictable from the independent variables.
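The US_by_year data is not included in this post, so here is a sketch of the same workflow on the built-in cars dataset instead (stopping distance vs speed). The names fit and fit_summary are mine.

```r
# Built-in 'cars' dataset: stopping distance (dist) vs speed
fit <- lm(dist ~ speed, data = cars)

coef(fit)                   # intercept and slope
fit_summary <- summary(fit)
fit_summary$r.squared       # coefficient of determination
fit_summary$coefficients    # estimates, std. errors, t values, p-values
```

Storing summary(fit) in a variable makes it easy to pull out single pieces such as r.squared without re-reading the printed output.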
- max() –> find the maximum value of a vector or data frame.
- min() –> find the minimum value of a vector or data frame.
- pmax() –> computes the parallel maxima of two or more input vectors.
-
pmin() –> computes the parallel minima of two or more input vectors.
Example:
> x1 <- c(2, 8, 3, 4, 1, 5)
> x2 <- c(0, 7, 5, 5, 6, 1)
> max(x1, x2)
[1] 8
> pmax(x1,x2)
[1] 2 8 5 5 6 5
> min(x1, x2)
[1] 0
> pmin(x1, x2)
[1] 0 7 3 4 1 1
- median()
-
quantile(x, probs, na.rm=FALSE, names=TRUE) .
Notes:
- x = the numeric vector whose quantiles we want
- probs = the quantile(s) we want to obtain, given as probabilities between 0 and 1. It can be a single number or a vector.
Example: find the 5th percentile
quantile(cars$mpg, probs = 0.05)
Example: find the 5th and 95th percentile
quantile(cars$mpg, probs = c(0.05, 0.95))
- If we do not specify probs, then by default quantile will return a named vector of 5 elements: min, q1, q2 (the median), q3, max.
Example:
> asdf <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30)
> quantile(asdf)
0% 25% 50% 75% 100%
2 9 16 23 30
Thus, if we want to get q3, we can simply use:
> quantile(asdf)[4]
75%
23
- round()
- sum() –> sum of all values in the data structure (vector, matrix, etc).
- which.max() –> find the index of the max element in a vector.
- which.min() –> find the index of the min element in a vector.
Example:
> x <- c(1:4, 0:5, 11)
> x
[1] 1 2 3 4 0 1 2 3 4 5 11
> min(x)
[1] 0
> which.min(x)
[1] 5
> max(x)
[1] 11
> which.max(x)
[1] 11
- cor(a, b, use='everything')
- Notes:
- a, b = the 2 variables to calculate the correlation of.
- use = a string which defines how to handle missing values.
- "everything" –> default –> the resulting cor value will be NA whenever one of its contributing observations is NA.
- "all.obs" –> the presence of missing observations will produce an error.
- "complete.obs" –> deletes all cases (rows) with missing values before calculating the correlation. If there are no complete cases, it returns an error. Complete cases = rows in which every column has a value –> no missing values.
- "na.or.complete" –> same as "complete.obs" except that if there are no complete cases, it will return NA.
- "pairwise.complete.obs" –> only take complete pairs into the correlation calculation. Let’s say the 3rd value in a is present but the 3rd value in b is missing. Then that pair is dropped from the calculation.
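A tiny sketch of how the use argument changes the result when there is a missing value. The two vectors are made up so that the complete pairs are perfectly linear.

```r
a <- c(1, 2, 3, 4, 5)
b <- c(2, 4, NA, 8, 10)

cor(a, b)                         # NA, because the default is use = "everything"
cor(a, b, use = "complete.obs")   # 1, the incomplete 3rd pair is dropped first
```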
- chisq.test()
performs chi-squared contingency table tests and goodness-of-fit tests.
Format:
chisq.test(x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)), rescale.p = FALSE, simulate.p.value = FALSE, B = 2000)
Returns:
a list of 9 elements: statistic, parameter, p.value, method, data.name, observed, expected, residuals, stdres.
Example:
> results <- chisq.test(table(adult$RBMI, adult$SRAGE_P))
> results
Pearson's Chi-squared test
data: table(adult$RBMI, adult$SRAGE_P)
X-squared = 1009.5, df = 198, p-value < 2.2e-16
> str(results)
List of 9
$ statistic: Named num 1010
..- attr(*, "names")= chr "X-squared"
$ parameter: Named int 198
..- attr(*, "names")= chr "df"
$ p.value : num 6.4e-109
$ method : chr "Pearson's Chi-squared test"
$ data.name: chr "table(adult$RBMI, adult$SRAGE_P)"
$ observed : 'table' int [1:4, 1:67] 30 254 80 52 22 248 76 45 14 191 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:4] "Under-weight" "Healthy-weight" "Over-weight" "Obese"
.. ..$ : chr [1:67] "18" "19" "20" "21" ...
$ expected : num [1:4, 1:67] 8.38 174.11 141.91 91.61 7.87 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:4] "Under-weight" "Healthy-weight" "Over-weight" "Obese"
.. ..$ : chr [1:67] "18" "19" "20" "21" ...
$ residuals: table [1:4, 1:67] 7.47 6.05 -5.2 -4.14 5.04 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:4] "Under-weight" "Healthy-weight" "Over-weight" "Obese"
.. ..$ : chr [1:67] "18" "19" "20" "21" ...
$ stdres : table [1:4, 1:67] 7.59 7.98 -6.43 -4.71 5.11 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:4] "Under-weight" "Healthy-weight" "Over-weight" "Obese"
.. ..$ : chr [1:67] "18" "19" "20" "21" ...
- attr(*, "class")= chr "htest"
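Because the result is a list, its pieces can be pulled out with $. The adult data is not available here, so this sketch uses the built-in mtcars dataset instead; the name tab is mine.

```r
# Contingency table from the built-in mtcars data: cylinders vs transmission
tab <- table(mtcars$cyl, mtcars$am)

# suppressWarnings() hides the small-sample approximation warning
results <- suppressWarnings(chisq.test(tab))

names(results)      # the 9 element names listed above
results$statistic   # the X-squared value
results$p.value     # the p-value as a plain number
results$expected    # expected counts under independence
```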
VII. Text-Related Functions
-
paste (…, sep = " ", collapse = NULL)
converts its arguments to character strings, and concatenates them (separating them by the string given by sep).
Example:
> age <- 25
> print(paste("My age is", age))
[1] "My age is 25"
In the code above, paste concatenates My age is and age after converting age to a string.
-
paste0(...)
similar to paste() except that the strings are concatenated without a separator.
Example:
> cols <- c("country", paste0("year_", 1960:1966))
> cols
[1] "country" "year_1960" "year_1961" "year_1962" "year_1963" "year_1964"
[7] "year_1965" "year_1966"
> cols2 <- c("country", paste("year_", 1960:1966))
> cols2
[1] "country" "year_ 1960" "year_ 1961" "year_ 1962" "year_ 1963"
[6] "year_ 1964" "year_ 1965" "year_ 1966"
See the difference?
-
cat(R_objects)
– It converts its arguments to character vectors, concatenates them into a single character vector, appends the given sep= string(s) to each element and then outputs them to the console or a file.
– cat() will not return anything, it will just output to the console or another connection.
– paste() vs cat()
–> paste() will return something, while cat() only outputs the result to the console.
–> thus, output from cat() cannot be assigned to a variable, whilst output from paste() can.
– Example:
> asdf <- cat("Miaow!")
Miaow!
> asdf
NULL # asdf = NULL because cat() does not return anything.
>
> paste("Miaow!")
[1] "Miaow!"
- the length of chars –>
nchar(the_strings)
Example:
> nama <- "erika"
> nchar(nama)
[1] 5
- replacement
sub(pattern, replacement, x) –> only replace the first match
gsub(pattern, replacement, x) –> replace all matches.
- split string based on a certain separator.
use: strsplit(string, split=sep)
Notes:
– split = the separator used for separating the string.
– strsplit() returns a list with one component per input string; each component is a character vector holding the pieces of that string.
In example 1, the input is a single string. Thus, the resulting list consists of 1 component only.
On the other hand, in example 2, the input is a vector of 4 strings. Thus, the resulting list comprises 4 components.
– split = "" –> split by character
– split = " " –> split by word (separated by a whitespace).
Single square brackets return a list that contains the selected component.
To directly get the content of the component, we have to add [[1]] right after splitting the string.
– Example 1:
> rquote <- "r's internals are irrefutably intriguing"
> # without [[1]]
> chars <- strsplit(rquote, split = "")
> chars
[[1]]
[1] "r" "'" "s" " " "i" "n" "t" "e" "r" "n" "a" "l" "s" " " "a" "r" "e" " " "i"
[20] "r" "r" "e" "f" "u" "t" "a" "b" "l" "y" " " "i" "n" "t" "r" "i" "g" "u" "i"
[39] "n" "g"
> # with [[1]]
> chars2 <- strsplit(rquote, split = "")[[1]]
> chars2
[1] "r" "'" "s" " " "i" "n" "t" "e" "r" "n" "a" "l" "s" " " "a" "r" "e" " " "i"
[20] "r" "r" "e" "f" "u" "t" "a" "b" "l" "y" " " "i" "n" "t" "r" "i" "g" "u" "i"
[39] "n" "g"
Can you see the difference? The first output starts with a [[1]] header because chars is still a list, while chars2 is a plain character vector.
– Example 2: resulting list has more than 1 component.
> # Split names from birth year
> # pioneers is a vector that consists of 4 elements. Thus, the resulting list will also comprise 4 components.
> pioneers <- c("GAUSS:1777", "BAYES:1702", "PASCAL:1623", "PEARSON:1857")
>
> # Split names from birth year
> split_math <- strsplit(pioneers, split = ":")
> split_math
[[1]]
[1] "GAUSS" "1777"
[[2]]
[1] "BAYES" "1702"
[[3]]
[1] "PASCAL" "1623"
[[4]]
[1] "PEARSON" "1857"
- Converting string to lowercase letters
– use tolower()
– example:
> aku <- "ERIKA"
> aku_low <- tolower(aku)
> aku_low
[1] "erika"
- Converting string to uppercase letters
– use toupper() .
- finding similarity and differences
- unique(iterables) –> unique() can also be used for numbers, etc. It is not exclusive to strings.
- setequal(a, b) –> check if a and b contain the same set of values
- identical(a, b) –> check if a and b are exactly the same object.
The difference between setequal and identical shows up:
- when there are duplicate values
- when the ordering is different.
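Two of the helpers from this section are easier to grasp side by side: setequal() vs identical(), and sub() vs gsub(). The values below are made up for the illustration.

```r
a <- c(1, 2, 3)
b <- c(3, 2, 1, 1)

unique(b)                  # 3 2 1
setequal(a, b)             # TRUE: same set of values, order and duplicates ignored
identical(a, b)            # FALSE: different order and length
identical(a, c(1, 2, 3))   # TRUE: exactly the same values in the same order

s <- "banana"
sub("a", "o", s)    # "bonana", only the first match is replaced
gsub("a", "o", s)   # "bonono", every match is replaced
```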
VIII. Comparison
- recognized symbol: < , > , == , != , <= , >=
- can be used on scalars and vectors
- scalar –> 5 < 6 returns TRUE
- vector –>
c(4, 5, 6) > 5 returns FALSE FALSE TRUE
- We can also select only the elements for which the comparison is TRUE, by using the square bracket.
Example:
> my_score <- c(100, 50, 20, 85, 90)
> my_ok_score <- my_score > 70
> my_ok_score
[1]  TRUE FALSE FALSE  TRUE  TRUE
> my_ok_score_selected <- my_score[my_ok_score]
> my_ok_score_selected
[1] 100  85  90
We can also simplify the code above by rewriting it as follows:
> my_score <- c(100, 50, 20, 85, 90)
> my_score[my_score > 70]
[1] 100  85  90
The difference is that the comparison is now written directly inside the square brackets.
- When comparing 2 strings, the comparison follows alphabetical (dictionary) order: it starts from the 1st letter of each string and moves on to the next letters only when there is a tie.
- When comparing 2 booleans, TRUE is treated as 1 and FALSE as 0. Thus, TRUE > FALSE is always TRUE.
- When comparing 2 vectors, the comparison is conducted element-wise.
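The three rules above can be tried out with a few one-liners. This is only a sketch with made-up values; string ordering can depend on the locale, so the strings here were chosen to sort the same way in common locales:

```r
# strings: compared in dictionary order
s1 <- "apple" < "banana"     # TRUE, "a" comes before "b"
s2 <- "apple" < "apricot"    # TRUE, tie on "ap", then "p" comes before "r"

# booleans: TRUE counts as 1, FALSE as 0
b1 <- TRUE > FALSE           # TRUE
b2 <- TRUE + TRUE            # 2

# vectors: compared element by element
v <- c(1, 5, 3) >= c(2, 5, 1)   # FALSE TRUE TRUE
```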
IX. Matrix
Creating A Matrix
matrix(data, byrow = TRUE/FALSE, nrow = x, ncol = y, dimnames = NULL) .
Notes:
- data = the data that will be put into the matrix –> an optional data vector
- byrow = if it is TRUE, the matrix will be filled row-wise. Otherwise, it will be filled column-wise.
- nrow = the number of rows in the matrix
- ncol = the number of columns in the matrix
- dimnames = the names for rows and columns in the matrix.
- Format: dimnames = list(rownames_vector, colnames_vector)
- use NULL if the matrix does not have rownames or colnames.
- Example:
pokemon <- matrix(mydata, ncol = 6, byrow = TRUE, dimnames = list(NULL, c('Ants','Birds','Cats','Dogs','Elephants','Flamingos')))
Result:
> str(pokemon)
 num [1:24, 1:6] 45 60 80 80 39 58 78 78 78 44 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:6] "Ants" "Birds" "Cats" "Dogs" ...
> head(pokemon)
     Ants Birds Cats Dogs Elephants Flamingos
[1,]   45    49   49   65        65        45
[2,]   60    62   63   80        80        60
[3,]   80    82   83  100       100        80
[4,]   80   100  123  122       120        80
[5,]   39    52   43   60        50        65
[6,]   58    64   58   80        65        80
- dataframe vs matrix
- matrix is homogenous –> e.g: all numeric, etc
- dataframe is heterogenous
Example : my_matrix <- matrix(1:9, byrow=TRUE, nrow=3)
The data could be hardcoded or taken from another array.
Example:
income <- c(460.998, 314.400, 290.475, 247.900, 309.306, 165.800)
income_matrix <- matrix(income, nrow = 3, byrow = TRUE)
Output:
        [,1]  [,2]
[1,] 460.998 314.4
[2,] 290.475 247.9
[3,] 309.306 165.8
byrow defines how the matrix will be filled. If the matrix above is created using byrow=FALSE , then the output will be:
        [,1]    [,2]
[1,] 460.998 247.900
[2,] 314.400 309.306
[3,] 290.475 165.800
Naming the columns and rows of a matrix:
a) naming the columns –>
colnames(matrix_name) <- col_names_vector
b) naming the rows –> rownames(matrix_name) <- row_names_vector
Example:
colnames(income_matrix) <- c("Male", "Female")
rownames(income_matrix) <- c("Manager", "Supervisor", "Staff")
income_matrix
Output:
              Male Female
Manager    460.998  314.4
Supervisor 290.475  247.9
Staff      309.306  165.8
We can also name the rows and columns at once using the matrix parameter
dimnames .
Example:
income <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
income_matrix <- matrix(income, nrow = 3, byrow = TRUE, dimnames = list(c("Manager", "Supervisor", "Staff"), c("Male", "Female")))
Output:
              Male Female
Manager    460.998  314.4
Supervisor 290.475  247.9
Staff      309.306  165.8
Counting the number of rows and columns in a matrix
nrow(matrix_name) and ncol(matrix_name)
Calculating the total of each row
use
rowSums(matrix_name) .
Example:
> total_row_income <- rowSums(income_matrix)
> total_row_income
   Manager Supervisor      Staff
   775.398    538.375    475.106
Calculating the total of each column
use
colSums(matrix_name) .
Example:
> total_col_income <- colSums(income_matrix)
> total_col_income
    Male   Female
1060.779  728.100
Calculating the cumulative sum of the values in a vector (for example, a column of a dataframe).
use:
cumsum(the_vector) .
Example:
> head(DF)
   Under-weight Healthy-weight Over-weight Obese
18           30            254          80    52
19           22            248          76    45
20           14            191          68    43
21           15            168          70    45
22           13            145          56    44
23           15            142          59    58
Calculating the cumsum (here DF$groupSum holds the total of each row, e.g. 30 + 254 + 80 + 52 = 416 for the first row):
> DF$xmax <- cumsum(DF$groupSum)
> DF$xmax
 [1]   416   807  1123  1421  1679  1953  2228  2488  2777  3070  3354  3670
[13]  4029  4403  4763  5134  5532  5953  6400  6909  7433  8046  8635  9245
[25]  9851 10474 11142 11879 12581 13312 14154 15028 15885 16762 17706 18636
[37] 19559 20536 21530 22506 23473 24410 25404 26419 27478 28480 29304 30208
[49] 31058 31895 32658 33382 34083 34804 35499 36138 36783 37383 37984 38569
[61] 39131 39734 40309 40819 41292 41787 42167
Adding new column(s) from other matrices/vectors
use cbind(matrix1, matrix2, vector1, ....)
Example:
> big_matrix <- cbind(income_matrix, total_row_income)
> big_matrix
              Male Female total_row_income
Manager    460.998  314.4          775.398
Supervisor 290.475  247.9          538.375
Staff      309.306  165.8          475.106
Adding new row(s) from other matrices/vectors
use
rbind(matrix1, matrix2, vector1, ...)
Example:
> bigger_matrix <- rbind(income_matrix1, income_matrix2)
> bigger_matrix
                    US non-US
Manager          461.0  314.4
Supervisor       290.5  247.9
Staff            309.3  165.8
Security         474.5  552.5
Cleaning Service 310.7  338.7
Secretary        380.3  468.5
Get the names of the rows in a dataframe
Use:
row.names(df) .
Example:
> row.names(mtcars)
 [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"
 [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"
 [7] "Duster 360"          "Merc 240D"           "Merc 230"
[10] "Merc 280"            "Merc 280C"           "Merc 450SE"
[13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood"
[16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"
[19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"
[22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"
[25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"
[28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"
[31] "Maserati Bora"       "Volvo 142E"
The Use of Square Bracket [ ]
There are 3 different uses of square brackets:
- Selection –> positive [ ]
my_matrix[1,2] –> selects the element at the first row and second column
my_matrix[1:3,2:4] –> selects the data in rows 1, 2, 3 and columns 2, 3, 4.
my_matrix[,1] –> selects all elements of the first column.
my_matrix[1,] –> selects only the 1st row and all columns.
- Selecting the opposite –> positive [ ] and exclamation mark
my_matrix[!is_ok] –> selects all the elements in my_matrix that do not fit the criteria in is_ok .
- Omission/Removal –> negative [ ]
my_df[-(1:5), ] # Omit first 5 rows of my_df
my_df[, -4] # Omit fourth column of my_df
Note: The row names are not counted as a column of the matrix. Similarly, the column names are not counted as a row.
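To see selection, negation, and omission side by side, here is a small sketch on a made-up 3 x 4 matrix (the name m and its values are only for illustration):

```r
m <- matrix(1:12, nrow = 3, byrow = TRUE)
# m looks like this:
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    3    4
# [2,]    5    6    7    8
# [3,]    9   10   11   12

one_cell  <- m[2, 3]      # 7: row 2, column 3
first_col <- m[, 1]       # 1 5 9: all rows of column 1
no_row_1  <- m[-1, ]      # omits the first row, keeps rows 2 and 3

is_big    <- m > 6        # logical matrix of the same shape as m
big       <- m[is_big]    # 9 10 7 11 8 12 (read column by column)
not_big   <- m[!is_big]   # 1 5 2 6 3 4
```

Note that selecting with a logical matrix returns a plain vector, read column by column.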
Select the first observations –>
head(..., n = 1)
n = the number of observations we want to obtain. It can be 1, 2, 3, etc.
select the last observations –> tail(..., n = 1)
Math Operation
- all math operation such as ‘+’, ‘-‘, ‘*’, ‘/’ will be conducted element-wise
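A quick sketch of what element-wise means, using two made-up 2 x 2 matrices. The last line uses %*%, which is R's true matrix multiplication and is shown only for contrast:

```r
a <- matrix(1:4, nrow = 2)                 # columns: (1, 2) and (3, 4)
b <- matrix(c(10, 20, 30, 40), nrow = 2)   # columns: (10, 20) and (30, 40)

added   <- a + b    # each element of a plus the matching element of b
scaled  <- a * 2    # every element multiplied by 2
product <- a * b    # element-wise product, NOT matrix multiplication

mat_mul <- a %*% b  # matrix multiplication (rows times columns)
```

Here product is (10, 40, 90, 160) read column by column, while mat_mul is (70, 100, 150, 220).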
Looping
Looping over a matrix can be done by iterating over the rows first, then, on each row, iterating over each of its columns. In this case, each column holds a single element.
Example:
> ttt
     [,1] [,2] [,3]
[1,] "O"  NA   "X"
[2,] NA   "O"  "O"
[3,] "X"  NA   "X"
>
> # define the double for loop
> for (i in 1:nrow(ttt)) {
    for (j in 1:ncol(ttt)) {
      print(paste("On row", i, "and column", j, "the board contains", ttt[i, j]))
    }
  }
[1] "On row 1 and column 1 the board contains O"
[1] "On row 1 and column 2 the board contains NA"
[1] "On row 1 and column 3 the board contains X"
[1] "On row 2 and column 1 the board contains NA"
[1] "On row 2 and column 2 the board contains O"
[1] "On row 2 and column 3 the board contains O"
[1] "On row 3 and column 1 the board contains X"
[1] "On row 3 and column 2 the board contains NA"
[1] "On row 3 and column 3 the board contains X"
X. Data Frame
Data frame vs Matrix
- A data frame is heterogeneous, while a matrix is homogeneous.
All columns in a matrix must have the same data type (numeric, character, etc.) and the same length.
In a data frame, different columns can have different modes (numeric, character, factor, etc.), just like a table in a database or an Excel sheet.
- A data frame always has row and column names, while in a matrix they are optional.
Creating a Data Frame
A data frame is basically a set of vectors. Thus, we can create a dataframe by combining several vectors using the function
data.frame(vector1, vector2, ...., vectorN) or
df = data.frame()
Example:
emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3,515.2,611.0,729.0,843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
    "2015-03-27")),
  stringsAsFactors = FALSE
)
Output:
  emp_id emp_name salary start_date
1      1     Rick 623.30 2012-01-01
2      2      Dan 515.20 2013-09-23
3      3 Michelle 611.00 2014-11-15
4      4     Ryan 729.00 2014-05-11
5      5     Gary 843.25 2015-03-27
Exploratory Functions
- str –> str(df)
- summary –> summary(df)
- glimpse –> glimpse(df) –> only available with the dplyr package.
- pairs –> pairs(df) –> returns a plot matrix, consisting of scatterplots for each variable combination of a data frame.
Example: pairs(iris)
Get the overview of Data Frame
Use the function str(), which (I think) stands for structure.
Example:
> str(mtcars)
'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
Get The summary Statistics of All Variables in The Data Frame or Vector
summary()
For numeric data, the summary consists of min, Q1, median, mean, Q3, and max.
Example 1:
> asdf <- c(1, 2, 3, 4, 5, 6)
> summary(asdf)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    2.25    3.50    3.50    4.75    6.00
Example 2:
For the hotdogs data that consists of 3 variables: meat types, sodium, and calorie levels, the summary gives the following result.
> # summary() shows the summary of each variable (column) in the txt file
> print(summary(hotdogs))
       V1           V2              V3
 Beef   :20   Min.   : 86.0   Min.   :144.0
 Meat   :17   1st Qu.:132.0   1st Qu.:362.5
 Poultry:17   Median :145.0   Median :405.0
              Mean   :145.4   Mean   :424.8
              3rd Qu.:172.8   3rd Qu.:503.5
              Max.   :195.0   Max.   :645.0
Aggregate a Data Frame.
aggregate() splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.
Let say we have a dataframe x with columns: col1, col2, and col3. Then we can do various aggregation as follows.
Usage 1: aggregate(x, by = list(x$col2), FUN) –> aggregate each column in dataframe x by col2, with function FUN.
Usage 2: aggregate(x$col1, by = list(x$col2), FUN) –> aggregate column col1 only grouped by col2, with function FUN.
Usage 3: aggregate(col1 ~ col2, data = x, FUN) –> aggregate column col1 only grouped by col2 with function FUN.
Notes:
- x = the data we want to aggregate. It could be a dataframe itself or a column in the dataframe.
- by = the variable(s) to group the data by. In usages 1 and 2, it must be given as a list. In usage 3, the grouping variable goes on the right-hand side of the formula instead.
- FUN = the function used to summarize each subset.
Example 1:
Let’s say we have the dataframe signs as follows:
> str(signs)
'data.frame':	146 obs. of  49 variables:
 $ sign_type: chr  "pedestrian" "pedestrian" "pedestrian" "pedestrian" ...
 $ r1       : int  155 142 57 22 169 75 136 149 13 123 ...
 $ g1       : int  228 217 54 35 179 67 149 225 34 124 ...
 $ b1       : int  251 242 50 41 170 60 157 241 28 107 ...
 $ r2       : int  135 166 187 171 231 131 200 34 5 83 ...
 $ g2       : int  188 204 201 178 254 89 203 45 21 61 ...
 $ b2       : int  101 44 68 26 27 53 107 1 11 26 ...
From the dataframe signs, find the average of each column, grouped by the sign_type column.
aggregate(signs, by = list(signs$sign_type), mean) .
     Group.1 sign_type        r1        g1       b1        r2        g2        b2
1 pedestrian        NA 100.76087 116.50000 110.9565  99.28261 114.73913  50.97826
2      speed        NA  86.71429  97.08163  94.2449 124.55102 140.55102 138.53061
3       stop        NA 114.56863 130.54902 128.0000 108.33333  41.21569  41.03922
The sign_type column shows NA because mean() cannot average character values.
Example 2:
Let’s say we want to aggregate column r2 only. We can do so with 2 different usages, which return the same values.
aggregate(signs$r2, list(signs$sign_type), mean) or aggregate(r2 ~ sign_type, data=signs, mean) .
The outputs differ only in the names of the resulting columns:
Output of example 2, usage 1
> aggregate(signs$r2, list(signs$sign_type), mean)
     Group.1         x
1 pedestrian  99.28261
2      speed 124.55102
3       stop 108.33333
Output of example 2, usage 2
> aggregate(r2 ~ sign_type, data = signs, mean)
   sign_type        r2
1 pedestrian  99.28261
2      speed 124.55102
3       stop 108.33333
Selecting elements of Data Frame
Similar to that of matrix –> use square bracket
[] .
In addition, we can also select columns using its names instead of its index number.
Example:
> # Select first 5 values of diameter column
> planets_df_5 <- planets_df[1:5, 'diameter']
> planets_df_5
[1]  0.382  0.949  1.000  0.532 11.209
Since the diameter column is the 3rd column in the planets_df dataframe, planets_df_5 <- planets_df[1:5, 'diameter'] can also be rewritten as planets_df_5 <- planets_df[1:5, 3] .
Selecting All Rows of 1 Column Only
Use the sign ‘$’.
Example:
# Select the rings variable from planets_df
rings_vector <- planets_df$rings
Select Rows that Fulfil Certain Condition.
Method 1:
Use the condition as the parameter for the row.
Example:
small <- planets_df$diameter < 1 # this returns a logical vector
planets_df_small <- planets_df[small, ]
The code above selects all rows for which small is TRUE.
Method 2:
Another way to select by condition is using
subset(df, subset=condition) , which is simpler and more understandable.
Example:
subset(where9am, daytype == 'weekday' & location == 'office')
Note that for an element-wise logical operation like this one, we use the ampersand symbol only once (& , not && ).
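Both methods can be compared on the built-in mtcars dataset. This is only a sketch; the threshold wt > 5 and the variable names are arbitrary choices for illustration:

```r
# Method 1: a logical vector used as the row index
heavy <- mtcars$wt > 5
heavy_by_index <- mtcars[heavy, ]

# Method 2: subset() with the same condition
heavy_by_subset <- subset(mtcars, wt > 5)

# Both methods pick the same rows
rownames(heavy_by_index)
rownames(heavy_by_subset)

# Two conditions combined with a single & (element-wise AND)
manual_4cyl <- subset(mtcars, cyl == 4 & am == 1)
nrow(manual_4cyl)
```

Inside subset() the column names can be written directly, without the mtcars$ prefix, which is what makes it shorter to read.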
Selecting attribute of an object.
attr(x, which)
Notes:
- x = an object whose attributes are to be accessed.
- which = a non-empty character string specifying which attribute is to be accessed.
Example:
Choose the attribute prob from the object sign_pred (the attribute name must be given as a character string).
attr(sign_pred, "prob") .
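A self-contained sketch of attr(). The vector x and the attribute name "units" are made up for illustration; they are not part of the sign_pred example above:

```r
x <- 1:3

# attach an attribute to x, then read it back
attr(x, "units") <- "cm"
units_of_x <- attr(x, "units")      # "cm"

# asking for an attribute that does not exist returns NULL
missing_attr <- attr(x, "colour")   # NULL

# attributes() lists everything that is attached to the object
attributes(x)
```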
XI. List
Created using
list(comp1, comp2, ....)
comp = a component, which can be of any object type: matrix, vector, df, etc.
Example:
> my_vector <- 1:10 #comp 1
> my_matrix <- matrix(1:9, ncol = 3) #comp 2
> my_df <- mtcars[1:10,] # comp 3
> # Construct list with these different elements:
> my_list <- list(my_vector, my_matrix, my_df)
>
> # output will be a combination of all the above components, displayed as separate components.
> my_list
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

[[3]]
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Selecting Elements of a List
Since a list is a combination of several components, then selecting an element in a List is conducted in 2 steps:
1. selecting the component, using one of these 3 ways.
a) double square bracket with the index number inside –> my_list[[2]]
b) double square bracket with the component name inside –> my_list[["comp1"]]
c) using the $ sign –> my_list$comp1
2. selecting the particular element itself.
Example:
> # selecting the 2nd component (actor) and 2nd element of the component
> shining_list[[2]][2]
[1] "Shelley Duvall"
>
> # the code above will return the same result as:
> # shining_list[['actor']][2] or
> # shining_list$actor[2]
Adding New Component to a List
Similar to that of a vector. Use the function
c()
Example:
ext_list <- c(my_list, my_name = my_val)
Converting a List into a Vector
- use unlist(the_list)
- returns a single vector that contains the values of all components in the list. The component names are carried over as element names, unless use.names = FALSE is set.
- Example:
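A minimal sketch of unlist(), using a made-up list with two named components:

```r
my_list <- list(a = 1:3, b = c(10, 20))

flat <- unlist(my_list)
flat
# the result is one numeric vector holding 1 2 3 10 20,
# with element names a1 a2 a3 b1 b2 built from the component names

# drop the names if only the values are needed
flat_values <- unlist(my_list, use.names = FALSE)
```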
Getting the name attributes of a list
- using function
names(list) .
Example:
> y <- list(a = 1, b = "c", c = 1:3)
> y
$a
[1] 1

$b
[1] "c"

$c
[1] 1 2 3

> names(y)
[1] "a" "b" "c"
- If we pass an object whose elements have no names, names() will return NULL.
Example:
> z <- 1:3
> z
[1] 1 2 3
> names(z)
NULL
XII. Factor
factor(the_vector, levels, labels = levels, exclude = NA, ordered = is.ordered(x), nmax = NA) –> used to encode a vector as a categorical data or enumerated type (or ‘factor’ in R terms).
Notes:
- the_vector = the vector that will be turned into categorical data
- levels = a vector of the unique values in the_vector. It also fixes the order of the categories.
- labels = optional names for the categories, given in the same order as levels. labels can be used for renaming the categories.
Let’s say we have data of students based on gender which is labeled by ‘F’ and ‘M’. To make these labels meaningful, we can use labels to rename the categories. So, we’ll write something like this: levels = c('F', 'M'), labels = c('Female', 'Male')
- ordered = logical flag to determine if the levels should be regarded as ordered (in the order given). By setting ordered = TRUE , we indicate that the factor is ordered based on the given levels. Otherwise, it will just be an unordered factor.
Example:
> # no other parameters are defined. creating unordered factor.
> temperature_vector <- c("High", "Low", "High","Low", "Medium")
> factor_temperature_vector <- factor(temperature_vector)
> factor_temperature_vector
[1] High   Low    High   Low    Medium
Levels: High Low Medium
> # levels is defined and ordered = FALSE. This also creates an unordered factor.
> factor_temperature_vector <- factor(temperature_vector, ordered = FALSE, levels = c("Low", "Medium", "High"))
> factor_temperature_vector
[1] High   Low    High   Low    Medium
Levels: Low Medium High
> # levels is defined and ordered = TRUE
> factor_temperature_vector <- factor(temperature_vector, ordered = TRUE, levels = c("Low", "Medium", "High"))
> factor_temperature_vector
[1] High   Low    High   Low    Medium
Levels: Low < Medium < High
We can also define the levels later as in this following example:
> # Build factor_survey_vector with clean levels
> survey_vector <- c("M", "F", "F", "M", "M")
> factor_survey_vector <- factor(survey_vector)
>
> # here we define the levels
> levels(factor_survey_vector) <- c("Female", "Male")
> factor_survey_vector
[1] Male   Female Female Male   Male
Levels: Female Male
The function ‘summary()’
summary(the_data_object) is used to provide the summary of the given data object.
In a character vector,
summary() only shows the length and the type of the data inside the object.
In a factor,
summary() shows the quantity of each category.
Example:
> survey_vector <- c("M", "F", "F", "M", "M")
> factor_survey_vector <- factor(survey_vector)
> levels(factor_survey_vector) <- c("Female", "Male")
>
> survey_vector_sum <- summary(survey_vector)
> survey_vector_sum
   Length     Class      Mode
        5 character character
>
> factor_survey_vector_sum <- summary(factor_survey_vector)
> factor_survey_vector_sum
Female   Male
     2      3
From the example above, we can see that using summary() on a factor will give us the information of the number of elements in each category.
XIII. Miscellaneous
- custom-styled report with RMarkdown and CSS – all of that within the powerful tidyverse.
- R is case sensitive.
- to get help on a certain function –> use the question mark symbol ‘?thing_to_ask’ or help(thing_to_ask)
- Example:
?matrix
Typing the function name without parentheses prints its source code instead:
> matrix
function (data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
{
    if (is.object(data) || !is.atomic(data))
        data <- as.vector(data)
    .Internal(matrix(data, nrow, ncol, byrow, dimnames, missing(nrow),
        missing(ncol)))
}
<bytecode: 0x38d6730>
<environment: namespace:base>
- documentation is available here: www.rdocumentation.org.
Cheatsheet for General Must-Known R Function
-
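%in% –> checks whether a value exists in a vector (for example, in the column names of a dataframe)
- use ! to find the ‘not’
- Example:
# if x exists in names(df)
if ('x' %in% names(df)) {
  # code to run when the column exists
}
# if x does not exist in names(df)
if (!'x' %in% names(df)) {
  # code to run when the column is missing
}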
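The checks above can be tried on a small made-up data frame (df and its columns are only for illustration):

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

has_x  <- "x" %in% names(df)       # TRUE: there is a column called x
lacks_z <- !("z" %in% names(df))   # TRUE: there is no column called z

# %in% is vectorised: one answer per value on the left-hand side
found <- c(1, 5) %in% df$x         # TRUE FALSE
```

The extra parentheses in !("z" %in% names(df)) are optional, but they make the intent easier to read.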
-
any(iterables)
- Given a set of logical vectors, is at least one of the values true?
- Example:
# checking if there is at least 1 NA in 'social_df'
any(is.na(social_df))
-
append(x, values, after = length(x)) –> inserts ‘values’ into ‘x’ after the position given in ‘after’ (by default, at the end).
Example:
> a <- c(1, 2, 3)
> b <- c("a", "b", "c")
> append(a, b)
[1] "1" "2" "3" "a" "b" "c"
> append(a, b, after = 2)
[1] "1" "2" "a" "b" "c" "3"
-
args(function_name) –> Displays the argument names and corresponding default values of a function or primitive.
Example:
> args(mean)
function (x, ...)
NULL
> args(sd)
function (x, na.rm = FALSE)
NULL
-
as.new_class(data_to_convert) : Convert an R object from one class to another class.
Example 1:
as.character()
as.Date()
as.factor()
as.integer()
as.logical()
as.numeric()
as.vector()
Example 2:
> linkedin
[[1]]
[1] 16

[[2]]
[1] 9

[[3]]
[1] 13

[[4]]
[1] 5

[[5]]
[1] 2

[[6]]
[1] 17

[[7]]
[1] 14

> li_vec <- sapply(linkedin, as.vector)
> li_vec
[1] 16  9 13  5  2 17 14
- class(object) –> what class an object belongs to.
- diff(x) –> finds the differences between consecutive elements in vector x.
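A short sketch of diff() on a made-up vector. The lag argument is part of base R's diff() and is shown here only as an extra:

```r
x <- c(2, 5, 9, 14)

step1 <- diff(x)            # 3 4 5: each element minus the one before it
step2 <- diff(x, lag = 2)   # 7 9: each element minus the one two places before it
```

The result is always shorter than the input, because the first element(s) have nothing to be compared with.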
-
dist(x, method='euclidean') –> find the distance between each point in the matrix.
- Notes:
- x = a numeric matrix, data frame or “dist” object.
- method = the distance measure to be used. This must be one of “euclidean“, “maximum“, “manhattan“, “canberra“, “binary” or “minkowski“.
- Example:
> three_players
   x  y
1  5  4
2 15 10
3  0 20
> dist(three_players)
         1        2
2 11.66190
3 16.76305 18.02776
-
file.path(subpath1, subpath2, ....) –> Construct the path to a file
Example:
> path <- file.path("wordbank", "edequality.dta")
> print(path)
[1] "wordbank/edequality.dta"
- GET(url) –> fetch the URL –> like cURL (GET() comes from the httr package).
-
grep(pattern, x) –> find which elements of x contain the pattern
Example:
# check to see if your name is included
grep("Bob", unique(babynames$name))
# looks like bob is in there
- identical(x, y) –> test two objects for being exactly equal. Returns TRUE or FALSE
-
is.*() : Check whether an R object is of a certain class or kind. Returns TRUE or FALSE.
Example:
is.na(iterables) –> checks, element by element, whether a value is missing.
-
list.dirs()
–> see the list of directories in the current working directory.
-
list.files()
–> see the list of files (including directories) in the current working directory
-
list.files("dir_name")
–> see the list of files inside the directory “dir_name”.
–> example:
> # get the list of dirs in the current working directory.
> # one of the directories is "data"
> list.dirs()
[1] "."      "./.aws" "./data"
>
> # list.files() also results in the list of directories.
> list.files()
[1] "data"
>
> # shows the list of files inside the dir "data"
> list.files("data")
[1] "hotdogs.txt"
- ls() –> a function to see the list of data, values, and functions in the global environment.
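- na.omit(iterables) –> removes the missing values (NA, which stands for "not available") from a vector, list, etc.
- na.rm –> an argument of many functions such as mean() and sum(). If it’s TRUE , the missing values are removed before the calculation.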
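A small sketch of both ways of handling missing values, on a made-up vector:

```r
x <- c(4, NA, 6, NA, 8)

mean_with_na <- mean(x)                 # NA: one missing value spoils the result
mean_clean   <- mean(x, na.rm = TRUE)   # 6: NAs are dropped before averaging

n_missing <- sum(is.na(x))              # 2: how many values are missing

clean <- na.omit(x)                     # the vector without the NAs
clean_values <- as.numeric(clean)       # 4 6 8 as a plain numeric vector
```

na.omit() also records which positions were dropped as an attribute, which is why as.numeric() is used here to get a plain vector back.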
-
range(object, na.rm=FALSE)
- returns a vector that consists of 2 elements: min and max.
- Example:
> x <- c('a', 'b', 'c', 'd', 'e')
> y <- seq(11, 15)
> range(x)
[1] "a" "e"
> range(y)
[1] 11 15
-
rep(x, times = n) –> replicate x n times. x can be a single number, vector, or list.
rep(x, each = n) –> replicate x n times, but do it element-wise.
See the difference between the 2 variants of rep in the example below.
Example:
> myseq <- seq(8, 2, by = -2)
> rep(myseq, times = 3)
 [1] 8 6 4 2 8 6 4 2 8 6 4 2
> rep(myseq, each = 3)
 [1] 8 8 8 6 6 6 4 4 4 2 2 2
-
rev(x) –> reverse the vector/list ‘x’
Example:
> x
[1] 1 2 3 4 5 5 4 3
> rev(x)
[1] 3 4 5 5 4 3 2 1
- scale(x)
standardize the data –> mean = 0, sd = 1.
-
seq(from, to, by = x) or
seq(from, to, length.out = n)
–> create a sequence from ‘from’ to ‘to’, incremented by x (or you can use -x if you want a decrement), or made of n evenly spaced elements.
Example:
> seq(from=3, to=17, by=2)
[1]  3  5  7  9 11 13 15 17
Notes:
1. from = starting point
2. to = end point
3. by = the increment between the elements
4. length.out = the desired length of the sequence. It will be rounded up if it’s a fraction.
5. from, to, by and length.out cannot all be given at once. Leave at least one of them out.
- sort(x, decreasing = FALSE) –> returns x sorted in ascending order (or in descending order if decreasing = TRUE).
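A few more variants of seq() and sort(), as a sketch with made-up numbers:

```r
# n evenly spaced values between two end points
even <- seq(0, 1, length.out = 5)        # 0.00 0.25 0.50 0.75 1.00

# a decreasing sequence needs a negative 'by'
down <- seq(10, 1, by = -3)              # 10 7 4 1

# 'by' and 'length.out' together work when 'to' is left out
steps <- seq(2, by = 3, length.out = 4)  # 2 5 8 11

# sort() returns the sorted values themselves
desc <- sort(c(3, 1, 2), decreasing = TRUE)   # 3 2 1
```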
-
seq_along(object)
- generates a sequence along the index of the object passed to it, but handles the empty case much better.
- = seq_len(length(x))
- an alternative to 1:ncol(df)
- Example:
> a <- c(2, 4, 6, 8, 10)
> seq_along(a)
[1] 1 2 3 4 5
> b <- c(1, 5, 0, 6, 8, 7)
> seq_along(b)
[1] 1 2 3 4 5 6
> c <- c('a', 'l', 'i', 'f')
> c
[1] "a" "l" "i" "f"
> seq_along(c)
[1] 1 2 3 4
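- typeof() –> checks the internal storage type of certain data/vector, etc. It is related to class() but not always the same: for example, a factor has the type "integer" and the class "factor".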
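A sketch that puts typeof() and class() next to each other, using a made-up factor and the built-in mtcars data frame:

```r
f <- factor(c("low", "high", "low"))

type_f  <- typeof(f)        # "integer": factors are stored as integer codes
class_f <- class(f)         # "factor"

type_df  <- typeof(mtcars)  # "list": a data frame is stored as a list of columns
class_df <- class(mtcars)   # "data.frame"

type_num  <- typeof(3.14)   # "double"
class_num <- class(3.14)    # "numeric"
```

In short, typeof() describes how R stores the object, while class() describes how R functions treat it.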
- unique(something) –> remove duplicate elements. returns ‘the something’ with the duplicate elements removed.
- order(the_sorting_criteria) –> returns a vector of indexes that would put the elements in sorted order. It does not return the sorted elements themselves.
By default, sorting is ascending.
To reshuffle the vector elements, we can do vector selection using the index-ordered vector as the argument.
Example of sorting a vector:
> # Play around with the order function in the console
> nama <- c('e', 'r', 'i', 'k', 'a')
> ord <- order(nama)
> ord
[1] 5 1 3 4 2
>
> # reshuffle the vector into ordered vector
> nama_ord <- nama[ord]
> nama_ord
[1] "a" "e" "i" "k" "r"
The sorting criteria should be a vector. If you want to sort a vector, then you can directly pass the vector as the argument for the function order().
If you want to order a dataframe based on a certain column, you can select the column first using the $ sign, then store the ordered index in a variable.
Then, to reshuffle the dataframe, do the same thing with square brackets as for a vector, only that we use 2 arguments in the square bracket to represent rows and columns.
Example of sorting a dataframe:
> # Use order() to create positions
> positions <- order(planets_df$diameter)
> positions
[1] 1 4 2 3 8 7 6 5
>
> # Use positions to sort planets_df
> # 1st argument = the ordered index, 2nd argument = all columns
> planets_df_sorted <- planets_df[positions, ]
- Vector vs List
- selection
- vector –> single square bracket –> [ ]
- list –> double square bracket –> [[ ]] (a single square bracket on a list returns a smaller list)
- The general rule in selection
– x[index] –> example: x[2] –> selecting the 2nd element
– x[condition] –>
example 1: x[x < 5] –> selecting elements with values < 5
example 2:
# selecting emails that fulfil the hits condition
hits <- grep(pattern = "edu", emails)
emails[hits]
-
which(condition)
returns the index of the row(s) that match the condition.
Example: find the index number of the row whose country column == 'Brazil'.
> which(nested$country == 'Brazil')
[1] 7
Alternatively, we can also use
match(matching_criteria, source)
to return the index of the first element that matches the criteria.
Example:
> match('Brazil', nested$country)
[1] 7
-
rename(x, newcol1 = col1, newcol2 = col2, ...) .
- renaming column names in vector/df. rename() is not part of base R; it comes from an add-on package, and the exact argument form depends on which package is loaded.
Example:
x <- c(a=1, b=2)
rename(x,a="A",b="B")
- head() / tail() – see the head and the tail – also check out the corner function of the jaffelab package created by LIBD Rstats founding member E. Burke
- colnames() / rownames() – see and rename columns or row names
- colMeans() / rowMeans() / colSums() / rowSums() – get means and sums of columns and rows
- dim() and length() – determine the dimensions/size of a data set – need to use length() when evaluating a vector
- ncol() / nrow() – number of columns and rows
- str() – displays the structure of an object – this is very useful with complex data structures
- unique()/duplicated() – find unique and duplicated values
- order()/sort()– order and sort your data
- gsub() – replace values
- table() – build a frequency table of the counts at each combination of factor levels. Summarize your data in table format. Read also datasciencemadesimple.com to get a better understanding about table() function. With table() we can create:
- A frequency table.
Example:
> table(iris$Species)

    setosa versicolor  virginica
        50         50         50
- A frequency table with a condition. This will create a TRUE/FALSE table.
Example:
> table(iris$Petal.Length < 4)

FALSE  TRUE
   89    61
- 2-way cross table.
Example:
row names = mtcars$cyl, column names = mtcars$gear.
> table(mtcars$cyl, mtcars$gear)

     3  4  5
  4  1  8  2
  6  2  4  1
  8 12  0  2
- 3-way cross table
Example:
group name = mtcars$carb.
In each group, row names = mtcars$cyl, column names = mtcars$gear
> table(mtcars$cyl, mtcars$gear, mtcars$carb)
, ,  = 1

    3 4 5
  4 1 4 0
  6 2 0 0
  8 0 0 0

, ,  = 2

    3 4 5
  4 0 4 2
  6 0 0 0
  8 4 0 0

, ,  = 3

    3 4 5
  4 0 0 0
  6 0 0 0
  8 3 0 0

, ,  = 4

    3 4 5
  4 0 0 0
  6 0 4 0
  8 5 0 1

, ,  = 6

    3 4 5
  4 0 0 0
  6 0 0 1
  8 0 0 0

, ,  = 8

    3 4 5
  4 0 0 0
  6 0 0 0
  8 0 0 1
The example above shows that the column donated has 88,751 rows of value 0 and 4711 rows of value 1.
- t.test() – perform a t test
- cor.test() – perform a correlation test
- lm() – make a linear model
- summary() – if you use the lm() output – this will give you the results
- set.seed() – allows random permutations or random data to be the same every time you run your code.
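To round off the last few items, here is a sketch that combines set.seed(), lm(), and summary(). The seed value 42 and the model mpg ~ wt on the built-in mtcars data are arbitrary choices for illustration:

```r
# the same seed gives the same "random" numbers
set.seed(42)
first_draw <- rnorm(3)
set.seed(42)
second_draw <- rnorm(3)
same_draws <- identical(first_draw, second_draw)   # TRUE

# a simple linear model: fuel efficiency explained by weight
fit <- lm(mpg ~ wt, data = mtcars)
slope <- coef(fit)[["wt"]]   # negative: heavier cars tend to have lower mpg

# summary() on the lm() output prints coefficients, R-squared, and p-values
summary(fit)
```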