The Basic R

Hello..hello…I am falling deeply in love with R. That’s why in this post I’d still like to ramble and blabber about R. For an R newbie like me, it’s essential to know the basic data types in R, how to obtain data from various sources such as csv, xls, or R’s built-in datasets, what matrices and lists are, and which operators we can use to do operations in R. Without further ado, let’s dive into the very basics of R.

I. Data Type in R

R can handle numeric, text (character), and logical values.
Use the function class(var_name) to check the data type of a variable.

  1. Numerical: integer (int), double (dbl)
  2. Categorical: factor (fct)
  3. Vectors (one-dimensional array)  –> can hold numeric, character or logical values.
    Created using:

    1. c()    or
    2. using the vector() function   –>  vector(mode = "logical", length = 0)     –>  produces a vector of the given length and mode.
      Example:
      output <- vector("double", ncol(df))
      Output:
      > output
      [1] 0 0 0 0
  4. Matrices (two-dimensional array) –>  can hold numeric, character or logical values. The elements in a matrix all have the same data type.
    Created using matrix() .
  5. Data frames (two-dimensional objects) –> can hold numeric, character or logical values. Within a column, all elements have the same data type, but different columns can be of different data type.
    Created using data.frame()  .
  6. Table  –>  a contingency table of the counts at each combination of factor levels.
  7. List –> a collection of a variety of objects under one name. These objects can be matrices, vectors, data frames, even other lists, etc.
    A list is a super data type: you can store practically any piece of information in it. Created using  list() .
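
As a quick sketch of the types listed above, class() can be called on one small object of each kind. All variable names and values here are made up for illustration only.

```r
# Hypothetical sample values, chosen only for illustration
n  <- 3.14                               # double
i  <- 7L                                 # integer (note the L suffix)
s  <- "hello"                            # character
b  <- TRUE                               # logical
f  <- factor(c("low", "high", "low"))    # factor
m  <- matrix(1:6, nrow = 2)              # matrix
df <- data.frame(id = 1:2, name = c("a", "b"))
l  <- list(n, s, df)                     # a list can mix all of the above

class(n)    # "numeric"
class(i)    # "integer"
class(s)    # "character"
class(b)    # "logical"
class(f)    # "factor"
class(m)    # "matrix" "array" on R >= 4.0
class(df)   # "data.frame"
class(l)    # "list"
```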

II. Operators in R

  1. Relational (equal/unequal) :  ==  ,   !=  ,   >  ,   >=  ,  <  ,  <=
  2. Logical      –>  and (&)   ,   or (  |  )   ,   NOT (  !  ),    double and (&&)   ,    double OR (||)
    Notes:

    1. ‘&’ , ‘&&’, ‘|’, and ‘||’ behave differently in R, compared to other languages such as Java and C.
    2. &  and |
      1. compare every corresponding element of the two vectors –> perform an element-wise operation.
      2. produce a result with the length of the longer operand.
    3. &&  and  ||
      1. work on single (length-one) logical values and return a single logical value.
      2. In older versions of R, only the first element of each operand was examined and all other elements were silently ignored. Since R 4.3.0, passing an operand of length greater than one is an error.
      3. See the example from datamentor.io below to get a clearer understanding:
      4. I must say that the concept of &, &&, |, and || was quite confusing to me, and I found that the documentation did not really enlighten me. Luckily, I managed to gather information from various sources such as stackoverflow and csgillespie:
        1. && and || are “short-circuiting” operators. That means they will not evaluate the second operand if the first operand is enough to determine the value of the expression.
        2. For example, if the first operand to && is FALSE then there is no point in evaluating the second operand, since it can’t change the value of the expression (FALSE && TRUE and FALSE && FALSE are both FALSE). The same goes for || when the first operand is TRUE.
        3. && and || are very useful for flow control.
      5. Still confused? Then just follow this rule:
        For element-wise logical comparison, stick to “&” and “|” unless you know you need “&&”.
        Use && and || for flow control such as if..else and while(), or whenever you are sure what you are going to do with them.
    4. x < y,                       TRUE if x is less than y
    5. x <= y,                    TRUE if x is less than or equal to y
    6. x == y,                    TRUE if x equals y
    7. x != y,                     TRUE if x does not equal y
    8. x >= y,                    TRUE if x is greater than or equal to y
    9. x > y,                      TRUE if x is greater than y
    10. x %in% c(a, b, c),  TRUE if x is in the vector c(a, b, c)
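
The operators above can be tried out with a small sketch like the following. The vectors are hypothetical, and the && / || lines use single values on purpose, since recent R versions raise an error for longer operands.

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

x & y     # element-wise AND: TRUE FALSE FALSE
x | y     # element-wise OR:  TRUE TRUE TRUE
!x        # element-wise NOT: FALSE TRUE FALSE

# && and || expect single values and short-circuit:
# the right-hand side is never evaluated here, so stop() is not triggered.
FALSE && stop("not evaluated")   # FALSE
TRUE  || stop("not evaluated")   # TRUE

# Typical flow-control use with a length-one condition
v <- c(3, 8, 12)
if (length(v) > 0 && v[1] < 5) {
  message("first element is small")
}

5 %in% c(1, 5, 9)   # TRUE
```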

III. Obtaining Data

  1. hardcode it
    1. using 1-d array (vector)
      Two functions that we have to know when working with vectors:
      1) the combine function c() for creating the vector and
      2) the names function names() for naming the elements in the vector.

      1. Creating a 1-d array (vector)
        use the combine function –> c()
        Remember: ‘c’  stands for combine
        e.g:
        numeric_vector <- c(1, 10, 49)
        character_vector <- c("a", "b", "c")
        days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

        We can also name the vector elements using either of these 2 ways:
        1. directly naming it when creating the vector. So, it’s like creating an associative array.
        2. name it after creating the vector by using the function names(the_vector) <- the_vector_names .

        Example:

        We can also merge several vectors into 1 single vector. 
        e.g:
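
A minimal sketch of the two naming approaches and of merging vectors, using made-up numbers:

```r
# Way 1: name the elements while creating the vector
poker <- c(Monday = 140, Tuesday = -50, Wednesday = 20)

# Way 2: name the elements afterwards with names()
roulette <- c(-24, -50, 100)
names(roulette) <- c("Monday", "Tuesday", "Wednesday")

# Merge several vectors into one with c()
all_winnings <- c(poker, roulette)
length(all_winnings)   # 6
names(all_winnings)    # the element names are carried over
```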

      2. selecting certain element
        a) based on index number enclosed in square brackets
        index starts from 1.
        Example:
        my_vector[2]    –> selecting the 2nd element of a vector
        my_vector[c(2, 3, 4, 5)]      –> selecting the 2nd, 3rd, 4th, and 5th element of a vector.
        There is a more convenient way:
        my_vector[2:5]   –> unlike Python, the last index is included.
        b) based on the names of the elements
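
Both kinds of selection can be sketched on a small named vector (the names and values are hypothetical):

```r
my_vector <- c(a = 10, b = 20, c = 30, d = 40, e = 50)

my_vector[2]              # 2nd element (b = 20)
my_vector[c(2, 3, 4, 5)]  # 2nd to 5th elements
my_vector[2:5]            # same selection; index 5 is included
my_vector["c"]            # select by name
my_vector[c("a", "e")]    # several names at once
```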

      3. Sum or Counting all elements in a vector
        use: sum(vector_name)
        example:

      4. Length of The Vector
        length(vector_name)
      5. create categories out of a vector’s values.
        use  –>  factor(x = character(), levels, labels = levels, exclude = NA, ordered = is.ordered(x), nmax = NA)
        We can define factor() as a categoric vector.
        Example 1: Factoring

        Example 2: Refactoring and Relabeling:

        1. Converting a numeric-like factor from ‘factor’ type to ‘numeric’ type.
          1. Let’s say we have a dataframe ‘zf’ with 1 factor column that contains numeric-like values.
          2. unclass() it to get its integer codes.
          3. converting it directly to a numeric type returns its integer codes instead of its real values.
          4. meanwhile, converting it to a character type returns the string version of its values.
          5. Thus, to convert a factor with numeric-like values into numeric values, we have to do a double conversion: first to character, then to numeric.
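
The conversion steps above might look like this. The data frame zf and its score column are hypothetical stand-ins.

```r
# Hypothetical data frame with a factor column holding numeric-like values
zf <- data.frame(score = factor(c("10", "25", "10", "40")))

unclass(zf$score)                     # integer codes plus a levels attribute
as.numeric(zf$score)                  # 1 2 1 3  -> the codes, not the values
as.character(zf$score)                # "10" "25" "10" "40"
as.numeric(as.character(zf$score))    # 10 25 10 40 -> the real values
```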
      6. Adding a new element to a vector
        Just directly add it using the function c() .
        Example:
  2. Load it from the existing datasets that ship with R.
    1. By default, R (and therefore RStudio) already provides a number of datasets for us to use, such as iris, mtcars, Titanic, and AirPassengers.
    2. List of functions that we can use:
      1. data()  –> check the list of available datasets.
      2. data(dataset_name)  –> start using the dataset by loading it.
      3. str(dataset_name)  –> check the structure of the dataset.
      4. glimpse(dataset_name)  –> see a glimpse of the dataset (requires the dplyr package).
  3. import from other files.
    1. Flat Files
      1. using the built-in function scan(txtfile, skip=0)
        1. works on plain-text files such as .txt.
        2. returns a 1-d vector (not a dataframe).
        3. Notes:
          1. txtfile = the text file
          2. skip = the number of lines of the input file to skip before beginning to read data values.
        4. Example:
          x3 <- scan('x.txt')
        5. to get a matrix instead of a flat vector, we can use scan() inside matrix().
          Example:
          Let’s say we have a text file contains data as follows:

          Using scan() we get a vector as follows:

          To make it as matrix, we can use scan() inside matrix()
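
A self-contained sketch of scan() on its own and inside matrix(). The file is a temporary one written just for this example, and its contents are made up.

```r
# Write a small, hypothetical text file so the example is self-contained
txt <- tempfile(fileext = ".txt")
writeLines(c("1 2 3", "4 5 6"), txt)

v <- scan(txt)                                   # flat numeric vector: 1 2 3 4 5 6
m <- matrix(scan(txt), ncol = 3, byrow = TRUE)   # 2 x 3 matrix, one file line per row
m
```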
      2. using the read.xxx functions of the built-in utils package
        Result: dataframe

        1. Supporting format:
          1. flat file –> *.txt, *.csv
            1. read.csv(filename, stringsAsFactors = TRUE, header = TRUE, sep = ',')
              Notes:

              1. for comma-delimited files.
              2. filename includes the path  –> must be given as a quoted string.
              3. if stringsAsFactors = TRUE –> convert strings in the flat file to factors. This only makes sense if the strings you import represent categorical variables in R. If you set stringsAsFactors to FALSE (the default since R 4.0.0), the data frame columns corresponding to strings in your text file will be characters.
            2. read.delim(filename, stringsAsFactors = TRUE, sep = "\t", header = TRUE)
              Notes:

              1. for a tab-delimited file.
            3. read.table(filename, stringsAsFactors = TRUE, header = FALSE, sep = "", col.names = c("xxx", "yyy", "zzz"), colClasses = c('class1', 'class2', 'class3'))
              Notes:

              1. for other types of delimiters.
              2. It can be used for comma-delimited or tab-delimited files too, as long as we pass the proper sep argument.
              3. header = TRUE  means the 1st row contains the variable names.
              4. sep = separator. It’s "" by default, which means any whitespace. We can change it to suit our file, e.g. "/".
              5. col.names  –> a vector of optional names for the variables. The default is to use "V" followed by the column number.
              6. colClasses –> specify the column types/classes of the resulting data frame.
                –> If a column is set to "NULL" in the colClasses vector, this column will be skipped and will not be loaded into the data frame.
          2. databases –> PostgreSQL, MySQL
          3. web
          4. other statistical software –> SPSS, Stata
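
To make the read.table() arguments described above concrete, here is a small sketch. The file, its "/" separator, and the column names are all assumptions made up for the example.

```r
# Hypothetical "/"-separated file without a header row
f <- tempfile(fileext = ".txt")
writeLines(c("Jakarta/Indonesia/10.5/661", "Tokyo/Japan/13.9/2194"), f)

cities <- read.table(
  f,
  sep = "/",
  header = FALSE,
  col.names = c("city", "country", "population", "area"),
  colClasses = c("character", "character", "numeric", "NULL"),  # "NULL" skips the 4th column
  stringsAsFactors = FALSE
)
str(cities)   # 2 obs. of 3 variables; the area column was skipped
```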
      3. Using readr package by Hadley Wickham
        1. Result: a tibble
        2. can handle flat files that are on the internet.
        3. format: read_    instead of read.
        4. utils vs readr
          No   utils                readr
          1    uses a dot ‘.’       uses an underscore ‘_’
          2    output: dataframe    output: tibble
          3    read.table()         read_delim()
          4    read.csv()           read_csv()
          5    read.delim()         read_tsv()

        5. read_delim("filename", delim = "", col_names = TRUE, col_types = NULL, skip = 0, n_max = Inf)
          Notes:

          1. delim = sep
          2. col_names = TRUE or FALSE or a vector of column names. If it’s TRUE, the 1st row will be header.
          3. col_types = colClasses.
            Possible values:

            1. NULL  –> The column types will be guessed based on the first 1000 rows of the input
            2. a collection which is defined using col_something().
              In the read_delim()  function, it is enclosed in list().
              Example:

              Possible collection:

              1. col_double()
              2. col_character()
              3. col_integer()  –> meaning:  the column should be interpreted as an integer.
              4. col_factor(levels = c("x", "y", "z"))  –>  the column should be interpreted as a factor with the given levels
            3. a compact string representation where each character represents one column:
              c = character, i = integer, n = number, d = double, l = logical, D = date, T = date time, t = time, ? = guess, or _/- to skip the column.
              Example:
              read_delim("states2.txt", delim = "/", col_types = "ccdd")   –> col_types = character, character, double, double.
            4. skip  –> number of rows at the top of the file to skip before we begin importing
            5. n_max  –> max number of rows that will be imported.
        6. read_csv("filename", col_names = TRUE)
        7. read_tsv("filename", col_names = TRUE)   –> tab-separated value
      4. Using the package data.table by Matt Dowle and Arun Srinivasan
        1. use fread()  for reading the file.
          1. suitable for large datasets that read.csv() can’t handle well.
          2. Similar to read.table  but faster and more convenient.

            Notes:
            1. drop  –> Vector of column names or numbers to drop, keep the rest.
            2. select  –> Vector of column names or numbers to keep, drop the rest.
            Example:
    2. Working with Excel Files
      1. using package readxl by Hadley Wickham
        1. made specifically for reading Excel files.
        2. similar arguments to readr.
        3. 2 functions:
          1. excel_sheets("filename")  –> find out which sheets are available in the workbook.
            Example:
          2. read_excel(path, sheet = 1, col_names = TRUE, col_types = NULL, na = "", skip = 0)
            Notes:

            1. sheet ==> a sheet name or a sheet number; the numbering starts from 1
            2. read_excel()  cannot handle xls files that are on the internet (at least, not yet).
      2. using package gdata by Gregory R. Warnes
        1. read.xls(filename, sheet = 1)
        2. the first row of the original data will be imported as the variable names.
          Meaning: the row numbering after import != the row numbering before import.
        3. it skips empty rows
        4. it returns a dataframe.
        5. Always compare the data after import to get an understanding of the new rows order.
        6. can handle .xls files that are on the internet.
      3. Using package XLConnect

        1. Example:
        2. Other functions in XLConnect:
          1. getSheets(workbook)   –> list the sheets in an Excel workbook. The list is returned as a vector.
          2. createSheet(workbook, name = "sheet_name")  –>   add a new sheet to a workbook.
          3. writeWorksheet(workbook, data, sheet = "sheet_name")   –> populate a sheet with data
          4. saveWorkbook(workbook, filename)   –>  store a workbook into a new file.
          5. renameSheet()
        3. if you encounter problems during the XLConnect installation, try the following steps:
          1. install java
          2. install jdk
          3. associate the installed JDK with R
          4. Install rJava and rgdal
          5. Install package in RStudio
            install.packages("rJava")
        4. For working with excel files through R. Bridging between Excel and R  –> no need to import the files.
        5. makes it able to edit your Excel files from inside R
          1. create a new sheet.
          2. populate the sheet with data
          3. save the results in a new Excel file.
    3. Working with Database
      In this case, we do not download the whole data from the database to R, but rather load just the rows that we need.
      Steps:

      1. Create a connection to the database. Define the host, db, user, and pwd.
        Example for mysql:
      2. Reference a table in the DB  –> use function tbl(conn, table_name) .
        Example for mysql:

        This will be further discussed in another post.
    4. Download files from the internet
      1. use: download.file(url, destfile)
        1. this will save the downloaded file into our local directory
        2. it does not return the data itself, only (invisibly) a status code, 0 for success. Thus, the data cannot be assigned to a variable this way.
      2. load the downloaded file into our working environment
        load(the_downloaded_file)   –> restores the saved R objects into the environment; it only returns their names, invisibly.
      3. alternative way
        load(url(the_file_url))  –> but the file will not be downloaded into our local directory.
    5. Working with JSON
      1. use the library jsonlite
        library(jsonlite)
      2. use the function fromJSON(json_strings)  to convert JSON data into a nicely structured R list.
        It also works if you pass a URL as a character string or the path to a local file that contains JSON data.
        Example:

        Output:
      3. toJSON(x, pretty = FALSE)   –> convert an R object back to JSON, in the minified format by default.
        Notes:

        1. pretty = TRUE   –> prettify the JSON format
        2. other JSON-related functions:
          1. prettify()
          2. minify()
    6. Import from other Statistical software
      1. using haven package by Hadley Wickham
        1. read_sas()   –> for SAS
        2. read_stata()   and read_dta()   –> for Stata
        3. read_spss()

          summary of haven package with Statistical Software Packages. Image taken from datacamp
      2. using foreign package by R Core team
        1. can handle more statistical software: SPSS, Stata, Systat, and Weka.
        2. cannot import SAS data files (.sas7bdat), only SAS libraries (.xport):
        3. read stata

          Notes:

          1. convert.underscore  –> if TRUE, convert the ‘_’ to ‘.’
        4. read spss

          Notes:

          1. to.data.frame = TRUE   –> return a data frame
            to.data.frame = FALSE  –> return a list.

IV. Using Data

  • selecting a certain column  –> use the ‘$’ sign.
    Example:  Let’s say we have the data students and the column we want is ‘age’. Thus, we do  students$age .

V. Visualization

  1. using R built-in plot()
    1. plot(x, y)
    2. hist(x, breaks = 10, main = paste("Histogram of", xname), xlab = xx, ylab = yy)
      Notes:

      1. main = the graph title
  2. using ggplot2 library. Read more here.
    ggplot(data, aes)

VI. Mathematical Functions

a. Arithmetic

– Using operator:  +   -   *   /   %%
– Work in element-wise way.
– 2 types of operations:
a) matrix vs constant
b) matrix vs matrix
–   2 * my_matrix  –>   multiplies every element of my_matrix  by two.
my_matrix1 * my_matrix2  –>  creates a matrix where each element is the product of the corresponding elements in my_matrix1  and my_matrix2 .

 “+”  vs sum()

+  –> do element-wise summation
sum()  –> sum up all elements in the vector(s)
Example:
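
A small sketch of the difference, with made-up vectors and matrices:

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)

a + b          # element-wise: 11 22 33
sum(a, b)      # one total over all elements: 66
2 * a          # constant times vector: 2 4 6

m1 <- matrix(1:4, nrow = 2)
m2 <- matrix(5:8, nrow = 2)
m1 * m2        # element-wise product, not matrix multiplication
```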

b. Mathematic/Statistics

  1. abs()
  2. coef(modeling_function)
    Extract coefficients from a modeling function.
    Example:
  3. mean(vector_name, trim=0, na.rm=FALSE)
    Notes:
    – trim = the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed.
    – na.rm = if it’s TRUE, remove missing values (NA) before computing.
    Example:

    –> Output: 36.66667
  4. standard deviation  –>    sd(x, na.rm = FALSE)
  5. linear regression related
    1. lm(formula = y ~ x, data)
      1. To fit a linear model with a response variable y and explanatory variable x
      2. example:
        US_fit <- lm(percent_yes ~ year, data = US_by_year) .
        Output:

        We can also see more details of the lm()  output by using summary()  of the output.
      3. lm’s methods and properties:
        my_lm <- lm(y ~ x, data=my_data) .

        1. coef(my_lm)  –> extract the coefficient of the model.
        2. summary(my_lm)  –> show full output of the model.
      4. to make the model above looks tidy, use the package broom  and its function tidy(model) .
        Example:
        From the model US_fit  above, we can tidy it using broom  as follows:
    2. R squared  –>  summary(my_lm)$r.squared
      –> find the coefficient of determination (R squared). R squared is the proportion of the variance in the dependent variable that is predictable from the independent variables.
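
Since the US_by_year data is not available here, the following sketch uses the built-in mtcars dataset instead; the model mpg ~ wt is just an illustrative choice.

```r
# mtcars ships with R; mpg ~ wt is only an illustrative model
my_lm <- lm(mpg ~ wt, data = mtcars)

coef(my_lm)                  # intercept and slope
summary(my_lm)               # full model output
summary(my_lm)$r.squared     # coefficient of determination
```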
  6. max()  –> find the maximum value of a vector or data frame.
  7. min()  –> find the minimum value of a vector or data frame.
  8. pmax()  –> computes the parallel maxima of two or more input vectors.
  9. pmin()  –> computes the parallel minima of two or more input vectors.
    Example:
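
A short sketch contrasting max() with the parallel versions, on made-up vectors:

```r
a <- c(1, 8, 3)
b <- c(4, 2, 9)

max(a, b)     # 9      -> single largest value overall
pmax(a, b)    # 4 8 9  -> larger value at each position
pmin(a, b)    # 1 2 3  -> smaller value at each position
```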

     
  10. median()
  11. quantile(x, probs, na.rm=FALSE, names=TRUE) .
    Notes:

    1. x = the numeric vector whose quantiles we want
    2. probs = the quantile(s) we want to obtain, given as probabilities between 0 and 1. It can be a single number or a vector.
      Example: find the 5th percentile
      quantile(mtcars$mpg, probs = 0.05)
      Example: find the 5th and 95th percentile
      quantile(mtcars$mpg, probs = c(0.05, 0.95))
    3. If we do not specify probs, then by default quantile() returns a named vector of 5 elements: min (0%), q1 (25%), q2 (50%), q3 (75%), max (100%).
      Example:

      Thus, if we want to get q3, we can just simply use:
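
One possible way to pick q3 out of the default result, sketched on the built-in mtcars data:

```r
q <- quantile(mtcars$mpg)   # default probs: 0, 0.25, 0.5, 0.75, 1
names(q)                    # "0%" "25%" "50%" "75%" "100%"
q[["75%"]]                  # q3, picked by name
q[4]                        # q3 again, picked by position

quantile(mtcars$mpg, probs = c(0.05, 0.95))   # 5th and 95th percentile
```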
  12. round()
  13. sum() –> sum of all values in the data structure (vector, list, etc).
  14. which.max() –> find the index of the max element in a vector/list.
  15. which.min()  –> find the index of the min element in a vector/list
    Example:
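
For instance, with a hypothetical named vector of scores:

```r
scores <- c(ana = 70, budi = 95, cici = 60)

which.max(scores)          # index 2, named budi
which.min(scores)          # index 3, named cici
scores[which.max(scores)]  # the largest value itself
```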
  16. cor(a, b, use = "everything")
    1. Notes:
      1. a, b = the 2 variables to calculate the correlation of.
      2. use = a string which defines how to handle data in the presence of missing values.
        1. "everything" –> default  –> the resulting cor value will be NA whenever one of its contributing observations is NA.
        2. "all.obs" –> the presence of missing observations will produce an error.
        3. "complete.obs"  –> deletes all cases (rows) with missing values before calculating the correlation. If there are no complete cases, it returns an error. Complete cases = rows whose columns are all filled in, i.e. rows without any missing value.
        4. "na.or.complete"  –> same as "complete.obs" except that if there are no complete cases, it will return NA.
        5. "pairwise.complete.obs"  –> the correlation between each pair of variables is computed from the observations that are complete for that pair. Let’s say the 3rd value in a is present but the 3rd value in b is missing. Then the 3rd observation is dropped from the calculation for that pair.
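
The effect of the use argument can be sketched with two short made-up vectors, one of which has a missing value:

```r
a <- c(1, 2, 3, 4, 5)
b <- c(2, 4, NA, 8, 10)

cor(a, b)                                 # NA, because of the missing value
cor(a, b, use = "complete.obs")           # 1, the incomplete observation is dropped first
cor(a, b, use = "pairwise.complete.obs")  # 1 as well, there is only one pair here
```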
  17. chisq.test()
    performs chi-squared contingency table tests and goodness-of-fit tests.
    Format:

    Returns:
    a list of 9 elements: statistic, parameter, p.value, method, data.name, observed, expected, residuals, stdres.

    Example:
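
A sketch on the built-in Titanic data, collapsed to a Class by Survived contingency table. The choice of table is only for illustration.

```r
# Titanic ships with R; collapse it to a Class x Survived contingency table
tab <- apply(Titanic, c(1, 4), sum)
res <- chisq.test(tab)

names(res)      # the 9 elements listed above
res$statistic   # the chi-squared statistic
res$p.value
res$expected    # expected counts under independence
```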

VII. Text-Related Functions

  • paste(..., sep = " ", collapse = NULL)
    converts its arguments to character strings, and concatenates them (separating them by the string given by sep).
    Example:

    In the code above, paste concatenates "My age is" and age after converting age to a string.

  • paste0(...)
    similar to paste() except that the strings are concatenated without separator.
    Example:

    See the difference?

  • cat(R_objects)
    – It converts its arguments to character vectors, concatenates them to a single character vector, appends the given sep= string(s) to each element and then output them to a console or a file.
    – cat() will not return anything, it will just output to the console or another connection.
    paste()  vs  cat()
    –> paste() will return something, while cat() only output the result to the console.
    –> thus, output from cat() cannot be assigned to a variable, whilst output from paste() can.
    – Example:
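
The three functions can be compared side by side in a small sketch. The age value is made up.

```r
age <- 30   # hypothetical value

paste("My age is", age)                   # "My age is 30"
paste("My age is", age, sep = "-")        # "My age is-30"
paste0("My age is", age)                  # "My age is30" (no separator)
paste(c("a", "b", "c"), collapse = "+")   # "a+b+c"

msg <- paste("My age is", age)   # paste() returns a value we can store
out <- cat(msg, "\n")            # cat() prints to the console and returns NULL
is.null(out)                     # TRUE
```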

  • the length of chars –>  nchar(the_strings)
    Example:

  • replacement
    sub(pattern, replacement, x)  –> only replace the first match
    gsub(pattern, replacement, x)  –> replace all matches.
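
    A tiny sketch of the difference:

```r
x <- "banana"
sub("a", "o", x)    # "bonana"  -> only the first match
gsub("a", "o", x)   # "bonono"  -> every match
```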
  • split string based on a certain separator. 
    use:  strsplit(string, split=sep)
    Notes:
    – split = the separator used for separating the string.
    – strsplit() returns a list with one component per element of the input vector; each component is a character vector holding the pieces of that element.
    In example 1, the input is a single string. Thus, the resulting list consists of 1 component only.
    On the other hand, in example 2, the input is a vector of 4 strings. Thus, the resulting list comprises 4 components.
    – split = ""  –> split by character
    – split = " " –> split by word (separated by a whitespace).   –> see example 3.
    Single square brackets return the component still wrapped in its container (a list).
    To directly get the content of the component, we have to add [[1]]  right after splitting the string.
    See the illustration below taken from stackoverflow:

    Example 1:

    Can you see the difference? Have a look at line 5 in the code above.

    Example 2: resulting list has more than 1 component.
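
The behaviour described in the notes above can be sketched as follows, with made-up input strings:

```r
# A single string: the resulting list has 1 component
parts <- strsplit("a,b,c", split = ",")
parts          # list of length 1
parts[1]       # still a list
parts[[1]]     # character vector "a" "b" "c"

# A vector of 4 strings gives a list with 4 components
dates <- c("2020-01", "2021-02", "2022-03", "2023-04")
length(strsplit(dates, split = "-"))   # 4

strsplit("hi you", split = "")[[1]]    # single characters, the space included
strsplit("hi you", split = " ")[[1]]   # "hi" "you"
```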

  • Converting string to lowercase letters
    – use tolower()
    – example:

  • Converting string to uppercase letters
    – use toupper()  .
  • finding similarity and differences
    • unique(x)  –> unique() can also be used for numbers, etc. Not exclusive to strings.
    • setequal(a, b)  –> check if a and b contain the same set of values
    • identical(a, b)  –> check if a and b are exactly the same.
      The difference between setequal and identical shows up:

      • when there are duplicate values
      • when the ordering is different.
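
A minimal sketch of both cases at once, using made-up vectors:

```r
a <- c(1, 2, 3)
b <- c(3, 2, 1, 1)

setequal(a, b)             # TRUE  -> same set of values; order and duplicates ignored
identical(a, b)            # FALSE -> different length and order
identical(a, c(1, 2, 3))   # TRUE
unique(b)                  # 3 2 1
```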

VIII. Comparison

  • recognized symbol: <  ,  >  , ==  ,  !=   , <=   ,   >=
  • can be used on scalars and vectors
    • scalar –>  5 < 6    TRUE
    • vector  –>
      c(4, 5, 6) > 5                   Return:  FALSE FALSE TRUE

      1. We can also select only the elements whose values are TRUE, by using the square bracket.
        Example:

        We can also simplify the code above by rewriting it as follows:

        See the difference on line 3.
  • When comparing 2 strings, the comparison follows alphabetical (dictionary) order, character by character, starting from the 1st letter of each string.
  • When comparing 2 booleans, TRUE is treated as 1, while FALSE is treated as 0. Thus, TRUE > FALSE always holds.
  • When comparing 2 vectors, comparison will be conducted element-wise.
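
Selecting elements with a comparison, as described above, might look like this sketch:

```r
x <- c(4, 5, 6)
is_big <- x > 5     # FALSE FALSE TRUE
x[is_big]           # 6

x[x > 5]            # same result without the intermediate variable

"apple" < "banana"  # TRUE, alphabetical order
TRUE > FALSE        # TRUE, since TRUE counts as 1 and FALSE as 0
```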

IX. Matrix

Creating A Matrix

matrix(data, byrow = TRUE/FALSE, nrow = x, ncol = y, dimnames = NULL) .
Notes:

  • data = the data that will be put into the matrix –> an optional data vector
  • byrow = if it is TRUE, the matrix will be filled row-wise. Otherwise, it will be filled by column.
  • nrow = the number of rows in the matrix
  • ncol = the number of columns in the matrix
  • dimnames = the names for rows and columns in the matrix.
    • Format:  dimnames = list(rownames_vector, colnames_vector)
    • use NULL if the matrix does not have rownames or colnames.
    • Example:
      pokemon <- matrix(mydata, ncol = 6, byrow = TRUE, dimnames = list(NULL, c('Ants','Birds','Cats','Dogs','Elephants','Flamingos')))

      Result:

       

  • dataframe vs matrix
    • matrix is homogeneous –> e.g: all numeric, etc
    • dataframe is heterogeneous

Example :  my_matrix <- matrix(1:9, byrow=TRUE, nrow=3)

The data could be hardcoded or taken from another array.
Example:

Output:

byrow defines how the matrix will be filled. If the matrix above is created using byrow=FALSE , then the output will be:
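
The two fill orders can be compared directly on the 1:9 example:

```r
by_row <- matrix(1:9, byrow = TRUE, nrow = 3)
by_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
# [3,]    7    8    9

by_col <- matrix(1:9, byrow = FALSE, nrow = 3)
by_col
#      [,1] [,2] [,3]
# [1,]    1    4    7
# [2,]    2    5    8
# [3,]    3    6    9
```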

naming the columns and rows of a matrix:
a) naming the columns –>    colnames(matrix_name) <- col_names_vector
b) naming the rows      –>   rownames(matrix_name) <- row_names_vector

Example:

Output:

We can also name the rows and columns at once using the matrix parameter dimnames .
Example:

Output:
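
Both naming approaches can be sketched on a small made-up matrix; the names and values are hypothetical.

```r
sales <- matrix(c(10, 20, 30, 40), nrow = 2, byrow = TRUE)
colnames(sales) <- c("Q1", "Q2")
rownames(sales) <- c("north", "south")

# The same names in one step through dimnames
sales2 <- matrix(c(10, 20, 30, 40), nrow = 2, byrow = TRUE,
                 dimnames = list(c("north", "south"), c("Q1", "Q2")))

identical(sales, sales2)   # TRUE
sales["south", "Q1"]       # 30
```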

Count the number of rows and cols in a matrix

nrow(matrix_name)      and    ncol(matrix_name)


Calculating the total of each row

use rowSums(matrix_name) .
Example:

Calculating the total of each column

use colSums(matrix_name) .
Example:

Calculating the cumulative sum of the values in a vector.

use: cumsum(the_vector) .
Example:

Calculating the cumsum:

Adding new column(s) from other matrixes/vectors

use cbind(matrix1, matrix2, vector1, ....)

Example:

Adding new row(s) from other matrixes/vectors

use rbind(matrix1, matrix2, vector1, ....)

Example:
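
The totals and binding functions from the sections above can be sketched together on one small made-up matrix:

```r
m <- matrix(1:6, nrow = 2, byrow = TRUE)   # rows: 1 2 3 / 4 5 6

nrow(m); ncol(m)    # 2 and 3
rowSums(m)          # 6 15
colSums(m)          # 5 7 9
cumsum(c(1, 2, 3))  # 1 3 6

wider  <- cbind(m, c(7, 8))      # adds a 4th column
taller <- rbind(m, c(7, 8, 9))   # adds a 3rd row
dim(wider)    # 2 4
dim(taller)   # 3 3
```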

Get the names of the rows in  a dataframe

Use: row.names(df) .
Example:

The Use of Square Bracket [ ]

There are 3 different uses of the square bracket:

  1. Selection   –> positive [ ]
    my_matrix[1,2]          –>  selects the element at the first row and second column
    my_matrix[1:3,2:4]     –>  select data on the rows 1, 2, 3 and columns 2, 3, 4.
    my_matrix[,1]             –>  selects all elements of the first column.
    my_matrix[1,]            –>  selects only 1st row and all columns.
  2. Select the opposite  –> positive []  with a logical vector negated by an exclamation mark
    my_matrix[!is_ok]  –> select all the elements in my_matrix that do not fit the criteria in is_ok .
  3. Omission/Removal  –> negative [ ]
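
The three uses can be sketched on a hypothetical 3 x 4 matrix. The is_ok condition is made up for the example.

```r
my_matrix <- matrix(1:12, nrow = 3, byrow = TRUE)

my_matrix[1, 2]        # element at row 1, column 2 -> 2
my_matrix[1:2, 2:4]    # rows 1-2, columns 2-4
my_matrix[, 1]         # whole first column -> 1 5 9
my_matrix[1, ]         # whole first row -> 1 2 3 4

is_ok <- my_matrix > 6
my_matrix[!is_ok]      # elements that are NOT > 6 (returned as a vector)

my_matrix[-1, ]        # drop the first row
my_matrix[, -c(1, 2)]  # drop the first two columns
```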

Note:  The row names are not counted as a column of the matrix. Similarly, the column names are not counted as a row.

Select the first observations –>  head(..., n = 1)
n = the number of observations we want to obtain. It can be 1, 2, 3, etc.

select the last observations –> tail(..., n = 1)

Math Operation

  • all math operation such as ‘+’, ‘-‘, ‘*’, ‘/’ will be conducted element-wise

Looping

Looping over a matrix can be done by iterating over the rows first, then on each row, iterating over each of its columns. In this case, the columns are the elements themselves.
Example:
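
A nested-loop sketch of that idea. The visited vector is only there to record the order in which the elements are reached.

```r
m <- matrix(1:6, nrow = 2, byrow = TRUE)
visited <- c()

for (r in seq_len(nrow(m))) {        # outer loop: rows
  for (k in seq_len(ncol(m))) {      # inner loop: columns of that row
    cat("row", r, "col", k, "value", m[r, k], "\n")
    visited <- c(visited, m[r, k])
  }
}
visited   # 1 2 3 4 5 6, i.e. row by row
```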

X. Data Frame

Data frame vs Matrix

  1. dataframe is heterogeneous, while the matrix is homogeneous.
    All columns in a matrix must have the same data type (numeric, character, etc.) and the same length.
    In a data frame, different columns can have different modes (numeric, character, factor, etc.). Just like a table in a database or an excel sheet.
  2. A data frame always has row and column names, while in a matrix they are optional.

Creating a Data Frame

A data frame is basically a set of vectors of equal length. Thus, we can create a dataframe by combining several vectors using the function data.frame(vector1, vector2, ...., vectorN) . An empty one can be created with  df = data.frame()
Example:

Output:
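
A sketch of building a data frame from vectors and inspecting it. The planet values are illustrative only.

```r
name     <- c("Mercury", "Venus", "Earth")
type     <- c("Terrestrial", "Terrestrial", "Terrestrial")
diameter <- c(0.382, 0.949, 1.000)     # relative to Earth
rings    <- c(FALSE, FALSE, FALSE)

planets_df <- data.frame(name, type, diameter, rings)
str(planets_df)       # 3 obs. of 4 variables
summary(planets_df)   # per-column summary statistics
```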

Exploratory Functions

  1. str   –> str(df)
  2. summary  –> summary(df)
  3. glimpse   –> glimpse(df)   –> only available with the dplyr package.

Get the overview of Data Frame

Use the function str()  which (I think) stands for structure.

Example:

Get The summary Statistics of All Variables in The Data Frame or Vector

summary()
For numeric data, the summary consists of min, Q1, median, mean, Q3, and max.
Example 1:

Example 2:
For the hotdogs data that consists of 3 variables: meat types, sodium, and calorie levels, the summary gives this following result.

Aggregate a Data Frame.

aggregate()  splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.

Let say we have a dataframe x with columns: col1, col2, and col3. Then we can do various aggregation as follows.

Usage 1: aggregate(x, by = list(x$col2), FUN) –> aggregate each column in dataframe x by col2, with function FUN.

Usage 2: aggregate(x$col1, by = list(x$col2), FUN)   –> aggregate column col1 only grouped by col2, with function FUN.

Usage 3: aggregate(col1 ~ col2, data=x, FUN)  –> aggregate column col1 only grouped by col2 with function FUN.

Notes:

  1. x = the data we want to aggregate. It could be a dataframe itself or a column in the dataframe.
  2. by = the variable to group the data by. In usages 1 and 2, it must be given as a list. In usage 3, the grouping variable goes on the right-hand side of the formula instead.
  3. FUN = the function to aggregate/summary the subset.

Example 1:

Let’s say we have the dataframe signs as follows:

From the dataframe signs find the average of each columns based on the sign_type column.
aggregate(signs, by = list(signs$sign_type), mean) .

Example 2:

Let’s say we only want to aggregate column r2 only. We can do such by using 2 different usages which will return the same output.

aggregate(signs$r2, list(signs$sign_type), mean)  or  aggregate(r2 ~ sign_type, data=signs, mean) .

The output is slightly different in terms of the resulted aggregated column names:

Output of example 2, usage 1

Output of example 2, usage 2
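
Because the original signs data is not reproduced here, this sketch uses a small hypothetical stand-in with the same column names to show the difference in the resulting column names.

```r
# Hypothetical stand-in for the signs data
signs <- data.frame(
  sign_type = c("stop", "stop", "speed", "speed"),
  r1 = c(200, 220, 90, 110),
  r2 = c(30, 50, 80, 100)
)

u2 <- aggregate(signs$r2, by = list(signs$sign_type), FUN = mean)
u3 <- aggregate(r2 ~ sign_type, data = signs, FUN = mean)

u2   # columns are named Group.1 and x
u3   # columns keep their names: sign_type and r2
```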

 

Selecting elements of Data Frame

Similar to that of matrix –> use square bracket [] .
In addition, we can also select columns using its names instead of its index number.
Example:

Since the diameter column is the 3rd column in the planets_df dataframe, then planets_df_5 <- planets_df[1:5, 'diameter']  can also be rewritten as planets_df_5 <- planets_df[1:5, 3] .

Selecting All Rows of 1 Column Only

Use the sign ‘$’.
Example:

Select Rows that Fulfil Certain Condition. 

Method 1:
Use the condition as the parameter for the row.
Example:

The code above selects all rows that have small = TRUE.

Method 2:
Another way to select by condition is using subset(df, subset=condition) , which is simpler and more understandable.
Example:  subset(where9am, daytype == 'weekday' & location == 'office')

Note that for this element-wise logical operation in R, we use the single ampersand & rather than &&.
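
The selection methods from this section can be sketched on a hypothetical data frame shaped like the where9am example:

```r
# Hypothetical data in the shape of the where9am example
where9am <- data.frame(
  daytype  = c("weekday", "weekday", "weekend"),
  location = c("office", "home", "home"),
  small    = c(TRUE, FALSE, TRUE)
)

m1 <- where9am[where9am$small, ]                                      # method 1
m2 <- subset(where9am, daytype == "weekday" & location == "office")   # method 2

where9am$location            # one column, all rows
where9am[1:2, "location"]    # rows 1-2 of a column, by name
where9am[1:2, 2]             # same selection, by index
```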

Selecting attribute of an object. 

attr(x, which)

Notes:

  1. x = an object whose attributes are to be accessed.
  2. which = a non-empty character string specifying which attribute is to be accessed.

Example:
Choose the attribute prob from the object sign_pred. The attribute name must be passed as a quoted string:
attr(sign_pred, "prob") .
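A minimal sketch, assuming a hypothetical sign_pred built by hand (the original object and its values are not shown in this post):

```r
# Hypothetical stand-in for sign_pred: a factor that carries a "prob" attribute.
sign_pred <- factor(c("stop", "speed", "stop"))
attr(sign_pred, "prob") <- c(0.9, 0.6, 0.8)

# Read the attribute back; the name is given as a character string.
probs <- attr(sign_pred, "prob")
```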

XI. List

Created using list(comp1, comp2, ....)
comp = a component, which can be any object type: matrix, vector, dataframe, etc.

Example:
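A short sketch of creating a list, with and without component names (all object names here are hypothetical):

```r
my_vector <- 1:5
my_matrix <- matrix(1:6, nrow = 2)
my_df <- data.frame(id = 1:2, label = c("a", "b"))

# Unnamed components: they can only be reached by position.
my_list <- list(my_vector, my_matrix, my_df)

# Named components: they can be reached by position or by name.
named_list <- list(vec = my_vector, mat = my_matrix, df = my_df)
```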

Selecting Elements of a List

Since a list is a collection of several components, selecting an element of a list takes 2 steps:
1. select the component, using one of these 3 ways:
a) double square brackets with the index number inside  –>  my_list[[2]]
b) double square brackets with the component name inside  –>  my_list[["comp1"]]
c) the $ sign  –>  my_list$comp1
2. select the particular element within that component, e.g. my_list[["comp1"]][1]

Example:
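The two steps can be sketched as follows, assuming a hypothetical my_list with two named components:

```r
my_list <- list(comp1 = c(10, 20, 30), comp2 = matrix(1:6, nrow = 2))

# Step 1: three equivalent ways to pull out the first component.
by_index <- my_list[[1]]
by_name  <- my_list[["comp1"]]
by_dollar <- my_list$comp1

# Step 2: index into the component as usual.
second_value <- my_list[["comp1"]][2]  # 20
cell <- my_list$comp2[2, 3]            # 6 (row 2, column 3 of the matrix)
```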

Adding New Component to a List
Similar to a vector: use the function c() .
Example:
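A minimal sketch, assuming a hypothetical my_list describing a planet:

```r
my_list <- list(name = "Earth", diameter = 1)

# Add one new named component at the end of the list.
my_list <- c(my_list, moons = 1)

# When the new component is itself a vector, wrap it in list() so that it is
# added as a single component rather than one component per element.
my_list <- c(my_list, list(neighbours = c("Venus", "Mars")))

names(my_list)  # "name" "diameter" "moons" "neighbours"
```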

Converting a List into a Vector

  • use  unlist(the_list)
  • returns a vector containing the values of all components in the list. The component names are not part of the values, but they are kept as element names unless use.names = FALSE .
  • Example:
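A short sketch with a hypothetical two-component list:

```r
my_list <- list(a = 1:2, b = 3)

flat <- unlist(my_list)

unname(flat)  # 1 2 3
names(flat)   # "a1" "a2" "b"

# Drop the names entirely if only the values are wanted.
flat_no_names <- unlist(my_list, use.names = FALSE)
```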

Getting the name attributes of a list

  • using function names(list)  .
    Example:
  • If the list has no named components, names() will return NULL.
    Example:
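Both cases can be sketched with two hypothetical lists:

```r
named <- list(comp1 = 1, comp2 = "a")
unnamed <- list(1, "a")

names(named)    # "comp1" "comp2"
names(unnamed)  # NULL
```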

XII. Factor

factor(the_vector, levels, labels = levels, exclude = NA, ordered = is.ordered(the_vector), nmax = NA)  –> encodes a vector as categorical data, also called an enumerated type (or ‘factor’ in R terms).

Notes:

  • the_vector = the vector that will be turned into categorical data
  • levels = a vector of the unique values in the_vector, in the order the categories should appear. The values given here must match the data; to rename the categories, use labels (or assign to levels(f) afterwards).
    Let’s say we have student data where gender is coded as ‘F’ and ‘M’. To make these codes meaningful, we keep levels = c('F', 'M') and add the parameter labels = c('Female', 'Male') .
  • ordered = logical flag that determines whether the levels should be regarded as ordered (in the order given). By setting ordered = TRUE , we indicate that the factor is ordered based on the given levels. Otherwise, it will just be an unordered factor.

Example:
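A minimal sketch, assuming two small hypothetical vectors (one unordered category, one ordered):

```r
# Unordered factor: the codes F and M are relabelled as Female and Male.
gender <- c("F", "M", "M", "F", "F")
gender_factor <- factor(gender, levels = c("F", "M"), labels = c("Female", "Male"))

# Ordered factor: the order of levels defines low < medium < high.
size <- c("low", "high", "medium", "low")
size_factor <- factor(size, levels = c("low", "medium", "high"), ordered = TRUE)

size_factor[1] < size_factor[2]  # TRUE, because low < high
```

Comparisons such as < are only meaningful on ordered factors.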

We can also set the level names afterwards, as in the following example:
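A short sketch of renaming the levels after the factor has been created, using a hypothetical gender vector:

```r
gender_factor <- factor(c("F", "M", "M", "F"))

levels(gender_factor)  # "F" "M" (alphabetical by default)

# Assign new names in the same order as the existing levels.
levels(gender_factor) <- c("Female", "Male")
```

The order of the new names matters: the first name replaces the first existing level, and so on.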

The function ‘summary()’

summary(the_data_object)  provides a summary of the given data object.
For a regular character vector, summary()  only shows the length, class, and mode of the object (for a numeric vector it shows the minimum, quartiles, mean, and maximum).
For a factor, summary()  shows the number of elements in each category.

Example:
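A minimal sketch comparing the two cases, with a hypothetical gender vector:

```r
gender <- c("F", "M", "M", "F", "F")

# On a character vector: only length, class, and mode are reported.
vector_summary <- summary(gender)

# On a factor: one count per category.
counts <- summary(factor(gender))
counts  # F: 3, M: 2
```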

From the example above, we can see that using summary()  on a factor will give us the information of the number of elements in each category.

XIII. Miscellaneous

  • custom-styled report with RMarkdown and CSS – all of that within the powerful tidyverse.
  • R is case sensitive.
  • to get help on a certain function  –> use the question mark ‘?thing_to_ask’ or help(thing_to_ask)
  • Example: ?mean  or  help(mean)
  • documentation is also available at www.rdocumentation.org.

Cheatsheet of Must-Know General R Functions

  1. x %in% y  –> checks whether each value of x exists in the vector y (for example, a dataframe column)
    1. wrap the expression in ! to test for ‘not in’, e.g. !(x %in% y)
    2. Example:

  2. any(iterables)
    1. Given a set of logical vectors, is at least one of the values true?
    2. Example:
  3. append(x, values, after = length(x))  –> inserts ‘values’ into ‘x’ after the position given in ‘after’ (by default, at the end).
    Example:
  4. args(function_name)   –> Displays the argument names and corresponding default values of a function or primitive.
    Example:
  5. as.new_class(data_to_convert)  –> converts an R object from one class to another, e.g. as.numeric() , as.character() .
    Example 1:

    Example 2:
  6. class(object)  –> what class an object belongs to.
  7. diff(x)  –> finds the differences between consecutive elements of vector x.
  8. dist(x, method = 'euclidean')  –>  computes the distances between the rows (points) of x.
    1. Notes:
      1. x = a numeric matrix, data frame or “dist” object.
      2. method = the distance measure to be used. This must be one of “euclidean”, “maximum”, “manhattan”, “canberra”, “binary” or “minkowski”.
    2. Example:
  9. file.path(subpath1, subpath2, ....)  –> Construct the path to a file
    Example:
  10. GET(url)  –> sends an HTTP GET request to the URL, like cURL (provided by the httr package).
  11. grep(pattern, x)  –> returns the indexes of the elements of x that match pattern.
    Example:
  12. identical(x, y)  –>  test two objects for being exactly equal. Returns TRUE or FALSE
  13. is.*() :  checks whether an R object belongs to a given class or has a given property; returns TRUE or FALSE.
    Example:
    is.na(iterables)
  14. list.dirs()
    –> lists the directories under the current working directory.
  15. list.files()
    –> lists the files (including directories) in the current working directory.
  16. list.files("dir_name")
    –> lists the files inside the directory “dir_name”.
    –> example:
  17. ls()  –>  lists the data, values, and functions in the global environment.
  18. na.omit(x)  –> removes missing values (NA) from a vector, dataframe, etc.
  19. na.rm  –> an argument of many summary functions (e.g. mean() , sum() , range() ); if it’s TRUE , missing values are removed before computing.
  20. range(object, na.rm=FALSE)
    1. returns a vector consisting of 2 elements: min and max.
    2. Example:
  21. rep(x, times = n)  –> repeats the whole of x n times. x can be a single number, vector, or list.
    rep(x, each = n)    –> repeats each element of x n times before moving on to the next element.
    The example below shows the difference between the 2 variants of rep.
    Example:
  22. rev(x)  –> reverses the vector/list ‘x’.
    Example:
  23. scale(x)
    standardizes each column of the data –> mean = 0, sd = 1.
  24. seq(from, to, by = x)  or
    seq(from, to, length.out = n)
    –> creates a sequence from ‘from’ to ‘to’, either in steps of x (use -x for a decreasing sequence) or with n evenly spaced elements.
    Example:

    Notes:
    1. from = starting point
    2. to = end point
    3. by = the increment between consecutive elements
    4. length.out = the desired length of the sequence. It is rounded up if it’s a fraction.
    5. from , to , by , and length.out  cannot all be given at once; specify at most three of them.
  25. sort(x, decreasing = FALSE)
  26. seq_along(object)
    1. generates the sequence of indexes of the object passed to it, and handles the empty case correctly (an empty object gives an empty sequence).
    2. equivalent to seq_len(length(object))
    3. an alternative to
    4. Example:
  27. typeof(x)  –> returns the internal storage type of an object (e.g. "double", "integer", "character", "list"). Related to, but not the same as, class() : for a dataframe, class() gives "data.frame" while typeof() gives "list".
  28. unique(something)   –> returns ‘something’ with the duplicate elements removed.
  29. order(the_sorting_criteria)  –> returns the indexes that would sort the vector, not the sorted values themselves.
    By default, the order is ascending; pass decreasing = TRUE for descending.
    To reorder the vector elements, use this index vector inside square brackets.
    Example:

    The sorting criteria should be a vector. To sort a vector, pass the vector itself as the argument of order().
    To order a dataframe by a certain column, select the column first using the $ sign, then store the resulting index vector in a variable.
    Then reorder the dataframe with square brackets, just as for a vector, except that the brackets take 2 arguments: the index vector for the rows and an empty slot for the columns.

    Example of sorting a vector:

    Example of sorting a dataframe:

  30. Vector vs List
    1. selection
      1. vector  –> single square bracket   –> [ ]
      2. list  –> double square bracket  –>   [[ ]]
  31. The general rule in selection
    x[index]   –> example: x[2]  –> selecting the 2nd element
    x[condition ]   –>
    example1:  x[x < 5]  –> selecting elements with values < 5
    example 2:

  32. which(condition) 
    returns the index(es) of the element(s) or row(s) for which the condition is TRUE.
    Example:

    Find the index number of the row whose country column == 'Brazil'.

    Alternatively, we can use match(value, source) to return the index of the first element of source that equals value.
    Example:

  33. rename(x, newcol1 = col1, newcol2 = col2, ...)  (from the dplyr package).
    1. renames columns of a dataframe; the new name goes on the left of the = sign.
      Example:
  34. head() / tail() – see the head and the tail – also check out the corner function of the jaffelab package created by LIBD Rstats founding member E. Burke
  35. colnames() / rownames() – see and rename columns or row names
  36. colMeans() / rowMeans() / colSums() / rowSums() – get means and sums of columns and rows
  37. dim() and length() – determine the dimensions/size of a data set – use length() for a vector, since dim() returns NULL for one
  38. ncol() / nrow() – number of columns and rows
  39. str() – displays the structure of an object – this is very useful with complex data structures
  40. unique()/duplicated() – find unique and duplicated values
  41. order()/sort()– order and sort your data
  42. gsub(pattern, replacement, x) – replace every match of a pattern in a string
  43. table() – build a contingency table of the counts at each combination of factor levels. Summarize your data in table format.
    Example:

    The example above shows that the column donated has 88,751 rows with value 0 and 4,711 rows with value 1.
  44. t.test() – perform a t test
  45. cor.test() – perform a correlation test
  46. lm() – make a linear model
  47. summary() – if you use the lm() output – this will give you the results
  48. set.seed() – makes random permutations or random data come out the same every time you run your code.
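Since most of the example screenshots in this cheatsheet are missing, here is a short sketch that runs a handful of the functions above on tiny made-up inputs. The vector x and the dataframe df are hypothetical and exist only for illustration.

```r
x <- c(3, 1, 2)

2 %in% x                   # TRUE
!(5 %in% x)                # TRUE: 5 is not in x
any(x > 2)                 # TRUE: at least one element is greater than 2
append(x, 9, after = 1)    # 3 9 1 2
rev(x)                     # 2 1 3
range(x)                   # 1 3

rep(x, times = 2)          # 3 1 2 3 1 2
rep(x, each = 2)           # 3 3 1 1 2 2

seq(1, 9, by = 2)          # 1 3 5 7 9
seq(0, 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00
seq_along(x)               # 1 2 3
seq_along(c())             # integer(0), whereas 1:length(c()) gives 1 0

idx <- order(x)            # 2 3 1: positions of the elements in sorted order
x[idx]                     # 1 2 3

# Ordering a dataframe by one column, then looking up a row by value.
df <- data.frame(country = c("Chile", "Brazil", "Peru"), score = c(2, 3, 1))
sorted_df <- df[order(df$score), ]   # rows in the order Peru, Chile, Brazil
which(df$country == "Brazil")        # 2
match("Brazil", df$country)          # 2

table(c(0, 0, 1))          # counts: two 0s and one 1
```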
