1. R Data Types

top

Atomic classes in R:

  • charcater
  • numeric
  • integer (L)
  • complex
  • logical (TRUE/FALSE)

Special numbers Inf and NaN: 1/0 = Inf and 1/Inf = 0; 0/0=NaN

Object attributes:

  • names, dimnames
  • dimensions (for matrices and arrays)
  • class
  • length
  • Other user defined attributes

attributes() to get list of attributes of an object

Creating vectors

  • vector() : to create an empty vector
  • vector("numeric", length=10)
  • c('a', 'b') : to concatenate multiple objects into a single vector
  • Mixing objects from different classes: c(1.5, 'a') : coercion

Explicit coercion

  • as.character()
  • as.numeric()
  • as.logical()
  • as.complex()
In [1]:
x <- 0:6
print(x)
print(as.logical(x))
print(as.complex(x))
[1] 0 1 2 3 4 5 6
[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[1] 0+0i 1+0i 2+0i 3+0i 4+0i 5+0i 6+0i

Matrices

To create a vector:

  • create a matrox by matrix() function
  • change the dimension of a vector dim()
  • binding vectors and matruces: cbind() and rbind()
In [2]:
m <- matrix(nrow=2, ncol=3)
print(m)
attributes(m)
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA

Out[2]:
$dim
[1] 2 3

In [3]:
# Method1: create a matrix by matrix()
m1 <- matrix(1:6, nrow=2, ncol=3)
print(m1)
# Method2: change the dimension of a vector
m2 <- 1:6
dim(m2) = c(2,3)
print(m2)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

In [4]:
x <- 1:4
y <- 6:9
cbind(x,y)
rbind(x,y)
Out[4]:
     x y
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
Out[4]:
  [,1] [,2] [,3] [,4]
x    1    2    3    4
y    6    7    8    9

Lists

Lists are special vectors that can store elements from different classes

  • Create a list by list()
In [5]:
x <- list(23, "abc", 1+5i, TRUE)
print(x)
[[1]]
[1] 23

[[2]]
[1] "abc"

[[3]]
[1] 1+5i

[[4]]
[1] TRUE


Factors

Factors are a special data type to represent categorical variables, such as gender ('male', 'female'), patterm classes ('1', '2') ..

  • Create factors by factor()
  • Access the labels: levels()
  • Make a table view of number of items in each factor: table()

Factor levels are ordered by default in alphabetic order. In order to directly specify the order: factor(vec, levels=c("male", "female"))

In [6]:
## create a vector of factors
xf <- factor(c("male", "female", "female", "female", "male", "female", "female", "male", "female"))
str(xf)
## print levels of xf
levels(xf)
unclass(xf)
 Factor w/ 2 levels "female","male": 2 1 1 1 2 1 1 2 1

Out[6]:
[1] "female" "male"  
Out[6]:
[1] 2 1 1 1 2 1 1 2 1
attr(,"levels")
[1] "female" "male"  
In [7]:
table(xf)
Out[7]:
xf
female   male 
     6      3 

Missing values

Missing values are denoted by NA or NaN

NaN is for undefined mathematical operations such as 0/0. NaN is a sub-type of NA

  • is.na() and is.nan() to test whether vector elements are missing
In [8]:
x <- c(1,2,3, NA, 6, NaN, 8)
is.na(x)
is.nan(x)
Out[8]:
[1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
Out[8]:
[1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE

Data Frames

To store tabular data. One of the most useful things in R.

Data frames can store different object classes in different columns, unlike matrices that have to have the same elements class.

  • special attributes: row.names, col.names
  • Rading tabular data from files: read.table() and read.csv()
  • Convert data.frame to a matrix: data.matrox() (coercion may happen)
In [9]:
df <- data.frame(foo=11:14, bar=c(T, T, F, T))
df
nrow(df)
ncol(df)
Out[9]:
  foo   bar
1  11  TRUE
2  12  TRUE
3  13 FALSE
4  14  TRUE
Out[9]:
[1] 4
Out[9]:
[1] 2

Names

R objects can have a name. Useful for self describing data, so each vector can be accesed by an index or by its object names

  • access and change the names by names()
  • names in list: x <- list(a=1, b=2, c=3)
  • names in matrices: dimnames()
In [10]:
x <- 1:3
names(x) <- c("foo", "bar", "norf")
x
names(x)
Out[10]:
 foo  bar norf 
   1    2    3 
Out[10]:
[1] "foo"  "bar"  "norf"

2. Subsetting R objects

Top

To extract subsets of data (objects).

  • subsetting with [ ] (given numeric index or logical index)
  • subsetting with double bracket for lists: [[ ]] (only to extract a single element of a list, meaning no vector inside double bracket)
  • subsetting with $ to extract elements of lists and data.frames by their names
In [11]:
## subsetting vectors: numeric or logical index
x <- c("a", "b", "c", "d", "h")
x[1:3]
x[x > "c"]
Out[11]:
[1] "a" "b" "c"
Out[11]:
[1] "d" "h"

For matrices, you can acess rows and columns by [row index, column index]

By default, subsetting drops the column and row index, to avoid that use m[1,2, drop=F] This way, the output would still be a matrix.

In [12]:
## subsetting matrices
m1[1,2]
m2[,2]

m2[,2, drop=F]
Out[12]:
[1] 3
Out[12]:
[1] 3 4
Out[12]:
     [,1]
[1,]    3
[2,]    4

Subsetting lists

Using double bracket to subset lists. Only extract one element of a list at time. Therefore, no vector inside double bracket. (x[[1:3]] ==> error)

In [13]:
xl <- list(foo=1:5, bar=0.7, norf="ABcDE")
xl[[1]]
xl[["bar"]]
xl["bar"]
xl$norf
Out[13]:
[1] 1 2 3 4 5
Out[13]:
[1] 0.7
Out[13]:
$bar
[1] 0.7

Out[13]:
[1] "ABcDE"
  • Use single bracket [ ] To extract multiple elements of a list:
In [14]:
xl[c(1,3)]
Out[14]:
$foo
[1] 1 2 3 4 5

$norf
[1] "ABcDE"

  • Nested list
In [15]:
xnest <- list(a=list(10, 12, 13, 14), b=c(4:7))
# to extract 13:
xnest[["a"]][[3]]
Out[15]:
[1] 13

Partial Matching

Partial matching works for subsetting objects with [[ ]] and $

In [16]:
xp <- list(abc=c(1:8), fgh=c(T, F, F, T))
xp$a
xp[["f"]]
xp[["f", exact=F]]
Out[16]:
[1] 1 2 3 4 5 6 7 8
Out[16]:
NULL
Out[16]:
[1]  TRUE FALSE FALSE  TRUE

Removing missing values

  • In order to subset non-missing values onlt, use !is.na()
  • Complete cases of a data.frame, a matrix, or multiple vectors complete.cases()
In [17]:
x <- c(1,2,NA, 5, NaN, 8, NaN, 9)
x[!is.na(x)]
Out[17]:
[1] 1 2 5 8 9
In [18]:
y <- c(3, 5, 6, NA, NA, 8, 9, 10)
complete.cases(x,y)
Out[18]:
[1]  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE

3. Reading and Writing Data

Top

Reading data

  • To read in tabular data, use read.table() or read.csv()
  • readLines() to read lines of a text file
  • source() for reading R codes (inverse of dump())
  • dget() for reading R codes (inverse of dput())
  • load() for reading a saved workspace
  • unserialize() for reading single R objects in binary form

Attributes of read.table()

  • file: name of the file
  • header: logical (T/F) indicating if the file has a header or not
  • sep: a string that indicates how columns are separated
  • colClasses: a character vector indicating the class of each column in the data file
  • nrows: the number of rows in the dataset
  • comment.char: a character string indicating the comment character used in the file
  • skip: the number of lines to be skipped from the begining
  • stringAsFactors: logical, should character variables be coded as factors?

Writing data

  • write.table()
  • writeLines()
  • dump()
  • dput()
  • save()
  • serialize()

Reading large datasets

For efficiently reading in large datasets:

  • first identify column classes
  • ignore the comment character if you know there is no comment in the file: comment.char=""

You can automaucally specify column classes by reaading the first 10 rows of the file.

example: init10 <- read.table("large_dataset.txt", nrows=10) classes <- sapply(init10, class) datAll <- read.table("large_dataset.txt", colClasses=classes)

  • If you know the number of rows, it is more memory efficient to set nrows as well

Reading and writing in textual format

Textual format has the advantage of readablity and storing element class information.

  • dput() for a single object
  • dunpy() for multiple objects
## example for dupt:
y <- data.frame(a=c(1,2), b=c("HH", "WW"))
dput(y, file="data_y.R")
new.y <- dget("data_t.R")
## example for dput: multiple objects
x <- "foo"
dump(c("x", "y"), file="data_xy.R"))
rm(x,y)
source("data_xy.R")

Connection interfaces

Data can be read from other sources by creating a connection.

  • file(): opens a connection to a file
  • gzfile(): opens a connection to a gz compressed file
  • bzfile(): opens a connection to a bz compressed file
  • url(): opens a connection to a web-page
In [19]:
str(file)
function (description = "", open = "", blocking = TRUE, encoding = getOption("encoding"), 
    raw = FALSE)  

In [20]:
con <- gzfile("data/hw1.csv.gz", "r")
xgz <- readLines(con, 10) ## read the first 1 lines
xgz[1]
simpleWarning in readLines(con, 10): seek on a gzfile connection returned an internal error

Out[20]:
[1] "RT,SERIALNO,DIVISION,PUMA,REGION,ST,ADJUST,WGTP,NP,TYPE,ACR,AGS,BDS,BLD,BUS,CONP,ELEP,FS,FULP,GASP,HFL,INSP,KIT,MHP,MRGI,MRGP,MRGT,MRGX,PLM,RMS,RNTM,RNTP,SMP,TEL,TEN,VACS,VAL,VEH,WATP,YBL,FES,FINCP,FPARC,GRNTP,GRPIP,HHL,HHT,HINCP,HUGCL,HUPAC,HUPAOC,HUPARC,LNGI,MV,NOC,NPF,NPP,NR,NRC,OCPIP,PARTNER,PSF,R18,R60,R65,RESMODE,SMOCP,SMX,SRNT,SVAL,TAXP,WIF,WKEXREL,WORKSTAT,FACRP,FAGSP,FBDSP,FBLDP,FBUSP,FCONP,FELEP,FFSP,FFULP,FGASP,FHFLP,FINSP,FKITP,FMHP,FMRGIP,FMRGP,FMRGTP,FMRGXP,FMVYP,FPLMP,FRMSP,FRNTMP,FRNTP,FSMP,FSMXHP,FSMXSP,FTAXP,FTELP,FTENP,FVACSP,FVALP,FVEHP,FWATP,FYBLP,wgtp1,wgtp2,wgtp3,wgtp4,wgtp5,wgtp6,wgtp7,wgtp8,wgtp9,wgtp10,wgtp11,wgtp12,wgtp13,wgtp14,wgtp15,wgtp16,wgtp17,wgtp18,wgtp19,wgtp20,wgtp21,wgtp22,wgtp23,wgtp24,wgtp25,wgtp26,wgtp27,wgtp28,wgtp29,wgtp30,wgtp31,wgtp32,wgtp33,wgtp34,wgtp35,wgtp36,wgtp37,wgtp38,wgtp39,wgtp40,wgtp41,wgtp42,wgtp43,wgtp44,wgtp45,wgtp46,wgtp47,wgtp48,wgtp49,wgtp50,wgtp51,wgtp52,wgtp53,wgtp54,wgtp55,wgtp56,wgtp57,wgtp58,wgtp59,wgtp60,wgtp61,wgtp62,wgtp63,wgtp64,wgtp65,wgtp66,wgtp67,wgtp68,wgtp69,wgtp70,wgtp71,wgtp72,wgtp73,wgtp74,wgtp75,wgtp76,wgtp77,wgtp78,wgtp79,wgtp80"
In [21]:
## read from a web-page
con <- url("http://www.jhsph.edu", "r")
x <- readLines(con)
head(x)
Out[21]:
[1] "<!DOCTYPE html>"                                               
[2] "<html lang=\"en\">"                                            
[3] ""                                                              
[4] "<head>"                                                        
[5] "<meta charset=\"utf-8\" />"                                    
[6] "<title>Johns Hopkins Bloomberg School of Public Health</title>"

Top

4. Flow Control

  • if, else, else if
  • for
  • while
  • repeat an infinite loop, (to stop use break)
  • break break the execution of a loop
  • next skip one iteration
  • return exit a function
In [22]:
x <- 5
y <- if(x>3) {
   10
}else{
   0
}

print(y)
[1] 10

In [23]:
## three ways to iterate through a vector:
x <- c("a", "b", "c")

for (i in 1:length(x)) {
   print(x[i])
}

for (i in seq_along(x))  print(x[i])

for (letter in x)  print(letter)
[1] "a"
[1] "b"
[1] "c"
[1] "a"
[1] "b"
[1] "c"
[1] "a"
[1] "b"
[1] "c"

5. Functions

Top

Functions are first class objects.

We can have nested functions, functions inside other functions.

Function arguments can have default values. Function arguments can be mapped positionally, or by name.

The arguments that don't have default values, are required when function is called.

Note: Split your functions so that each function does a single task.

In [24]:
args(lm)
Out[24]:
function (formula, data, subset, weights, na.action, method = "qr", 
    model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
    contrasts = NULL, offset, ...) 
NULL

special argument "..."

Three dots indicates variable number of arguments.

It is usually used when extending another function:

In [25]:
myplot <- function(x, y, mytype="l", ...) {
   plot(x, y, type=mytype)
}

... is also useful when the number of input arguments is not known in advance:

In [26]:
args(paste)

args(cat)
Out[26]:
function (..., sep = " ", collapse = NULL) 
NULL
Out[26]:
function (..., file = "", sep = " ", fill = FALSE, labels = NULL, 
    append = FALSE) 
NULL
In [27]:
paste("a", "b", sep=":")
# partial matching doesn't work after ...
paste("a", "b", se=":")
Out[27]:
[1] "a:b"
Out[27]:
[1] "a b :"

6. Scoping Rules

Top

R binds a value to every symbol.

If you define a new objects with a name that is used previously in R:

R uses lexical scoping

  • R first searches for a free variable in the environment in which the function is defined
  • If no matched found, it searches in the parent environment

The global enfironemnt is the workspace.

When a free variable is found, R searches through the search space, from the workspace (global environment) until the symbol is found or it hits the emoty environment.

In [28]:
search()
Out[28]:
[1] ".GlobalEnv"        "package:stats"     "package:graphics" 
[4] "package:grDevices" "package:utils"     "package:datasets" 
[7] "package:methods"   "Autoloads"         "package:base"     
In [29]:
make.Gaussian <- function(mu, sigma) {
    gauss <- function(x) {
       exp(-((x-mu)/sigma)^2)
    }
    gauss
}

mygauss_0 <- make.Gaussian(mu=0, sigma=1)
mygauss_1 <- make.Gaussian(mu=1, sigma=1)
mygauss_0
Out[29]:
function(x) {
       exp(-((x-mu)/sigma)^2)
    }
<environment: 0x1ff91d0>
In [30]:
library(IRdisplay)

x <- seq(-3, 5, by=0.1)

png("figs/mygauss.png")
par(mar=c(4.7, 4.7, 0.6, 0.6))
plot(x, mygauss_0(x), type="l", col=rgb(0.2, 0.4, 0.7, 0.5), lwd=6, xlab="x", ylab="Gaussian Distn.", cex.axis=1.5, cex.lab=1.5)
lines(x, mygauss_1(x), col=rgb(0.7, 0.3, 0.2, 0.5), lwd=6)
dev.off()
display_png("figs/mygauss.png")
Out[30]:
pdf 
  2 

7. Dates and Times

Top

  • dates are represented by Date class
  • times are represented by POSIXct or POSIXlt classes
  • POSIXct is a very large integer; It uses a data frame to store time
  • POSIXlt store other information, such as day of month, day of year, ...
  • internal reference time is 1970-01-01

  • weekdays() what day of the week that time/date is
  • months() gives the month name
  • quarters() gives the quarter number

  • objects of the same class can be used in mathematical operations

In [31]:
t <- as.Date("2014-05-11")
print(t)
unclass(t)  ## number of days since 1970-01-01
[1] "2014-05-11"

Out[31]:
[1] 16201
In [32]:
x <- Sys.time()
print(x)
p <- as.POSIXlt(x)
names(unclass(p))
p$hour
[1] "2014-07-05 23:45:44 EDT"

Out[32]:
 [1] "sec"    "min"    "hour"   "mday"   "mon"    "year"   "wday"   "yday"  
 [9] "isdst"  "zone"   "gmtoff"
Out[32]:
[1] 23

Format Strings to dates

To convert a string in an arbitrary format to a Date format:

In [33]:
dstring <- c("January 01 2012, 10:40", "March 05 2013, 11:20")
strptime(dstring, "%B %d %Y, %H:%M")

## the formats can be found in help page ?strptime
Out[33]:
[1] "2012-01-01 10:40:00 EST" "2013-03-05 11:20:00 EST"

8. Apply Functions

Top

Apply functions can be used to apply a function over elements of a vector, list or data frame. They make the codes shorter, and faster through vectorized operations.

  • lapply() loop over elements of a list, and evaluate a function on each element.
    • if input is not a list, it will be corced to a list (if possible)
  • sapply() same as lapply, but also simplifies the result
  • apply() apply a function over margins of an array
  • tapply() apply a function over subsets of a vector
  • mapply() multivariate version of lapply
In [34]:
## Example for lapply()
x <- list(a=1:20, b=rnorm(10, 2))
lapply(x, mean)
sapply(x, mean)
Out[34]:
$a
[1] 10.5

$b
[1] 2.083524

Out[34]:
        a         b 
10.500000  2.083524 
In [35]:
## Example for apply()
x <- matrix(rnorm(50, mean=1), nrow=10, ncol=5)
apply(x, 2, mean)
apply(x, 1, mean)
Out[35]:
[1] 1.0639838 0.7583712 1.9225710 1.2163184 0.7788464
Out[35]:
 [1] 0.8258462 1.6545185 0.8579511 1.2843574 0.1814005 2.0651896 1.2851799
 [8] 1.5149253 0.9494952 0.8613177

Shortcut functions

For means and sums of a matrix, these shortcut functions perform much faster on big matrices.

  • rowSums() equivalent to apply(x, 1, sum)
  • colSums() equivalent to apply(x, 2, sum)
  • rowMeans() equivalent to apply(x, 1, mean)
  • colMeans() equivalent to apply(x, 2, mean)
In [36]:
# Benchmark: Comparing apply() and rowMeans()
library(rbenchmark)
a <- matrix(rnorm(100000), nrow=1000)

benchmark(apply(a, 1, mean))
benchmark(rowMeans(a))
Out[36]:
               test replications elapsed relative user.self sys.self user.child
1 apply(a, 1, mean)          100   2.064        1     2.044    0.012          0
  sys.child
1         0
Out[36]:
         test replications elapsed relative user.self sys.self user.child
1 rowMeans(a)          100   0.032        1     0.032        0          0
  sys.child
1         0

tapply() and split()

tapplly() can be used to apply a function on each sub-group (determied by a factor vector).

In [37]:
## Example for tapply()
x <- c(rnorm(10, mean=3), runif(10, min=1, max=3), rnorm(10, mean=5, sd=2))
f <- gl(3, 10)
tapply(x, f, mean)
Out[37]:
       1        2        3 
3.192245 1.742892 5.028784 

**split() ** is not a loop function, but it can be used to split a vector based on a given factor vector. The output is a list.

In [38]:
split(x, f)
Out[38]:
$`1`
 [1] 3.024968 3.866025 2.810482 3.425582 4.168800 4.495183 2.009164 1.691082
 [9] 3.326658 3.104507

$`2`
 [1] 2.974586 2.862309 1.254037 1.466379 1.491590 1.245616 1.958429 1.645276
 [9] 1.370200 1.160498

$`3`
 [1]  7.230541  5.884052  3.269644 10.447268  6.557051  2.791258  4.381073
 [8]  2.738739  3.447649  3.540565

In [39]:
library(datasets)
s <- split(airquality, airquality$Month)

sapply(s, function(x) colMeans(x[,c("Ozone", "Solar.R", "Wind", "Temp")], na.rm=T))
Out[39]:
                5         6          7          8         9
Ozone    23.61538  29.44444  59.115385  59.961538  31.44828
Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
Wind     11.62258  10.26667   8.941935   8.793548  10.18000
Temp     65.54839  79.10000  83.903226  83.967742  76.90000

9. Random Number Generation

Top

  • rnorm() generates normal random number variates with a given mean and SD
  • rnorm(n, mean=0, sd=1)
  • dnorm() returns the probablity density function at a point x, with given mean and CD
  • pnorm() returns the cumulative distribution function at point x
  • qnorm() quantile
  • rpois() generates random numbers with Poisson distribution with a given rate (lambda)
  • rbinom() generates binomial random numbers with a given probability
In [40]:
x <- rnorm(n=10, mean=20, sd=5)
x
str(rpois)
Out[40]:
 [1] 14.36841 27.17210 23.60247 17.18039 23.46058 19.76642 19.04366 31.96666
 [9] 22.28409 19.87019
function (n, lambda)  

Gaussian Noise

In [41]:
## adding Gaussian noise to sin(x)
x <- seq(0, 2*pi, by=0.02)
y <- sapply(sin(x), function(x) rnorm(n=1, mean=sin(x), sd=0.2))
## the same as y <- sin(x) + rnorm(n=length(x), mean=0, sd=0.2)

par(mar=c(4.7, 4.7, 0.7, 0.7))
plot(x, y, pch=19, col=rgb(0.2, 0.3, 0.7, 0.6), cex=1.3, cex.lab=1.6, cex.axis=1.6)
  • Use set.seed() to make the random numbers reproducible

Simulate random numbers from a Generalized Linear Model

Assume \(y \approx Poisson(\mu)\)

and \(log(\mu) = \beta_0 + \beta_1 x\)

In [42]:
x <- rnorm(200)
log.mu <- 0.5 + 0.3 * x
y <- rpois(200, exp(log.mu))

plot(x, y, pch=19, col=rgb(0.2, 0.3, 0.7, 0.6), cex=1.5, cex.axis=1.6, cex.lab=1.6)

Random Sampling

  • sample() to draw a random sample from a set
  • replace=TRUE
  • replace=FALSE
In [43]:
str(sample)
sample(letters, 5)
sample(1:10, size=6, replace=T)
function (x, size, replace = FALSE, prob = NULL)  

Out[43]:
[1] "f" "r" "q" "x" "t"
Out[43]:
[1]  7  1  2  1  6 10

10. R Profiler

  • First design your code, unit-test it, then optimize it

  • Using system.time() to measure how much time it takes to perform a step
  • returns an object of class proc_time
  • user time --> CPU time
  • elapsed time --> wallclock time

In [44]:
system.time(rnorm(n=100000, mean=0, sd=1))
Out[44]:
   user  system elapsed 
  0.016   0.000   0.015 
In [45]:
system.time(readLines("http://www.jhsph.edu"))
Out[45]:
   user  system elapsed 
  0.008   0.000   0.361 
  • Measureing multiple expressions:
In [46]:
system.time({
    n <- 1000
    r <- numeric(n)
    for (i in 1:n) {
        x = rnorm(i)
        r[i] = mean(x)
    }
})
Out[46]:
   user  system elapsed 
  0.104   0.000   0.105 
In [47]:
# comparing time of vectorized operations vs. for loops
n=1000000
x <- rnorm(n, mean=0, sd=1)

system.time(sum(x))
system.time({
    s = 0
    for (i in 1:n) {
        s = s + x[i]
    }
})
Out[47]:
   user  system elapsed 
  0.000   0.000   0.002 
Out[47]:
   user  system elapsed 
  1.452   0.004   1.462 

Using Rprof()

  • summaryRprof() will tabulates the output
In [48]:
Rprof(filename="data/rprof_test.dat")

for (i in 1:1000) {
    x <- seq(0, 5, by=0.01)
    y <- 1.5+x + rnorm(n=length(x), mean=0, sd=1)
    lm(y ~ x)
}

Rprof(NULL)

summaryRprof(filename="data/rprof_test.dat")
Out[48]:
$by.self
                        self.time self.pct total.time total.pct
"as.character"               0.32    12.21       0.32     12.21
"match"                      0.16     6.11       0.36     13.74
"model.frame.default"        0.14     5.34       0.90     34.35
"[.data.frame"               0.14     5.34       0.28     10.69
".External2"                 0.12     4.58       0.60     22.90
".External"                  0.12     4.58       0.12      4.58
"[["                         0.10     3.82       0.34     12.98
"[[.data.frame"              0.10     3.82       0.24      9.16
"na.omit.data.frame"         0.08     3.05       0.48     18.32
"deparse"                    0.08     3.05       0.18      6.87
"unique"                     0.08     3.05       0.12      4.58
"%in%"                       0.06     2.29       0.22      8.40
"as.name"                    0.06     2.29       0.06      2.29
"get"                        0.06     2.29       0.06      2.29
"unclass"                    0.06     2.29       0.06      2.29
"model.matrix.default"       0.04     1.53       0.46     17.56
"lapply"                     0.04     1.53       0.34     12.98
"lm.fit"                     0.04     1.53       0.14      5.34
"anyDuplicated.default"      0.04     1.53       0.04      1.53
"match.call"                 0.04     1.53       0.04      1.53
"mode"                       0.04     1.53       0.04      1.53
"names"                      0.04     1.53       0.04      1.53
"rep.int"                    0.04     1.53       0.04      1.53
"<Anonymous>"                0.02     0.76       2.62    100.00
"lm"                         0.02     0.76       2.44     93.13
"sapply"                     0.02     0.76       0.42     16.03
"model.response"             0.02     0.76       0.38     14.50
"FUN"                        0.02     0.76       0.22      8.40
"paste"                      0.02     0.76       0.20      7.63
".getXlevels"                0.02     0.76       0.18      6.87
"getExportedValue"           0.02     0.76       0.16      6.11
"simplify2array"             0.02     0.76       0.14      5.34
"as.list.data.frame"         0.02     0.76       0.08      3.05
"seq"                        0.02     0.76       0.08      3.05
"anyDuplicated"              0.02     0.76       0.06      2.29
"seq.default"                0.02     0.76       0.06      2.29
"terms.formula"              0.02     0.76       0.04      1.53
"$<-"                        0.02     0.76       0.02      0.76
"abs"                        0.02     0.76       0.02      0.76
"any"                        0.02     0.76       0.02      0.76
"as.list.default"            0.02     0.76       0.02      0.76
"attr"                       0.02     0.76       0.02      0.76
".Call"                      0.02     0.76       0.02      0.76
"dim"                        0.02     0.76       0.02      0.76
"isNamespace"                0.02     0.76       0.02      0.76
"length"                     0.02     0.76       0.02      0.76
"makepredictcall"            0.02     0.76       0.02      0.76
"NCOL"                       0.02     0.76       0.02      0.76
"pmatch"                     0.02     0.76       0.02      0.76
"pmin"                       0.02     0.76       0.02      0.76
"seq_len"                    0.02     0.76       0.02      0.76
"structure"                  0.02     0.76       0.02      0.76
"sum"                        0.02     0.76       0.02      0.76
"unique.default"             0.02     0.76       0.02      0.76

$by.total
                        total.time total.pct self.time self.pct
"<Anonymous>"                 2.62    100.00      0.02     0.76
"doTryCatch"                  2.62    100.00      0.00     0.00
"eval"                        2.62    100.00      0.00     0.00
"evaluate"                    2.62    100.00      0.00     0.00
"evaluate_call"               2.62    100.00      0.00     0.00
"handle"                      2.62    100.00      0.00     0.00
"handle_shell"                2.62    100.00      0.00     0.00
"try"                         2.62    100.00      0.00     0.00
"tryCatch"                    2.62    100.00      0.00     0.00
"tryCatchList"                2.62    100.00      0.00     0.00
"tryCatchOne"                 2.62    100.00      0.00     0.00
"withCallingHandlers"         2.62    100.00      0.00     0.00
"withVisible"                 2.62    100.00      0.00     0.00
"lm"                          2.44     93.13      0.02     0.76
"model.frame.default"         0.90     34.35      0.14     5.34
".External2"                  0.60     22.90      0.12     4.58
"na.omit.data.frame"          0.48     18.32      0.08     3.05
"na.omit"                     0.48     18.32      0.00     0.00
"model.matrix.default"        0.46     17.56      0.04     1.53
"model.matrix"                0.46     17.56      0.00     0.00
"sapply"                      0.42     16.03      0.02     0.76
"model.response"              0.38     14.50      0.02     0.76
"match"                       0.36     13.74      0.16     6.11
"[["                          0.34     12.98      0.10     3.82
"lapply"                      0.34     12.98      0.04     1.53
"as.character"                0.32     12.21      0.32    12.21
"[.data.frame"                0.28     10.69      0.14     5.34
"["                           0.28     10.69      0.00     0.00
"[[.data.frame"               0.24      9.16      0.10     3.82
"%in%"                        0.22      8.40      0.06     2.29
"FUN"                         0.22      8.40      0.02     0.76
"paste"                       0.20      7.63      0.02     0.76
"deparse"                     0.18      6.87      0.08     3.05
".getXlevels"                 0.18      6.87      0.02     0.76
"getExportedValue"            0.16      6.11      0.02     0.76
"::"                          0.16      6.11      0.00     0.00
"lm.fit"                      0.14      5.34      0.04     1.53
"simplify2array"              0.14      5.34      0.02     0.76
".External"                   0.12      4.58      0.12     4.58
"unique"                      0.12      4.58      0.08     3.05
"$"                           0.12      4.58      0.00     0.00
"as.vector"                   0.12      4.58      0.00     0.00
"$.data.frame"                0.12      4.58      0.00     0.00
"as.list"                     0.10      3.82      0.00     0.00
"rnorm"                       0.10      3.82      0.00     0.00
"as.list.data.frame"          0.08      3.05      0.02     0.76
"seq"                         0.08      3.05      0.02     0.76
"asNamespace"                 0.08      3.05      0.00     0.00
"getInternalExportName"       0.08      3.05      0.00     0.00
"model.offset"                0.08      3.05      0.00     0.00
"vapply"                      0.08      3.05      0.00     0.00
"as.name"                     0.06      2.29      0.06     2.29
"get"                         0.06      2.29      0.06     2.29
"unclass"                     0.06      2.29      0.06     2.29
"anyDuplicated"               0.06      2.29      0.02     0.76
"seq.default"                 0.06      2.29      0.02     0.76
".deparseOpts"                0.06      2.29      0.00     0.00
"getNamespace"                0.06      2.29      0.00     0.00
"anyDuplicated.default"       0.04      1.53      0.04     1.53
"match.call"                  0.04      1.53      0.04     1.53
"mode"                        0.04      1.53      0.04     1.53
"names"                       0.04      1.53      0.04     1.53
"rep.int"                     0.04      1.53      0.04     1.53
"terms.formula"               0.04      1.53      0.02     0.76
"getNamespaceInfo"            0.04      1.53      0.00     0.00
"model.weights"               0.04      1.53      0.00     0.00
"terms"                       0.04      1.53      0.00     0.00
"$<-"                         0.02      0.76      0.02     0.76
"abs"                         0.02      0.76      0.02     0.76
"any"                         0.02      0.76      0.02     0.76
"as.list.default"             0.02      0.76      0.02     0.76
"attr"                        0.02      0.76      0.02     0.76
".Call"                       0.02      0.76      0.02     0.76
"dim"                         0.02      0.76      0.02     0.76
"isNamespace"                 0.02      0.76      0.02     0.76
"length"                      0.02      0.76      0.02     0.76
"makepredictcall"             0.02      0.76      0.02     0.76
"NCOL"                        0.02      0.76      0.02     0.76
"pmatch"                      0.02      0.76      0.02     0.76
"pmin"                        0.02      0.76      0.02     0.76
"seq_len"                     0.02      0.76      0.02     0.76
"structure"                   0.02      0.76      0.02     0.76
"sum"                         0.02      0.76      0.02     0.76
"unique.default"              0.02      0.76      0.02     0.76
"is.empty.model"              0.02      0.76      0.00     0.00
"logical"                     0.02      0.76      0.00     0.00
"nrow"                        0.02      0.76      0.00     0.00
"unlist"                      0.02      0.76      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 2.62

11. Optimization in R

  • optim(), nlm(), optimize()

  • Pass them a function and a vector of parameters.
  • We can hold some parameters fixed, by fixed=c(F, T)

Example: maximize log-likelihood

In [49]:
## define a constructor function:
make.NegLogLik <- function(data, fixed=c(FALSE, FALSE)) {
    params <- fixed
    function(p) {
        params[!fixed] <- p
        mu <- params[1]
        sigma <- params[2]
        a <- -0.5 * length(data) * log(2*pi*sigma^2)
        b <- -0.5 * sum((data-mu)^2) / (sigma^2)
        -(a + b)
    }
}

# Generate a random dataset with normal distribution
d <- rnorm(n=1000, mean=2, sd=3)
nLL <- make.NegLogLik(d)
nLL
ls(environment(nLL))
Out[49]:
function(p) {
        params[!fixed] <- p
        mu <- params[1]
        sigma <- params[2]
        a <- -0.5 * length(data) * log(2*pi*sigma^2)
        b <- -0.5 * sum((data-mu)^2) / (sigma^2)
        -(a + b)
    }
<environment: 0x1c7b758>
Out[49]:
[1] "data"   "fixed"  "params"
In [50]:
## Minimze Neative of log-likelihood
optim(c(mu=0, sigma=1), nLL)$par
Out[50]:
      mu    sigma 
2.059232 3.108933 
In [51]:
## generate a range of possible values for mean
x <- seq(1, 3, len=100)
## when fixing sigma, use the true value
nLL.mean <- make.NegLogLik(d, fixed=c(FALSE, 3)) 
y <- sapply(x, nLL.mean)

#windows.options(width=40, height=10)
#layout(matrix(c(1,2), 1, 2, byrow = TRUE),
#   widths=c(5,5), heights=c(1,1))



par(mar=c(4.7, 4.7, 0.7, 0.7), mfrow=c(1,2))
plot(x, -(y-min(y)), type="l", lwd=4, col="steelblue", 
     cex.axis=1.7, cex.lab=1.7, xlab=expression(mu), ylab="log-likelihood")
plot(x, exp(-(y-min(y))), type="l", lwd=4, col="steelblue", 
     cex.axis=1.7, cex.lab=1.7, xlab=expression(mu), ylab="likelihood")