This page shows several examples of reading a comma-separate-value (CSV) file into a data.frame.
Recall Sample.csv file created using writeLines:
Find all the R code in this Gist, R-IO-readcsv-writecsv.R.
rawdata <- read.csv("Sample.csv")
rawdata
## Grade Age Code Amount Start
## 1 A 23 ABC 48.982 2014-01-01
## 2 B 56 DEF 2.718 2014-01-02
## 3 F NA 0.000
## 4 B 35 XYZ 3.142 2014-01-05
dim(rawdata) # dimensions
## [1] 4 5
nrow(rawdata) # number of rows
## [1] 4
ncol(rawdata) # number of columns
## [1] 5
names(rawdata) # column names
## [1] "Grade" "Age" "Code" "Amount" "Start"
row.names(rawdata)
## [1] "1" "2" "3" "4"
str(rawdata) # structure
## 'data.frame': 4 obs. of 5 variables:
## $ Grade : Factor w/ 3 levels "A","B","F": 1 2 3 2
## $ Age : int 23 56 NA 35
## $ Code : Factor w/ 4 levels "","ABC","DEF",..: 2 3 1 4
## $ Amount: num 48.98 2.72 0 3.14
## $ Start : Factor w/ 4 levels "","2014-01-01",..: 2 3 1 4
R head and tail functions are useful to see the first few or last few rows of a data.frame.
Data types will be described below.
Notes:
Columns of data.frame
rawdata$Grade
## [1] A B F B
## Levels: A B F
rawdata$Age
## [1] 23 56 NA 35
rawdata$Score # typo here!
## NULL
A data.frame can be indexed like a matrix: rawdata[rows, columns]
Note: indexing in R starts at 1.
rawdata[2, 4] # single element: row 2, column 4
## [1] 2.718
rawdata[2, ] # row 2
## Grade Age Code Amount Start
## 2 B 56 DEF 2.718 2014-01-02
rawdata[, 4] # column 4
## [1] 48.982 2.718 0.000 3.142
rawdata[1:2, c(1, 3, 5)] # first 2 rows of columns 1, 3 and 5
## Grade Code Start
## 1 A ABC 2014-01-01
## 2 B DEF 2014-01-02
Associate memory, hash, dictionary
# Single element
rawdata["2", "Amount"] # row name '2', 'Amount' column
## [1] 2.718
# Rows with names '2' and '1', columns with names 'Start', 'Code' and 'Age'
rawdata[c("2", "1"), c("Start", "Code", "Age")]
## Start Code Age
## 2 2014-01-02 DEF 56
## 1 2014-01-01 ABC 23
# 'Factors' are stored as integers internally
rawdata$Code
## [1] ABC DEF XYZ
## Levels: ABC DEF XYZ
as.integer(rawdata$Code)
## [1] 2 3 1 4
as.character(rawdata$Code)
## [1] "ABC" "DEF" "" "XYZ"
# Re-read using stringsAsFactors=FALSE
rawdata <- read.csv("Sample.csv", stringsAsFactors = FALSE)
rawdata
## Grade Age Code Amount Start
## 1 A 23 ABC 48.982 2014-01-01
## 2 B 56 DEF 2.718 2014-01-02
## 3 F NA 0.000
## 4 B 35 XYZ 3.142 2014-01-05
str(rawdata)
## 'data.frame': 4 obs. of 5 variables:
## $ Grade : chr "A" "B" "F" "B"
## $ Age : int 23 56 NA 35
## $ Code : chr "ABC" "DEF" "" "XYZ"
## $ Amount: num 48.98 2.72 0 3.14
## $ Start : chr "2014-01-01" "2014-01-02" "" "2014-01-05"
rawdata <- read.csv("Sample.csv", colClasses = c("factor", "integer", "character",
"numeric", "Date"))
str(rawdata)
## 'data.frame': 4 obs. of 5 variables:
## $ Grade : Factor w/ 3 levels "A","B","F": 1 2 3 2
## $ Age : int 23 56 NA 35
## $ Code : chr "ABC" "DEF" "" "XYZ"
## $ Amount: num 48.98 2.72 0 3.14
## $ Start : Date, format: "2014-01-01" "2014-01-02" ...
In 32-bit R:
N <- 1000000L
object.size(rep(0L, N)) # a million integers
## 4000040 bytes
object.size(rep(0, N)) # a million doubles
## 8000040 bytes
nrow dictates number of rows to read.
“NULL” type in colClasses supprsses reading of data column
selected <- read.csv("Sample.csv", nrow = 2, header = TRUE, colClasses = c("NULL",
"integer", "NULL", "numeric", "NULL"))
selected
## Age Amount
## 1 23 48.982
## 2 56 2.718
This trick can be used to read selected columns from a mulitmillion-row file that would otherwise fill available memory.
The default write.csv adds some information that may not be in the original file.
# Default write.csv
write.csv(rawdata) # echo to screen to review
## "","Grade","Age","Code","Amount","Start"
## "1","A",23,"ABC",48.982,2014-01-01
## "2","B",56,"DEF",2.7183,2014-01-02
## "3","F",NA,"",0,NA
## "4","B",35,"XYZ",3.1416,2014-01-05
write.csv(rawdata, "Sample-Copy-Default.csv") # write to file
write.csv defaults to including row.names and quotes around
# Drop row names.
write.csv(rawdata, row.names = FALSE) # echo to screen
## "Grade","Age","Code","Amount","Start"
## "A",23,"ABC",48.982,2014-01-01
## "B",56,"DEF",2.7183,2014-01-02
## "F",NA,"",0,NA
## "B",35,"XYZ",3.1416,2014-01-05
write.csv(rawdata, "Sample-Copy-No-RowNames.csv", row.names = FALSE)
# Exactly the same as original file (for this case)
write.csv(rawdata, row.names = FALSE, quote = FALSE, na = "") # echo to screen
## Grade,Age,Code,Amount,Start
## A,23,ABC,48.982,2014-01-01
## B,56,DEF,2.7183,2014-01-02
## F,,,0,
## B,35,XYZ,3.1416,2014-01-05
write.csv(rawdata, "Sample-Copy-Exact.csv", row.names = FALSE, quote = FALSE,
na = "")
write.csv(rawdata, file("clipboard"), row.names = FALSE, quote = FALSE, na = "") # Write to clipboard