Dealing with factors

Last updated on 2020-05-26 4 min read R, factor

A `factor` is an headache

I have a dataset, cleaning which has been a pain lately. I’m going to use 20 observations of the imported dataset in this post to demonstrate how pathetically have I been advancing with it.

plot	jan_23_2017	jan_26_2017	jan_29_2017	feb_02_2017
1	0	0	b	am
2	f	s	1p	10p
3	b	a	sm	3p
4	b	b	ap	2
5	0	b	bp	s
6	b	a	sp	3

Providing it a context, the colums represent multiple observations of same variable at different dates, as apparent from the column names.

One can observe from the str(head_boot) that the dataframe although has only 20 observations. Had a larger dataframe been subsetted, the factors with their usual nuisance carry over the all the levels, even currently unused ones. To demonstrate this, below I present how many factor levels the original dataframe had and what changes after a subset (slice, filter have same feature) operation.

# original dataframe
head_boot %>% select_if(is.factor) %>% map_int(function(x) length(levels(x)))

## jan_23_2017 jan_26_2017 jan_29_2017 feb_02_2017 
##           5           6          14          15

# subsetted dataframe
subset.data.frame(head_boot, subset = head_boot$jan_23_2017 %in% 
    c("s", "0"), select = sapply(head_boot, is.factor)) %>% 
    map_int(function(x) length(levels(x)))

## jan_23_2017 jan_26_2017 jan_29_2017 feb_02_2017 
##           5           6          14          15

A prophylactic measure is avoiding at all importing of character data columns as a factor. This could be done using setup option: options(stringsAsFactors = FALSE) ahead of import. Still, after the import many options exist. Some of the best alternatives out are listed below:

Use subsetted_df$factor_col[, drop=TRUE], where a vector of factor values is indexed with drop argument.
Convert factor columns of subsetted dataframe first to character and then back to factors class.
Use forcats::fct_drop(f) for more reliable NA handling
Use droplevels() or droplevels.data.frame()

To give a realistic feel to data, factor levels may be recoded or relabelled with more informative labels.

suit_labels <- factor(c("before_boot_n_boot", "heading", 
    "anthesis"), levels = c("before_boot_n_boot", "heading", 
    "anthesis"), ordered = T)

# what are all the levels
head_boot_lvl <- head_boot %>% select_if(is.factor) %>% 
    map(levels) %>% unlist() %>% unique()

# purrr::cross_df preserves the order,
# tidyr::crossing doesn't
possible_hb_lvl <- cross_df(list(post = factor(c("m", 
    "", "p"), ordered = TRUE), pre = factor(c("b", 
    "f", "a", "s", "1", "2", "3", "4", "10", "11", 
    "12"), ordered = TRUE))) %>% select(rev(everything())) %>% 
    # arrange(match(post, c('m', '', 'p'))) %>% # don't
# need this ordering
unite(col = "pre_post", sep = "") %>% add_row(pre_post = "0", 
    .before = 1) %>% mutate(pre_post = factor(pre_post, 
    levels = pre_post, ordered = TRUE))

# mutate original dataframe to represent ordinal
# factors
head_boot <- head_boot %>% mutate_if(is.factor, function(x) factor(x, 
    levels = possible_hb_lvl$pre_post, labels = possible_hb_lvl$pre_post, 
    ordered = TRUE))

# function to recode factor levels (better to use
# `fct_recode()`)
fct_minimize <- function(x) case_when(x < "1m" ~ suit_labels[1], 
    x >= "1m" & x < "10m" ~ suit_labels[2], x >= "10m" ~ 
        suit_labels[3], TRUE ~ as.factor(x))

# mutate original dataframe with function to recode
# factor levels
head_boot <- head_boot %>% mutate_if(is.factor, fct_minimize)

str(head_boot)

## tibble [20 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ plot       : num [1:20] 1 2 3 4 5 6 7 8 9 10 ...
##  $ jan_23_2017: Ord.factor w/ 3 levels "before_boot_n_boot"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ jan_26_2017: Ord.factor w/ 3 levels "before_boot_n_boot"<..: 1 1 1 1 1 1 1 2 1 1 ...
##  $ jan_29_2017: Ord.factor w/ 3 levels "before_boot_n_boot"<..: 1 2 1 1 1 1 2 2 2 1 ...
##  $ feb_02_2017: Ord.factor w/ 3 levels "before_boot_n_boot"<..: 1 3 2 2 1 2 2 3 2 1 ...

A more telling dataframe looks like the one below.

knitr::kable(head(head_boot), format = "html", align = "c") %>% 
    kableExtra::kable_styling(bootstrap_options = c("striped", 
        "hover"), font_size = 10, position = "center") %>% 
    kableExtra::row_spec(0, bold = TRUE) %>% kableExtra::column_spec(1, 
    bold = TRUE)

plot	jan_23_2017	jan_26_2017	jan_29_2017	feb_02_2017
1	before_boot_n_boot	before_boot_n_boot	before_boot_n_boot	before_boot_n_boot
2	before_boot_n_boot	before_boot_n_boot	heading	anthesis
3	before_boot_n_boot	before_boot_n_boot	before_boot_n_boot	heading
4	before_boot_n_boot	before_boot_n_boot	before_boot_n_boot	heading
5	before_boot_n_boot	before_boot_n_boot	before_boot_n_boot	before_boot_n_boot
6	before_boot_n_boot	before_boot_n_boot	before_boot_n_boot	heading

Visualizing data

Bar plotting of the ordinal data might reveal interesting insights, so let’s prepare ggplot graphs.

walk2(.x = head_boot %>% select(jan_23_2017:feb_02_2017) %>% 
    colnames(), .y = as.character(strptime(str_subset(colnames(head_boot), 
    "^\\w{3}_\\d{2}"), format = "%b_%d_%Y")), .f = ~print(ggplot(aes(x = get(.x)), 
    data = head_boot) + geom_bar(position = "dodge", 
    stat = "count", width = 0.5) + xlab(.y) + theme(axis.text.x = element_text(angle = 40, 
    size = 9, vjust = 0.6))))

Figure 1: Growth stages at different dates

Figure 2: Growth stages at different dates

Figure 3: Growth stages at different dates

Figure 4: Growth stages at different dates

With dplyr, It’s seemless to obtain a tabular summary of what has been visualized above.

# credit:
# https://stackoverflow.com/a/46340237/6725057
head_boot %>% select_if(is.factor) %>% tidyr::gather(date, 
    stage) %>% dplyr::group_by(date, stage) %>% dplyr::count() %>% 
    dplyr::ungroup() %>% tidyr::spread(date, n) %>% 
    knitr::kable(format = "html", align = "c") %>% 
    kableExtra::kable_styling(bootstrap_options = c("striped", 
        "hover"), font_size = 10, position = "center") %>% 
    kableExtra::row_spec(0, bold = TRUE) %>% kableExtra::column_spec(1, 
    bold = TRUE)

stage	feb_02_2017	jan_23_2017	jan_26_2017	jan_29_2017
anthesis	6
before_boot_n_boot	6	19	15	12
heading	8	1	5	8

Dealing with factors

A factor is an headache

Visualizing data

A `factor` is an headache