`factor` or `as.factor`?
Publish date: Sep 22, 2021
Tags: Programming
The story
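When plotting character variables in ggplot2, the categories are sorted alphabetically on the axis, and putting them in a custom order is always troublesome.
library(tidyverse)  # assumed setup: provides %>%, ggplot2 and forcats
library(gapminder)  # assumed source of the gapminder data

gapminder %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot()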

There are several ways to solve this problem:
# 1. Use the `fct_relevel` function to change the order inside the plotting call.
# Pass the levels unnamed: naming the argument (`levels = c(...)`) is what produces the warning
# "Outer names are only allowed for unnamed scalar atomic inputs".
gapminder %>%
  ggplot(aes(x = fct_relevel(continent, "Africa", "Asia", "Americas", "Europe", "Oceania"), y = lifeExp)) +
  geom_boxplot()
- This method leaves the original data unchanged, which is useful if we want to make other plots with the original order.
# 2. Use the `factor` or `reorder` function to convert the chr variable to a factor before plotting.
gapminder <- gapminder %>%
  mutate(continent = as.factor(continent, levels = c("Africa","Asia","Americas","Europe","Oceania")))
## Error in `mutate()`:
## ! Problem while computing `continent = as.factor(...)`.
## Caused by error in `as.factor()`:
## ! unused argument (levels = c("Africa", "Asia", "Americas", "Europe", "Oceania"))

What’s wrong?
factor and as.factor
# 2. Use the `factor` or `reorder` function to convert the chr variable to a factor before plotting.
gapminder <- gapminder %>%
  mutate(continent = factor(continent, levels = c("Africa", "Asia", "Americas", "Europe", "Oceania")))  # or replace `factor` with `ordered`, not `as.ordered`
The problem is that I used as.factor instead of factor. What’s the difference between them?
- From the source code, I found that as.factor does not have levels as a parameter. It is a thin wrapper around the factor function. If the column is already a factor, it is returned unchanged, and a plain integer column takes a fast path, so as.factor is more efficient in those cases. For anything else, including characters, it simply calls factor(x). So it can convert characters to factors, but it cannot manually specify the levels.
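A minimal sketch of how the two functions treat a character vector, using base R only. The vector `sizes` is made up for illustration and is not part of gapminder.

```r
# Hypothetical character vector, used only for illustration
sizes <- c("small", "large", "medium", "small")

# as.factor() converts characters, but the levels come out in sorted order
f_default <- as.factor(sizes)
levels(f_default)

# factor() accepts `levels`, so the order can be set by hand
f_custom <- factor(sizes, levels = c("small", "medium", "large"))
levels(f_custom)

# as.factor() has no `levels` argument, so this call fails;
# tryCatch() captures the error message instead of stopping the script
res <- tryCatch(
  as.factor(sizes, levels = c("small", "medium", "large")),
  error = function(e) conditionMessage(e)
)
res
```

Only `factor` (or `ordered`) lets you choose the level order, which is what the plot axis follows.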
Example:
all_years = as.factor(gapminder$year)[1:5]
factor(all_years)
## [1] 1952 1957 1962 1967 1972
## Levels: 1952 1957 1962 1967 1972
as.factor(all_years)
## [1] 1952 1957 1962 1967 1972
## Levels: 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
See the difference? :D
I would appreciate it if you could explain why in the discussion below~
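One likely explanation, judging from the source code below: as.factor hands back an input that is already a factor untouched, so unused levels survive, while factor rebuilds the levels from the values actually present. A small sketch with a made-up factor `f` (hypothetical, not from gapminder):

```r
# Hypothetical factor with four levels
f <- factor(c("a", "b", "c", "d"))
sub <- f[1:2]              # subsetting keeps all four levels

nlevels(as.factor(sub))    # already a factor, returned as-is: 4 levels
nlevels(factor(sub))       # levels recomputed from the values present: 2 levels
identical(as.factor(sub), sub)
```

This mirrors the year example: the first five rows contain only five distinct years, so `factor` keeps five levels and `as.factor` keeps all twelve.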
Source Code:
as.factor
## function (x)
## {
##     if (is.factor(x))
##         x
##     else if (!is.object(x) && is.integer(x)) {
##         levels <- sort.int(unique.default(x))
##         f <- match(x, levels)
##         levels(f) <- as.character(levels)
##         if (!is.null(nx <- names(x)))
##             names(f) <- nx
##         class(f) <- "factor"
##         f
##     }
##     else factor(x)
## }
## <bytecode: 0x12b3260e8>
## <environment: namespace:base>
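To make the three branches of this source concrete, here is a short sketch in base R. The inputs `f`, `i` and `ch` are made up for illustration.

```r
# Branch 1: the input is already a factor, so it is returned unchanged
f <- factor(c("x", "y"), levels = c("y", "x"))
identical(as.factor(f), f)

# Branch 2: a plain integer vector takes the fast path without calling factor()
i <- c(3L, 1L, 3L, 2L)
fi <- as.factor(i)
levels(fi)       # sorted unique values, stored as character
as.integer(fi)   # position of each value among the levels

# Branch 3: anything else (here a character vector) falls through to factor(x)
ch <- c("b", "a", "b")
identical(as.factor(ch), factor(ch))
```

None of the branches takes a `levels` argument, which is why the call in `mutate()` above fails.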