Making Summary Tables in R

Background

Table output is one of R's richest and most satisfying features. The R Markdown ecosystem offers plenty of packages to create, format, and present tables beautifully. That is extremely useful on the one hand, but on the other it can be daunting to choose among the many packages available for formatting a table. I have gathered a few suggestions and lists here to help with that dilemma.

Once in a while someone writes a blog post that addresses these issues, and this topic is no exception: https://rfortherestofus.com/2019/11/how-to-make-beautiful-tables-in-r/ has a wonderfully curated list of several such options. My intent is to supplement the information in that post.

General purpose tables

Here goes the list of packages:

  1. Data frame printing is a general setting for R Markdown documents rather than a package. It is controlled by the df_print option in the YAML header:
---
title: Some good amount of table
output:
  html_document:
    df_print: paged
---

The df_print option can take other values such as default, kable and tibble. More on this at https://bookdown.org/yihui/rmarkdown/html-document.html#data-frame-printing.
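As a rough illustration of what the kable value produces, the chunk below calls knitr::kable directly on a data frame. The selected mtcars columns and the caption text are assumptions made for the example; they are not taken from the post.

```r
library(knitr)

# A small data frame to print; mtcars ships with base R
cars_head <- head(mtcars[, c("mpg", "cyl", "disp", "wt")])

# Render a plain table, similar in spirit to what df_print: kable gives
tab <- kable(cars_head, digits = 1, caption = "First rows of mtcars")
tab
```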

  1. gt package
  2. kable + kableExtra.

Here are a bunch of appealing examples that will surely entice you into using this combination of packages.

vignette(package = "kableExtra", topic = "awesome_table_in_pdf")
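For a quick taste before opening the vignette, here is a minimal sketch of the kable + kableExtra combination. It assumes kableExtra 1.2.0 or later (for kbl); the data subset, caption, and styling choices are illustrative only.

```r
library(kableExtra)

# Build an HTML table from the first rows of mtcars
tab <- kbl(head(mtcars[, c("mpg", "cyl", "disp", "wt")]),
           format = "html",
           caption = "First rows of mtcars")

# Add striped rows and hover highlighting, and keep the table narrow
tab <- kable_styling(tab,
                     bootstrap_options = c("striped", "hover"),
                     full_width = FALSE)
tab
```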

Sharla Gelfand also maintains a whole repository for sharing examples of kableExtra use. Check it out at: https://github.com/sharlagelfand/kableExtra-cookbook

Additionally, I have forked the repo and tried to contribute some of my own hacks (not exactly my own, but learnt elsewhere on the internet) to the bookdown project.

  1. formattable
  2. DT. More at: https://rstudio.github.io/DT/
  3. reactable. A demonstration of use at: https://projects.fivethirtyeight.com/2019-womens-world-cup-predictions/
  4. flextable: https://davidgohel.github.io/flextable/index.html
  5. huxtable. https://hughjonesd.github.io/huxtable/
  6. rhandsontable. This package is extremely helpful if you have dirty data: it lets you edit the data manually, much like working in spreadsheet software. More on: https://jrowen.github.io/rhandsontable/
  7. pixiedust. https://github.com/nutterb/pixiedust

Summary tables

rtables package

At the time of writing this post, the rtables package was available only as a GitHub project of G. Becker, and the package was built from a specific branch. As he mentions, though, the project has a long history: it was released as open source well after being used as proprietary software for some time.

We start by installing and loading essential libraries.
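The original chunk is not shown, so the following is only a sketch of the setup step. The package list is an assumption based on what the rest of this section mentions (rtables and agridat), and the install call is only printed as a hint rather than run.

```r
# Packages assumed for this section; the post does not list them explicitly
pkgs <- c("rtables", "agridat")

# Find the ones that are not installed yet
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

if (length(missing_pkgs) > 0) {
  # Print a hint instead of installing silently
  message("Install first: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
} else {
  library(rtables)
  library(agridat)
}
```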

Throughout, the williams.trees dataset from the agridat package is used. Besides the existing factors and numeric variables, an additional factor is generated by a random process, because the existing gen (genotype information) variable is nested. A nested variable is one whose summary is available only within a specific grouping, not overall.
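A minimal sketch of this data preparation might look as follows. The name grp, the seed, and the decision to drop incomplete records are all assumptions; the level labels D, C and B are borrowed from the printed tables further down.

```r
# Load the dataset; agridat documents the columns env, gen, height, survival
data("williams.trees", package = "agridat")

# Drop incomplete records (an assumption about how the post handled NAs)
trees <- na.omit(williams.trees)

# Hypothetical extra factor, drawn at random and independent of env and gen
set.seed(2020)
trees$grp <- factor(sample(c("D", "C", "B"), nrow(trees), replace = TRUE),
                    levels = c("D", "C", "B"))

# Cross-tabulate to confirm every env has records in every group
table(trees$env, trees$grp)
```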

Using rtables::split_cols_by, we split the analysis variable into multiple columns formed by a grouping variable.
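The call that produced the output is not shown, so here is one way it could be sketched. It assumes height is the analysis variable and uses a hypothetical random factor grp nested under env; because grp is random, the numbers will not match the author's output printed below.

```r
library(rtables)

# Data as described above; grp is a hypothetical random factor
data("williams.trees", package = "agridat")
trees <- na.omit(williams.trees)
set.seed(2020)
trees$grp <- factor(sample(c("D", "C", "B"), nrow(trees), replace = TRUE))

# Columns: one block per env, split again by grp; cells: mean height
lyt <- basic_table() |>
  split_cols_by("env") |>
  split_cols_by("grp") |>
  analyze("height", afun = mean, format = "xx.x")

tbl <- build_table(lyt, trees)
tbl
```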

##             Chanthaburi              HuaiBong               Ratchaburi               SaiThong                Sakaerat                 SiSaKet       
##          D       C       B       D       C       B       D       C       B       D       C       B       D       C       B       D       C       B  
## ----------------------------------------------------------------------------------------------------------------------------------------------------
## mean   227.7   306.6   247.9   201.5   179.5   227.8   445.5   464.5   474.2   655.9   699.1   576.7   285.5   318.6   397.3   421.1   439.5   537.4

Row splitting can also be done as shown.
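A sketch of such a row split, with a custom analysis function returning several statistics per cell, is given here. The function name summ_fun, the chosen formats, and the random factor grp are assumptions; the author's actual output follows below.

```r
library(rtables)

data("williams.trees", package = "agridat")
trees <- na.omit(williams.trees)
set.seed(2020)
trees$grp <- factor(sample(c("D", "C", "B"), nrow(trees), replace = TRUE))

# Hypothetical custom summary: one table row per statistic
summ_fun <- function(x) {
  in_rows(
    mean  = rcell(mean(x), format = "xx.xx"),
    sd    = rcell(sd(x), format = "xx.xx"),
    range = rcell(diff(range(x)), format = "xx"),
    max   = rcell(max(x), format = "xx"),
    min   = rcell(min(x), format = "xx")
  )
}

# Columns by env, row groups by grp with a count (percent) header row
lyt <- basic_table() |>
  split_cols_by("env") |>
  split_rows_by("grp") |>
  summarize_row_groups() |>
  analyze("height", afun = summ_fun)

tbl <- build_table(lyt, trees)
tbl
```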

##           Chanthaburi    HuaiBong    Ratchaburi    SaiThong     Sakaerat     SiSaKet  
## --------------------------------------------------------------------------------------
## D          15 (50%)     11 (40.7%)   11 (32.4%)   16 (48.5%)   11 (33.3%)   9 (25.7%) 
##   mean      227.67        201.55       445.45       655.94       285.45       421.11  
##   sd        107.19        91.81        141.97       277.59       142.17       177.37  
##   range       315          322          455          957          512          561    
##   max         391          397          673          1073         575          658    
##   min         76            75          218          116           63           97    
## C          5 (16.7%)     10 (37%)    11 (32.4%)   7 (21.2%)    10 (30.3%)   12 (34.3%)
##   mean       306.6        179.5        464.55       699.14       318.6        439.5   
##   sd         195.3        95.81        142.08       320.74       136.36       177.74  
##   range       489          249          508          820          458          515    
##   max         569          345          764          1083         632          735    
##   min         80            96          256          263          174          220    
## B         10 (33.3%)    6 (22.2%)    12 (35.3%)   10 (30.3%)   12 (36.4%)    14 (40%) 
##   mean       247.9        227.83       474.17       576.7        397.33       537.43  
##   sd         142.2        89.87        154.73       214.48       190.91       209.85  
##   range       471          225          555          580          568          681    
##   max         555          327          764          899          658          839    
##   min         84           102          209          319           90          158

In the previous table we used a custom function for summarizing. However, we can also use R's existing helper functions, such as summary.
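One way to plug summary into a layout is through rtables::list_wrap_x, which turns the named vector returned by summary into one row per element. The sketch below also calls add_colcounts to get the (N=...) header line; the random factor grp is again an assumption, and the author's own output follows.

```r
library(rtables)

data("williams.trees", package = "agridat")
trees <- na.omit(williams.trees)
set.seed(2020)
trees$grp <- factor(sample(c("D", "C", "B"), nrow(trees), replace = TRUE))

# summary() yields Min., 1st Qu., Median, Mean, 3rd Qu., Max.
lyt <- basic_table() |>
  split_cols_by("env") |>
  add_colcounts() |>
  split_rows_by("grp") |>
  summarize_row_groups() |>
  analyze("height", afun = list_wrap_x(summary), format = "xx.xx")

tbl <- build_table(lyt, trees)
tbl
```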

##             Chanthaburi    HuaiBong    Ratchaburi    SaiThong     Sakaerat     SiSaKet  
##               (N=30)        (N=27)       (N=34)       (N=33)       (N=33)       (N=35)  
## ----------------------------------------------------------------------------------------
## D            15 (50%)     11 (40.7%)   11 (32.4%)   16 (48.5%)   11 (33.3%)   9 (25.7%) 
##   Min.          76            75          218          116           63           97    
##   1st Qu.      154.5        145.5        358.5        464.75        207          353    
##   Median        186          189          436         718.5         235          414    
##   Mean        227.67        201.55       445.45       655.94       285.45       421.11  
##   3rd Qu.      325.5         227          538          884          362          521    
##   Max.          391          397          673          1073         575          658    
## C            5 (16.7%)     10 (37%)    11 (32.4%)   7 (21.2%)    10 (30.3%)   12 (34.3%)
##   Min.          80            96          256          263          174          220    
##   1st Qu.       163         114.5         377          476          256         316.25  
##   Median        306          138          439          802          278          358    
##   Mean         306.6        179.5        464.55       699.14       318.6        439.5   
##   3rd Qu.       415         227.75       559.5         897         358.75        619    
##   Max.          569          345          764          1083         632          735    
## B           10 (33.3%)    6 (22.2%)    12 (35.3%)   10 (30.3%)   12 (36.4%)    14 (40%) 
##   Min.          84           102          209          319           90          158    
##   1st Qu.      141.5        161.25       422.75       420.75       205.75       404.25  
##   Median       217.5        254.5        481.5         493          458          568    
##   Mean         247.9        227.83       474.17       576.7        397.33       537.43  
##   3rd Qu.     323.25        287.75        540          753         529.75       677.75  
##   Max.          555          327          764          899          658          839

In the earlier calls, the analysis function received a single variable as its data argument. The entire dataset can be passed instead when the summary involves multiple variables.
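In rtables, an analysis function whose first argument is named df receives the data frame subset for the cell rather than a single vector. The sketch below uses that to count genotype labels. The post does not show which column was counted, so geno is a hypothetical five-level random factor added for illustration, and geno_fun is likewise a made-up name.

```r
library(rtables)

data("williams.trees", package = "agridat")
trees <- na.omit(williams.trees)
set.seed(2020)
trees$grp  <- factor(sample(c("D", "C", "B"), nrow(trees), replace = TRUE))
# Hypothetical genotype label so that repeats occur within a cell
trees$geno <- factor(sample(paste0("G", 1:5), nrow(trees), replace = TRUE))

# df is the cell's data frame; .N_col is the column count supplied by rtables
geno_fun <- function(df, .N_col) {
  n_total  <- nrow(df)
  n_unique <- length(unique(df$geno))
  n_repeat <- sum(duplicated(df$geno))  # records beyond the first per label
  in_rows(
    "Total genotypes"            = rcell(n_total, format = "xx"),
    "Unique genotypes"           = rcell(n_unique, format = "xx"),
    "Genotypes with > 1 records" = rcell(c(n_repeat, n_repeat / .N_col),
                                         format = "xx (xx.xx%)")
  )
}

lyt <- basic_table() |>
  split_cols_by("env") |>
  add_colcounts() |>
  split_rows_by("grp") |>
  summarize_row_groups() |>
  analyze("geno", afun = geno_fun)

tbl <- build_table(lyt, trees)
tbl
```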

##                                Chanthaburi    HuaiBong    Ratchaburi    SaiThong      Sakaerat     SiSaKet  
##                                  (N=30)        (N=27)       (N=34)       (N=33)        (N=33)       (N=35)  
## ------------------------------------------------------------------------------------------------------------
## D                               15 (50%)     11 (40.7%)   11 (32.4%)   16 (48.5%)    11 (33.3%)   9 (25.7%) 
##   Total genotypes                  15            11           11           16            11           9     
##   Unique genotypes                  3            5            5             5            4            4     
##   Genotypes with > 1 records    12 (40%)     6 (22.22%)   6 (17.65%)   11 (33.33%)   7 (21.21%)   5 (14.29%)
## C                               5 (16.7%)     10 (37%)    11 (32.4%)    7 (21.2%)    10 (30.3%)   12 (34.3%)
##   Total genotypes                   5            10           11            7            10           12    
##   Unique genotypes                  3            4            4             4            5            5     
##   Genotypes with > 1 records    2 (6.67%)    6 (22.22%)   7 (20.59%)    3 (9.09%)    5 (15.15%)    7 (20%)  
## B                              10 (33.3%)    6 (22.2%)    12 (35.3%)   10 (30.3%)    12 (36.4%)    14 (40%) 
##   Total genotypes                  10            6            12           10            12           14    
##   Unique genotypes                  4            3            4             4            5            5     
##   Genotypes with > 1 records     6 (20%)     3 (11.11%)   8 (23.53%)   6 (18.18%)    7 (21.21%)   9 (25.71%)

Also, instead of letting the column counts be computed automatically from the analysis variable, we can supply them manually by populating the column counts up front. This is done using tapply or map functions.
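A sketch of the tapply route is shown here, passing the counts through the col_counts argument of build_table. My understanding is that recent rtables documentation describes col_counts as deprecated, so treat this as an illustration of the older interface rather than the recommended one.

```r
library(rtables)

data("williams.trees", package = "agridat")
trees <- na.omit(williams.trees)
trees$env <- factor(trees$env)  # fix the column order via factor levels
set.seed(2020)
trees$grp <- factor(sample(c("D", "C", "B"), nrow(trees), replace = TRUE))

# One count per env, in the same order as the table columns
col_n <- as.integer(tapply(trees$height, trees$env, length))

lyt <- basic_table() |>
  split_cols_by("env") |>
  split_rows_by("grp") |>
  summarize_row_groups() |>
  analyze("height", afun = mean, format = "xx.xx")

# Supply the counts by hand instead of letting rtables compute them
tbl <- build_table(lyt, trees, col_counts = col_n)
tbl
```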

##                                Chanthaburi    HuaiBong    Ratchaburi    SaiThong      Sakaerat     SiSaKet  
##                                  (N=30)        (N=27)       (N=34)       (N=33)        (N=33)       (N=35)  
## ------------------------------------------------------------------------------------------------------------
## D                               15 (50%)     11 (40.7%)   11 (32.4%)   16 (48.5%)    11 (33.3%)   9 (25.7%) 
##   Total genotypes                  15            11           11           16            11           9     
##   Unique genotypes                  3            5            5             5            4            4     
##   Genotypes with > 1 records    12 (40%)     6 (22.22%)   6 (17.65%)   11 (33.33%)   7 (21.21%)   5 (14.29%)
## C                               5 (16.7%)     10 (37%)    11 (32.4%)    7 (21.2%)    10 (30.3%)   12 (34.3%)
##   Total genotypes                   5            10           11            7            10           12    
##   Unique genotypes                  3            4            4             4            5            5     
##   Genotypes with > 1 records    2 (6.67%)    6 (22.22%)   7 (20.59%)    3 (9.09%)    5 (15.15%)    7 (20%)  
## B                              10 (33.3%)    6 (22.2%)    12 (35.3%)   10 (30.3%)    12 (36.4%)    14 (40%) 
##   Total genotypes                  10            6            12           10            12           14    
##   Unique genotypes                  4            3            4             4            5            5     
##   Genotypes with > 1 records     6 (20%)     3 (11.11%)   8 (23.53%)   6 (18.18%)    7 (21.21%)   9 (25.71%)

Here are some of the handy utility functions that can be used on the go.
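The post does not show which utilities were meant. As a guess, the sketch below runs a few accessors that rtables provides for a built table: dimensions, row and column labels, and row subsetting.

```r
library(rtables)

data("williams.trees", package = "agridat")
trees <- na.omit(williams.trees)

# A small table to inspect: mean height per env
lyt <- basic_table() |>
  split_cols_by("env") |>
  analyze("height", afun = mean, format = "xx.x")
tbl <- build_table(lyt, trees)

dim(tbl)        # rows and columns
nrow(tbl)       # number of rows
ncol(tbl)       # number of columns
row.names(tbl)  # row labels
names(tbl)      # column labels
tbl[1, ]        # subset rows like a data frame
```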

qwraps2 package

For constructing simple whole-sample or subsample summary tables, the qwraps2 package has a simple interface. It comes with a richly illustrated vignette that uses the mtcars dataset.

The markup language needs to be set early on, in a code chunk, for the table to render in the proper format.
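A cut-down sketch in the spirit of the package vignette is given below. It assumes qwraps2 0.5.0 or later, where summary_table takes a by argument; the cars data frame and the shortened summaries list are illustrative, and the table printed after this chunk is the author's fuller version. In an R Markdown document the chunk would typically also need results = "asis".

```r
library(qwraps2)

# Set the markup language before building any table
options(qwraps2_markup = "markdown")

# Grouping factor labelled like the table header below
cars <- transform(mtcars,
                  cyl_factor = factor(cyl, levels = c(6, 4, 8),
                                      labels = paste(c(6, 4, 8), "cylinders")))

# Row groups, each a named list of one-sided formulas
summaries <- list(
  "Miles Per Gallon" = list(
    "min"       = ~ min(mpg),
    "max"       = ~ max(mpg),
    "mean (sd)" = ~ qwraps2::mean_sd(mpg)
  ),
  "Forward Gears" = list(
    "Three" = ~ qwraps2::n_perc0(gear == 3),
    "Four"  = ~ qwraps2::n_perc0(gear == 4),
    "Five"  = ~ qwraps2::n_perc0(gear == 5)
  )
)

tab <- summary_table(cars, summaries = summaries, by = "cyl_factor")
tab
```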

                    mtcars2           cyl_factor: 6 cylinders  cyl_factor: 4 cylinders  cyl_factor: 8 cylinders  P-value
                    (N = 32)          (N = 7)                  (N = 11)                 (N = 14)
---------------------------------------------------------------------------------------------------------------------------
Miles Per Gallon
   min              10.4              17.8                     21.4                     10.4
   max              33.9              21.4                     33.9                     19.2
   mean (sd)        20.09 ± 6.03      19.74 ± 1.45             26.66 ± 4.51             15.10 ± 2.56             P < 0.0001
Displacement
   min              71.1              145.0                    71.1                     275.8
   median           196.3             167.6                    108.0                    350.5
   max              472               258.0                    146.7                    472.0
   mean (sd)        230.72 ± 123.94   183.31 ± 41.56           105.14 ± 26.87           353.10 ± 67.77           P < 0.0001
Weight (1000 lbs)
   min              1.513             2.620                    1.513                    3.170
   max              5.424             3.460                    3.190                    5.424
   mean (sd)        3.22 ± 0.98       3.12 ± 0.36              2.29 ± 0.57              4.00 ± 0.76              P < 0.0001
Forward Gears                                                                                                    P < 0.0001
   Three            15 (47)           2 (29)                   1 (9)                    12 (86)
   Four             12 (38)           4 (57)                   8 (73)                   0 (0)
   Five             5 (16)            1 (14)                   2 (18)                   2 (14)

Alternatively, the p-value can be reported alongside the row group name, as exemplified in the package vignette.

gtsummary package

This package is a recent development, but its vignettes offer many more ready-to-preview examples. It is richer in features and more easily extensible because it draws upon the gt package. Check out the vignettes on the package's CRAN page: https://cran.r-project.org/web/packages/gtsummary/index.html.
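As a small taste, the sketch below builds a grouped summary with gtsummary::tbl_summary. Using mtcars grouped by cyl is my own choice, made to mirror the qwraps2 example; it does not come from the gtsummary vignettes.

```r
library(gtsummary)

# Summarise four variables, one column per number of cylinders
tbl <- tbl_summary(
  mtcars,
  by = cyl,
  include = c(mpg, disp, wt, gear)
)
tbl
```

By default, continuous variables are shown as median (IQR) and categorical ones as n (%).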
