POV: You’re tired of calculating means, but you have to. Data exploration tools in R

Part of my job is to review the results of various surveys from the Mexican Network of Parkinson’s Disease Research. We have a large number of surveys and responses, and I need to provide reports regularly and identify potential new projects for incoming students. To do this, I frequently conduct exploratory analyses using R. If you do something similar, you’ll agree that exploring datasets and performing routine statistics can become a repetitive and time-consuming task. Today, I’d like to share some libraries and functions in R that make this process easier.

To illustrate this, I’ve prepared a small example attached here.
Rutinarios.pdf (177.4 KB)

• glimpse() function from the dplyr library:
This function provides a quick overview of a dataframe, showing its structure in a compact format. It displays the number of rows and columns, plus a preview of each column's data type and first values. It is especially useful for large or complex datasets. It also saves a lot of time when you are using the dataframe as input for another function (yes, it turns out your variable was an integer while you needed a double, and that is why your code is not running).
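
Here is a minimal sketch of that integer-versus-double situation. The data frame and its column names are made up for illustration and are not taken from the attached example:

```r
library(dplyr)

# Hypothetical toy data; the columns are invented for illustration
patients <- data.frame(
  id    = 1:4,                      # 1:4 creates an integer column
  age   = c(61.5, 70, 58.2, 66),    # decimal values are stored as double
  onset = c("early", "late", "late", "early")
)

# Prints the number of rows and columns, then each column's type
# followed by its first values
glimpse(patients)

# If a downstream function needs a double, convert explicitly
patients$id <- as.numeric(patients$id)
```

In the glimpse() output, the type abbreviations (such as int, dbl, chr) are what tell you whether a column is what the next function expects.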

• tabyl() function from the janitor library:
You can use this to create frequency tables and cross-tabulations. It shows counts and proportions of categorical data, which helps in understanding the distribution of your data.
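
As a rough sketch, assuming a small made-up survey with two categorical columns (the names sex and stage are hypothetical), a one-way table and a cross-tabulation could look like this:

```r
library(janitor)

# Hypothetical toy data; the columns are invented for illustration
survey <- data.frame(
  sex   = c("F", "M", "F", "F", "M"),
  stage = c("early", "late", "early", "late", "late")
)

# One-way table: counts (n) and proportions (percent) for each level
sex_counts <- tabyl(survey, sex)
print(sex_counts)

# Two-way cross-tabulation: one row per sex, one column per stage
sex_by_stage <- tabyl(survey, sex, stage)
print(sex_by_stage)
```

The two-way table holds counts only. If you want row percentages instead, janitor's adorn_percentages() and adorn_pct_formatting() helpers can be piped onto it.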

• CreateTableOne() function from the tableone library:
This function lets you easily create a table of summary statistics for a study (the classic “Table 1”), which is common in clinical research. It provides descriptive statistics for continuous and categorical variables and allows stratification by groups. Personally, I wouldn’t use it for the final table in a manuscript, and I especially recommend double-checking that the test behind each p-value is the one you want, but it is incredibly useful for checking all your data at once.
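
A minimal sketch of a stratified table, using simulated data (the group, age, and sex columns are hypothetical). The second print() call shows one way to ask for different tests than the defaults, which is the part worth double-checking:

```r
library(tableone)

# Simulated toy data; the columns are invented for illustration
set.seed(1)
study <- data.frame(
  group = rep(c("control", "patient"), each = 20),
  age   = round(rnorm(40, mean = 65, sd = 8)),
  sex   = rep(c("F", "M", "M", "F", "M"), times = 8)
)

tab <- CreateTableOne(
  vars       = c("age", "sex"),   # variables to summarize
  strata     = "group",           # one column per group, plus p-values
  data       = study,
  factorVars = "sex"              # treat sex as categorical
)

# Default output: mean (SD) for continuous, n (%) for categorical variables
print(tab)

# Treat age as non-normal (median [IQR], nonparametric test)
# and use Fisher's exact test for sex
print(tab, nonnormal = "age", exact = "sex")
```

Comparing the two printed tables is a quick way to see how much the p-values depend on the test chosen.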

• summ() function from the epiDisplay library:
Overall, epiDisplay is a package for data exploration and result presentation of epidemiological data. It grew out of the older Epicalc package and provides functions for descriptive statistics, contingency tables, and common epidemiological analyses such as relative risks, odds ratios, and logistic regression models.

This package is very extensive and includes its own datasets (though many of its functions can be covered with base R). I highly suggest checking the documentation because there’s too much to cover!

The one I use most is summ(), which summarizes a whole data frame in a convenient table. It can also produce summary statistics and a graph for a single variable.
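
A short sketch of both uses, with a made-up numeric data frame (the column names are hypothetical):

```r
library(epiDisplay)

# Hypothetical toy data; the columns are invented for illustration
visits <- data.frame(
  age   = c(61, 70, 58, 66, 73, 64),
  score = c(12, 18, 9, 15, 20, 14)
)

# Whole data frame: one row of summary statistics per variable
summ(visits)

# Single variable: summary statistics plus a plot of the values
summ(visits$age)

# Same statistics without drawing the plot
summ(visits$age, graph = FALSE)
```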

4 Likes

Thanks for your post @paularp! I didn’t realize that dplyr had glimpse(), which looks like the dplyr equivalent of base R’s str().

I find the base R function summary() really useful as well for taking a quick peek at a dataframe. I’m looking forward to checking out these other packages that you listed. They look super useful!
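
For comparison, here is a quick sketch of the two base R calls on a made-up data frame (the column names are just for illustration):

```r
# Hypothetical toy data; no packages needed
df <- data.frame(
  age   = c(61.5, 70, 58.2, 66),
  onset = factor(c("early", "late", "late", "early"))
)

# Structure: dimensions, column types, and the first values
str(df)

# Per-column summary: quartiles and mean for numeric columns,
# counts per level for factors
summary(df)

# summary() also works on a single column and returns a named vector
age_summary <- summary(df$age)
age_summary[["Max."]]
```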

2 Likes