A couple of weeks ago, I was looking for a package I had previously come across that prints summary statistics with inline histograms. I checked all my bookmarks and liked tweets, but I couldn’t find it! So I asked on Twitter, and fortunately Maëlle Salmon read the tweet and pointed me to skimr
by ropenscilabs, which actually releases many useful packages.
In this post, I will focus on spark histograms in summary statistics and beyond. I will give an example of summarizing data in a dataframe and adding a new column with spark histograms, since I like to play with list columns and fit anything into them!
skimr Summary Statistics
Basically, skimr can show summary statistics that one can go through quickly, handling different data types (numerics, factors, etc.). For example, if we skim storms and look at one variable, we can see the summaries as follows:
## load libraries
library(tidyverse)
library(skimr)
library(knitr)
library(lubridate)
library(DT)
## Set locale (for windows)
slang <- Sys.setlocale("LC_CTYPE", "Chinese")
## skim storms
storm_skim <- skim(storms)
storm_skim %>%
  filter(var == "wind") %>%
  kable()
var | type | stat | level | value |
---|---|---|---|---|
wind | integer | missing | .all | 0.00000 |
wind | integer | complete | .all | 10010.00000 |
wind | integer | n | .all | 10010.00000 |
wind | integer | mean | .all | 53.49500 |
wind | integer | sd | .all | 26.21387 |
wind | integer | min | .all | 10.00000 |
wind | integer | median | .all | 45.00000 |
wind | integer | quantile | 25% | 30.00000 |
wind | integer | quantile | 75% | 65.00000 |
wind | integer | max | .all | 160.00000 |
wind | integer | hist | ▂▇▅▃▂▂▁▁▁▁ | 0.00000 |
So we can see the inline histogram, which might be useful while skimming a summary of many variables. We can also extract the histograms for all the numeric variables as follows:
## print kable
storm_skim %>%
  filter(stat == "hist") %>%
  kable()
var | type | stat | level | value |
---|---|---|---|---|
year | numeric | hist | ▃▃▃▆▇▆▇▇▇▆ | 0 |
month | numeric | hist | ▁▁▁▁▁▂▅▇▃▁ | 0 |
day | integer | hist | ▇▆▆▆▆▆▆▆▆▆ | 0 |
hour | numeric | hist | ▇▁▇▁▁▇▁▇▁▁ | 0 |
lat | numeric | hist | ▂▇▇▆▇▇▅▂▁▁ | 0 |
long | numeric | hist | ▁▅▇▇▇▆▆▃▁▁ | 0 |
wind | integer | hist | ▂▇▅▃▂▂▁▁▁▁ | 0 |
pressure | integer | hist | ▁▁▁▁▁▁▂▃▇▂ | 0 |
ts_diameter | numeric | hist | ▇▇▅▂▁▁▁▁▁▁ | 0 |
hu_diameter | numeric | hist | ▇▁▁▁▁▁▁▁▁▁ | 0 |
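Note that each spark histogram is just a character string stored in the level column, so we can also pull the strings out directly. A small sketch using the same storm_skim object from above:
## extract the spark histograms as a plain character vector
storm_skim %>%
  filter(stat == "hist") %>%
  pull(level)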
Spark Histograms Beyond Summary Statistics
After exploring skimr, I thought: what if I wanted spark histograms in a dataframe as a summary for each group, i.e. something like df %>% group_by(x) %>% summarize(hist = plot_spark_hist())? This is part of my interest in fitting anything into list columns!
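For illustration, here is a minimal sketch of what such a plot_spark_hist() helper could look like. The name plot_spark_hist is just the hypothetical one from above, and the binning is a plain cut()/table() approximation rather than anything taken from skimr:
## a rough sketch of a spark-histogram helper (not from skimr)
plot_spark_hist <- function(x, bins = 10) {
  ticks <- c("▁", "▂", "▃", "▄", "▅", "▆", "▇")
  ## bin the values and scale the counts to the block characters
  counts <- table(cut(x, breaks = bins))
  idx <- ceiling(counts / max(counts) * length(ticks))
  idx[idx == 0] <- 1
  paste0(ticks[idx], collapse = "")
}
## e.g. one spark histogram per storm status
storms %>%
  group_by(status) %>%
  summarize(wind_hist = plot_spark_hist(wind))
That said, skimr already ships the histograms with the rest of the summary, so below I will stick with skim() and simply pull the histogram out of its output.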
For the next example, I will use the Wuzzuf_Job_Posts_Sample.csv dataset released on Kaggle here. The dataset provides a sample of jobs posted in 2014-2016 with some details. Here we will focus on the 2015 posts.
First: we will parse the post date and extract the jobs posted in 2015.
## read the data (assuming the csv is in the working directory)
jobs <- read_csv("Wuzzuf_Job_Posts_Sample.csv")
## parse date
jobs <- jobs %>%
  mutate(post_date = as.Date(post_date))
## filter jobs in 2015
jobs_2015 <- jobs %>%
  filter(year(post_date) == 2015)
Second: we will group by job_category1 and nest the rest of the data to keep it in the same dataframe. Then we can count the number of jobs in each category and select the top ten.
## select the top ten categories with their details nested in data column
jobs_2015_topx <- jobs_2015 %>%
  group_by(job_category1) %>%
  nest() %>%
  mutate(job_posts = map_int(data, ~nrow(.x))) %>%
  arrange(desc(job_posts)) %>%
  filter(between(row_number(), 1, 10))
Third: we will pick any numeric variable (let’s say views) and mutate the dataframe with a new column holding the spark histograms. It might seem confusing at first glance, but we can do it as follows:

- pass the nested data to a formula that skims a dataframe containing only the target variable views
- extract the histogram row from the resulting stats
- extract the level value, which holds the spark histogram (a character string after all)
## count views and mutate histograms
jobs_2015_top_hist <- jobs_2015_topx %>%
  mutate(views_sum = map_int(data, ~sum(.x[["views"]]))) %>%
  mutate(views_hist = map(data, ~skim(select(.x, views)) %>%
                            filter(stat == "hist")) %>%
           map_chr("level"))
## print kable excluding nested data
jobs_2015_top_hist %>%
  select(-data) %>%
  kable()
job_category1 | job_posts | views_sum | views_hist |
---|---|---|---|
IT/Software Development | 2812 | 3658423 | ▇▇▁▁▁▁▁▁▁▁ |
Engineering | 1674 | 2817031 | ▂▇▃▂▁▁▁▁▁▁ |
Customer Service/Support | 1334 | 2320181 | ▇▅▁▁▁▁▁▁▁▁ |
Marketing | 1201 | 1446364 | ▃▅▇▃▁▁▁▁▁▁ |
Sales/Retail/Business Development | 951 | 1210833 | ▂▇▇▃▁▁▁▁▁▁ |
Creative/Design | 777 | 1038488 | ▂▂▇▆▃▁▁▁▁▁ |
Quality Assurance/Quality Control | 199 | 405129 | ▃▇▆▃▁▁▁▁▁▁ |
Administration | 199 | 294276 | ▅▂▇▇▂▁▂▁▁▁ |
Editorial/Writing | 195 | 317639 | ▅▂▅▇▇▇▃▂▁▁ |
Installation/Maintenance/Repair | 129 | 220305 | ▃▂▇▇▃▂▁▁▁▁ |
Now we have a quick way to see the distribution of views across the different categories. Similarly, we can add more columns with other variables. This is definitely not enough for a thorough exploration, and one will need real histograms or other plots to gain a better understanding of the data. But after all, it is just something to skim the data. And if we can try it, why not?!
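For example, a quick sketch of adding one more summary column (the median views per category) alongside the histogram column:
## add the median views per category next to the spark histogram
jobs_2015_top_hist %>%
  mutate(views_median = map_dbl(data, ~median(.x[["views"]]))) %>%
  select(-data) %>%
  kable()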