A couple of weeks ago, I was looking for a package I had previously come across that prints summary statistics with inline histograms. I checked all my bookmarks and liked tweets, but I couldn’t find it! So I asked on Twitter, and fortunately Maëlle Salmon read the tweet and pointed me to skimr
by ropenscilabs, which actually releases many useful packages.
In this post, I will focus on spark histograms in summary statistics and beyond. I will give an example of summarizing data in a dataframe and adding a new column with spark histograms, since I like to play with list columns and fit anything into them!
skimr Summary Statistics
Basically, skimr can show summary statistics that one can go through quickly, handling different data types (numerics, factors, etc.). For example, if we skim storms and look at one variable, we can see the summaries as follows:
## load libraries
library(tidyverse)
library(skimr)
library(knitr)
library(lubridate)
library(DT)
## Set locale (for windows)
slang <- Sys.setlocale("LC_CTYPE", "Chinese")
## skim storms
storm_skim <- skim(storms)
storm_skim %>%
  filter(var == "wind") %>%
  kable()
var | type | stat | level | value |
---|---|---|---|---|
wind | integer | missing | .all | 0.00000 |
wind | integer | complete | .all | 10010.00000 |
wind | integer | n | .all | 10010.00000 |
wind | integer | mean | .all | 53.49500 |
wind | integer | sd | .all | 26.21387 |
wind | integer | min | .all | 10.00000 |
wind | integer | median | .all | 45.00000 |
wind | integer | quantile | 25% | 30.00000 |
wind | integer | quantile | 75% | 65.00000 |
wind | integer | max | .all | 160.00000 |
wind | integer | hist | ▂▇▅▃▂▂▁▁▁▁ | 0.00000 |
So we can see the inline histogram, which might be useful while skimming a summary of many variables. We can also extract the histograms for all the numeric variables as follows:
## print kable
storm_skim %>%
  filter(stat == "hist") %>%
  kable()
var | type | stat | level | value |
---|---|---|---|---|
year | numeric | hist | ▃▃▃▆▇▆▇▇▇▆ | 0 |
month | numeric | hist | ▁▁▁▁▁▂▅▇▃▁ | 0 |
day | integer | hist | ▇▆▆▆▆▆▆▆▆▆ | 0 |
hour | numeric | hist | ▇▁▇▁▁▇▁▇▁▁ | 0 |
lat | numeric | hist | ▂▇▇▆▇▇▅▂▁▁ | 0 |
long | numeric | hist | ▁▅▇▇▇▆▆▃▁▁ | 0 |
wind | integer | hist | ▂▇▅▃▂▂▁▁▁▁ | 0 |
pressure | integer | hist | ▁▁▁▁▁▁▂▃▇▂ | 0 |
ts_diameter | numeric | hist | ▇▇▅▂▁▁▁▁▁▁ | 0 |
hu_diameter | numeric | hist | ▇▁▁▁▁▁▁▁▁▁ | 0 |
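Note that each spark histogram is just a character string stored in the level column, so we can also pull the strings out directly. A small sketch using the same storm_skim object from above:
## extract the spark histograms as a plain character vector
storm_skim %>%
  filter(stat == "hist") %>%
  pull(level)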
Spark Histograms Beyond Summary Statistics
After exploring skimr, I thought: what if I wanted spark histograms in a dataframe as a summary for each group, i.e. something like df %>% group_by(x) %>% summarize(hist = plot_spark_hist())? This is part of my interest in fitting anything into list columns!
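For illustration, here is a minimal sketch of what such a plot_spark_hist() helper could look like. The name plot_spark_hist is just the hypothetical one from above, and the binning is a plain cut()/table() approximation rather than anything taken from skimr:
## a rough sketch of a spark-histogram helper (not from skimr)
plot_spark_hist <- function(x, bins = 10) {
  ticks <- c("▁", "▂", "▃", "▄", "▅", "▆", "▇")
  ## bin the values and scale the counts to the block characters
  counts <- table(cut(x, breaks = bins))
  idx <- ceiling(counts / max(counts) * length(ticks))
  idx[idx == 0] <- 1
  paste0(ticks[idx], collapse = "")
}
## e.g. one spark histogram per storm status
storms %>%
  group_by(status) %>%
  summarize(wind_hist = plot_spark_hist(wind))
That said, skimr already ships the histograms with the rest of the summary, so below I will stick with skim() and simply pull the histogram out of its output.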
For the next example, I will use the Wuzzuf_Job_Posts_Sample.csv dataset released on Kaggle here. The dataset provides a sample of jobs posted in 2014-2016 with some details. Here we will focus on the 2015 posts.
First: we will parse the post date and extract the jobs posted in 2015.
## read the data (assuming the csv is in the working directory)
jobs <- read_csv("Wuzzuf_Job_Posts_Sample.csv")
## parse date
jobs <- jobs %>%
  mutate(post_date = as.Date(post_date))
## filter jobs in 2015
jobs_2015 <- jobs %>%
  filter(year(post_date) == 2015)
Second: we will group by job_category1 and nest the rest of the data to keep it in the same dataframe. Then we can count the number of jobs in each category and select the top ten.
## select the top ten categories with their details nested in data column
jobs_2015_topx <- jobs_2015 %>%
  group_by(job_category1) %>%
  nest() %>%
  mutate(job_posts = map_int(data, ~nrow(.x))) %>%
  arrange(desc(job_posts)) %>%
  filter(between(row_number(), 1, 10))
Third: we will pick any numeric variable (let’s say views) and mutate the dataframe with a new column holding the spark histograms. It might seem confusing at first glance, but we can do it as follows:

- pass the nested data to a formula that skims a dataframe containing only the target variable views
- extract the histogram row from the resulting stats
- extract the level value, which holds the spark histogram (a character string after all)
## count views and mutate histograms
jobs_2015_top_hist <- jobs_2015_topx %>%
  mutate(views_sum = map_int(data, ~sum(.x[["views"]]))) %>%
  mutate(views_hist = map(data, ~skim(select(.x, views)) %>%
                            filter(stat == "hist")) %>%
           map_chr("level"))
## print kable excluding nested data
jobs_2015_top_hist %>%
  select(-data) %>%
  kable()
job_category1 | job_posts | views_sum | views_hist |
---|---|---|---|
IT/Software Development | 2812 | 3658423 | ▇▇▁▁▁▁▁▁▁▁ |
Engineering | 1674 | 2817031 | ▂▇▃▂▁▁▁▁▁▁ |
Customer Service/Support | 1334 | 2320181 | ▇▅▁▁▁▁▁▁▁▁ |
Marketing | 1201 | 1446364 | ▃▅▇▃▁▁▁▁▁▁ |
Sales/Retail/Business Development | 951 | 1210833 | ▂▇▇▃▁▁▁▁▁▁ |
Creative/Design | 777 | 1038488 | ▂▂▇▆▃▁▁▁▁▁ |
Quality Assurance/Quality Control | 199 | 405129 | ▃▇▆▃▁▁▁▁▁▁ |
Administration | 199 | 294276 | ▅▂▇▇▂▁▂▁▁▁ |
Editorial/Writing | 195 | 317639 | ▅▂▅▇▇▇▃▂▁▁ |
Installation/Maintenance/Repair | 129 | 220305 | ▃▂▇▇▃▂▁▁▁▁ |
Now we have a quick way to see the distribution of views across the different categories. Similarly, we can add more columns with other variables. This is definitely not enough for a thorough exploration, and one will need real histograms or other plots to gain a better understanding of the data. But after all, it is just something to skim the data. And if we can try it, why not?!
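For example, a quick sketch of adding one more summary column (the median views per category) alongside the histogram column:
## add the median views per category next to the spark histogram
jobs_2015_top_hist %>%
  mutate(views_median = map_dbl(data, ~median(.x[["views"]]))) %>%
  select(-data) %>%
  kable()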