# Stores the values in a variable called "flippers_length"
flippers_length <- c(18.1, 18.6, 19.5, 19.3, 19.0, 18.1, 19.5, 19.3, 19.0)
# Calculates the mean of the values stored in "flippers_length"
mean(flippers_length)
[1] 18.93333
Median (\(Q_2\))
The median is the value that splits the data set into two equal parts:
at least half of the observations are lower than or equal to the median;
at least half of the observations are higher than or equal to the median;
To calculate the median (denoted by \(Q_2\)), we must first arrange the data in ascending order.
Then, if the number of observations is:
odd: the median (\(Q_2\)) is the middle observation (i.e., the observation in position (n+1)/2);
even: the median (\(Q_2\)) is the average of the two central observations (i.e., the observations in positions (n/2) and (n/2+1));
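The positional rule above can be sketched in R. This is a minimal illustration, not how R's built-in `median()` is implemented; the helper name `median_by_position` and the small input vectors are made up for the example.

```r
# Hypothetical helper: median computed with the positional rule
median_by_position <- function(x) {
  x_sorted <- sort(x)          # arrange the data in ascending order
  n <- length(x_sorted)
  if (n %% 2 == 1) {
    # odd n: the observation in position (n + 1) / 2
    x_sorted[(n + 1) / 2]
  } else {
    # even n: average of the observations in positions n / 2 and n / 2 + 1
    (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2
  }
}

median_by_position(c(3, 1, 2))     # odd n: middle value of 1, 2, 3
median_by_position(c(4, 1, 3, 2))  # even n: average of 2 and 3
```

Both calls should agree with R's own `median()` on the same inputs.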
Example 2: Median (\(Q_2\))
Compute the median of the following 8 length measures of penguins’ flippers (in cm):
# Stores the values in a variable called "flippers_length"
flippers_length <- c(18.1, 18.6, 19.5, 19.3, 19.0, 18.1, 19.5, 19.3)
# Calculates the median of the values stored in "flippers_length"
median(flippers_length)
[1] 19.15
Mean vs Median
Both the mean and the median are useful measures of centrality.
But they have different behaviors and interpretations.
Mean vs Median - Interpretation
Suppose you want to know how much you will spend on monthly groceries in a given year.
Mean: the mean gives you an idea of how much you spend per month;
some months you will spend a bit more, some months a bit less;
but the mean also gives you an idea of the total amount you will spend in a year; just multiply it by 12.
Median: The median gives you an idea of how much you spend per month;
in about half of the months you will spend more than the median, and in about half of the months you will spend less;
Note the extra precision: for the mean, we could only say “some months”.
The median is not based on the total, so you can’t just multiply by 12 to know how much you will spend in a year.
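The difference is easy to check numerically. The sketch below uses made-up monthly grocery amounts (the vector `groceries` is hypothetical, not data from the course) to show that 12 times the mean recovers the yearly total, while 12 times the median generally does not.

```r
# Hypothetical monthly grocery spending (in dollars) for one year;
# December is deliberately much higher than the other months
groceries <- c(410, 395, 430, 405, 420, 400, 415, 390, 425, 410, 400, 700)

monthly_mean   <- mean(groceries)
monthly_median <- median(groceries)   # 410

sum(groceries)        # yearly total: 5200
monthly_mean * 12     # same as the yearly total
monthly_median * 12   # 4920, which is not the yearly total
```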
Mean vs Median - Outlier
Let’s return to the 8 length measures of penguins’ flippers (in cm). But this time, we have an additional measurement of 2.3 cm from a baby penguin.
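A short R sketch shows how a single outlier affects the two measures. It assumes the baby penguin's measurement is the 2.3 cm value that appears in the third-quartile example; the variable name `with_baby` is made up for this illustration.

```r
# The 8 original flipper lengths (in cm)
flippers_length <- c(18.1, 18.6, 19.5, 19.3, 19.0, 18.1, 19.5, 19.3)
# Assumption: the baby penguin is the 2.3 cm observation from the quartile example
with_baby <- c(flippers_length, 2.3)

mean(flippers_length)    # 18.925
mean(with_baby)          # about 17.08: pulled down by almost 2 cm

median(flippers_length)  # 19.15
median(with_baby)        # 19: barely moves
```

The mean uses every value, so one extreme observation shifts it noticeably; the median depends only on the middle of the sorted data and hardly changes.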
First Quartile (\(Q_1\)): \(n\) is even
First, we need to arrange the observations in ascending order: \[18.1,\ 18.1,\ 18.6,\ 19.0,\ 19.3,\ 19.3,\ 19.5,\ 19.5\]
Since \(n=8\) is even, we keep the first \(n/2 = 4\) observations: \[18.1,\ \color{red}{18.1},\ \color{red}{18.6},\ 19.0\]
Calculate the median of the remaining values: \(Q_1 = \frac{18.1+18.6}{2} = 18.35\)
Third Quartile (\(Q_3\)): \(n\) is odd
Let’s bring back the baby penguin. Now, compute the third quartile of the 9 length measures of penguins’ flippers (in cm).
First, we need to arrange the observations in ascending order: \[2.3,\ 18.1,\ 18.1,\ 18.6,\ 19.0,\ 19.3,\ 19.3,\ 19.5,\ 19.5\]
Since \(n=9\) is odd, we keep the last \((n+1)/2 = 5\) observations: \[19.0,\ 19.3,\ \color{red}{19.3},\ 19.5,\ 19.5\]
Calculate the median of the remaining values: \(Q_3 = 19.3\)
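The "median of a half" procedure can be written as a small R function. This is a sketch under one assumption: the slides show \(Q_1\) only for even \(n\) and \(Q_3\) only for odd \(n\), so applying the same "keep \((n+1)/2\) observations" rule to \(Q_1\) when \(n\) is odd is a guess at the intended convention. The helper name `quartiles_by_halves` is hypothetical.

```r
# Hypothetical helper: quartiles as medians of the lower and upper halves
quartiles_by_halves <- function(x) {
  x_sorted <- sort(x)
  n <- length(x_sorted)
  # Half size: n / 2 when n is even, (n + 1) / 2 when n is odd
  # (for odd n the middle observation is counted in both halves: an assumption)
  half <- ceiling(n / 2)
  c(Q1 = median(head(x_sorted, half)),   # median of the first half
    Q3 = median(tail(x_sorted, half)))   # median of the last half
}

flippers_length <- c(18.1, 18.6, 19.5, 19.3, 19.0, 18.1, 19.5, 19.3)
quartiles_by_halves(flippers_length)          # Q1 = 18.35, Q3 = 19.4
quartiles_by_halves(c(flippers_length, 2.3))  # Q1 = 18.1,  Q3 = 19.3
```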
Quantiles using R
Using R:
# Stores the values in a variable called "flippers_length"
flippers_length <- c(18.1, 18.6, 19.5, 19.3, 19.0, 18.1, 19.5, 19.3)
# Calculates the quantiles of the values stored in "flippers_length"
quantile(flippers_length, 0.25) # First quartile
25%
18.475
quantile(flippers_length, 0.50) # Second quartile
50%
19.15
quantile(flippers_length, 0.75) # Third quartile
75%
19.35
Warning
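By default, R's quantile() interpolates between neighbouring observations, so its results might differ slightly from what you get using the by-hand approach. Here, for example, R reports a first quartile of 18.475, while the by-hand calculation gives 18.35.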
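To see the discrepancy side by side, one option is to compare `quantile()` with base R's `fivenum()`, which returns Tukey's five-number summary (minimum, lower hinge, median, upper hinge, maximum). For these 8 observations the hinges coincide with the median-of-halves quartiles; this is offered as an illustration, not as the method the course prescribes.

```r
flippers_length <- c(18.1, 18.6, 19.5, 19.3, 19.0, 18.1, 19.5, 19.3)

# Default quantile(): linear interpolation between neighbouring observations
quantile(flippers_length, c(0.25, 0.75))   # 18.475 and 19.35

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
fivenum(flippers_length)                   # 18.1 18.35 19.15 19.4 19.5
```

`quantile()` also accepts a `type` argument (an integer from 1 to 9) that selects among several quantile definitions; the default is `type = 7`.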
Exercise
The final exam for STAT 200 was scheduled at a different time of day than the lecture. You want to learn how long it takes to get to UBC at that time of day so you know when to leave home. You asked your usual bus driver. As a passionate hobbyist statistician, the bus driver asked which measure of centrality you want to know:
mean commute time;
median commute time;
another commute time quantile; which one?
I have no idea!
Explain your answer!
Scale
Variability measures
The measures of centrality tell us where the data are centred.
However, they don’t tell us how much the data varies.
There are two very important variability measures: standard deviation and interquartile range;
Variance
Variance is the “arithmetic average” of the squared deviation from the mean:
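Written out, and assuming the usual sample-variance convention with an \(n-1\) divisor (the convention R's var() follows, and presumably the reason “arithmetic average” is in quotes), the definition reads:
\[
S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
\]
Here \(x_1, \dots, x_n\) are the observations and \(\bar{x}\) is their mean; these symbols are notation assumed for this formula.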
Calculate the variance of penguins’ heights. The observed data are: \[50,\ 100,\ 75,\ 88,\ 65\]
Variance
You can also use R:
# Stores the values in a variable called "penguins_height"
penguins_height <- c(50, 100, 75, 88, 65)
# Calculates the variance of the values stored in "penguins_height"
var(penguins_height)
[1] 379.3
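The same number can be reproduced by hand. The sketch below breaks the calculation into three steps; this particular breakdown and the names `x_bar` and `sq_dev` are assumptions made for the illustration, and the \(n-1\) divisor is the one `var()` uses.

```r
penguins_height <- c(50, 100, 75, 88, 65)
n <- length(penguins_height)

# Step 1: compute the mean
x_bar <- mean(penguins_height)          # 75.6

# Step 2: compute the squared deviation of each observation from the mean
sq_dev <- (penguins_height - x_bar)^2   # 655.36 595.36 0.36 153.76 112.36

# Step 3: add them up and divide by n - 1
sum(sq_dev) / (n - 1)                   # 379.3, the same value as var(penguins_height)
```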
Standard Deviation
The problem with the variance is that it uses the square of the deviations;
This affects the unit of measurement, and our interpretation;
To fix that, we can take the square root of the variance: \[
S = \sqrt{S^2}
\]
\(S\) is called the standard deviation.
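In R, the standard deviation is available directly through `sd()`. A quick check on the penguin heights used for the variance confirms it equals the square root of `var()`:

```r
penguins_height <- c(50, 100, 75, 88, 65)

sqrt(var(penguins_height))  # about 19.48
sd(penguins_height)         # the same value, in the same unit as the data
```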
Properties of Standard Deviation
Std. Deviation is always non-negative (\(\geq 0\)).
If you add a constant \(c\) to all observations, the std. deviation does not change.
If you multiply all observations by a constant \(c\), then the std. deviation is multiplied by \(|c|\) (so it stays non-negative even when \(c\) is negative).
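These properties can be verified numerically. The sketch below uses the penguin heights and an arbitrary constant `k` (chosen only for the example); `all.equal()` is used instead of `==` to allow for floating-point rounding.

```r
x <- c(50, 100, 75, 88, 65)
k <- 10   # arbitrary constant, for illustration only

sd(x) >= 0                              # always non-negative
all.equal(sd(x + k), sd(x))             # shifting the data leaves sd unchanged
all.equal(sd(k * x), k * sd(x))         # scaling by k scales sd by k
all.equal(sd(-k * x), abs(-k) * sd(x))  # a negative factor scales sd by its absolute value
```

Each line should evaluate to `TRUE`.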
Interquartile range (IQR)
It is the range that encloses the middle 50% of the observations: \[
IQR = Q_3 - Q_1
\]
You can use the IQR function in R to compute the IQR:
# Stores the values in a variable called "penguins_height"
penguins_height <- c(50, 100, 75, 88, 65)
# Calculating the IQR using quantiles
quantile(penguins_height, 0.75) - quantile(penguins_height, 0.25)
75%
23
# Or you can use the IQR function
IQR(penguins_height)
[1] 23
Visualization of quantitative variables
Histogram
Since we are dealing with quantitative variables, we don’t have categories to count and should not use a bar chart;
We use histograms to create bins and then count how many observations there are in each bin.
Histograms follow some specific conventions:
There should be no space between bins.
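As a rough sketch, base R's `hist()` does the binning and counting in one call. The bin edges below (`bin_edges`, every 10 units from 40 to 100) are an arbitrary choice for illustration, and the data are the five penguin heights used for the variance.

```r
penguins_height <- c(50, 100, 75, 88, 65)

# Hypothetical bin edges: 40, 50, ..., 100
bin_edges <- seq(40, 100, by = 10)

# plot = FALSE returns the binning without drawing anything
h <- hist(penguins_height, breaks = bin_edges, plot = FALSE)
h$counts   # number of observations in each bin

# Draw the histogram; adjacent bars touch, with no gaps between bins
hist(penguins_height, breaks = bin_edges,
     main = "Histogram of penguin heights",
     xlab = "Height", ylab = "Frequency")
```

By default `hist()` treats bins as closed on the right, so an observation equal to a bin edge is counted in the bin to its left.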
Histogram of penguin heights
[Interactive histogram of 20 randomly generated penguin heights. x-axis: Height (cm); y-axis: frequency. Hovering over a bin highlights it and shows the number of observations it contains.]
[Interactive boxplot of 20 randomly generated penguin heights. y-axis: Height (cm). The box spans Q1 to Q3, with a thick line at the median (Q2). The whiskers end at the smallest point above the lower fence (Q1 - 1.5 x IQR) and at the largest point below the upper fence (Q3 + 1.5 x IQR). Points outside the fences are drawn individually.]
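The quantities a boxplot is built from (the fences Q1 - 1.5 x IQR and Q3 + 1.5 x IQR, the whisker ends, and the points outside the fences) can be computed directly in R. The sketch below uses the 9 flipper lengths including the 2.3 cm baby penguin, with R's default `quantile()`; the variable names are made up for this illustration, and other quartile conventions could give slightly different fences.

```r
# 9 flipper lengths (in cm), including the 2.3 cm baby penguin
flippers_length <- c(2.3, 18.1, 18.6, 19.5, 19.3, 19.0, 18.1, 19.5, 19.3)

q1  <- quantile(flippers_length, 0.25, names = FALSE)
q3  <- quantile(flippers_length, 0.75, names = FALSE)
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers end at the most extreme observations that are still inside the fences
inside <- flippers_length[flippers_length >= lower_fence &
                          flippers_length <= upper_fence]
range(inside)   # 18.1 and 19.5

# Observations outside the fences are drawn as individual points
flippers_length[flippers_length < lower_fence | flippers_length > upper_fence]  # 2.3
```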