Sampling Techniques

STAT 200 - Chapter 9

Population Concepts

Population

  • Population: the group containing all elements you want to study.
    • The population is fixed;
    • You don’t have access to all elements of the population;
  • Examples:
    • All penguins in the world;
    • All Adelie penguins in the world;
    • All iPhones;
    • All Google employees;

Parameters

  • Parameters: quantities that summarize the population.
    • Parameters are fixed but unknown;
    • We want to estimate them because they give us useful information about the population;
  • Examples:
    • The average body mass of all penguins in the world;
    • The median flipper length of all Adelie penguins in the world;
    • The average lifetime of all iPhone 16 Pro Max units;
    • The IQR of all Google employees’ salaries;

Census

  • Retrieving data from the entire population is called a census;

  • In a census, we have to measure all elements of the population.

    • Unfortunately, this is often impossible, too costly (moneywise or timewise) or unethical.
  • Example 1: You want to learn the effectiveness of a new HIV drug. Can you imagine infecting the entire population with HIV so that you can give them an untested drug with unknown side effects?

  • Example 2: Ford wants to crash-test its vehicles to measure some safety metrics. Should they test every single car they produce?
  • What would they sell?

Parameters vs Variables

Caution

A common mistake many students make is to mix up a variable of interest with a parameter of interest.

  • A variable can be measured for each individual in the population;

  • A parameter is a summary of these measurements (e.g., mean, median, etc.);

  • For example, one might be interested in the average dolphin weight, which is a parameter;
    • the average weight is a quantity of the population, not of a single dolphin.
  • However, the dolphin weight is what is being measured, i.e., the variable of interest.

Population distribution

  • The population distribution is obtained by measuring all the elements in the population.

  • The population distribution is unknown!

    • remember: we don’t have access to all elements in the population, so we can never get the population distribution.

Sample Concepts

Sample

  • Sample: a subset (part) of the population;

    • You do have access to all elements of the sample;
  • We hope that the sample represents the population well, but this is not always the case.

  • We use samples to obtain information about the population (i.e., to estimate parameters).

  • Example 1: You’re making soup and want to know if it has enough salt. Then, you taste a spoonful (a sample!) of the soup. If that portion lacks salt, you conclude that the whole soup lacks salt.
    • You are extrapolating results from a sample to the entire population.

  • Example 2: Imagine you order a basket of french fries. You take one piece to check whether there is enough salt. But, just by chance, you end up with a piece that got too much salt on it. You might wrongly conclude that the whole basket of french fries is salty.
    • In this case, you got a sample that doesn’t represent the population well. But it’s still a sample!

Random Samples

  • There are many different strategies we can use for sampling! We will cover some of them today.

  • But they all have one thing in common: they have a random component!

  • Randomness is crucial in sampling and statistical theory.

  • Randomization tends to give samples that are fairly representative of the population.

Sample Distribution

  • The sample distribution is obtained by measuring all the elements in the sample.

  • The sample distribution is known!

  • We hope that the sample distribution resembles the population distribution;

    • remember: we don’t know the population distribution, so we will never know for sure whether it does.

Statistics

  • Statistics: quantities that summarize the sample.
    • Samples are random, so statistics are also random;
    • Statistics can be calculated because we can measure the entire sample;
    • Statistics give information about the parameters;
  • Statistics are the sample counterpart of parameters;
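The contrast between a fixed parameter and random statistics can be sketched with a short simulation. This is only an illustration under stated assumptions: the population of penguin body masses is made up, and the names `population`, `mu`, and `sample_means` are hypothetical.

```python
import random
import statistics

rng = random.Random(200)  # fixed seed so the sketch is reproducible

# Hypothetical population: body masses (in grams) of 10,000 penguins.
# In practice we would NOT have access to this whole list.
population = [rng.gauss(4200, 800) for _ in range(10_000)]

# Parameter: computed from the entire population, so it is one fixed number.
mu = statistics.mean(population)

# Statistics: computed from samples, so they change from sample to sample.
sample_means = []
for _ in range(5):
    sample = rng.sample(population, 50)  # 50 penguins, without replacement
    sample_means.append(statistics.mean(sample))

print(f"population mean (parameter): {mu:.1f}")
print("sample means (statistics):", [round(m, 1) for m in sample_means])
```

Each run of the loop draws a new sample, so the five sample means differ from one another, while `mu` never changes.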

Sampling Techniques

Example: BC’s Health System

  • Suppose we want to know the average income of all workers in BC’s hospitals.

  • The first thing is to properly define our population;
    • part time workers?
    • temporary workers?
    • casual workers?

  • Second, the parameter(s) of interest.

  • What population quantities are you interested in?

    • population mean income (\(\mu\))?
    • population median income (\(Q_2\))?
    • population Std. Dev. (\(\sigma\))?
  • Finally, how do we select our sample?

Simple Random Sampling (SRS)

  • In SRS, all individuals have the same chance of being selected (more precisely, every possible sample of a given size is equally likely);
  • The steps are:
    1. obtain the list with the names of all hospital workers (sampling frame);
    2. select a few names from the list at random;
    3. go to the field and collect the data;
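The first two steps can be sketched in a few lines. This is a minimal sketch, assuming the sampling frame is a plain list of worker IDs; the IDs and the sample size `n` are made up for illustration.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Step 1: the sampling frame, a list of every hospital worker
# (hypothetical IDs standing in for real names).
sampling_frame = [f"worker_{i:04d}" for i in range(1, 2001)]

# Step 2: select n names at random, without replacement.
# random.sample gives every subset of size n the same chance.
n = 25
selected = rng.sample(sampling_frame, n)

# Step 3 (fieldwork) would then collect the income of each selected worker.
print(selected[:5])
```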

[Interactive plot: the population distribution (\(\mu = ?\)) shown alongside the distribution of a sample drawn by SRS]

  • Use the previous slides to investigate the following questions:
    1. What happens to the statistics when a new sample is taken?
    2. What happens to the parameters when a new sample is taken?
    3. Contrast the Sample Distribution with the Population Distribution for small and large sample sizes. What do you notice?

Stratified Sampling

  • We are investigating the income of hospital workers in BC;

  • The idea is to divide the population into groups, called strata;

    • Individuals in the same stratum are similar to each other (in terms of the variables being measured);
  • Then, we draw an SRS from each stratum separately;

  • For example, we could split the population into staff, nurses, and doctors.
    • or use even more groups: IT staff, admin staff, licensed nurses, registered nurses, general doctors, specialist doctors, surgeons.
  • We expect the incomes within each stratum (job category) to be somewhat similar;

  • In stratified sampling, we:
    1. split the population into subpopulations - called strata.
    2. draw an SRS from each stratum;
    3. estimate the parameters of interest of each stratum separately;
    4. combine the strata’s estimates to build an overall estimate;
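The four steps can be sketched as follows. The income figures are made up. The sketch also assumes proportional allocation and weights each stratum's estimate by its share of the population; that is one common way to combine the estimates, not necessarily the only one. The function name `stratified_mean` is hypothetical.

```python
import random
import statistics

rng = random.Random(9)  # fixed seed so the sketch is reproducible

# Step 1: split the population into strata.
# Hypothetical annual incomes (in $1000s) for each job category.
strata = {
    "staff": [rng.gauss(55, 8) for _ in range(3000)],
    "nurse": [rng.gauss(85, 10) for _ in range(5000)],
    "doctor": [rng.gauss(250, 40) for _ in range(2000)],
}

def stratified_mean(strata, n_total, rng):
    """Estimate the overall mean from one stratified sample."""
    N = sum(len(incomes) for incomes in strata.values())
    estimate = 0.0
    for incomes in strata.values():
        weight = len(incomes) / N                # stratum share of the population
        n_h = max(1, round(n_total * weight))    # proportional allocation (assumption)
        sample = rng.sample(incomes, n_h)        # step 2: SRS within the stratum
        stratum_mean = statistics.mean(sample)   # step 3: estimate for the stratum
        estimate += weight * stratum_mean        # step 4: combine the estimates
    return estimate

estimate = stratified_mean(strata, 100, rng)
print(f"stratified estimate of the mean income: {estimate:.1f}")
```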

Stratified Random Sampling

[Interactive plot: population and sample distributions, each split into the Staff, Nurse, and Doctor strata. Two accompanying tables have columns Staff, Nurse, Doctor, and Overall, and rows Mean, Median, 0.99-quantile, Std. Dev., and IQR: one table for the population parameters and one for the sample statistics.]

  • In stratified sampling, we study each subpopulation separately and then combine the results for the entire population.

  • Stratified Sampling tends to perform better than SRS (i.e., there is less variability across samples);

    • The more homogeneous the groups are, the better the Stratified Sampling is in comparison to SRS.
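The claim about variability across samples can be checked with a small simulation. It is a sketch under stated assumptions: the incomes are made up, the strata are deliberately homogeneous within and very different from each other, and the stratified estimate uses proportional allocation. The helper names `srs_estimate` and `stratified_estimate` are hypothetical.

```python
import random
import statistics

rng = random.Random(2024)  # fixed seed so the sketch is reproducible

# Hypothetical annual incomes (in $1000s) by job category.
strata = {
    "staff": [rng.gauss(55, 8) for _ in range(3000)],
    "nurse": [rng.gauss(85, 10) for _ in range(5000)],
    "doctor": [rng.gauss(250, 40) for _ in range(2000)],
}
population = [x for incomes in strata.values() for x in incomes]
N = len(population)
n = 100  # total sample size for both methods

def srs_estimate():
    """Sample mean from one simple random sample of size n."""
    return statistics.mean(rng.sample(population, n))

def stratified_estimate():
    """Weighted mean from one stratified sample (proportional allocation)."""
    total = 0.0
    for incomes in strata.values():
        weight = len(incomes) / N
        n_h = round(n * weight)
        total += weight * statistics.mean(rng.sample(incomes, n_h))
    return total

# Repeat each method many times and measure how much the estimates vary.
reps = 500
srs_sd = statistics.stdev([srs_estimate() for _ in range(reps)])
strat_sd = statistics.stdev([stratified_estimate() for _ in range(reps)])

print(f"SD of SRS estimates:        {srs_sd:.2f}")
print(f"SD of stratified estimates: {strat_sd:.2f}")
```

With these made-up strata, the stratified estimates vary noticeably less from sample to sample than the SRS estimates.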

Cluster Sampling

  • SRS and Stratified sampling can be prohibitively expensive;

  • A more convenient (but potentially less precise) alternative is cluster sampling;

  • In cluster sampling, we split the population into groups, called clusters.

    • Unlike a stratum, a cluster is supposed to be heterogeneous;
    • Ideally, each cluster has a composition similar to that of the population as a whole;

  • For example, we could use individual hospitals as clusters.
    • Each hospital should have a composition similar to that of the population;

  • In cluster sampling, we:
    1. split the population into subpopulations - called clusters.
    2. get a list of all clusters in the population;
    3. draw an SRS of clusters;

Once we have a sample of clusters we can:

  1. Collect the data from all units in the selected clusters; this is called one-stage cluster sampling;

  2. Select a sample of units within each selected cluster using SRS or Stratified Sampling; this is called two-stage cluster sampling;
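Both options can be sketched as follows, assuming hospitals are the clusters. The hospital names, cluster sizes, and incomes are all made up, and the second stage uses SRS with 20 workers per selected hospital purely for illustration.

```python
import random

rng = random.Random(5)  # fixed seed so the sketch is reproducible

# Hypothetical clusters: 40 hospitals, each with 50 to 150 workers,
# storing a made-up annual income (in $1000s) for every worker.
clusters = {
    f"hospital_{i:02d}": [rng.gauss(100, 30) for _ in range(rng.randint(50, 150))]
    for i in range(1, 41)
}

# Draw an SRS of clusters from the list of all clusters.
chosen = rng.sample(sorted(clusters), 5)

# One-stage: collect data from ALL units in the selected clusters.
one_stage = [income for name in chosen for income in clusters[name]]

# Two-stage: draw an SRS of 20 units within each selected cluster.
two_stage = [income for name in chosen for income in rng.sample(clusters[name], 20)]

print("selected clusters:", chosen)
print("one-stage sample size:", len(one_stage))
print("two-stage sample size:", len(two_stage))
```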

Systematic Sampling

  • A systematic sample is obtained by selecting every kth individual from the sampling frame, usually starting from a randomly chosen position among the first k;

  • The effectiveness of this method depends on the structure of the sampling frame.

  • It could be better, worse, or the same as SRS or even stratified sampling.
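A minimal sketch of systematic sampling, assuming the sampling frame is an ordered list and the starting position is chosen at random among the first k entries. The random start is a common convention assumed here, and the function name `systematic_sample` is hypothetical.

```python
import random

def systematic_sample(frame, k, rng):
    """Take every kth element of the frame, from a random start in the first k."""
    start = rng.randrange(k)  # random position among the first k (assumption)
    return frame[start::k]

rng = random.Random(3)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: workers numbered 1 to 100.
frame = list(range(1, 101))
sample = systematic_sample(frame, 10, rng)
print(sample)
```

If the frame is ordered in a way that relates to the variable being measured (for example, sorted by seniority), the spacing can help or hurt, which is why the effectiveness depends on the structure of the frame.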

Multistage sampling

  • Multistage sampling involves more than one stage or more than one sampling procedure in obtaining a sample.

  • Two-stage cluster sampling is an example of multistage sampling.

Sampling problems

Biased samples

  • If our sampling approach systematically gives us nonrepresentative samples, we say that the sampling method is biased.

  • Remember, we don’t know if a sample is representative or not since we don’t know the population;

  • Biased sampling is a property of the approach, not of a given sample.

  • Let’s check a few things that can compromise our sample data;

Undercoverage

  • It occurs when a sampling frame or a sampling procedure completely excludes or underrepresents certain kinds of individuals from the population.

  • For example, a librarian wants to find out how often UBC students use library services. She only surveys students visiting the Woodward Biomedical Library.
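The librarian example can be sketched as a simulation showing that bias is a property of the method rather than of any single sample. The numbers of students and their weekly library visits are made up; the sketch assumes that only frequent library users can be reached by the in-library survey.

```python
import random
import statistics

rng = random.Random(11)  # fixed seed so the sketch is reproducible

# Hypothetical population: weekly library visits for 5,000 students.
frequent = [rng.randint(4, 10) for _ in range(1000)]   # reachable in the library
occasional = [rng.randint(0, 3) for _ in range(4000)]  # rarely in the library
population = frequent + occasional
mu = statistics.mean(population)  # the parameter

def srs_mean(n=40):
    """Sample mean from an SRS of the whole population."""
    return statistics.mean(rng.sample(population, n))

def undercoverage_mean(n=40):
    """Sample mean when only frequent users can be selected."""
    return statistics.mean(rng.sample(frequent, n))

# Repeat each method many times: a biased method is off on average,
# not just in one unlucky sample.
reps = 500
avg_srs = statistics.mean([srs_mean() for _ in range(reps)])
avg_biased = statistics.mean([undercoverage_mean() for _ in range(reps)])

print(f"parameter:                        {mu:.2f}")
print(f"average SRS estimate:             {avg_srs:.2f}")
print(f"average undercoverage estimate:   {avg_biased:.2f}")
```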

Convenience Sampling

  • The selection of individuals from the population based on easy availability and accessibility.

  • For example, a market researcher wants to estimate the average housing price in Vancouver. He collects information on prices by sending out a survey to 50 households in his neighbourhood.

Voluntary Response Bias

  • If participation in a survey is voluntary, individuals with strong opinions tend to respond more often and thus will be overrepresented.

  • For example, call-in polls, UBC’s optional teaching evaluations, etc…

Nonresponse Bias

  • Individuals who do not respond to a survey might differ from the respondents in certain aspects (e.g., mail-in questionnaires);

  • Voluntary response bias is a form of nonresponse bias; but nonresponse may occur for other reasons.

  • For example, those who are at work during the day won’t respond to a telephone survey conducted only during working hours.

Response Bias

  • When a surveyed subject’s response is influenced by how a question is phrased or asked, or due to misunderstanding of a question or unwillingness to disclose the truth, response bias has occurred.

  • For example, the question, “Have you ever committed a crime?” could pressure the respondents into lying to avoid compromising themselves.

References

Image Attributions

Female Nurse 1: Twitter, CC BY 4.0, via Wikimedia Commons.

Female Nurse 2: Twitter, CC BY 4.0, via Wikimedia Commons.

Female Nurse 3: Twitter, CC BY 4.0, via Wikimedia Commons.

Male Nurse 1: Twitter, CC BY 4.0, via Wikimedia Commons.

Male Nurse 2: Twitter, CC BY 4.0, via Wikimedia Commons.

Female Doctor 1: Google, Apache License 2.0, via Wikimedia Commons.

Female Doctor 2: Google, Apache License 2.0, via Wikimedia Commons.

Female Doctor 3: Google, Apache License 2.0, via Wikimedia Commons.

Male Doctor 1: Google, Apache License 2.0, via Wikimedia Commons.

Male Doctor 2: Google, Apache License 2.0, via Wikimedia Commons.

Female Staff 1: Google, Apache License 2.0, via Wikimedia Commons.

Female Staff 2: Google, Apache License 2.0, via Wikimedia Commons.

Female Staff 3: Google, Apache License 2.0, via Wikimedia Commons.

Male Staff 1: Google, Apache License 2.0, via Wikimedia Commons.

Male Staff 2: Google, Apache License 2.0, via Wikimedia Commons.

Male Staff 3: Google, Apache License 2.0, via Wikimedia Commons.