Descriptive statistics: describe the data.
Summarize (statistics), tabulate (frequency distribution), graph (histogram, etc.)

Data

Qualitative (nominal, categorical): words.
Quantitative: numbers.
  Discrete: integers; counts.
  Continuous: real numbers (decimals); measurements (amount, size, time, rate).

Levels of measurement
Level: description. Examples. What you can do with it.
Nominal: names, labels, categories.
  Examples: Yes/No, Agree/Disagree, Have/Have not, Success/Failure, M/F, ...
    MaritalStatus, State, County, Zipcode, Major, Brand, make, model, color, Place,
    race, religion, party, ideology, ..., TaxFilingStatus, Blood type, Housing type, Pet
  Can do: Count/tally each category. Relative frequency. Mode. Bar chart.
    Chi-square tests (independence, goodness-of-fit).
    Confidence interval for a proportion (1-PropZInt).
Ordinal: orderable/rankable categories, but differences (obtained by subtraction) between data values either cannot be determined or are meaningless.
  Examples: class (frosh/soph/jun/sen), trim levels, film ratings, gold/silver/bronze, letter grades, days of week, months, education level, clothing sizes, pain scales, military rank, star ratings, priority/risk levels.
    Likert scale:
      Strongly disagree / Disagree / Neutral / Agree / Strongly agree
      Very dissatisfied / Dissatisfied / Neutral / Satisfied / Very satisfied
      Poor / Fair / Good / Very good / Excellent
  Can do: Above + median/quartiles, percentiles, Spearman rank correlation.
Interval: numbers; orderable, and differences between data values can be found and are meaningful. But no natural zero (0 does not mean "none of the quantity"; the 0 is fakish).
  Examples: temperature in C or F, years/dates, shoe size, IQ/SAT, FICO, pH, BMI.
  Can do: Above + histogram, mean, median, SD, ...
    Estimation, CI (t-based), hypothesis testing, ANOVA, correlation, linear regression.
Ratio: numbers; orderable, differences between data values can be found and are meaningful, there is a natural zero (meaning none of the quantity), and ratios (e.g. "twice as much") are meaningful.
  Examples: weight, height, age, length, area, volume, time, money, temperature in K, energy, BP, LDL, MPG, MPH, BPM, DJI, S&P 500, counts.
  Can do: Above + coefficient of variation (CV), geometric mean (GM).
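
A minimal sketch of what each level allows, using only the Python standard library. All data values below are made up for illustration; the variable names are hypothetical.

```python
from collections import Counter
import statistics

# Nominal: count each category, relative frequency, mode (made-up data).
blood_types = ["O", "A", "O", "B", "A", "O", "AB", "O"]
counts = Counter(blood_types)
rel_freq = {k: v / len(blood_types) for k, v in counts.items()}
mode = statistics.mode(blood_types)

# Ordinal: code the ordered categories as ranks, then take the median rank.
scale = ["Poor", "Fair", "Good", "Very good", "Excellent"]
ratings = ["Good", "Fair", "Excellent", "Good", "Very good"]
ranks = sorted(scale.index(r) for r in ratings)
median_rating = scale[statistics.median_low(ranks)]

# Ratio: CV and geometric mean are meaningful because there is a natural zero.
weights_kg = [60.0, 72.5, 81.0, 55.5, 90.0]
cv = statistics.stdev(weights_kg) / statistics.mean(weights_kg)
gm = statistics.geometric_mean(weights_kg)

print(counts, rel_freq, mode, median_rating, cv, gm)
```

A mean of the blood types or a CV of the ratings is not computed here, since those operations have no meaning at those levels.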

Measurements have some measuring unit, e.g. inches, pounds, meters, minutes, acres, grams, MPH, BPM, ng/L, ... but the units are basically irrelevant for the statistical work.


Data "set" (but can have duplicates) consisting of observations/data values/measurements/datums/individuals/scores, all the same meaning, e.g. weights of adults, greasiness of bags of chips, longevity of bulbs, widget regional sales, effect of pill, number of events, ...
Whole numbers vs. real (decimal) numbers: no difference calculating stats, histogram, etc.
Negative numbers: ditto. If all negative, "middles" are negative; if some are positive, middles might be negative, positive, or 0. Range and SD are never negative (0 only when all values are equal).
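
A quick check of the point about negative data, as a sketch with made-up numbers: the "middles" follow the sign of the data, while range and SD measure spread and so cannot go below 0.

```python
import statistics

data = [-12.5, -3.0, -7.25, -1.0, -9.0]   # made-up values, all negative

mean = statistics.mean(data)              # negative
median = statistics.median(data)          # negative
data_range = max(data) - min(data)        # distance from min to max, not negative
sd = statistics.stdev(data)               # sample SD, not negative

# Spread is 0 only when every value is the same.
sd_constant = statistics.stdev([5, 5, 5])

print(mean, median, data_range, sd, sd_constant)
```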

Some Triola data
Some data distros


Example: Population: weights of adults in country/county.
Not possible to census this. So need a non-biased, representative sample (a teaspoon of the pot of soup).
Ideal: Simple random sample (SRS): every adult equally-likely to be in the sample and every sample of that size is equally-likely.
   "Random" refers to the selection procedure used to take the sample, not to the sample itself: a randomly taken sample.
  Bad: voluntary response, convenience sample.
Collect data. Measured vs self-reported (unreliable).
Calculate/derive statistic from the data: a point estimate of the parameter. But samples have uncertainty/variability so determine [confidence] interval estimate.
Inferential statistics: use probability to understand/quantify/describe uncertainty.
If have census, i.e. population is all known, no need to sample, just describe the population. Sample(s) only useful/taken/needed to estimate population parameter(s).
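
The workflow above (population, simple random sample, point estimate, interval estimate) can be sketched as follows. The population here is simulated, not real weight data, and the interval uses the normal critical value (about 1.96) instead of a t value, which is an assumption that is reasonable only because n is fairly large.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 adult weights in kg (simulated).
population = [random.gauss(80, 15) for _ in range(100_000)]
mu = statistics.mean(population)   # parameter, known here only because we simulated it

# Simple random sample: random.sample draws without replacement,
# so every subset of size n is equally likely.
n = 200
sample = random.sample(population, n)

x_bar = statistics.mean(sample)                 # point estimate of mu
s = statistics.stdev(sample)                    # sample SD
z = statistics.NormalDist().inv_cdf(0.975)      # critical value for 95% confidence
margin = z * s / n ** 0.5
ci = (x_bar - margin, x_bar + margin)           # interval estimate

print(f"parameter {mu:.2f}, estimate {x_bar:.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
```

With a real population the parameter `mu` would be unknown; it is printed here only to compare against the estimate.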


Interval: a continuous set of numbers, e.g. [1.45, 3.7]
OR, where all our data is: [min, max]
Range (statistic) is the length or distance of our data interval, max - min. Never negative.

Very Strong / Textbook Examples of Normal Data

Human heights (adult men or women within one population group)
Adult male heights in the US: mean ~175 cm, SD ~7 cm
Adult female heights: mean ~162 cm, SD ~6–7 cm
Very close to normal when measured in large samples from a single ethnic/geographic group.

IQ scores (standardized tests like WAIS, Stanford-Binet)
Designed to be normally distributed: mean = 100, SD = 15 (by construction).
One of the purest real-world normal distributions.
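
Since IQ is defined by its mean and SD, a normal model gives the share of scores in any band. A small sketch using the standard library's `NormalDist`; the bands chosen are just illustrations.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

within_1sd = iq.cdf(115) - iq.cdf(85)   # about 0.68
within_2sd = iq.cdf(130) - iq.cdf(70)   # about 0.95
above_130 = 1 - iq.cdf(130)             # about 0.023

print(within_1sd, within_2sd, above_130)
```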

Birth weights (full-term infants in a population)
Mean ~3.4 kg, SD ~0.5 kg
Very close to normal for healthy, full-term babies.

Measurement errors (in repeated measurements of the same object)
Length of a metal rod measured 50 times with a caliper
Errors due to small random fluctuations → almost perfectly normal.

Blood pressure (systolic or diastolic in a healthy adult population)
Systolic BP in adults: often close to normal (mean ~120 mmHg, SD ~15–20 mmHg) when excluding hypertension cases.


Examples That Are Approximately Normal (Good Enough in Practice)

Test scores on standardized exams (SAT, ACT, GRE, etc.)
Scores are deliberately scaled to be roughly normal (e.g. each SAT section: mean ~500, SD ~100; other tests use different scales).

Body mass index (BMI) in large adult populations
When excluding extremes (severe obesity or underweight), BMI often looks roughly normal.

Reaction times (simple visual or auditory reaction time tasks)
In controlled lab settings, reaction times are close to normal after removing outliers.

Residuals in regression models
In linear regression, the residuals (errors) are often assumed and observed to be normally distributed when the model fits well.



Right-skewed data:
Income / wealth distributions
Time to failure of mechanical parts (exponential)
Time between customer arrivals (exponential inter-arrival)
Insurance claim sizes
Sizes of cities / populations
Sizes of earthquakes (Richter scale)
Sizes of forest fires / natural disasters
Number of children per family
Time spent on websites / app sessions
Number of social media followers / likes
Sizes of companies (employees, revenue)
Word lengths in texts
Sizes of file downloads / internet traffic
Bacterial colony counts
Particle sizes in aerosols or sediments
Waiting times in queues
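
One signature of right-skewed data is that the long right tail pulls the mean above the median. A sketch with simulated exponential waiting times (mean of 5 minutes is an arbitrary assumption, not from any real data set):

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Simulated waiting times in minutes, exponential with mean 5.
waits = [random.expovariate(1 / 5) for _ in range(10_000)]

mean = statistics.mean(waits)       # pulled up by the long right tail
median = statistics.median(waits)   # theory: 5 * ln(2), about 3.47

print(f"mean {mean:.2f}, median {median:.2f}")
```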


Data in Excel file (.xlsx). Open it in Excel. Select column, copy, then paste into other SW.

Data in Text file (.txt, .dat) or webpage, in column (of many columns, each a different data set):
  Open it or Import it in Excel. Select column, copy, then paste into other SW.
    OR
  Open it in Notepad and then select all (Ctrl-A) then copy (Ctrl-C) and paste into Excel. Select column, copy, then paste into other SW.
BodyTemperatures.txt


Dot plot.

Stem-and-leaf plot.

Data: 44 46 47 49 63 64 66 68 68 72 72 75 76 81 84 88 106
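
A minimal sketch of building the stem-and-leaf plot for the data above, taking the tens as the stem and the units digit as the leaf. The function name `stem_and_leaf` is just a label chosen for this example.

```python
from collections import defaultdict

data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106]

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf plot (stem = tens, leaf = units digit)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # Include empty stems so gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}".rstrip())
    return lines

plot = stem_and_leaf(data)
print("\n".join(plot))
```

Stems 5 and 9 come out empty, which shows the gaps in the data, and 106 sits alone on stem 10.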

train schedule


random stochastic aleatory chance luck mis/fortune contingent accidental fate fortuitous haphazard

random number between 0 and 1 (e.g. a probability)
#0 #1 #2 #3 #4 #5 #6 #7 #8 #9
computer PRNG (pseudo-random number generator) : 9 quadrillion of them    (@1/s → 285M years)
There will be about the same number of them in each quartile, decile, percentile, millionile.
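
A sketch that checks both claims, assuming the "9 quadrillion" refers to the 2^53 distinct values that Python's `random.random()` can return (it produces floats with 53-bit precision). The sample size and seed are arbitrary.

```python
import random

count = 2 ** 53                          # about 9 quadrillion possible values in [0, 1)
seconds_per_year = 365.25 * 24 * 3600
years = count / seconds_per_year         # time to list them all at one per second

random.seed(42)  # fixed seed so the sketch is repeatable
n = 100_000
deciles = [0] * 10
for _ in range(n):
    # Tenths digit picks the decile; min() guards the top edge.
    deciles[min(int(random.random() * 10), 9)] += 1

print(f"{count:,} values, about {years / 1e6:.0f} million years")
print(deciles)
```

Each decile count should land near n/10, with small chance variation from run to run if the seed is changed.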

Each pixel randomly black or white:
     Sound track of randomness: