Descriptive statistics: describe the data.
Summarize (statistics), tabulate (frequency distribution), graph (histogram, etc.)
| Data | |
|---|---|
| Qualitative (nominal, categorical): words | Quantitative: numbers |
Levels of measurement
| Level | Description | Examples | What you can do with it |
|---|---|---|---|
| Nominal | Names, labels, categories. | Yes/No, Agree/Disagree, Have/Have not, Success/Failure, M/F, ...; marital status, state, county, ZIP code, major, brand, make, model, color, place, race, religion, party, ideology, ..., tax filing status, blood type, housing type, pet | Count/tally each category. Relative frequency. Mode. Bar chart. Chi-square tests (independence, goodness-of-fit). Confidence interval for a proportion (1-PropZInt). |
| Ordinal | Orderable/rankable categories, but differences (obtained by subtraction) between data values either cannot be determined or are meaningless. | Class (frosh/soph/jun/sen), trim levels, film ratings, gold/silver/bronze, letter grades, days of week, months, education level, clothing sizes, pain scales, military rank, star ratings, priority/risk levels, percentiles. Likert scales: Strongly disagree / Disagree / Neutral / Agree / Strongly agree; Very dissatisfied / Dissatisfied / Neutral / Satisfied / Very satisfied; Poor / Fair / Good / Very good / Excellent | Above + median/quartiles, Spearman rank correlation. |
| Interval | Numbers: orderable, and differences between data values can be found and are meaningful. But there is no natural zero (a zero meaning none of the quantity). | Temperature in °C or °F, years/dates, shoe size, IQ/SAT, FICO, pH (the 0 is arbitrary) | Histogram, mean, median, SD, ... Estimation, confidence intervals, t-tests, hypothesis testing, ANOVA, correlation, linear regression. |
| Ratio | Numbers: orderable, differences between data values can be found and are meaningful, there is a natural zero (meaning none of the quantity), and ratios (e.g. "twice as much") are meaningful. | Weight, height, age, length, area, volume, time, money, temperature in K, energy, BP, LDL, BMI, DJI, S&P 500 | Above + CV (coefficient of variation), GM (geometric mean). |
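
The summaries listed per level can be sketched in a few lines of Python. This is only an illustration using the standard library: every data value below (blood types, Likert responses, weights) is made up for the example, and the variable names are hypothetical.

```python
from collections import Counter
import statistics

# Nominal: tally each category, relative frequency, mode.
blood_types = ["O", "A", "O", "B", "A", "O", "AB", "O"]  # made-up data
counts = Counter(blood_types)
rel_freq = {k: v / len(blood_types) for k, v in counts.items()}
mode = statistics.mode(blood_types)

# Ordinal: categories have an order, so a median is meaningful,
# but subtracting two responses is not.
likert_order = ["Strongly disagree", "Disagree", "Neutral",
                "Agree", "Strongly agree"]
responses = ["Agree", "Neutral", "Strongly agree", "Disagree", "Agree"]  # made-up
ranks = [likert_order.index(r) for r in responses]
# median_low always returns an actual data value, so it maps back to a label.
median_response = likert_order[statistics.median_low(ranks)]

# Ratio: a natural zero makes ratios meaningful, so the coefficient of
# variation (SD / mean) and the geometric mean can be used.
weights_kg = [60.0, 72.5, 80.0, 95.0]  # made-up data
mean_weight = statistics.mean(weights_kg)
cv = statistics.stdev(weights_kg) / mean_weight
gm = statistics.geometric_mean(weights_kg)

print(counts, rel_freq, mode)
print(median_response)
print(mean_weight, cv, gm)
```

Each step only uses operations that the table allows at that level: counting for nominal data, ordering for ordinal data, and division for ratio data.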
Example: Population: weights of adults in country/county.
A census of this population is not feasible, so we need an unbiased, representative sample (a teaspoon from the pot of soup).
Ideal: simple random sample (SRS): every adult is equally likely to be in the sample,
and every possible sample of that size is equally likely.
Bad: voluntary response, convenience sample.
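
A simple random sample can be sketched with Python's `random.sample`, which draws without replacement so that every subset of the requested size is equally likely. The sampling frame of 10,000 numbered adults, the sample size of 100, and the seed are all assumptions made for illustration.

```python
import random

# Hypothetical sampling frame: ID numbers for 10,000 adults.
population_ids = list(range(1, 10_001))

# Seeded only so the example is reproducible.
rng = random.Random(42)

# Draw 100 distinct IDs; each adult has the same chance of selection.
sample_ids = rng.sample(population_ids, k=100)

print(len(sample_ids), len(set(sample_ids)))
```

In practice the hard part is obtaining a complete sampling frame, not the random draw itself.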
Collect data. Measured vs self-reported (self-reports are less reliable).
Calculate/derive a statistic from the data: a point estimate of the parameter.
But samples vary, so the point estimate is uncertain: also determine a [confidence] interval estimate.
Inferential statistics: use probability to understand/quantify/describe uncertainty.
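
As a rough sketch of a point estimate and an interval estimate, the code below computes a sample mean and an approximate 95% confidence interval for the population mean. The sample is simulated (not real weights), and the interval uses a normal critical value, which is an assumption that is only reasonable for a fairly large sample.

```python
import random
import statistics

# Simulated sample of 100 measured adult weights in kg (not real data).
rng = random.Random(0)
sample = [rng.gauss(75, 12) for _ in range(100)]

n = len(sample)
x_bar = statistics.mean(sample)   # point estimate of the population mean
s = statistics.stdev(sample)      # sample standard deviation

# Critical value for 95% confidence from the standard normal (about 1.96).
z_star = statistics.NormalDist().inv_cdf(0.975)

margin = z_star * s / n ** 0.5    # margin of error
ci = (x_bar - margin, x_bar + margin)

print(x_bar, ci)
```

For a small sample, a t critical value would be the more appropriate choice; the standard library does not provide one, which is why this sketch falls back on the normal approximation.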
If you have a census, i.e. the whole population is known, there is no need to sample: just describe the population.
Samples are only needed to estimate population parameters.
Data set mapped/transformed/normalized to z-scores:
each datum x: its number of SDs from the mean:
z = (x-mean) / SD
Within ±2 is "normal". (-2, 2)
≤-2 (-∞,-2] or ≥2 [2,∞) is "statistically significant",
i.e. maybe important.
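
The z-score transformation and the ±2 rule of thumb can be sketched as follows. The data set is made up, and treating it as a whole population (so that `statistics.pstdev` is the SD) is an assumption; for a sample, `statistics.stdev` would be used instead.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data, treated as a population

mean = statistics.mean(data)      # 5
sd = statistics.pstdev(data)      # 2.0

# z-score: how many SDs each value lies from the mean.
z_scores = [(x - mean) / sd for x in data]

# Flag values at least 2 SDs from the mean.
unusual = [x for x, z in zip(data, z_scores) if abs(z) >= 2]

print(z_scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(unusual)   # [9]
```

After the transformation the z-scores have mean 0 and SD 1, whatever the units of the original data were.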