[1] "EWR" "LGA" "JFK"
A set is a mathematical model for a collection of distinct things.
Example: the origin airports in flights.
A vector is an ordered collection of things (elements) of the same type.
In a set each element can occur only once, and the order does not matter!
\(\{1,3,5\} = \{3,5,1\} = \{1,1,1,3,5,5\}\)
For vectors:
\([1\ 3\ 5] \neq [3\ 5\ 1]\) because vectors are compared component-wise. Neither of them can even be compared with the vector \([1\ 1\ 1\ 3\ 5\ 5]\), because the lengths differ.
Sets \(A = \{🐺, 🦊, 🐶\}\) and \(B = \{🐶, 🐷, 🐹\}\), \(C = \{🐶, 🐷\}\):
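Union \(A \cup B\): all \(x\) with \(x\in A\) | (or) \(x\in B\).
Intersection \(A \cap B\): all \(x\) with \(x\in A\) & (and) \(x\in B\).
See the analogy of set operations and logical operations.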
unique shows the set of elements in a vector.
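A minimal sketch of these ideas in base R. The vector `v` and the animal names are illustrative stand-ins (assumptions), not data from the course.

```r
# A vector may contain repeats; unique() keeps each element once
v <- c("EWR", "LGA", "JFK", "LGA", "EWR")
unique(v)                              # "EWR" "LGA" "JFK"

# Sets ignore order and repetition, vectors do not
setequal(c(1, 3, 5), c(3, 5, 1, 1))    # TRUE
identical(c(1, 3, 5), c(3, 5, 1))      # FALSE

# Set operations on (hypothetical) sets A and B
A <- c("wolf", "fox", "dog")
B <- c("dog", "pig", "hamster")
union(A, B)                            # "wolf" "fox" "dog" "pig" "hamster"
intersect(A, B)                        # "dog"

# The analogy to logical operations for a single element x
x <- "dog"
in_union        <- (x %in% A) | (x %in% B)   # TRUE
in_intersection <- (x %in% A) & (x %in% B)   # TRUE
```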
Consider two sets: The domain \(X\) and the codomain \(Y\).
A function \(f\) assigns to each element of \(X\) exactly one element of \(Y\).
We write \(f : X \to Y\)
"\(f\) maps from \(X\) to \(Y\)"
and \(x \mapsto f(x)\)
"\(x\) maps to \(f(x)\)"
The yellow set is called the image of \(f\).
Important in math
\(\ \mapsto\ \)
Input from \(X = \{\text{A picture where a face can be recognized}\}\).
Function: Upload input at https://funny.pho.to/lion/ and download output.
Output from \(Y = \{\text{Set of pictures with a specific format.}\}\)
Yes, it is a function. Important: Output is the same for the same input!
Input a text snippet. Function: Enter text at https://www.craiyon.com. Output a picture.
No, it is not a function. It has nine outcomes and these change when run again.
Other examples:
\(f(x) = x\) identity function
\(f(x) = x^2\) square function
\(f(x) = \sqrt{x}\) square root function
\(f(x) = e^x\) exponential function
\(f(x) = \log(x)\) natural logarithm
For \(x > 0\): \(\sqrt[2]{x}^2 = \sqrt[2]{x^2} = x\), \(\log(e^x) = e^{\log(x)} = x\)
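The inverse relations above can be checked numerically. This is a small sketch with arbitrarily chosen positive test values; the last line shows why positivity matters for the square root.

```r
x <- c(0.5, 1, 4, 9)

# Square and square root undo each other for positive x
all.equal(sqrt(x)^2, x)      # TRUE
all.equal(sqrt(x^2), x)      # TRUE

# Exponential and natural logarithm undo each other
all.equal(log(exp(x)), x)    # TRUE
all.equal(exp(log(x)), x)    # TRUE

# For a negative input the square root of the square is the absolute value
sqrt((-3)^2)                 # 3, not -3
```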
How can we shift, stretch, or shrink a graph vertically and horizontally?
Add a constant to the function.
\(f(x) = x^3 - x^2 \leadsto\)
\(\quad f(x) = x^3 - x^2 + a\)
For \(a =\) -2, -0.5, 0.5, 2
Subtract a constant from all \(x\) within the function definition.
\(f(x) = x^3 - x^2 \leadsto\)
\(\quad f(x) = (x - a)^3 - (x - a)^2\)
For \(a =\) -2, -0.5, 0.5, 2
Attention:
Shifting \(a\) units to the right needs subtracting \(a\)!
You can think of the coordinate system being shifted in direction \(a\) while the graph stays.
Multiply the whole function by a constant.
\(f(x) = x^3 - x^2 \leadsto\)
\(\quad f(x) = a(x^3 - x^2)\)
For \(a =\) -2, -0.5, 0.5, 2
Negative numbers flip the graph around the \(x\)-axis.
Divide all \(x\) within the function definition by a constant.
\(f(x) = x^3 - x^2 \leadsto\)
\(\quad f(x) = (x/a)^3 - (x/a)^2\)
For \(a =\) -2, -0.5, 0.5, 2
Negative numbers flip the graph around the \(y\)-axis.
Attention: Stretching needs a division by \(a\)!
You can think of the coordinate system being stretched multiplicatively by \(a\) while the graph stays.
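The four transformations can be sketched as R functions built on \(f(x) = x^3 - x^2\). The helper names (`shift_up`, `shift_right`, `stretch_y`, `stretch_x`) are made up for this illustration, and \(a = 2\) is just one of the values listed above.

```r
f <- function(x) x^3 - x^2
a <- 2

shift_up    <- function(x) f(x) + a   # vertical shift: add a to the function
shift_right <- function(x) f(x - a)   # horizontal shift: subtract a from x
stretch_y   <- function(x) a * f(x)   # vertical stretch: multiply the function by a
stretch_x   <- function(x) f(x / a)   # horizontal stretch: divide x by a

x <- seq(-2, 2, by = 0.5)

# The shifted graph at x + a has the value of the original graph at x
all.equal(shift_right(x + a), f(x))   # TRUE
# The stretched graph at a * x has the value of the original graph at x
all.equal(stretch_x(a * x), f(x))     # TRUE

# f has a root at x = 1; it moves to 1 + a = 3 and to 1 * a = 2 respectively
shift_right(3)                        # 0
stretch_x(2)                          # 0
```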
A polynomial is a function composed of one or more addends of the form \(ax^n\), with different coefficients \(a\) and non-negative integer exponents \(n\).
In an exponential the \(x\) appears in the exponent.
\(f(x) = x^3\) vs. \(f(x) = e^x\)
For \(x\to\infty\), any exponential with a base greater than 1 will eventually "overtake" any polynomial.
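A quick numerical look at \(x^3\) versus \(e^x\), as a sketch with hand-picked values of \(x\): the polynomial is ahead for some small \(x\), but the exponential wins from about \(x = 5\) on.

```r
x <- c(2, 4, 5, 10, 20)

comparison <- data.frame(x = x, cubic = x^3, exponential = exp(x))
comparison

# Is the polynomial still larger than the exponential?
x^3 > exp(x)   # TRUE TRUE FALSE FALSE FALSE
```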
\(x^0\)
\(0^x\)
\(0^0\)
\((x\cdot y)^a\)
\(x^{-a}\), \(x^{-1}\)
\(x^\frac{a}{b}\), \(x^\frac{1}{2}\)
\((x^a)^b\)
\(x^0 = 1\)
\(0^x = 0\) for \(x\neq 0\)
\(0^0 = 1\) (discontinuity in \(0^x\))
\((x\cdot y)^a = x^a\cdot y^a\)
\(x^{-a} = \frac{1}{x^a}\), \(x^{-1} = \frac{1}{x}\)
\(x^\frac{a}{b} = \sqrt[b]{x^a} = (\sqrt[b]{x})^a,\ x^\frac{1}{2} = \sqrt{x}\)
\((x^a)^b = x^{a\cdot b} = (x^b)^a \neq x^{a^b} = x^{(a^b)}\)
Example: \((4^3)^2 = 64^2 = 4096 \qquad 4^{3^2} = 4^9 = 262144\)
\(x^a\cdot x^b\)
\(x^a\cdot x^b = x^{a+b}\) Multiplication of powers (with same base \(x\)) becomes addition of exponents.
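The power rules can be verified numerically in R. The values of `x`, `y`, `a`, and `b` below are arbitrary choices for this sketch. Note that `^` is right-associative in R, so `4^3^2` means \(4^{(3^2)}\).

```r
x <- 4; y <- 2.5; a <- 3; b <- 2

all.equal((x * y)^a, x^a * y^a)    # TRUE: power of a product
all.equal(x^(-a), 1 / x^a)         # TRUE: negative exponent
all.equal(x^(a / b), sqrt(x)^a)    # TRUE: fractional exponent (here b = 2)
all.equal((x^a)^b, x^(a * b))      # TRUE: power of a power
all.equal(x^a * x^b, x^(a + b))    # TRUE: same base, exponents add

(4^3)^2    # 4096
4^3^2      # 262144
0^0        # 1
```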
\((x+y)^a\)
No "simple" form! For a non-negative integer \(a\), use the binomial expansion. \((x+y)^2 = x^2 + 2xy + y^2\)
\((x+y)^3 = x^3 + 3x^2y + 3xy^2 + y^3\)
\((x+y)^n = \sum_{k=0}^n {n \choose k} x^{n-k}y^k\)
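The binomial coefficients \({n \choose k}\) are available in base R as `choose(n, k)`. The following sketch, with arbitrarily chosen `n`, `x`, and `y`, checks the expansion formula for one case.

```r
n <- 4
k <- 0:n

# Binomial coefficients for n = 4
choose(n, k)                              # 1 4 6 4 1

# Check the binomial expansion for x = 2, y = 3
x <- 2; y <- 3
expansion <- sum(choose(n, k) * x^(n - k) * y^k)
expansion                                 # 625
(x + y)^n                                 # 625
```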
Pascal's triangle
We meet it again in probability:
A row represents a binomial distribution,
which mimics the normal distribution more and more closely
and is related to the central limit theorem.
Definition: The logarithm of \(a\) to base \(b\) is the exponent to which \(b\) must be raised to obtain \(a\): \(\log_b(a) = x\) means that \(b^x = a\).
Most common:
\(\log_{10}(100) = 2\)
\(\log_{10}(1) = 0\)
\(\log_{10}(6590) \approx 3.818885\)
\(\log_{10}(0.02) \approx -1.69897\)
Usually only one base is used in the same context, because changing base is easy:
\(\log_c(x) = \frac{\log_b(x)}{\log_b(c)} = \frac{\log(x)}{\log(c)}\)
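In R, `log()` is the natural logarithm, `log10()` uses base 10, and `log(x, base = c)` uses an arbitrary base. A short sketch of the values above and of the change-of-base rule; the number 8 and base 2 are arbitrary choices.

```r
# Base-10 logarithms of the example values
log10(c(100, 1, 6590, 0.02))   # 2.000000  0.000000  3.818885 -1.698970

# Change of base: log_2(8) computed three ways
direct      <- log(8, base = 2)
via_natural <- log(8) / log(2)
via_base10  <- log10(8) / log10(2)
c(direct, via_natural, via_base10)
```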
\(\log(x\cdot y) = \log(x) + \log(y)\) Multiplication \(\to\) addition.
\(\log(x^y) = y\cdot\log(x)\)
\(\log(x+y)\): complicated!
Also changing bases for powers is easy: \(x^y = (e^{\log(x)})^y = e^{y\cdot\log(x)}\)
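These logarithm rules can be checked numerically. The values `x = 3` and `y = 7` are arbitrary positive numbers chosen for this sketch.

```r
x <- 3; y <- 7

all.equal(log(x * y), log(x) + log(y))   # TRUE: product rule
all.equal(log(x^y), y * log(x))          # TRUE: power rule
all.equal(x^y, exp(y * log(x)))          # TRUE: changing the base of a power

x^y                                      # 2187
```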
Difference to math terminology:
The output need not be the same for the same input.
function is a class of objects in R.
Typing a function's name without parentheses prints its code or some information about it.
function (x, na.rm = FALSE)
sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
na.rm = na.rm))
<bytecode: 0x55b0c7b79b98>
<environment: namespace:stats>
function (...) .Primitive("c")
function (data = NULL, mapping = aes(), ..., environment = parent.frame())
{
UseMethod("ggplot")
}
<bytecode: 0x55b0c6cd1ae8>
<environment: namespace:ggplot2>
The skeleton for a function definition is
function_name <- function(input) { body }
function_name should be a short but evocative verb.
input can be empty or consist of one or more name or name=expression terms as arguments.
{} can be omitted when the body is a single expression. For example: add_one <- function(x) x + 1
In a function call, arguments are given as name=expression or just expression (then they are taken as the next argument).
Mathematical functions in programming are often "vectorized":
[1] 1 2 3 4 5 6 7 8 9 10
[1] -0.5 -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
[16] 1.0 1.1 1.2 1.3 1.4 1.5
[1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
[8] 0.7777778 0.8888889 1.0000000
[1] 1 2 3 1 2 3 1 2 3
[1] 1 1 1 2 2 2 3 3 3
Vector creation and vectorized functions are key for plotting and transformation.
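Calls of the following kind create vectors like the ones printed above. This is a sketch, since the original calls are not shown. The last lines apply a vectorized expression to a whole vector at once.

```r
1:10                           # integers from 1 to 10
seq(-0.5, 1.5, by = 0.1)       # fixed step width
seq(0, 1, length.out = 10)     # fixed number of points
rep(1:3, times = 3)            # 1 2 3 1 2 3 1 2 3
rep(1:3, each = 3)             # 1 1 1 2 2 2 3 3 3

# Vectorized arithmetic: the expression is evaluated for every element
x <- seq(-2, 2, by = 1)
y <- x^3 - x^2
y                              # -12  -2   0   0   4
```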
ggplot-ing functions
if executes a code block if a condition is TRUE
else executes a code block if the condition is FALSE
Skeleton
Example: A piece-wise defined function
Problem: piecewise is not vectorized. piecewise(c(1,2,3)) does not work.
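The original definition of piecewise is not shown, so the following is a hypothetical stand-in using the if/else skeleton. It works for a single number. For a vector, `if` receives a condition of length greater than one, which gives an error in recent R versions and a warning in older ones.

```r
# Hypothetical piece-wise defined function: -x for negative x, x^2 otherwise
piecewise <- function(x) {
  if (x < 0) {
    -x
  } else {
    x^2
  }
}

piecewise(-3)   # 3
piecewise(2)    # 4

# Calling it with a vector fails; tryCatch records how it fails
outcome <- tryCatch(
  piecewise(c(1, 2, 3)),
  error   = function(e) "error",
  warning = function(w) "warning"
)
outcome
```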
map
map functions apply a function to each element of a vector.
map(.x, .f, ...) applies the function .f to each element of the vector .x and returns a list.
map_dbl returns a double vector (other variants exist).
map and reduce
Instead of a list or a vector, reduce returns a single value.
To that end it needs a function with two arguments. It applies it to the first two elements of the vector, then to the result and the third element, then to the result and the fourth element, and so on.
Note: \(x) is a short way to write an anonymous function, equivalent to function(x).
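A small sketch of map, map_dbl, and reduce from the purrr package. The function piecewise is a hypothetical non-vectorized example defined here only for illustration.

```r
library(purrr)

# Hypothetical non-vectorized function
piecewise <- function(x) if (x < 0) -x else x^2

# map() returns a list, map_dbl() a double vector
as_list   <- map(c(-1, 2, 3), piecewise)
as_vector <- map_dbl(c(-1, 2, 3), piecewise)   # 1 4 9

# Anonymous function in the short \(x) notation
plus_one <- map_dbl(1:3, \(x) x + 1)           # 2 3 4

# reduce() combines the elements step by step into a single value
total   <- reduce(1:5, \(a, b) a + b)          # ((((1 + 2) + 3) + 4) + 5) = 15
largest <- reduce(c(3, 9, 4), max)             # 9
```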
Example: Reading multiple files
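The original code for this example is not shown. One possible sketch, under the assumption that all files share the same columns: map reads every file into a list of data frames and reduce stacks them. The two small CSV files are created in a temporary folder only to keep the sketch self-contained.

```r
library(purrr)

# Create two small demo files in a temporary folder (illustration only)
demo_dir <- tempfile("csv_demo")
dir.create(demo_dir)
write.csv(data.frame(id = 1:2, value = c(10, 20)),
          file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(30, 40)),
          file.path(demo_dir, "b.csv"), row.names = FALSE)

# Read every CSV file in the folder and stack the tables row-wise
files    <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
tables   <- map(files, read.csv)
combined <- reduce(tables, rbind)
combined
```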
map functions are powerful tools for iterative tasks!
Summary statistics summarize a set of observations to communicate as much information as simply as possible.
Univariate (for one variable)
Bivariate (for two variables)
Goal: For a sequence of numerical observations \(x_1,\dots,x_n\) we want to measure
Three different ways: mean, median, and mode.
Do they deliver one unambiguous answer for any sequence?
Mean and median, yes.
The mode has no rules for a tie.
Can they be generalized to variables with ordered or even unordered categories?
Mean: No.
Median: For ordered categories (except when the number of observations is even and the two middle values are not the same).
Mode: For any categorical variable.
Is the measure always also in the data sequence?
Mean: No.
Median: Yes, for sequences of odd length.
Mode: Yes.
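Base R provides mean() and median() but no function for the statistical mode. The helper mode_value below is a hypothetical sketch that returns the first of several tied values, which is one arbitrary way to resolve a tie.

```r
# Hypothetical helper: most frequent value, first one in case of a tie
mode_value <- function(x) {
  distinct <- unique(x)
  counts   <- tabulate(match(x, distinct))
  distinct[which.max(counts)]
}

x <- c(1, 2, 2, 3, 10)
mean(x)          # 3.6, not an element of x
median(x)        # 2
mode_value(x)    # 2

# The mode also works for unordered categories; here "a" and "b" are tied
mode_value(c("b", "a", "b", "a"))   # "b"
```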
For \(x_1, \dots, x_n > 0\) and \(p\in \mathbb{R}_{\neq 0}\) the generalized mean is
\[M_p(x_1, \dots, x_n) = \left(\frac{1}{n}\sum_{i=1}^n x_i^p\right)^\frac{1}{p}\]
For \(p = 0\) it is \(M_0(x_1, \dots, x_n) = (\prod_{i=1}^n x_i)^\frac{1}{n}\).
\(M_1\) is the arithmetic mean. \(M_0\) is called the geometric mean. \(M_{-1}\) the harmonic mean.
Note: Generalized means are often only reasonable when all values are positive \(x_i > 0\).
For \(p \in \mathbb{R}\): \(f(x) = \begin{cases}\frac{x^p - 1}{p} & \text{for $p\neq 0$} \\ \log(x) & \text{for $p= 0$}\end{cases}\)
The \(p\)-mean is
\[M_p(x) = f^{-1}\left(\frac{1}{n}\sum_{i=1}^n f(x_i)\right)\]
with \(x = [x_1, \dots, x_n]\). \(f^{-1}\) is the inverse of \(f\).
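Both formulations of the generalized mean can be sketched in R. The function names (`generalized_mean`, `transform_p`, `transform_p_inverse`) and the example vector are assumptions made for this illustration.

```r
# Direct formula, with the geometric mean as the special case p = 0
generalized_mean <- function(x, p) {
  stopifnot(all(x > 0))
  if (p == 0) exp(mean(log(x))) else mean(x^p)^(1 / p)
}

# The transformation f and its inverse from the second formulation
transform_p         <- function(x, p) if (p == 0) log(x) else (x^p - 1) / p
transform_p_inverse <- function(y, p) if (p == 0) exp(y) else (p * y + 1)^(1 / p)

x <- c(1, 2, 4)
generalized_mean(x, 1)     # arithmetic mean: 2.333333
generalized_mean(x, 0)     # geometric mean:  2
generalized_mean(x, -1)    # harmonic mean:   1.714286

# Same result via f^{-1}(mean(f(x)))
transform_p_inverse(mean(transform_p(x, -1)), -1)   # 1.714286
```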
What is the weight of the meat of this ox?
library(tidyverse)  # provides ggplot2 (also readr and dplyr, used further below)
library(readxl)
galton <- read_excel("data/galton_data.xlsx")
galton |> ggplot(aes(Estimate)) + geom_histogram(binwidth = 5) +
  geom_vline(xintercept = 1198, color = "green") +                     # true value
  geom_vline(xintercept = mean(galton$Estimate), color = "red") +      # mean
  geom_vline(xintercept = median(galton$Estimate), color = "blue") +   # median
  geom_vline(xintercept = Mode(galton$Estimate), color = "purple")     # mode
787 estimates, true value 1198, mean 1196.7, median 1208, mode 1218
How many lots will be sold by the end of the festival?
viertel <- read_csv("data/Viertelfest.csv")
viertel |> ggplot(aes(`Schätzung`)) + geom_histogram() +
  geom_vline(xintercept = 10788, color = "green") +                      # true value
  geom_vline(xintercept = mean(viertel$Schätzung), color = "red") +      # mean
  geom_vline(xintercept = median(viertel$Schätzung), color = "blue") +   # median
  geom_vline(xintercept = Mode(viertel$Schätzung), color = "purple")     # mode
1226 estimates, the maximal value is 29530000!
We should filter out the highest values for the histogram…
How many lots will be sold by the end of the festival?
viertel <- read_csv("data/Viertelfest.csv")
viertel |> filter(Schätzung < 100000) |> ggplot(aes(`Schätzung`)) + geom_histogram(binwidth = 500) +
  geom_vline(xintercept = 10788, color = "green") +                                # true value
  geom_vline(xintercept = mean(viertel$Schätzung), color = "red") +                # mean
  geom_vline(xintercept = median(viertel$Schätzung), color = "blue") +             # median
  geom_vline(xintercept = Mode(viertel$Schätzung), color = "purple") +             # mode
  geom_vline(xintercept = exp(mean(log(viertel$Schätzung))), color = "orange")     # geometric mean
1226 estimates, true value 10788, mean 53163.9, median 9843, mode 10000,
geometric mean 10510.1
viertel |> mutate(log10Est = log10(Schätzung)) |> ggplot(aes(log10Est)) + geom_histogram(binwidth = 0.05) +
  geom_vline(xintercept = log10(10788), color = "green") +                        # true value
  geom_vline(xintercept = log10(mean(viertel$Schätzung)), color = "red") +        # mean
  geom_vline(xintercept = log10(median(viertel$Schätzung)), color = "blue") +     # median
  geom_vline(xintercept = log10(Mode(viertel$Schätzung)), color = "purple") +     # mode
  geom_vline(xintercept = mean(log10(viertel$Schätzung)), color = "orange")       # geometric mean
1226 estimates, true value 10788, mean 53163.9, median 9843, mode 10000,
geometric mean 10510.1
Goal: We want to measure
Variance is the mean of the squared deviation from the mean: \(\text{Var}(x) = \frac{1}{n}\sum_{i=1}^n(x_i - \mu)^2\) where \(\mu\) (mu) is the mean.
Standard deviation is the square root of the variance \(\text{SD}(x) = \sqrt{\text{Var}(x)}\).
The standard deviation is often denoted \(\sigma\) (sigma) and the variance \(\sigma^2\).
Mean absolute deviation (MAD) is the mean of the absolute deviation from the mean: \(\text{MAD}(x) = \frac{1}{n}\sum_{i=1}^n|x_i - \mu|\).
Range is the difference of the maximal and the minimal value \(\max(x) - \min(x)\).
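These measures can be computed directly from their definitions. The example vector is made up for this sketch. Note that R's built-in var() and sd() divide by \(n - 1\) instead of \(n\), so they give slightly larger values than the formulas above, and that R's mad() is the median absolute deviation, a different measure.

```r
x  <- c(2, 4, 4, 4, 5, 5, 7, 9)
mu <- mean(x)                           # 5

variance     <- mean((x - mu)^2)        # 4
std_dev      <- sqrt(variance)          # 2
mean_abs_dev <- mean(abs(x - mu))       # 1.5
value_range  <- max(x) - min(x)         # 7

# Built-in var() uses the divisor n - 1
var(x)                                  # 4.571429
```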
Variables are standardized by subtracting their mean and then dividing by their standard deviation.
A value from a standardized variable is called a standard score or z-score.
\(z_i = \frac{x_i - \mu}{\sigma}\)
where \(\mu\) is the mean and \(\sigma\) the standard deviation of the vector \(x\).
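A sketch of standardization with a made-up example vector, using the standard deviation with divisor \(n\) as defined above. A standardized variable has mean 0 and standard deviation 1.

```r
x     <- c(2, 4, 4, 4, 5, 5, 7, 9)
mu    <- mean(x)                        # 5
sigma <- sqrt(mean((x - mu)^2))         # 2

z <- (x - mu) / sigma
z                                       # -1.5 -0.5 -0.5 -0.5  0.0  0.0  1.0  2.0

mean(z)                                 # 0
sqrt(mean((z - mean(z))^2))             # 1
```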
Next steps coming (you will receive individual repositories for this):