Logistic Regression #
- Eager learner: builds a model during training rather than waiting until prediction time
- Logistic regression is a classification algorithm that uses the sigmoid function.
- The sigmoid is an S-shaped curve varying between 0 and 1, with asymptotes at the tails
- Unlike linear regression, the output variable is discrete
- The logistic function takes any real value and maps it into the range (0, 1)
- Can work on both continuous and discrete attributes
- When using multiple attributes we cannot compare models as such. Hence we remove/add attributes and check whether each variable's effect on the prediction is significantly different from zero (Wald's test)
- An attribute that does not help the prediction is informally referred to as “totes useless”
- Unlike linear regression, the concept of residuals doesn't apply here, so maximum likelihood is used to fit the curve
- A special type of Generalized Linear Model (GLM)
- Though presented as a logistic function, the coefficients are determined on a linear scale by converting probabilities to log-odds via the logit function (see the sketch after this list)
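A minimal NumPy sketch of the two functions (the names `logistic` and `logit` are mine, not from the notes), showing that they are inverses of each other:

```python
import numpy as np

def logistic(z):
    """S-shaped sigmoid: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds: the inverse of the logistic function."""
    return np.log(p / (1.0 - p))

z = np.array([-4.0, 0.0, 4.0])
p = logistic(z)                  # [0.018, 0.5, 0.982]
print(np.allclose(logit(p), z))  # True: logit undoes logistic
```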
Representation #
\[h_{\theta}(x) = \frac{1}{1 + e^{-\theta^{T}x}} \qquad \text{e: natural logarithm base}\]

\[y \in \lbrace 0, 1 \rbrace \qquad 0: \text{Negative case; } 1: \text{Positive case}\]
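A direct NumPy translation of the hypothesis (a sketch; the `theta` and `x` values are illustrative):

```python
import numpy as np

def hypothesis(theta, x):
    """h_theta(x) = 1 / (1 + e^(-theta^T x)), always in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

theta = np.array([0.5, -1.2, 2.0])  # illustrative parameters
x = np.array([1.0, 0.3, 0.8])       # features; x[0] = 1 is the bias term
print(hypothesis(theta, x))         # ~0.85, read as P(y=1|x;theta)
```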
Interpretation #

The output of the hypothesis function is interpreted as the probability of the target being positive, given features x parameterized by \(\theta\)
\[P(y=1|x;\theta) + P(y=0|x;\theta) = 1 \\ P(y=1|x;\theta) = 1 - P(y=0|x;\theta)\]
Cost Function #

\[J(\theta) = \frac{1}{m} \sum_{i = 1}^{m} Cost(h_{\theta}(x^{(i)}), y^{(i)}) \qquad \begin{cases} Cost(h_{\theta}(x), y) = -\log(h_{\theta}(x)) &\text{if } y = 1 \\ Cost(h_{\theta}(x), y) = -\log(1 - h_{\theta}(x)) &\text{if } y = 0 \end{cases}\]

The cost function can be simplified as below; substituting y = 0 or y = 1 recovers the two cases above:

\[Cost(h_{\theta}(x), y) = -y \log(h_{\theta}(x)) - (1 - y) \log(1 - h_{\theta}(x))\]
\[J(\theta) = -\frac{1}{m} \sum_{i = 1}^{m} \left[ y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) \right]\]

Minimizing this cost is equivalent to maximum likelihood estimation.
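A sketch of the simplified cost in NumPy (the toy data is made up for the example):

```python
import numpy as np

def cost(theta, X, y):
    """Cross-entropy cost J(theta), averaged over the m training examples."""
    h = 1.0 / (1.0 + np.exp(-X @ theta))  # h_theta(x) for every row of X
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])  # toy feature matrix
y = np.array([1.0, 0.0, 1.0])                        # labels in {0, 1}
print(cost(np.zeros(2), X, y))  # log(2) ~ 0.693: h = 0.5 everywhere
```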
Interpretation #
- Logistic regression models the probability of the positive class.
Logit #
- \(logit^{-1}(x) = logistic(x)\)
- Find the y-intercept (the estimated coefficient)
- Find its standard error
- z-value = y-intercept / standard error; the number of standard errors the estimate lies away from the mean of the normal curve
- Use Wald's test to determine statistical significance. If the z-value is less than 2 standard errors away from zero, the coefficient is insignificant (see the sketch below)
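A sketch of the z-value computation (the estimate and standard error are made-up numbers; in practice they come from the fitted model):

```python
from scipy.stats import norm

estimate = 1.8   # hypothetical fitted coefficient (made up)
std_error = 0.6  # its standard error (made up)

z = estimate / std_error              # standard errors away from zero
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided Wald test

print(z)        # 3.0 -> more than 2 standard errors from zero
print(p_value)  # ~0.0027 -> significant at the 5% level
```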
Optimization Algorithms #
- Gradient Descent
- Conjugate Gradient
- BFGS
- L-BFGS
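Any of these can minimize \(J(\theta)\). A sketch using SciPy's `minimize` with BFGS (the toy data and starting point are illustrative; the cost is the one defined above):

```python
import numpy as np
from scipy.optimize import minimize

def cost(theta, X, y):
    """The cross-entropy cost J(theta) from the section above."""
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, 1.0], [1.0, 3.0]])  # toy data
y = np.array([0.0, 0.0, 1.0, 1.0])

# BFGS approximates second-order information, so there is no learning rate
# to tune; "L-BFGS-B" is the limited-memory variant for high-dimensional theta.
result = minimize(cost, x0=np.zeros(2), args=(X, y), method="BFGS")
print(result.x)  # fitted theta
```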
Cross Entropy #
Unlike in linear regression, plugging the sigmoid prediction function into a squared-error cost produces a function with many local minima, so gradient descent cannot be used reliably on it. Hence cross-entropy, also known as the log loss, is used as the loss function: it yields a convex cost with a single global minimum.
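Because the cross-entropy cost is convex, plain gradient descent does converge on it; a minimal sketch (toy data made up, gradient derived from \(J(\theta)\) above):

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, steps=5000):
    """Plain gradient descent on the convex cross-entropy cost.

    The gradient of J(theta) works out to (1/m) * X^T (h - y), the same
    form as in linear regression, but with the sigmoid inside h.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(steps):
        h = 1.0 / (1.0 + np.exp(-X @ theta))
        theta -= lr * (X.T @ (h - y)) / m
    return theta

X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, 1.0], [1.0, 3.0]])  # toy data
y = np.array([0.0, 0.0, 1.0, 1.0])
print(gradient_descent(X, y))  # converges toward the single global minimum
```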