Logistic Regression #
- Eager learner: builds a model during training rather than waiting until prediction time
- Logistic regression is a classification algorithm that uses the sigmoid function.
- The sigmoid is an S-shaped curve varying between 0 and 1, with asymptotes at the tails
- Unlike linear regression, the output variable is discrete
- The logistic function takes any real value and maps it into the range (0, 1)
- Can work on both continuous and discrete attributes
- When using multiple attributes we cannot compare models as such. Hence we remove/add attributes and check whether each variable's effect on the prediction is significantly different from zero (Wald's test)
- An attribute that does not help the prediction is informally referred to as “totes useless”
- Unlike linear regression, the concept of residuals doesn't apply here, so maximum likelihood is used to fit the curve
- A special type of Generalized Linear Model (GLM)
- Though presented as a logistic function, the coefficients are determined on a linear scale by converting probabilities to log-odds via the logit function (see the sketch after this list)
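A minimal NumPy sketch of the two functions (the names `logistic` and `logit` are mine, not from the notes), showing that they are inverses of each other:

```python
import numpy as np

def logistic(z):
    """S-shaped sigmoid: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds: the inverse of the logistic function."""
    return np.log(p / (1.0 - p))

z = np.array([-4.0, 0.0, 4.0])
p = logistic(z)                  # [0.018, 0.5, 0.982]
print(np.allclose(logit(p), z))  # True: logit undoes logistic
```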
Representation #
\[h_{\theta}(x) = \frac{1}{1 + e^{-\theta^{T}x}} \qquad \text{e: natural logarithm base}\]

\[y \in \lbrace 0, 1 \rbrace \qquad 0: \text{Negative case; } 1: \text{Positive case}\]
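A direct NumPy translation of the hypothesis (a sketch; the `theta` and `x` values are illustrative):

```python
import numpy as np

def hypothesis(theta, x):
    """h_theta(x) = 1 / (1 + e^(-theta^T x)), always in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

theta = np.array([0.5, -1.2, 2.0])  # illustrative parameters
x = np.array([1.0, 0.3, 0.8])       # features; x[0] = 1 is the bias term
print(hypothesis(theta, x))         # ~0.85, read as P(y=1|x;theta)
```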
Interpretation #

The output of the hypothesis function is interpreted as the probability of the target being positive, given features x parameterized by \(\theta\)
\[P(y=1|x;\theta) + P(y=0|x;\theta) = 1 \\ P(y=1|x;\theta) = 1 - P(y=0|x;\theta)\]
Cost Function #

\[J(\theta) = \frac{1}{m} \sum_{i = 1}^{m} Cost(h_{\theta}(x^{(i)}), y^{(i)}) \qquad \begin{cases} Cost(h_{\theta}(x), y) = -\log(h_{\theta}(x)) &\text{if } y = 1 \\ Cost(h_{\theta}(x), y) = -\log(1 - h_{\theta}(x)) &\text{if } y = 0 \end{cases}\]

The cost function can be simplified as below; substituting y = 0 or y = 1 recovers the two cases above:

\[Cost(h_{\theta}(x), y) = -y \log(h_{\theta}(x)) - (1 - y) \log(1 - h_{\theta}(x))\]
\[J(\theta) = -\frac{1}{m} \sum_{i = 1}^{m} \left[ y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) \right]\]

Minimizing this cost is equivalent to maximum likelihood estimation.
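A sketch of the simplified cost in NumPy (the toy data is made up for the example):

```python
import numpy as np

def cost(theta, X, y):
    """Cross-entropy cost J(theta), averaged over the m training examples."""
    h = 1.0 / (1.0 + np.exp(-X @ theta))  # h_theta(x) for every row of X
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])  # toy feature matrix
y = np.array([1.0, 0.0, 1.0])                        # labels in {0, 1}
print(cost(np.zeros(2), X, y))  # log(2) ~ 0.693: h = 0.5 everywhere
```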
Interpretation #
- Logistic regression models the probability of the positive class.
Logit #
- \(logit^{-1}(x) = logistic(x)\)
- Find the y-intercept (the estimated coefficient)
- Find its standard error
- z-value = y-intercept / standard error; the number of standard errors the estimate lies away from the mean of the normal curve
- Use Wald's test to determine statistical significance. If the z-value is less than 2 standard errors away from zero, the coefficient is insignificant (see the sketch below)
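A sketch of the z-value computation (the estimate and standard error are made-up numbers; in practice they come from the fitted model):

```python
from scipy.stats import norm

estimate = 1.8   # hypothetical fitted coefficient (made up)
std_error = 0.6  # its standard error (made up)

z = estimate / std_error              # standard errors away from zero
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided Wald test

print(z)        # 3.0 -> more than 2 standard errors from zero
print(p_value)  # ~0.0027 -> significant at the 5% level
```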
Optimization Algorithms #
- Gradient Descent
- Conjugate Gradient
- BFGS
- L-BFGS
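Any of these can minimize \(J(\theta)\). A sketch using SciPy's `minimize` with BFGS (the toy data and starting point are illustrative; the cost is the one defined above):

```python
import numpy as np
from scipy.optimize import minimize

def cost(theta, X, y):
    """The cross-entropy cost J(theta) from the section above."""
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, 1.0], [1.0, 3.0]])  # toy data
y = np.array([0.0, 0.0, 1.0, 1.0])

# BFGS approximates second-order information, so there is no learning rate
# to tune; "L-BFGS-B" is the limited-memory variant for high-dimensional theta.
result = minimize(cost, x0=np.zeros(2), args=(X, y), method="BFGS")
print(result.x)  # fitted theta
```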
Cross Entropy #
Unlike in linear regression, plugging the sigmoid prediction function into a squared-error cost produces a function with many local minima, so gradient descent cannot be used reliably on it. Hence cross-entropy, also known as the log loss, is used as the loss function: it yields a convex cost with a single global minimum.
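Because the cross-entropy cost is convex, plain gradient descent does converge on it; a minimal sketch (toy data made up, gradient derived from \(J(\theta)\) above):

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, steps=5000):
    """Plain gradient descent on the convex cross-entropy cost.

    The gradient of J(theta) works out to (1/m) * X^T (h - y), the same
    form as in linear regression, but with the sigmoid inside h.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(steps):
        h = 1.0 / (1.0 + np.exp(-X @ theta))
        theta -= lr * (X.T @ (h - y)) / m
    return theta

X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, 1.0], [1.0, 3.0]])  # toy data
y = np.array([0.0, 0.0, 1.0, 1.0])
print(gradient_descent(X, y))  # converges toward the single global minimum
```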