Jekyll2022-01-03T08:19:16+00:00https://rohanvarma.me/feed.xmlRohan VarmaSoftware Engineer @ FacebookEfficient Matrix Operations through Diagonalizability2019-07-21T00:00:00+00:002019-07-21T00:00:00+00:00https://rohanvarma.me/Diagonalizability<p>In this blog post, I’ll talk about <em>diagonalizability</em>, what it is, and why it may be useful to diagonalize matrices (when they can be) to efficiently compute operations on matrices. I won’t go into detail <em>when</em> a matrix is diagonalizable, but it will be briefly mentioned in an example.</p>
<p>If \(A\) is similar to a diagonal matrix \(D\), then \(A = QDQ^{-1}\) for some invertible matrix \(Q\) and some diagonal matrix \(D = \begin{bmatrix} \lambda_1 & 0 &… & 0 \\ 0 & \lambda_2 & … & 0 \\ … & … & … & … \\ 0 & 0 & … & \lambda_n \end{bmatrix}\), and \(A\) is <em>diagonalizable</em>. Only square matrices \(A \in \mathbb{M_{nxn}}\) are (possibly) diagonalizable. The matrix \(Q\) has columns given by (distinct, linearly independent) eigenvectors of the linear transformation given by \(A\), and \(D\) is a diagonal matrix whose diagonal entries contain eigenvalues corresponding to the eigenvectors of the linear transformation given by \(A\).</p>
<p>Being able to diagonalize matrices this way is useful, since it allows for easy and fast computation of functions of \(A\).</p>
<p>For example, if we wanted to compute \(A^2\), then the naive matrix multiplication algorithm will take \(O(n^3)\) time, and the best known algorithm is still \(O(n^{2.3})\). However, if we know that \(A\) is diagonalizable, then we have:</p>
\[A^2 = QDQ^{-1}QDQ^{-1} \rightarrow{} A^2 = QD^2Q^{-1}\]
<p>An induction on the power \(k\) we raise \(A\) to shows that in general \(A^k = QD^kQ^{-1}\).</p>
<p>Next, we show that this makes our life easier, since \(D^k = \begin{bmatrix} \lambda_1^k & 0 &… & 0 \\ 0 & \lambda_2^k & … & 0 \\ … & … & … & … \\ 0 & 0 & … & \lambda_n^k \end{bmatrix}\).</p>
<p>First, let’s consider \(D^2\). We have \(D^2 = DD = \begin{bmatrix} \lambda_1 & 0 &… & 0 \\ 0 & \lambda_2 & … & 0 \\ … & … & … & … \\ 0 & 0 & … & \lambda_n \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 &… & 0 \\ 0 & \lambda_2 & … & 0 \\ … & … & … & … \\ 0 & 0 & … & \lambda_n \end{bmatrix}\).</p>
<p>In general, for a matrix product \(C = AB\) we have \(C_{ij} = \sum_{k=1}^{m} A_{ik}B_{kj}\) - the \((i,j)\) entry of \(C\) is the dot product of the \(i\)th row of \(A\) and the \(j\)th column of \(B\). Therefore, we have \(D^2_{i,j} = [0, …, \lambda_i, …][0, …, \lambda_j, …]^T\) so if \(i \neq j\) then \(D^2_{ij} = 0\), otherwise \(D^2_{ij} = \lambda_i^2 = \lambda_j^2\).</p>
<p>Thus, we have \(D^2 = \begin{bmatrix} \lambda_1^2 & 0 &… & 0 \\ 0 & \lambda_2^2 & … & 0 \\ … & … & … & … \\ 0 & 0 & … & \lambda_n^2 \end{bmatrix}\) and an inductive argument gives our above result for \(D^k\).</p>
<p>This drastically reduces the complexity of computing \(A^k\): instead of \(\log(k)\) matrix multiplications, taking \(O(\log(k)n^{2.3})\) with a good algorithm, we have an \(O(n)\) pass to exponentiate across the diagonal, followed by two matrix multiplications (the first matrix multiplication with the diagonal matrix is also more efficient, as the rows of \(Q^{-1}\) are just scaled), for a complexity of \(O(n^{2.3})\) - and the latter complexity does not grow with \(k\).</p>
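<p>As a rough illustration of the saving, here is a minimal Python sketch (not part of the original derivation) that compares repeated multiplication with \(QD^kQ^{-1}\). The \(2 \times 2\) matrix, its hand-computed eigendecomposition, and the helper names <code>matmul</code>, <code>naive_power</code> and <code>diag_power</code> are all assumptions chosen for the example:</p>

```python
# Sketch: compare A^k from repeated multiplication with Q D^k Q^{-1}.
# The matrix and its eigendecomposition are illustrative, worked out by hand.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def naive_power(A, k):
    """Compute A^k with k - 1 successive matrix multiplications (k >= 1)."""
    result = A
    for _ in range(k - 1):
        result = matmul(result, A)
    return result

def diag_power(Q, eigenvalues, Q_inv, k):
    """Compute Q D^k Q^{-1}, exponentiating only the diagonal of D."""
    n = len(eigenvalues)
    # D^k Q^{-1}: row i of Q^{-1} is simply scaled by lambda_i^k.
    scaled = [[(eigenvalues[i] ** k) * Q_inv[i][j] for j in range(n)]
              for i in range(n)]
    return matmul(Q, scaled)

A = [[4.0, 1.0], [2.0, 3.0]]
Q = [[1.0, 1.0], [1.0, -2.0]]              # columns: eigenvectors for 5 and 2
eigenvalues = [5.0, 2.0]
Q_inv = [[2 / 3, 1 / 3], [1 / 3, -1 / 3]]  # inverse of Q, computed by hand

k = 10
via_products = naive_power(A, k)
via_diagonal = diag_power(Q, eigenvalues, Q_inv, k)
max_gap = max(abs(via_products[i][j] - via_diagonal[i][j])
              for i in range(2) for j in range(2))
print(max_gap)  # small; the two results differ only by floating-point rounding
```

<p>For a matrix this small nothing is gained in practice. The sketch only shows that the work that depends on \(k\) is confined to the diagonal entries.</p>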
<p>We can actually make a similar argument for computing any function \(f\) with a matrix argument: \(f(A) = Qf(D)Q^{-1} =Q \begin{bmatrix} f(\lambda_1) & 0 &… & 0 \\ 0 & f(\lambda_2) & … & 0 \\ … & … & … & … \\ 0 & 0 & … & f(\lambda_n) \end{bmatrix}Q^{-1}\), when \(A\) is diagonalizable.</p>
<p>To show why this is the case, let’s consider the Taylor series expansion of \(f(x)\), where \(x \in \mathbb{R}\), around \(0\) (we are assuming here that \(f\) is infinitely differentiable around \(0\) and equal to its Taylor series):</p>
\[f(x) = f(0) + f'(0)x + \frac{1}{2}f''(0)x^2 + \frac{1}{3!}f'''(0)x^3 + … = \sum_{n=0}^{\infty}\frac{f^{(n)}(0)x^n}{n!}\]
<p>To bring in the matrix, we can substitute \(A\) in place of \(x\), and sums now become matrix sums and products scale the matrix:</p>
\[f(A) = f(0)I_{n} + f'(0)(A) + \frac{1}{2}f''(0)A^2 + \frac{1}{3!}f'''(0)A^3 + …\]
<p>Next, since \(A\) is diagonalizable and \(A = QDQ^{-1}\), we can substitute:</p>
\[f(A) = f(QDQ^{-1}) = f(0)I_{n} + f'(0)(QDQ^{-1}) + \frac{1}{2}f''(0)(QDQ^{-1})^2 + \frac{1}{3!}f'''(0)(QDQ^{-1})^3 + …\]
<p>Writing the above expression as a sum yields the following:</p>
\[f(A) = \sum_{i=0}^{\infty} \frac{f^i(0)(QDQ^{-1})^i}{i!}\]
<p>The term \((QDQ^{-1})^i\) should be familiar - we know from above that that is just \(QD^iQ^{-1}\). Thus, we’re left with \(f(QDQ^{-1}) = \sum_{i=0}^{\infty} Q\frac{f^i(0)D^i}{i!}Q^{-1} = Q(\sum_{i=0}^{\infty} \frac{f^i(0)D^i}{i!})Q^{-1}\).</p>
<p>But \(f(D) = \sum_{i=0}^{\infty} \frac{f^i(0)D^i}{i!}\), so \(f(QDQ^{-1}) = Qf(D)Q^{-1}\).</p>
<p>Next, all that’s left to show is that \(f(D) = \begin{bmatrix} f(\lambda_1) & 0 &… & 0 \\ 0 & f(\lambda_2) & … & 0 \\ … & … & … & … \\ 0 & 0 & … & f(\lambda_n) \end{bmatrix}\). We use a similar approach with the Taylor series expansion around \(0\):</p>
\[f(D) = f(0)I_{n} + f'(0)D + \frac{1}{2}f''(0)D^2 + \frac{1}{3!}f'''(0)D^3 + …\]
\[= f(0) \begin{bmatrix} 1 & 0 &… & 0 \\ 0 & 1 & … & 0 \\ … & … & … & … \\ 0 & 0 & … & 1 \end{bmatrix} + f'(0)\begin{bmatrix} \lambda_1 & 0 &… & 0 \\ 0 & \lambda_2 & … & 0 \\ … & … & … & … \\ 0 & 0 & … & \lambda_n \end{bmatrix} + \frac{1}{2}f''(0)\begin{bmatrix} \lambda_1^2 & 0 &… & 0 \\ 0 & \lambda_2^2 & … & 0 \\ … & … & … & … \\ 0 & 0 & … & \lambda_n^2 \end{bmatrix} + ...\]
<p>This is just a bunch of matrix scaling and summation, so let’s move some things inside the matrices:</p>
\[\begin{bmatrix} f(0) & 0 &… & 0 \\ 0 & f(0) & … & 0 \\ … & … & … & … \\ 0 & 0 & … & f(0) \end{bmatrix} + \begin{bmatrix} f'(0) \lambda_1 & 0 &… & 0 \\ 0 & f'(0)\lambda_2 & … & 0 \\ … & … & … & … \\ 0 & 0 & … & f'(0)\lambda_n \end{bmatrix} + \begin{bmatrix} \frac{1}{2}f''(0)\lambda_1^2 & 0 &… & 0 \\ 0 & \frac{1}{2}f''(0)\lambda_2^2 & … & 0 \\ … & … & … & … \\ 0 & 0 & … & \frac{1}{2}f''(0)\lambda_n^2 \end{bmatrix} + …\]
<p>Now, adding up the matrices, we get:</p>
\[\begin{bmatrix} \sum_{i=0}^{\infty} \frac{f^i(0)\lambda_1^i}{i!}& 0 &… & 0 \\ 0 & \sum_{i=0}^{\infty}\frac{f^i(0)\lambda_2^i}{i!} & … & 0 \\ … & … & … & … \\ 0 & 0 & … & \sum_{i=0}^{\infty}\frac{f^i(0)\lambda_n^i}{i!} \end{bmatrix}\]
<p>But each nonzero term is the Taylor series expansion for the corresponding \(\lambda_i\), so this is just</p>
<p>\(f(D) = \begin{bmatrix} f(\lambda_1) & 0 &… & 0 \\ 0 & f(\lambda_2) & … & 0 \\ … & … & … & … \\ 0 & 0 & … & f(\lambda_n) \end{bmatrix}\), so the proof is complete.</p>
<p>This is useful since it allows us to compute functions over matrices much more easily, as we saw above with exponentiation. Consider for example \(f(A) = \exp(A)\). A valid notion of exponentiating matrices can be given by defining \(\exp(A)\) similar to \(\exp(x), x \in \mathbb{R}\):</p>
\[\exp(A) = A^0 + A^1 + \frac{A^2}{2} + \frac{A^3}{3!} + … = \sum_{i=0}^{\infty} \frac{A^i}{i!}\]
<p>(As an aside, it turns out that the above sum is a much more natural way to think about the \(\exp\) function, instead of thinking of it as raising \(e\) to some power. Read <a href="https://www.quora.com/Is-it-just-a-coincidence-that-the-derivative-of-e-x-is-also-e-x-or-is-there-some-feature-in-the-function-due-to-which-this-happens/answer/Alon-Amit">this fascinating Quora answer</a> to find out why.)</p>
<p>Clearly, computing matrix powers like this to approximate \(\exp(A)\) is very expensive. But if \(A\) is diagonalizable, we have a much less expensive approach to computing \(\exp(A)\):</p>
\[\exp(A) = \exp(QDQ^{-1}) = Q\exp(D)Q^{-1} = Q \begin{bmatrix} \exp(\lambda_1) & 0 &… & 0 \\ 0 & \exp(\lambda_2) & … & 0 \\ … & … & … & … \\ 0 & 0 & … & \exp(\lambda_n) \end{bmatrix}Q^{-1}\]
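<p>The two routes to \(\exp(A)\) can be compared numerically. The following is a small sketch under stated assumptions: the example matrix and its eigendecomposition are made up for illustration, and <code>exp_taylor</code> and <code>exp_diagonalized</code> are hypothetical helper names, not a library API:</p>

```python
import math

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def exp_taylor(A, terms):
    """Approximate exp(A) by summing the first `terms` terms of the series."""
    n = len(A)
    total = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # A^0 = I
    term = [row[:] for row in total]
    for i in range(1, terms):
        # term becomes A^i / i! by multiplying the previous term by A / i.
        term = [[v / i for v in row] for row in matmul(term, A)]
        total = [[total[r][c] + term[r][c] for c in range(n)] for r in range(n)]
    return total

def exp_diagonalized(Q, eigenvalues, Q_inv):
    """Compute Q exp(D) Q^{-1}, applying exp only to the eigenvalues."""
    n = len(eigenvalues)
    scaled = [[math.exp(eigenvalues[i]) * Q_inv[i][j] for j in range(n)]
              for i in range(n)]
    return matmul(Q, scaled)

# Illustrative matrix with eigenvalues 5 and 2 (decomposition worked out by hand).
A = [[4.0, 1.0], [2.0, 3.0]]
Q = [[1.0, 1.0], [1.0, -2.0]]
eigenvalues = [5.0, 2.0]
Q_inv = [[2 / 3, 1 / 3], [1 / 3, -1 / 3]]

via_series = exp_taylor(A, terms=60)   # 59 matrix multiplications
via_diagonal = exp_diagonalized(Q, eigenvalues, Q_inv)
gap = max(abs(via_series[i][j] - via_diagonal[i][j])
          for i in range(2) for j in range(2))
print(gap)
```

<p>The series version needs one matrix multiplication per term, while the diagonalized version needs \(n\) scalar exponentials and one full matrix product once \(Q\), \(D\) and \(Q^{-1}\) are known.</p>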
<p><strong>An application: predicting the growth of a portfolio through solving linear differential equations</strong></p>
<p>Consider a simple portfolio given by dollars invested into two investments (say IBM stock and a money market fund). Then, this portfolio lives in a two-dimensional vector space \(P = \begin{bmatrix} Stock \\ Money Market \end{bmatrix}\).</p>
<p>Assume further that we expect IBM stock to have a long-term growth rate of 3% (indeed an egregious assumption for any individual stock), and to issue a 2% dividend every year. We also will assume that the money market fund will provide a consistent 2% return on investment. Let’s say that we initially have $100 invested in IBM stock, and $500 invested in the money market fund.</p>
<p>With these assumptions, the growth of our portfolio every year can be given by a linear transformation \(T\) whose matrix is given by \(A = [T(v_1), T(v_2)]\) where \(v_1 = (1,0), v_2 = (0,1)\) are the standard basis vectors.</p>
<p>We have \(T(v_1) = [1.03, 0.02]^T\) since one dollar of stock returns 1.03 in the stock and 0.02 as a dividend, and \(T(v_2)= [0, 1.02]^T\) since one dollar in the money market fund does not return anything in the stock, and returns 1.02 in the fund.</p>
<p>Thus, the matrix of the linear transformation is given by \(A = \begin{bmatrix} 1.03 & 0 \\ 0.02 & 1.02 \end{bmatrix}\), and we can model the growth of our portfolio as a first-order linear differential equation:</p>
\[x'(t) = Ax(t)\]
<p>If \(A\) were diagonalizable, we’d have a simple, exact solution, so we attempt to diagonalize \(A\). Since \(A\) is lower triangular, we have that the eigenvalues for \(A\) are its diagonal entries \(\lambda_1 = 1.03, \lambda_2 = 1.02\). Since \(A\)’s eigenvalues are distinct, both have algebraic multiplicity one, so they must have geometric multiplicity one, and thus \(A\) is diagonalizable.</p>
<p>We compute \(ker(A - 1.03I_2)\) and \(ker(A - 1.02I_2)\) to see that the eigenspaces corresponding to \(\lambda_1\) and \(\lambda_2\) are \(E_{\lambda_1} = Span(0.5,1)\) and \(E_{\lambda_2} = Span(0,1)\), respectively.</p>
<p>Thus, we have \(A = QDQ^{-1}\) with \(Q = \begin{bmatrix} 0.5 & 0 \\ 1 & 1 \end{bmatrix}, D = \begin{bmatrix} 1.03 & 0 \\ 0 & 1.02 \end{bmatrix}\), and \(x'(t) = QDQ^{-1}x(t)\).</p>
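<p>A quick numerical sanity check of this factorization, as a sketch: the inverse \(Q^{-1} = \begin{bmatrix} 2 & 0 \\ -2 & 1 \end{bmatrix}\) is worked out by hand here and is an addition for this check, as are the helper names.</p>

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def matvec(M, v):
    """Multiply a matrix by a vector."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

A = [[1.03, 0.0], [0.02, 1.02]]
Q = [[0.5, 0.0], [1.0, 1.0]]        # columns: eigenvectors (0.5, 1) and (0, 1)
D = [[1.03, 0.0], [0.0, 1.02]]
Q_inv = [[2.0, 0.0], [-2.0, 1.0]]   # hand-computed inverse of Q

# Each column v of Q should satisfy A v = lambda v.
eigen_residuals = []
for col, lam in enumerate([1.03, 1.02]):
    v = [Q[0][col], Q[1][col]]
    Av = matvec(A, v)
    eigen_residuals.append(max(abs(Av[i] - lam * v[i]) for i in range(2)))

# Q D Q^{-1} should reproduce A up to rounding.
reconstructed = matmul(matmul(Q, D), Q_inv)
reconstruction_error = max(abs(reconstructed[i][j] - A[i][j])
                           for i in range(2) for j in range(2))
print(eigen_residuals, reconstruction_error)
```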
<p>To simplify things, we can multiply both sides by \(Q^{-1}\) to get \((Q^{-1}x(t))' = DQ^{-1}x(t)\). Now, we let \(y(t) = Q^{-1}x(t)\) so that \(y'(t) = Dy(t)\). Expanding this, we have</p>
<p>\(\begin{bmatrix} y_1'(t) \\ y_2'(t) \end{bmatrix} = \begin{bmatrix} 1.03 & 0 \\ 0 & 1.02 \end{bmatrix}\begin{bmatrix} y_1(t) \\ y_2(t) \end{bmatrix}\) , and we get the uncoupled differential equations</p>
<p>\(y_1'(t) = 1.03y_1(t)\), \(y_2'(t) = 1.02y_2(t)\) which we solve separately to get \(y_1(t) = C_1e^{1.03t}, y_2(t) = C_2e^{1.02t}\), for some constants \(C_1, C_2\). Now, since \(x(t) = Qy(t)\), we recover \(x(t)\) by applying \(Q\):</p>
<p>\(x(t) = \begin{bmatrix} 0.5 & 0 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} C_1e^{1.03t} \\ C_2e^{1.02t} \end{bmatrix}\). Multiplying, we have that</p>
<p>\(x_1(t) = \frac{1}{2}C_1e^{1.03t}\) and \(x_2(t) = C_1e^{1.03t} + C_2e^{1.02t}\). Taking into account our initial conditions, we have that \(x_1(0) = \frac{1}{2}C_1 = 100\) and \(x_2(0) = C_1 + C_2 = 500\) , so we have that \(C_1 = 200, C_2 = 300\). Thus, our final solution is \(x_1(t) = 100e^{1.03t}\) and \(x_2(t) = 200e^{1.03t} + 300e^{1.02t}\), in which we can plug in a value for \(t\) to estimate the value of our portfolio at that time.</p>Hessians - A tool for debugging neural network optimization2019-03-03T00:00:00+00:002019-03-03T00:00:00+00:00https://rohanvarma.me/Optimization<p>Optimizing deep neural networks has long followed a general tried-and-true template. Generally, we randomly initialize our weights, which can be thought of as randomly picking a place on the “hill” which is the optimization landscape. There are some tricks we can do to achieve better initialization schemes, such as the He or Xavier initialization.</p>
<p>Then, we follow the gradient and update our parameters until we’ve met some stopping criterion. This is known as gradient descent. More commonly, stochastic gradient descent is used to add noise (via approximating the gradient instead of computing the exact gradient) to the optimization process and speed up training time. Additionally, vanilla SGD is less common now, with practitioners instead opting for methods that add in momentum or adaptive learning rate techniques.</p>
<p><img src="https://raw.githubusercontent.com/ucla-labx/deeplearningbook-notes/master/images/along_the_ravine.png" alt="" /></p>
<p>It’s natural to question theoretical - and practical - aspects of this common technique. What guarantees does it provide, and what are some conditions in which it could fail? Here, I’ll attempt to dissect some of these questions so we can better understand what to watch out for when optimizing deep networks.</p>
<h5 id="precursor-convexity-and-the-hessian">Precursor: Convexity and the Hessian</h5>
<p>Say we want to optimize \(f(x)\) where \(x\) are our parameters that we want to learn. The question of convexity arises. Is \(f\) convex, meaning that any local minimum we achieve is the optimal solution to our problem? Or is \(f\) nonconvex, in which case local minima can be larger than global minima? It turns out that the Hessian, or the matrix of \(f\)’s second derivatives, can tell us a lot about this.</p>
<p>The Hessian is real and symmetric, since in general we assume that the second derivatives exist and \(\frac{\partial^2 y}{\partial x_1 \partial x_2} = \frac{\partial^2 y}{\partial x_2 \partial x_1}\) for the functions that we are considering (Schwarz’s theorem provides the conditions that need to be true for this to hold). For example, the Hessian for a function \(y = f(x_1, x_2)\) could be expressed as</p>
\[\begin{bmatrix} \frac{\partial^2 y}{\partial x_1^2} & \frac{\partial^2 y}{\partial x_1 \partial x_2} \\ \frac{\partial^2 y}{\partial x_2 \partial x_1} & \frac{\partial^2 y}{\partial x_2^2}\end{bmatrix}\]
<p>Real symmetric matrices have nice properties:</p>
<ul>
<li>All eigenvalues are real and distinct eigenvalues correspond to distinct eigenvectors</li>
<li>The eigenvectors of distinct values are orthogonal, and therefore form a basis for \(R^n\), where \(n\) is the dimension of the row/column space of the matrix.</li>
<li>Thus, the matrix is diagonalizable, i.e. \(Q^{-1} H Q = D\), where \(D\) is a diagonal matrix with the eigenvalues on the diagonal, and \(Q\)’s column vectors form an orthonormal basis for \(R^n\). This is a result of the <a href="http://www.math.lsa.umich.edu/~speyer/417/SpectralTheorem.pdf">Spectral Theorem</a>.</li>
</ul>
<p>Next, a <em>positive definite</em> matrix is a symmetric matrix that has all positive eigenvalues. One way to determine if a function is <em>convex</em> is to check if its Hessian is positive definite.</p>
<p>A symmetric matrix \(H\) with all positive eigenvalues satisfies \(z^T H z > 0\) for any nonzero real vector \(z\). To see why all positive eigenvalues imply this, first let’s consider the case where \(z\) is an eigenvector of \(H\). Since \(Hz = \lambda z\) we have</p>
<p>\(z^T H z = z^T\lambda z = \lambda z^Tz = \lambda \vert \vert z\vert \vert^2 > 0\) since \(\lambda >0\)</p>
<p>To prove this for an arbitrary vector \(z\), we first note that we can diagonalize \(H\) as follows:</p>
\[z^T H z = z^T Q \Lambda Q^{-1}z\]
<p>Where \(Q\) is a matrix whose columns are (distinct) eigenvectors of \(H\) and \(\Lambda\) is a diagonal matrix with the corresponding eigenvalues on its diagonal. We know that this diagonalization is possible since \(H\) is real and symmetric.</p>
<p>As mentioned, the eigenvectors are orthogonal. Since \(Q\) is a matrix whose columns are the eigenvectors, \(Q\) is an orthogonal matrix, so we have \(Q^{-1} = Q^T\), giving us:</p>
\[z^T Q \Lambda Q^Tz >0\]
<p>Let’s define \(s = Q^T z\), so we now have \(s^T \Lambda s > 0\). Taking</p>
\[s = \begin{bmatrix} s_1 \\ … \\ s_n \end{bmatrix}\]
<p>and</p>
\[\Lambda = \begin{bmatrix} \lambda_1 & 0 &… & 0 \\ 0 & \lambda_2 & … & 0 \\ … & … & … & … \\ 0 & 0 & … & \lambda_n \end{bmatrix}\]
<p>We now have</p>
\[\begin{bmatrix} s_1 & … & s_n \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 &… & 0 \\ 0 & \lambda_2 & … & 0 \\ … & … & … & … \\ 0 & 0 & … & \lambda_n \end{bmatrix} \begin{bmatrix} s_1 \\ … \\ s_n \end{bmatrix} = \begin{bmatrix} s_1 & … & s_n \end{bmatrix} \begin{bmatrix} \lambda_1 s_1 \\ … \\ \lambda_ns_n \end{bmatrix} = \sum_{i=1}^{n}\lambda_is_i^2 > 0\]
<p>Which is true since all the eigenvalues are positive and \(s \neq 0\) whenever \(z \neq 0\) (because \(Q^T\) is invertible).</p>
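<p>The identity \(z^T H z = \sum_i \lambda_i s_i^2\) with \(s = Q^T z\) can be checked on a small case. This is only a sketch: the matrix \(H\), its eigenvalues \(3\) and \(1\), the orthonormal eigenvectors, and the test vectors are assumptions chosen by hand for illustration.</p>

```python
import math

H = [[2.0, 1.0], [1.0, 2.0]]       # symmetric, eigenvalues 3 and 1
eigenvalues = [3.0, 1.0]
r = 1 / math.sqrt(2)
Q = [[r, r], [r, -r]]              # columns: orthonormal eigenvectors of H

def quadratic_form(H, z):
    """Compute z^T H z directly."""
    n = len(z)
    return sum(z[i] * H[i][j] * z[j] for i in range(n) for j in range(n))

def via_eigenvalues(Q, eigenvalues, z):
    """Compute sum_i lambda_i * s_i^2 with s = Q^T z."""
    n = len(z)
    s = [sum(Q[i][j] * z[i] for i in range(n)) for j in range(n)]
    return sum(lam * s_j ** 2 for lam, s_j in zip(eigenvalues, s))

test_vectors = [[1.0, 0.0], [1.0, -1.0], [-3.0, 2.0], [0.5, 4.0]]
results = [(quadratic_form(H, z), via_eigenvalues(Q, eigenvalues, z))
           for z in test_vectors]
for z, (direct, spectral) in zip(test_vectors, results):
    print(z, direct, spectral)     # the two values agree and are positive
```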
<p>As an example of using this analysis to prove the convexity of a machine learning problem, we can take the loss function of the (l2-regularized) SVM:</p>
\[L(w) = \lambda \sum_n w_n^2 + \sum_n \max (0, 1 - y_n(w^Tx_n))\]
<p>Wherever the hinge term is differentiable, the derivatives for each \(w_m\) are \(\frac{\partial L}{\partial w_m} = 2\lambda w_m + \sum_n \textbf{1} (y_n w^T x_n < 1)(-y_nx_{nm})\), where \(\textbf{1}\) denotes the indicator function that returns its second argument if its first argument is true and \(0\) otherwise, and \(x_{nm}\) denotes the \(m\)th feature of the \(n\)th feature vector.</p>
<p>The second derivatives can be characterized as \(\frac{\partial^2 L}{\partial w_m \partial w_k}, k \neq m\) and \(\frac{\partial^2 L}{\partial w_m^2}\). The latter derivatives will appear as the diagonal entries of the Hessian.</p>
<p>For the first case, since our expression \(\frac{\partial L}{\partial w_m}\) is locally constant with respect to \(w_k\) (the indicator only changes value at the kinks of the hinge), the derivative is simply \(0\). For the second case, the derivative is simply \(2\lambda\). Therefore, our Hessian is the following diagonal matrix:</p>
\[\begin{bmatrix} 2\lambda & 0 &… & 0 \\ 0 & 2\lambda & … & 0 \\ … & … & … & … \\ 0 & 0 & … & 2\lambda \end{bmatrix}\]
<p>Since this is a diagonal matrix, the eigenvalues are simply the entries on the diagonal. Since the regularization constant \(\lambda > 0\), the Hessian is positive definite, therefore the above formulation of the hinge loss is a convex problem. We could have alternatively shown that \(z^T H z > 0\) as well, since \(H z\) simply scales \(z\) by a positive scalar.</p>
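<p>One way to sanity-check a Hessian like this is to estimate it with finite differences at a point away from the kinks of the hinge. The sketch below does that for \(L(w)\) exactly as written above, for which the regularizer \(\lambda \sum_n w_n^2\) contributes \(2\lambda\) to each diagonal entry. The tiny dataset, the point \(w\), and the helper names are made up for illustration.</p>

```python
def svm_loss(w, X, y, lam):
    """L2-regularized hinge loss: lam * ||w||^2 + sum_n max(0, 1 - y_n w.x_n)."""
    reg = lam * sum(w_m ** 2 for w_m in w)
    hinge = sum(max(0.0, 1.0 - y_n * sum(w_m * x_m for w_m, x_m in zip(w, x_n)))
                for x_n, y_n in zip(X, y))
    return reg + hinge

def numerical_hessian(f, w, h=1e-3):
    """Estimate every second partial derivative with central differences."""
    n = len(w)

    def shifted(i, di, j, dj):
        v = list(w)
        v[i] += di
        v[j] += dj
        return f(v)

    return [[(shifted(i, h, j, h) - shifted(i, h, j, -h)
              - shifted(i, -h, j, h) + shifted(i, -h, j, -h)) / (4 * h * h)
             for j in range(n)] for i in range(n)]

# Hypothetical toy data; every point violates the margin at w, so no kink is nearby.
X = [[1.0, 2.0], [-1.0, -1.0], [2.0, 0.0]]
y = [1.0, -1.0, 1.0]
lam = 0.5
w = [0.1, 0.1]

hessian = numerical_hessian(lambda v: svm_loss(v, X, y, lam), w)
print(hessian)  # close to 2 * lam on the diagonal and 0 elsewhere
```

<p>The hinge term is piecewise linear, so it contributes nothing to the estimate here; only the regularizer produces curvature.</p>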
<h5 id="okbut-what-about-neural-networks">Ok…But what about Neural Networks?</h5>
<p>The optimization problems for neural networks are generally highly nonconvex, making optimizing deep neural networks much tougher. Fortunately, there are still some applications of the Hessian that we could use.</p>
<h6 id="local-minima">Local Minima</h6>
<p>First, it may be worth knowing whether we are at a local minimum at any point during our optimization process. In some cases, if this occurs early in our training process or at a high value of our objective function, then we could consider restarting our training/initialization process or randomly perturbing our parameters to “kick” them out of the bad local minimum.</p>
<p>Even though the Hessian may not be positive definite at any given point, we can check if we’re at a local minimum by examining the Hessian. This is basically the second derivative test in single-variable calculus. Consider the Taylor Series expansion of our objective \(f\) around \(x_0\):</p>
\[f(x)\approx f(x_0) + (x-x_0)^T\nabla_xf(x_0) + \frac{1}{2}(x-x_0)^T\textbf{H}(x-x_0)\]
<p>If we’re at a critical point \(x_0\), it is a potential local minimum and we have \(\nabla_x f(x_0) = 0\). Considering an update \(x = x_0 - \epsilon u\) in some direction \(u\), we have</p>
\[f(x_0-\epsilon u) \approx f(x_0) + \frac{1}{2}(x_0 - \epsilon u - x_0)^T\textbf{H}(x_0 - \epsilon u - x_0)\]
\[f(x_0-\epsilon u) \approx f(x_0) + \frac{1}{2}\epsilon^2 u^T\textbf{H} u\]
<p>If the Hessian is positive definite at \(x_0\), then it tells us that any (nonzero) direction \(u\) in which we choose to travel will result in an increased value of the objective function (since the term \(\frac{1}{2}\epsilon^2 u^T\textbf{H} u > 0\)), so we are at a local minimum. We can similarly use the idea of negative definiteness to see if we are at a local maximum.</p>
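<p>To make the second-order argument concrete, the sketch below steps away from the critical point of a small quadratic in several directions and compares the actual change in the objective with \(\frac{1}{2}\epsilon^2 u^T\textbf{H} u\). The function, its Hessian, and the step size are assumptions picked for illustration.</p>

```python
import math

def f(p):
    """Toy objective with a critical point at the origin."""
    x, y = p
    return x * x + x * y + 2 * y * y

H = [[2.0, 1.0], [1.0, 4.0]]   # Hessian of f (constant, positive definite)
x0 = (0.0, 0.0)
eps = 0.1

changes = []
for step in range(12):
    angle = 2 * math.pi * step / 12
    u = (math.cos(angle), math.sin(angle))          # unit direction
    moved = (x0[0] - eps * u[0], x0[1] - eps * u[1])
    actual = f(moved) - f(x0)
    predicted = 0.5 * eps ** 2 * sum(u[i] * H[i][j] * u[j]
                                     for i in range(2) for j in range(2))
    changes.append((actual, predicted))

# Every direction increases the objective, as expected at a local minimum.
print(min(actual for actual, _ in changes))
```

<p>Because the toy objective is exactly quadratic, the second-order prediction matches the actual change up to rounding; for a general objective it only holds approximately for small \(\epsilon\).</p>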
<p><strong>Are local minima really that big of a problem?</strong></p>
<p>Arguably, the goal of fitting deep neural networks may not be to reach the global minimum, as this could result in overfitting, but rather to find an acceptable local minimum that obtains good performance on the test set. In <a href="https://arxiv.org/pdf/1412.0233.pdf">this paper</a>, the authors state that “while local minima are numerous, they are relatively easy to find, and they are all more or less equivalent in terms of performance on the test set”. In the <a href="http://www.deeplearningbook.org/">Deep Learning Book</a>, Goodfellow et al. state that early convergence to a high value for the objective is generally <em>not</em> due to local minima, and this can be verified by plotting the norm of the gradient throughout the training process. The gradients at these supposed local minima exhibit healthy norms, indicating that the point is not actually a local minimum.</p>
<h6 id="a-poorly-conditioned-hessian">A Poorly Conditioned Hessian</h6>
<p>There are some cases where the regular optimization process - a small step in the direction of the negative gradient - could actually fail, and increase the value of our objective function - counteractive to our goal. This happens due to the <em>curvature</em> of the local region of our objective function, and can be investigated with the Hessian as well.</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/rohan-blog/gh-pages/curvature.png" alt="" /></p>
<p>How much will a gradient descent update change the value of our objective? First, let’s again look at the Taylor series expansion for an update \(x \leftarrow x_0 - \epsilon \textbf{g}\) where \(\textbf{g} = \nabla f(x_0)\):</p>
\[f(x_0 - \epsilon\textbf{g})\approx f(x_0) + (x_0 - \epsilon \textbf{g}-x_0)^T\textbf{g} + \frac{1}{2}(x_0 - \epsilon \textbf{g}-x_0)^T\textbf{H}(x_0 - \epsilon \textbf{g}-x_0)\]
\[f(x_0 - \epsilon\textbf{g})\approx f(x_0) -\epsilon\textbf{g}^T\textbf{g} + \frac{1}{2}\epsilon^2\textbf{g}^T\textbf{H}\textbf{g}\]
<p>We can see that the decrease in the objective is related to 2nd order information, specifically the term \(g^T H g\). If this term is negative, then the decrease in our objective is greater than our (scaled) squared gradient norm; if it is positive, then the decrease is less; if it is zero, the decrease is completely given by first-order information.</p>
<p>“Poor Conditioning” refers to the Hessian at the point \(x_0\) being such that it results in an <em>increased</em> value of our objective function. This can happen if the term that indicates our decrease in our objective is actually greater than \(0\):</p>
<p>\(-\epsilon g^Tg + \frac{1}{2}\epsilon^2g^T Hg > 0\) which corresponds to \(\epsilon g^T g < \frac{1}{2}\epsilon^2g^THg\)</p>
<p>This can happen if the curvature along the gradient direction, \(g^T H g\), is very large (rearranging the inequality, whenever \(\epsilon > \frac{2g^Tg}{g^THg}\)), in which case we’d want to use a smaller learning rate than \(\epsilon\) to offset some of the influence of the curvature. An intuitive justification for this is that due to the large magnitude of curvature, we’re more “unsure” of our gradient updates, so we want to take a smaller step, and therefore have a smaller learning rate.</p>
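<p>A small numerical sketch of this failure mode follows. The quadratic objective with Hessian \(\text{diag}(1, 100)\), the starting point, and the two learning rates are assumptions chosen so that one step size decreases the objective and the other increases it.</p>

```python
H_DIAG = (1.0, 100.0)   # Hessian of f is diag(1, 100): very different curvatures

def f(x):
    """f(x) = 0.5 * x^T H x for the diagonal Hessian above."""
    return 0.5 * sum(h * x_i ** 2 for h, x_i in zip(H_DIAG, x))

def grad(x):
    return [h * x_i for h, x_i in zip(H_DIAG, x)]

def predicted_change(x0, eps):
    """Second-order estimate: -eps * g^T g + 0.5 * eps^2 * g^T H g."""
    g = grad(x0)
    gTg = sum(g_i ** 2 for g_i in g)
    gHg = sum(h * g_i ** 2 for h, g_i in zip(H_DIAG, g))
    return -eps * gTg + 0.5 * eps ** 2 * gHg

def actual_change(x0, eps):
    """Change in f after one gradient descent step of size eps."""
    g = grad(x0)
    stepped = [x_i - eps * g_i for x_i, g_i in zip(x0, g)]
    return f(stepped) - f(x0)

x0 = [1.0, 1.0]
small_step = actual_change(x0, 0.01)   # negative: the objective decreases
large_step = actual_change(x0, 0.03)   # positive: the same direction now hurts
print(small_step, large_step)
```

<p>Here \(\frac{2g^Tg}{g^THg} \approx 0.02\), so a learning rate of \(0.01\) helps while \(0.03\) overshoots along the high-curvature coordinate.</p>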
<h6 id="saddle-points">Saddle Points</h6>
<p>Getting stuck at a <em>saddle point</em> is a very real issue for optimizing deep neural networks, and arguably a more important issue than getting stuck at a local minimum (see <a href="http://www.offconvex.org/2016/03/22/saddlepoints/">http://www.offconvex.org/2016/03/22/saddlepoints/</a>). One explanation of this is that it’s much more likely to arrive at a saddle point than a local minimum or maximum: under the heuristic assumption that each of the \(n\) eigenvalues of the Hessian at a critical point is independently positive or negative with equal probability, the probability that a critical point in an \(n\) dimensional optimization space is a local minimum is just \(\frac{1}{2^n}\) (and likewise for a local maximum).</p>
<p>A saddle point is defined as a point with \(0\) gradient at which the Hessian is neither positive definite nor negative definite. It is possible that learning would stop if we are relying only on first-order information (i.e. since \(\nabla_x f(x) = 0\), the weights will not be updated) but in practice, using techniques such as momentum reduces the chance of this.</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/rohan-blog/gh-pages/saddle.png" alt="" /></p>
<p>Around a saddle point, the expansion of our objective is similar to the expansion at a local minimum:</p>
\[f(x_0-\epsilon u) \approx f(x_0) + \frac{1}{2}\epsilon^2 u^T\textbf{H} u\]
<p>Here, \(\textbf{H}\) is not positive definite, so we may be able to pick certain directions \(u\) that increase or decrease the value of our objective. Concretely, if we pick \(u\) such that \(u^T \textbf{H}u < 0\), then \(u\) is a direction that would decrease the value of our objective (since the term \(\frac{1}{2}\epsilon^2 u^T\textbf{H} u < 0\) decreases the value of our objective) , and we can update our parameters with \(u\). Ideally we’d like to find a \(u\) such that \(u^T\textbf{H}u\) is significantly less than \(0\), so that we have a steeper direction of descent to travel in.</p>
<p>How do we know what functions have saddle points that are “well-behaved” like this? Ge et al. in their <a href="https://arxiv.org/abs/1503.02101">paper</a> introduced the notion of “strict saddle” functions. One of the properties of these “strict saddle” functions is that at all saddle points \(x\), the Hessian has a (significant) negative eigenvalue, which means that we can pick the eigenvector corresponding to this eigenvalue as our direction to travel in if we are at a saddle point.</p>
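<p>For a \(2 \times 2\) symmetric Hessian the most negative eigenvalue and its eigenvector have a closed form, which makes it easy to sketch the idea of moving along a direction of negative curvature. The objective below and the helper <code>smallest_eigenpair</code> are assumptions written for this example; a real network would need an iterative eigenvalue method instead.</p>

```python
import math

def smallest_eigenpair(H):
    """Smallest eigenvalue and a unit eigenvector of a symmetric 2x2 matrix."""
    a, b, d = H[0][0], H[0][1], H[1][1]
    lam = (a + d) / 2 - math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    if b == 0.0:
        v = (1.0, 0.0) if a <= d else (0.0, 1.0)
    else:
        v = (b, lam - a)        # solves (a - lam) * v1 + b * v2 = 0
    norm = math.hypot(v[0], v[1])
    return lam, (v[0] / norm, v[1] / norm)

H = [[1.0, 2.0], [2.0, 1.0]]    # eigenvalues 3 and -1: indefinite

def f(p):
    """Toy objective 0.5 * p^T H p, which has a saddle point at the origin."""
    x, y = p
    return 0.5 * (H[0][0] * x * x + 2 * H[0][1] * x * y + H[1][1] * y * y)

lam_min, u = smallest_eigenpair(H)
eps = 0.1
x0 = (0.0, 0.0)
moved = (x0[0] - eps * u[0], x0[1] - eps * u[1])
change = f(moved) - f(x0)
print(lam_min, u, change)       # change equals 0.5 * eps^2 * lam_min, which is negative
```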
<p>However, it is often infeasible to compute the Hessian while training deep learning models, since it is computationally expensive. How can we escape saddle points using only first-order information?</p>
<p>Ge et al. in their paper also describe a variant of stochastic gradient descent which they call “noisy stochastic gradient descent”. The only difference from standard SGD is that some random noise is added to the gradient:</p>
\[\textbf{g} = \nabla_\theta \frac{1}{m}\sum_{i=1}^{m} L(f(x^i; \theta), y^i)\]
\[\theta_{t+1} \leftarrow{} \theta_{t} - \epsilon (\textbf{g} + \nu)\]
<p>where \(\nu\) is random noise sampled from the unit sphere. This ensures that there is noise in every direction, allowing this variant of SGD to explore the region around the saddle point. The authors show that this noise can help escape from saddle points for strict-saddle functions.</p>
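<p>The effect of the added noise can be seen on a toy saddle. This is a loose sketch of the update rule above, not the authors’ implementation: it uses the exact gradient of \(f(x, y) = x^2 - y^2\) instead of a minibatch estimate, samples \(\nu\) uniformly from the unit circle, and the step size, step count, and seed are arbitrary choices.</p>

```python
import math
import random

def f(theta):
    x, y = theta
    return x * x - y * y           # saddle point at the origin

def grad(theta):
    x, y = theta
    return (2 * x, -2 * y)

def descend(theta, steps, eps, rng=None):
    """Gradient descent; if rng is given, add unit-circle noise to each gradient."""
    for _ in range(steps):
        g = grad(theta)
        if rng is None:
            nu = (0.0, 0.0)
        else:
            angle = rng.uniform(0.0, 2 * math.pi)
            nu = (math.cos(angle), math.sin(angle))
        theta = (theta[0] - eps * (g[0] + nu[0]),
                 theta[1] - eps * (g[1] + nu[1]))
    return theta

start = (0.0, 0.0)                 # exactly at the saddle, so the gradient is zero
plain = descend(start, steps=100, eps=0.1)
noisy = descend(start, steps=100, eps=0.1, rng=random.Random(0))
print(f(plain), f(noisy))          # plain stays at the saddle, noisy moves below it
```

<p>The toy function is unbounded below, so the noisy run simply keeps descending along \(y\). The point of the sketch is only that plain gradient descent never leaves the saddle, while any perturbation with a component along the negative-curvature direction gets amplified.</p>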
<p>The authors mention in the <a href="http://www.offconvex.org/2016/03/22/saddlepoints/">corresponding blog post</a> that it is useful to think of a saddle point as <em>unstable</em>, so slight perturbations can be effective in order to kick your weights out of the saddle point. To me this indicates that any perturbation in addition to the regular SGD update could be beneficial to continue optimization at a saddle point, so techniques such as momentum would be helpful as well.</p>
<h6 id="computing-the-hessian">Computing the Hessian</h6>
<p>Most of the methods discussed involve computing the Hessian, which is often unrealistic in practice for the aforementioned reasons. However, we can use <em>finite difference methods</em> in order to approximate the Hessian, which may be enough to do our job. To understand this, first we can consider the limit definition of the derivative in the single-variable case:</p>
\[\frac{d}{dx} f(x) = \lim_{h \rightarrow{} 0} \frac{f(x + h) - f(x)}{h}\]
<p>This means an approximation for \(\frac{d}{dx}f(x)\) can be given by</p>
\[\frac{f(x + h) - f(x)}{h}\]
<p>where our approximation improves with a smaller \(h\) but we run into numerical stability issues if we take \(h\) to be too small. It turns out that a slightly better approximation would be to use the centered version of the above (see the explanation on <a href="http://cs231n.github.io/neural-networks-3/#gradcheck">CS 231n</a> for details):</p>
\[\frac{d}{dx}f(x) \approx \frac{f(x + h) - f(x-h)}{2h}\]
<p>For functions of several variables, we can generalize this idea to each coordinate:</p>
\[\frac{d}{dx_i}f(x) \approx \frac{f(x + h s_i) - f(x - h s_i)}{2h}\]
<p>where \(s_i\) is a vector that has \(1\) as its \(i\)th entry, and is zero everywhere else. We’re basically taking the unit vector pointing in that direction, scaling it so that it is small, and adding that to our setting of parameters given by the vector \(x\).</p>
<p>It turns out that this identity can be further generalized, to approximate the gradient-vector product (i.e. the product of the gradient times a given vector). We simply remove the constraint that \(s_i\) is the unit vector pointing in a particular direction, and instead let it be any vector \(u\). This lets us approximate the gradient vector product:</p>
\[\nabla_x f(x)^T u \approx \frac{f(x + h u) - f(x - h u)}{2h}\]
<p>It’s quite simple to extend this to computing the Hessian-vector product: given the gradient, the gradient of the gradient is the Hessian. This means that we can replace the gradient on the left hand side of the above equation with the Hessian, and the function on the right hand side with the gradient:</p>
\[H u \approx \frac{\nabla_xf(x + h u) - \nabla_xf(x - h u)}{2h}\]
<p>which gives us the Hessian-vector product.</p>
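<p>Both approximations are easy to try on a function whose gradient and Hessian are known. In the sketch below the test function, the point \(x\), the vector \(u\), and the step \(h\) are illustrative assumptions; the analytic gradient stands in for what backpropagation would provide in a real model.</p>

```python
def f(x):
    return x[0] ** 2 * x[1] + 3 * x[1] ** 2

def grad_f(x):
    """Analytic gradient of f (a stand-in for a backpropagated gradient)."""
    return [2 * x[0] * x[1], x[0] ** 2 + 6 * x[1]]

def shift(x, u, t):
    """Return x + t * u."""
    return [x_i + t * u_i for x_i, u_i in zip(x, u)]

def gradient_vector_product(f, x, u, h=1e-4):
    """Central-difference estimate of grad f(x)^T u from function values only."""
    return (f(shift(x, u, h)) - f(shift(x, u, -h))) / (2 * h)

def hessian_vector_product(grad, x, u, h=1e-4):
    """Central-difference estimate of H u from two gradient evaluations."""
    plus = grad(shift(x, u, h))
    minus = grad(shift(x, u, -h))
    return [(p - m) / (2 * h) for p, m in zip(plus, minus)]

x = [1.0, 2.0]
u = [0.5, -1.0]
gv = gradient_vector_product(f, x, u)
hv = hessian_vector_product(grad_f, x, u)
# Exact values at this point: grad = (4, 13), H = [[4, 2], [2, 6]],
# so grad^T u = -11 and H u = (0, -5).
print(gv, hv)
```

<p>Note that the Hessian-vector product costs only two gradient evaluations, and the full Hessian is never formed.</p>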
<h6 id="takeaways">Takeaways</h6>
<p>The Hessian can give us useful second order information when optimizing machine learning algorithms, though it is computationally tough to compute in practice. By analyzing the Hessian, we may be able to get information regarding the convex nature of our problem, and it can also help us determine local minima or “debug” gradient descent when it actually fails to reduce our cost function or gets stuck at a saddle point.</p>
<h6 id="sources">Sources</h6>
<ol>
<li><a href="http://www.offconvex.org/2016/03/22/saddlepoints/">Escaping from Saddle Points</a></li>
<li><a href="https://arxiv.org/abs/1503.02101">Paper: Escaping from Saddle Points - Online SGD for Tensor Decomposition</a></li>
<li><a href="http://www.deeplearningbook.org/">Deep Learning Book Ch. 8</a></li>
<li><a href="https://timvieira.github.io/blog/post/2014/02/10/gradient-vector-product/">Explanation of approximating gradient-vector products</a></li>
</ol>
<h5 id="notes">Notes</h5>
<p>[4/7/19] - Added further explanation for the claim that “The eigenvectors of distinct values are orthogonal, and therefore form a basis for \(R^n\)”</p>
<p>[4/7/19] - Added a note about the indicator function notation that I used when explaining the L2-regularized SVM.</p>
<p>[5/7/19] - Fixed an incorrect equation and added some clarifications.</p>
<p>[5/30/19] - Fixed the gradient in the SVM example.</p>
<p>[6/7/19] - Added a note about the Spectral Theorem.</p>ResNets2019-02-17T00:00:00+00:002019-02-17T00:00:00+00:00https://rohanvarma.me/ResNetCIFAR<p><img src="https://raw.githubusercontent.com/rohan-varma/resnet-implementation/master/images/verydeep_network.png" alt="" /></p>
<p>These are some notes that I took while reading the paper <a href="https://arxiv.org/abs/1512.03385">Deep Residual Learning for Image Recognition</a>, the paper that introduced modern ResNets. A mock implementation of the network described for the CIFAR-10 portion of the paper is available <a href="https://github.com/rohan-varma/resnet-implementation">here</a>.</p>
<h5 id="1-introduction">1. Introduction</h5>
<ul>
<li>
<p>Main motivation: very deep neural networks are harder to fit</p>
<ul>
<li>
<p>Have higher training error on CIFAR 10 - so learning is not as simple as stacking more layers as it was once thought</p>
</li>
<li>
<p>Degradation problem: very deep networks are difficult to fit (despite batch normalization and He/Xavier initialization methods); they don’t just overfit, they actually have worse training error</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/resnet-implementation/master/images/verydeep_network.png" alt="" /></p>
</li>
</ul>
</li>
<li>
<p>Intuitively, deep networks shouldn’t be “harder” to fit. If there is a certain number of layers \(N\) that achieves optimal accuracy on a dataset, then the layers after \(N\) could just learn the identity mapping (i.e. each layer computes its mapping as \(H(x) = x\), where \(H(x)\) is the mapping of the layer to be learned), and then the network will effectively have its final output at layer \(N\)</p>
</li>
<li>
<p>However, it is not “easy” for weights to be pushed in such ways that they exactly produce the identity mapping</p>
</li>
<li>
<p>Authors introduce the idea of <strong>residual learning</strong> - instead of directly approximating the underlying mapping we want, \(H(x)\), we instead learn a residual function \(H(x) - x\). This is done by making the output of a stack of layers be \(y = F(x) + x\), where \(F(x)\) is the output of the layers (before the ReLU of the last layer) and then the original input \(x\) is element-wise added:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/resnet-implementation/master/images/residual_learning_block.png" alt="" /></p>
</li>
<li>
<p>Therefore, if our underlying mapping is still \(y = H(x)\) that we want to learn, then \(F(x) = H(x) - x\) so that \(y = F(x) + x = H(x) - x + x = H(x)\) .</p>
<ul>
<li>The idea of learning identity mappings is now easier, since we just need to set all weights to \(0\), so that \(F(x) = 0\) and \(H(x) = x\), so \(y = x\) is learned</li>
</ul>
</li>
<li>
<p>Ensemble of ResNets attained <strong>3.57%</strong> top-5 error rate on ImageNet dataset</p>
<ul>
<li>Six total ResNets of different depths are combined; two of them are 152-layer ResNets</li>
</ul>
</li>
</ul>
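<p>To make the identity argument concrete, here is a minimal sketch of a residual block on plain Python lists. The helper names <code class="language-plaintext highlighter-rouge">matvec</code>, <code class="language-plaintext highlighter-rouge">relu</code>, and <code class="language-plaintext highlighter-rouge">residual_block</code> are made up for illustration, the layers are fully connected rather than convolutional, and the ReLU that the paper applies after the addition is left out to keep the identity case visible.</p>

```python
# Minimal sketch of a residual block y = F(x) + x on plain Python lists.
# matvec, relu, and residual_block are hypothetical helper names.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, v) for v in x]

def residual_block(x, W1, W2):
    """F(x) = W2 * relu(W1 * x); the shortcut adds x back element-wise."""
    fx = matvec(W2, relu(matvec(W1, x)))
    return [f + v for f, v in zip(fx, x)]

x = [1.5, -2.0, 0.25]
zeros = [[0.0] * 3 for _ in range(3)]

# With all weights at zero, F(x) = 0 and the block reduces to the identity.
y = residual_block(x, zeros, zeros)
print(y)
```

<p>With zero weights the block returns its input unchanged, which is the sense in which the identity mapping is easy to reach with residual learning.</p>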
<h5 id="2-related-work">2. Related Work</h5>
<ul>
<li>Auxiliary classifiers inserted at early layers of a deep network to send back stronger gradient signals to deal w/vanishing gradient problems are similar</li>
<li>Inception network which uses concatenations of different operations includes a shortcut connection</li>
</ul>
<h5 id="3-deep-residual-learning">3. Deep Residual Learning</h5>
<ul>
<li>
<p>Say we want to learn \(y = H(x)\). We can either directly learn this or try to learn \(F(x)\) where \(F(x)= H(x)-x\) and formulate our output as \(y = F(x) + x\). So by adding the identity in a so-called “residual block” we force the network to learn a residual mapping \(F(x) = H(x) -x\).</p>
</li>
<li>
<p>A <em>building block</em> in ResNet is defined as \(y = F(x, \{W_i\}) + x\) where \(F\) can be multiple layers. For example, the above figure has \(F = W_2(\sigma (W_1 x))\)</p>
<ul>
<li>The \(+\) operation is performed by a shortcut operation and element-wise addition</li>
</ul>
</li>
<li>
<p>Described like this, the shortcut connection introduces no new parameters in a network, so training the network in this way doesn’t introduce an increase in training time due to the numbers of parameters that must be trained. But this isn’t possible when dimensionalities are different - for example, the 2 layers of conv/relu above may result in the output before the addition having different dimension than that of \(x\).</p>
</li>
<li>
<p>To handle this, we can use a projection matrix \(W_s\) that projects \(x\) to the same space as \(F(x)\). We have \(y = F(x, \{W_i\}) + W_sx\), but this introduces more parameters into the network.</p>
</li>
<li>
<p>These functions are just as applicable to convolutional layers as they are to FC layers. For example, \(F\) can represent multiple conv layers, and the element-wise addition is performed on the two feature maps, going channel by channel (so the dims must be the same)</p>
</li>
<li>
<p><strong>Residual Network details</strong></p>
<ul>
<li>
<p>Based off a plain 34-layer network that has a \(7 * 7\) conv, then a series of \(3*3\) convs gradually increasing the channel size, followed by a global average pooling layer, followed by a \(1000\) way FC layer + softmax at the end that represents \(p(y \vert{} x)\)</p>
</li>
<li>
<p>The residual network is similar, but shortcut connections are added every 2 layers, and the network looks as follows:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/resnet-implementation/master/images/resnet_layout.png" alt="" /></p>
</li>
<li>
<p>Two options when dimensions don’t match:</p>
<ul>
<li>Projection matrix (as mentioned above), or just padding extra zeros to increase dimension (doesn’t increase number of parameters) are both tried out</li>
</ul>
</li>
<li>
<p>Downsampling is performed directly by conv layers that have a stride of \(2\), rather than by separate pooling layers.</p>
</li>
<li>
<p>Design rules: If the output of a conv layer has the same feature map size, the layers have the same number of filters, but if the size of the feature map is halved, then the number of filters is doubled, so as to preserve the time complexity per layer.</p>
</li>
<li>
<p>Implementation details:</p>
<ul>
<li>224 x 224 crops sampled from ImageNet dataset, with per-pixel mean subtracted</li>
<li>Data augmentation: images are flipped to increase dataset size</li>
<li>Batch norm is used, the pattern is conv-bn-relu, so before the activation</li>
<li><em>He</em> initialization of weights is used, namely weights are initialized from sampling from a Gaussian with mean \(0\) and standard deviation \(\sqrt{\frac{2}{n_l}}\). Biases are initialized to be \(0\).</li>
<li>SGD with minibatch size 256 is used</li>
<li>Learning rate starts off as \(0.1\) and is then decreased by dividing by \(10\) when the error plateaus.</li>
<li>\(60 * 10^4\) total iterations</li>
<li>Weight decay of \(0.0001\) and momentum of \(0.9\) are used</li>
<li>Dropout is <em>not</em> used, in favor of only batchnorm.</li>
</ul>
</li>
</ul>
</li>
</ul>
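<p>The design rule about halving the feature map and doubling the filters can be checked with a little arithmetic. The sketch below counts multiply-accumulates for one conv layer; the function name <code class="language-plaintext highlighter-rouge">conv_mults</code> and the two example layer shapes (56 x 56 with 64 channels, 28 x 28 with 128 channels) are illustrative assumptions, not values quoted from these notes.</p>

```python
def conv_mults(height, width, kernel, c_in, c_out):
    """Multiply-accumulates for one conv layer producing a height x width output map."""
    return height * width * kernel * kernel * c_in * c_out

# A 3x3 layer inside one stage: 56x56 output map, 64 -> 64 channels.
before = conv_mults(56, 56, 3, 64, 64)

# A 3x3 layer inside the next stage: 28x28 output map, 128 -> 128 channels.
after = conv_mults(28, 28, 3, 128, 128)

print(before, after)
```

<p>Halving both spatial dimensions divides the work by four, while doubling both the input and output channels multiplies it by four, so the two counts come out equal.</p>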
<h5 id="4-training-and-approach">4. Training and Approach</h5>
<ul>
<li>
<p>Trained 18 and 34 layer plain networks, along with 18 and 34 layer ResNets</p>
<ul>
<li>It was shown that 34 layer plain nets have higher training error than 18 layer nets, and it was argued that this was not due to vanishing gradients, because: 1) proper initialization was used, 2) BN was used, 3) it was ensured that gradients have healthy norms throughout training</li>
<li>Speculated that deep plain networks have exponentially low convergence rates (i.e. need to be trained for much longer to achieve same results compared to shallower networks)</li>
</ul>
</li>
<li>
<p>For 18 and 34 layer ResNets, simple element-wise addition shortcut connections were used, so there were no new parameters in the network</p>
<ul>
<li>34 layer ResNet did better compared to 18 layer, indicating that the degradation problem observed in plain nets was not evident here</li>
</ul>
</li>
<li>
<p><strong>Identity vs projection shortcuts</strong></p>
<ul>
<li>3 types:
<ul>
<li>A: zero-padding shortcuts used when dimensions do not match, all shortcuts are parameter free</li>
<li>B: Projection shortcuts used when dimensions do not match, and other shortcuts are regular element-wise addition</li>
<li>C: All shortcuts are projections (meaning that a square matrix is used even when the dimensions match)</li>
</ul>
</li>
<li>It was shown that B is slightly better than A and C was slightly better than B, but C introduces more parameters and increases the time/memory complexity of the network, so B was used overall (projections when dimensions do not match, otherwise regular identity and element-wise addition)</li>
</ul>
</li>
<li>
<p><strong>Bottleneck architecture</strong></p>
<ul>
<li>
<p>For every residual function \(F\), 3 layers instead of 2 are used: first layer is a 1x1 conv, then a 3x3 conv, then a 1x1 conv</p>
<ul>
<li>The 1x1 layers reduce and increase dimensionality, and the 3x3 conv operates on a smaller dimensional space</li>
</ul>
</li>
<li>
<p>Example: in the following figure, a \(256\) dimensional (256 channels) input is fed into a 1x1 conv which maps it to 64 channels, then a 3x3 conv which keeps it at 64 channels, and then a 1x1 conv that maps it back to the original dimensionality of 256 channels.</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/resnet-implementation/master/images/bottleneck.png" alt="" /></p>
</li>
<li>
<p>Parameter-free shortcuts are particularly important here: the time complexity and model size are doubled if identity shortcuts are replaced with projections</p>
</li>
<li>
<p>This architecture is used to create 50/101/152 layer ResNets, which all had improved accuracy compared to the 34 layer ResNets, and the degradation problem is not observed</p>
<ul>
<li>152-layer ResNet performed the best</li>
</ul>
</li>
</ul>
</li>
</ul>
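<p>A rough weight count shows why the bottleneck design and parameter-free shortcuts go together. This sketch uses the 256 -> 64 -> 64 -> 256 channel example from the notes above; the helper <code class="language-plaintext highlighter-rouge">conv_params</code> is a hypothetical name, and biases and batch norm parameters are ignored.</p>

```python
def conv_params(kernel, c_in, c_out):
    """Weights in a kernel x kernel conv layer, ignoring biases and batch norm."""
    return kernel * kernel * c_in * c_out

bottleneck = (
    conv_params(1, 256, 64)    # 1x1 conv reduces 256 -> 64 channels
    + conv_params(3, 64, 64)   # 3x3 conv works in the 64-channel space
    + conv_params(1, 64, 256)  # 1x1 conv restores 64 -> 256 channels
)

# Two plain 3x3 layers at 256 channels, for comparison.
plain_pair = 2 * conv_params(3, 256, 256)

# A 1x1 projection shortcut connecting the two 256-channel ends.
projection = conv_params(1, 256, 256)

print(bottleneck, plain_pair, projection)
```

<p>Under these assumptions the bottleneck block has 69,632 weights against 1,179,648 for the pair of plain 3x3 layers, and a projection shortcut alone would add 65,536 more, which is nearly the size of the whole block. That is consistent with the observation that projections would roughly double the cost of a bottleneck block.</p>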
<h5 id="resnets-on-cifar-10">ResNets on CIFAR-10</h5>
<ul>
<li>Network inputs are 32 * 32 with per-pixel mean subtracted</li>
<li>First layer: 3 x 3 conv, then stack of \(6n\) layers with \(3*3\) convolutions with feature map sizes of 32, 16, and 8. Each feature map size has \(2n\) layers for \(6n\) total layers.
<ul>
<li>This means that the output feature map size is 32 twice, then 16 twice, etc</li>
</ul>
</li>
<li>Number of filters are 16, 32, 64, respectively. Subsampling is done with conv layers of stride 2 instead of max/average pooling throughout the network (which is the traditional way of downsampling)</li>
<li>Global average pooling after all the conv layers, and then a 10-way fully connected layer + softmax at the end</li>
<li>Identity shortcuts used in all cases</li>
<li>Weight decay: \(0.0001\), momentum of \(0.9\), with He init, BN, and no dropout, with a batch size of \(128\).</li>
<li>Learning rate of \(0.1\) which is divided by \(10\) at 32k and 48k iterations, and training is terminated at 64k iterations</li>
<li>110 layer network achieved \(6.43\)% error, which is state of the art</li>
<li>Noticed that deeper ResNets have a smaller magnitude of responses, where a response is the standard deviation of layer responses for each layer (i.e. the responses in layers of the ResNets generally have lower standard deviations compared to plain networks)</li>
<li>1202 layer network did not work well (had similar training error, but higher testing error, indicating overfitting)
<ul>
<li>Not much regularization was used in these ResNets (i.e. no maxout or dropout), regularization is just imposed by the architecture of the design</li>
</ul>
</li>
</ul>The Python GIL2019-02-06T00:00:00+00:002019-02-06T00:00:00+00:00https://rohanvarma.me/GIL<h4 id="a-story">A story</h4>
<p>Let’s imagine that you’re trying to optimize some code that computes statistics about how much time users spend on your website. You have a list of objects for each user - where each object itself is a list representing the amount of time a user spent on your website across several sessions. Your pseudocode looks something like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_time_spent</span><span class="p">(</span><span class="n">users</span><span class="p">):</span>
    <span class="n">user_to_time</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">user_sessions</span><span class="p">)</span> <span class="ow">in</span> <span class="n">users</span><span class="p">:</span>
        <span class="n">total_time_spent</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">user_sessions</span><span class="p">)</span>
        <span class="n">user_to_time</span><span class="p">[</span><span class="n">user_id</span><span class="p">]</span> <span class="o">=</span> <span class="n">total_time_spent</span>
    <span class="k">return</span> <span class="n">user_to_time</span>
</code></pre></div></div>
<p>This code works, but you’re concerned about the time it takes, since the list of <code class="language-plaintext highlighter-rouge">users</code> and <code class="language-plaintext highlighter-rouge">user_session</code> is very large. You remember threads and know that this code will run on an 8-core machine, so you decide to split up the <code class="language-plaintext highlighter-rouge">users</code> list into 8 chunks and spin up 8 different threads. You quickly conclude that your code will now run about 8x faster, save for any overhead taken up in chunking the list and thread management - you don’t have to worry about things like locking since each thread is working on its own component of the data.</p>
<p>All you have to do is create the threads, and tell each one to run the above <code class="language-plaintext highlighter-rouge">get_time_spent</code> function on their distinct list of <code class="language-plaintext highlighter-rouge">users</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">threaded_time_spent</span><span class="p">(</span><span class="n">users</span><span class="p">):</span>
    <span class="n">user_chunks</span> <span class="o">=</span> <span class="n">chunk_list</span><span class="p">(</span><span class="n">users</span><span class="p">,</span> <span class="mi">8</span><span class="p">)</span>
    <span class="n">threads</span> <span class="o">=</span> <span class="p">[</span><span class="n">Thread</span><span class="p">(</span><span class="n">target</span><span class="o">=</span><span class="n">get_time_spent</span><span class="p">,</span> <span class="n">args</span><span class="o">=</span><span class="p">(</span><span class="n">chunk</span><span class="p">,))</span> <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">user_chunks</span><span class="p">]</span>
    <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">threads</span><span class="p">:</span>
        <span class="n">t</span><span class="p">.</span><span class="n">start</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">threads</span><span class="p">:</span>
        <span class="n">t</span><span class="p">.</span><span class="n">join</span><span class="p">()</span>
</code></pre></div></div>
<p>After coding this up, you run some basic performance tests and realize that your new code actually takes <em>longer</em> to execute than your old code - a far cry from the 8x performance gain you had anticipated.</p>
<p>You’re confused, and begin to look for a bug in your threading implementation, but find nothing. After doing some research, you come across the Global Interpreter Lock, and realize that it effectively serializes your code, even though you wanted to be able to take advantage of multiple cores.</p>
<p>What the hell is that, and why does it even exist in Python?</p>
<h4 id="what-does-the-gil-do">What does the GIL do?</h4>
<p>Simply put, the GIL is a lock around the interpreter. Any thread wishing to execute Python bytecode must hold the GIL in order to do so. This means that at most one thread can be executing Python bytecode at any given moment. This effectively serializes portions of multithreaded programs where each thread is executing Python bytecode. To allow other threads to run, the thread holding the GIL releases it periodically - both voluntarily when it no longer needs it and involuntarily after a certain interval.</p>
<h4 id="why-does-python-have-a-gil">Why does Python have a GIL?</h4>
<p>First, as an aside, Python doesn’t technically have a GIL - the reference implementation, CPython, has one. Other implementations of Python - such as Jython and IronPython - don’t have a GIL, and have different tradeoffs than CPython due to it. In this post, I’ll mostly be focusing on the GIL as it is implemented in CPython.</p>
<p>The GIL was originally introduced as part of an effort to support multithreaded programming in Python. Python uses automatic memory management via garbage collection, implemented with a technique called reference counting. CPython stores a reference count on each object, tracking how many references point to it, and when that count drops to zero, the object can be freed.</p>
<p>However, race conditions in multithreaded programming made it so that the count of these references could be updated incorrectly, making it so that objects could be erroneously freed or never freed at all. One way to solve this problem is with more granular locking, such as around every shared object, but this would create issues such as increased overhead due to a lot of lock acquire/release requests, as well as increase the possibility of deadlock. The Python developers instead chose to solve this problem by placing a lock around the entire interpreter, making each thread acquire this lock when it runs Python bytecode. This avoids a lot of the performance issues around excessive locking, but effectively serializes bytecode execution.</p>
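<p>As a small illustration of the counts that the GIL protects, the sketch below watches a reference count change using <code class="language-plaintext highlighter-rouge">sys.getrefcount</code>. The <code class="language-plaintext highlighter-rouge">Session</code> class is a made-up placeholder, and the behavior shown is specific to CPython; absolute counts vary, so only the differences are meaningful.</p>

```python
import sys

class Session:
    """Placeholder object; the name is hypothetical."""

obj = Session()
base = sys.getrefcount(obj)   # includes the temporary reference held by the call itself

alias = obj                   # a second name now refers to the same object
after_alias = sys.getrefcount(obj)

del alias                     # dropping the name decrements the count again
after_del = sys.getrefcount(obj)

print(after_alias - base, after_del - base)
```

<p>Each of these increments and decrements is a read-modify-write on shared state, which is why two threads updating the same count without any lock could corrupt it.</p>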
<h4 id="how-does-the-gil-impact-performance">How does the GIL impact performance?</h4>
<p>Let’s consider two examples to illustrate the difference in performance for CPU-bound and I/O bound threads. First, we’ll create a dummy CPU bound task that just counts down to zero from an input. We’ll also define two implementations that call this function twice - one that just makes two successive calls, and another that spawns two threads that run this function, and then <code class="language-plaintext highlighter-rouge">join</code>s them:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">count</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
    <span class="k">while</span> <span class="n">n</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span>
        <span class="n">n</span> <span class="o">-=</span> <span class="mi">1</span>

<span class="o">@</span><span class="n">report_time</span>
<span class="k">def</span> <span class="nf">run_threaded</span><span class="p">():</span>
    <span class="n">t1</span> <span class="o">=</span> <span class="n">threading</span><span class="p">.</span><span class="n">Thread</span><span class="p">(</span><span class="n">target</span><span class="o">=</span><span class="n">count</span><span class="p">,</span> <span class="n">args</span><span class="o">=</span><span class="p">(</span><span class="mi">10000000</span><span class="p">,))</span>
    <span class="n">t2</span> <span class="o">=</span> <span class="n">threading</span><span class="p">.</span><span class="n">Thread</span><span class="p">(</span><span class="n">target</span><span class="o">=</span><span class="n">count</span><span class="p">,</span> <span class="n">args</span><span class="o">=</span><span class="p">(</span><span class="mi">10000000</span><span class="p">,))</span>
    <span class="n">t1</span><span class="p">.</span><span class="n">start</span><span class="p">()</span>
    <span class="n">t2</span><span class="p">.</span><span class="n">start</span><span class="p">()</span>
    <span class="n">t1</span><span class="p">.</span><span class="n">join</span><span class="p">()</span>
    <span class="n">t2</span><span class="p">.</span><span class="n">join</span><span class="p">()</span>

<span class="o">@</span><span class="n">report_time</span>
<span class="k">def</span> <span class="nf">run_sequential</span><span class="p">():</span>
    <span class="n">count</span><span class="p">(</span><span class="mi">10000000</span><span class="p">)</span>
    <span class="n">count</span><span class="p">(</span><span class="mi">10000000</span><span class="p">)</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">report_time</code> <a href="https://github.com/rohan-varma/python-gil/blob/master/time_decorator.py">decorator</a> is a simple decorator that uses the <code class="language-plaintext highlighter-rouge">time</code> module to report how long the function took to execute. Running this script 10 times and averaging the result gave me an average of 1.53 seconds for sequential execution, and 1.57 seconds for threaded execution (see <a href="https://github.com/rohan-varma/python-gil/blob/master/output.txt">results</a>) - meaning that despite having a 4-core machine, threading here did not help at all, and in fact marginally worsened performance.</p>
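<p>The linked decorator is not reproduced in this post, so the following is only a guess at its shape: a minimal timing decorator built on the <code class="language-plaintext highlighter-rouge">time</code> module. Returning a <code class="language-plaintext highlighter-rouge">(result, elapsed)</code> pair is an assumption, as is the choice of <code class="language-plaintext highlighter-rouge">time.perf_counter</code> as the clock.</p>

```python
import functools
import time

def report_time(func):
    """Wrap func so each call prints its duration and returns (result, elapsed_seconds)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f} seconds")
        return result, elapsed
    return wrapper

@report_time
def count(n):
    while n > 0:
        n -= 1

# count returns None, so the first element of the pair is discarded.
_, elapsed = count(100_000)
```

<p>Using a monotonic clock such as <code class="language-plaintext highlighter-rouge">time.perf_counter</code> avoids surprises if the system clock is adjusted mid-measurement.</p>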
<p>Now let’s consider two I/O bound threads instead. The following <a href="https://github.com/rohan-varma/python-gil/blob/master/gil_test_io_bound.py">code</a> runs the <code class="language-plaintext highlighter-rouge">select</code> function on empty lists of file descriptors, and times out after 2 seconds:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def run_select():
    a, b, c = select.select([], [], [], 2)

@report_time
def run_threaded():
    t1 = threading.Thread(target=run_select)
    t2 = threading.Thread(target=run_select)
    t1.start()
    t2.start()
    t1.join()
    t2.join()

@report_time
def run_sequential():
    run_select()
    run_select()

_, threaded_time = run_threaded()
_, seq_time = run_sequential()
print(f'{threaded_time} with threading, {seq_time} sequentially')
</code></pre></div></div>
<p>As expected, the sequential execution takes 4 seconds, and the threaded execution takes 2 seconds. In the threaded model, the I/O can be awaited in parallel. When the first thread runs, it grabs the GIL, loads in the <code class="language-plaintext highlighter-rouge">select</code> name and the empty lists along with the constant <code class="language-plaintext highlighter-rouge">2</code>, and then makes an I/O request. Python’s implementation of blocking I/O operations drops the GIL and then reacquires the GIL when the thread runs again after the I/O completes, so another thread is free to execute. In this case, the other thread can run almost instantly, making a call to <code class="language-plaintext highlighter-rouge">select</code> as well. This process of yielding the GIL during blocking I/O operations is similar to <em>cooperative multitasking</em>.</p>
<p>On the other hand, CPU-bound threads don’t voluntarily yield the CPU to other threads, and instead have to be forcibly switched out to allow other threads to run. In Python 2, there was a concept of “interpreter ticks”, and the GIL was forcibly dropped every 100 ticks if the thread hadn’t voluntarily given it up. In Python 3, this was switched (and the GIL was in general revamped) to be a time interval, which is by default 5 milliseconds. The two intervals can be changed with <code class="language-plaintext highlighter-rouge">sys.setcheckinterval()</code> and <code class="language-plaintext highlighter-rouge">sys.setswitchinterval()</code> respectively.</p>
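<p>For Python 3, the switch interval can be inspected and changed at runtime. The sketch below reads the current value, changes it, and puts it back; the 10 millisecond value is an arbitrary choice for illustration, not a recommendation.</p>

```python
import sys

original = sys.getswitchinterval()   # 0.005 seconds unless something changed it
sys.setswitchinterval(0.01)          # request a forced GIL drop roughly every 10 ms
changed = sys.getswitchinterval()
sys.setswitchinterval(original)      # restore the previous value

print(original, changed)
```

<p>A longer interval means fewer forced switches between CPU-bound threads, at the cost of other threads waiting longer for their turn.</p>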
<p>The main takeaway is that the GIL only inhibits performance when you have several CPU-bound threads that you’d like to execute across multiple cores, since the GIL serializes them. Parallelizing I/O bound work via multiple threads is still a good bet.</p>
<h4 id="working-around-pythons-gil">Working around Python’s GIL</h4>
<p>There are still a few things that we can do to potentially optimize performance of CPU-bound Python code across multiple threads, despite the GIL. The first thing to consider is using the <code class="language-plaintext highlighter-rouge">multiprocessing</code> module. Its <code class="language-plaintext highlighter-rouge">Process</code> class has an API similar to that of <code class="language-plaintext highlighter-rouge">threading.Thread</code>, but it spawns an entirely separate process to run your function. The good news is that the new process has its own interpreter, so you can take advantage of multiple cores on your machine. <a href="https://github.com/rohan-varma/python-gil/blob/master/gil_test_multiprocessing.py">Porting the above threaded code</a> to use <code class="language-plaintext highlighter-rouge">Process</code> shows the expected ~2x performance gain. However, processes are much heavier than threads, which brings in efficiency concerns if you’re creating a nontrivial number of separate processes. In addition, there’s more overhead involved in using IPC mechanisms rather than simply communicating with shared variables with threads.</p>
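<p>A process-based version of the countdown benchmark might look like the sketch below. It is not the linked port, just an illustration of the same idea using <code class="language-plaintext highlighter-rouge">multiprocessing.Process</code>; the function name <code class="language-plaintext highlighter-rouge">run_in_processes</code> and its <code class="language-plaintext highlighter-rouge">workers</code> parameter are hypothetical.</p>

```python
import multiprocessing

def count(n):
    while n > 0:
        n -= 1

def run_in_processes(n, workers=2):
    """Run count(n) in `workers` separate processes and wait for all of them."""
    procs = [multiprocessing.Process(target=count, args=(n,)) for _ in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # An exit code of 0 means the worker finished without an error.
    return [p.exitcode for p in procs]

if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing this module.
    print(run_in_processes(10_000_000))
```

<p>Each worker runs in its own interpreter with its own GIL, so the two countdowns can occupy two cores at once. The arguments are passed to the child by pickling or by inheriting memory, depending on the platform's start method, which is part of the extra overhead compared to threads.</p>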
<p>As a more involved task, it may be worth considering porting your CPU-bound code to C, and then writing a C extension to bridge your Python code to the C code. This can provide significant performance advantages, and libraries such as <code class="language-plaintext highlighter-rouge">numpy</code>, as well as standard-library modules such as <code class="language-plaintext highlighter-rouge">hashlib</code>, release the GIL in their C extensions.</p>
<p>By default, code that runs as part of a call to a C extension is still subject to the GIL being held, but it can be manually released. This is common for blocking I/O operations as well as processor intensive computations. Python makes this easy to do with two macros: <code class="language-plaintext highlighter-rouge">Py_BEGIN_ALLOW_THREADS</code> and <code class="language-plaintext highlighter-rouge">Py_END_ALLOW_THREADS</code>, which save the thread state and drop the GIL, and restore the thread state and reacquire the GIL, respectively.</p>
<h4 id="details-around-the-gil-implementation">Details around the GIL implementation</h4>
<p>The GIL is implemented in <a href="https://github.com/python/cpython/blob/27e2d1f21975dfb8c0ddcb192fa0f45a51b7977e/Python/ceval_gil.h#L12">ceval_gil.h</a> and used by the interpreter in <a href="https://github.com/python/cpython/blob/master/Python/ceval.c">ceval.c</a>. A thread waiting for the GIL will do a timed wait on the GIL, with a <a href="https://github.com/python/cpython/blob/27e2d1f21975dfb8c0ddcb192fa0f45a51b7977e/Python/ceval_gil.h#L12">preset interval</a> that can be modified with <code class="language-plaintext highlighter-rouge">sys.setswitchinterval</code>. If the GIL hasn’t been released at all during that interval, then a <a href="https://github.com/python/cpython/blob/27e2d1f21975dfb8c0ddcb192fa0f45a51b7977e/Python/ceval_gil.h#L216">drop request</a> will be sent to the current running thread (which has the GIL). This is done via <code class="language-plaintext highlighter-rouge">COND_TIMED_WAIT</code> in the <a href="https://github.com/python/cpython/blob/27e2d1f21975dfb8c0ddcb192fa0f45a51b7977e/Python/ceval_gil.h#L209">source code</a>, which sets a <code class="language-plaintext highlighter-rouge">timed_out</code> <a href="https://github.com/python/cpython/blob/27e2d1f21975dfb8c0ddcb192fa0f45a51b7977e/Python/ceval_gil.h#L89">variable</a> that indicates if the wait has timed out.</p>
<p>The running thread will finish the instruction that it’s on, drop the GIL, and <a href="https://github.com/python/cpython/blob/27e2d1f21975dfb8c0ddcb192fa0f45a51b7977e/Python/ceval_gil.h#L166">signal on a condition variable that the GIL is available</a>. This is encapsulated by the <code class="language-plaintext highlighter-rouge">drop_gil</code> <a href="https://github.com/python/cpython/blob/master/Python/ceval.c#L1030">function</a>.</p>
<p>Importantly, it will also <a href="https://github.com/python/cpython/blob/27e2d1f21975dfb8c0ddcb192fa0f45a51b7977e/Python/ceval_gil.h#L173-L187">wait for a signal</a> that another thread was able to get the GIL and run. This is done by checking if the last holder of the GIL was the thread itself, and if so, resetting the GIL drop request and waiting on a condition variable that signals that the GIL has been switched.</p>
<p>This wasn’t the case in Python 2, where a thread that just dropped the GIL could potentially compete for it again. This would often result in starvation of certain threads, due to how the OS would schedule these threads. For example, if you had two cores, then the thread dropping the GIL (let’s call this t1) could still be running on one core, and the thread attempting to acquire the GIL (let’s call this t2) could be scheduled on the 2nd core. What could happen is that since t1 is still running, it could re-acquire the GIL before t2 even gets a chance to wake up and see that it can acquire the GIL, so t2 will continue to block since it wasn’t able to acquire the GIL, and then wake up and try again repeatedly. This would result in something dubbed the GIL battle:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/python-gil/master/gil_battle.png" alt="" /><em>The GIL battle (<a href="https://www.dabeaz.com/python/NewGIL.pdf">source</a>)</em></p>
<p>This would frequently happen for CPU-bound threads left running on a core in Python 2. This could result in I/O bound threads being starved and lots of extra signalling on condition variables, reducing performance. <a href="https://www.dabeaz.com/python/GIL.pdf">These slides</a> have some more details about the Python 2 GIL.</p>
<p>There’s one important subtlety in the case of multiple threads. Since Python doesn’t have its own thread scheduling and wraps POSIX/Windows threads, scheduling of threads is left up to the OS. Therefore, when multiple threads are competing to run, the thread that issued the GIL <code class="language-plaintext highlighter-rouge">drop_request</code> may not actually be the thread that acquires the GIL (since a context switch could occur, another waiting thread could see that the GIL is available, and acquire it). <a href="https://www.dabeaz.com/python/NewGIL.pdf">These slides</a> have some more details about this.</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/python-gil/master/new_gil_multiple_threads.png" alt="" /><em>GIL behavior with multiple threads (<a href="https://www.dabeaz.com/python/NewGIL.pdf">source</a>)</em></p>
<p>What could (but doesn’t) happen is that the thread that was unable to acquire the GIL, but still timed out on waiting for it, could continue to issue the <code class="language-plaintext highlighter-rouge">drop_request</code> and attempt to re-acquire the GIL. This would essentially be like a spin lock - the thread would keep polling for the GIL and demanding for it to be released, using up CPU to accomplish nothing.</p>
<p>Instead, on a time out, a check is also done to see if the GIL has switched in that time interval (i.e. to another thread). If so, then this thread simply goes to sleep waiting for the GIL again. While this does in a sense reduce “fairness” (in the sense that the thread that requested the GIL should get it next), it also dramatically reduces GIL contention, compared to Python 2, and is somewhat of a necessary tradeoff since Python doesn’t control when its threads run.</p>
<p>One criticism of the new GIL relates to the above point: the “most deserving” thread (i.e. the thread that sent the <code class="language-plaintext highlighter-rouge">drop_request</code>) is not guaranteed to get the GIL. This is largely because thread scheduling is entirely up to the OS, but also has some repercussions for I/O operations that complete very quickly. Since any I/O operation will release the GIL, CPU-bound threads will restart and use their entire time slice before yielding back, and the I/O bound thread has to go through the entire timeout process to re-acquire the GIL. This can create somewhat of a convoy effect of quick-running I/O operations having to queue up to acquire the GIL:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/python-gil/master/newgil_convoy.png" alt="" /><em>Convoy effect with the new GIL (<a href="https://www.dabeaz.com/python/UnderstandingGIL.pdf">source</a>)</em></p>
<h4 id="summary">Summary</h4>
<p>The GIL is an interesting part of Python, and it’s cool to see the different tradeoffs and optimizations that were done in both Python 2 and Python 3 to improve performance as it relates to the GIL. The seemingly small changes to Python 3’s GIL (such as the time-based, as opposed to tick-based, switch interval and the reduction of GIL contention) emphasize just how important and nuanced issues such as lock contention and thread switching are, and how hard they are to get right.</p>
<h3 id="sources">Sources</h3>
<ol>
<li><a href="https://www.dabeaz.com/python/NewGIL.pdf">Talk on the new GIL</a></li>
<li><a href="https://www.dabeaz.com/python/GIL.pdf">Talk on the old GIL</a></li>
<li><a href="https://www.dabeaz.com/python/UnderstandingGIL.pdf">Another talk on the new GIL</a></li>
<li><a href="https://opensource.com/article/17/4/grok-gil">Grokking the GIL</a></li>
<li><a href="https://realpython.com/python-gil/">About the Python GIL</a></li>
</ol>A storyHow Much do I use my Phone?2019-01-02T00:00:00+00:002019-01-02T00:00:00+00:00https://rohanvarma.me/Phone-Usage<p>Towards the end of 2017, I started using an iOS app called <a href="https://inthemoment.io/">Moment</a>, which tracks how much time you spend on your phone each day and how many times you pick it up. By using this application throughout 2018 and poking around in the app for a way to export my day-by-day data, I was able to obtain a <a href="https://github.com/rohan-varma/phone-usage-tracking/blob/master/data/moment.json">JSON file</a> consisting of my phone usage time and number of pickups for every day of the year.</p>
<p>I decided to do some exploring to figure out just how much I’ve been using my phone on a daily basis, and see if there are any daily, weekly, or monthly differences - i.e. did I use my phone more on the weekends or on the weekdays? What follows is a Jupyter notebook that I created for analyzing this data and coming up with some plots, as well as a bit of analysis.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># imports
</span><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="c1"># a class to manage the phone usage data for a particular day
</span><span class="k">class</span> <span class="nc">Day</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">day_dict</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">minutes</span> <span class="o">=</span> <span class="n">day_dict</span><span class="p">[</span><span class="s">'minuteCount'</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">pickups</span> <span class="o">=</span> <span class="n">day_dict</span><span class="p">[</span><span class="s">'pickupCount'</span><span class="p">]</span>
        <span class="c1"># get the date and save if it is a weekday or not
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">date</span> <span class="o">=</span> <span class="n">day_dict</span><span class="p">[</span><span class="s">'date'</span><span class="p">].</span><span class="n">split</span><span class="p">(</span><span class="s">'T'</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">is_weekday</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">strptime</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">date</span><span class="p">,</span> <span class="s">'%Y-%m-%d'</span><span class="p">).</span><span class="n">weekday</span><span class="p">()</span> <span class="o"><</span> <span class="mi">5</span>
    <span class="k">def</span> <span class="nf">__repr__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="s">'minutes: {0}, pickups: {1}, date: {2}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">minutes</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">pickups</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">date</span><span class="p">)</span>
<span class="c1"># open and deserialize json, convert into day objects
</span><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'data/moment.json'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">data</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
<span class="n">day_data</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="nb">iter</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">values</span><span class="p">()))</span>
<span class="n">days</span> <span class="o">=</span> <span class="p">[</span><span class="n">Day</span><span class="p">(</span><span class="n">d</span><span class="p">)</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">day_data</span><span class="p">]</span>
<span class="c1"># filter out non 2018
</span><span class="n">days</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">days</span> <span class="k">if</span> <span class="s">'2018'</span> <span class="ow">in</span> <span class="n">d</span><span class="p">.</span><span class="n">date</span><span class="p">]</span>
</code></pre></div></div>
<p><br />
Here is what some of the raw JSON data coming from the Moment app looks like:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">day_data</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{'pickupCount': 69, 'pickups': [], 'date': '2018-12-30T00:00:00+11:00',
'minuteCount': 181, 'appUsages': [], 'sessions': []}
</code></pre></div></div>
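<p>As a quick sanity check on the date handling, here is a minimal sketch that parses the <code class="language-plaintext highlighter-rouge">date</code> field from the record above and classifies it. The variable names are illustrative only. Note that <code class="language-plaintext highlighter-rouge">%m</code> is the month directive, while <code class="language-plaintext highlighter-rouge">%M</code> means minutes. 2018-12-30 fell on a Sunday, so it should not count as a weekday.</p>

```python
from datetime import datetime

raw_date = '2018-12-30T00:00:00+11:00'  # 'date' value from the record above
date_part = raw_date.split('T')[0]      # keep only the calendar date

# %m is the month directive; %M would be interpreted as minutes
parsed = datetime.strptime(date_part, '%Y-%m-%d')

# weekday() counts from Monday = 0 to Sunday = 6
is_weekday = parsed.weekday() not in (5, 6)

print(date_part, parsed.weekday(), is_weekday)
```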
<p>To attempt to understand the overall data, we can find the mean and standard deviation of how many minutes per day I used my phone, as well as plot a histogram.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">minute_data</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="p">.</span><span class="n">minutes</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">days</span><span class="p">]</span>
<span class="n">mean_time</span><span class="p">,</span> <span class="n">time_std</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">minute_data</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">minute_data</span><span class="p">)</span>
<span class="c1"># hourly bins
</span><span class="n">bins</span> <span class="o">=</span> <span class="p">[</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">max</span><span class="p">(</span><span class="n">minute_data</span><span class="p">)</span> <span class="o">+</span> <span class="mi">60</span><span class="p">,</span> <span class="mi">60</span><span class="p">)]</span>
<span class="c1"># plot overall usage
</span><span class="n">n</span><span class="p">,</span> <span class="n">bins</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">hist</span><span class="p">([</span><span class="n">minute_data</span><span class="p">],</span> <span class="n">bins</span><span class="o">=</span><span class="n">bins</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Minutes of Phone Usage'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">bins</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Frequency (# of Days)'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Histogram of Phone Usage Time'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">text</span><span class="p">(</span><span class="mi">300</span><span class="p">,</span> <span class="mi">50</span><span class="p">,</span> <span class="sa">r</span><span class="s">'$\mu={0:.2f},\ \sigma={1:.2f}$'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">mean_time</span><span class="p">,</span> <span class="n">time_std</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">grid</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
<span class="c1"># hour-by-hour data
</span><span class="n">bin_ranges</span> <span class="o">=</span> <span class="p">[(</span><span class="n">bins</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">bins</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">])</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">bins</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">)]</span>
<span class="n">hours_to_num_days</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">bin_ranges</span><span class="p">,</span> <span class="n">n</span><span class="p">))</span>
<span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">hours_to_num_days</span><span class="p">.</span><span class="n">items</span><span class="p">()):</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'Between {0} and {1} hours of usage: {2} days'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">k</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">/</span><span class="mi">60</span><span class="p">,</span> <span class="n">k</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">/</span><span class="mi">60</span><span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="n">v</span><span class="p">)))</span>
</code></pre></div></div>
<p><img src="https://raw.githubusercontent.com/rohan-varma/phone-usage-tracking/master/How%20Much%20do%20I%20use%20my%20Phone%3F_files/How%20Much%20do%20I%20use%20my%20Phone%3F_6_0.png" alt="png" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Between 0.0 and 1.0 hours of usage: 33 days
Between 1.0 and 2.0 hours of usage: 154 days
Between 2.0 and 3.0 hours of usage: 121 days
Between 3.0 and 4.0 hours of usage: 42 days
Between 4.0 and 5.0 hours of usage: 8 days
Between 5.0 and 6.0 hours of usage: 3 days
Between 6.0 and 7.0 hours of usage: 1 days
Between 7.0 and 8.0 hours of usage: 1 days
Between 8.0 and 9.0 hours of usage: 1 days
</code></pre></div></div>
<p>It looks like I spent an average of about 2 hours and 6 minutes on my phone each day, with a standard deviation of 1 hour and 2 minutes. This is slightly lower than the <a href="https://hackernoon.com/how-much-time-do-people-spend-on-their-mobile-phones-in-2017-e5f90a0b10a6">average time per day</a> spent on their phones by American adults, which comes in at 2 hours and 51 minutes.</p>
<p>In other words, I spent about 8.75% of my entire day on my phone. If you only consider waking hours and assume 8 hours of sleep per day, then I spent about 13% of my waking hours using my phone each day. Translated to a year, this means I spent a whopping 766.25 hours on my phone in 2018, or 31.93 days - more than an entire month!</p>
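<p>The arithmetic behind these figures can be reproduced in a few lines. This sketch assumes a mean of 126 minutes, which is the 2 hours and 6 minutes quoted above after rounding. The yearly totals it prints therefore come out slightly different from the 766.25 hours in the text, which presumably used the unrounded mean.</p>

```python
mean_minutes = 126            # assumption: the mean rounded to 2 h 6 min
minutes_per_day = 24 * 60
waking_minutes = 16 * 60      # assumes 8 hours of sleep per day

share_of_day = mean_minutes / minutes_per_day * 100
share_of_waking = mean_minutes / waking_minutes * 100
hours_per_year = mean_minutes * 365 / 60
days_per_year = hours_per_year / 24

print('{0:.2f}% of the whole day'.format(share_of_day))
print('{0:.1f}% of waking hours'.format(share_of_waking))
print('{0:.1f} hours, or {1:.1f} days, per year'.format(hours_per_year, days_per_year))
```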
<p>Another interesting thing to look at is the variability in my phone usage. On most days I used my phone for between one and three hours, which accounts for about 75% of all days of the year. However, there were three days with more than 6 hours of phone usage, which increased the variability. Looking back, I think that this makes sense, as I do use my phone a lot on days when I’m traveling or on a road trip, or if I’m just really bored that day and don’t feel like doing anything else.</p>
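<p>The 75% figure can be checked directly against the per-bin counts printed earlier. In this small sketch the list is copied by hand from that output, and the variable names are illustrative.</p>

```python
# days per one-hour bin, copied from the printed histogram counts
counts = [33, 154, 121, 42, 8, 3, 1, 1, 1]

total_days = sum(counts)
one_to_three_hours = counts[1] + counts[2]  # the 1-2 hour and 2-3 hour bins
six_hours_or_more = sum(counts[6:])         # the last three bins

print('{0} of {1} days ({2:.1f}%) fall between one and three hours'.format(
    one_to_three_hours, total_days, one_to_three_hours / total_days * 100))
print('{0} days had six or more hours of usage'.format(six_hours_or_more))
```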
<p>Let’s look at some more data, such as whether there’s a difference between weekdays and weekends.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># separate weekdays and weekends, and plot each
</span><span class="n">weekdays</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="p">.</span><span class="n">minutes</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">days</span> <span class="k">if</span> <span class="n">d</span><span class="p">.</span><span class="n">is_weekday</span><span class="p">]</span>
<span class="n">weekends</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="p">.</span><span class="n">minutes</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">days</span> <span class="k">if</span> <span class="ow">not</span> <span class="n">d</span><span class="p">.</span><span class="n">is_weekday</span><span class="p">]</span>
<span class="n">weekday_mean</span><span class="p">,</span> <span class="n">weekend_mean</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">weekdays</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">weekends</span><span class="p">)</span>
<span class="n">weekday_std</span><span class="p">,</span> <span class="n">weekend_std</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">weekdays</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">weekends</span><span class="p">)</span>
<span class="n">n</span><span class="p">,</span> <span class="n">bins</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">weekdays</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="n">bins</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Minutes of Phone Usage on Weekdays'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">bins</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Frequency (# of Days)'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Weekday Phone Usage Time'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">text</span><span class="p">(</span><span class="mi">300</span><span class="p">,</span> <span class="mi">50</span><span class="p">,</span> <span class="sa">r</span><span class="s">'$\mu={0:.2f},\ \sigma={1:.2f}$'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">weekday_mean</span><span class="p">,</span> <span class="n">weekday_std</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">grid</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
<span class="n">n</span><span class="p">,</span> <span class="n">bins</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">weekends</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="n">bins</span><span class="p">,</span> <span class="n">facecolor</span><span class="o">=</span><span class="s">'orange'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Minutes of Phone Usage on Weekends'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">bins</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Frequency (# of Days)'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Weekend Phone Usage Time'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">text</span><span class="p">(</span><span class="mi">300</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="sa">r</span><span class="s">'$\mu={0:.2f},\ \sigma={1:.2f}$'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">weekend_mean</span><span class="p">,</span> <span class="n">weekend_std</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">grid</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="https://raw.githubusercontent.com/rohan-varma/phone-usage-tracking/master/How%20Much%20do%20I%20use%20my%20Phone%3F_files/How%20Much%20do%20I%20use%20my%20Phone%3F_8_0.png" alt="png" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/phone-usage-tracking/master/How%20Much%20do%20I%20use%20my%20Phone%3F_files/How%20Much%20do%20I%20use%20my%20Phone%3F_8_1.png" alt="png" /></p>
<p>This was really interesting to me - the distributions for my weekday and weekend phone usage are quite similar, indicating that there’s little difference in my phone usage on a weekend or weekday. This ran counter to my hypothesis that I’d use my phone a lot more on weekends, given that I don’t have class or work. Next, let’s see if there’s any difference in phone usage on different days of the week, different weeks, and different months.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># separate each day of the week, and plot each.
</span><span class="n">mon</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="p">.</span><span class="n">minutes</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">days</span> <span class="k">if</span> <span class="n">datetime</span><span class="p">.</span><span class="n">strptime</span><span class="p">(</span><span class="n">d</span><span class="p">.</span><span class="n">date</span><span class="p">,</span> <span class="s">'%Y-%m-%d'</span><span class="p">).</span><span class="n">weekday</span><span class="p">()</span> <span class="o">==</span> <span class="mi">0</span><span class="p">]</span>
<span class="n">tues</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="p">.</span><span class="n">minutes</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">days</span> <span class="k">if</span> <span class="n">datetime</span><span class="p">.</span><span class="n">strptime</span><span class="p">(</span><span class="n">d</span><span class="p">.</span><span class="n">date</span><span class="p">,</span> <span class="s">'%Y-%m-%d'</span><span class="p">).</span><span class="n">weekday</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">]</span>
<span class="n">wed</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="p">.</span><span class="n">minutes</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">days</span> <span class="k">if</span> <span class="n">datetime</span><span class="p">.</span><span class="n">strptime</span><span class="p">(</span><span class="n">d</span><span class="p">.</span><span class="n">date</span><span class="p">,</span> <span class="s">'%Y-%m-%d'</span><span class="p">).</span><span class="n">weekday</span><span class="p">()</span> <span class="o">==</span> <span class="mi">2</span><span class="p">]</span>
<span class="n">thurs</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="p">.</span><span class="n">minutes</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">days</span> <span class="k">if</span> <span class="n">datetime</span><span class="p">.</span><span class="n">strptime</span><span class="p">(</span><span class="n">d</span><span class="p">.</span><span class="n">date</span><span class="p">,</span> <span class="s">'%Y-%m-%d'</span><span class="p">).</span><span class="n">weekday</span><span class="p">()</span> <span class="o">==</span> <span class="mi">3</span><span class="p">]</span>
<span class="n">fri</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="p">.</span><span class="n">minutes</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">days</span> <span class="k">if</span> <span class="n">datetime</span><span class="p">.</span><span class="n">strptime</span><span class="p">(</span><span class="n">d</span><span class="p">.</span><span class="n">date</span><span class="p">,</span> <span class="s">'%Y-%m-%d'</span><span class="p">).</span><span class="n">weekday</span><span class="p">()</span> <span class="o">==</span> <span class="mi">4</span><span class="p">]</span>
<span class="n">sat</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="p">.</span><span class="n">minutes</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">days</span> <span class="k">if</span> <span class="n">datetime</span><span class="p">.</span><span class="n">strptime</span><span class="p">(</span><span class="n">d</span><span class="p">.</span><span class="n">date</span><span class="p">,</span> <span class="s">'%Y-%m-%d'</span><span class="p">).</span><span class="n">weekday</span><span class="p">()</span> <span class="o">==</span> <span class="mi">5</span><span class="p">]</span>
<span class="n">sun</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="p">.</span><span class="n">minutes</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">days</span> <span class="k">if</span> <span class="n">datetime</span><span class="p">.</span><span class="n">strptime</span><span class="p">(</span><span class="n">d</span><span class="p">.</span><span class="n">date</span><span class="p">,</span> <span class="s">'%Y-%m-%d'</span><span class="p">).</span><span class="n">weekday</span><span class="p">()</span> <span class="o">==</span> <span class="mi">6</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">plot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">title</span><span class="p">):</span>
    <span class="k">global</span> <span class="n">bins</span>
    <span class="n">n</span><span class="p">,</span> <span class="n">bins</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="n">bins</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="n">title</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">bins</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Frequency (# of Days)'</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Phone Usage Time'</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">text</span><span class="p">(</span><span class="mi">300</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="sa">r</span><span class="s">'$\mu={0:.2f},\ \sigma={1:.2f}$'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">data</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">data</span><span class="p">)))</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">grid</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
<span class="n">plot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">mon</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Minutes of Phone Usage on Monday'</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">tues</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Minutes of Phone Usage on Tuesday'</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">wed</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Minutes of Phone Usage on Wednesday'</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">thurs</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Minutes of Phone Usage on Thursday'</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">fri</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Minutes of Phone Usage on Friday'</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">sat</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Minutes of Phone Usage on Saturday'</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">sun</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Minutes of Phone Usage on Sunday'</span><span class="p">)</span>
<span class="c1"># plot overall for mean comparison
</span><span class="n">means</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="p">[</span><span class="n">mon</span><span class="p">,</span> <span class="n">tues</span><span class="p">,</span> <span class="n">wed</span><span class="p">,</span> <span class="n">thurs</span><span class="p">,</span> <span class="n">fri</span><span class="p">,</span> <span class="n">sat</span><span class="p">,</span> <span class="n">sun</span><span class="p">]]</span>
<span class="k">print</span><span class="p">(</span><span class="n">means</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">bar</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">7</span><span class="p">),</span> <span class="n">means</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Average Phone Usage for Day of Week'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Average Phone Usage'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Day of Week'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="https://raw.githubusercontent.com/rohan-varma/phone-usage-tracking/master/How%20Much%20do%20I%20use%20my%20Phone%3F_files/How%20Much%20do%20I%20use%20my%20Phone%3F_10_0.png" alt="png" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/phone-usage-tracking/master/How%20Much%20do%20I%20use%20my%20Phone%3F_files/How%20Much%20do%20I%20use%20my%20Phone%3F_10_1.png" alt="png" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/phone-usage-tracking/master/How%20Much%20do%20I%20use%20my%20Phone%3F_files/How%20Much%20do%20I%20use%20my%20Phone%3F_10_2.png" alt="png" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/phone-usage-tracking/master/How%20Much%20do%20I%20use%20my%20Phone%3F_files/How%20Much%20do%20I%20use%20my%20Phone%3F_10_3.png" alt="png" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/phone-usage-tracking/master/How%20Much%20do%20I%20use%20my%20Phone%3F_files/How%20Much%20do%20I%20use%20my%20Phone%3F_10_4.png" alt="png" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/phone-usage-tracking/master/How%20Much%20do%20I%20use%20my%20Phone%3F_files/How%20Much%20do%20I%20use%20my%20Phone%3F_10_5.png" alt="png" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/phone-usage-tracking/master/How%20Much%20do%20I%20use%20my%20Phone%3F_files/How%20Much%20do%20I%20use%20my%20Phone%3F_10_6.png" alt="png" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/phone-usage-tracking/master/How%20Much%20do%20I%20use%20my%20Phone%3F_files/How%20Much%20do%20I%20use%20my%20Phone%3F_10_8.png" alt="png" /></p>
<p>We can see that there’s a lot of similarity between the days of the week, though it looks like on average, I use my phone less on Thursdays, Fridays, and Sundays, while I use it comparatively more on Tuesdays, Wednesdays, and Saturdays. Overall, each day’s distribution is similar, with a mean of around two hours and a standard deviation of around an hour. Let’s examine weekly usage now.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># extract weeks from the year by sorting days and going by sevens
</span><span class="n">ordered_days</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">reversed</span><span class="p">(</span><span class="n">days</span><span class="p">))</span>
<span class="n">weeks</span> <span class="o">=</span> <span class="p">[</span><span class="n">ordered_days</span><span class="p">[</span><span class="n">i</span><span class="p">:</span><span class="n">i</span><span class="o">+</span><span class="mi">7</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">ordered_days</span><span class="p">),</span> <span class="mi">7</span><span class="p">)]</span>
<span class="n">weekly_usages</span> <span class="o">=</span> <span class="p">[</span><span class="nb">sum</span><span class="p">(</span><span class="n">d</span><span class="p">.</span><span class="n">minutes</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">week</span><span class="p">)</span> <span class="k">for</span> <span class="n">week</span> <span class="ow">in</span> <span class="n">weeks</span><span class="p">]</span>
<span class="c1"># plot each week's usage in a bar graph
</span><span class="n">plt</span><span class="p">.</span><span class="n">bar</span><span class="p">([</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">weekly_usages</span><span class="p">))],</span><span class="n">weekly_usages</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Week of the Year'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Phone Usage Minutes'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Week-by-Week Phone Usage Minutes'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
<span class="c1"># plot weekly usage histogram
</span><span class="n">n</span><span class="p">,</span> <span class="n">bins</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">weekly_usages</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Weekly Phone Usage Distribution'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">bins</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Frequency (# of Weeks)'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Weekly Phone Usage Time'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">text</span><span class="p">(</span><span class="mi">950</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="sa">r</span><span class="s">'$\mu={0:.2f},\ \sigma={1:.2f}$'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span>
<span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">weekly_usages</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">weekly_usages</span><span class="p">)))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">grid</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
<span class="n">max_weekly</span><span class="p">,</span> <span class="n">min_weekly</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">weekly_usages</span><span class="p">),</span> <span class="nb">min</span><span class="p">(</span><span class="n">weekly_usages</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'{} minutes in highest-usage week, {} minutes in lowest-usage week'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span>
<span class="n">max_weekly</span><span class="p">,</span> <span class="n">min_weekly</span><span class="p">))</span>
</code></pre></div></div>
<p><img src="https://raw.githubusercontent.com/rohan-varma/phone-usage-tracking/master/How%20Much%20do%20I%20use%20my%20Phone%3F_files/How%20Much%20do%20I%20use%20my%20Phone%3F_12_0.png" alt="png" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/phone-usage-tracking/master/How%20Much%20do%20I%20use%20my%20Phone%3F_files/How%20Much%20do%20I%20use%20my%20Phone%3F_12_1.png" alt="png" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1685 minutes in highest-usage week, 580 minutes in lowest-usage week
</code></pre></div></div>
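<p>The highest-usage figure printed above is easier to interpret in hours and as a share of the week. Here is a small sketch of that arithmetic; the variable names are my own, and the only input is the 1685-minute figure from the output.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Convert the highest-usage week (taken from the printed output) into hours.
highest_week_minutes = 1685
hours, minutes = divmod(highest_week_minutes, 60)

# Share of the whole week (7 days of 24 hours) spent on the phone.
minutes_per_week = 7 * 24 * 60
share_of_week = highest_week_minutes / minutes_per_week

print('{} h {} min, {:.1%} of the week'.format(hours, minutes, share_of_week))
</code></pre></div></div>
<p>Run as written, this prints 28 h 5 min and 16.7% of the week.</p>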
<p>This is pretty interesting - it looks like my phone usage clustered around the 700-900 minute range for many weeks, with spikes up to the 1100+ minute range in a few of the weeks. My highest-usage week was a whopping 1685 minutes, or over 28 hours: more than an entire day of that week spent solely on my phone. Finally, let’s move on to monthly usage.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># parse months out of dates, and get those days corresponding to the month
</span><span class="n">months</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">set</span><span class="p">([</span><span class="s">"-"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">day</span><span class="p">.</span><span class="n">date</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"-"</span><span class="p">)[</span><span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">])</span> <span class="k">for</span> <span class="n">day</span> <span class="ow">in</span> <span class="n">ordered_days</span><span class="p">]))</span>
<span class="n">month_to_days</span> <span class="o">=</span> <span class="p">{</span><span class="nb">int</span><span class="p">(</span><span class="n">month</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"-"</span><span class="p">)[</span><span class="mi">1</span><span class="p">]):</span>
<span class="p">[</span><span class="n">day</span> <span class="k">for</span> <span class="n">day</span> <span class="ow">in</span> <span class="n">ordered_days</span> <span class="k">if</span> <span class="n">month</span> <span class="ow">in</span> <span class="n">day</span><span class="p">.</span><span class="n">date</span><span class="p">]</span> <span class="k">for</span> <span class="n">month</span> <span class="ow">in</span> <span class="n">months</span><span class="p">}</span>
<span class="c1"># plot bar graph of monthly means
</span><span class="n">monthly_means</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">([</span><span class="n">day</span><span class="p">.</span><span class="n">minutes</span> <span class="k">for</span> <span class="n">day</span> <span class="ow">in</span> <span class="n">li</span><span class="p">])</span> <span class="k">for</span> <span class="n">li</span> <span class="ow">in</span> <span class="nb">list</span><span class="p">(</span><span class="n">month_to_days</span><span class="p">.</span><span class="n">values</span><span class="p">())]</span>
<span class="n">plt</span><span class="p">.</span><span class="n">bar</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">month_to_days</span><span class="p">.</span><span class="n">keys</span><span class="p">()),</span> <span class="n">monthly_means</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Month (1 = Jan)'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Average daily minutes of phone usage'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Average daily minutes of phone usage per month'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">13</span><span class="p">)))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
<span class="c1"># plot histograms of most and least used months.
</span><span class="n">month_keys</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">month_to_days</span><span class="p">.</span><span class="n">keys</span><span class="p">())</span>
<span class="n">most_use_month</span><span class="p">,</span> <span class="n">least_use_month</span> <span class="o">=</span> <span class="n">month_keys</span><span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">monthly_means</span><span class="p">)],</span> <span class="n">month_keys</span><span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">argmin</span><span class="p">(</span><span class="n">monthly_means</span><span class="p">)]</span>
<span class="n">most_use_days</span><span class="p">,</span> <span class="n">least_use_days</span> <span class="o">=</span> <span class="n">month_to_days</span><span class="p">[</span><span class="n">most_use_month</span><span class="p">],</span> <span class="n">month_to_days</span><span class="p">[</span><span class="n">least_use_month</span><span class="p">]</span>
<span class="n">most_use_mins</span> <span class="o">=</span> <span class="p">[</span><span class="n">day</span><span class="p">.</span><span class="n">minutes</span> <span class="k">for</span> <span class="n">day</span> <span class="ow">in</span> <span class="n">most_use_days</span><span class="p">]</span>
<span class="n">least_use_mins</span> <span class="o">=</span> <span class="p">[</span><span class="n">day</span><span class="p">.</span><span class="n">minutes</span> <span class="k">for</span> <span class="n">day</span> <span class="ow">in</span> <span class="n">least_use_days</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">plot</span><span class="p">(</span><span class="n">month</span><span class="p">,</span> <span class="n">mins</span><span class="p">):</span>
    <span class="n">bins</span> <span class="o">=</span> <span class="p">[</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">max</span><span class="p">(</span><span class="n">mins</span><span class="p">)</span> <span class="o">+</span> <span class="mi">60</span><span class="p">,</span> <span class="mi">60</span><span class="p">)]</span>
    <span class="n">n</span><span class="p">,</span> <span class="n">bins</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">mins</span><span class="p">,</span> <span class="n">bins</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Minutes of Phone Usage'</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">bins</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Frequency (# of Days)'</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Histogram of Phone Usage Time: Month {0:02d}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">month</span><span class="p">))</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">text</span><span class="p">(</span><span class="mi">300</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="sa">r</span><span class="s">'$\mu={0:.2f},\ \sigma={1:.2f}$'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">mins</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">mins</span><span class="p">)))</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">grid</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
<span class="n">plot</span><span class="p">(</span><span class="n">most_use_month</span><span class="p">,</span> <span class="n">most_use_mins</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">least_use_month</span><span class="p">,</span> <span class="n">least_use_mins</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="https://raw.githubusercontent.com/rohan-varma/phone-usage-tracking/master/How%20Much%20do%20I%20use%20my%20Phone%3F_files/How%20Much%20do%20I%20use%20my%20Phone%3F_14_0.png" alt="png" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/phone-usage-tracking/master/How%20Much%20do%20I%20use%20my%20Phone%3F_files/How%20Much%20do%20I%20use%20my%20Phone%3F_14_1.png" alt="png" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/phone-usage-tracking/master/How%20Much%20do%20I%20use%20my%20Phone%3F_files/How%20Much%20do%20I%20use%20my%20Phone%3F_14_2.png" alt="png" /></p>
<p>It looks like my monthly phone usage was fairly consistent, hovering slightly above the two-hour mark, with a dip during the summer months and an increase during April and December. In April, I used my phone for slightly over three hours a day on average (nearly 19% of my waking hours!), compared with 1 hour and 43 minutes, or 10.7% of my waking hours, in the least-used month of August.</p>
<p>To be fair, April had a large amount of variability, so the mean of 3 hours doesn’t really reflect my usual usage that month: April contained all three of the outliers in the entire year, where I used my phone for more than 6 hours. Honestly, I’m not too sure what happened; I either left my phone on accidentally at some points during the month or, more realistically, just used my phone a lot on those few days.</p>
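<p>To see how a handful of very heavy days can pull a monthly mean away from typical usage, here is a minimal sketch comparing the mean and the median. The minute values are made up for illustration (27 ordinary days plus three outliers above six hours) and are not my actual April data.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# Hypothetical month: 27 ordinary two-hour days and three outlier days.
typical_days = [120] * 27
outlier_days = [380, 400, 420]
month_minutes = np.array(typical_days + outlier_days)

mean_minutes = np.mean(month_minutes)      # pulled upwards by the outliers
median_minutes = np.median(month_minutes)  # unaffected by the three outliers

print('mean: {:.1f} min, median: {:.1f} min'.format(mean_minutes, median_minutes))
</code></pre></div></div>
<p>With these made-up numbers the mean is 148 minutes while the median stays at 120, so the median is arguably the better summary of a “usual” day when a month contains outliers.</p>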
<h4 id="concluding-remarks">Concluding Remarks</h4>
<p>The data indicates that my phone occupies a pretty significant chunk of my waking hours on an average day. I knew, like many people, that I used my phone a lot, but I didn’t quite understand how much until I actually looked at the data. However, this data doesn’t capture more granular information about whether I’m using my phone for more “useful” purposes such as necessary communication, homework/work-related stuff, calling a Lyft/Uber, getting directions, or talking on the phone, versus more typical time wasters (randomly checking social media/email for the 10000th+ time or just browsing around).</p>
<p>The overall takeaway for me is to think of my phone more as a tool, instead of as a distraction for when I’m bored. Phones and applications can be incredibly useful in keeping us connected with our friends and family, getting from place to place, learning new things, or capturing incredible moments, but can also take away from the present moment.</p>
<p>In 2019, I’m going to make a conscious effort to simply note when I use my phone immediately when boredom presents itself, such as during a long car ride, waiting for an elevator, or even just walking from place to place. Hopefully, this will make me more mindful when I use my phone to distract myself from the present moment, and in time, I can learn to turn off this deeply ingrained habit. Here’s to being more present in 2019.</p>
<p><strong>Training very deep networks with Batchnorm</strong> (2018-02-19, https://rohanvarma.me/Batch-Norm)</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/batchnorm_grad_first_layer.png" alt="grad" /></p>
<p>Training very deep neural networks is hard. It turns out one significant issue with deep neural networks is that the activations of each layer tend to converge to 0 in the later layers, and therefore the gradients vanish as they backpropagate throughout the network.</p>
<p>A lot of this has to do with the sheer depth of the network - as you multiply numbers with magnitude less than one together over and over, the product converges to zero, and that’s partially why network architectures such as InceptionV3 insert auxiliary classifiers after layers earlier on in their network, so there’s a stronger gradient signal backpropagated during the first few epochs of training.</p>
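<p>As a rough illustration of how repeated multiplication shrinks a signal, here is a small sketch. The per-layer factor of 0.9 and the depths are hypothetical values chosen for the example, not measurements from a real network.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical per-layer scaling factor with magnitude below one.
factor = 0.9
depths = [1, 5, 20, 50]

# Fraction of the signal that survives after passing through `depth` such layers.
surviving = {depth: factor ** depth for depth in depths}

for depth, value in surviving.items():
    print('after {:2d} layers: {:.4f}'.format(depth, value))
</code></pre></div></div>
<p>Even with a factor this close to one, roughly 12% of the signal is left after 20 layers and about half a percent after 50.</p>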
<p>However, there’s also a more subtle issue that leads to this problem of vanishing activations and gradients. It has to do with the initialization of the weights in each layer of our network, and the subsequent distributions of the activations in our network. Understanding this issue is key to understanding why batch normalization is now a staple in training deep networks.</p>
<p>First, we can write some code to generate some random data, and forward it through a dummy deep neural network:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n_examples</span><span class="p">,</span> <span class="n">hidden_layer_dim</span> <span class="o">=</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">100</span>
<span class="n">input_dim</span> <span class="o">=</span> <span class="mi">1000</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">n_examples</span><span class="p">,</span> <span class="n">input_dim</span><span class="p">)</span> <span class="c1"># 100 examples of 1000 points
</span><span class="n">n_layers</span> <span class="o">=</span> <span class="mi">20</span>
<span class="n">layer_dim</span> <span class="o">=</span> <span class="p">[</span><span class="n">hidden_layer_dim</span><span class="p">]</span> <span class="o">*</span> <span class="n">n_layers</span> <span class="c1"># each one has 100 neurons
</span>
<span class="n">hs</span> <span class="o">=</span> <span class="p">[</span><span class="n">X</span><span class="p">]</span> <span class="c1"># stores the hidden layer activations
</span><span class="n">zs</span> <span class="o">=</span> <span class="p">[</span><span class="n">X</span><span class="p">]</span> <span class="c1"># stores the affine transforms in each layer, used for backprop
</span><span class="n">ws</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1"># stores the weights
</span>
<span class="c1"># the forward pass
</span><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">n_layers</span><span class="p">):</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">hs</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># get the input into this hidden layer
</span>    <span class="c1">#W = np.random.randn(h.shape[0], layer_dim[i]) * np.sqrt(2)/(np.sqrt(200) * np.sqrt(3))
</span>    <span class="c1">#W = np.random.uniform(-np.sqrt(6)/(200), np.sqrt(6)/200, size = (h.shape[0], layer_dim[i]))
</span>    <span class="n">W</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">2</span><span class="o">/</span><span class="p">(</span><span class="n">h</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">layer_dim</span><span class="p">[</span><span class="n">i</span><span class="p">])),</span> <span class="n">size</span> <span class="o">=</span> <span class="p">(</span><span class="n">layer_dim</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">h</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
    <span class="c1">#W = np.random.normal(0, np.sqrt(2/(h.shape[0] + layer_dim[i])), size = (layer_dim[i], h.shape[0])) * 0.01
</span>    <span class="n">z</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">W</span><span class="p">,</span> <span class="n">h</span><span class="p">)</span>
    <span class="n">h_out</span> <span class="o">=</span> <span class="n">z</span> <span class="o">*</span> <span class="p">(</span><span class="n">z</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span>
    <span class="n">ws</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">W</span><span class="p">)</span>
    <span class="n">zs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
    <span class="n">hs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">h_out</span><span class="p">)</span>
</code></pre></div></div>
<p>Now that we have a list of each layer’s hidden activations stored in <strong>hs</strong>, we can go ahead and plot the activations to see what their distribution looks like. Here, I’ve included plots of the activations at the first and final hidden layers in our 20 layer network:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/activation_0.png" alt="act0" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/activation_19.png" alt="act19" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/activation_20.png" alt="act20" /></p>
<p>What’s important to notice is that in later layers, <em>nearly all of the activations are zero</em> (just look at the scale of the axes). If we look at the distributions of these activations, it’s clear that they differ significantly from each other - the first activation takes on a clear Gaussian shape around 0, while successive hidden layers have most of their activations at 0, with rapidly decreasing variance. This is closely related to what the <a href="https://arxiv.org/pdf/1502.03167.pdf">batch normalization paper</a> refers to as <em>internal covariate shift</em> - the change in the distribution of each layer’s inputs as the parameters of the preceding layers change during training.</p>
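<p>One way to put a number on the shrinking spread is to record the standard deviation of each layer’s activations. The sketch below re-runs a forward pass like the one above under my own assumptions: a fixed seed, NumPy’s <code>default_rng</code> generator instead of <code>np.random.randn</code>, and variable names that are not from the original code.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# Assumed setup mirroring the post: 20 layers of 100 units, input of shape (100, 1000).
rng = np.random.default_rng(0)
n_layers, hidden_dim, n_columns = 20, 100, 1000
h = rng.standard_normal((hidden_dim, n_columns))

activation_stds = []
for _ in range(n_layers):
    # Same scale as the forward pass above: sqrt(2 / (fan_in + fan_out)).
    scale = np.sqrt(2 / (hidden_dim + hidden_dim))
    W = rng.normal(0, scale, size=(hidden_dim, hidden_dim))
    h = np.maximum(W @ h, 0)  # ReLU
    activation_stds.append(h.std())

print('first layer std: {:.4f}'.format(activation_stds[0]))
print('last layer std:  {:.6f}'.format(activation_stds[-1]))
</code></pre></div></div>
<p>Under these assumptions the spread of the activations drops by a few orders of magnitude between the first and the last layer, which is what the narrowing axes in the plots show.</p>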
<p><strong>Why does this matter, and why is this bad?</strong></p>
<p>This is bad mostly due to the small and decreasing variance in the distributions of our activations across layers. Having zero activations is fine, unless nearly all your activations are zero. To understand why this is bad, we need to look at the backwards pass of our network, which is responsible for computing each gradient \(\frac{dL}{dW_i}\) across each hidden layer in our network. Given the following formulation of an arbitrary layer in our network: \(h_i=\mathrm{relu}(W_ih_{i-1}+b_i)\) where \(h_i\) denotes the activations of the <em>i</em>th layer in our network, we can construct the local gradient \(\frac{dL}{dW_i}\). Given an upstream gradient into this layer \(\frac{dL}{dh_i}\), we can compute the local gradient with the chain rule:</p>
\[\frac{dL}{dW_i} = \frac{dh_i}{dW_i} * \frac{dL}{dh_i}\]
<p>Applying the derivatives, we obtain:</p>
\[\frac{dL}{dW_i} = [\mathbb{1}(W_ih_{i-1} + b_i > 0) \odot \frac{dL}{dh_i}]h_{i-1}^T\]
<p>Concretely, we can take our loss function for a single point to be the squared error, i.e. \(L = \frac{1}{2}(y-t)^2\), where \(y\) is the network’s output and \(t\) is the target. If we were at the last layer of our network (i.e. \(h_i = y\)), our upstream gradient would be \(\frac{dL}{dh_i} = h_i - t\). This would give us a gradient of</p>
\[\frac{dL}{dW_i} = [\mathbb{1}(W_ih_{i-1} + b_i > 0) \odot (h_i - t)]h_{i-1}^T\]
<p>in the final layer of our network.</p>
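<p>A quick way to convince yourself that this expression is right is to compare it against a finite-difference estimate. The sketch below does this for a single tiny final layer; the shapes, the seed, and every variable name are assumptions made for the example.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
# Hypothetical tiny layer: 4 inputs, 3 outputs, a single data point.
W = rng.normal(size=(3, 4))
b = rng.normal(size=(3, 1))
h_prev = rng.normal(size=(4, 1))
t = rng.normal(size=(3, 1))

def loss(weights):
    """Squared error of relu(weights @ h_prev + b) against the target t."""
    h = np.maximum(weights @ h_prev + b, 0)
    return 0.5 * np.sum((h - t) ** 2)

# Analytic gradient: [1(z > 0) * (h - t)] h_prev^T
z = W @ h_prev + b
h = np.maximum(z, 0)
analytic = ((z > 0) * (h - t)) @ h_prev.T

# Central finite differences, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(W)
for r in range(W.shape[0]):
    for c in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[r, c] += eps
        W_minus[r, c] -= eps
        numeric[r, c] = (loss(W_plus) - loss(W_minus)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print('largest difference between the two gradients:', max_abs_diff)
</code></pre></div></div>
<p>The two gradients should agree up to floating-point noise. Rows of the gradient whose relu unit is inactive come out as exactly zero in the analytic version, which is the masking behaviour that the indicator function encodes.</p>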
<p><strong>What does this tell us about our gradients for our weights?</strong></p>
<p>The expression for the gradient of our weights is intuitive: for every element in the incoming gradient matrix, pass the gradient through if this layer’s linear transformation would activate the relu neuron at that element, and scale the gradient by our input into this layer. Otherwise, zero out the gradient.</p>
<p>This means that if the incoming gradient at a certain element wasn’t already zero, it will be scaled by the input into this layer. The input into this layer is just the activations from the previous layer in our network. And as we discussed above, nearly all of those activations were zero.</p>
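<p>To make the scaling concrete, here is a minimal sketch with made-up shapes and names. It holds the masked upstream gradient fixed and shrinks the layer’s input by a factor of 1000.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
# Hypothetical shapes: 5 units in this layer, 4 inputs, 3 data columns.
dLdz = rng.standard_normal((5, 3))    # gradient after the relu mask, held fixed
h_prev = rng.standard_normal((4, 3))  # activations feeding this layer

weight_grad = dLdz @ h_prev.T           # dL/dW with ordinary-sized inputs
shrunk_grad = dLdz @ (1e-3 * h_prev).T  # same upstream gradient, tiny inputs

ratio = np.abs(shrunk_grad).max() / np.abs(weight_grad).max()
print('largest gradient entry shrinks by a factor of', ratio)
</code></pre></div></div>
<p>Because the weight gradient is linear in the layer’s input, shrinking the input activations shrinks every entry of the gradient by the same factor, no matter how strong the upstream signal is.</p>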
<p>Therefore, nearly all of the gradients backpropagated through our network will be zero, and few weight updates, if any, will occur. In the final few layers of our network, this isn’t as much of a problem. We have a strong gradient signal (i.e. \(h_i - t\) in the example above) coming from the gradient of our loss function with respect to the outputs of our network (since it is early in learning, and our predictions are inaccurate). However, after we backpropagate this signal even a few layers, chances that the gradient is zeroed out become extremely high.</p>
<p>In order to see if this is actually true, we can write out the backwards pass of our 20 layer network, and plot the gradients as we did with our activations. The following code computes the gradients using the expression given above, for all layers in our network:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dLdh</span> <span class="o">=</span> <span class="mi">100</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">hidden_layer_dim</span><span class="p">,</span> <span class="n">input_dim</span><span class="p">)</span> <span class="c1"># random incoming grad into our last layer
</span><span class="n">h_grads</span> <span class="o">=</span> <span class="p">[</span><span class="n">dLdh</span><span class="p">]</span> <span class="c1"># store the incoming grads into each layer
</span><span class="n">w_grads</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1"># store dL/dw for each layer
</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">np</span><span class="p">.</span><span class="n">flip</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_layers</span> <span class="o">+</span> <span class="mi">1</span><span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">):</span>
    <span class="c1"># get the incoming gradient
</span>    <span class="n">incoming_loss_grad</span> <span class="o">=</span> <span class="n">h_grads</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
    <span class="c1"># backprop through the relu
</span>    <span class="n">dLdz</span> <span class="o">=</span> <span class="n">incoming_loss_grad</span> <span class="o">*</span> <span class="p">(</span><span class="n">zs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span> <span class="c1"># zs[i] is the pre-activation of layer i (W h, no bias in this demo)
</span>    <span class="c1"># get the gradient dL/dh_{i-1}, this will be the incoming grad into the next layer
</span>    <span class="n">h_grad</span> <span class="o">=</span> <span class="n">ws</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">T</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">dLdz</span><span class="p">)</span> <span class="c1"># ws[i-1] are our weights at this layer
</span>    <span class="c1"># get the gradient of the weights of this layer (dL/dw)
</span>    <span class="n">weight_grad</span> <span class="o">=</span> <span class="n">dLdz</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">hs</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">T</span><span class="p">)</span> <span class="c1"># hs[i-1] was our input into this layer
</span>    <span class="n">h_grads</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">h_grad</span><span class="p">)</span>
    <span class="n">w_grads</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">weight_grad</span><span class="p">)</span>
</code></pre></div></div>
<p>Now, we can plot our gradients for our earlier layers and see if our hypothesis was true:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/grad_layer2.png" alt="grad1" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/grad_layer_3.png" alt="grad3" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/grad_layer_4.png" alt="grad4" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/grads_layer_20.png" alt="grad20" /></p>
<p>As we can see, vanishing gradients aren’t an issue for the final layer, but they are for earlier layers: after a few layers, nearly all of the gradients are zero. This will result in extremely slow learning, if any learning happens at all.</p>
<p><strong>Ok, but what does batch normalization have to do with any of this?</strong></p>
<p>Batch normalization addresses the root cause of our zero activations and vanishing gradients by reducing internal covariate shift. We want to ensure that the variances of our activations do not differ too much from each other. Batch normalization does this by normalizing each activation in a batch:</p>
\[\hat{x_k} = \frac{x_k - \mu_B}{\sqrt{\sigma^2_B + \epsilon}}\]
<p>Here, \(x_k\) denotes a particular activation, \(\hat{x_k}\) is its normalized value, and \(\mu_B\), \(\sigma^2_B\) are the mean and variance of that activation across the minibatch. A small constant \(\epsilon\) is added to ensure that we don’t divide by zero.</p>
<p>This constrains all hidden layer activations to have zero mean and unit variance, so the variances in our hidden layer activations should not differ too much from each other, and therefore we shouldn’t have nearly all our activations be zero.</p>
<p>It’s important to note here that batch normalization doesn’t <em>force</em> the network activations to rigidly follow this distribution at all times, because the above result is scaled and shifted by some parameters before being passed as input into the next layer:</p>
\[y_k = \gamma \hat{x_k} + \beta\]
<p>This allows the network to “undo” the previous normalization procedure if it wants to. For example, if \(y_k\) is the input into a sigmoid neuron, we may not want to normalize at all, because doing so would keep the input in the roughly linear regime of the sigmoid and constrain the expressivity of the neuron.</p>
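<p>As a rough illustration of the two equations above, here is a minimal NumPy sketch of the normalize-then-scale-and-shift step. The function name <code>batch_norm</code>, the toy data, and the convention that rows are examples in the batch are assumptions made for this example; they are not taken from the demo code in this post.</p>

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each column of x over the batch axis (axis 0), then scale and shift."""
    mu = x.mean(axis=0)                     # per-activation mean over the batch
    var = x.var(axis=0)                     # per-activation variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, (almost) unit variance
    return gamma * x_hat + beta             # learnable scale and shift

rng = np.random.default_rng(0)
acts = 5.0 + 3.0 * rng.standard_normal((256, 4))  # toy activations: mean 5, std 3

out = batch_norm(acts)
print(out.mean(axis=0), out.std(axis=0))

# Choosing gamma = sqrt(var + eps) and beta = mu "undoes" the normalization.
restored = batch_norm(
    acts,
    gamma=np.sqrt(acts.var(axis=0) + 1e-5),
    beta=acts.mean(axis=0),
)
print(np.allclose(restored, acts))
```

<p>The last call shows the “undo” property in the simplest case: with that particular choice of \(\gamma\) and \(\beta\), the layer reproduces its input.</p>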
<p><strong>Does normalizing our inputs into the next layer actually work?</strong></p>
<p>With batch normalization, we can be confident that the distributions of our activations across hidden layers are reasonably similar. If this is true, then we know that the gradients should have a wider distribution, and not be nearly all zero, following the same scaling logic described above.</p>
<p>Let’s add batch normalization to our forward pass to see if the activations have reasonable variances. Our forward pass changes in only a few lines:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n_examples</span><span class="p">,</span> <span class="n">hidden_layer_dim</span> <span class="o">=</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">100</span>
<span class="n">input_dim</span> <span class="o">=</span> <span class="mi">1000</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">n_examples</span><span class="p">,</span> <span class="n">input_dim</span><span class="p">)</span> <span class="c1"># 100 examples of 1000 points
</span><span class="n">n_layers</span> <span class="o">=</span> <span class="mi">20</span>
<span class="n">layer_dim</span> <span class="o">=</span> <span class="p">[</span><span class="n">hidden_layer_dim</span><span class="p">]</span> <span class="o">*</span> <span class="n">n_layers</span> <span class="c1"># each one has 100 neurons
</span>
<span class="n">hs</span> <span class="o">=</span> <span class="p">[</span><span class="n">X</span><span class="p">]</span> <span class="c1"># save hidden states
</span><span class="n">hs_not_batchnormed</span> <span class="o">=</span> <span class="p">[</span><span class="n">X</span><span class="p">]</span> <span class="c1"># saves the results before we do batchnorm, because we need this in the backward pass.
</span><span class="n">zs</span> <span class="o">=</span> <span class="p">[</span><span class="n">X</span><span class="p">]</span> <span class="c1"># save affine transforms for backprop
</span><span class="n">ws</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1"># save the weights
</span><span class="n">gamma</span><span class="p">,</span> <span class="n">beta</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span>
<span class="c1"># the forward pass
</span><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">n_layers</span><span class="p">):</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">hs</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># get the input into this hidden layer
</span>    <span class="n">W</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">size</span> <span class="o">=</span> <span class="p">(</span><span class="n">layer_dim</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">h</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span> <span class="o">*</span> <span class="mf">0.01</span> <span class="c1"># weight init: gaussian around 0
</span>    <span class="n">z</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">W</span><span class="p">,</span> <span class="n">h</span><span class="p">)</span>
    <span class="n">h_out</span> <span class="o">=</span> <span class="n">z</span> <span class="o">*</span> <span class="p">(</span><span class="n">z</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span>
    <span class="c1"># save the not-yet-batchnormed activations for backprop
</span>    <span class="n">hs_not_batchnormed</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">h_out</span><span class="p">)</span>
    <span class="c1"># apply batch normalization
</span>    <span class="n">h_out</span> <span class="o">=</span> <span class="p">(</span><span class="n">h_out</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">h_out</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">))</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">h_out</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
    <span class="c1"># scale and shift
</span>    <span class="n">h_out</span> <span class="o">=</span> <span class="n">gamma</span> <span class="o">*</span> <span class="n">h_out</span> <span class="o">+</span> <span class="n">beta</span>
    <span class="n">ws</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">W</span><span class="p">)</span>
    <span class="n">zs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
    <span class="n">hs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">h_out</span><span class="p">)</span>
</code></pre></div></div>
<p>Using the results of this forward pass (again stored in <strong>hs</strong>), we can plot a few of the activations:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/batchnorm_activation_4.png" alt="act4" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/batchnorm_activation_19.png" alt="act20" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/batchnorm_activation_20.png" alt="act20" /></p>
<p>This is great! Our later activations now have a much more reasonable distribution than before, when they were almost all zero - just compare the scales of the axes on the batchnorm graphs against the original graphs.</p>
<p>Let’s see if this makes any difference in our gradients. First, we have to rewrite our original backwards pass to accommodate the gradients for the batchnorm operation. The gradients I used in the batchnorm layer are the ones given by the <a href="https://arxiv.org/pdf/1502.03167.pdf">original paper</a>. Our backwards pass now becomes:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dLdh</span> <span class="o">=</span> <span class="mf">0.01</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">hidden_layer_dim</span><span class="p">,</span> <span class="n">input_dim</span><span class="p">)</span> <span class="c1"># random incoming grad into our last layer
</span><span class="n">h_grads</span> <span class="o">=</span> <span class="p">[</span><span class="n">dLdh</span><span class="p">]</span> <span class="c1"># incoming grads into each layer
</span><span class="n">w_grads</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1"># will hold dL/dw_i for each layer
</span>
<span class="c1"># the backwards pass
</span><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">np</span><span class="p">.</span><span class="n">flip</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_layers</span><span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">):</span>
    <span class="c1"># get the incoming gradient
</span>    <span class="n">incoming_loss_grad</span> <span class="o">=</span> <span class="n">h_grads</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
    <span class="c1"># backprop through the batchnorm layer
</span>    <span class="c1"># y_i is the result of batch norm, so h_out or hs[i]; x is the input to batch norm (x_i in the paper)
</span>    <span class="n">x</span> <span class="o">=</span> <span class="n">hs_not_batchnormed</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
    <span class="n">m</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">mu</span><span class="p">,</span> <span class="n">std</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
    <span class="n">dldx_hat</span> <span class="o">=</span> <span class="n">incoming_loss_grad</span> <span class="o">*</span> <span class="n">gamma</span>
    <span class="n">dldvar</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">dldx_hat</span> <span class="o">*</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">mu</span><span class="p">)</span> <span class="o">*</span> <span class="o">-</span><span class="mf">0.5</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">power</span><span class="p">(</span><span class="n">std</span><span class="p">,</span> <span class="o">-</span><span class="mf">3.0</span><span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
    <span class="c1"># the second term sums -2 * (x - mu) over the batch, as in the paper
</span>    <span class="n">dldmean</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="o">-</span><span class="n">dldx_hat</span> <span class="o">/</span> <span class="n">std</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="n">dldvar</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">mu</span><span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
    <span class="c1"># the following is dL/dhs_not_batchnormed[i] (aka dL/dx_i) in the paper!
</span>    <span class="n">dldx</span> <span class="o">=</span> <span class="n">dldx_hat</span> <span class="o">/</span> <span class="n">std</span> <span class="o">+</span> <span class="n">dldvar</span> <span class="o">*</span> <span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">mu</span><span class="p">)</span> <span class="o">/</span> <span class="n">m</span> <span class="o">+</span> <span class="n">dldmean</span> <span class="o">/</span> <span class="n">m</span>
    <span class="c1"># although we don't need it for this demo, for completeness we also compute the derivatives with respect to gamma and beta.
</span>    <span class="n">dldgamma</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">incoming_loss_grad</span> <span class="o">*</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">mu</span><span class="p">)</span> <span class="o">/</span> <span class="n">std</span><span class="p">)</span>
    <span class="n">dldbeta</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">incoming_loss_grad</span><span class="p">)</span>
    <span class="c1"># now incoming_loss_grad should be replaced by that backpropped result
</span>    <span class="n">incoming_loss_grad</span> <span class="o">=</span> <span class="n">dldx</span>
    <span class="c1"># backprop through the relu
</span>    <span class="n">dLdz</span> <span class="o">=</span> <span class="n">incoming_loss_grad</span> <span class="o">*</span> <span class="p">(</span><span class="n">zs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span>
    <span class="c1"># get the gradient dL/dh_{i-1}, this will be the incoming grad into the next layer
</span>    <span class="n">h_grad</span> <span class="o">=</span> <span class="n">ws</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">T</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">dLdz</span><span class="p">)</span>
    <span class="c1"># get the gradient of the weights of this layer (dL/dw)
</span>    <span class="n">weight_grad</span> <span class="o">=</span> <span class="n">dLdz</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">hs</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">T</span><span class="p">)</span>
    <span class="n">h_grads</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">h_grad</span><span class="p">)</span>
    <span class="n">w_grads</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">weight_grad</span><span class="p">)</span>
</code></pre></div></div>
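<p>Batchnorm gradients are easy to get subtly wrong, so it can help to check them numerically. The following is a small, standalone sketch (not part of the demo code) that implements the gradient formulas from the paper for a single batchnorm layer and compares the result against central finite differences. The helper names <code>bn_forward</code> and <code>bn_backward</code>, the toy shapes, and the use of \(\epsilon\) inside the square root are assumptions made for this example.</p>

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    """Batch norm over axis 0 (the batch axis); returns the output and a cache."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, (x, mu, var, x_hat, gamma, eps)

def bn_backward(dy, cache):
    """Gradients of the loss w.r.t. x, gamma and beta, following the paper."""
    x, mu, var, x_hat, gamma, eps = cache
    m = x.shape[0]
    inv_std = 1.0 / np.sqrt(var + eps)
    dx_hat = dy * gamma
    dvar = np.sum(dx_hat * (x - mu), axis=0) * -0.5 * inv_std**3
    dmu = np.sum(-dx_hat * inv_std, axis=0) + dvar * np.mean(-2.0 * (x - mu), axis=0)
    dx = dx_hat * inv_std + dvar * 2.0 * (x - mu) / m + dmu / m
    dgamma = np.sum(dy * x_hat, axis=0)
    dbeta = np.sum(dy, axis=0)
    return dx, dgamma, dbeta

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 3))
gamma = rng.standard_normal(3)
beta = rng.standard_normal(3)
dy = rng.standard_normal((8, 3))   # stand-in for the incoming gradient

_, cache = bn_forward(x, gamma, beta)
dx, dgamma, dbeta = bn_backward(dy, cache)

def loss(x_in):
    # A scalar "loss" whose gradient w.r.t. the batchnorm output is exactly dy.
    y, _ = bn_forward(x_in, gamma, beta)
    return np.sum(y * dy)

# Central finite differences, one input entry at a time.
h = 1e-5
dx_num = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    x_plus, x_minus = x.copy(), x.copy()
    x_plus[idx] += h
    x_minus[idx] -= h
    dx_num[idx] = (loss(x_plus) - loss(x_minus)) / (2 * h)

max_err = np.max(np.abs(dx - dx_num))
print(max_err)
```

<p>If the analytic and numerical gradients agree to within a small tolerance, the backward formulas are very likely implemented correctly.</p>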
<p>Using this backwards pass, we can now plot our gradients. We expect them to no longer be nearly all zero, which will mean that avoiding internal covariate shift fixed our vanishing gradients problem:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/batchnorm_grad_first_layer.png" alt="bngrad1" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/batchnorm_grad_second_layer.png" alt="bngrad3" /></p>
<p>Awesome! Looking at our gradients early in the network, we can see that they follow a roughly normal distribution with plenty of non-zero, large-magnitude values. Since our gradients are much more reasonable than previously, where they were nearly all zero, we are more confident that learning will occur at a reasonable rate, even for a large deep neural network (20 layers). We’ve successfully used batch normalization to fix one of the most common issues in training deep neural networks!</p>
<h4 id="intuition-for-why-batch-normalization-helps-with-better-gradient-signals">Intuition for why Batch Normalization helps with better gradient signals</h4>
<p>When gradient descent updates a certain layer in our network with the gradient \(\frac{dL}{dW_i}\), it is ignorant of the changes in statistics in other layers - for example, it implicitly assumes that the distribution of the activations of the previous layer (and hence the input into this layer) stays the same as it updates the current layer. Without batch normalization, this assumption isn’t true: gradient descent also updates the weights in the previous layer, which changes the statistics of that layer’s output activations. Therefore, there could be a case where we update layer \(i\), but the distribution of the inputs into that layer changes such that the update actually does <em>worse</em> on these new inputs. Batch normalization addresses this by keeping the statistics (mean and variance) of the normalized input into each layer fixed throughout the learning process. See <a href="https://www.youtube.com/watch?v=Xogn6veSyxA&feature=youtu.be&t=325">this explanation</a> by Goodfellow for more on this.</p>
<p>P.S. - all the code used to generate the plots used in this answer is available <a href="https://github.com/rohan-varma/nn-init-demo/">here</a>.</p>
<h4 id="references">References</h4>
<ol>
<li><a href="https://arxiv.org/abs/1502.03167">Batch Normalization Paper</a></li>
<li><a href="http://cs231n.stanford.edu">CS 231n Lecture on Batch Normalization</a></li>
</ol>
<h4 id="notes">Notes</h4>
<p>[2/19/18] - I originally wrote this as an <a href="https://www.quora.com/How-does-batch-normalization-help/answer/Rohan-Varma-8">answer on Quora</a></p>
<p>[2/21/18] - The code used in the forward and backward pass isn’t completely accurate with respect to scaling the outputs by parameters \(\gamma\) and \(\beta\). In actuality, there is supposed to be a \(\gamma_i\) and a \(\beta_i\) for <em>each</em> activation in <em>each</em> hidden layer - for example, if we have a batch of \(n\) activations and each activation has shape \(1000\), there should be \(1000\) \(\gamma_i\)s and \(1000\) \(\beta_i\)s in each layer. I didn’t bother to actually implement it this way as it doesn’t affect the normalization process for the one step I illustrated.</p>
<p>[2/22/18] - I applied batch normalization <em>after</em> the ReLU nonlinearity, whereas the original paper states that it is applied after the affine layer and <em>before</em> the nonlinearity. Apparently, their actual code applies it after the ReLU as well, and it was mis-stated in their paper. See <a href="https://www.reddit.com/r/MachineLearning/comments/67gonq/d_batch_normalization_before_or_after_relu/">this Reddit thread</a> for more discussion.</p>Picking Loss Functions - A comparison between MSE, Cross Entropy, and Hinge Loss2018-01-09T00:00:00+00:002018-01-09T00:00:00+00:00https://rohanvarma.me/Loss-Functions<p><img src="https://raw.githubusercontent.com/rohan-varma/rohan-blog/gh-pages/images/loss3.jpg" alt="loss" /></p>
<p>Loss functions are a key part of any machine learning model: they define an objective against which the performance of your model is measured, and the setting of weight parameters learned by the model is determined by minimizing a chosen loss function. There are several different common loss functions to choose from: the cross-entropy loss, the mean-squared error, the Huber loss, and the hinge loss - just to name a few. Given a particular model, each loss function has particular properties that make it interesting - for example, the (L2-regularized) hinge loss comes with the maximum-margin property, and the mean-squared error when used in conjunction with linear regression comes with convexity guarantees.</p>
<p>In this post, I’ll discuss three common loss functions: the mean-squared (MSE) loss, cross-entropy loss, and the hinge loss. These are the loss functions I’ve seen used most often in traditional machine learning and deep learning models, so I thought it would be a good idea to figure out the underlying theory behind each one, and when to prefer one over the others.</p>
<h4 id="the-mean-squared-loss-probabalistic-interpretation">The Mean-Squared Loss: Probabilistic Interpretation</h4>
<p>For a model prediction such as \(h_\theta(x_i) = \theta_0 + \theta_1x_i\) (a simple linear regression in 2 dimensions), where \(x_i\) is the input feature, the mean-squared error is given by averaging across all \(N\) training examples the squared difference between the true label \(y_i\) and the prediction \(h_\theta(x_i)\):</p>
\[J = \frac{1}{N} \sum_{i=1}^{N} (y_i - h_\theta(x_i))^2\]
<p>It turns out we can derive the mean-squared loss by considering a typical linear regression problem.</p>
<p>With linear regression, we seek to model our real-valued labels \(Y\) as being a linear function of our inputs \(X\), corrupted by some noise. Let’s write out this assumption:</p>
\[Y = \theta_0 + \theta_1x + \eta\]
<p>And to solidify our assumption, we’ll say that \(\eta\) is Gaussian noise with 0 mean and unit variance, that is \(\eta \sim N(0, 1)\). This means that \(E[Y] = E[\theta_0 + \theta_1x + \eta] = \theta_0 + \theta_1x\) and \(Var[Y] = Var[\theta_0 + \theta_1x + \eta] = 1\), so \(Y\) is also Gaussian with mean \(\theta_0 + \theta_1x\) and variance 1.</p>
<p>We can write out the probability density of observing a single \((x_i, y_i)\) sample:</p>
\[p(y_i \vert x_i) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(y_{i} - (\theta_{0} + \theta_{1}x_{i}))^2}{2}}\]
<p>Combining all \(N\) of the samples in our dataset, we can write down the likelihood - essentially the probability of observing all \(N\) of our samples. Note that we also make the assumption that our data are independent of each other, so we can write out the likelihood as a simple product over each individual probability:</p>
\[L(x, y) = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}}e^{-\frac{(y_i - (\theta_0 + \theta_1x_i))^2}{2}}\]
<p>Next, we can take the log of our likelihood function to obtain the log-likelihood, a function that is easier to differentiate and overall nicer to work with:</p>
\[l(x, y) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{N}(y_i - (\theta_0 + \theta_1x_i))^2\]
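<p>Maximizing this log-likelihood with respect to \(\theta\) is equivalent to minimizing its negative. Dropping any terms that do not depend on \(\theta\), this gives us the squared-error objective, which is the MSE up to a constant factor that does not change the minimizer:</p>
\[J = \frac{1}{2}\sum_{i=1}^{N}(y_i - (\theta_0 + \theta_1x_i))^2\]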
<p>Essentially, this means that using the MSE loss makes sense if you assume that your outputs are a real-valued function of your inputs, corrupted by a certain amount of irreducible Gaussian noise with constant mean and variance. If these assumptions don’t hold true (such as in the context of classification), the MSE loss may not be the best bet.</p>
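<p>To make the link between the Gaussian likelihood and the squared error concrete, here is a small sketch on synthetic data. It fits \(\theta\) by ordinary least squares and then checks that nudging \(\theta\) in any direction only increases the negative log-likelihood under the unit-variance Gaussian noise model. The data, the true parameters (2.0 and 0.5), and the helper name <code>neg_log_likelihood</code> are all made up for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
# Assumed ground truth: theta_0 = 2.0, theta_1 = 0.5, noise eta ~ N(0, 1).
y = 2.0 + 0.5 * x + rng.standard_normal(200)

# Least-squares fit, i.e. the minimizer of the sum of squared errors.
A = np.column_stack([np.ones_like(x), x])
theta_ls, *_ = np.linalg.lstsq(A, y, rcond=None)

def neg_log_likelihood(theta):
    """Negative log-likelihood of the data under y ~ N(theta_0 + theta_1 * x, 1)."""
    resid = y - A @ theta
    return 0.5 * np.sum(resid**2) + 0.5 * len(y) * np.log(2 * np.pi)

nll_best = neg_log_likelihood(theta_ls)
nudges = [np.array(d) for d in ([0.1, 0.0], [-0.1, 0.0], [0.0, 0.1], [0.0, -0.1])]
perturbed = [neg_log_likelihood(theta_ls + d) for d in nudges]

print(theta_ls, nll_best, perturbed)
```

<p>Because the negative log-likelihood is just the sum of squared residuals (scaled by one half) plus a constant, the least-squares solution is also the maximum likelihood estimate.</p>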
<h4 id="the-cross-entropy-loss-probabalistic-interpretation">The Cross-Entropy Loss: Probabilistic Interpretation</h4>
<p>In the context of classification, our model’s prediction \(h_\theta(x_i)\) will be given by \(\sigma(Wx_i + b)\) which produces a value between \(0\) and \(1\) that can be interpreted as a probability of example \(x_i\) belonging to the positive class. If this probability were less than \(0.5\) we’d classify it as a negative example, otherwise we’d classify it as a positive example. This means that we can write down the probability of observing a negative or positive instance:</p>
<p>\(p(y_i = 1 \vert x_i) = h_\theta(x_i)\) and \(p(y_i = 0 \vert x_i) = 1 - h_\theta(x_i)\)</p>
<p>We can combine these two cases into one expression:</p>
\[p(y_i | x_i) = [h_\theta(x_i)]^{(y_i)} [1 - h_\theta(x_i)]^{(1 - y_i)}\]
<p>Invoking our assumption that the data are independent and identically distributed, we can write down the likelihood by simply taking the product across the data:</p>
\[L(x, y) = \prod_{i = 1}^{N}[h_\theta(x_i)]^{(y_i)} [1 - h_\theta(x_i)]^{(1 - y_i)}\]
<p>Similar to above, we can take the log of the above expression and use properties of logs to simplify, and finally negate the entire expression to obtain the cross entropy loss:</p>
\[J = -\sum_{i=1}^{N} \left[ y_i\log (h_\theta(x_i)) + (1 - y_i)\log(1 - h_\theta(x_i)) \right]\]
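<p>As a quick sanity check on this derivation, the sketch below computes the binary cross-entropy for a handful of made-up labels and scores and confirms that it equals the negative log of the likelihood written as a product. The function names and the toy values are assumptions for illustration; the clipping by a small <code>eps</code> is a common numerical safeguard, not part of the derivation.</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, p, eps=1e-12):
    """Summed binary cross-entropy; p is clipped to avoid log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])                 # toy labels
scores = np.array([2.0, -1.0, 0.5, -0.3])  # toy values of W x_i + b
p = sigmoid(scores)                        # predicted P(y_i = 1 | x_i)

loss = binary_cross_entropy(y, p)
likelihood = np.prod(p**y * (1 - p) ** (1 - y))

print(loss, -np.log(likelihood))
```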
<h4 id="the-cross-entropy-loss-in-the-case-of-multi-class-classification">The Cross-Entropy Loss in the case of multi-class classification</h4>
<p>Let’s suppose that we’re now interested in applying the cross-entropy loss to multiple (> 2) classes. The idea behind the loss function doesn’t change, but now since our labels \(y_i\) are one-hot encoded, we write down the loss (slightly) differently:</p>
\[-\sum_{i=1}^{N} \sum_{j=1}^{K} y_{ij} \log(h_{\theta}(x_{i})_{j})\]
<p>This is pretty similar to the binary cross entropy loss we defined above, but since we have multiple classes we need to sum over all of them. The loss \(L_i\) for a particular training example is given by</p>
<p>\(L_{i} = - \log p(Y = y_{i} \vert X = x_{i})\).</p>
<p>In particular, in the inner sum, only one term will be non-zero, and that term will be the \(\log\) of the (normalized) probability assigned to the correct class. Intuitively, this makes sense because \(\log(x)\) is increasing on the interval \((0, 1)\), so \(-\log(x)\) is decreasing on that interval: the more probability we assign to the correct class, the lower the loss. For example, using base-10 logarithms, a probability of 0.8 for the correct label gives a loss of about 0.10, whereas a probability of 0.08 gives a loss of about 1.10 (with the natural logarithm, these would be about 0.22 and 2.53).</p>
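<p>Here is a minimal sketch of the multi-class loss with one-hot labels, assuming the predicted probabilities come from a softmax over raw scores. The scores and labels are made up, and the function names are only illustrative. It checks that the double sum reduces to the negative log-probability of the correct class for each example.</p>

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(y_onehot, probs):
    """Double sum over examples and classes of -y_ij * log(p_ij)."""
    return -np.sum(y_onehot * np.log(probs))

scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])   # toy raw scores: 2 examples, 3 classes
labels = np.array([0, 1])              # index of the correct class per example

probs = softmax(scores)
y_onehot = np.eye(3)[labels]

loss = cross_entropy(y_onehot, probs)
per_example = -np.log(probs[np.arange(len(labels)), labels])

print(loss, per_example.sum())
```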
<p>Another variant on the cross entropy loss for multi-class classification also adds the other predicted class scores to the loss:</p>
\[- \sum_{i=1}^{N} \sum_{j=1}^{K} \left[ y_{ij} \log(h_{\theta}(x_{i})_{j}) + (1-y_{ij})\log(1 - h_{\theta}(x_{i})_{j}) \right]\]
<p>The second term in the inner sum essentially inverts our labels and score assignments: it gives the other predicted classes a probability of \(1 - s_j\), and penalizes them by the \(\log\) of that amount (here, \(s_j\) denotes the \(j\)th score, which is the \(j\)th element of \(h_\theta(x_i)\)).</p>
<p>This again makes sense - penalizing the incorrect classes in this way will encourage the values \(1 - s_j\) (where each \(s_j\) is a probability assigned to an incorrect class) to be large, which will in turn encourage \(s_j\) to be low. This alternative version seems to tie in more closely to the binary cross entropy that we obtained from the maximum likelihood estimate, but the first version appears to be more commonly used both in practice and in teaching.</p>
<p>It turns out that it doesn’t really matter which variant of cross-entropy you use for multiple-class classification, as they both decrease at similar rates and are just offset, with the second variant discussed having a higher loss for a particular setting of scores. To show this, I <a href="https://github.com/rohan-varma/machine-learning-notes/blob/master/loss-experiment/loss.py">wrote some code</a> to plot these 2 loss functions against each other, for probabilities for the correct class ranging from 0.01 to 0.98, and obtained the following plot:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/machine-learning-notes/master/loss-experiment/loss.png" alt="loss" /></p>
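<p>The comparison can be sketched in a few lines. This is not the linked script, just an illustration under one simplifying assumption: there are \(K = 3\) classes and whatever probability is not assigned to the correct class is split evenly among the incorrect ones.</p>

```python
import numpy as np

K = 3                                      # assumed number of classes
p_correct = np.linspace(0.01, 0.98, 50)    # probability of the correct class
p_other = (1.0 - p_correct) / (K - 1)      # assumption: the rest is split evenly

# Variant 1: only the correct class contributes.
loss_v1 = -np.log(p_correct)
# Variant 2: each incorrect class additionally contributes -log(1 - s_j).
loss_v2 = loss_v1 - (K - 1) * np.log(1.0 - p_other)

gap = loss_v2 - loss_v1
print(gap.min(), gap.max())
```

<p>Under this assumption both losses decrease as the correct class receives more probability, and the second variant is never lower than the first. How large the gap is depends on how the remaining probability mass is spread over the incorrect classes.</p>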
<h4 id="cross-entropy-loss-an-information-theory-perspective">Cross Entropy Loss: An information theory perspective</h4>
<p>As mentioned in the <a href="https://cs231n.github.io/linear-classify/">CS 231n lectures</a>, the cross-entropy loss can be interpreted via information theory. In information theory, the Kullback-Leibler (KL) divergence measures how “different” two probability distributions are. We can think of our classification problem as having 2 different probability distributions: first, the distribution for our actual labels, where all the probability mass is concentrated on the correct label, and there is no probability mass on the rest, and second, the distribution which we are learning, where the concentrations of probability mass are given by the outputs of running our raw scores through a softmax function.</p>
<p>In an ideal world, our learned distribution would match the actual distribution, with 100% probability being assigned to the correct label. This can’t really happen since that would mean our raw scores would have to be \(\infty\) and \(-\infty\) for our correct and incorrect classes respectively, and, more practically, constraints we impose on our model (e.g. using logistic regression instead of a deep neural net) will limit our ability to correctly classify every example with high probability on the correct label.</p>
<p>Interpreting the cross-entropy loss as minimizing the KL divergence between 2 distributions is interesting if we consider how we can extend cross-entropy to different scenarios. For example, a lot of datasets are only partially labelled or have noisy (i.e. occasionally incorrect) labels. If we could probabilistically assign labels to the unlabelled portion of a dataset, or interpret the incorrect labels as being sampled from a probabilistic noise distribution, we can still apply the idea of minimizing the KL-divergence, although our ground-truth distribution will no longer concentrate all the probability mass over a single label.</p>
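<p>The link between the two quantities is the identity \(H(p, q) = H(p) + KL(p \Vert q)\): the cross entropy equals the entropy of the true distribution plus the KL divergence. The sketch below checks this numerically; the distributions and function names are made up for illustration.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

def cross_entropy(p, q):
    # H(p, q) = -sum_k p_k log q_k, skipping terms where p_k = 0
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk != 0)

def entropy(p):
    # H(p) = -sum_k p_k log p_k
    return -sum(pk * math.log(pk) for pk in p if pk != 0)

def kl_divergence(p, q):
    # KL(p || q) = sum_k p_k log(p_k / q_k)
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk != 0)

one_hot = [1.0, 0.0, 0.0]    # all mass on the correct label
noisy = [0.8, 0.1, 0.1]      # hypothetical noisy-label distribution
q = [0.7, 0.2, 0.1]          # hypothetical softmax output

for p in (one_hot, noisy):
    print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))
</code></pre></div></div>
<p>For the one-hot labels the entropy term is zero, so minimizing the cross entropy is exactly minimizing the KL divergence. For the noisy labels the entropy is a constant that does not depend on the model, so the minimizer is still the same.</p>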
<h4 id="differences-in-learning-speed-for-classification">Differences in learning speed for classification</h4>
<p>It turns out that if we’re given a typical classification problem and a model \(h_\theta(x) = \sigma(Wx_i + b)\), we can show that (at least theoretically) the cross-entropy loss leads to quicker learning through gradient descent than the MSE loss. This is primarily due to the use of the sigmoid function. First, let’s recall the gradient descent update rule:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>For iteration = 1 ... N:
    Compute dJ/dw_i for each of the i = 1 ... M parameters
    Let w_i = w_i - learning_rate * dJ/dw_i for each i
</code></pre></div></div>
<p>(Note that the gradient terms \(\frac{dJ}{dw_i}\) should all be computed before applying the updates). Essentially, the gradient descent algorithm computes partial derivatives for all the parameters in our network, and updates the parameters by decrementing the parameters by their respective partial derivatives, times a constant known as the learning rate, taking a step towards a local minimum.</p>
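<p>The update rule above can be sketched in Python roughly as follows. The quadratic objective, the function names, and the learning rate are hypothetical choices made only so that the example runs end to end.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def gradient_descent(grad_fn, w, learning_rate=0.1, num_iters=100):
    for _ in range(num_iters):
        # Compute every partial derivative before touching any parameter.
        grads = grad_fn(w)
        w = [w_i - learning_rate * g_i for w_i, g_i in zip(w, grads)]
    return w

# Hypothetical objective J(w) = sum_i (w_i - t_i)^2, minimized at w = t.
target = [3.0, -1.0]

def grad_J(w):
    return [2.0 * (w_i - t_i) for w_i, t_i in zip(w, target)]

w_final = gradient_descent(grad_J, [0.0, 0.0])
print(w_final)
</code></pre></div></div>
<p>Building the new parameter list from the old one in a single step is what guarantees that all gradients are evaluated at the same point.</p>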
<p>This means that the “speed” of learning is dictated by two things: the learning rate and the size of the partial derivative. The learning rate is a hyperparameter that we must tune, so we’ll focus on the size of the partial derivatives for now. Consider the following binary classification scenario: we have an input feature vector \(x_i\), a label \(y_i\), and a prediction \(\hat{y_i} = h_\theta(x_i)\).</p>
<p>We’ll show that given our model \(h_\theta(x) = \sigma(Wx_i + b)\), learning can occur much faster during the beginning phases of training if we used the cross-entropy loss instead of the MSE loss. And we want this to happen, since at the beginning of training, our model is performing poorly due to the weights being randomly initialized.</p>
<p>Given our prediction \(\hat{y_i} = \sigma(Wx_i + b)\) and our loss \(J = \frac{1}{2}(y_i - \hat{y_i})^2\), we first obtain the partial derivative \(\frac{dJ}{dW}\), applying the chain rule twice:</p>
\[\frac{dJ}{dW} = (\hat{y_i} - y_i)\sigma'(Wx_i + b)x_i\]
<p>This derivative has the term \(\sigma'(Wx_i + b)\) in it. This can be expressed as \(\sigma(Wx_i + b)(1 - \sigma(Wx_i + b))\), which follows from differentiating \(\sigma(z) = \frac{1}{1 + e^{-z}}\). Whenever \(Wx_i + b\) is large in magnitude, so that the sigmoid saturates, this expression will be very close to 0, even if the prediction is badly wrong. With randomly initialized weights this can happen during the early stages of training, which makes the partial derivative nearly vanish. A plot of the sigmoid curve’s derivative is shown below [6], indicating that the gradients are small whenever the outputs are close to \(0\) or \(1\):</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/rohan-blog/gh-pages/images/sigmoid_derivative.jpg" alt="sigmoid" /></p>
<p>This can lead to slower learning at the beginning stages of gradient descent, since the smaller derivatives change each weight by only a small amount, and gradient descent takes a while to escape this flat region and make larger updates towards a minimum.</p>
<p>On the other hand, given the cross entropy loss:</p>
\[J = -\sum_{i=1}^{N} \left[ y_i\log(\sigma (Wx_i + b)) + (1-y_i)\log(1 - \sigma(Wx_i + b)) \right]\]
<p>We can obtain the partial derivative \(\frac{dJ}{dW}\) as follows (with the substitution \(z = Wx_i + b\)):</p>
\[\frac{dJ}{dW} = -\sum_{i=1}^{N} \left[ \frac{y_i x_i\sigma'(z)}{\sigma(z)} - \frac{(1-y_i)x_i \sigma'(z)}{1 - \sigma(z)} \right]\]
<p>Simplifying with \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\), we obtain a nice expression for the gradient of the loss function with respect to the weights:</p>
\[\frac{dJ}{dW} = \sum_{i=1}^{N} x_i(\sigma(z) - y_i)\]
<p>This derivative does not have a \(\sigma'\) term in it, and we can see that the magnitude of the derivative is entirely dependent on the magnitude of our error \(\sigma(z) - y_i\) - how far off our prediction was from the ground truth. This is great, since that means early on in learning, the derivatives will be large, and later on in learning, the derivatives will get smaller and smaller, corresponding to smaller adjustments to the weight variables. This makes intuitive sense: if our error is small, then we’d want to avoid large adjustments that could cause us to jump out of the minimum. Michael Nielsen in his <a href="https://neuralnetworksanddeeplearning">book</a> has an in-depth discussion and illustration of this that is really helpful.</p>
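<p>To make the difference concrete, here is a small sketch that evaluates both gradients for a single sigmoid neuron with scalar weight and input. The specific numbers (a saturated neuron with \(z = 6\) and label \(0\)) and the function names are assumptions chosen for illustration.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_grad(w, b, x, y):
    # d/dw of 0.5 * (y - sigmoid(z))^2 with z = w * x + b
    s = sigmoid(w * x + b)
    return (s - y) * s * (1.0 - s) * x

def ce_grad(w, b, x, y):
    # d/dw of the binary cross-entropy: the sigmoid derivative cancels out
    s = sigmoid(w * x + b)
    return (s - y) * x

# Hypothetical saturated, badly wrong neuron: z = 6 but the label is 0.
w, b, x, y = 6.0, 0.0, 1.0, 0.0
print("MSE gradient:", mse_grad(w, b, x, y))
print("cross-entropy gradient:", ce_grad(w, b, x, y))
</code></pre></div></div>
<p>In this setting the prediction is almost maximally wrong, yet the MSE gradient is hundreds of times smaller than the cross-entropy gradient, because it is multiplied by the nearly vanishing \(\sigma'(z)\).</p>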
<h4 id="hinge-loss-vs-cross-entropy-loss">Hinge Loss vs Cross-Entropy Loss</h4>
<p>There’s actually another commonly used type of loss function in classification related tasks: the hinge loss. The (L2-regularized) hinge loss leads to the canonical support vector machine model with the max-margin property: the margin is the smallest distance between any data point and the line (or more generally, hyperplane) that separates our points into classes and defines our classification:</p>
<p><img src="https://docs.opencv.org/2.4.13.4/_images/optimal-hyperplane.png" alt="svm" /></p>
<p>The hinge loss penalizes predictions not only when they are incorrect, but even when they are correct but not confident. It penalizes gravely wrong predictions significantly, correct but not confident predictions a little less, and only confident, correct predictions are not penalized at all. Let’s formalize this by writing out the hinge loss in the case of binary classification:</p>
\[\sum_{i} \max(0, 1 - y_{i} \cdot h_{\theta}(x_{i}))\]
<p>Our labels \(y_{i}\) are either -1 or 1, so the loss is only zero when the signs match and \(\vert h_{\theta}(x_{i})\vert \geq 1\). For example, if our score for a particular training example was \(0.2\) but the label was \(-1\), we’d incur a penalty of \(1.2\); if our score was \(-0.7\) (meaning that this instance was predicted to have label \(-1\)), we’d still incur a penalty of \(0.3\); but if we predicted \(-1.1\), then we would incur no penalty. A visualization of the hinge loss (in green) compared to other cost functions is given below:</p>
<p><img src="https://i.stack.imgur.com/4DFDU.png" alt="hinge loss" /></p>
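<p>The three worked cases above can be reproduced with a one-line implementation of the binary hinge loss. The function name is just a label chosen for this sketch.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def hinge_loss(score, label):
    # label is -1 or +1; score is the raw model output h_theta(x)
    return max(0.0, 1.0 - label * score)

# The three scores discussed above, all with true label -1.
for score in (0.2, -0.7, -1.1):
    print(score, hinge_loss(score, -1))
</code></pre></div></div>
<p>Up to floating point rounding, this gives penalties of 1.2, 0.3 and 0 respectively.</p>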
<p>The main difference between the hinge loss and the cross entropy loss is that the former arises from trying to maximize the margin between our decision boundary and data points - thus attempting to ensure that each point is correctly and confidently classified, while the latter comes from a maximum likelihood estimate of our model’s parameters. The softmax function, whose scores are used by the cross entropy loss, allows us to interpret our model’s scores as relative probabilities against each other. For example, the cross-entropy loss would incur a much higher loss if our (un-normalized) scores were \([10, 8, 8]\) than if they were \([10, -10, -10]\), where the first class is correct. The (multi-class) hinge loss, in contrast, would recognize that the correct class score already exceeds the other scores by more than the margin, so it incurs zero loss on both sets of scores. Once the margins are satisfied, the SVM will no longer optimize the weights in an attempt to “do better” than it is already.</p>
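<p>Here is a rough numerical check of that example. The multi-class hinge loss used below is the formulation from the CS 231n notes (sum over incorrect classes of \(\max(0, s_j - s_{correct} + 1)\)), which is an assumption about which variant is meant; the function names are illustrative.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

def softmax_cross_entropy(scores, correct):
    # -log softmax(scores)[correct], shifted by the max score for stability
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[correct]

def multiclass_hinge(scores, correct, margin=1.0):
    # Sum of margin violations over the incorrect classes
    return sum(max(0.0, s - scores[correct] + margin)
               for j, s in enumerate(scores) if j != correct)

for scores in ([10.0, 8.0, 8.0], [10.0, -10.0, -10.0]):
    print(scores, softmax_cross_entropy(scores, 0), multiclass_hinge(scores, 0))
</code></pre></div></div>
<p>The hinge loss is zero for both score vectors, while the cross-entropy loss is roughly 0.24 for the first and nearly zero for the second, so only the cross entropy keeps pushing the incorrect scores down.</p>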
<h4 id="wrap-up">Wrap-Up</h4>
<p>In this post, we’ve shown that the MSE loss comes from a probabilistic interpretation of the regression problem, and the cross-entropy loss comes from a probabilistic interpretation of binary classification. The MSE loss is therefore better suited to regression problems, and the cross-entropy loss provides us with faster learning when our predictions differ significantly from our labels, as is generally the case during the first several iterations of model training. We’ve also compared and contrasted the cross-entropy loss and hinge loss, and discussed how using one over the other leads to our models learning in different ways. Thanks for reading, and hope you enjoyed the post!</p>
<h4 id="sources">Sources</h4>
<ol>
<li>
<p><a href="https://neuralnetworksanddeeplearning.com/chap3.html">Michael Nielsen’s Neural Networks and Deep Learning, Chapter 3</a></p>
</li>
<li>
<p><a href="https://cs231n.github.io/linear-classify/">Stanford CS 231n notes on cross entropy and hinge loss</a></p>
</li>
<li>
<p><a href="https://docs.opencv.org/2.4.13.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html">OpenCV introduction to SVMs</a></p>
</li>
<li>
<p><a href="https://math.stackexchange.com/questions/782586/how-do-you-minimize-hinge-loss">StackExchange answer on hinge loss minimization</a></p>
</li>
<li>
<p><a href="https://www.cs.princeton.edu/courses/archive/fall16/cos402/lectures/402-lec5.pdf">Machine Learning, Princeton University</a></p>
</li>
<li>
<p><a href="https://ronny.rest/blog/post_2017_08_10_sigmoid/">Ronny Restrepo, sigmoid functions</a></p>
</li>
</ol>
<h4 id="notes">Notes</h4>
<p>[4/16/19] - Fixed broken links and clarified the particular model for which the learning speed of MSE loss is slower than cross-entropy</p>Paper Analysis - Sequence to Sequence Learning2018-01-02T00:00:00+00:002018-01-02T00:00:00+00:00https://rohanvarma.me/Seq-2-Seq<p><img src="https://raw.githubusercontent.com/rohan-varma/rohan-blog/gh-pages/images/seq2seq.png" alt="seq2seq" /></p>
<p><a href="https://arxiv.org/pdf/1409.3215.pdf">Link to paper</a>
<a href="https://github.com/rohan-varma/paper-analysis/blob/master/seq2seq-paper/Pytorch%20Seq%202%20Seq%20Model.ipynb">Link to example implementation</a></p>
<p><strong>Abstract</strong></p>
<ul>
<li>
<p>Traditional DNNs have achieved good performance whenever large labelled training datasets are available, but cannot map sequences to sequences</p>
</li>
<li>
<p>Main approach of the paper is to use a multilayer LSTM to map input sequence to a fixed-length vector, and then another deep LSTM to decode the fixed length vector into a sequence</p>
</li>
<li>
<p>The LSTM model also learned useful phrase & sentence representations that are sensitive to word order and invariant to passive/active voice</p>
</li>
<li>
<p>indicates that the actual structure of the language was mostly captured, the representation of “He ate the cookie” and “the cookie was eaten by him” aren’t too different</p>
</li>
<li>
<p>Reversing word order in source sentence helped because it introduced more short-term dependencies</p>
</li>
</ul>
<p><strong>Introduction</strong></p>
<ul>
<li>DNNs are powerful, example: 2 layer neural network of quadratic size can learn to sort n n-bit numbers</li>
<li>DNNs are only useful for problems whose inputs and labels can be expressed as fixed-length vectors in some way</li>
<li>This is limiting, DNNs can’t do many tasks whose inputs are best represented as sequences such as machine translation, POS tagging, speech recognition</li>
<li>Main idea: use one LSTM to read the input sequence (of variable length) one time step at a time, and map this to a fixed-length vector. Then a second LSTM takes this fixed-length vector as an input and produces an output sequence</li>
<li>The second LSTM is basically an RNN language model that is conditioned on the encoded representation</li>
<li>The researchers trained an ensemble of 5 deep LSTMs (each with 384 million parameters) and used a beam search decoder to achieve close to state-of-the-art performance on the WMT English to French translation task</li>
<li>Reversing the words in the source sentence helped train the LSTMs a lot, because this introduced many more short term dependencies to make it easier to train the LSTM with SGD</li>
</ul>
<p><strong>Model</strong></p>
<ul>
<li>
<p>An RNN takes a sequence \(x_1 … x_T\) as input and computes a sequence of outputs \(y_1 … y_T\)</p>
</li>
<li>
<p>At each timestep, RNN computes a hidden state \(h_t\) and an output \(y_t\). We can think of the hidden state as encapsulating the information encountered at previous timesteps</p>
</li>
<li>
\[h_t = \tanh(W_{hx}x_t + W_{hh}h_{t-1})\]
</li>
<li>
\[y_t = W_{hy}h_t\]
</li>
<li>
<p>This is for a “single-layer” RNN that does not have layers of hidden states. If there are multiple layers of hidden states, then instead of \(x_t\) as the input into a later hidden layer, the input is the \(h_t\) at the previous layer, same timestep</p>
</li>
<li>
<p>For general/variable-length sequence to sequence learning, the general idea is to map the input sequence to a fixed-length vector using one RNN and then map the fixed-length vector to the target sequence with another RNN</p>
</li>
<li>
<p>However, in practice RNNs aren’t very good with learning longer-term dependencies. LSTMs have been shown to do much better at learning longer-term dependencies because they don’t fall victim to the vanishing gradient problem like traditional RNN cells do</p>
</li>
<li>
<p>Goal of the LSTM is to estimate the following conditional probability:</p>
<ul>
<li>\(p(y_{1}, … y_{T'} \vert x_{1} … x_{T})\) , where the lengths \(T\) and \(T'\) of the 2 sequences may differ from each other</li>
<li>The LSTM does this by computing a fixed-dimensional representation \(v\) after observing the input sequence \(x_1 … x_T\)</li>
<li>Then, conditioned on this \(v\), we can produce the output sequence via the following formulation:
<ul>
<li>
\[p(y_1 | x_1 … x_T) = p(y_1 | v)\]
</li>
<li>
\[P(y_1, y_2 | x_1 … x_T) = p(y_2 | v, y_1) p(y_1 | v)\]
</li>
<li>i.e., at each time step the \(t\)th output is conditioned on the fixed length vector \(v\) and the previous outputs, if any</li>
<li>In general, \(P(y_1, y_2 … y_{T'} \vert x_1, … x_T) = \prod_{t = 1}^{T'} p(y_t \vert v, y_1 … y_{t-1})\)</li>
<li>Each of these distributions is represented with a softmax over the vocabulary</li>
</ul>
</li>
<li>2 different LSTMs are used (one for the input sequence and one for the output sequence), since this increases the number of model parameters at negligible computational cost and makes it natural to train the LSTM on multiple language pairs</li>
<li>A deep LSTM was used with four layers; this was found to significantly outperform shallow LSTMs</li>
<li>Reversed order of words in input sequences was found to help a lot</li>
<li>The dataset had 12m sentences with 348m French words and 304m English words</li>
</ul>
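<p>The single-layer RNN recurrence from the start of this section can be sketched in NumPy as follows. The dimensions, the zero initial hidden state, and the random weights are assumptions made for illustration; this is a plain RNN, not the LSTM used in the paper.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def rnn_forward(xs, W_hx, W_hh, W_hy):
    h = np.zeros(W_hh.shape[0])        # initial hidden state, assumed zero
    ys = []
    for x_t in xs:
        h = np.tanh(W_hx @ x_t + W_hh @ h)   # h_t from x_t and h_{t-1}
        ys.append(W_hy @ h)                  # y_t = W_hy h_t
    return np.array(ys), h

# Hypothetical sizes: 5 timesteps, 3 input dims, 4 hidden units, 2 outputs.
rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))
W_hx = rng.normal(size=(4, 3))
W_hh = rng.normal(size=(4, 4))
W_hy = rng.normal(size=(2, 4))

ys, h_final = rnn_forward(xs, W_hx, W_hh, W_hy)
print(ys.shape, h_final.shape)
</code></pre></div></div>
<p>The final hidden state plays the role of the fixed-length vector \(v\): in an encoder-decoder setup it would be handed to a second network that generates the output sequence.</p>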
<p><strong>Training</strong></p>
<ul>
<li>Model was trained to maximize probability of producing a correct translation given a source sentence:</li>
<li>
\[\frac{1}{N} \sum_{i=1}^{N} \log p(T_i | S_i)\]
</li>
<li>Translations are produced by finding the most likely translation given by the LSTM: \(T^* = \arg\max_T p(T \vert S)\)</li>
<li>Beam search used to find the most likely translations. At each time, we maintain a list of partial hypotheses and then extend each partial hypothesis with every word in the vocabulary. Then we discard all but \(B\) of the most likely hypotheses (where the likelihood is given by the model’s log probability).</li>
<li>A hypothesis is finished once the end-of-sentence tag &lt;EOS&gt; is emitted</li>
<li>As discussed, reversing the words in the source sentences helps the translation task a lot</li>
<li>An intuitive explanation for this is that, although reversing the source sentence leaves the average distance between corresponding words unchanged, the first few source words end up very close to the first few target words, so the minimal time lag is greatly reduced</li>
<li>Therefore, backpropagation has an easier time communicating between the source sentence and the target sentence.</li>
<li>Ex: “I like to eat the apples” and “Me gusta comer las manzanitas” vs “Apples the eat to like I” and “Me gusta comer las manzanitas”, the second pair has more words closer to the corresponding word in the translated sentence</li>
<li>A deep LSTM with 4 layers and 1000 cells at each layer was used, with 1000 dimensional word embeddings</li>
<li>Initialization was uniform random between -0.08 and 0.08</li>
<li>SGD without momentum with lr = 0.7; after 5 epochs, the learning rate was halved every half epoch, for a total of 7.5 epochs of training</li>
<li>batch size = 128</li>
<li>To avoid exploding gradients, the researchers enforced a hard cap on the norm of the gradients: the gradients were scaled down whenever the norm exceeded a threshold</li>
<li>Each layer was placed on a different GPU and communicated its activations to the next layer as soon as they were computed. Spent about 10 days training</li>
</ul>
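<p>Two of the training tricks above, reversing the source sentence and capping the gradient norm, are simple enough to sketch directly. The function names are hypothetical and the threshold of 5.0 is only an illustrative value.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def reverse_source(tokens):
    # Feed the source sentence to the encoder in reverse order.
    return tokens[::-1]

def clip_by_norm(grad, threshold):
    # Rescale the gradient so that its L2 norm never exceeds the threshold.
    norm = np.linalg.norm(grad)
    scale = min(1.0, threshold / max(norm, 1e-12))
    return grad * scale

print(reverse_source(["I", "like", "to", "eat", "the", "apples"]))

g = np.array([6.0, 8.0])                    # gradient with norm 10
clipped = clip_by_norm(g, threshold=5.0)    # illustrative threshold
print(clipped, np.linalg.norm(clipped))
</code></pre></div></div>
<p>Clipping by norm keeps the direction of the gradient and only shrinks its length, so gradients that are already below the threshold pass through unchanged.</p>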
<p><strong>Results</strong></p>
<ul>
<li>
<p>The model achieved close to state-of-the-art BLEU scores on the WMT English to French translation task</p>
</li>
<li>
<p>The fixed length vectors learned were pretty meaningful, in that they were sensitive to order of the words (i.e. John admires Mary was further away from Mary admires John, but Mary admires John and Mary respects John were relatively close together)</p>
</li>
<li>
<p>Other approaches included using convolutional networks to map sentences to fixed length vectors, using attention mechanisms to overcome issues with long sentence translation, or taking phrase-based approaches to achieve smoother translations</p>
</li>
</ul>
</li>
</ul>Interpreting Regularization as a Bayesian Prior2017-08-24T00:00:00+00:002017-08-24T00:00:00+00:00https://rohanvarma.me/Regularization<p><img src="https://raw.githubusercontent.com/rohan-varma/rohan-blog/gh-pages/images/reg.png" alt="img" /></p>
<h3 id="introductionbackground">Introduction/Background</h3>
<p>In machine learning, we often start off by writing down a probabilistic model that defines our data. We then go on to write down a likelihood or some type of loss function, which we then optimize over to get the optimal settings for the parameters that we seek to estimate. Along the way, techniques such as regularization, hyperparameter tuning, and cross-validation can be used to ensure that we don’t overfit on our training dataset and our model generalizes well to unseen data.</p>
<p>Specifically, we have a few key functions and variables: the underlying probability distribution \(p(x, y)\) which generates our training examples (pairs of features and labels), a training set \((x, y)_{i = 1}^{D}\) of \(D\) examples which we observe, and a model \(h(x) : x \rightarrow{} y\) which we wish to learn in order to produce a mapping from \(x\) to \(y\). This function \(h\) is selected from a larger function space \(H\).</p>
<p>For example, if we are in the context of linear regression models, then all functions in the function space of \(H\) will take on the form \(y_i = x_{i}^T \beta\) where a particular setting of our parameters \(\beta\) will result in a particular \(h(x)\). We also have some function \(L(h(x), y)\) that takes in our predictions and labels, and quantifies how accurate our model is across some data.</p>
<p>Ideally, we’d like to minimize the risk function</p>
\[R[h(x)] = \sum_{(x, y)} L( h(x), y) p(x, y)\]
<p>across all possible \((x, y)\) pairs. However, this is impossible since we don’t know the underlying probability distribution that describes our dataset, so instead we seek to approximate the risk function by minimizing a loss function across the data that we have observed:</p>
\[\frac{1}{D} \sum_{i = 1}^{D} L(h(x_i), y_i)\]
<h3 id="linear-models">Linear Models</h3>
<p>If we assume that our data are roughly linear, then we can write a relationship between our features and real-valued outputs: \(y_i = x_i^T \beta + \epsilon\) where \(\epsilon \sim N(0, \sigma^2)\). This essentially means that our data has a linear relationship that is corrupted by random Gaussian noise that has zero mean and constant variance.</p>
<p>This has the implication that \(y_i\) is a Gaussian random variable, and we can compute its expectation and variance:</p>
\[E[y_i] = E[x_i^T \beta + \epsilon] = x_i^T \beta\]
\[Var[y_i] = Var[x_i^T \beta + \epsilon] = \sigma^2\]
<p>We can now write down the probability of observing a value \(y_i\) given a certain set of features \(x\):</p>
\[p(y_i | x_i) = N(y_i | x_i^T \beta, \sigma^2)\]
<p>Next, we can write down the probability of observing the entire dataset of \((x, y)\) pairs. This is known as the likelihood, and it’s simply the product of the probabilities of observing each of the individual (feature, label) pairs:</p>
\[L(x,y) = \prod_{i = 1}^{n} N(y_i | x_i \beta, \sigma^2)\]
<p>As a note, writing down the likelihood this way does assume that our training data are independent and identically distributed, meaning that we are assuming that each of the training samples have the same probability distribution, and are mutually independent.</p>
<p>If we want to find the \(\hat{\beta}\) that maximizes the chance of us observing the training examples that we observed, then it makes sense to maximize the above likelihood. This is known as <strong>maximum likelihood estimation</strong>, and is a common approach to many machine learning problems such as linear and logistic regression.</p>
<p>In other words, we want to find</p>
\[\hat{\beta} = argmax_{\beta} \prod_{i = 1}^{n} N(y_i | x_i \beta, \sigma^2)\]
<p>To simplify this a little bit, we can write out the normal distribution, and also take the log of the function, since the \(\hat{\beta}\) that maximizes \(L\) will also maximize \(log(L)\). We end up with</p>
\[\hat{\beta} = argmax_{\beta} \log \prod_{i = 1}^{n} \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(y_i - x_i \beta)^2}{2 \sigma^2}}\]
<p>Distributing the log and dropping constants (since they don’t affect the value of our parameter which maximizes the expression), we obtain</p>
\[\hat{\beta} = argmax_{\beta} \sum_{i = 1}^{N} -(y_i - x_i \beta)^2\]
<p>Since maximizing a function is the same as minimizing its negative, we can turn the above into a minimization problem:</p>
\[\hat{\beta} = argmin_{\beta} \sum_{i = 1}^{N} (y_i - x_i \beta)^2\]
<p>This is the familiar least squares estimator, which says that the optimal parameter is the one that minimizes the \(L2\) squared norm between the predictions and actual values. We can use gradient descent with some initial setting of \(\beta\) and be guaranteed to get to a global minimum (since the function is convex) or we can explicitly solve for \(\beta\) and obtain the same answer.</p>
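<p>As a quick sanity check of that claim, the sketch below fits the same synthetic dataset both ways. The data, the true parameters, the step size, and the iteration count are all made up for illustration.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
beta_true = np.array([2.0, -1.0, 0.5])        # hypothetical true parameters
y = X @ beta_true + rng.normal(scale=0.1, size=200)

# Closed form: solve the normal equations X^T X beta = X^T y.
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the mean squared error, starting from zero.
beta_gd = np.zeros(3)
for _ in range(2000):
    grad = 2.0 * X.T @ (X @ beta_gd - y) / len(y)
    beta_gd -= 0.1 * grad

print(beta_closed)
print(beta_gd)
</code></pre></div></div>
<p>Because the objective is convex, both approaches land on the same estimate, which is also close to the parameters that generated the data.</p>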
<p>Right now is a good time to think about the assumptions of this linear regression model. Like many models, it assumes that the data are drawn independently from the same data generating distribution. Furthermore, it assumes that this distribution is normal with a linear mean and constant variance. It also has a more implicit assumption: that the parameter \(\beta\) which we wish to estimate is not a random variable itself, and we will show how relaxing this assumption leads to a regularized linear model.</p>
<h3 id="regularization">Regularization</h3>
<p>Regularization is a popular approach to reducing a model’s predisposition to overfit on the training data and thus hopefully increasing the generalization ability of the model. Previously, we sought to learn the optimal \(h(x)\) from the space of functions \(H\). However, if the whole function space can be explored, and our samples were observed with some amount of noise, then the model will likely select a function that overfits on the observed data. One way we can combat this is by limiting our search to a subspace within \(H\), and this is exactly what regularization does.</p>
<p>To regularize a model, we take our loss function and add a regularizer to it. Regularizers take the form \(\lambda R(\beta)\) where \(R(\beta)\) is some function of our parameters, and \(\lambda\) is a hyperparameter describing our regularization constant. Using this rule, we can write out a regularized version of our loss function above, giving us a model known as ridge regression:</p>
\[\hat{\beta} = argmin_{\beta} \sum_{i = 1}^{N} (y_i - x_i \beta)^2 + \lambda \sum_{j} \beta_j^2\]
<p>What’s interesting about regularization is that it can be more deeply understood if we reconsider our original probabilistic model. In our original model, we conditioned our outputs on a linear function of the parameter which we wish to learn \(\beta\). Instead of considering \(\beta\) as a fixed quantity that we want a point estimate of, we can consider \(\beta\) itself as a random variable drawn from a certain probability distribution. This is known as the <strong>prior</strong> probability distribution, because we assign \(\beta\) some probability without having observed the associated \((x, y)\) pairs. Imposing a prior would be especially useful if we had some information about the parameter before observing any of the training data (possibly from domain knowledge), but it turns out that imposing a Gaussian prior even in the absence of actual prior knowledge leads to interesting properties (see the <a href="http://www.deeplearningbook.org/">Deep Learning Book chapter on regularization</a> for more details about this). In particular, we can model \(\beta\) as drawn from a Gaussian with 0 mean and constant variance [1]:</p>
\[\beta \sim N(0, \lambda^{-1})\]
<p>As a consequence, we must adjust our probability of observing a particular \((x, y)\) pair to accommodate the probability of observing the parameter that generated this pair. We obtain a new expression for our likelihood:</p>
\[L(x,y) = N(\beta | 0, \lambda^{-1}) \prod_{i = 1}^{n} N(y_i | x_i \beta, \sigma^2)\]
<p>Similar to the previously discussed method of maximum likelihood estimation, we can estimate the parameter \(\beta\) to be the \(\hat{\beta}\) that maximizes the above function:</p>
\[\hat{\beta} = argmax_{\beta} \left[ \sum_{i = 1}^{n} \log N(y_i | x_i \beta, \sigma^2) \right] + \log N(\beta | 0, \lambda^{-1})\]
<p>This is the maximum a posteriori estimate of \(\beta\), and it only differs from the maximum likelihood estimate in that the former takes into account previous information, or a prior distribution, on the parameter \(\beta\). In fact, the maximum likelihood estimate of the parameter can be seen as a special case of the maximum a posteriori estimate, where we take the prior probability distribution on the parameter to just be a constant.</p>
<p>Since (dropping unneeded constants) \(N(\beta | 0, \lambda^{-1}) \propto \exp(\frac{- \beta^{2}}{2 \lambda^{-1}})\), after taking the log and minimizing the negative of the above function, we obtain the familiar regularizer \(\frac{1}{2} \lambda \beta^2\), while our squared loss function \(\sum_{i = 1}^{N} (y_i - x_i \beta)^2\) is the same as the loss function we obtained without regularization (the noise variance \(\sigma^2\) only rescales the effective regularization constant). In this way, \(L2\) regularization on a linear model can be thought of as imposing a Bayesian prior on the underlying parameters which we wish to estimate.</p>
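<p>The equivalence can be checked numerically. In the sketch below, the negative log posterior is \(\frac{1}{2\sigma^2}\Vert y - X\beta \Vert^2 + \frac{\lambda}{2}\Vert \beta \Vert^2\), and its gradient vanishes at the ridge solution with regularization constant \(\lambda \sigma^2\). The data and the values of \(\sigma\) and \(\lambda\) are hypothetical.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
sigma, lam = 0.5, 4.0        # hypothetical noise std and prior precision
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=sigma, size=50)

def neg_log_posterior_grad(beta):
    # Gradient of ||y - X beta||^2 / (2 sigma^2) + lam * ||beta||^2 / 2
    return -X.T @ (y - X @ beta) / sigma**2 + lam * beta

# Ridge solution with regularization constant lam * sigma^2.
beta_map = np.linalg.solve(X.T @ X + lam * sigma**2 * np.eye(2), X.T @ y)

print(neg_log_posterior_grad(beta_map))   # numerically zero
</code></pre></div></div>
<p>A zero gradient of a convex objective means the ridge solution is the maximum a posteriori estimate under this Gaussian prior.</p>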
<h3 id="aside-interpreting-regularization-in-the-context-of-bias-and-variance">Aside: interpreting regularization in the context of bias and variance</h3>
<p>The error of a statistical model can be decomposed into three distinct sources of error: error due to bias, error due to variance, and irreducible error. They are related as follows:</p>
\[Err(x) = bias(x)^2 + var(x) + \epsilon\]
<p>Given a constant error, this means that there will always be a tradeoff between bias and variance. Having too much bias or too much variance isn’t good for a model, but for different reasons. A high bias, low variance model will likely end up being inaccurate across both the training and testing datasets, and its predictions will likely not deviate too much based on the data sample it is trained on. On the other hand, a low-bias, high-variance model will likely give good results on a training dataset, but fail to generalize as well on a testing dataset.</p>
<p>The Gauss-Markov theorem states that in a linear regression problem, the least squares estimator has the lowest variance among all linear unbiased estimators. However, if we consider biased estimators such as the estimator given by ridge regression, we can arrive at a lower-variance, higher-bias solution. In particular, the expectation of the ridge estimator (derived <a href="http://math.bu.edu/people/cgineste/classes/ma575/p/w14_1.pdf">here</a>) is given by:</p>
\[E[\hat{\beta}_{ridge}] = \beta - \lambda (X^TX + \lambda I)^{-1} \beta\]
<p>The bias of an estimator is defined as the difference between the estimator’s expected value and the true parameter \(\beta\): \(bias(\hat{\beta}) = E[\hat{\beta}] - \beta\)</p>
<p>As you can see, the bias is \(-\lambda (X^TX + \lambda I)^{-1} \beta\), which vanishes when \(\lambda = 0\): this gives us the unbiased least squares estimator, since then \(E[\hat{\beta}] = \beta\). Therefore, assuming a constant total error for the least squares estimator and the ridge estimator, the variance for the ridge estimator is lower. A more complete discussion, including formal calculations for the bias and variance of the ridge estimator compared to the least squares estimator, is given <a href="https://math.bu.edu/people/cgineste/classes/ma575/p/w14_1.pdf">here</a>.</p>
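<p>The expectation formula can be verified with a few lines of NumPy. Since \(E[y] = X\beta\), the expected ridge estimate is \((X^TX + \lambda I)^{-1}X^TX\beta\), and the sketch checks that this matches the expression above. The design matrix, true parameter, and \(\lambda\) are hypothetical.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
beta = np.array([1.0, 2.0, -1.0])    # hypothetical true parameter
lam = 5.0

A = np.linalg.inv(X.T @ X + lam * np.eye(3))
expected_ridge = A @ X.T @ X @ beta  # E[beta_hat], using E[y] = X beta
bias = expected_ridge - beta

print(bias)
print(-lam * A @ beta)               # same vector, from the formula above
</code></pre></div></div>
<p>Setting <code class="language-plaintext highlighter-rouge">lam = 0.0</code> in this sketch makes the bias vanish, which recovers the unbiased least squares case.</p>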
<h3 id="a-linear-algebra-perspective">A linear algebra perspective</h3>
<p>To see why regularization makes sense from a linear algebra perspective, we can write down our least squares estimate in vectorized form:</p>
\[argmin_{\beta} { (y - X\beta)^T (y - X \beta) }\]
<p>Next, we can expand this and simplify a little bit:</p>
\[argmin_{\beta} (y^T - \beta^TX^T)(y - X\beta)\]
\[= argmin_{\beta} -2y^TX\beta + \beta^TX^TX\beta\]
<p>where we have dropped the terms that do not depend on \(\beta\) since they will zero out when we differentiate.</p>
<p>To minimize, we differentiate with respect to \(\beta\):</p>
\[\frac{\partial L}{\partial \beta} = -2 X^Ty + 2X^TX\beta\]
<p>Setting the derivative equal to zero gives us the closed form solution of \(\beta\) which is the least-squares estimate [2]:</p>
\[\hat{\beta} = (X^TX)^{-1} X^Ty\]
<p>As we can see, in order to actually compute this quantity the matrix \(X^T X\) must be invertible. The matrix \(X^T X\) being invertible corresponds exactly to showing that the matrix is positive definite, which means that the scalar quantity \(z^T X^T X z > 0\) for any real, non-zero vectors \(z\). However, the best we can do is show that \(X^T X\) is positive semidefinite.</p>
<p>To show that \(X^TX\) is positive semidefinite, we must show that the quantity \(z^T X^T X z \geq 0\) for any real, non-zero vectors \(z\).</p>
<p>If we expand out the quantity \(X^T X\), we obtain \(\sum_{i = 1}^{N} x_i x_i^T\) and it follows that the quantity \(z^T (\sum_{i = 1}^{N} x_i x_i^T) z = \sum_{i = 1}^{N} (x_i^Tz)^2 \geq 0\). This means that in situations where this quantity is exactly \(0\) for some non-zero \(z\) (for example, when two features are perfectly collinear), the matrix \(X^T X\) cannot be inverted and a closed-form least squares solution cannot be computed.</p>
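<p>A tiny example makes the failure case concrete. The design matrix below is hypothetical: its second column is exactly twice the first, so there is a non-zero \(z\) with \(Xz = 0\).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# Hypothetical design matrix whose second column is twice the first.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
gram = X.T @ X

z = np.array([2.0, -1.0])                # non-zero vector with X z = 0
quad_form = z @ gram @ z                 # equals the sum of (x_i^T z)^2

eigenvalues = np.linalg.eigvalsh(gram)   # non-negative for a PSD matrix
rank = np.linalg.matrix_rank(gram)
print(quad_form, eigenvalues, rank)
</code></pre></div></div>
<p>The quadratic form is zero, one eigenvalue is (numerically) zero, and the rank is 1 rather than 2, so \(X^TX\) is positive semidefinite but not invertible.</p>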
<p>On the other hand, expanding out our ridge estimate which has an extra regularization term \(\lambda \sum_{i} \beta_i^2\), we obtain the derivative</p>
\[\frac{\partial L}{\partial \beta} = -2 X^Ty + 2X^TX\beta + 2 \lambda \beta\]
<p>Setting this quantity equal to zero, and rewriting \(\lambda \beta\) as \(\lambda I \beta\) (using the property of multiplication with the identity matrix), we now obtain</p>
\[(X^TX + \lambda I) \beta = X^Ty\]
<p>giving us the ridge estimate</p>
\[\hat{\beta}_{ridge} = (X^TX + \lambda I)^{-1} X^Ty\]
<p>The only difference in this closed-form solution is the addition of the \(\lambda I\) term to the quantity that gets inverted, so we are now sure that this quantity is positive definite if \(\lambda > 0\), since \(z^T (X^TX + \lambda I) z = \sum_{i = 1}^{N} (x_i^Tz)^2 + \lambda z^Tz > 0\) for any non-zero \(z\). In other words, even when the matrix \(X^T X\) is not invertible, we can still compute a ridge estimate from our data [3].</p>
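<p>To make this concrete, here is a minimal sketch using NumPy. The toy data and the helper name <code class="language-plaintext highlighter-rouge">ridge_estimate</code> are made up for illustration: the second feature duplicates the first, so \(X^T X\) is singular, yet the ridge system can still be solved.</p>

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Closed-form ridge estimate: solve (X^T X + lam * I) beta = X^T y."""
    n_features = X.shape[1]
    gram = X.T @ X + lam * np.eye(n_features)
    # Solve the linear system rather than forming the inverse explicitly.
    return np.linalg.solve(gram, X.T @ y)

# Toy data (an assumption for this example): the second column duplicates
# the first, so X^T X has rank 1 and cannot be inverted.
X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

rank = np.linalg.matrix_rank(X.T @ X)
beta_ridge = ridge_estimate(X, y, lam=0.1)
print("rank of X^T X:", rank)
print("ridge estimate:", beta_ridge)
```

<p>Because the two features are identical, the ridge solution splits the weight evenly between them, which is one way to see how the penalty picks a unique answer where least squares has infinitely many.</p>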
<h3 id="regularizers-in-neural-networks">Regularizers in neural networks</h3>
<p>While techniques such as L2 regularization can be used while training a neural network, techniques such as dropout, which randomly discards some proportion of the activations at a per-layer level during training, have been shown to be much more successful. There is also a different type of regularizer that takes into account the idea that a neural network should have sparse activations for any particular input. There are several theoretical reasons for why sparsity is important, a topic covered very well by Glorot et al. in a <a href="https://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf">2011 paper</a>.</p>
<p>Since sparsity is important in neural networks, we can introduce a constraint that can guarantee us some degree of sparsity. Specifically, we can constrain the average activation of a particular neuron in a particular hidden layer.</p>
<p>In particular, the average activation of a neuron in a particular layer, taken over the \(m\) training inputs \(x_1, ..., x_m\), can be given by averaging the neuron’s activation \(a(x_i)\) on each input: \(\hat{\rho} = \frac{1}{m} \sum_{i = 1}^{m} a(x_i)\). Next, we can choose a hyperparameter \(\rho\) for this particular neuron, which represents the average activation we want it to have - for example, if we wanted this neuron to activate sparsely, we might set \(\rho = 0.05\). In order to ensure that our model learns neurons which sparsely activate, we must incorporate some function of \(\hat{\rho}\) and \(\rho\) into our cost function.</p>
<p>One way to do this is with the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL divergence</a>, which measures how much one probability distribution diverges from another. Here we treat our current average activation \(\hat\rho\) and our target average activation \(\rho\) as the means of two Bernoulli distributions. If we minimize the KL divergence for each of our neurons’ activations then our model will learn sparse activations. The cost function may be:</p>
\[J_{sparse} (W, b) = J(W, b) + \lambda \sum_{i = 1}^{M} KL(\rho_i || \hat{\rho_i})\]
<p>where \(J(W, b)\) is a regular cost function used in neural networks, such as the cross-entropy loss. The hyperparameter \(\lambda\) indicates how important sparsity is to us - as \(\lambda \rightarrow{} \infty\), we disregard the actual loss function and only aim to learn a sparse representation, and as \(\lambda \rightarrow{} 0\) we disregard the importance of sparse activations and only minimize the original loss function. Additional details on this type of regularization with application to sparse autoencoders are given <a href="https://ufldl.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity">here</a>.</p>
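<p>A minimal sketch of this penalty is below. It assumes the Bernoulli form of the KL divergence, \(KL(\rho || \hat{\rho}) = \rho \log \frac{\rho}{\hat{\rho}} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}}\), and a single shared target \(\rho\) for every neuron; the function names and the numbers are illustrative only.</p>

```python
import math

def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))

def sparse_cost(base_cost, rho, rho_hats, lam):
    """Base cost plus lam times the summed KL penalty over all hidden neurons."""
    penalty = sum(kl_bernoulli(rho, rho_hat) for rho_hat in rho_hats)
    return base_cost + lam * penalty

# Hypothetical average activations of three hidden neurons.
rho_hats = [0.05, 0.20, 0.50]
cost = sparse_cost(base_cost=1.0, rho=0.05, rho_hats=rho_hats, lam=0.1)
print(cost)
```

<p>The penalty is zero for a neuron whose average activation already equals \(\rho\) and grows as the average activation moves away from it, so only the second and third neurons above contribute to the extra cost.</p>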
<h3 id="recap">Recap</h3>
<p>As we have seen, regularization can be interpreted in several different ways, each of which gives us additional insight into what exactly regularization accomplishes. A few of the different interpretations are:</p>
<p>1) As a Bayesian prior on the parameters which we are trying to learn.</p>
<p>2) As a term added to the loss function of our model which penalizes some function of our parameters, thereby introducing a tradeoff between minimizing the original loss function and ensuring our weights do not deviate too much from what we want them to be.</p>
<p>3) As a constraint on the model which we are trying to learn. This means we can take the original optimization problem and frame it in a constrained fashion, thereby ensuring that the magnitude of our weights never exceeds a certain threshold (in the case of \(L2\) regularization).</p>
<p>4) As a method of reducing the function search space \(H\) to a new function search space \(H'\) that is smaller than \(H\). Without regularization, we may search for our optimal function \(h\) in a much larger space, and constraining this to a smaller subspace can lead us to select models with better generalization ability.</p>
<p>Overall, regularization is a useful technique that is often employed to reduce the overall variance of a model, thereby improving its generalization capability. Of course, there are tradeoffs in using regularization, most notably having to tune the hyperparameter \(\lambda\), which can be costly in terms of computational time. Thanks for reading!</p>
<h3 id="sources">Sources</h3>
<ol>
<li>
<p><a href="https://math.bu.edu/people/cgineste/classes/ma575/p/w14_1.pdf">Boston University Linear Models Course by Cedric Ginestet</a></p>
</li>
<li>
<p><a href="https://ufldl.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity">Autoencoders and Sparsity, Stanford UFLDL</a></p>
</li>
<li>
<p><a href="https://math.stackexchange.com/questions/1582348/simple-example-of-maximum-a-posteriori/1582407">Explanation of MAP Estimation</a></p>
</li>
</ol>
<h4 id="notes">Notes</h4>
<p>[1] Imposing different prior distributions on the parameter leads to different types of regularization. A normal distribution with zero mean and constant variance leads to \(L2\) regularization, while a Laplacian prior would lead to \(L1\) regularization.</p>
<p>[2] Technically, we’ve only shown that the \(\hat{\beta}\) we’ve found is a local optimum. We actually want to verify that this is indeed a global minimum, which can be done by showing that the function we are minimizing is convex.</p>
<p>[3] For completeness, it is worth mentioning that there are other solutions if the inverse of the matrix \(X^T X\) does not exist. One common workaround is to use the <a href="https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_pseudoinverse">Moore-Penrose pseudoinverse</a>, which can be computed using the singular value decomposition of the matrix being pseudo-inverted. This is commonly used in implementations of PCA algorithms.</p>
<p>[7/15/19] - Added a note about the underlying parameters of a model coming from a prior probability distribution.</p>
<p>Language Models, Word2Vec, and Efficient Softmax Approximations (2017-07-02, https://rohanvarma.me/Word2Vec)</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/paper-analysis/master/word2vec-papers/models.png" alt="img" /></p>
<h3 id="introduction">Introduction</h3>
<p>The Word2Vec model has become a standard method for representing words as dense vectors. This is typically done as a preprocessing step, after which the learned vectors are fed into a discriminative model (typically an RNN) to predict movie review sentiment, perform machine translation, or even generate text, <a href="https://github.com/karpathy/char-rnn">character by character</a>.</p>
<h3 id="previous-language-models">Previous Language Models</h3>
<p>Previously, the bag of words model was commonly used to represent words and sentences as numerical vectors, which could then be fed into a classifier (for example Naive Bayes) to produce output predictions. Given a vocabulary of \(V\) words and a document of \(N\) words, a \(V\)-dimensional vector would be created to represent the document, where index \(i\) in the vector denotes the number of times the \(i\)th word in the vocabulary occurred in the document.</p>
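<p>As a small illustration, the sketch below builds such a count vector. The vocabulary, the example document, and the whitespace tokenization are all simplifying assumptions made for this example.</p>

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Return a count vector with one entry per vocabulary word."""
    counts = Counter(document.lower().split())
    # Words outside the vocabulary are ignored; missing words count as 0.
    return [counts[word] for word in vocabulary]

vocabulary = ["the", "quick", "brown", "fox", "dog", "cat"]
vector = bag_of_words("The quick brown fox saw the dog", vocabulary)
print(vector)  # [2, 1, 1, 1, 1, 0]
```

<p>Note that the vector records how often each word occurs but nothing about word order, which is exactly the independence assumption discussed next.</p>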
<p>This model represented words as atomic units, assuming that all words were independent of each other. It had success in several fields such as document classification, spam detection, and even sentiment analysis, but its assumptions (that words are completely independent of each other) were too strong for more powerful and accurate models. A model that aimed to reduce some of the strong assumptions of the traditional bag of words model was the n-gram model.</p>
<h3 id="n-gram-models-and-markov-chains">N-gram models and Markov Chains</h3>
<p>Language models seek to predict the probability of observing the \(t + 1\)th word \(w_{t + 1}\) given the previous \(t\) words:</p>
\[p(w_{t + 1} | w_1, w_2, ... w_t)\]
<p>Using the chain rule of probability, we can compute the probability of observing an entire sentence:</p>
\[p(w_1, w_2, ... w_t) = p(w_1)p(w_2 | w_1)...p(w_t | w_{t -1}, ... w_1)\]
<p>Computing these probabilities has many applications, for example in speech recognition, spelling correction, and automatic sentence completion. However, estimating these probabilities can be tough. We can use the maximum likelihood estimate:</p>
\[p(x_{t + 1} | x_1, ... x_t) = \frac{count(x_1, x_2, ... x_t, x_{t + 1})}{count(x_1, x_2, ... x_t)}\]
<p>However, computing this is quite unrealistic - we will generally not observe enough data from a corpus to obtain realistic counts for any sequence of \(t\) words for any nontrivial value of \(t\), so we instead invoke the Markov assumption. The Markov assumption assumes that the probability of observing a word at a given time is only dependent on the word observed in the previous time step:</p>
\[p(x_{t + 1} | x_1, x_2, ... x_t) = p(x_{t + 1} | x_t)\]
<p>Therefore, the probability of a sentence can be given by</p>
\[p(w_1, w_2, ... w_t) = p(w_1)\prod_{i = 2}^{t} p(w_i | w_{i - 1})\]
<p>The Markov assumption can be extended to condition the probability of the \(t\)th word on the previous two, three, four, and so on words. This is where the name of the n-gram model comes in - an \(n\)-gram model conditions each word on the previous \(n - 1\) words. The unigram and bigram models, respectively, are given below.</p>
\[p(x_{t + 1} | x_{1}, x_{2}, ... x_{t}) = p(x_{t + 1})\]
\[p(x_{t + 1} | x_{1}, x_{2}, ... x_{t}) = p(x_{t + 1} | x_{t})\]
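<p>The bigram case can be sketched in a few lines using the maximum likelihood estimate \(count(x_t, x_{t + 1}) / count(x_t)\). The tiny corpus and the function name <code class="language-plaintext highlighter-rouge">train_bigram</code> are made up for illustration, and no smoothing is applied.</p>

```python
from collections import Counter

def train_bigram(tokens):
    """Return a function prob(next_word, prev_word) estimated by counting."""
    # Count each word only where it has a successor, so that the
    # conditional probabilities for a given previous word sum to 1.
    context_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    def prob(next_word, prev_word):
        if context_counts[prev_word] == 0:
            return 0.0  # unseen context; a real model would smooth here
        return bigram_counts[(prev_word, next_word)] / context_counts[prev_word]

    return prob

tokens = "the cat sat on the mat the cat ran".split()
prob = train_bigram(tokens)
print(prob("cat", "the"))  # "the" is followed by "cat" in 2 of its 3 occurrences
print(prob("mat", "the"))
```

<p>Any bigram that never appears in the corpus gets probability zero here, which is the problem that the smoothing techniques mentioned below are designed to address.</p>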
<p>There is a lot more to the n-gram model such as linear interpolation and smoothing techniques, which <a href="https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf">these slides</a> explain very well. One thing to keep in mind is that as we increase \(n\), the probabilities become more challenging to compute, though a smaller \(n\) may make our model less expressive, since we aren’t able to learn longer-term conditionings.</p>
<h3 id="the-skip-gram-and-continuous-bag-of-words-models">The Skip-Gram and Continuous Bag of Words Models</h3>
<p>Word vectors, or word embeddings, or distributed representations of words, generally refer to a dense vector representation of a word, as compared to a sparse (i.e. one-hot) traditional representation. There are actually two different implementations of models that learn dense representations of words: the Skip-Gram model and the Continuous Bag of Words model. Both of these models learn dense vector representations of words, based on the words that surround them (i.e., their <em>context</em>).</p>
<p>The difference is that the skip-gram model predicts context (surrounding) words given the current word, whereas the continuous bag of words model predicts the current word based on several surrounding words.</p>
<p>This notion of “surrounding” words is best described by considering a center (or current) word and a window of words around it. For example, if we consider the sentence “The quick brown fox jumped over the lazy dog”, and a window size of 2, we’d have the following pairs for the skip-gram model:</p>
<p><img src="https://mccormickml.com/assets/word2vec/training_data.png" alt="img" /></p>
<p>Figure 1: Training Samples <a href="https://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/">(Source, from Chris McCormick’s insightful post)</a></p>
<p>In contrast, for the CBOW model, we’ll input the context words within the window (such as “the”, “brown”, “fox”) and aim to predict the target word “quick” (simply reversing the input to prediction pipeline from the skip-gram model).</p>
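<p>Both kinds of training examples can be generated with a short sketch like the one below, using the example sentence and a window size of 2. The function names are hypothetical, and lowercasing plus whitespace splitting stands in for real tokenization.</p>

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every word within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """Return (context_words, target) examples, the reverse direction."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

tokens = "the quick brown fox jumped over the lazy dog".split()
pairs = skipgram_pairs(tokens, window=2)
examples = cbow_examples(tokens, window=2)
print(pairs[:5])
print(examples[1])  # (['the', 'brown', 'fox'], 'quick')
```

<p>The second CBOW example reproduces the case described above: the context words “the”, “brown”, and “fox” are used to predict “quick”.</p>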
<p>The following is a visualization of the skip-gram and CBOW models:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/paper-analysis/master/word2vec-papers/models.png" alt="img" /></p>
<p>Figure 2: CBOW vs Skip-gram models. <a href="https://arxiv.org/pdf/1301.3781.pdf">(Source)</a></p>
<p>In this <a href="https://arxiv.org/pdf/1301.3781.pdf">paper</a>, the overall recommendation was to use the skip-gram model, since it had been shown to perform better on analogy-related tasks than the CBOW model. Overall, if you understand one model, it is pretty easy to understand the other: just reverse the inputs and predictions. Since both papers focused on the skip-gram model, this post will do the same.</p>
<h3 id="learning-with-the-skip-gram-model">Learning with the Skip-Gram Model</h3>
<p>Our goal is to find word representations that are useful for predicting the surrounding words given a current word.
In particular, we wish to maximize the average log probability across our entire corpus:</p>
\[argmax_{\theta} \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} log p(w_{t + j} | w_{t} ; \theta)\]
<p>This equation essentially says that there is some probability \(p\) of observing a particular word that’s within a window of size \(c\) of the current word \(w_t\). This probability is conditioned on the current word (\(w_t\)) and some setting of parameters \(\theta\) (determined by our model). We wish to set these parameters \(\theta\) so that this probability is maximized across our entire corpus.</p>
<h3 id="basic-parametrization-softmax-model">Basic Parametrization: Softmax Model</h3>
<p>The basic skip-gram model defines the probability \(p\) through the softmax function. If we consider \(w_i\) to be a one-hot encoded vector with dimension \(N\) and \(\theta\) to be a \(K \times N\) embedding matrix (here, we have \(N\) words in our vocabulary and our learned embeddings have dimension \(K\)), so that \(\theta w_i\) is the embedding of word \(i\), then we can define</p>
\[p(w_{i} | w_{t} ; \theta) = \frac{exp((\theta w_i)^T (\theta w_t))}{\sum_{j = 1}^{N} exp((\theta w_j)^T (\theta w_t))}\]
<p>Here a single embedding matrix is used to keep the notation simple; the original paper keeps separate input and output vectors for each word. It is worth noting that after learning, the matrix \(\theta\) can be thought of as an embedding lookup matrix. If you have a word that is represented with the \(k\)th index of a vector being hot, then the learned embedding for that word will be the \(k\)th column. This parametrization has a major disadvantage that limits its usefulness in cases of very large corpora. Specifically, we notice that in order to compute a single forward pass of our model, we must sum across the entire corpus vocabulary in order to evaluate the softmax function. This is prohibitively expensive on large datasets, so we look to alternate approximations of this model for the sake of computational efficiency.</p>
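<p>The cost of the normalization is easy to see in code. The sketch below is only an illustration under stated assumptions: it scores a candidate word by the dot product between its embedding and the current word’s embedding, uses one shared \(K \times N\) matrix stored as a list of rows, and fills that matrix with made-up numbers.</p>

```python
import math

def embedding(theta, word_index):
    """Multiplying theta by a one-hot vector just selects one column."""
    return [row[word_index] for row in theta]

def softmax_probabilities(theta, current_index):
    """Probability of every vocabulary word given the current word."""
    n_words = len(theta[0])
    v_current = embedding(theta, current_index)
    scores = []
    for j in range(n_words):  # one score per vocabulary word: O(N) work
        v_j = embedding(theta, j)
        scores.append(sum(a * b for a, b in zip(v_j, v_current)))
    exps = [math.exp(s) for s in scores]
    total = sum(exps)  # the normalizer touches the entire vocabulary
    return [e / total for e in exps]

# Hypothetical embedding matrix with K = 2 dimensions and N = 4 words.
theta = [[0.1, 0.4, -0.2, 0.3],
         [0.5, -0.1, 0.2, 0.0]]
probs = softmax_probabilities(theta, current_index=0)
print(probs)
```

<p>With four words the loop is trivial, but every single prediction repeats it over the whole vocabulary, which is what becomes expensive when \(N\) is in the hundreds of thousands.</p>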
<h3 id="hierarchical-softmax">Hierarchical Softmax</h3>
<p>As discussed, the traditional softmax approach can become prohibitively expensive on large corpora, and the hierarchical softmax is a common alternative approach that approximates the softmax computation, but has logarithmic time complexity in the number of words in the vocabulary, as opposed to linear time complexity.</p>
<p>This is done by representing the softmax layer as a binary tree where the words are leaf nodes of the tree, and the probabilities are computed by a walk from the root of the binary tree to the particular leaf. An example of the binary tree of the hierarchical layer is given below:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/paper-analysis/master/word2vec-papers/hierarchical.png" alt="img" /></p>
<p>Figure 3: Hierarchical Softmax Tree. <a href="https://www.youtube.com/watch?v=B95LTf2rVWM">(Source)</a></p>
<p>At each node in the tree starting from the root, we would like to predict the probability of branching right given the observed context. Therefore, in the above tree, if we would like to compute the probability of observing the word “cat” given a certain context, we would define it as the product of the probabilities of going left at node 1, then going right at node 2, and then again going right at node 5 (conditioned on the context).</p>
<p>The actual computation to determine the probability of a word is done by taking the output of the previous layer, applying a set of node-specific weights and biases to it, and running that result through a non-linearity (often sigmoidal). The following image is an illustration of the process of computing the probability of the word “cat” given an observed context:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/paper-analysis/master/word2vec-papers/hierarchical2.png" alt="img" /></p>
<p>Figure 4: Hierarchical Softmax Computation. <a href="https://www.youtube.com/watch?v=B95LTf2rVWM">(Source)</a></p>
<p>Here, \(V\) is our matrix of weights connecting the outputs of our previous layer (denoted by \(h(x)\)) to our hierarchical layer, and the probability of branching right at a certain node \(n\) is given by \(\sigma(h(x)V_n + b_n)\), where \(V_n\) and \(b_n\) are the weights and bias of that node. The probability of observing a particular word, then, is just the product of the branch probabilities along the path that leads to it.</p>
<p>In the above image, we also notice that in a vocabulary of 8 words, we only needed 3 computations to compute the probability of a word as opposed to 8. More generally, hierarchical softmax greatly reduces our computation time to \(O(log_2(n))\) where \(n\) is our vocabulary size, compared to linear time for the traditional softmax approach. However, this speedup is only useful for training when we don’t need to know the full probability distribution. In settings where we wish to emit the most likely word given a context (for example, in sentence generation), we’d still need to compute the probability of all of the words given the context, resulting in no speed up (although some methods such as pruning when the probability of a certain word quickly tends to zero can certainly increase efficiency).</p>
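<p>The walk from the root to a leaf can be sketched as follows. This assumes a complete binary tree whose nodes are numbered like a heap (root 1, children of node \(n\) at \(2n\) and \(2n + 1\), leaves 8 to 15), which agrees with the left at node 1, right at node 2, right at node 5 example above but is otherwise an assumption; the weights, biases, and hidden vector are invented numbers.</p>

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def path_to_leaf(leaf):
    """Return [(node, went_right), ...] from the root down to the leaf."""
    path = []
    node = leaf
    while node != 1:
        parent = node // 2
        path.append((parent, node % 2 == 1))  # odd children are right children
        node = parent
    return list(reversed(path))

def word_probability(leaf, h, weights, biases):
    """Multiply the branch probabilities along the path to the leaf."""
    prob = 1.0
    for node, went_right in path_to_leaf(leaf):
        score = sum(hi * wi for hi, wi in zip(h, weights[node])) + biases[node]
        p_right = sigmoid(score)
        prob *= p_right if went_right else 1.0 - p_right
    return prob

# Hypothetical hidden vector and per-node parameters for the 7 inner nodes.
h = [1.0, 2.0]
weights = {n: [0.1 * n, -0.2 * n] for n in range(1, 8)}
biases = {n: 0.01 * n for n in range(1, 8)}

cat_leaf = 11  # reached by: left at node 1, right at node 2, right at node 5
print(path_to_leaf(cat_leaf))
print(word_probability(cat_leaf, h, weights, biases))
total = sum(word_probability(leaf, h, weights, biases) for leaf in range(8, 16))
print(total)  # the leaf probabilities form a valid distribution
```

<p>Each word needs only three sigmoid evaluations here, and because every inner node splits its probability mass between its two children, the eight leaf probabilities still sum to one without any explicit normalization.</p>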
<h3 id="negative-sampling-and-noise-contrastive-estimation">Negative Sampling and Noise Contrastive Estimation</h3>
<p>Multinomial softmax regression is expensive when we are computing softmax across many different classes (each word essentially denotes a separate class). The core idea of Noise Contrastive Estimation (NCE) is to convert a multiclass classification problem into one of binary classification via logistic regression, while still retaining the quality of word vectors learned. With NCE, word vectors are no longer learned by attempting to predict the context words from the target word. Instead we learn word vectors by learning how to distinguish true pairs of (target, context) words from corrupted (target, random word from vocabulary) pairs. The idea is that if a model can distinguish between actual pairs of target and context words from random noise, then good word vectors will be learned.</p>
<p>Specifically, for each positive sample (ie, true target/context pair) we present the model with \(k\) negative samples drawn from a noise distribution. For small to average size training datasets, a value for \(k\) between 5 and 20 was recommended, while for very large datasets a smaller value of \(k\) between 2 and 5 suffices. Our model only has a single output node, which predicts whether the pair was just random noise or actually a valid target/context pair. The noise distribution itself is a free parameter, but the paper found that the unigram distribution raised to the power \(3/4\) worked better than other distributions, such as the unigram and uniform distributions.</p>
<p>The main difference between NCE and negative sampling is the choice of distribution - the paper used a distribution (discussed above) that samples less frequently occurring words more often than the plain unigram distribution would. Moreover, NCE approximately maximizes the log probability of the softmax across the entire corpus (so it is a good approximation of softmax regression), but this does not hold for negative sampling (although negative sampling still learns quality word vectors).</p>
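<p>A rough sketch of negative sampling is given below. It builds the noise distribution from unigram counts raised to the \(3/4\) power, draws \(k\) noise words, and evaluates the per-pair objective \(\log \sigma(v_c \cdot v_t) + \sum \log \sigma(-v_n \cdot v_t)\) from the paper. The word counts, vectors, and function names are invented for illustration, and for brevity nothing prevents a sampled noise word from coinciding with the true context word.</p>

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to a power, then normalized to sum to 1."""
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}

def sample_negatives(noise, k, rng):
    """Draw k noise words according to the noise distribution."""
    words = list(noise)
    return rng.choices(words, weights=[noise[w] for w in words], k=k)

def negative_sampling_objective(v_target, v_context, v_negatives):
    """Objective for one true pair and its noise words (to be maximized)."""
    objective = math.log(sigmoid(dot(v_context, v_target)))
    for v_negative in v_negatives:
        objective += math.log(sigmoid(-dot(v_negative, v_target)))
    return objective

counts = {"the": 1000, "fox": 10}  # hypothetical corpus counts
noise = noise_distribution(counts)
negatives = sample_negatives(noise, k=5, rng=random.Random(0))
print(noise)

objective = negative_sampling_objective(
    v_target=[0.5, -0.2], v_context=[0.4, 0.1], v_negatives=[[-0.3, 0.2], [0.1, 0.1]])
print(objective)
```

<p>In this toy example “fox” makes up about 1% of the raw counts but roughly 3% of the noise distribution, which shows how the \(3/4\) power shifts probability mass toward rarer words.</p>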
<h3 id="practical-considerations">Practical Considerations</h3>
<p><strong>Implementing Softmax</strong>: If you’re implementing your own softmax function, it’s important to consider overflow issues. Specifically, the computation \(\sum_j e^{z_j}\) can easily overflow, leading to <code class="language-plaintext highlighter-rouge">NaN</code> values while training. To resolve this issue, we can instead compute the equivalent \(\frac{e^{z_i + k}}{\sum_j e^{z_j + k}}\) and set \(k = - \max_j z_j\) so that the largest exponent is zero, avoiding overflow issues.</p>
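<p>A minimal pure-Python sketch of this trick is shown below; the function name and the example inputs are chosen for illustration.</p>

```python
import math

def stable_softmax(z):
    """Softmax computed after subtracting the maximum entry of z."""
    shift = max(z)
    # Every exponent is now at most 0, so math.exp cannot overflow.
    exps = [math.exp(value - shift) for value in z]
    total = sum(exps)
    return [e / total for e in exps]

# math.exp(1000.0) on its own raises OverflowError in pure Python
# (NumPy would instead return inf, and inf / inf gives NaN).
large = stable_softmax([1000.0, 1001.0, 1002.0])
small = stable_softmax([0.0, 1.0, 2.0])
print(large)
print(small)
```

<p>Both calls give the same probabilities, since adding a constant to every input leaves the softmax unchanged.</p>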
<p><strong>Subsampling of frequent words</strong>: We don’t get much information from very frequent words such as “the”, “it”, and the like. There will be many more pairs of (the, French) as opposed to (France, French) but we’re more interested in the latter pair. Therefore, it would be useful to subsample some of the more frequent words. We would also like to do this proportionally: very common words are sampled out with high probability, and uncommon words are not sampled out.</p>
<p>In order to do this, the paper defines the probability of discarding a particular word as \(p(w_i) = 1 - \sqrt{\frac{t}{freq(w_i)}}\) where \(t\) is a chosen threshold, taken in the paper to be \(10^{-5}\). This discarding function will cause words that appear with a frequency much greater than \(t\) to be sampled out with a high probability, while words that appear with a frequency of less than or equal to \(t\) will not be sampled out. For example, if \(t = 10^{-5}\) and a particular word covers \(0.1\%\) of the corpus, then each instance of that word will be discarded from the training corpus with probability \(1 - \sqrt{10^{-5} / 10^{-3}} = 0.9\).</p>
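<p>The discard rule is short enough to sketch directly. The version below uses the square-root form from the word2vec paper, which reproduces the 0.9 figure in the example; clamping negative values to zero for rare words and the function names are choices made for this illustration.</p>

```python
import math
import random

def discard_probability(frequency, t=1e-5):
    """Probability of dropping one occurrence of a word with this corpus frequency."""
    # Words at or below the threshold would get a value <= 0, so clamp to 0.
    return max(0.0, 1.0 - math.sqrt(t / frequency))

def subsample(tokens, frequencies, rng, t=1e-5):
    """Keep each token with probability 1 - discard_probability."""
    return [tok for tok in tokens
            if rng.random() >= discard_probability(frequencies[tok], t)]

print(discard_probability(1e-3))  # a word covering 0.1% of the corpus
print(discard_probability(1e-5))  # a word exactly at the threshold is never dropped

# Hypothetical frequencies: "the" is very common, "aardvark" is rare.
frequencies = {"the": 0.05, "aardvark": 1e-6}
kept = subsample(["the", "aardvark"] * 50, frequencies, random.Random(0))
print(kept.count("the"), kept.count("aardvark"))
```

<p>Rare words always survive, while most occurrences of very frequent words are removed before training pairs are generated.</p>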
<h3 id="conclusion">Conclusion</h3>
<p>We have discussed language models including the bag of words model, the n-gram model, and the word2vec model along with changes to the softmax layer in order to more efficiently compute word embeddings. The paper presented empirical results that indicated that negative sampling outperforms hierarchical softmax and (slightly) outperforms NCE on analogical reasoning tasks. Overall, word2vec is one of the most commonly used models for learning dense word embeddings to represent words, and these vectors have several interesting properties (such as additive compositionality). Once these word vectors are learned, they can be a more powerful representation than the typical one-hot encodings when used as inputs into RNNs/LSTMs for applications such as machine translation or sentiment analysis. Thanks for reading! A discussion on Hacker News can be found <a href="https://news.ycombinator.com/item?id=15578788">here</a>.</p>
<h3 id="sources">Sources</h3>
<ul>
<li><a href="https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf">Distributed Representations of Words and Phrases</a> - the main paper discussed.</li>
<li><a href="https://www.youtube.com/watch?v=B95LTf2rVWM">Hierarchical Output Layer Video by Hugo Larochelle</a> - an excellent video going into great detail about hierarchical softmax.</li>
<li><a href="https://arxiv.org/pdf/1402.3722v1.pdf">Word2Vec explained</a> - a meta-paper explaining the word2vec paper</li>
<li><a href="https://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/">Chris McCormick’s Word2Vec Tutorial</a></li>
<li><a href="https://www.quora.com/Word2vec-How-can-hierarchical-soft-max-training-method-of-CBOW-guarantee-its-self-consistence">Stephan Gouws’s Quora answer on Hierarchical Softmax</a> - an insightful answer about the hierarchical output layer</li>
<li><a href="https://sebastianruder.com/word-embeddings-1/">Word Embeddings Post by Sebastian Ruder</a> - an informative post covering word embeddings and language modelling.</li>
<li><a href="https://arxiv.org/pdf/1301.3781.pdf">Efficient estimation of word representations</a> - another key word2vec paper discussing the differences (both from an architecture perspective and in empirical results) between the continuous bag of words and skip-gram models.</li>
</ul>