<p><strong>Training very deep networks with Batchnorm</strong> (2018-02-19, http://rohan-varma.github.io/Batch-Norm)</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/batchnorm_grad_first_layer.png" alt="grad" /></p>
<p>Training very deep neural networks is hard. It turns out one significant issue with deep neural networks is that the activations of each layer tend to converge to 0 in the later layers, and therefore the gradients vanish as they backpropagate throughout the network.</p>
<p>A lot of this has to do with the sheer size of the network: as you repeatedly multiply numbers with magnitude less than one, the product converges to zero. That is partly why network architectures such as InceptionV3 insert auxiliary classifiers after earlier layers in the network, so that a stronger gradient signal is backpropagated during the first few epochs of training.</p>
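<p>To get a feel for how quickly repeated multiplication shrinks a signal, here is a minimal sketch. The per-layer factors are made-up numbers drawn uniformly between 0.3 and 0.9 (an assumption for illustration only, not values measured from a real network), and <strong>running_product</strong> is just a name chosen for this example.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers = 20

# hypothetical per-layer scaling factors, each with magnitude below one
factors = rng.uniform(0.3, 0.9, size=n_layers)

# running_product[k] is the product of the first k + 1 factors
running_product = np.cumprod(factors)

for depth in (1, 5, 10, 20):
    print(f"after {depth:2d} layers: {running_product[depth - 1]:.2e}")
```

<p>Because every factor is below one, the running product can only shrink with depth, which is the same effect a gradient experiences as it is passed back through many layers.</p>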
<p>However, there’s also a more subtle issue that leads to this problem of vanishing activations and gradients. It has to do with the initialization of the weights in each layer of our network, and the subsequent distributions of the activations in our network. Understanding this issue is key to understanding why batch normalization is now a staple in training deep networks.</p>
<p>First, we can write some code to generate some random data, and forward it through a dummy deep neural network:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="n">n_examples</span><span class="p">,</span> <span class="n">hidden_layer_dim</span> <span class="o">=</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">100</span>
<span class="n">input_dim</span> <span class="o">=</span> <span class="mi">1000</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">n_examples</span><span class="p">,</span> <span class="n">input_dim</span><span class="p">)</span> <span class="c"># 100 examples of 1000 points</span>
<span class="n">n_layers</span> <span class="o">=</span> <span class="mi">20</span>
<span class="n">layer_dim</span> <span class="o">=</span> <span class="p">[</span><span class="n">hidden_layer_dim</span><span class="p">]</span> <span class="o">*</span> <span class="n">n_layers</span> <span class="c"># each one has 100 neurons</span>
<span class="n">hs</span> <span class="o">=</span> <span class="p">[</span><span class="n">X</span><span class="p">]</span> <span class="c"># stores the hidden layer activations </span>
<span class="n">zs</span> <span class="o">=</span> <span class="p">[</span><span class="n">X</span><span class="p">]</span> <span class="c"># stores the affine transforms in each layer, used for backprop</span>
<span class="n">ws</span> <span class="o">=</span> <span class="p">[]</span> <span class="c"># stores the weights</span>
<span class="c"># the forward pass</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">n_layers</span><span class="p">):</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">hs</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="c"># get the input into this hidden layer</span>
    <span class="c">#W = np.random.randn(h.shape[0], layer_dim[i]) * np.sqrt(2)/(np.sqrt(200) * np.sqrt(3))</span>
    <span class="c">#W = np.random.uniform(-np.sqrt(6)/(200), np.sqrt(6)/200, size = (h.shape[0], layer_dim[i]))</span>
    <span class="n">W</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">2</span><span class="o">/</span><span class="p">(</span><span class="n">h</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">layer_dim</span><span class="p">[</span><span class="n">i</span><span class="p">])),</span> <span class="n">size</span> <span class="o">=</span> <span class="p">(</span><span class="n">layer_dim</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">h</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
    <span class="c">#W = np.random.normal(0, np.sqrt(2/(h.shape[0] + layer_dim[i])), size = (layer_dim[i], h.shape[0])) * 0.01</span>
    <span class="n">z</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">W</span><span class="p">,</span> <span class="n">h</span><span class="p">)</span>
    <span class="n">h_out</span> <span class="o">=</span> <span class="n">z</span> <span class="o">*</span> <span class="p">(</span><span class="n">z</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span>
    <span class="n">ws</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">W</span><span class="p">)</span>
    <span class="n">zs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
    <span class="n">hs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">h_out</span><span class="p">)</span>
</code></pre></div></div>
<p>Now that we have a list of each layer’s hidden activations stored in <strong>hs</strong>, we can go ahead and plot the activations to see what their distribution looks like. Here, I’ve included plots of the activations at the first and final hidden layers in our 20 layer network:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/activation_0.png" alt="act0" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/activation_19.png" alt="act19" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/activation_20.png" alt="act20" /></p>
<p>What’s important to notice is that in later layers, <em>nearly all of the activations are zero</em> (just look at the scale of the axes). The distributions of these activations clearly differ from one another: the first activation takes on a clear Gaussian shape around 0, while successive hidden layers have most of their activations at 0, with rapidly decreasing variance. This is closely related to what the <a href="https://arxiv.org/pdf/1502.03167.pdf">batch normalization paper</a> calls <em>internal covariate shift</em>. The paper uses the term for the change in the distribution of a layer’s inputs as the parameters of earlier layers change during training; here we use it more loosely to mean that the distributions of activations differ from layer to layer.</p>
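<p>One way to put numbers on what the histograms show is to print the standard deviation of the activations at a few depths. The sketch below is a stand-alone re-implementation rather than the code that produced the plots. It assumes rows are examples and columns are features (so each layer computes <strong>h @ W</strong>), uses a zero-mean Gaussian initialization with variance 2/(fan_in + fan_out), and the name <strong>act_stds</strong> is introduced only for this example.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, dim, n_layers = 1000, 100, 20

# assumed layout: one example per row, one feature per column
h = rng.standard_normal((n_examples, dim))

act_stds = []  # standard deviation of the activations after each layer
for _ in range(n_layers):
    # zero-mean Gaussian init with variance 2 / (fan_in + fan_out)
    W = rng.normal(0.0, np.sqrt(2.0 / (dim + dim)), size=(dim, dim))
    z = h @ W
    h = np.maximum(z, 0.0)  # relu
    act_stds.append(h.std())

for layer in (1, 5, 10, 20):
    print(f"layer {layer:2d}: activation std = {act_stds[layer - 1]:.2e}")
```

<p>Under these assumptions the relu discards roughly half of the signal at every layer, so the spread of the activations should drop by orders of magnitude between the first and the last layer.</p>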
<p><strong>Why does this matter, and why is this bad?</strong></p>
<p>The problem is mostly the small and decreasing variance in the distributions of our activations across layers. Having some zero activations is fine, unless nearly all of your activations are zero. To understand why this is bad, we need to look at the backwards pass of our network, which is responsible for computing each gradient <script type="math/tex">\frac{dL}{dW_i}</script> across each hidden layer in our network. Given the following formulation of an arbitrary layer in our network: <script type="math/tex">h_i = \mathrm{relu}(W_i h_{i-1} + b_i)</script>, where <script type="math/tex">h_i</script> denotes the activations of the <em>i</em>th layer in our network, we can construct the local gradient <script type="math/tex">\frac{dL}{dW_i}</script>. Given an upstream gradient into this layer <script type="math/tex">\frac{dL}{dh_i}</script>, we can compute the local gradient with the chain rule:</p>
<script type="math/tex; mode=display">\frac{dL}{dW_i} = \frac{dh_i}{dW_i} * \frac{dL}{dh_i}</script>
<p>Applying the derivatives, we obtain:</p>
<script type="math/tex; mode=display">\frac{dL}{dW_i} = [\mathbb{1}(W_ih_{i-1} + b_i > 0) \odot \frac{dL}{dh_i}]h_{i-1}^T</script>
<p>Concretely, we can take our loss function for a single point to be given by the squared error, i.e. <script type="math/tex">L_i = \frac{1}{2}(y-t)^2</script>, and if we were at the last layer of our network (i.e. <script type="math/tex">h_i = y</script>), our upstream gradient would be <script type="math/tex">\frac{dL}{dh_i} = h_i - t</script>. This would give us a gradient of</p>
<script type="math/tex; mode=display">\frac{dL}{dW_i} = [\mathbb{1}(W_ih_{i-1} + b_i > 0) \odot (h_i - t)]h_{i-1}^T</script>
<p>in the final layer of our network.</p>
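<p>If you want to convince yourself that this expression is right, a finite-difference check is a quick way to do it. The sketch below uses a single made-up example with small, arbitrary dimensions; the names <strong>loss</strong>, <strong>analytic</strong> and <strong>numeric</strong> are introduced only for this check, and the loss is the squared error from above summed over the output units.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3  # arbitrary small sizes for the check

h_prev = rng.standard_normal((d_in, 1))   # input into the final layer
W = rng.standard_normal((d_out, d_in))    # final-layer weights
b = rng.standard_normal((d_out, 1))       # final-layer bias
t = rng.standard_normal((d_out, 1))       # made-up target


def loss(weights):
    """Squared error 0.5 * sum((h - t)^2) with h = relu(weights @ h_prev + b)."""
    h = np.maximum(weights @ h_prev + b, 0.0)
    return 0.5 * np.sum((h - t) ** 2)


# analytic gradient: [1(z > 0) * (h - t)] h_prev^T
z = W @ h_prev + b
h = np.maximum(z, 0.0)
analytic = ((z > 0) * (h - t)) @ h_prev.T

# numerical gradient by central differences, one weight at a time
eps = 1e-6
numeric = np.zeros_like(W)
for r in range(d_out):
    for c in range(d_in):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[r, c] += eps
        W_minus[r, c] -= eps
        numeric[r, c] = (loss(W_plus) - loss(W_minus)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print("largest difference between analytic and numeric gradient:", max_abs_diff)
```

<p>Rows of the gradient that belong to inactive relu units (where <strong>z</strong> is not positive) come out as exactly zero in both versions, which is the gating behaviour the indicator function describes.</p>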
<p><strong>What does this tell us about our gradients for our weights?</strong></p>
<p>The expression for the gradient of our weights is intuitive: for every element in the incoming gradient matrix, pass the gradient through if this layer’s linear transformation would activate the relu neuron at that element, and scale the gradient by our input into this layer. Otherwise, zero out the gradient.</p>
<p>This means that if the incoming gradient at a certain element wasn’t already zero, it will be scaled by the input into this layer. The input in this layer is just the activations from the previous layer in our network. And as we discussed above, nearly all of those activations were zero.</p>
<p>Therefore, nearly all of the gradients backpropagated through our network will be zero, and few weight updates, if any, will occur. In the final few layers of our network, this isn’t as much of a problem. We have a strong gradient signal (i.e. <script type="math/tex">h_i - t</script> in the example above) coming from the gradient of our loss function with respect to the outputs of our network (since it is early in learning, and our predictions are inaccurate). However, after we backpropagate this signal even a few layers, chances that the gradient is zeroed out become extremely high.</p>
<p>In order to see if this is actually true, we can write out the backwards pass of our 20 layer network, and plot the gradients as we did with our activations. The following code computes the gradients using the expression given above, for all layers in our network:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dLdh</span> <span class="o">=</span> <span class="mi">100</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">hidden_layer_dim</span><span class="p">,</span> <span class="n">input_dim</span><span class="p">)</span> <span class="c"># random incoming grad into our last layer</span>
<span class="n">h_grads</span> <span class="o">=</span> <span class="p">[</span><span class="n">dLdh</span><span class="p">]</span> <span class="c"># store the incoming grads into each layer</span>
<span class="n">w_grads</span> <span class="o">=</span> <span class="p">[]</span> <span class="c"># store dL/dw for each layer</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">flip</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_layers</span><span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">):</span>
    <span class="c"># get the incoming gradient</span>
    <span class="n">incoming_loss_grad</span> <span class="o">=</span> <span class="n">h_grads</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
    <span class="c"># backprop through the relu</span>
    <span class="n">dLdz</span> <span class="o">=</span> <span class="n">incoming_loss_grad</span> <span class="o">*</span> <span class="p">(</span><span class="n">zs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span> <span class="c"># zs was the result of Wx + b</span>
    <span class="c"># get the gradient dL/dh_{i-1}, this will be the incoming grad into the next layer</span>
    <span class="n">h_grad</span> <span class="o">=</span> <span class="n">ws</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">T</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">dLdz</span><span class="p">)</span> <span class="c"># ws[i-1] are our weights at this layer</span>
    <span class="c"># get the gradient of the weights of this layer (dL/dw)</span>
    <span class="n">weight_grad</span> <span class="o">=</span> <span class="n">dLdz</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">hs</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">T</span><span class="p">)</span> <span class="c"># hs[i-1] was our input into this layer</span>
    <span class="n">h_grads</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">h_grad</span><span class="p">)</span>
    <span class="n">w_grads</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">weight_grad</span><span class="p">)</span>
</code></pre></div></div>
<p>Now, we can plot our gradients for our earlier layers and see if our hypothesis was true:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/grad_layer2.png" alt="grad1" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/grad_layer_3.png" alt="grad3" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/grad_layer_4.png" alt="grad4" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/grads_layer_20.png" alt="grad20" /></p>
<p>As we can see, vanishing gradients aren’t an issue for the final layer, but they are for earlier layers. In fact, after a few layers nearly all of the gradients are zero. This will result in extremely slow learning, if any learning happens at all.</p>
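<p>As a rough numeric companion to the plots, we can also track how the backpropagated signal <script type="math/tex">\frac{dL}{dh_i}</script> shrinks on its way back through the network. This is a stand-alone sketch, not the code behind the figures: it assumes rows are examples and columns are features, reuses the Gaussian initialization with variance 2/(fan_in + fan_out), starts from a made-up random upstream gradient, and <strong>grad_scale</strong> is a name chosen for this example.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, dim, n_layers = 1000, 100, 20

# forward pass, keeping what the backward pass needs
h = rng.standard_normal((n_examples, dim))
ws, zs = [], []
for _ in range(n_layers):
    W = rng.normal(0.0, np.sqrt(2.0 / (dim + dim)), size=(dim, dim))
    z = h @ W
    h = np.maximum(z, 0.0)  # relu
    ws.append(W)
    zs.append(z)

# backward pass for dL/dh only, starting from a made-up upstream gradient
grad_h = rng.standard_normal((n_examples, dim))
grad_scale = [np.abs(grad_h).mean()]  # entry k: mean |dL/dh| after k layers
for W, z in zip(reversed(ws), reversed(zs)):
    grad_z = grad_h * (z > 0)  # backprop through the relu
    grad_h = grad_z @ W.T      # backprop through the linear map
    grad_scale.append(np.abs(grad_h).mean())

for n_back in (0, 5, 10, 20):
    print(f"{n_back:2d} layers back: mean |dL/dh| = {grad_scale[n_back]:.2e}")
```

<p>Each relu masks out about half of the incoming gradient, so under these assumptions the signal that reaches the first layers should be a small fraction of what entered at the top.</p>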
<p><strong>Ok, but what does batch normalization have to do with any of this?</strong></p>
<p>Batch normalization targets the root cause of our zero activations and vanishing gradients by reducing internal covariate shift. We want to ensure that the variances of our activations do not differ too much from layer to layer. Batch normalization does this by normalizing each activation across a batch:</p>
<script type="math/tex; mode=display">\hat{x}_k = \frac{x_k - \mu_B}{\sqrt{\sigma^2_B + \epsilon}}</script>
<p>Here, <script type="math/tex">x_k</script> denotes a certain activation, <script type="math/tex">\hat{x}_k</script> its normalized value, and <script type="math/tex">\mu_B</script>, <script type="math/tex">\sigma^2_B</script> the mean and variance of that activation across the minibatch. A small constant <script type="math/tex">\epsilon</script> is added to ensure that we don’t divide by zero.</p>
<p>This constrains all hidden layer activations to have zero mean and unit variance, so the variances in our hidden layer activations should not differ too much from each other, and therefore we shouldn’t have nearly all our activations be zero.</p>
<p>It’s important to note here that batch normalization doesn’t <em>force</em> the network activations to rigidly follow this distribution at all times, because the above result is scaled and shifted by some parameters before being passed as input into the next layer:</p>
<script type="math/tex; mode=display">y_k = \gamma \hat{x}_k + \beta</script>
<p>This allows the network to “undo” the normalization if it wants to. For example, if <script type="math/tex">y_k</script> is the input to a sigmoid neuron, we may not want to normalize at all, because doing so may constrain the expressivity of the sigmoid neuron.</p>
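<p>The normalize-then-scale-and-shift step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a full layer: rows are examples and columns are features, only the training-time batch statistics are used (no running averages), and <strong>batchnorm_forward</strong> is a hypothetical helper name. The last call illustrates the “undo” case: choosing <script type="math/tex">\gamma = \sqrt{\sigma^2_B + \epsilon}</script> and <script type="math/tex">\beta = \mu_B</script> gives back the original activations.</p>

```python
import numpy as np


def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each column of x over the batch (rows), then scale and shift."""
    mu = x.mean(axis=0)    # per-feature batch mean
    var = x.var(axis=0)    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, x_hat, mu, var


rng = np.random.default_rng(0)
# made-up activations with a non-zero mean and non-unit variance
x = 3.0 * rng.standard_normal((256, 8)) + 5.0

# gamma = 1, beta = 0: plain normalization
y, x_hat, mu, var = batchnorm_forward(x, gamma=1.0, beta=0.0)
print("per-feature mean after batchnorm:", np.round(y.mean(axis=0), 6))
print("per-feature std after batchnorm: ", np.round(y.std(axis=0), 6))

# gamma = sqrt(var + eps), beta = mu: the layer undoes its own normalization
eps = 1e-5
y_undone, _, _, _ = batchnorm_forward(x, gamma=np.sqrt(var + eps), beta=mu, eps=eps)
print("identity recovered:", np.allclose(y_undone, x))
```

<p>In a real network <script type="math/tex">\gamma</script> and <script type="math/tex">\beta</script> are learned, so the network can settle anywhere between these two extremes.</p>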
<p><strong>Does normalizing our inputs into the next layer actually work?</strong></p>
<p>With batch normalization, we can be confident that the distributions of our activations across hidden layers are reasonably similar. If this is true, then we know that the gradients should have a wider distribution, and not be nearly all zero, following the same scaling logic described above.</p>
<p>Let’s add batch normalization to our forward pass to see if the activations have reasonable variances. Our forward pass changes in only a few lines:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n_examples</span><span class="p">,</span> <span class="n">hidden_layer_dim</span> <span class="o">=</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">100</span>
<span class="n">input_dim</span> <span class="o">=</span> <span class="mi">1000</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">n_examples</span><span class="p">,</span> <span class="n">input_dim</span><span class="p">)</span> <span class="c"># 100 examples of 1000 points</span>
<span class="n">n_layers</span> <span class="o">=</span> <span class="mi">20</span>
<span class="n">layer_dim</span> <span class="o">=</span> <span class="p">[</span><span class="n">hidden_layer_dim</span><span class="p">]</span> <span class="o">*</span> <span class="n">n_layers</span> <span class="c"># each one has 100 neurons</span>
<span class="n">hs</span> <span class="o">=</span> <span class="p">[</span><span class="n">X</span><span class="p">]</span> <span class="c"># save hidden states</span>
<span class="n">hs_not_batchnormed</span> <span class="o">=</span> <span class="p">[</span><span class="n">X</span><span class="p">]</span> <span class="c"># saves the results before we do batchnorm, because we need this in the backward pass.</span>
<span class="n">zs</span> <span class="o">=</span> <span class="p">[</span><span class="n">X</span><span class="p">]</span> <span class="c"># save affine transforms for backprop</span>
<span class="n">ws</span> <span class="o">=</span> <span class="p">[]</span> <span class="c"># save the weights</span>
<span class="n">gamma</span><span class="p">,</span> <span class="n">beta</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span>
<span class="c"># the forward pass</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">n_layers</span><span class="p">):</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">hs</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="c"># get the input into this hidden layer</span>
    <span class="n">W</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="n">size</span> <span class="o">=</span> <span class="p">(</span><span class="n">layer_dim</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">h</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span> <span class="o">*</span> <span class="mf">0.01</span> <span class="c"># weight init: gaussian around 0</span>
    <span class="n">z</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">W</span><span class="p">,</span> <span class="n">h</span><span class="p">)</span>
    <span class="n">h_out</span> <span class="o">=</span> <span class="n">z</span> <span class="o">*</span> <span class="p">(</span><span class="n">z</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span>
    <span class="c"># save the not-batchnormed part for backprop</span>
    <span class="n">hs_not_batchnormed</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">h_out</span><span class="p">)</span>
    <span class="c"># apply batch normalization</span>
    <span class="n">h_out</span> <span class="o">=</span> <span class="p">(</span><span class="n">h_out</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">h_out</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">))</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">h_out</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
    <span class="c"># scale and shift</span>
    <span class="n">h_out</span> <span class="o">=</span> <span class="n">gamma</span> <span class="o">*</span> <span class="n">h_out</span> <span class="o">+</span> <span class="n">beta</span>
    <span class="n">ws</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">W</span><span class="p">)</span>
    <span class="n">zs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
    <span class="n">hs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">h_out</span><span class="p">)</span>
</code></pre></div></div>
<p>Using the results of this forward pass (again stored in <strong>hs</strong>), we can plot a few of the activations:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/batchnorm_activation_4.png" alt="act4" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/batchnorm_activation_19.png" alt="act19" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/batchnorm_activation_20.png" alt="act20" /></p>
<p>This is great! Our later activations now have a much more reasonable distribution than before, when they were almost all zero. Just compare the scales of the axes on the batchnorm graphs against the original graphs.</p>
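<p>The same comparison can be made with a single number per run. The sketch below is a stand-alone illustration, not the code that produced the plots: it assumes rows are examples and columns are features, applies the normalization after the relu with <script type="math/tex">\gamma = 1</script> and <script type="math/tex">\beta = 0</script>, initializes weights as a standard Gaussian scaled by 0.01, and <strong>final_activation_std</strong> is a hypothetical helper name.</p>

```python
import numpy as np


def final_activation_std(use_batchnorm, n_layers=20, dim=100,
                         n_examples=1000, eps=1e-5, seed=0):
    """Run a random relu network and return the std of the last layer's output."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((n_examples, dim))
    for _ in range(n_layers):
        W = 0.01 * rng.standard_normal((dim, dim))  # small Gaussian init
        h = np.maximum(h @ W, 0.0)                  # linear map followed by relu
        if use_batchnorm:
            # normalize every feature over the batch (gamma = 1, beta = 0)
            h = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)
    return h.std()


std_plain = final_activation_std(use_batchnorm=False)
std_bn = final_activation_std(use_batchnorm=True)
print(f"last-layer std without batchnorm: {std_plain:.2e}")
print(f"last-layer std with batchnorm:    {std_bn:.2e}")
```

<p>Without normalization the small initialization shrinks the activations at every layer, while with normalization each layer’s output is rescaled before it is passed on.</p>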
<p>Let’s see if this makes any difference in our gradients. First, we have to rewrite our original backwards pass to accommodate the gradients for the batchnorm operation. The gradients I used in the batchnorm layer are the ones given by the <a href="https://arxiv.org/pdf/1502.03167.pdf">original paper</a>. Our backwards pass now becomes:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dLdh</span> <span class="o">=</span> <span class="mf">0.01</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">hidden_layer_dim</span><span class="p">,</span> <span class="n">input_dim</span><span class="p">)</span> <span class="c"># random incoming grad into our last layer</span>
<span class="n">h_grads</span> <span class="o">=</span> <span class="p">[</span><span class="n">dLdh</span><span class="p">]</span> <span class="c"># incoming grads into each layer</span>
<span class="n">w_grads</span> <span class="o">=</span> <span class="p">[]</span> <span class="c"># will hold dL/dw_i for each layer</span>
<span class="c"># the backwards pass</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">flip</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_layers</span><span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">):</span>
    <span class="c"># get the incoming gradient</span>
    <span class="n">incoming_loss_grad</span> <span class="o">=</span> <span class="n">h_grads</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
    <span class="c"># backprop through the batchnorm layer</span>
    <span class="c"># y_i is the result of batch norm, i.e. h_out or hs[i]</span>
    <span class="n">dldx_hat</span> <span class="o">=</span> <span class="n">incoming_loss_grad</span> <span class="o">*</span> <span class="n">gamma</span>
    <span class="n">dldvar</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">dldx_hat</span> <span class="o">*</span> <span class="p">(</span><span class="n">hs_not_batchnormed</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">hs_not_batchnormed</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">))</span> <span class="o">*</span> <span class="o">-.</span><span class="mi">5</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">power</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">var</span><span class="p">(</span><span class="n">hs_not_batchnormed</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">),</span> <span class="o">-</span><span class="mf">1.5</span><span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
    <span class="c"># in the paper, the second term of dL/dmean is summed over the batch</span>
    <span class="n">dldmean</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">dldx_hat</span> <span class="o">*</span> <span class="o">-</span><span class="mi">1</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">hs_not_batchnormed</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="n">dldvar</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">hs_not_batchnormed</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">hs_not_batchnormed</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span><span class="o">/</span><span class="n">hs_not_batchnormed</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="c"># the following is dL/hs_not_batchnormed[i] (aka dL/dx_i) in the paper; note the +2 in the variance term</span>
    <span class="n">dldx</span> <span class="o">=</span> <span class="n">dldx_hat</span> <span class="o">*</span> <span class="mi">1</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">hs_not_batchnormed</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="n">dldvar</span> <span class="o">*</span> <span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">hs_not_batchnormed</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">hs_not_batchnormed</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">))</span><span class="o">/</span><span class="n">hs_not_batchnormed</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">dldmean</span><span class="o">/</span><span class="n">hs_not_batchnormed</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="c"># although we don't need it for this demo, for completeness we also compute the derivatives with respect to gamma and beta.</span>
    <span class="c"># hs[i] equals x_hat here because gamma = 1 and beta = 0; gamma is a scalar, so we sum over all entries</span>
    <span class="n">dldgamma</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">incoming_loss_grad</span> <span class="o">*</span> <span class="n">hs</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
    <span class="n">dldbeta</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">incoming_loss_grad</span><span class="p">)</span>
    <span class="c"># now incoming_loss_grad should be replaced by that backpropped result</span>
    <span class="n">incoming_loss_grad</span> <span class="o">=</span> <span class="n">dldx</span>
    <span class="c"># backprop through the relu</span>
    <span class="k">print</span><span class="p">(</span><span class="n">incoming_loss_grad</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
    <span class="n">dLdz</span> <span class="o">=</span> <span class="n">incoming_loss_grad</span> <span class="o">*</span> <span class="p">(</span><span class="n">zs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span>
    <span class="c"># get the gradient dL/dh_{i-1}, this will be the incoming grad into the next layer</span>
    <span class="n">h_grad</span> <span class="o">=</span> <span class="n">ws</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">T</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">dLdz</span><span class="p">)</span>
    <span class="c"># get the gradient of the weights of this layer (dL/dw)</span>
    <span class="n">weight_grad</span> <span class="o">=</span> <span class="n">dLdz</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">hs</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">T</span><span class="p">)</span>
    <span class="n">h_grads</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">h_grad</span><span class="p">)</span>
    <span class="n">w_grads</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">weight_grad</span><span class="p">)</span>
</code></pre></div></div>
<p>Using this backwards pass, we can now plot our gradients. We expect them to no longer be nearly all zero, which will mean that avoiding internal covariate shift fixed our vanishing gradients problem:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/batchnorm_grad_first_layer.png" alt="bngrad1" /></p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/nn-init-demo/master/plots/batchnorm_grad_second_layer.png" alt="bngrad3" /></p>
<p>Awesome! Looking at our gradients early in the network, we can see that they follow a roughly normal distribution with plenty of non-zero, large-magnitude values. Since our gradients are much more reasonable than previously, where they were nearly all zero, we are more confident that learning will occur at a reasonable rate, even for a large deep neural network (20 layers). We’ve successfully used batch normalization to fix one of the most common issues in training deep neural networks!</p>
<h4 id="intuition-for-why-batch-normalization-helps-with-better-gradient-signals">Intuition for why Batch Normalization helps with better gradient signals</h4>
<p>When gradient descent updates a certain layer in our network with the gradient <script type="math/tex">\frac{dL}{dW_i}</script>, it is ignorant of the changes in statistics in other layers. For example, it implicitly assumes that the distribution of the activations of the previous layer (and hence the input into this layer) stays the same as it updates the current layer. Without batch normalization, this assumption isn’t true: gradient descent also updates the weights in the previous layer, which changes the statistics of that layer’s output activations. There could therefore be a case where we update layer <script type="math/tex">i</script>, but the distribution of the inputs into that layer changes such that the update actually does <em>worse</em> on these new inputs. Batch normalization mitigates this by keeping the mean and variance of the input into each layer fixed (up to the learned scale and shift) throughout the learning process. See <a href="https://www.youtube.com/watch?v=Xogn6veSyxA&feature=youtu.be&t=325">this explanation</a> by Goodfellow for more on this.</p>
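<p>To make this concrete, here is a minimal sketch (not the code used for the plots above). The sizes, the <code>batchnorm</code> helper, and the way <code>W_after</code> is produced are all assumptions chosen for illustration. It perturbs the weights of a "previous" layer and compares the statistics the next layer would see with and without normalization:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def batchnorm(h, eps=1e-5):
    # Normalize each feature (column) over the batch dimension (rows).
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

X = rng.standard_normal((100, 50))               # hypothetical batch: 100 examples, 50 features
W_before = rng.standard_normal((50, 20)) * 0.1   # previous layer's weights before an update
W_after = W_before * 5.0 + 0.3                   # stand-in for "the previous layer's weights changed"

h_before = np.maximum(0, X @ W_before)           # ReLU activations feeding the next layer
h_after = np.maximum(0, X @ W_after)

# Without batch norm, the statistics seen by the next layer move with the weights.
raw_shift = abs(h_after.mean() - h_before.mean())

# With batch norm, every feature is brought back to mean ~0 and variance ~1.
bn_before, bn_after = batchnorm(h_before), batchnorm(h_after)

print("shift in raw mean:", raw_shift)
print("batchnormed means:", bn_before.mean(), bn_after.mean())
print("batchnormed variances:", bn_before.var(axis=0).mean(), bn_after.var(axis=0).mean())
```

<p>Only the first two moments are pinned down here; the shape of the distribution can still change, which is one reason to read "stay the same" loosely.</p>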
<p>P.S. - all the code used to generate the plots in this answer is available <a href="https://github.com/rohan-varma/nn-init-demo/">here</a>.</p>
<h4 id="references">References</h4>
<ol>
<li><a href="https://arxiv.org/abs/1502.03167">Batch Normalization Paper</a></li>
<li><a href="http://cs231n.stanford.edu">CS 231n Lecture on Batch Normalization</a></li>
</ol>
<h4 id="notes">Notes</h4>
<p>[2/19/18] - I originally wrote this as an <a href="https://www.quora.com/How-does-batch-normalization-help/answer/Rohan-Varma-8">answer on Quora</a></p>
<p>[2/21/18] - The code used in the forward and backward pass isn’t completely accurate with respect to scaling the outputs by parameters <script type="math/tex">\gamma</script> and <script type="math/tex">\beta</script>. In actuality, there is supposed to be a <script type="math/tex">\gamma_i</script> and a <script type="math/tex">\beta_i</script> for <em>each</em> activation in <em>each</em> hidden layer - for example, if we have a batch of <script type="math/tex">n</script> activations and each activation has shape <script type="math/tex">1000</script>, there should be <script type="math/tex">1000</script> <script type="math/tex">\gamma_i</script>s and <script type="math/tex">1000</script> <script type="math/tex">\beta_i</script>s in each layer. I didn’t bother to actually implement it this way as it doesn’t affect the normalization process for the one step I illustrated.</p>
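<p>For reference, a per-feature version could look roughly like the sketch below. This is an illustration under assumptions rather than a drop-in replacement for the code above: the function names and the batch size of 32 are hypothetical, and only the gradients for <script type="math/tex">\gamma</script> and <script type="math/tex">\beta</script> are shown.</p>

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (n, d) batch of activations; gamma, beta: (d,), one pair per feature.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    out = gamma * x_hat + beta            # (d,) parameters broadcast across the batch
    return out, x_hat

def batchnorm_param_grads(dout, x_hat):
    # Gradients w.r.t. gamma and beta are summed over the batch,
    # so each has the same shape (d,) as the parameter it belongs to.
    dgamma = (dout * x_hat).sum(axis=0)
    dbeta = dout.sum(axis=0)
    return dgamma, dbeta

rng = np.random.default_rng(0)
n, d = 32, 1000                           # illustrative sizes; d matches the 1000 in the note
x = rng.standard_normal((n, d)) * 3.0 + 2.0
gamma, beta = np.ones(d), np.zeros(d)

out, x_hat = batchnorm_forward(x, gamma, beta)
dgamma, dbeta = batchnorm_param_grads(np.ones_like(out), x_hat)
print(gamma.shape, beta.shape, dgamma.shape, dbeta.shape)
```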
<p>[2/22/18] - I applied batch normalization <em>after</em> the ReLU nonlinearity, whereas the original paper states that it is applied after the affine layer and <em>before</em> the nonlinearity. Apparently, their actual code applies it after the ReLU as well, and it was mis-stated in their paper. See <a href="https://www.reddit.com/r/MachineLearning/comments/67gonq/d_batch_normalization_before_or_after_relu/">this Reddit thread</a> for more discussion.</p>Picking Loss Functions - A comparison between MSE, Cross Entropy, and Hinge Loss2018-01-09T00:00:00+00:002018-01-09T00:00:00+00:00http://rohan-varma.github.io/Loss-Functions<p><img src="https://raw.githubusercontent.com/rohan-varma/rohan-blog/gh-pages/images/loss3.jpg" alt="loss" /></p>
<p>Loss functions are a key part of any machine learning model: they define an objective against which the performance of your model is measured, and the setting of weight parameters learned by the model is determined by minimizing a chosen loss function. There are several different common loss functions to choose from: the cross-entropy loss, the mean-squared error, the Huber loss, and the hinge loss - just to name a few. Given a particular model, each loss function has particular properties that make it interesting - for example, the (L2-regularized) hinge loss comes with the maximum-margin property, and the mean-squared error when used in conjunction with linear regression comes with convexity guarantees.</p>
<p>In this post, I’ll discuss three common loss functions: the mean-squared (MSE) loss, the cross-entropy loss, and the hinge loss. These are the loss functions I’ve seen used most often in traditional machine learning and deep learning models, so I thought it would be a good idea to figure out the underlying theory behind each one, and when to prefer one over the others.</p>
<h4 id="the-mean-squared-loss-probabalistic-interpretation">The Mean-Squared Loss: Probabilistic Interpretation</h4>
<p>For a model prediction such as <script type="math/tex">h_\theta(x_i) = \theta_0 + \theta_1x_i</script> (a simple linear regression with a single input feature <script type="math/tex">x_i</script>), the mean-squared error is computed by taking, for each of the <script type="math/tex">N</script> training examples, the squared difference between the true label <script type="math/tex">y_i</script> and the prediction <script type="math/tex">h_\theta(x_i)</script>, and averaging over the examples:</p>
<script type="math/tex; mode=display">J = \frac{1}{N} \sum_{i=1}^{N} (y_i - h_\theta(x_i))^2</script>
<p>It turns out we can derive the mean-squared loss by considering a typical linear regression problem.</p>
<p>With linear regression, we seek to model our real-valued labels <script type="math/tex">Y</script> as being a linear function of our inputs <script type="math/tex">X</script>, corrupted by some noise. Let’s write out this assumption:</p>
<script type="math/tex; mode=display">Y = \theta_0 + \theta_1x + \eta</script>
<p>And to solidify our assumption, we’ll say that <script type="math/tex">\eta</script> is Gaussian noise with 0 mean and unit variance, that is <script type="math/tex">\eta \sim N(0, 1)</script>. Treating <script type="math/tex">x</script> as fixed, this means that <script type="math/tex">E[Y] = E[\theta_0 + \theta_1x + \eta] = \theta_0 + \theta_1x</script> and <script type="math/tex">Var[Y] = Var[\theta_0 + \theta_1x + \eta] = 1</script>, so <script type="math/tex">Y</script> is also Gaussian with mean <script type="math/tex">\theta_0 + \theta_1x</script> and variance 1.</p>
<p>We can write out the probability density of observing a single <script type="math/tex">(x_i, y_i)</script> sample:</p>
<script type="math/tex; mode=display">p(y_i \vert x_i) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(y_{i} - (\theta_{0} + \theta_{1}x_{i}))^2}{2}}</script>
<p>Taking the product across all <script type="math/tex">N</script> samples in our dataset, we can write down the likelihood - essentially the probability of observing all <script type="math/tex">N</script> of our samples. Note that we also make the assumption that our data are independent of each other, so we can write out the likelihood as a simple product over each individual probability:</p>
<script type="math/tex; mode=display">L(x, y) = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}}e^{-\frac{(y_i - (\theta_0 + \theta_1x_i))^2}{2}}</script>
<p>Next, we can take the log of our likelihood function to obtain the log-likelihood, a function that is easier to differentiate and overall nicer to work with:</p>
<script type="math/tex; mode=display">l(x, y) = -\frac{N}{2}\log(2\pi) -\frac{1}{2}\sum_{i=1}^{N}(y_i - (\theta_0 + \theta_1x_i))^2</script>
<p>The first term does not depend on <script type="math/tex">\theta</script>, so maximizing the log-likelihood is the same as minimizing the sum of squared errors, which is the MSE up to a constant factor:</p>
<script type="math/tex; mode=display">J = \frac{1}{2}\sum_{i=1}^{N}(y_i - (\theta_0 + \theta_1x_i))^2</script>
<p>Essentially, this means that using the MSE loss makes sense if your outputs are a real-valued function of your inputs, corrupted by a certain amount of irreducible Gaussian noise with zero mean and constant variance. If these assumptions don’t hold (such as in the context of classification), the MSE loss may not be the best bet.</p>
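<p>As a quick numerical check of this derivation, the sketch below generates synthetic data (the true parameters 1.5 and 2.0 and the sample size are made up for illustration) and confirms that the Gaussian negative log-likelihood and the sum of squared errors differ only by a factor of one half and an additive constant, so they are minimized by the same parameters:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
x = rng.uniform(-3, 3, size=N)
y = 1.5 + 2.0 * x + rng.standard_normal(N)   # hypothetical true theta = (1.5, 2.0), unit-variance noise

def neg_log_likelihood(theta0, theta1):
    # -log prod_i N(y_i | theta0 + theta1 * x_i, 1)
    resid = y - (theta0 + theta1 * x)
    return 0.5 * np.sum(resid ** 2) + 0.5 * N * np.log(2 * np.pi)

def sum_squared_error(theta0, theta1):
    return np.sum((y - (theta0 + theta1 * x)) ** 2)

# Least-squares fit (minimizes the squared error in closed form).
A = np.column_stack([np.ones(N), x])
theta_ls, *_ = np.linalg.lstsq(A, y, rcond=None)

# The two objectives differ by a factor of 1/2 and an additive constant,
# so they rank every candidate theta identically.
const = 0.5 * N * np.log(2 * np.pi)
for t0, t1 in [(0.0, 0.0), (1.0, 2.5), tuple(theta_ls)]:
    assert np.isclose(neg_log_likelihood(t0, t1), 0.5 * sum_squared_error(t0, t1) + const)

print("least-squares estimate:", theta_ls)
```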
<h4 id="the-cross-entropy-loss-probabalistic-interpretation">The Cross-Entropy Loss: Probabilistic Interpretation</h4>
<p>In the context of binary classification, our model’s prediction <script type="math/tex">h_\theta(x_i)</script> will be given by <script type="math/tex">\sigma(Wx_i + b)</script>, which produces a value between <script type="math/tex">0</script> and <script type="math/tex">1</script> that can be interpreted as the probability of example <script type="math/tex">x_i</script> belonging to the positive class. If this probability were less than <script type="math/tex">0.5</script> we’d classify it as a negative example, otherwise we’d classify it as a positive example. This means that we can write down the probability of observing a negative or positive instance:</p>
<p><script type="math/tex">p(y_i = 1 \vert x_i) = h_\theta(x_i)</script> and <script type="math/tex">p(y_i = 0 \vert x_i) = 1 - h_\theta(x_i)</script></p>
<p>We can combine these two cases into one expression:</p>
<script type="math/tex; mode=display">p(y_i | x_i) = [h_\theta(x_i)]^{(y_i)} [1 - h_\theta(x_i)]^{(1 - y_i)}</script>
<p>Invoking our assumption that the data are independent and identically distributed, we can write down the likelihood by simply taking the product across the data:</p>
<script type="math/tex; mode=display">L(x, y) = \prod_{i = 1}^{N}[h_\theta(x_i)]^{(y_i)} [1 - h_\theta(x_i)]^{(1 - y_i)}</script>
<p>Similar to above, we can take the log of the above expression and use properties of logs to simplify, and finally negate the entire expression to obtain the cross entropy loss:</p>
<script type="math/tex; mode=display">J = -\sum_{i=1}^{N} \left[ y_i\log (h_\theta(x_i)) + (1 - y_i)\log(1 - h_\theta(x_i)) \right]</script>
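<p>The link between the likelihood and the loss can be checked numerically. The following is a small sketch with made-up labels and scores (the lists <code>ys</code> and <code>zs</code> are hypothetical), showing that the binary cross-entropy equals the negative log of the Bernoulli likelihood:</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(ys, ps):
    # J = -sum_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(ys, ps))

def likelihood(ys, ps):
    # prod_i p_i^{y_i} (1 - p_i)^{1 - y_i}
    out = 1.0
    for y, p in zip(ys, ps):
        out *= (p ** y) * ((1 - p) ** (1 - y))
    return out

ys = [1, 0, 1, 1]                  # hypothetical labels
zs = [2.0, -1.0, 0.5, -0.3]        # hypothetical scores W x_i + b
ps = [sigmoid(z) for z in zs]

J = binary_cross_entropy(ys, ps)
print(J, -math.log(likelihood(ys, ps)))   # the two numbers agree
```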
<h4 id="the-cross-entropy-loss-in-the-case-of-multi-class-classification">The Cross-Entropy Loss in the case of multi-class classification</h4>
<p>Let’s suppose that we’re now interested in applying the cross-entropy loss to multiple (> 2) classes. The idea behind the loss function doesn’t change, but now since our labels <script type="math/tex">y_i</script> are one-hot encoded, we write down the loss (slightly) differently:</p>
<script type="math/tex; mode=display">-\sum_{i=1}^{N} \sum_{j=1}^{K} y_{ij} \log(h_{\theta}(x_{i})_{j})</script>
<p>This is pretty similar to the binary cross entropy loss we defined above, but since we have multiple classes we need to sum over all of them. The loss <script type="math/tex">L_i</script> for a particular training example is given by</p>
<p><script type="math/tex">L_{i} = - \log p(Y = y_{i} \vert X = x_{i})</script>.</p>
<p>In particular, in the inner sum, only one term will be non-zero, and that term will be the <script type="math/tex">\log</script> of the (normalized) probability assigned to the correct class. Intuitively, this makes sense because <script type="math/tex">\log(x)</script> is increasing on the interval <script type="math/tex">(0, 1)</script> so <script type="math/tex">-\log(x)</script> is decreasing on that interval: the more probability we assign to the correct class, the lower the loss. For example, using the natural log, if we have a score of 0.8 for the correct label, our loss will be about 0.22; if we have a score of 0.08, our loss would be about 2.53.</p>
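<p>Here is a short sketch of the per-example loss. The raw scores passed to <code>softmax</code> are made up for illustration. Since the numeric value of the loss depends on the base of the logarithm, the last lines print the loss for probabilities 0.8 and 0.08 under both the natural log and base 10:</p>

```python
import math

def softmax(scores):
    m = max(scores)                                   # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, correct_index):
    # With a one-hot label only the correct-class term of the inner sum survives.
    return -math.log(probs[correct_index])

probs = softmax([2.0, 1.0, 0.1])    # hypothetical raw scores for K = 3 classes
loss = cross_entropy(probs, 0)
print(probs, loss)

nat_08, nat_008 = -math.log(0.8), -math.log(0.08)         # natural log
b10_08, b10_008 = -math.log10(0.8), -math.log10(0.08)     # base 10
print(round(nat_08, 3), round(nat_008, 3))   # 0.223 2.526
print(round(b10_08, 3), round(b10_008, 3))   # 0.097 1.097
```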
<p>Another variant on the cross entropy loss for multi-class classification also adds the other predicted class scores to the loss:</p>
<script type="math/tex; mode=display">- \sum_{i=1}^{N} \sum_{j=1}^{K} \left[ y_{ij} \log(h_{\theta}(x_{i})_{j}) + (1-y_{ij})\log(1 - h_{\theta}(x_{i})_{j}) \right]</script>
<p>The second term in the inner sum essentially inverts our labels and score assignments: it gives the other predicted classes a probability of <script type="math/tex">1 - s_j</script>, and penalizes them by the <script type="math/tex">\log</script> of that amount (here, <script type="math/tex">s_j</script> denotes the <script type="math/tex">j</script>th score, which is the <script type="math/tex">j</script>th element of <script type="math/tex">h_\theta(x_i)</script>).</p>
<p>This again makes sense - penalizing the incorrect classes in this way will encourage the values <script type="math/tex">1 - s_j</script> (where each <script type="math/tex">s_j</script> is a probability assigned to an incorrect class) to be large, which will in turn encourage <script type="math/tex">s_j</script> to be low. This alternative version seems to tie in more closely to the binary cross entropy that we obtained from the maximum likelihood estimate, but the first version appears to be more commonly used both in practice and in teaching.</p>
<p>It turns out that it doesn’t really matter which variant of cross-entropy you use for multiple-class classification, as they both decrease at similar rates and are just offset, with the second variant discussed having a higher loss for a particular setting of scores. To show this, I <a href="https://github.com/rohan-varma/machine-learning-courses/blob/master/cs231n/loss.py">wrote some code</a> to plot these 2 loss functions against each other, for probabilities for the correct class ranging from 0.01 to 0.98, and obtained the following plot:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/machine-learning-courses/master/cs231n/loss.png" alt="loss" /></p>
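<p>A rough way to reproduce the comparison without the plotting code is sketched below. It is not the linked script: it assumes three classes and, purely for illustration, that the probability mass not assigned to the correct class is split evenly among the incorrect classes.</p>

```python
import math

def variant_one(p_correct):
    # Only the correct class contributes: -log(p_correct).
    return -math.log(p_correct)

def variant_two(p_correct, num_classes=3):
    # Assumption for illustration: the remaining probability mass is split
    # evenly across the incorrect classes.
    p_other = (1.0 - p_correct) / (num_classes - 1)
    return -math.log(p_correct) - (num_classes - 1) * math.log(1.0 - p_other)

for p in [0.01, 0.25, 0.5, 0.75, 0.98]:
    print(p, round(variant_one(p), 3), round(variant_two(p), 3))
```

<p>Under these assumptions both losses decrease as the probability of the correct class grows, and the second variant is always the larger of the two.</p>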
<h4 id="cross-entropy-loss-an-information-theory-perspective">Cross Entropy Loss: An information theory perspective</h4>
<p>As mentioned in the <a href="http://cs231n.github.io/linear-classify/">CS 231n lectures</a>, the cross-entropy loss can be interpreted via information theory. In information theory, the Kullback-Leibler (KL) divergence measures how “different” two probability distributions are. We can think of our classification problem as having 2 different probability distributions: first, the distribution for our actual labels, where all the probability mass is concentrated on the correct label, and there is no probability mass on the rest, and second, the distribution which we are learning, where the concentrations of probability mass are given by the outputs of running our raw scores through a softmax function.</p>
<p>In an ideal world, our learned distribution would match the actual distribution, with 100% probability being assigned to the correct label. This can’t really happen since that would mean our raw scores would have to be <script type="math/tex">\infty</script> and <script type="math/tex">-\infty</script> for our correct and incorrect classes respectively, and, more practically, constraints we impose on our model (e.g. using logistic regression instead of a deep neural net) will limit our ability to correctly classify every example with high probability on the correct label.</p>
<p>Interpreting the cross-entropy loss as minimizing the KL divergence between 2 distributions is interesting if we consider how we can extend cross-entropy to different scenarios. For example, a lot of datasets are only partially labelled or have noisy (i.e. occasionally incorrect) labels. If we could probabilistically assign labels to the unlabelled portion of a dataset, or interpret the incorrect labels as being sampled from a probabilistic noise distribution, we can still apply the idea of minimizing the KL-divergence, although our ground-truth distribution will no longer concentrate all the probability mass over a single label.</p>
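<p>The relationship between the two quantities is that cross-entropy equals the entropy of the label distribution plus the KL divergence. The sketch below checks this for a one-hot label and for a "soft" label; both label vectors and the predicted distribution <code>q</code> are made-up values for illustration:</p>

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

q = [0.7, 0.2, 0.1]            # hypothetical softmax output
one_hot = [1.0, 0.0, 0.0]      # all mass on the correct label
soft = [0.8, 0.15, 0.05]       # hypothetical noisy / probabilistic label

for p in (one_hot, soft):
    h, ce, kl = entropy(p), cross_entropy(p, q), kl_divergence(p, q)
    print(round(h, 4), round(ce, 4), round(kl, 4))   # ce == h + kl in both cases
```

<p>For the one-hot label the entropy term is zero, so minimizing the cross-entropy and minimizing the KL divergence are the same thing. For the soft label the entropy is a constant that does not depend on the model, so the minimizer is still the same.</p>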
<h4 id="differences-in-learning-speed-for-classification">Differences in learning speed for classification</h4>
<p>It turns out that if we’re given a typical classification problem, we can show that (at least theoretically) the cross-entropy loss leads to quicker learning through gradient descent than the MSE loss. First, let’s recall the gradient descent update rule:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>For step t = 1 ... T:
    Compute dJ/dw_j for each of the j = 1 ... M parameters
    For j = 1 ... M:
        Let w_j = w_j - learning_rate * dJ/dw_j
</code></pre></div></div>
<p>Essentially, the gradient descent algorithm computes partial derivatives for all the parameters in our network, and updates each parameter by subtracting its partial derivative times a constant known as the learning rate, taking a step towards a local minimum.</p>
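<p>A runnable version of this update rule might look like the sketch below. The quadratic objective and the helper names <code>gradient_descent</code> and <code>toy_grad</code> are assumptions made for illustration, not part of any particular library:</p>

```python
def gradient_descent(grad_fn, w, learning_rate=0.1, num_steps=100):
    # grad_fn(w) returns the list of partial derivatives dJ/dw_j.
    for _ in range(num_steps):
        grads = grad_fn(w)
        # Update every parameter using gradients computed at the same point.
        w = [w_j - learning_rate * g_j for w_j, g_j in zip(w, grads)]
    return w

# Toy objective (an assumption for illustration): J(w) = (w_0 - 3)^2 + (w_1 + 1)^2
def toy_grad(w):
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]

w_final = gradient_descent(toy_grad, [0.0, 0.0])
print(w_final)   # approaches [3, -1]
```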
<p>This means that the “speed” of learning is dictated by two things: the learning rate and the size of the partial derivative. The learning rate is a hyperparameter that we must tune, so we’ll focus on the size of the partial derivatives for now. Consider the following binary classification scenario: we have an input feature vector <script type="math/tex">x_i</script>, a label <script type="math/tex">y_i</script>, and a prediction <script type="math/tex">\hat{y_i} = h_\theta(x_i)</script>.</p>
<p>We’ll show that given our model <script type="math/tex">h_\theta(x_i) = \sigma(Wx_i + b)</script>, learning can occur much faster during the beginning phases of training if we use the cross-entropy loss instead of the MSE loss. And we want this to happen, since at the beginning of training, our model is performing poorly due to the weights being randomly initialized.</p>
<p>Given our prediction <script type="math/tex">\hat{y_i} = \sigma(Wx_i + b)</script> and our loss <script type="math/tex">J = \frac{1}{2}(y_i - \hat{y_i})^2</script>, we first obtain the partial derivative <script type="math/tex">\frac{dJ}{dW}</script>, applying the chain rule twice:</p>
<script type="math/tex; mode=display">\frac{dJ}{dW} = (\hat{y_i} - y_i)\sigma'(Wx_i + b)x_i</script>
<p>This derivative has the term <script type="math/tex">\sigma'(Wx_i + b)</script> in it. This can be expressed as <script type="math/tex">\sigma(Wx_i + b)(1 - \sigma(Wx_i + b))</script> (see here for a proof). Whenever the output saturates near <script type="math/tex">0</script> or <script type="math/tex">1</script>, which can happen early in training when randomly initialized weights produce a confidently wrong prediction, this expression will be very close to 0, which will make the partial derivative nearly vanish even though the error is large. A plot of the sigmoid curve’s derivative is shown below, indicating that the gradients are small whenever the outputs are close to <script type="math/tex">0</script> or <script type="math/tex">1</script>:</p>
<p><img src="http://ronny.rest/media/blog/2017/2017_08_10_sigmoid/sigmoid_and_derivative_plot.jpg" alt="sigmoid" /></p>
<p>This can lead to slower learning at the beginning stages of gradient descent, since the smaller derivatives change each weight by only a small amount, and gradient descent takes a while to leave this saturated region and make larger updates towards a minimum.</p>
<p>On the other hand, given the cross entropy loss:</p>
<script type="math/tex; mode=display">J = -\sum_{i=1}^{N} y_i\log(\sigma (Wx_i + b)) + (1-y_i)\log(1 - \sigma(Wx_i + b))</script>
<p>We can obtain the partial derivative <script type="math/tex">\frac{dJ}{dW}</script> as follows (with the substitution <script type="math/tex">z = Wx_i + b</script>):</p>
<script type="math/tex; mode=display">\frac{dJ}{dW} = -\sum_{i=1}^{N} \left[ \frac{y_i x_i\sigma'(z)}{\sigma(z)} - \frac{(1-y_i)x_i \sigma'(z)}{1 - \sigma(z)} \right]</script>
<p>Substituting <script type="math/tex">\sigma'(z) = \sigma(z)(1 - \sigma(z))</script> and simplifying, we obtain a nice expression for the gradient of the loss function with respect to the weights:</p>
<script type="math/tex; mode=display">\frac{dJ}{dW} = \sum_{i=1}^{N} x_i(\sigma(z) - y_i)</script>
<p>This derivative does not have a <script type="math/tex">\sigma'</script> term in it, and we can see that the magnitude of the derivative is entirely dependent on the magnitude of our error <script type="math/tex">\sigma(z) - y_i</script> - how far off our prediction was from the ground truth. This is great, since that means early on in learning, the derivatives will be large, and later on in learning, the derivatives will get smaller and smaller, corresponding to smaller adjustments to the weight variables. This makes intuitive sense: if our error is small, we’d want to avoid large adjustments that could cause us to jump out of the minimum. Michael Nielsen in his <a href="http://neuralnetworksanddeeplearning.com/chap3.html">book</a> has an in-depth discussion and illustration of this that is really helpful.</p>
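<p>To see the difference in gradient size, the sketch below evaluates both derivatives for a single scalar example. The input, label, and weight values are made up for illustration; larger weights here correspond to a more confidently wrong prediction.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_grad(w, b, x, y):
    # d/dw of 0.5 * (y - sigmoid(w x + b))^2; note the sigmoid'(z) factor.
    y_hat = sigmoid(w * x + b)
    return (y_hat - y) * y_hat * (1.0 - y_hat) * x

def cross_entropy_grad(w, b, x, y):
    # d/dw of -[y log(y_hat) + (1 - y) log(1 - y_hat)]; no sigmoid'(z) factor.
    return (sigmoid(w * x + b) - y) * x

x, y = 1.0, 0.0                      # hypothetical example with label 0
for w in [0.0, 2.0, 5.0]:            # larger w => more confidently wrong
    print(w, round(mse_grad(w, 0.0, x, y), 4), round(cross_entropy_grad(w, 0.0, x, y), 4))
```

<p>As the prediction gets more wrong, the cross-entropy gradient grows towards 1 while the MSE gradient shrinks towards 0, which is the slowdown described above.</p>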
<h4 id="hinge-loss-vs-cross-entropy-loss">Hinge Loss vs Cross-Entropy Loss</h4>
<p>There’s actually another commonly used type of loss function in classification related tasks: the hinge loss. The (L2-regularized) hinge loss leads to the canonical support vector machine model with the max-margin property. The margin is the distance from the line (or more generally, hyperplane) that separates our points into classes to the closest training point, and the SVM picks the separating hyperplane that makes this distance as large as possible:</p>
<p><img src="https://docs.opencv.org/2.4.13.4/_images/optimal-hyperplane.png" alt="svm" /></p>
<p>The hinge loss penalizes predictions not only when they are incorrect, but even when they are correct but not confident. It penalizes gravely wrong predictions significantly, correct but not confident predictions a little less, and only confident, correct predictions are not penalized at all. Let’s formalize this by writing out the hinge loss in the case of binary classification:</p>
<script type="math/tex; mode=display">\sum_{i} \max(0, 1 - y_{i} \cdot h_{\theta}(x_{i}))</script>
<p>Our labels <script type="math/tex">y_{i}</script> are either -1 or 1, so the loss is only zero when the signs match and <script type="math/tex">\vert h_{\theta}(x_{i})\vert \geq 1</script>. For example, if our score for a particular training example was <script type="math/tex">0.2</script> but the label was <script type="math/tex">-1</script>, we’d incur a penalty of <script type="math/tex">1.2</script>; if our score was <script type="math/tex">-0.7</script> (meaning that this instance was predicted to have label <script type="math/tex">-1</script>) we’d still incur a penalty of <script type="math/tex">0.3</script>; but if we predicted <script type="math/tex">-1.1</script> then we would incur no penalty. A visualization of the hinge loss (in green) compared to other cost functions is given below:</p>
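<p>The three cases above can be checked directly with a few lines of code. The function name <code>hinge_loss</code> is just a label chosen for this sketch:</p>

```python
def hinge_loss(label, score):
    # label is -1 or +1; score is the raw model output h_theta(x).
    return max(0.0, 1.0 - label * score)

for score in (0.2, -0.7, -1.1):
    print(score, round(hinge_loss(-1, score), 2))   # 1.2, 0.3, 0.0
```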
<p><img src="https://i.stack.imgur.com/4DFDU.png" alt="hinge loss" /></p>
<p>The main difference between the hinge loss and the cross entropy loss is that the former arises from trying to maximize the margin between our decision boundary and data points - thus attempting to ensure that each point is correctly and confidently classified, while the latter comes from a maximum likelihood estimate of our model’s parameters. The softmax function, whose scores are used by the cross entropy loss, allows us to interpret our model’s scores as relative probabilities against each other. For example, the cross-entropy loss would incur a much higher loss if our (un-normalized) scores were <script type="math/tex">[10, 8, 8]</script> than if they were <script type="math/tex">[10, -10, -10]</script>, where the first class is correct. The (multi-class) hinge loss, in contrast, would recognize that the correct class score already exceeds the other scores by more than the margin, so it will incur zero loss for both sets of scores. Once the margins are satisfied, the SVM will no longer optimize the weights in an attempt to “do better” than it is already.</p>
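<p>The two sets of scores can be compared numerically. The sketch below assumes the common multi-class hinge formulation that sums <script type="math/tex">\max(0, s_j - s_{correct} + 1)</script> over the incorrect classes (as in the CS 231n notes); the post itself does not spell out a formula, so treat that choice as an assumption.</p>

```python
import math

def softmax_cross_entropy(scores, correct):
    # -log softmax(scores)[correct], computed with the log-sum-exp trick.
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[correct]

def multiclass_hinge(scores, correct, margin=1.0):
    # Assumed formulation: sum over incorrect classes of max(0, s_j - s_correct + margin).
    return sum(max(0.0, s - scores[correct] + margin)
               for j, s in enumerate(scores) if j != correct)

for scores in ([10, 8, 8], [10, -10, -10]):
    print(scores, round(softmax_cross_entropy(scores, 0), 4), multiclass_hinge(scores, 0))
```

<p>The hinge loss is zero for both score vectors, while the cross-entropy loss is still noticeably above zero for the first one and keeps pushing the scores apart.</p>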
<h4 id="wrap-up">Wrap-Up</h4>
<p>In this post, we’ve shown that the MSE loss comes from a probabilistic interpretation of the regression problem, and the cross-entropy loss comes from a probabilistic interpretation of binary classification. The MSE loss is therefore better suited to regression problems, and the cross-entropy loss provides us with faster learning when our predictions differ significantly from our labels, as is generally the case during the first several iterations of model training. We’ve also compared and contrasted the cross-entropy loss and hinge loss, and discussed how using one over the other leads to our models learning in different ways. Thanks for reading, and hope you enjoyed the post!</p>
<h4 id="sources">Sources</h4>
<ol>
<li>
<p><a href="http://neuralnetworksanddeeplearning.com/chap3.html">Michael Nielsen’s Neural Networks and Deep Learning, Chapter 3</a></p>
</li>
<li>
<p><a href="http://cs231n.github.io/linear-classify/">Stanford CS 231n notes on cross entropy and hinge loss</a></p>
</li>
<li>
<p><a href="https://docs.opencv.org/2.4.13.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html">OpenCV introduction to SVMs</a></p>
</li>
<li>
<p><a href="https://math.stackexchange.com/questions/782586/how-do-you-minimize-hinge-loss">StackExchange answer on hinge loss minimization</a></p>
</li>
<li>
<p><a href="https://www.cs.princeton.edu/courses/archive/fall16/cos402/lectures/402-lec5.pdf">Machine Learning, Princeton University</a></p>
</li>
<li>
<p><a href="http://ronny.rest/blog/post_2017_08_10_sigmoid/">Ronny Restrepo, sigmoid functions</a></p>
</li>
</ol>Paper Analysis - Sequence to Sequence Learning2018-01-02T00:00:00+00:002018-01-02T00:00:00+00:00http://rohan-varma.github.io/Seq-2-Seq<p><img src="https://raw.githubusercontent.com/rohan-varma/rohan-blog/gh-pages/images/seq2seq.png" alt="seq2seq" /></p>
<p><a href="https://arxiv.org/pdf/1409.3215.pdf">Link to paper</a>
<a href="https://github.com/rohan-varma/paper-analysis/blob/master/seq2seq-paper/Pytorch%20Seq%202%20Seq%20Model.ipynb">Link to example implementation</a></p>
<p><strong>Abstract</strong></p>
<ul>
<li>
<p>Traditional DNNs have achieved good performance whenever large labelled training datasets are available, but cannot map sequences to sequences</p>
</li>
<li>
<p>Main approach of the paper is to use a multilayer LSTM to map input sequence to a fixed-length vector, and then another deep LSTM to decode the fixed length vector into a sequence</p>
</li>
<li>
<p>The LSTM model also learned useful phrase & sentence representations that are sensitive to word order and invariant to passive/active voice</p>
</li>
<li>
<p>This indicates that the actual structure of the language was mostly captured: the representations of “He ate the cookie” and “The cookie was eaten by him” aren’t too different</p>
</li>
<li>
<p>Reversing word order in source sentence helped because it introduced more short-term dependencies</p>
</li>
</ul>
<p><strong>Introduction</strong></p>
<ul>
<li>DNNs are powerful, example: 2 layer neural network of quadratic size can learn to sort n n-bit numbers</li>
<li>DNNs are only useful for problems whose inputs and labels can be expressed as fixed-length vectors in some way</li>
<li>This is limiting: DNNs can’t do many tasks whose inputs are best represented as sequences, such as machine translation, POS tagging, and speech recognition</li>
<li>Main idea: use one LSTM to read the input sequence (of variable length) one time step at a time, and map this to a fixed-length vector. Then a second LSTM takes this fixed-length vector as an input and produces an output sequence</li>
<li>The second LSTM is basically an RNN language model that is conditioned on the encoded representation</li>
<li>The researchers trained an ensemble of 5 deep LSTMs (each with 384 million parameters) and used a beam search decoder to achieve strong performance, close to the state of the art at the time, on the WMT English-to-French translation task</li>
<li>Reversing the words in the source sentence helped train the LSTMs a lot, because this introduced many more short term dependencies to make it easier to train the LSTM with SGD</li>
</ul>
<p><strong>Model</strong></p>
<ul>
<li>
<p>An RNN takes a sequence <script type="math/tex">x_1, \ldots, x_T</script> as input and computes a sequence of outputs <script type="math/tex">y_1, \ldots, y_T</script></p>
</li>
<li>
<p>At each timestep, RNN computes a hidden state <script type="math/tex">h_t</script> and an output <script type="math/tex">y_t</script>. We can think of the hidden state as encapsulating the information encountered at previous timesteps</p>
</li>
<li>
<script type="math/tex; mode=display">h_t = \tanh(W_{hx}x_t + W_{hh}h_{t-1})</script>
</li>
<li>
<script type="math/tex; mode=display">y_t = W_{hy}h_t</script>
</li>
<li>
<p>This is for a “single-layer” RNN that does not have layers of hidden states. If there are multiple layers of hidden states, then instead of <script type="math/tex">x_t</script> as the input into a later hidden layer, the input is the <script type="math/tex">h_t</script> at the previous layer, same timestep</p>
</li>
<li>
<p>For general/variable-length sequence to sequence learning, the general idea is to map the input sequence to a fixed-length vector using one RNN and then map the fixed-length vector to the target sequence with another RNN</p>
</li>
<li>
<p>However, in practice RNNs aren’t very good at learning longer-term dependencies. LSTMs have been shown to do much better at learning longer-term dependencies because they are much less susceptible to the vanishing gradient problem than traditional RNN cells</p>
</li>
<li>
<p>Goal of the LSTM is to estimate the following conditional probability:</p>
<ul>
<li><script type="math/tex">p(y_{1}, \ldots, y_{T'} \vert x_{1}, \ldots, x_{T})</script>, where the lengths <script type="math/tex">T</script> and <script type="math/tex">T'</script> of the two sequences may differ from each other</li>
<li>The LSTM does this by computing a fixed-dimensional representation <script type="math/tex">v</script> after observing the input sequence <script type="math/tex">x_1, \ldots, x_T</script></li>
<li>Then, conditioned on this <script type="math/tex">v</script>, we can produce the output sequence via the following formulation:
<ul>
<li>
<script type="math/tex; mode=display">p(y_1 \vert x_1, \ldots, x_T) = p(y_1 \vert v)</script>
</li>
<li>
<script type="math/tex; mode=display">p(y_1, y_2 \vert x_1, \ldots, x_T) = p(y_2 \vert v, y_1) p(y_1 \vert v)</script>
</li>
<li>i.e., at each time step the <script type="math/tex">t</script>th output is conditioned on the fixed length vector <script type="math/tex">v</script> and the previous outputs, if any</li>
<li>In general, <script type="math/tex">p(y_1, \ldots, y_{T'} \vert x_1, \ldots, x_T) = \prod_{t = 1}^{T'} p(y_t \vert v, y_1, \ldots, y_{t-1})</script></li>
<li>Each of these distributions is represented with a softmax over the vocabulary</li>
</ul>
</li>
<li>2 different LSTMs are used (one for the input sequence and one for the output sequence), since this increases the number of model parameters at negligible computational cost and makes it natural to train the LSTM on multiple language pairs</li>
<li>A deep LSTM was used with four layers; this was found to significantly outperform shallow LSTMs</li>
<li>Reversed order of words in input sequences was found to help a lot</li>
<li>The dataset had 12M sentences with 348M French words and 304M English words</li>
</ul>
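<p>The recurrence for the hidden state and output can be sketched in a few lines of NumPy. This is a plain single-layer RNN rather than the LSTM used in the paper, and the dimensions and weight initialization are illustrative assumptions; the point is only to show how a variable-length input is folded into one fixed-length vector.</p>

```python
import numpy as np

def rnn_forward(xs, W_hx, W_hh, W_hy, h0):
    # xs: (T, input_dim). Returns hidden states (T, hidden_dim) and outputs (T, output_dim).
    h = h0
    hs, ys = [], []
    for x_t in xs:
        h = np.tanh(W_hx @ x_t + W_hh @ h)   # h_t = tanh(W_hx x_t + W_hh h_{t-1})
        hs.append(h)
        ys.append(W_hy @ h)                  # y_t = W_hy h_t
    return np.stack(hs), np.stack(ys)

rng = np.random.default_rng(0)
T, input_dim, hidden_dim, output_dim = 6, 4, 8, 3     # illustrative sizes
xs = rng.standard_normal((T, input_dim))
W_hx = rng.uniform(-0.08, 0.08, (hidden_dim, input_dim))
W_hh = rng.uniform(-0.08, 0.08, (hidden_dim, hidden_dim))
W_hy = rng.uniform(-0.08, 0.08, (output_dim, hidden_dim))

hs, ys = rnn_forward(xs, W_hx, W_hh, W_hy, np.zeros(hidden_dim))
v = hs[-1]   # the final hidden state plays the role of the fixed-length vector v
print(hs.shape, ys.shape, v.shape)
```

<p>Changing <code>T</code> changes the number of hidden states and outputs but not the size of <code>v</code>, which is what lets a second network decode from it regardless of the input length.</p>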
<p><strong>Training</strong></p>
<ul>
<li>Model was trained to maximize probability of producing a correct translation given a source sentence:</li>
<li>
<script type="math/tex; mode=display">\frac{1}{N} \sum_{i=1}^{N} \log p(T_i | S_i)</script>
</li>
<li>Translations are produced by finding the most likely translation given by the LSTM: <script type="math/tex">T* = argmax_T p(T \vert S)</script></li>
<li>Beam search used to find the most likely translations. At each time, we maintain a list of partial hypotheses and then extend each partial hypothesis with every word in the vocabulary. Then we discard all but <script type="math/tex">B</script> of the most likely hypotheses (where the likelihood is given by the model’s log probability).</li>
<li>A hypothesis is finished once the end-of-sentence tag &lt;EOS&gt; is emitted</li>
<li>As discussed, reversing the words in the source sentences helps the translation task a lot</li>
<li>An intuitive explanation for this is that reversing the source sentence puts the first few source words very close to the first few target words, so the minimal time lag between corresponding words is greatly reduced (even though the average distance is unchanged)</li>
<li>Therefore, backpropagation has an easier time communicating between the source sentence and the target sentence.</li>
<li>Ex: compare “I like to eat the apples” and “Me gusta comer las manzanitas” with “Apples the eat to like I” and “Me gusta comer las manzanitas”. In the second pair, the first words of the source sentence (“I like”) sit right next to their translations (“Me gusta”)</li>
<li>A deep LSTM with 4 layers and 1000 cells at each layer was used, with 1000-dimensional word embeddings</li>
<li>Initialization was uniform random between -0.08 and 0.08</li>
<li>SGD without momentum with lr = 0.7; after 5 epochs the learning rate was halved every half epoch, for a total of 7.5 epochs of training</li>
<li>batch size = 128</li>
<li>To avoid exploding gradients, the researchers enforced a hard cap on the norm of the gradients: the gradients were scaled down whenever their norm exceeded a threshold</li>
<li>Each layer was trained on a different GPU and communicated its activations to the next layer as soon as they were computed. Training took about 10 days</li>
</ul>
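<p>The beam search procedure described above can be sketched as follows. This is a minimal toy version, not the paper’s implementation: <code>toy_model</code> is a hypothetical stand-in for the decoder that returns fixed next-token log-probabilities, and token 0 is assumed to be the end-of-sentence token.</p>

```python
import numpy as np

def beam_search(step_log_probs, beam_width, eos_id, max_len):
    """Toy beam search.

    step_log_probs(prefix) returns a vector of log-probabilities over the
    vocabulary for the next token, given the tuple `prefix`.
    """
    beams = [((), 0.0)]   # (partial hypothesis, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = step_log_probs(prefix)
            # Extend this partial hypothesis with every word in the vocabulary.
            for token, lp in enumerate(log_probs):
                candidates.append((prefix + (token,), score + lp))
        # Discard all but the B most likely hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            # A hypothesis ending in the end-of-sentence token is complete.
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

def toy_model(prefix):
    # Hypothetical decoder: prefers token 1 first, then end-of-sentence (token 0).
    if len(prefix) == 0:
        probs = np.array([0.1, 0.6, 0.3])
    else:
        probs = np.array([0.7, 0.2, 0.1])
    return np.log(probs)

best, score = beam_search(toy_model, beam_width=2, eos_id=0, max_len=4)
```

<p>With this toy distribution the best hypothesis is token 1 followed by end-of-sentence, with probability 0.6 × 0.7 = 0.42. A real decoder would condition <code>step_log_probs</code> on the source representation <script type="math/tex">v</script> as well as the prefix.</p>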
<p><strong>Results</strong></p>
<ul>
<li>
<p>The model achieved state-of-the-art accuracy on English to French translation tasks</p>
</li>
<li>
<p>The fixed length vectors learned were pretty meaningful, in that they were sensitive to order of the words (i.e. John admires Mary was further away from Mary admires John, but Mary admires John and Mary respects John were relatively close together)</p>
</li>
<li>
<p>Other approaches included using convolutional networks to map sentences to fixed length vectors, using attention mechanisms to overcome issues with long sentence translation, or taking phrase-based approaches to achieve smoother translations</p>
</li>
</ul>
</li>
</ul>Interpreting Regularization as a Bayesian Prior2017-08-24T00:00:00+00:002017-08-24T00:00:00+00:00http://rohan-varma.github.io/Regularization<p><img src="https://raw.githubusercontent.com/rohan-varma/rohan-blog/gh-pages/images/reg.png" alt="img" /></p>
<h3 id="introductionbackground">Introduction/Background</h3>
<p>In machine learning, we often start off by writing down a probabilistic model that describes how our data are generated. We then go on to write down a likelihood or some type of loss function, which we then optimize over to get the optimal settings for the parameters that we seek to estimate. Along the way, techniques such as regularization, hyperparameter tuning, and cross-validation can be used to ensure that we don’t overfit on our training dataset and our model generalizes well to unseen data.</p>
<p>Specifically, we have a few key functions and variables: the underlying probability distribution <script type="math/tex">p(x, y)</script> which generates our training examples (pairs of features and labels), a training set <script type="math/tex">\{(x_i, y_i)\}_{i = 1}^{N}</script> of <script type="math/tex">N</script> examples which we observe, and a model <script type="math/tex">h(x) : x \rightarrow{} y</script> which we wish to learn in order to produce a mapping from <script type="math/tex">x</script> to <script type="math/tex">y</script>. This function <script type="math/tex">h</script> is selected from a larger function space <script type="math/tex">H</script>.</p>
<p>For example, if we are in the context of linear regression models, then all functions in the function space of <script type="math/tex">H</script> will take on the form <script type="math/tex">y_i = x_{i}^T \beta</script> where a particular setting of our parameters <script type="math/tex">\beta</script> will result in a particular <script type="math/tex">h(x)</script>. We also have some function <script type="math/tex">L(h(x), y)</script> that takes in our predictions and labels, and quantifies how accurate our model is across some data.</p>
<p>Ideally, we’d like to minimize the risk function</p>
<script type="math/tex; mode=display">R[h(x)] = \sum_{(x, y)} L( h(x), y) p(x, y)</script>
<p>across all possible <script type="math/tex">(x, y)</script> pairs. However, this is impossible since we don’t know the underlying probability distribution that describes our dataset, so instead we seek to approximate the risk function by minimizing a loss function across the data that we have observed:</p>
<script type="math/tex; mode=display">\frac{1}{N} \sum_{i = 1}^{N} L(h(x_i), y_i)</script>
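<p>As a small illustration of this empirical average, the sketch below evaluates it for a squared loss. The candidate model <code>h</code> and the three data points are made up purely for the example.</p>

```python
import numpy as np

def empirical_risk(h, loss, xs, ys):
    """Average loss of model h over the observed (x, y) pairs."""
    return np.mean([loss(h(x), y) for x, y in zip(xs, ys)])

def squared_loss(prediction, y):
    return (prediction - y) ** 2

def h(x):
    # Hypothetical candidate model: y = 2x.
    return 2.0 * x

xs = np.array([0.0, 1.0, 2.0])
ys = np.array([0.0, 2.0, 5.0])
risk = empirical_risk(h, squared_loss, xs, ys)   # (0 + 0 + 1) / 3
```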
<h3 id="linear-models">Linear Models</h3>
<p>If we assume that our data are roughly linear, then we can write a relationship between our features and real-valued outputs: <script type="math/tex">y_i = x_i^T \beta + \epsilon</script> where <script type="math/tex">\epsilon \sim N(0, \sigma^2)</script>. This essentially means that our data has a linear relationship that is corrupted by random Gaussian noise that has zero mean and constant variance.</p>
<p>This has the implication that <script type="math/tex">y_i</script> is a Gaussian random variable, and we can compute its expectation and variance:</p>
<script type="math/tex; mode=display">E[y_i] = E[x_i^T \beta + \epsilon] = x_i^T \beta</script>
<script type="math/tex; mode=display">Var[y_i] = Var[x_i^T \beta + \epsilon] = \sigma^2</script>
<p>We can now write down the probability of observing a value <script type="math/tex">y_i</script> given a certain set of features <script type="math/tex">x_i</script>:</p>
<script type="math/tex; mode=display">p(y_i | x_i) = N(y_i | x_i^T \beta, \sigma^2)</script>
<p>Next, we can write down the probability of observing the entire dataset of <script type="math/tex">(x, y)</script> pairs. This is known as the likelihood, and it’s simply the product of the probabilities of observing each of the individual feature, label pairs:</p>
<script type="math/tex; mode=display">L(x,y) = \prod_{i = 1}^{N} N(y_i | x_i \beta, \sigma^2)</script>
<p>As a note, writing down the likelihood this way does assume that our training data are independent and identically distributed, meaning that we are assuming that each of the training samples have the same probability distribution, and are mutually independent.</p>
<p>If we want to find the <script type="math/tex">\hat{\beta}</script> that maximizes the chance of us observing the training examples that we observed, then it makes sense to maximize the above likelihood. This is known as <strong>maximum likelihood estimation</strong>, and is a common approach to many machine learning problems such as linear and logistic regression.</p>
<p>In other words, we want to find</p>
<script type="math/tex; mode=display">\hat{\beta} = argmax_{\beta} \prod_{i = 1}^{N} N(y_i | x_i \beta, \sigma^2)</script>
<p>To simplify this a little bit, we can write out the normal distribution, and also take the log of the function, since the <script type="math/tex">\hat{\beta}</script> that maximizes <script type="math/tex">L</script> will also maximize <script type="math/tex">log(L)</script>. We end up with</p>
<script type="math/tex; mode=display">\hat{\beta} = argmax_{\beta} \log \prod_{i = 1}^{N} \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(y_i - x_i \beta)^2}{2 \sigma^2}}</script>
<p>Distributing the log and dropping constants (since they don’t affect the value of our parameter which maximizes the expression), we obtain</p>
<script type="math/tex; mode=display">\hat{\beta} = argmax_{\beta} \sum_{i = 1}^{N} -(y_i - x_i \beta)^2</script>
<p>Since minimizing the negative of a function is the same as maximizing the function, we can turn the above into a minimization problem:</p>
<script type="math/tex; mode=display">\hat{\beta} = argmin_{\beta} \sum_{i = 1}^{N} (y_i - x_i \beta)^2</script>
<p>This is the familiar least squares estimator, which says that the optimal parameter is the one that minimizes the squared <script type="math/tex">L2</script> norm of the difference between the predictions and actual values. We can use gradient descent with some initial setting of <script type="math/tex">\beta</script> and, with a suitable learning rate, be guaranteed to get to a global minimum (since the function is convex), or we can explicitly solve for <script type="math/tex">\beta</script> and obtain the same answer.</p>
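<p>The claim that gradient descent and the explicit solution agree can be checked on synthetic data. The sketch below is illustrative only: the data, the true parameter, the learning rate, and the iteration count are all assumptions chosen so the example converges quickly.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)   # linear signal + Gaussian noise

# Explicit least squares solution.
beta_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient descent on the mean squared residual, starting from zero.
beta_gd = np.zeros(d)
learning_rate = 0.1
for _ in range(2000):
    gradient = (-2.0 / n) * (X.T @ (y - X @ beta_gd))
    beta_gd -= learning_rate * gradient
```

<p>Both estimates land on the same <script type="math/tex">\hat{\beta}</script>, and with this little noise they are also close to the parameter that generated the data.</p>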
<p>Right now is a good time to think about the assumptions of this linear regression model. Like many models, it assumes that the data are drawn independently from the same data generating distribution. Furthermore, it assumes that this distribution is normal with a linear mean and constant variance. It also has a more implicit assumption: that the parameter <script type="math/tex">\beta</script> which we wish to estimate is not a random variable itself, and we will show how relaxing this assumption leads to a regularized linear model.</p>
<h3 id="regularization">Regularization</h3>
<p>Regularization is a popular approach to reducing a model’s predisposition to overfit on the training data and thus hopefully increasing the generalization ability of the model. Previously, we sought to learn the optimal <script type="math/tex">h(x)</script> from the space of functions <script type="math/tex">H</script>. However, if the whole function space can be explored, and our samples were observed with some amount of noise, then the model will likely select a function that overfits on the observed data. One way we can combat this is by limiting our search to a subspace within <script type="math/tex">H</script>, and this is exactly what regularization does.</p>
<p>To regularize a model, we take our loss function and add a regularizer to it. Regularizers take the form <script type="math/tex">\lambda R(\beta)</script> where <script type="math/tex">R(\beta)</script> is some function of our parameters, and <script type="math/tex">\lambda</script> is a hyperparameter describing our regularization constant. Using this rule, we can write out a regularized version of our loss function above, giving us a model known as ridge regression:</p>
<script type="math/tex; mode=display">\hat{\beta} = argmin_{\beta} \sum_{i = 1}^{N} (y_i - x_i \beta)^2 + \lambda \sum_{j} \beta_j^2</script>
<p>What’s interesting about regularization is that it can be more deeply understood if we reconsider our original probabilistic model. In our original model, we conditioned our outputs on a linear function of the parameter which we wish to learn <script type="math/tex">\beta</script>. It turns out we often want to also consider <script type="math/tex">\beta</script> itself as a random variable, and impose a probability distribution on it. This is known as the <strong>prior</strong> probability distribution, because we assign <script type="math/tex">\beta</script> some probability without having observed the associated <script type="math/tex">(x, y)</script> pairs. Imposing a prior would be especially useful if we had some information about the parameter before observing any of the training data (possibly from domain knowledge), but it turns out that imposing a Gaussian prior even in the absence of actual prior knowledge leads to interesting properties. In particular, we can place a Gaussian prior with 0 mean and constant variance on <script type="math/tex">\beta</script> [1]:</p>
<script type="math/tex; mode=display">\beta \sim N(0, \lambda^{-1})</script>
<p>As a consequence, by Bayes’ rule the probability of a particular <script type="math/tex">\beta</script> given the observed <script type="math/tex">(x, y)</script> pairs is proportional to the likelihood of the data multiplied by the prior probability of the parameter. The prior appears once, outside the product over the data, and we obtain a new expression to maximize:</p>
<script type="math/tex; mode=display">L(x,y) = N(\beta | 0, \lambda^{-1}) \prod_{i = 1}^{N} N(y_i | x_i \beta, \sigma^2)</script>
<p>Similar to the previously discussed method of maximum likelihood estimation, we can estimate the parameter <script type="math/tex">\beta</script> to be the <script type="math/tex">\hat{\beta}</script> that maximizes the above function:</p>
<script type="math/tex; mode=display">\hat{\beta} = argmax_{\beta} \sum_{i = 1}^{N} log N(y_i | x_i \beta, \sigma^2) + log N(\beta | 0, \lambda^{-1})</script>
<p>This is the maximum a posteriori estimate of <script type="math/tex">\beta</script>, and it only differs from the maximum likelihood estimate in that the former takes into account previous information, or a prior distribution, on the parameter <script type="math/tex">\beta</script>. In fact, the maximum likelihood estimate of the parameter can be seen as a special case of the maximum a posteriori estimate, where we take the prior probability distribution on the parameter to just be a constant.</p>
<p>Since (dropping unneeded constants) <script type="math/tex">N(\beta | 0, \lambda^{-1}) \propto \exp(\frac{- \beta^{2}}{2 \lambda^{-1}})</script>, taking the log and minimizing the negative of the above function gives the familiar regularizer <script type="math/tex">\frac{1}{2} \lambda \beta^2</script>, while the likelihood term yields the same squared loss <script type="math/tex">\sum_{i = 1}^{N} (y_i - x_i \beta)^2</script> that we obtained without regularization. In this way, <script type="math/tex">L2</script> regularization on a linear model can be thought of as imposing a Bayesian prior on the underlying parameters which we wish to estimate.</p>
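<p>To make that last step concrete, here is a sketch of the negative log of the Gaussian likelihood multiplied by the Gaussian prior, written for a scalar <script type="math/tex">\beta</script> as in the expressions above. The constant <script type="math/tex">C</script> is introduced here only to collect every term that does not depend on <script type="math/tex">\beta</script>:</p>
<script type="math/tex; mode=display">-\log \left[ N(\beta | 0, \lambda^{-1}) \prod_{i = 1}^{N} N(y_i | x_i \beta, \sigma^2) \right] = \frac{1}{2 \sigma^2} \sum_{i = 1}^{N} (y_i - x_i \beta)^2 + \frac{\lambda}{2} \beta^2 + C</script>
<p>Multiplying through by <script type="math/tex">2 \sigma^2</script> does not change the minimizer, so under these assumptions the objective is the squared loss plus <script type="math/tex">\sigma^2 \lambda \beta^2</script>. In other words, a tighter prior (larger <script type="math/tex">\lambda</script>) or noisier observations (larger <script type="math/tex">\sigma^2</script>) both correspond to a stronger ridge penalty.</p>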
<h3 id="aside-interpreting-regularization-in-the-context-of-bias-and-variance">Aside: interpreting regularization in the context of bias and variance</h3>
<p>The error of a statistical model can be decomposed into three distinct sources of error: error due to bias, error due to variance, and irreducible error. They are related as follows:</p>
<script type="math/tex; mode=display">Err(x) = bias(x)^2 + var(x) + \epsilon</script>
<p>Given a constant error, this means that there will always be a tradeoff between bias and variance. Having too much bias or too much variance isn’t good for a model, but for different reasons. A high bias, low variance model will likely end up being inaccurate across both the training and testing datasets, and its predictions will likely not deviate too much based on the data sample it is trained on. On the other hand, a low-bias, high-variance model will likely give good results on a training dataset, but fail to generalize as well on a testing dataset.</p>
<p>The Gauss-Markov theorem states that in a linear regression problem, the least squares estimator has the lowest variance among all linear unbiased estimators. However, if we consider biased estimators such as the estimator given by ridge regression, we can arrive at a lower variance, higher-bias solution. In particular, the expectation of the ridge estimator (derived <a href="http://math.bu.edu/people/cgineste/classes/ma575/p/w14_1.pdf">here</a>) is given by:</p>
<script type="math/tex; mode=display">\beta - \lambda (X^TX + \lambda I)^{-1} \beta</script>
<p>The bias of an estimator is defined as the difference between the estimator’s expected value and the true parameter <script type="math/tex">\beta</script>: <script type="math/tex">bias(\hat{\beta}) = E[\hat{\beta}] - \beta</script></p>
<p>As you can see, the bias is controlled by <script type="math/tex">\lambda</script>: it is non-zero for any <script type="math/tex">\lambda > 0</script>, and <script type="math/tex">\lambda = 0</script> gives us the unbiased least squares estimator since <script type="math/tex">E[\hat{\beta}] = \beta</script>. In exchange for this bias, the ridge estimator has a lower variance than the least squares estimator. A more complete discussion, including formal calculations for the bias and variance of the ridge estimator compared to the least squares estimator, is given <a href="http://math.bu.edu/people/cgineste/classes/ma575/p/w14_1.pdf">here</a>.</p>
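<p>These quantities can be evaluated directly for a fixed design matrix. The sketch below assumes the standard closed forms for the mean and covariance of the two estimators under Gaussian noise; the design matrix, the true parameter, <code>lam</code>, and <code>sigma</code> are arbitrary values picked for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, sigma = 50, 3, 5.0, 1.0
X = rng.normal(size=(n, d))
beta = np.array([1.0, -2.0, 0.5])     # hypothetical true parameter

gram = X.T @ X
ridge_inv = np.linalg.inv(gram + lam * np.eye(d))

# E[beta_ridge] = (X^T X + lam I)^{-1} X^T X beta, because E[y] = X beta.
expected_ridge = ridge_inv @ gram @ beta
bias = expected_ridge - beta          # equals -lam (X^T X + lam I)^{-1} beta

# Total variance (trace of the covariance matrix) of each estimator.
var_ols = sigma**2 * np.trace(np.linalg.inv(gram))
var_ridge = sigma**2 * np.trace(ridge_inv @ gram @ ridge_inv)
```

<p>The bias matches the expression above, and the total variance of the ridge estimator comes out smaller than that of least squares, which is the tradeoff this section describes.</p>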
<h3 id="a-linear-algebra-perspective">A linear algebra perspective</h3>
<p>To see why regularization makes sense from a linear algebra perspective, we can write down our least squares estimate in vectorized form:</p>
<script type="math/tex; mode=display">argmin_{\beta} { (y - X\beta)^T (y - X \beta) }</script>
<p>Next, we can expand this and simplify a little bit:</p>
<script type="math/tex; mode=display">argmin_{\beta} (y^T - \beta^TX^T)(y - X\beta)</script>
<script type="math/tex; mode=display">= argmin_{\beta} -2y^TX\beta + \beta^TX^TX\beta</script>
<p>where we have dropped the terms that do not depend on <script type="math/tex">\beta</script> since they will zero out when we differentiate.</p>
<p>To minimize, we differentiate with respect to <script type="math/tex">\beta</script>:</p>
<script type="math/tex; mode=display">\frac{\partial L}{\partial \beta} = -2 X^Ty + 2X^TX\beta</script>
<p>Setting the derivative equal to zero gives us the closed form solution of <script type="math/tex">\beta</script> which is the least-squares estimate [2]:</p>
<script type="math/tex; mode=display">\hat{\beta} = (X^TX)^{-1} X^Ty</script>
<p>As we can see, in order to actually compute this quantity the matrix <script type="math/tex">X^T X</script> must be invertible. Since <script type="math/tex">X^T X</script> is always at least positive semidefinite (as shown below), it is invertible exactly when it is positive definite, which means that the scalar quantity <script type="math/tex">z^T X^T X z > 0</script> for any real, non-zero vector <script type="math/tex">z</script>. However, the best we can do in general is show that <script type="math/tex">X^T X</script> is positive semidefinite.</p>
<p>To show that <script type="math/tex">X^TX</script> is positive semidefinite, we must show that the quantity <script type="math/tex">z^T X^T X z \geq 0</script> for any real, non-zero vector <script type="math/tex">z</script>.</p>
<p>If we expand out the quantity <script type="math/tex">X^T X</script>, we obtain <script type="math/tex">\sum_{i = 1}^{N} x_i x_i^T</script> and it follows that the quantity <script type="math/tex">z^T (\sum_{i = 1}^{N} x_i x_i^T) z = \sum_{i = 1}^{N} (x_i^Tz)^2 \geq 0</script>. This means that in situations where this quantity is exactly <script type="math/tex">0</script> for some non-zero <script type="math/tex">z</script> (for example, when two features are perfectly collinear), the matrix <script type="math/tex">X^T X</script> cannot be inverted and a closed-form least squares solution cannot be computed.</p>
<p>On the other hand, expanding out our ridge estimate which has an extra regularization term <script type="math/tex">\lambda \sum_{i} \beta_i^2</script>, we obtain the derivative</p>
<script type="math/tex; mode=display">\frac{\partial L}{\partial \beta} = -2 X^Ty + 2X^TX\beta + 2 \lambda \beta</script>
<p>Setting this quantity equal to zero, and rewriting <script type="math/tex">\lambda \beta</script> as <script type="math/tex">\lambda I \beta</script> (using the property of multiplication with the identity matrix), we now obtain</p>
<script type="math/tex; mode=display">(X^TX + \lambda I) \beta = X^Ty</script>
<p>giving us the ridge estimate</p>
<script type="math/tex; mode=display">\hat{\beta}_{ridge} = (X^TX + \lambda I)^{-1} X^Ty</script>
<p>The only difference in this closed-form solution is the addition of the <script type="math/tex">\lambda I</script> term to the quantity that gets inverted, so we are now sure that this quantity is positive definite if <script type="math/tex">\lambda > 0</script>. In other words, even when the matrix <script type="math/tex">X^T X</script> is not invertible, we can still compute a ridge estimate from our data [3].</p>
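<p>A quick way to see this is to build a design matrix whose columns are perfectly collinear. The sketch below uses made-up data in which the second feature is just twice the first, so the matrix being inverted in the least squares solution is singular, while the ridge system can still be solved. The value of <code>lam</code> is an arbitrary choice.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 20, 0.1
x1 = rng.normal(size=n)
X = np.column_stack([x1, 2.0 * x1])    # second feature is a rescaled copy of the first
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

gram = X.T @ X
rank = np.linalg.matrix_rank(gram, tol=1e-8)   # 1 rather than 2: gram is singular

# Adding lam * I makes the matrix positive definite, so the system is solvable.
regularized = gram + lam * np.eye(2)
smallest_eig = np.linalg.eigvalsh(regularized).min()
beta_ridge = np.linalg.solve(regularized, X.T @ y)
```

<p>Solving the linear system with <code>np.linalg.solve</code> rather than forming the inverse explicitly is a common numerical choice; it computes the same ridge estimate.</p>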
<h3 id="regularizers-in-neural-networks">Regularizers in neural networks</h3>
<p>While techniques such as L2 regularization can be used while training a neural network, techniques such as dropout, which randomly discards some proportion of the activations at a per-layer level during training, have been shown to be much more successful. There is also a different type of regularizer that takes into account the idea that a neural network should have sparse activations for any particular input. There are several theoretical reasons for why sparsity is important, a topic covered very well by Glorot et al. in a <a href="http://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf">2011 paper</a>.</p>
<p>Since sparsity is important in neural networks, we can introduce a constraint that can guarantee us some degree of sparsity. Specifically, we can constrain the average activation of a particular neuron in a particular hidden layer.</p>
<p>In particular, the average activation of a neuron in a particular layer can be computed by averaging its activation over all <script type="math/tex">m</script> training inputs: <script type="math/tex">\hat{\rho} = \frac{1}{m} \sum_{i = 1}^{m} a(x_i)</script>, where <script type="math/tex">a(x_i)</script> denotes the neuron’s activation on the input <script type="math/tex">x_i</script>. Next, we can choose a hyperparameter <script type="math/tex">\rho</script> for this particular neuron, which represents the average activation we want it to have - for example, if we wanted this neuron to activate sparsely, we might set <script type="math/tex">\rho = 0.05</script>. In order to ensure that our model learns neurons which sparsely activate, we must incorporate some function of <script type="math/tex">\hat{\rho}</script> and <script type="math/tex">\rho</script> into our cost function.</p>
<p>One way to do this is with the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL divergence</a>, which measures how much one probability distribution (in this case, a Bernoulli distribution whose mean is our current average activation <script type="math/tex">\hat\rho</script>) diverges from another, expected probability distribution (a Bernoulli distribution with mean <script type="math/tex">\rho</script>). If we minimize the KL divergence for each of our neurons’ activations then our model will learn sparse activations. The cost function may be:</p>
<script type="math/tex; mode=display">J_{sparse} (W, b) = J(W, b) + \lambda \sum_{i = 1}^{M} KL(\rho_i || \hat{\rho_i})</script>
<p>where <script type="math/tex">J(W, b)</script> is a regular cost function used in neural networks, such as the cross-entropy loss. The hyperparameter <script type="math/tex">\lambda</script> indicates how important sparsity is to us - as <script type="math/tex">\lambda \rightarrow{} \infty</script>, we disregard the actual loss function and only aim to learn a sparse representation, and as <script type="math/tex">\lambda \rightarrow{} 0</script> we disregard the importance of sparse activations and only minimize the original loss function. Additional details on this type of regularization with application to sparse autoencoders are given <a href="http://ufldl.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity">here</a>.</p>
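<p>The sparsity penalty can be sketched in a few lines. This follows the Bernoulli form of the KL divergence used in the sparse autoencoder notes linked above; the function names and the example activations are hypothetical, and activations are assumed to lie strictly between 0 and 1 (for example, sigmoid outputs).</p>

```python
import numpy as np

def bernoulli_kl(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * np.log(rho / rho_hat)
            + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def sparsity_penalty(activations, rho=0.05, lam=1.0):
    """lam * sum over neurons of KL(rho || average activation).

    activations: (n_examples, n_neurons) array with entries in (0, 1),
                 e.g. the sigmoid outputs of one hidden layer.
    """
    rho_hat = activations.mean(axis=0)   # average activation per neuron
    return lam * bernoulli_kl(rho, rho_hat).sum()

sparse = np.full((4, 3), 0.05)   # every neuron already averages 0.05
dense = np.full((4, 3), 0.5)     # every neuron averages 0.5
```

<p>The penalty is essentially zero when the average activations already match <script type="math/tex">\rho</script> and grows as they move away from it; it would be added to the regular cost <script type="math/tex">J(W, b)</script> during training.</p>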
<h3 id="recap">Recap</h3>
<p>As we have seen, regularization can be interpreted in several different ways, each of which gives us additional insight into what exactly regularization accomplishes. A few of the different interpretations are:</p>
<p>1) As a Bayesian prior on the parameters which we are trying to learn.</p>
<p>2) As a term added to the loss function of our model which penalizes some function of our parameters, thereby introducing a tradeoff between minimizing the original loss function and ensuring our weights do not deviate too much from what we want them to be.</p>
<p>3) As a constraint on the model which we are trying to learn. This means we can take the original optimization problem and frame it in a constrained fashion, thereby ensuring that the magnitude of our weights never exceeds a certain threshold (in the case of <script type="math/tex">L2</script> regularization).</p>
<p>4) As a method of reducing the function search space <script type="math/tex">H</script> to a new function search space <script type="math/tex">H'</script> that is smaller than <script type="math/tex">H</script>. Without regularization, we may search for our optimal function <script type="math/tex">h</script> in a much larger space, and constraining this to a smaller subspace can lead us to select models with better generalization ability.</p>
<p>Overall, regularization is a useful technique that is often employed to reduce the overall variance of a model, thereby improving its generalization capability. Of course, there are tradeoffs in using regularization, most notably having to tune the hyperparameter <script type="math/tex">\lambda</script> which can be costly in terms of computational time. Thanks for reading!</p>
<h3 id="sources">Sources</h3>
<ol>
<li>
<p><a href="http://math.bu.edu/people/cgineste/classes/ma575/p/w14_1.pdf">Boston University Linear Models Course by Cedric Ginestet</a></p>
</li>
<li>
<p><a href="http://ufldl.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity">Autoencoders and Sparsity, Stanford UFLDL</a></p>
</li>
<li>
<p><a href="https://math.stackexchange.com/questions/1582348/simple-example-of-maximum-a-posteriori/1582407">Explanation of MAP Estimation</a></p>
</li>
</ol>
<p>[1] Imposing different prior distributions on the parameter leads to different types of regularization. A normal distribution with zero mean and constant variance leads to <script type="math/tex">L2</script> regularization, while a Laplace prior would lead to <script type="math/tex">L1</script> regularization.</p>
<p>[2] Technically, we’ve only shown that the <script type="math/tex">\hat{\beta}</script> we’ve found is a local optimum. We actually want to verify that this is indeed a global minimum, which can be done by showing that the function we are minimizing is convex.</p>
<p>[3] For completeness, it is worth mentioning that there are other solutions if the inverse of the matrix <script type="math/tex">X^T X</script> does not exist. One common workaround is to use the <a href="https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_pseudoinverse">Moore-Penrose Pseudoinverse</a> which can be computed using the singular value decomposition of the matrix being pseudo-inverted. This is commonly used in implementations of PCA algorithms.</p>Language Models, Word2Vec, and Efficient Softmax Approximations2017-07-02T00:00:00+00:002017-07-02T00:00:00+00:00http://rohan-varma.github.io/Word2Vec<p><img src="https://raw.githubusercontent.com/rohan-varma/paper-analysis/master/word2vec-papers/models.png" alt="img" /></p>
<h3 id="introduction">Introduction</h3>
<p>The Word2Vec model has become a standard method for representing words as dense vectors. This is typically done as a preprocessing step, after which the learned vectors are fed into a discriminative model (typically an RNN) to predict movie review sentiment, perform machine translation, or even generate text, <a href="https://github.com/karpathy/char-rnn">character by character</a>.</p>
<h3 id="previous-language-models">Previous Language Models</h3>
<p>Previously, the bag of words model was commonly used to represent words and sentences as numerical vectors, which could then be fed into a classifier (for example Naive Bayes) to produce output predictions. Given a vocabulary of <script type="math/tex">V</script> words and a document of <script type="math/tex">N</script> words, a <script type="math/tex">V</script>-dimensional vector would be created to represent the document, where index <script type="math/tex">i</script> denotes the number of times the <script type="math/tex">i</script>th word in the vocabulary occurred in the document.</p>
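<p>As a minimal sketch of this representation, the snippet below builds the count vector for a toy document. The four-word vocabulary, the lowercasing, and the whitespace tokenization are simplifying assumptions made for the example.</p>

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Return a V-dimensional count vector for `document`."""
    counts = Counter(document.lower().split())
    # Entry i is the number of times the i-th vocabulary word occurs.
    return [counts[word] for word in vocabulary]

vocabulary = ["the", "cat", "dog", "sat"]
vector = bag_of_words("The cat sat near the other cat", vocabulary)
# vector == [2, 2, 0, 1]
```

<p>Words outside the vocabulary (“near”, “other”) are simply dropped, and word order is lost entirely, which is the limitation discussed next.</p>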
<p>This model represented words as atomic units, assuming that all words were independent of each other. It had success in several fields such as document classification, spam detection, and even sentiment analysis, but its assumptions (that words are completely independent of each other) were too strong for more powerful and accurate models. A model that aimed to reduce some of the strong assumptions of the traditional bag of words model was the n-gram model.</p>
<h3 id="n-gram-models-and-markov-chains">N-gram models and Markov Chains</h3>
<p>Language models seek to predict the probability of observing the <script type="math/tex">t + 1</script>th word <script type="math/tex">w_{t + 1}</script> given the previous <script type="math/tex">t</script> words:</p>
<script type="math/tex; mode=display">p(w_{t + 1} | w_1, w_2, ... w_t)</script>
<p>Using the chain rule of probability, we can compute the probability of observing an entire sentence:</p>
<script type="math/tex; mode=display">p(w_1, w_2, ... w_t) = p(w_1)p(w_2 | w_1)...p(w_t | w_{t -1}, ... w_1)</script>
<p>Computing these probabilities has many applications, for example in speech recognition, spelling correction, and automatic sentence completion. However, estimating these probabilities can be tough. We can use the maximum likelihood estimate:</p>
<script type="math/tex; mode=display">p(x_{t + 1} | x_1, ... x_t) = \frac{count(x_1, x_2, ... x_t, x_{t + 1})}{count(x_1, x_2, ... x_t)}</script>
<p>However, computing this is quite unrealistic - we will generally not observe enough data from a corpus to obtain realistic counts for any sequence of <script type="math/tex">t</script> words for any nontrivial value of <script type="math/tex">t</script>, so we instead invoke the Markov assumption. The Markov assumption assumes that the probability of observing a word at a given time is only dependent on the word observed in the previous time step, and independent of the words observed in all earlier time steps:</p>
<script type="math/tex; mode=display">p(x_{t + 1} | x_1, x_2, ... x_t) = p(x_{t + 1} | x_t)</script>
<p>Therefore, the probability of a sentence can be given by</p>
<script type="math/tex; mode=display">p(w_1, w_2, ... w_t) = p(w_1)\prod_{i = 2}^{t} p(w_i | w_{i - 1})</script>
<p>The Markov assumption can be extended to condition the probability of the <script type="math/tex">t</script>th word on the previous two, three, four, and so on words. This is where the name of the n-gram model comes in - an n-gram model conditions the current word on the previous <script type="math/tex">n - 1</script> words. The unigram and bigram models, respectively, are given below.</p>
<script type="math/tex; mode=display">p(x_{t + 1} | x_{1}, x_{2}, ... x_{t}) = p(x_{t + 1})</script>
<script type="math/tex; mode=display">p(x_{t + 1} | x_{1}, x_{2}, ... x_{t}) = p(x_{t + 1} | x_{t})</script>
<p>There is a lot more to the n-gram model such as linear interpolation and smoothing techniques, which <a href="https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf">these slides</a> explain very well.</p>
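<p>The count-based maximum likelihood estimate for the bigram case can be sketched as follows. The tiny corpus is made up, and no smoothing is applied, so any unseen bigram gets probability zero.</p>

```python
from collections import Counter

def bigram_probability(tokens, previous, current):
    """MLE estimate: count(previous, current) / count(previous)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    unigram_counts = Counter(tokens[:-1])   # only words that have a successor
    if unigram_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, current)] / unigram_counts[previous]

tokens = "the cat sat on the mat the cat ran".split()
p_cat_given_the = bigram_probability(tokens, "the", "cat")   # 2 / 3
p_dog_given_the = bigram_probability(tokens, "the", "dog")   # 0.0, never observed
```

<p>The zero for the unseen bigram is exactly the sparsity problem that smoothing and interpolation are meant to address.</p>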
<h3 id="the-skip-gram-and-continuous-bag-of-words-models">The Skip-Gram and Continuous Bag of Words Models</h3>
<p>Word vectors, or word embeddings, or distributed representations of words, generally refer to a dense vector representation of a word, as compared to a sparse (i.e. one-hot) traditional representation. There are actually two different models that learn dense representations of words: the Skip-Gram model and the Continuous Bag of Words model. Both of these models learn dense vector representations of words, based on the words that surround them (i.e., their <em>context</em>).</p>
<p>The difference is that the skip-gram model predicts context (surrounding) words given the current word, whereas the continuous bag of words model predicts the current word based on several surrounding words.</p>
<p>This notion of “surrounding” words is best described by considering a center (or current) word and a window of words around it. For example, if we consider the sentence “The quick brown fox jumped over the lazy dog”, and a window size of 2, we’d have the following pairs for the skip-gram model:</p>
<p><img src="http://mccormickml.com/assets/word2vec/training_data.png" alt="img" /></p>
<p>Figure 1: Training Samples <a href="http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/">(Source, from Chris McCormick’s insightful post)</a></p>
<p>In contrast, for the CBOW model, we’ll input the context words within the window (such as “the”, “brown”, “fox”) and aim to predict the target word “quick” (simply reversing the input to prediction pipeline from the skip-gram model).</p>
<p>The following is a visualization of the skip-gram and CBOW models:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/paper-analysis/master/word2vec-papers/models.png" alt="img" /></p>
<p>Figure 2: CBOW vs Skip-gram models. <a href="https://arxiv.org/pdf/1301.3781.pdf">(Source)</a></p>
<p>In this <a href="https://arxiv.org/pdf/1301.3781.pdf">paper</a>, the overall recommendation was to use the skip-gram model, since it had been shown to perform better on analogy-related tasks than the CBOW model. Overall, if you understand one model, it is pretty easy to understand the other: just reverse the inputs and predictions. Since both papers focused on the skip-gram model, this post will do the same.</p>
<h3 id="learning-with-the-skip-gram-model">Learning with the Skip-Gram Model</h3>
<p>Our goal is to find word representations that are useful for predicting the surrounding words given a current word.
In particular, we wish to maximize the average log probability across our entire corpus:</p>
<script type="math/tex; mode=display">\arg\max_{\theta} \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log p(w_{t + j} | w_{t} ; \theta)</script>
<p>This equation essentially says that there is some probability <script type="math/tex">p</script> of observing a particular word that’s within a window of size <script type="math/tex">c</script> of the current word <script type="math/tex">w_t</script>. This probability is conditioned on the current word (<script type="math/tex">w_t</script>) and some setting of parameters <script type="math/tex">\theta</script> (determined by our model). We wish to set these parameters <script type="math/tex">\theta</script> so that this probability is maximized across our entire corpus.</p>
<h3 id="basic-parametrization-softmax-model">Basic Parametrization: Softmax Model</h3>
<p>The basic skip-gram model defines the probability <script type="math/tex">p</script> through the softmax function. If we consider <script type="math/tex">w_i</script> to be a one-hot encoded vector with dimension <script type="math/tex">N</script> and <script type="math/tex">\theta</script> to be a <script type="math/tex">K \times N</script> embedding matrix (here, we have <script type="math/tex">N</script> words in our vocabulary and our learned embeddings have dimension <script type="math/tex">K</script>), then <script type="math/tex">\theta w_i</script> is the embedding of word <script type="math/tex">w_i</script>. The paper uses separate input and output embeddings for the two roles a word can play; a single matrix is used here to keep the notation light. We can then define</p>
<script type="math/tex; mode=display">p(w_{i} | w_{t} ; \theta) = \frac{\exp((\theta w_i)^T (\theta w_t))}{\sum_{j=1}^{N} \exp((\theta w_j)^T (\theta w_t))}</script>
<p>It is worth noting that after learning, the matrix <script type="math/tex">\theta</script> can be thought of as an embedding lookup matrix. If you have a word that is represented with the <script type="math/tex">k</script>th index of a vector being hot, then the learned embedding for that word will be the <script type="math/tex">k</script>th column. This parametrization has a major disadvantage that limits its usefulness in cases of very large corpora. Specifically, in order to compute a single forward pass of our model, we must sum across the entire vocabulary in order to evaluate the softmax function. This is prohibitively expensive on large datasets, so we look to alternate approximations of this model for the sake of computational efficiency.</p>
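<p>Here is a rough NumPy sketch of that forward pass, mainly to show where the cost comes from. It assumes an embedding matrix with one column per word and, as a simplification, uses the same matrix to score both the center word and the candidate context words. The sizes and the name <code class="highlighter-rouge">context_distribution</code> are made up for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 10, 4                     # hypothetical vocabulary size and embedding dimension
theta = rng.normal(size=(K, N))  # one K-dimensional column per word

def context_distribution(theta, center_index):
    """Softmax over the whole vocabulary, given the index of the center word."""
    center_vec = theta[:, center_index]   # embedding lookup: pick a column
    scores = theta.T @ center_vec         # one score per vocabulary word (N of them)
    scores = scores - scores.max()        # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # the normalizer touches all N words

p = context_distribution(theta, center_index=3)
print(p.shape, p.sum())
```

<p>Both the score computation and the normalizer scale linearly with the vocabulary size, which is what the approximations in the following sections try to avoid.</p>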
<h3 id="hierarchical-softmax">Hierarchical Softmax</h3>
<p>As discussed, the traditional softmax approach can become prohibitively expensive on large corpora, and the hierarchical softmax is a common alternative approach that approximates the softmax computation, but has logarithmic time complexity in the number of words in the vocabulary, as opposed to linear time complexity.</p>
<p>This is done by representing the softmax layer as a binary tree where the words are leaf nodes of the tree, and the probabilities are computed by a walk from the root of the binary tree to the particular leaf. An example of the binary tree of the hierarchical layer is given below:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/paper-analysis/master/word2vec-papers/hierarchical.png" alt="img" /></p>
<p>Figure 3: Hierarchical Softmax Tree. <a href="https://www.youtube.com/watch?v=B95LTf2rVWM">(Source)</a></p>
<p>At each node in the tree starting from the root, we would like to predict the probability of branching right given the observed context. Therefore, in the above tree, if we would like to compute the probability of observing the word “cat” given a certain context, we would define it as the product of the probabilities of going left at node 1, then going right at node 2, and then again going right at node 5 (each conditioned on the context).</p>
<p>The actual computation to determine the probability of a word is done by taking the output of the previous layer, applying a set of node-specific weights and biases to it, and running that result through a non-linearity (often sigmoidal). The following image is an illustration of the process of computing the probability of the word “cat” given an observed context:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/paper-analysis/master/word2vec-papers/hierarchical2.png" alt="img" /></p>
<p>Figure 4: Hierarchical Softmax Computation. <a href="https://www.youtube.com/watch?v=B95LTf2rVWM">(Source)</a></p>
<p>Here, <script type="math/tex">V</script> is our matrix of weights connecting the outputs of our previous layer (denoted by <script type="math/tex">h(x)</script>) to our hierarchical layer, and the probability of branching right at node <script type="math/tex">n</script> is given by <script type="math/tex">\sigma(h(x)W_n + b_n)</script>, where <script type="math/tex">W_n</script> and <script type="math/tex">b_n</script> are the weights and bias specific to that node. The probability of observing a particular word is then just the product of the branch probabilities along the path that leads to it.</p>
<p>In the above image, we also notice that in a vocabulary of 8 words, we only needed 3 computations to approximate the softmax computation as opposed to 8. More generally, hierarchical softmax reduces the cost of evaluating the probability of a single word to <script type="math/tex">O(\log_2 n)</script>, where <script type="math/tex">n</script> is our vocabulary size, compared to linear time for the traditional softmax approach. However, this speedup is only useful for training, when we don’t need to know the full probability distribution. In settings where we wish to emit the most likely word given a context (for example, in sentence generation), we’d still need to compute the probability of all of the words given the context, resulting in no speedup (although some methods, such as pruning a branch when its probability quickly tends to zero, can certainly increase efficiency).</p>
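<p>The walk down the tree can be sketched in NumPy as follows. This is only an illustration under a few assumptions: the tree is a complete binary tree with 8 leaves, nodes use heap numbering (the root is node 1, and the children of node <script type="math/tex">n</script> are <script type="math/tex">2n</script> and <script type="math/tex">2n + 1</script>), and the weights are random rather than learned. Under this numbering, going left at node 1, right at node 2, and right at node 5 lands on leaf index 3.</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def word_probability(h, W, b, leaf, n_leaves=8):
    """Probability of reaching `leaf` (0-based) in a complete binary tree.

    W[n] and b[n] hold the weights and bias of internal node n (heap numbering).
    """
    node = leaf + n_leaves  # heap index of the leaf
    prob = 1.0
    while node > 1:         # walk from the leaf back up to the root
        parent = node // 2
        p_right = sigmoid(h @ W[parent] + b[parent])
        went_right = node % 2 == 1
        prob *= p_right if went_right else 1.0 - p_right
        node = parent
    return prob

rng = np.random.default_rng(0)
hidden_dim, n_leaves = 5, 8
h = rng.normal(size=hidden_dim)              # stand-in for the previous layer's output
W = rng.normal(size=(n_leaves, hidden_dim))  # row 0 is unused; rows 1..7 are internal nodes
b = rng.normal(size=n_leaves)

probs = [word_probability(h, W, b, leaf) for leaf in range(n_leaves)]
print(sum(probs))  # the leaf probabilities form a distribution
```

<p>Each call only evaluates 3 sigmoids for 8 words, and the probabilities over all leaves still sum to 1 because every internal node splits its probability mass between its two children.</p>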
<h3 id="negative-sampling-and-noise-contrastive-estimation">Negative Sampling and Noise Contrastive Estimation</h3>
<p>Multinomial softmax regression is expensive when we are computing softmax across many different classes (each word essentially denotes a separate class). The core idea of Noise Contrastive Estimation (NCE) is to convert a multiclass classification problem into one of binary classification via logistic regression, while still retaining the quality of word vectors learned. With NCE, word vectors are no longer learned by attempting to predict the context words from the target word. Instead we learn word vectors by learning how to distinguish true pairs of (target, context) words from corrupted (target, random word from vocabulary) pairs. The idea is that if a model can distinguish between actual pairs of target and context words from random noise, then good word vectors will be learned.</p>
<p>Specifically, for each positive sample (ie, true target/context pair) we present the model with <script type="math/tex">k</script> negative samples drawn from a noise distribution. For small to average size training datasets, a value for <script type="math/tex">k</script> between 5 and 20 was recommended, while for very large datasets a smaller value of <script type="math/tex">k</script> between 2 and 5 suffices. Our model only has a single output node, which predicts whether the pair was just random noise or actually a valid target/context pair. The noise distribution itself is a free parameter, but the paper found that the unigram distribution raised to the power <script type="math/tex">3/4</script> worked better than other distributions, such as the unigram and uniform distributions.</p>
<p>The main difference between NCE and negative sampling is the choice of noise distribution - the paper used a distribution (discussed above) that samples less frequently occurring words more often than the plain unigram distribution would. Moreover, NCE approximately maximizes the log probability across the entire corpus (so it is a good approximation of softmax regression), but this does not hold for negative sampling (although negative sampling still learns quality word vectors).</p>
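<p>As a small illustration of the noise distribution, the sketch below raises unigram counts to the power <script type="math/tex">3/4</script>, renormalizes, and draws <script type="math/tex">k</script> negative samples. The counts are invented, and a real implementation would also avoid drawing the true context word, which is omitted here.</p>

```python
import numpy as np

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power` and renormalized to sum to 1."""
    weights = np.asarray(counts, dtype=float) ** power
    return weights / weights.sum()

counts = np.array([1000, 100, 10, 1])  # hypothetical word counts, most frequent first
unigram = counts / counts.sum()
noise = noise_distribution(counts)

rng = np.random.default_rng(0)
k = 5                                  # number of negative samples per positive pair
negatives = rng.choice(len(counts), size=k, p=noise)

print(unigram.round(4))
print(noise.round(4))
print(negatives)
```

<p>Compared to the unigram distribution, the most frequent word loses some probability mass and the rarest word gains some, which is the flattening effect described above.</p>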
<h3 id="practical-considerations">Practical Considerations</h3>
<p><strong>Implementing Softmax</strong>: If you’re implementing your own softmax function, it’s important to consider overflow issues. Specifically, the computation <script type="math/tex">\sum_j e^{z_j}</script> can easily overflow, leading to <code class="highlighter-rouge">NaN</code> values while training. To resolve this issue, we can instead compute the equivalent <script type="math/tex">\frac{e^{z_i + k}}{\sum_j e^{z_j + k}}</script> and set <script type="math/tex">k = -\max_j z_j</script> so that the largest exponent is zero, avoiding overflow issues.</p>
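<p>A minimal NumPy version of this trick is sketched below; <code class="highlighter-rouge">stable_softmax</code> is just an illustrative name.</p>

```python
import numpy as np

def stable_softmax(z):
    """Softmax of a 1-D array, shifted so that the largest exponent is zero."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()  # equivalent to adding k = -max(z)
    exp_shifted = np.exp(shifted)
    return exp_shifted / exp_shifted.sum()

z = np.array([1000.0, 1001.0, 1002.0])  # np.exp(1000.0) alone would overflow to inf
p = stable_softmax(z)
print(p, p.sum())
```

<p>Because softmax is invariant to adding a constant to every input, the result is the same as the softmax of <code class="highlighter-rouge">[0, 1, 2]</code>.</p>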
<p><strong>Subsampling of frequent words</strong>: We don’t get much information from very frequent words such as “the”, “it”, and the like. There will be many more pairs of (the, French) as opposed to (France, French) but we’re more interested in the latter pair. Therefore, it would be useful to subsample some of the more frequent words. We would also like to do this proportionally: very common words are sampled out with high probability, and uncommon words are not sampled out.</p>
<p>In order to do this, the paper defines the probability of discarding a particular word as <script type="math/tex">p(w_i) = 1 - \sqrt{\frac{t}{freq(w_i)}}</script> where <script type="math/tex">t</script> is a chosen threshold, taken in the paper to be <script type="math/tex">10^{-5}</script>. This discarding function will cause words that appear with a frequency much greater than <script type="math/tex">t</script> to be sampled out with a high probability, while words that appear with a frequency of less than or equal to <script type="math/tex">t</script> will not be sampled out (the expression is then zero or negative). For example, if <script type="math/tex">t = 10^{-5}</script> and a particular word covers <script type="math/tex">0.1\%</script> of the corpus, then each instance of that word will be discarded from the training corpus with probability <script type="math/tex">1 - \sqrt{10^{-2}} = 0.9</script>.</p>
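<p>The worked example can be checked in a couple of lines. This sketch uses the square-root form of the discard probability from the paper, which is the form that reproduces the 0.9 in the example; the function name and the clipping at zero are assumptions added for illustration.</p>

```python
import math

def discard_probability(freq, t=1e-5):
    """Probability of dropping one occurrence of a word with relative frequency `freq`."""
    # Clip at zero: words at or below the threshold are never discarded.
    return max(0.0, 1.0 - math.sqrt(t / freq))

print(discard_probability(1e-3))  # a word covering 0.1% of the corpus
print(discard_probability(1e-6))  # a rare word below the threshold
```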
<h3 id="conclusion">Conclusion</h3>
<p>We have discussed language models including the bag of words model, the n-gram model, and the word2vec model along with changes to the softmax layer in order to more efficiently compute word embeddings. The paper presented empirical results that indicated that negative sampling outperforms hierarchical softmax and (slightly) outperforms NCE on analogical reasoning tasks. Overall, word2vec is one of the most commonly used models for learning dense word embeddings to represent words, and these vectors have several interesting properties (such as additive compositionality). Once these word vectors are learned, they can be a more powerful representation than the typical one-hot encodings when used as inputs into RNNs/LSTMs for applications such as machine translation or sentiment analysis. Thanks for reading! A discussion on Hacker News can be found <a href="https://news.ycombinator.com/item?id=15578788">here</a>.</p>
<h3 id="sources">Sources</h3>
<ul>
<li><a href="https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf">Distributed Representations of Words and Phrases</a> - the main paper discussed.</li>
<li><a href="https://www.youtube.com/watch?v=B95LTf2rVWM">Hierarchical Output Layer Video by Hugo Larochelle</a> - an excellent video going into great detail about hierarchical softmax.</li>
<li><a href="https://arxiv.org/pdf/1402.3722v1.pdf">Word2Vec explained</a> - a meta-paper explaining the word2vec paper</li>
<li><a href="http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/">Chris McCormick’s Word2Vec Tutorial</a></li>
<li><a href="https://www.quora.com/Word2vec-How-can-hierarchical-soft-max-training-method-of-CBOW-guarantee-its-self-consistence">Stephan Gouws’s Quora answer on Hierarchical Softmax</a> - an insightful answer about the hierarchical output layer</li>
<li><a href="http://sebastianruder.com/word-embeddings-1/">Word Embeddings Post by Sebastian Ruder</a> - an informative post covering word embeddings and language modelling.</li>
<li><a href="https://arxiv.org/pdf/1301.3781.pdf">Efficient estimation of word representations</a> - another key word2vec paper discussing the differences (both from an architecture perspective and in empirical results) between the continuous bag of words and skip-gram models.</li>
</ul>Creating Neural Networks in Tensorflow2017-05-16T00:00:00+00:002017-05-16T00:00:00+00:00http://rohan-varma.github.io/Neural-Net-Tensorflow<p>This is a write-up and code tutorial that I wrote for an AI workshop given at UCLA, at which I gave a talk on neural networks and implementing them in Tensorflow. It’s part of a series on machine learning with Tensorflow, and the tutorials for the rest of them are available <a href="https://github.com/uclaacmai/tf-workshop-series">here</a>.</p>
<h3 id="recap-the-learning-problem">Recap: The Learning Problem</h3>
<p>We have a large dataset of <script type="math/tex">(x, y)</script> pairs where <script type="math/tex">x</script> denotes a vector of features and <script type="math/tex">y</script> denotes the label for that feature vector. We want to learn a function <script type="math/tex">h(x)</script> that maps features to labels, with good generalization accuracy. We do this by minimizing a loss function computed on our dataset: <script type="math/tex">\sum_{i=1}^{N} L(y_i, h(x_i))</script>. There are many loss functions we can choose. We have gone over the cross-entropy loss and variants of the squared error loss functions in previous workshops, and we will once again consider those today.</p>
<h3 id="review-a-single-neuron-aka-the-perceptron">Review: A Single “Neuron”, aka the Perceptron</h3>
<p><img src="https://raw.githubusercontent.com/rohan-varma/rohan-blog/gh-pages/images/perceptron.png" alt="perceptron" /></p>
<p>A single perceptron first calculates a <strong>weighted sum</strong> of our inputs. This means that we multiply each of our features <script type="math/tex">(x_1, x_2, ... x_n)</script> with an associated weight <script type="math/tex">(w_1, w_2, ... w_n)</script>, sum the results, and add a bias <script type="math/tex">b</script>. We then take the sign of this linear combination, which tells us whether to classify this instance as a positive or negative example.</p>
<script type="math/tex; mode=display">h(x) = sign(w^Tx + b)</script>
<p>We then moved on to logistic regression, where we changed our sign function to instead be a sigmoid (<script type="math/tex">\sigma</script>) function. As a reminder, here’s the sigmoid function:</p>
<p><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/600px-Logistic-curve.svg.png" alt="sigmoid" /></p>
<p>Therefore, the function we compute for logistic regression is <script type="math/tex">h(x) = \sigma (w^Tx + b)</script>.</p>
<p>The sigmoid function is commonly referred to as an “activation” function. When we say that a “neuron computes an activation function”, it means that a standard linear combination is calculated (<script type="math/tex">w^Tx + b</script>) and then we apply a <em>nonlinear</em> function to it, such as the sigmoid function.</p>
<p>Here are a few other common activation functions:</p>
<p><img src="http://www.dplot.com/functions/tanh.png" alt="tanh" />
<img src="https://i.stack.imgur.com/8CGlM.png" alt="relu" /></p>
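<p>To tie the review together, here is a small NumPy sketch of a single neuron: one weighted sum, followed by the different functions we could apply to it. The weights, bias, and input are made-up numbers.</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.0, 2.0])  # hypothetical weights
b = 0.1                         # hypothetical bias
x = np.array([1.0, 2.0, 0.5])   # one feature vector

z = w @ x + b                   # weighted sum: 0.5 - 2.0 + 1.0 + 0.1 = -0.4

perceptron_output = np.sign(z)  # hard positive/negative decision
logistic_output = sigmoid(z)    # squashed into (0, 1)
tanh_output = np.tanh(z)        # squashed into (-1, 1)
relu_output = np.maximum(0.0, z)  # negative inputs are clipped to zero

print(perceptron_output, logistic_output, tanh_output, relu_output)
```

<p>The linear part is identical in every case; only the function applied to <code class="highlighter-rouge">z</code> changes.</p>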
<h3 id="review-from-binary-to-multi-class-classification">Review: From binary to multi-class classification</h3>
<p>The most important change in moving from a binary (negative/positive) classification model to one that can classify training instances into many different classes (say, 10, for MNIST) is that our vector of weights <script type="math/tex">w</script> changes into a matrix <script type="math/tex">W</script>.</p>
<p>Each row of weights we learn represents the parameters for a certain class:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/rohan-blog/gh-pages/images/imagemap.jpg" alt="weights" /></p>
<p>We also want to take our output and normalize the results so that they all sum to one, so that we can interpret them as probabilities. This is commonly done using the <em>softmax</em> function, which takes in a vector and returns another vector whose elements are positive and sum to 1, with each element proportional to the exponential of the corresponding element of the original vector. In binary classification we used the sigmoid function to compute probabilities. Now since we have a vector, we use the softmax function.</p>
<p>Here is our current model of learning, then:</p>
<p><script type="math/tex">h(x) = softmax(Wx + b)</script>.</p>
<h3 id="building-up-the-neural-network">Building up the neural network</h3>
<p>Now that we’ve figured out how to linearly model multi-class classification, we can create a basic neural network. Consider what happens when we combine the idea of artificial neurons with our softmax classifier. Instead of computing a linear function <script type="math/tex">Wx + b</script> and immediately passing the output to a softmax function, we have an intermediate step: pass the output of our linear combination to a vector of artificial neurons, which each compute a nonlinear function.</p>
<p>The output of this “layer” of neurons can be multiplied with a matrix of weights again, and we can apply our softmax function to this result to produce our predictions.</p>
<p><strong>Original function</strong>: <script type="math/tex">h(x) = softmax(Wx + b)</script></p>
<p><strong>Neural Network function</strong>: <script type="math/tex">h(x) = softmax(W_2(nonlin(W_1x + b_1)) + b_2)</script></p>
<p>The key differences are that we have more biases and weights, as well as a larger composition of functions. This function is harder to optimize, and introduces a few interesting ideas about learning the weights with an algorithm known as backpropagation.</p>
<p>This “intermediate step” is actually known as a hidden layer, and we have complete control over it, meaning that among other things, we can vary the number of parameters or connections between weights and neurons to obtain an optimal network. It’s also important to notice that we can stack an arbitrary number of these hidden layers between the input and output of our network, and we can tune these layers individually. This lets us make our network as deep as we want it. For example, here’s what a neural network with two hidden layers would look like:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/rohan-blog/gh-pages/images/neuralnet.png" alt="neuralnet" /></p>
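<p>Before moving to Tensorflow, the two functions above can be compared in a few lines of NumPy. This is a sketch with random, untrained weights and a single hidden layer; the sizes mirror MNIST but are otherwise arbitrary, and relu is used as the nonlinearity.</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
n_features, n_hidden, n_classes = 784, 50, 10  # MNIST-like sizes, chosen for illustration
x = rng.normal(size=n_features)                # one random "image"

# Original function: softmax(Wx + b)
W = rng.normal(scale=0.01, size=(n_classes, n_features))
b = np.zeros(n_classes)
linear_prediction = softmax(W @ x + b)

# Neural network function: softmax(W_2 nonlin(W_1 x + b_1) + b_2)
W_1 = rng.normal(scale=0.01, size=(n_hidden, n_features))
b_1 = np.zeros(n_hidden)
W_2 = rng.normal(scale=0.01, size=(n_classes, n_hidden))
b_2 = np.zeros(n_classes)
hidden = relu(W_1 @ x + b_1)                   # activations of the hidden layer
network_prediction = softmax(W_2 @ hidden + b_2)

print(linear_prediction.shape, network_prediction.shape)
```

<p>Both versions output a vector of 10 class probabilities; the network just inserts one extra matrix multiply and nonlinearity in between.</p>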
<p>We’re now ready to start implementing a basic neural network in Tensorflow. First, let’s start off with the standard <code class="highlighter-rouge">import</code> statements, and visualize a few examples from our training dataset.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">from</span> <span class="nn">tensorflow.examples.tutorials.mnist</span> <span class="kn">import</span> <span class="n">input_data</span>
<span class="n">mnist</span> <span class="o">=</span> <span class="n">input_data</span><span class="o">.</span><span class="n">read_data_sets</span><span class="p">(</span><span class="s">'MNIST_data'</span><span class="p">,</span> <span class="n">one_hot</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="c"># reads in the MNIST dataset</span>
<span class="c"># a function that shows examples from the dataset. If num is specified (between 0 and 9), then only pictures with that label will be used</span>
<span class="k">def</span> <span class="nf">show_pics</span><span class="p">(</span><span class="n">mnist</span><span class="p">,</span> <span class="n">num</span> <span class="o">=</span> <span class="bp">None</span><span class="p">):</span>
    <span class="n">to_show</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">))</span> <span class="k">if</span> <span class="n">num</span> <span class="ow">is</span> <span class="bp">None</span> <span class="k">else</span> <span class="p">[</span><span class="n">num</span><span class="p">]</span><span class="o">*</span><span class="mi">10</span> <span class="c"># figure out which numbers we should show</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
        <span class="n">batch</span> <span class="o">=</span> <span class="n">mnist</span><span class="o">.</span><span class="n">train</span><span class="o">.</span><span class="n">next_batch</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="c"># gets some examples</span>
        <span class="n">pic</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="n">batch</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">batch</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
        <span class="k">if</span> <span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">label</span><span class="p">)</span> <span class="ow">in</span> <span class="n">to_show</span><span class="p">:</span>
            <span class="c"># use matplotlib to plot it</span>
            <span class="n">pic</span> <span class="o">=</span> <span class="n">pic</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="mi">28</span><span class="p">,</span><span class="mi">28</span><span class="p">))</span>
            <span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">"Label: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">label</span><span class="p">)))</span>
            <span class="n">plt</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">pic</span><span class="p">,</span> <span class="n">cmap</span> <span class="o">=</span> <span class="s">'binary'</span><span class="p">)</span>
            <span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
            <span class="n">to_show</span><span class="o">.</span><span class="n">remove</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">label</span><span class="p">))</span>
<span class="c">#show_pics(mnist)</span>
<span class="n">show_pics</span><span class="p">(</span><span class="n">mnist</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
</code></pre></div></div>
<p><img src="https://raw.githubusercontent.com/uclaacmai/tf-workshop-series/master/week6-neural-nets/Neural%20Network%20Tensorflow_files/Neural%20Network%20Tensorflow_1_1.png" alt="png" /></p>
<p><img src="https://raw.githubusercontent.com/uclaacmai/tf-workshop-series/master/week6-neural-nets/Neural%20Network%20Tensorflow_files/Neural%20Network%20Tensorflow_1_2.png" alt="png" /></p>
<p><img src="https://raw.githubusercontent.com/uclaacmai/tf-workshop-series/master/week6-neural-nets/Neural%20Network%20Tensorflow_files/Neural%20Network%20Tensorflow_1_3.png" alt="png" /></p>
<p><img src="https://raw.githubusercontent.com/uclaacmai/tf-workshop-series/master/week6-neural-nets/Neural%20Network%20Tensorflow_files/Neural%20Network%20Tensorflow_1_4.png" alt="png" /></p>
<p><img src="https://raw.githubusercontent.com/uclaacmai/tf-workshop-series/master/week6-neural-nets/Neural%20Network%20Tensorflow_files/Neural%20Network%20Tensorflow_1_5.png" alt="png" /></p>
<p><img src="https://raw.githubusercontent.com/uclaacmai/tf-workshop-series/master/week6-neural-nets/Neural%20Network%20Tensorflow_files/Neural%20Network%20Tensorflow_1_6.png" alt="png" /></p>
<p>As usual, we would like to define several variables to represent our weight matrices and our biases. We will also need to create placeholders to hold our actual data. Anytime we want to create variables or placeholders, we must have a sense of the <strong>shape</strong> of our data so that Tensorflow has no issues in carrying out the numerical computations.</p>
<p>In addition, neural networks rely on various hyperparameters, some of which will be defined below. Two important ones are the <strong>learning rate</strong> and the number of neurons in our hidden layer. Depending on these settings, the accuracy of the network may greatly change.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># some functions for quick variable creation</span>
<span class="k">def</span> <span class="nf">weight_variable</span><span class="p">(</span><span class="n">shape</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">tf</span><span class="o">.</span><span class="n">Variable</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">truncated_normal</span><span class="p">(</span><span class="n">shape</span><span class="p">,</span> <span class="n">stddev</span> <span class="o">=</span> <span class="mf">0.1</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">bias_variable</span><span class="p">(</span><span class="n">shape</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">tf</span><span class="o">.</span><span class="n">Variable</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">constant</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span> <span class="n">shape</span> <span class="o">=</span> <span class="n">shape</span><span class="p">))</span>
<span class="c"># hyperparameters we will use</span>
<span class="n">learning_rate</span> <span class="o">=</span> <span class="mf">0.1</span>
<span class="n">hidden_layer_neurons</span> <span class="o">=</span> <span class="mi">50</span>
<span class="n">num_iterations</span> <span class="o">=</span> <span class="mi">5000</span>
<span class="c"># placeholder variables</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">placeholder</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">shape</span> <span class="o">=</span> <span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="mi">784</span><span class="p">])</span> <span class="c"># none = the size of that dimension doesn't matter. why is that okay here? </span>
<span class="n">y_</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">placeholder</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">shape</span> <span class="o">=</span> <span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
</code></pre></div></div>
<p>We will now actually create all of the variables we need, and define our neural network as a series of function computations.</p>
<p>In our first layer, we take our inputs that have dimension <script type="math/tex">n * 784</script>, and multiply them with weights that have dimension <script type="math/tex">784 * k</script>, where <script type="math/tex">k</script> is the number of neurons in the hidden layer. We then add the biases to this result, which also have a dimension of <script type="math/tex">k</script>.</p>
<p>Finally, we apply a nonlinearity to our result. There are, as discussed, several choices, three of which are tanh, sigmoid, and rectifier. We have chosen to use the rectifier (also known as relu, standing for Rectified Linear Unit), since both research and practice have shown that it tends to outperform and learn faster than other activation functions.</p>
<p>Therefore, the “activations” of our hidden layer are given by <script type="math/tex">h_1 = relu(xW_1 + b_1)</script>.</p>
<p>We follow a similar procedure for our output layer. Our activations have a shape <script type="math/tex">n * k</script>, where <script type="math/tex">n</script> is the number of training examples we input into our network and <script type="math/tex">k</script> is the number of neurons in our hidden layer.</p>
<p>We want our final outputs to have dimension <script type="math/tex">n * 10</script> (in the case of MNIST) since we have 10 classes. Therefore, it makes sense for our second matrix of weights to have dimension <script type="math/tex">k * 10</script> and the bias to have dimension <script type="math/tex">10</script>.</p>
<p>After taking the linear combination <script type="math/tex">h_1 W_2 + b_2</script>, we would then apply the softmax function. However, applying the softmax function and then writing out the cross-entropy loss ourselves could result in numerical instability, so we will instead use a library call that computes both the softmax outputs and the cross entropy loss.</p>
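<p>To see why fusing the two steps helps, here is a rough NumPy sketch of a cross-entropy computed directly from the unnormalized outputs (the “logits”) using the log-sum-exp trick. It illustrates the idea only and is not Tensorflow’s actual implementation; the function name and the example values are made up.</p>

```python
import numpy as np

def cross_entropy_from_logits(logits, one_hot_labels):
    """Mean cross-entropy over a batch, without forming the softmax explicitly."""
    # Shift each row by its max so that the exponentials cannot overflow.
    shifted = logits - logits.max(axis=1, keepdims=True)
    # log softmax = shifted - log(sum(exp(shifted)))
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(one_hot_labels * log_probs).sum(axis=1).mean()

logits = np.array([[1000.0, 0.0, -1000.0],  # extreme values that would break a naive softmax
                   [2.0, 1.0, 0.1]])
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
loss = cross_entropy_from_logits(logits, labels)
print(loss)
```

<p>A naive version would compute <code class="highlighter-rouge">np.exp(1000.0)</code>, get <code class="highlighter-rouge">inf</code>, and end up with <code class="highlighter-rouge">NaN</code> in the loss; working in log space avoids that.</p>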
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># create our weights and biases for our first hidden layer</span>
<span class="n">W_1</span><span class="p">,</span> <span class="n">b_1</span> <span class="o">=</span> <span class="n">weight_variable</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="n">hidden_layer_neurons</span><span class="p">]),</span> <span class="n">bias_variable</span><span class="p">([</span><span class="n">hidden_layer_neurons</span><span class="p">])</span>
<span class="c"># compute activations of the hidden layer</span>
<span class="n">h_1</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">W_1</span><span class="p">)</span> <span class="o">+</span> <span class="n">b_1</span><span class="p">)</span>
<span class="n">W_2_hidden</span> <span class="o">=</span> <span class="n">weight_variable</span><span class="p">([</span><span class="n">hidden_layer_neurons</span><span class="p">,</span> <span class="mi">30</span><span class="p">])</span>
<span class="n">b_2_hidden</span> <span class="o">=</span> <span class="n">bias_variable</span><span class="p">([</span><span class="mi">30</span><span class="p">])</span>
<span class="n">h_2</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">h_1</span><span class="p">,</span> <span class="n">W_2_hidden</span><span class="p">)</span> <span class="o">+</span> <span class="n">b_2_hidden</span><span class="p">)</span>
<span class="c"># create our weights and biases for our output layer</span>
<span class="n">W_2</span><span class="p">,</span> <span class="n">b_2</span> <span class="o">=</span> <span class="n">weight_variable</span><span class="p">([</span><span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">]),</span> <span class="n">bias_variable</span><span class="p">([</span><span class="mi">10</span><span class="p">])</span>
<span class="c"># compute the logits of the output layer</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">h_2</span><span class="p">,</span><span class="n">W_2</span><span class="p">)</span> <span class="o">+</span> <span class="n">b_2</span>
</code></pre></div></div>
<p>The cross-entropy loss function is a commonly used loss function. For a single prediction/label pair, it is given by <script type="math/tex">C(y, h(x)) = -\sum_i y_i \log(h(x_i))</script>.*</p>
<p>Here, <script type="math/tex">y</script> is a specific one-hot encoded label vector, meaning that it is a column vector that has a 1 at the index corresponding to its label, and is zero everywhere else. <script type="math/tex">h(x)</script> is the output of our prediction function whose elements sum to 1. As an example, we may have:</p>
<script type="math/tex; mode=display">y = \begin{bmatrix}
1 \\
0 \\
0
\end{bmatrix}, h(x) = \begin{bmatrix}
0.2 \\
0.7 \\
0.1
\end{bmatrix} \longrightarrow{} C(y, h(x)) = -\sum_{i=1}^{N}y_i\log(h(x_i)) = -\log(0.2) \approx 1.61</script>
<p>The contribution to the entire training data’s loss by this pair was about 1.61 (using the natural logarithm). To contrast, we can swap the first two probabilities in our softmax vector. We then end up with a lower loss:</p>
<script type="math/tex; mode=display">y = \begin{bmatrix}
1 \\
0 \\
0
\end{bmatrix}, h(x) = \begin{bmatrix}
0.7 \\
0.2 \\
0.1
\end{bmatrix} \longrightarrow{} C(y, h(x)) = -\sum_{i=1}^{N}y_i\log(h(x_i)) = -\log(0.7) \approx 0.36</script>
<p>So our cross-entropy loss makes intuitive sense: it is lower when our softmax vector has a high probability at the index of the true label, and it is higher when our probabilities indicate a wrong or uncertain choice.</p>
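<p>As a quick check of this intuition, the following NumPy sketch evaluates the loss for the two softmax vectors above. The helper name <code class="highlighter-rouge">cross_entropy</code> is chosen here for illustration, and the natural logarithm is assumed.</p>

```python
import numpy as np

def cross_entropy(y, h):
    # y: one-hot label vector, h: predicted probabilities that sum to 1
    return -np.sum(y * np.log(h))

y = np.array([1.0, 0.0, 0.0])
wrong = np.array([0.2, 0.7, 0.1])  # most probability mass on the wrong class
right = np.array([0.7, 0.2, 0.1])  # most probability mass on the true class

loss_wrong = cross_entropy(y, wrong)
loss_right = cross_entropy(y, right)
print("wrong-leaning prediction:", loss_wrong)
print("right-leaning prediction:", loss_right)
```

<p>Only the probability assigned to the true class contributes, so moving mass toward that class lowers the loss.</p>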
<p><strong>Sanity check: why do we need the negative sign outside the sum?</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># define our loss function as the cross entropy loss</span>
<span class="n">cross_entropy_loss</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">reduce_mean</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">softmax_cross_entropy_with_logits</span><span class="p">(</span><span class="n">labels</span> <span class="o">=</span> <span class="n">y_</span><span class="p">,</span> <span class="n">logits</span> <span class="o">=</span> <span class="n">y</span><span class="p">))</span>
<span class="c"># create an optimizer to minimize our cross entropy loss</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">train</span><span class="o">.</span><span class="n">GradientDescentOptimizer</span><span class="p">(</span><span class="n">learning_rate</span><span class="p">)</span><span class="o">.</span><span class="n">minimize</span><span class="p">(</span><span class="n">cross_entropy_loss</span><span class="p">)</span>
<span class="c"># functions that allow us to gauge accuracy of our model</span>
<span class="n">correct_predictions</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">equal</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">tf</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">y_</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span> <span class="c"># creates a vector where each element is T or F, denoting whether our prediction was right</span>
<span class="n">accuracy</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">reduce_mean</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="n">correct_predictions</span><span class="p">,</span> <span class="n">tf</span><span class="o">.</span><span class="n">float32</span><span class="p">))</span> <span class="c"># maps the boolean values to 1.0 or 0.0 and calculates the accuracy</span>
<span class="c"># we will need to run this in our session to initialize our weights and biases. </span>
<span class="n">init</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">global_variables_initializer</span><span class="p">()</span>
</code></pre></div></div>
<p>With all of our variables created and computation graph defined, we can now launch the graph in a session and begin training. It is important to remember that since we declared <code class="highlighter-rouge">x</code> and <code class="highlighter-rouge">y_</code> as placeholders, we will need to feed in data to run our optimizer that minimizes the cross-entropy loss.</p>
<p>The data we will feed in (by passing into our function a dictionary <em>feed_dict</em>) will come from the MNIST dataset. To randomly sample 100 training examples, we can use a wrapper provided by TensorFlow: <code class="highlighter-rouge">mnist.train.next_batch(100)</code>.</p>
<p>When we run the optimizer with the call <code class="highlighter-rouge">optimizer.run(..)</code>, TensorFlow calculates a forward pass for us (essentially propagating our data through the graph we have described), uses the loss function we created to evaluate the loss, computes the partial derivatives of the loss with respect to each set of weights, and updates the weights according to those partial derivatives. Computing these derivatives is called the backpropagation algorithm, and it involves significant application of the chain rule. CS 231N provides an <a href="http://cs231n.github.io/optimization-2/">excellent explanation</a> of backpropagation.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># launch a session to run our graph defined above. </span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">Session</span><span class="p">()</span> <span class="k">as</span> <span class="n">sess</span><span class="p">:</span>
    <span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">init</span><span class="p">)</span> <span class="c"># initializes our variables</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_iterations</span><span class="p">):</span>
        <span class="c"># get a sample of the dataset and run the optimizer, which calculates a forward pass and then runs the backpropagation algorithm to improve the weights</span>
        <span class="n">batch</span> <span class="o">=</span> <span class="n">mnist</span><span class="o">.</span><span class="n">train</span><span class="o">.</span><span class="n">next_batch</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span>
        <span class="n">optimizer</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">feed_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">x</span><span class="p">:</span> <span class="n">batch</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">y_</span><span class="p">:</span> <span class="n">batch</span><span class="p">[</span><span class="mi">1</span><span class="p">]})</span>
        <span class="c"># every 100 iterations, print out the accuracy</span>
        <span class="k">if</span> <span class="n">i</span> <span class="o">%</span> <span class="mi">100</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
            <span class="c"># accuracy and loss are both functions that take (x, y) pairs as input, and run a forward pass through the network to obtain a prediction, and then compares the prediction with the actual y.</span>
            <span class="n">acc</span> <span class="o">=</span> <span class="n">accuracy</span><span class="o">.</span><span class="nb">eval</span><span class="p">(</span><span class="n">feed_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">x</span><span class="p">:</span> <span class="n">batch</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">y_</span><span class="p">:</span> <span class="n">batch</span><span class="p">[</span><span class="mi">1</span><span class="p">]})</span>
            <span class="n">loss</span> <span class="o">=</span> <span class="n">cross_entropy_loss</span><span class="o">.</span><span class="nb">eval</span><span class="p">(</span><span class="n">feed_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">x</span><span class="p">:</span> <span class="n">batch</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">y_</span><span class="p">:</span> <span class="n">batch</span><span class="p">[</span><span class="mi">1</span><span class="p">]})</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"Epoch: {}, accuracy: {}, loss: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">acc</span><span class="p">,</span> <span class="n">loss</span><span class="p">))</span>
    <span class="c"># evaluate our testing accuracy </span>
    <span class="n">acc</span> <span class="o">=</span> <span class="n">accuracy</span><span class="o">.</span><span class="nb">eval</span><span class="p">(</span><span class="n">feed_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">x</span><span class="p">:</span> <span class="n">mnist</span><span class="o">.</span><span class="n">test</span><span class="o">.</span><span class="n">images</span><span class="p">,</span> <span class="n">y_</span><span class="p">:</span> <span class="n">mnist</span><span class="o">.</span><span class="n">test</span><span class="o">.</span><span class="n">labels</span><span class="p">})</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"testing accuracy: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">acc</span><span class="p">))</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Epoch: 0, accuracy: 0.07999999821186066, loss: 2.2931833267211914
Epoch: 100, accuracy: 0.8399999737739563, loss: 0.6990350484848022
Epoch: 200, accuracy: 0.8700000047683716, loss: 0.35569435358047485
Epoch: 300, accuracy: 0.9300000071525574, loss: 0.26591774821281433
Epoch: 400, accuracy: 0.8999999761581421, loss: 0.3307000696659088
Epoch: 500, accuracy: 0.9399999976158142, loss: 0.23977749049663544
Epoch: 600, accuracy: 0.9800000190734863, loss: 0.09397666901350021
Epoch: 700, accuracy: 0.9200000166893005, loss: 0.2931550145149231
Epoch: 800, accuracy: 0.9399999976158142, loss: 0.20180968940258026
Epoch: 900, accuracy: 0.949999988079071, loss: 0.18461622297763824
Epoch: 1000, accuracy: 0.9700000286102295, loss: 0.18968147039413452
Epoch: 1100, accuracy: 0.9599999785423279, loss: 0.14828498661518097
Epoch: 1200, accuracy: 0.949999988079071, loss: 0.1613173633813858
Epoch: 1300, accuracy: 0.9800000190734863, loss: 0.10008890926837921
Epoch: 1400, accuracy: 0.9900000095367432, loss: 0.07440848648548126
Epoch: 1500, accuracy: 0.9599999785423279, loss: 0.1167958676815033
Epoch: 1600, accuracy: 0.9100000262260437, loss: 0.1591644138097763
Epoch: 1700, accuracy: 0.9599999785423279, loss: 0.10022231936454773
Epoch: 1800, accuracy: 0.9700000286102295, loss: 0.1086776852607727
Epoch: 1900, accuracy: 0.9700000286102295, loss: 0.15659521520137787
Epoch: 2000, accuracy: 0.9599999785423279, loss: 0.09391114860773087
Epoch: 2100, accuracy: 0.9800000190734863, loss: 0.09786181151866913
Epoch: 2200, accuracy: 0.9700000286102295, loss: 0.11428779363632202
Epoch: 2300, accuracy: 0.9900000095367432, loss: 0.07231700420379639
Epoch: 2400, accuracy: 0.9700000286102295, loss: 0.09908157587051392
Epoch: 2500, accuracy: 0.9599999785423279, loss: 0.15657338500022888
Epoch: 2600, accuracy: 0.9900000095367432, loss: 0.07787769287824631
Epoch: 2700, accuracy: 0.9800000190734863, loss: 0.07373256981372833
Epoch: 2800, accuracy: 0.9700000286102295, loss: 0.062044695019721985
Epoch: 2900, accuracy: 0.9700000286102295, loss: 0.12512363493442535
Epoch: 3000, accuracy: 0.9900000095367432, loss: 0.11000598967075348
Epoch: 3100, accuracy: 0.9700000286102295, loss: 0.20609986782073975
Epoch: 3200, accuracy: 0.9800000190734863, loss: 0.09811186045408249
Epoch: 3300, accuracy: 0.9700000286102295, loss: 0.09816547483205795
Epoch: 3400, accuracy: 0.9700000286102295, loss: 0.10826745629310608
Epoch: 3500, accuracy: 0.9900000095367432, loss: 0.0645124614238739
Epoch: 3600, accuracy: 0.9700000286102295, loss: 0.1555529236793518
Epoch: 3700, accuracy: 0.9700000286102295, loss: 0.06963416188955307
Epoch: 3800, accuracy: 0.9900000095367432, loss: 0.08054723590612411
Epoch: 3900, accuracy: 0.9800000190734863, loss: 0.06120322644710541
Epoch: 4000, accuracy: 0.9900000095367432, loss: 0.06058483570814133
Epoch: 4100, accuracy: 0.9700000286102295, loss: 0.11490124464035034
Epoch: 4200, accuracy: 0.9700000286102295, loss: 0.10046141594648361
Epoch: 4300, accuracy: 0.9800000190734863, loss: 0.04671316221356392
Epoch: 4400, accuracy: 0.9900000095367432, loss: 0.052477456629276276
Epoch: 4500, accuracy: 0.9800000190734863, loss: 0.08245706558227539
Epoch: 4600, accuracy: 0.9900000095367432, loss: 0.041497569531202316
Epoch: 4700, accuracy: 0.9900000095367432, loss: 0.050769224762916565
Epoch: 4800, accuracy: 0.9900000095367432, loss: 0.039090484380722046
Epoch: 4900, accuracy: 0.9900000095367432, loss: 0.0564178042113781
testing accuracy: 0.9653000235557556
</code></pre></div></div>
<h3 id="questions-to-ponder">Questions to Ponder</h3>
<ul>
  <li>Why is the test accuracy lower than the (final) training accuracy?</li>
<li>Why is there only a nonlinearity in our hidden layer, and not in the output layer?</li>
<li>How can we tune our hyperparameters? In practice, is it okay to continually search for the best performance on the test dataset?</li>
<li>Why do we use only 100 examples in each iteration, as opposed to the entire dataset of 50,000 examples?</li>
</ul>
<h3 id="exercises">Exercises</h3>
<ol>
<li>Using different activation functions. Consult the Tensorflow documentation on <code class="highlighter-rouge">tanh</code> and <code class="highlighter-rouge">sigmoid</code>, and use that as the activation function instead of <code class="highlighter-rouge">relu</code>. Gauge the resulting changes in accuracy.</li>
  <li>Varying the number of neurons - as mentioned, we have complete control over the number of neurons in our hidden layer. How does the testing accuracy change with a small number of neurons versus a large number of neurons? What about the generalization gap (the training accuracy relative to the testing accuracy)?</li>
<li>Using different loss functions - we have discussed the cross entropy loss. Another common loss function used in neural networks is the MSE loss. Consult the Tensorflow documentation and implement the <code class="highlighter-rouge">MSELoss()</code> function.</li>
<li>Addition of another hidden layer - We can create a deeper neural network with additional hidden layers. Similar to how we created our original hidden layer, you will have to figure out the dimensions for the weights (and biases) by looking at the dimension of the previous layer, and deciding on the number of neurons you would like to use. Once you have decided this, you can simply insert another layer into the network with only a few lines of code:
<ol>
<li>Use <code class="highlighter-rouge">weight_variable()</code> and <code class="highlighter-rouge">bias_variable()</code> to create new variables for the additional layer (remember to specify the shape correctly).</li>
<li>Similar to computing the activations for the first layer, <code class="highlighter-rouge">h_1 = tf.nn.relu(...)</code>, compute the activations for your additional hidden layer.</li>
<li>Remember to change your output weight dimensions in order to reflect the number of neurons in the previous layer.</li>
</ol>
</li>
</ol>
<h3 id="more">More</h3>
<ol>
<li>Adding dropout</li>
<li>Using momentum optimization or other optimizers</li>
<li>Decaying learning rate</li>
<li>L2-regularization</li>
</ol>
<p>*Technical note: The way this loss function is presented is such that activations corresponding to a label of zero are not penalized at all. The full form of the cross-entropy loss is given by <script type="math/tex">C(y, h(x)) = -\sum_i \left[ y_i \log(h(x_i)) + (1 - y_i)\log(1 - h(x_i)) \right]</script>. However, the previously presented function works just as well in environments with larger amounts of data samples and training for many epochs (passes through the dataset), which is typically the case for neural networks.</p>This is a write-up and code tutorial that I wrote for an AI workshop given at UCLA, at which I gave a talk on neural networks and implementing them in Tensorflow. It’s part of a series on machine learning with Tensorflow, and the tutorials for the rest of them are available here.Paper Analysis - Training on corrupted labels2017-04-07T00:00:00+00:002017-04-07T00:00:00+00:00http://rohan-varma.github.io/Noisy-Labels<p><a href="https://arxiv.org/pdf/1703.08774.pdf">Link to paper</a></p>
<h3 id="abstract-and-intro">Abstract and Intro</h3>
<p>This paper talks about an innovative way to use labels assigned to medical images by many different doctors. Generally, large medical datasets are labelled by a variety of doctors and each doctor labels a small fraction of the dataset, and we also have many different doctors labelling the same picture. Often, their labels disagree. Generally when creating training and testing labels, this “disagreement” is captured through a majority vote or through modelling it with a probability distribution.</p>
<p>As an example, if a specific medical image is labelled as malignant by 5 doctors and benign by 4, then with the majority vote method the label will be malignant, and with the probability distribution method the label will be malignant with probability 5/9. This is equivalent to sampling a Bernoulli distribution with parameter 5/9.</p>
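<p>The two labelling schemes can be sketched in a few lines of NumPy. The vote array below is a hypothetical encoding of the 5-versus-4 example (1 for malignant, 0 for benign), and the variable names are illustrative.</p>

```python
import numpy as np

# 1 = malignant, 0 = benign: nine hypothetical doctor labels for one image
votes = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0])

p_malignant = votes.mean()               # fraction of doctors voting malignant (5/9)
majority_label = int(p_malignant > 0.5)  # majority vote keeps only the winning class

# probabilistic label: a single Bernoulli draw with parameter 5/9
rng = np.random.default_rng(0)
sampled_label = rng.binomial(1, p_malignant)

print(p_malignant, majority_label, sampled_label)
```

<p>The majority vote discards how close the vote was, while the probabilistic label keeps it in expectation.</p>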
<p>However, there could be potentially useful information in this disagreement of labels that other methods could better model. For example, we could take into account which expert produced which label, and the reliability of that expert. A possible way to do this is by modelling each expert individually and weighting the label by the expert’s reliability.</p>
<p>This paper first showed that the assumption that the training label accuracy is an upper bound for a neural net’s accuracy is false, and next showed that there are better ways of modelling the opinions of several experts.</p>
<h3 id="motivation">Motivation</h3>
<p>The main motivation was to show that a neural network could “perform better than its teacher”, or attain a test accuracy that is higher than the accuracy of the labels it was trained on. An example of this was shown with MNIST.</p>
<p>The researchers trained a (relatively shallow) convolutional network with 2 conv layers and a single fully connected layer followed by a 10-way softmax. It was trained with minibatch stochastic gradient descent, which is explained further in the next section. When the researchers introduced noise into the data by replacing the true label, with probability <script type="math/tex">0.5</script>, with a random label from another class, the network still only got 2.29% error. However, as the probability of corrupting the label increased above about <script type="math/tex">0.83</script>, the network failed to learn and had the same error as the corruption probability.</p>
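<p>A corruption procedure of this kind could look roughly like the sketch below. It assumes the replacement class is drawn uniformly from the other classes, which the description above suggests but does not spell out; the function name <code class="highlighter-rouge">corrupt_labels</code> and the random stand-in labels are hypothetical.</p>

```python
import numpy as np

def corrupt_labels(labels, p, num_classes, rng):
    """With probability p, replace each label with a different, uniformly chosen class."""
    labels = np.asarray(labels)
    flip = rng.random(labels.shape[0]) < p
    # an offset in [1, num_classes - 1] taken modulo num_classes never maps a label to itself
    offsets = rng.integers(1, num_classes, size=labels.shape[0])
    return np.where(flip, (labels + offsets) % num_classes, labels)

rng = np.random.default_rng(0)
clean = rng.integers(0, 10, size=10000)  # stand-in for MNIST digit labels
noisy = corrupt_labels(clean, p=0.5, num_classes=10, rng=rng)
print("fraction corrupted:", np.mean(noisy != clean))
```

<p>Roughly half of the labels end up wrong, yet the correct class is still the single most frequent label for each digit.</p>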
<h3 id="stochastic-gradient-descent">Stochastic Gradient Descent</h3>
<p>As an aside, stochastic gradient descent is a method for approximating the true gradient that ordinary (full-batch) gradient descent computes. We consider the typical gradient descent algorithm that takes derivatives of a loss function <script type="math/tex">J(\theta)</script> with respect to its parameters and then updates the parameters in the opposite direction:</p>
<script type="math/tex; mode=display">\delta \theta_i = \nabla_{\theta_i} J(\theta, X)</script>
<script type="math/tex; mode=display">\theta_i \leftarrow \theta_i - \alpha \, \delta \theta_i</script>
<script type="math/tex; mode=display">\forall i \in [1...m]</script>
<p>where there are <script type="math/tex">m</script> parameters that we need to learn. The above algorithm just models regular gradient descent without any techniques such as momentum or Adagrad applied. The main point is that when we compute partial derivatives, we need to use the entire training set <script type="math/tex">X</script>.</p>
<p>If the training set is extremely large, this can be computationally prohibitive. The main idea behind SGD is then to use only a small portion of the training dataset to compute the updates, which are approximations of the true gradient. For example, the researchers used minibatches of 200 samples instead of the entire training set of 50,000 examples. These minibatch samples need to be drawn randomly. Even though each individual approximation may not be very accurate, in the long run we get a very good approximation of the true gradient.</p>
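<p>The minibatch update can be sketched in NumPy as follows. This is an illustration on a synthetic linear-regression problem rather than the paper's setup; the data, the learning rate, and the number of steps are all assumptions, and only the batch size of 200 is taken from the text.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic regression data (an assumption for illustration)
n_samples, n_features = 5000, 5
X = rng.normal(size=(n_samples, n_features))
true_theta = rng.normal(size=n_features)
y = X @ true_theta + 0.1 * rng.normal(size=n_samples)

def gradient(theta, X_batch, y_batch):
    # gradient of the mean squared error J = mean((X theta - y)^2)
    residual = X_batch @ theta - y_batch
    return 2.0 * X_batch.T @ residual / len(y_batch)

theta = np.zeros(n_features)
alpha, batch_size = 0.05, 200
for step in range(500):
    # draw a random minibatch instead of using all of X
    idx = rng.choice(n_samples, size=batch_size, replace=False)
    # theta <- theta - alpha * (minibatch estimate of the gradient)
    theta -= alpha * gradient(theta, X[idx], y[idx])

print("max parameter error:", np.abs(theta - true_theta).max())
```

<p>Each step only touches 200 rows, so its cost does not grow with the size of the training set.</p>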
<h3 id="better-use-of-noisy-labels">Better use of Noisy Labels</h3>
<ul>
  <li>The paper pointed out that there are many differences in how doctors label the same data, due to the different training they received and even the biases that every human has. The paper pointed out that doctors only agreed with each other 70% of the time, and sometimes they even changed their own opinion from what they had said previously.</li>
  <li>This is pretty common in medicine. There’s usually no single right answer, and a lot of times doctors rely on previous experience and intuition to diagnose their patients. I was reminded of a talk given by Vinod Khosla at Stanford MedicineX, where he said that the “practice” of medicine could become a more robust science if we use artificially intelligent agents to aid diagnosis.</li>
  <li>This paper trained the neural network to model each of the individual doctors who were labelling data, instead of training the network to average the doctors.</li>
  <li>Previously, deep learning methods have been really successful in diabetic retinopathy detection, with some networks attaining high sensitivity and specificity (97.5% and 93.4%, respectively).</li>
</ul>
<h3 id="accuracy-sensitivityrecall-specifity-and-precision">Accuracy, Sensitivity/Recall, Specificity, and Precision</h3>
<ul>
<li>Accuracy is not always the best way to measure the ability of a model, and sometimes using it can be completely useless. Consider a scenario where you have 98 spam emails and 2 non-spam emails on a testing dataset. A model that gets 95% accuracy is not useful, as it performs worse than simply taking the majority label. Always be wary of accuracy percentages if they are not contextualized.</li>
  <li>To understand sensitivity (same as recall), specificity, and precision, we first consider the following diagram, from a blog post by <a href="http://yuvalg.com/blog/2012/01/01/precision-recall-sensitivity-and-specificity/">Yuval Greenfield</a>:</li>
</ul>
<p><img src="http://i.imgur.com/cJDJU.png" alt="Measurement methods" /></p>
<ul>
  <li>Let’s define some terms. Consider a binary classification system that outputs a positive or negative label. Then a true positive is the outcome where the classifier correctly predicts a positive label. A false positive is when the classifier incorrectly predicts a positive label, and similarly for the true and false negatives.</li>
  <li>Accuracy, intuitively, is just the number of instances that we classified correctly over all the instances (so the instances we classified correctly and incorrectly). This means that <script type="math/tex">acc = \frac{TP + TN}{TP + FN + TN + FP}</script>.</li>
  <li>Recall is defined as the proportion of correct positive classifications over the total number of positives. Therefore, we have the recall <script type="math/tex">r = \frac{TP}{TP + FN}</script>, where the sum <script type="math/tex">TP + FN</script> gives us all instances that are positive. Recall measures the proportion of actual positives that we predicted as positive. The term recall is interchangeable with sensitivity.</li>
  <li>Precision measures a different quantity than recall, but they are very easy to mix up. Precision measures the proportion of our positive predictions that are actually positive. This means that the precision <script type="math/tex">p = \frac{TP}{TP + FP}</script>. Note how this differs from recall. Recall measures how many positives we “found” out of all the positives, while precision measures the proportion of all our positive predictions that were correct.</li>
  <li>Specificity is like recall, but for negatives - it measures the proportion of the correct negative classifications over all of the negatives, giving us the ratio of how many negatives we found to all of the existing negatives. This means that the specificity <script type="math/tex">s = \frac{TN}{TN + FP}</script>.</li>
</ul>
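<p>The four metrics are easy to compute from confusion-matrix counts. The sketch below uses a hypothetical helper, <code class="highlighter-rouge">classification_metrics</code>, and made-up counts: 98 spam (positive) and 2 non-spam (negative) emails, with a classifier that simply labels every email as spam.</p>

```python
def classification_metrics(tp, fp, tn, fn):
    # each metric is a ratio of confusion-matrix counts
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": tp / (tp + fn),        # also called sensitivity
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
    }

# always predicting "spam": every spam email is a true positive,
# and both non-spam emails become false positives
metrics = classification_metrics(tp=98, fp=2, tn=0, fn=0)
for name, value in metrics.items():
    print(name, value)
```

<p>This majority-label classifier reaches 98% accuracy and perfect recall, but its specificity is zero, which is why accuracy alone can be misleading on imbalanced data.</p>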
<h3 id="methods">Methods</h3>
<ul>
<li>The researchers trained several different models of varying complexity on the diabetic retinopathy dataset.</li>
<li>As a baseline, the Inception-v3 architecture was used. Inception-v3 is a deep CNN with layers of “inception modules” that are composed of a concatenation of pooling, conv, and 1x1 conv steps. This is explained further in the next section.</li>
<li>
    <p>The other networks used include the “Doctor Net” that is extended to model the opinion of each doctor, and the “Weighted Doctor Net” that trains individual models for each doctor, and then combines their predictions through weighted averaging.</p>
</li>
  <li>The cross-entropy loss function was used to quantify the loss in all the models. The main difference between the several different models that the researchers trained can be seen in the cross-entropy loss function. The usual inputs into the cross-entropy loss are the predictions for a certain image along with the true label. This was replaced with, for example, the target distribution (basically probabilistic labels) and averaged predictions.</li>
</ul>
<h3 id="inception-modules-in-convolutional-architectures">Inception Modules in Convolutional Architectures</h3>
<ul>
  <li>At each step in a convolutional neural network’s architecture, you’re faced with many different possible choices. If you’re adding a convolutional layer, you’ll have to select the stride length, the kernel size, and whether you want to pad the edges or not. Alternatively, you may want to add a pooling layer, whether that’s max or average pooling.</li>
<li>The idea behind the inception module is that you don’t have to choose, and can instead apply all of these different options to your image/image transformation.</li>
  <li>For example, you could have a 5 x 5 convolution followed by max pooling, as well as a 3 x 3 convolution followed by a 1 x 1 convolution, and simply concatenate the outputs of these operations at the end. The following image, from the Udacity Course on Deep Learning, gives a good visualization of this:</li>
</ul>
<p><img src="https://raw.githubusercontent.com/rohan-varma/paper-analysis/master/noise-labels-paper/incmod.png" alt="Inception Module" /></p>
<ul>
<li>The main idea behind inception modules is that a 5 x 5 kernel and 3 x 3 kernel followed by a 1 x 1 convolution may both be beneficial to the modelling power of your architecture, so we could just use both, and the model will often perform better than using a single convolution. <a href="https://www.youtube.com/watch?v=VxhSouuSZDY">This video</a> explains the inception module in more detail.</li>
</ul>
<h3 id="modelling-label-noise-through-probabilistic-methods">Modelling label noise through probabilistic methods</h3>
<ul>
<li>
<p>The label noise was modelled by first assuming that a true label m is generated from an image s with some conditional probability: <script type="math/tex">p(m \vert{} s)</script>. Usually any form of deep neural networks (and general supervised ML) tries to learn this underlying probability distribution. Several learning algorithms such as binary logistic regression, softmax regression, and linear regression have a probabilistic interpretation of trying to model some underlying distribution. Here are a few examples:</p>
<ul>
      <li>Binary logistic regression tries to model a Bernoulli distribution by conditioning the label <script type="math/tex">y_n</script> on the input <script type="math/tex">x_n</script> and the weights of the model <script type="math/tex">w</script>: <script type="math/tex">p(y_n = 1 \vert{} x_n; w) = h_w(x_n)</script> where <script type="math/tex">h</script> is our model that we learn. More generally, we have the likelihood <script type="math/tex">L = \prod_n h_w(x_n)^{y_n} (1 - h_w(x_n))^{1-y_n}</script>. We can then maximize the likelihood (or more typically, minimize the negative log likelihood) by applying gradient descent.</li>
      <li>Linear regression can be interpreted as the real-valued output, <script type="math/tex">y</script>, being a linear function of the input <script type="math/tex">x</script> with Gaussian noise <script type="math/tex">\epsilon \sim N(0, \sigma^2)</script> added to it. Then we can write the log likelihood as <script type="math/tex">l(\theta) = \sum_n \log p(y_n \vert{} \theta^T x_n, \sigma^2) = -\frac{1}{2\sigma^2}\sum_n (y_n - \theta^T x_n)^2 - \frac{N}{2}\log(2\pi\sigma^2)</script>.</li>
      <li>What these probabilistic interpretations let us do is see the assumptions our models make, which is key if we want to simulate the real world. For example, these probability distributions show us that a key assumption is that our data are independent of each other. More specifically for typical linear regression, we also assume that the output is normally distributed around a mean that is linear in the input, with constant variance.</li>
</ul>
</li>
<li>This paper tries to model a similar probability distribution <script type="math/tex">p(m \vert{} s)</script> but with deep neural networks. It further takes that probability distribution of labels and adds a corrupting probability. The ideal label was <script type="math/tex">m</script> but we observe, in our training set, a noisy label <script type="math/tex">\hat{m}</script> with probability <script type="math/tex">p(\hat{m} \vert{} m)</script>.</li>
<li>These probabilities can be drawn from any distribution; the researchers chose an asymmetric binary one. This allows us to account for the fact that even doctors disagree on the true label, so we better model real-world scenarios.</li>
</ul>
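<p>To make the corruption step concrete, here is a minimal sketch of sampling noisy labels from <script type="math/tex">p(\hat{m} \vert{} m)</script> for binary labels. The function name <code class="highlighter-rouge">corrupt_labels</code> and the flip probabilities (0.05 and 0.2) are made up for illustration and are not taken from the paper; the only point is that the chance of a flip depends on the true label, which is what makes the noise asymmetric.</p>

```python
import numpy as np

def corrupt_labels(true_labels, p_flip_0_to_1, p_flip_1_to_0, rng):
    """Sample noisy binary labels m_hat from p(m_hat | m).

    The flip probabilities are illustrative assumptions, not values from the paper.
    """
    true_labels = np.asarray(true_labels)
    # The probability of a flip depends on the true label (asymmetric noise).
    flip_prob = np.where(true_labels == 0, p_flip_0_to_1, p_flip_1_to_0)
    flips = rng.random(true_labels.shape) < flip_prob
    return np.where(flips, 1 - true_labels, true_labels)

rng = np.random.default_rng(0)
m = rng.integers(0, 2, size=10000)          # "true" labels
m_hat = corrupt_labels(m, p_flip_0_to_1=0.05, p_flip_1_to_0=0.2, rng=rng)

# Empirical flip rates should be close to the probabilities we chose.
rate_0 = np.mean(m_hat[m == 0] == 1)
rate_1 = np.mean(m_hat[m == 1] == 0)
print(rate_0, rate_1)
```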
<h3 id="training-the-model">Training the Model</h3>
<ul>
<li>The training was done with TensorFlow across several different workers and GPUs. The model was pre-initialized with weights learned by the inception-v3 architecture on the ImageNet dataset.</li>
<li>This method of “transfer learning”, or transferring knowledge from one task to another, has recently gained popularity. The idea is that with learning parameters on ImageNet first, the model learns weights that aid with basic object recognition. Then, the model is trained on a more specific dataset to adjust its later layers, which model higher-level features of what we desire to learn.</li>
</ul>
<h3 id="results">Results</h3>
<ul>
<li>The results support the researchers’ thesis that generalization accuracy improves if the amount of information in the desired outputs is increased.</li>
<li>Training was done with 5-class loss. Results reported included the 5-class error, binary AUC, and specificity.</li>
<li>The hyperparameters were tuned with grid search. Methods to avoid overfitting that were used include L1 and L2-regularization as well as dropout throughout the networks. More information about regularization methods to prevent overfitting is in one of my blog posts here.</li>
<li>The “Weighted Doctor Network” (the network that averages weights of predictions given by several different models, learned for a particular doctor) performed best with a 5-class error of 20.58%, beating out the baseline inception net and the expectation-maximization algorithm that had 23.83% and 23.74% error respectively.</li>
</ul>
<h3 id="grid-search">Grid Search</h3>
<ul>
<li>Grid search is a common method for tuning the hyperparameters of a deep model. Deep neural networks often require careful hyperparameter tuning; for example, a learning rate that is too large, or one that does not decay as training goes on, may cause the algorithm to overshoot the minimum and start to diverge. Therefore, we look at all the possible combinations of hyperparameter values and pick the one that performs the best.</li>
<li>Specifically, we enumerate values for our hyperparameters:
<ul>
<li>learning_rates = [0.0001, 0.001, 0.01, 0.1]</li>
<li>momentum_consts = [0.1, 0.5, 1.0]</li>
<li>dropout_probs = [0.1, 0.5, 0.8]</li>
</ul>
</li>
<li>Next we do a search over all possible values. To evaluate the performance, it is important to use the validation set or k-fold cross validation. Never touch the test set during training:
<ul>
<li>for lr in learning_rates:
<ul>
<li>for momentum in momentum_consts:
<ul>
<li>for dropout in dropout_probs:
<ul>
<li>model = trained_model(X_train, y_train, lr, momentum, dropout) # calls the function that trains the model</li>
<li>cv_error = cross_validate(model, X_train, y_train)</li>
<li>update best cv error and hyperparams if less error is found</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li>Finally, train a model with the selected hyperparameters.</li>
</ul>
</li>
<li>As you may have noticed, this method can get expensive as the number of different hyperparameters or different values for each goes up. For <script type="math/tex">n</script> different parameters with <script type="math/tex">k</script> possibilities each, we have to consider <script type="math/tex">k^n</script> different tuples.</li>
</ul>
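<p>The nested loops above can be sketched in runnable form with <code class="highlighter-rouge">itertools.product</code>. Note that <code class="highlighter-rouge">validation_error</code> here is a hypothetical stand-in: a real version would train a model with the given hyperparameters and return its error on a validation set (or its k-fold cross-validation error). The toy formula is only there so that the search has something to minimize.</p>

```python
import itertools

learning_rates = [0.0001, 0.001, 0.01, 0.1]
momentum_consts = [0.1, 0.5, 1.0]
dropout_probs = [0.1, 0.5, 0.8]

def validation_error(lr, momentum, dropout):
    """Hypothetical stand-in for 'train a model, then measure validation error'.

    Replace the body with real training and evaluation code.
    """
    return (lr - 0.01) ** 2 + (momentum - 0.5) ** 2 + (dropout - 0.5) ** 2

best_error, best_params = float("inf"), None
n_evaluated = 0
# itertools.product enumerates every combination, like the three nested loops.
for lr, momentum, dropout in itertools.product(
        learning_rates, momentum_consts, dropout_probs):
    error = validation_error(lr, momentum, dropout)
    n_evaluated += 1
    if error < best_error:
        best_error, best_params = error, (lr, momentum, dropout)

print(n_evaluated, best_params)
```

<p>With 4, 3, and 3 candidate values, the search evaluates 4 * 3 * 3 = 36 combinations, which is why the cost grows so quickly as more hyperparameters are added.</p>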
<h3 id="conclusion">Conclusion</h3>
<ul>
<li>The paper showed that there are more effective methods to use the noise in labels than using a probability distribution or voting method. The network in this paper seeks to model the labels given by each individual doctor, and learn how to weight them optimally.</li>
</ul>
<h3 id="future-application">Future Application</h3>
<ul>
<li>This new method of modelling noise in the training datasets is pretty cool. I think it better models real-world datasets, where “predictions”, or diagnoses, are made by experts with varying levels of experience, biases, and predispositions. For deep learning to advance in the medical field, modelling this aspect of medicine well will be essential. It also has application to other fields where noisy labels exist in any fashion. <a href="https://github.com/rohan-varma/paper-analysis/blob/master/noise-labels-paper/tf-implementation.py">Here</a> is an example tensorflow implementation of training on corrupted labels.</li>
</ul>Link to paperImplementing a Neural Network in Python2017-02-10T00:00:00+00:002017-02-10T00:00:00+00:00http://rohan-varma.github.io/Neural-Net<p>Recently, I spent some time writing out the code for a neural network in Python from scratch, without using any machine learning libraries. It proved to be a pretty enriching experience and taught me a lot about how neural networks work, and what we can do to make them work better. I thought I’d share some of my thoughts in this post.</p>
<h3 id="defining-the-learning-problem">Defining the Learning Problem</h3>
<p>In supervised learning problems, we’re given a training dataset that contains pairs of input instances and their corresponding labels. For example, in the MNIST dataset, our input instances are images of handwritten digits, and each label is a single digit that indicates the number written in the image. To input this training data to a computer, we need to numerically represent our data. Each image in the MNIST dataset is a 28 x 28 grayscale image, so we can represent each image as a vector <script type="math/tex">\vec{x} \in R^{784}</script>. The elements in the vector <script type="math/tex">x</script> are known as features, and in this case they’re values in between 0 and 255. Our labels are commonly denoted as <script type="math/tex">y</script>, and as mentioned, are in between 0 and 9. Here’s an example from the MNIST dataset [1]:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/rohan-blog/master/images/mnistimg.png" alt="image" /></p>
<p>We can think of this dataset as a sample from some probability distribution over the feature/label space, known as the data generating distribution. Specifically, this distribution gives us the probability of observing any particular <script type="math/tex">(x, y)</script> pair, for every <script type="math/tex">(x, y)</script> pair in the cartesian product <script type="math/tex">X \times Y</script>. Intuitively, we would expect the pair that consists of an image of a handwritten 2 and the label 2 to have a high probability, while a pair that consists of a handwritten 2 and the label 9 should have a low probability.</p>
<p>Unfortunately, we don’t know what this data generating distribution is parametrized by, and this is where machine learning comes in: we aim to learn a function <script type="math/tex">h</script> that maps feature vectors to labels as accurately as possible, and in doing so, come up with estimates for the true underlying parameters. This function should generalize well: we don’t just want to learn a function that produces a flawless mapping on our training set. The function needs to be able to generalize over all unseen examples in the distribution. With this, we can introduce the idea of the loss function, a function that quantifies how off our prediction is from the true value. The loss function gives us a good idea about our model’s performance, so over the entire population of (feature vector, label) pairs, we’d want the expectation of the loss to be as low as possible. Therefore, we want to find <script type="math/tex">h(x)</script> that minimizes the following function:</p>
<script type="math/tex; mode=display">E[L(y, h(x))] = \sum_{(x, y) \in D} p(x, y)L(y, h(x))</script>
<p>However, there’s a problem here: we can’t compute <script type="math/tex">p(x, y)</script>, so we have to resort to approximations of the loss function based on the training data that we do have access to. To approximate our loss, it is common to sum the loss function’s output across our training data, and then divide it by the number of training examples to obtain an average loss, known as the training loss:</p>
<script type="math/tex; mode=display">\frac{1}{N} \sum_{i=1}^{N} L(y_i, h(x_i))</script>
<p>There are several different loss functions that we can use in our neural network to give us an idea of how well it is doing. The function that I ended up using was the cross-entropy loss, which will be discussed a bit later.</p>
<p>In the space of neural networks, the function <script type="math/tex">h(x)</script> we will find will consist of several operations of matrix multiplications followed by applying nonlinearity functions. The basic idea is that we need to find the parameters of this function that both produce a low training loss and generalize well to unseen data. With our learning problem defined, we can get on to the theory behind neural networks:</p>
<h3 id="precursor-a-single-neuron">Precursor: A single Neuron</h3>
<p>In the special case of binary classification, we can model an artificial neuron as receiving a linear combination of our inputs <script type="math/tex">w^{T} \cdot x</script>, and then computing a function that returns either 0 or 1, which is the predicted label of the input.</p>
<p>The weights are applied to the inputs, which are just the features of the training instance. Then, as a simple example of a function an artificial neuron can compute, we take the sign of the resulting number, and map that to a prediction. So the following is the neural model of learning [2]:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/rohan-blog/gh-pages/images/perceptron.png" alt="Perceptron" /></p>
<p>There are a few evident limitations to this kind of learning - for one, it can only do binary classification. Moreover, this neuron can only linearly separate data, and therefore this model assumes that the data is indeed linearly separable. Deep neural networks are capable of learning representations that model the nonlinearity inherent in many data samples. The idea, however, is that neural networks are just made up of layers of these neurons, which by themselves, are pretty simple, but extremely powerful when they are combined.</p>
<h3 id="from-binary-classification-to-multinomial-classfication">From Binary Classification to Multinomial Classification</h3>
<p>In the context of our MNIST problem, we’re interested in producing more than a binary classification - we want to predict one label out of a possible ten. One intuitive way of doing this is simply training several classifiers - a one classifier, a two classifier, and so on. We don’t want to train multiple models separately though, we’d like a single model to learn all the possible different classifications.</p>
<p>If we consider our basic model of a neuron, we see that it has one vector of weights that it applies to determine a label. What if we had multiple vectors - a matrix - of weights instead? Then, each row of weights could represent a separate classifier. To see this clearly, we can start off with a simple linear mapping:</p>
<script type="math/tex; mode=display">a = Wx + b</script>
<p>For our MNIST problem, x is a vector with 784 components, W was originally a single vector with 784 values, and the bias, b, was a single number. However, if we modified W to be a matrix instead, we get multiple rows of weights, each of which can be applied to the input x via a matrix multiplication. Since we want to be able to predict 10 different labels, we can let W be a 10 x 784 matrix, and the matrix product <script type="math/tex">Wx</script> will produce a column vector of values that represent the output of 10 separate classifiers, where the weights for each classifier are given by the rows of W. The bias term is now a 10-dimensional vector, each entry of which adds a bias to the corresponding entry of the matrix product. The core idea, however, is that this matrix of weights represents different classifiers, and now we can predict more than just binary labels. An image from Stanford’s CS 231n course shows this clearly [3]:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/rohan-blog/gh-pages/images/imagemap.jpg" alt="Multi-class classification" /></p>
<p>Now that we have a vector of outputs that roughly correspond to scores for each predicted class, we’d like to figure out the most likely label. To do this, we can map our 10 dimensional vector to another 10 dimensional vector in which each value is in the range (0, 1), and the sum of all values is 1. This is known as the softmax function. We can use the output of this function to represent a probability distribution: each value gives us the probability of the input x mapping to a particular label y. The softmax function’s input and output are both vectors, and its <em>ith</em> output can be defined as <script type="math/tex">softmax(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}</script>.</p>
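<p>As a small sketch of the definition above, here is a softmax for a single score vector. Subtracting the maximum score before exponentiating is a common numerical-stability trick that I am adding here as an assumption; it does not change the result, since the shift cancels between the numerator and the denominator. The example scores are arbitrary.</p>

```python
import numpy as np

def softmax(z):
    """Map a vector of scores to probabilities that sum to 1."""
    shifted = z - np.max(z)      # the shift cancels out but avoids overflow in exp
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

scores = np.array([2.0, 1.0, 0.1])   # arbitrary example scores
probs = softmax(scores)
print(probs, probs.sum())
```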
<p>Next, we can use our loss function discussed previously to evaluate how well our classifier is doing. Specifically, we use the cross-entropy loss, which for a single prediction/label pair, is given by <script type="math/tex">C(S,L) = - \sum_{i}L_{i}log(S_{i})</script>.</p>
<p>Here, <script type="math/tex">L</script> is a specific one-hot encoded label vector, meaning that it is a column vector that has a 1 at the index corresponding to its label, and is zero everywhere else. <script type="math/tex">S</script> is a prediction vector whose elements sum to 1. As an example, we may have:</p>
<script type="math/tex; mode=display">L = \begin{bmatrix}
1 \\
0 \\
0
\end{bmatrix}, S = \begin{bmatrix}
0.2 \\
0.7 \\
0.1
\end{bmatrix} \longrightarrow{} C(S, L) = - \sum_{i=1}^{N}L_i\log(S_i) = -\log(0.2) \approx 1.61</script>
<p>Using the natural logarithm, the contribution to the entire training data’s loss by this pair is about 1.61. To contrast, we can swap the first two probabilities in our softmax vector. We then end up with a lower loss:</p>
<script type="math/tex; mode=display">L = \begin{bmatrix}
1 \\
0 \\
0
\end{bmatrix}, S = \begin{bmatrix}
0.7 \\
0.2 \\
0.1
\end{bmatrix} \longrightarrow{} C(S, L) = - \sum_{i=1}^{N}L_i\log(S_i) = -\log(0.7) \approx 0.36</script>
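<p>The two worked examples can be checked in a couple of lines. This is just a sketch; the helper name <code class="highlighter-rouge">cross_entropy</code> is mine. Keep in mind that <code class="highlighter-rouge">np.log</code> is the natural logarithm, so the losses come out to roughly 1.61 and 0.36, whereas base-10 logarithms would give roughly 0.70 and 0.15. Either way, the prediction that puts more probability on the true label has the lower loss.</p>

```python
import numpy as np

def cross_entropy(S, L):
    """Cross-entropy between a prediction vector S and a one-hot label L."""
    return -np.sum(L * np.log(S))   # np.log is the natural logarithm

L = np.array([1.0, 0.0, 0.0])
loss_wrong = cross_entropy(np.array([0.2, 0.7, 0.1]), L)   # true class gets 0.2
loss_right = cross_entropy(np.array([0.7, 0.2, 0.1]), L)   # true class gets 0.7
print(loss_wrong, loss_right)
```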
<p>So our cross-entropy loss makes intuitive sense: it is lower when our softmax vector has a high probability at the index of the true label, and it is higher when our probabilities indicate a wrong or uncertain choice. The average cross entropy loss is given by plugging into the average training loss function given above. A large part of training our neural network will be finding parameters that make the value of this function as small as possible, but still ensuring that our parameters generalize well to unseen data. For the linear softmax classifier, the training loss can be written as:</p>
<script type="math/tex; mode=display">L = \frac{1}{N}\sum_{j} C( S(Wx_j + b), L_j)</script>
<p>This is the function that we seek to minimize. Using the gradient descent algorithm, we can learn a particular matrix of weights that performs well and produces a low training loss. The assumption is that a low training loss will correspond to a low expected loss across all samples in the population of data, but this is a risky assumption that can lead to overfitting. Therefore, a lot of research into machine learning is directed towards figuring out how to minimize training loss while also retaining the ability to generalize.</p>
<p>Now that we’ve figured out how to linearly model multiclass classification, we can create a basic neural network. Consider what happens when we combine the idea of artificial neurons with our linear softmax classifier. Instead of computing a linear function <script type="math/tex">Wx + b</script> and immediately passing the result to a softmax function, we can have an intermediate step: pass the output of our linear combination to a vector of artificial neurons, that each compute a nonlinear function. Then, we can take a linear combination with a vector of weights for each of these outputs, and pass that into our softmax function.</p>
<p>Our previous linear function was given by:</p>
<script type="math/tex; mode=display">\hat{y} = softmax(W_1x + b)</script>
<p>And our new function is not too different:</p>
<script type="math/tex; mode=display">\hat{y} = softmax(W_2(nonlin(W_1x + b_1)) + b_2)</script>
<p>The key differences are that we have more biases and weights, as well as a larger composition of functions. This function is harder to optimize, and introduces a few interesting ideas about learning the weights with an algorithm known as backpropagation.</p>
<p>This “intermediate step” is actually known as a hidden layer, and we have complete control over it, meaning that among other things, we can vary the number of parameters or connections between weights and neurons to obtain an optimal network. It’s also important to notice that we can stack an arbitrary number of these hidden layers between the input and output of our network, and we can tune these layers individually. This lets us make our network as deep as we want it. For example, here’s what a neural network with two hidden layers would look like [4]:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/rohan-blog/gh-pages/images/neuralnet.png" alt="neural network" /></p>
<h3 id="implementing-the-neural-network">Implementing the Neural Network</h3>
<p>With a bit of background out of the way, we can actually begin implementing our network. If we’re going to implement a neural network with one hidden layer of arbitrary size, we need to initialize two matrices of weights: one to multiply with our inputs to feed into the hidden layer, and one to multiply with the outputs of our hidden layer, to feed into the softmax layer. Here’s how we can initialize our weights:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">init_weights</span><span class="p">(</span><span class="n">num_input_features</span><span class="p">,</span> <span class="n">num_hidden_units</span><span class="p">,</span> <span class="n">num_output_units</span><span class="p">):</span>
    <span class="s">"""initialize weights uniformly at random in [-1, 1); the extra column in each matrix multiplies the bias unit"""</span>
    <span class="n">w1</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="o">-</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">num_hidden_units</span><span class="p">,</span> <span class="n">num_input_features</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span>
    <span class="n">w2</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="o">-</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">num_output_units</span><span class="p">,</span> <span class="n">num_hidden_units</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">w1</span><span class="p">,</span> <span class="n">w2</span>

<span class="n">w1</span><span class="p">,</span> <span class="n">w2</span> <span class="o">=</span> <span class="n">init_weights</span><span class="p">(</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">w1</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>  <span class="c"># expect (30, 785)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">w2</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>  <span class="c"># expect (10, 31)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(30, 785)
(10, 31)
</code></pre></div></div>
<p>An important preprocessing step is to one-hot encode all of our labels. This is a typical process in machine learning and deep learning problems that involve modeling more than two labels. We begin with a 1-dimensional vector <script type="math/tex">y</script> with <em>M</em> elements, where each element <script type="math/tex">y_i \in \{0, \dots, N-1\}</script>, and turn it into an <em>N x M</em> matrix <em>Y</em>. Then, the <em>ith</em> column in <em>Y</em> represents the <em>ith</em> training label (this is also the element at index <em>i</em> in <script type="math/tex">y</script>). For this column, the label is given by the row <em>j</em> for which the value <script type="math/tex">Y[j][i] = 1</script>.</p>
<p>In other words, we’ve taken a vector in which a label <em>j</em> is given by <script type="math/tex">y[i] = j</script> and changed it into the matrix where the label would be <em>j</em> for the <em>ith</em> training example if <script type="math/tex">Y[j][i] = 1</script>. From this, we can implement a one-hot encoding:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">encode_labels</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">num_labels</span><span class="p">):</span>
    <span class="s">""" Encode labels into a one-hot representation
    Params:
    y: numpy array of num_samples, contains the target class labels for each training example.
    For example, y = [2, 1, 3, 3] -> 4 training samples, and the ith sample has label y[i]
    num_labels: number of output labels
    returns: onehot, a matrix of labels by samples. In column i, the entry at row y[i] will be
    "hot", or 1, to represent that index being the label.
    """</span>
    <span class="n">onehot</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">num_labels</span><span class="p">,</span> <span class="n">y</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">y</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]):</span>
        <span class="n">onehot</span><span class="p">[</span><span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mf">1.0</span>
    <span class="k">return</span> <span class="n">onehot</span>

<span class="n">y_train</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">8</span><span class="p">,</span><span class="mi">7</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">4</span><span class="p">])</span>
<span class="n">Y</span> <span class="o">=</span> <span class="n">encode_labels</span><span class="p">(</span><span class="n">y_train</span><span class="p">,</span><span class="mi">9</span><span class="p">)</span>
<span class="n">Y</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[ 1., 0., 0., 0., 0., 0., 1., 0., 0.],
[ 0., 1., 0., 0., 0., 0., 0., 1., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0., 0.]])
</code></pre></div></div>
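<p>As an aside, the same encoding can be written without an explicit loop. The sketch below is an alternative I am adding, not the version used in the rest of this post; the name <code class="highlighter-rouge">encode_labels_vectorized</code> is hypothetical, and it assumes the labels are integers in the range 0 to <code class="highlighter-rouge">num_labels - 1</code>.</p>

```python
import numpy as np

def encode_labels_vectorized(y, num_labels):
    """One-hot encode integer labels into a (num_labels x num_samples) matrix."""
    # Row y[i] of the identity matrix is the one-hot vector for label y[i];
    # transposing puts one sample per column, matching the layout above.
    return np.eye(num_labels)[y].T

y_train = np.array([0, 1, 8, 7, 4, 5, 0, 1, 4])
Y = encode_labels_vectorized(y_train, 9)
print(Y.shape)
```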
<p>With that out of the way, we’re ready to start implementing the bread and butter of the neural network: the <code class="highlighter-rouge">fit()</code> function. Fitting a function to our data requires two key steps: the forward propagation, where we make a prediction for a specific training example, and the backpropagation algorithm, where we update each of our weights by calculating the weight’s impact on our prediction error. The prediction error is quantified by the average training loss discussed above.</p>
<p>The first step in implementing the entire fit function will be to implement forward propagation. I decided to use the tanh function as the nonlinearity. Other popular choices include the sigmoid and ReLU functions. The forward propagation code passes our inputs to the hidden layer via a matrix multiplication with weights, and the output of the hidden layer is multiplied with a different set of weights, the result of which is passed into the softmax layer from which we obtain our predictions.</p>
<p>It’s also useful to save and return these intermediate values instead of only returning the prediction, since we’ll need these values later for backpropagation.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">w1</span><span class="p">,</span> <span class="n">w2</span><span class="p">,</span> <span class="n">do_dropout</span> <span class="o">=</span> <span class="bp">True</span><span class="p">):</span>
    <span class="s">""" Compute feedforward step """</span>
    <span class="c"># the activation of the input layer is simply the input matrix plus a bias unit, added for each sample.</span>
    <span class="n">a1</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">add_bias_unit</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
    <span class="c"># the input of the hidden layer is obtained by applying our weights to our inputs. We essentially take a linear combination of our inputs</span>
    <span class="n">z2</span> <span class="o">=</span> <span class="n">w1</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a1</span><span class="o">.</span><span class="n">T</span><span class="p">)</span>
    <span class="c"># apply the tanh function to squash each value into the range (-1, 1)</span>
    <span class="n">a2</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">z2</span><span class="p">)</span>
    <span class="c"># add a bias unit to the activation of the hidden layer.</span>
    <span class="n">a2</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">add_bias_unit</span><span class="p">(</span><span class="n">a2</span><span class="p">,</span> <span class="n">column</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="c"># compute input of output layer in exactly the same manner.</span>
    <span class="n">z3</span> <span class="o">=</span> <span class="n">w2</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a2</span><span class="p">)</span>
    <span class="c"># the activation of our output layer is just the softmax function.</span>
    <span class="n">a3</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">z3</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">a1</span><span class="p">,</span> <span class="n">z2</span><span class="p">,</span> <span class="n">a2</span><span class="p">,</span> <span class="n">z3</span><span class="p">,</span> <span class="n">a3</span>
</code></pre></div></div>
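<p>The forward pass relies on three helpers (<code class="highlighter-rouge">add_bias_unit</code>, <code class="highlighter-rouge">tanh</code>, and <code class="highlighter-rouge">softmax</code>) that aren’t shown above. Here is one plausible sketch of them as standalone functions, together with a shape check on random data. The details are assumptions on my part: in particular, I assume the bias unit is a column of ones when <code class="highlighter-rouge">column=True</code> and a row of ones otherwise, and that it is prepended rather than appended.</p>

```python
import numpy as np

def add_bias_unit(X, column=True):
    """Prepend a column (or row) of ones so the bias is folded into the weights.

    Assumed behavior; prepending rather than appending is an arbitrary choice.
    """
    if column:
        return np.hstack([np.ones((X.shape[0], 1)), X])
    return np.vstack([np.ones((1, X.shape[1])), X])

def tanh(z):
    return np.tanh(z)

def softmax(z):
    """Column-wise softmax: each column holds the scores for one sample."""
    exp_z = np.exp(z - np.max(z, axis=0, keepdims=True))
    return exp_z / np.sum(exp_z, axis=0, keepdims=True)

# Shape check: 5 samples, 784 features, 30 hidden units, 10 classes.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 784))
w1 = rng.uniform(-1.0, 1.0, size=(30, 785))
w2 = rng.uniform(-1.0, 1.0, size=(10, 31))

a1 = add_bias_unit(X)                                   # (5, 785)
a2 = add_bias_unit(tanh(w1.dot(a1.T)), column=False)    # (31, 5)
a3 = softmax(w2.dot(a2))                                # (10, 5)
print(a1.shape, a2.shape, a3.shape)
```

<p>Each column of <code class="highlighter-rouge">a3</code> is the predicted distribution over the 10 classes for one sample, so every column sums to 1.</p>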
<p>Since these operations are all vectorized, we generally run forward propagation on the entire matrix of training data at once. Next, we want to quantify how “off” our weights are, based on what was predicted. The cost function is given by <script type="math/tex">-\sum_{i,j} L_{i,j}log(S_{i,j})</script> , where <script type="math/tex">L</script> is the one-hot encoded label for a particular example and <script type="math/tex">S</script> is the output of the softmax function in the final layer of our neural network. In code, it can be implemented as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_cost</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">y_enc</span><span class="p">,</span> <span class="n">output</span><span class="p">,</span> <span class="n">w1</span><span class="p">,</span> <span class="n">w2</span><span class="p">):</span>
    <span class="s">""" Compute the average cross-entropy cost over all samples."""</span>
    <span class="c"># y_enc and output are both (num_labels x num_samples); w1 and w2 are not used in this snippet</span>
    <span class="n">cost</span> <span class="o">=</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">y_enc</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">output</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">cost</span><span class="o">/</span><span class="n">y_enc</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="c">#average cost</span>
</code></pre></div></div>
<h3 id="learning-weights-with-gradient-descent">Learning Weights with Gradient Descent</h3>
<p>Now we’re at a stage where our neural network can make predictions given training data, compare it to the actual labels, and quantify the error across our entire training dataset. Our network isn’t able to learn quite yet, however. The actual “learning” happens with the gradient descent algorithm. Gradient descent works by computing the partial derivative of the cost with respect to each of our weights. The vector of these partial derivatives, the gradient, gives us the direction of fastest increase for our loss function (in particular, it can be shown mathematically that the gradient of a function points in the direction of fastest increase). Then, we update each of our weights by the negative value of the gradient (hence the “descent” part of gradient descent). This can be seen as taking a “step” in the direction of a minimum. The size of this step is given by a hyperparameter known as the learning rate, which turns out to be extremely important in getting gradient descent to work. In general, the gradient descent algorithm can be given as follows:</p>
<p><em>while not converged</em>:</p>
<script type="math/tex; mode=display">\delta_i = \frac{\delta L}{\delta w_i} \forall w_i \in W</script>
<script type="math/tex; mode=display">w_i := w_i - \alpha*\delta_i</script>
<p>Gradient descent seeks to find the weights that bring our cost function to a minimum, ideally the global one. Intuitively, this makes sense, as we’d like our cost function to be as low as possible (while still taking care not to overfit on our training data). However, the functions that quantify the loss for most machine learning algorithms tend not to have an explicit solution to <script type="math/tex">\frac{\delta L}{\delta W} = 0</script>, so we must use numerical optimization algorithms such as gradient descent and hope to reach at least a good local minimum. We’re not guaranteed to reach a global minimum: gradient descent is only guaranteed to converge to one if our cost function is <strong>convex</strong> (and the learning rate is chosen suitably), and while cost functions for algorithms such as logistic regression are convex, the cost function for our single hidden layer neural network is not.</p>
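<p>To make the update rule concrete, here is a minimal sketch of gradient descent on a simple convex function, where convergence to the global minimum is expected. The function <code class="highlighter-rouge">gradient_descent</code>, the quadratic loss, and the values of <code class="highlighter-rouge">alpha</code> and <code class="highlighter-rouge">n_steps</code> are all assumptions chosen for illustration, not part of the network’s code.</p>

```python
import numpy as np

def gradient_descent(grad_fn, w0, alpha=0.1, n_steps=200):
    """Repeatedly apply w := w - alpha * gradient, starting from w0."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(n_steps):
        w -= alpha * grad_fn(w)
    return w

# Illustrative convex loss: L(w) = ||w - target||^2, whose gradient is 2 * (w - target).
target = np.array([3.0, -1.0])
w_final = gradient_descent(lambda w: 2.0 * (w - target), w0=[0.0, 0.0])

print(w_final)  # should be very close to `target`
```

<p>With this loss, each step shrinks the distance to the minimum by a factor of (1 - 2 * alpha), so a learning rate above 1.0 would make the iterates diverge, which is the failure mode described for overly large learning rates.</p>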
<p>We can still use gradient descent and get to a reasonably good set of weights, however. The art of doing this is an active area of deep learning research. Currently, a common method for implementing gradient descent for deep learning seems to be:</p>
<p>1) Initializing your weights sensibly. This often involves some experimentation in how you initialize your weights. If your network is not very deep, initializing them randomly with small values and low variance usually works. If your network is deeper, weights that are too small cause the activations to shrink toward zero layer by layer, so the scale has to be chosen more carefully. <a href="http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization">Xavier initialization</a> is a useful scheme that sets the variance of the initial weights based on the number of inputs and outputs of each layer.</p>
<p>2) Choosing an optimal learning rate. If the learning rate is too large, gradient descent could end up actually diverging, or skipping over the minimum entirely since it takes steps that are too large. Likewise, if the learning rate is too small, gradient descent will converge much more slowly. In general, it is advisable to start off with a small learning rate and decay it over time as your function begins to converge.</p>
<p>3) Use minibatch gradient descent. Instead of computing the loss and weight updates across the entire set of training examples, <strong>randomly</strong> choose a subset of your training examples and use that to update your weights. While each step then follows only a noisy estimate of the true gradient, it is so much cheaper to compute that we end up winning by a lot. We essentially approximate the gradient across the entire training set with the gradient on a sample from the training set.</p>
<p>4) Use the momentum method. This involves remembering the previous gradients, and factoring in the direction of those previous gradients when calculating the current update. This has proved to be pretty successful, as Geoffrey Hinton discusses in <a href="https://www.youtube.com/watch?v=8yg2mRJx-z4">this video</a>.</p>
<p>As a side note, Ilya Sutskever, a co-founder of OpenAI, has written more about training deep neural networks with stochastic gradient descent <a href="http://yyue.blogspot.com/2015/01/a-brief-overview-of-deep-learning.html">here</a>.</p>
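<p>Two of the steps above, weight initialization and random minibatches, can be sketched in a few lines. This is only an illustration: the helper names <code class="highlighter-rouge">xavier_init</code> and <code class="highlighter-rouge">random_minibatches</code>, the layer sizes, and the batch size are assumptions, and the uniform variant of Xavier initialization shown here is one common choice among several.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_out, n_in, rng):
    """Xavier/Glorot uniform initialization for a weight matrix of shape (n_out, n_in)."""
    # The limit is chosen so that the weight variance is 2 / (n_in + n_out).
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def random_minibatches(n_examples, batch_size, rng):
    """Shuffle the example indices and cut them into batches of at most batch_size."""
    perm = rng.permutation(n_examples)
    return [perm[i:i + batch_size] for i in range(0, n_examples, batch_size)]

# Hypothetical sizes: 784 inputs (e.g. MNIST pixels) and 30 hidden units.
w1 = xavier_init(30, 784, rng)
batches = random_minibatches(n_examples=100, batch_size=32, rng=rng)

print(w1.shape, [len(b) for b in batches])
```

<p>Reshuffling at the start of every epoch gives each update a different random sample. By contrast, <code class="highlighter-rouge">np.array_split</code> takes the <em>number of sections</em> as its second argument and keeps the indices in order, which is worth keeping in mind when reading a call such as <code class="highlighter-rouge">np.array_split(range(n), self.minibatch_size)</code>.</p>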
<p>Here’s an implementation of the fit() function:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">fit</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">print_progress</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
    <span class="s">""" Learn weights from training data """</span>
    <span class="n">X_data</span><span class="p">,</span> <span class="n">y_data</span> <span class="o">=</span> <span class="n">X</span><span class="o">.</span><span class="n">copy</span><span class="p">(),</span> <span class="n">y</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
    <span class="n">y_enc</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">encode_labels</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">n_output</span><span class="p">)</span>
    <span class="c"># init previous gradients</span>
    <span class="n">prev_grad_w1</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">w1</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
    <span class="n">prev_grad_w2</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">w2</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
    <span class="c"># pass through the dataset once per epoch</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">epochs</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">learning_rate</span> <span class="o">/=</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">decay_rate</span><span class="o">*</span><span class="n">i</span><span class="p">)</span>
        <span class="c"># split the example indices into minibatches (array_split's second argument is the number of splits)</span>
        <span class="n">mini</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array_split</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">y_data</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="bp">self</span><span class="o">.</span><span class="n">minibatch_size</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="n">mini</span><span class="p">:</span>
            <span class="c"># feedforward</span>
            <span class="n">a1</span><span class="p">,</span> <span class="n">z2</span><span class="p">,</span> <span class="n">a2</span><span class="p">,</span> <span class="n">z3</span><span class="p">,</span> <span class="n">a3</span><span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="n">X_data</span><span class="p">[</span><span class="n">idx</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">w1</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">w2</span><span class="p">)</span>
            <span class="n">cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">get_cost</span><span class="p">(</span><span class="n">y_enc</span><span class="o">=</span><span class="n">y_enc</span><span class="p">[:,</span> <span class="n">idx</span><span class="p">],</span> <span class="n">output</span><span class="o">=</span><span class="n">a3</span><span class="p">,</span> <span class="n">w1</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">w1</span><span class="p">,</span> <span class="n">w2</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">w2</span><span class="p">)</span>
            <span class="c"># compute gradient via backpropagation</span>
            <span class="n">grad1</span><span class="p">,</span> <span class="n">grad2</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">backprop</span><span class="p">(</span><span class="n">a1</span><span class="o">=</span><span class="n">a1</span><span class="p">,</span> <span class="n">a2</span><span class="o">=</span><span class="n">a2</span><span class="p">,</span> <span class="n">a3</span><span class="o">=</span><span class="n">a3</span><span class="p">,</span> <span class="n">z2</span><span class="o">=</span><span class="n">z2</span><span class="p">,</span> <span class="n">y_enc</span><span class="o">=</span><span class="n">y_enc</span><span class="p">[:,</span> <span class="n">idx</span><span class="p">],</span> <span class="n">w1</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">w1</span><span class="p">,</span> <span class="n">w2</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">w2</span><span class="p">)</span>
            <span class="c"># update parameters, multiplying by learning rate + momentum constants</span>
            <span class="c"># gradient update: w += -alpha * gradient.</span>
            <span class="n">w1_update</span><span class="p">,</span> <span class="n">w2_update</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">learning_rate</span><span class="o">*</span><span class="n">grad1</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">learning_rate</span><span class="o">*</span><span class="n">grad2</span>
            <span class="c"># use momentum - add in the previous update multiplied by a momentum hyperparameter.</span>
            <span class="n">momentum_factor_w1</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">momentum_const</span> <span class="o">*</span> <span class="n">prev_grad_w1</span>
            <span class="n">momentum_factor_w2</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">momentum_const</span> <span class="o">*</span> <span class="n">prev_grad_w2</span>
            <span class="c"># update</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">w1</span> <span class="o">+=</span> <span class="o">-</span><span class="p">(</span><span class="n">w1_update</span> <span class="o">+</span> <span class="n">momentum_factor_w1</span><span class="p">)</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">w2</span> <span class="o">+=</span> <span class="o">-</span><span class="p">(</span><span class="n">w2_update</span> <span class="o">+</span> <span class="n">momentum_factor_w2</span><span class="p">)</span>
            <span class="c"># save the current (learning-rate-scaled) gradients for the next momentum step</span>
            <span class="n">prev_grad_w1</span><span class="p">,</span> <span class="n">prev_grad_w2</span> <span class="o">=</span> <span class="n">w1_update</span><span class="p">,</span> <span class="n">w2_update</span>
        <span class="k">if</span> <span class="n">print_progress</span> <span class="ow">and</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="mi">50</span><span class="o">==</span><span class="mi">0</span><span class="p">:</span>
            <span class="nb">print</span><span class="p">(</span><span class="s">"Epoch: "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">))</span>
            <span class="nb">print</span><span class="p">(</span><span class="s">"Loss: "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">cost</span><span class="p">))</span>
            <span class="n">acc</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">training_acc</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
            <span class="nb">print</span><span class="p">(</span><span class="s">"Training Accuracy: "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">acc</span><span class="p">))</span>
    <span class="k">return</span> <span class="bp">self</span>
</code></pre></div></div>
<p>To compute the actual gradients, we use the backpropagation algorithm that calculates the gradients that we need to update our weights from the outputs of our feed forward step. Essentially, we repeatedly apply the chain rule starting from our outputs until we end up with values for <script type="math/tex">\frac{\delta L}{\delta W_1}</script> and <script type="math/tex">\frac{\delta L}{\delta W_2}</script>. CS 231N provides an <a href="http://cs231n.github.io/optimization-2/">excellent explanation</a> of backprop.</p>
<p>Our forward pass was given by:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a1</span> <span class="o">=</span> <span class="n">X</span>
<span class="n">z2</span> <span class="o">=</span> <span class="n">w1</span> <span class="o">*</span> <span class="n">a1</span><span class="o">.</span><span class="n">T</span>
<span class="n">a2</span> <span class="o">=</span> <span class="n">tanh</span><span class="p">(</span><span class="n">z2</span><span class="p">)</span>
<span class="n">z3</span> <span class="o">=</span> <span class="n">w2</span> <span class="o">*</span> <span class="n">a2</span>
<span class="n">a3</span> <span class="o">=</span> <span class="n">softmax</span><span class="p">(</span><span class="n">z3</span><span class="p">)</span>
</code></pre></div></div>
<p>Using these values, our backwards pass can be given by:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">s3</span> <span class="o">=</span> <span class="n">a3</span> <span class="o">-</span> <span class="n">y_actual</span>
<span class="n">grad_w2</span> <span class="o">=</span> <span class="n">s3</span> <span class="o">*</span> <span class="n">a2</span><span class="o">.</span><span class="n">T</span>
<span class="n">s2</span> <span class="o">=</span> <span class="n">w2</span><span class="o">.</span><span class="n">T</span> <span class="o">*</span> <span class="n">s3</span> <span class="o">*</span> <span class="n">tanh</span><span class="p">(</span><span class="n">z2</span><span class="p">,</span> <span class="n">deriv</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">grad_w1</span> <span class="o">=</span> <span class="n">s2</span> <span class="o">*</span> <span class="n">a1</span>
</code></pre></div></div>
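<p>The pseudocode above can be turned into a small runnable NumPy sketch and checked against a numerical derivative. Everything here is an assumption made for illustration: bias units are left out, the loss is the summed (not averaged) cross-entropy so that the output error is exactly <code class="highlighter-rouge">a3 - y</code>, and the layer sizes and random data are arbitrary.</p>

```python
import numpy as np

def softmax(z):
    """Column-wise softmax (each column is one example)."""
    z = z - z.max(axis=0, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def forward(X, w1, w2):
    a1 = X                 # (n_examples, n_features)
    z2 = w1 @ a1.T         # (n_hidden, n_examples)
    a2 = np.tanh(z2)
    z3 = w2 @ a2           # (n_classes, n_examples)
    a3 = softmax(z3)
    return a1, z2, a2, z3, a3

def loss(X, y, w1, w2):
    """Summed cross-entropy over all examples."""
    return -np.sum(y * np.log(forward(X, w1, w2)[-1]))

def backprop(X, y, w1, w2):
    a1, z2, a2, z3, a3 = forward(X, w1, w2)
    s3 = a3 - y                          # error at the output layer
    grad_w2 = s3 @ a2.T
    s2 = (w2.T @ s3) * (1.0 - a2 ** 2)   # tanh'(z2) = 1 - tanh(z2)^2
    grad_w1 = s2 @ a1
    return grad_w1, grad_w2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                    # 5 examples, 4 features
y = np.eye(3)[:, rng.integers(0, 3, size=5)]   # one-hot labels as columns, shape (3, 5)
w1 = rng.normal(scale=0.5, size=(6, 4))
w2 = rng.normal(scale=0.5, size=(3, 6))

grad_w1, grad_w2 = backprop(X, y, w1, w2)

# Central finite difference for a single entry of w1, used as a sanity check.
eps = 1e-6
w1_plus, w1_minus = w1.copy(), w1.copy()
w1_plus[0, 0] += eps
w1_minus[0, 0] -= eps
numeric = (loss(X, y, w1_plus, w2) - loss(X, y, w1_minus, w2)) / (2 * eps)

print(numeric, grad_w1[0, 0])  # the two values should agree closely
```

<p>A gradient check like this is a cheap way to catch indexing or transpose mistakes in a hand-written backward pass before training on real data.</p>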
<p>The results of our backwards pass were used in the fit() function to update our weights. Those are essentially all of the important parts of implementing a neural network, and training this vanilla neural network on MNIST with 1000 epochs gave me about 95% accuracy on test data. There are still a few more bells and whistles we can add to our network to make it generalize better to unseen data, however. These techniques reduce overfitting, and two common ones are L2-regularization and dropout.</p>
<h3 id="l2-regularization">L2-regularization</h3>
<p>L2-regularization is one of the most common ways to address overfitting in neural networks. It adds a term to the cost function that we seek to minimize.</p>
<p>Previously, our cost function was given by <script type="math/tex">- \sum_{i,j} L_{i,j} \log(S_{i,j})</script></p>
<p>Now, we tack on an additional regularization term: <script type="math/tex">0.5 \lambda W^{2}</script>. Essentially, we impose a penalty on large weight values. Large weights are indicative of overfitting, so we want to keep the weights in our model relatively small, which is more indicative of a simpler model. To see why this is, consider the classic case of overfitting, where our learning algorithm essentially memorizes the training data [5]:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/rohan-blog/gh-pages/images/overfitting.png" alt="overfitting" /></p>
<p>The learned coefficient values for the degree 9 polynomial are much greater in magnitude than those for the degree 3 polynomial:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/rohan-blog/gh-pages/images/overfitting2.png" alt="overfitting values" /></p>
<p>With regularization, when we minimize the cost function, we have two separate goals. Minimizing the first term picks weight values that give us the smallest training error. Minimizing the second term picks weight values that are as small as possible. The value of the hyperparameter <script type="math/tex">\lambda</script> controls how much we penalize large weights: if <script type="math/tex">\lambda</script> is 0, we don’t regularize at all, and if <script type="math/tex">\lambda</script> is very large, then the cross-entropy term is effectively ignored and we prioritize small weight values.</p>
<p>Adding the L2-regularization term to the cost function does not change gradient descent very much. The derivative of the regularization term <script type="math/tex">0.5 \lambda W^2</script> with respect to <script type="math/tex">W</script> is simply <script type="math/tex">\lambda W</script>, so we just add that term while computing the gradient. The result of adding this extra term to the gradients is that each time we update our weights, the weights undergo a linear decay: every weight is shrunk by an amount proportional to its current value.</p>
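<p>A tiny numerical sketch makes the decay visible. The helper names, the weight matrix, and the values of <code class="highlighter-rouge">lam</code> and <code class="highlighter-rouge">alpha</code> below are illustrative assumptions; the data gradient is set to zero on purpose so that only the effect of the regularization term remains.</p>

```python
import numpy as np

def l2_penalty(w, lam):
    """Regularization term 0.5 * lambda * sum(w^2)."""
    return 0.5 * lam * np.sum(w ** 2)

def l2_gradient(w, lam):
    """Gradient of the penalty with respect to w: lambda * w."""
    return lam * w

w = np.array([[2.0, -4.0],
              [0.5, 1.0]])
lam, alpha = 0.1, 0.5

# Pretend the gradient of the data loss is zero to isolate the decay.
data_grad = np.zeros_like(w)
w_new = w - alpha * (data_grad + l2_gradient(w, lam))

print(l2_penalty(w, lam))
print(w_new)  # every entry is scaled by (1 - alpha * lam)
```

<p>Since the update is <code class="highlighter-rouge">w - alpha * lam * w</code>, each weight is multiplied by <code class="highlighter-rouge">(1 - alpha * lam)</code> on every step, which is why this form of regularization is often called weight decay.</p>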
<p>While L2-regularization is quite popular, a few other forms of regularization are used as well. Another common method is L1-regularization, in which we add on the L1-norm of our weights (the sum of their absolute values), multiplied by the regularization hyperparameter: <script type="math/tex">\lambda \lvert W \rvert</script>.</p>
<p>With L1-regularization, we penalize weights that are non-zero, thus leading our network to learn sparse vectors of weights (vectors where many of the weight entries are zero). Therefore, our neurons will only fire when the most important features (whatever they may be) are detected in our training examples. This helps with feature selection.</p>
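<p>One way to see why L1 tends to produce sparse weights is to compare the pull each penalty exerts on a weight. The sketch below is illustrative only: the function names and example values are assumptions, and it uses the sign function as a subgradient of the absolute value (with 0 at exactly zero).</p>

```python
import numpy as np

def l1_penalty(w, lam):
    """Regularization term lambda * sum(|w|)."""
    return lam * np.sum(np.abs(w))

def l1_subgradient(w, lam):
    """Subgradient of the L1 penalty: lambda * sign(w)."""
    return lam * np.sign(w)

def l2_gradient(w, lam):
    """Gradient of 0.5 * lambda * sum(w^2), for comparison."""
    return lam * w

w = np.array([3.0, -0.01, 0.0])
lam = 0.1

print(l1_subgradient(w, lam))  # same magnitude for every non-zero weight
print(l2_gradient(w, lam))     # shrinks along with the weight itself
```

<p>The L1 pull has the same size for a weight of 3.0 as for a weight of -0.01, so small weights keep being pushed all the way to zero, whereas the L2 pull fades as a weight gets small.</p>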
<h3 id="dropout">Dropout</h3>
<p>Dropout is a relatively recent but very effective technique for reducing overfitting in neural networks. Generally, every neuron in a particular layer is connected to all the neurons in the next layer. This is called a “fully-connected” or “Dense” layer - all activations are passed through the layer in the network. Dropout randomly drops a subset of a layer’s activations, so the neurons in the next layer don’t receive any activations from the dropped neurons in the previous layer. This process is random, meaning that a different set of activations is discarded across different iterations of learning. Here’s a visualization of what happens when dropout is in use [6]:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/rohan-blog/master/images/dropout.jpeg" alt="dropout" /></p>
<p>When dropout is used, each neuron is forced to learn redundant representations of its features, meaning that it is less likely to only fire when an extremely specific set of features is seen. This leads to better generalization. Alternatively, dropout can be seen as training several different neural network architectures during training (since some neurons are sampled out). When the network is tested, we don’t discard any activations, so it is similar to taking an average prediction from many different (though not independent) neural network architectures.</p>
<p>Dropout is very effective, often yielding better results than other regularization methods and early stopping (stopping training when the error on a validation dataset starts to increase). In a <a href="http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf">paper describing dropout</a>, researchers were able to train a 65-million parameter network on MNIST (which has 60,000 training examples) with only 0.95% error using dropout - overfitting would have been a huge issue if such a large network relied only on other regularization methods.</p>
<p>To implement dropout, we can set some of the activations computed to 0, and then pass that vector of results to the next layer. Forward propagation changes slightly:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">w1</span><span class="p">,</span> <span class="n">w2</span><span class="p">,</span> <span class="n">do_dropout</span> <span class="o">=</span> <span class="bp">True</span><span class="p">):</span>
    <span class="s">""" Compute feedforward step """</span>
    <span class="n">a1</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">add_bias_unit</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
    <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">dropout</span> <span class="ow">and</span> <span class="n">do_dropout</span><span class="p">:</span> <span class="n">a1</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">compute_dropout</span><span class="p">(</span><span class="n">a1</span><span class="p">)</span> <span class="c"># dropout</span>
    <span class="c"># the input of the hidden layer is obtained by applying our weights to our inputs. We essentially take a linear combination of our inputs</span>
    <span class="n">z2</span> <span class="o">=</span> <span class="n">w1</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a1</span><span class="o">.</span><span class="n">T</span><span class="p">)</span>
    <span class="c"># applies the tanh function to map the input to values between -1 and 1</span>
    <span class="n">a2</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">z2</span><span class="p">)</span>
    <span class="c"># add a bias unit to activation of the hidden layer.</span>
    <span class="n">a2</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">add_bias_unit</span><span class="p">(</span><span class="n">a2</span><span class="p">,</span> <span class="n">column</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">dropout</span> <span class="ow">and</span> <span class="n">do_dropout</span><span class="p">:</span> <span class="n">a2</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">compute_dropout</span><span class="p">(</span><span class="n">a2</span><span class="p">)</span> <span class="c"># dropout</span>
    <span class="c"># compute input of output layer in exactly the same manner.</span>
    <span class="n">z3</span> <span class="o">=</span> <span class="n">w2</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a2</span><span class="p">)</span>
    <span class="c"># the activation of our output layer is just the softmax function.</span>
    <span class="n">a3</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">z3</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">a1</span><span class="p">,</span> <span class="n">z2</span><span class="p">,</span> <span class="n">a2</span><span class="p">,</span> <span class="n">z3</span><span class="p">,</span> <span class="n">a3</span>
</code></pre></div></div>
<p>In order to actually compute the dropout, we can randomly sample a 0/1 mask from a binomial distribution in which each activation is kept with probability p and set to 0 otherwise; p is yet another hyperparameter that must be tuned. When using dropout, it’s also important to scale the activations by p when doing a prediction (which doesn’t use dropout). This is because during training time, the average value of a certain neuron will be <script type="math/tex">px + (1-p) \cdot 0 = px</script>, where x was the activation before applying dropout. To keep the same average output when dropout is off during prediction time, we should scale the activations by p. This is equivalent to dividing the activations by p when training, and we’d prefer to do that to be more efficient while predicting. The <a href="http://cs231n.github.io/neural-networks-2/">CS 231n lectures</a> explain this dropout scaling concept well.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">compute_dropout</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">activations</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mf">0.5</span><span class="p">):</span>
    <span class="s">"""Keeps each activation with probability p, zeroes the rest, and rescales by 1/p"""</span>
    <span class="c"># the default of 0.5 is only illustrative, so that forward() can call this without passing p</span>
    <span class="n">mult</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">binomial</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="n">activations</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
    <span class="n">activations</span><span class="o">/=</span><span class="n">p</span>
    <span class="n">activations</span><span class="o">*=</span><span class="n">mult</span>
    <span class="k">return</span> <span class="n">activations</span>
</code></pre></div></div>
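<p>The scaling argument can be checked empirically with a short, standalone sketch. The function name <code class="highlighter-rouge">inverted_dropout</code>, the keep probability of 0.8, and the all-ones input are assumptions picked for illustration; unlike the method above, this version returns a new array rather than modifying its input in place.</p>

```python
import numpy as np

def inverted_dropout(activations, keep_prob, rng):
    """Keep each activation with probability keep_prob and rescale by 1 / keep_prob."""
    mask = rng.binomial(1, keep_prob, size=activations.shape)
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
x = np.ones((1000, 1000))          # every activation equals 1 before dropout
dropped = inverted_dropout(x, keep_prob=0.8, rng=rng)

print(dropped.mean())              # close to 1.0, the mean before dropout
print((dropped == 0).mean())       # close to 0.2, the fraction that was dropped
```

<p>Because the surviving activations are divided by the keep probability during training, the expected activation is unchanged, and no extra scaling is needed at prediction time.</p>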
<p>With these modifications, our neural network is less prone to overfitting and generalizes better. The full source code for the neural network can be found <a href="https://github.com/rohan-varma/neuralnets/blob/master/NeuralNetwork.py">here</a>, along with an <a href="https://github.com/rohan-varma/neuralnets/blob/master/NeuralNetDemo.ipynb">iPython notebook</a> with a demonstration on the MNIST dataset.</p>
<p><strong>Error Corrections/Changes</strong></p>
<ul>
<li>(1/16/18): Realized that I forgot to scale the activations when using dropout, so I added a note for that. Also fixed in the code with <a href="https://github.com/rohan-varma/neuralnets/commit/1040f5f091a38e369e933fde6d72f7f49e84b049">this</a> commit.</li>
<li>(1/16/18): Fixed broken links to source code.</li>
</ul>
<p><strong>References</strong></p>
<p>[1] <a href="http://yann.lecun.com/exdb/mnist/">The MNIST Database of Handwritten Digits</a></p>
<p>[2] <a href="https://blog.dbrgn.ch/2013/3/26/perceptrons-in-python/">Programming a Perceptron in Python</a> by Danilo Bargen</p>
<p>[3] <a href="http://cs231n.github.io/linear-classify/">Stanford CS 231N</a></p>
<p>[4] <a href="http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/">Stanford Deep Learning Tutorial</a></p>
<p>[5] <a href="http://web.cs.ucla.edu/~ameet/teaching/winter17/cs260/lectures/lec09.pdf">Ameet Talwalkar, UCLA CS 260</a></p>
<p>[6] <a href="http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf">Srivastava, Hinton, et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting</a></p>
<p>[7] <a href="https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb">Sebastian Raschka, Python Machine Learning, Chapter 12 Neural Networks</a> for code samples</p>
<p><strong>Building and testing an API with Express, Mongo, and Chai</strong> (2017-01-03, http://rohan-varma.github.io/Express-API)</p>
<p>Recently, I’ve been going through the Express, Mongoose, and Chai docs in order to help build out and test an API that’s going to be used for ACM Hack, a committee of UCLA’s CS club that focuses on teaching students new technologies and frameworks, as well as fostering/building an environment of hackers and makers at UCLA. We’re completely revamping Hack for the next quarter with regular events, projects, and additional content in terms of blog posts and tutorials for our users. To do this, we needed to revamp the Hack website.</p>
<p>Specifically, a few backend tasks were required, in the form of creating a functional API to support the needs of our front-end developers and users:</p>
<ul>
<li>Create, update, get, and delete Events (an Event, for example, could be an “Android Workshop Session”)</li>
<li>Create, update, get, and delete Showcase Projects (these are projects that our Hack members submit to us, and we showcase the coolest/most innovative projects)</li>
<li>Secure this API through the use of tokens, to make sure that requests cannot be spammed.</li>
<li>Create an email list API endpoint, that allows users to subscribe to our mailing list that notifies them about new events or important updates.</li>
<li>Create Mongoose schemas for all of the above data types.</li>
</ul>
<h3 id="tools-used">Tools Used</h3>
<p>On the backend, we decided to use MongoDB for our database, Express.js for our web framework, and Mocha/Chai for unit tests. The first order of business was to create database schemas for all of the above data types. We used <code class="highlighter-rouge">mongoose</code> to interact with our MongoDB database. <a href="http://mongoosejs.com/index.html">Mongoose</a> allows us to define object models that we can save and retrieve from our database. From the <a href="http://mongoosejs.com/docs/api.html">MongooseJS docs</a>, models are compiled from their schema definitions and represent specific documents in our database. The models also handle document creation and retrieval.</p>
<p>To take the example of creating our mailing list API endpoint, it would be useful to have an email schema that contains both the user’s email address as well as the user’s name. Moreover, we’d like to be able to retrieve all emails in a single request. Here’s the schema that we defined for emails:</p>
<script src="https://gist.github.com/rohan-varma/1cde65d7e093ddfc24d048a28dcc4af0.js"></script>
<p>We defined a <code class="highlighter-rouge">getAll</code> function in our schema to support querying for the entire mailing list. According to the MongooseJS docs, each model has <code class="highlighter-rouge">find</code>, <code class="highlighter-rouge">findById</code>, <code class="highlighter-rouge">findOne</code> and a few other useful functions that we can use to retrieve particular documents. We primarily used the <code class="highlighter-rouge">find</code> function, which has a few interesting use cases:</p>
<script src="https://gist.github.com/rohan-varma/20889e90b5bc7f7d348d214753397a05.js"></script>
<p>We used the latter to return all email documents, thus providing us with our mailing list.</p>
<p>Next, we created a <code class="highlighter-rouge">mongoose</code> instance and connected it to MongoDB. There are several ways to create your own MongoDB instance, a popular choice being <a href="https://mlab.com">MongoLab</a>. We also exported our schemas so that they can be instantiated in other areas of our application, namely, in our API where these models will be created and accessed. The following code connects the <code class="highlighter-rouge">mongoose</code> instance and exports the schemas:</p>
<script src="https://gist.github.com/rohan-varma/ad8eb415c940d359e31159fc6ee4d327.js"></script>
<h3 id="defining-our-api-endpoint-with-express">Defining Our API Endpoint with Express</h3>
<p>The next step was to set up the Express framework and begin to define routes and endpoints for our application. <a href="http://expressjs.com/">Express</a> is a minimal web framework that is essentially composed of two things: routing and middleware functions. At a high level, <a href="https://expressjs.com/en/guide/routing.html">routing</a> defines endpoints for your application that can be accessed to perform certain actions (i.e., GET or POST certain data). In other words, it defines the structure that is used for interaction with the backend of your web app. An Express route essentially maps a URL to a specific set of functions, called <a href="https://expressjs.com/en/guide/writing-middleware.html">middleware functions</a>. Middleware functions are quite powerful, and are capable of the following actions:</p>
<ul>
<li>Execute any code on the server</li>
<li>Modify the request (req) and response (res) objects</li>
<li>Call the next middleware function in the stack, denoted by <code class="highlighter-rouge">next()</code></li>
<li>End the request-response cycle (that is, end the API call).</li>
</ul>
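<p>The list above can be illustrated without Express at all. The following is a rough sketch of the idea, not Express’s actual implementation: <code class="highlighter-rouge">runMiddleware</code> and the three middleware functions are hypothetical names, and the request and response are plain objects.</p>

```javascript
// Runs an array of middleware functions in order. Each one receives
// (req, res, next) and either calls next() or ends the cycle by not calling it.
function runMiddleware(stack, req, res) {
  let index = 0;
  function next() {
    const fn = stack[index];
    index += 1;
    if (fn) fn(req, res, next);
  }
  next();
}

const log = [];

// Modifies the request object, then hands control to the next function.
function attachTimestamp(req, res, next) {
  req.receivedAt = Date.now();
  log.push('attachTimestamp');
  next();
}

// Ends the cycle early when no token is present (next is never called).
function requireToken(req, res, next) {
  log.push('requireToken');
  if (!req.token) {
    res.body = { success: false, errors: 'missing token' };
    return;
  }
  next();
}

// Last function on the stack: writes the response and ends the cycle.
function sendList(req, res, next) {
  log.push('sendList');
  res.body = { success: true, errors: '' };
}

const stack = [attachTimestamp, requireToken, sendList];
const rejected = {};
runMiddleware(stack, {}, rejected);
const accepted = {};
runMiddleware(stack, { token: 'abc' }, accepted);
console.log(rejected.body, accepted.body);
```

<p>The request without a token never reaches <code class="highlighter-rouge">sendList</code>, because <code class="highlighter-rouge">requireToken</code> ends the cycle instead of calling <code class="highlighter-rouge">next()</code>.</p>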
<p>For example, we can create a route for obtaining and sending data to our mailing list. To do this, we will create a router that maps the URL <code class="highlighter-rouge">/api/v1/email/:email?</code> to a set of functions. The last part of the URL, <code class="highlighter-rouge">:email?</code>, is an optional URL parameter. First, we can define middleware functions for this URL, which will also take care of the behavior of the endpoint without the optional argument:</p>
<script src="https://gist.github.com/rohan-varma/5ff1f324e9524332468f77ec9233a4c1.js"></script>
<p>In other files in our <code class="highlighter-rouge">api</code> directory of our application, we can tell Express to use certain routers for specific API endpoints. This way, routers can be composed: the <code class="highlighter-rouge">/api</code> endpoint can have routes for each API version, and each API version can have routes for its several endpoints that access data such as the mailing list or upcoming events:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// create this level's router, then mount the routers implemented for each data type</span>
<span class="kr">const</span> <span class="nx">router</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'express'</span><span class="p">).</span><span class="nx">Router</span><span class="p">();</span>
<span class="nx">router</span><span class="p">.</span><span class="nx">use</span><span class="p">(</span><span class="s1">'/event'</span><span class="p">,</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'./event'</span><span class="p">).</span><span class="nx">router</span><span class="p">);</span>
<span class="nx">router</span><span class="p">.</span><span class="nx">use</span><span class="p">(</span><span class="s1">'/email'</span><span class="p">,</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'./email'</span><span class="p">).</span><span class="nx">router</span><span class="p">);</span>
<span class="nx">router</span><span class="p">.</span><span class="nx">use</span><span class="p">(</span><span class="s1">'/showcase'</span><span class="p">,</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'./showcase'</span><span class="p">).</span><span class="nx">router</span><span class="p">);</span>
<span class="nx">module</span><span class="p">.</span><span class="nx">exports</span> <span class="o">=</span> <span class="p">{</span><span class="nx">router</span><span class="p">};</span>
</code></pre></div></div>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// create the top-level API router, then mount a router for each version of the API implemented</span>
<span class="kr">const</span> <span class="nx">router</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'express'</span><span class="p">).</span><span class="nx">Router</span><span class="p">();</span>
<span class="nx">router</span><span class="p">.</span><span class="nx">use</span><span class="p">(</span><span class="s1">'/v1'</span><span class="p">,</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'./v1'</span><span class="p">).</span><span class="nx">router</span><span class="p">);</span>
<span class="nx">module</span><span class="p">.</span><span class="nx">exports</span> <span class="o">=</span> <span class="p">{</span><span class="nx">router</span><span class="p">};</span>
</code></pre></div></div>
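<p>To see how nested routers add up to full URLs, here is a small sketch that imitates only the prefix composition. <code class="highlighter-rouge">createRouter</code> is a hypothetical helper written for this illustration (it is not the Express router), and the <code class="highlighter-rouge">/:id?</code> event route is an assumed example.</p>

```javascript
// Tiny illustrative router: use() mounts a child router under a prefix,
// get() registers a handler, and routes() lists the resulting full paths.
function createRouter() {
  const entries = [];
  return {
    get(path, handler) {
      entries.push({ path, handler });
    },
    use(prefix, child) {
      entries.push({ prefix, child });
    },
    routes() {
      return entries.flatMap((entry) =>
        entry.child
          ? entry.child.routes().map((path) => entry.prefix + path)
          : [entry.path]
      );
    },
  };
}

// Leaf routers, one per data type.
const email = createRouter();
email.get('/:email?', () => {});

const event = createRouter();
event.get('/:id?', () => {}); // assumed route, for illustration

// One router per API version, mounting the data-type routers.
const v1 = createRouter();
v1.use('/email', email);
v1.use('/event', event);

// Top-level routers mounting each version under /api.
const api = createRouter();
api.use('/v1', v1);

const root = createRouter();
root.use('/api', api);

console.log(root.routes());
```

<p>Each level only knows its own prefix, so adding a <code class="highlighter-rouge">/v2</code> later would mean mounting one more router on <code class="highlighter-rouge">api</code> without touching the existing ones.</p>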
<p>With this setup, access to our application’s data was organized into several different API endpoints. Next, we had to actually implement each middleware function for each of our API endpoints. To do this, we had to think about our API’s design at a granular level: what fields will we require for particular requests? Which requests will need token authentication? What will the response body look like in the case of success and in the case of failure?</p>
<p>We decided that our response objects will have two high level fields: <code class="highlighter-rouge">success</code>, a boolean value that indicates the status of the request, and <code class="highlighter-rouge">errors</code>, a string that indicates the errors (if any) that were encountered during the request (such as an invalid ID or unauthorized token). Here’s an example implementation of a <code class="highlighter-rouge">get</code> request:</p>
<script src="https://gist.github.com/rohan-varma/7d045f555f659f92f9bf394fbf2d7247.js"></script>
<p>As indicated above, we can have certain requests require a valid <code class="highlighter-rouge">token</code> for the request to return successfully. Also, we pass in an anonymous function that takes in two parameters to the <code class="highlighter-rouge">getAll</code> function defined in our Email model. From the implementation of <code class="highlighter-rouge">getAll</code> in the email schema discussed previously, the function retrieves all emails and then calls a provided callback function. In this case, the function returns a response object back to the user.</p>
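<p>As a rough, dependency-free sketch of that flow (not the implementation in the gist above): the token value, the <code class="highlighter-rouge">emails</code> response field, the stand-in <code class="highlighter-rouge">getAll</code>, and the fake response object are all assumptions made for illustration.</p>

```javascript
// Hypothetical token; a real service would verify it against stored credentials.
const VALID_TOKEN = 'secret-token';

// Stand-in for the Email model's getAll: calls back with (err, emails).
function getAll(callback) {
  callback(null, [{ name: 'Ada', email: 'ada@example.com' }]);
}

// Handler in the (req, res) style; res.json is the only method it relies on.
function getEmails(req, res) {
  if (req.body.token !== VALID_TOKEN) {
    return res.json({ success: false, errors: 'Unauthorized token', emails: [] });
  }
  getAll((err, emails) => {
    if (err) {
      return res.json({ success: false, errors: String(err), emails: [] });
    }
    res.json({ success: true, errors: '', emails });
  });
}

// Minimal fake response object that records what was sent.
function fakeRes() {
  return {
    sent: null,
    json(payload) {
      this.sent = payload;
    },
  };
}

const ok = fakeRes();
getEmails({ body: { token: VALID_TOKEN } }, ok);
const denied = fakeRes();
getEmails({ body: { token: 'wrong' } }, denied);
console.log(ok.sent, denied.sent);
```

<p>Both outcomes share the same top-level fields, so a client can always check <code class="highlighter-rouge">success</code> first and read <code class="highlighter-rouge">errors</code> only when it is false.</p>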
<h3 id="testing-the-api-using-mocha-and-chai">Testing the API using Mocha and Chai</h3>
<p>Next, we moved on to testing our API endpoints to make sure they work well, especially in edge cases such as malformed or unauthorized requests. At first, we manually tested our API using <a href="https://www.getpostman.com/">Postman</a>, which is a useful tool for quickly querying your endpoint to make sure it works correctly. However, as our API and overall application began to change rapidly and increase in size, we decided to use unit testing in order to make sure that our core functionality doesn’t break as a result of an erroneous commit.</p>
<p>Unit tests let us automatically detect problems in our codebase as soon as they are introduced, and requiring all of our tests to pass during the build step keeps us from pushing a broken build. We used two JavaScript testing libraries: <a href="https://mochajs.org/">Mocha.js</a>, the test runner that actually executes our unit tests, and <a href="http://chaijs.com/">Chai.js</a>, an assertion library that contains several useful helper functions for writing our testing code. Using a few more add-ons such as chai-http (to create and send HTTP requests) and Chai’s <code class="highlighter-rouge">should</code> assertion style (to write clean assert statements), we can efficiently create a testing setup for our API.</p>
<p>First, we describe a test and what it should do, and have an anonymous function running the actual test. The test for an API makes a request to that endpoint with some data, and then we verify that the response object looks like it should. As an example, to test our email API endpoint, we did the following:</p>
<ul>
<li>Create a valid GET request with a valid token in the body. Verify that the response object contains the relevant status fields and returns the mailing list.</li>
<li>Create an invalid GET request that is missing a valid token. Verify that the response object indicates failure and provides no emails.</li>
<li>Create a valid POST request that has a body indicating the user’s name and email address. Verify that the response object indicates that the request executed successfully.</li>
<li>Create a valid POST request that has a body that is missing optional fields. Ensure that missing these optional fields doesn’t cause the request to fail.</li>
</ul>
<p>Here’s an example of a single test case:</p>
<script src="https://gist.github.com/rohan-varma/aaf8f1f74633334e5e6f6b95072bd07d.js"></script>
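<p>The POST cases from the list above can also be sketched as a small table of inputs and expected outcomes. This is only an illustration in plain JavaScript, without Mocha or Chai: <code class="highlighter-rouge">validateSubscription</code> and the optional <code class="highlighter-rouge">year</code> field are hypothetical, chosen to show a required/optional split.</p>

```javascript
// Hypothetical validator for a POST body: name and email are required,
// any other field (such as year) is treated as optional.
function validateSubscription(body) {
  const missing = ['name', 'email'].filter((field) => !body[field]);
  if (missing.length > 0) {
    return { success: false, errors: 'Missing fields: ' + missing.join(', ') };
  }
  return { success: true, errors: '' };
}

// Each case pairs a description and a request body with the expected success flag.
const cases = [
  {
    it: 'accepts a body with all fields',
    body: { name: 'Ada', email: 'ada@example.com', year: 2 },
    expected: true,
  },
  {
    it: 'accepts a body that is missing optional fields',
    body: { name: 'Ada', email: 'ada@example.com' },
    expected: true,
  },
  {
    it: 'rejects a body that is missing the email address',
    body: { name: 'Ada' },
    expected: false,
  },
];

// Run every case and record whether the response matched the expectation.
const results = cases.map((testCase) => {
  const response = validateSubscription(testCase.body);
  return { it: testCase.it, passed: response.success === testCase.expected };
});

results.forEach((result) => {
  console.log((result.passed ? 'ok   ' : 'FAIL ') + result.it);
});
```

<p>A test runner such as Mocha does essentially the same bookkeeping, with nicer reporting and support for asynchronous requests.</p>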
<p>To easily run our tests, we just need to add the entry <code class="highlighter-rouge">"test": "mocha"</code> to the <code class="highlighter-rouge">scripts</code> section of our <code class="highlighter-rouge">package.json</code> file. Then, the unit tests can be run with a single command: <code class="highlighter-rouge">npm test</code>. Chai and Mocha allow the developer to create and define tests so that the end result of running the tests is descriptive of what tests were run, and how they should behave:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/rohan-blog/master/images/chaitest.png" alt="chai-test" title="unit tests" /></p>
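<p>For reference, the relevant part of <code class="highlighter-rouge">package.json</code> would look roughly like this. Only the <code class="highlighter-rouge">scripts</code> entry is shown; every other field of the file is omitted here.</p>

```json
{
  "scripts": {
    "test": "mocha"
  }
}
```

<p>By default, Mocha looks for test files in a <code class="highlighter-rouge">test</code> directory at the project root.</p>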
<p>And that’s it! We now have a well-organized, reliable, and reusable setup for creating and testing a robust API. In the coming months, we hope to expand on this and push out even more interesting features for UCLA’s CS community.</p>
<h3 id="projectcode-contributors">Project/Code Contributors:</h3>
<ul>
<li><a href="https://github.com/nkansal96">Nikhil Kansal</a></li>
<li><a href="https://github.com/yvonneCh">Yvonne Chen</a></li>
<li><a href="https://github.com/hsykwon">Justin Liu</a></li>
<li><a href="https://github.com/akhilnadendla">Akhil Nadendla</a></li>
</ul>
Training Production-Grade Machine Learning Pipelines 2016-10-01T00:00:00+00:00 http://rohan-varma.github.io/ML-Production
<p>A few thoughts on how machine learning models can be scaled, stored, and used in production applications.</p>
<p>Choosing, training, and testing the right machine learning classifier is a difficult task: you have to preprocess
and analyze your dataset’s features, possibly extract new features, tune hyperparameters, and perform cross-validation, just to name a few components of a typical machine learning problem.
After you’ve trained and tested a reliable classifier, it’s ready to be deployed to serve new predictions at scale.
Machine learning systems that are trained on a massive amount of data coming from a variety of sources can be hard to maintain and scale up. This post collects a few of my thoughts on deploying a machine learning architecture, specifically using Amazon Web Services.</p>
<h3 id="the-multi-model-architecture">The Multi-Model Architecture</h3>
<p>Our machine learning system has to be capable of a few different tasks:</p>
<ul>
<li>It needs to efficiently store data, as well as pull data from several different sources.</li>
<li>It should be capable of automatically re-training and testing itself. Since new data is always flowing to our system, it’s probably not a good idea to train our model only once on an initial dataset.</li>
<li>The time-consuming training phase should occur offline. When the model is trained, it should be deployed such that any arbitrary event can trigger it.</li>
<li>A user-friendly interface is essential for developers to manage the training, testing, and deployment phases of the machine learning system.</li>
</ul>
<p>For the above reasons, I’ve found the tools and infrastructure offered by AWS to be very helpful. Specifically, I’ll be talking about how we can use EC2, RDS, S3, and Lambda to build out a production-grade architecture.</p>
<h3 id="the-architecture">The Architecture</h3>
<p>Our architecture is composed of many pieces that interact with each other to train, deploy, and store our machine learning models. Here’s an overview of how our architecture could work, with details to follow:</p>
<p><img src="https://raw.githubusercontent.com/rohan-varma/rohan-blog/master/images/model.png" alt="model" /></p>
<p>Let’s review this model piece by piece.</p>
<h3 id="storage-components">Storage Components</h3>
<p>This model uses two storage components: RDS and S3. RDS (Relational Database Service) is a managed relational database hosted in the cloud, and acts as our data warehouse: we can efficiently query for data when we are testing or training our model. S3 (Simple Storage Service) will store our machine learning models as serialized data transfer objects. We’ll send these objects to other components when they need to be used or updated. Here’s how a serializable Neural Network object could be represented - using C#’s <code class="highlighter-rouge">DataContract</code> paradigm:
<script src="https://gist.github.com/rohan-varma/92b6a07db23399cfdb98f348cca9370c.js"></script></p>
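<p>The same idea can be sketched in JavaScript, using plain JSON in place of C#’s <code class="highlighter-rouge">DataContract</code> serialization. The <code class="highlighter-rouge">NeuralNetworkModel</code> class and its fields are hypothetical and much simpler than a real network; the point is only the round trip from object to string and back.</p>

```javascript
// Illustrative model container: layer sizes, weight matrices, and a version tag.
class NeuralNetworkModel {
  constructor(layerSizes, weights, version) {
    this.layerSizes = layerSizes;
    this.weights = weights;
    this.version = version;
  }

  // Turn the model into a string that could be uploaded to object storage.
  serialize() {
    return JSON.stringify({
      layerSizes: this.layerSizes,
      weights: this.weights,
      version: this.version,
    });
  }

  // Rebuild a model instance from a previously serialized string.
  static deserialize(text) {
    const data = JSON.parse(text);
    return new NeuralNetworkModel(data.layerSizes, data.weights, data.version);
  }
}

const original = new NeuralNetworkModel([2, 1], [[0.5, -0.25]], 3);
const payload = original.serialize(); // this string is what would be stored
const restored = NeuralNetworkModel.deserialize(payload);
console.log(restored.version, restored.weights);
```

<p>Keeping a version field in the payload makes it easier to tell which trained model a given prediction came from once models start being retrained.</p>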
<h3 id="offline-training">Offline Training</h3>
<p>Training highly accurate machine learning models on a lot of data can take a really long time. The training phase should occur offline (i.e., separate from our application’s use of the model) and on separate hardware. This is because training is typically a CPU/GPU-intensive process: dedicated hardware can result in faster training times, and it separates the training concern from your application. Amazon EC2 (Elastic Compute Cloud) provides compute power in the cloud as a service - you can launch new instances when you need them, and terminate them when finished (such as when all your models are trained). This lets you scale your compute resources up or down quickly.</p>
<p>We can delegate the process of training our machine learning model to EC2. EC2 will be responsible for pulling data from RDS, training a model, testing and validating it, and sending that model to be stored in S3. Additionally, we’ll need to retrain our model as new data becomes available. To do this, we can use a popular queue-based paradigm to manage the training jobs we need to get done - this is the “Training Request Queue” in our model above. Requests for training or re-training a model can be generated by our application when enough new data becomes available. Here’s what a serializable request object might look like:</p>
<script src="https://gist.github.com/rohan-varma/ad7306b3628a98db712d2b504c7d15fa.js"></script>
<p>These requests are lined up in a queue, from which a pool of EC2 instances can pull. An instance then parses the training request, which involves obtaining the data needed from RDS and information about the particular type of classifier needed. After training, the instance sends the new model object to S3, and is ready to pull another training request. If there are no more training requests, we can easily terminate the instance so as to not waste compute power.</p>
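<p>Here is a minimal sketch of that worker loop, with everything held in memory. The queue, <code class="highlighter-rouge">fetchRows</code> (standing in for a query against RDS), <code class="highlighter-rouge">train</code>, and <code class="highlighter-rouge">modelStore</code> (standing in for S3) are all hypothetical placeholders, and no AWS SDK calls are involved.</p>

```javascript
// In-memory stand-in for the Training Request Queue.
const trainingQueue = [
  { modelId: 'spam-filter', classifier: 'logistic', table: 'emails' },
  { modelId: 'churn', classifier: 'tree', table: 'users' },
];

// Plays the role of S3: maps a model id to its serialized model.
const modelStore = new Map();

// Plays the role of a query against RDS; returns a few dummy rows.
function fetchRows(table) {
  return [
    { table, label: 1 },
    { table, label: 0 },
  ];
}

// Placeholder for real training: records what was trained and on how much data.
function train(request, rows) {
  return {
    modelId: request.modelId,
    classifier: request.classifier,
    trainedOn: rows.length,
  };
}

// One worker: pull requests until the queue is empty, then report
// that the instance can be terminated.
function runWorker(name) {
  let processed = 0;
  while (trainingQueue.length > 0) {
    const request = trainingQueue.shift();
    const rows = fetchRows(request.table);
    modelStore.set(request.modelId, JSON.stringify(train(request, rows)));
    processed += 1;
  }
  return { name, processed, terminate: true };
}

const report = runWorker('worker-1');
console.log(report, Array.from(modelStore.keys()));
```

<p>With a real queue service, several such workers could run on separate instances and pull from the same queue, which is what lets the pool grow or shrink with the backlog.</p>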
<h3 id="making-predictions-at-scale-with-lambda">Making Predictions at Scale with Lambda</h3>
<p>We’ve discussed storing the relevant data and objects we need, as well as training our classifier using EC2. Now, it’s time to use our trained classifiers to serve prediction requests at scale. Lambda is a great option for this. Lambda employs a serverless architecture - you can run code without having to manage any servers or a backend service. All you have to do is upload your code and define when it should be executed, and Lambda will take care of the compute resources needed to run and scale your code.</p>
<p>Our Lambda function can simply be the relevant prediction function from our trained machine learning classifier - a function that takes our classifier’s weights and applies them to the input, and returns the predicted label. It’ll be responsible for loading the serialized model from S3, deserializing it, and outputting the prediction. If we’re training several different machine learning classifiers, we can deploy independent Lambda functions and invoke the relevant one. This way, each function represents a single model that solves a single problem.</p>
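<p>A prediction function of this kind might look like the sketch below. It assumes a simple linear classifier stored as JSON, and <code class="highlighter-rouge">fakeBucket</code> stands in for S3 so the example runs without the AWS SDK; the object key, the event fields, and the weights are made up for illustration.</p>

```javascript
// Stand-in for S3 (assumption): maps an object key to a serialized model.
const fakeBucket = {
  'models/spam-filter.json': JSON.stringify({ weights: [2, -1], bias: -0.5 }),
};

// Linear classifier: label 1 when the weighted sum plus bias is positive, else 0.
function predict(model, features) {
  const score = model.weights.reduce(
    (sum, weight, i) => sum + weight * features[i],
    model.bias
  );
  return score > 0 ? 1 : 0;
}

// Handler in the shape of an async Lambda function: event in, result out.
// It loads the serialized model, deserializes it, and returns the prediction.
async function handler(event) {
  const model = JSON.parse(fakeBucket[event.modelKey]);
  return { label: predict(model, event.features) };
}

handler({ modelKey: 'models/spam-filter.json', features: [1, 1] })
  .then((result) => console.log(result));
```

<p>In a deployed function, the lookup in <code class="highlighter-rouge">fakeBucket</code> would be replaced by a download from S3, and the deserialized model could be cached between invocations to avoid reloading it on every request.</p>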
<p>Along with writing the code for our function, we’ll have to define <code class="highlighter-rouge">triggers</code> that invoke our function. These can be nearly anything - API requests, updates from S3, or explicit calls. This makes it easy to turn our machine learning applications into several reusable microservices.</p>
<p>And that’s it! Having a well-defined machine learning infrastructure to use in production makes it easier to scale up, encapsulate different tasks, and quickly track problems when something’s not working. There’s definitely a lot more to doing machine learning at scale well - such as extracting the right features, preprocessing your dataset, and choosing the right classifier for the task. Thanks for reading!</p>