Neural Networks: 1 to 100

The first question that comes to any normal human being's mind is: what the hell even is a neural network? That is completely fine, and it deserves an answer right away.


What is a Neural Network?

A neural network is a machine learning model that stacks simple neurons in layers and learns pattern-recognizing weights and biases from data to map inputs to outputs. Neural networks are among the most influential algorithms in modern machine learning and artificial intelligence.

Neural networks are a subset of machine learning. They are computational models inspired by the human brain, designed to recognize patterns and solve complex problems by processing data through interconnected layers of nodes (neurons). Neural networks are the backbone of deep learning, the branch of machine learning built on networks with many layers.

Note: The inspiration for neural networks comes from the biological neurons in the human brain, which communicate through electrical signals. The perceptron is the historical ancestor of today's networks: essentially a linear model with a constrained output.


Architecture of Neural Networks

A neural network is organized as a sequence of layers through which data flows from the input to the final output. The first layer is called the input layer, and it is responsible for receiving the raw data. The data is then passed through the hidden layers, where the learned weights transform it step by step, and finally to the output layer, which produces the prediction.

Note: There can be any number of hidden layers stacked between the input and the output layer.

Neural networks tackle problems that people solve intuitively, such as recognizing a face or understanding speech, by learning patterns from data instead of following hand-written rules.

It is called a neural network because it is made of smaller units called neurons, and these neurons are the basic building blocks that work together to form the whole network.


The Neuron: A Tiny Decision Maker

Neuron — the basic unit of information processing. It acts as a mathematical function that receives inputs, calculates a weighted sum, adds bias, and applies an activation function to produce an output.

Inspired by biological neurons, these nodes are organized into layers to build AI models capable of complex pattern recognition.

Key Components of a Neuron:

  1. Inputs — Numerical data points representing features (e.g., house size, pixel intensity).
  2. Weights & Summation — Importance assigned to each input, combined into a single weighted sum.
  3. Activation — A non-linear function that decides whether the signal should "fire" or be suppressed.

Anatomy of a Neuron

output = activation(Σ wᵢxᵢ + b)

Complete breakdown:

  1. Weights (w) — Determine the strength and importance of each input signal.
  2. Bias (b) — An offset that allows the neuron to shift its decision boundary.
  3. Activation — A non-linear function that decides if the signal should fire or be suppressed.

The Step-by-Step Data Flow Inside a Neuron

Within a single neuron, the data flows through four key steps:

  1. Input Reception — The neuron receives input values from the previous layer or raw data.
  2. Weighted Summation — Each input is multiplied by a corresponding weight, which dictates the strength of that input's connection to the neuron. All weighted inputs are summed together, and a bias is added, which adjusts the activation threshold.
  3. Activation Function — The resulting weighted sum is passed through a nonlinear activation function (e.g., ReLU or Sigmoid). This determines if the neuron "fires" and introduces nonlinearity to model complex data patterns.
  4. Output Generation — The final activation value is passed as input to the next layer of neurons.
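
The four steps above can be sketched in a few lines of plain Python. This is only an illustration: the function name `neuron`, the choice of Sigmoid as the activation, and the input values are assumptions made for the example, not something fixed by the text.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Compute activation(sum(w_i * x_i) + b) for a single neuron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Step 2: weighted summation plus bias
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 3: activation function
    return activation(z)

# Step 1: hypothetical input values (two features)
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.0

# Step 4: the output that would be handed to the next layer
# Here z = 0.5*1.0 - 0.25*2.0 + 0.0 = 0.0, and sigmoid(0.0) = 0.5
out = neuron(x, w, b)
print(out)
```

Note how the negative weight on the second input cancels the contribution of the first, which is exactly the "push in the opposite direction" behaviour of negative weights.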

What are weights? — Weights are numerical values that show how important each input is to a neuron. A higher weight makes that input have more effect on the output. A negative weight can reduce the output or push it in the opposite direction.

Bias — Bias is an extra number added to the weighted sum of inputs. It lets the neuron produce an output even when inputs are small or zero, making the model more flexible.

How are parameters calculated? — Weights and biases are called parameters. They are not calculated from a formula at the start — they are initialized randomly and then learned during training by adjusting them step by step to reduce prediction error.


Activation Functions

Here is the thing that makes neural networks actually powerful. Without activation functions, no matter how many layers you stack, the entire network collapses into a single linear equation. A linear function of a linear function is still linear. That means the network can only draw straight lines — which is useless for anything remotely complex.

Activation functions introduce non-linearity. They decide whether a neuron should fire and how strongly, which lets a network with enough neurons approximate a very wide range of functions instead of only straight lines.
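
The collapse of stacked linear layers can be checked numerically. The sketch below uses made-up 2×2 weights and small helper functions (`matvec`, `matmul`, `add`) written just for this example; it shows that two layers without an activation give the same result as a single layer with W = W2·W1 and b = W2·b1 + b2.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

# Two "layers" with no activation function (hypothetical weights)
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [1.0, 0.0]
W2, b2 = [[2.0, 0.0], [1.0, 1.0]], [0.0, 3.0]

def two_linear_layers(x):
    h = add(matvec(W1, x), b1)
    return add(matvec(W2, h), b2)

# The same mapping written as ONE linear layer
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_linear_layer(x):
    return add(matvec(W, x), b)

x = [3.0, -2.0]
print(two_linear_layers(x), one_linear_layer(x))  # both print [0.0, 5.0]
```

Inserting any non-linear activation between the two layers breaks this equivalence, which is the whole point of having one.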

ReLU — Rectified Linear Unit

f(z) = max(0, z)

ReLU is the default choice for hidden layers in almost every modern neural network. It is computationally cheap, does not suffer from the vanishing gradient problem for positive values, and trains fast.

The problem: Dying ReLU. If a neuron's input is always negative, the gradient is always zero, and the neuron stops learning permanently. Variants like Leaky ReLU fix this by allowing a small negative slope instead of a hard zero.

Why ReLU works so well — It keeps gradients alive for positive values (gradient = 1), which means the network can learn efficiently across many layers. This is what made deep networks practical.
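
A minimal sketch of ReLU, Leaky ReLU, and their gradients. The negative slope of 0.01 is a commonly used value but is an assumption here, as are the function names.

```python
def relu(z):
    """f(z) = max(0, z)"""
    return max(0.0, z)

def relu_grad(z):
    """Gradient is 1 for positive inputs and 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, slope=0.01):
    """Like ReLU, but negative inputs keep a small slope (assumed 0.01)."""
    return z if z > 0 else slope * z

def leaky_relu_grad(z, slope=0.01):
    """Gradient never drops to zero, so the neuron can keep learning."""
    return 1.0 if z > 0 else slope

for z in (-2.0, 3.0):
    print(z, relu(z), relu_grad(z), leaky_relu(z), leaky_relu_grad(z))
```

For the negative input, ReLU reports a gradient of exactly zero (the dying ReLU case), while Leaky ReLU still passes a small gradient back.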

Sigmoid

f(z) = 1 / (1 + e⁻ᶻ)

Sigmoid squashes any input into the range (0, 1), making it a natural fit for binary classification — the output can directly be interpreted as a probability.

The problem: Vanishing gradient. For inputs far from zero in either direction, the sigmoid function becomes nearly flat, and its gradient approaches zero. Even at its steepest point (z = 0) the gradient is only 0.25. During backpropagation, this small gradient gets multiplied across layers and virtually disappears, which makes deep networks using sigmoid very hard to train.
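
The effect is easy to see numerically. The sketch below uses the standard identity that the derivative of Sigmoid is f(z)·(1 − f(z)); the sample inputs and the depth of 10 layers are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The gradient peaks at z = 0 and shrinks quickly as |z| grows
for z in (0.0, 5.0, 10.0):
    print(z, sigmoid_grad(z))

# Best case: every one of 10 stacked sigmoid layers sits at its peak gradient
best_case = 0.25 ** 10
print(best_case)
```

Even in the best case, ten sigmoid layers scale the gradient down by a factor of roughly a million before it reaches the first layer.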

Tanh — Hyperbolic Tangent

f(z) = (eᶻ - e⁻ᶻ) / (eᶻ + e⁻ᶻ)

Tanh is similar to Sigmoid but outputs values in the range (-1, 1). This zero-centered output is important: the activations passed to the next layer are not all positive, so the weight gradients in that layer are not all forced to share the same sign, which leads to more stable updates during training.

It still suffers from vanishing gradients at the extremes, but it generally outperforms Sigmoid in hidden layers.
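
A short sketch of both properties, using `math.tanh` from the Python standard library and the standard derivative 1 − tanh²(z). The helper name `tanh_grad` and the sample inputs are chosen for this example only.

```python
import math

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

# Zero-centered: outputs are symmetric around 0
print(math.tanh(-2.0), math.tanh(0.0), math.tanh(2.0))

# Peak gradient is 1.0 at z = 0 (Sigmoid peaks at 0.25),
# but it still vanishes for inputs far from zero
print(tanh_grad(0.0), tanh_grad(5.0))
```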

Softmax

f(zᵢ) = exp(zᵢ) / Σⱼ exp(zⱼ)

Softmax is typically used in the output layer for multi-class classification. It converts a vector of raw scores (called logits) into a probability distribution: every output is between 0 and 1, and all outputs sum to 1. The class with the highest probability is the prediction.
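
A minimal Softmax sketch in plain Python. Subtracting the largest logit before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the result unchanged but avoids overflow for large scores. The logits are made-up values.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(logits)  # shift for numerical stability (result is unchanged)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical raw scores for three classes
probs = softmax(logits)
predicted_class = probs.index(max(probs))

print(probs)
print(sum(probs), predicted_class)
```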

Activation | Output Range      | Best Used For      | Problem
ReLU       | [0, ∞)            | Hidden layers      | Dying neurons
Sigmoid    | (0, 1)            | Binary output      | Vanishing gradient
Tanh       | (-1, 1)           | Hidden layers      | Vanishing gradient
Softmax    | (0, 1), sums to 1 | Multi-class output |

Forward Propagation

Forward propagation is the process of passing data through the network, layer by layer, from input to output. For each layer, the computation is straightforward:

z = W · x + b
a = activation(z)

Where W is the weight matrix, x is the input, b is the bias vector, and a is the output (activation) of that layer.

The final output ŷ is the network's prediction. We then compare it to the true label y using a loss function.
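
As a rough sketch, here is forward propagation through a tiny network with two inputs, two hidden neurons (ReLU), and one output neuron (no activation, as in regression). The layer sizes, weights, and the helper name `dense` are all assumptions made for the example.

```python
def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def dense(W, x, b, activation):
    """One layer: z = W · x + b, then a = activation(z), element by element."""
    z = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return [activation(zi) for zi in z]

# Hypothetical parameters: each row of W holds one neuron's weights
W1, b1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]   # hidden layer (2 neurons)
W2, b2 = [[2.0, -2.0]], [0.5]                      # output layer (1 neuron)

x = [2.0, 1.0]                    # input
a1 = dense(W1, x, b1, relu)       # hidden activations: [1.0, 0.5]
y_hat = dense(W2, a1, b2, identity)

print(a1, y_hat)                  # y_hat is [1.5]
```

Each layer's output `a` simply becomes the next layer's input `x`, which is all that "layer by layer" means here.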


Loss Functions

The loss function measures how wrong the network's prediction is. This is the signal the network uses to improve.

For Regression — Mean Squared Error (MSE):

L = (1/n) Σ (yᵢ - ŷᵢ)²

MSE penalizes larger errors more heavily because of the square. It is smooth and differentiable everywhere, which makes it easy to optimize.
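
A small MSE sketch with made-up targets and predictions, to show how the square weights a large error more heavily than a small one.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

y_true = [3.0, 5.0, 2.0]   # hypothetical true values
y_pred = [2.0, 5.0, 4.0]   # hypothetical predictions

# Errors are 1, 0 and -2, so the loss is (1 + 0 + 4) / 3
loss = mse(y_true, y_pred)
print(loss)
```

The prediction that is off by 2 contributes four times as much to the loss as the one that is off by 1.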

For Classification — Cross-Entropy Loss:

L = -Σ yᵢ · log(ŷᵢ)

Cross-entropy measures the difference between the true probability distribution and the predicted distribution. It pairs naturally with the Softmax activation in the output layer.
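
A minimal cross-entropy sketch for a single example with a one-hot true label. The small `eps` clamp is an assumption added to avoid taking log(0); the two prediction vectors are made up to contrast a confident correct prediction with an unsure one.

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """L = -sum(y_i * log(y_hat_i)); eps guards against log(0)."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))

y_true = [0.0, 1.0, 0.0]          # the true class is index 1
confident = [0.05, 0.90, 0.05]    # hypothetical Softmax outputs
unsure = [0.40, 0.30, 0.30]

loss_confident = cross_entropy(y_true, confident)
loss_unsure = cross_entropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

With a one-hot label, only the probability assigned to the true class matters, so the loss reduces to -log of that probability.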

The goal of training — Find the weights W and biases b that minimize the loss function L. Every step of training is pushing towards a lower loss.


Backpropagation

Once the loss is computed, the network needs to figure out: which weights contributed most to the error, and in which direction should they change?

This is what backpropagation does. It uses the chain rule of calculus to propagate the gradient of the loss backwards through the network — from the output layer all the way to the input layer.

For each weight w, backpropagation computes ∂L/∂w — the partial derivative of the loss with respect to that weight. This gradient tells us: if we increase w slightly, does the loss go up or down, and by how much?

Chain Rule — The reason backprop works across many layers is the chain rule. The gradient at each layer is computed by multiplying the gradient arriving from the layer after it by that layer's own local derivatives: the slope of its activation function and the weights that connect it to the layer before. This chain of multiplications flows all the way back to the first layer.
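
To make the chain rule concrete, here is a sketch for the smallest possible case: one Sigmoid neuron with one input and a squared-error loss on a single example. The setup and all values are assumptions chosen for illustration. The analytic gradient is compared against a finite-difference estimate as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Squared error of one sigmoid neuron on one example."""
    a = sigmoid(w * x + b)
    return (a - y) ** 2

def gradients(w, b, x, y):
    """Chain rule: dL/dw = dL/da * da/dz * dz/dw (and likewise for b)."""
    z = w * x + b
    a = sigmoid(z)
    dL_da = 2.0 * (a - y)     # derivative of the squared error
    da_dz = a * (1.0 - a)     # local gradient of the sigmoid
    dz_dw, dz_db = x, 1.0     # local gradients of z = w*x + b
    return dL_da * da_dz * dz_dw, dL_da * da_dz * dz_db

w, b, x, y = 0.5, -0.2, 1.5, 1.0   # hypothetical parameter and data values
dw, db = gradients(w, b, x, y)

# Sanity check: central finite difference approximation of dL/dw
h = 1e-6
dw_numeric = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)
print(dw, dw_numeric)
```

Here `dw` comes out negative: increasing w slightly would lower the loss, so the weight should be increased.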


Gradient Descent

Now that we have gradients ∂L/∂w for every parameter, we update the weights to reduce the loss:

w = w - η · (∂L/∂w)

Where η (eta) is the learning rate — a small positive number that controls how large each update step is.

The learning rate matters a lot:

  • Too large → the updates overshoot the minimum, and the loss bounces around or diverges.
  • Too small → training takes forever, and you might get stuck in a local minimum.

A good default starting point is η = 0.001 with the Adam optimizer, which adapts the learning rate automatically for each parameter.
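
The effect of the learning rate can be seen on a toy problem. The sketch below applies the update rule to L(w) = (w − 3)², whose minimum is at w = 3. This function, the step count, and the three learning rates are assumptions picked for illustration, and the code uses plain gradient descent, not Adam.

```python
def gradient_descent(lr, steps=50, w=0.0):
    """Minimize L(w) = (w - 3)^2 starting from w, using a fixed learning rate."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)   # dL/dw
        w = w - lr * grad        # the update rule: w = w - lr * dL/dw
    return w

good = gradient_descent(lr=0.1)        # converges close to 3
too_large = gradient_descent(lr=1.1)   # overshoots further on every step
too_small = gradient_descent(lr=0.001) # barely moves away from 0 in 50 steps

print(good, too_large, too_small)
```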

Variants of Gradient Descent

Variant             | What it does
Batch GD            | Uses all training samples per update. Stable but slow.
Stochastic GD (SGD) | Uses one sample per update. Fast but noisy.
Mini-batch GD       | Uses a small batch (e.g., 32 or 64 samples). Best of both worlds.

In practice, mini-batch gradient descent is the standard. The batch size is a hyperparameter you choose — 32 and 64 are common defaults.
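
One way to split a dataset into shuffled mini-batches, sketched with the standard library only. The generator name `minibatches` and the fixed seed are assumptions for the example.

```python
import random

def minibatches(samples, batch_size, seed=0):
    """Yield shuffled mini-batches; the last one may be smaller."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)   # fixed seed only for repeatability
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

data = list(range(10))                     # stand-in for 10 training samples
batches = list(minibatches(data, batch_size=4))
sizes = [len(batch) for batch in batches]
print(sizes)                               # [4, 4, 2]
```

Setting `batch_size=1` gives stochastic gradient descent, and `batch_size=len(data)` gives batch gradient descent, so the three variants differ only in this one setting.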


Putting It All Together — The Training Loop

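Training a neural network is a loop that runs for many epochs (one epoch = one full pass through the training dataset). For every batch, the loop repeats the same four steps:

  1. Run forward propagation on the batch to get the predictions ŷ.
  2. Compute the loss between ŷ and the true labels y.
  3. Run backpropagation to get the gradient ∂L/∂w for every parameter.
  4. Update the weights and biases with gradient descent.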

Key terminology:

  • Epoch — One complete pass through the full training dataset.
  • Batch — A subset of training samples used for one weight update.
  • Iteration — One forward + backward pass on one batch.
  • Hyperparameters — Settings you choose before training (learning rate, batch size, number of layers, number of neurons). These are not learned — they are set by you.
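
The loop and the terminology above can be put together in a small sketch. To keep it short, the "network" is a single linear neuron trained with mini-batch gradient descent and MSE on made-up data that follows y = 2x + 1; the data, the hyperparameter values, and the hand-derived gradients are all assumptions for illustration.

```python
import random

# Hypothetical toy data: y = 2x + 1, which one linear neuron can fit exactly
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(20)]

# Hyperparameters: chosen by hand, not learned
learning_rate = 0.1
batch_size = 4
epochs = 200

rng = random.Random(0)
w, b = rng.uniform(-1.0, 1.0), 0.0   # parameters: random weight, zero bias

def mean_loss(w, b):
    """MSE of the neuron y_hat = w*x + b over the whole dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

initial_loss = mean_loss(w, b)

for epoch in range(epochs):                         # one epoch = one full pass
    rng.shuffle(data)
    for start in range(0, len(data), batch_size):   # one iteration per batch
        batch = data[start:start + batch_size]
        dw = db = 0.0
        for x, y in batch:
            error = (w * x + b) - y                 # forward pass: y_hat - y
            dw += 2.0 * error * x / len(batch)      # dL/dw for MSE
            db += 2.0 * error / len(batch)          # dL/db for MSE
        w -= learning_rate * dw                     # gradient descent update
        b -= learning_rate * db

final_loss = mean_loss(w, b)
print(w, b, initial_loss, final_loss)
```

With 20 samples and a batch size of 4, each epoch consists of 5 iterations. After training, w and b should end up close to 2 and 1.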

Types of Neural Networks

Once you understand the core mechanics, the world of neural networks opens up into many specialized architectures:

  • CNN — Uses convolutional filters to detect spatial patterns in images. Powers image classification, object detection, and face recognition.
  • RNN / LSTM — Designed for sequential data with memory. Each step can remember information from previous steps. Used in time series and early NLP.
  • Transformers — The architecture behind GPT, Claude, and virtually every modern large language model. They replaced RNNs for most sequence tasks by using a mechanism called attention.

Where Neural Networks Are Used Today

Neural networks are not a lab curiosity anymore — they are embedded in tools you use every day:

  • Computer Vision — Face unlock on your phone, medical image diagnosis, self-driving cars.
  • Natural Language Processing — Search engines, chatbots, translation, summarization.
  • Speech Recognition — Siri, Alexa, Google Assistant.
  • Recommendation Systems — YouTube, Netflix, Spotify all use deep networks to suggest content.
  • Drug Discovery — Predicting protein structures (AlphaFold), finding new drug candidates.
  • Code Generation — GitHub Copilot, Claude, ChatGPT.

What You Have Covered

From one neuron doing a weighted sum to a full training loop updating millions of parameters, this is the complete picture.

The math is not magic. It is just calculus, linear algebra, and a lot of iteration. The more you train networks and read the gradients, the more intuition you build. The best way to go from knowing this conceptually to really knowing it is to build one — a small network from scratch in NumPy, before reaching for PyTorch or TensorFlow.

That is how you go from 1 to 100.


Built with love by Sidhant Singh Rathore