What is a Loss Function?

As an intuitive understanding loss would be predicted value – actual value i.e. how much off is prediction to our actual target, most of you might be familiar with R^2 loss in high school which is what we often compute while fitting points to a line.

Loss function is what quantifies the discrepancy between what the model thinks reality is, and what reality actually is.

For a single training example i.e. a single data point, it outputs a scalar representing that specific prediction’s error. When we average this error across an entire dataset or a mini batch during optimization, it becomes the Cost Function or Objective Function that the model tries to minimize.

Now why are Loss function important in ML?

Just like a physical system works on the principle of Least action (Lagrangian)

What it means is, nature basically evaluates the Lagrangian (difference of kinetic energy and potential energy) – and defines the path of the particle – which is the path of least action.

That is how our ML models also learn – following the path of least action i.e. decreasing loss function – to get to a better and better understanding of the inherent relations in the data to either output those or to output a prediction.

There are a ton of loss functions out there – some of the most common ones like RMSE, MAE, MSE etc. for regression like problems and CE, BCE, Log Loss, Hinge Loss etc. for classification like problems, we will probably cover those in detail some other time and you could try to get familiar with the basics here if you want:
https://www.datacamp.com/tutorial/loss-function-in-machine-learning

In this post I will cover some loss functions I’ve been using in my current iteration on a project involving imbalanced classification on an image (pixel level classification).

Table Of Contents

What is a Loss Function?
Binary Cross Entropy (BCE)
- Cross Entropy
Focal
Dice
Tversky
Conclusion
Implementation
Resources

Pointwise

Pointwise functions would be those, that take into account single prediction and true value, compute loss and decide how much it should weigh in overall fitting / learning trajectory.

Binary Cross Entropy (BCE)

The essence of this function lies in framing the problem as a series of independent coin flips, what that means is – we calculate the loss (log loss actually) of each pixel independently.

Which if you think about it, will heavily penalize confidently incorrect predictions (using a log scale!)

Walkthrough

Now to demonstrate how it works, lets work with this example:

We have this image with 9 pixels in which only 1/9 is flooded (our target class)

Now our first model generates a series of predictions p1, which completely misses our target, which is common in a binary classification if a class imbalance isn’t handled properly, what it means is, if the model trains on data thats 99% dry (target 0) – classifying all pixels as class 0 is a guaranteed way to optimize accuracy.

in this example, our loss would come out to be:

Now lets look at p2, the predictions of another model (or second iteration of this one), the model correctly classifies the flooded pixel and in a way localizes the neighborhood as well, i.e. some pixels next to the flooded pixel also get classified as flooded, in this case:

So if we evaluate these two iterations with BCE we would assume that p1 is superior to p2

Notes

Measures distance between predicted and true probability distributions
Best for balanced binary classification problems
It is a convex function – smooth gradient
Probabilistic – so helps callibrate for max likelihood
could have runtime errors with log(0) -> -inf
Pytorch implementation han ln and also has BCEWithLogitsLoss (Sigmoid + BCE) implementation

Cross Entropy

Cross entropy is for multi class mutually exclusive labels

Feature	Binary Cross-Entropy (BCE)	Cross-Entropy (CE)
Number of Classes	C = 2	C > 2
Output Activation	Sigmoid	Softmax
Output Layer Shape	1 neuron (or N independent)	C neurons (one per class)
Class Relationship	Independent (Can have multiple true labels)	Mutually Exclusive (Only one true label)
Target Label Format	Scalars (0 or 1) or multi-hot vectors	One-hot encoded vectors (or probability distribution)
Gradient Interaction	Gradients for each node are calculated independently.	Gradients are coupled across all output nodes via Softmax.

Focal

Focal Loss was an improved, dynamically modified BCE, that was introduced by Facebook originally for a R-CNN framework.

The essence of this function is that unlike BCE that treats all errors equally, Focal loss introduces weights, and down weights the well classified examples forcing the network to mine for hard examples.

It has two hyperparameters alpha and gamma, gamma term makes for the modulating factor dictating how much should samples who are easily classified weigh and alpha helps tune loss function to the inherent imbalance of the dataset.

Intuition is that,

gamma controls the steepness of the penalty for being wrong
alpha roughly reflects the inverse frequency of classes, but must be dampened if using a high gamma

Usually we start with gamma = 2.0 and alpha = 0.25 (which roughly corresponds to the true target being 25% of the values)

Walkthrough

Now to demonstrate how it works, lets work the same example as above:

Here we see if a pixel is classified easily, like p = 0.95 (i.e. for the void pixels of prediction 0.05), the modulating factor with gamma, reduces their contribution to overall loss function significantly.

Notes

A good potential solution for highly imbalanced datasets (especially the highly sparse ones)
Makes network focus on its representational capacity
Mines for rare / hard samples
Potentially very sensitive to noise – minor noise could be treated as extremely rare samples and dictate the optimization direction
Computationally its significantly more expensive than BCE
Numerical stability needs handling when p approaches 0 or 1
Performance highly dependent on hyperparameter selection (alpha and gamma) – adds another layer of tuning/selection

Overall – Macro Boundaries

Instead of pixel level attribution, these evaluate overall overlaps and use the True Positives, False Negatives, false Positives etc. in their calculations, letting you optimize for the desired outcome directly.

These are called geometric or boundary functions because they focus on preserving the decision boundary.

Dice

It is used to calculate overlap between predicted mask and ground truth, or as we call Intersection Over Union (IoU). The essence is the function ignores the background altogether and cares only about true and predicted values of target class.

Walkthrough

To explain how this works, we will consider this example:

Here we have an image of 100 pixels, in which 10 are flooded (i.e. our target class) and the rest 90 are class 0.

Now, lets say our model predicts 15 of the 100 pixels to be flooded, in which only 8 are truly flooded (True Positives).

This means we have 7 false alarms (False Positives), and we missed 2 true floods (False Negatives) and the rest 83 were non flooded which we guessed correctly (True Negatives)

As you can see even though TN make about 83% of our data – we don’t really care for them in our loss function.

Notes

Blind to TN
It directly optimizes for IoU which is the metric we care about
Its good when preserving boundaries is important (Medical Imaging / SAR)
Its inherently balanced since intersections are relative
because of the mentioned scale invariance – can swing wildly from image to image i.e. missing 100 pixels in 10k pixels is a way smaller penalty than missing all 10 in 10
By itself – its a non convex function and would be challenging to find a global minima – causing unstable gradients
Its blind to the types of errors – our use case cares about FP more than FN or vice versa – this loss function does not accommodate that calibration
Needs the term epsilon added – to prevent division by 0 during runtime (prone to vanishing gradients)

Tversky

Basically modified Dice with two parameters alpha and beta added to tune the FP and FN weight as per use case requirement. Th idea is that not all mistakes carry same penalty.

alpha = beta = 0.5 => we have dice loss function

alpha < beta => Penalizing FN (Optimizing Recall) – Missing a true label is more expensive

alpha > beta => Penalizing FP (Optimizing Precision) – Being confident in true labels is critical

Walkthrough

To explain how this works, we will consider the same example:

Here we have an image of 100 pixels, in which 10 are flooded (i.e. our target class) and the rest 90 are class 0.

Now, lets say our model predicts 15 of the 100 pixels to be flooded, in which only 8 are truly flooded (True Positives).

This means we have 7 false alarms (False Positives), and we missed 2 true floods (False Negatives) and the rest 83 were non flooded which we guessed correctly (True Negatives)

Now, if the model instead made the opposite error (2 False Positives and 7 False Negatives), the denominator would become 8 + (0.3 * 2) + (0.7 * 7) = 13.5, resulting in a Tversky Index of 0.59 and a higher Loss of 0.41. The math explicitly punishes the network harder for missing the target than for over predicting it

Notes

pretty much same as dice, completely blind to true negatives
great for real world usecases / risks – lets optimize penalties for FP, FN
Like focal loss – sensitive to parameter selection – adds another layer of hyperparameter tuning and selection
Because of hyperparameter weighting – can be really sensitive to noise – if say FN penalized highly i.e. beta = 0.9, model would overindex and bloat target label predictions
Can optimize better for Recall than Dice
like dice non convex – can lead to unstable gradients if used by itself
Same issues as Dice for scale invariance – can fluctuate wildly from image to image
Same issues as Dice needs the term epsilon added – to prevent division by 0 during runtime (prone to vanishing gradients)

Conclusion

There are a number of widely tested loss functions which are better suited to a specific problem and model, I would recommend trying with the most basic ones for the first prototype and then researching the ones best suited to your use case, you might need to define a custom one yourself too! But I hope this post helps give some insight into few of the ones covered, I’ll follow up with a deep into some other ones.

Implementation

In my case I’ve tried to implement a combination of pointwise and overall loss functions

Focal + Tversky:

class FocalTverskyLoss(nn.Module):

  def __init__(
    self,
    focal_alpha: float = 0.25,      # focal class weight on positive pixels
    focal_gamma: float = 2.0,      # down-weight easy pixels: (1 - pt)^gamma
    tversky_alpha: float = 0.3,    # FP cost in Tversky denominator
    tversky_beta: float = 0.7,     # FN cost (β > α → recall-friendly)
    smooth: float = 1e-5,          # epsilon - to prevent division by 0
    tversky_weight: float = 1.0,
  ):
    super().__init__()
    # focal loss
    # alpha indicates class weight to positive class, gamma is the modulating factor
    self.focal_alpha, self.focal_gamma = focal_alpha, focal_gamma
    # tversky loss
    # beta = 0.8, alpha = 0.2  - penalizes FN 4x times more than FP
    self.tversky_alpha, self.tversky_beta = tversky_alpha, tversky_beta
    # smooth - epsilon - to prevent division by 0
    self.smooth, self.tversky_weight = smooth, tversky_weight

  def forward(self, logits, targets, valid_mask=None):
    #extract valid area to ignore void pixels
    m = torch.ones_like(targets) if valid_mask is None else valid_mask.float()
    #total number of pixels (valid)
    n = m.sum() + self.smooth  

    # --- Focal: batch-level pos_weight in BCE, then (1-pt)^gamma focusing ---
    # with torch.no_grad():
      # TP count
      # pos = (targets * m).sum()
      # TN count
      # neg = ((1 - targets) * m).sum()
      # dynamic ratio of positive to negative  (capped at 50)
      # pos_w = (neg / (pos + self.smooth)).clamp(1.0, 50.0)  # rare-flood upweight
    # calculate BCE 
    # bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none", pos_weight=pos_w)
    bce_unweighted = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    pt = torch.exp(-bce_unweighted)
    # static alpha balancing map
    alpha_t = self.focal_alpha * targets + (1.0 - self.focal_alpha) * (1.0 - targets)
    # Focal formula
    # focal = (w * (1 - pt).pow(self.focal_gamma) * bce_unweighted * m).sum() / n
    focal_map = alpha_t * (1.0 - pt).pow(self.focal_gamma) * bce_unweighted
    focal = (focal_map * m).sum() / n

    # --- Tversky: soft tp/fp/fn per sample, then mean(1 - index) ---
    # Converts raw logits to probabilities and zeroes out invalid pixels
    p = (torch.sigmoid(logits) * m).flatten(1)
    # Flatten Ground Truth
    t = (targets * m).flatten(1)
    tp = (p * t).sum(1)
    fp = (p * (1 - t)).sum(1)
    fn = ((1 - p) * t).sum(1)
    # Tversky Formula
    tversky = 1 - (tp + self.smooth) / (tp + self.tversky_alpha * fp + self.tversky_beta * fn + self.smooth)
    # Averages the Tversky scores across the batch and adds the Focal loss
    return focal + self.tversky_weight * tversky.mean()

class FocalTverskyLoss(nn.Module):

  def __init__(
    self,
    focal_alpha: float = 0.25,      # focal class weight on positive pixels
    focal_gamma: float = 2.0,      # down-weight easy pixels: (1 - pt)^gamma
    tversky_alpha: float = 0.3,    # FP cost in Tversky denominator
    tversky_beta: float = 0.7,     # FN cost (β > α → recall-friendly)
    smooth: float = 1e-5,          # epsilon - to prevent division by 0
    tversky_weight: float = 1.0,
  ):
    super().__init__()
    # focal loss
    # alpha indicates class weight to positive class, gamma is the modulating factor
    self.focal_alpha, self.focal_gamma = focal_alpha, focal_gamma
    # tversky loss
    # beta = 0.8, alpha = 0.2  - penalizes FN 4x times more than FP
    self.tversky_alpha, self.tversky_beta = tversky_alpha, tversky_beta
    # smooth - epsilon - to prevent division by 0
    self.smooth, self.tversky_weight = smooth, tversky_weight

  def forward(self, logits, targets, valid_mask=None):
    #extract valid area to ignore void pixels
    m = torch.ones_like(targets) if valid_mask is None else valid_mask.float()
    #total number of pixels (valid)
    n = m.sum() + self.smooth  

    # --- Focal: batch-level pos_weight in BCE, then (1-pt)^gamma focusing ---
    # with torch.no_grad():
      # TP count
      # pos = (targets * m).sum()
      # TN count
      # neg = ((1 - targets) * m).sum()
      # dynamic ratio of positive to negative  (capped at 50)
      # pos_w = (neg / (pos + self.smooth)).clamp(1.0, 50.0)  # rare-flood upweight
    # calculate BCE 
    # bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none", pos_weight=pos_w)
    bce_unweighted = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    pt = torch.exp(-bce_unweighted)
    # static alpha balancing map
    alpha_t = self.focal_alpha * targets + (1.0 - self.focal_alpha) * (1.0 - targets)
    # Focal formula
    # focal = (w * (1 - pt).pow(self.focal_gamma) * bce_unweighted * m).sum() / n
    focal_map = alpha_t * (1.0 - pt).pow(self.focal_gamma) * bce_unweighted
    focal = (focal_map * m).sum() / n

    # --- Tversky: soft tp/fp/fn per sample, then mean(1 - index) ---
    # Converts raw logits to probabilities and zeroes out invalid pixels
    p = (torch.sigmoid(logits) * m).flatten(1)
    # Flatten Ground Truth
    t = (targets * m).flatten(1)
    tp = (p * t).sum(1)
    fp = (p * (1 - t)).sum(1)
    fn = ((1 - p) * t).sum(1)
    # Tversky Formula
    tversky = 1 - (tp + self.smooth) / (tp + self.tversky_alpha * fp + self.tversky_beta * fn + self.smooth)
    # Averages the Tversky scores across the batch and adds the Focal loss
    return focal + self.tversky_weight * tversky.mean()

BCE + DICE loss (assuming its for more or less balanced datasets):

class BCEDiceLoss(nn.Module):
    def __init__(self, smooth: float = 1e-5):
        super().__init__()
        self.smooth = smooth
        self.bce = nn.BCEWithLogitsLoss(reduction="none")

    def forward(self, logits: torch.Tensor, targets: torch.Tensor, valid_mask: torch.Tensor = None) -> torch.Tensor:
        
        if valid_mask is None:
            valid_mask = torch.ones_like(targets)

        valid_mask = valid_mask.float()
        
        # BCE Loss (masked)
        bce_map = self.bce(logits, targets)
        bce_loss = (bce_map * valid_mask).sum() / (valid_mask.sum() + self.smooth)

        # Vectorized Dice Loss
        probs = torch.sigmoid(logits)
        
        # Flatten spatial dimensions to [Batch, Features]
        batch_size = probs.shape[0]
        p_flat = (probs * valid_mask).view(batch_size, -1)
        t_flat = (targets * valid_mask).view(batch_size, -1)
        
        inter = (p_flat * t_flat).sum(dim=1)
        p_sum = p_flat.sum(dim=1)
        t_sum = t_flat.sum(dim=1)
        
        dice_score = (2.0 * inter + self.smooth) / (p_sum + t_sum + self.smooth)
        dice_loss = (1.0 - dice_score).mean()

        return bce_loss + dice_loss

class BCEDiceLoss(nn.Module):
    def __init__(self, smooth: float = 1e-5):
        super().__init__()
        self.smooth = smooth
        self.bce = nn.BCEWithLogitsLoss(reduction="none")

    def forward(self, logits: torch.Tensor, targets: torch.Tensor, valid_mask: torch.Tensor = None) -> torch.Tensor:
        
        if valid_mask is None:
            valid_mask = torch.ones_like(targets)

        valid_mask = valid_mask.float()
        
        # BCE Loss (masked)
        bce_map = self.bce(logits, targets)
        bce_loss = (bce_map * valid_mask).sum() / (valid_mask.sum() + self.smooth)

        # Vectorized Dice Loss
        probs = torch.sigmoid(logits)
        
        # Flatten spatial dimensions to [Batch, Features]
        batch_size = probs.shape[0]
        p_flat = (probs * valid_mask).view(batch_size, -1)
        t_flat = (targets * valid_mask).view(batch_size, -1)
        
        inter = (p_flat * t_flat).sum(dim=1)
        p_sum = p_flat.sum(dim=1)
        t_sum = t_flat.sum(dim=1)
        
        dice_score = (2.0 * inter + self.smooth) / (p_sum + t_sum + self.smooth)
        dice_loss = (1.0 - dice_score).mean()

        return bce_loss + dice_loss

Resources

Basic intuition

Relatively Comprehensive Landscape

https://arxiv.org/html/2504.04242v1

Focal Loss:

https://arxiv.org/pdf/1708.02002

Live & Learn

BCE, Focal, Dice, and Tversky – Flood Prediction Loss Deep Dive

What is a Loss Function?

Pointwise

Binary Cross Entropy (BCE)

Walkthrough

Notes

Cross Entropy

Focal

Walkthrough

Notes

Overall – Macro Boundaries

Dice

Walkthrough

Notes

Tversky

Walkthrough

Notes

Conclusion

Implementation

Resources

Comments

Leave a Reply Cancel reply