Normalizing Data for ML : Methods and Lessons Learned

I ended up writing this article after digging into a tiny rabbit hole of stellar spectra normalization (a project I picked up on a weekend and was supposed to end that weekend). Normalization is critical for anything data, data analysis, data science, ML and definitely NN but normalization techniques vary for data, data type and model / process downstream where we want to leverage that data.

The aim of this article is to cover the concept of normalization, walk through different techniques, what they do and where to use them as well as proceed to some data specific normalization techniques hoping to give you some insights on how to work with this kind of data (Some of these are definitely not a good choice for spectral data – and the classification job i have, but we aim to recap all the methodologies, how they work and what they do to our data samples), so, lets start with the basics, what is normalization?

Normalization is the process of transforming data to a ‘standard scale’ keeping the relative information intact (i.e. without distorting the difference in values and keeping critical information intact).

This could mean scaling values between 0 and 1, centering them around 0, or adjusting them relative to some baseline (you decide the standard scale). In the context of spectra, it often means removing the baseline ‘continuum’ so the sharp features (like absorption/emission lines) stand out clearly.

Now, why is this important? Having a standard scale is important for visually interpreting and comparing things, some metrics and methods assume certain distributions and scales and if we’re training a model, standardizing data introduces numerical stability in the models, helps gradient convergence and could even help better identify features and outliers. The right normalization technique can significantly improve signal clarity and model performance.

Which one to chose? There is no one answer, the decision usually depends on

The type of data (images, time series, spectra (scientific data), categorical, etc.)

The goal (visual clarity, comparability, feature extraction, feeding into ML models)

The model or algorithm downstream (linear regression cares about scale, tree models don’t)

If you prefer a brief overview, here is a summary I created with the help of notebookLM :

Table Of Contents

Dataset
Classical Normalization Techniques
Continuum Normalization Techniques
Conclusion

Dataset

In this example, we’re working with resampled spectral data from the Sloan Digital Sky Survey (SDSS), mapped to a common wavelength grid from 3800 to 9200 Angstroms across 3600 bins. This preprocessing ensures consistent input size and resolution for comparative and statistical analysis.

Spectra is rich, continuous, structured data, it capture physical information about stars (like temperature, composition, and velocity) by measuring how light intensity varies with wavelength. Its messy! different days, different instruments, different atmosphere, different stars, interstellar objects can create all sorts of variations and noise.

Why this data? Simple, because I was already working with it, and its a good use case because, its continuous and messy and requires a lot of pre processing, it has a physical meaning and is sensitive to accuracy – we need to align specific spectral lines to fingerprint a star! Yoou can read more about the topic here or in my upcoming article on the project (some initial code for the whole project here).

Our spectral data could look like this:

Where each of the above image represents the spectrum of a star (1 sample of our data).

Overall, our sample data consists of about ~3200 stars and is distributed as:

Now lets proceed to the different normalization techniques!

For the sake of making a reference sheet and understanding what techniques are used in general, we’ll go through all of them here:

Classical Normalization Techniques

Each method has a unique philosophy, some preserve absolute features, others emphasize shape, while a few are robust to noise or outliers. The intention is to review one of the popular and well known normalization techniques in ML and see what it really does to our data.

Most of these are available by default in popular ML libraries like scikit-learn or tensorflow, here I just write the method, because its easy enough and will help you understand what’s happening better.

1. Min-Max Scaling

Method: Linearly rescales values using minimum and maximum values to the [0, 1] range. good for preserving shape and definite range of 0 and 1

(flux - min(flux)) / (max(flux) - min(flux) + 1e-8)

Pros: Simple and preserves relative differences.
Cons: Very sensitive to outliers, may suppress true variability.

def normalize_min_max(spectra):
    spectra_min = spectra.min(axis=1, keepdims=True)
    spectra_max = spectra.max(axis=1, keepdims=True)
    return (spectra - spectra_min) / (spectra_max - spectra_min + 1e-8)

Applying this on a single spectra (locally), each spectra is treated as isolated data – and the internal values are used to get the min and max – this would preserve the shape (like relative line depths and structure within the spectra) but would remove all the information which could be used to compare different spectras (like absolute flux, brightness etc.)

Applying this globally, that is scaling our sample relative to other samples (we use global min and max across spectras) – this would not preserve their shape but would reflect their absolute relation, how flux and brightness etc. compares to other stars

One could be fancier and apply min max class wise (its not that fancy – quite common actually for certain use cases)

If we apply our min max normalization globally – our data distribution would look something like:

2. Z-Score (Standard) Normalization

Method: Centers data at zero mean, unit variance. good for normal distributions

(flux - mean(flux)) / (std(flux) + 1e-8)

Pros: Ideal for Gaussian features, useful for statistical ML.
Cons: Still somewhat sensitive to outliers, assumes normal distribution.

def normalize_z_score(spectra):
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / (std + 1e-8)

Again, applying locally,

And applying globally,

Overall distribution of data after standardization,

3. Area Under Curve (AUC) Normalization

Method: Normalizes based on total energy (integrated flux).

flux / (∫ flux dλ + 1e-8)

Pros: Preserves energy distribution; useful in signal comparison.
Cons: Can be skewed by wide continuum slopes or noise.

def normalize_auc(spectra):
    area = np.trapz(spectra, COMMON_WAVE, axis=1).reshape(-1, 1)
    return spectra / (area + 1e-8)

Applying this locally,

Applying it globally,

the data distribution looks somewhat like,

4. Max-Peak Normalization

Method: Divides spectrum by its maximum peak value. good for relative intensities and cases where max / peak of data are important.

flux / (max(flux) + 1e-8)

Pros: Highlights peak intensity; good for emission line analysis.
Cons: Dominated by a single point — poor for noise-prone data.

def normalize_max(spectrum):
    return spectrum / (np.max(spectrum) + 1e-8)

Applying this locally,

Applying it globally,

the data distribution looks somewhat like,

5. L2 Normalization

Method: Ensures unit vector norm in high-dimensional space. good for unit length vector use cases. (certain ML models)

flux / (||flux||_2 + 1e-8)

Pros: Maintains shape in vector space, good for ML.
Cons: Doesn’t reflect physical scale, ignores energy magnitude.

def normalize_l2(spectrum):
    norm = np.linalg.norm(spectrum)
    return spectrum / (norm + 1e-8)

Applying this locally,

Applying it globally,

the data distribution looks somewhat like,

6. Robust (IQR) Normalization

Method: Uses interquartile range and median for robust scaling. handles outlier betters, median centered data (instead of mean).

(flux - median) / (IQR + 1e-8)

Pros: Resists outliers and skew; ideal for noisy spectra.
Cons: Can flatten meaningful variations outside central range.

def normalize_robust(spectrum):
    median = np.median(spectrum)
    iqr = np.percentile(spectrum, 75) - np.percentile(spectrum, 25)
    return (spectrum - median) / (iqr + 1e-8)

Applying this locally,

Applying it globally,

the data distribution looks somewhat like,

7. Log Normalization

Method: Applies log transform to compress dynamic range.

log(1 + flux)

Pros: Dampens spikes, enhances low intensity lines.
Cons: Loses physical interpretability, can’t handle negatives.

def normalize_log(spectra):
    return np.log1p(spectra)  # log(1 + x)

Applying this locally,

Applying it globally,

the data distribution looks somewhat like,

8. Quantile Normalization

Method: Aligns value ranks across spectra, forcing identical distributions. Removes technical variations between samples. Sort values, get ranks, average values at each rank across samples

Pros: Removes inter-sample technical variation.
Cons: Physically meaningless in spectroscopy

def normalize_quantile(spectra):
    sorted_X = np.sort(spectra, axis=1)
    ranks = np.argsort(np.argsort(spectra, axis=1), axis=1)
    avg_dist = np.mean(sorted_X, axis=0)
    return avg_dist[ranks]

Applying this locally,

Applying it globally,

the data distribution looks somewhat like,

Continuum Normalization Techniques

Now, after that brief recap, lets jump into what I’ve been doing these days on the side, star spectrum normalization!

While flux normalization scales spectra globally, continuum normalization removes the baseline trend (continuum) to highlight relative features like absorption or emission lines. This is crucial in stellar classification, abundance analysis, and ML model performance on astrophysical spectra.

1. Savitzky-Golay Filter

Method: Applies a polynomial smoothing filter (low degree polynomial over a window) to estimate the continuum, then divides the original spectrum by the smoothed version.

continuum = savgol_filter(flux, window_length=151, polyorder=3)
norm_flux = flux / continuum

Pros: Best for clean spectras with broad features, fast, no need for model fitting, preserves broad shape.
Cons: Can over smooth broad features, bad for low SNR spectra (or noisy data).

from scipy.signal import savgol_filter

def continuum_savgol(flux, window=None, poly=3, clip_percentile=90):

    if window is None: window = min(151, len(flux) // 10 | 1) 
    if window % 2 == 0: window += 1  # odd window (req by savgol filter)

    # Mask absorption lines (keep only upper X% flux values)
    threshold = np.percentile(flux, clip_percentile)
    mask = flux > threshold

    # Interpolate over masked points for smoother baseline
    interp_flux = np.copy(flux)
    interp_flux[~mask] = np.interp(np.flatnonzero(~mask), np.flatnonzero(mask), flux[mask])

    continuum = savgol_filter(interp_flux, window_length=window, polyorder=poly, mode='interp')

    #Avoid division by very very small numbers so its not inf and ensure its not 0 or -ve
    continuum = np.clip(continuum, 1e-5, None) 

    return flux / continuum, continuum

Methodology Overview:

This function would auto select a window if not given (min 151 and odd)
then it identifies all the high points in the spectrum based on the given threshold ( by default we clip at 90% i.e. we keep only 10% of brightest points)
then it creates a mask (keeping all thehigh points and filtering all the absorption region)
fills the dips in the mask by interpolating
applies smoothing filter on this data (this is savgol step – we use a windowing lower polynomial estimation for this smoothing – the order is 3 by default)
(and ensures non zeroes) => this is our cntinuum
divides the flux by the continuum to get a normalized flux

2. Spline Fit

Method: Identifies high-flux points across bins, fits a spline through them to approximate the continuum, and normalizes by the result.

spline = UnivariateSpline(wave[mask], flux[mask], s=s_factor)
continuum = spline(wave)
norm_flux = flux / continuum

Pros: Adapts well to variable continuum shape.
Cons: Sensitive to masking thresholds, can fail if not enough valid points.

from scipy.interpolate import UnivariateSpline

def continuum_spline(wave, flux, high_percentile=90, s_factor=1e-4):
    mask = np.zeros_like(flux, dtype=bool)
    # n_bins = 5
    # bins = np.linspace(wave.min(), wave.max(), n_bins+1)
    n_bins = max(10, len(flux) // 300)
    bins = np.linspace(wave.min(), wave.max(), n_bins + 1)

    for i in range(n_bins):
        bin_mask = (wave >= bins[i]) & (wave < bins[i+1])
        bin_flux = flux[bin_mask]
        if len(bin_flux) == 0:
            continue
        threshold = np.percentile(bin_flux, high_percentile)
        mask[bin_mask] |= (flux[bin_mask] > threshold)

    if np.sum(mask) < 10:
        raise ValueError("Too few points to fit spline")
    
    s = s_factor * (wave.max() - wave.min()) * len(wave)
    spline = UnivariateSpline(wave[mask], flux[mask], s=s)
    continuum = spline(wave)
    continuum = np.clip(continuum, 1e-4, None)
    return flux / continuum, continuum

Methodology Overview:

Initialize masks and bins (10 minimum) but its based on a bin per every 300 points
in each bin we then select a percentile of points based on bin flux ( by default we clip at 90% i.e. we keep only 10% of brightest points)
then we define a smoothing factor based on wave range (s=0 would mean exact interpolation through all the points and s>0 would encourage smoothing)
then we call univariate splie fit (piecewise polynomial through the selected lines)
its important to have a high enough s to prevent overfitting but not too high to miss important features, not have an aggressive high points percentile resulting in missing continuum in noisyregions, have decent amount of points per bin

3. Spline with Savitzky-Golay + Sigma Clipping

Method: Combines smoothing, sigma clipping, and bin-based peak selection to choose continuum points, followed by spline fitting.

# Apply median filter, then savgol, sigma clip, then spline fit on top N% points per bin
norm_flux, continuum = continuum_spline_wsavgol(wave, flux)

Pros: Robust to noise and adaptable to spectra with varying features.
Cons: Computationally heavier; requires fine-tuning of bins and smoothing.

from scipy.interpolate import UnivariateSpline
from scipy.signal import savgol_filter, medfilt

def continuum_spline_wsavgol(wave, flux, smooth_window=None, sigma=2.5, s_factor=1e-4, top_n_per_bin=10, n_bins=15, verbose=False):
    if smooth_window is None: smooth_window = min(151, len(flux) // 10 | 1)
    if smooth_window % 2 == 0: smooth_window += 1

    # Step 1: Initial smoothing
    init_smooth = medfilt(flux, kernel_size=15)
    smoothed = savgol_filter(init_smooth, window_length=smooth_window, polyorder=3)

    # Step 2: Sigma clipping
    mask = np.ones_like(flux, dtype=bool)
    for _ in range(5):
        residual = flux[mask] - smoothed[mask]
        mad = np.median(np.abs(residual - np.median(residual)))
        if mad < 1e-10:
            break
        std = 1.4826 * mad
        mask[mask] = np.abs(residual) < sigma * std
        smoothed = savgol_filter(flux * mask, window_length=smooth_window, polyorder=3)

    # Step 3: Always get points from each bin
    bins = np.linspace(wave.min(), wave.max(), n_bins + 1)
    cont_mask = np.zeros_like(flux, dtype=bool)

    for i in range(n_bins):
        in_bin = (wave >= bins[i]) & (wave < bins[i + 1])
        idx = np.where(in_bin & mask)[0]
        if len(idx) == 0:
            idx = np.where(in_bin)[0]
        if len(idx) > 0:
            top_idx = idx[np.argsort(flux[idx])[-min(top_n_per_bin, len(idx)):]]
            cont_mask[top_idx] = True

    # Step 4: Fit spline to selected points
    wave_scaled = (wave - wave.mean()) / wave.std()
    s = s_factor * len(wave) * np.median(flux)**2
    spline = UnivariateSpline(wave_scaled[cont_mask], flux[cont_mask], s=s)
    continuum = spline(wave_scaled)
    continuum = np.clip(continuum, 1e-5, None)
    norm = flux / continuum
    norm = np.clip(norm, 0, 10)

    if verbose:
        print(f"Spline fit with {np.sum(cont_mask)} points (no fallback). s={s:.2e}")

    return norm, continuum

Methodology Overview:

This function would auto select a window if not given (min 151 and odd)
Then it will do a median filtering to remove local spikes
This is followed by a savgol smoothing (low ordered windowed polynomial approximation)
Then we do an iterative sigma clipping to remove outliers that deviate too much – we use median absolute deviation instead of standard deviation here to make it more robust to outliers (we use 2.5 sigma clipping by default)
then we divide the data into bins
within each bin we take the points that survived the sigma clipping (if there are no points we just take all points)
we sort these points by flux and select the top n brightest points (10 by default)
we standardize the wave (for stability) and select an appropriate smoothing factor and then proceed to fit a spline using our selected mask
the above steps give us our continuum
then we just divide our flux by the continuum baseline to get normalized flux

4. Polynomial Fit

Method: Fits a low-degree polynomial to the top percentile of flux values.

coeffs = np.polyfit(wave_scaled[mask], flux[mask], deg)
continuum = np.polyval(coeffs, wave_scaled)
norm_flux = flux / continuum

Pros: Fast, interpretable, ideal for low-resolution data.
Cons: Poor for complex spectra; can underfit or overfit depending on degree.

def continuum_polynomial(wave, flux, deg=5, clip_percentile=90):

    wave_scaled = (wave - wave.mean()) / wave.std()

    threshold = np.percentile(flux, clip_percentile)
    mask = flux > threshold

    coeffs = np.polyfit(wave_scaled[mask], flux[mask], deg)
    continuum = np.polyval(coeffs, wave_scaled)
    continuum = np.clip(continuum, 1e-5, None)

    return flux / continuum, continuum

Methodology Overview:

relatively simple – we take the highest points by flux (default 90 percentile clipping – i.e. we keep only 10% of brightest points)
fit a polynomial through those points to get the continuum
we divide flux by the continuum to get a normalized flux

5. Weighted Polynomial Fit

Method: Uses smoothed flux as weights to emphasize higher values while fitting the polynomial continuum.

weights = np.clip(medfilt(flux, 5) / threshold, 0, 1)
coeffs = np.polyfit(wave_scaled, flux, deg=4, w=weights)
continuum = np.polyval(coeffs, wave_scaled)

Pros: Soft masking avoids hard thresholds; more stable than hard masking.
Cons: Still limited in flexibility compared to splines.

from scipy.signal import medfilt

def continuum_polynomial_weighted(wave, flux, deg=4, high_percentile=90, smooth=True):
    wave_scaled = (wave - wave.mean()) / wave.std()

    if smooth:
        flux_for_weights = medfilt(flux, kernel_size=5)
    else:
        flux_for_weights = flux

    threshold = np.percentile(flux_for_weights, high_percentile)
    weights = np.clip(flux_for_weights / threshold, 0, 1)

    coeffs = np.polyfit(wave_scaled, flux, deg=deg, w=weights)
    continuum = np.polyval(coeffs, wave_scaled)
    continuum = np.clip(continuum, 1e-4, None)

    return flux / continuum, continuum

Methodology Overview:

standardizes wave for numerical stability (polynomial fitting could be severely affected by large values)
clips high percentile points based on flux
sets a threshold (weights for points passing the threshold close to 1) – all weights [0,1]
weighted polynomial fitting (so the brighter regions have more influence)
this gives our continuum, then we just divid flux by continuum to get normalized flux

6. Auto-Selected Continuum Method

Method: Applies all continuum methods, scores them based on smoothness, flatness, and curvature, and selects the best one automatically.

# Score = smoothness + 0.5 * flatness + 2.0 * clipping_penalty + 0.3 * curvature
best_method = min(scored, key=lambda x: x[3])

Pros: Adaptable across diverse data qualities; removes manual guesswork.
Cons: Depends on scoring heuristics; black-box to non-experts.

def score_method(norm, cont):
    smoothed = savgol_filter(cont, 101, 3)
    smoothness = np.std(cont - smoothed)
    flatness = np.abs(np.median(norm) - 1.0)
    clip_penalty = np.mean((norm > 5) | (norm < 0.2))
    curvature = np.std(np.gradient(cont))
    score = (
        1.0 * smoothness +
        0.5 * flatness +
        2.0 * clip_penalty +
        0.3 * curvature
    )
    return score

def auto_continuum_normalize(wave, flux, verbose=False, methods=None):
    if methods is None:
        methods = [
            "spline_wsavgol",
            "polynomial_weighted",
            "savgol"
        ]

    results = []

    for method in methods:
        try:
            if method == "spline_wsavgol":
                norm, cont = continuum_spline_wsavgol(wave, flux)
            elif method == "polynomial_weighted":
                norm, cont = continuum_polynomial_weighted(wave, flux)
            elif method == "savgol":
                cont = savgol_filter(flux, window_length=151, polyorder=3, mode='interp')
                cont = np.clip(cont, 1e-5, None)
                norm = flux / cont
            else:
                continue  # Unknown method

            results.append((method, norm, cont))
            if verbose:
                print(f"{method} succeeded")

        except Exception as e:
            if verbose:
                print(f"{method} failed: {e}")

    if not results:
        raise RuntimeError("All continuum normalization methods failed.")

    

    scored = [(method, norm, cont, score_method(norm, cont)) for method, norm, cont in results]
    best_method, best_norm, best_cont, best_score = min(scored, key=lambda x: x[3])

    if verbose:
        print("Tried methods (with scores):")
        for method, _, _, score in scored:
            print(f" - {method}: {score:.4f}")
        print(f"Selected '{best_method}' with score {best_score:.4f}")

    return best_norm, best_cont, best_method

Methodology Overview:

tries different techniques and picks up the best one
scoring methodology is the key, to assess which one to pick, we first,
measure how bumpy each continuum is by calculating smoothness which is a standard deviation from savgol
then we measure flatness i.e. if the normalized flux is centered around 1 (median = 1)
then we measure clip penalty – fraction of outliers or extreme points
then we measure curvature i.e.e how bendy the continuum is by calculating standard deviation of gradient
we combine all the above 4 metrics into a scre 1.0*smoothness + 0.5*flatness + 2.0*clip_penalty +0.3*curvature
lower score indicates a better fit (smooth and more stable normalization)

For each technique, we visually inspected the normalized result to assess whether it retained key spectral features while flattening the continuum. This process helps guide preprocessing decisions for any spectral ML pipeline.

Continuum Normalization Analysis

For the specific spectra we’ve been analyzing so far i.e. an M type star, we observe:

Let’s look at some other stars now:

We also analyze the flux range distributions before and after normalization. Here’s how the range of normalized flux behaves after normalization:

Conclusion

Spectrum normalization is not one size fits all. Your choice should reflect the data characteristics, application goals, and downstream models. Visual inspection is key, and automated scoring can help streamline the selection process.

As you can see even our fancier algorithms failed ‘catastrophically’ on some stars, based on the graphs i shared seems like spline with savgol might be our best bet – you could modify the polynomial and weighted polynomial to atleast add a clipping so they don’t explode like that. I’ll follow up on that and new scoring methodology for my auto function. And some manual spectral line validation to see if our normalization lines up consistently with the spectral types

You can access this code on my github repo here

Live & Learn

Normalizing Data for ML : Methods and Lessons Learned

Dataset

Classical Normalization Techniques

1. Min-Max Scaling

2. Z-Score (Standard) Normalization

3. Area Under Curve (AUC) Normalization

4. Max-Peak Normalization

5. L2 Normalization

6. Robust (IQR) Normalization

7. Log Normalization

8. Quantile Normalization

Continuum Normalization Techniques

1. Savitzky-Golay Filter

2. Spline Fit

3. Spline with Savitzky-Golay + Sigma Clipping

4. Polynomial Fit

5. Weighted Polynomial Fit

6. Auto-Selected Continuum Method

Continuum Normalization Analysis

Conclusion

Comments

Leave a Reply Cancel reply