In Embeddings I we explored an LLM embedding model, the idea of using this trained encoder to make our mundane features context-aware, and the challenges of the naive ways of combining the resulting vectors.

To recap, we have a made-up use case: users and each user's interactions with different domains. We want to group users together, build a recommendation system for a user, or something along those lines. To do that (I bet there are other ways) we want to create a user profile. Earlier we got an embedding for each domain and combined them into a single user vector using frequency-weighted averaging. These are dense embeddings, not a one-hot encoding (OHE), so weighted averaging raises some concerns, both theoretical and computational, especially if we scale our little example to real data with millions of users and thousands of domains.

Here we discuss a different, somewhat hacky approach for our use case. It is not as fancy as Google’s User LLM, which we’ll probably discuss later in this series.

So, what we basically do is:

1. Embed individual domains into vector space using an LLM embedding model (text-embedding-3-small).
2. Cluster these domains in the embedding space to dynamically identify the subdomains we will describe each user in.
3. Optionally label the clusters with human-readable tags generated by GPT. I opted to pass it the top N domains by user frequency in each cluster.
4. Build user profiles as a set of weighted vectors across multiple clusters. A single user object is now Nx1536 (N cluster vectors) instead of one 1536-dimensional vector. This makes sense when you think about it: the domains in a cluster represent a single or closely related behaviour in the embedding space, so they can be combined into a representative centroid without interfering with drastically different vectors in any arithmetic operation.
5. Compute similarity between query prompts and users using different strategies (e.g. max, weighted average, softmax attention).

This lets us represent a user as a mixture of interests instead of a single-point mishmash of everything the user has done.
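To make the "mixture of interests" point concrete, here is a minimal sketch that compares a single averaged user vector with a per-interest representation. The 3-D vectors, counts and variable names are made up for illustration; they are not real embeddings.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "cluster centroids" for two unrelated interests (hypothetical values).
ski = np.array([1.0, 0.0, 0.0])
nightlife = np.array([0.0, 1.0, 0.0])

# A user who interacts equally with both interests.
counts = np.array([5.0, 5.0])
weights = counts / counts.sum()

# Option 1: collapse everything into one frequency-weighted average.
single_vector = weights[0] * ski + weights[1] * nightlife

# Option 2: keep one (centroid, weight) pair per interest.
multi_vectors = [(ski, weights[0]), (nightlife, weights[1])]

query = np.array([1.0, 0.0, 0.0])  # a query that is purely about skiing

sim_single = cosine(query, single_vector)
sim_multi_max = max(cosine(query, vec) for vec, _ in multi_vectors)

print(f"single averaged vector: {sim_single:.3f}")
print(f"best matching interest: {sim_multi_max:.3f}")
```

In this toy case the averaged vector sits halfway between the two interests, so its similarity to a pure ski query is diluted, while the per-interest representation still contains an exact match.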

https://github.com/w-winnie/livnlearn/blob/main/embeddings_combination_alternative1_livnlearnversion.ipynb

embeddings_combination_alternative1_livnlearnversion

Imports¶

In [1]:

from dotenv import load_dotenv
import os

import pandas as pd
import numpy as np

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans, MiniBatchKMeans

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

from collections import defaultdict

from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"

In [2]:

import openai
from openai import OpenAI

load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

openai_client = OpenAI(api_key=openai_api_key)

def get_embedding(text, deployment_name="text-embedding-3-small"):
    response = openai_client.embeddings.create(
        input=[text],
        model=deployment_name
    )
    return response.data[0].embedding

def call_gpt_labeller(prompt, model="gpt-4o-mini", max_tokens=50, temperature=0.2):
    response = openai_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a concise labeling assistant."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=max_tokens,
        temperature=temperature
    )
    return response.choices[0].message.content.strip()

Example 1¶

In [73]:

# df_users = pd.DataFrame([
#     {"user_id": user, "domain": search, "count": count}
#     for user, searches in user_searches.items()
#     for search, count in searches.items()
# ])

OR Generate some Random data¶

In [74]:

import random

random.seed(42)
np.random.seed(42)

themes = {
    "Skiing": ["Ski Aspen CO USA", "Ski Whistler BC Canada", "Ski Zermatt Switzerland", 
    "Ski Vail CO USA", "Ski St Anton Austria"],
    "Surfing": ["Surf Bondi Beach Australia", "Surf Huntington Beach CA USA", "Surf Tofino BC Canada", "Surf Bali Indonesia", "Surf Maui HI USA"],
    "Nightlife": ["Nightclub NYC USA", "Bar Las Vegas NV USA", "Rooftop Tokyo Japan", "Pub Dublin Ireland", "Lounge Miami FL USA"],
    "Luxury Travel": ["Four Seasons Maldives", "Private Jet Service Dubai", "Luxury Yacht Monaco", "Ritz Paris France", "Aman Tokyo Japan"],
    "Family": ["Zoo San Diego CA USA", "Pediatrician Toronto Canada", "Toy Store NYC USA", "Children Museum Boston USA", "Montessori School LA USA"],
    "Tech/Geek": ["Hackathon MIT Boston USA", "AI Lab Google Mountain View USA", "Tech Meetup SF USA", "SpaceX HQ Hawthorne USA", "AR Conference Tokyo Japan"],
    "Pop Culture": ["Taylor Swift Eras LA USA", "Comic Con San Diego USA", "Kpop Concert Seoul Korea", "Anime Expo LA USA", "Billie Eilish NYC USA"],
    "Creative Arts": ["Art Gallery Paris France", "Theatre Broadway NYC USA", "Dance Studio Mumbai India", "Film Fest Cannes France", "Poetry Slam Berlin Germany"],
    "Outdoor": ["Hiking Trail Banff Canada", "Kayaking Lake Tahoe USA", "Rock Climb Yosemite USA", "Biking Trail Moab USA", "Nature Park Costa Rica"],
    "Noise": ["Starbucks NYC USA", "Walmart LA USA", "Gas Station Houston USA", "Grocery Store Chicago USA", "CVS Pharmacy Boston USA"]
}

all_domains = []
for theme, places in themes.items():
    all_domains.extend([(place, theme) for place in places])

num_users = 50
users = []
for i in range(num_users):
    user_id = f"user_{i+1:03}"
    profile = {"user_id": user_id}
    user_data = []

    num_themes = np.random.choice([1, 2, 3], p=[0.4, 0.4, 0.2])
    selected_themes = random.sample(list(themes.keys()), num_themes)

    for theme in selected_themes:
        selected_places = random.sample(themes[theme], k=random.randint(2, 4))
        for place in selected_places:
            frequency = random.randint(1, 10)
            user_data.append({"user_id": user_id, "domain": place, "theme": theme, "count": frequency})

    # noise
    if random.random() < 0.5:
        noise_places = random.sample(themes["Noise"], k=random.randint(1, 3))
        for noise_place in noise_places:
            frequency = random.randint(10, 30)
            user_data.append({"user_id": user_id, "domain": noise_place, "theme": "Noise", "count": frequency})

    users.extend(user_data)

df_users = pd.DataFrame(users)

Create user vectors¶

domain embeddings¶

In [75]:

all_domains = df_users['domain'].unique()
# print(f"All unique domains: {all_domains}")
domain_embeddings = {pc: np.array(get_embedding(pc)) for pc in all_domains}
# len(domain_embeddings[""])
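The cell above makes one API request per domain. The embeddings endpoint also accepts a list of inputs, so with thousands of domains it may be worth batching. Below is a hedged sketch: `get_embeddings_batch` and its `batch_size` default are my own assumptions, not part of the notebook, and it assumes the response items expose `index` and `embedding` as the OpenAI Python client does.

```python
from typing import Sequence

import numpy as np

def get_embeddings_batch(client, texts: Sequence[str],
                         model: str = "text-embedding-3-small",
                         batch_size: int = 100) -> dict:
    """Embed texts in batches and return {text: np.ndarray}.

    `client` is expected to look like an OpenAI client, i.e. to offer
    client.embeddings.create(input=[...], model=...).
    """
    embeddings = {}
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        response = client.embeddings.create(input=batch, model=model)
        # Each returned item carries the position of its input within the batch.
        for item in response.data:
            embeddings[batch[item.index]] = np.array(item.embedding)
    return embeddings
```

With the notebook's objects this would be called as `domain_embeddings = get_embeddings_batch(openai_client, list(all_domains))`.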

Clustering¶

domain clustering – identifying subdomains¶

In [76]:

all_domains = list(domain_embeddings.keys())
all_vectors = np.array([domain_embeddings[pc] for pc in all_domains])

In [77]:

from sklearn.metrics import silhouette_score

pca = PCA(n_components=10, random_state=42)
reduced_embeddings = pca.fit_transform(all_vectors)

k_values = [5, 10, 20, 30]
inertias = []
silhouette_scores = []

for k in k_values:
    kmeans = MiniBatchKMeans(n_clusters=k, batch_size=10, random_state=42)
    labels = kmeans.fit_predict(reduced_embeddings)
    
    # Sample without replacement so duplicated points don't distort the silhouette score
    sample_idx = np.random.choice(len(reduced_embeddings), size=min(30, len(reduced_embeddings)), replace=False)
    sil = silhouette_score(reduced_embeddings[sample_idx], labels[sample_idx])
    
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(sil)
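As a sanity check of the k-selection loop, here is a small sketch on synthetic data where the right answer is known. The data (4 well-separated groups in 8 dimensions) and the candidate k values are assumptions chosen for illustration. Real domain embeddings will rarely separate this cleanly, so treat the silhouette score as a hint rather than a verdict.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for domain embeddings: 4 well-separated groups in 8-D.
true_k = 4
centers = 10.0 * np.eye(8)[:true_k]
points = np.vstack([c + rng.normal(scale=0.5, size=(25, 8)) for c in centers])

# Fit KMeans for each candidate k and record the mean silhouette score.
scores = {}
for k in [2, 3, 4, 5, 6]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# Higher silhouette means tighter, better-separated clusters.
best_k = max(scores, key=scores.get)
print({k: round(float(s), 3) for k, s in scores.items()})
print("best k:", best_k)
```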

In [78]:

# plt.figure(figsize=(12, 5))
# plt.subplot(1, 2, 1)
# plt.plot(k_values, inertias, marker='o')
# plt.title("Elbow Method (Inertia)")
# plt.xlabel("k (Number of clusters)")
# plt.ylabel("Inertia")

# plt.subplot(1, 2, 2)
# plt.plot(k_values, silhouette_scores, marker='o')
# plt.title("Silhouette Score")
# plt.xlabel("k (Number of clusters)")
# plt.ylabel("Score")

# plt.tight_layout()
# plt.show()

In [79]:

n_global_clusters = 10
kmeans = MiniBatchKMeans(n_clusters=n_global_clusters, batch_size=5000, random_state=42)
domain_cluster_ids = kmeans.fit_predict(all_vectors)

Map domains -> cluster id

In [80]:

pc_to_cluster = {
    pc: cid for pc, cid in zip(all_domains, domain_cluster_ids)
}

In [81]:

cluster_df = pd.DataFrame({
    "domain": all_domains,
    "cluster": domain_cluster_ids
})

Inspect¶

In [82]:

def top_domains_per_cluster(cluster_df, top_n=5):
    top_examples = {}
    for cid, group in cluster_df.groupby("cluster"):
        examples = group["domain"].head(top_n).tolist()
        top_examples[cid] = examples
    return top_examples

cluster_examples = top_domains_per_cluster(cluster_df, top_n=5)

print("Sample domains per cluster:")
for cid, pcs in cluster_examples.items():
    print(f"Cluster {cid}: {pcs}")

Sample domains per cluster:
Cluster 0: ['Surf Tofino BC Canada', 'Surf Huntington Beach CA USA', 'Hiking Trail Banff Canada', 'Surf Maui HI USA', 'Surf Bondi Beach Australia']
Cluster 1: ['Ski St Anton Austria', 'Ski Whistler BC Canada', 'Ski Vail CO USA', 'Ski Zermatt Switzerland', 'Ski Aspen CO USA']
Cluster 2: ['SpaceX HQ Hawthorne USA', 'Walmart LA USA', 'Starbucks NYC USA', 'Grocery Store Chicago USA', 'Gas Station Houston USA']
Cluster 3: ['Taylor Swift Eras LA USA', 'Anime Expo LA USA', 'Comic Con San Diego USA', 'Hackathon MIT Boston USA', 'Tech Meetup SF USA']
Cluster 4: ['Children Museum Boston USA', 'Montessori School LA USA']
Cluster 5: ['Dance Studio Mumbai India']
Cluster 6: ['Kpop Concert Seoul Korea', 'AR Conference Tokyo Japan', 'Aman Tokyo Japan', 'Poetry Slam Berlin Germany', 'Rooftop Tokyo Japan']
Cluster 7: ['Ritz Paris France', 'Art Gallery Paris France', 'Film Fest Cannes France']
Cluster 8: ['Nature Park Costa Rica']
Cluster 9: ['Four Seasons Maldives', 'Luxury Yacht Monaco']

Label¶

In [83]:

# pc_freq = defaultdict(int)
# for user_places in user_searches.values():
#     for pc, count in user_places.items():
#         pc_freq[pc] += count
# cluster_df["frequency"] = cluster_df["domain"].map(pc_freq)
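# Number of interaction rows (i.e. users) per domain
pc_freq = df_users.groupby("domain")["user_id"].count().to_dict()
# Map from cluster_df's own domain column so each frequency lands on the matching row
cluster_df["frequency"] = cluster_df["domain"].map(pc_freq).fillna(0).astype(int)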

print(cluster_df[["domain", "frequency"]].head())

                         domain  frequency
0         Surf Tofino BC Canada          6
1  Surf Huntington Beach CA USA          7
2     Hiking Trail Banff Canada         12
3        Nature Park Costa Rica         11
4              Surf Maui HI USA          7
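One thing to watch when attaching frequencies to a per-domain table: pandas aligns an assigned Series by index label, not by meaning. The sketch below uses made-up data and hypothetical frame names (`interactions`, `domains`). It shows how mapping a column from a different DataFrame can silently attach the wrong values, even though the first few rows may look plausible.

```python
import pandas as pd

# One row per (user, domain) interaction.
interactions = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3", "u3"],
    "domain":  ["surf", "surf", "ski", "ski", "surf"],
})
# One row per unique domain.
domains = pd.DataFrame({"domain": ["ski", "surf"]})

# Users per domain: ski -> 2, surf -> 3
freq = interactions.groupby("domain")["user_id"].count().to_dict()

# Correct: look up each row of `domains` by its own domain name.
domains["frequency"] = domains["domain"].map(freq)

# Pitfall: this Series has one value per interaction row. Assigning it
# aligns on index labels 0 and 1, i.e. the first two interaction rows.
domains["misaligned"] = interactions["domain"].map(freq)

print(domains)
```

Here "ski" ends up with the frequency of "surf" in the `misaligned` column, because row 0 of `interactions` happens to be a surf row.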

In [84]:

def generate_cluster_label(domains, model="gpt-4o-mini"):
    prompt = f"""
    You are given a list of places (user searches): 
    {domains}

    Please generate a short descriptive label (2-4 words) 
    that best summarizes the common theme of these places.
    Examples:
    - ['Walmart', 'Costco', 'Superstore'] → "Grocery Retail"
    - ['Ski Aspen', 'Ski Whistler', 'Ski Zermatt'] → "Ski Destinations"
    - ['Starbucks', 'Coffee Bean', 'Dunkin Donuts'] → "Coffee Shops"
    """
    label = call_gpt_labeller(prompt, model=model, max_tokens=20, temperature=0.2)
    return label


def auto_label_clusters(cluster_df, top_n=5):
    cluster_labels = {}
    for cid, group in cluster_df.groupby("cluster"):
        # Pick top-N most frequent domains in the cluster
        top_places = group.sort_values("frequency", ascending=False).head(top_n)["domain"].tolist()
        # Ask GPT for a short label
        label = generate_cluster_label(top_places)
        cluster_labels[cid] = label
        print(f"cluster {cid} with {len(group)} domains: {label} for {top_places}")
    return cluster_labels

cluster_labels = auto_label_clusters(cluster_df, top_n=5)

# print("Automatic cluster labels:")
# for cid, label in cluster_labels.items():
#     print(f"Cluster {cid}: {label}")

cluster 0 with 9 domains: "Outdoor Adventure Spots" for ['Hiking Trail Banff Canada', 'Kayaking Lake Tahoe USA', 'Surf Huntington Beach CA USA', 'Surf Maui HI USA', 'Surf Bondi Beach Australia']
cluster 1 with 5 domains: "Ski Resorts" for ['Ski Whistler BC Canada', 'Ski St Anton Austria', 'Ski Zermatt Switzerland', 'Ski Vail CO USA', 'Ski Aspen CO USA']
cluster 2 with 8 domains: "Retail and Services" for ['CVS Pharmacy Boston USA', 'SpaceX HQ Hawthorne USA', 'Grocery Store Chicago USA', 'Gas Station Houston USA', 'Starbucks NYC USA']
cluster 3 with 12 domains: "Entertainment Venues" for ['Lounge Miami FL USA', 'Private Jet Service Dubai', 'Pub Dublin Ireland', 'Theatre Broadway NYC USA', 'Zoo San Diego CA USA']
cluster 4 with 2 domains: "Educational Institutions" for ['Children Museum Boston USA', 'Montessori School LA USA']
cluster 5 with 1 domains: "Dance Studios" for ['Dance Studio Mumbai India']
cluster 6 with 7 domains: "Event Venues" for ['Aman Tokyo Japan', 'AR Conference Tokyo Japan', 'Rooftop Tokyo Japan', 'Poetry Slam Berlin Germany', 'AI Lab Google Mountain View USA']
cluster 7 with 3 domains: "French Cultural Venues" for ['Ritz Paris France', 'Art Gallery Paris France', 'Film Fest Cannes France']
cluster 8 with 1 domains: "Nature Parks" for ['Nature Park Costa Rica']
cluster 9 with 2 domains: "Luxury Travel" for ['Luxury Yacht Monaco', 'Four Seasons Maldives']

Visualize¶

In [85]:

from sklearn.decomposition import PCA

In [86]:

pca = PCA(n_components=2, random_state=42)
domain_2d = pca.fit_transform(all_vectors)

cluster_df["PC1"] = domain_2d[:, 0]
cluster_df["PC2"] = domain_2d[:, 1]
cluster_df["label"] = cluster_df["cluster"].map(cluster_labels)

In [87]:

fig = px.scatter(
    cluster_df,
    x="PC1", y="PC2",
    color="label",
    size="frequency",
    hover_data={"domain": True, "cluster": True, "frequency": True},
    title="Global domain Clusters (PCA Projection with GPT Labels)"
)

fig.update_traces(marker=dict(opacity=0.8, line=dict(width=0.5, color="DarkSlateGrey")))

fig.show()

Building user vectors¶

In [88]:

from collections import defaultdict
import numpy as np

def build_user_multi_vectors_df(df_users, cluster_df, kmeans):
    # Create domain -> cluster_id mapping
    pc_to_cluster = dict(zip(cluster_df['domain'], cluster_df['cluster']))

    user_multi_vectors = {}

    for user_id, group in df_users.groupby("user_id"):
        cluster_to_weight = defaultdict(int)

        for _, row in group.iterrows():
            domain = row["domain"]
            count = row["count"]

            if domain not in pc_to_cluster:
                continue

            cluster_id = pc_to_cluster[domain]
            cluster_to_weight[cluster_id] += count

        user_subvectors = []
        for cid, weight in cluster_to_weight.items():
            centroid = kmeans.cluster_centers_[cid]
            user_subvectors.append((centroid, weight))

        user_multi_vectors[user_id] = user_subvectors

    return user_multi_vectors

user_multi_vectors = build_user_multi_vectors_df(df_users, cluster_df, kmeans)
print(f"Built multi interest vectors for all users - {len(user_multi_vectors)} users processed")

Built multi interest vectors for all users - 50 users processed
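The row-by-row loop above is fine for 50 users. For larger data, the same per-user, per-cluster weights can be computed with a single groupby. This is a sketch on toy data: the frame contents, `domain_to_cluster` and `centroids` are stand-ins I made up for the notebook's `df_users`, `pc_to_cluster` and `kmeans.cluster_centers_`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the interaction table, domain -> cluster map and centroids.
df_users = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "domain":  ["ski_a", "ski_b", "bar", "bar", "unknown"],
    "count":   [3, 2, 4, 1, 7],
})
domain_to_cluster = {"ski_a": 0, "ski_b": 0, "bar": 1}
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])

# Attach a cluster id to every interaction, drop unmapped domains,
# then sum the counts per (user, cluster) pair in one pass.
weights = (
    df_users.assign(cluster=df_users["domain"].map(domain_to_cluster))
    .dropna(subset=["cluster"])
    .groupby(["user_id", "cluster"])["count"]
    .sum()
)

# Same structure as before: user -> list of (centroid, weight) pairs.
user_multi_vectors = {}
for (user_id, cluster_id), weight in weights.items():
    user_multi_vectors.setdefault(user_id, []).append(
        (centroids[int(cluster_id)], int(weight))
    )

for user_id, subvectors in user_multi_vectors.items():
    print(user_id, [(vec.tolist(), w) for vec, w in subvectors])
```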

Define Similarity¶

V1 – max cosine similarity¶

In [89]:

def user_query_similarity_max(query_text, user_subvectors, get_embedding_fn):
    query_vec = np.array(get_embedding_fn(query_text))
    sims = [cosine_similarity([query_vec], [vec])[0][0] * weight
            for vec, weight in user_subvectors]
    return max(sims) if sims else 0.0

def list_similar_users_to_query_max(query_text, top_n=5):
    scores = []
    for user, subvectors in user_multi_vectors.items():
        sim = user_query_similarity_max(query_text, subvectors, get_embedding)
        scores.append((user, sim))
    df = pd.DataFrame(scores, columns=["User", "Similarity"])
    df.sort_values(by="Similarity", ascending=False, inplace=True)
    return df.head(top_n)
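Note that V1 multiplies each cosine by the raw interaction count, so the score is not bounded by 1 and heavy usage can outweigh relevance. Whether that is desirable depends on the use case. The sketch below, with made-up 2-D vectors and counts, shows the effect next to a plain max over cosines.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 0.0])
relevant = np.array([1.0, 0.0])   # cosine 1.0 with the query
unrelated = np.array([0.6, 0.8])  # cosine 0.6 with the query

# (centroid, raw count) pairs, hypothetical users.
user_a = [(relevant, 2)]    # a few visits to a perfectly matching cluster
user_b = [(unrelated, 30)]  # many visits to a weakly matching cluster

def max_times_count(query_vec, subvectors):
    """V1-style score: best cosine scaled by the raw count."""
    return max(cosine(query_vec, vec) * w for vec, w in subvectors)

def max_plain(query_vec, subvectors):
    """Best cosine only, ignoring counts."""
    return max(cosine(query_vec, vec) for vec, _ in subvectors)

print("count-scaled:", max_times_count(query, user_a), max_times_count(query, user_b))
print("plain max:   ", max_plain(query, user_a), max_plain(query, user_b))
```

With count scaling, user_b ranks first despite the weaker match. With the plain max, user_a does.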

V2 – normalized¶

In [90]:

def user_query_similarity_norm(query_text, user_subvectors, get_embedding_fn):
    query_vec = np.array(get_embedding_fn(query_text))
    
    weights = np.array([w for _, w in user_subvectors], dtype=float)
    weights /= weights.sum()  
    
    sims = np.array([cosine_similarity([query_vec], [vec])[0][0] for vec, _ in user_subvectors])
    
    return float(np.dot(sims, weights))  


def list_similar_users_to_query_norm(query_text, top_n=5):
    scores = []
    for user, subvectors in user_multi_vectors.items():
        sim = user_query_similarity_norm(query_text, subvectors, get_embedding)
        scores.append((user, sim))

    df = pd.DataFrame(scores, columns=["User", "Similarity"])
    df.sort_values(by="Similarity", ascending=False, inplace=True)
    return df.head(top_n)

V3 – softmax¶

In [91]:

def softmax(x, temp=1.0):
    e_x = np.exp((x - np.max(x)) / temp)
    return e_x / e_x.sum()

def user_query_similarity_softmax(query_text, user_subvectors, get_embedding_fn):
    query_vec = np.array(get_embedding_fn(query_text))

    sims = np.array([cosine_similarity([query_vec], [vec])[0][0]
                     for vec, _ in user_subvectors])
    
    weights = np.array([w for _, w in user_subvectors], dtype=float)
    weights = weights / weights.sum()
    
    attn = softmax(sims, temp=0.5)  
    
    return float(np.dot(sims, attn * weights))

def list_similar_users_to_query_softmax(query_text, top_n=5):
    scores = []
    for user, subvectors in user_multi_vectors.items():
        sim = user_query_similarity_softmax(query_text, subvectors, get_embedding)
        scores.append((user, sim))

    df = pd.DataFrame(scores, columns=["User", "Similarity"])
    df.sort_values(by="Similarity", ascending=False, inplace=True)
    return df.head(top_n)
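In V3 the product `attn * weights` no longer sums to 1, so the result is a scaled-down score rather than a true weighted average. If a convex combination is wanted, one option is to renormalize the combined weights. This is my suggestion, not something the notebook does; the similarity and weight values below are made up.

```python
import numpy as np

def softmax(x, temp=1.0):
    e_x = np.exp((x - np.max(x)) / temp)
    return e_x / e_x.sum()

# Hypothetical per-cluster cosine similarities and normalized frequency weights.
sims = np.array([0.8, 0.2, 0.1])
weights = np.array([0.2, 0.5, 0.3])

attn = softmax(sims, temp=0.5)
combined = attn * weights

# As in V3: the combined weights sum to less than 1.
score_unnormalized = float(np.dot(sims, combined))

# Renormalized variant: the score stays between min(sims) and max(sims).
combined_norm = combined / combined.sum()
score_normalized = float(np.dot(sims, combined_norm))

print("sum of attn * weights:", round(float(combined.sum()), 3))
print("unnormalized score:", round(score_unnormalized, 3))
print("renormalized score:", round(score_normalized, 3))
```

Since every user's score is shrunk by a different factor in the unnormalized version, renormalizing can change the ranking between users, not just the scale.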

V4 – Weighted + Max¶

In [92]:

def user_query_similarity_mixed1(query_text, user_subvectors, get_embedding_fn, alpha=0.7):
    query_vec = np.array(get_embedding_fn(query_text))

    sims = np.array([cosine_similarity([query_vec], [vec])[0][0] for vec, _ in user_subvectors])
    weights = np.array([w for _, w in user_subvectors], dtype=float)
    weights = weights / weights.sum()

    weighted_avg = np.dot(sims, weights)
    best = sims.max()

    return float(alpha * weighted_avg + (1 - alpha) * best)


def list_similar_users_to_query_mixed1(query_text, top_n=5):
    scores = []
    for user, subvectors in user_multi_vectors.items():
        sim = user_query_similarity_mixed1(query_text, subvectors, get_embedding)
        scores.append((user, sim))

    df = pd.DataFrame(scores, columns=["User", "Similarity"])
    df.sort_values(by="Similarity", ascending=False, inplace=True)
    return df.head(top_n)

Test¶

In [93]:

def list_similar_users_to_query_multi(type="max", query_text="", top_n=5):
    if type == "max":
        return list_similar_users_to_query_max(query_text, top_n)
    elif type == "norm":
        return list_similar_users_to_query_norm(query_text, top_n)
    elif type == "softmax":
        return list_similar_users_to_query_softmax(query_text, top_n)
    elif type == "mixed1":
        return list_similar_users_to_query_mixed1(query_text, top_n)
    else:
        raise ValueError(f"Unknown type: {type}")
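Each of the `list_similar_users_to_query_*` helpers embeds the query inside the per-user loop, so one query costs one embedding call per user. Below is a sketch that embeds the query once and reuses the vector. `rank_users`, `score_weighted_average` and `embed_fn` are hypothetical names; `embed_fn` stands in for `get_embedding`.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_weighted_average(query_vec, subvectors):
    """Frequency-weighted average of per-cluster cosine similarities (as in V2)."""
    if not subvectors:
        return 0.0
    weights = np.array([w for _, w in subvectors], dtype=float)
    weights /= weights.sum()
    sims = np.array([cosine(query_vec, vec) for vec, _ in subvectors])
    return float(np.dot(sims, weights))

def rank_users(query_text, user_multi_vectors, embed_fn, score_fn, top_n=5):
    """Embed the query once, then score every user against that one vector."""
    query_vec = np.asarray(embed_fn(query_text), dtype=float)  # single embedding call
    scores = [
        (user, score_fn(query_vec, subvectors))
        for user, subvectors in user_multi_vectors.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]
```

Any of the scoring strategies can be passed as `score_fn`, as long as it takes the already-embedded query vector instead of the query text.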

In [106]:

def inspect_user(df, user_id):
    user_data = df[df['user_id'] == user_id]
    if user_data.empty:
        return f"No data found for user_id: {user_id}"
    return user_data.sort_values(by="count", ascending=False)

Inspect¶

In [94]:

sim_type = "norm"  # "max", "norm", "softmax", "mixed1"

In [103]:

query = "ski"
print(f"Query: {query}, type: {sim_type}")
print(list_similar_users_to_query_multi(sim_type, query, top_n=5))

Query: ski, type: norm
        User  Similarity
46  user_047    0.438464
20  user_021    0.385194
35  user_036    0.342506
27  user_028    0.338921
39  user_040    0.329243

In [110]:

inspect_user(df_users, 'user_047')

Out[110]:

	user_id	domain	theme	count
286	user_047	Ski Zermatt Switzerland	Skiing	6
289	user_047	Ski St Anton Austria	Skiing	5
287	user_047	Ski Vail CO USA	Skiing	3
288	user_047	Ski Whistler BC Canada	Skiing	1

In [109]:

inspect_user(df_users, 'user_040')

Out[109]:

	user_id	domain	theme	count
246	user_040	Gas Station Houston USA	Noise	9
250	user_040	Ski St Anton Austria	Skiing	8
248	user_040	Ski Aspen CO USA	Skiing	7
244	user_040	CVS Pharmacy Boston USA	Noise	6
249	user_040	Ski Whistler BC Canada	Skiing	6
245	user_040	Grocery Store Chicago USA	Noise	2
247	user_040	Ski Zermatt Switzerland	Skiing	2

In [104]:

query = "surf"
print(f"Query: {query}, type: {sim_type}")
print(list_similar_users_to_query_multi(sim_type, query, top_n=5))

Query: surf, type: norm
        User  Similarity
0   user_001    0.502815
44  user_045    0.502730
16  user_017    0.502730
2   user_003    0.405118
1   user_002    0.401624

In [111]:

inspect_user(df_users, 'user_001')

Out[111]:

	user_id	domain	theme	count
0	user_001	Surf Tofino BC Canada	Surfing	4
1	user_001	Surf Huntington Beach CA USA	Surfing	3

In [112]:

inspect_user(df_users, 'user_002')

Out[112]:

	user_id	domain	theme	count
4	user_002	Surf Maui HI USA	Surfing	9
7	user_002	Surf Bali Indonesia	Surfing	8
5	user_002	Surf Bondi Beach Australia	Surfing	7
8	user_002	Kpop Concert Seoul Korea	Pop Culture	6
9	user_002	Taylor Swift Eras LA USA	Pop Culture	5
2	user_002	Hiking Trail Banff Canada	Outdoor	4
3	user_002	Nature Park Costa Rica	Outdoor	4
6	user_002	Surf Tofino BC Canada	Surfing	4
11	user_002	Comic Con San Diego USA	Pop Culture	4
10	user_002	Anime Expo LA USA	Pop Culture	3

In [117]:

query = "luxury"
print(f"Query: {query}, type: {sim_type}")
print(list_similar_users_to_query_multi(sim_type, query, top_n=5))

Query: luxury, type: norm
        User  Similarity
9   user_010    0.320341
14  user_015    0.286443
45  user_046    0.284857
49  user_050    0.277800
17  user_018    0.248539

In [118]:

inspect_user(df_users, 'user_010')

Out[118]:

	user_id	domain	theme	count
58	user_010	Four Seasons Maldives	Luxury Travel	8
60	user_010	Ski Whistler BC Canada	Skiing	8
59	user_010	Ski St Anton Austria	Skiing	3
57	user_010	Aman Tokyo Japan	Luxury Travel	2

In [119]:

inspect_user(df_users, 'user_018')

Out[119]:

	user_id	domain	theme	count
104	user_018	Aman Tokyo Japan	Luxury Travel	10
105	user_018	Private Jet Service Dubai	Luxury Travel	10
107	user_018	Taylor Swift Eras LA USA	Pop Culture	10
106	user_018	Billie Eilish NYC USA	Pop Culture	7

In [123]:

query = "outdoors"
print(f"Query: {query}, type: {sim_type}")
print(list_similar_users_to_query_multi(sim_type, query, top_n=5))

Query: outdoors, type: norm
        User  Similarity
0   user_001    0.333498
44  user_045    0.333498
16  user_017    0.333476
32  user_033    0.299596
1   user_002    0.298054

In [124]:

inspect_user(df_users, 'user_001')

Out[124]:

	user_id	domain	theme	count
0	user_001	Surf Tofino BC Canada	Surfing	4
1	user_001	Surf Huntington Beach CA USA	Surfing	3

In [127]:

inspect_user(df_users, 'user_033')

Out[127]:

	user_id	domain	theme	count
194	user_033	Nature Park Costa Rica	Outdoor	8
195	user_033	Biking Trail Moab USA	Outdoor	8
196	user_033	Hiking Trail Banff Canada	Outdoor	5

Compare similarity methods¶

In [128]:

import plotly.express as px
import plotly.graph_objects as go

def compare_similarity_functions(query_text, top_n=5):
    sim_types = ["norm", "softmax", "mixed1"]  # "max" is not included in this comparison
    results = []

    for sim_type in sim_types:
        df = list_similar_users_to_query_multi(sim_type, query_text, top_n=top_n)
        df["Type"] = sim_type
        results.append(df)

    results_df = pd.concat(results)
    heatmap_df = results_df.pivot(index="User", columns="Type", values="Similarity").fillna(0)


    # --- Heatmap ---
    fig_heatmap = px.imshow(
        heatmap_df,
        text_auto=".2f",
        aspect="auto",
        color_continuous_scale="RdBu",
        title=f"Query: '{query_text}' — User Similarities Across Methods",
    )
    fig_heatmap.update_layout(
        width=700, height=300, margin=dict(l=30, r=30, t=30, b=30)
    )
    fig_heatmap.show()

    # --- Bar chart ---
    bar_df = results_df.copy()
    fig_bar = px.bar(
        bar_df,
        x="User",
        y="Similarity",
        color="Type",
        barmode="group",
        title=f"Query: '{query_text}' — Similarity Scores",
    )
    fig_bar.update_layout(
        width=700, height=350,
        xaxis_tickangle=-45,
        margin=dict(l=30, r=30, t=30, b=30)
    )
    fig_bar.show()

    return heatmap_df

def plot_user_themes(df, user_ids):
    filtered_df = df[df["user_id"].isin(user_ids)]
    theme_counts = filtered_df.groupby(["user_id", "theme"])["count"].sum().reset_index()
    fig = px.bar(
        theme_counts,
        x="user_id",
        y="count",
        color="theme",
        barmode="group",
        title="Theme Distribution per User",
        labels={"user_id": "User", "count": "Interaction Count"},
        height=400
    )
    fig.update_layout(xaxis_tickangle=-45, width=750)
    fig.show()

In [133]:

query = "surf"
comparison_df = compare_similarity_functions(query, top_n=5)

In [134]:

plot_user_themes(df_users, comparison_df.index)

Sorry for the rendering problem above. I still have to figure out how to fix Plotly frames in a WordPress HTML block, so I'm attaching them as an iframe here:

Does it solve all the problems mentioned before? Not exactly. But it does give us another way to look at the same problem, and with the secret sauce of an appropriate method for combining a user's similarity scores, we might have found the perfect hack for our little problem. Of course, that doesn’t mean I didn’t look into the User LLM and the custom encoder Google suggested, which creates user embeddings by passing a user's interaction sequence through a self-supervised, next-token-predicting transformer. It is basically attention that helps them combine embeddings. That one is on the back burner for me right now, but I’ll try to make it the next post in this series.

Live & Learn

Embeddings II – Combining hack
