In Embeddings I we explored an LLM embedding model, the idea of using this trained encoder to make our mundane features context-aware, and the challenges of the naive ways of combining them.
To recap, we have a made-up use case: a set of users and each user's interactions with different domains. We want to group users together, build a recommendation system for a user, or something along those lines. To do that (I bet there are other ways) we want to create a user profile. Earlier we tried getting an embedding for each domain and combining them into a single user vector using frequency-weighted averaging. Because these are dense embeddings rather than one-hot encodings, weighted averaging raises some concerns, both theoretical and computational, especially if we scale our little example to real data with millions of users and thousands of domains.
Here we discuss a different and kind of hacky approach for our use case. It is not as fancy as Google's User LLM, which we'll probably discuss later in this series.
So, what we basically do is:
- Embed individual domains into a vector space using an LLM embedding model (text-embedding-3-small)
- Cluster these domains in the embedding space to dynamically identify the sub-domains that we will describe a user in
- Optionally label these clusters with GPT to generate human-readable tags. For the labelling prompt I opted for the top N domains by user frequency in each cluster
- Now that the clusters are defined, build each user profile as a set of weighted vectors across multiple clusters. A single user object becomes an N x 1536 matrix (one 1536-dimensional vector per cluster the user touches) instead of a single 1536-dimensional vector. When you think about it, this makes sense: domains within a cluster represent the same or closely related behaviour in the embedding space, so they can be combined into a representative centroid without interfering with drastically different vectors in any arithmetic operation
- Compute similarity between query prompts and users using different strategies (e.g. max, weighted average, softmax attention)
This allows us to represent a user as a mixture of interests instead of a single-point mishmash of everything the user has done.
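To make the "mixture of interests" idea concrete, here is a minimal toy sketch. It assumes two unrelated interests sit in roughly orthogonal directions of an 8-dimensional space (made-up vectors, not real embeddings) and compares a single averaged profile with a two-row multi-vector profile against a ski-like query.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # toy dimension; the post uses 1536-dimensional embeddings

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

# Two unrelated interests: orthogonal directions plus a little noise (made-up data).
ski = unit(np.eye(dim)[0] + 0.05 * rng.normal(size=dim))
nightlife = unit(np.eye(dim)[1] + 0.05 * rng.normal(size=dim))

# Single-vector profile: the average blurs both interests together.
single = unit((ski + nightlife) / 2)

# Multi-vector profile: one row per interest cluster, shape (N, dim).
multi = np.stack([ski, nightlife])

query = unit(np.eye(dim)[0])  # a "ski"-like query

sim_single = float(single @ query)
sim_multi_max = float((multi @ query).max())
print(f"single-vector similarity:     {sim_single:.3f}")
print(f"best multi-vector similarity: {sim_multi_max:.3f}")
```

The averaged profile sits halfway between the two interests, so it matches the query noticeably worse than the dedicated ski row does.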
Imports¶
from dotenv import load_dotenv
import os
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from collections import defaultdict
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"
import openai
from openai import OpenAI

load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")
openai_client = OpenAI(
    api_key=openai_api_key
)

def get_embedding(text, deployment_name="text-embedding-3-small"):
    # Embed a single string and return its vector as a list of floats
    response = openai_client.embeddings.create(
        input=[text],
        model=deployment_name
    )
    return response.data[0].embedding

def call_gpt_labeller(prompt, model="gpt-4o-mini", max_tokens=50, temperature=0.2):
    # Ask a chat model for a short label; low temperature keeps labels stable
    response = openai_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a concise labeling assistant."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=max_tokens,
        temperature=temperature
    )
    return response.choices[0].message.content.strip()
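Since `get_embedding` sends one request per text, embedding many domains means many round trips. The embeddings endpoint also accepts a list of inputs, so a batched helper is one option. The sketch below is an assumption about how that could look, not the code used in this post: `get_embeddings_batched`, `batch_size=100`, and the stand-in `FakeEmbeddings` client are all hypothetical names, and the fake client exists only so the example runs without an API key.

```python
from types import SimpleNamespace

def get_embeddings_batched(texts, client, model="text-embedding-3-small", batch_size=100):
    """Embed many texts with one request per batch instead of one per text."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(input=batch, model=model)
        # Sort by the returned index so vectors line up with the input order.
        for item in sorted(response.data, key=lambda d: d.index):
            vectors.append(item.embedding)
    return vectors

# Stand-in client so the sketch runs offline (hypothetical, not the OpenAI SDK).
class FakeEmbeddings:
    def create(self, input, model):
        items = [SimpleNamespace(index=i, embedding=[float(len(t))])
                 for i, t in enumerate(input)]
        # Return items reversed to show that sorting by index restores input order.
        return SimpleNamespace(data=list(reversed(items)))

fake_client = SimpleNamespace(embeddings=FakeEmbeddings())
demo = get_embeddings_batched(["ski", "surfing", "bar"], fake_client, batch_size=2)
print(demo)
```

With the real client, the call would presumably look like `get_embeddings_batched(list(all_domains), openai_client)`.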
Example 1¶
# df_users = pd.DataFrame([
# {"user_id": user, "domain": search, "count": count}
# for user, searches in user_searches.items()
# for search, count in searches.items()
# ])
OR Generate some Random data¶
import random
random.seed(42)
np.random.seed(42)
themes = {
    "Skiing": ["Ski Aspen CO USA", "Ski Whistler BC Canada", "Ski Zermatt Switzerland",
               "Ski Vail CO USA", "Ski St Anton Austria"],
    "Surfing": ["Surf Bondi Beach Australia", "Surf Huntington Beach CA USA", "Surf Tofino BC Canada", "Surf Bali Indonesia", "Surf Maui HI USA"],
    "Nightlife": ["Nightclub NYC USA", "Bar Las Vegas NV USA", "Rooftop Tokyo Japan", "Pub Dublin Ireland", "Lounge Miami FL USA"],
    "Luxury Travel": ["Four Seasons Maldives", "Private Jet Service Dubai", "Luxury Yacht Monaco", "Ritz Paris France", "Aman Tokyo Japan"],
    "Family": ["Zoo San Diego CA USA", "Pediatrician Toronto Canada", "Toy Store NYC USA", "Children Museum Boston USA", "Montessori School LA USA"],
    "Tech/Geek": ["Hackathon MIT Boston USA", "AI Lab Google Mountain View USA", "Tech Meetup SF USA", "SpaceX HQ Hawthorne USA", "AR Conference Tokyo Japan"],
    "Pop Culture": ["Taylor Swift Eras LA USA", "Comic Con San Diego USA", "Kpop Concert Seoul Korea", "Anime Expo LA USA", "Billie Eilish NYC USA"],
    "Creative Arts": ["Art Gallery Paris France", "Theatre Broadway NYC USA", "Dance Studio Mumbai India", "Film Fest Cannes France", "Poetry Slam Berlin Germany"],
    "Outdoor": ["Hiking Trail Banff Canada", "Kayaking Lake Tahoe USA", "Rock Climb Yosemite USA", "Biking Trail Moab USA", "Nature Park Costa Rica"],
    "Noise": ["Starbucks NYC USA", "Walmart LA USA", "Gas Station Houston USA", "Grocery Store Chicago USA", "CVS Pharmacy Boston USA"]
}
all_domains = []
for theme, places in themes.items():
    all_domains.extend([(place, theme) for place in places])
num_users = 50
users = []
for i in range(num_users):
    user_id = f"user_{i+1:03}"
    user_data = []
    # Each user gets 1 to 3 themes, with 2 to 4 places per theme
    num_themes = int(np.random.choice([1, 2, 3], p=[0.4, 0.4, 0.2]))
    selected_themes = random.sample(list(themes.keys()), num_themes)
    for theme in selected_themes:
        selected_places = random.sample(themes[theme], k=random.randint(2, 4))
        for place in selected_places:
            frequency = random.randint(1, 10)
            user_data.append({"user_id": user_id, "domain": place, "theme": theme, "count": frequency})
    # noise: half of the users also get high-frequency everyday places
    if random.random() < 0.5:
        noise_places = random.sample(themes["Noise"], k=random.randint(1, 3))
        for noise_place in noise_places:
            frequency = random.randint(10, 30)
            user_data.append({"user_id": user_id, "domain": noise_place, "theme": "Noise", "count": frequency})
    users.extend(user_data)
df_users = pd.DataFrame(users)
Create user vectors¶
domain embeddings¶
all_domains = df_users['domain'].unique()
# print(f"All unique domains: {all_domains}")
domain_embeddings = {pc: np.array(get_embedding(pc)) for pc in all_domains}
# len(domain_embeddings[""])
Clustering¶
domain clustering – identifying subdomains¶
all_domains = list(domain_embeddings.keys())
all_vectors = np.array([domain_embeddings[pc] for pc in all_domains])
from sklearn.metrics import silhouette_score
# Explore k on PCA-reduced vectors; the final clustering below runs on the full embeddings
pca = PCA(n_components=10, random_state=42)
reduced_embeddings = pca.fit_transform(all_vectors)
k_values = [5, 10, 20, 30]
inertias = []
silhouette_scores = []
for k in k_values:
    kmeans = MiniBatchKMeans(n_clusters=k, batch_size=10, random_state=42)
    labels = kmeans.fit_predict(reduced_embeddings)
    # Score on all points: sampling with replacement duplicates points and inflates the score,
    # and the toy dataset is small. For large data, pass sample_size= to silhouette_score.
    sil = silhouette_score(reduced_embeddings, labels)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(sil)
# plt.figure(figsize=(12, 5))
# plt.subplot(1, 2, 1)
# plt.plot(k_values, inertias, marker='o')
# plt.title("Elbow Method (Inertia)")
# plt.xlabel("k (Number of clusters)")
# plt.ylabel("Inertia")
# plt.subplot(1, 2, 2)
# plt.plot(k_values, silhouette_scores, marker='o')
# plt.title("Silhouette Score")
# plt.xlabel("k (Number of clusters)")
# plt.ylabel("Score")
# plt.tight_layout()
# plt.show()
n_global_clusters = 10
kmeans = MiniBatchKMeans(n_clusters=n_global_clusters, batch_size=5000, random_state=42)
domain_cluster_ids = kmeans.fit_predict(all_vectors)
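One thing worth checking here: k-means groups points by Euclidean distance, while the similarity functions later in this post use cosine similarity. For unit-length vectors the two agree, because the squared Euclidean distance equals 2 minus twice the cosine. The sketch below verifies that identity on random toy vectors. Whether the embeddings you get back are already unit length is an assumption to confirm (for example with `np.linalg.norm(all_vectors, axis=1)`).

```python
import numpy as np

rng = np.random.default_rng(42)
a, b = rng.normal(size=(2, 16))  # two random toy vectors, not real embeddings
a /= np.linalg.norm(a)           # normalize both to unit length
b /= np.linalg.norm(b)

cosine = float(a @ b)
sq_euclidean = float(np.sum((a - b) ** 2))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(sq_euclidean, 2 - 2 * cosine)
```

If the norms turn out not to be 1, normalizing the rows before clustering is one way to keep the clustering and the cosine-based scoring consistent.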
Map domains -> cluster id
pc_to_cluster = {
    pc: cid for pc, cid in zip(all_domains, domain_cluster_ids)
}
cluster_df = pd.DataFrame({
    "domain": all_domains,
    "cluster": domain_cluster_ids
})
Inspect¶
def top_domains_per_cluster(cluster_df, top_n=5):
    # Take the first top_n domains of each cluster as examples (no ranking applied here)
    top_examples = {}
    for cid, group in cluster_df.groupby("cluster"):
        examples = group["domain"].head(top_n).tolist()
        top_examples[cid] = examples
    return top_examples

cluster_examples = top_domains_per_cluster(cluster_df, top_n=5)
print("Sample domains per cluster:")
for cid, pcs in cluster_examples.items():
    print(f"Cluster {cid}: {pcs}")
Sample domains per cluster:
Cluster 0: ['Surf Tofino BC Canada', 'Surf Huntington Beach CA USA', 'Hiking Trail Banff Canada', 'Surf Maui HI USA', 'Surf Bondi Beach Australia']
Cluster 1: ['Ski St Anton Austria', 'Ski Whistler BC Canada', 'Ski Vail CO USA', 'Ski Zermatt Switzerland', 'Ski Aspen CO USA']
Cluster 2: ['SpaceX HQ Hawthorne USA', 'Walmart LA USA', 'Starbucks NYC USA', 'Grocery Store Chicago USA', 'Gas Station Houston USA']
Cluster 3: ['Taylor Swift Eras LA USA', 'Anime Expo LA USA', 'Comic Con San Diego USA', 'Hackathon MIT Boston USA', 'Tech Meetup SF USA']
Cluster 4: ['Children Museum Boston USA', 'Montessori School LA USA']
Cluster 5: ['Dance Studio Mumbai India']
Cluster 6: ['Kpop Concert Seoul Korea', 'AR Conference Tokyo Japan', 'Aman Tokyo Japan', 'Poetry Slam Berlin Germany', 'Rooftop Tokyo Japan']
Cluster 7: ['Ritz Paris France', 'Art Gallery Paris France', 'Film Fest Cannes France']
Cluster 8: ['Nature Park Costa Rica']
Cluster 9: ['Four Seasons Maldives', 'Luxury Yacht Monaco']
Label¶
# pc_freq = defaultdict(int)
# for user_places in user_searches.values():
# for pc, count in user_places.items():
# pc_freq[pc] += count
# cluster_df["frequency"] = cluster_df["domain"].map(pc_freq)
pc_freq = df_users.groupby("domain")["user_id"].count().to_dict()
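# Map from cluster_df's own domain column; df_users["domain"] would align by row index, not by domain
cluster_df["frequency"] = cluster_df["domain"].map(pc_freq).fillna(0).astype(int)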
print(cluster_df[["domain", "frequency"]].head())
                         domain  frequency
0         Surf Tofino BC Canada          6
1  Surf Huntington Beach CA USA          7
2     Hiking Trail Banff Canada         12
3        Nature Park Costa Rica         11
4              Surf Maui HI USA          7
def generate_cluster_label(domains, model="gpt-4o-mini"):
    prompt = f"""
    You are given a list of places (user searches):
    {domains}
    Please generate a short descriptive label (2-4 words)
    that best summarizes the common theme of these places.
    Examples:
    - ['Walmart', 'Costco', 'Superstore'] → "Grocery Retail"
    - ['Ski Aspen', 'Ski Whistler', 'Ski Zermatt'] → "Ski Destinations"
    - ['Starbucks', 'Coffee Bean', 'Dunkin Donuts'] → "Coffee Shops"
    """
    label = call_gpt_labeller(prompt, model=model, max_tokens=20, temperature=0.2)
    return label

def auto_label_clusters(cluster_df, top_n=5):
    cluster_labels = {}
    for cid, group in cluster_df.groupby("cluster"):
        # Pick top-N most frequent domains in the cluster
        top_places = group.sort_values("frequency", ascending=False).head(top_n)["domain"].tolist()
        # Ask GPT for a short label
        label = generate_cluster_label(top_places)
        cluster_labels[cid] = label
        print(f"cluster {cid} with {len(group)} domains: {label} for {top_places}")
    return cluster_labels

cluster_labels = auto_label_clusters(cluster_df, top_n=5)
# print("Automatic cluster labels:")
# for cid, label in cluster_labels.items():
#     print(f"Cluster {cid}: {label}")
cluster 0 with 9 domains: "Outdoor Adventure Spots" for ['Hiking Trail Banff Canada', 'Kayaking Lake Tahoe USA', 'Surf Huntington Beach CA USA', 'Surf Maui HI USA', 'Surf Bondi Beach Australia']
cluster 1 with 5 domains: "Ski Resorts" for ['Ski Whistler BC Canada', 'Ski St Anton Austria', 'Ski Zermatt Switzerland', 'Ski Vail CO USA', 'Ski Aspen CO USA']
cluster 2 with 8 domains: "Retail and Services" for ['CVS Pharmacy Boston USA', 'SpaceX HQ Hawthorne USA', 'Grocery Store Chicago USA', 'Gas Station Houston USA', 'Starbucks NYC USA']
cluster 3 with 12 domains: "Entertainment Venues" for ['Lounge Miami FL USA', 'Private Jet Service Dubai', 'Pub Dublin Ireland', 'Theatre Broadway NYC USA', 'Zoo San Diego CA USA']
cluster 4 with 2 domains: "Educational Institutions" for ['Children Museum Boston USA', 'Montessori School LA USA']
cluster 5 with 1 domains: "Dance Studios" for ['Dance Studio Mumbai India']
cluster 6 with 7 domains: "Event Venues" for ['Aman Tokyo Japan', 'AR Conference Tokyo Japan', 'Rooftop Tokyo Japan', 'Poetry Slam Berlin Germany', 'AI Lab Google Mountain View USA']
cluster 7 with 3 domains: "French Cultural Venues" for ['Ritz Paris France', 'Art Gallery Paris France', 'Film Fest Cannes France']
cluster 8 with 1 domains: "Nature Parks" for ['Nature Park Costa Rica']
cluster 9 with 2 domains: "Luxury Travel" for ['Luxury Yacht Monaco', 'Four Seasons Maldives']
Visualize¶
from sklearn.decomposition import PCA
# Project the full embeddings to 2D purely for plotting
pca = PCA(n_components=2, random_state=42)
domain_2d = pca.fit_transform(all_vectors)
cluster_df["PC1"] = domain_2d[:, 0]
cluster_df["PC2"] = domain_2d[:, 1]
cluster_df["label"] = cluster_df["cluster"].map(cluster_labels)
fig = px.scatter(
    cluster_df,
    x="PC1", y="PC2",
    color="label",
    size="frequency",
    hover_data={"domain": True, "cluster": True, "frequency": True},
    title="Global domain Clusters (PCA Projection with GPT Labels)"
)
fig.update_traces(marker=dict(opacity=0.8, line=dict(width=0.5, color="DarkSlateGrey")))
fig.show()
Building user vectors¶
from collections import defaultdict
import numpy as np

def build_user_multi_vectors_df(df_users, cluster_df, kmeans):
    # Create domain -> cluster_id mapping
    pc_to_cluster = dict(zip(cluster_df['domain'], cluster_df['cluster']))
    user_multi_vectors = {}
    for user_id, group in df_users.groupby("user_id"):
        # Sum the interaction counts of this user per cluster
        cluster_to_weight = defaultdict(int)
        for _, row in group.iterrows():
            domain = row["domain"]
            count = row["count"]
            if domain not in pc_to_cluster:
                continue
            cluster_id = pc_to_cluster[domain]
            cluster_to_weight[cluster_id] += count
        # One (centroid, weight) pair per cluster the user touched
        user_subvectors = []
        for cid, weight in cluster_to_weight.items():
            centroid = kmeans.cluster_centers_[cid]
            user_subvectors.append((centroid, weight))
        user_multi_vectors[user_id] = user_subvectors
    return user_multi_vectors

user_multi_vectors = build_user_multi_vectors_df(df_users, cluster_df, kmeans)
print(f"Built multi interest vectors for all users - {len(user_multi_vectors)} users processed")
Built multi interest vectors for all users - 50 users processed
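The row-by-row loop is fine for 50 users, but it would likely be slow at the "millions of users" scale mentioned earlier. One possible alternative is to compute the per-user cluster weights with a merge and a group-by. The sketch below uses tiny made-up stand-ins (`toy_users`, `toy_clusters`) for `df_users` and `cluster_df`, so it illustrates the idea under those assumptions and is not a drop-in replacement.

```python
import pandas as pd

# Toy stand-ins for df_users and cluster_df (values are made up).
toy_users = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "domain":  ["Ski A", "Ski B", "Surf A", "Surf A"],
    "count":   [3, 2, 5, 4],
})
toy_clusters = pd.DataFrame({
    "domain":  ["Ski A", "Ski B", "Surf A"],
    "cluster": [0, 0, 1],
})

# Attach each interaction's cluster, then sum counts per (user, cluster).
weights = (
    toy_users.merge(toy_clusters, on="domain", how="inner")
    .groupby(["user_id", "cluster"])["count"]
    .sum()
    .unstack(fill_value=0)  # rows: users, columns: cluster ids
)
print(weights)
```

Each row of `weights` is a user's weight per cluster. Pairing the non-zero entries with the matching rows of `kmeans.cluster_centers_` would give the same (centroid, weight) pairs as the loop.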
Define Similarity¶
V1 – max cosine similarity¶
def user_query_similarity_max(query_text, user_subvectors, get_embedding_fn):
    query_vec = np.array(get_embedding_fn(query_text))
    # Cosine similarity to each cluster centroid, scaled by the raw interaction count
    sims = [cosine_similarity([query_vec], [vec])[0][0] * weight
            for vec, weight in user_subvectors]
    return max(sims) if sims else 0.0

def list_similar_users_to_query_max(query_text, top_n=5):
    scores = []
    for user, subvectors in user_multi_vectors.items():
        sim = user_query_similarity_max(query_text, subvectors, get_embedding)
        scores.append((user, sim))
    df = pd.DataFrame(scores, columns=["User", "Similarity"])
    df.sort_values(by="Similarity", ascending=False, inplace=True)
    return df.head(top_n)
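Note that the scoring functions in this post embed the query inside the per-user call, so ranking 50 users sends the same query to the embedding API 50 times. A small cache avoids that. The sketch below is one way to do it with `functools.lru_cache`. `make_cached_embedder` and `fake_embed` are hypothetical names, and the fake embedder only stands in for the real API call so the call count can be observed.

```python
from functools import lru_cache

def make_cached_embedder(embed_fn, maxsize=1024):
    """Wrap an embedding function so repeated texts are embedded only once."""
    @lru_cache(maxsize=maxsize)
    def cached(text):
        # Store a tuple: it is immutable, so callers cannot alter the cached value.
        return tuple(embed_fn(text))
    return cached

# Stand-in embedder that counts calls (hypothetical; replaces the real API call).
calls = {"n": 0}
def fake_embed(text):
    calls["n"] += 1
    return [float(len(text)), 1.0]

cached_embed = make_cached_embedder(fake_embed)
for _ in range(50):  # e.g. one scoring call per user
    vec = cached_embed("ski")
print(calls["n"])
```

Passing `make_cached_embedder(get_embedding)` as the `get_embedding_fn` argument should work with the functions here, since `np.array` accepts a tuple.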
V2 – normalized¶
def user_query_similarity_norm(query_text, user_subvectors, get_embedding_fn):
    query_vec = np.array(get_embedding_fn(query_text))
    # Normalize counts so each user's weights sum to 1
    weights = np.array([w for _, w in user_subvectors], dtype=float)
    weights /= weights.sum()
    sims = np.array([cosine_similarity([query_vec], [vec])[0][0] for vec, _ in user_subvectors])
    return float(np.dot(sims, weights))

def list_similar_users_to_query_norm(query_text, top_n=5):
    scores = []
    for user, subvectors in user_multi_vectors.items():
        sim = user_query_similarity_norm(query_text, subvectors, get_embedding)
        scores.append((user, sim))
    df = pd.DataFrame(scores, columns=["User", "Similarity"])
    df.sort_values(by="Similarity", ascending=False, inplace=True)
    return df.head(top_n)
V3 – softmax¶
def softmax(x, temp=1.0):
    # Subtract the max for numerical stability
    e_x = np.exp((x - np.max(x)) / temp)
    return e_x / e_x.sum()

def user_query_similarity_softmax(query_text, user_subvectors, get_embedding_fn):
    query_vec = np.array(get_embedding_fn(query_text))
    sims = np.array([cosine_similarity([query_vec], [vec])[0][0]
                     for vec, _ in user_subvectors])
    weights = np.array([w for _, w in user_subvectors], dtype=float)
    weights = weights / weights.sum()
    # Attention over clusters: clusters closer to the query get more say
    attn = softmax(sims, temp=0.5)
    return float(np.dot(sims, attn * weights))

def list_similar_users_to_query_softmax(query_text, top_n=5):
    scores = []
    for user, subvectors in user_multi_vectors.items():
        sim = user_query_similarity_softmax(query_text, subvectors, get_embedding)
        scores.append((user, sim))
    df = pd.DataFrame(scores, columns=["User", "Similarity"])
    df.sort_values(by="Similarity", ascending=False, inplace=True)
    return df.head(top_n)
V4 – Weighted + Max¶
def user_query_similarity_mixed1(query_text, user_subvectors, get_embedding_fn, alpha=0.7):
    query_vec = np.array(get_embedding_fn(query_text))
    sims = np.array([cosine_similarity([query_vec], [vec])[0][0] for vec, _ in user_subvectors])
    weights = np.array([w for _, w in user_subvectors], dtype=float)
    weights = weights / weights.sum()
    weighted_avg = np.dot(sims, weights)
    best = sims.max()
    # Blend the count-weighted average with the single best-matching cluster
    return float(alpha * weighted_avg + (1 - alpha) * best)

def list_similar_users_to_query_mixed1(query_text, top_n=5):
    scores = []
    for user, subvectors in user_multi_vectors.items():
        sim = user_query_similarity_mixed1(query_text, subvectors, get_embedding)
        scores.append((user, sim))
    df = pd.DataFrame(scores, columns=["User", "Similarity"])
    df.sort_values(by="Similarity", ascending=False, inplace=True)
    return df.head(top_n)
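To see how the four scoring rules behave, it helps to run them on made-up numbers without any API calls. The sketch below assumes a user with two clusters: one close to the query but rarely visited, and one far from the query but visited often. The similarities and counts are invented for illustration. The `score_softmax_renorm` line is my own variant, not one of the four methods above. I added it because `attn * weights` in V3 does not sum to 1, which shrinks that score relative to the others.

```python
import numpy as np

def softmax(x, temp=1.0):
    e_x = np.exp((x - np.max(x)) / temp)
    return e_x / e_x.sum()

sims = np.array([0.60, 0.10])    # made-up cosine similarities per cluster
counts = np.array([2.0, 18.0])   # made-up interaction counts per cluster
weights = counts / counts.sum()  # normalized to [0.1, 0.9]

# V1: max of similarity times raw count (not bounded by 1)
score_max = float(np.max(sims * counts))
# V2: count-weighted average of similarities
score_norm = float(sims @ weights)
# V3: softmax attention multiplied by count weights (weights no longer sum to 1)
attn = softmax(sims, temp=0.5)
mix = attn * weights
score_softmax = float(sims @ mix)
# Variant (assumption, not from the post): renormalize the combined weights
score_softmax_renorm = float(sims @ (mix / mix.sum()))
# V4: blend of the weighted average and the best single cluster
alpha = 0.7
score_mixed = float(alpha * score_norm + (1 - alpha) * sims.max())

for name, value in [("max", score_max), ("norm", score_norm),
                    ("softmax", score_softmax),
                    ("softmax_renorm", score_softmax_renorm),
                    ("mixed1", score_mixed)]:
    print(f"{name:>15}: {value:.4f}")
```

In this toy case V1 is won by the frequent but poorly matching cluster, because raw counts dominate the product, while V4 gives the well-matching cluster more credit than the plain weighted average does.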
Test¶
def list_similar_users_to_query_multi(sim_type="max", query_text="", top_n=5):
    # Dispatch to one of the similarity strategies defined above
    # (parameter named sim_type to avoid shadowing the built-in type)
    if sim_type == "max":
        return list_similar_users_to_query_max(query_text, top_n)
    elif sim_type == "norm":
        return list_similar_users_to_query_norm(query_text, top_n)
    elif sim_type == "softmax":
        return list_similar_users_to_query_softmax(query_text, top_n)
    elif sim_type == "mixed1":
        return list_similar_users_to_query_mixed1(query_text, top_n)
    else:
        raise ValueError(f"Unknown type: {sim_type}")

def inspect_user(df, user_id):
    # Show one user's interactions, most frequent first
    user_data = df[df['user_id'] == user_id]
    if user_data.empty:
        return f"No data found for user_id: {user_id}"
    return user_data.sort_values(by="count", ascending=False)
Inspect¶
sim_type = "norm" # "max", "norm", "softmax", "mixed1"
query = "ski"
print(f"Query: {query}, type: {sim_type}")
print(list_similar_users_to_query_multi(sim_type, query, top_n=5))
Query: ski, type: norm
User Similarity
46 user_047 0.438464
20 user_021 0.385194
35 user_036 0.342506
27 user_028 0.338921
39 user_040 0.329243
inspect_user(df_users, 'user_047')
| | user_id | domain | theme | count |
|---|---|---|---|---|
| 286 | user_047 | Ski Zermatt Switzerland | Skiing | 6 |
| 289 | user_047 | Ski St Anton Austria | Skiing | 5 |
| 287 | user_047 | Ski Vail CO USA | Skiing | 3 |
| 288 | user_047 | Ski Whistler BC Canada | Skiing | 1 |
inspect_user(df_users, 'user_040')
| | user_id | domain | theme | count |
|---|---|---|---|---|
| 246 | user_040 | Gas Station Houston USA | Noise | 9 |
| 250 | user_040 | Ski St Anton Austria | Skiing | 8 |
| 248 | user_040 | Ski Aspen CO USA | Skiing | 7 |
| 244 | user_040 | CVS Pharmacy Boston USA | Noise | 6 |
| 249 | user_040 | Ski Whistler BC Canada | Skiing | 6 |
| 245 | user_040 | Grocery Store Chicago USA | Noise | 2 |
| 247 | user_040 | Ski Zermatt Switzerland | Skiing | 2 |
query = "surf"
print(f"Query: {query}, type: {sim_type}")
print(list_similar_users_to_query_multi(sim_type, query, top_n=5))
Query: surf, type: norm
User Similarity
0 user_001 0.502815
44 user_045 0.502730
16 user_017 0.502730
2 user_003 0.405118
1 user_002 0.401624
inspect_user(df_users, 'user_001')
| | user_id | domain | theme | count |
|---|---|---|---|---|
| 0 | user_001 | Surf Tofino BC Canada | Surfing | 4 |
| 1 | user_001 | Surf Huntington Beach CA USA | Surfing | 3 |
inspect_user(df_users, 'user_002')
| | user_id | domain | theme | count |
|---|---|---|---|---|
| 4 | user_002 | Surf Maui HI USA | Surfing | 9 |
| 7 | user_002 | Surf Bali Indonesia | Surfing | 8 |
| 5 | user_002 | Surf Bondi Beach Australia | Surfing | 7 |
| 8 | user_002 | Kpop Concert Seoul Korea | Pop Culture | 6 |
| 9 | user_002 | Taylor Swift Eras LA USA | Pop Culture | 5 |
| 2 | user_002 | Hiking Trail Banff Canada | Outdoor | 4 |
| 3 | user_002 | Nature Park Costa Rica | Outdoor | 4 |
| 6 | user_002 | Surf Tofino BC Canada | Surfing | 4 |
| 11 | user_002 | Comic Con San Diego USA | Pop Culture | 4 |
| 10 | user_002 | Anime Expo LA USA | Pop Culture | 3 |
query = "luxury"
print(f"Query: {query}, type: {sim_type}")
print(list_similar_users_to_query_multi(sim_type, query, top_n=5))
Query: luxury, type: norm
User Similarity
9 user_010 0.320341
14 user_015 0.286443
45 user_046 0.284857
49 user_050 0.277800
17 user_018 0.248539
inspect_user(df_users, 'user_010')
| | user_id | domain | theme | count |
|---|---|---|---|---|
| 58 | user_010 | Four Seasons Maldives | Luxury Travel | 8 |
| 60 | user_010 | Ski Whistler BC Canada | Skiing | 8 |
| 59 | user_010 | Ski St Anton Austria | Skiing | 3 |
| 57 | user_010 | Aman Tokyo Japan | Luxury Travel | 2 |
inspect_user(df_users, 'user_018')
| | user_id | domain | theme | count |
|---|---|---|---|---|
| 104 | user_018 | Aman Tokyo Japan | Luxury Travel | 10 |
| 105 | user_018 | Private Jet Service Dubai | Luxury Travel | 10 |
| 107 | user_018 | Taylor Swift Eras LA USA | Pop Culture | 10 |
| 106 | user_018 | Billie Eilish NYC USA | Pop Culture | 7 |
query = "outdoors"
print(f"Query: {query}, type: {sim_type}")
print(list_similar_users_to_query_multi(sim_type, query, top_n=5))
Query: outdoors, type: norm
User Similarity
0 user_001 0.333498
44 user_045 0.333498
16 user_017 0.333476
32 user_033 0.299596
1 user_002 0.298054
inspect_user(df_users, 'user_001')
| | user_id | domain | theme | count |
|---|---|---|---|---|
| 0 | user_001 | Surf Tofino BC Canada | Surfing | 4 |
| 1 | user_001 | Surf Huntington Beach CA USA | Surfing | 3 |
inspect_user(df_users, 'user_033')
| | user_id | domain | theme | count |
|---|---|---|---|---|
| 194 | user_033 | Nature Park Costa Rica | Outdoor | 8 |
| 195 | user_033 | Biking Trail Moab USA | Outdoor | 8 |
| 196 | user_033 | Hiking Trail Banff Canada | Outdoor | 5 |
Compare similarity methods¶
import plotly.express as px
import plotly.graph_objects as go
def compare_similarity_functions(query_text, top_n=5):
    # "max" is left out: it scales by raw counts, so its scores are on a different scale
    sim_types = ["norm", "softmax", "mixed1"]
    results = []
    for sim_type in sim_types:
        df = list_similar_users_to_query_multi(sim_type, query_text, top_n=top_n)
        df["Type"] = sim_type
        results.append(df)
    results_df = pd.concat(results)
    # Users missing from a method's top-N show up as 0
    heatmap_df = results_df.pivot(index="User", columns="Type", values="Similarity").fillna(0)
    # --- Heatmap ---
    fig_heatmap = px.imshow(
        heatmap_df,
        text_auto=".2f",
        aspect="auto",
        color_continuous_scale="RdBu",
        title=f"Query: '{query_text}' - User Similarities Across Methods",
    )
    fig_heatmap.update_layout(
        width=700, height=300, margin=dict(l=30, r=30, t=30, b=30)
    )
    fig_heatmap.show()
    # --- Bar chart ---
    bar_df = results_df.copy()
    fig_bar = px.bar(
        bar_df,
        x="User",
        y="Similarity",
        color="Type",
        barmode="group",
        title=f"Query: '{query_text}' - Similarity Scores",
    )
    fig_bar.update_layout(
        width=700, height=350,
        xaxis_tickangle=-45,
        margin=dict(l=30, r=30, t=30, b=30)
    )
    fig_bar.show()
    return heatmap_df
def plot_user_themes(df, user_ids):
    # Total interaction count per theme for the selected users
    filtered_df = df[df["user_id"].isin(user_ids)]
    theme_counts = filtered_df.groupby(["user_id", "theme"])["count"].sum().reset_index()
    fig = px.bar(
        theme_counts,
        x="user_id",
        y="count",
        color="theme",
        barmode="group",
        title="Theme Distribution per User",
        labels={"user_id": "User", "count": "Interaction Count"},
        height=400
    )
    fig.update_layout(xaxis_tickangle=-45, width=750)
    fig.show()
query = "surf"
comparison_df = compare_similarity_functions(query, top_n=5)
plot_user_themes(df_users, comparison_df.index)
Sorry for the rendering problem above. I still have to figure out how to fix Plotly frames in a WordPress HTML block, but I'm attaching them as an iframe here:
Does it solve all the problems mentioned before? Well, not exactly. But it does give us another way to look at the same problem, and with the secret sauce of an appropriate method for combining a user's similarity scores, we might have found the perfect hack for our little problem. Of course, that doesn't mean I didn't look into User LLM and the custom encoder Google suggested, which creates user embeddings by passing a user's interaction sequence through a self-supervised next-token-prediction transformer. It is basically attention that helps them combine embeddings. That one is on the back burner for me right now, but I'll try to make it the next post in this series at least.
