The purpose of this notebook is to study and evaluate the use of an embedding model for user segmentation.
Task¶
User segmentation / profiling (without training)
Data¶
Let's say I have data on which locations people are interested in (maybe they've searched for them or visited them).
I've formatted the data like:
"{user1}": {
    "{name of place} {city} {state} {country}": {frequency of search/visit},
    "{name of place} {city} {state} {country}": {frequency of search/visit},
    ...
}
Plan¶
We've all spent time on strenuous feature selection and engineering (distribution tests, predictive power, iteration after iteration) to select key features, encode them with things like one-hot encoding (OHE), train models to create user vectors, and then cluster them.
But what if we could instead encode the raw data directly and build user vectors that carry the context of each feature, making the user itself queryable? No predefined feature engineering, vectorization, or clustering: we use the power of pre-trained embedding models to bring in the knowledge of context.
Here's how the process could look (a compact sketch follows this list):
- get the distinct strings (let's call them domain entries)
- get an embedding for each distinct domain entry – {name of place} {city} {state} {country} – using an OpenAI embedding model
- weight each embedding by its frequency and average them together to create a user vector
- encode the query with the same embedding model
- list all users similar to a description (using cosine similarity, for instance)
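Before going step by step, here's a compact sketch of the whole plan, assuming some embedding function embed_fn; the notebook below does exactly this with the actual OpenAI client, and the helper names build_user_vector and rank_users_by_query are just illustrative, not part of any library:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def build_user_vector(place_frequencies, embed_fn):
    # frequency-weighted average of the embeddings of a user's domain entries
    entries = list(place_frequencies.keys())
    freqs = np.array([place_frequencies[e] for e in entries], dtype=float)
    vecs = np.array([embed_fn(e) for e in entries])            # shape: (n_entries, dim)
    return (freqs[:, None] * vecs).sum(axis=0) / freqs.sum()   # weighted mean

def rank_users_by_query(query_text, user_vectors, embed_fn, top_n=5):
    # rank users by cosine similarity of their vector to the query embedding
    q = np.array(embed_fn(query_text)).reshape(1, -1)
    sims = {u: cosine_similarity(v.reshape(1, -1), q)[0][0] for u, v in user_vectors.items()}
    return sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:top_n]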
Purpose of this notebook¶
To test the embedding model and get more insight into how it encodes: what we can do with its embeddings, what the characteristics of the embedding space are, and what a weighted average looks like in principle.
Concerns: in real life a user's behaviour could span tens of thousands of distinct domain entries, and combining them could lead to things like:
- dilution by too-frequent behaviours or too many domains
- dilution of weaker traits when combined
- information loss
- misclassification, etc. I discuss these below in the Test Cases section
We use OpenAI's text-embedding-3-small model to inspect and play with the embeddings.
To begin, the section below imports the necessary libraries and sets up the embedding function.
Imports¶
from dotenv import load_dotenv
import os
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"
from openai import OpenAI

load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")
openai_client = OpenAI(api_key=openai_api_key)
def get_embedding(text, deployment_name="text-embedding-3-small"):
    # embed a single text string and return its embedding vector (list of floats)
    response = openai_client.embeddings.create(
        input=[text],
        model=deployment_name
    )
    return response.data[0].embedding
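Side note: the embeddings endpoint also accepts a list of inputs, so if there are many distinct domain entries it's cheaper (in round trips) to embed them in batches. A minimal sketch; the batch size of 100 is just an illustrative choice:
def get_embeddings_batch(texts, deployment_name="text-embedding-3-small", batch_size=100):
    # embed many texts with fewer API calls; returns one vector per text, in input order
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = openai_client.embeddings.create(input=batch, model=deployment_name)
        embeddings.extend(item.embedding for item in response.data)
    return embeddings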
Example 1¶
We start with the standard example quoted everywhere: if the context is captured and we stay in the same vector space, then king – man + woman -> queen
texts = {
"king": "king",
"man": "man",
"woman": "woman",
"queen": "queen"
}
embeddings = {k: np.array(get_embedding(v)) for k, v in texts.items()}
king_vec = embeddings["king"]
man_vec = embeddings["man"]
woman_vec = embeddings["woman"]
queen_vec = embeddings["queen"]
analogy_vec = king_vec - man_vec + woman_vec
similarity = cosine_similarity([analogy_vec], [queen_vec])[0][0]
print(f"Cosine similarity with 'queen': {similarity:.4f}")
Cosine similarity with 'queen': 0.6162
print("Similarity of analogy vector to all texts:")
for word, vec in embeddings.items():
    sim = cosine_similarity([analogy_vec], [vec])[0][0]
    print(f"{word}: {sim:.4f}")
Similarity of analogy vector to all texts:
king: 0.7637
man: 0.1922
woman: 0.6023
queen: 0.6162
Even though the analogy vector is quite similar to queen (and woman), it's still most similar to king
embeddings['analogy'] = analogy_vec
labels = list(embeddings.keys())
embedding_matrix = np.array(list(embeddings.values()))
You could use either PCA or t-SNE to visualize; both reduce the 1536 dimensions down to 2–3 for plotting. It's important to note that they're approximations and the two work quite differently, in spite of what people keep saying :/ . In my experience t-SNE better captures the higher-dimensional structure, but you're free to try both. This case is very simple so it doesn't matter much here, but you can add more examples and see.
pca_3d = PCA(n_components=3)
pca_result = pca_3d.fit_transform(embedding_matrix)
fig_pca = px.scatter_3d(
x=pca_result[:, 0], y=pca_result[:, 1], z=pca_result[:, 2],
text=labels,
title="3D PCA of OpenAI Embeddings"
)
fig_pca.show()
# tsne_3d = TSNE(n_components=3, perplexity=3, n_iter=1000, random_state=42)
# tsne_result = tsne_3d.fit_transform(embedding_matrix)
# fig_tsne = px.scatter_3d(
# x=tsne_result[:, 0], y=tsne_result[:, 1], z=tsne_result[:, 2],
# text=labels,
# title="3D t-SNE of OpenAI Embeddings"
# )
# fig_tsne.show()
Example 2¶
It took me a while to come up with one good example set that demonstrates the potential challenges I wanted to show; of course, two hours of back and forth on ChatGPT helped.
user_searches = {
# Strong and coherent interest: SKI
"user_ski_only": {
"Ski Aspen Colorado United States": 6,
"Ski Whistler British Columbia Canada": 5,
"Ski St Anton Tyrol Austria": 5
},
# Strong interest diluted by frequent unrelated behaviour (groceries, coffee shops)
"user_ski_with_daily_life": {
"Ski Aspen Colorado United States": 6,
"Ski Whistler British Columbia Canada": 5,
"Ski St Anton Tyrol Austria": 5,
"Starbucks New York New York United States": 30,
"Trader Joe's San Francisco California United States": 25,
"Whole Foods Austin Texas United States": 20
},
# Weak trait: Taylor Swift fan (only 1–2 relevant signals)
"user_taylor_swift_fan": {
"Taylor Swift Eras Tour Los Angeles California United States": 2,
"Nail Salon Brooklyn New York United States": 3,
"Pop Music Club Miami Florida United States": 2
},
# Taylor Swift fan diluted by unrelated high-frequency daily searches
"user_taylor_swift_with_noise": {
"Taylor Swift Eras Tour Los Angeles California United States": 2,
"Nail Salon Brooklyn New York United States": 3,
"Pop Music Club Miami Florida United States": 2,
"Starbucks Chicago Illinois United States": 20,
"Walmart Phoenix Arizona United States": 18,
"Home Depot Houston Texas United States": 15,
"CVS Pharmacy Boston Massachusetts United States": 12
},
# Multi-interest user: SKI + Tennis + Horror
"user_ski_tennis_horror": {
"Ski Aspen Colorado United States": 4,
"City St Anton Tyrol Austria": 4,
"Tennis Court Queens New York United States": 4,
"City Wimbledon London United Kingdom": 4,
"Horror Nights Universal Orlando Florida United States": 4,
"Haunted House Salem Massachusetts United States": 4
},
# Multi-interest + unrelated noise (risk of cancellation / misclassification)
"user_ski_tennis_horror_with_noise": {
"Ski Aspen Colorado United States": 4,
"City St Anton Tyrol Austria": 4,
"Tennis Court Queens New York United States": 4,
"City Wimbledon London United Kingdom": 4,
"Horror Nights Universal Orlando Florida United States": 4,
"Haunted House Salem Massachusetts United States": 4,
"Grocery Store Dallas Texas United States": 30,
"Gas Station Atlanta Georgia United States": 25,
"Starbucks Los Angeles California United States": 20,
"Walmart Denver Colorado United States": 15
},
# Pure noise: No clear theme
"user_random_behaviour": {
"Starbucks New York New York United States": 20,
"McDonald's Chicago Illinois United States": 15,
"Walmart Los Angeles California United States": 18,
"Grocery Store Miami Florida United States": 12,
"Gym Boston Massachusetts United States": 5,
"Gas Station Seattle Washington United States": 8,
"Pharmacy San Diego California United States": 10
},
"user_ski_horror_romcom": {
"Ski Aspen Colorado United States": 4,
"Ski Zermatt Valais Switzerland": 4,
"Horror Nights Universal Orlando Florida United States": 4,
"Haunted House Salem Massachusetts United States": 4,
"Romantic Comedy Theatre New York New York United States": 4,
"Romantic Movies Los Angeles California United States": 4
},
"user_luxury": {
"Fairmont Chateau Whistler British Columbia Canada": 1,
"Pebble Beach Golf Links California United States": 2,
"Don Alfonso Toronto Canada": 2,
},
"user_party_parent": {
"Montessori School New York New York United States": 6,
"Children's Library New York New York United States": 6,
"Night Club Las Vegas Nevada United States": 6,
"Rooftop Bar New York New York United States": 6
}
}
Create user vectors¶
domain_entry embeddings¶
all_domain_entrys = set(pc for user in user_searches.values() for pc in user)
# print(f"All unique domain_entrys: {all_domain_entrys}")
domain_entry_embeddings = {pc: np.array(get_embedding(pc)) for pc in all_domain_entrys}
# len(domain_entry_embeddings[""])
len(domain_entry_embeddings["Fairmont Chateau Whistler British Columbia Canada"])
1536
Weighted average¶
Why this works:

$$\mathbf{u} = \frac{\sum_{i=1}^{N} f_i \, \mathbf{v}_i}{\sum_{i=1}^{N} f_i}$$

where we add up the embeddings $\mathbf{v}_i$ of the $N$ places, weight each by its frequency $f_i$, and average into a single vector $\mathbf{u}$ for the user.
Weighted averaging, as you can see, is linear (every entry contributes additively, in proportion to its weight), which is what makes it challenging:
there can be dilution (if one behaviour dominates), information loss (if behaviours oppose each other and partially cancel), and noise (too many low-frequency behaviours averaging out to something generic).
user_vectors = {}
for user, places in user_searches.items():
    vectors = []
    weights = []
    for pc, count in places.items():
        vec = domain_entry_embeddings[pc]
        vectors.append(vec * count)  # frequency-weighted embedding
        weights.append(count)
    # weighted mean: sum of (frequency * embedding) divided by total frequency
    avg_vector = np.sum(vectors, axis=0) / sum(weights)
    user_vectors[user] = avg_vector
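The loop above is equivalent to numpy's built-in weighted mean; a quick sanity check (just a sketch, it recomputes the same vectors with np.average):
for user, places in user_searches.items():
    entries = list(places.keys())
    vecs = np.array([domain_entry_embeddings[pc] for pc in entries])
    freqs = np.array([places[pc] for pc in entries], dtype=float)
    # np.average with weights gives the same frequency-weighted mean
    assert np.allclose(np.average(vecs, axis=0, weights=freqs), user_vectors[user])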
Inspect¶
Similar users¶
users = list(user_vectors.keys())
user_matrix = np.array([user_vectors[u] for u in users])
similarity_matrix = cosine_similarity(user_matrix)
similarity_df = pd.DataFrame(similarity_matrix, index=users, columns=users)
print("User-to-User Cosine Similarity:")
display(similarity_df)
User-to-User Cosine Similarity:
| | user_ski_only | user_ski_with_daily_life | user_taylor_swift_fan | user_taylor_swift_with_noise | user_ski_tennis_horror | user_ski_tennis_horror_with_noise | user_random_behaviour | user_ski_horror_romcom | user_luxury | user_party_parent |
|---|---|---|---|---|---|---|---|---|---|---|
| user_ski_only | 1.000000 | 0.540992 | 0.397819 | 0.456787 | 0.691260 | 0.485610 | 0.410343 | 0.671378 | 0.543556 | 0.390753 |
| user_ski_with_daily_life | 0.540992 | 1.000000 | 0.582823 | 0.778416 | 0.646878 | 0.774614 | 0.824077 | 0.608937 | 0.541601 | 0.638665 |
| user_taylor_swift_fan | 0.397819 | 0.582823 | 1.000000 | 0.619498 | 0.657044 | 0.603537 | 0.675986 | 0.629545 | 0.471972 | 0.666918 |
| user_taylor_swift_with_noise | 0.456787 | 0.778416 | 0.619498 | 1.000000 | 0.652059 | 0.846196 | 0.881911 | 0.599575 | 0.493377 | 0.539648 |
| user_ski_tennis_horror | 0.691260 | 0.646878 | 0.657044 | 0.652059 | 1.000000 | 0.695228 | 0.645057 | 0.853430 | 0.600145 | 0.631478 |
| user_ski_tennis_horror_with_noise | 0.485610 | 0.774614 | 0.603537 | 0.846196 | 0.695228 | 1.000000 | 0.889049 | 0.621469 | 0.492833 | 0.555950 |
| user_random_behaviour | 0.410343 | 0.824077 | 0.675986 | 0.881911 | 0.645057 | 0.889049 | 1.000000 | 0.617215 | 0.499182 | 0.646826 |
| user_ski_horror_romcom | 0.671378 | 0.608937 | 0.629545 | 0.599575 | 0.853430 | 0.621469 | 0.617215 | 1.000000 | 0.572886 | 0.626741 |
| user_luxury | 0.543556 | 0.541601 | 0.471972 | 0.493377 | 0.600145 | 0.492833 | 0.499182 | 0.572886 | 1.000000 | 0.454355 |
| user_party_parent | 0.390753 | 0.638665 | 0.666918 | 0.539648 | 0.631478 | 0.555950 | 0.646826 | 0.626741 | 0.454355 | 1.000000 |
# plt.figure(figsize=(10, 8))
# sns.heatmap(similarity_df, annot=True, fmt=".2f", cmap="coolwarm", square=True)
# plt.title("User to User Cosine Similarity")
# plt.xticks(rotation=45, ha='right')
# plt.yticks(rotation=0)
# plt.tight_layout()
# plt.show()
Users similar to query¶
weighted average user embeddings¶
def list_similar_users_to_query(query_text, top_n=5):
    # embed the query and rank user vectors by cosine similarity to it
    query_embedding = get_embedding(query_text)
    query_similarities = []
    for user, vec in user_vectors.items():
        sim = cosine_similarity([vec], [query_embedding])[0][0]
        query_similarities.append((user, sim))
    query_similarities.sort(key=lambda x: x[1], reverse=True)
    query_sim_df = pd.DataFrame(query_similarities, columns=["User", "cos_sim"])
    query_sim_df = query_sim_df[:top_n]
    print(f"Similarity to: {query_text}")
    # print(query_sim_df)
    return query_sim_df
list_similar_users_to_query("culture", top_n=5)
Similarity to: culture
| | User | cos_sim |
|---|---|---|
| 0 | user_ski_tennis_horror | 0.210922 |
| 1 | user_ski_horror_romcom | 0.206665 |
| 2 | user_luxury | 0.190605 |
| 3 | user_taylor_swift_fan | 0.189760 |
| 4 | user_party_parent | 0.171029 |
domain_entry embeddings¶
query_text = "culture"
query_embedding = get_embedding(query_text)
domain_entry_similarities = []
for user, places in user_searches.items():
for pc in places:
if pc in domain_entry_embeddings:
vec = domain_entry_embeddings[pc]
if isinstance(vec, np.ndarray) and vec.shape == (1536,):
sim = cosine_similarity([vec], [query_embedding])[0][0]
domain_entry_similarities.append((user, pc, sim))
domain_entry_sim_df = pd.DataFrame(domain_entry_similarities, columns=["User", "domain_entry", "Similarity"])
print(f"Top 5 domain_entrys for query '{query_text}':")
print(domain_entry_sim_df.head(5).to_string(index=False))
top_users_df = domain_entry_sim_df.groupby("User")["Similarity"].sum().sort_values(ascending=False).head(5).reset_index()
print(f"Top 5 users for query '{query_text}':")
print(top_users_df)
First 5 domain_entry similarities for query 'culture':
User domain_entry Similarity
user_ski_only Ski Aspen Colorado United States 0.106800
user_ski_only Ski Whistler British Columbia Canada 0.122614
user_ski_only Ski St Anton Tyrol Austria 0.082558
user_ski_with_daily_life Ski Aspen Colorado United States 0.106800
user_ski_with_daily_life Ski Whistler British Columbia Canada 0.122614
Top 5 users for query 'culture':
User Similarity
0 user_ski_tennis_horror_with_noise 1.151959
1 user_ski_horror_romcom 0.825786
2 user_ski_tennis_horror 0.820006
3 user_taylor_swift_with_noise 0.810314
4 user_random_behaviour 0.738377
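Note that summing per-entry similarities rewards users who simply have more entries, which is why the noise-heavy users float to the top here. One alternative worth trying (not used elsewhere in this notebook) is to score each user by their single best-matching entry instead:
# aggregate by max instead of sum: a user's score is their best-matching domain_entry
top_users_max_df = (
    domain_entry_sim_df.groupby("User")["Similarity"]
    .max()
    .sort_values(ascending=False)
    .head(5)
    .reset_index()
)
print(f"Top 5 users (max aggregation) for query '{query_text}':")
print(top_users_max_df)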
Visualize¶
pca = PCA(n_components=3)
pca_result = pca.fit_transform(user_matrix)
pca_df = pd.DataFrame(pca_result, columns=["PC1", "PC2", "PC3"])
pca_df["user"] = users
fig = px.scatter_3d(
pca_df, x="PC1", y="PC2", z="PC3",
text="user", title="3D PCA of User Embeddings from Weighted domain_entry Searches"
)
fig.show()
# from sklearn.manifold import TSNE
# tsne = TSNE(n_components=3, perplexity=5, n_iter=1000, random_state=42)
# tsne_result = tsne.fit_transform(user_matrix)
# tsne_df = pd.DataFrame(tsne_result, columns=["TSNE1", "TSNE2", "TSNE3"])
# tsne_df["user"] = users
# fig_tsne = px.scatter_3d(
# tsne_df, x="TSNE1", y="TSNE2", z="TSNE3",
# text="user", title="3D t-SNE of User Embeddings from Weighted Placecode Searches"
# )
# fig_tsne.show()
Test Cases¶
Dilution of Strong Characteristics – Multi-Domain Noise (Many Distinct Domains)¶
list_similar_users_to_query("horror", top_n=5)
# list_similar_users_to_query("halloween events", top_n=5)
# list_similar_users_to_query("haunted", top_n=5)
# Expected - user_ski_tennis_horror, user_ski_tennis_horror_with_noise
# Observed - user_ski_tennis_horror_with_noise is considerably lower because of additional factors
Similarity to: horror
| | User | cos_sim |
|---|---|---|
| 0 | user_ski_horror_romcom | 0.335731 |
| 1 | user_ski_tennis_horror | 0.261294 |
| 2 | user_party_parent | 0.132138 |
| 3 | user_luxury | 0.124931 |
| 4 | user_taylor_swift_fan | 0.116897 |
Dilution of Strong Characteristics – Dominant Features (High-Frequency Activities)¶
list_similar_users_to_query("ski", top_n=5)
# list_similar_users_to_query("ski enthusiast", top_n=5)
# Expected - user_ski_only followed by user_ski_with_daily_life - both ranked higher than user_ski_tennis_horror, user_ski_tennis_horror_with_noise
# Observed - user_ski_with_daily_life, even though it has more ski searches and higher frequencies, is ranked lower
Similarity to: ski
| | User | cos_sim |
|---|---|---|
| 0 | user_ski_only | 0.412250 |
| 1 | user_ski_horror_romcom | 0.261849 |
| 2 | user_ski_tennis_horror | 0.245377 |
| 3 | user_ski_with_daily_life | 0.225419 |
| 4 | user_luxury | 0.213018 |
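One way to see this dilution numerically: in user_ski_with_daily_life the ski entries carry only a small share of the total weight that goes into the average. A quick check (the substring match on 'Ski' is just a convenience for this toy data):
places = user_searches["user_ski_with_daily_life"]
ski_weight = sum(count for pc, count in places.items() if "Ski" in pc)
total_weight = sum(places.values())
print(f"ski share of the weighted average: {ski_weight / total_weight:.2%}")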
Dilution of Weak Characteristics – Multi-Domain Noise (Many Distinct Domains)¶
list_similar_users_to_query("Taylor Swift", top_n=5)
# Expected - user_taylor_swift_fan followed by user_taylor_swift_with_noise
# Observed - user_taylor_swift_with_noise is ranked lower due to dilution by unrelated domains
Similarity to: Taylor Swift
| | User | cos_sim |
|---|---|---|
| 0 | user_taylor_swift_fan | 0.326926 |
| 1 | user_ski_with_daily_life | 0.227022 |
| 2 | user_ski_horror_romcom | 0.215764 |
| 3 | user_ski_tennis_horror | 0.211203 |
| 4 | user_luxury | 0.208624 |
Incorrect information – possibly due to specific combinations¶
list_similar_users_to_query("luxury", top_n=5)
# Expected - user_luxury followed by ski or something
# Observed - user_ski_horror_romcom (perhaps its urban / entertainment themes?) looks closer to 'luxury' than the user with actual luxury places
Similarity to: luxury
| | User | cos_sim |
|---|---|---|
| 0 | user_ski_horror_romcom | 0.243705 |
| 1 | user_ski_tennis_horror | 0.229921 |
| 2 | user_luxury | 0.228843 |
| 3 | user_party_parent | 0.201947 |
| 4 | user_taylor_swift_fan | 0.195764 |
Information loss – Domain Conflicts (opposing domains)¶
# list_similar_users_to_query("party", top_n=5)
list_similar_users_to_query("parent", top_n=5)
# Expected - user_party_parent
# Observed - user_party_parent ranks very low, likely because the children-related and party-related signals conflict semantically and partially cancel each other when combined
Similarity to: parent
| | User | cos_sim |
|---|---|---|
| 0 | user_luxury | 0.193544 |
| 1 | user_ski_tennis_horror | 0.173891 |
| 2 | user_ski_horror_romcom | 0.158006 |
| 3 | user_ski_only | 0.140331 |
| 4 | user_taylor_swift_fan | 0.102343 |
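To see the cancellation effect in isolation, here's a toy 2-D illustration with made-up vectors (not real embeddings): two partly opposing traits average into a vector that isn't very similar to either of them.
# toy example: averaging two partly opposing vectors weakens similarity to both
party = np.array([1.0, 0.2])
parent = np.array([-0.9, 0.3])
combined = (party + parent) / 2
print(f"cosine(party, parent) = {cosine_similarity([party], [parent])[0][0]:.3f}")
for name, vec in [("party", party), ("parent", parent)]:
    print(f"cosine(combined, {name}) = {cosine_similarity([combined], [vec])[0][0]:.3f}")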
https://github.com/w-winnie/livnlearn/blob/main/embeddings_livnlearnversion.ipynb
