The purpose of this notebook is to study and evaluate the use of an embedding model for user segmentation.
Task¶
User segmentation / profiling (without training)
Data¶
Let's say I have data on which locations people are interested in (maybe they've searched for them or visited them).
I've formatted the data like:
"{user1}": {
    "{name of place} {city} {state} {country}": {frequency of search/visit},
    "{name of place} {city} {state} {country}": {frequency of search/visit},
    ...
}
Plan¶
We've all spent time on strenuous feature selection and engineering (distribution tests, predictive power, iteration after iteration) to select key features, encode them with things like one-hot encoding (OHE), train models to create user vectors, and then cluster them.
But what if we could instead encode the raw data directly and build user vectors that carry the context of each feature, making the user itself queryable? No predefined feature engineering, vectorization, or clustering: we use the power of pre-trained embedding models to bring in the knowledge of context.
Here's how the process could look (a compact sketch follows this list):
- get the distinct strings (let's call them domain entries)
- get an embedding for each distinct domain entry – {name of place} {city} {state} {country} – using an OpenAI embedding model
- weight each embedding by its frequency and average them together to create a user vector
- encode the query with the same embedding model
- list all users similar to a description (using cosine similarity, for instance)
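Before going step by step, here's a compact sketch of the whole plan, assuming some embedding function embed_fn; the notebook below does exactly this with the actual OpenAI client, and the helper names build_user_vector and rank_users_by_query are just illustrative, not part of any library:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def build_user_vector(place_frequencies, embed_fn):
    # frequency-weighted average of the embeddings of a user's domain entries
    entries = list(place_frequencies.keys())
    freqs = np.array([place_frequencies[e] for e in entries], dtype=float)
    vecs = np.array([embed_fn(e) for e in entries])            # shape: (n_entries, dim)
    return (freqs[:, None] * vecs).sum(axis=0) / freqs.sum()   # weighted mean

def rank_users_by_query(query_text, user_vectors, embed_fn, top_n=5):
    # rank users by cosine similarity of their vector to the query embedding
    q = np.array(embed_fn(query_text)).reshape(1, -1)
    sims = {u: cosine_similarity(v.reshape(1, -1), q)[0][0] for u, v in user_vectors.items()}
    return sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:top_n]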
Purpose of this notebook¶
To test the embedding model and get more insight into how it encodes: what we can do with its embeddings, what the characteristics of the embedding space are, and what a weighted average looks like in principle.
Concerns: in real life a user's behaviour could span tens of thousands of distinct domain entries, and combining them could lead to things like:
- dilution by too-frequent behaviours or too many domains
- dilution of weaker traits when combined
- information loss
- misclassification, etc. I discuss these below in the Test Cases section
We use OpenAI's text-embedding-3-small model to inspect and play with the embeddings.
To begin, the section below imports the necessary libraries and sets up the embedding function.
Imports¶
from dotenv import load_dotenv
import os
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"
from openai import OpenAI

load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")
openai_client = OpenAI(api_key=openai_api_key)
def get_embedding(text, deployment_name="text-embedding-3-small"):
    # embed a single text string and return its embedding vector (list of floats)
    response = openai_client.embeddings.create(
        input=[text],
        model=deployment_name
    )
    return response.data[0].embedding
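Side note: the embeddings endpoint also accepts a list of inputs, so if there are many distinct domain entries it's cheaper (in round trips) to embed them in batches. A minimal sketch; the batch size of 100 is just an illustrative choice:
def get_embeddings_batch(texts, deployment_name="text-embedding-3-small", batch_size=100):
    # embed many texts with fewer API calls; returns one vector per text, in input order
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = openai_client.embeddings.create(input=batch, model=deployment_name)
        embeddings.extend(item.embedding for item in response.data)
    return embeddings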
Example 1¶
We start with the standard example quoted everywhere: if the context is captured and we stay in the same vector space, then king – man + woman -> queen
texts = {
"king": "king",
"man": "man",
"woman": "woman",
"queen": "queen"
}
embeddings = {k: np.array(get_embedding(v)) for k, v in texts.items()}
king_vec = embeddings["king"]
man_vec = embeddings["man"]
woman_vec = embeddings["woman"]
queen_vec = embeddings["queen"]
analogy_vec = king_vec - man_vec + woman_vec
similarity = cosine_similarity([analogy_vec], [queen_vec])[0][0]
print(f"Cosine similarity with 'queen': {similarity:.4f}")
Cosine similarity with 'queen': 0.6162
print("Similarity of analogy vector to all texts:")
for word, vec in embeddings.items():
    sim = cosine_similarity([analogy_vec], [vec])[0][0]
    print(f"{word}: {sim:.4f}")
Similarity of analogy vector to all texts:
king: 0.7637
man: 0.1922
woman: 0.6023
queen: 0.6162
Even though the analogy vector is quite similar to queen (and woman), it's still most similar to king
embeddings['analogy'] = analogy_vec
labels = list(embeddings.keys())
embedding_matrix = np.array(list(embeddings.values()))
You could use either PCA or t-SNE to visualize; both reduce the 1536 dimensions down to 2–3 for plotting. It's important to note that they're approximations and the two work quite differently, in spite of what people keep saying :/ . In my experience t-SNE better captures the higher-dimensional structure, but you're free to try both. This case is very simple so it doesn't matter much here, but you can add more examples and see.
pca_3d = PCA(n_components=3)
pca_result = pca_3d.fit_transform(embedding_matrix)
fig_pca = px.scatter_3d(
x=pca_result[:, 0], y=pca_result[:, 1], z=pca_result[:, 2],
text=labels,
title="3D PCA of OpenAI Embeddings"
)
fig_pca.show()
# tsne_3d = TSNE(n_components=3, perplexity=3, n_iter=1000, random_state=42)
# tsne_result = tsne_3d.fit_transform(embedding_matrix)
# fig_tsne = px.scatter_3d(
# x=tsne_result[:, 0], y=tsne_result[:, 1], z=tsne_result[:, 2],
# text=labels,
# title="3D t-SNE of OpenAI Embeddings"
# )
# fig_tsne.show()
Example 2¶
It took me a while to come up with one good example set that demonstrates the potential challenges I wanted to show; of course, two hours of back and forth on ChatGPT helped.
user_searches = {
# Strong and coherent interest: SKI
"user_ski_only": {
"Ski Aspen Colorado United States": 6,
"Ski Whistler British Columbia Canada": 5,
"Ski St Anton Tyrol Austria": 5
},
# Strong interest diluted by frequent unrelated behaviour (groceries, coffee shops)
"user_ski_with_daily_life": {
"Ski Aspen Colorado United States": 6,
"Ski Whistler British Columbia Canada": 5,
"Ski St Anton Tyrol Austria": 5,
"Starbucks New York New York United States": 30,
"Trader Joe's San Francisco California United States": 25,
"Whole Foods Austin Texas United States": 20
},
# Weak trait: Taylor Swift fan (only 1–2 relevant signals)
"user_taylor_swift_fan": {
"Taylor Swift Eras Tour Los Angeles California United States": 2,
"Nail Salon Brooklyn New York United States": 3,
"Pop Music Club Miami Florida United States": 2
},
# Taylor Swift fan diluted by unrelated high-frequency daily searches
"user_taylor_swift_with_noise": {
"Taylor Swift Eras Tour Los Angeles California United States": 2,
"Nail Salon Brooklyn New York United States": 3,
"Pop Music Club Miami Florida United States": 2,
"Starbucks Chicago Illinois United States": 20,
"Walmart Phoenix Arizona United States": 18,
"Home Depot Houston Texas United States": 15,
"CVS Pharmacy Boston Massachusetts United States": 12
},
# Multi-interest user: SKI + Tennis + Horror
"user_ski_tennis_horror": {
"Ski Aspen Colorado United States": 4,
"City St Anton Tyrol Austria": 4,
"Tennis Court Queens New York United States": 4,
"City Wimbledon London United Kingdom": 4,
"Horror Nights Universal Orlando Florida United States": 4,
"Haunted House Salem Massachusetts United States": 4
},
# Multi-interest + unrelated noise (risk of cancellation / misclassification)
"user_ski_tennis_horror_with_noise": {
"Ski Aspen Colorado United States": 4,
"City St Anton Tyrol Austria": 4,
"Tennis Court Queens New York United States": 4,
"City Wimbledon London United Kingdom": 4,
"Horror Nights Universal Orlando Florida United States": 4,
"Haunted House Salem Massachusetts United States": 4,
"Grocery Store Dallas Texas United States": 30,
"Gas Station Atlanta Georgia United States": 25,
"Starbucks Los Angeles California United States": 20,
"Walmart Denver Colorado United States": 15
},
# Pure noise: No clear theme
"user_random_behaviour": {
"Starbucks New York New York United States": 20,
"McDonald's Chicago Illinois United States": 15,
"Walmart Los Angeles California United States": 18,
"Grocery Store Miami Florida United States": 12,
"Gym Boston Massachusetts United States": 5,
"Gas Station Seattle Washington United States": 8,
"Pharmacy San Diego California United States": 10
},
"user_ski_horror_romcom": {
"Ski Aspen Colorado United States": 4,
"Ski Zermatt Valais Switzerland": 4,
"Horror Nights Universal Orlando Florida United States": 4,
"Haunted House Salem Massachusetts United States": 4,
"Romantic Comedy Theatre New York New York United States": 4,
"Romantic Movies Los Angeles California United States": 4
},
"user_luxury": {
"Fairmont Chateau Whistler British Columbia Canada": 1,
"Pebble Beach Golf Links California United States": 2,
"Don Alfonso Toronto Canada": 2,
},
"user_party_parent": {
"Montessori School New York New York United States": 6,
"Children's Library New York New York United States": 6,
"Night Club Las Vegas Nevada United States": 6,
"Rooftop Bar New York New York United States": 6
}
}
Create user vectors¶
domain_entry embeddings¶
all_domain_entrys = set(pc for user in user_searches.values() for pc in user)
# print(f"All unique domain_entrys: {all_domain_entrys}")
domain_entry_embeddings = {pc: np.array(get_embedding(pc)) for pc in all_domain_entrys}
# len(domain_entry_embeddings[""])
len(domain_entry_embeddings["Fairmont Chateau Whistler British Columbia Canada"])
1536
Weighted average¶
Why this works:

$$\mathbf{u} = \frac{\sum_{i=1}^{N} f_i \, \mathbf{v}_i}{\sum_{i=1}^{N} f_i}$$

where we add up the embeddings $\mathbf{v}_i$ of the $N$ places, weight each by its frequency $f_i$, and average into a single vector $\mathbf{u}$ for the user.
Weighted averaging, as you can see, is linear (every entry contributes additively, in proportion to its weight), which is what makes it challenging:
there can be dilution (if one behaviour dominates), information loss (if behaviours oppose each other and partially cancel), and noise (too many low-frequency behaviours averaging out to something generic).
user_vectors = {}
for user, places in user_searches.items():
    vectors = []
    weights = []
    for pc, count in places.items():
        vec = domain_entry_embeddings[pc]
        vectors.append(vec * count)  # frequency-weighted embedding
        weights.append(count)
    # weighted mean: sum of (frequency * embedding) divided by total frequency
    avg_vector = np.sum(vectors, axis=0) / sum(weights)
    user_vectors[user] = avg_vector
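The loop above is equivalent to numpy's built-in weighted mean; a quick sanity check (just a sketch, it recomputes the same vectors with np.average):
for user, places in user_searches.items():
    entries = list(places.keys())
    vecs = np.array([domain_entry_embeddings[pc] for pc in entries])
    freqs = np.array([places[pc] for pc in entries], dtype=float)
    # np.average with weights gives the same frequency-weighted mean
    assert np.allclose(np.average(vecs, axis=0, weights=freqs), user_vectors[user])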
Inspect¶
Similar users¶
users = list(user_vectors.keys())
user_matrix = np.array([user_vectors[u] for u in users])
similarity_matrix = cosine_similarity(user_matrix)
similarity_df = pd.DataFrame(similarity_matrix, index=users, columns=users)
print("User-to-User Cosine Similarity:")
display(similarity_df)
User-to-User Cosine Similarity:
| | user_ski_only | user_ski_with_daily_life | user_taylor_swift_fan | user_taylor_swift_with_noise | user_ski_tennis_horror | user_ski_tennis_horror_with_noise | user_random_behaviour | user_ski_horror_romcom | user_luxury | user_party_parent |
|---|---|---|---|---|---|---|---|---|---|---|
| user_ski_only | 1.000000 | 0.540992 | 0.397819 | 0.456787 | 0.691260 | 0.485610 | 0.410343 | 0.671378 | 0.543556 | 0.390753 |
| user_ski_with_daily_life | 0.540992 | 1.000000 | 0.582823 | 0.778416 | 0.646878 | 0.774614 | 0.824077 | 0.608937 | 0.541601 | 0.638665 |
| user_taylor_swift_fan | 0.397819 | 0.582823 | 1.000000 | 0.619498 | 0.657044 | 0.603537 | 0.675986 | 0.629545 | 0.471972 | 0.666918 |
| user_taylor_swift_with_noise | 0.456787 | 0.778416 | 0.619498 | 1.000000 | 0.652059 | 0.846196 | 0.881911 | 0.599575 | 0.493377 | 0.539648 |
| user_ski_tennis_horror | 0.691260 | 0.646878 | 0.657044 | 0.652059 | 1.000000 | 0.695228 | 0.645057 | 0.853430 | 0.600145 | 0.631478 |
| user_ski_tennis_horror_with_noise | 0.485610 | 0.774614 | 0.603537 | 0.846196 | 0.695228 | 1.000000 | 0.889049 | 0.621469 | 0.492833 | 0.555950 |
| user_random_behaviour | 0.410343 | 0.824077 | 0.675986 | 0.881911 | 0.645057 | 0.889049 | 1.000000 | 0.617215 | 0.499182 | 0.646826 |
| user_ski_horror_romcom | 0.671378 | 0.608937 | 0.629545 | 0.599575 | 0.853430 | 0.621469 | 0.617215 | 1.000000 | 0.572886 | 0.626741 |
| user_luxury | 0.543556 | 0.541601 | 0.471972 | 0.493377 | 0.600145 | 0.492833 | 0.499182 | 0.572886 | 1.000000 | 0.454355 |
| user_party_parent | 0.390753 | 0.638665 | 0.666918 | 0.539648 | 0.631478 | 0.555950 | 0.646826 | 0.626741 | 0.454355 | 1.000000 |
# plt.figure(figsize=(10, 8))
# sns.heatmap(similarity_df, annot=True, fmt=".2f", cmap="coolwarm", square=True)
# plt.title("User to User Cosine Similarity")
# plt.xticks(rotation=45, ha='right')
# plt.yticks(rotation=0)
# plt.tight_layout()
# plt.show()
Users similar to query¶
weighted average user embeddings¶
def list_similar_users_to_query(query_text, top_n=5):
    # embed the query and rank user vectors by cosine similarity to it
    query_embedding = get_embedding(query_text)
    query_similarities = []
    for user, vec in user_vectors.items():
        sim = cosine_similarity([vec], [query_embedding])[0][0]
        query_similarities.append((user, sim))
    query_similarities.sort(key=lambda x: x[1], reverse=True)
    query_sim_df = pd.DataFrame(query_similarities, columns=["User", "cos_sim"])
    query_sim_df = query_sim_df[:top_n]
    print(f"Similarity to: {query_text}")
    # print(query_sim_df)
    return query_sim_df
list_similar_users_to_query("culture", top_n=5)
Similarity to: culture
| | User | cos_sim |
|---|---|---|
| 0 | user_ski_tennis_horror | 0.210922 |
| 1 | user_ski_horror_romcom | 0.206665 |
| 2 | user_luxury | 0.190605 |
| 3 | user_taylor_swift_fan | 0.189760 |
| 4 | user_party_parent | 0.171029 |
domain_entry embeddings¶
query_text = "culture"
query_embedding = get_embedding(query_text)
domain_entry_similarities = []
for user, places in user_searches.items():
for pc in places:
if pc in domain_entry_embeddings:
vec = domain_entry_embeddings[pc]
if isinstance(vec, np.ndarray) and vec.shape == (1536,):
sim = cosine_similarity([vec], [query_embedding])[0][0]
domain_entry_similarities.append((user, pc, sim))
domain_entry_sim_df = pd.DataFrame(domain_entry_similarities, columns=["User", "domain_entry", "Similarity"])
print(f"Top 5 domain_entrys for query '{query_text}':")
print(domain_entry_sim_df.head(5).to_string(index=False))
top_users_df = domain_entry_sim_df.groupby("User")["Similarity"].sum().sort_values(ascending=False).head(5).reset_index()
print(f"Top 5 users for query '{query_text}':")
print(top_users_df)
First 5 domain_entry similarities for query 'culture':
User domain_entry Similarity
user_ski_only Ski Aspen Colorado United States 0.106800
user_ski_only Ski Whistler British Columbia Canada 0.122614
user_ski_only Ski St Anton Tyrol Austria 0.082558
user_ski_with_daily_life Ski Aspen Colorado United States 0.106800
user_ski_with_daily_life Ski Whistler British Columbia Canada 0.122614
Top 5 users for query 'culture':
User Similarity
0 user_ski_tennis_horror_with_noise 1.151959
1 user_ski_horror_romcom 0.825786
2 user_ski_tennis_horror 0.820006
3 user_taylor_swift_with_noise 0.810314
4 user_random_behaviour 0.738377
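Note that summing per-entry similarities rewards users who simply have more entries, which is why the noise-heavy users float to the top here. One alternative worth trying (not used elsewhere in this notebook) is to score each user by their single best-matching entry instead:
# aggregate by max instead of sum: a user's score is their best-matching domain_entry
top_users_max_df = (
    domain_entry_sim_df.groupby("User")["Similarity"]
    .max()
    .sort_values(ascending=False)
    .head(5)
    .reset_index()
)
print(f"Top 5 users (max aggregation) for query '{query_text}':")
print(top_users_max_df)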
Visualize¶
pca = PCA(n_components=3)
pca_result = pca.fit_transform(user_matrix)
pca_df = pd.DataFrame(pca_result, columns=["PC1", "PC2", "PC3"])
pca_df["user"] = users
fig = px.scatter_3d(
pca_df, x="PC1", y="PC2", z="PC3",
text="user", title="3D PCA of User Embeddings from Weighted domain_entry Searches"
)
fig.show()
# from sklearn.manifold import TSNE
# tsne = TSNE(n_components=3, perplexity=5, n_iter=1000, random_state=42)
# tsne_result = tsne.fit_transform(user_matrix)
# tsne_df = pd.DataFrame(tsne_result, columns=["TSNE1", "TSNE2", "TSNE3"])
# tsne_df["user"] = users
# fig_tsne = px.scatter_3d(
# tsne_df, x="TSNE1", y="TSNE2", z="TSNE3",
# text="user", title="3D t-SNE of User Embeddings from Weighted Placecode Searches"
# )
# fig_tsne.show()
Test Cases¶
Dilution of Strong Characteristics – Multi-Domain Noise (Many Distinct Domains)¶
list_similar_users_to_query("horror", top_n=5)
# list_similar_users_to_query("halloween events", top_n=5)
# list_similar_users_to_query("haunted", top_n=5)
# Expected - user_ski_tennis_horror, user_ski_tennis_horror_with_noise
# Observed - user_ski_tennis_horror_with_noise is considerably lower because of additional factors
Similarity to: horror
| | User | cos_sim |
|---|---|---|
| 0 | user_ski_horror_romcom | 0.335731 |
| 1 | user_ski_tennis_horror | 0.261294 |
| 2 | user_party_parent | 0.132138 |
| 3 | user_luxury | 0.124931 |
| 4 | user_taylor_swift_fan | 0.116897 |
Dilution of Strong Characteristics – Dominant Features (High-Frequency Activities)¶
list_similar_users_to_query("ski", top_n=5)
# list_similar_users_to_query("ski enthusiast", top_n=5)
# Expected - user_ski_only followed by user_ski_with_daily_life - both ranked higher than user_ski_tennis_horror, user_ski_tennis_horror_with_noise
# Observed - user_ski_with_daily_life, even though it has more ski searches and higher frequencies, is ranked lower
Similarity to: ski
| | User | cos_sim |
|---|---|---|
| 0 | user_ski_only | 0.412250 |
| 1 | user_ski_horror_romcom | 0.261849 |
| 2 | user_ski_tennis_horror | 0.245377 |
| 3 | user_ski_with_daily_life | 0.225419 |
| 4 | user_luxury | 0.213018 |
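One way to see this dilution numerically: in user_ski_with_daily_life the ski entries carry only a small share of the total weight that goes into the average. A quick check (the substring match on 'Ski' is just a convenience for this toy data):
places = user_searches["user_ski_with_daily_life"]
ski_weight = sum(count for pc, count in places.items() if "Ski" in pc)
total_weight = sum(places.values())
print(f"ski share of the weighted average: {ski_weight / total_weight:.2%}")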
Dilution of Weak Characteristics – Multi-Domain Noise (Many Distinct Domains)¶
list_similar_users_to_query("Taylor Swift", top_n=5)
# Expected - user_taylor_swift_fan followed by user_taylor_swift_with_noise
# Observed - user_taylor_swift_with_noise is ranked lower due to dilution by unrelated domains
Similarity to: Taylor Swift
| | User | cos_sim |
|---|---|---|
| 0 | user_taylor_swift_fan | 0.326926 |
| 1 | user_ski_with_daily_life | 0.227022 |
| 2 | user_ski_horror_romcom | 0.215764 |
| 3 | user_ski_tennis_horror | 0.211203 |
| 4 | user_luxury | 0.208624 |
Incorrect information – possibly due to specific combinations¶
list_similar_users_to_query("luxury", top_n=5)
# Expected - user_luxury followed by ski or something
# Observed - user_ski_horror_romcom (perhaps its urban / entertainment themes?) looks closer to 'luxury' than the user with actual luxury places
Similarity to: luxury
| | User | cos_sim |
|---|---|---|
| 0 | user_ski_horror_romcom | 0.243705 |
| 1 | user_ski_tennis_horror | 0.229921 |
| 2 | user_luxury | 0.228843 |
| 3 | user_party_parent | 0.201947 |
| 4 | user_taylor_swift_fan | 0.195764 |
Information loss – Domain Conflicts (opposing domains)¶
# list_similar_users_to_query("party", top_n=5)
list_similar_users_to_query("parent", top_n=5)
# Expected - user_party_parent
# Observed - user_party_parent ranks very low, likely because the children-related and party-related signals conflict semantically and partially cancel each other when combined
Similarity to: parent
| | User | cos_sim |
|---|---|---|
| 0 | user_luxury | 0.193544 |
| 1 | user_ski_tennis_horror | 0.173891 |
| 2 | user_ski_horror_romcom | 0.158006 |
| 3 | user_ski_only | 0.140331 |
| 4 | user_taylor_swift_fan | 0.102343 |
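To see the cancellation effect in isolation, here's a toy 2-D illustration with made-up vectors (not real embeddings): two partly opposing traits average into a vector that isn't very similar to either of them.
# toy example: averaging two partly opposing vectors weakens similarity to both
party = np.array([1.0, 0.2])
parent = np.array([-0.9, 0.3])
combined = (party + parent) / 2
print(f"cosine(party, parent) = {cosine_similarity([party], [parent])[0][0]:.3f}")
for name, vec in [("party", party), ("parent", parent)]:
    print(f"cosine(combined, {name}) = {cosine_similarity([combined], [vec])[0][0]:.3f}")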
https://github.com/w-winnie/livnlearn/blob/main/embeddings_livnlearnversion.ipynb
