Embeddings II – Combining hack

In Embeddings I we explored an LLM embedding model, the idea of using this pretrained encoder to make our mundane features context-aware, and the challenges of naively combining the resulting vectors.

To recap: we have a made-up use case with users and their interactions across different domains, and we want to group users together, build a recommendation system for a user, or something along those lines. To do that (I bet there are other ways), we want to create a user profile. Earlier we tried getting an embedding for each domain and combining them into a single user vector using frequency-weighted averaging. Since these aren't one-hot encodings, that averaging has some real concerns, both theoretical and computational, especially if we scale our little example up to real data with millions of users and thousands of domains.
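To see the concern concretely, here's a toy sketch of how frequency-weighted averaging can cancel distinct interests into a meaningless point (2-D stand-ins for the 1536-d embeddings, with made-up frequencies):

```python
import numpy as np

# Two opposed interests (toy 2-D stand-ins for 1536-d domain embeddings):
cooking = np.array([1.0, 0.0])
gaming = np.array([-1.0, 0.0])

# Frequency-weighted average, as in the previous post's approach:
freqs = np.array([5, 5])
user = (freqs[:, None] * np.stack([cooking, gaming])).sum(axis=0) / freqs.sum()
print(user)  # [0. 0.] – the two interests cancel into an uninformative point
```

Both interests are equally strong, yet the averaged vector lands at the origin and resembles neither of them.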

Here we discuss a different, kind of hacky approach for our use case – not as fancy as Google's User LLM, which we'll probably discuss later at some point (it's upcoming in this series).

So, what we basically do is:

  1. Embed individual domains into vector space using an LLM embedding model (text-embedding-3-small)
  2. Cluster these domain vectors in the embedding space and dynamically identify the sub-domains we will define the user in
  3. We can also label these clusters using GPT to generate human-readable tags – I opted for taking the top N domains by user frequency in each cluster
  4. Now that the clusters are defined, we build user profiles as a set of weighted vectors across multiple clusters. Our single user object becomes Nx1536 instead of a 1-D (1536,) vector – the user is now N-dimensional. When you think about it, this makes sense: the domains within a cluster represent a single or very related behaviour in the embedding space, so they can be efficiently combined into a representative centroid, and they won't interfere with drastically different vectors in any arithmetic operations
  5. Compute similarity between query prompts and users using different strategies (e.g. max, weighted average, softmax attention)
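Steps 1–4 can be sketched roughly like this. Random unit vectors stand in for text-embedding-3-small output (in practice these come from the embeddings API), and the interaction counts, cluster count `k`, and variable names are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for text-embedding-3-small output: 50 domains, 1536-d, unit norm.
n_domains, dim = 50, 1536
domain_vecs = rng.normal(size=(n_domains, dim))
domain_vecs /= np.linalg.norm(domain_vecs, axis=1, keepdims=True)

# Step 2: cluster domains into sub-domains (k chosen ad hoc for the sketch).
k = 5
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(domain_vecs)

# Step 4: a user is a set of weighted cluster vectors, not one averaged point.
# user_counts maps domain index -> interaction frequency for this user.
user_counts = {3: 10, 7: 5, 21: 2, 40: 8}

profile = np.zeros((k, dim))
weights = np.zeros(k)
for d, c in user_counts.items():
    cl = km.labels_[d]
    profile[cl] += c * domain_vecs[d]
    weights[cl] += c

active = weights > 0
profile[active] /= weights[active, None]  # per-cluster weighted centroid
profile[active] /= np.linalg.norm(profile[active], axis=1, keepdims=True)
weights /= weights.sum()  # cluster mixture weights

print(profile.shape)  # the user object is N x 1536, not 1 x 1536
```

Averaging now only happens inside a cluster, where the vectors are close by construction, so opposed interests end up in separate rows instead of cancelling.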

This lets us represent a user as a mixture of interests instead of a single-point mishmash of everything the user has done.
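For step 5, the three scoring strategies might look like this. This is a hypothetical helper, not the notebook's code; it assumes a user profile of k unit cluster vectors plus mixture weights:

```python
import numpy as np

def score_user(query_vec, profile, weights, mode="softmax", temp=0.1):
    """Combine per-cluster cosine similarities into a single user score.

    profile: (k, dim) unit cluster vectors; weights: (k,) mixture weights.
    """
    q = query_vec / np.linalg.norm(query_vec)
    sims = profile @ q  # cosine similarity of the query to each cluster
    if mode == "max":
        return float(sims.max())
    if mode == "weighted":
        return float(weights @ sims)
    # softmax attention: the query attends most to its closest clusters
    attn = np.exp(sims / temp) * weights
    attn /= attn.sum()
    return float(attn @ sims)
```

Max rewards any single matching interest, the weighted average favours the user's dominant interests, and softmax attention sits in between, with the temperature controlling how sharply it focuses on the best-matching cluster.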

https://github.com/w-winnie/livnlearn/blob/main/embeddings_combination_alternative1_livnlearnversion.ipynb


Sorry for the rendering problem above – I still have to figure out how to fix Plotly frames in the WordPress HTML block – but I'm attaching them as an iframe here:

Does it solve all the problems mentioned before? Well, not exactly – but it does give another way to look at the same problem, and with the secret sauce of an appropriate method for combining similarity scores for a user, we might have found the perfect hack for our little problem. Of course, that doesn't mean I didn't look into the User LLM and the custom encoder Google suggested, which creates user embeddings by passing a user's interaction sequence through a self-supervised next-token-predictor transformer – it's basically attention that combines the embeddings for them. That one is a bit on the back burner for me right now, but I'll try to make it the next post in this series at least.
