Visualising RAG data
You will need to set up Jupyter Notebook to run the code below. See Jupyter Notebook setup.
The example code below assumes you already have your documents loaded into the bionicGPT database.
We will create a scatter plot of the document chunks and highlight the 4 chunks most 'relevant' to the query you specify.
At the end we should have a diagram like this:
![Scatter example Alt text](../vis-rag-scatter-example.png)
Setup and Retrieve Query Embeddings
!pip install -q sqlalchemy psycopg2-binary pandas
!pip install -q matplotlib seaborn scikit-learn
import sqlalchemy
import pandas as pd
import requests
from sqlalchemy import text
import json
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
import seaborn as sns
import matplotlib.pyplot as plt
url = "http://embeddings-api:80/embed"
data = {"inputs": "how much money did the bank of england hold at end of 2022?"}
headers = {"Content-Type": "application/json"}
response = requests.post(url, json=data, headers=headers)
engine = sqlalchemy.create_engine('postgresql://postgres:testpassword@postgres:5432/bionic-gpt')
conn = engine.connect()
![Setup Alt text](../vis-rag-1.png)
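The response body from the embeddings API is used further down as a raw string. As a minimal sketch of an alternative, the body can be parsed into a list of floats first, which makes it easy to check the vector before it reaches the database. This assumes the endpoint returns a JSON array with one vector per input (for example `[[0.1, 0.2, ...]]`); the helper names `parse_embedding` and `to_vector_literal` are hypothetical and not part of bionicGPT.

```python
import json


def parse_embedding(response_text: str) -> list[float]:
    """Return the first embedding vector from an /embed response body.

    Assumes the body is a JSON array of vectors, one per input string.
    """
    payload = json.loads(response_text)
    if (
        not isinstance(payload, list)
        or not payload
        or not isinstance(payload[0], list)
    ):
        raise ValueError("expected a JSON array of embedding vectors")
    return [float(v) for v in payload[0]]


def to_vector_literal(vector: list[float]) -> str:
    """Format a vector as the '[v1,v2,...]' text form used in the SQL query."""
    return "[" + ",".join(repr(v) for v in vector) + "]"


# Stand-in for response.text so the sketch runs without the embeddings API.
sample_body = "[[0.25, -0.5, 1.0]]"
vector = parse_embedding(sample_body)
literal = to_vector_literal(vector)
print(len(vector), literal)
```

In the notebook, `parse_embedding(response.text)` would take the place of the sample string, and `len(vector)` gives the embedding dimension.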
Retrieve Document Data from Database
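We will now use the query embedding from above to retrieve all document chunks from the database, ordered by 'similarity' to the query.
# response.text is assumed to look like [[...]]; the [1:-1] slice in the query strips the outer brackets
query_embedding = response.text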
sql = text(f"""SELECT document_id, file_name, text, embeddings FROM chunks, documents
where documents.id = document_id
and embeddings is not null
ORDER BY embeddings <-> '{query_embedding[1:-1]}'""")
df = pd.read_sql(sql,conn)
df
![Retrieve chunks Alt text](../vis-rag-ret-chunks.png)
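The ordering in the query above comes from the `<->` operator. Assuming the `embeddings` column uses the pgvector `vector` type, `<->` is Euclidean (L2) distance, so the first rows are the chunks nearest to the query. The sketch below reproduces that ordering in plain Python on made-up 2-dimensional vectors; `rank_by_l2` is a hypothetical helper for illustration, not something the database or bionicGPT provides.

```python
import math


def rank_by_l2(
    query: list[float], chunks: dict[str, list[float]]
) -> list[tuple[str, float]]:
    """Order chunk ids by Euclidean distance to the query, nearest first."""
    distances = [
        (chunk_id, math.dist(query, vec)) for chunk_id, vec in chunks.items()
    ]
    return sorted(distances, key=lambda pair: pair[1])


# Toy 2-dimensional vectors; real embeddings have many more dimensions.
query = [0.0, 0.0]
chunks = {
    "chunk_a": [3.0, 4.0],
    "chunk_b": [1.0, 0.0],
    "chunk_c": [0.0, 2.0],
}
ranking = rank_by_l2(query, chunks)
for chunk_id, distance in ranking:
    print(f"{chunk_id}: {distance:.2f}")
```

A smaller distance means a more 'relevant' chunk, which is why the DataFrame rows can be read from top to bottom as most to least similar.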
Convert Data Retrieved into 2 Dimensional Data
df['embeddings_vec'] = df['embeddings'].apply(lambda x: [float(y) for y in json.loads(x)])
embeddings_list = df['embeddings_vec'].tolist()
df
x = np.array(df['embeddings_vec'].to_list(), dtype=np.float32)
# 1000 iterations is the default; the keyword is n_iter before scikit-learn 1.5 and max_iter from 1.5 onwards
tsne = TSNE(random_state=0)
tsne_results = tsne.fit_transform(x)
df_tsne = pd.DataFrame(tsne_results, columns=['x','y'])
df_tsne['doc'] = df['file_name']
df_tsne
![Data Conversion Alt text](../vis-rag-convert.png)
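One thing to watch when reducing the embeddings to 2 dimensions: scikit-learn's `TSNE` defaults to a perplexity of 30, and recent releases raise an error when the perplexity is not smaller than the number of samples. A database with only a handful of chunks may therefore fail at `fit_transform`. The sketch below shows one possible way to pick a safe value; `choose_perplexity` is a hypothetical helper, and capping at `n_samples - 1` is an assumption made for illustration rather than a recommended setting.

```python
def choose_perplexity(n_samples: int, preferred: float = 30.0) -> float:
    """Return a perplexity strictly below n_samples, capped at `preferred`."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least 2 samples")
    return min(preferred, float(n_samples - 1))


# Show the chosen value for a few dataset sizes.
for n in (5, 31, 500):
    print(n, choose_perplexity(n))

# Hypothetical use with the array built in this section:
# tsne = TSNE(random_state=0, perplexity=choose_perplexity(len(x)))
```

With a few hundred chunks or more the default is left unchanged, so the helper only matters for small test datasets.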
Plot Results
Different colours refer to the different documents uploaded.
The 4 circles in blue highlight the 4 most 'relevant' chunks based on the query used above.
fig, ax = plt.subplots(figsize=(10,8))
sns.set_style('darkgrid', {"grid.color": ".6", "grid.linestyle": ":"})
sns.scatterplot(data=df_tsne, x='x', y='y', hue='doc', palette='hls')
sns.scatterplot(data=df_tsne.head(4), x='x', y='y', marker=r"$\circ$", ec="face", hue='doc', s=30)
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
plt.title('Scatter plot of document chunks in bionic database');
plt.xlabel('x');
plt.ylabel('y');
plt.axis('equal')
![Data Conversion Alt text](../vis-rag-plot.png)
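As a text complement to the plot, it can help to count which documents the 4 highlighted chunks come from. The sketch below assumes the file names are still in the similarity order returned by the query; `top_k_by_document` is a hypothetical helper and the file names are invented for illustration.

```python
from collections import Counter


def top_k_by_document(file_names: list[str], k: int = 4) -> Counter:
    """Count how many of the first k (most similar) chunks belong to each document."""
    return Counter(file_names[:k])


# Invented file names, ordered from most to least similar chunk.
file_names = [
    "annual-report.pdf",
    "annual-report.pdf",
    "faq.pdf",
    "annual-report.pdf",
    "faq.pdf",
    "policy.pdf",
]
counts = top_k_by_document(file_names)
for name, n in counts.most_common():
    print(f"{name}: {n}")

# Hypothetical use with the notebook's DataFrame:
# top_k_by_document(df['file_name'].tolist())
```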