My dive into Vector databases

Toying with technology - there's a distinct thrill to it, isn't there? Particularly when you get to muck about with VectorDBs, a piece of tech that's as fascinating as it is perplexing. Imagine my surprise when the same tech popped up in Langchain. Two birds, one stone, I thought; a chance to experiment and explore.

Vector databases - a concept so astounding, it's repeatedly left me slack-jawed and mind blown. I had planned to pen down my findings earlier, but there was a gnawing sense of an incomplete understanding. In a twist of fate, while my mind was embroiled in other matters, the missing piece emerged from my subconscious, as if out of thin air or by using the Force. Consequently, I find myself returning to the subject, armed with newfound insight and ready to share.

Note: I am still learning about vector databases myself, and my dataset is tiny, so I have no experience scaling one up beyond messing around with it. If you came for more advanced material, I am sorry, I am not at that level yet.

So let's do a basic breakdown (based on my understanding):

Vector databases, or VectorDBs as they're fondly abbreviated, have been a nice breakthrough in the realm of database technology. Articles about them date back to around 2021, but they did not really leave an impact until recently. Operating in high-dimensional vector spaces, these databases excel at dealing with complex, unstructured data types: think images, audio, and even text.

Thanks to the rise of large language models, they've become the new rising star of the tech world, and for a reason. Their performance and versatility give them an edge in tasks such as similarity search, which they achieve with remarkable efficiency.
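Under the hood, a similarity search boils down to comparing embedding vectors, and cosine similarity is a common choice of measure. A minimal sketch, purely illustrative and not any library's actual implementation:

```typescript
// Cosine similarity: the core comparison behind a vector similarity
// search. Vectors pointing the same way score near 1, unrelated
// (orthogonal) vectors score near 0.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

The database's real trick is doing this comparison fast across huge numbers of stored vectors.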

While we're at it, let's note that Langchain isn't just slapping these VectorDBs in for the sake of it. Embeddings power the similarity search, and every stored record can carry metadata, enhancing usability and information depth.

So there you have it, a rudimentary introduction to the world of VectorDBs, a world we'll soon dive into headfirst. If you're new to this, prepare to have your mind blown, just as mine was.

I will break this up into "Vector store only" and "Vector store with a retriever".

Like probably everyone who played with Langchain and read the docs, I was curious about this 'Vector database' thing and saw they had an in-memory version. Perfect, I thought, a good way to get my toes wet. So I set up a basic dataset. At this point I had no problem I wanted to solve; I just wanted to see what would happen.

Vector store only

const vectorStore = await MemoryVectorStore.fromTexts(
  [
    "You favorite swimsuit is a neon green onepiece",
    "You have green eyes",
    "You are tall",
    "You like to eat pizza",
    "You like rock music",
    "新宿駅から東京駅まで山手線に乗れます",
    "新橋駅からお台場駅までゆりかもめに乗れます",
    "京都はかっこいいです",
    "これは関係ない日本語です",
    "東京都でタクシーに乗ってどこでも行けます",
  ],
  [
    { id: 2, type: "swimsuit" },
    { id: 1 },
    { id: 3 },
    { id: 4 },
    { id: 5 },
    { id: 6 },
    { id: 7 },
    { id: 8 },
    { id: 9 },
  ],
  new OpenAIEmbeddings()
);

Initially I did not realize it, but the second array can be used for the metadata, additional data, or tagging you wish to provide. (Note that I only supplied nine metadata entries for ten texts, which is why one record shows empty metadata in the results below.) I used a mix of English and Japanese to get a feel for things. So now that we have our input, what did I use it for? Let's see.

Question 1: Bikini
Result 1: [
  Document {
    pageContent: 'You favorite swimsuit is a neon green onepiece',
    metadata: { id: 2, type: 'swimsuit' }
  }
]
Question 2: What is my favorite color?
Result 2: [
  Document {
    pageContent: 'You favorite swimsuit is a neon green onepiece',
    metadata: { id: 2, type: 'swimsuit' }
  },
  Document { pageContent: 'You have green eyes', metadata: { id: 1 } },
  Document {
    pageContent: 'You like to eat pizza',
    metadata: { id: 4 }
  },
  Document { pageContent: 'You like rock music', metadata: { id: 5 } }
]
Question 3: Do you like R & B?
Result 3: [
  Document { pageContent: 'You like rock music', metadata: { id: 5 } }
]
Question 4: どうやってお台場に行きますか
Result 4: [
  Document {
    pageContent: '新橋駅からお台場駅までゆりかもめに乗れます',
    metadata: { id: 7 }
  }
]
Question 5: どうやって新宿に行けますか
Result 5: [ Document { pageContent: '新宿駅から東京駅まで山手線に乗れます', metadata: { id: 6 } } ]
Question 6: どうやって江戸川に行けますか
Result 6: [
  Document {
    pageContent: '新橋駅からお台場駅までゆりかもめに乗れます',
    metadata: { id: 7 }
  },
  Document { pageContent: '東京都でタクシーに乗ってどこでも行けます', metadata: {} }
]
Question 7: How do I get to Shinjuku?
Result 7: [ Document { pageContent: '新宿駅から東京駅まで山手線に乗れます', metadata: { id: 6 } } ]
Question 8: Tell me about the Tokyo illuminations
Result 8: [ Document { pageContent: '京都はかっこいいです', metadata: { id: 8 } } ]
Question 9: どうやってガンダム像を見に行きますか
Result 9: [
  Document {
    pageContent: '新橋駅からお台場駅までゆりかもめに乗れます',
    metadata: { id: 7 }
  },
  Document { pageContent: '東京都でタクシーに乗ってどこでも行けます', metadata: {} }
]
Question 10: How do I get to the Gundam statue?
Result 10: [
  Document {
    pageContent: '新橋駅からお台場駅までゆりかもめに乗れます',
    metadata: { id: 7 }
  },
  Document { pageContent: '新宿駅から東京駅まで山手線に乗れます', metadata: { id: 6 } }
]
Question 11: Hamburger
Result 11: [
  Document {
    pageContent: 'You like to eat pizza',
    metadata: { id: 4 }
  }
]
Question 12: お寿司
Result 12: [ Document { pageContent: '京都はかっこいいです', metadata: { id: 8 } } ]
Question 13: どうやって清水寺に行きますか?
Result 13: [
  Document {
    pageContent: '新橋駅からお台場駅までゆりかもめに乗れます',
    metadata: { id: 7 }
  },
  Document { pageContent: '新宿駅から東京駅まで山手線に乗れます', metadata: { id: 6 } },
  Document { pageContent: '東京都でタクシーに乗ってどこでも行けます', metadata: {} }
]

The above uses just the in-memory vector database with embeddings. My mind was blown by how accurate it got; let's go over some of it.
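Conceptually, each question is embedded and scored against every stored vector, and the best matches come back. Here is a toy stand-in for that ranking step, using fabricated 2-dimensional vectors rather than real OpenAI embeddings (which have 1,536 dimensions for ada-002); the types and function names are my own, not Langchain's:

```typescript
// Toy top-k retrieval over pre-computed vectors, mimicking what an
// in-memory vector store does internally for a similarity search.
type Doc = { pageContent: string; metadata: Record<string, number>; vector: number[] };

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, v, i) => s + v * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

function topK(docs: Doc[], query: number[], k: number): Doc[] {
  // Score every stored vector against the query, highest first.
  return [...docs]
    .sort((x, y) => cosine(y.vector, query) - cosine(x.vector, query))
    .slice(0, k);
}

const docs: Doc[] = [
  { pageContent: "You like rock music", metadata: { id: 5 }, vector: [0.9, 0.1] },
  { pageContent: "You have green eyes", metadata: { id: 1 }, vector: [0.1, 0.9] },
];

// A "music-like" query vector lands closest to the rock music record.
const hits = topK(docs, [0.8, 0.2], 1);
```

Every query in the run above is this same ranking, just in a space expressive enough that 'bikini' lands near 'swimsuit'.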

On the 'bikini' inquiry, the system consistently zeroed in on the swimsuit record as the prime answer; it connected the dots between a one-piece and a bikini, without even needing the 'swimsuit' metadata I added later.

The colour-related question, as expected, brought up answers related to colour, even though it was asked in a strange manner. It also figured out that R & B is closest to music, which matched my rock music record.

The Japanese records are basically 'how do I get from point A to point B' around Tokyo. It's almost uncanny how it matched the closest train lines to those areas; it absolutely blew my mind. Witness the Edogawa query: it advised using the Yurikamome line or, as an alternative, a taxi. However, I am unsure if that is the correct line, or if it assumed I wanted to get to Shinbashi. To Shinjuku? The correct train line again, demonstrating its elementary grasp of translation.

What I found interesting about the Gundam statue question: both in English and in Japanese it suggests I use the Yurikamome, but in Japanese its second suggestion is a taxi, while in English its second suggestion is the Yamanote line. This is excellent, but could it be a fluke? I've run this multiple times and it usually comes up with these answers.

Where it did fail: for any 'how do I get to X' question it would suggest the routes I provided in Tokyo, but it fell apart when I wanted to go see something in Kyoto, suggesting a taxi only as its third answer. I was hoping it would just tell me Kyoto is cool, as that is the record closest to Kyoto in the limited dataset. But this was a gotcha I put in willingly to see what it can and can't do.

Another successful gotcha was the record 「これは関係ない日本語です」 ('this is unrelated Japanese'), a decoy I set as a booby trap to ensure the system would never choose it as an answer in either English or Japanese. And indeed, it didn't.

So there we have it, a display of embeddings in action. However, remember this adventure was conducted with a rather sparse dataset. It's anyone's guess how the outcomes would evolve with a larger vector store to play with.

Vector store with a retriever

Ever been left hanging by a technological conundrum? I have. I was knee-deep in VectorDBs and a persistent issue was keeping me up at night. It was the matter of sorting results - a seemingly trivial matter that proved elusive. For weeks, I put it on the backburner, till a new revelation came crashing through the doors - Langchain 0.70. I read the docs and I noticed something, and with it, a whole new avenue opened up: the use of VectorDBs with the Retriever.

But, what's this Retriever, you ask? Well, it's a handy little interface that ties together documents with language models. Picture it as a bridge connecting two islands. The Retriever's function is straightforward. It hosts a get_relevant_documents method that takes a string query and spits out a list of pertinent documents.
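As a toy illustration of that contract: a string query goes in, a list of relevant documents comes out. The keyword matching below is purely a stand-in of my own (real retrievers embed the query, and LangChain JS spells the method getRelevantDocuments and makes it async):

```typescript
// A minimal retriever-shaped object: query in, documents out.
// The keyword match is a placeholder for real embedding search.
type Doc = { pageContent: string; metadata: { id: number } };

class KeywordRetriever {
  docs: Doc[];

  constructor(docs: Doc[]) {
    this.docs = docs;
  }

  getRelevantDocuments(query: string): Doc[] {
    // Return every document whose content mentions the query term.
    return this.docs.filter((d) => d.pageContent.includes(query));
  }
}

const retriever = new KeywordRetriever([
  { pageContent: "You like rock music", metadata: { id: 5 } },
  { pageContent: "You have green eyes", metadata: { id: 1 } },
]);
```

The point is only the shape of the interface: everything downstream (chains, memory) can stay ignorant of how relevance is actually computed.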

The arrival of Langchain update at the time was like a sudden shaft of sunlight piercing through dark clouds. It brought an example of using VectorDBs in tandem with a Retriever - a development that set my mind racing with possibilities.

So, I dived back into my previous Vector store, launched the Retriever over it and instructed it to retrieve the four most relevant answers. With an air of defiance, I then added a few more records in the most uninteresting manner. This gave me access to my initial Vector store and the new additions, but the old problem remained: how to handle contradictory truths?

Let's take a simple example. "Person A lives in X," and then "Person A lives in Y." A clear contradiction, yet a common occurrence in data sets. It's this challenge that I set out to tackle next.
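One illustrative way to attack this, assuming each record carries a time stamp in its metadata as mine do, is to prefer the most recent record for a given fact type. This is my own sketch, not something Langchain does for you:

```typescript
// Resolving contradictory facts by treating each record's `time`
// metadata as a version stamp: the most recent record wins.
type Fact = { output: string; type: string; time: string };

function latestFact(facts: Fact[], type: string): Fact | undefined {
  return facts
    .filter((f) => f.type === type)
    // ISO 8601 date strings sort correctly as plain strings.
    .sort((a, b) => b.time.localeCompare(a.time))[0];
}

const residences: Fact[] = [
  { output: "Lucy lived in Arasaka's secret laboratory.", type: "Lucy's residence", time: "2066-05-25" },
  { output: "Lucy lives in a small apartment in Night City.", type: "Lucy's residence", time: "2070-05-25" },
];

const current = latestFact(residences, "Lucy's residence");
```

This keeps both truths available (useful for past-tense questions) while giving a deterministic answer for "where does she live now".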

Warning: The dataset and breakdown might contain spoilers for Cyberpunk Edgerunners and Star Wars

A puzzle, that's what this was. A puzzling question gnawing at the back of my mind. I harked back to a test I'd once heard for Alexa: “Hey Alexa, what colour is Luke Skywalker’s lightsaber?” Quite the poser, considering the potential for it to be either blue or green. As I mulled over this, I realised the essence of my problem mirrored this very scenario.

Frustrated and stumped, I turned to the infinite wisdom of the internet for a quick fix. I didn't really get a quick answer to my question. So I thought this problem required a hands-on approach.

A dataset was my next port of call, one carefully designed with precise inputs, outputs, and detailed metadata. Following this, I threw a series of questions at a GPT-3.5-backed vector store with retriever, curious to see how it would respond.

Impressed but not satisfied, I decided to test my 'Person A lives in X' conundrum. This required a slight detour into the world of anime - specifically "Cyberpunk Edgerunners," my go-to playground for tech experiments. I injected some related data into my set, sticking to the specifics I knew, such as Lucy being 20 in 2076. As for when she moved to Night City, I left it ambiguous - not out of laziness, mind you, but to challenge the system's reliance on either my vector store or GPT’s logic.

So here we are, standing on the precipice of discovery. Brace yourself as we delve into the results I gathered, employing the retriever method and putting both vector store and GPT to the test. Stay tuned; things are about to get interesting.

For our input

await memory.saveContext(
  { input: "Luke Skywalker's lightsaber color in 'A New Hope'." },
  {
    output: `In 'A New Hope', Luke Skywalker's lightsaber was blue. This was Anakin Skywalker's original lightsaber.`,
    type: "lightsaber_color",
    time: "1977-05-25" // The release date of 'A New Hope'
  }
);

await memory.saveContext(
  { input: "Luke Skywalker's lightsaber color in 'Return of the Jedi'." },
  {
    output: `In 'Return of the Jedi', Luke Skywalker had a green lightsaber. He crafted this one himself after losing his first one.`,
    type: "lightsaber_color",
    time: "1983-05-25" // The release date of 'Return of the Jedi'
  }
);

await memory.saveContext(
  { input: "Where does Lucy live?" },
  {
    output: `Lucy was abducted by Arasaka and lived in their secret laboratory, trained to be a net runner.`,
    type: "Lucy's residence",
    time: "2066-05-25" // Lucy is 20 in 2076
  }
);

await memory.saveContext(
  { input: "Where does Lucy live?" },
  {
    output: `Lucy moved to Night City and lives in a small apartment in the city.`,
    type: "Lucy's residence",
    time: "2070-05-25" // Lucy is is 20 in 2076, make an assumption that she moved to Night City at 14
  }
);

I will focus only on the data above. Here is what generally comes out; some questions show two answers to demonstrate the difference between runs, but only at places I deemed less accurate.

What colour is luke's lightsaber?
In 'Return of the Jedi', Luke Skywalker had a green lightsaber. He crafted this one himself after losing his first one. In 'A New Hope', his lightsaber was blue, which was the original lightsaber of his father, Anakin Skywalker.

Where does Lucy live?
Lucy currently resides in Night City, in a small apartment. However, in 2066 she was being held captive in Arasaka's secret laboratory.
Lucy moved to Night City and lives in a small apartment in the city.

Where did Lucy live?
Lucy moved to Night City and now lives in a small apartment in the city. But she was previously held captive in Arasaka's secret laboratory in 2066, trained to be a net runner
As per my memory, Lucy moved to Night City and lives in a small apartment in the city.*

What was Lucy trained to do?
Lucy was trained as a net runner while being held captive in Arasaka's secret laboratory.
As per my memory, Lucy was abducted by Arasaka and trained to be a net runner in one of their secret laboratories. She was taught to hack into computer systems and manipulate virtual environments to carry out various tasks.

Does Lucy have a lightsaber?
I'm sorry, I don't have any information on Lucy owning a lightsaber.

If Lucy is 20 years old in 2076 how old is Lucy when she moved to night city?
As per my memory, Lucy moved to Night City in 2066. Assuming she was born in 2056, she would have been 10 years old when she moved to Night City.

Now, let's dissect the results. It's all quite straightforward, isn't it? Yet, if your eyes darted to the final question regarding Lucy's age and time, you'd have spotted a glaring error. A numerical catastrophe that would have any self-respecting mathematician up in arms.
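The arithmetic GPT-3.5 fumbled is trivial to check by hand:

```typescript
// If Lucy is 20 in 2076, she was born in 2056, so moving in 2070
// makes her 14 at the time, not the 10 that GPT-3.5 came up with.
const referenceYear = 2076;
const ageAtReference = 20;
const birthYear = referenceYear - ageAtReference;

const moveYear = 2070;
const ageAtMove = moveYear - birthYear;
```

The model's mistake was effectively assuming she was born in 2056 and then subtracting from the wrong year.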

I'd also like to point out that I generally get a good result between runs. There was the answer marked * which failed to acknowledge that the question was in the past tense, but this was not a common answer across the multiple runs I did. Sometimes asking "Where does Lucy live" gives the previous place of residence; I could probably add settings to make it stricter.

However, I'm not one to be daunted by such trivialities. The solution was as swift as it was effective: a switch to GPT-4. Following the transition, the results were not just better; they were immaculate. It breezed through the calculation as though it were child's play. Ah, the marvels of robust data preparation for vector stores, revealing a tantalising glimpse of accurate answers. But when faced with the formidable prowess of GPT-4, it's clear who reigns supreme. It doesn't just set the bar; it is the bar.

What colour is luke's lightsaber?
In 'A New Hope,' Luke's lightsaber was blue, and in 'Return of the Jedi,' it was green.

Where does Lucy live?
Lucy lives in a small apartment in Night City now, but she was once abducted by Arasaka and lived in their secret laboratory, trained to be a netrunner.

Where did Lucy live?
Lucy lived in a secret Arasaka laboratory when she was abducted and trained as a netrunner, and now she lives in a small apartment in Night City.

What was Lucy trained to do?
Lucy was trained to be a netrunner by Arasaka when she was abducted and lived in their secret laboratory.

Does Lucy have a lightsaber?
No, Lucy doesn't have a lightsaber. She's a netrunner, not a Jedi from a fictional universe.

If Lucy is 20 years old in 2076 how old is Lucy when she moved to night city?
Hmm, let's try some basic math, shall we? If Lucy was 20 years old in 2076 and she moved to Night City on 2070-05-25, then she would be 14 years old when she moved.

Succinct, flawless. However, there is a certain charm to running it with GPT-3.5, where it tried to explain certain things, albeit making the answers really long; for learning, that could be better. Also note there is no instruction difference between the two; the only change was switching from GPT-3.5 to GPT-4.

Now, for those of you attentive souls or avid fans who've managed to stick through my rant, you might have noticed something intriguing. A little nod to the galaxy far, far away in the dates scattered across the data. A not-so-subtle 'Happy Birthday' to our beloved Star Wars.

In a strange twist of fate, it's almost as if the Force itself has been at work today, sparking the very inspiration that solved my VectorDB conundrum. How fitting that on the anniversary of Star Wars, we've embarked on a journey into the uncharted territories of technology, echoing the pioneering spirit of that galactic adventure.

So, there we have it. A journey into the realm of VectorDBs and GPT engines, filled with perplexing puzzles and intriguing insights. May this serve as a reminder - there's always room for exploration, questioning, and a healthy dose of trial and error. Especially when you're toying with technology.