Simple chatbot using custom documents with langchain
With the ChatGPT and LLM craze in full swing, I set out to build a simple question answering bot. The flavor I was interested in was a chatbot that can answer questions about a custom dataset I point it to. Additionally, I wanted to try langchain - it turned out to be a great idea. Langchain is fantastic for exploratory and prototype work, and I highly recommend it: it makes connecting different components easy. In my example, I needed:
- A dataset composed of text files with information I want to query
- A way to narrow down the context the LLM uses to answer the question
- A way to feed the LLM the context and the query, and get back an answer
Langchain provides all the necessary building blocks for this. I’ll go through the steps I took to build a simple question answering bot.
Dataset
Because I wanted to simulate answering questions about a custom dataset, I needed to create one - I didn’t want knowledge baked into the LLM to bleed into the answers.
I ended up asking ChatGPT to generate a bunch of text files about an imaginary fantasy world called Eryndor. I broke the content into three files: factions, lands, and people. This simulates a knowledge base spread across multiple files.
Code
The code is surprisingly short; you can see it here. I’ll go over the important parts below.
1. Load the dataset
The dataset is broken up into directories, and you choose a directory as the context. Once it’s chosen, I use a DirectoryLoader with a TextLoader and a CharacterTextSplitter to break the documents into small chunks.
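The loading step looks roughly like this (a minimal sketch; the directory path, glob pattern, and chunk size are illustrative placeholders, not the exact values from my code):

```python
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter

# Load every text file in the chosen context directory
# ("data/eryndor" is a hypothetical path).
loader = DirectoryLoader("data/eryndor", glob="*.txt", loader_cls=TextLoader)
documents = loader.load()

# Split the documents into small chunks so that only the
# relevant pieces need to be passed to the LLM later.
splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = splitter.split_documents(documents)
```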
2. Initialize embedding storage
This is the “secret sauce” - because the dataset can be too large to feed into the LLM in one go, we need to store it in a way that allows us to quickly retrieve a relevant subset of it. Embeddings represent text in vector form and, as a result, let us quickly retrieve documents similar to the query.
Additionally, I wanted to use an external vector store. I found out langchain supports Redis (and I like Redis!), so I went with it.
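In langchain this boils down to a few lines - roughly the following (a sketch; I’m using OpenAI embeddings here, and the Redis URL and index name are placeholders):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores.redis import Redis

# Embed the chunks and store the vectors in Redis; similar
# documents can later be retrieved with a vector similarity search.
embeddings = OpenAIEmbeddings()
vectorstore = Redis.from_documents(
    docs,
    embeddings,
    redis_url="redis://localhost:6379",  # placeholder - point at your Redis
    index_name="eryndor",                # hypothetical index name
)
```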
3. Initialize the LLM model
My goal was to have it work with GPT4All, but it was very slow and its embeddings did not work correctly. I ended up using the OpenAI APIs, which worked much better.
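Through langchain, initializing it is essentially a one-liner (assuming the OPENAI_API_KEY environment variable is set; the temperature value is my choice here, not a requirement):

```python
from langchain.llms import OpenAI

# Uses the OPENAI_API_KEY environment variable;
# temperature=0 keeps the answers deterministic.
llm = OpenAI(temperature=0)
```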
4. Query the bot
And finally, I created a Q/A loop that lets you ask questions and get answers.
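A minimal loop could look like the following (a sketch building on the objects from the previous steps; the “stuff” chain type and the exit condition are my assumptions):

```python
from langchain.chains import RetrievalQA

# Wire the LLM and the Redis-backed retriever into a Q/A chain.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # stuff the retrieved chunks directly into the prompt
    retriever=vectorstore.as_retriever(),
)

# Keep asking until an empty question is entered.
while True:
    query = input("Question: ")
    if not query:
        break
    print(qa.run(query))
```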
Result
First off, I wanted to see what happens when I load an empty dataset and ask a few questions:
Then, I load the dataset and ask the same questions:
The wrap-up
I’m very happy with the results. I’m still learning how to navigate the LLM ecosystem, and yet I was able to build what I set out to build - a simple question answering bot over custom data - in a few hours. Obviously, it’s a very simple example, but it shows how you can use langchain to quickly build a prototype to experiment and learn.