Marcin Bunsch

Code Mechanic at Synthesis

Simple chat bot using custom documents with langchain

With the ChatGPT and LLM craze in full swing, I set out to build a simple question answering bot - specifically, a chatbot that can answer questions about a custom dataset I point it to. I also wanted to try langchain, which turned out to be a great idea. Langchain is fantastic for exploratory and prototype work, and I highly recommend it - it makes it easy to connect different solutions. In my example, I needed:

  1. A dataset composed of text files with information I want to query
  2. A way to narrow down the context for the LLM model to answer the question
  3. A way to feed the LLM model with the context and query and get the answer

Langchain provides all the necessary building blocks for this. I'll go through the steps I took to build a simple question answering bot.

Dataset

Because I want to simulate answering questions about a custom dataset, I needed to create one - I didn’t want the knowledge within the LLM to bleed into the answers.

I ended up asking ChatGPT to generate a bunch of text files about an imaginary fantasy world called Eryndor. I broke it into 3 files: factions, lands and people. This simulates a knowledge base spread across multiple files.

Code

The code is surprisingly short - you can see it here.

I’ll go over the important parts here.

1. Load the dataset

The dataset is broken up into directories, and you choose the directory as the context. When it’s chosen, I use a DirectoryLoader with TextLoader and a CharacterTextSplitter to break it up into small chunks.

# load all documents from folder

loader = DirectoryLoader(index_name, glob="**/*.txt", loader_cls=TextLoader)
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1024, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
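To illustrate what the splitter does conceptually, here's a minimal pure-Python sketch - not langchain's actual CharacterTextSplitter (which splits on a separator like "\n\n"), just the basic idea of cutting text into bounded, optionally overlapping chunks:

```python
# Illustrative sketch of fixed-size chunking. Real splitters prefer to
# break on separators so chunks don't cut words or sentences in half.
def split_text(text: str, chunk_size: int = 1024, chunk_overlap: int = 0) -> list[str]:
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("a" * 2500, chunk_size=1024)
print(len(chunks))  # 3 chunks: 1024 + 1024 + 452 characters
```

The chunk size matters: chunks need to be small enough that a few of them fit into the LLM's context window, but large enough to carry useful information.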

2. Initialize embedding storage

This is the “secret sauce” - because the dataset can be too large to feed into the LLM, we need to store it in a way that allows us to quickly retrieve a subset of it. Embeddings represent text in vector form, and as a result let us quickly retrieve documents similar to the query.
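The retrieval idea can be sketched without a real embedding model: turn each document into a vector, embed the query the same way, and return the document with the highest cosine similarity. The toy bag-of-words `embed` function below is my own stand-in for OpenAI's embeddings, which produce much richer vectors:

```python
import math

# Toy "embedding": a bag-of-words count vector over a tiny fixed vocabulary.
# A stand-in for a real embedding model - the retrieval logic is the same.
VOCAB = ["dwarf", "king", "orc", "marsh", "necromancer"]

def embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "the dwarf king rules the mountain",
    "the necromancer queen leads the undead of the marsh",
]
query = "who is the king of the dwarf clans"
best = max(docs, key=lambda d: cosine(embed(d), embed(query)))
print(best)  # the dwarf document scores highest
```

A vector store does exactly this lookup, just at scale and with proper indexing.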

Additionally, I wanted to use an external vector store. I found out I could use Redis (and I like Redis!), so I went with it.

from langchain.vectorstores.redis import Redis
from langchain.embeddings import OpenAIEmbeddings

# This is an OpenAI API client that allows us to get embeddings for a given text

embeddings = OpenAIEmbeddings()

# Feed the Redis vector store with the documents and embeddings

rds = Redis.from_documents(docs, embeddings, redis_url=redis_url, index_name=index_name)

3. Initialize the LLM model

My goal was to have it work with GPT4All, but it was very slow and embeddings did not work correctly. I ended up using the OpenAI APIs, which worked much better.

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

llm = OpenAI()

retriever = rds.as_retriever(search_kwargs={"k": 1})
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
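The "stuff" chain type simply stuffs the retrieved documents into the prompt alongside the question - with `k=1` above, only the single most similar chunk goes in. A rough sketch of the kind of prompt it builds (the template wording here is my own, not langchain's exact prompt):

```python
def build_stuff_prompt(docs: list[str], question: str) -> str:
    # "Stuff" strategy: concatenate all retrieved chunks into one context
    # block, then append the question. The instruction to admit ignorance
    # is why the empty-dataset run below answers "I don't know."
    context = "\n\n".join(docs)
    return (
        "Use the following context to answer the question. "
        "If you don't know the answer, say \"I don't know.\"\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_stuff_prompt(
    ["King Drogan Stonehammer rules the dwarves."],
    "Who is the leader of the dwarves?",
)
print(prompt)
```

Other chain types (map_reduce, refine) handle cases where the retrieved documents don't all fit in one prompt, but "stuff" is the simplest and works fine here.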

4. Query the bot

And finally, create a Q/A loop that allows you to ask questions and get answers.

import sys

print("Ready to answer questions!")
print("> ", end="", flush=True)

for line in sys.stdin:
    question = line.strip()
    result = qa.run(question)
    print("<", result.strip())
    print("> ", end="", flush=True)

Result

First off, I wanted to see what happens when I load an empty dataset and ask the question:

$ REBUILD_INDEX=true python3 run.py nothing
Rebuilding index
Ready to answer questions!
> who is the leader of the dwarves?
< I don't know.
> Who is the enemy of humans?
< I don't know.
> Are there necromancers?
< I don't know.

Then, I load the dataset and ask the same questions:

$ REBUILD_INDEX=true python3 run.py eryndor
Rebuilding index
Ready to answer questions!
> who is the leader of the dwarves?
< The leader of the dwarves is King Drogan Stonehammer.
> Who is the enemy of humans?
< The Orcs of the Savage Wastes.
> Are there necromancers?
< Yes, there are necromancers. They are part of the Undead of the Dark Marshes and are led by their necromancer queen, Sylvanas Windrunner.

The wrap-up

I’m very happy with the results. I’m still learning how to navigate the LLM ecosystem and I was able to build what I set out to do - a simple, custom data question answering bot - in a few hours. Obviously, it’s a very simple example, but it shows how you can use langchain to build a prototype quickly to experiment and learn.