This is a smart chatbot that extracts and queries text from documents, including the images embedded within them. In this blog, I will guide you through building this advanced OCR-based chatbot using Vertex AI, LangChain, Gradio, ChromaDB, Google Cloud Storage (GCS), and multithreading, resulting in a powerful system that manages and queries files efficiently.
The chatbot uses ChromaDB collections to handle both new and previously uploaded files together without reprocessing the older ones, and it provides source information for every response when querying a document. Let’s dive into the techniques behind this OCR-based chatbot and its features.
Key Features:
1. OCR-Based Text Extraction
- Document Formats Supported: Handles PDF, DOCX, PPTX, and TXT file formats.
- Image Extraction: Extracts images from documents and performs OCR using EasyOCR to convert images to text.
- Text Extraction: Utilizes libraries like PyMuPDF for PDFs, python-docx for DOCX files, and python-pptx for PPTX files to extract text from these formats.
2. Vector Database Integration (ChromaDB)
- Manages and stores document embeddings. Supports creation, retrieval, and deletion of collections.
- Document Embeddings: Uses Vertex AI for creating embeddings of document text.
- Document Splitting: Uses `RecursiveCharacterTextSplitter` to handle large documents by splitting them into manageable chunks.
3. Concurrent Document Processing
- Efficient File Handling: Processes documents concurrently, minimizing the time required for large-scale operations.
4. Gradio UI
- File Upload & Management: Allows users to upload files, process them, and manage collections.
- Question Answering: Users can ask questions based on the content of the documents, with answers retrieved from the ChromaDB vector store.
- Collection Management: Supports listing, deleting, and retrieving files from collections.
5. Integration with Vertex AI
- Vertex AI Embeddings: Utilizes Vertex AI’s text embedding model to generate vector representations of document content.
- Vertex AI Language Model: Uses Vertex AI’s text-bison-32k model for generating responses to user queries.
6. Google Cloud Platform (GCP) Integration
- GCS: Handles file uploads and downloads from Google Cloud Storage.
- Persistent Client: Manages the ChromaDB client for persistent storage of embeddings.
7. Metadata Handling
- Metadata Extraction: Captures metadata including file names and upload timestamps, and integrates it with extracted text for richer information retrieval.
These features come together to create a sophisticated document management and querying system, capable of handling diverse file formats and delivering accurate responses based on the content of the documents.
Some Screenshots of RAG System Creation and Q&A:
(Figure 1) RAG System Creation & some APIs to interact with Collections
(Figure 2) Q&A System
Architecture Diagram of the OCR-Based Chatbot:
The provided architecture diagram illustrates a system for creating a knowledge base and a retrieval-augmented generation (RAG) system. Here’s a brief explanation:
1. Document Upload and Processing:
- User Upload: Users upload documents (PDFs, DOCX, PPTX, TXT) via a web interface.
- Processing Pipeline:
- Text Extraction: Extract text from documents.
- OCR: Extract text from images within documents.
- Text Aggregation: Combine extracted text from various sources.
2. Chunking and Embedding:
- Text Chunking: Split aggregated text into chunks.
- Text Embedding: Embed chunks using a model.
- Store in Vector DB: Save embedded chunks in ChromaDB.
3. Vector Database Management:
- Collection Operations:
- Create: Set up new collections.
- Retrieve: Access existing collections.
- Delete: Remove collections as needed.
4. Q&A System:
- Query Processing:
- User Query: Input from users.
- Embed Query: Convert query into embeddings.
- Search and Response Generation:
- Similarity Search: Match query embedding against the vector database.
- Generate Prompt: Create prompt using matched chunks and query.
- Response Generation: Process prompt with Vertex AI LLM.
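The query flow above can be sketched end to end in plain Python. In this minimal illustration, the 2-d vectors and the in-memory store are stand-ins for real Vertex AI embeddings and the ChromaDB collection, and all names here are hypothetical:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_search(query_vec, store, k=2):
    """Return the k chunks whose embeddings are closest to the query."""
    ranked = sorted(store,
                    key=lambda d: cosine_similarity(query_vec, d["embedding"]),
                    reverse=True)
    return [d["text"] for d in ranked[:k]]

def build_prompt(query, chunks):
    """Assemble the prompt sent to the LLM from matched chunks and the query."""
    context = "\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Toy store with tiny "embeddings" standing in for real vectors.
store = [
    {"text": "Invoices are due in 30 days.", "embedding": [1.0, 0.1]},
    {"text": "The office cat is named Biscuit.", "embedding": [0.0, 1.0]},
]
query_vec = [0.9, 0.2]  # pretend this came from embedding the user query
chunks = similarity_search(query_vec, store, k=1)
prompt = build_prompt("When are invoices due?", chunks)
```

In the real system, `similarity_search` is handled by ChromaDB and the final prompt is sent to the Vertex AI LLM.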
Let’s see some of the important code snippets of the Chatbot:
Embedding model & LLM
This part of the code configures the models used for text embeddings and question answering:
- `embedding = VertexAIEmbeddings()` initializes a text-embeddings model backed by Vertex AI.
- A large language model (LLM) is configured with specific parameters for answering questions: model name and version “text-bison-32k”, with a temperature of 0.1 (low randomness).
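A sketch of that configuration using LangChain’s Vertex AI wrappers. Exact import paths vary across LangChain versions, and `max_output_tokens` is an illustrative parameter choice, so treat this as a sketch rather than the exact original code:

```python
from langchain_community.embeddings import VertexAIEmbeddings
from langchain_community.llms import VertexAI

# Text-embedding model served by Vertex AI (used to embed document chunks).
embedding = VertexAIEmbeddings()

# LLM used for answering questions; a low temperature keeps answers
# focused and largely deterministic.
llm = VertexAI(
    model_name="text-bison-32k",
    temperature=0.1,          # low randomness
    max_output_tokens=1024,   # assumed value, not from the original post
)
```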
VectorDB Operations
- CHROMA_DB_CLIENT: Initializes a persistent Chroma database client at the given path.
- del_collection(collection_name): Deletes the given collection from the Chroma database.
- get_all_collections(): Lists the names of all collections in the Chroma database.
- get_files_in_collection(collection_name): Returns details (file name and upload date) of all files in the given collection, or an error message if the lookup fails.
Document Class & split_documents
This code defines a `Document` class to store page content and metadata. The `split_documents` function processes files by splitting their content into chunks: it takes `metadata_all_files` and `dict_pagenumber_text_all_files` as inputs, creates a `Document` instance for each page, and then uses `RecursiveCharacterTextSplitter` to split these documents into smaller chunks. Finally, it returns the split documents and prints the total number of chunks created.
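A simplified sketch of that flow. The real implementation uses LangChain’s `RecursiveCharacterTextSplitter`; here a plain sliding-window splitter stands in so the structure is visible, and the exact input shapes are assumptions:

```python
class Document:
    """Holds one page's text plus its metadata (file name, page number, ...)."""
    def __init__(self, page_content, metadata):
        self.page_content = page_content
        self.metadata = metadata

def split_documents(metadata_all_files, dict_pagenumber_text_all_files,
                    chunk_size=1000, chunk_overlap=100):
    """Build a Document per page, then split pages into overlapping chunks."""
    docs = []
    for file_name, pages in dict_pagenumber_text_all_files.items():
        meta = metadata_all_files[file_name]
        for page_number, text in pages.items():
            docs.append(Document(text, {**meta, "page": page_number}))
    # Stand-in for RecursiveCharacterTextSplitter.split_documents(docs)
    chunks = []
    step = chunk_size - chunk_overlap
    for doc in docs:
        text = doc.page_content
        for start in range(0, max(len(text), 1), step):
            chunks.append(Document(text[start:start + chunk_size], doc.metadata))
    print(f"Total chunks created: {len(chunks)}")
    return chunks
```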
Text Extraction
The extract_text function extracts text from files stored in a GCS bucket. It first creates a temporary folder and downloads the specified file. Depending on the file type (PDF, DOCX, PPTX, or TXT), it uses different methods to extract the text and organizes it by page numbers. The function also extracts and processes images within the file using OCR to include any text from the images. Finally, the temporary file is deleted, and the function returns the combined text, metadata, and a dictionary mapping page numbers to text.
Process Documents
The `process_documents` function handles processing multiple documents from a GCS bucket. It processes the files concurrently to extract text and metadata, then splits the documents into smaller chunks. These chunks are then used to create embeddings, which are saved in a ChromaDB collection. Finally, the function returns the database instance containing the processed documents.
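The orchestration can be sketched as below. The concrete extract/split/store callables are placeholders for the GCS-backed `extract_text`, the chunk splitter, and the Chroma embedding-and-store step described above; their signatures are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def process_documents(file_names, extract_fn, split_fn, store_fn, max_workers=4):
    """Extract text from many files concurrently, chunk it, then store embeddings.

    extract_fn(file_name) -> (text, metadata, page_texts)
    split_fn(results)     -> list of chunks
    store_fn(chunks)      -> vector-DB handle (e.g. a Chroma collection)
    """
    # Text extraction (downloads + OCR) dominates runtime, so fan it out
    # across worker threads; map preserves the input order of the files.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(extract_fn, file_names))
    chunks = split_fn(results)
    return store_fn(chunks)
```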
Future Enhancements for the OCR-Based Chatbot:
- More File Formats: Extend to handle .xls, .csv, and other document types.
- Better Image Processing: Use advanced techniques for improved OCR.
- Multiple-Language Support: Extend OCR and text processing for multiple languages.
- Scalability: Improve to manage larger volumes of documents efficiently.
- Custom Embedding Models: Allow users to select and customize different embedding models.
- Logging & Monitoring: Add detailed logging and monitoring for performance and error tracking.
- Feedback Mechanism: Add a way for users to rate the correctness and relevance of responses.
Reference:
- LangChain GitHub repository
- ChromaDB documentation
- Vertex AI documentation
References used in preparing this chatbot and blog include articles, research papers, and documentation from reputable sources. I apologize if any references have been inadvertently missed; please let me know if you find any gaps, and I will address them promptly.
Code Link: https://github.com/deeepalichandel/OCR_based_chatbot.git
LinkedIn Profile: https://www.linkedin.com/in/deepalichandel16/