Supa Crawl Chat
Introducing Supa-Crawl-Chat: A Comprehensive Web Crawling, Semantic Search, and AI-Driven Chat Solution with Supabase & Crawl4AI.
Seamlessly crawl websites, transform content into vector embeddings, and enable advanced semantic search. Supa-Crawl-Chat utilizes Supabase for reliable data storage and incorporates AI-powered chat with long-term memory features.
GitHub repo for Supa Crawl Chat
Key Features
- High-Performance Web Crawling
- Harness the power of Crawl4AI to efficiently index websites and sitemaps with configurable depth and scope
- Advanced crawling algorithms adapt to different website structures and content types for optimal data extraction
- Seamless handling of JavaScript-rendered content and dynamic websites
- Advanced Semantic Search Engine
- Leverage cutting-edge vector similarity and OpenAI embeddings for context-aware search capabilities
- Achieve up to 95% more relevant search results compared to traditional keyword-based approaches
- Fine-tuned ranking algorithms that understand semantic relationships between concepts
- AI-Powered Content Intelligence
- Transform raw web content into structured, actionable data from the terminal or the web UI
- Generate human-quality titles, summaries, and site descriptions with remarkable accuracy
- Automatic content categorization and entity extraction for enhanced data organization
- Interactive Data Visualization
- Explore your data ecosystem through an intuitive Streamlit-based interface
- Real-time analytics and insights into your content repository
- Customizable dashboards for monitoring crawl performance and content metrics
- Scalable Deployment Architecture
- Deploy with confidence using our Docker configurations:
- Lightweight: App-only deployment for integration with existing infrastructure
- Standard: App + Crawl4AI for complete content processing capabilities
- Full-Stack: End-to-end solution with App + Crawl4AI + Supabase for maximum autonomy
- Comprehensive API Ecosystem
- RESTful API with comprehensive documentation for seamless integration
- Webhook support for event-driven architectures
- Comprehensive access management and multi-factor authentication for enhanced security
Prerequisites
- Python 3.10+
- Node 18+
- A running Crawl4AI instance (self-hosted or provided)
- A Supabase instance (self-hosted or provided)
- OpenAI API key for generating embeddings, content summaries and chat
- Docker (optional)
Installation
Clone this repository:
git clone https://github.com/bigsk1/supa-crawl-chat.git
cd supa-crawl-chat
Install the required dependencies:
pip install -r requirements.txt
Change directory to the frontend and install dependencies:
cd frontend
npm install
Create a .env file with your configuration:
# Crawl4AI Configuration
# Run locally in Docker or as an external service - easy to set up with Docker Compose
CRAWL4AI_API_TOKEN=your_crawl4ai_api_token
# Local Docker
# CRAWL4AI_BASE_URL=http://crawl4ai:11235
# External Service
CRAWL4AI_BASE_URL=your_crawl4ai_base_url
# Supabase Configuration
SUPABASE_URL=your_supabase_host:port
# Database credentials
SUPABASE_DB=postgres
SUPABASE_KEY=postgres
SUPABASE_PASSWORD=postgres
# OpenAI Configuration
OPENAI_API_KEY=sk-proj-
# Model to use for embeddings
OPENAI_EMBEDDING_MODEL=text-embedding-3-small
# Model to use for title and summary generation and chat analysis
OPENAI_CONTENT_MODEL=gpt-4o-mini
# Crawl Configuration
# Set to 'url' for regular website or 'sitemap' for sitemap crawling, will crawl child pages from the sitemap
CRAWL_TYPE=url
# URL to crawl (can be a website URL or sitemap URL)
CRAWL_URL=https://example.com
# Maximum number of URLs to crawl from a sitemap (set to 0 for unlimited)
MAX_URLS=30
# Optional name for the site (if not provided, one will be generated)
CRAWL_SITE_NAME=
# Optional description for the site (if not provided, one will be generated)
CRAWL_SITE_DESCRIPTION=
# Chat Configuration
# Model to use for the chat interface
CHAT_MODEL=gpt-4o
# Number of results to retrieve for each query
CHAT_RESULT_LIMIT=5
# Similarity threshold for vector search (0-1)
CHAT_SIMILARITY_THRESHOLD=0.4
# Default session ID (if not provided, a new one will be generated); you can use any random string
CHAT_SESSION_ID=
# Default user ID (optional; a username such as 'larry')
CHAT_USER_ID=
# Default chat profile (default, pydantic, technical, concise, scifi, pirate, supabase_expert, medieval, etc.)
CHAT_PROFILE=default
# Directory containing profile YAML files
CHAT_PROFILES_DIR=profiles
# Verbose mode (true, false) - enable to see more during chat
CHAT_VERBOSE=false
Running the Frontend and Backend
To run the backend API and the frontend UI, follow these steps:
Start the Backend API: Open a terminal and navigate to the root directory of the project. Then run:
python run_api.py
Start the Frontend UI: Open a separate terminal, navigate to the frontend directory, and run:
npm run dev
Access the Web UI: Open your web browser and go to:
http://localhost:3000/
This will start the backend API on port 8001 and the frontend UI on port 3000.
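To confirm that both services are up, you can hit the API docs endpoint and the frontend with a quick script (a convenience check, not part of the project; it assumes the default ports above):

```python
import requests  # pip install requests

# The backend serves interactive docs at /docs (see the API section below).
checks = [("API", "http://localhost:8001/docs"), ("Frontend", "http://localhost:3000/")]
for name, url in checks:
    try:
        status = requests.get(url, timeout=5).status_code
        print(f"{name}: HTTP {status}")
    except requests.RequestException as exc:
        print(f"{name}: not reachable ({exc})")
```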
Database Connection Options
The project supports two ways to connect to your Supabase database:
Single URL (Option 1): Use this for both local and remote connections. The URL can be specified with or without protocol.
# With protocol (for remote instances)
SUPABASE_URL=https://your-project.supabase.co:5432
# Without protocol (for local instances)
SUPABASE_URL=192.168.xx.xx:54322
You'll need to provide the database credentials:
SUPABASE_DB=postgres
SUPABASE_KEY=postgres
SUPABASE_PASSWORD=postgres
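For reference, here is a minimal sketch of how these variables could be combined into a direct Postgres connection with psycopg2. The exact logic in db_client.py may differ; treat the function below as illustrative only:

```python
import os
import psycopg2

def connect_from_env():
    """Build a Postgres connection from the SUPABASE_* variables (sketch)."""
    url = os.getenv("SUPABASE_URL", "localhost:54322")
    # Strip an optional protocol prefix so both forms shown above work.
    host_port = url.split("://", 1)[-1]
    host, _, port = host_port.partition(":")
    return psycopg2.connect(
        host=host,
        port=int(port or 5432),
        dbname=os.getenv("SUPABASE_DB", "postgres"),
        user=os.getenv("SUPABASE_KEY", "postgres"),
        password=os.getenv("SUPABASE_PASSWORD", "postgres"),
    )
```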
Content Chunking for LLM Interaction
The system automatically breaks down large content into smaller, more manageable chunks for better LLM interaction and more precise search results. This provides several benefits:
Improved Search Precision: Instead of matching against entire pages, the system can find the specific chunk that best answers a query.
Efficient Token Usage: When interacting with LLMs, only the relevant chunks are sent, reducing token usage and costs.
Better Context Management: Each chunk maintains a reference to its parent page, preserving the full context.
Automatic Token Limit Handling: Content is automatically chunked to stay within the token limits of the embedding model (8,192 tokens for text-embedding-3-small).
How Chunking Works
The system uses a sophisticated semantic chunking strategy:
Semantic Boundary Detection: Content is first split along natural semantic boundaries:
- Markdown headers (e.g., # Section Title)
- Paragraph breaks
- This preserves the meaning and context of each chunk
Token-Based Sizing: Each section is then analyzed to ensure it fits within token limits:
- Sections that fit are kept together
- Sections that exceed limits are further split with token-based chunking
- A 200-token overlap is maintained between chunks for context continuity
Smart Overlap: When creating overlaps between chunks, the system looks for natural break points:
- Paragraph breaks
- Sentence endings
- Clause breaks
- Word boundaries
Metadata Preservation: Each chunk maintains references to:
- Its parent document
- Its position in the sequence (chunk index)
- Its token count
This approach ensures that chunks are not only sized appropriately for LLMs but also maintain semantic coherence, making them more useful for search and retrieval.
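As a rough illustration of this strategy (not the exact code in crawler.py), a simplified chunker might first split on headers and paragraph breaks, then pack sections under a token budget with a small overlap. The real implementation also looks for sentence and clause boundaries when building the overlap, as described above:

```python
import re
import tiktoken  # tokenizer compatible with OpenAI embedding models

enc = tiktoken.get_encoding("cl100k_base")

def chunk_text(text: str, max_tokens: int = 4000, overlap_tokens: int = 200) -> list[str]:
    """Split text on semantic boundaries, then pack sections under a token budget (sketch)."""
    # 1. Semantic boundary detection: markdown headers and paragraph breaks.
    sections = re.split(r"\n(?=#{1,6} )|\n\n+", text)

    chunks: list[str] = []
    current: list[int] = []  # token ids of the chunk being built
    for section in sections:
        tokens = enc.encode(section)
        # 2. Sections that exceed the limit are hard-split on token boundaries.
        while len(tokens) > max_tokens:
            chunks.append(enc.decode(tokens[:max_tokens]))
            tokens = tokens[max_tokens - overlap_tokens:]
        # 3. Flush the current chunk when full, carrying an overlap forward
        #    (the budget is approximate: the overlap may push a chunk slightly over).
        if current and len(current) + len(tokens) > max_tokens:
            chunks.append(enc.decode(current))
            current = current[-overlap_tokens:]
        current.extend(tokens)
    if current:
        chunks.append(enc.decode(current))
    return chunks
```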
Configuring Chunking
You can adjust the chunking parameters in the code:
# In crawler.py, enhance_pages method
enhanced_pages = asyncio.run(self.enhance_pages(pages, max_tokens_per_chunk=4000))
The default settings are:
- max_tokens_per_chunk: 4,000 tokens (half of the 8,192 token limit for safety)
- overlap_tokens: 200 tokens (overlap between chunks to maintain context)
Testing the Setup
Before using the crawler, you can test your setup:
- Test the database connection:
python tests/test_db_connection.py
- Test the Crawl4AI API:
python tests/test_crawl_api.py
Usage
Setting up the database
Before using the crawler, you need to set up the database:
python main.py setup
This will create the necessary tables and extensions in your Supabase database.
Terminal
Crawling a website with args
You can crawl a website in two ways:
- Using the command-line interface:
python main.py crawl https://example.com --name "Example Site" --description "An example website"
To crawl a sitemap:
python main.py crawl https://example.com/sitemap.xml --sitemap --name "Example Site"
You can limit the number of URLs to crawl from the sitemap:
python main.py crawl https://example.com/sitemap.xml --sitemap --max-urls 20
Note: If you don't provide a description, the system will automatically generate one based on the content of the homepage or main page.
- Using the .env file configuration (recommended): First, update the .env file with your crawl settings:
CRAWL_TYPE=url # or 'sitemap' for sitemap crawling
CRAWL_URL=https://example.com
CRAWL_SITE_NAME=Example Site
CRAWL_SITE_DESCRIPTION=An example website # Optional - will be auto-generated if empty
Then run:
python run_crawl.py
Title and Summary Generation
The crawler automatically generates titles and summaries for crawled content using OpenAI. You can configure the model used for this in the .env file:
OPENAI_CONTENT_MODEL=gpt-4o-mini
Updating Existing Content
If you have existing pages without titles or summaries, or if you want to regenerate them with a different model, you can use the update_content.py script:
# Update all sites
python update_content.py
# Update a specific site
python update_content.py --site-id 1
# Limit the number of pages to update
python update_content.py --limit 50
# Force update all pages, even if they already have titles and summaries
python update_content.py --force
Searching the crawled content
To search the crawled content using semantic search:
python main.py search "your search query"
To use text-based search instead of semantic search:
python main.py search "your search query" --text-only
To adjust the similarity threshold and limit the number of results:
python main.py search "your search query" --threshold 0.8 --limit 2
To save the search results to a file:
python main.py search "your search query" --output results.json
Listing crawled sites
To list all the sites that have been crawled:
python main.py list-sites
By default, this only counts parent pages (not chunks). To include chunks in the page count:
python main.py list-sites --include-chunks
Working with Chunks
When retrieving or searching content, you can control whether chunks are included:
# Get pages for a site (parent pages only)
pages = crawler.get_site_pages(site_id, limit=100)
# Get pages for a site including chunks
pages_with_chunks = crawler.get_site_pages(site_id, limit=100, include_chunks=True)
When searching, chunks are automatically included and prioritized for more precise results. Each chunk includes context about its parent document:
python main.py search "your search query"
The search results will include:
- The content snippet that matched your query
- Which document it came from
- Which part of the document it represents (e.g., "Part 2 of 5")
This makes it easier to understand the context of each search result, even when it's a small chunk of a larger document.
Using the chat interface in terminal
The project includes a chat interface in the terminal that uses an LLM to answer questions based on the crawled data. The chat interface now supports persistent conversation history, allowing the LLM to remember previous interactions even after restarting the application.
You can start the terminal chat interface using either the dedicated script or the main CLI:
# Using the dedicated script
python chat.py
# Using the main CLI
python main.py chat
Chat Interface Options
You can customize the chat interface with various options:
# Specify a different OpenAI model
python main.py chat --model gpt-4
# Set the maximum number of search results to retrieve when chatting
python main.py chat --limit 10
# Adjust the similarity threshold for vector search (0-1)
python main.py chat --threshold 0.6
# Use a specific session ID for persistent conversations
python main.py chat --session my-chat-session
# Associate the conversation with a specific user
python main.py chat --user John
# Enable verbose debug output
python main.py chat --verbose
# Combined
python main.py chat --model gpt-4 --limit 15 --threshold 0.3 --session 12123111111 --user John --verbose
Search Functionality
The chat interface uses a sophisticated hybrid search approach that combines vector similarity with text matching:
Vector Search: Uses OpenAI's embeddings to find semantically similar content
Text Search: Enhances results with keyword matching for better precision
Hybrid Approach: Combines both methods to provide the most relevant results
This approach ensures that even when vector similarity might not find exact matches, the text search component can still retrieve relevant information. The system automatically adjusts the search strategy based on the query type and available content.
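A minimal sketch of what such a hybrid query could look like with pgvector, assuming an embedding column on crawl_pages and the cosine-distance operator; the project's actual SQL in db_client.py may be structured differently:

```python
HYBRID_SEARCH_SQL = """
SELECT id, url, title, summary,
       1 - (embedding <=> %(query_embedding)s::vector) AS similarity
FROM crawl_pages
WHERE 1 - (embedding <=> %(query_embedding)s::vector) > %(threshold)s
   OR content ILIKE %(text_pattern)s        -- keyword fallback for exact terms
ORDER BY similarity DESC NULLS LAST
LIMIT %(limit)s;
"""

def hybrid_search(conn, query_embedding: list[float], query_text: str,
                  threshold: float = 0.4, limit: int = 5):
    """Combine vector similarity with simple keyword matching (sketch)."""
    with conn.cursor() as cur:
        cur.execute(HYBRID_SEARCH_SQL, {
            # pgvector accepts a '[...]' text literal cast to vector
            "query_embedding": str(query_embedding),
            "threshold": threshold,
            "text_pattern": f"%{query_text}%",
            "limit": limit,
        })
        return cur.fetchall()
```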
Persistent Conversation History
The chat interface stores all conversation history in the database, allowing the LLM to remember previous interactions. This enables more natural and contextual conversations over time.
Key features:
Session-based conversations: Each conversation gets a unique session ID
User identification: Optionally associate conversations with specific users
Conversation continuity: Continue conversations where you left off, even after restarting
Chat commands:
- Type clear to clear the conversation history
- Type history to view the conversation history
- Type exit or bye to quit the chat interface
Important: To maintain the same conversation across multiple chat sessions, you must use the same session ID. The session ID is displayed when you start the chat interface. You can specify it before starting a new chat session:
# Start a new chat session
python chat.py --user Joe
# Note the session ID displayed (e.g., "Session ID: a24b6b72-e526-4a09-b662-0f85e82f78a7")
# Later, continue the same conversation by specifying the session ID
python chat.py --user Joe --session a24b6b72-e526-4a09-b662-0f85e82f78a7
You can also set a default session ID in your .env file:
CHAT_SESSION_ID=your-session-id
This way, the chat interface will always use the same session ID unless you explicitly specify a different one with the --session parameter.
User Preferences and Memory
The chat interface can remember user preferences and information shared during conversations, as long as you use the same session ID. For example:
- If you tell the assistant "I like Corvettes" in one session
- Then in a later session (using the same session ID), ask "What cars do I like?"
- The assistant will remember and respond with "You like Corvettes"
This memory persistence works by:
- Storing all messages in the database with the session ID
- Analyzing conversation history when relevant questions are asked
- Extracting user preferences and information from previous messages
To get the most out of this feature, always use the same session ID and user ID when you want the assistant to remember previous conversations.
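Under the hood, this boils down to keying every message on the session ID and replaying that history into the model's context. A rough sketch against the chat_conversations table described later in this document (the real code also extracts preferences and lives in the chat and database modules):

```python
def save_message(conn, session_id: str, role: str, content: str, user_id: str | None = None) -> None:
    """Append one message to the chat_conversations table (sketch)."""
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO chat_conversations (session_id, user_id, role, content) "
            "VALUES (%s, %s, %s, %s)",
            (session_id, user_id, role, content),
        )

def load_history(conn, session_id: str) -> list[dict]:
    """Rebuild the message list for a session, oldest first, to feed the LLM."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT role, content FROM chat_conversations "
            "WHERE session_id = %s ORDER BY timestamp",
            (session_id,),
        )
        return [{"role": role, "content": content} for role, content in cur.fetchall()]
```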
Managing User Preferences via CLI
The chat interface includes several commands for managing user preferences directly from the command line:
Viewing Preferences
preferences
Displays a table of all active preferences for the current user, including ID, type, value, confidence, context, and last used timestamp.
Adding Preferences
add preference <type> <value> [confidence]
Manually adds a new preference for the current user. If confidence is not specified, it defaults to 0.9.
Examples:
add preference like Python
add preference expertise JavaScript 0.85
add preference goal "Learn machine learning"
Deleting Preferences
delete preference <id>
Deletes a specific preference by ID.
Clearing All Preferences
clear preferences
Deletes all preferences for the current user after confirmation.
Important: Preference commands are only available when a user ID is provided (using --user when starting the chat).
Chat Profiles
The chat interface supports different profiles that customize the behavior of the assistant. Each profile has its own system prompt, search settings, and site filtering capabilities. Ideally, crawl the sitemap of a documentation site and then use or create a profile with an additional system prompt that makes the assistant an expert on those docs.
Built-in profiles:
default: General-purpose assistant that searches all sites
pydantic: Specialized for Pydantic documentation, focusing on technical details and code examples
technical: Provides detailed technical explanations with step-by-step instructions
concise: Gives brief, to-the-point answers without unnecessary details
You can switch profiles during a chat session:
profile pydantic
Or start with a specific profile:
python main.py chat --profile technical
You can also view all available profiles:
profiles
How Site Filtering Works
The sites array in each profile's search_settings controls which sites the assistant searches through when answering questions:
search_settings:
sites: ["pydantic"] # Only search in sites with "pydantic" in the name
threshold: 0.6
limit: 8
Here's how the filtering works:
Empty array (sites: []): Searches across ALL sites in the database
Site patterns: Filters to only include sites where the site name contains any of the specified patterns
Pattern matching: Uses case-insensitive partial matching, so "bigsk1" would match site names like "Bigsk1 Com", "bigsk1.com", etc.
Multiple patterns: You can include multiple patterns to search across several related sites
The filtering process:
- When a user asks a question, the system looks at the current profile's sites setting
- It queries the crawl_sites table to find site IDs where the name contains any of the patterns
- It then only searches for content in pages associated with those site IDs
- This allows profiles to focus on specific content sources, making responses more relevant
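A rough sketch of that filtering step, assuming a direct SQL query against the crawl_sites table (the project's own query may differ):

```python
def resolve_site_ids(conn, patterns: list[str]) -> list[int]:
    """Map profile site-name patterns to crawl_sites IDs (case-insensitive partial match)."""
    if not patterns:
        return []  # an empty sites list in the profile means: no filter, search all sites
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id FROM crawl_sites WHERE "
            + " OR ".join(["name ILIKE %s"] * len(patterns)),
            [f"%{p}%" for p in patterns],
        )
        return [row[0] for row in cur.fetchall()]

# Example: a profile with sites: ["pydantic"] limits search to matching sites,
# and the content query then adds a condition like: WHERE site_id = ANY(%(site_ids)s)
```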
Custom Profiles
You can create your own custom profiles by adding YAML files to the profiles directory. Each profile file should include:
- name: The name of the profile (used to select it)
- description: A brief description of the profile
- system_prompt: The system prompt that defines the assistant's behavior
- search_settings: Configuration for search behavior
  - sites: List of site name patterns to filter by (empty list means search all sites)
  - threshold: Similarity threshold for vector search (0-1)
  - limit: Maximum number of results to return
Example profile file (profiles/custom_expert.yaml):
name: custom_expert
description: Custom expert for specific documentation
system_prompt: |
You are an expert on [specific topic].
Your expertise includes:
- [Area of expertise 1]
- [Area of expertise 2]
- [Area of expertise 3]
When answering questions:
- [Instruction 1]
- [Instruction 2]
- [Instruction 3]
search_settings:
sites: ["site1", "site2"] # Only search in sites containing these terms
threshold: 0.6 # Higher threshold for more precise matches
limit: 8 # Number of results to return
You can specify a custom profiles directory:
python main.py chat --profiles-dir my_profiles
Configuration via .env
You can set default values for the chat interface in your .env file:
# Chat Configuration
CHAT_MODEL=gpt-4o
CHAT_RESULT_LIMIT=5
CHAT_SIMILARITY_THRESHOLD=0.5
CHAT_SESSION_ID=default-session
CHAT_USER_ID=default-user
CHAT_PROFILE=default
CHAT_PROFILES_DIR=profiles
CHAT_VERBOSE=false
This allows you to maintain consistent settings and continue the same conversation across multiple sessions.
Resetting the database
If you want to start fresh and delete all data or recreate the tables, you can use the reset_database.py script:
python tests/reset_database.py
This script provides two options:
- Delete all data (keep tables) - This will delete all data from the tables but keep the table structure.
- Drop and recreate tables - This will drop the tables and recreate them, effectively starting from scratch.
Programmatic usage
You can also use the crawler programmatically in your own Python code. See tests/example.py for a demonstration.
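As a rough idea of what that looks like (class and method names below are assumed for illustration; tests/example.py shows the real usage):

```python
# Hypothetical sketch -- see tests/example.py for the actual class and method names.
from crawler import Crawler  # main crawler class from crawler.py (name assumed)

crawler = Crawler()
site_id = 1  # e.g. an ID reported by `python main.py list-sites`

# Fetch stored pages for a site (parent pages only), as shown in "Working with Chunks".
pages = crawler.get_site_pages(site_id, limit=100)

# Include chunks as well, e.g. to inspect how content was split.
pages_with_chunks = crawler.get_site_pages(site_id, limit=100, include_chunks=True)
print(f"{len(pages)} pages, {len(pages_with_chunks)} pages + chunks")
```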
Project Structure
Backend
- main.py: Main script with command-line interface
- crawler.py: Main crawler class that ties everything together
- crawl_client.py: Client for interacting with the Crawl4AI API
- embeddings.py: Module for generating OpenAI embeddings
- content_enhancer.py: Module for generating titles and summaries using OpenAI
- db_client.py: Client for interacting with the Supabase database
- db_setup.py: Script for setting up the database
- chat.py: Chat interface for interacting with crawled data using an LLM
- run_api.py: Script to run the API
- run_crawl.py: Script to run a crawl using the configuration from the .env file
- update_content.py: Script to update existing pages with titles and summaries
- utils.py: Utility functions for the CLI
- requirements.txt: List of dependencies for the backend
- .env.example: Example environment file for the backend
- api/: Directory containing the FastAPI implementation
  - main.py: FastAPI application entry point
  - routers/: Directory containing API route definitions
    - crawl.py: Endpoints for crawling websites and sitemaps
    - search.py: Endpoints for searching crawled content
    - sites.py: Endpoints for managing and retrieving site information
    - chat.py: Endpoints for interacting with the chat interface
    - pages.py: Endpoints for managing and retrieving page information
  - README.md: Comprehensive API documentation
- docker/: Directory containing Docker-related files
  - Dockerfile: Docker image definition for the backend application
  - frontend.Dockerfile: Docker image definition for the frontend application
  - docker-compose.yml: Docker Compose configuration for the API service only
  - crawl4ai-docker-compose.yml: Docker Compose configuration for integrated API and Crawl4AI services
  - full-stack-compose.yml: Docker Compose configuration for the complete stack (API, Crawl4AI, Supabase, Frontend)
  - setup.sh: Script to set up the full stack environment
  - reset.sh: Script to reset the full stack environment
  - status.sh: Script to check the status of the full stack environment
  - .env: Environment variables for Docker deployment
  - .env.example: Example environment file for Docker deployment
  - full-stack/: Documentation and utilities for the full stack setup
    - README.md: Documentation for the full stack setup
    - ENV_GUIDE.md: Guide for configuring environment variables
    - check_db_connections.sh: Script to verify database connections
  - volumes/: Directory for Docker volumes
  - .dockerignore: Specifies files to exclude from Docker builds
- supabase_explorer/: Directory containing the Supabase Explorer Streamlit app
  - supabase_explorer.py: Interactive Streamlit app for database exploration
  - supabase_queries.md: Collection of useful SQL queries
  - database_explorer_readme.md: Documentation for the Supabase Explorer
- profiles/: Directory containing chat profile configurations
  - Various YAML files defining different chat personalities and behaviors
- tests/: Directory containing test scripts
  - example.py: Example script demonstrating programmatic usage
  - test_db_connection.py: Script to test the database connection
  - test_crawl_api.py: Script to test the Crawl4AI API
  - reset_database.py: Script to delete tables or reset the database
Frontend
- frontend/: Directory containing the React-based web UI
  - src/: Source code for the frontend application
    - api/: API client for communicating with the backend
      - apiService.ts: Service for making API requests
      - apiWrapper.ts: Wrapper for API endpoints with type definitions
    - components/: Reusable UI components
      - Layout.tsx: Main layout component with Sidebar and Navbar
      - Navbar.tsx: Top navigation bar
      - Sidebar.tsx: Side navigation menu
      - NotificationCenter.tsx: Notification system for user alerts
      - PageListItem.tsx: Component for displaying page items in lists
      - UserProfileModal.tsx: Modal for user profile management
      - ui/: Shadcn UI component library
        - Various UI components like buttons, inputs, dialogs, etc.
    - context/: React context providers for state management
    - hooks/: Custom React hooks
    - lib/: Utility libraries and configurations
    - pages/: Main application views
      - HomePage.tsx: Landing page
      - ChatPage.tsx: AI chat interface
      - CrawlPage.tsx: Web crawling interface
      - SearchPage.tsx: Search interface
      - SitesPage.tsx: Site management
      - SiteDetailPage.tsx: Detailed view of a crawled site
      - NotificationInfo.tsx: Notification settings and information
      - UserProfileModal.tsx: User profile management
      - UserPreferencesPage.tsx: User preferences management
    - styles/: CSS and styling files
    - utils/: Utility functions
    - App.tsx: Main application component
    - main.tsx: Entry point for the React application
  - public/: Static assets
  - index.html: HTML entry point
  - vite.config.ts: Vite configuration
  - tailwind.config.js: Tailwind CSS configuration
  - tsconfig.json: TypeScript configuration
  - package.json: NPM dependencies and scripts
Database Structure
The project uses the following tables in the Supabase database:
crawl_sites: Stores information about the sites you've crawled
- id: Primary key
- name: Name of the site
- url: URL of the site
- description: Optional description of the site
- created_at: Timestamp when the site was added
crawl_pages: Stores the actual content, embeddings, titles, and summaries for each page
- id: Primary key
- site_id: Foreign key referencing the crawl_sites table
- url: URL of the page (unique)
- title: Title of the page
- content: Content of the page
- summary: Summary of the page
- embedding: Vector embedding of the content
- metadata: Additional metadata about the page
- is_chunk: Boolean indicating if this is a chunk of a larger page
- chunk_index: Index of the chunk within the parent page
- parent_id: Foreign key referencing the parent page
- created_at: Timestamp when the page was added
- updated_at: Timestamp when the page was last updated
chat_conversations: Stores conversation history for the chat interface
- id: Primary key
- session_id: Unique identifier for the conversation session
- user_id: Optional identifier for the user
- timestamp: Timestamp when the message was sent
- role: Role of the message sender (user, assistant, system)
- content: Content of the message
- metadata: Additional metadata about the message
When you crawl a site multiple times, the system will update existing pages rather than creating duplicates, ensuring you always have the most recent content. Similarly, the chat interface will maintain conversation history across sessions, allowing for more natural and contextual interactions.
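Because the url column on crawl_pages is unique, re-crawling a page can be handled as an upsert. A minimal sketch of that idea (the project's own insert logic in db_client.py may differ):

```python
UPSERT_PAGE_SQL = """
INSERT INTO crawl_pages (site_id, url, title, content, summary, embedding, metadata)
VALUES (%(site_id)s, %(url)s, %(title)s, %(content)s, %(summary)s, %(embedding)s, %(metadata)s)
ON CONFLICT (url) DO UPDATE SET
    title      = EXCLUDED.title,
    content    = EXCLUDED.content,
    summary    = EXCLUDED.summary,
    embedding  = EXCLUDED.embedding,
    metadata   = EXCLUDED.metadata,
    updated_at = now();
"""

def upsert_page(conn, page: dict) -> None:
    """Insert a crawled page, or refresh it in place if its URL was crawled before (sketch).

    The metadata value should be passed as a JSON string (or psycopg2.extras.Json).
    """
    with conn, conn.cursor() as cur:
        cur.execute(UPSERT_PAGE_SQL, page)
```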
Supabase Explorer
The project includes a powerful Streamlit-based Supabase Explorer app that allows you to interactively explore and analyze your database. This tool makes it easy to run SQL queries, visualize results, and gain insights from your crawled data.
Features
Interactive Query Interface: Run predefined or custom SQL queries with a single click
Data Visualization: Create bar charts, line charts, and pie charts from your query results
Database Overview: View statistics about your database, including site counts and page distribution
Export Functionality: Download query results as CSV files for further analysis
Predefined Queries: Access a comprehensive collection of useful SQL queries organized by category:
- Site queries
- Page queries
- Chunk queries
- Metadata queries
- Conversation history queries
- Statistics queries
- Embedding analysis queries
- Content quality queries
- Advanced conversation analysis
- Performance queries
- Search performance analysis
Running the Supabase Explorer
To launch the Supabase Explorer:
cd supabase_explorer
pip install -r requirements.txt
streamlit run supabase_explorer.py
The app will automatically connect to your Supabase database using the credentials in your root .env file.
Running Supabase Explorer in Docker
The Supabase Explorer is also available as part of the Docker setup. When you run either of the Docker Compose configurations, the Streamlit app will be accessible at:
http://localhost:8501
This allows you to explore your database directly from the Docker container without having to install Streamlit locally.
# Start the Docker containers including the Supabase Explorer
docker-compose -f docker/docker-compose.yml up -d
# Or with the integrated Crawl4AI setup
docker-compose -f docker/crawl4ai-docker-compose.yml up -d
Adding Custom Queries
You can add your own custom queries to the predefined list by editing the supabase_explorer/supabase_queries.md file. Follow the existing format:
Your Category
Your Query Name
```sql
SELECT * FROM your_table WHERE your_condition;
```
After adding your queries, restart the Streamlit app to load the new queries.
Docker Deployment - 3 different options!
# Build and start the container
docker-compose -f docker/docker-compose.yml up -d
# View logs
docker-compose -f docker/docker-compose.yml logs -f
This setup includes:
- API backend on port 8001
- Frontend UI on port 3000
- Streamlit Explorer on port 8501
Integrated Crawl4AI Docker Deployment
If you want to run both the API and Crawl4AI in Docker containers (for when you already have a Supabase instance running locally or externally), you can use the provided crawl4ai-docker-compose.yml file:
# Build and start both containers
docker-compose -f docker/crawl4ai-docker-compose.yml up -d
# View logs
docker-compose -f docker/crawl4ai-docker-compose.yml logs -f
This setup will:
- Start a Crawl4AI container using the official image from Docker Hub
- Start your API container with the correct configuration to connect to Crawl4AI
- Start the frontend UI container for the web interface
- Start the Streamlit Explorer for database exploration
- Create a network for the containers to communicate with each other
Make sure your .env file in the root includes the necessary Crawl4AI configuration:
# Crawl4AI Configuration
CRAWL4AI_API_TOKEN=your_crawl4ai_api_token
# This will be automatically set to the Docker service name in the container
# CRAWL4AI_BASE_URL=http://crawl4ai:11235
Access the services:
- API: http://localhost:8001
- Frontend UI: http://localhost:3000
- Streamlit Explorer: http://localhost:8501
- Crawl4AI: http://localhost:11235
Full Stack Docker Setup (Supabase + API + Crawl4AI + Frontend)
We provide a comprehensive Docker setup that includes everything you need to run the entire application stack:
- Supa Chat API Backend
- Frontend UI
- Supabase Docker images (Database, Kong, Realtime, etc.)
- Crawl4AI Docker image for web crawling
This setup comes with everything you need to run the complete application without any external dependencies.
Important Environment Variable Configuration
The full-stack Docker setup requires careful configuration of environment variables:
- SUPABASE_URL: This should be commented out or left empty to ensure the API connects directly to the database:
# SUPABASE_URL=http://kong:8002
If this is set, the API will try to connect to Kong for database operations, which will cause SSL negotiation errors.
- Direct Database Connection: Ensure these database connection parameters are set correctly:
SUPABASE_HOST=db
SUPABASE_PORT=5432
SUPABASE_KEY=supabase_admin
SUPABASE_PASSWORD=${POSTGRES_PASSWORD}
Setting Up the Full Stack
To use the full stack Docker setup:
Navigate to the docker directory:
cd docker
Run the setup script to create necessary configuration files:
chmod +x setup_update.sh
./setup_update.sh
This script will:
- Check for the existence of the .env file
- Create SQL scripts for database initialization
- Download Supabase initialization scripts
- Create application tables and functions
- Generate the Kong configuration file
Edit the Docker-specific .env file with your actual values:
nano .env
Start the services:
docker-compose -f full-stack-compose.yml up -d
Access the services:
- API: http://localhost:8001
- API Documentation: http://localhost:8001/docs
- Frontend UI: http://localhost:3000
- Supabase Studio: http://localhost:3001 (username: supabase, password: from your .env file)
- Kong API Gateway: http://localhost:8002
- Crawl4AI: http://localhost:11235
Monitor or manage the stack:
# Check status of all services
./status.sh
# Reset the stack (removes all data)
./reset.sh
Troubleshooting
- Database Connection Issues:
  - If you see SSL negotiation errors, make sure SUPABASE_URL is commented out or empty in your .env file
  - Verify the database credentials in the .env file
  - Restart the API service after making changes:
docker-compose -f full-stack-compose.yml restart api
- REST Service Issues:
If the REST service is not connecting properly, run the fix script:
./fix_rest.sh
- Checking Logs:
View logs for a specific service:
docker logs supachat-api
docker logs supachat-kong
docker logs supachat-frontend
API
The project includes a FastAPI-based REST API that allows you to integrate the Supa-Crawl-Chat functionality with other applications or build custom frontends. The API provides endpoints for searching, crawling, managing sites, and chatting.
Running the API
To start the API server:
python run_api.py
or use:
cd api
uvicorn api.main:app --host 0.0.0.0 --port 8001 --reload
The API will be available at http://localhost:8001
API Endpoints
The interactive API documentation is available at:
http://localhost:8001/docs
The API provides the following endpoints:
Search
GET /api/search: Search for content using semantic search or text search
- Parameters:
  - query: The search query
  - threshold: Similarity threshold (0-1)
  - limit: Maximum number of results
  - text_only: Use text search instead of embeddings
  - site_id: Optional site ID to filter results by
Crawl
POST /api/crawl: Crawl a website or sitemap
- Body:
  - url: URL to crawl
  - site_name: Optional name for the site
  - site_description: Optional description of the site
  - is_sitemap: Whether the URL is a sitemap
  - max_urls: Maximum number of URLs to crawl from a sitemap
GET /api/crawl/status/{site_id}: Get the status of a crawl by site ID
Sites
GET /api/sites: List all crawled sites
- Parameters:
  - include_chunks: Whether to include chunks in the page count
GET /api/sites/{site_id}: Get a site by ID
- Parameters:
  - include_chunks: Whether to include chunks in the page count
GET /api/sites/{site_id}/pages: Get pages for a specific site
- Parameters:
  - include_chunks: Whether to include chunks in the results
  - limit: Maximum number of pages to return
Chat
POST /api/chat: Send a message to the chat bot and get a response
- Body:
  - message: The user's message
  - session_id: Optional session ID for persistent conversations
  - user_id: Optional user ID
  - profile: Optional profile to use
- Parameters:
  - model: Optional model to use
  - result_limit: Optional maximum number of search results
  - similarity_threshold: Optional similarity threshold (0-1)
  - include_context: Whether to include search context in the response
  - include_history: Whether to include conversation history in the response
GET /api/chat/profiles: List all available profiles
- Parameters:
  - session_id: Optional session ID to get active profile
  - user_id: Optional user ID
POST /api/chat/profiles/{profile_name}: Set the active profile for a session
- Parameters:
  - session_id: Session ID
  - user_id: Optional user ID
GET /api/chat/history: Get conversation history for a session
- Parameters:
  - session_id: Session ID
  - user_id: Optional user ID
DELETE /api/chat/history: Clear conversation history for a session
- Parameters:
  - session_id: Session ID
  - user_id: Optional user ID
Example API Usage
Here's an example of how to use the API with curl:
# Search for content
curl -X GET "http://localhost:8001/api/search?query=pydantic&threshold=0.3&limit=5" -H "accept: application/json"
# Start a chat session
curl -X POST "http://localhost:8001/api/chat" \
-H "Content-Type: application/json" \
-d '{"message": "Tell me about pydantic", "user_id": "example_user"}'
# Continue the conversation with the same session ID
curl -X POST "http://localhost:8001/api/chat" \
-H "Content-Type: application/json" \
-d '{"message": "How do I use BaseModel?", "session_id": "SESSION_ID_FROM_PREVIOUS_RESPONSE", "user_id": "example_user"}'
Finished crawl example
License
This project is licensed under the MIT License - see the LICENSE file for details.