Dump's ingestion pipeline takes a URL, text, or image and transforms it into a searchable, categorized knowledge item.
Ingestion Flow
Source Detection
The URL is matched against known patterns (twitter.com, youtube.com, etc.) to determine the source type. Plain text and images are detected by format.
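The detection step can be sketched as a simple pattern table. This is a hypothetical illustration, not Dump's actual code; the type names, pattern list, and `detectSource` function are assumptions.

```typescript
// Hypothetical sketch of source detection. The pattern list and type names
// are assumptions; Dump's real matcher may differ.
type SourceType = "twitter" | "youtube" | "image" | "text" | "web";

const URL_PATTERNS: [RegExp, SourceType][] = [
  [/(^|\.)twitter\.com$|(^|\.)x\.com$/, "twitter"],
  [/(^|\.)youtube\.com$|(^|\.)youtu\.be$/, "youtube"],
];

function detectSource(input: { url?: string; text?: string; image?: string }): SourceType {
  if (input.url) {
    const host = new URL(input.url).hostname;
    for (const [pattern, type] of URL_PATTERNS) {
      if (pattern.test(host)) return type;
    }
    return "web"; // unrecognized URLs fall back to generic web extraction
  }
  if (input.image) return "image";
  return "text"; // plain text is the remaining case
}
```

Matching on the hostname (rather than the full URL) keeps tracking parameters and paths from affecting detection.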
Content Extraction
A source-specific extractor pulls title, content, author, media URLs, and metadata from the source.
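Each extractor returns the same shape regardless of source, which is what lets the rest of the pipeline stay source-agnostic. The interface below is a hypothetical sketch; the field and method names are assumptions based on the list above.

```typescript
// Hypothetical extractor contract; field names are assumptions based on the
// fields listed above (title, content, author, media URLs, metadata).
interface ExtractedContent {
  title: string;
  content: string;
  author?: string;             // not every source exposes an author
  mediaUrls: string[];
  metadata: Record<string, unknown>;
}

interface Extractor {
  extract(url: string): Promise<ExtractedContent>;
}

// A trivial stand-in extractor showing the shape of a result.
const plainTextExtractor: Extractor = {
  async extract(url: string): Promise<ExtractedContent> {
    return { title: url, content: "", mediaUrls: [], metadata: { source: "text" } };
  },
};
```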
AI Categorization
Gemini assigns a root category, subcategories, tags, summary, language, and entities.
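The categorization result can be thought of as a structured record. The interface below is a hypothetical sketch of that shape; the field names are assumptions derived from the list above, not Gemini's or Dump's actual schema.

```typescript
// Hypothetical shape of a categorization result; field names are assumptions.
interface Categorization {
  category: string;        // root category
  subcategories: string[];
  tags: string[];
  summary: string;
  language: string;        // e.g. an ISO 639-1 code like "en"
  entities: string[];
}

// Example of what a populated result might look like.
const example: Categorization = {
  category: "Video",
  subcategories: ["Music"],
  tags: ["music-video"],
  summary: "A music video.",
  language: "en",
  entities: ["Rick Astley"],
};
```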
Embedding
Content is vectorized using text-embedding-004 (768 dimensions) for semantic search.
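At query time, semantic search ranks items by how close their embedding vectors are to the query's embedding. Cosine similarity is the usual metric for this; the function below is a generic sketch (it works for any two equal-length vectors, including the 768-dimensional ones here), not Dump's actual search code.

```typescript
// Cosine similarity between two equal-length embedding vectors.
// Returns 1 for identical directions, 0 for orthogonal vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

In practice the comparison typically happens inside the database (pgvector on Supabase) rather than in application code, but the math is the same.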
Storage
The item is saved to Supabase with full-text indexes and vector embeddings.
API
POST /api/ingest
Request Body
{
  url?: string    // URL to extract content from
  text?: string   // Plain text to store directly
  image?: string  // Image URL or base64
  force?: boolean // Re-ingest even if URL already exists
}

At least one of url, text, or image must be provided. If url is given and already exists in the vault, the request is rejected unless force: true is set.
Example
curl -X POST http://localhost:3106/api/ingest \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_TOKEN" \
  -d '{"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}'

Duplicate Detection
Before processing, Dump normalizes the URL (strips tracking params, trailing slashes) and checks if it already exists in the user's vault. This prevents accidental duplicates while allowing forced re-ingestion when needed.
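The normalization described above can be sketched as follows. The tracking-parameter list here is an assumption for illustration; Dump's actual list may differ.

```typescript
// A minimal sketch of URL normalization for duplicate detection.
// The tracking-parameter list is an assumption, not Dump's actual list.
const TRACKING_PARAMS = [
  "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
  "fbclid", "gclid",
];

function normalizeUrl(raw: string): string {
  const url = new URL(raw);
  for (const param of TRACKING_PARAMS) url.searchParams.delete(param);
  url.hash = "";
  // Strip a trailing slash from the path, but keep a bare "/"
  if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}
```

Two submissions of the same article, one with `?utm_source=newsletter` and one without, then normalize to the same key and are treated as one item.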
Error Handling
If extraction or categorization fails, Dump falls back gracefully:
- Extraction failure: Returns error with source-specific message
- Categorization failure: Retries once, then assigns default category ("Reference")
- Embedding failure: Non-fatal, item is saved without vector (full-text search still works)
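The categorization fallback above (one retry, then a default category) can be sketched like this. `categorize` is a hypothetical stand-in for the Gemini call, injected so the retry logic is visible on its own.

```typescript
// Sketch of the categorization fallback: try, retry once, then default to
// "Reference". The `categorize` parameter stands in for the Gemini call.
async function categorizeWithFallback(
  content: string,
  categorize: (content: string) => Promise<string>,
): Promise<string> {
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      return await categorize(content);
    } catch {
      // first failure: loop retries once; second failure: fall through
    }
  }
  return "Reference"; // default root category when both attempts fail
}
```

The same shape works for the embedding step, except that its fallback is to save the item without a vector rather than to substitute a default value.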