Dump's ingestion pipeline takes a URL, text, or image and transforms it into a searchable, categorized knowledge item.
Ingestion Flow
Source Detection
The URL is matched against known patterns (twitter.com, youtube.com, etc.) to determine the source type. Plain text and images are detected by format.
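The detection step can be sketched as a simple pattern table. This is a hypothetical illustration, not Dump's actual code; the type names, pattern list, and `detectSource` function are assumptions.

```typescript
// Hypothetical sketch of source detection. The pattern list and type names
// are assumptions; Dump's real matcher may differ.
type SourceType = "twitter" | "youtube" | "image" | "text" | "web";

const URL_PATTERNS: [RegExp, SourceType][] = [
  [/(^|\.)twitter\.com$|(^|\.)x\.com$/, "twitter"],
  [/(^|\.)youtube\.com$|(^|\.)youtu\.be$/, "youtube"],
];

function detectSource(input: { url?: string; text?: string; image?: string }): SourceType {
  if (input.url) {
    const host = new URL(input.url).hostname;
    for (const [pattern, type] of URL_PATTERNS) {
      if (pattern.test(host)) return type;
    }
    return "web"; // unrecognized URLs fall back to generic web extraction
  }
  if (input.image) return "image";
  return "text"; // plain text is the remaining case
}
```

Matching on the hostname (rather than the full URL) keeps tracking parameters and paths from affecting detection.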
Content Extraction
A source-specific extractor pulls title, content, author, media URLs, and metadata from the source.
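Each extractor returns the same shape regardless of source, which is what lets the rest of the pipeline stay source-agnostic. The interface below is a hypothetical sketch; the field and method names are assumptions based on the list above.

```typescript
// Hypothetical extractor contract; field names are assumptions based on the
// fields listed above (title, content, author, media URLs, metadata).
interface ExtractedContent {
  title: string;
  content: string;
  author?: string;             // not every source exposes an author
  mediaUrls: string[];
  metadata: Record<string, unknown>;
}

interface Extractor {
  extract(url: string): Promise<ExtractedContent>;
}

// A trivial stand-in extractor showing the shape of a result.
const plainTextExtractor: Extractor = {
  async extract(url: string): Promise<ExtractedContent> {
    return { title: url, content: "", mediaUrls: [], metadata: { source: "text" } };
  },
};
```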
AI Categorization
Gemini assigns a root category, subcategories, tags, summary, language, and entities.
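The categorization result can be thought of as a structured record. The interface below is a hypothetical sketch of that shape; the field names are assumptions derived from the list above, not Gemini's or Dump's actual schema.

```typescript
// Hypothetical shape of a categorization result; field names are assumptions.
interface Categorization {
  category: string;        // root category
  subcategories: string[];
  tags: string[];
  summary: string;
  language: string;        // e.g. an ISO 639-1 code like "en"
  entities: string[];
}

// Example of what a populated result might look like.
const example: Categorization = {
  category: "Video",
  subcategories: ["Music"],
  tags: ["music-video"],
  summary: "A music video.",
  language: "en",
  entities: ["Rick Astley"],
};
```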
Embedding
Content is vectorized using text-embedding-004 (768 dimensions) for semantic search.
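At query time, semantic search ranks items by how close their embedding vectors are to the query's embedding. Cosine similarity is the usual metric for this; the function below is a generic sketch (it works for any two equal-length vectors, including the 768-dimensional ones here), not Dump's actual search code.

```typescript
// Cosine similarity between two equal-length embedding vectors.
// Returns 1 for identical directions, 0 for orthogonal vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

In practice the comparison typically happens inside the database (pgvector on Supabase) rather than in application code, but the math is the same.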
Storage
The item is saved to Supabase with full-text indexes and vector embeddings.
API
POST /api/ingest
Request Body
{
  url?: string    // URL to extract content from
  text?: string   // Plain text to store directly
  image?: string  // Image URL or base64
  force?: boolean // Re-ingest even if URL already exists
}

At least one of url, text, or image must be provided. If url is given and already exists in the vault, the request is rejected unless force: true is set.
Example
curl -X POST http://localhost:3106/api/ingest \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_TOKEN" \
  -d '{"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}'

Duplicate Detection
Before processing, Dump normalizes the URL (strips tracking params, trailing slashes) and checks if it already exists in the user's vault. This prevents accidental duplicates while allowing forced re-ingestion when needed.
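The normalization described above can be sketched as follows. The tracking-parameter list here is an assumption for illustration; Dump's actual list may differ.

```typescript
// A minimal sketch of URL normalization for duplicate detection.
// The tracking-parameter list is an assumption, not Dump's actual list.
const TRACKING_PARAMS = [
  "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
  "fbclid", "gclid",
];

function normalizeUrl(raw: string): string {
  const url = new URL(raw);
  for (const param of TRACKING_PARAMS) url.searchParams.delete(param);
  url.hash = "";
  // Strip a trailing slash from the path, but keep a bare "/"
  if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}
```

Two submissions of the same article, one with `?utm_source=newsletter` and one without, then normalize to the same key and are treated as one item.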
Error Handling
If extraction or categorization fails, Dump falls back gracefully:
- Extraction failure: Returns error with source-specific message
- Categorization failure: Retries once, then assigns default category ("Reference")
- Embedding failure: Non-fatal, item is saved without vector (full-text search still works)
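The categorization fallback above (one retry, then a default category) can be sketched like this. `categorize` is a hypothetical stand-in for the Gemini call, injected so the retry logic is visible on its own.

```typescript
// Sketch of the categorization fallback: try, retry once, then default to
// "Reference". The `categorize` parameter stands in for the Gemini call.
async function categorizeWithFallback(
  content: string,
  categorize: (content: string) => Promise<string>,
): Promise<string> {
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      return await categorize(content);
    } catch {
      // first failure: loop retries once; second failure: fall through
    }
  }
  return "Reference"; // default root category when both attempts fail
}
```

The same shape works for the embedding step, except that its fallback is to save the item without a vector rather than to substitute a default value.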