Source Types

Dump automatically detects the source type from the input and routes it to the appropriate extractor. Detection is based on URL pattern matching.

Detection Priority

Source detection follows a priority order: patterns are checked in sequence, and the first matching pattern wins.
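A minimal sketch of first-match-wins detection is shown here to make the idea concrete. The type names, the regular expressions, and the order of the rules are all assumptions made for illustration; they are not taken from Dump's actual pattern table.

```typescript
// Hypothetical source-type names, for illustration only.
type SourceType =
  | "twitter" | "youtube" | "instagram" | "linkedin" | "reddit"
  | "pdf" | "image" | "web" | "text";

// Assumed patterns and assumed order. Rules are tried top to bottom,
// so an earlier rule shadows any later rule that would also match.
const RULES: Array<[SourceType, RegExp]> = [
  ["twitter", /^https?:\/\/(www\.)?(twitter|x)\.com\//i],
  ["youtube", /^https?:\/\/((www|m)\.)?(youtube\.com|youtu\.be)\//i],
  ["instagram", /^https?:\/\/(www\.)?instagram\.com\//i],
  ["linkedin", /^https?:\/\/(www\.)?linkedin\.com\//i],
  ["reddit", /^https?:\/\/((www|old)\.)?reddit\.com\//i],
  ["pdf", /^https?:\/\/[^?#]+\.pdf([?#]|$)/i],
  ["image", /^https?:\/\/[^?#]+\.(png|jpe?g|gif|webp|svg|bmp|ico)([?#]|$)/i],
  ["web", /^https?:\/\//i], // generic fallback for any other URL
];

function detectSourceType(input: string): SourceType {
  const trimmed = input.trim();
  for (const [type, pattern] of RULES) {
    if (pattern.test(trimmed)) {
      return type; // first match wins
    }
  }
  return "text"; // not a URL: treat as plain text
}
```

Because the generic URL rule sits last, it only catches URLs that no more specific rule has claimed, and anything that is not a URL at all falls through to plain text.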

Twitter/X

Extracts tweet content, author, and media. When deployed on Vercel (serverless), extraction goes through a VPS proxy running Bird CLI. When running locally, it uses direct extraction.

Extracts: Tweet text, author, author URL, media URLs, thread content.
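The environment-dependent routing described above could be sketched as follows. This is an assumption about how the switch might be made, not Dump's actual code: the function name and strategy labels are hypothetical, and the check relies on the `VERCEL` system environment variable that Vercel sets to `"1"` in its deployments.

```typescript
// Hypothetical strategy labels, for illustration only.
type TweetStrategy = "vps-proxy" | "direct";

// `env` is passed in to keep the function easy to test;
// in a Node.js application it would typically be process.env.
function chooseTweetStrategy(
  env: Record<string, string | undefined>
): TweetStrategy {
  // Assumption: a Vercel deployment is detected via VERCEL === "1".
  return env["VERCEL"] === "1" ? "vps-proxy" : "direct";
}
```

Passing the environment as an argument, rather than reading it globally, lets both branches be exercised without deploying.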

YouTube

Extracts the video transcript using the youtube-transcript library.

Extracts: Video title, transcript text, channel info.

Instagram

Uses the oEmbed API and Open Graph meta tags for extraction.

Extracts: Post caption, author, media URLs, thumbnail.

Instagram extraction uses oEmbed and Open Graph tags because VPS-based extraction returned empty results.
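To illustrate the Open Graph half of this approach, here is a minimal sketch of pulling `og:*` values out of a page's HTML. The function name is hypothetical, and the regex-based parsing is a simplification chosen to keep the example dependency-free; it does not decode HTML entities, and a real HTML parser would be more robust.

```typescript
// Collects Open Graph properties (og:title, og:image, ...) from raw HTML.
// Keys are stored without the "og:" prefix; the first occurrence wins.
function parseOpenGraph(html: string): Map<string, string> {
  const tags = new Map<string, string>();
  const metaTag = /<meta\b[^>]*>/gi;
  let match: RegExpExecArray | null;

  while ((match = metaTag.exec(html)) !== null) {
    const tag = match[0];
    // Attributes may appear in either order and use either quote style.
    const property = /\bproperty\s*=\s*["']og:([^"']+)["']/i.exec(tag);
    const content = /\bcontent\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(tag);
    const key = property ? property[1] : undefined;
    if (key !== undefined && content !== null && !tags.has(key)) {
      tags.set(key, content[1] ?? content[2] ?? "");
    }
  }
  return tags;
}
```

For a page containing `<meta property="og:title" content="Hello">`, `parseOpenGraph(html).get("title")` yields `"Hello"`.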

LinkedIn

Parses Open Graph meta tags with a Googlebot user agent to access public post data.

Extracts: Post content, author name, author URL.

LinkedIn extraction uses Googlebot UA because VPS-based extraction returned empty results.
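A request with a Googlebot user agent might look like the sketch below. The function names are hypothetical. The user-agent string is the one Google publishes for its desktop crawler, but whether Dump sends exactly this string is an assumption. The example presumes a server-side runtime with a global `fetch`, such as Node.js 18 or later.

```typescript
// Publicly documented Googlebot user-agent string (assumed to be the one used).
const GOOGLEBOT_UA =
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

// Builds the request headers separately so they can be inspected or reused.
function googlebotHeaders(): { "User-Agent": string } {
  return { "User-Agent": GOOGLEBOT_UA };
}

// Fetches a public page as HTML; the caller would then parse its
// Open Graph meta tags for post content and author details.
async function fetchAsGooglebot(url: string): Promise<string> {
  const response = await fetch(url, { headers: googlebotHeaders() });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}
```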

Reddit

Uses Reddit's JSON API directly (appending .json to the URL).

Extracts: Post title, body text, author, subreddit, comments.
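The `.json` approach can be sketched in two small steps: building the JSON URL and reading fields from the response. Both function names are hypothetical. The field names (`title`, `selftext`, `author`, `subreddit`, `body`) follow the listing structure Reddit returns for a post page, but treat the exact shape as an assumption to verify against a live response.

```typescript
// Appends ".json" to the post path, preserving any query string.
function toRedditJsonUrl(postUrl: string): string {
  const url = new URL(postUrl);
  const path = url.pathname.replace(/\/+$/, ""); // drop trailing slashes
  url.pathname = path.endsWith(".json") ? path : `${path}.json`;
  return url.toString();
}

interface RedditPost {
  title: string;
  body: string;
  author: string;
  subreddit: string;
  comments: string[];
}

// Assumed response shape: [postListing, commentListing].
// `any` is used because the payload is untyped JSON.
function parseRedditPost(json: any): RedditPost {
  const post = json[0].data.children[0].data;
  const comments: string[] = json[1].data.children
    .filter((child: any) => child.kind === "t1") // skip "more" placeholders
    .map((child: any) => String(child.data.body));
  return {
    title: post.title,
    body: post.selftext,
    author: post.author,
    subreddit: post.subreddit,
    comments,
  };
}
```

Only top-level comments are collected here; nested replies would require walking each comment's `replies` listing.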

PDF

Parses PDF documents using the pdf-parse library.

Extracts: Full text content, metadata (title, author, page count).

Web Page

Generic web page extractor for any URL not matching other patterns. Extracts main content using readability algorithms.

Extracts: Title, main content, author, description, Open Graph metadata.

Image

Processes image files via the VPS proxy. Supports PNG, JPEG, GIF, WebP, SVG, BMP, and ICO.

Extracts: Image metadata, OCR text (when available).
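One way to check an input against the supported formats is by file extension, as in the sketch below. The function name and the extension list (including treating both `.jpg` and `.jpeg` as JPEG) are assumptions for illustration; Dump may instead inspect content types or file signatures.

```typescript
// Assumed mapping of the supported formats to file extensions.
const SUPPORTED_IMAGE_EXTENSIONS = [
  "png", "jpg", "jpeg", "gif", "webp", "svg", "bmp", "ico",
];

function isSupportedImage(urlOrPath: string): boolean {
  // Ignore any query string or fragment before reading the extension.
  const withoutQuery = urlOrPath.split(/[?#]/)[0] ?? "";
  const dot = withoutQuery.lastIndexOf(".");
  if (dot === -1) {
    return false;
  }
  const extension = withoutQuery.slice(dot + 1).toLowerCase();
  return SUPPORTED_IMAGE_EXTENSIONS.indexOf(extension) !== -1;
}
```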

Text

Stores plain text input directly without extraction.

Extracts: The input text as-is, with no transformation.