Dump automatically detects the source type of each input and routes it to the matching extractor. Detection relies on pattern matching against the input: the URL host, a file extension, or the absence of a URL.
Detection Priority
Patterns are checked in priority order, and the first match wins:
| Priority | Pattern | Source Type |
|---|---|---|
| 1 | twitter.com, x.com | twitter |
| 2 | youtube.com, youtu.be | youtube |
| 3 | instagram.com | instagram |
| 4 | linkedin.com | linkedin |
| 5 | reddit.com, redd.it | reddit |
| 6 | .pdf in URL | pdf |
| 7 | Any other http:// or https:// | article |
| 8 | Image extension (.png, .jpg, .webp) | image |
| 9 | Plain text (no URL) | text |
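The priority table can be sketched as a single function. This is a minimal illustration, not Dump's actual implementation: the names `detectSourceType`, `HOST_RULES`, and `hostMatches` are hypothetical, and the sketch assumes that URLs always carry an explicit `http://` or `https://` scheme and that subdomains (for example `old.reddit.com`) should match their parent domain.

```typescript
type SourceType =
  | "twitter" | "youtube" | "instagram" | "linkedin" | "reddit"
  | "pdf" | "article" | "image" | "text";

// Rows 1-5 of the table, in priority order (hypothetical structure).
const HOST_RULES: Array<[string[], SourceType]> = [
  [["twitter.com", "x.com"], "twitter"],
  [["youtube.com", "youtu.be"], "youtube"],
  [["instagram.com"], "instagram"],
  [["linkedin.com"], "linkedin"],
  [["reddit.com", "redd.it"], "reddit"],
];

// Extensions taken from row 8 of the table.
const IMAGE_EXTENSION = /\.(png|jpg|webp)$/i;

// Returns a parsed URL only for http(s) inputs, otherwise null.
function parseHttpUrl(input: string): URL | null {
  try {
    const url = new URL(input.trim());
    return url.protocol === "http:" || url.protocol === "https:" ? url : null;
  } catch {
    return null;
  }
}

// True for the domain itself or any subdomain, but not for look-alikes
// such as "netflix.com" when the rule is "x.com".
function hostMatches(hostname: string, domain: string): boolean {
  return hostname === domain || hostname.endsWith("." + domain);
}

function detectSourceType(input: string): SourceType {
  const url = parseHttpUrl(input);
  if (url !== null) {
    for (const [domains, type] of HOST_RULES) {
      if (domains.some((d) => hostMatches(url.hostname, d))) return type;
    }
    if (url.href.toLowerCase().includes(".pdf")) return "pdf"; // row 6
    return "article"; // row 7: any other http(s) URL
  }
  if (IMAGE_EXTENSION.test(input.trim())) return "image"; // row 8
  return "text"; // row 9
}
```

Because the sketch follows the table order literally, an `https://` URL that ends in `.png` is classified as `article`, and the image rule only applies to inputs that are not http(s) URLs, such as a bare file name.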
Extractor Details
Twitter/X
Extracts tweet content, author, and media. On Vercel (serverless), extraction goes through a VPS proxy with the Bird CLI. When run locally, it uses direct extraction.
Extracts: Tweet text, author, author URL, media URLs, thread content.
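The environment split could look like the sketch below. How Dump actually detects its environment is not stated here; the sketch assumes the `VERCEL` system environment variable, which Vercel sets to `"1"` on its deployments. The names `TwitterStrategy` and `pickTwitterStrategy` are hypothetical.

```typescript
type TwitterStrategy = "vps-proxy" | "direct";

// Assumption: a Vercel deployment is recognized by VERCEL === "1".
// The env object is passed in so the function stays easy to test.
function pickTwitterStrategy(
  env: Record<string, string | undefined>
): TwitterStrategy {
  return env.VERCEL === "1" ? "vps-proxy" : "direct";
}
```

In a Node process this would typically be called as `pickTwitterStrategy(process.env)`.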
YouTube
Extracts video transcript using the youtube-transcript library.
Extracts: Video title, transcript text, channel info.
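Transcripts usually arrive as a list of timed segments rather than one string. The sketch below shows one way to flatten such a list into transcript text. The `TranscriptSegment` shape and the `joinTranscript` helper are assumptions for illustration; the youtube-transcript library's `YoutubeTranscript.fetchTranscript` call is understood to resolve to an array of similar objects, but that should be checked against the installed version.

```typescript
// Assumed shape of one transcript segment; only `text` is used here.
interface TranscriptSegment {
  text: string;
  offset?: number;
  duration?: number;
}

// Collapses whitespace inside each segment, drops empty segments,
// and joins the rest with single spaces.
function joinTranscript(segments: TranscriptSegment[]): string {
  return segments
    .map((segment) => segment.text.replace(/\s+/g, " ").trim())
    .filter((text) => text.length > 0)
    .join(" ");
}
```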
Instagram
Uses the oEmbed API plus Open Graph meta tags for extraction.
Extracts: Post caption, author, media URLs, thumbnail.
Instagram extraction uses oEmbed + OG tags because VPS-based extraction returned empty results.
LinkedIn
Parses Open Graph meta tags with a Googlebot user agent to access public post data.
Extracts: Post content, author name, author URL.
LinkedIn extraction uses Googlebot UA because VPS-based extraction returned empty results.
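A rough sketch of the Open Graph approach follows. The helper names (`parseOpenGraph`, `fetchOpenGraph`, `readAttribute`) are hypothetical, and the regex-based parsing is a simplification: it does not decode HTML entities and would miss tags whose attribute values contain `>`. A real implementation would more likely use an HTML parser. The user agent string is the commonly published Googlebot identifier.

```typescript
const GOOGLEBOT_UA =
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

// Reads a quoted attribute value from a single tag, or undefined if absent.
function readAttribute(tag: string, name: string): string | undefined {
  const pattern = new RegExp(`\\b${name}\\s*=\\s*(?:"([^"]*)"|'([^']*)')`, "i");
  const m = pattern.exec(tag);
  return m ? (m[1] ?? m[2]) : undefined;
}

// Collects og:* properties from <meta> tags; the first occurrence wins.
function parseOpenGraph(html: string): Record<string, string> {
  const tags: Record<string, string> = {};
  const metaTag = /<meta\b[^>]*>/gi;
  let match: RegExpExecArray | null;
  while ((match = metaTag.exec(html)) !== null) {
    const property = readAttribute(match[0], "property");
    const content = readAttribute(match[0], "content");
    if (property === undefined || content === undefined) continue;
    if (!property.startsWith("og:")) continue;
    if (!(property in tags)) tags[property] = content;
  }
  return tags;
}

// Fetches a page while identifying as Googlebot, then parses its OG tags.
async function fetchOpenGraph(url: string): Promise<Record<string, string>> {
  const response = await fetch(url, { headers: { "User-Agent": GOOGLEBOT_UA } });
  if (!response.ok) throw new Error(`Request failed with status ${response.status}`);
  return parseOpenGraph(await response.text());
}
```

Which Open Graph properties map to post content, author name, and author URL is not specified here, so the sketch returns every `og:*` property and leaves that mapping to the caller.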
Reddit
Uses Reddit's JSON API directly (appending .json to the URL).
Extracts: Post title, body text, author, subreddit, comments.
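The `.json` approach can be illustrated as follows. Both helpers (`toRedditJsonUrl`, `extractRedditPost`) and the `RedditPost` type are hypothetical. The field names (`title`, `selftext`, `author`, `subreddit`, `body`) reflect the listing format Reddit's public JSON endpoints are known to return for a post page, a two-element array with the post first and its comments second, but treat them as assumptions to verify against a live response.

```typescript
interface RedditPost {
  title: string;
  body: string;
  author: string;
  subreddit: string;
  comments: string[];
}

// Appends ".json" to the path while keeping any query string intact.
function toRedditJsonUrl(postUrl: string): string {
  const url = new URL(postUrl);
  const path = url.pathname.replace(/\/+$/, ""); // drop trailing slashes
  url.pathname = path.endsWith(".json") ? path : path + ".json";
  return url.toString();
}

// Pulls the documented fields out of an already-parsed JSON payload.
// Only top-level comments are collected; nested replies are ignored.
function extractRedditPost(payload: unknown): RedditPost {
  const listings = payload as any[];
  const post = listings[0].data.children[0].data;
  const comments: string[] = listings[1].data.children
    .map((child: any) => child.data.body)
    .filter((body: unknown) => typeof body === "string");
  return {
    title: post.title,
    body: post.selftext ?? "",
    author: post.author,
    subreddit: post.subreddit,
    comments,
  };
}
```

Short `redd.it` links redirect to the full post URL, so they would need to be resolved before the suffix is appended.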
PDF
Parses PDF documents using the pdf-parse library.
Extracts: Full text content, metadata (title, author, page count).
Article
Generic web page extractor for any URL not matching other patterns. Extracts main content using readability algorithms.
Extracts: Title, main content, author, description, Open Graph metadata.
Image
Processes image files via VPS proxy. Supports PNG, JPEG, GIF, WebP, SVG, BMP, ICO.
Extracts: Image metadata, OCR text (when available).
Text
Stores plain text input directly without extraction.
Extracts: The input text as-is, with no transformation.