This guide walks through creating a new extractor to support an additional content source in Dump.
Overview
Dump's extractor system is a registry mapping SourceType values to Extractor implementations. Each extractor takes an input string and returns a standardized ExtractionResult.
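The registry idea can be sketched in a few lines. This is an illustrative reduction, not Dump's actual code: the two-member SourceType union, the trimmed ExtractionResult, and the runExtractor helper are assumptions made to keep the example self-contained.

```typescript
// Trimmed stand-ins for the real types (assumed shapes, for illustration only).
type SourceType = 'text' | 'tiktok'

interface ExtractionResult {
  content: string
  source_type: SourceType
}

interface Extractor {
  extract(input: string): Promise<ExtractionResult>
}

// Record<SourceType, Extractor> requires exactly one entry per source type.
const extractors: Record<SourceType, Extractor> = {
  text: {
    extract: async (input) => ({ content: input, source_type: 'text' }),
  },
  tiktok: {
    extract: async (input) => ({
      content: `TikTok video at ${input}`,
      source_type: 'tiktok',
    }),
  },
}

// Hypothetical dispatch helper: look up the extractor and run it.
async function runExtractor(
  type: SourceType,
  input: string
): Promise<ExtractionResult> {
  return extractors[type].extract(input)
}
```

Because the registry is typed as Record<SourceType, Extractor>, adding a member to the union without registering an extractor for it is a compile-time error, which is why the steps below touch both places.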
Steps
Add the Source Type
Add your new source type to the SourceType union in src/types/index.ts:
export type SourceType =
  | 'twitter'
  | 'youtube'
  | 'article'
  | 'instagram'
  | 'linkedin'
  | 'text'
  | 'image'
  | 'pdf'
  | 'reddit'
  | 'tiktok' // <-- new source type

Add Detection Pattern
Register a URL pattern in src/lib/extractors/detect.ts. Order matters: the first matching pattern wins.
const SOURCE_PATTERNS: [RegExp, SourceType][] = [
  [/(?:twitter|x)\.com/i, 'twitter'],
  [/(?:youtube\.com|youtu\.be)/i, 'youtube'],
  [/instagram\.com/i, 'instagram'],
  [/linkedin\.com/i, 'linkedin'],
  [/(?:reddit\.com|redd\.it)/i, 'reddit'],
  [/tiktok\.com/i, 'tiktok'], // <-- add before .pdf catch
  [/\.pdf($|\?)/i, 'pdf'],
  [/^https?:\/\//i, 'article'],
]

Place your pattern before the generic article catch-all and the .pdf pattern; otherwise your URLs will be classified as articles.
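To see why ordering matters, here is a minimal sketch of first-match-wins detection. The detectSourceType helper, the reduced pattern lists, and the 'text' fallback for input that matches nothing are assumptions for illustration; the real logic in detect.ts may differ.

```typescript
type SourceType = 'tiktok' | 'pdf' | 'article' | 'text'
type Pattern = [RegExp, SourceType]

// Hypothetical helper: return the type of the first pattern that matches.
function detectSourceType(input: string, patterns: Pattern[]): SourceType {
  for (const [pattern, type] of patterns) {
    if (pattern.test(input)) return type
  }
  return 'text' // assumed fallback for input that matches no pattern
}

// Specific pattern first, catch-all last.
const correctOrder: Pattern[] = [
  [/tiktok\.com/i, 'tiktok'],
  [/\.pdf($|\?)/i, 'pdf'],
  [/^https?:\/\//i, 'article'],
]

// Catch-all first: it shadows every pattern after it.
const wrongOrder: Pattern[] = [
  [/^https?:\/\//i, 'article'],
  [/tiktok\.com/i, 'tiktok'],
]

const url = 'https://www.tiktok.com/@username/video/123'
console.log(detectSourceType(url, correctOrder)) // tiktok
console.log(detectSourceType(url, wrongOrder)) // article
```

With the catch-all listed first, the TikTok pattern is never reached, so the URL silently falls through to the article extractor.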
Create the Extractor
Create a new file src/lib/extractors/tiktok.ts:
import type { ExtractionResult } from '@/types'

export async function extractTikTok(
  input: string
): Promise<ExtractionResult> {
  // Your extraction logic here
  // Fetch the TikTok URL, parse the content
  return {
    title: 'Video Title',
    content: 'Extracted content...',
    author: '@username',
    author_url: 'https://tiktok.com/@username',
    media_urls: ['https://...'],
    source_date: null,
    source_type: 'tiktok',
    raw_content: { /* raw API response */ },
    metadata: { /* additional metadata */ },
  }
}

The ExtractionResult interface requires these fields:
| Field | Type | Description |
|---|---|---|
| title | string \| null | Content title |
| content | string | Main text content (required) |
| author | string \| null | Author name |
| author_url | string \| null | Author profile URL |
| media_urls | string[] | Array of media (images, videos) |
| source_date | string \| null | Original publication date |
| source_type | SourceType | Must match your type |
| raw_content | Record | Raw extraction data |
| metadata | Record | Additional metadata |
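As one possible shape for the extraction logic, the sketch below fills every field of the table from Open Graph meta tags in an HTML string. Everything here is an assumption for illustration: the readMeta and buildResult helpers are hypothetical, Record<string, unknown> is a guess at the Record type parameters, and whether a given TikTok page exposes these tags would need to be checked. Regex-based HTML parsing is also fragile; a real extractor would more likely use an HTML parser or an API.

```typescript
type SourceType = 'tiktok' // trimmed for illustration

interface ExtractionResult {
  title: string | null
  content: string
  author: string | null
  author_url: string | null
  media_urls: string[]
  source_date: string | null
  source_type: SourceType
  raw_content: Record<string, unknown> // type parameters assumed
  metadata: Record<string, unknown> // type parameters assumed
}

// Hypothetical helper: read one Open Graph <meta> tag from an HTML string.
function readMeta(html: string, property: string): string | null {
  const pattern = new RegExp(
    `<meta[^>]+property=["']${property}["'][^>]+content=(["'])(.*?)\\1`,
    'i'
  )
  const match = html.match(pattern)
  return match?.[2] ?? null
}

// Hypothetical helper: map already-fetched HTML onto an ExtractionResult.
function buildResult(html: string, url: string): ExtractionResult {
  const description = readMeta(html, 'og:description')
  if (!description) {
    // content is required, so fail loudly rather than return an empty item
    throw new Error(`No content found at ${url}`)
  }
  const image = readMeta(html, 'og:image')
  return {
    title: readMeta(html, 'og:title'),
    content: description,
    author: null,
    author_url: null,
    media_urls: image ? [image] : [],
    source_date: null,
    source_type: 'tiktok',
    raw_content: { html },
    metadata: { url },
  }
}
```

Keeping the mapping in a pure function, separate from the network fetch, makes it possible to test the field mapping against saved HTML without hitting the source site.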
Register the Extractor
Add your extractor to the registry in src/lib/extractors/index.ts:
import { extractTikTok } from './tiktok'

const tiktokExtractor: Extractor = { extract: extractTikTok }

const extractors: Record<SourceType, Extractor> = {
  // ... existing extractors
  tiktok: tiktokExtractor,
}

Create a Card Component (Optional)
If the source needs custom rendering, create a card in src/components/cards/:
export function TikTokCard({ item }: { item: DumpItem }) {
  return (
    <div>{/* Your card component */}</div>
  )
}

Then register it in DumpCard.tsx's source-type switch.
Test
- Start the dev server:
  pnpm --filter dump dev
- Ingest a URL matching your pattern
- Verify the item appears with the correct source type
- Check that categorization and search work
Tips
- Keep extractors focused on a single source type
- Return as much content as possible in the content field (used for search and categorization)
- Use raw_content to preserve the original API response for debugging
- Extraction failures should throw errors; the ingest route handles error responses
- For sources requiring a proxy (e.g., anti-bot protection), see vps.ts for the VPS extractor pattern
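The tip about throwing on failure can be sketched as follows. The ExtractionError class and the assertOk helper are hypothetical names introduced for illustration; the guide only states that extractors should throw and that the ingest route handles the error response.

```typescript
// Hypothetical error type carrying the URL that failed.
class ExtractionError extends Error {
  url: string

  constructor(message: string, url: string) {
    super(message)
    this.name = 'ExtractionError'
    this.url = url
  }
}

// Hypothetical guard: throw instead of returning a partial result
// when the upstream response is not a 2xx status.
function assertOk(status: number, url: string): void {
  if (status < 200 || status >= 300) {
    throw new ExtractionError(`Request failed with status ${status}`, url)
  }
}

// Inside an extractor this might be called right after fetching:
//   const response = await fetch(input)
//   assertOk(response.status, input)
```

Throwing early keeps error formatting out of the extractor, leaving it to the caller to decide how to report the failure.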