Adding a New Extractor

This guide walks through creating a new extractor to support an additional content source in Dump.

Overview

Dump's extractor system is a registry mapping SourceType values to Extractor implementations. Each extractor takes an input string and returns a standardized ExtractionResult.
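
As a rough sketch of that shape: the types below are trimmed-down stand-ins, not the project's real definitions, and the two sample extractors and the `getExtractor` helper are hypothetical names used only for illustration.

```typescript
// Simplified stand-ins for the real types; only two source types are shown.
type SourceType = 'text' | 'article'

interface ExtractionResult {
  title: string | null
  content: string
  source_type: SourceType
}

// Every extractor exposes the same async entry point.
interface Extractor {
  extract: (input: string) => Promise<ExtractionResult>
}

// Hypothetical extractors, used only to populate the registry below.
const textExtractor: Extractor = {
  extract: async (input: string): Promise<ExtractionResult> => ({
    title: null,
    content: input,
    source_type: 'text',
  }),
}
const articleExtractor: Extractor = {
  extract: async (input: string): Promise<ExtractionResult> => ({
    title: input,
    content: '',
    source_type: 'article',
  }),
}

// The registry: exactly one Extractor per SourceType value.
const extractors: Record<SourceType, Extractor> = {
  text: textExtractor,
  article: articleExtractor,
}

// Hypothetical lookup helper: resolve the extractor for a detected source type.
function getExtractor(type: SourceType): Extractor {
  return extractors[type]
}
```

Because the registry is typed as a Record keyed by SourceType, TypeScript reports a missing-property error if a member of the union has no entry.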

Steps

Add the Source Type

Add your new source type to the SourceType union in src/types/index.ts:

src/types/index.ts

export type SourceType =
  | 'twitter'
  | 'youtube'
  | 'article'
  | 'instagram'
  | 'linkedin'
  | 'text'
  | 'image'
  | 'pdf'
  | 'reddit'
  | 'tiktok'  // <-- new source type

Add Detection Pattern

Add a detection pattern for your source to SOURCE_PATTERNS in src/lib/extractors/detect.ts:

src/lib/extractors/detect.ts

const SOURCE_PATTERNS: [RegExp, SourceType][] = [
  [/(?:twitter|x)\.com/i, 'twitter'],
  [/(?:youtube\.com|youtu\.be)/i, 'youtube'],
  [/instagram\.com/i, 'instagram'],
  [/linkedin\.com/i, 'linkedin'],
  [/(?:reddit\.com|redd\.it)/i, 'reddit'],
  [/tiktok\.com/i, 'tiktok'],  // <-- add before .pdf catch
  [/\.pdf($|\?)/i, 'pdf'],
  [/^https?:\/\//i, 'article'],
]

Place your pattern before both the .pdf pattern and the generic article catch-all; otherwise your URLs will be classified as articles.
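
To see why the order matters, here is a minimal sketch of a first-match-wins detector. It assumes the table is scanned top to bottom; the `detectSourceType` name and the 'text' fallback are assumptions for illustration, not necessarily what detect.ts does.

```typescript
type SourceType = 'tiktok' | 'pdf' | 'article' | 'text'

// Trimmed copy of the pattern table; specific patterns come before the catch-all.
const SOURCE_PATTERNS: [RegExp, SourceType][] = [
  [/tiktok\.com/i, 'tiktok'],
  [/\.pdf($|\?)/i, 'pdf'],
  [/^https?:\/\//i, 'article'],
]

// Hypothetical detector: returns the type of the first matching pattern and
// falls back to 'text' when nothing matches (assumed fallback).
function detectSourceType(input: string): SourceType {
  for (const [pattern, type] of SOURCE_PATTERNS) {
    if (pattern.test(input)) return type
  }
  return 'text'
}
```

With this ordering, `detectSourceType('https://www.tiktok.com/@user/video/123')` returns 'tiktok'. If the TikTok entry were moved below the `^https?://` entry, the same URL would match the catch-all first and come back as 'article'.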

Create the Extractor

Create a new file src/lib/extractors/tiktok.ts:

src/lib/extractors/tiktok.ts

import type { ExtractionResult } from '@/types'

export async function extractTikTok(
  input: string
): Promise<ExtractionResult> {
  // Your extraction logic here
  // Fetch the TikTok URL, parse the content

  return {
    title: 'Video Title',
    content: 'Extracted content...',
    author: '@username',
    author_url: 'https://tiktok.com/@username',
    media_urls: ['https://...'],
    source_date: null,
    source_type: 'tiktok',
    raw_content: { /* raw API response */ },
    metadata: { /* additional metadata */ },
  }
}

The ExtractionResult interface requires these fields:

Field	Type	Description
`title`	`string \| null`	Content title
`content`	`string`	Main text content (required)
`author`	`string \| null`	Author name
`author_url`	`string \| null`	Author profile URL
`media_urls`	`string[]`	Array of media URLs (images, videos)
`source_date`	`string \| null`	Original publication date
`source_type`	`SourceType`	Must match your type
`raw_content`	`Record`	Raw extraction data
`metadata`	`Record`	Additional metadata
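
One way to fill in the stub is to separate fetching from mapping. The sketch below assumes the data comes from TikTok's oEmbed endpoint and that the payload carries `title`, `author_name`, `author_url`, and `thumbnail_url`; the payload type, the `toExtractionResult` helper, the injected `fetchJson` parameter, and the `Record<string, unknown>` type arguments are all assumptions made for illustration.

```typescript
// Local stand-in for ExtractionResult, narrowed to this source type.
interface ExtractionResult {
  title: string | null
  content: string
  author: string | null
  author_url: string | null
  media_urls: string[]
  source_date: string | null
  source_type: 'tiktok'
  raw_content: Record<string, unknown>
  metadata: Record<string, unknown>
}

// Assumed oEmbed-style payload; every field is treated as optional.
type OEmbedPayload = {
  title?: string
  author_name?: string
  author_url?: string
  thumbnail_url?: string
}

// Pure mapping step: payload in, ExtractionResult out. Throws when there is
// no caption, since content is the one field that cannot be null.
function toExtractionResult(url: string, payload: OEmbedPayload): ExtractionResult {
  const title = payload.title ?? null
  if (!title) {
    throw new Error(`No caption found for ${url}`)
  }
  return {
    title,
    content: title,
    author: payload.author_name ?? null,
    author_url: payload.author_url ?? null,
    media_urls: payload.thumbnail_url ? [payload.thumbnail_url] : [],
    source_date: null,
    source_type: 'tiktok',
    raw_content: { ...payload },
    metadata: { url },
  }
}

// Hypothetical wiring: fetchJson is injected so the mapping can be exercised offline.
async function extractTikTok(
  input: string,
  fetchJson: (url: string) => Promise<OEmbedPayload>
): Promise<ExtractionResult> {
  const endpoint = `https://www.tiktok.com/oembed?url=${encodeURIComponent(input)}`
  return toExtractionResult(input, await fetchJson(endpoint))
}
```

The second parameter exists only to keep the sketch testable; an extractor registered in the registry would take just the input string and perform the request itself.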

Register the Extractor

Add your extractor to the registry in src/lib/extractors/index.ts:

src/lib/extractors/index.ts

import { extractTikTok } from './tiktok'

const tiktokExtractor: Extractor = { extract: extractTikTok }

const extractors: Record<SourceType, Extractor> = {
  // ... existing extractors
  tiktok: tiktokExtractor,
}

Create a Card Component (Optional)

If the source needs custom rendering, create a card in src/components/cards/:

src/components/cards/TikTokCard.tsx

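import type { DumpItem } from '@/types' // assumed location of DumpItem

export function TikTokCard({ item }: { item: DumpItem }) {
  return (
    <div>
      {/* Your card markup, rendered from item */}
    </div>
  )
}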

Then register it in DumpCard.tsx's source-type switch.

Test

Start the dev server: pnpm --filter dump dev
Ingest a URL matching your pattern
Verify the item appears with the correct source type
Check that categorization and search work for the new item

Tips

Keep extractors focused on a single source type
Return as much content as possible in the content field (used for search and categorization)
Use raw_content to preserve the original API response for debugging
Extraction failures should throw errors — the ingest route handles error responses
For sources requiring a proxy (e.g., anti-bot protection), see vps.ts for the VPS extractor pattern
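
The tip about throwing on failure could look like the sketch below. Both helpers are hypothetical: `requireContent` is a guard an extractor might call, and `toErrorResponse` only stands in for whatever the ingest route actually does with a thrown error (the 500 status is an assumption).

```typescript
// Hypothetical guard: throw rather than return a result with empty content.
function requireContent(content: string | undefined, sourceUrl: string): string {
  const trimmed = content?.trim()
  if (!trimmed) {
    throw new Error(`No content could be extracted from ${sourceUrl}`)
  }
  return trimmed
}

// Stand-in for the caller: converts a thrown value into an error payload.
function toErrorResponse(err: unknown): { status: number; error: string } {
  const message = err instanceof Error ? err.message : 'Unknown extraction error'
  return { status: 500, error: message }
}
```

Keeping the guard inside the extractor means callers never have to check for a half-filled result; they either get usable content or an exception.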
