HYVE Docs

This guide walks through creating a new extractor to support an additional content source in Dump.

Overview

Dump's extractor system is a registry that maps each SourceType value to an Extractor implementation. Each extractor takes an input string and returns a promise that resolves to a standardized ExtractionResult.

Steps

Add the Source Type

Add your new source type to the SourceType union in src/types/index.ts:

src/types/index.ts
export type SourceType =
  | 'twitter'
  | 'youtube'
  | 'article'
  | 'instagram'
  | 'linkedin'
  | 'text'
  | 'image'
  | 'pdf'
  | 'reddit'
  | 'tiktok'  // <-- new source type

Add Detection Pattern

Register a URL pattern in src/lib/extractors/detect.ts. The order matters — first match wins.

src/lib/extractors/detect.ts
const SOURCE_PATTERNS: [RegExp, SourceType][] = [
  [/(?:twitter|x)\.com/i, 'twitter'],
  [/(?:youtube\.com|youtu\.be)/i, 'youtube'],
  [/instagram\.com/i, 'instagram'],
  [/linkedin\.com/i, 'linkedin'],
  [/(?:reddit\.com|redd\.it)/i, 'reddit'],
  [/tiktok\.com/i, 'tiktok'],  // <-- add before .pdf catch
  [/\.pdf($|\?)/i, 'pdf'],
  [/^https?:\/\//i, 'article'],
]

Place your pattern before the .pdf pattern and the generic article catch-all. A pattern listed after the catch-all is never reached, so its URLs are classified as articles.
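
To make the "first match wins" rule concrete, here is a minimal sketch of how the pattern list might be consumed. The detectSourceType name and the 'text' fallback for non-URL input are assumptions made for illustration; the real logic in detect.ts may differ.

```typescript
type SourceType =
  | 'twitter' | 'youtube' | 'article' | 'instagram' | 'linkedin'
  | 'text' | 'image' | 'pdf' | 'reddit' | 'tiktok'

const SOURCE_PATTERNS: [RegExp, SourceType][] = [
  [/(?:twitter|x)\.com/i, 'twitter'],
  [/(?:youtube\.com|youtu\.be)/i, 'youtube'],
  [/instagram\.com/i, 'instagram'],
  [/linkedin\.com/i, 'linkedin'],
  [/(?:reddit\.com|redd\.it)/i, 'reddit'],
  [/tiktok\.com/i, 'tiktok'],
  [/\.pdf($|\?)/i, 'pdf'],
  [/^https?:\/\//i, 'article'],
]

// Hypothetical helper: returns the type paired with the first matching pattern.
// Falling back to 'text' when nothing matches is an assumption.
function detectSourceType(input: string): SourceType {
  for (const [pattern, type] of SOURCE_PATTERNS) {
    if (pattern.test(input)) return type
  }
  return 'text'
}
```

Because the loop returns on the first hit, a TikTok URL only reaches the tiktok entry if no earlier pattern matches it, and it would never reach an entry placed after the article catch-all.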

Create the Extractor

Create a new file src/lib/extractors/tiktok.ts:

src/lib/extractors/tiktok.ts
import type { ExtractionResult } from '@/types'

export async function extractTikTok(
  input: string
): Promise<ExtractionResult> {
  // Your extraction logic here
  // Fetch the TikTok URL, parse the content

  return {
    title: 'Video Title',
    content: 'Extracted content...',
    author: '@username',
    author_url: 'https://tiktok.com/@username',
    media_urls: ['https://...'],
    source_date: null,
    source_type: 'tiktok',
    raw_content: { /* raw API response */ },
    metadata: { /* additional metadata */ },
  }
}

The ExtractionResult interface requires these fields:

Field        Type            Description
title        string | null   Content title
content      string          Main text content (required)
author       string | null   Author name
author_url   string | null   Author profile URL
media_urls   string[]        Array of media (images, videos)
source_date  string | null   Original publication date
source_type  SourceType      Must match your type
raw_content  Record          Raw extraction data
metadata     Record          Additional metadata
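
As a sketch of how these fields might be filled, the helper below maps an already-fetched payload onto an ExtractionResult. The TikTokPayload shape and the toExtractionResult name are hypothetical and chosen only for illustration; they do not describe a documented TikTok response. Typing raw_content and metadata as Record<string, unknown> is likewise an assumption.

```typescript
// Local copy of the result shape so the sketch is self-contained.
// In the project this type is imported from '@/types'.
interface ExtractionResult {
  title: string | null
  content: string
  author: string | null
  author_url: string | null
  media_urls: string[]
  source_date: string | null
  source_type: 'tiktok'
  raw_content: Record<string, unknown>
  metadata: Record<string, unknown>
}

// Hypothetical payload shape, not a documented API response.
type TikTokPayload = {
  title?: string
  author_name?: string
  author_url?: string
  thumbnail_url?: string
}

function toExtractionResult(payload: TikTokPayload, url: string): ExtractionResult {
  return {
    title: payload.title ?? null,
    // content is the only non-nullable text field, so fall back to an empty string
    content: payload.title ?? '',
    author: payload.author_name ?? null,
    author_url: payload.author_url ?? null,
    media_urls: payload.thumbnail_url ? [payload.thumbnail_url] : [],
    source_date: null, // not available in this sketch
    source_type: 'tiktok',
    raw_content: { ...payload }, // keep the untouched payload for debugging
    metadata: { url },
  }
}
```

Nullable fields are set to null rather than omitted, so every key in the table is always present on the result.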

Register the Extractor

Add your extractor to the registry in src/lib/extractors/index.ts:

src/lib/extractors/index.ts
import { extractTikTok } from './tiktok'

const tiktokExtractor: Extractor = { extract: extractTikTok }

const extractors: Record<SourceType, Extractor> = {
  // ... existing extractors
  tiktok: tiktokExtractor,
}
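
Because the registry is typed as Record<SourceType, Extractor>, TypeScript reports an error when a SourceType member has no entry, which is why this step cannot be skipped. The sketch below shows the idea with a reduced SourceType and a trimmed ExtractionResult; both reductions and the runExtractor name are assumptions for illustration, not Dump's actual code.

```typescript
// Reduced types so the sketch is self-contained; the real ones live in the project.
type SourceType = 'text' | 'tiktok'
interface ExtractionResult { content: string; source_type: SourceType }
interface Extractor { extract: (input: string) => Promise<ExtractionResult> }

const textExtractor: Extractor = {
  extract: async (input) => ({ content: input, source_type: 'text' }),
}
const tiktokExtractor: Extractor = {
  extract: async (input) => ({ content: `TikTok video at ${input}`, source_type: 'tiktok' }),
}

// Leaving out a key here is a compile-time error: every SourceType needs an entry.
const extractors: Record<SourceType, Extractor> = {
  text: textExtractor,
  tiktok: tiktokExtractor,
}

// Hypothetical dispatcher: look up the extractor for a detected type and run it.
async function runExtractor(type: SourceType, input: string): Promise<ExtractionResult> {
  return extractors[type].extract(input)
}
```

A caller would pair this with source detection, for example runExtractor('tiktok', url), and receive the same result shape regardless of source.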

Create a Card Component (Optional)

If the source needs custom rendering, create a card in src/components/cards/:

src/components/cards/TikTokCard.tsx
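import type { DumpItem } from '@/types' // assumed export location; adjust if DumpItem lives elsewhere

export function TikTokCard({ item }: { item: DumpItem }) {
  return (
    <div>{/* Your card markup, rendered from item */}</div>
  )
}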

Then register it in DumpCard.tsx's source-type switch.

Test

  1. Start the dev server: pnpm --filter dump dev
  2. Ingest a URL matching your pattern
  3. Verify the item appears with correct source type
  4. Check that categorization and search work for the new item

Tips

  • Keep extractors focused on a single source type
  • Return as much content as possible in the content field (used for search and categorization)
  • Use raw_content to preserve the original API response for debugging
  • Throw an error when extraction fails; the ingest route handles the error response
  • For sources requiring a proxy (e.g., anti-bot protection), see vps.ts for the VPS extractor pattern
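
The tip about throwing on failure can be sketched as follows. The fetchHtml parameter is a hypothetical injected fetcher, used so the logic runs without network access, and ExtractionError and extractContent are illustrative names rather than part of Dump.

```typescript
// Illustrative error type; the project may simply use Error.
class ExtractionError extends Error {
  url: string
  constructor(message: string, url: string) {
    super(message)
    this.name = 'ExtractionError'
    this.url = url
  }
}

// Hypothetical fetcher signature, injected so the logic can be exercised offline.
type FetchHtml = (url: string) => Promise<{ ok: boolean; status: number; body: string }>

async function extractContent(url: string, fetchHtml: FetchHtml): Promise<string> {
  const res = await fetchHtml(url)
  if (!res.ok) {
    // Throw instead of returning a partial result; the caller decides how to respond.
    throw new ExtractionError(`Request failed with status ${res.status}`, url)
  }
  const content = res.body.trim()
  if (content.length === 0) {
    throw new ExtractionError('No extractable content', url)
  }
  return content
}
```

Throwing keeps the extractor free of response-formatting concerns: it either returns usable content or fails loudly with the URL attached.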
