Data ingestion

Every supported source — native parsers, AI-powered formats, and URL extraction — and what to expect from each.

Native parsers

These formats are parsed directly by ShipMCP. They're the fastest path: no model inference, no rendering, just stream → schema → Postgres.

Format	Behaviour
`.csv` / `.tsv`	Headers infer column names. Types are detected from values (int, numeric, boolean, timestamp, text).
`.json` / `.ndjson`	Array of objects → table. Nested objects become JSONB columns. Arrays of mixed shapes are flattened with best-effort union types.
`.sql`	Postgres-flavoured dump. `CREATE TABLE` + `INSERT` statements are replayed against your dedicated Neon project.
`.xlsx` / `.xls`	Each sheet becomes a table. First row is treated as headers.
`.md`	Single-document corpus. Frontmatter, `#tags`, `[links]`, and headings are extracted into `documents` / `tags` / `links` / `headings` tables.
`.txt` / `.html`	Treated as a markdown corpus after a lightweight conversion. HTML strips boilerplate; `.txt` becomes a single document.
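
As a rough illustration of value-based type detection for CSV columns, the sketch below picks the narrowest of the five types listed above that fits every non-empty value. The function name `infer_column_type`, the order of the checks, and the accepted literals are assumptions for illustration; this is not ShipMCP's actual parser.

```python
from datetime import datetime


def infer_column_type(values):
    """Return the narrowest type label that fits every non-empty value.

    Hypothetical helper: relies on Python's own parsers, which are more
    permissive than a production CSV loader would likely be.
    """
    def is_int(v):
        try:
            int(v)
            return True
        except ValueError:
            return False

    def is_numeric(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    def is_boolean(v):
        # Assumption: only the literals true/false count as booleans.
        return v.lower() in {"true", "false"}

    def is_timestamp(v):
        try:
            datetime.fromisoformat(v)
            return True
        except ValueError:
            return False

    checks = [
        ("int", is_int),
        ("numeric", is_numeric),
        ("boolean", is_boolean),
        ("timestamp", is_timestamp),
    ]
    present = [v.strip() for v in values if v.strip()]
    if not present:
        return "text"  # nothing to infer from
    for name, check in checks:
        if all(check(v) for v in present):
            return name
    return "text"  # fallback when no narrower type fits every value
```

A single non-conforming value is enough to widen the column to text in this sketch, which is one reason cleaning a column before upload tends to produce better-typed tables.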

AI-powered documents

These run through Cloudflare Workers AI's toMarkdown binding before being ingested as a Markdown corpus. The model converts each file into clean markdown; from there the standard markdown ingest path applies (frontmatter, tags, links, headings).

PDF, DOCX, Excel macro, Excel binary, ET sheet, ODS, ODT, Numbers, XML, PNG, JPG, WebP, SVG

Quality of extraction depends on the source. Born-digital PDFs and Word documents produce the cleanest markdown. Scanned images go through OCR — accuracy is good for printed text, lower for handwriting or low-resolution scans.
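
To make the markdown ingest path more concrete, here is a minimal sketch of pulling frontmatter, tags, links, and headings out of one document. The function name `extract_corpus`, the regular expressions, and the assumption that links are standard `[text](url)` markdown links are all illustrative; the real extractor may use different rules.

```python
import re


def extract_corpus(markdown):
    """Split one markdown document into frontmatter, headings, tags, links.

    Hypothetical helper: handles only simple `key: value` frontmatter.
    """
    frontmatter = {}
    body = markdown
    match = re.match(r"^---\n(.*?)\n---\n", markdown, re.DOTALL)
    if match:
        for line in match.group(1).splitlines():
            key, sep, value = line.partition(":")
            if sep:
                frontmatter[key.strip()] = value.strip()
        body = markdown[match.end():]

    # Headings: 1-6 '#' characters at line start, then whitespace and text.
    headings = [
        (len(m.group(1)), m.group(2).strip())
        for m in re.finditer(r"^(#{1,6})\s+(.+)$", body, re.MULTILINE)
    ]
    # Tags: '#' immediately followed by a letter, not preceded by a word
    # character or another '#', so heading markers are not counted.
    tags = sorted(set(re.findall(r"(?<![\w#])#([A-Za-z][\w-]*)", body)))
    # Links: standard inline markdown links as (text, target) pairs.
    links = re.findall(r"\[([^\]]+)\]\(([^)]+)\)", body)
    return {
        "frontmatter": frontmatter,
        "headings": headings,
        "tags": tags,
        "links": links,
    }


doc = (
    "---\ntitle: Notes\n---\n"
    "# Intro\n\nSee [docs](https://example.com) #alpha and #beta.\n\n"
    "## Details\n"
)
result = extract_corpus(doc)
```

Each of the four keys in the result corresponds to one of the corpus tables named earlier (`documents`, `tags`, `links`, `headings`), with the frontmatter attached to the document row in this sketch.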

AI-powered audio

Audio files run through Cloudflare Workers AI's Whisper model (@cf/openai/whisper) for speech-to-text. The transcript is wrapped with the filename as an H1 heading and ingested as a single-document corpus — same downstream treatment as a Markdown upload.

MP3, WAV, M4A, OGG, FLAC

Whisper auto-detects the codec from file content. Transcription accuracy is high for clear single-speaker audio; quality drops with overlapping speakers, heavy accents, or background noise. Audio files cap at 25 MB per request — chunking for longer recordings is on the roadmap.
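
The transcript wrapping and the size cap described above could look roughly like this. The function name `transcript_to_markdown` and the reading of "25 MB" as 25 × 1024 × 1024 bytes are assumptions; the Whisper call itself is left out, so the transcript is passed in as plain text.

```python
# Assumption: "25 MB" is interpreted as 25 MiB for this sketch.
MAX_AUDIO_BYTES = 25 * 1024 * 1024


def transcript_to_markdown(filename, transcript, size_bytes):
    """Wrap a transcript as a single markdown document titled by filename."""
    if size_bytes > MAX_AUDIO_BYTES:
        raise ValueError(
            f"{filename} is {size_bytes} bytes; audio is capped at "
            f"{MAX_AUDIO_BYTES} bytes per request"
        )
    # The filename becomes the H1 heading; the transcript is the body.
    return f"# {filename}\n\n{transcript.strip()}\n"
```

From here the output would follow the same path as any uploaded Markdown file.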

Video isn't yet supported. No native ffmpeg in Workers means we can't extract audio or frames from video files. Tracked on the roadmap.

Workers AI quotas apply. AI-powered ingest counts against your Workers AI usage on the underlying Cloudflare account. For the hosted ShipMCP service, this is included in your plan; on Pro and Scale, expect headroom for hundreds of multi-page PDFs (or hours of audio) per month.

URL extraction

Pick a mode at upload time:

Markdown — fetches one page through Cloudflare Browser Rendering, converts to clean markdown, parses the same corpus tables (documents, tags, links, headings).
Crawl — same-origin breadth-first up to N pages (configurable per upload). Each page becomes one document row.
JSON — uses Cloudflare's structured-extraction endpoint to coerce the page into typed records. You provide the table name; we infer the schema from the response.
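
For Crawl mode, the same-origin breadth-first traversal can be sketched without any network access by walking an in-memory link graph. The function name `crawl_order` and the `links_by_page` mapping are hypothetical stand-ins for real page fetches, and robots.txt handling is not modelled here.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlsplit


def crawl_order(start_url, links_by_page, max_pages):
    """Return the pages a same-origin BFS would visit, in order.

    links_by_page maps a URL to the hrefs found on that page; it stands
    in for rendering and parsing each page.
    """
    origin = urlsplit(start_url)[:2]  # (scheme, host[:port])
    queue = deque([start_url])
    seen = {start_url}
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for href in links_by_page.get(url, []):
            # Resolve relative links and drop #fragments before comparing.
            target, _fragment = urldefrag(urljoin(url, href))
            if urlsplit(target)[:2] == origin and target not in seen:
                seen.add(target)
                queue.append(target)
    return order


site = {
    "https://a.test/": ["/docs", "https://b.test/x", "/blog#top"],
    "https://a.test/docs": ["/", "/docs/api"],
    "https://a.test/blog": ["/docs/api"],
}
```

With this graph, the off-origin link to `b.test` is never queued, and a page limit of 3 stops the crawl before `/docs/api` is reached.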

Browser Rendering is configured with waitUntil: "load" plus a 3-second post-load action timeout, so SPAs (Notion sites, React landing pages) can hydrate before extraction. An empty-extraction guard rejects results shorter than 50 characters with a clear error that points at the likely causes (auth wall, headless-browser blocking, JS-only site).
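
A minimal sketch of the empty-extraction guard follows. The names `EmptyExtractionError` and `check_extraction` are hypothetical, and trimming whitespace before counting characters is an assumption about how the 50-character threshold is applied.

```python
MIN_EXTRACTED_CHARS = 50  # threshold stated in the text above


class EmptyExtractionError(ValueError):
    """Raised when a page yields too little text to be a real extraction."""


def check_extraction(url, markdown):
    """Return the trimmed markdown, or raise if it is suspiciously short."""
    text = markdown.strip()  # assumption: whitespace does not count
    if len(text) < MIN_EXTRACTED_CHARS:
        raise EmptyExtractionError(
            f"Extracted only {len(text)} characters from {url}. Likely "
            "causes: auth wall, headless-browser blocking, or a JS-only site."
        )
    return text
```

Failing early here keeps a near-empty page from becoming a near-empty document row further down the pipeline.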

What happens when you upload

Every ingest — file or URL, create or append — runs through the same seven-phase pipeline. The endpoint detail page shows live progress through these phases via the same component used in the dashboard. Each phase in one sentence:

queued — your upload landed in R2 (for files) or your URL was accepted (for URL ingest); the job is sitting in the Cloudflare Queue waiting for a worker.
fetching — for files: pulled from R2 staging to worker memory. For URLs: Browser Rendering is rendering the page (this is the slowest phase for crawls — typically 5-30s per page).
provisioning_db — first-time only; creates a dedicated Neon Postgres project for this endpoint. Skipped on every append.
loading_data — runs the appropriate ingest path: native parser (CSV / TSV / JSON / SQL / Excel / Markdown / TXT / HTML), Workers AI document conversion (PDF / DOCX / PPTX / images), Whisper (audio), or Browser Rendering markdown/JSON/links extraction. Inserts rows into the per-endpoint Postgres.
introspecting — connects to the Postgres project, reads information_schema, derives column types and FK relationships, and computes capabilities for every table. This is where schema-gen runs.
publishing — generates the MCP tool manifest from the introspection result, writes tool_manifest to D1, builds llms.txt from the schema + tool catalog, writes that to D1 too, and emails you on first activation.
active — endpoint is live. tools/list returns the new manifest; llms.txt serves at the public well-known paths; agents can connect.

When you append data to an existing endpoint, phase 3 (provisioning_db) is skipped because the project already exists. When you toggle allow_writes or click Rebuild on the Tools list card, only phases 5 (introspecting) and 6 (publishing) run; nothing in your data changes.
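
The phase-skipping rules can be summarised in a few lines. The operation names `create`, `append`, and `rebuild` and the function `phases_for` are hypothetical labels chosen for this sketch; only the phase names come from the list above.

```python
PHASES = [
    "queued",
    "fetching",
    "provisioning_db",
    "loading_data",
    "introspecting",
    "publishing",
    "active",
]


def phases_for(operation):
    """Return the phases that run for a given (hypothetical) operation."""
    if operation == "create":
        return list(PHASES)
    if operation == "append":
        # The Neon project already exists, so provisioning is skipped.
        return [p for p in PHASES if p != "provisioning_db"]
    if operation == "rebuild":
        # Toggling allow_writes or clicking Rebuild touches no data.
        return ["introspecting", "publishing"]
    raise ValueError(f"unknown operation: {operation}")
```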

Append-data flow

Endpoints are not immutable. From the endpoint detail page, click Add data to append more rows or pull in another related table. The append flow:

Reuses the existing Neon project — no new database, no new endpoint URL.
Loads the new rows into matching tables, or creates new tables if shapes differ.
Re-runs schema introspection so newly added columns become filterable.
Regenerates llms.txt and the tool manifest.

The endpoint stays online throughout — agents see the new tools on their next tools/list call.

Limits & gotchas

Per-file size — 25 MB on Free, 100 MB on Pro, 500 MB on Scale. SQL dumps can be larger; we stream them.
Column count — soft cap of 200 columns per table. Wider tables split heuristically.
Mixed-shape JSON arrays — we generate a union schema, but you'll get cleaner tools by normalizing first.
Excel formulas — evaluated values are loaded; formula text is discarded.
Crawl politeness — same-origin only, respects robots.txt, hard cap of 200 pages per job.
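
If you want to pre-check uploads on your side before sending them, the size and crawl limits above translate into something like the following. The helper names are hypothetical, "MB" is treated as 1024 × 1024 bytes by assumption, and SQL dumps are simply exempted because the text does not state how much larger they may be.

```python
MB = 1024 * 1024  # assumption: the documented "MB" is treated as MiB

FILE_LIMIT_MB = {"free": 25, "pro": 100, "scale": 500}
MAX_CRAWL_PAGES = 200  # hard cap per crawl job


def file_size_ok(plan, filename, size_bytes):
    """Check a file against the per-plan size limit."""
    if filename.lower().endswith(".sql"):
        # SQL dumps are streamed and may be larger; no limit is given,
        # so this sketch does not enforce one.
        return True
    return size_bytes <= FILE_LIMIT_MB[plan.lower()] * MB


def clamp_crawl_pages(requested):
    """Cap a requested crawl size at the documented hard limit."""
    return min(requested, MAX_CRAWL_PAGES)
```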
