Data ingestion
Every supported source — native parsers, AI-powered formats, and URL extraction — and what to expect from each.
Native parsers
These formats are parsed directly by ShipMCP. They're the fastest path: no model inference, no rendering, just stream → schema → Postgres.
| Format | Behaviour |
|---|---|
| .csv / .tsv | Headers infer column names. Types are detected from values (int, numeric, boolean, timestamp, text). |
| .json / .ndjson | Array of objects → table. Nested objects become JSONB columns. Arrays of mixed shapes are flattened with best-effort union types. |
| .sql | Postgres-flavoured dump. CREATE TABLE + INSERT statements are replayed against your dedicated Neon project. |
| .xlsx / .xls | Each sheet becomes a table. First row is treated as headers. |
| .md | Single-document corpus. Frontmatter, #tags, [links], and headings are extracted into documents / tags / links / headings tables. |
| .txt / .html | Treated as a markdown corpus after a lightweight conversion. HTML strips boilerplate; .txt becomes a single document. |
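To make the CSV row in the table above concrete, here is a minimal sketch of value-based type detection. It is not ShipMCP's actual parser: the precedence order (int, then numeric, boolean, timestamp, with text as the fallback), the rule that empty cells are ignored, and every function name are assumptions made for illustration.

```python
import csv
import io
import re
from datetime import datetime

INT_RE = re.compile(r"[+-]?\d+")
NUMERIC_RE = re.compile(r"[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?")


def _is_timestamp(value):
    # Assumption: only ISO 8601 strings count as timestamps in this sketch.
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False


# Checked in order; the first type that fits every value in a column wins.
CHECKS = [
    ("int", lambda v: INT_RE.fullmatch(v) is not None),
    ("numeric", lambda v: NUMERIC_RE.fullmatch(v) is not None),
    ("boolean", lambda v: v.lower() in {"true", "false"}),
    ("timestamp", _is_timestamp),
]


def infer_column_type(values):
    # Empty cells are skipped so that a missing value does not force "text".
    non_empty = [v.strip() for v in values if v.strip() != ""]
    if not non_empty:
        return "text"
    for name, check in CHECKS:
        if all(check(v) for v in non_empty):
            return name
    return "text"


def infer_schema(csv_text):
    rows = list(csv.reader(io.StringIO(csv_text)))
    headers, data = rows[0], rows[1:]
    return {
        header: infer_column_type([row[i] for row in data if i < len(row)])
        for i, header in enumerate(headers)
    }


sample = (
    "id,price,active,created,name\n"
    "1,9.5,true,2024-01-05,Ada\n"
    "2,3,false,2024-02-10T08:30:00,Bob\n"
)
schema = infer_schema(sample)
```

Note that a column mixing `9.5` and `3` resolves to numeric rather than int, because the type has to fit every value in the column.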
AI-powered documents
These run through Cloudflare Workers AI's toMarkdown binding before
being ingested as a Markdown corpus. The model converts each file into clean
markdown; from there the standard markdown ingest path applies (frontmatter, tags,
links, headings).
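The standard markdown ingest path mentioned above can be pictured roughly as follows. This is a sketch under stated assumptions, not the real implementation: the regular expressions, the simple `key: value` frontmatter parsing, and the shape of the returned rows are all hypothetical.

```python
import re

FRONTMATTER_RE = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$", re.MULTILINE)
# A tag is "#" followed directly by a letter, so "# Heading" is not a tag.
TAG_RE = re.compile(r"(?<![\w#])#([A-Za-z][\w-]*)")
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def extract_corpus(markdown):
    frontmatter = {}
    body = markdown
    match = FRONTMATTER_RE.match(markdown)
    if match:
        # Assumption: frontmatter is flat "key: value" pairs, not full YAML.
        for line in match.group(1).splitlines():
            key, sep, value = line.partition(":")
            if sep:
                frontmatter[key.strip()] = value.strip()
        body = markdown[match.end():]
    return {
        "documents": [{"frontmatter": frontmatter, "body": body}],
        "headings": [
            {"level": len(hashes), "text": text}
            for hashes, text in HEADING_RE.findall(body)
        ],
        "tags": sorted(set(TAG_RE.findall(body))),
        "links": [
            {"text": text, "target": target}
            for text, target in LINK_RE.findall(body)
        ],
    }


doc = (
    "---\n"
    "title: Notes\n"
    "---\n"
    "# Intro\n\n"
    "See [docs](https://example.com/docs) for #setup details.\n\n"
    "## Next steps\n"
)
corpus = extract_corpus(doc)
```

Each key of the returned dictionary corresponds to one of the four corpus tables (documents, tags, links, headings).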
Quality of extraction depends on the source. Born-digital PDFs and Word documents produce the cleanest markdown. Scanned images go through OCR — accuracy is good for printed text, lower for handwriting or low-resolution scans.
AI-powered audio
Audio files run through Cloudflare Workers AI's Whisper model
(@cf/openai/whisper) for speech-to-text. The transcript is wrapped
with the filename as an H1 heading and ingested as a single-document corpus —
same downstream treatment as a Markdown upload.
Whisper auto-detects the codec from file content. Transcription accuracy is high for clear single-speaker audio; quality drops with overlapping speakers, heavy accents, or background noise. Audio files cap at 25 MB per request — chunking for longer recordings is on the roadmap.
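As a rough illustration of the audio path, the sketch below checks the 25 MB cap and wraps a transcript with the filename as an H1 heading. The function names are hypothetical, and reading "25 MB" as 25 × 1024 × 1024 bytes is an assumption; the actual limit may be counted differently.

```python
# Assumption: "25 MB" is read as 25 MiB here; the real cutoff may differ.
MAX_AUDIO_BYTES = 25 * 1024 * 1024


def check_audio_size(size_bytes):
    # Reject oversized uploads before any transcription is attempted.
    if size_bytes > MAX_AUDIO_BYTES:
        raise ValueError(
            f"audio file is {size_bytes} bytes; the cap is {MAX_AUDIO_BYTES}"
        )


def transcript_to_markdown(filename, transcript):
    # The filename becomes the H1; the transcript is the document body.
    return f"# {filename}\n\n{transcript.strip()}\n"


check_audio_size(3 * 1024 * 1024)
markdown = transcript_to_markdown("standup.mp3", "  hello team  ")
```

The resulting string can then go through the same markdown ingest path as any uploaded .md file.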
Video isn't yet supported. No native ffmpeg in Workers means we can't extract audio or frames from video files. Tracked on the roadmap.
Workers AI quotas apply. AI-powered ingest counts against your Workers AI usage on the underlying Cloudflare account. For the hosted ShipMCP service, this is included in your plan; on Pro and Scale, expect headroom for hundreds of multi-page PDFs (or hours of audio) per month.
URL extraction
Pick a mode at upload time:
- Markdown — fetches one page through Cloudflare Browser Rendering, converts to clean markdown, parses the same corpus tables (documents, tags, links, headings).
- Crawl — same-origin breadth-first up to N pages (configurable per upload). Each page becomes one document row.
- JSON — uses Cloudflare's structured-extraction endpoint to coerce the page into typed records. You provide the table name; we infer the schema from the response.
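The Crawl mode in the list above describes a same-origin breadth-first traversal with a page budget. A minimal sketch of that idea follows; it is not the production crawler. The `get_links` callback stands in for whatever fetches and parses a page, and the fragment-stripping step is an assumption.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlsplit


def same_origin(a, b):
    pa, pb = urlsplit(a), urlsplit(b)
    return (pa.scheme, pa.netloc) == (pb.scheme, pb.netloc)


def crawl(start_url, get_links, max_pages):
    # get_links(url) is a hypothetical callback returning the hrefs on a page.
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            # Resolve relative links and drop "#fragment" so that
            # in-page anchors are not treated as separate pages.
            target, _fragment = urldefrag(urljoin(url, href))
            if target not in seen and same_origin(start_url, target):
                seen.add(target)
                queue.append(target)
    return visited


# A fake site, used in place of real network fetches.
site = {
    "https://example.com/": ["/a", "/b", "https://other.com/x"],
    "https://example.com/a": ["/c", "/#top"],
    "https://example.com/b": ["/a"],
    "https://example.com/c": [],
}
pages = crawl("https://example.com/", lambda u: site.get(u, []), max_pages=3)
```

Using a queue rather than a stack is what makes the traversal breadth-first: pages closest to the start URL are visited before deeper ones, so a small page budget covers the top of the site first.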
Browser Rendering is configured with waitUntil: "load" + a 3-second post-load
action timeout so SPAs (Notion sites, React landing pages) hydrate before extraction.
An empty-extraction guard rejects results shorter than 50 characters with a clear error pointing at
likely causes (auth wall, headless-browser block, JS-only site).
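The guard described above could look something like this. It is a sketch only: the exception name, the error wording, and the choice to strip whitespace before counting characters are assumptions.

```python
MIN_EXTRACTED_CHARS = 50


class EmptyExtractionError(ValueError):
    """Raised when a page yields too little text to be worth ingesting."""


def guard_extraction(url, markdown):
    # Assumption: surrounding whitespace does not count toward the minimum.
    text = markdown.strip()
    if len(text) < MIN_EXTRACTED_CHARS:
        raise EmptyExtractionError(
            f"{url} produced only {len(text)} characters. Likely causes: "
            "an auth wall, a block on headless browsers, or a JS-only site."
        )
    return text


ok = guard_extraction("https://example.com/", "Welcome to the docs. " * 5)
```

Failing early here avoids creating an endpoint whose only document is a login prompt or an empty shell page.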
What happens when you upload
Every ingest — file or URL, create or append — runs through the same seven-phase pipeline. The endpoint detail page shows live progress through these phases via the same component used in the dashboard. Each phase in one sentence:
- queued — your upload landed in R2 (for files) or your URL was accepted (for URL ingest); the job is sitting in the Cloudflare Queue waiting for a worker.
- fetching — for files: pulled from R2 staging to worker memory. For URLs: Browser Rendering is rendering the page (this is the slowest phase for crawls — typically 5-30s per page).
- provisioning_db — first-time only; creates a dedicated Neon Postgres project for this endpoint. Skipped on every append.
- loading_data — runs the appropriate ingest path: native parser (CSV / JSON / SQL / Excel / Markdown / TSV / TXT / HTML), Workers AI document conversion (PDF / DOCX / PPTX / images), Whisper turbo (audio), or Browser Rendering markdown/JSON/links extraction. Inserts rows into the per-endpoint Postgres.
- introspecting — connects to the Postgres project, reads information_schema, derives column types and FK relationships, and computes capabilities for every table. This is where schema-gen runs.
- publishing — generates the MCP tool manifest from the introspection result, writes tool_manifest to D1, builds llms.txt from the schema + tool catalog, writes that to D1 too, and emails you on first activation.
- active — endpoint is live. tools/list returns the new manifest; llms.txt serves at the public well-known paths; agents can connect.
When you append data to an existing endpoint, phase 3 (provisioning_db) is
skipped because the project already exists. When you toggle allow_writes or click Rebuild on the Tools list card, only phases
5 (introspecting) and 6 (publishing) run; nothing in your data changes.
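The phase rules above can be summarised in a few lines. The phase names come from this page; the job kinds ("create", "append", "rebuild") and the function itself are hypothetical labels used only for this sketch.

```python
PHASES = [
    "queued",
    "fetching",
    "provisioning_db",
    "loading_data",
    "introspecting",
    "publishing",
    "active",
]


def phases_for(job_kind):
    # job_kind values are illustrative names, not ShipMCP identifiers.
    if job_kind == "create":
        return list(PHASES)
    if job_kind == "append":
        # The Neon project already exists, so provisioning is skipped.
        return [p for p in PHASES if p != "provisioning_db"]
    if job_kind == "rebuild":
        # Toggling allow_writes or clicking Rebuild touches no data.
        return ["introspecting", "publishing"]
    raise ValueError(f"unknown job kind: {job_kind}")


append_phases = phases_for("append")
```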
Append-data flow
Endpoints are not immutable. From the endpoint detail page, click Add data to append more rows or pull in another related table. The append flow:
- Reuses the existing Neon project — no new database, no new endpoint URL.
- Loads the new rows into matching tables, or creates new tables if shapes differ.
- Re-runs schema introspection so newly-added columns become filterable.
- Regenerates llms.txt and the tool manifest.
The endpoint stays online throughout — agents see the new tools on their next tools/list call.
Limits & gotchas
- Per-file size — 25 MB on Free, 100 MB on Pro, 500 MB on Scale. SQL dumps can be larger; we stream them.
- Column count — soft cap of 200 columns per table. Wider tables split heuristically.
- Mixed-shape JSON arrays — we generate a union schema, but you'll get cleaner tools by normalizing first.
- Excel formulas — evaluated values are loaded; formula text is discarded.
- Crawl politeness — same-origin only, respects robots.txt, hard cap of 200 pages per job.
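For the mixed-shape JSON item in the list above, the sketch below shows what a union schema might look like and why normalizing first gives cleaner results. It is illustrative only: the type names, the nullable rule, and the treatment of arrays as JSONB are assumptions rather than documented behaviour.

```python
def type_name(value):
    # bool is checked before int because bool is a subclass of int in Python.
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "numeric"
    if isinstance(value, (dict, list)):
        # Nested objects become JSONB; treating arrays the same way is assumed.
        return "jsonb"
    return "text"


def union_schema(records):
    seen = {}
    for record in records:
        for key, value in record.items():
            seen.setdefault(key, set()).add(type_name(value))
    schema = {}
    for key, types in seen.items():
        missing_somewhere = any(key not in record for record in records)
        schema[key] = {
            "types": sorted(types - {"null"}),
            "nullable": missing_somewhere or "null" in types,
        }
    return schema


records = [
    {"id": 1, "name": "Ada"},
    {"id": "2", "meta": {"role": "admin"}},
]
schema = union_schema(records)
```

Here `id` ends up with two candidate types because one record stores it as a number and the other as a string. Normalizing the source so that every record uses the same keys and value types collapses each column to a single type.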