Structured JSON Extraction
Extract structured data from unstructured text with high reliability. Use Zod schemas to validate and type the model's output.
model-agnostic · extraction · json · zod · typescript · parsing · Updated 2026-04-23
System Prompt
You are a data extraction engine. Extract information from the provided text
and return it as valid JSON matching the schema described by the user.
## Rules
- Return ONLY valid JSON — no markdown, no code fences, no explanation.
- If a field is not present in the text, use null when the schema marks the field
  as nullable, or omit the key when the schema marks it as optional.
- Do not invent values. If the text is ambiguous, use null.
- Dates must be in ISO 8601 format (YYYY-MM-DD).
- Numbers must be numbers, not strings.
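The date rule can also be checked on the client, since a prompt instruction alone does not guarantee the format. The following is a minimal sketch, not part of the original recipe: `isIsoDate` is a hypothetical helper name, and it accepts only the plain `YYYY-MM-DD` calendar form named in the rule.

```typescript
// Hypothetical helper: strict check for the YYYY-MM-DD form required by the system prompt.
function isIsoDate(value: string): boolean {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
  if (!match) return false;
  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  // Round-trip through a UTC Date to reject impossible dates such as 2024-02-30.
  // Note: years 0000-0099 are rejected because Date.UTC maps them to 1900-1999.
  const date = new Date(Date.UTC(year, month - 1, day));
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}
```

With Zod, a check like this could be attached to a date field through `refine`, for example `z.string().refine(isIsoDate).nullable()`.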
User Prompt Template
Extract the following fields from the text below.
Schema:
{{SCHEMA_DESCRIPTION}}
Text:
{{INPUT_TEXT}}
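Filling the template can be sketched as a small pure function. The helper name `renderUserPrompt` is an assumption for illustration; the template text is the one shown above.

```typescript
// Template text copied from the user prompt template in this recipe.
const USER_PROMPT_TEMPLATE = [
  "Extract the following fields from the text below.",
  "",
  "Schema:",
  "{{SCHEMA_DESCRIPTION}}",
  "",
  "Text:",
  "{{INPUT_TEXT}}",
].join("\n");

// Hypothetical helper: fills both placeholders in a single pass.
function renderUserPrompt(schemaDescription: string, inputText: string): string {
  const values: Record<string, string> = {
    SCHEMA_DESCRIPTION: schemaDescription,
    INPUT_TEXT: inputText,
  };
  // A replacer function keeps "$&"-style sequences in the input literal, and a
  // single pass means placeholders that appear inside the input are not expanded.
  return USER_PROMPT_TEMPLATE.replace(
    /\{\{(SCHEMA_DESCRIPTION|INPUT_TEXT)\}\}/g,
    (_match: string, key: string) => values[key] ?? "",
  );
}
```

Using one pass with a replacer function, rather than two chained `replace` calls with string replacements, is a deliberate choice here: document text is untrusted input and may itself contain `{{INPUT_TEXT}}` or `$` sequences.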
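Full Implementation (TypeScript + Zod)
typescript
import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";

// The system prompt from the "System Prompt" section.
const STRUCTURED_JSON_SYSTEM_PROMPT = `You are a data extraction engine. Extract information from the provided text
and return it as valid JSON matching the schema described by the user.
## Rules
- Return ONLY valid JSON: no markdown, no code fences, no explanation.
- If a field is not present in the text, use null when the schema marks the field
  as nullable, or omit the key when the schema marks it as optional.
- Do not invent values. If the text is ambiguous, use null.
- Dates must be in ISO 8601 format (YYYY-MM-DD).
- Numbers must be numbers, not strings.`;

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function extractStructured<T extends z.ZodTypeAny>(
  text: string,
  schema: T,
  schemaDescription: string,
): Promise<z.infer<T>> {
  const response = await client.messages.create({
    model: "claude-3-5-haiku-latest", // Haiku is sufficient for simple extraction
    max_tokens: 2048,
    system: STRUCTURED_JSON_SYSTEM_PROMPT,
    messages: [
      {
        role: "user",
        content: `Extract the following fields from the text below.\n\nSchema:\n${schemaDescription}\n\nText:\n${text}`,
      },
    ],
  });

  // Concatenate all text blocks of the reply into one string.
  const rawText = response.content
    .filter((b): b is Anthropic.TextBlock => b.type === "text")
    .map((b) => b.text)
    .join("")
    .trim();

  // Strip accidental markdown code fences
  const cleaned = rawText
    .replace(/^```(?:json)?\n?/, "")
    .replace(/\n?```$/, "")
    .trim();

  // JSON.parse throws on malformed JSON; schema.parse throws a ZodError on a schema mismatch.
  const parsed = JSON.parse(cleaned) as unknown;
  return schema.parse(parsed);
}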
// --- Usage example ---
const InvoiceSchema = z.object({
  invoice_number: z.string(),
  date: z.string().nullable(),
  vendor: z.string(),
  total_amount: z.number().nullable(),
  currency: z.string().default("USD"),
  line_items: z.array(
    z.object({
      description: z.string(),
      quantity: z.number().nullable(),
      unit_price: z.number().nullable(),
    }),
  ),
});

// Hand-written description sent to the model; keep it in sync with InvoiceSchema.
const schemaDescription = `
{
  invoice_number: string,
  date: string | null, // ISO 8601 format
  vendor: string,
  total_amount: number | null,
  currency: string, // default "USD"
  line_items: Array<{
    description: string,
    quantity: number | null,
    unit_price: number | null
  }>
}`;

// rawInvoiceText is the plain-text invoice to process, assumed to be defined elsewhere.
const invoice = await extractStructured(
  rawInvoiceText,
  InvoiceSchema,
  schemaDescription,
);
console.log(invoice.total_amount); // number | null, fully typed
Handling Parsing Errors
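A retry is more likely to succeed when the model can see why its previous reply was rejected. The sketch below is one way to assemble such a follow-up conversation. It is an illustration under assumptions, not part of the recipe: `buildRetryMessages` and `ChatMessage` are hypothetical names, and using it would require adapting `extractStructured` to accept a message list and to expose the raw reply text.

```typescript
// Minimal message shape, matching the { role, content } objects used in this recipe.
type ChatMessage = { role: "user" | "assistant"; content: string };

// Hypothetical helper: builds the conversation for a retry that includes the failure reason.
function buildRetryMessages(
  originalPrompt: string,
  rejectedOutput: string,
  error: unknown,
): ChatMessage[] {
  const reason = error instanceof Error ? error.message : String(error);
  const feedback =
    "Your previous reply could not be accepted.\n" +
    `Error: ${reason}\n` +
    "Return the corrected JSON only.";

  // An empty assistant turn is not a valid message, so fold the feedback into the prompt.
  if (rejectedOutput.trim() === "") {
    return [{ role: "user", content: `${originalPrompt}\n\n${feedback}` }];
  }
  return [
    { role: "user", content: originalPrompt },
    { role: "assistant", content: rejectedOutput },
    { role: "user", content: feedback },
  ];
}
```

For a `ZodError`, the message lists the failing paths, which tends to be specific enough for the model to correct a single field without rewriting the rest.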
typescript
async function safeExtract<T extends z.ZodTypeAny>(
  text: string,
  schema: T,
  schemaDescription: string,
  retries = 2,
): Promise<z.infer<T> | null> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await extractStructured(text, schema, schemaDescription);
    } catch (err) {
      // Thrown by the API call, JSON.parse (malformed JSON), or schema.parse (ZodError).
      if (attempt === retries) {
        console.error("Extraction failed after retries:", err);
        return null;
      }
      // The same request is resent unchanged; the error is not fed back to the model.
      console.warn(`Extraction attempt ${attempt + 1} failed, retrying…`);
    }
  }
  return null; // only reached when retries < 0
}
Optimization with Prompt Caching
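If you process many documents with the same schema, cache the system prompt and the schema description. Caching only takes effect once the cached prefix reaches the model's minimum cacheable length, so the gain is largest with long schema descriptions: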
typescript
const response = await client.messages.create({
  model: "claude-3-5-haiku-latest",
  max_tokens: 2048,
  system: [
    {
      type: "text",
      text: STRUCTURED_JSON_SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: `Schema:\n${schemaDescription}\n\nText:\n`,
          cache_control: { type: "ephemeral" }, // caches the schema description
        },
        {
          type: "text",
          text: documentText, // only this block changes per document
        },
      ],
    },
  ],
});
For batch extraction, use Claude 3.5 Haiku (claude-3-5-haiku-latest): it is roughly 10x faster and cheaper than Opus. Reserve Opus for complex or ambiguous documents where simple extraction fails.
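The "cheap model first, larger model only on failure" policy can be sketched as a small escalation loop. This is an illustration under assumptions: `extractWithEscalation` and `Attempt` are hypothetical names, and the callback stands in for a variant of the extraction function that accepts a model identifier, which the recipe's code does not currently do.

```typescript
// An attempt runs one extraction with the given model and resolves to null on failure.
type Attempt<T> = (model: string) => Promise<T | null>;

// Hypothetical helper: tries each model in order and stops at the first success.
async function extractWithEscalation<T>(
  models: readonly string[],
  attempt: Attempt<T>,
): Promise<{ model: string; value: T } | null> {
  for (const model of models) {
    // Treat a thrown error the same as a null result so escalation continues.
    const value = await attempt(model).catch(() => null);
    if (value !== null) return { model, value };
  }
  return null;
}
```

Passing the model list from smallest to largest keeps the common case on the cheaper model, and the returned `model` field makes it possible to log how often escalation was needed.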
Common Schemas
| Use case | Typical fields |
|---|---|
| Invoices | invoice_number, date, vendor, total_amount, line_items |
| CVs | name, email, experience[], skills[], education[] |
| News articles | title, author, published_at, topics[], sentiment |
| Contracts | parties[], effective_date, termination_date, obligations[] |
| Support tickets | id, priority, category, customer_id, description |
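For any of these use cases, the schema description sent to the model can be generated from a field list instead of being written by hand, which helps keep it consistent across prompts. The sketch below is an assumption-laden illustration: `describeSchema`, `FieldSpec`, and `FieldType` are hypothetical names, and the field types chosen for the support-ticket example are guesses, not something the table specifies.

```typescript
// Field types supported by this sketch; extend as needed.
type FieldType = "string" | "number" | "string[]";

interface FieldSpec {
  name: string;
  type: FieldType;
  nullable?: boolean;
  note?: string; // free-text hint shown to the model, e.g. "ISO 8601 format"
}

// Hypothetical helper: renders a field list in the same style as the
// hand-written schemaDescription in the invoice example.
function describeSchema(fields: readonly FieldSpec[]): string {
  const lines = fields.map((field, index) => {
    const type = field.nullable ? `${field.type} | null` : field.type;
    const comma = index < fields.length - 1 ? "," : "";
    const note = field.note ? ` // ${field.note}` : "";
    return `  ${field.name}: ${type}${comma}${note}`;
  });
  return ["{", ...lines, "}"].join("\n");
}

// Example: the support-ticket use case (field types are assumptions).
const ticketDescription = describeSchema([
  { name: "id", type: "string" },
  { name: "priority", type: "string", nullable: true },
  { name: "category", type: "string", nullable: true },
  { name: "customer_id", type: "string", nullable: true },
  { name: "description", type: "string" },
]);
```

The resulting string can be passed as the `schemaDescription` argument alongside a matching Zod schema; the Zod schema remains the source of truth for validation.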