The problem: B2B data comes in every format imaginable
If you build B2B software, you have heard the same story from your enterprise clients. They cannot send you a clean CSV. Their ERP exports fixed-width text files. Their procurement system generates EDI X12 850 purchase orders. Their finance team sends PDF invoices. Their compliance department shares XML reports. Their HR team exports XLSX workbooks with six sheets, three of which are blank.
This is not an edge case. This is the default state of enterprise data exchange. According to industry estimates, fewer than 30% of B2B file transfers use CSV as the primary format. The rest is a mix of legacy formats, industry-specific standards, and whatever export button the source system happens to offer.
Most file import tools were built for CSV. Some support Excel. Almost none handle PDF table extraction, EDI parsing, or fixed-width files. If you rely on one of these tools, you are still writing custom parsers for the majority of your enterprise clients. That defeats the purpose of having a file import tool in the first place.
The average B2B SaaS company receives data in 5 to 8 different file formats across its client base. Supporting only CSV and Excel means you are still writing custom code for the majority of your integrations.
Every file format FileFeed supports
FileFeed was designed from the ground up to handle the full spectrum of file formats that enterprises actually use. Every format feeds into the same validation, mapping, and transformation pipeline. You define your schema once, and FileFeed normalizes incoming data regardless of how it arrives.
CSV with full encoding and delimiter support
CSV is the most common format, but it is also the most inconsistent. FileFeed handles CSV files with any delimiter: commas, tabs, pipes, semicolons, or custom characters. Encoding is detected automatically, supporting UTF-8, UTF-16, Latin-1, Windows-1252, and ISO-8859-1. BOM (byte order mark) headers are stripped transparently. Quoted fields with embedded delimiters, newlines, and escape characters are parsed correctly. There is no configuration required for standard CSV files. FileFeed detects the format and parses it.
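To make the detection concrete, here is a minimal Python sketch of BOM stripping and delimiter sniffing using the standard library's `csv.Sniffer`. The helper name `parse_csv_bytes` is illustrative, not FileFeed's API, and real detection covers many more encodings:

```python
import codecs
import csv
import io

def parse_csv_bytes(raw: bytes) -> list[dict]:
    """Sketch of BOM + delimiter handling: strip a UTF-8 BOM,
    sniff the delimiter from a sample, then parse into dict rows."""
    # Strip a UTF-8 byte order mark so it cannot leak into the first header.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    text = raw.decode("utf-8")
    # Sniff the delimiter from the first chunk of the file.
    dialect = csv.Sniffer().sniff(text[:1024], delimiters=",;\t|")
    return list(csv.DictReader(io.StringIO(text), dialect=dialect))

rows = parse_csv_bytes(codecs.BOM_UTF8 + b"sku;qty\nA-1;5\nB-2;3\n")
# rows[0] == {"sku": "A-1", "qty": "5"}
```

Without the BOM strip, the first header would come back as `\ufeffsku`, which is exactly the phantom-character bug that plagues Excel-exported CSVs.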
Excel workbooks (XLSX and XLS)
Enterprise clients frequently send Excel files with multiple sheets, merged cells, formula-driven columns, and inconsistent header rows. FileFeed reads XLSX and legacy XLS formats, extracts data from specified sheets or auto-detects the primary data sheet, strips formulas down to their computed values, and handles merged cells by propagating values. Header row detection is automatic, even when the first few rows contain titles or metadata rather than column headers.
JSON with nested structure support
API-generated exports and modern systems often produce JSON. FileFeed parses both flat and deeply nested JSON structures. Arrays are automatically unwrapped into rows. Nested objects can be flattened or accessed via dot notation during the mapping phase. Whether the file contains a top-level array of records or a wrapper object with metadata and a nested data array, FileFeed extracts the records and presents them as tabular data for mapping.
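A dot-notation flattener of the kind described fits in a few lines. The names `flatten` and `extract_records` are illustrative, and the wrapper key `data` is an assumption about the payload shape:

```python
import json

def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested objects into dot-notation keys,
    e.g. {"buyer": {"name": "Acme"}} -> {"buyer.name": "Acme"}."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

def extract_records(payload: str, records_key: str = "data") -> list[dict]:
    """Unwrap a top-level array, or a wrapper object holding one."""
    doc = json.loads(payload)
    records = doc if isinstance(doc, list) else doc[records_key]
    return [flatten(r) for r in records]

rows = extract_records('{"meta": 1, "data": [{"sku": "A-1", "buyer": {"name": "Acme"}}]}')
# rows[0] == {"sku": "A-1", "buyer.name": "Acme"}
```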
XML with configurable element extraction
XML remains common in healthcare, government, and legacy enterprise systems. FileFeed parses XML files with configurable root and row element selectors. Attributes and child elements are extracted as fields. Namespaces are handled transparently. You specify which XML element represents a record, and FileFeed extracts each instance as a row in your schema.
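A row-element extractor can be sketched with Python's standard library (namespace handling is omitted for brevity, and the element and field names below are hypothetical):

```python
import xml.etree.ElementTree as ET

def xml_rows(doc: str, row_tag: str) -> list[dict]:
    """Extract each <row_tag> element as a record:
    attributes and child-element text become fields."""
    root = ET.fromstring(doc)
    rows = []
    for el in root.iter(row_tag):
        record = dict(el.attrib)
        for child in el:
            record[child.tag] = (child.text or "").strip()
        rows.append(record)
    return rows

doc = """<report>
  <filing id="F-100"><agency>EPA</agency><year>2024</year></filing>
  <filing id="F-101"><agency>FDA</agency><year>2024</year></filing>
</report>"""
# xml_rows(doc, "filing")[0] == {"id": "F-100", "agency": "EPA", "year": "2024"}
```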
Fixed-width files
Fixed-width (positional) files are still used by banks, insurance companies, government agencies, and mainframe systems. Each field occupies a specific character range in each line. FileFeed supports position-based field extraction with configurable start position and length for each column. If your client sends a 200-character-per-line fixed-width payroll file from a system built in the 1990s, FileFeed can parse it without custom code.
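Position-based extraction is easy to picture in code. The layout below is a hypothetical three-column payroll spec, not a real client configuration:

```python
def parse_fixed_width(lines, layout):
    """layout: list of (field_name, start, length) column specs,
    mirroring the start-position/length configuration described above."""
    rows = []
    for line in lines:
        rows.append({name: line[start:start + length].strip()
                     for name, start, length in layout})
    return rows

# Hypothetical layout: employee id (cols 0-5), name (6-17), salary (18-25).
layout = [("emp_id", 0, 6), ("name", 6, 12), ("salary", 18, 8)]
rows = parse_fixed_width(["100042Jane Doe    75000"], layout)
# rows[0] == {"emp_id": "100042", "name": "Jane Doe", "salary": "75000"}
```

In practice the layout comes from configuration rather than code, which is what lets a 1990s mainframe export be onboarded without a custom parser.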
PDF table extraction (AI-powered)
PDF is the most challenging format because it is a presentation format, not a data format. Tables in PDFs have no underlying structure. They are visual arrangements of text on a page. FileFeed uses AI-powered table detection to identify tabular data within PDF documents, extract rows and columns into structured data, and present the results for user confirmation before processing.
EDI X12 and EDIFACT
Electronic Data Interchange (EDI) is the backbone of supply chain, healthcare, and retail data exchange. FileFeed parses EDI X12 transaction sets including 850 (Purchase Order), 810 (Invoice), 856 (ASN), and 834 (Benefit Enrollment). EDIFACT formats including ORDERS, INVOIC, and DESADV are also supported. EDI segments and elements are parsed into structured JSON, which then flows through the same mapping and validation pipeline as any other format.
How PDF data extraction works in FileFeed
PDF extraction deserves a deeper explanation because it solves one of the hardest problems in data onboarding. When an enterprise client sends you a PDF invoice, a compliance certificate, or a bank statement, the data you need is locked inside a format designed for printing, not for parsing.
Traditional PDF parsers rely on text extraction, which produces an unstructured stream of characters with no concept of rows, columns, or tables. This approach fails for any document with meaningful tabular structure. FileFeed takes a fundamentally different approach.
- Table detection: FileFeed's AI model analyzes the visual layout of each page and identifies regions that contain tabular data. This works even when tables have no visible borders or grid lines.
- Structure recognition: Within each detected table region, the model identifies column boundaries, header rows, and data rows. Merged cells are detected and their values are properly attributed.
- Data extraction: Text content is extracted from each cell and organized into a structured row-and-column format, producing clean tabular data from what was previously an unstructured visual layout.
- User preview and confirmation: The extracted data is presented in a preview interface where users can verify the extraction, adjust column mappings, and confirm before the data enters the processing pipeline.
- Multi-page handling: Tables that span multiple pages are recognized and concatenated automatically. Headers are detected on continuation pages and deduplicated.
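The AI detection itself cannot be reproduced in a few lines, but the multi-page stitching step can be: given per-page tables, drop the repeated header at the top of each continuation page. A sketch with made-up bank-statement rows:

```python
def stitch_pages(pages: list[list[list[str]]]) -> list[list[str]]:
    """Concatenate per-page tables, dropping a duplicated header row
    at the top of each continuation page."""
    if not pages:
        return []
    header = pages[0][0]
    stitched = list(pages[0])
    for page in pages[1:]:
        # Keep only data rows when the page repeats the header.
        rows = page[1:] if page and page[0] == header else page
        stitched.extend(rows)
    return stitched

page1 = [["date", "desc", "amount"], ["2024-01-03", "Wire in", "1200.00"]]
page2 = [["date", "desc", "amount"], ["2024-01-05", "Card", "-42.10"]]
# stitch_pages([page1, page2]) keeps one header and both data rows
```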
Common use cases for PDF extraction include processing vendor invoices with line item tables, extracting transaction data from bank statements, parsing compliance and audit reports, reading medical claims from PDF explanation-of-benefits documents, and ingesting inventory reports from suppliers who only provide PDF exports.
PDF table extraction converts documents that previously required manual data entry into structured data that flows through your automated pipeline. A process that took 30 minutes per document becomes seconds.
How EDI parsing works in FileFeed
EDI (Electronic Data Interchange) has been the standard for B2B document exchange in supply chain, healthcare, and retail for decades. Despite its age, EDI is not going away. The global EDI market was valued at $41 billion in 2024 and is projected to grow to $91 billion by 2032. If your product operates in any of these industries, you will encounter EDI.
The challenge with EDI is that it was designed for machine-to-machine communication in the 1970s. An EDI X12 document is a stream of segments, conventionally terminated by tildes, with elements separated by asterisks (the actual delimiters are declared in the ISA envelope). There are no labels, no headers, and no self-describing structure. Parsing EDI requires knowledge of the specific transaction set schema to know which segment carries which business data.
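The tokenizing step is straightforward to sketch. This example assumes the common default delimiters (`~` and `*`) rather than reading them from the ISA envelope, and the sample segments are illustrative:

```python
def tokenize_x12(edi: str, seg_term: str = "~", elem_sep: str = "*"):
    """Split a raw X12 stream into segments and positional elements.
    Real delimiters are declared in the ISA envelope; the common
    defaults are assumed here."""
    segments = []
    for raw in edi.strip().split(seg_term):
        if not raw.strip():
            continue
        elements = raw.strip().split(elem_sep)
        segments.append({"id": elements[0], "elements": elements[1:]})
    return segments

sample = "BEG*00*SA*PO-1001**20240115~PO1*1*12*EA*9.50**UP*012345678905~"
segs = tokenize_x12(sample)
# segs[1]["id"] == "PO1"; segs[1]["elements"][1] == "12" (the quantity)
```

Notice that nothing in the stream says "quantity": the meaning of position 2 in a PO1 segment comes entirely from the 850 schema, which is why the next step maps positions to labels.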
EDI X12 transaction sets
FileFeed parses the most common X12 transaction sets used in B2B commerce:
- 850 (Purchase Order): The most widely used EDI document. Contains buyer information, shipping details, and line items with quantities, prices, and product identifiers. FileFeed extracts each line item as a row and maps PO header data as context fields.
- 810 (Invoice): The supplier's invoice in response to a purchase order. Contains invoice amounts, line item details, tax information, and payment terms. FileFeed parses both summary and detail segments.
- 856 (Advance Ship Notice): Notifies the buyer that a shipment is on its way. Contains hierarchical data with shipment, order, and item levels. FileFeed flattens the hierarchy into rows while preserving parent-child relationships.
- 834 (Benefit Enrollment): Used heavily in healthcare for enrolling members in benefit plans. Contains member demographics, coverage information, and effective dates. FileFeed parses both full-file and change-only (maintenance) 834 documents.
EDIFACT message types
For international commerce, FileFeed supports UN/EDIFACT message types:
- ORDERS: International purchase orders used extensively in European and Asian supply chains.
- INVOIC: Commercial invoices with support for multiple currencies, tax schemes, and regulatory requirements.
- DESADV: Despatch advice (shipping notice), the EDIFACT equivalent of the X12 856 ASN.
The EDI-to-JSON pipeline
When FileFeed receives an EDI document, the parsing pipeline works as follows: the raw EDI stream is segmented and tokenized according to the declared transaction set. Each segment is mapped to its schema definition, producing labeled key-value pairs. The labeled data is structured as JSON, with hierarchical segments (like the HL loops in an 856) represented as nested objects. This JSON then enters the standard FileFeed pipeline where it can be mapped to your schema, validated against your rules, and delivered to your API.
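The labeling step can be sketched with a toy schema. The element labels below are heavily simplified placeholders, not a real X12 implementation guide:

```python
# Hypothetical, heavily simplified schema: segment id -> element labels.
SCHEMA_850 = {
    "BEG": ["purpose", "type", "po_number", "release", "date"],
    "PO1": ["line", "qty", "uom", "unit_price"],
}

def label_segments(segments, schema):
    """Turn positional elements into labeled key-value pairs:
    the step between tokenizing and emitting JSON."""
    labeled = []
    for seg in segments:
        labels = schema.get(seg["id"])
        if labels is None:
            continue  # skip segments outside this simplified schema
        labeled.append({"segment": seg["id"],
                        **dict(zip(labels, seg["elements"]))})
    return labeled

segments = [{"id": "BEG", "elements": ["00", "SA", "PO-1001", "", "20240115"]},
            {"id": "PO1", "elements": ["1", "12", "EA", "9.50"]}]
records = label_segments(segments, SCHEMA_850)
# records[1] == {"segment": "PO1", "line": "1", "qty": "12",
#                "uom": "EA", "unit_price": "9.50"}
```

Once the data looks like this, it is ordinary JSON, and the rest of the pipeline no longer cares that it started life as EDI.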
FileFeed can also generate EDI documents for outbound export use cases. If your system needs to send 810 invoices or 856 ASNs to trading partners, FileFeed can transform your JSON data into compliant EDI X12 or EDIFACT documents. This bidirectional capability means you can use FileFeed for both inbound and outbound EDI without a separate EDI translator.
EDI is not legacy technology. It is a $41B market growing at 10% annually. Companies that treat EDI as a first-class integration format rather than an afterthought have a significant advantage in supply chain, healthcare, and retail verticals.
Encoding and format detection
Before any file can be parsed, FileFeed needs to determine its encoding and format. This happens automatically, with no configuration required from the user. The encoding detection system identifies the character encoding of incoming files by analyzing byte patterns, BOM markers, and statistical character frequencies.
- Encoding detection: UTF-8, UTF-8 with BOM, UTF-16 LE/BE, Latin-1 (ISO-8859-1), Windows-1252, Shift_JIS, EUC-KR, and 20+ additional encodings are detected automatically. Files are transcoded to UTF-8 before parsing.
- CSV delimiter detection: Comma, tab, pipe, semicolon, and custom delimiters are detected by analyzing the first several rows for consistency. European CSV files that use semicolons (because commas are decimal separators) are handled correctly.
- BOM handling: Byte order marks are detected and stripped so they do not appear as phantom characters in the first column header, a common problem with CSV files exported from Excel on Windows.
- Line ending normalization: CRLF (Windows), LF (Unix/Mac), and CR (legacy Mac) line endings are all normalized during parsing.
- File type detection: When a file extension is missing or incorrect, FileFeed uses magic bytes and content analysis to determine the actual file type. An XLSX file renamed to .csv will still be parsed as an Excel workbook.
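A condensed sketch of these detection steps on raw bytes (the candidate lists and heuristics are illustrative, not FileFeed's actual implementation):

```python
import codecs

def detect(raw: bytes) -> dict:
    """Sketch of BOM, file-type, and delimiter detection on raw bytes."""
    info = {"encoding": "utf-8", "kind": "csv", "delimiter": ","}
    # BOM markers identify the encoding outright.
    if raw.startswith(codecs.BOM_UTF8):
        info["encoding"] = "utf-8-sig"
    elif raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        info["encoding"] = "utf-16"
    # Magic bytes beat file extensions: XLSX files are ZIP containers ("PK").
    if raw[:2] == b"PK":
        info["kind"] = "xlsx"
        return info
    # Delimiter: pick the candidate with a consistent nonzero per-line count.
    text = raw.decode(info["encoding"], errors="replace")
    lines = [l for l in text.splitlines() if l][:10]
    for cand in [",", ";", "\t", "|"]:
        counts = {line.count(cand) for line in lines}
        if len(counts) == 1 and counts != {0}:
            info["delimiter"] = cand
            break
    return info

# detect(b"a;b\n1;2\n")        -> delimiter ";"
# detect(b"PK\x03\x04...")     -> kind "xlsx", whatever the extension says
```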
One pipeline for every format
The most important architectural decision in FileFeed is that every file format feeds into the same processing pipeline. Whether the incoming data started as a PDF invoice, an EDI 850 purchase order, an XML feed, or a plain CSV file, it is normalized into a tabular structure before it reaches the mapping, validation, and transformation stages.
This means you define your target schema once. You define your validation rules once. You write your transformation functions once. A new client that sends PDF invoices and an existing client that sends CSV files both flow through the same pipeline, the same webhooks, and the same API delivery.
- Ingest: The file arrives via SFTP, API upload, or the embeddable importer. FileFeed detects the file type and encoding.
- Parse: The format-specific parser (CSV, Excel, JSON, XML, PDF, EDI, or fixed-width) converts the file into normalized tabular rows.
- Map: Columns from the source file are mapped to fields in your target schema using AI-powered field mapping. Mappings can be saved per client so recurring files are processed automatically.
- Validate: Each row is validated against your schema rules: required fields, data types, formats, ranges, and custom validation logic.
- Transform: Transformation functions run on validated data: date format conversion, string normalization, value lookups, computed fields, and conditional logic.
- Deliver: Clean, validated, transformed data is delivered to your API endpoint, webhook, or database as structured JSON.
This unified pipeline eliminates the need for format-specific processing logic in your application. Your code receives the same clean JSON payload regardless of whether the source was a PDF, an EDI document, or a CSV file.
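The design is easy to demonstrate: once every parser returns the same row shape, the downstream steps never branch on format. A toy sketch with two parsers and a shared required-fields check (all names are illustrative):

```python
import json

def rows_from_csv(text: str) -> list[dict]:
    header, *lines = [l.split(",") for l in text.strip().splitlines()]
    return [dict(zip(header, line)) for line in lines]

def rows_from_json(text: str) -> list[dict]:
    return json.loads(text)

# Dispatch table: each parser normalizes its format into dict rows.
PARSERS = {"csv": rows_from_csv, "json": rows_from_json}

def pipeline(raw: str, kind: str, required: set) -> list[dict]:
    """Parse with a format-specific parser, then run the shared
    validation step on the normalized rows."""
    rows = PARSERS[kind](raw)
    return [r for r in rows if required <= r.keys()]

a = pipeline("sku,qty\nA-1,5", "csv", {"sku", "qty"})
b = pipeline('[{"sku": "A-1", "qty": "5"}]', "json", {"sku", "qty"})
# a == b: same payload, two source formats
```

In a real system the dispatch table would include Excel, XML, PDF, EDI, and fixed-width entries, but the shape of the design is the same.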
How FileFeed compares to other file import tools
Most file import and data onboarding tools were built for CSV and, in some cases, Excel. When you look at what formats are actually supported by the tools in the market, the gaps become clear.
- Flatfile: Supports CSV and Excel. No PDF extraction, no EDI parsing, no XML, no fixed-width. If your enterprise clients send anything other than spreadsheet formats, you are on your own.
- OneSchema: Supports CSV and Excel. Similar limitations to Flatfile. Focused on the embeddable import experience for spreadsheet-like files.
- Couchdrop: Handles file transfer (SFTP, cloud storage routing) but does not parse files at all. No schema validation, no mapping, no transformation. It moves files from point A to point B without processing them.
- Custom code: You can build parsers for any format, but each one takes engineering time. Maintaining parsers for 8 file formats across hundreds of client configurations is a full-time job for a team.
- FileFeed: Supports CSV, Excel, JSON, XML, PDF (AI-powered extraction), EDI X12, EDI EDIFACT, and fixed-width files. All formats feed into the same mapping, validation, and transformation pipeline.
If your file import tool only supports CSV and Excel, you are covering less than 30% of the file formats that enterprise clients actually send. Every unsupported format becomes a custom engineering project.
Real-world scenarios
Supply chain: EDI purchase orders to your order management system
A retailer sends EDI 850 purchase orders from their procurement system. FileFeed parses the X12 document, extracts line items with quantities and product IDs, validates against your product catalog, and delivers structured order data to your API. When the retailer's trading partner is in Europe and sends EDIFACT ORDERS instead, the same pipeline handles it. No code changes.
Fintech: PDF bank statements to transaction records
A lending platform needs transaction history from applicants. Some banks provide CSV exports, but many only offer PDF statements. FileFeed's PDF extraction identifies the transaction table on each page, extracts dates, descriptions, and amounts, and delivers them as structured transaction records. The lending platform's underwriting system receives the same JSON payload regardless of whether the source was CSV or PDF.
Healthcare: 834 enrollment files for benefits administration
A benefits administration platform receives 834 Benefit Enrollment files from insurance carriers. These EDI documents contain member demographics, coverage selections, and effective dates. FileFeed parses the 834, validates member data against the platform's schema, and delivers clean enrollment records. When a smaller carrier sends the same data as an Excel spreadsheet instead of EDI, the pipeline handles that too.
Government and compliance: XML regulatory filings
A compliance platform ingests regulatory filings from government agencies. These arrive as XML documents with specific namespace requirements and deeply nested element structures. FileFeed parses the XML, extracts records from the configured row elements, and maps them to the platform's internal schema. When the same data arrives from a different agency as a fixed-width text file, the same target schema and validation rules apply.
Getting started with multi-format file ingestion
If you are currently handling only CSV and Excel imports and know you need to support additional formats, the transition to FileFeed is straightforward. Your existing CSV and Excel pipelines continue to work exactly as before. You gain PDF extraction, EDI parsing, XML, JSON, and fixed-width support without any changes to your downstream data processing.
The key advantage is that all formats produce the same output. Your application code does not need to know or care what format the source file was in. It receives validated, mapped, transformed JSON through the same API endpoint or webhook, regardless of whether the original file was a 40-page PDF invoice or a simple CSV with three columns.
Enterprise data exchange is messy. File formats are diverse, encodings are inconsistent, and every client has their own way of exporting data. A file import tool that only handles CSV is solving the easiest 30% of the problem. FileFeed handles all of it, through a single pipeline, with no custom code required.