Technical Deep Dive: Building a Modern PDF Data Extraction Application

In this comprehensive technical guide, we'll explore the architecture and implementation details of a modern PDF data extraction application built with Next.js 14, focusing on scalability, performance, and user experience.

Introduction

Building a reliable PDF data extraction system presents unique challenges, from handling various PDF formats to ensuring accurate data extraction at scale. Our application addresses these challenges using a modern tech stack:

  • Next.js 14 for the frontend and API routes

  • LlamaParse for robust PDF text extraction

  • OpenAI for intelligent data structuring

  • Clerk v6 for authentication

  • Supabase for data persistence

  • Tailwind CSS and shadcn/ui for the UI

The application serves businesses that need to extract structured data from PDFs such as invoices, receipts, and forms, converting them into usable formats like Excel spreadsheets.

Architecture Overview

Frontend Architecture

The frontend is built using Next.js 14's App Router, employing a hybrid approach of server and client components for optimal performance. Here's our component organization:

app/
  ├── (auth)/
  │   ├── sign-in/
  │   └── sign-up/
  ├── dashboard/
  │   ├── page.tsx
  │   ├── loading.tsx
  │   └── error.tsx
  ├── processing/
  │   └── [...steps]/
  └── schemas/
      └── [id]/

We use React Server Components (RSC) for data-heavy components and Client Components for interactive elements. This separation provides several benefits:

  • Reduced JavaScript bundle size

  • Improved initial page load

  • Better SEO through server-side rendering

  • Maintained interactive features where needed
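
To make the split concrete, here is a minimal sketch (file and component names are illustrative, and db stands for the same database client used in the upload handler later in this post): the page fetches data as a server component, while a small client component owns the one interactive element.

// app/dashboard/page.tsx (server component: no 'use client' directive)
import { DeleteUploadButton } from './delete-upload-button';

export default async function DashboardPage() {
  // Runs only on the server; the query and ORM client never ship to the browser
  const uploads = await db.uploads.findMany({ orderBy: { createdAt: 'desc' } });

  return (
    <ul>
      {uploads.map((u) => (
        <li key={u.id}>
          {u.fileUrl}
          <DeleteUploadButton uploadId={u.id} />
        </li>
      ))}
    </ul>
  );
}

// app/dashboard/delete-upload-button.tsx (client component for interactivity)
'use client';

export function DeleteUploadButton({ uploadId }: { uploadId: string }) {
  // Event handlers need a client component; only this small island hydrates
  return (
    <button onClick={() => fetch(`/api/uploads/${uploadId}`, { method: 'DELETE' })}>
      Delete
    </button>
  );
}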

Backend Architecture

The backend leverages Next.js API routes organized into a clear hierarchy:

app/api/
  ├── uploads/
  │   └── route.ts
  ├── processing/
  │   └── route.ts
  ├── schemas/
  │   └── route.ts
  └── exports/
      └── route.ts
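
A typical handler in this hierarchy is a plain route module. Here is a minimal sketch of what app/api/uploads/route.ts might look like (the query shape is an assumption; db is the database client used throughout this post):

// app/api/uploads/route.ts
import { NextResponse } from 'next/server';
import { auth } from '@clerk/nextjs/server';

export async function GET() {
  const { userId } = await auth();
  if (!userId) {
    return NextResponse.json({ error: 'Unauthorized' }, { status: 401 });
  }

  // List only the caller's uploads
  const uploads = await db.uploads.findMany({ where: { userId } });
  return NextResponse.json(uploads);
}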

Each route is protected using Clerk middleware:

import { clerkMiddleware, createRouteMatcher } from '@clerk/nextjs/server';

// Health checks and Clerk webhooks stay public; everything else needs a session
const isPublicRoute = createRouteMatcher(['/api/health', '/api/webhooks/clerk']);

export default clerkMiddleware(async (auth, req) => {
  if (!isPublicRoute(req)) await auth.protect();
});

Core Features Deep Dive

1. PDF Upload and Processing

The upload system is built on UploadThing (the @uploadthing/react client paired with a server-side file router) with custom configuration for PDF handling:

import { createUploadthing, type FileRouter } from 'uploadthing/next';
import { UploadThingError } from 'uploadthing/server';
import { currentUser } from '@clerk/nextjs/server';

const f = createUploadthing();

export const ourFileRouter = {
  pdfUploader: f({ pdf: { maxFileSize: '32MB' } })
    .middleware(async () => {
      const user = await currentUser();
      if (!user) throw new UploadThingError('Unauthorized');
      return { userId: user.id };
    })
    .onUploadComplete(async ({ metadata, file }) => {
      // Record the upload so the processing pipeline can pick it up
      await db.uploads.create({
        data: {
          userId: metadata.userId,
          fileUrl: file.url,
          status: 'pending'
        }
      });
    })
} satisfies FileRouter;

The processing pipeline follows these steps:

  1. File upload and validation

  2. Text extraction using LlamaParse

  3. Structure identification

  4. Data extraction with OpenAI

  5. Results validation
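
Tying these steps together, the orchestration reduces to a linear sketch. Here, validatePdf, parseWithLlamaParse, identifySchema, and validateAgainstSchema are hypothetical helpers, while extractData is shown in full in section 3:

const processUpload = async (uploadId: string) => {
  const upload = await db.uploads.findUnique({ where: { id: uploadId } });
  if (!upload) throw new Error(`Unknown upload ${uploadId}`);

  // Steps 1-2: validate the file, then pull raw text out via LlamaParse
  await validatePdf(upload.fileUrl);
  const text = await parseWithLlamaParse(upload.fileUrl);

  // Steps 3-4: identify the applicable schema, then extract against it
  const schema = await identifySchema(text);
  const data = await extractData(text, schema);

  // Step 5: validate the result before marking the upload complete
  const result = validateAgainstSchema(data, schema);
  await db.uploads.update({
    where: { id: uploadId },
    data: { status: 'complete', result }
  });
};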

2. Schema Definition System

The schema builder interface allows users to define custom data structures:

interface SchemaField {
  name: string;
  type: 'string' | 'number' | 'date' | 'nested';
  required: boolean;
  validation?: {
    pattern?: string;
    min?: number;
    max?: number;
  };
  children?: SchemaField[]; // For nested structures
}

We provide default templates for common documents:

  • Invoices

  • Receipts

  • Purchase Orders

  • Shipping Documents
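
As an illustration, an invoice template expressed with SchemaField might look like this (the exact fields in the shipped template are a guess):

const invoiceTemplate: SchemaField[] = [
  { name: 'invoiceNumber', type: 'string', required: true },
  { name: 'issueDate', type: 'date', required: true },
  { name: 'total', type: 'number', required: true, validation: { min: 0 } },
  {
    name: 'lineItems',
    type: 'nested',
    required: false,
    children: [
      { name: 'description', type: 'string', required: true },
      { name: 'quantity', type: 'number', required: true, validation: { min: 1 } },
      { name: 'unitPrice', type: 'number', required: true, validation: { min: 0 } }
    ]
  }
];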

3. Data Extraction Pipeline

The extraction process uses OpenAI's GPT-4 with custom prompting:

import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const extractData = async (text: string, schema: Schema) => {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4-turbo-preview',
    // Constrain the model to valid JSON so the parse below cannot fail on prose
    response_format: { type: 'json_object' },
    messages: [
      {
        role: 'system',
        content: 'Extract structured data from the text as JSON matching the provided schema.'
      },
      {
        role: 'user',
        content: `Schema: ${JSON.stringify(schema)}\n\nText: ${text}`
      }
    ],
    temperature: 0.1,
    max_tokens: 2000
  });

  const raw = completion.choices[0].message.content;
  if (!raw) throw new Error('OpenAI returned an empty completion');
  return JSON.parse(raw);
};

4. Credit Management System

Credits are tracked in real time as an append-only log of transactions:

interface CreditTransaction {
  userId: string;
  amount: number;
  type: 'deduction' | 'addition';
  reason: 'processing' | 'refund' | 'purchase';
  metadata: {
    pageCount?: number;
    fileId?: string;
  };
}

Credit deduction follows these rules:

  • 1 credit per page processed

  • Minimum 1 credit per file

  • Bulk processing discounts

  • Refunds for failed processing
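
Put together, the deduction logic reduces to a small function. This is a sketch: the bulk-discount tier is illustrative (the actual tiers are not specified above), and creditTransactions is an assumed table name:

const creditsForFile = (pageCount: number): number => {
  // 1 credit per page, with a minimum of 1 credit per file
  const base = Math.max(1, pageCount);

  // Illustrative bulk discount: 10% off files beyond 100 pages
  return base > 100 ? Math.ceil(base * 0.9) : base;
};

const deductCredits = async (userId: string, fileId: string, pageCount: number) => {
  const tx: CreditTransaction = {
    userId,
    amount: creditsForFile(pageCount),
    type: 'deduction',
    reason: 'processing',
    metadata: { pageCount, fileId }
  };
  await db.creditTransactions.create({ data: tx });
  return tx.amount;
};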

5. Export Functionality

The export system handles nested data structures:

interface ExportOptions {
  format: 'xlsx' | 'csv' | 'json';
  flatten: boolean;
  includeMetadata: boolean;
}

import * as XLSX from 'xlsx';

const generateExport = async (data: ExtractedData[], options: ExportOptions) => {
  // Flatten nested structures first so each record maps to a single sheet row
  const rows = options.flatten ? flattenNestedStructure(data) : data;

  if (options.format === 'json') {
    return Buffer.from(JSON.stringify(rows, null, 2));
  }

  const workbook = XLSX.utils.book_new();
  const worksheet = XLSX.utils.json_to_sheet(rows);
  XLSX.utils.book_append_sheet(workbook, worksheet, 'Extracted Data');

  // bookType 'csv' serializes only the first sheet, which is all we have here
  return XLSX.write(workbook, {
    type: 'buffer',
    bookType: options.format === 'csv' ? 'csv' : 'xlsx'
  });
};

Security and Performance

Security Measures

We implement multiple security layers:

  1. Authentication using Clerk

  2. File validation and sanitization

  3. Rate limiting on API routes

  4. Secure file storage with signed URLs

  5. Data encryption at rest
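
As one concrete illustration of layer 3, an API route can be rate limited per user with Upstash's Ratelimit (a sketch; the post does not name the rate-limiting library, and the 20-per-minute budget is an assumption):

import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';
import { NextResponse } from 'next/server';
import { auth } from '@clerk/nextjs/server';

// 20 requests per user per minute, counted in Redis via a sliding window
const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(20, '60 s')
});

export async function POST(req: Request) {
  const { userId } = await auth();
  if (!userId) {
    return NextResponse.json({ error: 'Unauthorized' }, { status: 401 });
  }

  const { success } = await ratelimit.limit(userId);
  if (!success) {
    return NextResponse.json({ error: 'Too many requests' }, { status: 429 });
  }

  // ...handle the upload or processing request
  return NextResponse.json({ ok: true });
}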

Performance Optimizations

Key optimizations include:

  1. Streaming uploads for large files

  2. Parallel processing where possible

  3. Caching of extraction results

  4. Lazy loading of heavy components

  5. Background job processing for long-running tasks
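
For item 3, one simple way to cache extraction results is to key the cache on a hash of the document text plus the schema, so re-uploads of an unchanged file skip the OpenAI call entirely. A sketch, assuming the same Upstash Redis client used for rate limiting:

import { Redis } from '@upstash/redis';
import { createHash } from 'crypto';

const redis = Redis.fromEnv();

const cachedExtract = async (text: string, schema: Schema) => {
  // Identical text and schema always produce the same cache key
  const key = 'extract:' + createHash('sha256')
    .update(text)
    .update(JSON.stringify(schema))
    .digest('hex');

  const hit = await redis.get<Record<string, unknown>>(key);
  if (hit) return hit;

  const result = await extractData(text, schema);
  await redis.set(key, result, { ex: 60 * 60 * 24 }); // cache for 24 hours
  return result;
};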

User Experience Considerations

The UI is built with Tailwind CSS and shadcn/ui components for a modern look:

'use client';

import React from 'react';
import { Progress } from '@/components/ui/progress'; // shadcn/ui progress bar

interface ProgressIndicator {
  current: number;
  total: number;
  status: 'uploading' | 'processing' | 'extracting' | 'complete';
}

const ProcessingStatus: React.FC<ProgressIndicator> = ({ current, total, status }) => {
  const labels: Record<ProgressIndicator['status'], string> = {
    uploading: 'Uploading files...',
    processing: 'Processing PDFs...',
    extracting: 'Extracting data...',
    complete: 'Complete'
  };

  return (
    <div className="w-full max-w-md mx-auto">
      <Progress value={(current / total) * 100} />
      <p className="text-sm text-gray-500 mt-2">{labels[status]}</p>
    </div>
  );
};

Challenges and Solutions

Large File Handling

For large PDFs, we implemented:

  • Chunked uploads

  • Progressive processing

  • Background workers

  • Status webhooks

Concurrent Processing

To handle multiple simultaneous uploads:

  • Queue system with Redis

  • Worker pools

  • Progress tracking per file

  • Automatic retries
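
One way to realize this setup is BullMQ on top of the Redis instance (the library choice here is ours for illustration; the post only specifies "queue system with Redis"). The producer enqueues one job per file, and a small worker pool consumes them, reusing the processUpload pipeline sketched earlier:

import { Queue, Worker } from 'bullmq';

const connection = { host: process.env.REDIS_HOST!, port: 6379 };

// Producer: enqueue one job per uploaded file
export const processingQueue = new Queue('pdf-processing', { connection });

export const enqueueUpload = (uploadId: string) =>
  processingQueue.add('process', { uploadId }, {
    attempts: 3, // automatic retries
    backoff: { type: 'exponential', delay: 5000 }
  });

// Consumer: a worker pool of five, with per-file progress reporting
new Worker(
  'pdf-processing',
  async (job) => {
    await job.updateProgress(10);
    await processUpload(job.data.uploadId);
    await job.updateProgress(100);
  },
  { connection, concurrency: 5 }
);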

Future Improvements

Planned enhancements include:

  • AI-powered schema suggestions

  • Improved accuracy through feedback loops

  • Additional export formats

  • Batch processing optimizations

Conclusion

Building a robust PDF data extraction system requires careful attention at every layer: resilient file handling and queueing, accurate AI-driven extraction, and a responsive user experience. This implementation provides a scalable foundation that can be extended to fit specific document types and workloads.

Resources