Technical Deep Dive: Building a Modern PDF Data Extraction Application
In this comprehensive technical guide, we'll explore the architecture and implementation details of a modern PDF data extraction application built with Next.js 14, focusing on scalability, performance, and user experience.
Introduction
Building a reliable PDF data extraction system presents unique challenges, from handling various PDF formats to ensuring accurate data extraction at scale. Our application addresses these challenges using a modern tech stack:
Next.js 14 for the frontend and API routes
LlamaParse for robust PDF text extraction
OpenAI for intelligent data structuring
Clerk v6 for authentication
Supabase for data persistence
Tailwind CSS and shadcn/ui for the UI
The application serves businesses needing to extract structured data from PDFs, such as invoices, receipts, and forms, converting them into usable formats like Excel spreadsheets.
Architecture Overview
Frontend Architecture
The frontend is built using Next.js 14's App Router, employing a hybrid approach of server and client components for optimal performance. Here's our component organization:
app/
├── (auth)/
│   ├── sign-in/
│   └── sign-up/
├── dashboard/
│   ├── page.tsx
│   ├── loading.tsx
│   └── error.tsx
├── processing/
│   └── [...steps]/
└── schemas/
    └── [id]/
We use React Server Components (RSC) for data-heavy components and Client Components for interactive elements. This separation provides several benefits:
Reduced JavaScript bundle size
Improved initial page load
Better SEO through server-side rendering
Maintained interactive features where needed
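As a minimal sketch of this split (file names, the getUploadsForUser helper, and the UploadList props are illustrative, not the app's actual code), the dashboard page can fetch data on the server while a small client component handles interaction:

// app/dashboard/page.tsx — Server Component: data is fetched on the server,
// so none of this code ships to the browser.
import { UploadList } from './upload-list';
import { getUploadsForUser } from '@/lib/uploads'; // hypothetical data helper

export default async function DashboardPage() {
  const uploads = await getUploadsForUser();
  return <UploadList uploads={uploads} />;
}

// app/dashboard/upload-list.tsx — Client Component: only the interactive part
'use client';
import { useState } from 'react';

export function UploadList({ uploads }: { uploads: { id: string; fileUrl: string }[] }) {
  const [selected, setSelected] = useState<string | null>(null);
  return (
    <ul>
      {uploads.map((u) => (
        <li key={u.id} onClick={() => setSelected(u.id)}>
          {u.fileUrl} {selected === u.id && '(selected)'}
        </li>
      ))}
    </ul>
  );
}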
Backend Architecture
The backend leverages Next.js API routes organized into a clear hierarchy:
app/api/
├── uploads/
│   └── route.ts
├── processing/
│   └── route.ts
├── schemas/
│   └── route.ts
└── exports/
    └── route.ts
Each route is protected using Clerk middleware:
import { clerkMiddleware, createRouteMatcher } from '@clerk/nextjs/server';

// Clerk v6 replaces authMiddleware with clerkMiddleware + route matchers;
// health checks and Clerk webhooks stay reachable without a session.
const isPublicRoute = createRouteMatcher(['/api/health', '/api/webhooks/clerk(.*)']);

export default clerkMiddleware(async (auth, req) => {
  if (!isPublicRoute(req)) await auth.protect();
});
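Inside individual route handlers the authenticated user is still available. As a rough sketch (the handler body is illustrative, and db is the same database client used elsewhere in this article), the uploads route might read:

// app/api/uploads/route.ts
import { NextResponse } from 'next/server';
import { auth } from '@clerk/nextjs/server';

export async function GET() {
  const { userId } = await auth(); // Clerk v6: auth() is async in route handlers
  if (!userId) {
    return NextResponse.json({ error: 'Unauthorized' }, { status: 401 });
  }

  const uploads = await db.uploads.findMany({ where: { userId } });
  return NextResponse.json(uploads);
}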
Core Features Deep Dive
1. PDF Upload and Processing
The upload system uses UploadThing (the @uploadthing/react client components backed by a server-side file router) with custom configuration for PDF handling:
// File router for PDF uploads
import { createUploadthing, type FileRouter } from 'uploadthing/next';
import { UploadThingError } from 'uploadthing/server';
import { currentUser } from '@clerk/nextjs/server';

const f = createUploadthing();

export const ourFileRouter = {
  pdfUploader: f({ pdf: { maxFileSize: '32MB' } })
    .middleware(async () => {
      const user = await currentUser();
      if (!user) throw new UploadThingError('Unauthorized');
      return { userId: user.id };
    })
    .onUploadComplete(async ({ metadata, file }) => {
      // `db` is the app's database client; record the upload so the
      // processing pipeline can pick it up.
      await db.uploads.create({
        data: {
          userId: metadata.userId,
          fileUrl: file.url,
          status: 'pending'
        }
      });
    })
} satisfies FileRouter;
The processing pipeline follows these steps; a simplified orchestration is sketched after the list:
File upload and validation
Text extraction using LlamaParse
Structure identification
Data extraction with OpenAI
Results validation
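The helper functions below (validatePdf, extractTextWithLlamaParse, resolveSchemaForUpload, validateAgainstSchema) are illustrative names rather than the app's actual code, the Schema type is assumed to wrap a SchemaField[] as fields, and retries and error handling are omitted:

// Simplified pipeline sketch — helper names are hypothetical
async function processUpload(uploadId: string) {
  const upload = await db.uploads.findUnique({ where: { id: uploadId } });

  // 1. Validate the uploaded file (size, MIME type, page count)
  await validatePdf(upload.fileUrl);

  // 2. Extract raw text from the PDF with LlamaParse
  const text = await extractTextWithLlamaParse(upload.fileUrl);

  // 3–4. Identify the structure to use and extract data with OpenAI
  const schema = await resolveSchemaForUpload(upload);
  const extracted = await extractData(text, schema);

  // 5. Validate results against the schema before persisting
  const { valid, errors } = validateAgainstSchema(extracted, schema.fields);
  if (!valid) throw new Error(`Validation failed: ${errors.join(', ')}`);

  await db.extractions.create({ data: { uploadId, result: extracted } });
  await db.uploads.update({ where: { id: uploadId }, data: { status: 'complete' } });
}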
2. Schema Definition System
The schema builder interface allows users to define custom data structures:
interface SchemaField {
  name: string;
  type: 'string' | 'number' | 'date' | 'nested';
  required: boolean;
  validation?: {
    pattern?: string;
    min?: number;
    max?: number;
  };
  children?: SchemaField[]; // For nested structures
}
We provide default templates for common documents (an example invoice template follows the list):
Invoices
Receipts
Purchase Orders
Shipping Documents
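For example, a default invoice template expressed with the SchemaField interface above could look like this (the field names are illustrative):

const invoiceTemplate: SchemaField[] = [
  { name: 'invoiceNumber', type: 'string', required: true },
  { name: 'issueDate', type: 'date', required: true },
  { name: 'totalAmount', type: 'number', required: true, validation: { min: 0 } },
  {
    name: 'lineItems',
    type: 'nested',
    required: false,
    children: [
      { name: 'description', type: 'string', required: true },
      { name: 'quantity', type: 'number', required: true, validation: { min: 1 } },
      { name: 'unitPrice', type: 'number', required: true, validation: { min: 0 } }
    ]
  }
];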
3. Data Extraction Pipeline
The extraction process uses OpenAI's GPT-4 with custom prompting:
import OpenAI from 'openai';

const openai = new OpenAI();

const extractData = async (text: string, schema: Schema) => {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4-turbo-preview',
    // Request a JSON object so the response can be parsed reliably
    response_format: { type: 'json_object' },
    messages: [
      {
        role: 'system',
        content: 'Extract structured data according to the provided schema. Respond with JSON only.'
      },
      {
        role: 'user',
        content: `Schema: ${JSON.stringify(schema)}\nText: ${text}`
      }
    ],
    temperature: 0.1,
    max_tokens: 2000
  });

  const content = completion.choices[0].message.content;
  if (!content) throw new Error('No content returned from OpenAI');

  return JSON.parse(content);
};
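The validateAgainstSchema helper referenced in the pipeline sketch earlier can be a plain walk over the schema; a minimal version, assuming the SchemaField shape defined above:

// Minimal validation of extracted data against a list of SchemaFields
function validateAgainstSchema(
  data: Record<string, unknown>,
  fields: SchemaField[]
): { valid: boolean; errors: string[] } {
  const errors: string[] = [];

  for (const field of fields) {
    const value = data[field.name];

    if (value === undefined || value === null) {
      if (field.required) errors.push(`Missing required field: ${field.name}`);
      continue;
    }

    if (field.type === 'nested' && field.children) {
      // Validate each nested row recursively (e.g. invoice line items)
      for (const row of value as Record<string, unknown>[]) {
        errors.push(...validateAgainstSchema(row, field.children).errors);
      }
    }
  }

  return { valid: errors.length === 0, errors };
}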
4. Credit Management System
Credits are tracked as a ledger of transactions that is updated in real time as files are processed:
interface CreditTransaction {
  userId: string;
  amount: number;
  type: 'deduction' | 'addition';
  reason: 'processing' | 'refund' | 'purchase';
  metadata: {
    pageCount?: number;
    fileId?: string;
  };
}
Credit deduction follows these rules (a small calculation sketch follows the list):
1 credit per page processed
Minimum 1 credit per file
Bulk processing discounts
Refunds for failed processing
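A sketch of the deduction logic implied by these rules; the bulk-discount threshold and percentage below are placeholders, not the app's actual pricing:

// 1 credit per processed page, with a minimum of 1 credit per file
function creditsForFile(pageCount: number): number {
  return Math.max(1, pageCount);
}

// Batch cost with an illustrative bulk discount for large jobs
function creditsForBatch(pageCounts: number[]): number {
  const base = pageCounts.reduce((sum, pages) => sum + creditsForFile(pages), 0);
  const totalPages = pageCounts.reduce((sum, pages) => sum + pages, 0);
  return totalPages >= 500 ? Math.ceil(base * 0.9) : base; // hypothetical 10% discount
}

A failed file then produces a 'refund' CreditTransaction for the amount originally deducted rather than changing this calculation.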
5. Export Functionality
The export system handles nested data structures:
interface ExportOptions {
  format: 'xlsx' | 'csv' | 'json';
  flatten: boolean;
  includeMetadata: boolean;
}
import * as XLSX from 'xlsx';

const generateExport = async (data: ExtractedData[], options: ExportOptions) => {
  // Flatten nested structures into single-level rows when requested
  const rows = options.flatten ? flattenNestedStructure(data) : data;

  // xlsx shown here; the csv and json branches follow the same pattern
  const workbook = XLSX.utils.book_new();
  const worksheet = XLSX.utils.json_to_sheet(rows);
  XLSX.utils.book_append_sheet(workbook, worksheet, 'Extracted Data');

  return XLSX.write(workbook, { type: 'buffer', bookType: 'xlsx' });
};
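One simple way to implement the flattenNestedStructure helper used above is to collapse nested objects into dot-notation keys so each record maps onto a single spreadsheet row (arrays such as line items are serialized here for brevity):

function flattenNestedStructure(data: ExtractedData[]): Record<string, unknown>[] {
  const flattenOne = (obj: Record<string, unknown>, prefix = ''): Record<string, unknown> => {
    const out: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(obj)) {
      const path = prefix ? `${prefix}.${key}` : key;
      if (value && typeof value === 'object' && !Array.isArray(value)) {
        // Recurse into nested objects: { customer: { name: 'A' } } -> 'customer.name'
        Object.assign(out, flattenOne(value as Record<string, unknown>, path));
      } else {
        out[path] = Array.isArray(value) ? JSON.stringify(value) : value;
      }
    }
    return out;
  };

  return data.map((record) => flattenOne(record as Record<string, unknown>));
}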
Security and Performance
Security Measures
We implement multiple security layers:
Authentication using Clerk
File validation and sanitization
Rate limiting on API routes (sketched after this list)
Secure file storage with signed URLs
Data encryption at rest
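As one example of the rate limiting mentioned above, a naive fixed-window limiter keyed by user id can sit at the top of each route handler. This in-memory sketch is illustrative only; a multi-instance deployment would back it with a shared store such as Redis:

// Naive fixed-window rate limiter — illustrative; use a shared store in production
const WINDOW_MS = 60_000; // 1-minute window
const MAX_REQUESTS = 30;  // per user, per window
const hits = new Map<string, { count: number; windowStart: number }>();

export function checkRateLimit(userId: string): boolean {
  const now = Date.now();
  const entry = hits.get(userId);

  if (!entry || now - entry.windowStart > WINDOW_MS) {
    hits.set(userId, { count: 1, windowStart: now });
    return true;
  }

  entry.count += 1;
  return entry.count <= MAX_REQUESTS;
}

A route handler calls checkRateLimit(userId) after authentication and returns a 429 response when it fails.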
Performance Optimizations
Key optimizations include:
Streaming uploads for large files
Parallel processing where possible
Caching of extraction results (example below)
Lazy loading of heavy components
Background job processing for long-running tasks
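The extraction-result cache can key on a hash of the parsed text plus the schema, so re-uploading an identical document skips the OpenAI call entirely. A rough sketch (the extractionCache table is hypothetical):

import { createHash } from 'crypto';

// Key the cache on the text and schema so identical inputs reuse earlier results
const cacheKeyFor = (text: string, schema: Schema) =>
  createHash('sha256').update(text).update(JSON.stringify(schema)).digest('hex');

async function extractDataCached(text: string, schema: Schema) {
  const key = cacheKeyFor(text, schema);

  const cached = await db.extractionCache.findUnique({ where: { key } });
  if (cached) return cached.result;

  const result = await extractData(text, schema);
  await db.extractionCache.create({ data: { key, result } });
  return result;
}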
User Experience Considerations
The UI is built with Tailwind CSS and shadcn/ui components for a modern look:
import React from 'react';
import { Progress } from '@/components/ui/progress'; // shadcn/ui Progress (default path)

interface ProgressIndicator {
  current: number;
  total: number;
  status: 'uploading' | 'processing' | 'extracting' | 'complete';
}

const ProcessingStatus: React.FC<ProgressIndicator> = ({ current, total, status }) => {
  return (
    <div className="w-full max-w-md mx-auto">
      <Progress value={(current / total) * 100} />
      <p className="text-sm text-gray-500 mt-2">
        {status === 'uploading' ? 'Uploading files...' :
         status === 'processing' ? 'Processing PDFs...' :
         status === 'extracting' ? 'Extracting data...' :
         'Complete'}
      </p>
    </div>
  );
};
Challenges and Solutions
Large File Handling
For large PDFs, we implemented:
Chunked uploads
Progressive processing
Background workers
Status webhooks (sketched below)
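The status webhooks can be a plain route handler that background workers call as each stage finishes; a rough sketch (the route path and payload shape are illustrative, and a production version should verify a shared-secret signature):

// Illustrative status webhook receiver, e.g. app/api/webhooks/processing/route.ts
import { NextResponse } from 'next/server';

export async function POST(req: Request) {
  // Assumed payload: workers report the upload id and its new stage
  const { uploadId, status } = await req.json();

  await db.uploads.update({ where: { id: uploadId }, data: { status } });
  return NextResponse.json({ ok: true });
}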
Concurrent Processing
To handle multiple simultaneous uploads:
Queue system with Redis (see the sketch after this list)
Worker pools
Progress tracking per file
Automatic retries
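A sketch of this queue/worker split using BullMQ, one common Redis-backed job queue for Node (the article does not name a specific library, so treat this as one possible implementation):

import { Queue, Worker } from 'bullmq';

const connection = { host: process.env.REDIS_HOST ?? 'localhost', port: 6379 };

// Producer side: the API route enqueues one job per uploaded file
export const processingQueue = new Queue('pdf-processing', { connection });

export const enqueueUpload = (uploadId: string) =>
  processingQueue.add(
    'process',
    { uploadId },
    { attempts: 3, backoff: { type: 'exponential', delay: 5000 } } // automatic retries
  );

// Consumer side: a worker pool runs the pipeline with bounded concurrency
new Worker(
  'pdf-processing',
  async (job) => {
    await processUpload(job.data.uploadId); // pipeline sketched earlier
  },
  { connection, concurrency: 4 }
);

Per-file progress can then be written back to the uploads table or reported through the status webhooks above.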
Future Improvements
Planned enhancements include:
AI-powered schema suggestions
Improved accuracy through feedback loops
Additional export formats
Batch processing optimizations
Conclusion
Building a robust PDF data extraction system requires careful consideration of various technical aspects, from file handling to user experience. This implementation provides a scalable foundation that can be extended based on specific needs.