To extract text from a scanned PDF, you need OCR (Optical Character Recognition). When you scan a paper document to PDF, the result is a photograph of each page stored inside a PDF container. The text is not real text; it is just pixels shaped like letters, so you cannot select, search, or copy it. OCR analyses that image and converts it into actual, editable text. The OCR PDF tool does this for free, in your browser.
How to Extract Text from a Scanned PDF — Step by Step
1. Open the OCR PDF tool and upload your scanned PDF.
2. Select the language of the document from the dropdown. This significantly improves recognition accuracy.
3. Click "Extract Text." The tool renders each page and runs OCR on it.
4. Review the extracted text displayed on screen. Look for common OCR errors: 0/O, 1/l, and rn/m substitutions.
5. Click "Copy" to put the text on your clipboard so you can paste it elsewhere, or "Download" to save it as a .txt file.
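The review in step 4 can be partly automated. The following is a minimal sketch, not part of the tool: it flags words that mix letters with the digits 0 or 1, or digits with the letters O, o, I, or l, since those are the places where 0/O and 1/l substitutions tend to hide. The names `SuspectToken` and `findSuspectTokens` are hypothetical. The heuristic will also flag legitimate codes such as part numbers, and it cannot detect rn/m substitutions, which need a dictionary check.

```typescript
// A word flagged for manual review, with its position in the OCR output.
interface SuspectToken {
  token: string; // the word exactly as it appears in the extracted text
  index: number; // character offset of the word within the text
}

// Flag words that mix confusable characters: a 0 or 1 next to letters,
// or an O, o, I, or l next to digits (for example "2O24" or "l00").
function findSuspectTokens(text: string): SuspectToken[] {
  const suspects: SuspectToken[] = [];
  const wordPattern = /[A-Za-z0-9]+/g;
  let match: RegExpExecArray | null;
  while ((match = wordPattern.exec(text)) !== null) {
    const token = match[0];
    const hasLetter = /[A-Za-z]/.test(token);
    const hasDigit = /[0-9]/.test(token);
    const hasConfusableDigit = /[01]/.test(token);
    const hasConfusableLetter = /[OoIl]/.test(token);
    if ((hasConfusableDigit && hasLetter) || (hasConfusableLetter && hasDigit)) {
      suspects.push({ token, index: match.index });
    }
  }
  return suspects;
}
```

Calling `findSuspectTokens("Invoice 2O24 total l00")` flags "2O24" and "l00" and leaves "Invoice" and "total" alone. Each flagged word still needs a human decision, because the function only points at likely trouble spots.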
How OCR Works
OCR software analyses a scanned image to find shapes that correspond to characters. Older engines compared each shape against a database of known letter forms and output the most probable match. Modern OCR engines, including the Tesseract.js engine used in this tool, rely on neural networks trained on large collections of text images. In Tesseract's case the network reads a whole line of text at a time, which gives considerably higher accuracy than matching characters one by one.
The OCR PDF tool renders each PDF page as a high-resolution image, then runs Tesseract.js on each page. Everything happens locally in your browser — your scanned PDF is never uploaded to any server.
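Rendering a page "as a high-resolution image" comes down to choosing a scale factor. PDF page sizes are measured in points, with 72 points to the inch, so a target resolution in DPI translates directly into a scale and a pixel size. The sketch below shows that arithmetic only. The function names are hypothetical, and the resolution the tool actually renders at is not stated here.

```typescript
// PDF page dimensions are expressed in points: 1 point = 1/72 inch.
const POINTS_PER_INCH = 72;

interface PixelSize {
  width: number;
  height: number;
}

// Scale factor to apply to a page so that it renders at the given DPI.
function renderScaleForDpi(dpi: number): number {
  return dpi / POINTS_PER_INCH;
}

// Pixel dimensions of a page (given in points) when rendered at the given DPI.
function pixelSizeAtDpi(widthPt: number, heightPt: number, dpi: number): PixelSize {
  return {
    width: Math.round((widthPt * dpi) / POINTS_PER_INCH),
    height: Math.round((heightPt * dpi) / POINTS_PER_INCH),
  };
}
```

For a US Letter page (612 by 792 points), `pixelSizeAtDpi(612, 792, 300)` gives 2550 by 3300 pixels. Higher resolutions give the OCR engine more detail per character, at the cost of more memory and longer processing per page.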
What Affects OCR Accuracy
The quality of the extracted text depends heavily on the quality of the original scan:
- Resolution: 200 DPI is the minimum for good results; 300 DPI is recommended. Below 150 DPI, characters are too small to recognise reliably.
- Contrast: text should be dark on a light background. Faded, coloured, or low-contrast pages significantly reduce accuracy.
- Page alignment: a page tilted even 5–10 degrees reduces accuracy noticeably. Use your scanner's auto-straighten feature if available.
- Font style: clean serif and sans-serif fonts are recognised very accurately. Decorative fonts, handwriting, and script fonts are more difficult.
- Page condition: crumpled, torn, or water-damaged pages produce more errors.
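If you know the pixel width of a scanned page and the width of the paper, you can check the resolution point above yourself. The sketch below is illustrative only and uses hypothetical names. The 150, 200, and 300 DPI thresholds come from the list above. The label "marginal" for the 150 to 199 DPI band is an assumption, because the list does not name that range.

```typescript
type ScanQuality = "too low" | "marginal" | "acceptable" | "recommended";

// Effective resolution of a scan: pixels across divided by inches across.
function estimateDpi(pixelWidth: number, pageWidthInches: number): number {
  return pixelWidth / pageWidthInches;
}

// Map a resolution onto the thresholds given in the accuracy list.
function classifyDpi(dpi: number): ScanQuality {
  if (dpi < 150) return "too low";    // characters too small to recognise reliably
  if (dpi < 200) return "marginal";   // assumption: band not named in the text
  if (dpi < 300) return "acceptable"; // 200 DPI is the stated minimum for good results
  return "recommended";               // 300 DPI and above
}
```

A Letter-width page (8.5 inches) that is 2550 pixels wide works out to 300 DPI, so `classifyDpi(estimateDpi(2550, 8.5))` returns "recommended". The same page at 1000 pixels wide is under 118 DPI and would be classed as "too low".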
Tip: If your scan quality is poor, increase the contrast to near-maximum in an image editor before running OCR. This single step can dramatically improve results on faded or washed-out documents.
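To show what a contrast boost does to the pixels, here is a minimal sketch of linear contrast stretching on grayscale values (0 is black, 255 is white). This is one common technique chosen for illustration. It is not necessarily what an image editor's contrast slider does, and it is not part of the OCR PDF tool. The function name `stretchContrast` is hypothetical, and the input is assumed to be a plain array of grayscale values that you have already read from the image.

```typescript
// Linearly rescale grayscale values so the darkest pixel becomes 0
// and the lightest becomes 255, spreading everything else in between.
function stretchContrast(gray: readonly number[]): number[] {
  if (gray.length === 0) return [];
  const min = gray.reduce((a, b) => Math.min(a, b), 255);
  const max = gray.reduce((a, b) => Math.max(a, b), 0);
  if (max === min) return gray.slice(); // flat image: nothing to stretch
  return gray.map((v) => Math.round(((v - min) * 255) / (max - min)));
}
```

A washed-out scan whose values only span 50 to 200 is spread across the full range: `stretchContrast([50, 100, 150, 200])` returns `[0, 85, 170, 255]`. Faded ink ends up darker and grey paper ends up whiter, which is the effect the tip above is after.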
OCR vs. PDF to Word — Which Should You Use?
Both tools extract text from scanned PDFs, but they serve different purposes:
- OCR PDF → extracts raw text only. Fast, simple. Use when you need to copy specific text, numbers, or names from a scanned page.
- PDF to Word → extracts text and attempts to preserve document structure (headings, paragraphs, tables). Use when you need a fully formatted, editable document.
- For quick copying: OCR PDF is faster. For full document editing: PDF to Word gives better structure.
Supported Languages
The OCR PDF tool supports English, Spanish, French, German, and Simplified Chinese. Always select the correct language from the dropdown before extracting — using the wrong language model significantly degrades accuracy.
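Tesseract identifies each language model by a short code. The sketch below maps the five supported languages to the standard Tesseract codes for them. How the tool wires its dropdown to these codes is an assumption, and the names `TESSERACT_LANG_CODES` and `langCodeFor` are hypothetical.

```typescript
// Standard Tesseract language codes for the languages listed above.
const TESSERACT_LANG_CODES = {
  English: "eng",
  Spanish: "spa",
  French: "fra",
  German: "deu",
  "Simplified Chinese": "chi_sim",
} as const;

type SupportedLanguage = keyof typeof TESSERACT_LANG_CODES;

// Look up the language code for a dropdown choice.
function langCodeFor(language: SupportedLanguage): string {
  return TESSERACT_LANG_CODES[language];
}
```

For example, `langCodeFor("German")` returns "deu". Because `SupportedLanguage` is derived from the map's keys, passing a language that is not in the list is rejected at compile time instead of silently falling back to the wrong model.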
Tips for Better OCR Results
- Scan at 300 DPI where possible (200 DPI is the minimum for good results). This is the most impactful single improvement you can make.
- Scan in grayscale or black-and-white for text documents (colour scans are larger and sometimes lower contrast).
- Flatten the document before scanning. Pages that curve or bend, such as those in a bound book, distort the text and reduce accuracy in the affected areas.
- After extraction, use Find & Replace in Word to check for common substitutions: search for the numeral "1" to find places where it replaced the letter "l", and search for the letter "l" to find the reverse.
- For multi-column documents, OCR may merge columns incorrectly. Manually check the reading order of extracted text.