Optical Character Recognition, or OCR, is primarily used to turn the text from scanned images into selectable, copyable, encoded, embedded text. Many modern desktop and mobile applications and scanner software stacks have some OCR functionality built in, and most circulating PDFs have text embedded. However, you may still encounter documents or images that contain significant amounts of non-embedded text which cannot be automatically extracted. In this case, you can use a pipeline of open source tools to automatically perform OCR. This is especially useful if you are ingesting documents or images to a web application that needs to extract text, or if you are working with a large corpus of documents that need to have their full text indexed.

This tutorial will cover setting up an OCR pipeline using Ghostscript, Tesseract, and PDFtk. You will also review other tools that can be used instead of or in addition to this baseline functionality. These tools are available on most platforms. This tutorial will provide installation instructions for an Ubuntu 22.04 server, following our guide to Initial Server Setup with Ubuntu 22.04.

Step 1 – Installing Ghostscript, Tesseract, and PDFtk

OCR can be performed on both PDFs (which contain, and are sometimes rendered as, images) and standalone images. Working with PDFs adds some extra steps, which you can skip if you are working with images by themselves. You will need three tools for the end-to-end pipeline: Ghostscript, which handles all kinds of PDF-to-image conversion and vice versa (it was originally created as an interpreter for PostScript, the predecessor technology to PDF); Tesseract, an open source OCR engine which, like Ghostscript, has been developed continuously since the 1980s; and PDFtk, a smaller utility for slicing or reconstructing PDFs from individual pages.

All three applications are available in Ubuntu’s default repositories, and can be installed with the apt package manager. Update your package sources with apt update and then use apt install to install them:

    sudo apt update
    sudo apt install pdftk ghostscript tesseract-ocr x11-utils

You should now have three new commands present, one for each application, which you can verify by using the which command. You’ll use these commands to perform OCR in the next step.

Step 2 – Converting PDFs to Images and Running Tesseract

If you don’t already have a PDF that you want to perform OCR on, you can follow along with this tutorial by downloading this sample PDF, which was scanned without any embedded text. To download the PDF onto your server, you can use curl with the -O flag to save it to your current directory under the same file name.

If you’re working with one or more PDFs, you’ll need to convert them to individual images before they can be used as OCR sources. This can be done using a Ghostscript command. You’ll need to include additional parameters to maintain consistency around DPI, color space, and dimensions. First, create a working output directory for the files created by this process, then run gs:

    mkdir output
    gs -o output/%05d.png -sDEVICE=png16m -r300 -dPDFFitPage=true OCR-sample-paper.pdf

This gs command specifies the output path before the rest of the command, using the -o flag. %05d is printf-style format syntax that Ghostscript understands natively. In this case, it means to name the output PNG files from the input PDF using automatically incremented, 5-digit numbers. You may see this used in other, older, command line applications. After adding some PNG formatting syntax and a DPI of -r300, provide the path to OCR-sample-paper.pdf or your chosen input file. Ghostscript will output every page in the PDF individually.

If you only need an output image, you can skip ahead to the final steps of this tutorial to learn more about bulk text extraction options. If you’re using a PDF, you’ll reconstruct and finalize it in the next step.

Step 3 (Optional) – Rebuilding PDFs from Image Output

If you used a PDF as input in the last step, you’ll now need to use PDFtk and Ghostscript again to put it back together from the individual pages produced by Tesseract. Because they are numbered sequentially, you can use shell syntax to pass an ordered list of files to the pdftk cat command, to concatenate them:

    pdftk output/*.pdf cat output joined.pdf

You now have a single PDF reconstructed from your Tesseract output named joined.pdf.
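The %05d pattern in the gs command and the output/*.pdf glob passed to pdftk depend on each other: zero-padded names sort in page order, so the glob hands pdftk the pages in sequence. The following is a minimal sketch of that behavior only. It does not call gs, tesseract, or pdftk. The page_name helper and the temporary directory are hypothetical names introduced for illustration, and the sketch assumes page numbering starts at 1.

```shell
#!/usr/bin/env bash
# Sketch: how %05d zero-pads page numbers, and why that keeps a
# shell glob in page order. No PDF tools are invoked here.
set -euo pipefail

# Hypothetical helper: the file name that the pattern
# output/%05d.png would produce for page number $1.
page_name() {
  printf 'output/%05d.png\n' "$1"
}

page_name 1
page_name 12

# Create empty stand-in files named like per-page PDFs.
workdir=$(mktemp -d)
for n in 1 2 10 11; do
  touch "$workdir/$(printf '%05d' "$n").pdf"
done

# A glob expands in sorted order; with zero padding, sorted order
# matches numeric page order (page 2 comes before page 10).
ordered=$(cd "$workdir" && echo *.pdf)
echo "$ordered"

rm -rf "$workdir"
```

Without the padding, names such as 10.pdf would sort ahead of 2.pdf, and the concatenated PDF would come out with its pages shuffled.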