You will have a lot of trouble doing that with coordinates. Then would come the task of ordering it visually.

Yes, with Ghostscript, you can extract text from PDFs. But no, it is not the best tool for the job.

What you can do: extract the text of a certain range of pages only. Using the txtwrite device together with -dFirstPage=3 and -dLastPage=5, Ghostscript will output all text contained on pages 3-5 to stdout. Recent versions of Ghostscript have seen major improvements and bug fixes in the txtwrite device.
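
The page-range extraction described above can be sketched as a small shell function. This is only a sketch: the function name `gs_page_text` and its argument order are made up for illustration, and `-q` is added on the assumption that you want Ghostscript's startup banner kept out of the text on stdout.

```shell
# Sketch: print the text of pages FIRST..LAST of a PDF to stdout
# using Ghostscript's txtwrite device.
# gs_page_text is a hypothetical helper name, not part of Ghostscript.
gs_page_text() {
    pdf=$1 first=$2 last=$3
    # -q               suppress the startup banner (it would otherwise mix into stdout)
    # -sOutputFile=-   write the extracted text to stdout
    gs -q -dBATCH -dNOPAUSE -sDEVICE=txtwrite \
       -dFirstPage="$first" -dLastPage="$last" \
       -sOutputFile=- "$pdf"
}
```

For pages 3-5 you would call it as `gs_page_text /path/to/your/pdf 3 5`.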

This one requires you to download the latest version of the file ps2ascii.ps. If the -dSIMPLE parameter is not defined, each output line contains, beyond the pure text content, some additional info about the fonts and font sizes used. If you replace that parameter with -dCOMPLEX, you'll get additional info about the colors and images used. Read the comments inside ps2ascii.ps. It's not comfortable to use, but for me it worked in most cases where I needed it.

pdftotext is a utility based either on Poppler or on XPDF. It can extract text from a range of PDF pages, like the OP asked for.
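
A page-range call to pdftotext (the Poppler/XPDF-based utility) might look like the sketch below. The wrapper name `pdf_page_text` is hypothetical; `-f`, `-l` and `-layout` are standard pdftotext options, and a trailing `-` as the output file sends the text to stdout.

```shell
# Sketch: print the text of pages FIRST..LAST of a PDF to stdout
# using pdftotext, keeping the physical page layout.
# pdf_page_text is a hypothetical helper name, not part of Poppler or XPDF.
pdf_page_text() {
    pdf=$1 first=$2 last=$3
    # -f / -l   first and last page to convert
    # -layout   maintain the original physical layout as far as possible
    # -         write to stdout instead of a .txt file
    pdftotext -f "$first" -l "$last" -layout "$pdf" -
}
```

For pages 3-5 you would call it as `pdf_page_text /path/to/your/pdf 3 5`.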

pdftotext works best if used with the -layout parameter.

TET has a commandline interface, and it's the most powerful of all text extraction tools I'm aware of. TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded from or included in the text extraction.

Which version of Ghostscript is needed to use the txtwrite device?

