When you own a big collection of PDF files the used storage space can increasing quite high. Sometimes I own PDF documents with more than 100 MB. Well nowadays this storage capacities are not a big issue. But if you want to backup those files to other mediums like USB pen drives or a DVD it would be great to reduce the file size of you PDF collection.
Long a go I worked with a little scrip that allowed me to reduce the file size of a PDF document significantly. This script called a interactive tool called PDF Sam with some command line parameters. Unfortunately many years ago the software PDF Sam become with this option commercial, so I was needed a new solution.
Before I go closer to my approach I will discuss some basic information about what happens in the background. As first, when your PDF blew up to a huge file is the reason because of the included graphics. If you scanned you handwritten notes to save them in one single archive you should be aware that every scan is a image file. By default the PDF processor already optimize those files. This is why the file size almost don’t get reduced when you try to compress them by a tool like zip.
Scanned images can optimized before to include them to a PDF document by a graphic tool like Gimp. Actions you can perform are reduce the image quality and increase the contrast. Specially for scanned handwritten notes are this steps important. If the contrast is very low and maybe you plan to print those documents, it could happens they are not readable. Another problem in this case is that you can’t apply a text search over the document. A solution to this problem is the usage of an OCR tool to transform text in images back to real text.
We resume shortly the previous minds. When we try to reduce the file size of a PDF we need to reduce the quality of the included images. This can be done by reducing the amount of dots per inch (dpi). Be aware that after the compression the image is still readable. As long you do not plan to do a high quality print like a magazine or a book, nothing will get affected.
When we wanna reduce plenty PDF files in a short time we can’t do all those actions by hand. For instance we need an automated solution. To reach the goal it is important that the tool we use support the command line. The we can create a simple batch job to perform the task without any hands on.
We have several options to optimize the images inside a PDF. If it is a great idea to perform all options, depend on the purpose of the usage.
- change the image file to the PNG format
- reduce the graphic dimensions to the real printable area
- reduce the DPI
- change the image color profile to gray-scale
As Ubuntu Linux user I have all of the things I need already together. And now comes the part that I explain you my well working solution.
GPL Ghostscript is used for PostScript/PDF preview and printing. Usually as a back-end to a program such as ghostview, it can display PostScript and PDF documents in an X11 environment.
If you don’t have Ghostscript installed on you system, you can do this very fast.
sudo apt-get update sudo apt-get -y install ghostscript
Before you execute any script or command be aware you do not overwrite with the output the existing files. In the case something get wrong you loose all originals to try other options. Before you start to try out anything backup your files or generate the compressed PDF in a separate folder.
gs -sDEVICE=pdfwrite \ -dCompatibilityLevel=1.4 \ -dPDFSETTINGS=/default \ -dNOPAUSE -dQUIET -dBATCH -dDetectDuplicateImages \ -dCompressFonts=true \ -r150 \ -sOutputFile=output.pdf input.pdf
The important parameter is r150, which reduce the output resolution to 150 dpi. In the manage you can check for more parameters to compress the result more stronger. The given command you are able to place in a script, were its surrounded by a FOR loop to fetch all PDF files in a directory, to write them reduced in another directory.
The command I used for a original file with 260 MB and 640 pages. After the operation was done the size got reduced to around 36 MB. The shrunken file is almost 7 times smaller than the original. A huge different. As you can see in the screenshot, the quality of the pictures is almost identical.
As alternative, in the case you won’t come closer to the command line there is a online PDF compression tool in German and English language for free use available.
Linux Systems have many powerful tools to deal with PDF documents. For example the Libreoffice Suite have a button where you can generate for every document a proper PDF file. But sometimes you wish to create a PDF in the printing dialog of any other application in your system. With the cups PDF print driver you enable this functionality on your system.
sudo apt-get install printer-driver-cups-pdf
As I already explained, OCR allows you to extract from graphics text to make a document searchable. When you need to work with this type of software be aware that the result is good, but you cant avoid mistakes. Even when you perform an OCR on a scanned book page, you will find several mistakes. OCRFeeder is a free and very powerful solution for Linux systems.
Another powerful helper is the tool PDF Arranger which allows you to add or remove pages to an existing PDF. You are also able to change the order of the pages.