Refactoring – A Short Story of Failure
For my small open source project TP-CORE, which you can find on GitHub, I had the great idea of replacing the iText library with OpenPDF. After making a plan for how to reach my goal, I started all the necessary activities. But in real life things are never as easy as we originally imagined. In this talk you will learn what exactly happened. I talk about my motivation, why I wanted the change, and how I planned to bring all activities to a successful end. You will learn what it was like when I reached the point where I realized I would not reach my goal this way. I briefly explain what I did so that this short adventure did not affect the rest of the project.
When you own a big collection of PDF files, the storage space they use can grow considerably. Some of my PDF documents are larger than 100 MB. Nowadays storage capacity is not a big issue. But if you want to back up those files to other media like USB pen drives or a DVD, it would be great to reduce the file size of your PDF collection.
A long time ago I worked with a little script that allowed me to reduce the file size of a PDF document significantly. This script called an interactive tool named PDF Sam with some command line parameters. Unfortunately, many years ago this option of PDF Sam became commercial, so I needed a new solution.
Before I go into my approach, I will discuss some basic information about what happens in the background. First, when a PDF blows up to a huge file, the reason is usually the included graphics. If you scan your handwritten notes to save them in one single archive, you should be aware that every scan is an image file. By default the PDF processor already optimizes those files. This is why the file size hardly shrinks when you try to compress them with a tool like zip.
Scanned images can be optimized with a graphics tool like Gimp before they are included in a PDF document. Actions you can perform are reducing the image quality and increasing the contrast. These steps are especially important for scanned handwritten notes. If the contrast is very low and you plan to print those documents, they may turn out unreadable. Another problem in this case is that you can't run a text search over the document. A solution to this problem is to use an OCR tool that transforms text in images back into real text.
Let us briefly summarize the previous thoughts. When we try to reduce the file size of a PDF, we need to reduce the quality of the included images. This can be done by reducing the number of dots per inch (dpi). Make sure that the image is still readable after the compression. As long as you do not plan a high quality print like a magazine or a book, nothing will be affected.
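To get a feeling for how much the resolution matters, the pixel dimensions of a scanned page can be worked out from the paper size and the dpi value. The following small sketch assumes an A4 page (210 x 297 mm), which is my assumption for the example; the function name page_pixels is made up for illustration.

```shell
# Pixel dimensions of a scanned A4 page (210 x 297 mm, an assumption) at a given dpi.
# One inch is 25.4 mm, so pixels = millimetres * dpi / 25.4.
page_pixels() {
    awk -v dpi="$1" 'BEGIN { printf "%.0f %.0f\n", 210 * dpi / 25.4, 297 * dpi / 25.4 }'
}

page_pixels 300   # width and height in pixels at 300 dpi
page_pixels 150   # width and height in pixels at 150 dpi
```

Halving the dpi halves both the width and the height, so the image holds only a quarter of the pixels.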
When we want to shrink plenty of PDF files in a short time, we can't do all those actions by hand. Instead, we need an automated solution. To reach this goal it is important that the tool we use supports the command line. Then we can create a simple batch job that performs the task without any manual work.
We have several options to optimize the images inside a PDF. Whether it is a good idea to apply all of them depends on how the document will be used.
change the image file to the PNG format
reduce the graphic dimensions to the real printable area
reduce the DPI
change the image color profile to gray-scale
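Two of these options, reducing the DPI and switching to grayscale, can be expressed as parameters of Ghostscript, the tool I use further down. The following is only a sketch: gs_image_flags is a hypothetical helper name, and the options it prints are parameters of Ghostscript's pdfwrite device. Converting images to PNG or cropping them to the printable area is not covered here.

```shell
# Print Ghostscript (pdfwrite) options for two of the listed optimizations.
# gs_image_flags is a hypothetical helper name chosen for this sketch.
gs_image_flags() {
    local dpi="$1" gray="${2:-no}"
    # Downsample color, grayscale and monochrome images to the requested resolution.
    printf '%s\n' \
        "-dDownsampleColorImages=true" "-dColorImageResolution=$dpi" \
        "-dDownsampleGrayImages=true"  "-dGrayImageResolution=$dpi" \
        "-dDownsampleMonoImages=true"  "-dMonoImageResolution=$dpi"
    # Optionally convert all colors to grayscale.
    if [ "$gray" = "gray" ]; then
        printf '%s\n' "-sColorConversionStrategy=Gray" "-dProcessColorModel=/DeviceGray"
    fi
}

# Options for 150 dpi images in grayscale
gs_image_flags 150 gray
```

Whether grayscale is acceptable depends on the document; for handwritten notes it usually is, for colored diagrams it may not be.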
As an Ubuntu Linux user I already have everything I need at hand. Now comes the part where I explain my well-working solution.
GPL Ghostscript is used for PostScript/PDF preview and printing. Usually as a back-end to a program such as ghostview, it can display PostScript and PDF documents in an X11 environment.
If Ghostscript is not installed on your system, you can install it quickly.
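On Ubuntu the package in the standard repositories is named ghostscript and the binary is called gs. The following sketch only checks whether gs is available and prints the install command if it is missing; check_ghostscript is a name chosen for this example.

```shell
# Report whether Ghostscript is available; print the install command if it is not.
# check_ghostscript is a hypothetical helper name chosen for this sketch.
check_ghostscript() {
    if command -v gs >/dev/null 2>&1; then
        echo "Ghostscript $(gs --version) is installed"
    else
        echo "Ghostscript is missing, install it with: sudo apt-get install ghostscript"
    fi
}

check_ghostscript
```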
Before you execute any script or command, make sure the output does not overwrite the existing files. If something goes wrong, you lose all originals and can't try other options. So before you try anything out, back up your files or generate the compressed PDFs in a separate folder.
The important parameter is -r150, which reduces the output resolution to 150 dpi. In the man page you can look up further parameters to compress the result even more. You can place the command in a script, where it is surrounded by a for loop that fetches all PDF files in one directory and writes the reduced versions to another directory.
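A minimal sketch of such a script could look like this. Besides -r150 it assumes the pdfwrite device and the /ebook preset, which downsamples images to 150 dpi; these additional options are my assumption for the example. The function name compress_pdfs and the GS_BIN variable are chosen for this sketch only.

```shell
# Compress every PDF in a source directory into a separate target directory.
# compress_pdfs and GS_BIN are names chosen for this sketch.
compress_pdfs() {
    local src="$1" dst="$2" gs_bin="${GS_BIN:-gs}" pdf
    mkdir -p "$dst"
    # Refuse to write into the source directory so originals are never overwritten.
    if [ "$(cd "$src" && pwd)" = "$(cd "$dst" && pwd)" ]; then
        echo "source and target directory must differ" >&2
        return 1
    fi
    for pdf in "$src"/*.pdf; do
        [ -e "$pdf" ] || continue   # skip if the directory contains no PDF files
        "$gs_bin" -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
            -dPDFSETTINGS=/ebook -r150 \
            -dNOPAUSE -dQUIET -dBATCH \
            -sOutputFile="$dst/$(basename "$pdf")" "$pdf"
    done
}

# Usage (directory names are placeholders):
# compress_pdfs "$HOME/scans" "$HOME/scans-compressed"
```

Writing into a second directory keeps the originals untouched, so you can repeat the run with other parameters if the quality is not good enough.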
I used the command on an original file with 260 MB and 640 pages. After the operation was done, the size was reduced to around 36 MB. The shrunken file is roughly 7 times smaller than the original. A huge difference. As you can see in the screenshot, the quality of the pictures is almost identical.
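If you want to check the result for your own files, the ratio can be computed directly from the file sizes. The helper below is a small sketch; size_ratio is a made-up name, and the last line simply repeats the calculation with the figures from above.

```shell
# Print how many times smaller the compressed file is compared with the original.
# size_ratio is a hypothetical helper name chosen for this sketch.
size_ratio() {
    local before after
    before="$(wc -c < "$1")"   # size of the original in bytes
    after="$(wc -c < "$2")"    # size of the compressed file in bytes
    awk -v b="$before" -v a="$after" 'BEGIN { printf "%.1f\n", b / a }'
}

# The figures from the text: 260 MB before, 36 MB after.
awk 'BEGIN { printf "%.1f\n", 260 / 36 }'

# Usage (file names are placeholders):
# size_ratio original.pdf compressed.pdf
```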
As an alternative, in case you prefer to stay away from the command line, there is an online PDF compression tool in German and English available for free use.
Linux systems have many powerful tools to deal with PDF documents. For example, the LibreOffice suite has a button that generates a proper PDF file from any document. But sometimes you wish to create a PDF from the printing dialog of any other application on your system. The CUPS PDF printer driver enables this functionality on your system.
sudo apt-get install printer-driver-cups-pdf
As I already explained, OCR allows you to extract text from graphics to make a document searchable. When you work with this type of software, be aware that the result is good, but you can't avoid mistakes. Even when you perform OCR on a scanned book page, you will find several mistakes. OCRFeeder is a free and very powerful solution for Linux systems.
Another powerful helper is the tool PDF Arranger, which allows you to add pages to or remove pages from an existing PDF. You are also able to change the order of the pages.