PDF generation with TP-CORE

PDF is arguably the most important interoperable document format between operating systems and devices such as computers, smartphones, and tablets. The most important characteristic of PDF is the immutability of the documents. Of course, there are limits, and nowadays, pages can easily be removed or added to a PDF document using appropriate programs. However, the content of individual pages cannot be easily modified. Where years ago expensive specialized software such as Adobe Acrobat was required to work with PDFs, many free solutions are now available, even for commercial use. For example, the common office suites support exporting documents as PDFs.

Especially in the business world, PDF is an excellent choice for generating invoices or reports. This article focuses on this topic. I describe how simple PDF documents can be used for invoicing, etc. Complex layouts, such as those used for magazines and journals, are not covered in this article.

Technically, the freely available library openPDF is used. The concept is to generate a PDF from valid HTML code contained in a template. Since we’re working in the Java environment, Velocity is the template engine of choice. TP-CORE utilizes all these dependencies in its PdfRenderer functionality, which is implemented as a facade. This is intended to encapsulate the complexity of PDF generation within the functionality and facilitate library replacement.

While I’m not a PDF generation specialist, I’ve gained considerable experience over the years, particularly with this functionality, in terms of software project maintainability. I even gave a conference presentation on this topic. Many years ago, when I decided to support PDF in TP-CORE, the only available library was iText, which at that time was freely available with version 5. Similar to Linus Torvalds with his original source control management system, I found myself in a similar situation. iText became commercial, and I needed a new solution. Well, I’m not Linus, who can just whip up Git overnight, so after some waiting, I discovered openPDF. OpenPDF is a fork of iText 5. Now I had to adapt my existing code accordingly. This took some time, but thanks to my encapsulation, it was a manageable task. However, during this adaptation process, I discovered problems that were already causing me difficulties in my small environment, so I released TP-CORE version 3.0 to achieve functional stability. Therefore, anyone using TP-CORE version 2.x will still find iText5 as the PDF solution. But that’s enough about the development history. Let’s look at how we can currently generate PDFs with TP-CORE 3.1. Here, too, my main goal is to achieve the highest possible compatibility.

Before we can begin, we need to include TP-CORE as a dependency in our Java project. This example demonstrates the use of Maven as the build tool, but it can easily be switched to Gradle.

<dependency>
    <groupId>io.github.together.modules</groupId>
    <artifactId>core</artifactId>
    <version>3.1.0</version>
</dependency>

To generate an invoice, for example, we need several steps. First, we need an HTML template, which we create using Velocity. The template can, of course, contain placeholders for names and addresses, enabling mail merge and batch processing. The following code demonstrates how to work with this.

<h1>HTML to PDF Velocity Template</h1>

<p>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet.</p>

<p>Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.</p>

<h2>Lists</h2>
<ul>
    <li><b>bold</b></li>
    <li><i>italic</i></li>
    <li><u>underline</u></li>
</ul>

<ol>
    <li>Item</li>
    <li>Item</li>
</ol>

<h2>Table 100%</h2>
<table border="1" style="width: 100%; border: 1px solid black;">
    <tr>
        <td>Property 1</td>
        <td>$prop_01</td>
        <td>Lorem ipsum dolor sit amet,</td>
    </tr>
    <tr>
        <td>Property 2</td>
        <td>$prop_02</td>
        <td>Lorem ipsum dolor sit amet,</td>
    </tr>
    <tr>
        <td>Property 3</td>
        <td>$prop_03</td>
        <td>Lorem ipsum dolor sit amet,</td>
    </tr>
</table>

<img src="path/to/myImage.png" />

<h2>Links</h2>
<p>here we try a simple <a href="https://together-platform.org/tp-core/">link</a></p>

template.vm

The template contains three properties: $prop_01, $prop_02, and $prop_03, which we need to populate with values. We can do this with a simple HashMap, which we insert using the TemplateRenderer, as the following example shows.

Map<String, String> properties = new HashMap<>();
properties.put("prop_01", "value 01");
properties.put("prop_02", "value 02");
properties.put("prop_03", "value 03");

TemplateRenderer templateEngine = new VelocityRenderer();
String directory = Constraints.SYSTEM_USER_HOME_DIR + "/";
String htmlTemplate = templateEngine
                .loadContentByFileResource(directory, "template.vm", properties);

The TemplateRenderer requires three arguments:

  • directory – The full path where the template is located.
  • template – The name of the Velocity template.
  • properties – The HashMap containing the variables to be replaced.

The result is valid HTML code, from which we can now generate the PDF using the PdfRenderer. A special feature is the constant SYSTEM_USER_HOME_DIR, which points to the user directory of the currently logged-in user and handles differences between operating systems such as Linux and Windows.

PdfRenderer pdf = new OpenPdfRenderer();
pdf.setTitle("Title of my own PDF");
pdf.setAuthor("Elmar Dott");
pdf.setSubject("A little description about the PDF document.");
pdf.setKeywords("pdf, html, openPDF");
pdf.renderDocumentFromHtml(directory + "myOwn.pdf", htmlTemplate);

The code for creating the PDF is clear and quite self-explanatory. It’s also possible to hardcode the HTML. Using the template allows for a separation of code and design, which also makes later adjustments flexible.

The default format is A4 with the dimensions size: 21cm 29.7cm; margin: 20mm 20mm; and is defined as inline CSS. This value can be customized using the setFormat(String format); method. The first value represents the width and the second value the height.

The PDF renderer also allows you to delete individual pages from a document.

PdfRenderer pdf = new OpenPdfRenderer();
PdfDocument original = pdf.loadDocument( new File("document.pdf") );
PdfDocument reduced = pdf.removePage(original, 1, 5);
pdf.writeDocument(reduced, "reduced.pdf");

We can see, therefore, that great emphasis was placed on ease of use when implementing the PDF functionality. Nevertheless, there are many ways to create customized PDF documents. This makes the solution of defining the layout via HTML particularly appealing. Despite the PdfRenderer’s relatively limited functionality, Java’s inheritance mechanism makes it very easy to extend the interface and its implementation with custom solutions. This possibility also exists for all other implementations in TP-CORE.

TP-CORE is also free for commercial use and has no restrictions thanks to the Apache 2 license. The source code can be found at https://git.elmar-dott.com and the documentation, including security and test coverage, can be found at https://together-platform.org/tp-core/.

How to reduce the size of a PDF document

When you own a big collection of PDF files the used storage space can increasing quite high. Sometimes I own PDF documents with more than 100 MB. Well nowadays this storage capacities are not a big issue. But if you want to backup those files to other mediums like USB pen drives or a DVD it would be great to reduce the file size of you PDF collection.

Long a go I worked with a little scrip that allowed me to reduce the file size of a PDF document significantly. This script called a interactive tool called PDF Sam with some command line parameters. Unfortunately many years ago the software PDF Sam become with this option commercial, so I was needed a new solution.

Before I go closer to my approach I will discuss some basic information about what happens in the background. As first, when your PDF blew up to a huge file is the reason because of the included graphics. If you scanned you handwritten notes to save them in one single archive you should be aware that every scan is a image file. By default the PDF processor already optimize those files. This is why the file size almost don’t get reduced when you try to compress them by a tool like zip.

Scanned images can optimized before to include them to a PDF document by a graphic tool like Gimp. Actions you can perform are reduce the image quality and increase the contrast. Specially for scanned handwritten notes are this steps important. If the contrast is very low and maybe you plan to print those documents, it could happens they are not readable. Another problem in this case is that you can’t apply a text search over the document. A solution to this problem is the usage of an OCR tool to transform text in images back to real text.

We resume shortly the previous minds. When we try to reduce the file size of a PDF we need to reduce the quality of the included images. This can be done by reducing the amount of dots per inch (dpi). Be aware that after the compression the image is still readable. As long you do not plan to do a high quality print like a magazine or a book, nothing will get affected.

When we wanna reduce plenty PDF files in a short time we can’t do all those actions by hand. For instance we need an automated solution. To reach the goal it is important that the tool we use support the command line. The we can create a simple batch job to perform the task without any hands on.

We have several options to optimize the images inside a PDF. If it is a great idea to perform all options, depend on the purpose of the usage.

  1. change the image file to the PNG format
  2. reduce the graphic dimensions to the real printable area
  3. reduce the DPI
  4. change the image color profile to gray-scale

As Ubuntu Linux user I have all of the things I need already together. And now comes the part that I explain you my well working solution.

Ghostscript

GPL Ghostscript is used for PostScript/PDF preview and printing. Usually as a back-end to a program such as ghostview, it can display PostScript and PDF documents in an X11 environment.

If you don’t have Ghostscript installed on you system, you can do this very fast.

sudo apt-get update 
sudo apt-get -y install ghostscript

 Before you execute any script or command be aware you do not overwrite with the output the existing files. In the case something get wrong you loose all originals to try other options. Before you start to try out anything backup your files or generate the compressed PDF in a separate folder.

Abonnement / Subscription

[English] This content is only available to subscribers.

[Deutsch] Diese Inhalte sind nur für Abonnenten verfügbar.

The important parameter is r150, which reduce the output resolution to 150 dpi. In the manage you can check for more parameters to compress the result more stronger. The given command you are able to place in a script, were its surrounded by a FOR loop to fetch all PDF files in a directory, to write them reduced in another directory.

The command I used for a original file with 260 MB and 640 pages. After the operation was done the size got reduced to around 36 MB. The shrunken file is almost 7 times smaller than the original. A huge different. As you can see in the screenshot, the quality of the pictures is almost identical.

As alternative, in the case you won’t come closer to the command line there is a online PDF compression tool in German and English language for free use available.

PDF Workbench

Linux Systems have many powerful tools to deal with PDF documents. For example the Libreoffice Suite have a button where you can generate for every document a proper PDF file. But sometimes you wish to create a PDF in the printing dialog of any other application in your system. With the cups PDF print driver you enable this functionality on your system.

sudo apt-get install printer-driver-cups-pdf 

As I already explained, OCR allows you to extract from graphics text to make a document searchable. When you need to work with this type of software be aware that the result is good, but you cant avoid mistakes. Even when you perform an OCR on a scanned book page, you will find several mistakes. OCRFeeder is a free and very powerful solution for Linux systems.

Another powerful helper is the tool PDF Arranger which allows you to add or remove pages to an existing PDF. You are also able to change the order of the pages.

Resources

Abonnement / Subscription

[English] This content is only available to subscribers.

[Deutsch] Diese Inhalte sind nur für Abonnenten verfügbar.


{j}DD Poland 2021

Refactoring Disasters: A Story how I failed

For my small Open Source project TP-CORE, you can find it on GitHub, I had the gorgeous Idea to replace the iText library for OpenPDF. After I made a plan how I could reach my goal I started all necessary activities. But in real life the things never that easy like we have originally in mind. I failed with my idea and in this talk I will let you know what happened exactly. I talk about my motivation why I wanted the replacements and how was my plan to success all activities. You will get to know how it was when I reached the point, I realized I will not make it. I give a brief explanation what I did that this short adventure did not affect the rest of the project.