Databases: the choice of torture

Rate this post

The permanent storage of data is called persistence in technical terms. To be able to access this data again, software is needed that structures and searches it. Such software is called a Database Management System (DBMS). To access a database from a programming language like Java, Ruby, Python, or PHP, a corresponding driver is required. This driver is often referred to as a client, because the DBMS is the server that allows access for multiple clients. In this article, we won’t focus on how to connect to the respective databases with which programming language, but rather on the different database technologies and their applications.

[Relational DB (rows, columns) | GIS DB | embedded DB]
[NoSQL | Key Value Store | Document DB (JSON, XML) | Graph DB | Time Series Server]

There are now numerous solutions to choose from for classic database systems, the so-called relational databases. Both commercial and professional free open-source options are vying for users’ attention. Most web hosting providers offer their users the choice between the free DBMS MySQL (Oracle) and MariaDB (a fork of MySQL after its acquisition by Oracle) for data storage. However, those who can manage their own servers can, of course, opt for the more professional PostgreSQL.

PostgreSQL is rather unsuitable for most standard PHP applications, although WordPress and Joomla do support this database system. Problems usually arise with the developers of extensions. Instead of using the application’s interfaces, database access is often achieved by ignorantly using MySQL’s native commands.

In commercial application development, Oracle or Microsoft SQL Server are typically used, depending on familiarity with the Microsoft Windows environment. The reason for using commercial database servers lies in the costly support available when vulnerabilities and bugs are discovered. Business-critical applications must ensure the continued existence of both the vendor and their customers. The speed of delivery of security patches is a particularly significant reason for using commercial software.

The functionality of relational databases is defined by tables. The columns of a table define the properties, and a row of the table represents the data record. To access an explicit data record, a column (primary key) must contain unique entries that do not appear again in that column. This property of the primary key is called uniqueness. Primary keys allow for the establishment of relationships, or relations, between tables. To keep this article from becoming excessively long, I will limit my in-depth discussion of the functionality of relational databases to this point and move on to the next category.

Of course, there are also relational databases that operate in a column-oriented rather than row-oriented manner. This enables more efficient queries and analyses, especially with large datasets. Here are some of the main features and benefits of column-oriented databases:

  • Data organization: Stores data in columns, which speeds up the processing of specific columns in queries.
  • Compression: Often offers better compression rates for columnar data because similar data types are stored together.
  • Analytical queries: Optimized for analysis and aggregate queries that need to quickly process large amounts of data.
  • Reduced I/O: Reduces the amount of data that needs to be read from disk, as only the required columns are retrieved.

Column-oriented databases include Apache Cassandra, SAP Hanna, DB2, and Amazon BigQuery, with classic use cases for:

  • Business Intelligence: Ideal for databases that need to process large amounts of data for analytical purposes.
  • Data Warehousing: Efficient storage and analysis of historical data.
  • Real-time analytics: Suitable for applications that require rapid decisions based on current data.
IDsupplierarticlepricepackageamount
13MongoDBJSON7.88piece1
21XindiceXML15.67piece1
// Row-oriented DBMS
[{13, MongoDB, JSON, 7.88, piece, 1} {21, Xindice, XML, 15.67, piece, 1}]

// Column-oriented DBMS
[{13,21} {MongoDB, Xindice} {JSON, XML} {7.88, 15.67} {piece, piece} {1, 1}]

To provide data for geographic information systems (GIS) like Google Maps, so-called geospatial databases are used. Geospatial databases are extensions of relational databases that provide tables and relations optimized and standardized for geometric objects. The GIS extension for PostgreSQL is called PostGIS. The datasets for the freely available OpenStreetMap are in a specialized XML format but can also be transformed into geospatial data structures.

Key-value stores are often used in configuration files. However, if you want to build a fast caching system, you need a bit more complexity. This is because the key/value relationship can range from simple strings to complex objects. Basically, a store consists of a unique key to which values ​​can be assigned depending on the data type. Data types can be strings, numbers (integers, floats), Boolean values, and lists. Key-value databases belong to the NoSQL database family because, unlike relational databases, queries are not performed using SQL but are database- and vendor-specific.

Typical key-value databases include Redis, MemCached, Amazon DynamoDB, and the somewhat outdated BarkleyDB, which was acquired by Oracle. A characteristic of key-value databases is that data is stored in memory and backed up to disk at regular intervals. Keeping data in RAM naturally requires a machine with sufficient RAM. Especially with large applications, an enormous amount of data can accumulate for caching.

Another category of databases is embedded databases. “Embedded” refers to the database server itself. Specifically, this means that the database system is not a standalone installation but rather a library integrated into the application. The advantage of this solution is a simpler application installation process. However, this often comes at the expense of security, as many embedded databases lack a dedicated user management layer. This is particularly true for SQLite and the Java-implemented H2. Even the previously mentioned NoSQL BarkelyDB, available as a Java or C library, lacks user management. This means that anyone with access to the application can use a client to read data from the database. Therefore, these systems are not suitable for applications requiring a high level of security.

Regarding the Java version of BarkelyDB, the last available implementation dates back to 2017 and is available as source code in Java/Apache Ant, but this code must be compiled manually. An official binary from Oracle is no longer available, but unofficial versions can be found in the Maven Central Repository.

Anyone wanting to integrate a fully functional relational database into their application can use the embedded version of PostgreSQL – pgx – which provides all the functions of the PostgreSQL server locally.

The next class of databases belongs to the NoSQL category: document-based databases. The two DBMSs, MongoDB and CouchDB, are quite similar in their feature set, but there are significant differences.

  • MongoDB is often chosen for applications requiring complex queries and real-time analytics due to its comprehensive query language and high performance.
  • CouchDB is particularly well-suited for applications that require reliability, a distributed architecture, and easy replication, especially in scenarios where offline access is essential.

The fundamental way document databases work is that the schema is derived from the underlying data structure. These data structures are usually in JSON format and are accessed accordingly. Documents of the same data structure are assigned to a collection. Therefore, these databases don’t store classic office documents, but rather formats like JSON and XML. Document databases that specialize in XML include Oracle XML DB and Apache Xindice.

Many web developers specializing in front-end (UX/UI) development frequently use document databases. This allows them to store data in JSON format to simulate RESTful access and thus populate the dynamic content of the user interface.

A very exotic variant of NoSQL databases are graph databases, which represent data as graphs. This storage format allows for the efficient storage of information according to relationships. Such relationships can be links between websites or a person’s representation on social media. Even the complex relationships used in recommendation systems can be represented as graphs. The following figure shows a simple example of a graph database implemented in Java using Neo4j, to illustrate its use case.

Other graph databases include Amazon Neptune and ArangoDB.

Finally, I’d like to introduce time series. Since monitoring has become essential, especially in the context of application operation, data presented as time series has gained in importance. Typical databases that specialize in processing time series are Prometheus and InfluxDB. However, there are also corresponding extensions for classic relational databases. The PostgreSQL database, which has already been mentioned several times, also has a corresponding extension for this use case called TimescaleDB.

Of course, much more could be said about this topic. After all, countless books on databases fill several meters of library shelves. However, this should suffice for an introduction and an overview of the various database systems and NoSQL solutions. With the information from this article, you now have an idea of ​​which database is suitable for your specific use case. We have also seen that relational databases, especially the free and open-source database PostgreSQL with its available extensions, are very versatile. Further topics related to databases include data modeling and security against hacker attacks.


Leave a Reply