Wednesday, October 24, 2007

Embedded Databases

An embedded database is a software component that is generally a part of the application. The ones that I have looked at:

  • Have a small footprint, about 2 megabytes for the base engine and embedded JDBC driver.
  • Based on Java, JDBC, and SQL standards.
  • Require zero or limited human administration

Click here to find out more about embedded databases.


HSQLDB

HSQLDB support is available from HyperXtremeSQL, and is available under the BSD license. It is a pure Java database, has a small footprint, and comes bundled with third party products like OpenOffice, JBoss, RSA crypto libraries.

Observations

  • It has an 8Gb database limit, it does not support blobs greater than 4Kb, and does not support server side cursors.
  • Database Directory
    • Each HSQLDB database consists of between 2 to 5 files; all named the same but with different extensions, located in the same directory. For example, the database named "test" consists of test.properties, test.script, test.log, test.data, test.backup, test.lck.
      • The script file contains the definition of tables and other database objects, plus the data for non-cached tables.
      • The log file contains recent changes to the database.
      • The data file contains the data for cached tables.
      • The backup file is a zipped backup of the last known consistent state of the data file.
      • The lck file is also used to record the fact that the database is open.
      • The log and lck files are deleted at normal SHUTDOWN.
  • Deployment Options
    • HSQLDB can be run in a number of different ways. In general these are divided into Server Modes and In-Process Mode (also called Standalone Mode).
    • In-Process Mode
      • Faster as the data is not converted and sent over the network.
      • More secure as it is not possible by default to connect to the database from outside your application. As a result you cannot check the contents of the database with external tools such as Database Manager while your application is running.
      • In 1.8.0, you can run a server instance in a thread from the same virtual machine as your application and provide external access to your in-process database.
  • In-Memory Only Database
    • It is possible to run HSQLDB in a way that the database is not persistent and exists entirely in random access memory.
  • Shutdown
    • In 1.8.0, a connection property, shutdown=true, can be specified on the first connection to the database (the connection that opens the database) to force a shutdown when the last connection closes.
    • When SHUTDOWN is issued, all active transactions are rolled back.
    • A special form of closing the database is via the SHUTDOWN COMPACT command. This command rewrites the .data file that contains the information stored in CACHED tables and compacts it to size
    • The SHUTDOWN should also be executed when accessing the database using the HSQL Database Manager tool.
    • If the DB is not shutdown properly, SHUTDOWN via JDBC, the occasionally HSQLDB all data changes (updates, inserts, deletes) made during the session. This is despite the fact that the:
      • Database is not configured as “In-Memory” type
      • Database is configured to frequently flush data to disk
      • JDBC transactions were executed in an “Auto-Commit” mode
    • If the DB is not shutdown properly, SHUTDOWN via JDBC, the physical lock files have to be manually deleted. Otherwise, upon JVM restart, HSQLDB complains that the database files are locked by another process.
  • Boot Database
    • When a server instance is started, or when a connection is made to an in-process database, a new, empty database is created if no database exists at the given path. This feature has a side effect that can confuse new users. If a mistake is made in specifying the path for connecting to an existing database, a connection is nevertheless established to a new database.
  • SQL Standard
    • The SQL dialect used in HSQLDB is as close to the SQL92 and SQL200n standards as it has been possible to achieve so far in a small-footprint database engine. Not all the features of the Standard are supported and there are some proprietary extensions. In 1.8.0 the behavior of the engine is far more compliant with the Standards than with older versions.
  • Types of Tables
    • Memory tables are the default type when the CREATE TABLE command is used. Their data is held entirely in memory but any change to their structure or contents is written to the .script file. The script file is read the next time the database is opened, and the MEMORY tables are recreated with all their contents.
    • CACHED tables are created with the CREATE CACHED TABLE command. Only part of their data or indexes is held in memory, allowing large tables that would otherwise take up to several hundred megabytes of memory. Another advantage of cached tables is that the database engine takes less time to start up when a cached table is used for large amounts of data. The disadvantage of cached tables is a reduction in speed.
    • It lets you use CSV files directly as Database tables, which is very handy for testing.
  • Write Delay
    • The purpose of this command is to control the amount of data loss in case of a total system crash.
    • This property can only be set (permanently) by executing the SET WRITE_DELAY SQL command, using the Database Manager Tool.
    • The default is TRUE and indicates that the changes to the database that have been logged are synched to the file system once every 20 seconds. FALSE indicates there is no delay and at each commit a file synch operation is performed. This will slow the engine down to the speed at which the file synch operation can be performed by the disk subsystem.
    • Values down to 10 milliseconds can be specified by adding MILLIS to the command, but in practice a delay of 100 milliseconds provides 99.99999% reliability with an average one system crash per 6 days.

Additional Documentation



Derby

  • It was first released as Cloudscape in 1997.
  • Derby and JavaDB are offshoots. Support for Apache Derby is available as Cloudscape from IBM and JavaDB from Sun
  • Derby is under the Apache license
  • Derby has a small footprint -- about 2 megabytes for the base engine and embedded JDBC driver.
  • Derby is based on the Java, JDBC, and SQL standards.
  • Derby provides an embedded JDBC driver that lets you embed Derby in any Java-based solution. In the default configuration there is no separate database server to be installed or maintained by the end user.
  • Apache Derby is a big community of developers that includes people from Sun and IBM, which makes it backed by 2 major companies that are already major open source contributors
  • The user and developers mailing lists are very active ones and you can get good assistance from these too
  • Supports server side cursors

Observations

  • Derby has some reserved tables that complain if you name your tables the same. Like USER. Hypersonic doesn't have this problem.
  • Deployment Options: Derby software distribution provides two basic deployment options
    • Embedded: Refers to Derby being started by a simple single-user Java application. With this option Derby runs in the same Java virtual machine (JVM) as the application. Derby can be almost invisible to the end user because it is started and stopped by the application and often requires no administration.
    • Server (or Server-based): Refers to Derby being started by an application that provides multi-user connectivity to Derby databases across a network. With this option Derby runs in the Java virtual machine (JVM) that hosts the Server. Applications connect to the Server from different JVMs to access the database. The Derby Network Server is part of the Derby software distribution and provides this type of framework for Derby. Derby also works well with other, independently developed Server applications.
  • Database Shutdown
    • If an application starts the Derby engine, the application should shut down all databases before exiting. The attribute “;shutdown=true” in the Derby connection URL performs the shutdown. When the Derby engine is shutdown, all booted databases will automatically shut down. The shutdown process cleans up records in the transaction log to ensure a faster startup the next time the database is booted. You can shut down individual databases without shutting down the engine by including the database name in the connection URL. NOTE: A successful shutdown always results in an SQLException to indicate that Derby has shut down and that there is no other exception.
  • Boot Database
    • The DriverManager class loads the database using the Derby connection URL stored in the variable connectionURL. This URL includes the parameter “;create=true” so that the database will be created if it does not already exist.
  • Database Directory: A Derby database is stored in files that live in a directory of the same name as the database. A database directory contains
    • log directory: Contains files that make up the database transaction log, used internally for data recovery (not the same thing as the error log)
    • seg0 directory: Contains one file for each user table, system table, and index (known as conglomerates).
    • service.properties file: A text file with internal configuration information.
    • tmp directory: (might not exist.) A temporary directory used by Derby for large sorts and deferred updates and deletes. Sorts are used by a variety of SQL statements. For databases on read-only media, you might need to set a property to change the location of this directory. See "Creating Derby Databases for Read-Only Use".
    • jar directory: (might not exist.) A directory in which jar files are stored when you use database class loading.
  • Management: Included with the product are some standalone Java tools and utilities that make it easier to use and develop applications for Derby. There are no GUI management tools.
    • ij: ij is Derby's interactive JDBC scripting tool. It is a simple utility for running scripts against a Derby database. You can also use it interactively to run ad hoc queries. ij provides several commands for ease in accessing a variety of JDBC features. ij can be used in an embedded or a client/server environment.
    • The import and export utilities: These server-side utilities allow you to import data directly from files into tables and to export data from tables into files. Server-side utilities can be in a client/server environment but require that all files referenced be on the Server machine.
    • Database class loading utilities: These utilities allow you to store application logic in a database.
    • Sysinfo: sysinfo provides information about your version of Derby and your environment.
    • dblook: dblook is Derby's Data Definition Language (DDL) Generation Utility, also called a schema dump tool. It is a simple utility for the dumping the DDL of a user-specified database to either a console or to a file. The generated DDL can then be used for such things as recreating all or parts of a database, viewing a subset of a database's objects (for example, those which pertain to specific tables and schemas), or documenting a database's schema.

Additional Documentation



Conclusion

The two products are fairly comparable it terms of their feature sets. The tooling support is again quite similar, though HSQLDB provides a GUI based management tool. Both products have their own quirks, which requires some getting used to. In my opinion either of these products will work well for a typical single-user, Swing based desktop application.

Apache Derby has a big community of developers that includes people from Sun and IBM, which makes it backed by 2 major companies that are already major open source contributors. Sun has adopted Apache Derby as its 100% Java DB, and included it in the Java SE 6 (Mustang) JDK. Sun is also using Java DB for some of its own products and other open source projects that it is heavily involved in. Customers can now purchase Sun Software Service Plans for Java DB including two levels of support — Premium (around-the-clock) and Standard (extended business hours).

So, if support, paid or community-based, is a big concern then Apache Derby definitely stands out.



Sunday, October 7, 2007

Integration Technologies in Practice

The following is based on what I have seen and experienced.

Point-to-Point using Sockets
  • Generally seen with legacy systems that have been in existence for a while.
  • Typically these communicate over an ASCII text based protocol
  • Over time you end up with a plethora of such brittle point-to-point interfaces, which are tricky to manage and to modify. (Integration Spaghetti)
  • There also exist variations of interfaces that are quite similar in nature and in intent.
  • Each interface builds it own failure recovery mechanism
  • Some of these interfaces are synchronous in design but are functionally asynchronous.

RMI or RMI/IIOP
  • Suitable when both the Service Provider and the Consumers are implemented in Java
  • RMI being a binary protocol makes it easier to transport large amounts of data
  • I wouldn't use it for large-scale deployments, as it gets tricky to manage resources like RMI Service Threads
  • It requires a registry, to access a service reference. There is another way of accessing RMI service, without using a registry, by using reference objects. In my view, going through the registry is the easier route
  • Firewall issues around ports being enabled. RMI over HTTP attempts to work around that issue
  • RMI activation framework helps increase availability of RMI Services, and helps better manage resources on the server

Object Serialization over HTTP
  • Service providers and consumers interact by serializing binary objects over HTTP.
  • Leverages the scalability and availability of standard HTTP servers like Apache.
  • Leverages the ubiquity and firewall-friendly nature of the HTTP protocol
  • Suitable when both the Service Provider and the Consumers are implemented in Java
  • Being a binary protocol makes it easier to transport large amounts of data

Integration using MOM
  • Publish/Subscribe
    • Pub/Sub is a highly advertised feature of MOMs.
    • In my view, this model fits in a world where the subscribers are highly dynamic.
    • If you have a high number of subscribers, pay careful attention to how messages a published over the network. Multicast reduces the latency in publishing, and has a lower network impact.
  • Point-to-Point
    • Things to watch out for are: Queue Depth, Sequence of Messages, Message Correlation, Duration in Queue and Message Expiration.
    • Pro-active monitoring and management of the dead-letter queue.

Web Services
  • These are typically accessed over HTTP, which is a synchronous protocol. This introduces Availability Coupling between the Service Provider and the Consumers.
  • The better way to communicate is by exchanging documents.
  • Stick to industry standard XML schemas
  • Authentication and Authorization is a tricky problem to solve
  • There is some value in attempting to solve the Service Versioning issue, by following the same approach as that for Component Versioning
  • The plethora of WS standards reminds me what happened with CORBA. In the end, only a handful of standards remained standing, and the vendors ended up implementing only those standards.
  • Then you have the whole set of arguments around REST vs. Web Services.

Enterprise Service Bus (ESB)
  • It is an attempt to emulate (conceptually) within the middleware world, what the computer hardware vendors ended up doing, to solve their integration issues.
  • The typical value-add of such products are: Protocol Transformation, Process Orchestration/Choreography, Monitoring & Management, Service Versioning, SLA management, etc