Fujitsu Develops Column-Oriented Data-Processing Engine Enabling Fast, High-Volume Data Analysis in Database Systems
In recent years, column-oriented databases have emerged as a way to read and analyze large volumes of data faster, as a counterpart to existing row-oriented databases, which are suited to handling data updates. Existing systems, however, have had one of two problems: either changes to row-oriented data cannot be automatically reflected in the column-oriented data, or the size of the column-oriented data is constrained by installed memory.
Fujitsu has developed an engine that, running on a PostgreSQL open-source database, without being dependent on memory capacity, instantly updates column-oriented data in response to changes in row-oriented data, and processes column-oriented data quickly. The engine quickly analyzes indexes, which are provided by most database systems, and can be used by developers without special consideration to whether the storage method is row-oriented or column-oriented. With a parallel-processing engine especially suited for processing column-oriented data, analyses run on a single CPU core are conducted 4 times faster than before, and one server equipped with 15 CPU cores can run analyses at least 50 times faster.
Even on smaller computer systems with little memory, this technology enables real-time data analysis reflecting the latest data.
Details of this technology are being presented at the Seventh Forum on Data Engineering and Information Management (DEIM 2015), opening March 2 in Koriyama, Fukushima.
Background
Database systems are widely used for online transaction processing (OLTP), in which changes to data are processed and the results are reported back to a terminal efficiently, for example when storing and using data from business systems.
Issues
In recent years, there has been an increasing demand for high-volume data analysis that is fast and available on demand, creating a need for a single database system that can handle OLTP and high-volume data analysis simultaneously. Row-oriented data is best suited to OLTP, whereas column-oriented data is better for data analysis but gets bogged down when processing changes to data. One relatively recent solution is to store both row-oriented and column-oriented data as a way to accelerate analyses. But with previous technologies, changes to the row-oriented data are not automatically reflected in the column-oriented data, and memory constraints are also problematic.
About the Technology
Fujitsu has developed an engine for PostgreSQL open-source databases that instantly reflects updates to row-oriented data in the column-oriented data, stores column-oriented data without being dependent on memory capacity, and quickly conducts analysis of column-oriented data. Massive volumes of column-oriented data can be stored by taking advantage of a new technique for managing column-oriented data. The engine also enables high-speed analyses of the indexes that typical database systems provide, and can be used without special consideration for whether the data is stored as row-oriented or column-oriented. On the DBT-3 benchmark Query1 for reading, filtering, and aggregating, the parallel-processing analysis engine, which has been optimized for column-oriented data, runs 4 times faster on a single CPU core than its predecessors. On a single server with 15 CPU cores, performance is at least 50 times faster.
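As a rough reading of the figures quoted above, the following sketch separates the gain from the column-optimized engine from the gain due to parallelism. It assumes that the 4x and 50x figures are measured against the same baseline, which the release suggests but does not state explicitly, and it treats 50x as a lower bound.

```python
# Figures quoted in the text: 4x on one core, at least 50x on 15 cores.
single_core_speedup = 4.0
multi_core_speedup = 50.0   # lower bound ("at least 50 times")
cores = 15

# Assumption: both figures share one baseline, so dividing them isolates
# the speedup that comes from parallelism alone.
parallel_speedup = multi_core_speedup / single_core_speedup

# Parallel efficiency: fraction of ideal linear scaling across the cores.
parallel_efficiency = parallel_speedup / cores

print(f"parallel speedup: {parallel_speedup:.1f}x over {cores} cores")
print(f"parallel efficiency: {parallel_efficiency:.0%}")
```

Under that assumption, parallelism contributes a factor of at least 12.5 across 15 cores, or roughly 83% of linear scaling.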
Key features of the technology are as follows:
- Large-volume column-oriented data storage
To efficiently manage large volumes of column-oriented data that cannot fit into memory, data areas are managed in large units called "extents" (roughly 260,000 records each). Data areas are allocated and deleted, and free areas are reclaimed, in these units. Managing such large units while analyses are running could result in long wait times, so Fujitsu adopted MultiVersion Concurrency Control (MVCC), which allows analyses to run at the same time that data areas are managed.
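The extent idea described above can be sketched as a toy in-memory store. This is not Fujitsu's implementation: the class and function names are hypothetical, the extent size of 2**18 is an assumption chosen to match "roughly 260,000 records," and the snapshot is only a loose stand-in for MVCC visibility.

```python
EXTENT_RECORDS = 262_144  # assumption: 2**18, close to "roughly 260,000 records"


class ExtentStore:
    """Toy column store that allocates and reclaims space one extent at a time."""

    def __init__(self, extent_records=EXTENT_RECORDS):
        self.extent_records = extent_records
        self.extents = {}     # extent id -> list of column values
        self.free_ids = []    # ids of dropped extents, available for reuse
        self.next_id = 0
        self.current = None   # id of the extent currently being filled

    def _allocate(self):
        # Reuse a reclaimed extent id if one exists, otherwise take a fresh one.
        ext_id = self.free_ids.pop() if self.free_ids else self.next_id
        if ext_id == self.next_id:
            self.next_id += 1
        self.extents[ext_id] = []
        return ext_id

    def append(self, value):
        full = (self.current is None
                or len(self.extents[self.current]) >= self.extent_records)
        if full:
            self.current = self._allocate()
        self.extents[self.current].append(value)
        return self.current

    def drop_extent(self, ext_id):
        # Deleting a whole extent returns its id to the free list.
        del self.extents[ext_id]
        self.free_ids.append(ext_id)
        if self.current == ext_id:
            self.current = None

    def snapshot(self):
        # A reader pins each visible extent and the record count at that moment,
        # so later appends and drops do not change what the reader sees.
        return [(values, len(values)) for values in self.extents.values()]


def scan_sum(snapshot):
    """Aggregate only the records that were visible when the snapshot was taken."""
    return sum(sum(values[:count]) for values, count in snapshot)
```

A scan that took its snapshot before an extent was dropped keeps reading the old extent, while new writers can already reuse the freed slot. This is the property the text attributes to MVCC, shown here in a much simplified form.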
- Column-oriented indexes (column-store indexes)
A column-oriented index (column-store index) is created in the same way as any other index. Once it exists, the data-storage method (row-oriented or column-oriented) that suits the query being run is selected and used to process it. When row-oriented data on which a column-store index has been created is updated, the column-oriented data is updated automatically. This frees users from having to think about the data-storage method.
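The behavior described above, where a column-oriented copy follows every change to the rows, can be illustrated with a small sketch. The `Table` class and its methods are hypothetical and exist only for illustration; they are not the PostgreSQL or Symfoware interface.

```python
class Table:
    """Toy row-oriented table with optional column-store indexes."""

    def __init__(self):
        self.rows = {}             # row id -> dict of column values
        self.column_indexes = {}   # column name -> {row id: value}
        self.next_row_id = 0

    def create_column_index(self, column):
        # Build the column-oriented copy from the rows that already exist.
        self.column_indexes[column] = {
            rid: row[column] for rid, row in self.rows.items()
        }

    def insert(self, **values):
        rid = self.next_row_id
        self.next_row_id += 1
        self.rows[rid] = dict(values)
        # Every column index is maintained as part of the same write.
        for column, index in self.column_indexes.items():
            index[rid] = values[column]
        return rid

    def update(self, rid, **changes):
        self.rows[rid].update(changes)
        for column, value in changes.items():
            if column in self.column_indexes:
                self.column_indexes[column][rid] = value

    def delete(self, rid):
        del self.rows[rid]
        for index in self.column_indexes.values():
            index.pop(rid, None)

    def total(self, column):
        # Use the column-oriented copy when one exists, else scan the rows.
        # The caller does not need to know which path was taken.
        if column in self.column_indexes:
            return sum(self.column_indexes[column].values())
        return sum(row[column] for row in self.rows.values())


orders = Table()
orders.insert(item="a", price=10)
orders.insert(item="b", price=20)
orders.create_column_index("price")
orders.update(0, price=15)
print(orders.total("price"))
```

The caller issues the same `total` call whether or not an index exists, which mirrors the claim that developers need not consider the storage method.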
Figure 1: Architecture of the new technology
- Analysis engine optimized for column-oriented data and parallel processing using an original shared-memory structure
Simply using column-oriented data to improve read performance does not make the most of the benefits that column-oriented data can offer. Fujitsu developed an analysis engine that applies the same operation to multiple data items at once (vector processing), which improves performance even on a single core. As a parallel-analysis mechanism, the company also developed a new shared-memory structure so that multiple processes operating in parallel in PostgreSQL can hand off data with little slowdown. On a server with 15 CPU cores, this achieves performance at least 50 times that of the previous PostgreSQL.
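The two ideas in this feature, processing whole columns at a time and merging partial results from parallel workers, can be sketched as follows. The column names (`ship_day`, `flag`, `quantity`) and both function names are hypothetical, and the thread pool is only a stand-in for the engine's parallel processes and shared-memory structure, whose details the release does not give. In CPython, threads illustrate the partition-and-merge pattern rather than real CPU parallelism.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def aggregate_chunk(ship_day, flag, quantity, cutoff):
    """Filter and aggregate one chunk, working on whole columns at a time."""
    # Pass 1: scan a single column to find the positions that qualify.
    selected = [i for i, day in enumerate(ship_day) if day <= cutoff]
    # Pass 2: touch only the columns needed for the aggregate.
    totals = Counter()
    for i in selected:
        totals[flag[i]] += quantity[i]
    return totals


def parallel_aggregate(ship_day, flag, quantity, cutoff, workers=4, chunk_size=1000):
    """Split the columns into chunks, aggregate each, then merge the partials."""
    starts = range(0, len(ship_day), chunk_size)

    def work(start):
        end = start + chunk_size
        return aggregate_chunk(
            ship_day[start:end], flag[start:end], quantity[start:end], cutoff
        )

    merged = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(work, starts):
            merged.update(partial)   # Counter.update adds counts together
    return dict(merged)


ship_day = [1, 5, 9, 3, 7, 2]
flag = ["A", "B", "A", "B", "A", "B"]
quantity = [10, 20, 30, 40, 50, 60]
print(parallel_aggregate(ship_day, flag, quantity, cutoff=5, workers=2, chunk_size=2))
```

Because each worker produces an independent partial result, only small per-group totals need to be handed between workers, which is one reason this pattern tends to scale well with core count.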
Results
This technology enables existing smaller systems with limited memory to analyze and use big data in real time, in ways that were not possible before.
Future Plans
Fujitsu is aiming for a commercial implementation of this technology during fiscal 2015, as a part of Symfoware Server, Fujitsu's database product.
All company or product names mentioned herein are trademarks or registered trademarks of their respective owners. Information provided in this press release is accurate at time of publication and is subject to change without advance notice.