Database Design: Individual Assignment - Database Optimisation

HI5033

Database Design

Individual Assignment

















Student ID –

Student Name -

Table of Figures



Introduction

Database tuning is central to improving the performance and scalability of a DBMS, and it matters more than ever as modern systems are expected to process larger volumes of data and more complex operations. Optimisation covers a range of methods aimed at increasing the efficiency of a database: reducing the time required to complete queries, and improving how the system accesses resources, stores data, and maintains indexes.

The objective of this report is to analyse several database tuning strategies and assess their efficacy. More precisely, it revisits findings from the literature on database indexing, query optimisation, and storage management, and discusses the most widely implemented and effective mechanisms for enhancing DBMS performance. The aim is to give readers a broad perspective on how such optimisations can be incorporated into everyday database systems to support larger data volumes, faster query processing, and better overall system performance.

Database optimisation comprises query optimisation, indexing techniques, system caching, in-memory databases, and partitioning (Kouahla et al., 2021, p.19(2)). Some of these are better suited than others to a specific system architecture, and it is not easy to determine which technique will be most useful when designing and implementing a system. For instance, query optimisation aims to choose an efficient way to process a query, by reordering operations or selecting the best join technique, whereas indexing speeds up data retrieval by organising the data in a structure that can be searched quickly.

This report presents a literature review of ten peer-reviewed academic research papers that address various aspects of database optimisation techniques, followed by a critical evaluation of the role the selected sources play in enhancing the understanding of performance improvement in DBMSs. Using this approach, the report compares and contrasts current areas of study, establishes existing knowledge gaps, and suggests potential avenues for future research.



Figure 1: Database Optimisation

Literature Review

The database optimisation problem attracts substantial interest in both research and practice, and many techniques have been proposed to enhance the efficiency and modularity of DBMSs (Przytarski et al., 2021, p.1(1)). These techniques vary in impact depending on the complexity of the implemented database, the workload patterns flowing through the system, and the complexity of the queries it must process. This section collates the ten selected papers and draws out their trends, methodologies, and results.

Indexing Strategies

Indexing remains one of the cornerstones of database optimisation, even as databases grow and queries become more complex. Adaptive indexing has been shown to enhance ad-hoc query performance by constructing partial indexes at query-execution time. This not only decreases the up-front cost of indexing but also scales well under dynamic workloads. Adaptive indexing is most useful in situations where query patterns change over time, because it avoids the constraints of building a full index in advance.
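The idea of building partial indexes during query execution can be illustrated with a minimal "database cracking" sketch in Python. This is an illustration only: the `CrackedColumn` class and its methods are invented names, not taken from any of the reviewed systems. Each range query partitions the column around its own bounds, so the column becomes progressively more ordered as a side effect of querying.

```python
import bisect

class CrackedColumn:
    """Minimal database-cracking sketch: the column is reorganised
    incrementally as a side effect of each range query."""

    def __init__(self, values):
        self.values = list(values)
        # Cracks recorded as (pivot, position), kept sorted by pivot:
        # all values before `position` are < pivot, the rest are >= pivot.
        self.cracks = []

    def _crack(self, pivot):
        """Partition the narrowest piece containing `pivot` around it,
        and record the resulting crack."""
        pivots = [p for p, _ in self.cracks]
        i = bisect.bisect_left(pivots, pivot)
        lo = self.cracks[i - 1][1] if i > 0 else 0
        hi = self.cracks[i][1] if i < len(self.cracks) else len(self.values)
        piece = self.values[lo:hi]
        left = [v for v in piece if v < pivot]
        right = [v for v in piece if v >= pivot]
        self.values[lo:hi] = left + right
        self.cracks.insert(i, (pivot, lo + len(left)))
        return lo + len(left)

    def range_query(self, low, high):
        """Return values in [low, high); each call refines the index."""
        start = self._crack(low)
        end = self._crack(high)
        return self.values[start:end]
```

Because each query only reorganises the piece of the column it touches, the up-front indexing cost described above is replaced by many small, query-driven reorganisations.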

Other indexing strategies have developed along similar lines, such as hybrid indexes that blend features of tree and hash structures. These hybrid techniques combine tree-based search efficiency with hash-based direct access, and significantly improve query execution time and memory usage on large datasets (Ukey et al., 2023, p.629(3)).
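A minimal sketch of the hybrid idea, with invented names and no claim to match any particular system: a hash table answers exact-match lookups in constant time, while a sorted key list (standing in for the tree side) serves range scans.

```python
import bisect

class HybridIndex:
    """Hybrid-index sketch: a hash table for O(1) point lookups plus a
    sorted key list for range scans (tree-like behaviour)."""

    def __init__(self):
        self._hash = {}         # key -> row id, for exact-match queries
        self._sorted_keys = []  # kept ordered, for range queries

    def insert(self, key, row_id):
        if key not in self._hash:
            bisect.insort(self._sorted_keys, key)
        self._hash[key] = row_id

    def lookup(self, key):
        """Point query served by the hash side."""
        return self._hash.get(key)

    def range(self, low, high):
        """Range query served by the ordered side; keys in [low, high)."""
        lo = bisect.bisect_left(self._sorted_keys, low)
        hi = bisect.bisect_left(self._sorted_keys, high)
        return [(k, self._hash[k]) for k in self._sorted_keys[lo:hi]]
```

The cost of keeping two structures in sync on every insert is the management overhead the hybrid approach trades for faster lookups.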

Query Optimisation

In query optimisation, the broad aim is to find the execution plan that consumes the fewest resources for a given query on the database. In recent years, cost-based query optimisation models have become the most important approach, as they enable query engines to approximate the cost of each possible execution path. This kind of optimisation outperforms rule-based approaches, particularly in settings with many joins and large databases.
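To make this concrete, the following sketch enumerates left-deep join orders under a toy cost model. The table names, cardinalities, and selectivities are invented for illustration; a real optimiser would draw them from catalogue statistics and use smarter enumeration than brute force.

```python
from itertools import permutations

# Hypothetical table cardinalities and pairwise join selectivities.
CARD = {"orders": 1_000_000, "customers": 50_000, "nation": 25}
SEL = {frozenset({"orders", "customers"}): 1e-5,
       frozenset({"customers", "nation"}): 0.04,
       frozenset({"orders", "nation"}): 0.04}

def plan_cost(order):
    """Cost a left-deep join order as the sum of intermediate-result
    sizes, estimating cardinalities from the selectivities above."""
    cost, rows, joined = 0.0, CARD[order[0]], {order[0]}
    for table in order[1:]:
        sel = min(SEL.get(frozenset({t, table}), 1.0) for t in joined)
        rows = rows * CARD[table] * sel   # estimated output cardinality
        cost += rows                      # pay for materialising it
        joined.add(table)
    return cost

# Exhaustively compare all join orders and keep the cheapest.
best = min(permutations(CARD), key=plan_cost)
```

Even this toy model shows why join order matters: starting from the small, selective tables keeps intermediate results small, which is exactly what a cost-based optimiser exploits.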

More recent approaches incorporate machine learning, making query optimisation adaptive: the optimiser learns from previous query usage patterns. Because current and historical query execution data can be used to train machine learning models to choose strategies for future queries, execution times can be greatly improved.

In-Memory Databases

In-memory database technology has launched a new generation of databases, mainly used for transaction processing where query response time is critical. These systems avoid disk I/O by keeping data in main memory, and thus achieve much lower query execution times. In-memory computing is not just about speed improvements; it enables the near real-time responsiveness on which many high-throughput industries depend.

Nonetheless, it is significant that the value proposition of in-memory databases is inherently linked to memory management. Techniques such as data compression and garbage collection must be carefully optimised to deal with continuously increasing data sizes. Although compression reduces memory usage, and thereby improves query efficiency by shrinking the amount of data to scan, it does so at an increased computational cost during compression and decompression.
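The memory-versus-CPU trade-off can be sketched in a few lines of Python. The `CompressedStore` class and its size threshold are hypothetical choices for illustration: large values are kept zlib-compressed (saving RAM), and the decompression cost is paid on every read.

```python
import json
import zlib

class CompressedStore:
    """In-memory store sketch trading CPU for RAM: values above a size
    threshold are kept zlib-compressed and decompressed on read."""

    THRESHOLD = 64  # bytes; small values are not worth compressing

    def __init__(self):
        self._data = {}  # key -> (is_compressed, payload bytes)

    def put(self, key, obj):
        raw = json.dumps(obj).encode()
        if len(raw) > self.THRESHOLD:
            packed = zlib.compress(raw)
            # Only keep the compressed form if it actually saves space.
            if len(packed) < len(raw):
                self._data[key] = (True, packed)
                return
        self._data[key] = (False, raw)

    def get(self, key):
        is_compressed, payload = self._data[key]
        if is_compressed:
            payload = zlib.decompress(payload)  # CPU paid on the read path
        return json.loads(payload)
```

The threshold check mirrors the point above: for small or incompressible data, the decompression cost outweighs the memory saved, so compression should be applied selectively.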

Partitioning and Sharding

Partitioning and sharding techniques have become almost mandatory as distributed databases grow in popularity. Splitting rows across different database servers, known as horizontal partitioning, allows queries to be executed in parallel and keeps execution times on large tables from ballooning.

Vertical partitioning, by contrast, is advantageous where heavily used columns can be separated out in the physical database, so that the system reads only the required information and minimises query completion time. Each partitioning method has a precondition for being implemented without slowing down system operations, namely a good understanding of query patterns. Sharding, which partitions the database into more manageable pieces, further extends scalability in cloud-based architectures by spreading data across nodes. To be sure, sharding has measurable advantages for scalability and extensibility, yet handling cross-shard queries and maintaining consistency across shards raise substantial concerns.
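A minimal hash-sharding sketch makes both points visible: single-key operations touch exactly one shard, while a cross-shard query must fan out to all of them. The `ShardRouter` class is an invented illustration, not a description of any reviewed system.

```python
import hashlib

class ShardRouter:
    """Hash-sharding sketch: each key is routed to one of N shards by
    hashing, so rows spread evenly and point queries touch one node."""

    def __init__(self, n_shards):
        self.shards = [dict() for _ in range(n_shards)]

    def _shard_for(self, key):
        digest = hashlib.sha256(str(key).encode()).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, row):
        self.shards[self._shard_for(key)][key] = row

    def get(self, key):
        # Point query: only one shard is consulted.
        return self.shards[self._shard_for(key)].get(key)

    def scatter_gather(self, predicate):
        """Cross-shard query: must fan out to every shard, which is
        exactly the overhead discussed above."""
        return [row for shard in self.shards
                for row in shard.values() if predicate(row)]
```

Hashing gives an even spread without knowing query patterns in advance, but it also means any query that cannot be routed by key pays the full scatter-gather cost.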

Caching Mechanisms

Caching is an optimisation strategy that places the most frequently used data in memory to cut the time needed for recurrent queries. Dynamic caching extends simple caching configurations by adjusting cache sizes and evicting cached results according to query and access frequencies. These adaptive techniques can deliver performance increases of up to thirty per cent, especially in volatile working environments.
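The eviction side of this can be sketched with a classic least-recently-used (LRU) cache; an adaptive scheme of the kind described above would additionally resize the cache based on the hit/miss counters tracked here. The class is a generic textbook construction, not drawn from the reviewed papers.

```python
from collections import OrderedDict

class LRUCache:
    """Query-result cache sketch with least-recently-used eviction.
    Capacity is fixed here; an adaptive scheme would resize it based
    on the observed hit rate."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()
        self.hits = self.misses = 0

    def get(self, query):
        if query in self._store:
            self._store.move_to_end(query)  # mark as recently used
            self.hits += 1
            return self._store[query]
        self.misses += 1
        return None

    def put(self, query, result):
        self._store[query] = result
        self._store.move_to_end(query)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the coldest entry
```

Tracking hits and misses is what lets a dynamic configuration decide whether growing or shrinking the cache would pay off under the current workload.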

Storage Optimisation

Another factor that can improve database effectiveness is the management of storage media. Multi-tier storage systems, which combine expensive but fast SSDs with inexpensive yet slow hard disks, provide a good means of balancing the two (Abbas and Alsheddy, 2020, p.56(3)). Data is moved between tiers so that frequently used data is placed on SSDs while rarely accessed data is relegated to slower storage.
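The hot/cold placement policy can be sketched as follows, with two Python dicts standing in for the SSD and HDD tiers; the class name, capacities, and promotion threshold are invented for illustration.

```python
class TieredStore:
    """Tiered-storage sketch: hot keys live in a fast tier (standing in
    for SSD), cold keys in a slow tier (standing in for HDD); access
    counts drive promotion between tiers."""

    def __init__(self, hot_capacity, promote_after=3):
        self.hot, self.cold = {}, {}
        self.hot_capacity = hot_capacity
        self.promote_after = promote_after
        self.reads = {}  # per-key access counter on the cold tier

    def put(self, key, value):
        self.cold[key] = value  # new data starts on the cheap tier

    def get(self, key):
        if key in self.hot:
            return self.hot[key]
        value = self.cold[key]
        self.reads[key] = self.reads.get(key, 0) + 1
        if self.reads[key] >= self.promote_after:
            self._promote(key)
        return value

    def _promote(self, key):
        if len(self.hot) >= self.hot_capacity:
            # Demote an arbitrary hot entry to make room; a real system
            # would demote the least-recently-used one.
            old_key, old_val = self.hot.popitem()
            self.cold[old_key] = old_val
        self.hot[key] = self.cold.pop(key)
```

The continual promote/demote traffic is the "management" cost of tiering: the performance gain only materialises if the monitoring keeps genuinely hot data on the fast tier.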

Another effective method is to implement column-based structures supported by data compression (Wang, 2023, pp.820-842(1)). Columnar storage dramatically reduces the amount of data that needs to be scanned during query execution; it also aids data compression, reducing storage requirements and enhancing data retrieval rates.
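Why columnar layouts compress well and cut scan volume can be shown with a small sketch using run-length encoding; the `ColumnStore` class is illustrative only, not a description of the cited work.

```python
class ColumnStore:
    """Column-store sketch: each column is held contiguously with
    run-length encoding, so a query touching one column scans only
    that column's (compressed) data rather than whole rows."""

    def __init__(self, columns):
        self.columns = {name: self._rle_encode(vals)
                        for name, vals in columns.items()}

    @staticmethod
    def _rle_encode(values):
        """Collapse consecutive repeats into (value, run_length) pairs;
        columns are often low-cardinality, so runs are long."""
        runs = []
        for v in values:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1
            else:
                runs.append([v, 1])
        return runs

    def sum_column(self, name):
        """Aggregate straight over the encoded runs: no decompression
        into rows, and no other columns are touched."""
        return sum(v * count for v, count in self.columns[name])
```

Note that the aggregate operates on the encoded runs directly, which is how columnar engines combine reduced scan volume with reduced storage.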

Figure 2: Optimize Storage

Discussion

The following discussion examines these techniques in greater depth. Each method has unique strengths and weaknesses and requires an analysis of its relevance to contemporary database technologies. In this discussion, the relationship between the techniques and their practical application in real-life contexts is evaluated.

Efficient Indexing Techniques

Two major categories of indexing strategy, adaptive and hybrid indexing, have been singled out as most effective for query performance optimisation, especially under complex and changing conditions. Adaptive indexing, in which partial indexes are created during query execution rather than before it, offers the flexibility to alter indexes as new query patterns emerge, while avoiding the up-front cost of building full index trees in advance.

At the same time, hybrid indexing, which unites several index structures, for example hash-based and tree-based ones, records strong improvement in both query time and memory usage. Although the hybrid approach minimises lookup time by selecting the right index for each query type, it also complicates overall index management (Achkouty et al., 2023, p.15(2)). For big systems handling tens of millions of queries, this translates into high maintenance costs.

Advanced Query Optimisation

Cost-based query optimisation is one of the firm foundations for choosing the right execution plan for compound database queries (Ye et al., 2023, p.2314(2)). Since a cost model can estimate the resources needed by the various execution plans a query could use, cost-based optimisation offers good performance gains, especially where there are large joins or nested queries. However, reliance on cost estimates is a severe liability in situations where resources such as CPU or memory are unpredictable (Ahamed et al., 2023, p.650(1)).

Most query optimisation techniques that employ machine learning add an adaptive layer that learns from query histories and improves over time. This self-learning characteristic allows the system to adapt to workload changes and variations in available resources, making it a more robust solution for modern environments. However, the model training and update processes needed to support continuous learning introduce additional computational complexity, especially at large scale with highly diverse queries.

In-Memory Database Systems

In-memory databases have reshaped the concept of database optimisation through their capability to execute queries with very low latency by eliminating disk I/O constraints. For transactional workloads, that is, low-latency, high-throughput environments, in-memory systems offer a significant benefit.

Further, memory-management approaches such as compression and garbage collection become fundamental to sustaining performance in these systems (Bahn and Cho, 2020, p.999(1)). Compression shrinks the data footprint and can improve retrieval speed, but it carries a decompression cost that can cancel the gains once the data is large enough. Likewise, poor garbage collection results in memory fragmentation and increased latency, so it needs to be managed carefully, especially during real-time processing.

Figure 3: In-Memory Database Systems

Data Partitioning and Sharding Approaches

Partitioning and sharding techniques are crucial for complex DBMSs that must scale to large data volumes and high query loads (Calatrava, Fontal and Cucchietti, 2022, p.86(2)). Horizontal partitioning, by spreading rows across different servers, facilitates parallel query execution and provides substantial performance improvement. However, efficient horizontal partitioning requires rigorous data balancing to avoid some nodes becoming overloaded hotspots. An uneven distribution of data leads to skewed query performance, where some partitions are overworked while others sit almost idle.

The second approach, vertical partitioning, in which data is segmented along columns, is more appropriate for systems with complex data structures. It allows rarely used columns to be separated from frequently accessed ones, minimising the data read during query processing (Al Jawarneh et al., 2021, p.4160). However, this method is often inefficient where the data is unstructured or only partly structured, since column-based segmentation makes little difference there. Meanwhile, sharding has become one of the main measures for scaling cloud-based databases, with data split across different nodes.

Dynamic Caching Solutions

Adaptive caching algorithms offer significant improvements in query response time, especially under changing workloads. These algorithms can determine cache sizes and eviction strategies based on observed query behaviour, leading to significant performance improvement. However, such systems depend on an accurate forecast of query patterns, which can be a serious limitation in fast-changing environments: misestimating how often particular data is queried leads to inefficient cache utilisation and disappointing performance.

Optimising Storage Solutions

SSD-based storage structures have received massive support because of the gains they provide in performance, typically measured in IOPS (input/output operations per second). With their fast data access and minimal latency, SSDs are well suited to workloads that constantly read large datasets. However, SSDs remain relatively costly compared with HDDs, which makes large-scale deployment difficult in cost-sensitive areas such as consumer markets.

To this end, many systems have adopted tiered storage designs that use both SSDs and conventional hard drives. In such configurations, frequently accessed or "hot" data is placed on the SSDs while "cold" data is stored on the less expensive, but slower, hard disk drives. This tiered strategy is effective at balancing performance against cost, but it depends on careful data management: the system must continually monitor and optimise the distribution of data between SSDs and HDDs.





Conclusion

The literature examined in this report offers a panoramic view of optimisation strategies that can improve the efficiency and functionality of a DBMS. Adaptive and hybrid indexing strategies provide large query-performance improvements over simple indexing, though their use can increase system complexity. Cost models, machine learning methods, and other optimisation approaches also show high potential for performance enhancement, while their effectiveness depends on accurate cost estimates and manageable model-training overheads.

In-memory databases are pronounced an optimal solution for minimising disk I/O and enhancing query performance, though their dependence on memory limits real-world scalability. Partitioning and sharding are both powerful techniques for managing distributed databases, but they must be implemented carefully to avoid mistakes that degrade system performance. Adaptive caching can be effective at reducing execution time for repeated queries, although predicting query patterns remains a hard task.

Last but not least, improving storage systems through SSDs and data compression brings major advantages in data-retrieval speed and reduced storage footprint respectively. However, since SSDs are expensive compared with normal hard disk drives, and since compression requires computation, the level of compression must be dynamically balanced against the performance required.

Subsequent research should strive to implement more flexible optimisation methodologies that can respond to changes in workload and data within the system. More work is also needed on applying machine learning to query optimisation and on scalable memory solutions essential for building in-memory databases.



References

  1. Abbas, Q. and Alsheddy, A., 2020. Driver fatigue detection systems using multi-sensors, smartphone, and cloud-based computing platforms: a comparative analysis. Sensors, 21(1), p.56. https://www.mdpi.com/1424-8220/21/1/56

  2. Achkouty, F., Chbeir, R., Gallon, L., Mansour, E. and Corral, A., 2023. Resource Indexing and Querying in Large Connected Environments. Future Internet, 16(1), p.15. https://www.mdpi.com/1999-5903/16/1/15

  3. Ahamed, Z., Khemakhem, M., Eassa, F., Alsolami, F. and Al-Ghamdi, A.S.A.M., 2023. Technical study of deep learning in cloud computing for accurate workload prediction. Electronics, 12(3), p.650. https://www.mdpi.com/2079-9292/12/3/650

  4. Al Jawarneh, I.M., Bellavista, P., Corradi, A., Foschini, L. and Montanari, R., 2021. QoS-aware approximate query processing for smart cities spatial data streams. Sensors, 21(12), p.4160. https://www.mdpi.com/1424-8220/21/12/4160

  5. Bahn, H. and Cho, K., 2020. Implications of NVM-based storage on memory subsystem management. Applied Sciences, 10(3), p.999. https://www.mdpi.com/2076-3417/10/3/999

  6. Calatrava, C.G., Fontal, Y.B. and Cucchietti, F.M., 2022. A holistic scalability strategy for time series databases following cascading polyglot persistence. Big Data and Cognitive Computing, 6(3), p.86. https://www.mdpi.com/2504-2289/6/3/86

  7. Kouahla, Z., Benrazek, A.E., Ferrag, M.A., Farou, B., Seridi, H., Kurulay, M., Anjum, A. and Asheralieva, A., 2021. A survey on big IoT data indexing: Potential solutions, recent advancements, and open issues. Future Internet, 14(1), p.19. https://www.mdpi.com/1999-5903/14/1/19

  8. Przytarski, D., Stach, C., Gritti, C. and Mitschang, B., 2021. Query processing in blockchain systems: Current state and future challenges. Future Internet, 14(1), p.1. https://www.mdpi.com/1999-5903/14/1/1

  9. Ukey, N., Yang, Z., Li, B., Zhang, G., Hu, Y. and Zhang, W., 2023. Survey on exact known queries over high-dimensional data space. Sensors, 23(2), p.629. https://www.mdpi.com/1424-8220/23/2/629

  10. Wang, W., 2023. Efficient modal identification and optimal sensor placement via dynamic DIC measurement and feature-based data compression. Vibration, 6(4), pp.820-842. https://www.mdpi.com/2571-631X/6/4/50

  11. Ye, C., Duan, H., Zhang, H., Zhang, H., Wang, H. and Dai, G., 2023. Multi-Source Data Repairing: A Comprehensive Survey. Mathematics, 11(10), p.2314. https://www.mdpi.com/2227-7390/11/10/2314




