Data Management

Introduction

Data volumes are growing at an alarming rate across business, experimental science, social media and academia. At this scale, it is hard to discover information and make decisions in a reasonable time. According to an International Data Corporation (IDC) report, digital data was predicted to grow 40-fold from 2012 to 2020, so it is of utmost importance to develop tools that can handle such a huge volume of evolving data [2,15].

Today, the amount of big data that NoSQL systems such as Riak can manage outstrips what the largest relational database management systems (RDBMS) can manage. NoSQL databases are designed from the ground up to require less management: features such as automatic data distribution and automatic repair, together with simpler data models, lower the administration effort [3]. NoSQL databases typically use clusters of cheap commodity servers to handle exploding data and transaction volumes, whereas relational databases tend to rely on expensive proprietary servers and storage systems. As a result, the cost per gigabyte or per transaction per second can be many times lower for NoSQL than for an RDBMS, which makes it possible to store and process more data at a much lower price [1]. NoSQL key-value (KV) stores and document databases let the application store practically any structure it needs in a data element. Even the more rigidly defined BigTable-based NoSQL databases (Cassandra, HBase) typically allow new columns to be created without much fuss.

The large number of NoSQL offerings makes it difficult to differentiate between them and to judge their suitability in different circumstances [7]. In this study, we test and evaluate the Riak key-value database for big data clusters using Basho-bench, a benchmarking tool created to conduct accurate and repeatable performance and stress tests and to produce performance graphs.

This paper aims to accomplish the following:

  • To generate a synthetic workload and a data access pattern on the cluster that match the workloads of real-world applications, and to monitor the cluster's performance.
  • To observe the performance of Riak KV with large data volumes and various workloads (read, write, update, and a mix of read and update).
  • To monitor the performance of Riak KV (throughput and latency) during read, write and update operations.

The rest of this paper is organised as follows. Section 2 presents the background and basic concepts. Section 3 takes a deeper look at related work. Section 4 provides an overview of the Riak KV NoSQL database system and its infrastructure. Section 5 describes Basho-bench benchmarking of Riak KV. Section 6 presents the experiment environment for testing the Riak KV NoSQL database with Basho-bench. Section 7 provides our experimental results and discussion. Section 8 concludes the paper.

Background and Basic Concepts

This section introduces the basic concepts related to big data and the properties of NoSQL databases, along with the challenges associated with both.

Big data

In this part, we describe the term big data, which is closely related to NoSQL database systems. Big data can be defined as the capability of managing a huge volume of data at the right time and at the proper speed. It is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information and that cannot be managed using relational database management systems (RDBMS) [6,8]. Every day, new data is created from a variety of sources, including social networks, photos, videos and more. Because of this rapid growth, it has become very difficult to process the data with the available database management systems. One proposed solution has been to use better hardware; however, this approach has not been sufficient, because hardware improvement has reached a point where the growth of data volume outpaces computing resources [5]. Big data can be found in three forms:

Structured data- Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data. Over time, computer scientists have achieved great success in developing techniques for working with this kind of data (where the format is well known in advance) and for deriving value from it. Structured data comes from two sources. The first is data generated by human intervention, such as gaming data and input data. The second is data generated by machines, such as sensor data, web log data and financial data [8,9].

Unstructured data- Before online and mobile applications became ubiquitous, databases processed straightforward, structured data. The data forms were fairly simple and described a set of relationships between various data types in the database. In contrast, unstructured data refers to data

Column-oriented- A column-oriented database uses tables as its data model, but it stores tables of extensible records. A table consists of columns and rows, which may be partitioned across nodes. In general, this data model is better suited to aggregation and data warehouse applications. HBase [20] and Cassandra [21] are examples of this kind of data store.

Document data stores- Also known as document-oriented databases, these systems are used to store, retrieve and manage semi-structured data. A document database can usually use secondary indexes to support the upper-layer application. Although the key-value and document database structures are very similar, they differ in how they process data. The name comes from the manner of storage: the data is stored as documents in XML or JSON format [22,23]. CouchDB and MongoDB [24] are examples of document data stores.

Graph databases- A graph database comprises nodes that are connected by edges. Data can be stored in both nodes and edges. One advantage of a graph database is that it can traverse relationships very quickly. Similar to the other three types of NoSQL databases mentioned above, graph databases have some problems with horizontal scaling. Because every node can connect to any other node, traversing nodes located on different physical machines can have a negative effect on performance. Another difference from the above three is that most graph databases

Abramova et al. [13] tested the performance of Cassandra against a number of factors, including the number of nodes, workload characteristics, number of threads and data size, and analysed whether it provides the desired speed-up and scalability. They found that increasing the number of nodes and the size of the data sets does not guarantee better performance; however, Cassandra handles concurrent request threads well and scales well with them. In summary, when the number of nodes in a cluster increases from 1 or 3 to 6, even for relatively large data sets, an improvement in performance cannot be guaranteed. The authors of [29] presented the method and results of a study that chose among three NoSQL database systems for a large, distributed healthcare organisation.

The performance assessment methods and results are presented for the following databases: MongoDB, Cassandra and Riak. The test was based on the YCSB benchmark for evaluating NoSQL databases. The paper concluded that Cassandra provides the best throughput, but with the highest latency.

Riak Key-Value (KV)

Riak is the open-source version of Riak Enterprise DS. It is a KV database developed by Basho in 2007 and written in Erlang and C. The enterprise version adds multi-data-center replication, monitoring and additional support [22].

Riak KV is a distributed NoSQL database that is highly scalable, highly available and straightforward to work with. It automatically distributes data across a cluster to ensure fast performance and fault tolerance. Riak Enterprise includes multi-cluster replication that guarantees low latency and strong business continuity. By supporting both local and multi-cluster replication, Riak KV ensures that reads and writes succeed even in cases of hardware failure or network partition. Riak KV is designed to deal with a range of challenges facing big data applications, including tracking client or session data. Basho-bench focuses on two performance metrics: throughput and latency [28].
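To make the two metrics concrete, the following minimal Python sketch computes throughput (operations per second) and mean latency from a list of per-operation latencies collected in one measurement window. It illustrates the definitions only and is not how Basho-bench is implemented; the function name `summarize` and the sample values are hypothetical.

```python
from statistics import mean


def summarize(latencies_ms, window_seconds):
    """Return (throughput in ops/sec, mean latency in ms) for one window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not latencies_ms:
        return 0.0, 0.0
    # Throughput: completed operations divided by the window length.
    throughput = len(latencies_ms) / window_seconds
    # Latency: average time a single operation took to complete.
    return throughput, mean(latencies_ms)


# Hypothetical sample: five operations completed within a one-second window.
sample_latencies_ms = [2.0, 4.0, 3.0, 5.0, 6.0]
throughput, mean_latency = summarize(sample_latencies_ms, window_seconds=1.0)
print(f"throughput={throughput:.1f} ops/s, mean latency={mean_latency:.1f} ms")
```

A real benchmark would typically also report latency percentiles, since a mean can hide a slow tail.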

How Does the Benchmark Work?

Each node can be either a traffic generator or a Riak node. A traffic generator runs one copy of Basho-bench that generates and sends commands to Riak nodes. A Riak node contains a complete and independent copy of the Riak package which is identified by an Internet Protocol (IP) address and a port number. Figure 2 shows how traffic generators and Riak nodes are organised inside a cluster. There is one traffic generator for every three Riak nodes [4].
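The one-to-three ratio described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, rounding up for an incomplete group of three is our assumption, and the grouping of specific Riak nodes under a specific generator is illustrative rather than something the text specifies.

```python
import math

# Ratio stated in the text: one traffic generator per three Riak nodes.
RIAK_NODES_PER_GENERATOR = 3


def traffic_generators_needed(riak_nodes):
    """Number of traffic generators for a given number of Riak nodes.

    Assumption: an incomplete group of three still gets its own generator.
    """
    if riak_nodes < 0:
        raise ValueError("riak_nodes must be non-negative")
    return math.ceil(riak_nodes / RIAK_NODES_PER_GENERATOR)


def group_nodes(riak_nodes):
    """Map each Riak node index to a generator index (illustrative grouping)."""
    return {node: node // RIAK_NODES_PER_GENERATOR for node in range(riak_nodes)}


print(traffic_generators_needed(6))
print(group_nodes(6))
```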

Experiment Environment

In this part, we introduce the experiments performed to test the Riak KV NoSQL database with Basho-bench. The benchmark is specifically designed for Riak performance testing and analysis. Benchmarking is done with Basho's measurement software, which reports the number of transactions executed per second. The benchmark needs a configuration file, which contains the parameters required to start the benchmark. It executes the given number of workers that

Basho-bench is a test tool that performs reads, updates and writes according to a workload and measures performance. The operations that the driver will run are given as a weighted list, such as [{get,4}, {put,4}, {delete, 1}], which means that out of every 9 operations, get will be called four times, put four times, and delete once, on average. The benchmark package provides a set of predetermined experiments that can be executed as follows:

To evaluate the loading time, we generated different numbers of keys (10 K, 100 K, 1,000 K, 10,000 K, and 200,000 K) and varied the number of threads (4, 8, and 12).
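As a rough illustration of the workload settings described above, the Python sketch below turns the operation weights [{get,4}, {put,4}, {delete,1}] into expected shares and enumerates the key-count and thread-count settings. It is not a Basho-bench configuration file. The helper names are hypothetical, "K" is assumed to mean one thousand keys, and pairing every key count with every thread count is our assumption rather than something the text states.

```python
from itertools import product

# Operation weights quoted in the text: get 4, put 4, delete 1.
OPERATIONS = [("get", 4), ("put", 4), ("delete", 1)]

# Key counts and thread counts from the text (assuming K = 1,000 keys).
KEY_COUNTS = [10_000, 100_000, 1_000_000, 10_000_000, 200_000_000]
THREAD_COUNTS = [4, 8, 12]


def operation_shares(operations):
    """Convert integer weights into the expected fraction of each operation."""
    total = sum(weight for _, weight in operations)
    return {name: weight / total for name, weight in operations}


def experiment_grid(key_counts, thread_counts):
    """Enumerate every (keys, threads) pair; full pairing is an assumption."""
    return [{"keys": k, "threads": t} for k, t in product(key_counts, thread_counts)]


shares = operation_shares(OPERATIONS)
grid = experiment_grid(KEY_COUNTS, THREAD_COUNTS)
print(shares)     # get and put each take 4/9 of operations, delete takes 1/9
print(len(grid))  # number of (keys, threads) combinations
```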

Experimental Results and Discussion

In the following, we devote a section to each experiment, describing the different read and update scenarios and illustrating the results.

equipment as well as communications. From Figure 5, we observe that the three cases have high latency in the data update process. This is expected because read operations usually do not incur as much latency as the other operations. Where the highest value

Conclusion

In this paper, we analysed and evaluated the read and update throughput and latency of the Riak KV NoSQL database management system in a cluster environment, using Basho-bench. Benchmarking NoSQL data stores in a cluster environment and monitoring factors such as throughput and latency are important because NoSQL databases differ from one another and their usefulness varies from one application to another. In addition, system performance is still an important factor when processing large amounts of data. We took measurements in three experiments with different numbers of operations: experiments A, B and C. In each experiment, we measured the read throughput and latency and the update throughput and latency. We found that performance is affected significantly by increased data size. We also found that as the number of threads increases, throughput improves and latency decreases.

References

  • [1] Rakesh Kumar, Shilpi Charu, Somya Bansal. "Effective Way to Handling Big Data Problems using NoSQL Database (MongoDB)". Journal of Advanced Database Management & Systems, ISSN: 2393-8730 (online), Volume 2, Issue 2, 2015.
  • Rakesh K. Lenka et al. "Comparative Analysis of Spatial Hadoop and GeoSpark for Geospatial Big Data Analytics". In 2016 2nd International Conference on Contemporary Computing and Informatics (IC3I), 14-17 Dec. 2016.
  • Anasuya N. Jadagerimath and Dr. Prakash S. "Efficient IoT Data Management for Cloud Environment using Mongo DB". Proc. of Int. Conf. on Current Trends in Eng., Science and Technology (ICCTEST), 2017.
  • Amir Ghaffari, Natalia Chechina, Phil Trinder, Jon Meredith. (Sep 2013). Scalable Persistent Storage for Erlang: Theory and Practice. Twelfth ACM SIGPLAN Workshop on Erlang, Boston, MA, USA.
  • "Challenges and Opportunities with Big Data". CRA.org. Retrieved Jan 2016.
  • "Big Data for Dummies". Dr. Fern Halper, Marcia Kaufman, Judith Hurwitz, Alan Nugent, 2013.
  • Raj R. Parmar and Sudipta Roy. "MongoDB as an Efficient Graph Database: An Application of Document Oriented NOSQL Database". Data Intensive Computing Applications for Big Data, 2018.
  • https://www.webopedia.com/TERM/B/big_data.html
  • A Comparison of NoSQL Database Systems: A Study on MongoDB, Apache Hbase, and Apache Cassandra
  • NoSQL Databases: Critical Analysis and Comparison
  • Testing the Performance of NoSQL Databases via the Database Benchmark Tool
  • Survey of NoSQL Database Engines for Big Data
  • V. Abramova, J. Bernardino, P. Furtado. (2014). Which NoSQL database? A performance overview. Open Journal of Databases, Volume 1, Issue 2, pp. 17-24.
  • https://www.techopedia.com/definition/28802/semi-structured-data
  • Jing Han, Haihong E, Guan Le, Jian Du. (2011). Survey on NoSQL Database. In IEEE 6th International Conference on Pervasive Computing and Applications (ICPCA).
  • Asadulla Khan Zaki. (2014). NoSQL databases: new millennium database for big data, big users, cloud computing and its security challenges. IJRET: International Journal of Research in Engineering and Technology, Volume 03, Special Issue.
  • Techopedia [Online]. 2018, Retrieved from: https://www.techopedia.com/definition/26284/key-value-store.
  • Riak-kv database [Online]. 2018, Retrieved from: http://basho.com/products/riak-kv/
  • Redis database [Online]. 2018, Retrieved from: https://redis.io/.
  • Hbase database [Online]. 2018, Retrieved from: http://hbase.apache.org/.
  • Cassandra database [Online]. 2018, Retrieved from: http://cassandra.apache.org/.
  • Man Qi. Digital Forensics and NoSQL Databases. (2014). In IEEE 11th International Conference on Fuzzy Systems and Knowledge Discovery.
  • Jing Han, Haihong E, Guan Le, Jian Du. (2011). Survey on NoSQL Database. In IEEE 6th International Conference on Pervasive Computing and Applications (ICPCA).
  • MongoDB database [Online]. 2018, Retrieved from: https://www.mongodb.com/.
  • Man Qi. Digital Forensics and NoSQL Databases. (2014). In IEEE 11th International Conference on Fuzzy Systems and Knowledge Discovery.
  • Neo4j database [Online]. 2018, Retrieved from: https://neo4j.com/
  • https://github.com/basho/basho_bench.
  • John Klein, Ian Gorton, Neil Ernst, Patrick Donohoe, Kim Pham, and Chrisjan Matser. (2015). Performance Evaluation of NoSQL Databases: A Case Study. In Proceedings of the 1st Workshop on Performance Analysis of Big Data Systems (PABS ’15). ACM, New York, NY, USA, pp. 5-10.