Data Persistence

Hello guys and welcome back to my Programming Applications & Frameworks tutorial series. In this article, I would like to discuss some important facts about "Data Persistence".





What is Data Persistence?

What information systems mainly do is process data and convert it into information. The data therefore has to persist for later use, for example to maintain state, for logging, or for further processing to derive knowledge. Data can be stored, read, updated (modified), and deleted. While a software system is running, its data is held in main memory, which is volatile. Persistence is "the continued or prolonged existence of something." In computer systems, persistence means that the data survives after the process that created it has ended. For data to persist, it must be stored in non-volatile storage. We can store data in files or databases, and there are many formats to choose from, such as plain text, XML, JSON, and tables.
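
To make this concrete, here is a minimal Java sketch, assuming a small text file is an acceptable store. The class name NoteStore and its methods are made up for illustration. It writes a value to disk and reads it back, which is what lets the value outlive the process that created it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: persists a single string in a plain text file.
final class NoteStore {
    private final Path file;

    NoteStore(Path file) {
        this.file = file;
    }

    // Write the value to non-volatile storage, replacing any previous content.
    void save(String value) {
        try {
            Files.write(file, value.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the value back; returns null if nothing has been saved yet.
    String load() {
        if (!Files.exists(file)) {
            return null;
        }
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A second NoteStore created later for the same path, even in a different run of the program, would load the value that the first one saved. A variable in main memory cannot do that.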

To understand data persistence, it helps to have an idea about data, databases, database servers, and database management systems.

Data

Data are facts and statistics that are collected for analysis or reference. Data itself has no meaning unless it is converted to meaningful information. Data can be quantitative or qualitative.

There are 3 ways to arrange data
  1. Un-structured
  2. Semi-structured
  3. Structured

Database

A database is a structured set of data held in a computer. They are created and managed in database servers. Data is organized into rows, columns, and tables, and it is indexed to make it easier to find relevant information. Data gets updated, expanded and deleted as new information is added. Databases process workloads to create and update themselves, querying the data they contain and running applications against it.


Database Server

A database server is a term used to refer to the back-end system of a database application using client/server architecture. The back-end, sometimes called a database server, performs tasks such as data analysis, storage, data manipulation, archiving, and other non-user specific tasks. It may also refer to the physical computer used to host the database. When mentioned in this context, the database server is typically a dedicated higher-end computer that hosts the database.

Database Management System

A database management system (DBMS) is system software for creating and managing databases. The DBMS provides users and programmers with a systematic way to create, retrieve, update and manage data. A DBMS makes it possible for end users to create, read, update and delete data in a database. The DBMS essentially serves as an interface between the database and end users or application programs, ensuring that data is consistently organized and remains easily accessible.



Arrangements of data

Un-structured data -

Unstructured data is information, in many different forms, that doesn't hew to conventional data models and thus typically isn't a good fit for a mainstream relational database. Thanks to the emergence of alternative platforms for storing and managing such data, it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. One of the most common types of unstructured data is text. Unstructured text is generated and collected in a wide range of forms, including Word documents, email messages, PowerPoint presentations, survey responses, transcripts of call center interactions, and posts from blogs and social media sites.
Ex - Unstructured data files often include text and multimedia content. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents. 

Semi-Structured data -

In addition to structured and unstructured data, there is a third category: semi-structured data. Semi-structured data is data that has not been organized into a specialized repository, such as a database, but that nevertheless has associated information, such as metadata, that makes it more amenable to processing than raw data. In other words, it doesn't reside in a relational database, but it does have some organizational properties that make it easier to analyze. Examples of semi-structured data include XML documents and NoSQL databases.
Ex - XML and JSON documents are semi-structured, and CSV files are often placed here as well. NoSQL databases are also considered semi-structured.

Structured data -

Structured data is data that follows a predefined data model. It is organized into fields with fixed names and types, usually as rows and columns in a table, so it can be stored in a relational database and queried easily. (The same term is also used in web development for extra markup, such as the Event microdata from Schema.org, that is added to HTML pages so that search bots can understand the content and show rich snippets in search results. That is a different meaning from the one used in this article.)
Ex - numbers, dates, and groups of words and numbers called strings. Structured data is the data you’re probably used to dealing with, and it’s usually stored in a database. Most experts agree that this kind of data accounts for about 20 percent of the data that is out there.
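
As a rough illustration of the three arrangements, the Java sketch below holds the same fact (an order) in three shapes. All names and values are invented for the example.

```java
import java.time.LocalDate;

// Hypothetical example: one order represented in three arrangements.
final class Arrangements {
    // Unstructured: free text, the meaning is only recoverable by reading it.
    static final String UNSTRUCTURED =
        "Hi, on 5 March 2019 Nimal ordered 3 keyboards. Thanks!";

    // Semi-structured: self-describing tags, but no enforced schema.
    static final String SEMI_STRUCTURED =
        "{\"customer\":\"Nimal\",\"item\":\"keyboard\",\"qty\":3,\"date\":\"2019-03-05\"}";

    // Structured: fixed fields with fixed types, like one row of a table.
    static final class OrderRow {
        final String customer;
        final String item;
        final int quantity;
        final LocalDate date;

        OrderRow(String customer, String item, int quantity, LocalDate date) {
            this.customer = customer;
            this.item = item;
            this.quantity = quantity;
            this.date = date;
        }
    }

    static OrderRow structured() {
        return new OrderRow("Nimal", "keyboard", 3, LocalDate.of(2019, 3, 5));
    }
}
```

With the structured form, a program can read the quantity directly as a number. With the semi-structured form it has to parse the text first, and with the unstructured form it has to interpret natural language.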


Let's get to know about different types of database management systems

There are several types of database management systems
  1. Hierarchical databases
  2. Network databases
  3. Relational databases
  4. Object-oriented databases
  5. Graph databases
  6. ER model databases
  7. Document databases

Hierarchical Databases


In a hierarchical database management system (hierarchical DBMS), data is stored as nodes in a parent-child relationship. Besides the actual data, records in a hierarchical database also contain information about their parent/child relationships.
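
The parent-child idea can be sketched in a few lines of Java. This is only an in-memory illustration with a made-up HNode class, not how any particular hierarchical DBMS stores its records.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical record type for a hierarchical store: one parent, many children.
final class HNode {
    final String name;
    final HNode parent;                              // null for the root
    final List<HNode> children = new ArrayList<>();

    HNode(String name, HNode parent) {
        this.name = name;
        this.parent = parent;
        if (parent != null) {
            parent.children.add(this);               // record the relationship on both sides
        }
    }

    // Walk up the parent links to build the path from the root to this node.
    String path() {
        return parent == null ? name : parent.path() + "/" + name;
    }
}
```

Because every node has exactly one parent, there is exactly one path from the root to any record.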

Network Databases

Network database management systems (network DBMSs) use a network structure to create relationships between entities. Network databases were mainly used on large digital computers. They are similar to hierarchical databases, but unlike a hierarchical database, where a node can have only one parent, a network node can have relationships with multiple entities.

Relational Databases

In relational database management systems (RDBMS), the relationship between data is relational and data is stored in tabular form, in columns and rows. Each column of a table represents an attribute, each row represents a record, and each field represents a data value.

Object-oriented Databases

An object-oriented database builds on the concepts of object-oriented programming. It does more than store programming language objects: an object DBMS extends the semantics of languages such as C++ and Java to provide full-featured database programming capability while retaining native language compatibility. In effect, it adds database functionality to object programming languages.

Graph Databases

Graph Databases are NoSQL databases and use a graph structure for semantic queries. The data is stored in the form of nodes, edges, and properties. In a graph database, a Node represents an entity or instance such as customer, person, or a car. A node is equivalent to a record in a relational database system.
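
Here is a small in-memory sketch of nodes, edges, and properties in Java. The TinyGraph class is invented for illustration and is not the API of any real graph database.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory property graph.
final class TinyGraph {
    // A node is an entity such as a customer or a car, with free-form properties.
    static final class Node {
        final String label;
        final Map<String, Object> properties = new HashMap<>();

        Node(String label) {
            this.label = label;
        }
    }

    // An edge is a typed, directed relationship between two nodes.
    static final class Edge {
        final Node from;
        final String type;
        final Node to;

        Edge(Node from, String type, Node to) {
            this.from = from;
            this.type = type;
            this.to = to;
        }
    }

    private final List<Edge> edges = new ArrayList<>();

    void connect(Node from, String type, Node to) {
        edges.add(new Edge(from, type, to));
    }

    // Follow the outgoing edges of the given type from a node.
    List<Node> neighbours(Node from, String type) {
        List<Node> result = new ArrayList<>();
        for (Edge e : edges) {
            if (e.from == from && e.type.equals(type)) {
                result.add(e.to);
            }
        }
        return result;
    }
}
```

A query such as "which cars does this customer own?" becomes a walk along the OWNS edges from the customer node, instead of a join between tables.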

ER Model Databases

An ER model is typically implemented as a database. In a simple relational database implementation, each row of a table represents one instance of an entity type, and each field in a table represents an attribute type. In a relational database, a relationship between entities is implemented by storing the primary key of one entity as a pointer or "foreign key" in the table of another entity.

Document Databases

Document databases (document DBs) are also NoSQL databases, and they store data in the form of documents. Each document represents the data, its relationships to other data elements, and the attributes of the data. Document databases store data in a key-value form. They have become popular recently due to their document storage and NoSQL properties. NoSQL data storage can provide a faster mechanism to store and search documents.
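
The key-value form of a document store can be sketched like this in Java. TinyDocumentStore is a made-up class for illustration; real document databases add persistence, indexing, and a query language on top of this idea.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory document store: each id maps to one document,
// and each document is itself a map of field names to values.
final class TinyDocumentStore {
    private final Map<String, Map<String, Object>> documents = new HashMap<>();

    void put(String id, Map<String, Object> document) {
        documents.put(id, document);
    }

    Map<String, Object> get(String id) {
        return documents.get(id);
    }

    // Find the ids of documents whose field equals the given value.
    // Documents that lack the field are skipped, since there is no fixed schema.
    List<String> findBy(String field, Object value) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Map<String, Object>> entry : documents.entrySet()) {
            if (value.equals(entry.getValue().get(field))) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }
}
```

Notice that two documents in the same store do not need to have the same fields, which is the main difference from rows in a relational table.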

Data Warehouse & Big Data

Each basis for comparison is listed below, first for a data warehouse and then for big data.

Meaning
  • Data Warehouse: A data warehouse is mainly an architecture, not a technology. It extracts data from a variety of SQL-based data sources (mainly relational databases) and helps generate analytic reports. In other words, the data repository that this process builds and that is used for analytic reports is the data warehouse.
  • Big Data: Big data is mainly a technology, characterized by the volume, velocity, and variety of data. Volume is the amount of data coming from different sources, velocity is the speed at which data is processed, and variety is the number of types of data (it mainly supports all data formats).

Preferences
  • Data Warehouse: If an organization wants to make informed decisions (such as understanding what is going on in the business, or planning next year based on the current year's performance data), it prefers data warehousing, because this kind of report needs reliable, trustworthy data from the sources.
  • Big Data: If an organization needs to analyze large volumes of big data that contain valuable information and help it make better decisions (such as how to increase revenue, profitability, or customers), it prefers the big data approach.

Accepted Data Source
  • Data Warehouse: Accepts one or more homogeneous (all sites use the same DBMS product) or heterogeneous (sites may run different DBMS products) data sources.
  • Big Data: Accepts any kind of source, including business transactions, social media, and sensor or machine-specific data. The data may or may not come from a DBMS product.

Accepted type of formats
  • Data Warehouse: Handles mainly structured data (specifically relational data).
  • Big Data: Accepts all types of formats: structured data, relational data, and unstructured data including text documents, email, video, audio, stock ticker data, and financial transactions.

Subject-Oriented
  • Data Warehouse: A data warehouse is subject-oriented because it provides information on a specific subject (such as products, customers, suppliers, sales, or revenue) rather than on the organization's ongoing operations. It mainly focuses on analyzing or displaying data that helps decision making.
  • Big Data: Big data is also subject-oriented. The main difference is the source of data, since big data can accept and process data from all sources, including social media and sensor or machine-specific data. It also aims to provide analysis of data on a specific subject.

Time-Variant
  • Data Warehouse: The data collected in a data warehouse is identified by a particular time period, because it mainly holds historical data for analytical reports.
  • Big Data: Big data has several approaches for identifying already loaded data, and a time period is one of them. Because big data mainly processes flat files, archiving with a date and time is the best approach for identifying loaded data. It can also work with streaming data, so it does not always hold historical data.

Non-volatile
  • Data Warehouse: Previous data is never erased when new data is added. This is one of the major features of a data warehouse. Because it is separate from the operational database, changes to the operational database do not directly affect the data warehouse.
  • Big Data: Again, previous data is never erased when new data is added. It is stored as files that represent tables. For streaming, however, Hive or Spark is sometimes used directly as the operating environment.

Distributed File System
  • Data Warehouse: Processing huge amounts of data in a data warehouse is time-consuming and can sometimes take an entire day to complete.
  • Big Data: This is one of the big advantages of big data. HDFS (Hadoop Distributed File System) is mainly designed to load huge amounts of data into distributed systems, where it is processed by MapReduce programs.
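
The map reduce idea mentioned for HDFS can be sketched in plain Java. The example below is a single-process word count that follows the map, group, and reduce phases. It does not use the Hadoop API, and the class name WordCountSketch is made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Single-process sketch of the map/reduce idea (word count); not the Hadoop API.
final class WordCountSketch {
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
            // "Map" phase: split each line into lower-cased words.
            .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
            .filter(word -> !word.isEmpty())
            // "Shuffle" and "reduce" phases: group equal words and count each group.
            .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }
}
```

In a real cluster the lines would live in different blocks of a distributed file system and the phases would run on many machines in parallel, but the shape of the computation is the same.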

Do you know about these statements?

SQL statements

Structured Query Language (SQL) is a standard computer language for relational database management and data manipulation. SQL is used to query, insert, update, and delete data. Most relational databases support SQL, which is an added benefit for database administrators (DBAs), as they are often required to support databases across several different platforms.

Prepared statements

A prepared statement is a feature used to execute the same (or similar) SQL statements repeatedly with high efficiency. Prepared statements basically work like this:
  1. Prepare: An SQL statement template is created and sent to the database. Certain values are left unspecified, called parameters (labeled "?"). Example: INSERT INTO MyGuests VALUES(?, ?, ?).
  2. The database parses, compiles, and performs query optimization on the SQL statement template, and stores the result without executing it.
  3. Execute: At a later time, the application binds the values to the parameters, and the database executes the statement. The application may execute the statement as many times as it wants with different values.
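
In Java, the prepare, bind, and execute steps map onto the JDBC PreparedStatement interface. The sketch below assumes a MyGuests table with three text columns and an already open Connection supplied by the caller. The GuestDao class and the layout of the guest arrays are made up for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Sketch only: assumes a MyGuests table with three text columns and an
// already-open JDBC Connection supplied by the caller.
final class GuestDao {
    static final String INSERT_SQL = "INSERT INTO MyGuests VALUES (?, ?, ?)";

    // Prepares the statement once, then executes it once per guest.
    // Each guest is a {firstName, lastName, email} array (a made-up layout).
    static int insertAll(Connection conn, List<String[]> guests) throws SQLException {
        int inserted = 0;
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            for (String[] guest : guests) {
                ps.setString(1, guest[0]);      // bind values to the "?" parameters
                ps.setString(2, guest[1]);
                ps.setString(3, guest[2]);
                inserted += ps.executeUpdate(); // execute with the current values
            }
        }
        return inserted;
    }
}
```

Because the values are bound as parameters and never concatenated into the SQL text, prepared statements are also a common defence against SQL injection.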

Callable statements

The CallableStatement interface is used to call stored procedures and functions. Suppose you need to get the age of an employee based on the date of birth: you may create a function that receives a date as input and returns the employee's age as output. A callable statement object provides a way to call stored procedures in a standard way for all RDBMSs. A stored procedure is stored in a database; the call to the stored procedure is what a callable statement object contains. This call is written in an escape syntax that may take one of two forms: one form with a result parameter, and the other without one. A result parameter, a kind of OUT parameter, is the return value for the stored procedure.
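
Here is a minimal JDBC sketch of the form with a result parameter. It assumes the database already defines a stored function that takes a date and returns an integer age. The function name get_age and the AgeLookup class are made up for illustration.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.Date;
import java.sql.SQLException;
import java.sql.Types;

// Sketch only: assumes the database defines a stored function
// get_age(date) returning an integer; that name is hypothetical.
final class AgeLookup {
    // Escape syntax with a result parameter: the first "?" is the return value.
    static final String CALL_SQL = "{? = call get_age(?)}";

    static int ageFor(Connection conn, Date dateOfBirth) throws SQLException {
        try (CallableStatement cs = conn.prepareCall(CALL_SQL)) {
            cs.registerOutParameter(1, Types.INTEGER); // result (OUT) parameter
            cs.setDate(2, dateOfBirth);                // IN parameter
            cs.execute();
            return cs.getInt(1);                       // read the returned age
        }
    }
}
```

The form without a result parameter would look like {call some_procedure(?)} and would not need the registerOutParameter call for the return value.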

Object Relational Mapping (ORM)

Object-Relational Mapping (ORM) is a technique that lets you query and manipulate data from a database using an object-oriented paradigm. When talking about ORM, most people are referring to a library that implements the Object-Relational Mapping technique, hence the phrase "an ORM". An ORM library is a completely ordinary library written in your language of choice that encapsulates the code needed to manipulate the data, so you don't use SQL anymore; you interact directly with an object in the same language you're using.

For example, here is a completely imaginary case with a pseudo language:
You have a book class, and you want to retrieve all the books whose author is "Linus". Manually, you would do something like this:
book_list = new List();
sql = "SELECT book FROM library WHERE author = 'Linus'";
data = query(sql); // I oversimplify ...
while (row = data.next())
{
     book = new Book();
     book.setAuthor(row.get('author'));
     book_list.add(book);
}
With an ORM library, it would look like this:
book_list = BookTable.query(author="Linus");
The mechanical part is taken care of automatically via the ORM library.
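
To give a feel for what that mechanical part involves, here is a toy Java mapper that uses reflection to copy the columns of one row into the fields of an object. A row is represented as a simple Map, and the TinyMapper and Book classes are made up for illustration. Real ORM libraries do far more than this (SQL generation, caching, relationships, transactions).

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.util.Map;

// Toy illustration of what an ORM automates; not how any real ORM is implemented.
final class TinyMapper {
    // Creates an instance of the given type and fills each declared field from
    // the row entry with the same name. Missing columns leave the field untouched.
    static <T> T mapRow(Map<String, Object> row, Class<T> type) {
        try {
            Constructor<T> constructor = type.getDeclaredConstructor();
            constructor.setAccessible(true);
            T instance = constructor.newInstance();
            for (Field field : type.getDeclaredFields()) {
                if (row.containsKey(field.getName())) {
                    field.setAccessible(true);
                    field.set(instance, row.get(field.getName()));
                }
            }
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot map row to " + type.getName(), e);
        }
    }
}

// Hypothetical domain class standing in for the book class of the example.
final class Book {
    private String title;
    private String author;

    String getTitle() { return title; }
    String getAuthor() { return author; }
}
```

The caller only says which class it wants, and the mapper works out the field-by-field copying on its own. That is the part you would otherwise write by hand for every table.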

Do you know about POJO, Java Beans and JPA?

POJO

POJO stands for “Plain Old Java Object”. It’s a pure data structure that has fields with getters and possibly setters, and may override some methods from Object (e.g. equals) or some other interface like Serializable, but does not have a behavior of its own. It’s the Java equivalent of a C struct.
Ex-
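class Point {
    private double x;
    private double y;
    public double getX() { return x; }
    public double getY() { return y; }
    public void setX(double v) { x = v; }
    public void setY(double v) { y = v; }
    // Two points are equal when both coordinates match.
    @Override
    public boolean equals(Object other) {
        if (!(other instanceof Point)) return false;
        Point p = (Point) other;
        return Double.compare(x, p.x) == 0 && Double.compare(y, p.y) == 0;
    }
    @Override
    public int hashCode() { return java.util.Objects.hash(x, y); }
}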

As soon as you start adding methods that operate on points, like vector addition or complex multiplication, you no longer have a POJO. POJOs can have all of their methods defined automatically based on their field names and types. IDEs can do this for you.


Java Beans

A Java Bean is a Java class that follows certain conventions: it should have a no-arg constructor, should be serializable, and should provide methods to set and get the values of its properties, known as setter and getter methods. We use Java Beans because, according to the Java white paper, a bean is a reusable software component. A bean encapsulates many objects into one object so that we can access this object from multiple places. Moreover, it provides easy maintenance.
Ex -

//Employee.java

package mypack;

public class Employee implements java.io.Serializable {
    private int id;
    private String name;

    public Employee() {}

    public void setId(int id) { this.id = id; }
    public int getId() { return id; }
    public void setName(String name) { this.name = name; }
    public String getName() { return name; }
}
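
Because a bean is serializable, it can be turned into bytes and read back later, which is one simple route to persistence. The sketch below repeats the Employee bean (without the package line, so that it compiles on its own) and round-trips it through an in-memory byte array. The BeanRoundTrip helper is made up for illustration, and a file stream could be used in place of the byte array.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Same shape as the Employee bean above, repeated so the sketch compiles alone.
class Employee implements Serializable {
    private static final long serialVersionUID = 1L;
    private int id;
    private String name;

    public Employee() {}

    public void setId(int id) { this.id = id; }
    public int getId() { return id; }
    public void setName(String name) { this.name = name; }
    public String getName() { return name; }
}

// Hypothetical helper that serializes a bean to bytes and reads it back.
final class BeanRoundTrip {
    static Employee copyViaSerialization(Employee original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                return (Employee) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Serialization round trip failed", e);
        }
    }
}
```

The object that comes back is a new instance with the same property values, which is why the no-arg constructor and the getter and setter conventions matter to tools that work with beans.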

JPA

The Java Persistence API (JPA) is a collection of classes and methods, provided by Oracle Corporation, for persistently storing vast amounts of data in a database. To reduce the burden of writing code for relational object management, a programmer uses a ‘JPA provider’ framework, which allows easy interaction with a database instance. JPA defines the interface, and the provider supplies the implementation.
JPA is an open specification rather than a single product, so various enterprise vendors such as Oracle, Red Hat, and Eclipse provide products that add JPA persistence to them. Some of these products include Hibernate, EclipseLink, TopLink, and Spring Data JPA.


ORM Tools

As we discussed before, object-relational mapping (ORM, O/RM, or O/R mapping) is a programming technique for converting data between incompatible type systems in object-oriented programming languages. This creates, in effect, a "virtual object database" that can be used from within the programming language. There are both free and commercial packages available that perform object-relational mapping, although some programmers opt to create their own ORM tools. Some popular ORM tools are described below.

Hibernate -

Hibernate is an object-relational mapping (ORM) library for the Java language, providing a framework for mapping an object-oriented domain model to a traditional relational database. Hibernate solves object-relational impedance mismatch problems by replacing direct persistence-related database accesses with high-level object handling functions.

Features -
  • Transparent persistence without byte code processing
  • Object-oriented query language
  • Object / Relational mappings
  • Automatic primary key generation
  • Object/Relational mapping definition
  • HDLCA (Hibernate Dual-Layer Cache Architecture)
  • High performance
  • J2EE integration
  • JMX support, Integration with J2EE architecture

iBATIS / MyBatis

iBATIS is a persistence framework which automates the mapping between SQL databases and objects in Java, .NET, and Ruby on Rails. In Java, the objects are POJOs (Plain Old Java Objects). The mappings are decoupled from the application logic by packaging the SQL statements in XML configuration files. The result is a significant reduction in the amount of code that a developer needs to access a relational database using lower level APIs like JDBC and ODBC.

Features -
  • Support for Unit of work/object level transactions
  • In memory object filtering
  • Providing an ODMG compliant API and/or OCL and/or OPath
  • Supports multi servers (clustering) and simultaneous access by other applications without loss of transaction integrity
  • Query Caching - Built-in support
  • Supports disconnected operations
  • Support for Remoting. Distributed Objects.

TopLink

TopLink is an object-relational mapping (ORM) package for Java developers. It provides a framework for storing Java objects in a relational database or for converting Java objects to XML documents. TopLink Essentials is the reference implementation of the EJB 3.0 Java Persistence API (JPA) and the open-source community edition of Oracle's TopLink product. TopLink Essentials is a limited version of the proprietary product. For example, TopLink Essentials doesn't provide cache synchronization between clustered applications, some cache invalidation policies, or the query cache.

Features -
  • Query framework that supports an object-oriented expression framework, Query by Example (QBE), EJB QL, SQL, and stored procedures
  • Object-level transaction framework
  • Caching to ensure object identity
  • Set of direct and relational mappings
  • EIS/JCA support for non-relational data sources
  • Visual mapping editor (Mapping Workbench)
  • Database and JEE Architecture independent

What is NoSQL?

NoSQL is an approach to database design that can accommodate a wide variety of data models, including key-value, document, columnar and graph formats. NoSQL, which stands for "not only SQL," is an alternative to traditional relational databases in which data is placed in tables and data schema is carefully designed before the database is built. NoSQL databases are especially useful for working with large sets of distributed data.

Advantages of NoSQL -

  • Large volumes of structured, semi-structured, and unstructured data
  • Agile sprints, quick iteration, and frequent code pushes
  • Object-oriented programming that is easy to use and flexible
  • Efficient, scale-out architecture instead of expensive, monolithic architecture

There are 4 basic types of NoSQL databases -
  1. Key-Value Store - It has a big hash table of keys and values. {Example - Riak, Amazon S3 (Dynamo)}
  2. Document-based Store - It stores documents made up of tagged elements. {Example - CouchDB}
  3. Column-based Store - Each storage block contains data from only one column. {Example - HBase, Cassandra}
  4. Graph-based - A network database that uses edges and nodes to represent and store data. {Example - Neo4j}
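
The column-based idea is probably the least familiar of the four, so here is a tiny Java sketch that contrasts a row-oriented layout with a column-oriented one. All names and values are invented, and real column stores such as HBase or Cassandra are organized quite differently in detail.

```java
// Hypothetical sketch contrasting row-oriented and column-oriented layouts.
final class ColumnLayoutSketch {
    // Row-oriented: each element holds every field of one record.
    static final Object[][] ROWS = {
        {"alice", 30, "Colombo"},
        {"bob", 25, "Kandy"},
        {"chitra", 41, "Galle"},
    };

    // Column-oriented: each array holds one field for all records.
    static final String[] NAMES = {"alice", "bob", "chitra"};
    static final int[] AGES = {30, 25, 41};
    static final String[] CITIES = {"Colombo", "Kandy", "Galle"};

    // Aggregating one column only touches that column's storage.
    static int totalAge() {
        int sum = 0;
        for (int age : AGES) {
            sum += age;
        }
        return sum;
    }
}
```

Summing the ages reads only the AGES array and never touches names or cities, which is why column-based stores suit analytical queries over a few columns of very large tables.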

What is Hadoop?

Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running in clustered systems. It is at the center of a growing ecosystem of big data technologies that are primarily used to support advanced analytics initiatives, including predictive analytics, data mining, and machine learning applications. Hadoop can handle various forms of structured and unstructured data, giving users more flexibility for collecting, processing and analyzing data than relational databases and data warehouses provide.


That's all about Data Persistence for this article. I hope you gained some understanding of data persistence and picked up some valuable knowledge about this area. I hope to see you guys in another interesting topic of Programming Applications & Frameworks. Thank you for reading and good luck.


