This post started off as a research review, but halfway through proofing it I decided to turn it into a more opinionated piece about the current state of big data and the range of technologies available for solving big data problems.
Most of my writing assumes little to no prior knowledge of the topic I'm covering; this is partly in the hope of being useful to more readers, but more so to help my future self if I decide to come back and read this later.
In this post I'll walk you through the challenges of solving big data problems and highlight how they differ from classic data problems. A good portion of the post will also cover common technologies used to handle big data.
What is big data?
Technically, big data refers to high-volume, high-velocity and high-variety information that demands innovative forms of processing to extract meaningful insights, patterns and trends. Or, more simply: extremely large sets of data that require special programs and technology to make sense of.
Recently, companies have started to place more value on data. With technology becoming more ingrained in people's everyday lives, companies now have access to massive streams of data coming from things like fitness wearables, mobile phones and social networks.
Big data can also be characterized more completely by the 5Vs: volume, variety, velocity, veracity and value. Understanding these 5Vs is key to solving big data problems.
Database technologies for big data
Big data requires large, scalable infrastructure to store and process the massive loads of data associated with it. The general trend is that the costs of this infrastructure, which includes network bandwidth, storage devices, data warehouses and processors, have steadily declined over the years, making it cheaper to collect and store large amounts of data.
Cheaper infrastructure addresses two of the five Vs: volume and velocity. Next we need to focus on handling the variety and veracity of big data, and on extracting real value from it.
Traditional relational database management systems (RDBMS) were not designed to handle big data. Newer technologies that can efficiently distribute big data sets across multiple servers are a necessity. Below are a few popular database systems and processing frameworks used for this purpose:
Hadoop is a distributed data infrastructure that consists of the Hadoop Distributed File System (HDFS) for data storage and a Map/Reduce processing component. It's an extremely scalable and fault-tolerant platform for storing and processing large amounts of structured, semi-structured and unstructured data. Hadoop is primarily suited to batch processing large sets of data.
Map/Reduce: This is a programming framework for automatic parallelization. The map step breaks an input apart and processes the pieces independently (in parallel), and the reduce step combines the intermediate results into one output. (Simply put, it lets you break apart massive sets of data, analyze the pieces individually and combine everything into one output at the end.)
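To make that concrete, here's a minimal pure-Python sketch of the map/reduce idea as a toy word count. This is not Hadoop's actual API; the names `map_phase` and `reduce_phase` are my own, and a real cluster would run the map calls on different machines.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: each document is processed independently, emitting (word, 1) pairs.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/reduce: group the pairs by key and sum the counts per word.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data is big", "data needs processing"]
print(reduce_phase(map_phase(docs)))
# {'big': 2, 'data': 2, 'is': 1, 'needs': 1, 'processing': 1}
```

The key property is that the map step has no shared state, which is what makes it trivially parallelizable across machines.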
Apache Spark is an in-memory-centric big data framework that can process large sets of data in near real time at impressive speeds; its developers claim it runs up to 100x faster than Hadoop in memory and 10x faster on disk. Spark is easier to program than Hadoop, but it's important to understand that it's a data processing tool and doesn't do distributed storage like Hadoop, so to use Spark you have to combine it with HDFS or another storage system.
NoSQL and NewSQL databases
NoSQL and NewSQL databases come in the form of key-value stores, column stores, document stores and graph databases. NoSQL refers to open-source, distributed, non-relational databases that typically favor eventual consistency, a trade-off explained by the CAP (consistency, availability, partition tolerance) theorem. This allows them to be far more scalable than SQL databases. There's also a newer class of databases called NewSQL that combines the ACID (atomicity, consistency, isolation, durability) properties of SQL with the fault tolerance and scalability of NoSQL.
Below is a quick summary of the different types of NoSQL databases, which offer different levels of scalability, flexibility, complexity and functionality.
Key-value stores are a schema-less way of storing data as (key, value) pairs. They were designed to provide very fast access to massive loads of data, although they are only viable when all data can be accessed by a key.
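As an illustration, here's a toy in-memory key-value store in Python. The `KeyValueStore` class and its `put`/`get` methods are hypothetical, not any real product's API, but they show the core constraint: every read goes through a key.

```python
# A toy in-memory key-value store (hypothetical API, for illustration only).
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        # Every lookup goes through the key; there is no querying by value.
        return self._data.get(key, default)

store = KeyValueStore()
store.put("user:42:name", "Ada")
print(store.get("user:42:name"))  # Ada
```

Note the `user:42:name` convention: since lookups are key-only, applications often encode structure into the keys themselves.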
Column store databases like Cassandra store data tables as sections of columns. They extend key-value stores by allowing columns to have complex structures. Columns can be grouped together into column families to combine related data, which allows data to be retrieved by addressing a column family name and a row key.
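A rough sketch of that addressing model, using plain Python dicts to stand in for column families (the layout and the `read` helper are my own illustration, not Cassandra's API):

```python
# Toy column-family layout: data is addressed by (column family, row key).
table = {
    "profile": {                      # column family grouping related columns
        "user:42": {"name": "Ada", "city": "London"},
    },
    "activity": {                     # a separate family for other columns
        "user:42": {"last_login": "2017-03-01"},
    },
}

def read(family, row_key):
    # Retrieve one row's columns within a single column family.
    return table[family][row_key]

print(read("profile", "user:42"))  # {'name': 'Ada', 'city': 'London'}
```

Grouping related columns into a family means a query touching only profile data never has to read the activity columns at all.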
Document stores are essentially key-value stores where the value is a document. The documents can have complex structures of any kind and therefore provide great flexibility for storing data. MongoDB and CouchDB are popular database systems that use document stores.
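The practical difference from a plain key-value store is that the database can look inside the documents. Here's a toy sketch in plain Python that filters on a nested field (the `find_by_city` helper is my own, not MongoDB's query syntax):

```python
# Toy document store: values are documents with arbitrary nested structure.
documents = {
    "doc1": {"name": "Ada", "address": {"city": "London"}, "tags": ["math"]},
    "doc2": {"name": "Grace", "address": {"city": "Arlington"}},
}

def find_by_city(city):
    # Unlike a plain key-value store, we can filter on fields *inside* the
    # stored documents, not just look them up by key.
    return [doc_id for doc_id, doc in documents.items()
            if doc.get("address", {}).get("city") == city]

print(find_by_city("London"))  # ['doc1']
```

Notice that the two documents don't even share the same fields, which is the schema flexibility the paragraph above describes.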
Graph databases are a type of non-relational database that provides a unique way of representing relationships between data, based on graph theory. Graph databases make it possible to write queries as traversals, and therefore to ask and answer more complex questions about the data. Key disadvantages of graph databases are that they are tough to partition across servers and that they currently lack a standard query language.
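A traversal-style query can be sketched with a plain adjacency list in Python (a toy breadth-first search, not any real graph database's query language; the "follows" relationship is an invented example):

```python
from collections import deque

# Toy adjacency-list graph: edges represent "follows" relationships.
follows = {
    "alice": ["bob"],
    "bob": ["carol"],
    "carol": [],
}

def reachable(start):
    # A query expressed as a traversal: everyone reachable from `start`,
    # found by breadth-first search over the edges.
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in follows.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

print(sorted(reachable("alice")))  # ['bob', 'carol']
```

Answering the same "friends of friends of friends" style question in a relational database would require a chain of joins, which is exactly where traversal queries shine.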
Big data is an exciting field to study and operate in, as it holds the key to a future where data is used in new ways to improve the livelihoods of billions of people by identifying and solving problems we couldn't otherwise understand.
Learning more about the tools available for problem solving with big data is one of the first steps into the space. Hopefully I'll follow this post up with a more in-depth analysis of each of these tools and how to apply them.