Monday, 24 December 2012

Buzzword Bazinga: What is Big Data? featuring MapReduce and NoSQL


The term 'Big Data' describes the vast amount of data, or chatter, generated from two main sources: people and computers. People spend a lot of time talking on social networks like Twitter and Facebook and producing online content about their companies, affiliated organisations and themselves. Computers and devices, now IP-enabled and connected to the internet, have the ability to send their instrumentation information out onto the internet. The latter phenomenon is known as 'the internet of things': a description of how connected devices and sensors, from shoes to thermostats to cloud server farms, now have a voice and can make information about how they are used and what they experience accessible to data-crunching algorithms. These algorithms can then produce insights that are fed back into the real world, affecting everything from people's behaviour to how corporations are run.

Handling all this data is a massive computational challenge. Compounding the challenge, the data is growing exponentially, may be stored all over the place and may be kept in different formats. Two key technologies have evolved to help us address these challenges: MapReduce and NoSQL.

But before we delve into these let's break down the attributes of Big Data into a bit more detail.



  • A Lot of Data - By 'a lot of data', we are talking Terabytes and Petabytes, not Gigabytes. Traditional databases have a scale-up ceiling where the hardware, infrastructure and skills required to maintain petabytes of data in an RDBMS become untenable. 
  • Growing at speed - Exacerbating the problem of large volumes of data is the speed at which it grows. With a scale-up model the technology hits a ceiling fast, which is why being able to 'infinitely' scale out is what supports big data velocities.
  • All over the place - Big Data is distributed. Whether it lives in database shards, in file shares in countries all over the world or behind data streams exposed via APIs, the Big Data challenge is to surface all this data and derive a common set of insights across the distributed data set.
  • In different forms - Another challenge with Big Data is that the data doesn't fit into a nice schema with defined relationships, like you get in an RDBMS. JSON, XML, CSV and various other formats, as well as differing entity models and relationships, require the ability to interrogate data in different ways but draw the same conclusions for analysis purposes (see the sketch after this list).
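As a rough illustration of the 'different forms' problem, here is a minimal Python sketch that folds JSON and CSV records into one common shape so the same analysis can run over both. The feeds and field names (user, country, spend) are made up for the example.

```python
import csv
import io
import json

# Two hypothetical feeds describing the same kind of event in different formats.
json_feed = '[{"user": "alice", "country": "UK", "spend": 12.5}]'
csv_feed = "user,country,spend\nbob,NZ,7.0\n"

def from_json(raw):
    # JSON already maps cleanly onto dictionaries.
    return [
        {"user": r["user"], "country": r["country"], "spend": float(r["spend"])}
        for r in json.loads(raw)
    ]

def from_csv(raw):
    # CSV needs explicit type coercion to reach the same shape.
    return [
        {"user": r["user"], "country": r["country"], "spend": float(r["spend"])}
        for r in csv.DictReader(io.StringIO(raw))
    ]

records = from_json(json_feed) + from_csv(csv_feed)
print(records)
print("Total spend:", sum(r["spend"] for r in records))
```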
Technologies have evolved to cope with the challenges above because existing data warehousing technologies like relational databases simply weren't cut out for the job. Enter MapReduce.

MapReduce
[Image: MapReduce overview diagram, courtesy of Wikipedia]
Crawling the web and creating a search index as fast as possible is probably the most apt use case for dealing with Big Data. MapReduce was Google's answer to this challenge. MapReduce refers to the technology stack Google created to process such volumes of distributed, unstructured data in parallel.

The MapReduce name comes from the two main steps in harvesting the data: a Map Step and a Reduce Step.

In the Map Step, the data points for the distributed dataset are defined. When the Map Step is executed, the distributed dataset is crawled and the output is a flattened, table-like view of the data. This view can then be reduced in the Reduce Step. Having a T-SQL background, I always think of the Map Step as the SELECT and JOIN parts of the statement, e.g. SELECT URL, META-TAGS, HEADINGS, HYPERLINKS FROM THE INTERNET.

In the Reduce Step, the data surfaced from the Map Step is processed and aggregated. Again, using the T-SQL analogy, I think of the Reduce Step as the GROUP BY clause: GROUP BY META-TAGS, HEADINGS, HYPERLINKS.
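To make the two steps concrete, here is a minimal Python sketch of the idea (nothing to do with Google's actual implementation): the map function emits (key, value) pairs from some made-up 'pages', and the reduce function aggregates the values for each key, much like a GROUP BY.

```python
from collections import defaultdict

# Hypothetical input: a few crawled pages and the hyperlinks found on them.
pages = {
    "pageA.html": ["pageB.html", "pageC.html"],
    "pageB.html": ["pageC.html"],
    "pageC.html": ["pageA.html", "pageB.html"],
}

def map_step(page, links):
    # 'SELECT' the interesting data points: emit one (link, 1) pair per hyperlink.
    return [(link, 1) for link in links]

def reduce_step(link, counts):
    # 'GROUP BY' the key: aggregate all values emitted for the same link.
    return link, sum(counts)

# Shuffle: group the mapped pairs by key before reducing.
grouped = defaultdict(list)
for page, links in pages.items():
    for key, value in map_step(page, links):
        grouped[key].append(value)

inbound_counts = dict(reduce_step(k, v) for k, v in grouped.items())
print(inbound_counts)  # {'pageB.html': 2, 'pageC.html': 2, 'pageA.html': 1}
```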

The real brains behind MapReduce is in the scheduling and distribution of the workload in parallel. Google's MapReduce platform takes big data in (e.g. the web), processes and analyses it using map and reduce functions, and spits out processed data (e.g. a search index). This is done in a distributed way, with a Master node sending map tasks to clusters of Worker nodes to process. The Worker nodes are ideally located near the data to reduce network latency and costs. Each Worker node processes its Map task internally and then returns the result to the Master node to perform the reduce functions. Once reduced, the final data set is ready to be output into a search index, with PageRank intelligence, which can be queried against.
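Here is a toy sketch of that master/worker split, using Python's multiprocessing pool as a stand-in for a cluster of Worker nodes; a real MapReduce framework adds data locality, fault tolerance and distributed reduces on top of this.

```python
from collections import defaultdict
from multiprocessing import Pool

# The 'master' splits the input into chunks and farms the map work out to workers.
documents = [
    "big data is big",
    "map reduce maps then reduces",
    "data data everywhere",
]

def map_task(document):
    # Runs on a worker: emit (word, 1) pairs for one chunk of the input.
    return [(word, 1) for word in document.split()]

if __name__ == "__main__":
    with Pool(processes=3) as workers:          # pretend each process is a Worker node
        mapped = workers.map(map_task, documents)

    # Back on the 'master': shuffle the pairs by key, then reduce each group.
    grouped = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            grouped[word].append(count)

    word_counts = {word: sum(counts) for word, counts in grouped.items()}
    print(word_counts)  # e.g. {'big': 2, 'data': 3, ...}
```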

There you have it, possibly the most over-simplified view of Google's MapReduce you will ever read.

What is easy to explain, however, is that MapReduce gave birth to a number of technologies which allow 'not only Google' to process vast amounts of data efficiently. Coupled with Cloud Computing, which provides pay-as-you-go scale, Big Data services like Amazon's Elastic MapReduce and the up-and-coming Microsoft Azure HDInsight (both built on Hadoop) are in an early boom stage, and their practical uses are being realised by businesses and organisations around the world.

Given that Apache Hadoop is probably the biggest player in the market, it is worthwhile taking a quick look at its ecosystem.
[Image: Hadoop ecosystem, from www.windowsazure.com]

  • HDFS (Hadoop Distributed File System) - A fast, durable, distributed file storage system used to store the data that map and reduce jobs read and write
  • MapReduce - The job scheduling / execution system
  • Pig - A high level platform that supports querying big data using data flow queries in 'Pig Latin'
  • Hive - A high level platform that supports querying big data using a SQL flavour called HiveQL. Hive translates HiveQL queries into MapReduce jobs
  • Sqoop - A framework used to transfer bulk data between Hadoop and relational databases. 
There are other interesting frameworks out there that are worth mentioning:
  • Apache Mahout - A machine learning library that can be used to build things like recommendation engines. 
  • Pegasus - A data mining system. 

NoSQL

A familiar story with traditional RDBMS is as follows: 

A client has a killer idea for a product and comes to you to build it. You spend ages coming up with an elegant data schema of about 150 tables which supports every possible business requirement. The client spent all the money on database design and licences, so they can't afford big servers. The product launches and is a big hit. Over time the site's performance starts to degrade. You add more hardware: drives, memory, processors, to buy some time. After a while things start to slow down again. You get your SQL guru mate to fine-tune indexes and work on stored procedure optimisation. They also create some data de-normalisation routines to process data into quick-access tables. These work well, but then the client wants the schema changed to adapt to business-critical stuff, and shouts at you when you explain how long that will take. You try your best but fail, and end up with more performance problems. The client shouts at you more and you feel fragile because you haven't slept much in the last three weeks. This isn't what you wanted. You continue to bandage the database and begin to hate your chosen career path. You look with deep envy at manual labourers, whose jobs begin and end. You hate RDBMS. You wish you had another option.

And this is why NoSQL came about. Not this exact story, but ones with a similar narrative arc.

I've written about NoSQL in the past, specifically about RavenDB. NoSQL, which stands for 'Not Only SQL', grew from the challenges traditional RDBMSs faced when dealing with Big Data - namely performance, cost and maintenance. The two key requirements for NoSQL systems are to be able to scale out horizontally by adding nodes and to be able to add and retrieve data quickly. To meet these requirements, NoSQL has had to make some sacrifices and presents some new challenges, chiefly around Data Consistency, Architecture and Data Relationships.

Data Consistency

Getting Big Data out of your NoSQL database quickly is important. Because data is distributed in NoSQL systems, a few sacrifices have to be made to ensure that data can be added and read quickly. The main sacrifice is around data consistency, i.e. when I add data to a NoSQL database, when will the entire database reflect this update?

Because of the distributed nature of NoSQL, if you add data to a NoSQL database it may have to be replicated from one node to another across a network boundary. Since networks have latency and can fail, there is no way to guarantee that what is written to a NoSQL database can be read immediately by other clients requesting the data. Getting data immediately is called 'Immediate Consistency' and is supported by RDBMS systems, as data is stored in one place and table locks and transactions provide Immediate Consistency for all database clients. NoSQL solutions support other types of consistency models, such as:
  • Eventual Consistency - Defined/Coined by Amazon as 'the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value.'
  • Monotonic Consistency - A stricter form of Eventual Consistency, namely, the writes are guaranteed to be in the correct order.
  • Read your own Writes Consistency - Eventual Consistency, but with the ability for clients to read back what they've written immediately. This is essential for web programming where, for example, you add a comment to a post and the page refreshes with your comment included. Others may not see it immediately, but that is a non-essential requirement. This is also known as 'Personal Consistency' if the client refers to a user.
  • Strong Consistency - a system which supports read/write atomic operations on single data entities. MongoDB supports Strong Consistency through its Master / Slave architecture: all reads and writes are done against the Master, ensuring Strong Consistency. Slaves can be read from if SlaveOk is specified, thereby changing the consistency model to Eventual Consistency (see the toy sketch after this list). 
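To get a feel for the trade-off, here is a toy in-memory sketch (not any real database) of a primary node with a lagging replica: reading your own write from the primary works immediately, while reading from the stale replica shows Eventual Consistency in action.

```python
import time

class ToyReplicaSet:
    """A pretend primary/replica pair where replication lags by a fixed delay."""

    def __init__(self, replication_lag=0.5):
        self.primary = {}
        self.replica = {}
        self.pending = []            # (apply_at_time, key, value) not yet replicated
        self.replication_lag = replication_lag

    def write(self, key, value):
        # Writes always go to the primary; replication happens 'later'.
        self.primary[key] = value
        self.pending.append((time.time() + self.replication_lag, key, value))

    def _replicate(self):
        now = time.time()
        still_pending = []
        for apply_at, key, value in self.pending:
            if apply_at <= now:
                self.replica[key] = value
            else:
                still_pending.append((apply_at, key, value))
        self.pending = still_pending

    def read(self, key, from_replica=False):
        self._replicate()
        store = self.replica if from_replica else self.primary
        return store.get(key, "<not found>")

db = ToyReplicaSet(replication_lag=0.5)
db.write("comment:1", "Great post!")

print(db.read("comment:1"))                     # read your own write: 'Great post!'
print(db.read("comment:1", from_replica=True))  # stale replica: '<not found>'
time.sleep(0.6)
print(db.read("comment:1", from_replica=True))  # eventually consistent: 'Great post!'
```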
Architecture 

NoSQL systems are cost effective because servers (or VMs) and storage are cheap. It is much cheaper to have one hundred small servers sitting side by side working together than one monster server trying to do everything. This is called scaling out, as opposed to scaling up.

Scaling out requires changes to architectural thinking, namely sharding. Defining the right sharding strategy for your solution is a key challenge in building a good scalability model.
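As an illustration, here is a minimal Python sketch of one common sharding strategy: hash a shard key to pick a node. Real systems layer rebalancing, replication and range-based schemes on top of this.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(shard_key: str) -> str:
    # Hash the shard key (e.g. a user id) and map it onto one of the shards.
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every operation for the same key is routed to the same shard.
for user_id in ["user-17", "user-42", "user-99"]:
    print(user_id, "->", shard_for(user_id))
```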

Relational Data

Because NoSQL is schema-less, you have a lot more flexibility in your data model and entity designs. This does come at a cost when working with data that has relationships. With an RDBMS you have the luxury of joins across tables with built-in relationships. With NoSQL this isn't as easy. You can build relationships between data, but you have to go about it in a different way. 

Some techniques include (a rough sketch follows the list):
  • Using materialised views of data, e.g. RavenDB allows you to create materialised views (a Lucene query over multiple data entities that can hydrate denormalised objects)
  • Adding denormalised data into your data entities - probably the simplest approach
  • Pre-loading - loading related entities up front based on reference Ids stored in the entity model
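Here is a tiny Python sketch of the last two techniques, using plain dictionaries to stand in for documents (the entity shapes are made up for the example): the first post embeds denormalised author data, while the second stores only a reference Id and the related entity is pre-loaded with a second lookup.

```python
# A stand-in 'authors' collection, keyed by Id.
authors = {"authors/1": {"id": "authors/1", "name": "Ada", "bio": "Writes about data"}}

# Denormalisation: copy the bits of the author you need straight into the post.
post_denormalised = {
    "id": "posts/1",
    "title": "Buzzword Bazinga",
    "author": {"id": "authors/1", "name": "Ada"},  # copied, so no join is needed to render
}

# Pre-loading: store only a reference Id and fetch the full entity when needed.
post_with_reference = {
    "id": "posts/2",
    "title": "More buzzwords",
    "authorId": "authors/1",
}

def load_post_with_author(post, author_store):
    # Pre-load: fetch the referenced entity and attach it to the post.
    loaded = dict(post)
    loaded["author"] = author_store[post["authorId"]]
    return loaded

print(post_denormalised["author"]["name"])                                     # Ada, no lookup
print(load_post_with_author(post_with_reference, authors)["author"]["name"])   # Ada, one lookup
```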
In summary: 

This is a high-level view of the Big Data challenge and the technologies which have emerged from it. These technologies are evolving fast, and business sectors are starting to cotton on to the benefits of being able to store, harvest and analyse their data, and to capitalise on the insights both internally and for their customers and client base.  

From a technology solution provider's perspective, Cloud Computing offerings from Amazon and Microsoft, as well as open source products like Hadoop, have enabled us to build big data systems without having to be Google. 

2013 will be an exciting year for big data and the cloud, and I hope I get an opportunity to do something in this space. 

Twitter Stream anyone? 
