Monday, 24 December 2012

Buzzword Bazinga: What is Big Data? featuring MapReduce and NoSQL

The term 'Big Data' describes the vast amount of data, or chatter, generated from two main sources: people and computers. People spend a lot of time talking on social networks like Twitter and Facebook and producing online content about their companies, affiliated organisations and themselves. Computers and devices, now IP-enabled and connected to the internet, can send their instrumentation data onto the internet. The latter phenomenon is known as 'the internet of things': a description of how connected devices and sensors, from shoes to thermostats to cloud server farms, now have a voice and can make information about how they are used and what they experience accessible to data-crunching algorithms. These algorithms can then produce insights that are fed back into the real world, affecting everything from people's behaviour to how corporations are run.

Handling all this data is a massive computational challenge. Compounding the challenge, the data is growing exponentially, may be stored all over the place and may even be kept in different formats. Two key technologies have evolved to help us address these challenges: MapReduce and NoSQL.

But before we delve into these let's break down the attributes of Big Data into a bit more detail.

  • A Lot of Data - By 'a lot of data', we are talking terabytes and petabytes, not gigabytes. Traditional databases have a scale-up ceiling: the hardware, infrastructure and skills required to maintain petabytes of data in an RDBMS become untenable.
  • Growing at speed - Exacerbating the problem of large data volumes is the speed at which they grow. Again, with the scale-up model the technology hits a ceiling fast, which is why being able to 'infinitely' scale out supports Big Data velocities.
  • All over the place - Big Data is distributed. Whether it sits in database shards, in file shares in countries all over the world or in data streams exposed via APIs, the Big Data challenge is to surface all this data and derive a common set of insights across the distributed data set.
  • In different forms - Another Big Data challenge is that the data doesn't fit into a neat schema with defined relationships, like you get in an RDBMS. JSON, XML, CSV and various other formats, as well as differing entity models and relationships, require the ability to interrogate data in different ways but draw the same conclusions for analysis purposes.
Technologies have evolved to cope with the challenges above because existing data warehousing technologies like relational databases simply weren't cut out for the job. Enter MapReduce.

Crawling the web and creating a search index as fast as possible is probably the most apt use case for dealing with Big Data. MapReduce was Google's answer to this challenge. MapReduce refers to the technology stack Google created to process such volumes of distributed, unstructured data in parallel.

The MapReduce name comes from the two main steps in harvesting the data: a Map Step and a Reduce Step.

In the Map Step, the data points for the distributed dataset are defined. When the Map Step is executed, the distributed dataset is crawled and the output is a flattened, table-like view of the data. This view can then be aggregated in the Reduce Step. Having a T-SQL background, I always think of the Map Step as the SELECT and JOIN parts of a statement, e.g. SELECT URL, META-TAGS, HEADINGS, HYPERLINKS FROM THE INTERNET.

In the Reduce Step, the data surfaced by the Map Step is processed and aggregated. Again, using the T-SQL analogy, I think of the Reduce Step as the GROUP BY clause: GROUP BY META-TAGS, HEADINGS, HYPERLINKS.
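To make the two steps concrete, here is a minimal word-count sketch in Python. The input pages and the choice of word counting are my own illustrative assumptions, not Google's implementation:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical "crawled pages": URL -> page text. In the T-SQL analogy the
# map step is the SELECT (emit key/value pairs), the reduce step the GROUP BY.
pages = {
    "http://a.example": "big data big insights",
    "http://b.example": "big data needs mapreduce",
}

def map_step(url, text):
    """Emit a (word, 1) pair for every word on the page."""
    return [(word, 1) for word in text.split()]

def reduce_step(word, counts):
    """Aggregate all the counts emitted for one word."""
    return (word, sum(counts))

# Map: flatten every page into key/value pairs (the 'table-like view').
mapped = [pair for url, text in pages.items() for pair in map_step(url, text)]

# Shuffle/sort: group pairs by key so each reduce call sees one word's values.
mapped.sort(key=itemgetter(0))
reduced = [reduce_step(word, [c for _, c in group])
           for word, group in groupby(mapped, key=itemgetter(0))]

print(dict(reduced))  # {'big': 3, 'data': 2, 'insights': 1, 'mapreduce': 1, 'needs': 1}
```

The shuffle/sort in the middle is the part an RDBMS would do for you; in MapReduce it is the framework's job, which is what makes the map and reduce functions themselves so simple.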

The real brains behind MapReduce lie in the scheduling and distribution of the workload in parallel. Google's MapReduce platform takes big data in (e.g. the web), processes and analyses it using map and reduce functions, and spits out processed data (e.g. a search index). This is done in a distributed way: a Master node sends map tasks to clusters of Worker nodes to process. The Worker nodes are ideally located near the data to reduce network latency and costs. Each Worker node processes its map task internally and returns the result to the Master node, which performs the reduce functions. Once reduced, the final data set is ready to be output into a search index, with PageRank intelligence, which can be queried against.
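That master/worker flow can be sketched with threads standing in for worker nodes. This is a toy model of the scheduling, not Google's actual platform; in a real cluster each map task would be shipped to the node holding that chunk of data:

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

# Hypothetical input split into chunks, one per 'worker node'.
chunks = [
    ["big data", "big insights"],
    ["big data needs mapreduce"],
]

def map_task(chunk):
    """Worker: count words in its local chunk and return a partial result."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts

# 'Master': schedule one map task per chunk and collect the partial results.
with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
    partials = list(pool.map(map_task, chunks))

# Reduce: merge the partial counts into the final data set.
total = sum(partials, Counter())
print(total.most_common(2))  # [('big', 3), ('data', 2)]
```

The point of the sketch is that the workers never talk to each other; only partial results travel back to the master, which is what keeps the model scalable.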

There you have it, possibly the most over-simplified view of Google's MapReduce you will ever read.

What is easy to explain, however, is that MapReduce gave birth to a number of technologies which allow 'not only Google' to process vast amounts of data efficiently. Coupled with Cloud Computing, which provides pay-as-you-go scale, Big Data services like Amazon's Elastic MapReduce and the up-and-coming Microsoft Azure HDInsight (both built on Hadoop) are in an early boom stage, and their practical uses are being realised by businesses and organisations around the world.

Given that Apache Hadoop is probably the biggest player in the market, its ecosystem is worth a quick look.

  • HDFS (Hadoop Distributed File System) - Fast, Durable, Distributed File Storage System used to store map and reduced data
  • MapReduce - Job Scheduling / Execution system
  • Pig - A high level platform that supports querying big data using data flow queries in 'Pig Latin'
  • Hive - A high level platform that supports querying big data using a SQL flavour called HiveQL. Hive translates HiveQL queries into MapReduce jobs
  • Sqoop - A framework for bulk-transferring data between Hadoop and relational databases.
There are other interesting frameworks out there that are worth mentioning:
  • Apache Mahout - A machine learning library that can be used to build things like recommendation engines.
  • Pegasus - A data mining system. 


A familiar story with a traditional RDBMS goes as follows:

A client has a killer idea for a product and comes to you to build it. You spend ages coming up with an elegant data schema of about 150 tables which supports every possible business requirement. The client has spent all the money on database design and licences, so they can't afford big servers. The product launches and is a big hit. Over time the site's performance starts to degrade. You add more hardware: drives, memory, processors, to buy some time. After a while things slow down again. You get your SQL guru mate to fine-tune indexes and optimise stored procedures. They also create some data de-normalisation routines to process data into quick-access tables. These work well, but the client wants the schema changed to adapt to business-critical stuff. You warn them it will take a long time, and they shout at you. You try your best but fail and end up with more performance problems. The client shouts at you some more, and you feel fragile because you haven't slept much in the last three weeks. This isn't what you wanted. You continue to bandage the database and begin to hate your chosen career path. You look at manual labourers with deep envy at jobs which begin and end. You hate RDBMS. You wish you had another option.

And this is why NoSQL came about. Not this exact story, but ones with a similar narrative arc.

I've written about NoSQL in the past, specifically about RavenDB. NoSQL, which stands for 'Not Only SQL', grew from the challenges traditional RDBMSs faced when dealing with Big Data, namely performance, cost and maintenance. The two key requirements for NoSQL systems are to be able to scale out by adding nodes horizontally, and to be able to add and retrieve data quickly. To meet these requirements, NoSQL has had to make some sacrifices and presents some new challenges, the main ones being data consistency, architecture and data relationships.

Data Consistency

Getting Big Data out of your NoSQL database quickly is important. Because data in NoSQL systems is distributed, a few sacrifices have to be made to ensure that data can be added and read quickly. The main sacrifice is data consistency, i.e. when I add data to a NoSQL database, when will the entire database reflect the update?

Because of the distributed nature of NoSQL, data added to a NoSQL database may have to be replicated from one node to another across a network boundary. Since networks have latency and can fail, there is no way to guarantee that what is written to a NoSQL database can be read immediately by other clients requesting the data. Getting data immediately is called 'Immediate Consistency' and is supported by RDBMS systems: data is stored in one place, and table locks and transactions provide Immediate Consistency for all database clients. NoSQL solutions support other types of consistency model, such as:
  • Eventual Consistency - Defined/Coined by Amazon as 'the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value.'
  • Monotonic Consistency - A stricter form of Eventual Consistency, namely, the writes are guaranteed to be in the correct order.
  • Read your own Writes Consistency - Eventual Consistency, but with the ability for clients to retrieve what they've written immediately. This is essential for web programming where, for example, you add a comment to a post and the page then refreshes with your comment included. Others may not see it immediately, but that is a non-essential requirement. This is also known as 'Personal Consistency' when the client refers to a user.
  • Strong Consistency - A system which supports read/write atomic operations on single data entities. MongoDB supports Strong Consistency through its Master/Slave architecture: all reads and writes are done against the Master, ensuring Strong Consistency. Slaves can be read from if SlaveOk is specified, thereby changing the consistency model to Eventual Consistency.
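A toy simulation makes the trade-off visible. Here a master and a slave node are plain dictionaries and replication is an explicit queue; all the names and data are illustrative, not any particular database's API:

```python
class Node:
    def __init__(self):
        self.data = {}

# Hypothetical two-node store: writes go to the master, and replication to
# the slave is delayed, so a read from the slave can return stale data.
master, slave = Node(), Node()
pending = []  # replication queue

def write(key, value):
    master.data[key] = value      # immediately visible on the master
    pending.append((key, value))  # replicate to the slave later

def replicate():
    while pending:
        key, value = pending.pop(0)
        slave.data[key] = value

write("comment:1", "first!")

# Read-your-own-writes: the writer reads the master and sees its own write.
assert master.data["comment:1"] == "first!"

# Another client reading the slave before replication sees nothing (stale).
assert "comment:1" not in slave.data

# Eventual consistency: once replication catches up, all reads converge.
replicate()
assert slave.data["comment:1"] == "first!"
```

The gap between the second and third assertions is exactly the window the consistency models above describe: Immediate Consistency closes it, Eventual Consistency only promises it will close.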

Architecture

NoSQL systems are cost effective because servers (or VMs) and storage are cheap. It is much cheaper to have one hundred small servers sitting side by side working together than one monster server trying to do everything. This is called scaling out, as opposed to scaling up.

Scaling out requires changes to architectural thinking, namely sharding. Defining the right sharding strategy for your solution is key to a good scalability model.
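A common starting point is hash-based sharding: hash each entity's key and take it modulo the number of nodes, so every client routes a given key to the same shard. A minimal sketch, where the shard count and key format are my own assumptions:

```python
import hashlib

SHARDS = 4  # hypothetical number of database nodes

def shard_for(key: str) -> int:
    """Route a key to a shard by hashing it and taking it modulo the
    shard count. A stable hash (not Python's randomised hash()) keeps
    routing consistent across processes and restarts."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % SHARDS

# The routing is deterministic, so reads find the shard the writes went to.
assert shard_for("user:42") == shard_for("user:42")

# A spread of keys lands across the shards rather than piling onto one.
shards = {shard_for(f"user:{i}") for i in range(100)}
print(sorted(shards))
```

The weakness of the naive modulo approach is that changing SHARDS remaps almost every key, which is why production systems tend toward consistent hashing or range-based sharding instead.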

Relational Data

Because NoSQL is schema-less, you have a lot more flexibility in your data model and entity designs. This does come at a cost when working with data that has relationships. With an RDBMS you have the luxury of joins across tables with built-in relationships. With NoSQL this isn't as easy: you can build relationships between data, but you have to go about it in a different way.

Some techniques include:
  • Using materialised views of data, e.g. RavenDB allows you to create materialised views (a Lucene query over multiple data entities that can hydrate denormalised objects)
  • Adding denormalised data into your data entities - probably the simplest approach
  • Pre-loading - loading your entities up front based on reference ids stored in the entity model
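The second and third techniques can be sketched in a few lines; the documents and field names below are hypothetical, not tied to any particular database:

```python
# Hypothetical documents in a schema-less store. Instead of a JOIN, the
# author's name is denormalised into each post (technique two), and the
# full author object is pre-loaded from a stored reference id (technique
# three).
authors = {"a1": {"id": "a1", "name": "Ada", "bio": "Mathematician"}}
posts = [
    {"id": "p1", "title": "Big Data", "authorId": "a1", "authorName": "Ada"},
]

# Denormalised read: the common case needs no second lookup at all.
assert posts[0]["authorName"] == "Ada"

def preload(post, store):
    """Hydrate the referenced entity from its stored reference id."""
    return {**post, "author": store[post["authorId"]]}

# Pre-loaded read: one extra lookup gives the full related entity.
hydrated = preload(posts[0], authors)
assert hydrated["author"]["bio"] == "Mathematician"
```

The trade-off is the usual one: denormalised copies are fast to read but must be kept in sync when the author changes, while pre-loading keeps one copy of the truth at the cost of an extra round trip.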
In summary: 

This is a high-level view of the Big Data challenge and the technologies which have emerged from it. These technologies are evolving fast, and business sectors are starting to cotton on to the benefits of being able to store, harvest and analyse data, and to capitalise on the insights both internally and for their customers and client base.

From a technology solution provider's perspective, Cloud Computing offerings from Amazon and Microsoft, as well as open source products like Hadoop, have enabled us to build big data systems without having to be Google.

2013 will be an exciting year for big data and the cloud, and I hope I get an opportunity to do something in this space.

Twitter Stream anyone? 
