As DBAs, let us ask ourselves: what is « Big Data »? Why so many NoSQL databases? Why Java/MapReduce jobs? What is so « Big » that it does not fit into an Exadata database machine with 672 TB of high-capacity disk? What is so « Big » that it does not fit into 18 Exadata racks cabled together?
It could be that I’m getting old, but this is not the first time I have been told relational databases are failing! We all know « Big Data » sits at the top of Gartner’s 2014 « hype » curve 😉, ready to crash and burn! But being passionate about our work, we should not simply turn our backs on « Big Data ». We should consider how it addresses our DBA concerns. In fact, some of us have already embraced « Big Data », and others will.
This is exactly what this series of 13 articles is about: « understanding some of the bits and bytes of Hadoop, one of the leading Big Data technologies ». There is a lot to learn here, and before we drill down into the details, let us think about it for a while. What do we, as DBAs, have in common with Big Data?
We are Big DBAs…
Big Data is spreading everywhere, from the Internet to telcos, utilities, banking, the public sector and many more. It is a collection of infrastructures and algorithms invented to address some of the challenges we DBAs have been facing for years now. Most of us are already dealing with dozens, sometimes hundreds, of terabytes of data stored in relational databases. We are Big DBAs:
- Volume: database sizes are blowing up; relational databases now routinely reach dozens, sometimes hundreds, of terabytes.
- Variety: data are stored in a multitude of formats; relational data co-exist with documents, pictures, videos, XML, geographic data and, now, JSON, web ontologies and Linked Data. Those data are processed together.
- Velocity: collection and loading systems are getting faster and bigger. The data feeding our analytical and transactional databases are massive and generated by other systems.
- Value: most of that gigantic amount of data has low value. Still, among the noise, you’ll find extremely valuable signals: failures one can avoid, customers ready to buy, fraudsters about to strike, or people ready to fall in love…
The challenges are the same, but the angle differs… We should be aware of what Big Data brings to the table: what it is, what is new, what we can leverage for our own use, and when it is the better choice.
Big Data Breakthroughs
Big Data was born when Google published papers describing some of its key technologies, including GFS, MapReduce and BigTable. Since then, many projects have started and rebuilt platforms from the ground up, sometimes breaking assumptions that had been commonly accepted for decades. Those technological breakthroughs are, more than anything, what Big Data is all about; nothing else can link together distant relatives like Amazon DynamoDB and Apache Hadoop. Besides, nobody is foolish enough to suggest that losing data consistency, transactions or SQL can somehow be a goal in itself… Rather, sacrificing those properties can lead to solutions to some common problems we are facing:
- As DBAs, we’ve all faced situations where application performance hits a wall due to activity and data growth. That’s where partitioning comes up like a magic trick and a powerful tool. Think seriously about it and you’ll see why it works: it somehow « archives » inactive data, shrinking what remains in the access paths by a large factor. Obviously the data remain online, but that does not change the principle. By relaxing the transaction and data consistency rules, NoSQL databases spread data across different servers « automatically », so that by adding more servers you add more ability to manage the load, regardless of the volume of data; that’s something you can check for yourself with the first sketch after this list.
- As DBAs, we’ve all faced difficulties designing and building decision support systems and data warehouses. Changes can take days or even weeks when they impact the data models. By allowing the collection of loose data into what is described as a « data lake », Hadoop lets you add new collections and analytics in a few hours. Data are considered unstructured at load time and described right before they are used. The Hadoop/Hive « schema on read » approach speeds up some changes compared with relational « schema on write » data models, as the second sketch after this list illustrates…
- As DBAs, we have experienced tailored systems built on expensive licenses and engineered systems. Big Data’s cloud-based, commodity and open-source approach is a game changer: it lets you add resources easily, for a limited period of time, without the usual step in cost.
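To make the « automatic » spreading of data across servers concrete, here is a minimal Java sketch of key-hash sharding, the basic idea behind most NoSQL partitioning schemes. It is an illustration only: the ShardRouter class, the node names and the modulo routing are made up for this example and do not come from any particular product.

```java
import java.util.List;

// Minimal, hypothetical sketch of key-hash sharding: every key is routed
// to exactly one server, so adding servers adds capacity regardless of
// the total volume of data.
public class ShardRouter {
    private final List<String> servers;   // e.g. ["node1", "node2", "node3"]

    public ShardRouter(List<String> servers) {
        this.servers = servers;
    }

    // Route a row key to a server; Math.floorMod keeps the index positive
    // even when hashCode() returns a negative value.
    public String serverFor(String rowKey) {
        return servers.get(Math.floorMod(rowKey.hashCode(), servers.size()));
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(List.of("node1", "node2", "node3"));
        System.out.println(router.serverFor("customer:42")); // always the same node
    }
}
```

Note that this naive modulo routing reshuffles almost every key when a server is added; that is precisely why Dynamo-style systems use consistent hashing instead. The principle stays the same, though: each key lives on one server, there are no cross-server transactions, and capacity grows by adding nodes.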
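Likewise, here is a hedged sketch of « schema on read » in plain Java: the raw lines are kept exactly as loaded, the way HDFS keeps files, and a structure is projected onto them only when a query runs. The Call record and its field names are hypothetical; conceptually, this is what a Hive table definition does when it parses files at query time.

```java
import java.util.stream.Stream;

// Hypothetical sketch of « schema on read »: raw lines are stored as-is
// (the way HDFS keeps files) and a structure is applied only when a
// query runs, not when the data is loaded.
public class SchemaOnRead {
    // The "schema" lives in the reading code, so changing it is a code
    // change, not a reload of the data.
    record Call(String caller, String callee, int seconds) {}

    static Call parse(String rawLine) {
        String[] f = rawLine.split(",");
        return new Call(f[0], f[1], Integer.parseInt(f[2]));
    }

    public static void main(String[] args) {
        // Raw data, loaded long before anyone decided how to read it.
        Stream<String> rawLines = Stream.of("alice,bob,120", "bob,carol,45");

        long longCalls = rawLines.map(SchemaOnRead::parse)
                                 .filter(c -> c.seconds() > 60)
                                 .count();
        System.out.println(longCalls + " call(s) longer than a minute");
    }
}
```

Changing the schema here means changing the parsing code, not reloading terabytes of data; that is the whole point of « schema on read ».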
In addition to being a real opportunity for a few DBAs already, Big Data is a land of change where technological breakthroughs keep emerging and maturing. Those tools, algorithms and integrated platforms will sooner or later join our DBA armories. Some technologies might replace our regular tools; others might be nice additions. In many cases, they will integrate into our ecosystem just like the in-memory option recently did… Remember, not so long ago, in-memory databases were a market for niche players and newcomers only!
Why start a journey?
I’ve had the chance to work with Big Data for a while. I suggest you come with me on a 13-week journey on the trail of the elephant in the room: Apache Hadoop. Through the lens of 13 blog posts, we’ll explore some of the key aspects of that « Big Data, no big deal » technology:
- (this post) Big Data Breakthroughs (1/13)
- Building Hadoop for Oracle Linux 7 (2/13)
- Simple 3-node Hadoop Cluster Setup (3/13)
- An Introduction to HDFS (4/13)
- MapReduce Sample Job (5/13)
- Yarn Sample Application (6/13)
- Evolved Yarn Application (7/13)
- Pig and Pig DataFu scripts (8/13)
- SQL on Hadoop with Hive (9/13)
- Linking Hadoop and relational databases with Sqoop (10/13)
- Mahout Predefined Algorithms (11/13)
- Spark: In-Memory Hadoop (12/13)
- Big Data Projects (13/13)
Big Data is evolving fast. Here, more than anywhere, you should not simply believe what you’re told! Think for yourself: no one wants to abandon DBMS ACID properties or replace SQL with Java. On the contrary, Big Data platforms are trying to get back to SQL and, eventually, ACID. Besides, « Big » is neither « Beautiful » nor « BulletProof ». With Yarn and Spark, Hadoop has deeply evolved MapReduce into something different, breaking its own once-groundbreaking model. Google itself has dumped MapReduce for something more evolved. Big Data is far from frozen, and we can question the established order.
Join this thread and do not hesitate to share your mood and comment, whether you are a DBA interested in Big Data or not!