Once upon a time, programmers stored data in flat files. That was tough: Each file was tied to the C/Fortran/COBOL code that wrote and read it, sequentially, byte by byte, in a binary format unusable without that exact code. Then relational databases came along. Schemas defined the structure of the data. A declarative language, SQL, allowed convenient, flexible access. Then came transactions, replication, clustering, security, administrative tools, and the whole towering stack that DBAs use to this day.
Big Data, until a decade ago, was much like pre-RDB data files: Everything had to be coded ad hoc. Then Hadoop came along, Moore's Law gave it a kick, and Big Data took off. But it took an out-of-the-box framework, Hadoop, to let non-specialist programmers focus on their core business without having to be Big Data experts.
Still, Hadoop was a bare-bones framework for coding algorithms structured as MapReduce tasks, and not much else. It was missing a lot of the useful layers that had evolved around the RDB. Over the last decade, some of these layers sprouted up on top of Hadoop: Pig and Hive for easy querying, instead of writing procedural Java code line by line; HBase for a structured dataset on top of the unstructured HDFS; Oozie for workflow, instead of chaining together MapReduce processes by hand; and much, much more. The crowning jewel, the rider on top of the Hadoop elephant, was Mahout for machine learning. Machine learning and data mining were always the most-talked-about reasons to use Hadoop. But in practice, most uses of Hadoop involved trivial transformations of data files, extracting the simplest of statistics. It took Mahout to make it useful.
Yet as the stack grew, it got more and more top-heavy. You either had to install the pieces yourself or work from a virtual machine.
Apache Spark, Cassandra, and Java 8 make it a lot easier. I’m going to explain how I put those technologies together to suggest matches between software developers and tech employers. What would have taken a team of distributed-systems developers and machine learning experts weeks a decade ago can now be cranked out in a few days by a single Java developer. Power!
Subscribe to this blog's RSS feed for more posts. If you want to follow the code, I’ve extracted a minimal viable example: Watch it here on GitHub.