The world's not yet perfect. Working on it.

Month: December 2014

GitHub project: Spark/Cassandra with Java 8 Lambdas

I built a small project to accompany my Datanami article. It illustrates collaborative filtering using MLlib on Apache Spark, reading its data from Cassandra through the DataStax connector.

I wrote the key class twice, once in Java 7 and once in Java 8, to illustrate how much easier lambdas make things. Java 8 is not quite as good as Scala for Big Data (and Spark itself is written in Scala), but if your team is already deep into Java and you’re not ready to change, you can get half the benefits of functional programming (and strongly typed functional programming, for that matter) by upgrading to Java 8.
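To give a feel for the difference, here is a minimal sketch that is not taken from the project itself. The `Mapper` interface and the `map` helper are hypothetical stand-ins for the kind of single-method function interface that Spark's Java API asks you to implement; the point is only to compare the Java 7 anonymous inner class with the Java 8 lambda that replaces it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LambdaComparison {

    // Hypothetical single-method interface, standing in for a
    // Spark-style function type. Not part of any real library.
    interface Mapper<T, R> {
        R call(T input);
    }

    // Hypothetical helper: applies the mapper to every item in the list.
    public static <T, R> List<R> map(List<T> items, Mapper<T, R> mapper) {
        List<R> out = new ArrayList<>();
        for (T item : items) {
            out.add(mapper.call(item));
        }
        return out;
    }

    // Java 7 style: the one line of logic is buried in an anonymous inner class.
    public static List<Integer> lengthsJava7(List<String> words) {
        return map(words, new Mapper<String, Integer>() {
            @Override
            public Integer call(String word) {
                return word.length();
            }
        });
    }

    // Java 8 style: the same logic as a lambda, with types inferred.
    public static List<Integer> lengthsJava8(List<String> words) {
        return map(words, word -> word.length());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "cassandra", "java");
        System.out.println(lengthsJava7(words)); // prints [5, 9, 4]
        System.out.println(lengthsJava8(words)); // prints [5, 9, 4]
    }
}
```

Both methods do exactly the same thing. The lambda version simply drops the ceremony, which adds up quickly when a job chains several transformations together.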



Article up at Datanami: Apache Spark and Cassandra with Java 8

I’ve been digging deep into Apache Spark recently, and in particular seeing how it plays with Cassandra and with the new functional programming features of Java 8.

My latest article on the topic just appeared on Datanami, the leading technical/business site on Big Data.





Apache Spark on Cassandra: Matching jobs and employers

Once upon a time, programmers stored data in flat files. That was tough: each file was tied to the C/Fortran/COBOL code that wrote and read it, sequentially, byte by byte, in a binary format unusable without that exact code. Then relational databases came along. Schemas defined the structure of the data. A pure-functional language, SQL, allowed convenient, flexible access. Then came transactions, replication, clustering, security, administrative tools, and the whole towering stack that DBAs use to this day.

Big Data, until a decade ago, was much like pre-RDB data files: everything had to be coded ad hoc. Then Hadoop came along, Moore's Law gave it a kick, and big data took off. But it took an out-of-the-box framework, Hadoop, to let non-specialist programmers focus on their core business without having to be Big Data experts.


Still, Hadoop was a bare-bones framework for coding algorithms structured as MapReduce tasks, and not much else. It was missing a lot of the useful layers that had evolved around the RDB. Over the last decade, some of these layers sprouted up on top of Hadoop: Pig and Hive for easy querying, instead of writing procedural Java code line by line; HBase for a structured dataset on top of the unstructured HDFS; Oozie for workflow, instead of chaining together MapReduce processes by hand; and much, much more. The crowning jewel, the rider on top of the Hadoop elephant, was Mahout for machine learning. Machine learning and data mining were always the most talked-about reasons to use Hadoop. But in practice, most uses of Hadoop involved trivial transformations of data files, extracting the simplest of statistics. It took Mahout to make it useful.

Yet as the stack grew, it got more and more top-heavy. Either you had to install the pieces yourself, or else work from a virtual machine.

Apache Spark, Cassandra, and Java 8 make it a lot easier. I’m going to explain how I put those technologies together to suggest matches between software developers and tech employers. What would have taken weeks for a team of distributed-systems developers and machine learning experts a decade ago can now be cranked out in a few days by a single Java developer. Power!

Subscribe to this blog's RSS feed for more posts. If you want to follow the code, I’ve extracted a minimal viable example: Watch it here at GitHub.

© 2017 JTF
