2004 to 2017 – Convergence of Big Data, Machine Learning, Semantic Web, Graph Analytics, and High Performance Computing – All These and Yet Big Data Analytics Sucks

2004 – Tim Berners-Lee


Semantic Web

OWL and RDF were introduced to address the Semantic Web and, more broadly, knowledge representation. This really called for Big Data technology that was still not ready.
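
To make the triple model concrete, here is a minimal sketch of RDF-style knowledge representation using plain Python tuples. The resource names and the tiny query helper are illustrative inventions, not part of any RDF standard:

```python
# Minimal sketch of RDF-style subject-predicate-object triples.
# Real RDF identifies resources with URIs and is serialized in formats
# such as Turtle or RDF/XML; plain tuples are used here for brevity.
triples = {
    ("TimBernersLee", "proposed", "SemanticWeb"),
    ("SemanticWeb", "describedWith", "RDF"),
    ("SemanticWeb", "describedWith", "OWL"),
    ("OWL", "supports", "KnowledgeRepresentation"),
}

def objects_of(subject, predicate):
    """Tiny pattern query: all objects matching (subject, predicate, ?)."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}

print(objects_of("SemanticWeb", "describedWith"))  # {'RDF', 'OWL'} (order may vary)
```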



2006 – Hadoop

Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware.




Scientific Method Obsolete for Big Data

Chris Anderson's 2008 Wired essay "The End of Theory" argued that the data deluge makes the scientific method obsolete.


2008 – MapReduce

Large-scale data processing and classification

Google created the MapReduce framework. From the paper's abstract: "MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the paper."

•        https://research.google.com/archive/mapreduce.html
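
To make the programming model concrete, here is a minimal single-machine sketch of the map and reduce steps for word counting. It is illustrative only: a real MapReduce deployment shards the input, shuffles intermediate pairs across a cluster, and runs these functions in parallel:

```python
from collections import defaultdict

def map_fn(doc_name, text):
    # Map: emit an intermediate (word, 1) pair for every word.
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: merge all intermediate values for the same key.
    yield (word, sum(counts))

docs = {"doc1": "big data big ideas", "doc2": "big clusters"}

# "Shuffle": group intermediate values by their intermediate key.
intermediate = defaultdict(list)
for name, text in docs.items():
    for key, value in map_fn(name, text):
        intermediate[key].append(value)

result = dict(pair for key, values in intermediate.items()
              for pair in reduce_fn(key, values))
print(result)  # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}
```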


2009 – Machine Learning

Emergence of Big Data machine learning frameworks and libraries.


2009 – Apache Mahout

Machine learning on Big Data introduced. Apache Mahout is a linear algebra library that runs on top of any distributed engine for which bindings have been written.


Mahout ML is mostly restricted to set theory. Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on collaborative filtering, clustering, and classification.
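
To illustrate the class of algorithms Mahout scales out, here is a toy item-based collaborative-filtering sketch using cosine similarity over co-ratings. This is plain illustrative Python, not Mahout's API, and the data is invented:

```python
import math

# Toy item-based collaborative filtering: which items are rated
# similarly by the same users? ratings: user -> {item: rating}.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 4, "B": 1, "C": 5},
    "carol": {"A": 1, "B": 5},
}

def item_vector(item):
    # The item's ratings, keyed by the users who rated it.
    return {user: r[item] for user, r in ratings.items() if item in r}

def cosine(v1, v2):
    common = set(v1) & set(v2)
    if not common:
        return 0.0
    dot = sum(v1[u] * v2[u] for u in common)
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (norm1 * norm2)

# Items most similar to "A", judged purely by co-rating behavior.
print({item: round(cosine(item_vector("A"), item_vector(item)), 2)
       for item in ("B", "C")})
```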



2012 – Apache Spark

Apache Spark was introduced to deal with very large data and in-memory processing. It is a cluster-computing architecture that can be up to 100 times faster than slow, disk-based MapReduce and better parallelizes algorithms. Apache Spark is an open-source cluster-computing framework, originally developed at the University of California, Berkeley's AMPLab.
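
As a rough illustration of the in-memory, RDD-based style Spark introduced, here is a minimal PySpark word count. It assumes a local Spark installation; the inlined data and names are illustrative:

```python
from pyspark import SparkContext

# Minimal PySpark sketch: word count over an RDD, cached in memory.
sc = SparkContext("local[*]", "wordcount-sketch")

lines = sc.parallelize(["big data big ideas", "big clusters"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)
               .cache())  # keep the result in memory for repeated reuse

print(counts.collect())   # e.g. [('big', 3), ('data', 1), ...]
sc.stop()
```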



Mahout vs Spark

The difference: Mahout supplies scalable machine learning algorithms, while Spark supplies a general-purpose in-memory compute engine; indeed, newer Mahout releases run on top of Spark.



2012 – GraphX

GraphX is a distributed graph-processing framework on top of Apache Spark. Because it is based on RDDs, which are immutable, its graphs are immutable; GraphX is therefore unsuitable for graphs that need to be updated, let alone in a transactional manner like a graph database. GraphX can be viewed as the Spark in-memory counterpart of Apache Giraph, which uses Hadoop's disk-based MapReduce.
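
The practical consequence of RDD immutability can be sketched in a few lines of plain Python (GraphX itself is a Scala API; this is only a conceptual illustration): any "update" builds a new graph rather than modifying the old one.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Graph:
    # Immutable vertex and edge sets, mirroring the RDD constraint.
    vertices: frozenset
    edges: frozenset  # set of (src, dst) pairs

    def add_edge(self, src, dst):
        # Returns a brand-new Graph; the original is never mutated.
        return Graph(self.vertices | {src, dst}, self.edges | {(src, dst)})

g1 = Graph(frozenset({"a", "b"}), frozenset({("a", "b")}))
g2 = g1.add_edge("b", "c")
print(g1.edges)  # unchanged: frozenset({('a', 'b')})
print(g2.edges)  # new graph carrying the extra edge
```
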
2013 – DARPA PPAML https://www.darpa.mil/program/probabilistic-programming-for-advancing-machine-learning


Machine learning – the ability of computers to understand data, manage results and infer insights from uncertain information – is the force behind many recent revolutions in computing. Email spam filters, smartphone personal assistants and self-driving vehicles are all based on research advances in machine learning. Unfortunately, even as the demand for these capabilities is accelerating, every new application requires a Herculean effort. Teams of hard-to-find experts must build expensive, custom tools that are often painfully slow and can perform unpredictably against large, complex data sets.

The Probabilistic Programming for Advancing Machine Learning (PPAML) program aims to address these challenges. Probabilistic programming is a new programming paradigm for managing uncertain information.
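
As a minimal sketch of the idea (not DARPA's or any PPAML performer's system): write down a generative model with its uncertainty, condition on observed data, and let an inference engine recover the posterior. Naive rejection sampling stands in here for the far smarter inference real probabilistic-programming systems provide:

```python
import random

# Infer a coin's unknown bias after observing 9 heads in 10 flips.
observed_heads, flips = 9, 10

accepted = []
for _ in range(200_000):
    bias = random.random()                       # prior: bias ~ Uniform(0, 1)
    heads = sum(random.random() < bias for _ in range(flips))
    if heads == observed_heads:                  # condition on the observation
        accepted.append(bias)

# The exact posterior here is Beta(10, 2), whose mean is 10/12 ≈ 0.83.
print(f"posterior mean bias ≈ {sum(accepted) / len(accepted):.2f}")
```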

Ingine responded to DARPA's RFQ with a detailed architecture based on Barry's algorithmic innovation, which addresses the above ask to some extent. Importantly, it solves Probabilistic Ontology for knowledge extraction from uncertainty and semantic reasoning.

2017 – DARPA Graph Analytics https://graphchallenge.mit.edu/scenarios


In this era of big data, the rates at which these data sets grow continue to accelerate, and the ability to manage and analyze the largest data sets is always severely taxed. The most challenging of these data sets are those containing relational or network data. The HIVE challenge is envisioned as an annual challenge to advance the state of the art in graph analytics on extremely large data sets. The primary focus is the expansion and acceleration of graph analytic algorithms through improvements to algorithms and their implementations and, especially, through special-purpose hardware such as distributed and grid computers and GPUs. Potential approaches to accelerating graph analytic algorithms include massively parallel computation, improvements to memory utilization, more efficient communication, and optimized data processing units.
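
One of the published GraphChallenge kernels is triangle counting. A minimal serial sketch of that computation follows; the challenge itself is about running kernels like this on graphs with billions of edges via parallel algorithms and special-purpose hardware, and the toy data here is invented:

```python
# Serial triangle count: for each edge (u, v), triangles through that
# edge equal the common neighbors of u and v; each triangle is then
# seen once per edge, i.e. three times, so divide by 3.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]

adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

triangles = sum(len(adj[u] & adj[v]) for u, v in edges) // 3
print(triangles)  # 1 (the triangle a-b-c)
```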


2013 – Other Large Graph Analytics Reference: "An NSA Big Graph experiment"


2017 – Data Science: Dealing with Large Data Still Sucks


Despite the emergence of Big Data, machine learning, graph techniques, and the Semantic Web, the convergence remains far-fleeting. In particular, semantic / cognitive / knowledge-extraction techniques are very poorly defined, and there is no framework approach to knowledge engineering that leads into machine learning and the automation of knowledge extraction, representation, learning, and reasoning. This is what Q-UEL and HDN solve at the algorithmic level.