2004 to 2017 Convergence of Big Data, Machine Learning, Semantic Web, Graph Analytics, High Performance Computing – All These and Yet Big Data Analytics Sucks

2004 – Tim Lee Berner

 

Semantic Web

OWL and RDF introduced to address Semantic Web and also Knowledge Representation. This really calls for BigData technology that was still not ready.

https://www.w3.org/2004/01/sws-pressrelease

 

2006 – Hadoop Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

https://opensource.com/life/14/8/intro-apache-hadoop-big-data

 

2008

Scientific Method Obsolete for BigData

 The Data Deluge Makes the Scientific Method Obsolete

 

2008 – MapReduce

Large Data Processing – classification

Google created the framework for MapReduce – MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

•        https://research.google.com/archive/mapreduce.html

 

2009 – Machine Learning Emergence of BigData Machine Learning Framework and Libraries

 

2009 – Apache Mahout Apache Mahout – Machine Learning on BigData Introduced.  Apache Mahout is a linear algebra library that runs on top of any distributed engine that have bindings written.

https://www.ibm.com/developerworks/library/j-mahout/

Mahout ML is mostly restricted to set theory. Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification.

 

 

2012 – Apache SPARK Apache SPARK Introduced to deal with Very Large Data and IN-Memorry Processing. It is an architecture for cluster computing – that increases the computing compared with slow MapReduce by 100 times and also better solves parallelization of the algorithm. Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley’s AMPLab

https://en.wikipedia.org/wiki/Apache_Spark

 

Mahout vs Spark Difference between Mahout vs SPARK

https://www.linkedin.com/pulse/choosing-machine-learning-frameworks-apache-mahout-vs-debajani

 

2012 – GraphX GraphX is a distributed graph processing framework on top of Apache Spark. Because it is based on RDDs, which are immutable, graphs are immutable and thus GraphX is unsuitable for graphs that need to be updated, let alone in a transactional manner like a graph databasE. GraphX can be viewed as being the Spark in-memory version of Apache Giraph, which utilized Hadoop disk-based MapReduce.
2013 – DARPA PPAML https://www.darpa.mil/program/probabilistic-programming-for-advancing-machine-learning

 

Machine learning – the ability of computers to understand data, manage results and infer insights from uncertain information – is the force behind many recent revolutions in computing. Email spam filters, smartphone personal assistants and self-driving vehicles are all based on research advances in machine learning. Unfortunately, even as the demand for these capabilities is accelerating, every new application requires a Herculean effort. Teams of hard-to-find experts must build expensive, custom tools that are often painfully slow and can perform unpredictably against large, complex data sets.

The Probabilistic Programming for Advancing Machine Learning (PPAML) program aims to address these challenges. Probabilistic programming is a new programming paradigm for managing uncertain information.

Ingine Responded to DARPA’s RFQ with a detailed architecture based on Barry’s innovation in the algorithm that basically solves the above ask to some extent. Importantly it solve Probabilistic Ontology for  Knowledge Extraction from Uncertainty and Semantic Reasoning.

2017 – DARPA Graph Analytics https://graphchallenge.mit.edu/scenarios

 

In this era of big data, the rates at which these data sets grow continue to accelerate. The ability to manage and analyze the largest data sets is always severely taxed.  The most challenging of these data sets are those containing relational or network data. The HIVE challenge is envisioned to be an annual challenge that will advance the state of the art in graph analytics on extremely large data sets. The primary focus of the challenges will be on the expansion and acceleration of graph analytic algorithms through improvements to algorithms and their implementations, and especially importantly, through special purpose hardware such as distributed and grid computers, and GPUs. Potential approaches to accelerate graph analytic algorithms include such methods as massively parallel computation, improvements to memory utilization, more efficient communications, and optimized data processing units.

 

2013 Other Large Graph Analytics Reference An NSA Big Graph experiment

http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf

2017 Data Science Dealing with Large Data Still Sucks

 

Despite emergence of Big Data, Machine Learning, Graphing Techniques and Semantic Web. The convergence is still far fleeting. Especially Semantic / Cognitive / Knowledge Extraction techniques are very poorly defined and there does not exists a framework approach to knowledge engineering leading into Machine Learning and automation in Knowledge Extraction, Representation, Learning and Reasoning. This is what  Q-UEL and HDN solves at the algorithmic level.

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.