2004 to 2017 Convergence of Big Data, Machine Learning, Semantic Web, Graph Analytics, High Performance Computing – All These and Yet Big Data Analytics Sucks

2004 – Tim Lee Berner

Semantic Web

OWL and RDF introduced to address Semantic Web and also Knowledge Representation. This really calls for BigData technology that was still not ready.

https://www.w3.org/2004/01/sws-pressrelease

2006 – Hadoop

Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

https://opensource.com/life/14/8/intro-apache-hadoop-big-data

2008

Scientific Method Obsolete for BigData

The Data Deluge Makes the Scientific Method Obsolete

2008 – MapReduce

Large Data Processing – classification

Google created the framework for MapReduce – MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

• https://research.google.com/archive/mapreduce.html

2009 – Machine Learning

Emergence of BigData Machine Learning Framework and Libraries

2009 – Apache Mahout

Apache Mahout – Machine Learning on BigData Introduced. Apache Mahout is a linear algebra library that runs on top of any distributed engine that have bindings written.

https://www.ibm.com/developerworks/library/j-mahout/

Mahout ML is mostly restricted to set theory. Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification.

2012 – Apache SPARK

Apache SPARK Introduced to deal with Very Large Data and IN-Memorry Processing. It is an architecture for cluster computing – that increases the computing compared with slow MapReduce by 100 times and also better solves parallelization of the algorithm. Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley’s AMPLab

https://en.wikipedia.org/wiki/Apache_Spark

Mahout vs Spark

Difference between Mahout vs SPARK

https://www.linkedin.com/pulse/choosing-machine-learning-frameworks-apache-mahout-vs-debajani

2012 – GraphX

GraphX is a distributed graph processing framework on top of Apache Spark. Because it is based on RDDs, which are immutable, graphs are immutable and thus GraphX is unsuitable for graphs that need to be updated, let alone in a transactional manner like a graph databasE. GraphX can be viewed as being the Spark in-memory version of Apache Giraph, which utilized Hadoop disk-based MapReduce.

2013 – DARPA PPAML

https://www.darpa.mil/program/probabilistic-programming-for-advancing-machine-learning

Machine learning – the ability of computers to understand data, manage results and infer insights from uncertain information – is the force behind many recent revolutions in computing. Email spam filters, smartphone personal assistants and self-driving vehicles are all based on research advances in machine learning. Unfortunately, even as the demand for these capabilities is accelerating, every new application requires a Herculean effort. Teams of hard-to-find experts must build expensive, custom tools that are often painfully slow and can perform unpredictably against large, complex data sets.

The Probabilistic Programming for Advancing Machine Learning (PPAML) program aims to address these challenges. Probabilistic programming is a new programming paradigm for managing uncertain information.

Ingine Responded to DARPA’s RFQ with a detailed architecture based on Barry’s innovation in the algorithm that basically solves the above ask to some extent. Importantly it solve Probabilistic Ontology for Knowledge Extraction from Uncertainty and Semantic Reasoning.

2017 – DARPA Graph Analytics

https://graphchallenge.mit.edu/scenarios

In this era of big data, the rates at which these data sets grow continue to accelerate. The ability to manage and analyze the largest data sets is always severely taxed. The most challenging of these data sets are those containing relational or network data. The HIVE challenge is envisioned to be an annual challenge that will advance the state of the art in graph analytics on extremely large data sets. The primary focus of the challenges will be on the expansion and acceleration of graph analytic algorithms through improvements to algorithms and their implementations, and especially importantly, through special purpose hardware such as distributed and grid computers, and GPUs. Potential approaches to accelerate graph analytic algorithms include such methods as massively parallel computation, improvements to memory utilization, more efficient communications, and optimized data processing units.

2013 Other Large Graph Analytics Reference

An NSA Big Graph experiment

http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf

2017 Data Science Dealing with Large Data Still Sucks

Despite emergence of Big Data, Machine Learning, Graphing Techniques and Semantic Web. The convergence is still far fleeting. Especially Semantic / Cognitive / Knowledge Extraction techniques are very poorly defined and there does not exists a framework approach to knowledge engineering leading into Machine Learning and automation in Knowledge Extraction, Representation, Learning and Reasoning. This is what Q-UEL and HDN solves at the algorithmic level.

Generative Transformation

"The study of enterprises which are in a systemic state of entropy"

2004 to 2017 Convergence of Big Data, Machine Learning, Semantic Web, Graph Analytics, High Performance Computing – All These and Yet Big Data Analytics Sucks

Leave a comment Cancel reply

Share this:

Related

Leave a comment Cancel reply