2004 – Tim Berners-Lee
Semantic Web |
OWL and RDF were published as W3C Recommendations to support the Semantic Web and knowledge representation. Applying them at scale called for Big Data technology that was not yet ready.
https://www.w3.org/2004/01/sws-pressrelease
|
2006 – Hadoop | Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware.
https://opensource.com/life/14/8/intro-apache-hadoop-big-data
|
2008
Scientific Method Obsolete for BigData |
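Chris Anderson's Wired essay "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete" argued that, with enough data, correlation can stand in for hypothesis-driven models.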
|
2008 – MapReduce
Large Data Processing – classification |
Google created the MapReduce framework. MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the paper.
https://research.google.com/archive/mapreduce.html
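The map and reduce roles described above can be sketched in a few lines of Python. This is a minimal single-process illustration of the programming model, not Google's or Hadoop's implementation: the names `map_word_count`, `reduce_word_count` and `run_mapreduce`, and the sample documents, are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_word_count(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """Map: take one (document name, text) pair and emit (word, 1) pairs."""
    for word in value.lower().split():
        yield word, 1


def reduce_word_count(key: str, values: Iterable[int]) -> int:
    """Reduce: merge all intermediate values that share the same key."""
    return sum(values)


def run_mapreduce(
    records: Iterable[Tuple[str, str]],
    map_fn: Callable[[str, str], Iterator[Tuple[str, int]]],
    reduce_fn: Callable[[str, Iterable[int]], int],
) -> Dict[str, int]:
    """Hypothetical in-memory driver: map, group by intermediate key, reduce."""
    intermediate: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            # Grouping by intermediate key plays the role of the shuffle step.
            intermediate[out_key].append(out_value)
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}


# Made-up sample input: (document name, document text) pairs.
documents = [
    ("doc1", "big data needs big clusters"),
    ("doc2", "data beats opinion"),
]
counts = run_mapreduce(documents, map_word_count, reduce_word_count)
print(counts)
```

In a real cluster the map calls, the grouping step and the reduce calls run on different machines; here they run in one loop only to make the data flow visible.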
|
2009 – Machine Learning | Emergence of Big Data machine learning frameworks and libraries
|
2009 – Apache Mahout | Apache Mahout introduced machine learning on Big Data. It is a linear algebra library that runs on top of any distributed engine for which bindings have been written. Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on collaborative filtering, clustering and classification. Mahout ML is mostly restricted to set theory.
https://www.ibm.com/developerworks/library/j-mahout/
|
2012 – Apache Spark | Apache Spark was introduced to handle very large data sets with in-memory processing. It is an open-source cluster-computing framework, originally developed at the University of California, Berkeley’s AMPLab. By keeping data in memory it can run some workloads up to 100 times faster than disk-based MapReduce, and it makes algorithms easier to parallelize.
https://en.wikipedia.org/wiki/Apache_Spark
|
Mahout vs Spark | Differences between Mahout and Spark
https://www.linkedin.com/pulse/choosing-machine-learning-frameworks-apache-mahout-vs-debajani
|
2012 – GraphX | GraphX is a distributed graph processing framework on top of Apache Spark. Because it is based on RDDs, which are immutable, its graphs are immutable too. GraphX is therefore unsuitable for graphs that need to be updated, let alone updated transactionally as in a graph database. GraphX can be viewed as the Spark in-memory counterpart of Apache Giraph, which used Hadoop's disk-based MapReduce. |
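To make the immutability point concrete, here is a small plain-Python sketch. It does not use Spark or the GraphX API; `ImmutableGraph` and `with_edge` are hypothetical names invented for this illustration. It only shows the general consequence of immutable storage: an "update" has to build a new graph value instead of changing the existing one in place.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Edge = Tuple[str, str]


@dataclass(frozen=True)
class ImmutableGraph:
    """Hypothetical stand-in for a graph backed by immutable collections."""

    vertices: FrozenSet[str]
    edges: FrozenSet[Edge]

    def with_edge(self, src: str, dst: str) -> "ImmutableGraph":
        # No in-place mutation: adding an edge returns a brand-new graph.
        return ImmutableGraph(
            vertices=self.vertices | {src, dst},
            edges=self.edges | {(src, dst)},
        )


g1 = ImmutableGraph(frozenset({"a", "b"}), frozenset({("a", "b")}))
g2 = g1.with_edge("b", "c")

print(sorted(g1.edges))  # the original graph is unchanged
print(sorted(g2.edges))  # the new graph carries the extra edge
```

Deriving a new graph per change is fine for batch analytics, but it is a poor fit for workloads that apply many small updates, which is the limitation described above.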
2013 – DARPA PPAML | https://www.darpa.mil/program/probabilistic-programming-for-advancing-machine-learning
Machine learning – the ability of computers to understand data, manage results and infer insights from uncertain information – is the force behind many recent revolutions in computing. Email spam filters, smartphone personal assistants and self-driving vehicles are all based on research advances in machine learning. Unfortunately, even as the demand for these capabilities is accelerating, every new application requires a Herculean effort. Teams of hard-to-find experts must build expensive, custom tools that are often painfully slow and can perform unpredictably against large, complex data sets. The Probabilistic Programming for Advancing Machine Learning (PPAML) program aims to address these challenges. Probabilistic programming is a new programming paradigm for managing uncertain information. Ingine responded to DARPA’s RFQ with a detailed architecture based on Barry’s algorithmic innovation, which meets the above requirements to some extent. Importantly, it addresses probabilistic ontology for knowledge extraction from uncertainty and semantic reasoning. |
2017 – DARPA Graph Analytics | https://graphchallenge.mit.edu/scenarios
In this era of big data, the rates at which these data sets grow continue to accelerate. The ability to manage and analyze the largest data sets is always severely taxed. The most challenging of these data sets are those containing relational or network data. The HIVE challenge is envisioned to be an annual challenge that will advance the state of the art in graph analytics on extremely large data sets. The primary focus of the challenges will be on the expansion and acceleration of graph analytic algorithms through improvements to algorithms and their implementations, and especially importantly, through special purpose hardware such as distributed and grid computers, and GPUs. Potential approaches to accelerate graph analytic algorithms include such methods as massively parallel computation, improvements to memory utilization, more efficient communications, and optimized data processing units.
|
2013 Other Large Graph Analytics Reference | An NSA Big Graph experiment
http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf |
2017 Data Science Dealing with Large Data Still Sucks
|
Despite the emergence of Big Data, machine learning, graph techniques and the Semantic Web, their convergence remains elusive. Semantic, cognitive and knowledge extraction techniques in particular are poorly defined, and no framework approach exists for knowledge engineering that leads into machine learning and automation of knowledge extraction, representation, learning and reasoning. This is what Q-UEL and HDN solve at the algorithmic level. |