2004 – Tim Berners-Lee
Semantic Web 
OWL and RDF were introduced to address the Semantic Web and also Knowledge Representation. This really called for BigData technology that was not yet ready.
https://www.w3.org/2004/01/swspressrelease

2006 – Hadoop  Apache Hadoop is an open source software framework for storage and large scale processing of datasets on clusters of commodity hardware.
https://opensource.com/life/14/8/introapachehadoopbigdata

2008
Scientific Method Obsolete for BigData 
The Data Deluge Makes the Scientific Method Obsolete

2008 – MapReduce
Large Data Processing – classification 
Google created the framework for MapReduce – MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
• https://research.google.com/archive/mapreduce.html
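The map and reduce roles described above can be sketched with a word count, the standard introductory example. This is a minimal single-process illustration, not Google's or Hadoop's implementation: the function names (`map_words`, `reduce_counts`, `run_mapreduce`) and the two sample documents are assumptions made for the sketch, and the grouping step stands in for the distributed shuffle a real cluster performs.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(doc_id: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map: emit an intermediate (word, 1) pair for every word in a document."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> int:
    """Reduce: merge all intermediate values that share the same key."""
    return sum(counts)


def run_mapreduce(inputs: Dict[str, str], mapper, reducer) -> Dict[str, int]:
    """Single-process stand-in for the shuffle and reduce phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in inputs.items():
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # group values by intermediate key
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical input documents, keyed by document id.
docs = {"d1": "big data big clusters", "d2": "data on clusters"}
word_counts = run_mapreduce(docs, map_words, reduce_counts)
print(word_counts)
```

Because each map call sees only one input pair and each reduce call sees only one key, both phases can in principle be spread across many machines, which is the point of the model.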

2009 – Machine Learning  Emergence of BigData Machine Learning Framework and Libraries

2009 – Apache Mahout  Apache Mahout – Machine Learning on BigData Introduced. Apache Mahout is a linear algebra library that runs on top of any distributed engine for which bindings have been written.
https://www.ibm.com/developerworks/library/jmahout/ Mahout ML is mostly restricted to set theory. Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification.

2012 – Apache SPARK  Apache SPARK was introduced to deal with very large data and in-memory processing. It is an architecture for cluster computing that can run up to 100 times faster than the slower, disk-based MapReduce and also parallelizes algorithms more effectively. Apache Spark is an open-source cluster-computing framework, originally developed at the University of California, Berkeley’s AMPLab.
https://en.wikipedia.org/wiki/Apache_Spark

Mahout vs Spark  Difference between Mahout vs SPARK
https://www.linkedin.com/pulse/choosingmachinelearningframeworksapachemahoutvsdebajani

2012 – GraphX  GraphX is a distributed graph processing framework on top of Apache Spark. Because it is based on RDDs, which are immutable, graphs are immutable, and thus GraphX is unsuitable for graphs that need to be updated, let alone in a transactional manner like a graph database. GraphX can be viewed as the Spark in-memory version of Apache Giraph, which utilized Hadoop disk-based MapReduce.
2013 – DARPA PPAML  https://www.darpa.mil/program/probabilisticprogrammingforadvancingmachinelearning
Machine learning – the ability of computers to understand data, manage results and infer insights from uncertain information – is the force behind many recent revolutions in computing. Email spam filters, smartphone personal assistants and self-driving vehicles are all based on research advances in machine learning. Unfortunately, even as the demand for these capabilities is accelerating, every new application requires a Herculean effort. Teams of hard-to-find experts must build expensive, custom tools that are often painfully slow and can perform unpredictably against large, complex data sets. The Probabilistic Programming for Advancing Machine Learning (PPAML) program aims to address these challenges. Probabilistic programming is a new programming paradigm for managing uncertain information. Ingine responded to DARPA’s RFQ with a detailed architecture based on Barry’s algorithmic innovation, which addresses the above requirements to some extent. Importantly, it addresses Probabilistic Ontology for Knowledge Extraction from Uncertainty and Semantic Reasoning.
2017 – DARPA Graph Analytics  https://graphchallenge.mit.edu/scenarios
In this era of big data, the rates at which these data sets grow continue to accelerate. The ability to manage and analyze the largest data sets is always severely taxed. The most challenging of these data sets are those containing relational or network data. The HIVE challenge is envisioned to be an annual challenge that will advance the state of the art in graph analytics on extremely large data sets. The primary focus of the challenges will be on the expansion and acceleration of graph analytic algorithms through improvements to algorithms and their implementations, and especially importantly, through special purpose hardware such as distributed and grid computers, and GPUs. Potential approaches to accelerate graph analytic algorithms include such methods as massively parallel computation, improvements to memory utilization, more efficient communications, and optimized data processing units.

2013 Other Large Graph Analytics Reference  An NSA Big Graph experiment
http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf 
2017 Data Science Dealing with Large Data Still Sucks

Despite the emergence of Big Data, Machine Learning, Graphing Techniques and the Semantic Web, their convergence remains elusive. Semantic / Cognitive / Knowledge Extraction techniques in particular are very poorly defined, and there does not exist a framework approach to knowledge engineering that leads into Machine Learning and automation in Knowledge Extraction, Representation, Learning and Reasoning. This is what QUEL and HDN address at the algorithmic level.
Enterprise Architecture
Deep Learning in Hamiltonian Space on iPad
Large Data Analytics – on your iPad
[Big Data In Your Mini Space]
Combinatorial Explosion !!!
Hermitian Conjugates and Billion Tags
Hamiltonian Space Offering Deep Learning
The BioIngine.com™ Platform
The BioIngine.com™ offers a comprehensive biostatistical reasoning experience in the application of data science that blends descriptive and inferential statistical studies. Progressing further, it will also blend NLP and AI to create a holistic Cognitive Experience.
The BioIngine.com™ is a High Performance Cloud Computing Platform delivering HealthCare Large-Data Analytics capability derived from an ensemble of biostatistical computations. The automated biostatistical reasoning is a combination of “deterministic” and “probabilistic” methods employed against both structured and unstructured large data sets, leading into Cognitive Reasoning.
The figure below depicts the healthcare analytics challenge as the order complexity is scaled.
Given the challenge of analyzing large data sets, both structured (EHR data) and unstructured, the emerging Healthcare analytics center on the methods discussed below, in particular multivariate regression and multivariate probabilistic inference; Ingine is unique in proposing the Hyperbolic Dirac Net for probabilistic inference.
The basic premise in engineering The BioIngine.com™ is the acknowledgment that, in extracting knowledge from large data sets (both structured and unstructured), one is confronted by very large data sets riddled with high dimensionality and uncertainty.
Generally, in deriving insights from large data sets, the order of complexity scales as follows.
A) Descriptive Statistics : Insights around : “what”
For large data sets, descriptive statistics are adequate to extract a “what” perspective. Descriptive statistics generally delivers statistical summary of the ecosystem and the probabilistic distribution.
i) Univariate Problem : “what”
Assuming simple relationships, or cumulative effects, between the independent variables (causes) and the dependent variables (outcomes):
Univariate regression (simple independent variables to dependent variables analysis)
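A univariate regression of this kind can be sketched as an ordinary least-squares line fit. This is a minimal illustration only: the function name `fit_line` and the four data points are hypothetical, chosen so that the outcome follows y = 2x + 1 exactly.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one independent variable: y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: one independent variable (cause) and one outcome.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```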
ii) Bivariate Problem : “what”
Correlation Cluster – shows impact of set of variables or segment analysis.
https://en.wikipedia.org/wiki/Correlation_clustering
From above link : In machine learning, correlation clustering or cluster editing operates in a scenario where the relationships between the objects are known instead of the actual representations of the objects. For example, given a weighted graph G = (V,E), where the edge weight indicates whether two nodes are similar (positive edge weight) or different (negative edge weight), the task is to find a clustering that either maximizes agreements (sum of positive edge weights within a cluster plus the absolute value of the sum of negative edge weights between clusters) or minimizes disagreements (absolute value of the sum of negative edge weights within a cluster plus the sum of positive edge weights across clusters). Unlike other clustering algorithms this does not require choosing the number of clusters k in advance because the objective, to minimize the sum of weights of the cut edges, is independent of the number of clusters.
http://www.statisticssolutions.com/correlationpearsonkendallspearman/
From above link. : Correlation is a bivariate analysis that measures the strength of association between two variables. In statistics, the value of the correlation coefficient varies between +1 and -1. When the value of the correlation coefficient lies around ±1, there is said to be a perfect degree of association between the two variables. As the correlation coefficient value goes towards 0, the relationship between the two variables becomes weaker. Usually, in statistics, we measure three types of correlations: Pearson correlation, Kendall rank correlation and Spearman correlation.
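The three correlation measures named above can be sketched in plain Python. This is an illustrative sketch, not a library-grade implementation: the rank and Kendall functions assume there are no tied values (Kendall is computed as tau-a), and the sample data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson r: covariance divided by the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """Rank values from 1 to n (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = float(rank)
    return result


def spearman(xs, ys):
    """Spearman rho: Pearson r computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs (no ties)."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_curve = [1.0, 4.0, 9.0, 16.0, 25.0]  # increasing but not linear
y_down = [10.0, 8.0, 6.0, 4.0, 2.0]    # exactly linear, decreasing

print(pearson(x, y_down), pearson(x, y_curve))
print(spearman(x, y_curve), kendall_tau(x, y_curve))
```

On the curved data the rank-based measures report a perfect monotone association while Pearson falls slightly below +1, which is the practical difference between the three.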
iii) Multivariate Analysis (Complexity increases) : “what”
§ Multiple regression (considering multiple univariate regressions to analyze the effect of the independent variables on the outcomes)
§ Multivariate regression – where multiple causes and multiple outcomes exist
iv) Neural Net : “what”
The challenges of multivariate analysis discussed above push us into techniques such as the Neural Net, which is the next level beyond the multivariate regression statistical approach: multiple regression models feed into the next level of clusters, which is again an array of multiple regression models.
The Neural Net method still remains inadequate in depicting “how” the human mind probably operates. In discerning the health ecosystem for diagnostic purposes, the “how”, “why” and “when” interrogatives become imperative for arriving at an accurate diagnosis and targeting outcomes effectively. A Neural Net's learning is “smudged out”. Put a little more precisely: it is hard to interrogate a Neural Net because it is far from easy to see which weights are mixed up in the different pooled contributions, or where they come from.
“We enter Probabilistic Computations, which is as such a Combinatorial Explosion Problem”.
B) Inferential Statistics : – Deeper Insights “how”, “why”, “when” in addition to “what”.
Hyperbolic Dirac Net (Inverse or Dual Bayesian technique)
All the above are still discussing the “what” aspect. When the complexity increases, the notion of independent and dependent variables becomes non-deterministic, since the distinction is difficult to establish given the interactions amongst the variables, potentially including cyclic paths of influence in a network of interactions. A very simple example is that obesity causes diabetes, but the converse is also true, and we may also suspect that obesity causes type 2 diabetes, which in turn causes obesity. In such a situation, what is best treated as “subject” and what is best treated as “object” becomes difficult to establish. Existing inference network methods typically assume that the world can be represented by a Directed Acyclic Graph, more like a tree, but the real world is more complex than that: metabolism, neural pathways, road maps, subway maps and concept maps are not unidirectional; they are more interactive, with cyclic routes. Furthermore, discovering the “how” aspect becomes important in the diagnosis of the episodes and in establishing correct pathways, while also extracting the severe cases (chronic cases, which are a multivariate problem). Indeterminism also creates an ontology that can be probabilistic, not crisp.
Note: From a Healthcare Analytics perspective, most Accountable Care Organization (ACO) analytics address the above based on the PQRS clinical factors, which are all quantitative. These are barely useful for advancing the ACO toward performance-driven or value-driven outcomes, most of which are qualitative.
Notes On Statistics :
Generally one enters Inferential Statistics, a form of inductive reasoning, when there is no clear distinction between independent and dependent variables; this problem is accentuated by the multivariate condition. As such the problem becomes irreducible. Please refer to the MIT course work below to gain a better understanding of statistics and of the different statistical methods, descriptive and inferential. Pay particular attention to Bayesian Statistics. The HDN Inferential Statistics being introduced in The BioIngine.com are an advancement on Bayesian Statistics.
Introduction to Statistics Class 10, 18.05,
Spring 2014 Jeremy Orloff and Jonathan Bloom
From above link
a) What is a statistic?
b) Descriptive statistics
Raw data often takes the form of a massive list, array, or database of labels and numbers. To make sense of the data, we can calculate summary statistics like the mean, median, and interquartile range. We can also visualize the data using graphical devices like histograms, scatterplots, and the empirical cdf. These methods are useful for both communicating and exploring the data to gain insight into its structure, such as whether it might follow a familiar probability distribution.
c) Inferential statistics
https://www.coursera.org/specializations/socialscience
Inferential statistics are concerned with making inferences, based on relations found in the sample, about relations in the population. Inferential statistics help us decide, for example, whether the differences between groups that we see in our data are strong enough to provide support for our hypothesis that group differences exist in general, in the entire population.
d) Types of Inferential Statistics
i) Frequentist – 19th Century
Hypothesis Stable – Evaluating Data
https://en.wikipedia.org/wiki/Frequentist_inference
Frequentist inference is a type of statistical inference that draws conclusions from sample data by emphasizing the frequency or proportion of the data. An alternative name is frequentist statistics. This is the inference framework in which the wellestablished methodologies of statistical hypothesis testing and confidence intervals are based.
ii) Bayesian Inference – 20th Century
Data Held Stable – Evaluating Hypothesis
https://ocw.mit.edu/courses/mathematics/1805introductiontoprobabilityandstatisticsspring 2014/readings/MIT18_05S14_Reading10a.pdf
In scientific experiments we start with a hypothesis and collect data to test the hypothesis. We will often let H represent the event ‘our hypothesis is true’ and let D be the collected data. In these words Bayes’ theorem says
P(H | D) = P(D | H) P(H) / P(D)
The left-hand term is the probability our hypothesis is true given the data we collected. This is precisely what we’d like to know. When all the probabilities on the right are known exactly, we can compute the probability on the left exactly. This will be our focus next week. Unfortunately, in practice we rarely know the exact values of all the terms on the right. Statisticians have developed a number of ways to cope with this lack of knowledge and still make useful inferences. We will be exploring these methods for the rest of the course.
http://www.ling.upenn.edu/courses/cogs501/Bayes1.html
A. Conditional Probability
P(A | B) is the probability of event A occurring, given that event B occurs.
https://en.wikipedia.org/wiki/Conditional_probability
In probability theory, conditional probability is a measure of the probability of an event given that (by assumption, presumption, assertion or evidence) another event has occurred. If the event of interest is A and the event B is known or assumed to have occurred, “the conditional probability of A given B”, or “the probability of A under the condition B”, is usually written as P(A | B), or sometimes P_B(A). For example, the probability that any given person has a cough on any given day may be only 5%. But if we know or assume that the person has a cold, then they are much more likely to be coughing. The conditional probability of coughing given that you have a cold might be a much higher 75%.
The concept of conditional probability is one of the most fundamental and one of the most important concepts in probability theory. But conditional probabilities can be quite slippery and require careful interpretation. For example, there need not be a causal or temporal relationship between A and B.
B. Joint Probability
https://en.wikipedia.org/wiki/Joint_probability_distribution
P(A, B): the probability of two or more events occurring together.
In the study of probability, given at least two random variables X, Y, …, that are defined on a probability space, the joint probability distribution for X, Y, … is a probability distribution that gives the probability that each of X, Y, … falls in any particular range or discrete set of values specified for that variable. In the case of only two random variables, this is called a bivariate distribution, but the concept generalizes to any number of random variables, giving a multivariate distribution.
Bayesian Rules
P(A | B) = P(A, B) / P(B)
P(B | A) = P(B, A) / P(A)
P(B | A) = P(A, B) / P(A)
P(A | B) P(B) = P(A, B)
P(B | A) P(A) = P(A, B)
P(A | B) P(B) = P(A, B) = P(B | A) P(A)
P(A | B) = P(B | A) P(A) / P(B)
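The rules above can be checked numerically with the cough and cold example quoted earlier. The 5% and 75% figures come from that quotation; the 4% base rate for having a cold is an assumption introduced here purely so the arithmetic can be completed.

```python
# From the quoted example: P(cough) and P(cough | cold).
p_cough = 0.05
p_cough_given_cold = 0.75

# Assumption for illustration only: base rate of having a cold.
p_cold = 0.04

# Joint probability: P(cough, cold) = P(cough | cold) P(cold).
p_joint = p_cough_given_cold * p_cold

# Bayes' rule: P(cold | cough) = P(cough | cold) P(cold) / P(cough).
p_cold_given_cough = p_joint / p_cough

# Symmetry check: P(A | B) P(B) = P(A, B) = P(B | A) P(A).
lhs = p_cough_given_cold * p_cold
rhs = p_cold_given_cough * p_cough

print(round(p_joint, 4), round(p_cold_given_cough, 4))
```

Under these numbers a cough raises the probability of a cold from 4% to about 60%, which shows how the direction of the conditional changes the answer.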
iii) C. Hyperbolic Dirac Net (HDN) – 21st Century
Non-hypothesis-driven unsupervised machine learning. Independent of both data and hypothesis.
Refer: http://www.sciencedirect.com/science/article/pii/S0010482516300397
Data-mining to build a knowledge representation store for clinical decision support. Studies on curation and validation based on machine performance in multiple choice medical licensing examinations
Barry Robson, Srinidhi Boray
The differences between a BN and an HDN are as follows. A BN is essentially an estimate of a complicated joint or conditional probability, complicated because it considers many factors, states, events, measurements etc., that by analogy with XML tags and hence QUEL tags, we call attributes in the HDN context. In a BN, the complicated probability is seen as a probabilistic-algebraic expansion into many simpler conditional probabilities of general form P(x | y) = P(x, y)/P(y), simpler because each has fewer attributes. For example, one such may be of more specific form P(G | B, D, F, H), where B, D, F, G, H are attributes and the vertical bar ‘|’ is effectively a logical operator that has the sense of “conditional upon” or “if”, “derived from the sample of”, “is a set with members” or sometimes “is caused by”. Along with simple, self or prior probabilities such as P(D) all these probabilities multiply together, which implies use of logical AND between the statements they represent, to give the estimate. It is an estimate because the use of probabilities with fewer attributes assumes that attributes separated by being in different probabilities are statistically independent of each other. As previously described [2], one key difference in an HDN is that the individual probabilities are bidirectional, using a dual probability (P(x | y), P(y | x)), say (P(B, G | D, F, H), P(D, F, H | B, G)), which is a complex value, i.e., with an imaginary part [1, 2]. Another, the subject of the present report, is that for these probabilities to serve as semantic triples such as subject-verb-object as the Semantic Web requires, the vertical bar must be replaced by many other kinds of relationship. Yet another, which will be described in great detail elsewhere, is that there can be other kinds of operator between probabilities as statements than just logical AND.
All these aspects, and the notation used including for the format of QUEL, have direct analogies in the Dirac notation and algebra [8] developed in the 1920s and 1930s for quantum mechanics (QM). It is a widely accepted standard, the capabilities of which are described in Refs. [9-12] that are also excellent introductions. The primary difference between QM and QUEL and HDN methodologies is that the complex value in the latter cases is purely h-complex where h is the hyperbolic imaginary number such that hh = +1. The significance of this is that it avoids a description of the world in terms of waves and so behaves in an essentially classical way.
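The property hh = +1 can be illustrated with generic hyperbolic (split-complex) arithmetic, contrasted with the ordinary imaginary unit where ii = -1. This is a sketch of the number system only; the class name `HComplex` is hypothetical, and nothing here reproduces how QUEL or the HDN actually encode dual probabilities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HComplex:
    """A number a + b*h, where h is the hyperbolic imaginary unit with h*h = +1."""
    real: float
    hyper: float  # coefficient of h

    def __add__(self, other: "HComplex") -> "HComplex":
        return HComplex(self.real + other.real, self.hyper + other.hyper)

    def __mul__(self, other: "HComplex") -> "HComplex":
        # (a + b h)(c + d h) = (ac + bd) + (ad + bc) h, because h*h = +1.
        return HComplex(
            self.real * other.real + self.hyper * other.hyper,
            self.real * other.hyper + self.hyper * other.real,
        )

    def conjugate(self) -> "HComplex":
        return HComplex(self.real, -self.hyper)


h = HComplex(0.0, 1.0)
h_squared = h * h                 # real part +1, no h part
i_squared = complex(0.0, 1.0) ** 2  # ordinary imaginary unit squares to -1

print(h_squared, i_squared)
```

Repeated multiplication by the ordinary i rotates through negative values, which is what gives rise to wave-like behaviour; repeated multiplication by h does not.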
Inductive (Inferential Statistics) Reasoning: – Hyperbolic Dirac Net Reference : Notes on Synthesis of Forms by Christopher Alexander on Inductive Logic
Christopher Alexander: – Sometimes one of these structures is close enough to a real situation to be allowed to represent it. And then, because the logic is so tightly drawn, we gain insight into the reality, which was previously withheld from us.
Study Descriptive Statistics (Univariate – Bivariate – Multivariate)
Transformed Data Set
Univariate – Statistical Summary
Univariate – Probability Summary
Bivariate – Correlation Cluster
Correlation Cluster Varying the Pearson’s Coefficient
Scatter (Cluster) Plot – Linear Regression
Scatter (Cluster) Plot and Pearson Correlation Coefficient
What values can the Pearson correlation coefficient take?
The Pearson correlation coefficient, r, is a statistic representing how closely two variables covary; it can vary from -1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation).
Multivariate Regression
HDN Multivariate Probabilistic Inference – Computing in Hamiltonian System
Hyperbolic Dirac Net (HDN) – This computation is against Billion Tags in the Semantic Lake
What is the relative risk of needing to take BP medication if you are diabetic as opposed to not diabetic?
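In its conventional form, the relative risk asked about here is the ratio of two conditional probabilities, P(BP medication | diabetic) / P(BP medication | not diabetic). The sketch below uses the ordinary frequency-count calculation, not the HDN computation; the function name and all patient counts are hypothetical.

```python
def relative_risk(exposed_events, exposed_total, unexposed_events, unexposed_total):
    """Ratio of the event rate in the exposed group to that in the unexposed group."""
    risk_exposed = exposed_events / exposed_total        # P(event | exposed)
    risk_unexposed = unexposed_events / unexposed_total  # P(event | not exposed)
    return risk_exposed / risk_unexposed


# Hypothetical counts: 60 of 100 diabetic patients take BP medication,
# compared with 90 of 300 non-diabetic patients.
rr = relative_risk(60, 100, 90, 300)
print(rr)
```

A value above 1 means the outcome is more common among diabetic patients in the sample; a value of 1 would mean no difference between the groups.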
Note: – To conduct HDN Inference, bear in mind that getting all the combinations of factors by data mining is a “combinatorial explosion” problem, which lies behind the difficulty of Big Data as high dimensional data.
It applies in any kind of data mining, though it is most clearly apparent when mining structured data, a kind of spreadsheet with many columns, each of which is one of our dimensions. In considering combinations of demographic and clinical factors, say A, B, C, D, E, …, we ideally have to count the number of combinations (A), (A, B), (A, C), … (B, C, E), … and so on. Though sometimes assumptions can be made, you cannot always deduce a combination with many factors from those with fewer, nor vice versa. For N factors A, B, C, D, E, … there are 2^N - 1 possible combinations. So data with 100 columns as factors would imply about
1,000,000,000,000,000,000,000,000,000,000
combinations, each of which we want to observe several times and so count, to obtain probabilities. Finding what we need without knowing in advance exactly what it is distinguishes unsupervised data mining from statistics, in which traditionally we test a hunch, a hypothesis. But worse still, in our spreadsheet the A, B, C, D, E are really to be seen as column headings with, say, about n possible different values in the columns below them, and so roughly we are speaking of potentially needing to count not just, say, males and females but each of n^N different kinds of patient or thing. This results in a truly astronomic number of different things, each to be observed many times. If merely n = 10, then n^N is
10^100, that is, a 1 followed by 100 zeros.
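The two counts quoted above can be reproduced exactly, since Python integers have arbitrary precision. The function names are hypothetical; the formulas 2^N - 1 and n^N are the ones given in the text.

```python
def subset_count(num_factors: int) -> int:
    """Number of non-empty combinations of N factors: 2^N - 1."""
    return 2 ** num_factors - 1


def cell_count(values_per_factor: int, num_factors: int) -> int:
    """Number of distinct value assignments across N columns: n^N."""
    return values_per_factor ** num_factors


combos = subset_count(100)   # about 1.27 x 10^30
cells = cell_count(10, 100)  # exactly 10^100

print(f"{combos:.3e}")
print(len(str(cells)) - 1, "zeros after the leading 1")
```

A small case makes the first formula concrete: three factors A, B, C give the seven combinations (A), (B), (C), (A, B), (A, C), (B, C) and (A, B, C).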
There is a further implied difficulty, which in a strange way lifts much of the above challenge from the shoulders of researchers and of their computers. In most of the above cases, most of the things we are counting contain many of the factors A, B, C, D, E, etc. Such concurrences of so many things are typically rare, so many of the things we would like to count will never be seen at all, and most of the rest will be seen just 1, 2, or 3 times. Indeed, any reasonably rich patient record with lots of data will probably be unique on this planet. However, most approaches are unable to make proper use of that sparse data, since it seems that it would need to be weighted and taken into account in the balance of evidence according to the information it contains, and it is not evident how. The zeta approach tells us how to do that. In short, the real curse of high dimensionality is in practice not that our computers lack sufficient memory to hold all the different probabilities, but that this is also true for the universe: even in principle we do not have all the data with which to determine probabilities by counting, even if we could count and use them. Note that things that are never observed are, in the usual interpretation of zeta theory and of QUEL, assumed to have probability 1. In a purely multiplicative inference net, multiplying by probability 1 will have no effect. Information I = –log(P) for P = 1 means that information I = 0. Most statements of knowledge are, as philosopher Karl Popper argued, assertions awaiting refutation.
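The effect of the default probability of 1 in a purely multiplicative net, and the corresponding zero information, can be sketched as follows. This illustrates only the arithmetic stated above; it is not the zeta method or QUEL, and the factor names and probabilities are hypothetical.

```python
import math


def net_probability(observed, factors, default=1.0):
    """Multiply the probabilities of all factors; unobserved ones use the default of 1."""
    product = 1.0
    for factor in factors:
        product *= observed.get(factor, default)
    return product


def information(p):
    """Information I = -log(P), written as log(1/P) so that P = 1 gives 0.0."""
    return math.log(1.0 / p)


# Hypothetical net: A and B were observed, C never was.
observed = {"A": 0.5, "B": 0.25}
p_with_c = net_probability(observed, ["A", "B", "C"])
p_without_c = net_probability(observed, ["A", "B"])

print(p_with_c, p_without_c, information(1.0))
```

Because information is additive where probabilities are multiplicative, the unobserved factor contributes nothing to the total in either representation.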
Nonetheless the general approach in the fields of semantics, knowledge representation, and reasoning from it is to gather all the knowledge that can be got into a kind of vast and ever growing encyclopedia.
In The BioIngine.com™ the native data sets have been transformed into a Semantic Lake or Knowledge Representation Store (KRS) based on the QUEL Notational Language, such that they are now amenable to HDN-based Inferences. Where possible, probabilities are assigned; if not, the default probabilities are again 1.
The BioIngine.com – Deep Learning Comprehensive Statistical Framework – Descriptive to Probabilistic Inference
Given the challenge of analyzing against the large data sets both structured (EHR data) and unstructured data; the emerging Healthcare analytics are around below discussed methods d (multivariate regression), e (neuralnet) and f (multivariate probabilistic inference); Ingine is unique in the Hyperbolic Dirac Net proposition for probabilistic inference.
The basic premise in engineering The BioIngine.com™ is in acknowledging the fact that in solving knowledge extraction from the large data sets (both structured and unstructured), one is confronted by very large data sets riddled with highdimensionality and uncertainty.
Generally in solving insights from the large data sets the order in complexity is scaled as follows.
a) Insights around : “what”
For large data sets, descriptive statistics are adequate to extract a “what” perspective. Descriptive statistics generally delivers statistical summary of the ecosystem and the probabilistic distribution.
b) Univariate Problem : “what”
Considering some simplicity in the variables relationships or is cumulative effects between the independent variables (causing) and the dependent variables (outcomes):
i) Univariate regression (simple independent variables to dependent variables analysis)
c) Bivariate Problem : “what”
Correlation Cluster – shows impact of set of variables or segment analysis.
https://en.wikipedia.org/wiki/Correlation_clustering
From above link : In machine learning, correlation clustering or cluster editing operates in a scenario where the relationships between the objects are known instead of the actual representations of the objects. For example, given a weighted graph G = (V,E), where the edge weight indicates whether two nodes are similar (positive edge weight) or different (negative edge weight), the task is to find a clustering that either maximizes agreements (sum of positive edge weights within a cluster plus the absolute value of the sum of negative edge weights between clusters) or minimizes disagreements (absolute value of the sum of negative edge weights within a cluster plus the sum of positive edge weights across clusters). Unlike other clustering algorithms this does not require choosing the number of clusters k in advance because the objective, to minimize the sum of weights of the cut edges, is independent of the number of clusters.
http://www.statisticssolutions.com/correlationpearsonkendallspearman/
From above link. : Correlation is a bivariate analysis that measures the strengths of association between two variables. In statistics, the value of the correlation coefficient varies between +1 and 1. When the value of the correlation coefficient lies around ± 1, then it is said to be a perfect degree of association between the two variables. As the correlation coefficient value goes towards 0, the relationship between the two variables will be weaker. Usually, in statistics, we measure three types of correlations: Pearson correlation, Kendall rank correlation and Spearman correlation
d) Multivariate Analysis (Complexity increases) : “what”
§ Multiple regression (considering multiple univariate to analyze the effect of the independent variables on the outcomes)
§ Multivariate regression – where multiple causes and multiple outcomes exists
e) Neural Net : “what”
The above discussed challenges of analyzing multivariate pushes us into techniques such as Neural Net; which is the next level to Multivariate Regression Statistical Approach…. where multiple regression models are feeding into the next level of clusters, again an array of multiple regression models.
The above Neural Net method still remains inadequate in depicting “how” probably the human mind is operates. In discerning the health ecosystem for diagnostic purposes, for which “how”, “why” and “when” interrogatives becomes imperative to arrive at accurate diagnosis and target outcomes effectively. Its learning is “smudged out”. A little more precisely put: it is hard to interrogate a Neural Net because it is far from easy to see what are the weights mixed up in different pooled contributions, or where they come from.
“So we enter Probabilistic Computations which is as such Combinatorial Explosion Problem”.
f) Hyperbolic Dirac Net (Inverse or Dual Bayesian technique): – “how”, “why”, “when” in addition to “what”.
All the above are still discussing the “what” aspect. When the complexity increases the notion of independent and dependent variables become nondeterministic, since it is difficult to establish given the interactions, potentially including cyclic paths of influence in a network of interactions, amongst the variables. A very simple example in just a simple case is that obesity causes diabetes, but the also converse is true, and we may also suspect that obesity causes type 2 diabetes cause obesity. In such situation what is best as “subject” and what is best as “object” becomes difficult to establish. Existing inference network methods typically assume that the world can be represented by a Directional Acyclic Graph, more like a tree, but the real world is more complex than that that: metabolism, neural pathways, road maps, subway maps, concept maps, are not unidirectional, and they are more interactive, with cyclic routes. Furthermore, discovering the “how” aspect becomes important in the diagnosis of the episodes and to establish correct pathways, while also extracting the severe cases (chronic cases which is a multivariate problem). Indeterminism also creates an ontology that can be probabilistic, not crisp.
Note: From Healthcare Analytics perspective most Accountable Care Organization (ACO) analytics addresses the above based on the PQRS clinical factors, which are all quantitative. Barely useful for advancing the ACO into solving performance driven or value driven outcomes most of which are qualitative.
To conduct HDN Inference, bear in mind that getting all the combinations of factors by data mining is “ combinatorial explosion ” problem, which lies behind the difficulty of Big Data as high dimensional data.
It applies in any kind of data mining, though it is most clearly apparent when mining structured data, a kind of spreadsheet with many columns, each of which are our different dimensions. In considering combinations of demographic and clinical factors, say A, B, C, D, E.., we ideally have to count the number of combinations (A), (A,B) (A, C) …(B, C, E)…and so on. Though sometimes assumptions can be made, you cannot always deduce a combination with many factors from those with fewer, nor vice versa. In the case of the number N of factors A,B,C,D,E,… etc. the answer is that there are 2N1 possible combinations. So data with 100 columns as factors would imply about
1,000,000,000,000,000,000,000,000,000,000
combinations, each of which we want to observe several times, and so count, to obtain probabilities. Finding what we need without knowing in advance exactly what it is distinguishes unsupervised data mining from statistics, in which we traditionally test a hunch, a hypothesis. But worse still, in our spreadsheet A, B, C, D, E are really column headings with, say, about n possible different values in the columns below them, so roughly speaking we potentially need to count not just, say, males and females, but each of n^N different kinds of patient or thing. This results in a truly astronomical number of different things, each to be observed many times. If merely n = 10, then with N = 100 columns n^N is
10^100, that is, a 1 followed by 100 zeros.
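The two counts above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are invented here, and the brute-force enumeration is feasible only for a handful of factors, which is exactly the point of the combinatorial explosion.

```python
from itertools import combinations


def count_factor_combinations(n_factors: int) -> int:
    """Number of non-empty combinations of N factors: 2^N - 1."""
    return 2 ** n_factors - 1


def count_record_kinds(n_values: int, n_factors: int) -> int:
    """Number of distinct kinds of record when each of N columns takes n values: n^N."""
    return n_values ** n_factors


# Brute-force check on five factors: enumerate (A), (A, B), ..., (A, B, C, D, E).
factors = ["A", "B", "C", "D", "E"]
enumerated = [
    combo
    for size in range(1, len(factors) + 1)
    for combo in combinations(factors, size)
]
assert len(enumerated) == count_factor_combinations(len(factors))

print(count_factor_combinations(100))         # exact value of 2^100 - 1
print(len(str(count_record_kinds(10, 100))))  # number of digits in 10^100
```

For 100 columns the first figure is about 1.27 x 10^30, in line with the rounded number quoted above.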
There is a further implied difficulty, which in a strange way lifts much of the above challenge from the shoulders of researchers and of their computers. In most of the above cases, most of the things we are counting contain many of the factors A, B, C, D, E, etc. Such concurrences of so many things are typically rare, so many of the things we would like to count will never be seen at all, and most of the rest will be seen just 1, 2, or 3 times. Indeed, any reasonably rich patient record with lots of data will probably be unique on this planet. However, most approaches are unable to make proper use of such sparse data, since it would need to be weighted and taken into account in the balance of evidence according to the information it contains, and it is not evident how. The zeta approach tells us how to do that. In short, the real curse of high dimensionality is in practice not only that our computers lack sufficient memory to hold all the different probabilities, but that the same is true of the universe: even in principle, we do not have all the data with which to determine probabilities by counting, even if we could count and use them. Note that things that are never observed are, in the usual interpretation of zeta theory and of QUEL, assumed to have probability 1. In a purely multiplicative inference net, multiplying by probability 1 has no effect. Information I = –log(P) for P = 1 means that I = 0. Most statements of knowledge are, as the philosopher Karl Popper argued, assertions awaiting refutation.
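The effect of the default probability 1 can be illustrated with a minimal sketch. The natural logarithm and the example probabilities are assumptions made only for illustration, and the plain product stands in for a purely multiplicative inference net; this is not the zeta calculation itself.

```python
import math


def information(p: float) -> float:
    """Information I = -log(P); the natural logarithm is an assumption here."""
    return -math.log(p)


def multiplicative_net(probabilities) -> float:
    """A purely multiplicative net: the product of the statement probabilities."""
    return math.prod(probabilities)


observed = [0.8, 0.5]                  # hypothetical statements with counted probabilities
with_defaults = observed + [1.0, 1.0]  # two never-observed statements default to P = 1

# Statements at the default probability carry no information and leave the product unchanged.
assert information(1.0) == 0
assert multiplicative_net(with_defaults) == multiplicative_net(observed)
```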
Nonetheless the general approach in the fields of semantics, knowledge representation, and reasoning from it is to gather all the knowledge that can be got into a kind of vast and ever growing encyclopedia.
In The BioIngine.com™, the native data sets have been transformed into a Semantic Lake or Knowledge Representation Store (KRS) based on the QUEL notational language, so that they are now amenable to HDN-based inference. Where possible, probabilities are assigned; if not, the default probabilities are again 1.
The Bioingine.com : Onboarding PICO – Evidence Based Medicine [Large Data Driven Medicine]
The BioIngine.com Platform Beta launch on the anvil with below discussed EBM examples for all to Explore !!!
The Bioingine.com Platform is built on Wolfram Enterprise Private Cloud
 using the technology from one of the leading science and tech companies
 using Wolfram Technology, the same technology that is at every Fortune 500 company
 using Wolfram Technology, the same technology that is at every major educational facility in the world
 leveraging the same technology as WolframAlpha, the brains behind Apple’s Siri
Medical Automated Reasoning Programming Language environment [MARPLE]
References: On PICO Gold Standard
 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3140151/
 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1069073/
Formulating a researchable question: A critical step for facilitating good clinical research
Sadaf Aslam and Patricia Emmanuel
Abstract: Developing a researchable question is one of the challenging tasks a researcher encounters when initiating a project. Both, unanswered issues in current clinical practice or when experiences dictate alternative therapies may provoke an investigator to formulate a clinical research question. This article will assist researchers by providing step-by-step guidance on the formulation of a research question. This paper also describes PICO (population, intervention, control, and outcomes) criteria in framing a research question. Finally, we also assess the characteristics of a research question in the context of initiating a research project.
Keywords: Clinical research project, PICO format, research question
MARPLE – Question Format Medical Exam / PICO Setting
A good way to use MARPLE/HDNstudent is to set it up like an exam, which the student then answers. MARPLE then answers with its choices, i.e. candidate answers ranked by probability, proposing its own choice of answer as the most probable and explaining why (by the knowledge elements successfully used). This can then be compared with the examiner's intended answer, for which MARPLE's probability assessment can of course also be seen.
It is already the case that MARPLE is used to test exam questions, and it is scary that questions issued by a Medical Licensing Board can turn out to have been assigned an incorrect or unreachable answer by the examiner. The reason, on inspection, is that the question was ambiguous and potentially misleading, even though that may not have been obvious, or simply out of date: progress in science changed the answer, and it shows up fast on some new web page (Translational Research for Medicine in action!). Often the answer is wrong or misleading because there turns out to be a very strong alternative answer.
Formulating the Questions in PICO Format
The modern approach to formulation is the recommendation for medical best practice known as PICO.
 P is the patient, population or problem (primarily, what is the disease/diagnosis Dx?)
 I is the intervention or something happening that intervenes (what is the proposed therapy Rx: drug, surgery, or lifestyle recommendation?)
 C is some alternative to that intervention or something happening that can be compared (with what options, including no treatment?). This may also be considered in the context of different compared types of patient: female, diabetic, elderly, Hispanic, etc.
 O is the outcome, i.e. a disease state or set of such states that occurs, fails to occur, or is ideally terminated by the intervention such that health is restored. (Often that means the prognosis, but prognosis frequently implies a more complex scenario on a longer timescale further in the future.)
Put briefly, “For P, does I as opposed to C have outcome O?” is the PICO form.
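As a small illustration of the PICO form, the four parts can be held in a record and rendered as a question. The class name, field names, and the example values are hypothetical and are not part of MARPLE or The BioIngine.com.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PicoQuestion:
    """Hypothetical container for the four PICO parts."""
    population: str    # P: patient, population or problem
    intervention: str  # I: proposed therapy or intervening event
    comparison: str    # C: alternative, possibly "no treatment"
    outcome: str       # O: disease state that occurs or is terminated

    def as_sentence(self) -> str:
        return (
            f"For {self.population}, does {self.intervention} "
            f"as opposed to {self.comparison} have outcome {self.outcome}?"
        )


# Illustrative values only.
question = PicoQuestion(
    population="elderly diabetic patients",
    intervention="a lifestyle recommendation",
    comparison="no treatment",
    outcome="restored health",
)
print(question.as_sentence())
```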
The above kinds of probabilities are not necessarily the same as an essentially statistical analysis by structured data mining would deliver. All of these except C relate to associations, symptoms, Dx, Rx, outcome. It is C that is difficult. Probably the best interpretation is replacing Rx in associations with no Rx and then various other Rx. If C means say in other kinds of patients, then it is a matter of associations including those.
A second step of quantification is usually required, in which probabilities are obtained as measures of scope based on counting. Of particular interest here is the odds ratio.
Two Primary Methods of Asking a Question in The BioIngine
1. Primarily Symbolic and Qualitative. (more unstructured data dependent) [Release 1]
HDN is behind the scenes but focuses mainly on contextual probabilities between statements. HDNstudent is used to address the issue as a multiple-choice exam with an indefinitely large number of candidate answers, in which the expert end user can formulate PICO questions and candidate answers, or all of these can be derived automatically or semi-automatically. Each initial question can be split into a P, I, C, and O question.
2. Primarily Calculative and Quantitative. (more structured – EHR data dependent) [Release 2]
Focus on intrinsic probabilities, the degree of truth associated with each statement by itself. DiracBuilder used after DiracMiner addresses EBM decision measures as special cases of HDN inference. Of particular interest is an entry
<O | P, I> / <O | P, C>
which is the HDN likelihood or HDN relative risk of the outcome O for patient/population/problem P given I as opposed to C, the latter usually seen as a “NOT I”, and
<NOT O | P, I> / <NOT O | P, C>
which is the HDN likelihood or HDN relative risk of NOT getting the outcome O for patient/population/problem P given I as opposed to C, again usually seen as a “NOT I”. Note, though, that you get a two-for-one, because we also have <P, I | O>, the adjoint form, at the same time, since one is the complex conjugate of the other. Note that the ODDS RATIO is the former likelihood ratio over the latter, and hence the HDN odds ratio as it would normally be entered in DiracBuilder is as follows:
<O | P, I>
<NOT O | P, C>
/<O | P, C>
/<NOT O | P, I>
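The statement that the odds ratio is the former likelihood ratio over the latter can be written out explicitly. The block below is a sketch that reads each bra-ket as an ordinary conditional probability in the forward direction only; that reading is an assumption made for illustration and ignores the dual (adjoint) value.

```latex
\mathrm{OR}
  = \frac{P(O \mid P, I) \,/\, P(O \mid P, C)}
         {P(\lnot O \mid P, I) \,/\, P(\lnot O \mid P, C)}
  = \frac{P(O \mid P, I)\; P(\lnot O \mid P, C)}
         {P(O \mid P, C)\; P(\lnot O \mid P, I)}
```

So the two multiplied terms are <O | P, I> and <NOT O | P, C>, and the two dividing terms are <O | P, C> and <NOT O | P, I>.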
 QUALITATIVE / SYMBOLIC:
“An 84-year-old man in a nursing home has increasing poorly localized lower abdominal pain recurring every 3-4 hours over the past 3 days. He has no nausea or vomiting; the last bowel movement was not recorded. Examination shows a soft abdomen with a palpable, slightly tender, lower left abdominal mass. Hematocrit is 28%. Leukocyte count is 10,000/mm3. Serum amylase activity is within normal limits. Test of the stool for occult blood is positive. What is the diagnosis?”
•This is usually addressed by a declared list of multiple-choice candidate answers, though the list can be indefinitely large. 30 is not unusual.
•The answers are all assigned probabilities, and the most probable is considered the answer, at least for testing purposes in a medical licensing exam context. These can make use of intrinsic probabilities, but predominantly they are contextual probabilities, depending on the relationships between chains and networks of knowledge elements that link the question to each answer.
 QUANTITATIVE / CALCULATIVE:
“Will my female patient aged 50-59, taking diabetes medication and having a body mass index of 30-39, have very high cholesterol if the systolic BP is 130-139 mmHg, HDL is 50-59 mg/dL and non-HDL is 120-129 mg/dL?”
•This forms a preliminary Hyperbolic Dirac Net (inference net) from the query, which may be refined, and intrinsic probabilities are assigned to each statement, e.g. automatically by data mining.
•This question could properly start “What is the probability that…”. The real answers of interest here are not qualitative statements, but the final probabilities.
•Note the “IF”. But POPPER extends this to relationships beyond IF associative or conditional ones, e.g. verbs of action.
Quantitative Computations : Odds Ratio and Risk Computations
 Medical Necessity
 Laboratory Testing Principles
 Quality of Diagnosis
 Diagnosis Test Accuracy
 Diagnosis Test
 Sensitivity
 Specificity
 Predictive Values – Employing Bayes Theorem (Positive and Negative Value)
 Coefficient of Variations
 Resolving Power
 Prevalence and Incidence
 Prevalence and Rate
 Relative Risk and Cohort Studies
 Predictive Odds
 Attributable Risk
 Odds Ratio
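Several of the measures listed above follow directly from a 2x2 table of test results. The sketch below uses the standard textbook definitions, with the predictive values obtained through Bayes' theorem; the counts are invented for illustration and the functions are not taken from The BioIngine.com.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of diseased patients that the test flags as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of healthy patients that the test reports as negative."""
    return true_neg / (true_neg + false_pos)


def positive_predictive_value(sens: float, spec: float, prevalence: float) -> float:
    """P(disease | positive test) by Bayes' theorem."""
    true_pos_rate = sens * prevalence
    false_pos_rate = (1.0 - spec) * (1.0 - prevalence)
    return true_pos_rate / (true_pos_rate + false_pos_rate)


def negative_predictive_value(sens: float, spec: float, prevalence: float) -> float:
    """P(no disease | negative test) by Bayes' theorem."""
    true_neg_rate = spec * (1.0 - prevalence)
    false_neg_rate = (1.0 - sens) * prevalence
    return true_neg_rate / (true_neg_rate + false_neg_rate)


# Hypothetical counts: true/false positives and negatives.
TP, FN, TN, FP = 90, 10, 80, 20
sens = sensitivity(TP, FN)
spec = specificity(TN, FP)
prevalence = (TP + FN) / (TP + FN + TN + FP)
ppv = positive_predictive_value(sens, spec, prevalence)
npv = negative_predictive_value(sens, spec, prevalence)
print(sens, spec, ppv, npv)
```

With the prevalence taken from the same table, the Bayes values agree with the direct ratios TP/(TP+FP) and TN/(TN+FN).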
Examples Quantitative / Calculative HDN Queries
In The Bioingine.com Release 1 – we are only dealing with Quantitative / Calculative type questions
The examples discussed in section A below are simple to play with, to appreciate the power of HDN for conducting inference. However, problems B2 onwards require some deeper understanding of Bayesian and HDN analysis.
<'Taking BP medication':='1' | 'Taking diabetes medication':='1'>
/<'Taking BP medication':='1' | 'Taking diabetes medication':='0'>
A. Against Data Set 1.csv (2114 records with 33 variables, created for Cardiovascular Risk Studies (Framingham Risk Factor))
B. Against Data Set2.csv (nearly 700,000 records with 196 variables): truly a large data set with high dimensionality (many columns of clinical and demographic factors), leading to a combinatorial explosion.
Note: in the examples below, you are forming questions or HDN queries such as
“For African Caribbean patients 50-59 years old with a BMI of 50-59, what is the Relative Risk of needing to be on BP medication if there is a family history as opposed to no family history?”
IMPORTANT: THE TWO-FOR-ONE EFFECT OF THE DUAL. Calculations report a dual value for any probabilistic value implied by the expression entered. In some cases you may be interested only in the first number in the dual, but the second number is always meaningful and frequently very useful. Notably, we say Relative Risk by itself for brevity, but in fact this is only the first number in the dual that is reported. In general, the form
<'A':='1' | 'B':='1'>
/<'A':='1' | 'B':='0'>
yields the following dual probabilistic value:
( P('A':='1' | 'B':='1') / P('A':='1' | 'B':='0'), P('B':='1' | 'A':='1') / P('B':='0' | 'A':='1') )
where the first ratio is the relative risk RR and the second ratio is the predictive odds PO.
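The dual value can be imitated on a toy data set by plain counting. This is a sketch under stated assumptions: it uses classical conditional probabilities rather than zeta probabilities, the records and function names are invented, and it takes the second number to be P(B=1 | A=1) / P(B=0 | A=1), the adjoint of the form entered.

```python
def conditional_probability(records, target, given):
    """P(target | given) by counting; target and given map column name to value."""
    matching = [r for r in records if all(r.get(k) == v for k, v in given.items())]
    if not matching:
        raise ValueError("the conditioning event was never observed")
    hits = [r for r in matching if all(r.get(k) == v for k, v in target.items())]
    return len(hits) / len(matching)


def dual_relative_risk(records, a, b):
    """Return (RR, PO) for the form <a:=1 | b:=1> / <a:=1 | b:=0>."""
    rr = (conditional_probability(records, {a: "1"}, {b: "1"})
          / conditional_probability(records, {a: "1"}, {b: "0"}))
    po = (conditional_probability(records, {b: "1"}, {a: "1"})
          / conditional_probability(records, {b: "0"}, {a: "1"}))
    return rr, po


# Toy records: 30 with A=1,B=1; 20 with A=1,B=0; 10 with A=0,B=1; 40 with A=0,B=0.
records = (
    [{"A": "1", "B": "1"}] * 30
    + [{"A": "1", "B": "0"}] * 20
    + [{"A": "0", "B": "1"}] * 10
    + [{"A": "0", "B": "0"}] * 40
)
rr, po = dual_relative_risk(records, "A", "B")
print(rr, po)
```

Here the relative risk is (30/40) / (20/60) and the predictive odds are (30/50) / (20/50), so the two numbers in the dual generally differ.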
a. This inquiry about the risk of BP must be translated into a QUEL specification as shown below. [All the QUEL queries below can be copied and entered as the HDN query to get the HDN inference for the pertinent data sets.]
< 'Taking BP medication':='1' | 'Family history of BP':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking BP medication':='1' | 'Family history of BP':='0' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
b. The QUEL-specified query enables Notational Algebra to work while making inferences from the giant semantic lake or Knowledge Representation Store (KRS).
c. Recall that the KRS is the representation of the universe as a Hyperbolic Dirac Net. It was created by transforming the uploaded data set to activate the automated statistical studies.
d. The query works against the KRS and extracts the inference in HDN format, displaying an inverse Bayesian result that includes both classical and zeta probabilities: Pfwd, Pzfwd and Pbwd, Pzbwd.
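Because the relative-risk queries differ only in one attribute between numerator and denominator, they can be assembled by a small helper instead of being typed by hand. The helper below is hypothetical and only formats text: it assumes the bra-ket layout < outcome | condition and condition ... > with straight quotes, and it does not talk to the HDN engine.

```python
def quel_term(outcome, conditions):
    """Format one bra-ket term; outcome and each condition are (name, value) pairs."""
    def attribute(name, value):
        return f"'{name}':='{value}'"

    left = attribute(*outcome)
    right = " and ".join(attribute(name, value) for name, value in conditions)
    return f"< {left} | {right} >"


def relative_risk_query(outcome, factor, exposed, unexposed, shared):
    """Two-line query: the same outcome with the factor at two different values."""
    numerator = quel_term(outcome, [(factor, exposed)] + list(shared))
    denominator = quel_term(outcome, [(factor, unexposed)] + list(shared))
    return numerator + "\n/" + denominator


query = relative_risk_query(
    outcome=("Taking BP medication", "1"),
    factor="Family history of BP",
    exposed="1",
    unexposed="0",
    shared=[
        ("Ethnicity", "African Caribbean"),
        ("age(years)", "50-59"),
        ("BMI", "50-59"),
    ],
)
print(query)
```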
A1. Relative Risk – High BP Case
Example: – Study of BP = blood pressure (high) in the population data set considered.
This case is very similar, because high BP and diabetes are each comorbidities with high BMI and hence to some extent with each other. Consequently we just substitute diabetes by BP throughout.
Note: the values entered may be discrete or continuous.
(0) We can in fact test the strength of the above with the following RR, which in effect reads as “What is the relative risk of needing to take BP medication if you are diabetic as opposed to not diabetic?”
<'Taking BP medication':='1' | 'Taking diabetes medication':='1'>
/<'Taking BP medication':='1' | 'Taking diabetes medication':='0'>
The following predictive odds PO make sense and are useful here:
<'Taking BP medication':='1' | 'BMI':='50-59'>
/<'Taking BP medication':='0' | 'BMI':='50-59'>
and (separately entered)
<'Taking diabetes medication':='1' | 'BMI':='50-59'>
/<'Taking diabetes medication':='0' | 'BMI':='50-59'>
And the odds ratio OR would be a good measure here (as it works in both directions). Note that Pfwd = Pbwd theoretically for an odds ratio.
<'Taking BP medication':='1' | 'Taking diabetes medication':='1'>
<'Taking BP medication':='0' | 'Taking diabetes medication':='0'>
/<'Taking BP medication':='1' | 'Taking diabetes medication':='0'>
/<'Taking BP medication':='0' | 'Taking diabetes medication':='1'>
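That the odds ratio works in both directions can be checked on a 2x2 table of counts. The sketch below computes it once conditioning on diabetes medication (forward) and once conditioning on BP medication (backward); the counts are invented, and classical counting probabilities are assumed rather than the zeta values.

```python
def odds_ratio_both_directions(n11: int, n10: int, n01: int, n00: int):
    """n11: A=1,B=1; n10: A=1,B=0; n01: A=0,B=1; n00: A=0,B=0.

    Forward:  P(A=1|B=1) P(A=0|B=0) / ( P(A=1|B=0) P(A=0|B=1) )
    Backward: P(B=1|A=1) P(B=0|A=0) / ( P(B=1|A=0) P(B=0|A=1) )
    """
    b1, b0 = n11 + n01, n10 + n00  # column totals for B=1 and B=0
    a1, a0 = n11 + n10, n01 + n00  # row totals for A=1 and A=0
    forward = (n11 / b1) * (n00 / b0) / ((n10 / b0) * (n01 / b1))
    backward = (n11 / a1) * (n00 / a0) / ((n01 / a0) * (n10 / a1))
    return forward, backward


# Hypothetical counts with A = taking BP medication, B = taking diabetes medication.
forward, backward = odds_ratio_both_directions(n11=30, n10=20, n01=10, n00=40)
print(forward, backward)
```

Both directions reduce to n11 * n00 / (n10 * n01), which is why the forward and backward values coincide up to rounding.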
(1) For African Caribbean patients 50-59 years old with a BMI of 50-59, what is the Relative Risk of needing to be on BP medication if there is a family history as opposed to no family history?
< 'Taking BP medication':='1' | 'Family history of BP':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking BP medication':='1' | 'Family history of BP':='0' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
(2) For African Caribbean patients 50-59 years old with a family history of BP, what is the Relative Risk of needing to be on BP medication if there is a BMI of 50-59 as opposed to a reasonable BMI of 20-29?
< 'Taking BP medication':='1' | 'Family history of BP':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking BP medication':='1' | 'Family history of BP':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='20-29' >
(3) For African Caribbean patients with a family history of BP, what is the Relative Risk of needing to be on BP medication if there is an age of 50-59 rather than 40-49?
< 'Taking BP medication':='1' | 'Family history of BP':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking BP medication':='1' | 'Family history of BP':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='40-49' and 'BMI':='50-59' >
(4) For African Caribbean patients with a family history of BP, what is the Relative Risk of needing to be on BP medication if there is an age of 50-59 rather than 40-49?
< 'Taking BP medication':='1' | 'Family history of BP':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking BP medication':='1' | 'Family history of BP':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='40-49' >
(5) For African Caribbean patients with a family history of BP, what is the Relative Risk of needing to be on BP medication if there is an age of 50-59 rather than 40-49?
< 'Taking BP medication':='1' | 'Family history of BP':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking BP medication':='1' | 'Family history of BP':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='40-49' >
(6) For African Caribbean patients with a family history of BP, what is the Relative Risk of needing to be on BP medication if there is an age of 50-59 rather than 30-39?
< 'Taking BP medication':='1' | 'Family history of BP':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking BP medication':='1' | 'Family history of BP':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='30-39' and 'BMI':='40-49' >
(7) For African Caribbean patients with a family history of BP, what is the Relative Risk of needing to be on BP medication if there is an age of 50-59 rather than 20-29?
< 'Taking BP medication':='1' | 'Family history of BP':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking BP medication':='1' | 'Family history of BP':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='20-29' and 'BMI':='40-49' >
(8) For patients with a family history of BP, age 50-59 and BMI of 50-59, what is the Relative Risk of needing to be on BP medication if they are African Caribbean rather than Caucasian?
< 'Taking BP medication':='1' | 'Family history of BP':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking BP medication':='1' | 'Family history of BP':='1' and 'Ethnicity':='Caucasian' and 'age(years)':='50-59' and 'BMI':='50-59' >
(9) For patients with a family history of BP, age 50-59 and BMI of 50-59, what is the Relative Risk of needing to be on BP medication if they are African Caribbean rather than Asian?
< 'Taking BP medication':='1' | 'Family history of BP':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking BP medication':='1' | 'Family history of BP':='1' and 'Ethnicity':='Asian' and 'age(years)':='50-59' and 'BMI':='50-59' >
(10) For patients with a family history of BP, age 50-59 and BMI of 50-59, what is the Relative Risk of needing to be on BP medication if they are African Caribbean rather than Hispanic?
< 'Taking BP medication':='1' | 'Family history of BP':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking BP medication':='1' | 'Family history of BP':='1' and 'Ethnicity':='Hispanic' and 'age(years)':='50-59' and 'BMI':='50-59' >
A2. Relative Risk – Diabetes Case
Against Data Set1.csv
Type 2 diabetes is implied here.
(11) For African Caribbean patients 50-59 years old with a BMI of 50-59, what is the Relative Risk of needing to be on diabetes medication if there is a family history as opposed to no family history?
< 'Taking diabetes medication':='1' | 'Family history of diabetes':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking diabetes medication':='1' | 'Family history of diabetes':='0' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
(12) For African Caribbean patients 50-59 years old with a family history of diabetes, what is the Relative Risk of needing to be on diabetes medication if there is a BMI of 50-59 as opposed to a reasonable BMI of 20-29?
< 'Taking diabetes medication':='1' | 'Family history of diabetes':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking diabetes medication':='1' | 'Family history of diabetes':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='20-29' >
(13) For African Caribbean patients with a family history of diabetes, what is the Relative Risk of needing to be on diabetes medication if there is an age of 50-59 rather than 40-49?
< 'Taking diabetes medication':='1' | 'Family history of diabetes':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking diabetes medication':='1' | 'Family history of diabetes':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='40-49' and 'BMI':='50-59' >
(14) For African Caribbean patients with a family history of diabetes, what is the Relative Risk of needing to be on diabetes medication if there is an age of 50-59 rather than 40-49?
< 'Taking diabetes medication':='1' | 'Family history of diabetes':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking diabetes medication':='1' | 'Family history of diabetes':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='40-49' >
(15) For African Caribbean patients with a family history of diabetes, what is the Relative Risk of needing to be on diabetes medication if there is an age of 50-59 rather than 40-49?
< 'Taking diabetes medication':='1' | 'Family history of diabetes':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking diabetes medication':='1' | 'Family history of diabetes':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='40-49' >
(16) For African Caribbean patients with a family history of diabetes, what is the Relative Risk of needing to be on diabetes medication if there is an age of 50-59 rather than 30-39?
< 'Taking diabetes medication':='1' | 'Family history of diabetes':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking diabetes medication':='1' | 'Family history of diabetes':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='30-39' and 'BMI':='40-49' >
(17) For African Caribbean patients with a family history of diabetes, what is the Relative Risk of needing to be on diabetes medication if there is an age of 50-59 rather than 20-29?
< 'Taking diabetes medication':='1' | 'Family history of diabetes':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking diabetes medication':='1' | 'Family history of diabetes':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='20-29' and 'BMI':='40-49' >
A3. Relative Risk – Cholesterol Case
Against Data Set1.csv
(18) For African Caribbean patients 50-59 years old with a fat% of 40-49, what is the Relative Risk of needing to be on cholesterol medication if there is a family history as opposed to no family history?
< 'Taking cholesterol medication':='1' | 'Fat(%)':='40-49' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking cholesterol medication':='' | 'Fat(%)':='40-49' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
(19) For African Caribbean patients 50-59 years old with a fat% of 40-49 and a family history of cholesterol, what is the Relative Risk of needing to be on cholesterol medication if there is a BMI of 50-59 as opposed to a reasonable BMI of 20-29?
< 'Taking cholesterol medication':='1' | 'Fat(%)':='40-49' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking cholesterol medication':='1' | 'Fat(%)':='40-49' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='20-29' >
(20) For African Caribbean patients with a family history of cholesterol and a fat% of 40-49, what is the Relative Risk of needing to be on cholesterol medication if there is an age of 50-59 rather than 40-49?
< 'Taking cholesterol medication':='1' | 'Fat(%)':='40-49' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking cholesterol medication':='1' | 'Fat(%)':='40-49' and 'Ethnicity':='African Caribbean' and 'age(years)':='40-49' and 'BMI':='50-59' >
(21) For African Caribbean patients with a family history of cholesterol and a fat% of 40-49, what is the Relative Risk of needing to be on cholesterol medication if there is an age of 50-59 rather than 40-49?
< 'Taking cholesterol medication':='1' | 'Fat(%)':='40-49' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking cholesterol medication':='1' | 'Fat(%)':='40-49' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='40-49' >
(22) For African Caribbean patients with a family history of cholesterol and a fat% of 40-49, what is the Relative Risk of needing to be on cholesterol medication if there is an age of 50-59 rather than 40-49?
< 'Taking cholesterol medication':='1' | 'Fat(%)':='40-49' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking cholesterol medication':='1' | 'Fat(%)':='40-49' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='40-49' >
(23) For African Caribbean patients with a family history of cholesterol and a fat% of 40-49, what is the Relative Risk of needing to be on cholesterol medication if there is an age of 50-59 rather than 30-39?
< 'Taking cholesterol medication':='1' | 'Fat(%)':='40-49' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking cholesterol medication':='1' | 'Fat(%)':='40-49' and 'Ethnicity':='African Caribbean' and 'age(years)':='30-39' and 'BMI':='40-49' >
(24) For African Caribbean patients with a family history of cholesterol and a fat% of 40-49, what is the Relative Risk of needing to be on cholesterol medication if there is an age of 50-59 rather than 20-29?
< 'Taking cholesterol medication':='1' | 'Fat(%)':='40-49' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking cholesterol medication':='1' | 'Fat(%)':='40-49' and 'Ethnicity':='African Caribbean' and 'age(years)':='20-29' and 'BMI':='40-49' >
(25) For patients with a family history of cholesterol, age 50-59, BMI of 50-59 and a fat% of 40-49, what is the Relative Risk of needing to be on cholesterol medication if they are African Caribbean rather than Caucasian?
< 'Taking cholesterol medication':='1' | 'Fat(%)':='40-49' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking cholesterol medication':='1' | 'Fat(%)':='40-49' and 'Ethnicity':='Caucasian' and 'age(years)':='50-59' and 'BMI':='50-59' >
(26) For patients with a family history of cholesterol, age 50-59, BMI of 50-59 and a fat% of 40-49, what is the Relative Risk of needing to be on cholesterol medication if they are African Caribbean rather than Asian?
< 'Taking cholesterol medication':='1' | 'Fat(%)':='40-49' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking cholesterol medication':='1' | 'Fat(%)':='40-49' and 'Ethnicity':='Asian' and 'age(years)':='50-59' and 'BMI':='50-59' >
(27) For patients with a family history of cholesterol, age 50-59, BMI of 50-59 and a fat% of 40-49, what is the Relative Risk of needing to be on cholesterol medication if they are African Caribbean rather than Hispanic?
< 'Taking cholesterol medication':='1' | 'Fat(%)':='40-49' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking cholesterol medication':='1' | 'Fat(%)':='40-49' and 'Ethnicity':='Hispanic' and 'age(years)':='50-59' and 'BMI':='50-59' >
(28) For African Caribbean patients with a family history of cholesterol, age 50-59 and BMI of 50-59, what is the Relative Risk of needing to be on cholesterol medication if they have a fat% of 40-49 rather than 30-39?
< 'Taking cholesterol medication':='1' | 'Fat(%)':='40-49' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking cholesterol medication':='1' | 'Fat(%)':='30-39' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
< 'Taking diabetes medication':='1' | 'Family history of diabetes':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking diabetes medication':='1' | 'Family history of diabetes':='1' and 'Ethnicity':='Caucasian' and 'age(years)':='50-59' and 'BMI':='50-59' >
(29) For patients with a family history of diabetes, age 50-59 and BMI of 50-59, what is the Relative Risk of needing to be on diabetes medication if they are African Caribbean rather than Asian?
< 'Taking diabetes medication':='1' | 'Family history of diabetes':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking diabetes medication':='1' | 'Family history of diabetes':='1' and 'Ethnicity':='Asian' and 'age(years)':='50-59' and 'BMI':='50-59' >
(30) For patients with a family history of diabetes, age 50-59 and BMI of 50-59, what is the Relative Risk of needing to be on diabetes medication if they are African Caribbean rather than Hispanic?
< 'Taking diabetes medication':='1' | 'Family history of diabetes':='1' and 'Ethnicity':='African Caribbean' and 'age(years)':='50-59' and 'BMI':='50-59' >
/< 'Taking diabetes medication':='1' | 'Family history of diabetes':='1' and 'Ethnicity':='Hispanic' and 'age(years)':='50-59' and 'BMI':='50-59' >
(31) For patients with a family history of diabetes, age 50-59 and BMI of 50-59, what is the Relative Risk of needing to be on diabetes medication if they are African Caribbean rather than Caucasian?
The BioIngine.com Platform Beta Release 1.0 on the Anvil
The BioIngine.com™
Ingine; Inc™, The BioIngine.com™, DiracIngine™, MARPLE™ are all Ingine Inc © and Trademark Protected; also The BioIngine.com is Patent Pending IP belonging to Ingine; Inc™.
High Performance Cloud based Cognitive Computing Platform
The below figure depicts the healthcare analytics challenge as the order of complexity is scaled.
1. Introduction Beta Release 1.0
It is our pleasure to introduce startup venture Ingine; Inc, which brings to market The BioIngine.com™, a Cognitive Computing Platform for the Healthcare market, delivering Medical Automated Reasoning Programming Language Environment (MARPLE) capability based on mathematics borrowed from several disciplines, notably from the late Prof Paul A M Dirac’s Quantum Mechanics.
The BioIngine.com™ is a High Performance Cloud Computing Platform delivering Healthcare Large-Data Analytics capability derived from an ensemble of biostatistical computations. The automated biostatistical reasoning is a combination of “deterministic” and “probabilistic” methods employed against both structured and unstructured large data sets, leading into Cognitive Reasoning.
The BioIngine.com™ delivers Medical Automated Reasoning based on a Medical Automated Reasoning Programming Language Environment (MARPLE) capability, so better achieving 2nd-order semantic interoperability in the Healthcare ecosystem (see note 1 in the Appendix Notes).
The BioIngine.com™ is the result of several years of effort with Dr. Barry Robson, former Chief Scientific Officer, IBM Global Healthcare, Pharmaceutical and Life Science. His research has been in developing a quantum-math-driven exchange and inference language that achieves semantic interoperability, while also enabling a Clinical Decision Support System that is inherently Evidence Based Medicine (EBM). The solution, besides enabling EBM, also delivers knowledge graphs for Public Health surveys, including those sought by epidemiologists. Based on Dr Robson’s experience in the biopharmaceutical industry and pioneering efforts in bioinformatics, this has the data-mining-driven potential to advance pathways planning from clinical to pharmacogenomics.
The BioIngine.com™ brings the machinery of Quantum Mechanics to Healthcare analytics, delivering a comprehensive data science experience that covers both Patient Health and Population Health (Epidemiology) analytics, driven by a range of biostatistical methods from descriptive to inferential statistics, leading into evidence-driven medical reasoning.
The BioIngine.com™ transforms the large clinical data sets generated by interoperability architectures, such as a Health Information Exchange (HIE), into a “semantic lake” representing the Health ecosystem that is more amenable to biostatistical reasoning and knowledge representation. This capability delivers the evidence-based knowledge needed for a Clinical Decision Support System, better achieving Clinical Efficacy by helping to reduce medical errors.
The BioIngine.com™ platform, working against large clinical data sets or residing within a large Patient Health Information Exchange (HIE), creates opportunity for Clinical Efficacy, while also facilitating the “Efficiencies in the Healthcare Management” that an Accountable Care Organization (ACO) seeks.
Our endeavors have resulted in the development of revolutionary Data Science to deliver Health Knowledge by Probabilistic Inference. The solution addresses critical areas, both scientific and technical, notably the healthcare interoperability challenges of delivering semantically relevant knowledge at both the patient health (clinical) level and the public health level (Accountable Care Organization).
2. Why The BioIngine.com™?
The basic premise in engineering The BioIngine.com™ is the acknowledgment that, in extracting knowledge from large data sets (both structured and unstructured), one is confronted by very large data sets riddled with high dimensionality and uncertainty.
Generally, in extracting insights from large data sets, the order of complexity scales as follows:
A. Insights around : “what”
For large data sets, descriptive statistics are adequate to extract a “what” perspective. Descriptive statistics generally deliver a statistical summary of the ecosystem and its probability distribution.
B. Univariate Problem : “what”
Assuming simple relationships, or simple cumulative effects, between the independent variables (causes) and the dependent variables (outcomes):
a) Univariate regression (simple analysis of independent variables against dependent variables)
b) Correlation Cluster – shows the impact of a set of variables, or segment analysis.
https://en.wikipedia.org/wiki/Correlation_clustering
[From above link: In machine learning, correlation clustering or cluster editing operates in a scenario where the relationships between the objects are known instead of the actual representations of the objects. For example, given a weighted graph G = (V,E), where the edge weight indicates whether two nodes are similar (positive edge weight) or different (negative edge weight), the task is to find a clustering that either maximizes agreements (sum of positive edge weights within a cluster plus the absolute value of the sum of negative edge weights between clusters) or minimizes disagreements (absolute value of the sum of negative edge weights within a cluster plus the sum of positive edge weights across clusters). Unlike other clustering algorithms this does not require choosing the number of clusters k in advance because the objective, to minimize the sum of weights of the cut edges, is independent of the number of clusters.]
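As a minimal sketch of the objective quoted above (an illustration only, not part of The BioIngine.com™ platform), the following Python function scores a candidate clustering of a signed weighted graph by its disagreements. The function name `disagreement_cost`, the edge dictionary and the toy weights are all hypothetical.

```python
def disagreement_cost(edges, labels):
    """Sum |negative weights| inside clusters plus positive weights across clusters."""
    cost = 0.0
    for (u, v), weight in edges.items():
        same_cluster = labels[u] == labels[v]
        if same_cluster and weight < 0:
            cost += -weight          # dissimilar pair placed together
        elif not same_cluster and weight > 0:
            cost += weight           # similar pair split apart
    return cost


# Hypothetical signed graph: positive = similar, negative = different.
edges = {("a", "b"): 1.0, ("b", "c"): 0.5, ("a", "c"): -2.0, ("c", "d"): 1.5}
one_cluster = {"a": 0, "b": 0, "c": 0, "d": 0}
two_clusters = {"a": 0, "b": 0, "c": 1, "d": 1}

print(disagreement_cost(edges, one_cluster))   # 2.0
print(disagreement_cost(edges, two_clusters))  # 0.5
```

The second labelling has the lower cost, and nothing in the scoring required fixing the number of clusters in advance.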
C. Multivariate Analysis (Complexity increases) : “what”
a) Multiple regression (combining multiple univariate analyses to examine the effect of the independent variables on the outcomes)
b) Multivariate regression – where multiple causes and multiple outcomes exist
All of the above still address the “what” aspect. As complexity increases, the notions of independent and dependent variables become non-deterministic, because they are difficult to establish given the interactions among the variables, which may include cyclic paths of influence in a network of interactions. A very simple example is that obesity causes diabetes, but the converse is also true: we may suspect that obesity causes type 2 diabetes, which in turn causes obesity. In such a situation it becomes difficult to establish what is best treated as “subject” and what as “object”. Existing inference network methods typically assume that the world can be represented by a Directed Acyclic Graph, more like a tree, but the real world is more complex than that: metabolism, neural pathways, road maps, subway maps and concept maps are not unidirectional; they are more interactive, with cyclic routes. Furthermore, discovering the “how” aspect becomes important in diagnosing episodes and establishing correct pathways, while also extracting the severe cases (chronic cases, which are a multivariate problem). Indeterminism also creates an ontology that can be probabilistic, not crisp.
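To make the contrast with a Directed Acyclic Graph concrete, here is a small hedged sketch that checks whether a network of influences contains a cyclic path, using a standard depth-first search. The function name `has_cycle` and both example graphs are hypothetical illustrations; this is not the platform's own algorithm.

```python
def has_cycle(graph):
    """Return True if a directed graph (dict: node -> list of successors) has a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for successor in graph.get(node, []):
            state = color.get(successor, WHITE)
            if state == GREY:         # edge back into the current path
                return True
            if state == WHITE and visit(successor):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in list(graph))


# The obesity / type 2 diabetes example, with influence in both directions.
cyclic_influences = {"obesity": ["type 2 diabetes"], "type 2 diabetes": ["obesity"]}
# A tree-like net of the kind a DAG-based method assumes.
tree_like = {"cause": ["effect 1", "effect 2"], "effect 1": [], "effect 2": []}

print(has_cycle(cyclic_influences))  # True
print(has_cycle(tree_like))          # False
```

A net for which this check returns True cannot be represented as a DAG without dropping at least one direction of influence.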
Most ACO analytics address the above based on the PQRS clinical factors, which are all quantitative. These are barely useful for advancing the ACO toward performance-driven or value-driven outcomes, most of which are qualitative.
D. Neural Net : “what”
https://www.wolfram.com/language/11/neuralnetworks/?product=mathematica
The challenges of multivariate analysis discussed above push us toward techniques such as the Neural Net, the next level beyond the multivariate regression approach, in which multiple regression models feed into a next level of clusters that is again an array of multiple regression models.
The Neural Net method still remains inadequate in exposing “how” the human mind is probably organized in discerning the health ecosystem for diagnostic purposes, where “how”, “why”, “when” and so on become imperative to arrive at an accurate diagnosis and to target outcomes efficiently. Its learning is “smudged out”. Put a little more precisely: it is hard to interrogate a Neural Net because it is far from easy to see which weights are mixed up in the different pooled contributions, or where they come from.
“So we enter Probabilistic Computations, which are as such a Combinatorial Explosion Problem”.
E. Hyperbolic Dirac Net (Inverse or Dual Bayesian technique): – “how”, “why”, “when” in addition to “what”.
Note: Beta Release 1.0 only addresses HDN transformation and inference query against the structured data sets and Features A, B and E. However, as a non-packaged solution, features C and D can still be explored.
Release 2.0 will deliver full AI-driven reasoning capability (MARPLE) working against both structured and unstructured data sets. Furthermore, it will be designed to be customized for an EBM-driven “Point Of Care” and “Care Planning” productized user experience.
The BioIngine.com™ offers a comprehensive biostatistical reasoning experience in applying the data science discussed above, blending descriptive and inferential statistical studies.
Given the challenge of analyzing large data sets, both structured (EHR data) and unstructured, the emerging Healthcare analytics center on methods D and E discussed above; Ingine Inc is unique in its Hyperbolic Dirac Net proposition.
QUEL Toolkit for Medical Decision Making : Science of Uncertainty and Probabilities
Quantum Universal Exchange Language
• Implicate order and explicate order are concepts coined by David Bohm to describe two different frameworks for understanding the same phenomenon or aspect of reality. He uses these notions to describe how the same phenomenon might look different, or might be characterized by different principal factors, in different contexts such as at different scales. Macro vs Micro overcoming Cartesian Dilemma.
• The implicate order, also referred to as the “enfolded” order, is seen as a deeper and more fundamental order of reality.
• In contrast, the explicate or “unfolded” order includes the abstractions that humans normally perceive.
Emergent  Interoperability  Knowledge Mining  Blockchain
QUEL
• It is a toolkit / framework.
• It is an Algorithmic Language for constructing Complex Systems.
• It results in an Inferential Statistical mechanism suitable for a highly complex system: the “Hyperbolic Dirac Net”.
• It involves an approach based on the premise that a Highly Complex System driven by human social structures continuously strives to achieve a higher order in its entropic journey by continuously discerning the knowledge hidden in the system, which is in continuum.
• A System in Continuum seeking Higher and Higher Order is a Generative System.
• A Generative System brings the System itself as a Method to achieve Transformation. Similar is the case for the National Learning Health System.
• A Generative System as such is based on Distributed Autonomous Agents / Organizations, achieving Syndication driven by Self Regulation or Swarming behavior.
• Essentially, QUEL as a toolkit / framework algorithmically addresses interoperability, knowledge mining and blockchain, while driving the Healthcare Ecosystem into Generative Transformation, achieving higher and higher orders in the National Learning Health System.
• It has capabilities to facilitate medical workflow, continuity of care, and medical knowledge extraction and representation from vast sets of structured and unstructured data, automating biostatistical reasoning that leads into large-data-driven evidence based medicine, which in turn leads into clinical decision support systems, including knowledge management and Artificial Intelligence, and into public health and epidemiological analysis.
http://www.himss.org/achievingnationallearninghealthsystem
GENERATIVE SYSTEM :
https://ingine.wordpress.com/2013/01/09/generativetransformationsystemisthemethod/
A Large Chaotic System driven by Human Social Structures has two contending ways.
a. Natural Selection – Adaptive – Darwinian – Natural Selection – Survival Of Fittest – Dominance
b. Self Regulation – Generative – Innovation – Diversity – Cambrian Explosion – Unique Peculiarities – Co Existence – Emergent
The Accountable Care Organization (ACO), driven by the Affordable Care Act, transforms the present Healthcare System from adaptive (competitive) into generative (collaborative / coordinated), to achieve inclusive success and partake in the savings achieved. This is a generative systemic response, in contrast to the functional and competitive response of an adaptive system.
Natural selection seems to have resulted in functional transformation, where adaptation is the mode; it does not account for diversity.
Self Regulation seems to be a systemic outcome of integrative influence (the ecosystem), responding to the system constraints. It accounts for rich diversity.
The observer learns generatively from the system constraints for the type of reflexive response required (Refer – Generative Grammar – Immune System – http://www.ncbi.nlm.nih.gov/pmc/articles/PMC554270/pdf/emboj002690006.pdf)
From the above observation, if the theory of self regulation seems more correct and adheres to the laws of nature, in which generative learning occurs, then the assertion is that the “method” is offered by the system itself. The system’s ontology has an implicate knowledge of the processes required for transformation (David Bohm – Implicate Order).
For a very large complex system,
the System itself is the method – the impetus is the “constraint”.
In the video below, the ability for the cells to creatively create the script is discussed which makes the case for self regulated and generative complex system in addition to complex adaptive system.
Further Notes on QUEL / HDN :
• Brings Quantum Mechanics (QM) machinery to Medical Science.
• Is derived from the Dirac Notation that helped define the framework for describing QM. The resulting framework or language is QUEL, and it delivers a mechanism for inferential statistics: the “Hyperbolic Dirac Net”.
• Is created from a System Dynamics and Systems Thinking perspective.
• Is Systemic in approach, where the System is itself the Method.
• Engages probabilistic ontology and semantics.
• Creates a mathematical framework to advance Inferential Statistics for studying highly chaotic complex systems.
• Is an algorithmic approach that creates the Semantic Architecture of the problem or phenomenon under study.
• The algorithmic approach is a blend of linguistic semantics, artificial intelligence and systems theory.
• The algorithm creates the Semantic Architecture defined by a Probabilistic Ontology, representing the Ecosystem Knowledge distribution based on Graph Theory.
To make a decision in any domain, a knowledge compendium of the domain, or system knowledge, is first of all imperative.
A System Riddled with Complexity is generally a Multivariate System, which as such creates much uncertainty.
A highly complex system, being non-deterministic, requires probabilistic approaches to discern, study and model the system.
General Characteristics of Complex System Methods
• Descriptive statistics are employed to study the “WHAT” aspects of the System.
• Inferential Statistics are applied to study “HOW”, “WHEN”, “WHY” and “WHERE”, probing both spatial and temporal aspects.
• In a highly complex system, causality becomes indeterminable, meaning that the correlations or relationships between the independent and dependent variables are not obviously established. The variables also seem to interchange positions. This creates a dilemma between subject and object, and between causes and outcomes.
• When approaching a highly complex system, the prior and posterior are not definable, so inferential techniques in which the hypotheses are fixed before beginning the study of the system become unviable.
Review of Inferential Techniques as the Complexity is Scaled
Step 1: Simple System (turbulence level:1)
Frequentist: the simplest classical or traditional statistics, employed by treating the data as random under a steady-state hypothesis; the system is considered not uncertain (simple system). In Frequentist notions of statistics, probability is treated as a classical measure based only on the idea of counting and proportion. This technique applies probability to the data, where the data sets are rather small.
Increase the complexity: larger data sets, multivariate, the hypothesis model is not established, and there is a large variety of variables, each of which can combine (conditionally and jointly) in many different ways to produce the effect.
Step 2: Complex System (turbulence level:2)
Bayesian: the hypothesis is considered probabilistic, while the data is held at steady state. In Bayesian notions of statistics, probability is that of the hypothesis for a given set of data that is fixed. That is, the hypothesis is random and the data is fixed. The knowledge extracted contains the more subjectivist notions of uncertainty, belief, reliability, or confidence often used in automated inference and decision support systems.
Additionally, the hypothesis can be explored only in an acyclic fashion, creating Directed Acyclic Graphs (DAG).
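As a small illustration of the Bayesian view just described (fixed data, probabilistic hypothesis), the sketch below applies Bayes’ theorem to a single hypothesis. The function name `posterior` and the numbers are hypothetical and chosen only to show the arithmetic.

```python
def posterior(prior, likelihood, likelihood_if_false):
    """P(H | D) from P(H), P(D | H) and P(D | NOT H) via Bayes' theorem."""
    evidence = likelihood * prior + likelihood_if_false * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical values: P(H) = 0.1, P(D | H) = 0.9, P(D | NOT H) = 0.2.
print(round(posterior(0.1, 0.9, 0.2), 4))  # 0.3333
```

The data D stays fixed throughout; only the degree of belief in H moves, here from 0.1 to about one third.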
Increase the throttle on the complexity: very large data sets, both structured and unstructured; the hypothesis is random and multiple hypotheses are possible; anomalies can exist; there are hidden conditions; and the need arises to discover the “probabilistic ontology”, as it represents the system and the behavior within.
Step 3: Highly Chaotic Complex System (turbulence level:3)
Certainly a DAG is now inadequate, since we need to check probabilities as correlations and also as causations of the variables, and whether they conform to a hypothesis-producing pattern, meaning that some ontology is discovered which describes the peculiar intrinsic behavior among a specific combination of the variables that represents a hypothesis condition. There are many such possibilities within the system, hence a very chaotic and complex system.
Now the System itself seems probabilistic, regardless of the hypothesis and the data. This demands a Multi-Lateral Cognitive approach.
Telandic …. “Point – equilibrium – steady state – periodic (oscillatory) – quasi-periodic – Chaotic – and telandic (goal seeking behavior) are examples of behavior here placed in order of increasing complexity”
A Highly Complex System demands a Dragon Slayer – Hyperbolic Dirac Net (HDN) driven Statistics (Bidirectional Bayesian) for extracting the Knowledge from a Chaotic Uncertain System.
BioIngine.com : High Performance Cloud Computing Platform
Non-Hypothesis driven Unsupervised Machine Learning Platform delivering Medical Automated Reasoning Programming Language Environment (MARPLE)
Uncertainties in the Healthcare Ecosystem
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3146626/
BioIngine.com Platform
Is a High Performance Cloud Computing Platform delivering both probabilistic and deterministic computations, while combining HDN Inferential Statistics and Descriptive Statistics.
The biostatistical reasoning algorithms have been implemented in the Wolfram Language, which is a knowledge-based, unified symbolic programming language. As such, a symbolic language has good synergy in implementing Dirac Notational Algebra.
The BioIngine.com brings the Quantum Mechanics machinery to Healthcare analytics, delivering a comprehensive data science experience that covers both Patient Health and Public Health analytics, driven by a range of biostatistical methods from descriptive to inferential statistics, leading into evidence-driven medical reasoning.
The BioIngine.com transforms the large clinical data sets generated by interoperability architectures, such as a Health Information Exchange (HIE), into a semantic lake representing the Health ecosystem that is more amenable to biostatistical reasoning and knowledge representation. This capability delivers the evidence-based knowledge needed for a Clinical Decision Support System, better achieving Clinical Efficacy by helping to reduce medical errors.
Algorithm based on Hyperbolic Dirac Net (HDN)
An HDN is a dualization procedure performed on a given inference net; it consists of a pair of split-complex number factorizations of the joint probability and its dual (adjoint, reverse direction of conditionality). The Hyperbolic Dirac Net is derived from the Dirac Notational Algebra that forms the mechanism for defining Quantum Mechanics.
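For readers unfamiliar with split-complex (hyperbolic) numbers, the sketch below shows the arithmetic that makes such a dual factorization possible: numbers a + b·h with h·h = +1. The class `SplitComplex` and the helper `dual` are hypothetical names, and the particular encoding (forward probability on the idempotent (1 − h)/2, backward probability on (1 + h)/2) is an assumption made for illustration, not a statement of the platform's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitComplex:
    """A number real + hyper*h, where h*h = +1 (split-complex / hyperbolic)."""
    real: float
    hyper: float

    def __mul__(self, other):
        # (a + b h)(c + d h) = (ac + bd) + (ad + bc) h, because h*h = +1
        return SplitComplex(
            self.real * other.real + self.hyper * other.hyper,
            self.real * other.hyper + self.hyper * other.real,
        )

    def conjugate(self):
        """Flip the sign of the h part; in this encoding it swaps the two directions."""
        return SplitComplex(self.real, -self.hyper)


def dual(p_fwd, p_bwd):
    """Assumed encoding: p_fwd*(1 - h)/2 + p_bwd*(1 + h)/2."""
    return SplitComplex((p_fwd + p_bwd) / 2, (p_bwd - p_fwd) / 2)


# Multiplying two duals multiplies the forward and backward parts independently.
product = dual(0.5, 0.25) * dual(0.5, 0.5)
print(product == dual(0.25, 0.125))                # True
print(dual(0.5, 0.25).conjugate() == dual(0.25, 0.5))  # True
```

Because the two idempotents multiply to zero, one chain of multiplications carries both directions of conditionality at once, which is the property the dual net relies on.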
A Hyperbolic Dirac Net (HDN) is a truly Bayesian model and a probabilistic general graph model that includes cause and effect as players of equal importance. It is taken from the mathematics of Nobel Laureate Paul A. M. Dirac, which has been standard notation and algebra in physics for some 70 years. It includes but goes beyond the Bayes Net, which is seen as a special and (arguably) usually misleading case. In tune with nature, the HDN does not constrain interactions and may contain cyclic paths in the graphs representing the probabilistic relationships between all things (states, events, observations, measurements etc.). In the larger picture, HDNs define a probabilistic semantics and so are not confined to conditional relationships; they can evolve under logical, grammatical, definitional and other relationships. In its larger context, the HDN is also a model of the nature of natural language and of the human reasoning based on it that takes account of uncertainty.
Explanation: An HDN is an inference net, but it is best explained by showing that it stands in sharp contrast to the current notion of an inference net that, for historical reasons, is today often taken as meaning the same thing as a Bayes Net. “A Bayesian network, Bayes network, belief network, Bayes(ian) model or probabilistic directed acyclic graphical model is a probabilistic graphical model (a type of statistical model) that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG). For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.” [https://en.wikipedia.org/wiki/Bayesian_network]. In practice, such nets have little to do with Bayes, or with the Bayes’ rule, law, theorem or equation that allows verification that the probabilities used are consistent with each other and with all other probabilities that can be derived from the data. Most importantly, in reality all things interact in the manner of a general graph, and a DAG is in general a poor model of reality since it may consequently miss key interactions.
DiracMiner
Is a machine-learning-based biostatistical algorithm that transforms Large Data Sets, such as Millions of Patient Records, into a Semantic Lake as defined by HDN driven computations that are a mix of Number Theory (Riemann Zeta) and Information Theory (Dual Bayesian or HDN).
The HDN Semantic Lake represents the health ecosystem as captured in the Knowledge Representation Store (KRS), consisting of Billions of Tags (QUEL Tags).
DiracBuilder
Sends an HDN query to the KRS to seek an HDN probabilistic inference / estimate. The query contains the HDN that the user would like to have, and DiracBuilder helps get the best similar dual net by looking at which of the Billions of QUEL tags and joint probabilities are available.
High Performance Cloud Computing
The BioIngine.com Platform computes (probabilistic computations) against the billions of QUEL tags, employing an extended in-memory processing technique. The creation of the billions of QUEL tags and querying against them is a combinatorial explosion problem.
The BioIngine.com platform, whether working against large clinical data sets or residing within a large Patient Health Information Exchange (HIE), creates opportunity for Clinical Efficacy and also facilitates better achievement of the “Efficiencies in the Healthcare Management” that an ACO seeks.
Our endeavors have resulted in the development of revolutionary Data Science to deliver Health Knowledge by Probabilistic Inference. The solution addresses critical areas, both scientific and technical, notably the healthcare interoperability challenges of delivering semantically relevant knowledge at both the patient health (clinical) level and the public health level (Accountable Care Organization).
Multivariate Cognitive Inference from Uncertainty
Solving high-dimensional multivariate inference involving variable factors in excess of 4, representing the high dimensionality that characterizes the healthcare domain.
EBM Diagnostic Risk Factors and Calculating Predictive Odds
QUEL tags of the form
< A Pfwd:=x | assoc:=y | B Pbwd:=z >
where, say, A = disease and B = cause, drug, or diagnostic prediction of disease, are designed to imply the following, given the numbers x, y, and z:
P(A | B) = x
K(A; B) = P(A, B) / (P(A)P(B)) = y
P(B | A) = z
From which we can calculate the following:
P(A) = P(A | B) / K(A; B)
P(B) = P(B | A) / K(A; B)
P(NOT A) = 1 – P(A)
P(NOT B) = 1 – P(B)
P(A, B) = P(A | B)P(B) = P(B | A)P(A)
P(NOT A, B) = P(B) – P(A, B)
P(A, NOT B) = P(A) – P(A, B)
P(NOT A, NOT B) = 1 – P(A, B) – P(NOT A, B) – P(A, NOT B)
P(NOT A | B) = 1 – P(A | B)
P(NOT B | A) = 1 – P(B | A)
P(A | NOT B) = P(A, NOT B) / P(NOT B)
P(B | NOT A) = P(NOT A, B) / P(NOT A)
Positive Predictive Value P+ = P(A | B)
Negative Predictive Value P– = P(NOT A | NOT B)
Sensitivity = P(B | A)
Specificity = P(NOT B | NOT A)
Accuracy A = P(A | B) + P(NOT A | NOT B)
Predictive odds PO = P(A | B) / P(NOT A | B)
Relative Risk RR = Positive likelihood ratio LR+ = P(A | B) / P(A | NOT B)
Negative likelihood ratio LR– = P(NOT A | B) / P(NOT A | NOT B)
Odds ratio OR = P(A, B)P(NOT A, NOT B) / ( P(NOT A, B)P(A, NOT B) )
Absolute risk reduction ARR = P(NOT A | B) – P(A | B) (where A is disease and B is drug etc.)
Number Needed to Treat NNT = +1 / ARR if ARR > 0 (giving positive result)
Number Needed to Harm NNH = –1 / ARR if ARR < 0 (giving positive result)
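The relations above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform's implementation: the function name `derive_from_tag`, the dictionary keys and the example numbers are hypothetical, and only a subset of the listed measures is computed.

```python
def derive_from_tag(x, y, z):
    """Derive probabilities from one QUEL tag: x = P(A|B), y = K(A;B), z = P(B|A)."""
    p_a = x / y                       # P(A) = P(A|B) / K(A;B)
    p_b = z / y                       # P(B) = P(B|A) / K(A;B)
    p_ab = x * p_b                    # P(A, B) = P(A|B) P(B)
    p_nota_b = p_b - p_ab             # P(NOT A, B)
    p_a_notb = p_a - p_ab             # P(A, NOT B)
    p_nota_notb = 1.0 - p_ab - p_nota_b - p_a_notb
    return {
        "P(A)": p_a,
        "P(B)": p_b,
        "P(A,B)": p_ab,
        "PPV": x,                                        # P(A | B)
        "NPV": p_nota_notb / (1.0 - p_b),                # P(NOT A | NOT B)
        "sensitivity": z,                                # P(B | A)
        "specificity": p_nota_notb / (1.0 - p_a),        # P(NOT B | NOT A)
        "predictive_odds": x / (1.0 - x),                # P(A|B) / P(NOT A|B)
        "relative_risk": x / (p_a_notb / (1.0 - p_b)),   # P(A|B) / P(A|NOT B)
        "odds_ratio": (p_ab * p_nota_notb) / (p_nota_b * p_a_notb),
    }


# Hypothetical tag consistent with P(A) = 0.2, P(B) = 0.4, P(A, B) = 0.1.
measures = derive_from_tag(x=0.25, y=1.25, z=0.5)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

With these example numbers the relative risk comes out at 1.5 and the odds ratio at about 1.67; a tag with y = 1 (no association) would give both as 1.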
Example:
BP = blood pressure (high)
This case is very similar, because high BP and diabetes are each comorbidities with high BMI and hence to some extent with each other. Consequently we just substitute BP for diabetes throughout.
(0) We can in fact test the strength of the above with the following RR, which in effect reads as “What is the relative risk of needing to take BP medication if you are diabetic as opposed to not diabetic?”
<'Taking BP medication':='1' | 'Taking diabetes medication':='1'>
/<'Taking BP medication':='1' | 'Taking diabetes medication':='0'>
The following predictive odds PO make sense and are useful here:
<'Taking BP medication':='1' | 'BMI':='50-59'>
/<'Taking BP medication':='0' | 'BMI':='50-59'>
and (separately entered)
<'Taking diabetes medication':='1' | 'BMI':='50-59'>
/<'Taking diabetes medication':='0' | 'BMI':='50-59'>
And the odds ratio OR would be a good measure here (as it works in both directions). Note that Pfwd = Pbwd theoretically for an odds ratio.
<'Taking BP medication':='1' | 'Taking diabetes medication':='1'>
<'Taking BP medication':='0' | 'Taking diabetes medication':='0'>
/<'Taking BP medication':='1' | 'Taking diabetes medication':='0'>
/<'Taking BP medication':='0' | 'Taking diabetes medication':='1'>
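The tag ratios above can be mimicked on a small 2×2 table. The sketch below is purely illustrative: the patient counts are invented, the helper `conditional` is a hypothetical name, and the predictive odds are conditioned on diabetes medication rather than BMI simply to keep the table two-by-two.

```python
def conditional(counts, a, b):
    """Estimate P(A=a | B=b) from counts keyed by (a, b)."""
    total_b = sum(n for (_, b_value), n in counts.items() if b_value == b)
    return counts[(a, b)] / total_b


# Hypothetical counts keyed by (taking BP medication, taking diabetes medication).
counts = {(1, 1): 30, (0, 1): 20, (1, 0): 20, (0, 0): 80}

# RR: <BP=1 | diabetes=1> / <BP=1 | diabetes=0>
relative_risk = conditional(counts, 1, 1) / conditional(counts, 1, 0)
# PO: <BP=1 | diabetes=1> / <BP=0 | diabetes=1>
predictive_odds = conditional(counts, 1, 1) / conditional(counts, 0, 1)
# OR: <1|1> <0|0> / ( <1|0> <0|1> )
odds_ratio = (conditional(counts, 1, 1) * conditional(counts, 0, 0)) / (
    conditional(counts, 1, 0) * conditional(counts, 0, 1)
)

print(round(relative_risk, 6), round(predictive_odds, 6), round(odds_ratio, 6))
# 3.0 1.5 6.0
```

The odds ratio built from the four conditional values equals the usual cross-product of the counts, 30 × 80 / (20 × 20) = 6, and it is unchanged if the roles of the two variables are swapped, which is why it works in both directions.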