Large Data Analytics – on your iPad
[Big Data In Your Mini Space]
Combinatorial Explosion !!!
Hermitian Conjugates and Billion Tags
Hamiltonian Space Offering Deep Learning
The BioIngine.com™ Platform
The BioIngine.com™ offers a comprehensive bio-statistical reasoning experience that applies data science to blend descriptive and inferential statistical studies. Going forward, it will also blend NLP and AI to create a holistic Cognitive Experience.
The BioIngine.com™ is a High Performance Cloud Computing Platform delivering HealthCare Large-Data Analytics capability derived from an ensemble of bio-statistical computations. The automated bio-statistical reasoning combines “deterministic” and “probabilistic” methods applied to both structured and unstructured large data sets, leading into Cognitive Reasoning.
The figure below depicts the healthcare analytics challenge as the order complexity is scaled.
Given the challenge of analyzing large data sets, both structured (EHR data) and unstructured, the emerging healthcare analytics center on the methods discussed below as E (multivariate regression) and F (multivariate probabilistic inference). Ingine is unique in proposing the Hyperbolic Dirac Net for probabilistic inference.
The basic premise in engineering The BioIngine.com™ is acknowledging that, in extracting knowledge from large data sets (both structured and unstructured), one is confronted by very large data sets riddled with high dimensionality and uncertainty.
In general, when extracting insights from large data sets, the order of complexity scales as follows.
A) Descriptive Statistics :- Insights around :- “what”
For large data sets, descriptive statistics are adequate to extract a “what” perspective. Descriptive statistics generally deliver a statistical summary of the ecosystem and its probability distribution.
Descriptive statistics : Raw data often takes the form of a massive list, array, or database of labels and numbers. To make sense of the data, we can calculate summary statistics like the mean, median, and interquartile range. We can also visualize the data using graphical devices like histograms, scatterplots, and the empirical cdf. These methods are useful for both communicating and exploring the data to gain insight into its structure, such as whether it might follow a familiar probability distribution.
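As a minimal sketch of the summary statistics named above, the following uses only the Python standard library. The readings are hypothetical values invented for illustration, and the function names `summarize` and `empirical_cdf` are assumptions, not part of The BioIngine.com™.

```python
import statistics

# Hypothetical systolic blood pressure readings (mmHg); illustrative values only.
readings = [118, 122, 125, 130, 131, 135, 140, 142, 150, 165]

def summarize(values):
    """Return the mean, median, and interquartile range of a numeric sample."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }

def empirical_cdf(values, x):
    """Fraction of observations that are less than or equal to x."""
    return sum(v <= x for v in values) / len(values)

summary = summarize(readings)
print(summary)
print(empirical_cdf(readings, 131))
```

The interquartile range depends on the quantile convention; `statistics.quantiles` uses its default (exclusive) method here.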
i) Univariate Problem :- “what”
Assuming some simplicity in the relationships between the variables, or in the cumulative effects between the independent variables (causes) and the dependent variables (outcomes):-
Univariate regression (simple independent variables to dependent variables analysis)
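A univariate regression of one outcome on one independent variable can be sketched with the ordinary least-squares closed form. This is a generic textbook illustration under the assumption of a straight-line relationship; the function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```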
ii) Bivariate Problem :- “what”
Correlation Cluster – shows impact of set of variables or segment analysis.
From above link :- In machine learning, correlation clustering or cluster editing operates in a scenario where the relationships between the objects are known instead of the actual representations of the objects. For example, given a weighted graph G = (V,E), where the edge weight indicates whether two nodes are similar (positive edge weight) or different (negative edge weight), the task is to find a clustering that either maximizes agreements (sum of positive edge weights within a cluster plus the absolute value of the sum of negative edge weights between clusters) or minimizes disagreements (absolute value of the sum of negative edge weights within a cluster plus the sum of positive edge weights across clusters). Unlike other clustering algorithms this does not require choosing the number of clusters k in advance because the objective, to minimize the sum of weights of the cut edges, is independent of the number of clusters.
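The “disagreements” objective in the passage above can be made concrete with a small scoring function. This is only a sketch: the graph, its weights, and the name `disagreements` are hypothetical, and it scores candidate clusterings rather than searching for the optimum.

```python
def disagreements(weights, cluster_of):
    """Cost of a clustering: |negative weights| kept inside clusters
    plus positive weights cut between clusters."""
    total = 0.0
    for (u, v), w in weights.items():
        same_cluster = cluster_of[u] == cluster_of[v]
        if same_cluster and w < 0:
            total += -w          # dissimilar pair placed together
        elif not same_cluster and w > 0:
            total += w           # similar pair split apart
    return total

# Hypothetical signed graph: positive = similar, negative = different.
weights = {
    ("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): -1.0,
    ("c", "d"): 0.5, ("a", "d"): -2.0,
}

two_pairs = {"a": 0, "b": 0, "c": 1, "d": 1}
one_big = {"a": 0, "b": 0, "c": 0, "d": 0}
singletons = {"a": 0, "b": 1, "c": 2, "d": 3}

for name, clustering in [("two pairs", two_pairs), ("one cluster", one_big),
                         ("singletons", singletons)]:
    print(name, disagreements(weights, clustering))
```

Note that the number of clusters is never fixed in advance; it falls out of whichever assignment gives the lowest cost.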
From above link. :- Correlation is a bivariate analysis that measures the strengths of association between two variables. In statistics, the value of the correlation coefficient varies between +1 and -1. When the value of the correlation coefficient lies around ± 1, then it is said to be a perfect degree of association between the two variables. As the correlation coefficient value goes towards 0, the relationship between the two variables will be weaker. Usually, in statistics, we measure three types of correlations: Pearson correlation, Kendall rank correlation and Spearman correlation
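Two of the three coefficients mentioned above, Pearson and Spearman, can be sketched in a few lines of standard-library Python. The helper names are assumptions made for this illustration; Spearman is computed here as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient r of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

xs = [1, 2, 3, 4, 5]
squares = [x * x for x in xs]      # monotone but not linear in x
print(pearson(xs, squares), spearman(xs, squares))
```

For a monotone but non-linear relationship such as this one, Spearman reaches 1 while Pearson stays below it, which is the practical difference between the two.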
iii) Multivariate Analysis (Complexity increases) :- “what”
Multiple regression (considering multiple univariate regressions to analyze the effect of the independent variables on the outcomes)
Multivariate regression – where multiple causes and multiple outcomes exist
iv) Neural Net :- “what”
The challenges of multivariate analysis discussed above push us toward techniques such as the Neural Net, which is the next level beyond the multivariate regression statistical approach: multiple regression models feed into the next level of clusters, which is again an array of multiple regression models. The Neural Net method still remains inadequate in depicting “how” the human mind probably operates. In discerning the health ecosystem for diagnostic purposes, the “how”, “why” and “when” interrogatives become imperative to arrive at an accurate diagnosis and to target outcomes effectively. A Neural Net's learning is “smudged out”. Put a little more precisely: it is hard to interrogate a Neural Net because it is far from easy to see which weights are mixed up in the different pooled contributions, or where they come from.
“We enter probabilistic computations, which are as such a combinatorial explosion problem.”
B) Inferential Statistics : – Deeper Insights “how”, “why”, “when” in addition to “what”.
Hyperbolic Dirac Net (Inverse or Dual Bayesian technique)
All the above still address the “what” aspect. When the complexity increases, the notion of independent and dependent variables becomes non-deterministic, since the distinction is difficult to establish given the interactions amongst the variables, potentially including cyclic paths of influence in a network of interactions. A very simple example is that obesity causes diabetes, but the converse is also true, and we may also suspect a cycle in which obesity causes type 2 diabetes, which in turn causes obesity. In such a situation, what is best treated as “subject” and what as “object” becomes difficult to establish. Existing inference network methods typically assume that the world can be represented by a Directed Acyclic Graph, more like a tree, but the real world is more complex than that: metabolism, neural pathways, road maps, subway maps, and concept maps are not unidirectional; they are more interactive, with cyclic routes. Furthermore, discovering the “how” aspect becomes important in diagnosing episodes and establishing correct pathways, while also extracting the severe cases (chronic cases, which are a multivariate problem). Indeterminism also creates an ontology that can be probabilistic, not crisp.
Note: From a healthcare analytics perspective, most Accountable Care Organization (ACO) analytics address the above based on the PQRS clinical factors, which are all quantitative. These are barely useful for advancing the ACO toward solving performance-driven or value-driven outcomes, most of which are qualitative.
Notes On Statistics :-
Generally, one enters inferential statistics, a form of inductive reasoning, when there is no clear distinction between independent and dependent variables; this problem is further accentuated by the multivariate condition. As such, the problem becomes irreducible. Please refer to the MIT course work below to gain a better understanding of statistics and of the different statistical methods, descriptive and inferential. Pay particular attention to Bayesian statistics. The HDN inferential statistics being introduced in The BioIngine.com is an advancement on Bayesian statistics.
Introduction to Statistics Class 10, 18.05,
Spring 2014 Jeremy Orloff and Jonathan Bloom
From above link
a) What is Statistics?
The mathematical study of the likelihood and probability of events occurring based on known information and inferred by taking a limited number of samples. Statistics plays an extremely important role in many aspects of economics and science, allowing educated guesses to be made with a minimum of expensive or difficult-to-obtain data. A joke told about statistics (or, more precisely, about statisticians), runs as follows. Two statisticians are out hunting when one of them sees a duck. The first takes aim and shoots, but the bullet goes sailing past six inches too high. The second statistician also takes aim and shoots, but this time the bullet goes sailing past six inches too low. The two statisticians then give one another high fives and exclaim, “Got him!” (This joke plays on the fact that the mean of -6 and 6 is 0, so “on average, ” the two shots hit the duck.) Approximately 73.8474% of extant statistical jokes are maintained by Ramseyer.
b) Descriptive statistics
Raw data often takes the form of a massive list, array, or database of labels and numbers. To make sense of the data, we can calculate summary statistics like the mean, median, and interquartile range. We can also visualize the data using graphical devices like histograms, scatterplots, and the empirical cdf. These methods are useful for both communicating and exploring the data to gain insight into its structure, such as whether it might follow a familiar probability distribution.
c) Inferential statistics
Are concerned with making inferences based on relations found in the sample, to relations in the population. Inferential statistics help us decide, for example, whether the differences between groups that we see in our data are strong enough to provide support for our hypothesis that group differences exist in general, in the entire population.
d) Types of Inferential Statistics
i) Frequentist – 19th Century
Hypothesis Stable – Evaluating Data
Frequentist inference is a type of statistical inference that draws conclusions from sample data by emphasizing the frequency or proportion of the data. An alternative name is frequentist statistics. This is the inference framework in which the well-established methodologies of statistical hypothesis testing and confidence intervals are based.
ii) Bayesian Inference – 20th Century
Data Held Stable – Evaluating Hypothesis
In scientific experiments we start with a hypothesis and collect data to test the hypothesis. We will often let H represent the event ‘our hypothesis is true’ and let D be the collected data. In these words Bayes’ theorem says
P(H | D) = P(D | H) P(H) / P(D)
The left-hand term is the probability our hypothesis is true given the data we collected. This is precisely what we’d like to know. When all the probabilities on the right are known exactly, we can compute the probability on the left exactly. This will be our focus next week. Unfortunately, in practice we rarely know the exact values of all the terms on the right. Statisticians have developed a number of ways to cope with this lack of knowledge and still make useful inferences. We will be exploring these methods for the rest of the course.
A. Conditional Probability
P (A|B) is the probability of event A occurring, given that event B occurs.
In probability theory, conditional probability is a measure of the probability of an event given that (by assumption, presumption, assertion or evidence) another event has occurred. If the event of interest is A and the event B is known or assumed to have occurred, “the conditional probability of A given B“, or “the probability of A under the condition B“, is usually written as P(A|B), or sometimes PB(A). For example, the probability that any given person has a cough on any given day may be only 5%. But if we know or assume that the person has a cold, then they are much more likely to be coughing. The conditional probability of coughing given that you have a cold might be a much higher 75%.
The concept of conditional probability is one of the most fundamental and one of the most important concepts in probability theory. But conditional probabilities can be quite slippery and require careful interpretation. For example, there need not be a causal or temporal relationship between A and B.
B. Joint Probability
P (A,B) The probability of two or more events occurring together.
In the study of probability, given at least two random variables X, Y, …, that are defined on a probability space, the joint probability distribution for X, Y, … is a probability distribution that gives the probability that each of X, Y, … falls in any particular range or discrete set of values specified for that variable. In the case of only two random variables, this is called a bivariate distribution, but the concept generalizes to any number of random variables, giving a multivariate distribution.
P(A | B) = P(A,B) / P(B)
P(B | A) = P(B,A) / P(A)
P(B | A) = P(A,B) / P(A)
P(A | B) P(B) = P(A,B)
P(B | A) P(A) = P(A,B)
P(A | B) P(B) = P(A,B) = P(B | A) P(A)
P(A | B) = P(B | A) P(A) / P(B)
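The final identity can be checked numerically. The sketch below reuses the cough and cold figures quoted earlier (P(cough) = 5%, P(cough | cold) = 75%); the prior P(cold) = 2% is an assumption invented for this illustration, as is the function name `bayes`.

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A | B) = P(B | A) P(A) / P(B)."""
    if not 0.0 < p_b <= 1.0:
        raise ValueError("P(B) must be in (0, 1]")
    return p_b_given_a * p_a / p_b

p_cough = 0.05             # P(B), from the example in the text
p_cough_given_cold = 0.75  # P(B | A), from the example in the text
p_cold = 0.02              # P(A): hypothetical prior, not from the source

# Joint probability P(A, B) = P(B | A) P(A)
p_joint = p_cough_given_cold * p_cold

# Posterior P(A | B), computed two equivalent ways
p_cold_given_cough = bayes(p_cough_given_cold, p_cold, p_cough)
print(p_joint, p_cold_given_cough, p_joint / p_cough)
```

Under these assumed numbers, a cough raises the probability of a cold from 2% to about 30%.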
iii) C. Hyperbolic Dirac Net (HDN) – 21st Century
Non – Hypothesis driven unsupervised machine learning. Independent of both data and hypothesis.
Data-mining to build a knowledge representation store for clinical decision support. Studies on curation and validation based on machine performance in multiple choice medical licensing examinations
Barry Robson Srinidhi Boray
The differences between a BN and an HDN are as follows. A BN is essentially an estimate of a complicated joint or conditional probability, complicated because it considers many factors, states, events, measurements etc., that by analogy with XML tags and hence Q-UEL tags, we call attributes in the HDN context. In a BN, the complicated probability is seen as a probabilistic-algebraic expansion into many simpler conditional probabilities of general form P(x | y) = P(x, y)/P(y), simpler because each has fewer attributes. For example, one such may be of more specific form P(G | B, D, F, H), where B, D, F, G, H are attributes and the vertical bar ‘|’ is effectively a logical operator that has the sense of “conditional upon” or “if”, “derived from the sample of”, “is a set with members” or sometimes “is caused by”. Along with simple, self or prior probabilities such as P(D) all these probabilities multiply together, which implies use of logical AND between the statements they represent, to give the estimate. It is an estimate because the use of probabilities with fewer attributes assumes that attributes separated by being in different probabilities are statistically independent of each other. As previously described, one key difference in an HDN is that the individual probabilities are bidirectional, using a dual probability (P(x|y), P(y|x)), say (P(B, G | D, F, H), P(D, F, H | B, G)) which is a complex value, i.e., with an imaginary part [1, 2]. Another, the subject of the present report, is that for these probabilities to serve as semantic triples such as subject-verb-object as the Semantic Web requires, the vertical bar must be replaced by many other kinds of relationship. Yet another, which will be described in great detail elsewhere, is that there can be other kinds of operator between probabilities as statements than just logical AND.
All these aspects, and the notation used including for the format of Q-UEL, have direct analogies in the Dirac notation and algebra developed in the 1920s and 1930s for quantum mechanics (QM). It is a widely accepted standard, the capabilities of which are described in Refs. [9-12] that are also excellent introductions. The primary difference between QM and Q-UEL and HDN methodologies is that the complex value in the latter cases is purely h-complex where h is the hyperbolic imaginary number such that hh = +1. The significance of this is that it avoids a description of the world in terms of waves and so behaves in an essentially classical way.
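To make the property hh = +1 concrete, here is a minimal sketch of hyperbolic (split-complex) arithmetic. It illustrates only the number system; the class name `HComplex` is hypothetical, and the sketch does not attempt to reproduce how an HDN encodes a dual probability (P(x|y), P(y|x)) as an h-complex value.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HComplex:
    """A number a + b*h, where h is the hyperbolic imaginary unit (h*h = +1)."""
    re: float
    im: float  # coefficient of h

    def __add__(self, other):
        return HComplex(self.re + other.re, self.im + other.im)

    def __mul__(self, other):
        # (a + b h)(c + d h) = (ac + bd) + (ad + bc) h, because h*h = +1
        return HComplex(self.re * other.re + self.im * other.im,
                        self.re * other.im + self.im * other.re)

    def conjugate(self):
        return HComplex(self.re, -self.im)

h = HComplex(0.0, 1.0)
z = HComplex(3.0, 2.0)

print(h * h)              # real part +1, no h part
print(1j * 1j)            # ordinary complex unit squares to -1 instead
print(z * z.conjugate())  # a*a - b*b, a real value
```

With the ordinary imaginary unit, repeated multiplication rotates and produces oscillation; with h it does not, which is the sense in which the text says the wave description is avoided.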
Inductive (Inferential Statistics) Reasoning: – Hyperbolic Dirac Net Reference :- Notes on Synthesis of Forms by Christopher Alexander on Inductive Logic
The search for causal relations of this sort cannot be mechanically experimental or statistical; it requires interpretation: to practice it we must adopt the same kind of common sense that we have to make use of all the time in the inductive part of science. The data of scientific method never go further than to display regularities. We put structure into them only by inference and interpretation. In just the same way, the structural facts about a system of variables in an ensemble will come only from the thoughtful interpretation of observations.
But, in speaking of logic, we do not need to be concerned with processes of inference at all. While it is true that a great deal of what is generally understood to be logic is concerned with deduction, logic, in the widest sense, refers to something far more general. It is concerned with the form of abstract structures, and is involved the moment we make pictures of reality and then seek to manipulate these pictures so that we may look further into the reality itself. It is the business of logic to invent purely artificial structures of elements and relations.
Christopher Alexander: – Sometimes one of these structures is close enough to a real situation to be allowed to represent it. And then, because the logic is so tightly drawn, we gain insight into the reality, which was previously withheld from us.
Study Descriptive Statistics (Univariate – Bivariate – Multivariate)
Transformed Data Set
Univariate – Statistical Summary
Univariate – Probability Summary
Bivariate – Correlation Cluster
Correlation Cluster Varying the Pearson’s Coefficient
Scatter (Cluster) Plot – Linear Regression
Scatter (Cluster) Plot and Pearson Correlation Coefficient
What values can the Pearson correlation coefficient take?
The Pearson correlation coefficient, r, is a statistic representing how closely two variables co-vary; it can vary from -1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation).
HDN Multivariate Probabilistic Inference – Computing in Hamiltonian System
Hyperbolic Dirac Net (HDN) – This computation is against Billion Tags in the Semantic Lake
What is the relative risk of needing to take BP medication if you are diabetic as opposed to not diabetic?
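For orientation, the conventional frequency-based definition of relative risk is the ratio of two conditional probabilities: P(BP medication | diabetic) / P(BP medication | not diabetic). The sketch below computes it from raw counts; it is not the HDN computation, and the counts and the function name `relative_risk` are hypothetical.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Ratio of the outcome rate in the exposed group to that in the unexposed group."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed

# Hypothetical counts, invented for illustration:
# 60 of 100 diabetic patients take BP medication,
# 90 of 300 non-diabetic patients take BP medication.
rr = relative_risk(60, 100, 90, 300)
print(rr)
```

A value above 1 means the outcome is more frequent among the exposed group; under these made-up counts the diabetic group is twice as likely to take BP medication.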
Note: – To conduct HDN Inference, bear in mind that getting all the combinations of factors by data mining is a “combinatorial explosion” problem, which lies behind the difficulty of Big Data as high-dimensional data.
It applies in any kind of data mining, though it is most clearly apparent when mining structured data, a kind of spreadsheet with many columns, each of which is one of our dimensions. In considering combinations of demographic and clinical factors, say A, B, C, D, E, ..., we ideally have to count the number of combinations (A), (A, B), (A, C), ... (B, C, E), ... and so on. Though sometimes assumptions can be made, you cannot always deduce a combination with many factors from those with fewer, nor vice versa. For N factors A, B, C, D, E, ... the answer is that there are 2^N - 1 possible combinations. So data with 100 columns as factors would imply about 1.27 × 10^30 combinations, each of which we want to observe several times and so count them, to obtain probabilities. Finding what we need without knowing in advance exactly what it is distinguishes unsupervised data mining from statistics, in which traditionally we test a hunch, a hypothesis. But worse still, in our spreadsheet the A, B, C, D, E are really to be seen as column headings with, say, about n possible different values in the columns below them, and so roughly we are speaking of potentially needing to count not just, say, males and females but each of n^N different kinds of patient or thing. This results in a truly astronomic number of different things, each to be observed many times. If merely n = 10, then with N = 100 columns n^N is 10^100.
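The two counts in this passage, 2^N - 1 combinations of N factors and n^N distinct kinds of record, are easy to verify with Python's arbitrary-precision integers. The function names below are assumptions made for this sketch; the brute-force enumeration is only feasible for a handful of factors, which is the point.

```python
from itertools import combinations

def factor_combinations(num_factors):
    """Number of non-empty combinations of N factors: 2^N - 1."""
    return 2 ** num_factors - 1

def distinct_records(values_per_column, num_columns):
    """Number of distinct rows when each of N columns takes n values: n^N."""
    return values_per_column ** num_columns

# Brute-force check on a small case: list every non-empty subset of A, B, C.
factors = ["A", "B", "C"]
subsets = [c for r in range(1, len(factors) + 1)
           for c in combinations(factors, r)]

print(len(subsets), factor_combinations(len(factors)))
print(f"{factor_combinations(100):.3e}")
print(distinct_records(10, 100) == 10 ** 100)
```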
There is a further implied difficulty, which in a strange way lifts much of the above challenge from the shoulders of researchers and of their computers. In most of the above cases, most of the things we are counting contain many of the factors A, B, C, D, E, etc. Such concurrences of so many things are typically rare, so many of the things we would like to count will never be seen at all, and most of the rest will be seen just 1, 2, or 3 times. Indeed, any reasonably rich patient record with lots of data will probably be unique on this planet. However, most approaches are unable to make proper use of that sparse data, since it seems that it would need to be weighted and taken into account in the balance of evidence according to the information it contains, and it is not evident how. The zeta approach tells us how to do that. In short, the real curse of high dimensionality is in practice not that our computers lack sufficient memory to hold all the different probabilities, but that this is also true for the universe: even in principle we do not have all the data needed to determine probabilities by counting, even if we could count and use them. Note that probabilities of things that are never observed are, in the usual interpretation of zeta theory and of Q-UEL, assumed to have probability 1. In a purely multiplicative inference net, multiplying by probability 1 will have no effect. Information I = –log(P) for P = 1 means that information I = 0. Most statements of knowledge are, as philosopher Karl Popper argued, assertions awaiting refutation.
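The role of the default probability 1 can be seen in a few lines. This sketch covers only the two facts stated above, that a factor of 1 leaves a product unchanged and that I = -log(P) is 0 at P = 1; it does not implement the zeta approach, and the names and probabilities are hypothetical.

```python
import math

def information(p):
    """Information I = -log(P), written as log(1 / P); natural log units."""
    return math.log(1.0 / p)

def net_product(probabilities):
    """Estimate from a purely multiplicative inference net."""
    return math.prod(probabilities)

observed = [0.2, 0.5]              # hypothetical probabilities backed by data
with_defaults = observed + [1.0, 1.0]  # unobserved terms default to 1

print(net_product(observed), net_product(with_defaults))
print(information(1.0), information(0.5))
```

Because information is additive where probabilities are multiplicative, a default of P = 1 contributes exactly nothing on either scale.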
Nonetheless the general approach in the fields of semantics, knowledge representation, and reasoning from it is to gather all the knowledge that can be got into a kind of vast and ever growing encyclopedia.
In The BioIngine.com™, the native data sets have been transformed into a Semantic Lake or Knowledge Representation Store (KRS) based on the Q-UEL notational language, such that they are now amenable to HDN-based inferences. Probabilities are assigned where possible; where they are not, the default probability is again 1.