*Note: The Bioingine Cognitive Computing Platform employs the Hyperbolic Dirac Net, an advanced Bayesian approach that overcomes the acyclic constraint; the resulting statistical method delivers coherent results when dealing with a messy system in which both hypotheses and data have become random, disturbing the entropy.*

By Dr. Barry Robson

Bioingine.com (Ingine, Inc.)

**The hunger and thirst for more fine-grained and powerful descriptions of our nature, health and treatment has meant that the latest International Classification of Diseases takes us from ICD-9, with 3,824 procedure codes and 14,025 diagnosis codes, to ICD-10, with 71,924 procedure codes and 69,823 diagnosis codes**. It is important to contemplate what this really means for **extracting knowledge**, such as statements with associated probabilities, from clinical data that will help in public health analysis, biomedical research, and clinical decision making.

At first glance, it appears to be the problem of sitting down before an enormous feast at which new courses arrive faster and faster, more than we can ever consume. In liquid terms, it is of course the proverbial “drinking from the fire hose”. Medical information technology already abounds in entities that are in some sense elaborate statements, in that they are statements about many demographic and clinical factors at a time. For example, 30 to 40 factors, including age, gender, ethnicity, weight, systolic blood pressure, fasting glucose and so on, are routinely recorded in public health studies. Even if we merely recorded these in a binary way, with values yes/no, good/bad, normal/atypical and so on, there are a great many statements that we would like to explore and use, and they are represented by the many probabilities that we would ideally like to know.

*For just 30 factors there are mathematically potentially 2^60 − 1 = 1,152,921,504,606,846,975 different probabilities, e.g. P(gender=male), P(Hispanic=no and ‘type 2 diabetes’=no), P(‘age greater than 60’=yes and ‘systolic blood pressure’=‘not high’), P(male=no and smoker=yes and ‘heart attack’=yes), and so on up to the one probability that has all 30 factors.*

**For a basic health record with N=100 factors, we have at least 1,000,000,000,000,000,000,000,000,000,000 such statements.** Being more realistic, with of the order of X possible values for each factor we are speaking of the order of X to the power N, and for a typical electronic health record we begin to exceed the number of quarks in the known universe.

But then, we already began to appreciate that as we addressed the problem of seeking meaning in DNA after the human genome projects, where the factors of interest number three billion per patient, the approximate number of base pairs in our DNA. For 4 kinds of base (G, C, A or T) to the power 3 billion, your calculator will most likely just say “invalid input for function”.
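The magnitudes quoted above can be checked with exact integer arithmetic. The sketch below is illustrative only: it assumes the 30-factor figure counts the non-empty subsets of 60 factor=value states (30 binary factors, two values each), which is the reading that reproduces the quoted number. All function and variable names are hypothetical.

```python
import math


def nonempty_subsets(num_states: int) -> int:
    """Number of non-empty subsets of num_states distinct items."""
    return 2 ** num_states - 1


def full_assignments(values_per_factor: int, num_factors: int) -> int:
    """X to the power N: ways to give every factor one of X values."""
    return values_per_factor ** num_factors


# Assumption: 30 binary factors yield 60 factor=value states.
thirty_factor_count = nonempty_subsets(60)
print(f"{thirty_factor_count:,}")

# N = 100 binary factors: even the 2**100 complete yes/no records
# already exceed the "at least 10**30" figure in the text.
hundred_factor_records = full_assignments(2, 100)
print(hundred_factor_records >= 10 ** 30)

# 4 ** 3_000_000_000 is far too large to print, so count its
# decimal digits with logarithms instead of computing it.
genome_digits = math.floor(3_000_000_000 * math.log10(4)) + 1
print(f"4^(3 billion) has about {genome_digits:,} decimal digits")
```

Python integers have arbitrary precision, so the first two results are exact. The genome-scale figure is only sized by its digit count (roughly 1.8 billion digits), which is why an ordinary calculator gives up.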

This multifactor problem is often called “*the curse of high dimensionality*”, but it also implies a saving grace, which is an ultimate irony.

The ultimate irony is that the problem is one of sparse data. There is a *terrible famine amongst the feast of possibilities,* where it is mostly only crumbs that we can reach. Our fire hose mostly issues just a tiny drop at a time of the most interesting and precious liquids. Our most powerful computers appearing over the next twenty years will still not run out of steam because they will run out of data first. There aren’t actually 1,152,921,504,606,846,975 (billion tags) or 1,000,000,000,000,000,000,000,000,000,000 patients on planet Earth, let alone a number so large as to be “invalid input for function”. We won’t find the cases to count to obtain probabilities of interest. High dimensionality means lots and lots of sparse data, of which we see combinations occurring just one or two times, and far more never occurring at all.
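A small simulation can make the sparsity concrete. This is a minimal sketch under loud assumptions: the records are synthetic, each of the 30 binary factors is an independent fair coin flip (real clinical factors are correlated), and the record count of 20,000 is an arbitrary illustrative choice. The names are hypothetical.

```python
import random
from collections import Counter

NUM_FACTORS = 30       # binary factors per record (illustrative)
NUM_RECORDS = 20_000   # synthetic "patients" (illustrative)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Each record is a 30-bit integer: one bit per yes/no factor.
records = [rng.getrandbits(NUM_FACTORS) for _ in range(NUM_RECORDS)]
counts = Counter(records)

possible = 2 ** NUM_FACTORS  # all complete yes/no combinations
seen = len(counts)           # combinations observed at least once
seen_once = sum(1 for c in counts.values() if c == 1)
coverage = seen / possible

print(f"possible combinations: {possible:,}")
print(f"combinations observed: {seen:,}")
print(f"observed exactly once: {seen_once:,}")
print(f"fraction of possibilities observed: {coverage:.2e}")
```

Because the number of observed combinations can never exceed the number of records, the covered fraction here is bounded by 20,000 / 2^30, under two in a hundred thousand, whatever the random draw. Almost every combination that does appear is seen only once.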

**The problem is that if we take a “classical” statistical approach, declare that we have too little data in all of that, and simply ignore it, we overlook the fact that lots of weak evidence can add up to overthrow a decision made without it. Neglect of sparse data is a systematic error. Millions of crumbs can prevent someone somewhere from dying from hunger; millions of drops can stop someone somewhere dying from thirst.**
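How weak evidence can accumulate is easy to show with log-odds. The following is a minimal sketch, not the Hyperbolic Dirac Net method mentioned in the note at the top: it assumes the weak items of evidence are independent (a naive-Bayes simplification), and the starting log-odds of -2 and the likelihood ratio of 1.05 per item are invented purely for illustration.

```python
import math


def probability_from_log_odds(log_odds: float) -> float:
    """Convert natural-log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical decision based only on well-sampled evidence:
# log-odds of -2 means the hypothesis looks unlikely.
strong_evidence_log_odds = -2.0

# Hypothetical sparse evidence: 100 items, each barely informative
# (likelihood ratio 1.05), assumed independent of one another.
weak_likelihood_ratios = [1.05] * 100
weak_log_odds = sum(math.log(lr) for lr in weak_likelihood_ratios)

posterior_without = probability_from_log_odds(strong_evidence_log_odds)
posterior_with = probability_from_log_odds(
    strong_evidence_log_odds + weak_log_odds
)

print(f"ignoring the weak evidence:  {posterior_without:.3f}")
print(f"including the weak evidence: {posterior_with:.3f}")
```

Under these assumptions, no single weak item would change the decision, yet together they move the probability from below one half to above it. Dropping them all therefore biases the result in a consistent direction rather than merely adding noise.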

It is in these and many related areas that we must look for improvements in our handling of data and knowledge.