Author: Ouskova-Leonteva Anna (leopard152015@gmail.com)
Date: May 22, 2016
The aim of the current work is to study the complex etiology of obesity and to identify genetic variations and/or factors related to nutrition and physical exercise that contribute to its variability. By identifying specific combinations of factors, we hope to distinguish and predict cases of genetic and non-genetic obesity.
Obesity (body mass index > 30 kg/m2) is considered a global public health problem and occurs when energy intake from food and drink consumption is greater than energy expenditure through the body’s metabolism and physical activity over a prolonged period, resulting in the accumulation of excess body fat. Obesity is a serious concern because it is associated with heart disease, stroke, type 2 diabetes and certain types of cancer, some of the leading causes of preventable death.
According to the World Health Organization (2013), worldwide over 600 million adults and 42 million children are obese, figures which have more than doubled since 1980. In France, 6.5 million people are considered obese (14.5% of the adult population) [1].
The causes of the obesity epidemic over the past 30 years are known to be multifactorial (including genetic makeup, neuroendocrine disorders, emotions, and even secondary effects of medical treatments) [2]. Until recently, the rise in obesity was generally believed to result from a caloric imbalance, linked to the easy availability of energy-dense foods and the convenience of labor-saving, or more accurately ‘human-energy-saving’, devices for everyday activities and transport. However, the current paradigm that overeating of easily digestible carbohydrates and the resulting energy imbalance is the cause of obesity has recently been challenged [3]. Indeed, recent studies show that obesity is a highly complex disease with both causal and contributory links to metabolic dysfunction. The development of obesity is influenced by multiple pathways, including (i) eating behaviors directly regulated by the brain; (ii) energy expenditure in tissues such as liver, fat and muscle; and (iii) changes in adiposity mediated by adipocyte differentiation and lipid accumulation in adipose tissues [4].
The neuropathology of obesity has been studied by examining genetic and acquired brain diseases that cause or are influenced by obesity [5]. Bardet-Biedl syndrome (BBS), for example, is a monogenic cause of obesity, linked to abnormal detection in the brain of the peripheral signals that regulate energy homeostasis. BBS is a rare, generally autosomal-recessive disorder: it is clinically heterogeneous but is associated with six core features: obesity, retinal dystrophy, renal abnormalities, polydactyly, learning disability and urogenital tract deficits. Obesity affects 72–92% of BBS patients. The study of the origins of obesity in BBS helped make the link between inactivation of ciliary genes and disorders of the hypothalamic regulation of food intake. However, even in the context of a monogenic obesity such as BBS, it is clear that obesity is the result of imbalances in numerous signaling cascades combined with defects in the regulation of cellular differentiation pathways [6].
The project is a collaboration between the clinicians and molecular biologists studying BBS in the Laboratoire de Génétique Humaine, Strasbourg and the Complex Systems and Translational Bioinformatics team, Icube, Strasbourg.
Environmental factors, such as diet, physical activity, age, gender, socio-economic status and ethnicity, among others, have been shown to modulate the risk for obesity [13]. As obesity genetics makes further progress, considerable interest has recently turned to the potential interactions between obesity-predisposing gene variants and specific environmental situations. “Genetic testing is taking medical nutrition therapy to a new level where we can individualize lifestyle interventions to optimize outcomes based on a better understanding of what works and what does not, advancing precision medicine in this field,” said Dr. Osama Hamdy, Medical Director of the Obesity Clinical Program at Joslin Diabetes Center and Assistant Professor of Medicine at Harvard Medical School. The approach is a personalized nutrigenomic profile that provides individualized information to physicians and their patients in order to help them understand how genetics and lifestyle may impact diet, nutrition, and exercise. Personalized medicine for obesity is the use of information about the genetic makeup of a person with obesity to tailor strategies for preventing, detecting, treating, or monitoring their obesity. The article [18] proposed a strategy of personalized medicine for diabetes, in which the identification of risk factors through genotype is accompanied by an effective therapy. According to [18], the practice of personalized medicine generally involves four processes:
In this work we identify parameters relevant to obesity classification with a personalized approach in the case of BBS, and determine the role of machine learning in this context.
The vast number of parameters for personalization makes obesity management increasingly complex. Today, smart health technologies such as continuous glucose monitoring, physical activity detection, location and movement data, image recognition for planned meals [19] and speech-recognition diet tracking [20] offer large data sets which can be used for type classification, initialization and improvement of the therapy of an individual obese person. The large amount of generated data shows the importance of knowledge discovery in data handling/processing for therapy personalization. A better understanding of the correlation between phenotypes and genotypes will help to improve the accuracy of classification as well as of the treatment process. In fact, bioinformatics research entails many problems that can be solved by data mining tasks. Concretely, recognition of physical activity and heart rate using wearable sensors can provide valuable information regarding an individual’s movements and help us to determine some aspects of their health. The main contributions of this work are 4-fold:
As a result, we hope to determine, in the case of BBS:
All analyses were made using different packages from the R software [14].
We use R because:
Our goal is to create a research cohort consisting of obese patients with and without genetic factors, who will share genetic data, biological samples, diet/lifestyle information, and in particular sensor/mobile data (see section 4.2).
As one of the tasks, there is a need for tools such as sensor/mobile data that allow people to measure their physical activity intensity and energy intake over the course of a day, make informed decisions, and perform energetic trade-offs. Daily information about health conditions, physical activity level, energy expenditure, and energy intake is central to weight control, since energy balance is defined by the interaction between these variables. A Wireless Body Area Network (WBAN) is a special kind of autonomous sensor network evolved to provide a wide variety of services. WBAN has emerged as a technology for e-healthcare that allows data concerning a patient’s vital body parameters and movements to be collected by small wearable or implantable sensors and communicated using short-range wireless communication techniques, and it has shown great potential in improving healthcare quality. Today, WBAN has become an integral component of healthcare management systems in which a patient needs to be monitored both inside and outside of home or hospital. Unfortunately, existing technologies in the area of physical activity are mostly designed for individuals who have already achieved a high level of physical fitness and are in good health. Most of them are equipped with a heart rate sensor and an accelerometer and provide means to track heart rate, hours of sleep, speed, distance, pace and calories burned. But this is not enough for a medical application. The accurate measurement of physical activity, energy expenditure, and energy intake is challenging and, at present, there is no technology that allows people to measure these variables comfortably, accurately, and continuously over the course of a day and obtain real-time feedback. One of the challenges for this project will be to define technologies (a system of multiple devices and a set of algorithms) that can measure and analyze many physical parameters (such as galvanic skin response, blood oxygen saturation, calorie intake, blood pressure, blood glucose) and that are adapted for use by patients who have difficulty maintaining a healthy weight and minimum levels of physical activity every day. As we will be collecting human patient-related data, safety issues need to be taken into account to avoid any risk to life, as well as security and privacy protection issues for the data collected from a WBAN, either while stored inside the WBAN or during its transmission outside of the WBAN. Controlling congestion and security is a major unsolved concern, with challenges resulting from the stringent resource constraints of WBAN devices and the high demand for both security/privacy and practicality/usability. In this work, we will address two important data security issues: secure and dependable distributed data storage, and fine-grained distributed data access control for sensitive and private patient medical data, as well as the diverse practical issues that need to be taken into account while fulfilling the security and privacy requirements.
The objective classification of Health Trackers (HTs) is an important condition for understanding the complex relationship between health and physical activity. Many different HTs were considered during this study. Generally, we can divide them by commercial availability, manufacturer (big brands vs. start-ups or future projects) and area of application (medical vs. fitness). Further details about these sensors are provided in Table 1.
Table 1

How can we know which one is the best? Various HTs have been tested and validated for field-based research in both healthy and chronically diseased populations [21]. There are many studies about the validity of the energy consumption algorithms embedded in different devices [22] in the medical area rather than in fitness trackers [26]. According to [26], the 2014 results of energy expenditure estimation during a 69-minute protocol were as follows: the mean absolute percent error values were 9.3%, 10.1%, 10.4%, 12.2%, 12.6%, 12.8%, 13.0%, and 23.5% for the BodyMedia FIT, Fitbit Zip, Fitbit One, Jawbone Up, ActiGraph, DirectLife, NikeFuel Band, and Basis B1 Band, respectively.
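The accuracy metric quoted above, mean absolute percent error, can be computed as follows (the device and criterion values in the example are hypothetical):

```python
# Mean absolute percent error (MAPE), the metric used above to compare
# device EE estimates against a criterion measure.
def mape(estimates, reference):
    """Return the mean absolute percent error of estimates vs. reference."""
    errors = [abs(e - r) / r * 100.0 for e, r in zip(estimates, reference)]
    return sum(errors) / len(errors)

# Hypothetical example: a tracker reports 450 and 520 kcal for two sessions
# whose criterion (indirect calorimetry) values are both 500 kcal.
print(mape([450, 520], [500, 500]))  # -> 7.0
```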
The aim of this section is to compare four commercially available activity monitors, the most suitable for our case study, in terms of variety of functions and accuracy of measurement:
GoBe and AngelSensor are start-up products. The Jawbone UP24/UP3 and the Microsoft Band 2 are two of the most popular fitness trackers in the world.
All of these gadgets have three things in common:
These HTs use accelerometry combined with physiological sensors such as heart rate, temperature and galvanic skin response to increase their accuracy in predicting energy expenditure and discriminating activity types. Activity-type-specific equations are generally implemented in HTs to model energy expenditure [23]. These equations are based on measurement of step detection, but only a few studies have focused specifically on the accuracy of this estimate. In [24] the authors compared the accuracy of a pedometer and a triaxial accelerometer in their step count estimates in patients with Parkinson’s disease, using video recordings as reference. The error of the pedometer was speed-dependent, ranging between 4.5% and 17.2%, whereas the error of the accelerometer was around 7%. In another work [25], a medical HT (SenseWear Armband Mini) was tested in a group of patients with chronic pulmonary disease, and the results at slow speeds (1.4-1.8 km/h) were not adequately accurate.
There are other things to consider as well, such as data extraction (SDK, API), price, the set of sensors, intuitive application interface design, and the capability to motivate users.
GoBe is a bracelet which measures the wearer’s heart rate, calories burned, sleep, and stress levels. That’s all conceivable, given what the Jawbone UP3 and other body trackers already measure.
But it is interesting for its “patented flow technology”. With it, GoBe promises something a little more sensational: automatically tracking the calories of everything the wearer eats, through his or her skin.
The automatic calorie-tracking, which GoBe claims to do by reading glucose levels in cells, would revolutionize dieting: even the best calorie-counting apps today rely on manual food logging. The premise seems lofty, in fact. Let’s say GoBe does measure glucose levels without piercing the skin, as it claims to do. That would be a godsend to diabetics, who, as it stands, must regularly prick their fingers to test blood sugar. The less-invasive technology is probably coming soon, Michelle MacDonald (a clinical dietician at the National Jewish Health hospital in Denver) told PandoDaily, “but when it does it will be the size of a shoebox … It will come from a big lab, will be huge news and make a lot of money.” But on top of that, blood glucose is only a rough measure of total energy intake. Eat a tablespoon of olive oil, and you’ve consumed 119 calories, but your blood sugar would barely rise. A very thin slice of white bread, meanwhile, would send blood sugar soaring and yields only 40 calories.
A developer API will be available later; for now, data extraction is not possible.
AngelSensor M1 is an open source fitness tracker that provides a lot of health data. Its main advantage is the concept of providing an open platform for developers to create their own applications, relying on the raw data streams sent by the sensor via Bluetooth Low Energy. This opens up varied opportunities, because being able to accurately quantify physical activity is important for epidemiological research, so that relations between physical activity, health status, environmental factors, and so on, can be determined. It is extremely hard to find open platforms with decent usability and low cost, especially when it comes to physiological data (ECG, PPG, etc.); there is basically nothing on the market. In this context AngelSensor stands out from the other sensors. However, it is more a prototype than a final device: some functions like blood oxygen or sleep quality are not ready yet and will become available later via firmware update. There is no site to store data; data is sent only to the mobile application. Temperature measurements are inaccurate, reading 2-3 °C lower than actual, but reflect temperature changes well. The same applies to the heart rate parameter. Over 20% of steps are missed, but it is possible to read the raw data directly to implement other step-counting algorithms later. So this is a prototype that mostly works, but a dedicated application or API is needed to expose all implemented parameters. It has not only a 3-axis accelerometer but also a gyroscope. Concerning the usability of the current applications: not all parameters are monitored, and there is no history-viewing capability in the current app version. The app is in alpha version for Android and has more features on iOS, but the project is in active development. It also provides an SDK for developers, quite good for customization and for creating our own applications. This is not a device for consumers, but it is a device for developers.
The sensor comes with an open source project, hosted on GitHub, which is quite useful as a starting point for new applications. The documentation is far from sufficient, but the support staff provide some help. We made a small example of getting and using data from this sensor, to determine whether we could apply it in this work for estimating health parameters and measuring energy expenditure. The sensor transmits data using the standard Bluetooth Low Energy protocol. By default it sends only heart rate data, but it is also possible to receive raw data from two channels, PPG and the accelerometer. Our application simply stores these signals to a CSV file; a description of the protocols can be found in the sensor's documentation. After observing the data, we found that it contains considerable noise. Here we propose a data example from the accelerometer:
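As a stand-in for the raw trace, here is a minimal sketch of the logging step itself: incoming samples (made-up values; the BLE receive loop is platform-specific and omitted) are written to a CSV file.

```python
import csv
import io

# Minimal sketch of the CSV logging step: each incoming sample is assumed to
# be a (timestamp, x, y, z) tuple from the accelerometer stream. The actual
# BLE notification parsing depends on the device protocol and is not shown.
def log_samples(samples, fileobj):
    writer = csv.writer(fileobj)
    writer.writerow(["t", "ax", "ay", "az"])  # header row
    for t, x, y, z in samples:
        writer.writerow([t, x, y, z])

buf = io.StringIO()
log_samples([(0.00, 12, -3, 980), (0.02, 15, -1, 978)], buf)
print(buf.getvalue().splitlines()[0])  # -> t,ax,ay,az
```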
Interpreting the raw PPG data is much more difficult, because the signal appears very close to noise. This is explained by the fact that PPG signal quality depends on sensor factors such as the layout of the optical sensor, which wavelength of light is used, and the quality of the amplifiers. We need to avoid warping the pulse of the real signal by filtering, in view of the risk of mixing some of the broad-spectrum noise with the signal while removing some of the higher harmonics of the true pulse waveform. In any case, before feature extraction we have to apply a data filter, taking into account that there might be important information in the low and high frequency spectra of our time series. The R package “signal” allows us to separate them by providing a Butterworth digital filter. In conclusion, using AngelSensor leaves us free to develop our own system of energy expenditure estimation, but on the other hand, the PPG data seems too noisy for in-depth heart rate variability analysis. The accelerometer data, in contrast, seems usable. As we did not have enough time to collect a big dataset ourselves, for the activity recognition analysis (an important step in measuring energy expenditure, see 4.1.2) we will use a smartphone dataset [27], which contains analogous physical activity information from a gyroscope and accelerometer.
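The filtering step described above uses the R “signal” package; the same idea can be sketched in Python with scipy.signal on a synthetic pulse (the sampling rate and noise level here are made up for illustration):

```python
import numpy as np
from scipy.signal import butter, filtfilt

np.random.seed(0)                            # reproducible noise
fs = 100.0                                   # assumed PPG sampling rate, Hz
t = np.arange(0, 10, 1 / fs)
pulse = np.sin(2 * np.pi * 1.2 * t)          # ~72 bpm "true" pulse wave
noisy = pulse + 0.5 * np.random.randn(t.size)

# Third-order low-pass Butterworth at 5 Hz (cutoff normalized by the Nyquist
# frequency), applied forward and backward (filtfilt) so the pulse shape is
# not delayed by the filter.
b, a = butter(3, 5.0 / (fs / 2), btype="low")
clean = filtfilt(b, a, noisy)

# The filtered signal should be much closer to the underlying pulse.
print(np.std(clean - pulse) < np.std(noisy - pulse))  # -> True
```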
In my informal testing (over several months), the Jawbone UP24 handles burned calories and step counts quite well (sometimes within +/- 10% of steps). It provides a REST API to collect a list of parameters: steps, distance, burned calories, sleep (see below the parameters for the new UP3 version). Sleep duration is measured well (but by accelerometer, i.e. number of hours in a horizontal position), so sleep quality is not accurate; the device nevertheless attempts to give additional information such as how heavy or light the user's sleep was, and how long waking periods were. And this is all that the current version provides. In the 3.0 version there is also a heart rate parameter. Concerning the usability of the current applications: the app is user-friendly and also allows users to log food consumption to help with their dietary goals. The user can set the device to vibrate under certain circumstances; a good option might be to have it vibrate after the user hasn't moved for a certain period of time. Let's look at the approach to data gathering that the Jawbone API proposes. The API description can be found in Jawbone's developer documentation. This is the full list of possible scopes for the Jawbone UP3:
Developers do not have direct access to Bluetooth or the raw sensors on the devices. All information is delivered, after processing, handling and transformations, from the cloud via the REST API. The only advanced sensor data available through the UP API is Resting Heart Rate. This provides one value per day, captured when the user wakes up while wearing the band, and is only available from users that have an UP3 band. However, Jawbone are currently exploring opportunities to unlock access to more sensor data in the future, but do not have a set timeline yet. Their step-classification concept is described in this article [https://jawbone.com/blog/classifying-steps-machine-learning/].
We tested data access via the API and confirmed the convenience of the system.
This HT was not tested, but it deserves attention and may represent the best option. The Microsoft Band is a commercial product available to the public without any delays or problems. It features a large number of important sensors: optical heart rate sensor, 3-axis accelerometer/gyrometer, ambient light sensor, UV sensor, capacitive sensor, galvanic skin response, barometer, haptic vibration motor, microphone, and GPS. A big positive point is that Microsoft has provided an SDK that allows access to the band's sensor readings. The band communicates via Bluetooth with a synchronized smartphone. So it can present a good solution, as it gives access to sensor data like the Angel Sensor and has a user-friendly application like the Jawbone, which is also important for motivation. But the price is approximately $100 higher.
Using parameters from HTs, we can measure and control physical parameters, intensity of activity and energy expenditure (EE). As the estimation of calories burnt (EE) is a key aspect of obesity determination, this section is dedicated to the concept of EE measurement. We quickly review the existing problems of EE estimation. Then the relation between accelerometer data, physiological data and EE will be established. Finally, we propose models able to increase the accuracy level and to take into account variability in physiological signals between individuals, applying a method of personalization.
Definition EE
EE (calories burnt) makes up half of the energy balance equation:
Energy Balance = Energy Expenditure - Energy Intake.
We can determine three elements of EE:
According to our method (see section “Method description”), we will deal with AEE.
How we can measure EE
One of the most standard and practical methods of estimating EE is indirect calorimetry, i.e. the VO2max test. It analyzes the concentrations of inhaled and exhaled gases, and the rate of oxygen consumption (VO2, mL/min) is used as the representation of energy expenditure according to Weir’s equation:
Resting EE=3.9×VO2+1.1×VCO2,
where: VO2 = oxygen uptake (ml/min); VCO2 = carbon dioxide output (ml/min).
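Weir's equation above can be sketched directly; with VO2 and VCO2 in mL/min the result is in cal/min, so we divide by 1000 to report kcal/min (the example values are made up):

```python
# Weir's equation as given above. With VO2 and VCO2 in mL/min the raw result
# is in cal/min; dividing by 1000 converts to kcal/min.
def resting_ee_kcal_per_min(vo2_ml_min, vco2_ml_min):
    return (3.9 * vo2_ml_min + 1.1 * vco2_ml_min) / 1000.0

# Example: VO2 = 250 mL/min, VCO2 = 200 mL/min (respiratory exchange
# ratio 0.8, typical of rest).
print(round(resting_ee_kcal_per_min(250, 200), 3))  # -> 1.195
```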
Metabolic carts offer invaluable insight into exercise metabolism, allowing interrogation of each breath. There are a couple of portable indirect calorimeters on the market, like AeroSport TEEM 100 Metabolic Analysis System, but the participant must use a facial mask during all data acquisition.
Below, we demonstrate an example of estimating substrate utilisation using metabolic cart data from the VO2 R package, collected during:
For the estimation of substrate utilisation by way of gas exchange we have to prepare the oxygen uptake data. First we filtered the data to remove unwanted noise. Then, following Robergs et al. [45], we removed ventilatory and cardiovascular frequency components (leaving, in theory, only the muscular oxygen uptake kinetics) via the application of a 0.04 Hz third-order Butterworth filter. Finally, we used the equations of Jeukendrup & Wallis [44], implemented in the VO2 package, to compute energy expenditure (ee), carbohydrate oxidation rates (choox), and fat oxidation rates (fatox).
It would be interesting to estimate the rate of maximal fat oxidation. Given that the rate of fat oxidation during graded exercise follows a quadratic function, its maximum can be extracted as the vertex of a fitted polynomial:
## x
## 167.4839
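The vertex extraction above (done in R) can be sketched as follows; the fat-oxidation values here are invented for illustration, with the maximum placed near an intensity of 160:

```python
import numpy as np

# Sketch of extracting the vertex of a fitted quadratic: fat oxidation rate
# as a function of exercise intensity is fitted with a 2nd-degree polynomial,
# whose maximum lies at x = -b / (2a). Data below are made up for illustration.
intensity = np.array([100, 120, 140, 160, 180, 200], dtype=float)  # e.g. HR
fatox = np.array([0.20, 0.35, 0.45, 0.48, 0.42, 0.25])             # g/min

a, b, c = np.polyfit(intensity, fatox, 2)   # quadratic coefficients
vertex = -b / (2 * a)                       # intensity of maximal fat oxidation
print(round(vertex, 1))
```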
Finally, we can try to evaluate another important parameter: efficiency during exercise. The two dominant methods, calculations of gross efficiency and delta efficiency, are described in [46].
In fact, at the beginning of the research it will be very useful to create for each patient a metabolic profile, obtained by indirect calorimetry, that provides detailed information on their caloric burn rate and individualized heart rate zones, because it shows the patient's ability to absorb oxygen. It is precisely the level of oxygen consumption by the muscles that influences, for example, the ability to run fast and to sustain that speed for a long time. It helps to personalize the workout plan and to select EE estimation algorithms with higher accuracy. Moreover, scientists have come to the conclusion that the VO2max indicator is inherited, and the main trouble is that the ability to improve this value is also inherited. In 2007-2008, Norwegian researchers conducted a large number of tests on experiment participants to reveal the dynamics of VO2max. The results showed that, despite certain genetic limits, everyone improved their physical form and reached a good level of this indicator given regular training progress. This indicator could help not only to create an accurate system of EE estimation, but also presents an interesting phenotype for study in this work.
Problems of EE estimation
As we have already noted, many HTs on the consumer market provide EE estimation in their applications. They apply regression models using accelerometer data, HR, or a combination of the two, across all activities, including anthropometric characteristics in the equation given the relation between body size and EE. Some of them claim higher accuracy due to a combination of accelerometer, gyroscope and physiological data (e.g. Basis or the Apple Watch). But the results are still not accurate enough, partly because these devices simply combine multiple signals without personalization. As explained in section “Method description”, the personalization of data is very important and helps to avoid suboptimal results. This fact can be demonstrated very easily with the following example: suppose we want to estimate EE from HR, and there are two participants in our experiment with similar anthropometric characteristics (e.g. similar body weight) but with different fitness levels. Normally, individuals with similar body size expend similar amounts of energy during a certain activity; however, their HR differs during moderate to vigorous activities, which is explained by each participant's fitness level. This means we can estimate EE from HR only if HR is properly normalized. Still, there is a strong link between HR, oxygen consumption and EE, which motivates the use of physiological data for EE estimation. Let's look at another example: if we use only an accelerometer, for which the rationale is that body motion close to the body's center of mass reflects whole-body movement, we will see the wrong relation between accelerometer data and EE for a participant across different activities. Single regression models are unable to fit all activities, since the slope and intercept of the regression model change based on the activity performed while data is collected [40].
So the relation becomes very weak for some non-weight-bearing types of activities, such as biking, while it is reasonably good for walking and running.
We can denote two points:
Improving accelerometer data
To improve the use of accelerometer data, almost all modern devices and many studies (such as [40], [41]) use activity recognition algorithms, where activity recognition is performed in different ways. For example, it depends on sensor location and on the number and types of clusters in the set: a system could classify 5 or 50 activities with different levels of accuracy, and the set may or may not be mutually exclusive. In this work we propose to create our own system of activity recognition using data obtained from the HT through its SDK. For that we have to build a machine learning model which gives results close to the VO2max test. In section “Activity recognition” we build a simple example classifier based on classical machine learning algorithms and show that applying activity recognition algorithms makes the system much more flexible; for example, it gives the ability to selectively use HR data, only when the activity is of a certain type.
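As a toy illustration of this selective use of sensors, activity-specific regression models might be organized as follows (all coefficients below are made up; real models would be fitted to calorimetry data):

```python
# Toy sketch of activity-specific EE regression: each recognized activity
# selects its own (slope, intercept) model over accelerometer counts, and
# HR is only used for activities where it is informative (e.g. biking,
# where accelerometry alone underestimates EE). Coefficients are invented.
MODELS = {
    "walking": {"slope": 0.0008, "intercept": 1.2, "use_hr": False},
    "running": {"slope": 0.0012, "intercept": 1.5, "use_hr": True},
    "biking":  {"slope": 0.0001, "intercept": 2.0, "use_hr": True},
}

def ee_kcal_min(activity, counts, hr=None, hr_coef=0.02, hr_rest=60):
    m = MODELS[activity]
    ee = m["intercept"] + m["slope"] * counts
    if m["use_hr"] and hr is not None:
        ee += hr_coef * (hr - hr_rest)   # HR term only when the activity warrants it
    return ee

print(round(ee_kcal_min("walking", 3000), 2))        # -> 3.6
print(round(ee_kcal_min("biking", 500, hr=120), 2))  # -> 3.25
```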
Improving physiological data
Activity recognition methods have brought many improvements to EE measurement and are now widely used. However, another problem remains: HR shows huge variability across individuals, and this issue is still unresolved. We have to develop models that fully exploit the potential of physiological data across different subjects without requiring individual calibration. To resolve this problem, some researchers propose the following possible approaches:
We could normalize physiological data using a person's physiological signals under high-intensity activities. But to avoid re-performing individual calibrations whenever the relation between physiological data and a person's physical form changes, we can use low-intensity activities of daily living, automatically recognized using machine learning models, to contextualize HR or any other physiological signal, and then predict what the physiological data during a high-intensity activity would be, without having to actually perform that activity [42]. Knowing the “intense” value, we can normalize the signal across individuals.
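A minimal sketch of this normalization idea, assuming a hypothetical linear predictor of “intense” HR learned from a reference population:

```python
# Sketch of HR normalization across individuals: predict what a subject's HR
# would be at a reference high intensity from HR observed during an everyday
# activity (here walking, via an invented linear relation), then express any
# HR on a 0-1 scale between resting and predicted-intense HR.
def predict_intense_hr(hr_walking):
    # hypothetical relation learned from a reference population
    return 1.4 * hr_walking + 30

def normalize_hr(hr, hr_rest, hr_intense):
    return (hr - hr_rest) / (hr_intense - hr_rest)

hr_intense = predict_intense_hr(100)                # predicted "intense" HR
print(round(normalize_hr(130, 60, hr_intense), 2))  # -> 0.64
```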
Alternatively, we can use a more sophisticated solution, described in [43], which focuses on hierarchical Bayesian models. The Bayesian hierarchical approach fits tasks of EE estimation very well, because the parameters are naturally nested: they belong to different levels. This means that features at the second level influence features at the first level. We distinguish two levels of a hierarchical structure: the individual (first) level and the group (second) level. The group-level parameters impact the relation between the predictors at the first level and the outcome variable. In this method we have to understand the cause behind the necessity of individual calibration. For example, in the case of HR, it is the difference in participants' fitness levels. So we could measure this level and then create a hierarchical model using it as a group-level parameter. In this case fitness level directly impacts the relation between HR and EE, without the need for explicit HR normalization. How to estimate fitness level is another question, discussed briefly in subsection “How we can measure EE”. The principle is used by many sub-maximal fitness tests (like the Rockport Walk Test), where HR at a certain intensity is used as a predictor, together with other variables, to estimate VO2max. The difference is that by using context recognition algorithms there is no need to perform lab tests or specific exercises: we can take the HR while walking at a certain speed as our sub-maximal HR and then predict VO2max. Finally, fitness level is included as a group-level parameter for the prediction of EE.
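The two-level idea can be caricatured in a few lines: a group-level fitness parameter (proxied by estimated VO2max) modulates the individual-level HR-to-EE slope. All coefficients below are invented for illustration; a real hierarchical Bayesian model would infer them from data:

```python
# Caricature of the two-level structure: the group-level parameter (fitness,
# proxied by an estimated VO2max) shapes the slope of the individual-level
# relation between normalized HR and EE. Coefficients are made up.
def ee_from_hr(hr_norm, vo2max):
    slope = 0.05 + 0.002 * vo2max          # group level: fitness shapes the slope
    intercept = 1.0
    return intercept + slope * hr_norm * 100  # individual level, kcal/min scale

# The same normalized HR maps to different EE for different fitness levels:
print(round(ee_from_hr(0.5, vo2max=35), 2))  # less fit  -> 7.0
print(round(ee_from_hr(0.5, vo2max=55), 2))  # more fit  -> 9.0
```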
Conclusion
Physical activity recognition is one of the key problems in determining the accuracy of energy expenditure estimation, energy expenditure being one of the main parameters in obesity classification and treatment. The previous subsection described the components needed to gather the relevant data from a person in the context of everyday activity. This subsection shows a simple example of how machine learning techniques are able to determine the activity from the sensor data, helping to increase the accuracy of the measurement results. The approach is validated on real data from the physical activity monitoring dataset from smartphones [27].
Dataset information
This dataset is very close to the data we can obtain from the AngelSensor. The experiments were carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, 3-axial linear acceleration and 3-axial angular velocity were captured at a constant rate of 50 Hz. The experiments were video-recorded in order to label the data manually. The obtained dataset was randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data and 30% for the test data.
The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap (128 readings/window). The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low frequency components, therefore a filter with 0.3 Hz cutoff frequency was used. From each window, a vector of features was obtained by calculating variables from the time and frequency domain.
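The windowing step above can be sketched as follows (a pure-Python stand-in for the dataset's preprocessing: 128-reading windows with 50% overlap, plus simple time-domain features per window):

```python
import statistics

# Fixed-width sliding windows with 50% overlap, as in the dataset above
# (128 readings/window at 50 Hz = 2.56 s), with simple time-domain features.
def sliding_windows(signal, width=128, overlap=0.5):
    step = int(width * (1 - overlap))
    return [signal[i:i + width]
            for i in range(0, len(signal) - width + 1, step)]

def window_features(window):
    return {"mean": statistics.mean(window), "std": statistics.pstdev(window)}

signal = list(range(512))                   # stand-in for one accelerometer axis
windows = sliding_windows(signal)
print(len(windows))                         # -> 7
print(window_features(windows[0])["mean"])  # -> 63.5
```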
For each record in the dataset, the following is provided:
Data Exploration
First, we examined the data for preprocessing needs and found that all features already lie within [-1, 1]; consequently, neither normalization nor scaling is necessary.
We then checked for features with strongly skewed distributions; the three worst variables are:
v389_fBodyAccJerkbandsEnergy57_64 | v479_fBodyGyrobandsEnergy33_40 | v60_tGravityAcciqrX |
---|---|---|
14.70005 | 12.33718 | 12.18477 |
To select the model quality criterion (Accuracy or Kappa), it is necessary to check the activity distribution:
WALKING | WALKING_UP | WALKING_DOWN | SITTING | STANDING | LAYING |
---|---|---|---|---|---|
1226 | 1073 | 986 | 1286 | 1374 | 1407 |
The classes are distributed almost equally, except perhaps WALKING_DOWN. Thus, Accuracy can be used.
Modeling
In this subsection, as an example of solving the classification task, two classical algorithms are applied: Random Forest and Support Vector Machine.
Random Forest
We use Random Forest as the basic algorithm because it has a built-in mechanism for assessing variable importance. The model quality is as follows:
WALKING | WALKING_UP | WALKING_DOWN | SITTING | STANDING | LAYING | |
---|---|---|---|---|---|---|
WALKING | 481 | 11 | 4 | 0 | 0 | 0 |
WALKING_UP | 41 | 424 | 6 | 0 | 0 | 0 |
WALKING_DOWN | 16 | 42 | 362 | 0 | 0 | 0 |
SITTING | 0 | 0 | 0 | 427 | 64 | 0 |
STANDING | 0 | 0 | 0 | 51 | 481 | 0 |
LAYING | 0 | 0 | 0 | 0 | 0 | 537 |
NA. | Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull | AccuracyPValue | McnemarPValue |
overall | 0.9202579 | 0.9041334 | 0.9098849 | 0.9297870 | 0.1822192 | 0.0000000 | NA |
NA. | Class: WALKING | Class: WALKING_UP | Class: WALKING_DOWN | Class: SITTING | Class: STANDING | Class: LAYING |
byClass.Sensitivity | 0.9697581 | 0.9002123 | 0.8619048 | 0.8696538 | 0.9041353 | 1.0000000 |
byClass.Specificity | 0.9767442 | 0.9785945 | 0.9960427 | 0.9792345 | 0.9734990 | 1.0000000 |
byClass.Pos.Pred.Value | 0.8940520 | 0.8888889 | 0.9731183 | 0.8933054 | 0.8825688 | 1.0000000 |
byClass.Neg.Pred.Value | 0.9937733 | 0.9809717 | 0.9774757 | 0.9740786 | 0.9787677 | 1.0000000 |
byClass.Prevalence | 0.1683068 | 0.1598235 | 0.1425178 | 0.1666101 | 0.1805226 | 0.1822192 |
byClass.Detection.Rate | 0.1632168 | 0.1438751 | 0.1228368 | 0.1448931 | 0.1632168 | 0.1822192 |
byClass.Detection.Prevalence | 0.1825585 | 0.1618595 | 0.1262301 | 0.1621988 | 0.1849338 | 0.1822192 |
byClass.Balanced.Accuracy | 0.9732511 | 0.9394034 | 0.9289738 | 0.9244441 | 0.9388172 | 1.0000000 |
Support Vector Machine
For comparison, we build an SVM model. The quality of the model is as follows:
WALKING | WALKING_UP | WALKING_DOWN | SITTING | STANDING | LAYING | |
---|---|---|---|---|---|---|
WALKING | 482 | 5 | 9 | 0 | 0 | 0 |
WALKING_UP | 13 | 457 | 1 | 0 | 0 | 0 |
WALKING_DOWN | 7 | 32 | 381 | 0 | 0 | 0 |
SITTING | 0 | 1 | 0 | 442 | 46 | 2 |
STANDING | 0 | 0 | 0 | 26 | 506 | 0 |
LAYING | 0 | 0 | 0 | 0 | 0 | 537 |
NA. | Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull | AccuracyPValue | McnemarPValue |
overall | 0.9518154 | 0.9420842 | 0.9434533 | 0.9592650 | 0.1822192 | 0.0000000 | NA |
NA. | Class: WALKING | Class: WALKING_UP | Class: WALKING_DOWN | Class: SITTING | Class: STANDING | Class: LAYING |
byClass.Sensitivity | 0.9717742 | 0.9702760 | 0.9071429 | 0.9002037 | 0.9511278 | 1.0000000 |
byClass.Specificity | 0.9918401 | 0.9846527 | 0.9960427 | 0.9894137 | 0.9809524 | 0.9991701 |
byClass.Pos.Pred.Value | 0.9601594 | 0.9232323 | 0.9744246 | 0.9444444 | 0.9166667 | 0.9962894 |
byClass.Neg.Pred.Value | 0.9942740 | 0.9942904 | 0.9847418 | 0.9802340 | 0.9891441 | 1.0000000 |
byClass.Prevalence | 0.1683068 | 0.1598235 | 0.1425178 | 0.1666101 | 0.1805226 | 0.1822192 |
byClass.Detection.Rate | 0.1635562 | 0.1550730 | 0.1292840 | 0.1499830 | 0.1717000 | 0.1822192 |
byClass.Detection.Prevalence | 0.1703427 | 0.1679674 | 0.1326773 | 0.1588056 | 0.1873091 | 0.1828979 |
byClass.Balanced.Accuracy | 0.9818071 | 0.9774643 | 0.9515928 | 0.9448087 | 0.9660401 | 0.9995851 |
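The two models can be reproduced in outline as follows. The original analysis was done in R; the sketch below uses Python/scikit-learn, and since the full HAR feature matrix cannot be embedded here, a synthetic 6-class problem stands in for it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in for the 6-class HAR feature matrix
X, y = make_classification(n_samples=2000, n_features=60, n_informative=20,
                           n_classes=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
svm = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)

for name, model in [("RF", rf), ("SVM", svm)]:
    pred = model.predict(X_te)
    print(name, round(accuracy_score(y_te, pred), 3))
    # confusion_matrix(y_te, pred) yields the per-class tables shown above
```

On the real HAR features, the SVM slightly outperformed the RF (0.952 vs 0.920 accuracy); on synthetic data the ranking may differ.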
Important variables
Variable extraction - important variables with RF:
It can be seen that both components of the accelerometer signal (Body and Gravity) are present among the most important variables, but there are no data from the gyroscope, which could be suspicious.
Checking the important variables with SVM
The basic idea is to start with the 10% most important variables and then keep increasing by 10% while monitoring accuracy. After reaching the maximum, the step is reduced first to 5%, then to 2.5% and, finally, to a single variable. Model with 440 variables:
WALKING | WALKING_UP | WALKING_DOWN | SITTING | STANDING | LAYING | |
---|---|---|---|---|---|---|
WALKING | 482 | 5 | 9 | 0 | 0 | 0 |
WALKING_UP | 13 | 457 | 1 | 0 | 0 | 0 |
WALKING_DOWN | 4 | 33 | 383 | 0 | 0 | 0 |
SITTING | 0 | 1 | 0 | 436 | 52 | 2 |
STANDING | 0 | 0 | 0 | 25 | 507 | 0 |
LAYING | 0 | 0 | 0 | 0 | 0 | 537 |
NA. | Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull | AccuracyPValue | McnemarPValue |
overall | 0.9507974 | 0.9408597 | 0.9423595 | 0.9583251 | 0.1822192 | 0.0000000 | NA |
Model with 490 variables:
WALKING | WALKING_UP | WALKING_DOWN | SITTING | STANDING | LAYING | |
---|---|---|---|---|---|---|
WALKING | 483 | 5 | 8 | 0 | 0 | 0 |
WALKING_UP | 15 | 455 | 1 | 0 | 0 | 0 |
WALKING_DOWN | 5 | 31 | 384 | 0 | 0 | 0 |
SITTING | 0 | 1 | 0 | 440 | 48 | 2 |
STANDING | 0 | 0 | 0 | 25 | 507 | 0 |
LAYING | 0 | 0 | 0 | 0 | 0 | 537 |
NA. | Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull | AccuracyPValue | McnemarPValue |
overall | 0.9521547 | 0.9424917 | 0.9438181 | 0.9595781 | 0.1822192 | 0.0000000 | NA |
As a result, the maximum accuracy appeared around 490 variables and reached 0.952 (0.950 with 440 variables), about 0.1% better than the value obtained with the full variable set, while the training time is halved (see the time performance table).
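The selection procedure can be sketched as follows, again on stand-in synthetic data with scikit-learn in place of the original R code; the fraction schedule here is coarser than the 10%/5%/2.5% refinement described above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=100, n_informative=15,
                           n_classes=4, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Rank features with the RF built-in importance measure
rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)
order = np.argsort(rf.feature_importances_)[::-1]

# Refit the SVM on growing fractions of the top-ranked variables
scores = {}
for frac in (0.1, 0.2, 0.3, 0.5, 1.0):
    k = max(1, int(frac * X.shape[1]))
    cols = order[:k]
    svm = SVC(kernel="rbf").fit(X_tr[:, cols], y_tr)
    scores[k] = svm.score(X_te[:, cols], y_te)
print(scores)  # accuracy as a function of the number of retained variables
```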
Second approach: important variables with Information gain ratio
To confirm the choice of important variables, we apply the Information Gain Ratio (IGR). Information gain tells us how important a given attribute of the feature vectors is. After calculating the IGR, the list of variables was sorted in decreasing order of value. The first 20 entries of this list are the following:
The maximum IGR value is 0.897, the minimum 0. As can be seen, IGR gives a different set of important variables than RF (only five variables are the same). The gyroscope is again not present among the most important variables, but there are many variables derived from the accelerometer X components.
The difference in results could be explained by the fact that some variables in the dataset carry more information than others. This is clearly visible when plotting variables with different IGR values. For example, we selected two variables (5 and 500) representing the two opposite cases: with high IGR the points are visually well separated, in contrast to the low-IGR case, which nevertheless still carries some useful information. Finally, the number of classes also plays a role, because by maximizing accuracy on one class we can lose performance on another. This can be seen in the confusion matrices for two attribute sets with the same accuracy: for example, in the first matrix the first two classes have fewer errors and the other three classes more errors than in the second one.
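For reference, the IGR of a single numeric feature can be computed as below; this is a simple equal-width discretization sketch, and the bin count and toy data are arbitrary.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain_ratio(feature, labels, bins=10):
    """IGR of one numeric feature (equal-width discretized) w.r.t. the class labels."""
    edges = np.histogram_bin_edges(feature, bins)
    binned = np.digitize(feature, edges[1:-1])
    # Information gain = H(Y) - H(Y | X); split information = H(X)
    h_cond = 0.0
    for v in np.unique(binned):
        mask = binned == v
        h_cond += mask.mean() * entropy(labels[mask])
    gain = entropy(labels) - h_cond
    split_info = entropy(binned)
    return gain / split_info if split_info > 0 else 0.0

rng = np.random.default_rng(2)
y = rng.integers(0, 2, 1000)
informative = y + rng.normal(0, 0.3, 1000)  # separates the two classes well
noise = rng.normal(0, 1, 1000)              # carries no class information
print(info_gain_ratio(informative, y), info_gain_ratio(noise, y))
```

The informative feature scores a much higher IGR than the pure-noise one, which is the ranking behavior exploited above.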
Principal component analysis
Since we are dealing with fairly high-dimensional data, it is useful to apply Principal Component Analysis (PCA). The goal of PCA is to explain the largest amount of variance with the lowest number of variables. Collinear variables can be combined into underlying factors or principal components that are uncorrelated with one another. In other words, we can reduce dimensionality by dropping the components that do not explain much of the variance. To see whether PCA could be useful, we apply it to the RF and SVM models.
We include all of the variables in the PCA. After performing PCA, 102 components remain.
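The reduction step can be sketched with scikit-learn, keeping just enough components to explain 95% of the variance; the variance threshold and the synthetic stand-in data are illustrative (the actual analysis retained 102 components).

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated data standing in for the 561 HAR features
X, _ = make_classification(n_samples=1000, n_features=50, n_informative=10,
                           n_redundant=30, random_state=3)

# Keep the smallest number of components explaining 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(StandardScaler().fit_transform(X))
print(X.shape[1], "->", X_reduced.shape[1])
```

The reduced matrix is then fed to the RF and SVM classifiers exactly as the full feature set was.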
Below the confusion matrix and statistics for RF and SVM models are shown:
RF with PCA data
WALKING | WALKING_UP | WALKING_DOWN | SITTING | STANDING | LAYING | |
---|---|---|---|---|---|---|
WALKING | 473 | 5 | 18 | 0 | 0 | 0 |
WALKING_UP | 40 | 415 | 16 | 0 | 0 | 0 |
WALKING_DOWN | 72 | 43 | 305 | 0 | 0 | 0 |
SITTING | 0 | 1 | 0 | 384 | 98 | 8 |
STANDING | 1 | 0 | 2 | 40 | 488 | 1 |
LAYING | 2 | 1 | 0 | 3 | 10 | 521 |
NA. | Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull | AccuracyPValue | McnemarPValue |
overall | 0.8775025 | 0.8526416 | 0.8651198 | 0.8891302 | 0.1822192 | 0.0000000 | NA |
SVM with PCA data
WALKING | WALKING_UP | WALKING_DOWN | SITTING | STANDING | LAYING | |
---|---|---|---|---|---|---|
WALKING | 487 | 1 | 8 | 0 | 0 | 0 |
WALKING_UP | 41 | 412 | 18 | 0 | 0 | 0 |
WALKING_DOWN | 2 | 13 | 405 | 0 | 0 | 0 |
SITTING | 0 | 1 | 2 | 431 | 53 | 4 |
STANDING | 1 | 0 | 0 | 37 | 494 | 0 |
LAYING | 0 | 0 | 0 | 0 | 0 | 537 |
NA. | Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull | AccuracyPValue | McnemarPValue |
overall | 0.9385816 | 0.9261940 | 0.9292992 | 0.9469808 | 0.1822192 | 0.0000000 | NA |
In both cases, fitting the models on the reduced data gives worse results: accuracy for the RF model with PCA is about 5% lower than for the RF model with the full feature set, and about 2% lower in the case of SVM. However, the training time was significantly reduced for both models (see the table below).
model | duration |
---|---|
RF Full model | 11.4498 mins |
SVM Full model | 12.5184 mins |
SVM Imp Var model 490 | 8.850571 mins |
SVM Imp Var model 440 | 5.61223 mins |
SVM IGR model 526 | 16.16524 mins |
RF IGR model 561 | 12.18458 mins |
RF PCA model | 2.044874 mins |
SVM PCA model | 4.304657 mins |
Above, we demonstrated how to successfully perform human activity classification based on a simple feature extraction process applied to smartphone time series. Two classifiers (SVM and RF) with differing computational costs and similar performance profiles were used. The lists of important variables produced by RF and IGR raise the question of whether the gyroscope should be used to improve recognition accuracy. In fact, the absence of gyroscope data among the important variables can be explained. Among the accelerometer features, X-axis acceleration showed the lowest prediction error because the X-axis was aligned with the direction of forward movement, making it one of the most informative features for simple activity recognition. From a physical point of view, however, since gyroscopes capture only dynamic rotational movement, free from gravitational bias, gyroscope features on the Y and Z axes should show lower errors than the corresponding Y and Z acceleration features. To achieve higher accuracy and reliably recognize stair use and rotations, we need to apply a more sophisticated activity recognition system that combines accelerometer and gyroscope information to reduce prediction errors.
Conclusion
Diet tracking is an important behavior change strategy for controlling obesity. However, manually recording the foods eaten is tedious. Many diet tracking approaches have been proposed, but each has limitations: recognizing food from photographs fails for complicated foods, and scanning food barcodes works only for the foods that carry barcodes. (to be continued)
The specific parameters to be measured need to be determined during the research process, such as the genetic, metabolic or social factors implicated in the development of the disease.
Obesity factors map
These factors have been clustered as in the figure below. This figure is derived from Tackling Obesities: Future Choices – Obesity System Atlas, published by the former Department for Innovation, Universities and Skills, and is used under the Open Government Licence.
From top left: Social psychology (yellow), Individual psychology (orange), Physical activity environment (dark brown), Individual physical activity (light brown), Physiology (blue), Food consumption (light green), Food production (dark green).
Potential parameters include:
A major requirement for the success of the project will be the efficient collection and management of the heterogeneous patient data, in terms of both real time and non-real time traffic. The computational research aspects will therefore involve development of new algorithms, methodologies and software for integrating and analyzing the biomedical ‘big data’. Integrated analyses of genomic, environmental, behavioral, and clinical data will allow us to detect and characterize different types of obesity, leading to a better understanding of the complex relationships between obesity and genetic diseases. In the longer term, this should contribute to advances in pharmacogenomics (the right drug for the right patient at the right dose), identification of new targets for treatment and prevention, testing whether mobile devices can encourage healthy behaviors, and laying the scientific foundation for precision medicine in complex diseases.
The specific goals of this study are to determine the most important parameters in the context of obesity, by comparing two cohorts of genetic and non-genetic obese patients. Also, we hope to investigate the effects of physical activity and energy expenditure in the two cohorts. To be defined: exactly what questions we want to answer.
Healthcare is one of the industries that suffers from the under-use of IT in data processing. There is no single place to store a patient's medical data; some digitized components are not portable and are not suitable for collaborative projects. Today, the healthcare industry is shifting toward an information-centric care delivery model, enabled in part by open standards that support cooperation, collaborative workflows and information sharing. Cloud computing provides an infrastructure that allows hospitals, medical practices, insurance companies and research facilities to tap improved computing resources at lower initial capital outlays. Additionally, cloud environments lower the barriers to innovation and to the modernization of systems and applications. Cloud computing in healthcare could provide the following benefits:
Healthcare data has strict demands on security, confidentiality, availability to authorized users, traceability of access, reversibility of data and long-term preservation. All these demands depend strongly on the country, so cloud providers must take them into account before storing such information. Three types of applications should be considered: clinical, non-clinical and third-party. Clinical applications contain patients' electronic records and must be secured properly, so a private cloud is likely to be used. Non-clinical applications handle workflows or billing cycles, so public or hybrid cloud types can be considered. Third-party applications contain non-privacy-related data for analysis or other needs, for which public or hybrid cloud types are also suitable.
Machine learning is an algorithm-based and data-driven technique to automatically improve computer programs by learning from experience. Training is performed by estimating the unknown parameters of a model using training sets. The literature distinguishes the following main ML groups: supervised, unsupervised and semi-supervised learning, reinforcement learning and deep learning. There are also other approaches: ‘weak signals’, bioinspired algorithms.
In this project, machine learning techniques will be used to build models to represent the different obesity types. To be defined: what kinds of models for what types of obesity?
The rest of this section summarizes some of the main techniques used in machine learning.
A map of machine learning algorithms:
In an initial ‘training’ stage, the algorithm is presented with example inputs and their desired outputs, given by an “expert”, and the goal is to learn a general rule that maps inputs to outputs. In a second stage, the rules can then be used to predict the output for unknown inputs. Specific techniques include Decision trees, Random forests, Artificial neural networks, Support vector machines, Bayesian statistics, Inductive logic programming and Case-based reasoning.
No ‘outputs’ are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means to an end (feature learning). Specific techniques include Clustering, Outlier Detection, Association rule learning, Self-organizing maps.
Semi-supervised learning is between supervised and unsupervised learning, where the expert gives incomplete training examples, with some (or many) of the target outputs missing.
The algorithm interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle), without an expert explicitly telling it whether it has come close to its goal. Another example is learning to play a game by playing against an opponent.
Deep learning is based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers. Architectures include deep neural networks, convolutional deep neural networks, deep belief networks and recurrent neural networks. They use a cascade of many layers of nonlinear processing units for feature extraction and transformation, each successive layer using the output of the previous layer as input. The algorithms may be supervised or unsupervised. Some successful applications of deep learning are computer vision and speech recognition. Recently, a deep-learning approach based on an artificial neural network has been used in bioinformatics to predict Gene Ontology annotations and gene-function relationships (Chicco et al., 2014).
Other approaches: ‘weak signals’, bioinspired algorithms
To perform this analysis, we obtained a small published dataset (www.omicsonline.org/speaker/s-shajith-anoop-bharathiar-university-india) from a cohort of unrelated, obese subjects (n=208) and non obese controls (n=166) recruited from a semi urban population of Tamilnadu, South India. The goal of the original study was to identify gene-nutrient interactions in obesity. The dataset included various physiological characteristic features (BMI, WHR values, LDL/HDL ratio, TC, TGL, FBL etc.), genotypic data for known obesity genes and also data on the gender and age of the persons in the study.
In a first step, we performed some simple statistical analysis (using the R language) to see how each of the physiological factors and genetic factors correlated with obesity, in particular the BMI and the WHR values.
## Classes 'tbl_df' and 'data.frame': 372 obs. of 16 variables:
## $ Subject.type : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Age : num 20 21 24 26 27 28 25 27 25 23 ...
## $ gender : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BMI : num 25 26.7 29.1 28 29 ...
## $ WHR : num 0.98 0.99 1.02 1.08 1.06 0.98 0.99 1.02 1.04 1.08 ...
## $ FBG : num 88 110 90 112 69 98 109 101 98 92 ...
## $ T C : num 198 232 168 198 164 174 164 200 174 155 ...
## $ TGL : num 198 205 134 110 148 174 196 91 174 81 ...
## $ HDL : num 32 26 29 26 33 34 33 57 34 28 ...
## $ TG.HDL.RATIO : num 6.19 7.88 4.62 4.23 4.48 ...
## $ TC.HDL.RATIO : num 6.19 8.92 5.79 7.62 4.97 ...
## $ LDL : num 143 143 156 158 168 164 143 156 193 166 ...
## $ VLDL : num 31 31 38 33 23 10 30 14 10 9 ...
## $ LDL.HDL.RATIO: num 4.46 5.5 5.37 6.07 5.09 4.82 4.33 2.73 5.67 5.92 ...
## $ MC4R : chr "G/A" "G/A" "G/A" "G/G" ...
## $ DGAT : chr "CC" "CC" "CC" "CC" ...
## NULL
The dataset contains three BMI ranges across the two subject types. Subjects in the middle (overweight) BMI range may be classified as either of the two subject types (Obese or Non-obese):
Non-obese | Obese | |
---|---|---|
[14.4,24.6) | 0.976 | 0.024 |
[24.6,29.1) | 0.187 | 0.813 |
[29.1,38.1] | 0.033 | 0.967 |
BMI vs subject type (obese or non-obese)
BMI is used as a gauge of obesity: the degree of obesity can be measured by BMI.
WHR vs subject type (obese or non-obese)
High WHR is indicative of abdominal obesity and the general idea is that obese individuals (in terms of BMI) are more likely to have abdominal obesity (in terms of WHR). The graph below supports this hypothesis.
Fasting Blood Glucose vs subject type (obese or non-obese)
Note that FBG for obese people has greater variation compared to non-obese individuals, especially wider tails. However, the mean value of the FBG is similar for both the obese and the non-obese groups.
LDL/HDL ratio vs subject type (obese or non-obese)
One of the main factors underlying obesity is the accumulation of extra LDL (low-density lipoprotein), also referred to as “bad fat”, and the reduction of body HDL (high-density lipoprotein) content, also referred to as “good fat”. As a result, for obese individuals we are more likely to observe a higher value of the LDL/HDL ratio. In the data under consideration, the LDL/HDL ratio is indeed markedly higher for obese individuals than for non-obese individuals.
Correlation Matrix
This multi-panel plot combines the following elements: correlation plots, histograms with density estimators and, on the lower diagonal, the associated statistics.
Highly correlated attributes:
## [1] "HDL" "TG.HDL.RATIO"
Generally, we can remove attributes with an absolute correlation of 0.75 or higher.
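A minimal sketch of that filtering rule is shown below, on synthetic lipid-like data where the TG/HDL ratio is correlated with HDL by construction; the variable names and distributions are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
hdl = rng.normal(45, 9, n)
tgl = rng.normal(150, 15, n)
data = {
    "HDL": hdl,
    "TGL": tgl,
    "TG.HDL.RATIO": tgl / hdl,  # strongly correlated with HDL by construction
    "Age": rng.uniform(18, 60, n),
}

names = list(data)
X = np.column_stack([data[k] for k in names])
corr = np.corrcoef(X, rowvar=False)

# Flag the second member of every pair with |r| >= 0.75
to_drop = set()
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if abs(corr[i, j]) >= 0.75:
            to_drop.add(names[j])
print(sorted(to_drop))
```

Dropping only the second member of each highly correlated pair keeps one representative of the shared information, mirroring what caret's findCorrelation-style filtering does in R.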
In a second step, we explored the effect of the genetic factors on BMI (obesity) and WHR (abdominal obesity). In the published study, sequence analysis of exon 4 in leptin receptor (LEP-R) revealed a novel mutation -Leu48His (CTT to CAT) in heterozygous state. The mutation was absent in the sequences of non obese controls. Sequence data was also available for the MC4R gene, coding for a protein called melanocortin 4 receptor, which is mainly found in the hypothalamus of the brain, an area responsible for controlling appetite and satiety. Mutations in the MC4R gene account for 6-8% of obesity cases (www.gbhealthwatch.com/GND-Obesity-MC4R.php). We investigated the relationship between variations in the MC4R gene (G/G or G/A at an unknown position) identified in the cohort and the BMI / WHR. For the two variants of the gene, the joint distribution of (BMI, WHR) can be represented by 3-D density plots:
(BMI,WHR) density plot for individuals with G/A & G/G variant
Note that for the G/G genotype, the peak in the (BMI,WHR) density plot is found at higher values of BMI and WHR. This implies that the variant may be involved in increasing both the BMI and WHR, making the person more susceptible to becoming obese.
To test this hypothesis, we create three simple linear models, one for each phenotype (BMI, WHR, LDL/HDL ratio) as the response, with the MC4R gene variants as the predictor, and compute summary statistics. Since interactions and influences among the obesity phenotypes matter greatly for the causes and consequences of obesity, we also include the environmental factors available in the dataset (gender and age) in the models.
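The analysis in this section was done in R with lm(); the Python/numpy sketch below, on simulated data with made-up coefficients, only illustrates how the genotype factor is dummy-coded into such a model.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 372

# Hypothetical regeneration of the modeling setup
genotype = rng.choice(["G/A", "G/G"], n)  # MC4R variants
gender = rng.integers(0, 2, n)
age = rng.uniform(18, 60, n)

gg = (genotype == "G/G").astype(float)    # dummy coding: G/A is the baseline level
ratio = 2.1 + 1.4 * gg + 0.27 * gender + 0.01 * age + rng.normal(0, 0.8, n)

# Design matrix with an intercept, as R's lm() would build it
X = np.column_stack([np.ones(n), gg, gender, age])
beta, *_ = np.linalg.lstsq(X, ratio, rcond=None)
print(np.round(beta, 2))  # estimates for (Intercept, MC4R G/G, gender, Age)
```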
Genetic variants and LDL/HDL model
The LDL/HDL distribution for each gender:
And for each variant of MC4R:
As seen in the figures, the MC4R factor shows a tendency to increase the LDL/HDL ratio.
We create a linear regression model with the MC4R gene variants, gender and age as the predictors. To assess the model, we examine the residual distribution:
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 2.11906274 | 0.38013534 | 5.574496 | 0.0000000481 |
MC4RG/G | 1.39066887 | 0.16798736 | 8.278414 | 0.0000000000 |
gender | 0.27112899 | 0.16277624 | 1.665655 | 0.0966332176 |
Age | 0.01064211 | 0.01042379 | 1.020945 | 0.3079512954 |
The density plot of the residuals in the figure above shows an approximately normal distribution, which is a positive sign for the model fit.
Second, to make the model more flexible, we fit the same model with the number of risk alleles in a genotype as a predictor instead of the factor; that is, the genotypes are coded by their number of risk alleles: G/G = 2 and G/A = 1.
G/A | G/G | |
---|---|---|
Non-obese | 130 | 20 |
Obese | 71 | 151 |
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 0.99952287 | 0.36371511 | 2.748093 | 0.006289484 |
MC4R_risk | 1.39066887 | 0.16798736 | 8.278414 | 0.000000000 |
gender2 | 0.27112899 | 0.16277624 | 1.665655 | 0.096633218 |
Age | 0.01064211 | 0.01042379 | 1.020945 | 0.307951295 |
We can see that the model fit (R2 = 0.19) is essentially the same as before. The only significant predictor is the number of risk alleles of the MC4R gene (est = 1.39, p-value = 2.33e-15). Age (est = 0.01, p-value = 0.3) and gender (est = 0.27, p-value = 0.09) are again found to be insignificant.
Genetic variants and BMI model
Although we have a direct indicator of each person's obesity status, we investigate BMI in the next model, because it lets us estimate the variation of an individual's degree of obesity on a continuous scale with respect to the values of the predictor variables. To begin, we provide the BMI mean and standard deviation values for each variant of the MC4R gene:
Levels | Mean BMI | SD BMI |
---|---|---|
G/G | 29.6926315789474 | 3.26087223606477 |
G/A | 24.094328358209 | 4.26459220461251 |
The odds ratio is 20.04, which indicates a strong effect of the variants on the BMI.
As in the previous subsection, we create linear models with BMI as the response and the gene variants, age and gender as the predictors, and observe which of these factors have a profound effect.
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 19.8903293 | 0.8089136 | 24.5889411 | 0.0000000000 |
MC4RG/G | 4.9191671 | 0.4039918 | 12.1764032 | 0.0000000000 |
gender2 | 0.2398279 | 0.3914596 | 0.6126506 | 0.5404858212 |
Age | 0.1321873 | 0.0250681 | 5.2731287 | 0.0000002293 |
We then create the same model with the number of risk alleles in a genotype as a predictor instead of the factor, and compare it with the previous one.
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 18.2858446 | 0.6299465 | 29.027615 | 0.0000000 |
MC4R_risk | 5.5407173 | 0.4002034 | 13.844754 | 0.0000000 |
gender2 | 0.6047309 | 0.3990441 | 1.515449 | 0.1305149 |
## Analysis of Variance Table
##
## Model 1: BMI ~ MC4R + gender + Age
## Model 2: BMI ~ MC4R_risk + gender
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 368 5031.2
## 2 369 5411.3 -1 -380.15 27.806 2.293e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here the ANOVA test is significant, indicating that the fuller model (with the MC4R genotype factor and age) should be preferred over the risk-allele model. The model fit was good (R2 = 0.39). The G/G genotype of the MC4R gene was found to be extremely significant (est = 4.91, p-value < 2e-16), which confirms the importance of MC4R as a predictor for obesity, with G again acting as a risk allele. The gender factor has no essential effect (est = 0.23, p-value = 0.54), but in this case age is found to be a major influencing parameter (est = 0.13, p-value = 2.29e-07), reflecting the tendency of obesity to increase with age.
The plot of the fitted values against the residuals and the residual distribution imply that there is no significant relation between the residuals and the fitted values, and that the residuals are approximately normally distributed around 0, both of which are crucial model assumptions.
Genetic variants and WHR model
WHR is an important factor for determining the abdominal obesity of a person. According to World Health Organization (WHO) standards for Europe [37], males with WHR > 0.94 and females with WHR > 0.8 are considered to have abdominal obesity. Analogously to the previous two models (BMI and LDL/HDL ratio), we create a WHR model using the same predictors (gender, age and the MC4R gene) with WHR as the response. As in the first two cases, the model is validated using the fitted-values-versus-residuals plot and the residual distribution. We begin by examining the WHR mean and standard deviation values:
Levels | Mean WHR | SD WHR |
---|---|---|
G/G | 0.923391812865497 | 0.0836390265133804 |
G/A | 0.862537313432836 | 0.0770456348584803 |
And we look at its distribution for each gender:
Characteristics of the linear regression model with gender, age and the MC4R gene as predictors:
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 0.8643374291 | 0.0163016546 | 53.021454 | 0.0000000 |
MC4RG/G | 0.0626275680 | 0.0081414562 | 7.692428 | 0.0000000 |
gender2 | -0.0610770752 | 0.0078889009 | -7.742153 | 0.0000000 |
Age | 0.0008143214 | 0.0005051856 | 1.611925 | 0.1078355 |
The model fit (R2 = 0.24) shows that, as before, the age factor is not influential (est = 0.0008, p-value = 0.108), but the MC4R gene (G/G: est = 0.062, p-value = 1.34e-13) and gender (female: est = -0.061, p-value = 9.55e-14) have a meaningful effect. This suggests that G is a risk allele for abdominal obesity as well, and that males tend to have a higher WHR than females. Let's examine the fitted values against the residuals and plot the density of the residuals:
Characteristics of the linear regression model with gender and the MC4R risk alleles as predictors:
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 1.22246249 | 0.015198564 | 80.432761 | 0e+00 |
MC4R_risk | -0.08306089 | 0.009655608 | -8.602346 | 0e+00 |
gender2 | 0.06510986 | 0.009627638 | 6.762807 | 1e-10 |
This model improves on the previous results.
However, we need to take into account that the male WHR cut-off for abdominal obesity is higher than the female one. And since the effect size is small (est = 0.061), it may not translate into a higher rate of abdominal obesity in males compared to females.
To make this clear, we fit a logistic model with the same set of predictors as the first model (MC4R, gender, age) against an indicator response variable representing whether a person is abdominally obese (response = 1 if the individual is male with WHR > 0.94, or female with WHR > 0.8; response = 0 otherwise).
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | -1.52030360 | 0.57664817 | -2.636449 | 0.0083778744 |
MC4RG/G | 2.11164040 | 0.33727998 | 6.260794 | 0.0000000004 |
gender2 | 2.97806892 | 0.38018063 | 7.833300 | 0.0000000000 |
Age | 0.02290676 | 0.01787063 | 1.281810 | 0.1999092113 |
The two most significant effects, as in the WHR model, are the MC4R gene (G/G: est = 2.11, p-value = 6e-10) and gender (female: est = 2.97, p-value = 6.82e-15). This model shows that females tend to have a higher rate of abdominal obesity than males, because the abdominal obesity cut-off set by the WHO is lower for females than for males. The following figure helps explain this:
Genetic variants and other factors
Finally, we build a linear model with obesity status (obese/non-obese) as the response and the gene variants as the predictor:
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 0.3980100 | 0.02824171 | 14.09298 | 0 |
MC4RG/G | 0.5318146 | 0.04165474 | 12.76720 | 0 |
We also present logistic regressions of the obesity / abdominal obesity data against the MC4R variants, age and gender. First we look at the abdominally obese individuals:
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | -1.5203036 | 0.5766482 | -2.636449 | 0.0083779 |
MC4RG/G | 2.1116404 | 0.3372800 | 6.260794 | 0.0000000 |
gender2 | 2.9780689 | 0.3801806 | 7.833300 | 0.0000000 |
Age | 0.0229068 | 0.0178706 | 1.281810 | 0.1999092 |
We also check the possible influence of MC4R in the case of overweight (BMI > 23):
| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -1.1246467 | 0.5693295 | -1.975388 | 0.0482241 |
| MC4RG/G | 2.8292455 | 0.4824269 | 5.864610 | 0.0000000 |
| gender2 | 0.2553966 | 0.2906601 | 0.878678 | 0.3795759 |
| Age | 0.0480813 | 0.0186422 | 2.579164 | 0.0099040 |
From these simple analyses, we conclude that the MC4R gene variant has a significant effect on both abdominal obesity and overweight. Gender is highly significant for abdominal obesity but not for overweight, whereas age is significant for overweight but not for abdominal obesity.
One of the first problems faced in attempting to define a genetic basis for obesity is deciding what kind of effect (phenotype) we seek to examine [36]. That is why discovering the features/parameters for the prediction model is one of the most important steps.
Most of the studies shown above for identifying factors employed traditional data analysis procedures for selecting parameters, such as correlation and linear or logistic regression. It is also possible to use receiver-operating characteristic (ROC) curve analysis to verify the predictive role of each variable and to discover the best cut-off values for them. Correlation is very sensitive to assumption violations: a small, non-significant correlation does not imply poor prediction, and a high, significant correlation does not necessarily imply good prediction either. Linear and logistic regression analyses are likewise sensitive to assumption violations. Because they are based on null-hypothesis significance testing (the P-value), caution is required when verifying which variables better predict obesity: a smaller P-value does not indicate a stronger relationship between independent and dependent variables, and statistical significance does not indicate practical importance [15]. ROC curve analysis is used to provide and verify the quality of cut-off points and is highly recommended in epidemiological studies [16], because it can describe the accuracy with which a variable classifies people into relevant clinical groups. However, the ROC curve methodology is not an informative technique for evaluating the contribution of an additional variable to the model [17], being limited when investigating the improvement in prediction, or in the amount of variance explained, when an additional variable enters the model (incremental validity). In this section we use machine learning techniques to discover relations in the dataset, to verify the incremental validity of additional predictors, and to make accurate predictions on new datasets, which may help us find new robust diagnostic parameters in future work.
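The ROC analysis mentioned above can be summarized by the area under the curve (AUC); a minimal plain-Python sketch using the rank interpretation of AUC (illustrative only; the analyses in this work were run in R):

```python
def roc_auc(scores, labels):
    # AUC equals the probability that a randomly chosen positive case
    # receives a higher score than a randomly chosen negative case
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```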
While working on a problem, we will settle on a number of well-performing data mining models. After tuning the parameters of each, we have to compare the models and discover which perform best and worst.
In this section we outline and compare some general machine learning models used for classification and prediction, in the context of usability, efficiency and challenges in the healthcare domain. We train the 6 machine learning models that we will compare in the next section: Learning Vector Quantization (LVQ), Linear Discriminant Analysis (LDA), Support Vector Machine (SVM) with a Radial Basis Function kernel, Classification and Regression Trees (CART), k-Nearest Neighbors (KNN) and Random Forest (RF). Each model is automatically tuned and evaluated using 3 repeats of 10-fold cross-validation. The evaluation metrics are accuracy and Kappa, because they are easy to interpret.
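The repeated cross-validation resampling scheme can be sketched as follows (a plain-Python illustration; the actual tuning and evaluation were done with the caret package in R):

```python
import random

def repeated_kfold(n_samples, k=10, repeats=3, seed=42):
    """Yield (repeat, fold, test_indices) for repeated k-fold CV:
    each repeat reshuffles the data and partitions it into k
    disjoint test folds."""
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        for f in range(k):
            yield r, f, idx[f::k]
```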
Six different models were fit on the training set, in order to identify which variables, or which combinations of variables, were suitable for obesity identification. To compare the estimated accuracies of the constructed models, the distributions are summarized in a summary table and two comparative plots.
Table Summary
The simplest comparison is to compose a table with one algorithm per row and the evaluation metrics in the columns.
##
## Call:
## summary.resamples(object = results)
##
## Models: LQV, LDA, SVM, KNN, CART, RF
## Number of resamples: 50
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LQV 0.7500 0.8966 0.9286 0.9328 0.9652 1 0
## LDA 0.8621 0.9286 0.9643 0.9608 1.0000 1 0
## SVM 0.8214 0.9286 0.9643 0.9545 1.0000 1 0
## KNN 0.6897 0.8276 0.8889 0.8759 0.9286 1 0
## CART 0.8214 0.9266 0.9630 0.9439 0.9643 1 0
## RF 0.8214 0.9286 0.9643 0.9586 1.0000 1 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LQV 0.5000 0.7841 0.8560 0.8632 0.9280 1 0
## LDA 0.7157 0.8555 0.9263 0.9194 1.0000 1 0
## SVM 0.6196 0.8544 0.9239 0.9045 1.0000 1 0
## KNN 0.3524 0.6492 0.7724 0.7468 0.8539 1 0
## CART 0.6067 0.8456 0.9231 0.8820 0.9275 1 0
## RF 0.6196 0.8555 0.9253 0.9150 1.0000 1 0
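The Kappa statistic summarized above is Cohen's kappa, i.e. accuracy corrected for the agreement expected by chance; a minimal plain-Python sketch (caret computes this in R):

```python
def cohens_kappa(y_true, y_pred):
    # (observed agreement - chance agreement) / (1 - chance agreement)
    n = len(y_true)
    labels = sorted(set(y_true) | set(y_pred))
    p_obs = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_exp = sum((y_true.count(c) / n) * (y_pred.count(c) / n)
                for c in labels)
    return (p_obs - p_exp) / (1 - p_exp)
```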
Box and Whisker Plots
Here we show the spread of the estimated accuracies for the different methods and how they relate.
Statistical significance tests
In this step we build a table of pairwise statistical significance scores, which helps us understand the significance of the differences between the metric distributions of the different machine learning algorithms.
##
## Call:
## summary.diff.resamples(object = diffs)
##
## p-value adjustment: bonferroni
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
##
## Accuracy
## LQV LDA SVM KNN CART RF
## LQV -0.028055 -0.021727 0.056843 -0.011109 -0.025861
## LDA 0.001291 0.006328 0.084898 0.016946 0.002194
## SVM 0.015498 1.000000 0.078570 0.010618 -0.004134
## KNN 7.812e-07 3.145e-11 1.066e-10 -0.067952 -0.082704
## CART 1.000000 0.244335 0.881729 2.217e-07 -0.014752
## RF 0.002369 1.000000 1.000000 1.686e-12 0.339856
##
## Kappa
## LQV LDA SVM KNN CART RF
## LQV -0.056282 -0.041357 0.116307 -0.018866 -0.051814
## LDA 0.001462 0.014925 0.172590 0.037416 0.004469
## SVM 0.032563 1.000000 0.157665 0.022491 -0.010456
## KNN 7.878e-07 3.478e-11 1.947e-10 -0.135173 -0.168121
## CART 1.000000 0.171148 0.859744 4.562e-07 -0.032948
## RF 0.002806 1.000000 1.000000 1.461e-12 0.238005
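The "bonferroni" p-value adjustment reported in the output above simply multiplies each raw p-value by the number of comparisons, capping at 1; a sketch:

```python
def bonferroni(p_values):
    # multiply each p-value by the number of comparisons, capped at 1.0
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```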
Positional genetic techniques require no special prior knowledge of the function of an individual genomic region, but implicate it in the causation of obesity purely on the grounds that identifiable markers in the region are found in obese phenotypes more frequently than would be expected by chance (i.e. they segregate with obesity and are identified in families by linkage or in populations by genetic association studies (GWAS)).
BBS8 is caused by homozygous mutations in the TTC8 gene (608132) on chromosome 14q31. Let's give one example of high-throughput sequence analysis using Bioconductor, displaying gene models and the underlying support across BAM (aligned read) files. We used BAM files from the Bioconductor experiment data package RNAseqData.HNRNPC.bam.chr14, which contains 8 BAM files from an experiment involving knockdown of the gene HNRNPC. Here is the plot of the region track:
(to be continued)
Epistasis, the phenomenon in which several loci (genes) collectively affect a single trait, and pleiotropy, in which a single locus (gene) affects more than one trait (phenotype), have long been recognized as central to the understanding of gene expression. Epistasis and pleiotropy require defining an appropriate genetic element, which can be done in a variety of ways (a gene, a chromosomal segment with high linkage disequilibrium, or a mutation). For example, in article [11] Darabos et al. use SNPs and genes, while in [12] Philip et al. use expression quantitative trait loci (eQTL) as genetic elements. A quantitative trait locus (QTL) is a section of DNA (the locus) that correlates with variation in a phenotype (the quantitative trait). The QTL typically is linked to, or contains, the genes that control that phenotype. QTLs are mapped by identifying which molecular markers (such as SNPs or AFLPs) correlate with an observed trait. This is often an early step in identifying and sequencing the actual genes that cause the trait variation [38]. In this section we use a recently developed method, Combined Analysis of Epistasis and Pleiotropy (CAPE), originally described in [8], that infers directed interaction networks between genetic variants to predict the influence of genetic perturbations on phenotypes. We can apply this method to the case of BBS for the following reasons:
The method uses regression on pairs of loci to detect interaction effects from each locus pair on each phenotype. It then combines the results of the linear regressions across phenotypes to interpret the direction of the interaction. This directional information is calculated through a reparametrization of the coefficients from the pairwise linear regressions. The result is a pair of directed coefficients describing how the two loci influence each other in terms of suppression or enhancement.
We will analyze a dataset described in [9]. This dataset was established to find quantitative trait loci (QTL) for obesity and other risk factors of type II diabetes in a reciprocal back-cross of non-obese non-diabetic (NON/Lt) mice and diabetes-prone, New Zealand obese (NZO/HILt) mice. Included in this dataset are 204 male mice genotyped at 85 markers across the genome. The phenotypes included are the body weight (g), insulin levels (ng/mL), and plasma glucose levels (mg/dL), all measured at age 24 weeks. In addition, there is a variable called “mom” indicating whether the mother of each mouse was normal weight (0) or obese (1).
## List of 6
## $ pheno : num [1:204, 1:4] 58.6 49.9 56 53.7 48.7 41.1 45.4 44.1 40.4 40.1 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:204] "1" "2" "3" "4" ...
## .. ..$ : chr [1:4] "body_weight" "glucose" "insulin" "mom"
## $ geno : num [1:204, 1:84] 1 1 1 0 1 0 1 NA 0 0 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:204] "1" "2" "3" "4" ...
## .. ..$ : chr [1:84] "1" "2" "3" "4" ...
## $ chromosome : chr [1:84] "1" "1" "1" "1" ...
## $ marker.names : chr [1:84] "D1Mit296" "D1Mit211" "D1Mit411" "D1Mit123" ...
## $ marker.num : int [1:84] 1 2 3 4 5 6 7 8 9 10 ...
## $ marker.location: num [1:84] 2.08 10.59 12.62 17.67 22.88 ...
## NULL
First, we examine the distribution of each phenotype:
While body weight looks relatively normally distributed, glucose and insulin have obviously non-normal distributions. Before proceeding with the analysis we mean-centre and normalize all phenotypes. After normalization, only insulin still has a ceiling effect, which cannot be removed by normalization because rank cannot be determined among equal values. We then examine Q-Q plots of pairs of phenotypes to reveal phenotyping errors and other pathologies.
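The normalization step can be sketched as a rank-based inverse-normal transform (a plain-Python illustration, assuming this is the kind of normalization applied; the cape package performs it in R). Tied values share an average rank, which is exactly why the insulin ceiling effect survives normalization:

```python
from statistics import NormalDist

def rank_z(values):
    """Rank-based inverse-normal transform. Ties receive the same
    average rank, so blocks of equal values (a ceiling effect) map
    to the same z-score and cannot be separated."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    nd = NormalDist()
    return [nd.inv_cdf(r / (n + 1)) for r in ranks]
```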
Here the ceiling effect is visible in the insulin measurement, and it cannot be removed through normalization. Nevertheless, the distribution is more similar to those of the other phenotypes than before normalization. Given this ceiling effect we could simply remove insulin from the analysis, but here we leave it in the data object.
This method relies on the selection of two or more phenotypes that have common genetic factors but are not identical across all individuals. Such phenotypes may describe multiple aspects of a single complex trait, (in this case - obesity ), and may encompass a combination of molecular phenotypes, such as plasma glucose levels, and phenotypes, such as body weight, that are measured at the organismal level. The central assumption of this method is that different genetic interactions found for a single gene pair in the context of different phenotypes represent multiple manifestations of a single underlying gene network. By measuring the interactions between genetic variants in different contexts we can gain a clearer picture of the network underlying statistical epistasis [8]. By examining the correlations between phenotypes, we can see that the phenotypes measured in this experiment are correlated, but not identical across all individuals.
Prior to calculating the linear regression models, we decomposed the phenotypes via singular value decomposition (SVD) into their principal components, called eigentraits (ETs) [10]. This step decorrelates the phenotypes, reorganizing phenotypic signals into orthogonal, composite phenotypes. This procedure potentially concentrates genetic associations, in that variants with weak associations to multiple original phenotypes often exhibit strong association to one ET. Although the final CAPE-derived model will be recast in terms of the original phenotypes, this provides enhanced detection of candidate loci for interaction analysis.
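In symbols, the decomposition step can be written as follows (a sketch in standard SVD notation, with $Y$ the $n \times p$ matrix of mean-centred, normalized phenotypes):

$$ Y = U \, \Sigma \, V^{\mathsf{T}} $$

The orthogonal columns of $U$ (scaled by the singular values) serve as the eigentraits, and the fraction of phenotypic variance captured by ET $j$ is $\sigma_j^2 / \sum_k \sigma_k^2$.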
Before selecting eigentraits for the analysis, they should be examined:
This figure demonstrates the contributions of each phenotype to each ET. Green squares indicate a positive contribution, while yellow squares indicate a negative contribution. Gray bars show the percent variance accounted for by each ET. Here, the first eigentrait captures more than 70% of the variance in the three phenotypes. This eigentrait describes the processes by which body weight, glucose levels and insulin levels all vary together. The second eigentrait captures nearly 20% of the variance in the phenotypes. It captures the processes through which glucose and body weight vary in opposite directions. This eigentrait may be important in distinguishing the genetic discordance between obesity and diabetes. The third eigentrait describes the divergence of blood glucose and insulin levels and may represent a genetic link between glucose and body weight that is non-insulin dependent.
Once the eigentraits for the analysis have been selected, the single-locus scan is run to investigate how individual markers are associated with each eigentrait. This scan performs a linear regression at each marker. The single-variant scan can be used as a pre-processing step to filter variants and choose those that will be included in the pair scan. This is useful because the number of possible variant pairs may be too large to test exhaustively on large data sets. It also helps avoid interactions being obscured by large main effects: variants with large main effects can be used as covariates in the pair scan. The single-marker scan currently does not support markers on sex chromosomes. Because the X chromosome is hemizygous in males, sex differences in phenotype can lead to false associations, and markers on this chromosome require special consideration [39].
We used linear regression to determine the association of each locus with each phenotype.
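This per-marker regression can be sketched in plain Python (illustrative; the actual scan is performed by the cape package in R, and genotype coding here is an assumption):

```python
def single_locus_scan(genotype, phenotype):
    """Simple regression of one phenotype on one marker's genotype
    dosage; returns (slope, t-statistic) for the marker effect."""
    n = len(genotype)
    mx = sum(genotype) / n
    my = sum(phenotype) / n
    sxx = sum((x - mx) ** 2 for x in genotype)
    sxy = sum((x - mx) * (y - my) for x, y in zip(genotype, phenotype))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x)
             for x, y in zip(genotype, phenotype)]
    sse = sum(e * e for e in resid)
    se = (sse / (n - 2) / sxx) ** 0.5  # standard error of the slope
    return slope, slope / se
```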
The results of the single variant scan can be visualized:
For better comparison of effects between phenotypes we present heatmap which demonstrate the effects of each regression on each phenotype:
Each of the phenotypes had multiple associated QTL, and these QTL often overlapped across multiple phenotypes. For example, body weight and glucose shared a large QTL on Chr 15, and also they showed overlapping QTL on Chr 1. These overlapping QTL indicate the possibility of common genetic factors underlying multiple phenotypes, such that information can be combined across multiple phenotypes to gain information about individual loci. Unique QTL were also observed, providing non-redundant information to discern genetic factors with phenotypic specificity.
Above, we decomposed the normalized, mean-centered phenotypes into eigentraits and described them. As seen in the figure below, single-locus associations with each ET detected multiple QTL:
The results of the single variant scan can be visualized:
In this figure the t-statistic of each marker is plotted as a vertical line. Results for both eigentraits are shown here as ET1 and ET2, and chromosome numbers are written along the x axis. In this example we will not filter the markers and use all markers in the pairwise scan. Since ET are linear combinations of traits, each QTL indicates a potentially pleiotropic association with varying effect strengths on each trait. For example, data for body weight and glucose had overlapping QTL on Chr 1. These phenotypes also contributed substantially to ET1, and there was a corresponding significant QTL for ET1 on Chr 1 representing the common QTL.
The purpose of the pairwise scan is to find interactions, or epistasis, between variants. The epistatic models are then combined across phenotypes or eigentraits to infer a network that takes data from all eigentraits into account. But before running the pair scan, it is important to reduce the genetic matrix to markers that are linearly independent of one another. Some of the markers used were tightly linked, so we need to calculate the correlation between all pairs of markers. High Pearson correlation between markers indicates that the individual markers supply redundant information, which can lead to false-positive interactions. To address this problem, we filter the marker pairs by removing all highly correlated pairs from consideration, using a Pearson correlation cutoff of r ≥ 0.75.
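The filtering rule can be sketched in plain Python (an illustration of the r ≥ 0.75 cutoff with a hypothetical `filter_pairs` helper; cape performs this in R, and the code assumes each marker column varies):

```python
def pearson(x, y):
    # Pearson correlation between two equal-length numeric sequences
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def filter_pairs(genotypes, cutoff=0.75):
    """Keep only marker pairs whose genotype columns are NOT highly
    correlated (|r| < cutoff); highly correlated pairs are redundant."""
    names = list(genotypes)
    keep = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson(genotypes[names[i]], genotypes[names[j]])
            if abs(r) < cutoff:
                keep.append((names[i], names[j]))
    return keep
```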
The figure above shows the locations of the discarded markers: which markers have been selected for the pair scan and which have been rejected. A total of 2 marker pairs (56, 57) were eliminated in this filtering step. After ensuring that all markers are linearly independent and thresholded satisfactorily, the pair scan can be run. Pair scans are performed for each ET by multivariate linear regression, with an intercept, covariates, main effects and an interaction term for each pair of markers. The interaction (epistasis) coefficients for each ET are plotted in the following PDF file:
Click here to download the big picture of the results of the pairwise scan.
This plot shows the resulting t-statistic for each pair of markers for ET1. The symmetric matrix of all marker-pair interaction terms is displayed in matrix form, with gray and white bars along the axes marking chromosome boundaries. 82 markers were selected for the pair scan, which makes 3321 possible pairs.
The coefficients from the pairwise regression are then reparametrized to give direct influence coefficients, describing how each marker either enhances or suppresses the activity of the other marker. First, two new parameters (δ1 and δ2) are defined in terms of the interaction coefficients from the pairwise regression. δ1 can be thought of as the additional genetic activity of the variant near marker 1 when the variant near marker 2 is present. The δ terms are independent of phenotype and together completely describe the interaction term. They can be interpreted as the extent to which each marker influences the effect of the other on the phenotypes. For example, a negative δ2 indicates that the presence of variant 2 suppresses the effect of variant 1 on the ETs. The δ terms are related to both the main effects and the interaction effects. The picture below presents these direct influence coefficients:
In the figure above all non-significant interactions are colored white.
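Schematically, the reparametrization described above can be written as follows (a sketch of the CAPE formulation; consult [8] for the exact conventions). For each eigentrait $j$, the pairwise model is

$$ y^{(j)} = \beta_0^{(j)} + \beta_1^{(j)} x_1 + \beta_2^{(j)} x_2 + \beta_{12}^{(j)} x_1 x_2 + \varepsilon^{(j)} $$

and the phenotype-independent pair $(\delta_1, \delta_2)$ is chosen so that, across all eigentraits,

$$ \beta_{12}^{(j)} \approx \delta_1 \, \beta_1^{(j)} + \delta_2 \, \beta_2^{(j)}, $$

i.e. the interaction term is explained as each variant modulating the other variant's main effect.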
As an example, we present the first 7 rows of the table of significant influences:
| Source | Target | Effect | SE | X.Effect..SE | P_empirical | qval |
|---|---|---|---|---|---|---|
| mom | insulin | 0.5492437 | 0.0911159 | 6.027969 | 0 | < 0.00296 |
| D1Mit76 | insulin | 0.8006307 | 0.1378770 | 5.806849 | 0 | < 0.00296 |
| D1Mit123 | insulin | 0.7471561 | 0.1392996 | 5.363661 | 0 | < 0.00296 |
| D1Mit411 | insulin | 0.7429545 | 0.1389041 | 5.348688 | 0 | < 0.00296 |
| mom | body_weight | 0.5686825 | 0.1085399 | 5.239388 | 0 | < 0.00296 |
| D1Mit213 | insulin | 0.7201700 | 0.1403843 | 5.129990 | 0 | < 0.00296 |
| mom | glucose | 0.5539787 | 0.1082084 | 5.119552 | 0 | < 0.00296 |
The network plot is shown below. It presents the chromosomes in a circle and shows interactions as arrows between regions on the chromosomes:
In this network figure, each chromosome is represented by a black bar labeled with the chromosome number. Main effects are depicted in the gray circles outside the chromosome bars. Green segments represent significant positive main effects; there are no yellow segments, which would represent significant negative main effects. As mentioned, interactions are denoted by arrows between chromosomal regions. The inner, middle and outer main-effect circles show main effects for body_weight, glucose and insulin, respectively. Many subnetworks of individual modules can be seen as arrows linking common regions together. The largest module, for example, comprises interactions between Chrs 1, 2, 3, 4, 6, 8, 9, 10, 11, 12, 13, 14, 16, 17 and 18, with Chr 1 acting as the hub of the module. Another, smaller module links Chr 14 with Chrs 1, 7 and 12. The modules also displayed different mixtures of positive and negative interactions. Some multi-node modules contained both enhancing and suppressing interactions, while one other module contained only positive interactions. The first type of module has a clear hub on Chr 1. This chromosome had positive effects on body weight, glucose and insulin, which were enhanced through interactions with loci on six other chromosomes (Chrs 2, 9, 10, 11, 13 and 16) and suppressed by Chrs 10 and 12. Similarly, Chr 18 had a positive effect on body weight, which was enhanced through interactions with loci on Chrs 6 and 15 and suppressed by Chr 2.
The interaction between Chrs 1 and 12 is illustrative: although both loci have positive effects on body weight, their joint effect is less than additive. This suggests that one or both of the loci are suppressing the effect of the other. The nature of these interactions can be seen by examining the effect plots for them. [to be continued]
Only the Chr 1 locus has an effect on insulin and this effect is independent of the Chr 12 locus.
The main points at which security must be applied within a WBAN are:

- On sensors attached in, on or around the user's body.
- On personal servers (where aggregated information is kept before being sent outside for further processing).
- On communication channels and the various entry/exit points or gateways.
- On the Internet (connecting the medical community outside the WBAN).
- On devices used by clinicians or medical helpers.
A great deal of research has taken place, and is still in progress, to solve this safety problem within WBANs, including the following:
A cloud-based framework using WBANs as its backbone that implements security and privacy techniques [28], [29] is one of a kind. This framework applies security in two steps: any pair of sensors can talk to each other safely using a multi-biometric key generation scheme within the WBAN, and the patient's data stored in the cloud is kept confidential and safe through dynamic reconstruction of metadata. The framework attaches to each patient a personal server, a client interface / data reader, an RBS (remote base station) and a hospital community cloud. The main technique used to secure communication is based on the combination of two biometric values, taken from ECG and EEG devices. It increases the key length using a key-generation algorithm, and these lengthened keys are used to encrypt and decrypt the private data, introducing randomness and unpredictability for attackers. However, this framework does not put sufficient focus on protection against physical tampering and jamming of the sensing devices. Another research paper, by Han, In & Jo [30], has presented a better scheme for data confidentiality in cloud-based WBANs. The authors propose a multi-valued and ambiguous scheme to confine data confidentiality, attempting to secure data communication between the cloud and the WBAN efficiently. It is based on the association of complexity theory with cryptography. The authors compared their results with the standard AES and DES encryption techniques and showed their superiority over these contemporary methods. In another study, presented by Muhannad and Yaser [31], efficient information collection in WBANs is implemented by introducing the cloudlet concept. Reliability and trustworthiness are maintained here up to a point; hence, the integrity of the large volumes of data collected by the different devices attached to patients and users is also preserved.
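The multi-biometric key idea can be sketched as a toy example (a hypothetical `biometric_key` helper; practical schemes like those in [28], [29] use fuzzy-extractor-style constructions so that the two sensors tolerate measurement noise):

```python
import hashlib

def biometric_key(ecg_features, eeg_features):
    """Toy sketch: quantize features from the two biometric signals
    and hash the combined string into a shared key. Coarse rounding
    stands in for noise tolerance; not a real key-agreement scheme."""
    material = ",".join(f"{v:.1f}" for v in ecg_features + eeg_features)
    return hashlib.sha256(material.encode()).hexdigest()
```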
Since WBANs are directly related to human health, values and information generated by or contained in the human body can be used to strengthen security within these systems. Following this idea, the robustness of the human body is used as an inspiration [32], evolving concepts from biology to develop secure systems. The approach to securing WBANs using bio-inspiration developed by Rathore et al. is a revolutionary idea in the current scenario. In this research, security is implemented using the human immune system as its base and inspiration, together with machine learning techniques. Malicious nodes are detected by a machine learning module, while the antigen/antibody concept of the human immune system is used in a separate module to remove malicious nodes from the communication network. In other research, new and improved encryption mechanisms have been developed on the basis of two concepts: DNA computation and chaos theory [33]. This targets secure data communication using the principle that only encrypted information will be transmitted. DNA-based cryptography is not a new method, but combining it with the chaos theory of non-linear mathematical models yields a broad and unpredictable encryption scheme. An unauthorized user will perceive this chaotically encrypted data as noise. Chaos is thus used as a key generator, and it has proved to be a strong pseudorandom generator. Based on this concept, a safe, collision-free and efficient MAC protocol could be developed.
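The chaos-as-key-generator idea can be sketched with the classic logistic map (a toy illustration only, not the actual scheme of [33] and not a vetted cipher):

```python
def chaos_keystream(seed, n_bytes, r=3.99):
    """Logistic-map keystream sketch: x -> r*x*(1-x). The orbit is
    deterministic for a given seed but highly sensitive to the seed,
    which is what makes chaotic maps attractive as pseudorandom
    generators. Toy code; real designs need careful cryptanalysis."""
    x = seed
    out = []
    for _ in range(n_bytes):
        x = r * x * (1 - x)
        out.append(int(x * 256) % 256)  # quantize the orbit to a byte
    return out
```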
In these mechanisms, many existing routing and transmission control protocols are redesigned and rebuilt in order to make WBANs secure and privacy-preserving. Another area of development consists of new security protocols to defeat the evil intentions of cryptanalysts. Zhang et al. developed a secure and lightweight admission and transmission protocol for WBSNs and WBANs [34]. This protocol uses PWH (Personal Wireless Hub) and PHI (Personal Health Information) as its basic terms: the PWH is the local processing unit of the WBAN, and the data collected by the sensors is the PHI. Data is forwarded from the PWH to a remote healthcare centre for any necessary actions. In this research, both the security of PHI transmission and the privacy of PHI are handled properly. A polynomial-based authentication scheme is explored and used to fulfill the required security and privacy implementation. Eavesdropping is controlled and prevented by pairwise key generation and usage between two non-malicious nodes. Security during data transmission is provided by a protocol combining symmetric encryption with a sub-keyed hash function. With this methodology, the major security aspects of confidentiality, authentication and integrity are accomplished. Another line of security protocol development is enriched by the proper usage of ideas based on PAKE (password-authenticated key exchange). A detailed analysis of PAKE protocols [35] provides a transparent view of secure WBANs; various limitations of PAKE protocols, such as forward secrecy, impersonation attacks, dictionary attacks and replay attacks, are also analyzed thoroughly there. Hence, these studies give a path to move ahead and create further security protocols to strengthen the security of data and communication channels within WBANs. This observation is the first step toward finding and solving existing problems in e-healthcare arrangements.
Our future research should move in the direction of multidisciplinary collaboration to reach better, promising and as yet unexplored areas of WBAN-based e-healthcare arrangements. Further research and methodical design are required in this area to enhance the security and privacy of applications based on WBANs.
Table 1
Details of the health trackers
List of health trackers. Click on the name of a health tracker to go to the corresponding website.
(XLSX)
All relevant data are within the paper and its Supporting information files.