Abstract:. Document classification is challenging due to handling of voluminous and highly non-linear data, generated exponentially in the era of digitization. Proper representation of documents increases efficiency and performance of classification, ultimate goal of retrieving information from large corpus. Deep neural network models learn features for document classification unlike the engineered feature based approaches where features are extracted or selected from the data. In the paper we investigate performance of different classifiers based on the features obtained using two approaches. We apply deep autoencoder for learning features while engineering features are extracted by exploiting semantic association within the terms of the documents. Experimentally it has been observed that learning feature based classification always perform better than the proposed engineering feature based classifiers.
Conclusion and Future Work: In the paper we emphasize the importance of feature representation for classification. The potential of deep learning in feature extraction process for efficient compression and representation of raw features is explored. By conducting multiple experiments we deduce that a DBN – Deep AE feature extractor and a DNNC outperforms most other techniques providing a trade-off between accuracy and execution time. In this paper we have dealt with the most significant feature extraction and classification techniques for text documents where each text document belongs to a single class label. With the explosion of digital information a large number of documents may belong to multiple class labels handling of which is a new challenge and scope of future work. Word2vec models  in association with Recurrent Neural Networks(RNN) [4,14] have recently started gaining popularity in feature representation domain. We would like to compare their performance with our deep learning method in future. Similar feature extraction techniques can also be applied to image data to generate compressed feature which can facilitate efficient classification. We would also like to explore such possibilities in our future work.
Abstract: One of the areas where Artificial Intelligence is having more impact is machine learning, which develops algorithms able to learn patterns and decision rules from data. Machine learning algorithms have been embedded into data mining pipelines, which can combine them with classical statistical strategies, to extract knowledge from data. Within the EU-funded MOSAIC project, a data mining pipeline has been used to derive a set of predictive models of type 2 diabetes mellitus (T2DM) complications based on electronic health record data of nearly one thousand patients. Such pipeline comprises clinical center profiling, predictive model targeting, predictive model construction and model validation. After having dealt with missing data by means of random forest (RF) and having applied suitable strategies to handle class imbalance, we have used Logistic Regression with stepwise feature selection to predict the onset of retinopathy, neuropathy, or nephropathy, at different time scenarios, at 3, 5, and 7 years from the first visit at the Hospital Center for Diabetes (not from the diagnosis). Considered
variables are gender, age, time from diagnosis, body mass index (BMI), glycated hemoglobin (HbA1c), hypertension, and smoking habit. Final models, tailored in accordance with the complications, provided an accuracy up to 0.838. Different variables were selected for each complication and time scenario, leading to specialized models easy to translate to the clinical
Conclusions: This work shows how data mining and computational methods can be effectively adopted in clinical medicine to derive models that use patient-specific information to predict an outcome of interest. Predictive data mining methods may be applied to the construction of decision models for procedures such as prognosis, diagnosis and treatment planning, which—once evaluated and verified—may be embedded within clinical information systems. Developing predictive models for the onset of chronic microvascular complications in patients suffering from T2DM could contribute to evaluating the relation between exposure to individual factors and the risk of onset of a specific complication, to stratifying the patients’ population in a medical center with respect to this risk, and to developing tools for the support of clinical informed decisions in patients’ treatment.
Abstract: To estimate the quality of the induced predictive model we generally use measures of averaged prediction accuracy, such as the relative mean squared error on test data. Such evaluation fails to provide local information about reliability of individual predictions, which can be important in risk-sensitive fields (medicine, finance, industry etc.). Related work presented several ways for computing individual prediction reliability estimates for single-target regression models, but has not considered their use with multi-target regression models that predict a vector of independent target variables. In this paper we adapt the existing single-target reliability estimates to multi-target models. In this way we try to design reliability estimates, which can estimate the prediction errors without knowing true prediction errors, for multi-target regression algorithms, as well. We approach this in two ways: by aggregating reliability estimates for individual target components, and by generalizing the existing reliability estimates to higher number of dimensions. The results revealed favorable performance of the reliability estimates that are based on bagging variance and local cross-validation approaches. The results are consistent with the related work in single-target reliability estimates and provide a support for multi-target decision making.
Conclusion: In the paper we proposed several approaches for estimating the reliabilities of individual multi-target regression predictions. The aggregated variants (AM, l2 and +) produce a single-valued estimate which is preferable for interpretation and comparison. The last variant (+) is a direct generalization of the singletarget estimators from the related work. Our evaluation showed that best results were achieved using the BAGV and the LCV reliability estimates regardless the estimate variant. This complies with the related work on the single-target predictions, where these two estimates also performed well. Although all of the proposed variants achieve comparable results, our proposed generalization of existing methods (+) is still the preferred variant due to its lower computational complexity (as estimates are only calculated once for all of the target attributes) and the solid theoretical background. In our further work we intend to additionally evaluate other reliability estimates in combination with several other regression models. We also plan to test the adaptation of the proposed methods to multi-target classification. Reliability estimation of individual predictions offers many advantages especially when making decisions in highly sensitive environment. Our work provides an effective support for model-independent multi-target regression.