Redefining Cancer Treatment with Machine Learning

Photo by National Cancer Institute

Cancer treatment has evolved continuously over the past decades. Scientists have applied different techniques to detect cancer types before they cause symptoms. Recent years have brought many breakthroughs in medicine, along with a large amount of data for medical researchers, who have used machine learning to identify hidden patterns in complex data and try to predict the likely outcomes of a given cancer type.

Given the significance of personalized medicine and the growing use of ML techniques, we will try to solve one such problem: distinguishing the mutations that contribute to tumor growth (drivers) from the neutral mutations (passengers).

Currently, this interpretation of genetic mutations is done manually. It is a very time-consuming task in which a clinical pathologist has to review and classify every single genetic mutation based on evidence from text-based clinical literature. We need to develop a machine learning algorithm that, using this knowledge base as a baseline, automatically classifies genetic variations.

Given Gene, Variation and Text as features, we need to predict the value of the Class variable (the target variable). It’s a multi-class classification problem, and we will measure the performance of our model with the multi-class log-loss metric.
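For reference, the multi-class log-loss is usually written as below. The symbols are introduced here for illustration and do not come from the article: N is the number of points, M the number of classes, y_ij is 1 when point i belongs to class j (0 otherwise), and p_ij is the probability the model assigns to class j for point i.

```latex
\text{log-loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \, \log p_{ij}
```

Lower is better: a model that puts probability 1 on the correct class for every point scores 0, and confident wrong predictions are penalized heavily.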

We will read the data, preprocess the text, split the data into train, test and cross-validation sets, train a random model as a baseline, train different ML models, compute the log-loss and the percentage of misclassified points for each, and then compare them to find the best model.

Chaliye Shuru Karte (Let’s start coding!)

Reading the data

Our data is stored in two files with different separators, so we will read each file separately and then combine them on the “ID” column.
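A minimal sketch of this step, using only the standard library so it runs as-is. The sample rows, the header lines and the two separators ("," and "||") are assumptions for illustration; the article only says that the files use different separators and share an ID column. With pandas, `read_csv` followed by a `merge` on ID would be the more usual route.

```python
import csv
import io

# Hypothetical stand-ins for the two files. The separators ("," and "||")
# and the sample rows are assumptions, not taken from the article.
VARIANTS_CSV = """ID,Gene,Variation,Class
0,FAM58A,Truncating Mutations,1
1,CBL,W802*,2
"""
TEXT_FILE = """ID,Text
0||Cyclin-dependent kinases regulate a variety of cellular processes.
1||Abstract: c-CBL mutations were found in lung cancer samples.
"""


def read_variants(handle):
    """Parse the comma-separated variants file into {ID: row dict}."""
    return {int(row["ID"]): row for row in csv.DictReader(handle)}


def read_text(handle, sep="||"):
    """Parse the text file, where ID and text are separated by '||'."""
    next(handle)  # skip the header line
    texts = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        row_id, _, text = line.partition(sep)
        texts[int(row_id)] = text
    return texts


def merge_on_id(variants, texts):
    """Left-join the text onto the variants using the shared ID column."""
    merged = []
    for row_id, row in sorted(variants.items()):
        # Rows without a matching text entry get Text=None.
        merged.append({**row, "ID": row_id, "Text": texts.get(row_id)})
    return merged


data = merge_on_id(read_variants(io.StringIO(VARIANTS_CSV)),
                   read_text(io.StringIO(TEXT_FILE)))
for row in data:
    print(row["ID"], row["Gene"], row["Variation"], row["Class"], row["Text"][:30])
```

The join is deliberately a left join on the variants, so a variant with no text survives the merge and can be handled during preprocessing.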

Text-Preprocessing and Feature Engineering

After reading the data we preprocess the text: removing stopwords, removing any special characters, normalizing the text and converting all words to lowercase. During this process we found that some rows have no text, so we replace those NaN values with the Gene + Variation values.
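The cleaning steps described above could look roughly like this. The stopword list is a tiny illustrative one (the article does not say which list was used), the function names are hypothetical, and a missing text is represented here as `None` or an empty string rather than a pandas NaN.

```python
import re

# A small illustrative stopword list; this is an assumption, not the
# list used in the article.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "was", "were"}


def clean_text(text):
    """Replace special characters, normalize whitespace, lowercase, drop stopwords."""
    text = re.sub(r"[^a-zA-Z0-9\n]", " ", text)  # special characters -> space
    text = re.sub(r"\s+", " ", text).lower()     # collapse whitespace, lowercase
    return " ".join(word for word in text.split() if word not in STOPWORDS)


def fill_missing_text(row):
    """Fall back to 'Gene Variation' when a row has no text."""
    text = row.get("Text")
    if text is None or not text.strip():
        return f'{row["Gene"]} {row["Variation"]}'
    return text


row = {"Gene": "BRCA1", "Variation": "T1773I", "Text": None}
print(clean_text("The BRCA1 mutation (T1773I) was found in 3 of the samples!"))
print(fill_missing_text(row))
```

Filling the gap with Gene + Variation keeps the row in the dataset, so the model still sees at least the two categorical signals for that point.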

We will now split our data into train, test and cross-validation sets and check whether the distribution of our target values is the same in all three.

Why does the distribution need to be the same? The distribution of the target value should match across the splits so that, during training, our model encounters all the classes in the proportions present in our dataset.
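One way to get matching distributions is a stratified split, which divides each class separately. The sketch below is hand-rolled with the standard library; the 64/16/20 proportions, the seed and the toy labels are assumptions. In scikit-learn, `train_test_split` with the `stratify` argument does the same job.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, fractions=(0.64, 0.16, 0.20), seed=42):
    """Return (train, cv, test) index lists with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, cv, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class before cutting
        n_train = round(fractions[0] * len(indices))
        n_cv = round(fractions[1] * len(indices))
        train += indices[:n_train]
        cv += indices[n_train:n_train + n_cv]
        test += indices[n_train + n_cv:]  # the remainder goes to test
    return train, cv, test


def class_distribution(labels, indices):
    """Fraction of each class among the selected rows."""
    counts = Counter(labels[i] for i in indices)
    return {c: round(n / len(indices), 3) for c, n in sorted(counts.items())}


# Toy labels: an imbalanced three-class target (made-up data).
labels = [1] * 50 + [2] * 30 + [3] * 20
train, cv, test = stratified_split(labels)
for name, part in (("train", train), ("cv", cv), ("test", test)):
    print(name, len(part), class_distribution(labels, part))
```

Printing the per-split class fractions side by side is a quick textual version of the distribution plots: the three dictionaries should be close to each other.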

Distribution of target variable

We will first train a random model to serve as a baseline against which we can compare the performance of our other models.

How do we compute log-loss for a random model in a multi-class setting? For every point in our test and cross-validation data, we randomly generate as many numbers as there are classes (9 in our problem) and then normalize them so that they sum to one.

In the code for this step, we first create an empty array of size 9 (one entry per class label) for each point, fill it with randomly generated probabilities, and then plot the confusion matrix and compute the log-loss.
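The procedure described above can be sketched as follows. This is an illustrative version with made-up labels, not the article's original code; the function names, the sample size of 500 and the seed are assumptions.

```python
import math
import random

N_CLASSES = 9  # class labels 1..9


def random_probabilities(n_points, n_classes, rng):
    """One row of random class probabilities per point, each row summing to 1."""
    rows = []
    for _ in range(n_points):
        raw = [rng.random() for _ in range(n_classes)]
        total = sum(raw)
        rows.append([value / total for value in raw])  # normalize to sum to one
    return rows


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label - 1], eps), 1 - eps)  # labels are 1-based
        total -= math.log(p)
    return total / len(y_true)


def misclassified_fraction(y_true, probs):
    """Share of points whose most probable class differs from the true label."""
    wrong = sum(1 for label, row in zip(y_true, probs)
                if row.index(max(row)) + 1 != label)
    return wrong / len(y_true)


rng = random.Random(0)
y_cv = [rng.randint(1, N_CLASSES) for _ in range(500)]  # made-up labels
cv_probs = random_probabilities(len(y_cv), N_CLASSES, rng)

print("random-model log-loss:", round(multiclass_log_loss(y_cv, cv_probs), 3))
print("misclassified fraction:", round(misclassified_fraction(y_cv, cv_probs), 3))
```

For comparison, predicting a flat 1/9 for every class would score ln 9 ≈ 2.20, so a noisy random model should land somewhat above that. Any trained model worth keeping has to come in well below this baseline.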

We can see that our random model has a log-loss of 2.4 on both the cross-validation and test data, so our models need to perform better than this. Let’s check the precision and recall for this model.

How to interpret the above precision recall matrix?

Precision
1. Take cell (1, 1) as an example: its value is 0.127, meaning that of all the points predicted to be class 1, only 12.7% actually are class 1.

2. For original class 4 and predicted class 2, we can say that of the points our model predicted as class 2, 23.6% actually belong to class 4.

Recall

1. Cell (1, 1) has a value of 0.079, which means that of all the points that actually belong to class 1, our model predicted only 7.9% as class 1.

2. For original class 8 and predicted class 5, the value is 0.250, which means that of all the points that are actually class 8, our model predicted 25% as class 5.
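To make the two readings concrete, here is a small sketch of how such matrices can be derived from a confusion matrix whose rows are the original classes and whose columns are the predicted classes. The labels are made up and the function names are hypothetical; the precision matrix divides each column by its sum, and the recall matrix divides each row by its sum.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """C[i][j] = number of points of true class i+1 predicted as class j+1."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual - 1][predicted - 1] += 1
    return matrix


def precision_matrix(matrix):
    """Divide each column by its sum, so every non-empty column sums to 1."""
    n = len(matrix)
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / col_sums[j] if col_sums[j] else 0.0
             for j in range(n)] for i in range(n)]


def recall_matrix(matrix):
    """Divide each row by its sum, so every non-empty row sums to 1."""
    return [[value / sum(row) if sum(row) else 0.0 for value in row]
            for row in matrix]


# Made-up labels for a three-class toy problem.
y_true = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
y_pred = [1, 2, 3, 2, 2, 1, 3, 3, 3, 2]

C = confusion_matrix(y_true, y_pred, n_classes=3)
P = precision_matrix(C)
R = recall_matrix(C)

print("confusion:", C)
print("precision cell (1, 1):", P[0][0])
print("recall cell (3, 1):", R[2][0])
```

In this toy example, precision cell (1, 1) says that of the points predicted as class 1, half really are class 1, while recall cell (3, 1) says that of the points that are actually class 3, 20% were predicted as class 1.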

Logistic Regression

Performance of Logistic Regression on Cross-Validation
Confusion Matrix of Logistic Regression Model

Support Vector Machine

Performance of SVM on Cross-Validation

Comparison of all the models

We can see that Logistic Regression and Support Vector Machine perform better than the other models in terms of both log-loss and percentage of misclassified points.
