Fake and Real News Classification

In this project we will build a model to classify news as fake or real using the data available on Kaggle. Link: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset

The dataset can be used by anyone who wants to learn the basics of natural language processing and model building.

Table of Contents:

  1. Importing Libraries
  2. Importing Data
  3. Data Analysis and Data Cleaning
  4. Feature Engineering
  5. Model Building
  6. Model Evaluation
  7. Future Research

Note: In this notebook I have used seaborn and matplotlib as visualization libraries, but there are more interactive libraries, such as Plotly, that can be used for visualization.

Importing Libraries
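The imports below are a minimal sketch of what this notebook relies on; the exact list depends on which cells you run.

```python
# Core data handling and visualization libraries used throughout the notebook.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```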


Importing Data

Our data is available as two different datasets: fake news and real news. We will combine these datasets into one dataframe for further analysis. We will also add a new column called target to label each record.

Let's check that the columns of our dataframes match, create the target column, and combine the dataframes.
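A sketch of this step, assuming the two Kaggle files are named Fake.csv and True.csv:

```python
import pandas as pd

# Load the two CSV files from the Kaggle dataset.
fake = pd.read_csv("Fake.csv")
real = pd.read_csv("True.csv")

# Sanity check: both frames should have the same columns.
assert list(fake.columns) == list(real.columns)

# Label each frame, then combine into a single dataframe.
fake["target"] = "fake"
real["target"] = "real"
df = pd.concat([fake, real], ignore_index=True)
print(df.shape)
```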



Data Analysis

In the data analysis we will look at the following aspects of the data: column names, structure, and types; missing data; the distribution of column values; and, since we have text data, the distribution of words.

We don't have any missing data in our dataset. All the columns are of type object, including the date column. We have the news title, the news text, the news subject, the date of the news, and the target column marking fake or real news.


target

Our dataset is not highly imbalanced. A slight imbalance may reflect the real-world distribution of news.


subject

Fake and real news are not uniformly distributed across news subjects. We can see that some news subjects are entirely fake while others are entirely real. We need to look into what kinds of words make up fake and real news. It also shows that we should be careful about using the news subject as a feature, since it can make our model biased.


date

Just looking at the date column, we can see that it might have irregularities.

Now, let's look at the timeline of increases and decreases in real and fake news in our dataset.

By looking at the aggregated data, we observe that for the year 2015 we don't have counts for real news, so we will plot counts for the rest of the data.

Let's look at time series plots of daily aggregated counts for fake and real news. We are going to remove the data for 2015 and 2018, since those years contain only one class of news, and use the rest of the data for the analysis.

Clearly, we can see that at the beginning of 2016 we had more fake news than real news, while in 2018 there was a surge in real news. This also raises questions about how the data was collected and about what factors led to the decrease or increase in fake news over the years.

Let's look at the monthly aggregated counts of fake/real news and examine the timeline.

We see a rise in real news articles in August 2017. I went online to research what was going on in the US at this time and found that there was a rise in discussion of fake news at various levels. From the beginning of 2017, discussions on how to combat fake news started. Twitter also started working on initiatives to stop fake and offensive news on their platform.

  1. https://www.nytimes.com/interactive/2017/business/media/trump-fake-news.html
  2. https://blogs.lse.ac.uk/medialse/2017/08/10/the-evolving-conversation-around-fake-news-and-potential-solutions/
  3. https://www.washingtonpost.com/news/the-switch/wp/2017/06/29/twitter-is-looking-for-ways-to-let-users-flag-fake-news/

Let's also check the time series changes in fake/real news in terms of the subject of the news.

In the above time series analysis we have seen that some subjects don't exist in real news while others don't exist in fake news. This raises the question of how the data was collected; if we build a model on this data, it might not generalize well to real-world data. But let's look at the text that makes up the news, as this might help us build some features for a model.


Now, let's look at the columns for the news title and news text. In order to analyze the text columns, we need a few functions to clean the data and create word-count columns in our dataframe.

Just by looking at the text we can see that fake news contains a lot of Twitter handles (@username). This is something we should look into for further analysis.
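A quick way to quantify this, using a simple regular expression over the text column (the num_handles column name is my own):

```python
import re

# Count @mentions (Twitter handles) in each news text.
handle_pattern = re.compile(r"@\w+")
df["num_handles"] = df["text"].apply(lambda t: len(handle_pattern.findall(str(t))))

# Compare the distributions for fake vs. real news.
print(df.groupby("target")["num_handles"].describe())
```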

Looking at the box plot for real news, we can say that most real news does not contain any Twitter handles. Fake news is more spread out than real news in terms of the Twitter handles used. We have a lot of outliers in both fake and real news when it comes to the number of Twitter handles used. The number of Twitter handles mentioned in fake news is much higher than in real news. This might also be due to the sources used to collect the fake and real news: it looks like the fake news data was collected from Twitter while the real news data was collected from other news sources.

Let's look at some other statistics, like the number of words, the number of unique words, and the number of real dictionary words in the fake/real news data.
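A sketch for computing these statistics with pandas (counting dictionary words would additionally need a word list, e.g. NLTK's words corpus):

```python
# Simple word statistics per article.
df["num_words"] = df["text"].str.split().str.len()
df["num_unique_words"] = df["text"].apply(lambda t: len(set(str(t).lower().split())))

# Compare the two classes at a glance.
print(df.groupby("target")[["num_words", "num_unique_words"]].median())
```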

Now, let's move on to cleaning the text data and look at word statistics for fake and real news.

Let's lemmatize our data to reduce inflection and map groups of word forms to the same base form, which generalizes our dataset.
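A minimal cleaning-and-lemmatization helper, sketched with NLTK's WordNetLemmatizer (the exact cleaning steps here are illustrative):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean_text(text: str) -> str:
    """Lowercase, strip non-letters, drop stopwords, and lemmatize."""
    text = re.sub(r"[^a-z\s]", " ", str(text).lower())
    tokens = [lemmatizer.lemmatize(w) for w in text.split() if w not in stop_words]
    return " ".join(tokens)

df["clean_text"] = df["text"].apply(clean_text)
```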

Analyzing the violin plots for unique word counts and word counts in our text column:

  1. The maximum number of total words used in fake news is much higher than the maximum in real news.
  2. The medians for both kinds of news are quite close.
  3. There are many more outliers (total words in text) in fake news than in real news.
  4. The outliers of fake news are at a much more extreme end compared to those of real news.
  5. We also have values of 0 for the number of words in the news text, which shows that we will need to clean some of this data.
  6. Fake news is much more spread out than real news.

IMP: While analyzing datasets, ask these kinds of questions about the data so that they can be communicated to the customer before building the model.

Looking at the violin plots raises the question of why fake news has more words than real news. Is this a correct representation of real-world data? Since this data was downloaded from Kaggle, that can be a hard question to answer: most of the time such data is collected by individuals, and these questions go unanswered.

We also need to make sure that we remove the data which contains 0 words in text before building a model.

Now, let's remove the rows where we don't have any data in the news text column.
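One way to do this, assuming the clean_text column from the lemmatization step above:

```python
# Drop rows whose cleaned text contains no words at all.
before = len(df)
df = df[df["clean_text"].str.split().str.len() > 0].reset_index(drop=True)
print(f"Removed {before - len(df)} empty articles")
```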

Let's analyze the words that make up fake and real news.
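A sketch using the wordcloud package to compare the two classes side by side:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(16, 6))
for ax, label in zip(axes, ["fake", "real"]):
    # Concatenate all cleaned articles of one class into a single string.
    text = " ".join(df.loc[df["target"] == label, "clean_text"])
    ax.imshow(WordCloud(width=800, height=400).generate(text))
    ax.set_title(f"{label} news")
    ax.axis("off")
plt.show()
```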

Looking at the word clouds for fake and real news, the words seem to be quite similar between the two.

Looking at the most common words that occur in fake and real news, they are quite similar to each other. This looks like something that would happen in the real world, where some words appear in fake news (like donald, trump, clinton, obama, etc.) as well as in real news (like donald, trump, republican, official, etc.), which uses the same words to provide clarity about the fake news spread by people.

Let's look at the chi-squared (chi2) values for the words in the news data to find the impact of these words.
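A sketch of the chi-squared scoring with scikit-learn (the vocabulary size and the number of words printed are illustrative choices):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

# Bag-of-words counts, then chi-squared scores against the target.
vectorizer = CountVectorizer(max_features=10000)
X = vectorizer.fit_transform(df["clean_text"])
y = (df["target"] == "fake").astype(int)

scores, _ = chi2(X, y)
top = sorted(zip(vectorizer.get_feature_names_out(), scores),
             key=lambda pair: pair[1], reverse=True)[:20]
for word, score in top:
    print(f"{word}: {score:.1f}")
```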

Clearly, we can see that words like said, black, obama, and clinton have high prominence. But we should also look into topic modelling to understand the presence of these words in the news.

Topic Modelling
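A minimal topic-modelling sketch using scikit-learn's LatentDirichletAllocation (the number of topics is an illustrative choice):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(max_features=5000, stop_words="english")
X = vec.fit_transform(df["clean_text"])

lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(X)

# Print the top words for each discovered topic.
words = vec.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[-10:]]
    print(f"Topic {idx}: {', '.join(top_words)}")
```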

Looking at the topics, we observe that the words with high prominence are present in the topics as well.

Now, let's look at n-grams to check which word sequences characterize each kind of news, using a helper like the sketch below.
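A small helper for extracting the most frequent n-grams per class, sketched with CountVectorizer (the n-gram size and top-k values are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(texts, n=2, k=15):
    """Return the k most frequent n-grams in a collection of texts."""
    vec = CountVectorizer(ngram_range=(n, n), stop_words="english")
    counts = vec.fit_transform(texts).sum(axis=0).A1
    vocab = vec.get_feature_names_out()
    order = counts.argsort()[::-1][:k]
    return [(vocab[i], int(counts[i])) for i in order]

print(top_ngrams(df.loc[df["target"] == "fake", "clean_text"]))
print(top_ngrams(df.loc[df["target"] == "real", "clean_text"]))
```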

Fake News n-gram Analysis
Real News n-gram Analysis

Analyzing the n-grams for fake and real news, both sets of news apparently contain the same kinds of n-grams.



Conclusion of data analysis: Before we start feature engineering and model building, we need to keep in mind the results of our data analysis. We observed that real and fake news share a lot of common keywords. The n-gram analysis shows that the most common n-grams are also quite similar. After the chi-squared analysis, we found that the most prominent words are common to both kinds of news. This is a clear indication that our dataset might be biased. The sources of the datasets are distinct, but the datasets still have a lot in common. Let's first see how the most basic models in the field of machine learning work on our dataset; then we can move on to more complicated models to achieve better accuracy.

Feature Engineering
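A minimal feature-engineering sketch: TF-IDF vectors over the cleaned text plus a train/test split (max_features and test_size are illustrative choices):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# TF-IDF features over the cleaned text.
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(df["clean_text"])
y = (df["target"] == "fake").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```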


Model Building

For the first algorithm, let's use one of the classifiers from the family of passive-aggressive algorithms. You can find a description at: https://scikit-learn.org/stable/modules/linear_model.html#passive-aggressive

Passive-Aggressive Algorithm
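A sketch of fitting the classifier on the TF-IDF features from the previous section:

```python
from sklearn.linear_model import PassiveAggressiveClassifier

# Fit a passive-aggressive classifier on the TF-IDF training features.
pac = PassiveAggressiveClassifier(max_iter=1000, random_state=42)
pac.fit(X_train, y_train)
```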


Logistic Regression Algorithm

Let's build a logistic regression model and also check how many K-best features we need to build a model with reasonable accuracy.
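A sketch of the K-best search, scoring features with chi2 and evaluating logistic regression for a few illustrative values of k:

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Try several values of k and see how accuracy changes.
for k in [100, 500, 1000, 5000]:
    selector = SelectKBest(chi2, k=k).fit(X_train, y_train)
    lr = LogisticRegression(max_iter=1000)
    lr.fit(selector.transform(X_train), y_train)
    acc = accuracy_score(y_test, lr.predict(selector.transform(X_test)))
    print(f"k={k}: accuracy={acc:.3f}")
```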


Deep Learning Model

Let's build a deep learning model to check if we can achieve higher accuracy. We will use GloVe (Global Vectors) embeddings to represent the data features for the model, using the pre-trained GloVe representation built by Stanford.
I found this link helpful for learning about and implementing GloVe embeddings: https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

We will load the GloVe vectors into a dictionary.
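A sketch of the loading step, assuming the 300-dimensional glove.6B file from the Stanford download (the file path is an assumption):

```python
import numpy as np

# Map each word to its pre-trained GloVe vector.
embeddings_index = {}
with open("glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        embeddings_index[word] = np.asarray(values[1:], dtype="float32")
print(f"Loaded {len(embeddings_index)} word vectors")
```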

We will use only 5000 features, and the maximum length of each sequence will be 300. These are hyperparameters which need to be tuned for real-world training and deployment of the model; they can be handled using something like grid search.

Tokenize the training set.

We will encode the training dataset and pad the sequences (because our news texts are of varying length).
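A sketch of the tokenization and padding steps with Keras (the train/test split here is illustrative; 5000 and 300 come from the text above):

```python
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_FEATURES = 5000  # vocabulary size (from the text above)
MAX_LEN = 300        # maximum sequence length (from the text above)

# Illustrative split on the cleaned text; y is the 0/1 target from earlier.
train_texts, test_texts, y_train_dl, y_test_dl = train_test_split(
    df["clean_text"], y, test_size=0.2, random_state=42)

tokenizer = Tokenizer(num_words=MAX_FEATURES)
tokenizer.fit_on_texts(train_texts)

X_train_seq = pad_sequences(tokenizer.texts_to_sequences(train_texts), maxlen=MAX_LEN)
X_test_seq = pad_sequences(tokenizer.texts_to_sequences(test_texts), maxlen=MAX_LEN)
```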

We will create a word-embedding matrix for the words that appear in our training data. We will use the tokenizer that we fit on the training data to locate each word and assign it a weight from the pre-trained GloVe embeddings we loaded.

We will create our embedding layer using the weights from the GloVe word embeddings. Since we do not want to change these weights while training our model, we set trainable to False.
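A sketch of building the embedding matrix and the frozen embedding layer, assuming the tokenizer and embeddings_index from the steps above:

```python
import numpy as np
from tensorflow.keras.layers import Embedding

EMBEDDING_DIM = 300  # must match the GloVe file loaded above

# One row per word in our vocabulary, filled with its GloVe vector when available.
vocab_size = min(MAX_FEATURES, len(tokenizer.word_index)) + 1
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
for word, i in tokenizer.word_index.items():
    if i < vocab_size:
        vector = embeddings_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector

# trainable=False freezes the pre-trained weights during training.
embedding_layer = Embedding(vocab_size, EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_LEN,
                            trainable=False)
```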

Create and Train Model

We can use different optimizers and learning-rate schedules to handle the learning rate.

Training Deep Learning Model
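A sketch of one possible model; the LSTM architecture, optimizer, epoch count, and batch size here are illustrative choices, not the notebook's exact configuration:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam

model = Sequential([
    embedding_layer,            # frozen GloVe embeddings from above
    LSTM(64),
    Dropout(0.3),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer=Adam(learning_rate=1e-3),
              loss="binary_crossentropy",
              metrics=["accuracy"])

history = model.fit(X_train_seq, np.asarray(y_train_dl),
                    validation_split=0.1, epochs=5, batch_size=128)
```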



Model Evaluation

Let's check the accuracy of the passive-aggressive classifier.
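A sketch of the evaluation, assuming the pac classifier and the test split from the earlier sections:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = pac.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(confusion_matrix(y_test, y_pred))
```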

This is really good accuracy, but we need to check what is causing it to be so high.


Even with 10% of the features we are able to achieve good accuracy. Let's check if we can use even fewer features and predict with a similar level of accuracy.

By using only 500 words we can achieve really high accuracy. This shows that our dataset has a high inherent bias.


Evaluate Deep Learning Model

We achieved really good accuracy using the deep learning model.



Conclusion

If you have any feedback, contact me via my LinkedIn profile.