Exploratory Data Analysis using Pandas Profiling

In this article and its companion web app, we are going to talk about data science in its true meaning: the analysis of large collections of data, their visualization and representation, and the systematic study of what has been collected. One of my first projects in data science was about exploratory data analysis using different libraries. Here I'm going to tell you about one such analysis, built on the Pandas Profiling report.

If you've taken any courses on statistics (and by that I mean the more advanced courses that touch on topics like probability distributions, Gaussian functions and normal distributions), you will have come across data analysis at some point or other. Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. It was first proposed by the American mathematician John Tukey (who is also known for his work on the Fast Fourier Transform algorithm, the Tukey range test, the Tukey lambda distribution and more). EDA is different from IDA (Initial Data Analysis), which focuses more narrowly on checking the assumptions needed for model fitting and hypothesis testing.

The objectives of EDA are to:

  • Suggest hypotheses about the causes of observed phenomena
  • Assess assumptions on which statistical inference will be based
  • Support the selection of appropriate statistical tools and techniques
  • Provide a basis for further data collection through surveys or experiments

There are many excellent books on exploratory data analysis, such as the very first, 'Exploratory Data Analysis' by John Tukey, published in 1977; 'Exploratory Data Analysis with MATLAB' by Angel R Martinez, Jeffrey Solka and Wendy L Martinez; 'Exploratory Data Analysis using R' by Ronald K Pearson; 'Graphical Exploratory Data Analysis' by A. G. W. Steyn, Rolf Stumpf, and Stephen Henry Charles Du Toit; 'Exploratory Data Analysis in Empirical Research: Proceedings of the 25th Annual Conference of the Gesellschaft Für Klassifikation E.V., University of Munich, March 14–16, 2001' and many more. Many high quality YouTube videos have also been made on data analysis.


Coming to the app itself: it takes a dataset supplied by the user, performs exploratory data analysis on it using the Pandas Profiling report, and displays the outcome as a large set of statistics, such as the number of variables, number of observations, total memory size, variable types, interactions, correlations and missing values. It also provides descriptive statistics including the mean, standard deviation, skewness and much more. We have also provided a sample dataset to run the app with. It is the Pima Indian Diabetes Dataset (I have an app and article related to that, which have already been uploaded).


Pandas Profiling is an open source Python module that helps us do quick and efficient EDA. However, it is not the best choice for large datasets, because the analysis takes significantly longer as the dataset grows. Pandas Profiling gives us an in-depth analysis of numerical variables, covering quantile and descriptive statistics. It displays the quartile values, which describe how the ordered values in the dataset are distributed above and below the median, and shows the interquartile range, standard deviation, coefficient of variation, mean absolute deviation and skewness.
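To make the point about large datasets concrete, here is a minimal sketch of how a report could be generated while keeping the run time in check. It is not the code behind the app. The function names, the 10,000-row cap and the file paths are assumptions made up for illustration. The package has since been renamed from pandas-profiling to ydata-profiling, so the sketch tries both import names.

```python
def load_report_class():
    """Return ProfileReport from whichever package name is installed."""
    try:
        from ydata_profiling import ProfileReport   # current package name
    except ImportError:
        from pandas_profiling import ProfileReport  # older package name
    return ProfileReport


def rows_to_profile(n_rows, max_rows=10_000):
    """Number of rows to analyze: everything for small data, a cap otherwise.

    The 10,000-row default is an arbitrary illustrative choice.
    """
    return min(n_rows, max_rows)


def profile_csv(csv_path, html_path, max_rows=10_000, minimal=True):
    """Read a CSV file, profile (a sample of) it and write an HTML report."""
    import pandas as pd  # imported here so the helpers above work without pandas

    df = pd.read_csv(csv_path)
    n = rows_to_profile(len(df), max_rows)
    if n < len(df):
        # Profile a reproducible random sample instead of the full dataset.
        df = df.sample(n=n, random_state=0)

    ProfileReport = load_report_class()
    # minimal=True switches off the most expensive parts of the report.
    report = ProfileReport(df, title="EDA report", minimal=minimal)
    report.to_file(html_path)
    return html_path
```

With a hypothetical file name, a call would look like `profile_csv("diabetes.csv", "report.html")`. Sampling trades completeness for speed, so rare values may be missing from the report.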

For those of you who are not familiar with the statistical terms, here are a few definitions:

  • A quartile is a type of quantile which divides the data points into four parts, or quarters, of more-or-less equal size.
  • The interquartile range, also called the midspread, middle 50%, or H‑spread, is a measure of statistical dispersion equal to the difference between the 75th and 25th percentiles, that is, between the upper and lower quartiles.
  • The standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.
  • The mean absolute deviation of a dataset is the average distance between each data point and the mean. It gives us an idea of the variability in a dataset.
  • Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
  • The coefficient of variation (CV) is a measure of the dispersion of data points in a data series around the mean. It is the ratio of the standard deviation to the mean.
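The definitions above can be sketched in a few lines of plain Python using only the standard library. This is an illustration of the formulas, not how Pandas Profiling computes them. The function name `describe` and the sample values are made up, and the sketch uses the population forms of the standard deviation and skewness, so a profiling tool that uses sample-adjusted forms may report slightly different numbers.

```python
import statistics


def describe(values):
    """Compute the summary statistics defined above for a list of numbers.

    Assumes at least two values, not all equal, and a nonzero mean.
    """
    data = sorted(values)
    n = len(data)
    mean = statistics.fmean(data)

    # Quartiles: the three cut points that split the ordered data into quarters.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

    # Population standard deviation (divides by n rather than n - 1).
    std = statistics.pstdev(data)

    # Mean absolute deviation: average distance of each point from the mean.
    mad = sum(abs(x - mean) for x in data) / n

    # Population skewness: third central moment divided by std cubed.
    third_moment = sum((x - mean) ** 3 for x in data) / n
    skewness = third_moment / std ** 3

    return {
        "mean": mean,
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": q3 - q1,          # interquartile range
        "std": std,
        "mad": mad,
        "cv": std / mean,        # coefficient of variation
        "skewness": skewness,
    }


stats = describe([2, 4, 4, 4, 5, 5, 7, 9])
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

For this sample the mean is 5, the standard deviation is 2 and the coefficient of variation is 0.4. The skewness is positive because the largest values sit further from the mean than the smallest ones.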

We have deployed the app using Streamlit, an open source framework that allows data science teams to build and deploy web apps fairly easily. Its hosting is among the best I've used, and it's great for quick and easy deployment of web apps. The app is coded in Python.

Link to the web app: https://share.streamlit.io/skillcode-ml/eda2/main/app.py

