1
00:00:15,690 --> 00:00:20,890
Hello, and welcome to the third lecture of
the course Introduction to Data Analytics.
2
00:00:20,890 --> 00:00:25,150
My name is Professor Nandan Sudarsanam and
today, we are going to be talking about Descriptive
3
00:00:25,150 --> 00:00:31,039
Statistics and more specifically about Graphical
Approaches used in descriptive statistics.
4
00:00:31,039 --> 00:00:37,379
Now, before we jump into the content of
descriptive statistics and its different types,
5
00:00:37,379 --> 00:00:44,100
it makes sense for us to take a step back and
talk more generally about data, what is data
6
00:00:44,100 --> 00:00:50,649
and most critically, what are the different
types of data that you will encounter.
7
00:00:50,649 --> 00:00:56,120
And it is important to do this, because the
descriptive statistics that you use and the
8
00:00:56,120 --> 00:01:01,910
approaches that you use vary according to
the different types of variables or different
9
00:01:01,910 --> 00:01:07,930
types of data that you will encounter and
this is a recurring theme in many other aspects
10
00:01:07,930 --> 00:01:14,680
of this course. So, it makes sense for us to talk
about that for a few minutes right now. So,
11
00:01:14,680 --> 00:01:21,530
data is essentially numbers. Now, it can also
be text or symbols, but more often than not,
12
00:01:21,530 --> 00:01:26,580
you will encounter numbers, which represent
some kind of information.
13
00:01:26,580 --> 00:01:32,140
And therefore, it kind of helps to think
of data as values, because "values" is broad
14
00:01:32,140 --> 00:01:37,700
enough to cover numbers, text, symbols. So,
you can think of data as values of quantitative
15
00:01:37,700 --> 00:01:48,670
and qualitative variables. Now, the variables
themselves can be of different types and that
16
00:01:48,670 --> 00:01:56,060
is, what we are going to talk about right
now. In statistics, you have various classifications
17
00:01:56,060 --> 00:02:03,220
of variables and, depending on
whom you ask, you find different classifications
18
00:02:03,220 --> 00:02:08,420
in different textbooks as well.
But, one broad classification that makes sense
19
00:02:08,420 --> 00:02:14,390
and that is very useful for us is to differentiate
variables as quantitative and qualitative
20
00:02:14,390 --> 00:02:23,890
variables. So, quantitative variables are
also called numerical variables and these
21
00:02:23,890 --> 00:02:30,660
variables are essentially, you know the best
way to think of them is that they have meaning
22
00:02:30,660 --> 00:02:41,350
as a measurement, such as a person's height
or weight or IQ, or they can be some kind of
23
00:02:41,350 --> 00:02:50,910
count, such as the number of something, say the number
of days it has rained, and so on and so forth.
24
00:02:50,910 --> 00:02:58,130
But a very intuitive way, one that
has always been useful for me, is to think of
25
00:02:58,130 --> 00:03:03,630
quantitative variables as variables, where
some form of basic arithmetic like either
26
00:03:03,630 --> 00:03:11,340
adding or averaging or subtracting kind of
makes sense. So, I mean, a definite requirement
27
00:03:11,340 --> 00:03:17,349
is that the quantitative variable is numerical,
but sometimes you can also have numbers being
28
00:03:17,349 --> 00:03:25,130
used as symbols, not as actual numbers.
So, a really simple rule that kind of helps
29
00:03:25,130 --> 00:03:31,069
me identify quantitative variables is to say
that it has to be a numerical variable, that
30
00:03:31,069 --> 00:03:38,940
the values that the variable takes on are
numerical, and that some basic forms of arithmetic
31
00:03:38,940 --> 00:03:45,960
kind of make sense on such variables.
Now, within the quantitative variables you
32
00:03:45,960 --> 00:03:54,730
have continuous and discrete variables. Continuous
variables are essentially ones defined within
33
00:03:54,730 --> 00:04:01,060
a certain interval, and this is an interval
in which the variable
34
00:04:01,060 --> 00:04:08,370
could take on values. Within this interval,
any value is possible; if any value within
35
00:04:08,370 --> 00:04:13,319
this interval is possible, then this variable
is said to be a continuous variable. So, let
36
00:04:13,319 --> 00:04:24,490
me give one example. Let us say that the variable
we are interested in is the height of students,
37
00:04:24,490 --> 00:04:31,630
who are registered for this course.
And so, let us say I go take a sample of
38
00:04:31,630 --> 00:04:40,150
1000 students and write down their heights,
and so this is a data set that I have. The
39
00:04:40,150 --> 00:04:47,470
data set has 1000 values and, let us say, an interval,
and you can maybe get the
40
00:04:47,470 --> 00:04:52,900
interval by taking the highest value and
the lowest value, or you can just pick one, as long
41
00:04:52,900 --> 00:05:00,830
as it is a real interval that covers this
data. The question is, is any value possible
42
00:05:00,830 --> 00:05:08,630
within this interval and the answer is yes.
So, let us say my interval was 120 centimeters
43
00:05:08,630 --> 00:05:15,030
to 140 centimeters; is it possible that someone
who has taken this course could have a height
44
00:05:15,030 --> 00:05:22,930
of 135 centimeters? Absolutely, there is nothing
about heights that inherently stops people
45
00:05:22,930 --> 00:05:29,270
from having that particular height. Now, I
can go even further, I can ask, is it possible
46
00:05:29,270 --> 00:05:38,000
for someone to have a height of 135.156
centimeters? And the answer is still yes. I
47
00:05:38,000 --> 00:05:42,080
mean, I might not have a measuring scale that
goes to that accuracy, but that is a
48
00:05:42,080 --> 00:05:48,240
measurement problem. There is nothing about
heights that prohibits this value from existing
49
00:05:48,240 --> 00:05:54,740
in this data set.
So, continuous data is essentially one, where
50
00:05:54,740 --> 00:05:59,390
any value within a certain interval is possible, and the
interval is sometimes formally defined as
51
00:05:59,390 --> 00:06:06,430
the lowest value in the data set to the highest
value. So, if any value within this interval
52
00:06:06,430 --> 00:06:13,690
is potentially possible, then you have a continuous
data set. The second kind is the discrete,
53
00:06:13,690 --> 00:06:21,419
so the discrete quantitative data is one,
where this condition is not true essentially
54
00:06:21,419 --> 00:06:28,280
and it again helps to think of what kind
of a data set would be such that not
55
00:06:28,280 --> 00:06:32,690
all possible values are there and I will give
you another example for that.
56
00:06:32,690 --> 00:06:43,400
So, let us say that I was interested in looking
at the number of people, who enter IIT Madras
57
00:06:43,400 --> 00:06:48,670
every day, so the number of people who...
Let us make it interesting, the number of
58
00:06:48,670 --> 00:06:52,540
unique people, who enter IIT Madras every
day. So, if you come in and go out, come in
59
00:06:52,540 --> 00:06:57,169
go out n times, I do not care, you are still
one person. So, number of unique people, who
60
00:06:57,169 --> 00:07:05,270
enter IIT Madras on a given day is my variable
and the actual values of this variable I get
61
00:07:05,270 --> 00:07:09,810
from doing this kind of a survey or a study
for 1 year.
62
00:07:09,810 --> 00:07:15,759
So, I have 365 data points, one data point
for each day, which says the number of people
63
00:07:15,759 --> 00:07:25,550
who enter IIT Madras. Now, clearly this variable
is discrete, because let us say there is a lower
64
00:07:25,550 --> 00:07:32,520
bound, which is maybe 0 people, nobody entering
IIT Madras, which is highly unlikely, but possible on a given
65
00:07:32,520 --> 00:07:39,590
day. So, zero is the lower bound and the upper
bound is some, you know, 100,000, 50,000, something.
66
00:07:39,590 --> 00:07:48,850
Now, within this range, is every
value possible? No. You could for instance
67
00:07:48,850 --> 00:07:56,180
never have a day where two and a half people
entered or, let us say, you know, 133.2 people
68
00:07:56,180 --> 00:08:00,090
entered.
So, this is discrete; it is discrete
69
00:08:00,090 --> 00:08:07,620
in the sense that only integer values
are possible; that is, you
70
00:08:07,620 --> 00:08:11,860
can never have values that are non-integers.
So, that is an example of a discrete variable,
71
00:08:11,860 --> 00:08:17,240
and in this particular case, it happens to
be one of being discrete at the integers,
72
00:08:17,240 --> 00:08:21,550
but that is just because of the example I
came up with. You can come up with other examples
73
00:08:21,550 --> 00:08:27,789
where the variable is numerical, and it is,
I mean, it is quantitative, but the values it
74
00:08:27,789 --> 00:08:32,260
can take are discrete.
We move on to the other class of variables,
75
00:08:32,260 --> 00:08:38,550
which are qualitative variables, these are
also known as categorical variables and categorical
76
00:08:38,550 --> 00:08:44,110
variables essentially represent some characteristic,
some characteristic which can be categorized,
77
00:08:44,110 --> 00:08:51,830
which can be grouped. So, examples of that
are things like a person's gender, so that is
78
00:08:51,830 --> 00:08:56,330
the variable, and the values the variable can
take are male and female.
79
00:08:56,330 --> 00:09:04,270
And then, you might have something like marital
status or, a more interesting one might be home
80
00:09:04,270 --> 00:09:11,459
town or state within India. Let us say, let
us take all the people who registered for
81
00:09:11,459 --> 00:09:18,790
this course from within India, which should
be the bulk of them. And the variable we are interested
82
00:09:18,790 --> 00:09:24,880
in is which state you are from; so that is
the variable, the state that you are from, and
83
00:09:24,880 --> 00:09:31,400
the values that this variable can take up
are the different states of India.
84
00:09:31,400 --> 00:09:41,150
Now, the categorical variables again, because
of their definition and their nature are always
85
00:09:41,150 --> 00:09:47,470
discrete, so that should be obvious. But,
within these categorical discrete variables
86
00:09:47,470 --> 00:09:54,040
there are two classes again: nominal
and ordinal. With nominal, there is essentially...
87
00:09:54,040 --> 00:09:59,020
The big difference is that, with nominal, there
is no order. So, a great example of that
88
00:09:59,020 --> 00:10:06,459
would be this home state, which state are
you from, variable. There is no order, which
89
00:10:06,459 --> 00:10:16,330
says that Madhya Pradesh is greater than or
lesser than another state and so on and so
90
00:10:16,330 --> 00:10:19,060
forth.
So, the values that these variables
91
00:10:19,060 --> 00:10:26,290
can take up cannot be ordered in a sensible
way, you know, whereas with ordinal data, by
92
00:10:26,290 --> 00:10:36,320
definition of the variable, there is an order.
Let me give an example of that. Let us say that
93
00:10:36,320 --> 00:10:44,230
we created a variable which is the color
for terror alert; some countries have this,
94
00:10:44,230 --> 00:10:50,500
the terror alert color signifies something.
And let us say the possible values this can take
95
00:10:50,500 --> 00:10:56,010
are green, yellow, orange and red.
So, the variable is the terror alert color;
96
00:10:56,010 --> 00:11:02,860
green, yellow, orange and red are the four
values that this particular variable can take,
97
00:11:02,860 --> 00:11:09,170
where green represents low risk and red represents
very high risk. See, so there again it is
98
00:11:09,170 --> 00:11:14,300
a categorical variable, because it is not
like you can do arithmetic operations on green,
99
00:11:14,300 --> 00:11:20,339
yellow, red. The variable itself is a qualitative
variable, but yet there is some order.
100
00:11:20,339 --> 00:11:29,500
Because you can say things like, if orange
is worse than yellow and yellow is worse than
101
00:11:29,500 --> 00:11:36,200
green, that must mean orange is worse than
green. So, you get the idea: that
102
00:11:36,200 --> 00:11:42,649
is an ordered categorical variable, whereas
with a nominal variable you could not say
103
00:11:42,649 --> 00:11:46,300
if Madhya Pradesh is greater... First of all,
you could never say greater or less than,
104
00:11:46,300 --> 00:11:53,769
so creating more complex relationships becomes
impossible. So, that is just to give you a
105
00:11:53,769 --> 00:11:56,100
very quick idea about the different variable
types.
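As a rough sketch of the nominal versus ordinal distinction just described, and assuming Python purely for illustration, the terror alert colors can be stored with an explicit order while the home states cannot. The helper name is_worse and the second and third state names are assumptions made for this example, not something from the lecture.

```python
# Ordinal: the four alert colors from the lecture, listed from low risk to very high risk.
ALERT_ORDER = ["green", "yellow", "orange", "red"]


def is_worse(a: str, b: str) -> bool:
    """Return True if alert color a signals higher risk than alert color b."""
    return ALERT_ORDER.index(a) > ALERT_ORDER.index(b)


# Nominal: state names have no meaningful order, so no comparison helper is defined.
# "Tamil Nadu" and "Kerala" are illustrative additions.
home_states = {"Madhya Pradesh", "Tamil Nadu", "Kerala"}

# The transitive reasoning from the lecture: orange > yellow and yellow > green.
orange_worse_than_green = is_worse("orange", "yellow") and is_worse("yellow", "green")
```

The position in the list is what carries the order; the colors themselves still support no arithmetic.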
106
00:11:56,100 --> 00:12:03,480
So, now, let us jump into descriptive statistics.
So, descriptive statistics is the idea of
107
00:12:03,480 --> 00:12:10,040
quantitatively describing data and you can
do that through various means, you can do
108
00:12:10,040 --> 00:12:17,269
that through visualization techniques like
graphical representation, tabular representation,
109
00:12:17,269 --> 00:12:25,220
but you can also do that through summary statistics.
The idea here is that, you crunch the data,
110
00:12:25,220 --> 00:12:33,600
you work with the data and come up with 1
or 2 or 3 or 4 different numbers that summarize
111
00:12:33,600 --> 00:12:37,010
the data for you.
In this class we are going to be focusing
112
00:12:37,010 --> 00:12:45,990
more on the graphical and the tabular representation
and the next module is going to be on the
113
00:12:45,990 --> 00:12:56,860
summary statistics, so that is the idea. Now,
this is a very good time for us to just quickly
114
00:12:56,860 --> 00:13:03,120
review, you know in our overview classes we
spoke about descriptive versus inferential
115
00:13:03,120 --> 00:13:08,399
statistics and this is a good point to just
bring that up again and to kind of have a
116
00:13:08,399 --> 00:13:13,200
very quick idea of what descriptive statistics
really means.
117
00:13:13,200 --> 00:13:21,850
The core idea in this dichotomy is that descriptive
statistics is the way to say something
118
00:13:21,850 --> 00:13:28,270
meaningful about the data that you have at hand.
So, you have some data at hand, whether you
119
00:13:28,270 --> 00:13:33,860
call it a sample or a population or whatever,
if you are making a statement based on that
120
00:13:33,860 --> 00:13:43,470
data, about that data, derived from that data,
you are dealing with descriptive statistics.
121
00:13:43,470 --> 00:13:48,839
Descriptive statistics do not, however, allow
us to make conclusions beyond the data we
122
00:13:48,839 --> 00:13:53,730
have.
So, you cannot look at the data, do something
123
00:13:53,730 --> 00:14:00,480
with the data and, based on that, make
a generalization about, potentially, the source
124
00:14:00,480 --> 00:14:09,300
that the data came from; you would
need inferential statistics for that. Now,
125
00:14:09,300 --> 00:14:13,850
having said this, however, descriptive statistics
is still very important, because you cannot
126
00:14:13,850 --> 00:14:18,029
just simply present raw data, it would be
very hard to visualize, especially when the
127
00:14:18,029 --> 00:14:24,560
data is a lot. When you have a lot of data,
you cannot just show the data, you need to
128
00:14:24,560 --> 00:14:31,019
present the data in a more meaningful way,
which allows for some kind of simpler interpretation
129
00:14:31,019 --> 00:14:35,019
and that can be through a graph or through
numbers. Great.
130
00:14:35,019 --> 00:14:42,160
So, and a final point I just want to make
is that, descriptive statistics is not just
131
00:14:42,160 --> 00:14:46,860
confined to a single variable, it can be about
multiple variables and when you are dealing
132
00:14:46,860 --> 00:14:53,060
with multiple variables, our topic of interest
is relationships. So, how does one variable
133
00:14:53,060 --> 00:14:57,390
change with respect to another variable. So,
in essence you will be doing two things which
134
00:14:57,390 --> 00:15:02,260
is summarizing each variable or describing
each variable, but you are also interested
135
00:15:02,260 --> 00:15:06,399
in showing interrelationships between variables.
136
00:15:06,399 --> 00:15:13,680
So, let us go ahead and now, that we have
an understanding of different variable types,
137
00:15:13,680 --> 00:15:19,760
let us talk about some graphical representation
techniques. If one is dealing with a single
138
00:15:19,760 --> 00:15:27,870
variable and let us say it is a categorical
variable, a great way of representing data
139
00:15:27,870 --> 00:15:36,029
could be through a bar graph. So, that is
the graph that you see on your left hand side
140
00:15:36,029 --> 00:15:41,990
here. So, here for instance let us say the
example is one, where we sent out a survey
141
00:15:41,990 --> 00:15:46,660
and asked people what their highest level of
education is, and highest level of education
142
00:15:46,660 --> 00:15:52,360
is the variable; the possible values that it
takes are high school, bachelor's, master's
143
00:15:52,360 --> 00:15:56,339
and doctorate.
Therefore, this is a categorical variable;
144
00:15:56,339 --> 00:16:03,420
there are only four possible qualitative states,
that this particular variable can take up.
145
00:16:03,420 --> 00:16:12,699
And what we are plotting is the
number of responses, or the number of observations
146
00:16:12,699 --> 00:16:20,490
we counted as having these values. So, you
sent the survey out to, let us say, about 50000
147
00:16:20,490 --> 00:16:27,430
people, maybe 60000 from the look of it,
and about 15000 of them said that their highest
148
00:16:27,430 --> 00:16:32,980
level of education was high school.
So, the height represents the frequency
149
00:16:32,980 --> 00:16:43,160
of occurrence of this particular value of
this variable. So, this kind of a representation
150
00:16:43,160 --> 00:16:47,430
can quickly summarize which is more, which
is less and so on. But, an interesting thing
151
00:16:47,430 --> 00:16:55,149
to note out here is that this categorical
variable is actually ordinal, meaning there
152
00:16:55,149 --> 00:17:00,680
is an order going high school, bachelor's,
master's, doctorate. You could have flipped the
153
00:17:00,680 --> 00:17:06,539
whole graph around, but you would still have
the order, in the sense that a doctorate
154
00:17:06,539 --> 00:17:12,350
is more years, I guess, than a master's,
which is more than a bachelor's, which is more
155
00:17:12,350 --> 00:17:18,130
than high school, or whatever.
So, in some sense the variable itself has
156
00:17:18,130 --> 00:17:26,129
an intrinsic ordering and the good thing is
something like a bar graph allows for that.
157
00:17:26,129 --> 00:17:36,690
Just because of the fact that there is this
concept of an x axis, it is very convenient
158
00:17:36,690 --> 00:17:44,470
to represent ordinal variables, which are
categorical. Another way of representing categorical
159
00:17:44,470 --> 00:17:50,929
variables could be something like a pie chart,
this is an example where, let us say, we looked
160
00:17:50,929 --> 00:17:55,070
at the number of students, who were in different
engineering departments.
161
00:17:55,070 --> 00:17:59,109
And your different engineering departments
here are mechanical, civil, electrical and
162
00:17:59,109 --> 00:18:05,200
computer science. These are just some random
departments I chose and again the frequency
163
00:18:05,200 --> 00:18:13,070
of occurrence is represented more as a percentage
of the whole. So, this percentage of this
164
00:18:13,070 --> 00:18:20,789
full circle is computer science students, and
that is the idea behind using the pie chart.
165
00:18:20,789 --> 00:18:27,600
And clearly a pie chart is not very suited
for ordinal variables; it is more suited
166
00:18:27,600 --> 00:18:31,929
for nominal variables,
because there is no order; the fact
167
00:18:31,929 --> 00:18:38,580
that computer science shares the wall with
mechanical and civil is just coincidental,
168
00:18:38,580 --> 00:18:44,179
that is not what a pie chart seeks to capture.
Sometimes people will keep similar things
169
00:18:44,179 --> 00:18:50,740
together, but that is not a requirement. One
important thing about pie charts is that,
170
00:18:50,740 --> 00:18:56,919
usually you want to represent all the values.
So, if there are some engineering departments
171
00:18:56,919 --> 00:19:02,559
that are not being represented, usually a
pie chart need not be the best way, because
172
00:19:02,559 --> 00:19:08,860
there is an impression that this is all the
departments together. So, if there are more
173
00:19:08,860 --> 00:19:14,090
engineering departments, but you wanted to
only show 3 or 4, may be you could use a bar
174
00:19:14,090 --> 00:19:16,749
graph rather than a pie chart.
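To make the bar graph and pie chart ideas concrete, here is a minimal sketch, assuming Python and a made-up list of student records. The department names are the ones mentioned above; the counts are invented for illustration only. The bar heights are just frequencies, and the pie slices are those same frequencies expressed as a percentage of the whole.

```python
from collections import Counter

# Hypothetical data: one entry per student. The counts are made up.
students = (
    ["mechanical"] * 4
    + ["civil"] * 3
    + ["electrical"] * 2
    + ["computer science"] * 1
)

# Bar graph view: frequency of occurrence of each category.
counts = Counter(students)

# Pie chart view: each category as a percentage of all students.
total = sum(counts.values())
shares = {dept: 100 * n / total for dept, n in counts.items()}
```

Because the shares are computed against the total of the categories present, they only add up to the whole when every department is included, which is the caution about pie charts made above.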
175
00:19:16,749 --> 00:19:24,509
Now, we move on to quantitative variables
and with quantitative variables, you have
176
00:19:24,509 --> 00:19:32,960
a couple of different ways of representing
a variable. One example is a box plot. So,
177
00:19:32,960 --> 00:19:38,019
with quantitative variables, remember that
is numerical data and, so you might be interested
178
00:19:38,019 --> 00:19:46,529
in representing things like, what is the average,
what is the variance and in our class on summary
179
00:19:46,529 --> 00:19:49,730
statistics, we are going to go into great
detail about it.
180
00:19:49,730 --> 00:19:58,340
But, for purposes of this, a box plot is essentially
something that captures central tendency,
181
00:19:58,340 --> 00:20:06,989
which is this red line that is there; typically,
that tends to be the median of the data set.
182
00:20:06,989 --> 00:20:13,580
And you have the two bounds of the data set,
so the top and the bottom of the box itself,
183
00:20:13,580 --> 00:20:20,659
and that tends to capture in some way the
variability in the data, and the way a box
184
00:20:20,659 --> 00:20:25,940
plot does that is by representing the lower quartile
and the upper quartile.
185
00:20:25,940 --> 00:20:31,590
Now, the lower quartile and the upper quartile
in really simple words are just the 25th percentile
186
00:20:31,590 --> 00:20:41,269
and the 75th percentile, and so that
range gets captured there, and the
187
00:20:41,269 --> 00:20:48,799
whiskers themselves take on different meanings
depending on which version of the box plot
188
00:20:48,799 --> 00:20:52,549
one is using, but more often than not they tend
to be the lowest value and the highest value in
189
00:20:52,549 --> 00:21:00,609
the data set. And typically red dots like
this represent outliers in the data.
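The numbers behind a box plot can be sketched as follows, assuming Python's standard statistics module and a small made-up data set. This uses the version in which the whiskers are simply the lowest and highest values; other versions of the box plot place the whiskers differently and mark points beyond them as outliers.

```python
import statistics

# Illustrative values only; not data from the lecture.
data = [25, 26, 27, 28, 29, 30, 31, 32, 33]

# With n=4, statistics.quantiles returns the three quartile cut points:
# lower quartile (25th percentile), median, upper quartile (75th percentile).
q1, median, q3 = statistics.quantiles(data, n=4)

# Whiskers in the simple min/max version of the box plot.
low, high = min(data), max(data)

summary = {"low": low, "q1": q1, "median": median, "q3": q3, "high": high}
```

The box is drawn from q1 to q3 with a line at the median, so these five numbers are all that the plot needs.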
190
00:21:00,609 --> 00:21:05,249
The box plot is itself something that will
make more sense to you when we talk about
191
00:21:05,249 --> 00:21:10,080
summary statistics, because you will understand
what exactly a median means; you will understand
192
00:21:10,080 --> 00:21:17,980
the different ways of representing spread
and outliers and so on. But, it helps you at
193
00:21:17,980 --> 00:21:23,019
this stage to kind of say that this is one
simple way to take a data set, which is full
194
00:21:23,019 --> 00:21:30,380
of numbers. So, let us say this data set had,
you know, 500 or 1000 numbers and it looks
195
00:21:30,380 --> 00:21:36,490
like these numbers are pretty much within
the range of, like, 25 to 33 or so, and to represent
196
00:21:36,490 --> 00:21:45,259
all of these numbers in a single graphical
representation. So, great. Another way of representing
197
00:21:45,259 --> 00:21:50,599
quantitative data is through a histogram. The
histogram is what you see on the right hand
198
00:21:50,599 --> 00:22:02,279
side, and the histogram is arguably the richest
representation of numerical quantitative data,
199
00:22:02,279 --> 00:22:11,169
because a histogram essentially says how many
data points you have within a given range.
200
00:22:11,169 --> 00:22:21,399
So, the x axis out here represents the different
possible values that this data set has. So,
201
00:22:21,399 --> 00:22:31,629
if you take this as 8 to 10, this should be
something like 8 to 8.66, and this should be
202
00:22:31,629 --> 00:22:38,619
up to about 9.33, and this should possibly be
10. So, the question is, in your data set, how
203
00:22:38,619 --> 00:22:48,739
many data points do you have that have values
greater than 8 and less than the value out here? So,
204
00:22:48,739 --> 00:22:54,859
another way of showing that, I am going to
try to highlight it: so, we are interested in
205
00:22:54,859 --> 00:23:06,039
this range right here, this range. So, this
is 8 and this is, let us say, 8.66 from reading
206
00:23:06,039 --> 00:23:09,529
the graph, right? How many data points do you
have,
207
00:23:09,529 --> 00:23:14,460
because this is a numerical quantitative
data set, that have values greater than 8 and
208
00:23:14,460 --> 00:23:22,059
less than 8.66? And the answer to that question
is, it looks like 6 data points, right, approximately,
209
00:23:22,059 --> 00:23:28,080
maybe 7, if I am reading the graph correctly.
So, you answer that question for each of these
210
00:23:28,080 --> 00:23:34,190
bins. These are all called bins; each of these
columns is called a bin. For each of these columns
211
00:23:34,190 --> 00:23:41,559
you answer that question, and what you get
is a histogram, and the histogram is the first
212
00:23:41,559 --> 00:23:48,220
step towards empirically constructing what
you will later learn is a distribution.
213
00:23:48,220 --> 00:23:59,710
So, once you capture this, this entire picture
out here you have a very clear representation
214
00:23:59,710 --> 00:24:07,879
of the entire data set. So, again just keep
in mind that we are going to be talking about
215
00:24:07,879 --> 00:24:13,529
distributions, we are going to be talking about
medians, means and variances, but keep in mind
216
00:24:13,529 --> 00:24:17,850
also that this is the graphical way of representing
these things.
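The counting that a histogram does can be sketched in a few lines, assuming Python and a small made-up data set between 8 and 10. The function name histogram, its parameters, and the data values are assumptions for illustration; the three equal-width bins mirror the 8 to 8.66, 8.66 to 9.33, and 9.33 to 10 ranges read off the graph.

```python
def histogram(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width ranges on [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            # The upper edge (v == high) is placed in the last bin.
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts


# Illustrative values only; not the data behind the lecture's graph.
data = [8.1, 8.3, 8.5, 8.9, 9.0, 9.4, 9.9, 10.0]
counts = histogram(data, 8.0, 10.0, 3)
```

Each entry of counts is the height of one column, that is, the answer to "how many data points fall in this range" for that bin.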
217
00:24:17,850 --> 00:24:27,239
So, now we move on to multiple variables,
the last section in graphical representation,
218
00:24:27,239 --> 00:24:35,180
and there are three major forms of representing
this data, and they are the following. The
219
00:24:35,180 --> 00:24:41,519
first is scatter plots. These are very useful
when you have two quantitative variables, that
220
00:24:41,519 --> 00:24:49,769
is, you know, two variables which are
both numerical; you can very easily
221
00:24:49,769 --> 00:24:56,389
represent them using scatter plots, and so the
key thing you should notice in this scatter
222
00:24:56,389 --> 00:25:03,899
plot is that it really helps you understand
the relationship; it does not do a very good
223
00:25:03,899 --> 00:25:11,139
job of describing each individual variable.
So, maybe if you had plotted the distribution of x and
224
00:25:11,139 --> 00:25:16,279
the distribution of y separately, you would have understood
those two variables, but what it does a good
225
00:25:16,279 --> 00:25:22,690
job of is capturing the relationship between
x and y, in this particular case the fact
226
00:25:22,690 --> 00:25:29,700
that in general when x increases it looks like
y also increases, or y is also high, or, you know,
227
00:25:29,700 --> 00:25:33,409
you can always flip it the other way around:
in general when y is high x is high, and when
228
00:25:33,409 --> 00:25:36,629
y is low x is also low.
So, that relationship gets captured. For that
229
00:25:36,629 --> 00:25:45,820
reason this is a great graphical representation
of two variables. Usually, you can extend box
230
00:25:45,820 --> 00:25:53,110
plots if you feel like one of your variables
is categorical and the other is quantitative.
231
00:25:53,110 --> 00:26:00,259
So, you are interested in,
let us say, one variable, which is country, and
232
00:26:00,259 --> 00:26:06,690
the other variable, which is some indicator,
let us say of the economy or crime or whatever;
233
00:26:06,690 --> 00:26:09,529
I have just called it "values" here because
it does not matter.
234
00:26:09,529 --> 00:26:16,669
But this variable is continuous, right? I
mean, I should say quantitative; I do not
235
00:26:16,669 --> 00:26:20,799
know if it is continuous; it could be continuous
or discrete, but this variable is quantitative,
236
00:26:20,799 --> 00:26:26,590
whereas this variable is qualitative. So,
one great way of comparing different
237
00:26:26,590 --> 00:26:34,719
values of a qualitative variable, which have data
that are on a quantitative scale, is to perhaps
238
00:26:34,719 --> 00:26:41,590
use multiple box plots on the same graph,
and that gives you not just an idea of how
239
00:26:41,590 --> 00:26:46,840
on average each country is different, but how they
are also different in terms of their variability
240
00:26:46,840 --> 00:26:52,700
and their outliers and so on and so forth.
Finally, we come to the use of a contingency
241
00:26:52,700 --> 00:27:00,249
table, out here on the extreme right, and
the idea here is that when you have two categorical
242
00:27:00,249 --> 00:27:08,710
variables and what you are interested in representing
is the frequency of occurrence, so the frequency
243
00:27:08,710 --> 00:27:14,279
of occurrence is the theme of the data set,
then contingency tables are great ways to
244
00:27:14,279 --> 00:27:21,929
do that. So, an example that I have come up
with out here is this: let us say you
245
00:27:21,929 --> 00:27:29,990
go to a company and you take a survey of all
the managers working in the company; you may be
246
00:27:29,990 --> 00:27:31,799
interested in asking how many of them have
an MBA.
247
00:27:31,799 --> 00:27:39,609
So, Y represents yes, they have an MBA; N represents
no. Clearly this is a categorical variable, right?
248
00:27:39,609 --> 00:27:44,049
It has only two states. Similarly, you could
ask, before these people joined the company, did
249
00:27:44,049 --> 00:27:49,999
they have work experience before they joined
as managers, and the answer could again be yes
250
00:27:49,999 --> 00:27:55,789
or no, so that is also a categorical variable.
So, here is an example where you have two categorical
251
00:27:55,789 --> 00:28:00,960
variables and what you are interested in is
how many people belong to each combination.
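Counting the combinations for a contingency table can be sketched like this, assuming Python and a handful of made-up survey rows. Each row is a hypothetical (has MBA, prior work experience) pair with values "Y" or "N"; the rows and the names managers and table are invented for illustration.

```python
from collections import Counter

# Hypothetical survey responses: (has_mba, prior_work_experience).
managers = [
    ("Y", "Y"),
    ("Y", "N"),
    ("N", "Y"),
    ("Y", "Y"),
    ("N", "N"),
    ("N", "Y"),
]

# Frequency of each combination of the two categorical variables.
table = Counter(managers)

# Print as a 2 x 2 grid: rows are MBA = Y/N, columns are experience = Y/N.
for mba in ("Y", "N"):
    row = [table[(mba, exp)] for exp in ("Y", "N")]
    print(f"MBA={mba}: experience Y={row[0]}, experience N={row[1]}")
```

Each cell of the grid is one combination, and the cells together account for every manager surveyed.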
252
00:28:00,960 --> 00:28:06,980
So, how many people have MBAs and had work
experience before joining, how many people
253
00:28:06,980 --> 00:28:12,250
had MBAs but did not have work experience before
joining. So, this can be a complex data set
254
00:28:12,250 --> 00:28:19,590
which is very neatly summarized in the contingency
table, and that is something that could be
255
00:28:19,590 --> 00:28:30,700
quite useful. So, I think that is about it
for graphical approaches to representing data
256
00:28:30,700 --> 00:28:37,330
in the module on descriptive statistics, and
in the next class we will be looking more
257
00:28:37,330 --> 00:28:46,259
at summary statistics as the next
sub-module in descriptive statistics.
258
00:28:46,259 --> 00:28:53,059
And then, we will move on to inferential statistics.
Great, thank you for joining me, and I look forward
259
00:28:53,059 --> 00:28:54,959
to seeing you in the next lecture.