1
00:00:13,870 --> 00:00:21,090
hi welcome to today's lecture so i hope
you have done your assignments and
2
00:00:21,090 --> 00:00:27,099
gone through the multiple choice questions
which were uploaded so today we will start
3
00:00:27,099 --> 00:00:31,080
discussing data and ways of representing
data so broadly speaking there are two components
4
00:00:31,080 --> 00:00:36,880
of statistics one is descriptive statistics
which essentially is to summarize that is to
5
00:00:36,880 --> 00:00:41,080
basically convert raw data into some numbers
so that is what descriptive statistics is
6
00:00:41,080 --> 00:00:52,920
about and the other type of statistics is
called inferential statistics here we want
7
00:00:52,920 --> 00:01:04,530
to develop procedures for finding out or making
distinct conclusions from the measures that
8
00:01:04,530 --> 00:01:06,702
we have drawn from the sample taken from the population
ok
9
00:01:06,702 --> 00:01:14,950
so of course inferential statistics
is the most important thing so there are a few
10
00:01:14,950 --> 00:01:23,060
steps which we need to follow in order to
understand what are the steps in inferential
11
00:01:23,060 --> 00:01:31,030
statistics right so the very beginning the
first thing is to identify what is your question
12
00:01:31,030 --> 00:01:47,250
right what is your question and who is your
population let's say you
13
00:01:47,250 --> 00:01:53,901
want to make and market a soap
for teenagers so what should be the look
14
00:01:53,901 --> 00:01:59,290
and feel of the soap so as to attract teenagers
to using it so your population is teenagers
15
00:01:59,290 --> 00:02:05,330
the question is basically to make a soap
and identify the essential features of the
16
00:02:05,330 --> 00:02:09,899
soap right
so now you want to have a process of selecting a
17
00:02:09,899 --> 00:02:13,590
sample right so you know it's teenagers but
teenagers from where and what is the
18
00:02:13,590 --> 00:02:17,489
proportion of boys versus girls in the
sample right so once you have done that and
19
00:02:17,489 --> 00:02:21,310
we had discussed in the previous lecture
that if your sampling is improper then
20
00:02:21,310 --> 00:02:25,480
it might lead to a completely wrong result
ok
21
00:02:25,480 --> 00:02:31,879
once you select the sample you have to analyze
the information right you select the sample
22
00:02:31,879 --> 00:02:38,880
you ask relevant questions in the course of
a questionnaire and you analyze the responses
23
00:02:38,880 --> 00:02:43,689
given by boys and girls right
and based on that you want to make an inference
24
00:02:43,689 --> 00:02:55,470
that you can apply for the whole population
of teenagers and then finally you want to
25
00:02:55,470 --> 00:03:04,200
determine the reliability of the inference right
you have come up with ok a pink soap
26
00:03:04,200 --> 00:03:07,599
which is more elliptical or
oval in shape is what people would want
27
00:03:07,599 --> 00:03:11,639
so but you want to test the reliability of
this inference
28
00:03:11,639 --> 00:03:18,689
so these are the steps in inferential statistics
ok but the prelude to
29
00:03:18,689 --> 00:03:26,319
inferential statistics is descriptive statistics
and we want to begin with descriptive
30
00:03:26,319 --> 00:03:34,340
statistics ok so in descriptive statistics
one of the most important things is
31
00:03:34,340 --> 00:03:41,980
a variable right what is your variable ok
so a variable is a characteristic which varies
32
00:03:41,980 --> 00:03:48,989
with time and or across different individuals so
our body temperature can be a variable right
33
00:03:48,989 --> 00:03:59,219
so you want to figure out whether someone
has fever or not right so body temperature
34
00:03:59,219 --> 00:04:15,019
is a variable in that case
or you want to figure out what is the
35
00:04:15,019 --> 00:04:18,280
average height of this population right so
then height is the variable similarly weight so on and so
36
00:04:18,280 --> 00:04:22,200
forth so this is just an example of
data of let's say five students in a class
37
00:04:22,200 --> 00:04:28,170
so you have the following categories in other
words you have the following variables what
38
00:04:28,170 --> 00:04:36,160
is the gender what is the year of the
student say you have selected five
39
00:04:36,160 --> 00:04:45,650
students from the hostel which
year are they in first year second year so on and
40
00:04:45,650 --> 00:04:58,400
so forth what are they majoring in maths
physics biology so on and so forth
41
00:04:58,400 --> 00:05:06,330
how many courses have they already done right
so a first year student would have taken probably
42
00:05:06,330 --> 00:05:11,050
five courses already that means this
is the second semester of the first year so on
43
00:05:11,050 --> 00:05:18,680
and so forth ok and what is the g p a of that
particular student so what you see is that the
44
00:05:18,680 --> 00:05:22,229
nature of the variable differs a lot ok so
in case of gender its just a category right
45
00:05:22,229 --> 00:05:25,630
you either have male or female in year you
have a number one two three four ok major
46
00:05:25,630 --> 00:05:30,090
is also a category right you have distinct
categories maths physics biology so on and
47
00:05:30,090 --> 00:05:35,870
so forth number of courses is a variable but
it is a discrete variable right you can have
48
00:05:35,870 --> 00:05:46,060
only natural numbers which are greater
than zero that is one two three like
49
00:05:46,060 --> 00:05:55,430
that ok but c g p a is a fraction right it
is a number which depending on what is
50
00:05:55,430 --> 00:06:04,810
your total c g p a can vary
anywhere between zero and ten let's say ok
51
00:06:04,810 --> 00:06:11,919
but you can have any value which is between
these numbers
52
00:06:11,919 --> 00:06:18,720
so in other words my variable can be divided
into four categories right your type
53
00:06:18,720 --> 00:06:23,860
of data can be qualitative so by qualitative
i mean that gender is for example male
54
00:06:23,860 --> 00:06:29,460
or female or you can have a quantitative
variable which is essentially like c g p a
55
00:06:29,460 --> 00:06:36,550
or which year you are in so
again which year you are in is a discrete
56
00:06:36,550 --> 00:06:43,659
variable and your c g p a is a continuous
variable ok so there are various types of
57
00:06:43,659 --> 00:06:47,300
variables you have to identify depending on
the problem
58
00:06:47,300 --> 00:06:55,052
now let's say we go to another
plot where you have grades so
59
00:06:55,052 --> 00:07:19,219
the mid sem exam is over and you have
graded the students and you want to find out
60
00:07:19,219 --> 00:07:28,539
the statistics as to who has gotten what grades
ok so there is ten percent of the population
61
00:07:28,539 --> 00:07:32,520
which has gotten grade a thirty percent of
the population grade b forty percent grade
62
00:07:32,520 --> 00:07:37,650
c and twenty percent grade d so you can represent
it in what is very popularly
63
00:07:37,650 --> 00:07:42,240
known and used its called a pie chart it is
attractive in nature so what you clearly see
64
00:07:42,240 --> 00:07:44,220
is that forty percent is c and it has the biggest section
of the pie chart
65
00:07:44,220 --> 00:07:53,090
so the area of each slice of this pie chart is
proportional to the relative frequency
66
00:07:53,090 --> 00:08:01,590
of that grade ok so a pie chart is easy
to represent easy to understand but it has
67
00:08:01,590 --> 00:08:06,820
its share of problems so we need to know what
are these problems so in this case
68
00:08:06,820 --> 00:08:12,680
there are only four grades so there are four
categories it is easy to come up with the
69
00:08:12,680 --> 00:08:18,490
pie chart imagine a situation where there
are twenty five different
70
00:08:18,490 --> 00:08:24,310
categories possible
so in other words each of these percentage
71
00:08:24,310 --> 00:08:28,669
areas will keep on shrinking and shrinking
so imagine you have one case which is one
72
00:08:28,669 --> 00:08:39,789
percent and the other one which is forty one
percent so forty one will of course take a
73
00:08:39,789 --> 00:08:44,310
huge chunk of this pie chart but one percent
will barely be visible
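A minimal sketch of why small categories vanish in a pie chart, assuming each slice's central angle is 360 degrees times its share. The helper name `slice_angles` is hypothetical; the grade percentages are the ones quoted in the lecture.

```python
# Hypothetical helper: converts category shares into pie-slice angles.
def slice_angles(percentages):
    """Return the central angle in degrees for each category."""
    total = sum(percentages.values())
    return {name: 360.0 * value / total for name, value in percentages.items()}

# Grade shares quoted in the lecture (A, B, C, D).
grades = {"A": 10, "B": 30, "C": 40, "D": 20}
angles = slice_angles(grades)
for name, angle in angles.items():
    print(f"grade {name}: {angle:.1f} degrees")

# A one percent category gets only a thin sliver of the full circle.
tiny = 360.0 * 1 / 100
print(f"one percent slice: {tiny:.1f} degrees")
```

With twenty five categories the average slice is under fifteen degrees, which is why a bar chart or a table is usually easier to read at that point.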
74
00:08:44,310 --> 00:08:54,160
so in other words it is difficult to represent data
in pie charts when your volume of data increases
75
00:08:54,160 --> 00:08:58,360
such that there are many different categories
possible so you can express this categorical
76
00:08:58,360 --> 00:09:00,480
data in another form
which is widely used which is a bar chart
77
00:09:00,480 --> 00:09:05,600
so same as before you have the percentage
on your y axis and you have the categories
78
00:09:05,600 --> 00:09:10,460
a b c d ok so as before one of the weaknesses
or deficiencies of these bar charts
79
00:09:10,460 --> 00:09:16,420
is that if you have too many bars it looks
cluttered ok if you have few bars then
80
00:09:16,420 --> 00:09:22,190
it is easy to represent ok
so this is coming to the few
81
00:09:22,190 --> 00:09:29,300
bars and again the same problem that i mentioned
before for pie charts right so you have a
82
00:09:29,300 --> 00:09:35,079
one value which is two percent and another
value which is forty percent how can you represent
83
00:09:35,079 --> 00:09:39,959
it in the same bar chart so that
the other person can still make sense out of
84
00:09:39,959 --> 00:09:46,600
it the two percent for all practical purposes
will look like zero so it is nearly impossible
85
00:09:46,600 --> 00:09:56,080
but there is a solution so what we do is when
the variation in data is huge as is in this
86
00:09:56,080 --> 00:10:01,920
particular plot you have three categories
a b c where a's value is around fifty and c
87
00:10:01,920 --> 00:10:06,300
is maybe even six hundred right
so what you can do is introduce something
88
00:10:06,300 --> 00:10:22,620
called a break ok so
you want to show that there is significant
89
00:10:22,620 --> 00:10:28,240
difference between a and b so
whatever is the range maximum of c
90
00:10:28,240 --> 00:10:35,820
till there you can have a continuous
axis in y but after that you can introduce
91
00:10:35,820 --> 00:10:49,630
what is a break right so lets say this guy
is eight hundred so you can introduce a break
92
00:10:49,630 --> 00:11:02,330
at four hundred and then plot again so everything
still fits into the same thing but the essential
93
00:11:02,330 --> 00:11:07,430
part of the information is there for you together
that this is way smaller than this is
94
00:11:07,430 --> 00:11:10,300
also part of the information and this is way
smaller than this is also part of the information
95
00:11:10,300 --> 00:11:14,960
and you want to capture both these things
in the same plot
96
00:11:14,960 --> 00:11:20,529
so let us have a simple example ok we are
talking about working with quantitative data
97
00:11:20,529 --> 00:11:38,620
right so these are the body mass indices of
twenty five people in a class
98
00:11:38,620 --> 00:11:50,580
let's say you have this entire data and of
course these values are continuous variables
99
00:11:50,580 --> 00:11:58,820
so you can have all these values
now we want to know how we can convert it
100
00:11:58,820 --> 00:12:06,370
into a way of representing it so identifying
categories a b is perhaps not a
101
00:12:06,370 --> 00:12:10,380
good way because it is not a discrete quantity
but a continuous quantity but what you can
102
00:12:10,380 --> 00:12:15,480
do is you can identify what is the range right
so in order to identify the range we want
103
00:12:15,480 --> 00:12:18,140
to know what is the smallest value in this
population so i can go through this list and
104
00:12:18,140 --> 00:12:22,420
i think the smallest value is eighteen point
three so eighteen point three is the smallest
105
00:12:22,420 --> 00:12:26,889
value and the largest value
is thirty four point
106
00:12:26,889 --> 00:12:30,170
two ok
so thirty four point two is the largest value
107
00:12:30,170 --> 00:12:35,320
this is smallest this is largest right so
we can divide it so eighteen to thirty four
108
00:12:35,320 --> 00:12:40,820
is roughly a span equal
to sixteen so we can have
109
00:12:40,820 --> 00:12:47,990
a range of four per basket so we can identify
four baskets let's say ok one is eighteen
110
00:12:47,990 --> 00:12:51,410
point three to twenty two point three another
is twenty two point
111
00:12:51,410 --> 00:12:55,790
three to twenty six point three ok we can
have another one which is twenty six point
112
00:12:55,790 --> 00:12:57,870
three to thirty point three and thirty point
three to thirty four point three
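A minimal sketch of sorting values into the four baskets just listed, assuming each basket includes its lower edge and excludes its upper edge so that no value lands in two baskets. The function names and the sample values are hypothetical; only the edges, the minimum 18.3 and the maximum 34.2 come from the lecture.

```python
from bisect import bisect_right

# Basket edges as read out in the lecture: four baskets of width 4 from 18.3.
EDGES = [18.3, 22.3, 26.3, 30.3, 34.3]

def basket_index(x):
    """Return i such that EDGES[i] <= x < EDGES[i + 1]."""
    if not (EDGES[0] <= x < EDGES[-1]):
        raise ValueError(f"{x} is outside the basket range")
    return bisect_right(EDGES, x) - 1

def frequencies(values):
    """Count how many values fall into each basket."""
    counts = [0] * (len(EDGES) - 1)
    for v in values:
        counts[basket_index(v)] += 1
    return counts

# Hypothetical BMI values; the lecture's 25 values are not in the transcript.
bmi = [18.3, 21.0, 22.3, 24.9, 26.3, 27.5, 30.2, 34.2]
print(frequencies(bmi))
```

A value that sits exactly on an edge, such as 22.3 or 26.3, goes to the basket that starts there, which is the rule the lecture spells out next.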
113
00:12:57,870 --> 00:13:00,860
now each of these ranges would mean that
a number x will come into this basket
114
00:13:00,860 --> 00:13:06,730
if let's say that number x is
greater than or equal to twenty
115
00:13:06,730 --> 00:13:15,160
six point three and x is less than thirty
point three so this would make sure the
116
00:13:15,160 --> 00:13:20,680
same point x does not go into multiple
baskets ok so this way what we can generate
117
00:13:20,680 --> 00:13:27,800
is called a histogram ok so you convert the
data into frequencies you can then plot them
118
00:13:27,800 --> 00:13:35,250
as counts or percentages and then you can
have multiple distributions depending on the
119
00:13:35,250 --> 00:13:41,630
nature of the data ok so your histogram looks
something like this so you can have these
120
00:13:41,630 --> 00:13:50,370
bars so in our case we have four bars so we
will have this distribution so these are the
121
00:13:50,370 --> 00:13:58,589
values and this axis is frequency or the number
of them so it is possible
122
00:13:58,589 --> 00:14:01,150
to convert this data
now lets say you are going through this exercise
123
00:14:01,150 --> 00:14:05,199
you have this distribution in one case where
the total number of observations was twenty
124
00:14:05,199 --> 00:14:10,389
five and another distribution where n
is equal to six hundred right is it possible
125
00:14:10,389 --> 00:14:18,950
to put both of these data sets on the same plot
ok and this is where you have to do what is
126
00:14:18,950 --> 00:14:23,910
called a normalization exercise so you
know that the total n is equal to twenty five
127
00:14:23,910 --> 00:14:40,540
and you know each of these frequencies
so you convert it
128
00:14:40,540 --> 00:14:46,760
you normalize the curve in other words you
divide every frequency by the total so if let's say this is my f one
129
00:14:46,760 --> 00:14:52,150
this is my f two this is my f three so on
and so forth i convert them into fractions
130
00:14:52,150 --> 00:15:08,350
ok so the nature of the curve won't change
so this value is now f one by summation
131
00:15:08,350 --> 00:15:20,190
f i ok so it is equal to f one by f one plus
f two plus f three plus f four
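A minimal sketch of the normalization just described, where each frequency f i is divided by the sum of all frequencies. The function name and both sets of counts are hypothetical, chosen only so that one sample has n equal to 25 and the other n equal to 600.

```python
def relative_frequencies(counts):
    """Divide each frequency by the total so the results sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no observations to normalize")
    return [c / total for c in counts]

# Hypothetical frequencies for four baskets (not the lecture's data).
small_sample = [5, 10, 7, 3]          # n = 25
large_sample = [120, 240, 168, 72]    # n = 600

small_norm = relative_frequencies(small_sample)
large_norm = relative_frequencies(large_sample)

# After normalization both samples live on the same 0-to-1 scale,
# so they can share one y axis.
for s, l in zip(small_norm, large_norm):
    print(f"{s:.2f}  {l:.2f}")
```

The shape of each histogram is unchanged by this step; only the y axis changes from counts to fractions.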
132
00:15:20,190 --> 00:15:26,260
so what you will get is a fraction
ok so once we have done this then it is
133
00:15:26,260 --> 00:15:29,970
theoretically possible to generate the following
plot i have the same thing and another one
134
00:15:29,970 --> 00:15:39,920
let's just say hypothetically
ok so if i were to draw
135
00:15:39,920 --> 00:15:54,540
the outlines of these curves they would
look like this so you have one curve like
136
00:15:54,540 --> 00:15:57,600
this and the other curve which is like this
so it is possible to plot both of them at
137
00:15:57,600 --> 00:16:00,740
the same time but what you have to do is normalize
but another caveat of this is you must ensure
138
00:16:00,740 --> 00:16:04,459
that the data is from a similar distribution
so of course there is greater certainty
139
00:16:04,459 --> 00:16:08,519
when you have sampled six hundred
individual measurements but when you are
140
00:16:08,519 --> 00:16:11,620
plotting the same thing with n equal
to twenty five there is a great possibility
141
00:16:11,620 --> 00:16:13,540
that the nature of that distribution will
shift ok
142
00:16:13,540 --> 00:16:21,670
so another type of
plot which is widely used is called a scatter
143
00:16:21,670 --> 00:16:31,910
plot so a scatter plot is just x and y values
let's say i have x versus y i have age
144
00:16:31,910 --> 00:16:36,050
as one variable and the other variable is
let's say weight right so i can have this data where
145
00:16:36,050 --> 00:16:40,730
age is varied so at five years weight
is ten kgs so on and so forth and at
146
00:16:40,730 --> 00:16:59,950
fifty years age weight is sixty kgs so you
have a range ok
147
00:16:59,950 --> 00:17:12,750
now depending on the nature of this data you
might have points which look like
148
00:17:12,750 --> 00:17:21,020
this so this is my x this is my y so you
might have data which looks like this or as
149
00:17:21,020 --> 00:17:24,630
i have plotted in this particular curve you
have a kind of reverse association
150
00:17:24,630 --> 00:17:28,480
where the greater the increase in x
the more the y value decreases with a notable exception
151
00:17:28,480 --> 00:17:33,650
ok so this is where your data analysis comes in
so in this case do you call it a
152
00:17:33,650 --> 00:17:43,440
negative association or do you want to describe
a much more non linear nature of this curve
153
00:17:43,440 --> 00:18:00,880
so scatter plots are widely used so again
here it is better to plot these points as
154
00:18:00,880 --> 00:18:06,350
scattered as opposed to connecting them because then it
is much more difficult to make sense out of this
155
00:18:06,350 --> 00:18:12,120
data ok but if you were to connect
them then you can generate what are typically
156
00:18:12,120 --> 00:18:17,500
called line plots so this is an example
of a line plot where x and y i have plotted
157
00:18:17,500 --> 00:18:22,780
in a shape which looks like an
s in some way so these are reminiscent of
158
00:18:22,780 --> 00:18:27,990
bacterial growth curves so you
can have various functions which describe
159
00:18:27,990 --> 00:18:33,830
these line plots so it makes sense to connect
them with a line when you know that the underlying
160
00:18:33,830 --> 00:18:38,159
phenomenon is actually a physical process
which has a given time constant associated
161
00:18:38,159 --> 00:18:43,630
with it or a given mechanism by
which it happens so it is
162
00:18:43,630 --> 00:18:46,610
under control it is not a completely random
association
163
00:18:46,610 --> 00:19:05,909
so that is when you can have very nice line
plots so just a small detour on the
164
00:19:05,909 --> 00:19:30,169
types of plots you are all
well conversant with linear plots x equal
165
00:19:30,169 --> 00:19:42,890
to y is a very simple plot and you know how
to plot it you have x you have y you take
166
00:19:42,890 --> 00:19:48,070
these points
so lets say this is one comma one you have
167
00:19:48,070 --> 00:19:52,840
minus one comma minus one five comma
five so on and so forth you have a line which
168
00:19:52,840 --> 00:19:58,940
goes like this and this is forty five degrees
right so in general if you have a line which
169
00:19:58,940 --> 00:20:03,570
kind of shifts up so in general
you can have a line which is like this
170
00:20:03,570 --> 00:20:08,970
in this case there is an intercept
a nonzero intercept on the y axis which you
171
00:20:08,970 --> 00:20:14,580
can call y naught and it has a given slope
so you can have m as the slope where m is nothing
172
00:20:14,580 --> 00:20:27,230
but tan theta so in this case y is equal to
y naught plus m x is your equation ok
173
00:20:27,230 --> 00:20:34,350
so you can have multiple types of
functions these are all linear functions
174
00:20:34,350 --> 00:20:37,440
that i have drawn you can have something like
this lets say this is an example of a parabola
175
00:20:37,440 --> 00:20:42,390
so this is x this is y and y is equal to lets
say x square right
176
00:20:42,390 --> 00:20:47,880
so the farther out you go the more you see the
non linear nature of the curve
177
00:20:47,880 --> 00:20:58,950
y is equal to x cubed will look similar
but it will have a much sharper rise
178
00:20:58,950 --> 00:21:16,029
beyond x equal to one and lower values before
this but y is equal to x square
179
00:21:16,029 --> 00:21:23,470
is symmetric whereas y is equal to x cubed
looks like this
180
00:21:23,470 --> 00:21:25,640
when x is negative your y values are negative
ok
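A minimal sketch of the three curves discussed above: the straight line y = y naught plus m x with m equal to tan theta, the parabola y = x squared, and the cubic y = x cubed. The function names and the sample x values are hypothetical and only illustrate the shapes.

```python
import math

def line(x, y0, theta_deg):
    """Straight line y = y0 + m*x, where the slope m is tan(theta)."""
    m = math.tan(math.radians(theta_deg))
    return y0 + m * x

def square(x):
    """Parabola y = x^2, symmetric about the y axis."""
    return x ** 2

def cube(x):
    """Cubic y = x^3, negative whenever x is negative."""
    return x ** 3

# Print x, line (intercept 1, angle 45 degrees), x^2 and x^3 side by side.
for x in (-2, -0.5, 0.5, 2):
    print(x, round(line(x, 1.0, 45.0), 3), square(x), cube(x))
```

The printed rows show the two points made in the lecture: the parabola gives the same value for x and minus x, while the cubic changes sign, and the cubic sits below the parabola between zero and one but above it beyond one.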
181
00:21:25,640 --> 00:21:37,360
so these are some of the simple polynomial
curves you can also have exponential curves
182
00:21:37,360 --> 00:21:49,810
let's say an exponential decay curve
will look like this let's say this is x this is y
183
00:21:49,810 --> 00:21:58,400
so if it is y is equal to e to the power
minus x in terms of decay at x equal
184
00:21:58,400 --> 00:22:13,010
to zero you have y equal to one and then you
have a characteristic time so in the most
185
00:22:13,010 --> 00:22:21,440
general case you have x by tau where tau
represents the time constant of the
186
00:22:21,440 --> 00:22:25,890
decay
so when you are trying to fit data lets say
187
00:22:25,890 --> 00:22:41,789
you have data which looks like this then
it should immediately occur to you that this
188
00:22:41,789 --> 00:22:52,380
has some form it might look like an exponential
or it might look
189
00:22:52,380 --> 00:22:57,029
like a polynomial so polynomials are easy
to fit because they have multiple dimensions
190
00:22:57,029 --> 00:23:01,429
but it is not wise to always fit
every function with a polynomial
191
00:23:01,429 --> 00:23:10,990
now what are the things that you need to keep
in mind while doing these plots let us go
192
00:23:10,990 --> 00:23:16,580
over them one by one so of course when you
make a plot the first thing is to label your variables
193
00:23:16,580 --> 00:23:21,900
x and y you have to put their units ideally
so let's say if this is age then i can have
194
00:23:21,900 --> 00:23:36,760
years if this is weight i can have k
g ok so i need to know what are
195
00:23:36,760 --> 00:23:43,150
my axes and what are my units ok and i need
to choose the appropriate range lets say for
196
00:23:43,150 --> 00:23:50,450
example i want to make a plot of
population expansion right so if i plot like
197
00:23:50,450 --> 00:24:02,179
this right it gives me the impression so sorry
this is k g so let's say it is weight itself
198
00:24:02,179 --> 00:24:09,240
right weight which is increasing as a function
of years which will also probably be a linear
199
00:24:09,240 --> 00:24:12,760
increase and then some saturation
after a point ok
200
00:24:12,760 --> 00:24:18,980
so in this case if i want to show it is linearly
increasing so lets say this maximum value
201
00:24:18,980 --> 00:24:25,600
is around sixty ok
and i am plotting till one fifty so this portion
202
00:24:25,600 --> 00:24:29,900
of my plot is completely wasted because
i am not using the space i am visually
203
00:24:29,900 --> 00:24:33,940
trying to convey that the weight is not changing
much with years but in reality the weight
204
00:24:33,940 --> 00:24:47,970
is changing with years so i should actually
redraw this plot so that this is from zero to
205
00:24:47,970 --> 00:24:49,559
sixty and my curve should look something like
this ok
206
00:24:49,559 --> 00:24:54,840
so if it was till sixty then i can clearly
see there is a non linear increase initially
207
00:24:54,840 --> 00:25:02,150
which means that initially when kids are growing
their weight increases drastically but once
208
00:25:02,150 --> 00:25:08,360
they reach a certain age it starts to kind
of plateau off ok again the other point
209
00:25:08,360 --> 00:25:18,890
is breaks as i said so whatever you have
done in the bar graph you can have the break here
210
00:25:18,890 --> 00:25:27,760
itself again lets say you have a variable
x which goes from zero to hundred and a variable
211
00:25:27,760 --> 00:25:30,289
y which goes from point one to ten thousand
right
212
00:25:30,289 --> 00:25:33,340
so here if you put a linear scale
all values which are very small will
213
00:25:33,340 --> 00:25:36,870
just look like a mess here
but what you can do is you can either plot
214
00:25:36,870 --> 00:25:41,410
it in log scale so if you plot it in log scale
then accordingly every point so this will
215
00:25:41,410 --> 00:25:44,510
be one this will be ten this will be hundred so on and
so forth ok so the points will be well separated
216
00:25:44,510 --> 00:25:50,809
out and you can see them so it is important
to choose an appropriate range again
217
00:25:50,809 --> 00:26:01,020
as before let's say this is from point one
to hundred what i can do is i can introduce
218
00:26:01,020 --> 00:26:13,409
a break so lets say i can have point one to
one and then eighty to hundred if all the
219
00:26:13,409 --> 00:26:19,230
data is just here and the remaining is here
ok so this is how i can really make use of
220
00:26:19,230 --> 00:26:24,559
the whole plot and still plot my axis so that
everything is clearly visible
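A minimal sketch of why a log scale separates values spanning point one to ten thousand, assuming a point's position along the axis is its fraction of the axis length. The function names are hypothetical; the range is the one quoted in the lecture.

```python
import math

# y values spanning the range mentioned in the lecture: 0.1 to 10000.
y_values = [0.1, 1, 10, 100, 1000, 10000]

def linear_position(y, y_max):
    """Fraction of the axis height used by y on a linear axis from 0 to y_max."""
    return y / y_max

def log_position(y, y_min, y_max):
    """Fraction of the axis height used by y on a log10 axis."""
    span = math.log10(y_max) - math.log10(y_min)
    return (math.log10(y) - math.log10(y_min)) / span

for y in y_values:
    lin = linear_position(y, 10000)
    log = log_position(y, 0.1, 10000)
    print(f"y={y:>8}: linear {lin:.5f}, log {log:.2f}")
```

On the linear axis every value up to 100 sits within the bottom one percent of the plot, while on the log axis each factor of ten takes an equal step, so the points are well separated.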
221
00:26:24,559 --> 00:26:30,049
one more thing is let's just say that
all your data is here and there is one outlier
222
00:26:30,049 --> 00:26:33,620
which is here ok all your data
is essentially concentrated in this portion
223
00:26:33,620 --> 00:26:38,570
of the curve but there is one point which
is way out which is an outlier so would
224
00:26:38,570 --> 00:26:44,260
you bother to plot the entire range or would
you just bother to plot the centre
225
00:26:44,260 --> 00:26:50,370
i think it makes sense to plot the centre
and then blow it up ok so you make it
226
00:26:50,370 --> 00:26:56,169
big so then you see all the scatter and
as an inset you can have this higher value
227
00:26:56,169 --> 00:27:04,330
where all these points are looking the same
so make this whole curve the inset so this
228
00:27:04,330 --> 00:27:08,020
is called an inset ok to handle outliers ok
so now these are some single y plots again
229
00:27:08,020 --> 00:27:14,780
let's just say that i have three variables
right time
230
00:27:14,780 --> 00:27:20,700
age and weight ok and i
want to make a single
231
00:27:20,700 --> 00:27:27,010
plot putting all of them together so this
is where you can make use of what is called
232
00:27:27,010 --> 00:27:33,669
a double y plot ok so you can have two y axes
so you can label this as y one axis
233
00:27:33,669 --> 00:27:42,953
this as y two axis this is x and you can
plot them let's say weight
234
00:27:42,953 --> 00:27:58,659
with time might saturate and age with time
has a linear relationship so
235
00:27:58,659 --> 00:28:05,480
if x is my time this is my weight and this
is my age then age will have a linearly
236
00:28:05,480 --> 00:28:07,559
increasing curve ok
so this is just another example of a double
237
00:28:07,559 --> 00:28:10,340
y plot so in this case i have the reverse
way in which one variable exhibits a decrease
238
00:28:10,340 --> 00:28:12,840
a linear decrease with x as a function of
time and the other variable actually exhibits a
239
00:28:12,840 --> 00:28:16,970
saturation profile so beyond a certain value
of x it reaches the saturation ok
240
00:28:16,970 --> 00:28:22,150
so now let us solve a few examples so we have
the following example where you have the number
241
00:28:22,150 --> 00:28:40,040
of visits to a dental clinic in a typical
week so as you can clearly see these
242
00:28:40,040 --> 00:28:43,059
numbers are all discrete numbers you dont
have a fraction because the number of visits
243
00:28:43,059 --> 00:28:45,769
is of course a discrete number and you
want to know what is the best way of plotting
244
00:28:45,769 --> 00:28:50,539
it ok so first you see what is the range right
so we have all the way from one to eight and
245
00:28:50,539 --> 00:28:51,960
i think when you have this kind of data it
is good to sort it so if i were to write the
246
00:28:51,960 --> 00:28:54,490
same data together in a sorted form i have
one so i can count the
247
00:28:54,490 --> 00:28:57,820
frequency of each value so i have one two three four
five six seven eight right and this is
248
00:28:57,820 --> 00:28:59,620
my frequency axis we have the number of
visits and the frequency axis so for number
249
00:28:59,620 --> 00:29:01,580
one the frequency is two
the frequency of two is only one right frequency
250
00:29:01,580 --> 00:29:02,670
of three is one two three frequency of four
is one two three four five frequency of five
251
00:29:02,670 --> 00:29:03,690
is one two three four five six seven frequency
of six is only one frequency of seven is three
252
00:29:03,690 --> 00:29:05,100
eight is one two ok
so you have the number of visits
253
00:29:05,100 --> 00:29:06,620
so of course this axis has to be
one two
254
00:29:06,620 --> 00:29:08,570
three like that and because these numbers
are small there is absolutely no necessity
255
00:29:08,570 --> 00:29:09,680
to make it into a relative frequency you can just
have these values so for example for one it
256
00:29:09,680 --> 00:29:16,290
is two for two it is one for three it is three
for four it is five for five it is seven for six
257
00:29:16,290 --> 00:29:19,510
it is one seven it is three eight it is two
ok so rather than connecting them actually
258
00:29:19,510 --> 00:29:20,960
i should have them as bars ok so what you
see is that it is not a unimodal
259
00:29:20,960 --> 00:29:23,169
distribution there is a reasonable
amount of variation in the data so if this
260
00:29:23,169 --> 00:29:24,169
five was slightly higher then
you would have a nice histogram like shape but this
261
00:29:24,169 --> 00:29:25,169
is different ok
so this is of course a discrete variable
262
00:29:25,169 --> 00:29:27,380
and i think a histogram would
be the easiest way to plot it ok
263
00:29:27,380 --> 00:29:30,500
so a histogram is the way to plot
it let us take another example so i have test
264
00:29:30,500 --> 00:29:31,500
scores of twenty students right
so i have test scores of twenty students and
265
00:29:31,500 --> 00:29:32,500
i want to know what
is the average test score right as before
266
00:29:32,500 --> 00:29:34,500
and what is the way of plotting it as before
i think the histogram is the best way of plotting
267
00:29:34,500 --> 00:29:35,909
it ok so we can again go through the same
process we know what is our lowest number
268
00:29:35,909 --> 00:29:37,581
which is around twenty nine what is our highest
number which is around ninety three and we
269
00:29:37,581 --> 00:29:38,581
can make it into a histogram ok
so again a histogram is a good way of
270
00:29:38,581 --> 00:29:39,581
representing this the last
one is an example so imagine that the above
271
00:29:39,581 --> 00:29:40,581
data is not test scores of twenty students
but rather it is test scores of ten students
272
00:29:40,581 --> 00:29:41,581
in two exams so ten students exam one exam
two so now we have to plot this so
273
00:29:41,581 --> 00:29:42,581
you can make them as two separate histograms
but if you want to plot it in the same plot
274
00:29:42,581 --> 00:29:43,581
maybe it is best to plot for each
student exam one as x and exam two as y and that might show
275
00:29:43,581 --> 00:29:44,581
some correlation between how they performed
in each of the exams ok
276
00:29:44,581 --> 00:29:47,909
so i guess that brings us to the end of this lecture
just a brief recap we discussed
277
00:29:47,909 --> 00:29:51,220
the nature of variables either qualitative
or quantitative we discussed some of the common
278
00:29:51,220 --> 00:29:55,190
ways of representation which are pie chart
bar chart histograms line or scatter plot
279
00:29:55,190 --> 00:29:58,970
and then double y plots ok so depending on the nature
of the data the range the size of the data
280
00:29:58,970 --> 00:30:03,940
you might choose to use the histogram
or the scatter plot as the case may be when
281
00:30:03,940 --> 00:30:06,260
you are trying to look for some correlation
you want to preferably use plots like the scatter
282
00:30:06,260 --> 00:30:11,220
plot ok with that i thank you for today's
lecture and i hope that you attempt the
283
00:30:11,220 --> 00:30:14,130
multiple choice questions which we
upload
284
00:30:14,130 --> 00:30:14,710
thank you