1
00:00:00,669 --> 00:00:04,410
Let us look at the problem of searching for
a value in an array.
2
00:00:04,410 --> 00:00:09,441
So, in general the search problem is to find
whether a value K is present in a
3
00:00:09,441 --> 00:00:14,099
collection of values A and in our case we
will generally think of A as a sequence
4
00:00:14,099 --> 00:00:15,099
of values.
5
00:00:15,099 --> 00:00:19,310
And moreover, we also assume that the sequence
is something like integers, where we can talk
6
00:00:19,310 --> 00:00:21,060
of one value being less than another value.
7
00:00:21,060 --> 00:00:24,732
So, the values can be ordered with respect
to each other.
8
00:00:24,732 --> 00:00:31,360
So, we already saw that we can keep such sequences
in two different ways, as arrays and as lists.
9
00:00:31,360 --> 00:00:35,309
And depending on whether we keep them as arrays
or lists, the way we can access the elements
10
00:00:35,309 --> 00:00:36,309
is different.
11
00:00:36,309 --> 00:00:42,199
So, the first question we might ask is, whether
searching makes a difference in a list versus
12
00:00:42,199 --> 00:00:47,309
an array and the second question we might
ask is, is there some importance to how the
13
00:00:47,309 --> 00:00:51,499
values are arranged in the sequence, does
it help if they are in ascending or descending
14
00:00:51,499 --> 00:00:53,589
order or does it not matter.
15
00:00:53,589 --> 00:01:00,079
Is it equally easy to search for something
in a randomly ordered collection of values
16
00:01:00,079 --> 00:01:03,129
or when it is structured in some particular
way?
17
00:01:03,129 --> 00:01:11,420
So, in the unsorted case we have no choice
basically, we have a sequence A which runs
18
00:01:11,420 --> 00:01:13,329
from 0 to n minus 1.
19
00:01:13,329 --> 00:01:17,200
So, we must look at all the values, because
we have no idea where K may be.
20
00:01:17,200 --> 00:01:23,600
So, a systematic way to do it is to start with
the position 0 and just scan all the way to
21
00:01:23,600 --> 00:01:24,600
n minus 1.
22
00:01:24,600 --> 00:01:31,060
So, we have this loop here, which scans and
this scan either terminates when you reach
23
00:01:31,060 --> 00:01:36,850
the end without finding it or when at some
position i we find that A i is equal to K.
24
00:01:36,850 --> 00:01:41,039
And then depending on that we either say that
it is found, in which case we return the position
25
00:01:41,039 --> 00:01:45,789
or we have reached i equal to n, which means
we have gone beyond A n minus 1 and so, we
26
00:01:45,789 --> 00:01:53,659
return minus 1, which is an invalid
position, to indicate that it is not found.
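As a rough sketch in Python (not taken from the lecture's slides), the scan just described might look like this. The function name linear_search and the use of a Python list to stand in for the array A are assumptions made for illustration.

```python
def linear_search(a, k):
    """Return the first position i with a[i] == k, or -1 if k is absent."""
    i = 0
    n = len(a)
    # Scan from position 0; stop at a match or when i reaches n.
    while i < n and a[i] != k:
        i += 1
    # i == n means we ran past a[n-1] without finding k.
    return i if i < n else -1
```

For example, linear_search([7, 3, 9], 9) returns 2, while searching the same sequence for 4 returns -1.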
27
00:01:53,659 --> 00:02:00,609
So, we saw before that the worst case actually
happens when K is not an element of A, K does
28
00:02:00,609 --> 00:02:04,900
not come in A, we have to scan A 0 to A n
minus 1 in order to determine that K is not
29
00:02:04,900 --> 00:02:05,900
there.
30
00:02:05,900 --> 00:02:09,030
Because we have no information in advance about which
position it is likely to be in.
31
00:02:09,030 --> 00:02:14,040
So, this means that in the worst case searching
for an element in an unsorted array takes
32
00:02:14,040 --> 00:02:15,040
linear time.
33
00:02:15,040 --> 00:02:18,799
And of course, it does not matter now whether
it is an array or a list, because in a list
34
00:02:18,799 --> 00:02:23,030
we could also in linear time start from the
first element and follow the links all the
35
00:02:23,030 --> 00:02:27,110
way to the end, in an array we start with
A 0 and go all the way to A n minus 1, both
36
00:02:27,110 --> 00:02:31,920
of them take linear time.
37
00:02:31,920 --> 00:02:36,330
On the other hand if the sequence is sorted
and in particular if it is an array, we can
38
00:02:36,330 --> 00:02:38,060
be a little more intelligent.
39
00:02:38,060 --> 00:02:42,599
So, what we know is that the values are say
in ascending order.
40
00:02:42,599 --> 00:02:50,689
So, if you probe the value in the middle and
check is this equal to K, if the value that
41
00:02:50,689 --> 00:02:52,860
we have is equal to K of course, we have found
it.
42
00:02:52,860 --> 00:03:00,239
If K is smaller than the value here, then
we only need to search in the left half, and if
43
00:03:00,239 --> 00:03:01,980
K is larger, then we need to search in the right
half.
44
00:03:01,980 --> 00:03:05,489
Now, this is something that we intuitively
do all the time, this is how we look for say
45
00:03:05,489 --> 00:03:11,660
words in a dictionary or when we play 20 questions,
we try to ask questions about the age of a
46
00:03:11,660 --> 00:03:16,510
person, then you say is this person less than
40, is this person greater than 65 and so
47
00:03:16,510 --> 00:03:17,510
on.
48
00:03:17,510 --> 00:03:19,489
So, this is something we know intuitively,
but we can formalize.
49
00:03:19,489 --> 00:03:23,450
So, we take the midpoint of the range we are
searching.
50
00:03:23,450 --> 00:03:27,280
If the value at the midpoint is the one we want,
we have found it; otherwise, depending on the
51
00:03:27,280 --> 00:03:30,739
value we are looking for and what the value
is at the midpoint, we search either the bottom
52
00:03:30,739 --> 00:03:32,019
half or the top half.
53
00:03:32,019 --> 00:03:37,550
So, this has a name which many of you may
already know, this is called binary search.
54
00:03:37,550 --> 00:03:41,269
So, here is a simple recursive algorithm for
binary search.
55
00:03:41,269 --> 00:03:46,250
So, in general it searches an array, remember
that when we do the search we might be searching
56
00:03:46,250 --> 00:03:49,450
different segments depending on how far we
have progressed in the search.
57
00:03:49,450 --> 00:03:54,909
So, in general binary search takes a value
K to search for, an array, and two end points, left
58
00:03:54,909 --> 00:03:59,599
and right and just to make sure that we get
everything right, we will have the convention
59
00:03:59,599 --> 00:04:05,200
that it searches from the index l to the index
r minus 1, that is it searches from l to r,
60
00:04:05,200 --> 00:04:07,080
but not including r itself.
61
00:04:07,080 --> 00:04:17,000
So, now if l and r are actually the same,
then we have an empty array, because l to
62
00:04:17,000 --> 00:04:19,350
r minus 1 is actually something where there
are no elements in between.
63
00:04:19,350 --> 00:04:21,320
So, we say we have not found it.
64
00:04:21,320 --> 00:04:25,420
So, when the interval that we are searching
for becomes empty, the array definitely does
65
00:04:25,420 --> 00:04:27,910
not contain the value we are looking for.
66
00:04:27,910 --> 00:04:32,320
Otherwise, we compute the midpoint between
l and r by taking the sum and dividing by
67
00:04:32,320 --> 00:04:35,620
2 and because this might be an odd number,
we use integer division.
68
00:04:35,620 --> 00:04:40,360
Now, at this point we have found
the midpoint.
69
00:04:40,360 --> 00:04:44,360
So, now, we examine if the value that we want
is there at the midpoint.
70
00:04:44,360 --> 00:04:45,950
If so, we return true.
71
00:04:45,950 --> 00:04:50,320
Otherwise, if the value that we want is smaller
than the value at the midpoint, then we go to the left
72
00:04:50,320 --> 00:04:56,390
and if the value that we want is bigger than
the value at the midpoint, then we go to the right.
73
00:04:56,390 --> 00:05:02,170
So, this either goes from left to mid minus
1 or mid plus 1 to right.
74
00:05:02,170 --> 00:05:04,690
In other words we exclude mid from our search.
75
00:05:04,690 --> 00:05:11,790
So, this, the first case runs the search from
left to mid minus 1 because that is our assumption,
76
00:05:11,790 --> 00:05:14,780
if we call it with mid, it goes to mid minus
1.
77
00:05:14,780 --> 00:05:18,580
This one starts at mid plus 1 and goes to
right minus 1.
78
00:05:18,580 --> 00:05:21,880
So, the original thing was from left to right
minus 1 and we have now excluded mid from
79
00:05:21,880 --> 00:05:25,380
this and we have also halved the interval
to search.
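A minimal Python sketch of the recursive procedure described above, assuming the array is a Python list sorted in ascending order. The name binary_search and the parameter order (k, a, l, r) are illustrative choices, and the half-open convention l to r minus 1 follows the lecture.

```python
def binary_search(k, a, l, r):
    """Return True if k occurs in the sorted slice a[l..r-1], else False."""
    # Empty interval: nothing lies between l and r - 1.
    if l >= r:
        return False
    mid = (l + r) // 2  # integer division, since l + r may be odd
    if a[mid] == k:
        return True
    if k < a[mid]:
        # Passing mid as the right end searches l .. mid - 1.
        return binary_search(k, a, l, mid)
    # Otherwise search mid + 1 .. r - 1; mid itself is excluded.
    return binary_search(k, a, mid + 1, r)
```

To search a whole list values, the call would be binary_search(k, values, 0, len(values)).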
80
00:05:25,380 --> 00:05:32,060
So, the crucial advantage of binary search
is that at each step we halve the interval to
81
00:05:32,060 --> 00:05:36,120
search and at some point we will reach 1 and
then when we halve 1, we will get an interval
82
00:05:36,120 --> 00:05:38,680
of size 0 and so, we will get an immediate
answer.
83
00:05:38,680 --> 00:05:45,750
So, we can write as we saw for such recursive
functions, we can write what is called a recurrence.
84
00:05:45,750 --> 00:05:50,470
Recurrence is just an expression for the time,
in terms of smaller values of the same expression.
85
00:05:50,470 --> 00:05:57,530
So, when we write T of
n we mean the time taken to search in a list
86
00:05:57,530 --> 00:06:01,120
or an array actually of size n.
87
00:06:01,120 --> 00:06:04,130
So, T of 0 is 1.
88
00:06:04,130 --> 00:06:11,360
So, if we have an empty array we have nothing
to do and T of n in general is 1 step to find
89
00:06:11,360 --> 00:06:12,360
and compare.
90
00:06:12,360 --> 00:06:15,230
So, this one is actually a constant number
of steps to compare with the midpoint and
91
00:06:15,230 --> 00:06:16,610
decide to go up, down and all that.
92
00:06:16,610 --> 00:06:21,050
So, those operations plus the time taken to
search whichever half we focus on, the left
93
00:06:21,050 --> 00:06:22,050
half or the right half.
94
00:06:22,050 --> 00:06:27,580
Remember, that if we look at the left half,
we are never going to look at the right half again.
95
00:06:27,580 --> 00:06:31,690
So, one way to solve such a recurrence is
to unwind it.
96
00:06:31,690 --> 00:06:35,110
So, we have T n is 1 plus T n by 2.
97
00:06:35,110 --> 00:06:40,260
So, we take n by 2 and we do n by 2 divided
by 2 and rather than write it as n by 4
98
00:06:40,260 --> 00:06:43,120
we write it as n by 2 squared.
99
00:06:43,120 --> 00:06:47,150
And this is because now, if I do one more
time it will be 1 plus 1 plus 1 plus T of n divided by
100
00:06:47,150 --> 00:06:48,250
2 cubed and so on.
101
00:06:48,250 --> 00:06:52,690
So, in general you can see that if I do it
k times then I am going to have k 1's here,
102
00:06:52,690 --> 00:07:02,570
if I do 3 times I have 1 plus 1 plus 1 plus
T of n by 2 to the 3, 4 times I have 4 plus T of n by 2 to
103
00:07:02,570 --> 00:07:03,570
the 4 and so on.
104
00:07:03,570 --> 00:07:12,020
So, after k steps I have k plus T of n by
2 to the k, now when n by 2 to the k becomes
105
00:07:12,020 --> 00:07:15,950
1 at the next step I am going to get T of
0.
106
00:07:15,950 --> 00:07:23,510
So, when does this become 1, when n is 2 to
the k in other words when k is log 2 of n.
107
00:07:23,510 --> 00:07:29,490
So, when I get log 2 of n, then this will
become T of 1 and at the next step this is
108
00:07:29,490 --> 00:07:32,370
going to become 1 plus T of 0.
109
00:07:32,370 --> 00:07:40,740
So, this is going to be log n 1s and so, overall
the complexity of binary search is just
110
00:07:40,740 --> 00:07:41,900
order of log n.
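Written out, the unwinding just described looks roughly as follows. This assumes for simplicity that n is a power of 2, so that every halving is exact.

```latex
\begin{align*}
T(0) &= 1 \\
T(n) &= 1 + T(n/2) \\
     &= 1 + 1 + T(n/2^2) \\
     &= 1 + 1 + 1 + T(n/2^3) \\
     &\;\;\vdots \\
     &= k + T(n/2^k) \\
     &= \log_2 n + T(1)        && \text{taking } k = \log_2 n \\
     &= \log_2 n + 1 + T(0) \\
     &= \log_2 n + 2 = O(\log n)
\end{align*}
```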
111
00:07:41,900 --> 00:07:47,610
So, we have gone from a linear search in the
case of an unsorted array to a logarithmic
112
00:07:47,610 --> 00:07:51,260
search in the case of a sorted array.
113
00:07:51,260 --> 00:07:57,550
So, we mentioned in the previous unit about
arrays and lists that things that work on
114
00:07:57,550 --> 00:07:59,490
lists may not work on arrays and vice versa.
115
00:07:59,490 --> 00:08:03,220
So, here is an example of something which
works only for arrays.
116
00:08:03,220 --> 00:08:07,840
The idea of looking up the midpoint and then
going left or right works only if you can find the
117
00:08:07,840 --> 00:08:11,601
midpoint in constant time, if you have to
spend time looking for
118
00:08:11,601 --> 00:08:14,760
the midpoint, then you cannot get this recurrence
anymore.
119
00:08:14,760 --> 00:08:21,240
It is not going to be 1 plus T of n by 2,
but it is going to be n by 2 plus T of n by
120
00:08:21,240 --> 00:08:23,760
2 and then we will actually get linear time.
121
00:08:23,760 --> 00:08:27,810
So, binary search for a list will actually
turn out to be linear, because of the time it
122
00:08:27,810 --> 00:08:29,440
takes us to go to the midpoint.
123
00:08:29,440 --> 00:08:34,330
So, this works only for arrays, but really
the remarkable thing about binary search is
124
00:08:34,330 --> 00:08:39,409
that by only looking at a very small fraction
of the sequence, we can conclude that an element
125
00:08:39,409 --> 00:08:40,409
is not present.
126
00:08:40,409 --> 00:08:44,620
So, we know for instance that
2 to the 10 is 1024.
127
00:08:44,620 --> 00:08:54,720
So, if I give you 1000 values, we can look
at 10 or maybe 11 and say that something is
128
00:08:54,720 --> 00:08:55,720
not there.
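A small Python sketch to check this count, assuming the l to r minus 1 convention used earlier, under which a failed probe leaves at most n // 2 elements to search. The function name worst_case_probes is hypothetical.

```python
def worst_case_probes(n):
    """Count elements examined when every probe leaves the larger half."""
    probes = 0
    while n > 0:
        probes += 1  # examine the midpoint of the current interval
        n //= 2      # the larger remaining half has n // 2 elements
    return probes
```

Under these assumptions worst_case_probes(1000) gives 10 and worst_case_probes(1024) gives 11, which matches the "10 or maybe 11" figure.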
129
00:08:55,720 --> 00:08:59,560
So, the overwhelming number of values we do
not even have to look at in order to decide
130
00:08:59,560 --> 00:09:05,189
whether a value is there or not and this makes
binary search a remarkable procedure,
131
00:09:05,189 --> 00:09:06,130
if you think about it carefully.