We have been looking at methods for solving nonlinear algebraic equations. So far we have covered gradient-free methods, that is, successive substitutions, and then gradient-based methods; now I am going to move on to a third category, which is optimization-based methods. Let me first collate the methods that we have seen.

For solving nonlinear algebraic equations, the first method we saw was successive substitutions, where we had different variants such as Jacobi iterations, Gauss-Seidel iterations, and so on. The second class, I would say, is gradient- or slope-based methods, and here we have looked at the univariate secant method and the multivariate secant method, which is popularly known as Wegstein iterations.
So we looked at the univariate secant method, which is a slope-based method; instead of saying gradient-based, I would say slope-based. It uses two consecutive iterates to find a slope, and uses that slope to find the next step. Then we looked at the multivariate secant method. Under the same category we moved to Newton-Raphson, or Newton's method, sometimes known as the Newton-Raphson method.

Here we had some modifications: the damped Newton method, and we looked at the quasi-Newton method. The damped Newton method adjusts the step length; the quasi-Newton method updates the Jacobian matrix using rank-one matrices, that is, Broyden's update. Of course, you can combine these two and have a damped quasi-Newton method, and so on.
You can have all the combinations. So these are the categories of methods we have looked at for solving nonlinear algebraic equations. The third category that I want to look at is optimization-based methods. The first two categories we have looked at in detail; the third one, the optimization-based approach, has its parallel in solving linear algebraic equations, which we solved using the gradient method and the conjugate gradient method.

Likewise, here too we can pose solving nonlinear algebraic equations as an optimization problem, and iteratively solve that optimization problem until we reach the solution. So what are these optimization-based methods? We start by forming an optimization problem. See, I want to solve f(x) = 0, where x belongs to R^n; x is an n-dimensional vector, and f(x) = 0 is what I want to solve for.
I formulate an objective function phi(x) = (1/2) f(x)^T f(x), which is nothing but (1/2)[f1(x)^2 + ... + fn(x)^2]; the function vector f has components f1 to fn. We want to minimize this with respect to x, so we solve the problem: minimize phi(x) with respect to x.

What is the necessary condition for optimality? The gradient must equal zero, that is, d(phi)/dx = (df/dx)^T f(x) = 0. This is the optimality criterion, and you can see that when the Jacobian is nonsingular at the optimum, the only way the necessary condition can be satisfied is when f(x) = 0. If the Jacobian is nonsingular, the only way to satisfy the necessary condition is f(x) = 0. So when you reach a stationary point where the Jacobian is nonsingular, you have reached the optimum.
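Written out, the problem formulation and the optimality condition just described are:

```latex
\phi(x) \;=\; \tfrac{1}{2}\, f(x)^{\mathsf T} f(x)
        \;=\; \tfrac{1}{2}\sum_{i=1}^{n} f_i(x)^2 ,
\qquad x \in \mathbb{R}^n ,
\qquad
\nabla\phi(x) \;=\; \left(\frac{\partial f}{\partial x}\right)^{\!\mathsf T} f(x)
              \;=\; J(x)^{\mathsf T} f(x) \;=\; 0 .
```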
And that optimum is the solution of f(x) = 0; that is the idea. Now, how do you solve this? Well, we do it iteratively, using different numerical optimization methods. The first and simplest would be the gradient method, but as I told you, the gradient method has a problem: as I said, it is more useful as the basis for deriving more complex methods than as a method in its own right.
Probably we would not use the gradient method; we would use the conjugate gradient method. But just to state the algorithm: the gradient method starts with an initial guess x0, and then you get x(k+1) = xk + lambda_k gk, where gk = -grad phi(x) evaluated at x = xk. If we develop a notation, let me do it here: let Jk = df/dx evaluated at x = xk, and let fk denote f evaluated at xk; we have used this same notation earlier. Then gk is nothing but -Jk^T fk. How do you compute lambda_k? It is a scalar step-length parameter, found by one-dimensional minimization.

So this is the iterative numerical procedure: you start from x0 and generate a new guess in the direction of the negative gradient, and at each step how much to move is decided by minimizing with respect to lambda. That is, lambda_k is the minimizer over lambda of phi(xk + lambda gk), just a one-dimensional minimization. gk is the gradient direction which you have computed; xk is known to you, gk is known to you, and lambda is the unknown, which is found by this minimization.
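As a concrete sketch of this loop. The example system f, the starting point, and the crude grid-based line search are illustrative assumptions, not from the lecture:

```python
import numpy as np

# Gradient method for solving f(x) = 0 by minimizing phi(x) = 0.5 f(x)^T f(x).
def f(x):
    # illustrative system: x0^2 + x1^2 - 2 = 0, x0 - x1 = 0 (a root at [1, 1])
    return np.array([x[0]**2 + x[1]**2 - 2.0, x[0] - x[1]])

def jacobian(x):
    return np.array([[2.0*x[0], 2.0*x[1]], [1.0, -1.0]])

def phi(x):
    fx = f(x)
    return 0.5 * fx @ fx

def line_search(x, g):
    # crude grid stand-in for the exact one-dimensional minimization of phi(x + lam*g)
    lams = np.linspace(0.0, 1.0, 101)
    vals = [phi(x + lam * g) for lam in lams]
    return lams[int(np.argmin(vals))]

x = np.array([2.0, 0.5])
for k in range(200):
    g = -jacobian(x).T @ f(x)      # g_k = -grad phi(x_k) = -J_k^T f_k
    lam = line_search(x, g)
    if lam == 0.0:                 # no grid point improves phi: stop
        break
    x = x + lam * g

root = x
```

The loop reduces phi monotonically, so the iterates converge toward a root of f, though (as the lecture notes) progress slows near the optimum.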
Of course, the more popular and more computationally efficient way is to use conjugate gradients. In fact, in MATLAB, when you are solving simultaneous nonlinear algebraic equations there is a subroutine called fsolve. I think fsolve is not Newton-Raphson; it is an optimization-based solver. It actually forms an optimization problem and minimizes it by iterative search.
Let me confirm, but I think fsolve is not a Newton-Raphson solver; it is an optimization-based solver. The reason this matters is that even though we have written the gradient direction here in terms of the Jacobian multiplied by fk, if you numerically compute the gradient of phi at xk, that numerical gradient computation does not involve explicitly computing the Jacobian.

See, phi is the function (1/2)[f1(x)^2 + ... + fn(x)^2]. If I want to compute its numerical gradient, I only need to evaluate the function vector and take its norm; I do not have to explicitly compute the Jacobian. Computing grad phi(x) is mathematically equivalent to computing (df/dx)^T f(x); these two are equivalent, but the numerical computation needs only values of the function vector. See, how do we compute a numerical Jacobian or numerical gradient? We have a subroutine, which we wrote in the programming tutorial: we perturb each element by plus epsilon and by minus epsilon and then take differences.
00:13:13,800 --> 00:13:23,150
So see basically what I do is to compute if
I want to compute doh phibydoh xi okay What
83
00:13:23,150 --> 00:13:36,730
we do is we approximate this as phi xi plus
epsilon minus phi xi minus epsilon I am just
84
00:13:36,730 --> 00:13:44,920
perturbing the ith element remaining elements
are same okay divided by 2 epsilon right This
85
00:13:44,920 --> 00:13:53,120
is how we find out and of course here I have
not written remaining elements of x this phi
86
00:13:53,120 --> 00:13:54,930
is a function of remaining elements of x
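The perturbation scheme just described can be sketched as follows; the example f used for the sanity check is an illustrative assumption:

```python
import numpy as np

# Central-difference approximation of grad phi: perturb each element x_i
# by +/- eps and difference the scalar phi.  Only phi evaluations are
# needed; no Jacobian is ever formed.
def numerical_gradient(phi, x, eps=1e-6):
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (phi(x + e) - phi(x - e)) / (2.0 * eps)
    return g

# sanity check on phi(x) = 0.5 ||f(x)||^2 with an illustrative f
def f(x):
    return np.array([x[0]**2 - x[1], x[1] - 1.0])

def phi(x):
    fx = f(x)
    return 0.5 * fx @ fx

x0 = np.array([2.0, 3.0])
g_num = numerical_gradient(phi, x0)

# analytic comparison: grad phi = J^T f
J = np.array([[2.0*x0[0], -1.0], [0.0, 1.0]])
g_exact = J.T @ f(x0)
```

The two results agree to within the truncation error of the central difference, which is the point of the lecture's remark: the gradient is available without the Jacobian.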
Just to show you: doing this function calculation does not anywhere involve explicitly computing the Jacobian. You see the advantage: the analytical expression is equivalent, but numerically you do not have to go through that route; you do not need to compute the Jacobian. So there are advantages in using an optimization-based search, because you do not have to compute the Jacobian explicitly. For a large-scale problem, you are not required to compute, say, a 1000 x 1000 matrix; that is not required. Is this clear?
So what about the conjugate gradient method? The conjugate gradient method of course requires the current gradient information, so we need gk, which is the negative of the gradient of phi with respect to x; this direction we have to compute at x = xk.

What we do next, of course, is find the conjugate search direction. Now, what was the idea in generating conjugate directions? The idea was: given some matrix A which is positive definite, we said that two search directions Sk and S(k-1) are A-conjugate when Sk^T A S(k-1) = 0.
When you take any two successive directions, they should have this property. What is done in the conjugate gradient method here is to find search directions with A = I, so that successive directions are perpendicular to each other. We find Sk = beta_k S(k-1) + gk, where gk is the gradient we have just calculated, and beta_k is given by beta_k = (gk^T gk) / (g(k-1)^T g(k-1)); this is the Fletcher-Reeves formula.

And the remaining part is similar to the gradient method: once you know the search direction, you find x(k+1) = xk + lambda_k Sk, and find lambda_k by one-dimensional minimization of phi with respect to lambda. So this beta_k computed here is only for finding the new search direction, which is conjugate with respect to the identity matrix.
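A sketch of the whole conjugate gradient loop. The lecture leaves the beta_k expression unfinished, so the Fletcher-Reeves formula with a periodic restart is assumed here; the example system, starting point, and grid line search are also illustrative:

```python
import numpy as np

# Conjugate gradient method for minimizing phi(x) = 0.5 f(x)^T f(x).
def f(x):
    return np.array([x[0]**2 + x[1]**2 - 2.0, x[0] - x[1]])

def jacobian(x):
    return np.array([[2.0*x[0], 2.0*x[1]], [1.0, -1.0]])

def phi(x):
    fx = f(x)
    return 0.5 * fx @ fx

def line_search(x, s):
    # crude grid stand-in for the exact one-dimensional minimization
    lams = np.linspace(0.0, 1.0, 201)
    return lams[int(np.argmin([phi(x + lam * s) for lam in lams]))]

x = np.array([2.0, 0.5])
g = -jacobian(x).T @ f(x)          # g_0 = -grad phi(x_0)
s = g                              # S_0 = g_0: first move along the gradient
for k in range(100):
    lam = line_search(x, s)
    x = x + lam * s
    g_new = -jacobian(x).T @ f(x)
    if g_new @ g_new < 1e-30:      # gradient vanished: converged
        break
    if (k + 1) % x.size == 0:
        s = g_new                  # periodic restart to the gradient direction
    else:
        beta = (g_new @ g_new) / (g @ g)   # assumed Fletcher-Reeves beta_k
        s = g_new + beta * s               # S_k = beta_k S_{k-1} + g_k
    g = g_new

root = x
```

Note how each new direction folds the previous one in, which is exactly the "linear combination of past gradients" interpretation discussed next.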
So this is the direction computation. Once you compute the direction Sk, how much to move in that direction? See, one interpretation of these conjugate directions is that each is a linear combination of the previous gradients, because of how you build up the calculation. At step 0 you set S0 = g0: the first direction is the same as the gradient direction. Next time, S1 = beta_1 g0 + g1, a linear combination of g0 and g1. When you go to the next step it will be a linear combination of g0, g1, and g2, because S1 is a linear combination of g0 and g1; so S2 turns out to be a linear combination of g0, g1, and g2. So instead of using only the current gradient, once I get a direction I know it is a linear combination of all the past gradients.

See, just imagine you are going down a valley. What do you mean by taking a gradient step? You are looking at the current slope. Now, looking only at the current slope could be too local: you do not have information about what happened in the past, unless you take it into consideration somehow.
If you make your next move based on information about the past slopes, then your move will be much better than basing your decision only on the current slope. Because if you take past slopes into consideration, in some way you are taking into consideration the curvature along the path, and then making your decision. See, if you look at how this conjugate gradient method proceeds, the step is x(k+1) = xk + lambda_k Sk. If I look at the progression of how the direction changes: S0 = g0, the negative of whatever gradient direction you get the first time. What is next? S1 = beta_1 S0 + g1, where g1 is the negative of the gradient direction at the new point. What is S2? beta_2 S1 + g2, which expands into a combination of g0, g1, and g2; you see this.
So as you progress, you are actually basing your move direction not on just one gradient but on a history of gradients. That is why the conjugate gradient direction turns out to be more powerful: it moves much faster even when you get close to the optimum. What is the problem with the gradient method? When you get close to the optimum it tends to become slow, whereas here the move is based on the past history of gradients. So it is like going down a hill trying to make use of the curvature rather than only the local slope. The gradient method will just look at the local slopes; it will only look at the negative gradients g0, g1, g2, g3, g4 individually. That is why the conjugate gradient method is more powerful.
Now there is one more variant. In the gradient method you are only using the local gradient information. Now, what if you can generate something more? If you can generate the Hessian, and what is the Hessian, the second derivative of the objective function; if you generate the Hessian, then your moves are much better, because you are using more information about the local curvature than just the gradient.

Hessian computation requires second derivatives to be calculated, and the optimization methods you get which use the Hessian to generate the search direction are called Newton's optimization methods. So we will just briefly look at this Newton's method. Now, do not confuse the Newton-Raphson method with the Newton method here: that was a direct root-finding method applied to the equations, whereas this is an optimization-based method.
So the names appear similar, but under the category of optimization we again have Newton's method and quasi-Newton methods. Now, what did we start with? We started by saying that to solve f(x) = 0, we form the objective function phi = (1/2) f(x)^T f(x). And what is the necessary condition for optimality? The necessary condition for optimality is grad phi = 0: at the stationary point, the gradient of this objective function should equal the zero vector. This is itself a nonlinear algebraic equation. Now, phi is a scalar function: it is nothing but (1/2) times the square of the 2-norm of f(x), and a norm is a scalar, so the objective function is a scalar objective function.

So what is this grad phi, is it a vector or a matrix? It is a vector, it is not a matrix: it equals [d(phi)/dx1, d(phi)/dx2, ..., d(phi)/dxn]^T, and this should equal the zero vector. Because phi is a scalar, its gradient with respect to x is a vector, and this vector I want to set equal to zero.
If I could solve this equation exactly, then I would get the solution, but my problem is nonlinear and I cannot solve it exactly. What I decide to do is, instead of solving this equation exactly, to solve it using a Newton step; and to use a Newton step, I am going to linearize this equation. “Professor student conversation starts” How will I get f(x) = 0? No, no: analytically, grad phi is (df/dx)^T f(x). So if the Jacobian matrix is nonsingular, then setting this equal to zero can happen only when f(x) = 0. When you set this equal to zero, if the Jacobian is nonsingular, the only way you get zero is when the vector f(x) equals zero, and that is exactly when you have the solution. So satisfying the optimality condition is the same as getting the solution. “Professor student conversation ends”
So now I want to come up with the Newton step here. I will continue here on this side. I am going to write grad phi(x) using our good old Taylor series expansion: grad phi(x) is approximately equal to grad phi(xk) plus del^2 phi(xk) times (x - xk). Now you realize why this is called Newton's method: this is exactly what you do in Newton's method for solving algebraic equations. You have a nonlinear algebraic equation, you linearize it, and then instead of solving the original problem you solve the linearized problem. So this del^2 phi we call Hk; Hk is the Hessian. And the gradient, as I have shown earlier, is nothing but Jk^T fk.
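In symbols, the linearization and the step it produces are:

```latex
\nabla\phi(x) \;\approx\; \nabla\phi(x^k) \;+\; \underbrace{\nabla^2\phi(x^k)}_{H_k}\,\Delta x^k ,
\qquad \nabla\phi(x^k) \;=\; J_k^{\mathsf T} f^k ,
```
and setting this approximation to zero,
```latex
\nabla\phi(x^k) + H_k\,\Delta x^k = 0
\;\;\Longrightarrow\;\;
\Delta x^k \;=\; -\,H_k^{-1}\,\nabla\phi(x^k) \;=\; -\,H_k^{-1} J_k^{\mathsf T} f^k .
```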
And now, instead of solving grad phi(x) = 0, I am going to set this approximation equal to the zero vector. Hk is a square matrix, an n x n matrix: phi is the scalar objective function, and its second derivative with respect to x is the Hessian matrix. We looked at the Hessian matrix earlier when we talked about the necessary and sufficient conditions for optimality. And the gradient is Jacobian transpose times fk; again, I have written it in terms of the Jacobian, but you do not have to explicitly compute the Jacobian, the expression is more for getting insight. So now I solve this: solving the approximate Taylor series expansion, I get delta xk = -Hk^(-1) grad phi(xk), and what I get here is a Newton-like step.

So this is my search direction. The way I construct the next iterate is x(k+1) = xk plus lambda times delta xk, and then you do the same thing as before: a search with respect to lambda. That part remains the same, so lambda_k is the minimizer over lambda of phi(xk + lambda delta xk), a one-dimensional minimization.
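A sketch of one such damped Newton step. To keep it self-contained, the Gauss-Newton approximation H ≈ J^T J is used in place of the exact Hessian (the exact Hessian would add second-derivative terms of f); the example system and grid line search are also assumptions:

```python
import numpy as np

# One damped Newton step for minimizing phi(x) = 0.5 f(x)^T f(x):
# solve H_k dx = -grad phi(x_k), then line-search along dx.
def f(x):
    return np.array([x[0]**2 + x[1]**2 - 2.0, x[0] - x[1]])

def jacobian(x):
    return np.array([[2.0*x[0], 2.0*x[1]], [1.0, -1.0]])

def phi(x):
    fx = f(x)
    return 0.5 * fx @ fx

def newton_step(x):
    J = jacobian(x)
    grad = J.T @ f(x)                   # grad phi = J^T f
    H = J.T @ J                         # Gauss-Newton Hessian approximation
    dx = np.linalg.solve(H, -grad)      # dx_k = -H_k^{-1} grad phi(x_k)
    lams = np.linspace(0.0, 1.0, 101)   # crude one-dimensional search
    lam = lams[int(np.argmin([phi(x + l * dx) for l in lams]))]
    return x + lam * dx

x = np.array([2.0, 0.5])
for k in range(20):
    x = newton_step(x)

root = x
```

Run from the same starting point as the earlier sketches, this converges in a handful of iterations rather than hundreds, which is the curvature advantage the lecture describes.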
But the direction for moving is obtained using the second-order derivative. So actually, if you compare the optimization-based methods, say the raw gradient method, then the conjugate gradient method, and then Newton's method, of course Newton's method is fastest in converging; that is because you are using second-order derivative information.

So this is the more powerful method, but there is a price to pay: computing the Hessian. The Hessian is an n x n matrix; if your number of variables is large, the Hessian matrix is large, and again we have the same trade-off: either fewer iterations but a large amount of computation at each iteration, or a larger number of iterations with less computation at each iteration (each iteration, not each time step). So it is a balance. In some cases it is worth computing the Hessian and taking quick steps; in some cases the Hessian computation can be complex, and so you might want to just use the conjugate gradient method and search for the optimum. Now, of course, the direction you get should be a descent direction, and that requires the Hessian to be positive definite, and so on.
So the convergence conditions will depend upon the nature of H, the local Hessian. So what is the advantage of the Hessian? You have second-order information, and convergence can be much better. What is the trouble? You have to compute an n x n matrix. In Newton's method for equation solving, how did we get over this problem? We got over it using Broyden's update: we used rank-one updates, and then we had a way of updating the Jacobian, just like a difference equation, and then we used the updated value of the Jacobian rather than actually computing the Jacobian. The same thing is done here. The quasi-Newton methods actually use a rank-one update of the Hessian inverse, because what you need to compute here is the inverse of the Hessian matrix.
So what is done in quasi-Newton methods is, let us define this matrix L to be H inverse. See, what is the troublesome step in doing this? The gradient calculation is just one vector calculation by numerical perturbation, not a big issue; gradient computation you can do relatively easily, because there are only n components in the gradient of the scalar function phi, which is much less computation than computing the Hessian. The Hessian would require a lot more computation. Now, to avoid Hessian computations, we do the gradient computation by numerical perturbation, but for the Hessian we do an update. So let us define L = H^(-1); then we have an update formula. This is called a variable metric method, specifically the Davidon-Fletcher-Powell method, in which you update the inverse of the Hessian iteratively.
So only once, at the beginning, do you compute the inverse; after that you just update it without actually having to compute it explicitly. This is the quasi-Newton idea, where you do not have to explicitly keep computing the Hessian. So here you have this update: we compute L0 = H0^(-1) once at the beginning, and after that what we do is apply L(k+1) = Lk + Mk + Nk.

In quasi-Newton methods this is the philosophy: you define a matrix L = H^(-1), you compute it only once, and then every time, the new inverse is the old inverse plus some correction, where the correction does not involve explicit Hessian computation. What is this correction? The derivation of this you can find in any book on optimization.
So the derivation of this particular approach I am going to just leave to your curiosity; you can go back to the book by S. S. Rao or any other book on optimization and you will find the derivation. I am just giving you the final formulas. You define a vector qk, which is grad phi evaluated at iteration k+1 minus grad phi evaluated at iteration k. The correction Mk is lambda_k times (delta xk)(delta xk)^T divided by (delta xk)^T qk. So this is a rank-one matrix: delta xk is the previous move, and (delta xk)(delta xk)^T will be a rank-one matrix. Mk and Nk are two corrections, which are rank-one corrections here. “Professor student conversation starts”
No, no, this is after; see, the sequence is like this: you have already computed delta xk. You have then minimized and found lambda_k, and then you have found x(k+1). Once you have found x(k+1), you are preparing for the next iteration. So once I have found x(k+1), I actually go back and find the new gradient at k+1, take the difference between the new gradient and the old gradient, and use that together with the delta xk move, the lambda_k which you have found, and the previous inverse, all of them, to come up with the approximate inverse of the Hessian for the next iteration. That is how this quasi-Newton method proceeds. Well, this is not a full course on optimization; I am just trying to give you the idea that nonlinear algebraic equations can be very efficiently solved using optimization-based methods. “Professor student conversation ends”
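The DFP update described above can be sketched as follows. The lecture leaves the full formulas to texts such as S. S. Rao's optimization book, so the standard DFP corrections are assumed here, checked on an illustrative quadratic phi(x) = 0.5 x^T A x:

```python
import numpy as np

# DFP (Davidon-Fletcher-Powell) update of the inverse Hessian L:
#   L_{k+1} = L_k + M_k + N_k,
# where step = lambda_k * dx_k is the move just taken and
# q = grad_new - grad_old is the gradient difference.
def dfp_update(L, step, q):
    Lq = L @ q
    M = np.outer(step, step) / (step @ q)   # rank-one correction M_k
    N = -np.outer(Lq, Lq) / (q @ Lq)        # rank-one correction N_k
    return L + M + N

# sanity check on a quadratic: with exact line searches, after n steps
# L should reproduce the true inverse Hessian A^{-1} and x the minimizer
A = np.array([[4.0, 1.0], [1.0, 3.0]])
L = np.eye(2)                          # initial inverse-Hessian estimate
x = np.array([2.0, -1.0])
for _ in range(2):
    grad = A @ x
    d = -L @ grad                      # quasi-Newton search direction
    lam = -(grad @ d) / (d @ A @ d)    # exact line search for a quadratic
    step = lam * d
    x_new = x + step
    q = A @ x_new - grad               # gradient difference q_k
    L = dfp_update(L, step, q)
    x = x_new
```

Note that the update uses only gradients and the step taken; the Hessian itself is never formed or inverted after the start, which is the whole point of the method.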
279
00:41:09,480 --> 00:41:13,580
So we are just as I said or I keep saying
throughout this course we are just touching
280
00:41:13,580 --> 00:41:21,390
the tip of the iceberg we are not really getting
into the deep okay Where the last thing which
281
00:41:21,390 --> 00:41:28,530
I want to talk here is a method which is very
very popular in solving nonlinear algebraic
282
00:41:28,530 --> 00:41:38,609
equations. This is called the Levenberg Marquardt
method, which is a combination of the gradient method
283
00:41:38,609 --> 00:41:42,060
and the Hessian method.
284
00:41:42,060 --> 00:41:46,800
See, what is the nice thing about the gradient
method? When you are far away
285
00:41:46,800 --> 00:41:53,260
from the optimum, it makes very long strides;
it tries to move towards the optimum
286
00:41:53,260 --> 00:42:03,700
very quickly. But once it comes close to the optimum,
there is a problem: it becomes very
287
00:42:03,700 --> 00:42:08,350
very slow. The Hessian based
method, on the other hand, is very fast when
288
00:42:08,350 --> 00:42:11,710
you come close to the optimum okay
289
00:42:11,710 --> 00:42:17,690
So there is merit in having a method which
is a mixture of the two, which initially starts
290
00:42:17,690 --> 00:42:26,420
like a gradient method and later on becomes
like a Hessian method. So, in modern
291
00:42:26,420 --> 00:42:35,210
parlance, it is like a multi-agent
optimization method: there are two agents, one
292
00:42:35,210 --> 00:42:39,369
is the gradient method, the other is the Hessian
method, and you try to mix them in such
293
00:42:39,369 --> 00:42:41,930
a way that one is dominant when it is useful
294
00:42:41,930 --> 00:42:47,920
The other is dominant when that becomes useful.
So this is just a small modification.
295
00:42:47,920 --> 00:42:56,540
I am just going to give the philosophy; I am
not going to get into details. So what you
296
00:42:56,540 --> 00:43:08,369
do here is that you have this gradient direction,
which is minus the gradient of f.
297
00:43:08,369 --> 00:43:28,380
What I do is: my search direction Sk is found as
minus of H plus eta k times I, the whole inverse, times the gradient.
298
00:43:28,380 --> 00:43:35,480
If I put this eta k equal to 0,
it is nothing but the Newton step
299
00:43:35,480 --> 00:43:40,910
If you go back and check, this should be nothing
but a Newton step. The Newton step actually
300
00:43:40,910 --> 00:43:46,000
does involve the negative of the gradient direction,
except that it is premultiplied
301
00:43:46,000 --> 00:43:52,000
by H inverse. Otherwise it involves the negative of
302
00:43:52,000 --> 00:44:03,240
the gradient direction. So what we do is
we start with this eta very large.
303
00:44:03,240 --> 00:44:14,420
So you start with a large value of eta, say 10
to the power 5 or something; you take a large
304
00:44:14,420 --> 00:44:28,960
value of eta. So actually you have H0 plus 10 to
the power 5 times I. See, the Hessian elements
305
00:44:28,960 --> 00:44:35,940
will typically not be very large. So you
want to take this number sufficiently large
306
00:44:35,940 --> 00:44:44,210
compared to the elements of the Hessian. So this
inverse, when this is a large number, is approximately
307
00:44:44,210 --> 00:44:51,280
like the inverse of 10 to the power 5 times I.
308
00:44:51,280 --> 00:45:00,420
This term dominates over the Hessian. So initially
you would start with a very large eta.
309
00:45:00,420 --> 00:45:07,440
So eta 0 is chosen to be 10 to the power 5
and then what you do is you go on reducing
310
00:45:07,440 --> 00:45:19,780
eta as k increases. So initially, since this
inverse is like a scaled identity, the direction Sk
311
00:45:19,780 --> 00:45:24,610
is along the negative of the gradient direction
So initially you are moving along the negative
312
00:45:24,610 --> 00:45:26,280
of the gradient direction okay
313
00:45:26,280 --> 00:45:34,880
And then you go on: we reduce eta k
as k increases by some logic. As eta k
314
00:45:34,880 --> 00:45:40,430
reduces, this term becomes smaller and smaller
and the Hessian starts
315
00:45:40,430 --> 00:45:46,040
dominating. So initially the method behaves
like a gradient method; later on the method behaves
316
00:45:46,040 --> 00:45:52,430
like the Hessian based method. So for Levenberg
Marquardt, if you have the matlab
317
00:45:52,430 --> 00:45:55,869
toolbox or any other toolbox, say the scilab toolbox,
you will find this.
318
00:45:55,869 --> 00:46:02,260
So this is one of the popular methods which
is used for solving nonlinear algebraic equations
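The blending described above can be sketched as follows. This is an illustrative sketch, not the lecture's exact algorithm: for large eta the step minus (H plus eta I) inverse times the gradient is close to a small gradient step, and as eta shrinks it approaches the Newton step. The lecture only says eta is reduced "by some logic"; the rule used here (shrink eta tenfold when a step lowers f, grow it tenfold otherwise) and the Rosenbrock test function are my own assumed choices.

```python
import numpy as np

def levenberg_marquardt(f, grad, hess, x0, eta0=1e5, tol=1e-8, max_iter=500):
    """Levenberg-Marquardt-style blend of gradient and Hessian methods.

    Step: s = -(H + eta*I)^{-1} g. Large eta -> scaled gradient step;
    eta -> 0 recovers the Newton step."""
    x = x0.astype(float)
    eta = eta0                         # start very large, as in the lecture
    n = len(x)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        s = np.linalg.solve(hess(x) + eta * np.eye(n), -g)
        if f(x + s) < f(x):
            x = x + s
            eta = max(eta * 0.1, 1e-12)  # step worked: move toward Newton
        else:
            eta *= 10.0                  # step failed: back toward gradient
    return x

# usage on the Rosenbrock function, a standard ill-conditioned test problem
def rosen(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosen_grad(x):
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
        200 * (x[1] - x[0]**2),
    ])

def rosen_hess(x):
    return np.array([
        [2 - 400 * x[1] + 1200 * x[0]**2, -400 * x[0]],
        [-400 * x[0], 200.0],
    ])

x_star = levenberg_marquardt(rosen, rosen_grad, rosen_hess, np.array([-1.2, 1.0]))
```

The small floor on eta keeps H plus eta I invertible even near points where the Hessian itself is singular.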
319
00:46:02,260 --> 00:46:09,349
So with this we come to the end of the algorithmic
part of solving nonlinear algebraic equations.
320
00:46:09,349 --> 00:46:15,980
We have looked at different methods we have
looked at gradient free or successive substitution
321
00:46:15,980 --> 00:46:21,790
methods. We have looked at gradient based or
slope based methods.
322
00:46:21,790 --> 00:46:28,720
In that we had the Wegstein method, Newton method,
damped Newton method, and so on.
323
00:46:28,720 --> 00:46:34,560
Then the Broyden update for Newton’s method.
We moved on to optimization based methods: we
324
00:46:34,560 --> 00:46:42,180
looked at the gradient method again. Then
we looked at Newton’s method, the quasi Newton
325
00:46:42,180 --> 00:46:45,930
method, which I have just quickly looked at,
and Levenberg Marquardt, which tries
326
00:46:45,930 --> 00:46:50,140
to merge the two, the gradient and Newton’s method.
327
00:46:50,140 --> 00:46:55,420
Solving nonlinear algebraic equations,
you know, is a complex problem, and as more
328
00:46:55,420 --> 00:47:00,320
and more we become advanced in computing, we
want to solve larger and larger problems.
329
00:47:00,320 --> 00:47:04,450
So how to solve nonlinear algebraic equations
is an ever challenging problem, and there
330
00:47:04,450 --> 00:47:11,010
are many, many methods. So a question that
would naturally arise is: which is “the”
331
00:47:11,010 --> 00:47:13,080
method? Is there “the” method?
332
00:47:13,080 --> 00:47:18,820
If there were “the” method, we would not be required,
right? We would be out of business. So you need
333
00:47:18,820 --> 00:47:23,619
an expert, a person who understands
the physics of the problem. You should know
334
00:47:23,619 --> 00:47:28,830
how to concoct the solution, how to concoct
a recipe. Just like when you cook, you either
335
00:47:28,830 --> 00:47:34,130
know the recipe or you form a recipe; here
also you have to form a solution for
336
00:47:34,130 --> 00:47:37,940
a particular problem and sometimes Newton’s
method will work
337
00:47:37,940 --> 00:47:41,790
Sometimes an optimization method will work, sometimes
the Wegstein method will work. So you have to develop
338
00:47:41,790 --> 00:47:47,210
an expertise, beyond a point, as to how to
go about solving these problems. So it
339
00:47:47,210 --> 00:47:52,190
does require this human element; otherwise
matlab would do everything. You would just give a
340
00:47:52,190 --> 00:47:58,320
problem to matlab, but it does not happen that
way. That is good; that is why we get jobs.
341
00:47:58,320 --> 00:48:04,710
So we will now continue: I will say a
few things about the convergence of these nonlinear
342
00:48:04,710 --> 00:48:06,140
algebraic equation solvers
343
00:48:06,140 --> 00:48:12,560
We cannot do justice to that in this course;
it is a very advanced topic. But at least
344
00:48:12,560 --> 00:48:17,910
you should be sensitized to what is involved
when you talk about convergence. It is far
345
00:48:17,910 --> 00:48:21,869
more complex than talking about convergence
of iterative schemes for linear algebraic
346
00:48:21,869 --> 00:48:26,750
equations because here you have nonlinearity
and things do not work out as nicely as linear
347
00:48:26,750 --> 00:48:31,990
algebraic equations, but we will have a peek
at that and then move on to ODE initial value
348
00:48:31,990 --> 00:48:32,959
problems okay