phessianfree’s documentation
phessianfree is a modern optimization method for smooth, unconstrained minimization problems, intended to be used in place of LBFGS. It is designed to take advantage of the structure of regularized loss minimization problems, where the objective decomposes as a sum over a large number of terms. By exploiting this structure, it achieves rapid initial convergence similar to stochastic gradient descent methods, without sacrificing the later-stage convergence properties of batch methods such as LBFGS.
phessianfree uses a Newton-CG method, sometimes called a Hessian-free method, in which the search direction at each step is refined by a linear solver to bring it closer to the ideal Newton step. Hessian-vector products are evaluated without forming the actual Hessian, using finite differences, with an adjustable overhead that defaults to 10% more computation per iteration than standard LBFGS methods.
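The finite-difference Hessian-vector product idea can be sketched as follows. This is a minimal illustration using NumPy, not code from phessianfree itself; the quadratic objective and helper names are invented for the example:

```python
import numpy as np

def hessian_vec_product(grad, x, v, fd_eps=1e-8):
    """Approximate H(x) @ v with a forward difference of the gradient.

    This costs one extra gradient evaluation instead of ever forming H:
        H v  ~=  (g(x + eps*v) - g(x)) / eps
    """
    return (grad(x + fd_eps * v) - grad(x)) / fd_eps

# Toy quadratic f(x) = 0.5 x^T A x, whose exact Hessian is A,
# so the approximation can be checked directly.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
grad = lambda x: A @ x

x = np.array([1.0, -1.0])
v = np.array([0.5, 2.0])
hv = hessian_vec_product(grad, x, v)  # close to A @ v
```

For a quadratic objective the finite difference is exact up to floating-point rounding; for general smooth objectives the step size (fdEps below) trades truncation error against cancellation error.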
See the example code for comparisons between LBFGS, phessianfree and SAG on training a standard classifier, where phessianfree converges in test loss up to 4 times faster than LBFGS. Example code with MNIST data bundled is available on GitHub.
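A sketch of what an objective with the (x, s, e) signature looks like, here for plain least squares on synthetic data. The data and names are invented for illustration, and it is assumed (as with SciPy-style LBFGS routines) that the objective returns the (loss, gradient) pair; the call to phessianfree.optimize is shown commented out:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))            # toy design matrix
y = X @ np.ones(5) + 0.1 * rng.standard_normal(1000)

def f(x, s, e):
    """Least-squares loss and gradient over datapoints s..e only."""
    Xb, yb = X[s:e], y[s:e]
    r = Xb @ x - yb
    loss = 0.5 * np.dot(r, r)
    grad = Xb.T @ r
    return loss, grad

# x_opt, fval = phessianfree.optimize(f, np.zeros(5), ndata=1000)
```

Because f only ever touches the slice [s:e], phessianfree can evaluate gradients and Hessian-vector products over small parts of the data.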
Core method
-----------
phessianfree.optimize(f, x0, ndata, gtol=1e-05, maxiter=100, callback=None, props={})
This method can be invoked in a similar way to the LBFGS routines in SciPy,
with the following differences:
- f takes additional arguments ‘s’ and ‘e’ that signify a range of
  datapoints over which to evaluate the objective.
- The callback gives additional information.
- Logging is performed using the standard Python logging framework.
Parameters:
- f (function) – Objective function, taking arguments (x,s,e), where
(s,e) is the range of datapoints over which to evaluate
the objective.
- x0 (vector) – Initial point
- ndata (int) – Number of points in dataset. The passed function
will be invoked with s,e between 0 and ndata.
- gtol (float) – stopping criterion on the 2-norm of the gradient.
- maxiter (int) – Maximum number of steps to complete. Note that this does
not count line search iterations.
- callback (function) – Invoked with (xk, fval, gfk, pointsProcessed);
useful for tracking progress for later plotting.
PlottingCallback in the convergence module can do
this for you.
- props (object) – Map of additional parameters:
- parts (integer, default 100)
  For computing gradients and Hessian-vector products,
  the data is split into this many parts. Each call to your
  objective function will cover roughly ndata/parts datapoints.
  The default of 100 is suitable for most datasets;
  smaller values are only useful if the dataset is small
  or non-homogeneous, in which case the Hessian-free method
  is ineffective. Larger numbers of parts may improve
  convergence, but result in proportionally more internal overhead.
- subsetVariant (string, default ‘lbfgs’)
  Setting this to ‘cg’ gives the standard conjugate gradient method
  for solving the linear system Hp = -g, to find the search direction
  p from the gradient g and Hessian H. This is computed over only one
  of the parts, so only a small amount of the data is seen.
  Setting this to ‘lbfgs’ uses a stochastic minibatch LBFGS method
  for solving the linear subproblem. This sees many more parts of the
  data, but is only able to make half as many steps for the same
  computation time. For problems without extreme curvature, ‘lbfgs’
  works much better than ‘cg’. If the condition number of the Hessian
  is very large, however, ‘cg’ is the better option; in those cases
  the solveFraction property should normally be increased as well.
- solveFraction (float, default 0.2)
  The cg or lbfgs linear solvers perform a number of iterations
  such that a solveFraction fraction of overhead is incurred.
  For example, if set to 0.2 with 100 parts, 20 cg iterations on 1
  part will be performed, if the cg subset variant is used.
  If subsetObjective is off, then essentially 20% extra computation
  is done per outer step over a standard LBFGS method (excluding line
  searches).
- subsetObjective (boolean, default True)
  Turns the use of subsets of data for computing gradients
  on or off. If off, gradients are computed using
  the full dataset, but Hessian-vector products still use subsets.
  The size of the subset used for computing the gradient is chosen
  adaptively, using bounds on the approximation error.
- gradRelErrorBound (float, default 0.1)
  At a search point, the gradient is computed over enough parts
  that the relative variance of the gradient estimate is brought below
  this threshold. 0.1 is conservative; better results may be
  achieved with values up to about 0.4, though larger values may cause
  erratic convergence behavior.
- lbfgsMemory (integer, default 10)
  The LBFGS search direction is used as the initial guess at the
  search direction for the cg and lbfgs inner solves. This controls
  the memory used for that; the same memory is used for the inner
  lbfgs solve. Changing this has less of an effect than it would
  on a standard LBFGS implementation.
- fdEps (float, default 1e-8)
  Unless a gaussNewtonProd method is implemented, Hessian-vector
  products are computed using finite differences. Unlike
  applying finite differences to approximate the gradient, the FD
  method allows a Hessian-vector product to be computed
  at the cost of only one subset gradient evaluation.
  If convergence plots become erratic near the optimum, tuning this
  parameter can help; this normally occurs long after the test loss
  has plateaued, however.
- innerSolveAverage (boolean, default False)
  Applicable only if subsetVariant is ‘lbfgs’; turns on
  50% sequence suffix averaging for the inner solve.
  If a large number of parts (say 1000) is being used, this
  can give better results.
- innerSolveStepFactor (float, default 0.5)
  The lbfgs subsetVariant is stochastic, but it uses the
  fact that quadratic problems admit a simple formula for exact line
  searches in order to make better step choices than plain SGD.
  An exact line search makes overconfident steps, however, so
  the step is scaled by this factor. If the lbfgs linear solve
  is diverging, decrease this.
Return type: (xk, fval)
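A minimal custom callback matching the documented (xk, fval, gfk, pointsProcessed) signature, recording progress for later plotting. The recorder shown here is purely illustrative, and the optimize call is shown commented out:

```python
import logging

logging.basicConfig(level=logging.INFO)

history = []  # (pointsProcessed, fval) pairs for later plotting

def progress_callback(xk, fval, gfk, pointsProcessed):
    """Record the objective value against datapoints processed so far."""
    history.append((pointsProcessed, fval))
    logging.info("f=%.6f after %d datapoints", fval, pointsProcessed)

# x_opt, fval = phessianfree.optimize(
#     f, x0, ndata,
#     callback=progress_callback,
#     props={'parts': 100, 'subsetVariant': 'lbfgs'})
```

Since logging goes through the standard Python logging framework, the library's own log output can be controlled with the same basicConfig call.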
Note
If your objective is non-convex, you need to explicitly provide a
function that computes matrix-vector products against the
Gauss-Newton approximation to the Hessian. You can do this
by making f an object with a __call__ method that implements the
objective function as above, and a gaussNewtonProd(x, v, s, e)
method that implements the matrix-vector product against v for
the GN approximation at x over the datapoints (s, e). This is
illustrated in the autoencoder example code.
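Such an objective object can be sketched as follows. This uses a least-squares model purely for illustration (where the Gauss-Newton approximation J^T J happens to equal the true Hessian); only the __call__ and gaussNewtonProd signatures come from the documentation above, everything else is invented:

```python
import numpy as np

class Objective:
    """Objective exposing a Gauss-Newton Hessian-vector product,
    as required by phessianfree for non-convex problems."""

    def __init__(self, X, y):
        self.X, self.y = X, y

    def __call__(self, x, s, e):
        # Loss and gradient over datapoints s..e only.
        Xb, yb = self.X[s:e], self.y[s:e]
        r = Xb @ x - yb
        return 0.5 * np.dot(r, r), Xb.T @ r

    def gaussNewtonProd(self, x, v, s, e):
        # For residual models the GN approximation is J^T J; here the
        # Jacobian of the residuals is just the data block X[s:e].
        # (For this linear model the product does not depend on x.)
        Xb = self.X[s:e]
        return Xb.T @ (Xb @ v)

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
y = rng.standard_normal(200)
obj = Objective(X, y)

# x_opt, fval = phessianfree.optimize(obj, np.zeros(3), ndata=200)
```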