phessianfree’s documentation

phessianfree is a modern optimization method for smooth, unconstrained minimization problems, intended to be used in place of LBFGS. It is designed to take advantage of the structure of regularized loss minimization problems, where the objective decomposes as a sum over a large number of terms. By exploiting this structure, it achieves rapid initial convergence similar to stochastic gradient descent methods, without sacrificing the later-stage convergence properties of batch methods such as LBFGS.

phessianfree uses a Newton-CG method, sometimes called a Hessian-free method, in which the search direction at each step is improved using a linear solver, bringing it closer to the ideal Newton step. Hessian-vector products are evaluated without forming the Hessian explicitly, using finite differences, with an adjustable overhead that defaults to 10% more computation per iteration than standard LBFGS methods.
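To make the finite-difference idea concrete, the sketch below shows the standard forward-difference approximation Hv ≈ (∇f(x + εv) − ∇f(x)) / ε. It is a generic illustration rather than phessianfree's internal code; the helper name and the assumption that a gradient function is available are hypothetical.

    import numpy as np

    def hessian_vector_product(grad, x, v, eps=1e-8):
        # Approximate H(x) @ v with one forward finite difference of the gradient.
        # eps plays the role of the fdEps property described below.
        g_x = grad(x)                     # typically already available from the outer step
        g_perturbed = grad(x + eps * v)   # the single extra gradient evaluation
        return (g_perturbed - g_x) / eps

    # Check on a quadratic f(x) = 0.5 * x @ A @ x, whose Hessian is exactly A.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    grad = lambda x: A.dot(x)
    x = np.array([1.0, -1.0])
    v = np.array([0.5, 2.0])
    print(hessian_vector_product(grad, x, v))   # close to A.dot(v)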

See the example code for comparisons between LBFGS, phessianfree and SAG on training a standard classifier, where phessianfree is able to converge in test loss up to 4 times faster than LBFGS. Example code with MNIST data bundled is available on GitHub.
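As a minimal usage sketch, the call below follows the signature documented in the next section: an objective that accepts a datapoint range (s, e), an initial point, and the dataset size. The synthetic least-squares data, the (value, gradient) return pair, and the 1/ndata scaling are illustrative assumptions rather than requirements stated on this page.

    import numpy as np
    import phessianfree

    # Synthetic least-squares problem: minimize 0.5 * ||X w - y||^2 / ndata.
    ndata, dim = 1000, 10
    rng = np.random.RandomState(0)
    X = rng.randn(ndata, dim)
    y = X.dot(np.ones(dim)) + 0.1 * rng.randn(ndata)

    def f(w, s, e):
        # Objective value and gradient over the datapoint range [s, e) only.
        Xs, ys = X[s:e], y[s:e]
        r = Xs.dot(w) - ys
        return 0.5 * r.dot(r) / ndata, Xs.T.dot(r) / ndata

    x0 = np.zeros(dim)
    xk, fval = phessianfree.optimize(f, x0, ndata, gtol=1e-5, maxiter=50)
    print("final objective:", fval)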

Core method

phessianfree.optimize(f, x0, ndata, gtol=1e-05, maxiter=100, callback=None, props={})

This method can be invoked in a similar way to the LBFGS routines in SciPy, with the following differences:

  • f takes additional arguments ‘s’ and ‘e’ that specify a range of datapoints over which to evaluate the objective.
  • The callback is given additional information.
  • Logging is performed using the standard Python logging framework.
Parameters:
  • f (function) – Objective function, taking arguments (x,s,e), where (s,e) is the range of datapoints over which to evaluate the objective.
  • x0 (vector) – Initial point
  • ndata (int) – Number of points in dataset. The passed function will be invoked with s,e between 0 and ndata.
  • gtol (float) – Stopping tolerance on the 2-norm of the gradient.
  • maxiter (int) – Maximum number of steps to complete. Note that this does not count line search iterations.
  • callback (function) – Invoked with (xk, fval, gfk, pointsProcessed); useful for tracking progress for later plotting. PlottingCallback in the convergence module can do this for you (see the sketch after the return type below).
  • props (object) –
    Map of additional parameters:
    • parts (integer default 100)
      For computing gradients and hessian-vector products, the data is split into this many parts. Each call to your objective function will cover roughly ndata/parts datapoints. The default of 100 is suitable for most datasets; smaller numbers are only useful if the dataset is small or non-homogeneous, in which case the hessian-free method is ineffective. Larger numbers of parts may improve convergence, but result in proportionally more internal overhead.
    • subsetVariant (string default ‘lbfgs’)
      Setting this to ‘cg’ gives the standard conjugate gradient method for solving the linear system Hp = -g, which finds the search direction p from the gradient g and hessian H. This is computed over only one of the parts, so only a small amount of the data is seen. Setting this to ‘lbfgs’ uses a stochastic minibatch lbfgs method for solving the linear subproblem. This sees many more parts of the data, but is only able to make half as many steps in the same computation time. For problems without extreme curvature, lbfgs works much better than cg. If the condition number of the hessian is very large, however, cg is the better option; in those cases the solveFraction property should normally be increased as well.
    • solveFraction (float default 0.2)
      The cg or lbfgs linear solver performs a number of iterations such that a solveFraction fraction of overhead is incurred. For example, if set to 0.2 with 100 parts and the cg subset variant, 20 cg iterations on one part will be performed. If subsetObjective is off, this amounts to roughly 20% extra computation per outer step compared to a standard lbfgs method (excluding line searches).
    • subsetObjective (boolean default True)
      Turn on or off the use of subsets of data for computing gradients. If off, gradients are computed using the full dataset, but hessian-vector products still use subsets. The size of the subset used for computing the gradient is adaptive using bounds on the approximation error.
    • gradRelErrorBound (float default 0.1)
      At a search point, the gradient is computed over enough parts so that the relative variance of the gradients is brought below this threshold. 0.1 is conservative; better results may be achieved by using values up to about 0.4. Larger values may cause erratic convergence behavior though.
    • lbfgsMemory (integer default 10)
      The lbfgs search direction is used as the initial guess at the search direction for the cg and lbfgs inner solves; this controls the memory used for that, and the same memory length is used for the inner lbfgs solve. Changing this has less of an effect than it would on a standard lbfgs implementation.
    • fdEps (float default 1e-8)
      Unless a gaussNewtonProd method is implemented, hessian-vector products are computed using finite differences. Unlike applying finite differences to approximate the gradient, the FD method yields hessian-vector products at the cost of only one extra subset gradient evaluation. If convergence plots become erratic near the optimum, tuning this parameter can help, although this normally occurs long after the test loss has plateaued.
    • innerSolveAverage (boolean default False)
      Applicable only if subsetVariant is lbfgs, this turns on the use of 50% sequence suffix averaging for the inner solve. If a large number of parts (say 1000) is being used, this can give better results.
    • innerSolveStepFactor (float default 0.5)
      The lbfgs subsetVariant is stochastic; however, it exploits the fact that quadratic problems admit an exact closed-form line search, which allows better step choices than plain SGD. Exact line searches produce overconfident steps, so each step is scaled by this factor. If the lbfgs linear solve is diverging, decrease this value.
Return type:

(xk, fval)
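As a hedged illustration of the callback and props parameters above, the sketch below records progress with a hand-written callback and passes a few of the documented properties; the particular property values are arbitrary examples, and f, x0 and ndata are assumed to be defined as in the earlier sketch. The PlottingCallback mentioned above could take the place of the hand-written callback, but its constructor is not documented on this page.

    import logging
    import phessianfree

    logging.basicConfig(level=logging.INFO)  # optimize() reports progress via the logging framework

    history = []

    def progress(xk, fval, gfk, pointsProcessed):
        # Record (datapoints processed, objective value) for later plotting.
        history.append((pointsProcessed, fval))

    props = {
        'parts': 100,           # split the data into 100 subsets
        'subsetVariant': 'cg',  # use the cg inner solver instead of the default 'lbfgs'
        'solveFraction': 0.4,   # allow more inner-solve work per outer step
    }

    xk, fval = phessianfree.optimize(f, x0, ndata, gtol=1e-5, maxiter=100,
                                     callback=progress, props=props)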

Note

If your objective is non-convex, you need to explicitly provide a function that computes matrix-vector products against the Gauss-Newton approximation to the hessian. You can do this by making f an object with a __call__ method that implements the objective function as above, and a gaussNewtonProd(x, v, s, e) method that implements the matrix-vector product against v for the Gauss-Newton approximation at x over the datapoints (s,e). This is illustrated in the autoencoder example code.
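A minimal structural sketch of such an object is shown below. Only the interface shape comes from this page (a __call__(x, s, e) objective and a gaussNewtonProd(x, v, s, e) product); the placeholder least-squares model and the (value, gradient) return pair are assumptions for illustration.

    import numpy as np

    class NonconvexObjective(object):
        # Usable as the f argument when the problem is non-convex.

        def __init__(self, X):
            self.X = X  # placeholder dataset; a real model (e.g. an autoencoder) would hold its data here

        def __call__(self, w, s, e):
            # Objective value and gradient over datapoints [s, e).
            Xs = self.X[s:e]
            r = Xs.dot(w)
            return 0.5 * r.dot(r), Xs.T.dot(r)

        def gaussNewtonProd(self, w, v, s, e):
            # Gauss-Newton matrix-vector product at w against v over datapoints [s, e).
            # For this least-squares placeholder the Gauss-Newton matrix is Xs.T Xs.
            Xs = self.X[s:e]
            return Xs.T.dot(Xs.dot(v))

    # f = NonconvexObjective(X)
    # xk, fval = phessianfree.optimize(f, x0, ndata)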
