Configuration options

Runners are configured through keyword arguments to their init methods.

These can be set in mrjob.conf, from the command line, or by passing them directly to the runner constructor.

All runners

MRJobRunner.__init__(mr_job_script=None, conf_path=None, extra_args=None, file_upload_args=None, hadoop_input_format=None, hadoop_output_format=None, input_paths=None, output_dir=None, partitioner=None, stdin=None, **opts)

All runners take the following keyword arguments:

Parameters:
  • mr_job_script (str) – the path of the .py file containing the MRJob. If this is None, you won’t actually be able to run() the job, but other utilities (e.g. ls()) will work.
  • conf_path (str) – Alternate path to read configs from, or False to ignore all config files.
  • extra_args (list of str) – a list of extra cmd-line arguments to pass to the mr_job script. This is a hook to allow jobs to take additional arguments.
  • file_upload_args – a list of tuples of ('--ARGNAME', path). The file at the given path will be uploaded to the local directory of the mr_job script when it runs, and then passed into the script with --ARGNAME. Useful for passing in SQLite DBs and other configuration files to your job.
  • hadoop_input_format (str) – name of an optional Hadoop InputFormat class. Passed to Hadoop along with your first step with the -inputformat option. Note that if you write your own class, you’ll need to include it in your own custom streaming jar (see hadoop_streaming_jar).
  • hadoop_output_format (str) – name of an optional Hadoop OutputFormat class. Passed to Hadoop along with your last step with the -outputformat option. Note that if you write your own class, you’ll need to include it in your own custom streaming jar (see hadoop_streaming_jar).
  • input_paths (list of str) – Input files for your job. Supports globs and recursively walks directories (e.g. ['data/common/', 'data/training/*.gz']). If this is left blank, we’ll read from stdin.
  • output_dir (str) – an empty/non-existent directory where Hadoop streaming should put the final output from the job. If you don’t specify an output directory, we’ll output into a subdirectory of this job’s temporary directory. You can control this from the command line with --output-dir.
  • partitioner (str) – Optional name of a Hadoop partitioner class, e.g. 'org.apache.hadoop.mapred.lib.HashPartitioner'. Hadoop streaming will use this to determine how mapper output should be sorted and distributed to reducers.
  • stdin – an iterable (can be a StringIO or even a list) to use as stdin. This is a hook for testing; if you set stdin via sandbox(), it’ll get passed through to the runner. If for some reason your lines are missing newlines, we’ll add them; this makes it easier to write automated tests.
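The newline handling described for stdin can be pictured with a short sketch. The helper name normalize_stdin_lines is hypothetical and not part of mrjob; it only illustrates the behavior described above and is not the library's actual implementation.

```python
from io import StringIO


def normalize_stdin_lines(stdin):
    """Yield each line from an iterable, adding a trailing newline if missing.

    Hypothetical helper for illustration only; not part of mrjob.
    """
    for line in stdin:
        yield line if line.endswith("\n") else line + "\n"


# Both a plain list and a StringIO are iterables of lines.
from_list = list(normalize_stdin_lines(["foo", "bar\n"]))
from_file = list(normalize_stdin_lines(StringIO("one\ntwo")))
```

Because any iterable of lines is accepted, a test can pass a plain list of strings without worrying about line endings.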

All runners also take the following options as keyword arguments. These can be defaulted in your mrjob.conf file:

Parameters:
  • base_tmp_dir (str) – path to put local temp dirs inside. By default we just call tempfile.gettempdir().
  • bootstrap_mrjob (bool) – should we automatically tar up the mrjob library and install it when we run the mrjob? Set this to False if you’ve already installed mrjob on your Hadoop cluster.
  • cleanup (list) – List of which kinds of directories to delete when a job succeeds. See CLEANUP_CHOICES.
  • cleanup_on_failure (list) – Which kinds of directories to clean up when a job fails. See CLEANUP_CHOICES.
  • cmdenv (dict) – environment variables to pass to the job inside Hadoop streaming
  • hadoop_extra_args (list of str) – extra arguments to pass to hadoop streaming
  • hadoop_streaming_jar (str) – path to a custom hadoop streaming jar.
  • jobconf (dict) – -jobconf args to pass to hadoop streaming. This should be a map from property name to value. Equivalent to passing ['-jobconf', 'KEY1=VALUE1', '-jobconf', 'KEY2=VALUE2', ...] to hadoop_extra_args.
  • label (str) – description of this job to use as part of its name. By default, we use the script’s module name, or no_script if there is none.
  • owner (str) – who is running this job. Used solely to set the job name. By default, we use getpass.getuser(), or no_user if it fails.
  • python_archives (list of str) – same as upload_archives, except they get added to the job’s PYTHONPATH
  • python_bin (str) – Name/path of alternate python binary for mappers/reducers (e.g. for use with virtualenv). Defaults to 'python'.
  • setup_cmds (list) – a list of commands to run before each mapper/reducer step (e.g. ['cd my-src-tree; make', 'mkdir -p /tmp/foo']). You can specify commands as strings, which will be run through the shell, or lists of args, which will be invoked directly. We’ll use file locking to ensure that multiple mappers/reducers running on the same node won’t run setup_cmds simultaneously (it’s safe to run make).
  • setup_scripts (list of str) – files that will be copied into the local working directory and then run. These are run after setup_cmds. Like with setup_cmds, we use file locking to keep multiple mappers/reducers on the same node from running setup_scripts simultaneously.
  • steps_python_bin (str) – Name/path of alternate python binary to use to query the job about its steps (e.g. for use with virtualenv). Rarely needed. Defaults to sys.executable (the current Python interpreter).
  • upload_archives (list of str) – a list of archives (e.g. tarballs) to unpack in the local directory of the mr_job script when it runs. You can set the local name of the dir we unpack into by appending #localname to the path; otherwise we just use the name of the archive file (e.g. foo.tar.gz).
  • upload_files (list of str) – a list of files to copy to the local directory of the mr_job script when it runs. You can set the local name of the copied file by appending #localname to the path; otherwise we just use the name of the file.
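Two of the conventions above are easy to show in code: the equivalence between jobconf and hadoop_extra_args, and the #localname suffix accepted by upload_archives and upload_files. The following is a minimal sketch; both function names are hypothetical, the sorted key order is an assumption made only to keep the output deterministic, and neither function is mrjob's own implementation.

```python
import posixpath


def jobconf_to_extra_args(jobconf):
    """Expand a jobconf dict into the equivalent hadoop_extra_args list.

    Hypothetical helper; keys are sorted only for a stable result.
    """
    args = []
    for key, value in sorted(jobconf.items()):
        args.extend(["-jobconf", "%s=%s" % (key, value)])
    return args


def split_upload_path(path):
    """Split 'path#localname' into (path, localname).

    Hypothetical helper; without a '#localname' suffix, the local name
    falls back to the base name of the file.
    """
    if "#" in path:
        path, localname = path.rsplit("#", 1)
    else:
        localname = posixpath.basename(path)
    return path, localname


extra_args = jobconf_to_extra_args(
    {"mapred.reduce.tasks": 2, "mapred.map.tasks": 4})
renamed = split_upload_path("data/foo.tar.gz#foo")
unnamed = split_upload_path("data/foo.tar.gz")
```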

In-process local testing

InlineMRJobRunner.__init__(mrjob_cls=None, **kwargs)

InlineMRJobRunner takes the same keyword args as MRJobRunner. However, please note:

  • hadoop_extra_args, hadoop_input_format, hadoop_output_format, hadoop_streaming_jar, jobconf, and partitioner are ignored because they require Java. If you need to test these, consider starting up a standalone Hadoop instance and running your job with -r hadoop.
  • cmdenv, python_bin, setup_cmds, setup_scripts, steps_python_bin, upload_archives, and upload_files are ignored because we don’t invoke the job as a subprocess or run it in its own directory.

Local Hadoop-like simulation

LocalMRJobRunner.__init__(**kwargs)

Arguments to this constructor may also appear in mrjob.conf under runners/local.

LocalMRJobRunner’s constructor takes the same keyword args as MRJobRunner. However, please note:

  • cmdenv is combined with combine_local_envs()
  • python_bin defaults to sys.executable (the current python interpreter)
  • hadoop_extra_args, hadoop_input_format, hadoop_output_format, hadoop_streaming_jar, and partitioner are ignored because they require Java. If you need to test these, consider starting up a standalone Hadoop instance and running your job with -r hadoop.
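As a rough illustration of defaulting local runner options under runners/local in mrjob.conf, the sketch below builds such a structure in Python and serializes it. The option names come from the lists above; the top-level "runners" key, the nesting, and the use of JSON are assumptions inferred from the runners/local path and from the remark elsewhere on this page that the config file is basically JSON. Check your mrjob version's configuration documentation before relying on it.

```python
import json
import sys

# Assumed layout: a top-level "runners" mapping keyed by runner name.
conf = {
    "runners": {
        "local": {
            "python_bin": sys.executable,  # matches the local runner default
            "cmdenv": {"TZ": "UTC"},       # example environment variable
            "base_tmp_dir": "/tmp",        # example path
        }
    }
}

# Text that could be written to a config file.
conf_text = json.dumps(conf, indent=2, sort_keys=True)

# Reading it back gives the keyword arguments for the local runner.
local_opts = json.loads(conf_text)["runners"]["local"]
```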

On EMR

EMRJobRunner.__init__(**kwargs)

EMRJobRunner takes the same arguments as MRJobRunner, plus some additional options which can be defaulted in mrjob.conf.

aws_access_key_id and aws_secret_access_key are required if you haven’t set them up already for boto (e.g. by setting the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY).

Additional options:

Parameters:
  • additional_emr_info (JSON str, None, or JSON-encodable object) – Special parameters to select additional features, mostly to support beta EMR features. Pass a JSON string on the command line or use data structures in the config file (which is itself basically JSON).
  • ami_version (str) – EMR AMI version to use. This controls which Hadoop version(s) are available and which version of Python is installed, among other things; see http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuideindex.html?EnvironmentConfig_AMIVersion.html for details. Implicitly defaults to AMI version 1.0 (this will change to 2.0 in mrjob v0.4).
  • aws_access_key_id (str) – “username” for Amazon web services.
  • aws_availability_zone (str) – availability zone to run the job in
  • aws_secret_access_key (str) – your “password” on AWS
  • aws_region (str) – region to connect to S3 and EMR on (e.g. us-west-1). If you want to use separate regions for S3 and EMR, set emr_endpoint and s3_endpoint.
  • bootstrap_actions (list of str) – a list of raw bootstrap actions (essentially scripts) to run prior to any of the other bootstrap steps. Any arguments should be separated from the command by spaces (we use shlex.split()). If the action is on the local filesystem, we’ll automatically upload it to S3.
  • bootstrap_cmds (list) – a list of commands to run on the master node to set up libraries, etc. Like setup_cmds, these can be strings, which will be run in the shell, or lists of args, which will be run directly. Prepend sudo to commands to do things that require root privileges.
  • bootstrap_files (list of str) – files to download to the bootstrap working directory on the master node before running bootstrap_cmds (for example, Debian packages). May be local files for mrjob to upload to S3, or any URI that hadoop fs can handle.
  • bootstrap_mrjob (boolean) – This is actually an option in the base MRJobRunner class. If this is True (the default), we’ll tar up mrjob from the local filesystem, and install it on the master node.
  • bootstrap_python_packages (list of str) – paths of python modules to install on EMR. These should be standard Python module tarballs. If a module is named foo.tar.gz, we expect to be able to run tar xfz foo.tar.gz; cd foo; sudo python setup.py install.
  • bootstrap_scripts (list of str) – scripts to upload and then run on the master node (a combination of bootstrap_cmds and bootstrap_files). These are run after the commands from bootstrap_cmds.
  • check_emr_status_every (float) – How often to check on the status of EMR jobs. Default is 30 seconds (too often and AWS will throttle you anyway).
  • ec2_instance_type (str) – What sort of EC2 instance(s) to use on the nodes that actually run tasks (see http://aws.amazon.com/ec2/instance-types/). When you run multiple instances (see num_ec2_instances), the master node is just coordinating the other nodes, so usually the default instance type (m1.small) is fine, and using larger instances is wasteful.
  • ec2_key_pair (str) – name of the SSH key you set up for EMR.
  • ec2_key_pair_file (str) – path to file containing the SSH key for EMR
  • ec2_core_instance_type (str) – like ec2_instance_type, but only for the core (also known as “slave”) Hadoop nodes; these nodes run tasks and host HDFS. Usually you just want to use ec2_instance_type. Defaults to 'm1.small'.
  • ec2_core_instance_bid_price (str) – when specified and not “0”, this creates the core Hadoop nodes as spot instances at this bid price. You usually only want to set bid price for task instances.
  • ec2_master_instance_type (str) – like ec2_instance_type, but only for the master Hadoop node. This node hosts the task tracker and HDFS, and runs tasks if there are no other nodes. Usually you just want to use ec2_instance_type. Defaults to 'm1.small'.
  • ec2_master_instance_bid_price (str) – when specified and not “0”, this creates the master Hadoop node as a spot instance at this bid price. You usually only want to set bid price for task instances unless the master instance is your only instance.
  • ec2_slave_instance_type (str) – An alias for ec2_core_instance_type, for consistency with the EMR API.
  • ec2_task_instance_type (str) – like ec2_instance_type, but only for the task Hadoop nodes; these nodes run tasks but do not host HDFS. Usually you just want to use ec2_instance_type. Defaults to the same instance type as ec2_core_instance_type.
  • ec2_task_instance_bid_price (str) – when specified and not “0”, this creates the task Hadoop nodes as spot instances at this bid price. (You usually only want to set bid price for task instances.)
  • emr_endpoint (str) – optional host to connect to when communicating with EMR (e.g. us-west-1.elasticmapreduce.amazonaws.com). Default is to infer this from aws_region.
  • emr_job_flow_id (str) – the ID of a persistent EMR job flow to run jobs in (normally we launch our own job flow). It’s fine for other jobs to be using the job flow; we give our job’s steps a unique ID.
  • emr_job_flow_pool_name (str) – Specify a pool name to join. Is set to 'default' if not specified. Does not imply pool_emr_job_flows.
  • enable_emr_debugging (bool) – store Hadoop logs in SimpleDB
  • hadoop_streaming_jar (str) – This is actually an option in the base MRJobRunner class. Points to a custom hadoop streaming jar on the local filesystem or S3. If you want to point to a streaming jar already installed on the EMR instances (perhaps through a bootstrap action?), use hadoop_streaming_jar_on_emr.
  • hadoop_streaming_jar_on_emr (str) – Like hadoop_streaming_jar, except that it points to a path on the EMR instance, rather than to a local file or one on S3. Rarely necessary to set this by hand.
  • hadoop_version (str) – Set the version of Hadoop to use on EMR. Consider setting ami_version instead; only AMI version 1.0 supports multiple versions of Hadoop anyway. If ami_version is not set, we’ll default to Hadoop 0.20 for backwards compatibility with mrjob v0.3.0.
  • num_ec2_core_instances (int) – Number of core (or “slave”) instances to start up. These run your job and host HDFS. Incompatible with num_ec2_instances. This is in addition to the single master instance.
  • num_ec2_instances (int) – Total number of instances to start up; basically the number of core instances you want, plus 1 (there is always one master instance). Default is 1. Incompatible with num_ec2_core_instances and num_ec2_task_instances.
  • num_ec2_task_instances (int) – number of task instances to start up. These run your job but do not host HDFS. Incompatible with num_ec2_instances. If you use this, you must set num_ec2_core_instances; EMR does not allow you to run task instances without core instances (because there’s nowhere to host HDFS).
  • pool_emr_job_flows (bool) – Try to run the job on a WAITING pooled job flow with the same bootstrap configuration. Prefer the one with the most compute units. Use S3 to “lock” the job flow and ensure that the job is not scheduled behind another job. If no suitable job flow is WAITING, create a new pooled job flow. WARNING: do not run this without having mrjob.tools.emr.terminate.idle_job_flows in your crontab; job flows left idle can quickly become expensive!
  • s3_endpoint (str) – Host to connect to when communicating with S3 (e.g. s3-us-west-1.amazonaws.com). Default is to infer this from aws_region.
  • s3_log_uri (str) – where on S3 to put logs, for example s3://yourbucket/logs/. Logs for your job flow will go into a subdirectory (in this example, s3://yourbucket/logs/j-JOBFLOWID/). Default is to append logs/ to s3_scratch_uri.
  • s3_scratch_uri (str) – S3 directory (URI ending in /) to use as scratch space, e.g. s3://yourbucket/tmp/. Default is tmp/mrjob/ in the first bucket belonging to you.
  • s3_sync_wait_time (float) – How long to wait for S3 to reach eventual consistency. This is typically less than a second (zero in U.S. West) but the default is 5.0 to be safe.
  • ssh_bin (str or list) – path to the ssh binary; may include switches (e.g. 'ssh -v' or ['ssh', '-v']). Defaults to ssh.
  • ssh_bind_ports (list of int) – a list of ports that are safe to listen on. Defaults to ports 40001 through 40840.
  • ssh_tunnel_to_job_tracker (bool) – If True, create an ssh tunnel to the job tracker and listen on a randomly chosen port. This requires you to set ec2_key_pair and ec2_key_pair_file. See SSH Tunneling and Log Fetching for detailed instructions.
  • ssh_tunnel_is_open (bool) – if True, any host can connect to the job tracker through the SSH tunnel you open. Mostly useful if your browser is running on a different machine from your job runner.
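The compatibility rules for num_ec2_instances, num_ec2_core_instances, and num_ec2_task_instances listed above can be summarized as a small validation routine. This is a sketch of those rules only; the function name resolve_instance_counts and its error messages are hypothetical and do not reflect how EMRJobRunner is actually implemented.

```python
def resolve_instance_counts(num_ec2_instances=None,
                            num_ec2_core_instances=None,
                            num_ec2_task_instances=None):
    """Return (master, core, task) instance counts.

    Hypothetical helper that encodes the documented rules:
    - there is always exactly one master instance;
    - num_ec2_instances is a total and cannot be combined with the
      core/task counts;
    - task instances require at least one core instance.
    """
    if num_ec2_instances is not None:
        if (num_ec2_core_instances is not None or
                num_ec2_task_instances is not None):
            raise ValueError(
                "num_ec2_instances is incompatible with "
                "num_ec2_core_instances and num_ec2_task_instances")
        # Everything beyond the master is a core instance.
        return 1, num_ec2_instances - 1, 0

    if num_ec2_task_instances and not num_ec2_core_instances:
        raise ValueError(
            "num_ec2_task_instances requires num_ec2_core_instances")

    return 1, num_ec2_core_instances or 0, num_ec2_task_instances or 0


total_of_five = resolve_instance_counts(num_ec2_instances=5)
core_and_task = resolve_instance_counts(num_ec2_core_instances=2,
                                        num_ec2_task_instances=3)
defaults = resolve_instance_counts()
```

With no counts given, the sketch returns a single master instance, which matches the documented default of num_ec2_instances being 1.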

On your Hadoop cluster

HadoopJobRunner.__init__(**kwargs)

HadoopJobRunner takes the same arguments as MRJobRunner, plus some additional options which can be defaulted in mrjob.conf.

output_dir and hdfs_scratch_dir need not be fully qualified hdfs:// URIs because it’s understood that they have to be on HDFS (e.g. tmp/mrjob/ would be okay).

Additional options:

Parameters:
  • hadoop_bin (str or list) – name/path of your hadoop program (may include arguments). Defaults to hadoop_home plus bin/hadoop.
  • hadoop_home (str) – alternative to setting the HADOOP_HOME environment variable
  • hdfs_scratch_dir (str) – temp directory on HDFS. Default is tmp/mrjob.

hadoop_streaming_jar is optional; by default, we’ll search for it inside HADOOP_HOME.
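The way hadoop_bin, hadoop_home, and the HADOOP_HOME environment variable relate to each other can be sketched as follows. The function resolve_hadoop_bin is hypothetical and its precedence (explicit hadoop_bin first, then hadoop_home, then the environment) is an assumption drawn from the descriptions above, not HadoopJobRunner's actual code.

```python
import os
import posixpath
import shlex


def resolve_hadoop_bin(hadoop_bin=None, hadoop_home=None, environ=None):
    """Return the hadoop command as a list of arguments.

    Hypothetical helper. hadoop_bin may be a string (possibly with
    arguments) or a list; if it is not given, fall back to
    <hadoop_home>/bin/hadoop, where hadoop_home defaults to the
    HADOOP_HOME environment variable.
    """
    environ = os.environ if environ is None else environ

    if hadoop_bin is None:
        home = hadoop_home or environ.get("HADOOP_HOME")
        if not home:
            raise ValueError(
                "set hadoop_bin, hadoop_home, or HADOOP_HOME")
        return [posixpath.join(home, "bin", "hadoop")]

    if isinstance(hadoop_bin, str):
        # A string may include arguments, so split it like a shell would.
        return shlex.split(hadoop_bin)

    return list(hadoop_bin)


from_home = resolve_hadoop_bin(hadoop_home="/opt/hadoop", environ={})
from_env = resolve_hadoop_bin(environ={"HADOOP_HOME": "/usr/lib/hadoop"})
from_string = resolve_hadoop_bin("hadoop --config /etc/hadoop", environ={})
```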

Getting configuration options out of runners

MRJobRunner.get_opts()

Get options set for this runner, as a dict.

classmethod MRJobRunner.get_default_opts()

Get default options for this runner class, as a dict.