mrjob.protocol - input and output

Protocols are what allow mrjob.job.MRJob to input and output arbitrary values, rather than just strings.

We use JSON as our default protocol rather than something more powerful because we want to encourage interoperability with other languages. If you need to handle types that JSON cannot represent, you can encode values as reprs (ReprProtocol) or pickles (PickleProtocol).

Also, if you know that your input will always be in JSON format, consider JSONValueProtocol as an alternative to RawValueProtocol.

Custom Protocols

A protocol is an object with methods read(self, line) and write(self, key, value). The read() method takes a line (a string) and returns a 2-tuple of decoded objects, (key, value). The write() method takes a key and a value and returns the encoded line, which is either passed back to Hadoop Streaming or written as output.

The built-in protocols use class methods instead of instance methods for legacy reasons, but you should use instance methods.
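As a rough illustration of the interface described above, here is a minimal sketch of a custom protocol that uses instance methods. The class name CommaSeparatedProtocol and its wire format are hypothetical and chosen only for this example; they are not part of mrjob.

```python
class CommaSeparatedProtocol(object):
    """Hypothetical protocol (not part of mrjob): key and value are
    lists of strings, each encoded as comma-separated text, joined by a tab.
    """

    def read(self, line):
        # Split on the first tab only, so the value may contain tabs.
        raw_key, raw_value = line.split('\t', 1)
        # Return a 2-tuple of decoded objects: (key, value).
        return raw_key.split(','), raw_value.split(',')

    def write(self, key, value):
        # Return a single line: encoded key, a tab, encoded value.
        return '%s\t%s' % (','.join(key), ','.join(value))


protocol = CommaSeparatedProtocol()
line = protocol.write(['a', 'b'], ['1', '2', '3'])
key, value = protocol.read(line)
```

The sketch assumes that no item contains a comma and that the key contains no tab; a real protocol would need to escape or reject such input.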

For more information on using alternate protocols in your job, see Protocols.

class mrjob.protocol.JSONProtocol

Encode (key, value) as two JSONs separated by a tab.

Note that JSON has some limitations; dictionary keys must be strings, and there’s no distinction between lists and tuples.
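The limitations above can be seen with the standard json module alone. The encode() and decode() helpers below are hypothetical stand-ins that mimic the "two JSONs separated by a tab" format; they are not mrjob's own code.

```python
import json


def encode(key, value):
    # Hypothetical helper: two JSON documents joined by a tab.
    return '%s\t%s' % (json.dumps(key), json.dumps(value))


def decode(line):
    # json.dumps() escapes tabs inside strings, so the first raw tab
    # is always the separator.
    raw_key, raw_value = line.split('\t', 1)
    return json.loads(raw_key), json.loads(raw_value)


line = encode((1, 2), {3: 'three'})
key, value = decode(line)

# The tuple key comes back as a list, and the integer dictionary key
# comes back as a string.
print(key)    # [1, 2]
print(value)  # {'3': 'three'}
```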

class mrjob.protocol.JSONValueProtocol

Encode value as a JSON and discard key (key is read in as None).

class mrjob.protocol.PickleProtocol

Encode (key, value) as two string-escaped pickles separated by a tab.

We string-escape the pickles to avoid having to deal with stray \t and \n characters, which would confuse Hadoop Streaming.
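The idea can be sketched as follows. In Python 2 this kind of escaping is available as the 'string_escape' codec; the sketch below instead uses 'unicode_escape' over latin-1 as a Python 3 stand-in, which is an assumption made for illustration and not necessarily how mrjob implements it. All function names here are hypothetical.

```python
import pickle


def escape_pickle(obj):
    # Pickle to bytes, then escape tabs, newlines and other
    # non-printable bytes so the result is plain one-line ASCII text.
    raw = pickle.dumps(obj)
    return raw.decode('latin-1').encode('unicode_escape').decode('ascii')


def unescape_pickle(text):
    # Reverse the escaping to recover the original pickle bytes.
    raw = text.encode('ascii').decode('unicode_escape').encode('latin-1')
    return pickle.loads(raw)


def write(key, value):
    # Two escaped pickles separated by a single tab.
    return '%s\t%s' % (escape_pickle(key), escape_pickle(value))


def read(line):
    # The escaped pickles contain no raw tabs, so the only tab in the
    # line is the separator.
    raw_key, raw_value = line.split('\t', 1)
    return unescape_pickle(raw_key), unescape_pickle(raw_value)


original_key = ('a\tb', {1: [2.5, None]})
line = write(original_key, 'x\ny')
key, value = read(line)
```

Because the escaped text never contains a raw tab or newline, each encoded pair stays on one line with exactly one separator.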

Ugly, but should work for any type that can be pickled.

class mrjob.protocol.PickleValueProtocol

Encode value as a string-escaped pickle and discard key (key is read in as None).

class mrjob.protocol.RawProtocol

Encode (key, value) as key and value separated by a tab (key and value should be bytestrings).

If key or value is None, don’t include a tab. When decoding a line with no tab in it, value will be None.

When reading from a line with multiple tabs, we break on the first one.

Your key should probably not be None or have tab characters in it, but we don’t check.
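The rules above can be summarized in a short sketch. These standalone read() and write() functions are hypothetical and only follow the behavior described here; they are not copied from mrjob.

```python
def write(key, value):
    # Join whichever of key and value is not None; with only one
    # present, no tab is written.
    return b'\t'.join(x for x in (key, value) if x is not None)


def read(line):
    # Break on the first tab only; later tabs stay in the value.
    parts = line.split(b'\t', 1)
    if len(parts) == 1:
        # No tab in the line: the value is None.
        return parts[0], None
    return parts[0], parts[1]


print(write(b'k', b'v'))       # b'k\tv'
print(read(b'a\tb\tc'))        # (b'a', b'b\tc')
print(read(write(None, b'v'))) # (b'v', None)
```

The last call shows why a None key is a bad idea: the value is written without a tab, so it is read back as the key.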

class mrjob.protocol.RawValueProtocol

Read in a line as (None, line). Write out (key, value) as value. value must be a str.

The default way for a job to read its initial input.

class mrjob.protocol.ReprProtocol

Encode (key, value) as two reprs separated by a tab.

This only works for basic types (we use mrjob.util.safeeval()).
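To illustrate the "basic types only" restriction, here is a minimal sketch that uses the standard library's ast.literal_eval() as a stand-in for mrjob.util.safeeval(). That substitution is an assumption made so the example is self-contained; the two functions may not accept exactly the same input. The write() and read() helpers are hypothetical.

```python
import ast


def write(key, value):
    # Two reprs separated by a tab; repr() escapes tabs inside strings.
    return '%s\t%s' % (repr(key), repr(value))


def read(line):
    raw_key, raw_value = line.split('\t', 1)
    # literal_eval only accepts literals (numbers, strings, tuples,
    # lists, dicts, None, ...), not arbitrary expressions.
    return ast.literal_eval(raw_key), ast.literal_eval(raw_value)


line = write((1, 'a'), {'x': [1.5, None]})
key, value = read(line)  # unlike JSON, the tuple key stays a tuple

# An arbitrary object's repr is not a literal, so decoding fails.
try:
    read(write('k', object()))
    decoded_object = True
except (ValueError, SyntaxError):
    decoded_object = False
```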

class mrjob.protocol.ReprValueProtocol

Encode value as a repr and discard key (key is read in as None).

This only works for basic types (we use mrjob.util.safeeval()).

mrjob.protocol.DEFAULT_PROTOCOL = 'json'

Deprecated since version 0.3.0.

Formerly the default protocol for all encoded input and output: 'json'

mrjob.protocol.PROTOCOL_DICT = {'raw_value': <class 'mrjob.protocol.RawValueProtocol'>, 'pickle': <class 'mrjob.protocol.PickleProtocol'>, 'repr': <class 'mrjob.protocol.ReprProtocol'>, 'json_value': <class 'mrjob.protocol.JSONValueProtocol'>, 'repr_value': <class 'mrjob.protocol.ReprValueProtocol'>, 'json': <class 'mrjob.protocol.JSONProtocol'>, 'pickle_value': <class 'mrjob.protocol.PickleValueProtocol'>}

Deprecated since version 0.3.0.

Default mapping from protocol name to class:

name            class
json            JSONProtocol
json_value      JSONValueProtocol
pickle          PickleProtocol
pickle_value    PickleValueProtocol
raw_value       RawValueProtocol
repr            ReprProtocol
repr_value      ReprValueProtocol
