LIAC-ARFF v2.1

Introduction

The liac-arff module implements functions to read and write ARFF files in Python. It was created in the Connectionist Artificial Intelligence Laboratory (LIAC) at the Federal University of Rio Grande do Sul (UFRGS), in Brazil.

ARFF (Attribute-Relation File Format) is a file format specially designed to describe datasets used in machine learning experiments and software. It was created for use in Weka, one of the most widely used tools for automated machine learning experiments.

An ARFF file can be divided into two sections: header and data. The header describes the metadata of the dataset, including a general description, its name, and its attributes. The source below is an example of the header section of an XOR dataset:

% 
% XOR Dataset
% 
% Created by Renato Pereira
%            rppereira@inf.ufrgs.br
%            http://inf.ufrgs.br/~rppereira
% 
% 
@RELATION XOR

@ATTRIBUTE input1 REAL
@ATTRIBUTE input2 REAL
@ATTRIBUTE y REAL

The data section of an ARFF file describes the observations of the dataset; in the case of the XOR dataset:

@DATA
0.0,0.0,0.0
0.0,1.0,1.0
1.0,0.0,1.0
1.0,1.0,0.0
% 
% 
% 

Notice that several lines start with a % symbol, denoting a comment. Lines beginning with % are ignored, except for the description block at the beginning of the file. The declarations @RELATION, @ATTRIBUTE, and @DATA are all case insensitive and obligatory.

For more information and details about the ARFF file format, consult http://www.cs.waikato.ac.nz/~ml/weka/arff.html

ARFF Files in Python

This module uses built-in Python objects to represent a deserialized ARFF file. A dictionary is used as the container of the data and metadata of ARFF, and has the following keys:

  • description: (OPTIONAL) a string with the description of the dataset.

  • relation: (OBLIGATORY) a string with the name of the dataset.

  • attributes: (OBLIGATORY) a list of attributes with the following template:

    (attribute_name, attribute_type)
    

    the attribute_name is a string, and attribute_type must be a string or a list of strings.

  • data: (OBLIGATORY) a list of data instances. Each data instance must be a list with values, depending on the attributes.

The above keys are case sensitive and must be written exactly as described. The attribute_type must be one of the following strings (case insensitive): NUMERIC, INTEGER, REAL or STRING. For nominal attributes, attribute_type must be a list of strings.

In this format, the XOR dataset presented above can be represented as a Python object as:

xor_dataset = {
    'description': 'XOR Dataset',
    'relation': 'XOR',
    'attributes': [
        ('input1', 'REAL'),
        ('input2', 'REAL'),
        ('y', 'REAL'),
    ],
    'data': [
        [0.0, 0.0, 0.0],
        [0.0, 1.0, 1.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 0.0]
    ]
}

Features

This module provides several features, including:

  • Read and write ARFF files using Python built-in structures, such as dictionaries and lists;
  • Supports scipy.sparse.coo matrices and lists of dictionaries as used by SVMLight;
  • Supports the following attribute types: NUMERIC, REAL, INTEGER, STRING, and NOMINAL;
  • Has an interface similar to other built-in modules such as json, or zipfile;
  • Supports reading and writing file descriptions;
  • Supports missing values and names with spaces;
  • Supports unicode values and names;
  • Fully compatible with Python 2.6+ and Python 3.4+;
  • Under MIT License

How To Install

Via pip:

$ pip install liac-arff

Via easy_install:

$ easy_install liac-arff

Manually:

$ python setup.py install

Basic Usage

arff.load(fp, encode_nominal=False, return_type=0)

Load a file-like object containing the ARFF document and convert it into a Python object.

Parameters:
  • fp – a file-like object.
  • encode_nominal – boolean, if True perform a label encoding while reading the .arff file.
  • return_type – determines the data structure used to store the dataset. Can be one of arff.DENSE, arff.COO and arff.LOD. Consult the section on working with sparse data.
Returns:

a dictionary.

arff.loads(s, encode_nominal=False, return_type=0)

Convert a string instance containing the ARFF document into a Python object.

Parameters:
  • s – a string object.
  • encode_nominal – boolean, if True perform a label encoding while reading the .arff file.
  • return_type – determines the data structure used to store the dataset. Can be one of arff.DENSE, arff.COO and arff.LOD. Consult the section on working with sparse data.
Returns:

a dictionary.

arff.dump(obj, fp)

Serialize an object representing the ARFF document to a given file-like object.

Parameters:
  • obj – a dictionary.
  • fp – a file-like object.
arff.dumps(obj)

Serialize an object representing the ARFF document, returning a string.

Parameters:obj – a dictionary.
Returns:a string with the ARFF document.

Encoders and Decoders

class arff.ArffDecoder

An ARFF decoder.

decode(s, encode_nominal=False, return_type=0)

Returns the Python representation of a given ARFF file.

When a file object is passed as an argument, this method reads it line by line, avoiding loading the whole document into memory.

Parameters:
  • s – a string or file object with the ARFF file.
  • encode_nominal – boolean, if True perform a label encoding while reading the .arff file.
  • return_type – determines the data structure used to store the dataset. Can be one of arff.DENSE, arff.COO and arff.LOD. Consult the section on working with sparse data.
class arff.ArffEncoder

An ARFF encoder.

encode(obj)

Encodes a given object to an ARFF file.

Parameters:obj – the object containing the ARFF information.
Returns:the ARFF file as a unicode string.
iter_encode(obj)

The iterative version of arff.ArffEncoder.encode.

This iteratively encodes a given object, returning the lines of the ARFF file one by one.

Parameters:obj – the object containing the ARFF information.
Returns:(yields) the ARFF file as unicode strings.

Exceptions

exception arff.BadRelationFormat

Error raised when the relation declaration is in an invalid format.

exception arff.BadAttributeFormat

Error raised when some attribute declaration is in an invalid format.

exception arff.BadDataFormat

Error raised when some data instance is in an invalid format.

exception arff.BadAttributeType

Error raised when some invalid type is provided into the attribute declaration.

exception arff.BadNominalValue

Error raised when a value is used in some data instance but is not declared in its respective attribute declaration.

exception arff.BadNumericalValue

Error raised when an invalid numerical value is used in some data instance.

exception arff.BadLayout

Error raised when the layout of the ARFF file is invalid.

exception arff.BadObject(msg='')

Error raised when the object representing the ARFF document is invalid.

Unicode

LIAC-ARFF works with unicode (in Python 2.6+; in Python 3.x this is the default). To take advantage of it, load the ARFF file using codecs, specifying its encoding:

import codecs
import arff

file_ = codecs.open('/path/to/file.arff', 'rb', 'utf-8')
arff.load(file_)
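An alternative sketch uses io.open (available on Python 2.6+ and built in on Python 3), which also takes an explicit encoding argument. The file name 'example.arff' here is hypothetical; the snippet writes it first so the example is self-contained:

```python
import io
import arff

# Write a tiny UTF-8 ARFF file containing a non-ASCII value, then read it back.
content = u'@RELATION demo\n\n@ATTRIBUTE name STRING\n\n@DATA\ncaf\u00e9\n'
with io.open('example.arff', 'w', encoding='utf-8') as f:
    f.write(content)
with io.open('example.arff', 'r', encoding='utf-8') as f:
    d = arff.load(f)
print(d['data'][0][0])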

Examples

Dumping An Object

Converting an object to ARFF:

import arff

obj = {
   'description': u'',
   'relation': 'weather',
   'attributes': [
       ('outlook', ['sunny', 'overcast', 'rainy']),
       ('temperature', 'REAL'),
       ('humidity', 'REAL'),
       ('windy', ['TRUE', 'FALSE']),
       ('play', ['yes', 'no'])
   ],
   'data': [
       ['sunny', 85.0, 85.0, 'FALSE', 'no'],
       ['sunny', 80.0, 90.0, 'TRUE', 'no'],
       ['overcast', 83.0, 86.0, 'FALSE', 'yes'],
       ['rainy', 70.0, 96.0, 'FALSE', 'yes'],
       ['rainy', 68.0, 80.0, 'FALSE', 'yes'],
       ['rainy', 65.0, 70.0, 'TRUE', 'no'],
       ['overcast', 64.0, 65.0, 'TRUE', 'yes'],
       ['sunny', 72.0, 95.0, 'FALSE', 'no'],
       ['sunny', 69.0, 70.0, 'FALSE', 'yes'],
       ['rainy', 75.0, 80.0, 'FALSE', 'yes'],
       ['sunny', 75.0, 70.0, 'TRUE', 'yes'],
       ['overcast', 72.0, 90.0, 'TRUE', 'yes'],
       ['overcast', 81.0, 75.0, 'FALSE', 'yes'],
       ['rainy', 71.0, 91.0, 'TRUE', 'no']
   ],
}

print(arff.dumps(obj))

resulting in:

@RELATION weather

@ATTRIBUTE outlook {sunny, overcast, rainy}
@ATTRIBUTE temperature REAL
@ATTRIBUTE humidity REAL
@ATTRIBUTE windy {TRUE, FALSE}
@ATTRIBUTE play {yes, no}

@DATA
sunny,85.0,85.0,FALSE,no
sunny,80.0,90.0,TRUE,no
overcast,83.0,86.0,FALSE,yes
rainy,70.0,96.0,FALSE,yes
rainy,68.0,80.0,FALSE,yes
rainy,65.0,70.0,TRUE,no
overcast,64.0,65.0,TRUE,yes
sunny,72.0,95.0,FALSE,no
sunny,69.0,70.0,FALSE,yes
rainy,75.0,80.0,FALSE,yes
sunny,75.0,70.0,TRUE,yes
overcast,72.0,90.0,TRUE,yes
overcast,81.0,75.0,FALSE,yes
rainy,71.0,91.0,TRUE,no
%
%
%

Loading An Object

Loading an ARFF file:

import arff
import pprint

file_ = '''@RELATION weather

@ATTRIBUTE outlook {sunny, overcast, rainy}
@ATTRIBUTE temperature REAL
@ATTRIBUTE humidity REAL
@ATTRIBUTE windy {TRUE, FALSE}
@ATTRIBUTE play {yes, no}

@DATA
sunny,85.0,85.0,FALSE,no
sunny,80.0,90.0,TRUE,no
overcast,83.0,86.0,FALSE,yes
rainy,70.0,96.0,FALSE,yes
rainy,68.0,80.0,FALSE,yes
rainy,65.0,70.0,TRUE,no
overcast,64.0,65.0,TRUE,yes
sunny,72.0,95.0,FALSE,no
sunny,69.0,70.0,FALSE,yes
rainy,75.0,80.0,FALSE,yes
sunny,75.0,70.0,TRUE,yes
overcast,72.0,90.0,TRUE,yes
overcast,81.0,75.0,FALSE,yes
rainy,71.0,91.0,TRUE,no
%
%
% '''
d = arff.loads(file_)
pprint.pprint(d)

resulting in:

{u'attributes': [(u'outlook', [u'sunny', u'overcast', u'rainy']),
                 (u'temperature', u'REAL'),
                 (u'humidity', u'REAL'),
                 (u'windy', [u'TRUE', u'FALSE']),
                 (u'play', [u'yes', u'no'])],
 u'data': [[u'sunny', 85.0, 85.0, u'FALSE', u'no'],
           [u'sunny', 80.0, 90.0, u'TRUE', u'no'],
           [u'overcast', 83.0, 86.0, u'FALSE', u'yes'],
           [u'rainy', 70.0, 96.0, u'FALSE', u'yes'],
           [u'rainy', 68.0, 80.0, u'FALSE', u'yes'],
           [u'rainy', 65.0, 70.0, u'TRUE', u'no'],
           [u'overcast', 64.0, 65.0, u'TRUE', u'yes'],
           [u'sunny', 72.0, 95.0, u'FALSE', u'no'],
           [u'sunny', 69.0, 70.0, u'FALSE', u'yes'],
           [u'rainy', 75.0, 80.0, u'FALSE', u'yes'],
           [u'sunny', 75.0, 70.0, u'TRUE', u'yes'],
           [u'overcast', 72.0, 90.0, u'TRUE', u'yes'],
           [u'overcast', 81.0, 75.0, u'FALSE', u'yes'],
           [u'rainy', 71.0, 91.0, u'TRUE', u'no']],
 u'description': u'',
 u'relation': u'weather'}

Loading An Object with encoded labels

In some cases it is practical to have categorical data represented by integers rather than strings. In scikit-learn, for example, integer data can be directly converted into a continuous representation with the One-Hot Encoder, which is necessary for most machine learning algorithms, e.g. Support Vector Machines. The values [u'sunny', u'overcast', u'rainy'] of the attribute u'outlook' would be represented by [0, 1, 2]. This representation can be used directly with the One-Hot Encoder.

Encoding categorical data while reading it from a file saves at least one memory copy, and can be invoked as in this example:

import arff
import pprint

file_ = '''@RELATION weather

@ATTRIBUTE outlook {sunny, overcast, rainy}
@ATTRIBUTE temperature REAL
@ATTRIBUTE humidity REAL
@ATTRIBUTE windy {TRUE, FALSE}
@ATTRIBUTE play {yes, no}

@DATA
sunny,85.0,85.0,FALSE,no
sunny,80.0,90.0,TRUE,no
overcast,83.0,86.0,FALSE,yes
rainy,70.0,96.0,FALSE,yes
rainy,68.0,80.0,FALSE,yes
rainy,65.0,70.0,TRUE,no
overcast,64.0,65.0,TRUE,yes
sunny,72.0,95.0,FALSE,no
sunny,69.0,70.0,FALSE,yes
rainy,75.0,80.0,FALSE,yes
sunny,75.0,70.0,TRUE,yes
overcast,72.0,90.0,TRUE,yes
overcast,81.0,75.0,FALSE,yes
rainy,71.0,91.0,TRUE,no
%
%
% '''
decoder = arff.ArffDecoder()
d = decoder.decode(file_, encode_nominal=True)
pprint.pprint(d)

resulting in:

{u'attributes': [(u'outlook', [u'sunny', u'overcast', u'rainy']),
             (u'temperature', u'REAL'),
             (u'humidity', u'REAL'),
             (u'windy', [u'TRUE', u'FALSE']),
             (u'play', [u'yes', u'no'])],
 u'data': [[0, 85.0, 85.0, 1, 1],
           [0, 80.0, 90.0, 0, 1],
           [1, 83.0, 86.0, 1, 0],
           [2, 70.0, 96.0, 1, 0],
           [2, 68.0, 80.0, 1, 0],
           [2, 65.0, 70.0, 0, 1],
           [1, 64.0, 65.0, 0, 0],
           [0, 72.0, 95.0, 1, 1],
           [0, 69.0, 70.0, 1, 0],
           [2, 75.0, 80.0, 1, 0],
           [0, 75.0, 70.0, 0, 0],
           [1, 72.0, 90.0, 0, 0],
           [1, 81.0, 75.0, 1, 0],
           [2, 71.0, 91.0, 0, 1]],
 u'description': u'',
 u'relation': u'weather'}

Using this dataset in scikit-learn:

from sklearn import preprocessing, svm
import numpy as np

data = np.array(d['data'])
X, y = data[:, :4], data[:, 4]
# One-hot encode the nominal feature columns (outlook and windy).
# Note: the categorical_features parameter was removed in scikit-learn 0.22;
# on newer versions a ColumnTransformer achieves the same effect.
enc = preprocessing.OneHotEncoder(categorical_features=[0, 3])
X_enc = enc.fit_transform(X).toarray()
clf = svm.SVC()
clf.fit(X_enc, y)

Working with sparse data

Sparse data is data in which most of the elements are zero. By storing only the non-zero elements, one can potentially save a lot of space on disk or in RAM. liac-arff supports two sparse data structures:

  • scipy.sparse.coo is intended for easy construction of sparse matrices inside a Python program.

  • list of dictionaries in the form

    [{column: value, column: value},
     {column: value, column: value}]
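In the list-of-dictionaries layout, each dictionary maps a column index to a non-zero value; columns absent from a dictionary are implicitly zero. A small helper (hypothetical, not part of liac-arff) that produces this layout from dense rows:

```python
# Convert dense rows to the list-of-dictionaries layout by keeping
# only the non-zero values, keyed by column index.
def dense_to_lod(rows):
    return [{i: v for i, v in enumerate(row) if v != 0} for row in rows]

dense = [[0.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
print(dense_to_lod(dense))  # → [{}, {1: 1.0, 2: 1.0}]
```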
    

Dumping sparse data

Both scipy.sparse.coo matrices and lists of dictionaries can be used as the value for data in the arff object. Let’s look again at the XOR example, this time with the data encoded as a list of dictionaries:

xor_dataset = {
    'description': 'XOR Dataset',
    'relation': 'XOR',
    'attributes': [
        ('input1', 'REAL'),
        ('input2', 'REAL'),
        ('y', 'REAL'),
    ],
    'data': [
        {},
        {1: 1.0, 2: 1.0},
        {0: 1.0, 2: 1.0},
        {0: 1.0, 1: 1.0}
    ]
}

print(arff.dumps(xor_dataset))

resulting in:

% XOR Dataset
@RELATION XOR

@ATTRIBUTE input1 REAL
@ATTRIBUTE input2 REAL
@ATTRIBUTE y REAL

@DATA
{  }
{ 1 1.0,2 1.0 }
{ 0 1.0,2 1.0 }
{ 0 1.0,1 1.0 }
%
%
%

Loading sparse data

When reading a sparse dataset, the user can choose a target data structure. These are represented by the constants arff.DENSE, arff.COO and arff.LOD:

decoder = arff.ArffDecoder()
d = decoder.decode(file_, encode_nominal=True, return_type=arff.LOD)
pprint.pprint(d)

resulting in:

{
    'description': 'XOR Dataset',
    'relation': 'XOR',
    'attributes': [
        ('input1', 'REAL'),
        ('input2', 'REAL'),
        ('y', 'REAL'),
    ],
    'data': [
        {},
        {1: 1.0, 2: 1.0},
        {0: 1.0, 2: 1.0},
        {0: 1.0, 1: 1.0}
    ]
}

When choosing arff.COO, the data can be directly passed to the scipy constructor:

from scipy import sparse
decoder = arff.ArffDecoder()
d = decoder.decode(file_, encode_nominal=True, return_type=arff.COO)
data = d['data'][0]
row = d['data'][1]
col = d['data'][2]
matrix = sparse.coo_matrix((data, (row, col)), shape=(max(row)+1, max(col)+1))
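As the snippet above shows, with arff.COO the 'data' entry holds three parallel lists: values, row indices, and column indices. A plain-Python sketch of how those triplets map back to a dense matrix (using the XOR values from earlier as made-up input):

```python
# Rebuild a dense matrix from COO-style (values, rows, cols) triplets.
def coo_to_dense(values, rows, cols, shape):
    m = [[0.0] * shape[1] for _ in range(shape[0])]
    for v, r, c in zip(values, rows, cols):
        m[r][c] = v
    return m

values = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
rows = [1, 1, 2, 2, 3, 3]
cols = [1, 2, 0, 2, 0, 1]
print(coo_to_dense(values, rows, cols, (4, 3)))
# → [[0.0, 0.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
```

In practice scipy.sparse.coo_matrix does this (and more) efficiently; the helper only illustrates the layout.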