clevercsv package

Submodules

clevercsv.break_ties module

Break ties in the data consistency measure.

Author: Gertjan van den Burg

clevercsv.break_ties.break_ties_four(data, dialects)

Break ties between four dialects.

This function works by breaking the ties between pairs of dialects that result in the same parsing result (if any). If this reduces the number of dialects, then break_ties_three() or break_ties_two() is used; otherwise, the tie can’t be broken.

Ties are only broken if all dialects have the same delimiter.

Parameters
  • data (str) – The data of the file as a string

  • dialects (list) – List of SimpleDialect objects

Returns

dialect – The chosen dialect if the tie can be broken, None otherwise.

Return type

SimpleDialect

Notes

We have only observed one case during development where this function was needed. It may need to be revisited in the future if other examples are found.

clevercsv.break_ties.break_ties_three(data, A, B, C)

Break ties between three dialects.

If the delimiters and the escape characters are all equal, then we look for the dialect that has no quotechar. The tie is broken by calling break_ties_two() for the dialect without quotechar and another dialect that gives the same parsing result.

If only the delimiter is the same for all dialects then use break_ties_two() on the dialects that do not have a quotechar, provided there are only two of these.

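Parameters
  • data (str) – The data of the file as a string

  • A (SimpleDialect) – A potential dialect

  • B (SimpleDialect) – A potential dialect

  • C (SimpleDialect) – A potential dialect

Returns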

dialect – The chosen dialect if the tie can be broken, None otherwise.

Return type

SimpleDialect

Notes

We have only observed one tie for each case during development, so this may need to be improved in the future.

clevercsv.break_ties.break_ties_two(data, A, B)

Break ties between two dialects.

This function breaks ties between two dialects that give the same score. We distinguish several cases:

1. The delimiter and escapechar are the same and one of the quote characters is the empty string. We parse the file with both dialects and check whether the parsing result is the same. If it is, the correct dialect is the one with no quotechar; otherwise it’s the other one.

2. The quotechar and escapechar are the same and the delimiters are comma and space, in which case we go for comma. Alternatively, if either of the delimiters is the hyphen, we assume it’s the other dialect.

3. The delimiter and quotechar are the same and one dialect uses the escapechar and the other doesn’t. We break this tie by checking whether the escapechar has an effect and whether it occurs an even or odd number of times.

If it’s none of these cases, we don’t break the tie and return None.

Parameters
  • data (str) – The data of the file as a string.

  • A (SimpleDialect) – A potential dialect

  • B (SimpleDialect) – A potential dialect

Returns

dialect – The chosen dialect if the tie can be broken, None otherwise.

Return type

SimpleDialect or None
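
As a rough illustration of case 2 above, the delimiter rule could be sketched as follows. This is a minimal sketch under stated assumptions, not CleverCSV's implementation: the Dialect namedtuple is a stand-in for SimpleDialect, and break_delimiter_tie is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-in for SimpleDialect (assumption: only these three fields matter here).
Dialect = namedtuple("Dialect", ["delimiter", "quotechar", "escapechar"])


def break_delimiter_tie(A, B):
    """Sketch of case 2: same quotechar/escapechar, different delimiters."""
    if A.quotechar != B.quotechar or A.escapechar != B.escapechar:
        return None  # not case 2
    delims = {A.delimiter, B.delimiter}
    if delims == {",", " "}:
        # Comma versus space: prefer the comma.
        return A if A.delimiter == "," else B
    if "-" in delims and len(delims) == 2:
        # One dialect uses the hyphen: prefer the other dialect.
        return B if A.delimiter == "-" else A
    return None  # tie not broken


comma = Dialect(",", '"', "")
space = Dialect(" ", '"', "")
chosen = break_delimiter_tie(comma, space)
```

Here chosen is the comma dialect; a pair that matches neither rule yields None, mirroring the documented fallback.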

clevercsv.break_ties.reduce_pairwise(data, dialects)

Reduce the set of dialects by breaking pairwise ties

Parameters
  • data (str) – The data of the file as a string

  • dialects (list) – List of SimpleDialect objects

Returns

dialects – List of SimpleDialect objects.

Return type

list

clevercsv.break_ties.tie_breaker(data, dialects)

Break ties between dialects.

This function is used to break ties where possible between two, three, or four dialects that receive the same value for the data consistency measure.

Parameters
  • data (str) – The data as a single string

  • dialects (list) – Dialects that are tied

Returns

dialect – One of the dialects from the list provided or None.

Return type

SimpleDialect

clevercsv.consistency module

Detect the dialect using the data consistency measure.

Author: Gertjan van den Burg

clevercsv.consistency.break_ties(data, dialects)
clevercsv.consistency.consistency_scores(data, dialects, skip=True, logger=<built-in function print>)
clevercsv.consistency.detect_consistency_dialects(data, dialects, verbose=False)

Wrapper for dialect detection with the consistency measure

This function takes a list of dialects to consider.

clevercsv.consistency.detect_dialect_consistency(data, delimiters=None, verbose=False)

Detect the dialect with the data consistency measure

This uses the data consistency measure to detect the dialect. See the paper for details.

Parameters
  • data (str) – The data of the file as a string

  • delimiters (iterable) – List of delimiters to consider. If None, the get_delimiters() function is used to automatically detect this (as described in the paper).

  • verbose (bool) – Print out the dialects considered and their scores.

Returns

dialect – The detected dialect. If no dialect could be detected, returns None.

Return type

SimpleDialect

clevercsv.consistency.get_best_set(scores)

clevercsv.cparser_util module

Python utility functions that wrap the C parser.

clevercsv.cparser_util.field_size_limit(*args, **kwargs)

Get/Set the limit to the field size.

This function is adapted from the one in the Python CSV module. See the documentation there.

clevercsv.cparser_util.parse_data(data, dialect=None, delimiter=None, quotechar=None, escapechar=None, strict=None, return_quoted=False)

Parse the data given a dialect using the C parser

Parameters
  • data (iterable) – The data of the CSV file as an iterable

  • dialect (SimpleDialect) – The dialect to use for the parsing. If None, the dialect with each component set to the empty string is used.

  • delimiter (str) – The delimiter to use. If not None, overwrites the delimiter in the dialect.

  • quotechar (str) – The quote character to use. If not None, overwrites the quote character in the dialect.

  • escapechar (str) – The escape character to use. If not None, overwrites the escape character in the dialect.

  • strict (bool) – Enable strict mode or not. If not None, overwrites the strict mode set in the dialect.

  • return_quoted (bool) – For each cell, return a tuple “(field, is_quoted)” where the second element indicates whether the cell was a quoted cell or not.

Yields

rows (list) – The rows of the file as a list of cells.

Raises

Error – When an error occurs during parsing.

clevercsv.cparser_util.parse_string(data, *args, **kwargs)

Utility for when the CSV file is encoded as a single string

clevercsv.detect module

Drop-in replacement for Python Sniffer object.

Author: Gertjan van den Burg

class clevercsv.detect.Detector

Bases: object

Detect the dialect of CSV files with normal forms or the data consistency measure. This class provides a drop-in replacement for the Python dialect Sniffer from the standard library.

Note

We call the object Detector just to mark the difference in the implementation and avoid naming issues. You can nonetheless import it under the name Sniffer, as in from clevercsv import Sniffer.

detect(sample, delimiters=None, verbose=False, method='auto')
has_header(sample)

Detect if a file has a header from a sample.

This function is copied from CPython! The only change we’ve made is to use our dialect detection method.

sniff(sample, delimiters=None, verbose=False)

clevercsv.detect_pattern module

Code for computing the pattern score.

Author: Gertjan van den Burg

clevercsv.detect_pattern.fill_empties(abstract)

Fill empty cells in the abstraction

The way the row patterns are constructed assumes that empty cells are marked by the letter C as well. This function fills those in. The function also removes duplicate occurrences of CC and replaces these with C.

Parameters

abstract (str) – The abstract representation of the file.

Returns

abstraction – The abstract representation with empties filled.

Return type

str
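
The filling step can be sketched with plain string replacements. This is a minimal sketch, not the library's code: it assumes that the abstraction uses D for a delimiter and R for a row separator next to the letter C named above, and fill_empties_sketch is a hypothetical name.

```python
def fill_empties_sketch(abstract):
    """Mark empty cells with C and collapse repeated C's.

    Assumption: "C" is a cell, "D" a delimiter, "R" a row separator.
    """
    # An empty cell sits between two delimiters, or between a delimiter
    # and a row separator (in either order).
    for pair, filled in (("DD", "DCD"), ("DR", "DCR"), ("RD", "RCD")):
        while pair in abstract:
            abstract = abstract.replace(pair, filled)
    # Collapse duplicate cell markers.
    while "CC" in abstract:
        abstract = abstract.replace("CC", "C")
    # Empty cells at the very start or end of the file.
    if abstract.startswith("D"):
        abstract = "C" + abstract
    if abstract.endswith("D"):
        abstract += "C"
    return abstract


filled = fill_empties_sketch("CDDCRDC")
```

Under these assumptions, the input CDDCRDC (an empty middle cell in the first row and an empty first cell in the second row) becomes CDCDCRCDC.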

clevercsv.detect_pattern.make_abstraction(data, dialect)

Create an abstract representation of the CSV file based on the dialect.

This function constructs the basic abstraction used to compute the row patterns.

Parameters
  • data (str) – The data of the file as a string.

  • dialect (SimpleDialect) – A dialect to parse the file with.

Returns

abstraction – An abstract representation of the CSV file.

Return type

str

clevercsv.detect_pattern.merge_with_quotechar(S, dialect)

Merge quoted blocks in the abstraction

This function takes the abstract representation and merges quoted blocks (QC...CQ) to a single cell (C). The function takes nested quotes into account.

Parameters
  • S (str) – The data of a file as a string

  • dialect (SimpleDialect) – The dialect used to make the abstraction.

Returns

abstraction – A simplified version of the abstraction with quoted blocks merged.

Return type

str

clevercsv.detect_pattern.pattern_score(data, dialect, eps=0.001)

Compute the pattern score for given data and a dialect.

Parameters
  • data (string) – The data of the file as a raw character string

  • dialect (dialect.Dialect) – The dialect object

Returns

score – the pattern score

Return type

float

clevercsv.detect_pattern.strip_trailing(abstract)

Strip trailing row separator from abstraction.

clevercsv.detect_type module

Code for computing the type score.

Author: Gertjan van den Burg

class clevercsv.detect_type.TypeDetector(strip_whitespace=True)

Bases: object

detect_type(cell, is_quoted=False)
is_currency(cell, **kwargs)
is_date(cell, **kwargs)
is_datetime(cell, **kwargs)
is_email(cell, **kwargs)
is_empty(cell, **kwargs)
is_ipv4(cell, **kwargs)
is_known_type(cell, is_quoted=False)
is_nan(cell, **kwargs)
is_number(cell, **kwargs)
is_percentage(cell, **kwargs)
is_time(cell, **kwargs)
is_unicode_alphanum(cell, is_quoted=False, **kwargs)
is_unix_path(cell, **kwargs)
is_url(cell, **kwargs)
clevercsv.detect_type.gen_known_type(cells)

Utility that yields, for each of the provided cells, whether or not the cell is of a known type.

clevercsv.detect_type.type_score(data, dialect, eps=1e-10)

Compute the type score as the ratio of cells with a known type.

Parameters
  • data (str) – the data as a single string

  • dialect (SimpleDialect) – the dialect to use

  • eps (float) – the minimum value of the type score
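
The ratio described above can be sketched with the standard library. This is only an illustration under stated assumptions: it parses with the csv module instead of CleverCSV's parser, and is_known_type_sketch recognizes just empty cells and plain numbers, whereas TypeDetector covers many more types. Both function names are hypothetical.

```python
import csv
import io
import re

# Assumption: a very small notion of "known type" (empty or plain number).
NUMBER = re.compile(r"^[+-]?\d+(\.\d+)?$")


def is_known_type_sketch(cell):
    cell = cell.strip()
    return cell == "" or NUMBER.match(cell) is not None


def type_score_sketch(data, delimiter=",", quotechar='"', eps=1e-10):
    """Ratio of cells with a known type, bounded below by eps."""
    reader = csv.reader(
        io.StringIO(data), delimiter=delimiter, quotechar=quotechar
    )
    total = known = 0
    for row in reader:
        for cell in row:
            total += 1
            known += is_known_type_sketch(cell)
    if total == 0:
        return eps
    return max(eps, known / total)


score = type_score_sketch("1,2\n3,x\n")
```

Three of the four cells are numbers here, so score is 0.75; a file with no recognizable cells gets eps rather than zero.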

clevercsv.dialect module

Definitions for the dialect object.

Author: Gertjan van den Burg

class clevercsv.dialect.SimpleDialect(delimiter, quotechar, escapechar, strict=False)

Bases: object

The simplified dialect object.

For the delimiter, quotechar, and escapechar the empty string means no delimiter/quotechar/escapechar in the file. None is used to mark it undefined.

Parameters
  • delimiter (str) – The delimiter of the CSV file.

  • quotechar (str) – The quotechar of the file.

  • escapechar (str) – The escapechar of the file.

  • strict (bool) – Whether strict parsing should be enforced. Same as in the csv module.

classmethod deserialize(obj)

Deserialize dialect from a JSON object

classmethod from_csv_dialect(d)
classmethod from_dict(d)
serialize()

Serialize dialect to a JSON object

to_csv_dialect()
to_dict()
validate()

clevercsv.dict_read_write module

DictReader and DictWriter.

This code is entirely copied from the Python csv module. The only exception is that it uses the reader and writer classes from our package.

Author: Gertjan van den Burg

class clevercsv.dict_read_write.DictReader(f, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds)

Bases: object

property fieldnames
class clevercsv.dict_read_write.DictWriter(f, fieldnames, restval='', extrasaction='raise', dialect='excel', *args, **kwds)

Bases: object

writeheader()
writerow(rowdict)
writerows(rowdicts)

clevercsv.escape module

Common functions for dealing with escape characters.

Author: Gertjan van den Burg

Date: 2018-11-06

clevercsv.escape.is_potential_escapechar(char, encoding, block_char=None)

Check if a character is a potential escape character.

A character is considered a potential escape character if it is in the “Punctuation, Other” Unicode category and not in the list of blocked characters.

Parameters
  • char (str) – The character to check

  • encoding (str) – The encoding of the character

  • block_char (iterable) –

    Characters that are in the Punctuation Other category but that should not be considered as escape characters. If None, the default set is used, equal to:

    ["!", "?", '"', "'", ".", ",", ";", ":", "%", "*", "&", "#"]
    

Returns

is_escape – Whether the character is considered a potential escape or not.

Return type

bool
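
The check can be approximated with the unicodedata module, where “Punctuation, Other” is the category Po. This is a minimal sketch: it takes an already decoded character and therefore omits the encoding parameter, and is_potential_escapechar_sketch is a hypothetical name.

```python
import unicodedata

# Default blocked characters, as listed in the parameter description.
DEFAULT_BLOCK_CHAR = {"!", "?", '"', "'", ".", ",", ";", ":", "%", "*", "&", "#"}


def is_potential_escapechar_sketch(char, block_char=None):
    """True if char is in category Po and is not explicitly blocked."""
    blocked = DEFAULT_BLOCK_CHAR if block_char is None else set(block_char)
    return unicodedata.category(char) == "Po" and char not in blocked


backslash_ok = is_potential_escapechar_sketch("\\")
exclamation_ok = is_potential_escapechar_sketch("!")
```

The backslash is in category Po and is not blocked, so it qualifies; the exclamation mark is also in Po but is blocked by default.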

clevercsv.exceptions module

Exceptions for CleverCSV

Author: Gertjan van den Burg

exception clevercsv.exceptions.Error

Bases: cparser.Error

exception clevercsv.exceptions.NoDetectionResult

Bases: Exception

clevercsv.normal_form module

Detect the dialect with very strict functional tests.

This module uses so-called “normal forms” to detect the dialect of CSV files. Normal forms are detected with strict functional tests. The normal forms are used as a pre-test to check if files are simple enough that computing the data consistency measure is not necessary.

Author: Gertjan van den Burg

clevercsv.normal_form.detect_dialect_normal(data, encoding='UTF-8', delimiters=None, verbose=False)

Detect the normal form of a file from a given sample

Parameters
  • data (str) – The data as a single string

  • encoding (str) – The encoding of the data

Returns

dialect – The dialect detected using normal forms, or None if no such dialect can be found.

Return type

SimpleDialect

clevercsv.normal_form.even_rows(rows, dialect)
clevercsv.normal_form.every_row_has_delim(rows, dialect)
clevercsv.normal_form.has_delimiter(string, delim)
clevercsv.normal_form.has_nested_quotes(string, quotechar)
clevercsv.normal_form.is_any_empty(cell)
clevercsv.normal_form.is_any_partial_quoted_cell(cell)
clevercsv.normal_form.is_any_quoted_cell(cell)
clevercsv.normal_form.is_elementary(cell)
clevercsv.normal_form.is_empty_quoted(cell, quotechar)
clevercsv.normal_form.is_empty_unquoted(cell)
clevercsv.normal_form.is_form_1(data, dialect=None)
clevercsv.normal_form.is_form_2(data, dialect)
clevercsv.normal_form.is_form_3(data, dialect)
clevercsv.normal_form.is_form_4(data, dialect)
clevercsv.normal_form.is_form_5(data, dialect)
clevercsv.normal_form.is_quoted_cell(cell, quotechar)
clevercsv.normal_form.maybe_has_escapechar(data, encoding, delim, quotechar)
clevercsv.normal_form.split_file(data)
clevercsv.normal_form.split_row(row, dialect)
clevercsv.normal_form.strip_trailing_crnl(data)

clevercsv.potential_dialects module

Code for selecting the potential dialects of a file.

Author: Gertjan van den Burg

clevercsv.potential_dialects.filter_urls(data)

Filter URLs from the data

clevercsv.potential_dialects.get_delimiters(data, encoding, delimiters=None, block_cat=None, block_char=None)

Get potential delimiters

The set of potential delimiters is constructed as follows. For each unique character of the file, we check if its Unicode character category is in the set block_cat of prohibited categories. If it is, we don’t allow it to be a delimiter, with the exception of Tab (which is in the Control category). We furthermore block characters in block_char from being delimiters.

Parameters
  • data (str) – The data of the file

  • encoding (str) – The encoding of the file

  • delimiters (iterable) – Allowed delimiters. If provided, it overrides the block_cat/block_char mechanism and only the provided characters will be considered delimiters (if they occur in the file). If None, all characters can be considered delimiters subject to the block_cat and block_char parameters.

  • block_cat (list) –

    List of Unicode categories (2-letter abbreviations) for characters that should not be considered as delimiters. If None, the following default set is used:

    ["Lu", "Ll", "Lt", "Lm", "Lo", "Nd", "Nl", "No", "Ps", "Pe", "Co"]
    

  • block_char (list) –

    Explicit list of characters that should not be considered delimiters. If None, the following default set is used:

    [".", "/", '"', "'", "\n", "\r"]
    

Returns

delims – Set of potential delimiters. The empty string is added by default.

Return type

set
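
The construction described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the data is taken as an already decoded string, so the encoding parameter is omitted, and get_delimiters_sketch is a hypothetical name.

```python
import unicodedata

DEFAULT_BLOCK_CAT = ["Lu", "Ll", "Lt", "Lm", "Lo", "Nd", "Nl", "No", "Ps", "Pe", "Co"]
DEFAULT_BLOCK_CHAR = [".", "/", '"', "'", "\n", "\r"]


def get_delimiters_sketch(data, delimiters=None, block_cat=None, block_char=None):
    """Collect the characters of data that may serve as a delimiter."""
    block_cat = DEFAULT_BLOCK_CAT if block_cat is None else block_cat
    block_char = DEFAULT_BLOCK_CHAR if block_char is None else block_char

    delims = set()
    for char in set(data):
        if delimiters is not None:
            # An explicit list overrides the blocking mechanism.
            if char in delimiters:
                delims.add(char)
            continue
        # Tab is always allowed as far as its category is concerned.
        allowed_cat = char == "\t" or unicodedata.category(char) not in block_cat
        if allowed_cat and char not in block_char:
            delims.add(char)
    delims.add("")  # the empty string (no delimiter) is always included
    return delims


found = get_delimiters_sketch("a,b;c\n1,2;3\n")
```

For this input, letters and digits are excluded by category and the newline by block_char, leaving the comma, the semicolon, and the empty string.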

clevercsv.potential_dialects.get_dialects(data, encoding='UTF-8', delimiters=None, test_masked_by_quotes=False)

Return the possible dialects for the given data.

We consider as escape characters those characters for which is_potential_escapechar() is True and that occur at least once before a quote character or delimiter in the dialect.

One may wonder if self-escaping is an issue here (i.e. “\\”, two times backslash). It is not. Escaping a backslash with a backslash only makes sense in a file where the backslash is already used as an escape character (in which case we include it). If it is never used as escape for the delimiter or quotechar, then it is not necessary to self-escape. This is an assumption, but it holds in general and it reduces noise.

Parameters
  • data (str) – The data for the file

  • encoding (str) – The encoding of the file

  • delimiters (iterable) – Set of delimiters to consider. See get_delimiters() for more info.

  • test_masked_by_quotes (bool) – Remove dialects where the delimiter is always masked by the quote character. Enabling this typically removes a number of potential dialects from the list, which can remove false positives. It is, however, not a very fast operation, so it is disabled by default.

Returns

dialects – List of SimpleDialect objects that are considered potential dialects.

Return type

list

clevercsv.potential_dialects.get_quotechars(data, quote_chars=None)

Get potential quote characters

Quote characters are those that occur in the quote_chars set and are found at least once in the file.

Parameters
  • data (str) – The data of the file as a string

  • quote_chars (iterable) –

    Characters that should be considered quote characters. If it is None, the following default set is used:

    ["'", '"', "~", "`"]
    

Returns

quotes – Set of potential quote characters. The empty string is added by default.

Return type

set
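
In code, this selection amounts to a membership test per candidate. The following is a minimal sketch with a hypothetical function name; it follows the description above rather than the library source.

```python
def get_quotechars_sketch(data, quote_chars=None):
    """Return candidate quote characters that occur in data."""
    if quote_chars is None:
        quote_chars = ["'", '"', "~", "`"]
    quotes = {q for q in quote_chars if q in data}
    quotes.add("")  # the empty string (no quote character) is always included
    return quotes


quotes = get_quotechars_sketch('a,"b c",d~e')
```

Here quotes contains the double quote, the tilde, and the empty string, since the single quote and the backtick do not occur in the data.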

clevercsv.potential_dialects.masked_by_quotechar(data, quotechar, escapechar, test_char)

Test if a character is always masked by quote characters

This function tests if a given character is always within quoted segments (defined by the quote character). Double quoting and escaping are supported.

Parameters
  • data (str) – The data of the file as a string

  • quotechar (str) – The quote character

  • escapechar (str) – The escape character

  • test_char (str) – The character to test

Returns

masked – Returns True if the test character is never outside quoted segments, False otherwise.

Return type

bool
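
One way to implement such a test is a single pass that tracks whether the scan is currently inside a quoted segment. The sketch below is an assumption-laden illustration with a hypothetical name, not the library's code; in particular, treating an empty test character as not masked is a choice made here.

```python
def masked_by_quotechar_sketch(data, quotechar, escapechar, test_char):
    """True if test_char never occurs outside a quoted segment."""
    if test_char == "":
        return False  # assumption: the empty string cannot be masked
    in_quotes = False
    escape_next = False
    i, n = 0, len(data)
    while i < n:
        s = data[i]
        if escape_next:
            # This character is escaped and has no special meaning.
            escape_next = False
        elif escapechar and s == escapechar:
            escape_next = True
        elif s == quotechar:
            if in_quotes and i + 1 < n and data[i + 1] == quotechar:
                i += 1  # doubled quote inside a quoted segment: skip it
            else:
                in_quotes = not in_quotes
        elif s == test_char and not in_quotes:
            return False
        i += 1
    return True


separated = masked_by_quotechar_sketch('"a,b","c,d"', '"', "", ",")
masked = masked_by_quotechar_sketch('"a,b"\n"c,d"', '"', "", ",")
```

In the first call the comma between the two quoted cells lies outside the quotes, so the result is False; in the second call every comma is quoted and the result is True.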

clevercsv.potential_dialects.unicode_category(x, encoding=None)

Return the Unicode category of a character

Parameters
  • x (str) – character

  • encoding (str) – Encoding of the character

Returns

category – The Unicode category of the character.

Return type

str

clevercsv.read module

Drop-in replacement for the Python csv reader class. This is a wrapper for the Parser class, defined in cparser.

Author: Gertjan van den Burg

class clevercsv.read.reader(csvfile, dialect='excel', **fmtparams)

Bases: object

next()

clevercsv.utils module

Various utilities

Author: Gertjan van den Burg

clevercsv.utils.get_encoding(filename)

Get the encoding of the file

This function uses the chardet package for detecting the encoding of a file.

Parameters

filename (str) – Path to a file

Returns

encoding – Encoding of the file.

Return type

str

clevercsv.utils.pairwise(iterable)

s -> (s0, s1), (s1, s2), (s2, s3), …
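
The mapping shown above matches the well-known pairwise recipe from the itertools documentation. A minimal sketch of that recipe, assuming the library's helper behaves the same way (the name pairwise_sketch is hypothetical):

```python
from itertools import tee


def pairwise_sketch(iterable):
    """s -> (s0, s1), (s1, s2), (s2, s3), ..."""
    a, b = tee(iterable)
    next(b, None)  # advance the second iterator by one element
    return zip(a, b)


pairs = list(pairwise_sketch([1, 2, 3, 4]))
```

The result is a list of overlapping pairs; an iterable with fewer than two elements produces no pairs.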

clevercsv.wrappers module

Wrappers for some loading/saving functionality.

Author: Gertjan van den Burg

clevercsv.wrappers.csv2df(filename, *args, num_chars=None, **kwargs)

This function is deprecated, use read_dataframe instead.

clevercsv.wrappers.detect_dialect(filename, num_chars=None, encoding=None, verbose=False, method='auto')

Detect the dialect of a CSV file

This is a utility function that simply returns the detected dialect of a given CSV file.

Parameters
  • filename (str) – The filename of the CSV file.

  • num_chars (int) – Number of characters to read for the detection. If None, the entire file will be read. Note that limiting the number of characters can reduce the accuracy of the detected dialect.

  • encoding (str) – The file encoding of the CSV file. If None, it is detected.

  • verbose (bool) – Enable verbose mode during detection.

  • method (str) – Dialect detection method to use. Either ‘normal’ for normal form detection, ‘consistency’ for the consistency measure, or ‘auto’ for first normal and then consistency.

Returns

dialect – The detected dialect as a SimpleDialect, or None if detection failed.

Return type

SimpleDialect

clevercsv.wrappers.read_as_dicts(filename, dialect=None, encoding=None, num_chars=None, verbose=False)

This function is deprecated, use read_dicts instead.

clevercsv.wrappers.read_csv(filename, dialect=None, encoding=None, num_chars=None, verbose=False)

This function is deprecated, use read_table instead.

clevercsv.wrappers.read_dataframe(filename, *args, num_chars=None, **kwargs)

Read a CSV file to a Pandas dataframe

This function uses CleverCSV to detect the dialect, and then passes this to the read_csv function in pandas. Additional arguments and keyword arguments are passed to read_csv as well.

Parameters
  • filename (str) – The filename of the CSV file. At the moment, only local files are supported.

  • *args – Additional arguments for the pandas.read_csv function.

  • num_chars (int) –

    Number of characters to use for dialect detection. If None, use the entire file.

    Note that using less than the entire file will speed up detection, but can reduce the accuracy of the detected dialect.

  • **kwargs – Additional keyword arguments for the pandas.read_csv function. You can specify the file encoding here if needed, and it will be used during dialect detection.

clevercsv.wrappers.read_dicts(filename, dialect=None, encoding=None, num_chars=None, verbose=False)

Read a CSV file as a list of dictionaries

This function returns the rows of the CSV file as a list of dictionaries. The keys of the dictionaries are assumed to be in the first row of the CSV file. The dialect will be detected automatically, unless it is provided.

Parameters
  • filename (str) – Path of the CSV file

  • dialect (str, SimpleDialect, or csv.Dialect object) – If the dialect is known, it can be provided here. This function uses the CleverCSV clevercsv.DictReader object, which supports various dialect types (string, SimpleDialect, or csv.Dialect). If None, the dialect will be detected.

  • encoding (str) – The encoding of the file. If None, it is detected.

  • num_chars (int) –

    Number of characters to use to detect the dialect. If None, use the entire file.

    Note that using less than the entire file will speed up detection, but can reduce the accuracy of the detected dialect.

  • verbose (bool) – Whether or not to show detection progress.

Returns

rows – Returns rows of the file as a list of dictionaries.

Return type

list

Raises

NoDetectionResult – When the dialect detection fails.

clevercsv.wrappers.read_table(filename, dialect=None, encoding=None, num_chars=None, verbose=False)

Read a CSV file as a table (a list of lists)

This is a convenience function that reads a CSV file and returns the data as a list of lists (= rows). The dialect will be detected automatically, unless it is provided.

Parameters
  • filename (str) – Path of the CSV file

  • dialect (str, SimpleDialect, or csv.Dialect object) – If the dialect is known, it can be provided here. This function uses the CleverCSV clevercsv.reader object, which supports various dialect types (string, SimpleDialect, or csv.Dialect). If None, the dialect will be detected.

  • encoding (str) – The encoding of the file. If None, it is detected.

  • num_chars (int) –

    Number of characters to use to detect the dialect. If None, use the entire file.

    Note that using less than the entire file will speed up detection, but can reduce the accuracy of the detected dialect.

  • verbose (bool) – Whether or not to show detection progress.

Returns

rows – Returns rows as a list of lists.

Return type

list

Raises

NoDetectionResult – When the dialect detection fails.

clevercsv.wrappers.stream_csv(filename, dialect=None, encoding=None, num_chars=None, verbose=False)

This function is deprecated, use stream_table instead.

clevercsv.wrappers.stream_dicts(filename, dialect=None, encoding=None, num_chars=None, verbose=False)

Read a CSV file as a generator over dictionaries

This function streams the rows of the CSV file as dictionaries. The keys of the dictionaries are assumed to be in the first row of the CSV file. The dialect will be detected automatically, unless it is provided.

Parameters
  • filename (str) – Path of the CSV file

  • dialect (str, SimpleDialect, or csv.Dialect object) – If the dialect is known, it can be provided here. This function uses the CleverCSV clevercsv.DictReader object, which supports various dialect types (string, SimpleDialect, or csv.Dialect). If None, the dialect will be detected.

  • encoding (str) – The encoding of the file. If None, it is detected.

  • num_chars (int) –

    Number of characters to use to detect the dialect. If None, use the entire file.

    Note that using less than the entire file will speed up detection, but can reduce the accuracy of the detected dialect.

  • verbose (bool) – Whether or not to show detection progress.

Returns

rows – Returns file as a generator over rows as dictionaries.

Return type

generator

Raises

NoDetectionResult – When the dialect detection fails.

clevercsv.wrappers.stream_table(filename, dialect=None, encoding=None, num_chars=None, verbose=False)

Read a CSV file as a generator over rows of a table

This is a convenience function that reads a CSV file and returns the data as a generator of rows. The dialect will be detected automatically, unless it is provided.

Parameters
  • filename (str) – Path of the CSV file

  • dialect (str, SimpleDialect, or csv.Dialect object) – If the dialect is known, it can be provided here. This function uses the CleverCSV clevercsv.reader object, which supports various dialect types (string, SimpleDialect, or csv.Dialect). If None, the dialect will be detected.

  • encoding (str) – The encoding of the file. If None, it is detected.

  • num_chars (int) –

    Number of characters to use to detect the dialect. If None, use the entire file.

    Note that using less than the entire file will speed up detection, but can reduce the accuracy of the detected dialect.

  • verbose (bool) – Whether or not to show detection progress.

Returns

rows – Returns file as a generator over rows.

Return type

generator

Raises

NoDetectionResult – When the dialect detection fails.

clevercsv.wrappers.write_table(table, filename, dialect='excel', transpose=False)

Write a table (a list of lists) to a file

This is a convenience function for writing a table to a CSV file.

Parameters
  • table (list) – A table as a list of lists. The table must have the same number of cells in each row (taking the transpose flag into account).

  • filename (str) – The filename of the CSV file to write the table to.

  • dialect (SimpleDialect or csv.Dialect) – The dialect to use.

  • transpose (bool) – Transpose the table before writing.

Raises

ValueError – When the length of the rows is not constant.
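
The behavior described above (constant row length, optional transpose) can be sketched with the standard csv writer. This is an illustration with a hypothetical function name; it uses csv.writer rather than clevercsv.writer and only accepts dialects that the csv module understands.

```python
import csv
import os
import tempfile


def write_table_sketch(table, filename, dialect="excel", transpose=False):
    """Write a list of lists to filename, optionally transposed."""
    # All rows must have the same number of cells.
    if len({len(row) for row in table}) > 1:
        raise ValueError("Table doesn't have constant row length.")
    if transpose:
        table = [list(column) for column in zip(*table)]
    with open(filename, "w", newline="", encoding="utf-8") as fp:
        csv.writer(fp, dialect=dialect).writerows(table)


# Usage: write a transposed table to a temporary file and read it back.
path = os.path.join(tempfile.mkdtemp(), "table.csv")
write_table_sketch([["a", "b"], [1, 2]], path, transpose=True)
with open(path, newline="", encoding="utf-8") as fp:
    rows = list(csv.reader(fp))
```

Reading the file back gives the rows ["a", "1"] and ["b", "2"], since the csv module returns every cell as a string.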

clevercsv.write module

Drop-in replacement for the Python csv writer class.

Author: Gertjan van den Burg

class clevercsv.write.writer(csvfile, dialect='excel', **fmtparams)

Bases: object

writerow(row)
writerows(rows)

Module contents