clevercsv package#

Subpackages#

Submodules#

clevercsv.break_ties module#

Break ties in the data consistency measure.

Author: Gertjan van den Burg

clevercsv.break_ties.break_ties_four(data: str, dialects: List[SimpleDialect]) SimpleDialect | None#

Break ties between four dialects.

This function works by breaking the ties between pairs of dialects that result in the same parsing result (if any). If this reduces the number of dialects, then break_ties_three() or break_ties_two() is used; otherwise, the tie can’t be broken.

Ties are only broken if all dialects have the same delimiter.

Parameters:
  • data (str) – The data of the file as a string

  • dialects (list) – List of SimpleDialect objects

Returns:

dialect – The chosen dialect if the tie can be broken, None otherwise.

Return type:

Optional[SimpleDialect]

Notes

We have only observed one case during development where this function was needed. It may need to be revisited in the future if other examples are found.

clevercsv.break_ties.break_ties_three(data: str, A: SimpleDialect, B: SimpleDialect, C: SimpleDialect) SimpleDialect | None#

Break ties between three dialects.

If the delimiters and the escape characters are all equal, then we look for the dialect that has no quotechar. The tie is broken by calling break_ties_two() for the dialect without quotechar and another dialect that gives the same parsing result.

If only the delimiter is the same for all dialects then use break_ties_two() on the dialects that do not have a quotechar, provided there are only two of these.

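Parameters:
  • data (str) – The data of the file as a string.

  • A (SimpleDialect) – A potential dialect

  • B (SimpleDialect) – A potential dialect

  • C (SimpleDialect) – A potential dialect

Returns: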

dialect – The chosen dialect if the tie can be broken, None otherwise.

Return type:

Optional[SimpleDialect]

Notes

We have only observed one tie for each case during development, so this may need to be improved in the future.

clevercsv.break_ties.break_ties_two(data: str, A: SimpleDialect, B: SimpleDialect) SimpleDialect | None#

Break ties between two dialects.

This function breaks ties between two dialects that give the same score. We distinguish several cases:

1. If delimiter and escapechar are the same and one of the quote characters is the empty string, we parse the file with both dialects and check if the parsing result is the same. If it is, the correct dialect is the one with no quotechar; otherwise it’s the other one.

2. If quotechar and escapechar are the same and the delimiters are comma and space, then we go for comma. Alternatively, if either of the delimiters is the hyphen, we assume it’s the other dialect.

3. If the delimiter and quotechar are the same and one dialect uses the escapechar and the other doesn’t, we break this tie by checking if the escapechar has an effect and if it occurs an even or odd number of times.

If it’s none of these cases, we don’t break the tie and return None.

Parameters:
  • data (str) – The data of the file as a string.

  • A (SimpleDialect) – A potential dialect

  • B (SimpleDialect) – A potential dialect

Returns:

dialect – The chosen dialect if the tie can be broken, None otherwise.

Return type:

SimpleDialect or None
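Case 2 above can be illustrated with a minimal, standard-library sketch. This is not the CleverCSV implementation: the Dialect tuple and the break_delimiter_tie name are hypothetical stand-ins introduced only for this example, and the other cases (which require parsing the data) are left out.

```python
from typing import NamedTuple, Optional


class Dialect(NamedTuple):
    """Hypothetical stand-in for SimpleDialect, used only in this sketch."""

    delimiter: str
    quotechar: str
    escapechar: str


def break_delimiter_tie(a: Dialect, b: Dialect) -> Optional[Dialect]:
    """Sketch of case 2: equal quotechar and escapechar, different delimiters."""
    if a.quotechar != b.quotechar or a.escapechar != b.escapechar:
        return None  # not case 2, so this sketch does not decide
    delims = {a.delimiter, b.delimiter}
    if delims == {",", " "}:
        # Comma versus space: prefer the comma.
        return a if a.delimiter == "," else b
    if "-" in delims and len(delims) == 2:
        # One delimiter is the hyphen: prefer the other dialect.
        return b if a.delimiter == "-" else a
    return None  # tie cannot be broken by this rule


comma = Dialect(",", '"', "")
space = Dialect(" ", '"', "")
hyphen = Dialect("-", '"', "")
print(break_delimiter_tie(space, comma))
print(break_delimiter_tie(hyphen, space))
```

Returning None when no rule applies mirrors the documented behaviour of leaving the tie unbroken.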

clevercsv.break_ties.reduce_pairwise(data: str, dialects: List[SimpleDialect]) List[SimpleDialect] | None#

Reduce the set of dialects by breaking pairwise ties

Parameters:
  • data (str) – The data of the file as a string

  • dialects (list) – List of SimpleDialect objects

Returns:

dialects – List of SimpleDialect objects.

Return type:

list

clevercsv.break_ties.tie_breaker(data: str, dialects: List[SimpleDialect]) SimpleDialect | None#

Break ties between dialects.

This function is used to break ties where possible between two, three, or four dialects that receive the same value for the data consistency measure.

Parameters:
  • data (str) – The data as a single string

  • dialects (list) – Dialects that are tied

Returns:

dialect – One of the dialects from the list provided or None.

Return type:

Optional[SimpleDialect]

clevercsv.consistency module#

Detect the dialect using the data consistency measure.

Author: Gertjan van den Burg

class clevercsv.consistency.ConsistencyDetector(skip: bool = True, verbose: bool = False, cache_capacity: int = 100000)#

Bases: object

Detect the dialect with the data consistency measure

This class uses the data consistency measure to detect the dialect. See the paper for details.

Parameters:
  • skip (bool) – Skip computation of the type score for dialects with a low pattern score.

  • verbose (bool) – Print out the dialects considered and their scores.

  • cache_capacity (int) – The size of the cache for type detection. Caching the type detection result greatly speeds up the computation of the consistency measure. The size of the cache can be changed to trade off memory use and speed.

compute_consistency_scores(data: str, dialects: List[SimpleDialect]) Dict[SimpleDialect, ConsistencyScore]#

Compute the consistency score for each dialect

This function computes the consistency score for each dialect. This is done by first computing the pattern score for a dialect. If the class is instantiated with skip set to False, the type score is also computed for every dialect. If skip is True (the default), the type score is only computed if the pattern score is greater than or equal to the current best combined score.

Parameters:
  • data (str) – The data of the file as a string

  • dialects (Iterable[SimpleDialect]) – An iterable of dialects to consider.

Returns:

scores – A map with a ConsistencyScore object for each dialect provided as input.

Return type:

Dict[SimpleDialect, ConsistencyScore]
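The skip logic described above can be sketched as follows. This is an illustration under stated assumptions, not the library code: the combined score is assumed to be the product Q = P * T with the type score T at most 1, and pattern_fn and type_fn are hypothetical callables standing in for the real scoring functions.

```python
from typing import Callable, Dict, Hashable, Iterable, Optional, Tuple

# (P, T, Q) triple; T and Q stay None when the type score is skipped.
Score = Tuple[float, Optional[float], Optional[float]]


def consistency_scores_sketch(
    dialects: Iterable[Hashable],
    pattern_fn: Callable[[Hashable], float],
    type_fn: Callable[[Hashable], float],
    skip: bool = True,
) -> Dict[Hashable, Score]:
    scores: Dict[Hashable, Score] = {}
    best_q = float("-inf")
    for dialect in dialects:
        p = pattern_fn(dialect)
        if skip and p < best_q:
            # Assuming T <= 1, Q = P * T can never exceed P, so this
            # dialect cannot beat the current best: skip the type score.
            scores[dialect] = (p, None, None)
            continue
        t = type_fn(dialect)
        q = p * t
        best_q = max(best_q, q)
        scores[dialect] = (p, t, q)
    return scores


# Made-up scores for three hypothetical dialects.
pattern = {"a": 0.9, "b": 0.2, "c": 0.5}
types = {"a": 0.5, "b": 1.0, "c": 1.0}
result = consistency_scores_sketch(pattern, pattern.get, types.get)
print(result)
```

In this made-up run, dialect "b" is skipped because its pattern score is already below the best combined score seen so far.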

compute_type_score(data: str, dialect: SimpleDialect, eps: float = 1e-10) float#

Compute the type score

detect(data: str, delimiters: List[str] | None = None) SimpleDialect | None#

Detect the dialect using the consistency measure

Parameters:
  • data (str) – The data of the file as a string

  • delimiters (iterable) – List of delimiters to consider. If None, the get_delimiters() function is used to automatically detect this (as described in the paper).

Returns:

dialect – The detected dialect. If no dialect could be detected, returns None.

Return type:

Optional[SimpleDialect]

static get_best_dialects(scores: Dict[SimpleDialect, ConsistencyScore]) List[SimpleDialect]#

Identify the dialects with the highest consistency score

class clevercsv.consistency.ConsistencyScore(P: float, T: float | None, Q: float | None)#

Bases: object

Container to track the consistency score calculation

Parameters:
  • P (float) – The pattern score

  • T (Optional[float]) – The type score. Can be None if not computed for speed.

  • Q (Optional[float]) – The consistency score. Can be None if not computed for speed.

P: float#
Q: float | None#
T: float | None#
clevercsv.consistency.detect_dialect_consistency(data: str, delimiters: Iterable[str] | None = None, skip: bool = True, verbose: bool = False) SimpleDialect | None#

Helper function that wraps ConsistencyDetector

clevercsv.cparser_util module#

Python utility functions that wrap the C parser.

clevercsv.cparser_util.field_size_limit(*args: Any, **kwargs: Any) int#

Get/Set the limit to the field size.

This function is adapted from the one in the Python CSV module. See the documentation there.

clevercsv.cparser_util.parse_data(data: Iterable[str], dialect: SimpleDialect | None = None, delimiter: str | None = None, quotechar: str | None = None, escapechar: str | None = None, strict: bool | None = None, return_quoted: bool = False) Iterator[List[str] | List[Tuple[str, bool]]]#

Parse the data given a dialect using the C parser

Parameters:
  • data (iterable) – The data of the CSV file as an iterable

  • dialect (SimpleDialect) – The dialect to use for the parsing. If None, the dialect with each component set to the empty string is used.

  • delimiter (str) – The delimiter to use. If not None, overwrites the delimiter in the dialect.

  • quotechar (str) – The quote character to use. If not None, overwrites the quote character in the dialect.

  • escapechar (str) – The escape character to use. If not None, overwrites the escape character in the dialect.

  • strict (bool) – Enable strict mode or not. If not None, overwrites the strict mode set in the dialect.

  • return_quoted (bool) – For each cell, return a tuple “(field, is_quoted)” where the second element indicates whether the cell was a quoted cell or not.

Yields:

rows (list) – The rows of the file as a list of cells.

Raises:

clevercsv.exceptions.Error – When an error occurs during parsing.

clevercsv.cparser_util.parse_string(data: str, dialect: SimpleDialect, return_quoted: bool = False) Iterator[List[str] | List[Tuple[str, bool]]]#

Utility for when the CSV file is encoded as a single string

clevercsv.detect module#

Drop-in replacement for Python Sniffer object.

Author: Gertjan van den Burg

class clevercsv.detect.DetectionMethod(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)#

Bases: str, Enum

Possible detection methods

Valid options are “auto” (the default for Detector.detect), “normal”, or “consistency”. The “auto” option first attempts to detect the dialect using normal-form detection, and uses the consistency measure if normal-form detection is inconclusive. The “normal” method uses normal-form detection exclusively, and the “consistency” method uses the consistency measure exclusively.

AUTO = 'auto'#
CONSISTENCY = 'consistency'#
NORMAL = 'normal'#
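Because the enum derives from both str and Enum, plain strings can be cast to members. The sketch below re-creates an enum of the same shape with the standard library only; DetectionMethodSketch and coerce_method are hypothetical names for illustration and are not part of CleverCSV.

```python
from enum import Enum
from typing import Union


class DetectionMethodSketch(str, Enum):
    """Stand-in with the same members as the documented enum."""

    AUTO = "auto"
    NORMAL = "normal"
    CONSISTENCY = "consistency"


def coerce_method(
    method: Union[DetectionMethodSketch, str]
) -> DetectionMethodSketch:
    """Accept either a member or its string value; raise ValueError otherwise."""
    return DetectionMethodSketch(method)


print(coerce_method("normal"))
print(coerce_method(DetectionMethodSketch.AUTO) is DetectionMethodSketch.AUTO)
```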
class clevercsv.detect.Detector#

Bases: object

Detect the Dialect of CSV files with normal forms or the data consistency measure. This class provides a drop-in replacement for the Python dialect Sniffer from the standard library.

Note

We call the object Detector just to mark the difference in the implementation and avoid naming issues. You can import it as from clevercsv import Sniffer nonetheless.

detect(sample: str, delimiters: Iterable[str] | None = None, verbose: bool = False, method: DetectionMethod | str = DetectionMethod.AUTO, skip: bool = True) SimpleDialect | None#

Detect the dialect of a CSV file

This method detects the dialect of the CSV file using the specified detection method.

Parameters:
  • sample (str) – A sample of text from the CSV file. For best results and if time allows, use the entire contents of the CSV file as the sample.

  • delimiters (Optional[Iterable[str]]) – Set of delimiters to consider for dialect detection. The potential dialects will be constructed by analyzing the sample and these delimiters. If omitted, the set of potential delimiters will be constructed from the sample.

  • verbose (bool) – Enable verbose mode.

  • method (Union[DetectionMethod, str]) – The method to use for dialect detection. Possible values are DetectionMethod instances or strings that can be cast to such an enum.

  • skip (bool) – Whether to skip potential dialects that have too low a pattern score in the consistency detection. See ConsistencyDetector.compute_consistency_scores() for more details.

Returns:

dialect – The detected dialect. Can be None if dialect detection was inconclusive.

Return type:

Optional[SimpleDialect]

has_header(sample: str, max_rows_to_check: int = 20) bool#

Detect if a file has a header from a sample.

This function is copied from CPython! The only change we’ve made is to use our dialect detection method.

sniff(sample: str, delimiters: Iterable[str] | None = None, verbose: bool = False) SimpleDialect | None#

clevercsv.detect_pattern module#

Code for computing the pattern score.

Author: Gertjan van den Burg

clevercsv.detect_pattern.fill_empties(abstract: str) str#

Fill empty cells in the abstraction

The way the row patterns are constructed assumes that empty cells are marked by the letter C as well. This function fills those in. The function also removes duplicate occurrences of CC and replaces these with C.

Parameters:

abstract (str) – The abstract representation of the file.

Returns:

abstraction – The abstract representation with empties filled.

Return type:

str
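A rough sketch of the filling step is shown below. It assumes, beyond what is stated above, that the abstraction marks delimiters with D and row separators with R in addition to C for cells; treat those letters and the function name as assumptions made for this illustration.

```python
def fill_empties_sketch(abstract: str) -> str:
    """Insert C wherever two separators touch, then collapse repeated C."""
    # An empty cell sits between two adjacent separators.
    for pair, filled in (("DD", "DCD"), ("DR", "DCR"), ("RD", "RCD")):
        while pair in abstract:
            abstract = abstract.replace(pair, filled)
    # Duplicate cell markers carry no extra information.
    while "CC" in abstract:
        abstract = abstract.replace("CC", "C")
    # A leading or trailing delimiter implies an empty cell at the edge.
    if abstract.startswith("D"):
        abstract = "C" + abstract
    if abstract.endswith("D"):
        abstract += "C"
    return abstract


# ",,\nx," abstracts to "DDRCD" under the assumed alphabet.
print(fill_empties_sketch("DDRCD"))
```

Under these assumptions, a first row of three empty cells and a second row of two cells yield the pattern CDCDC for the first row and CDC for the second.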

clevercsv.detect_pattern.make_abstraction(data: str, dialect: SimpleDialect) str#

Create an abstract representation of the CSV file based on the dialect.

This function constructs the basic abstraction used to compute the row patterns.

Parameters:
  • data (str) – The data of the file as a string.

  • dialect (SimpleDialect) – A dialect to parse the file with.

Returns:

abstraction – An abstract representation of the CSV file.

Return type:

str

clevercsv.detect_pattern.merge_with_quotechar(S: str, dialect: SimpleDialect | None = None) str#

Merge quoted blocks in the abstraction

This function takes the abstract representation and merges quoted blocks (QC...CQ) to a single cell (C). The function takes nested quotes into account.

Parameters:
  • S (str) – The data of a file as a string

  • dialect (SimpleDialect) – The dialect used to make the abstraction. This is not used but kept for backwards compatibility. Will be removed in a future version.

Returns:

abstraction – A simplified version of the abstraction with quoted blocks merged.

Return type:

str

clevercsv.detect_pattern.pattern_score(data: str, dialect: SimpleDialect, eps: float = 0.001) float#

Compute the pattern score for given data and a dialect.

Parameters:
  • data (str) – The data of the file as a raw character string

  • dialect (SimpleDialect) – The dialect object

Returns:

score – the pattern score

Return type:

float

clevercsv.detect_pattern.strip_trailing(abstract: str) str#

Strip trailing row separator from abstraction.

clevercsv.detect_type module#

Code for computing the type score.

Author: Gertjan van den Burg

class clevercsv.detect_type.TypeDetector(patterns: Dict[str, Pattern[str]] | None = None, strip_whitespace: bool = True)#

Bases: object

detect_type(cell: str, is_quoted: bool = False) str | None#
is_bytearray(cell: str, is_quoted: bool = False) bool#
is_currency(cell: str, is_quoted: bool = False) bool#
is_date(cell: str, is_quoted: bool = False) bool#
is_datetime(cell: str, is_quoted: bool = False) bool#
is_email(cell: str, is_quoted: bool = False) bool#
is_empty(cell: str, is_quoted: bool = False) bool#
is_ipv4(cell: str, is_quoted: bool = False) bool#
is_json_obj(cell: str, is_quoted: bool = False) bool#
is_known_type(cell: str, is_quoted: bool = False) bool#
is_nan(cell: str, is_quoted: bool = False) bool#
is_number(cell: str, is_quoted: bool = False) bool#
is_percentage(cell: str, is_quoted: bool = False) bool#
is_time(cell: str, is_quoted: bool = False) bool#
is_unicode_alphanum(cell: str, is_quoted: bool = False) bool#
is_unix_path(cell: str, is_quoted: bool = False) bool#
is_url(cell: str, is_quoted: bool = False) bool#
list_known_types() List[str]#
clevercsv.detect_type.gen_known_type(cells)#

Utility that returns a generator indicating, for each of the provided cells, whether it is of a known type.

clevercsv.detect_type.type_score(data: str, dialect: SimpleDialect, eps: float = 1e-10) float#

Compute the type score as the ratio of cells with a known type.

Parameters:
  • data (str) – the data as a single string

  • dialect (SimpleDialect) – the dialect to use

  • eps (float) – the minimum value of the type score

Returns:

type_score – The computed type score

Return type:

float
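The ratio can be sketched as below, assuming the rows have already been parsed into cells. The two type checks (empty cell and plain number) are a deliberately tiny, hypothetical substitute for the many checks in TypeDetector, so the numbers this sketch produces will differ from the library's.

```python
import re
from typing import Iterable, Sequence

# Hypothetical, minimal number pattern; far simpler than the real detector.
NUMBER = re.compile(r"^[+-]?\d+(\.\d+)?$")


def is_known_type_sketch(cell: str) -> bool:
    cell = cell.strip()
    return cell == "" or NUMBER.match(cell) is not None


def type_score_sketch(rows: Iterable[Sequence[str]], eps: float = 1e-10) -> float:
    """Fraction of cells with a known type, never lower than eps."""
    total = 0
    known = 0
    for row in rows:
        for cell in row:
            total += 1
            known += is_known_type_sketch(cell)
    if total == 0:
        return eps
    return max(eps, known / total)


print(type_score_sketch([["1", "2.5"], ["x y z!", ""]]))
```

Keeping the score at or above eps means a dialect with no recognised cells still gets a small positive score rather than exactly zero.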

clevercsv.dialect module#

Definitions for the dialect object.

Author: Gertjan van den Burg

class clevercsv.dialect.SimpleDialect(delimiter: str | None, quotechar: str | None, escapechar: str | None, strict: bool = False)#

Bases: object

The simplified dialect object.

For the delimiter, quotechar, and escapechar the empty string means no delimiter/quotechar/escapechar in the file. None is used to mark it undefined.

Parameters:
  • delimiter (str) – The delimiter of the CSV file.

  • quotechar (str) – The quotechar of the file.

  • escapechar (str) – The escapechar of the file.

  • strict (bool) – Whether strict parsing should be enforced. Same as in the csv module.

classmethod deserialize(obj: str) SimpleDialect#

Deserialize dialect from a JSON object

classmethod from_csv_dialect(d: Dialect) SimpleDialect#
classmethod from_dict(d: Dict[str, Any]) SimpleDialect#
serialize() str#

Serialize dialect to a JSON object

to_csv_dialect() Dialect#
to_dict() Dict[str, str | bool | None]#
validate() None#

clevercsv.dict_read_write module#

DictReader and DictWriter.

This code is entirely copied from the Python csv module. The only exception is that it uses the reader and writer classes from our package.

Author: Gertjan van den Burg

class clevercsv.dict_read_write.DictReader(f: Iterable[str], fieldnames: Sequence[_T] | None = None, restkey: str | None = None, restval: str | None = None, dialect: _DialectLike = 'excel', *args: Any, **kwds: Any)#

Bases: Generic[_T], Iterator[_DictReadMapping[Union[_T, Any], Union[str, Any]]]

property fieldnames: Sequence[_T]#
class clevercsv.dict_read_write.DictWriter(f: SupportsWrite[str], fieldnames: Collection[_T], restval: Any | None = '', extrasaction: Literal['raise', 'ignore'] = 'raise', dialect: _DialectLike = 'excel', *args: Any, **kwds: Any)#

Bases: Generic[_T]

writeheader() Any#
writerow(rowdict: Mapping[_T, Any]) Any#
writerows(rowdicts: Iterable[Mapping[_T, Any]]) None#

clevercsv.encoding module#

Functionality to detect file encodings

Author: G.J.J. van den Burg

License: See the LICENSE file

This file is part of CleverCSV.

clevercsv.encoding.get_encoding(filename: str | bytes | os.PathLike[str] | os.PathLike[bytes] | int, try_cchardet: bool = True) str | None#

Get the encoding of the file

This function uses the chardet package for detecting the encoding of a file.

Parameters:
  • filename (str) – Path to a file

  • try_cchardet (bool) – Whether to run detection using cChardet if it is available. This can be faster, but may give different results than using chardet.

Returns:

encoding – Encoding of the file.

Return type:

Optional[str]

clevercsv.escape module#

Common functions for dealing with escape characters.

Author: Gertjan van den Burg

Date: 2018-11-06

clevercsv.escape.DEFAULT_BLOCK_CHARS: Set[str] = {'!', '"', '#', '%', '&', "'", '*', ',', '.', ':', ';', '?'}#

Set of default characters to never consider as escape character

clevercsv.escape.UNICODE_PO_CHARS: Set[str] = {'!', '"', '#', '%', '&', "'", '*', ',', '.', '/', ':', ';', '?', '@', '\\', '¡', '§', '¶', '·', '¿', ';', '·', '՚', '՛', '՜', '՝', '՞', '՟', '։', '׀', '׃', '׆', '׳', '״', '؉', '؊', '،', '؍', '؛', '؝', '؞', '؟', '٪', '٫', '٬', '٭', '۔', '܀', '܁', '܂', '܃', '܄', '܅', '܆', '܇', '܈', '܉', '܊', '܋', '܌', '܍', '߷', '߸', '߹', '࠰', '࠱', '࠲', '࠳', '࠴', '࠵', '࠶', '࠷', '࠸', '࠹', '࠺', '࠻', '࠼', '࠽', '࠾', '࡞', '।', '॥', '॰', '৽', '੶', '૰', '౷', '಄', '෴', '๏', '๚', '๛', '༄', '༅', '༆', '༇', '༈', '༉', '༊', '་', '༌', '།', '༎', '༏', '༐', '༑', '༒', '༔', '྅', '࿐', '࿑', '࿒', '࿓', '࿔', '࿙', '࿚', '၊', '။', '၌', '၍', '၎', '၏', '჻', '፠', '፡', '።', '፣', '፤', '፥', '፦', '፧', '፨', '᙮', '᛫', '᛬', '᛭', '᜵', '᜶', '។', '៕', '៖', '៘', '៙', '៚', '᠀', '᠁', '᠂', '᠃', '᠄', '᠅', '᠇', '᠈', '᠉', '᠊', '᥄', '᥅', '᨞', '᨟', '᪠', '᪡', '᪢', '᪣', '᪤', '᪥', '᪦', '᪨', '᪩', '᪪', '᪫', '᪬', '᪭', '᭚', '᭛', '᭜', '᭝', '᭞', '᭟', '᭠', '᭽', '᭾', '᯼', '᯽', '᯾', '᯿', '᰻', '᰼', '᰽', '᰾', '᰿', '᱾', '᱿', '᳀', '᳁', '᳂', '᳃', '᳄', '᳅', '᳆', '᳇', '᳓', '‖', '‗', '†', '‡', '•', '‣', '․', '‥', '…', '‧', '‰', '‱', '′', '″', '‴', '‵', '‶', '‷', '‸', '※', '‼', '‽', '‾', '⁁', '⁂', '⁃', '⁇', '⁈', '⁉', '⁊', '⁋', '⁌', '⁍', '⁎', '⁏', '⁐', '⁑', '⁓', '⁕', '⁖', '⁗', '⁘', '⁙', '⁚', '⁛', '⁜', '⁝', '⁞', '⳹', '⳺', '⳻', '⳼', '⳾', '⳿', '⵰', '⸀', '⸁', '⸆', '⸇', '⸈', '⸋', '⸎', '⸏', '⸐', '⸑', '⸒', '⸓', '⸔', '⸕', '⸖', '⸘', '⸙', '⸛', '⸞', '⸟', '⸪', '⸫', '⸬', '⸭', '⸮', '⸰', '⸱', '⸲', '⸳', '⸴', '⸵', '⸶', '⸷', '⸸', '⸹', '⸼', '⸽', '⸾', '⸿', '⹁', '⹃', '⹄', '⹅', '⹆', '⹇', '⹈', '⹉', '⹊', '⹋', '⹌', '⹍', '⹎', '⹏', '⹒', '⹓', '⹔', '、', '。', '〃', '〽', '・', '꓾', '꓿', '꘍', '꘎', '꘏', '꙳', '꙾', '꛲', '꛳', '꛴', '꛵', '꛶', '꛷', '꡴', '꡵', '꡶', '꡷', '꣎', '꣏', '꣸', '꣹', '꣺', '꣼', '꤮', '꤯', '꥟', '꧁', '꧂', '꧃', '꧄', '꧅', '꧆', '꧇', '꧈', '꧉', '꧊', '꧋', '꧌', '꧍', '꧞', '꧟', '꩜', '꩝', '꩞', '꩟', '꫞', '꫟', '꫰', '꫱', '꯫', '︐', '︑', '︒', '︓', '︔', '︕', '︖', '︙', '︰', '﹅', '﹆', '﹉', '﹊', '﹋', '﹌', '﹐', '﹑', '﹒', '﹔', '﹕', '﹖', '﹗', '﹟', '﹠', 
'﹡', '﹨', '﹪', '﹫', '!', '"', '#', '%', '&', ''', '*', ',', '.', '/', ':', ';', '?', '@', '\', '。', '、', '・', '𐄀', '𐄁', '𐄂', '𐎟', '𐏐', '𐕯', '𐡗', '𐤟', '𐤿', '𐩐', '𐩑', '𐩒', '𐩓', '𐩔', '𐩕', '𐩖', '𐩗', '𐩘', '𐩿', '𐫰', '𐫱', '𐫲', '𐫳', '𐫴', '𐫵', '𐫶', '𐬹', '𐬺', '𐬻', '𐬼', '𐬽', '𐬾', '𐬿', '𐮙', '𐮚', '𐮛', '𐮜', '𐽕', '𐽖', '𐽗', '𐽘', '𐽙', '𐾆', '𐾇', '𐾈', '𐾉', '𑁇', '𑁈', '𑁉', '𑁊', '𑁋', '𑁌', '𑁍', '𑂻', '𑂼', '𑂾', '𑂿', '𑃀', '𑃁', '𑅀', '𑅁', '𑅂', '𑅃', '𑅴', '𑅵', '𑇅', '𑇆', '𑇇', '𑇈', '𑇍', '𑇛', '𑇝', '𑇞', '𑇟', '𑈸', '𑈹', '𑈺', '𑈻', '𑈼', '𑈽', '𑊩', '𑑋', '𑑌', '𑑍', '𑑎', '𑑏', '𑑚', '𑑛', '𑑝', '𑓆', '𑗁', '𑗂', '𑗃', '𑗄', '𑗅', '𑗆', '𑗇', '𑗈', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', '𑗑', '𑗒', '𑗓', '𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑙃', '𑙠', '𑙡', '𑙢', '𑙣', '𑙤', '𑙥', '𑙦', '𑙧', '𑙨', '𑙩', '𑙪', '𑙫', '𑙬', '𑚹', '𑜼', '𑜽', '𑜾', '𑠻', '𑥄', '𑥅', '𑥆', '𑧢', '𑨿', '𑩀', '𑩁', '𑩂', '𑩃', '𑩄', '𑩅', '𑩆', '𑪚', '𑪛', '𑪜', '𑪞', '𑪟', '𑪠', '𑪡', '𑪢', '𑬀', '𑬁', '𑬂', '𑬃', '𑬄', '𑬅', '𑬆', '𑬇', '𑬈', '𑬉', '𑱁', '𑱂', '𑱃', '𑱄', '𑱅', '𑱰', '𑱱', '𑻷', '𑻸', '𑽃', '𑽄', '𑽅', '𑽆', '𑽇', '𑽈', '𑽉', '𑽊', '𑽋', '𑽌', '𑽍', '𑽎', '𑽏', '𑿿', '𒑰', '𒑱', '𒑲', '𒑳', '𒑴', '𒿱', '𒿲', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖬹', '𖬺', '𖬻', '𖭄', '𖺗', '𖺘', '𖺙', '𖺚', '𖿢', '𛲟', '𝪇', '𝪈', '𝪉', '𝪊', '𝪋', '𞥞', '𞥟'}#

Set of characters in the Unicode “Po” category

clevercsv.escape.is_potential_escapechar(char: str, encoding: str, block_char: Iterable[str] | None = None) bool#

Check if a character is a potential escape character.

A character is considered a potential escape character if it is in the “Punctuation, Other” Unicode category and not in the list of blocked characters.

Parameters:
  • char (str) – The character to check

  • encoding (str) – The encoding of the character

  • block_char (Optional[Iterable[str]]) – Characters that are in the Punctuation Other category but that should not be considered as escape character. If None, the default set is used, which is defined in DEFAULT_BLOCK_CHARS.

Returns:

is_escape – Whether the character is considered a potential escape or not.

Return type:

bool
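The check can be approximated with the standard unicodedata module, as in the sketch below. It is a simplified illustration rather than the library function: the encoding parameter is omitted, and the function name is hypothetical. The default block set is copied from DEFAULT_BLOCK_CHARS above.

```python
import unicodedata
from typing import Iterable, Optional

# Same characters as DEFAULT_BLOCK_CHARS listed above.
DEFAULT_BLOCK = set("!\"#%&'*,.:;?")


def is_potential_escapechar_sketch(
    char: str, block_char: Optional[Iterable[str]] = None
) -> bool:
    """True if char is in Unicode category "Po" and is not blocked."""
    blocked = DEFAULT_BLOCK if block_char is None else set(block_char)
    return unicodedata.category(char) == "Po" and char not in blocked


print(is_potential_escapechar_sketch("\\"))
print(is_potential_escapechar_sketch(","))
```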

clevercsv.exceptions module#

Exceptions for CleverCSV

Author: Gertjan van den Burg

exception clevercsv.exceptions.Error#

Bases: Error

exception clevercsv.exceptions.NoDetectionResult#

Bases: Exception

clevercsv.normal_form module#

Detect the dialect with very strict functional tests.

This module uses so-called “normal forms” to detect the dialect of CSV files. Normal forms are detected with strict functional tests. The normal forms are used as a pre-test to check if files are simple enough that computing the data consistency measure is not necessary.

Author: Gertjan van den Burg

clevercsv.normal_form.detect_dialect_normal(data: str, encoding: str = 'UTF-8', delimiters: Iterable[str] | None = None, verbose: bool = False) SimpleDialect | None#

Detect the normal form of a file from a given sample

Parameters:
  • data (str) – The data as a single string

  • encoding (str) – The encoding of the data

Returns:

dialect – The dialect detected using normal forms, or None if no such dialect can be found.

Return type:

Optional[SimpleDialect]

clevercsv.normal_form.every_row_has_delim(rows: List[str], dialect: SimpleDialect) bool#
clevercsv.normal_form.every_row_has_delim_and_is_the_same_length(rows: List[str], dialect: SimpleDialect) bool#
clevercsv.normal_form.has_delimiter(string: str, delim: str) bool#
clevercsv.normal_form.has_nested_quotes(string: str, quotechar: str) bool#
clevercsv.normal_form.is_any_empty(cell: str) bool#
clevercsv.normal_form.is_any_partial_quoted_cell(cell: str) bool#
clevercsv.normal_form.is_any_quoted_cell(cell: str) bool#
clevercsv.normal_form.is_elementary(cell: str) bool#
clevercsv.normal_form.is_empty_quoted(cell: str, quotechar: str) bool#
clevercsv.normal_form.is_empty_unquoted(cell: str) bool#
clevercsv.normal_form.is_form_1(rows: List[str], dialect: SimpleDialect) bool#
clevercsv.normal_form.is_form_2(rows: List[str], dialect: SimpleDialect) bool#
clevercsv.normal_form.is_form_3(rows: List[str], dialect: SimpleDialect) bool#
clevercsv.normal_form.is_form_4(rows: List[str], dialect: SimpleDialect) bool#
clevercsv.normal_form.is_form_5(rows: List[str], dialect: SimpleDialect) bool#
clevercsv.normal_form.is_quoted_cell(cell: str, quotechar: str) bool#
clevercsv.normal_form.maybe_has_escapechar(data: str, encoding: str, delim: str, quotechar: str) bool#
clevercsv.normal_form.split_file(data: str) List[str]#
clevercsv.normal_form.split_row(row: str, dialect: SimpleDialect) List[str]#
clevercsv.normal_form.strip_trailing_crnl(data: str) str#

clevercsv.potential_dialects module#

Code for selecting the potential dialects of a file.

Author: Gertjan van den Burg

clevercsv.potential_dialects.filter_urls(data: str) str#

Filter URLs from the data

clevercsv.potential_dialects.get_delimiters(data: str, encoding: str, delimiters: List[str] | None = None, block_cat: List[str] | None = None, block_char: List[str] | None = None) Set[str]#

Get potential delimiters

The set of potential delimiters is constructed as follows. For each unique character of the file, we check if its Unicode character category is in the set block_cat of prohibited categories. If it is, we don’t allow it to be a delimiter, with the exception of Tab (which is in the Control category). We furthermore block characters in block_char from being delimiters.

Parameters:
  • data (str) – The data of the file

  • encoding (str) – The encoding of the file

  • delimiters (iterable) – Allowed delimiters. If provided, it overrides the block_cat/block_char mechanism and only the provided characters will be considered delimiters (if they occur in the file). If None, all characters can be considered delimiters subject to the block_cat and block_char parameters.

  • block_cat (list) –

    List of Unicode categories (2-letter abbreviations) for characters that should not be considered as delimiters. If None, the following default set is used:

    ["Lu", "Ll", "Lt", "Lm", "Lo", "Nd", "Nl", "No", "Ps", "Pe", "Co"]
    

  • block_char (list) –

    Explicit list of characters that should not be considered delimiters. If None, the following default set is used:

    [".", "/", '"', "'", "\n", "\r"]
    

Returns:

delims – Set of potential delimiters. The empty string is added by default.

Return type:

set
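The selection rule can be sketched with the standard library as follows. This is a simplified illustration, not the library function: it ignores the encoding parameter and any preprocessing of the data, and the function name is hypothetical. The default lists are the ones given above.

```python
import unicodedata
from typing import Iterable, Optional, Set

DEFAULT_BLOCK_CAT = ["Lu", "Ll", "Lt", "Lm", "Lo", "Nd", "Nl", "No", "Ps", "Pe", "Co"]
DEFAULT_BLOCK_CHAR = [".", "/", '"', "'", "\n", "\r"]


def get_delimiters_sketch(
    data: str,
    delimiters: Optional[Iterable[str]] = None,
    block_cat: Optional[Iterable[str]] = None,
    block_char: Optional[Iterable[str]] = None,
) -> Set[str]:
    allowed = None if delimiters is None else set(delimiters)
    cats = set(DEFAULT_BLOCK_CAT if block_cat is None else block_cat)
    chars = set(DEFAULT_BLOCK_CHAR if block_char is None else block_char)
    found: Set[str] = set()
    for char in set(data):
        if allowed is not None:
            # An explicit list overrides the block mechanism entirely.
            if char in allowed:
                found.add(char)
            continue
        if char != "\t" and unicodedata.category(char) in cats:
            continue  # blocked category (Tab is always allowed)
        if char in chars:
            continue  # explicitly blocked character
        found.add(char)
    found.add("")  # the empty delimiter is always a candidate
    return found


print(sorted(get_delimiters_sketch("a,b;c\t1\n")))
```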

clevercsv.potential_dialects.get_dialects(data: str, encoding: str = 'UTF-8', delimiters: List[str] | None = None, test_masked_by_quotes: bool = False) List[SimpleDialect]#

Return the possible dialects for the given data.

We consider as escape characters those characters for which is_potential_escapechar() is True and that occur at least once before a quote character or delimiter in the dialect.

One may wonder if self-escaping is an issue here (i.e. “\\”, a backslash written twice). It is not. In a file where a single backslash is desired and escaping with a backslash is used, it only makes sense to do this if the backslash is already used as an escape character (in which case we include it). If it is never used as escape for the delimiter or quotechar, then it is not necessary to self-escape. This is an assumption, but it holds in general and it reduces noise.

Parameters:
  • data (str) – The data for the file

  • encoding (str) – The encoding of the file

  • delimiters (iterable) – Set of delimiters to consider. See get_delimiters() for more info.

  • test_masked_by_quotes (bool) – Remove dialects where the delimiter is always masked by the quote character. Enabling this typically removes a number of potential dialects from the list, which can remove false positives. It is, however, not a very fast operation, so it is disabled by default.

Returns:

dialects – List of SimpleDialect objects that are considered potential dialects.

Return type:

List[SimpleDialect]

clevercsv.potential_dialects.get_quotechars(data: str, quote_chars: Iterable[str] | None = None) Set[str]#

Get potential quote characters

Quote characters are those that occur in the quote_chars set and are found at least once in the file.

Parameters:
  • data (str) – The data of the file as a string

  • quote_chars (iterable) –

    Characters that should be considered quote characters. If it is None, the following default set is used:

    ["'", '"', "~", "`"]
    

Returns:

quotes – Set of potential quote characters. The empty string is added by default.

Return type:

set
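The rule above amounts to a membership test per candidate, as in this minimal sketch (the function name is hypothetical and the default list is the one given above):

```python
from typing import Iterable, Optional, Set


def get_quotechars_sketch(
    data: str, quote_chars: Optional[Iterable[str]] = None
) -> Set[str]:
    candidates = ["'", '"', "~", "`"] if quote_chars is None else quote_chars
    # Keep only the candidates that actually occur in the data.
    quotes = {q for q in candidates if q in data}
    quotes.add("")  # "no quote character" is always a candidate
    return quotes


print(sorted(get_quotechars_sketch('a,"b",~c')))
```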

clevercsv.potential_dialects.masked_by_quotechar(data: str, quotechar: str, escapechar: str, test_char: str) bool#

Test if a character is always masked by quote characters

This function tests if a given character is always within quoted segments (defined by the quote character). Double quoting and escaping is supported.

Parameters:
  • data (str) – The data of the file as a string

  • quotechar (str) – The quote character

  • escapechar (str) – The escape character

  • test_char (str) – The character to test

Returns:

masked – Returns True if the test character is never outside quoted segments, False otherwise.

Return type:

bool
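One way to implement such a test is a single pass over the data that tracks whether the current position is inside a quoted segment. The sketch below is written for this document as an illustration and may differ from the library in edge cases; the function name is hypothetical.

```python
def masked_by_quotechar_sketch(
    data: str, quotechar: str, escapechar: str, test_char: str
) -> bool:
    """True if test_char never appears outside a quoted segment."""
    if test_char == "":
        return False
    in_quotes = False
    escape_next = False
    i = 0
    n = len(data)
    while i < n:
        ch = data[i]
        escaped = escape_next
        escape_next = False
        if quotechar and ch == quotechar:
            if escaped:
                pass  # escaped quote: a literal character, state unchanged
            elif in_quotes and i + 1 < n and data[i + 1] == quotechar:
                i += 1  # doubled quote inside a quoted segment: literal quote
            else:
                in_quotes = not in_quotes
        elif ch == test_char and not in_quotes:
            return False  # found the character outside quotes
        elif escapechar and ch == escapechar and not escaped:
            escape_next = True
        i += 1
    return True


sample = 'a,"b;c",d\n"e;f",g'
print(masked_by_quotechar_sketch(sample, '"', "", ";"))
print(masked_by_quotechar_sketch(sample, '"', "", ","))
```

In the sample, every semicolon sits inside quotes while the commas do not, so only the semicolon counts as masked.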

clevercsv.potential_dialects.unicode_category(x: str, encoding: str) str#

Return the Unicode category of a character

Parameters:
  • x (str) – character

  • encoding (str) – Encoding of the character

Returns:

category – The Unicode category of the character.

Return type:

str

clevercsv.read module#

Drop-in replacement for the Python csv reader class. This is a wrapper for the Parser class, defined in cparser.

Author: Gertjan van den Burg

class clevercsv.read.reader(csvfile: Iterable[str], dialect: str | Dialect | Type[Dialect] | SimpleDialect = 'excel', **fmtparams: Any)#

Bases: object

property dialect: Dialect#

clevercsv.utils module#

Various utilities

Author: Gertjan van den Burg

clevercsv.utils.pairwise(iterable: Iterable[T]) Iterator[Tuple[T, T]]#

s -> (s0, s1), (s1, s2), (s2, s3), …
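A common way to build such an iterator uses itertools.tee. The sketch below follows that well-known recipe and is not necessarily how the library implements it; the function name is hypothetical.

```python
from itertools import tee
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def pairwise_sketch(iterable: Iterable[T]) -> Iterator[Tuple[T, T]]:
    """Yield overlapping pairs of consecutive items."""
    first, second = tee(iterable)
    next(second, None)  # advance the second iterator by one item
    return zip(first, second)


print(list(pairwise_sketch([1, 2, 3, 4])))
```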

clevercsv.utils.sha1sum(filename: str | bytes | os.PathLike[str] | os.PathLike[bytes]) str#

Compute the SHA1 checksum of a given file

Parameters:

filename (str) – Path to a file

Returns:

checksum – The SHA1 checksum of the file contents.

Return type:

str
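A checksum like this can be computed with hashlib by reading the file in blocks, which keeps memory use constant for large files. The sketch below is an illustration; the function name and the blocksize parameter are assumptions and not part of the documented signature.

```python
import hashlib


def sha1sum_sketch(filename: str, blocksize: int = 65536) -> str:
    """Return the hexadecimal SHA1 digest of the file contents."""
    digest = hashlib.sha1()
    with open(filename, "rb") as fh:
        # Read fixed-size blocks until read() returns an empty bytes object.
        for block in iter(lambda: fh.read(blocksize), b""):
            digest.update(block)
    return digest.hexdigest()
```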

clevercsv.wrappers module#

Wrappers for some loading/saving functionality.

Author: Gertjan van den Burg

clevercsv.wrappers.detect_dialect(filename: FileDescriptorOrPath, num_chars: int | None = None, encoding: str | None = None, verbose: bool = False, method: str = 'auto', skip: bool = True) SimpleDialect | None#

Detect the dialect of a CSV file

This is a utility function that simply returns the detected dialect of a given CSV file.

Parameters:
  • filename (str) – The filename of the CSV file.

  • num_chars (int) – Number of characters to read for the detection. If None, the entire file will be read. Note that limiting the number of characters can reduce the accuracy of the detected dialect.

  • encoding (str) – The file encoding of the CSV file. If None, it is detected.

  • verbose (bool) – Enable verbose mode during detection.

  • method (str) – Dialect detection method to use. Either ‘normal’ for normal form detection, ‘consistency’ for the consistency measure, or ‘auto’ for first normal and then consistency.

  • skip (bool) – Skip computation of the type score for dialects with a low pattern score.

Returns:

dialect – The detected dialect as a SimpleDialect, or None if detection failed.

Return type:

Optional[SimpleDialect]

clevercsv.wrappers.read_dataframe(filename: FileDescriptorOrPath, *args: Any, num_chars: int | None = None, **kwargs: Any) pd.DataFrame#

Read a CSV file to a Pandas dataframe

This function uses CleverCSV to detect the dialect, and then passes this to the read_csv function in pandas. Additional arguments and keyword arguments are passed to read_csv as well.

Parameters:
  • filename (str) – The filename of the CSV file. At the moment, only local files are supported.

  • *args – Additional arguments for the pandas.read_csv function.

  • num_chars (int) –

    Number of characters to use for dialect detection. If None, use the entire file.

    Note that using less than the entire file will speed up detection, but can reduce the accuracy of the detected dialect.

  • **kwargs – Additional keyword arguments for the pandas.read_csv function. You can specify the file encoding here if needed, and it will be used during dialect detection.

clevercsv.wrappers.read_dicts(filename: FileDescriptorOrPath, dialect: '_DialectLike' | None = None, encoding: str | None = None, num_chars: int | None = None, verbose: bool = False) List['_DictReadMapping']#

Read a CSV file as a list of dictionaries

This function returns the rows of the CSV file as a list of dictionaries. The keys of the dictionaries are assumed to be in the first row of the CSV file. The dialect will be detected automatically, unless it is provided.

Parameters:
  • filename (str) – Path of the CSV file

  • dialect (str, SimpleDialect, or csv.Dialect object) – If the dialect is known, it can be provided here. This function uses the CleverCSV clevercsv.DictReader object, which supports various dialect types (string, SimpleDialect, or csv.Dialect). If None, the dialect will be detected.

  • encoding (str) – The encoding of the file. If None, it is detected.

  • num_chars (int) –

    Number of characters to use to detect the dialect. If None, use the entire file.

    Note that using less than the entire file will speed up detection, but can reduce the accuracy of the detected dialect.

  • verbose (bool) – Whether or not to show detection progress.

Returns:

rows – Returns rows of the file as a list of dictionaries.

Return type:

list

Raises:

NoDetectionResult – When the dialect detection fails.

clevercsv.wrappers.read_table(filename: FileDescriptorOrPath, dialect: '_DialectLike' | None = None, encoding: str | None = None, num_chars: int | None = None, verbose: bool = False) List[List[str]]#

Read a CSV file as a table (a list of lists)

This is a convenience function that reads a CSV file and returns the data as a list of lists (= rows). The dialect will be detected automatically, unless it is provided.

Parameters:
  • filename (str) – Path of the CSV file

  • dialect (str, SimpleDialect, or csv.Dialect object) – If the dialect is known, it can be provided here. This function uses the CleverCSV clevercsv.reader object, which supports various dialect types (string, SimpleDialect, or csv.Dialect). If None, the dialect will be detected.

  • encoding (str) – The encoding of the file. If None, it is detected.

  • num_chars (int) –

    Number of characters to use to detect the dialect. If None, use the entire file.

    Note that using less than the entire file will speed up detection, but can reduce the accuracy of the detected dialect.

  • verbose (bool) – Whether or not to show detection progress.

Returns:

rows – Returns rows as a list of lists.

Return type:

list

Raises:

NoDetectionResult – When the dialect detection fails.

clevercsv.wrappers.stream_dicts(filename: FileDescriptorOrPath, dialect: _DialectLike | None = None, encoding: str | None = None, num_chars: int | None = None, verbose: bool = False) Iterator['_DictReadMapping']#

Read a CSV file as a generator over dictionaries

This function streams the rows of the CSV file as dictionaries. The keys of the dictionaries are assumed to be in the first row of the CSV file. The dialect will be detected automatically, unless it is provided.

Parameters:
  • filename (str) – Path of the CSV file

  • dialect (str, SimpleDialect, or csv.Dialect object) – If the dialect is known, it can be provided here. This function uses the CleverCSV clevercsv.DictReader object, which supports various dialect types (string, SimpleDialect, or csv.Dialect). If None, the dialect will be detected.

  • encoding (str) – The encoding of the file. If None, it is detected.

  • num_chars (int) –

    Number of characters to use to detect the dialect. If None, use the entire file.

    Note that using less than the entire file will speed up detection, but can reduce the accuracy of the detected dialect.

  • verbose (bool) – Whether or not to show detection progress.

Returns:

rows – Returns file as a generator over rows as dictionaries.

Return type:

generator

Raises:

NoDetectionResult – When the dialect detection fails.

clevercsv.wrappers.stream_table(filename: FileDescriptorOrPath, dialect: '_DialectLike' | None = None, encoding: str | None = None, num_chars: int | None = None, verbose: bool = False) Iterator[List[str]]#

Read a CSV file as a generator over rows of a table

This is a convenience function that reads a CSV file and returns the data as a generator of rows. The dialect will be detected automatically, unless it is provided.

Parameters:
  • filename (str) – Path of the CSV file

  • dialect (str, SimpleDialect, or csv.Dialect object) – If the dialect is known, it can be provided here. This function uses the CleverCSV clevercsv.reader object, which supports various dialect types (string, SimpleDialect, or csv.Dialect). If None, the dialect will be detected.

  • encoding (str) – The encoding of the file. If None, it is detected.

  • num_chars (int) –

    Number of characters to use to detect the dialect. If None, use the entire file.

    Note that using less than the entire file will speed up detection, but can reduce the accuracy of the detected dialect.

  • verbose (bool) – Whether or not to show detection progress.

Returns:

rows – Returns file as a generator over rows.

Return type:

generator

Raises:

NoDetectionResult – When the dialect detection fails.
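
The difference between the list-returning and generator-returning readers can be illustrated with the standard csv module as a stand-in. The helper names below are hypothetical and dialect detection is left out entirely, so this is a sketch of the streaming idea rather than of CleverCSV's implementation.

```python
import csv
import os
import tempfile
from typing import Iterator, List


def stream_rows_sketch(path: str, dialect: str = "excel") -> Iterator[List[str]]:
    """Yield rows lazily; the file is only open while the generator is alive."""
    with open(path, newline="", encoding="utf-8") as fh:
        yield from csv.reader(fh, dialect=dialect)


def read_rows_sketch(path: str, dialect: str = "excel") -> List[List[str]]:
    """Materialise every row in memory as a list of lists."""
    return list(stream_rows_sketch(path, dialect))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    with open(path, "w", newline="", encoding="utf-8") as fh:
        fh.write("a,b\n1,2\n3,4\n")

    rows_iter = stream_rows_sketch(path)
    first_row = next(rows_iter)  # only the first row has been parsed so far
    rows_iter.close()            # closing the generator also closes the file

    all_rows = read_rows_sketch(path)
```

Streaming is the better fit when a file is too large to hold in memory or when only the first few rows are needed.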

clevercsv.wrappers.write_dicts(items: Iterable[Mapping[_T, Any]], filename: FileDescriptorOrPath, dialect: _DialectLike = 'excel', encoding: str | None = None) None#

Write a list of dicts to a file

This is a convenience function to write dicts to a file. The header is extracted from the keys of the first item, so an OrderedDict is recommended to control the order of the headers in the output. If the list of items is empty, no output file is created.

Parameters:
  • items (list) – List of dicts to export

  • filename (str) – The filename of the CSV file to write the table to

  • dialect (str, SimpleDialect, or csv.Dialect) – The dialect to use. The default is the ‘excel’ dialect, which corresponds to RFC4180.

  • encoding (str) – Encoding to use to write the data to the file. Note that the default encoding is platform dependent, which ensures compatibility with the Python open() function. It thus defaults to locale.getpreferredencoding().
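
The behaviour described above (header taken from the first item, no file for an empty input) can be sketched with the standard csv.DictWriter. The function write_dicts_sketch and its boolean return value are assumptions made for illustration and are not CleverCSV's API.

```python
import csv
import os
import tempfile
from typing import Any, Iterable, Mapping


def write_dicts_sketch(items: Iterable[Mapping[str, Any]], path: str) -> bool:
    """Write dicts to path; return False and create no file if items is empty."""
    iterator = iter(items)
    try:
        first = next(iterator)
    except StopIteration:
        return False  # nothing to write, so the file is never opened
    fieldnames = list(first.keys())  # header order follows the first item's keys
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames, dialect="excel")
        writer.writeheader()
        writer.writerow(first)
        writer.writerows(iterator)
    return True


with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "people.csv")
    empty_path = os.path.join(tmp, "empty.csv")
    people = [{"name": "Ada", "age": 36}, {"name": "Bob", "age": 41}]
    wrote = write_dicts_sketch(people, out_path)
    wrote_empty = write_dicts_sketch([], empty_path)
    empty_exists = os.path.exists(empty_path)
    with open(out_path, newline="", encoding="utf-8") as fh:
        written_text = fh.read()
```

On Python 3.7 and later, plain dicts preserve insertion order, so they also give a predictable header order.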

clevercsv.wrappers.write_table(table: Iterable[Iterable[Any]], filename: FileDescriptorOrPath, dialect: _DialectLike = 'excel', transpose: bool = False, encoding: str | None = None) None#

Write a table (a list of lists) to a file

This is a convenience function for writing a table to a CSV file. If the table has no rows, no output file is created.

Parameters:
  • table (list) – A table as a list of lists. The table must have the same number of cells in each row (taking the transpose flag into account).

  • filename (str) – The filename of the CSV file to write the table to.

  • dialect (str, SimpleDialect, or csv.Dialect) – The dialect to use. The default is the ‘excel’ dialect, which corresponds to RFC4180. This is done to encourage more standardized CSV files.

  • transpose (bool) – Transpose the table before writing.

  • encoding (str) – Encoding to use to write the data to the file. Note that the default encoding is platform dependent, which ensures compatibility with the Python open() function. It thus defaults to locale.getpreferredencoding().

Raises:

ValueError – When the length of the rows is not constant.
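
The row-length check and the transpose flag described above can be sketched as follows, using the standard csv.writer. The function write_table_sketch, its return value, and the error message are assumptions for this example and do not come from CleverCSV.

```python
import csv
import os
import tempfile
from typing import Any, Iterable


def write_table_sketch(
    table: Iterable[Iterable[Any]], path: str, transpose: bool = False
) -> bool:
    """Write a rectangular table; return False and create no file if it is empty."""
    rows = [list(row) for row in table]
    if not rows:
        return False
    if len({len(row) for row in rows}) > 1:
        raise ValueError("Table does not have a constant row length.")
    if transpose:
        rows = [list(column) for column in zip(*rows)]  # columns become rows
    with open(path, "w", newline="", encoding="utf-8") as fh:
        csv.writer(fh, dialect="excel").writerows(rows)
    return True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "table.csv")
    write_table_sketch([["a", "b", "c"], [1, 2, 3]], path, transpose=True)
    with open(path, newline="", encoding="utf-8") as fh:
        transposed_text = fh.read()

    try:
        write_table_sketch([[1, 2], [3]], os.path.join(tmp, "ragged.csv"))
        ragged_raised = False
    except ValueError:
        ragged_raised = True
```

Validating before opening the file means a ragged table never leaves a partially written output behind.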

clevercsv.write module#

Drop-in replacement for the Python csv writer class.

Author: Gertjan van den Burg

class clevercsv.write.writer(csvfile: SupportsWrite[str], dialect: _DialectLike = 'excel', **fmtparams: Any)#

Bases: object

writerow(row: Iterable[Any]) Any#
writerows(rows: Iterable[Iterable[Any]]) Any#

Module contents#

class clevercsv.Detector#

Bases: object

Detect the Dialect of CSV files with normal forms or the data consistency measure. This class provides a drop-in replacement for the Python dialect Sniffer from the standard library.

Note

We call the object Detector to mark the difference in implementation and to avoid naming issues. You can nonetheless import it with from clevercsv import Sniffer.
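
A minimal sketch of the drop-in usage follows. It imports the Sniffer alias documented for the clevercsv package and, purely so the snippet still runs where CleverCSV is not installed, falls back to the standard library class. The fallback is an assumption of this example, and the two detectors are not guaranteed to agree on harder inputs.

```python
try:
    from clevercsv import Sniffer  # documented alias of Detector
except ImportError:
    from csv import Sniffer  # stdlib fallback so the snippet runs without CleverCSV

sample = "a;b;c\n1;2;3\n4;5;6\n"

# Both classes expose sniff(sample) and return an object with a delimiter
# attribute; the CleverCSV version may return None if detection fails.
dialect = Sniffer().sniff(sample)
detected_delimiter = dialect.delimiter if dialect is not None else None
```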

detect(sample: str, delimiters: Iterable[str] | None = None, verbose: bool = False, method: DetectionMethod | str = DetectionMethod.AUTO, skip: bool = True) SimpleDialect | None#

Detect the dialect of a CSV file

This method detects the dialect of the CSV file using the specified detection method.

Parameters:
  • sample (str) – A sample of text from the CSV file. For best results and if time allows, use the entire contents of the CSV file as the sample.

  • delimiters (Optional[Iterable[str]]) – Set of delimiters to consider for dialect detection. The potential dialects will be constructed by analyzing the sample and these delimiters. If omitted, the set of potential delimiters will be constructed from the sample.

  • verbose (bool) – Enable verbose mode.

  • method (Union[DetectionMethod, str]) – The method to use for dialect detection. Possible values are DetectionMethod instances or strings that can be cast to such an enum.

  • skip (bool) – Whether to skip potential dialects that have too low a pattern score in the consistency detection. See ConsistencyDetector.compute_consistency_scores() for more details.

Returns:

dialect – The detected dialect. Can be None if dialect detection was inconclusive.

Return type:

Optional[SimpleDialect]

has_header(sample: str, max_rows_to_check: int = 20) bool#

Detect if a file has a header from a sample.

This function is copied from CPython! The only change we’ve made is to use our dialect detection method.

sniff(sample: str, delimiters: Iterable[str] | None = None, verbose: bool = False) SimpleDialect | None#

class clevercsv.DictReader(f: Iterable[str], fieldnames: Sequence[_T] | None = None, restkey: str | None = None, restval: str | None = None, dialect: _DialectLike = 'excel', *args: Any, **kwds: Any)#

Bases: Generic[_T], Iterator[_DictReadMapping[Union[_T, Any], Union[str, Any]]]

property fieldnames: Sequence[_T]#

class clevercsv.DictWriter(f: SupportsWrite[str], fieldnames: Collection[_T], restval: Any | None = '', extrasaction: Literal['raise', 'ignore'] = 'raise', dialect: _DialectLike = 'excel', *args: Any, **kwds: Any)#

Bases: Generic[_T]

writeheader() Any#
writerow(rowdict: Mapping[_T, Any]) Any#
writerows(rowdicts: Iterable[Mapping[_T, Any]]) None#

exception clevercsv.Error#

Bases: Error

clevercsv.Sniffer#

alias of Detector

clevercsv.detect_dialect(filename: FileDescriptorOrPath, num_chars: int | None = None, encoding: str | None = None, verbose: bool = False, method: str = 'auto', skip: bool = True) SimpleDialect | None#

Detect the dialect of a CSV file

This is a utility function that simply returns the detected dialect of a given CSV file.

Parameters:
  • filename (str) – The filename of the CSV file.

  • num_chars (int) – Number of characters to read for the detection. If None, the entire file will be read. Note that limiting the number of characters can reduce the accuracy of the detected dialect.

  • encoding (str) – The file encoding of the CSV file. If None, it is detected.

  • verbose (bool) – Enable verbose mode during detection.

  • method (str) – Dialect detection method to use. Either ‘normal’ for normal form detection, ‘consistency’ for the consistency measure, or ‘auto’ to try normal form detection first and then the consistency measure.

  • skip (bool) – Skip computation of the type score for dialects with a low pattern score.

Returns:

dialect – The detected dialect as a SimpleDialect, or None if detection failed.

Return type:

Optional[SimpleDialect]

class clevercsv.excel#

Bases: Dialect

Describe the usual properties of Excel-generated CSV files.

delimiter = ','#
doublequote = True#
lineterminator = '\r\n'#
quotechar = '"'#
quoting = 0#
skipinitialspace = False#

class clevercsv.excel_tab#

Bases: excel

Describe the usual properties of Excel-generated TAB-delimited files.

delimiter = '\t'#

clevercsv.field_size_limit(*args: Any, **kwargs: Any) int#

Get/Set the limit to the field size.

This function is adapted from the one in the Python CSV module. See the documentation there.
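
For reference, the standard library function that this one is adapted from works as sketched below. The example uses csv.field_size_limit directly, on the assumption that the CleverCSV version follows the same get/set convention.

```python
import csv

# Called without arguments, the function only reports the current limit.
previous_limit = csv.field_size_limit()

# Called with a new limit, it sets that limit and returns the old one.
returned = csv.field_size_limit(1_000_000)
current_limit = csv.field_size_limit()

# Restore the original value so other code is unaffected.
csv.field_size_limit(previous_limit)
restored_limit = csv.field_size_limit()
```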

clevercsv.read_dataframe(filename: FileDescriptorOrPath, *args: Any, num_chars: int | None = None, **kwargs: Any) pd.DataFrame#

Read a CSV file to a Pandas dataframe

This function uses CleverCSV to detect the dialect, and then passes this to the read_csv function in pandas. Additional arguments and keyword arguments are passed to read_csv as well.

Parameters:
  • filename (str) – The filename of the CSV file. At the moment, only local files are supported.

  • *args – Additional arguments for the pandas.read_csv function.

  • num_chars (int) –

    Number of characters to use for dialect detection. If None, use the entire file.

    Note that using less than the entire file will speed up detection, but can reduce the accuracy of the detected dialect.

  • **kwargs – Additional keyword arguments for the pandas.read_csv function. You can specify the file encoding here if needed, and it will be used during dialect detection.

clevercsv.read_dicts(filename: FileDescriptorOrPath, dialect: '_DialectLike' | None = None, encoding: str | None = None, num_chars: int | None = None, verbose: bool = False) List['_DictReadMapping']#

Read a CSV file as a list of dictionaries

This function returns the rows of the CSV file as a list of dictionaries. The keys of the dictionaries are assumed to be in the first row of the CSV file. The dialect will be detected automatically, unless it is provided.

Parameters:
  • filename (str) – Path of the CSV file

  • dialect (str, SimpleDialect, or csv.Dialect object) – If the dialect is known, it can be provided here. This function uses the CleverCSV clevercsv.DictReader object, which supports various dialect types (string, SimpleDialect, or csv.Dialect). If None, the dialect will be detected.

  • encoding (str) – The encoding of the file. If None, it is detected.

  • num_chars (int) –

    Number of characters to use to detect the dialect. If None, use the entire file.

    Note that using less than the entire file will speed up detection, but can reduce the accuracy of the detected dialect.

  • verbose (bool) – Whether or not to show detection progress.

Returns:

rows – Returns rows of the file as a list of dictionaries.

Return type:

list

Raises:

NoDetectionResult – When the dialect detection fails.

clevercsv.read_table(filename: FileDescriptorOrPath, dialect: '_DialectLike' | None = None, encoding: str | None = None, num_chars: int | None = None, verbose: bool = False) List[List[str]]#

Read a CSV file as a table (a list of lists)

This is a convenience function that reads a CSV file and returns the data as a list of lists (= rows). The dialect will be detected automatically, unless it is provided.

Parameters:
  • filename (str) – Path of the CSV file

  • dialect (str, SimpleDialect, or csv.Dialect object) – If the dialect is known, it can be provided here. This function uses the CleverCSV clevercsv.reader object, which supports various dialect types (string, SimpleDialect, or csv.Dialect). If None, the dialect will be detected.

  • encoding (str) – The encoding of the file. If None, it is detected.

  • num_chars (int) –

    Number of characters to use to detect the dialect. If None, use the entire file.

    Note that using less than the entire file will speed up detection, but can reduce the accuracy of the detected dialect.

  • verbose (bool) – Whether or not to show detection progress.

Returns:

rows – Returns rows as a list of lists.

Return type:

list

Raises:

NoDetectionResult – When the dialect detection fails.

class clevercsv.reader(csvfile: Iterable[str], dialect: str | Dialect | Type[Dialect] | SimpleDialect = 'excel', **fmtparams: Any)#

Bases: object

property dialect: Dialect#

clevercsv.stream_dicts(filename: FileDescriptorOrPath, dialect: _DialectLike | None = None, encoding: str | None = None, num_chars: int | None = None, verbose: bool = False) Iterator['_DictReadMapping']#

Read a CSV file as a generator over dictionaries

This function streams the rows of the CSV file as dictionaries. The keys of the dictionaries are assumed to be in the first row of the CSV file. The dialect will be detected automatically, unless it is provided.

Parameters:
  • filename (str) – Path of the CSV file

  • dialect (str, SimpleDialect, or csv.Dialect object) – If the dialect is known, it can be provided here. This function uses the CleverCSV clevercsv.DictReader object, which supports various dialect types (string, SimpleDialect, or csv.Dialect). If None, the dialect will be detected.

  • encoding (str) – The encoding of the file. If None, it is detected.

  • num_chars (int) –

    Number of characters to use to detect the dialect. If None, use the entire file.

    Note that using less than the entire file will speed up detection, but can reduce the accuracy of the detected dialect.

  • verbose (bool) – Whether or not to show detection progress.

Returns:

rows – Returns file as a generator over rows as dictionaries.

Return type:

generator

Raises:

NoDetectionResult – When the dialect detection fails.

clevercsv.stream_table(filename: FileDescriptorOrPath, dialect: '_DialectLike' | None = None, encoding: str | None = None, num_chars: int | None = None, verbose: bool = False) Iterator[List[str]]#

Read a CSV file as a generator over rows of a table

This is a convenience function that reads a CSV file and returns the data as a generator of rows. The dialect will be detected automatically, unless it is provided.

Parameters:
  • filename (str) – Path of the CSV file

  • dialect (str, SimpleDialect, or csv.Dialect object) – If the dialect is known, it can be provided here. This function uses the CleverCSV clevercsv.reader object, which supports various dialect types (string, SimpleDialect, or csv.Dialect). If None, the dialect will be detected.

  • encoding (str) – The encoding of the file. If None, it is detected.

  • num_chars (int) –

    Number of characters to use to detect the dialect. If None, use the entire file.

    Note that using less than the entire file will speed up detection, but can reduce the accuracy of the detected dialect.

  • verbose (bool) – Whether or not to show detection progress.

Returns:

rows – Returns file as a generator over rows.

Return type:

generator

Raises:

NoDetectionResult – When the dialect detection fails.

class clevercsv.unix_dialect#

Bases: Dialect

Describe the usual properties of Unix-generated CSV files.

delimiter = ','#
doublequote = True#
lineterminator = '\n'#
quotechar = '"'#
quoting = 1#
skipinitialspace = False#
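
The practical effect of the quoting and lineterminator attributes can be seen by writing the same row with both dialects. The sketch uses the standard library classes csv.excel and csv.unix_dialect, on the assumption that the classes listed here share their attribute values; in the csv module, quoting value 0 is QUOTE_MINIMAL and 1 is QUOTE_ALL.

```python
import csv
import io
from typing import Any, List


def render(row: List[Any], dialect: Any) -> str:
    """Write a single row with the given dialect and return the raw text."""
    buffer = io.StringIO()
    csv.writer(buffer, dialect=dialect).writerow(row)
    return buffer.getvalue()


row = ["id", "two words", 3]
excel_line = render(row, csv.excel)        # minimal quoting, '\r\n' terminator
unix_line = render(row, csv.unix_dialect)  # every field quoted, '\n' terminator
```

With minimal quoting, a field is only quoted when it contains the delimiter, the quote character, or a line break, so none of the fields in this row are quoted under the excel dialect.
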
clevercsv.write_table(table: Iterable[Iterable[Any]], filename: FileDescriptorOrPath, dialect: _DialectLike = 'excel', transpose: bool = False, encoding: str | None = None) None#

Write a table (a list of lists) to a file

This is a convenience function for writing a table to a CSV file. If the table has no rows, no output file is created.

Parameters:
  • table (list) – A table as a list of lists. The table must have the same number of cells in each row (taking the transpose flag into account).

  • filename (str) – The filename of the CSV file to write the table to.

  • dialect (str, SimpleDialect, or csv.Dialect) – The dialect to use. The default is the ‘excel’ dialect, which corresponds to RFC4180. This is done to encourage more standardized CSV files.

  • transpose (bool) – Transpose the table before writing.

  • encoding (str) – Encoding to use to write the data to the file. Note that the default encoding is platform dependent, which ensures compatibility with the Python open() function. It thus defaults to locale.getpreferredencoding().

Raises:

ValueError – When the length of the rows is not constant.

class clevercsv.writer(csvfile: SupportsWrite[str], dialect: _DialectLike = 'excel', **fmtparams: Any)#

Bases: object

writerow(row: Iterable[Any]) Any#
writerows(rows: Iterable[Iterable[Any]]) Any#