Code Maven

Search for '{{search_term}}'

CSV - Comma Separated Files

CMOS is the Code-Maven Open Source podcast that also includes video interviews. Subscribe to this feed RSS feed with your Podcast listener app or via iTunes iTunes.

CSV stands for Comma Separated Values, though in reality, the character that separates the values can be anything. , is probably the most popular one, but there are plenty of files where the separator is semi-colon ;. In other places it is |, and yet in other places it is a \t TAB. The latter is usually also called TSV - Tab Separated Values.

We'll just use the name CSV and mean any of these cases.

A CSV file is great for representing table-like data. Something you could also store in Excel, Calc of Libre Office, or any other Spreadsheet application. Without the formatting and without the calculating of expressions. Just the raw data.

A CSV file can be seen as a table from a relational database, one that you'd normally interrogate using SQL. If you need to represent multiple SQL tables, that will require one CSV file for each table.

A rather simple example looks like this:



Here each field is separated by a semi-colon ; and each physical row represents one row in the table.

(Data: Snowwhite and the 7 Dwarfs in Hungarian.)

A more complex example is this one:


Tel Aviv,Beirut,500,km
New York,"Moscow, East",6000,km

(Data: totally made up.)

Here the fields are separated using , and we can see two special cases:

Handle embedded speparator characters

The line starting with "New York" contains a field that has a comma , in the middle. In order to make it clear that the comma between Moscow and East is part of the element the creator of the CSV file can do two things. Either wrap the element in quotes as happend in this example:

New York,"Moscow, East",6000,km

The character can also be "escaped" using a back-slash \. (Escaping here means that the character looses its special meaning separating fields.

New York,Moscow\, East,6000,km

The software creating the CSV file needs to be able to add these extra marks and the software reading the CSV file must be able to understand them and interprete them correctly.

Embedded newline (Multi-line fields)

The second special case can be seen in the lines which comprise a single logical row. We already know that the quotation character " marks the beginning of a field which ends with the next quotation character ", but in this case that second charcter is in a subsequent physical row. Effectively here we have a field that has a newline embedded in it.


If you are familiar with any of the Spreadsheet software you probably have already seen files with a few multi-line fields. This is how such fields are represented.

Here too, it is very important for the software to be able to parse the file correctly. A simple call to split won't work properly.

Header or not?

In the earlier examples each row contained data, though rows were not necessary aligned along newlines.

In the next example the first row of the file is the header. In that row the values are the names of the columns similar to what you would have in a relational database, except that in this case there is no restriction as to what the column name can contain. So we have one column with a space in the name and another column where the name (the content of the first row) has both space and parenthese.


Planet name,Distance (AU),Mass

When reading such a file one must treat the first row as special. At a minimum it should not be used it as data, but maybe it can even be used as way to refernce the values in the various fields.

(Data source Solar System. No really.)

Implementations of CSV parsers

See also an exercise to add numbers taken from a CSV file.


In the comments, please wrap your code snippets within <pre> </pre> tags and use spaces for indentation.
comments powered by Disqus