Reading large CSV files in Python - A perpetual problem

Data is growing at an exponential pace as businesses and institutions transform themselves into data-driven entities, so dealing with large amounts of data has become a necessity. CSV (Comma-Separated Values) is one of the most common formats for storing such data. However, importing a very large CSV file directly into a Python script can cause an 'Out of memory' error or even a system crash when RAM runs out.


The internet has plenty of tips and strategies for reading large CSV files, such as specifying the chunksize parameter in the pd.read_csv() command, or using Dask dataframes or the datatable library. After extensive testing and hours spent developing code to read massive quantities of data, I personally believe that each of these solutions hits some form of barrier at some point.


For example, reading with a chunksize and breaking the data into chunks necessitates an extra step of concatenating the chunks back into one dataset, which takes almost as long as simply reading the data in one go. With Dask dataframes, the first obstacle is specifying the dtype for every column (even when there are 200+ columns); the second is that working with Dask dataframes is not as straightforward as working with pandas dataframes.
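For reference, the chunked-read pattern described above looks roughly like this (a minimal sketch; the file name and chunk size are illustrative, and a tiny demo CSV stands in for a genuinely large file):

```python
import pandas as pd

# A small demo CSV standing in for the real large file.
pd.DataFrame({"a": range(10), "b": range(10)}).to_csv("demo.csv", index=False)

# Read the CSV in chunks, then concatenate the chunks back into
# one DataFrame -- the extra step discussed above.
chunks = pd.read_csv("demo.csv", chunksize=3)
df = pd.concat(chunks, ignore_index=True)
```

With a real multi-gigabyte file, that final pd.concat is where much of the time goes.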


Following extensive research, one feasible and efficient method for reading large dataframes is to not read them all at once. This leads us to the concept of 'structurization': most datasets can be divided into subsets based on the year, quarter, month, day, or any other criterion. Saving the data as subsets keyed on the datetime column makes it very simple later to read and concatenate only the data you need.


Amongst the various ways to create subsets of a dataset, one very efficient way is described below:


1. From the timestamp column in the dataframe (converted to datetime dtype with pd.to_datetime, if needed), create a new column holding just the year (or month, or date):

  • df['year'] = df['timestamp'].dt.year

  • df['month'] = df['timestamp'].dt.month

  • df['date'] = df['timestamp'].dt.date



2. Use the groupby function on one of the columns you created above:

  • grouped_df = df.groupby('year')



3. Using a for-loop, you can print all of the dataframe groups created and their shapes:

  • for name, group in grouped_df:
        print(str(name))
        print(group.shape)



4. To save the subsetted groups as CSVs in a folder, use the same for-loop as above, specify the folder path, and create the directory once before the loop (Path comes from pathlib):

  • from pathlib import Path

  • output_folder_path = "C:\\Users\\ABC\\year_wise_files\\"

  • output_dir = Path(output_folder_path)

  • output_dir.mkdir(exist_ok=True)

  • for name, group in grouped_df:
        output_file = str(name) + '.csv'
        group.to_csv(output_dir / output_file)
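Putting the steps above together, here is a minimal self-contained sketch (the column names, sample dates, and folder name are illustrative): it builds a small dataframe, splits it by year, writes one CSV per year, and then reads back and concatenates only the years required.

```python
from pathlib import Path

import pandas as pd

# A small demo dataframe with a timestamp column.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2020-01-15", "2020-06-01", "2021-03-10", "2022-11-30"]
    ),
    "value": [1, 2, 3, 4],
})

# Step 1: derive the year from the timestamp column.
df["year"] = df["timestamp"].dt.year

# Steps 2-4: group by year and write one CSV per group.
output_dir = Path("year_wise_files")
output_dir.mkdir(exist_ok=True)
for name, group in df.groupby("year"):
    group.to_csv(output_dir / (str(name) + ".csv"), index=False)

# Later: read back and concatenate only the years you need.
needed = [2020, 2021]
subset = pd.concat(
    (pd.read_csv(output_dir / (str(y) + ".csv")) for y in needed),
    ignore_index=True,
)
```

The payoff is in the last step: instead of loading the full file, you touch only the per-year CSVs you actually need.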


