In this tutorial, we'll explore how to use Pandas to merge all CSV files in a specific folder. This is particularly useful when dealing with large datasets split across several files.
First, we'll import the required libraries.
import pandas as pd
import os
Next, we'll build the list of all CSV files in the folder. To do this, we'll use the `os` and `glob` modules.
import glob
folder_name = '/your_folder_path'
file_type = 'csv'
separator = ','

# Read every matching CSV into a data frame, then concatenate them
dataframe = pd.concat(
    [pd.read_csv(f, sep=separator)
     for f in glob.glob(os.path.join(folder_name, '*.' + file_type))],
    ignore_index=True
)
The `glob.glob` call returns a list of matching file paths. The list comprehension passes each path to `pd.read_csv`, producing a list of data frames.
The last step is to concatenate them into a single data frame, which is done by `pd.concat`. Setting `ignore_index=True` resets the index in the final merged data frame, so it runs 0, 1, 2, … rather than repeating each file's original index.
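If the one-liner above feels dense, here is an equivalent step-by-step version that is easier to debug (the variable names `csv_files` and `frames` are just illustrative):

# Collect the matching file paths first so we can inspect them
csv_files = glob.glob(os.path.join(folder_name, '*.' + file_type))
print(csv_files)  # sanity check: make sure the expected files are listed

# Read each file into its own data frame
frames = [pd.read_csv(f, sep=separator) for f in csv_files]

# Concatenate them into one data frame with a fresh 0..n-1 index
dataframe = pd.concat(frames, ignore_index=True)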
Next, let's write the merged data frame to a new CSV file.
dataframe.to_csv("/output_folder_path/merged.csv", index=False)
This will save the merged data frame to a new CSV file named `merged.csv`. Passing `index=False` keeps the row index from being written as an extra column.
Handling Large Datasets
When the individual files are too large to read comfortably in one go, we can read each file in chunks, which bounds the peak memory of any single `pd.read_csv` call. We still pass `ignore_index=True` when concatenating:
chunksize = 10 ** 6  # rows per chunk

chunks = []
for f in glob.glob(os.path.join(folder_name, '*.' + file_type)):
    # read_csv with chunksize returns an iterator of data frames
    reader = pd.read_csv(f, sep=separator, chunksize=chunksize)
    chunks.append(pd.concat(reader, ignore_index=True))

df = pd.concat(chunks, ignore_index=True)
This way, each CSV file is read in chunks of one million rows, and only then are the per-file results combined. Note that the final `pd.concat` still materializes the full merged data frame, so the result itself must fit in memory.
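If even the merged result won't fit in memory, one option is to skip the in-memory concatenation entirely and append each chunk straight to the output file. This is a minimal sketch, assuming all files share the same columns; the output path is a placeholder:

output_path = '/output_folder_path/merged.csv'  # placeholder path

first_chunk = True
for f in glob.glob(os.path.join(folder_name, '*.' + file_type)):
    for chunk in pd.read_csv(f, sep=separator, chunksize=chunksize):
        # Write the header only once, then append subsequent chunks
        chunk.to_csv(output_path, mode='w' if first_chunk else 'a',
                     header=first_chunk, index=False)
        first_chunk = False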
Finally, remember to always check the merged data for consistency and correctness, as problems compound quickly when data comes from multiple sources.
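As one simple sanity check, you can confirm that the merged row count matches the sum of the individual file counts. A minimal sketch, assuming the in-memory `dataframe` from above:

# Re-read each file and total up the rows, then compare with the merge
total_rows = sum(len(pd.read_csv(f, sep=separator))
                 for f in glob.glob(os.path.join(folder_name, '*.' + file_type)))
assert len(dataframe) == total_rows, "row counts do not match after merging"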