This is related, but probably an easier case to optimize for. With smaller partitions, we can merge them logically before doing a physical merge, so to the developer they appear as normal, generally balanced partitions. This works best when the partitions are small, which is typically what happens with a large number of small files.
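To illustrate the idea (just a sketch, not the actual internal code): the cheap part is grouping small partitions into logical groups of roughly equal row counts; the expensive `pd.concat` can be deferred until a group is actually needed.

```python
import pandas as pd

def group_small_partitions(parts, target_rows):
    """Group adjacent small partitions until each group holds ~target_rows.

    The grouping itself is cheap bookkeeping; the expensive pd.concat
    (the "physical" merge) only happens once per logical group.
    """
    groups, current, current_rows = [], [], 0
    for part in parts:
        current.append(part)
        current_rows += len(part)
        if current_rows >= target_rows:
            groups.append(current)
            current, current_rows = [], 0
    if current:
        groups.append(current)
    # Physical merge: one concat per logical group.
    return [pd.concat(group, ignore_index=True) for group in groups]

# Example: 10 tiny partitions rebalanced into mostly even groups.
tiny = [pd.DataFrame({"x": range(i, i + 4)}) for i in range(0, 40, 4)]
balanced = group_small_partitions(tiny, target_rows=12)
print([len(df) for df in balanced])  # [12, 12, 12, 4]
```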
If you’re passing a glob to `read_csv_glob`, there’s no need to rebalance: the glob reader can compute good partitioning up front because it knows about all of the files in advance.
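For reference, usage looks like this (assuming a Modin version where the experimental API is available under this module path; the file path is made up):

```python
# read_csv_glob lives in Modin's experimental API, not plain pandas.
import modin.experimental.pandas as pd

# All files matching the pattern are inspected before reading, so the
# partitioning is computed once with full knowledge of the inputs.
df = pd.read_csv_glob("data/part-*.csv")
```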
The major limitation of `read_csv` on larger files is that the schema has to be derived from the data, and only after all of the data is loaded. This is expensive. The same is true for `read_json(lines=True)` and `read_table`.
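One way to sidestep that cost, if you happen to know the schema ahead of time, is to pass it explicitly so the reader can skip the full-scan type inference (column names here are hypothetical):

```python
import pandas as pd  # the same parameters work with Modin's drop-in replacement

# An explicit dtype mapping means types don't have to be inferred
# from the data after it is loaded.
dtypes = {"user_id": "int64", "amount": "float64", "country": "string"}
df = pd.read_csv("big.csv", dtype=dtypes, parse_dates=["created_at"])
```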
`read_parquet`’s limitation is that it is hard to tell how many rows are in a given file, and row-based partitioning is important at the scale you’re talking about. Nested Parquet files and large numbers of partitions make this complex, and I don’t think our reader is currently able to generate optimal partitioning for really complex nested Parquet file structures.
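For flat files, row counts can at least be pulled cheaply from the footer metadata without touching the column data; this is the kind of information a row-based partitioner needs (the filename below is hypothetical):

```python
import pyarrow.parquet as pq

# Row counts live in the Parquet footer, so only metadata is read here,
# not the column data itself.
meta = pq.ParquetFile("part-00000.parquet").metadata
print(meta.num_rows)                       # total rows in the file
print([meta.row_group(i).num_rows          # rows per row group
       for i in range(meta.num_row_groups)])
```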
`read_sql` in pandas doesn’t have a `partition_column` parameter or anything equivalent, so for some databases we end up with syntax errors in the SQL statements we generate for reading partitioned data. This is another area we’re looking at: generating database-dependent SQL clauses for splitting up the query.
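To make the problem concrete, here is a rough sketch of range-based splitting on a numeric column (not our implementation; the table, column, and connection string are all hypothetical). The `WHERE` approach below is fairly portable, but the truly efficient splitting clauses (`LIMIT`/`OFFSET`, `TABLESAMPLE`, ctid ranges, ...) are exactly the database-dependent part:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@host/db")  # hypothetical

def read_sql_partitioned(table, column, lower, upper, num_partitions):
    """Split one query into range queries over a numeric column."""
    step = (upper - lower) // num_partitions + 1
    frames = []
    for lo in range(lower, upper + 1, step):
        hi = lo + step
        query = f"SELECT * FROM {table} WHERE {column} >= {lo} AND {column} < {hi}"
        # Each range query could be dispatched to a different worker.
        frames.append(pd.read_sql(query, engine))
    return pd.concat(frames, ignore_index=True)

df = read_sql_partitioned("orders", "id", lower=0, upper=1_000_000, num_partitions=8)
```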
Does this make sense? Happy to answer any other questions!