Experimental API initial discussion

#1

This thread is for collecting thoughts on the Experimental API.

import modin.experimental.pandas as pd

Currently (as of Modin 0.3.1), read_sql in Modin accepts extra parameters, beyond those in pandas, that let users describe how the read should be partitioned. This was added in modin-project/modin/#436.

Self-contained code example:

import os
import time
import logging
import sqlite3
from contextlib import contextmanager
import pandas
import modin.experimental.pandas as pd

TBL_NAME = "tbl"
FILE_NAME = "benchmark.db"
CONN = "sqlite:///" + FILE_NAME
NUM_ROWS = 3000000
NUM_COLS = 10


def create_db():
    if os.path.exists(FILE_NAME):
        os.remove(FILE_NAME)
    table = [[row] * NUM_COLS for row in range(NUM_ROWS)]
    headers = ["col"+str(col) for col in range(NUM_COLS)]
    df = pandas.DataFrame(table, columns=headers)
    df.to_sql(TBL_NAME, CONN)
    sqlite3.connect(FILE_NAME).cursor().execute("CREATE INDEX tbl_col0 ON tbl (col0);")


@contextmanager
def time_logger(name):
    """This logs the time usage of a code block"""
    start_time = time.time()
    yield
    end_time = time.time()
    total_time = end_time - start_time
    logging.info("%s; time: %ss", name, total_time)


def pandas_test():
    pandas.read_sql(sql=TBL_NAME, con=CONN)


def modin_test():
    pd.read_sql(
        sql=TBL_NAME,
        con=CONN,
        partition_column="col0",
        lower_bound=0,
        upper_bound=NUM_ROWS-1
    )


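# INFO messages are hidden by default, so enable them before timing anything.
logging.basicConfig(level=logging.INFO)
create_db()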
with time_logger("Pandas - Read sql table with {} rows and {} columns".format(NUM_ROWS, NUM_COLS)):
    pandas_test()
with time_logger("Modin - Read sql table with {} rows and {} columns".format(NUM_ROWS, NUM_COLS)):
    modin_test()

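Note that the extra parameters (partition_column, lower_bound, and upper_bound) are what improve performance in this case: they tell Modin which column and value range to split the read on, so the table can be read in parallel.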
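To make the idea concrete, here is a minimal sketch of how a bounded column range could be turned into one query per partition. This is not Modin's actual implementation; split_bounds, partition_queries, and num_partitions are hypothetical names used only for illustration.

```python
def split_bounds(lower_bound, upper_bound, num_partitions):
    """Split the inclusive range [lower_bound, upper_bound] into contiguous chunks."""
    total = upper_bound - lower_bound + 1
    # Never create more partitions than there are values in the range.
    num_partitions = max(1, min(num_partitions, total))
    base, extra = divmod(total, num_partitions)
    ranges = []
    start = lower_bound
    for i in range(num_partitions):
        # The first `extra` chunks take one additional value each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def partition_queries(table, column, lower_bound, upper_bound, num_partitions):
    """Build one range query per partition (hypothetical helper, not a Modin API)."""
    return [
        "SELECT * FROM {} WHERE {} BETWEEN {} AND {}".format(table, column, lo, hi)
        for lo, hi in split_bounds(lower_bound, upper_bound, num_partitions)
    ]
```

Each query touches a disjoint slice of the table, so the slices can be fetched independently and concatenated afterwards. The index on col0 in the example above is what keeps each range query cheap.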

I am hoping to open up the discussion here for other potential “hints” or extra APIs on top of pandas parameters that we could use to optimize the system. All thoughts are welcome!

#2

In an issue related to the parallel read_sql parameters (modin-project/modin/#455), an additional parameter for max parallelism was suggested, which is a good idea. This would be the maximum number of connections to the database when operating in parallel, a number that most databases limit.
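As a rough sketch of what such a cap could look like, the snippet below reads range partitions from SQLite through a thread pool whose size bounds the number of simultaneous connections. The parameter name max_connections and the helpers read_ranges and demo are assumptions made for illustration; they are not part of Modin or of the linked issue.

```python
import os
import sqlite3
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor


def read_ranges(db_path, table, column, ranges, max_connections):
    """Run one range query per (lo, hi) pair with at most max_connections in flight."""
    lock = threading.Lock()
    state = {"open": 0, "peak": 0}  # bookkeeping to observe the cap

    def read_one(bounds):
        lo, hi = bounds
        with lock:
            state["open"] += 1
            state["peak"] = max(state["peak"], state["open"])
        conn = sqlite3.connect(db_path)  # each task opens its own connection
        try:
            query = "SELECT * FROM {} WHERE {} BETWEEN ? AND ?".format(table, column)
            return conn.execute(query, (lo, hi)).fetchall()
        finally:
            conn.close()
            with lock:
                state["open"] -= 1

    # The pool size is what limits the number of concurrent connections.
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        chunks = list(pool.map(read_one, ranges))
    rows = [row for chunk in chunks for row in chunk]
    return rows, state["peak"]


def demo():
    """Build a small throwaway database and read it in ten ranges, three at a time."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE tbl (col0 INTEGER, col1 INTEGER)")
    conn.executemany("INSERT INTO tbl VALUES (?, ?)", [(i, i * 2) for i in range(100)])
    conn.commit()
    conn.close()
    ranges = [(lo, lo + 9) for lo in range(0, 100, 10)]
    return read_ranges(path, "tbl", "col0", ranges, max_connections=3)
```

Calling demo() returns all 100 rows together with the peak number of tasks that held a connection at once, which never exceeds max_connections. A real implementation would apply the same bound across worker processes rather than threads.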
