Lazy engine initialization - rationale

Hi all,

This is a topic to gather rationale for lazy engine initialization as started by

@devin-petersohn, it would be awesome if you could provide some context, especially on the edge cases.


Take the following code:

import modin.pandas as pd

if isinstance(df, pd.DataFrame):  # df is whatever object was handed to us
    # do something
    # do something else
    pass

Checking whether or not an object is a Modin dataframe shouldn’t be super costly, and definitely shouldn’t create an entire multiprocessing execution context.

You can also consider the case where someone wants to use Modin in their library: importing that library shouldn’t start an engine as a side effect. Those two cases alone were enough for me to make the change.
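To make the idea concrete, here is a minimal sketch of lazy initialization in general. It is not Modin's actual code: `LazyEngine`, `ENGINE`, and the toy `DataFrame` class are made-up names for illustration, and the `start` callable stands in for an expensive call such as `ray.init()`.

```python
class LazyEngine:
    """Hypothetical holder that starts the engine on first use, not at import."""

    def __init__(self, start):
        self._start = start      # callable that does the expensive startup
        self._handle = None      # None means "not started yet"
        self.start_count = 0     # only here so the behavior is observable

    def get(self):
        # Start the engine the first time someone actually needs it.
        if self._handle is None:
            self._handle = self._start()
            self.start_count += 1
        return self._handle


# Creating the holder at import time is cheap; nothing is started here.
ENGINE = LazyEngine(start=lambda: "toy-engine")


class DataFrame:
    """Toy stand-in for a dataframe class backed by the engine."""

    def __init__(self, data):
        ENGINE.get()             # first real use triggers startup
        self._data = list(data)
```

With this shape, `isinstance(obj, DataFrame)` on an arbitrary object never touches the engine; only constructing a `DataFrame` does, and only the first construction pays the startup cost.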

Syntactically it is also nicer. We are going from:

import ray
ray.init()  # don't mess up the order!
import modin.pandas as pd

df = pd.read_parquet("some.parquet")

to supporting

import modin.pandas as pd
import ray

ray.init()  # can now come after the Modin import
df = pd.read_parquet("some.parquet")

Also, the engine doesn’t need to be updated before import:

import modin.pandas as pd
from modin.config import Engine

Engine.put("dask")  # switch engines after import; "dask" is just an example
df = pd.read_parquet("some.parquet")

Does that make sense?

I think so.
Considering this, I’m even more convinced that we should add an explicit init_engine() function, probably in the modin namespace, instead of relying on the side effect that subscribing to engine changes initializes the engine.

That way we can even make this warning disappear if a user has explicitly called this function.

Yes, an explicit init_engine makes sense. This change also makes it easier to do that. We can change the warning to suggest doing that as well.
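A rough sketch of how an explicit `init_engine()` could sit next to the lazy path, with the warning shown only on implicit startup. This is an assumption about the shape, not an existing Modin API: `EngineManager`, `get_engine`, and the warning text are hypothetical.

```python
import warnings


class EngineManager:
    """Toy model of lazy vs. explicit engine startup (not Modin's code)."""

    def __init__(self):
        self._engine = None

    def _start(self):
        # Stand-in for an expensive call such as ray.init().
        return object()

    def init_engine(self):
        """Explicit path: the user asked for startup, so no warning."""
        if self._engine is None:
            self._engine = self._start()
        return self._engine

    def get_engine(self):
        """Implicit path: start on first use and suggest init_engine()."""
        if self._engine is None:
            warnings.warn(
                "Engine started implicitly; call init_engine() to start it explicitly.",
                UserWarning,
                stacklevel=2,
            )
            self._engine = self._start()
        return self._engine
```

If the user calls `init_engine()` first, later `get_engine()` calls find the engine already running and stay silent; otherwise the first `get_engine()` call warns once and points to the explicit function.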