Gandiva Backend for Arrow format discussion


I have also just added a PR modin-project/modin#489 for running Arrow Table format natively instead of using pandas as the engine to perform the queries. There are a couple of ways that Arrow allows for operating on its Table format, but for now, I am using Gandiva to accelerate query (query code contributed by @suquark on GitHub).

This PR isn’t currently in modin.experimental.pandas though perhaps it should be. Any thoughts?

Currently, this is what appears:

$ MODIN_BACKEND=Gandiva ipython
Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018, 19:50:54) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.0.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import modin.pandas as pd                                                                                                                                          
UserWarning: Gandiva on Ray is experimental, some behaviors may not match expectations.
Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-03-03_16-11-38_6458/logs.
Waiting for redis server at to respond...
Waiting for redis server at to respond...
Starting Redis shard with 10.0 GB max memory.
Starting the Plasma object store with 20.0 GB memory using /tmp.

There is a warning in the first line of the output after it is imported, but it might be getting lost in the regular Ray output.

Experimental API initial discussion

It might be better in modin.experimental.pandas so that it will be obvious that the behavior may not work as expected. However, it doesn’t have to be since if people are going to run modin with modin environment variables, they probably are power users who know that the results that they get might not be what they expected.


I tend to agree that this maybe belongs in modin.experimental.pandas. It will require a minor refactor, but that should be fine.