Modin is slower than Pandas

tldr: Modin join slower than Pandas.

Generate datasets:

import numpy as np
import pandas as pd
cols = 100
arows = 20000
brows = 20000
keyrange = 100
a = pd.DataFrame(np.vstack([np.random.choice(keyrange,size=(arows)),np.random.normal(size=(cols,arows))]).transpose(),columns = ['key'] + ['avalue' + str(i) for i in range(1,1+cols)]).set_index('key')
b = pd.DataFrame(np.vstack([np.random.choice(keyrange,size=(brows)),np.random.normal(size=(cols,brows))]).transpose(),columns = ['key'] + ['bvalue' + str(i) for i in range(1,1+cols)]).set_index('key')

Run this code:

df = pd.read_csv("s3://"+ sys.argv[1] + "/" + sys.argv[2] )
start = time.time()
df = pd.read_csv("s3://"+ sys.argv[1] + "/" + sys.argv[2] )
df1 = pd.read_csv("s3://"+ sys.argv[1] + "/" + sys.argv[3] )

result = df.merge(df1,on='key',how='inner',suffixes=('_a','_b'))

On a c5.2xlarge on AWS, I find Modin 5x slower than Pandas. Any ideas why?

Hi @marsupialtail thanks for the post and welcome to the board!

There are a couple of possible explanations here. First, 20,000 rows may not be enough to see a meaningful benefit from Modin broadly. Second, the join strategy may be causing some problems. In some cases, we use a broadcast join, which broadcasts smaller datasets. Since the left and right datasets are both small here, the broadcast (shuffling data) will dominate the runtime.

There’s room for improvement in join (what pandas calls merge), and we have on the immediate roadmap some improvements to the shuffle-join (alternative to broadcast-join). This will improve performance of overall joins significantly. Lot’s of exciting things to come in the next few months related to better data shuffling!

Is there a recommended dataframe size where you will start to see improvement from Modin? Trying to understand how I should be benchmarking Modin using a similar merge operation

Great question @devin-petersohn

@hweller @marsupialtail In your cases, the merge operation has performance issues specific to the join strategy we are using (broadcast). Broadcast often works best when the e.g. left dataframe is large and the right dataframe is small. There’s a bit more information about your question in the docs, so I will point you there: Frequently Asked Questions (FAQs) — Modin 0+untagged.50.g406af7c.dirty documentation