Modin errors out on pytz.timezone()

Hello, I just started using Modin and it’s truly delightful, so :+1:t2:. I encountered an error that I have verified is specific to Modin, as the code runs fine without Modin. I hope you can help resolve it. Here goes:

My code has the following line, which calls pytz.timezone() to convert UTC to local time:

 acc2['Time'] = acc2['Time'].apply(timestamp_to_local, tzone=tzone)

Here is the timestamp_to_local function definition:

import datetime
from pytz import timezone

def timestamp_to_local(ts, tzone='Australia/Melbourne'):
    """Convert a Unix timestamp to local time. Output format: year-month-day hour:minute:seconds."""
    local = timezone('UTC').localize(datetime.datetime.utcfromtimestamp(ts)).astimezone(timezone(tzone))
    local = local.strftime(format='%Y-%m-%d %H:%M:%S:%f')
    return local
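
For context, a minimal self-contained sketch of the setup (an illustration only: modin.pandas is imported as pd, and the real acc2 is read from a csv file with Unix timestamps in the Time column):

import modin.pandas as pd  # assumption: Modin on its default Ray engine

acc2 = pd.DataFrame({'Time': [1585542839.0, 1585542839.5]})  # tiny stand-in for the real csv data
acc2['Time'] = acc2['Time'].apply(timestamp_to_local, tzone='Australia/Melbourne')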

Modin yields the following error:

Traceback (most recent call last):
File "/opt/conda/envs/modin/lib/python3.8/site-packages/modin/engines/ray/pandas_on_ray/frame/partition.py", line 180, in width
self._width_cache = ray.get(self._width_cache)
File "/opt/conda/envs/modin/lib/python3.8/site-packages/ray/worker.py", line 1474, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::modin.engines.ray.pandas_on_ray.frame.axis_partition.deploy_ray_func() (pid=10567, ip=xxx.xx.xx.xxx)
File "python/ray/_raylet.pyx", line 410, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task
ray.exceptions.RayTaskError: ray::modin.engines.ray.pandas_on_ray.frame.partition.deploy_ray_func() (pid=10566, ip=xxx.xx.xx.xxx)
File "python/ray/_raylet.pyx", line 410, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task
ray.exceptions.RayTaskError: ray::modin.engines.ray.pandas_on_ray.frame.partition.deploy_ray_func() (pid=10566, ip=xxx.xx.xx.xxx)
File "/opt/conda/envs/modin/lib/python3.8/site-packages/modin/data_management/functions/mapfunction.py", line 23, in <lambda>
lambda x: function(x, *args, **kwargs), *call_args, **call_kwds
File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/frame.py", line 6944, in applymap
return self.apply(infer)
File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/frame.py", line 6878, in apply
return op.get_result()
File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/apply.py", line 186, in get_result
return self.apply_standard()
File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/apply.py", line 295, in apply_standard
result = libreduction.compute_reduction(
File "pandas/_libs/reduction.pyx", line 620, in pandas._libs.reduction.compute_reduction
File "pandas/_libs/reduction.pyx", line 128, in pandas._libs.reduction.Reducer.get_result
File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/frame.py", line 6942, in infer
return lib.map_infer(x.astype(object).values, func)
File "pandas/_libs/lib.pyx", line 2329, in pandas._libs.lib.map_infer
File "/opt/conda/envs/modin/lib/python3.8/site-packages/modin/pandas/series.py", line 971, in <lambda>
lambda s: arg(s)
File "/opt/conda/envs/modin/lib/python3.8/site-packages/modin/pandas/series.py", line 514, in f
return func(x, *args, **kwds)
File "/home/ec2-user/jobs/create_info.py", line 51, in timestamp_to_local
local = timezone('UTC').localize(datetime.datetime.utcfromtimestamp(ts)).astimezone(timezone(tzone))
ValueError: Invalid value NaN (not a number)

Thank you for your help.

Hi @atan4583, thanks for posting this!

I wasn’t able to reproduce the issue based on the code you provided. Is it possible to share a few more lines of code from before this? It could be that something that should not be NaN is. The more you can share the better, we’d love to fix the issue!
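
In the meantime, a quick way to check whether the column already contains NaNs before the apply runs (a minimal sketch; acc2 and the Time column are taken from your snippet):

print(acc2['Time'].isna().sum())   # number of missing values going into apply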

Hello @devin-petersohn. Thanks for getting back to me. Sure, the code snippet looks like this:

start = 1585542839.0
fs = 32
# len(acc) == 1922220; acc is a pandas df read from a csv file
t = np.linspace(start,start + (len(acc)-1)/fs,len(acc))
tzone='Australia/Melbourne'
acc['X'] = acc['X']/64
acc['Y'] = acc['Y']/64
acc['Z'] = acc['Z']/64
acc['Time'] = acc['Time'].apply(timestamp_to_local, tzone=tzone)

While waiting for your reply, I tried the following to get around the error:

vf = np.vectorize(timestamp_to_local)
ltz = vf(ts=t,tzone=tzone)
acc['Time'] = pd.Series(ltz)

This got around the error, but the resulting acc df has an extra row, like below

This extra row led to another exception when the df was passed on to a Python user-defined class for further processing. To get around that, I removed the extra row, restricted the Python user-defined class to normal pandas only, saved the processed df as a csv file, and then reloaded the csv file into a Modin pandas df. However, Modin pandas again tripped on this operation:

info_sleeppy['Date']=pd.to_datetime(info_sleeppy['Date'], format='%d/%m/%Y')

The error this time:

  File "/opt/conda/envs/modin/lib/python3.8/site-packages/modin/engines/ray/pandas_on_ray/frame/partition.py", line 44, in get
    return ray.get(self.oid)
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/ray/worker.py", line 1474, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(IndexError): ray::modin.engines.ray.pandas_on_ray.frame.partition.deploy_ray_func() (pid=12748, ip=xxx.xx.xx.xxx)
  File "python/ray/_raylet.pyx", line 410, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task
ray.exceptions.RayTaskError: ray::modin.engines.ray.pandas_on_ray.frame.axis_partition.deploy_ray_func() (pid=12748, ip=xxx.xx.xx.xxx)
  File "python/ray/_raylet.pyx", line 410, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task
ray.exceptions.RayTaskError: ray::modin.engines.ray.pandas_on_ray.frame.partition.deploy_ray_func() (pid=12743, ip=xxx.xx.xx.xxx)
  File "python/ray/_raylet.pyx", line 446, in ray._raylet.execute_task
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/modin/engines/ray/pandas_on_ray/frame/partition.py", line 215, in deploy_ray_func
    partition = func(partition, **kwargs)
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/modin/engines/ray/pandas_on_ray/frame/partition.py", line 123, in <lambda>
    lambda df: pandas.DataFrame(df.iloc[row_indices, col_indices])
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/indexing.py", line 1762, in __getitem__
    return self._getitem_tuple(key)
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/indexing.py", line 2067, in _getitem_tuple
    self._has_valid_tuple(tup)
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/indexing.py", line 703, in _has_valid_tuple
    self._validate_key(k, i)
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/indexing.py", line 2009, in _validate_key
    raise IndexError("positional indexers are out-of-bounds")
IndexError: positional indexers are out-of-bounds

I ran the code on an AWS t2.large EC2 instance with 2 CPUs. Without Modin, normal pandas uses just 1 CPU, hits 100% usage, and then raises a pandas memory error while the other CPU sits 100% idle. This is why I switched to Modin pandas. I hope you can help resolve these teething issues. Thank you.

@devin-petersohn, I couldn’t edit my previous reply to add more info, so I am adding a new reply. I want to add:

  1. The code snippet provided in the previous reply runs fine under normal pandas

  2. The current code under normal pandas is too slow, so I tried to run it as 3 concurrent jobs. However, under normal pandas the first job takes up 100% of one CPU and hangs on a pandas memory error (it never returns), while the other 2 jobs are cancelled automatically by the OS. All this happens while the second CPU is 100% idle. I am not sure how to make the jobs use the idle CPU, so I am trying out Modin pandas, hoping it will run fast enough to eliminate the need for concurrent jobs (a sketch of how I start Modin is below).
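
For reference, this is roughly how Modin is being started here (a minimal sketch; as I understand it, the explicit ray.init line is optional because Modin otherwise starts Ray itself with all available CPUs):

import ray
ray.init(num_cpus=2)           # optional: only to pin the CPU count explicitly
import modin.pandas as pd

acc = pd.read_csv('acc.csv')   # hypothetical file name, for illustration only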

Hello @devin-petersohn. Thanks for getting back to me. Sure, the code snippet looks like this:

start = 1585542839.0
fs = 32
# len(acc) == 1922220; acc is a pandas df read from a csv file
t = np.linspace(start,start + (len(acc)-1)/fs,len(acc))
tzone='Australia/Melbourne'
acc['X'] = acc['X']/64
acc['Y'] = acc['Y']/64
acc['Z'] = acc['Z']/64
acc['Time'] = pd.Series(t)
acc['Time'] = acc['Time'].apply(timestamp_to_local, tzone=tzone)

While waiting for your reply, I tried the following to get around the error:

vf = np.vectorize(timestamp_to_local)
ltz = vf(ts=t,tzone=tzone)
acc['Time'] = pd.Series(ltz)

This got around the error, but the resulting acc df has an extra row, like below
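
(A minimal sketch of how such a trailing extra row can be dropped, assuming it appears as a single NaN row at the end; the removal I actually did is described below:)

acc = acc.dropna(subset=['Time']).reset_index(drop=True)  # assumes the stray row is a trailing NaN in Time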

This extra row led to another exception when the df was passed on to a Python user-defined class for further processing. To get around that, I removed the extra row, restricted the Python user-defined class to normal pandas only, saved the processed df as a csv file, and then reloaded the csv file into a Modin pandas df. However, Modin pandas again tripped on this operation:

info_sleeppy['Date']=pd.to_datetime(info_sleeppy['Date'], format='%d/%m/%Y')

The error this time:

  File "/opt/conda/envs/modin/lib/python3.8/site-packages/modin/engines/ray/pandas_on_ray/frame/partition.py", line 44, in get
    return ray.get(self.oid)
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/ray/worker.py", line 1474, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(IndexError): ray::modin.engines.ray.pandas_on_ray.frame.partition.deploy_ray_func() (pid=12748, ip=xxx.xx.xx.xxx)
  File "python/ray/_raylet.pyx", line 410, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task
ray.exceptions.RayTaskError: ray::modin.engines.ray.pandas_on_ray.frame.axis_partition.deploy_ray_func() (pid=12748, ip=xxx.xx.xx.xxx)
  File "python/ray/_raylet.pyx", line 410, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task
ray.exceptions.RayTaskError: ray::modin.engines.ray.pandas_on_ray.frame.partition.deploy_ray_func() (pid=12743, ip=xxx.xx.xx.xxx)
  File "python/ray/_raylet.pyx", line 446, in ray._raylet.execute_task
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/modin/engines/ray/pandas_on_ray/frame/partition.py", line 215, in deploy_ray_func
    partition = func(partition, **kwargs)
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/modin/engines/ray/pandas_on_ray/frame/partition.py", line 123, in <lambda>
    lambda df: pandas.DataFrame(df.iloc[row_indices, col_indices])
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/indexing.py", line 1762, in __getitem__
    return self._getitem_tuple(key)
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/indexing.py", line 2067, in _getitem_tuple
    self._has_valid_tuple(tup)
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/indexing.py", line 703, in _has_valid_tuple
    self._validate_key(k, i)
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/indexing.py", line 2009, in _validate_key
    raise IndexError("positional indexers are out-of-bounds")
IndexError: positional indexers are out-of-bounds

I ran the code on an AWS t2.large EC2 instance with 2 CPUs.

The current code under normal pandas is too slow, so I tried to run it as 3 concurrent jobs. However, under normal pandas the first job takes up 100% of one CPU and hangs on a pandas memory error (it never returns), while the other 2 jobs are cancelled automatically by the OS. All this happens while the second CPU is 100% idle. I am not sure how to make the concurrent jobs utilize the idle CPU. This is why I am switching to Modin pandas, hoping it will run fast enough to eliminate the need for concurrent jobs. I hope you can help resolve these teething issues. Thank you.

Hello @devin-petersohn. Apologies for posting so many replies; I am not sure how to edit or delete replies that contain mistakes. Please ignore the first and second replies and read the third post instead.

@atan4583, no problem at all! Sorry for some delay in my responses this week, it’s been a bit hectic!

Does the NaN row happen in pandas? This is something I would expect to work, unless pandas is doing a filter on that row. In that case, it would be a quick fix for your case, and we should investigate further how the deviation occurred.

Your use case fits Modin well, but it looks like there is something different going on. I definitely want to fix that case.

Thank you @devin-petersohn. No, the code runs fine under normal pandas, with no NaN row at all.

Even after manually removing the extra NaN row from the df, when it was passed into a Python user-defined class downstream, modin.pandas produced another exception on this operation:

data['LUX'] = pd.Series(np.zeros(len(data)))  # data is the name of the same acc df

The exception details:

  File "/opt/conda/envs/modin/lib/python3.8/site-packages/modin/pandas/dataframe.py", line 2515, in __setitem__
    self._query_compiler.concat(1, value._query_compiler),
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/modin/backends/pandas/query_compiler.py", line 255, in concat
    new_modin_frame = self._modin_frame._concat(axis, other_modin_frame, join, sort)
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/modin/engines/base/frame/data.py", line 1649, in _concat
    left_parts, right_parts, joined_index = self._copartition(
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/modin/engines/base/frame/data.py", line 1555, in _copartition
    joined_index = self._join_index_objects(axis, index_other_obj, how, sort)
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/modin/engines/base/frame/data.py", line 919, in _join_index_objects
    joined_obj = joined_obj.join(obj, how=how, sort=sort)
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/indexes/datetimelike.py", line 815, in join
    return Index.join(
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3296, in join
    return this.join(other, how=how, return_indexers=return_indexers)
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3310, in join
    return self._join_non_unique(
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3428, in _join_non_unique
    left_idx, right_idx = _get_join_indexers(
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 1311, in _get_join_indexers
    zipped = zip(*mapped)
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 1308, in <genexpr>
    _factorize_keys(left_keys[n], right_keys[n], sort=sort)
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 1924, in _factorize_keys
    llab, rlab = _sort_labels(uniques, llab, rlab)
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 1950, in _sort_labels
    _, new_labels = algos.safe_sort(uniques, labels, na_sentinel=-1)
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/algorithms.py", line 2012, in safe_sort
    ordered = sort_mixed(values)
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/pandas/core/algorithms.py", line 2002, in sort_mixed
    nums = np.sort(values[~str_pos])
  File "<__array_function__ internals>", line 5, in sort
  File "/opt/conda/envs/modin/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 991, in sort
    a.sort(axis=axis, kind=kind, order=order)
TypeError: '<' not supported between instances of 'int' and 'Timestamp'

The TypeError exception does not happen in normal pandas.
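
For completeness, a sketch of an alternative assignment I could try, one that sidesteps the Series-index join entirely (not yet verified under Modin):

import numpy as np

# Assumption: assigning a plain ndarray, or a Series built on data's own index,
# avoids the index-join path that raised the TypeError above.
data['LUX'] = np.zeros(len(data))
# or, keeping the Series form:
data['LUX'] = pd.Series(np.zeros(len(data)), index=data.index)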

Indeed, modin.pandas couldn’t run without hitting exceptions in that Python user-defined class. I eventually resorted to normal pandas in that module and saved the resulting df as a csv file before resuming modin.pandas upstream. Unfortunately, after modin.pandas read in the saved csv, it hit the IndexError: positional indexers are out-of-bounds exception (exception details in post #3) on this operation:

info_sleeppy['Date'] = pd.to_datetime(info_sleeppy['Date'], format='%d/%m/%Y')

Thanks so much for your commitment to helping resolve the issue. Looking forward to the fix.

Thanks @atan4583, this is helpful. I haven’t been able to reproduce the issue yet, but I am still working on it. Are you able to share more about your data/workflow? The smaller we can make the problem, the easier it will be to solve.

Does the first issue with the function above still happen on a subset of the file (smaller data)?

Hi @devin-petersohn, for the first issue, the number of rows in the csv file is 1,922,222, which is by no means large in typical data science applications.

I have not tried subsetting the file, because I am not sure if there is any kind of dependency issue (from a functional or business perspective) in subsetting it.

Are you able to share more about your data/workflow?

Could you elaborate on what you need? Data privacy and business confidentiality restrictions mean neither the data nor the code can be uploaded to a public forum like this. If you could provide a private and secure channel, a test Jupyter notebook for you to simulate the reported error conditions is a possibility.

Let me know how we can collaborate to get the issue resolved without compromising any binding data privacy & business confidentiality restrictions. Thank you for your help.

Hi @devin-petersohn, I can create a test Jupyter notebook for you to simulate the reported errors. Do you have an email address or upload site where I can either email you a download link or upload the notebook privately, please?

Hi @devin-petersohn, I created a test Jupyter notebook to simulate the first issue reported and exported it as an html file with the run results. Please download it here

If you still need the csv file, please provide an email and I will send you a download link.

Thank you.

Thanks for the link, you can send an email to devin.petersohn@berkeley.edu.

Hi @devin-petersohn, I emailed you a link to download the csv file. The test notebook I created is included in the zip file, so you can get right to the issue instead of spending time replicating the errors.

Thank you for your help.

Hi @devin-petersohn, checking in to see if you have an update on troubleshooting the issue. Is there anything else you need from my side to aid the resolution? Please let me know. Thank you.

Hi @atan4583, thank you for responding. I am still working through the notebook. I have many upcoming academic deadlines and have not been able to finish debugging yet. I will try to get the issue identified today or tomorrow. Thank you again and let me know if you have any questions!


Hi @devin-petersohn, sorry to bother you again. Did you have a chance to identify the issue?

Hi @atan4583, it is no bother at all! I have been looking into the issue. It is somewhat complicated, but I will give an update on where I am at.

1.) There is an undocumented (and inconsistent) behavior in how pandas inserts new Series objects into an existing DataFrame. This is where Modin inserts NaN but pandas does not. I am still trying to see how pandas handles these edge cases where the index objects don’t match.
2.) There is a new issue related to inserting a column with an identical index. I have that issue fixed and am in the process of getting it merged. After this fix, your notebook works with a reset_index (see the sketch after this list)!
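
A minimal sketch of what I mean by the reset_index workaround (assuming the frame only needs a plain RangeIndex before the new column is inserted; data and the LUX column are from your snippet above):

# Hypothetical illustration: give the frame a fresh RangeIndex so the inserted
# Series aligns cleanly, then assign the new column as before.
data = data.reset_index(drop=True)
data['LUX'] = pd.Series(np.zeros(len(data)))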

I will keep you updated. Don’t hesitate to ask for updates!

I opened an issue to track item (2) from my previous comment.

Thanks so much for the update @devin-petersohn! Very excited to hear you have fixed the issue with inserting a column with an identical index. Once your PR is merged, how soon will the fix be made available to the general public, and will a re-installation of Modin be required to get the patch?

Thanks again for your help amid your busy schedule. :+1: