repurpose package

Submodules

repurpose.img2ts module

class repurpose.img2ts.Img2Ts(input_dataset, outputpath, startdate, enddate, input_kwargs=None, input_grid=None, target_grid=None, imgbuffer=100, variable_rename=None, unlim_chunksize=100, cellsize_lat=None, cellsize_lon=None, r_methods='nn', r_weightf=None, r_min_n=1, r_radius=18000, r_neigh=8, r_fill_values=None, filename_templ='%04d.nc', gridname='grid.nc', global_attr=None, ts_attributes=None, ts_dtypes=None, time_units='days since 1858-11-17 00:00:00', zlib=True, n_proc=1, ignore_errors=False, backend='threading')[source]

Bases: object

Class that uses the read_img iterator of input_dataset to read all images between startdate and enddate and saves them as netCDF time series files with the cell structure of the output grid.

Currently, 2 time series formats are implemented:
  • The OrthoMultiTs format will be used when the same time stamp applies to all data points in a loaded image.

  • The IndexedRaggedTs format will be used when time stamps vary between locations in a netCDF image file.

The _read_image function will decide whether the orthogonal format is used or not.

calc()[source]

Iterate through all images of the image stack and extract temporal chunks. Transpose the data and append it to the output time series files.
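The transposition step can be pictured with a small, library-independent sketch. It assumes each image is a flat list with one value per grid point; the names `images` and `to_time_series` are illustrative and not part of repurpose.

```python
# Hypothetical sketch: turn a stack of images into per-location time series.
# Each image is a flat list holding one value per grid point (gpi).
images = [
    [0.10, 0.20, 0.30],  # image at time step 0
    [0.11, 0.21, 0.31],  # image at time step 1
]


def to_time_series(image_stack):
    """Transpose (time, gpi) data into (gpi, time) data."""
    return [list(series) for series in zip(*image_stack)]


time_series = to_time_series(images)
print(time_series[0])  # values of the first grid point over time
```

Each inner list of `time_series` is what would be appended to the time series record of one location.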

img_bulk()[source]

Yields numpy array of images from imgbuffer between start and enddate until all images have been read.

Returns:

  • img_stack_dict (dict[str, np.ndarray]) – stack of daily images for each variable

  • startdate (datetime.datetime) – date of first image in stack

  • enddate (datetime.datetime) – date of last image in stack

  • datetimestack (np.ndarray) – array of the timestamps of each image

  • jd_stack (np.ndarray or None) – None if all observations in an image have the same observation timestamp. Otherwise it gives the julian date of each observation in img_stack_dict

Yields:

tuple[dict, datetime, np.ndarray or None]
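The buffering behaviour can be sketched as a plain generator. This is only an illustration under the assumption of one image per day; `bulk_dates` is a hypothetical helper, not the library implementation, and it yields time stamps only instead of image data.

```python
from datetime import datetime, timedelta


def bulk_dates(startdate, enddate, imgbuffer):
    """Yield lists of at most `imgbuffer` daily time stamps."""
    current = startdate
    chunk = []
    while current <= enddate:
        chunk.append(current)
        if len(chunk) == imgbuffer:
            yield chunk
            chunk = []
        current += timedelta(days=1)
    if chunk:  # remaining images that do not fill a whole buffer
        yield chunk


chunks = list(bulk_dates(datetime(2020, 1, 1), datetime(2020, 1, 10), imgbuffer=4))
print([len(c) for c in chunks])
```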

exception repurpose.img2ts.Img2TsError[source]

Bases: Exception

repurpose.img2ts.is_subset_grid(grid, other, compare_index=False, compare_cell=False)[source]

Check if all the locations from the other grid are also included in grid, i.e. whether other is a subset of grid. This is done by checking that the distance between (common) GPIs is 0.

Parameters:
  • grid (CellGrid) – Main grid

  • other (CellGrid) – Potential subset grid

  • compare_index (bool, optional (default: False)) – If GPIs have the same coordinates, verify that the index is the same.

  • compare_cell (bool, optional (default: False)) – Both input grids must be CellGrids. Also the cell numbers assigned to the same locations must be equal.

Returns:

subset – True if subset or equal else False

Return type:

bool
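A minimal sketch of the subset idea, assuming grids are represented as plain dicts that map a GPI to a (lon, lat) tuple. It compares coordinates for exact equality instead of computing distances, and `is_subset` is a hypothetical stand-in rather than the library function.

```python
def is_subset(points, other_points, compare_index=False):
    """
    points, other_points: dict mapping gpi -> (lon, lat).
    Returns True if every location of other_points is also in points.
    """
    lookup = {coords: gpi for gpi, coords in points.items()}
    for gpi, coords in other_points.items():
        if coords not in lookup:
            return False
        # optionally require that matching locations share the same index
        if compare_index and lookup[coords] != gpi:
            return False
    return True


grid = {0: (10.0, 45.0), 1: (10.25, 45.0), 2: (10.5, 45.0)}
other = {5: (10.25, 45.0)}
same_location = is_subset(grid, other)
same_index = is_subset(grid, other, compare_index=True)
```

Here `other` shares a location with `grid` but uses a different GPI, so only the check without index comparison passes.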

repurpose.misc module

repurpose.misc.delete_empty_directories(path: str)[source]

Delete empty dirs in path

repurpose.misc.deprecated(message: str = None)[source]

Decorator for classes or functions to mark them as deprecated. If the decorator is applied without a specific message (@deprecated()), the default warning is shown when using the function/class. To specify a custom message use it like:

@deprecated("Don't use this function anymore!")

Parameters:

message (str, optional (default: None)) – Custom message to show with the DeprecationWarning.
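How such a decorator typically works can be sketched with the standard library alone. `deprecated_sketch` is a hypothetical stand-in that only handles functions; the default message text is an assumption.

```python
import functools
import warnings


def deprecated_sketch(message=None):
    """Hypothetical stand-in for a deprecation decorator (functions only)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            text = message or f"{func.__name__} is deprecated."
            warnings.warn(text, DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated_sketch("Don't use this function anymore!")
def old_add(a, b):
    return a + b


# Record warnings so the effect of the decorator can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_add(1, 2)
```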

repurpose.misc.find_first_at_depth(root_dir, depth, reverse=False)[source]

Finds and returns the first or last element at the specified depth in a directory tree.

This function performs a breadth-first search (BFS) of the directory structure, starting from the given root directory, and returns the name of the first or last (sorted) file or directory found at the specified depth. The elements at each level are processed in lexicographical order by default, but can be reversed.

Parameters:

  • root_dir (str) – The path to the root directory from which to start searching.

  • depth (int) – The target depth to search for an element. A depth of 0 refers to the root_dir itself, a depth of 1 refers to the immediate subdirectories/files in root_dir, a depth of 2 refers to the subdirectories/files within those subdirectories, and so on.

  • reverse (bool, optional (default: False)) – If False, the function returns the first element at the specified depth (lexicographically). If True, it returns the last element at the specified depth (reverse lexicographically).

Returns:

str or None

The name of the first (or last, if reverse=True) file or directory found at the specified depth, or None if no such element exists.

Raises:

ValueError:

If the root_dir is not a valid directory.

Notes:

  • If depth is 0, it will return the root directory itself if valid.

  • If files are encountered before reaching the target depth, they are ignored.
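The level-by-level search described above can be sketched as follows. `first_at_depth` is a hypothetical re-implementation for illustration; returning the bare entry name (and the root path for depth 0) is an assumption about the return format.

```python
import os
import tempfile


def first_at_depth(root_dir, depth, reverse=False):
    """Return the first (or last) sorted entry name at `depth`, or None."""
    if not os.path.isdir(root_dir):
        raise ValueError(f"{root_dir} is not a valid directory")
    if depth == 0:
        return root_dir
    level = [root_dir]
    for current_depth in range(1, depth + 1):
        next_level = []
        for directory in level:
            for name in sorted(os.listdir(directory), reverse=reverse):
                next_level.append(os.path.join(directory, name))
        if current_depth < depth:
            # files above the target depth are ignored
            next_level = [p for p in next_level if os.path.isdir(p)]
        level = next_level
    return os.path.basename(level[0]) if level else None


# Build a small throwaway tree: <root>/<year>/<month>/<file>
with tempfile.TemporaryDirectory() as root:
    for sub in ("2020/01", "2020/02", "2021/01"):
        os.makedirs(os.path.join(root, *sub.split("/")))
    for rel in ("2020/01/a.nc", "2020/02/b.nc", "2021/01/c.nc", "readme.txt"):
        open(os.path.join(root, *rel.split("/")), "w").close()

    first = first_at_depth(root, 3)
    last = first_at_depth(root, 3, reverse=True)
    too_deep = first_at_depth(root, 5)
```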

repurpose.process module

class repurpose.process.ImageBaseConnection(reader, max_retries=99, retry_delay_s=1, attr_read='read', attr_path='path', attr_grid='grid')[source]

Bases: object

Wrapper for image reader that creates a list of all files in a root directory upon initialisation. When the reader tries to access a file but cannot find it, verify against the previously created list. If the file should exist, repeat the reading, assuming that the file is not accessible due to some temporary issue.

This protects against processing gaps due to e.g. temporary network issues.
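The retry idea can be sketched independently of any reader class. `read_with_retries` and `flaky_read` are hypothetical helpers; the assumption is that a missing file surfaces as `FileNotFoundError`.

```python
import time


def read_with_retries(read, path, known_files, max_retries=3, retry_delay_s=0.0):
    """
    Call read(path); on FileNotFoundError retry only if `path` was listed
    when the connection was set up.
    """
    for attempt in range(max_retries + 1):
        try:
            return read(path)
        except FileNotFoundError:
            # A file that never existed is a real error, not a temporary one.
            if path not in known_files or attempt == max_retries:
                raise
            time.sleep(retry_delay_s)


calls = {"n": 0}


def flaky_read(path):
    """Fails twice, then succeeds, to mimic a temporary network issue."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise FileNotFoundError(path)
    return f"data from {path}"


result = read_with_retries(flaky_read, "img_001.nc", known_files={"img_001.nc"})
```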

property grid

tstamps_for_daterange(*args, **kwargs)[source]

class repurpose.process.ProgressParallel(use_tqdm=True, total=None, desc='', *args, **kwargs)[source]

Bases: Parallel

print_progress()[source]

Update the progress bar after each successful call.

repurpose.process.configure_worker_logger(log_queue, log_level, name)[source]

repurpose.process.idx_chunks(idx, n=-1)[source]

Yield successive n-sized chunks from list.

Parameters:
  • idx (pd.DateTimeIndex) – Time series index to split into parts

  • n (int, optional (default: -1)) – Size of the chunks to split idx up into, -1 returns the full index.
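A sketch of the chunking on a plain list, assuming (as the summary line suggests) that n is the chunk size. `chunks` is an illustrative helper, not the library function.

```python
def chunks(idx, n=-1):
    """Yield successive n-sized slices of idx; n=-1 yields idx unchanged."""
    if n == -1:
        yield idx
    else:
        for i in range(0, len(idx), n):
            yield idx[i:i + n]


parts = list(chunks(list(range(7)), n=3))
whole = list(chunks([1, 2]))
print(parts)
```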

repurpose.process.parallel_process(FUNC, ITER_KWARGS, STATIC_KWARGS=None, n_proc=1, show_progress_bars=True, ignore_errors=False, activate_logging=True, log_path=None, log_filename=None, loglevel='WARNING', logger_name=None, verbose=False, progress_bar_label='Processed', backend='threading', sharedmem=False, joblib_kwargs=None) list[source]

Applies the passed function to all elements of the passed iterables. Parallel function calls are processed ASYNCHRONOUSLY (i.e. the order of return values might be different from the order of the passed iterables)! Usually the iterable is a list of cells, but it can also be a list of e.g. images.

Parameters:
  • FUNC (Callable) – Function to call.

  • ITER_KWARGS (dict) – Container that holds iterables to split up and call in parallel with FUNC: Usually something like ‘cell’: [cells, … ] If multiple, iterables MUST HAVE THE SAME LENGTH. We iterate through all iterables and pass them to FUNC as individual kwargs. i.e. FUNC is called N times, where N is the length of iterables passed in this dict. Can not be empty!

  • STATIC_KWARGS (dict, optional (default: None)) – Kwargs that are passed to FUNC in addition to each element in ITER_KWARGS. Are the same for each call of FUNC!

  • n_proc (int, optional (default: 1)) – Number of parallel workers. If 1 is chosen, we do not use a pool. In this case the return values are kept in order.

  • show_progress_bars (bool, optional (default: True)) – Show how many iterables were processed already.

  • ignore_errors (bool, optional (default: False)) – If True, exceptions are caught and logged. If False, exceptions are raised.

  • activate_logging (bool, optional (default: True)) – If False, no logging is done at all (neither to file nor to stdout).

  • log_path (str, optional (default: None)) – If provided, a log file is created in the passed directory.

  • log_filename (str, optional (default: None)) – Name of the logfile to create in log_path. If None is chosen, a name is created automatically. If log_path is None, this has no effect.

  • loglevel (str, optional (default: "WARNING")) – Which level should be logged. Must be one of [“DEBUG”, “INFO”, “WARNING”, “ERROR”, “CRITICAL”].

  • logger_name (str, optional (default: None)) – The name to assign to the logger that can be accessed in FUNC to log to. If not given, then the root logger is used. E.g. `logger = logging.getLogger(<logger_name>)` followed by `logger.error("Some error message")`.

  • verbose (bool, optional (default: False)) – Print all logging messages to stdout, useful for debugging. Only effective when logging is activated.

  • progress_bar_label (str, optional (default: "Processed")) – Label to use for the progress bar.

  • backend (Literal["threading", "multiprocessing", "loky"] = "threading") – The backend to use for parallel execution (if n_proc > 1). Defaults to “threading”. See joblib docs for more info.

  • sharedmem (bool, optional (default: False)) – Activate shared memory option (slow). WARNING: Option not fully implemented / tested.

  • joblib_kwargs (dict, optional (default: None)) – Additional keyword arguments to pass to joblib.Parallel

Returns:

results – List of return values from each function call or None if no return values are found.

Return type:

list or None
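How ITER_KWARGS and STATIC_KWARGS combine into individual calls can be sketched with the standard library. `expand_calls` and `scale` are hypothetical; the sketch uses `concurrent.futures` instead of joblib, and `pool.map` keeps results in input order, unlike the asynchronous behaviour described above.

```python
from concurrent.futures import ThreadPoolExecutor


def expand_calls(iter_kwargs, static_kwargs=None):
    """Build one kwargs dict per call from equally long iterables."""
    static_kwargs = static_kwargs or {}
    lengths = {len(v) for v in iter_kwargs.values()}
    if len(lengths) != 1:
        raise ValueError("ITER_KWARGS must be non-empty and of equal length")
    n_calls = lengths.pop()
    return [
        {**{k: v[i] for k, v in iter_kwargs.items()}, **static_kwargs}
        for i in range(n_calls)
    ]


def scale(cell, factor, offset=0):
    return cell * factor + offset


# Two iterables of length 3 plus one static kwarg -> 3 calls of scale()
calls = expand_calls({"cell": [1, 2, 3], "factor": [10, 20, 30]}, {"offset": 1})
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(lambda kw: scale(**kw), calls))
```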

repurpose.process.parallel_process_async(*args, **kwargs)[source]

repurpose.process.rootdir() Path[source]

repurpose.process.run_with_error_handling(FUNC, ignore_errors=False, log_queue=None, log_level='WARNING', logger_name=None, **kwargs) Any[source]

repurpose.resample module

repurpose.resample.hamming_window(radius, distances)[source]

Hamming window filter.

Parameters:
  • radius (float32) – Radius of the window.

  • distances (numpy.ndarray) – Array with distances.

Returns:

weights – Distance weights.

Return type:

numpy.ndarray
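For orientation, a Hamming window over distances is commonly written as w(d) = α + (1 − α)·cos(π·d / radius) with α = 0.54. The sketch below assumes this conventional form; whether the library uses exactly these coefficients is an assumption, and `hamming_weights` is a hypothetical helper working on plain lists.

```python
import math


def hamming_weights(radius, distances, alpha=0.54):
    """Conventional Hamming window evaluated at the given distances."""
    return [alpha + (1 - alpha) * math.cos(math.pi * d / radius) for d in distances]


# Weights at the centre, at half the radius and at the full radius
weights = hamming_weights(18000, [0, 9000, 18000])
```

Under this form the weight falls from 1 at the centre to 0.08 at the edge of the window.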

repurpose.resample.resample_to_grid(input_data, src_lon, src_lat, target_lon, target_lat, methods='nn', weight_funcs=None, min_neighbours=1, search_rad=18000, neighbours=8, fill_values=None)[source]

Resamples data from a dictionary of numpy arrays to the given grid using pyresample. Searches for the neighbours and then resamples the data to the target grid (target_lon, target_lat) if at least min_neighbours neighbours are found.

Parameters:
  • input_data (dict of numpy.arrays) – Data to resample.

  • src_lon (numpy.array) – Longitudes of the input data.

  • src_lat (numpy.array) – Latitudes of the input data.

  • target_lon (numpy.array) – Longitudes of the output data.

  • target_lat (numpy.array) – Latitudes of the output data.

  • methods (string or dict, optional) – Method of spatial averaging, passed to pyresample. Either ‘nn’ (nearest neighbour) or ‘custom’ (a custom weight function has to be supplied in weight_funcs); see the pyresample documentation for more details. Can also be a dictionary with a method for each array in the input data dict.

  • weight_funcs (function or dict of functions, optional) – If method is ‘custom’, a function like func(distance) has to be given. Can also be a dictionary with a function for each array in the input data dict.

  • min_neighbours (int, optional) – If given, only points with at least this number of neighbours will be resampled. Default: 1

  • search_rad (float, optional) – Search radius of the neighbour search in meters. Default: 18000

  • neighbours (int, optional) – Maximum number of neighbours to look for for each input grid point. Default: 8

  • fill_values (number or dict, optional) – If given, the output array will be filled with this value where no valid resampled value could be computed; if not, a masked array will be returned. Can also be a dict with a fill value for each variable.

Returns:

data – resampled data on given grid

Return type:

dict of numpy.arrays

Raises:

ValueError : – if empty dataset is resampled
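The nearest-neighbour case can be illustrated without pyresample by a brute-force search. This is a sketch under simplifying assumptions (one variable, haversine distances on a sphere, one neighbour); `nn_resample` and `great_circle_m` are hypothetical helpers and do not reflect how pyresample performs the search.

```python
import math

EARTH_RADIUS_M = 6371000.0


def great_circle_m(lon1, lat1, lon2, lat2):
    """Haversine distance in meters."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nn_resample(values, src_lon, src_lat, target_lon, target_lat,
                search_rad=18000, fill_value=None):
    """Assign each target point the nearest source value within search_rad."""
    if not values:
        raise ValueError("Cannot resample an empty dataset")
    out = []
    for tlon, tlat in zip(target_lon, target_lat):
        dists = [great_circle_m(slon, slat, tlon, tlat)
                 for slon, slat in zip(src_lon, src_lat)]
        nearest = min(range(len(dists)), key=dists.__getitem__)
        out.append(values[nearest] if dists[nearest] <= search_rad else fill_value)
    return out


resampled = nn_resample(
    values=[0.1, 0.3], src_lon=[10.0, 11.0], src_lat=[45.0, 45.0],
    target_lon=[10.05, 11.02, 20.0], target_lat=[45.0, 45.0, 45.0],
    fill_value=-9999,
)
```

The third target point has no source point within 18 km and therefore receives the fill value.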

repurpose.resample.resample_to_grid_only_valid_return(input_data, src_lon, src_lat, target_lon, target_lat, methods='nn', weight_funcs=None, min_neighbours=1, search_rad=18000, neighbours=8, fill_values=None)[source]

Resamples data from a dictionary of numpy arrays to the given grid using pyresample. Searches for the neighbours and then resamples the data to the target grid (target_lon, target_lat) if at least min_neighbours neighbours are found.

Parameters:
  • input_data (dict of numpy.arrays) – Data to resample.

  • src_lon (numpy.array) – Longitudes of the input data.

  • src_lat (numpy.array) – Latitudes of the input data.

  • target_lon (numpy.array) – Longitudes of the output data.

  • target_lat (numpy.array) – Latitudes of the output data.

  • methods (string or dict, optional) – Method of spatial averaging, passed to pyresample. Either ‘nn’ (nearest neighbour) or ‘custom’ (a custom weight function has to be supplied in weight_funcs); see the pyresample documentation for more details. Can also be a dictionary with a method for each array in the input data dict.

  • weight_funcs (function or dict of functions, optional) – If method is ‘custom’, a function like func(distance) has to be given. Can also be a dictionary with a function for each array in the input data dict.

  • min_neighbours (int, optional) – If given, only points with at least this number of neighbours will be resampled. Default: 1

  • search_rad (float, optional) – Search radius of the neighbour search in meters. Default: 18000

  • neighbours (int, optional) – Maximum number of neighbours to look for for each input grid point. Default: 8

  • fill_values (number or dict, optional) – If given, the output array will be filled with this value where no valid resampled value could be computed; if not, a masked array will be returned. Can also be a dict with a fill value for each variable.

Returns:

  • data (dict of numpy.arrays) – resampled data on part of the target grid over which data was found

  • mask (numpy.ndarray) – boolean mask into target grid that specifies where data was resampled

Raises:

ValueError : – if empty dataset is resampled
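The relation between the two return values can be sketched on plain lists. `only_valid` is a hypothetical helper that assumes missing target points are marked with a fill value.

```python
def only_valid(resampled, fill_value):
    """Split resampled values into the valid subset and a boolean mask."""
    mask = [value != fill_value for value in resampled]
    data = [value for value, keep in zip(resampled, mask) if keep]
    return data, mask


data, mask = only_valid([0.1, 0.3, -9999], fill_value=-9999)
```

The mask has one entry per target grid point, while data only holds the points where resampling succeeded.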

repurpose.stack module

class repurpose.stack.Regular3dimImageStack(grid, timestamps, time_collocation=True, reference_time=None, zlib=True)[source]

Bases: object

add_variable(name, values=nan, static=False, attrs=None, dtype='float32')[source]

Add (empty) variable data to the current image stack.

Parameters:
  • name (str) – Name of the variable.

  • values (float or np.ndarray, optional (default: np.nan)) – Value that the variable should be initialised with. If an array is passed, it must have the correct shape.

  • static (bool, optional (default: False)) – If True, the variable is static, i.e. it has no time dimension.

  • attrs (dict, optional (default: None)) – Attributes that should be assigned to the variable.

  • dtype (str, optional (default: 'float32')) –

    Data type of the variable.

    One of ‘float32’, ‘float64’, ‘int32’, ‘int64’, ‘int16’, ‘int8’. ‘str’ is not yet supported.

close()[source]

collocate(df)[source]

For each image time stamp find the closest time series time stamp afterwards. Then convert time series time stamps to deltas (>0) from the image time stamps. If the image stack sampling is too sparse, i.e. multiple time series time stamps are assigned to the same image, then some data might be lost.

Parameters:

df (pd.DataFrame) – Loaded time series data

Returns:

collocated – Collocated version of df

Return type:

pd.DataFrame
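The collocation rule can be sketched with the standard library. `collocate_sketch` is a hypothetical helper working on plain datetime lists instead of a DataFrame, and it accepts a delta of exactly 0 as well.

```python
from bisect import bisect_left
from datetime import datetime


def collocate_sketch(image_times, series_times):
    """Map each image time to (observation time, delta in seconds)."""
    series_times = sorted(series_times)
    result = {}
    for img_t in image_times:
        pos = bisect_left(series_times, img_t)  # first observation at or after img_t
        if pos < len(series_times):
            obs_t = series_times[pos]
            result[img_t] = (obs_t, (obs_t - img_t).total_seconds())
    return result


image_times = [datetime(2020, 1, 1), datetime(2020, 1, 2)]
series_times = [
    datetime(2020, 1, 1, 3),
    datetime(2020, 1, 1, 15),  # not the closest to any image -> dropped
    datetime(2020, 1, 2, 2),
]
collocated = collocate_sketch(image_times, series_times)
```

With daily images, the 15:00 observation is not assigned to any image, which is the data loss mentioned above for too sparse image sampling.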

classmethod from_genreg(resolution=0.25, extent=None, **kwargs)[source]

Initialize image stack from regular raster of the passed resolution.

Parameters:
  • resolution (float, optional (default: 0.25)) – Resolution in degrees. A global raster of the chosen resolution is created.

  • extent (list or None) – Extent of the output image as [minlat, maxlat, minlon, maxlon].
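As a rough orientation for the raster size implied by resolution and extent, consider the sketch below. `regular_grid_shape` is a hypothetical helper; the assumption is that the extent is divided evenly into cells of the given resolution, which may differ from how the library places cell centres.

```python
def regular_grid_shape(resolution=0.25, extent=None):
    """Return (n_lat, n_lon) for extent = [minlat, maxlat, minlon, maxlon]."""
    minlat, maxlat, minlon, maxlon = extent or (-90, 90, -180, 180)
    n_lat = round((maxlat - minlat) / resolution)
    n_lon = round((maxlon - minlon) / resolution)
    return n_lat, n_lon


global_shape = regular_grid_shape(0.25)
subset_shape = regular_grid_shape(0.25, extent=[40, 50, 0, 20])
```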

static t_max_delta(dt)[source]

Find max of deltas between passed time stamps.

Parameters:

dt (pd.DatetimeIndex) – Datetime index. Deltas are computed between subsequent values.

Returns:

delta_h – The max delta in hours between the passed time stamps

Return type:

float
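The computation can be sketched on a list of datetimes. `max_delta_hours` is a hypothetical helper that mirrors the description above without pandas.

```python
from datetime import datetime


def max_delta_hours(timestamps):
    """Largest gap in hours between subsequent time stamps."""
    deltas = [
        (later - earlier).total_seconds() / 3600
        for earlier, later in zip(timestamps, timestamps[1:])
    ]
    return max(deltas)


delta_h = max_delta_hours([
    datetime(2020, 1, 1, 0),
    datetime(2020, 1, 1, 6),
    datetime(2020, 1, 2, 6),
])
```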

to_netcdf(path, *args, **kwargs)[source]

Shortcut to xarray.Dataset.to_netcdf. Write current stack to file. Zlib compression is applied when selected for the data set for all numeric variables. Other compression options can be set via the encoding keyword, e.g.

encoding={'sm': {'scale_factor': 0.001, 'dtype': 'int32', '_FillValue': -9999}}

to store sm values (0-1) with 3 decimal places precision as int32 (nans are stored as -9999).

Parameters:
  • path (str) – Path to output file.

  • args – Passed to xarray.Dataset.to_netcdf().

  • kwargs – Passed to xarray.Dataset.to_netcdf().
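The arithmetic behind such a scale_factor encoding can be sketched as follows. `pack` and `unpack` are hypothetical helpers that only mimic the packing convention; the actual conversion is done by the netCDF writer.

```python
import math


def pack(value, scale_factor=0.001, fill_value=-9999):
    """Convert a float to the integer that would be stored on disk."""
    if math.isnan(value):
        return fill_value
    return int(round(value / scale_factor))


def unpack(stored, scale_factor=0.001, fill_value=-9999):
    """Convert a stored integer back to a float."""
    return float("nan") if stored == fill_value else stored * scale_factor


stored = pack(0.4567)          # precision beyond 3 decimals is lost
restored = unpack(stored)
stored_nan = pack(float("nan"))
```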

write_loc(gpis, data, timestamp=None, new_var_kwargs=None)[source]

Write data for multiple gpis to one image. For static images (no time dimension), no timestamp is required. For dynamic images, a timestamp must be passed.

Parameters:
  • gpis (list of int) – List of gpis for which data is passed.

  • data (dict[str, np.ndarray]) – Data to be written. Keys are variable names (must exist in the dataset). Shape of each array must be (len(gpis),).

  • timestamp (str or datetime, optional (default: None)) – Timestamp of the image. If None, the image is static.

  • new_var_kwargs (dict[str, dict], optional (default: None)) – {variable_name: dict, …} In case a variable is not yet in the data set, we use these kwargs (passed to add_variable) to add the variable to the data set. The keys are the variable names in data, the values are the kwargs passed to add_variable.

write_ts(df, gpi, new_var_kwargs=None)[source]

Write time series for gpi to stack.

Parameters:
  • df (pd.DataFrame) – Data to be written to the stack. Columns contain variable names. If a variable is not yet present in the stack, a warning is issued.

  • gpi (int) – Gpi of the grid cell where the data is written to.

  • new_var_kwargs (dict[str, dict], optional (default: None)) – {variable_name: dict, …} In case a variable is not yet in the data set, we use these kwargs (passed to add_variable) to add the variable to the data set. The key is the column in df, values are the kwargs passed to add_variable.

repurpose.ts2img module

class repurpose.ts2img.Ts2Img(ts_reader, img_grid, timestamps, variables=None, read_function='read', max_dist=18000, time_collocation=True, loglevel='WARNING', ignore_errors=False, backend='threading')[source]

Bases: object

Takes a time series dataset and converts it into a set of images. Images are stored on a regular grid. This includes a spatial and temporal lookup, i.e. resampling of the time series data to a regular 2d grid as well as assigning time series time stamps to images.

Protected variable names (used internally) are:

timedelta_seconds, index_other, distance_other, gpi, lon, lat

Parameters:
  • ts_reader (GriddedNcOrthoMultiTs or GriddedNcContiguousRaggedTs or GriddedNcIndexedRaggedTs) – A reader that returns a time series for a given lon/lat combination. The method named in read_function is called to read a pandas DataFrame that has a DateTimeIndex and the variables as columns for a location.

  • img_grid (BasicGrid or CellGrid) – A regular grid that defines the output images. Must be rectangular and have a 2d shape attribute. Can be a spatial subset of the time series grid and contain points that are missing in the time series (filled with nan). For each grid point, we search the closest time series (within max_dist of ts_reader).

  • timestamps (pd.DateTimeIndex) – Each data point in the loaded time series must be assigned to an image. This defines the temporal sampling of the image stack. Each time stamp is a separate image. The closest time stamp from the time series will be stored in the according image, other data that would be assigned to the same image are DISCARDED! In this case a higher frequency (e.g. 12-hourly) should be chosen. A too low frequency here means that information is lost. A too high frequency here means that data is split up into many images.

  • variables (dict or list[str] or None, optional (default: None)) – Data variables to be read from the time series and transferred to the images. Must exist in the time series. If a dict is given, then the variables are renamed after reading. Ideally a fill value for each variable (new name) is given in ‘fill_values’. If None, all variables are read.

  • read_function (str, optional (default: 'read')) – Name of the method in ts_reader that takes a lon/lat pair and returns a pandas DataFrame with a DateTimeIndex and the variables as columns.

  • max_dist (float, optional (default: 18000)) – Maximum distance around an image grid cell to look for a time series. If multiple are found, only the nearest one is used!

  • time_collocation (bool, optional (default: True)) – Relevant when converting data with varying time stamps per location. For each image time stamp find the closest time series time stamp afterwards. Then convert time series time stamps to deltas (>0) from the image time stamps and store them in a new image variable ‘timedelta_seconds’.

  • loglevel (str, optional (default: 'WARNING')) – Logging level. Must be one of ‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’, ‘CRITICAL’.

  • ignore_errors (bool, optional (default: False)) – Instead of raising an exception, log errors and continue the process. E.g. to skip individual corrupt files.

  • backend (str, optional (default: 'threading')) – Which backend joblib should use. Default is ‘threading’, other options are ‘multiprocessing’ and ‘loky’

calc(path_out, format_out='slice', preprocess=None, preprocess_kwargs=None, postprocess=None, postprocess_kwargs=None, fn_template='{datetime}.nc', drop_empty=False, encoding=None, zlib=True, glob_attrs=None, var_attrs=None, var_fillvalues=None, var_dtypes=None, img_buffer=100, n_proc=1)[source]

Perform conversion of all time series to images. This will first split timestamps into processing chunks (img_buffer) and then - for each chunk - iterate through all cells (parallel) in the img_grid, and transfer the time series for each pixel into the image stack.

Parameters:
  • path_out (str) – Path to the output directory where the files are written to.

  • format_out (str, optional (default: 'slice')) –

    • slice: write each time step as a separate file. In this case the fn_template must contain a placeholder {datetime} where the date is inserted for each image.

    • stack: write all time steps into one file. In this case if there is a {datetime} placeholder in the fn_template, then the time range is inserted.

  • preprocess (callable or list[Callable], optional (default: None)) –

    Function that is applied to each time series before converting it. The first argument is the data frame that the reader returns. Additional keyword arguments can be passed via preprocess_kwargs. The function must return a data frame of the same form as the input data, i.e. with a datetime index and at least one column of data. Note: As an alternative to a preprocessing function, consider applying an adapter to the reader class. Adapters also perform preprocessing, see pytesmo.validation_framework.adapters. A simple example for a preprocessing function to compute the sum:

    ```
    import pandas as pd

    def preprocess_add(df: pd.DataFrame, **preprocess_kwargs) -> pd.DataFrame:
        # add a new column holding the sum of two existing columns
        df['var3'] = df['var1'] + df['var2']
        return df
    ```

  • preprocess_kwargs (dict or list[dict], optional (default: None)) – Keyword arguments for the preprocess function. If None are given, then the preprocessing function is called with only the input data frame and no additional arguments (see example above).

  • postprocess (Callable or list[Callable], optional (default: None)) –

    Function(s) applied to the image stack after loading the data and before writing it to disk. The function must take an xarray Dataset as the first argument and return an xarray Dataset of the same form. A simple example for a postprocessing function to add a new variable from the sum of two existing variables:

    ```
    import xarray as xr

    def postprocess_add(stack: xr.Dataset, **postprocess_kwargs) -> xr.Dataset:
        # add a new variable holding the sum of two existing variables
        stack = stack.assign(var3=lambda x: x['var0'] + x['var2'])
        return stack
    ```

  • postprocess_kwargs (dict or list[dict], optional (default: None)) – Keyword arguments for the postprocess function(s). If None are given, then the postprocess function is called with only the input image stack and no additional arguments (see example above).

  • fn_template (str, optional (default: "{datetime}.nc")) – Template for the output image file names. If format_out is ‘slice’, then a placeholder {datetime} must be in the fn_template, which will be replaced by the timestamp of each image. If format_out is ‘stack’, then no {datetime} placeholder is required. If it is still provided, the time range of the stack is inserted.

  • drop_empty (bool, optional (default: False)) – Images where all data variables are empty are removed from the stack after loading / before writing. Otherwise, empty images will be written to disk.

  • encoding (dict of dicts, optional (default: None)) – Encoding kwargs for each variable. Are passed to netcdf for storing the files to apply dtype, scale_factor, add_offset, etc. Make sure that the encoding is consistent with the data and fill values (var_fillvalues). For example, conversion to int16 for data values between 0 and 100 can result in data loss. e.g. {‘sm’: {‘dtype’: ‘int32’, ‘scale_factor’: 0.01}}

  • zlib (bool, optional (default: True)) – If True, then the netcdf files are compressed using zlib compression for all data variables.

  • glob_attrs (dict, optional (default: None)) – Additional global attributes that are added to the netcdf file. e.g. {‘product’: ‘ASCAT 12.5 TS’}

  • var_attrs (dict of dicts, optional (default: None)) – Additional variable attributes that are added to the netcdf file. The dict must have the following structure: {varname: {‘attrname’: value}}, e.g {‘sm’: {‘long_name’: ‘soil moisture’, ‘units’: ‘m3 m-3’}, …} In case variable was renamed, use the new name here!

  • var_fillvalues (dict, optional (default: None)) – Fill values for each variable. By default, nan is used for all variables (you can also use the encoding parameter to set a fill value when writing to disk). In case variable was renamed, use the new name here!

  • var_dtypes (dict, optional (default: None)) – Data types for each variable. By default, float32 is used for all variables (you can also use the encoding parameter to set a dtype when writing to disk). In case variable was renamed, use the new name here!

  • img_buffer (int, optional (default: 100)) – Size of the stack before writing to disk. Larger stacks need more memory but will lead to faster conversion. Passing -1 means that the whole stack is loaded into memory at once.

  • n_proc (int, optional (default: 1)) – Number of processes to use for parallel processing. We parallelize by 5 deg. grid cell.
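How format_out and fn_template interact can be sketched as follows. `output_names` is a hypothetical helper; the date format string and the separator used for the time range are assumptions and may differ from what the library writes.

```python
from datetime import datetime


def output_names(timestamps, fn_template="{datetime}.nc", format_out="slice",
                 fmt="%Y%m%d%H%M%S"):
    """Return the file names for one chunk of image time stamps."""
    if format_out == "slice":
        if "{datetime}" not in fn_template:
            raise ValueError("fn_template needs a {datetime} placeholder")
        # one file per time step
        return [fn_template.format(datetime=t.strftime(fmt)) for t in timestamps]
    # 'stack': a single file, the placeholder (if any) receives the time range
    span = f"{min(timestamps).strftime(fmt)}_{max(timestamps).strftime(fmt)}"
    return [fn_template.format(datetime=span)]


stamps = [datetime(2020, 1, 1), datetime(2020, 1, 2)]
slice_names = output_names(stamps)
stack_names = output_names(stamps, "stack_{datetime}.nc", format_out="stack")
```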

store_netcdf_images(path_out, fn_template="<class 'datetime.datetime'>.nc", encoding=None, annual_folder=True, keep=False, n_proc=1)[source]

Write the (global) merged image stack to netcdf files.

Parameters:
  • path_out (str) – Path to the output directory where the files are written to.

  • fn_template (str, optional (default: None)) – Template for the output image file names. Must contain a placeholder {datetime} where the image date is inserted.

  • encoding (dict (default: None)) – Encoding for the netcdf variables. The keys are the variable names.

  • annual_folder (bool, optional (default: True)) – If True, then the images are grouped by year, and images for each year are written to a separate folder.

  • keep (bool, optional (default: False)) – If True, then the image stack is kept in memory during writing. This is only needed if anything else should be done with the stack after writing it to disk. If False (recommended), then the stack is gradually deleted to empty memory during writing.

  • n_proc (int, optional (default: 1)) – Number of processes to use for reading cells and writing images in parallel. Merging cells after reading is not parallelised and might be a bottleneck.

Module contents