DRecPy.Dataset package

DRecPy.Dataset.dataset_abc module

class DRecPy.Dataset.dataset_abc.InteractionDatasetABC(**kwds)

Bases: abc.ABC

apply(column, function)

Modifies the current dataset instance by applying a transformation to a specific column in every row.

Parameters:
  • column – A string that represents the name of the column that will be transformed.
  • function – The function that will be used to map the current column value in each row to the new one.
Returns:

None.

assign_internal_ids()

Assigns user and item internal ids. Internal ids are consecutive integer identifiers that uniquely represent each user or item. Two new columns are created on this dataset instance: “uid” and “iid”, for the user internal id and the item internal id, respectively.

Returns: None.

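As a rough illustration of what consecutive internal ids look like, the sketch below maps raw ids to integers in order of first appearance. It is plain Python and not DRecPy's implementation: the helper name build_internal_ids, the zero-based start and the first-appearance ordering are all assumptions made for the example.

```python
def build_internal_ids(raw_ids):
    """Map each distinct raw id to a consecutive integer, in order of first appearance."""
    mapping = {}
    for raw in raw_ids:
        if raw not in mapping:
            mapping[raw] = len(mapping)  # next unused integer (assumed to start at 0)
    return mapping


# Hypothetical (user, item, interaction) records.
records = [("u7", "i3", 4.0), ("u2", "i3", 1.0), ("u7", "i9", 5.0)]

user_to_uid = build_internal_ids(user for user, _, _ in records)
item_to_iid = build_internal_ids(item for _, item, _ in records)

# Reverse lookups, similar in spirit to uid_to_user() and iid_to_item().
uid_to_user = {uid: user for user, uid in user_to_uid.items()}
iid_to_item = {iid: item for item, iid in item_to_iid.items()}
```
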
copy()

Copies the current dataset instance into a new one.

Returns: An InteractionDataset instance with the same data values as the current one.

count_unique(columns=None)

Count the number of unique values on the provided column combination.

Parameters: columns – A list containing the columns to take into account. Default: all.
Returns: The count of unique values present on the provided column combination.

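A column combination is counted as a whole, so two rows only collapse into one when they agree on every listed column. The stand-alone sketch below shows that idea on a list of dicts; the function name count_unique_rows and the sample rows are hypothetical, and it assumes every row has the same keys.

```python
def count_unique_rows(rows, columns=None):
    """Count distinct value combinations over the given columns (all columns if None)."""
    if not rows:
        return 0
    if columns is None:
        columns = list(rows[0])  # assumes all rows share the same keys
    return len({tuple(row[column] for column in columns) for row in rows})


rows = [
    {"user": "u1", "item": "i1"},
    {"user": "u1", "item": "i2"},
    {"user": "u2", "item": "i1"},
    {"user": "u1", "item": "i1"},  # duplicate of the first row
]
unique_users = count_unique_rows(rows, ["user"])
unique_pairs = count_unique_rows(rows, ["user", "item"])
```
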
drop(record_ids, copy=True, keep=False)

Remove (or keep) the provided list of record ids from the current InteractionDataset instance.

Parameters:
  • record_ids – A list of integers representing record ids.
  • copy – A boolean indicating whether the removal (or retention) is applied to a new InteractionDataset instance, or whether the current InteractionDataset instance is modified instead (copy=False). Default: True.
  • keep – A boolean indicating whether the provided record ids should be kept or removed (keep=False). Default: False.
Returns:

An instance of the InteractionDataset with (or without) the filtered records.
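
The interplay between record_ids and keep can be pictured with a small stand-alone sketch. It works on a plain dict keyed by record id rather than on an InteractionDataset, and the function name filter_records is hypothetical. It always returns a new dict, so the copy=False behaviour is not modelled.

```python
def filter_records(records, record_ids, keep=False):
    """Return a new dict without (keep=False) or with only (keep=True) the given ids."""
    ids = set(record_ids)
    # A record survives when its membership in `ids` equals the `keep` flag.
    return {rid: row for rid, row in records.items() if (rid in ids) == keep}


records = {0: ("u1", "i1", 5.0), 1: ("u1", "i2", 3.0), 2: ("u2", "i1", 1.0)}
removed = filter_records(records, [0, 2])          # record ids 0 and 2 are dropped
kept = filter_records(records, [0, 2], keep=True)  # only record ids 0 and 2 remain
```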

exists(query)

Checks whether the provided query matches at least one record.

Parameters: query – A string representing the query to be run. The query format should be: “column_name operator value”, where extra conditions should be separated by a ‘,’. E.g. “user == ‘123’, interaction > 3.5”.
Returns: A boolean indicating whether the query returns any results.

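To make the “column_name operator value” format concrete, here is a minimal sketch of how such a query string could be split into conditions and evaluated against dict rows. It is not DRecPy's parser: the names parse_query, parse_value and matches, the supported operator set, and the rule that unquoted values are numbers are assumptions for illustration only.

```python
import operator

# Longer symbols come first so that ">=" is not mistaken for ">".
OPERATORS = [
    ("==", operator.eq), ("!=", operator.ne),
    (">=", operator.ge), ("<=", operator.le),
    (">", operator.gt), ("<", operator.lt),
]


def parse_value(text):
    """Turn a quoted token into a str and anything else into a float."""
    text = text.strip()
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        return text[1:-1]
    return float(text)


def parse_query(query):
    """Split 'col op value, col op value' into (column, function, value) triples."""
    conditions = []
    for part in query.split(","):  # naive: values must not contain commas
        for symbol, func in OPERATORS:
            if symbol in part:
                column, value = part.split(symbol, 1)
                conditions.append((column.strip(), func, parse_value(value)))
                break
        else:
            raise ValueError(f"no operator found in condition: {part!r}")
    return conditions


def matches(record, conditions):
    """True when the record satisfies every condition."""
    return all(func(record[column], value) for column, func, value in conditions)


rows = [
    {"user": "123", "item": "a", "interaction": 4.0},
    {"user": "123", "item": "b", "interaction": 2.0},
    {"user": "456", "item": "a", "interaction": 5.0},
]
conditions = parse_query("user == '123', interaction > 3.5")
selected = [row for row in rows if matches(row, conditions)]
any_match = len(selected) > 0  # the kind of answer exists() gives
```
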
iid_to_item(iid)

Converts a given internal item id into its corresponding raw id. Raises an exception if no internal ids are assigned.

Parameters: iid – The item internal id.
Returns: An integer value representing the item raw id, or None if the provided internal item id does not exist.

item_to_iid(item)

Converts a given raw item id into its corresponding internal id. Raises an exception if no internal ids are assigned.

Parameters: item – The item raw id.
Returns: An integer value representing the item internal id, or None if the provided raw item id does not exist.

max(column=None)

Computes the maximum value for the provided column.

Parameters: column – The name of the column for which the maximum should be computed.
Returns: The maximum value present on the whole dataset, for the provided column name.

min(column=None)

Computes the minimum value for the provided column.

Parameters: column – The name of the column for which the minimum should be computed.
Returns: The minimum value present on the whole dataset, for the provided column name.

null_interaction_pair_generator(interaction_threshold=None, seed=None)

Provides a generator that yields negative / null interaction pairs.

Parameters:
  • interaction_threshold – An optional integer that is used as the boundary interaction value between positive and negative interaction pairs. All values greater than or equal to interaction_threshold are considered positive, and all values below it are considered negative. If None is provided, positive interactions are the ones present on the dataset, and all the others are considered negative. Default: None.
  • seed – An optional integer to be used as the seed value for the pseudo-random number generator used to sample null interaction pairs. Default: None.
Returns:

A generator that yields negative / null interaction pairs, that is, (user internal id, item internal id) tuples.
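
The threshold and seed semantics described above can be sketched with simple rejection sampling. This is only an illustration under stated assumptions, not the library's sampling strategy: the function name null_pairs, the dict-based input and the explicit n_users / n_items arguments are hypothetical.

```python
import random
from itertools import islice


def null_pairs(interactions, n_users, n_items, interaction_threshold=None, seed=None):
    """Yield random (uid, iid) pairs that are not positive interactions.

    `interactions` maps (uid, iid) tuples to interaction values.
    """
    if interaction_threshold is None:
        # Every pair present in the data counts as positive.
        positives = set(interactions)
    else:
        # Only pairs at or above the threshold count as positive.
        positives = {pair for pair, value in interactions.items()
                     if value >= interaction_threshold}
    if len(positives) >= n_users * n_items:
        return  # no null pair exists, so there is nothing to sample
    rng = random.Random(seed)
    while True:
        # Rejection sampling: simple, but slow when the data is dense.
        pair = (rng.randrange(n_users), rng.randrange(n_items))
        if pair not in positives:
            yield pair


interactions = {(0, 0): 5.0, (0, 1): 2.0, (1, 1): 4.0}
with_threshold = list(islice(null_pairs(interactions, 2, 2, interaction_threshold=3, seed=42), 5))
without_threshold = list(islice(null_pairs(interactions, 2, 2, seed=42), 3))
```

With a threshold of 3, the pair (0, 1) has a value of 2.0 and is therefore treated as negative; without a threshold it is present in the data and counts as positive.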

remove_internal_ids()

Removes user and item internal ids.

Returns: None.

save(path, columns=None, write_header=False)

Persists the current dataset instance in the provided path, as a csv file. Note that internal identifiers, such as the row id (rid), user internal id (uid) and item internal id (iid) are never persisted, since they’re only useful during runtime.

Parameters:
  • path – A string that represents the path where the current dataset values will be persisted.
  • columns – An optional list with the names of the columns that should be persisted. Default: all columns.
  • write_header – A boolean indicating whether to write the csv header on the persisted file. Default: False.
Returns:

None.
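
The effect of columns and write_header, and the exclusion of runtime-only identifiers, can be pictured with the standard csv module. The sketch below writes to an in-memory buffer; the function name write_csv and the sample rows are hypothetical and the code is not taken from DRecPy.

```python
import csv
import io

# Runtime-only identifiers that the documentation says are never persisted.
INTERNAL_COLUMNS = {"rid", "uid", "iid"}


def write_csv(rows, stream, columns=None, write_header=False):
    """Write dict rows as CSV, skipping runtime-only identifier columns."""
    if columns is None:
        columns = list(rows[0]) if rows else []
    columns = [column for column in columns if column not in INTERNAL_COLUMNS]
    writer = csv.writer(stream, lineterminator="\n")
    if write_header:
        writer.writerow(columns)
    for row in rows:
        writer.writerow([row[column] for column in columns])


rows = [
    {"rid": 0, "user": "u7", "item": "i3", "interaction": 4.0, "uid": 0, "iid": 0},
    {"rid": 1, "user": "u2", "item": "i3", "interaction": 1.0, "uid": 1, "iid": 0},
]
buffer = io.StringIO()
write_csv(rows, buffer, write_header=True)
output = buffer.getvalue()
```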

select(query, copy=True)

Select rows from the InteractionDataset.

Parameters:
  • query – A string representing the query to be run. The query format should be: “column_name operator value”, where extra conditions should be separated by a ‘,’. E.g. “user == ‘123’, interaction > 3.5”.
  • copy – A boolean indicating whether to create a new InteractionDataset where the rows that satisfy the provided query are put, or if the filtered rows should be removed from the current InteractionDataset (copy=False). Default: True.
Returns:

An instance of an InteractionDataset containing the rows selected by the provided query.

select_item_interaction_vec(iid)

Compute the item interaction vector for the provided item internal id.

Parameters: iid – The item internal id that references the item that should have its interaction vector computed.
Returns: The interaction vector (a vector containing the interaction values of each user to the provided item, in order) of the provided iid.

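An interaction vector can be thought of as one column (for an item) or one row (for a user) of the user-item interaction matrix. The sketch below builds the item version from (uid, iid, interaction) triples. The function name, the dense list representation and the use of 0.0 for missing interactions are assumptions for illustration; the library's actual return type may differ.

```python
def item_interaction_vector(records, iid, n_users):
    """Return the interaction values for one item, indexed by user internal id."""
    vector = [0.0] * n_users  # assumption: users without an interaction get 0.0
    for uid, record_iid, value in records:
        if record_iid == iid:
            vector[uid] = value
    return vector


# Hypothetical (uid, iid, interaction) triples for 3 users and 2 items.
records = [(0, 0, 4.0), (1, 0, 2.5), (2, 1, 5.0)]
item_vector = item_interaction_vector(records, iid=0, n_users=3)
```

The user interaction vector is the symmetric case: fix the uid and index the result by item internal id.
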
select_one(query, columns=None, to_list=False)

Select the first resulting row for the provided query.

Parameters:
  • query – A string representing the query to be run. The query format should be: “column_name operator value”, where extra conditions should be separated by a ‘,’. E.g. “user == ‘123’, interaction > 3.5”.
  • columns – A list with the column names to be kept on the resulting record. Default: all.
  • to_list – A boolean indicating whether each data point should be returned as a dict or as a list. Default: False.

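Returns:

The first record that satisfies the provided query, represented as a dict or, when to_list is True, as a list.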

select_random_generator(query=None)

Provides a generator that yields randomly selected dataset rows.

Parameters: query – A string representing the query to be run before selecting random rows. The query format should be: “column_name operator value”, where extra conditions should be separated by a ‘,’. E.g. “user == ‘123’, interaction > 3.5”.
Returns: A generator that yields dataset rows, where each row is represented as a dict.

select_user_interaction_vec(uid)

Compute the user interaction vector for the provided user internal id.

Parameters: uid – The user internal id that references the user that should have its interaction vector computed.
Returns: The interaction vector (a vector containing the interaction values of each item to the provided user, in order) of the provided uid.

uid_to_user(uid)

Converts a given internal user id into its corresponding raw id. Raises an exception if no internal ids are assigned.

Parameters: uid – The user internal id.
Returns: An integer value representing the user raw id, or None if the provided internal user id does not exist.

unique(columns=None, copy=True)

Return a new InteractionDataset instance containing only the unique values on the provided column combination.

Parameters:
  • columns – The column combination to take into account when computing unique values. Default: all.
  • copy – A boolean indicating whether a copy of the InteractionDataset should be made, or if it should be modified in-place. Default: True.
Returns:

An InteractionDataset instance containing the unique values on the provided column combination.

user_to_uid(user)

Converts a given raw user id into its corresponding internal id. Raises an exception if no internal ids are assigned.

Parameters: user – The user raw id.
Returns: An integer value representing the user internal id, or None if the provided raw user id does not exist.

values(columns=None, to_list=False)

Provides a generator that yields all the records present in the dataset.

Parameters:
  • columns – The list of columns that should be returned for each data point. Default: all.
  • to_list – A boolean indicating whether each data point should be returned as a dict or as a list. Default: False.
Returns:

A generator that yields records present in the dataset.

values_list(columns=None, to_list=False)

Provides a list with all the records present in the dataset.

Parameters:
  • columns – The list of columns that should be returned for each data point. Default: None (show all).
  • to_list – A boolean indicating whether each data point should be returned as a dict or as a list. Default: False.
Returns:

A list containing all records present in the dataset.

DRecPy.Dataset.dataset_factory module

class DRecPy.Dataset.dataset_factory.InteractionsDatasetFactory

Bases: object

InteractionsDatasetFactory creates InteractionDataset instances.

Parameters:
  • path – A string representing the path to the file where the dataset is located.
  • columns – A list with the names of the columns present on the dataset, ordered according to the column order present in the dataset file. Required column names: ‘user’, ‘item’, ‘interaction’.
  • delimiter – A string representing the delimiter used on the dataset file. Default: ‘,’.
  • has_header – A boolean indicating whether the dataset file has a header row, in which case the first row is skipped. Default: False.
  • in_memory – A boolean indicating whether to load the dataset in memory or out of memory. Default: True.
  • verbose – A boolean indicating whether to log info messages or not. Default: True.

static read_df(df, user_label='user', item_label='item', interaction_label='interaction', **kwds)

Converts the provided dataframe into an InteractionDataset instance.

Parameters:
  • df – A dataframe containing the dataset to be imported.
  • user_label – The name of the column containing the user identifiers. Default: ‘user’.
  • item_label – The name of the column containing the item identifiers. Default: ‘item’.
  • interaction_label – The name of the column containing the interaction values. Default: ‘interaction’.
Returns:

An InteractionDataset instance containing the provided data.

DRecPy.Dataset.mem_dataset module

class DRecPy.Dataset.mem_dataset.MemoryInteractionDataset(path='', columns=None, **kwds)

Bases: DRecPy.Dataset.dataset_abc.InteractionDatasetABC

apply(column, function)

Modifies the current dataset instance by applying a transformation to a specific column in every row.

Parameters:
  • column – A string that represents the name of the column that will be transformed.
  • function – The function that will be used to map the current column value in each row to the new one.
Returns:

None.

assign_internal_ids()

Assigns user and item internal ids. Internal ids are consecutive integer identifiers that uniquely represent each user or item. Two new columns are created on this dataset instance: “uid” and “iid”, for the user internal id and the item internal id, respectively.

Returns: None.

copy()

Copies the current dataset instance into a new one.

Returns: An InteractionDataset instance with the same data values as the current one.

drop(record_ids, copy=True, keep=False)

Remove (or keep) the provided list of record ids from the current InteractionDataset instance.

Parameters:
  • record_ids – A list of integers representing record ids.
  • copy – A boolean indicating whether the removal (or retention) is applied to a new InteractionDataset instance, or whether the current InteractionDataset instance is modified instead (copy=False). Default: True.
  • keep – A boolean indicating whether the provided record ids should be kept or removed (keep=False). Default: False.
Returns:

An instance of the InteractionDataset with (or without) the filtered records.

iid_to_item(iid)

Converts a given internal item id into its corresponding raw id. Raises an exception if no internal ids are assigned.

Parameters: iid – The item internal id.
Returns: An integer value representing the item raw id, or None if the provided internal item id does not exist.

item_to_iid(item)

Converts a given raw item id into its corresponding internal id. Raises an exception if no internal ids are assigned.

Parameters: item – The item raw id.
Returns: An integer value representing the item internal id, or None if the provided raw item id does not exist.

max(column=None)

Computes the maximum value for the provided column.

Parameters: column – The name of the column for which the maximum should be computed.
Returns: The maximum value present on the whole dataset, for the provided column name.

min(column=None)

Computes the minimum value for the provided column.

Parameters: column – The name of the column for which the minimum should be computed.
Returns: The minimum value present on the whole dataset, for the provided column name.

null_interaction_pair_generator(interaction_threshold=None, seed=None)

Provides a generator that yields negative / null interaction pairs.

Parameters:
  • interaction_threshold – An optional integer that is used as the boundary interaction value between positive and negative interaction pairs. All values greater than or equal to interaction_threshold are considered positive, and all values below it are considered negative. If None is provided, positive interactions are the ones present on the dataset, and all the others are considered negative. Default: None.
  • seed – An optional integer to be used as the seed value for the pseudo-random number generator used to sample null interaction pairs. Default: None.
Returns:

A generator that yields negative / null interaction pairs, that is, (user internal id, item internal id) tuples.

remove_internal_ids()

Removes user and item internal ids.

Returns: None.

save(path='', columns=None, write_header=False)

Persists the current dataset instance in the provided path, as a csv file. Note that internal identifiers, such as the row id (rid), user internal id (uid) and item internal id (iid) are never persisted, since they’re only useful during runtime.

Parameters:
  • path – A string that represents the path where the current dataset values will be persisted.
  • columns – An optional list with the names of the columns that should be persisted. Default: all columns.
  • write_header – A boolean indicating whether to write the csv header on the persisted file. Default: False.
Returns:

None.

select(query, copy=True)

Select rows from the InteractionDataset.

Parameters:
  • query – A string representing the query to be run. The query format should be: “column_name operator value”, where extra conditions should be separated by a ‘,’. E.g. “user == ‘123’, interaction > 3.5”.
  • copy – A boolean indicating whether to create a new InteractionDataset where the rows that satisfy the provided query are put, or if the filtered rows should be removed from the current InteractionDataset (copy=False). Default: True.
Returns:

An instance of an InteractionDataset containing the rows selected by the provided query.

select_item_interaction_vec(iid)

Compute the item interaction vector for the provided item internal id.

Parameters: iid – The item internal id that references the item that should have its interaction vector computed.
Returns: The interaction vector (a vector containing the interaction values of each user to the provided item, in order) of the provided iid.

select_one(query, columns=None, to_list=False)

Select the first resulting row for the provided query.

Parameters:
  • query – A string representing the query to be run. The query format should be: “column_name operator value”, where extra conditions should be separated by a ‘,’. E.g. “user == ‘123’, interaction > 3.5”.
  • columns – A list with the column names to be kept on the resulting record. Default: all.
  • to_list – A boolean indicating whether each data point should be returned as a dict or as a list. Default: False.

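Returns:

The first record that satisfies the provided query, represented as a dict or, when to_list is True, as a list.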

select_random_generator(query=None, seed=None)

Provides a generator that yields randomly selected dataset rows.

Parameters: query – A string representing the query to be run before selecting random rows. The query format should be: “column_name operator value”, where extra conditions should be separated by a ‘,’. E.g. “user == ‘123’, interaction > 3.5”.
Returns: A generator that yields dataset rows, where each row is represented as a dict.

select_user_interaction_vec(uid)

Compute the user interaction vector for the provided user internal id.

Parameters: uid – The user internal id that references the user that should have its interaction vector computed.
Returns: The interaction vector (a vector containing the interaction values of each item to the provided user, in order) of the provided uid.

uid_to_user(uid)

Converts a given internal user id into its corresponding raw id. Raises an exception if no internal ids are assigned.

Parameters: uid – The user internal id.
Returns: An integer value representing the user raw id, or None if the provided internal user id does not exist.

unique(columns=None, copy=True)

Return a new InteractionDataset instance containing only the unique values on the provided column combination.

Parameters:
  • columns – The column combination to take into account when computing unique values. Default: all.
  • copy – A boolean indicating whether a copy of the InteractionDataset should be made, or if it should be modified in-place. Default: True.
Returns:

An InteractionDataset instance containing the unique values on the provided column combination.

user_to_uid(user)

Converts a given raw user id into its corresponding internal id. Raises an exception if no internal ids are assigned.

Parameters: user – The user raw id.
Returns: An integer value representing the user internal id, or None if the provided raw user id does not exist.

values(columns=None, to_list=False)

Provides a generator that yields all the records present in the dataset.

Parameters:
  • columns – The list of columns that should be returned for each data point. Default: all.
  • to_list – A boolean indicating whether each data point should be returned as a dict or as a list. Default: False.
Returns:

A generator that yields records present in the dataset.

DRecPy.Dataset.db_dataset module

class DRecPy.Dataset.db_dataset.DatabaseInteractionDataset(path='', columns=None, **kwds)

Bases: DRecPy.Dataset.dataset_abc.InteractionDatasetABC

apply(column, function)

Modifies the current dataset instance by applying a transformation to a specific column in every row.

Parameters:
  • column – A string that represents the name of the column that will be transformed.
  • function – The function that will be used to map the current column value in each row to the new one.
Returns:

None.

assign_internal_ids()

Assigns user and item internal ids. Internal ids are consecutive integer identifiers that uniquely represent each user or item. Two new columns are created on this dataset instance: “uid” and “iid”, for the user internal id and the item internal id, respectively.

Returns: None.

close()

Cleanup method to delete temporary database files when they’re not in use anymore.

copy()

Copies the current dataset instance into a new one.

Returns: An InteractionDataset instance with the same data values as the current one.

count_unique(columns=None)

Count the number of unique values on the provided column combination.

Parameters: columns – A list containing the columns to take into account. Default: all.
Returns: The count of unique values present on the provided column combination.

drop(record_ids, copy=True, keep=False)

Remove (or keep) the provided list of record ids from the current InteractionDataset instance.

Parameters:
  • record_ids – A list of integers representing record ids.
  • copy – A boolean indicating whether the removal (or retention) is applied to a new InteractionDataset instance, or whether the current InteractionDataset instance is modified instead (copy=False). Default: True.
  • keep – A boolean indicating whether the provided record ids should be kept or removed (keep=False). Default: False.
Returns:

An instance of the InteractionDataset with (or without) the filtered records.

iid_to_item(iid)

Converts a given internal item id into its corresponding raw id. Raises an exception if no internal ids are assigned.

Parameters: iid – The item internal id.
Returns: An integer value representing the item raw id, or None if the provided internal item id does not exist.

item_to_iid(item)

Converts a given raw item id into its corresponding internal id. Raises an exception if no internal ids are assigned.

Parameters: item – The item raw id.
Returns: An integer value representing the item internal id, or None if the provided raw item id does not exist.

max(column=None)

Computes the maximum value for the provided column.

Parameters: column – The name of the column for which the maximum should be computed.
Returns: The maximum value present on the whole dataset, for the provided column name.

min(column=None)

Computes the minimum value for the provided column.

Parameters: column – The name of the column for which the minimum should be computed.
Returns: The minimum value present on the whole dataset, for the provided column name.

null_interaction_pair_generator(interaction_threshold=None, seed=None)

Provides a generator that yields negative / null interaction pairs.

Parameters:
  • interaction_threshold – An optional integer that is used as the boundary interaction value between positive and negative interaction pairs. All values greater than or equal to interaction_threshold are considered positive, and all values below it are considered negative. If None is provided, positive interactions are the ones present on the dataset, and all the others are considered negative. Default: None.
  • seed – An optional integer to be used as the seed value for the pseudo-random number generator used to sample null interaction pairs. Default: None.
Returns:

A generator that yields negative / null interaction pairs, that is, (user internal id, item internal id) tuples.

remove_internal_ids()

Removes user and item internal ids.

Returns: None.

save(path='', columns=None, write_header=False)

Persists the current dataset instance in the provided path, as a csv or sqlite file. Note that internal identifiers, such as the row id (rid), user internal id (uid) and item internal id (iid) are never persisted, since they’re only useful during runtime.

Parameters:
  • path – A string that represents the path where the current dataset values will be persisted. If it ends in “.sqlite” then a sqlite db file will be persisted in the provided path. Otherwise a csv file will be persisted.
  • columns – An optional list with the names of the columns that should be persisted. Default: all columns.
  • write_header – A boolean indicating whether to write the csv header on the persisted file. Default: False.
Returns:

None.
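
The extension-based choice between a sqlite file and a csv file can be sketched with the standard library alone. Everything specific here is an assumption for illustration: the function name save_rows, the table name interactions and the untyped columns are not taken from DRecPy.

```python
import csv
import os
import sqlite3
import tempfile


def save_rows(rows, path, columns):
    """Persist dict rows as a sqlite table if path ends in '.sqlite', else as csv."""
    if path.endswith(".sqlite"):
        connection = sqlite3.connect(path)
        try:
            column_list = ", ".join(f'"{column}"' for column in columns)
            placeholders = ", ".join("?" for _ in columns)
            # Hypothetical table name; columns are left untyped for brevity.
            connection.execute(f"CREATE TABLE interactions ({column_list})")
            connection.executemany(
                f"INSERT INTO interactions VALUES ({placeholders})",
                [tuple(row[column] for column in columns) for row in rows],
            )
            connection.commit()
        finally:
            connection.close()
    else:
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerows([row[column] for column in columns] for row in rows)


rows = [{"user": "u7", "item": "i3", "interaction": 4.0}]
columns = ["user", "item", "interaction"]
with tempfile.TemporaryDirectory() as folder:
    db_path = os.path.join(folder, "dataset.sqlite")
    csv_path = os.path.join(folder, "dataset.csv")
    save_rows(rows, db_path, columns)
    save_rows(rows, csv_path, columns)

    # Read both files back to confirm what was written.
    connection = sqlite3.connect(db_path)
    stored = connection.execute(
        'SELECT "user", "item", "interaction" FROM interactions'
    ).fetchall()
    connection.close()
    with open(csv_path, newline="") as handle:
        csv_lines = handle.read().splitlines()
```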

select(query, copy=True)

Select rows from the InteractionDataset.

Parameters:
  • query – A string representing the query to be run. The query format should be: “column_name operator value”, where extra conditions should be separated by a ‘,’. E.g. “user == ‘123’, interaction > 3.5”.
  • copy – A boolean indicating whether to create a new InteractionDataset where the rows that satisfy the provided query are put, or if the filtered rows should be removed from the current InteractionDataset (copy=False). Default: True.
Returns:

An instance of an InteractionDataset containing the rows selected by the provided query.

select_item_interaction_vec(iid)

Compute the item interaction vector for the provided item internal id.

Parameters: iid – The item internal id that references the item that should have its interaction vector computed.
Returns: The interaction vector (a vector containing the interaction values of each user to the provided item, in order) of the provided iid.

select_one(query, columns=None, to_list=False)

Select the first resulting row for the provided query.

Parameters:
  • query – A string representing the query to be run. The query format should be: “column_name operator value”, where extra conditions should be separated by a ‘,’. E.g. “user == ‘123’, interaction > 3.5”.
  • columns – A list with the column names to be kept on the resulting record. Default: all.
  • to_list – A boolean indicating whether each data point should be returned as a dict or as a list. Default: False.

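Returns:

The first record that satisfies the provided query, represented as a dict or, when to_list is True, as a list.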

select_random_generator(query=None, seed=None)

Provides a generator that yields randomly selected dataset rows.

Parameters: query – A string representing the query to be run before selecting random rows. The query format should be: “column_name operator value”, where extra conditions should be separated by a ‘,’. E.g. “user == ‘123’, interaction > 3.5”.
Returns: A generator that yields dataset rows, where each row is represented as a dict.

select_user_interaction_vec(uid)

Compute the user interaction vector for the provided user internal id.

Parameters: uid – The user internal id that references the user that should have its interaction vector computed.
Returns: The interaction vector (a vector containing the interaction values of each item to the provided user, in order) of the provided uid.

uid_to_user(uid)

Converts a given internal user id into its corresponding raw id. Raises an exception if no internal ids are assigned.

Parameters: uid – The user internal id.
Returns: An integer value representing the user raw id, or None if the provided internal user id does not exist.

unique(columns=None, copy=True)

Return a new InteractionDataset instance containing only the unique values on the provided column combination.

Parameters:
  • columns – The column combination to take into account when computing unique values. Default: all.
  • copy – A boolean indicating whether a copy of the InteractionDataset should be made, or if it should be modified in-place. Default: True.
Returns:

An InteractionDataset instance containing the unique values on the provided column combination.

user_to_uid(user)

Converts a given raw user id into its corresponding internal id. Raises an exception if no internal ids are assigned.

Parameters: user – The user raw id.
Returns: An integer value representing the user internal id, or None if the provided raw user id does not exist.

values(columns=None, to_list=False)

Provides a generator that yields all the records present in the dataset.

Parameters:
  • columns – The list of columns that should be returned for each data point. Default: all.
  • to_list – A boolean indicating whether each data point should be returned as a dict or as a list. Default: False.
Returns:

A generator that yields records present in the dataset.

DRecPy.Dataset.integrated_datasets module

class DRecPy.Dataset.integrated_datasets.DatasetReadConfig(url, full_file, columns, delimiter, encoding='utf8', train_file=None, test_file=None, unzip_folder=None, has_header=False, unzip=True)

Bases: object

DRecPy.Dataset.integrated_datasets.available_datasets()

Returns a list of the datasets available to download.

DRecPy.Dataset.integrated_datasets.download_dataset(ds_name)

Downloads the dataset whose name is passed as argument.

DRecPy.Dataset.integrated_datasets.get_dataset(ds_name, path, is_generated=False, force_out_of_memory=False, verbose=True, **kwds)

Returns an InteractionDataset containing the data present in the path argument, using the settings defined for the dataset specified in the ds_name argument. Downloads the dataset if it is not already stored.

DRecPy.Dataset.integrated_datasets.get_full_dataset(ds_name, force_out_of_memory=False, verbose=True, **kwds)

Gets a full dataset. Might download the dataset if it hasn’t been downloaded before.

Parameters:
  • ds_name – A string with the name of the requested dataset. This name should be present in the list returned by available_datasets(), otherwise an error will be raised.
  • force_out_of_memory – A boolean indicating whether to force the dataset to be loaded out of memory. Default: False.
  • verbose – A boolean indicating whether to log info messages or not. Default: True.
Returns:

An InteractionDataset containing the dataset.

DRecPy.Dataset.integrated_datasets.get_test_dataset(ds_name, force_out_of_memory=False, verbose=True, **kwds)

Gets a test dataset. If the named dataset does not have a specific test file (example: BX dataset), a test InteractionDataset will be created using leave_k_out() from the Evaluation module on the full dataset. The split is deterministic (i.e. has a defined seed value). Might download the dataset if it hasn’t been downloaded before.

Parameters:
  • ds_name – A string with the name of the requested dataset. This name should be present in the list returned by available_datasets(), otherwise an error will be raised.
  • force_out_of_memory – A boolean indicating whether to force the dataset to be loaded out of memory. Default: False.
  • verbose – A boolean indicating whether to log info messages or not. Default: True.
Returns:

An InteractionDataset containing the test dataset.
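
For datasets without a dedicated test file, the description above relies on a seeded leave-k-out split. The sketch below shows one way such a deterministic split can work on plain (user, item, interaction) tuples. It is not the leave_k_out() function from the Evaluation module: the name leave_k_out_split, its parameters, the default seed and the rule for users with too few records are all assumptions.

```python
import random
from collections import defaultdict


def leave_k_out_split(records, k=1, seed=0):
    """Move k randomly chosen records per user into the test set."""
    by_user = defaultdict(list)
    for record in records:
        by_user[record[0]].append(record)

    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for user in sorted(by_user):  # fixed iteration order, independent of input order
        user_records = by_user[user]
        if len(user_records) <= k:
            train.extend(user_records)  # assumption: too few records, hold none out
            continue
        held_out = set(rng.sample(range(len(user_records)), k))
        for index, record in enumerate(user_records):
            (test if index in held_out else train).append(record)
    return train, test


records = [
    ("u1", "i1", 5.0), ("u1", "i2", 3.0), ("u1", "i3", 4.0),
    ("u2", "i1", 1.0), ("u2", "i4", 2.0),
    ("u3", "i2", 4.0),
]
train, test = leave_k_out_split(records, k=1, seed=0)
```

Calling the function again with the same seed yields the same train and test lists, which is the property the deterministic split depends on.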

DRecPy.Dataset.integrated_datasets.get_train_dataset(ds_name, force_out_of_memory=False, verbose=True, **kwds)

Gets a train dataset. If the named dataset does not have a specific train file (example: BX dataset), a train InteractionDataset will be created using leave_k_out() from the Evaluation module on the full dataset. The split is deterministic (i.e. has a defined seed value). Might download the dataset if it hasn’t been downloaded before.

Parameters:
  • ds_name – A string with the name of the requested dataset. This name should be present in the list returned by available_datasets(), otherwise an error will be raised.
  • force_out_of_memory – A boolean indicating whether to force the dataset to be loaded out of memory. Default: False.
  • verbose – A boolean indicating whether to log info messages or not. Default: True.
Returns:

An InteractionDataset containing the train dataset.