DRecPy.Evaluation.Splits package

DRecPy.Evaluation.Splits.leave_k_out module

DRecPy.Evaluation.Splits.leave_k_out.leave_k_out(interaction_dataset, k=1, min_user_interactions=0, last_timestamps=False, timestamp_label='timestamp', seed=0, max_concurrent_threads=4, **kwds)

Dataset split method that uses a leave-k-out strategy. More specifically, for each user with more than k interactions, k interactions are selected (randomly, or by latest timestamp when last_timestamps is True), removed from the train set and placed in the test set. As a result, every user present in the test set is also present in the train set. Users with fewer than min_user_interactions interactions are removed from both sets. This function is not thread-safe (i.e. concurrent calls might produce unexpected results). Rather than calling it concurrently, increase the max_concurrent_threads argument to speed up the process (if enough cores are available).

Parameters:
  • interaction_dataset – An InteractionDataset instance containing the user-item interactions.
  • k – Optional integer or float value: if k is an integer, it represents the number of interactions, per user, to use in the test set (and to remove from the train set); if k is a float value between 0 and 1, it represents the fraction of interactions, per user, to use in the test set (and to remove from the train set). Default: 1.
  • min_user_interactions – Optional integer that represents the minimum number of interactions each user needs to have to be included in the train or test set. Default: 0.
  • last_timestamps – Optional boolean that indicates whether the test records should be the ones with the latest timestamps (read from the column named by the timestamp_label argument) instead of being randomly sampled. Default: False.
  • timestamp_label – Optional string with the name of the timestamp column in the interaction_dataset. This is only used when the last_timestamps argument is set to True. Default: 'timestamp'.
  • max_concurrent_threads – An optional integer representing the max concurrent threads to use. Default: 4.
  • seed – An integer that is used as a seed value for the pseudorandom number generator. Default: 0.
  • verbose – Optional boolean that indicates if a progress bar showing the splitting progress should be displayed or not. Default: True.
Returns:

The train and test interaction datasets, in this order.

Return type:

Two InteractionDataset instances
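
The per-user selection described above can be illustrated with a small standalone sketch. It is not DRecPy's implementation: it works on plain (user, item) tuples instead of an InteractionDataset, ignores last_timestamps and threading, and the name leave_k_out_sketch is hypothetical. It also assumes that a float k is rounded down to a whole count and that users with k or fewer interactions keep all their rows in the train set; the library may handle these cases differently.

```python
import random
from collections import defaultdict


def leave_k_out_sketch(rows, k=1, min_user_interactions=0, seed=0):
    """Illustrative leave-k-out split over (user, item) tuples; not DRecPy's code."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    train, test = [], []
    for user_rows in by_user.values():
        n = len(user_rows)
        if n < min_user_interactions:
            continue  # user is dropped from both sets
        # Integer k: absolute count. Float k: fraction of this user's rows
        # (assumption: rounded down).
        n_test = k if isinstance(k, int) else int(n * k)
        if n_test == 0 or n <= n_test:
            train.extend(user_rows)  # assumption: nothing is held out for this user
            continue
        held_out = set(rng.sample(range(n), n_test))
        for i, row in enumerate(user_rows):
            (test if i in held_out else train).append(row)
    return train, test


rows = [("u1", i) for i in range(5)] + [("u2", i) for i in range(2)] + [("u3", 0)]
train_rows, test_rows = leave_k_out_sketch(rows, k=1, min_user_interactions=2)
frac_train, frac_test = leave_k_out_sketch(rows, k=0.4)
```

In this toy data, u3 has a single interaction and is removed when min_user_interactions=2, while u1 and u2 each contribute one row to the test set and keep the rest in the train set.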

DRecPy.Evaluation.Splits.matrix_split module

DRecPy.Evaluation.Splits.matrix_split.matrix_split(interaction_dataset, user_test_ratio=0.2, item_test_ratio=0.2, min_user_interactions=0, seed=0, max_concurrent_threads=4, **kwds)

Dataset split method that uses a matrix split strategy. More specifically, item_test_ratio items from user_test_ratio users are sampled out of the full dataset and moved to the test set, while the remaining items and users make up the training set. If all records for a given user would be moved into the test set, the split for that user is skipped and their records are kept in the train set. This function is not thread-safe (i.e. concurrent calls might produce unexpected results). Rather than calling it concurrently, increase the max_concurrent_threads argument to speed up the process (if enough cores are available).

Parameters:
  • interaction_dataset – An InteractionDataset instance containing the user-item interactions.
  • user_test_ratio – Optional float value that represents the fraction of users to be sampled to the test set. Default: 0.2.
  • item_test_ratio – Optional float value that represents the fraction of items to be sampled to the test set. Default: 0.2.
  • min_user_interactions – Optional integer that represents the minimum number of interactions each user needs to have to be included in the train or test set. Default: 0.
  • max_concurrent_threads – An optional integer representing the max concurrent threads to use. Default: 4.
  • seed – An integer that is used as a seed value for the pseudorandom number generator. Default: 0.
  • verbose – Optional boolean that indicates if a progress bar showing the splitting progress should be displayed or not. Default: True.
Returns:

The train and test interaction datasets, in this order.

Return type:

Two InteractionDataset instances
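
A minimal standalone sketch of the matrix split idea follows. It is not DRecPy's implementation and the name matrix_split_sketch is hypothetical. It assumes one plausible reading of the description: a user_test_ratio fraction of users is sampled, and for each sampled user an item_test_ratio fraction of that user's own interactions is moved to the test set. It also assumes ratios are rounded to the nearest whole count, and it omits min_user_interactions and threading.

```python
import random
from collections import defaultdict


def matrix_split_sketch(rows, user_test_ratio=0.2, item_test_ratio=0.2, seed=0):
    """Illustrative matrix split over (user, item) tuples; not DRecPy's code."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    users = sorted(by_user)
    # Assumption: ratios are rounded to the nearest whole count.
    sampled_users = set(rng.sample(users, round(len(users) * user_test_ratio)))

    train, test = [], []
    for user in users:
        user_rows = by_user[user]
        n_test = round(len(user_rows) * item_test_ratio) if user in sampled_users else 0
        if n_test >= len(user_rows):
            n_test = 0  # every record would move: skip the split for this user
        held_out = set(rng.sample(range(len(user_rows)), n_test))
        for i, row in enumerate(user_rows):
            (test if i in held_out else train).append(row)
    return train, test


# 10 users with 5 interactions each.
rows = [(f"u{u}", i) for u in range(10) for i in range(5)]
train_rows, test_rows = matrix_split_sketch(rows)

# A user whose only record would be moved is kept entirely in the train set.
skip_train, skip_test = matrix_split_sketch([("a", 1)], 1.0, 1.0)
```

With the default ratios, two of the ten users are sampled and each gives up one of its five interactions, so every test user still has rows left in the train set.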

DRecPy.Evaluation.Splits.random_split module

DRecPy.Evaluation.Splits.random_split.random_split(interaction_dataset, test_ratio=0.25, seed=0, **kwds)

Random split that creates a train set with (100-100*test_ratio)% of the total rows, and a test set with the other (100*test_ratio)% of the rows. No guarantee is made that users or items exist in both datasets, so cases such as user X existing in the test set but not in the train set might happen. This split should therefore only be used with models that support this type of behaviour.

Parameters:
  • interaction_dataset – An InteractionDataset instance containing the user-item interactions.
  • test_ratio – A floating-point value representing the ratio of rows used for the test set. Default: 0.25.
  • seed – An integer that is used as a seed value for the pseudorandom number generator. If none is given, no seed will be used. Default: 0.
  • verbose – Optional boolean that indicates if a progress bar showing the splitting progress should be displayed or not. Default: True.
Returns:

The train and test interaction datasets, in this order.

Return type:

Two InteractionDataset instances
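
The row-level behaviour of a random split can be sketched as below. This is an illustration on plain (user, item) tuples rather than DRecPy's implementation; the name random_split_sketch is hypothetical, and rounding the test size down is an assumption. The last line shows how to check for users that ended up only in the test set, which this strategy does not prevent.

```python
import random


def random_split_sketch(rows, test_ratio=0.25, seed=0):
    """Illustrative row-level random split; not DRecPy's code."""
    rng = random.Random(seed)  # seed=None falls back to a non-deterministic seed
    shuffled = list(rows)
    rng.shuffle(shuffled)
    # Assumption: the test size is rounded down to a whole number of rows.
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


rows = [(f"u{u}", i) for u in range(4) for i in range(2)]
train_rows, test_rows = random_split_sketch(rows, test_ratio=0.25)

# Users that appear in the test set without any train rows (may be empty).
test_only_users = {u for u, _ in test_rows} - {u for u, _ in train_rows}
```

Models that cannot score unseen users would need such users filtered out, or a different split such as leave_k_out.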