Getting Started
Basic usage
Training a model
Recommenders built using the DRecPy framework follow the usual method definitions: fit() to fit the model to the provided data, and predict(), rank() or recommend() to provide predictions. Once trained, a model can be evaluated either with a custom evaluation process or with one of the built-in ones defined in the DRecPy.Evaluation package.
Here’s a quick example of training the CDAE recommender on the MovieLens-100k data set for 50 epochs and evaluating its ranking performance on 100 test users. Note that a seed parameter is passed both when instantiating the CDAE object and when calling the evaluation process, so that the pipeline is deterministic.
examples/cdae.py
from DRecPy.Recommender import CDAE
from DRecPy.Dataset import get_train_dataset
from DRecPy.Dataset import get_test_dataset
from DRecPy.Evaluation.Processes import ranking_evaluation
import time
ds_train = get_train_dataset('ml-100k')
ds_test = get_test_dataset('ml-100k')
start_train = time.time()
cdae = CDAE(hidden_factors=50, corruption_level=0.2, loss='bce', seed=10)
cdae.fit(ds_train, learning_rate=0.001, reg_rate=0.001, epochs=50, batch_size=64, neg_ratio=5)
print("Training took", time.time() - start_train)
print(ranking_evaluation(cdae, ds_test, k=[1, 5, 10], novelty=True, n_test_users=100, n_pos_interactions=1,
                         n_neg_interactions=100, generate_negative_pairs=True, seed=10, max_concurrent_threads=4,
                         verbose=True))
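To make the evaluation parameters above more concrete, the following sketch shows one common way a sampled ranking evaluation can work: each test user's held-out positive item is ranked against a seeded random sample of other items, and the metric counts how often the positive lands in the top k. This is not DRecPy's implementation. The names hit_ratio_at_k, score_fn and the toy data are hypothetical, and treating every other item as a candidate negative is a simplifying assumption.

```python
import random


def hit_ratio_at_k(score_fn, positives, all_items, k, n_neg=100, seed=10):
    """Fraction of users whose held-out positive item ranks in the top k
    among itself plus n_neg randomly sampled other items (illustrative only)."""
    rng = random.Random(seed)  # fixed seed: the same negatives are drawn on every run
    hits = 0
    for user, pos_item in positives.items():
        # Simplification: any item other than the positive may be sampled as a negative.
        candidates = [item for item in all_items if item != pos_item]
        negatives = rng.sample(candidates, min(n_neg, len(candidates)))
        ranked = sorted(negatives + [pos_item],
                        key=lambda item: score_fn(user, item), reverse=True)
        if pos_item in ranked[:k]:
            hits += 1
    return hits / len(positives)


# Hypothetical toy data: 200 items and one held-out positive item per test user.
all_items = list(range(200))
positives = {0: 5, 1: 17, 2: 42}


def perfect_scorer(user, item):
    # Always scores the held-out positive item highest.
    return 1.0 if item == positives[user] else 0.0


def biased_scorer(user, item):
    # Ignores the user and prefers items whose id is close to 100.
    return -abs(item - 100)


perfect_hr = hit_ratio_at_k(perfect_scorer, positives, all_items, k=1)
run_a = hit_ratio_at_k(biased_scorer, positives, all_items, k=10)
run_b = hit_ratio_at_k(biased_scorer, positives, all_items, k=10)
print(perfect_hr, run_a == run_b)
```

Because the sampler is seeded, the two runs with the same scorer draw identical negatives and return the same value, which is the kind of reproducibility the seed arguments in the example are meant to provide.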
Data Set usage
To learn more about the public methods offered by the InteractionDataset module, please read the respective API documentation. This section is simply a brief introduction to importing and using data sets.
Importing a built-in data set
At the moment, DRecPy provides several built-in data sets, including MovieLens (100k, 1M, 10M and 20M) and Book Crossing. The first time you use a built-in data set, a new folder called “.DRecPy_data” is created in your home directory. To save these data sets elsewhere, set the DATA_FOLDER environment variable to the intended path.
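As a minimal sketch, the DATA_FOLDER variable can be set from Python itself. The target directory below is a hypothetical choice, and setting the variable before importing DRecPy is an assumption made to be safe, since the text does not say when the variable is read. The DRecPy calls are left commented out so the snippet runs without the package installed.

```python
import os
import tempfile

# Hypothetical target directory; any writable path would do.
custom_folder = os.path.join(tempfile.gettempdir(), "drecpy_data")
os.makedirs(custom_folder, exist_ok=True)

# Assumption: set the variable before DRecPy is imported, so that it is
# already in place whenever the library looks up its data folder.
os.environ["DATA_FOLDER"] = custom_folder

# Uncomment when DRecPy is installed:
# from DRecPy.Dataset import get_train_dataset
# ds_train = get_train_dataset('ml-100k')

print("DATA_FOLDER is set to:", os.environ["DATA_FOLDER"])
```

Exporting the variable in the shell before launching Python should have the same effect.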
The example below shows how to use a built-in data set and how to manipulate it using the provided methods:
examples/integrated_datasets.py
from DRecPy.Dataset import get_train_dataset
from DRecPy.Dataset import get_test_dataset
from DRecPy.Dataset import get_full_dataset
from DRecPy.Dataset import available_datasets
print('Available datasets', available_datasets())
# Reading the ml-100k full dataset and prebuilt train and test datasets.
print('ml-100k full dataset', get_full_dataset('ml-100k'))
print('ml-100k train dataset', get_train_dataset('ml-100k'))
print('ml-100k test dataset', get_test_dataset('ml-100k'))
# Reading the ml-1m full dataset and generated train and test datasets using out of memory storage.
print('ml-1m full dataset', get_full_dataset('ml-1m', force_out_of_memory=True))
print('ml-1m train dataset', get_train_dataset('ml-1m', force_out_of_memory=True))
print('ml-1m test dataset', get_test_dataset('ml-1m', force_out_of_memory=True))
# Showcase some dataset operations
ds_ml = get_full_dataset('ml-100k')
print('Minimum rating value:', ds_ml.min('interaction'))
print('Unique rating values:', ds_ml.unique('interaction').values_list())
max_rating = ds_ml.max('interaction')  # computed once, instead of once per rating inside the lambda
ds_ml.apply('interaction', lambda x: x / max_rating)  # scale each rating by the maximum rating
print('New values', ds_ml.values_list()[:5])
Importing a custom data set
Custom data sets are also supported: provide the path to the CSV file, together with the column names and the delimiter.
examples/custom_datasets.py
from DRecPy.Dataset import InteractionDataset
from os import remove
# create file with sample dataset
with open('tmp.csv', 'w') as f:
    f.write('"john","ps4",4.5\n')
    f.write('"patrick","xbox",4.1\n')
    f.write('"anna","brush",3.6\n')
    f.write('"david","tv",2.0\n')
# load dataset into memory
ds_memory = InteractionDataset('tmp.csv', columns=['user', 'item', 'interaction'])
print('all values:', ds_memory.values_list())
print('filtered values:', ds_memory.select('interaction > 3.5').values_list())
ds_memory_scaled = ds_memory.copy()
ds_memory_scaled.apply('interaction', lambda x: x / ds_memory.max('interaction'))
print('all values scaled:', ds_memory_scaled.values_list())
# load dataset out of memory
ds_out_of_memory = InteractionDataset('tmp.csv', columns=['user', 'item', 'interaction'], in_memory=False)
print('all values:', ds_out_of_memory.values_list())
print('filtered values:', ds_out_of_memory.select('interaction > 3.5').values_list())
remove('tmp.csv') # delete previously created sample dataset file
Note that there are 3 required columns: user, item and interaction.
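When the interactions already live in Python objects, the sample file could also be produced with the standard csv module rather than hand-written strings. The sketch below writes the three required columns in the same headerless layout as the example above. The write_interactions helper and the sample rows are hypothetical, and whether InteractionDataset accepts a header row is not stated here, so none is written.

```python
import csv
import os
import tempfile

# Hypothetical sample rows in (user, item, interaction) order.
rows = [
    ("john", "ps4", 4.5),
    ("patrick", "xbox", 4.1),
    ("anna", "brush", 3.6),
    ("david", "tv", 2.0),
]


def write_interactions(path, rows):
    """Write (user, item, interaction) rows as a headerless CSV file."""
    with open(path, "w", newline="") as f:
        # QUOTE_NONNUMERIC quotes the string fields and leaves the numbers bare,
        # matching the layout of the hand-written sample file.
        writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
        for user, item, interaction in rows:
            writer.writerow([str(user), str(item), float(interaction)])


csv_path = os.path.join(tempfile.mkdtemp(), "interactions.csv")
write_interactions(csv_path, rows)

with open(csv_path) as f:
    written = f.read()
print(written)
```

The resulting path can then be passed to InteractionDataset with columns=['user', 'item', 'interaction'], as in the example above.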