Getting Started

Basic usage

Training a model

Recommenders built with the DRecPy framework follow the usual method definitions: fit() fits the model to the provided data, while predict(), rank() and recommend() produce predictions. Once a model is trained, you can evaluate it with a custom evaluation process or with one of the built-in processes defined in the DRecPy.Evaluation package.
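To make the division of labour between these four methods concrete, here is a minimal sketch of a toy recommender that exposes the same method names. ToyPopularityRecommender is a hypothetical class written for illustration only: it is not part of DRecPy, it does not inherit from any DRecPy base class, and its argument lists are assumptions rather than the framework's actual signatures.

```python
from collections import Counter


class ToyPopularityRecommender:
    """Hypothetical stand-in, NOT a DRecPy class: scores items by interaction count."""

    def fit(self, interactions):
        # interactions: iterable of (user, item) pairs (assumed input format)
        interactions = list(interactions)
        self.popularity = Counter(item for _, item in interactions)
        self.seen = {}
        for user, item in interactions:
            self.seen.setdefault(user, set()).add(item)
        return self

    def predict(self, user, item):
        # Score for a single (user, item) pair; this toy model ignores the user.
        return self.popularity.get(item, 0)

    def rank(self, user, items):
        # Order a caller-supplied candidate list from best to worst.
        return sorted(items, key=lambda i: self.predict(user, i), reverse=True)

    def recommend(self, user, n):
        # Pick the n best items the user has not interacted with yet.
        unseen = [i for i in self.popularity if i not in self.seen.get(user, set())]
        return self.rank(user, unseen)[:n]


model = ToyPopularityRecommender().fit(
    [('a', 'x'), ('b', 'x'), ('b', 'y'), ('c', 'z'), ('c', 'y'), ('c', 'x')]
)
print(model.predict('a', 'x'))            # 3
print(model.rank('a', ['z', 'x', 'y']))   # ['x', 'y', 'z']
print(model.recommend('a', 2))            # ['y', 'z']
```

In this sketch, predict() scores one pair, rank() orders candidates chosen by the caller, and recommend() chooses the candidates itself.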

Here’s a quick example of training the CDAE recommender on the MovieLens-100k data set for 50 epochs and evaluating its ranking performance on 100 test users. Note that a seed parameter is passed both when instantiating the CDAE object and when calling the evaluation process, so that the pipeline is deterministic.

From file examples/cdae.py

from DRecPy.Recommender import CDAE
from DRecPy.Dataset import get_train_dataset
from DRecPy.Dataset import get_test_dataset
from DRecPy.Evaluation.Processes import ranking_evaluation
import time

ds_train = get_train_dataset('ml-100k')
ds_test = get_test_dataset('ml-100k')

start_train = time.time()
cdae = CDAE(hidden_factors=50, corruption_level=0.2, loss='bce', seed=10)
cdae.fit(ds_train, learning_rate=0.001, reg_rate=0.001, epochs=50, batch_size=64, neg_ratio=5)
print("Training took", time.time() - start_train)

print(ranking_evaluation(cdae, ds_test, k=[1, 5, 10], novelty=True, n_test_users=100, n_pos_interactions=1,
                         n_neg_interactions=100, generate_negative_pairs=True, seed=10, max_concurrent_threads=4,
                         verbose=True))
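The evaluation call above ranks one held-out positive interaction per test user against 100 sampled negatives and reports results at several cutoffs k. The sketch below illustrates that idea with a plain hit ratio and a seeded sampler. It is an assumption-laden illustration, not the implementation of ranking_evaluation: the function name hit_ratio_at_k, its arguments and the choice of metric are all hypothetical.

```python
import random


def hit_ratio_at_k(score_fn, test_pairs, all_items, seen, k, n_neg=100, seed=10):
    """Fraction of test users whose held-out item lands in the top k among sampled negatives."""
    rng = random.Random(seed)  # fixed seed: the same negatives are drawn on every run
    hits = 0
    for user, pos_item in test_pairs:
        # Candidate negatives: items the user has not interacted with.
        candidates = sorted(
            i for i in all_items if i != pos_item and i not in seen.get(user, set())
        )
        negatives = rng.sample(candidates, min(n_neg, len(candidates)))
        ranked = sorted(negatives + [pos_item], key=lambda i: score_fn(user, i), reverse=True)
        hits += pos_item in ranked[:k]
    return hits / len(test_pairs)


# Toy scorer (hypothetical): lower item ids always receive higher scores.
def toy_score(user, item):
    return -item


test_pairs = [(0, 5), (1, 150)]  # (user, held-out positive item)
result = hit_ratio_at_k(toy_score, test_pairs, range(200), seen={}, k=10)
print(result)  # 0.5
```

Because the sampler is seeded, repeated calls with the same arguments draw the same negatives, which is the property the seed parameters in the example above are meant to provide.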

Data Set usage

This section is only a brief introduction to importing and using data sets. To learn more about the public methods offered by the InteractionDataset module, please read its API documentation.

Importing a built-in data set

DRecPy currently provides several built-in data sets, including MovieLens (100k, 1M, 10M and 20M) and Book Crossing. The first time you use a built-in data set, a folder named “.DRecPy_data” is created in your home directory. To store these data sets elsewhere, set the DATA_FOLDER environment variable to the intended path.
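The lookup described above can be pictured with the following sketch. The helper resolve_data_folder is hypothetical and is not part of DRecPy; it only mirrors the documented behaviour (use DATA_FOLDER when it is set, otherwise fall back to “.DRecPy_data” in the home directory).

```python
import os


def resolve_data_folder(env=os.environ):
    """Hypothetical helper: return DATA_FOLDER if set, else ~/.DRecPy_data."""
    default = os.path.join(os.path.expanduser('~'), '.DRecPy_data')
    return env.get('DATA_FOLDER', default)


# With the variable set, the custom path wins.
print(resolve_data_folder({'DATA_FOLDER': '/tmp/drecpy_cache'}))  # /tmp/drecpy_cache

# Without it, the default folder in the home directory is used.
print(resolve_data_folder({}))
```

One way to set the variable from Python is os.environ['DATA_FOLDER'] = '/tmp/drecpy_cache', placed before any data set is loaded; this assumes the library reads the variable after that assignment, which is not confirmed here.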

The example below shows how to use a built-in data set and how to manipulate it using the provided methods:

From file examples/integrated_datasets.py

from DRecPy.Dataset import get_train_dataset
from DRecPy.Dataset import get_test_dataset
from DRecPy.Dataset import get_full_dataset
from DRecPy.Dataset import available_datasets

print('Available datasets', available_datasets())

# Reading the ml-100k full dataset and prebuilt train and test datasets.
print('ml-100k full dataset', get_full_dataset('ml-100k'))
print('ml-100k train dataset', get_train_dataset('ml-100k'))
print('ml-100k test dataset', get_test_dataset('ml-100k'))

# Reading the ml-1m full dataset and generated train and test datasets using out of memory storage.
print('ml-1m full dataset', get_full_dataset('ml-1m', force_out_of_memory=True))
print('ml-1m train dataset', get_train_dataset('ml-1m', force_out_of_memory=True))
print('ml-1m test dataset', get_test_dataset('ml-1m', force_out_of_memory=True))

# Showcase some dataset operations
ds_ml = get_full_dataset('ml-100k')
print('Minimum rating value:', ds_ml.min('interaction'))
print('Unique rating values:', ds_ml.unique('interaction').values_list())

max_rating = ds_ml.max('interaction')  # compute the maximum once, before any value changes
ds_ml.apply('interaction', lambda x: x / max_rating)  # scale ratings by the maximum rating
print('New values', ds_ml.values_list()[:5])

Importing a custom data set

Custom data sets are also supported: provide the path to the CSV file together with the column names and the delimiter.

From file examples/custom_datasets.py

from DRecPy.Dataset import InteractionDataset
from os import remove

# create file with sample dataset
with open('tmp.csv', 'w') as f:
    f.write('"john","ps4",4.5\n')
    f.write('"patrick","xbox",4.1\n')
    f.write('"anna","brush",3.6\n')
    f.write('"david","tv",2.0\n')

# load dataset into memory
ds_memory = InteractionDataset('tmp.csv', columns=['user', 'item', 'interaction'])
print('all values:', ds_memory.values_list())
print('filtered values:', ds_memory.select('interaction > 3.5').values_list())
ds_memory_scaled = ds_memory.copy()
ds_memory_scaled.apply('interaction', lambda x: x / ds_memory.max('interaction'))
print('all values scaled:', ds_memory_scaled.values_list())

# load dataset out of memory
ds_out_of_memory = InteractionDataset('tmp.csv', columns=['user', 'item', 'interaction'], in_memory=False)
print('all values:', ds_out_of_memory.values_list())
print('filtered values:', ds_out_of_memory.select('interaction > 3.5').values_list())

remove('tmp.csv')  # delete previously created sample dataset file

Note that there are 3 required columns: user, item and interaction.
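Before handing a file to InteractionDataset, it can help to confirm that the column names you intend to pass include the three required ones. The sketch below does this with the standard library only. The helper read_rows is hypothetical, the semicolon delimiter and the extra timestamp column are assumptions made for illustration, and nothing here states how DRecPy itself validates its input.

```python
import csv
import io

REQUIRED_COLUMNS = ('user', 'item', 'interaction')


def read_rows(lines, columns, delimiter=','):
    """Hypothetical helper: parse header-less CSV lines after checking the column names."""
    missing = [c for c in REQUIRED_COLUMNS if c not in columns]
    if missing:
        raise ValueError(f'missing required columns: {missing}')
    reader = csv.DictReader(lines, fieldnames=columns, delimiter=delimiter)
    return [dict(row) for row in reader]


# Sample file content using ';' as delimiter and one extra column (both assumptions).
sample = io.StringIO('john;ps4;4.5;1000\npatrick;xbox;4.1;1001\n')
rows = read_rows(sample, ['user', 'item', 'interaction', 'timestamp'], delimiter=';')
print(rows[0])
# {'user': 'john', 'item': 'ps4', 'interaction': '4.5', 'timestamp': '1000'}
```

Values come back as strings here; converting the interaction field to a number is left to the caller.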