# actionrwd.reward

## actionrwd.reward(..) 🡒 (zedfunc, zedfunc), process

The action reward function actionrwd.reward() returns a pair of zedfuncs aligned with marginal outcomes of the nth unit being ordered. The first zedfunc, the sell-through, is the expectation of selling the unit within the relevant time frame. The second zedfunc, the holding time, is the expected number of periods where the unit will be kept in stock. The action reward emphasizes a discrete demand and lead time model centered around a single SKU.

### Overview

The action reward function is intended to compute the profitability of ordering one more unit of stock for a given SKU, taking into account a probabilistic demand forecast, a probabilistic lead time forecast and a few other variables. This function is intended to be composed with a short series of economic variables such as the expected marginal reward, the per-period carrying cost and the stock-out penalty. However, those variables are composed externally and are not part of the action reward itself.

The action reward supersedes the stock reward. The action reward properly handles (a) non-stationary future demand, (b) non-deterministic lead time, (c) fine-grained information about the stock on order and its estimated arrival time, (d) decoupled ordering lead times and supply lead times and (e) an ownership perspective on the economic consequences of the decision.

• The action reward models the probabilistic future demand through a trajectory generation mechanism which supports non-stationary patterns such as seasonality, trend or lifecycle.
• The lead time also benefits from probabilistic modeling. This ensures that the reward estimate properly factors in the risk of delays associated with the incoming orders.
• The stock on order is modeled explicitly with a time component which reflects the estimated arrival date of the incoming batch(es). The model backing the action reward implicitly assumes that the unserviced demand is lost, not merely delayed.
• The ordering frequency is distinguished from the supply lead time, instead of lumping the two lead time values together.
• The “ownership” perspective means that each +1 ordering increment is assessed in its capacity to serve the demand in a way that a later decision could not. Yet, it assigns all the resulting carrying costs to this original decision.

The action reward function operates in the ordering space, that is, it estimates the marginal economic returns of ordering 0, 1, 2, … units of stock. This perspective is slightly different from the stock reward function, which operates in the stocking space, that is, it estimates the economic returns of having 0, 1, 2, … units committed to the stock.

The notion of horizon in the action reward diverges in subtle but important ways compared to how the same notion is leveraged in the stock reward.

From the action reward perspective, at the level of a decision, the horizon is defined from an ownership (i.e. responsibility) perspective, and this horizon varies depending on whether the margin and stock-out penalty are considered or the carrying cost. For the margin and the stock-out penalty, the action is only “rewarded” (positively or negatively) for a duration equal to the ordering lead time. For the carrying cost, the action is penalized for the entire lifetime of the unit held in stock.
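
The split between the two horizons can be illustrated with a small Python sketch. It mirrors the composition used in the code example of this page (sell-through weighted by margin plus stock-out penalty, minus holding time weighted by carrying cost). The function name `marginal_rewards` and all numeric values are hypothetical and only serve the illustration; the two input lists stand in for the zedfuncs returned by the action reward.

```python
def marginal_rewards(sell_through, holding_time, margin, stockout_penalty, carrying_cost):
    """Marginal reward of the 1st, 2nd, ... ordered unit.

    sell_through[k]  : probability that unit k+1 is serviced within the coverage timespan
    holding_time[k]  : mean number of periods unit k+1 stays in stock
    """
    return [
        st * (margin + stockout_penalty) - ht * carrying_cost
        for st, ht in zip(sell_through, holding_time)
    ]


# Hypothetical values for the first three ordered units.
sell_through = [0.9, 0.6, 0.2]
holding_time = [2.0, 5.0, 12.0]
rewards = marginal_rewards(
    sell_through, holding_time,
    margin=10.0, stockout_penalty=4.0, carrying_cost=0.05,
)
print(rewards)
```

With these made-up inputs the marginal reward decreases with each extra unit, since later units are less likely to be serviced within the coverage timespan and stay in stock longer.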

### Concepts

The period can be a day, a week or a month. The action reward is agnostic of the chosen time granularity.

The coverage timespan starts at the point of time (in the future) when the first order placed today would be received and ends at the point of time (later in the future) when a second order placed at the next reorder opportunity would be received. From a decision-making perspective, the “upside” availability-wise of the present order is limited to the coverage timespan. Indeed, before this timespan, the present order is already too late to prevent the stockout; after this timespan, the prevention of the stockout becomes the responsibility of the next order.

The sell through, computed by the action reward function, estimates whether each unit of interest, within the order, is going to be serviced within the coverage timespan assuming a FIFO consumption of the stock. This estimate takes the form of a probability.
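
As a rough illustration of this definition, the Python sketch below estimates the sell through from a set of demand trajectories. It is not the implementation behind the action reward. It assumes a deterministic lead time, no stock on order, and the same lead time for the next order, so that the coverage timespan lasts `step_of_reorder` periods. The function and parameter names are hypothetical.

```python
def sell_through(trajectories, stock_on_hand, lead_time, step_of_reorder, max_units):
    """Estimate P(k-th ordered unit is serviced within the coverage timespan).

    Each trajectory is a list of per-period demands (integers).
    """
    hits = [0] * max_units
    for demand in trajectories:
        # Before the order arrives, demand drains the stock on hand;
        # any demand in excess of the stock is lost.
        remaining = stock_on_hand
        for d in demand[:lead_time]:
            remaining = max(0, remaining - d)
        # Demand over the coverage timespan: from the arrival of this order
        # up to the arrival of the next order.
        window = sum(demand[lead_time:lead_time + step_of_reorder])
        # FIFO: the older stock on hand is consumed first, then the ordered units.
        for k in range(1, max_units + 1):
            if window >= remaining + k:
                hits[k - 1] += 1
    return [h / len(trajectories) for h in hits]


# Two hand-written trajectories: steady demand of 2 per period, and no demand at all.
probabilities = sell_through(
    [[2] * 6, [0] * 6],
    stock_on_hand=3, lead_time=2, step_of_reorder=2, max_units=5,
)
print(probabilities)
```

In the first trajectory the stock on hand is exhausted before the order arrives, so the first four ordered units are serviced within the two-period coverage timespan; in the second trajectory none are.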

From an economic perspective, the sell through drives both the gross margin and the stockout penalty. Indeed, if the unit isn’t serviced within the coverage timespan, then, its gross margin does not belong to the present order but to a later order. Conversely, if the unit is serviced within the coverage timespan, then, it means that lacking this unit would have generated a stockout. Thus, while gross-margin and stockout may appear as two distinct economic drivers, we see that, from the action reward perspective, those two factors are largely correlated.

The holding timespan (or simply holding time) starts at the point of time (in the future) when a unit ordered today is received and ends at the point of time (later in the future) when this unit is finally serviced. From a decision-making perspective, the “downside” carrying-cost-wise of the present order runs for the whole holding timespan. Indeed, the action reward assumes that the only way to lower the stock is to service the demand, and thus, later orders can’t “undo” a previously ordered stock quantity lingering around.

The holding time, as computed by the action reward function, is the mean of the distribution of holding timespan.
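
The following Python sketch shows one way to estimate such a mean from demand trajectories. It is only an illustration under several assumptions that do not come from this page: a deterministic lead time, no stock on order, a unit serviced in its arrival period counted as held for 0 periods, and a unit never serviced counted up to the end of the trajectory. The names are hypothetical.

```python
def holding_time(trajectories, stock_on_hand, lead_time, max_units):
    """Mean number of periods the k-th ordered unit stays in stock (FIFO)."""
    totals = [0.0] * max_units
    for demand in trajectories:
        # Stock on hand left when the order arrives (excess demand is lost).
        remaining = stock_on_hand
        for d in demand[:lead_time]:
            remaining = max(0, remaining - d)
        horizon = len(demand)
        for k in range(1, max_units + 1):
            cumulative = 0
            held = horizon - lead_time  # default: never serviced within the horizon
            for t in range(lead_time, horizon):
                cumulative += demand[t]
                # The k-th ordered unit leaves the stock once the demand since
                # arrival has consumed the older stock plus the k-1 units before it.
                if cumulative >= remaining + k:
                    held = t - lead_time
                    break
            totals[k - 1] += held
    return [total / len(trajectories) for total in totals]


mean_holding = holding_time(
    [[2] * 6, [0] * 6],
    stock_on_hand=3, lead_time=2, max_units=5,
)
print(mean_holding)
```

Unlike the sell through, the loop is not bounded by the coverage timespan: the unit keeps accruing holding periods until it is serviced, whichever later order arrives in the meantime.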

### Model details

The action reward makes multiple assumptions about the underlying SKU being modeled:

• A lead time deviate of 1 indicates that the goods will be available at the start of the next period, hence prior to any consumption of stock for this period. More generally, received goods for a period are always assumed to be available at the beginning of the period. The same principle holds for the stock on order, which represents past not-yet-fulfilled orders.
• A lead time deviate of 0 (zero) indicates that the goods will be available at the start of the present period.
• A fraction of the demand is considered lost whenever the demand for the period exceeds the stock available for this period. The lost demand is assumed to be non-recoverable, hence it cannot be serviced by a later arrival of goods.
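
These three assumptions can be summarized as a per-period stock update. The Python sketch below is a minimal illustration, not the actual implementation; `step_period` and `simulate` are hypothetical names.

```python
def step_period(stock, receipts, demand):
    """One period: goods are received first, then demand is served; excess is lost."""
    available = stock + receipts
    serviced = min(available, demand)
    lost = demand - serviced  # lost for good, never back-ordered
    return available - serviced, serviced, lost


def simulate(stock, demand, order_qty, lead_time):
    """Run the periods in sequence with a single order arriving at `lead_time`."""
    lost_total = 0
    for t, d in enumerate(demand):
        receipts = order_qty if t == lead_time else 0
        stock, _, lost = step_period(stock, receipts, d)
        lost_total += lost
    return stock, lost_total


# Lead time deviate of 0: the order is usable in the present period.
same_period = simulate(stock=1, demand=[4, 3], order_qty=5, lead_time=0)
# Lead time deviate of 1: the order is usable from the start of the next period.
next_period = simulate(stock=1, demand=[4, 3], order_qty=5, lead_time=1)
print(same_period, next_period)
```

With a lead time of 1, the demand that exceeds the single unit on hand in the first period is lost, and the goods received afterwards do not recover it.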

### Code example

table Items = with
  [| as Id, as SellPrice, as BuyPrice, as LeadTime, as ReorderFrequency, \
     as MeanDemand, as Dispersion, as StockAvailable, as StockOnOrder |]
  [| "cap",      20, 10,    5,  3,  12.1,  3.2,   3,  0 |]
  [| "hat",      10,  3,   10,  7,   2.4,  1.5,   0, 10 |]
  [| "t-shirt",  25, 10,   30,  7,   7.9,  2.3,   7,  2 |]

table Periods = extend.range(80 into Items) // days ahead

Periods.StockOnOrder = Periods.N == 10 ? Items.StockOnOrder : 0

Periods.Seasonality = match Periods.N with
  ..    13 -> 0.5
  14 .. 39 -> 1.5
  ..       -> 0.5

Periods.Baseline = Items.MeanDemand * Periods.Seasonality

Items.Alpha = 0.3

Items.SellThrough, Items.HoldingTime = actionrwd.reward(
  TimeIndex: Periods.N
  Baseline: Periods.Baseline
  Dispersion: Items.Dispersion
  Alpha: Items.Alpha
  StockOnHand: Items.StockAvailable
  ArrivalTime: dirac(Periods.N)
  StockOnOrder: Periods.StockOnOrder
  StepOfReorder: Items.ReorderFrequency)

Periods.D = actionrwd.demand(
  Baseline: Periods.Baseline
  Dispersion: Items.Dispersion
  TimeIndex: Periods.N
  Alpha: Items.Alpha)

Periods.Median = quantile(Periods.D, 0.5)
Periods.Q95 = quantile(Periods.D, 0.95)
Periods.Q05 = quantile(Periods.D, 0.05)

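oosPenalty = 0.4     // % relative to selling price
carryingCost = 0.005 // % per-period carrying cost relative to purchase price
Items.M = Items.SellPrice - Items.BuyPrice // gross margin per unit (assumed definition, absent from the original listing)
Items.S = oosPenalty * Items.M
Items.C = carryingCost * Items.BuyPrice    // per-period carrying cost per unit (assumed definition, per the comment above)
Items.R = Items.SellThrough * (Items.M + Items.S) - Items.HoldingTime * Items.C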

Items.slice = sliceDashboard(Items.Id) by [Items.Id]

show scalar "Reward" a1c3 slices: Items.slice with same(Items.R) * uniform(0, 100)
show plot "Demand" a4c6 slices: Items.slice with
  Periods.N
  Periods.Median
  Periods.Q05
  Periods.Q95


### State space model of the demand

The demand trajectories are generated by a state space model. A single hidden state, the level, is used and generates observations in sequence. Each newly generated observation is drawn from a probability distribution, here a negative binomial, and this observation is used to update the state.

The pseudo-code that governs the state is given by:

level[t = 0] = 1

foreach t:
  mean[t]        = baseline[t] * level[t]
  variance[t]    = mean[t] * dispersion
  observation[t] = DrawNegativeBinomial(mean[t], variance[t])
  level[t + 1]   = (1 - alpha) * level[t] + alpha * observation[t] / baseline[t]
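
The pseudo-code above can be turned into a runnable Python sketch using only the standard library. This is an illustration, not the generator used by the action reward. The negative binomial draw is implemented here as a gamma-Poisson mixture, and two edge cases are handled by assumption: a variance not exceeding the mean falls back to a Poisson draw, and the level is left unchanged when the baseline is zero. All function names are hypothetical.

```python
import math
import random


def draw_poisson(rng, lam):
    """Poisson draw with Knuth's method; adequate for small means."""
    if lam <= 0:
        return 0
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def draw_negative_binomial(rng, mean, variance):
    """Negative binomial with the given mean and variance (gamma-Poisson mixture)."""
    if mean <= 0:
        return 0
    if variance <= mean:
        # Assumption: without over-dispersion, fall back to a Poisson draw.
        return draw_poisson(rng, mean)
    shape = mean * mean / (variance - mean)
    scale = (variance - mean) / mean
    return draw_poisson(rng, rng.gammavariate(shape, scale))


def generate_trajectory(baseline, dispersion, alpha, rng):
    """Generate one demand trajectory, one observation per baseline period."""
    level = 1.0
    observations = []
    for b in baseline:
        mean = b * level
        variance = mean * dispersion
        obs = draw_negative_binomial(rng, mean, variance)
        observations.append(obs)
        if b > 0:  # assumption: keep the level unchanged when the baseline is zero
            level = (1 - alpha) * level + alpha * obs / b
    return observations


rng = random.Random(42)
trajectory = generate_trajectory([5.0] * 20, dispersion=2.0, alpha=0.3, rng=rng)
print(trajectory)
```

Seeding the generator makes a run reproducible, which is the role played by the `Seed` argument in the function signature.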


If alpha = 0, the level stays at 1 and the state space model is equivalent to a sequence of independent negative binomial distributions with mean[t] = baseline[t] and variance[t] = baseline[t] * dispersion.

The number of trajectories used is also an input of the computation: the more trajectories, the more precise the computation, but also the longer it takes to compute. As with the horizon, increasing the number of trajectories always improves the precision, but past a certain point the precision gain becomes negligible.
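
As a rule of thumb, and assuming that the sell-through of a given unit is estimated as a plain average over $N$ independent trajectories (an assumption, since the estimator is not described here), the standard error of the estimate of a probability $p$ shrinks with the square root of $N$:

```latex
\mathrm{SE}(\hat{p}) = \sqrt{\frac{p\,(1 - p)}{N}}
\qquad\text{e.g. } p = 0.5:\quad
N = 1000 \Rightarrow \mathrm{SE} \approx 0.016,
\quad
N = 4000 \Rightarrow \mathrm{SE} \approx 0.008
```

Under this assumption, quadrupling the number of trajectories only halves the error, which is why the gain flattens out quickly.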

### Function signature

/// Returns two zedfuncs associated to the marginal outcome of ordering units.
/// The space considered is the ordering space.
call actionrwd.reward<Items, Periods, Orders>(
    /// Defines a non-ambiguous ordering per item (i.e. distinct values required).
    Periods.TimeIndex: number as "TimeIndex",
    /// The baseline of the average demand over each period.
    Periods.Baseline: number as "Baseline",
    /// The dispersion parameter (variance divided by mean) of the demand for each item.
    Items.Dispersion: number as "Dispersion",
    /// The update speed parameter of the ISSM model for each item.
    Items.Alpha: number as "Alpha",
    /// Stock on hand at the time the order is placed.
    Items.StockOnHand: number as "StockOnHand",
    /// Lead time, unit is arbitrary time step, must be consistent with StepOfReorder.
    /// Estimated next reorder time, in arbitrary time steps.
    Items.StepOfReorder: number as "StepOfReorder",
    /// Estimated order’s arrival period, zero-indexed.
    /// ArrivalTime and StockOnOrder must be both present, or both absent.
    Orders.ArrivalTime?: ranvar as "ArrivalTime",
    /// Quantity associated to the pending order. Stock is assumed available at the beginning of the period.
    Orders.StockOnOrder?: number as "StockOnOrder",
    /// Number of trajectories used to evaluate the action reward.
    scalar.Samples?: number as "Samples",
    /// Seed used for the trajectory generator.
    scalar.Seed?: number as "Seed",
    Items -> Periods,
    ?Items -> Orders) : {
        /// Probability of servicing the nth ordered unit while having the
        /// responsibility to do so.
        Items.SellThrough: zedfunc,
        /// Mean estimate of the number of periods of carrying cost for the nth
        /// ordered unit.
        Items.HoldingTime: zedfunc
    } as "actionrwd.reward"