actionrwd.reward

actionrwd.reward(..) 🡒 (zedfunc, zedfunc), process

The action reward function actionrwd.reward() returns a pair of zedfuncs aligned with marginal outcomes of the nth unit being ordered. The first zedfunc, the sell-through, is the expectation of selling the unit within the relevant time frame. The second zedfunc, the holding time, is the expected number of periods where the unit will be kept in stock. The action reward emphasizes a discrete demand and lead time model centered around a single SKU.

See actionrwd.demand.

Overview

The action reward function is intended to compute the profitability of ordering one more unit of stock for a given SKU, taking into account a probabilistic demand forecast, a probabilistic lead time forecast and a few other variables. This function is intended to be composed with a short series of economic variables such as the expected marginal reward, the per-period carrying cost and the stock-out penalty. However, those variables are composed externally and are not part of the action reward itself.

The action reward supersedes the stock reward. The action reward properly handles (a) non-stationary future demand, (b) non-deterministic lead time, (c) fine-grained information about the stock on order and its estimated arrival time, (d) decoupled ordering lead times and supply lead times and (e) an ownership perspective on the economic consequences of the decision.

The action reward function operates in the ordering space, that is, it estimates the marginal economic returns of ordering 0, 1, 2, … units of stock. This perspective is slightly different from the stock reward function, which operates in the stocking space, that is, it estimates the economic returns of having 0, 1, 2, … units committed to the stock.

The notion of horizon in the action reward diverges in subtle but important ways compared to how the same notion is leveraged in the stock reward.

From the action reward perspective, at the level of a decision, the horizon is defined from an ownership (i.e. responsibility) perspective, and this horizon varies depending on whether the margin and stock-out penalty are considered or the carrying cost. For the margin and the stock-out penalty, the action is only “rewarded” (positively or negatively) for a duration equal to the ordering lead time. For the carrying cost, the action is penalized for the entire lifetime of the unit held in stock.

Concepts

The period can be a day, a week or a month. The action reward is agnostic of the chosen time granularity.

The coverage timespan starts at the point in time (in the future) when the first order placed today would be received and ends at the point in time (later in the future) when a second order placed at the next reorder opportunity would be received. From a decision-making perspective, the “upside” availability-wise of the present order is limited to the coverage timespan. Indeed, before this timespan, the present order is already too late to prevent the stockout; after this timespan, the prevention of the stockout becomes the responsibility of the next order.
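The boundaries of the coverage timespan can be illustrated with a minimal Python sketch. It assumes a deterministic lead time that is identical for the present order and the next one, whereas the action reward itself accepts a probabilistic lead time. The function name coverage_window is hypothetical, and the numbers are borrowed from the Items table of the code example further down this page.

```python
def coverage_window(lead_time, step_of_reorder):
    """Return (start, end) of the coverage timespan, in periods from today.

    Assumption of this sketch: both the order placed today and the order
    placed at the next reorder opportunity take exactly `lead_time` periods.
    """
    start = lead_time                   # today's order is received
    end = step_of_reorder + lead_time   # the next order is received
    return start, end


for name, lead_time, step in [("cap", 5, 3), ("hat", 10, 7), ("t-shirt", 30, 7)]:
    start, end = coverage_window(lead_time, step)
    print(f"{name}: coverage from period {start} to period {end}")
```

Under this assumption, the length of the coverage timespan equals the time between two reorder opportunities.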

The sell through, computed by the action reward function, estimates whether each unit of interest within the order is going to be serviced within the coverage timespan, assuming a FIFO consumption of the stock. This estimate takes the form of a probability.

From an economic perspective, the sell through drives both the gross margin and the stockout penalty. Indeed, if the unit isn’t serviced within the coverage timespan, then its gross margin does not belong to the present order but to a later order. Conversely, if the unit is serviced within the coverage timespan, then lacking this unit would have generated a stockout. Thus, while the gross margin and the stockout penalty may appear as two distinct economic drivers, we see that, from the action reward perspective, those two factors are largely correlated.

The holding timespan (or simply holding time) starts at the point in time (in the future) when a unit ordered today is received and ends at the point in time (later in the future) when this unit is finally serviced. From a decision-making perspective, the “downside” carrying-cost-wise of the present order runs for the whole holding timespan. Indeed, the action reward assumes that the only way to lower the stock is to service the demand, and thus, later orders can’t “undo” a previously ordered stock quantity lingering around.

The holding time, as computed by the action reward function, is the mean of the distribution of holding timespan.
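To make the two outputs concrete, here is a deliberately simplified Python sketch that derives both quantities for the nth ordered unit from a handful of demand trajectories. It is not the implementation behind actionrwd.reward. It assumes a deterministic lead time, no stock on order, demand that is lost when it exceeds the stock before the order arrives, and a holding timespan that counts the period in which the unit is sold. The names unit_outcome and estimate, and the two trajectories, are hypothetical.

```python
def unit_outcome(demand, stock_on_hand, n, lead_time, step_of_reorder):
    """Fate of the nth ordered unit (1-based) along one demand trajectory.

    demand[t] is the demand of period t (0-based). Returns a pair
    (sold_within_coverage, holding_periods).
    """
    arrival = lead_time
    coverage_end = lead_time + step_of_reorder
    # Stock left when the order arrives; unmet demand is lost (assumption).
    remaining = max(0, stock_on_hand - sum(demand[:arrival]))
    # FIFO: the remaining stock and the n - 1 earlier units are sold first.
    units_ahead = remaining + (n - 1)
    cumulative = 0
    for t in range(arrival, len(demand)):
        cumulative += demand[t]
        if cumulative > units_ahead:
            return t < coverage_end, t - arrival + 1
    # Never sold within the simulated horizon: truncated holding timespan.
    return False, len(demand) - arrival


def estimate(trajectories, stock_on_hand, n, lead_time, step_of_reorder):
    """Average the per-trajectory outcomes into (sell_through, holding_time)."""
    outcomes = [
        unit_outcome(d, stock_on_hand, n, lead_time, step_of_reorder)
        for d in trajectories
    ]
    sell_through = sum(sold for sold, _ in outcomes) / len(outcomes)
    holding_time = sum(held for _, held in outcomes) / len(outcomes)
    return sell_through, holding_time


# Two hand-written trajectories, purely illustrative.
busy = [2, 1, 0, 3, 2, 1, 4, 0, 2, 1]
slow = [0, 1, 0, 0, 1, 0, 1, 0, 0, 2]
print(estimate([busy, slow], stock_on_hand=3, n=1, lead_time=5, step_of_reorder=3))
```

The sell through is the share of trajectories in which the unit is sold before the coverage timespan ends, while the holding time is the mean of the per-trajectory holding timespans.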

Model details

The action reward makes multiple assumptions about the underlying SKU being modeled:

Code example

table Items = with
  [| as Id, as SellPrice, as BuyPrice, as LeadTime, as ReorderFrequency, \
     as MeanDemand, as Dispersion, as StockAvailable, as StockOnOrder |]
  [| "cap",      20, 10,    5,  3,  12.1,  3.2,   3,  0 |]
  [| "hat",      10,  3,   10,  7,   2.4,  1.5,   0, 10 |]
  [| "t-shirt",  25, 10,   30,  7,   7.9,  2.3,   7,  2 |]

table Periods = extend.range(80 into Items) // days ahead

Periods.StockOnOrder = Periods.N == 10 ? Items.StockOnOrder : 0

Periods.Seasonality = match Periods.N with
  ..    13 -> 0.5
  14 .. 39 -> 1.5
  ..       -> 0.5

Periods.Baseline = Items.MeanDemand * Periods.Seasonality

Items.Alpha = 0.3

Items.SellThrough, Items.HoldingTime = actionrwd.reward(
  Baseline: Periods.Baseline
  Dispersion: Items.Dispersion
  TimeIndex: Periods.N
  Alpha: Items.Alpha
  StockOnHand: Items.StockAvailable
  StockOnOrder : Periods.StockOnOrder
  LeadTime: dirac(Items.LeadTime)
  StepOfReorder: Items.ReorderFrequency)

Periods.D = actionrwd.demand(
  Baseline: Periods.Baseline
  Dispersion: Items.Dispersion
  TimeIndex: Periods.N
  Alpha: Items.Alpha)

Periods.Median = quantile(Periods.D, 0.5)
Periods.Q95 = quantile(Periods.D, 0.95)
Periods.Q05 = quantile(Periods.D, 0.05)

oosPenalty = 0.4     // fraction of the gross margin, see Items.S below
carryingCost = 0.005 // % per-period carrying cost relative to purchase price
Items.M = Items.SellPrice - Items.BuyPrice
Items.S = oosPenalty * Items.M
Items.C = carryingCost * Items.BuyPrice
Items.R = Items.SellThrough * (Items.M + Items.S) - Items.HoldingTime * Items.C

Items.slice = sliceDashboard(Items.Id) by [Items.Id]

show scalar "Reward" a1c3 slices: Items.slice with same(Items.R) * uniform(0, 100)
show plot "Demand" a4c6 slices: Items.slice with
  Periods.N
  Periods.Median
  Periods.Q05
  Periods.Q95
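The economic composition performed at the end of the script (Items.M, Items.S, Items.C and Items.R) can be checked by hand for a single unit. The Python sketch below reproduces the same arithmetic. The sell-through and holding-time values passed in are hypothetical numbers chosen for illustration, not outputs of actionrwd.reward, and marginal_reward is a hypothetical name.

```python
def marginal_reward(sell_through, holding_time, sell_price, buy_price,
                    oos_penalty=0.4, carrying_cost=0.005):
    """Reward of one ordered unit, mirroring Items.R in the script above."""
    margin = sell_price - buy_price        # Items.M
    stockout = oos_penalty * margin        # Items.S
    carrying = carrying_cost * buy_price   # Items.C, per period
    return sell_through * (margin + stockout) - holding_time * carrying


# "cap" prices from the Items table; 0.8 and 6.0 are made-up values for
# the sell through and the holding time of some nth unit.
reward = marginal_reward(0.8, 6.0, sell_price=20, buy_price=10)
print(round(reward, 2))  # 0.8 * (10 + 4) - 6 * 0.05 = 10.9
```

In the Envision script the same expression is evaluated on zedfuncs, so it yields a reward for every ordered quantity at once rather than for a single unit.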

State space model of the demand

The demand trajectories are generated by a state space model. A single hidden state (the level) is used and generates observations in sequence. Each newly generated observation is drawn from a probability distribution (here a negative binomial), and this observation is then used to update the state.

The pseudo-code that governs the state is given by:

level[t = 0] = 1

foreach t:
  mean[t]        = baseline[t] * level[t]
  variance[t]    = mean[t] * dispersion
  observation[t] = DrawNegativeBinomial(mean[t], variance[t])
  level[t + 1]   = (1 - alpha) * level[t] + alpha * observation[t] / baseline[t]

If alpha = 0, the level stays at 1, and the state space model is equivalent to a sequence of independent negative binomial distributions with mean[t] = baseline[t] and variance[t] = baseline[t] * dispersion.
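The pseudo-code above can be turned into a small runnable Python sketch that uses only the standard library. The negative binomial draw is implemented as a gamma-Poisson mixture. Two choices are assumptions of this sketch and are not stated on this page: falling back to a Poisson draw when the variance does not exceed the mean, and leaving the level unchanged when the baseline is zero. The function names are hypothetical.

```python
import math
import random


def draw_negative_binomial(rng, mean, variance):
    """Draw a count with the given mean and variance (gamma-Poisson mixture)."""
    if mean <= 0:
        return 0
    if variance > mean:
        shape = mean * mean / (variance - mean)
        scale = (variance - mean) / mean
        rate = rng.gammavariate(shape, scale)
    else:
        rate = mean  # Poisson fallback (assumption of this sketch)
    # Knuth's Poisson sampler; adequate for the small rates used here.
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_trajectory(baseline, dispersion, alpha, rng):
    """Generate one demand trajectory following the pseudo-code above."""
    level = 1.0
    observations = []
    for base in baseline:
        mean = base * level
        variance = mean * dispersion
        observation = draw_negative_binomial(rng, mean, variance)
        observations.append(observation)
        if base > 0:  # guard against division by zero (assumption)
            level = (1 - alpha) * level + alpha * observation / base
    return observations


# Baseline of the "cap" item from the code example: mean demand 12.1 with
# a seasonality of 0.5, then 1.5, then 0.5 over 80 periods.
baseline = [12.1 * 0.5] * 13 + [12.1 * 1.5] * 26 + [12.1 * 0.5] * 41
trajectory = simulate_trajectory(baseline, dispersion=3.2, alpha=0.3,
                                 rng=random.Random(42))
print(trajectory[:10])
```

Passing a seeded random.Random instance makes a trajectory reproducible, which plays the same role as the Seed argument in the function signature.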

The number of trajectories used is also an input of the computation: the more trajectories, the more precise the computation, but also the longer it takes to compute. As with the horizon, increasing the number of trajectories always improves precision, but past a certain point the gain becomes negligible.
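The diminishing return can be illustrated with a generic Monte Carlo experiment in Python. The sketch below does not call the action reward: it uses a plain Bernoulli draw as a stand-in for the event "the unit is sold within the coverage timespan", and measures how much the estimated probability fluctuates from one run to the next. All names and the probability 0.7 are hypothetical.

```python
import random
import statistics


def estimate_probability(true_p, samples, rng):
    """Estimate a probability from `samples` independent Bernoulli draws."""
    hits = sum(1 for _ in range(samples) if rng.random() < true_p)
    return hits / samples


def spread_of_estimates(true_p, samples, repetitions, seed):
    """Standard deviation of the estimate across repeated experiments."""
    rng = random.Random(seed)
    estimates = [estimate_probability(true_p, samples, rng)
                 for _ in range(repetitions)]
    return statistics.pstdev(estimates)


spread_small = spread_of_estimates(0.7, samples=100, repetitions=200, seed=1)
spread_large = spread_of_estimates(0.7, samples=1600, repetitions=200, seed=1)
print(spread_small, spread_large)
```

For an estimator of this kind the error shrinks roughly with the square root of the number of samples, so using 16 times more samples only divides the spread by about 4. This is why the precision gain eventually becomes negligible compared to the extra computation.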

Function signature

/// Returns two zedfuncs associated with the marginal outcome of ordering units.
/// The space considered is the ordering space.
call actionrwd.reward<Items, Periods>(
    /// Time index of the period. Links the table to the time. Starts at one.
    Periods.TimeIndex: number as "TimeIndex",
    /// The baseline of the average demand over each period.
    Periods.Baseline: number as "Baseline",
    /// The dispersion parameter (variance divided by mean) of the demand for each item.
    Items.Dispersion: number as "Dispersion",
    /// The update speed parameter of the ISSM model for each item.
    Items.Alpha: number as "Alpha",
    /// Stock on hand at the time the order is placed.
    Items.StockOnHand: number as "StockOnHand",
    /// Stock on order scheduled to be available at the start of each period.
    Periods.StockOnOrder: number as "StockOnOrder",
    /// Lead time, unit is arbitrary time step,
    /// must be consistent with StepOfReorder.
    Items.LeadTime: ranvar as "LeadTime",
    /// Estimated next reorder time, in arbitrary time steps.
    Items.StepOfReorder: number as "StepOfReorder",
    /// Number of trajectories used to evaluate the action reward.
    scalar.Samples?: number as "Samples",
    /// Seed used for the trajectory generator.
    scalar.Seed?: number as "Seed",
    Items -> Periods) : {
        /// Probability of servicing the nth ordered unit while having the
        /// responsibility to do so.
        Items.SellThrough: zedfunc,
        /// Mean estimate of the number of periods of carrying cost for the nth
        /// ordered unit.
        Items.HoldingTime: zedfunc
    } as "actionrwd.reward"