For my MSc thesis, I have implemented a LARS algorithm in C++ to perform UQ on test functions. My goal is to use as small a sample (number of evaluations) as possible. The objective function is:
f(x_1, …, x_15) = 5∙x_1^2∙x_2^3∙x_3^2 + 2∙x_4^4∙x_5 + x_6^0 − x_7^4 + x_8^4∙x_9^4 + x_10^2 + 0.001∙(x_11 + x_12 + x_13 + x_14 + x_15)
Input variables (normal distribution):

| Variable | μ | σ |
|---|---|---|
| x_1 | 1 | 2 |
| x_2 | 2 | 1 |
| x_3 | 1 | 2 |
| x_4 | 2 | 1 |
| x_5 | 1 | 0.5 |
| x_6 | 1 | 2 |
| x_7 | 1 | 3 |
| x_8 | 1 | 0.5 |
| x_9 | 1 | 2 |
| x_10 | 1 | 3 |
| x_11 | 1 | 2 |
| x_12 | 1 | 2 |
| x_13 | 1 | 2 |
| x_14 | 1 | 0.3 |
| x_15 | 1 | 2 |
As you can see, I expect the variables x_11–x_15 to be less important in the polynomial basis that the LARS method will build.
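Just so the function is unambiguous, here is a direct C++ transcription of the objective (the function name and the 0-based indexing are only for illustration):

```cpp
#include <cmath>

// Direct transcription of the objective function f(x_1, ..., x_15)
// (0-based indexing: x[0] = x_1, ..., x[14] = x_15).
double objective(const double x[15])
{
    return 5.0 * x[0] * x[0] * std::pow(x[1], 3) * x[2] * x[2]  // 5*x1^2*x2^3*x3^2
         + 2.0 * std::pow(x[3], 4) * x[4]                       // 2*x4^4*x5
         + 1.0                                                  // x6^0
         - std::pow(x[6], 4)                                    // -x7^4
         + std::pow(x[7], 4) * std::pow(x[8], 4)                // x8^4*x9^4
         + x[9] * x[9]                                          // x10^2
         + 0.001 * (x[10] + x[11] + x[12] + x[13] + x[14]);     // 0.001*(x11+...+x15)
}
```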
My algorithm does the following:
A priori I use 6000 samples X → Y: I evaluate them and keep 30% of them aside to measure the error (see final diagram). I also define a priori the maximum chaos order for this problem to be 6. Then:

1. Compute the full polynomial design matrix once, so that it is available later on.
2. Add the first zero-degree polynomial Ψ_0 to my active basis (hardcoded) and compute the first correlations with the residual (at this point all coefficients are 0, so the residual is simply the vector of evaluated values, namely Y).
3. Select the polynomial with the highest correlation (say Ψ_1) and do OLS with Ψ_0, Ψ_1 in my design matrix to compute the first two coefficients.
4. Compute the correlations as c_j = ⟨Ψ_j, r⟩ / ||Ψ_j||, choose the Ψ_j with the highest c_j, add it to the active set, and update the design matrix.
5. Do OLS and compute the error between the test and train sets.
6. Repeat steps 4 and 5 until one of the following stopping criteria is met:
   a. Convergence criterion – stop if the relative error improvement is below 1e-4.
   b. Insignificant correlation – stop if the maximum correlation is less than 1e-1.
   c. Condition-number growth – stop if the condition number grows by more than a factor of 10 compared to the previous iteration.
   d. Error threshold – stop if the LOO error falls below 1e-2.
7. On exiting the loop, merge the test set back into the train set, do one last OLS, and compute the mean and standard deviation.
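To make steps 4 and 5 concrete, here is a rough sketch of the selection loop (not my exact code; I use Eigen here only to keep the linear algebra short, the names are illustrative, and most of the stopping criteria are only indicated by comments):

```cpp
#include <Eigen/Dense>
#include <algorithm>
#include <cmath>
#include <vector>

// Rough sketch of steps 4-5: pick the candidate basis function most correlated
// with the current residual, add it to the active set, refit with OLS.
// Psi : N x P full polynomial design matrix (all candidate basis functions)
// y   : N model evaluations
Eigen::VectorXd forwardSelection(const Eigen::MatrixXd& Psi,
                                 const Eigen::VectorXd& y,
                                 int maxTerms)
{
    const int P = Psi.cols();
    std::vector<int> active = {0};    // start with the constant polynomial Psi_0
    Eigen::VectorXd residual = y;     // all coefficients are 0, so r = Y
    Eigen::VectorXd coeffs;

    while (static_cast<int>(active.size()) < maxTerms) {
        // correlation c_j = <Psi_j, r> / ||Psi_j|| over the inactive columns
        int best = -1;
        double bestCorr = 0.0;
        for (int j = 0; j < P; ++j) {
            if (std::find(active.begin(), active.end(), j) != active.end())
                continue;
            const double c = std::abs(Psi.col(j).dot(residual)) / Psi.col(j).norm();
            if (c > bestCorr) { bestCorr = c; best = j; }
        }
        if (best < 0 || bestCorr < 1e-1)   // "insignificant correlation" criterion
            break;
        active.push_back(best);

        // OLS on the active columns only
        Eigen::MatrixXd PsiA(Psi.rows(), active.size());
        for (std::size_t k = 0; k < active.size(); ++k)
            PsiA.col(k) = Psi.col(active[k]);
        coeffs   = PsiA.colPivHouseholderQr().solve(y);
        residual = y - PsiA * coeffs;
        // ... here: check relative error improvement, condition-number growth,
        //     and the LOO error threshold before continuing ...
    }
    return coeffs;
}
```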
The final error in predicting the mean and standard deviation (compared with Monte Carlo at a cost of 5∙10^7 evaluations) is 0.0328827000% for the mean and 1.2964408180% for the standard deviation.
Questions and Comments:
a) I have used (and evaluated) 6000 samples a priori. Is there a way to define a priori the number of samples I will need? Would it be a good idea to start with a small number of samples and add samples during the loop (if needed), before each OLS?
b) Should the correlations be the inner product between the basis function and the residual (Y − Design_Matrix_active ∙ coefficients_active), or between the basis function and the error (Y − Ŷ)?
b) Would it be more efficient to calculate the LOO error? I believe that if it takes x amount of time to process a single sample this way, then LOO will require number_of_samples ∙ x time, which could be too costly in terms of computational resources for problems like this. Is that correct?
c) Do you have any suggestions? I may have misunderstood something, but are there ways to achieve better accuracy with even lower computational cost? Any feedback would be greatly appreciated!
I’m not an expert on LARS, but I can still give you some of my thoughts on your question.
LARS is not the simplest algorithm, especially if performance is one of your concerns. I wonder why you reimplemented it instead of using a well-established implementation.
I think what you call the test data is actually your validation data. Test data should not be used during training at all.
(a) I don’t think there is a clear solution to that. The number of samples depends heavily on factors like dimensionality and the topology of your problem. In your second sentence, are you referring to an active learning approach? If you’re interested, you can check out the UQLab manual on active learning. It is a good approach but adds additional complexity. I don’t know the scope of your MSc thesis, but you seem relatively new to these topics, so I would advise working with existing implementations if you also want to explore more advanced methods.
(b) What is \hat{Y} in your notation?
(b) (the second one) UQLab evaluates the LOO error to decide when to stop LARS. I seem to remember that it stops when the LOO error starts to increase. There might be a section on this in the manual; otherwise, you may want to check the UQLab implementation. And yes, this adds cost to the algorithm, but there is an analytical solution for the LOO error in linear regression problems, so the cost is manageable in most cases.
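If it helps, a minimal sketch of that analytical LOO error for an OLS fit could look like the following (using Eigen; the function and variable names are mine, not UQLab's):

```cpp
#include <Eigen/Dense>

// Analytical leave-one-out error of an OLS fit y ≈ Psi * c.
// Only the diagonal of the hat matrix H = Psi (Psi^T Psi)^{-1} Psi^T is needed,
// so there is no per-sample refitting:
//   e_LOO = (1/N) * sum_i ((y_i - yhat_i) / (1 - H_ii))^2.
double looError(const Eigen::MatrixXd& Psi, const Eigen::VectorXd& y)
{
    const int N = Psi.rows();
    const Eigen::MatrixXd G = Psi.transpose() * Psi;                 // information matrix
    const Eigen::VectorXd c = G.ldlt().solve(Psi.transpose() * y);   // OLS coefficients
    const Eigen::VectorXd yhat = Psi * c;
    const Eigen::MatrixXd H = Psi * G.ldlt().solve(Psi.transpose()); // hat matrix
    double err = 0.0;
    for (int i = 0; i < N; ++i) {
        const double e = (y(i) - yhat(i)) / (1.0 - H(i, i));
        err += e * e;
    }
    return err / N;
}
```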
Instead of validating, I have computed the LOO error and the modified LOO error as the manual says, and for a 10-dimensional problem I used 10∙d samples. I have a question about the termination criterion. The manual says: 'An effective and robust early stop criterion for LAR is then to stop adding regressors after the LOO is above its minimum value for at least 10% of the maximum number of possible iterations'.
My question is this: what is the maximum number of possible iterations? The number of samples minus 1, or simply the number of polynomials?
If I apply this criterion together with the criterion 'stop if the system becomes underdetermined (i.e. keep the number of polynomials below the number of samples)', then for this specific problem the LOO criterion is never fulfilled and the algorithm ends with the results below. Would it be better to change it to: 'stop adding regressors as soon as the LOO is above its minimum value'? Are these two criteria enough for the LAR algorithm?
The results I obtain using LAR-LSM are:
Max Correlation: 5.54492e-05
→ Mean (a0): 268.263 (true value according to Monte Carlo: 268.1208)
→ Standard Deviation: 48.5624 (true value according to Monte Carlo: 49.40246)
Condition Number: 7539.11
Total Polynomials: 99
LOO_error (normalised): 1.32759e-09
Modified LOO Error (normalised): 6.45492e-06
Number of samples used: 100
Thank you for your patience
PS: In parallel, I am trying to use the well-established UQLab code to solve the same problem.
The maximum number of iterations in UQLab is (see file uq_lar.m):
```matlab
[N, P] = size(Psi);          % line 107 (Psi is the regression matrix)
...
nvars = min(N-1, P);         % line 191
...
maxk = 8*nvars;              % line 196
...
maxiter = min(maxk, nvars);  % line 222
```
Regarding early stopping, I want to emphasize that this is exactly what it states—an early stop. UQLab then selects the iteration with the lowest LOO error. You can also run LARS until the pool of regressors is exhausted and then choose the iteration with the lowest error. However, this can be very time-consuming.
Does this also address your concerns from your second bullet point? You can either run LARS to completion and select the set of coefficients with the lowest LOO error or stop early if you notice that the LOO error is unlikely to decrease further.
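To make that bookkeeping concrete, here is a small sketch of the idea (plain C++; the names and the hardcoded 10% "patience" are only illustrative of what the manual describes): track the minimum LOO error over the iterations, stop once the error has stayed above that minimum for long enough, and report the coefficients from the best iteration at the end.

```cpp
#include <algorithm>
#include <limits>

// Sketch of the early-stop bookkeeping: remember the iteration with the lowest
// LOO error and stop once the error has stayed above that minimum for more than
// 10% of the maximum number of iterations. After the loop, report the model
// (coefficients) stored at bestIter, whether or not the early stop triggered.
struct EarlyStop {
    double bestError = std::numeric_limits<double>::infinity();
    int    bestIter  = -1;
    int    sinceBest = 0;
    int    patience;                          // e.g. 0.1 * maxiter

    explicit EarlyStop(int maxiter) : patience(std::max(1, maxiter / 10)) {}

    // Call once per LARS iteration; returns true when the loop should stop.
    bool update(int iter, double looError) {
        if (looError < bestError) {
            bestError = looError;
            bestIter  = iter;
            sinceBest = 0;
        } else {
            ++sinceBest;
        }
        return sinceBest > patience;
    }
};
```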
I have one more question. After reading this Acceptable LOO Error thread, I am wondering about the following.
I am performing LARS-based PCE. In each iteration, a highly correlated polynomial is added to the design matrix, and I compute the minimum modified LOO error (Eq. 1.22 in the UQ manual) for a nonlinear QoI with 10 stochastic inputs. The LOO error is not normalized, and I obtain the following results for a maximum chaos order of 2.
for:
- 100 training samples: modified LOO error is 0.0022469228
- 50 training samples: modified LOO error is 0.0145296080
- 25 training samples: modified LOO error is 0.0200713039 (here I noticed errors ≥ 10% when computing the standard deviation in comparison with the Monte Carlo results)
So, knowing that this is a problem-dependent decision, is there a general empirical rule that indicates which results can be trusted? Is the modified LOO error a suitable indicator for that?
I also want to ask about the maximum chaos order. I have chosen 2 in this case. For LARS with a small training sample like this one, is there a rule of thumb for that as well?