Clean up statistics writers

added feature label

assigned to @berber

mentioned in commit 7e90b8f2

Please note that the available statistics writers of core.main have changed. There are now best-candidate-per-generation and candidates-per-generation. Both can be configured to change their behaviour. The attributes are:

prefix-data, a boolean value that makes the writer prefix the attributes with search-space-value and optimisation-space-value-. This is handy in some use cases.
store-search-space, a boolean value that controls if the search space values of a candidate are stored
store-optimisation-space, a boolean value that controls if the optimisation space values of a candidate are stored
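To make the effect of the three attributes concrete, here is a minimal sketch of how they could shape the columns of one output row. It is not the actual writer code: the function build_header, its parameters, and the attribute names x0 and f are hypothetical, and it assumes both prefixes end with a hyphen and that the context columns are target, run, generation, and index.

```python
def build_header(search_names, optimisation_names, *,
                 prefix_data=False,
                 store_search_space=True,
                 store_optimisation_space=True):
    """Return the column names a writer might emit (hypothetical helper)."""
    # Assumed context columns, taken from the writer output shown in this thread.
    columns = ["target", "run", "generation", "index"]
    if store_search_space:
        # Assumption: the prefix is "search-space-value-" with a trailing hyphen.
        prefix = "search-space-value-" if prefix_data else ""
        columns += [prefix + name for name in search_names]
    if store_optimisation_space:
        prefix = "optimisation-space-value-" if prefix_data else ""
        columns += [prefix + name for name in optimisation_names]
    return columns


# Example: one search-space attribute "x0" and one optimisation-space attribute "f".
print(build_header(["x0"], ["f"], prefix_data=True))
print(build_header(["x0"], ["f"], store_search_space=False))
```

With prefix-data switched off, the attribute names are written unchanged, so the prefix mainly helps when search-space and optimisation-space attributes could otherwise share a name.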

@cplump Do you think it would be a good idea to add something like a raw-optimisation-value-of-best-candidate-per-generation and raw-optimisation-value-of-candidates-per-generation? This would allow us to store the optimisation values before any adjustment.

Yes, this definitely makes sense. We also have to talk about

the covariance statistics
adding the gof-measure for the stat-writer with a flag in surrogate cases

added componentcore workflowdiscussion labels

We have the standard use case, where we evaluate an optimisation run via the individuals of its population and their respective (adapted) fitness values. For this, the above variant sounds good, although an array-valued attribute listing the (adapted) fitness variants to store would be handy as well, e.g. optimisation values = ['objective-function', 'adjusted-fitness', 'predicted-value']. Additional information might be useful as well, e.g. pareto-classes (@lau_pau), gof-measures for the given individual, the number of violated constraints, or the generated malus value. @berber, should we make a list?

Then, we have correlation writers. These, however, are somewhat distinct from the actual optimisation and always relate to the entire generation, not just one individual. Maybe, it's worth giving them their own statistics writer instead of only attributes?

I think we should distinguish between different writers for different situations. A list such as values = ['objective-function', 'adjusted-fitness', 'predicted-value'] would be possible but would break any separation of concerns. Currently, I have the feeling that we have to rework the documentation part to make it more flexible and less intrusive, as the current structure requires us to calculate everything at the end of a generation. It would be easier to allow any component in the calculation process to add information. This is a larger change in the structure and should be moved to the next release, I think.

We have additional writers:

prediction-per-individual writes the predicted-value.
correlated ???
range-correlated ???
constraint-statistics ???

I think we should test these and add proper documentation.

The middle two are the ones I was referring to in the comment above:

<Then, we have correlation writers. These, however, are somewhat distinct from the actual optimisation and always relate to the entire generation, not just one individual. Maybe, it's worth giving them their own statistics writer instead of only attributes?>

Testing is surely a good idea, and I think some of them need work because research has evolved since then, so we wouldn't use them like that anymore.

In my opinion, but I may be wrong about that, there are three groups of writers:

focusing on some value on the co-domain side (fitness, objective, adjusted, predicted, pareto-class)
focusing on some value on the domain side (violated constraints of which type, considered amount of malus)
focusing on the entire population on either side (atm: correlated, range-correlated)

Should I change some of the examples to use them so we can see if they are doing the right things?

For which writers? I think the only really important one is the one for the predicted individual. Maybe the constraints.

Nevertheless, I think it might help to get some structure into our statistics component.

mentioned in issue #126 (closed)

marked this issue as related to #126 (closed)

prediction-per-individual produces the following output for a prediction model with eight output dimensions:

target,run,generation,index,y:0,y:1,y:2,y:3,y:4,y:5,y:6,y:7

I think that this is the intended behaviour.

Yes, I second that. Does it have options like above? (search space included or not?)

No, since the idea was to not repeat logging several times. If you want the search-space values as well, you have to use candidates-per-generation and prediction-per-individual and join the data on the columns target, run, generation, and index. Perhaps, we should revisit this topic for the next version of the "Logging API".
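For anyone who needs both views, the join described above can be sketched with the Python standard library. This is only an illustration: the key columns target, run, generation, and index come from this thread, while the helper join_on_key, the search-space column x0, and all values are made up for the example.

```python
import csv
import io

# Shared context columns that identify one candidate in both files.
KEY = ("target", "run", "generation", "index")


def join_on_key(candidates_csv, predictions_csv):
    """Inner-join two CSV texts on the shared context columns."""
    predictions = {
        tuple(row[k] for k in KEY): row
        for row in csv.DictReader(io.StringIO(predictions_csv))
    }
    joined = []
    for row in csv.DictReader(io.StringIO(candidates_csv)):
        match = predictions.get(tuple(row[k] for k in KEY))
        if match is not None:
            # Add only the prediction columns; the key columns are already present.
            row.update({k: v for k, v in match.items() if k not in KEY})
            joined.append(row)
    return joined


# Illustrative data only: "x0" is a hypothetical search-space column.
candidates = "target,run,generation,index,x0\n0,1,0,0,0.5\n0,1,0,1,0.7\n"
predictions = "target,run,generation,index,y:0\n0,1,0,0,1.25\n0,1,0,1,2.5\n"
rows = join_on_key(candidates, predictions)
print(rows[0])
```

Because csv.DictReader returns every field as a string, the keys match as long as both writers format the context columns identically.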

Ok, then I would say that the writer is fine and working correctly.

The correlated writer generates the following data:

target,run,generation,individual,age,ind. difference(0/0),sum(abs(ind. difference)),sum(sqr(ind. difference)),pop. difference(0/0),sum(abs(pop. difference)),sum(sqr(pop. difference)),time

@cplump It would be great if you added some documentation to the corresponding definition file.

@berber I have done so, see commit 8d6a4d7f. However, I am wondering whether we should talk about including the non-surrogate-specific writers in the general optimisation.dl file, as they can also be used when working with benchmarks and not surrogates.

Thanks. I think it would be a good idea to refactor the code and move it to a different place. I will deal with it.

The correlated writer generates the information only for the winning individual because the calculation is expensive. We could remove the age from the output as it is not necessary for the use case.

@cplump Can you provide a simple example for range-correlated and constraints?

The constraint writer writes the following for two constraints:

target,run,generation,violatingIndividuals_0,sumOfDifferences_0,minOfDifferences_0,avgOfDifferences_0,maxOfDifferences_0,violatingIndividuals_1,sumOfDifferences_1,minOfDifferences_1,avgOfDifferences_1,maxOfDifferences_1
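As a rough sketch of how these per-constraint columns could be derived for one generation (an assumption on my side, not the writer's actual code): the function constraint_columns and its input layout are hypothetical, and I assume a "difference" is the amount by which an individual violates a constraint, with the statistics taken over the violating individuals only.

```python
def constraint_columns(violations):
    """Build the per-constraint statistics for one generation.

    violations[i][c] is the amount by which individual i violates
    constraint c; 0.0 means the constraint is satisfied (assumed layout).
    """
    row = {}
    n_constraints = len(violations[0])
    for c in range(n_constraints):
        # Keep only the individuals that actually violate constraint c.
        diffs = [ind[c] for ind in violations if ind[c] > 0.0]
        row[f"violatingIndividuals_{c}"] = len(diffs)
        row[f"sumOfDifferences_{c}"] = sum(diffs)
        row[f"minOfDifferences_{c}"] = min(diffs, default=0.0)
        row[f"avgOfDifferences_{c}"] = sum(diffs) / len(diffs) if diffs else 0.0
        row[f"maxOfDifferences_{c}"] = max(diffs, default=0.0)
    return row


# Three individuals, two constraints (illustrative values only).
stats = constraint_columns([[0.0, 2.0], [1.0, 4.0], [3.0, 0.0]])
print(stats)
```

If the real writer averages over the whole population instead of only the violating individuals, the avg column would differ, which is one more reason to document the definitions.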

A more general aspect: Is it possible to make "target" optional? It is only relevant when we're optimising towards different targets, and most cases optimise towards 0. As you made some search space information optional as well, could that be extended to target?

This is a great idea. Target and run are some kind of external context that we are using to repeat evaluations. I am not sure if there is a short-term way of removing them. Definitely, something for the next release.

mentioned in issue #128

Created #128 to tackle this topic.

marked this issue as related to #128

closed
