Distribution-Selection Algorithms

    The algorithm that ExpertFit uses to select the probability distribution that best fits a data set was developed as follows.  We began with 15 heuristics, each thought to have some ability to discriminate between good-fitting and bad-fitting distributions.  To determine which of these heuristics actually performed best, a random sample of size n was generated from a known "parent" distribution, and each of the 15 heuristics was applied to see whether it would, in fact, choose the correct distribution.  This was repeated for 200 independent samples, giving an estimate of the probability that each heuristic picks the parent distribution at the specified sample size.  The whole process was then repeated for 175 parent-distribution/sample-size pairs, which identified several heuristics that appeared to be superior.  These heuristics were combined to give the overall algorithm for ranking the fitted distributions.  The analysis of the resulting 35,000 data sets (200 samples for each of the 175 pairs) took six months.
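
The evaluation procedure described above can be illustrated with a minimal sketch.  It is not ExpertFit's code: the two candidate distributions (exponential and normal), the two heuristics (maximum log-likelihood and smallest Kolmogorov-Smirnov distance), the parent parameters, and every function name are assumptions chosen only to make the example small and runnable.  The 15 heuristics that were actually studied are not listed in this document.

```python
import math
import random

# Hypothetical parent distributions, each a sampler that draws one value.
PARENTS = {
    "exponential": lambda rng: rng.expovariate(1.0),
    "normal": lambda rng: rng.gauss(10.0, 2.0),
}

def fit_exponential(xs):
    """Maximum-likelihood exponential fit; assumes a positive sample mean."""
    rate = len(xs) / sum(xs)
    logpdf = lambda x: math.log(rate) - rate * x if x >= 0 else -math.inf
    cdf = lambda x: 1.0 - math.exp(-rate * x) if x >= 0 else 0.0
    return logpdf, cdf

def fit_normal(xs):
    """Maximum-likelihood normal fit (sample mean, biased standard deviation)."""
    mu = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    logpdf = lambda x: (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
                        - (x - mu) ** 2 / (2.0 * sigma ** 2))
    cdf = lambda x: 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    return logpdf, cdf

FITTERS = {"exponential": fit_exponential, "normal": fit_normal}

def pick_by_loglik(xs, fitted):
    """Illustrative heuristic 1: choose the largest total log-likelihood."""
    return max(fitted, key=lambda name: sum(fitted[name][0](x) for x in xs))

def ks_statistic(xs, cdf):
    """Largest gap between the empirical CDF and the fitted CDF."""
    s, n = sorted(xs), len(xs)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n)
               for i, x in enumerate(s))

def pick_by_ks(xs, fitted):
    """Illustrative heuristic 2: choose the smallest Kolmogorov-Smirnov distance."""
    return min(fitted, key=lambda name: ks_statistic(xs, fitted[name][1]))

HEURISTICS = {"log-likelihood": pick_by_loglik, "kolmogorov-smirnov": pick_by_ks}

def estimate_selection_probability(parent, n, replications=200, seed=0):
    """Estimate, for each heuristic, the probability of picking the parent."""
    rng = random.Random(seed)
    hits = {name: 0 for name in HEURISTICS}
    for _ in range(replications):
        xs = [PARENTS[parent](rng) for _ in range(n)]      # one sample of size n
        fitted = {name: fit(xs) for name, fit in FITTERS.items()}
        for name, pick in HEURISTICS.items():
            hits[name] += pick(xs, fitted) == parent       # correct choice?
    return {name: count / replications for name, count in hits.items()}
```

Calling estimate_selection_probability("exponential", n=50) returns one estimated probability per heuristic for a single parent-distribution/sample-size pair; looping over many such pairs and comparing the estimates mirrors the structure of the study, on a far smaller scale.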

    To show the superiority of the ExpertFit distribution-selection algorithm, we used ExpertFit and another distribution-fitting package to automatically pick the "best" distribution for each of 69 real-world data sets selected from many different application areas.  ExpertFit provided a better-fitting distribution than the other package for 87% of the data sets tested [1] and tied on the rest!
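
The comparison criterion used here, the Anderson-Darling statistic described in the footnote, can be sketched as follows.  This is an illustration under stated assumptions rather than the procedure used in the study: the data are synthetic, the two candidates (exponential and normal) are arbitrary, and the function names are hypothetical.  The statistic is computed with the standard formula A^2 = -n - (1/n) * sum over i of (2i - 1) * [ln F(x_(i)) + ln(1 - F(x_(n+1-i)))], where F is the fitted CDF and x_(1) <= ... <= x_(n) are the ordered data; a smaller value indicates a closer fit.

```python
import math
import random

def anderson_darling(xs, cdf):
    """Return the Anderson-Darling statistic A^2 of xs against a fitted CDF."""
    s = sorted(xs)
    n = len(s)
    eps = 1e-12  # keep CDF values strictly inside (0, 1) so the logs are finite
    u = [min(max(cdf(x), eps), 1.0 - eps) for x in s]
    # With 0-based i, the weight (2i + 1) equals (2k - 1) for the k-th order statistic.
    total = sum((2 * i + 1) * (math.log(u[i]) + math.log(1.0 - u[n - 1 - i]))
                for i in range(n))
    return -n - total / n

def exponential_cdf(rate):
    """CDF of an exponential distribution with the given rate."""
    return lambda x: 1.0 - math.exp(-rate * x) if x >= 0 else 0.0

def normal_cdf(mu, sigma):
    """CDF of a normal distribution with mean mu and standard deviation sigma."""
    return lambda x: 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def better_fit(xs, fitted_cdfs):
    """Return (name, scores): the candidate with the smallest A^2, plus all scores."""
    scores = {name: anderson_darling(xs, cdf) for name, cdf in fitted_cdfs.items()}
    return min(scores, key=scores.get), scores

# Synthetic example data (an assumption; not one of the 69 real-world data sets).
rng = random.Random(42)
data = [rng.expovariate(0.5) for _ in range(200)]
mean = sum(data) / len(data)
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))

# Each "package" is represented here by the single distribution it selected.
winner, scores = better_fit(data, {
    "exponential": exponential_cdf(1.0 / mean),
    "normal": normal_cdf(mean, sd),
})
print(winner, scores)
```

Repeating such a comparison once per data set, and counting how often each package's selection has the smaller statistic, gives the kind of win/tie tally reported above.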

-------------------------------------------------------------------------------------------------------------------

[1] The Anderson-Darling test statistic, a powerful measure of goodness of fit, was used to determine which of the two selected distributions provided the better fit for each data set.