WEKA cross-validation and discretization
I'm trying to improve the accuracy of a WEKA model by applying the unsupervised Discretize filter. I need to decide on the number of bins and whether equal-frequency binning should be used. Normally, I would optimize this using the training set.
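For reference, these are the two unsupervised binning strategies I'm choosing between, as a plain-Python sketch (not WEKA's own implementation; the function names are my own):

```python
# Minimal sketch of the two unsupervised discretization strategies.
# WEKA's Discretize filter implements the same ideas.

def equal_width_bins(values, n_bins):
    """Split the value range into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # cut points between consecutive bins
    return [lo + width * i for i in range(1, n_bins)]

def equal_frequency_bins(values, n_bins):
    """Choose cut points so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    step = len(ordered) / n_bins
    return [ordered[int(step * i)] for i in range(1, n_bins)]

data = [1, 2, 2, 3, 3, 3, 4, 10, 50, 100]
print(equal_width_bins(data, 3))      # -> [34.0, 67.0]: bins dominated by outliers
print(equal_frequency_bins(data, 3))  # -> [3, 4]: cut points follow the data density
```

With skewed data like this, equal-width binning leaves most values crowded into the first bin, while equal-frequency binning spreads them out, which is exactly the trade-off I want to optimize.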
However, how do I determine the bin size, and whether to use equal-frequency binning, when using cross-validation? My initial idea was to use the accuracy results of the classifier in multiple cross-validation runs to find the optimal bin size. But isn't that wrong? Despite using cross-validation, I would be using the same set to test the accuracy of the model, so wouldn't I end up with an overfitted model? What is the correct way of determining bin sizes?
I also tried the supervised Discretize filter to determine the bin sizes, but it puts everything into a single bin. Does that mean my data is random and therefore cannot be split into multiple bins?
Yes, you are correct in both your idea and your concerns about the first issue.
What you are trying to do is parameter optimization. The term is usually used when you optimize the parameters of a classifier, e.g., the number of trees for a random forest or the C parameter for SVMs, but it applies equally to pre-processing steps and filters.
What you need in your case is nested cross-validation. (You should check https://stats.stackexchange.com/ for more information, for example here or here.) The important point is that the final classifier, including pre-processing steps such as binning, must never have seen the test set, only the training set. This is the outer cross-validation.
For each fold of the outer cross-validation, you need an inner cross-validation on the training set to determine the optimal parameters for your model.
I'll try to "visualize" it using a simple 2-fold cross-validation:
    data set
    ########################################

    split for outer cross-validation (2-fold)
    ####################  ####################
        training set            test set

    split for inner cross-validation
    ##########  ##########
     training      test

    evaluate parameters
    ##########  ##########
      build      evaluate

    bin size  5  ->  acc 70%
    bin size 10  ->  acc 80%
    bin size 20  ->  acc 75%
    ...
    => optimal bin size: 10

    outer cross-validation (2-fold)
    ####################  ####################
        training set            test set
      apply bin size 10
        train model           evaluate model
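The procedure in the diagram can be sketched in code. This is a hypothetical plain-Python outline of nested cross-validation for choosing the bin size; `train_and_score` is a placeholder for whatever model building and evaluation you do in WEKA, and the fold-splitting is deliberately simplistic:

```python
import random

def nested_cv(data, bin_sizes, train_and_score, outer_k=2, inner_k=2):
    """Nested cross-validation: the inner loop picks the bin size,
    the outer loop estimates performance of the whole procedure."""
    data = list(data)
    random.shuffle(data)
    outer_folds = [data[i::outer_k] for i in range(outer_k)]
    outer_scores = []
    for i, test_set in enumerate(outer_folds):
        train_set = [x for j, f in enumerate(outer_folds) if j != i for x in f]
        # inner cross-validation uses ONLY the outer training set
        inner_folds = [train_set[j::inner_k] for j in range(inner_k)]
        best_size, best_acc = None, -1.0
        for size in bin_sizes:
            accs = []
            for j, inner_test in enumerate(inner_folds):
                inner_train = [x for k, f in enumerate(inner_folds)
                               if k != j for x in f]
                accs.append(train_and_score(inner_train, inner_test, size))
            mean_acc = sum(accs) / len(accs)
            if mean_acc > best_acc:
                best_size, best_acc = size, mean_acc
        # retrain with the winning bin size, evaluate on the held-out outer fold
        outer_scores.append(train_and_score(train_set, test_set, best_size))
    return sum(outer_scores) / len(outer_scores)
```

The outer test fold only ever sees a model whose bin size was chosen without it, which is exactly what protects you from the overfitting the question worries about.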
parameter optimization can exhausting. if have 3 parameters 10 possible parameter values each, makes 10x10x10=1000 parameter combinations need evaluate each outer fold.
This is a topic of machine learning in itself, because you can do anything from a naive grid search to an evolutionary search here. You can also use heuristics. But you need some kind of parameter optimization every time.
As for your second question: that is hard to tell without seeing your data. You should post it as a separate question anyway.