WEKA cross validation discretization


I'm trying to improve the accuracy of my WEKA model by applying an unsupervised Discretize filter. I need to decide on the number of bins and whether equal-frequency binning should be used. Normally, I would optimize this using a training set.

However, how do I determine the bin size, and whether equal-frequency binning should be used, when using cross-validation? My initial idea was to use the accuracy of the classifier across multiple cross-validation tests to find the optimal bin size. But isn't it wrong, despite using cross-validation, to use the same set to also test the accuracy of the model, because I would then have an overfitted model? What would be a correct way of determining the bin sizes?

I also tried the supervised Discretize filter to determine the bin sizes, but it only results in a single bin. Does this mean that my data is random and therefore cannot be clustered into multiple bins?

Yes, you are correct in both your idea and your concerns regarding the first issue.

What you are trying to do is parameter optimization. The term is usually used when you optimize the parameters of a classifier, e.g., the number of trees for a random forest or the C parameter for SVMs. But you can apply it to pre-processing steps and filters as well.
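To make the two binning parameters from the question concrete, here is a minimal sketch in plain Python. It is not WEKA's implementation; the function names and the toy values are made up for illustration. It only shows how the number of bins and the equal-width versus equal-frequency choice change the resulting bins.

```python
from bisect import bisect_right

def equal_width_edges(values, bins):
    """Cut points that split [min, max] into `bins` intervals of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins
    return [lo + step * i for i in range(1, bins)]

def equal_frequency_edges(values, bins):
    """Cut points chosen so each interval holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // bins] for i in range(1, bins)]

def discretize(value, edges):
    """Index of the bin that `value` falls into."""
    return bisect_right(edges, value)

# Toy attribute with one outlier (hypothetical data).
values = [1, 2, 3, 4, 5, 6, 7, 100]

width_edges = equal_width_edges(values, 4)
freq_edges = equal_frequency_edges(values, 4)

width_bins = [discretize(v, width_edges) for v in values]
freq_bins = [discretize(v, freq_edges) for v in values]

print(width_bins)  # [0, 0, 0, 0, 0, 0, 0, 3]
print(freq_bins)   # [0, 0, 1, 1, 2, 2, 3, 3]
```

With equal-width binning the outlier stretches the range, so almost all values end up in the first bin. Equal-frequency binning spreads them evenly. Which one helps the classifier depends on the data, which is why both settings are candidates for optimization.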

What you have to do in this case is a nested cross-validation. (You should check https://stats.stackexchange.com/ for more information.) The important thing is that the final classifier, including all pre-processing steps such as binning, has never seen the test set, only the training set. This is the outer cross-validation.

For each fold of the outer cross-validation, you need to run an inner cross-validation on the training set to determine the optimal parameters for your model.

I'll try to "visualize" it on a simple 2-fold cross-validation:

Data set
########################################

Split for outer cross-validation (2-fold)
#################### ####################
training set         test set

Split for inner cross-validation
########## ##########
training   test

Evaluate parameters
########## ##########
build      evaluated
bin size  5   acc 70%
bin size 10   acc 80%
bin size 20   acc 75%
...
=> optimal bin size: 10

Outer cross-validation (2-fold)
#################### ####################
training set         test set
apply bin size 10
train model          evaluate model
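The scheme above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not WEKA code: the data set is synthetic, the "classifier" is a simple majority vote per bin, and all function names are hypothetical. The point it shows is that the bin size is chosen using the outer training set only, and the outer test set is touched once, for the final evaluation.

```python
import random
from bisect import bisect_right
from collections import Counter, defaultdict

def make_folds(rows, k):
    """Split rows into k disjoint folds (round-robin assignment)."""
    return [rows[i::k] for i in range(k)]

def without_fold(folds, i):
    """All rows except those in fold i."""
    return [r for j, fold in enumerate(folds) if j != i for r in fold]

def train(rows, bins):
    """Learn equal-width cut points and the majority class per bin."""
    xs = [x for x, _ in rows]
    lo, hi = min(xs), max(xs)
    step = (hi - lo) / bins
    edges = [lo + step * i for i in range(1, bins)]
    votes = defaultdict(Counter)
    for x, y in rows:
        votes[bisect_right(edges, x)][y] += 1
    table = {b: c.most_common(1)[0][0] for b, c in votes.items()}
    default = Counter(y for _, y in rows).most_common(1)[0][0]
    return edges, table, default

def accuracy(model, rows):
    """Fraction of rows whose bin's majority class matches the label."""
    edges, table, default = model
    hits = sum(table.get(bisect_right(edges, x), default) == y for x, y in rows)
    return hits / len(rows)

def cv_accuracy(rows, bins, k):
    """Plain k-fold cross-validation for one candidate bin count."""
    folds = make_folds(rows, k)
    scores = [accuracy(train(without_fold(folds, i), bins), folds[i])
              for i in range(k)]
    return sum(scores) / k

def nested_cv(rows, candidates, outer_k=2, inner_k=2):
    """Return (chosen bin count, test accuracy) for each outer fold."""
    outer = make_folds(rows, outer_k)
    results = []
    for i in range(outer_k):
        outer_train, outer_test = without_fold(outer, i), outer[i]
        # Inner cross-validation: only outer_train is used to pick the bin count.
        best = max(candidates, key=lambda b: cv_accuracy(outer_train, b, inner_k))
        # Rebuild on the full outer training set, evaluate once on the test set.
        results.append((best, accuracy(train(outer_train, best), outer_test)))
    return results

# Synthetic data (assumption): one numeric attribute, class 1 in the middle range.
rng = random.Random(0)
data = [(x, int(2.5 < x < 7.5)) for x in (rng.uniform(0, 10) for _ in range(200))]

results = nested_cv(data, candidates=[2, 5, 10, 20])
estimate = sum(acc for _, acc in results) / len(results)
print(results, estimate)
```

Note that the chosen bin count may differ between outer folds. The averaged outer accuracy estimates the whole procedure (optimization plus training), not one fixed bin size.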

Parameter optimization can be very exhausting. If you have 3 parameters with 10 possible values each, that makes 10x10x10=1000 parameter combinations you need to evaluate for each outer fold.
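A quick way to see how fast this grows is to enumerate a grid explicitly. The parameter names and the fold counts below are hypothetical and chosen only to reproduce the 10x10x10 example. Since each combination is itself evaluated by an inner cross-validation, the number of models trained is a multiple of the number of combinations.

```python
from itertools import product

# Hypothetical grid: three parameters with 10 candidate values each.
grid = {
    "bins": list(range(5, 55, 5)),   # 5, 10, ..., 50
    "param_b": list(range(10)),      # placeholder parameter
    "param_c": list(range(10)),      # placeholder parameter
}

combinations = list(product(*grid.values()))

inner_folds = 10   # assumed inner cross-validation folds
outer_folds = 10   # assumed outer cross-validation folds

# Each combination is trained once per inner fold.
models_per_outer_fold = len(combinations) * inner_folds
# Plus one final model per outer fold, built with the best combination.
total_models = outer_folds * (models_per_outer_fold + 1)

print(len(combinations))      # 1000
print(models_per_outer_fold)  # 10000
print(total_models)           # 100010
```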

This is a topic of machine learning in itself, because you can do anything here, from a naive grid search to an evolutionary search. Sometimes you can use heuristics. But you need to do some kind of parameter optimization every time.

As for your second question: this is really hard to tell without seeing your data. But you should post that as a separate question anyway.

