machine learning - Enough features for ML? -


I'm trying to train a random forest on an accelerometer dataset. I calculate features such as the mean, standard deviation, correlation between axes, area under the curve, and others. I'm an ML noob.

I'm trying to understand two things:

1. If I split the dataset of one person into train and test sets and run RF prediction, accuracy is high (> 90%). However, if I train the RF on data from different people and then predict, accuracy is low (< 50%). Why? How do I debug this? I'm not sure what I'm doing wrong.

2. In the above example with > 90% accuracy, how many features are "enough"? How much data is "enough"?

I can furnish more details. The dataset covers 10 people, with large files of labelled data. I have limited myself to the features above to avoid a lot of compute.
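For context, here is a minimal sketch of the kind of per-window features described above, assuming each recording is a pandas DataFrame with x, y, z and label columns (the column names, window size, and the sum-based area-under-curve approximation are placeholders, not my actual code):

    import numpy as np
    import pandas as pd

    def window_features(window: pd.DataFrame) -> dict:
        """Statistical features for one window of accelerometer samples."""
        feats = {}
        for axis in ("x", "y", "z"):
            feats[f"{axis}_mean"] = window[axis].mean()
            feats[f"{axis}_sd"] = window[axis].std()
            # Area under the curve, approximated as the sum of absolute
            # sample values (unit spacing between samples is assumed).
            feats[f"{axis}_auc"] = np.abs(window[axis].to_numpy()).sum()
        # Pairwise correlation between axes.
        feats["xy_corr"] = window["x"].corr(window["y"])
        feats["xz_corr"] = window["x"].corr(window["z"])
        feats["yz_corr"] = window["y"].corr(window["z"])
        return feats

    def extract_features(recording: pd.DataFrame, window_size: int = 128) -> pd.DataFrame:
        """Slice a labelled recording into fixed-size windows, one feature row per window."""
        rows = []
        for start in range(0, len(recording) - window_size + 1, window_size):
            window = recording.iloc[start:start + window_size]
            row = window_features(window)
            row["label"] = window["label"].mode().iloc[0]  # majority label in the window
            rows.append(row)
        return pd.DataFrame(rows)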

1. Most classifiers overfit: trained on one person, the model does not generalize well, because it may "memorize" the dataset labels instead of capturing general rules of the distribution (how each feature is correlated with the others, how each affects the result, and so on). You may need more data, or more features. To measure this gap honestly, evaluate with a subject-wise split rather than a random split, as in the sketch below this list.

2. That is not an easy question. It is the generalization problem, and there is a lot of theoretical research on it, for example Vapnik–Chervonenkis theory and the Akaike information criterion, but knowledge of such theories cannot answer your question exactly. Their main principle is: the more data you have, the less flexible the model you try to fit, and the smaller the gap you require between training and test accuracy, the higher such theories rank the model. E.g. if you want to minimize the difference between accuracy on the test and training sets (to make sure accuracy on the test data does not collapse), you need to increase the amount of data, provide more meaningful features (with respect to the model), or fit a less flexible model. The learning-curve check in the sketch below shows whether more data is actually closing that gap. If you are interested in a more detailed explanation of the theoretical aspects, you can watch the Caltech lectures, starting with CaltechX CS1156x, "Learning From Data".
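As a debugging starting point, here is a minimal sketch using scikit-learn, assuming X, y and groups are the feature matrix, activity labels, and person ID per window (e.g. built with the extraction sketch in the question); the estimator settings are placeholders, not a tuned configuration:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import (GroupKFold, LeaveOneGroupOut,
                                         cross_val_score, learning_curve)

    def evaluate_generalization(X, y, groups):
        """Subject-wise evaluation: train on 9 people, test on the held-out
        10th person, and repeat. This mirrors predicting for a new person,
        unlike a random split of a single person's recording."""
        clf = RandomForestClassifier(n_estimators=200, random_state=0)

        scores = cross_val_score(clf, X, y, groups=groups, cv=LeaveOneGroupOut())
        print("per-subject accuracy:", np.round(scores, 3))
        print("mean: %.3f +/- %.3f" % (scores.mean(), scores.std()))

        # Learning curve with group-aware folds: if validation accuracy keeps
        # rising as more training data is used, more data (or more people)
        # should help; if it has flattened, look at the features or at a
        # less flexible model instead.
        sizes, train_sc, val_sc = learning_curve(
            clf, X, y, groups=groups, cv=GroupKFold(n_splits=5),
            train_sizes=np.linspace(0.2, 1.0, 5))
        for n, tr, va in zip(sizes, train_sc.mean(axis=1), val_sc.mean(axis=1)):
            print("n=%d  train=%.3f  validation=%.3f" % (n, tr, va))
        return scores

If the per-subject scores come out much lower than the within-person score, the model is fitting person-specific patterns; per-person normalization of the features or more subjects in training usually narrows that gap.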

