python 3.x - Poor predictive performance for RandomForest in Spark -
This might be a long shot, but has anyone run into poor predictive performance using RandomForest in MLlib? Here's what I'm doing:
- Spark 1.4.1 with PySpark
- Python 3.4.2
- ~30,000 tweets of text
- 12,289 1s and 15,956 0s
- Whitespace tokenization, then the hashing trick for feature extraction with 10,000 features
- Run RF with 100 trees and a maxDepth of 4, then predict using the features of the 1s observations.
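The feature-extraction step above can be sketched in plain Python (this is an illustrative stand-in, not the MLlib `HashingTF` API; the function name and CRC32 hash choice are my own assumptions):

```python
import zlib


def hash_features(tokens, num_features=10_000):
    """Map whitespace tokens to a sparse term-frequency vector via the hashing trick.

    Each token is hashed to a bucket index in [0, num_features); collisions
    simply add their counts together, which is the usual hashing-trick tradeoff.
    """
    vec = {}
    for tok in tokens:
        # zlib.crc32 is used here for a deterministic hash across runs
        # (Python's built-in hash() of str is salted per process).
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec


tweet = "poor predictive performance for randomforest in spark"
features = hash_features(tweet.split())
```

The resulting sparse `{index: count}` dict corresponds to one row of the 10,000-dimensional feature matrix fed to the classifiers.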
So in theory, the predictions should come out close to 12,289 1s (especially if the model overfits). Instead I'm getting 0 1s, which sounds ludicrous to me and makes me suspect either wrong code or that I'm missing something. I notice similar behavior (although not as extreme) if I play around with the settings. I'm getting normal behavior from the other classifiers, so I don't think it's my setup that's the problem.
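One consistency check worth keeping in mind (a pure-Python sketch of the forest's voting rule, not MLlib internals): with 15,956 0s against 12,289 1s, a shallow tree whose splits gain nothing falls back to the majority class, and 100 such trees then vote unanimously for 0, producing exactly the all-zero prediction described above.

```python
def majority_vote(tree_preds):
    """Combine per-tree class predictions the way a random forest classifier does:
    predict 1 only if more than half the trees vote 1."""
    return 1.0 if sum(tree_preds) > len(tree_preds) / 2 else 0.0


# Hypothetical worst case: 100 depth-limited trees that each learned only
# the class prior (0 is the majority class) -> the ensemble predicts 0.
trees = [0.0] * 100
majority_vote(trees)  # 0.0
```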
for example:
>>> lrm = LogisticRegressionWithSGD.train(lp, iterations=10)
>>> logit_predict = lrm.predict(predict_feat)
>>> logit_predict.sum()
9077
>>> nb = NaiveBayes.train(lp)
>>> nb_predict = nb.predict(predict_feat)
>>> nb_predict.sum()
10287.0
>>> rf = RandomForest.trainClassifier(lp, numClasses=2, categoricalFeaturesInfo={}, numTrees=100, seed=422)
>>> rf_predict = rf.predict(predict_feat)
>>> rf_predict.sum()
0.0
This code was run back to back, and nothing changed in between. Does anyone have a possible explanation for this?