python 3.x - Poor predictive performance for RandomForest in Spark


This might be a long shot, but has anyone run into poor predictive performance using RandomForest in MLlib? Here is what I'm doing:

  • Spark 1.4.1 with PySpark
  • Python 3.4.2
  • ~30,000 tweets of text
  • 12,289 1s and 15,956 0s
  • Whitespace tokenization and hashing-trick feature extraction using 10,000 features
  • Run RF with 100 trees and a maxDepth of 4, then predict using the features of the 1s observations (a rough sketch of this pipeline follows the list).
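Roughly, the feature pipeline looks like this. The tweets RDD of (label, text) pairs is just a placeholder name for my data; lp and predict_feat are the variables used in the session further down.

    from pyspark.mllib.feature import HashingTF
    from pyspark.mllib.regression import LabeledPoint

    # Hashing trick with 10,000 feature buckets
    htf = HashingTF(numFeatures=10000)

    # Whitespace tokenization followed by hashed term frequencies
    lp = tweets.map(lambda row: LabeledPoint(row[0], htf.transform(row[1].split(" "))))
    lp.cache()

    # Features of the positive (label 1) observations, later fed to predict()
    predict_feat = lp.filter(lambda p: p.label == 1.0).map(lambda p: p.features)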

So in theory, I should get predictions close to 12,289 1s (especially if the model overfits). Instead I'm getting 0 1s, which sounds ludicrous to me and makes me suspect there is something wrong in my code or that I'm missing something. I notice similar behavior (although not as extreme) if I play around with the settings. I'm getting normal behavior from other classifiers, so I don't think it's my setup that's the problem.

For example:

    >>> lrm = LogisticRegressionWithSGD.train(lp, iterations=10)
    >>> logit_predict = lrm.predict(predict_feat)
    >>> logit_predict.sum()
    9077

    >>> nb = NaiveBayes.train(lp)
    >>> nb_predict = nb.predict(predict_feat)
    >>> nb_predict.sum()
    10287.0

    >>> rf = RandomForest.trainClassifier(lp, numClasses=2, categoricalFeaturesInfo={}, numTrees=100, seed=422)
    >>> rf_predict = rf.predict(predict_feat)
    >>> rf_predict.sum()
    0.0
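For reference, the RandomForest call with the tree depth from the list above written out explicitly looks like this (maxDepth defaults to 4 in trainClassifier, so it is the same model as in the session):

    from pyspark.mllib.tree import RandomForest

    # Same model as above, with the depth stated explicitly
    rf = RandomForest.trainClassifier(lp, numClasses=2, categoricalFeaturesInfo={},
                                      numTrees=100, maxDepth=4, seed=422)
    rf_predict = rf.predict(predict_feat)   # RDD of 0.0/1.0 predictions
    print(rf_predict.sum())                 # count of predicted 1s; this comes out 0.0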

This code was run back to back and nothing changed in between. Does anyone have a possible explanation for this?

