apache spark - Trouble understanding the LDA topic model in MLlib
I have trouble understanding the LDA topic model result in Spark MLlib.
To my understanding, the result should look like the following:
topic 1: term1, term2, term....
topic 2: term1, term2, term3...
...
topic n: term1, ........
doc1: topic1, topic2,...
doc2: topic1, topic2,...
doc3: topic1, topic2,...
...
docn: topic1, topic2,...
I applied LDA to the sample data of Spark MLlib, which looks like this:
1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0
Afterwards I get the following results:
topics: org.apache.spark.mllib.linalg.Matrix =
10.33743440804936   9.104197117225599   6.5583684747250395
6.342536927434482   12.486281081997593  10.171181990567925
2.1728012328444692  2.1939589470020042  7.633239820153526
17.858082227094904  9.405347532724434   12.736570240180663
13.226180094790433  3.9570395921153536  7.816780313094214
6.155778858763581   10.224730593556806  5.619490547679611
7.834725138351118   15.52628918346391   7.63898567818497
4.419396221560405   3.072221927676895   2.5083818507627
1.4984991123084432  3.5227422247618927  2.978758662929664
5.696963722524612   7.254625667071781   11.048410610403607
11.080658179168758  10.11489350657456   11.804448314256682
Each column is the term distribution of a topic. There are 3 topics in total, and each topic is a distribution over 11 vocabulary terms.
I think there are 12 documents, each of which has 11 vocabulary terms. My trouble is that:
- How can I find the topic distribution of each document?
- Why does each topic have a distribution over 11 vocabulary terms, while there are only 10 different vocabulary terms (0-9) in the data?
- Why is the sum of each column not equal to 100 (meaning 100%, according to my understanding)?
You can get the topic distribution of each document by calling DistributedLDAModel.topicDistributions()
or DistributedLDAModel.javaTopicDistributions()
in Spark 1.4. This only works if the model optimizer is set to EMLDAOptimizer
(the default).
There is an example here, and the documentation here.
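It looks like this in Java:
    // lda, k and corpus are defined elsewhere; the cast requires the EM optimizer (the default)
    LDAModel ldaModel = lda.setK(k.intValue()).run(corpus);
    JavaPairRDD<Long, Vector> topic_dist_over_docs = ((DistributedLDAModel) ldaModel).javaTopicDistributions();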
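To get from a per-document topic vector to the "doc1: topic1, topic2, ..." layout asked for in the question, one option is to sort the topic indices by weight. The following is only a sketch in plain Java with no Spark dependency: it assumes each document's distribution has already been copied into a double[] (for example with Vector.toArray()), and the class name, method name and sample weights are hypothetical, not taken from the post.

```java
import java.util.Arrays;

// Sketch: order topic indices by weight for a single document.
class DocTopicRanking {

    // Returns topic indices sorted from highest to lowest weight.
    static Integer[] rankTopics(double[] topicWeights) {
        Integer[] order = new Integer[topicWeights.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Object sort is stable, so equal weights keep their index order.
        Arrays.sort(order, (a, b) -> Double.compare(topicWeights[b], topicWeights[a]));
        return order;
    }

    public static void main(String[] args) {
        // Hypothetical weights for one document over k = 3 topics.
        double[] doc = {0.2, 0.7, 0.1};
        System.out.println("doc: " + Arrays.toString(rankTopics(doc)));
    }
}
```

Applying rankTopics to every value of the pair RDD would give one ordered topic list per document ID.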
As for the second question:
The LDA model returns, for each topic, a probability distribution over each word in the vocabulary. So you have 3 topics (three columns), each with 11 rows (one for each word in the vocab), because the vocab size is 11.
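The vocabulary size and the column sums can both be checked with a few lines of plain Java. This sketch rests on two assumptions. First, that the data was loaded as word-count vectors, one line per document, so the number of entries in a line is the vocabulary size and the digits are counts rather than word IDs. Second, that the posted matrix holds unnormalised term weights, which would explain why the columns do not sum to 1 or 100. The class and method names are hypothetical, and the small matrix in main is made up for illustration.

```java
// Sketch: vocabulary size from one count row, and column-normalised topic weights.
class TopicMatrixSketch {

    // A document row holds one count per vocabulary term, so the number
    // of entries in the row is the vocabulary size.
    static int vocabSize(String countRow) {
        return countRow.trim().split("\\s+").length;
    }

    // Divides every entry by its column sum so that each topic (column) sums to 1.
    // Assumes a non-empty rectangular matrix with non-zero column sums.
    static double[][] normalizeColumns(double[][] weights) {
        int rows = weights.length;
        int cols = weights[0].length;
        double[][] out = new double[rows][cols];
        for (int c = 0; c < cols; c++) {
            double sum = 0.0;
            for (int r = 0; r < rows; r++) {
                sum += weights[r][c];
            }
            for (int r = 0; r < rows; r++) {
                out[r][c] = weights[r][c] / sum;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // First row of the sample data shown in the question.
        System.out.println(vocabSize("1 2 6 0 2 3 1 1 0 0 3")); // prints 11

        // Made-up 2 x 2 weights (rows = terms, columns = topics).
        double[][] p = normalizeColumns(new double[][] {{1.0, 3.0}, {3.0, 1.0}});
        System.out.println(p[0][0] + " " + p[1][0]); // prints 0.25 0.75
    }
}
```

After normalisation each column can be read as proportions over the 11 terms, and multiplying by 100 gives percentages.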