apache spark - Trouble in understanding the LDA topic model in MLlib


I have trouble understanding the LDA topic model results in Spark MLlib.

To my understanding, the result should look like the following:

    topic 1: term1, term2, term....
    topic 2: term1, term2, term3...
    ...
    topic n: term1, ........

    doc1 : topic1, topic2,...
    doc2 : topic1, topic2,...
    doc3 : topic1, topic2,...
    ...
    docn : topic1, topic2,...

I applied LDA to the sample data of Spark MLlib, which looks like this:

    1 2 6 0 2 3 1 1 0 0 3
    1 3 0 1 3 0 0 2 0 0 1
    1 4 1 0 0 4 9 0 1 2 0
    2 1 0 3 0 0 5 0 2 3 9
    3 1 1 9 3 0 2 0 0 1 3
    4 2 0 3 4 5 1 1 1 4 0
    2 1 0 3 0 0 5 0 2 2 9
    1 1 1 9 2 1 2 0 0 1 3
    4 4 0 3 4 2 1 3 0 0 0
    2 8 2 0 3 0 2 0 2 7 2
    1 1 1 9 0 2 2 0 0 3 3
    4 1 0 0 4 5 1 3 0 1 0

Afterwards I get the following results:

    topics: org.apache.spark.mllib.linalg.Matrix =
    10.33743440804936   9.104197117225599   6.5583684747250395
    6.342536927434482   12.486281081997593  10.171181990567925
    2.1728012328444692  2.1939589470020042  7.633239820153526
    17.858082227094904  9.405347532724434   12.736570240180663
    13.226180094790433  3.9570395921153536  7.816780313094214
    6.155778858763581   10.224730593556806  5.619490547679611
    7.834725138351118   15.52628918346391   7.63898567818497
    4.419396221560405   3.072221927676895   2.5083818507627
    1.4984991123084432  3.5227422247618927  2.978758662929664
    5.696963722524612   7.254625667071781   11.048410610403607
    11.080658179168758  10.11489350657456   11.804448314256682

Each column is the term distribution of a topic. There are 3 topics in total, and each topic is a distribution over 11 vocabulary terms.

I think there are 12 documents, each of which has 11 vocabulary terms. My trouble is:

  • How can I find the topic distribution of each document?
  • Why does each topic have a distribution over 11 vocabulary terms when there are only 10 different values (0-9) in the data?
  • Why is the sum of each column not equal to 100 (meaning 100%, according to my understanding)?

You can get the topic distribution of each document by calling DistributedLDAModel.topicDistributions() or DistributedLDAModel.javaTopicDistributions() in Spark 1.4. This only works if the model's optimizer is set to EMLDAOptimizer (the default).

There is an example here, and the documentation is here.

In Java it looks like this:

    // The cast is valid only when the model was trained with the EM optimizer.
    LDAModel ldaModel = lda.setK(k.intValue()).run(corpus);
    JavaPairRDD<Long, Vector> topic_dist_over_docs =
        ((DistributedLDAModel) ldaModel).javaTopicDistributions();
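Once collected to the driver, each entry pairs a document ID with a vector of k topic weights. The sketch below is plain Java with no Spark dependency and shows one way to print such pairs in the "doc : topic, topic, ..." form the question asks for. The class name `TopicDistPrinter`, the `format` helper, and the sample weights are all made up for illustration; they are not output from the model above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class TopicDistPrinter {

    // Formats one document's topic weights, listing topics from the
    // largest weight to the smallest.
    static String format(long docId, double[] weights) {
        List<Integer> order = new ArrayList<>();
        for (int t = 0; t < weights.length; t++) {
            order.add(t);
        }
        // Sort topic indices by descending weight (stable for ties).
        order.sort(Comparator.comparingDouble((Integer t) -> -weights[t]));

        StringBuilder sb = new StringBuilder("doc" + docId + " :");
        for (int i = 0; i < order.size(); i++) {
            int t = order.get(i);
            sb.append(i == 0 ? " " : ", ");
            sb.append(String.format(Locale.ROOT, "topic%d (%.2f)", t, weights[t]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical weights standing in for collected (docId, Vector) pairs.
        Map<Long, double[]> topicDistOverDocs = new LinkedHashMap<>();
        topicDistOverDocs.put(0L, new double[] {0.20, 0.10, 0.70});
        topicDistOverDocs.put(1L, new double[] {0.55, 0.30, 0.15});

        for (Map.Entry<Long, double[]> e : topicDistOverDocs.entrySet()) {
            System.out.println(format(e.getKey(), e.getValue()));
        }
    }
}
```

With a real model, the pairs would presumably come from topic_dist_over_docs.collect(), with each Vector converted through toArray() before being passed to a helper like this one.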

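As for the second question:

The LDA model returns a probability distribution over the words in the vocabulary for each topic. So you have 3 topics (three columns), each with 11 rows (one for each word in the vocabulary), because the vocabulary size is 11. Each line of the input is a document's vector of word counts, so the 11 positions on a line are the 11 vocabulary terms, and the digits 0-9 are counts rather than word IDs.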
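The third question is not answered above. One likely explanation, stated here as an assumption rather than something the source confirms, is that the entries of the topics matrix are unnormalized term weights and not percentages. Under that assumption, dividing each column by its own sum turns it into a distribution that sums to 1. The sketch below is plain Java with no Spark dependency; the class name `TopicColumns` and its helpers are hypothetical, and the matrix values are copied from the output in the question.

```java
import java.util.Locale;

public class TopicColumns {

    // 11 terms (rows) by 3 topics (columns), copied from the question.
    static final double[][] TOPICS = {
        {10.33743440804936,  9.104197117225599,  6.5583684747250395},
        {6.342536927434482,  12.486281081997593, 10.171181990567925},
        {2.1728012328444692, 2.1939589470020042, 7.633239820153526},
        {17.858082227094904, 9.405347532724434,  12.736570240180663},
        {13.226180094790433, 3.9570395921153536, 7.816780313094214},
        {6.155778858763581,  10.224730593556806, 5.619490547679611},
        {7.834725138351118,  15.52628918346391,  7.63898567818497},
        {4.419396221560405,  3.072221927676895,  2.5083818507627},
        {1.4984991123084432, 3.5227422247618927, 2.978758662929664},
        {5.696963722524612,  7.254625667071781,  11.048410610403607},
        {11.080658179168758, 10.11489350657456,  11.804448314256682}
    };

    // Sum of each column of a rows-by-columns matrix.
    static double[] columnSums(double[][] m) {
        double[] sums = new double[m[0].length];
        for (double[] row : m) {
            for (int j = 0; j < row.length; j++) {
                sums[j] += row[j];
            }
        }
        return sums;
    }

    // Returns a copy in which every column is divided by its own sum.
    static double[][] normalizeColumns(double[][] m) {
        double[] sums = columnSums(m);
        double[][] out = new double[m.length][m[0].length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                out[i][j] = m[i][j] / sums[j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] raw = columnSums(TOPICS);
        double[] norm = columnSums(normalizeColumns(TOPICS));
        for (int j = 0; j < raw.length; j++) {
            System.out.printf(Locale.ROOT,
                "topic %d: raw sum = %.2f, normalized sum = %.2f%n",
                j, raw[j], norm[j]);
        }
    }
}
```

Multiplying the normalized values by 100 would give the percentages the question expects.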

