python - PySpark: How do I install a Linux command-line tool on workers?


I am trying to use the Linux command-line tool 'poppler' to extract information from PDF files. I want to do this for a huge number of PDFs on several Spark workers. I need to use poppler, not PyPDF or the like.

Does anybody know how to install poppler on the workers? I know that I can make command-line calls from within Python and fetch the output (or fetch the file generated by the poppler lib), but how do I install it on each worker? I'm using Spark 1.3.1 (Databricks).

Thank you!

The proper way is to install it on the workers when you set them up, just as you would install any other Linux application. As you pointed out, you can then shell out from within Python.
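For illustration, shelling out on the workers could look roughly like the sketch below. It rests on several assumptions that are not in the question: poppler's `pdftotext` is already on the PATH of every worker, the PDF paths are readable from every worker (for example via a shared mount), and `sc` is an existing SparkContext. The helper names are made up for the example.

```python
import subprocess


def pdftotext_command(pdf_path, binary="pdftotext"):
    """Build the argument list; the trailing "-" makes pdftotext write to stdout."""
    return [binary, "-layout", pdf_path, "-"]


def extract_text(pdf_path, binary="pdftotext"):
    """Run pdftotext on one file and return whatever it printed."""
    return subprocess.check_output(pdftotext_command(pdf_path, binary))


def extract_partition(pdf_paths):
    """Runs on a worker: convert every path in one partition."""
    for path in pdf_paths:
        yield path, extract_text(path)


def run(sc, pdf_paths):
    """Distribute the paths (assumed visible to all workers) and collect the results."""
    return sc.parallelize(pdf_paths).mapPartitions(extract_partition).collect()
```

Calling `run(sc, ["/mnt/pdfs/a.pdf", "/mnt/pdfs/b.pdf"])` (hypothetical paths) would return a list of `(path, text)` pairs. `mapPartitions` is used here only so that each task handles a batch of files; a plain `map` would work as well.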

If that is not an option for whatever reason, you can ship files to the workers using the addFile method: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.sparkcontext.addfile

Note that the latter approach does not take care of dependencies (libraries etc.).
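A minimal sketch of the addFile route, under the assumption that you have a `pdftotext` binary on the driver that can actually run on the workers (same architecture, and the shared libraries it needs are present there, which is exactly the dependency caveat above). `sc`, `local_binary` and the helper names are hypothetical. The `chmod` step is only a precaution in case the shipped copy is not marked executable.

```python
import os
import stat
import subprocess


def make_executable(path):
    """Set the owner-execute bit on the file and return its path."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR)
    return path


def run(sc, local_binary, pdf_paths):
    """Ship local_binary to the nodes, then call it on each PDF path."""
    sc.addFile(local_binary)  # distributes the file with the job
    name = os.path.basename(local_binary)

    def extract_partition(paths):
        # Imported on the worker; SparkFiles.get resolves the local copy by file name.
        from pyspark import SparkFiles
        binary = make_executable(SparkFiles.get(name))
        for path in paths:
            # "-" makes pdftotext write the text to stdout.
            yield path, subprocess.check_output([binary, path, "-"])

    return sc.parallelize(pdf_paths).mapPartitions(extract_partition).collect()
```

As with shelling out to an installed tool, the PDF paths are assumed to be readable from every worker. If the binary fails to start on a worker because a library is missing, installing poppler properly on the workers is the more reliable fix.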

