Hi all,

I have a small Hadoop cluster where I have stored a lot of data in Parquet files. I have installed a Spark master service on one of the nodes and would now like to query my Parquet files from a Spark client. When I run the following program from the spark-shell on the Spark Master node, everything works correctly:

    val sqlCont = new org.apache.spark.sql.SQLContext(sc)
    val reader = sqlCont.read
    val dataFrame = reader.parquet("/user/hdfs/parquet-multi/BICC")
    dataFrame.registerTempTable("BICC")
    val recSet = sqlCont.sql("SELECT protocolCode,beginTime,endTime,called,calling FROM BICC WHERE endTime>=1449421800000000 AND endTime<=1449422400000000 AND calling='6287870642893' AND p_endtime=1449422400000000")
    recSet.show()

But when I run the Java program below from my client, I get:

    Exception in thread "main" java.lang.AssertionError: assertion failed: No predefined schema found, and no Parquet data files or summary files found under file:/user/hdfs/parquet-multi/BICC.

The exception occurs at the line:

    DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");

On the Master node I can see the client connect when the SparkContext is instantiated, as I get the following lines in the Spark log:

    16/01/07 18:27:47 INFO Master: Registering app SparkTest
    16/01/07 18:27:47 INFO Master: Registered app SparkTest with ID app-20160107182747-00801

If I create a local directory with the given path, my program appears to loop endlessly, printing the following warning on the console:

    WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

To me it seems that my SQLContext does not connect to the Spark Master, but tries to work locally on the client, where the requested files do not exist.
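
The `file:/` prefix in the error message suggests that the path, which has no scheme, is being resolved against the client's local filesystem rather than HDFS. The following is only a minimal sketch of that effect using the JDK's `java.net.URI`; it does not call Spark or Hadoop, and it only approximates how a scheme-less path is qualified with a default filesystem. The class name `PathResolution`, the method `qualify`, and the address `hdfs://namenode.example:8020` are hypothetical placeholders, not values from my cluster.

```java
import java.net.URI;

class PathResolution {

    // Qualify a path against a default filesystem URI.
    // A path that already carries a scheme (e.g. hdfs://...) is returned unchanged;
    // a scheme-less path inherits the scheme and authority of the default.
    static URI qualify(String defaultFs, String path) {
        URI p = URI.create(path);
        if (p.getScheme() != null) {
            return p;
        }
        return URI.create(defaultFs).resolve(p);
    }

    public static void main(String[] args) {
        String path = "/user/hdfs/parquet-multi/BICC";

        // With a local default filesystem, the path ends up under the file scheme.
        System.out.println(qualify("file:///", path));

        // With an HDFS default (hypothetical NameNode address), the same path points into HDFS.
        System.out.println(qualify("hdfs://namenode.example:8020", path));
    }
}
```

If this is what happens on the client, then passing a fully qualified `hdfs://` URI to `reader.parquet(...)`, or making the cluster's Hadoop configuration visible to the client, might be worth trying; I have not verified either.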

Java program:

    SparkConf conf = new SparkConf()
        .setAppName("SparkTest")
        .setMaster("spark://172.27.13.57:7077");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);
    DataFrameReader reader = sqlContext.read();
    DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
    DataFrame filtered = df.filter("endTime>=1449421800000000 AND endTime<=1449422400000000 AND calling='6287870642893' AND p_endtime=1449422400000000");
    filtered.show();

Is there anyone who can help me?

Henrik