I solved the problem. I needed to tell the SparkContext about my Hadoop setup, so now my program is as follows:
SparkConf conf = new SparkConf()
    .setAppName("SparkTest")
    .setMaster("spark://172.27.13.57:7077")
    .set("spark.executor.memory", "2g")    // assign 2 GB of RAM to our job on each Worker
    .set("spark.driver.port", "51810");    // fix the port the driver listens on, good for firewalls!
JavaSparkContext sc = new JavaSparkContext(conf);

// Tell Spark about our Hadoop environment
File coreSite = new File("/etc/hadoop/conf/core-site.xml");
File hdfsSite = new File("/etc/hadoop/conf/hdfs-site.xml");
Configuration hConf = sc.hadoopConfiguration();
hConf.addResource(new Path(coreSite.getAbsolutePath()));
hConf.addResource(new Path(hdfsSite.getAbsolutePath()));

SQLContext sqlContext = new SQLContext(sc);
DataFrameReader reader = sqlContext.read();
DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
DataFrame filtered = df.filter("endTime>=1449421800000000 AND endTime<=1449422400000000 AND calling='6287870642893' AND p_endtime=1449422400000000");
filtered.show();

Henrik

On 07/01/2016 19:41, Ewan Leith wrote:
>
> Try the path
>
> "hdfs:///user/hdfs/parquet-multi/BICC"
>
> Thanks,
> Ewan
>
> ------ Original message ------
> From: Henrik Baastrup
> Date: Thu, 7 Jan 2016 17:54
> To: user@spark.apache.org
> Cc: Baastrup, Henrik
> Subject: Problems with reading data from parquet files in a HDFS remotely
>
> Hi All,
>
> I have a small Hadoop cluster where I have stored a lot of data in parquet
> files. I have installed a Spark master service on one of the nodes and now
> would like to query my parquet files from a Spark client. When I run the
> following program from the spark-shell on the Spark Master node, everything
> works correctly:
>
> # val sqlCont = new org.apache.spark.sql.SQLContext(sc)
> # val reader = sqlCont.read
> # val dataFrame = reader.parquet("/user/hdfs/parquet-multi/BICC")
> # dataFrame.registerTempTable("BICC")
> # val recSet = sqlCont.sql("SELECT protocolCode,beginTime,endTime,called,calling FROM BICC WHERE endTime>=1449421800000000 AND endTime<=1449422400000000 AND calling='6287870642893' AND p_endtime=1449422400000000")
> # recSet.show()
>
> But when I run the Java program below from my client, I get:
>
> Exception in thread "main" java.lang.AssertionError: assertion failed: No
> predefined schema found, and no Parquet data files or summary files found
> under file:/user/hdfs/parquet-multi/BICC.
>
> The exception occurs at the line:
>
> DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
>
> On the Master node I can see the client connect when the SparkContext is
> instantiated, as I get the following lines in the Spark log:
>
> 16/01/07 18:27:47 INFO Master: Registering app SparkTest
> 16/01/07 18:27:47 INFO Master: Registered app SparkTest with ID app-20160107182747-00801
>
> If I create a local directory with the given path, my program goes into an
> endless loop, with the following warning on the console:
>
> WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources;
> check your cluster UI to ensure that workers are registered and have
> sufficient resources
>
> To me it seems that my SQLContext does not connect to the Spark Master, but
> tries to work locally on the client, where the requested files do not exist.
>
> Java program:
>
> SparkConf conf = new SparkConf()
>     .setAppName("SparkTest")
>     .setMaster("spark://172.27.13.57:7077");
> JavaSparkContext sc = new JavaSparkContext(conf);
> SQLContext sqlContext = new SQLContext(sc);
>
> DataFrameReader reader = sqlContext.read();
> DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
> DataFrame filtered = df.filter("endTime>=1449421800000000 AND endTime<=1449422400000000 AND calling='6287870642893' AND p_endtime=1449422400000000");
> filtered.show();
>
> Is there someone who can help me?
>
> Henrik
>
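
For reference, the "under file:/user/hdfs/parquet-multi/BICC" in the exception shows the path being resolved against the client's local filesystem, because the driver had no Hadoop configuration telling it where HDFS lives. Besides adding core-site.xml and hdfs-site.xml as in the fix at the top of this reply, the same effect can be sketched by setting fs.defaultFS on the Hadoop configuration, or by spelling out the filesystem in the path (in the spirit of Ewan's suggestion). A minimal sketch, assuming a hypothetical NameNode address of namenode.example.com:8020, which is not given in the thread:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class SparkHdfsPathSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("SparkTest")
                .setMaster("spark://172.27.13.57:7077");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Point the default filesystem at the cluster's NameNode so that bare
        // paths such as "/user/hdfs/..." resolve against HDFS instead of file:/.
        // The host and port below are placeholders for the real NameNode address.
        sc.hadoopConfiguration().set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        SQLContext sqlContext = new SQLContext(sc);

        // Alternatively, skip fs.defaultFS and qualify the filesystem in the path itself:
        DataFrame df = sqlContext.read()
                .parquet("hdfs://namenode.example.com:8020/user/hdfs/parquet-multi/BICC");
        df.show();
    }
}

Loading the cluster's own core-site.xml and hdfs-site.xml, as done in the working program above, remains the more complete option, since it also carries any other HDFS client settings (HA NameNodes, security, etc.) the cluster may rely on.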