Problems with reading data from parquet files in a HDFS remotely

Henrik Baastrup Thu, 07 Jan 2016 09:55:21 -0800

Hi All,

I have a small Hadoop cluster where I have stored a lot of data in parquet 
files. I have installed a Spark master service on one of the nodes and now 
would like to query my parquet files from a Spark client. When I run the 
following program from the spark-shell on the Spark Master node, everything 
works correctly:


# val sqlCont = new org.apache.spark.sql.SQLContext(sc)
# val reader = sqlCont.read
# val dataFrame = reader.parquet("/user/hdfs/parquet-multi/BICC")
# dataFrame.registerTempTable("BICC")
# val recSet = sqlCont.sql("SELECT protocolCode,beginTime,endTime,called,calling FROM BICC " +
#   "WHERE endTime>=1449421800000000 AND endTime<=1449422400000000 " +
#   "AND calling='6287870642893' AND p_endtime=1449422400000000")
# recSet.show()  

But when I run the Java program below, from my client, I get: 

Exception in thread "main" java.lang.AssertionError: assertion failed: No 
predefined schema found, and no Parquet data files or summary files found under 
file:/user/hdfs/parquet-multi/BICC.

The exception occurs at the line: DataFrame df = 
reader.parquet("/user/hdfs/parquet-multi/BICC");

On the Master node I can see the client connect when the SparkContext is 
instantiated, as I get the following lines in the Spark log:

16/01/07 18:27:47 INFO Master: Registering app SparkTest
16/01/07 18:27:47 INFO Master: Registered app SparkTest with ID 
app-20160107182747-00801

If I create a local directory with the given path on the client, my program 
goes into an endless loop, repeating the following warning on the console:

WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; 
check your cluster UI to ensure that workers are registered and have sufficient 
resources

To me it seems that my SQLContext does not connect to the Spark Master, but 
tries to work locally on the client, where the requested files do not exist.
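
The "file:" prefix in the exception suggests that the scheme-less path is being 
qualified against the client's default filesystem. Hadoop's default value for 
fs.defaultFS is file:///, so a client without the cluster configuration would 
resolve the path locally. Below is a minimal stand-alone sketch of that 
resolution using only java.net.URI. It is a simplified model, not Hadoop's 
actual code; the class PathQualifier, the method qualify and the NameNode 
address namenode.example:8020 are hypothetical names introduced for 
illustration.

```java
import java.net.URI;

// Simplified model of how a path is qualified against a default filesystem.
// This is NOT Hadoop's implementation; it only mimics the visible behaviour.
class PathQualifier {

    // Resolve `path` against `defaultFs`. A path that already carries a scheme
    // (for example hdfs://...) is returned unchanged.
    static URI qualify(String defaultFs, String path) {
        return URI.create(defaultFs).resolve(URI.create(path));
    }

    public static void main(String[] args) {
        String path = "/user/hdfs/parquet-multi/BICC";

        // Client without cluster configuration: Hadoop's default for fs.defaultFS.
        System.out.println(qualify("file:///", path));

        // Hypothetical NameNode address, standing in for the cluster's real fs.defaultFS.
        System.out.println(qualify("hdfs://namenode.example:8020/", path));

        // A fully qualified path does not depend on the default at all.
        System.out.println(qualify("file:///", "hdfs://namenode.example:8020" + path));
    }
}
```

If this model matches what happens on the client, the things to try would be 
passing a fully qualified hdfs:// path to reader.parquet(...), or making the 
cluster's Hadoop configuration (core-site.xml) visible to the client.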

Java program:
        SparkConf conf = new SparkConf()
                .setAppName("SparkTest")
                .setMaster("spark://172.27.13.57:7077");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);
        
        DataFrameReader reader = sqlContext.read();
        DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
        DataFrame filtered = df.filter("endTime>=1449421800000000 AND "
                + "endTime<=1449422400000000 AND calling='6287870642893' AND "
                + "p_endtime=1449422400000000");
        filtered.show();

Is there someone who can help me?

Henrik
