Hi Ewan,

Thank you for your answer.
I have already tried what you suggested.

If I use:
    "hdfs://172.27.13.57:7077/user/hdfs/parquet-multi/BICC"
I get an AssertionError:
    Exception in thread "main" java.lang.AssertionError: assertion
failed: No predefined schema found, and no Parquet data files or summary
files found under hdfs://172.27.13.57:7077/user/hdfs/parquet-multi/BICC.
Note: The IP address of my Spark Master is: 172.27.13.57

If I do exactly as you suggest:
    "hdfs:///user/hdfs/parquet-multi/BICC"
I get an IOException:
    Exception in thread "main" java.io.IOException: Incomplete HDFS URI,
no host: hdfs:///user/hdfs/parquet-multi/BICC
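
My understanding is that the scheme-only form hdfs:/// can only be resolved
when the client knows the default filesystem (fs.defaultFS). Here is a minimal
sketch of what I believe setting it explicitly on the SparkContext would look
like; note that "namenode-host" and 8020 below are placeholders, not values
from my cluster:

    SparkConf conf = new SparkConf()
            .setAppName("SparkTest")
            .setMaster("spark://172.27.13.57:7077");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // "namenode-host:8020" is a placeholder: 8020 is only the common
    // namenode RPC default; the real value is whatever fs.defaultFS
    // says in the cluster's core-site.xml.
    sc.hadoopConfiguration().set("fs.defaultFS", "hdfs://namenode-host:8020");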

To me it seems that the Spark library tries to resolve the URI locally. I
suspect I am missing something in my configuration of the SparkContext, but
I do not know what.
Or could it be that I am using the wrong port in the hdfs:// URI above?
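
To check which host and port the Hadoop client actually expects, I believe
something like the following should print the configured default filesystem.
This is only a sketch, assuming the Hadoop client jars and the cluster's
core-site.xml are on the classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class CheckDefaultFs {
        public static void main(String[] args) throws Exception {
            // Loads core-site.xml / hdfs-site.xml if they are on the classpath.
            Configuration conf = new Configuration();
            // fs.defaultFS (fs.default.name on older Hadoop) holds the
            // namenode URI, e.g. hdfs://some-host:8020
            System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
            System.out.println("FileSystem URI = " + FileSystem.get(conf).getUri());
        }
    }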

Henrik




On 07/01/2016 19:41, Ewan Leith wrote:
>
> Try the path
>
>
> "hdfs:///user/hdfs/parquet-multi/BICC"
> Thanks,
> Ewan
>
>
> ------ Original message ------
> From: Henrik Baastrup
> Date: Thu, 7 Jan 2016 17:54
> To: user@spark.apache.org
> Cc: Baastrup, Henrik
> Subject: Problems with reading data from parquet files in a HDFS remotely
>
>
> Hi All,
>
> I have a small Hadoop cluster where I have stored a lot of data in parquet 
> files. I have installed a Spark master service on one of the nodes and now 
> would like to query my parquet files from a Spark client. When I run the 
> following program from the spark-shell on the Spark Master node, everything
> works correctly:
>
> scala> val sqlCont = new org.apache.spark.sql.SQLContext(sc)
> scala> val reader = sqlCont.read
> scala> val dataFrame = reader.parquet("/user/hdfs/parquet-multi/BICC")
> scala> dataFrame.registerTempTable("BICC")
> scala> val recSet = sqlCont.sql("SELECT protocolCode,beginTime,endTime,called,calling FROM BICC WHERE endTime>=1449421800000000 AND endTime<=1449422400000000 AND calling='6287870642893' AND p_endtime=1449422400000000")
> scala> recSet.show()
>
> But when I run the Java program below, from my client, I get: 
>
> Exception in thread "main" java.lang.AssertionError: assertion failed: No 
> predefined schema found, and no Parquet data files or summary files found 
> under file:/user/hdfs/parquet-multi/BICC.
>
> The exception occurs at the line: DataFrame df = 
> reader.parquet("/user/hdfs/parquet-multi/BICC");
>
> On the Master node I can see the client connect when the SparkContext is 
> instanced, as I get the following lines in the Spark log:
>
> 16/01/07 18:27:47 INFO Master: Registering app SparkTest
> 16/01/07 18:27:47 INFO Master: Registered app SparkTest with ID 
> app-20160107182747-00801
>
> If I create a local directory with the given path, my program goes into an 
> endless loop, with the following warning on the console:
>
> WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; 
> check your cluster UI to ensure that workers are registered and have 
> sufficient resources
>
> To me it seems that my SQLContext does not connect to the Spark Master, but 
> tries to work locally on the client, where the requested files do not exist.
>
> Java program:
>       import org.apache.spark.SparkConf;
>       import org.apache.spark.api.java.JavaSparkContext;
>       import org.apache.spark.sql.DataFrame;
>       import org.apache.spark.sql.DataFrameReader;
>       import org.apache.spark.sql.SQLContext;
>
>       // Connect to the stand-alone Spark master.
>       SparkConf conf = new SparkConf()
>               .setAppName("SparkTest")
>               .setMaster("spark://172.27.13.57:7077");
>       JavaSparkContext sc = new JavaSparkContext(conf);
>       SQLContext sqlContext = new SQLContext(sc);
>
>       // Read the Parquet data and filter it.
>       DataFrameReader reader = sqlContext.read();
>       DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
>       DataFrame filtered = df.filter("endTime>=1449421800000000 AND endTime<=1449422400000000 AND calling='6287870642893' AND p_endtime=1449422400000000");
>       filtered.show();
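>
> For illustration, here is how I understand a fully qualified read would
> look. This is only a sketch: "namenode-host" and 8020 are placeholders
> (8020 being just the common namenode RPC default), and the real authority
> should come from fs.defaultFS in core-site.xml. Without a scheme the path
> seems to resolve against the driver's default filesystem, which would
> explain the "file:/user/hdfs/..." in the exception above:
>
>       // Hypothetical authority; replace with the namenode host:port
>       // from the cluster's fs.defaultFS setting.
>       DataFrame df = reader.parquet("hdfs://namenode-host:8020/user/hdfs/parquet-multi/BICC");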
>
> Is there someone who can help me?
>
> Henrik
>
