I solved the problem. I needed to tell the SparkContext about my Hadoop setup, so now my program is as follows:
SparkConf conf = new SparkConf()
    .setAppName("SparkTest")
    .setMaster("spark://172.27.13.57:7077")
    .set("spark.executor.memory", "2g")    // assign 2 GB of RAM to our job on each Worker
    .set("spark.driver.port", "51810");    // fix the port the driver listens on, good for firewalls!
JavaSparkContext sc = new JavaSparkContext(conf);

// Tell Spark about our Hadoop environment
File coreSite = new File("/etc/hadoop/conf/core-site.xml");
File hdfsSite = new File("/etc/hadoop/conf/hdfs-site.xml");
Configuration hConf = sc.hadoopConfiguration();
hConf.addResource(new Path(coreSite.getAbsolutePath()));
hConf.addResource(new Path(hdfsSite.getAbsolutePath()));

SQLContext sqlContext = new SQLContext(sc);
DataFrameReader reader = sqlContext.read();
DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
DataFrame filtered = df.filter("endTime>=1449421800000000 AND endTime<=1449422400000000 AND calling='6287870642893' AND p_endtime=1449422400000000");
filtered.show();

Henrik

On 07/01/2016 19:41, Ewan Leith wrote:
>
> Try the path
>
> "hdfs:///user/hdfs/parquet-multi/BICC"
>
> Thanks,
> Ewan
>
> ------ Original message ------
> From: Henrik Baastrup
> Date: Thu, 7 Jan 2016 17:54
> To: user@spark.apache.org
> Cc: Baastrup, Henrik
> Subject: Problems with reading data from parquet files in a HDFS remotely
>
> Hi All,
>
> I have a small Hadoop cluster where I have stored a lot of data in parquet
> files. I have installed a Spark master service on one of the nodes and now
> would like to query my parquet files from a Spark client. When I run the
> following program from the spark-shell on the Spark Master node, everything
> works correctly:
>
> # val sqlCont = new org.apache.spark.sql.SQLContext(sc)
> # val reader = sqlCont.read
> # val dataFrame = reader.parquet("/user/hdfs/parquet-multi/BICC")
> # dataFrame.registerTempTable("BICC")
> # val recSet = sqlCont.sql("SELECT protocolCode,beginTime,endTime,called,calling FROM BICC WHERE endTime>=1449421800000000 AND endTime<=1449422400000000 AND calling='6287870642893' AND p_endtime=1449422400000000")
> # recSet.show()
>
> But when I run the Java program below from my client, I get:
>
> Exception in thread "main" java.lang.AssertionError: assertion failed: No
> predefined schema found, and no Parquet data files or summary files found
> under file:/user/hdfs/parquet-multi/BICC.
>
> The exception occurs at the line:
>
> DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
>
> On the Master node I can see the client connect when the SparkContext is
> instantiated, as I get the following lines in the Spark log:
>
> 16/01/07 18:27:47 INFO Master: Registering app SparkTest
> 16/01/07 18:27:47 INFO Master: Registered app SparkTest with ID app-20160107182747-00801
>
> If I create a local directory with the given path, my program goes into an
> endless loop, with the following warning on the console:
>
> WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources;
> check your cluster UI to ensure that workers are registered and have
> sufficient resources
>
> To me it seems that my SQLContext does not connect to the Spark Master, but
> tries to work locally on the client, where the requested files do not exist.
>
> Java program:
>
> SparkConf conf = new SparkConf()
>     .setAppName("SparkTest")
>     .setMaster("spark://172.27.13.57:7077");
> JavaSparkContext sc = new JavaSparkContext(conf);
> SQLContext sqlContext = new SQLContext(sc);
>
> DataFrameReader reader = sqlContext.read();
> DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
> DataFrame filtered = df.filter("endTime>=1449421800000000 AND endTime<=1449422400000000 AND calling='6287870642893' AND p_endtime=1449422400000000");
> filtered.show();
>
> Is there someone who can help me?
>
> Henrik
>
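
For reference, the "under file:/user/hdfs/parquet-multi/BICC" in the exception shows the path being resolved against the client's local filesystem, because the driver had no Hadoop configuration telling it where HDFS lives. Besides adding core-site.xml and hdfs-site.xml as in the fix at the top of this reply, the same effect can be sketched by setting fs.defaultFS on the Hadoop configuration, or by spelling out the filesystem in the path (in the spirit of Ewan's suggestion). A minimal sketch, assuming a hypothetical NameNode address of namenode.example.com:8020, which is not given in the thread:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class SparkHdfsPathSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("SparkTest")
                .setMaster("spark://172.27.13.57:7077");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Point the default filesystem at the cluster's NameNode so that bare
        // paths such as "/user/hdfs/..." resolve against HDFS instead of file:/.
        // The host and port below are placeholders for the real NameNode address.
        sc.hadoopConfiguration().set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        SQLContext sqlContext = new SQLContext(sc);

        // Alternatively, skip fs.defaultFS and qualify the filesystem in the path itself:
        DataFrame df = sqlContext.read()
                .parquet("hdfs://namenode.example.com:8020/user/hdfs/parquet-multi/BICC");
        df.show();
    }
}

Loading the cluster's own core-site.xml and hdfs-site.xml, as done in the working program above, remains the more complete option, since it also carries any other HDFS client settings (HA NameNodes, security, etc.) the cluster may rely on.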