Hello Sparkers,
I'm reading data from a CSV file, applying some transformations, and ending
up with an RDD of (String, Iterable<>) pairs.
I have already prepared Parquet files. I now want to take the previous
(key, value) RDD and populate the Parquet files as follows:
- key holds the name of the
Hi,
I have a pair RDD of the form: (mykey, (value1, value2))
How can I create a DataFrame with the schema [V1 String, V2 String] to
store [value1, value2] and save it into a Parquet table named "mykey"?
The /createDataFrame()/ method takes an RDD and a schema (StructType) as
parameters. The sc
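A minimal sketch of the pattern being asked about, using the Spark 1.x API that this thread is written against. The RDD name, table/path names, and the use of `groupByKey` plus a driver-side loop are assumptions for illustration, not the poster's actual code:

```scala
// Sketch: build a DataFrame with an explicit schema from a pair RDD of
// the form (key, (value1, value2)) and write one Parquet table per key.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}

val schema = StructType(Seq(
  StructField("V1", StringType, nullable = true),
  StructField("V2", StringType, nullable = true)))

// pairRDD: RDD[(String, (String, String))] -- placeholder name
pairRDD.groupByKey().collect().foreach { case (key, values) =>
  val rows = sc.parallelize(values.toSeq).map { case (v1, v2) => Row(v1, v2) }
  val df = sqlContext.createDataFrame(rows, schema)
  df.write.parquet(s"/tmp/parquet/$key")  // one Parquet directory per key
}
```

Note that `collect()` pulls all keys to the driver, which only works when the number of distinct keys is small; for many keys, partitioned writes are the usual alternative.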
Hi all,
I'm running SQL queries (sqlContext.sql()) on Parquet tables and facing
a problem with table caching (sqlContext.cacheTable()), using
spark-shell of Spark 1.5.1.
After I run the sqlContext.cacheTable(table), the sqlContext.sql(query)
takes longer the first time (well, for the lazy exe
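This matches how `cacheTable()` behaves in Spark 1.x: it is lazy, so the first query after the call pays the one-time cost of building the in-memory columnar cache. A common sketch for getting comparable timings, with `my_table` and the `id` column as placeholder names:

```scala
// cacheTable() only marks the table; nothing is cached yet.
sqlContext.cacheTable("my_table")

// A cheap action materializes the cache up front:
val n = sqlContext.sql("SELECT COUNT(*) FROM my_table")
  .collect()(0).getLong(0)

// Subsequent queries scan the in-memory columnar cache:
sqlContext.sql("SELECT * FROM my_table WHERE id = 42").show()
```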
If you can show a snippet of your code, that would help give us more clues.
Thanks
On Mar 24, 2016, at 2:43 AM, Mohamed Nadjib MAMI wrote:
Hi all,
I'm running SQL queries (sqlContext.sql()) on Parquet tables and facing a
problem with table caching (sqlContext.cacheTable()), using spark-she
6.id=p.id ORDER BY p.`bbb` LIMIT 10"
On 24.03.2016 22:16, Ted Yu wrote:
Can you obtain output from explain(true) on the query after
cacheTable() call ?
Potentially related JIRA:
[SPARK-13657] [SQL] Support parsing very long AND/OR expressions
On Thu, Mar 24, 2016 at 12:55 PM, Mo
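A sketch of how to get the plan Ted is asking for, and what to look for in it. The table name is a placeholder; the operator name is the one Spark 1.5 uses for cached scans:

```scala
sqlContext.cacheTable("my_table")
sqlContext.sql("SELECT COUNT(*) FROM my_table").collect()  // materialize the cache

// explain(true) prints the logical and physical plans; capturing the
// physical plan as a string lets you inspect it programmatically too.
val plan = sqlContext.sql("SELECT * FROM my_table")
  .queryExecution.executedPlan.toString
println(plan)
// If the cache is used, the plan shows "InMemoryColumnarTableScan";
// a plain Parquet scan instead means the cache is being bypassed.
```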
I noticed for most SQL queries (sqlContext.sql(query)) I ran on
Parquet tables that results are returned faster after the first and
second runs of the query. Is this variation normal, i.e., can two
executions of the same job take different times? Or are there some
intermediate results bein
Hello all,
I'm getting the famous /java.io.FileNotFoundException: ... (Too many
open files)/ exception. What seemed to help others hasn't worked
for me. I tried setting the ulimit via the command line /"ulimit
-n"/, then I tried adding the following lines to
/"/etc/security/limits.con
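For reference, a sketch of the usual two-step fix. A common pitfall is that `ulimit -n` only affects the current shell, so it has to be run in the shell that actually launches the Spark processes; the user name and limit value below are examples:

```shell
# Session-only: raise the open-file limit before starting spark-shell.
ulimit -n 65535

# Permanent: append to /etc/security/limits.conf (needs root), then
# log out and back in for pam_limits to pick it up:
#   sparkuser  soft  nofile  65535
#   sparkuser  hard  nofile  65535

# Verify the limit the current shell (and its children) will see:
ulimit -n
```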
Hello all,
Could someone please help me figure out what's wrong with my query over
Parquet tables? The query has the following form:
weird_query = "SELECT a._example.com/aa/1.1/aa_,
b._example.com/bb/1.2/bb_ FROM www$aa@aa a LEFT JOIN www$bb@bb b ON
a.http://example.de/cc=b.co
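One likely culprit, offered as an assumption since the full query is truncated: identifiers containing dots, slashes, `$`, `@`, or `://` must be backtick-quoted in Spark SQL, otherwise the parser splits them on `.` into table.column references (so `a.http://example.de/cc` is read as column `http` of table `a` followed by garbage). A sketch with example names shaped like the ones in the failing query:

```scala
// Backtick-quote every identifier that contains special characters.
val q = """SELECT a.`example.com/aa/1.1/aa`, b.`example.com/bb/1.2/bb`
           FROM `www$aa@aa` a
           LEFT JOIN `www$bb@bb` b
           ON a.`http://example.de/cc` = b.`http://example.de/cc`"""
sqlContext.sql(q).show()
```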
Your jars are not delivered to the workers. Have a look at this:
http://stackoverflow.com/questions/24052899/how-to-make-it-easier-to-deploy-my-jar-to-spark-cluster-in-standalone-mode
in Parquet
tables. Any help on solving or working around this would be much appreciated.
*Regards, Grüße, **Cordialement,** Recuerdos, Saluti, προσρήσεις, 问候,
تحياتي.*
*Mohamed Nadjib Mami*
*PhD Student - EIS Department - **Bonn University (Germany).*
*About me! <http://www.strikingly.com/mohame
Hello,
I've asked the following question [1] on Stack Overflow but haven't
gotten an answer yet. I'm now using this channel to give it more
visibility and hopefully find someone who can help.
"*Context.* I have tens of SQL queries stored in separate files. For
benchmarking purposes, I created an ap
I paste this right from Spark shell (Spark 2.1.0):
scala> spark.sql("SELECT count(distinct col) FROM Table").show()
+-------------------+
|count(DISTINCT col)|
+-------------------+
|               4697|
+-------------------+

scala> spark.sql
That was the case. Thanks for the quick and clean answer, Hemanth.
*Regards, Grüße, **Cordialement,** Recuerdos, Saluti, προσρήσεις, 问候,
تحياتي.*
*Mohamed Nadjib Mami*
*Research Associate @ Fraunhofer IAIS - PhD Student @ Bonn University*
*About me! <http://www.strikingly.com/mohamed-nadjib-m
Spark 2.1, so I think it doesn't include the new cost-based
optimizations (introduced in Spark 2.2).