Re: hiveContext.sql NullPointerException

2015-06-07 Thread Cheng Lian
Hi, This is expected behavior. HiveContext.sql (and also DataFrame.registerTempTable) is only expected to be invoked on driver side. However, the closure passed to RDD.foreach is executed on executor side, where no viable HiveContext instance exists. Cheng On 6/7/15 10:06 AM, patcharee wrot
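A minimal sketch of the driver-vs-executor distinction described above, spark-shell style (hiveContext is assumed to exist already; table and column names are illustrative):

    // Wrong: the closure below runs on executors, where no HiveContext exists,
    // so the call fails with a NullPointerException.
    // someRdd.foreach { zone => hiveContext.sql(s"SELECT count(*) FROM t WHERE zone = $zone") }

    // Right: issue the SQL once on the driver, then hand plain rows to the executors.
    val df = hiveContext.sql("SELECT zone, value FROM t")
    df.rdd.foreach(row => println(row))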

Re: Running SparkSql against Hive tables

2015-06-07 Thread Cheng Lian
On 6/6/15 9:06 AM, James Pirz wrote: I am pretty new to Spark, and using Spark 1.3.1, I am trying to use 'Spark SQL' to run some SQL scripts, on the cluster. I realized that for a better performance, it is a good idea to use Parquet files. I have 2 questions regarding that: 1) If I wanna us

Re: Avro or Parquet ?

2015-06-07 Thread Cheng Lian
Usually Parquet can be more efficient because of its columnar nature. Say your table has 10 columns but your join query only touches 3 of them, Parquet only reads those 3 columns from disk while Avro must load all data. Cheng On 6/5/15 3:00 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: We currently have data in a
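A small illustration of the column-pruning point (the path and column names are placeholders):

    // Suppose events.parquet has 10 columns; only the 3 referenced below are read from disk.
    val df = sqlContext.parquetFile("/data/events.parquet")
    df.select("user_id", "ts", "country").show()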

Re: Problem reading Parquet from 1.2 to 1.3

2015-06-07 Thread Cheng Lian
This issue has been fixed recently in Spark 1.4 https://github.com/apache/spark/pull/6581 Cheng On 6/5/15 12:38 AM, Marcelo Vanzin wrote: I talked to Don outside the list and he says that he's seeing this issue with Apache Spark 1.3 too (not just CDH Spark), so it seems like there is a real i

Re: NullPointerException SQLConf.setConf

2015-06-07 Thread Cheng Lian
Are you calling hiveContext.sql within an RDD.map closure or something similar? In this way, the call actually happens on executor side. However, HiveContext only exists on the driver side. Cheng On 6/4/15 3:45 PM, patcharee wrote: Hi, I am using Hive 0.14 and spark 0.13. I got java.lang.Nu

Re: Does Apache Spark maintain a columnar structure when creating RDDs from Parquet or ORC files?

2015-06-07 Thread Cheng Lian
For the following code: val df = sqlContext.parquetFile(path) `df` remains columnar (actually it just reads from the columnar Parquet file on disk). For the following code: val cdf = df.cache() `cdf` is also columnar but that's different from Parquet. When a DataFrame is cached, Spa

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-07 Thread Cheng Lian
Interesting, just posted on another thread asking exactly the same question :) My answer there quoted below: > For the following code: > > val df = sqlContext.parquetFile(path) > > `df` remains columnar (actually it just reads from the columnar Parquet file on disk). For the following code:

Re: spark sql - reading data from sql tables having space in column names

2015-06-07 Thread Cheng Lian
You can use backticks to quote the column names. Cheng On 6/3/15 2:49 AM, David Mitchell wrote: I am having the same problem reading JSON. There does not seem to be a way of selecting a field that has a space, "Executor Info" from the Spark logs. I suggest that we open a JIRA ticket to ad
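For example, assuming a temporary table named logs with a column literally called "Executor Info":

    val logsDf = sqlContext.table("logs")
    // Backticks let SQL reference a column whose name contains a space.
    sqlContext.sql("SELECT `Executor Info` FROM logs").show()
    // The DataFrame API resolves the name directly, spaces and all.
    logsDf.select(logsDf("Executor Info")).show()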

Re: SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-07 Thread Cheng Lian
Were you using HiveContext.setConf()? "dfs.replication" is a Hadoop configuration, but setConf() is only used to set Spark SQL specific configurations. You may either set it in your Hadoop core-site.xml. Cheng On 6/2/15 2:28 PM, Haopu Wang wrote: Hi, I'm trying to save SparkSQL DataFrame
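Besides core-site.xml, one way to carry the property along (an editorial sketch, not quoted from the thread) is to set it on the SparkContext's Hadoop configuration before writing:

    // dfs.replication is a Hadoop client-side setting, so it can be supplied via the
    // Hadoop Configuration that Spark uses when writing to HDFS.
    sc.hadoopConfiguration.set("dfs.replication", "2")
    df.saveAsParquetFile("hdfs:///tmp/parquet_with_rf2")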

Re: Caching parquet table (with GZIP) on Spark 1.3.1

2015-06-07 Thread Cheng Lian
Is it possible that some Parquet files of this data set have different schema as others? Especially those ones reported in the exception messages. One way to confirm this is to use [parquet-tools] [1] to inspect these files: $ parquet-schema Cheng [1]: https://github.com/apache/parquet

Re: hiveContext.sql NullPointerException

2015-06-07 Thread Cheng Lian
text on the executor? If only the driver can see HiveContext, does it mean I have to collect all datasets (very large) to the driver and use HiveContext there? It will be memory overload on the driver and fail. BR, Patcharee On 07. juni 2015 11:51, Cheng Lian wrote: Hi, This is expected beh

Re: hiveContext.sql NullPointerException

2015-06-08 Thread Cheng Lian
on the SQL programming guide for now. This should be mentioned though. BR, Patcharee On 07. juni 2015 16:40, Cheng Lian wrote: Spark SQL supports Hive dynamic partitioning, so one possible workaround is to create a Hive table partitioned by zone, z, year, and month dynamically, and then insert

Re: SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-08 Thread Cheng Lian
the replication factor of some specific files. -----Original Message- From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Sunday, June 07, 2015 10:17 PM To: Haopu Wang; user Subject: Re: SparkSQL: How to specify replication factor on the persisted parquet files? Were you using HiveContext.setC

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread Cheng Lian
For DataFrame, there are also transformations and actions. And transformations are also lazily evaluated. However, DataFrame transformations like filter(), select(), agg() return a DataFrame rather than an RDD. Other methods like show() and collect() are actions. Cheng On 6/8/15 1:33 PM, kira
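A tiny example of that laziness (df is any DataFrame):

    val adults = df.filter(df("age") > 21).select("name", "age") // transformations: nothing runs yet
    adults.show()                                                // action: triggers a job
    val rows = adults.collect()                                  // action: pulls the results to the driver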

Re: Error in using saveAsParquetFile

2015-06-08 Thread Cheng Lian
Are you appending the joined DataFrame whose PolicyType is string to an existing Parquet file whose PolicyType is int? The exception indicates that Parquet found a column with conflicting data types. Cheng On 6/8/15 5:29 PM, bipin wrote: Hi I get this error message when saving a table: parqu

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread Cheng Lian
grouped like in the case of RDD? Also if you can tell me if sqlContext.load and unionAll are transformations or actions... I answered a question on the forum assuming unionAll is a blocking call and said execution of multiple load and df.unionAll in different threads would benefit perf

Re: Error in using saveAsParquetFile

2015-06-08 Thread Cheng Lian
ile when are you loading these file? can you please share the code where you are passing parquet file to spark?. On 8 June 2015 at 16:39, Cheng Lian mailto:lian.cs@gmail.com>> wrote: Are you appending the joined DataFrame whose PolicyType is string to an e

Re: Running SparkSql against Hive tables

2015-06-08 Thread Cheng Lian
? (I tried to find some basic examples in the documentation, but I was not able to) - Any suggestion or hint on how I can do that would be highly appreciated. Thnx On Sun, Jun 7, 2015 at 6:39 AM, Cheng Lian <mailto:lian.cs@gmail.com>> wrote: On 6/6/15 9:06 AM, James Pirz wrote:

Re: Error in using saveAsParquetFile

2015-06-09 Thread Cheng Lian
and midway threw the error. It isn't quite there yet. Thanks for the help. On Mon, Jun 8, 2015 at 8:29 PM Cheng Lian <mailto:lian.cs@gmail.com>> wrote: I suspect that Bookings and Customerdetails both have a PolicyType field, one is string and the other is an in

Re: BigDecimal problem in parquet file

2015-06-09 Thread Cheng Lian
Would you please provide a snippet that reproduce this issue? What version of Spark were you using? Cheng On 6/9/15 8:18 PM, bipin wrote: Hi, When I try to save my data frame as a parquet file I get the following error: java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to o

Re: Spark 1.3.1 SparkSQL metastore exceptions

2015-06-09 Thread Cheng Lian
Seems that you're using a DB2 Hive metastore? I'm not sure whether Hive 0.12.0 officially supports DB2, but probably not? (Since I didn't find DB2 scripts under the metastore/scripts/upgrade folder in Hive source tree.) Cheng On 6/9/15 8:28 PM, Needham, Guy wrote: Hi, I’m using Spark 1.3.1 to

Re: Running SparkSql against Hive tables

2015-06-10 Thread Cheng Lian
o retrieve metadata of these tables. After starting HiveThriftServer2, you should be able to use Beeline to run SQL scripts. On Mon, Jun 8, 2015 at 6:56 PM, Cheng Lian <mailto:lian.cs@gmail.com>> wrote: On 6/9/15 8:42 AM, James Pirz wrote: Thanks for the help! I am ac

Re: Spark SQL with Thrift Server is very very slow and finally failing

2015-06-10 Thread Cheng Lian
Would you mind providing the executor output so that we can check why the executors died? You may also run EXPLAIN EXTENDED to find out the physical plan of your query, something like: 0: jdbc:hive2://localhost:1> explain extended select * from foo; +--

Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-10 Thread Cheng Lian
Hi Xiaohan, Would you please try setting "spark.sql.thriftServer.incrementalCollect" to "true" and increasing the driver memory size? This way, HiveThriftServer2 uses RDD.toLocalIterator rather than RDD.collect().iterator to return the result set. The key difference is that RDD.toLocalIterator
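A sketch of the two knobs, assuming the Thrift server is started programmatically (HiveThriftServer2.startWithContext is available from Spark 1.4 on); with start-thriftserver.sh the same key can be passed via --conf, as shown later in this thread:

    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    // Stream result partitions back one at a time instead of collecting them all at once.
    hiveContext.setConf("spark.sql.thriftServer.incrementalCollect", "true")
    HiveThriftServer2.startWithContext(hiveContext)
    // ...and launch the driver with a larger heap, e.g. --driver-memory 8g.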

Re: spark-submit does not use hive-site.xml

2015-06-10 Thread Cheng Lian
Hm, this is a common confusion... Although the variable name is `sqlContext` in Spark shell, it's actually a `HiveContext`, which extends `SQLContext` and has the ability to communicate with the Hive metastore. So your program needs to instantiate an `org.apache.spark.sql.hive.HiveContext` instead.
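A minimal standalone-app version of that (hive-site.xml must be on the classpath or in conf/; names are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object HiveEnabledApp {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hive-enabled-app"))
        // A plain SQLContext cannot see the Hive metastore; HiveContext can.
        val hiveContext = new HiveContext(sc)
        hiveContext.sql("SHOW TABLES").show()
      }
    }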

Re: Re: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-10 Thread Cheng Lian
changed from "OOM::GC overhead limit exceeded" to " lost worker heartbeat after 120s". I will try to set "spark.sql.thriftServer.incrementalCollect" and continue increase driver memory to 7G, and will send the result to you. Thanks, SuperJ - 原始邮件信息 -

Re: Re: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-10 Thread Cheng Lian
end the result to you. Thanks, SuperJ - Original Message - *From:* "Cheng Lian" *To:* "Hester wang" , *Subject:* Re: Met OOM when fetching more than 1,000,000 rows. *Date:* 2015/06/10 16:15:47 (Wed) Hi Xiaohan, Would you please try to set "spark.sql.thriftServer.incrementalC

Re: Re: Re: Re: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-10 Thread Cheng Lian
park.kryoserializer.buffer.mb=256 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.akka.frameSize=512 --driver-class-path /usr/local/hive/lib/classes12.jar --conf spark.sql.thriftServer.incrementalCollect=true Thanks, SuperJ - Original Message - *From:* &

Re: PostgreSQL JDBC Classpath Issue

2015-06-10 Thread Cheng Lian
Michael had answered this question in the SO thread http://stackoverflow.com/a/30226336 Cheng On 6/10/15 9:24 PM, shahab wrote: Hi George, I have same issue, did you manage to find a solution? best, /Shahab On Wed, May 13, 2015 at 9:21 PM, George Adams > wrote

Re: Spark SQL with Thrift Server is very very slow and finally failing

2015-06-10 Thread Cheng Lian
CT_LIVE_DT#31,IKB_PROJECT_TYPE#29], (MetastoreRelation sourav_ikb_hs, ikb_project_calendar_ext, Some(C)), None Code Generation: false == RDD == --- On Wed, Jun 10, 2015 at 12:59 AM, Cheng Lian <mailto:lian.cs@gmail.com>> wrote: Would you mind to provide executor output so t

Re: spark-submit does not use hive-site.xml

2015-06-10 Thread Cheng Lian
cept a JavaSparkContext, but a SparkContext. (The comment is basically misleading.) The correct code snippet should be: HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc.sc()); Thanks again for your help. On Wed, Jun 10, 2015 at 1:17 AM, Cheng Lian <

Re: Re: Re: Re: Re: Re: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-12 Thread Cheng Lian
ate my spark to v1.4. This issue is resolved. Thanks, SuperJ - Original Message - *From:* "姜超才" *To:* "Cheng Lian" , "Hester wang" , *Subject:* Re: Re: Re: Re: Re: Re: Met OOM when fetching more than 1,000,000 rows. *Date:* 2015/06/11 08:56:28 (Thu) No problem on L

Re: Re: Re: Re: Re: Re: Re: Re: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-12 Thread Cheng Lian
I remember I have seen the OOM stderr log on a slave node before, but recently there seems to be no OOM log on the slave node. Following the cmd, data, env and the code I gave you, the OOM can be reproduced 100% of the time in cluster mode. Thanks, SuperJ - Original Message - *From:* "Cheng Lian" *To:* "姜超

Re: Re: Re: Re: Re: Re: Re: Re: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-12 Thread Cheng Lian
s not performed). Cheng On 6/12/15 4:17 PM, Cheng, Hao wrote: Not sure if Spark Core will provide API to fetch the record one by one from the block manager, instead of the pulling them all into the driver memory. *From:*Cheng Lian [mailto:l...@databricks.com] *Sent:* Friday, June 12, 2015 3:51 P

Re: BigDecimal problem in parquet file

2015-06-12 Thread Cheng Lian
ry/ms378878%28v=sql.110%29.aspx> and/or spark datatypes <https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#data-types>) That is how I created the parquet file. Any help to solve the issue is appreciated. Thanks Bipin On 9 June 2015 at 20:44, Cheng Lian <mailto:lian.c

Re: Upgrade to parquet 1.6.0

2015-06-12 Thread Cheng Lian
At the time 1.3.x was released, Parquet 1.6.0 hadn't been released yet, and we didn't have enough time to upgrade to and test Parquet 1.6.0 for Spark 1.4.0. But we've already upgraded Parquet to 1.7.0 (which is exactly the same as 1.6.0, with the package name renamed from com.twitter to org.apache.parquet) on

Re: Parquet Multiple Output

2015-06-12 Thread Cheng Lian
Spark 1.4 supports dynamic partitioning, you can first convert your RDD to a DataFrame and then save the contents partitioned by date column. Say you have a DataFrame df containing three columns a, b, and c, you may have something like this: df.write.partitionBy("a", "b").mode("overwrite"
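A complete form of that write, since the snippet above is cut off by the preview (the output path is a placeholder):

    // df has columns a, b and c; this produces one directory per (a, b) combination,
    // e.g. .../a=1/b=x/part-*.parquet
    df.write
      .partitionBy("a", "b")
      .mode("overwrite")
      .parquet("hdfs:///output/partitioned_by_a_b")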

Re: Dataframe Write : Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.

2015-06-13 Thread Cheng Lian
As the error message says, were you using a SQLContext instead of a HiveContext to create the DataFrame? In the Spark shell, although the variable name is sqlContext, the type of that variable is actually org.apache.spark.sql.hive.HiveContext, which has the ability to communicate with Hive

Re: How to silence Parquet logging?

2015-06-13 Thread Cheng Lian
Hi Chris, Which Spark version were you using? And could you provide some sample log lines you saw? Parquet uses java.util.logging internally and can't be controlled by log4j.properties. The most recent master branch should have muted most Parquet logs. However, it's known that if you explicitl

Re: Spark 1.4 DataFrame Parquet file writing - missing random rows/partitions

2015-06-17 Thread Cheng Lian
Hi Nathan, Thanks a lot for the detailed report, especially the information about nonconsecutive part numbers. It's confirmed to be a race condition bug and just filed https://issues.apache.org/jira/browse/SPARK-8406 to track this. Will deliver a fix ASAP and this will be included in 1.4.1.

Re: Spark 1.4 DataFrame Parquet file writing - missing random rows/partitions

2015-06-17 Thread Cheng Lian
heng. Nice find! Let me know if there is anything we can do to help on this end with contributing a fix or testing. Side note - any ideas on the 1.4.1 eta? There are a few bug fixes we need in there. Cheers, Nathan From: Cheng Lian Date: Wednesday, 17 June 2015 6:25 pm To: Nathan, "user@s

Re: Spark DataFrame 1.4 write to parquet/saveAsTable tasks fail

2015-06-17 Thread Cheng Lian
What's the size of this table? Is the data skewed (so that speculation is probably triggered)? Cheng On 6/15/15 10:37 PM, Night Wolf wrote: Hey Yin, Thanks for the link to the JIRA. I'll add details to it. But I'm able to reproduce it, at least in the same shell session, every time I do a w

Re: HiveContext saveAsTable create wrong partition

2015-06-17 Thread Cheng Lian
Thanks for reporting this. Would you mind helping to create a JIRA for this? On 6/16/15 2:25 AM, patcharee wrote: I found that if I move the partitioned columns in schemaString and in Row to the end of the sequence, then it works correctly... On 16. juni 2015 11:14, patcharee wrote: Hi, I am using

Re: BigDecimal problem in parquet file

2015-06-17 Thread Cheng Lian
> > Thanks for helping out. > Bipin > > On 12 June 2015 at 14:57, Cheng Lian mailto:lian.cs@gmail.com>> wrote: >> >> On 6/10/15 8:53 PM, Bipin Nag wrote: >> >> Hi Cheng, >> >> I am using Spark

Re: How to disable parquet schema merging in 1.4?

2015-07-01 Thread Cheng Lian
With Spark 1.4, you may use the data source option "mergeSchema" to control it: sqlContext.read.option("mergeSchema", "false").parquet("some/path") or CREATE TABLE t USING parquet OPTIONS (mergeSchema false, path "some/path") We're considering disabling schema merging by default in 1.5.0 si

Re: DataFrame.write().partitionBy("some_column").parquet(path) produces OutOfMemory with very few items

2015-07-16 Thread Cheng Lian
Hi Nikos, How many columns and distinct values of "some_column" are there in the DataFrame? Parquet writer is known to be very memory consuming for wide tables. And lots of distinct partition column values result in many concurrent Parquet writers. One possible workaround is to first repartit
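The truncated workaround presumably continues along these lines (an editorial sketch; repartitioning by Column expressions needs Spark 1.6+, while on 1.4/1.5 a plain repartition(n) before the write serves a similar purpose):

    // Group rows by the partition column first, so each task writes only a few Parquet
    // files instead of keeping one open writer per distinct value.
    df.repartition(df("some_column"))
      .write
      .partitionBy("some_column")
      .parquet("hdfs:///output/by_some_column")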

Re: what is : ParquetFileReader: reading summary file ?

2015-07-17 Thread Cheng Lian
Yeah, Spark SQL's Parquet support needs to do some metadata discovery when first importing a folder containing Parquet files, and the discovered metadata is cached. Cheng On 7/17/15 1:56 PM, shsh...@tsmc.com wrote: Hi all, our scenario is to generate lots of folders containing parquet files and t

Re: feedback on dataset api explode

2016-05-25 Thread Cheng Lian
Agree, since they can be easily replaced by .flatMap (to do explosion) and .select (to rename output columns) Cheng On 5/25/16 12:30 PM, Reynold Xin wrote: Based on this discussion I'm thinking we should deprecate the two explode functions. On Wednesday, May 25, 2016, Koert Kuipers

Re: update mysql in spark

2016-06-15 Thread Cheng Lian
Spark SQL doesn't support update command yet. On Wed, Jun 15, 2016, 9:08 AM spR wrote: > hi, > > can we write a update query using sqlcontext? > > sqlContext.sql("update act1 set loc = round(loc,4)") > > what is wrong in this? I get the following error. > > Py4JJavaError: An error occurred while

Re: Hive 1.0.0 not able to read Spark 1.6.1 parquet output files on EMR 4.7.0

2016-06-15 Thread Cheng Lian
Spark 1.6.1 is also using 1.7.0. Could you please share the schema of your Parquet file as well as the exact exception stack trace reported by Hive? Cheng On 6/13/16 12:56 AM, mayankshete wrote: Hello Team , I am facing an issue where output files generated by Spark 1.6.1 are not read by

Re: Bug about reading parquet files

2016-07-08 Thread Cheng Lian
What's the Spark version? Could you please also attach result of explain(extended = true)? On Fri, Jul 8, 2016 at 4:33 PM, Sea <261810...@qq.com> wrote: > I have a problem reading parquet files. > sql: > select count(1) from omega.dwd_native where year='2016' and month='07' > and day='05' and h

Re: Re: Bug about reading parquet files

2016-07-09 Thread Cheng Lian
According to our offline discussion, the target table consists of 1M+ small Parquet files (~12M by average). The OOM occurred at driver side while listing input files. My theory is that the total size of all listed FileStatus objects is too large for the driver and caused the OOM. Suggestion

Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

2016-08-10 Thread Cheng Lian
Haven't figured out exactly how it failed, but the leading underscore in the partition directory name looks suspicious. Could you please try this PR to see whether it fixes the issue: https://github.com/apache/spark/pull/14585/files Cheng On 8/9/16 5:38 PM, immerrr again wrote: Anot

Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

2016-08-12 Thread Cheng Lian
OK, I've merged this PR to master and branch-2.0. On 8/11/16 8:27 AM, Cheng Lian wrote: Haven't figured out exactly how it failed, but the leading underscore in the partition directory name looks suspicious. Could you please try this PR to see whether it fixes the iss

Re: memory leak when saving Parquet files in Spark

2015-12-10 Thread Cheng Lian
This is probably caused by schema merging. Were you using Spark 1.4 or earlier versions? Could you please try the following snippet to see whether it helps: df.write .format("parquet") .option("mergeSchema", "false") .partitionBy(partitionCols: _*) .mode(saveMode) .save(targetPath) I

Re: memory leak when saving Parquet files in Spark

2015-12-14 Thread Cheng Lian
quot;false") Thanks, -Matt On Fri, Dec 11, 2015 at 1:58 AM, Cheng Lian <mailto:l...@databricks.com>> wrote: This is probably caused by schema merging. Were you using Spark 1.4 or earlier versions? Could you please try the following snippet to see whether it helps:

Re: [SparkSQL][Parquet] Read from nested parquet data

2015-12-31 Thread Cheng Lian
Hey Lin, This is a good question. The root cause of this issue lies in the analyzer. Currently, Spark SQL can only resolve a name to a top level column. (Hive suffers the same issue.) Take the SQL query and struct you provided as an example, col_b.col_d.col_g is resolved as two nested GetStru

Re: parquet repartitions and parquet.enable.summary-metadata does not work

2016-01-11 Thread Cheng Lian
Hey Gavin, Could you please provide a snippet of your code showing how you disabled "parquet.enable.summary-metadata" and wrote the files? Especially, you mentioned you saw "3000 jobs" fail. Were you writing each Parquet file with an individual job? (Usually people use write.partitionBy

Re: parquet repartitions and parquet.enable.summary-metadata does not work

2016-01-12 Thread Cheng Lian
oblem. Best, Gavin On Mon, Jan 11, 2016 at 4:31 PM, Cheng Lian <mailto:lian.cs@gmail.com>> wrote: Hey Gavin, Could you please provide a snippet of your code showing how you disabled "parquet.enable.summary-metadata" and wrote the files? Especially,

Re: DataFrame partitionBy to a single Parquet file (per partition)

2016-01-15 Thread Cheng Lian
You may try DataFrame.repartition(partitionExprs: Column*) to shuffle all data belonging to a single (data) partition into a single (RDD) partition: |df.coalesce(1)|||.repartition("entity", "year", "month", "day", "status")|.write.partitionBy("entity", "year", "month", "day", "status").mode(S

Re: cast column string -> timestamp in Parquet file

2016-01-25 Thread Cheng Lian
The following snippet may help: sqlContext.read.parquet(path).withColumn("col_ts", $"col".cast(TimestampType)).drop("col") Cheng On 1/21/16 6:58 AM, Muthu Jayakumar wrote: DataFrame and udf. This may be more performant than doing an RDD transformation as you'll only transform just the colu

Re: How to delete a record from parquet files using dataframes

2016-02-24 Thread Cheng Lian
Parquet is a read-only format. So the only way to remove data from a written Parquet file is to write a new Parquet file without unwanted rows. Cheng On 2/17/16 5:11 AM, SRK wrote: Hi, I am saving my records in the form of parquet files using dataframes in hdfs. How to delete the records usin
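In practice that rewrite looks roughly like this (predicate and paths are placeholders):

    val records = sqlContext.read.parquet("hdfs:///data/records")
    // Keep everything except the rows to be "deleted", then write a fresh dataset.
    records.filter(records("id") !== 42)
      .write
      .parquet("hdfs:///data/records_without_42")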

Re: Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-25 Thread Cheng Lian
Please set the SQL option spark.sql.parquet.binaryAsString to true when reading Parquet files containing strings generated by Hive. This is actually a bug in parquet-hive. When generating the Parquet schema for a string field, Parquet requires a "UTF8" annotation, something like: message hive
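Setting the option before the read looks like this (the path is a placeholder):

    // Treat BINARY columns lacking a UTF8 annotation (e.g. strings written by old
    // parquet-hive) as strings rather than raw byte arrays.
    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
    val df = sqlContext.read.parquet("hdfs:///warehouse/hive_written_table")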

Re: Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-25 Thread Cheng Lian
BTW, just checked that this bug should have been fixed since Hive 0.14.0. So the SQL option I mentioned is mostly used for reading legacy Parquet files generated by older versions of Hive. Cheng On 9/25/15 2:42 PM, Cheng Lian wrote: Please set the the SQL option

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
I guess you're probably using Spark 1.5? Spark SQL does support schema merging, but we disabled it by default since 1.5 because it introduces extra performance costs (it's turned on by default in 1.4 and 1.3). You may enable schema merging via either the Parquet data source specific option "me
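The truncated sentence refers to the data source option "mergeSchema"; the usual global counterpart is the SQL option spark.sql.parquet.mergeSchema. For example (paths are placeholders):

    // Per read:
    val merged = sqlContext.read.option("mergeSchema", "true").parquet("hdfs:///data/evolving")

    // Or globally for the session:
    sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")
    val mergedToo = sqlContext.read.parquet("hdfs:///data/evolving")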

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
Also, you may find more details in the programming guide: - http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging - http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration Cheng On 9/28/15 3:54 PM, Cheng Lian wrote: I guess you're probably

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
! The problem now is to filter out bad (miswritten) Parquet files, as they are causing this operation to fail. Any suggestions on detecting them quickly and easily? *From:*Cheng Lian [mailto:lian.cs@gmail.com] *Sent:* Monday, September 28, 2015 5:56 PM *To:* Thomas, Jordan ; mich

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
nd re-transferred. Thanks, Jordan *From:*Cheng Lian [mailto:lian.cs@gmail.com] *Sent:* Monday, September 28, 2015 6:15 PM *To:* Thomas, Jordan ; mich...@databricks.com *Cc:* user@spark.apache.org *Subject:* Re: Performance when iterating over many parquet files Could you please elaborate

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
g very similar this weekend. It works but is very slow. The Spark method I included in my original post is about 5-6 times faster. Just wondering if there is something even faster than that. I see this as being a recurring problem over the next few months. *From:*Cheng Lian [mailto:l

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
430-L431>, which reads the actual Parquet footers and probably take most of the time). Cheng On 9/28/15 6:51 PM, Cheng Lian wrote: Oh I see, then probably this one, basically the parallel Spark version of my last script, using ParquetFileReader: import org.apache.parquet.hadoop.ParquetFile

Re: Metadata in Parquet

2015-09-30 Thread Cheng Lian
Unfortunately this isn't supported at the moment https://issues.apache.org/jira/browse/SPARK-10803 Cheng On 9/30/15 10:54 AM, Philip Weaver wrote: Hi, I am using org.apache.spark.sql.types.Metadata to store extra information along with each of my fields. I'd also like to store Metadata for th

Re: Parquet file size

2015-10-07 Thread Cheng Lian
Why do you want larger files? Doesn't the result Parquet file contain all the data in the original TSV file? Cheng On 10/7/15 11:07 AM, Younes Naguib wrote: Hi, I’m reading a large tsv file, and creating parquet files using sparksql: insert overwrite table tbl partition(year, month, day)..

Re: Parquet file size

2015-10-07 Thread Cheng Lian
, without month and day). Cheng So you want to dump all data into a single large Parquet file? On 10/7/15 1:55 PM, Younes Naguib wrote: The TSV original files is 600GB and generated 40k files of 15-25MB. y *From:*Cheng Lian [mailto:lian.cs@gmail.com] *Sent:* October-07-15 3:18 PM *To

Re: Parquet file size

2015-10-08 Thread Cheng Lian
l.com<mailto:younes.nag...@streamtheworld.com>** *From:* odeach...@gmail.com [odeach...@gmail.com] on behalf of Deng Ching-Mallete [och...@apache.org] *Sent:* Wednesday, October 07, 2015 9:14 PM *To:* Younes Naguib *Cc:* Cheng

Re: Fixed writer version as version1 for Parquet when writing a Parquet file.

2015-10-09 Thread Cheng Lian
Hi Hyukjin, Thanks for bringing this up. Could you please make a PR for this one? We didn't use PARQUET_2_0 mostly because it's less mature than PARQUET_1_0, but we should let users choose the writer version, as long as PARQUET_1_0 remains the default option. Cheng On 10/8/15 11:04 PM, Hyuk

Re: Filter applied on merged Parquet schemas with new column fails.

2015-10-28 Thread Cheng Lian
Hey Hyukjin, Sorry that I missed the JIRA ticket. Thanks for bringing this issue up here, and for your detailed investigation. From my side, I think this is a bug in Parquet. Parquet was designed to support schema evolution. When scanning a Parquet file, if a column exists in the requested schema but is missin

Re: Issue of Hive parquet partitioned table schema mismatch

2015-11-03 Thread Cheng Lian
SPARK-11153 should be irrelevant because you are filtering on a partition key while SPARK-11153 is about Parquet filter push-down and doesn't affect partition pruning. Cheng On 11/3/15 7:14 PM, Rex Xiong wrote: We found the query performance is very poor due to this issue https://issues.apac

Re: Issue of Hive parquet partitioned table schema mismatch

2015-11-04 Thread Cheng Lian
Is there any chance that " spark.sql.hive.convertMetastoreParquet" is turned off? Cheng On 11/4/15 5:15 PM, Rex Xiong wrote: Thanks Cheng Lian. I found in 1.5, if I use spark to create this table with partition discovery, the partition pruning can be performed, but for my

Re: very slow parquet file write

2015-11-06 Thread Cheng Lian
I'd expect writing Parquet files to be slower than writing JSON files since Parquet involves more complicated encoders, but maybe not that slow. Would you mind trying to profile one Spark executor using tools like YJP to see where the hotspot is? Cheng On 11/6/15 7:34 AM, rok wrote: Apologies if thi

Re: very slow parquet file write

2015-11-06 Thread Cheng Lian
none of your responses are there either. I am definitely subscribed to the list though (I get daily digests). Any clue how to fix it? Sorry, no idea :-/ On Nov 6, 2015, at 9:26 AM, Cheng Lian <mailto:lian.cs@gmail.com>> wrote: I'd expect writing Parquet files slower than

Re: Unwanted SysOuts in Spark Parquet

2015-11-10 Thread Cheng Lian
This is because of PARQUET-369, which prevents users or other libraries from overriding Parquet's JUL logging settings via SLF4J. It has been fixed in the most recent parquet-format master (PR #32

Re: doubts on parquet

2015-11-19 Thread Cheng Lian
t works fine. my requirement is now to handle writing in multiple folders at same time. Basically the JavaPairrdd I want to write to multiple folders based on final hive partitions where this rdd will lend.Have you used multiple output formats in spark? On Fri, Nov 13, 2015 at 3:56 PM, Che

Re: DateTime Support - Hive Parquet

2015-11-23 Thread Cheng Lian
Hey Bryan, What do you mean by "DateTime properties"? Hive and Spark SQL both support DATE and TIMESTAMP types, but there's no DATETIME type. So I assume you are referring to Java class DateTime (possibly the one in joda)? Could you please provide a sample snippet that illustrates your requir

Re: DateTime Support - Hive Parquet

2015-11-24 Thread Cheng Lian
nanos, Timestamp, etc) prior to writing records to hive. Regards, Bryan Jeffrey Sent from Outlook Mail *From: *Cheng Lian *Sent: *Tuesday, November 24, 2015 1:42 AM *To: *Bryan Jeffrey;user *Subject: *Re: DateTime Support - Hive Parquet Hey Bryan, What do you mean by "DateTime prope

Re: DateTime Support - Hive Parquet

2015-11-29 Thread Cheng Lian
icit conversion for this case? Do you convert on insert or on RDD to DF conversion? Regards, Bryan Jeffrey Sent from Outlook Mail *From: *Cheng Lian *Sent: *Tuesday, November 24, 2015 6:49 AM *To: *Bryan;user *Subject: *Re: DateTime Support - Hive Parquet I see, then this is actually irrelevan

Re: Parquet files not getting coalesced to smaller number of files

2015-11-29 Thread Cheng Lian
RDD.coalesce(n) returns a new RDD rather than modifying the original RDD. So what you need is: metricsToBeSaved.coalesce(1500).saveAsNewAPIHadoopFile(...) Cheng On 11/29/15 12:21 PM, SRK wrote: Hi, I have the following code that saves the parquet files in my hourly batch to hdfs. My idea

Re: df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-12-02 Thread Cheng Lian
You may try to set Hadoop conf "parquet.enable.summary-metadata" to false to disable writing Parquet summary files (_metadata and _common_metadata). By default Parquet writes the summary files by collecting footers of all part-files in the dataset while committing the job. Spark also follows
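Setting that Hadoop property from a job looks like this (it can equally be passed as --conf spark.hadoop.parquet.enable.summary-metadata=false):

    // Skip the footer-collection step that writes _metadata / _common_metadata.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
    df.write.partitionBy("some_column").parquet("hdfs:///output/no_summary_files")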

Re: parquet file doubts

2015-12-06 Thread Cheng Lian
cc parquet-dev list (it would be nice to always do so for these general questions.) Cheng On 12/6/15 3:10 PM, Shushant Arora wrote: Hi I have few doubts on parquet file format. 1.Does parquet keeps min max statistics like in ORC. how can I see parquet version(whether its1.1,1.2or1.3) for pa

Re: parquet file doubts

2015-12-06 Thread Cheng Lian
, Ted Yu wrote: Cheng: I only see user@spark in the CC. FYI On Sun, Dec 6, 2015 at 8:01 PM, Cheng Lian <mailto:l...@databricks.com>> wrote: cc parquet-dev list (it would be nice to always do so for these general questions.) Cheng On 12/6/15 3:10 PM, Shushant Ar

Re: parquet file doubts

2015-12-08 Thread Cheng Lian
eet mailto:absi...@informatica.com>> wrote: Yes, Parquet has min/max. *From:*Cheng Lian [mailto:l...@databricks.com <mailto:l...@databricks.com>] *Sent:* Monday, December 07, 2015 11:21 AM *To:* Ted Yu *Cc:* Shushant Arora; user@spark.apache.org <mailto

Re: About the bottleneck of parquet file reading in Spark

2015-12-10 Thread Cheng Lian
Cc Spark user list since this information is generally useful. On Thu, Dec 10, 2015 at 3:31 PM, Lionheart <87249...@qq.com> wrote: > Dear, Cheng > I'm a user of Spark. Our current Spark version is 1.4.1 > In our project, I find there is a bottleneck when loading huge amount > of parquet

Re: Partition parquet data by ENUM column

2015-07-21 Thread Cheng Lian
Parquet support for Thrift/Avro/ProtoBuf ENUM types was just added to the master branch: https://github.com/apache/spark/pull/7048 ENUM types are actually not in the Parquet format spec, which is why we didn't have it in the first place. Basically, ENUMs are always treated as UTF8 strings in Spa

Re: Spark-hive parquet schema evolution

2015-07-21 Thread Cheng Lian
Hey Jerrick, What do you mean by "schema evolution with Hive metastore tables"? Hive doesn't take schema evolution into account. Could you please give a concrete use case? Are you trying to write Parquet data with extra columns into an existing metastore Parquet table? Cheng On 7/21/15 1:04

Re: Partition parquet data by ENUM column

2015-07-21 Thread Cheng Lian
p at newParquet.scala:573 but that is the same even with non partitioned data. Do you mean how to verify whether partition pruning is effective? You should be able to see log lines like this: 15/07/22 11:14:35 INFO DataSourceStrategy: Selected 1 partitions out of 3, pruned 66.6666

Re: Spark-hive parquet schema evolution

2015-07-22 Thread Cheng Lian
table' from SparkSQLCLI I won't see the new column being added. I understand that this is because Hive doesn't support schema evolution. So what is the best way to support CLI queries in this situation? Do I need to manually alter the table every time the underlying schema changes? Th

Re: Parquet problems

2015-07-22 Thread Cheng Lian
How many columns are there in these Parquet files? Could you load a small portion of the original large dataset successfully? Cheng On 6/25/15 5:52 PM, Anders Arpteg wrote: Yes, both the driver and the executors. Works a little bit better with more space, but still a leak that will cause fai

Re: Spark-hive parquet schema evolution

2015-07-22 Thread Cheng Lian
g.com On Wed, Jul 22, 2015 at 4:36 AM, Cheng Lian <mailto:lian.cs@gmail.com>> wrote: Since Hive doesn’t support schema evolution, you’ll have to update the schema stored in metastore somehow. For example, you can create a new external table with the merged schema. Say you ha

Re: Partition parquet data by ENUM column

2015-07-23 Thread Cheng Lian
r(PrimitiveType: BINARY, OriginalType: ENUM) Valid types for this column are: null Is it because Spark does not recognize ENUM type in parquet? Best Regards, Jerry On Wed, Jul 22, 2015 at 12:21 AM, Cheng Lian <mailto:lian.cs@gmail.com>> wrote: On 7/22/15 9:03 AM, Ankit wrote:
