Issue on Spark SQL insert or create table with Spark running on AWS EMR -- s3n.S3NativeFileSystem: rename never finished
Hi, we always get issues when inserting or creating tables with the Amazon EMR Spark version. When inserting a result set of about 1 GB, the Spark SQL query never finishes; inserting a small result set (like 500 MB) works fine. *spark.sql.shuffle.partitions* at its default of 200, or *set spark.sql.shuffle.partitions=1*, does not help. The log stops at:

15/04/01 15:48:13 INFO s3n.S3NativeFileSystem: rename s3://hive-db/tmp/hive-hadoop/hive_2015-04-01_15-47-43_036_1196347178448825102-15/-ext-1 s3://hive-db/db_xxx/some_huge_table/

then there are only metrics.MetricsSaver logs. We set

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>s3://hive-db</value>
</property>

but hive.exec.scratchdir is not set; I have no idea why the tmp files were created in s3://hive-db/tmp/hive-hadoop/. We just tried the newest Spark 1.3.0 on AMI 3.5.x and AMI 3.6 (https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/VersionInformation.md); it still does not work. Has anyone hit the same issue? Any idea how to fix it? I believe Amazon EMR's Spark build uses com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem to access S3, not the original Hadoop s3n implementation, right? By the way, /home/hadoop/spark/classpath/emr/* and /home/hadoop/spark/classpath/emrfs/* are on the classpath. Is there any plan to use the new Hadoop s3a implementation instead of s3n? Thanks for any help. Teng -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issue-on-Spark-SQL-insert-or-create-table-with-Spark-running-on-AWS-EMR-s3n-S3NativeFileSystem-renamd-tp22340.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
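One possible mitigation worth trying (an assumption on my side, not something confirmed on EMR): S3 has no atomic rename, so the final rename of the Hive staging directory is a slow server-side copy of the whole result set. Keeping the scratch directory on HDFS instead of S3 might avoid the S3-side rename entirely; the path below is only an example:

```xml
<!-- Example only, in hive-site.xml: keep Hive's staging/tmp files on local
     HDFS so the final move into the S3 warehouse is a single upload rather
     than an S3 "rename" (which is really a copy plus delete). -->
<property>
  <name>hive.exec.scratchdir</name>
  <value>hdfs:///tmp/hive-scratch</value>
</property>
```

With this set, the temporary -ext-1 directories would land under the HDFS scratch path rather than under s3://hive-db/tmp/.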
RE: SchemaRDD - Parquet - insertInto makes many files
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-NumberofTasks It would be great if something like hive.exec.reducers.bytes.per.reducer could be implemented. One idea: get the total size of all target blocks, then set the number of partitions from that.
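The idea above can be sketched in a few lines of plain Scala. The object name and the 256 MB default are my own assumptions for illustration, not an existing Spark or Hive API:

```scala
// Sketch: derive a partition count from the total input size, mimicking
// hive.exec.reducers.bytes.per.reducer. Names here are hypothetical.
object PartitionEstimator {
  def numPartitions(totalBytes: Long,
                    bytesPerPartition: Long = 256L * 1024 * 1024): Int =
    // At least one partition; otherwise one partition per size quantum.
    math.max(1, math.ceil(totalBytes.toDouble / bytesPerPartition.toDouble).toInt)
}
```

The result would then feed something like coalesce(n) or spark.sql.shuffle.partitions instead of a hard-coded 200.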
Re: Storage Handlers in Spark SQL
It seems he means to query an RDBMS or Cassandra using Spark SQL, i.e. multiple data sources for Spark SQL. I looked through the link he posted: https://docs.wso2.com/display/BAM241/Creating+Hive+Queries+to+Analyze+Data#CreatingHiveQueriestoAnalyzeData-CreatingHivetablesforvariousdatasources Using their storage handlers, users can create a Hive external table from a C* table or an RDBMS table (via JDBC). So Niranda, maybe you can take a look at this API: https://issues.apache.org/jira/browse/SPARK-2179 and there is some doc in the pull request pool: https://github.com/apache/spark/pull/1774 There is a similar implementation to your JDBC storage handlers in Spark SQL; it could also serve as a sample of the public API for DataTypes and Schema: https://github.com/apache/spark/pull/1612 (https://issues.apache.org/jira/browse/SPARK-2710) And in some other user-list threads I saw that some kind of C* mapper is also in development by Datastax?
Re: sqlContext.parquetFile(path) fails if path is a file but succeeds if a directory
It is definitely a bug: sqlContext.parquetFile should take both a directory and a single file as its parameter. The if-check for isDir makes no sense after this commit: https://github.com/apache/spark/pull/1370/files#r14967550 I opened a ticket for this issue: https://issues.apache.org/jira/browse/SPARK-3138 The ticket shows how to reproduce the bug.
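For illustration only, here is a tiny plain-Scala sketch of the behavior being argued for. This is a hypothetical helper, not Spark's implementation: a path should be usable whether it names a directory of part files or a single file.

```scala
import java.io.File

// Hypothetical helper: a directory yields the files it contains, a single
// file yields just itself -- no isDir rejection of plain files.
object PathUtil {
  def dataFiles(path: String): Seq[File] = {
    val f = new File(path)
    if (f.isDirectory) f.listFiles.toSeq.filter(_.isFile).sortBy(_.getName)
    else Seq(f)
  }
}
```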
Re: Multiple column families vs Multiple tables
ö_ö you should send this message to the HBase user list, not the Spark user list... But I can give you some personal advice on this: keep the number of column families as small as possible! At least, using a prefix on the column qualifier could also be an idea, but read performance may then be worse for a use case like yours: searching for a row with value X in column family A and value Y in column family B. So it depends on which workload is important for you. If your use case is very read-heavy and you really want to use multiple column families while keeping good read performance, you should try to disable region splits, tune the compaction interval carefully, and so on. There is a good slide deck on this: http://photo.weibo.com/1431095941/wbphotos/large/mid/3735178188435939/pid/554cca85gw1eiloddlqa5j20or0ik77z More slides about HBase + coprocessors, HBase + Hive and HBase + Spark: http://www.weibo.com/1431095941/BeL90zozx
Re: Is hive UDF are supported in HiveContext
There is no collect_list in Hive 0.12. Try it after this ticket is done: https://issues.apache.org/jira/browse/SPARK-2706 I am also looking forward to this.
Re: How to direct insert vaules into SparkSQL tables?
Oh, right, I meant within SQLContext alone: a SchemaRDD from a text file with a case class.
Re: share/reuse off-heap persisted (tachyon) RDD in SparkContext or saveAsParquetFile on tachyon in SQLContext
spark.speculation was not set; is there any speculative execution on the Tachyon side? tachyon-env.sh only changed the following:

export TACHYON_MASTER_ADDRESS=test01.zala
#export TACHYON_UNDERFS_ADDRESS=$TACHYON_HOME/underfs
export TACHYON_UNDERFS_ADDRESS=hdfs://test01.zala:8020
export TACHYON_WORKER_MEMORY_SIZE=16GB

test01.zala is the master node for HDFS, Tachyon, Spark, etc. Worker nodes are test02.zala, test03.zala, test04.zala. spark-shell runs on test02.

After parquetFile.saveAsParquetFile("tachyon://test01.zala:19998/parquet_1") I got the FailedToCheckpointException with "Failed to rename", but per tfs lsr there are some temporary files and a metadata file:

0.00 B    08-11-2014 16:19:28:054            /parquet_1
881.00 B  08-11-2014 16:19:28:054  In Memory /parquet_1/_metadata
0.00 B    08-11-2014 16:19:28:314            /parquet_1/_temporary
0.00 B    08-11-2014 16:19:28:314            /parquet_1/_temporary/0
0.00 B    08-11-2014 16:19:28:931            /parquet_1/_temporary/0/_temporary
0.00 B    08-11-2014 16:19:28:931            /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_00_33
0.00 B    08-11-2014 16:19:28:931  In Memory /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_00_33/part-r-1.parquet
0.00 B    08-11-2014 16:19:28:940            /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_01_35
0.00 B    08-11-2014 16:19:28:940  In Memory /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_01_35/part-r-2.parquet
0.00 B    08-11-2014 16:19:28:962            /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_03_36
0.00 B    08-11-2014 16:19:28:962  In Memory /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_03_36/part-r-4.parquet
0.00 B    08-11-2014 16:19:28:971            /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_02_34
0.00 B    08-11-2014 16:19:28:971  In Memory /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_02_34/part-r-3.parquet
0.00 B    08-11-2014 16:20:06:349            /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_01_37
0.00 B    08-11-2014 16:20:06:349  In Memory /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_01_37/part-r-2.parquet
0.00 B    08-11-2014 16:20:09:519            /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_03_38
0.00 B    08-11-2014 16:20:09:519  In Memory /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_03_38/part-r-4.parquet
0.00 B    08-11-2014 16:20:18:777            /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_02_39
0.00 B    08-11-2014 16:20:18:777  In Memory /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_02_39/part-r-3.parquet
0.00 B    08-11-2014 16:20:28:315            /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_00_40
0.00 B    08-11-2014 16:20:28:315  In Memory /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_00_40/part-r-1.parquet
0.00 B    08-11-2014 16:20:38:382            /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_01_41
0.00 B    08-11-2014 16:20:38:382  In Memory /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_01_41/part-r-2.parquet
0.00 B    08-11-2014 16:20:40:681            /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_03_42
0.00 B    08-11-2014 16:20:40:681  In Memory /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_03_42/part-r-4.parquet
0.00 B    08-11-2014 16:20:50:376            /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_02_43
0.00 B    08-11-2014 16:20:50:376  In Memory /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_02_43/part-r-3.parquet
0.00 B    08-11-2014 16:21:00:932            /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_00_44
0.00 B    08-11-2014 16:21:00:932  In Memory /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_00_44/part-r-1.parquet
0.00 B    08-11-2014 16:21:10:355            /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_01_45
0.00 B    08-11-2014 16:21:10:355  In Memory /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_01_45/part-r-2.parquet
0.00 B    08-11-2014 16:21:11:468            /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_03_46
0.00 B    08-11-2014 16:21:11:468  In Memory /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_03_46/part-r-4.parquet
0.00 B    08-11-2014 16:21:21:681            /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_02_47
0.00 B    08-11-2014 16:21:21:681  In Memory /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_02_47/part-r-3.parquet
0.00 B    08-11-2014 16:21:32:583
Re: share/reuse off-heap persisted (tachyon) RDD in SparkContext or saveAsParquetFile on tachyon in SQLContext
More interesting: if spark-shell is started on the master node (test01), then parquetFile.saveAsParquetFile("tachyon://test01.zala:19998/parquet_tablex") gives:

14/08/12 11:42:06 INFO : initialize(tachyon://... ... ...
14/08/12 11:42:06 INFO : File does not exist: tachyon://test01.zala:19998/parquet_tablex/_metadata
14/08/12 11:42:06 INFO : getWorkingDirectory: /
14/08/12 11:42:06 INFO : create(tachyon://test01.zala:19998/parquet_tablex/_metadata, rw-r--r--, true, 65536, 1, 33554432, null)
14/08/12 11:42:06 WARN : tachyon.home is not set. Using /mnt/tachyon_default_home as the default value.
14/08/12 11:42:06 INFO : Trying to get local worker host : test01.zala
14/08/12 11:42:06 ERROR : No local worker on test01.zala
NoWorkerException(message:No local worker on test01.zala)
        at tachyon.thrift.MasterService$user_getWorker_result$user_getWorker_resultStandardScheme.read(MasterService.java:25675)
        at tachyon.thrift.MasterService$user_getWorker_result$user_getWorker_resultStandardScheme.read(MasterService.java:25652)
        at tachyon.thrift.MasterService$user_getWorker_result.read(MasterService.java:25591)
        at tachyon.org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
        at tachyon.thrift.MasterService$Client.recv_user_getWorker(MasterService.java:832)
        at tachyon.thrift.MasterService$Client.user_getWorker(MasterService.java:818)
        at tachyon.master.MasterClient.user_getWorker(MasterClient.java:648)
        at tachyon.worker.WorkerClient.connect(WorkerClient.java:199)
        at tachyon.worker.WorkerClient.mustConnect(WorkerClient.java:360)
        at tachyon.worker.WorkerClient.getUserUfsTempFolder(WorkerClient.java:298)
        at tachyon.client.TachyonFS.createAndGetUserUfsTempFolder(TachyonFS.java:270)
        at tachyon.client.FileOutStream.<init>(FileOutStream.java:72)
        at tachyon.client.TachyonFile.getOutStream(TachyonFile.java:207)
        at tachyon.hadoop.AbstractTFS.create(AbstractTFS.java:102)
        at tachyon.hadoop.TFS.create(TFS.java:24)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:784)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:773)
        at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:344)
        at org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:345)
        at org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:142)
        at org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:120)
        at org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:197)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
        at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:399)
        at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:397)
        at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:403)
        at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:403)
        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:406)
        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:406)
        at org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:77)
        at org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:103)
        at $line12.$read$$iwC$$iwC$$iwC$$iwC.<init>(console:17)
        at $line12.$read$$iwC$$iwC$$iwC.<init>(console:22)
        at $line12.$read$$iwC$$iwC.<init>(console:24)
        at $line12.$read$$iwC.<init>(console:26)
        at $line12.$read.<init>(console:28)
        at $line12.$read$.<init>(console:32)
        at $line12.$read$.<clinit>(console)
        at $line12.$eval$.<init>(console:7)
        at $line12.$eval$.<clinit>(console)
        at $line12.$eval.$print(console)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
        at
Re: CDH5, HiveContext, Parquet
hive-thriftserver does not work with Parquet tables in the Hive metastore either. Will this PR fix that too? Is there no need to change any pom.xml?
Re: How to direct insert vaules into SparkSQL tables?
No, Spark SQL cannot insert into or update a text file yet; it can only insert into Parquet files. But people.union(new_people).registerAsTable("people") could be an idea.
share/reuse off-heap persisted (tachyon) RDD in SparkContext or saveAsParquetFile on tachyon in SQLContext
Sharing/reusing RDDs is always useful for many use cases. Is this possible by persisting an RDD on Tachyon, such as off-heap persisting a named RDD into a given path (instead of /tmp_spark_tachyon/spark-xxx-xxx-xxx), or via saveAsParquetFile on Tachyon? I tried to save a SchemaRDD on Tachyon:

val parquetFile = sqlContext.parquetFile("hdfs://test01.zala:8020/user/hive/warehouse/parquet_tables.db/some_table/")
parquetFile.saveAsParquetFile("tachyon://test01.zala:19998/parquet_1")

but it always errors; the first error message is:

14/08/11 16:19:28 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on test03.zala:37377 (size: 18.7 KB, free: 16.6 GB)
14/08/11 16:20:06 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 3.0 (TID 35, test04.zala): java.io.IOException: FailedToCheckpointException(message:Failed to rename hdfs://test01.zala:8020/tmp/tachyon/workers/140776003/31806/730 to hdfs://test01.zala:8020/tmp/tachyon/data/730)
        tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:112)
        tachyon.client.TachyonFS.addCheckpoint(TachyonFS.java:168)
        tachyon.client.FileOutStream.close(FileOutStream.java:104)
        org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
        org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
        parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:321)
        parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:111)
        parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
        org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:259)
        org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:272)
        org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:272)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:722)

hdfs://test01.zala:8020/tmp/tachyon/ is already chmod'd to 777, and both owner and group are the same as the Spark/Tachyon startup user. Off-heap persist or saving a normal text file on Tachyon works fine. CDH 5.1.0, Spark 1.1.0 snapshot, Tachyon 0.6 snapshot.
Re: How to use spark-cassandra-connector in spark-shell?
Try adding the following jars to the classpath:

xxx/cassandra-all-2.0.6.jar:xxx/cassandra-thrift-2.0.6.jar:xxx/libthrift-0.9.1.jar:xxx/cassandra-driver-spark_2.10-1.0.0-SNAPSHOT.jar:xxx/cassandra-java-driver-2.0.2/cassandra-driver-core-2.0.2.jar:xxx/cassandra-java-driver-2.0.2/cassandra-driver-dse-2.0.2.jar

then in spark-shell:

import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf(true).set("cassandra.connection.host", "your-cassandra-host")
val sc = new SparkContext("local[1]", "cassandra-driver", conf)
import com.datastax.driver.spark._
sc.cassandraTable("db1", "table1").select("key").count
Re: Spark with HBase
These two posts should be good for setting up a Spark + HBase environment and using the results of an HBase table scan as an RDD. Settings: http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html Some samples: http://www.abcn.net/2014/07/spark-hbase-result-keyvalue-bytearray.html
Re: reduceByKey to get all associated values
A long time ago, at Spark Summit 2013, Patrick Wendell said in his talk about performance (http://spark-summit.org/talk/wendell-understanding-the-performance-of-spark-applications/) that reduceByKey will be more efficient than groupByKey... He mentioned that groupByKey copies all data over the network. Is that still true? Which one should we choose? Because we can actually replace every groupByKey with reduceByKey: for example, if we want to use groupByKey on an RDD[(String, String)] to get an RDD[(String, Seq[String])], we can also do it with reduceByKey: first map the RDD[(String, String)] to an RDD[(String, Seq[String])], then call reduceByKey(_ ++ _) on it.
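The equivalence of the two shapes can be checked with plain Scala collections (no Spark; the object and method names are my own). Note this only shows that the results match; it does not make the reduceByKey variant cheaper, since merging per-element Seqs with _ ++ _ still moves all the values:

```scala
// groupByKey-style grouping vs. map-to-Seq followed by a reduceByKey-style
// merge with _ ++ _, simulated on an in-memory Seq of pairs.
object KeyedOps {
  def groupStyle(pairs: Seq[(String, String)]): Map[String, Seq[String]] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

  def reduceStyle(pairs: Seq[(String, String)]): Map[String, Seq[String]] =
    pairs.map { case (k, v) => (k, Seq(v)) }           // map side: wrap each value
      .foldLeft(Map.empty[String, Seq[String]]) {      // reduce side: concat Seqs
        case (acc, (k, vs)) => acc.updated(k, acc.getOrElse(k, Seq.empty) ++ vs)
      }
}
```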
Re: Save an RDD to a SQL Database
Right, Spark acts more like an OLAP system; I believe no one will use Spark as an OLTP system. So there is always the question of how to share data between these two platforms efficiently. More importantly, most enterprise BI tools rely on an RDBMS, or at least on a JDBC/ODBC interface.
Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed
In Spark 1.1 it may not be as easy as in Spark 1.0, after this commit: https://issues.apache.org/jira/browse/SPARK-2446 Only binary columns with the UTF8 annotation are recognized as strings after this commit, but in Impala strings are always written without the UTF8 annotation.
Why is there only getString(index) but no getString(columnName) in catalyst.expressions.Row.scala?
I do not want to always write schemaRDD.map { case Row(xxx) => ... }; using case we must spell out the table schema again. Is there any plan to implement this? Thanks
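Until something like that exists, a workaround can be sketched in plain Scala (a hypothetical helper, not the Row API): build a name-to-index map from the schema's field names once, then look columns up by name.

```scala
// Hypothetical helper: map column names to positions so lookups by name
// delegate to the positional access that Row actually provides.
object NamedRow {
  def indexOf(fieldNames: Seq[String]): Map[String, Int] =
    fieldNames.zipWithIndex.toMap

  // A Row is modeled here as a plain Seq[Any] for illustration.
  def getString(row: Seq[Any], idx: Map[String, Int], col: String): String =
    row(idx(col)).toString
}
```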
Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed
Hi, unfortunately it is not so straightforward. xxx_parquet.db is the folder of a managed database created by Hive/Impala, so every sub-element in it is a table in Hive/Impala. They are folders in HDFS, each table has a different schema, and each folder contains one or more Parquet files. That means xx001_suffix and xx002_suffix are folders containing Parquet files like:

xx001_suffix/parquet_file1_with_schema1
xx002_suffix/parquet_file1_with_schema2
xx002_suffix/parquet_file2_with_schema2

It seems only union can do this job. Nonetheless, thank you very much; maybe the only reason is that Spark is eating up too much memory...
Re: gain access to persisted rdd
But at least if users want to access the persisted RDDs, they can use sc.getPersistentRDDs in the same context.
Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed
No, something like this:

14/07/20 00:19:29 ERROR cluster.YarnClientClusterScheduler: Lost executor 2 on 02.xxx: remote Akka client disassociated
... ...
14/07/20 00:21:13 WARN scheduler.TaskSetManager: Lost TID 832 (task 1.2:186)
14/07/20 00:21:13 WARN scheduler.TaskSetManager: Loss was due to java.io.IOException
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:779)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at java.io.DataInputStream.readFully(DataInputStream.java:169)
        at parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:599)
        at parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:360)
        at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:100)
        at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
        at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:122)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.Task.run(Task.scala:51)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

ulimit is increased.
Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed
Like this:

val sc = new SparkContext(new SparkConf().setAppName("SLA Filter"))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
val suffix = args(0)
sqlContext.parquetFile("/user/hive/warehouse/xxx_parquet.db/xx001_" + suffix).registerAsTable("xx001")
sqlContext.parquetFile("/user/hive/warehouse/xxx_parquet.db/xx002_" + suffix).registerAsTable("xx002")
... ...
var xx001 = sql("select some_id, some_type, some_time from xx001").map(line => (line(0), (line(1).toString, line(2).toString.substring(0, 19))))
var xx002 = sql("select some_id, some_type, some_time from xx002").map(line => (line(0), (line(1).toString, line(2).toString.substring(0, 19))))
... ...
var all = xx001 union xx002 ... union ...
all.groupByKey.filter(kv => FilterSLA.filterSLA(kv._2.toSeq)).saveAsTextFile("xxx")

filterSLA turns the input Seq[(String, String)] into a Map, then checks something like: if the map contains type1 and type2, and then if timestamp_type1 - timestamp_type2 > 2 days. Thanks
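The filter described at the end can be sketched in plain Scala. The type names "type1"/"type2", the two-day threshold, and the timestamp format yyyy-MM-dd HH:mm:ss (what substring(0, 19) of a full timestamp would yield) are assumptions taken from the description above:

```scala
import java.time.LocalDateTime
import java.time.temporal.ChronoUnit

// Sketch of the SLA check: collapse the events into a Map by type, require
// both event types, and require a gap of more than two days between them.
object FilterSLA {
  def filterSLA(events: Seq[(String, String)]): Boolean = {
    val byType = events.toMap
    (byType.get("type1"), byType.get("type2")) match {
      case (Some(t1), Some(t2)) =>
        // "yyyy-MM-dd HH:mm:ss" -> ISO form for LocalDateTime.parse
        val ts1 = LocalDateTime.parse(t1.replace(' ', 'T'))
        val ts2 = LocalDateTime.parse(t2.replace(' ', 'T'))
        math.abs(ChronoUnit.DAYS.between(ts1, ts2)) > 2
      case _ => false
    }
  }
}
```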
Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed
160 G of Parquet files (ca. 30 files, snappy compressed, made by Cloudera Impala). Ca. 30 full table scans, take 3-5 columns out, then some normal Scala operations like substring, groupBy, filter; at the end, save as a file in HDFS. yarn-client mode, 23 cores and 60 G mem / node, but it always fails! Startup script (3 NodeManagers, each an executor): some screenshots: http://apache-spark-user-list.1001560.n3.nabble.com/file/n10254/spark1.png http://apache-spark-user-list.1001560.n3.nabble.com/file/n10254/spark2.png I got some log like: the same job using standalone mode (3 slaves) works... startup script (each 24 cores, 64 g mem): any idea? Thanks a lot!
Re: Shark CDH5 Final Release
Hi, you can take a look here: http://www.abcn.net/2014/04/install-shark-on-cdh5-hadoop2-spark.html