Issue on Spark SQL insert or create table with Spark running on AWS EMR -- s3n.S3NativeFileSystem: rename never finished

2015-04-01 Thread chutium
Hi,

we always run into issues when inserting into or creating tables with the Amazon
EMR Spark build: when the inserted result set is about 1 GB, the Spark SQL query
never finishes.

Inserting a smaller result set (around 500 MB) works fine.

Neither the default *spark.sql.shuffle.partitions* (200)
nor *set spark.sql.shuffle.partitions=1*
helps.

the log stops at:

15/04/01 15:48:13 INFO s3n.S3NativeFileSystem: rename
s3://hive-db/tmp/hive-hadoop/hive_2015-04-01_15-47-43_036_1196347178448825102-15/-ext-1
s3://hive-db/db_xxx/some_huge_table/

after that, only metrics.MetricsSaver log entries appear.

we set

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>s3://hive-db/</value>
  </property>

but hive.exec.scratchdir is not set; I have no idea why the temp files were
created in s3://hive-db/tmp/hive-hadoop/

we just tried the newest Spark 1.3.0 on AMI 3.5.x and AMI 3.6
(https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/VersionInformation.md),
and it still does not work.

Has anyone run into the same issue? Any idea how to fix it?

I believe Amazon EMR's Spark build uses
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem to access S3, not the
original Hadoop s3n implementation, right?

/home/hadoop/spark/classpath/emr/*
and
/home/hadoop/spark/classpath/emrfs/*
are on the classpath.

By the way, is there any plan to use the new Hadoop s3a implementation instead
of s3n?

Thanks for any help.

Teng




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Issue-on-Spark-SQL-insert-or-create-table-with-Spark-running-on-AWS-EMR-s3n-S3NativeFileSystem-renamd-tp22340.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




RE: SchemaRDD - Parquet - insertInto makes many files

2014-09-08 Thread chutium
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-NumberofTasks

It would be great if something like hive.exec.reducers.bytes.per.reducer
could be implemented.

One idea is to get the total size of all target blocks and then derive the
number of partitions from it, as in the sketch below.
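
A minimal sketch of that idea, assuming sc and schemaRDD (the SchemaRDD being
inserted) are already in scope; bytesPerReducer, the warehouse path and the
table name are placeholders I made up, not existing Spark settings:

import org.apache.hadoop.fs.{FileSystem, Path}

// target output size per partition/file -- an assumption, not a Spark setting
val bytesPerReducer = 256L * 1024 * 1024
val fs = FileSystem.get(sc.hadoopConfiguration)
val totalBytes = fs.getContentSummary(new Path("/user/hive/warehouse/some_table")).getLength
val numPartitions = math.max(1, math.ceil(totalBytes.toDouble / bytesPerReducer).toInt)

// write roughly numPartitions files instead of spark.sql.shuffle.partitions many
schemaRDD.coalesce(numPartitions).insertInto("target_table")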



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-Parquet-insertInto-makes-many-files-tp13480p13740.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Storage Handlers in Spark SQL

2014-08-26 Thread chutium
It seems he means querying an RDBMS or Cassandra using Spark SQL, i.e. multiple
data sources for Spark SQL.

I looked through the link he posted:
https://docs.wso2.com/display/BAM241/Creating+Hive+Queries+to+Analyze+Data#CreatingHiveQueriestoAnalyzeData-CreatingHivetablesforvariousdatasources

Using their storage handlers, users can create a Hive external table on top of
a Cassandra table or an RDBMS table (via JDBC).

so Niranda, maybe you can take a look at this API:
https://issues.apache.org/jira/browse/SPARK-2179
and there is some documentation in the pull request:
https://github.com/apache/spark/pull/1774

There is a similar implementation to your JDBC storage handler in Spark SQL;
it could also serve as a sample of the public API for DataTypes and Schema:
https://github.com/apache/spark/pull/1612
(https://issues.apache.org/jira/browse/SPARK-2710)

And in some other user-list threads I saw that some kind of Cassandra mapper
is also being developed by DataStax?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Re-Storage-Handlers-in-Spark-SQL-tp12780p12818.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: sqlContext.parquetFile(path) fails if path is a file but succeeds if a directory

2014-08-19 Thread chutium
It is definitely a bug; sqlContext.parquetFile should accept both a directory
and a single file as parameter.

This if-check for isDir makes no sense after this commit:
https://github.com/apache/spark/pull/1370/files#r14967550

I opened a ticket for this issue:
https://issues.apache.org/jira/browse/SPARK-3138

The ticket shows how to reproduce the bug.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/sqlContext-parquetFile-path-fails-if-path-is-a-file-but-succeeds-if-a-directory-tp12345p12426.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Multiple column families vs Multiple tables

2014-08-19 Thread chutium
ö_ö  you should send this message to the HBase user list, not the Spark user list...

But I can give you some personal advice on this: keep the number of column
families as small as possible!

At least, using a prefix on the column qualifier could also be an idea (see the
sketch below), but read performance may be worse for a use case like searching
for a row with value X in column family A and value Y in column family B.
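
A rough sketch of the qualifier-prefix idea using the plain HBase client API;
the column family and prefix names are placeholders I made up:

import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.ColumnPrefixFilter
import org.apache.hadoop.hbase.util.Bytes

// keep a single column family "cf" and encode the old families as qualifier prefixes
val scan = new Scan()
scan.addFamily(Bytes.toBytes("cf"))
scan.setFilter(new ColumnPrefixFilter(Bytes.toBytes("typeA_")))  // only qualifiers starting with "typeA_"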

So it depends on which workload is important for you. If your use case is very
read-heavy and you really want to use multiple column families while keeping
good read performance, you should try to disable region splits, tune the
compaction interval carefully, and so on.

There is a good slide deck on this:
http://photo.weibo.com/1431095941/wbphotos/large/mid/3735178188435939/pid/554cca85gw1eiloddlqa5j20or0ik77z

More slides about HBase + coprocessors, HBase + Hive and HBase + Spark:
http://www.weibo.com/1431095941/BeL90zozx




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-column-families-vs-Multiple-tables-tp12425p12439.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Is hive UDF are supported in HiveContext

2014-08-19 Thread chutium
There is no collect_list in Hive 0.12.

Try it after this ticket is done:
https://issues.apache.org/jira/browse/SPARK-2706

I am also looking forward to this.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-hive-UDF-are-supported-in-HiveContext-tp12310p12444.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: How to direct insert vaules into SparkSQL tables?

2014-08-14 Thread chutium
Oh right, I meant within SQLContext alone: a SchemaRDD created from a text file
with a case class, as in the sketch below.
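
A minimal sketch of what I mean (Spark 1.x SQLContext API; the file path, field
layout and table name are just placeholders):

case class Person(name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit RDD[Person] -> SchemaRDD conversion

val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerAsTable("people")
sqlContext.sql("SELECT name FROM people WHERE age >= 13").collect().foreach(println)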



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-direct-insert-vaules-into-SparkSQL-tables-tp11851p12100.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: share/reuse off-heap persisted (tachyon) RDD in SparkContext or saveAsParquetFile on tachyon in SQLContext

2014-08-12 Thread chutium
spark.speculation was not set; is there any speculative execution on the Tachyon side?

Only the following was changed in tachyon-env.sh:

export TACHYON_MASTER_ADDRESS=test01.zala
#export TACHYON_UNDERFS_ADDRESS=$TACHYON_HOME/underfs
export TACHYON_UNDERFS_ADDRESS=hdfs://test01.zala:8020
export TACHYON_WORKER_MEMORY_SIZE=16GB

test01.zala is the master node for HDFS, Tachyon, Spark, etc.
The worker nodes are
test02.zala
test03.zala
test04.zala

spark-shell runs on test02.

After parquetFile.saveAsParquetFile("tachyon://test01.zala:19998/parquet_1")

I get a FailedToCheckpointException with "Failed to rename".

But with tfs lsr there are some temporary files and a metadata file:

0.00 B     08-11-2014 16:19:28:054              /parquet_1
881.00 B   08-11-2014 16:19:28:054  In Memory   /parquet_1/_metadata
0.00 B     08-11-2014 16:19:28:314              /parquet_1/_temporary
0.00 B     08-11-2014 16:19:28:314              /parquet_1/_temporary/0
0.00 B     08-11-2014 16:19:28:931              /parquet_1/_temporary/0/_temporary
0.00 B     08-11-2014 16:19:28:931              /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_00_33
0.00 B     08-11-2014 16:19:28:931  In Memory   /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_00_33/part-r-1.parquet
0.00 B     08-11-2014 16:19:28:940              /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_01_35
0.00 B     08-11-2014 16:19:28:940  In Memory   /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_01_35/part-r-2.parquet
0.00 B     08-11-2014 16:19:28:962              /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_03_36
0.00 B     08-11-2014 16:19:28:962  In Memory   /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_03_36/part-r-4.parquet
0.00 B     08-11-2014 16:19:28:971              /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_02_34
0.00 B     08-11-2014 16:19:28:971  In Memory   /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_02_34/part-r-3.parquet
0.00 B     08-11-2014 16:20:06:349              /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_01_37
0.00 B     08-11-2014 16:20:06:349  In Memory   /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_01_37/part-r-2.parquet
0.00 B     08-11-2014 16:20:09:519              /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_03_38
0.00 B     08-11-2014 16:20:09:519  In Memory   /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_03_38/part-r-4.parquet
0.00 B     08-11-2014 16:20:18:777              /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_02_39
0.00 B     08-11-2014 16:20:18:777  In Memory   /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_02_39/part-r-3.parquet
0.00 B     08-11-2014 16:20:28:315              /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_00_40
0.00 B     08-11-2014 16:20:28:315  In Memory   /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_00_40/part-r-1.parquet
0.00 B     08-11-2014 16:20:38:382              /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_01_41
0.00 B     08-11-2014 16:20:38:382  In Memory   /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_01_41/part-r-2.parquet
0.00 B     08-11-2014 16:20:40:681              /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_03_42
0.00 B     08-11-2014 16:20:40:681  In Memory   /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_03_42/part-r-4.parquet
0.00 B     08-11-2014 16:20:50:376              /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_02_43
0.00 B     08-11-2014 16:20:50:376  In Memory   /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_02_43/part-r-3.parquet
0.00 B     08-11-2014 16:21:00:932              /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_00_44
0.00 B     08-11-2014 16:21:00:932  In Memory   /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_00_44/part-r-1.parquet
0.00 B     08-11-2014 16:21:10:355              /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_01_45
0.00 B     08-11-2014 16:21:10:355  In Memory   /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_01_45/part-r-2.parquet
0.00 B     08-11-2014 16:21:11:468              /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_03_46
0.00 B     08-11-2014 16:21:11:468  In Memory   /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_03_46/part-r-4.parquet
0.00 B     08-11-2014 16:21:21:681              /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_02_47
0.00 B     08-11-2014 16:21:21:681  In Memory   /parquet_1/_temporary/0/_temporary/attempt_201408111619_0017_r_02_47/part-r-3.parquet
0.00 B     08-11-2014 16:21:32:583

Re: share/reuse off-heap persisted (tachyon) RDD in SparkContext or saveAsParquetFile on tachyon in SQLContext

2014-08-12 Thread chutium
More interesting: if spark-shell is started on the master node (test01),

then

parquetFile.saveAsParquetFile("tachyon://test01.zala:19998/parquet_tablex")

14/08/12 11:42:06 INFO : initialize(tachyon://...
...
...
14/08/12 11:42:06 INFO : File does not exist:
tachyon://test01.zala:19998/parquet_tablex/_metadata
14/08/12 11:42:06 INFO : getWorkingDirectory: /
14/08/12 11:42:06 INFO :
create(tachyon://test01.zala:19998/parquet_tablex/_metadata, rw-r--r--,
true, 65536, 1, 33554432, null)
14/08/12 11:42:06 WARN : tachyon.home is not set. Using
/mnt/tachyon_default_home as the default value.
14/08/12 11:42:06 INFO : Trying to get local worker host : test01.zala
14/08/12 11:42:06 ERROR : No local worker on test01.zala
NoWorkerException(message:No local worker on test01.zala)
at
tachyon.thrift.MasterService$user_getWorker_result$user_getWorker_resultStandardScheme.read(MasterService.java:25675)
at
tachyon.thrift.MasterService$user_getWorker_result$user_getWorker_resultStandardScheme.read(MasterService.java:25652)
at
tachyon.thrift.MasterService$user_getWorker_result.read(MasterService.java:25591)
at
tachyon.org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at
tachyon.thrift.MasterService$Client.recv_user_getWorker(MasterService.java:832)
at
tachyon.thrift.MasterService$Client.user_getWorker(MasterService.java:818)
at tachyon.master.MasterClient.user_getWorker(MasterClient.java:648)
at tachyon.worker.WorkerClient.connect(WorkerClient.java:199)
at tachyon.worker.WorkerClient.mustConnect(WorkerClient.java:360)
at
tachyon.worker.WorkerClient.getUserUfsTempFolder(WorkerClient.java:298)
at
tachyon.client.TachyonFS.createAndGetUserUfsTempFolder(TachyonFS.java:270)
at tachyon.client.FileOutStream.init(FileOutStream.java:72)
at tachyon.client.TachyonFile.getOutStream(TachyonFile.java:207)
at tachyon.hadoop.AbstractTFS.create(AbstractTFS.java:102)
at tachyon.hadoop.TFS.create(TFS.java:24)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:784)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:773)
at
parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:344)
at
org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:345)
at
org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:142)
at
org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:120)
at
org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:197)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:399)
at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:397)
at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:403)
at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:403)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:406)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:406)
at
org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:77)
at
org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:103)
at $line12.$read$$iwC$$iwC$$iwC$$iwC.init(console:17)
at $line12.$read$$iwC$$iwC$$iwC.init(console:22)
at $line12.$read$$iwC$$iwC.init(console:24)
at $line12.$read$$iwC.init(console:26)
at $line12.$read.init(console:28)
at $line12.$read$.init(console:32)
at $line12.$read$.clinit(console)
at $line12.$eval$.init(console:7)
at $line12.$eval$.clinit(console)
at $line12.$eval.$print(console)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
at
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
at 

Re: CDH5, HiveContext, Parquet

2014-08-11 Thread chutium
The hive-thriftserver does not work with Parquet tables in the Hive metastore
either; will this PR fix that too?

Is there no need to change any pom.xml?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/CDH5-HiveContext-Parquet-tp11853p11880.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: How to direct insert vaules into SparkSQL tables?

2014-08-11 Thread chutium
No, Spark SQL cannot insert into or update a text file yet; it can only insert
into Parquet files.

But

people.union(new_people).registerAsTable("people")

could be an idea.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-direct-insert-vaules-into-SparkSQL-tables-tp11851p11882.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




share/reuse off-heap persisted (tachyon) RDD in SparkContext or saveAsParquetFile on tachyon in SQLContext

2014-08-11 Thread chutium
Sharing/reusing RDDs is useful for many use cases; is this possible by
persisting an RDD on Tachyon?

For example, off-heap persisting a named RDD into a given path (instead of
/tmp_spark_tachyon/spark-xxx-xxx-xxx), as sketched below,
or
saveAsParquetFile on Tachyon.
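
A minimal sketch of the off-heap persist variant I mean (Spark 1.x StorageLevel
API; the input path is a placeholder, and the Tachyon directory is chosen by
Spark via spark.tachyonStore.url rather than by the user, which is exactly the
limitation described above):

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs://test01.zala:8020/some/input")
rdd.persist(StorageLevel.OFF_HEAP)  // blocks go to Tachyon under a Spark-managed directory
rdd.count()                         // materialize the blocks so they can be reused in this context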

I tried to save a SchemaRDD on Tachyon:

val parquetFile =
  sqlContext.parquetFile("hdfs://test01.zala:8020/user/hive/warehouse/parquet_tables.db/some_table/")
parquetFile.saveAsParquetFile("tachyon://test01.zala:19998/parquet_1")

but it always fails; the first error message is:

14/08/11 16:19:28 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in
memory on test03.zala:37377 (size: 18.7 KB, free: 16.6 GB)
14/08/11 16:20:06 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 3.0
(TID 35, test04.zala): java.io.IOException:
FailedToCheckpointException(message:Failed to rename
hdfs://test01.zala:8020/tmp/tachyon/workers/140776003/31806/730 to
hdfs://test01.zala:8020/tmp/tachyon/data/730)
tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:112)
tachyon.client.TachyonFS.addCheckpoint(TachyonFS.java:168)
tachyon.client.FileOutStream.close(FileOutStream.java:104)
   
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
   
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:321)
   
parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:111)
   
parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
   
org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:259)
   
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:272)
   
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:272)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
   
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
   
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:722)



hdfs://test01.zala:8020/tmp/tachyon/
is already chmod'ed to 777; owner and group are the same as the Spark/Tachyon
startup user.

Off-heap persist or saving as a normal text file on Tachyon works fine.

CDH 5.1.0, Spark 1.1.0 snapshot, Tachyon 0.6 snapshot.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/share-reuse-off-heap-persisted-tachyon-RDD-in-SparkContext-or-saveAsParquetFile-on-tachyon-in-SQLCont-tp11897.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: How to use spark-cassandra-connector in spark-shell?

2014-08-08 Thread chutium
Try adding the following jars to the classpath:

xxx/cassandra-all-2.0.6.jar:xxx/cassandra-thrift-2.0.6.jar:xxx/libthrift-0.9.1.jar:xxx/cassandra-driver-spark_2.10-1.0.0-SNAPSHOT.jar:xxx/cassandra-java-driver-2.0.2/cassandra-driver-core-2.0.2.jar:xxx/cassandra-java-driver-2.0.2/cassandra-driver-dse-2.0.2.jar

then in spark-shell:

import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf(true).set("cassandra.connection.host", "your-cassandra-host")
val sc = new SparkContext("local[1]", "cassandra-driver", conf)
import com.datastax.driver.spark._
sc.cassandraTable("db1", "table1").select("key").count




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-spark-cassandra-connector-in-spark-shell-tp11757p11781.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark with HBase

2014-08-07 Thread chutium
These two posts should be good for setting up a Spark + HBase environment and
for using the results of an HBase table scan as an RDD.

Setup:
http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html

Some samples:
http://www.abcn.net/2014/07/spark-hbase-result-keyvalue-bytearray.html
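
For reference, a rough sketch of the basic pattern those posts build on
(the table name is a placeholder):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "some_table")

// each element is (row key, Result) for one HBase row
val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

hbaseRDD.count()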



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-with-HBase-tp11629p11647.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: reduceByKey to get all associated values

2014-08-07 Thread chutium
A long time ago, at Spark Summit 2013, Patrick Wendell said in his talk about
performance
(http://spark-summit.org/talk/wendell-understanding-the-performance-of-spark-applications/)
that reduceByKey will be more efficient than groupByKey... he mentioned that
groupByKey copies all data over the network.

Is that still true? Which one should we choose? Because actually we could
replace every groupByKey with reduceByKey.

For example, if we want to use groupByKey on an RDD[(String, String)] to get
an RDD[(String, Seq[String])],

we can also do it with reduceByKey:
first map the RDD[(String, String)] to an RDD[(String, Seq[String])],
then call reduceByKey(_ ++ _) on that RDD, as in the sketch below.
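
A minimal sketch of the two variants (the sample data is made up):

val pairs = sc.parallelize(Seq(("a", "1"), ("a", "2"), ("b", "3")))

// groupByKey version: RDD[(String, Iterable[String])]
val grouped = pairs.groupByKey()

// reduceByKey version: wrap each value in a Seq first, then concatenate
val reduced = pairs.mapValues(v => Seq(v)).reduceByKey(_ ++ _)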



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/reduceByKey-to-get-all-associated-values-tp11645p11652.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Save an RDD to a SQL Database

2014-08-07 Thread chutium
Right, Spark acts more like an OLAP system; I believe no one will use Spark as
an OLTP system, so there is always the question of how to share data between
these two platforms efficiently.

And more important is that most enterprise BI tools rely on an RDBMS, or at
least on a JDBC/ODBC interface.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Save-an-RDD-to-a-SQL-Database-tp11516p11672.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-23 Thread chutium
In Spark 1.1 it may not be as easy as in Spark 1.0, after this commit:
https://issues.apache.org/jira/browse/SPARK-2446

Only binary columns with the UTF8 annotation are recognized as strings after
this commit, but Impala always writes strings without the UTF8 annotation.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-1-SQL-on-160-G-parquet-file-snappy-compressed-made-by-cloudera-impala-23-core-and-60G-mem-d-tp10254p10490.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


why there is only getString(index) but no getString(columnName) in catalyst.expressions.Row.scala ?

2014-07-23 Thread chutium
I do not always want to use

schemaRDD.map {
  case Row(xxx) =>
    ...
}

because with a case pattern we have to spell out the table schema again.

Is there any plan to implement this?
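
In the meantime, a workaround sketch: resolve the column index from the schema
once and use the index-based getter. This assumes the StructType schema API
from SPARK-2179 (Spark 1.1+); the table and column names are placeholders:

val schemaRDD = sqlContext.sql("SELECT name, age FROM people")

// look up the position of the column once, then use the index-based getter
val nameIdx = schemaRDD.schema.fields.map(_.name).indexOf("name")
val names = schemaRDD.map(row => row.getString(nameIdx))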

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/why-there-is-only-getString-index-but-no-getString-columnName-in-catalyst-expressions-Row-scala-tp10521.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-21 Thread chutium
Hi,

unfortunately it is not so straightforward

xxx_parquet.db

is the folder of a managed database created by Hive/Impala, so every
sub-element in it is a table in Hive/Impala. They are folders in HDFS, each
table has a different schema, and each folder contains one or more Parquet
files.

that means

xx001_suffix
xx002_suffix

are folders, and they contain Parquet files like

xx001_suffix/parquet_file1_with_schema1

xx002_suffix/parquet_file1_with_schema2
xx002_suffix/parquet_file2_with_schema2

it seems only union can do this job~

Nonetheless, thank you very much; maybe the only reason is that Spark is eating
up too much memory...



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-1-SQL-on-160-G-parquet-file-snappy-compressed-made-by-cloudera-impala-23-core-and-60G-mem-d-tp10254p10335.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: gain access to persisted rdd

2014-07-21 Thread chutium
But at least if users want to access the persisted RDDs, they can use
sc.getPersistentRDDs within the same context.
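
For example, a small sketch (the RDD name is a placeholder and assumes the RDD
was given a name via setName before being persisted):

// find an already-persisted RDD by name within the same SparkContext
val cached = sc.getPersistentRDDs.values.find(_.name == "my_cached_rdd")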



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/gain-access-to-persisted-rdd-tp10313p10337.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-21 Thread chutium
No, something like this:

14/07/20 00:19:29 ERROR cluster.YarnClientClusterScheduler: Lost executor 2
on 02.xxx: remote Akka client disassociated

...
...

14/07/20 00:21:13 WARN scheduler.TaskSetManager: Lost TID 832 (task 1.2:186)
14/07/20 00:21:13 WARN scheduler.TaskSetManager: Loss was due to
java.io.IOException
java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
at
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:779)
at
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at
parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:599)
at
parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:360)
at
parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:100)
at
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
at
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
at
org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:122)
at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)


ulimit has already been increased.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-1-SQL-on-160-G-parquet-file-snappy-compressed-made-by-cloudera-impala-23-core-and-60G-mem-d-tp10254p10344.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-20 Thread chutium
like this:

val sc = new SparkContext(new SparkConf().setAppName("SLA Filter"))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
val suffix = args(0)
sqlContext.parquetFile("/user/hive/warehouse/xxx_parquet.db/xx001_" + suffix).registerAsTable("xx001")
sqlContext.parquetFile("/user/hive/warehouse/xxx_parquet.db/xx002_" + suffix).registerAsTable("xx002")
...
...
var xx001 = sql("select some_id, some_type, some_time from xx001").map(line => (line(0), (line(1).toString, line(2).toString.substring(0, 19))))
var xx002 = sql("select some_id, some_type, some_time from xx002").map(line => (line(0), (line(1).toString, line(2).toString.substring(0, 19))))
...
...

var all = xx001 union xx002 ... union ...

all.groupByKey.filter(kv => FilterSLA.filterSLA(kv._2.toSeq)).saveAsTextFile("xxx")

filterSLA turns the input Seq[(String, String)] into a Map, then checks
something like: if the map contains type1 and type2, compare timestamp_type1
- timestamp_type2 against 2 days (a sketch follows below).
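
A hypothetical sketch of such a filterSLA, based only on the description above;
the type keys, the date format and the direction of the 2-day comparison are
my assumptions:

import java.text.SimpleDateFormat

object FilterSLA {
  private val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")  // matches substring(0, 19) above

  def filterSLA(events: Seq[(String, String)]): Boolean = {
    val byType = events.toMap                                    // some_type -> some_time
    byType.contains("type1") && byType.contains("type2") && {
      val t1 = fmt.parse(byType("type1")).getTime
      val t2 = fmt.parse(byType("type2")).getTime
      math.abs(t1 - t2) > 2L * 24 * 60 * 60 * 1000               // differ by more than 2 days
    }
  }
}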


thanks




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-1-SQL-on-160-G-parquet-file-snappy-compressed-made-by-cloudera-impala-23-core-and-60G-mem-d-tp10254p10268.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-19 Thread chutium
160 G of Parquet files (ca. 30 files, snappy compressed, made by Cloudera Impala)

ca. 30 full table scans, taking 3-5 columns out, then some normal Scala
operations like substring, groupBy, filter; at the end the result is saved as a
file in HDFS.

yarn-client mode, 23 cores and 60 G mem / node

but it always fails!

startup script (3 NodeManagers, one executor each):




some screenshot:


http://apache-spark-user-list.1001560.n3.nabble.com/file/n10254/spark1.png 


http://apache-spark-user-list.1001560.n3.nabble.com/file/n10254/spark2.png 



I got some log like:




The same job works using standalone mode (3 slaves)...

startup script (24 cores, 64 G mem each):




any idea?

thanks a lot!





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-1-SQL-on-160-G-parquet-file-snappy-compressed-made-by-cloudera-impala-23-core-and-60G-mem-d-tp10254.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Shark CDH5 Final Release

2014-04-10 Thread chutium
Hi, you can take a look here:

http://www.abcn.net/2014/04/install-shark-on-cdh5-hadoop2-spark.html



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Shark-CDH5-Final-Release-tp3826p4055.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.