[jira] [Commented] (SPARK-5947) First class partitioning support in data sources API

2015-02-25 Thread Philippe Girolami (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336556#comment-14336556
 ] 

Philippe Girolami commented on SPARK-5947:
--

For some workloads, it can make more sense to use Hive's SKEWED BY rather than 
PARTITIONED BY, in order to avoid creating thousands of tiny partitions just to 
handle a few large ones.
As far as I can tell, these two cases can't be inferred from a directory layout, 
so maybe it would make sense to make PARTITION & SKEW part of Spark too, and 
rely on metadata defined by the application rather than directory discovery?
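To make that concrete, here is a rough sketch of how an application could declare both partitioning and skew as metadata instead of relying on path discovery. The names (PartitionSpec, SkewSpec, TableLayout) are made up for illustration and are not any existing Spark or Hive API:
{code}
// Illustrative only: hypothetical application-supplied layout metadata
// describing partition columns and known skewed values up front.
case class PartitionSpec(column: String, dataType: String, values: Seq[String])
case class SkewSpec(column: String, skewedValues: Seq[String])
case class TableLayout(
    basePath: String,
    partitions: Seq[PartitionSpec],
    skew: Option[SkewSpec])

val layout = TableLayout(
  basePath = "hdfs:///warehouse/events",
  partitions = Seq(PartitionSpec("date", "string", Seq("2015-02-24", "2015-02-25"))),
  skew = Some(SkewSpec("customer_id", Seq("big_customer"))))
{code}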

 First class partitioning support in data sources API
 

 Key: SPARK-5947
 URL: https://issues.apache.org/jira/browse/SPARK-5947
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Lian

 For file system based data sources, implementing Hive style partitioning 
 support can be complex and error prone. To be specific, partitioning support 
 includes:
 # Partition discovery: given a directory organized similarly to Hive 
 partitions, discover the directory structure and partitioning information 
 automatically, including partition column names, data types, and values.
 # Reading from partitioned tables
 # Writing to partitioned tables
 It would be good to have first class partitioning support in the data sources 
 API. For example, add a {{FileBasedScan}} trait with callbacks and default 
 implementations for these features.
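As a very rough sketch of the discovery piece: the {{FileBasedScan}} name comes from the description above, but the shape of the trait below is purely hypothetical, not an existing Spark API.
{code}
// Hypothetical sketch: a trait with a default Hive-style discovery callback.
trait FileBasedScan {
  /** Root directory of the table, e.g. "hdfs:///warehouse/events". */
  def basePath: String

  /**
   * Default partition discovery for Hive-style layouts: parse key=value path
   * segments under basePath, e.g.
   * "hdfs:///warehouse/events/year=2015/month=02/part-00000"
   * yields Seq("year" -> "2015", "month" -> "02").
   */
  def partitionValues(filePath: String): Seq[(String, String)] =
    filePath.stripPrefix(basePath)
      .split('/')
      .filter(_.contains("="))
      .map { seg =>
        val Array(key, value) = seg.split("=", 2)
        key -> value
      }
      .toSeq
}
{code}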






[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2015-02-24 Thread Philippe Girolami (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334727#comment-14334727
 ] 

Philippe Girolami commented on SPARK-1867:
--

[~srowen] I'm only getting this issue when building from source myself 
using Maven. Every snapshot, RC & release I've downloaded has worked just fine, 
so for me it's a packaging issue when building locally.


 Spark Documentation Error causes java.lang.IllegalStateException: unread 
 block data
 ---

 Key: SPARK-1867
 URL: https://issues.apache.org/jira/browse/SPARK-1867
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: sam

 I've employed two System Administrators on a contract basis (for quite a bit 
 of money), and both contractors have independently hit the following 
 exception.  What we are doing is:
 1. Installing Spark 0.9.1 according to the documentation on the website, 
 along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs.
 2. Building a fat jar with a Spark app with sbt then trying to run it on the 
 cluster
 I've also included code snippets, and sbt deps at the bottom.
 When I've Googled this, there seem to be two somewhat vague responses:
 a) Mismatching spark versions on nodes/user code
 b) Need to add more jars to the SparkConf
 Now I know that (b) is not the problem, having successfully run the same code 
 on other clusters while only including one jar (it's a fat jar).
 But I have no idea how to check for (a) - it appears Spark doesn't have any 
 version checks or anything - it would be nice if it checked versions and 
 threw a mismatching-version exception: you have user code using version X 
 and node Y has version Z.
 I would be very grateful for advice on this.
 The exception:
 Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 
 0.0:1 failed 32 times (most recent failure: Exception failure: 
 java.lang.IllegalStateException: unread block data)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to 
 java.lang.IllegalStateException: unread block data [duplicate 59]
 My code snippet:
 val conf = new SparkConf()
.setMaster(clusterMaster)
.setAppName(appName)
.setSparkHome(sparkHome)
.setJars(SparkContext.jarOfClass(this.getClass))
 println("count = " + new SparkContext(conf).textFile(someHdfsPath).count())
 My SBT dependencies:
 // relevant
 "org.apache.spark" % "spark-core_2.10" % "0.9.1",
 "org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0",
 // standard, probably unrelated
 "com.github.seratch" %% "awscala" % "[0.2,)",
 "org.scalacheck" %% "scalacheck" % "1.10.1" % "test",
 "org.specs2" %% "specs2" % "1.14" % "test",
 "org.scala-lang" % "scala-reflect" % "2.10.3",
 "org.scalaz" %% "scalaz-core" % "7.0.5",
 "net.minidev" % "json-smart" % "1.2"
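On the version-check idea in the description above, a minimal sketch of what user code could do on the 1.x line. This assumes org.apache.spark.SPARK_VERSION is available (Spark 1.2+, not the 0.9.x release used here), and a genuine mismatch may well fail before the check completes, which is arguably why it belongs in Spark itself:
{code}
// Hedged sketch, not an existing Spark facility: compare the driver's version
// string with whatever version the executor classpath reports.
val driverVersion = org.apache.spark.SPARK_VERSION
val executorVersions = sc.parallelize(1 to 100, 10)
  .map(_ => org.apache.spark.SPARK_VERSION)
  .distinct()
  .collect()
val mismatched = executorVersions.filterNot(_ == driverVersion)
if (mismatched.nonEmpty) {
  sys.error("Spark version mismatch: driver=" + driverVersion +
    ", executors=" + mismatched.mkString(", "))
}
{code}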




[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2015-02-14 Thread Philippe Girolami (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321747#comment-14321747
 ] 

Philippe Girolami commented on SPARK-1867:
--

I tried again this afternoon using the 1.3 snapshot and I still run into it. I 
have checked that the Java version I compiled with is the same as the one I'm 
running:
{code}
Philippes-MacBook-Air-3:spark Philippe$ java -version
java version "1.7.0_75"
Java(TM) SE Runtime Environment (build 1.7.0_75-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)
Philippes-MacBook-Air-3:spark Philippe$ mvn -version
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 
2014-08-11T22:58:10+02:00)
Maven home: /Users/Philippe/Documents/apache-maven-3.2.3
Java version: 1.7.0_75, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_75.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: mac os x, version: 10.9.5, arch: x86_64, family: mac
{code}

Maybe another JVM gets chosen when I run spark-shell? Is there a log somewhere 
that I could check?
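One quick way to answer that from inside the shell itself, using plain JVM system properties rather than anything Spark-specific:
{code}
// Run inside spark-shell: shows which JVM the driver actually picked up.
println(System.getProperty("java.home"))    // should point at the JDK used for the Maven build
println(System.getProperty("java.version")) // e.g. 1.7.0_75
{code}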

 Spark Documentation Error causes java.lang.IllegalStateException: unread 
 block data
 ---

 Key: SPARK-1867
 URL: https://issues.apache.org/jira/browse/SPARK-1867
 Project: Spark
  Issue Type: Bug
Reporter: sam


[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2015-02-05 Thread Philippe Girolami (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307580#comment-14307580
 ] 

Philippe Girolami commented on SPARK-1867:
--

Has anyone figured this out? I'm seeing this happen when running spark-shell 
off the master branch (at cd5da42), using the same example as [~ansonism]. 
It works fine in 1.2.0, downloaded from the website.
{code}
val source = sc.textFile("/tmp/testfile.txt")
source.saveAsTextFile("/tmp/test_spark_output")
{code}

I built master using
{code}
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver 
-Pbigtop-dist -DskipTests clean package install
{code}
on Mac OS X using Sun Java 7:
{quote}
java version "1.7.0_60"
Java(TM) SE Runtime Environment (build 1.7.0_60-b19)
Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)
{quote}
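
For what it's worth, a quick way to confirm which Spark build the shell actually loaded (useful when a locally built assembly and a downloaded release both sit on the same machine; SPARK_VERSION assumes a 1.2+ build):
{code}
// Run inside spark-shell: prints the version string and the jar SparkContext
// was loaded from.
println(org.apache.spark.SPARK_VERSION)
println(classOf[org.apache.spark.SparkContext]
  .getProtectionDomain.getCodeSource.getLocation)
{code}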

 Spark Documentation Error causes java.lang.IllegalStateException: unread 
 block data
 ---

 Key: SPARK-1867
 URL: https://issues.apache.org/jira/browse/SPARK-1867
 Project: Spark
  Issue Type: Bug
Reporter: sam


[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2015-02-05 Thread Philippe Girolami (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307719#comment-14307719
 ] 

Philippe Girolami commented on SPARK-1867:
--

[~srowen] I'm unfortunately reporting this bug. To mitigate SPARK-5557, I've 
reverted my working branch to commit cd5da42 until it gets sorted out. I should 
have included the stack trace. I think someone could easily verify by doing a 
clean clone, checking out cd5da42, and building the way I describe; then it's 
simply a matter of launching spark-shell. If that works for you, then I agree 
it's on my side, but I can't imagine how it could be, given the steps I 
describe to reproduce it.

{code}
Philippes-MacBook-Air-3:spark Philippe$ bin/spark-shell
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/02/05 19:13:15 INFO SecurityManager: Changing view acls to: Philippe
15/02/05 19:13:15 INFO SecurityManager: Changing modify acls to: Philippe
15/02/05 19:13:15 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(Philippe); users 
with modify permissions: Set(Philippe)
15/02/05 19:13:15 INFO HttpServer: Starting HTTP Server
15/02/05 19:13:16 INFO Utils: Successfully started service 'HTTP class server' 
on port 61040.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0-SNAPSHOT
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
15/02/05 19:13:21 INFO SparkContext: Running Spark version 1.3.0-SNAPSHOT
15/02/05 19:13:21 INFO SecurityManager: Changing view acls to: Philippe
15/02/05 19:13:21 INFO SecurityManager: Changing modify acls to: Philippe
15/02/05 19:13:21 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(Philippe); users 
with modify permissions: Set(Philippe)
15/02/05 19:13:22 INFO Slf4jLogger: Slf4jLogger started
15/02/05 19:13:22 INFO Remoting: Starting remoting
15/02/05 19:13:22 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkDriver@192.168.1.31:61043]
15/02/05 19:13:22 INFO Utils: Successfully started service 'sparkDriver' on 
port 61043.
15/02/05 19:13:22 INFO SparkEnv: Registering MapOutputTracker
15/02/05 19:13:22 INFO SparkEnv: Registering BlockManagerMaster
15/02/05 19:13:22 INFO DiskBlockManager: Created local directory at 
/var/folders/8r/0ty24ys52kvdvx8r6nz2cdc0gn/T/spark-local-20150205191322-7e22
15/02/05 19:13:22 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/02/05 19:13:22 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
15/02/05 19:13:23 INFO HttpFileServer: HTTP File server directory is 
/var/folders/8r/0ty24ys52kvdvx8r6nz2cdc0gn/T/spark-8400830a-a7fc-4909-ae37-ee4b48e3ff88
15/02/05 19:13:23 INFO HttpServer: Starting HTTP Server
15/02/05 19:13:23 INFO Utils: Successfully started service 'HTTP file server' 
on port 61044.
15/02/05 19:13:23 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.
15/02/05 19:13:23 INFO Utils: Successfully started service 'SparkUI' on port 
4041.
15/02/05 19:13:23 INFO SparkUI: Started SparkUI at http://192.168.1.31:4041
15/02/05 19:13:23 INFO Executor: Using REPL class URI: http://192.168.1.31:61040
15/02/05 19:13:23 INFO AkkaUtils: Connecting to HeartbeatReceiver: 
akka.tcp://sparkDriver@192.168.1.31:61043/user/HeartbeatReceiver
15/02/05 19:13:23 INFO NettyBlockTransferService: Server created on 61046
15/02/05 19:13:23 INFO BlockManagerMaster: Trying to register BlockManager
15/02/05 19:13:23 INFO BlockManagerMasterActor: Registering block manager 
localhost:61046 with 265.4 MB RAM, BlockManagerId(driver, localhost, 61046)
15/02/05 19:13:23 INFO BlockManagerMaster: Registered BlockManager
15/02/05 19:13:23 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala> val source = sc.textFile("/tmp/test")
15/02/05 19:13:27 INFO MemoryStore: ensureFreeSpace(163705) called with 
curMem=0, maxMem=278302556
15/02/05 19:13:27 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 159.9 KB, free 265.3 MB)
15/02/05 19:13:27 INFO MemoryStore: ensureFreeSpace(22736) called with 
curMem=163705, maxMem=278302556
15/02/05 19:13:27 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in 
memory (estimated size 22.2 KB, free 265.2 MB)
15/02/05 19:13:27 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 
localhost:61046 (size: 22.2 KB, free: 265.4 MB)
15/02/05 19:13:27 INFO BlockManagerMaster: Updated info of block 
broadcast_0_piece0
15/02/05 19:13:27 INFO SparkContext: Created