Encoding issue reading text file
Hi everyone, I'm trying to read a text file encoded as UTF-16LE, but I'm getting weird characters like this: �� W h e n

My code is this one:

sparkSession
  .read
  .format("text")
  .option("charset", "UTF-16LE")
  .load("textfile.txt")

I'm using Spark 2.3.1. Any ideas on how to fix it? Thanks -- Regards. Miguel Ángel
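For what it's worth, a workaround sketch, assuming the 2.3 "text" source silently ignores the charset option (which would explain UTF-16 bytes being read as UTF-8): decode the bytes yourself via binaryFiles. The path and BOM handling below are illustrative, not verified on 2.3.1:

import java.nio.charset.StandardCharsets
import sparkSession.implicits._

val df = sparkSession.sparkContext
  .binaryFiles("textfile.txt")
  .flatMap { case (_, stream) =>
    // Decode explicitly as UTF-16LE and drop a leading BOM if present
    new String(stream.toArray, StandardCharsets.UTF_16LE)
      .stripPrefix("\uFEFF")
      .split("\r?\n")
  }
  .toDF("value")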
Dataset error with Encoder
Hi, I have the following issue:

case class Item(c1: String, c2: String, c3: Option[BigDecimal])

import sparkSession.implicits._
val result = df.as[Item].groupByKey(_.c1).mapGroups((key, value) => {
  value
})

But I get the following error at compile time: "Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases." What am I missing? Thanks
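In case it helps, the likely cause: mapGroups needs an Encoder for whatever the function returns, and here it returns the Iterator itself, for which no encoder exists. Two hedged alternatives, assuming the goal is to keep the grouped items:

import sparkSession.implicits._

// Stream the items back out through flatMapGroups, which expects an iterator:
val flattened = df.as[Item].groupByKey(_.c1).flatMapGroups((key, items) => items)

// Or materialize the iterator into an encodable collection per key:
val perKey = df.as[Item].groupByKey(_.c1).mapGroups((key, items) => (key, items.toSeq))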
Hbase and Spark
I'm trying to build an application where it is necessary to do bulkGets and bulkLoads on HBase. I think that I could use this component: https://github.com/hortonworks-spark/shc Is it a good option? But I can't import it in my project: sbt cannot resolve the HBase connector. This is my build.sbt:

version := "1.0"
scalaVersion := "2.10.6"
mainClass in assembly := Some("com.location.userTransaction")
assemblyOption in assembly ~= { _.copy(includeScala = false) }

resolvers += "Typesafe Repo" at "http://repo.typesafe.com/typesafe/releases/"

val sparkVersion = "1.6.0"
val jettyVersion = "8.1.14.v20131031"
val hbaseConnectorVersion = "1.0.0-1.6-s_2.10"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-hive" % sparkVersion % "provided"
)
libraryDependencies += "com.hortonworks" % "shc" % hbaseConnectorVersion
libraryDependencies += "org.eclipse.jetty" % "jetty-client" % jettyVersion

-- Regards. Miguel Ángel
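One tentative guess: shc isn't published to Maven Central or the Typesafe repo, so sbt has nowhere to resolve it from. The project's README points at the Hortonworks repository, so something like the following might fix the resolution (the URL is taken on trust from that README — verify it):

resolvers += "Hortonworks Repo" at "http://repo.hortonworks.com/content/groups/public/"

libraryDependencies += "com.hortonworks" % "shc" % hbaseConnectorVersion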
Testing with spark testing base
Hi. I'm testing "spark testing base". For example:

class MyFirstTest extends FunSuite with SharedSparkContext {
  def tokenize(f: RDD[String]) = {
    f.map(_.split(" ").toList)
  }

  test("really simple transformation") {
    val input = List("hi", "hi miguel", "bye")
    val expected = List(List("hi"), List("hi", "miguel"), List("bye"))
    assert(tokenize(sc.parallelize(input)).collect().toList === expected)
  }
}

But... how can I launch this test? spark-submit or IntelliJ? Thanks. -- Regards Miguel
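For reference, ScalaTest suites like this one are normally launched by the build tool or the IDE rather than spark-submit; SharedSparkContext creates a local SparkContext for the suite by itself. A minimal sketch, assuming a standard sbt layout with the class under src/test/scala:

sbt test                     # runs every suite
sbt "testOnly MyFirstTest"   # runs just this suite

In IntelliJ, right-click the class and choose Run 'MyFirstTest'.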
Re: Debug Spark
This is very interesting. Thanks!!! On Thu, Dec 3, 2015 at 8:28 AM, Sudhanshu Janghel < sudhanshu.jang...@cloudwick.com> wrote: > Hi, > > Here is a doc that I had created for my team. This has steps along with > snapshots of how to set up debugging in Spark using IntelliJ locally. > > > https://docs.google.com/a/cloudwick.com/document/d/13kYPbmK61di0f_XxxJ-wLP5TSZRGMHE6bcTBjzXD0nA/edit?usp=sharing > > Kind Regards, > Sudhanshu > > On Thu, Dec 3, 2015 at 6:46 AM, Akhil Das > wrote: > >> This doc will get you started >> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ >> >> Thanks >> Best Regards >> >> On Sun, Nov 29, 2015 at 9:48 PM, Masf wrote: >> >>> Hi >>> >>> Is it possible to debug Spark locally with IntelliJ or another IDE? >>> >>> Thanks >>> >>> -- >>> Regards. >>> Miguel Ángel >>> >> >> > -- Regards. Miguel Ángel
Re: Debug Spark
Hi Ardo. Is there any tutorial on debugging with IntelliJ? Thanks. Regards. Miguel. On Sun, Nov 29, 2015 at 5:32 PM, Ndjido Ardo BAR wrote: > hi, > > IntelliJ is just great for that! > > cheers, > Ardo. > > On Sun, Nov 29, 2015 at 5:18 PM, Masf wrote: > >> Hi >> >> Is it possible to debug Spark locally with IntelliJ or another IDE? >> >> Thanks >> >> -- >> Regards. >> Miguel Ángel >> > > -- Regards. Miguel Ángel
Debug Spark
Hi. Is it possible to debug Spark locally with IntelliJ or another IDE? Thanks -- Regards. Miguel Ángel
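One simple sketch: run the driver with a local master so everything stays in one JVM and ordinary IDE breakpoints work (local[*] is the standard master string for this):

import org.apache.spark.{SparkConf, SparkContext}

// Run or debug this main class directly from the IDE; breakpoints in
// driver code and in functions executed by local tasks will be hit.
val conf = new SparkConf().setAppName("debug-session").setMaster("local[*]")
val sc = new SparkContext(conf)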
Re: SQLContext load. Filtering files
Thanks Akhil, I will have a look. I have a doubt regarding Spark Streaming and fileStream: if Spark Streaming crashes and new files are created in the input folder while Spark is down, how can I process those files when Spark Streaming is launched again? Thanks. Regards. Miguel. On Thu, Aug 27, 2015 at 12:29 PM, Akhil Das wrote: > Have a look at Spark Streaming. You can make use of the ssc.fileStream. > > Eg: > > val avroStream = ssc.fileStream[AvroKey[GenericRecord], NullWritable, > AvroKeyInputFormat[GenericRecord]](input) > > You can also specify a filter function > <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext> > as the second argument. > > Thanks > Best Regards > > On Wed, Aug 19, 2015 at 10:46 PM, Masf wrote: > >> Hi. >> >> I'd like to read Avro files using this library >> https://github.com/databricks/spark-avro >> >> I need to load several files from a folder, not all files. Is there some >> functionality to filter the files to load? >> >> And... Is it possible to know the names of the files loaded from a folder? >> >> My problem is that I have a folder where an external process is inserting >> files every X minutes and I need to process these files once, and I can't >> move, rename or copy the source files. >> >> >> Thanks >> -- >> >> Regards >> Miguel Ángel >> > > -- Regards. Miguel Ángel
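To make the suggestion above concrete, a sketch of the filter overload (the types follow Akhil's Avro example; newFilesOnly = false is the knob that lets a restarted context also consider files that already exist, within the limits of the remember window, so treat it as a starting point rather than a guarantee):

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.NullWritable

val avroStream = ssc.fileStream[AvroKey[GenericRecord], NullWritable,
    AvroKeyInputFormat[GenericRecord]](
  input,
  (path: Path) => path.getName.endsWith(".avro"), // only pick matching files
  newFilesOnly = false                            // also pick up pre-existing files
)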
Spark 1.3. Insert into hive parquet partitioned table from DataFrame
Hi. I have a DataFrame and I want to insert its data into a partitioned Parquet table in Hive. In Spark 1.4 I can use:

df.write.partitionBy("x","y").format("parquet").mode("append").saveAsTable("tbl_parquet")

but in Spark 1.3 I can't. How can I do it? Thanks -- Regards Miguel
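A possible route on 1.3, sketched with made-up column names (col1, col2): write through HiveContext with a dynamic-partition INSERT, which 1.3's HiveQL support should accept — worth verifying on your build:

df.registerTempTable("tmp_src")
hiveContext.sql("SET hive.exec.dynamic.partition = true")
hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
hiveContext.sql(
  """INSERT INTO TABLE tbl_parquet PARTITION (x, y)
    |SELECT col1, col2, x, y FROM tmp_src""".stripMargin) // partition columns go last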
SQLContext load. Filtering files
Hi. I'd like to read Avro files using this library: https://github.com/databricks/spark-avro I need to load several files from a folder, not all of them. Is there some functionality to filter the files to load? And... is it possible to know the names of the files loaded from a folder? My problem is that I have a folder where an external process is inserting files every X minutes and I need to process these files once, and I can't move, rename or copy the source files. Thanks -- Regards Miguel Ángel
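One approach, sketched with assumed paths and the 1.3-era SQLContext.load(path, source) form: list the folder yourself with the Hadoop FileSystem API, filter the names, load only those files, and keep the list as your own record of what has been processed:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val toProcess = fs.listStatus(new Path("/data/incoming"))
  .map(_.getPath.toString)
  .filter(_.endsWith(".avro"))   // or whatever "not yet processed" rule you track

val df = toProcess
  .map(p => sqlContext.load(p, "com.databricks.spark.avro"))
  .reduce(_ unionAll _)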
Dataframe Partitioning
Hi. I have 2 DataFrames with 1 and 12 partitions respectively. When I do an inner join between these DataFrames, the result contains 200 partitions. Why?

df1.join(df2, df1("id") === df2("id"), "Inner") => returns 200 partitions

Thanks!!! -- Regards. Miguel Ángel
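The 200 comes from the default of spark.sql.shuffle.partitions, which sets the post-shuffle partition count for joins and aggregations regardless of the inputs. A sketch of adjusting it (standard setting; 24 is an arbitrary value):

sqlContext.setConf("spark.sql.shuffle.partitions", "24")
val joined = df1.join(df2, df1("id") === df2("id"), "inner")
println(joined.rdd.partitions.length) // now 24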
Re: Adding columns to DataFrame
Hi. I think that it's possible to do:

df.select($"*", lit(null).as("col17"), lit(null).as("col18"), lit(null).as("col19"), ..., lit(null).as("col26"))

Any other advice? Miguel. On Wed, May 27, 2015 at 5:02 PM, Masf wrote: > Hi. > > I have a DataFrame with 16 columns (df1) and another with 26 columns (df2). > I want to do a UnionAll. So, I want to add 10 columns to df1 in order to > have the same number of columns in both dataframes. > > Is there some alternative to "withColumn"? > > Thanks > > -- > Regards. > Miguel Ángel > -- Regards. Miguel Ángel
Adding columns to DataFrame
Hi. I have a DataFrame with 16 columns (df1) and another with 26 columns (df2). I want to do a UnionAll, so I want to add 10 columns to df1 in order to have the same number of columns in both DataFrames. Is there some alternative to "withColumn"? Thanks -- Regards. Miguel Ángel
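A sketch of doing the padding programmatically instead of writing out ten literals, assuming that columns with matching names should line up (the string cast is a stand-in for whatever types df2 actually has):

import org.apache.spark.sql.functions.{col, lit}

val missing = df2.columns.filterNot(df1.columns.contains)
val padded = df1.select(
  df1.columns.map(col) ++
    missing.map(c => lit(null).cast("string").as(c)): _*  // cast to df2's real types
)
// unionAll matches columns by position, so align the order first:
val unioned = padded.select(df2.columns.map(col): _*).unionAll(df2)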
Re: DataFrame. Conditional aggregation
Yes, I think that your SQL solution will work, but I was looking for a solution with the DataFrame API. I had thought to use a UDF such as:

val myFunc = udf { (input: Int) => if (input > 100) 1 else 0 }

Although I'd like to know if it's possible to do it directly in the aggregation, inserting a lambda function or something else. Thanks. Regards. Miguel. On Wed, May 27, 2015 at 1:06 AM, ayan guha wrote: > For this, I can give you a SQL solution: > > joinedData.registerTempTable("j") > > Res = ssc.sql("select col1, col2, count(1) counter, min(col3) minimum, > sum(case when endrscp > 100 then 1 else 0 end) test from j group by col1, col2") > > Let me know if this works. > On 26 May 2015 23:47, "Masf" wrote: > >> Hi >> I don't know how it works. For example: >> >> val result = joinedData.groupBy("col1","col2").agg( >> count(lit(1)).as("counter"), >> min(col3).as("minimum"), >> sum("case when endrscp > 100 then 1 else 0 end").as("test") >> ) >> >> How can I do it? >> >> Thanks >> Regards. >> Miguel. >> >> On Tue, May 26, 2015 at 12:35 AM, ayan guha wrote: >> >>> Case when col2 > 100 then 1 else col2 end >>> On 26 May 2015 00:25, "Masf" wrote: >>> >>>> Hi. >>>> >>>> In a DataFrame, how can I execute a conditional expression inside an >>>> aggregation? For example, can I translate this SQL statement to the DataFrame API?: >>>> >>>> SELECT name, SUM(IF(table.col2 > 100, 1, table.col1)) >>>> FROM table >>>> GROUP BY name >>>> >>>> Thanks >>>> -- >>>> Regards. >>>> Miguel >>> >> >> -- >> Regards. >> Miguel Ángel >> > -- Regards. Miguel Ángel
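For the record, the DataFrame API later gained exactly this in org.apache.spark.sql.functions: when/otherwise (added in Spark 1.4, so hedged for this 1.3-era thread). With the column names from the example above:

import org.apache.spark.sql.functions.{col, count, lit, min, sum, when}

val result = joinedData.groupBy("col1", "col2").agg(
  count(lit(1)).as("counter"),
  min(col("col3")).as("minimum"),
  sum(when(col("endrscp") > 100, 1).otherwise(0)).as("test")
)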
Re: DataFrame. Conditional aggregation
Hi, I don't know how it works. For example:

val result = joinedData.groupBy("col1","col2").agg(
  count(lit(1)).as("counter"),
  min(col3).as("minimum"),
  sum("case when endrscp > 100 then 1 else 0 end").as("test")
)

How can I do it? Thanks. Regards. Miguel. On Tue, May 26, 2015 at 12:35 AM, ayan guha wrote: > Case when col2 > 100 then 1 else col2 end > On 26 May 2015 00:25, "Masf" wrote: > >> Hi. >> >> In a DataFrame, how can I execute a conditional expression inside an >> aggregation? For example, can I translate this SQL statement to the DataFrame API?: >> >> SELECT name, SUM(IF(table.col2 > 100, 1, table.col1)) >> FROM table >> GROUP BY name >> >> Thanks >> -- >> Regards. >> Miguel >> > -- Regards. Miguel Ángel
DataFrame. Conditional aggregation
Hi. In a DataFrame, how can I execute a conditional expression inside an aggregation? For example, can I translate this SQL statement to the DataFrame API?:

SELECT name, SUM(IF(table.col2 > 100, 1, table.col1))
FROM table
GROUP BY name

Thanks -- Regards. Miguel
Inserting Nulls
Hi. I have a Spark application where I store the results into a table (with HiveContext). Some of these columns allow nulls; in Scala, those columns are represented through Option[Int] or Option[Double], depending on the data type. For example:

val hc = new HiveContext(sc)
var col1: Option[Int] = None
...
val myRow = org.apache.spark.sql.Row(col1, ...)
val mySchema = StructType(Array(StructField("Column1", IntegerType, true)))
val TableOutputSchemaRDD = hc.applySchema(myRow, mySchema)
hc.registerRDDAsTable(TableOutputSchemaRDD, "result_intermediate")
hc.sql("CREATE TABLE table_output STORED AS PARQUET AS SELECT * FROM result_intermediate")

This produces:

java.lang.ClassCastException: scala.Some cannot be cast to java.lang.Integer
at org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaIntObjectInspector.get(JavaIntObjectInspector.java:40)
at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createPrimitive(ParquetHiveSerDe.java:247)
at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createObject(ParquetHiveSerDe.java:301)
at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createStruct(ParquetHiveSerDe.java:178)
at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.serialize(ParquetHiveSerDe.java:164)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:123)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:114)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:114)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Thanks! -- Regards. Miguel Ángel
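A sketch of the usual fix: Row fields must hold the raw value or null, never a Scala Option, which is why Hive's inspector chokes on scala.Some. For Option[Int] the value also has to be boxed first, since Option[Int].orNull does not compile (Int is not nullable):

val col1: Option[Int] = None
// Box the primitive, then unwrap Some/None into value-or-null:
val myRow = org.apache.spark.sql.Row(col1.map(Int.box).orNull)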
Re: Parquet number of partitions
Hi Eric. Q1: When I read parquet files, I've observed that Spark generates as many partitions as there are parquet files in the path. Q2: To reduce the number of partitions you can use rdd.repartition(x), where x is the number of partitions you want. Depending on your case, repartition can be a heavy task. Regards. Miguel. On Tue, May 5, 2015 at 3:56 PM, Eric Eijkelenboom < eric.eijkelenb...@gmail.com> wrote: > Hello guys > > Q1: How does Spark determine the number of partitions when reading a > Parquet file? > > val df = sqlContext.parquetFile(path) > > Is it some way related to the number of Parquet row groups in my input? > > Q2: How can I reduce this number of partitions? Doing this: > > df.rdd.coalesce(200).count > > from the spark-shell causes job execution to hang… > > Any ideas? Thank you in advance. > > Eric > -- Regards. Miguel Ángel
Re: Opening many Parquet files = slow
Hi guys. Regarding Parquet files: I have Spark 1.2.0 and reading 27 parquet files (250MB/file) takes 4 minutes. I have a cluster with 4 nodes and that seems too slow to me. The "load" function is not available in Spark 1.2, so I can't test it. Regards. Miguel. On Mon, Apr 13, 2015 at 8:12 PM, Eric Eijkelenboom < eric.eijkelenb...@gmail.com> wrote: > Hi guys > > Does anyone know how to stop Spark from opening all Parquet files before > starting a job? This is quite a show stopper for me, since I have 5000 > Parquet files on S3. > > Recap of what I tried: > > 1. Disable schema merging with: sqlContext.load(“parquet", > Map("mergeSchema" -> "false”, "path" -> “s3://path/to/folder")) > This opens most files in the folder (17 out of 21 in my small > example). For 5000 files on S3, sqlContext.load() takes 30 minutes to > complete. > > 2. Use the old api with: > sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false”) > Now sqlContext.parquetFile() only opens a few files and prints the > schema: so far so good! However, as soon as I run e.g. a count() on the > dataframe, Spark still opens all files _before_ starting a job/stage. > Effectively this moves the delay from load() to count() (or any other > action I presume). > > 3. Run Spark 1.3.1-rc2. > sqlContext.load() took about 30 minutes for 5000 Parquet files on S3, > the same as 1.3.0. > > Any help would be greatly appreciated! > > Thanks a lot. > > Eric > > > > > On 10 Apr 2015, at 16:46, Eric Eijkelenboom > wrote: > > Hi Ted > > Ah, I guess the term ‘source’ confused me :) > > Doing: > > sqlContext.load(“parquet", Map("mergeSchema" -> "false”, "path" -> “path > to a single day of logs")) > > for 1 directory with 21 files, Spark opens 17 files: > > 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening ' > s3n://mylogs/logs/=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-72' > for reading > 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening key > 'logs/=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-72' for > reading at position '261573524' > 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening ' > s3n://mylogs/logs/=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-74' > for reading > 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening ' > s3n://mylogs/logs/=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-77' > for reading > 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening ' > s3n://mylogs/logs/=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-62' > for reading > 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening key > 'logs/=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-74' for > reading at position '259256807' > 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening key > 'logs/=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-77' for > reading at position '260002042' > 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening key > 'logs/=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-62' for > reading at position ‘260875275' > etc. > > I can’t seem to pass a comma-separated list of directories to load(), so > in order to load multiple days of logs, I have to point to the root folder > and depend on auto-partition discovery (unless there’s a smarter way). > > Doing: > > sqlContext.load(“parquet", Map("mergeSchema" -> "false”, "path" -> “path > to root log dir")) > > starts opening what seems like all files (I killed the process after a > couple of minutes). > > Thanks for helping out. > Eric > > -- Regards. Miguel Ángel
Re: Increase partitions reading Parquet File
Hi. It doesn't work:

val file = sqlContext.parquetFile("hdfs://node1/user/hive/warehouse/file.parquet")
file.repartition(127)
println(file.partitions.size.toString()) <-- Returns 27!

Regards. On Fri, Apr 10, 2015 at 4:50 PM, Felix C wrote: > RDD.repartition(1000)? > > --- Original Message --- > > From: "Masf" > Sent: April 9, 2015 11:45 PM > To: user@spark.apache.org > Subject: Increase partitions reading Parquet File > > Hi > > I have this statement: > > val file = > sqlContext.parquetFile("hdfs://node1/user/hive/warehouse/file.parquet") > > This code generates as many partitions as there are files. So, I want to > increase the number of partitions. > I've tested coalesce (file.coalesce(100)) but the number of partitions > doesn't change. > > How can I increase the number of partitions? > > Thanks > > -- > Regards. > Miguel Ángel > -- Regards. Miguel Ángel
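Worth noting, as a sketch against the 1.2 SchemaRDD/RDD API: repartition returns a new RDD rather than modifying the one it is called on, and coalesce can only shrink the partition count unless shuffle = true — which would explain both results above:

val file = sqlContext.parquetFile("hdfs://node1/user/hive/warehouse/file.parquet")
val repartitioned = file.repartition(127)      // capture the result
println(repartitioned.partitions.size)         // 127
val grown = file.coalesce(100, shuffle = true) // coalesce needs shuffle=true to grow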
Spark SQL. Memory consumption
Hi. I'm using Spark SQL 1.2. I have this query:

CREATE TABLE test_MA STORED AS PARQUET AS
SELECT
  field1, field2, field3, field4, field5,
  COUNT(1) AS field6,
  MAX(field7),
  MIN(field8),
  SUM(field9 / 100),
  COUNT(field10),
  SUM(IF(field11 < -500, 1, 0)),
  MAX(field12),
  SUM(IF(field13 = 1, 1, 0)),
  SUM(IF(field13 IN (3,4,5,6,10,104,105,107), 1, 0)),
  SUM(IF(field13 = 2012, 1, 0)),
  SUM(IF(field13 IN (0,100,101,102,103,106), 1, 0))
FROM table1 CL
JOIN table2 netw ON CL.field15 = netw.id
WHERE field3 IS NOT NULL
  AND field4 IS NOT NULL
  AND field5 IS NOT NULL
GROUP BY field1, field2, field3, field4, netw.field5

spark-submit --master spark://master:7077 --driver-memory 20g --executor-memory 60g --class "GMain" project_2.10-1.0.jar --driver-class-path '/opt/cloudera/parcels/CDH/lib/hive/lib/*' --driver-java-options '-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hive/lib/*' 2> ./error

The input data is 8GB in Parquet format. It often crashes with GC overhead errors. I've set spark.sql.shuffle.partitions to 1024, but my worker nodes (128GB RAM per node) collapse. Is this query too demanding for Spark SQL? Would it be better to do it with the core Spark API? Am I doing something wrong? Thanks -- Regards. Miguel Ángel
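For reference, a sketch of the settings usually tried first for GC-overhead failures on 1.2 (these are standard 1.x property names, but the values are guesses that need tuning per cluster):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf()
  .setAppName("GMain")
  .set("spark.storage.memoryFraction", "0.3") // shrink the cache share, leaving heap for aggregation
  .set("spark.shuffle.memoryFraction", "0.5") // more room for shuffle buffers
val sc = new SparkContext(conf)
val hc = new HiveContext(sc)
hc.setConf("spark.sql.shuffle.partitions", "1024")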
Re: Error reading smallint in hive table with parquet format
No, my company is using the Cloudera distribution, and 1.2.0 is the latest version of Spark available there. Thanks. On Wed, Apr 1, 2015 at 8:08 PM, Michael Armbrust wrote:
> Can you try with Spark 1.3? Much of this code path has been rewritten / improved in this version.
>
> On Wed, Apr 1, 2015 at 7:53 AM, Masf wrote:
>
>> Hi.
>>
>> In Spark SQL 1.2.0, with HiveContext, I'm executing the following statement:
>>
>> CREATE TABLE testTable STORED AS PARQUET AS
>> SELECT field1
>> FROM table1
>>
>> field1 is a SMALLINT. If table1 is in text format everything is OK, but if table1 is in parquet format, Spark returns the following error:
>>
>> 15/04/01 16:48:24 ERROR TaskSetManager: Task 26 in stage 1.0 failed 1 times; aborting job
>> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 1.0 failed 1 times, most recent failure: Lost task 26.0 in stage 1.0 (TID 28, localhost): java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Short
>> at org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaShortObjectInspector.get(JavaShortObjectInspector.java:41)
>> at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createPrimitive(ParquetHiveSerDe.java:251)
>> at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createObject(ParquetHiveSerDe.java:301)
>> at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createStruct(ParquetHiveSerDe.java:178)
>> at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.serialize(ParquetHiveSerDe.java:164)
>> at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:123)
>> at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:114)
>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:114)
>> at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
>> at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>> at org.apache.spark.scheduler.Task.run(Task.scala:56)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Driver stacktrace:
>> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
>> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
>> at scala.Option.foreach(Option.scala:236)
>> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
>> at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>
>> Thanks!
>> --
>> Regards.
>> Miguel Ángel
-- Regards. Miguel Ángel
Error reading smallint in hive table with parquet format
Hi. In Spark SQL 1.2.0, with HiveContext, I'm executing the following statement:

CREATE TABLE testTable STORED AS PARQUET AS
SELECT field1
FROM table1

field1 is a SMALLINT. If table1 is in text format everything is OK, but if table1 is in parquet format, Spark returns the following error:

15/04/01 16:48:24 ERROR TaskSetManager: Task 26 in stage 1.0 failed 1 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 1.0 failed 1 times, most recent failure: Lost task 26.0 in stage 1.0 (TID 28, localhost): java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Short
at org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaShortObjectInspector.get(JavaShortObjectInspector.java:41)
at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createPrimitive(ParquetHiveSerDe.java:251)
at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createObject(ParquetHiveSerDe.java:301)
at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createStruct(ParquetHiveSerDe.java:178)
at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.serialize(ParquetHiveSerDe.java:164)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:123)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:114)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:114)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Thanks! -- Regards. Miguel Ángel
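A hedged workaround sketch in the meantime: cast the SMALLINT away before the Parquet writer sees it, so Hive's ShortObjectInspector never receives the int32 that Parquet hands back (untested, through HiveContext):

hiveContext.sql(
  """CREATE TABLE testTable STORED AS PARQUET AS
    |SELECT CAST(field1 AS INT) AS field1
    |FROM table1""".stripMargin)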
Re: Error in Delete Table
Hi Ted. Spark 1.2.0 and Hive 0.13.1. Regards. Miguel Angel. On Tue, Mar 31, 2015 at 10:37 AM, Ted Yu wrote:
> Which Spark and Hive release are you using ?
>
> Thanks
>
> > On Mar 27, 2015, at 2:45 AM, Masf wrote:
> >
> > Hi.
> >
> > In HiveContext, when I execute the statement "DROP TABLE IF EXISTS TestTable" and TestTable doesn't exist, Spark returns an error:
> >
> > ERROR Hive: NoSuchObjectException(message:default.TestTable table not found)
> > at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29338)
> > at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29306)
> > at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:29237)
> > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
> > at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:1036)
> > at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:1022)
> > at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1008)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:606)
> > at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:90)
> > at com.sun.proxy.$Proxy22.getTable(Unknown Source)
> > at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1000)
> > at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:942)
> > at org.apache.hadoop.hive.ql.exec.DDLTask.dropTableOrPartitions(DDLTask.java:3887)
> > at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:310)
> > at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
> > at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
> > at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1554)
> > at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1321)
> > at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1139)
> > at org.apache.hadoop.hive.ql.Driver.run(Driver.java:962)
> > at org.apache.hadoop.hive.ql.Driver.run(Driver.java:952)
> > at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:305)
> > at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
> > at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58)
> > at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56)
> > at org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
> > at org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51)
> > at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
> > at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
> > at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
> > at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108)
> > at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94)
> > at GeoMain$$anonfun$HiveExecution$1.apply(GeoMain.scala:98)
> > at GeoMain$$anonfun$HiveExecution$1.apply(GeoMain.scala:98)
> > at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> > at GeoMain$.HiveExecution(GeoMain.scala:96)
> > at GeoMain$.main(GeoMain.scala:17)
> > at GeoMain.main(GeoMain.scala)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:606)
> > at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
> > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
> > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> >
> > Thanks!!
> > --
> > Regards.
> > Miguel Ángel
-- Regards. Miguel Ángel
Re: Too many open files
I'm executing my application in local mode (with --master local[*]). I'm using Ubuntu and I've put "session required pam_limits.so" into /etc/pam.d/common-session, but it doesn't work. On Mon, Mar 30, 2015 at 4:08 PM, Ted Yu wrote:
> bq. In /etc/security/limits.conf set the next values:
>
> Have you done the above modification on all the machines in your Spark cluster ?
>
> If you use Ubuntu, be sure that the /etc/pam.d/common-session file contains the following line:
>
> session required pam_limits.so
>
> On Mon, Mar 30, 2015 at 5:08 AM, Masf wrote:
>
>> Hi.
>>
>> I've done a relogin; in fact, I ran 'ulimit -n' and it returns 100, but it crashes.
>> I'm doing reduceByKey and SparkSQL mixed over 17 files (250MB-500MB/file)
>>
>> Regards.
>> Miguel Angel.
>>
>> On Mon, Mar 30, 2015 at 1:52 PM, Akhil Das wrote:
>>
>>> Mostly, you will have to restart the machines to get the ulimit effect (or relogin). What operation are you doing? Are you doing too many repartitions?
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Mon, Mar 30, 2015 at 4:52 PM, Masf wrote:
>>>
>>>> Hi
>>>>
>>>> I have a problem with temp data in Spark. I have set spark.shuffle.manager to "SORT". In /etc/security/limits.conf I set the next values:
>>>> * soft nofile 100
>>>> * hard nofile 100
>>>> In spark-env.sh I set ulimit -n 100
>>>> I've restarted the Spark service and it continues crashing (Too many open files)
>>>>
>>>> How can I resolve it? I'm executing Spark 1.2.0 in Cloudera 5.3.2
>>>>
>>>> java.io.FileNotFoundException: /tmp/spark-local-20150330115312-37a7/2f/temp_shuffle_c4ba5bce-c516-4a2a-9e40-56121eb84a8c (Too many open files)
>>>> at java.io.FileOutputStream.open(Native Method)
>>>> at java.io.FileOutputStream.(FileOutputStream.java:221)
>>>> at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
>>>> at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:360)
>>>> at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:355)
>>>> at scala.Array$.fill(Array.scala:267)
>>>>
Re: Too many open files
Hi. I've done a relogin; in fact, I ran 'ulimit -n' and it returns 100, but it crashes. I'm doing reduceByKey and SparkSQL mixed over 17 files (250MB-500MB/file). Regards. Miguel Angel. On Mon, Mar 30, 2015 at 1:52 PM, Akhil Das wrote:
> Mostly, you will have to restart the machines to get the ulimit effect (or relogin). What operation are you doing? Are you doing too many repartitions?
>
> Thanks
> Best Regards
>
> On Mon, Mar 30, 2015 at 4:52 PM, Masf wrote:
>
>> Hi
>>
>> I have a problem with temp data in Spark. I have set spark.shuffle.manager to "SORT". In /etc/security/limits.conf I set the next values:
>> * soft nofile 100
>> * hard nofile 100
>> In spark-env.sh I set ulimit -n 100
>> I've restarted the Spark service and it continues crashing (Too many open files)
>>
>> How can I resolve it? I'm executing Spark 1.2.0 in Cloudera 5.3.2
>>
>> java.io.FileNotFoundException: /tmp/spark-local-20150330115312-37a7/2f/temp_shuffle_c4ba5bce-c516-4a2a-9e40-56121eb84a8c (Too many open files)
>> at java.io.FileOutputStream.open(Native Method)
>> at java.io.FileOutputStream.(FileOutputStream.java:221)
>> at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
>> at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:360)
>> at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:355)
>> at scala.Array$.fill(Array.scala:267)
>> at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:355)
>> at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
>> at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:56)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>> 15/03/30 11:54:18 WARN TaskSetManager: Lost task 22.0 in stage 3.0 (TID 27, localhost): java.io.FileNotFoundException: /tmp/spark-local-20150330115312-37a7/2f/temp_shuffle_c4ba5bce-c516-4a2a-9e40-56121eb84a8c (Too many open files)
>> at java.io.FileOutputStream.open(Native Method)
>> at java.io.FileOutputStream.(FileOutputStream.java:221)
>> at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
>> at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:360)
>> at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:355)
>> at scala.Array$.fill(Array.scala:267)
>> at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:355)
>> at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
>> at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:56)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Thanks!
>> --
>> Regards.
>> Miguel Ángel
-- Regards. Miguel Ángel
Too many open files
Hi. I have a problem with temp data in Spark. I have set spark.shuffle.manager to "SORT". In /etc/security/limits.conf I set the next values:

* soft nofile 100
* hard nofile 100

In spark-env.sh I set ulimit -n 100. I've restarted the Spark service and it continues crashing (Too many open files). How can I resolve it? I'm executing Spark 1.2.0 in Cloudera 5.3.2.

java.io.FileNotFoundException: /tmp/spark-local-20150330115312-37a7/2f/temp_shuffle_c4ba5bce-c516-4a2a-9e40-56121eb84a8c (Too many open files)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.(FileOutputStream.java:221)
at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:360)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:355)
at scala.Array$.fill(Array.scala:267)
at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:355)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/03/30 11:54:18 WARN TaskSetManager: Lost task 22.0 in stage 3.0 (TID 27, localhost): java.io.FileNotFoundException: /tmp/spark-local-20150330115312-37a7/2f/temp_shuffle_c4ba5bce-c516-4a2a-9e40-56121eb84a8c (Too many open files)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.(FileOutputStream.java:221)
at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:360)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:355)
at scala.Array$.fill(Array.scala:267)
at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:355)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Thanks! -- Regards. Miguel Ángel
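A quick way to confirm what limit the JVM actually inherited (a sketch using the HotSpot UnixOperatingSystemMXBean, available on Linux JVMs) — useful because limits.conf changes only apply to sessions opened after the edit:

import java.lang.management.ManagementFactory
import com.sun.management.UnixOperatingSystemMXBean

val os = ManagementFactory.getOperatingSystemMXBean
  .asInstanceOf[UnixOperatingSystemMXBean]
println(s"open/max file descriptors: ${os.getOpenFileDescriptorCount}/${os.getMaxFileDescriptorCount}")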
Error in Delete Table
Hi. In HiveContext, when I execute the statement "DROP TABLE IF EXISTS TestTable" and TestTable doesn't exist, Spark returns an error:

ERROR Hive: NoSuchObjectException(message:default.TestTable table not found)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29338)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29306)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:29237)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:1036)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:1022)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1008)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:90)
at com.sun.proxy.$Proxy22.getTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1000)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:942)
at org.apache.hadoop.hive.ql.exec.DDLTask.dropTableOrPartitions(DDLTask.java:3887)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:310)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1554)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1321)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1139)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:962)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:952)
at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:305)
at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
at org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94)
at GeoMain$$anonfun$HiveExecution$1.apply(GeoMain.scala:98)
at GeoMain$$anonfun$HiveExecution$1.apply(GeoMain.scala:98)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at GeoMain$.HiveExecution(GeoMain.scala:96)
at GeoMain$.main(GeoMain.scala:17)
at GeoMain.main(GeoMain.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Thanks!! -- Regards. Miguel Ángel
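A possible guard, sketched for the 1.2-era HiveContext (the column layout returned by SHOW TABLES is an assumption worth checking): test for the table yourself before dropping, so the Hive client never hits the missing-table path:

val exists = hiveContext.sql("SHOW TABLES").collect()
  .exists(_.getString(0).equalsIgnoreCase("TestTable")) // assumes table name is column 0
if (exists) hiveContext.sql("DROP TABLE TestTable")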
Re: Windowing and Analytics Functions in Spark SQL
OK, thanks. Is there some web resource where I could check the functionality supported by Spark SQL? Thanks!!! Regards. Miguel Ángel. On Thu, Mar 26, 2015 at 12:31 PM, Cheng Lian wrote: > We're working together with AsiaInfo on this. Possibly will deliver an > initial version of window function support in 1.4.0. But it's not a promise > yet. > > Cheng > > On 3/26/15 7:27 PM, Arush Kharbanda wrote: > > It's not yet implemented. > > https://issues.apache.org/jira/browse/SPARK-1442 > > On Thu, Mar 26, 2015 at 4:39 PM, Masf wrote: > >> Hi. >> >> Are the windowing and analytics functions supported in Spark SQL (with >> HiveContext or not)? For example, in Hive they are supported: >> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics >> >> Is there some tutorial or documentation where I can see all the features >> supported by Spark SQL? >> >> Thanks!!! >> -- >> Regards. >> Miguel Ángel >> > > -- > Arush Kharbanda || Technical Teamlead > ar...@sigmoidanalytics.com || www.sigmoidanalytics.com > -- Regards. Miguel Ángel
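For reference, once that window support ships, the HiveQL form would look like the sketch below (the table and column names are invented; the syntax is standard Hive windowing, not anything Spark-specific yet):

hiveContext.sql(
  """SELECT name, value,
    |       ROW_NUMBER() OVER (PARTITION BY name ORDER BY value DESC) AS rn,
    |       SUM(value) OVER (PARTITION BY name) AS total
    |FROM my_table""".stripMargin)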
Windowing and Analytics Functions in Spark SQL
Hi. Are the windowing and analytics functions supported in Spark SQL (with HiveContext or not)? For example, in Hive they are supported: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics Is there some tutorial or documentation where I can see all the features supported by Spark SQL? Thanks!!! -- Regards. Miguel Ángel
Re: Issues with SBT and Spark
Hi. Spark 1.2.1 uses Scala 2.10; because of this, your program fails with Scala 2.11. Regards. On Thu, Mar 19, 2015 at 8:17 PM, Vijayasarathy Kannan wrote: > My current simple.sbt is > > name := "SparkEpiFast" > > version := "1.0" > > scalaVersion := "2.11.4" > > libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.2.1" % > "provided" > > libraryDependencies += "org.apache.spark" % "spark-graphx_2.11" % "1.2.1" > % "provided" > > While I do "sbt package", it compiles successfully. But while running the > application, I get > "Exception in thread "main" java.lang.NoSuchMethodError: > scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;" > > However, changing the Scala version to 2.10.4 and updating the dependency > lines appropriately resolves the issue (no exception). > > Could anyone please point out what I am doing wrong? > -- Regards. Miguel Ángel
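A small guard against this class of mismatch, offered as a sketch: let sbt pick the artifact for the configured Scala version with %% instead of hard-coding the _2.11 suffix, and keep scalaVersion aligned with the Scala build of the Spark you run on:

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "1.2.1" % "provided",
  "org.apache.spark" %% "spark-graphx" % "1.2.1" % "provided"
)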
Re: Spark SQL. Cast to Bigint
Hi Yin. With HiveContext it works well. Thanks!!! Regards. Miguel Angel. On Fri, Mar 13, 2015 at 3:18 PM, Yin Huai wrote: > Are you using SQLContext? Right now, the parser in the SQLContext is quite > limited on the data type keywords that it handles (see here > <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala#L391>) > and unfortunately "bigint" is not handled in it right now. We will add > other data types in there ( > https://issues.apache.org/jira/browse/SPARK-6146 is used to track it). > Can you try HiveContext for now? > > On Fri, Mar 13, 2015 at 4:48 AM, Masf wrote: > >> Hi. >> >> I have a query in Spark SQL and I cannot convert a value to BIGINT: >> CAST(column AS BIGINT) or >> CAST(0 AS BIGINT) >> >> The output is: >> Exception in thread "main" java.lang.RuntimeException: [34.62] failure: >> ``DECIMAL'' expected but identifier BIGINT found >> >> Thanks!! >> Regards. >> Miguel Ángel >> > > -- Regards. Miguel Ángel
Hive error on partitioned tables
Hi. I'm running Spark 1.2.0. I have a HiveContext and I execute the following query:

select sum(field1 / 100) from table1 group by field2;

field1 in the Hive metastore is a smallint. The schema detected by HiveContext is an int32:

fileSchema: message schema { optional int32 field1; }

If table1 is an unpartitioned table it works well; however, if table1 is a partitioned table it crashes in spark-submit. The error is the following:

java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Short
at scala.runtime.BoxesRunTime.unboxToShort(BoxesRunTime.java:102)
at scala.math.Numeric$ShortIsIntegral$.toInt(Numeric.scala:72)
at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$6.apply(Cast.scala:234)
at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$6.apply(Cast.scala:234)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:366)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:365)
at org.apache.spark.sql.catalyst.expressions.Expression.f1(Expression.scala:162)
at org.apache.spark.sql.catalyst.expressions.Divide.eval(arithmetic.scala:115)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:365)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:365)
at org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:109)
at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:90)
at org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:50)
at org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:72)
at org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:526)
at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:167)
at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/03/17 10:42:51 ERROR Executor: Exception in task 1.0 in stage 3.0 (TID 5)
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Short
at scala.runtime.BoxesRunTime.unboxToShort(BoxesRunTime.java:102)
at scala.math.Numeric$ShortIsIntegral$.toInt(Numeric.scala:72)
at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$6.apply(Cast.scala:234)
at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$6.apply(Cast.scala:234)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:366)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:365)
at org.apache.spark.sql.catalyst.expressions.Expression.f1(Expression.scala:162)
at org.apache.spark.sql.catalyst.expressions.Divide.eval(arithmetic.scala:115)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:365)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:365)
at org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:109)
at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:90)
at org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:50)
at org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:72)
at org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:526)
at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:167)
at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.Shuff
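A hedged workaround sketch while the partitioned-read bug stands: do the widening explicitly in the query, so the reader never has to unbox a Short (untested, same query shape as above):

hiveContext.sql(
  "SELECT SUM(CAST(field1 AS INT) / 100) FROM table1 GROUP BY field2")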
Re: Parquet and repartition
Thanks Sean, I forgot that. The output error is the following:

java.lang.ClassCastException: scala.math.BigDecimal cannot be cast to org.apache.spark.sql.catalyst.types.decimal.Decimal
at org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:359)
at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:328)
at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:314)
at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:115)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:308)
at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/03/16 11:30:11 ERROR Executor: Exception in task 1.0 in stage 6.0 (TID 207)
java.lang.ClassCastException: scala.math.BigDecimal cannot be cast to org.apache.spark.sql.catalyst.types.decimal.Decimal
at org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:359)
at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:328)
at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:314)
at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:115)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:308)
at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/03/16 11:30:11 INFO TaskSetManager: Starting task 2.0 in stage 6.0 (TID 208, localhost, ANY, 2878 bytes)
15/03/16 11:30:11 WARN TaskSetManager: Lost task 0.0 in stage 6.0 (TID 206, localhost): java.lang.ClassCastException: scala.math.BigDecimal cannot be cast to org.apache.spark.sql.catalyst.types.decimal.Decimal
at org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:359)
at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:328)
at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:314)
at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:115)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:308)
at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

On Mon, Mar 16, 2015 at 12:19 PM, Sean Owen wrote:
> You forgot to give any information about what "fail" means here.
>
> On Mon, Mar 16, 2015 at 11:11 AM, Masf wrote:
> > Hi all.
> >
> > When I specify the number of partitions and save this RDD in parquet &
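One workaround sketch against the 1.2-era catalyst types (the import path is taken from the stack trace above): wrap each scala.math.BigDecimal in Spark's Decimal before the rows reach the Parquet writer. The column index and rebuild below are hypothetical — adapt to the real schema:

import org.apache.spark.sql.catalyst.types.decimal.Decimal

val fixedRows = selectTest.map { row =>
  // hypothetical: column 0 is the BigDecimal field
  val bd = row(0).asInstanceOf[scala.math.BigDecimal]
  org.apache.spark.sql.Row(Decimal(bd))
}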
Parquet and repartition
Hi all. When I specify the number of partitions and save this RDD in Parquet format, my app fails. For example:

selectTest.coalesce(28).saveAsParquetFile("hdfs://vm-clusterOutput")

However, it works well if I store the data as text:

selectTest.coalesce(28).saveAsTextFile("hdfs://vm-clusterOutput")

My Spark version is 1.2.1. Is this bug registered? -- Regards. Miguel Ángel
Spark SQL. Cast to Bigint
Hi. I have a query in Spark SQL and I cannot convert a value to BIGINT:

CAST(column AS BIGINT) or
CAST(0 AS BIGINT)

The output is: Exception in thread "main" java.lang.RuntimeException: [34.62] failure: ``DECIMAL'' expected but identifier BIGINT found

Thanks!! Regards. Miguel Ángel
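A minimal sketch of the switch that resolves it (per the reply above, the SQLContext parser lacks the bigint keyword while the Hive parser handles it; table and column names here are placeholders):

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
hc.sql("SELECT CAST(some_column AS BIGINT) FROM some_table")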
Re: Read parquet folders recursively
Hi. Thanks for your answers, but to read parquet files it is necessary to use the parquetFile method in org.apache.spark.sql.SQLContext, isn't it? How can I combine your solution with a call to this method? Thanks!! Regards. On Thu, Mar 12, 2015 at 8:34 AM, Yijie Shen wrote: > org.apache.spark.deploy.SparkHadoopUtil has a method: > > /** > * Get [[FileStatus]] objects for all leaf children (files) under the > given base path. If the > * given path points to a file, return a single-element collection > containing [[FileStatus]] of > * that file. > */ > def listLeafStatuses(fs: FileSystem, basePath: Path): Seq[FileStatus] = { > def recurse(path: Path) = { > val (directories, leaves) = fs.listStatus(path).partition(_.isDir) > leaves ++ directories.flatMap(f => listLeafStatuses(fs, f.getPath)) > } > > val baseStatus = fs.getFileStatus(basePath) > if (baseStatus.isDir) recurse(basePath) else Array(baseStatus) > } > > — > Best Regards! > Yijie Shen > > On March 12, 2015 at 2:35:49 PM, Akhil Das (ak...@sigmoidanalytics.com) > wrote: > > Hi > > We have a custom build to read directories recursively. Currently we use > it with fileStream like: > > val lines = ssc.fileStream[LongWritable, Text, > TextInputFormat]("/datadumps/", > (t: Path) => true, true, true) > > > Making the 4th argument true to read recursively. > > > You could give it a try > https://s3.amazonaws.com/sigmoidanalytics-builds/spark-1.2.0-bin-spark-1.2.0-hadoop2.4.0.tgz > > Thanks > Best Regards > > On Wed, Mar 11, 2015 at 9:45 PM, Masf wrote: >> Hi all >> >> Is it possible to read folders recursively to load parquet files? >> >> >> Thanks. >> >> -- >> Regards. >> Miguel Ángel >> > > -- Regards. Miguel Ángel
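One way to combine the two, as a sketch (the recursion mirrors the helper quoted above; the varargs parquetFile call assumes Spark 1.3 — on 1.2, load each path and union instead): collect the leaf file paths first, then hand them to parquetFile:

import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

def listLeafStatuses(fs: FileSystem, basePath: Path): Seq[FileStatus] = {
  def recurse(path: Path): Seq[FileStatus] = {
    val (dirs, leaves) = fs.listStatus(path).partition(_.isDir)
    leaves ++ dirs.flatMap(d => recurse(d.getPath))
  }
  val base = fs.getFileStatus(basePath)
  if (base.isDir) recurse(basePath) else Seq(base)
}

val fs = FileSystem.get(sc.hadoopConfiguration)
val paths = listLeafStatuses(fs, new Path("/data/root"))   // assumed root folder
  .map(_.getPath.toString)
  .filter(_.endsWith(".parquet"))
val df = sqlContext.parquetFile(paths: _*)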
Read parquet folders recursively
Hi all. Is it possible to read folders recursively to load parquet files? Thanks. -- Regards. Miguel Ángel