How to read large size files from a directory ?

2017-05-09 Thread ashwini anand
I posted this question yesterday, but the formatting was very bad, so I am
posting it again. My question is below:

I am reading a directory of files using wholeTextFiles. After that I am
calling a function on each element of the RDD using map. The whole program
uses just the first 50 lines of each file. Please find the code at the link
below:

https://gist.github.com/ashwini-anand/0e468da9b4ab7863dff14833d34de79e

The files in the directory can be very large in my case, so using the
wholeTextFiles API is inefficient here: wholeTextFiles loads the full file
content into memory. Can we make wholeTextFiles load only the first 50 lines
of each file? Apart from wholeTextFiles, the other solution I can think of is
iterating over each file of the directory one by one, but that also seems
inefficient. I am new to Spark. Please let me know if there is an efficient
way to do this.
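
One alternative, sketched below in Scala (the question's code is Python, but
the same pattern applies): parallelize the list of file paths and let each
task open its own file and read only the first 50 lines, so a file's full
contents are never loaded. This is only an illustration, not the code from
the gist; the app name, output handling, and bare-bones Hadoop configuration
are assumptions.

import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

import scala.io.Source

object FirstFiftyLines {
  def main(args: Array[String]): Unit = {
    val Array(inputDir, outputDir) = args
    val sc = new SparkContext(new SparkConf().setAppName("FirstFiftyLines"))

    // List the directory once on the driver.
    val fs = FileSystem.get(new URI(inputDir), new Configuration())
    val paths = fs.listStatus(new Path(inputDir)).map(_.getPath.toString).toSeq

    // Each task opens its own file and reads only the first 50 lines.
    val result = sc.parallelize(paths).map { p =>
      val in = FileSystem.get(new URI(p), new Configuration()).open(new Path(p))
      try (p, Source.fromInputStream(in).getLines().take(50).mkString("\n"))
      finally in.close()
    }

    result.saveAsTextFile(outputDir)
    sc.stop()
  }
}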



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-large-size-files-from-a-directory-tp28673.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




[jira] Lantao Jin shared "SPARK-20680: Spark-sql do not support for void column datatype of view" with you

2017-05-09 Thread Lantao Jin (JIRA)
Lantao Jin shared an issue with you




> Spark-sql do not support for void column datatype of view
> -
>
> Key: SPARK-20680
> URL: https://issues.apache.org/jira/browse/SPARK-20680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Lantao Jin
>
> Create a HIVE view:
> {quote}
> hive> create table bad as select 1 x, null z from dual;
> {quote}
> Because there's no type, Hive gives it the VOID type:
> {quote}
> hive> describe bad;
> OK
> x int 
> z void
> {quote}
> In Spark 2.0.x, reading this view behaves normally:
> {quote}
> spark-sql> describe bad;
> x   int NULL
> z   voidNULL
> Time taken: 4.431 seconds, Fetched 2 row(s)
> {quote}
> But in Spark 2.1.x, it fails with a SparkException: Cannot recognize hive type
> string: void
> {quote}
> spark-sql> describe bad;
> 17/05/09 03:12:08 INFO execution.SparkSqlParser: Parsing command: describe bad
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: int
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: void
> 17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad]
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
> DataType void() is not supported.(line 1, pos 0)
> == SQL ==  
> void   
> ^^^
> ... 61 more
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> {quote}
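
A possible workaround, sketched below (this is only an illustration, not part
of the shared issue; the table name is made up): give the NULL column an
explicit type when the table is created, so Hive records a concrete type
instead of void.

import org.apache.spark.sql.SparkSession

object VoidColumnWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    // CAST(NULL AS STRING) keeps the column nullable but typed as string.
    spark.sql("CREATE TABLE bad_fixed AS SELECT 1 AS x, CAST(NULL AS STRING) AS z")
    spark.sql("DESCRIBE bad_fixed").show()
    spark.stop()
  }
}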

 Also shared with
  d...@spark.apache.org







Re: Multiple CSV libs causes issues spark 2.1

2017-05-09 Thread lucas.g...@gmail.com
>
> df = spark.sqlContext.read.csv('out/df_in.csv')
>

> Shouldn't this just be:
> df = spark.read.csv('out/df_in.csv')
> SparkSession itself is the entry point to DataFrames and SQL functionality.



Our bootstrap is a bit messy, so in our case, no. In the general case, yes.

On 9 May 2017 at 16:56, Pushkar.Gujar  wrote:

> df = spark.sqlContext.read.csv('out/df_in.csv')
>>
>
> Shouldn't this just be:
>
> df = spark.read.csv('out/df_in.csv')
>
> SparkSession itself is the entry point to DataFrames and SQL functionality.
>
>
> Thank you,
> *Pushkar Gujar*
>
>
> On Tue, May 9, 2017 at 6:09 PM, Mark Hamstra 
> wrote:
>
>> Looks to me like it is a conflict between a Databricks library and Spark
>> 2.1. That's an issue for Databricks to resolve or provide guidance.
>>
>> On Tue, May 9, 2017 at 2:36 PM, lucas.g...@gmail.com <
>> lucas.g...@gmail.com> wrote:
>>
>>> I'm a bit confused by that answer, I'm assuming it's spark deciding
>>> which lib to use.
>>>
>>> On 9 May 2017 at 14:30, Mark Hamstra  wrote:
>>>
 This looks more like a matter for Databricks support than spark-user.

 On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com <
 lucas.g...@gmail.com> wrote:

> df = spark.sqlContext.read.csv('out/df_in.csv')
>>
>
>
>> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in
>> metastore. hive.metastore.schema.verification is not enabled so
>> recording the schema version 1.2.0
>> 17/05/09 15:51:29 WARN ObjectStore: Failed to get database default,
>> returning NoSuchObjectException
>> 17/05/09 15:51:30 WARN ObjectStore: Failed to get database
>> global_temp, returning NoSuchObjectException
>>
>
>
>> Py4JJavaError: An error occurred while calling o72.csv.
>> : java.lang.RuntimeException: Multiple sources found for csv 
>> (*com.databricks.spark.csv.DefaultSource15,
>> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat*),
>> please specify the fully qualified class name.
>> at scala.sys.package$.error(package.scala:27)
>> at org.apache.spark.sql.execution.datasources.DataSource$.looku
>> pDataSource(DataSource.scala:591)
>> at org.apache.spark.sql.execution.datasources.DataSource.provid
>> ingClass$lzycompute(DataSource.scala:86)
>> at org.apache.spark.sql.execution.datasources.DataSource.provid
>> ingClass(DataSource.scala:86)
>> at org.apache.spark.sql.execution.datasources.DataSource.resolv
>> eRelation(DataSource.scala:325)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.sc
>> ala:152)
>> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.sca
>> la:415)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
>> ssorImpl.java:57)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
>> thodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>> at py4j.Gateway.invoke(Gateway.java:280)
>> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.j
>> ava:132)
>> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>> at py4j.GatewayConnection.run(GatewayConnection.java:214) at
>> java.lang.Thread.run(Thread.java:745)
>
>
> When I change our call to:
>
> df = spark.hiveContext.read \
> 
> .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')
> \
> .load('df_in.csv)
>
> No such issue, I was under the impression (obviously wrongly) that
> spark would automatically pick the local lib.  We have the databricks
> library because other jobs still explicitly call it.
>
> Is the 'correct answer' to go through and modify so as to remove the
> databricks lib / remove it from our deploy?  Or should this just work?
>
> One of the things I find less helpful in the spark docs are when
> there's multiple ways to do it but no clear guidance on what those methods
> are intended to accomplish.
>
> Thanks!
>


>>>
>>
>


Re: Multiple CSV libs causes issues spark 2.1

2017-05-09 Thread Pushkar.Gujar
>
> df = spark.sqlContext.read.csv('out/df_in.csv')
>

Shouldn't this just be:

df = spark.read.csv('out/df_in.csv')

SparkSession itself is the entry point to DataFrames and SQL functionality.


Thank you,
*Pushkar Gujar*


On Tue, May 9, 2017 at 6:09 PM, Mark Hamstra 
wrote:

> Looks to me like it is a conflict between a Databricks library and Spark
> 2.1. That's an issue for Databricks to resolve or provide guidance.
>
> On Tue, May 9, 2017 at 2:36 PM, lucas.g...@gmail.com  > wrote:
>
>> I'm a bit confused by that answer, I'm assuming it's spark deciding which
>> lib to use.
>>
>> On 9 May 2017 at 14:30, Mark Hamstra  wrote:
>>
>>> This looks more like a matter for Databricks support than spark-user.
>>>
>>> On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com <
>>> lucas.g...@gmail.com> wrote:
>>>
 df = spark.sqlContext.read.csv('out/df_in.csv')
>


> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in
> metastore. hive.metastore.schema.verification is not enabled so
> recording the schema version 1.2.0
> 17/05/09 15:51:29 WARN ObjectStore: Failed to get database default,
> returning NoSuchObjectException
> 17/05/09 15:51:30 WARN ObjectStore: Failed to get database
> global_temp, returning NoSuchObjectException
>


> Py4JJavaError: An error occurred while calling o72.csv.
> : java.lang.RuntimeException: Multiple sources found for csv 
> (*com.databricks.spark.csv.DefaultSource15,
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat*),
> please specify the fully qualified class name.
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.execution.datasources.DataSource$.looku
> pDataSource(DataSource.scala:591)
> at org.apache.spark.sql.execution.datasources.DataSource.provid
> ingClass$lzycompute(DataSource.scala:86)
> at org.apache.spark.sql.execution.datasources.DataSource.provid
> ingClass(DataSource.scala:86)
> at org.apache.spark.sql.execution.datasources.DataSource.resolv
> eRelation(DataSource.scala:325)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.sc
> ala:152)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
> ssorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
> thodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.j
> ava:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:214) at
> java.lang.Thread.run(Thread.java:745)


 When I change our call to:

 df = spark.hiveContext.read \
 .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')
 \
 .load('df_in.csv)

 No such issue, I was under the impression (obviously wrongly) that
 spark would automatically pick the local lib.  We have the databricks
 library because other jobs still explicitly call it.

 Is the 'correct answer' to go through and modify so as to remove the
 databricks lib / remove it from our deploy?  Or should this just work?

 One of the things I find less helpful in the spark docs are when
 there's multiple ways to do it but no clear guidance on what those methods
 are intended to accomplish.

 Thanks!

>>>
>>>
>>
>


Re: Multiple CSV libs causes issues spark 2.1

2017-05-09 Thread Hyukjin Kwon
Sounds like it is related to https://github.com/apache/spark/pull/17916

We will allow picking up the internal one once that one gets merged.

On 10 May 2017 7:09 am, "Mark Hamstra"  wrote:

> Looks to me like it is a conflict between a Databricks library and Spark
> 2.1. That's an issue for Databricks to resolve or provide guidance.
>
> On Tue, May 9, 2017 at 2:36 PM, lucas.g...@gmail.com  > wrote:
>
>> I'm a bit confused by that answer, I'm assuming it's spark deciding which
>> lib to use.
>>
>> On 9 May 2017 at 14:30, Mark Hamstra  wrote:
>>
>>> This looks more like a matter for Databricks support than spark-user.
>>>
>>> On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com <
>>> lucas.g...@gmail.com> wrote:
>>>
 df = spark.sqlContext.read.csv('out/df_in.csv')
>


> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in
> metastore. hive.metastore.schema.verification is not enabled so
> recording the schema version 1.2.0
> 17/05/09 15:51:29 WARN ObjectStore: Failed to get database default,
> returning NoSuchObjectException
> 17/05/09 15:51:30 WARN ObjectStore: Failed to get database
> global_temp, returning NoSuchObjectException
>


> Py4JJavaError: An error occurred while calling o72.csv.
> : java.lang.RuntimeException: Multiple sources found for csv 
> (*com.databricks.spark.csv.DefaultSource15,
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat*),
> please specify the fully qualified class name.
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.execution.datasources.DataSource$.looku
> pDataSource(DataSource.scala:591)
> at org.apache.spark.sql.execution.datasources.DataSource.provid
> ingClass$lzycompute(DataSource.scala:86)
> at org.apache.spark.sql.execution.datasources.DataSource.provid
> ingClass(DataSource.scala:86)
> at org.apache.spark.sql.execution.datasources.DataSource.resolv
> eRelation(DataSource.scala:325)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.sc
> ala:152)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
> ssorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
> thodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.j
> ava:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:214) at
> java.lang.Thread.run(Thread.java:745)


 When I change our call to:

 df = spark.hiveContext.read \
 .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')
 \
 .load('df_in.csv)

 No such issue, I was under the impression (obviously wrongly) that
 spark would automatically pick the local lib.  We have the databricks
 library because other jobs still explicitly call it.

 Is the 'correct answer' to go through and modify so as to remove the
 databricks lib / remove it from our deploy?  Or should this just work?

 One of the things I find less helpful in the spark docs are when
 there's multiple ways to do it but no clear guidance on what those methods
 are intended to accomplish.

 Thanks!

>>>
>>>
>>
>


Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
Ah, thanks. Your code also works for me. I figured out it's because I tried
to encode a tuple of (MyClass, Int):


package org.apache.spark

/**
  */

import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Encoders, SQLContext}


object Hello {
  // this class has to be OUTSIDE the method that calls it!! otherwise
gives error about typetag not found
  // the UDT stuff from
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/test/ExamplePointUDT.scala
  // and 
http://stackoverflow.com/questions/32440461/how-to-define-schema-for-custom-type-in-spark-sql
  class Person4 {
@scala.beans.BeanProperty def setX(x:Int): Unit = {}
@scala.beans.BeanProperty def getX():Int = {1}
  }

  def main(args: Array[String]) {
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file
on your system
val conf = new
SparkConf().setMaster("local[*]").setAppName("Simple Application")
val sc = new SparkContext(conf)

val raw = Array((new Person4(), 1), (new Person4(), 1))
val myrdd = sc.parallelize(raw)

val sqlContext = new SQLContext(sc)

implicit val personEncoder = Encoders.bean[Person4](classOf[Person4])
implicit val personEncoder2 = Encoders.tuple(personEncoder, Encoders.INT)


import sqlContext.implicits._

// this works:
Seq(new Person4(), new Person4()).toDS()

// this doesn't:
Seq((new Person4(), 1), (new Person4(), 1)).toDS()


sc.stop()
  }
}


On Tue, May 9, 2017 at 1:37 PM, Michael Armbrust 
wrote:

> Must be a bug.  This works for me
> 
>  in
> Spark 2.1.
>
> On Tue, May 9, 2017 at 12:10 PM, Yang  wrote:
>
>> somehow the schema check is here
>>
>> https://github.com/apache/spark/blob/master/sql/catalyst/
>> src/main/scala/org/apache/spark/sql/catalyst/ScalaReflec
>> tion.scala#L697-L750
>>
>> supposedly beans are to be handled, but it's not clear to me which line
>> handles the type of beans. if that's clear, I could probably annotate my
>> bean class properly
>>
>> On Tue, May 9, 2017 at 11:19 AM, Michael Armbrust > > wrote:
>>
>>> I think you are supposed to set BeanProperty on a var as they do here
>>> .
>>> If you are using scala though I'd consider using the case class encoders.
>>>
>>> On Tue, May 9, 2017 at 12:21 AM, Yang  wrote:
>>>
 I'm trying to use Encoders.bean() to create an encoder for my custom
 class, but it fails complaining about can't find the schema:


 class Person4 { @scala.beans.BeanProperty def setX(x:Int): Unit = {}
 @scala.beans.BeanProperty def getX():Int = {1} } val personEncoder =
 Encoders.bean[Person4](classOf[Person4]) scala> val person_rdd =sc.
 parallelize(Array( (new Person4(), 1), (new Person4(), 2) )) person_rdd
 : org.apache.spark.rdd.RDD[(Person4, Int)] = ParallelCollectionRDD[1]
 at parallelize at :31 scala> sqlcontext.createDataFrame(per
 son_rdd) java.lang.UnsupportedOperationException: Schema for type
 Person4 is not supported at org.apache.spark.sql.catalyst.
 ScalaReflection$.schemaFor(ScalaReflection.scala:716) at org.apache.
 spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$2.apply(
 ScalaReflection.scala:71 2) at org.apache.spark.sql.catalyst.
 ScalaReflection$$anonfun$schemaFor$2.apply(ScalaReflection.scala:71 1)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(Traver
 sableLike.scala:234) at


 but if u look at the encoder's schema, it does know it:
 but the system does seem to understand the schema for "Person4":


 scala> personEncoder.schema
 res38: org.apache.spark.sql.types.StructType = 
 StructType(StructField(x,IntegerType,false))


>>>
>>
>


Re: Multiple CSV libs causes issues spark 2.1

2017-05-09 Thread Mark Hamstra
Looks to me like it is a conflict between a Databricks library and Spark
2.1. That's an issue for Databricks to resolve or provide guidance.

On Tue, May 9, 2017 at 2:36 PM, lucas.g...@gmail.com 
wrote:

> I'm a bit confused by that answer, I'm assuming it's spark deciding which
> lib to use.
>
> On 9 May 2017 at 14:30, Mark Hamstra  wrote:
>
>> This looks more like a matter for Databricks support than spark-user.
>>
>> On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com <
>> lucas.g...@gmail.com> wrote:
>>
>>> df = spark.sqlContext.read.csv('out/df_in.csv')

>>>
>>>
 17/05/09 15:51:29 WARN ObjectStore: Version information not found in
 metastore. hive.metastore.schema.verification is not enabled so
 recording the schema version 1.2.0
 17/05/09 15:51:29 WARN ObjectStore: Failed to get database default,
 returning NoSuchObjectException
 17/05/09 15:51:30 WARN ObjectStore: Failed to get database global_temp,
 returning NoSuchObjectException

>>>
>>>
 Py4JJavaError: An error occurred while calling o72.csv.
 : java.lang.RuntimeException: Multiple sources found for csv 
 (*com.databricks.spark.csv.DefaultSource15,
 org.apache.spark.sql.execution.datasources.csv.CSVFileFormat*), please
 specify the fully qualified class name.
 at scala.sys.package$.error(package.scala:27)
 at org.apache.spark.sql.execution.datasources.DataSource$.looku
 pDataSource(DataSource.scala:591)
 at org.apache.spark.sql.execution.datasources.DataSource.provid
 ingClass$lzycompute(DataSource.scala:86)
 at org.apache.spark.sql.execution.datasources.DataSource.provid
 ingClass(DataSource.scala:86)
 at org.apache.spark.sql.execution.datasources.DataSource.resolv
 eRelation(DataSource.scala:325)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
 ssorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
 thodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
 at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
 at py4j.Gateway.invoke(Gateway.java:280)
 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
 at py4j.commands.CallCommand.execute(CallCommand.java:79)
 at py4j.GatewayConnection.run(GatewayConnection.java:214) at
 java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>> When I change our call to:
>>>
>>> df = spark.hiveContext.read \
>>> .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')
>>> \
>>> .load('df_in.csv)
>>>
>>> No such issue, I was under the impression (obviously wrongly) that spark
>>> would automatically pick the local lib.  We have the databricks library
>>> because other jobs still explicitly call it.
>>>
>>> Is the 'correct answer' to go through and modify so as to remove the
>>> databricks lib / remove it from our deploy?  Or should this just work?
>>>
>>> One of the things I find less helpful in the spark docs are when there's
>>> multiple ways to do it but no clear guidance on what those methods are
>>> intended to accomplish.
>>>
>>> Thanks!
>>>
>>
>>
>


Re: Multiple CSV libs causes issues spark 2.1

2017-05-09 Thread lucas.g...@gmail.com
I'm a bit confused by that answer, I'm assuming it's spark deciding which
lib to use.

On 9 May 2017 at 14:30, Mark Hamstra  wrote:

> This looks more like a matter for Databricks support than spark-user.
>
> On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com  > wrote:
>
>> df = spark.sqlContext.read.csv('out/df_in.csv')
>>>
>>
>>
>>> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in
>>> metastore. hive.metastore.schema.verification is not enabled so
>>> recording the schema version 1.2.0
>>> 17/05/09 15:51:29 WARN ObjectStore: Failed to get database default,
>>> returning NoSuchObjectException
>>> 17/05/09 15:51:30 WARN ObjectStore: Failed to get database global_temp,
>>> returning NoSuchObjectException
>>>
>>
>>
>>> Py4JJavaError: An error occurred while calling o72.csv.
>>> : java.lang.RuntimeException: Multiple sources found for csv 
>>> (*com.databricks.spark.csv.DefaultSource15,
>>> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat*), please
>>> specify the fully qualified class name.
>>> at scala.sys.package$.error(package.scala:27)
>>> at org.apache.spark.sql.execution.datasources.DataSource$.
>>> lookupDataSource(DataSource.scala:591)
>>> at org.apache.spark.sql.execution.datasources.DataSource.
>>> providingClass$lzycompute(DataSource.scala:86)
>>> at org.apache.spark.sql.execution.datasources.DataSource.
>>> providingClass(DataSource.scala:86)
>>> at org.apache.spark.sql.execution.datasources.DataSource.
>>> resolveRelation(DataSource.scala:325)
>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>>> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
>>> ssorImpl.java:57)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
>>> thodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>>> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>>> at py4j.Gateway.invoke(Gateway.java:280)
>>> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>>> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>> at py4j.GatewayConnection.run(GatewayConnection.java:214) at
>>> java.lang.Thread.run(Thread.java:745)
>>
>>
>> When I change our call to:
>>
>> df = spark.hiveContext.read \
>> .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')
>> \
>> .load('df_in.csv)
>>
>> No such issue, I was under the impression (obviously wrongly) that spark
>> would automatically pick the local lib.  We have the databricks library
>> because other jobs still explicitly call it.
>>
>> Is the 'correct answer' to go through and modify so as to remove the
>> databricks lib / remove it from our deploy?  Or should this just work?
>>
>> One of the things I find less helpful in the spark docs are when there's
>> multiple ways to do it but no clear guidance on what those methods are
>> intended to accomplish.
>>
>> Thanks!
>>
>
>


Re: Multiple CSV libs causes issues spark 2.1

2017-05-09 Thread Mark Hamstra
This looks more like a matter for Databricks support than spark-user.

On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com 
wrote:

> df = spark.sqlContext.read.csv('out/df_in.csv')
>>
>
>
>> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in
>> metastore. hive.metastore.schema.verification is not enabled so
>> recording the schema version 1.2.0
>> 17/05/09 15:51:29 WARN ObjectStore: Failed to get database default,
>> returning NoSuchObjectException
>> 17/05/09 15:51:30 WARN ObjectStore: Failed to get database global_temp,
>> returning NoSuchObjectException
>>
>
>
>> Py4JJavaError: An error occurred while calling o72.csv.
>> : java.lang.RuntimeException: Multiple sources found for csv 
>> (*com.databricks.spark.csv.DefaultSource15,
>> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat*), please
>> specify the fully qualified class name.
>> at scala.sys.package$.error(package.scala:27)
>> at org.apache.spark.sql.execution.datasources.
>> DataSource$.lookupDataSource(DataSource.scala:591)
>> at org.apache.spark.sql.execution.datasources.DataSource.providingClass$
>> lzycompute(DataSource.scala:86)
>> at org.apache.spark.sql.execution.datasources.DataSource.providingClass(
>> DataSource.scala:86)
>> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(
>> DataSource.scala:325)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(
>> NativeMethodAccessorImpl.java:57)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
>> DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>> at py4j.Gateway.invoke(Gateway.java:280)
>> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>> at py4j.GatewayConnection.run(GatewayConnection.java:214) at
>> java.lang.Thread.run(Thread.java:745)
>
>
> When I change our call to:
>
> df = spark.hiveContext.read \
> .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')
> \
> .load('df_in.csv)
>
> No such issue, I was under the impression (obviously wrongly) that spark
> would automatically pick the local lib.  We have the databricks library
> because other jobs still explicitly call it.
>
> Is the 'correct answer' to go through and modify so as to remove the
> databricks lib / remove it from our deploy?  Or should this just work?
>
> One of the things I find less helpful in the spark docs are when there's
> multiple ways to do it but no clear guidance on what those methods are
> intended to accomplish.
>
> Thanks!
>


Multiple CSV libs causes issues spark 2.1

2017-05-09 Thread lucas.g...@gmail.com
>
> df = spark.sqlContext.read.csv('out/df_in.csv')
>


> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in
> metastore. hive.metastore.schema.verification is not enabled so recording
> the schema version 1.2.0
> 17/05/09 15:51:29 WARN ObjectStore: Failed to get database default,
> returning NoSuchObjectException
> 17/05/09 15:51:30 WARN ObjectStore: Failed to get database global_temp,
> returning NoSuchObjectException
>


> Py4JJavaError: An error occurred while calling o72.csv.
> : java.lang.RuntimeException: Multiple sources found for csv 
> (*com.databricks.spark.csv.DefaultSource15,
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat*), please
> specify the fully qualified class name.
> at scala.sys.package$.error(package.scala:27)
> at
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:591)
> at
> org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
> at
> org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
> at
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:214) at
> java.lang.Thread.run(Thread.java:745)


When I change our call to:

df = spark.hiveContext.read \
    .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat') \
    .load('df_in.csv')

there is no such issue. I was under the impression (obviously wrongly) that
Spark would automatically pick the built-in lib. We have the Databricks
library because other jobs still explicitly call it.

Is the 'correct answer' to go through and modify our jobs so as to remove the
Databricks lib / remove it from our deploy? Or should this just work?

One of the things I find less helpful in the Spark docs is when there are
multiple ways to do something but no clear guidance on what those methods are
intended to accomplish.
Thanks!


Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Michael Armbrust
Must be a bug. This works for me in Spark 2.1.

On Tue, May 9, 2017 at 12:10 PM, Yang  wrote:

> somehow the schema check is here
>
> https://github.com/apache/spark/blob/master/sql/
> catalyst/src/main/scala/org/apache/spark/sql/catalyst/
> ScalaReflection.scala#L697-L750
>
> supposedly beans are to be handled, but it's not clear to me which line
> handles the type of beans. if that's clear, I could probably annotate my
> bean class properly
>
> On Tue, May 9, 2017 at 11:19 AM, Michael Armbrust 
> wrote:
>
>> I think you are supposed to set BeanProperty on a var as they do here
>> .
>> If you are using scala though I'd consider using the case class encoders.
>>
>> On Tue, May 9, 2017 at 12:21 AM, Yang  wrote:
>>
>>> I'm trying to use Encoders.bean() to create an encoder for my custom
>>> class, but it fails complaining about can't find the schema:
>>>
>>>
>>> class Person4 { @scala.beans.BeanProperty def setX(x:Int): Unit = {}
>>> @scala.beans.BeanProperty def getX():Int = {1} } val personEncoder =
>>> Encoders.bean[Person4](classOf[Person4]) scala> val person_rdd =sc.
>>> parallelize(Array( (new Person4(), 1), (new Person4(), 2) )) person_rdd:
>>> org.apache.spark.rdd.RDD[(Person4, Int)] = ParallelCollectionRDD[1] at
>>> parallelize at :31 scala> sqlcontext.createDataFrame(per
>>> son_rdd) java.lang.UnsupportedOperationException: Schema for type
>>> Person4 is not supported at org.apache.spark.sql.catalyst.
>>> ScalaReflection$.schemaFor(ScalaReflection.scala:716) at org.apache.
>>> spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$2.apply(
>>> ScalaReflection.scala:71 2) at org.apache.spark.sql.catalyst.
>>> ScalaReflection$$anonfun$schemaFor$2.apply(ScalaReflection.scala:71 1)
>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike
>>> .scala:234) at
>>>
>>>
>>> but if u look at the encoder's schema, it does know it:
>>> but the system does seem to understand the schema for "Person4":
>>>
>>>
>>> scala> personEncoder.schema
>>> res38: org.apache.spark.sql.types.StructType = 
>>> StructType(StructField(x,IntegerType,false))
>>>
>>>
>>
>


Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
somehow the schema check is here

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L697-L750

supposedly beans are to be handled, but it's not clear to me which line
handles the type of beans. if that's clear, I could probably annotate my
bean class properly

On Tue, May 9, 2017 at 11:19 AM, Michael Armbrust 
wrote:

> I think you are supposed to set BeanProperty on a var as they do here
> .
> If you are using scala though I'd consider using the case class encoders.
>
> On Tue, May 9, 2017 at 12:21 AM, Yang  wrote:
>
>> I'm trying to use Encoders.bean() to create an encoder for my custom
>> class, but it fails complaining about can't find the schema:
>>
>>
>> class Person4 { @scala.beans.BeanProperty def setX(x:Int): Unit = {}
>> @scala.beans.BeanProperty def getX():Int = {1} } val personEncoder =
>> Encoders.bean[Person4](classOf[Person4]) scala> val person_rdd =sc.
>> parallelize(Array( (new Person4(), 1), (new Person4(), 2) )) person_rdd:
>> org.apache.spark.rdd.RDD[(Person4, Int)] = ParallelCollectionRDD[1] at
>> parallelize at :31 scala> sqlcontext.createDataFrame(person_rdd
>> ) java.lang.UnsupportedOperationException: Schema for type Person4 is not
>> supported at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(Sca
>> laReflection.scala:716) at org.apache.spark.sql.catalyst.
>> ScalaReflection$$anonfun$schemaFor$2.apply(ScalaReflection.scala:71 2)
>> at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$2.
>> apply(ScalaReflection.scala:71 1) at scala.collection.TraversableLi
>> ke$$anonfun$map$1.apply(TraversableLike.scala:234) at
>>
>>
>> but if u look at the encoder's schema, it does know it:
>> but the system does seem to understand the schema for "Person4":
>>
>>
>> scala> personEncoder.schema
>> res38: org.apache.spark.sql.types.StructType = 
>> StructType(StructField(x,IntegerType,false))
>>
>>
>


Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
2.0.2 with scala 2.11

On Tue, May 9, 2017 at 11:30 AM, Michael Armbrust 
wrote:

> Which version of Spark?
>
> On Tue, May 9, 2017 at 11:28 AM, Yang  wrote:
>
>> actually with var it's the same:
>>
>>
>> scala> class Person4 {
>>  |
>>  | @scala.beans.BeanProperty var X:Int = 1
>>  | }
>> defined class Person4
>>
>> scala> val personEncoder = Encoders.bean[Person4](classOf[Person4])
>> personEncoder: org.apache.spark.sql.Encoder[Person4] = class[x[0]: int]
>>
>> scala> val person_rdd =sc.parallelize(Array( (new Person4(), 1), (new
>> Person4(), 2) ))
>> person_rdd: org.apache.spark.rdd.RDD[(Person4, Int)] =
>> ParallelCollectionRDD[3] at parallelize at :39
>>
>> scala> sqlContext.createDataFrame(person_rdd)
>> java.lang.UnsupportedOperationException: Schema for type Person4 is not
>> supported
>>   at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(Sca
>> laReflection.scala:716)
>>   at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schem
>> aFor$2.apply(ScalaReflection.scala:712)
>>   at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schem
>> aFor$2.apply(ScalaReflection.scala:711)
>>   at scala.collection.TraversableLike$$anonfun$map$1.apply(
>> TraversableLike.scala:234)
>>   at scala.collection.TraversableLike$$anonfun$map$1.apply(
>> TraversableLike.scala:234)
>>   at scala.collection.immutable.List.foreach(List.scala:381)
>>   at scala.collection.TraversableLike$class.map(TraversableLike.
>> scala:234)
>>   at scala.collection.immutable.List.map(List.scala:285)
>>   at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(Sca
>> laReflection.scala:711)
>>   at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(Sca
>> laReflection.scala:654)
>>   at org.apache.spark.sql.SparkSession.createDataFrame(SparkSessi
>> on.scala:251)
>>   at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.
>> scala:278)
>>   ... 54 elided
>>
>> On Tue, May 9, 2017 at 11:19 AM, Michael Armbrust > > wrote:
>>
>>> I think you are supposed to set BeanProperty on a var as they do here
>>> .
>>> If you are using scala though I'd consider using the case class encoders.
>>>
>>> On Tue, May 9, 2017 at 12:21 AM, Yang  wrote:
>>>
 I'm trying to use Encoders.bean() to create an encoder for my custom
 class, but it fails complaining about can't find the schema:


 class Person4 { @scala.beans.BeanProperty def setX(x:Int): Unit = {}
 @scala.beans.BeanProperty def getX():Int = {1} } val personEncoder =
 Encoders.bean[Person4](classOf[Person4]) scala> val person_rdd =sc.
 parallelize(Array( (new Person4(), 1), (new Person4(), 2) )) person_rdd
 : org.apache.spark.rdd.RDD[(Person4, Int)] = ParallelCollectionRDD[1]
 at parallelize at :31 scala> sqlcontext.createDataFrame(per
 son_rdd) java.lang.UnsupportedOperationException: Schema for type
 Person4 is not supported at org.apache.spark.sql.catalyst.
 ScalaReflection$.schemaFor(ScalaReflection.scala:716) at org.apache.
 spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$2.apply(
 ScalaReflection.scala:71 2) at org.apache.spark.sql.catalyst.
 ScalaReflection$$anonfun$schemaFor$2.apply(ScalaReflection.scala:71 1)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(Traver
 sableLike.scala:234) at


 but if u look at the encoder's schema, it does know it:
 but the system does seem to understand the schema for "Person4":


 scala> personEncoder.schema
 res38: org.apache.spark.sql.types.StructType = 
 StructType(StructField(x,IntegerType,false))


>>>
>>
>


Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Michael Armbrust
Which version of Spark?

On Tue, May 9, 2017 at 11:28 AM, Yang  wrote:

> actually with var it's the same:
>
>
> scala> class Person4 {
>  |
>  | @scala.beans.BeanProperty var X:Int = 1
>  | }
> defined class Person4
>
> scala> val personEncoder = Encoders.bean[Person4](classOf[Person4])
> personEncoder: org.apache.spark.sql.Encoder[Person4] = class[x[0]: int]
>
> scala> val person_rdd =sc.parallelize(Array( (new Person4(), 1), (new
> Person4(), 2) ))
> person_rdd: org.apache.spark.rdd.RDD[(Person4, Int)] =
> ParallelCollectionRDD[3] at parallelize at :39
>
> scala> sqlContext.createDataFrame(person_rdd)
> java.lang.UnsupportedOperationException: Schema for type Person4 is not
> supported
>   at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(
> ScalaReflection.scala:716)
>   at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$
> schemaFor$2.apply(ScalaReflection.scala:712)
>   at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$
> schemaFor$2.apply(ScalaReflection.scala:711)
>   at scala.collection.TraversableLike$$anonfun$map$
> 1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$
> 1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(
> ScalaReflection.scala:711)
>   at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(
> ScalaReflection.scala:654)
>   at org.apache.spark.sql.SparkSession.createDataFrame(
> SparkSession.scala:251)
>   at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:278)
>   ... 54 elided
>
> On Tue, May 9, 2017 at 11:19 AM, Michael Armbrust 
> wrote:
>
>> I think you are supposed to set BeanProperty on a var as they do here
>> .
>> If you are using scala though I'd consider using the case class encoders.
>>
>> On Tue, May 9, 2017 at 12:21 AM, Yang  wrote:
>>
>>> I'm trying to use Encoders.bean() to create an encoder for my custom
>>> class, but it fails complaining about can't find the schema:
>>>
>>>
>>> class Person4 { @scala.beans.BeanProperty def setX(x:Int): Unit = {}
>>> @scala.beans.BeanProperty def getX():Int = {1} } val personEncoder =
>>> Encoders.bean[Person4](classOf[Person4]) scala> val person_rdd =sc.
>>> parallelize(Array( (new Person4(), 1), (new Person4(), 2) )) person_rdd:
>>> org.apache.spark.rdd.RDD[(Person4, Int)] = ParallelCollectionRDD[1] at
>>> parallelize at :31 scala> sqlcontext.createDataFrame(per
>>> son_rdd) java.lang.UnsupportedOperationException: Schema for type
>>> Person4 is not supported at org.apache.spark.sql.catalyst.
>>> ScalaReflection$.schemaFor(ScalaReflection.scala:716) at org.apache.
>>> spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$2.apply(
>>> ScalaReflection.scala:71 2) at org.apache.spark.sql.catalyst.
>>> ScalaReflection$$anonfun$schemaFor$2.apply(ScalaReflection.scala:71 1)
>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike
>>> .scala:234) at
>>>
>>>
>>> but if u look at the encoder's schema, it does know it:
>>> but the system does seem to understand the schema for "Person4":
>>>
>>>
>>> scala> personEncoder.schema
>>> res38: org.apache.spark.sql.types.StructType = 
>>> StructType(StructField(x,IntegerType,false))
>>>
>>>
>>
>


Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
actually with var it's the same:


scala> class Person4 {
 |
 | @scala.beans.BeanProperty var X:Int = 1
 | }
defined class Person4

scala> val personEncoder = Encoders.bean[Person4](classOf[Person4])
personEncoder: org.apache.spark.sql.Encoder[Person4] = class[x[0]: int]

scala> val person_rdd =sc.parallelize(Array( (new Person4(), 1), (new
Person4(), 2) ))
person_rdd: org.apache.spark.rdd.RDD[(Person4, Int)] =
ParallelCollectionRDD[3] at parallelize at :39

scala> sqlContext.createDataFrame(person_rdd)
java.lang.UnsupportedOperationException: Schema for type Person4 is not
supported
  at
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:716)
  at
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$2.apply(ScalaReflection.scala:712)
  at
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$2.apply(ScalaReflection.scala:711)
  at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:711)
  at
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:654)
  at
org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:251)
  at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:278)
  ... 54 elided

On Tue, May 9, 2017 at 11:19 AM, Michael Armbrust 
wrote:

> I think you are supposed to set BeanProperty on a var as they do here
> .
> If you are using scala though I'd consider using the case class encoders.
>
> On Tue, May 9, 2017 at 12:21 AM, Yang  wrote:
>
>> I'm trying to use Encoders.bean() to create an encoder for my custom
>> class, but it fails complaining about can't find the schema:
>>
>>
>> class Person4 { @scala.beans.BeanProperty def setX(x:Int): Unit = {}
>> @scala.beans.BeanProperty def getX():Int = {1} } val personEncoder =
>> Encoders.bean[Person4](classOf[Person4]) scala> val person_rdd =sc.
>> parallelize(Array( (new Person4(), 1), (new Person4(), 2) )) person_rdd:
>> org.apache.spark.rdd.RDD[(Person4, Int)] = ParallelCollectionRDD[1] at
>> parallelize at :31 scala> sqlcontext.createDataFrame(person_rdd
>> ) java.lang.UnsupportedOperationException: Schema for type Person4 is not
>> supported at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(Sca
>> laReflection.scala:716) at org.apache.spark.sql.catalyst.
>> ScalaReflection$$anonfun$schemaFor$2.apply(ScalaReflection.scala:71 2)
>> at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$2.
>> apply(ScalaReflection.scala:71 1) at scala.collection.TraversableLi
>> ke$$anonfun$map$1.apply(TraversableLike.scala:234) at
>>
>>
>> but if u look at the encoder's schema, it does know it:
>> but the system does seem to understand the schema for "Person4":
>>
>>
>> scala> personEncoder.schema
>> res38: org.apache.spark.sql.types.StructType = 
>> StructType(StructField(x,IntegerType,false))
>>
>>
>


Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
Thanks Michael.

I could not use a case class here, since I need to later modify the output of
getX() so that it is dynamically generated.

The bigger context is this: I want to implement topN() using a
BoundedPriorityQueue. Basically I include a queue in reduce() or
aggregateByKey(), but the only available serializer is Kryo, and it's
extremely slow in this case because BoundedPriorityQueue probably has a lot
of internal fields.

So I want to wrap the queue in a wrapper class and only export the queue
content through getContent() and setContent(), where the content is a list of
tuples. This way, when I encode the wrapper, the bean encoder simply encodes
the getContent() output, I think. Encoding a list of tuples is very fast.

Yang
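
For the underlying per-key topN goal, a sketch of an alternative that avoids
custom serializers entirely (an illustration only, not the poster's wrapper
approach; the data and N below are made up): keep a small bounded list per
key with aggregateByKey.

import org.apache.spark.{SparkConf, SparkContext}

object TopNPerKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("TopNPerKey"))
    val n = 3
    val data = sc.parallelize(Seq(("a", 5), ("a", 1), ("a", 9), ("a", 7), ("b", 2), ("b", 8)))

    // Keep at most n largest values per key; the accumulator is a plain List,
    // which serializes cheaply.
    val topN = data.aggregateByKey(List.empty[Int])(
      (acc, v) => (v :: acc).sorted(Ordering[Int].reverse).take(n),
      (a, b)   => (a ++ b).sorted(Ordering[Int].reverse).take(n)
    )

    topN.collect().foreach(println)
    sc.stop()
  }
}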

On Tue, May 9, 2017 at 11:19 AM, Michael Armbrust 
wrote:

> I think you are supposed to set BeanProperty on a var as they do here
> .
> If you are using scala though I'd consider using the case class encoders.
>
> On Tue, May 9, 2017 at 12:21 AM, Yang  wrote:
>
>> I'm trying to use Encoders.bean() to create an encoder for my custom
>> class, but it fails complaining about can't find the schema:
>>
>>
>> class Person4 { @scala.beans.BeanProperty def setX(x:Int): Unit = {}
>> @scala.beans.BeanProperty def getX():Int = {1} } val personEncoder =
>> Encoders.bean[Person4](classOf[Person4]) scala> val person_rdd =sc.
>> parallelize(Array( (new Person4(), 1), (new Person4(), 2) )) person_rdd:
>> org.apache.spark.rdd.RDD[(Person4, Int)] = ParallelCollectionRDD[1] at
>> parallelize at :31 scala> sqlcontext.createDataFrame(person_rdd
>> ) java.lang.UnsupportedOperationException: Schema for type Person4 is not
>> supported at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(Sca
>> laReflection.scala:716) at org.apache.spark.sql.catalyst.
>> ScalaReflection$$anonfun$schemaFor$2.apply(ScalaReflection.scala:71 2)
>> at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$2.
>> apply(ScalaReflection.scala:71 1) at scala.collection.TraversableLi
>> ke$$anonfun$map$1.apply(TraversableLike.scala:234) at
>>
>>
>> but if u look at the encoder's schema, it does know it:
>> but the system does seem to understand the schema for "Person4":
>>
>>
>> scala> personEncoder.schema
>> res38: org.apache.spark.sql.types.StructType = 
>> StructType(StructField(x,IntegerType,false))
>>
>>
>


Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Michael Armbrust
I think you are supposed to set BeanProperty on a var, as they do here.
If you are using Scala, though, I'd consider using the case class encoders.
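
A minimal sketch of the case-class route suggested above (the names here are
illustrative, not from this thread): case classes get Product encoders
automatically from spark.implicits, including inside tuples.

import org.apache.spark.sql.SparkSession

object CaseClassEncoderExample {
  case class Person(x: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("case-class-encoders").getOrCreate()
    import spark.implicits._

    // Tuples of case classes get encoders automatically from spark.implicits.
    val ds = Seq((Person(1), 1), (Person(2), 2)).toDS()
    ds.printSchema()
    ds.show()

    spark.stop()
  }
}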

On Tue, May 9, 2017 at 12:21 AM, Yang  wrote:

> I'm trying to use Encoders.bean() to create an encoder for my custom
> class, but it fails complaining about can't find the schema:
>
>
> class Person4 { @scala.beans.BeanProperty def setX(x:Int): Unit = {}
> @scala.beans.BeanProperty def getX():Int = {1} } val personEncoder =
> Encoders.bean[Person4](classOf[Person4]) scala> val person_rdd =sc.
> parallelize(Array( (new Person4(), 1), (new Person4(), 2) )) person_rdd:
> org.apache.spark.rdd.RDD[(Person4, Int)] = ParallelCollectionRDD[1] at
> parallelize at :31 scala> sqlcontext.createDataFrame(person_rdd)
> java.lang.UnsupportedOperationException: Schema for type Person4 is not
> supported at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(Sca
> laReflection.scala:716) at org.apache.spark.sql.catalyst.
> ScalaReflection$$anonfun$schemaFor$2.apply(ScalaReflection.scala:71 2) at
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$2.apply(
> ScalaReflection.scala:71 1) at scala.collection.TraversableLi
> ke$$anonfun$map$1.apply(TraversableLike.scala:234) at
>
>
> but if u look at the encoder's schema, it does know it:
> but the system does seem to understand the schema for "Person4":
>
>
> scala> personEncoder.schema
> res38: org.apache.spark.sql.types.StructType = 
> StructType(StructField(x,IntegerType,false))
>
>


Spark RandomForestClassifier and balancing classes

2017-05-09 Thread issues solution
Hi, I have already asked this question but I am still without an answer. Can
someone help me figure out how I can balance my classes when I use the fit
method of RandomForestClassifier?

Thanks in advance.
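
One common workaround, sketched below (this is only an illustration, not an
answer given on the list; the column names and fractions are made up):
RandomForestClassifier in Spark 2.x has no class-weight option, so one
approach is to stratify-sample the majority class with sampleBy before
calling fit().

import org.apache.spark.sql.SparkSession

object BalanceBeforeFit {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("BalanceBeforeFit").getOrCreate()
    import spark.implicits._

    val df = Seq((0.0, 1.0), (0.0, 2.0), (0.0, 3.0), (0.0, 4.0), (1.0, 5.0))
      .toDF("label", "feature")

    // Keep 25% of the majority class (0.0) and all of the minority class (1.0).
    val balanced = df.stat.sampleBy("label", Map(0.0 -> 0.25, 1.0 -> 1.0), seed = 42L)
    balanced.groupBy("label").count().show()

    spark.stop()
  }
}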


Re: How to read large size files from a directory ?

2017-05-09 Thread Alonso Isidoro Roman
Please create a GitHub repo and upload the code there...


Alonso Isidoro Roman
about.me/alonso.isidoro.roman


2017-05-09 8:47 GMT+02:00 ashwini anand :

> I am reading each file of a directory using wholeTextFiles. After that I
> am calling a function on each element of the rdd using map . The whole
> program uses just 50 lines of each file. The code is as below: def
> processFiles(fileNameContentsPair): fileName= fileNameContentsPair[0]
> result = "\n\n"+fileName resultEr = "\n\n"+fileName input =
> StringIO.StringIO(fileNameContentsPair[1]) reader =
> csv.reader(input,strict=True) try: i=0 for row in reader: if i==50: break
> // do some processing and get result string i=i+1 except csv.Error as e:
> resultEr = resultEr +"error occured\n\n" return resultEr return result if
> __name__ == "__main__": inputFile = sys.argv[1] outputFile = sys.argv[2] sc
> = SparkContext(appName = "SomeApp") resultRDD =
> sc.wholeTextFiles(inputFile).map(processFiles) 
> resultRDD.saveAsTextFile(outputFile)
> The size of each file of the directory can be very large in my case and
> because of this reason use of wholeTextFiles api will be inefficient in
> this case. Right now wholeTextFiles loads full file content into the
> memory. can we make wholeTextFiles to load only first 50 lines of each file
> ? Apart from using wholeTextFiles, other solution I can think of is
> iterating over each file of the directory one by one but that also seems to
> be inefficient. I am new to spark. Please let me know if there is any
> efficient way to do this.
> --
> View this message in context: How to read large size files from a
> directory ?
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


Re: Join streams Apache Spark

2017-05-09 Thread tencas
Hi scorpio,

Thanks for your reply.
I don't understand your approach. Is it possible to receive data from
different clients through the same port on Spark?

Surely I'm confused, and I'd appreciate your opinion.

Regarding the word count example from the Spark Streaming documentation,
Spark acts as a client that connects to a remote server in order to receive
data:

// Create a DStream that will connect to hostname:port, like localhost:
JavaReceiverInputDStream lines = jssc.socketTextStream("localhost",
);

Then you create a dummy server using nc to accept connection requests from
Spark and to send data:

nc -lk 

So, regarding this implementation, since Spark is playing the role of a TCP
client, you'd need to manage the join of the external sensor streams (by the
way, all with the same schema) in your own server.
How would you be able to make Spark act as a "sink" that can receive streams
from different sources through the same port?
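
One pattern that may be what scorpio meant, sketched below (hosts, ports and
batch interval are placeholders; this is only an illustration, not his
proposal): open one receiver per sensor endpoint and union the resulting
DStreams, so downstream logic sees a single stream.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UnionSensors {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("UnionSensors")
    val ssc = new StreamingContext(conf, Seconds(5))

    // One receiver per sensor endpoint; hosts and ports are placeholders.
    val sensorA = ssc.socketTextStream("sensor-host-a", 9999)
    val sensorB = ssc.socketTextStream("sensor-host-b", 9999)

    // Same record schema on both sides, so union yields one logical stream.
    val allSensors = sensorA.union(sensorB)
    allSensors.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}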








--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Join-streams-Apache-Spark-tp28603p28670.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-09 Thread Matthew cao
Hi,
I have tried a simple test like this:
case class A(id: Long)
val sample = spark.range(0,10).as[A]
sample.createOrReplaceTempView("sample")
val df = spark.emptyDataset[A]
val df1 = spark.sql("select * from sample").as[A]
df.union(df1)

It runs OK. As for nullability, I thought that issue had been fixed:
https://issues.apache.org/jira/browse/SPARK-18058

I think you should check your Spark version and the schema of your datasets
again. Hope this helps.

Best,
> On 2017年5月9日, at 04:56, Dirceu Semighini Filho  
> wrote:
> 
> Ok, great,
> Well I havn't provided a good example of what I'm doing. Let's assume that my 
> case  class is 
> case class A(tons of fields, with sub classes)
> 
> val df = sqlContext.sql("select * from a").as[A]
> 
> val df2 = spark.emptyDataset[A]
> 
> df.union(df2)
> 
> This code will throw the exception.
> Is this expected? I assume that when I do as[A] it will convert the schema to 
> the case class schema, and it shouldn't throw the exception, or this will be 
> done lazy when the union is been processed?
> 
> 
> 
> 2017-05-08 17:50 GMT-03:00 Burak Yavuz  >:
> Yes, unfortunately. This should actually be fixed, and the union's schema 
> should have the less restrictive of the DataFrames.
> 
> On Mon, May 8, 2017 at 12:46 PM, Dirceu Semighini Filho 
> > wrote:
> HI Burak, 
> By nullability you mean that if I have the exactly the same schema, but one 
> side support null and the other doesn't, this exception (in union dataset) 
> will be thrown? 
> 
> 
> 
> 2017-05-08 16:41 GMT-03:00 Burak Yavuz  >:
> I also want to add that generally these may be caused by the `nullability` 
> field in the schema. 
> 
> On Mon, May 8, 2017 at 12:25 PM, Shixiong(Ryan) Zhu  > wrote:
> This is because RDD.union doesn't check the schema, so you won't see the 
> problem unless you run RDD and hit the incompatible column problem. For RDD, 
> You may not see any error if you don't use the incompatible column.
> 
> Dataset.union requires compatible schema. You can print ds.schema and 
> ds1.schema and check if they are same.
> 
> On Mon, May 8, 2017 at 11:07 AM, Dirceu Semighini Filho 
> > wrote:
> Hello,
> I've a very complex case class structure, with a lot of fields.
> When I try to union two datasets of this class, it doesn't work with the 
> following error :
> ds.union(ds1)
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Union can 
> only be performed on tables with the compatible column types
> 
> But when use it's rdd, the union goes right:
> ds.rdd.union(ds1.rdd)
> res8: org.apache.spark.rdd.RDD[
> 
> Is there any reason for this to happen (besides a bug ;) )
> 
> 
> 
> 
> 
> 
> 
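
A minimal sketch of one way to line up nullability before a union (an
illustration only, not code from this thread): rebuild one side against the
other side's schema, so the two Datasets end up with identical column types
and nullability.

import org.apache.spark.sql.SparkSession

object AlignNullabilityForUnion {
  case class A(id: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("union-example").getOrCreate()
    import spark.implicits._

    val left  = spark.range(0, 10).as[A]
    val right = spark.emptyDataset[A]

    // If the two schemas differ only in nullability, force one side through
    // the other side's schema before the union.
    val alignedRight = spark.createDataFrame(right.toDF.rdd, left.schema).as[A]
    left.union(alignedRight).show()

    spark.stop()
  }
}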



how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
I'm trying to use Encoders.bean() to create an encoder for my custom class,
but it fails, complaining that it can't find the schema:

class Person4 {
  @scala.beans.BeanProperty def setX(x: Int): Unit = {}
  @scala.beans.BeanProperty def getX(): Int = { 1 }
}

val personEncoder = Encoders.bean[Person4](classOf[Person4])

scala> val person_rdd = sc.parallelize(Array( (new Person4(), 1), (new Person4(), 2) ))
person_rdd: org.apache.spark.rdd.RDD[(Person4, Int)] = ParallelCollectionRDD[1] at parallelize at :31

scala> sqlcontext.createDataFrame(person_rdd)
java.lang.UnsupportedOperationException: Schema for type Person4 is not supported
  at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:716)
  at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$2.apply(ScalaReflection.scala:712)
  at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$2.apply(ScalaReflection.scala:711)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  ...

But if you look at the encoder's schema, the system does seem to understand
the schema for "Person4":

scala> personEncoder.schema
res38: org.apache.spark.sql.types.StructType =
StructType(StructField(x,IntegerType,false))


RDD.cacheDataSet() not working intermittently

2017-05-09 Thread jasbir.sing
Hi,

I have a scenario in which I am caching my RDDs for future use, but I observed
that when I use such an RDD, the complete DAG is re-executed and the RDD gets
created again. How can I avoid this and make sure that RDD.cacheDataSet()
caches the RDD every time?

Regards,
Jasbir Singh
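
A minimal sketch of the standard pattern (cacheDataSet() is not part of the
stock RDD API, so the sketch below uses the built-in persist/cache calls; the
data is made up): persisting is lazy, and the cache is only materialized by
the first action that runs on the RDD.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("CacheExample"))

    val rdd = sc.parallelize(1 to 1000000).map(_ * 2)

    // persist/cache only marks the RDD; nothing is stored yet.
    val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)

    cached.count()          // first action materializes the cached partitions
    println(cached.sum())   // later actions reuse them instead of re-running the DAG

    sc.stop()
  }
}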





How to read large size files from a directory ?

2017-05-09 Thread ashwini anand
I am reading each file of a directory using wholeTextFiles. After that I am
calling a function on each element of the RDD using map. The whole program
uses just the first 50 lines of each file. The code is as below:

def processFiles(fileNameContentsPair):
    fileName = fileNameContentsPair[0]
    result = "\n\n" + fileName
    resultEr = "\n\n" + fileName
    input = StringIO.StringIO(fileNameContentsPair[1])
    reader = csv.reader(input, strict=True)
    try:
        i = 0
        for row in reader:
            if i == 50:
                break
            # do some processing and get result string
            i = i + 1
    except csv.Error as e:
        resultEr = resultEr + "error occured\n\n"
        return resultEr
    return result

if __name__ == "__main__":
    inputFile = sys.argv[1]
    outputFile = sys.argv[2]
    sc = SparkContext(appName="SomeApp")
    resultRDD = sc.wholeTextFiles(inputFile).map(processFiles)
    resultRDD.saveAsTextFile(outputFile)

The files in the directory can be very large in my case, so using the
wholeTextFiles API is inefficient here: wholeTextFiles loads the full file
content into memory. Can we make wholeTextFiles load only the first 50 lines
of each file? Apart from wholeTextFiles, the other solution I can think of is
iterating over each file of the directory one by one, but that also seems
inefficient. I am new to Spark. Please let me know if there is an efficient
way to do this.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-large-size-files-from-a-directory-tp28669.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.