UDF error with Spark 2.4 on scala 2.12

2019-01-17 Thread Andrés Ivaldi
Hello, I'm having problems with a UDF. I was reading a bit about it, and it
looks like a closure issue, but I don't know how to fix it; it works fine
on 2.11.

my code for the UDF definition (I tried several possibilities; this is the last one)

val o: org.apache.spark.sql.api.java.UDF2[java.sql.Timestamp, String, Int] =
  (date: java.sql.Timestamp, periodExp: String) => {
    11
  }

val o2: org.apache.spark.sql.api.java.UDF2[java.sql.Date, String, Int] =
  (date: java.sql.Date, periodExp: String) => {
    11
  }

sc.udf.register("timePeriod", o2, DataTypes.IntegerType)
sc.udf.register("timePeriod", o, DataTypes.IntegerType)


exception
Caused by: java.io.NotSerializableException: scala.runtime.LazyRef
[info] Serialization stack:
[info]  - object not serializable (class: scala.runtime.LazyRef, value:
LazyRef thunk)
[info]  - element of array (index: 3)
[info]  - array (class [Ljava.lang.Object;, size 5)
[info]  - field (class: java.lang.invoke.SerializedLambda, name:
capturedArgs, type: class [Ljava.lang.Object;)
[info]  - object (class java.lang.invoke.SerializedLambda,
SerializedLambda[capturingClass=class
org.apache.spark.sql.catalyst.expressions.ScalaUDF,
functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;,
implementation=invokeStatic
org/apache/spark/sql/catalyst/expressions/ScalaUDF.$anonfun$f$3:(Lscala/Function2;Lorg/apache/spark/sql/catalyst/expressions/Expression;Lorg/apache/spark/sql/catalyst/expressions/Expression;Lscala/runtime/LazyRef;Lscala/runtime/LazyRef;Lorg/apache/spark/sql/catalyst/InternalRow;)Ljava/lang/Object;,
instantiatedMethodType=(Lorg/apache/spark/sql/catalyst/InternalRow;)Ljava/lang/Object;,
numCaptured=5])
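
Note that the LazyRef is captured inside Spark's own ScalaUDF wrapper
(capturingClass=ScalaUDF above), not in my code. A workaround I am
considering (untested; it is only an assumption that registering a plain
Scala function instead of the Java UDF2 interface sidesteps that wrapper):

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// same behaviour as the UDF2 above, but registered as a Scala function,
// so Spark infers the return type itself
spark.udf.register("timePeriod", (date: Timestamp, periodExp: String) => 11)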

Regards,
-- 
Ing. Ivaldi Andres


Re: Spark Core - Embed in other application

2018-12-11 Thread Andrés Ivaldi
Hi, yes you can; I've also developed an engine to perform ETL.

I've built a REST service with Akka, with a method called "execute" that
receives a JSON structure representing the ETL.
You just need to configure your embedded standalone Spark. I did something
like this, in Scala:

val spark = SparkSession
  .builder().master("local[8]")
  .appName("xxx")
  .config("spark.sql.warehouse.dir", config.getString("cache.directory"))
  .config("spark.driver.memory", "11g")
  .config("spark.executor.cores", "8")
  .config("spark.shuffle.compress", false)
  .config("spark.executor.memory", "11g")
  .config("spark.cores.max", "12")
  .config("spark.deploy.defaultCores", "3")
  .config("spark.driver.maxResultSize", "0")
  .config("spark.default.parallelism", "9")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryoserializer.buffer", "1024k")
  .config("spark.sql.shuffle.partitions", "1")
  .config("spark.kryo.unsafe", true)
  .config("spark.ui.port", "4041")
  .enableHiveSupport()
  .getOrCreate()

(This will crash if you already have a Spark instance running...)

And use the spark variable as you wish.
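
For example, each "execute" request can then just run DataFrame
transformations against that session; a rough sketch (the source path and
column names are only illustrative):

import spark.implicits._

val df = spark.read.parquet("/data/input/sales.parquet")
val result = df
  .filter($"amount" > 0)
  .groupBy("region")
  .sum("amount")

result.write.mode("overwrite").parquet("/data/output/sales_by_region")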




On Thu, Dec 6, 2018 at 9:23 PM sparkuser99 
wrote:

> Hi,
>
> I have a use case to process simple ETL like jobs. The data volume is very
> less (less than few GB), and can fit easily on my running java
> application's
> memory. I would like to take advantage of Spark dataset api, but don't need
> any spark setup (Standalone / Cluster ). Can I embed spark in existing Java
> application and still use ?
>
> I heard local spark mode is only for testing. For small data sets like, can
> this still be used in production? Please advice if any disadvantages.
>
> Regards
> Reddy
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-- 
Ing. Ivaldi Andres


Spark version performance

2018-12-11 Thread Andrés Ivaldi
Hello list,

I'm having a little performance issue, with different Spark versions.

I have an embedded Spark application written in Scala. Initially I used
Spark 2.0.2 and it worked fine, with good response times, but when I updated
to 2.3.2, with no code changes at all, it became slower.

Mainly what the application does is gather information from a source, apply
transformations with filters, and perform aggregations over it. Its source
is mainly Parquet, and no write is done, just a serialization of the result.

Maybe I'm using deprecated API functions, or the order of the operations is
not generating a good plan...
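
One way to narrow it down (a sketch; df and the column names are placeholders
for the real pipeline) is to dump the plans in both versions and diff them:

import org.apache.spark.sql.functions.{col, sum}

val result = df.filter(col("amount") > 0).groupBy("key").agg(sum("amount"))
result.explain(true)   // parsed, analyzed, optimized and physical plans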

Can someone give me an idea of any change between these versions that could
cause this behavior?

Regards,

-- 
Ing. Ivaldi Andres


Re: Select entire row based on a logic applied on 2 columns across multiple rows

2017-08-30 Thread Andrés Ivaldi
I see. As @ayan said, it's valid, but why not use the API or SQL? The
built-in options are optimized.
I understand that the SQL route is hard when you are trying to build an API
on top of it, but the Spark API isn't, and you can do a lot with it.
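
A sketch of the same idea through the DataFrame API instead of raw SQL
(untested; df stands for the input DataFrame and the column names follow
@ayan's example):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last, row_number}

val w = Window.partitionBy("id").orderBy(col("income_age_ts").desc)

val latest = df
  .withColumn("r", row_number().over(w))
  .withColumn("last_income", last("income").over(w))
  .filter(col("r") === 1)
  .drop("r")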

regards,


On Wed, Aug 30, 2017 at 10:31 AM, ayan guha <guha.a...@gmail.com> wrote:

> Well, using raw sql is a valid option, but if you do not want you can
> always implement the concept using apis. All these constructs have api
> counterparts, such as filter, window, over, row number etc.
>
> On Wed, 30 Aug 2017 at 10:49 pm, purna pradeep <purna2prad...@gmail.com>
> wrote:
>
>> @Andres I need latest but it should less than 10 months based income_age
>> column and don't want to use sql here
>>
>> On Wed, Aug 30, 2017 at 8:08 AM Andrés Ivaldi <iaiva...@gmail.com> wrote:
>>
>>> Hi, if you need the last value from income in window function you can
>>> use last_value.
>>> No tested but meaby with @ayan sql
>>>
>>> spark.sql("select *, row_number(), last_value(income) over (partition by
>>> id order by income_age_ts desc) r from t")
>>>
>>>
>>> On Tue, Aug 29, 2017 at 11:30 PM, purna pradeep <purna2prad...@gmail.com
>>> > wrote:
>>>
>>>> @ayan,
>>>>
>>>> Thanks for your response
>>>>
>>>> I would like to have functions in this case  calculateIncome and the
>>>> reason why I need function is to reuse in other parts of the application
>>>> ..that's the reason I'm planning for mapgroups with function as argument
>>>> which takes rowiterator ..but not sure if this is the best to implement as
>>>> my initial dataframe is very large
>>>>
>>>> On Tue, Aug 29, 2017 at 10:24 PM ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> the tool you are looking for is window function.  Example:
>>>>>
>>>>> >>> df.show()
>>>>> +++---+--+-+
>>>>> |JoinDate|dept| id|income|income_age_ts|
>>>>> +++---+--+-+
>>>>> | 4/20/13|  ES|101| 19000|  4/20/17|
>>>>> | 4/20/13|  OS|101| 1|  10/3/15|
>>>>> | 4/20/12|  DS|102| 13000|   5/9/17|
>>>>> | 4/20/12|  CS|102| 12000|   5/8/17|
>>>>> | 4/20/10|  EQ|103| 1|   5/9/17|
>>>>> | 4/20/10|  MD|103|  9000|   5/8/17|
>>>>> +++---+--+-+
>>>>>
>>>>> >>> res = spark.sql("select *, row_number() over (partition by id
>>>>> order by income_age_ts desc) r from t")
>>>>> >>> res.show()
>>>>> +++---+--+-+---+
>>>>> |JoinDate|dept| id|income|income_age_ts|  r|
>>>>> +++---+--+-+---+
>>>>> | 4/20/10|  EQ|103| 1|   5/9/17|  1|
>>>>> | 4/20/10|  MD|103|  9000|   5/8/17|  2|
>>>>> | 4/20/13|  ES|101| 19000|  4/20/17|  1|
>>>>> | 4/20/13|  OS|101| 1|  10/3/15|  2|
>>>>> | 4/20/12|  DS|102| 13000|   5/9/17|  1|
>>>>> | 4/20/12|  CS|102| 12000|   5/8/17|  2|
>>>>> +++---+--+-+---+
>>>>>
>>>>> >>> res = spark.sql("select * from (select *, row_number() over
>>>>> (partition by id order by income_age_ts desc) r from t) x where r=1")
>>>>> >>> res.show()
>>>>> +++---+--+-+---+
>>>>> |JoinDate|dept| id|income|income_age_ts|  r|
>>>>> +++---+--+-+---+
>>>>> | 4/20/10|  EQ|103| 1|   5/9/17|  1|
>>>>> | 4/20/13|  ES|101| 19000|  4/20/17|  1|
>>>>> | 4/20/12|  DS|102| 13000|   5/9/17|  1|
>>>>> +++---+--+-+---+
>>>>>
>>>>> This should be better because it uses all in-built optimizations in
>>>>> Spark.
>>>>>
>>>>> Best
>>>>> Ayan
>>>>>
>>>>> On Wed, Aug 30, 2017 at 11:06 AM, purna pradeep <
>>>>> purna2prad...@gmail.com> wrote:
>>>>>
>>>>>> Please click on unnamed text/html  link for better view
>>>>>>
>>>>>> On Tue, Aug 29, 2017 at 8:11 PM purna pradeep <
>&

Re: Select entire row based on a logic applied on 2 columns across multiple rows

2017-08-30 Thread Andrés Ivaldi
Hi, if you need the last value of income in a window function you can use
last_value.
Not tested, but maybe with @ayan's SQL:

spark.sql("select *, row_number(), last_value(income) over (partition by id
order by income_age_ts desc) r from t")


On Tue, Aug 29, 2017 at 11:30 PM, purna pradeep 
wrote:

> @ayan,
>
> Thanks for your response
>
> I would like to have functions in this case  calculateIncome and the
> reason why I need function is to reuse in other parts of the application
> ..that's the reason I'm planning for mapgroups with function as argument
> which takes rowiterator ..but not sure if this is the best to implement as
> my initial dataframe is very large
>
> On Tue, Aug 29, 2017 at 10:24 PM ayan guha  wrote:
>
>> Hi
>>
>> the tool you are looking for is window function.  Example:
>>
>> >>> df.show()
>> +++---+--+-+
>> |JoinDate|dept| id|income|income_age_ts|
>> +++---+--+-+
>> | 4/20/13|  ES|101| 19000|  4/20/17|
>> | 4/20/13|  OS|101| 1|  10/3/15|
>> | 4/20/12|  DS|102| 13000|   5/9/17|
>> | 4/20/12|  CS|102| 12000|   5/8/17|
>> | 4/20/10|  EQ|103| 1|   5/9/17|
>> | 4/20/10|  MD|103|  9000|   5/8/17|
>> +++---+--+-+
>>
>> >>> res = spark.sql("select *, row_number() over (partition by id order
>> by income_age_ts desc) r from t")
>> >>> res.show()
>> +++---+--+-+---+
>> |JoinDate|dept| id|income|income_age_ts|  r|
>> +++---+--+-+---+
>> | 4/20/10|  EQ|103| 1|   5/9/17|  1|
>> | 4/20/10|  MD|103|  9000|   5/8/17|  2|
>> | 4/20/13|  ES|101| 19000|  4/20/17|  1|
>> | 4/20/13|  OS|101| 1|  10/3/15|  2|
>> | 4/20/12|  DS|102| 13000|   5/9/17|  1|
>> | 4/20/12|  CS|102| 12000|   5/8/17|  2|
>> +++---+--+-+---+
>>
>> >>> res = spark.sql("select * from (select *, row_number() over
>> (partition by id order by income_age_ts desc) r from t) x where r=1")
>> >>> res.show()
>> +++---+--+-+---+
>> |JoinDate|dept| id|income|income_age_ts|  r|
>> +++---+--+-+---+
>> | 4/20/10|  EQ|103| 1|   5/9/17|  1|
>> | 4/20/13|  ES|101| 19000|  4/20/17|  1|
>> | 4/20/12|  DS|102| 13000|   5/9/17|  1|
>> +++---+--+-+---+
>>
>> This should be better because it uses all in-built optimizations in Spark.
>>
>> Best
>> Ayan
>>
>> On Wed, Aug 30, 2017 at 11:06 AM, purna pradeep 
>> wrote:
>>
>>> Please click on unnamed text/html  link for better view
>>>
>>> On Tue, Aug 29, 2017 at 8:11 PM purna pradeep 
>>> wrote:
>>>

 -- Forwarded message -
 From: Mamillapalli, Purna Pradeep 
 Date: Tue, Aug 29, 2017 at 8:08 PM
 Subject: Spark question
 To: purna pradeep 

 Below is the input Dataframe (in real life this is a very large Dataframe)

 EmployeeID | INCOME | INCOME AGE TS | JoinDate | Dept
 -----------+--------+---------------+----------+-----
 101        | 19000  | 4/20/17       | 4/20/13  | ES
 101        | 1      | 10/3/15       | 4/20/13  | OS
 102        | 13000  | 5/9/17        | 4/20/12  | DS
 102        | 12000  | 5/8/17        | 4/20/12  | CS
 103        | 1      | 5/9/17        | 4/20/10  | EQ
 103        | 9000   | 5/8/15        | 4/20/10  | MD

 Get the latest income of an employee which has Income_age ts < 10 months

 Expected output Dataframe

 EmployeeID | INCOME | INCOME AGE TS | JoinDate | Dept
 -----------+--------+---------------+----------+-----
 101        | 19000  | 4/20/17       | 4/20/13  | ES
 102        | 13000  | 5/9/17        | 4/20/12  | DS
 103        | 1      | 5/9/17        | 4/20/10  | EQ



>>>
>>>
>>>
>>>
>>>
>>> Below is what I'm planning to implement

 case class employee(EmployeeID: Int, INCOME: Int, INCOMEAGE: Int,
   JOINDATE: Int, DEPT: String)

 val empSchema = new StructType().add("EmployeeID", "Int").add(
   "INCOME", "Int").add("INCOMEAGE", "Date").add("JOINDATE",
   "Date").add("DEPT", "String")

 //Reading from the File
 import sparkSession.implicits._

 val readEmpFile = sparkSession.read
   .option("sep", ",")
   .schema(empSchema)
   .csv(INPUT_DIRECTORY)

 //Create employee DataFrame
 val custDf = readEmpFile.as[employee]

 //Adding Salary Column
 val groupByDf = custDf.groupByKey(a => a.EmployeeID)

 val k = groupByDf.mapGroups((key,value) => 

Re: Spark 2.0.0 and Hive metastore

2017-08-29 Thread Andrés Ivaldi
Every comment are welcome

If I'm not wrong, it's because we are using the percentile aggregation, which
comes with Hive support; apart from that, nothing else.


On Tue, Aug 29, 2017 at 11:23 AM, Jean Georges Perrin <j...@jgp.net> wrote:

> Sorry if my comment is not helping, but... why do you need Hive? Can't you
> save your aggregation using parquet for example?
>
> jg
>
>
> > On Aug 29, 2017, at 08:34, Andrés Ivaldi <iaiva...@gmail.com> wrote:
> >
> > Hello, I'm using Spark API and with Hive support, I dont have a Hive
> instance, just using Hive for some aggregation functions.
> >
> > The problem is that Hive crete the hive and metastore_db folder at the
> temp folder, I want to change that location
> >
> > Regards.
> >
> > --
> > Ing. Ivaldi Andres
>
>


-- 
Ing. Ivaldi Andres


Spark 2.0.0 and Hive metastore

2017-08-29 Thread Andrés Ivaldi
Hello, I'm using the Spark API with Hive support. I don't have a Hive
instance; I'm just using Hive for some aggregation functions.

The problem is that Hive creates the hive and metastore_db folders in the
temp folder, and I want to change that location.
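
The kind of configuration I mean, as a sketch (the paths are illustrative,
and pointing the embedded Derby metastore at a directory through
javax.jdo.option.ConnectionURL is an assumption based on the standard Hive
metastore properties):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "/data/spark/warehouse")
  .config("javax.jdo.option.ConnectionURL",
    "jdbc:derby:;databaseName=/data/spark/metastore_db;create=true")
  .enableHiveSupport()
  .getOrCreate()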

Regards.

-- 
Ing. Ivaldi Andres


Re: UDF percentile_approx

2017-06-14 Thread Andrés Ivaldi
Hello,
Riccardo, I was able to make it run. The problem is that HiveContext doesn't
exist any more in Spark 2.0.2, as far as I can see, but the method
enableHiveSupport exists to add the Hive functionality to SparkSession. To
enable this, the spark-hive_2.11 dependency is needed.

In the Spark API docs this is not well explained; it only says that
SQLContext and HiveContext are now part of SparkSession:

"SparkSession is now the new entry point of Spark that replaces the old
SQLContext and HiveContext. Note that the old SQLContext and HiveContext
are kept for backward compatibility. A new catalog interface is accessible
from SparkSession - existing API on databases and tables access such as
listTables, createExternalTable, dropTempView, cacheTable are moved here."

I think it would be a good idea to document enableHiveSupport as well. A
minimal sketch of what worked for me is below.
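
(The spark-hive version should match your Spark build; 2.0.2 below is an
assumption.)

// build.sbt: libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.2"
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._
Seq(1.0, 8.0).toDF("a").selectExpr("percentile_approx(a, 0.5)").show()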

Thanks,

On Wed, Jun 14, 2017 at 5:13 AM, Takeshi Yamamuro <linguin@gmail.com>
wrote:

> You can use the function w/o hive and you can try:
>
> scala> Seq(1.0, 8.0).toDF("a").selectExpr("percentile_approx(a,
> 0.5)").show
>
> ++
>
> |percentile_approx(a, CAST(0.5 AS DOUBLE), 1)|
>
> ++
>
> | 8.0|
>
> ++
>
>
> // maropu
>
>
>
> On Wed, Jun 14, 2017 at 5:04 PM, Riccardo Ferrari <ferra...@gmail.com>
> wrote:
>
>> Hi Andres,
>>
>> I can't find the refrence, last time I searched for that I found that
>> 'percentile_approx' is only available via hive context. You should register
>> a temp table and use it from there.
>>
>> Best,
>>
>> On Tue, Jun 13, 2017 at 8:52 PM, Andrés Ivaldi <iaiva...@gmail.com>
>> wrote:
>>
>>> Hello, I`m trying to user percentile_approx  on my SQL query, but It's
>>> like spark context can´t find the function
>>>
>>> I'm using it like this
>>> import org.apache.spark.sql.functions._
>>> import org.apache.spark.sql.DataFrameStatFunctions
>>>
>>> val e = expr("percentile_approx(Cantidadcon0234514)")
>>> df.agg(e).show()
>>>
>>> and exception is
>>>
>>> org.apache.spark.sql.AnalysisException: Undefined function:
>>> 'percentile_approx'. This function is neither a registered temporary
>>> function nor a permanent function registered
>>>
>>> I've also tryid with callUDF
>>>
>>> Regards.
>>>
>>> --
>>> Ing. Ivaldi Andres
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>



-- 
Ing. Ivaldi Andres


UDF percentile_approx

2017-06-13 Thread Andrés Ivaldi
Hello, I'm trying to use percentile_approx in my SQL query, but it's like the
Spark context can't find the function.

I'm using it like this
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrameStatFunctions

val e = expr("percentile_approx(Cantidadcon0234514)")
df.agg(e).show()

and exception is

org.apache.spark.sql.AnalysisException: Undefined function:
'percentile_approx'. This function is neither a registered temporary
function nor a permanent function registered

I've also tried with callUDF.

Regards.

-- 
Ing. Ivaldi Andres


Re: why we can t apply udf on rdd ???

2017-04-13 Thread Andrés Ivaldi
Hi,
what Spark version are you using?
Did you register the UDF?
How are you using the UDF?
Does the UDF support that data type as parameter?

What I do with Spark 2.0 is (a minimal sketch follows below):
- Create the UDF for each data type I need
- Register the UDF with the SQLContext/SparkSession
- Use the UDF over a DataFrame, not an RDD (you can convert one easily)
- Then dataSet.withColumn(myColumn, expr("myUDF(myColumn)"))
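
A minimal sketch of that flow (all names are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// 1. Define and register a UDF for the data type it handles.
spark.udf.register("myUDF", (s: String) => s.trim.toUpperCase)

// 2. Work on a DataFrame instead of an RDD (rdd.toDF converts one).
val df = Seq("  a ", " b ").toDF("myColumn")

// 3. Apply the registered UDF through an expression.
df.withColumn("myColumn", expr("myUDF(myColumn)")).show()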





On Thu, Apr 13, 2017 at 6:52 AM, issues solution 
wrote:

> hi
> what kind of orgine of this error  ???
>
> java.lang.UnsupportedOperationException: Cannot evaluate expression:
> PythonUDF#Grappra(input[410, StringType])
>
>
> regrads
>
>


-- 
Ing. Ivaldi Andres


Exception on Join with Spark2.1

2017-04-11 Thread Andrés Ivaldi
Hello, I'm using Spark embedded. So far with Spark 2.0.2 everything was ok;
after updating Spark to 2.1.0, I'm having problems when joining two Datasets.

The queries are generated dynamically, but I have two Datasets: one with a
window function applied, and the other being the same Dataset before the
window function is applied.

I can get the data from each of them before the join, but after it I'm
getting this exception:


[11/04/2017 15:55:12 - default-akka.actor.default-dispatcher-6] ERROR
ar.com.visionaris.engine.Engine  -
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute,
tree:
Exchange hashpartitioning(TiempoAAMM#67, 2)
+- *Sort [TiempoAAMM#67 ASC NULLS FIRST], true, 0
   +- *Sort [TiempoAAMM#67 ASC NULLS FIRST], true, 0
  +- Exchange rangepartitioning(TiempoAAMM#67 ASC NULLS FIRST, 2)
 +- *HashAggregate(keys=[TiempoAAMM#67, spark_grouping_id#65],
functions=[sum(Cantidad#23)], output=[TiempoAAMM#67, Cantidad_SUM#96])
+- Exchange hashpartitioning(TiempoAAMM#67,
spark_grouping_id#65, 2)
   +- *HashAggregate(keys=[TiempoAAMM#67,
spark_grouping_id#65], functions=[partial_sum(Cantidad#23)],
output=[TiempoAAMM#67, spark_grouping_id#65, sum#160])
  +- *Filter isnotnull(TiempoAAMM#67)
 +- *!Expand [ArrayBuffer(Cantidad#23, TiempoAAMM#36,
0), ArrayBuffer(Cantidad#23, null, 1)], [Cantidad#23, TiempoAAMM#67,
spark_grouping_id#65]
+- InMemoryTableScan [Cantidad#23]
  +- InMemoryRelation [Cantidad#23,
TiempoAAMM#36], true, 1, StorageLevel(disk, memory, deserialized, 1
replicas)
+- *Project [Cantidad#6 AS Cantidad#23,
TiempoAAMM#4 AS TiempoAAMM#36]
   +- *FileScan parquet
[TiempoAAMM#4,Cantidad#6] Batched: true, Format: Parquet, Location:
InMemoryFileIndex[file:/tmp/vs/cache/dataSet-16], PartitionFilters: [],
PushedFilters: [], ReadSchema:
struct

at
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at
org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:112)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at
org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:235)
at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:121)
at
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:368)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at
org.apache.spark.sql.execution.InputAdapter.doExecute(WholeStageCodegenExec.scala:227)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at
org.apache.spark.sql.execution.joins.SortMergeJoinExec.inputRDDs(SortMergeJoinExec.scala:326)
at
org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42)
at
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:368)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)

Re: Grouping Set

2016-11-17 Thread Andrés Ivaldi
I realize that my data has null values, so those nulls come from the values,
not from the calculated grouping set. But that leads to another problem: how
can I detect which one it is? Now I have this:

If my data is just a row like [ {1:"A", 2:null, 3:123} ], the grouping set
(1) will give me
A, null, 123
A, null, 123

and with [ {1:"A", 2:null, 3:123}, {1:"A", 2:"b", 3:1} ] the grouping set (1)
will give me
A, null, 124
A, null, 123
A, b, 1

A quick fix could be isNull with a label that I can detect, but that's too
dirty I think; the grouping set should return something that can be detected
as the grouped-away column, not null.
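
It looks like the grouping()/grouping_id() functions do exactly this; a
sketch, assuming a table t(a, b, c) and spark being the SparkSession:

spark.sql("""
  select a, b, sum(c) as sum_c, grouping(b) as b_rolled_up
  from t
  group by a, b grouping sets ((a), (a, b))
""").show()
// b_rolled_up = 1 -> b was aggregated away by the grouping set
// b_rolled_up = 0 -> b comes from the data (and may itself be a real null)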


On Mon, Nov 14, 2016 at 5:49 PM, ayan guha <guha.a...@gmail.com> wrote:

> And, run the same SQL in hive and post any difference.
> On 15 Nov 2016 07:48, "ayan guha" <guha.a...@gmail.com> wrote:
>
>> It should be A,yes. Can you please reproduce this with small data and
>> exact SQL?
>> On 15 Nov 2016 02:21, "Andrés Ivaldi" <iaiva...@gmail.com> wrote:
>>
>>> Hello, I'm tryin to use Grouping Set, but I dont know if it is a bug or
>>> the correct behavior.
>>>
>>> Givven the above example
>>> Select a,b,sum(c) from table group by a,b grouping set ( (a), (a,b) )
>>>
>>> What shound be the expected result
>>> A:
>>>
>>> A  | B| sum(c)
>>> xx | null | 
>>> xx | yy   | 
>>> xx | zz   | 
>>>
>>>
>>> B
>>> A   | B| sum(c)
>>> xx  | null | 
>>> xx  | yy   | 
>>> xx  | zz   | 
>>> null| yy   | 
>>> null| zz   | 
>>> null| null | 
>>>
>>>
>>> I believe is A, but i'm getting B
>>> thanks
>>>
>>> --
>>> Ing. Ivaldi Andres
>>>
>>


-- 
Ing. Ivaldi Andres


Grouping Set

2016-11-14 Thread Andrés Ivaldi
Hello, I'm trying to use Grouping Set, but I don't know if what I see is a
bug or the correct behavior.

Given the example below:
Select a,b,sum(c) from table group by a,b grouping set ( (a), (a,b) )

What should be the expected result?
A:

A  | B| sum(c)
xx | null | 
xx | yy   | 
xx | zz   | 


B
A   | B| sum(c)
xx  | null | 
xx  | yy   | 
xx  | zz   | 
null| yy   | 
null| zz   | 
null| null | 


I believe it is A, but I'm getting B.
thanks

-- 
Ing. Ivaldi Andres


Re: DataSet toJson

2016-11-08 Thread Andrés Ivaldi
Ok, digging into the code, I found the following method in the
JacksonGenerator class:

private def writeFields(
  row: InternalRow, schema: StructType, fieldWriters: Seq[ValueWriter]):
Unit = {
  var i = 0
  while (i < row.numFields) {
val field = schema(i)
if (!row.isNullAt(i)) {
  gen.writeFieldName(field.name)
  fieldWriters(i).apply(row, i)
}
i += 1
  }
}

So null values are silently skipped; I would have to rewrite toJSON to use my
own JacksonGenerator.
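
In the meantime, a rough workaround sketch (not the built-in behaviour, and
it naively writes every non-null value as a string):

import com.fasterxml.jackson.databind.ObjectMapper

val schema = df.schema
val jsonWithNulls = df.rdd.mapPartitions { rows =>
  val mapper = new ObjectMapper()
  rows.map { row =>
    val node = mapper.createObjectNode()
    schema.fields.zipWithIndex.foreach { case (f, i) =>
      if (row.isNullAt(i)) node.putNull(f.name)      // keep the null field
      else node.put(f.name, row.get(i).toString)
    }
    mapper.writeValueAsString(node)
  }
}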

Regards.



On Tue, Nov 8, 2016 at 10:06 AM, Andrés Ivaldi <iaiva...@gmail.com> wrote:

> Hello, I'm using spark 2.0 and I'm using toJson method. I've seen that
> Null values are omitted in the Json Record, witch is valid, but I need the
> field with null as value, it's possible to configure that?
>
> thanks.
>
>


-- 
Ing. Ivaldi Andres


DataSet toJson

2016-11-08 Thread Andrés Ivaldi
Hello, I'm using Spark 2.0 and the toJSON method. I've seen that null values
are omitted in the JSON record, which is valid, but I need the field present
with null as its value. Is it possible to configure that?

thanks.


Re: Aggregation Calculation

2016-11-04 Thread Andrés Ivaldi
Ok, so I've read that rollup is just syntactic sugar for GROUPING SETS(...),
so in that case I just need to use GROUPING SETS. But the examples in the
documentation use GROUPING SETS with SQL syntax, and I am doing this
programmatically, so I need the Dataset API, something like ds.rollup(..) but
for grouping sets.

Does anyone know how to do it? The only fallback I can think of is sketched
below.
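
A sketch of that fallback (ds is the Dataset, spark the SparkSession, and the
table/column names are illustrative): register a temp view and drop to SQL
just for the grouping-sets step.

ds.createOrReplaceTempView("t")

val result = spark.sql("""
  select Year, Month, sum(Import) as Import_SUM
  from t
  group by Year, Month grouping sets ((Year), (Year, Month))
""")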

thanks.



On Thu, Nov 3, 2016 at 5:17 PM, Andrés Ivaldi <iaiva...@gmail.com> wrote:

> I'm not sure about inline views, it will still performing aggregation that
> I don't need. I think I didn't explain right, I've already filtered the
> values that I need, the problem is that default calculation of rollUp give
> me some calculations that I don't want like only aggregation by the second
> column.
> Suppose tree columns (DataSet Columns) Year, Moth, Import, and I want
> aggregation sum(Import), and the combination of all Year/Month Sum(import),
> also Year Sum(import), but Mont Sum(import) doesn't care
>
> in table it will looks like
>
> YEAR | MOTH | Sum(Import)
> 2006 | 1| 
> 2005 | 1| 
> 2005 | 2| 
> 2006 | null | 
> 2005 | null | 
> null | null | 
> null | 1| 
> null | 2| 
>
> the las tree rows are not needed, in this example I could perform
> filtering after rollUp i do the query by demand  so it will grow depending
> on number of rows and columns, and will be a lot of combinations that I
> don't need.
>
> thanks
>
>
>
>
>
> On Thu, Nov 3, 2016 at 4:04 PM, Stephen Boesch <java...@gmail.com> wrote:
>
>> You would likely want to create inline views that perform the filtering 
>> *before
>> *performing t he cubes/rollup; in this way the cubes/rollups only
>> operate on the pruned rows/columns.
>>
>> 2016-11-03 11:29 GMT-07:00 Andrés Ivaldi <iaiva...@gmail.com>:
>>
>>> Hello, I need to perform some aggregations and a kind of Cube/RollUp
>>> calculation
>>>
>>> Doing some test looks like Cube and RollUp performs aggregation over all
>>> posible columns combination, but I just need some specific columns
>>> combination.
>>>
>>> What I'm trying to do is like a dataTable where te first N columns are
>>> may rows and the second M values are my columns and the last columna are
>>> the aggregated values, like Dimension / Measures
>>>
>>> I need all the values of the N and M columns and the ones that
>>> correspond to the aggregation function. I'll never need the values that
>>> previous column has no value, ie
>>>
>>> having N=2 so two columns as rows I'll need
>>> R1 | R2  
>>> ##  |  ## 
>>> ##  |   null 
>>>
>>> but not
>>> null | ## 
>>>
>>> as roll up does, same approach to M columns
>>>
>>>
>>> So the question is what could be the better way to perform this
>>> calculation.
>>> Using rollUp/Cube give me a lot of values that I dont need
>>> Using groupBy give me less information ( I could do several groupBy but
>>> that is not performant, I think )
>>> Is any other way to something like that?
>>>
>>> Thanks.
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Ing. Ivaldi Andres
>>>
>>
>>
>
>
> --
> Ing. Ivaldi Andres
>



-- 
Ing. Ivaldi Andres


Re: Aggregation Calculation

2016-11-03 Thread Andrés Ivaldi
I'm not sure about inline views; they would still perform aggregations that I
don't need. I think I didn't explain it right: I've already filtered the
values that I need. The problem is that the default rollup calculation gives
me some aggregations that I don't want, like the aggregation by only the
second column.
Suppose three columns (Dataset columns): Year, Month, Import. I want the
aggregation sum(Import) for every Year/Month combination and also per Year,
but sum(Import) per Month alone doesn't matter.

in a table it would look like

YEAR | MONTH | Sum(Import)
2006 | 1| 
2005 | 1| 
2005 | 2| 
2006 | null | 
2005 | null | 
null | null | 
null | 1| 
null | 2| 

the last three rows are not needed. In this example I could filter after the
rollup, but I run the query on demand, so it will grow depending on the
number of rows and columns, and there will be a lot of combinations that I
don't need.

thanks





On Thu, Nov 3, 2016 at 4:04 PM, Stephen Boesch <java...@gmail.com> wrote:

> You would likely want to create inline views that perform the filtering 
> *before
> *performing t he cubes/rollup; in this way the cubes/rollups only operate
> on the pruned rows/columns.
>
> 2016-11-03 11:29 GMT-07:00 Andrés Ivaldi <iaiva...@gmail.com>:
>
>> Hello, I need to perform some aggregations and a kind of Cube/RollUp
>> calculation
>>
>> Doing some test looks like Cube and RollUp performs aggregation over all
>> posible columns combination, but I just need some specific columns
>> combination.
>>
>> What I'm trying to do is like a dataTable where te first N columns are
>> may rows and the second M values are my columns and the last columna are
>> the aggregated values, like Dimension / Measures
>>
>> I need all the values of the N and M columns and the ones that correspond
>> to the aggregation function. I'll never need the values that previous
>> column has no value, ie
>>
>> having N=2 so two columns as rows I'll need
>> R1 | R2  
>> ##  |  ## 
>> ##  |   null 
>>
>> but not
>> null | ## 
>>
>> as roll up does, same approach to M columns
>>
>>
>> So the question is what could be the better way to perform this
>> calculation.
>> Using rollUp/Cube give me a lot of values that I dont need
>> Using groupBy give me less information ( I could do several groupBy but
>> that is not performant, I think )
>> Is any other way to something like that?
>>
>> Thanks.
>>
>>
>>
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>
>


-- 
Ing. Ivaldi Andres


Aggregation Calculation

2016-11-03 Thread Andrés Ivaldi
Hello, I need to perform some aggregations and a kind of Cube/RollUp
calculation

Doing some tests, it looks like Cube and RollUp perform aggregation over all
possible column combinations, but I just need some specific combinations.

What I'm trying to build is like a data table where the first N columns are
my rows, the next M columns are my columns, and the last columns are the
aggregated values, like Dimension / Measures.

I need all the values of the N and M columns and the ones that correspond to
the aggregation function. I'll never need the rows where a previous column
has no value, i.e.

having N=2 so two columns as rows I'll need
R1 | R2  
##  |  ## 
##  |   null 

but not
null | ## 

as roll up does, same approach to M columns


So the question is: what could be the best way to perform this calculation?
Using rollUp/Cube gives me a lot of values that I don't need.
Using groupBy gives me less information (I could do several groupBy calls,
but that is not performant, I think).
Is there any other way to do something like that?

Thanks.





-- 
Ing. Ivaldi Andres


Re: Aggregations with scala pairs

2016-08-18 Thread Andrés Ivaldi
Thanks!!!

On Thu, Aug 18, 2016 at 3:35 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Agreed.
>
> Regards
> JB
> On Aug 18, 2016, at 07:32, Olivier Girardot <o.girardot@lateral-thoughts.
> com> wrote:
>>
>> CC'ing dev list,
>> you should open a Jira and a PR related to it to discuss it c.f.
>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#
>> ContributingtoSpark-ContributingCodeChanges
>>
>>
>>
>> On Wed, Aug 17, 2016 4:01 PM, Andrés Ivaldi iaiva...@gmail.com wrote:
>>
>>> Hello, I'd like to report a wrong behavior of DataSet's API, I don´t
>>> know how I can do that. My Jira account doesn't allow me to add a Issue
>>>
>>> I'm using Apache 2.0.0 but the problem came since at least version 1.4
>>> (given the doc since 1.3)
>>>
>>> The problem is simple to reporduce, also the work arround, if we apply
>>> agg over a DataSet with scala pairs over the same column, only one agg over
>>> that column is actualy used, this is because the toMap that reduce the pair
>>> values of the mane key to one and overwriting the value
>>>
>>> class
>>> https://github.com/apache/spark/blob/master/sql/core/
>>> src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
>>>
>>>
>>>  def agg(aggExpr: (String, String), aggExprs: (String, String)*):
>>> DataFrame = {
>>> agg((aggExpr +: aggExprs).toMap)
>>>   }
>>>
>>>
>>> rewrited as somthing like this should work
>>>  def agg(aggExpr: (String, String), aggExprs: (String, String)*):
>>> DataFrame = {
>>>toDF((aggExpr +: aggExprs).map { pairExpr =>
>>>   strToExpr(pairExpr._2)(df(pairExpr._1).expr)
>>> }.toSeq)
>>> }
>>>
>>>
>>> regards
>>> --
>>> Ing. Ivaldi Andres
>>>
>>
>>
>> *Olivier Girardot*   | Associé
>> o.girar...@lateral-thoughts.com
>> +33 6 24 09 17 94
>>
>


-- 
Ing. Ivaldi Andres


Aggregations with scala pairs

2016-08-17 Thread Andrés Ivaldi
Hello, I'd like to report a wrong behavior of the Dataset API, but I don't
know how to do that; my Jira account doesn't allow me to add an issue.

I'm using Apache Spark 2.0.0, but the problem has existed since at least
version 1.4 (given the doc, since 1.3).

The problem is simple to reproduce, and so is the workaround: if we apply agg
over a Dataset with Scala pairs on the same column, only one aggregation over
that column is actually used. This is because of the toMap call, which
reduces the pairs with the same key to one, overwriting the value.

class
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala


 def agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
> = {
> agg((aggExpr +: aggExprs).toMap)
>   }


Rewritten as something like this, it should work:
 def agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
= {
   toDF((aggExpr +: aggExprs).map { pairExpr =>
  strToExpr(pairExpr._2)(df(pairExpr._1).expr)
}.toSeq)
}
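
Until that is fixed, a user-side workaround (a sketch; names are
illustrative) is the Column-based overload of agg, which never goes through
that toMap call:

import org.apache.spark.sql.functions.{max, sum}

df.groupBy("key").agg(sum("col"), max("col"))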


regards
-- 
Ing. Ivaldi Andres


Spark 1.6.1 and regexp_replace

2016-08-09 Thread Andrés Ivaldi
I'm seeing strange behaviour with regexp_replace: I'm trying to trim the
string and also collapse runs of more than one space into a single space.

Given a string like "   A  B   ", with trim alone I get "A  B", which is
perfect; if I add regexp_replace I get "  A B".

Text1 is the column, so I did

df.withColumn("Text1", expr("trim(regexp_replace(Text1, '\\s+', ' '))"))

I also tried other expressions with no luck.

Any idea?
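
An alternative I would also try (a sketch): the column functions avoid
escaping the regex through the SQL expression parser a second time.

import org.apache.spark.sql.functions.{col, regexp_replace, trim}

df.withColumn("Text1", trim(regexp_replace(col("Text1"), "\\s+", " ")))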

thanks


Spark 2 and Solr

2016-08-01 Thread Andrés Ivaldi
Hello, does any one know if Spark 2.0 will have a Solr connector?
Lucidworks has one but is not available yet for Spark 2.0


thanks!!


Re: Data Frames Join by more than one column

2016-06-23 Thread Andrés Ivaldi
Ahh, my mistake I think: as the data came from SQL, one of the columns is
char(7) but not all characters are occupied, and the join was against a
varchar(30) column, so the comparison ended up looking like
"xx     " == "xx". Applying trim to the char column works.

On Thu, Jun 23, 2016 at 2:57 PM, Andrés Ivaldi <iaiva...@gmail.com> wrote:

> Hello, I'v been trying to join ( left_outer) dataframes by two columns,
> and the result is not as expected.
>
> I'm doing this
>
> dfo1.get.join(dfo2.get,
>
> dfo1.get.col("Col1Left").equalTo(dfo2.get.col("Col1Right")).and(dfo1.get.col("Col2Left").equalTo(dfo2.get.col("Col2Right")))
>   , "left_outer").head(10).foreach(println)
>
> the result is not the same than sql join, some tuples cant be matched on
> right
>
> So I don't know if the way to do the join is the correct.
>
> Regards.
>
> --
> Ing. Ivaldi Andres
>



-- 
Ing. Ivaldi Andres


Data Frames Join by more than one column

2016-06-23 Thread Andrés Ivaldi
Hello, I've been trying to join (left_outer) two DataFrames by two columns,
and the result is not as expected.

I'm doing this

dfo1.get.join(dfo2.get,
  dfo1.get.col("Col1Left").equalTo(dfo2.get.col("Col1Right"))
    .and(dfo1.get.col("Col2Left").equalTo(dfo2.get.col("Col2Right"))),
  "left_outer").head(10).foreach(println)

the result is not the same as the SQL join; some tuples can't be matched on
the right side.

So I don't know if the way to do the join is the correct.

Regards.

-- 
Ing. Ivaldi Andres


JDBC Create Table

2016-05-27 Thread Andrés Ivaldi
Hello, yesterday I updated Spark 1.6.0 to 1.6.1 and my tests started to fail
because it is no longer possible to create new tables in SQLServer. I'm using
SaveMode.Overwrite, as in version 1.6.0.

Any Idea

regards

-- 
Ing. Ivaldi Andres


Re: Insert into JDBC

2016-05-26 Thread Andrés Ivaldi
Done, version 1.6.1 has the fix; I updated and it works fine.

Thanks.

On Thu, May 26, 2016 at 4:15 PM, Anthony May <anthony...@gmail.com> wrote:

> It's on the 1.6 branch
>
> On Thu, May 26, 2016 at 4:43 PM Andrés Ivaldi <iaiva...@gmail.com> wrote:
>
>> I see, I'm using Spark 1.6.0 and that change seems to be for 2.0 or maybe
>> it's in 1.6.1 looking at the history.
>> thanks I'll see if update spark  to 1.6.1
>>
>> On Thu, May 26, 2016 at 3:33 PM, Anthony May <anthony...@gmail.com>
>> wrote:
>>
>>> It doesn't appear to be configurable, but it is inserting by column name:
>>>
>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L102
>>>
>>> On Thu, 26 May 2016 at 16:02 Andrés Ivaldi <iaiva...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>  I'realize that when dataframe executes insert it is inserting by
>>>> scheme order column instead by name, ie
>>>>
>>>> dataframe.write(SaveMode).jdbc(url, table, properties)
>>>>
>>>> Reading the profiler the execution is
>>>>
>>>> insert into TableName values(a,b,c..)
>>>>
>>>> what i need is
>>>> insert into TableNames (colA,colB,colC) values(a,b,c)
>>>>
>>>> could be some configuration?
>>>>
>>>> regards.
>>>>
>>>> --
>>>> Ing. Ivaldi Andres
>>>>
>>>
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>


-- 
Ing. Ivaldi Andres


Re: Insert into JDBC

2016-05-26 Thread Andrés Ivaldi
I see. I'm using Spark 1.6.0, and that change seems to be for 2.0, or maybe
it's in 1.6.1, looking at the history.
Thanks, I'll see about updating Spark to 1.6.1.

On Thu, May 26, 2016 at 3:33 PM, Anthony May <anthony...@gmail.com> wrote:

> It doesn't appear to be configurable, but it is inserting by column name:
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L102
>
> On Thu, 26 May 2016 at 16:02 Andrés Ivaldi <iaiva...@gmail.com> wrote:
>
>> Hello,
>>  I'realize that when dataframe executes insert it is inserting by scheme
>> order column instead by name, ie
>>
>> dataframe.write(SaveMode).jdbc(url, table, properties)
>>
>> Reading the profiler the execution is
>>
>> insert into TableName values(a,b,c..)
>>
>> what i need is
>> insert into TableNames (colA,colB,colC) values(a,b,c)
>>
>> could be some configuration?
>>
>> regards.
>>
>> --
>> Ing. Ivaldi Andres
>>
>


-- 
Ing. Ivaldi Andres


Insert into JDBC

2016-05-26 Thread Andrés Ivaldi
Hello,
I realized that when a DataFrame executes an insert, it inserts by schema
column order instead of by column name, i.e.

dataframe.write.mode(saveMode).jdbc(url, table, properties)

Reading the profiler the execution is

insert into TableName values(a,b,c..)

what i need is
insert into TableNames (colA,colB,colC) values(a,b,c)

could be some configuration?

regards.

-- 
Ing. Ivaldi Andres


Re: Bit(N) on create Table with MSSQLServer

2016-05-04 Thread Andrés Ivaldi
Ok, so I did that: I created the table and then inserted the data, but Spark
drops the table and tries to create it again. I'm using
DataFrame.write with SaveMode.Overwrite, and the documentation says:
"when performing a Overwrite, the data will be deleted before writing out
the new data."

So why is it dropping the table?


On Wed, May 4, 2016 at 6:44 AM, Andrés Ivaldi <iaiva...@gmail.com> wrote:

> Yes, I can do that, it's what we are doing now, but I think the best
> approach would be delegate the create table action to spark.
>
> On Tue, May 3, 2016 at 8:17 PM, Mich Talebzadeh <mich.talebza...@gmail.com
> > wrote:
>
>> Can you create the MSSQL (target) table first with the correct column
>> setting and insert data from Spark to it with JDBC as opposed to JDBC
>> creating target table itself?
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 3 May 2016 at 22:19, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>>
>>> Ok, Spark MSSQL dataType mapping is not right for me, ie. string is Text
>>> instead of varchar(MAX) , so how can I override default SQL Mapping?
>>>
>>> regards.
>>>
>>> On Sun, May 1, 2016 at 5:23 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Well if MSSQL cannot create that column then it is more like
>>>> compatibility between Spark and RDBMS.
>>>>
>>>> What value that column has in MSSQL. Can you create table the table in
>>>> MSSQL database or map it in Spark to a valid column before opening JDBC
>>>> connection?
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 29 April 2016 at 16:16, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>>>>
>>>>> Hello, Spark is executing a create table sentence (using JDBC) to
>>>>> MSSQLServer with a mapping column type like ColName Bit(1) for boolean
>>>>> types, This create table cannot be executed on MSSQLServer.
>>>>>
>>>>> In class JdbcDialect the mapping for Boolean type is Bit(1), so the
>>>>> question is, this is a problem of spark or JDBC driver who is not mapping
>>>>> right?
>>>>>
>>>>> Anyway it´s possible to override that mapping in Spark?
>>>>>
>>>>> Regards
>>>>>
>>>>> --
>>>>> Ing. Ivaldi Andres
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Ing. Ivaldi Andres
>>>
>>
>>
>
>
> --
> Ing. Ivaldi Andres
>



-- 
Ing. Ivaldi Andres


Re: Bit(N) on create Table with MSSQLServer

2016-05-04 Thread Andrés Ivaldi
Yes, I can do that; it's what we are doing now, but I think the best approach
would be to delegate the create-table action to Spark.

On Tue, May 3, 2016 at 8:17 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Can you create the MSSQL (target) table first with the correct column
> setting and insert data from Spark to it with JDBC as opposed to JDBC
> creating target table itself?
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 3 May 2016 at 22:19, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>
>> Ok, Spark MSSQL dataType mapping is not right for me, ie. string is Text
>> instead of varchar(MAX) , so how can I override default SQL Mapping?
>>
>> regards.
>>
>> On Sun, May 1, 2016 at 5:23 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Well if MSSQL cannot create that column then it is more like
>>> compatibility between Spark and RDBMS.
>>>
>>> What value that column has in MSSQL. Can you create table the table in
>>> MSSQL database or map it in Spark to a valid column before opening JDBC
>>> connection?
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 29 April 2016 at 16:16, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>>>
>>>> Hello, Spark is executing a create table sentence (using JDBC) to
>>>> MSSQLServer with a mapping column type like ColName Bit(1) for boolean
>>>> types, This create table cannot be executed on MSSQLServer.
>>>>
>>>> In class JdbcDialect the mapping for Boolean type is Bit(1), so the
>>>> question is, this is a problem of spark or JDBC driver who is not mapping
>>>> right?
>>>>
>>>> Anyway it´s possible to override that mapping in Spark?
>>>>
>>>> Regards
>>>>
>>>> --
>>>> Ing. Ivaldi Andres
>>>>
>>>
>>>
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>
>


-- 
Ing. Ivaldi Andres


Bit(N) on create Table with MSSQLServer

2016-04-29 Thread Andrés Ivaldi
Hello, Spark is executing a CREATE TABLE statement (using JDBC) against
MSSQLServer with a column type mapping like "ColName Bit(1)" for boolean
types, and this CREATE TABLE cannot be executed on MSSQLServer.

In the JdbcDialect class the mapping for the Boolean type is Bit(1), so the
question is: is this a problem in Spark, or in the JDBC driver that is not
mapping it right?

Anyway, is it possible to override that mapping in Spark? Something like the
sketch below is what I have in mind.
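
A sketch of what I mean (untested; the exact type strings are assumptions,
and JdbcDialects.registerDialect/getJDBCType are the developer-API hooks I
found):

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

object MsSqlDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:sqlserver")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case BooleanType => Some(JdbcType("BIT", java.sql.Types.BIT))
    case StringType  => Some(JdbcType("VARCHAR(MAX)", java.sql.Types.VARCHAR))
    case _           => None // fall back to the default mapping
  }
}

JdbcDialects.registerDialect(MsSqlDialect)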

Regards

-- 
Ing. Ivaldi Andres


Re: Fill Gaps between rows

2016-04-26 Thread Andrés Ivaldi
Thanks Sebastian, I've never understood that part of HiveContext. So it's
possible to use a HiveContext, then use the window functions, and save the
DataFrame into another source like MSSQL, Oracle, or anything with JDBC?

Regards.

On Tue, Apr 26, 2016 at 1:22 PM, Sebastian Piu <sebastian@gmail.com>
wrote:

> Yes you need hive Context for the window functions, but you don't need
> hive for it to work
>
> On Tue, 26 Apr 2016, 14:15 Andrés Ivaldi, <iaiva...@gmail.com> wrote:
>
>> Hello, do exists an Out Of the box for fill in gaps between rows with a
>> given  condition?
>> As example: I have a source table with data and a column with the day
>> number, but the record only register a event and no necessary all days have
>> events, so the table no necessary has all days. But I want a resultant
>> Table with all days, filled in the data with 0 o same as row before.
>>
>> I'm using SQLContext. I think window function will do that, but I cant
>> use it without hive context, Is that right?
>>
>> regards
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>


-- 
Ing. Ivaldi Andres


Fill Gaps between rows

2016-04-26 Thread Andrés Ivaldi
Hello, does an out-of-the-box way exist to fill in gaps between rows given a
condition?
As an example: I have a source table with data and a column with the day
number, but a record only registers an event, and not all days have events,
so the table does not necessarily contain all days. I want a resulting table
with all days, with the gaps filled with 0 or with the same value as the row
before.

I'm using SQLContext. I think a window function would do that, but I can't
use it without a HiveContext; is that right? A sketch of the join-based
approach I am considering is below.
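
A sketch of that approach (31 days, the column names, and the events
DataFrame are illustrative):

val days = sqlContext.range(1, 32).toDF("day")        // one row per day
val filled = days
  .join(events, Seq("day"), "left_outer")             // events(day, value)
  .na.fill(0.0, Seq("value"))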

regards

-- 
Ing. Ivaldi Andres


DataFrame group and agg

2016-04-25 Thread Andrés Ivaldi
Hello,
Does anyone know if this is on purpose or if it's a bug?
In
https://github.com/apache/spark/blob/2f1d0320c97f064556fa1cf98d4e30d2ab2fe661/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
class

the def agg has several implementations; two of them are:
Line 136:
  def agg(aggExpr: (String, String), aggExprs: (String, String)*):
DataFrame = {
agg((aggExpr +: aggExprs).toMap)
  }

Line 155:
  def agg(exprs: Map[String, String]): DataFrame = {
toDF(exprs.map { case (colName, expr) =>
  strToExpr(expr)(df(colName).expr)
}.toSeq)
  }

So this allows me to do something like .agg("col1" -> "sum", "col2" -> "max").

But if I want to apply two different aggregation functions to the same
column, since the method at line 136 builds a Map, something like
"col" -> "sum", "col" -> "max" will end up as just "col" -> "max".


I think a signature like this would work:

  def agg(exprs: Seq[(String, String)]): DataFrame = {
toDF(exprs.map { case (colName, expr) =>
  strToExpr(expr)(df(colName).expr)
}.toSeq)
  }

Regards.


Re: Spark SQL Transaction

2016-04-23 Thread Andrés Ivaldi
Thanks, I'll take a look at JdbcUtils.

regards.

On Sat, Apr 23, 2016 at 2:57 PM, Todd Nist <tsind...@gmail.com> wrote:

> I believe the class you are looking for is
> org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala.
>
> By default in savePartition(...) , it will do the following:
>
> if (supportsTransactions) { conn.setAutoCommit(false) // Everything in
> the same db transaction. } Then at line 224, it will issue the commit:
> if (supportsTransactions) { conn.commit() } HTH -Todd
>
> On Sat, Apr 23, 2016 at 8:57 AM, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>
>> Hello, so I executed Profiler and found that implicit isolation was turn
>> on by JDBC driver, this is the default behavior of MSSQL JDBC driver, but
>> it's possible change it with setAutoCommit method. There is no property for
>> that so I've to do it in the code, do you now where can I access to the
>> instance of JDBC class used by Spark on DataFrames?
>>
>> Regards.
>>
>> On Thu, Apr 21, 2016 at 10:59 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> This statement
>>>
>>> ."..each database statement is atomic and is itself a transaction.. your
>>> statements should be atomic and there will be no ‘redo’ or ‘commit’ or
>>> ‘rollback’."
>>>
>>> MSSQL compiles with ACIDITY which requires that each transaction be "all
>>> or nothing": if one part of the transaction fails, then the entire
>>> transaction fails, and the database state is left unchanged.
>>>
>>> Assuming that it is one transaction (through much doubt if JDBC does
>>> that as it will take for ever), then either that transaction commits (in
>>> MSSQL redo + undo are combined in syslogs table of the database) meaning
>>> there will be undo + redo log generated  for that row only in syslogs. So
>>> under normal operation every RDBMS including MSSQL, Oracle, Sybase and
>>> others will comply with generating (redo and undo) and one cannot avoid it.
>>> If there is a batch transaction as I suspect in this case, it is either all
>>> or nothing. The thread owner indicated that rollback is happening so it is
>>> consistent with all rows rolled back.
>>>
>>> I don't think Spark, Sqoop, Hive can influence the transaction behaviour
>>> of an RDBMS for DML. DQ (data queries) do not generate transactions.
>>>
>>> HTH
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 21 April 2016 at 13:58, Michael Segel <msegel_had...@hotmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Sometimes terms get muddled over time.
>>>>
>>>> If you’re not using transactions, then each database statement is
>>>> atomic and is itself a transaction.
>>>> So unless you have some explicit ‘Begin Work’ at the start…. your
>>>> statements should be atomic and there will be no ‘redo’ or ‘commit’ or
>>>> ‘rollback’.
>>>>
>>>> I don’t see anything in Spark’s documentation about transactions, so
>>>> the statements should be atomic.  (I’m not a guru here so I could be
>>>> missing something in Spark)
>>>>
>>>> If you’re seeing the connection drop unexpectedly and then a rollback,
>>>> could this be a setting or configuration of the database?
>>>>
>>>>
>>>> > On Apr 19, 2016, at 1:18 PM, Andrés Ivaldi <iaiva...@gmail.com>
>>>> wrote:
>>>> >
>>>> > Hello, is possible to execute a SQL write without Transaction? we
>>>> dont need transactions to save our data and this adds an overhead to the
>>>> SQLServer.
>>>> >
>>>> > Regards.
>>>> >
>>>> > --
>>>> > Ing. Ivaldi Andres
>>>>
>>>>
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>>
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>
>


-- 
Ing. Ivaldi Andres


Re: Spark SQL Transaction

2016-04-23 Thread Andrés Ivaldi
Hello, so I ran the Profiler and found that implicit transactions were turned
on by the JDBC driver; this is the default behavior of the MSSQL JDBC driver,
but it's possible to change it with the setAutoCommit method. There is no
property for that, so I have to do it in code. Do you know where I can access
the JDBC connection instance used by Spark for DataFrames?

Regards.

On Thu, Apr 21, 2016 at 10:59 AM, Mich Talebzadeh <mich.talebza...@gmail.com
> wrote:

> This statement
>
> ."..each database statement is atomic and is itself a transaction.. your
> statements should be atomic and there will be no ‘redo’ or ‘commit’ or
> ‘rollback’."
>
> MSSQL compiles with ACIDITY which requires that each transaction be "all
> or nothing": if one part of the transaction fails, then the entire
> transaction fails, and the database state is left unchanged.
>
> Assuming that it is one transaction (through much doubt if JDBC does that
> as it will take for ever), then either that transaction commits (in MSSQL
> redo + undo are combined in syslogs table of the database) meaning
> there will be undo + redo log generated  for that row only in syslogs. So
> under normal operation every RDBMS including MSSQL, Oracle, Sybase and
> others will comply with generating (redo and undo) and one cannot avoid it.
> If there is a batch transaction as I suspect in this case, it is either all
> or nothing. The thread owner indicated that rollback is happening so it is
> consistent with all rows rolled back.
>
> I don't think Spark, Sqoop, Hive can influence the transaction behaviour
> of an RDBMS for DML. DQ (data queries) do not generate transactions.
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 21 April 2016 at 13:58, Michael Segel <msegel_had...@hotmail.com>
> wrote:
>
>> Hi,
>>
>> Sometimes terms get muddled over time.
>>
>> If you’re not using transactions, then each database statement is atomic
>> and is itself a transaction.
>> So unless you have some explicit ‘Begin Work’ at the start…. your
>> statements should be atomic and there will be no ‘redo’ or ‘commit’ or
>> ‘rollback’.
>>
>> I don’t see anything in Spark’s documentation about transactions, so the
>> statements should be atomic.  (I’m not a guru here so I could be missing
>> something in Spark)
>>
>> If you’re seeing the connection drop unexpectedly and then a rollback,
>> could this be a setting or configuration of the database?
>>
>>
>> > On Apr 19, 2016, at 1:18 PM, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>> >
>> > Hello, is possible to execute a SQL write without Transaction? we dont
>> need transactions to save our data and this adds an overhead to the
>> SQLServer.
>> >
>> > Regards.
>> >
>> > --
>> > Ing. Ivaldi Andres
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


-- 
Ing. Ivaldi Andres


Tracing Spark DataFrame Execution

2016-04-21 Thread Andrés Ivaldi
Hello, is it possible to trace a DataFrame? I'd like to report the progress
of a DataFrame execution. I looked at SparkListeners, but nested DataFrames
produce several jobs and I don't know how to relate those jobs to each other;
I'm also reusing the SparkContext.
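
One approach I am considering (a sketch; df stands for the real pipeline) is
to tag every job spawned by one action with a job group, then relate the jobs
back to that action inside a listener:

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

sc.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val group = Option(jobStart.properties).map(_.getProperty("spark.jobGroup.id"))
    println(s"job ${jobStart.jobId} belongs to group $group")
  }
})

sc.setJobGroup("request-42", "aggregation for request 42")
df.count()          // all jobs triggered by this action carry the group id
sc.clearJobGroup()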

Regards.

-- 
Ing. Ivaldi Andres


Re: Spark SQL Transaction

2016-04-20 Thread Andrés Ivaldi
I think the same; I don't think reducing the batch size improves speed, but
it will avoid losing all the data on a rollback.


Thanks for the help..


On Wed, Apr 20, 2016 at 4:03 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> yep. I think it is not possible to make SQL Server do a non logged
> transaction. Other alternative is doing inserts in small batches if
> possible. Or write to a CSV type file and use Bulk copy to load the file
> into MSSQL with frequent commits like every 50K rows?
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 20 April 2016 at 19:42, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>
>> Yes, I know that behavior , but there is not explicit Begin Transaction
>> in my code, so, maybe Spark or the same driver is adding the begin
>> transaction, or implicit transaction is configured. If spark is'n adding a
>> Begin transaction on each insertion, then probably is database or Driver
>> configuration...
>>
>> On Wed, Apr 20, 2016 at 3:33 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>>
>>> You will see what is happening in SQL Server. First create a test table
>>> called  testme
>>>
>>> 1> use tempdb
>>> 2> go
>>> 1> create table testme(col1 int)
>>> 2> go
>>> -- Now explicitly begin a transaction and insert 1 row and select from
>>> table
>>> 1>
>>> *begin tran*2> insert into testme values(1)
>>> 3> select * from testme
>>> 4> go
>>> (1 row affected)
>>>  col1
>>>  ---
>>>1
>>> -- That value col1=1 is there
>>> --
>>> (1 row affected)
>>> -- Now rollback that transaction meaning in your case by killing your
>>> Spark process!
>>> --
>>> 1> rollback tran
>>> 2> select * from testme
>>> 3> go
>>>  col1
>>>  ---
>>> (0 rows affected)
>>>
>>> -- You can see that record has gone as it rolled back!
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 20 April 2016 at 18:42, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>>>
>>>> Sorry I'cant answer before, I want to know if spark is the responsible
>>>> to add the Begin Tran, The point is to speed up insertion over losing data,
>>>>  Disabling Transaction will speed up the insertion and we dont care about
>>>> consistency... I'll disable te implicit_transaction and see what happens.
>>>>
>>>> thanks
>>>>
>>>> On Wed, Apr 20, 2016 at 12:09 PM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Assuming that you are using JDBC for putting data into any ACID
>>>>> compliant database (MSSQL, Sybase, Oracle etc), you are implicitly or
>>>>> explicitly  adding BEGIN TRAN to INSERT statement in a distributed
>>>>> transaction. MSSQL does not know or care where data is coming from. If 
>>>>> your
>>>>> connection completes OK a COMMIT TRAN will be sent and that will tell MSQL
>>>>> to commit transaction. If yoy kill Spark transaction before MSSQL receive
>>>>> COMMIT TRAN, the transaction will be rolled back.
>>>>>
>>>>> The only option is that if you don't care about full data getting to
>>>>> MSSQL,to break your insert into chunks at source and send data to MSSQL in
>>>>> small batches. In that way you will not lose all data in MSSQL because of
>>>>> rollback.
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * 
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
&

Re: Spark SQL Transaction

2016-04-20 Thread Andrés Ivaldi
Yes, I know that behavior, but there is no explicit Begin Transaction in
my code, so maybe Spark or the driver itself is adding the begin
transaction, or an implicit transaction is configured. If Spark isn't adding
a Begin transaction on each insertion, then it is probably database or
driver configuration...

On Wed, Apr 20, 2016 at 3:33 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

>
> You will see what is happening in SQL Server. First create a test table
> called  testme
>
> 1> use tempdb
> 2> go
> 1> create table testme(col1 int)
> 2> go
> -- Now explicitly begin a transaction and insert 1 row and select from
> table
> 1>
> *begin tran*2> insert into testme values(1)
> 3> select * from testme
> 4> go
> (1 row affected)
>  col1
>  ---
>1
> -- That value col1=1 is there
> --
> (1 row affected)
> -- Now rollback that transaction meaning in your case by killing your
> Spark process!
> --
> 1> rollback tran
> 2> select * from testme
> 3> go
>  col1
>  ---
> (0 rows affected)
>
> -- You can see that record has gone as it rolled back!
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 20 April 2016 at 18:42, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>
>> Sorry I'cant answer before, I want to know if spark is the responsible to
>> add the Begin Tran, The point is to speed up insertion over losing data,
>>  Disabling Transaction will speed up the insertion and we dont care about
>> consistency... I'll disable te implicit_transaction and see what happens.
>>
>> thanks
>>
>> On Wed, Apr 20, 2016 at 12:09 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Assuming that you are using JDBC for putting data into any ACID
>>> compliant database (MSSQL, Sybase, Oracle etc), you are implicitly or
>>> explicitly  adding BEGIN TRAN to INSERT statement in a distributed
>>> transaction. MSSQL does not know or care where data is coming from. If your
>>> connection completes OK a COMMIT TRAN will be sent and that will tell MSQL
>>> to commit transaction. If yoy kill Spark transaction before MSSQL receive
>>> COMMIT TRAN, the transaction will be rolled back.
>>>
>>> The only option is that if you don't care about full data getting to
>>> MSSQL,to break your insert into chunks at source and send data to MSSQL in
>>> small batches. In that way you will not lose all data in MSSQL because of
>>> rollback.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 20 April 2016 at 07:33, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>>> Are you using JDBC to push data to MSSQL?
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 19 April 2016 at 23:41, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>>>>
>>>>> I mean local transaction, We've ran a Job that writes into SQLServer
>>>>> then we killed spark JVM just for testing purpose and we realized that
>>>>> SQLServer did a rollback.
>>>>>
>>>>> Regards
>>>>>
>>>>> On Tue, Apr 19, 2016 at 5:27 PM, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> What do you mean by *without transaction*? do you mean forcing SQL
>>>>>> Server to accept a non logged operation?
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> LinkedIn * 
>>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 19 April 2016 at 21:18, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello, is possible to execute a SQL write without Transaction? we
>>>>>>> dont need transactions to save our data and this adds an overhead to the
>>>>>>> SQLServer.
>>>>>>>
>>>>>>> Regards.
>>>>>>>
>>>>>>> --
>>>>>>> Ing. Ivaldi Andres
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ing. Ivaldi Andres
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>
>
>


-- 
Ing. Ivaldi Andres


Re: Spark SQL Transaction

2016-04-20 Thread Andrés Ivaldi
Sorry I couldn't answer before. I want to know whether Spark is responsible
for adding the Begin Tran. The point is to favour insertion speed over
protecting against data loss; disabling transactions will speed up the
insertion and we don't care about consistency... I'll disable the
implicit_transaction setting and see what happens.

thanks

On Wed, Apr 20, 2016 at 12:09 PM, Mich Talebzadeh <mich.talebza...@gmail.com
> wrote:

> Assuming that you are using JDBC for putting data into any ACID compliant
> database (MSSQL, Sybase, Oracle etc), you are implicitly or explicitly
>  adding BEGIN TRAN to INSERT statement in a distributed transaction. MSSQL
> does not know or care where data is coming from. If your connection
> completes OK a COMMIT TRAN will be sent and that will tell MSQL to commit
> transaction. If yoy kill Spark transaction before MSSQL receive COMMIT
> TRAN, the transaction will be rolled back.
>
> The only option is that if you don't care about full data getting to
> MSSQL,to break your insert into chunks at source and send data to MSSQL in
> small batches. In that way you will not lose all data in MSSQL because of
> rollback.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 20 April 2016 at 07:33, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Are you using JDBC to push data to MSSQL?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 19 April 2016 at 23:41, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>>
>>> I mean local transaction, We've ran a Job that writes into SQLServer
>>> then we killed spark JVM just for testing purpose and we realized that
>>> SQLServer did a rollback.
>>>
>>> Regards
>>>
>>> On Tue, Apr 19, 2016 at 5:27 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> What do you mean by *without transaction*? do you mean forcing SQL
>>>> Server to accept a non logged operation?
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 19 April 2016 at 21:18, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>>>>
>>>>> Hello, is possible to execute a SQL write without Transaction? we dont
>>>>> need transactions to save our data and this adds an overhead to the
>>>>> SQLServer.
>>>>>
>>>>> Regards.
>>>>>
>>>>> --
>>>>> Ing. Ivaldi Andres
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Ing. Ivaldi Andres
>>>
>>
>>
>


-- 
Ing. Ivaldi Andres


Re: Spark SQL Transaction

2016-04-19 Thread Andrés Ivaldi
I mean a local transaction. We ran a job that writes into SQLServer, then
we killed the Spark JVM just for testing purposes, and we realized that
SQLServer did a rollback.

Regards

On Tue, Apr 19, 2016 at 5:27 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
> What do you mean by *without transaction*? do you mean forcing SQL Server
> to accept a non logged operation?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 19 April 2016 at 21:18, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>
>> Hello, is possible to execute a SQL write without Transaction? we dont
>> need transactions to save our data and this adds an overhead to the
>> SQLServer.
>>
>> Regards.
>>
>> --
>> Ing. Ivaldi Andres
>>
>
>


-- 
Ing. Ivaldi Andres


Spark SQL Transaction

2016-04-19 Thread Andrés Ivaldi
Hello, is it possible to execute a SQL write without a transaction? We don't
need transactions to save our data and they add overhead to the SQLServer.

Regards.

-- 
Ing. Ivaldi Andres


Microsoft SQL dialect issues

2016-03-15 Thread Andrés Ivaldi
Hello, I'm trying to use MSSQL, storing data in MSSQL, but I'm having
dialect problems.
I found this
https://mail-archives.apache.org/mod_mbox/spark-issues/201510.mbox/%3cjira.12901078.1443461051000.34556.1444123886...@atlassian.jira%3E

That is what is happening to me. Is it possible to define the dialect, so I
can override the default for SQLServer?
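
A rough sketch of what I mean, assuming the public JdbcDialect API available since Spark 1.4 (the type mappings here are only examples, not the exact fix):

import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

object SqlServerDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  // how SQL Server types are mapped back to Spark SQL types on read
  override def getCatalystType(sqlType: Int, typeName: String, size: Int,
                               md: MetadataBuilder): Option[DataType] =
    if (typeName.equalsIgnoreCase("bit")) Some(BooleanType) else None

  // how Spark SQL types are written out as SQL Server column types
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("NVARCHAR(MAX)", Types.NVARCHAR))
    case BooleanType => Some(JdbcType("BIT", Types.BIT))
    case _           => None
  }
}

// register once, before reading or writing against SQL Server
JdbcDialects.registerDialect(SqlServerDialect)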

Regards.

-- 
Ing. Ivaldi Andres


Re: Can we use spark inside a web service?

2016-03-15 Thread Andrés Ivaldi
Thanks Evan for the points. I had assumed what you said, but since I don't
have enough experience I thought maybe I was missing something. Thanks for
the answer!!

On Mon, Mar 14, 2016 at 7:22 PM, Evan Chan <velvia.git...@gmail.com> wrote:

> Andres,
>
> A couple points:
>
> 1) If you look at my post, you can see that you could use Spark for
> low-latency - many sub-second queries could be executed in under a
> second, with the right technology.  It really depends on "real time"
> definition, but I believe low latency is definitely possible.
> 2) Akka-http over SparkContext - this is essentially what Spark Job
> Server does.  (it uses Spray, whic is the predecessor to akka-http
> we will upgrade once Spark 2.0 is incorporated)
> 3) Someone else can probably talk about Ignite, but it is based on a
> distributed object cache. So you define your objects in Java, POJOs,
> annotate which ones you want indexed, upload your jars, then you can
> execute queries.   It's a different use case than typical OLAP.
> There is some Spark integration, but then you would have the same
> bottlenecks going through Spark.
>
>
> On Fri, Mar 11, 2016 at 5:02 AM, Andrés Ivaldi <iaiva...@gmail.com> wrote:
> > nice discussion , I've a question about  Web Service with Spark.
> >
> > What Could be the problem using Akka-http as web service (Like play does
> ) ,
> > with one SparkContext created , and the queries over -http akka using
> only
> > the instance of  that SparkContext ,
> >
> > Also about Analytics , we are working on real- time Analytics and as
> Hemant
> > said Spark is not a solution for low latency queries. What about using
> > Ingite for that?
> >
> >
> > On Fri, Mar 11, 2016 at 6:52 AM, Hemant Bhanawat <hemant9...@gmail.com>
> > wrote:
> >>
> >> Spark-jobserver is an elegant product that builds concurrency on top of
> >> Spark. But, the current design of DAGScheduler prevents Spark to become
> a
> >> truly concurrent solution for low latency queries. DagScheduler will
> turn
> >> out to be a bottleneck for low latency queries. Sparrow project was an
> >> effort to make Spark more suitable for such scenarios but it never made
> it
> >> to the Spark codebase. If Spark has to become a highly concurrent
> solution,
> >> scheduling has to be distributed.
> >>
> >> Hemant Bhanawat
> >> www.snappydata.io
> >>
> >> On Fri, Mar 11, 2016 at 7:02 AM, Chris Fregly <ch...@fregly.com> wrote:
> >>>
> >>> great discussion, indeed.
> >>>
> >>> Mark Hamstra and i spoke offline just now.
> >>>
> >>> Below is a quick recap of our discussion on how they've achieved
> >>> acceptable performance from Spark on the user request/response path
> (@mark-
> >>> feel free to correct/comment).
> >>>
> >>> 1) there is a big difference in request/response latency between
> >>> submitting a full Spark Application (heavy weight) versus having a
> >>> long-running Spark Application (like Spark Job Server) that submits
> >>> lighter-weight Jobs using a shared SparkContext.  mark is obviously
> using
> >>> the latter - a long-running Spark App.
> >>>
> >>> 2) there are some enhancements to Spark that are required to achieve
> >>> acceptable user request/response times.  some links that Mark provided
> are
> >>> as follows:
> >>>
> >>> https://issues.apache.org/jira/browse/SPARK-11838
> >>> https://github.com/apache/spark/pull/11036
> >>> https://github.com/apache/spark/pull/11403
> >>> https://issues.apache.org/jira/browse/SPARK-13523
> >>> https://issues.apache.org/jira/browse/SPARK-13756
> >>>
> >>> Essentially, a deeper level of caching at the shuffle file layer to
> >>> reduce compute and memory between queries.
> >>>
> >>> Note that Mark is running a slightly-modified version of stock Spark.
> >>> (He's mentioned this in prior posts, as well.)
> >>>
> >>> And I have to say that I'm, personally, seeing more and more
> >>> slightly-modified versions of Spark being deployed to production to
> >>> workaround outstanding PR's and Jiras.
> >>>
> >>> this may not be what people want to hear, but it's a trend that i'm
> >>> seeing lately as more and more customize Spark to their specific use
> cases.
> >>>
> >>> Anyway, thanks for the good discussion, everyone!  This is why we have
> &g

Re: Can we use spark inside a web service?

2016-03-11 Thread Andrés Ivaldi
Nice discussion; I have a question about a web service with Spark.

What could be the problem with using Akka HTTP as the web service (like Play
does), with one SparkContext created and the queries over Akka HTTP using
only that single SparkContext instance?

Also, about analytics: we are working on real-time analytics, and as Hemant
said, Spark is not a solution for low-latency queries. What about using
Ignite for that?
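
Roughly the shape I have in mind, just a sketch assuming akka-http 2.4-style APIs and a Spark 1.6 SQLContext (the /aggregate endpoint and every name below are made up): one SparkContext created at startup and shared by every request handled by the route.

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.Future

object SparkRestService extends App {
  implicit val system = ActorSystem("spark-rest")
  implicit val materializer = ActorMaterializer()
  import system.dispatcher

  // single shared context for the lifetime of the service
  val sc = new SparkContext(new SparkConf().setMaster("local[8]").setAppName("rest-etl"))
  val sqlContext = new SQLContext(sc)

  val route =
    path("aggregate") {
      get {
        // run the Spark job off the routing thread and complete asynchronously
        complete(Future { sqlContext.range(0, 1000000).count().toString })
      }
    }

  Http().bindAndHandle(route, "localhost", 8080)
}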


On Fri, Mar 11, 2016 at 6:52 AM, Hemant Bhanawat 
wrote:

> Spark-jobserver is an elegant product that builds concurrency on top of
> Spark. But, the current design of DAGScheduler prevents Spark to become a
> truly concurrent solution for low latency queries. DagScheduler will turn
> out to be a bottleneck for low latency queries. Sparrow project was an
> effort to make Spark more suitable for such scenarios but it never made it
> to the Spark codebase. If Spark has to become a highly concurrent solution,
> scheduling has to be distributed.
>
> Hemant Bhanawat 
> www.snappydata.io
>
> On Fri, Mar 11, 2016 at 7:02 AM, Chris Fregly  wrote:
>
>> great discussion, indeed.
>>
>> Mark Hamstra and i spoke offline just now.
>>
>> Below is a quick recap of our discussion on how they've achieved
>> acceptable performance from Spark on the user request/response path (@mark-
>> feel free to correct/comment).
>>
>> 1) there is a big difference in request/response latency between
>> submitting a full Spark Application (heavy weight) versus having a
>> long-running Spark Application (like Spark Job Server) that submits
>> lighter-weight Jobs using a shared SparkContext.  mark is obviously using
>> the latter - a long-running Spark App.
>>
>> 2) there are some enhancements to Spark that are required to achieve
>> acceptable user request/response times.  some links that Mark provided are
>> as follows:
>>
>>- https://issues.apache.org/jira/browse/SPARK-11838
>>- https://github.com/apache/spark/pull/11036
>>- https://github.com/apache/spark/pull/11403
>>- https://issues.apache.org/jira/browse/SPARK-13523
>>- https://issues.apache.org/jira/browse/SPARK-13756
>>
>> Essentially, a deeper level of caching at the shuffle file layer to
>> reduce compute and memory between queries.
>>
>> Note that Mark is running a slightly-modified version of stock Spark.
>>  (He's mentioned this in prior posts, as well.)
>>
>> And I have to say that I'm, personally, seeing more and more
>> slightly-modified versions of Spark being deployed to production to
>> workaround outstanding PR's and Jiras.
>>
>> this may not be what people want to hear, but it's a trend that i'm
>> seeing lately as more and more customize Spark to their specific use cases.
>>
>> Anyway, thanks for the good discussion, everyone!  This is why we have
>> these lists, right!  :)
>>
>>
>> On Thu, Mar 10, 2016 at 7:51 PM, Evan Chan 
>> wrote:
>>
>>> One of the premises here is that if you can restrict your workload to
>>> fewer cores - which is easier with FiloDB and careful data modeling -
>>> you can make this work for much higher concurrency and lower latency
>>> than most typical Spark use cases.
>>>
>>> The reason why it typically does not work in production is that most
>>> people are using HDFS and files.  These data sources are designed for
>>> running queries and workloads on all your cores across many workers,
>>> and not for filtering your workload down to only one or two cores.
>>>
>>> There is actually nothing inherent in Spark that prevents people from
>>> using it as an app server.   However, the insistence on using it with
>>> HDFS is what kills concurrency.   This is why FiloDB is important.
>>>
>>> I agree there are more optimized stacks for running app servers, but
>>> the choices that you mentioned:  ES is targeted at text search;  Cass
>>> and HBase by themselves are not fast enough for analytical queries
>>> that the OP wants;  and MySQL is great but not scalable.   Probably
>>> something like VectorWise, HANA, Vertica would work well, but those
>>> are mostly not free solutions.   Druid could work too if the use case
>>> is right.
>>>
>>> Anyways, great discussion!
>>>
>>> On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly  wrote:
>>> > you are correct, mark.  i misspoke.  apologies for the confusion.
>>> >
>>> > so the problem is even worse given that a typical job requires multiple
>>> > tasks/cores.
>>> >
>>> > i have yet to see this particular architecture work in production.  i
>>> would
>>> > love for someone to prove otherwise.
>>> >
>>> > On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra >> >
>>> > wrote:
>>> >>>
>>> >>> For example, if you're looking to scale out to 1000 concurrent
>>> requests,
>>> >>> this is 1000 concurrent Spark jobs.  This would require a cluster
>>> with 1000
>>> >>> cores.
>>> >>
>>> >>
>>> >> This doesn't make sense.  A Spark Job is a 

Multiple Spark tasks with Akka FSM

2016-03-09 Thread Andrés Ivaldi
Hello,

I'd like to know whether this architecture is correct or not. We are studying
Spark as our ETL engine. We have a UI designer for the graph, and this gives
us a model that we want to translate into the corresponding Spark executions.
That is what brings us to Akka FSM: using the same SparkContext for all
actors, we assume that parallelism will be handled by Spark (depending on
configuration, of course) and Akka, and each node actor will be executed only
when its incoming edges are completed.
Is that correct, or is there a simpler way to do it? I took a look at
GraphX, but it looks like it is more for joins than for this kind of
manipulation, or maybe I've overlooked something.
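
To make the dependency rule concrete, a minimal sketch of only the ordering part, leaving Akka out (Node and the ETL functions are placeholders): a node runs only after the futures of its incoming edges have completed, and every node shares the same SparkContext.

import org.apache.spark.sql.DataFrame
import scala.concurrent.{ExecutionContext, Future}

case class Node(id: String, run: Seq[DataFrame] => DataFrame)

def execute(node: Node, inEdges: Seq[Future[DataFrame]])
           (implicit ec: ExecutionContext): Future[DataFrame] =
  Future.sequence(inEdges)          // wait for every incoming edge
    .map(dfs => node.run(dfs))      // then run this node's transformation

// usage sketch: extract -> transform -> load expressed as chained futures
// val extracted   = execute(extractNode, Seq.empty)
// val transformed = execute(transformNode, Seq(extracted))
// val loaded      = execute(loadNode, Seq(transformed))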

regards
-- 
Ing. Ivaldi Andres


Re: Multiple Spark tasks with Akka FSM

2016-03-09 Thread Andrés Ivaldi
My mistake, it's not Akka FSM, it's Akka flow graphs.


On Wed, Mar 9, 2016 at 1:46 PM, Andrés Ivaldi <iaiva...@gmail.com> wrote:

> Hello,
>
> I'd like to know if this architecture is correct or not. We are studying
> Spark as our ETL engine, we have a UI designer for the graph, this give us
> a model that we want to translate in the corresponding Spark executions.
> What brings to us Akka FSM, Using same sparkContext for all actors, we
> suppose that parallelism will be supported by Spark(depending on
> configuration of course) and Akka, each node Actor will be executed only if
> his related in edges are completed.
> It's that correct or there is any simple way to do it? I took a look to
> Graphx but It's looks like is more for joins than manipulation or maybe
> I've overlook something
>
> regards
> --
> Ing. Ivaldi Andres
>



-- 
Ing. Ivaldi Andres


GoogleAnalytics GAData

2016-01-29 Thread Andrés Ivaldi
Hello, I'm using the Google API to retrieve Google Analytics JSON.

I'd like to use Spark to load the JSON, but toString truncates the value. I
could save it to disk and then read it back, but then I'm losing performance;
is there any other way?
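
One option, sketched assuming Spark 1.x where the JSON reader accepts an RDD[String] (gaJson stands for the raw response string from the Google Analytics client): hand the string straight to Spark instead of going through disk.

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

def loadGaJson(sc: SparkContext, sqlContext: SQLContext, gaJson: String) = {
  val jsonRdd = sc.parallelize(Seq(gaJson))   // one element per JSON document
  sqlContext.read.json(jsonRdd)               // DataFrame with the schema inferred in memory
}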

Regards



-- 
Ing. Ivaldi Andres


Re: JSON to SQL

2016-01-28 Thread Andrés Ivaldi
Thanks for the tip; I had realized that and, in the end, I ended up using
explode as you said.

This is my attempt

var res = (df.explode("rows", "r") {
    l: WrappedArray[ArrayBuffer[String]] => l.toList
  }).select("r")
  .map { m => m.getList[Row](0) }

var u = res.map { m => Row.fromSeq(m.toSeq) }

var df1 = df.sqlContext.createDataFrame(u, getScheme(df))

It works OK, but it throws an invalid cast to Integer if the schema has some
IntegerType; it looks like a spark-csv bug, but I could work around it anyway.

Thanks for the help.


On Thu, Jan 28, 2016 at 7:43 PM, Mohammed Guller <moham...@glassbeam.com>
wrote:

> You don’t need Hive for that. The DataFrame class has a method  named
> explode, which provides the same functionality.
>
>
>
> Here is an example from the Spark API documentation:
>
> df.explode("words", "word"){words: String => words.split(" ")}
>
>
>
> The first argument to the explode method  is the name of the input column
> and the second argument is the name of the output column.
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>
>
>
> *From:* Andrés Ivaldi [mailto:iaiva...@gmail.com]
> *Sent:* Wednesday, January 27, 2016 7:17 PM
> *To:* Cheng, Hao
> *Cc:* Sahil Sareen; Al Pivonka; user
>
> *Subject:* Re: JSON to SQL
>
>
>
> I'm using DataFrames reading the JSON exactly as you say, and I can get
> the scheme from there. Reading the documentation, I realized that is
> possible to create Dynamically a Structure, so applying some
> transformations to the dataFrame plus the new structure I'll be able to
> save the JSON on my DBRM.
>
>
>
> For the flatten approach, you mentioned LateralView, do I need Hive DB for
> that? or just the Spark Hive Context? I saw some examples and that is
> exactly what I'm needing. Can you explain it a little bit more?
>
>
>
> Thanks
>
>
>
> On Wed, Jan 27, 2016 at 10:29 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>
> Have you ever try the DataFrame API like:
> sqlContext.read.json("/path/to/file.json"); the Spark SQL will auto infer
> the type/schema for you.
>
>
>
> And lateral view will help on the flatten issues,
>
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView,
> as well as the “a.b[0].c” format of expression.
>
>
>
>
>
> *From:* Andrés Ivaldi [mailto:iaiva...@gmail.com]
> *Sent:* Thursday, January 28, 2016 3:39 AM
> *To:* Sahil Sareen
> *Cc:* Al Pivonka; user
> *Subject:* Re: JSON to SQL
>
>
>
> I'm really brand new with Scala, but if I'm defining a case class then is
> becouse I know how is the json's structure is previously?
>
> If I'm able to define dinamicaly a case class from the JSON structure then
> even with spark I will be able to extract the data
>
>
>
> On Wed, Jan 27, 2016 at 4:01 PM, Sahil Sareen <sareen...@gmail.com> wrote:
>
> Isn't this just about defining a case class and using
> parse(json).extract[CaseClassName]  using Jackson?
>
> -Sahil
>
>
>
> On Wed, Jan 27, 2016 at 11:08 PM, Andrés Ivaldi <iaiva...@gmail.com>
> wrote:
>
> We dont have Domain Objects, its a service like a pipeline, data is read
> from source and they are saved it in relational Database
>
> I can read the structure from DataFrames, and do some transformations, I
> would prefer to do it with Spark to be consistent with the process
>
>
>
> On Wed, Jan 27, 2016 at 12:25 PM, Al Pivonka <alpivo...@gmail.com> wrote:
>
> Are you using an Relational Database?
>
> If so why not use a nojs DB ? then pull from it to your relational?
>
>
>
> Or utilize a library that understands Json structure like Jackson to
> obtain the data from the Json structure the persist the Domain Objects ?
>
>
>
> On Wed, Jan 27, 2016 at 9:45 AM, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>
> Sure,
>
> The Job is like an etl, but without interface, so I decide the rules of
> how the JSON will be saved into a SQL Table.
>
>
>
> I need to Flatten the hierarchies where is possible in case of list
> flatten also, nested objects Won't be processed by now
>
> {"a":1,"b":[2,3],"c"="Field", "d":[4,5,6,7,8] }
> {"a":11,"b":[22,33],"c"="Field1", "d":[44,55,66,77,88] }
> {"a":111,"b":[222,333],"c"="Field2", "d":[44,55,666,777,888] }
>
> I would like something like this on my SQL table
>
> ab  c

Re: spark-xml data source (com.databricks.spark.xml) not working with spark 1.6

2016-01-28 Thread Andrés Ivaldi
Hi, did you get it to work? Tomorrow I'll be using the XML parser as well, on
Windows 7; I'll let you know the results.

Regards,



On Thu, Jan 28, 2016 at 12:27 PM, Deenar Toraskar  wrote:

> Hi
>
> Anyone tried using spark-xml with spark 1.6. I cannot even get the sample
> book.xml file (wget
> https://github.com/databricks/spark-xml/raw/master/src/test/resources/books.xml
> ) working
> https://github.com/databricks/spark-xml
>
> scala> val df =
> sqlContext.read.format("com.databricks.spark.xml").load("books.xml")
>
>
> scala> df.count
>
> res4: Long = 0
>
>
> Anyone else facing the same issue?
>
>
> Deenar
>



-- 
Ing. Ivaldi Andres


Problems when applying schema to RDD

2016-01-28 Thread Andrés Ivaldi
Hello, I'm getting an exception when trying to apply a new schema to an RDD.

I'm reading a JSON with Databricks spark-csv v1.3.0.

After applying some transformations I have an RDD with String-typed columns.

Then, when I try to apply a schema where one of the fields is Integer, this
exception is raised:

16/01/28 17:38:14 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 4,
localhost): java.lang.ClassCastException: java.lang.String cannot be cast
to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41)
at
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getInt(rows.scala:221)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
Source)
at
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
at
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:354)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
at scala.collection.AbstractIterator.to(Iterator.scala:1194)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

The code I'm running is like

var res = (df.explode("rows", "r") {
    l: WrappedArray[ArrayBuffer[String]] => l.toList
  }).select("r")
  .map { m => m.getList[Row](0) }

var u = res.map { m => Row.fromSeq(m.toSeq) }

var df1 = df.sqlContext.createDataFrame(u, getScheme(df))
// if df1.show is called here, the exception is raised


getScheme returns the schema; the last column is IntegerType. If I change it
to StringType and then apply the cast like this, it works:

  df1.select(df1("ga:pageviews").cast(IntegerType)).show

The order of the fields in the structure seems to be OK.
I read that early versions of spark-csv had a similar issue.

Any ideas?

Regards!!!

Ing. Ivaldi Andres


Re: JSON to SQL

2016-01-27 Thread Andrés Ivaldi
I'm using DataFrames, reading the JSON exactly as you say, and I can get the
schema from there. Reading the documentation, I realized that it is possible
to create a structure dynamically, so applying some transformations to the
DataFrame plus the new structure, I'll be able to save the JSON into my
RDBMS.

For the flattening approach, you mentioned LateralView: do I need a Hive
database for that, or just the Spark HiveContext? I saw some examples and
that is exactly what I need. Can you explain it a little bit more?
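
Something like the sketch below is what I understand, assuming Spark 1.x: as far as I know a HiveContext only needs the spark-hive dependency, not a Hive installation, and it accepts the LATERAL VIEW syntax over a temp table ("d" is the array column from the sample JSON in this thread).

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("flatten"))
val hiveContext = new HiveContext(sc)

val df = hiveContext.read.json("/path/to/file.json")
df.registerTempTable("docs")

// one output row per element of the d array
val flattened = hiveContext.sql(
  "SELECT a, c, dElem FROM docs LATERAL VIEW explode(d) expl AS dElem")
flattened.show()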

Thanks

On Wed, Jan 27, 2016 at 10:29 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

> Have you ever try the DataFrame API like:
> sqlContext.read.json("/path/to/file.json"); the Spark SQL will auto infer
> the type/schema for you.
>
>
>
> And lateral view will help on the flatten issues,
>
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView,
> as well as the “a.b[0].c” format of expression.
>
>
>
>
>
> *From:* Andrés Ivaldi [mailto:iaiva...@gmail.com]
> *Sent:* Thursday, January 28, 2016 3:39 AM
> *To:* Sahil Sareen
> *Cc:* Al Pivonka; user
> *Subject:* Re: JSON to SQL
>
>
>
> I'm really brand new with Scala, but if I'm defining a case class then is
> becouse I know how is the json's structure is previously?
>
> If I'm able to define dinamicaly a case class from the JSON structure then
> even with spark I will be able to extract the data
>
>
>
> On Wed, Jan 27, 2016 at 4:01 PM, Sahil Sareen <sareen...@gmail.com> wrote:
>
> Isn't this just about defining a case class and using
> parse(json).extract[CaseClassName]  using Jackson?
>
> -Sahil
>
>
>
> On Wed, Jan 27, 2016 at 11:08 PM, Andrés Ivaldi <iaiva...@gmail.com>
> wrote:
>
> We dont have Domain Objects, its a service like a pipeline, data is read
> from source and they are saved it in relational Database
>
> I can read the structure from DataFrames, and do some transformations, I
> would prefer to do it with Spark to be consistent with the process
>
>
>
> On Wed, Jan 27, 2016 at 12:25 PM, Al Pivonka <alpivo...@gmail.com> wrote:
>
> Are you using an Relational Database?
>
> If so why not use a nojs DB ? then pull from it to your relational?
>
>
>
> Or utilize a library that understands Json structure like Jackson to
> obtain the data from the Json structure the persist the Domain Objects ?
>
>
>
> On Wed, Jan 27, 2016 at 9:45 AM, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>
> Sure,
>
> The Job is like an etl, but without interface, so I decide the rules of
> how the JSON will be saved into a SQL Table.
>
>
>
> I need to Flatten the hierarchies where is possible in case of list
> flatten also, nested objects Won't be processed by now
>
> {"a":1,"b":[2,3],"c"="Field", "d":[4,5,6,7,8] }
> {"a":11,"b":[22,33],"c"="Field1", "d":[44,55,66,77,88] }
> {"a":111,"b":[222,333],"c"="Field2", "d":[44,55,666,777,888] }
>
> I would like something like this on my SQL table
>
> ab  c d
>
> 12,3Field 4,5,6,7,8
>
> 11   22,33  Field1    44,55,66,77,88
>
> 111  222,333Field2444,555,,666,777,888
>
> Right now this is what i need
>
> I will later add more intelligence, like detection of list or nested
> objects and create relations in other tables.
>
>
>
>
>
>
>
> On Wed, Jan 27, 2016 at 11:25 AM, Al Pivonka <alpivo...@gmail.com> wrote:
>
> More detail is needed.
>
> Can you provide some context to the use-case ?
>
>
>
> On Wed, Jan 27, 2016 at 8:33 AM, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>
> Hello, I'm trying to Save a JSON filo into SQL table.
>
> If i try to do this directly the IlligalArgumentException is raised, I
> suppose this is beacouse JSON have a hierarchical structure, is that
> correct?
>
> If that is the problem, how can I flatten the JSON structure? The JSON
> structure to be processed would be unknow, so I need to do it
> programatically
>
> regards
>
> --
>
> Ing. Ivaldi Andres
>
>
>
>
>
> --
>
> Those who say it can't be done, are usually interrupted by those doing it.
>
>
>
> --
>
> Ing. Ivaldi Andres
>
>
>
>
>
> --
>
> Those who say it can't be done, are usually interrupted by those doing it.
>
>
>
> --
>
> Ing. Ivaldi Andres
>
>
>
>
>
>
> --
>
> Ing. Ivaldi Andres
>



-- 
Ing. Ivaldi Andres


Re: JSON to SQL

2016-01-27 Thread Andrés Ivaldi
We don't have domain objects; it's a service like a pipeline, where data is
read from the source and saved into a relational database.

I can read the structure from DataFrames and do some transformations; I
would prefer to do it with Spark to be consistent with the rest of the
process.



On Wed, Jan 27, 2016 at 12:25 PM, Al Pivonka <alpivo...@gmail.com> wrote:

> Are you using an Relational Database?
> If so why not use a nojs DB ? then pull from it to your relational?
>
> Or utilize a library that understands Json structure like Jackson to
> obtain the data from the Json structure the persist the Domain Objects ?
>
> On Wed, Jan 27, 2016 at 9:45 AM, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>
>> Sure,
>> The Job is like an etl, but without interface, so I decide the rules of
>> how the JSON will be saved into a SQL Table.
>>
>> I need to Flatten the hierarchies where is possible in case of list
>> flatten also, nested objects Won't be processed by now
>>
>> {"a":1,"b":[2,3],"c"="Field", "d":[4,5,6,7,8] }
>> {"a":11,"b":[22,33],"c"="Field1", "d":[44,55,66,77,88] }
>> {"a":111,"b":[222,333],"c"="Field2", "d":[44,55,666,777,888] }
>>
>> I would like something like this on my SQL table
>>
>> ab  c d
>> 12,3Field 4,5,6,7,8
>> 11   22,33  Field144,55,66,77,88
>> 111  222,333Field2444,555,,666,777,888
>>
>> Right now this is what i need
>>
>> I will later add more intelligence, like detection of list or nested
>> objects and create relations in other tables.
>>
>>
>>
>> On Wed, Jan 27, 2016 at 11:25 AM, Al Pivonka <alpivo...@gmail.com> wrote:
>>
>>> More detail is needed.
>>> Can you provide some context to the use-case ?
>>>
>>> On Wed, Jan 27, 2016 at 8:33 AM, Andrés Ivaldi <iaiva...@gmail.com>
>>> wrote:
>>>
>>>> Hello, I'm trying to Save a JSON filo into SQL table.
>>>>
>>>> If i try to do this directly the IlligalArgumentException is raised, I
>>>> suppose this is beacouse JSON have a hierarchical structure, is that
>>>> correct?
>>>>
>>>> If that is the problem, how can I flatten the JSON structure? The JSON
>>>> structure to be processed would be unknow, so I need to do it
>>>> programatically
>>>>
>>>> regards
>>>>
>>>> --
>>>> Ing. Ivaldi Andres
>>>>
>>>
>>>
>>>
>>> --
>>> Those who say it can't be done, are usually interrupted by those doing
>>> it.
>>>
>>
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>
>
>
> --
> Those who say it can't be done, are usually interrupted by those doing it.
>



-- 
Ing. Ivaldi Andres


Re: JSON to SQL

2016-01-27 Thread Andrés Ivaldi
I'm really brand new to Scala, but if I'm defining a case class, isn't that
because I already know what the JSON's structure is beforehand?

If I were able to define a case class dynamically from the JSON structure,
then even with Spark I would be able to extract the data.


On Wed, Jan 27, 2016 at 4:01 PM, Sahil Sareen <sareen...@gmail.com> wrote:

> Isn't this just about defining a case class and using
> parse(json).extract[CaseClassName]  using Jackson?
>
> -Sahil
>
> On Wed, Jan 27, 2016 at 11:08 PM, Andrés Ivaldi <iaiva...@gmail.com>
> wrote:
>
>> We dont have Domain Objects, its a service like a pipeline, data is read
>> from source and they are saved it in relational Database
>>
>> I can read the structure from DataFrames, and do some transformations, I
>> would prefer to do it with Spark to be consistent with the process
>>
>>
>>
>> On Wed, Jan 27, 2016 at 12:25 PM, Al Pivonka <alpivo...@gmail.com> wrote:
>>
>>> Are you using an Relational Database?
>>> If so why not use a nojs DB ? then pull from it to your relational?
>>>
>>> Or utilize a library that understands Json structure like Jackson to
>>> obtain the data from the Json structure the persist the Domain Objects ?
>>>
>>> On Wed, Jan 27, 2016 at 9:45 AM, Andrés Ivaldi <iaiva...@gmail.com>
>>> wrote:
>>>
>>>> Sure,
>>>> The Job is like an etl, but without interface, so I decide the rules of
>>>> how the JSON will be saved into a SQL Table.
>>>>
>>>> I need to Flatten the hierarchies where is possible in case of list
>>>> flatten also, nested objects Won't be processed by now
>>>>
>>>> {"a":1,"b":[2,3],"c"="Field", "d":[4,5,6,7,8] }
>>>> {"a":11,"b":[22,33],"c"="Field1", "d":[44,55,66,77,88] }
>>>> {"a":111,"b":[222,333],"c"="Field2", "d":[44,55,666,777,888] }
>>>>
>>>> I would like something like this on my SQL table
>>>>
>>>> ab  c d
>>>> 12,3Field 4,5,6,7,8
>>>> 11   22,33  Field144,55,66,77,88
>>>> 111  222,333Field2444,555,,666,777,888
>>>>
>>>> Right now this is what i need
>>>>
>>>> I will later add more intelligence, like detection of list or nested
>>>> objects and create relations in other tables.
>>>>
>>>>
>>>>
>>>> On Wed, Jan 27, 2016 at 11:25 AM, Al Pivonka <alpivo...@gmail.com>
>>>> wrote:
>>>>
>>>>> More detail is needed.
>>>>> Can you provide some context to the use-case ?
>>>>>
>>>>> On Wed, Jan 27, 2016 at 8:33 AM, Andrés Ivaldi <iaiva...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello, I'm trying to Save a JSON filo into SQL table.
>>>>>>
>>>>>> If i try to do this directly the IlligalArgumentException is raised,
>>>>>> I suppose this is beacouse JSON have a hierarchical structure, is that
>>>>>> correct?
>>>>>>
>>>>>> If that is the problem, how can I flatten the JSON structure? The
>>>>>> JSON structure to be processed would be unknow, so I need to do it
>>>>>> programatically
>>>>>>
>>>>>> regards
>>>>>>
>>>>>> --
>>>>>> Ing. Ivaldi Andres
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Those who say it can't be done, are usually interrupted by those doing
>>>>> it.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Ing. Ivaldi Andres
>>>>
>>>
>>>
>>>
>>> --
>>> Those who say it can't be done, are usually interrupted by those doing
>>> it.
>>>
>>
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>
>


-- 
Ing. Ivaldi Andres


Re: JSON to SQL

2016-01-27 Thread Andrés Ivaldi
Sure,
The job is like an ETL, but without a user interface, so I decide the rules
for how the JSON will be saved into a SQL table.

I need to flatten the hierarchies where possible; in the case of lists,
flatten them as well. Nested objects won't be processed for now.

{"a":1,"b":[2,3],"c"="Field", "d":[4,5,6,7,8] }
{"a":11,"b":[22,33],"c"="Field1", "d":[44,55,66,77,88] }
{"a":111,"b":[222,333],"c"="Field2", "d":[44,55,666,777,888] }

I would like something like this on my SQL table

a     b         c       d
1     2,3       Field   4,5,6,7,8
11    22,33     Field1  44,55,66,77,88
111   222,333   Field2  444,555,666,777,888

Right now this is what I need.

Later I will add more intelligence, like detecting lists or nested
objects and creating relations in other tables.
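
A sketch of the list-flattening rule above, assuming Spark 1.x (column names follow the sample JSON; everything else is illustrative): join the array columns into comma-separated strings so each document becomes one plain row.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

def flattenLists(df: DataFrame): DataFrame = {
  // arrays of JSON numbers are inferred as array<bigint>, i.e. Seq[Long] in the UDF
  val joinArray = udf((xs: Seq[Long]) => if (xs == null) null else xs.mkString(","))
  df.select(df("a"), joinArray(df("b")).as("b"), df("c"), joinArray(df("d")).as("d"))
}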



On Wed, Jan 27, 2016 at 11:25 AM, Al Pivonka <alpivo...@gmail.com> wrote:

> More detail is needed.
> Can you provide some context to the use-case ?
>
> On Wed, Jan 27, 2016 at 8:33 AM, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>
>> Hello, I'm trying to Save a JSON filo into SQL table.
>>
>> If i try to do this directly the IlligalArgumentException is raised, I
>> suppose this is beacouse JSON have a hierarchical structure, is that
>> correct?
>>
>> If that is the problem, how can I flatten the JSON structure? The JSON
>> structure to be processed would be unknow, so I need to do it
>> programatically
>>
>> regards
>>
>> --
>> Ing. Ivaldi Andres
>>
>
>
>
> --
> Those who say it can't be done, are usually interrupted by those doing it.
>



-- 
Ing. Ivaldi Andres


JSON to SQL

2016-01-27 Thread Andrés Ivaldi
Hello, I'm trying to save a JSON file into a SQL table.

If I try to do this directly, an IllegalArgumentException is raised; I
suppose this is because JSON has a hierarchical structure, is that correct?

If that is the problem, how can I flatten the JSON structure? The JSON
structure to be processed would be unknown, so I need to do it
programmatically.

regards

-- 
Ing. Ivaldi Andres


Re: Low Latency SQL query

2015-12-01 Thread Andrés Ivaldi
Yes,
The use case would be:
Have Spark in a service (I didn't investigate this yet); through API calls
to this service we perform some aggregations over data in SQL. We are
already doing this with an internal development.

Nothing complicated, for instance a table with Product, Product Family,
cost, price, etc., with columns acting as dimensions and measures.

I want Spark to query that table and perform a kind of rollup, with cost
as the measure and Product, Product Family as the dimensions.

Only 3 columns, and it takes around 20s to perform that query and the
aggregation; the same query directly against the database, grouping on those
columns, takes around 1s.
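
The aggregation itself would look roughly like the sketch below, assuming Spark 1.4+ where DataFrame has rollup (productsDf stands for the table loaded from the database):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sum

def rollupCost(productsDf: DataFrame): DataFrame =
  productsDf
    .rollup("productFamily", "product")    // hierarchy: family -> product
    .agg(sum("cost").as("totalCost"))

If the same table is queried repeatedly, caching productsDf after the first load should help more than tuning the aggregation itself.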

regards



On Tue, Dec 1, 2015 at 5:38 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> can you elaborate more on the use case?
>
> > On 01 Dec 2015, at 20:51, Andrés Ivaldi <iaiva...@gmail.com> wrote:
> >
> > Hi,
> >
> > I'd like to use spark to perform some transformations over data stored
> inSQL, but I need low Latency, I'm doing some test and I run into spark
> context creation and data query over SQL takes too long time.
> >
> > Any idea for speed up the process?
> >
> > regards.
> >
> > --
> > Ing. Ivaldi Andres
>



-- 
Ing. Ivaldi Andres


Re: Low Latency SQL query

2015-12-01 Thread Andrés Ivaldi
Thanks Jörn, I didn't expect Spark to be faster than SQL, just not that
slow.

We are tempted to use Spark as our hub of sources; that way we can access
the different data sources through it and normalize them. Currently we are
saving the data in SQL because of Spark's latency, but the best thing would
be to execute directly over Spark.
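
One thing that might help, sketched assuming the Spark 1.4+ JDBC data source with the SQL Server driver on the classpath (URL, table and credentials are placeholders): push the grouping down to the database by passing a subquery as dbtable, so Spark only receives the already-aggregated rows.

import org.apache.spark.sql.SQLContext

def aggregatedFromSql(sqlContext: SQLContext) =
  sqlContext.read
    .format("jdbc")
    .option("url", "jdbc:sqlserver://host:1433;databaseName=sales")
    .option("dbtable",
      "(SELECT productFamily, product, SUM(cost) AS totalCost " +
      " FROM products GROUP BY productFamily, product) AS agg")
    .option("user", "myUser")
    .option("password", "myPassword")
    .load()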


On Tue, Dec 1, 2015 at 9:05 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Hmm it will never be faster than SQL if you use SQL as an underlying
> storage. Spark is (currently) an in-memory batch engine for iterative
> machine learning workloads. It is not designed for interactive queries.
> Currently hive is going into the direction of interactive queries.
> Alternatives are Hbase on Phoenix or Impala.
>
> On 01 Dec 2015, at 21:58, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>
> Yes,
> The use case would be,
> Have spark in a service (I didnt invertigate this yet), through api calls
> of this service we perform some aggregations over data in SQL, We are
> already doing this with an internal development
>
> Nothing complicated, for instance, a table with Product, Product Family,
> cost, price, etc. Columns like Dimension and Measures,
>
> I want to Spark for query that table to perform a kind of rollup, with
> cost as Measure and Prodcut, Product Family as Dimension
>
> Only 3 columns, it takes like 20s to perform that query and the
> aggregation, the  query directly to the database with a grouping at the
> columns takes like 1s
>
> regards
>
>
>
> On Tue, Dec 1, 2015 at 5:38 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> can you elaborate more on the use case?
>>
>> > On 01 Dec 2015, at 20:51, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I'd like to use spark to perform some transformations over data stored
>> inSQL, but I need low Latency, I'm doing some test and I run into spark
>> context creation and data query over SQL takes too long time.
>> >
>> > Any idea for speed up the process?
>> >
>> > regards.
>> >
>> > --
>> > Ing. Ivaldi Andres
>>
>
>
>
> --
> Ing. Ivaldi Andres
>
>


-- 
Ing. Ivaldi Andres


Re: Low Latency SQL query

2015-12-01 Thread Andrés Ivaldi
OK, so the latency problem comes from using SQL as the source? What about
CSV, Hive, or another source?

On Tue, Dec 1, 2015 at 9:18 PM, Mark Hamstra <m...@clearstorydata.com>
wrote:

> It is not designed for interactive queries.
>
>
> You might want to ask the designers of Spark, Spark SQL, and particularly
> some things built on top of Spark (such as BlinkDB) about their intent with
> regard to interactive queries.  Interactive queries are not the only
> designed use of Spark, but it is going too far to claim that Spark is not
> designed at all to handle interactive queries.
>
> That being said, I think that you are correct to question the wisdom of
> expecting lowest-latency query response from Spark using SQL (sic,
> presumably a RDBMS is intended) as the datastore.
>
> On Tue, Dec 1, 2015 at 4:05 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Hmm it will never be faster than SQL if you use SQL as an underlying
>> storage. Spark is (currently) an in-memory batch engine for iterative
>> machine learning workloads. It is not designed for interactive queries.
>> Currently hive is going into the direction of interactive queries.
>> Alternatives are Hbase on Phoenix or Impala.
>>
>> On 01 Dec 2015, at 21:58, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>>
>> Yes,
>> The use case would be,
>> Have spark in a service (I didnt invertigate this yet), through api calls
>> of this service we perform some aggregations over data in SQL, We are
>> already doing this with an internal development
>>
>> Nothing complicated, for instance, a table with Product, Product Family,
>> cost, price, etc. Columns like Dimension and Measures,
>>
>> I want to Spark for query that table to perform a kind of rollup, with
>> cost as Measure and Prodcut, Product Family as Dimension
>>
>> Only 3 columns, it takes like 20s to perform that query and the
>> aggregation, the  query directly to the database with a grouping at the
>> columns takes like 1s
>>
>> regards
>>
>>
>>
>> On Tue, Dec 1, 2015 at 5:38 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> can you elaborate more on the use case?
>>>
>>> > On 01 Dec 2015, at 20:51, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>>> >
>>> > Hi,
>>> >
>>> > I'd like to use spark to perform some transformations over data stored
>>> inSQL, but I need low Latency, I'm doing some test and I run into spark
>>> context creation and data query over SQL takes too long time.
>>> >
>>> > Any idea for speed up the process?
>>> >
>>> > regards.
>>> >
>>> > --
>>> > Ing. Ivaldi Andres
>>>
>>
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>>
>


-- 
Ing. Ivaldi Andres


Re: Low Latency SQL query

2015-12-01 Thread Andrés Ivaldi
Mark, we have an application that uses data from different kinds of sources,
and we built an engine able to handle that, but it can't scale to big data
(we could make it scale, but that is too time-expensive), it doesn't have a
machine learning module, and so on. We came across Spark and it looks like it
has all we need; actually it does, but our current latency is very low, and
when we did some testing Spark took too long for the same kind of results,
always against the RDBMS, which is our primary source.

So, we want to expand our sources to CSV, web services, big data, etc. We can
either extend our engine or use something like Spark, which gives us the
power of clustering, access to different kinds of sources, streaming, machine
learning, easy extensibility and so on.

On Tue, Dec 1, 2015 at 9:36 PM, Mark Hamstra <m...@clearstorydata.com>
wrote:

> I'd ask another question first: If your SQL query can be executed in a
> performant fashion against a conventional (RDBMS?) database, why are you
> trying to use Spark?  How you answer that question will be the key to
> deciding among the engineering design tradeoffs to effectively use Spark or
> some other solution.
>
> On Tue, Dec 1, 2015 at 4:23 PM, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>
>> Ok, so latency problem is being generated because I'm using SQL as
>> source? how about csv, hive, or another source?
>>
>> On Tue, Dec 1, 2015 at 9:18 PM, Mark Hamstra <m...@clearstorydata.com>
>> wrote:
>>
>>> It is not designed for interactive queries.
>>>
>>>
>>> You might want to ask the designers of Spark, Spark SQL, and
>>> particularly some things built on top of Spark (such as BlinkDB) about
>>> their intent with regard to interactive queries.  Interactive queries are
>>> not the only designed use of Spark, but it is going too far to claim that
>>> Spark is not designed at all to handle interactive queries.
>>>
>>> That being said, I think that you are correct to question the wisdom of
>>> expecting lowest-latency query response from Spark using SQL (sic,
>>> presumably a RDBMS is intended) as the datastore.
>>>
>>> On Tue, Dec 1, 2015 at 4:05 PM, Jörn Franke <jornfra...@gmail.com>
>>> wrote:
>>>
>>>> Hmm it will never be faster than SQL if you use SQL as an underlying
>>>> storage. Spark is (currently) an in-memory batch engine for iterative
>>>> machine learning workloads. It is not designed for interactive queries.
>>>> Currently hive is going into the direction of interactive queries.
>>>> Alternatives are Hbase on Phoenix or Impala.
>>>>
>>>> On 01 Dec 2015, at 21:58, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>>>>
>>>> Yes,
>>>> The use case would be,
>>>> Have spark in a service (I didnt invertigate this yet), through api
>>>> calls of this service we perform some aggregations over data in SQL, We are
>>>> already doing this with an internal development
>>>>
>>>> Nothing complicated, for instance, a table with Product, Product
>>>> Family, cost, price, etc. Columns like Dimension and Measures,
>>>>
>>>> I want to Spark for query that table to perform a kind of rollup, with
>>>> cost as Measure and Prodcut, Product Family as Dimension
>>>>
>>>> Only 3 columns, it takes like 20s to perform that query and the
>>>> aggregation, the  query directly to the database with a grouping at the
>>>> columns takes like 1s
>>>>
>>>> regards
>>>>
>>>>
>>>>
>>>> On Tue, Dec 1, 2015 at 5:38 PM, Jörn Franke <jornfra...@gmail.com>
>>>> wrote:
>>>>
>>>>> can you elaborate more on the use case?
>>>>>
>>>>> > On 01 Dec 2015, at 20:51, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>>>>> >
>>>>> > Hi,
>>>>> >
>>>>> > I'd like to use spark to perform some transformations over data
>>>>> stored inSQL, but I need low Latency, I'm doing some test and I run into
>>>>> spark context creation and data query over SQL takes too long time.
>>>>> >
>>>>> > Any idea for speed up the process?
>>>>> >
>>>>> > regards.
>>>>> >
>>>>> > --
>>>>> > Ing. Ivaldi Andres
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Ing. Ivaldi Andres
>>>>
>>>>
>>>
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>
>


-- 
Ing. Ivaldi Andres


Re: Re: OLAP query using spark dataframe with cassandra

2015-11-10 Thread Andrés Ivaldi
Hi,

We have been evaluating Apache Kylin; how flexible is it? I mean, we need
to create the cube structure dynamically and populate it from different
sources. The processing time is not too important; what matters is the
response time of the queries.

Thanks.

On Mon, Nov 9, 2015 at 11:01 PM, fightf...@163.com 
wrote:

> Hi,
>
> According to my experience, I would recommend option 3) using Apache Kylin
> for your requirements.
>
> This is a suggestion based on the open-source world.
>
> For the per cassandra thing, I accept your advice for the special support
> thing. But the community is very
>
> open and convinient for prompt response.
>
> --
> fightf...@163.com
>
>
> *From:* tsh 
> *Date:* 2015-11-10 02:56
> *To:* fightf...@163.com; user ; dev
> 
> *Subject:* Re: OLAP query using spark dataframe with cassandra
> Hi,
>
> I'm in the same position right now: we are going to implement something
> like OLAP BI + Machine Learning explorations on the same cluster.
> Well, the question is quite ambivalent: from one hand, we have terabytes
> of versatile data and the necessity to make something like cubes (Hive and
> Hive on HBase are unsatisfactory). From the other, our users get accustomed
> to Tableau + Vertica.
> So, right now I consider the following choices:
> 1) Platfora (not free, I don't know price right now) + Spark
> 2) AtScale + Tableau(not free, I don't know price right now) + Spark
> 3) Apache Kylin (young project?) + Spark on YARN + Kafka + Flume + some
> storage
> 4) Apache Phoenix + Apache HBase + Mondrian + Spark on YARN + Kafka +
> Flume (has somebody use it in production?)
> 5) Spark + Tableau  (cubes?)
>
> For myself, I decided not to dive into Mesos. Cassandra is hardly
> configurable, you'll have to dedicate special employee to support it.
>
> I'll be glad to hear other ideas & propositions as we are at the beginning
> of the process too.
>
> Sincerely yours, Tim Shenkao
>
> On 11/09/2015 09:46 AM, fightf...@163.com wrote:
>
> Hi,
>
> Thanks for suggesting. Actually we are now evaluating and stressing the
> spark sql on cassandra, while
>
> trying to define business models. FWIW, the solution mentioned here is
> different from traditional OLAP
>
> cube engine, right ? So we are hesitating on the common sense or direction
> choice of olap architecture.
>
> And we are happy to hear more use case from this community.
>
> Best,
> Sun.
>
> --
> fightf...@163.com
>
>
> *From:* Jörn Franke 
> *Date:* 2015-11-09 14:40
> *To:* fightf...@163.com
> *CC:* user ; dev 
> *Subject:* Re: OLAP query using spark dataframe with cassandra
>
> Is there any distributor supporting these software components in
> combination? If not, and your core business is not software, then you may
> want to look for something else, because it might not make sense to build
> up internal know-how in all of these areas.
>
> In any case, it all depends heavily on your data and queries. You will
> have to do your own experiments.
>
> On 09 Nov 2015, at 07:02, "fightf...@163.com"  wrote:
>
> Hi, community
>
> We are especially interested in this feature integration according to
> some slides from [1]. The SMACK stack (Spark+Mesos+Akka+Cassandra+Kafka)
> seems a good implementation of the lambda architecture in the open-source
> world, especially for non-Hadoop-based cluster environments. As we see it,
> the advantages are:
>
> 1. the feasibility and scalability of the Spark DataFrame API, which can
> also be a perfect complement to Apache Cassandra's native CQL features.
>
> 2. availability of both streaming and batch processing using the whole
> stack, which is cool.
>
> 3. we can achieve both compactness and usability for Spark with Cassandra,
> including seamless integration with job scheduling and resource
> management.
>
> Our only concern is the OLAP query performance issue, mainly caused by
> frequent aggregation work over large, daily-growing tables, for both
> Spark SQL and Cassandra. I can see that the use case in [1] uses FiloDB
> to achieve columnar storage and query performance, but we have no further
> knowledge of it.
>
> The question is: has anyone had such a use case, especially in a
> production environment? We would be interested in your architecture for
> designing this OLAP engine using Spark + Cassandra. What do you think of
> the comparison with a traditional OLAP cube design, like Apache Kylin or
> Pentaho Mondrian?
>
> Best Regards,
>
> Sun.
>
>
> [1]
> 
> http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark
>
> --
> 

Re: Re: OLAP query using spark dataframe with cassandra

2015-11-10 Thread Andrés Ivaldi
Hi,
Cassandra looks very interesting and it seems to fit well, but it looks
like it needs too much work to get a proper configuration, which depends
on the data. What we need to build is a generic structure with as little
configuration as possible, because the end users don't have the know-how
to do that.

Please let me know if we misinterpreted Cassandra, so we can take another
look at it.
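
For reference, the Spark-side code to read a Cassandra table as a DataFrame
is quite small; what worried us was the data modelling and cluster tuning,
not the API. A minimal sketch with the DataStax spark-cassandra-connector
follows; the host, keyspace and table names are made up, and it assumes the
connector package is on the classpath.

// Minimal sketch: reading a Cassandra table into a DataFrame via the
// spark-cassandra-connector. Host/keyspace/table names are hypothetical.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("cassandra-read-sketch")
  .set("spark.cassandra.connection.host", "127.0.0.1")

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

val salesDF = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "analytics", "table" -> "sales"))
  .load()

// From here on it is plain DataFrame work, the same as with any other source.
salesDF.groupBy("region").count().show()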

Best Regards!!


On Mon, Nov 9, 2015 at 11:11 PM, fightf...@163.com <fightf...@163.com>
wrote:

> Hi,
>
> Have you ever considered Cassandra as a replacement? We now have almost
> the same usage as your engine, e.g. using MySQL to store the initial
> aggregated data. Can you share more about your kind of cube queries?
> We are very interested in that architecture too :)
>
> Best,
> Sun.
> --
> fightf...@163.com
>
>
> *From:* Andrés Ivaldi <iaiva...@gmail.com>
> *Date:* 2015-11-10 07:03
> *To:* tsh <t...@timshenkao.su>
> *CC:* fightf...@163.com; user <user@spark.apache.org>; dev
> <d...@spark.apache.org>
> *Subject:* Re: OLAP query using spark dataframe with cassandra
> Hi,
> I'm also considering something similar. Plain Spark is too slow for my
> case. A possible solution is to use Spark as a multiple-source connector
> and basic transformation layer, then persist the information (currently
> into an RDBMS); after that, with our engine, we build a kind of cube
> query, and the result is processed again by Spark, adding Machine Learning.
> The missing part is replacing the RDBMS with something more suitable and
> scalable; we don't care about pre-processing the information if, after
> pre-processing, the queries are fast.
>
> Regards
>
> On Mon, Nov 9, 2015 at 3:56 PM, tsh <t...@timshenkao.su> wrote:
>
>> Hi,
>>
>> I'm in the same position right now: we are going to implement something
>> like OLAP BI + Machine Learning explorations on the same cluster.
>> Well, the question is quite ambivalent: on one hand, we have terabytes
>> of versatile data and the need to build something like cubes (Hive and
>> Hive on HBase are unsatisfactory). On the other, our users are accustomed
>> to Tableau + Vertica.
>> So, right now I'm considering the following choices:
>> 1) Platfora (not free, I don't know the price right now) + Spark
>> 2) AtScale + Tableau (not free, I don't know the price right now) + Spark
>> 3) Apache Kylin (young project?) + Spark on YARN + Kafka + Flume + some
>> storage
>> 4) Apache Phoenix + Apache HBase + Mondrian + Spark on YARN + Kafka +
>> Flume (has anybody used it in production?)
>> 5) Spark + Tableau (cubes?)
>>
>> For myself, I decided not to dive into Mesos. Cassandra is hard to
>> configure; you'll have to dedicate a special employee to support it.
>>
>> I'll be glad to hear other ideas & propositions as we are at the
>> beginning of the process too.
>>
>> Sincerely yours, Tim Shenkao
>>
>>
>> On 11/09/2015 09:46 AM, fightf...@163.com wrote:
>>
>> Hi,
>>
>> Thanks for the suggestions. Actually we are now evaluating and
>> stress-testing Spark SQL on Cassandra while trying to define the business
>> models. FWIW, the solution mentioned here is different from a traditional
>> OLAP cube engine, right? So we are hesitating about the general direction
>> of the OLAP architecture.
>>
>> And we are happy to hear more use cases from this community.
>>
>> Best,
>> Sun.
>>
>> --
>> fightf...@163.com
>>
>>
>> *From:* Jörn Franke <jornfra...@gmail.com>
>> *Date:* 2015-11-09 14:40
>> *To:* fightf...@163.com
>> *CC:* user <user@spark.apache.org>; dev <d...@spark.apache.org>
>> *Subject:* Re: OLAP query using spark dataframe with cassandra
>>
>> Is there any distributor supporting these software components in
>> combination? If not, and your core business is not software, then you may
>> want to look for something else, because it might not make sense to build
>> up internal know-how in all of these areas.
>>
>> In any case, it all depends heavily on your data and queries. You will
>> have to do your own experiments.
>>
>> On 09 Nov 2015, at 07:02, "fightf...@163.com" <fightf...@163.com> wrote:
>>
>> Hi, community
>>
>> We are especially interested in this feature integration according to
>> some slides from [1]. The SMACK stack (Spark+Mesos+Akka+Cassandra+Kafka)
>> seems a good implementation of the lambda architecture in the open-source
>> world, especially for non-Hadoop-based cluster 

Re: OLAP query using spark dataframe with cassandra

2015-11-09 Thread Andrés Ivaldi
Hi,
I'm also considering something similar. Plain Spark is too slow for my
case. A possible solution is to use Spark as a multiple-source connector
and basic transformation layer, then persist the information (currently
into an RDBMS); after that, with our engine, we build a kind of cube query,
and the result is processed again by Spark, adding Machine Learning.
The missing part is replacing the RDBMS with something more suitable and
scalable; we don't care about pre-processing the information if, after
pre-processing, the queries are fast.
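
To make that concrete, here is a rough sketch of the middle layer as we are
testing it: read from a couple of sources with Spark, apply a basic
transformation, and persist the result into the relational store over JDBC.
All URLs, credentials and table names below are placeholders, not our real
setup, and the flat-file source assumes the spark-csv package is available.

// Rough sketch of the "multiple sources -> basic transformation -> persist to
// RDBMS" layer. Connection details and table names are placeholders only.
import java.util.Properties
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.sum

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("etl-sketch"))
val sqlContext = new SQLContext(sc)

val jdbcUrl = "jdbc:sqlserver://db-host;databaseName=sales"   // placeholder
val props = new Properties()
props.setProperty("user", "etl_user")
props.setProperty("password", "secret")

// Source 1: a relational fact table read over JDBC
val facts = sqlContext.read.jdbc(jdbcUrl, "FactV", props)

// Source 2: a flat file, loaded with the spark-csv package (assumed available)
val rates = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/data/rates.csv")

// Basic transformation: join the sources and pre-aggregate
val aggregated = facts
  .join(rates, facts("Cod") === rates("Cod"))
  .groupBy(facts("Cod"))
  .agg(sum("cant").alias("total"))

// Persist into the relational store; the cube engine then queries this table.
aggregated.write.mode("append").jdbc(jdbcUrl, "AggV", props)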

Regards

On Mon, Nov 9, 2015 at 3:56 PM, tsh  wrote:

> Hi,
>
> I'm in the same position right now: we are going to implement something
> like OLAP BI + Machine Learning explorations on the same cluster.
> Well, the question is quite ambivalent: on one hand, we have terabytes
> of versatile data and the need to build something like cubes (Hive and
> Hive on HBase are unsatisfactory). On the other, our users are accustomed
> to Tableau + Vertica.
> So, right now I'm considering the following choices:
> 1) Platfora (not free, I don't know the price right now) + Spark
> 2) AtScale + Tableau (not free, I don't know the price right now) + Spark
> 3) Apache Kylin (young project?) + Spark on YARN + Kafka + Flume + some
> storage
> 4) Apache Phoenix + Apache HBase + Mondrian + Spark on YARN + Kafka +
> Flume (has anybody used it in production?)
> 5) Spark + Tableau (cubes?)
>
> For myself, I decided not to dive into Mesos. Cassandra is hard to
> configure; you'll have to dedicate a special employee to support it.
>
> I'll be glad to hear other ideas & propositions as we are at the beginning
> of the process too.
>
> Sincerely yours, Tim Shenkao
>
>
> On 11/09/2015 09:46 AM, fightf...@163.com wrote:
>
> Hi,
>
> Thanks for the suggestions. Actually we are now evaluating and
> stress-testing Spark SQL on Cassandra while trying to define the business
> models. FWIW, the solution mentioned here is different from a traditional
> OLAP cube engine, right? So we are hesitating about the general direction
> of the OLAP architecture.
>
> And we are happy to hear more use cases from this community.
>
> Best,
> Sun.
>
> --
> fightf...@163.com
>
>
> *From:* Jörn Franke 
> *Date:* 2015-11-09 14:40
> *To:* fightf...@163.com
> *CC:* user ; dev 
> *Subject:* Re: OLAP query using spark dataframe with cassandra
>
> Is there any distributor supporting these software components in
> combination? If not, and your core business is not software, then you may
> want to look for something else, because it might not make sense to build
> up internal know-how in all of these areas.
>
> In any case, it all depends heavily on your data and queries. You will
> have to do your own experiments.
>
> On 09 Nov 2015, at 07:02, "fightf...@163.com"  wrote:
>
> Hi, community
>
> We are especially interested in this feature integration according to
> some slides from [1]. The SMACK stack (Spark+Mesos+Akka+Cassandra+Kafka)
> seems a good implementation of the lambda architecture in the open-source
> world, especially for non-Hadoop-based cluster environments. As we see it,
> the advantages are:
>
> 1. the feasibility and scalability of the Spark DataFrame API, which can
> also be a perfect complement to Apache Cassandra's native CQL features.
>
> 2. availability of both streaming and batch processing using the whole
> stack, which is cool.
>
> 3. we can achieve both compactness and usability for Spark with Cassandra,
> including seamless integration with job scheduling and resource
> management.
>
> Our only concern is the OLAP query performance issue, mainly caused by
> frequent aggregation work over large, daily-growing tables, for both
> Spark SQL and Cassandra. I can see that the use case in [1] uses FiloDB
> to achieve columnar storage and query performance, but we have no further
> knowledge of it.
>
> The question is: has anyone had such a use case, especially in a
> production environment? We would be interested in your architecture for
> designing this OLAP engine using Spark + Cassandra. What do you think of
> the comparison with a traditional OLAP cube design, like Apache Kylin or
> Pentaho Mondrian?
>
> Best Regards,
>
> Sun.
>
>
> [1]
> 
> http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark
>
> --
> fightf...@163.com
>
>
>


-- 
Ing. Ivaldi Andres


Spark Analytics

2015-11-05 Thread Andrés Ivaldi
Hello, I'm a newbie in the Spark world. My team and I are analyzing Spark
as an integration framework between different sources. So far so good, but
it becomes slow when aggregations and calculations are applied to the RDD.

I'm using Spark standalone on Windows.

I'm running this example:
- Create a SparkContext and SQLContext, and issue a query against SQL like this:

  val df = sqc.read.format("jdbc")
    .options(Map("url" -> url,
                 "dbtable" -> "(select [Cod], cant from [FactV]) as t"))
    .load()
    .toDF("k1", "v1")

- Perform an aggregation like this:

  val cant = df.rollup("k1")
    .agg(sum("v1").alias("v1"))

It takes around 20s; if I execute this query directly in SQL it takes less
than one second.

What I've seen is that the expensive part is context creation, or at
least that is what I believe.
Is there any way to do it faster, or to reuse the context?

What we need is to perform cube-like aggregations over different
sources in real time; it's also possible to pre-process the data.
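
Two things that helped in similar tests here, sketched below with
placeholder names: create the context once and reuse it for every request
(context creation alone costs several seconds), and either push the
aggregation down into the source database through the dbtable subquery or
cache the DataFrame so the JDBC read is not repeated. This is only a sketch
under those assumptions, not a measured comparison:

// Sketch only: reuse one context, push the aggregation down, or cache.
// The JDBC url and table names are placeholders matching the snippet above.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.sum

// Build the context once at application start and keep it for all requests.
val sc  = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("agg"))
val sqc = new SQLContext(sc)
val url = "jdbc:sqlserver://db-host;databaseName=sales"   // placeholder

// Option A: let the database do the heavy lifting inside the dbtable subquery.
val preAgg = sqc.read.format("jdbc")
  .options(Map(
    "url" -> url,
    "dbtable" -> "(select [Cod], sum(cant) as cant from [FactV] group by [Cod]) as t"))
  .load()
  .toDF("k1", "v1")

// Option B: read once, cache, then run many rollups against the cached data.
val df = sqc.read.format("jdbc")
  .options(Map("url" -> url, "dbtable" -> "(select [Cod], cant from [FactV]) as t"))
  .load()
  .toDF("k1", "v1")
  .cache()

val cant = df.rollup("k1").agg(sum("v1").alias("v1"))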

kind regards,


-- 
Ing. Ivaldi Andres