Re: PhysicalRDD problem?

2014-12-10 Thread Michael Armbrust
I'm hesitant to merge that PR in, as it is using a brand-new configuration
path that is different from the way the rest of Spark SQL / Spark is
configured.
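
For reference, the usual path is a SQLConf setting on the context, set either
programmatically or through SQL. A small illustration with an existing option
(illustration only; the PR's new option would presumably follow the same pattern):

// Existing Spark SQL options are read from SQLConf; these two lines are
// equivalent ways of setting one (using a real option as an example).
sqlContext.setConf("spark.sql.shuffle.partitions", "200")
sqlContext.sql("SET spark.sql.shuffle.partitions=200")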

I suspect that hitting max iterations is symptomatic of some other issue, as
resolution typically happens bottom-up, in a single pass.  Can you provide more
details about your schema / query?
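
For example, printing the plan for the failing query usually shows where
resolution gets stuck. A rough sketch (the table name is a placeholder; the
column names are taken from the error quoted below):

// Rough sketch: inspect the parsed vs. analyzed plan of the failing query.
// "subscriber_stats" is a placeholder table name; adjust to the real query.
val result = sqlContext.sql("SELECT T1.SP, T1.DOWN_BYTESHTTPSUBCR FROM subscriber_stats T1")
println(result.queryExecution.logical)   // parsed plan, before resolution
println(result.queryExecution.analyzed)  // analyzed plan; this is the step that fails below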

On Tue, Dec 9, 2014 at 11:43 PM, Nitin Goyal  wrote:

> I see that somebody has already raised a PR for this, but it hasn't been
> merged.
>
> https://issues.apache.org/jira/browse/SPARK-4339
>
> Can we merge this in the next 1.2 RC?
>
> Thanks
> -Nitin
>
>
> On Wed, Dec 10, 2014 at 11:50 AM, Nitin Goyal 
> wrote:
>
>> Hi Michael,
>>
>> I think I have found the exact problem in my case. I see that we have
>> written something like following in Analyzer.scala :-
>>
>>   // TODO: pass this in as a parameter.
>>   val fixedPoint = FixedPoint(100)
>>
>> and
>>
>> Batch("Resolution", fixedPoint,
>>   ResolveReferences ::
>>   ResolveRelations ::
>>   ResolveSortReferences ::
>>   NewRelationInstances ::
>>   ImplicitGenerate ::
>>   StarExpansion ::
>>   ResolveFunctions ::
>>   GlobalAggregates ::
>>   UnresolvedHavingClauseAttributes ::
>>   TrimGroupingAliases ::
>>   typeCoercionRules ++
>>   extendedRules : _*),
>>
>> Perhaps in my case, it reaches the 100 iterations and breaks out of the
>> while loop in RuleExecutor.scala, and thus doesn't "resolve" all the attributes.
>>
>> Exception in my logs :-
>>
>> 14/12/10 04:45:28 INFO HiveContext$$anon$4: Max iterations (100) reached
>> for batch Resolution
>>
>> 14/12/10 04:45:28 ERROR [Sql]: Servlet.service() for servlet [Sql] in
>> context with path [] threw exception [Servlet execution threw an exception]
>> with root cause
>>
>> org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
>> Unresolved attributes: 'T1.SP AS SP#6566,'T1.DOWN_BYTESHTTPSUBCR AS
>> DOWN_BYTESHTTPSUBCR#6567, tree:
>>
>> 'Project ['T1.SP AS SP#6566,'T1.DOWN_BYTESHTTPSUBCR AS
>> DOWN_BYTESHTTPSUBCR#6567]
>>
>> ...
>>
>> ...
>>
>> ...
>>
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:80)
>>
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:78)
>>
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
>>
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
>>
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:78)
>>
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:76)
>>
>> at
>> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
>>
>> at
>> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
>>
>> at
>> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>>
>> at
>> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
>>
>> at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
>>
>> at
>> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
>>
>> at
>> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
>>
>> at scala.collection.immutable.List.foreach(List.scala:318)
>>
>> at
>> org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
>>
>> at
>> org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
>>
>> at
>> org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
>>
>> at
>> org.apache.spark.sql.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:86)
>>
>> at
>> org.apache.spark.sql.CacheManager$class.writeLock(CacheManager.scala:67)
>>
>> at
>> org.apache.spark.sql.CacheManager$class.cacheQuery(CacheManager.scala:85)
>>
>> at org.apache.spark.sql.SQLContext.cacheQuery(SQLContext.scala:50)
>>
>>  at org.apache.spark.sql.SchemaRDD.cache(SchemaRDD.scala:490)
>>
>>
>> I think the solution here is to make the FixedPoint constructor argument
>> configurable/parameterized (as the TODO already suggests). Do we have a plan
>> to do this in the 1.2 release? Or I can take this up as a task myself if you
>> want (since this is very crucial for our release).
>>
>>
>> Thanks
>>
>> -Nitin
>>
>> On Wed, Dec 10, 2014 at 1:06 AM, Michael Armbrust wrote:
>>
>>> val newSchemaRDD = sqlContext.applySchema(existingSchemaRDD,
>>> existingSchemaRDD.schema)
>>>
>>> This line is throwing away the logical information about
>>> existingSchemaRDD and thus Spark SQL can't know how to push down
>>> projections or predicates past this operator.
>>>
>>> Can you describe in more detail the problems you see if you don't do this
>>> reapplication of the schema?

Re: PhysicalRDD problem?

2014-12-09 Thread Nitin Goyal
I see that somebody has already raised a PR for this, but it hasn't been
merged.

https://issues.apache.org/jira/browse/SPARK-4339

Can we merge this in the next 1.2 RC?

Thanks
-Nitin


On Wed, Dec 10, 2014 at 11:50 AM, Nitin Goyal  wrote:

> Hi Michael,
>
> I think I have found the exact problem in my case. I see that we have
> written something like following in Analyzer.scala :-
>
>   // TODO: pass this in as a parameter.
>   val fixedPoint = FixedPoint(100)
>
> and
>
> Batch("Resolution", fixedPoint,
>   ResolveReferences ::
>   ResolveRelations ::
>   ResolveSortReferences ::
>   NewRelationInstances ::
>   ImplicitGenerate ::
>   StarExpansion ::
>   ResolveFunctions ::
>   GlobalAggregates ::
>   UnresolvedHavingClauseAttributes ::
>   TrimGroupingAliases ::
>   typeCoercionRules ++
>   extendedRules : _*),
>
> Perhaps in my case, it reaches the 100 iterations and breaks out of the
> while loop in RuleExecutor.scala, and thus doesn't "resolve" all the attributes.
>
> Exception in my logs :-
>
> 14/12/10 04:45:28 INFO HiveContext$$anon$4: Max iterations (100) reached
> for batch Resolution
>
> 14/12/10 04:45:28 ERROR [Sql]: Servlet.service() for servlet [Sql] in
> context with path [] threw exception [Servlet execution threw an exception]
> with root cause
>
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
> attributes: 'T1.SP AS SP#6566,'T1.DOWN_BYTESHTTPSUBCR AS
> DOWN_BYTESHTTPSUBCR#6567, tree:
>
> 'Project ['T1.SP AS SP#6566,'T1.DOWN_BYTESHTTPSUBCR AS
> DOWN_BYTESHTTPSUBCR#6567]
>
> ...
>
> ...
>
> ...
>
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:80)
>
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:78)
>
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
>
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
>
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:78)
>
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:76)
>
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
>
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
>
> at
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>
> at
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
>
> at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
>
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
>
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
>
> at scala.collection.immutable.List.foreach(List.scala:318)
>
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
>
> at
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
>
> at
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
>
> at
> org.apache.spark.sql.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:86)
>
> at org.apache.spark.sql.CacheManager$class.writeLock(CacheManager.scala:67)
>
> at
> org.apache.spark.sql.CacheManager$class.cacheQuery(CacheManager.scala:85)
>
> at org.apache.spark.sql.SQLContext.cacheQuery(SQLContext.scala:50)
>
>  at org.apache.spark.sql.SchemaRDD.cache(SchemaRDD.scala:490)
>
>
> I think the solution here is to make the FixedPoint constructor argument
> configurable/parameterized (as the TODO already suggests). Do we have a plan
> to do this in the 1.2 release? Or I can take this up as a task myself if you
> want (since this is very crucial for our release).
>
>
> Thanks
>
> -Nitin
>
> On Wed, Dec 10, 2014 at 1:06 AM, Michael Armbrust 
> wrote:
>
>>> val newSchemaRDD = sqlContext.applySchema(existingSchemaRDD,
>>> existingSchemaRDD.schema)
>>
>> This line is throwing away the logical information about
>> existingSchemaRDD and thus Spark SQL can't know how to push down
>> projections or predicates past this operator.
>>
>> Can you describe in more detail the problems you see if you don't do this
>> reapplication of the schema?
>>
>
>
>
> --
> Regards
> Nitin Goyal
>



-- 
Regards
Nitin Goyal


Re: PhysicalRDD problem?

2014-12-09 Thread Nitin Goyal
Hi Michael,

I think I have found the exact problem in my case. I see that we have
written something like following in Analyzer.scala :-

  // TODO: pass this in as a parameter.
  val fixedPoint = FixedPoint(100)

and

Batch("Resolution", fixedPoint,
  ResolveReferences ::
  ResolveRelations ::
  ResolveSortReferences ::
  NewRelationInstances ::
  ImplicitGenerate ::
  StarExpansion ::
  ResolveFunctions ::
  GlobalAggregates ::
  UnresolvedHavingClauseAttributes ::
  TrimGroupingAliases ::
  typeCoercionRules ++
  extendedRules : _*),

Perhaps in my case, it reaches the 100 iterations and breaks out of the while
loop in RuleExecutor.scala, and thus doesn't "resolve" all the attributes.
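
As a toy illustration of that failure mode (just the fixed-point idea, not
Catalyst's actual code): once the iteration budget runs out, the loop returns
whatever it has, resolved or not:

object FixedPointDemo {
  // Apply `step` repeatedly until the value stops changing or we run out of iterations.
  def fixedPoint[A](start: A, maxIterations: Int)(step: A => A): A = {
    var current = start
    var iterations = 0
    var converged = false
    while (!converged && iterations < maxIterations) {
      val next = step(current)
      if (next == current) converged = true
      else { current = next; iterations += 1 }
    }
    if (!converged)
      println(s"Max iterations ($maxIterations) reached")  // analogous to the log line below
    current
  }

  def main(args: Array[String]): Unit = {
    // Toy "resolution": halve until 0. With a small budget we stop early, leaving work undone.
    println(fixedPoint(1 << 20, maxIterations = 5)(_ / 2))    // 32768 -- gave up early
    println(fixedPoint(1 << 20, maxIterations = 100)(_ / 2))  // 0 -- reached the fixed point
  }
}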

Exception in my logs :-

14/12/10 04:45:28 INFO HiveContext$$anon$4: Max iterations (100) reached
for batch Resolution

14/12/10 04:45:28 ERROR [Sql]: Servlet.service() for servlet [Sql] in
context with path [] threw exception [Servlet execution threw an exception]
with root cause

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
attributes: 'T1.SP AS SP#6566,'T1.DOWN_BYTESHTTPSUBCR AS
DOWN_BYTESHTTPSUBCR#6567, tree:

'Project ['T1.SP AS SP#6566,'T1.DOWN_BYTESHTTPSUBCR AS
DOWN_BYTESHTTPSUBCR#6567]

...

...

...

at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:80)

at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:78)

at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)

at
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)

at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:78)

at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:76)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)

at
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)

at
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)

at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)

at scala.collection.immutable.List.foreach(List.scala:318)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)

at
org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)

at
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)

at
org.apache.spark.sql.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:86)

at org.apache.spark.sql.CacheManager$class.writeLock(CacheManager.scala:67)

at org.apache.spark.sql.CacheManager$class.cacheQuery(CacheManager.scala:85)

at org.apache.spark.sql.SQLContext.cacheQuery(SQLContext.scala:50)

 at org.apache.spark.sql.SchemaRDD.cache(SchemaRDD.scala:490)


I think the solution here is to make the FixedPoint constructor argument
configurable/parameterized (as the TODO already suggests). Do we have a plan to
do this in the 1.2 release? Or I can take this up as a task myself if you want
(since this is very crucial for our release).
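
Roughly what I have in mind is something like the following (a hypothetical
sketch only, not the actual SPARK-4339 patch; the rule batches would stay
exactly as quoted above):

// Hypothetical sketch: thread the limit through the Analyzer constructor
// instead of hard-coding it, keeping today's value as the default.
class Analyzer(catalog: Catalog,
               registry: FunctionRegistry,
               caseSensitive: Boolean,
               maxIterations: Int = 100)   // new, configurable parameter
  extends RuleExecutor[LogicalPlan] with HiveTypeCoercion {

  // The TODO goes away: the limit is now passed in as a parameter.
  val fixedPoint = FixedPoint(maxIterations)

  // batches ("Resolution" etc.) unchanged, still built with fixedPoint
}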


Thanks

-Nitin

On Wed, Dec 10, 2014 at 1:06 AM, Michael Armbrust 
wrote:

>> val newSchemaRDD = sqlContext.applySchema(existingSchemaRDD,
>> existingSchemaRDD.schema)
>
> This line is throwing away the logical information about existingSchemaRDD
> and thus Spark SQL can't know how to push down projections or predicates
> past this operator.
>
> Can you describe in more detail the problems you see if you don't do this
> reapplication of the schema?
>



-- 
Regards
Nitin Goyal


Re: PhysicalRDD problem?

2014-12-09 Thread Michael Armbrust
>
> val newSchemaRDD = sqlContext.applySchema(existingSchemaRDD,
> existingSchemaRDD.schema)
>

This line is throwing away the logical information about existingSchemaRDD
and thus Spark SQL can't know how to push down projections or predicates
past this operator.
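
For illustration, a minimal sketch of the contrast (placeholder source and
names; Spark 1.2-era APIs assumed):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Minimal sketch with placeholder names/paths.
val sc = new SparkContext("local", "applySchema-demo")
val sqlContext = new SQLContext(sc)

val existingSchemaRDD = sqlContext.jsonFile("people.json")   // any SchemaRDD source

// Re-applying the schema wraps the rows in a brand-new leaf node, so projections
// and predicates on newSchemaRDD cannot be pushed down past it.
val newSchemaRDD = sqlContext.applySchema(existingSchemaRDD, existingSchemaRDD.schema)

// Using the original SchemaRDD keeps its logical plan, so push-down still works.
existingSchemaRDD.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").collect()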

Can you describe in more detail the problems you see if you don't do this
reapplication of the schema?