[jira] [Closed] (SPARK-6363) Switch to Scala 2.11 for default build

2019-10-11 Thread antonkulaga (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

antonkulaga closed SPARK-6363.
--

Resolved

> Switch to Scala 2.11 for default build
> --
>
> Key: SPARK-6363
> URL: https://issues.apache.org/jira/browse/SPARK-6363
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: antonkulaga
>Assignee: Josh Rosen
>Priority: Minor
>  Labels: releasenotes
> Fix For: 2.0.0
>
>
> Most libraries have already moved to 2.11 and many are starting to drop 2.10 
> support, so it would be better if the Spark binaries were built with Scala 
> 2.11 by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-10-11 Thread antonkulaga (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949230#comment-16949230
 ] 

antonkulaga commented on SPARK-28547:
-

>I bet there is room for improvement, but, ten thousand columns is just 
>inherently slow given how metadata, query plans, etc are handled.
>You'd at least need to help narrow down where the slow down is and why, and 
>even better if you can propose a class of fix. As it is I'd close this.

[~srowen] I am not a Spark developer, I am a Spark user, so I cannot say where 
the bottleneck is. When even super-simple tasks like describe, or a simple 
transformation (like taking the log of each gene expression value), fail, I 
report it as a performance problem. As I am a bioinformatician, most of my work 
deals with gene expression matrices (thousands of samples * tens of thousands 
of genes), so this makes Spark unusable for most of my use cases. If operations 
that take seconds on a pandas dataframe (without any Spark involved) take many 
hours or freeze on a Spark dataframe, there is something inherently wrong with 
how Spark dataframes handle the data, and something you should investigate for 
Spark 3.0.
If you want to narrow it down, can it be "make dataframe.describe work for a 
15K * 15K dataframe and take less than 20 minutes to complete"?
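
To make that acceptance criterion concrete, here is a rough, self-contained 
sketch (my own illustration, not code from this thread; the 15K sizes and the 
app name are arbitrary) that times describe() on a synthetic wide DataFrame of 
random doubles:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

object WideDescribeTiming {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("wide-describe-timing").getOrCreate()

    val numCols = 15000
    val numRows = 15000

    // 15K rows x 15K columns of random doubles, named like gene columns.
    val cols = (0 until numCols).map(i => rand().alias(s"gene_$i"))
    val wide = spark.range(numRows).select(cols: _*)

    val start = System.nanoTime()
    wide.describe().show(5)  // per-column count/mean/stddev/min/max
    val minutes = (System.nanoTime() - start) / 1e9 / 60
    println(f"describe() over $numCols%d columns took $minutes%.1f minutes")

    spark.stop()
  }
}
{code}

Per the report above, runs like this take hours or lose executors on 2.4.x, 
while the pandas equivalent finishes in minutes.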

> Make it work for wide (> 10K columns data)
> --
>
> Key: SPARK-28547
> URL: https://issues.apache.org/jira/browse/SPARK-28547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per 
> node, 32 cores (tried different configurations of executors)
>Reporter: antonkulaga
>Priority: Critical
>
> Spark is super-slow for all wide data (when there are >15K columns and >15K 
> rows). Most genomics/transcriptomics data is wide because the number of 
> genes is usually >20K and the number of samples as well. The very popular GTEx 
> dataset is a good example (see, for instance, the RNA-Seq data at 
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where a .gct is 
> just a .tsv file with two comment lines at the beginning). Everything done on 
> wide tables (even a simple "describe" applied to all the gene columns) either 
> takes hours or freezes (because of lost executors), irrespective of memory 
> and number of cores, while the same operations work fast (minutes) with pure 
> pandas (without any Spark involved).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-10-10 Thread antonkulaga (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948897#comment-16948897
 ] 

antonkulaga edited comment on SPARK-28547 at 10/10/19 7:24 PM:
---

[~hyukjin.kwon] What is not clear to you? I think it is really clear that Spark 
performs miserably (freezing, or taking hours/days to compute even the simplest 
operations like per-column statistics) whenever the data frame has 10-20K or 
more columns, and I gave the GTEx dataset as an example (though any gene or 
transcript expression dataset will do to demonstrate it). In many fields (like 
a big part of bioinformatics) wide data frames are common, and right now Spark 
is totally useless there.


was (Author: antonkulaga):
[~hyukjin.kwon] What is not clear to you? I think it is really clear that Spark 
performs miserably (freezing or taking many hours) whenever the data frame has 
10-20K or more columns, and I gave the GTEx dataset as an example (though any 
gene or transcript expression dataset will do to demonstrate it). In many 
fields (like a big part of bioinformatics) wide data frames are common, and 
right now Spark is totally useless there.

> Make it work for wide (> 10K columns data)
> --
>
> Key: SPARK-28547
> URL: https://issues.apache.org/jira/browse/SPARK-28547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per 
> node, 32 cores (tried different configurations of executors)
>Reporter: antonkulaga
>Priority: Critical
>
> Spark is super-slow for all wide data (when there are >15K columns and >15K 
> rows). Most genomics/transcriptomics data is wide because the number of 
> genes is usually >20K and the number of samples as well. The very popular GTEx 
> dataset is a good example (see, for instance, the RNA-Seq data at 
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where a .gct is 
> just a .tsv file with two comment lines at the beginning). Everything done on 
> wide tables (even a simple "describe" applied to all the gene columns) either 
> takes hours or freezes (because of lost executors), irrespective of memory 
> and number of cores, while the same operations work fast (minutes) with pure 
> pandas (without any Spark involved).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-10-10 Thread antonkulaga (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948897#comment-16948897
 ] 

antonkulaga commented on SPARK-28547:
-

[~hyukjin.kwon] What is not clear to you? I think it is really clear that Spark 
performs miserably (freezing or taking many hours) whenever the data frame has 
10-20K or more columns, and I gave the GTEx dataset as an example (though any 
gene or transcript expression dataset will do to demonstrate it). In many 
fields (like a big part of bioinformatics) wide data frames are common, and 
right now Spark is totally useless there.

> Make it work for wide (> 10K columns data)
> --
>
> Key: SPARK-28547
> URL: https://issues.apache.org/jira/browse/SPARK-28547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per 
> node, 32 cores (tried different configurations of executors)
>Reporter: antonkulaga
>Priority: Critical
>
> Spark is super-slow for all wide data (when there are >15K columns and >15K 
> rows). Most genomics/transcriptomics data is wide because the number of 
> genes is usually >20K and the number of samples as well. The very popular GTEx 
> dataset is a good example (see, for instance, the RNA-Seq data at 
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where a .gct is 
> just a .tsv file with two comment lines at the beginning). Everything done on 
> wide tables (even a simple "describe" applied to all the gene columns) either 
> takes hours or freezes (because of lost executors), irrespective of memory 
> and number of cores, while the same operations work fast (minutes) with pure 
> pandas (without any Spark involved).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-07-31 Thread antonkulaga (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

antonkulaga reopened SPARK-28547:
-

I did not see any solutions. 

> Make it work for wide (> 10K columns data)
> --
>
> Key: SPARK-28547
> URL: https://issues.apache.org/jira/browse/SPARK-28547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.3
> Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per 
> node, 32 cores (tried different configurations of executors)
>Reporter: antonkulaga
>Priority: Critical
>
> Spark is super-slow for all wide data (when there are >15K columns and >15K 
> rows). Most genomics/transcriptomics data is wide because the number of 
> genes is usually >20K and the number of samples as well. The very popular GTEx 
> dataset is a good example (see, for instance, the RNA-Seq data at 
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where a .gct is 
> just a .tsv file with two comment lines at the beginning). Everything done on 
> wide tables (even a simple "describe" applied to all the gene columns) either 
> takes hours or freezes (because of lost executors), irrespective of memory 
> and number of cores, while the same operations work fast (minutes) with pure 
> pandas (without any Spark involved).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-07-31 Thread antonkulaga (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897194#comment-16897194
 ] 

antonkulaga commented on SPARK-28547:
-

[~maropu] I think I was quite clear: even describe is slow as hell. So the 
easiest way to reproduce it is just to run describe on all numeric columns in 
GTEx.
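
For reference, a minimal sketch of that reproduction (my own illustration, not 
from this thread; the local file path is a placeholder), which loads a GTEx 
.gct file as TSV and runs describe over the sample columns:

{code:scala}
import org.apache.spark.sql.SparkSession

object GtexDescribeRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("gtex-describe").getOrCreate()
    import spark.implicits._

    val gctPath = "/data/GTEx_gene_tpm.gct"  // placeholder path to the downloaded file

    // A .gct is a TSV with two extra header lines (version and dimensions)
    // before the real column header, so drop those first.
    val lines = spark.sparkContext.textFile(gctPath)
      .zipWithIndex()
      .collect { case (line, idx) if idx >= 2 => line }

    val df = spark.read
      .option("sep", "\t")
      .option("header", "true")
      .option("inferSchema", "true")  // schema inference alone is already slow on ~10K+ columns
      .csv(lines.toDS())

    // The first two columns are gene Name and Description; the rest are numeric samples.
    val sampleCols = df.columns.drop(2)
    df.describe(sampleCols: _*).show(5)

    spark.stop()
  }
}
{code}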

> Make it work for wide (> 10K columns data)
> --
>
> Key: SPARK-28547
> URL: https://issues.apache.org/jira/browse/SPARK-28547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.3
> Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per 
> node, 32 cores (tried different configurations of executors)
>Reporter: antonkulaga
>Priority: Critical
>
> Spark is super-slow for all wide data (when there are >15K columns and >15K 
> rows). Most genomics/transcriptomics data is wide because the number of 
> genes is usually >20K and the number of samples as well. The very popular GTEx 
> dataset is a good example (see, for instance, the RNA-Seq data at 
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where a .gct is 
> just a .tsv file with two comment lines at the beginning). Everything done on 
> wide tables (even a simple "describe" applied to all the gene columns) either 
> takes hours or freezes (because of lost executors), irrespective of memory 
> and number of cores, while the same operations work fast (minutes) with pure 
> pandas (without any Spark involved).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-07-28 Thread antonkulaga (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

antonkulaga updated SPARK-28547:

Description: 
Spark is super-slow for all wide data (when there are >15K columns and >15K 
rows). Most genomics/transcriptomics data is wide because the number of genes 
is usually >20K and the number of samples as well. The very popular GTEx 
dataset is a good example (see, for instance, the RNA-Seq data at 
https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where a .gct is 
just a .tsv file with two comment lines at the beginning). Everything done on 
wide tables (even a simple "describe" applied to all the gene columns) either 
takes hours or freezes (because of lost executors), irrespective of memory and 
number of cores, while the same operations work well with pure pandas (without 
any Spark involved).

  was:
Spark is super-slow for all wide data (when there are >15K columns and >15K 
rows). Most genomics/transcriptomics data is wide because the number of genes 
is usually >20K and the number of samples as well. The very popular GTEx 
dataset is a good example (see, for instance, the RNA-Seq data at 
https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where a .gct is 
just a .tsv file with two comment lines at the beginning). Everything done on 
wide tables either takes hours or freezes (because of lost executors), 
irrespective of memory and number of cores, while the same operations work well 
with pure pandas (without any Spark involved).


> Make it work for wide (> 10K columns data)
> --
>
> Key: SPARK-28547
> URL: https://issues.apache.org/jira/browse/SPARK-28547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.3
> Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per 
> node, 32 cores (tried different configurations of executors)
>Reporter: antonkulaga
>Priority: Critical
>
> Spark is super-slow for all wide data (when there are >15K columns and >15K 
> rows). Most genomics/transcriptomics data is wide because the number of 
> genes is usually >20K and the number of samples as well. The very popular GTEx 
> dataset is a good example (see, for instance, the RNA-Seq data at 
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where a .gct is 
> just a .tsv file with two comment lines at the beginning). Everything done on 
> wide tables (even a simple "describe" applied to all the gene columns) either 
> takes hours or freezes (because of lost executors), irrespective of memory 
> and number of cores, while the same operations work well with pure pandas 
> (without any Spark involved).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-07-28 Thread antonkulaga (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

antonkulaga updated SPARK-28547:

Description: 
Spark is super-slow for all wide data (when there are >15K columns and >15K 
rows). Most genomics/transcriptomics data is wide because the number of genes 
is usually >20K and the number of samples as well. The very popular GTEx 
dataset is a good example (see, for instance, the RNA-Seq data at 
https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where a .gct is 
just a .tsv file with two comment lines at the beginning). Everything done on 
wide tables (even a simple "describe" applied to all the gene columns) either 
takes hours or freezes (because of lost executors), irrespective of memory and 
number of cores, while the same operations work fast (minutes) with pure pandas 
(without any Spark involved).

  was:
Spark is super-slow for all wide data (when there are >15K columns and >15K 
rows). Most genomics/transcriptomics data is wide because the number of genes 
is usually >20K and the number of samples as well. The very popular GTEx 
dataset is a good example (see, for instance, the RNA-Seq data at 
https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where a .gct is 
just a .tsv file with two comment lines at the beginning). Everything done on 
wide tables (even a simple "describe" applied to all the gene columns) either 
takes hours or freezes (because of lost executors), irrespective of memory and 
number of cores, while the same operations work well with pure pandas (without 
any Spark involved).


> Make it work for wide (> 10K columns data)
> --
>
> Key: SPARK-28547
> URL: https://issues.apache.org/jira/browse/SPARK-28547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.3
> Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per 
> node, 32 cores (tried different configurations of executors)
>Reporter: antonkulaga
>Priority: Critical
>
> Spark is super-slow for all wide data (when there are >15K columns and >15K 
> rows). Most genomics/transcriptomics data is wide because the number of 
> genes is usually >20K and the number of samples as well. The very popular GTEx 
> dataset is a good example (see, for instance, the RNA-Seq data at 
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where a .gct is 
> just a .tsv file with two comment lines at the beginning). Everything done on 
> wide tables (even a simple "describe" applied to all the gene columns) either 
> takes hours or freezes (because of lost executors), irrespective of memory 
> and number of cores, while the same operations work fast (minutes) with pure 
> pandas (without any Spark involved).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-07-28 Thread antonkulaga (JIRA)
antonkulaga created SPARK-28547:
---

 Summary: Make it work for wide (> 10K columns data)
 Key: SPARK-28547
 URL: https://issues.apache.org/jira/browse/SPARK-28547
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.3, 2.4.4
 Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per node, 
32 cores (tried different configurations of executors)
Reporter: antonkulaga


Spark is super-slow for all wide data (when there are >15K columns and >15K 
rows). Most genomics/transcriptomics data is wide because the number of genes 
is usually >20K and the number of samples as well. The very popular GTEx 
dataset is a good example (see, for instance, the RNA-Seq data at 
https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where a .gct is 
just a .tsv file with two comment lines at the beginning). Everything done on 
wide tables either takes hours or freezes (because of lost executors), 
irrespective of memory and number of cores, while the same operations work well 
with pure pandas (without any Spark involved).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-14220) Build and test Spark against Scala 2.12

2019-04-07 Thread antonkulaga (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

antonkulaga updated SPARK-14220:

Comment: was deleted

(was: I suggest using Spark 2.4.1, as there Scala 2.12 is no longer 
experimental)

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Sean Owen
>Priority: Blocker
>  Labels: release-notes
> Fix For: 2.4.0
>
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2019-04-07 Thread antonkulaga (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811726#comment-16811726
 ] 

antonkulaga commented on SPARK-14220:
-

I suggest using Spark 2.4.1, as there Scala 2.12 is no longer experimental.

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Sean Owen
>Priority: Blocker
>  Labels: release-notes
> Fix For: 2.4.0
>
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-11-03 Thread antonkulaga (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674066#comment-16674066
 ] 

antonkulaga commented on SPARK-25588:
-

Any updates on this? This bug blocks the ADAM library and hence blocks most 
bioinformaticians using Spark.

> SchemaParseException: Can't redefine: list when reading from Parquet
> 
>
> Key: SPARK-25588
> URL: https://issues.apache.org/jira/browse/SPARK-25588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
> Environment: Spark version 2.3.2
>Reporter: Michael Heuer
>Priority: Major
>
> In ADAM, a library downstream of Spark, we use Avro to define a schema, 
> generate Java classes from the Avro schema using the avro-maven-plugin, and 
> generate Scala Products from the Avro schema using our own code generation 
> library.
> In the code path demonstrated by the following unit test, we write out to 
> Parquet and read back in using an RDD of Avro-generated Java classes and then 
> write out to Parquet and read back in using a Dataset of Avro-generated Scala 
> Products.
> {code:scala}
>   sparkTest("transform reads to variant rdd") {
> val reads = sc.loadAlignments(testFile("small.sam"))
> def checkSave(variants: VariantRDD) {
>   val tempPath = tmpLocation(".adam")
>   variants.saveAsParquet(tempPath)
>   assert(sc.loadVariants(tempPath).rdd.count === 20)
> }
> val variants: VariantRDD = reads.transmute[Variant, VariantProduct, 
> VariantRDD](
>   (rdd: RDD[AlignmentRecord]) => {
> rdd.map(AlignmentRecordRDDSuite.varFn)
>   })
> checkSave(variants)
> val sqlContext = SQLContext.getOrCreate(sc)
> import sqlContext.implicits._
> val variantsDs: VariantRDD = reads.transmuteDataset[Variant, 
> VariantProduct, VariantRDD](
>   (ds: Dataset[AlignmentRecordProduct]) => {
> ds.map(r => {
>   VariantProduct.fromAvro(
> AlignmentRecordRDDSuite.varFn(r.toAvro))
> })
>   })
> checkSave(variantsDs)
> }
> {code}
> https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDSuite.scala#L1540
> Note the schema in Parquet are different:
> RDD code path
> {noformat}
> $ parquet-tools schema 
> /var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite3400691035694870641.adam/part-r-0.gz.parquet
> message org.bdgenomics.formats.avro.Variant {
>   optional binary contigName (UTF8);
>   optional int64 start;
>   optional int64 end;
>   required group names (LIST) {
> repeated binary array (UTF8);
>   }
>   optional boolean splitFromMultiAllelic;
>   optional binary referenceAllele (UTF8);
>   optional binary alternateAllele (UTF8);
>   optional double quality;
>   optional boolean filtersApplied;
>   optional boolean filtersPassed;
>   required group filtersFailed (LIST) {
> repeated binary array (UTF8);
>   }
>   optional group annotation {
> optional binary ancestralAllele (UTF8);
> optional int32 alleleCount;
> optional int32 readDepth;
> optional int32 forwardReadDepth;
> optional int32 reverseReadDepth;
> optional int32 referenceReadDepth;
> optional int32 referenceForwardReadDepth;
> optional int32 referenceReverseReadDepth;
> optional float alleleFrequency;
> optional binary cigar (UTF8);
> optional boolean dbSnp;
> optional boolean hapMap2;
> optional boolean hapMap3;
> optional boolean validated;
> optional boolean thousandGenomes;
> optional boolean somatic;
> required group transcriptEffects (LIST) {
>   repeated group array {
> optional binary alternateAllele (UTF8);
> required group effects (LIST) {
>   repeated binary array (UTF8);
> }
> optional binary geneName (UTF8);
> optional binary geneId (UTF8);
> optional binary featureType (UTF8);
> optional binary featureId (UTF8);
> optional binary biotype (UTF8);
> optional int32 rank;
> optional int32 total;
> optional binary genomicHgvs (UTF8);
> optional binary transcriptHgvs (UTF8);
> optional binary proteinHgvs (UTF8);
> optional int32 cdnaPosition;
> optional int32 cdnaLength;
> optional int32 cdsPosition;
> optional int32 cdsLength;
> optional int32 proteinPosition;
> optional int32 proteinLength;
> optional int32 distance;
> required group messages (LIST) {
>   repeated binary array (ENUM);
> }
>   }
> }
> required group attributes (MAP) {
>   repeated group map (MAP_KEY_VALUE) {
> required binary key (UTF8);
> required binary value (UTF8);
>   }
> }

[jira] [Created] (SPARK-25198) org.apache.spark.sql.catalyst.parser.ParseException: DataType json is not supported.

2018-08-22 Thread antonkulaga (JIRA)
antonkulaga created SPARK-25198:
---

 Summary: org.apache.spark.sql.catalyst.parser.ParseException: 
DataType json is not supported.
 Key: SPARK-25198
 URL: https://issues.apache.org/jira/browse/SPARK-25198
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
 Environment: Ubuntu 18.04, Spark 2.3.1, 
org.postgresql:postgresql:42.2.4
Reporter: antonkulaga


Whenever I try to save a dataframe that has a column containing a JSON string 
to the latest Postgres, I get 
org.apache.spark.sql.catalyst.parser.ParseException: DataType json is not 
supported. As Postgres supports JSON well and I use the latest postgresql 
client, I expect it to work. Here is an example of the code that crashes:

val columnTypes = """id integer, parameters json, title text, gsm text, gse text,
  organism text, characteristics text, molecule text, model text, description text,
  treatment_protocol text, extract_protocol text, source_name text, data_processing text,
  submission_date text, last_update_date text, status text, type text, contact text,
  gpl text"""

myDataframe.write.format("jdbc")
  .option("url", "jdbc:postgresql://db/sequencing")
  .option("customSchema", columnTypes)
  .option("dbtable", "test")
  .option("user", "postgres")
  .option("password", "changeme")
  .save()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16406) Reference resolution for large number of columns should be faster

2018-08-14 Thread antonkulaga (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580444#comment-16580444
 ] 

antonkulaga commented on SPARK-16406:
-

Are you going to backport it to 2.3.2 as well?

> Reference resolution for large number of columns should be faster
> -
>
> Key: SPARK-16406
> URL: https://issues.apache.org/jira/browse/SPARK-16406
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Major
> Fix For: 2.4.0
>
>
> Resolving columns in a LogicalPlan on average takes n / 2 (n being the number 
> of columns). This gets problematic as soon as you try to resolve a large 
> number of columns (m) on a large table: O(m * n / 2)
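
A rough sketch (my own illustration, not from the issue) that makes this 
resolution cost visible: selecting all n columns of an n-column plan by name 
forces m = n lookups during analysis, so analysis time grows roughly 
quadratically with the table width:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

object ResolutionCostSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("resolution-cost").getOrCreate()

    val n = 10000
    // A single-row DataFrame with n literal columns c0..c(n-1).
    val wide = spark.range(1).select((0 until n).map(i => lit(i).alias(s"c$i")): _*)

    val start = System.nanoTime()
    // Trigger analysis only (no execution): every col(...) reference has to be resolved.
    wide.select(wide.columns.map(col): _*).queryExecution.analyzed
    val ms = (System.nanoTime() - start) / 1e6
    println(f"analyzing a select of $n%d columns took $ms%.0f ms")

    spark.stop()
  }
}
{code}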



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4820) Spark build encounters "File name too long" on some encrypted filesystems

2017-07-25 Thread antonkulaga (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16100966#comment-16100966
 ] 

antonkulaga commented on SPARK-4820:


This issue is still valid for Spark 2.2.0 on Ubuntu 16.04, and it is a BLOCKER! 
I am blocked in some projects because I cannot get around this stupid error.

> Spark build encounters "File name too long" on some encrypted filesystems
> -
>
> Key: SPARK-4820
> URL: https://issues.apache.org/jira/browse/SPARK-4820
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Theodore Vasiloudis
>Priority: Minor
> Fix For: 1.4.0
>
>
> This was reported by Luchesar Cekov on github along with a proposed fix. The 
> fix has some potential downstream issues (it will modify the classnames) so 
> until we understand better how many users are affected we aren't going to 
> merge it. However, I'd like to include the issue and workaround here. If you 
> encounter this issue please comment on the JIRA so we can assess the 
> frequency.
> The issue produces this error:
> {code}
> [error] == Expanded type of tree ==
> [error] 
> [error] ConstantType(value = Constant(Throwable))
> [error] 
> [error] uncaught exception during compilation: java.io.IOException
> [error] File name too long
> [error] two errors found
> {code}
> The workaround is in maven under the compile options add: 
> {code}
> +  <arg>-Xmax-classfile-name</arg>
> +  <arg>128</arg>
> {code}
> In SBT add:
> {code}
> +scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21531) CLONE - Spark build encounters "File name too long" on some encrypted filesystems

2017-07-25 Thread antonkulaga (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

antonkulaga updated SPARK-21531:

Description: 
Originally this issue was discovered in Spark 1.x and then fixed in 1.x, but it 
is still valid in 2.x.

This was reported by Luchesar Cekov on github along with a proposed fix. The 
fix has some potential downstream issues (it will modify the classnames) so 
until we understand better how many users are affected we aren't going to merge 
it. However, I'd like to include the issue and workaround here. If you 
encounter this issue please comment on the JIRA so we can assess the frequency.

The issue produces this error:
{code}
[error] == Expanded type of tree ==
[error] 
[error] ConstantType(value = Constant(Throwable))
[error] 
[error] uncaught exception during compilation: java.io.IOException
[error] File name too long
[error] two errors found
{code}

The workaround is in maven under the compile options add: 

{code}
+  <arg>-Xmax-classfile-name</arg>
+  <arg>128</arg>
{code}

In SBT add:

{code}
+scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
{code}


  was:
This was reported by Luchesar Cekov on github along with a proposed fix. The 
fix has some potential downstream issues (it will modify the classnames) so 
until we understand better how many users are affected we aren't going to merge 
it. However, I'd like to include the issue and workaround here. If you 
encounter this issue please comment on the JIRA so we can assess the frequency.

The issue produces this error:
{code}
[error] == Expanded type of tree ==
[error] 
[error] ConstantType(value = Constant(Throwable))
[error] 
[error] uncaught exception during compilation: java.io.IOException
[error] File name too long
[error] two errors found
{code}

The workaround is in maven under the compile options add: 

{code}
+  <arg>-Xmax-classfile-name</arg>
+  <arg>128</arg>
{code}

In SBT add:

{code}
+scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
{code}



> CLONE - Spark build encounters "File name too long" on some encrypted 
> filesystems
> -
>
> Key: SPARK-21531
> URL: https://issues.apache.org/jira/browse/SPARK-21531
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: antonkulaga
>Assignee: Theodore Vasiloudis
> Fix For: 1.4.0
>
>
> Originally this issue was discovered in Spark 1.x and then fixed in 1.x, but 
> it is still valid in 2.x.
> This was reported by Luchesar Cekov on github along with a proposed fix. The 
> fix has some potential downstream issues (it will modify the classnames) so 
> until we understand better how many users are affected we aren't going to 
> merge it. However, I'd like to include the issue and workaround here. If you 
> encounter this issue please comment on the JIRA so we can assess the 
> frequency.
> The issue produces this error:
> {code}
> [error] == Expanded type of tree ==
> [error] 
> [error] ConstantType(value = Constant(Throwable))
> [error] 
> [error] uncaught exception during compilation: java.io.IOException
> [error] File name too long
> [error] two errors found
> {code}
> The workaround is in maven under the compile options add: 
> {code}
> +  <arg>-Xmax-classfile-name</arg>
> +  <arg>128</arg>
> {code}
> In SBT add:
> {code}
> +scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21531) CLONE - Spark build encounters "File name too long" on some encrypted filesystems

2017-07-25 Thread antonkulaga (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

antonkulaga updated SPARK-21531:

Description: 
Originally this issue was discovered in Spark 1.x and then fixed in 1.x, but it 
is still valid in 2.x.

ORIGINAL description:

This was reported by Luchesar Cekov on github along with a proposed fix. The 
fix has some potential downstream issues (it will modify the classnames) so 
until we understand better how many users are affected we aren't going to merge 
it. However, I'd like to include the issue and workaround here. If you 
encounter this issue please comment on the JIRA so we can assess the frequency.

The issue produces this error:
{code}
[error] == Expanded type of tree ==
[error] 
[error] ConstantType(value = Constant(Throwable))
[error] 
[error] uncaught exception during compilation: java.io.IOException
[error] File name too long
[error] two errors found
{code}

The workaround is in maven under the compile options add: 

{code}
+  <arg>-Xmax-classfile-name</arg>
+  <arg>128</arg>
{code}

In SBT add:

{code}
+scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
{code}


  was:
Originally this issue was discovered in Spark 1.x and then fixed in 1.x, but it 
is still valid in 2.x.

This was reported by Luchesar Cekov on github along with a proposed fix. The 
fix has some potential downstream issues (it will modify the classnames) so 
until we understand better how many users are affected we aren't going to merge 
it. However, I'd like to include the issue and workaround here. If you 
encounter this issue please comment on the JIRA so we can assess the frequency.

The issue produces this error:
{code}
[error] == Expanded type of tree ==
[error] 
[error] ConstantType(value = Constant(Throwable))
[error] 
[error] uncaught exception during compilation: java.io.IOException
[error] File name too long
[error] two errors found
{code}

The workaround is in maven under the compile options add: 

{code}
+  <arg>-Xmax-classfile-name</arg>
+  <arg>128</arg>
{code}

In SBT add:

{code}
+scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
{code}



> CLONE - Spark build encounters "File name too long" on some encrypted 
> filesystems
> -
>
> Key: SPARK-21531
> URL: https://issues.apache.org/jira/browse/SPARK-21531
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: antonkulaga
>Assignee: Theodore Vasiloudis
> Fix For: 1.4.0
>
>
> Originally this issue was discovered in Spark 1.x and then fixed in 1.x, but 
> it is still valid in 2.x.
> ORIGINAL description:
> This was reported by Luchesar Cekov on github along with a proposed fix. The 
> fix has some potential downstream issues (it will modify the classnames) so 
> until we understand better how many users are affected we aren't going to 
> merge it. However, I'd like to include the issue and workaround here. If you 
> encounter this issue please comment on the JIRA so we can assess the 
> frequency.
> The issue produces this error:
> {code}
> [error] == Expanded type of tree ==
> [error] 
> [error] ConstantType(value = Constant(Throwable))
> [error] 
> [error] uncaught exception during compilation: java.io.IOException
> [error] File name too long
> [error] two errors found
> {code}
> The workaround is in maven under the compile options add: 
> {code}
> +  <arg>-Xmax-classfile-name</arg>
> +  <arg>128</arg>
> {code}
> In SBT add:
> {code}
> +scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21531) CLONE - Spark build encounters "File name too long" on some encrypted filesystems

2017-07-25 Thread antonkulaga (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

antonkulaga updated SPARK-21531:

Priority: Major  (was: Minor)

> CLONE - Spark build encounters "File name too long" on some encrypted 
> filesystems
> -
>
> Key: SPARK-21531
> URL: https://issues.apache.org/jira/browse/SPARK-21531
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: antonkulaga
>Assignee: Theodore Vasiloudis
> Fix For: 1.4.0
>
>
> This was reported by Luchesar Cekov on github along with a proposed fix. The 
> fix has some potential downstream issues (it will modify the classnames) so 
> until we understand better how many users are affected we aren't going to 
> merge it. However, I'd like to include the issue and workaround here. If you 
> encounter this issue please comment on the JIRA so we can assess the 
> frequency.
> The issue produces this error:
> {code}
> [error] == Expanded type of tree ==
> [error] 
> [error] ConstantType(value = Constant(Throwable))
> [error] 
> [error] uncaught exception during compilation: java.io.IOException
> [error] File name too long
> [error] two errors found
> {code}
> The workaround is in maven under the compile options add: 
> {code}
> +  <arg>-Xmax-classfile-name</arg>
> +  <arg>128</arg>
> {code}
> In SBT add:
> {code}
> +scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21531) CLONE - Spark build encounters "File name too long" on some encrypted filesystems

2017-07-25 Thread antonkulaga (JIRA)
antonkulaga created SPARK-21531:
---

 Summary: CLONE - Spark build encounters "File name too long" on 
some encrypted filesystems
 Key: SPARK-21531
 URL: https://issues.apache.org/jira/browse/SPARK-21531
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: antonkulaga
Assignee: Theodore Vasiloudis
Priority: Minor
 Fix For: 1.4.0


This was reported by Luchesar Cekov on github along with a proposed fix. The 
fix has some potential downstream issues (it will modify the classnames) so 
until we understand better how many users are affected we aren't going to merge 
it. However, I'd like to include the issue and workaround here. If you 
encounter this issue please comment on the JIRA so we can assess the frequency.

The issue produces this error:
{code}
[error] == Expanded type of tree ==
[error] 
[error] ConstantType(value = Constant(Throwable))
[error] 
[error] uncaught exception during compilation: java.io.IOException
[error] File name too long
[error] two errors found
{code}

The workaround is in maven under the compile options add: 

{code}
+  <arg>-Xmax-classfile-name</arg>
+  <arg>128</arg>
{code}

In SBT add:

{code}
+scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
{code}
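
For context, a minimal build.sbt sketch (the project name and the Scala/Spark 
versions below are assumptions, not from this report) showing where that 
setting lives in an SBT project:

{code}
// build.sbt — keep generated class file names short enough for encrypted
// filesystems (e.g. eCryptfs home directories).
name := "spark-app"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0" % "provided"
scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128")
{code}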




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2017-03-23 Thread antonkulaga (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937923#comment-15937923
 ] 

antonkulaga commented on SPARK-14220:
-

Any progress on this?

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6363) make scala 2.11 default language

2015-03-19 Thread antonkulaga (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368846#comment-14368846
 ] 

antonkulaga commented on SPARK-6363:


>Spark is already cross-built for 2.10 and 2.11, and published separately for both

I mean the Spark downloads, where they provide only 2.10 builds and propose 
building 2.11 from source. I think 2.11 should be the default there.

 make scala 2.11 default language
 

 Key: SPARK-6363
 URL: https://issues.apache.org/jira/browse/SPARK-6363
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: antonkulaga
Priority: Minor
  Labels: scala

 Most libraries have already moved to 2.11 and many are starting to drop 2.10 
 support, so it would be better if the Spark binaries were built with Scala 
 2.11 by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6363) make scala 2.11 default language

2015-03-16 Thread antonkulaga (JIRA)
antonkulaga created SPARK-6363:
--

 Summary: make scala 2.11 default language
 Key: SPARK-6363
 URL: https://issues.apache.org/jira/browse/SPARK-6363
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: antonkulaga
Priority: Minor


Most libraries have already moved to 2.11 and many are starting to drop 2.10 
support, so it would be better if the Spark binaries were built with Scala 2.11 
by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org