Re: Spark SQL Dataframe resulting from an except( ) is unusable

2017-02-01 Thread Liang-Chi Hsieh
> (i.e. almost all productive uses of this dataframe) fails due to this > exception. > > Is it a miss in the Literal implementation that it does not handle > UserDefinedTypes or is it left out intentionally? Is there a way to get > around this problem? This problem seems to be present

Re: welcoming Burak and Holden as committers

2017-01-24 Thread Liang-Chi Hsieh
s in DataFrames, Python/R APIs for >> DataFrames, dstream, and most recently Structured Streaming. >> >> >> >> Holden has been a long time Spark contributor and evangelist. She has >> written a few books on Spark, as well as frequent contributions to the >> Python A

Re: spark main thread quit, but the driver don't crash at standalone cluster

2017-01-19 Thread Liang-Chi Hsieh
> rdd.collect().foreach(println)
> throw new RuntimeException //exception
> }
> ssc.start()
> try {
>   ssc.awaitTermination()
> } catch {
>   case e: Exception => {
>     System.out.println("end!")
>     throw e
>   }
> }

Re: Reply: Limit Query Performance Suggestion

2017-01-18 Thread Liang-Chi Hsieh
a number most close to that config. > E.g. the config is 2000: > if limit=1, 1 = 2000 * 5, we shuffle to 5 partitions > if limit=, = * 9, we shuffle to 9 partitions > if limit is a prime number, we just fall back to single partition > > best regards, > -zhen

Re: Equally split a RDD partition into two partition at the same node

2017-01-16 Thread Liang-Chi Hsieh
> the target RDD, I use the input partition to get the corresponding parent > partition, and get the half elements in the parent partitions as the > output > of the computing function. > > Thanks, > Fei > > On Sun, Jan 15, 2017 at 11:01 PM, Liang-Chi Hsieh > viirya@

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Liang-Chi Hsieh
hich can help get high data locality. >> >> Is there anyone who knows how to implement it or any hints for it? >> >> Thanks in advance, >> Fei >> >> > > > -- > -- Anastasios Zouzias > > azo@.ibm >

Re: Limit Query Performance Suggestion

2017-01-15 Thread Liang-Chi Hsieh
0*10 = 100), once the first group finishes its tasks, the driver will check whether the accumulator value has reached the limit; if it has, no further tasks will be launched to the executors and the result will be returned. > > Let me know for any further suggestion

Re: Both Spark AM and Client are trying to delete Staging Directory

2017-01-15 Thread Liang-Chi Hsieh
> Shouldn't we add some check to Client? > > > Thanks, > Rostyslav

Re: handling of empty partitions

2017-01-11 Thread Liang-Chi Hsieh
= toCarryBd.value.get(i + 1).get } You may need to make sure the next partition has a value too. As Holden pointed out before, you need to handle the case where the previous/next partition is also empty, and keep going until you find a non-empty partition. geoHeil wrote > Hi Liang-Chi Hsieh, > >
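
For readers coming from the archive, here is a minimal Scala sketch of the pattern discussed in this thread (the toy RDD and variable names are assumptions, not the exact code from the thread): broadcast the first element of every partition, then, for partition i, walk forward until a non-empty partition is found.

    val rdd = sc.parallelize(1 to 5, 8)   // more partitions than elements, so some are empty

    // Record the first element (if any) of every partition and broadcast it.
    val toCarry = rdd.mapPartitionsWithIndex { (i, iter) =>
      Iterator((i, if (iter.hasNext) Some(iter.next()) else None))
    }.collect().toMap
    val toCarryBd = sc.broadcast(toCarry)

    // For partition i, skip over empty partitions (None) until a value is found.
    val withNext = rdd.mapPartitionsWithIndex { (i, iter) =>
      val numPartitions = toCarryBd.value.size
      val next = ((i + 1) until numPartitions)
        .find(j => toCarryBd.value(j).isDefined)
        .flatMap(j => toCarryBd.value(j))
      iter.map(x => (x, next))
    }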

Re: How to hint Spark to use HashAggregate() for UDAF

2017-01-09 Thread Liang-Chi Hsieh
functions=[partial_aggregatefunction(key#0, nested#1, nestedArray#2, >>> nestedObjectArray#3, value#4L, com.[...]uDf@108b121f, 0, 0)], >>> output=[key#0,internal_col#37]) >>> +- *Sort [key#0 ASC], false, 0 >>>+- Scan Exist

Re: scala.MatchError: scala.collection.immutable.Range.Inclusive from catalyst.ScalaReflection.serializerFor?

2017-01-09 Thread Liang-Chi Hsieh

Re: handling of empty partitions

2017-01-09 Thread Liang-Chi Hsieh
do mapPartitions, it just gives you an empty iterator as input. You can do what you need. You already return a None when you find an empty iterator in preparing "toCarry". So I was wondering what you want to ask in the previous reply. geoHeil wrote > Thanks a lot, Holden.

Re: A note about MLlib's StandardScaler

2017-01-08 Thread Liang-Chi Hsieh
9984, 0.970241]) >> >> This result has std=sqrt(2/3) >> >> Instead it should have produced 3 scaled vectors that have std=1 for each >> column. >> >> Adding another vector (4 total) results in 4 scaled vectors that have >> std=sqrt(3/4) instead of std=1
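
A note on where those numbers come from (my explanation, not part of the truncated preview): MLlib's StandardScaler divides by the unbiased sample standard deviation (denominator n - 1). If you then measure the population standard deviation (denominator n) of a scaled column, you get

    sqrt((n - 1) / n)  =  sqrt(2/3) ≈ 0.816 for n = 3 vectors,  sqrt(3/4) ≈ 0.866 for n = 4

which matches the values reported above.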

Re: handling of empty partitions

2017-01-08 Thread Liang-Chi Hsieh
empty partitions? > http://stackoverflow.com/questions/41474175/spark-mappartitionswithindex-handling-empty-partitions > > Kind regards, > Georg

Re: Converting an InternalRow to a Row

2017-01-07 Thread Liang-Chi Hsieh
alRow > class TopKDataType extends UserDefinedType > > { > private final ExpressionEncoder > > unboundedEncoder; > private final List > > data; > > public Row[] rows() { > val encoder = resolveAndBind(this.unboundedEncoder); > >

Re: Parquet patch release

2017-01-07 Thread Liang-Chi Hsieh
lease follow that thread on the Parquet list. > > Thanks! > > rb > > -- > Ryan Blue > Software Engineer > Netflix

Re: Converting an InternalRow to a Row

2017-01-05 Thread Liang-Chi Hsieh
> collect(Collectors.toList()); >> val encoder = >> RowEncoder.apply(schema).resolveAndBind(ScalaUtils.scalaSeq(attributes), >> SimpleAnalyzer$.MODULE$); >> >> >> --- >> Regards, >> Andy >> >> On Thu, Jan 5, 2017 at 2:53
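
A compact Scala sketch of the resolve-and-bind step being discussed (Spark 2.x internal APIs; the schema and the input InternalRow below are hypothetical):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.encoders.RowEncoder
    import org.apache.spark.sql.types._
    import org.apache.spark.unsafe.types.UTF8String

    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("name", StringType)))

    // Without resolveAndBind(), fromRow() fails with
    // "Cannot evaluate expression: getcolumnbyordinal(...)".
    val encoder = RowEncoder(schema).resolveAndBind()
    val internalRow: InternalRow = InternalRow(1, UTF8String.fromString("a"))
    val row: Row = encoder.fromRow(internalRow)   // InternalRow -> external Row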

Re: Skip Corrupted Parquet blocks / footer.

2017-01-04 Thread Liang-Chi Hsieh
, 10] > at > org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:248) > > > Please let me know if I am missing anything.

Re: Converting an InternalRow to a Row

2017-01-04 Thread Liang-Chi Hsieh
ntln("Round trip: " + roundTrip.size()); > } > > The code fails at the line encoder.fromRow() with the exception: >> Caused by: java.lang.UnsupportedOperationException: Cannot evaluate > expression: getcolumnbyordinal(0, IntegerType) > > --- > Regards, > A

Re: Quick request: prolific PR openers, review your open PRs

2017-01-04 Thread Liang-Chi Hsieh
ot.com/users/cloud-fan; 8 > jerryshao https://spark-prs.appspot.com/users/jerryshao; 8

Re: Skip Corrupted Parquet blocks / footer.

2017-01-04 Thread Liang-Chi Hsieh
Forgot to say: another option is to replace readAllFootersInParallel with our own parallel reading logic, so we can ignore corrupt files. Liang-Chi Hsieh wrote > Hi, > > The method readAllFootersInParallel is implemented in Parquet's > ParquetFileReader. So the
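
A practical aside (my suggestion, not from the thread): newer Spark releases also expose a configuration flag that skips unreadable files during reads, which may be enough when only a few Parquet files are corrupt. A minimal sketch, with a hypothetical path:

    spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
    val df = spark.read.parquet("/data/that/may/contain/corrupt/files")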

Re: Skip Corrupted Parquet blocks / footer.

2017-01-04 Thread Liang-Chi Hsieh
corruptblock.0 is not a > Parquet file. expected magic number at tail [80, 65, 82, 49] but found > [65, 82, 49, 10] > at > org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:248) > > > Please let me know if I am missing anything.

Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

2017-01-03 Thread Liang-Chi Hsieh

Re: context.runJob() was suspended in getPreferredLocations() function

2017-01-01 Thread Liang-Chi Hsieh
no errors and no > outputs. > > What is the reason for it? > > Thanks, > Fei

Re: repeated unioning of dataframes take worse than O(N^2) time

2016-12-30 Thread Liang-Chi Hsieh
of 100 Range plans, there are 5049 Range nodes to go through. For 200 Range plans, it becomes 20099. You can see it is not a linear relation.
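
A quick sanity check on those numbers (my arithmetic, not from the thread): if analyzing the k-th union has to walk the Range nodes of all k plans built so far, then N repeated unions touch roughly

    1 + 2 + ... + N = N(N + 1) / 2

nodes in total, i.e. 5050 for N = 100 and 20100 for N = 200, essentially the 5049 and 20099 reported above, so the cost grows quadratically rather than linearly.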

Re: Why is spark.shuffle.sort.bypassMergeThreshold 200?

2016-12-28 Thread Liang-Chi Hsieh

Re: Error: at sqlContext.createDataFrame with RDD and Schema

2016-12-28 Thread Liang-Chi Hsieh
ools.nsc.interpreter.IMain.interpret(IMain.scala:565) >> at scala.tools.nsc.interpreter.ILoop.interpretStartingWith( >> ILoop.scala:807) >> at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681) >> at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395) >> at scala.tools.nsc.i

RE: Shuffle intermediate results not being cached

2016-12-28 Thread Liang-Chi Hsieh
} println(s"Took $timeLen miliseconds") totalTime += timeLen } val timeLen2 = time { val grouped2 = allDF.groupBy("cat1").agg(sum($"v"), count($"cat2")) grouped2.show() } totalTime += timeLen2 println(s"Overall time was $totalTime miliseconds

RE: Shuffle intermediate results not being cached

2016-12-28 Thread Liang-Chi Hsieh
does the window internally. This was still 8 times lower > throughput than batch and required a lot of coding and is not a general > solution. > > I am looking for an approach to improve the performance even more > (preferably to either be on par with batch or a relatively low factor >

RE: Shuffle intermediate results not being cached

2016-12-27 Thread Liang-Chi Hsieh
ch. > Is there a correct way to do such an aggregation on streaming data (using > dataframes rather than RDD operations). > Assaf. > > > > From: Liang-Chi Hsieh [via Apache Spark Developers List] [mailto: > ml-node+s1001551n20361h80@.nabble > ] > Sent: Monday,

Re: Shuffle intermediate results not being cached

2016-12-26 Thread Liang-Chi Hsieh
n x2))) ... Your first example just does two aggregation operations. But your second example, like the one above, does these aggregation operations for each iteration, so the time of the second example grows as the number of iterations increases.

Re: java.lang.AssertionError: assertion failed

2016-12-22 Thread Liang-Chi Hsieh
-18986.

Re: stratified sampling scales poorly

2016-12-22 Thread Liang-Chi Hsieh
y1”-> fraction, “key2”-> fraction, …, “keyn”-> fraction). > > My question is: why does stratified sampling scale poorly with different sampling fractions in this context, while simple random sampling scales well with different sampling fractions? (I ran experime
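
For readers unfamiliar with the API being discussed, stratified sampling on a pair RDD looks roughly like this (a minimal sketch; keys and fractions are hypothetical):

    val pairs = sc.parallelize(Seq(("key1", 1), ("key1", 2), ("key2", 3), ("key2", 4)))
    val fractions = Map("key1" -> 0.5, "key2" -> 0.5)

    // One pass over the data, approximate per-key sample sizes.
    val approxSample = pairs.sampleByKey(withReplacement = false, fractions)
    // Additional passes over the data, exact per-key sample sizes.
    val exactSample = pairs.sampleByKeyExact(withReplacement = false, fractions)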

Re: Aggregating over sorted data

2016-12-22 Thread Liang-Chi Hsieh
-sorted. see here: > https://github.com/tresata/spark-sorted/blob/master/src/main/scala/com/tresata/spark/sorted/sql/GroupSortedDataset.scala > > On Wed, Dec 21, 2016 at 9:44 PM, Liang-Chi Hsieh > viirya@ > wrote: > >> >> I agreed that to make sure this work, yo

Re: Aggregating over sorted data

2016-12-21 Thread Liang-Chi Hsieh

Re: Null pointer exception with RDD while computing a method, creating dataframe.

2016-12-21 Thread Liang-Chi Hsieh
_code").parquet(.....) - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Null-pointer-exception-with-RDD-while-computing-a-method-creating-dataframe-tp20308p20328.html Sent from the Apa

Re: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread Liang-Chi Hsieh
properly installed on all nodes in the cluster, because those Spark jobs will be launched from the node on which the main driver is running.

Re: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread Liang-Chi Hsieh
sure the node running the driver has enough resources to run them. I am not sure whether you can use `SparkLauncher` to submit them in different modes, e.g., the main driver in client mode and the others in cluster mode. Worth trying.
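
For reference, a minimal SparkLauncher sketch for submitting a child job from the driver (the jar path, main class, and master are hypothetical placeholders):

    import org.apache.spark.launcher.SparkLauncher

    val handle = new SparkLauncher()
      .setAppResource("/path/to/child-job.jar")   // hypothetical jar
      .setMainClass("com.example.ChildJob")       // hypothetical main class
      .setMaster("yarn")
      .setDeployMode("cluster")
      .startApplication()                         // returns a SparkAppHandle for monitoring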

Re: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread Liang-Chi Hsieh
on the node where the driver runs. It looks weird. It looks like you are trying to fetch some data first and then run some jobs on that data. Can't you just run those jobs in the main driver as Spark actions through its API?

Re: Aggregating over sorted data

2016-12-20 Thread Liang-Chi Hsieh
+------+--------------------+
|number|collect_list(letter)|
+------+--------------------+
|     3|           [a, b, c]|
|     1|           [a, b, c]|
|     2|           [a, b, c]|
+------+--------------------+

I think it should let you aggregate over sorted data per key.
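
A small Scala sketch of how output like that can be produced (note that relying on collect_list preserving the within-partition sort order is an implementation detail rather than a documented guarantee):

    import org.apache.spark.sql.functions._
    import spark.implicits._

    val df = Seq((1, "b"), (1, "a"), (1, "c"), (2, "c"), (2, "b"), (2, "a"),
                 (3, "a"), (3, "c"), (3, "b")).toDF("number", "letter")

    df.repartition($"number")                       // put each key into a single partition
      .sortWithinPartitions($"number", $"letter")   // sort rows within every partition
      .groupBy($"number")
      .agg(collect_list($"letter"))
      .show()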

Re: Null pointer exception with RDD while computing a method, creating dataframe.

2016-12-20 Thread Liang-Chi Hsieh
Hi, You can't invoke any RDD actions/transformations inside another transformation. They must be invoked by the driver. If I understand your purpose correctly, you can partition your data (i.e., `partitionBy`) when writing it out to parquet files.
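
A one-line sketch of that suggestion (the column name and output path are hypothetical):

    // Writes one sub-directory per distinct value of the partition column.
    df.write.partitionBy("some_code").parquet("/output/path")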

Re: Reduce memory usage of UnsafeInMemorySorter

2016-12-20 Thread Liang-Chi Hsieh
if it is the root cause and fix it then.

Re: Aggregating over sorted data

2016-12-18 Thread Liang-Chi Hsieh
-defined JVM object as a buffer to hold the input data for your aggregate function. But you may need to write the necessary encoder for the buffer object. If you really need this feature, you may open a JIRA to ask for others' opinions about it.

Re: Mistake in Apache Spark Java.

2016-12-16 Thread Liang-Chi Hsieh
Hi, I tried your example with the latest Spark master branch and branch-2.0. It works well.

Re: Document Similarity -Spark Mllib

2016-12-15 Thread Liang-Chi Hsieh
like it might not work much better than brute force even if you set a higher threshold.

Re: Document Similarity -Spark Mllib

2016-12-13 Thread Liang-Chi Hsieh
sure the problem is exactly at columnSimilarities? E.g., val exact = mat.columnSimilarities(0.5); val exactCount = exact.entries.count

Re: Why don't we implement some adaptive learning rate methods, such as Adadelta and Adam?

2016-12-11 Thread Liang-Chi Hsieh
Hi, There is a plan to add this into Spark ML. Please check out https://issues.apache.org/jira/browse/SPARK-18023. You can also follow this jira to get the latest update.

Re: Question about SPARK-11374 (skip.header.line.count)

2016-12-11 Thread Liang-Chi Hsieh
Hi Dongjoon, I know some people only use Spark SQL with SQL syntax, not the Dataset API. So I think it would be useful to provide a way to do this in SQL.

Re: Document Similarity -Spark Mllib

2016-12-10 Thread Liang-Chi Hsieh
also adjust the threshold when using DIMSUM. [1] Reza Bosagh Zadeh and Gunnar Carlsson, "Dimension Independent Matrix Square using MapReduce (DIMSUM)" [2] Reza Bosagh Zadeh and Ashish Goel, "Dimension Independent Similarity Computation"
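
For completeness, a tiny Scala sketch of DIMSUM via RowMatrix.columnSimilarities (the vectors and the threshold are illustrative only):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 2.0),
      Vectors.dense(0.0, 3.0, 4.0),
      Vectors.dense(5.0, 6.0, 0.0)))
    val mat = new RowMatrix(rows)

    val exact  = mat.columnSimilarities()      // brute force over all column pairs
    val approx = mat.columnSimilarities(0.5)   // DIMSUM sampling with threshold 0.5
    approx.entries.take(5).foreach(println)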

Re: java.lang.IllegalStateException: There is no space for new record

2016-12-09 Thread Liang-Chi Hsieh
Hi Nick, I think it is due to a bug in UnsafeKVExternalSorter. I created a Jira and a PR for this bug: https://issues.apache.org/jira/browse/SPARK-18800
