Cannot pass broker list parameter from Scala to Kafka: Property bootstrap.servers is not valid

2017-01-04 Thread Dino
I have spent a lot of time trying to figure out the following problem. I need to consume messages from a topic on a remote Kafka queue using Scala and Spark. On the remote machine, Kafka's port is set to `7072` instead of the default `9092`. Also, on the remote machine the following versions are installed
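
For context, the "Property bootstrap.servers is not valid" error typically comes from the old 0.8-era consumer config, which expects `metadata.broker.list`; the newer spark-streaming-kafka-0-10 connector accepts `bootstrap.servers` directly. A minimal sketch, assuming the 0-10 connector and placeholder host/topic names:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaPortExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-7072-example")
    val ssc = new StreamingContext(conf, Seconds(10))

    // The non-default port goes straight into bootstrap.servers;
    // "remote-host" and "my-topic" are placeholders.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "remote-host:7072",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "auto.offset.reset" -> "earliest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))

    stream.map(record => record.value).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```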

Re: Why is spark.shuffle.sort.bypassMergeThreshold 200?

2017-01-04 Thread Kay Ousterhout
I believe that these two were indeed originally related. In the old hash-based shuffle, we wrote objects out immediately to disk as they were generated by an RDD's iterator. On the other hand, with the original version of the new sort-based shuffle, Spark buffered a bunch of objects before writing
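
For readers tuning this: the bypass path is taken when a shuffle has no map-side aggregation and at most `spark.shuffle.sort.bypassMergeThreshold` reduce partitions, in which case each map task writes one file per reduce partition and concatenates them. A minimal sketch of adjusting it (the value 400 is purely illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Raise the threshold so shuffles with up to 400 reduce partitions
// (and no map-side combine) take the bypass-merge path.
val conf = new SparkConf()
  .setAppName("bypass-threshold-example")
  .set("spark.shuffle.sort.bypassMergeThreshold", "400")
val sc = new SparkContext(conf)
```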

Tests failing with GC limit exceeded

2017-01-04 Thread Kay Ousterhout
I've noticed a bunch of the recent builds failing because of GC limits, for seemingly unrelated changes (e.g. 70818, 70840, 70842). Shane, have there been any recent changes in the build configuration that might be causing this? Does anyone else have any ideas about what's going on here? -Kay

Re: Dependency Injection and Microservice development with Spark

2017-01-04 Thread darren
We've been able to use the ipopo dependency injection framework in our pyspark system and deploy .egg pyspark apps that resolve and wire up all the components (like a kernel architecture, also similar to Spring) during an initial bootstrap sequence, then invoke those components across spark. Just re

Re: Approach: Incremental data load from HBASE

2017-01-04 Thread ayan guha
Hi Chetan, What do you mean by incremental load from HBase? There is a timestamp marker for each cell, but not at the row level. On Wed, Jan 4, 2017 at 10:37 PM, Chetan Khatri wrote: > Ted Yu, > > You understood wrong, I said incremental load from HBase to Hive, > individually you can say Incrementa
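
A minimal sketch of using those per-cell timestamps for an incremental scan, assuming HBase 1.x-era APIs, an existing SparkContext `sc`, and a hypothetical `lastLoadTs` checkpoint saved by the previous run:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.protobuf.ProtobufUtil
import org.apache.hadoop.hbase.util.Base64

// Only cells written since the last load fall inside this time range.
val lastLoadTs = 1483488000000L // hypothetical checkpoint from the previous run
val scan = new Scan()
scan.setTimeRange(lastLoadTs, System.currentTimeMillis())

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")
hbaseConf.set(TableInputFormat.SCAN,
  Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray))

// Each Result now holds only cells changed since lastLoadTs.
val increments = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
```

Note ayan's caveat still applies: the timestamp is per cell, not per row, so row-level deletes won't be captured this way.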

Re: Skip Corrupted Parquet blocks / footer.

2017-01-04 Thread Liang-Chi Hsieh
After checking the code, I think there are a few issues with this ignoreCorruptFiles config, so you can't actually use it with Parquet files right now. I opened a JIRA https://issues.apache.org/jira/browse/SPARK-19082 and also submitted a PR for it. khyati wrote > Hi Reynold Xin, > > In spark 2.

Re: Converting an InternalRow to a Row

2017-01-04 Thread Liang-Chi Hsieh
You need to resolve and bind the encoder. ExpressionEncoder encoder = RowEncoder.apply(struct).resolveAndBind(); Andy Dang wrote > Hi all, > (cc-ing dev since I've hit some developer API corner) > > What's the best way to convert an InternalRow to a Row if I've got an > InternalRow and the co
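
To make the round trip concrete, here is a minimal Scala sketch of the same idea (Spark 2.x `ExpressionEncoder` API; the schema is illustrative):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val struct = StructType(Seq(StructField("id", IntegerType, nullable = false)))

// The encoder must be resolved and bound before toRow/fromRow will work.
val encoder: ExpressionEncoder[Row] = RowEncoder(struct).resolveAndBind()

val internal: InternalRow = encoder.toRow(Row(1)) // Row -> InternalRow
val row: Row = encoder.fromRow(internal)          // InternalRow -> Row
```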

Re: Quick request: prolific PR openers, review your open PRs

2017-01-04 Thread Hyukjin Kwon
Let me double-check mine too. 2017-01-04 21:57 GMT+09:00 Liang-Chi Hsieh : > > Ok. I will go through and check my open PRs. > > > Sean Owen wrote > > Just saw that there are many people with >= 8 open PRs. Some are > > legitimately in flight but many are probably stale. To set a good > example, >

Re: Clarification about typesafe aggregations

2017-01-04 Thread geoHeil
Thanks for the clarification. rxin [via Apache Spark Developers List] < ml-node+s1001551n20462...@n3.nabble.com> wrote on Wed, Jan 4, 2017 at 23:37: > Your understanding is correct - it is indeed slower due to extra > serialization. In some cases we can get rid of the serialization if the > valu

Re: Clarification about typesafe aggregations

2017-01-04 Thread Reynold Xin
Your understanding is correct - it is indeed slower due to extra serialization. In some cases we can get rid of the serialization if the value is already deserialized. On Wed, Jan 4, 2017 at 7:19 AM, geoHeil wrote: > Hi I would like to know more about typeface aggregations in spark. > > http://

Re: Tests failing with GC limit exceeded

2017-01-04 Thread shane knapp
Preliminary findings: seems to be transient, and affecting 4% of builds from late December until now (which is as far back as we keep build records for the PRB builds): 408 builds total, 16 with GC failures. It's also happening across all workers at about the same rate. And best of all, the

Converting an InternalRow to a Row

2017-01-04 Thread Andy Dang
Hi all, (cc-ing dev since I've hit some developer API corner) What's the best way to convert an InternalRow to a Row if I've got an InternalRow and the corresponding schema? Code snippet: @Test public void foo() throws Exception { Row row = RowFactory.create(1); StructType

Clarification about typesafe aggregations

2017-01-04 Thread geoHeil
Hi, I would like to know more about typesafe aggregations in Spark. http://stackoverflow.com/questions/40596638/inquiries-about-spark-2-0-dataset/40602882?noredirect=1#comment70139481_40602882 An example of these is https://blog.codecentric.de/en/2016/07/spark-2-0-datasets-case-classes/ ds.groupByK
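
For reference, a minimal sketch of the kind of typed aggregation being discussed, in the style of the blog post's `ds.groupByKey` example (assumes a Spark 2.x session named `spark`; the Sale class is made up):

```scala
import org.apache.spark.sql.expressions.scalalang.typed
import spark.implicits._

case class Sale(city: String, amount: Double)

// Typed aggregation: works on Sale objects directly, at the cost of the
// extra serialization rxin describes, versus untyped Column expressions.
val ds = spark.createDataset(Seq(Sale("NYC", 10.0), Sale("NYC", 5.0)))
val byCity = ds.groupByKey(_.city).agg(typed.sum[Sale](_.amount))
byCity.show()
```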

Re: Quick request: prolific PR openers, review your open PRs

2017-01-04 Thread Liang-Chi Hsieh
Ok. I will go through and check my open PRs. Sean Owen wrote > Just saw that there are many people with >= 8 open PRs. Some are > legitimately in flight but many are probably stale. To set a good example, > would (everyone) mind flicking through what they've got open and see if > some PRs are st

Quick request: prolific PR openers, review your open PRs

2017-01-04 Thread Sean Owen
Just saw that there are many people with >= 8 open PRs. Some are legitimately in flight but many are probably stale. To set a good example, would (everyone) mind flicking through what they've got open and see if some PRs are stale and should be closed? https://spark-prs.appspot.com/users Username

Re: Dependency Injection and Microservice development with Spark

2017-01-04 Thread Jiří Syrový
Hi, another nice approach is to instead use the Reader monad, with some framework to support it (e.g. Grafter - https://github.com/zalando/grafter). It's lightweight and helps a bit with dependency issues. 2016-12-28 22:55 GMT+01:00 Lars Albertsson : > Do you really need dependency
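
For anyone unfamiliar with the pattern, here is a minimal hand-rolled sketch of Reader-style wiring (Grafter and cats offer richer versions; `Env` and its fields are made up for illustration):

```scala
// A tiny Reader monad: a computation that needs an environment E to run.
case class Reader[E, A](run: E => A) {
  def map[B](f: A => B): Reader[E, B] = Reader(e => f(run(e)))
  def flatMap[B](f: A => Reader[E, B]): Reader[E, B] = Reader(e => f(run(e)).run(e))
}

case class Env(dbUrl: String, kafkaBrokers: String)

def loadUsers: Reader[Env, String] = Reader(env => s"select from ${env.dbUrl}")
def publish(msg: String): Reader[Env, Unit] =
  Reader(env => println(s"sending '$msg' to ${env.kafkaBrokers}"))

// Dependencies stay abstract until the very end, when run is given an Env.
val program: Reader[Env, Unit] = for {
  users <- loadUsers
  _     <- publish(users)
} yield ()

program.run(Env("jdbc:postgresql://db/prod", "broker:9092"))
```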

Re: Approach: Incremental data load from HBASE

2017-01-04 Thread Chetan Khatri
Ted Yu, You understood wrong; I said incremental load from HBase to Hive. Individually, you can say incremental import from HBase. On Wed, Dec 21, 2016 at 10:04 PM, Ted Yu wrote: > Incremental load traditionally means generating hfiles and > using org.apache.hadoop.hbase.mapreduce.LoadIncrementa

Re: Dependency Injection and Microservice development with Spark

2017-01-04 Thread Chetan Khatri
Lars, Thank you. I want to use DI for configuring all the properties (wiring) for the architectural approach below. Oracle -> Kafka Batch (Event Queuing) -> Spark Jobs (Incremental load from HBase -> Hive with Transformation) -> Spark Transformation -> PostgreSQL Thanks. On Thu, Dec 29, 2016 at 3:2

Re: Apache Hive with Spark Configuration

2017-01-04 Thread Chetan Khatri
Ryan, I agree that Hive 1.2.1 works reliably with Spark 2.x, but I went with the current stable version of Hive, which is 2.0.1, and I am working with that. Seems good, but I want to make sure which version of Hive is more reliable with Spark 2.x, and I think @Ryan you replied the same, which i
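
As a reference point for wiring Hive into Spark 2.x, a minimal sketch (the metastore client version comes from `spark.sql.hive.metastore.version` plus the jars on the classpath; the values here are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Enable Hive support; 1.2.1 is the built-in metastore client in Spark 2.x.
val spark = SparkSession.builder()
  .appName("hive-on-spark2-example")
  .config("spark.sql.hive.metastore.version", "1.2.1")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW TABLES").show()
```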

Re: Why ShuffleMapTask has transient locs and preferredLocs?!

2017-01-04 Thread Jacek Laskowski
Hi Imran, Yes, you're right. I stand corrected! Thanks. This is the part that opened my eyes: > By the time that task has been assigned a location, and it's running on an > executor, it doesn't matter anymore. That's why a task does not have to have it after deserialization (!) Thanks a lot. O
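
A tiny self-contained illustration of why marking such fields `@transient` is safe: they simply don't survive serialization, which is fine for data only the scheduler needs (the class and field names here are made up, not Spark's):

```scala
import java.io._

// Fields marked @transient are skipped by Java serialization and come back
// as null after deserialization, which is acceptable for data a task no
// longer needs once it is running on an executor, like preferred locations.
class DemoTask(val id: Int, @transient val preferredLocs: Seq[String])
  extends Serializable

val bytes = new ByteArrayOutputStream()
val out = new ObjectOutputStream(bytes)
out.writeObject(new DemoTask(1, Seq("host1", "host2")))
out.close()

val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
val revived = in.readObject().asInstanceOf[DemoTask]
assert(revived.id == 1)
assert(revived.preferredLocs == null) // dropped in transit, by design
```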

Re: Skip Corrupted Parquet blocks / footer.

2017-01-04 Thread Liang-Chi Hsieh
Forgot to say: another option is that we can replace readAllFootersInParallel with our own parallel reading logic, so we can ignore corrupt files. Liang-Chi Hsieh wrote > Hi, > > The method readAllFootersInParallel is implemented in Parquet's > ParquetFileReader. So the spark config > "spark.sql.files.
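
A rough sketch of what that replacement logic could look like, reading footers one by one and dropping unreadable files (parquet-hadoop 1.8-era API; `paths` is an assumed input file list):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.util.Try

// Read each footer independently so a corrupt file only loses itself,
// instead of failing the whole readAllFootersInParallel call.
val conf = new Configuration()
val paths: Seq[String] = Seq("/data/part-00000.parquet") // assumed input list
val footers = paths.par
  .flatMap(p => Try(ParquetFileReader.readFooter(conf, new Path(p))).toOption)
  .toList
```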

Re: Skip Corrupted Parquet blocks / footer.

2017-01-04 Thread Liang-Chi Hsieh
Hi, The method readAllFootersInParallel is implemented in Parquet's ParquetFileReader, so the Spark config "spark.sql.files.ignoreCorruptFiles" doesn't apply to it. Reading all footers in parallel speeds up the task, but we can't control whether corrupt files are ignored. Of course we ca
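
For reference, this is the config in question; per the findings above it does not yet take effect for Parquet footer reading (assumes a session named `spark`; the path is illustrative):

```scala
// Intended behavior: silently skip files that fail to read.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
val df = spark.read.parquet("/data/events") // corrupt parts would be dropped
```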