Jython importing pyspark?

2014-10-05 Thread Robert C Senkbeil
Hi there, I wanted to ask whether or not anyone has successfully used Jython with the pyspark library. I wasn't sure if the C extension support was needed for pyspark itself or was just a bonus of using Cython. There was a claim (

Impact of input format on timing

2014-10-05 Thread Tom Hubregtsen
Hi, I ran the same version of a program with two different types of input containing equivalent information. Program 1: 10,000 files, each with 50 IDs on average, one ID per line. Program 2: 1 file containing 10,000 lines, with 50 IDs per line on average. My program takes the input, creates key/value pairs
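
To make the two layouts concrete, here is a minimal sketch in plain Python (no Spark involved) that parses both into the same list of records. The function names, the sample IDs, and the whitespace separator in the single-file layout are assumptions for illustration; the original message does not say how IDs are separated on a line.

```python
def parse_many_files(file_contents):
    """Layout 1: one file per record, one ID per line.

    `file_contents` is a list of strings, each standing in for the
    text of one input file (hypothetical in-memory stand-in for files).
    """
    return [
        [line.strip() for line in text.splitlines() if line.strip()]
        for text in file_contents
    ]


def parse_single_file(text):
    """Layout 2: one file, one record per line.

    Assumes IDs on a line are separated by whitespace.
    """
    return [line.split() for line in text.splitlines() if line.strip()]


# Tiny made-up sample: two records with three and two IDs.
many_files = ["a1\na2\na3\n", "b1\nb2\n"]
single_file = "a1 a2 a3\nb1 b2\n"

records_from_files = parse_many_files(many_files)
records_from_lines = parse_single_file(single_file)
```

Both calls yield the same records, so any timing difference comes from how the input is read and split up, not from the information it carries.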

Re: Impact of input format on timing

2014-10-05 Thread Matei Zaharia
Hi Tom, HDFS and Spark don't actually have a minimum block size -- so in that first dataset, the files won't each be costing you 64 MB. However, the main reason for the difference in performance here is probably the number of RDD partitions. In the first case, Spark will create an RDD with 1

Re: Parquet schema migrations

2014-10-05 Thread Andrew Ash
Hi Cody, I wasn't aware there were different versions of the Parquet format. What's the difference between raw Parquet and the Hive-written Parquet files? As for your migration question, the approaches I've often seen are convert-on-read and convert-all-at-once. Apache Cassandra for example

Re: Jython importing pyspark?

2014-10-05 Thread Matei Zaharia
PySpark doesn't attempt to support Jython at present. IMO while it might be a bit faster, it would lose a lot of the benefits of Python, which are the very strong data processing libraries (NumPy, SciPy, Pandas, etc.). So I'm not sure it's worth supporting unless someone demonstrates a really

Re: Parquet schema migrations

2014-10-05 Thread Michael Armbrust
Hi Cody, Assuming you are talking about 'safe' changes to the schema (i.e. existing column names are never reused with incompatible types), this is something I'd love to support. Perhaps you can describe more what sorts of changes you are making, and if simple merging of the schemas would be
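
As a rough illustration of what "simple merging of the schemas" could look like under the 'safe' assumption above, here is a minimal sketch in plain Python. It is not Spark SQL or Parquet API code: `merge_schemas` and the representation of a schema as a dict from column name to type name are hypothetical choices made for this example.

```python
def merge_schemas(old, new):
    """Merge two schemas given as {column_name: type_name} dicts.

    A column present in both schemas must have the same type in each;
    otherwise the change is not 'safe' and a ValueError is raised.
    Columns present in only one schema are carried over unchanged.
    """
    merged = dict(old)
    for name, type_name in new.items():
        if name in merged and merged[name] != type_name:
            raise ValueError(
                "column %r changes type from %r to %r"
                % (name, merged[name], type_name)
            )
        merged[name] = type_name
    return merged


# Made-up example schemas: the newer one drops "name" and adds "email".
old_schema = {"id": "long", "name": "string"}
new_schema = {"id": "long", "email": "string"}
merged_schema = merge_schemas(old_schema, new_schema)
```

Reusing an existing column name with a different type, for example `{"id": "string"}` against `old_schema`, raises `ValueError` instead of merging silently.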

Hyper Parameter Tuning Algorithms

2014-10-05 Thread Lochana Menikarachchi
Found this thread from April: http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccabjxkq6b7sfaxie4+aqtcmd8jsqbznsxsfw6v5o0wwwouob...@mail.gmail.com%3E Wondering what the status of this is. We are thinking about implementing these algorithms. It would be a waste if they are already

Re: SPARK-3660 : Initial RDD for updateStateByKey transformation

2014-10-05 Thread Soumitra Kumar
Hello, I have submitted a pull request (Adding support of initial value for state update. #2665); please review it and let me know. Excited to submit my first pull request. -Soumitra.