Re: binary file deserialization

2016-03-09 Thread Saurabh Bajaj
You can load that binary up as a String RDD, then map over that RDD and
convert each row to your case class representing the data. In that map stage
you could also turn each input string into a JSON string, giving an RDD of
JSON values, and use the following function to convert it into a DataFrame:
http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets

val anotherPeople = sqlContext.read.json(anotherPeopleRDD)
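
For the record-parsing step before that, one way to get at the raw bytes is
sc.binaryFiles plus a small parsing function. Below is a rough PySpark sketch
of that flow; the 4-byte little-endian length header, the path, and the field
names are only assumptions for illustration, so adjust them to the documented
format:

import struct
from pyspark.sql import Row

def parse_records(path_and_bytes):
    # sc.binaryFiles yields (filename, whole-file contents) pairs
    _, data = path_and_bytes
    offset = 0
    while offset + 4 <= len(data):
        # assumed layout: 4-byte little-endian length header, then the payload
        (length,) = struct.unpack_from("<i", data, offset)
        offset += 4
        payload = data[offset:offset + length]
        offset += length
        # decode the payload per the documented format; kept raw here
        yield Row(length=length, payload=bytearray(payload))

records = sc.binaryFiles("hdfs:///data/legacy/*.bin").flatMap(parse_records)
df = sqlContext.createDataFrame(records)
# lands it in a Hive table (needs HiveContext); or use registerTempTable for ad-hoc SQL
df.write.saveAsTable("legacy_records")

In Scala the shape is the same: parse each record into your case class and
call toDF() on the resulting RDD.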


On Wed, Mar 9, 2016 at 9:15 AM, Ruslan Dautkhanov wrote:

> We have a huge binary file in a custom serialization format (e.g. the header
> tells the length of the record, then there is a varying number of items for
> that record). This is produced by an old C++ application.
> What would be the best approach to deserialize it into a Hive table or a Spark
> RDD?
> Format is known and well documented.
>
>
> --
> Ruslan Dautkhanov
>


Re: Output the data to external database at particular time in spark streaming

2016-03-08 Thread Saurabh Bajaj
You can call *foreachRDD*(*func*) on the output from the final stage, then
check the time: if it's the 15th minute of the hour, you flush the output to
the DB; otherwise you don't.
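
Roughly something like the sketch below; reduced_stream stands for the DStream
coming out of your reduceByKeyAndWindow stage and write_to_db is a placeholder
for your own writer, so treat it as an outline rather than a drop-in:

from datetime import datetime

def flush_if_quarter_past(rdd):
    # wall-clock check on the driver: only write when the batch fires
    # during the 15th minute of the hour
    if datetime.now().minute == 15 and not rdd.isEmpty():
        def write_partition(partition):
            # open the DB connection once per partition and reuse it
            for record in partition:
                write_to_db(record)   # placeholder for your writer
        rdd.foreachPartition(write_partition)

reduced_stream.foreachRDD(flush_if_quarter_past)
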
Let me know if that approach works.

On Tue, Mar 8, 2016 at 2:10 PM, ayan guha  wrote:

> Yes, if it falls within a batch. But if the requirement is to flush
> everything up to the 15th minute of the hour, then it should work.
> On 9 Mar 2016 04:01, "Ted Yu"  wrote:
>
>> That may miss the 15th minute of the hour (with non-trivial deviation),
>> right?
>>
>> On Tue, Mar 8, 2016 at 8:50 AM, ayan guha  wrote:
>>
>>> Why not compare the current time in every batch and, if it meets a certain
>>> condition, emit the data?
>>> On 9 Mar 2016 00:19, "Abhishek Anand"  wrote:
>>>
 I have a spark streaming job where I am aggregating the data by doing
 reduceByKeyAndWindow with an inverse function.

 I am keeping the data in memory for up to 2 hours, and in order to output
 the reduced data to external storage I conditionally need to push the data
 to the DB, say at the 15th minute of each hour.

 How can this be achieved?


 Regards,
 Abhi

>>>
>>


Re: pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042

2016-03-08 Thread Saurabh Bajaj
Hi Andy,

I believe you need to set the host and port settings separately:
spark.cassandra.connection.host
spark.cassandra.connection.port
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#cassandra-connection-parameters

Looking at the logs, it seems your port config is not being set and it's
falling back to the default.
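
In the notebook that would look something like the sketch below (assuming your
SSH tunnel is listening locally on 9043; adjust the host and port to your
setup):

sqlContext.setConf("spark.cassandra.connection.host", "127.0.0.1")
sqlContext.setConf("spark.cassandra.connection.port", "9043")

df = sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="time_series", keyspace="notification")\
    .load()
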
Let me know if that helps.

Saurabh Bajaj

On Tue, Mar 8, 2016 at 6:25 PM, Andy Davidson  wrote:

> Hi Ted
>
> I believe by default Cassandra listens on 9042.
>
> From: Ted Yu 
> Date: Tuesday, March 8, 2016 at 6:11 PM
> To: Andrew Davidson 
> Cc: "user @spark" 
> Subject: Re: pyspark spark-cassandra-connector java.io.IOException:
> Failed to open native connection to Cassandra at {192.168.1.126}:9042
>
> Have you contacted spark-cassandra-connector related mailing list ?
>
> I wonder where the port 9042 came from.
>
> Cheers
>
> On Tue, Mar 8, 2016 at 6:02 PM, Andy Davidson <
> a...@santacruzintegration.com> wrote:
>
>>
>> I am using spark-1.6.0-bin-hadoop2.6. I am trying to write a Python
>> notebook that reads a DataFrame from Cassandra.
>>
>> *I connect to Cassandra using an SSH tunnel running on port 9043.* CQLSH
>> works; however, I cannot figure out how to configure my notebook. I have
>> tried various hacks. Any idea what I am doing wrong?
>>
>> : java.io.IOException: Failed to open native connection to Cassandra at 
>> {192.168.1.126}:9042
>>
>>
>>
>>
>> Thanks in advance
>>
>> Andy
>>
>>
>>
>> $ extraPkgs="--packages com.databricks:spark-csv_2.11:1.3.0 \
>> --packages datastax:spark-cassandra-connector:1.6.0-M1-s_2.11"
>>
>> $ export PYSPARK_PYTHON=python3
>> $ export PYSPARK_DRIVER_PYTHON=python3
>> $ IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $extraPkgs $*
>>
>>
>>
>> In [15]:
>>
>> sqlContext.setConf("spark.cassandra.connection.host","127.0.0.1:9043")
>>
>> df = sqlContext.read\
>>     .format("org.apache.spark.sql.cassandra")\
>>     .options(table="time_series", keyspace="notification")\
>>     .load()
>>
>> df.printSchema()
>> df.show()
>>
>> ---------------------------------------------------------------------------
>> Py4JJavaError                             Traceback (most recent call last)
>> <ipython-input-...> in <module>()
>>       1 sqlContext.setConf("spark.cassandra.connection.host","localhost:9043")
>> ----> 2 df = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="kv", keyspace="notification").load()
>>       3
>>       4 df.printSchema()
>>       5 df.show()
>>
>> /Users/andrewdavidson/workSpace/spark/spark-1.6.0-bin-hadoop2.6/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
>>     137             return self._df(self._jreader.load(path))
>>     138         else:
>> --> 139             return self._df(self._jreader.load())
>>     140
>>     141     @since(1.4)
>>
>> /Users/andrewdavidson/workSpace/spark/spark-1.6.0-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
>>     811         answer = self.gateway_client.send_command(command)
>>     812         return_value = get_return_value(
>> --> 813             answer, self.gateway_client, self.target_id, self.name)
>>     814
>>     815         for temp_arg in temp_args:
>>
>> /Users/andrewdavidson/workSpace/spark/spark-1.6.0-bin-hadoop2.6/python/pyspark/sql/utils.py in deco(*a, **kw)
>>      43     def deco(*a, **kw):
>>      44         try:
>> ---> 45             return f(*a, **kw)
>>      46         except py4j.protocol.Py4JJavaError as e:
>>      47             s = e.java_exception.toString()
>>
>> /Users/andrewdavidson/workSpace/spark/spark-1.6.0-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
>>     306                 raise Py4JJavaError(
>>     307                     "An error occurred while calling {0}{1}{2}.\n".
>> --> 308                     format(target_id, ".", name), value)
>>     309             else:
>>     310                 raise Py4JError(
>>
>> Py4JJavaError: An error occurred while calling o280.load.
>> : java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042